GBNodeTransformer

sknnr.transformers.GBNodeTransformer

GBNodeTransformer(loss_reg: Literal['squared_error', 'absolute_error', 'huber', 'quantile'] = 'squared_error', loss_clf: Literal['log_loss', 'exponential'] = 'log_loss', learning_rate: float = 0.1, n_estimators: int = 100, subsample: float = 1.0, criterion: Literal['friedman_mse', 'squared_error'] = 'friedman_mse', min_samples_split: int | float = 2, min_samples_leaf: int | float = 1, min_weight_fraction_leaf: float = 0.0, max_depth: int | None = 3, min_impurity_decrease: float = 0.0, init: BaseEstimator | Literal['zero'] | None = None, random_state: int | None = None, max_features: Literal['sqrt', 'log2'] | int | float | None = None, alpha_reg: float = 0.9, verbose: int = 0, max_leaf_nodes: int | None = None, warm_start: bool = False, validation_fraction: float = 0.1, n_iter_no_change: int | None = None, tol: float = 0.0001, ccp_alpha: float = 0.0, tree_weighting_method: Literal['train_improvement', 'uniform'] = 'train_improvement')

Bases: TreeNodeTransformer

Transformer to capture node indexes for samples across multiple gradient boosting estimators.

A gradient boosting estimator is fit to each y target in the training set using either scikit-learn's GradientBoostingRegressor or GradientBoostingClassifier. The transformation captures the node indexes for each tree in each estimator for each training or new sample.
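As a rough illustration of what "node indexes" means here, the sketch below calls scikit-learn directly rather than GBNodeTransformer. The toy data, variable names and estimator settings are assumptions made for this example. Only the two estimator classes and their `apply` method come from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Toy data (assumed for illustration): 30 samples, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y_num = X[:, 0] + rng.normal(scale=0.1, size=30)       # numeric target
y_cat = np.array(["a", "b", "c"])[np.arange(30) % 3]   # three-class target

reg = GradientBoostingRegressor(n_estimators=5, random_state=0).fit(X, y_num)
clf = GradientBoostingClassifier(n_estimators=5, random_state=0).fit(X, y_cat)

# `apply` reports, for every sample, the leaf it reaches in every tree
reg_nodes = reg.apply(X)  # 2D: (n_samples, n_estimators)
clf_nodes = clf.apply(X)  # 3D: (n_samples, n_estimators, n_classes)

print(reg_nodes.shape, clf_nodes.shape)
```

The regressor yields one column per boosting stage, while the three-class classifier yields one tree per class per stage, which is why the classifier output carries an extra axis.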

The particular gradient boosting estimator type used for each target is determined by the data type of the target. If the target is numeric (e.g. int or float), a GradientBoostingRegressor is used. If the target is categorical (e.g. str or pd.Categorical), a GradientBoostingClassifier is used. Targets are automatically promoted to the minimum numpy dtype that safely represents all elements.
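The dtype-based choice described above can be sketched with NumPy alone. `infer_estimator_type` is a hypothetical helper written for this example and is not part of sknnr. The actual check inside the library may differ.

```python
import numpy as np


def infer_estimator_type(target) -> str:
    """Hypothetical helper: choose an estimator type from the promoted dtype."""
    # np.asarray promotes the elements to a single dtype that can hold them all
    arr = np.asarray(target)
    if np.issubdtype(arr.dtype, np.number):
        return "regression"
    return "classification"


int_kind = infer_estimator_type([1, 2, 3])         # integer dtype
float_kind = infer_estimator_type([0.5, 2, 3])     # promoted to a float dtype
str_kind = infer_estimator_type(["oak", "fir"])    # string dtype

print(int_kind, float_kind, str_kind)
```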

This transformer is intended to be used in conjunction with GBNNRegressor, which captures similarity between node indexes of training and inference data and creates predictions using nearest neighbors.

See sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier for more detail on available parameters. All parameters are passed through to these respective gradient boosting estimators for each model being built. Note that some parameters (e.g. loss and alpha) are specified separately for regression and classification and have _reg and _clf suffixes.

Parameters:

Name	Type	Description	Default
`loss_reg`	`('squared_error', 'absolute_error', 'huber', 'quantile')`	Loss function to be optimized for regression.	`"squared_error"`
`loss_clf`	`('log_loss', 'exponential')`	The loss function to be used for classification.	`"log_loss"`
`learning_rate`	`float`	Learning rate shrinks the contribution of each tree by `learning_rate`.	`0.1`
`n_estimators`	`int`	The number of boosting stages to perform.	`100`
`subsample`	`float`	The fraction of samples to be used for fitting the individual base learners.	`1.0`
`criterion`	`('friedman_mse', 'squared_error')`	The function to measure the quality of a split.	`"friedman_mse"`
`min_samples_split`	`int or float`	The minimum number of samples required to split an internal node.	`2`
`min_samples_leaf`	`int or float`	The minimum number of samples required to be at a leaf node.	`1`
`min_weight_fraction_leaf`	`float`	The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.	`0.0`
`max_depth`	`int or None`	Maximum depth of the individual regression estimators.	`3`
`min_impurity_decrease`	`float`	A node will be split if this split induces a decrease of the impurity greater than or equal to this value.	`0.0`
`init`	`(estimator, 'zero' or None)`	An estimator object that is used to compute the initial predictions.	`None`
`random_state`	`int, RandomState instance or None`	Controls the random seed given to each Tree estimator at each boosting iteration.	`None`
`max_features`	`('sqrt', 'log2'), int, float or None`	The number of features to consider when looking for the best split.	`None`
`alpha_reg`	`float`	The alpha-quantile of the huber loss function and the quantile loss function.	`0.9`
`verbose`	`int`	Enable verbose output.	`0`
`max_leaf_nodes`	`int or None`	Grow trees with `max_leaf_nodes` in best-first fashion.	`None`
`warm_start`	`bool`	When set to `True`, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just erase the previous solution.	`False`
`validation_fraction`	`float`	The proportion of training data to set aside as validation set for early stopping.	`0.1`
`n_iter_no_change`	`int or None`	`n_iter_no_change` is used to decide if early stopping will be used to terminate training when validation score is not improving.	`None`
`tol`	`float`	Tolerance for the early stopping.	`1e-4`
`ccp_alpha`	`non-negative float`	Complexity parameter used for Minimal Cost-Complexity Pruning.	`0.0`
`tree_weighting_method`	`('train_improvement', 'uniform')`	The method used to weight the trees in each gradient boosting model.	`"train_improvement"`

Attributes:

Name	Type	Description
`n_features_in_`	`int`	Number of features seen during `fit`.
`feature_names_in_`	ndarray of shape (`n_features_in_`)	Names of features seen during fit. Defined only when `X` has feature names that are all strings.
`estimator_type_dict_`	`dict[str, str]`	Dictionary mapping target names to their gradient boosting type ("regression" or "classification").
`estimators_`	list [`GradientBoostingRegressor`\|`GradientBoostingClassifier`]	The gradient boosting models associated with each target in `y` during `fit`.
`n_forests_`	`int`	The number of forests (i.e. targets) in the ensemble. Equal to `len(self.estimators_)`.
`n_trees_per_iteration_`	`list[int]`	The number of trees per iteration for each forest. For regression and binary classification this is 1; for multi-class classification (more than two classes) it equals the number of classes. Equal to `[est.n_trees_per_iteration_ for est in self.estimators_]`.
`tree_weights_`	list of length `self.n_forests_` of ndarrays of shape (`self.n_estimators` * `self.n_trees_per_iteration_[i]`,)	Weights assigned to each tree in each forest to be used when calculating distances between node indexes. In the case of multi-class classifiers, there are multiple trees per iteration, so each weight array holds one entry per tree. All weights for a single iteration are sequentially repeated.

Notes

The tree_weighting_method parameter determines how the trees in each forest are weighted when calculating distances between node indexes. If tree_weighting_method is set to "train_improvement", tree weights are calculated as a function of the change in loss between successive trees in the gradient boosting estimator. The weights therefore depend directly on the specified loss function, so the user may want to choose a loss function (i.e. loss_reg or loss_clf) that is appropriate for their task.

If tree_weighting_method is set to "uniform", all trees are weighted equally.
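To make the effect of tree weights concrete, here is a minimal sketch that assumes a weighted mismatch (Hamming-style) distance between two rows of node indexes. Both the distance definition and the `front_loaded` weights are assumptions chosen for illustration. The source only states that the weights are used when calculating distances between node indexes.

```python
import numpy as np


def weighted_node_distance(a, b, weights) -> float:
    """Weighted share of trees in which two samples land in different nodes."""
    a, b = np.asarray(a), np.asarray(b)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * (a != b)) / np.sum(w))


# Node indexes of two samples across four trees; they differ in trees 1 and 3
sample_a = [3, 4, 6, 2]
sample_b = [3, 5, 6, 7]

uniform = np.ones(4)                           # every tree counts equally
front_loaded = np.array([0.6, 0.2, 0.1, 0.1])  # hypothetical: early trees count more

d_uniform = weighted_node_distance(sample_a, sample_b, uniform)        # 2 of 4 trees differ
d_weighted = weighted_node_distance(sample_a, sample_b, front_loaded)  # only low-weight trees differ

print(d_uniform, d_weighted)
```

With uniform weights the two samples are half-way apart. When the mismatching trees carry little weight, the same pair of samples appears closer.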

Source code in src/sknnr/transformers/_gbnode_transformer.py

def __init__(
    self,
    loss_reg: Literal[
        "squared_error", "absolute_error", "huber", "quantile"
    ] = "squared_error",
    loss_clf: Literal["log_loss", "exponential"] = "log_loss",
    learning_rate: float = 0.1,
    n_estimators: int = 100,
    subsample: float = 1.0,
    criterion: Literal["friedman_mse", "squared_error"] = "friedman_mse",
    min_samples_split: int | float = 2,
    min_samples_leaf: int | float = 1,
    min_weight_fraction_leaf: float = 0.0,
    max_depth: int | None = 3,
    min_impurity_decrease: float = 0.0,
    init: BaseEstimator | Literal["zero"] | None = None,
    random_state: int | None = None,
    max_features: Literal["sqrt", "log2"] | int | float | None = None,
    alpha_reg: float = 0.9,
    verbose: int = 0,
    max_leaf_nodes: int | None = None,
    warm_start: bool = False,
    validation_fraction: float = 0.1,
    n_iter_no_change: int | None = None,
    tol: float = 0.0001,
    ccp_alpha: float = 0.0,
    tree_weighting_method: Literal[
        "train_improvement", "uniform"
    ] = "train_improvement",
):
    self.loss_reg = loss_reg
    self.loss_clf = loss_clf
    self.learning_rate = learning_rate
    self.n_estimators = n_estimators
    self.subsample = subsample
    self.criterion = criterion
    self.min_samples_split = min_samples_split
    self.min_samples_leaf = min_samples_leaf
    self.min_weight_fraction_leaf = min_weight_fraction_leaf
    self.max_depth = max_depth
    self.min_impurity_decrease = min_impurity_decrease
    self.alpha_reg = alpha_reg
    self.init = init
    self.random_state = random_state
    self.max_features = max_features
    self.verbose = verbose
    self.max_leaf_nodes = max_leaf_nodes
    self.warm_start = warm_start
    self.validation_fraction = validation_fraction
    self.n_iter_no_change = n_iter_no_change
    self.tol = tol
    self.ccp_alpha = ccp_alpha
    self.tree_weighting_method = tree_weighting_method

Attributes

alpha_reg `instance-attribute`

alpha_reg = alpha_reg

ccp_alpha `instance-attribute`

ccp_alpha = ccp_alpha

criterion `instance-attribute`

criterion = criterion

init `instance-attribute`

init = init

learning_rate `instance-attribute`

learning_rate = learning_rate

loss_clf `instance-attribute`

loss_clf = loss_clf

loss_reg `instance-attribute`

loss_reg = loss_reg

max_depth `instance-attribute`

max_depth = max_depth

max_features `instance-attribute`

max_features = max_features

max_leaf_nodes `instance-attribute`

max_leaf_nodes = max_leaf_nodes

min_impurity_decrease `instance-attribute`

min_impurity_decrease = min_impurity_decrease

min_samples_leaf `instance-attribute`

min_samples_leaf = min_samples_leaf

min_samples_split `instance-attribute`

min_samples_split = min_samples_split

min_weight_fraction_leaf `instance-attribute`

min_weight_fraction_leaf = min_weight_fraction_leaf

n_estimators `instance-attribute`

n_estimators = n_estimators

n_iter_no_change `instance-attribute`

n_iter_no_change = n_iter_no_change

random_state `instance-attribute`

random_state = random_state

subsample `instance-attribute`

subsample = subsample

tol `instance-attribute`

tol = tol

tree_weighting_method `instance-attribute`

tree_weighting_method = tree_weighting_method

validation_fraction `instance-attribute`

validation_fraction = validation_fraction

verbose `instance-attribute`

verbose = verbose

warm_start `instance-attribute`

warm_start = warm_start

Functions

fit

fit(X: DataLike, y: DataLike) -> Self

Source code in src/sknnr/transformers/_gbnode_transformer.py

def fit(self, X: DataLike, y: DataLike) -> Self:
    gb_common_kwargs = dict(
        learning_rate=self.learning_rate,
        n_estimators=self.n_estimators,
        subsample=self.subsample,
        criterion=self.criterion,
        min_samples_split=self.min_samples_split,
        min_samples_leaf=self.min_samples_leaf,
        min_weight_fraction_leaf=self.min_weight_fraction_leaf,
        max_depth=self.max_depth,
        min_impurity_decrease=self.min_impurity_decrease,
        init=self.init,
        random_state=self.random_state,
        max_features=self.max_features,
        verbose=self.verbose,
        max_leaf_nodes=self.max_leaf_nodes,
        warm_start=self.warm_start,
        validation_fraction=self.validation_fraction,
        n_iter_no_change=self.n_iter_no_change,
        tol=self.tol,
        ccp_alpha=self.ccp_alpha,
    )
    gb_reg_kwargs = {
        **gb_common_kwargs,
        "loss": self.loss_reg,
        "alpha": self.alpha_reg,
    }
    gb_clf_kwargs = {
        **gb_common_kwargs,
        "loss": self.loss_clf,
    }
    return self._fit(
        X,
        y,
        GradientBoostingRegressor,
        GradientBoostingClassifier,
        gb_reg_kwargs,
        gb_clf_kwargs,
    )
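The split into regression and classification keyword arguments can be tried out against scikit-learn directly. The sketch below uses a reduced, assumed set of common parameters and shows why `alpha` is only forwarded to the regressor: the classifier has no such parameter.

```python
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Reduced set of shared settings (assumed values, for illustration only)
common = dict(learning_rate=0.1, n_estimators=10, max_depth=3)

# Regression adds its own loss and the alpha quantile
reg_kwargs = {**common, "loss": "huber", "alpha": 0.9}
# Classification adds only its own loss
clf_kwargs = {**common, "loss": "log_loss"}

reg = GradientBoostingRegressor(**reg_kwargs)
clf = GradientBoostingClassifier(**clf_kwargs)

reg_has_alpha = "alpha" in reg.get_params()
clf_has_alpha = "alpha" in clf.get_params()
print(reg_has_alpha, clf_has_alpha)
```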

fit_transform

fit_transform(X: DataLike, y: DataLike) -> NDArray[int64]

Source code in src/sknnr/transformers/_tree_node_transformer.py

def fit_transform(self, X: DataLike, y: DataLike) -> NDArray[np.int64]:
    return self.fit(X, y).transform(X)

get_feature_names_out

get_feature_names_out() -> NDArray[object_]

Source code in src/sknnr/transformers/_gbnode_transformer.py

def get_feature_names_out(self) -> NDArray[np.object_]:
    check_is_fitted(self, "estimators_")
    feature_names: list[str] = []
    for i, est in enumerate(self.estimators_):
        # Regression and binary classification have 1 tree per iteration
        if est.n_trees_per_iteration_ == 1:
            feature_names.extend(
                [f"gb{i}_tree{k}" for k in range(est.n_estimators)]
            )
        # Multi-class classification has n_classes trees per iteration
        else:
            for j in range(est.n_trees_per_iteration_):
                feature_names.extend(
                    [f"gb{i}_cls{j}_tree{k}" for k in range(est.n_estimators)]
                )
    return np.asarray(feature_names, dtype=object)
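The naming scheme can be reproduced without fitted estimators. `node_feature_names` below is a hypothetical stand-alone helper that mirrors the loop above, taking `(n_trees_per_iteration, n_estimators)` pairs in place of fitted models.

```python
def node_feature_names(forests: list[tuple[int, int]]) -> list[str]:
    """Hypothetical helper: forests holds (n_trees_per_iteration, n_estimators) pairs."""
    names: list[str] = []
    for i, (n_per_iter, n_estimators) in enumerate(forests):
        if n_per_iter == 1:
            # Regression or binary classification: one tree per iteration
            names.extend(f"gb{i}_tree{k}" for k in range(n_estimators))
        else:
            # Multi-class: names are grouped by class, then by tree
            for j in range(n_per_iter):
                names.extend(f"gb{i}_cls{j}_tree{k}" for k in range(n_estimators))
    return names


# One regression target and one three-class target, two boosting stages each
names = node_feature_names([(1, 2), (3, 2)])
print(names)
# ['gb0_tree0', 'gb0_tree1',
#  'gb1_cls0_tree0', 'gb1_cls0_tree1',
#  'gb1_cls1_tree0', 'gb1_cls1_tree1',
#  'gb1_cls2_tree0', 'gb1_cls2_tree1']
```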

transform

transform(X: DataLike) -> NDArray[int64]

Source code in src/sknnr/transformers/_tree_node_transformer.py

def transform(self, X: DataLike) -> NDArray[np.int64]:
    check_is_fitted(self)
    X_arr = validate_data(
        self,
        X=X,
        reset=False,
        ensure_min_features=1,
        ensure_min_samples=1,
    )

    # Get the node IDs for each tree in each forest
    node_ids = []
    for est in self.estimators_:
        est_node_ids = est.apply(X_arr)
        # In the case of some multi-class estimators (e.g.
        # GradientBoostingClassifier), the output of `apply` is 3D (n_samples,
        # n_estimators, n_classes). First swap axes to get (n_samples,
        # n_classes, n_estimators), then flatten the last two dimensions
        # to ensure a 2D output.
        if est_node_ids.ndim == 3:
            est_node_ids = np.swapaxes(est_node_ids, 1, 2).reshape(
                est_node_ids.shape[0], -1
            )
        node_ids.append(est_node_ids)
    return np.hstack(node_ids).astype("int64")
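The axis swap before flattening determines the column order of the output. The toy array below is an assumed stand-in for a 3D `apply` result, with values chosen so that each entry reveals its estimator and class.

```python
import numpy as np

# Stand-in for `apply` output: 1 sample, 2 estimators, 3 classes.
# Each value is 10 * estimator_index + class_index.
node_ids = np.array([[[0, 1, 2],
                      [10, 11, 12]]])

# Swap to (n_samples, n_classes, n_estimators), then flatten the last two axes
flat = np.swapaxes(node_ids, 1, 2).reshape(node_ids.shape[0], -1)
print(flat)   # [[ 0 10  1 11  2 12]]

# Flattening without the swap would group columns by estimator instead
naive = node_ids.reshape(node_ids.shape[0], -1)
print(naive)  # [[ 0  1  2 10 11 12]]
```

After the swap, all trees of class 0 come first, then class 1, and so on, which matches the class-then-tree order of the names built in get_feature_names_out.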
