GBNodeTransformer

sknnr.transformers.GBNodeTransformer

GBNodeTransformer(loss_reg: Literal['squared_error', 'absolute_error', 'huber', 'quantile'] = 'squared_error', loss_clf: Literal['log_loss', 'exponential'] = 'log_loss', learning_rate: float = 0.1, n_estimators: int = 100, subsample: float = 1.0, criterion: Literal['friedman_mse', 'squared_error'] = 'friedman_mse', min_samples_split: int | float = 2, min_samples_leaf: int | float = 1, min_weight_fraction_leaf: float = 0.0, max_depth: int | None = 3, min_impurity_decrease: float = 0.0, init: BaseEstimator | Literal['zero'] | None = None, random_state: int | None = None, max_features: Literal['sqrt', 'log2'] | int | float | None = None, alpha_reg: float = 0.9, verbose: int = 0, max_leaf_nodes: int | None = None, warm_start: bool = False, validation_fraction: float = 0.1, n_iter_no_change: int | None = None, tol: float = 0.0001, ccp_alpha: float = 0.0, tree_weighting_method: Literal['train_improvement', 'uniform'] = 'train_improvement')

Bases: TreeNodeTransformer

Transformer to capture node indexes for samples across multiple gradient boosting estimators.

A gradient boosting estimator is fit to each y target in the training set using either scikit-learn's GradientBoostingRegressor or GradientBoostingClassifier. The transformation captures the node indexes for each tree in each estimator for each training or new sample.
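The per-target fitting and node capture described above can be sketched with scikit-learn alone. This is a minimal illustration under stated assumptions, not the sknnr implementation: it handles numeric targets only, and the synthetic data and variable names are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data with two numeric targets (illustrative only).
X, y = make_regression(n_samples=20, n_features=4, n_targets=2, random_state=0)

# Fit one gradient boosting estimator per target column.
estimators = [
    GradientBoostingRegressor(n_estimators=5, random_state=0).fit(X, y[:, i])
    for i in range(y.shape[1])
]

# `apply` returns the leaf index each sample reaches in each tree,
# with shape (n_samples, n_estimators) for a regressor.
node_ids = np.hstack([est.apply(X) for est in estimators]).astype("int64")

print(node_ids.shape)  # (20, 10): 5 trees for each of the 2 targets
```

Each row of `node_ids` is one sample and each column is one tree, so samples that share many column values land in the same leaves.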

The particular gradient boosting estimator type used for each target is determined by the data type of the target. If the target is numeric (e.g. int or float), a GradientBoostingRegressor is used. If the target is categorical (e.g. str or pd.Categorical), a GradientBoostingClassifier is used. Targets are automatically promoted to the minimum numpy dtype that safely represents all elements.
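The dtype-based choice can be pictured roughly as follows. The helper `infer_estimator_type` is hypothetical and is not part of sknnr; the sketch assumes that NumPy's own promotion rules and a numeric-dtype check are a fair approximation of the behavior described above.

```python
import numpy as np


def infer_estimator_type(target) -> str:
    """Hypothetical helper: pick an estimator type from a target's dtype."""
    # np.asarray promotes the elements to a common dtype, e.g. a mix of
    # ints and floats becomes float64, and strings become a unicode dtype.
    arr = np.asarray(target)
    if np.issubdtype(arr.dtype, np.number):
        return "regression"
    return "classification"


print(infer_estimator_type([1, 2, 3]))        # regression
print(infer_estimator_type([1, 2.5, 3]))      # regression
print(infer_estimator_type(["a", "b", "a"]))  # classification
```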

This transformer is intended to be used in conjunction with GBNNRegressor, which captures similarity between the node indexes of training and inference data and creates predictions using nearest neighbors.

See sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier for more detail on available parameters. All parameters are passed through to the respective gradient boosting estimator for each model being built. Note that parameters that differ between regression and classification (e.g. loss and alpha) are specified separately and carry a _reg or _clf suffix (loss_reg, loss_clf, alpha_reg).

Parameters:

Name Type Description Default
loss_reg ('squared_error', 'absolute_error', 'huber', 'quantile')

Loss function to be optimized for regression.

"squared_error"
loss_clf ('log_loss', 'exponential')

The loss function to be used for classification.

"log_loss"
learning_rate float

Learning rate shrinks the contribution of each tree by learning_rate.

0.1
n_estimators int

The number of boosting stages to perform.

100
subsample float

The fraction of samples to be used for fitting the individual base learners.

1.0
criterion ('friedman_mse', 'squared_error')

The function to measure the quality of a split.

"friedman_mse"
min_samples_split int or float

The minimum number of samples required to split an internal node.

2
min_samples_leaf int or float

The minimum number of samples required to be at a leaf node.

1
min_weight_fraction_leaf float

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.

0.0
max_depth int or None

Maximum depth of the individual regression estimators.

3
min_impurity_decrease float

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

0.0
init (estimator, 'zero' or None)

An estimator object that is used to compute the initial predictions.

None
random_state int, RandomState instance or None

Controls the random seed given to each Tree estimator at each boosting iteration.

None
max_features ('sqrt', 'log2'), int, float or None

The number of features to consider when looking for the best split.

None
alpha_reg float

The alpha-quantile of the huber loss function and the quantile loss function.

0.9
verbose int

Enable verbose output.

0
max_leaf_nodes int or None

Grow trees with max_leaf_nodes in best-first fashion.

None
warm_start bool

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just erase the previous solution.

False
validation_fraction float

The proportion of training data to set aside as validation set for early stopping.

0.1
n_iter_no_change int or None

n_iter_no_change is used to decide if early stopping will be used to terminate training when validation score is not improving.

None
tol float

Tolerance for the early stopping.

1e-4
ccp_alpha non-negative float

Complexity parameter used for Minimal Cost-Complexity Pruning.

0.0
tree_weighting_method ('train_improvement', 'uniform')

The method used to weight the trees in each gradient boosting model.

"train_improvement"

Attributes:

Name Type Description
n_features_in_ int

Number of features seen during fit.

feature_names_in_ ndarray of shape (`n_features_in_`)

Names of features seen during fit. Defined only when X has feature names that are all strings.

estimator_type_dict_ dict[str, str]

Dictionary mapping target names to their gradient boosting type ("regression" or "classification").

estimators_ list [`GradientBoostingRegressor`|`GradientBoostingClassifier`]

The gradient boosting models associated with each target in y during fit.

n_forests_ int

The number of forests (i.e. targets) in the ensemble. Equal to len(self.estimators_).

n_trees_per_iteration_ list[int]

The number of trees per iteration for each forest. For regression and binary classification this is 1; for multi-class classification (more than two classes) it is equal to the number of classes. Equal to [est.n_trees_per_iteration_ for est in self.estimators_].

tree_weights_ list of length `self.n_forests_` of ndarrays of shape (self.n_estimators * self.estimators_[i].n_trees_per_iteration_,)

Weights assigned to each tree in each forest to be used when calculating distances between node indexes. In the case of multi-class classifiers, there are multiple trees per iteration, so the shape of each weight array is (self.n_estimators * self.n_trees_per_iteration_[i],). All weights for a single iteration are sequentially repeated.

Notes

The tree_weighting_method parameter determines how the trees in each forest are weighted when calculating distances between node indexes. If tree_weighting_method is set to "train_improvement", tree weights are calculated as a function of the change in loss between successive trees in the gradient boosting estimator. The weights therefore depend on the loss function specified, so the user may want to choose a loss function (i.e. loss_reg or loss_clf) appropriate for their task.

If tree_weighting_method is set to "uniform", all trees are weighted equally.

Source code in src/sknnr/transformers/_gbnode_transformer.py
def __init__(
    self,
    loss_reg: Literal[
        "squared_error", "absolute_error", "huber", "quantile"
    ] = "squared_error",
    loss_clf: Literal["log_loss", "exponential"] = "log_loss",
    learning_rate: float = 0.1,
    n_estimators: int = 100,
    subsample: float = 1.0,
    criterion: Literal["friedman_mse", "squared_error"] = "friedman_mse",
    min_samples_split: int | float = 2,
    min_samples_leaf: int | float = 1,
    min_weight_fraction_leaf: float = 0.0,
    max_depth: int | None = 3,
    min_impurity_decrease: float = 0.0,
    init: BaseEstimator | Literal["zero"] | None = None,
    random_state: int | None = None,
    max_features: Literal["sqrt", "log2"] | int | float | None = None,
    alpha_reg: float = 0.9,
    verbose: int = 0,
    max_leaf_nodes: int | None = None,
    warm_start: bool = False,
    validation_fraction: float = 0.1,
    n_iter_no_change: int | None = None,
    tol: float = 0.0001,
    ccp_alpha: float = 0.0,
    tree_weighting_method: Literal[
        "train_improvement", "uniform"
    ] = "train_improvement",
):
    self.loss_reg = loss_reg
    self.loss_clf = loss_clf
    self.learning_rate = learning_rate
    self.n_estimators = n_estimators
    self.subsample = subsample
    self.criterion = criterion
    self.min_samples_split = min_samples_split
    self.min_samples_leaf = min_samples_leaf
    self.min_weight_fraction_leaf = min_weight_fraction_leaf
    self.max_depth = max_depth
    self.min_impurity_decrease = min_impurity_decrease
    self.alpha_reg = alpha_reg
    self.init = init
    self.random_state = random_state
    self.max_features = max_features
    self.verbose = verbose
    self.max_leaf_nodes = max_leaf_nodes
    self.warm_start = warm_start
    self.validation_fraction = validation_fraction
    self.n_iter_no_change = n_iter_no_change
    self.tol = tol
    self.ccp_alpha = ccp_alpha
    self.tree_weighting_method = tree_weighting_method

Attributes

alpha_reg instance-attribute

alpha_reg = alpha_reg

ccp_alpha instance-attribute

ccp_alpha = ccp_alpha

criterion instance-attribute

criterion = criterion

init instance-attribute

init = init

learning_rate instance-attribute

learning_rate = learning_rate

loss_clf instance-attribute

loss_clf = loss_clf

loss_reg instance-attribute

loss_reg = loss_reg

max_depth instance-attribute

max_depth = max_depth

max_features instance-attribute

max_features = max_features

max_leaf_nodes instance-attribute

max_leaf_nodes = max_leaf_nodes

min_impurity_decrease instance-attribute

min_impurity_decrease = min_impurity_decrease

min_samples_leaf instance-attribute

min_samples_leaf = min_samples_leaf

min_samples_split instance-attribute

min_samples_split = min_samples_split

min_weight_fraction_leaf instance-attribute

min_weight_fraction_leaf = min_weight_fraction_leaf

n_estimators instance-attribute

n_estimators = n_estimators

n_iter_no_change instance-attribute

n_iter_no_change = n_iter_no_change

random_state instance-attribute

random_state = random_state

subsample instance-attribute

subsample = subsample

tol instance-attribute

tol = tol

tree_weighting_method instance-attribute

tree_weighting_method = tree_weighting_method

validation_fraction instance-attribute

validation_fraction = validation_fraction

verbose instance-attribute

verbose = verbose

warm_start instance-attribute

warm_start = warm_start

Functions

fit

fit(X: DataLike, y: DataLike) -> Self
Source code in src/sknnr/transformers/_gbnode_transformer.py
def fit(self, X: DataLike, y: DataLike) -> Self:
    gb_common_kwargs = dict(
        learning_rate=self.learning_rate,
        n_estimators=self.n_estimators,
        subsample=self.subsample,
        criterion=self.criterion,
        min_samples_split=self.min_samples_split,
        min_samples_leaf=self.min_samples_leaf,
        min_weight_fraction_leaf=self.min_weight_fraction_leaf,
        max_depth=self.max_depth,
        min_impurity_decrease=self.min_impurity_decrease,
        init=self.init,
        random_state=self.random_state,
        max_features=self.max_features,
        verbose=self.verbose,
        max_leaf_nodes=self.max_leaf_nodes,
        warm_start=self.warm_start,
        validation_fraction=self.validation_fraction,
        n_iter_no_change=self.n_iter_no_change,
        tol=self.tol,
        ccp_alpha=self.ccp_alpha,
    )
    gb_reg_kwargs = {
        **gb_common_kwargs,
        "loss": self.loss_reg,
        "alpha": self.alpha_reg,
    }
    gb_clf_kwargs = {
        **gb_common_kwargs,
        "loss": self.loss_clf,
    }
    return self._fit(
        X,
        y,
        GradientBoostingRegressor,
        GradientBoostingClassifier,
        gb_reg_kwargs,
        gb_clf_kwargs,
    )

fit_transform

fit_transform(X: DataLike, y: DataLike) -> NDArray[int64]
Source code in src/sknnr/transformers/_tree_node_transformer.py
def fit_transform(self, X: DataLike, y: DataLike) -> NDArray[np.int64]:
    return self.fit(X, y).transform(X)

get_feature_names_out

get_feature_names_out() -> NDArray[object_]
Source code in src/sknnr/transformers/_gbnode_transformer.py
def get_feature_names_out(self) -> NDArray[np.object_]:
    check_is_fitted(self, "estimators_")
    feature_names: list[str] = []
    for i, est in enumerate(self.estimators_):
        # Regression and binary classification have 1 tree per iteration
        if est.n_trees_per_iteration_ == 1:
            feature_names.extend(
                [f"gb{i}_tree{k}" for k in range(est.n_estimators)]
            )
        # Multi-class classification has n_classes trees per iteration
        else:
            for j in range(est.n_trees_per_iteration_):
                feature_names.extend(
                    [f"gb{i}_cls{j}_tree{k}" for k in range(est.n_estimators)]
                )
    return np.asarray(feature_names, dtype=object)
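
The naming scheme is easiest to see on a small case. The sketch below restates the same loop as a standalone function so it can run without a fitted transformer; `node_feature_names` and its `estimator_shapes` argument are hypothetical names introduced for illustration only.

```python
def node_feature_names(estimator_shapes):
    """Hypothetical standalone version of the naming scheme.

    `estimator_shapes` holds one (n_trees_per_iteration, n_estimators)
    pair per fitted estimator.
    """
    names = []
    for i, (n_trees_per_iteration, n_estimators) in enumerate(estimator_shapes):
        if n_trees_per_iteration == 1:
            # One tree per boosting iteration: name by tree index only.
            names.extend(f"gb{i}_tree{k}" for k in range(n_estimators))
        else:
            # Several trees per iteration: group names by class, then tree.
            for j in range(n_trees_per_iteration):
                names.extend(f"gb{i}_cls{j}_tree{k}" for k in range(n_estimators))
    return names


# One single-tree estimator and one three-class estimator, 2 iterations each.
names = node_feature_names([(1, 2), (3, 2)])
print(names)
# ['gb0_tree0', 'gb0_tree1', 'gb1_cls0_tree0', 'gb1_cls0_tree1',
#  'gb1_cls1_tree0', 'gb1_cls1_tree1', 'gb1_cls2_tree0', 'gb1_cls2_tree1']
```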

transform

transform(X: DataLike) -> NDArray[int64]
Source code in src/sknnr/transformers/_tree_node_transformer.py
def transform(self, X: DataLike) -> NDArray[np.int64]:
    check_is_fitted(self)
    X_arr = validate_data(
        self,
        X=X,
        reset=False,
        ensure_min_features=1,
        ensure_min_samples=1,
    )

    # Get the node IDs for each tree in each forest
    node_ids = []
    for est in self.estimators_:
        est_node_ids = est.apply(X_arr)
        # In the case of some multi-class estimators (e.g.
        # GradientBoostingClassifier), the output of `apply` is 3D (n_samples,
        # n_estimators, n_classes). First swap axes to get (n_samples,
        # n_classes, n_estimators), then flatten the last two dimensions
        # to ensure a 2D output.
        if est_node_ids.ndim == 3:
            est_node_ids = np.swapaxes(est_node_ids, 1, 2).reshape(
                est_node_ids.shape[0], -1
            )
        node_ids.append(est_node_ids)
    return np.hstack(node_ids).astype("int64")
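
The axis swap and reshape in the 3D branch can be checked on a small synthetic array. The values below are made up so that each entry encodes its own (sample, tree, class) position; they are not real node indexes.

```python
import numpy as np

n_samples, n_estimators, n_classes = 2, 3, 2

# Stand-in for a 3D `apply` output: encode each entry as
# 100 * sample + 10 * tree + class so its origin stays readable.
s, t, c = np.indices((n_samples, n_estimators, n_classes))
node_ids_3d = 100 * s + 10 * t + c

# (n_samples, n_estimators, n_classes) -> (n_samples, n_classes, n_estimators),
# then flatten the last two axes into one.
flat = np.swapaxes(node_ids_3d, 1, 2).reshape(n_samples, -1)

print(flat[0])  # [ 0 10 20  1 11 21]
```

The flattened columns run through every tree of class 0 before moving to class 1, which matches the `gb{i}_cls{j}_tree{k}` order produced by get_feature_names_out.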