GBNodeTransformer
sknnr.transformers.GBNodeTransformer
GBNodeTransformer(loss_reg: Literal['squared_error', 'absolute_error', 'huber', 'quantile'] = 'squared_error', loss_clf: Literal['log_loss', 'exponential'] = 'log_loss', learning_rate: float = 0.1, n_estimators: int = 100, subsample: float = 1.0, criterion: Literal['friedman_mse', 'squared_error'] = 'friedman_mse', min_samples_split: int | float = 2, min_samples_leaf: int | float = 1, min_weight_fraction_leaf: float = 0.0, max_depth: int | None = 3, min_impurity_decrease: float = 0.0, init: BaseEstimator | Literal['zero'] | None = None, random_state: int | None = None, max_features: Literal['sqrt', 'log2'] | int | float | None = None, alpha_reg: float = 0.9, verbose: int = 0, max_leaf_nodes: int | None = None, warm_start: bool = False, validation_fraction: float = 0.1, n_iter_no_change: int | None = None, tol: float = 0.0001, ccp_alpha: float = 0.0, tree_weighting_method: Literal['train_improvement', 'uniform'] = 'train_improvement')
Bases: TreeNodeTransformer
Transformer to capture node indexes for samples across multiple gradient boosting estimators.
A gradient boosting estimator is fit to each y target in the training set
using either scikit-learn's GradientBoostingRegressor or
GradientBoostingClassifier. The transformation captures the node indexes
for each tree in each estimator for each training or new sample.
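
To make "node indexes" concrete, the following sketch uses scikit-learn directly rather than this transformer. It assumes synthetic data and a small `n_estimators`, and relies only on the `apply` method of `GradientBoostingRegressor`, which reports the leaf each sample lands in for every tree. How `GBNodeTransformer` assembles these indexes across targets internally is not shown here and may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: 50 samples, 4 features, one numeric target (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

gbr = GradientBoostingRegressor(n_estimators=10, max_depth=3, random_state=0)
gbr.fit(X, y)

# One leaf (node) index per sample per tree
node_indexes = gbr.apply(X)
print(node_indexes.shape)
```

Each row can be read as a sample's "address" across the ensemble: two samples that share many leaf indexes were routed the same way by many trees.
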
The particular gradient boosting estimator type used for each target is
determined by the data type of the target. If the target is numeric (e.g.
int or float), a GradientBoostingRegressor is used. If the target is
categorical (e.g. str or pd.Categorical), a GradientBoostingClassifier
is used. Targets are automatically promoted to the minimum numpy dtype that
safely represents all elements.
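
The dtype-based choice described above could be sketched as follows. The helper name `infer_estimator_type` is hypothetical and not part of sknnr; the sketch assumes that "numeric" means a NumPy numeric dtype after conversion with `np.asarray`, and the actual check used by the transformer may differ.

```python
import numpy as np


def infer_estimator_type(target) -> str:
    """Hypothetical helper: pick an estimator type from a target's dtype."""
    # np.asarray promotes mixed inputs to a common dtype,
    # e.g. ints and strings together become a string dtype
    arr = np.asarray(target)
    if np.issubdtype(arr.dtype, np.number):
        return "regression"
    return "classification"


print(infer_estimator_type([1, 2, 3]))
print(infer_estimator_type([0.5, 1, 2]))
print(infer_estimator_type(["pine", "fir"]))
print(infer_estimator_type([1, "fir"]))
```

Under this assumption, a target that mixes numbers and strings is promoted to a string dtype and would therefore be treated as categorical.
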
This transformer is intended to be used in conjunction with GBNNRegressor,
which captures similarity between the node indexes of training and inference
data and creates predictions using nearest neighbors.
See sklearn.ensemble.GradientBoostingRegressor and
sklearn.ensemble.GradientBoostingClassifier for more detail on available
parameters. All parameters are passed through to these respective gradient
boosting estimators for each model being built. Note that parameters that
differ between regression and classification carry a _reg or _clf suffix
(e.g. loss_reg, loss_clf and alpha_reg, which map to loss and alpha).
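
One way to picture the suffix convention is a helper that routes a flat parameter dictionary to the two estimator types. This is a minimal sketch under stated assumptions: the function `split_params` is hypothetical, not the library's implementation, and it only handles parameters that scikit-learn's gradient boosting estimators accept.

```python
def split_params(params):
    """Hypothetical helper: split suffixed parameters by estimator type."""
    reg_params, clf_params = {}, {}
    for name, value in params.items():
        if name.endswith("_reg"):
            # Regression-only value: strip the suffix
            reg_params[name[: -len("_reg")]] = value
        elif name.endswith("_clf"):
            # Classification-only value: strip the suffix
            clf_params[name[: -len("_clf")]] = value
        else:
            # Shared parameter: goes to both estimator types
            reg_params[name] = value
            clf_params[name] = value
    return reg_params, clf_params


reg_params, clf_params = split_params(
    {
        "loss_reg": "huber",
        "loss_clf": "log_loss",
        "alpha_reg": 0.9,
        "learning_rate": 0.05,
    }
)
print(reg_params)
print(clf_params)
```
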
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `loss_reg` | `{'squared_error', 'absolute_error', 'huber', 'quantile'}` | Loss function to be optimized for regression. | `"squared_error"` |
| `loss_clf` | `{'log_loss', 'exponential'}` | The loss function to be used for classification. | `"log_loss"` |
| `learning_rate` | `float` | Learning rate shrinks the contribution of each tree by `learning_rate`. | `0.1` |
| `n_estimators` | `int` | The number of boosting stages to perform. | `100` |
| `subsample` | `float` | The fraction of samples to be used for fitting the individual base learners. | `1.0` |
| `criterion` | `{'friedman_mse', 'squared_error'}` | The function to measure the quality of a split. | `"friedman_mse"` |
| `min_samples_split` | `int` or `float` | The minimum number of samples required to split an internal node. | `2` |
| `min_samples_leaf` | `int` or `float` | The minimum number of samples required to be at a leaf node. | `1` |
| `min_weight_fraction_leaf` | `float` | The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. | `0.0` |
| `max_depth` | `int` or `None` | Maximum depth of the individual regression estimators. | `3` |
| `min_impurity_decrease` | `float` | A node will be split if this split induces a decrease of the impurity greater than or equal to this value. | `0.0` |
| `init` | estimator, `'zero'` or `None` | An estimator object that is used to compute the initial predictions. | `None` |
| `random_state` | `int`, RandomState instance or `None` | Controls the random seed given to each Tree estimator at each boosting iteration. | `None` |
| `max_features` | `{'sqrt', 'log2'}`, `int`, `float` or `None` | The number of features to consider when looking for the best split. | `None` |
| `alpha_reg` | `float` | The alpha-quantile of the huber loss function and the quantile loss function. | `0.9` |
| `verbose` | `int` | Enable verbose output. | `0` |
| `max_leaf_nodes` | `int` or `None` | Grow trees with `max_leaf_nodes` in best-first fashion. | `None` |
| `warm_start` | `bool` | When set to `True`, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, erase the previous solution. | `False` |
| `validation_fraction` | `float` | The proportion of training data to set aside as validation set for early stopping. | `0.1` |
| `n_iter_no_change` | `int` or `None` | Used to decide whether early stopping terminates training when the validation score is not improving. | `None` |
| `tol` | `float` | Tolerance for the early stopping. | `1e-4` |
| `ccp_alpha` | non-negative `float` | Complexity parameter used for Minimal Cost-Complexity Pruning. | `0.0` |
| `tree_weighting_method` | `{'train_improvement', 'uniform'}` | The method used to weight the trees in each gradient boosting model. | `"train_improvement"` |
Attributes:
| Name | Type | Description |
|---|---|---|
| `n_features_in_` | `int` | Number of features seen during `fit`. |
| `feature_names_in_` | ndarray of shape (`n_features_in_`) | Names of features seen during fit. Defined only when `X` has feature names that are all strings. |
| `estimator_type_dict_` | `dict[str, str]` | Dictionary mapping target names to their gradient boosting type ("regression" or "classification"). |
| `estimators_` | list of `GradientBoostingRegressor` or `GradientBoostingClassifier` | The gradient boosting models associated with each target in `y`. |
| `n_forests_` | `int` | The number of forests (i.e. targets) in the ensemble. Equal to |
| `n_trees_per_iteration_` | `list[int]` | The number of trees per iteration for each forest. For regression and binary classification this is 1; for multi-class classification (more than two classes) it is equal to the number of classes. Equal to |
| `tree_weights_` | list with length `self.n_forests_` of ndarrays of shape ( | |
Notes
The tree_weighting_method parameter determines how the trees in each
forest are weighted when calculating distances between node indexes.
If tree_weighting_method is set to "train_improvement", tree weights are
calculated as a function of the change in loss between successive trees
in the gradient boosting estimator. As such, the weights depend on the
loss function specified, and the user may want to choose a loss function
(i.e. loss_reg or loss_clf) that is appropriate for their task.
If tree_weighting_method is set to "uniform", all trees are weighted
equally.
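
The two weighting methods can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the helper names are hypothetical, the weights are taken to be the normalized loss reduction contributed by each tree, and the distance is a weighted count of trees where two samples land in different nodes. The exact functions used by sknnr are not given in this section and may differ.

```python
def improvement_weights(losses):
    """Hypothetical: weight each tree by its normalized loss reduction.

    losses[0] is the loss before any tree is added and losses[i] is the
    loss after tree i, so there is one weight per tree.
    """
    gains = [max(prev - cur, 0.0) for prev, cur in zip(losses, losses[1:])]
    total = sum(gains)
    if total == 0:
        # No improvement anywhere: fall back to uniform weights
        return [1.0 / len(gains)] * len(gains)
    return [gain / total for gain in gains]


def weighted_node_distance(nodes_a, nodes_b, weights):
    """Hypothetical: sum the weights of trees where the node indexes differ."""
    return sum(w for a, b, w in zip(nodes_a, nodes_b, weights) if a != b)


losses = [10.0, 6.0, 4.0, 3.0]      # illustrative training losses
sample_a = [3, 5, 6]                # node index per tree for one sample
sample_b = [3, 4, 6]                # differs from sample_a in the second tree

train_weights = improvement_weights(losses)
uniform_weights = [1.0 / 3] * 3

print(weighted_node_distance(sample_a, sample_b, train_weights))
print(weighted_node_distance(sample_a, sample_b, uniform_weights))
```

In this sketch, a disagreement in an early tree that removed a large share of the loss counts for more than a disagreement in a late tree, whereas uniform weighting treats every tree the same.
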