GBNNRegressor
sknnr.GBNNRegressor
GBNNRegressor(*, loss_reg: Literal['squared_error', 'absolute_error', 'huber', 'quantile'] = 'squared_error', loss_clf: Literal['log_loss', 'exponential'] = 'log_loss', learning_rate: float = 0.1, n_estimators: int = 100, subsample: float = 1.0, criterion: Literal['friedman_mse', 'squared_error'] = 'friedman_mse', min_samples_split: int | float = 2, min_samples_leaf: int | float = 1, min_weight_fraction_leaf: float = 0.0, max_depth: int | None = 3, min_impurity_decrease: float = 0.0, init: BaseEstimator | Literal['zero'] | None = None, random_state: int | None = None, max_features: Literal['sqrt', 'log2'] | int | float | None = None, alpha_reg: float = 0.9, verbose: int = 0, max_leaf_nodes: int | None = None, warm_start: bool = False, validation_fraction: float = 0.1, n_iter_no_change: int | None = None, tol: float = 0.0001, ccp_alpha: float = 0.0, forest_weights: Literal['uniform'] | ArrayLike = 'uniform', tree_weighting_method: Literal['train_improvement', 'uniform'] = 'train_improvement', n_neighbors: int = 5, weights: Literal['uniform', 'distance'] | Callable = 'uniform', n_jobs: int | None = None)
Bases: WeightedTreesNNRegressor
Regression using Gradient Boosting Nearest Neighbors (GBNN) imputation.
New data is predicted by similarity of its node indexes to training set node indexes when run through multiple univariate gradient boosting models. A gradient boosting model is fit to each target in the training set and node indexes are captured for each tree in each forest for each training sample. Node indexes are then captured for inference data and distance is calculated as the dissimilarity between node indexes.
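The idea in the paragraph above can be sketched with scikit-learn alone. This is a minimal illustration, not sknnr's implementation: the synthetic data, the `node_indexes` helper, and the unweighted neighbor mean are all assumptions made for the example, and only numeric targets are handled.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
# Two synthetic numeric targets (illustrative data only).
Y = np.column_stack([X[:, 0] + 0.1 * rng.normal(size=60), X[:, 1] * X[:, 2]])

# Fit one univariate gradient boosting model per target.
models = [
    GradientBoostingRegressor(n_estimators=20, random_state=0).fit(X, Y[:, j])
    for j in range(Y.shape[1])
]


def node_indexes(models, X):
    """Hypothetical helper: stack the leaf index of every tree in every model.

    Returns an array of shape (n_samples, total number of trees).
    """
    return np.hstack([m.apply(X) for m in models])


train_nodes = node_indexes(models, X)

# Hamming distance is the fraction of trees in which two samples
# land in different leaves.
nn = NearestNeighbors(n_neighbors=5, metric="hamming", algorithm="brute")
nn.fit(train_nodes)

X_new = rng.normal(size=(3, 4))
dist, idx = nn.kneighbors(node_indexes(models, X_new))

# Predict each target as the unweighted mean of the neighbors' values.
pred = Y[idx].mean(axis=1)
```

Here `pred` has one row per query sample and one column per target. The real estimator additionally supports tree and target weights, which this sketch omits.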
Gradient boosting models are constructed using either scikit-learn's
GradientBoostingRegressor or GradientBoostingClassifier classes based on
the data type of each target (y or y_fit) in the training set. If the
target is numeric (e.g. int or float), a GradientBoostingRegressor is
used. If the target is categorical (e.g. str or pd.Categorical), a
GradientBoostingClassifier is used. The
sknnr.transformers.GBNodeTransformer class is responsible for constructing
the gradient boosting models and capturing the node indexes.
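The dtype-based choice described above might look roughly like the following. `choose_estimator_class` is a hypothetical helper written for this illustration; it is not part of sknnr, and it does not handle pandas categoricals, which would need an explicit check.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor


def choose_estimator_class(target):
    """Hypothetical helper: pick a boosting class from a target's dtype."""
    values = np.asarray(target)
    # Numeric targets (int or float) are treated as regression problems.
    if np.issubdtype(values.dtype, np.number):
        return GradientBoostingRegressor
    # Anything else (e.g. strings) is treated as classification.
    return GradientBoostingClassifier


numeric_choice = choose_estimator_class([1.5, 2.0, 3.25])
string_choice = choose_estimator_class(["oak", "fir", "oak"])
```

With these inputs, `numeric_choice` is `GradientBoostingRegressor` and `string_choice` is `GradientBoostingClassifier`.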
See sklearn.neighbors.KNeighborsRegressor for more detail on
parameters associated with nearest neighbors. See
sklearn.ensemble.GradientBoostingRegressor and
sklearn.ensemble.GradientBoostingClassifier for more detail on parameters
associated with gradient boosting. Note that some parameters (e.g.
loss and alpha) are specified separately for regression and
classification and have _reg and _clf suffixes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| loss_reg | ('squared_error', 'absolute_error', 'huber', 'quantile') | Loss function to be optimized for regression. | "squared_error" |
| loss_clf | ('log_loss', 'exponential') | The loss function to be used for classification. | "log_loss" |
| learning_rate | float | Learning rate shrinks the contribution of each tree by | 0.1 |
| n_estimators | int | The number of boosting stages to perform. | 100 |
| subsample | float | The fraction of samples to be used for fitting the individual base learners. | 1.0 |
| criterion | ('friedman_mse', 'squared_error') | The function to measure the quality of a split. | "friedman_mse" |
| min_samples_split | int or float | The minimum number of samples required to split an internal node. | 2 |
| min_samples_leaf | int or float | The minimum number of samples required to be at a leaf node. | 1 |
| min_weight_fraction_leaf | float | The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. | 0.0 |
| max_depth | int or None | Maximum depth of the individual regression estimators. | 3 |
| min_impurity_decrease | float | A node will be split if this split induces a decrease of the impurity greater than or equal to this value. | 0.0 |
| init | (estimator, 'zero' or None) | An estimator object that is used to compute the initial predictions. | None |
| random_state | int, RandomState instance or None | Controls the random seed given to each Tree estimator at each boosting iteration. | None |
| max_features | ('sqrt', 'log2'), int, float or None | The number of features to consider when looking for the best split. | None |
| alpha_reg | float | The alpha-quantile of the huber loss function and the quantile loss function. | 0.9 |
| verbose | int | Enable verbose output. | 0 |
| max_leaf_nodes | int or None | Grow trees with | None |
| warm_start | bool | When set to | False |
| validation_fraction | float | The proportion of training data to set aside as validation set for early stopping. | 0.1 |
| n_iter_no_change | int or None | | None |
| tol | float | Tolerance for the early stopping. | 1e-4 |
| ccp_alpha | non-negative float | Complexity parameter used for Minimal Cost-Complexity Pruning. | 0.0 |
| forest_weights | Literal['uniform'] \| ArrayLike | Weights assigned to each target in the training set when calculating Hamming distance between node indexes. This allows for differential weighting of targets when calculating distances. Note that all trees associated with a target will receive the same weight. If "uniform", each tree is assigned equal weight. | 'uniform' |
| tree_weighting_method | Literal['train_improvement', 'uniform'] | The method used to weight the trees in each gradient boosting model. | 'train_improvement' |
| n_neighbors | int | Number of neighbors to use by default for | 5 |
| weights | ('uniform', 'distance') or callable | Weight function used in prediction. | "uniform" |
| n_jobs | int or None | The number of jobs to run in parallel. | None |
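To make the effect of forest_weights concrete, the sketch below computes a weighted Hamming distance between two node-index vectors. It assumes two targets with three trees each and made-up weights of 2.0 and 1.0; `weighted_hamming` is a hypothetical helper, and how sknnr combines target and tree weights internally is not shown in this reference.

```python
import numpy as np


def weighted_hamming(u, v, weights):
    """Weighted fraction of positions where two node-index vectors differ."""
    u = np.asarray(u)
    v = np.asarray(v)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * (u != v)) / np.sum(weights))


# Hypothetical per-target weights: the first target counts twice as much.
forest_weights = np.array([2.0, 1.0])
n_trees_per_target = 3
# Every tree belonging to a target receives that target's weight.
per_tree_weights = np.repeat(forest_weights, n_trees_per_target)

# Node indexes for two samples; they differ in two trees of the first target.
a = np.array([4, 7, 3, 5, 5, 2])
b = np.array([4, 6, 2, 5, 5, 2])

uniform = weighted_hamming(a, b, np.ones(6))         # 2 / 6
weighted = weighted_hamming(a, b, per_tree_weights)  # 4 / 9
```

Because both disagreements fall in trees of the more heavily weighted target, the weighted distance (4/9) is larger than the uniform one (1/3).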
Attributes:
| Name | Type | Description |
|---|---|---|
| effective_metric_ | str | Always set to 'hamming'. |
| effective_metric_params_ | dict | Always empty. |
| hamming_weights_ | array | When |
| independent_prediction_ | array | When |
| independent_score_ | double | When |
| n_features_in_ | int | Number of features that the transformer outputs. This is equal to the number of features in |
| n_samples_fit_ | int | Number of samples in the fitted data. |
| transformer_ | GBNodeTransformer | The fitted transformer which holds the built gradient boosting models for each feature. |
| y_fit_ | array or DataFrame | When |
Notes
The tree_weighting_method parameter determines how the trees in each
forest are weighted when calculating distances between node indexes.
If tree_weighting_method is set to "train_improvement", tree weights are
calculated as a function of the change in loss between successive trees
in the gradient boosting estimator. The weights therefore depend directly
on the specified loss function, so users may want to choose the loss
function (i.e. loss_reg or loss_clf) that is appropriate for their task.
If tree_weighting_method is set to "uniform", all trees are weighted
equally.
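As a rough illustration of weights derived from the change in training loss, the sketch below fits a scikit-learn GradientBoostingRegressor and normalizes the per-tree decrease in mean squared error. The exact function sknnr applies is not given in this reference, so the clip-and-normalize step, the synthetic data, and the use of mean squared error are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=80)

gbr = GradientBoostingRegressor(n_estimators=10, random_state=0).fit(X, y)

# Training loss before any tree is added. With the default squared-error
# loss and init=None, the initial prediction is the target mean.
losses = [np.mean((y - y.mean()) ** 2)]
# Training loss after each boosting stage.
losses += [np.mean((y - stage_pred) ** 2) for stage_pred in gbr.staged_predict(X)]

# Loss decrease contributed by each tree, clipped at zero and normalized
# so the weights sum to one (an assumed weighting scheme).
improvement = np.clip(-np.diff(losses), 0.0, None)
tree_weights = improvement / improvement.sum()
```

Under this scheme a different loss function yields a different sequence of improvements and therefore different tree weights, which is why the choice of loss matters.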
Source code in src/sknnr/_gbnn.py
Functions
fit
Fit using transformed feature data. If y_fit is provided, it will be used to fit the transformer.
kneighbors
kneighbors(X: DataLike | None = None, n_neighbors: int | None = None, return_distance: bool = True, return_dataframe_index: bool = False, use_deterministic_ordering: bool = True) -> NDArray[int64] | tuple[NDArray[float64], NDArray[int64]]
Find the K-neighbors of a point or points of transformed feature data and optionally return dataframe indexes rather than array indices when the model was fitted with a dataframe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_queries, n_features) | The query point or points. Points are first transformed using the fitted transformer. If not provided, neighbors of each indexed point are returned. In this case, the query point is not considered its own neighbor. | None |
| n_neighbors | int | Number of neighbors required for each sample. The default is the value passed to the constructor. | None |
| return_distance | bool | Whether or not to return the distances. | True |
| return_dataframe_index | bool | Whether or not to return dataframe indexes instead of array indices. Only applicable if the model was fitted with a dataframe. | False |
| use_deterministic_ordering | bool | Whether to use deterministic ordering of neighbors when distances are nearly identical. If True, neighbors with nearly identical distances (up to DISTANCE_PRECISION_DECIMALS decimal places) are ordered lexicographically by: (1) their scaled and rounded distances, (2) the absolute difference between a query point's row index and the neighbor index (so that a sample, when present, is returned before other equally distant samples), and (3) the neighbor index itself. If False, use the default ordering from | True |
Returns:
| Name | Type | Description |
|---|---|---|
| neigh_dist | array-like of shape (n_queries, n_neighbors) | Array representing the lengths to points, only present if return_distance=True. |
| neigh_ind | array-like of shape (n_queries, n_neighbors) | Array indices or dataframe indexes of the nearest points in the population matrix. |
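The three-key ordering described for use_deterministic_ordering can be sketched with `numpy.lexsort`. This is an illustration for a single query row, not sknnr's code: `deterministic_order` is a hypothetical helper, the `decimals=10` default is an assumed stand-in for DISTANCE_PRECISION_DECIMALS, and the distance scaling step is omitted.

```python
import numpy as np


def deterministic_order(distances, neighbor_idx, query_row, decimals=10):
    """Order one query's neighbors by (rounded distance, |row - index|, index).

    `decimals` stands in for DISTANCE_PRECISION_DECIMALS; 10 is an assumption.
    """
    distances = np.asarray(distances, dtype=float)
    neighbor_idx = np.asarray(neighbor_idx)
    rounded = np.round(distances, decimals)
    row_gap = np.abs(neighbor_idx - query_row)
    # np.lexsort uses the LAST key as the primary sort key.
    order = np.lexsort((neighbor_idx, row_gap, rounded))
    return distances[order], neighbor_idx[order]


# Query row 2 is tied at distance 0 with rows 4, 2 and 0.
dist, idx = deterministic_order(
    [0.0, 0.25, 0.0, 0.0], [4, 1, 2, 0], query_row=2
)
# idx is [2, 0, 4, 1]: the sample itself first, then ties by index.
```

The second key places the query sample ahead of other equally distant samples, and the third key breaks any remaining ties by neighbor index so the result is reproducible.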