GBNNRegressor

sknnr.GBNNRegressor

GBNNRegressor(*, loss_reg: Literal['squared_error', 'absolute_error', 'huber', 'quantile'] = 'squared_error', loss_clf: Literal['log_loss', 'exponential'] = 'log_loss', learning_rate: float = 0.1, n_estimators: int = 100, subsample: float = 1.0, criterion: Literal['friedman_mse', 'squared_error'] = 'friedman_mse', min_samples_split: int | float = 2, min_samples_leaf: int | float = 1, min_weight_fraction_leaf: float = 0.0, max_depth: int | None = 3, min_impurity_decrease: float = 0.0, init: BaseEstimator | Literal['zero'] | None = None, random_state: int | None = None, max_features: Literal['sqrt', 'log2'] | int | float | None = None, alpha_reg: float = 0.9, verbose: int = 0, max_leaf_nodes: int | None = None, warm_start: bool = False, validation_fraction: float = 0.1, n_iter_no_change: int | None = None, tol: float = 0.0001, ccp_alpha: float = 0.0, forest_weights: Literal['uniform'] | ArrayLike = 'uniform', tree_weighting_method: Literal['train_improvement', 'uniform'] = 'train_improvement', n_neighbors: int = 5, weights: Literal['uniform', 'distance'] | Callable = 'uniform', n_jobs: int | None = None)

Bases: WeightedTreesNNRegressor

Regression using Gradient Boosting Nearest Neighbors (GBNN) imputation.

New data is predicted by similarity of its node indexes to training set node indexes when run through multiple univariate gradient boosting models. A gradient boosting model is fit to each target in the training set and node indexes are captured for each tree in each forest for each training sample. Node indexes are then captured for inference data and distance is calculated as the dissimilarity between node indexes.
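
The dissimilarity described above can be illustrated with a small, self-contained sketch. It assumes each sample is represented by one node index per tree and that the distance is a weighted Hamming distance normalized by the total weight. The function name `weighted_hamming`, the node indexes, and the weights are all hypothetical and are not taken from sknnr.

```python
from typing import Optional, Sequence


def weighted_hamming(
    a: Sequence[int],
    b: Sequence[int],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted fraction of trees in which two samples land in different nodes."""
    if len(a) != len(b):
        raise ValueError("node-index vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(a)  # uniform weighting: every tree counts equally
    total = sum(weights)
    mismatched = sum(w for x, y, w in zip(a, b, weights) if x != y)
    # Assumption: normalize by the total weight so the result stays in [0, 1].
    return mismatched / total


# Hypothetical node indexes, one entry per tree, for four trees.
train_sample = [3, 7, 4, 6]
query_sample = [3, 6, 4, 2]

# The samples disagree in trees 1 and 3.
uniform_distance = weighted_hamming(train_sample, query_sample)
weighted_distance = weighted_hamming(
    train_sample, query_sample, weights=[4.0, 2.0, 1.0, 1.0]
)
```

With uniform weights, two of four trees disagree, so the distance is 0.5. Down-weighting the disagreeing trees lowers the distance to 0.375.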

Gradient boosting models are constructed using either scikit-learn's GradientBoostingRegressor or GradientBoostingClassifier classes based on the data type of each target (y or y_fit) in the training set. If the target is numeric (e.g. int or float), a GradientBoostingRegressor is used. If the target is categorical (e.g. str or pd.Categorical), a GradientBoostingClassifier is used. The sknnr.transformers.GBNodeTransformer class is responsible for constructing the gradient boosting models and capturing the node indexes.

See sklearn.neighbors.KNeighborsRegressor for more detail on parameters associated with nearest neighbors. See sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier for more detail on parameters associated with gradient boosting. Note that some parameters (e.g. loss and alpha) are specified separately for regression and classification and have _reg and _clf suffixes.

Parameters:

Name Type Description Default
loss_reg ('squared_error', 'absolute_error', 'huber', 'quantile')

Loss function to be optimized for regression.

"squared_error"
loss_clf ('log_loss', 'exponential')

The loss function to be used for classification.

"log_loss"
learning_rate float

Learning rate shrinks the contribution of each tree by learning_rate.

0.1
n_estimators int

The number of boosting stages to perform.

100
subsample float

The fraction of samples to be used for fitting the individual base learners.

1.0
criterion ('friedman_mse', 'squared_error')

The function to measure the quality of a split.

"friedman_mse"
min_samples_split int or float

The minimum number of samples required to split an internal node.

2
min_samples_leaf int or float

The minimum number of samples required to be at a leaf node.

1
min_weight_fraction_leaf float

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.

0.0
max_depth int or None

Maximum depth of the individual regression estimators.

3
min_impurity_decrease float

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

0.0
init (estimator, 'zero' or None)

An estimator object that is used to compute the initial predictions.

None
random_state int, RandomState instance or None

Controls the random seed given to each Tree estimator at each boosting iteration.

None
max_features ('sqrt', 'log2'), int, float or None

The number of features to consider when looking for the best split.

None
alpha_reg float

The alpha-quantile of the huber loss function and the quantile loss function.

0.9
verbose int

Enable verbose output.

0
max_leaf_nodes int or None

Grow trees with max_leaf_nodes in best-first fashion.

None
warm_start bool

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just erase the previous solution.

False
validation_fraction float

The proportion of training data to set aside as validation set for early stopping.

0.1
n_iter_no_change int or None

n_iter_no_change is used to decide if early stopping will be used to terminate training when validation score is not improving.

None
tol float

Tolerance for the early stopping.

1e-4
ccp_alpha non-negative float

Complexity parameter used for Minimal Cost-Complexity Pruning.

0.0
forest_weights Literal['uniform'] | ArrayLike

Weights assigned to each target in the training set when calculating the Hamming distance between node indexes, which allows targets to be weighted differently. All trees associated with a target receive the same forest weight. If "uniform", every target is assigned equal weight.

'uniform'
tree_weighting_method Literal['train_improvement', 'uniform']

The method used to weight the trees in each gradient boosting model.

'train_improvement'
n_neighbors int

Number of neighbors to use by default for kneighbors queries.

5
weights ('uniform', 'distance') or callable

Weight function used in prediction.

"uniform"
n_jobs int or None

The number of jobs to run in parallel.

None

Attributes:

Name Type Description
effective_metric_ str

Always set to 'hamming'.

effective_metric_params_ dict

Always empty.

hamming_weights_ array

When fit, provides the weights on each tree in each forest when calculating the Hamming distance.

independent_prediction_ array

When fit, provides the prediction for training data not allowing self-assignment during neighbor search.

independent_score_ double

When fit, the mean coefficient of determination of the independent prediction across all features.

n_features_in_ int

Number of features that the transformer outputs. This is equal to the number of features in y (or y_fit) * n_estimators_per_forest.

n_samples_fit_ int

Number of samples in the fitted data.

transformer_ GBNodeTransformer

The fitted transformer which holds the built gradient boosting models for each feature.

y_fit_ array or DataFrame

When y_fit is passed to fit, the data used to construct the individual gradient boosting models. Note that all y data is used for prediction.
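
As a rough illustration of what independent_prediction_ and independent_score_ describe, the sketch below predicts each training sample from its nearest neighbors while excluding the sample itself, then scores the result with the coefficient of determination. It assumes a precomputed pairwise distance matrix, a single target, and unweighted neighbor averaging. The helper names and the data are hypothetical and do not reflect sknnr's internals.

```python
def independent_predictions(
    distances: list[list[float]], targets: list[float], k: int
) -> list[float]:
    """Predict each training sample from its k nearest *other* samples."""
    predictions = []
    for i, row in enumerate(distances):
        # Exclude the sample itself so it cannot be assigned as its own neighbor.
        candidates = sorted((d, j) for j, d in enumerate(row) if j != i)
        neighbors = [j for _, j in candidates[:k]]
        predictions.append(sum(targets[j] for j in neighbors) / len(neighbors))
    return predictions


def r_squared(actual: list[float], predicted: list[float]) -> float:
    """Coefficient of determination for a single target."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical pairwise distances between four training samples.
D = [
    [0.0, 0.25, 0.5, 1.0],
    [0.25, 0.0, 0.5, 0.75],
    [0.5, 0.5, 0.0, 0.25],
    [1.0, 0.75, 0.25, 0.0],
]
y = [10.0, 20.0, 30.0, 40.0]

preds = independent_predictions(D, y, k=2)
score = r_squared(y, preds)
```

Because no sample may use itself as a neighbor, the score is usually lower than one computed with self-assignment allowed, which makes it a more honest estimate of accuracy on the training data.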

Notes

The tree_weighting_method parameter determines how the trees in each forest are weighted when calculating distances between node indexes. If tree_weighting_method is set to "train_improvement", tree weights are calculated as a function of the change in loss between successive trees in the gradient boosting estimator. As such, the weights depend on the specified loss function, so users may want to choose the loss function (i.e. loss_reg or loss_clf) that is appropriate for their task.

If tree_weighting_method is set to "uniform", all trees are weighted equally.
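
One way to picture improvement-based weighting is the sketch below. It assumes a tree's weight is its non-negative reduction in training loss relative to the previous boosting stage, normalized so the weights sum to one. This formula is an assumption chosen for illustration. The text above only states that weights are a function of the change in loss, and the function name and loss values are hypothetical.

```python
def train_improvement_weights(
    initial_loss: float, stage_losses: list[float]
) -> list[float]:
    """Weight each tree by how much it reduced the training loss (assumed formula)."""
    # Loss in effect before each stage: the initial loss, then each earlier stage.
    previous = [initial_loss] + stage_losses[:-1]
    # Clip at zero so a stage that increased the loss gets no weight.
    improvements = [
        max(before - after, 0.0) for before, after in zip(previous, stage_losses)
    ]
    total = sum(improvements)
    if total == 0.0:
        # No stage improved the loss: fall back to uniform weights.
        return [1.0 / len(stage_losses)] * len(stage_losses)
    return [improvement / total for improvement in improvements]


# Hypothetical training loss before boosting and after each of four trees.
tree_weights = train_improvement_weights(10.0, [6.0, 4.0, 3.0, 2.0])
```

Here the first tree removes half of the total loss reduction and receives half of the weight, while later trees that contribute less receive proportionally less.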

Source code in src/sknnr/_gbnn.py
def __init__(
    self,
    *,
    loss_reg: Literal[
        "squared_error", "absolute_error", "huber", "quantile"
    ] = "squared_error",
    loss_clf: Literal["log_loss", "exponential"] = "log_loss",
    learning_rate: float = 0.1,
    n_estimators: int = 100,
    subsample: float = 1.0,
    criterion: Literal["friedman_mse", "squared_error"] = "friedman_mse",
    min_samples_split: int | float = 2,
    min_samples_leaf: int | float = 1,
    min_weight_fraction_leaf: float = 0.0,
    max_depth: int | None = 3,
    min_impurity_decrease: float = 0.0,
    init: BaseEstimator | Literal["zero"] | None = None,
    random_state: int | None = None,
    max_features: Literal["sqrt", "log2"] | int | float | None = None,
    alpha_reg: float = 0.9,
    verbose: int = 0,
    max_leaf_nodes: int | None = None,
    warm_start: bool = False,
    validation_fraction: float = 0.1,
    n_iter_no_change: int | None = None,
    tol: float = 0.0001,
    ccp_alpha: float = 0.0,
    forest_weights: Literal["uniform"] | ArrayLike = "uniform",
    tree_weighting_method: Literal[
        "train_improvement", "uniform"
    ] = "train_improvement",
    n_neighbors: int = 5,
    weights: Literal["uniform", "distance"] | Callable = "uniform",
    n_jobs: int | None = None,
):
    self.loss_reg = loss_reg
    self.loss_clf = loss_clf
    self.learning_rate = learning_rate
    self.n_estimators = n_estimators
    self.subsample = subsample
    self.criterion = criterion
    self.min_samples_split = min_samples_split
    self.min_samples_leaf = min_samples_leaf
    self.min_weight_fraction_leaf = min_weight_fraction_leaf
    self.max_depth = max_depth
    self.min_impurity_decrease = min_impurity_decrease
    self.init = init
    self.random_state = random_state
    self.max_features = max_features
    self.alpha_reg = alpha_reg
    self.verbose = verbose
    self.max_leaf_nodes = max_leaf_nodes
    self.warm_start = warm_start
    self.validation_fraction = validation_fraction
    self.n_iter_no_change = n_iter_no_change
    self.tol = tol
    self.ccp_alpha = ccp_alpha
    self.forest_weights = forest_weights
    self.tree_weighting_method = tree_weighting_method

    super().__init__(
        n_neighbors=n_neighbors,
        weights=weights,
        n_jobs=n_jobs,
    )

Attributes

algorithm instance-attribute

algorithm = algorithm

alpha_reg instance-attribute

alpha_reg = alpha_reg

ccp_alpha instance-attribute

ccp_alpha = ccp_alpha

criterion instance-attribute

criterion = criterion

forest_weights instance-attribute

forest_weights = forest_weights

init instance-attribute

init = init

leaf_size instance-attribute

leaf_size = leaf_size

learning_rate instance-attribute

learning_rate = learning_rate

loss_clf instance-attribute

loss_clf = loss_clf

loss_reg instance-attribute

loss_reg = loss_reg

max_depth instance-attribute

max_depth = max_depth

max_features instance-attribute

max_features = max_features

max_leaf_nodes instance-attribute

max_leaf_nodes = max_leaf_nodes

metric instance-attribute

metric = metric

metric_params instance-attribute

metric_params = metric_params

min_impurity_decrease instance-attribute

min_impurity_decrease = min_impurity_decrease

min_samples_leaf instance-attribute

min_samples_leaf = min_samples_leaf

min_samples_split instance-attribute

min_samples_split = min_samples_split

min_weight_fraction_leaf instance-attribute

min_weight_fraction_leaf = min_weight_fraction_leaf

n_estimators instance-attribute

n_estimators = n_estimators

n_iter_no_change instance-attribute

n_iter_no_change = n_iter_no_change

n_jobs instance-attribute

n_jobs = n_jobs

n_neighbors instance-attribute

n_neighbors = n_neighbors

p instance-attribute

p = p

random_state instance-attribute

random_state = random_state

regressor_ instance-attribute

regressor_: RawKNNRegressor

subsample instance-attribute

subsample = subsample

tol instance-attribute

tol = tol

transformer_ instance-attribute

transformer_: TreeNodeTransformer

tree_weighting_method instance-attribute

tree_weighting_method = tree_weighting_method

validation_fraction instance-attribute

validation_fraction = validation_fraction

verbose instance-attribute

verbose = verbose

warm_start instance-attribute

warm_start = warm_start

weights instance-attribute

weights = weights

Functions

fit

fit(X: DataLike, y: DataLike, y_fit: DataLike | None = None) -> Self

Fit using transformed feature data. If y_fit is provided, it will be used to fit the transformer.

Source code in src/sknnr/_base.py
def fit(self, X: DataLike, y: DataLike, y_fit: DataLike | None = None) -> Self:
    """Fit using transformed feature data. If y_fit is provided, it will be used
    to fit the transformer."""
    self.y_fit_ = y_fit
    return super().fit(X, y)

kneighbors

kneighbors(X: DataLike | None = None, n_neighbors: int | None = None, return_distance: bool = True, return_dataframe_index: bool = False, use_deterministic_ordering: bool = True) -> NDArray[int64] | tuple[NDArray[float64], NDArray[int64]]

Find the K-neighbors of a point or points of transformed feature data and optionally return dataframe indexes rather than array indices when the model was fitted with a dataframe.

Parameters:

Name Type Description Default
X array-like of shape (n_queries, n_features)

The query point or points. Points are first transformed using the fitted transformer. If not provided, neighbors of each indexed point are returned. In this case, the query point is not considered its own neighbor.

None
n_neighbors int

Number of neighbors required for each sample. The default is the value passed to the constructor.

None
return_distance bool

Whether or not to return the distances.

True
return_dataframe_index bool

Whether or not to return dataframe indexes instead of array indices. Only applicable if the model was fitted with a dataframe.

False
use_deterministic_ordering bool

Whether to use deterministic ordering of neighbors when distances are nearly identical. If True, neighbors with nearly identical distances (up to DISTANCE_PRECISION_DECIMALS decimal places) are ordered lexicographically by: (1) their scaled and rounded distances, (2) the absolute difference between a query point's row index and the neighbor index (so that a sample, when present, is returned before other equally distant samples), and (3) the neighbor index itself. If False, use the default ordering from KNeighborsRegressor.kneighbors. See the usage guide for more details.

True

Returns:

Name Type Description
neigh_dist array-like of shape (n_queries, n_neighbors)

Array representing the lengths to points, only present if return_distance=True.

neigh_ind array-like of shape (n_queries, n_neighbors)

Array indices or dataframe indexes of the nearest points in the population matrix.

Source code in src/sknnr/_base.py
def kneighbors(
    self,
    X: DataLike | None = None,
    n_neighbors: int | None = None,
    return_distance: bool = True,
    return_dataframe_index: bool = False,
    use_deterministic_ordering: bool = True,
) -> NDArray[np.int64] | tuple[NDArray[np.float64], NDArray[np.int64]]:
    """
    Find the K-neighbors of a point or points of transformed feature data
    and optionally return dataframe indexes rather than array indices when
    the model was fitted with a dataframe.

    Parameters
    ----------
    X : array-like of shape (n_queries, n_features), default=None
        The query point or points. Points are first transformed using the
        fitted transformer. If not provided, neighbors of each indexed
        point are returned. In this case, the query point is not
        considered its own neighbor.
    n_neighbors : int, default=None
        Number of neighbors required for each sample. The default is the
        value passed to the constructor.
    return_distance : bool, default=True
        Whether or not to return the distances.
    return_dataframe_index : bool, default=False
        Whether or not to return dataframe indexes instead of array indices.
        Only applicable if the model was fitted with a dataframe.
    use_deterministic_ordering : bool, default=True
        Whether to use deterministic ordering of neighbors when distances
        are nearly identical.  If True, neighbors with nearly identical
        distances (up to DISTANCE_PRECISION_DECIMALS decimal places) are
        ordered lexicographically by:
        (1) their scaled and rounded distances,
        (2) the absolute difference between a query point's row index
            and the neighbor index (so that a sample, when present, is
            returned before other equally distant samples), and
        (3) the neighbor index itself.
        If False, use the default ordering from
        `KNeighborsRegressor.kneighbors`. See the
        [usage guide](`../../../usage/#deterministic-neighbor-ordering`)
        for more details.

    Returns
    -------
    neigh_dist : array-like of shape (n_queries, n_neighbors)
        Array representing the lengths to points, only present if
        return_distance=True.
    neigh_ind : array-like of shape (n_queries, n_neighbors)
        Array indices or dataframe indexes of the nearest points in the
        population matrix.
    """
    X_transformed = self._transform_X(X)
    return self.regressor_.kneighbors(
        X=X_transformed,
        n_neighbors=n_neighbors,
        return_distance=return_distance,
        return_dataframe_index=return_dataframe_index,
        use_deterministic_ordering=use_deterministic_ordering,
    )
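
The three-level tie-breaking described for use_deterministic_ordering can be sketched in plain Python. This is an illustration of the ordering rule only, not the library's implementation. The value of `DECIMALS` is an assumption (the documentation names DISTANCE_PRECISION_DECIMALS but does not state its value), and `order_neighbors` is a hypothetical helper.

```python
DECIMALS = 10  # assumed precision; stands in for DISTANCE_PRECISION_DECIMALS


def order_neighbors(
    query_row: int, distances: list[float], indices: list[int]
) -> list[int]:
    """Order neighbor indices by rounded distance, then index offset, then index."""

    def sort_key(pair: tuple[float, int]) -> tuple[int, int, int]:
        distance, index = pair
        return (
            round(distance * 10**DECIMALS),  # (1) scaled and rounded distance
            abs(query_row - index),          # (2) offset from the query's row index
            index,                           # (3) the neighbor index itself
        )

    ranked = sorted(zip(distances, indices), key=sort_key)
    return [index for _, index in ranked]


# Three neighbors are tied at about 0.25; index 2 is the query sample itself.
ordered = order_neighbors(
    query_row=2,
    distances=[0.25, 0.25 + 1e-13, 0.25, 0.125],
    indices=[5, 2, 0, 7],
)
```

The closest neighbor (index 7) comes first. Among the three near-ties, the query's own row (index 2) precedes index 0, which precedes index 5, even though the distance to index 2 is fractionally larger before rounding.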

predict

predict(X: DataLike) -> NDArray[float64]

Source code in src/sknnr/_base.py
def predict(self, X: DataLike) -> NDArray[np.float64]:
    X_transformed = self._transform_X(X)
    return self.regressor_.predict(X_transformed)

score

score(X: DataLike, y: DataLike) -> float

Source code in src/sknnr/_base.py
def score(self, X: DataLike, y: DataLike) -> float:
    X_transformed = self._transform_X(X)
    return self.regressor_.score(X_transformed, y)