GBNNRegressor

sknnr.GBNNRegressor ¶

GBNNRegressor(*, loss_reg: Literal['squared_error', 'absolute_error', 'huber', 'quantile'] = 'squared_error', loss_clf: Literal['log_loss', 'exponential'] = 'log_loss', learning_rate: float = 0.1, n_estimators: int = 100, subsample: float = 1.0, criterion: Literal['friedman_mse', 'squared_error'] = 'friedman_mse', min_samples_split: int | float = 2, min_samples_leaf: int | float = 1, min_weight_fraction_leaf: float = 0.0, max_depth: int | None = 3, min_impurity_decrease: float = 0.0, init: BaseEstimator | Literal['zero'] | None = None, random_state: int | None = None, max_features: Literal['sqrt', 'log2'] | int | float | None = None, alpha_reg: float = 0.9, verbose: int = 0, max_leaf_nodes: int | None = None, warm_start: bool = False, validation_fraction: float = 0.1, n_iter_no_change: int | None = None, tol: float = 0.0001, ccp_alpha: float = 0.0, forest_weights: Literal['uniform'] | ArrayLike = 'uniform', tree_weighting_method: Literal['train_improvement', 'uniform'] = 'train_improvement', n_neighbors: int = 5, weights: Literal['uniform', 'distance'] | Callable = 'uniform', n_jobs: int | None = None)

Bases: WeightedTreesNNRegressor

Regression using Gradient Boosting Nearest Neighbors (GBNN) imputation.

New data is predicted by similarity of its node indexes to training set node indexes when run through multiple univariate gradient boosting models. A gradient boosting model is fit to each target in the training set and node indexes are captured for each tree in each forest for each training sample. Node indexes are then captured for inference data and distance is calculated as the dissimilarity between node indexes.
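The dissimilarity described above can be illustrated with a small, library-free sketch. It assumes each sample is represented by one node index per tree and that the distance is the (optionally weighted) fraction of trees in which two samples fall in different nodes. The function name `node_hamming_distance` and the node values are hypothetical and are not part of sknnr.

```python
from typing import Optional, Sequence


def node_hamming_distance(
    a: Sequence[int],
    b: Sequence[int],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted fraction of trees in which two samples land in different nodes."""
    if len(a) != len(b):
        raise ValueError("node index vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(a)  # uniform weighting: every tree counts equally
    total = sum(weights)
    mismatched = sum(w for x, y, w in zip(a, b, weights) if x != y)
    return mismatched / total


# Hypothetical node indexes for one training sample and one query sample
# across four trees (values are made up for illustration).
train_nodes = [3, 7, 4, 10]
query_nodes = [3, 6, 4, 11]

print(node_hamming_distance(train_nodes, query_nodes))  # 0.5
print(node_hamming_distance(train_nodes, query_nodes, weights=[4, 1, 4, 1]))  # 0.2
```

With uniform weights, two of four trees disagree, giving 0.5. Down-weighting the disagreeing trees lowers the distance, which is the effect tree and forest weights have on neighbor selection.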

Gradient boosting models are constructed using either scikit-learn's GradientBoostingRegressor or GradientBoostingClassifier classes based on the data type of each target (y or y_fit) in the training set. If the target is numeric (e.g. int or float), a GradientBoostingRegressor is used. If the target is categorical (e.g. str or pd.Categorical), a GradientBoostingClassifier is used. The sknnr.transformers.GBNodeTransformer class is responsible for constructing the gradient boosting models and capturing the node indexes.
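The per-target choice between regressor and classifier can be pictured roughly as follows. This is only a sketch of the rule stated above, not the logic inside GBNodeTransformer: the helper `choose_boosting_model` is hypothetical, it inspects plain Python values rather than array dtypes, and it returns the class name as a string so that it runs without scikit-learn installed.

```python
from numbers import Number
from typing import Dict, Iterable, List


def choose_boosting_model(target: Iterable[object]) -> str:
    """Return the scikit-learn class name a target of this kind would map to."""
    values = list(target)
    # Assumption of this sketch: a target counts as numeric only if every
    # value is a number (note that Python bools are also Numbers).
    if all(isinstance(v, Number) for v in values):
        return "GradientBoostingRegressor"
    return "GradientBoostingClassifier"


# Made-up targets: one numeric, one categorical.
targets: Dict[str, List[object]] = {
    "height": [12.5, 30.1, 8.0],
    "species": ["fir", "pine", "fir"],
}
for name, values in targets.items():
    print(name, "->", choose_boosting_model(values))
```

Each target gets its own model, so a training set with both numeric and categorical targets would end up with a mix of the two estimator types.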

See sklearn.neighbors.KNeighborsRegressor for more detail on parameters associated with nearest neighbors. See sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier for more detail on parameters associated with gradient boosting. Note that some parameters (e.g. loss and alpha) are specified separately for regression and classification and have _reg and _clf suffixes.

Parameters:

Name	Type	Description	Default
`loss_reg`	`('squared_error', 'absolute_error', 'huber', 'quantile')`	Loss function to be optimized for regression.	`"squared_error"`
`loss_clf`	`('log_loss', 'exponential')`	The loss function to be used for classification.	`"log_loss"`
`learning_rate`	`float`	Learning rate shrinks the contribution of each tree by `learning_rate`.	`0.1`
`n_estimators`	`int`	The number of boosting stages to perform.	`100`
`subsample`	`float`	The fraction of samples to be used for fitting the individual base learners.	`1.0`
`criterion`	`('friedman_mse', 'squared_error')`	The function to measure the quality of a split.	`"friedman_mse"`
`min_samples_split`	`int or float`	The minimum number of samples required to split an internal node.	`2`
`min_samples_leaf`	`int or float`	The minimum number of samples required to be at a leaf node.	`1`
`min_weight_fraction_leaf`	`float`	The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.	`0.0`
`max_depth`	`int or None`	Maximum depth of the individual regression estimators.	`3`
`min_impurity_decrease`	`float`	A node will be split if this split induces a decrease of the impurity greater than or equal to this value.	`0.0`
`init`	`(estimator, 'zero' or None)`	An estimator object that is used to compute the initial predictions.	`None`
`random_state`	`int, RandomState instance or None`	Controls the random seed given to each Tree estimator at each boosting iteration.	`None`
`max_features`	`('sqrt', 'log2'), int, float or None`	The number of features to consider when looking for the best split.	`None`
`alpha_reg`	`float`	The alpha-quantile of the huber loss function and the quantile loss function.	`0.9`
`verbose`	`int`	Enable verbose output.	`0`
`max_leaf_nodes`	`int or None`	Grow trees with `max_leaf_nodes` in best-first fashion.	`None`
`warm_start`	`bool`	When set to `True`, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just erase the previous solution.	`False`
`validation_fraction`	`float`	The proportion of training data to set aside as validation set for early stopping.	`0.1`
`n_iter_no_change`	`int or None`	`n_iter_no_change` is used to decide if early stopping will be used to terminate training when validation score is not improving.	`None`
`tol`	`float`	Tolerance for the early stopping.	`1e-4`
`ccp_alpha`	`non-negative float`	Complexity parameter used for Minimal Cost-Complexity Pruning.	`0.0`
`forest_weights`	`Literal['uniform'] \| ArrayLike`	Weights assigned to each target in the training set when calculating Hamming distance between node indexes. This allows for differential weighting of targets when calculating distances. Note that all trees associated with a target will receive the same weight. If "uniform", each tree is assigned equal weight.	`'uniform'`
`tree_weighting_method`	`Literal['train_improvement', 'uniform']`	The method used to weight the trees in each gradient boosting model.	`'train_improvement'`
`n_neighbors`	`int`	Number of neighbors to use by default for `kneighbors` queries.	`5`
`weights`	`('uniform', 'distance') or callable`	Weight function used in prediction.	`"uniform"`
`n_jobs`	`int or None`	The number of jobs to run in parallel.	`None`

Attributes:

Name	Type	Description
`effective_metric_`	`str`	Always set to 'hamming'.
`effective_metric_params_`	`dict`	Always empty.
`hamming_weights_`	`array`	When `fit`, provides the weights on each tree in each forest when calculating the Hamming distance.
`independent_prediction_`	`array`	When `fit`, provides the prediction for training data not allowing self-assignment during neighbor search.
`independent_score_`	`double`	When `fit`, the mean coefficient of determination of the independent prediction across all features.
`n_features_in_`	`int`	Number of features that the transformer outputs. This is equal to the number of features in `y` (or `y_fit`) * `n_estimators_per_forest`.
`n_samples_fit_`	`int`	Number of samples in the fitted data.
`transformer_`	`GBNodeTransformer`	The fitted transformer which holds the built gradient boosting models for each feature.
`y_fit_`	`array or DataFrame`	When `y_fit` is passed to `fit`, the data used to construct the individual gradient boosting models. Note that all `y` data is used for prediction.

Notes

The tree_weighting_method parameter determines how the trees in each forest are weighted when calculating distances between node indexes. If tree_weighting_method is set to "train_improvement", tree weights are calculated as a function of the change in loss between successive trees in the gradient boosting estimator. The weights therefore depend directly on the loss function specified, so the user may want to choose the loss function (i.e. loss_reg or loss_clf) appropriate for their task.

If tree_weighting_method is set to "uniform", all trees are weighted equally.
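To make the idea of "train_improvement" weighting concrete, here is one plausible scheme, written as a sketch. The exact function sknnr applies is not given on this page, so the normalization below (each tree's loss decrease divided by the total decrease) is an assumption, as are the helper name `train_improvement_weights` and the loss values.

```python
from typing import List, Sequence


def train_improvement_weights(losses: Sequence[float]) -> List[float]:
    """Turn a per-stage training loss curve into normalized tree weights.

    `losses[0]` is the loss before any tree is added and `losses[i]` is the
    loss after tree i, so tree i's improvement is losses[i - 1] - losses[i].
    """
    if len(losses) < 2:
        raise ValueError("need the initial loss and at least one tree's loss")
    # Assumption of this sketch: a tree that increases the loss gets weight 0.
    improvements = [max(prev - cur, 0.0) for prev, cur in zip(losses, losses[1:])]
    total = sum(improvements)
    if total == 0:
        # No tree improved the loss: fall back to equal weights.
        return [1.0 / len(improvements)] * len(improvements)
    return [imp / total for imp in improvements]


# Made-up loss curve: initial loss, then the loss after each of three trees.
weights = train_improvement_weights([10.0, 6.0, 4.0, 3.0])
print(weights)  # the earliest tree, with the largest loss drop, gets the most weight
```

Because the weights are derived from the loss curve, switching the loss (for example from squared error to absolute error) changes the curve and therefore the relative influence of each tree on the distance.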

Source code in src/sknnr/_gbnn.py

def __init__(
    self,
    *,
    loss_reg: Literal[
        "squared_error", "absolute_error", "huber", "quantile"
    ] = "squared_error",
    loss_clf: Literal["log_loss", "exponential"] = "log_loss",
    learning_rate: float = 0.1,
    n_estimators: int = 100,
    subsample: float = 1.0,
    criterion: Literal["friedman_mse", "squared_error"] = "friedman_mse",
    min_samples_split: int | float = 2,
    min_samples_leaf: int | float = 1,
    min_weight_fraction_leaf: float = 0.0,
    max_depth: int | None = 3,
    min_impurity_decrease: float = 0.0,
    init: BaseEstimator | Literal["zero"] | None = None,
    random_state: int | None = None,
    max_features: Literal["sqrt", "log2"] | int | float | None = None,
    alpha_reg: float = 0.9,
    verbose: int = 0,
    max_leaf_nodes: int | None = None,
    warm_start: bool = False,
    validation_fraction: float = 0.1,
    n_iter_no_change: int | None = None,
    tol: float = 0.0001,
    ccp_alpha: float = 0.0,
    forest_weights: Literal["uniform"] | ArrayLike = "uniform",
    tree_weighting_method: Literal[
        "train_improvement", "uniform"
    ] = "train_improvement",
    n_neighbors: int = 5,
    weights: Literal["uniform", "distance"] | Callable = "uniform",
    n_jobs: int | None = None,
):
    self.loss_reg = loss_reg
    self.loss_clf = loss_clf
    self.learning_rate = learning_rate
    self.n_estimators = n_estimators
    self.subsample = subsample
    self.criterion = criterion
    self.min_samples_split = min_samples_split
    self.min_samples_leaf = min_samples_leaf
    self.min_weight_fraction_leaf = min_weight_fraction_leaf
    self.max_depth = max_depth
    self.min_impurity_decrease = min_impurity_decrease
    self.init = init
    self.random_state = random_state
    self.max_features = max_features
    self.alpha_reg = alpha_reg
    self.verbose = verbose
    self.max_leaf_nodes = max_leaf_nodes
    self.warm_start = warm_start
    self.validation_fraction = validation_fraction
    self.n_iter_no_change = n_iter_no_change
    self.tol = tol
    self.ccp_alpha = ccp_alpha
    self.forest_weights = forest_weights
    self.tree_weighting_method = tree_weighting_method

    super().__init__(
        n_neighbors=n_neighbors,
        weights=weights,
        n_jobs=n_jobs,
    )

Attributes¶

algorithm `instance-attribute` ¶

algorithm = algorithm

alpha_reg `instance-attribute` ¶

alpha_reg = alpha_reg

ccp_alpha `instance-attribute` ¶

ccp_alpha = ccp_alpha

criterion `instance-attribute` ¶

criterion = criterion

forest_weights `instance-attribute` ¶

forest_weights = forest_weights

init `instance-attribute` ¶

init = init

leaf_size `instance-attribute` ¶

leaf_size = leaf_size

learning_rate `instance-attribute` ¶

learning_rate = learning_rate

loss_clf `instance-attribute` ¶

loss_clf = loss_clf

loss_reg `instance-attribute` ¶

loss_reg = loss_reg

max_depth `instance-attribute` ¶

max_depth = max_depth

max_features `instance-attribute` ¶

max_features = max_features

max_leaf_nodes `instance-attribute` ¶

max_leaf_nodes = max_leaf_nodes

metric `instance-attribute` ¶

metric = metric

metric_params `instance-attribute` ¶

metric_params = metric_params

min_impurity_decrease `instance-attribute` ¶

min_impurity_decrease = min_impurity_decrease

min_samples_leaf `instance-attribute` ¶

min_samples_leaf = min_samples_leaf

min_samples_split `instance-attribute` ¶

min_samples_split = min_samples_split

min_weight_fraction_leaf `instance-attribute` ¶

min_weight_fraction_leaf = min_weight_fraction_leaf

n_estimators `instance-attribute` ¶

n_estimators = n_estimators

n_iter_no_change `instance-attribute` ¶

n_iter_no_change = n_iter_no_change

n_jobs `instance-attribute` ¶

n_jobs = n_jobs

n_neighbors `instance-attribute` ¶

n_neighbors = n_neighbors

p `instance-attribute` ¶

p = p

random_state `instance-attribute` ¶

random_state = random_state

regressor_ `instance-attribute` ¶

regressor_: RawKNNRegressor

subsample `instance-attribute` ¶

subsample = subsample

tol `instance-attribute` ¶

tol = tol

transformer_ `instance-attribute` ¶

transformer_: TreeNodeTransformer

tree_weighting_method `instance-attribute` ¶

tree_weighting_method = tree_weighting_method

validation_fraction `instance-attribute` ¶

validation_fraction = validation_fraction

verbose `instance-attribute` ¶

verbose = verbose

warm_start `instance-attribute` ¶

warm_start = warm_start

weights `instance-attribute` ¶

weights = weights

Functions¶

fit ¶

fit(X: DataLike, y: DataLike, y_fit: DataLike | None = None) -> Self

Fit using transformed feature data. If y_fit is provided, it will be used to fit the transformer.

Source code in src/sknnr/_base.py

def fit(self, X: DataLike, y: DataLike, y_fit: DataLike | None = None) -> Self:
    """Fit using transformed feature data. If y_fit is provided, it will be used
    to fit the transformer."""
    self.y_fit_ = y_fit
    return super().fit(X, y)

kneighbors ¶

kneighbors(X: DataLike | None = None, n_neighbors: int | None = None, return_distance: bool = True, return_dataframe_index: bool = False, use_deterministic_ordering: bool = True) -> NDArray[int64] | tuple[NDArray[float64], NDArray[int64]]

Find the K-neighbors of a point or points of transformed feature data and optionally return dataframe indexes rather than array indices when the model was fitted with a dataframe.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_queries, n_features)`	The query point or points. Points are first transformed using the fitted transformer. If not provided, neighbors of each indexed point are returned. In this case, the query point is not considered its own neighbor.	`None`
`n_neighbors`	`int`	Number of neighbors required for each sample. The default is the value passed to the constructor.	`None`
`return_distance`	`bool`	Whether or not to return the distances.	`True`
`return_dataframe_index`	`bool`	Whether or not to return dataframe indexes instead of array indices. Only applicable if the model was fitted with a dataframe.	`False`
`use_deterministic_ordering`	`bool`	Whether to use deterministic ordering of neighbors when distances are nearly identical. If True, neighbors with nearly identical distances (up to DISTANCE_PRECISION_DECIMALS decimal places) are ordered lexicographically by: (1) their scaled and rounded distances, (2) the absolute difference between a query point's row index and the neighbor index (so that a sample, when present, is returned before other equally distant samples), and (3) the neighbor index itself. If False, use the default ordering from `KNeighborsRegressor.kneighbors`. See the usage guide for more details.	`True`

Returns:

Name	Type	Description
`neigh_dist`	`array-like of shape (n_queries, n_neighbors)`	Array representing the lengths to points, only present if return_distance=True.
`neigh_ind`	`array-like of shape (n_queries, n_neighbors)`	Array indices or dataframe indexes of the nearest points in the population matrix.
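The three-part ordering described for use_deterministic_ordering can be sketched in plain Python. This is an illustration of the stated sort key, not the library's implementation: `order_neighbors` is a hypothetical helper, and `PRECISION_DECIMALS` is an assumed stand-in because the value of DISTANCE_PRECISION_DECIMALS is not shown on this page.

```python
from typing import List, Sequence, Tuple

# Assumed value for illustration only; the library's
# DISTANCE_PRECISION_DECIMALS constant is not shown on this page.
PRECISION_DECIMALS = 10


def order_neighbors(
    distances: Sequence[float], query_row: int, decimals: int = PRECISION_DECIMALS
) -> List[int]:
    """Return neighbor indices sorted by the three-part key described above."""

    def key(neighbor: int) -> Tuple[int, int, int]:
        return (
            round(distances[neighbor] * 10**decimals),  # (1) scaled, rounded distance
            abs(query_row - neighbor),  # (2) closeness to the query's own row
            neighbor,  # (3) the neighbor index itself
        )

    return sorted(range(len(distances)), key=key)


# The query is row 2; candidates 0, 2 and 4 are tied up to floating-point noise.
dists = [0.25, 0.5, 0.25 + 1e-15, 0.75, 0.25]
print(order_neighbors(dists, query_row=2))  # [2, 0, 4, 1, 3]
```

Rounding first makes the three near-identical distances compare equal, the row-difference term then places the query's own row (index 2) ahead of the other tied samples, and the index term breaks the remaining tie between rows 0 and 4.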

Source code in src/sknnr/_base.py

def kneighbors(
    self,
    X: DataLike | None = None,
    n_neighbors: int | None = None,
    return_distance: bool = True,
    return_dataframe_index: bool = False,
    use_deterministic_ordering: bool = True,
) -> NDArray[np.int64] | tuple[NDArray[np.float64], NDArray[np.int64]]:
    """
    Find the K-neighbors of a point or points of transformed feature data
    and optionally return dataframe indexes rather than array indices when
    the model was fitted with a dataframe.

    Parameters
    ----------
    X : array-like of shape (n_queries, n_features), default=None
        The query point or points. Points are first transformed using the
        fitted transformer. If not provided, neighbors of each indexed
        point are returned. In this case, the query point is not
        considered its own neighbor.
    n_neighbors : int, default=None
        Number of neighbors required for each sample. The default is the
        value passed to the constructor.
    return_distance : bool, default=True
        Whether or not to return the distances.
    return_dataframe_index : bool, default=False
        Whether or not to return dataframe indexes instead of array indices.
        Only applicable if the model was fitted with a dataframe.
    use_deterministic_ordering : bool, default=True
        Whether to use deterministic ordering of neighbors when distances
        are nearly identical.  If True, neighbors with nearly identical
        distances (up to DISTANCE_PRECISION_DECIMALS decimal places) are
        ordered lexicographically by:
        (1) their scaled and rounded distances,
        (2) the absolute difference between a query point's row index
            and the neighbor index (so that a sample, when present, is
            returned before other equally distant samples), and
        (3) the neighbor index itself.
        If False, use the default ordering from
        `KNeighborsRegressor.kneighbors`. See the
        [usage guide](`../../../usage/#deterministic-neighbor-ordering`)
        for more details.

    Returns
    -------
    neigh_dist : array-like of shape (n_queries, n_neighbors)
        Array representing the lengths to points, only present if
        return_distance=True.
    neigh_ind : array-like of shape (n_queries, n_neighbors)
        Array indices or dataframe indexes of the nearest points in the
        population matrix.
    """
    X_transformed = self._transform_X(X)
    return self.regressor_.kneighbors(
        X=X_transformed,
        n_neighbors=n_neighbors,
        return_distance=return_distance,
        return_dataframe_index=return_dataframe_index,
        use_deterministic_ordering=use_deterministic_ordering,
    )

predict ¶

predict(X: DataLike) -> NDArray[float64]

Source code in src/sknnr/_base.py

def predict(self, X: DataLike) -> NDArray[np.float64]:
    X_transformed = self._transform_X(X)
    return self.regressor_.predict(X_transformed)

score ¶

score(X: DataLike, y: DataLike) -> float

Source code in src/sknnr/_base.py

def score(self, X: DataLike, y: DataLike) -> float:
    X_transformed = self._transform_X(X)
    return self.regressor_.score(X_transformed, y)
