RFNNRegressor

sknnr.RFNNRegressor ¶

RFNNRegressor(*, n_estimators: int = 50, criterion_reg: Literal['squared_error', 'absolute_error', 'friedman_mse', 'poisson'] = 'squared_error', criterion_clf: Literal['gini', 'entropy', 'log_loss'] = 'gini', max_depth: int | None = None, min_samples_split: int | float = 2, min_samples_leaf: int | float = 5, min_weight_fraction_leaf: float = 0.0, max_features_reg: Literal['sqrt', 'log2'] | int | float | None = 1.0, max_features_clf: Literal['sqrt', 'log2'] | int | float | None = 'sqrt', max_leaf_nodes: int | None = None, min_impurity_decrease: float = 0.0, bootstrap: bool = True, oob_score: bool | Callable = False, n_jobs: int | None = None, random_state: int | RandomState | None = None, verbose: int = 0, warm_start: bool = False, class_weight_clf: Literal['balanced', 'balanced_subsample'] | dict[str, float] | list[dict[str, float]] | None = None, ccp_alpha: float = 0.0, max_samples: int | float | None = None, monotonic_cst: list[int] | None = None, forest_weights: Literal['uniform'] | ArrayLike = 'uniform', n_neighbors: int = 5, weights: Literal['uniform', 'distance'] | Callable = 'uniform')

Bases: WeightedTreesNNRegressor

Regression using Random Forest Nearest Neighbors (RFNN) imputation.

New data is predicted by similarity of its node indexes to training set node indexes when run through multiple univariate random forests. A random forest is fit to each target in the training set and node indexes are captured for each tree in each forest for each training sample. Node indexes are then captured for inference data and distance is calculated as the dissimilarity between node indexes.
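The distance described above can be sketched in plain Python. This is only an illustration, not the library's implementation: the node indexes, the sample names, and the `hamming_distance` helper are hypothetical, and the distance is assumed to be the normalized Hamming distance (the fraction of trees in which two samples land in different nodes).

```python
def hamming_distance(nodes_a, nodes_b):
    """Fraction of trees in which two samples fall into different nodes."""
    if len(nodes_a) != len(nodes_b):
        raise ValueError("node index vectors must have the same length")
    mismatches = sum(a != b for a, b in zip(nodes_a, nodes_b))
    return mismatches / len(nodes_a)


# Hypothetical node indexes: one entry per tree, concatenated across forests.
train_nodes = {
    "plot_1": [4, 7, 2, 9, 3, 5],
    "plot_2": [4, 3, 6, 9, 1, 5],
    "plot_3": [8, 1, 6, 2, 1, 0],
}
query_nodes = [4, 7, 2, 9, 1, 5]

# Distance from the query sample to every training sample.
distances = {
    name: hamming_distance(query_nodes, nodes)
    for name, nodes in train_nodes.items()
}

# The nearest neighbor is the training sample sharing the most nodes.
nearest = min(distances, key=distances.get)
print(nearest, distances)
```

Here `plot_1` differs from the query in one of six trees, so it is the nearest neighbor.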

Random forests are constructed using either scikit-learn's RandomForestRegressor or RandomForestClassifier classes based on the data type of each target (y or y_fit) in the training set. If the target is numeric (e.g. int or float), a RandomForestRegressor is used. If the target is categorical (e.g. str or pd.Categorical), a RandomForestClassifier is used. The sknnr.transformers.RFNodeTransformer class is responsible for constructing the random forests and capturing the node indexes.

See sklearn.neighbors.KNeighborsRegressor for more detail on parameters associated with nearest neighbors. See sklearn.ensemble.RandomForestRegressor and sklearn.ensemble.RandomForestClassifier for more detail on parameters associated with random forests. Note that some parameters (e.g. criterion and max_features) are specified separately for regression and classification and have _reg and _clf suffixes.

Parameters:

Name	Type	Description	Default
`n_estimators`	`int`	The number of trees in each random forest. In scikit-learn this parameter applies to a single random forest; in `RFNNRegressor` it applies to every forest, one of which is fit for each target in the training set.	`50`
`criterion_reg`	`('squared_error', 'absolute_error', 'friedman_mse', 'poisson')`	The function to measure the quality of a split for RandomForestRegressor objects.	`"squared_error"`
`criterion_clf`	`('gini', 'entropy', 'log_loss')`	The function to measure the quality of a split for RandomForestClassifier objects.	`"gini"`
`max_depth`	`int`	The maximum depth of the tree.	`None`
`min_samples_split`	`int or float`	The minimum number of samples required to split an internal node.	`2`
`min_samples_leaf`	`int or float`	The minimum number of samples required to be at a leaf node.	`5`
`min_weight_fraction_leaf`	`float`	The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.	`0.0`
`max_features_reg`	`"sqrt", "log2", int, float or None`	The number of features to consider when looking for the best split for RandomForestRegressor objects.	`1.0`
`max_features_clf`	`"sqrt", "log2", int, float or None`	The number of features to consider when looking for the best split for RandomForestClassifier objects.	`"sqrt"`
`max_leaf_nodes`	`int`	Grow trees with max_leaf_nodes in best-first fashion.	`None`
`min_impurity_decrease`	`float`	A node will be split if this split induces a decrease of the impurity greater than or equal to this value.	`0.0`
`bootstrap`	`bool`	Whether bootstrap samples are used when building trees.	`True`
`oob_score`	`bool or callable`	Whether to use out-of-bag samples to estimate the generalization score.	`False`
`n_jobs`	`int`	The number of jobs to run in parallel.	`None`
`random_state`	`int, RandomState instance or None`	Controls both the randomness of the bootstrapping of the samples used when building trees (if `bootstrap=True`) and the sampling of the features to consider when looking for the best split at each node (if `max_features < n_features`).	`None`
`verbose`	`int`	Controls the verbosity when fitting and predicting.	`0`
`warm_start`	`bool`	When set to `True`, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.	`False`
`class_weight_clf`	`"balanced", "balanced_subsample", dict, list of dict or None`	Weights associated with classes in the form {class_label: weight}. If not given, all classes are assumed to have weight one.	`None`
`ccp_alpha`	`non-negative float`	Complexity parameter used for Minimal Cost-Complexity Pruning.	`0.0`
`max_samples`	`int or float`	If bootstrap is `True`, the number of samples to draw from X to train each base estimator.	`None`
`monotonic_cst`	`array-like of int of shape (n_features)`	Indicates the monotonicity constraint to enforce on each feature.	`None`
`forest_weights`	`"uniform" or array-like`	Weights assigned to each target in the training set when calculating Hamming distance between node indexes. This allows for differential weighting of targets when calculating distances. Note that all trees associated with a target will receive the same weight. If "uniform", each tree is assigned equal weight.	`"uniform"`
`n_neighbors`	`int`	Number of neighbors to use by default for `kneighbors` queries.	`5`
`weights`	`"uniform", "distance" or callable`	Weight function used in prediction.	`"uniform"`
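As a rough illustration of how `forest_weights` could enter the distance calculation, the sketch below expands one weight per target into one weight per tree and uses those weights in a Hamming distance. The helper names and the normalization by the sum of weights are assumptions made for this example, not the library's code.

```python
def expand_forest_weights(forest_weights, n_targets, n_estimators):
    """Repeat each target's weight once per tree in that target's forest."""
    if isinstance(forest_weights, str):
        if forest_weights != "uniform":
            raise ValueError("the only supported string is 'uniform'")
        per_target = [1.0] * n_targets
    else:
        per_target = [float(w) for w in forest_weights]
        if len(per_target) != n_targets:
            raise ValueError("expected one weight per target")
    return [w for w in per_target for _ in range(n_estimators)]


def weighted_hamming(nodes_a, nodes_b, tree_weights):
    """Weighted share of trees in which the two samples disagree."""
    mismatch = sum(
        w for a, b, w in zip(nodes_a, nodes_b, tree_weights) if a != b
    )
    return mismatch / sum(tree_weights)


# Two targets with three trees each; the first target counts double.
tree_weights = expand_forest_weights([2.0, 1.0], n_targets=2, n_estimators=3)
uniform_weights = expand_forest_weights("uniform", n_targets=2, n_estimators=3)

sample_a = [1, 2, 3, 4, 5, 6]
sample_b = [1, 9, 3, 4, 5, 6]  # disagrees in one tree of the first forest

weighted = weighted_hamming(sample_a, sample_b, tree_weights)
unweighted = weighted_hamming(sample_a, sample_b, uniform_weights)
print(tree_weights, weighted, unweighted)
```

Because the disagreement falls in the more heavily weighted forest, the weighted distance (2/9) is larger than the uniform one (1/6).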

Attributes:

Name	Type	Description
`hamming_weights_`	`array`	When `fit`, provides the weights on each tree in each forest when calculating the Hamming distance.
`independent_prediction_`	`array`	The independent predictions for each sample in the training set, obtained by calculating `kneighbors` on the training data itself and calculating predictions based on those neighbors.
`independent_score_`	`double`	The independent score (i.e. coefficient of determination or R²) for the model, obtained by calculating the average R² across all outputs.
`n_features_in_`	`int`	Number of features that the transformer outputs. This is equal to the number of features in `y` (or `y_fit`) * `n_estimators_per_forest`.
`regressor_`	`RawKNNRegressor`	The underlying RawKNNRegressor instance.
`transformer_`	`RFNodeTransformer`	The fitted transformer which holds the built random forests for each feature.
`y_fit_`	`array-like of shape (n_samples, n_targets)`	The target data seen during fit which is used to construct the individual random forests. Note that `y_fit_` is only used for fitting, whereas regression will be run on the `y` values passed to `fit`.
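The averaging behind `independent_score_` can be illustrated with a small, library-free sketch. The data and helper names below are hypothetical; the sketch only assumes what the description states, namely that R² is computed per output and then averaged.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination for a single output."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_r2(y_true_columns, y_pred_columns):
    """Unweighted average of the per-output R² values."""
    scores = [r2_score(t, p) for t, p in zip(y_true_columns, y_pred_columns)]
    return sum(scores) / len(scores)


# Two hypothetical outputs, each stored as a column of three samples.
y_true = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
y_pred = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]]  # perfect fit, then mean-only fit

score = mean_r2(y_true, y_pred)
print(score)
```

The first output is predicted perfectly (R² = 1) and the second no better than its mean (R² = 0), so the averaged score is 0.5.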

Notes

n_jobs is used as a parameter in both RFNodeTransformer and KNeighborsRegressor. The value specified for this parameter will be passed to both estimators.

References

Crookston, NL, Finley, AO. 2008. yaImpute: an R package for kNN imputation. Journal of Statistical Software, 23, pp.1-16.

Source code in src/sknnr/_rfnn.py

def __init__(
    self,
    *,
    n_estimators: int = 50,
    criterion_reg: Literal[
        "squared_error", "absolute_error", "friedman_mse", "poisson"
    ] = "squared_error",
    criterion_clf: Literal["gini", "entropy", "log_loss"] = "gini",
    max_depth: int | None = None,
    min_samples_split: int | float = 2,
    min_samples_leaf: int | float = 5,
    min_weight_fraction_leaf: float = 0.0,
    max_features_reg: Literal["sqrt", "log2"] | int | float | None = 1.0,
    max_features_clf: Literal["sqrt", "log2"] | int | float | None = "sqrt",
    max_leaf_nodes: int | None = None,
    min_impurity_decrease: float = 0.0,
    bootstrap: bool = True,
    oob_score: bool | Callable = False,
    n_jobs: int | None = None,
    random_state: int | RandomState | None = None,
    verbose: int = 0,
    warm_start: bool = False,
    class_weight_clf: Literal["balanced", "balanced_subsample"]
    | dict[str, float]
    | list[dict[str, float]]
    | None = None,
    ccp_alpha: float = 0.0,
    max_samples: int | float | None = None,
    monotonic_cst: list[int] | None = None,
    forest_weights: Literal["uniform"] | ArrayLike = "uniform",
    n_neighbors: int = 5,
    weights: Literal["uniform", "distance"] | Callable = "uniform",
):
    self.n_estimators = n_estimators
    self.criterion_reg = criterion_reg
    self.criterion_clf = criterion_clf
    self.max_depth = max_depth
    self.min_samples_split = min_samples_split
    self.min_samples_leaf = min_samples_leaf
    self.min_weight_fraction_leaf = min_weight_fraction_leaf
    self.max_features_reg = max_features_reg
    self.max_features_clf = max_features_clf
    self.max_leaf_nodes = max_leaf_nodes
    self.min_impurity_decrease = min_impurity_decrease
    self.bootstrap = bootstrap
    self.oob_score = oob_score
    self.n_jobs = n_jobs
    self.random_state = random_state
    self.verbose = verbose
    self.warm_start = warm_start
    self.class_weight_clf = class_weight_clf
    self.ccp_alpha = ccp_alpha
    self.max_samples = max_samples
    self.monotonic_cst = monotonic_cst
    self.forest_weights = forest_weights

    super().__init__(
        n_neighbors=n_neighbors,
        weights=weights,
        n_jobs=self.n_jobs,
    )

Attributes¶

algorithm `instance-attribute` ¶

algorithm = algorithm

bootstrap `instance-attribute` ¶

bootstrap = bootstrap

ccp_alpha `instance-attribute` ¶

ccp_alpha = ccp_alpha

class_weight_clf `instance-attribute` ¶

class_weight_clf = class_weight_clf

criterion_clf `instance-attribute` ¶

criterion_clf = criterion_clf

criterion_reg `instance-attribute` ¶

criterion_reg = criterion_reg

forest_weights `instance-attribute` ¶

forest_weights = forest_weights

leaf_size `instance-attribute` ¶

leaf_size = leaf_size

max_depth `instance-attribute` ¶

max_depth = max_depth

max_features_clf `instance-attribute` ¶

max_features_clf = max_features_clf

max_features_reg `instance-attribute` ¶

max_features_reg = max_features_reg

max_leaf_nodes `instance-attribute` ¶

max_leaf_nodes = max_leaf_nodes

max_samples `instance-attribute` ¶

max_samples = max_samples

metric `instance-attribute` ¶

metric = metric

metric_params `instance-attribute` ¶

metric_params = metric_params

min_impurity_decrease `instance-attribute` ¶

min_impurity_decrease = min_impurity_decrease

min_samples_leaf `instance-attribute` ¶

min_samples_leaf = min_samples_leaf

min_samples_split `instance-attribute` ¶

min_samples_split = min_samples_split

min_weight_fraction_leaf `instance-attribute` ¶

min_weight_fraction_leaf = min_weight_fraction_leaf

monotonic_cst `instance-attribute` ¶

monotonic_cst = monotonic_cst

n_estimators `instance-attribute` ¶

n_estimators = n_estimators

n_jobs `instance-attribute` ¶

n_jobs = n_jobs

n_neighbors `instance-attribute` ¶

n_neighbors = n_neighbors

oob_score `instance-attribute` ¶

oob_score = oob_score

p `instance-attribute` ¶

p = p

random_state `instance-attribute` ¶

random_state = random_state

regressor_ `instance-attribute` ¶

regressor_: RawKNNRegressor

transformer_ `instance-attribute` ¶

transformer_: TreeNodeTransformer

verbose `instance-attribute` ¶

verbose = verbose

warm_start `instance-attribute` ¶

warm_start = warm_start

weights `instance-attribute` ¶

weights = weights

Functions¶

fit ¶

fit(X: DataLike, y: DataLike, y_fit: DataLike | None = None) -> Self

Fit using transformed feature data. If y_fit is provided, it will be used to fit the transformer.

Source code in src/sknnr/_base.py

def fit(self, X: DataLike, y: DataLike, y_fit: DataLike | None = None) -> Self:
    """Fit using transformed feature data. If y_fit is provided, it will be used
    to fit the transformer."""
    self.y_fit_ = y_fit
    return super().fit(X, y)

kneighbors ¶

kneighbors(X: DataLike | None = None, n_neighbors: int | None = None, return_distance: bool = True, return_dataframe_index: bool = False, use_deterministic_ordering: bool = True) -> NDArray[int64] | tuple[NDArray[float64], NDArray[int64]]

Find the K-neighbors of a point or points of transformed feature data and optionally return dataframe indexes rather than array indices when the model was fitted with a dataframe.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_queries, n_features)`	The query point or points. Points are first transformed using the fitted transformer. If not provided, neighbors of each indexed point are returned. In this case, the query point is not considered its own neighbor.	`None`
`n_neighbors`	`int`	Number of neighbors required for each sample. The default is the value passed to the constructor.	`None`
`return_distance`	`bool`	Whether or not to return the distances.	`True`
`return_dataframe_index`	`bool`	Whether or not to return dataframe indexes instead of array indices. Only applicable if the model was fitted with a dataframe.	`False`
`use_deterministic_ordering`	`bool`	Whether to use deterministic ordering of neighbors when distances are nearly identical. If True, neighbors with nearly identical distances (up to DISTANCE_PRECISION_DECIMALS decimal places) are ordered lexicographically by: (1) their scaled and rounded distances, (2) the absolute difference between a query point's row index and the neighbor index (so that a sample, when present, is returned before other equally distant samples), and (3) the neighbor index itself. If False, use the default ordering from `KNeighborsRegressor.kneighbors`. See the usage guide for more details.	`True`

Returns:

Name	Type	Description
`neigh_dist`	`array-like of shape (n_queries, n_neighbors)`	Array representing the lengths to points, only present if return_distance=True.
`neigh_ind`	`array-like of shape (n_queries, n_neighbors)`	Array indices or dataframe indexes of the nearest points in the population matrix.
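The three-part ordering described for `use_deterministic_ordering` can be sketched as a lexicographic sort. This is a simplified illustration, not the library's implementation: the value given to `DISTANCE_PRECISION_DECIMALS` here is an assumption, plain `round` stands in for the "scaled and rounded" step, and `order_neighbors` is a hypothetical helper.

```python
# Assumed value for illustration only; the library defines its own constant.
DISTANCE_PRECISION_DECIMALS = 10


def order_neighbors(query_index, neighbor_indices, distances,
                    decimals=DISTANCE_PRECISION_DECIMALS):
    """Order neighbors by rounded distance, then index offset, then index."""
    keys = [
        (round(dist, decimals), abs(query_index - idx), idx)
        for idx, dist in zip(neighbor_indices, distances)
    ]
    keys.sort()  # tuples compare element by element (lexicographically)
    return [idx for _, _, idx in keys]


# Query row 5 has three candidates at (nearly) the same distance.
neighbor_indices = [2, 5, 7, 9]
distances = [0.25, 0.25000000000001, 0.25, 0.1]

ordered = order_neighbors(5, neighbor_indices, distances)
print(ordered)
```

Index 9 comes first because it is strictly closer. Among the three near-ties, the query's own row (5) is returned first, followed by the rows whose indexes are closest to it.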

Source code in src/sknnr/_base.py

def kneighbors(
    self,
    X: DataLike | None = None,
    n_neighbors: int | None = None,
    return_distance: bool = True,
    return_dataframe_index: bool = False,
    use_deterministic_ordering: bool = True,
) -> NDArray[np.int64] | tuple[NDArray[np.float64], NDArray[np.int64]]:
    """
    Find the K-neighbors of a point or points of transformed feature data
    and optionally return dataframe indexes rather than array indices when
    the model was fitted with a dataframe.

    Parameters
    ----------
    X : array-like of shape (n_queries, n_features), default=None
        The query point or points. Points are first transformed using the
        fitted transformer. If not provided, neighbors of each indexed
        point are returned. In this case, the query point is not
        considered its own neighbor.
    n_neighbors : int, default=None
        Number of neighbors required for each sample. The default is the
        value passed to the constructor.
    return_distance : bool, default=True
        Whether or not to return the distances.
    return_dataframe_index : bool, default=False
        Whether or not to return dataframe indexes instead of array indices.
        Only applicable if the model was fitted with a dataframe.
    use_deterministic_ordering : bool, default=True
        Whether to use deterministic ordering of neighbors when distances
        are nearly identical.  If True, neighbors with nearly identical
        distances (up to DISTANCE_PRECISION_DECIMALS decimal places) are
        ordered lexicographically by:
        (1) their scaled and rounded distances,
        (2) the absolute difference between a query point's row index
            and the neighbor index (so that a sample, when present, is
            returned before other equally distant samples), and
        (3) the neighbor index itself.
        If False, use the default ordering from
        `KNeighborsRegressor.kneighbors`. See the
        [usage guide](`../../../usage/#deterministic-neighbor-ordering`)
        for more details.

    Returns
    -------
    neigh_dist : array-like of shape (n_queries, n_neighbors)
        Array representing the lengths to points, only present if
        return_distance=True.
    neigh_ind : array-like of shape (n_queries, n_neighbors)
        Array indices or dataframe indexes of the nearest points in the
        population matrix.
    """
    X_transformed = self._transform_X(X)
    return self.regressor_.kneighbors(
        X=X_transformed,
        n_neighbors=n_neighbors,
        return_distance=return_distance,
        return_dataframe_index=return_dataframe_index,
        use_deterministic_ordering=use_deterministic_ordering,
    )

predict ¶

predict(X: DataLike) -> NDArray[float64]

Source code in src/sknnr/_base.py

def predict(self, X: DataLike) -> NDArray[np.float64]:
    X_transformed = self._transform_X(X)
    return self.regressor_.predict(X_transformed)

score ¶

score(X: DataLike, y: DataLike) -> float

Source code in src/sknnr/_base.py

def score(self, X: DataLike, y: DataLike) -> float:
    X_transformed = self._transform_X(X)
    return self.regressor_.score(X_transformed, y)
