RFNNRegressor
sknnr.RFNNRegressor
RFNNRegressor(*, n_estimators: int = 50, criterion_reg: Literal['squared_error', 'absolute_error', 'friedman_mse', 'poisson'] = 'squared_error', criterion_clf: Literal['gini', 'entropy', 'log_loss'] = 'gini', max_depth: int | None = None, min_samples_split: int | float = 2, min_samples_leaf: int | float = 5, min_weight_fraction_leaf: float = 0.0, max_features_reg: Literal['sqrt', 'log2'] | int | float | None = 1.0, max_features_clf: Literal['sqrt', 'log2'] | int | float | None = 'sqrt', max_leaf_nodes: int | None = None, min_impurity_decrease: float = 0.0, bootstrap: bool = True, oob_score: bool | Callable = False, n_jobs: int | None = None, random_state: int | RandomState | None = None, verbose: int = 0, warm_start: bool = False, class_weight_clf: Literal['balanced', 'balanced_subsample'] | dict[str, float] | list[dict[str, float]] | None = None, ccp_alpha: float = 0.0, max_samples: int | float | None = None, monotonic_cst: list[int] | None = None, forest_weights: Literal['uniform'] | ArrayLike = 'uniform', n_neighbors: int = 5, weights: Literal['uniform', 'distance'] | Callable = 'uniform')
Bases: WeightedTreesNNRegressor
Regression using Random Forest Nearest Neighbors (RFNN) imputation.
New data is predicted by similarity of its node indexes to training set node indexes when run through multiple univariate random forests. A random forest is fit to each target in the training set and node indexes are captured for each tree in each forest for each training sample. Node indexes are then captured for inference data and distance is calculated as the dissimilarity between node indexes.
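The distance described above can be sketched with plain NumPy. This is a minimal illustration, not the library's implementation: the function name `node_index_dissimilarity` and the node index values are hypothetical, and the dissimilarity is assumed to be the fraction of trees in which two samples fall into different nodes.

```python
import numpy as np

def node_index_dissimilarity(query_nodes, train_nodes):
    """Fraction of trees in which two samples land in different nodes."""
    query_nodes = np.asarray(query_nodes)
    train_nodes = np.asarray(train_nodes)
    # Compare every query row against every training row, tree by tree.
    mismatches = query_nodes[:, None, :] != train_nodes[None, :, :]
    return mismatches.mean(axis=2)

# Hypothetical node indexes: 3 training samples, 4 trees in total.
train_nodes = np.array([[2, 5, 1, 7],
                        [2, 6, 1, 8],
                        [3, 6, 4, 8]])
query_nodes = np.array([[2, 5, 1, 8]])

# One row per query sample, one column per training sample.
dist = node_index_dissimilarity(query_nodes, train_nodes)
```

In practice the node indexes would come from the fitted forests rather than hand-written arrays; the comparison step is the part this sketch is meant to show.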
Random forests are constructed using either scikit-learn's RandomForestRegressor
or RandomForestClassifier classes based on the data type of each target
(y or y_fit) in the training set. If the target is numeric (e.g. int
or float), a RandomForestRegressor is used. If the target is
categorical (e.g. str or pd.Categorical), a RandomForestClassifier is
used. The sknnr.transformers.RFNodeTransformer class is responsible for
constructing the random forests and capturing the node indexes.
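The choice between forest types can be pictured as a check on each target's data type. The helper below is a hypothetical sketch, assuming that "numeric" means a NumPy numeric dtype; the actual logic inside RFNodeTransformer may differ, for example in how it treats pd.Categorical or boolean targets.

```python
import numpy as np

def forest_kind_for_target(y):
    """Return 'regressor' for numeric targets and 'classifier' otherwise."""
    y = np.asarray(y)
    # Assumption: any NumPy numeric dtype (int or float) counts as numeric.
    return "regressor" if np.issubdtype(y.dtype, np.number) else "classifier"

# Hypothetical targets, one entry per column of y.
kinds = {
    "height": forest_kind_for_target([1.5, 2.0, 3.25]),
    "count": forest_kind_for_target([1, 4, 2]),
    "species": forest_kind_for_target(["fir", "pine", "fir"]),
}
```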
See sklearn.neighbors.KNeighborsRegressor for more detail on
parameters associated with nearest neighbors. See
sklearn.ensemble.RandomForestRegressor and
sklearn.ensemble.RandomForestClassifier for more detail on parameters
associated with random forests. Note that some parameters (e.g. criterion
and max_features) are specified separately for regression and classification
and have _reg and _clf suffixes.
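One way to read the suffix convention is that each suffixed parameter maps to the scikit-learn parameter of the same name without the suffix, while unsuffixed parameters apply to both forest types. The helper `split_forest_params` below is hypothetical and only illustrates that reading; it is not how sknnr is known to route parameters internally.

```python
def split_forest_params(params):
    """Route shared and suffixed parameters to each forest type."""
    reg_kwargs, clf_kwargs = {}, {}
    for name, value in params.items():
        if name.endswith("_reg"):
            # e.g. criterion_reg -> criterion for the regressor only
            reg_kwargs[name[: -len("_reg")]] = value
        elif name.endswith("_clf"):
            # e.g. criterion_clf -> criterion for the classifier only
            clf_kwargs[name[: -len("_clf")]] = value
        else:
            # Unsuffixed parameters are shared by both forest types.
            reg_kwargs[name] = value
            clf_kwargs[name] = value
    return reg_kwargs, clf_kwargs

reg_kwargs, clf_kwargs = split_forest_params({
    "n_estimators": 50,
    "criterion_reg": "squared_error",
    "criterion_clf": "gini",
    "max_features_reg": 1.0,
    "max_features_clf": "sqrt",
    "class_weight_clf": None,
})
```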
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_estimators | int | The number of trees in each random forest. Typically, this parameter is applied to a single random forest. However, in | 50 |
| criterion_reg | ('squared_error', 'absolute_error', 'friedman_mse', 'poisson') | The function to measure the quality of a split for RandomForestRegressor objects. | "squared_error" |
| criterion_clf | ('gini', 'entropy', 'log_loss') | The function to measure the quality of a split for RandomForestClassifier objects. | "gini" |
| max_depth | int | The maximum depth of the tree. | None |
| min_samples_split | int or float | The minimum number of samples required to split an internal node. | 2 |
| min_samples_leaf | int or float | The minimum number of samples required to be at a leaf node. | 5 |
| min_weight_fraction_leaf | float | The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. | 0.0 |
| max_features_reg | "sqrt", "log2", int, float or None | The number of features to consider when looking for the best split for RandomForestRegressor objects. | 1.0 |
| max_features_clf | "sqrt", "log2", int, float or None | The number of features to consider when looking for the best split for RandomForestClassifier objects. | "sqrt" |
| max_leaf_nodes | int | Grow trees with max_leaf_nodes in best-first fashion. | None |
| min_impurity_decrease | float | A node will be split if this split induces a decrease of the impurity greater than or equal to this value. | 0.0 |
| bootstrap | bool | Whether bootstrap samples are used when building trees. | True |
| oob_score | bool or callable | Whether to use out-of-bag samples to estimate the generalization score. | False |
| n_jobs | int | The number of jobs to run in parallel. | None |
| random_state | int, RandomState instance or None | Controls both the randomness of the bootstrapping of the samples used when building trees (if | None |
| verbose | int | Controls the verbosity when fitting and predicting. | 0 |
| warm_start | bool | When set to | False |
| class_weight_clf | "balanced", "balanced_subsample", dict, list of dicts or None | Weights associated with classes in the form {class_label: weight}. If not given, all classes are assumed to have weight one. | None |
| ccp_alpha | non-negative float | Complexity parameter used for Minimal Cost-Complexity Pruning. | 0.0 |
| max_samples | int or float | If bootstrap is | None |
| monotonic_cst | array-like of int of shape (n_features) | Indicates the monotonicity constraint to enforce on each feature. | None |
| forest_weights | "uniform" or array-like | Weights assigned to each target in the training set when calculating Hamming distance between node indexes. This allows for differential weighting of targets when calculating distances. Note that all trees associated with a target will receive the same weight. If "uniform", each tree is assigned equal weight. | "uniform" |
| n_neighbors | int | Number of neighbors to use by default for | 5 |
| weights | ('uniform', 'distance') or callable | Weight function used in prediction. | "uniform" |
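The effect of forest_weights on the Hamming distance can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the function name is hypothetical, the trees of each forest are assumed to occupy contiguous columns, and the weighted mismatches are assumed to be normalized by the total weight.

```python
import numpy as np

def weighted_node_dissimilarity(query_nodes, train_nodes, forest_weights, n_estimators):
    """Hamming-style dissimilarity with one weight per forest (target)."""
    query_nodes = np.asarray(query_nodes)
    train_nodes = np.asarray(train_nodes)
    # Every tree inherits the weight of the forest (target) it belongs to.
    tree_weights = np.repeat(np.asarray(forest_weights, dtype=float), n_estimators)
    mismatches = query_nodes[:, None, :] != train_nodes[None, :, :]
    # Assumption: normalize by total weight so distances stay in [0, 1].
    return (mismatches * tree_weights).sum(axis=2) / tree_weights.sum()

# Two targets with two trees each: columns 0-1 belong to the first forest.
train_nodes = np.array([[2, 5, 1, 7],
                        [3, 6, 1, 7]])
query_nodes = np.array([[2, 5, 4, 8]])

uniform = weighted_node_dissimilarity(query_nodes, train_nodes, [1.0, 1.0], 2)
favor_first = weighted_node_dissimilarity(query_nodes, train_nodes, [3.0, 1.0], 2)
```

With the first target weighted more heavily, the training sample that agrees with the query in the first forest moves closer, while the sample that disagrees in every tree stays at the maximum distance.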
Attributes:
| Name | Type | Description |
|---|---|---|
| hamming_weights_ | array | When |
| independent_prediction_ | array | The independent predictions for each sample in the training set, obtained by calculating |
| independent_score_ | double | The independent score (i.e. coefficient of determination or R²) for the model, obtained by calculating the average R² across all outputs. |
| n_features_in_ | int | Number of features that the transformer outputs. This is equal to the number of features in |
| regressor_ | RawKNNRegressor | The underlying RawKNNRegressor instance. |
| transformer_ | RFNodeTransformer | The fitted transformer which holds the built random forests for each target. |
| y_fit_ | array-like of shape (n_samples, n_targets) | The target data seen during fit which is used to construct the individual random forests. Note that |
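The averaging behind independent_score_ can be illustrated with a short calculation. The sketch below assumes the predictions are already available as an array (how independent_prediction_ is obtained is not shown here), and the helper name `mean_r2` and the sample values are hypothetical.

```python
import numpy as np

def mean_r2(y_true, y_pred):
    """Average the coefficient of determination across output columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Residual and total sums of squares, computed per output column.
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))

# Hypothetical two-output example: the first output is predicted exactly,
# the second has one large error.
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y_pred = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 20.0]])
score = mean_r2(y_true, y_pred)
```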
Notes
n_jobs is used as a parameter in both RFNodeTransformer and
KNeighborsRegressor. The value specified for this parameter will be
passed to both estimators.
References
Crookston, NL, Finley, AO. 2008. yaImpute: an R package for kNN imputation. Journal of Statistical Software, 23, pp.1-16.
Source code in src/sknnr/_rfnn.py
Functions
fit
Fit using transformed feature data. If y_fit is provided, it will be used to fit the transformer.
kneighbors
kneighbors(X: DataLike | None = None, n_neighbors: int | None = None, return_distance: bool = True, return_dataframe_index: bool = False, use_deterministic_ordering: bool = True) -> NDArray[int64] | tuple[NDArray[float64], NDArray[int64]]
Find the K-neighbors of a point or points of transformed feature data and optionally return dataframe indexes rather than array indices when the model was fitted with a dataframe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_queries, n_features) | The query point or points. Points are first transformed using the fitted transformer. If not provided, neighbors of each indexed point are returned. In this case, the query point is not considered its own neighbor. | None |
| n_neighbors | int | Number of neighbors required for each sample. The default is the value passed to the constructor. | None |
| return_distance | bool | Whether or not to return the distances. | True |
| return_dataframe_index | bool | Whether or not to return dataframe indexes instead of array indices. Only applicable if the model was fitted with a dataframe. | False |
| use_deterministic_ordering | bool | Whether to use deterministic ordering of neighbors when distances are nearly identical. If True, neighbors with nearly identical distances (up to DISTANCE_PRECISION_DECIMALS decimal places) are ordered lexicographically by: (1) their scaled and rounded distances, (2) the absolute difference between a query point's row index and the neighbor index (so that a sample, when present, is returned before other equally distant samples), and (3) the neighbor index itself. If False, use the default ordering from | True |
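The three-key ordering described for use_deterministic_ordering can be sketched with np.lexsort. This is an illustration only: the value of `DECIMALS` is a hypothetical stand-in for DISTANCE_PRECISION_DECIMALS, whose actual value is not given here, and the function name is made up for the example.

```python
import numpy as np

DECIMALS = 10  # hypothetical stand-in for DISTANCE_PRECISION_DECIMALS

def deterministic_order(distances, neighbor_idx, query_row):
    """Order one query's neighbors by (rounded distance, row gap, index)."""
    distances = np.asarray(distances, dtype=float)
    neighbor_idx = np.asarray(neighbor_idx)
    # (1) Scale and round distances so near-ties become exact ties.
    rounded = np.round(distances * 10**DECIMALS).astype(np.int64)
    # (2) Gap between the query's row index and each neighbor index.
    row_gap = np.abs(neighbor_idx - query_row)
    # (3) The neighbor index itself breaks any remaining ties.
    # np.lexsort treats the LAST key as the primary sort key.
    order = np.lexsort((neighbor_idx, row_gap, rounded))
    return neighbor_idx[order], distances[order]

idx, dist = deterministic_order(
    distances=[0.25, 0.25 + 1e-14, 0.25, 0.1],
    neighbor_idx=[7, 3, 4, 9],
    query_row=4,
)
```

Here the three neighbors at a distance of roughly 0.25 are treated as tied, so the sample at the query's own row (index 4) is returned ahead of indexes 3 and 7.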
Returns:
| Name | Type | Description |
|---|---|---|
| neigh_dist | array-like of shape (n_queries, n_neighbors) | Array representing the distances to points, only present if return_distance=True. |
| neigh_ind | array-like of shape (n_queries, n_neighbors) | Array indices or dataframe indexes of the nearest points in the population matrix. |
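The difference between array indices and dataframe indexes amounts to a lookup from row position to index label. The sketch below assumes the labels of the training dataframe are available as a sequence; the helper name and the label values are hypothetical.

```python
import numpy as np

def to_dataframe_index(neigh_ind, index_labels):
    """Map positional neighbor indices to the labels of the fitted dataframe."""
    # Fancy indexing keeps the (n_queries, n_neighbors) shape.
    return np.asarray(index_labels)[np.asarray(neigh_ind)]

# Hypothetical index labels of the dataframe used for fitting.
index_labels = ["plot_a", "plot_b", "plot_c", "plot_d"]
# Positional neighbor indices for two query points.
neigh_ind = np.array([[2, 0], [3, 1]])
neigh_labels = to_dataframe_index(neigh_ind, index_labels)
```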