RFNodeTransformer

sknnr.transformers.RFNodeTransformer ¶

RFNodeTransformer(n_estimators: int = 50, criterion_reg: Literal['squared_error', 'absolute_error', 'friedman_mse', 'poisson'] = 'squared_error', criterion_clf: Literal['gini', 'entropy', 'log_loss'] = 'gini', max_depth: int | None = None, min_samples_split: int | float = 2, min_samples_leaf: int | float = 5, min_weight_fraction_leaf: float = 0.0, max_features_reg: Literal['sqrt', 'log2'] | int | float | None = 1.0, max_features_clf: Literal['sqrt', 'log2'] | int | float | None = 'sqrt', max_leaf_nodes: int | None = None, min_impurity_decrease: float = 0.0, bootstrap: bool = True, oob_score: bool | Callable = False, n_jobs: int | None = None, random_state: int | RandomState | None = None, verbose: int = 0, warm_start: bool = False, class_weight_clf: Literal['balanced', 'balanced_subsample'] | dict[str, float] | list[dict[str, float]] | None = None, ccp_alpha: float = 0.0, max_samples: int | float | None = None, monotonic_cst: list[int] | None = None)

Bases: TreeNodeTransformer

Transformer to capture node indexes for samples across multiple random forests.

A random forest is fit to each y target in the training set using either scikit-learn's RandomForestRegressor or RandomForestClassifier. The transformation captures the node indexes for each tree in each forest for each training or new sample.

The particular random forest type used for each target is determined by the data type of the target. If the target is numeric (e.g. int or float), a RandomForestRegressor is used. If the target is categorical (e.g. str or pd.Categorical), a RandomForestClassifier is used. Targets are automatically promoted to the minimum numpy dtype that safely represents all elements.
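The dtype-based choice of forest and the node index capture described above can be sketched with scikit-learn alone. This is a minimal illustration, not the library's implementation: the data, the target names, and the `np.issubdtype` check are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))

# Two hypothetical targets: one numeric, one categorical.
y_numeric = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=40)
y_labels = np.where(X[:, 1] > 0, "conifer", "hardwood")

forests = []
for target in (y_numeric, y_labels):
    # Assumed rule for this sketch: numeric dtype -> regressor, else classifier.
    is_numeric = np.issubdtype(np.asarray(target).dtype, np.number)
    forest_cls = RandomForestRegressor if is_numeric else RandomForestClassifier
    forest = forest_cls(n_estimators=5, min_samples_leaf=5, random_state=0)
    forests.append(forest.fit(X, target))

# `apply` returns the leaf index reached in every tree, with shape
# (n_samples, n_estimators); stacking gives one column per tree per forest.
node_ids = np.hstack([forest.apply(X) for forest in forests]).astype("int64")
print(node_ids.shape)  # (40, 10)
```

Each row of `node_ids` holds the leaf reached by one sample in every tree, which is the kind of output the transformer is described as producing.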

This transformer is intended to be used in conjunction with RFNNRegressor which captures similarity between node indexes of training and inference data and creates predictions using nearest neighbors.

See sklearn.ensemble.RandomForestRegressor and sklearn.ensemble.RandomForestClassifier for more detail on available parameters. All parameters are passed through to these respective random forest estimators for each random forest being built. Note that some parameters (e.g. criterion and max_features) are specified separately for regression and classification and have _reg and _clf suffixes.
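To make the suffix convention concrete, the sketch below shows how `_reg` and `_clf` parameters could be routed to the two estimator types. The helper `split_rf_kwargs` is hypothetical and exists only for illustration; the library builds these dictionaries explicitly in `fit`.

```python
def split_rf_kwargs(params: dict) -> tuple[dict, dict]:
    """Split suffixed parameters into regressor and classifier kwargs.

    Hypothetical helper: unsuffixed parameters go to both estimators,
    while `_reg` / `_clf` parameters lose their suffix and go to one.
    """
    suffixes = ("_reg", "_clf")
    common = {k: v for k, v in params.items() if not k.endswith(suffixes)}
    reg = {k[: -len("_reg")]: v for k, v in params.items() if k.endswith("_reg")}
    clf = {k[: -len("_clf")]: v for k, v in params.items() if k.endswith("_clf")}
    return {**common, **reg}, {**common, **clf}


params = {
    "n_estimators": 50,
    "criterion_reg": "squared_error",
    "criterion_clf": "gini",
    "max_features_reg": 1.0,
    "max_features_clf": "sqrt",
    "class_weight_clf": None,
}
reg_kwargs, clf_kwargs = split_rf_kwargs(params)
print(reg_kwargs)
print(clf_kwargs)
```

Here `reg_kwargs` carries `criterion` and `max_features` for a RandomForestRegressor, while `clf_kwargs` additionally carries `class_weight` for a RandomForestClassifier.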

Parameters:

Name	Type	Description	Default
`n_estimators`	`int`	The number of trees in each random forest.	`50`
`criterion_reg`	`('squared_error', 'absolute_error', 'friedman_mse', 'poisson')`	The function to measure the quality of a split for RandomForestRegressor objects.	`"squared_error"`
`criterion_clf`	`('gini', 'entropy', 'log_loss')`	The function to measure the quality of a split for RandomForestClassifier objects.	`"gini"`
`max_depth`	`int`	The maximum depth of the tree.	`None`
`min_samples_split`	`int or float`	The minimum number of samples required to split an internal node.	`2`
`min_samples_leaf`	`int or float`	The minimum number of samples required to be at a leaf node.	`5`
`min_weight_fraction_leaf`	`float`	The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.	`0.0`
`max_features_reg`	`{"sqrt", "log2", None}, int or float`	The number of features to consider when looking for the best split for RandomForestRegressor objects.	`1.0`
`max_features_clf`	`{"sqrt", "log2", None}, int or float`	The number of features to consider when looking for the best split for RandomForestClassifier objects.	`"sqrt"`
`max_leaf_nodes`	`int`	Grow trees with max_leaf_nodes in best-first fashion.	`None`
`min_impurity_decrease`	`float`	A node will be split if this split induces a decrease of the impurity greater than or equal to this value.	`0.0`
`bootstrap`	`bool`	Whether bootstrap samples are used when building trees.	`True`
`oob_score`	`bool or callable`	Whether to use out-of-bag samples to estimate the generalization score.	`False`
`n_jobs`	`int`	The number of jobs to run in parallel.	`None`
`random_state`	`int, RandomState instance or None`	Controls both the randomness of the bootstrapping of the samples used when building trees (if `bootstrap=True`) and the sampling of the features to consider when looking for the best split at each node (if `max_features < n_features`).	`None`
`verbose`	`int`	Controls the verbosity when fitting and predicting.	`0`
`warm_start`	`bool`	When set to `True`, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.	`False`
`class_weight_clf`	`{"balanced", "balanced_subsample"}, dict or list of dicts`	Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.	`None`
`ccp_alpha`	`non-negative float`	Complexity parameter used for Minimal Cost-Complexity Pruning.	`0.0`
`max_samples`	`int or float`	If bootstrap is `True`, the number of samples to draw from X to train each base estimator.	`None`
`monotonic_cst`	`array-like of int of shape (n_features)`	Indicates the monotonicity constraint to enforce on each feature.	`None`

Attributes:

Name	Type	Description
`n_features_in_`	`int`	Number of features seen during `fit`.
`feature_names_in_`	ndarray of shape (`n_features_in_`)	Names of features seen during fit. Defined only when `X` has feature names that are all strings.
`estimator_type_dict_`	`dict[str, str]`	Dictionary mapping target names to their random forest type ("regression" or "classification").
`estimators_`	list [`RandomForestRegressor`\|`RandomForestClassifier`]	The random forests associated with each target in `y` during `fit`.
`n_forests_`	`int`	The number of forests (i.e. targets) in the ensemble. Equal to `len(self.estimators_)`.
`n_trees_per_iteration_`	`list[int]`	The number of trees per iteration for each forest. Set to 1 for all random forest estimators.
`tree_weights_`	list with length `n_forests_` of ndarrays of shape (`n_estimators`,)	Weights assigned to each tree in each forest to be used when calculating distances between node indexes. Set to 1.0 for all trees.
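As a rough illustration of how per-tree weights might enter a distance between node indexes, the sketch below computes a weighted proportion of mismatching nodes. The function `weighted_node_mismatch` and the choice of a Hamming-style measure are assumptions for this example; the page does not state which distance the downstream estimator uses.

```python
import numpy as np


def weighted_node_mismatch(a, b, weights) -> float:
    """Weighted share of trees in which two samples land in different nodes.

    Hypothetical helper: a Hamming-style distance, assumed for illustration.
    """
    a, b = np.asarray(a), np.asarray(b)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * (a != b)) / np.sum(weights))


# Node indexes of one training sample and one new sample across four trees.
train_nodes = [4, 7, 2, 9]
new_nodes = [4, 6, 2, 9]

# With all weights set to 1.0, every tree counts equally.
distance = weighted_node_mismatch(train_nodes, new_nodes, np.ones(4))
print(distance)  # 0.25
```

With uniform weights the measure reduces to the plain fraction of trees in which the two samples disagree.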

Source code in src/sknnr/transformers/_rfnode_transformer.py

def __init__(
    self,
    n_estimators: int = 50,
    criterion_reg: Literal[
        "squared_error", "absolute_error", "friedman_mse", "poisson"
    ] = "squared_error",
    criterion_clf: Literal["gini", "entropy", "log_loss"] = "gini",
    max_depth: int | None = None,
    min_samples_split: int | float = 2,
    min_samples_leaf: int | float = 5,
    min_weight_fraction_leaf: float = 0.0,
    max_features_reg: Literal["sqrt", "log2"] | int | float | None = 1.0,
    max_features_clf: Literal["sqrt", "log2"] | int | float | None = "sqrt",
    max_leaf_nodes: int | None = None,
    min_impurity_decrease: float = 0.0,
    bootstrap: bool = True,
    oob_score: bool | Callable = False,
    n_jobs: int | None = None,
    random_state: int | RandomState | None = None,
    verbose: int = 0,
    warm_start: bool = False,
    class_weight_clf: Literal["balanced", "balanced_subsample"]
    | dict[str, float]
    | list[dict[str, float]]
    | None = None,
    ccp_alpha: float = 0.0,
    max_samples: int | float | None = None,
    monotonic_cst: list[int] | None = None,
):
    self.n_estimators = n_estimators
    self.criterion_reg = criterion_reg
    self.criterion_clf = criterion_clf
    self.max_depth = max_depth
    self.min_samples_split = min_samples_split
    self.min_samples_leaf = min_samples_leaf
    self.min_weight_fraction_leaf = min_weight_fraction_leaf
    self.max_features_reg = max_features_reg
    self.max_features_clf = max_features_clf
    self.max_leaf_nodes = max_leaf_nodes
    self.min_impurity_decrease = min_impurity_decrease
    self.bootstrap = bootstrap
    self.oob_score = oob_score
    self.n_jobs = n_jobs
    self.random_state = random_state
    self.verbose = verbose
    self.warm_start = warm_start
    self.class_weight_clf = class_weight_clf
    self.ccp_alpha = ccp_alpha
    self.max_samples = max_samples
    self.monotonic_cst = monotonic_cst

Attributes¶

bootstrap `instance-attribute` ¶

bootstrap = bootstrap

ccp_alpha `instance-attribute` ¶

ccp_alpha = ccp_alpha

class_weight_clf `instance-attribute` ¶

class_weight_clf = class_weight_clf

criterion_clf `instance-attribute` ¶

criterion_clf = criterion_clf

criterion_reg `instance-attribute` ¶

criterion_reg = criterion_reg

max_depth `instance-attribute` ¶

max_depth = max_depth

max_features_clf `instance-attribute` ¶

max_features_clf = max_features_clf

max_features_reg `instance-attribute` ¶

max_features_reg = max_features_reg

max_leaf_nodes `instance-attribute` ¶

max_leaf_nodes = max_leaf_nodes

max_samples `instance-attribute` ¶

max_samples = max_samples

min_impurity_decrease `instance-attribute` ¶

min_impurity_decrease = min_impurity_decrease

min_samples_leaf `instance-attribute` ¶

min_samples_leaf = min_samples_leaf

min_samples_split `instance-attribute` ¶

min_samples_split = min_samples_split

min_weight_fraction_leaf `instance-attribute` ¶

min_weight_fraction_leaf = min_weight_fraction_leaf

monotonic_cst `instance-attribute` ¶

monotonic_cst = monotonic_cst

n_estimators `instance-attribute` ¶

n_estimators = n_estimators

n_jobs `instance-attribute` ¶

n_jobs = n_jobs

oob_score `instance-attribute` ¶

oob_score = oob_score

random_state `instance-attribute` ¶

random_state = random_state

verbose `instance-attribute` ¶

verbose = verbose

warm_start `instance-attribute` ¶

warm_start = warm_start

Functions¶

fit ¶

fit(X: DataLike, y: DataLike) -> Self

Source code in src/sknnr/transformers/_rfnode_transformer.py

def fit(self, X: DataLike, y: DataLike) -> Self:
    # Specialize the kwargs sent to initialize the random forests
    rf_common_kwargs = dict(
        n_estimators=self.n_estimators,
        max_depth=self.max_depth,
        min_samples_split=self.min_samples_split,
        min_samples_leaf=self.min_samples_leaf,
        min_weight_fraction_leaf=self.min_weight_fraction_leaf,
        max_leaf_nodes=self.max_leaf_nodes,
        min_impurity_decrease=self.min_impurity_decrease,
        bootstrap=self.bootstrap,
        oob_score=self.oob_score,
        n_jobs=self.n_jobs,
        random_state=self.random_state,
        verbose=self.verbose,
        warm_start=self.warm_start,
        ccp_alpha=self.ccp_alpha,
        max_samples=self.max_samples,
        monotonic_cst=self.monotonic_cst,
    )
    rf_reg_kwargs = {
        **rf_common_kwargs,
        "criterion": self.criterion_reg,
        "max_features": self.max_features_reg,
    }
    rf_clf_kwargs = {
        **rf_common_kwargs,
        "criterion": self.criterion_clf,
        "max_features": self.max_features_clf,
        "class_weight": self.class_weight_clf,
    }
    return self._fit(
        X,
        y,
        RandomForestRegressor,
        RandomForestClassifier,
        rf_reg_kwargs,
        rf_clf_kwargs,
    )

fit_transform ¶

fit_transform(X: DataLike, y: DataLike) -> NDArray[int64]

Source code in src/sknnr/transformers/_tree_node_transformer.py

def fit_transform(self, X: DataLike, y: DataLike) -> NDArray[np.int64]:
    return self.fit(X, y).transform(X)

get_feature_names_out ¶

get_feature_names_out() -> NDArray[object_]

Source code in src/sknnr/transformers/_rfnode_transformer.py

def get_feature_names_out(self) -> NDArray[np.object_]:
    check_is_fitted(self, "estimators_")
    return np.asarray(
        [
            f"rf{i}_tree{j}"
            for i in range(len(self.estimators_))
            for j in range(self.estimators_[i].n_estimators)
        ],
        dtype=object,
    )
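The naming scheme used above can be reproduced without a fitted transformer. In this sketch, `node_feature_names` is a hypothetical stand-in that takes the number of trees in each forest instead of reading `estimators_`.

```python
def node_feature_names(trees_per_forest: list[int]) -> list[str]:
    """Build one name per tree, ordered by forest and then by tree."""
    return [
        f"rf{i}_tree{j}"
        for i, n_trees in enumerate(trees_per_forest)
        for j in range(n_trees)
    ]


# Two forests (i.e. two targets) with three trees each.
names = node_feature_names([3, 3])
print(names)
# ['rf0_tree0', 'rf0_tree1', 'rf0_tree2', 'rf1_tree0', 'rf1_tree1', 'rf1_tree2']
```

The names follow the same order as the columns returned by `transform`, so column `k` of the transformed array corresponds to `names[k]`.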

transform ¶

transform(X: DataLike) -> NDArray[int64]

Source code in src/sknnr/transformers/_tree_node_transformer.py

def transform(self, X: DataLike) -> NDArray[np.int64]:
    check_is_fitted(self)
    X_arr = validate_data(
        self,
        X=X,
        reset=False,
        ensure_min_features=1,
        ensure_min_samples=1,
    )

    # Get the node IDs for each tree in each forest
    node_ids = []
    for est in self.estimators_:
        est_node_ids = est.apply(X_arr)
        # In the case of some multi-class estimators (e.g.
        # GradientBoostingClassifier), the output of `apply` is 3D (n_samples,
        # n_estimators, n_classes). First swap axes to get (n_samples,
        # n_classes, n_estimators), then flatten the last two dimensions
        # to ensure a 2D output.
        if est_node_ids.ndim == 3:
            est_node_ids = np.swapaxes(est_node_ids, 1, 2).reshape(
                est_node_ids.shape[0], -1
            )
        node_ids.append(est_node_ids)
    return np.hstack(node_ids).astype("int64")
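The 3D branch in `transform` is easiest to follow on a small array. The sketch below uses made-up values in place of a real `apply` output to show how swapping axes and reshaping yields a 2D result. Random forests return 2D output from `apply`, so this branch matters only for estimators that return three dimensions.

```python
import numpy as np

n_samples, n_estimators, n_classes = 2, 3, 2

# Stand-in for a 3D `apply` output of shape (n_samples, n_estimators, n_classes).
apply_out = np.arange(n_samples * n_estimators * n_classes).reshape(
    n_samples, n_estimators, n_classes
)

# Swap to (n_samples, n_classes, n_estimators), then flatten the last two axes.
flat = np.swapaxes(apply_out, 1, 2).reshape(apply_out.shape[0], -1)

print(flat.shape)  # (2, 6)
print(flat)
# [[ 0  2  4  1  3  5]
#  [ 6  8 10  7  9 11]]
```

After flattening, each row lists all estimators for the first class, followed by all estimators for the second class.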
