RFNodeTransformer

sknnr.transformers.RFNodeTransformer

RFNodeTransformer(n_estimators: int = 50, criterion_reg: Literal['squared_error', 'absolute_error', 'friedman_mse', 'poisson'] = 'squared_error', criterion_clf: Literal['gini', 'entropy', 'log_loss'] = 'gini', max_depth: int | None = None, min_samples_split: int | float = 2, min_samples_leaf: int | float = 5, min_weight_fraction_leaf: float = 0.0, max_features_reg: Literal['sqrt', 'log2'] | int | float | None = 1.0, max_features_clf: Literal['sqrt', 'log2'] | int | float | None = 'sqrt', max_leaf_nodes: int | None = None, min_impurity_decrease: float = 0.0, bootstrap: bool = True, oob_score: bool | Callable = False, n_jobs: int | None = None, random_state: int | RandomState | None = None, verbose: int = 0, warm_start: bool = False, class_weight_clf: Literal['balanced', 'balanced_subsample'] | dict[str, float] | list[dict[str, float]] | None = None, ccp_alpha: float = 0.0, max_samples: int | float | None = None, monotonic_cst: list[int] | None = None)

Bases: TreeNodeTransformer

Transformer to capture node indexes for samples across multiple random forests.

A random forest is fit to each y target in the training set using either scikit-learn's RandomForestRegressor or RandomForestClassifier. The transformation captures the node indexes for each tree in each forest for each training or new sample.
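As a rough illustration of what "node indexes for each tree in each forest" means, the sketch below uses scikit-learn directly rather than RFNodeTransformer itself. The synthetic data and the variable names (`y_numeric`, `y_label`, `forests`) are assumptions made for the example; it is not the library's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data (illustrative only): 40 samples, 3 features, 2 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y_numeric = X[:, 0] + rng.normal(scale=0.1, size=40)  # numeric target
y_label = np.where(X[:, 1] > 0, "a", "b")  # string target

# One forest per target: a regressor for the numeric target and a
# classifier for the string target.
forests = [
    RandomForestRegressor(n_estimators=5, min_samples_leaf=5, random_state=0).fit(
        X, y_numeric
    ),
    RandomForestClassifier(n_estimators=5, min_samples_leaf=5, random_state=0).fit(
        X, y_label
    ),
]

# `apply` returns, for every sample, the index of the leaf it lands in for
# each tree. Stacking the forests side by side gives one column per tree.
node_ids = np.hstack([forest.apply(X) for forest in forests]).astype("int64")
print(node_ids.shape)  # (40, 10): 2 forests x 5 trees
```

Each row is one sample and each column is one tree, so two samples that share many column values fall into the same leaves across the forests.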

The particular random forest type used for each target is determined by the data type of the target. If the target is numeric (e.g. int or float), a RandomForestRegressor is used. If the target is categorical (e.g. str or pd.Categorical), a RandomForestClassifier is used. Targets are automatically promoted to the minimum numpy dtype that safely represents all elements.
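The dtype-based choice can be pictured with a small NumPy sketch. The helper `forest_type_for` is hypothetical and the exact rule sknnr applies is not shown on this page; the sketch only assumes that integer and float dtypes count as numeric and that everything else is treated as categorical.

```python
import numpy as np


def forest_type_for(target) -> str:
    """Hypothetical helper: choose a forest type from the target's dtype."""
    # np.asarray promotes the elements to a single common dtype,
    # e.g. [1, 2.5] becomes float64 and [1, "a"] becomes a string dtype.
    arr = np.asarray(target)
    # Kinds "i", "u" and "f" are signed int, unsigned int and float.
    if arr.dtype.kind in "iuf":
        return "regression"
    return "classification"


print(forest_type_for([1, 2, 3]))  # regression
print(forest_type_for([1, 2.5]))  # regression
print(forest_type_for(["a", "b"]))  # classification
```

The returned strings mirror the values documented for `estimator_type_dict_` ("regression" or "classification").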

This transformer is intended to be used with RFNNRegressor, which captures the similarity between node indexes of training and inference data and creates predictions using nearest neighbors.

See sklearn.ensemble.RandomForestRegressor and sklearn.ensemble.RandomForestClassifier for more detail on available parameters. All parameters are passed through to these respective random forest estimators for each random forest being built. Note that some parameters (e.g. criterion and max_features) are specified separately for regression and classification and have _reg and _clf suffixes.

Parameters:

Name Type Description Default
n_estimators int

The number of trees in each random forest.

50
criterion_reg ('squared_error', 'absolute_error', 'friedman_mse', 'poisson')

The function to measure the quality of a split for RandomForestRegressor objects.

"squared_error"
criterion_clf ('gini', 'entropy', 'log_loss')

The function to measure the quality of a split for RandomForestClassifier objects.

"gini"
max_depth int

The maximum depth of the tree.

None
min_samples_split int or float

The minimum number of samples required to split an internal node.

2
min_samples_leaf int or float

The minimum number of samples required to be at a leaf node.

5
min_weight_fraction_leaf float

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.

0.0
max_features_reg "sqrt", "log2", int, float or None

The number of features to consider when looking for the best split for RandomForestRegressor objects.

1.0
max_features_clf "sqrt", "log2", int, float or None

The number of features to consider when looking for the best split for RandomForestClassifier objects.

"sqrt"
max_leaf_nodes int

Grow trees with max_leaf_nodes in best-first fashion.

None
min_impurity_decrease float

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

0.0
bootstrap bool

Whether bootstrap samples are used when building trees.

True
oob_score bool or callable

Whether to use out-of-bag samples to estimate the generalization score.

False
n_jobs int

The number of jobs to run in parallel.

None
random_state int, RandomState instance or None

Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features).

None
verbose int

Controls the verbosity when fitting and predicting.

0
warm_start bool

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.

False
class_weight_clf "balanced", "balanced_subsample", dict, list of dict or None

Weights associated with classes in the form {class_label: weight} for RandomForestClassifier objects. If not given, all classes are assumed to have weight one.

None
ccp_alpha non-negative float

Complexity parameter used for Minimal Cost-Complexity Pruning.

0.0
max_samples int or float

If bootstrap is True, the number of samples to draw from X to train each base estimator.

None
monotonic_cst array-like of int of shape (n_features)

Indicates the monotonicity constraint to enforce on each feature.

None

Attributes:

Name Type Description
n_features_in_ int

Number of features seen during fit.

feature_names_in_ ndarray of shape (`n_features_in_`)

Names of features seen during fit. Defined only when X has feature names that are all strings.

estimator_type_dict_ dict[str, str]

Dictionary mapping target names to their random forest type ("regression" or "classification").

estimators_ list [`RandomForestRegressor`|`RandomForestClassifier`]

The random forests associated with each target in y during fit.

n_forests_ int

The number of forests (i.e. targets) in the ensemble. Equal to len(self.estimators_).

n_trees_per_iteration_ list[int]

The number of trees per iteration for each forest. Set to 1 for all random forest estimators.

tree_weights_ list with length `n_forests_` of ndarrays of shape (n_estimators,)

Weights assigned to each tree in each forest to be used when calculating distances between node indexes. Set to 1.0 for all trees.

Source code in src/sknnr/transformers/_rfnode_transformer.py
def __init__(
    self,
    n_estimators: int = 50,
    criterion_reg: Literal[
        "squared_error", "absolute_error", "friedman_mse", "poisson"
    ] = "squared_error",
    criterion_clf: Literal["gini", "entropy", "log_loss"] = "gini",
    max_depth: int | None = None,
    min_samples_split: int | float = 2,
    min_samples_leaf: int | float = 5,
    min_weight_fraction_leaf: float = 0.0,
    max_features_reg: Literal["sqrt", "log2"] | int | float | None = 1.0,
    max_features_clf: Literal["sqrt", "log2"] | int | float | None = "sqrt",
    max_leaf_nodes: int | None = None,
    min_impurity_decrease: float = 0.0,
    bootstrap: bool = True,
    oob_score: bool | Callable = False,
    n_jobs: int | None = None,
    random_state: int | RandomState | None = None,
    verbose: int = 0,
    warm_start: bool = False,
    class_weight_clf: Literal["balanced", "balanced_subsample"]
    | dict[str, float]
    | list[dict[str, float]]
    | None = None,
    ccp_alpha: float = 0.0,
    max_samples: int | float | None = None,
    monotonic_cst: list[int] | None = None,
):
    self.n_estimators = n_estimators
    self.criterion_reg = criterion_reg
    self.criterion_clf = criterion_clf
    self.max_depth = max_depth
    self.min_samples_split = min_samples_split
    self.min_samples_leaf = min_samples_leaf
    self.min_weight_fraction_leaf = min_weight_fraction_leaf
    self.max_features_reg = max_features_reg
    self.max_features_clf = max_features_clf
    self.max_leaf_nodes = max_leaf_nodes
    self.min_impurity_decrease = min_impurity_decrease
    self.bootstrap = bootstrap
    self.oob_score = oob_score
    self.n_jobs = n_jobs
    self.random_state = random_state
    self.verbose = verbose
    self.warm_start = warm_start
    self.class_weight_clf = class_weight_clf
    self.ccp_alpha = ccp_alpha
    self.max_samples = max_samples
    self.monotonic_cst = monotonic_cst

Attributes

bootstrap instance-attribute

bootstrap = bootstrap

ccp_alpha instance-attribute

ccp_alpha = ccp_alpha

class_weight_clf instance-attribute

class_weight_clf = class_weight_clf

criterion_clf instance-attribute

criterion_clf = criterion_clf

criterion_reg instance-attribute

criterion_reg = criterion_reg

max_depth instance-attribute

max_depth = max_depth

max_features_clf instance-attribute

max_features_clf = max_features_clf

max_features_reg instance-attribute

max_features_reg = max_features_reg

max_leaf_nodes instance-attribute

max_leaf_nodes = max_leaf_nodes

max_samples instance-attribute

max_samples = max_samples

min_impurity_decrease instance-attribute

min_impurity_decrease = min_impurity_decrease

min_samples_leaf instance-attribute

min_samples_leaf = min_samples_leaf

min_samples_split instance-attribute

min_samples_split = min_samples_split

min_weight_fraction_leaf instance-attribute

min_weight_fraction_leaf = min_weight_fraction_leaf

monotonic_cst instance-attribute

monotonic_cst = monotonic_cst

n_estimators instance-attribute

n_estimators = n_estimators

n_jobs instance-attribute

n_jobs = n_jobs

oob_score instance-attribute

oob_score = oob_score

random_state instance-attribute

random_state = random_state

verbose instance-attribute

verbose = verbose

warm_start instance-attribute

warm_start = warm_start

Functions

fit

fit(X: DataLike, y: DataLike) -> Self
Source code in src/sknnr/transformers/_rfnode_transformer.py
def fit(self, X: DataLike, y: DataLike) -> Self:
    # Specialize the kwargs sent to initialize the random forests
    rf_common_kwargs = dict(
        n_estimators=self.n_estimators,
        max_depth=self.max_depth,
        min_samples_split=self.min_samples_split,
        min_samples_leaf=self.min_samples_leaf,
        min_weight_fraction_leaf=self.min_weight_fraction_leaf,
        max_leaf_nodes=self.max_leaf_nodes,
        min_impurity_decrease=self.min_impurity_decrease,
        bootstrap=self.bootstrap,
        oob_score=self.oob_score,
        n_jobs=self.n_jobs,
        random_state=self.random_state,
        verbose=self.verbose,
        warm_start=self.warm_start,
        ccp_alpha=self.ccp_alpha,
        max_samples=self.max_samples,
        monotonic_cst=self.monotonic_cst,
    )
    rf_reg_kwargs = {
        **rf_common_kwargs,
        "criterion": self.criterion_reg,
        "max_features": self.max_features_reg,
    }
    rf_clf_kwargs = {
        **rf_common_kwargs,
        "criterion": self.criterion_clf,
        "max_features": self.max_features_clf,
        "class_weight": self.class_weight_clf,
    }
    return self._fit(
        X,
        y,
        RandomForestRegressor,
        RandomForestClassifier,
        rf_reg_kwargs,
        rf_clf_kwargs,
    )

fit_transform

fit_transform(X: DataLike, y: DataLike) -> NDArray[int64]
Source code in src/sknnr/transformers/_tree_node_transformer.py
def fit_transform(self, X: DataLike, y: DataLike) -> NDArray[np.int64]:
    return self.fit(X, y).transform(X)

get_feature_names_out

get_feature_names_out() -> NDArray[object_]
Source code in src/sknnr/transformers/_rfnode_transformer.py
def get_feature_names_out(self) -> NDArray[np.object_]:
    check_is_fitted(self, "estimators_")
    return np.asarray(
        [
            f"rf{i}_tree{j}"
            for i in range(len(self.estimators_))
            for j in range(self.estimators_[i].n_estimators)
        ],
        dtype=object,
    )
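The naming scheme above can be reproduced without a fitted transformer. In this sketch the helper `node_feature_names` and its `trees_per_forest` argument are hypothetical stand-ins for the fitted `estimators_` list; only the `rf{i}_tree{j}` pattern comes from the source.

```python
def node_feature_names(trees_per_forest: list[int]) -> list[str]:
    """Hypothetical helper: build one output name per tree, forest by forest."""
    return [
        f"rf{i}_tree{j}"
        for i, n_trees in enumerate(trees_per_forest)
        for j in range(n_trees)
    ]


# Two forests with 2 and 3 trees respectively.
names = node_feature_names([2, 3])
print(names)
# ['rf0_tree0', 'rf0_tree1', 'rf1_tree0', 'rf1_tree1', 'rf1_tree2']
```

The names are ordered forest by forest, which matches the column order produced by stacking each forest's node indexes horizontally in `transform`.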

transform

transform(X: DataLike) -> NDArray[int64]
Source code in src/sknnr/transformers/_tree_node_transformer.py
def transform(self, X: DataLike) -> NDArray[np.int64]:
    check_is_fitted(self)
    X_arr = validate_data(
        self,
        X=X,
        reset=False,
        ensure_min_features=1,
        ensure_min_samples=1,
    )

    # Get the node IDs for each tree in each forest
    node_ids = []
    for est in self.estimators_:
        est_node_ids = est.apply(X_arr)
        # In the case of some multi-class estimators (e.g.
        # GradientBoostingClassifier), the output of `apply` is 3D (n_samples,
        # n_estimators, n_classes). First swap axes to get (n_samples,
        # n_classes, n_estimators), then flatten the last two dimensions
        # to ensure a 2D output.
        if est_node_ids.ndim == 3:
            est_node_ids = np.swapaxes(est_node_ids, 1, 2).reshape(
                est_node_ids.shape[0], -1
            )
        node_ids.append(est_node_ids)
    return np.hstack(node_ids).astype("int64")
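The 3D branch in `transform` is easier to follow on a tiny array. The sketch below uses a made-up array in place of a real `apply` output; the shapes and values are assumptions chosen so the reordering is visible.

```python
import numpy as np

# Stand-in for a 3D `apply` output: 2 samples, 3 estimators, 2 classes.
n_samples, n_estimators, n_classes = 2, 3, 2
node_ids_3d = np.arange(n_samples * n_estimators * n_classes).reshape(
    n_samples, n_estimators, n_classes
)

# Swap to (n_samples, n_classes, n_estimators), then flatten the last two
# axes so each sample becomes a single row.
flat = np.swapaxes(node_ids_3d, 1, 2).reshape(n_samples, -1)

print(flat.shape)  # (2, 6)
print(flat[0])  # [0 2 4 1 3 5]
```

After the swap, all estimators for the first class come first in each row, followed by all estimators for the second class.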