Usage

Estimators¶

sknnr provides seven estimators that are fully compatible, drop-in replacements for scikit-learn estimators:

These estimators can be used like any other sklearn regressor (or classifier)¹.

from sknnr import EuclideanKNNRegressor
from sknnr.datasets import load_swo_ecoplot
from sklearn.model_selection import train_test_split

X, y = load_swo_ecoplot(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
est = EuclideanKNNRegressor(n_neighbors=3).fit(X_train, y_train)

print(est.score(X_test, y_test))
# 0.11496218649569434

In addition to their core functionality of fitting, predicting, and scoring, sknnr estimators offer a number of other features, detailed below.

Regression and Classification¶

The estimators in sknnr are all initialized with an optional parameter n_neighbors that determines how many plots a target plot's attributes will be predicted from. When n_neighbors > 1, a plot's attributes are calculated as optionally-weighted averages of each of its k nearest neighbors. Predicted values can fall anywhere between the observed plot values, making this "regression mode" suitable for continuous attributes (e.g. basal area). To maintain categorical attributes (e.g. dominant species type), the estimators can be run in "classification mode" with n_neighbors = 1, where each attribute is imputed directly from its nearest neighbor. To predict a combination of continuous and categorical attributes, it's possible to use two estimators and concatenate their predictions manually.

Independent Scores and Predictions¶

When an independent test set is not available, the accuracy of a kNN regressor can be estimated by comparing each sample in the training set to its second-nearest neighbor, i.e. the closest point excluding itself. All sknnr estimators set independent_prediction_ and independent_score_ attributes when they are fit, which store the predictions and scores of this independent evaluation.

print(est.independent_score_)
# 0.10243925752772305

Deterministic Neighbor Ordering¶

scikit-learn's KNeighborsRegressor warns that:

in case of multiple neighbors being at the same distance, the result will depend on the order of the samples in the training data.

In sknnr, we allow the user to enforce strict ordering of neighbors with deterministic tie-breaking when calling kneighbors by using the use_deterministic_ordering parameter. When this value is True, neighbors are sorted using the following logical order:

Scaled and rounded distances: Neighbors are first sorted by their scaled and rounded distances. Scaling is performed per query point such that each neighbor's distance is first normalized by dividing by the maximum distance for that query point (or 1.0 if the maximum distance is less than 1.0), then rounded to RawKNNRegressor.DISTANCE_PRECISION_DECIMALS decimal places (currently set to 10). Some floating point operations in distance determination (notably numpy.dot) can introduce very small numerical differences across platforms, which is effectively handled by this rounding.
Difference between query point row index and neighbors indexes: If two or more neighbors have identical rounded distances, they are further sorted by the absolute difference between their row index in the training data and the row index of the query point. This ensures that when a sample is its own nearest neighbor, it will always be selected first.
Neighbor index: If two or more neighbors are still tied based on the two above criteria, they are finally sorted by their row index in the training data.

As an example, consider the following training data with three samples:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [1, 2, 3]
])
y = np.array([10, 20, 30])
est = KNeighborsRegressor(n_neighbors=2).fit(X, y)

print(est.kneighbors(X, return_distance=False))
# [[0 2]
#  [1 0]
#  [0 2]] - Not returning itself as first neighbor

Using sknnr's RawKNNRegressor with deterministic ordering:

from sknnr import RawKNNRegressor
est = RawKNNRegressor(n_neighbors=2).fit(X, y)
print(est.kneighbors(X, return_distance=False, use_deterministic_ordering=True))
# [[0 2]
#  [1 0]
#  [2 0]] - Returning itself as first neighbor

The use_deterministic_ordering parameter defaults to True, but can revert to scikit-learn's default behavior when calling kneighbors:

distances, neighbors = est.kneighbors(
    X_test,
    use_deterministic_ordering=False
)

Warning

There may be potential to lose meaningful precision when rounding distances, especially with datasets that include samples with very large distances. In these situations, we suggest either increasing RawKNNRegressor.DISTANCE_PRECISION_DECIMALS or disabling use_deterministic_ordering at the expense of cross-platform reproducibility.

Retrieving Dataframe Indexes¶

In sklearn, the KNeighborsRegressor.kneighbors method can identify the array index of the nearest neighbor to a given sample. Estimators in sknnr offer an additional parameter return_dataframe_index that allows neighbor samples to be identified directly by their index.

X, y = load_swo_ecoplot(return_X_y=True, as_frame=True)
est = est.fit(X, y)

# Find the distance and dataframe index of the nearest neighbors to the first plot
distances, neighbor_ids = est.kneighbors(X.iloc[:1], return_dataframe_index=True)

# Preview the nearest neighbors by their dataframe index
print(y.loc[neighbor_ids[0]])

	ABAM_COV	ABGRC_COV	ABPRSH_COV	ACMA3_COV	ALRH2_COV
52481	0	0	39.3469	0	0
60089	0	0	22.1199	0	0
56253	0	0	22.8948	0	0

Warning

An estimator must be fit with a DataFrame in order to use return_dataframe_index=True.

Tip

In forestry applications, users typically store a unique inventory plot identification number as the index in the dataframe.

Y-Fit Data¶

The GNNRegressor, MSNRegressor, RFNNRegressor, and GBNNRegressor estimators can be fit with X and y data, but they also accept an optional y_fit parameter. If provided, y_fit is used to fit the transformer while y is used to fit the kNN regressor.

In forest attribute estimation, the underlying transformations for two of these estimators (CCA for GNN and CCorA for MSN) typically use a matrix of species abundances or presence/absence information to relate the species data to environmental covariates, but often the user wants predictions based not on these features, but rather attributes that describe forest structure (e.g. biomass) or composition (e.g. species richness). In this case, the species matrix would be specified as y_fit and the stand attributes would be specified as y.

For RFNN and GBNN, the y_fit parameter can be used to specify the attributes for which individual forests will be created (one forest per feature). As with GNN and MSN, the y parameter can then be used to specify the attributes that will be predicted by the nearest neighbors.

from sknnr import GNNRegressor

est = GNNRegressor().fit(X, y, y_fit=y_fit)

Dimensionality Reduction¶

The ordination transformers used by the GNNRegressor and MSNRegressor estimators apply dimensionality reduction by creating components that are linear combinations of the features in the X data. For both transformers, components that explain more variation present in the y (or y_fit) matrix are ordered first. Users can further reduce the number of components that are used to determine nearest neighbors by specifying n_components when instantiating the estimator.

est = GNNRegressor(n_components=3).fit(X, y)

Warning

The maximum number of components depends on the input data and the estimator. Specifying n_components greater than the maximum number of components will raise an error.

RFNN and GBNN Distance Metric¶

For all estimators other than RFNNRegressor and GBNNRegressor, the distance metric used to determine nearest neighbors is the Euclidean distance between samples in the transformed space. RFNN and GBNN, on the other hand, first build a forest for each feature in the y (or y_fit) matrix and then capture the node IDs (not values) for each sample on every forest and tree. The distance between samples is calculated using Hamming Distance, which captures the number of node IDs that are different between the target and reference samples and then divided by the total number of nodes. Therefore, a target and reference sample that share all node IDs would have a distance of 0, whereas a target and reference sample that share no node IDs would have a distance of 1.

Additionally, GBNN allows users to specify the tree_weighting_method parameter, which applies weights to the Hamming distance calculation based on the importance of the tree stage in training. When tree_weighting_method is set to "train_improvement", tree stages that contribute more to reducing training loss are weighted more heavily. When tree_weighting_method is set to "uniform", all trees are weighted equally.

Custom Transformers¶

Most estimators in sknnr work by applying specialized transformers like CCA and CCorA to the input data. These transformers can be used independently of the estimators, like any other sklearn transformer.

from sknnr.transformers import CCATransformer

cca = CCATransformer(n_components=3)
print(cca.fit_transform(X, y))

sknnr currently provides the following transformers:

Datasets¶

sknnr estimators can be used for any multi-output regression problem, but they excel at predicting forest attributes. The sknnr.datasets module contains a number of test datasets with plot-based forest measurements and environmental attributes.

from sknnr.datasets import load_swo_ecoplot, load_moscow_stjoes

Dataset Format¶

Like in sklearn, datasets in sknnr can be loaded in a variety of formats, including as a dict-like Dataset object:

dataset = load_swo_ecoplot()
print(dataset)
# Dataset(n=3005, features=18, targets=25)

...as an X, y tuple of Numpy arrays:

X, y = load_swo_ecoplot(return_X_y=True)
print(X.shape, y.shape)
# (3005, 18) (3005, 25)

...or as tuple of Pandas dataframes:

X_df, y_df = load_swo_ecoplot(return_X_y=True, as_frame=True)
print(X_df.head())

	ANNPRE	ANNTMP	AUGMAXT	CONTPRE	CVPRE	DECMINT	DIFTMP	SMRTMP	SMRTP	ASPTR	DEM	PRR	SLPPCT	TPI450	TC1	TC2	TC3	NBR
52481	740	514.667	2315	517.667	8971.67	-583.111	2899.11	1136.11	212.222	197.667	1870.11	13196.7	48.3333	33.7778	218.778	68.5556	-86.2222	343.556
52482	742	563.556	2354.33	502	9124.33	-543.556	2898.89	1179.44	221.111	190.222	1713.11	16355.8	5.4444	6.4444	210.222	60.3333	-96.6667	261.667
52484	738.556	639.111	2468.89	545.889	8897.22	-479.111	2949	1266.22	236	194.556	1612.11	15132.6	15.5556	-1.2222	157	110.222	-17.4444	721
52485	730.333	622.667	2405.33	555	8829.78	-481.222	2887.56	1244.22	234	196.444	1682.33	15146.7	19.8889	-16.8889	152.556	86.1111	-31.6667	597.111
52494	720	778.556	2678.11	658.556	8638	-386.667	3065.78	1396	262	191.778	1345.67	16672.1	2	0.4444	214.667	58.5556	-88.1111	294.222

Note

pandas must be installed to use as_frame=True.

Check out the sklearn docs for a refresher on estimator basics. ↩