Usage
Estimators¶
sknnr provides seven estimators that are fully compatible, drop-in replacements for scikit-learn estimators:
- RawKNNRegressor
- EuclideanKNNRegressor
- MahalanobisKNNRegressor
- GNNRegressor
- MSNRegressor
- RFNNRegressor
- GBNNRegressor
These estimators can be used like any other sklearn regressor (or classifier) [1].
from sknnr import EuclideanKNNRegressor
from sknnr.datasets import load_swo_ecoplot
from sklearn.model_selection import train_test_split
X, y = load_swo_ecoplot(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
est = EuclideanKNNRegressor(n_neighbors=3).fit(X_train, y_train)
print(est.score(X_test, y_test))
# 0.11496218649569434
In addition to their core functionality of fitting, predicting, and scoring, sknnr estimators offer a number of other features, detailed below.
Regression and Classification¶
All sknnr estimators accept an optional n_neighbors parameter that determines how many neighboring plots are used to predict a target plot's attributes. When n_neighbors > 1, each attribute is calculated as an optionally weighted average across the plot's k nearest neighbors. Predicted values can fall anywhere between the observed plot values, making this "regression mode" suitable for continuous attributes (e.g. basal area). To preserve categorical attributes (e.g. dominant species type), the estimators can be run in "classification mode" with n_neighbors = 1, where each attribute is imputed directly from the nearest neighbor. To predict a combination of continuous and categorical attributes, use two estimators and concatenate their predictions manually.
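The difference between the two modes can be illustrated with plain NumPy. This is a minimal sketch rather than sknnr's implementation: the attribute values, species codes, and distances are made up, and inverse-distance weighting is assumed here as one possible weighting scheme.

```python
import numpy as np

# Hypothetical attribute values for the three nearest neighbors of one target plot.
neighbor_basal_area = np.array([20.0, 30.0, 40.0])      # continuous attribute
neighbor_species = np.array(["PSME", "TSHE", "PSME"])   # categorical attribute
neighbor_distances = np.array([1.0, 2.0, 4.0])

# "Regression mode" (n_neighbors > 1): average the neighbors' values.
mean_ba = neighbor_basal_area.mean()

# Optional weighting; inverse distance is an assumption made for this sketch.
weights = 1.0 / neighbor_distances
weighted_ba = np.average(neighbor_basal_area, weights=weights)

# "Classification mode" (n_neighbors = 1): copy the single nearest neighbor's value.
nearest = np.argmin(neighbor_distances)
imputed_species = neighbor_species[nearest]

print(mean_ba, weighted_ba, imputed_species)
```

The averaged basal area lies between the observed values, while the imputed species is always one of the observed categories, which is why the two modes suit different attribute types.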
Independent Scores and Predictions¶
When an independent test set is not available, the accuracy of a kNN regressor can be estimated by comparing each sample in the training set to its second-nearest neighbor, i.e. the closest point excluding itself. All sknnr estimators set independent_prediction_ and independent_score_ attributes when they are fit, which store the predictions and scores of this independent evaluation.
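The idea of an independent prediction can be sketched with NumPy alone. The example below is an illustration under simplifying assumptions (a single neighbor, Euclidean distance, toy data), not the code sknnr runs internally: each training sample is predicted from its closest neighbor other than itself.

```python
import numpy as np

# Toy training data: four samples with one feature and one attribute each.
X = np.array([[0.0], [1.0], [3.0], [7.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])

# Pairwise Euclidean distances between all training samples.
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Exclude each sample from its own neighbor search.
np.fill_diagonal(dist, np.inf)

# Index of the closest *other* sample, and the value it would contribute.
nearest_other = dist.argmin(axis=1)
independent_prediction = y[nearest_other]

print(nearest_other)
print(independent_prediction)
```

Comparing independent_prediction against y then gives an accuracy estimate that does not require a held-out test set.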
Deterministic Neighbor Ordering¶
scikit-learn's KNeighborsRegressor warns that:
in case of multiple neighbors being at the same distance, the result will depend on the order of the samples in the training data.
In sknnr, we allow the user to enforce strict ordering of neighbors with deterministic tie-breaking when calling kneighbors by using the use_deterministic_ordering parameter. When this value is True, neighbors are sorted using the following logical order:
- Scaled and rounded distances: Neighbors are first sorted by their scaled and rounded distances. Scaling is performed per query point such that each neighbor's distance is first normalized by dividing by the maximum distance for that query point (or 1.0 if the maximum distance is less than 1.0), then rounded to RawKNNRegressor.DISTANCE_PRECISION_DECIMALS decimal places (currently set to 10). Some floating point operations in distance determination (notably numpy.dot) can introduce very small numerical differences across platforms, which is effectively handled by this rounding.
- Difference between query point row index and neighbor indexes: If two or more neighbors have identical rounded distances, they are further sorted by the absolute difference between their row index in the training data and the row index of the query point. This ensures that when a sample is its own nearest neighbor, it will always be selected first.
- Neighbor index: If two or more neighbors are still tied based on the two above criteria, they are finally sorted by their row index in the training data.
As an example, consider the following training data with three samples:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
X = np.array([
[1, 2, 3],
[4, 5, 6],
[1, 2, 3]
])
y = np.array([10, 20, 30])
est = KNeighborsRegressor(n_neighbors=2).fit(X, y)
print(est.kneighbors(X, return_distance=False))
# [[0 2]
# [1 0]
# [0 2]] - Not returning itself as first neighbor
Using sknnr's RawKNNRegressor with deterministic ordering:
from sknnr import RawKNNRegressor
est = RawKNNRegressor(n_neighbors=2).fit(X, y)
print(est.kneighbors(X, return_distance=False, use_deterministic_ordering=True))
# [[0 2]
# [1 0]
# [2 0]] - Returning itself as first neighbor
The use_deterministic_ordering parameter defaults to True, but scikit-learn's default behavior can be restored by passing use_deterministic_ordering=False when calling kneighbors.
Warning
There may be potential to lose meaningful precision when rounding distances, especially with datasets that include samples with very large distances. In these situations, we suggest either increasing RawKNNRegressor.DISTANCE_PRECISION_DECIMALS or disabling use_deterministic_ordering at the expense of cross-platform reproducibility.
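The three-level ordering described in this section can be sketched for a single query point with np.lexsort. The function name deterministic_order and its arguments are hypothetical, chosen for illustration; this is a sketch of the documented sort keys, not sknnr's actual code.

```python
import numpy as np

DECIMALS = 10  # mirrors the documented DISTANCE_PRECISION_DECIMALS value


def deterministic_order(distances, neighbor_idx, query_idx, decimals=DECIMALS):
    """Order one query point's candidate neighbors by the three documented keys."""
    distances = np.asarray(distances, dtype=float)
    neighbor_idx = np.asarray(neighbor_idx)

    # Key 1: distances divided by the per-query maximum (or 1.0 if smaller), then rounded.
    scale = max(distances.max(), 1.0)
    rounded = np.round(distances / scale, decimals)

    # Key 2: absolute difference between neighbor and query row indexes.
    index_gap = np.abs(neighbor_idx - query_idx)

    # Key 3: the neighbor's own row index.
    # np.lexsort treats the LAST key as the primary one, so keys go in reverse priority.
    order = np.lexsort((neighbor_idx, index_gap, rounded))
    return neighbor_idx[order]


# Rows 0 and 2 are identical, so both are at distance 0 from query row 2.
result = deterministic_order([0.0, 5.196, 0.0], [0, 1, 2], query_idx=2)
print(result)
```

With query row 2, the tie between rows 0 and 2 is broken by the index difference, so the sample itself comes first, matching the behavior shown for RawKNNRegressor above.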
Retrieving Dataframe Indexes¶
In sklearn, the KNeighborsRegressor.kneighbors method identifies the nearest neighbors of a given sample by their array index. Estimators in sknnr offer an additional parameter, return_dataframe_index, that allows neighbor samples to be identified directly by their dataframe index.
X, y = load_swo_ecoplot(return_X_y=True, as_frame=True)
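# Refit an existing sknnr estimator, e.g. the EuclideanKNNRegressor(n_neighbors=3) from the first example
est = est.fit(X, y)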
# Find the distance and dataframe index of the nearest neighbors to the first plot
distances, neighbor_ids = est.kneighbors(X.iloc[:1], return_dataframe_index=True)
# Preview the nearest neighbors by their dataframe index
print(y.loc[neighbor_ids[0]])
| | ABAM_COV | ABGRC_COV | ABPRSH_COV | ACMA3_COV | ALRH2_COV |
|---|---|---|---|---|---|
| 52481 | 0 | 0 | 39.3469 | 0 | 0 |
| 60089 | 0 | 0 | 22.1199 | 0 | 0 |
| 56253 | 0 | 0 | 22.8948 | 0 | 0 |
Warning
An estimator must be fit with a DataFrame in order to use return_dataframe_index=True.
Tip
In forestry applications, users typically store a unique inventory plot identification number as the index in the dataframe.
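Conceptually, returning dataframe indexes amounts to translating positional neighbor indexes into index labels. The sketch below shows that translation with NumPy only; the neighbor positions are invented for illustration, and it does not claim to reproduce how sknnr performs the lookup.

```python
import numpy as np

# Hypothetical plot IDs, standing in for the index of the dataframe used in fit
# (with pandas this would be something like X.index.to_numpy()).
plot_ids = np.array([52481, 52482, 52484, 52485, 52494])

# Positional neighbor indexes, one row per query sample.
neighbor_positions = np.array([[0, 3, 1],
                               [4, 2, 0]])

# Fancy indexing maps every position to its plot ID while keeping the array shape.
neighbor_ids = plot_ids[neighbor_positions]
print(neighbor_ids)
```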
Y-Fit Data¶
The GNNRegressor, MSNRegressor, RFNNRegressor, and GBNNRegressor estimators can be fit with X and y data, but they also accept an optional y_fit parameter. If provided, y_fit is used to fit the transformer while y is used to fit the kNN regressor.
In forest attribute estimation, the underlying transformations for two of these estimators (CCA for GNN and CCorA for MSN) typically use a matrix of species abundances or presence/absence information to relate the species data to environmental covariates. Often, however, the user wants to predict not these species features but attributes that describe forest structure (e.g. biomass) or composition (e.g. species richness). In this case, the species matrix would be specified as y_fit and the stand attributes would be specified as y.
For RFNN and GBNN, the y_fit parameter can be used to specify the attributes for which individual forests will be created (one forest per feature). As with GNN and MSN, the y parameter can then be used to specify the attributes that will be predicted by the nearest neighbors.
Dimensionality Reduction¶
The ordination transformers used by the GNNRegressor and MSNRegressor estimators apply dimensionality reduction by creating components that are linear combinations of the features in the X data. For both transformers, components that explain more variation present in the y (or y_fit) matrix are ordered first. Users can further reduce the number of components that are used to determine nearest neighbors by specifying n_components when instantiating the estimator.
Warning
The maximum number of components depends on the input data and the estimator. Specifying n_components greater than the maximum number of components will raise an error.
RFNN and GBNN Distance Metric¶
For all estimators other than RFNNRegressor and GBNNRegressor, the distance metric used to determine nearest neighbors is the Euclidean distance between samples in the transformed space. RFNN and GBNN, on the other hand, first build a forest for each feature in the y (or y_fit) matrix and then capture the node IDs (not values) for each sample on every forest and tree. The distance between samples is calculated using the Hamming distance: the number of node IDs that differ between the target and reference samples, divided by the total number of node IDs per sample. Therefore, a target and reference sample that share all node IDs would have a distance of 0, whereas a target and reference sample that share no node IDs would have a distance of 1.
Additionally, GBNN allows users to specify the tree_weighting_method parameter, which applies weights to the Hamming distance calculation based on the importance of the tree stage in training. When tree_weighting_method is set to "train_improvement", tree stages that contribute more to reducing training loss are weighted more heavily. When tree_weighting_method is set to "uniform", all trees are weighted equally.
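The Hamming distance on node IDs can be sketched in a few lines of NumPy. The helper node_hamming is hypothetical, and the weighted form (weighted count of mismatches divided by the sum of weights) is an assumption about how per-tree weights might enter the calculation, not a description of GBNN's internals.

```python
import numpy as np


def node_hamming(target_nodes, reference_nodes, tree_weights=None):
    """Fraction of trees in which two samples land in different nodes."""
    target_nodes = np.asarray(target_nodes)
    reference_nodes = np.asarray(reference_nodes)
    differs = target_nodes != reference_nodes

    # Uniform weighting: plain proportion of mismatched node IDs.
    if tree_weights is None:
        return differs.mean()

    # Assumed weighted form: mismatches count in proportion to each tree's weight.
    w = np.asarray(tree_weights, dtype=float)
    return (w * differs).sum() / w.sum()


# One node ID per tree for each sample; the samples differ in trees 2 and 4.
target = [4, 7, 2, 9]
reference = [4, 3, 2, 8]

uniform = node_hamming(target, reference)
weighted = node_hamming(target, reference, tree_weights=[0.4, 0.3, 0.2, 0.1])
print(uniform, weighted)
```

Because the mismatches here fall on lower-weighted trees, the weighted distance is smaller than the uniform one; identical node IDs give 0 and fully different node IDs give 1 in both forms.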
Custom Transformers¶
Most estimators in sknnr work by applying specialized transformers like CCA and CCorA to the input data. These transformers can be used independently of the estimators, like any other sklearn transformer.
from sknnr.transformers import CCATransformer
cca = CCATransformer(n_components=3)
print(cca.fit_transform(X, y))
sknnr currently provides the following transformers:
- StandardScalerWithDOF
- MahalanobisTransformer
- CCATransformer
- CCorATransformer
- RFNodeTransformer
- GBNodeTransformer
Datasets¶
sknnr estimators can be used for any multi-output regression problem, but they excel at predicting forest attributes. The sknnr.datasets module contains a number of test datasets with plot-based forest measurements and environmental attributes.
Dataset Format¶
Like in sklearn, datasets in sknnr can be loaded in a variety of formats: as a dict-like Dataset object, as an X, y tuple of NumPy arrays (return_X_y=True), or as a tuple of pandas dataframes (return_X_y=True, as_frame=True). The first rows of X in dataframe form:
| | ANNPRE | ANNTMP | AUGMAXT | CONTPRE | CVPRE | DECMINT | DIFTMP | SMRTMP | SMRTP | ASPTR | DEM | PRR | SLPPCT | TPI450 | TC1 | TC2 | TC3 | NBR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 52481 | 740 | 514.667 | 2315 | 517.667 | 8971.67 | -583.111 | 2899.11 | 1136.11 | 212.222 | 197.667 | 1870.11 | 13196.7 | 48.3333 | 33.7778 | 218.778 | 68.5556 | -86.2222 | 343.556 |
| 52482 | 742 | 563.556 | 2354.33 | 502 | 9124.33 | -543.556 | 2898.89 | 1179.44 | 221.111 | 190.222 | 1713.11 | 16355.8 | 5.4444 | 6.4444 | 210.222 | 60.3333 | -96.6667 | 261.667 |
| 52484 | 738.556 | 639.111 | 2468.89 | 545.889 | 8897.22 | -479.111 | 2949 | 1266.22 | 236 | 194.556 | 1612.11 | 15132.6 | 15.5556 | -1.2222 | 157 | 110.222 | -17.4444 | 721 |
| 52485 | 730.333 | 622.667 | 2405.33 | 555 | 8829.78 | -481.222 | 2887.56 | 1244.22 | 234 | 196.444 | 1682.33 | 15146.7 | 19.8889 | -16.8889 | 152.556 | 86.1111 | -31.6667 | 597.111 |
| 52494 | 720 | 778.556 | 2678.11 | 658.556 | 8638 | -386.667 | 3065.78 | 1396 | 262 | 191.778 | 1345.67 | 16672.1 | 2 | 0.4444 | 214.667 | 58.5556 | -88.1111 | 294.222 |
Note
pandas must be installed to use as_frame=True.
[1] Check out the sklearn docs for a refresher on estimator basics.