Not a Number LVQ (NaNLVQ)
NaNLVQ [1] is an extension that can be implemented for various distance functions. It uses the partial distance strategy: any feature whose value is NaN is left out of the distance computation. Equivalently, each missing value can be seen as imputed with the corresponding value of the prototype. That feature then contributes zero to the distance, which results in a zero update for the feature containing the NaN value.
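The partial distance idea can be illustrated with plain NumPy. The sketch below is not the sklvq implementation; the helper name partial_squared_euclidean and the toy vectors are assumptions made for illustration. It shows that ignoring the NaN feature and imputing it with the prototype value give the same squared Euclidean distance, and that the gradient with respect to the prototype is zero for that feature.

```python
import numpy as np


def partial_squared_euclidean(x, w):
    """Squared Euclidean distance over the features of x that are not NaN.

    Hypothetical helper for illustration only, not part of sklvq.
    """
    diff = x - w                   # NaN wherever x is NaN
    diff[np.isnan(diff)] = 0.0     # missing features contribute nothing
    return float(diff @ diff), diff


x = np.array([1.0, np.nan, 3.0])   # sample with one missing feature
w = np.array([0.0, 5.0, 1.0])      # prototype

distance, diff = partial_squared_euclidean(x, w)

# Equivalent view: impute the missing feature with the prototype's value.
x_imputed = np.where(np.isnan(x), w, x)
distance_imputed = float(np.sum((x_imputed - w) ** 2))

# Gradient of the squared distance with respect to the prototype;
# its entry for the missing feature is zero, so that feature is not updated.
gradient = -2.0 * diff

print(distance, distance_imputed)  # 5.0 5.0
print(gradient)
```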
import matplotlib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklvq import GMLVQ
matplotlib.rc("xtick", labelsize="small")
matplotlib.rc("ytick", labelsize="small")
iris = load_iris()
data = iris.data
labels = iris.target
# Insert some "random" missing values represented by np.nan
num_missing_values = 50
num_samples, num_dimensions = data.shape
i = np.random.choice(num_samples, num_missing_values, replace=False)
j = np.random.choice(num_dimensions, num_missing_values, replace=True)
data[i, j] = np.nan
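Because the row indices are drawn without replacement, every affected sample loses exactly one feature, so no sample ends up with all values missing. The sketch below checks this on a random stand-in array instead of the iris data; the seeded generator and the names toy, rows, and cols are assumptions added for reproducibility and are not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility
toy = rng.normal(size=(150, 4))           # stand-in with the same shape as iris.data

num_missing_values = 50
num_samples, num_dimensions = toy.shape

# Distinct rows, but columns may repeat across rows.
rows = rng.choice(num_samples, num_missing_values, replace=False)
cols = rng.choice(num_dimensions, num_missing_values, replace=True)
toy[rows, cols] = np.nan

nan_per_sample = np.isnan(toy).sum(axis=1)
print(int(np.isnan(toy).sum()))           # 50
print(int(nan_per_sample.max()))          # 1
```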
Fitting the Model
Scale the data and create a GMLVQ object with, e.g., a custom distance function, activation function, and solver. See the API reference in the documentation for the defaults and other available parameters.
# Object to perform z-transform
scaler = StandardScaler()
# Compute (fit) and apply (transform) z-transform
data = scaler.fit_transform(data)
# Create the model object that will be fitted to the data.
model = GMLVQ(
    distance_type="adaptive-squared-euclidean",
    activation_type="swish",
    activation_params={"beta": 2},
    solver_type="waypoint-gradient-descent",
    solver_params={"max_runs": 10, "k": 3, "step_size": np.array([0.1, 0.05])},
    random_state=1428,
    # Make the data validation and the distance function accept and
    # handle np.nan values.
    force_all_finite="allow-nan",
)
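The z-transform above is applied to data that already contains NaN values. To our knowledge, StandardScaler disregards NaN values when computing the per-feature statistics and keeps them as NaN in the transformed output. The NumPy sketch below mimics that behaviour on a small made-up array; it is an illustration under that assumption, not the scikit-learn implementation.

```python
import numpy as np

# Small made-up array with one missing value.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])

# Per-feature statistics computed over the observed values only.
mean = np.nanmean(X, axis=0)
std = np.nanstd(X, axis=0)      # population standard deviation (ddof=0)

# The missing entry stays NaN after scaling.
Z = (X - mean) / std
print(Z)
```

The missing entries are therefore still present after scaling, which is why the model itself has to be told to accept them.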
The next step is to fit the GMLVQ object to the data and use the predict method to make predictions. Note that this example evaluates the model only on the training data and therefore says nothing about how well the fitted model generalizes.
# Train the model using the data and labels
model.fit(data, labels)
# Predict the labels using the trained model
predicted_labels = model.predict(data)
# To get a sense of the training performance, print the classification report.
print(classification_report(labels, predicted_labels))
Out:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       0.98      0.96      0.97        50
           2       0.96      0.98      0.97        50

    accuracy                           0.98       150
   macro avg       0.98      0.98      0.98       150
weighted avg       0.98      0.98      0.98       150
This example uses GMLVQ, but all models and their compatible distance functions support the force_all_finite option.
References
[1] Rick van Veen (2016). Analysis of Missing Data Imputation Applied to Heart Failure Data (Master’s thesis, University of Groningen, Groningen, The Netherlands). Retrieved from http://fse.studenttheses.ub.rug.nl/id/eprint/14679
Total running time of the script: (0 minutes 0.062 seconds)