Output of Label Column when applying ONNX model is not as expected #460
Description
When creating ONNX models for classifiers with NimbusML and then applying them, either with OnnxRunner (aka OnnxTransformer from ML.NET) or directly through the Python API of ONNX Runtime (aka ORT), we get unexpected values in the Label column (i.e. the column that was used as the label for the classifier).
The behavior differs depending on whether the input DataFrame's Label column is category, object (string) or float, as the repro below shows; I guess similar problems arise for other types. There are two main issues:
Issue 1. When running with ORT, the output Label column from the ONNX model contains 'keys' and not 'values', i.e. we get integer keys instead of whatever original values there were in Label (in the Species output below, 1 for setosa and 3 for virginica). This happens regardless of the input Label column type.
Issue 2. When running with OnnxRunner, the Label column has unexpected values. If the input Label column was object (string), the value in that column is "4294967295" for every row; if the input was category or float, the value is "0".
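To make Issue 1 concrete, here is a minimal sketch of what mapping the keys back to values could look like on the caller's side. It is only an illustration under assumptions that this issue does not confirm: that the keys are 1-based (the Species output prints 1 for setosa and 3 for virginica), that 0 stands for a missing value, and that keys were assigned in order of first appearance of each label. The helper names `build_vocabulary` and `keys_to_values` are hypothetical and are not part of NimbusML, ML.NET or ORT.

```python
def build_vocabulary(labels):
    """Return the distinct labels in order of first appearance.

    ASSUMPTION: the key mapping inside the model follows this order.
    """
    return list(dict.fromkeys(labels))


def keys_to_values(keys, vocabulary, missing=None):
    """Map integer keys back to label values.

    ASSUMPTION: keys are 1-based; anything outside 1..len(vocabulary)
    (including 0) is treated as missing.
    """
    values = []
    for key in keys:
        key = int(key)
        if 1 <= key <= len(vocabulary):
            values.append(vocabulary[key - 1])
        else:
            values.append(missing)
    return values


# Tiny stand-in for the Species column and the keys ORT returned for it.
labels = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
keys = [1, 1, 2, 3, 3]

vocabulary = build_vocabulary(labels)
restored = keys_to_values(keys, vocabulary)
print(restored)
```

With the repro below, the same helpers could be fed `iris_df["Species"].tolist()` and `result_ort["Species.output"].tolist()`. This is a workaround sketch only; the expected behavior is still for the ONNX model to output the values itself.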
Repro
NOTE: the data_frame_tool module used is the one currently in the aml branch (link)
import os
import tempfile
from data_frame_tool import DataFrameTool as DFT
from nimbusml.datasets import get_dataset
from nimbusml.linear_model import FastLinearClassifier
from nimbusml.preprocessing import OnnxRunner
from nimbusml.preprocessing import FromKey, ToKey
from nimbusml import Pipeline
def get_tmp_file(suffix=None):
    fd, file_name = tempfile.mkstemp(suffix=suffix)
    fl = os.fdopen(fd, 'w')
    fl.close()
    return file_name
# Change the label column to see different behaviors:
LABEL_COLUMN_NAME = "Species" # Type: object (string)
#LABEL_COLUMN_NAME = "Setosa" # Type: float
#LABEL_COLUMN_NAME = "Label" # Type: category
iris_df = get_dataset("iris").as_df()
print("\n\nORIGINAL DATASET - using", LABEL_COLUMN_NAME, " as Label column")
print(iris_df)
print(iris_df.dtypes)
predictor = FastLinearClassifier(feature=["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"], label=LABEL_COLUMN_NAME)
predictor.fit(iris_df)
print("\n\nML.NET RESULT")
original_result = predictor.predict(iris_df) # Notice this outputs only "PredictedLabel" so the user can't get the Label column after applying the predictor. QUESTION: Is there a way for the user to get that column after the predictor?
print(predictor.model_)
print(original_result)
print(original_result.dtypes)
# onnxpath = get_tmp_file()
onnxpath = get_tmp_file()
print()
print("Onnx model path:", onnxpath)
predictor.export_to_onnx(onnxpath, 'com.microsoft.ml')
print("\n\nORT RESULT")
df_tool = DFT(onnxpath)
result_ort = df_tool.execute(iris_df, [])
print(result_ort)
print("\nColumn:", LABEL_COLUMN_NAME, " - ORT RESULT") # Issue 1: It prints the "keys", instead of values for the Label column
print(result_ort[LABEL_COLUMN_NAME + ".output"])
print("\n\nONNX RUNNER RESULT")
onnxrunner = OnnxRunner(model_file=onnxpath)
result_onnx = onnxrunner.fit_transform(iris_df)
print(result_onnx)
print(result_onnx.dtypes)
print("\nColumn:", LABEL_COLUMN_NAME, " - ONNX RUNNER RESULT") # Issue 2: It prints "4294967295" when label column is "Species" (string), "0" when label column is "Label" (category) and "Setosa" (float), for every row
print(result_onnx[LABEL_COLUMN_NAME])
Output (for LABEL_COLUMN_NAME="Species")
ORIGINAL DATASET - using Species as Label column
Sepal_Length Sepal_Width Petal_Length Petal_Width Label Species Setosa
0 5.1 3.5 1.4 0.2 0 setosa 1.0
1 4.9 3.0 1.4 0.2 0 setosa 1.0
2 4.7 3.2 1.3 0.2 0 setosa 1.0
3 4.6 3.1 1.5 0.2 0 setosa 1.0
4 5.0 3.6 1.4 0.2 0 setosa 1.0
.. ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2 virginica 0.0
146 6.3 2.5 5.0 1.9 2 virginica 0.0
147 6.5 3.0 5.2 2.0 2 virginica 0.0
148 6.2 3.4 5.4 2.3 2 virginica 0.0
149 5.9 3.0 5.1 1.8 2 virginica 0.0
[150 rows x 7 columns]
Sepal_Length float64
Sepal_Width float64
Petal_Length float64
Petal_Width float64
Label category
Species object
Setosa float64
dtype: object
Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Using 6 threads to train.
Automatically choosing a check frequency of 6.
Auto-tuning parameters: maxIterations = 9996.
Auto-tuning parameters: L2 = 2.667734E-05.
Auto-tuning parameters: L1Threshold (L1/L2) = 0.
Using best model from iteration 948.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:00.9079426
ML.NET RESULT
C:\Users\anvelazq\AppData\Local\Temp\tmp7b539j8w.model.bin
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: PredictedLabel, Length: 150, dtype: object
object
Onnx model path: C:\Users\anvelazq\Desktop\is23repros\model-labelissue.onnx
ORT RESULT
Sepal_Length.output Sepal_Width.output Petal_Length.output ... Score.output.0 Score.output.1 Score.output.2
0 5.1 3.5 1.4 ... 9.979612e-01 0.002039 7.896303e-15
1 4.9 3.0 1.4 ... 9.935742e-01 0.006426 1.243418e-13
2 4.7 3.2 1.3 ... 9.969639e-01 0.003036 2.946764e-14
3 4.6 3.1 1.5 ... 9.950643e-01 0.004936 1.473649e-13
4 5.0 3.6 1.4 ... 9.984953e-01 0.001505 4.957718e-15
.. ... ... ... ... ... ... ...
145 6.7 3.0 5.2 ... 6.576003e-09 0.002802 9.971976e-01
146 6.3 2.5 5.0 ... 3.143095e-07 0.031589 9.684103e-01
147 6.5 3.0 5.2 ... 4.240965e-07 0.031176 9.688237e-01
148 6.2 3.4 5.4 ... 1.435240e-08 0.002293 9.977069e-01
149 5.9 3.0 5.1 ... 7.885213e-06 0.121532 8.784599e-01
[150 rows x 19 columns]
Column: Species - ORT RESULT
0 1
1 1
2 1
3 1
4 1
..
145 3
146 3
147 3
148 3
149 3
Name: Species.output, Length: 150, dtype: uint32
ONNX RUNNER RESULT
Sepal_Length Sepal_Width Petal_Length ... Score.setosa Score.versicolor Score.virginica
0 5.1 3.5 1.4 ... 9.979612e-01 0.002039 7.896303e-15
1 4.9 3.0 1.4 ... 9.935742e-01 0.006426 1.243418e-13
2 4.7 3.2 1.3 ... 9.969639e-01 0.003036 2.946764e-14
3 4.6 3.1 1.5 ... 9.950643e-01 0.004936 1.473649e-13
4 5.0 3.6 1.4 ... 9.984953e-01 0.001505 4.957718e-15
.. ... ... ... ... ... ... ...
145 6.7 3.0 5.2 ... 6.576003e-09 0.002802 9.971976e-01
146 6.3 2.5 5.0 ... 3.143095e-07 0.031589 9.684103e-01
147 6.5 3.0 5.2 ... 4.240965e-07 0.031176 9.688237e-01
148 6.2 3.4 5.4 ... 1.435240e-08 0.002293 9.977069e-01
149 5.9 3.0 5.1 ... 7.885213e-06 0.121532 8.784599e-01
[150 rows x 19 columns]
Sepal_Length float64
Sepal_Width float64
Petal_Length float64
Petal_Width float64
Label object
Species uint32
Setosa float64
311418708f7545c0a2fd7f3db667a0cd float32
5ab7f7a1e38348f4b66ed5e3a9c2416e float32
776cb47f18c24a52a72e93f759808599 float32
17f5b772493b497fa3dfca2abffc6049 float32
Features.Sepal_Length float32
Features.Sepal_Width float32
Features.Petal_Length float32
Features.Petal_Width float32
PredictedLabel object
Score.setosa float32
Score.versicolor float32
Score.virginica float32
dtype: object
Column: Species - ONNX RUNNER RESULT
0 4294967295
1 4294967295
2 4294967295
3 4294967295
4 4294967295
...
145 4294967295
146 4294967295
147 4294967295
148 4294967295
149 4294967295
Name: Species, Length: 150, dtype: uint32
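A side note on the value itself: 4294967295 is 2**32 - 1, the largest value a uint32 can hold, and uint32 is the dtype reported for the Species column above. The sketch below just checks that arithmetic. One possible reading, which is only a guess and is not confirmed anywhere in this issue, is that a -1 (or a 0 key shifted down by one) ended up stored in an unsigned 32-bit column. The helper `wrap_to_uint32` is hypothetical and only mimics that wrap-around in plain Python.

```python
UINT32_BITS = 32
UINT32_MAX = 2**UINT32_BITS - 1  # largest value representable in a uint32


def wrap_to_uint32(value):
    """Reduce a Python int modulo 2**32, as an unsigned 32-bit store would."""
    return value % (2**UINT32_BITS)


# The value seen in every row of the OnnxRunner output.
observed = 4294967295

print(observed == UINT32_MAX)
# GUESS: a -1, or a key of 0 minus 1, wraps around to the same value.
print(wrap_to_uint32(-1) == observed)
print(wrap_to_uint32(0 - 1) == observed)
```

If that reading is right, the string labels are not being mapped to keys at all on the OnnxRunner path, which would also be consistent with every row showing the same value.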