Topic:
Adjusted Target Encoding for Multi-Class Classifications
How category-encoders library gives incorrect results for multi-class categories.
http://contrib.scikit-learn.org/category_encoders/targetencoder.html
Backgrounds
About Target Encoder
1 | class category_encoders.target_encoder.TargetEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', min_samples_leaf=1, smoothing=1.0) |
Target encoding for categorical features.
Supported targets: binomial and continuous.For polynomial target support, see polynomial wrapper.
For the For the case of categorical target: features are replaced with a blend of posterior probability of the target given particular categorical value and the prior probability of the target over all the training data.
For the case of categorical target: features are replaced with a blend of posterior probability of the target given particular categorical value and the prior probability of the target over all the training data.
About Real-Life Dataset
To better explain the theories beneath those algorithms without loss of generality, a simplified, anonymized version of the dataset we're working with is applied:
Before Inappropriate Target Encoding
Number of Units(X) | Credit Rank(Y) |
---|---|
1 | 0 |
1 | 4 |
1 | 0 |
1 | 4 |
2 | 1 |
2 | 3 |
2 | 2 |
1 | from category_encoders import TargetEncoder |
After Inappropriate Target Encoding
Number of Units(X) | Credit Rank(Y) |
---|---|
2 | 0 |
2 | 4 |
2 | 0 |
2 | 4 |
2 | 1 |
2 | 3 |
2 | 2 |
All those different values will be replaced with 2, which causes confusion and loss of necessary information.
To aid this situation...
Apply Polynomial Wrappers(Experimental Feature provided by CE library, but changes nightly), or following the scheme provided in this paper A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.
- One-Hot Encode Target Label
, which gives binary columns, one corresponding to each class of the target; - Drop one arbitrary column to avoid linear dependencies;
- Use the usual target encoding algorithms for each categorical columns using each binary target one by one.
In the end, each column of categorical features is expended to
Code Implementations
1 | # Apply Mean Target Encoder |