FURP Driven Insights: `Target Encoding`
Ch'i YU Lv3

Topic: Adjusted Target Encoding for Multi-Class Classification

How the category_encoders library gives incorrect results for multi-class targets.

http://contrib.scikit-learn.org/category_encoders/targetencoder.html

Background

About Target Encoder

class category_encoders.target_encoder.TargetEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', min_samples_leaf=1, smoothing=1.0)

Target encoding for categorical features.

Supported targets: binomial and continuous. For polynomial target support, see polynomial wrapper.

For the case of categorical target: features are replaced with a blend of posterior probability of the target given particular categorical value and the prior probability of the target over all the training data.
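As a rough illustration of this blend, here is a minimal sketch that assumes a sigmoid weight controlled by `min_samples_leaf` and `smoothing`. The function name `blend` and the example numbers are hypothetical, and the exact formula inside category_encoders may differ between versions.

```python
import math

def blend(n, category_mean, prior, min_samples_leaf=1, smoothing=1.0):
    """Shrink a category's target mean toward the global prior (illustrative only)."""
    # Sigmoid weight: approaches 1 as the category count n grows.
    weight = 1.0 / (1.0 + math.exp(-(n - min_samples_leaf) / smoothing))
    return weight * category_mean + (1.0 - weight) * prior

# A category seen 50 times keeps almost all of its own mean ...
frequent = blend(n=50, category_mean=0.9, prior=0.3)
# ... while a category seen once is pulled halfway toward the prior.
rare = blend(n=1, category_mean=0.9, prior=0.3)
print(frequent, rare)
```

Under these assumptions, rare categories are encoded close to the prior, and frequent ones close to their own posterior estimate.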

About Real-Life Dataset

To explain the theory behind these algorithms without loss of generality, we use a simplified, anonymized version of the dataset we are working with:

Before Inappropriate Target Encoding

Number of Units(X)    Credit Rank(Y)
1                     0
1                     4
1                     0
1                     4
2                     1
2                     3
2                     2
import pandas as pd
from category_encoders import TargetEncoder

df = pd.read_csv("xxx.csv")
print(df)

# Pass the target as a Series, not as its column name
te = TargetEncoder()
print(te.fit_transform(df[["Number of Units(X)"]], df["Credit Rank(Y)"]))

After Inappropriate Target Encoding

Number of Units(X)    Credit Rank(Y)
2                     0
2                     4
2                     0
2                     4
2                     1
2                     3
2                     2

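All the distinct feature values are replaced with 2. The encoder treats the credit rank as a continuous number, and both categories happen to have a mean rank of 2 (which is also the global mean), so the encoded column no longer distinguishes the categories and the necessary information is lost.

To remedy this situation...

Apply the Polynomial Wrapper (an experimental feature of the category_encoders library, still subject to frequent change), or follow the scheme proposed in the paper "A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems":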
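The collapse can be checked by hand. The sketch below uses plain Python rather than the library, with the toy table typed in as lists (the variable names are ours), and computes the mean rank per category and overall.

```python
from collections import defaultdict

# Toy data from the table above
units = [1, 1, 1, 1, 2, 2, 2]
rank = [0, 4, 0, 4, 1, 3, 2]

# Collect the target values observed for each category
groups = defaultdict(list)
for x, y in zip(units, rank):
    groups[x].append(y)

category_means = {x: sum(ys) / len(ys) for x, ys in groups.items()}
global_mean = sum(rank) / len(rank)
print(category_means, global_mean)
```

Both category means coincide with the global mean, so any blend of the two yields the same number for every row.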

  1. One-hot encode the target label, which gives N binary columns, one for each of the N classes of the target;
  2. Drop one arbitrary column to avoid linear dependencies;
  3. Apply the usual target encoding algorithm to each categorical column, using each remaining binary target in turn.

In the end, each categorical feature column is expanded to N - 1 columns, where N represents the number of classes in the target.
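To see what the scheme produces on the toy data, here is a minimal sketch in plain Python. It assumes unsmoothed per-category means instead of the library's blended estimate, and the helper `class_rates` and the `units_<class>` column names are hypothetical.

```python
# Toy data: Number of Units (X) and Credit Rank (Y)
units = [1, 1, 1, 1, 2, 2, 2]
rank = [0, 4, 0, 4, 1, 3, 2]

# One binary target per class, dropping the first class (rank 0)
classes = sorted(set(rank))[1:]

def class_rates(xs, ys, cls):
    """Per-category mean of the binary indicator (y == cls), without smoothing."""
    rates = {}
    for cat in set(xs):
        hits = [int(y == cls) for x, y in zip(xs, ys) if x == cat]
        rates[cat] = sum(hits) / len(hits)
    return rates

# One encoded column per remaining class: N - 1 = 4 columns here
encoded = {}
for cls in classes:
    rates = class_rates(units, rank, cls)
    encoded[f"units_{cls}"] = [rates[x] for x in units]

for name, column in encoded.items():
    print(name, column)
```

Unlike the single collapsed column, these columns take different values for the two categories, so the feature keeps its information about the target.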

Code Implementations

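import gc
import category_encoders as ce
import pandas as pd

# Apply Mean Target Encoder
# `df` is the full dataset; `cat_cols` lists its categorical feature columns
rank_onehot = pd.get_dummies(df["Credit Rank"],   # the target `Y`
                             drop_first=True,     # to avoid linear dependencies
                             dtype=int)           # 0/1 targets rather than booleans
for col in cat_cols:
    for index in rank_onehot.columns:
        # Encode `col` against one binary target at a time
        te_encoder = ce.TargetEncoder(cols=[col])
        df[col + "_" + str(index)] = te_encoder.fit_transform(
            df[[col]], rank_onehot[index])[col]
    del df[col]  # drop the raw column once all of its encodings exist

del rank_onehot
gc.collect()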