I have a dataset with categorical variables, e.g.:
alias Explorer.DataFrame, as: DF
df = DF.new(%{
  age: [1, 5, 3],
  animal: ["dog", "cat", "dog"], # categorical
  color: ["brown", "black", "brindle"] # categorical
})
I currently have the variables one-hot encoded:
df_one_hot = DF.new(%{
  age: [1, 5, 3],
  animal_cat_1_of_2: [1, 0, 1], # animal == "dog"
  animal_cat_2_of_2: [0, 1, 0], # animal == "cat"
  color_cat_1_of_3: [1, 0, 0], # color == "brown"
  color_cat_2_of_3: [0, 1, 0], # color == "black"
  color_cat_3_of_3: [0, 0, 1] # color == "brindle"
})
Neural nets generally do well with that encoding. Tree-based models, however, may benefit from the original encoding or ordinal encoding. So while experimenting with different models, I found myself needing to undo the one-hot encoding.
I came up with a solution, which I’ll post below, but I wanted to see how others would approach the problem. This is a somewhat computationally intensive operation, and I worry that I’m not taking full advantage of Explorer. In particular, this looked like it might be a job for across, but I couldn’t make it work.
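To pin down the target behaviour, here is a minimal plain-Elixir baseline. It is not the Explorer-based solution mentioned above, and it does not use across. It works on a map of column name to list, which I assume is roughly the shape you would get by pulling the columns out of the data frame. The module OneHotSketch and its decode/3 function are hypothetical names of my own, not part of Explorer, and the labels have to be passed in the same order as the numbered columns.

```elixir
defmodule OneHotSketch do
  # Hypothetical helper, not part of Explorer.
  # Collapses the columns "<prefix>_cat_<i>_of_<n>" back into a single
  # "<prefix>" column, using `labels` (ordered to match i = 1..n).
  def decode(columns, prefix, labels) do
    n = length(labels)
    names = for i <- 1..n, do: "#{prefix}_cat_#{i}_of_#{n}"

    decoded =
      names
      |> Enum.map(&Map.fetch!(columns, &1))
      # Turn the list of columns into a list of row tuples.
      |> Enum.zip()
      |> Enum.map(fn row ->
        # The position of the 1 in the row selects the label;
        # a row with no 1 at all decodes to nil.
        case Enum.find_index(Tuple.to_list(row), &(&1 == 1)) do
          nil -> nil
          idx -> Enum.at(labels, idx)
        end
      end)

    columns
    |> Map.drop(names)
    |> Map.put(prefix, decoded)
  end
end

# Same data as df_one_hot, written as a plain map of columns.
columns = %{
  "age" => [1, 5, 3],
  "animal_cat_1_of_2" => [1, 0, 1],
  "animal_cat_2_of_2" => [0, 1, 0],
  "color_cat_1_of_3" => [1, 0, 0],
  "color_cat_2_of_3" => [0, 1, 0],
  "color_cat_3_of_3" => [0, 0, 1]
}

decoded =
  columns
  |> OneHotSketch.decode("animal", ["dog", "cat"])
  |> OneHotSketch.decode("color", ["brown", "black", "brindle"])
```

Here decoded holds the original age, animal, and color columns again. This walks every row in Elixir rather than in Explorer's backend, so treat it as a correctness reference rather than the fast version I’m hoping exists.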