billylanchantin
How should I undo a one-hot encoded variable?
I have a dataset with categorical variables, e.g.:
alias Explorer.DataFrame, as: DF
df = DF.new(%{
  age: [1, 5, 3],
  animal: ["dog", "cat", "dog"], # categorical
  color: ["brown", "black", "brindle"] # categorical
})
I currently have the variables one-hot encoded:
df_one_hot = DF.new(%{
  age: [1, 5, 3],
  animal_cat_1_of_2: [1, 0, 1], # animal == "dog"
  animal_cat_2_of_2: [0, 1, 0], # animal == "cat"
  color_cat_1_of_3: [1, 0, 0], # color == "brown"
  color_cat_2_of_3: [0, 1, 0], # color == "black"
  color_cat_3_of_3: [0, 0, 1], # color == "brindle"
})
Neural nets generally do well with that encoding. Tree-based models, however, may benefit from the original encoding or ordinal encoding. So while experimenting with different models, I found myself needing to undo the one-hot encoding.
I came up with a solution, which I’ll post below, but I wanted to see how others would approach the problem. This is a somewhat computationally intensive operation, and I worry that I’m not taking full advantage of Explorer. In particular, this looked like it might be a job for across, but I couldn’t make it work.
Marked As Solved
josevalim
If you can hardcode the fields, then you can do:
require Explorer.DataFrame, as: DF
DF.mutate(df_one_hot,
  animal: cond do
    animal_cat_1_of_2 == 1 -> "dog"
    animal_cat_2_of_2 == 1 -> "cat"
  end,
  color: cond do
    color_cat_1_of_3 == 1 -> "brown"
    color_cat_2_of_3 == 1 -> "black"
    color_cat_3_of_3 == 1 -> "brindle"
  end
)
If you cannot, then you can port your approach to mutate_with. mutate_with gives you access to the columns and allows you to dynamically build a query based on the field names. Then use Series.select to build the equivalent of the cond. The answer follows, in case you want to try it out by yourself first:
# Group the one-hot columns by prefix, e.g.
# "animal" => [{"animal_cat_1_of_2", 1}, {"animal_cat_2_of_2", 2}]
categorical_cols =
  df_one_hot.names
  |> Enum.map(&Regex.run(~r/(.+)_cat_(\d+)_of_\d+/, &1))
  |> Enum.reject(&is_nil/1)
  |> Enum.group_by(
    fn [_, group, _] -> group end,
    fn [col, _, num] -> {col, String.to_integer(num)} end
  )

DF.mutate_with(df_one_hot, fn df ->
  Enum.map(categorical_cols, fn {group, col_to_num} ->
    # Chain one select per column; -1 is left where no column equals 1.
    # The new column holds the category number, not its label.
    expr = Enum.reduce(col_to_num, -1, fn {col_name, num}, acc ->
      equal = Explorer.Series.equal(df[col_name], 1)
      Explorer.Series.select(equal, num, acc)
    end)
    {group, expr}
  end)
end)
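To see what that logic does without running Explorer, here is a rough plain-Elixir sketch of the same two steps, applied row by row to the data from the question. The `OneHotSketch` module, the `rows` list, and the `labels` map are all hypothetical and only for illustration; the label order is an assumption taken from the comments in the question. The `if` in `decode_row/2` stands in for `Series.select`, which does the same thing column-wise.

```elixir
defmodule OneHotSketch do
  # Hypothetical helper module for illustration; not part of Explorer.

  # Group one-hot column names by prefix, e.g. "animal_cat_1_of_2"
  # ends up under "animal" as {"animal_cat_1_of_2", 1}.
  def group_columns(names) do
    names
    |> Enum.map(&Regex.run(~r/(.+)_cat_(\d+)_of_\d+/, &1))
    |> Enum.reject(&is_nil/1)
    |> Enum.group_by(
      fn [_, group, _] -> group end,
      fn [col, _, num] -> {col, String.to_integer(num)} end
    )
  end

  # Row-wise stand-in for the chained selects: start from -1 and
  # replace the accumulator with `num` when that column holds a 1.
  def decode_row(row, col_num_pairs) do
    Enum.reduce(col_num_pairs, -1, fn {col, num}, acc ->
      if Map.fetch!(row, col) == 1, do: num, else: acc
    end)
  end
end

# The one-hot data from the question, written as plain maps (one per row).
rows = [
  %{"age" => 1, "animal_cat_1_of_2" => 1, "animal_cat_2_of_2" => 0,
    "color_cat_1_of_3" => 1, "color_cat_2_of_3" => 0, "color_cat_3_of_3" => 0},
  %{"age" => 5, "animal_cat_1_of_2" => 0, "animal_cat_2_of_2" => 1,
    "color_cat_1_of_3" => 0, "color_cat_2_of_3" => 1, "color_cat_3_of_3" => 0},
  %{"age" => 3, "animal_cat_1_of_2" => 1, "animal_cat_2_of_2" => 0,
    "color_cat_1_of_3" => 0, "color_cat_2_of_3" => 0, "color_cat_3_of_3" => 1}
]

groups = rows |> hd() |> Map.keys() |> OneHotSketch.group_columns()

# Category numbers per group, one entry per row.
codes =
  Map.new(groups, fn {group, pairs} ->
    {group, Enum.map(rows, &OneHotSketch.decode_row(&1, pairs))}
  end)

# Assumed label order (from the comments in the question).
labels = %{"animal" => ["dog", "cat"], "color" => ["brown", "black", "brindle"]}

# Map numbers back to labels; -1 (no column was 1) becomes nil.
decoded =
  Map.new(codes, fn {group, group_codes} ->
    {group,
     Enum.map(group_codes, fn
       -1 -> nil
       code -> Enum.at(labels[group], code - 1)
     end)}
  end)

IO.inspect(codes)   # %{"animal" => [1, 2, 1], "color" => [1, 2, 3]}
IO.inspect(decoded) # %{"animal" => ["dog", "cat", "dog"], "color" => ["brown", "black", "brindle"]}
```

Note that the numbers alone already give you an ordinal-style encoding; you only need something like the `labels` lookup if you want the original strings back.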