How should I undo a one-hot encoded variable?

billylanchantin · October 24, 2023, 9:05pm

I have a dataset with categorical variables, e.g.:

alias Explorer.DataFrame, as: DF

df = DF.new(%{
  age: [1, 5, 3],
  animal: ["dog", "cat", "dog"],       # categorical
  color: ["brown", "black", "brindle"] # categorical
})

I currently have the variables one-hot encoded:

df_one_hot = DF.new(%{
  age: [1, 5, 3],
  animal_cat_1_of_2: [1, 0, 1], # animal == "dog"
  animal_cat_2_of_2: [0, 1, 0], # animal == "cat"
  color_cat_1_of_3: [1, 0, 0], # color == "brown"
  color_cat_2_of_3: [0, 1, 0], # color == "black"
  color_cat_3_of_3: [0, 0, 1], # color == "brindle"
})

Neural nets generally do well with that encoding. Tree-based models, however, may benefit from the original encoding or ordinal encoding. So while experimenting with different models, I found myself needing to undo the one-hot encoding.

I came up with a solution, which I’ll post below, but I wanted to see how others would approach the problem. This is a somewhat computationally intensive operation, and I worry that I’m not taking full advantage of Explorer. In particular, this looked like it might be a job for across, but I couldn’t make it work.

billylanchantin · October 24, 2023, 9:08pm

My approach for converting to an ordinal encoding:

require DF

# Build up a map of %{original_col => %{category_col => category_num}}
categorical_cols =
  df_one_hot.names
  |> Enum.map(&Regex.run(~r/(.+)_cat_(\d+)_of_\d+/, &1))
  |> Enum.reject(&is_nil/1)
  |> Enum.group_by(
    fn [_, group, _] ->
      group
    end,
    fn [col, _, num_string] ->
      {num, _} = Integer.parse(num_string)
      {col, num}
    end
  )
  |> Map.new(fn {group, col_num_pairs} -> {group, Map.new(col_num_pairs)} end)

# Ordinal encode the data by repeatedly adding `num * col` to a column of all 0s.
df_ordinal =
  Enum.reduce(categorical_cols, df_one_hot, fn {group, col_to_num}, outer ->
    col_to_num
    |> Enum.reduce(DF.put(outer, group, [0]), fn {col_name, num}, inner ->
      DF.mutate(inner, [{^group, col(^group) + col(^col_name) * ^num}])
    end)
    |> DF.discard(Map.keys(col_to_num))
  end)
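
For reference, here is a small sketch of what the first step produces for the example column names. It is plain Elixir with no Explorer dependency; the `names` list is typed out by hand as a stand-in for `df_one_hot.names`, and I've used `String.to_integer/1` in place of `Integer.parse/1` for brevity.

```elixir
# Hand-written stand-in for df_one_hot.names (column order does not matter here).
names = [
  "age",
  "animal_cat_1_of_2",
  "animal_cat_2_of_2",
  "color_cat_1_of_3",
  "color_cat_2_of_3",
  "color_cat_3_of_3"
]

categorical_cols =
  names
  # Regex.run/2 returns nil for columns that are not one-hot columns (e.g. "age").
  |> Enum.map(&Regex.run(~r/(.+)_cat_(\d+)_of_\d+/, &1))
  |> Enum.reject(&is_nil/1)
  # Key: the original column name. Value: {one_hot_column, category_number}.
  |> Enum.group_by(
    fn [_, group, _] -> group end,
    fn [col, _, num] -> {col, String.to_integer(num)} end
  )
  |> Map.new(fn {group, pairs} -> {group, Map.new(pairs)} end)

IO.inspect(categorical_cols)
# %{
#   "animal" => %{"animal_cat_1_of_2" => 1, "animal_cat_2_of_2" => 2},
#   "color" => %{"color_cat_1_of_3" => 1, "color_cat_2_of_3" => 2, "color_cat_3_of_3" => 3}
# }
```

Note that the category numbers come straight from the column names, so the ordinal codes are 1-based.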

josevalim · October 25, 2023, 7:14am

If you can hardcode the fields, then you can do:

require Explorer.DataFrame, as: DF

DF.mutate(df_one_hot,
  animal: cond do
    animal_cat_1_of_2 == 1 -> "dog"
    animal_cat_2_of_2 == 1 -> "cat"
  end,
  color: cond do
    color_cat_1_of_3 == 1 -> "brown"
    color_cat_2_of_3 == 1 -> "black"
    color_cat_3_of_3 == 1 -> "brindle"
  end
)

If you cannot, then you can port your approach to mutate_with. mutate_with gives you access to the columns and allows you to build a query dynamically based on the field names. Then use Series.select to build the cond:

categorical_cols =
  df_one_hot.names
  |> Enum.map(&Regex.run(~r/(.+)_cat_(\d+)_of_\d+/, &1))
  |> Enum.reject(&is_nil/1)
  |> Enum.group_by(
    fn [_, group, _] -> group end,
    fn [col, _, num] -> {col, String.to_integer(num)} end
  )

DF.mutate_with(df_one_hot, fn df ->
  Enum.map(categorical_cols, fn {group, col_to_num} ->
    expr = Enum.reduce(col_to_num, -1, fn {col_name, num}, acc ->
      equal = Explorer.Series.equal(df[col_name], 1)
      Explorer.Series.select(equal, num, acc)
    end)

    {group, expr}
  end)
end)
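
To see why the reduce works, it may help to model the select chain row by row in plain Elixir. This is only an illustration of the logic, not how Explorer evaluates it: `decode`, `animal_cols`, and `rows` are hypothetical names, and `if` stands in for `Explorer.Series.select/3`.

```elixir
# One {one_hot_column, category_number} pair per category, as in categorical_cols.
animal_cols = [{"animal_cat_1_of_2", 1}, {"animal_cat_2_of_2", 2}]

# Hand-written rows matching the example data (dog, cat, dog),
# plus one row where no one-hot column is set.
rows = [
  %{"animal_cat_1_of_2" => 1, "animal_cat_2_of_2" => 0},
  %{"animal_cat_1_of_2" => 0, "animal_cat_2_of_2" => 1},
  %{"animal_cat_1_of_2" => 1, "animal_cat_2_of_2" => 0},
  %{"animal_cat_1_of_2" => 0, "animal_cat_2_of_2" => 0}
]

# Each step wraps the previous result: "if this column is 1, use its number,
# otherwise fall back to whatever the earlier columns decided".
decode = fn row, col_to_num ->
  Enum.reduce(col_to_num, -1, fn {col_name, num}, acc ->
    if row[col_name] == 1, do: num, else: acc
  end)
end

codes = Enum.map(rows, &decode.(&1, animal_cols))
IO.inspect(codes)
# [1, 2, 1, -1]
```

The initial accumulator of -1 is what a row ends up with when none of its one-hot columns is set. If more than one column were set, the last one in iteration order would win.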

billylanchantin · October 25, 2023, 3:25pm

Thanks! That made a huge improvement.

Quick benchmark on a subset of the data (cols: 203, rows: 65,917):

{billy_us, billy_df} = :timer.tc(fn ->
  # ...
end)

{jose_us, jose_df} = :timer.tc(fn ->
  # ...
  |> DF.discard(#... to make the comparison fair
end)

billy_df.names == jose_df.names  #=> true
[billy: billy_us, jose: jose_us] #=> [billy: 26438, jose: 4680]
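
For anyone reproducing this: `:timer.tc/1` returns `{elapsed_microseconds, result}`, so the numbers above are in microseconds. A minimal sketch of the pattern, with a stand-in workload instead of the actual dataframe code:

```elixir
# Stand-in workload; swap in the dataframe transformation you want to time.
{elapsed_us, result} = :timer.tc(fn -> Enum.sum(1..1_000_000) end)
IO.puts("took #{elapsed_us / 1000} ms, result: #{result}")

# Ratio of the two timings reported above.
speedup = Float.round(26_438 / 4_680, 1)
IO.inspect(speedup)
# 5.6
```

A single `:timer.tc` run is noisy, so treat the ratio as a rough indication rather than a precise measurement.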

What’s more, your solution works on lazy frames. With my approach, I needed to call DF.collect on my dataframe first.
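
A minimal sketch of the lazy version, hardcoding a single two-category column for brevity. The `Mix.install` version requirement is my assumption about a compatible release, and `ldf`, `is_first`, and `result` are just illustrative names.

```elixir
# Assumption: any Explorer release matching this requirement provides
# DF.lazy/1, DF.mutate_with/2 and DF.collect/1.
Mix.install([{:explorer, "~> 0.7"}])

alias Explorer.DataFrame, as: DF
alias Explorer.Series

df_one_hot =
  DF.new(%{
    animal_cat_1_of_2: [1, 0, 1],
    animal_cat_2_of_2: [0, 1, 0]
  })

result =
  df_one_hot
  # Switch to a lazy frame; nothing is computed until collect/1.
  |> DF.lazy()
  |> DF.mutate_with(fn ldf ->
    is_first = Series.equal(ldf["animal_cat_1_of_2"], 1)
    # Category 1 where the first column is set, category 2 otherwise.
    [animal: Series.select(is_first, 1, 2)]
  end)
  # Materialize the result as an eager dataframe.
  |> DF.collect()

IO.inspect(Series.to_list(result["animal"]))
```

The callback only builds an expression from the columns, which is why the same code can run against either an eager or a lazy frame.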

This particular dataset had ~80 one-hot columns. Hardcoding is doable, but I’d prefer to avoid it if possible.