Help to simplify/optimize a function

rmoretto · May 3, 2021, 11:30pm

Hello!

So I currently have the following problem: I want to take a map and “sanitize” all the integers fields. Basically verifying if that field value is really an int and replace the parsed value in the map, furthermore, if the verified value is not an valid int it should save nil to the field, for example:

Input Map

%{
    int_a: "1",
    int_b: "not_a_int!"
%}

Output Map:

%{
    int_a: 1,
    int_b: nil
%}

On the system that I am currently working I know that all the integers fields in the map will start with the int_ prefix (important to say that I don’t have full access to the struct of the map, so I can’t write a function that uses a specific field) so I wrote the following code:

def parse_map_int_fields(map) do
    ints_map =
      map 
      # Filter all the Integers field
      |> Enum.filter(fn {k, _v} -> Atom.to_string(k) |> String.starts_with?("int_") end)
      # Remove all fields that are nil or ints
      |> Enum.filter(fn {_k, v} -> !is_integer(v) and !is_nil(v) end)
      # Parse the values to int, returning nil if invalid
      |> Enum.map(fn {key, val} ->
        case Integer.parse(val) do
          :error ->
            {key, nil}

          {parsed, _} ->
            {key, parsed}
        end
      end)
      |> Map.new()
    
    # Merge the original map with the "sanitized" map
    Map.merge(map, ints_map)
  end

Another point is that this code will be ran in a lot of maps (5 to 10 millions maps) with on average 15 fields per map, so I have two questions:

Most important, can this code be simplified? From my perspective it look a little bit convoluted, I would appreciate any tips!
How one would optimize this function? I have wrote a simple benchmark using Benchee that gave me the following results:

Operating System: Linux
CPU Information: AMD Ryzen 7 2700X Eight-Core Processor
Number of Available Cores: 16
Available memory: 15.63 GB
Elixir 1.11.2
Erlang 23.2.3

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 7 s

Benchmarking parse_map_int_fields...

Name                           ips        average  deviation         median         99th %
parse_map_int_fields      857.59 K        1.17 μs  ±2149.16%        1.01 μs        2.12 μs

Code used for the benchmark:

defmodule Sanitizer do
  def parse_map_int_fields(map) do
    ints_map =
      map
      |> Enum.filter(fn {k, _v} -> Atom.to_string(k) |> String.starts_with?("int_") end)
      |> Enum.filter(fn {_k, v} -> !is_integer(v) and !is_nil(v) end)
      |> Enum.map(fn {key, val} ->
        case Integer.parse(val) do
          :error ->
            {key, nil}

          {parsed, _} ->
            {key, parsed}
        end
      end)
      |> Map.new()

    Map.merge(map, ints_map)
  end
end

test_map = %{
  int_a: "1",
  int_b: "not_a_int!",
  str_a: "This is a string field",
  str_b: "Another string field",
  fl_a: 0.0
}

Benchee.run(%{
  "parse_map_int_fields" => fn -> Sanitizer.parse_map_int_fields(test_map) end
})

Which isn’t all that bad for my use case, but in the interest of learning I would like to know if something could be done different.

Thanks to all!

kokolegorille · May 4, 2021, 12:19am

rmoretto:

      |> Enum.filter(fn {k, _v} -> Atom.to_string(k) |> String.starts_with?("int_") end)
      # Remove all fields that are nil or ints
      |> Enum.filter(fn {_k, v} -> !is_integer(v) and !is_nil(v) end)

You could simplify in one pass…

|> Enum.filter(fn {k, v} -> cond1(k) and cond2(v) end)

In fact, You could probably do all in one pass. And You should not filter, if later You will merge the filtered {k, v}

Maybe something like this, one iteration only…

Map.new(map, fn {_k, _v}=el ->
  if has_to_be_sanitized?(el), do: sanitize(el), else: el
end)
#
defp has_to_be_sanitized?({k, v}) do
  # check as in your filters...
end

defp sanitize({k, v}) do
  # ...
end

You can even replace with a private function this part.

defp maybe_reshape(el) do
  if has_to_be_sanitized?(el), do: sanitize(el), else: el
end

# This allows You to reduce to
Map.new(map, &maybe_reshape(&1))

You should not hesitate to split your pipeline into small, composable functions.

chungwong · May 4, 2021, 12:41am

You can give this a try, it is using tail recursion but not fully tested

  def parse(m) when is_map(m), do: parse(Map.to_list(m), [])
  def parse([], acc), do: Enum.into(acc, %{})
  def parse([{k, v} | t], acc), do: parse(t, prepend({to_string(k), v}, acc))

  def prepend({"int_" <> _ = k, v} ,acc) when is_binary(v), do: [{String.to_atom(k), to_integer(Integer.parse(v))} | acc]
  def prepend({k, v}, acc), do: [{String.to_atom(k), v} | acc]              

  def to_integer(:error), do: nil 
  def to_integer({parsed, _}), do: parsed                       

  Benchee.run(%{
    "parse" => fn -> Sanitizer.parse(test_map) end
  })

ericgray · May 4, 2021, 3:48am

Another option is to use reduce. You can map the key/value pairs into a new map to get the desired result. You can also break your code out into smaller functions that do one thing to make your code more readable. Many different ways to do it. Just a matter of taste.

  def parse_map_int_fields(map) do
    Enum.reduce(map, %{}, &do_parse_map_int_fields/2)
  end

  # Reconstruct or accumulate key value pairs into a new map
  def do_parse_map_int_fields({key, value}, map) do
    case starts_with_int?(key) do
      true ->
        Map.put( map, key, parse_value(value) )

      false ->
        Map.put( map, key, value )
    end
  end

  # Boolean function returns true if field is int
  def starts_with_int?(key) do
    key
    |> Atom.to_string()
    |> String.starts_with?("int_")
  end

  # Parses the value of int field, if not an integer return nil
  def parse_value(value) do
    case Integer.parse(value) do
      {value, _} -> value
      :error -> nil
    end
  end

amnu3387 · May 4, 2021, 5:34am

Reduce is fine and a single iteration:

@spec reshape_map_int_fields(map) :: map
def reshape_map_int_fields(map) when is_map(map),
  do: Enum.reduce(map, %{}, fn {k, v}, acc ->
        Map.put(acc, k, maybe_reshape(k, v))
      end)

@spec maybe_reshape(atom, term) :: integer | term
def maybe_reshape(k, v) when is_atom(k) and not is_integer(v) and not is_nil(v) do
  case Atom.to_string(k) do
    <<"int_", _>> ->
      case Integer.parse(v) do
        {new_val, _} -> new_val
        _ -> nil
      end
    _ ->
      v
  end
end

def maybe_reshape(_, v), do: v

gregvaughn · May 4, 2021, 4:02pm

I haven’t benchmarked, but you could do something with a for comprehension

iex(103)> map = %{
...(103)>     int_a: "1",
...(103)>     int_b: "not_a_int!",
...(103)>     other: "leave me alone"
...(103)> }
%{int_a: "1", int_b: "not_a_int!", other: "leave me alone"}
iex(104)> for {k, v} <- map, match?("int" <> _, Atom.to_string(k)), into: map do
...(104)>   case Integer.parse(v) do
...(104)>     {int, _} -> {k, int}
...(104)>     _other -> {k, nil}
...(104)>   end
...(104)> end
%{int_a: 1, int_b: nil, other: "leave me alone"}

rmoretto · May 4, 2021, 5:02pm

Thanks people for all the responses!

Indeed a lot of way to approach the problem, I will probably use the @amnu3387 response just because is a little bit faster, but as Eric said, is just a matter of taste, and all other options would be completely fine.

Thanks again for the help!

The benchmark results were:

Name                             ips        average  deviation         median         99th %
mnussbaumer_response        927.74 K        1.08 μs  ±2239.93%        0.96 μs        1.98 μs
kokolegorille_response      867.51 K        1.15 μs  ±2327.39%        1.00 μs        2.14 μs
gregvaughn_response         766.18 K        1.31 μs  ±2168.73%        1.15 μs        2.31 μs
ericgray_response           730.43 K        1.37 μs  ±1314.86%        1.26 μs        2.48 μs
chungwong_response          655.58 K        1.53 μs  ±1555.15%        1.36 μs        2.78 μs
parse_map_int_fields        567.61 K        1.76 μs  ±1491.38%        1.57 μs        3.18 μs

Comparison:
mnussbaumer_response        927.74 K
kokolegorille_response      867.51 K - 1.07x slower +0.0748 μs
gregvaughn_response         766.18 K - 1.21x slower +0.23 μs
ericgray_response           730.43 K - 1.27x slower +0.29 μs
chungwong_response          655.58 K - 1.42x slower +0.45 μs
parse_map_int_fields        567.61 K - 1.63x slower +0.68 μs

And the code used for the benchmark:

defmodule Sanitizer do
  def parse_map_int_fields(map) do
    ints_map =
      map
      |> Enum.filter(fn {k, _v} -> Atom.to_string(k) |> String.starts_with?("int_") end)
      |> Enum.filter(fn {_k, v} -> !is_integer(v) and !is_nil(v) end)
      |> Enum.map(fn {key, val} ->
        case Integer.parse(val) do
          :error ->
            {key, nil}

          {parsed, _} ->
            {key, parsed}
        end
      end)
      |> Map.new()

    Map.merge(map, ints_map)
  end

  def kokolegorille_response(map) do
    Map.new(map, fn {_k, _v} = el ->
      if has_to_be_sanitized?(el), do: sanitize(el), else: el
    end)
  end

  defp has_to_be_sanitized?({k, v}) do
    Atom.to_string(k) |> String.starts_with?("int_") and !is_integer(v) and !is_nil(v)
  end

  defp sanitize({k, v}) do
    case Integer.parse(v) do
      :error ->
        {k, nil}

      {parsed, _} ->
        {k, parsed}
    end
  end

  def chungwong_response(map) do
    parse(map)
  end

  def parse(m) when is_map(m), do: parse(Map.to_list(m), [])
  def parse([], acc), do: Enum.into(acc, %{})
  def parse([{k, v} | t], acc), do: parse(t, prepend({to_string(k), v}, acc))

  def prepend({"int_" <> _ = k, v}, acc) when is_binary(v),
    do: [{String.to_atom(k), to_integer(Integer.parse(v))} | acc]

  def prepend({k, v}, acc), do: [{String.to_atom(k), v} | acc]

  def to_integer(:error), do: nil
  def to_integer({parsed, _}), do: parsed

  def ericgray_response(map) do
    Enum.reduce(map, %{}, &do_parse_map_int_fields/2)
  end

  # Reconstruct or accumulate key value pairs into a new map
  def do_parse_map_int_fields({key, value}, map) do
    case starts_with_int?(key) do
      true ->
        Map.put(map, key, parse_value(value))

      false ->
        Map.put(map, key, value)
    end
  end

  # Boolean function returns true if field is int
  def starts_with_int?(key) do
    key
    |> Atom.to_string()
    |> String.starts_with?("int_")
  end

  # Parses the value of int field, if not an integer return nil
  def parse_value(value) do
    case Integer.parse(value) do
      {value, _} -> value
      :error -> nil
    end
  end

  def mnussbaumer_response(map) do
    reshape_map_int_fields(map)
  end

  @spec reshape_map_int_fields(map) :: map
  def reshape_map_int_fields(map) when is_map(map),
    do:
      Enum.reduce(map, %{}, fn {k, v}, acc ->
        Map.put(acc, k, maybe_reshape(k, v))
      end)

  @spec maybe_reshape(atom, term) :: integer | term
  def maybe_reshape(k, v) when is_atom(k) and not is_integer(v) and not is_nil(v) do
    case Atom.to_string(k) do
      <<"int_", _>> ->
        case Integer.parse(v) do
          {new_val, _} -> new_val
          _ -> nil
        end

      _ ->
        v
    end
  end

  def maybe_reshape(_, v), do: v

  def gregvaughn_response(map) do
    for {k, v} <- map, match?("int" <> _, Atom.to_string(k)), into: map do
      case Integer.parse(v) do
        {int, _} -> {k, int}
        _other -> {k, nil}
      end
    end
  end
end

test_map = %{
  int_a: "1",
  int_b: "not_a_int!",
  str_a: "This is a string field",
  str_b: "Another string field",
  fl_a: 0.0
}

Benchee.run(%{
  "parse_map_int_fields" => fn -> Sanitizer.parse_map_int_fields(test_map) end,
  "kokolegorille_response" => fn -> Sanitizer.kokolegorille_response(test_map) end,
  "chungwong_response" => fn -> Sanitizer.chungwong_response(test_map) end,
  "ericgray_response" => fn -> Sanitizer.ericgray_response(test_map) end,
  "mnussbaumer_response" => fn -> Sanitizer.mnussbaumer_response(test_map) end,
  "gregvaughn_response" => fn -> Sanitizer.gregvaughn_response(test_map) end
})

ericgray · May 4, 2021, 10:28pm

I like this approach. Simple elegant solution. Never thought about match?

eksperimental · May 5, 2021, 3:42am

you can optimize it even more:
def maybe_reshape(k, v) when is_atom(k) and is_binary(v) do

eksperimental · May 5, 2021, 3:59am

Additionally, IIRC correctly it could be faster to reduce into a keyword list and convert that to a map with :maps.from_list. You iterate twice the list though. I would like to see the benchmarks.

amnu3387 · May 5, 2021, 4:03am

Yeap, you beat me to it. I think that if I was writing this I would probably do:

@spec reshape_map_int_fields(map) :: map
def reshape_map_int_fields(map) when is_map(map),
  do: Enum.reduce(map, %{}, fn {k, v}, acc ->
        Map.put(acc, k, maybe_reshape(k, v))
      end)

@spec maybe_reshape(atom | String.t, term) :: integer | nil | term
def maybe_reshape(k, v) when is_atom(k) and is_binary(v),
  do: maybe_reshape(Atom.to_string(k), v)

def maybe_reshape(<<"int_", _>>, v) when is_binary(v) do
  case Integer.parse(v) do
     {new_val, _} -> new_val
      _ -> nil
  end
end

# perhaps add a special clause for when it's still `"int_"` but not binary, not nil
# and not integer to error out, or nillify but that would depend

def maybe_reshape(_, v), do: v

This would have the benefit of also converting binary keys besides atom ones.

eksperimental · May 5, 2021, 4:10am

you can make it into a defp and name the function differently and you will not need to run the is_binary(v) check since it was checked in the previous call.

amnu3387 · May 5, 2021, 4:15am

Yeap, but then only works for atom keyed maps - that version was to allow binary keys too - but good pointers

eksperimental · May 5, 2021, 5:33am

Here’s my improved version built on top of @amnu3387 's version.


defmodule Sanitizer do
  @spec parse_map_int_fields(map) :: map
  def parse_map_int_fields(map) when is_map(map) do
    Enum.reduce(map, [], fn
      {k, v}, acc when is_atom(k) and is_binary(v) ->
        [{k, transform_v(Atom.to_string(k), v)} | acc]

      k_v, acc ->
        [k_v | acc]
    end)
    |> :maps.from_list()
  end

  defp transform_v(<<"int_", _>>, v) do
    case Integer.parse(v) do
      {v_int, ""} -> v_int
      _ -> nil
    end
  end

  defp transform_v(_k_string, v) do
    v
  end
end

I use a list to build the new map, and then convert it into a map. Could be faster or not depending on how many items get updated from the original map.

Check out this new test_map. If you really want to check on a full representation of an integer, you need to pattern match on the remainder, to be ""

test_map = %{
  int_a: "1",
  int_b: "not_a_int!",
  int_c: "1.2",
  int_d: "1x",
  int_x: 42,
  str_a: "This is a string field",
  str_b: "Another string field",
  fl_a: 0.0,
  
}

Sanitizer.parse_map_int_fields(test_map)

iex(5)> Sanitizer.parse_map_int_fields(test_map)
%{
  fl_a: 0.0,
  int_a: 1,
  int_b: nil,
  int_c: nil,
  int_d: nil,
  int_x: 42,
  str_a: "This is a string field",
  str_b: "Another string field"
}

rmoretto · May 5, 2021, 12:10pm

I agree with you that to be more precise with the parsing the "" remainder should be match, but in my use case is better that any integer is saved than none.

I have run the benchmark with the code that you shared, only changing the line {v_int, ""} to {v_int, _} to match the other functions, here are the results:

Name                             ips        average  deviation         median         99th %
mnussbaumer_response        960.24 K        1.04 μs  ±2517.65%        0.94 μs        1.80 μs
gregvaughn_response         868.02 K        1.15 μs  ±1919.57%        1.03 μs        2.04 μs
kokolegorille_response      827.66 K        1.21 μs  ±2517.58%        1.09 μs        1.65 μs
ericgray_response           724.15 K        1.38 μs  ±1604.94%        1.26 μs        2.50 μs
chungwong_response          656.80 K        1.52 μs  ±1600.17%        1.38 μs        2.71 μs
parse_map_int_fields        623.51 K        1.60 μs  ±1424.02%        1.45 μs        2.52 μs
eksperimental_response      612.94 K        1.63 μs  ±1509.09%        1.47 μs        2.92 μs

Comparison:
mnussbaumer_response        960.24 K
gregvaughn_response         868.02 K - 1.11x slower +0.111 μs
kokolegorille_response      827.66 K - 1.16x slower +0.167 μs
ericgray_response           724.15 K - 1.33x slower +0.34 μs
chungwong_response          656.80 K - 1.46x slower +0.48 μs
parse_map_int_fields        623.51 K - 1.54x slower +0.56 μs
eksperimental_response      612.94 K - 1.57x slower +0.59 μs

I guess that the :maps.from_list() is too expensive to be worth in this specific case?

dimitarvp · May 5, 2021, 6:42pm

Naive solution on a first try:

defmodule IntMapSanitizer do
  @test_map_1 %{
    greet: "Hirrow",
    int_a: "17",
    int_b: "gibberish",
    int_c: "428",
    farewell: "Aufwiedersehen"
  }

  defp is_int_name?(k) when is_atom(k) do
    k
    |> Atom.to_string()
    |> String.starts_with?("int_")
  end

  defp should_keep?(v) when is_nil(v) or is_integer(v), do: false
  defp should_keep?(_), do: true

  defp parse_to_int_or_nil(v) do
    case Integer.parse(v) do
      :error ->
        nil

      {parsed, _} ->
        parsed
    end
  end

  def sanitize(%{} = map) do
    for {k, v} <- map, is_int_name?(k), should_keep?(v), reduce: map do
      acc -> Map.update(acc, k, nil, &parse_to_int_or_nil(&1))
    end
  end

  def test1() do
    sanitize(@test_map_1)
  end
end

Try in iex:

iex(1)> IntMapSanitizer.test1()
%{
  farewell: "Aufwiedersehen",
  greet: "Hirrow",
  int_a: 17,
  int_b: nil,
  int_c: 428
}

Haven’t measured but for comprehensions with filters and the reduce option have always been fairly fast in my experience.

Maybe if you make a public GitHub repo then we can make a benchmarking arena.