String.split(string, " ", trim: true) performance boost using filter

While looking for performance bottlenecks I stumbled onto this, which is quite surprising.

  # Assumes :benchee and :benchee_html are in the project's deps.
  # The module name SplitBench is only a placeholder for illustration.
  defmodule SplitBench do
    @some_string "1632767814.9532351 00 0  0 1  0 0 0 0 0 0 0 0 0 0  00  00 0 00 00 00 00 00 0e 01 14 3c 3c 96 20 99 ba 1e 10 80 93 a7 1d 00 00 01 00 00 00 00 00 00 00 00 00 00 00 0  0  0"

    def bench_string_split_trim(n) do
      # Variant 1: let String.split/3 drop the empty strings.
      fun1 = fn ->
        Enum.map(1..n, fn _x ->
          String.split(@some_string, " ", trim: true)
        end)
      end

      # Variant 2: plain split, then drop the empty strings with Enum.filter/2.
      fun2 = fn ->
        Enum.map(1..n, fn _x ->
          @some_string |> String.split(" ") |> Enum.filter(fn x -> x != "" end)
        end)
      end

      Benchee.run(
        %{
          "vanilla" => fun1,
          "filter" => fun2
        },
        formatters: [
          {Benchee.Formatters.HTML, file: "samples_output/my.html"},
          Benchee.Formatters.Console
        ]
      )
    end
  end

Replacing trim: true with an explicit Enum.filter/2 more than doubles the performance in this benchmark. Just my 2 cents, might be handy for someone out there.
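Before swapping one for the other, it is worth checking that both variants return the same list. A minimal sketch, using a shortened, made-up sample string rather than the one from the benchmark:

```elixir
# Shortened, hypothetical sample with runs of consecutive spaces.
sample = "1632767814.95 00 0  0 1  0e  0  0"

with_trim = String.split(sample, " ", trim: true)

without_trim =
  sample
  |> String.split(" ")
  |> Enum.filter(fn part -> part != "" end)

# Both approaches drop the empty strings produced by consecutive separators.
IO.inspect(with_trim == without_trim)
# → true
```

This only compares the results; it says nothing about timing, which is what the Benchee run above measures.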

cheers!

Nice find! I think it would make sense to open an issue on Elixir’s issue tracker. Seems like there is room for a performance improvement in the standard library.

I think that’s related to the fact that String.trim does not simply check for empty strings after removing spaces; it probably deals with Unicode spaces as well and this is where the performance difference is likely coming from.

Good suggestion. I guess it’s worth mentioning: Potential performance improvement, filter instead of trim: true · Issue #11398 · elixir-lang/elixir · GitHub

I think this option has nothing to do with whitespace… It is a bit confusing, I know. From the docs:

* `:trim` (boolean) - if `true`, empty strings are removed from
  the resulting list.
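To illustrate the quoted docs, here is a small sketch (the inputs are made up for the example) showing that trim: true only drops empty strings and leaves whitespace inside the parts untouched:

```elixir
# Empty strings are dropped, including leading and trailing ones.
IO.inspect(String.split(" a  b ", " ", trim: true))
# → ["a", "b"]

# Other whitespace is left alone: the tab after "b" survives.
IO.inspect(String.split("a b\t", " ", trim: true))
# → ["a", "b\t"]

# With a different separator, spaces around the parts are kept.
IO.inspect(String.split("a , b,,", ",", trim: true))
# → ["a ", " b"]
```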

Yep, I got tripped on the naming then, sorry.

The issue "Potential performance improvement, filter instead of trim: true · Issue #11398 · elixir-lang/elixir · GitHub" has been closed and is already fixed in a later version. Express delivery by @josevalim :sunny:

Don’t blame yourself for that! Look at the options for :binary.split/3 (Erlang -- binary):

trim

Removes trailing empty parts of the result (as does trim in re:split/3).

trim_all

Removes all empty parts of the result.
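The difference between the two Erlang options can be seen from Elixir with a short sketch (the input string is made up for the example):

```elixir
input = "a  b  "

# No trimming: every separator produces a part, empty or not.
IO.inspect(:binary.split(input, " ", [:global]))
# → ["a", "", "b", "", ""]

# :trim removes only the trailing empty parts.
IO.inspect(:binary.split(input, " ", [:global, :trim]))
# → ["a", "", "b"]

# :trim_all removes every empty part, which matches what Elixir's trim: true does.
IO.inspect(:binary.split(input, " ", [:global, :trim_all]))
# → ["a", "b"]
```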

If anything, it is something that should be fixed upstream in OTP.