Cldr.Number.Parser.parse quite slow?

Phillipp · September 25, 2020, 1:31pm

Hey,

I have an application that loads data from an XML API, parses that response using XmlToMap and then iterates over the parsed data to transform it into maps and generally a better structure for future processing.

Part of that transformation is to parse the values I get. They can be booleans, text, numbers and timestamps (also just numbers tho).

Here is what I did:

  def parse_value("true"), do: true
  def parse_value("True"), do: true
  def parse_value("TRUE"), do: true
  def parse_value("false"), do: false
  def parse_value("False"), do: false
  def parse_value("FALSE"), do: false
  def parse_value(value) do
    case Cldr.Number.Parser.parse(value) do
      {:ok, number} -> number
      _ -> value
    end
  end

Yes, not the prettiest, but it did the job. Unfortunately, Cldr.Number.Parser.parse slows everything down: the entire transformation takes 5 seconds on my dev system and 16 seconds on a Raspberry Pi. This is pretty bad, because my plan was to fetch the data every 5 seconds and store it in an InfluxDB database.

Is there a way to improve the performance of Cldr.Number.Parser.parse, or can I use something else? I chose Cldr.Number.Parser.parse because I can just throw stuff at it and it returns a number, no matter if it's an integer, a float, or whatever.

Comparison:

iex(23)> :timer.tc(MyApp, :parse_value, ["true"])
{3, true}

iex(21)> :timer.tc(MyApp, :parse_value, ["1"])
{8520, 1}

This function goes over my data map and returns a huge list of tuples.

iex(20)> :timer.tc(MyApp, :generate_influx_tuples, [data])
{584, [...]}
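
A single :timer.tc call can be noisy, so averaging over many iterations gives a steadier figure when comparing parsers. The following is only a sketch: the module name BenchSketch and the function average_us are made up for illustration and are not part of the application above.

```elixir
defmodule BenchSketch do
  # Hypothetical helper (not from the original post): runs `fun`
  # `iterations` times and returns the average time per call in
  # microseconds, as measured by :timer.tc/1.
  def average_us(fun, iterations \\ 10_000)
      when is_function(fun, 0) and is_integer(iterations) and iterations > 0 do
    {total_us, _result} =
      :timer.tc(fn ->
        Enum.each(1..iterations, fn _ -> fun.() end)
      end)

    # `/` always returns a float in Elixir
    total_us / iterations
  end
end
```

For example, BenchSketch.average_us(fn -> MyApp.parse_value("1") end) would report the average cost of one call instead of a single sample.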

kip · September 25, 2020, 6:24pm

@Phillipp, I’ll take a look (I’m the author) and see if there is anything I can do to improve the performance. Feel free to add an issue so I can track it properly as well.

Note, however, that Cldr.Number.Parser.parse/2 is probably not the best tool for this job. If you know you are only receiving data that has no localisations in it (no separators, no localised decimal digits, …) then using the standard library tools would be better. For example, the following code parses both integers and floats that have no formatting in them (no localisations) in about 1.46 μs versus 1.01 μs for Float.parse/1:

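def parse_number(x) do
  # Try an integer first; accept it only if the whole string was consumed
  case Integer.parse(x) do
    {integer, ""} ->
      integer

    _other ->
      # Fall back to a float; otherwise return the input unchanged
      case Float.parse(x) do
        {float, ""} -> float
        _other -> x
      end
  end
end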

Cldr.Number.Parser.parse/2 is designed to be quite resilient in the face of localised and formatted numbers and that means there is definitely more work going on. Its typical use is to enhance user experience when parsing user-provided text input.
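
As a rough sketch, the standard-library approach above can be combined with the boolean clauses from the original post. The module name ParseSketch is hypothetical, and this assumes the input never contains localised formatting:

```elixir
defmodule ParseSketch do
  # Hypothetical module name, for illustration only.
  # Same boolean spellings as in the original post.
  def parse_value(value) when value in ["true", "True", "TRUE"], do: true
  def parse_value(value) when value in ["false", "False", "FALSE"], do: false
  def parse_value(value) when is_binary(value), do: parse_number(value)

  # Accept a number only when the whole string was consumed;
  # anything else is returned unchanged as text.
  defp parse_number(value) do
    case Integer.parse(value) do
      {integer, ""} ->
        integer

      _other ->
        case Float.parse(value) do
          {float, ""} -> float
          _other -> value
        end
    end
  end
end
```

With this sketch, ParseSketch.parse_value("1") returns the integer 1 and ParseSketch.parse_value("1.5") returns the float 1.5, while a formatted string such as "1,234" comes back unchanged as text because no locale-aware parsing is attempted.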

kip · September 25, 2020, 7:08pm

I did some more digging here and the issue isn’t primarily number parsing. It’s related to repeatedly parsing the default locale, which happens when no locale is supplied and none was set with Cldr.put_locale/1.

You can probably get a speed up of 40x with the following addition to your original code:

  # Assumes you are only using one locale
  @locale Cldr.Locale.new!("en", MyApp.Cldr)

  def parse_value(value) do
    case Cldr.Number.Parser.parse(value, locale: @locale) do
      {:ok, number} -> number
      _ -> value
    end
  end

I will fix the underlying issue here and publish a new version of ex_cldr.

Nevertheless, my suggestions in the previous message still hold.

kip · September 25, 2020, 8:38pm

I have published ex_cldr version 2.17.1 which improves your example code by about 40x. The changelog entry is:

Bug Fixes

Significantly improve the performance of Cldr.default_locale/0. In previous releases, the default locale was being parsed on each access. In this release it is parsed once and cached in the application environment. This improves performance by about 40x. Thanks to @Phillipp who brought this to attention in Elixir Forum.

LostKobrakai · September 26, 2020, 8:33am

Wouldn’t the process dict fit better here? It’s not external to the computation and should therefore not experience race conditions.

kip · September 26, 2020, 8:53am

In this specific case it’s the system-wide default, so the process dictionary isn’t really suitable. The system-wide default has always been in the app environment, but as a binary and therefore parsed on each access. The only change is to now also store the parsed version, which saves about 1 ms per access. That is a really big win.
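
The general pattern of keeping a raw setting in the application environment and caching its parsed form next to it can be sketched as follows. This is not the actual ex_cldr implementation: the module CacheSketch, the app name :cache_sketch, the keys, and the stand-in parser are all assumptions made for illustration.

```elixir
defmodule CacheSketch do
  # Hypothetical app name and keys, for illustration only.
  @app :cache_sketch

  # Store the raw binary setting and drop any stale parsed copy.
  def put_raw(value) when is_binary(value) do
    Application.put_env(@app, :raw_setting, value)
    Application.delete_env(@app, :parsed_setting)
  end

  # Return the parsed setting, parsing the raw value only when no
  # cached copy exists yet.
  def parsed do
    case Application.fetch_env(@app, :parsed_setting) do
      {:ok, parsed} ->
        parsed

      :error ->
        parsed = expensive_parse(Application.fetch_env!(@app, :raw_setting))
        Application.put_env(@app, :parsed_setting, parsed)
        parsed
    end
  end

  # Stand-in for an expensive parse; a real locale parser does far more.
  defp expensive_parse(raw), do: String.split(raw, "-")
end
```

If two processes hit the empty cache at the same time, both parse the same raw value and store the same result, so the overlap only costs a duplicate parse.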

Phillipp · September 26, 2020, 10:24am

Hey @kip,

thanks for your investigation and recommendations. You are right that it is probably better to use the standard library for this job. I’ve tested the new version of the package and my code execution went from around 15 seconds to around 1.1 seconds (including 2 HTTP requests over the local network + XML parsing).

Glad that this led to a performance improvement in your (awesome) library.

Phillipp