How to speed up this custom file parser code?

hristonev · January 18, 2018, 12:36pm

Hi, guys
In other forum we test different languages. My project is Elixir based.
Task: Custom file parser. Similar to Markdown.

line starts with # header (1…6)
empty line block (paragraph)
In the begining of the output file ToC linked to headers
Links line start with [txt] url
Links in text [txt] replaced by link generated above
/text become bold/italic

My code exec time is ~450ms (i7,ssd):
Are you see some slow blocks/elements?
Avg time for reading file 12MB is 90/100ms

defmodule Markdown do
  def parse(source_file, target_file) do
    source = File.read!(source_file)
      |> String.splitter("\n")
      |> Enum.to_list

    links_worker = Task.async(fn ->
      Enum.reduce(source, %{}, fn(i, acc) ->
        ltr = String.slice(i, 0, 1)
        if ltr == "[", do: Map.merge(acc, form_link(i)), else: acc
      end)
    end)

    html = source
      |> Stream.transform(
        fn() -> %{ tag: "", html: "", header: [], line: 0 } end,
        fn(i, acc) ->
          ltr = String.slice(i, 0, 1)
          if(ltr == "[") do
            {[nil], acc}
          else
            data = parse_line(ltr, i, acc)
            {[data.html], data}
          end
        end,
        fn(acc) ->
          toc = acc.header
            |> Enum.reduce("", fn(x, acc) -> acc <> "<li>#{x}</li>" end)
          html = "<!DOCTYPE html><html lang=\"zxx\"><head><title>Parser output</title><meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\"/></head><body><ul>#{toc}</ul>"

          File.write(target_file, html, [:delayed_write])
          acc
        end)
      |> Enum.reduce("", fn(i, acc) -> if !is_nil(i), do: acc <> i, else: acc end)
      |> String.split(~r/\*/)
      |> Stream.transform(1, fn i, acc ->
          if rem(acc, 2) == 0, do: {["<b>#{i}</b>"], acc + 1}, else: {[i], acc + 1}
        end)
      |> Enum.join
      |> String.splitter(["[", "]"])
      |> Enum.to_list
      |> (fn(html, worker) ->
          links = Task.await(worker)
          place_links(html, links)
        end).(links_worker)
      |> Enum.join

    File.write(target_file, html, [:append, :delayed_write])
  end

  def place_links([], _) do
    ["</body></html>"]
  end

  def place_links([head | tail], links) do
    html = head
      |> (fn x, links -> if Map.has_key?(links, head), do: Map.get(links, head), else: x end).(links)

    [html] ++ place_links(tail, links)
  end

  defp form_link(src) do
    [_, txt, url] = String.splitter(src, ["[", "]"]) |> Enum.to_list

    Map.put(%{}, txt, "<a href=\"#{url}\">#{txt}</a>")
  end

  # Block start/end
  defp parse_line(start, _, acc) when start == "" do
    html = ""
      |> (fn(_,acc) -> if String.length(acc.tag) > 0, do: "</#{acc.tag}>" end).(acc)
    %{ tag: "", html: html, header: acc.header, line: acc.line + 1 }
  end

  # Header
  defp parse_line(start, line, acc) when start == "#" do
    {tag_num, content} = head_num(line)

    link = "<a href=\"#_#{acc.line}\" id=\"#{acc.line}\">#{content}</a>"
    html = "<h#{tag_num}>#{content} <a href=\"##{acc.line}\" id=\"_#{acc.line}\">&#8683</a></h#{tag_num}>"

    %{ tag: "h#{tag_num}", html: html, header: acc.header ++ [link], line: acc.line + 1 }
  end

  # Paragraph
  defp parse_line(_, line, acc) do
    html = line
      |> (fn(line, acc) -> if acc.tag != "p", do: "<p>" <> line, else: line end).(acc)
    %{ tag: "p", html: html, header: acc.header, line: acc.line + 1 }
  end

  # Match header tag
  defp head_num("######" <> rest), do: {6, String.trim(rest)}
  defp head_num("#####" <> rest), do: {5, String.trim(rest)}
  defp head_num("####" <> rest), do: {4, String.trim(rest)}
  defp head_num("###" <> rest), do: {3, String.trim(rest)}
  defp head_num("##" <> rest), do: {2, String.trim(rest)}
  defp head_num("#" <> rest), do: {1, String.trim(rest)}
end

[source, target] = System.argv
if File.exists?(source) do
  :timer.tc(Markdown, :parse, [source, target])
  |> IO.inspect
else
  IO.puts "Source file does not exist"
end

Test source file

NobbZ · January 18, 2018, 1:04pm

Your indentation is off, which makes it a bit harder to follow the code.

But what I do see immediately, is that you eager-load the file into a string and then split it into lines and then convert it into a list. You should use File.stream!/3 to open the file and then process this stream.

Also I do not understand what you are using a Task for.

In general your code is looking unnecesarily complex, uses a lot of anonymous functions and doesn’t seem to have much structure.

You should try to implement a proper tokenizer first. While collecting the tokens, it should also easy doable to write collected headlines and URLs into an ETS.

After you have your tokens, you can transcribe them in a nearly 1:1 fashion into HTML, creating the TOC from ETS and also the link targets.

I’ll try to take a closer look this evening.

hristonev · January 18, 2018, 1:14pm

Thanks for your review. My first version was File.stream!/3. Unfortunately this is slower approach. Task is for async build of links. Link replacement is in the end of the parse, and building can be async… I’m new to Elixir and still learning from imperative programming. I’m not sure what you mean with “proper tokenizer first”

NobbZ · January 18, 2018, 1:26pm

Perhaps related to stuff going on deeper in your code. But without seeing that version I’m not sure…

Well, a tokenizer (here) would take a String and convert it into a list or stream of tokens.

eg:

tokens = tokenize("*foo*")
#=> [:bold_start, {:word, "foo"}, :bold_end]
html = to_html(tokens)
#=> "<b>foo</b>"

Writing such a tokenizer in a proper way is not an easy task and requires some knowledge about finite state machines. Its easier using tools like leex, but I do fear, that leex is incapable of tokenizing this format, as a single asterisk or slash can have different meanings depending on the context.

If though it is guaranteed that there is never slashes and stars interleaved or combined and also do not span multiple lines, leex could do the job.

Also I do not think that collecting link targets via a Task gains you anything. The task has to skip all the other stuff, while your regular “parser” has to skip all the link target definitions. Then at the end they have to wait for each other.

Its better to have a single process walking and tokenizing the input, while keeping track of defined links in an ETS or if you prefer a pure style a map passed around through the invocations of the tokenizer (which will again make using leex impossible).

And you said, you are doing this in another forum, may I ask which one? I’d like to take a look at other implementations as well.

hristonev · January 18, 2018, 1:33pm

Sure here is theme. Do not know google translate can help you from Bulgarian. There are implementation in Assembly, C, C++, Java, Lua, Perl 6. My tag is same as here .

gon782 · January 18, 2018, 1:40pm

http://forums.bgdev.org/index.php?act=Attach&type=post&id=295293

This script generates the markdown that’s supposed to be parsed, correct?

Edit:

As per this post http://forums.bgdev.org/index.php?act=Attach&type=post&id=295293:

Току що получих от Sun последната версия на генератора и го пускам като прикачен файл.
Лично аз нямам забележки към генерираният код. Ако съм пропуснал нещо, казвайте.

Работи така:

generator4.pl 10000 1500 > test.md

Първият аргумент е броя на параграфите които ще се генерират.
Вторият е броя на връзките.
Извежда на стандартният изход, така че ако ни трябва файл, трябва да го пренасочим.

В прикаченият файл съм сложил и примерен файл, генериран при параметри 15 5

Roughly:

"As soon as I received the last version of the generator from Sun [forum user] I also(?) added it as an attached file. I don’t personally have any notes on the generated code. If I’ve missed something, tell me.

It works like this:

 generator4.pl 10000 1500 > test.md

The first argument is the number of paragraphs it should generate. Second is the amount of links. It outputs to standard out, so if we need it in a file we have to redirect it.

In the attached file is an example file generated with parameters 15 5."

Edit: Translated by free hand, there may be issues.

hristonev · January 18, 2018, 1:44pm

Yep. I work with one large and one small file, never used generator:slight_smile:

hristonev · January 19, 2018, 6:41am

Here is what I understand . Bottleneck is file IO it is very slow on Elixir. 100ms reading file in List. Next is link parsing there are 2 parts[quote=“hristonev, post:1, topic:11710”]
links_worker = Task.async(fn ->
[/quote]

Collect links from source in a map. Map is for faster access and search. Second part start at[quote=“hristonev, post:1, topic:11710”]
|> String.splitter(["[", “]”])
[/quote]

For example I have implemented this in PHP. Overall time is 4x+ faster ~100ms. Reading file into memory as Array took 13ms.