CharlesO

Parse / Pattern match large binary data

I have read through the thread binary-pattern-matching-when-input-is-a-stream on a related subject, but I’m not able to make any headway.

I get this error trying to load and parse a large binary directly:

iex> JQL.save "--1566364097905", "STMT.ENTRY"
eheap_alloc: Cannot allocate 18446744071692551144 bytes of memory (of type "heap").

Crash dump is being written to: erl_crash.dump...

I’m doing an ETL (extract - transform - load) for a reporting project, moving data from a proprietary platform into SQL Server for easier reporting.

The parser has worked fine until I hit this size issue.

How can we use streams to handle this kind of situation, particularly when the existing parser was not built with streaming in mind?

JQL Parser: https://gist.github.com/CharlesOkwuagwu/4c6c89d96db7876bc0d27fecd518340e (updated)

Sample large file (~850mb unzipped): https://paperlesssolutionsltd.com.ng/java/--1566364097905.7z

Thanks.

Marked As Solved

CharlesO

Found a really simple way to solve this. Just read the data-header, then parse row by row:

  def read(src, cnt \\ nil) do
    {:ok, fid} = :file.open("_cache/#{src}", [:raw, :read, :binary])
    {:h, start, n} = _p(fid, :h)

    cnt = cnt || n

    Logger.debug("Reading: #{cnt} of #{n} rows")

    Process.put(:start_pos, start)

    for i <- 1..cnt do
      {:r, start, row} = _p(fid, Process.get(:start_pos), :r)
      Process.put(:start_pos, start)
      Logger.debug(inspect({i, row}, @format))
    end

    :file.close(fid)
  end

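  # READER
  # Header: 37 bytes in the sample; the last 4 bytes (offset 33) hold the row count.
  defp _p(fid, :h) do
    # 000007775B4C4900000136490000000049FFFFFFFF4C490000006749000000004900000005
    {:ok, <<n::32>>} = :file.pread(fid, 33, 4)
    # Rows start immediately after the 37-byte header.
    {:h, 37, n}
  end

  # Row: 12-byte prefix, 4-byte big-endian payload length, then the payload.
  defp _p(fid, start, :r) do
    # 490000000049000000006A62 00000165 3136333737303....
    {:ok, <<l::32>>} = :file.pread(fid, start + 12, 4)
    {:ok, b} = :file.pread(fid, start + 16, l)

    # The next row starts 18 + l bytes on: 16 bytes before the payload, plus 2 after it.
    {:r, start + l + 18, _split(b)}
  end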

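  defp _split(v) do
    # Keep printable ASCII plus the three marker bytes 252, 253 and 254.
    b = for(<<c <- v>>, c in 32..126 or c in [252, 253, 254], into: "", do: <<c>>)
    # 253 and 252 become ";" and "^"; 254 separates the fields of a row.
    b = :binary.replace(b, <<253>>, ";", [:global])
    b = :binary.replace(b, <<252>>, "^", [:global])
    :binary.split(b, <<254>>, [:global])
  end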

Also Liked

NobbZ

But the error message you posted claims that it wants to allocate another 18446744071692551144 bytes, which is 18014398507512256 KiB, 17592186042492 MiB, 17179869182 GiB, 16777215 TiB, 16383 PiB, so ~16 EiB…

As you can see @hauleth even missed by a factor of ~100…

There might be plenty of reasons why such a huge amount of memory might be allocated…
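
The unit conversion above can be checked with a few lines of Elixir. The sketch below uses a hypothetical `AllocSize` module and also reinterprets the same 64 bits as a signed integer. The requested size is exactly 2^64 − 2017000472, which is consistent with a negative size having been treated as unsigned, although nothing in the thread confirms that this is what happened.

```elixir
defmodule AllocSize do
  # Hypothetical helper: truncating conversion to a power-of-1024 unit.
  def in_unit(bytes, power), do: div(bytes, Integer.pow(1024, power))

  # Reinterpret an unsigned 64-bit value as a signed 64-bit integer.
  def as_signed64(bytes) do
    <<signed::signed-64>> = <<bytes::unsigned-64>>
    signed
  end
end

bytes = 18_446_744_071_692_551_144

for {unit, power} <- [{"KiB", 1}, {"MiB", 2}, {"GiB", 3}, {"TiB", 4}, {"PiB", 5}] do
  IO.puts("#{AllocSize.in_unit(bytes, power)} #{unit}")
end

# Prints -2017000472, i.e. roughly -1.9 GiB when read as a signed value.
IO.inspect(AllocSize.as_signed64(bytes))
```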

hauleth

No surprise there, as I assume that you do not have 163 PB of RAM available on your machine. It seems like you want to load a lot of data at once. Maybe try to split it into reasonable chunks?

NobbZ

Does the problem persist if you remove this line?

https://gist.github.com/CharlesOkwuagwu/4c6c89d96db7876bc0d27fecd518340e#file-jql-ex-L102

You are not even using the result of term_to_binary… And unless compressed, it will always require more memory than the input.
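
A small illustration of that point, using a made-up repetitive binary rather than the real data: the external term format adds a few bytes of framing around a binary, so the encoded copy is larger than the input unless `:compressed` is passed.

```elixir
# Made-up sample input; the real file is far larger.
input = :binary.copy("STMT.ENTRY;", 10_000)

plain = :erlang.term_to_binary(input)
packed = :erlang.term_to_binary(input, [:compressed])

IO.inspect(byte_size(input), label: "input")
IO.inspect(byte_size(plain), label: "term_to_binary")
IO.inspect(byte_size(packed), label: "compressed")
```

Either way the call builds a second binary alongside the original, so on an ~850 MB input an unused `term_to_binary` roughly doubles peak memory for no benefit.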
