Parse / Pattern match large binary data

I have read through this thread: binary-pattern-matching-when-input-is-a-stream on a related subject, but I’m not able to make any headway.

I get this error trying to load and parse a large binary directly:

iex> JQL.save "--1566364097905", "STMT.ENTRY"
eheap_alloc: Cannot allocate 18446744071692551144 bytes of memory (of type "heap").

Crash dump is being written to: erl_crash.dump...

I’m doing an ETL (extract - transform - load) for a reporting project, moving data from a proprietary platform into SQL Server for easier reporting.

The parser has worked fine until I hit this size issue.

Can we use streams to handle this kind of situation, and if so, how?
(Particularly when the existing parser was not built with streaming in mind.)

JQL Parser: https://gist.github.com/CharlesOkwuagwu/4c6c89d96db7876bc0d27fecd518340e (updated)

Sample large file (~850mb unzipped): https://paperlesssolutionsltd.com.ng/java/--1566364097905.7z

Thanks.

No surprise there, as I assume that you do not have 163 PB of RAM available on your machine. It seems like you want to load a lot of data at once. Maybe try to split it into reasonable chunks?

3 Likes

I’m just loading 853 MB…

I have 16 GB of RAM.

But the error message you posted claims that it wants to allocate another 18446744071692551144 bytes, which is 18014398507512256 KiB, 17592186042492 MiB, 17179869182 GiB, 16777215 TiB, 16383 PiB, so ~16 EiB…

As you can see, @hauleth even missed by a factor of ~100…

There might be plenty of reasons why such a huge amount of memory is being allocated…

4 Likes

Very strange.

I’m at a loss.

I’ve uploaded the code and the exact file I’m trying to process.

Does the problem persist if you remove this line?

You are not even using the result of term_to_binary… And unless compressed, it will always require more memory than the input.

3 Likes

That line was just for testing something else. Removing it still produces allocation errors, but smaller ones:

iex> JQL.save "--1566364097905", "STMT.ENTRY"
eheap_alloc: Cannot allocate 1898305688 bytes of memory (of type "heap").

Crash dump is being written to: erl_crash.dump...
  def save(src, name) do
    {:ok, <<len::32, "[", _::bits>> = bin} = :file.read_file("_cache/#{src}")
    {:ok, f} = :file.open("_cache/#{name}.bin", [:raw, :binary, :write, :delayed_write])
    m = _r(binary_part(bin, 5, len - 2), [])
    # b = :erlang.term_to_binary(m)
    :done
  end

I’m able to read the file, but not parse it.

This is a smaller sample of using the parser to handle just a few kilobytes:

def decode(hex), do: _r(Base.decode16!(hex))

iex> JQL.decode "000007775B4C4900000136490000000049FFFFFFFF4C490000006749000000004900000005490000000049000000006A62000001653136333737303030313335323832312E303130303031FE3136333737303030313335323832312E303130303031FE3030303030303030303139373339FE4E4730303130303031FE333439312E3030FE323133FE414C4C4F434154494F4EFE414C4C4F434154494F4EFEFE31FE31FE35303032FE3230313231313031FEFE465431323330363030303636FEFE3230313231313031FE31FE33FE465431323330363030303636FE4654FE3230313231313031FE3136333737303030313335323832312E3031FD312D32FEFEFE31FE31323131303131343430FE4E474EFE333439312E3030FEFE4445424954FEFE3230313231313031FE3230313231313031FEFEFEFEFEFEFEFEFEFEFEFEFEFE41432E312E54522E4E474E2E353030322E312E353032302E343930302E4E472E4E472E313030302D312E312E2E2E2E2E4E4730303130303031FEFE44454641554C54FE4E4730303130303031FEFEFE3230313231313031FE31FEFE3B3B490000000049000000006A620000014D3136393233303030303936333935342E303130303032FE3136393233303030303936333935342E303130303032FE4E474E31313236303030303130303031FE4E4730303130303031FE2D3439343534392E3031FE323133FEFEFEFEFE31FE3131323630FE3230313430343239FEFE465431343131393030303036FEFE3230313430343239FE31FE37FE465431343131393030303036FE4654FE3230313430343239FE3136393233303030303936333935342E3031FD312D32FEFEFE31FE31343035303131373435FE4E474EFE2D3439343534392E3031FEFE4445424954FEFE3230313430343239FE3230313430343239FEFEFEFEFEFEFEFEFEFEFEFEFEFE41432E312E54522E4E474E2E31313236302E372E2E2E2E2E313030302D312E2E2E2E2E2E4E4730303130303031FEFE44454641554C54FE4E4730303130303031FEFEFE3230313430343239FE31FEFE3B3B490000000049000000006A62000001783136393834303030303534313938352E303230303032FE3136393834303030303534313938352E303230303032FE3030303030303030303231303234FE4E4730303130303031FE2D36393032322E3636FE323133FE434F4C4C454354494F4EFE434F4C4C454354494F4EFEFE31303030FE31FE35303033FE3230313430363330FEFE465431343138323030313432FEFE3230313430373031FE31FE33FE465431343138323030313432FE4654FE3230313430373031FE3136393834303030303534313938352E3032FD312D32FEFEFE31FE31343037303131313339FE4E474EFE2D36393032322E3636FEFE435245444954FEFE3230313430363330FE3230313430363330FEFEFEFEFEFEFEFEFE3230313430373031FEFEFEFEFE41432E312E54522E4E474E2E353030332E372E353032302E343930302E4E472E4E472E313030302D312E313030302E2E2E2E2E4E4730303130303031FEFE44454641554C54FE4E4730303130303031FEFEFE3230313430373031FE31FEFE3B3B490000000049000000006A620000015C3136393834303030303634363137372E303030303031FE3136393834303030303634363137372E303030303031FE3030303030303030303139383434FE4E4730303130303032FE3230353038302E3838FE323133FEFEFEFE32303030FE31FE31323036FE3230313430373031FEFE465431343138323030333438FEFE3230313430373031FE31FE37FE465431343138323030333438FE4654FE3230313430373031FE3136393834303030303634363137372E3030FD31FD31FEFEFE31FE31343037303131323439FE4E474EFE3230353038302E3838FEFE435245444954FEFE3230313430373031FE3230313430373031FEFEFEFEFEFEFEFEFEFEFEFEFEFE41432E312E54522E4E474E2E313230362E372E353032302E343930302E4E472E4E472E323030302D312E323030302E2E2E2E2E4E4730303130303032FEFE44454641554C54FE4E4730303130303032FEFEFE3230313430373031FE31FEFE3B3B490000000049000000006A62000001733136393834303030323435323030362E303230303031FE3136393834303030323435323030362E303230303031FE3030303030303030303139393431FE4E4730303130303031FE3130383336382E3030FE323133FE434C45415245442044455441494C53FE434C45415245442044455441494C53FEFE31303030FE31FE31303130FE3230313430373031FEFE465431343138323030343633FEFE3230313430373031FE31FE33FE465431343138323030343633FE4654FE3230313430373031FE3136393834303030323435323030362E3032FD312D32FEFEFE31FE31343037303131343236FE
4E474EFE3130383336382E3030FEFE435245444954FEFE3230313430373031FE3230313430373031FEFEFEFEFEFEFEFEFEFEFEFEFEFE41432E312E54522E4E474E2E313031302E372E353032302E343930302E4E472E4E472E313030302D312E313030302E2E2E2E2E4E4730303130303031FEFEFE4E4730303130303031FEFEFE3230313430373031FE31FEFE3B3B3B3B5D"

iex> {:data, 5,
 [
   ["163770001352821.010001", "163770001352821.010001", "00000000019739", "NG0010001", "3491.00", "213", "ALLOCATION", "ALLOCATION", "", "1", "1", "5002", "20121101", "", "FT1230600066", "", "20121101", "1",
    "3", "FT1230600066", "FT", "20121101", "163770001352821.01;1-2", "", "", "1", "1211011440", "NGN", "3491.00", "", "DEBIT", "", "20121101", "20121101", "", "", "", "", "", "", "", "", "", "", "", "", "",
    "AC.1.TR.NGN.5002.1.5020.4900.NG.NG.1000-1.1.....NG0010001", "", "DEFAULT", "NG0010001", "", "", "20121101", "1", "", ""],
   ["169230000963954.010002", "169230000963954.010002", "NGN1126000010001", "NG0010001", "-494549.01", "213", "", "", "", "", "1", "11260", "20140429", "", "FT1411900006", "", "20140429", "1", "7",
    "FT1411900006", "FT", "20140429", "169230000963954.01;1-2", "", "", "1", "1405011745", "NGN", "-494549.01", "", "DEBIT", "", "20140429", "20140429", "", "", "", "", "", "", "", "", "", "", "", "", "",
    "AC.1.TR.NGN.11260.7.....1000-1......NG0010001", "", "DEFAULT", "NG0010001", "", "", "20140429", "1", "", ""],
   ["169840000541985.020002", "169840000541985.020002", "00000000021024", "NG0010001", "-69022.66", "213", "COLLECTION", "COLLECTION", "", "1000", "1", "5003", "20140630", "", "FT1418200142", "", "20140701",
    "1", "3", "FT1418200142", "FT", "20140701", "169840000541985.02;1-2", "", "", "1", "1407011139", "NGN", "-69022.66", "", "CREDIT", "", "20140630", "20140630", "", "", "", "", "", "", "", "", "20140701", "",
    "", "", "", "AC.1.TR.NGN.5003.7.5020.4900.NG.NG.1000-1.1000.....NG0010001", "", "DEFAULT", "NG0010001", "", "", "20140701", "1", "", ""],
   ["169840000646177.000001", "169840000646177.000001", "00000000019844", "NG0010002", "205080.88", "213", "", "", "", "2000", "1", "1206", "20140701", "", "FT1418200348", "", "20140701", "1", "7",
    "FT1418200348", "FT", "20140701", "169840000646177.00;1;1", "", "", "1", "1407011249", "NGN", "205080.88", "", "CREDIT", "", "20140701", "20140701", "", "", "", "", "", "", "", "", "", "", "", "", "",
    "AC.1.TR.NGN.1206.7.5020.4900.NG.NG.2000-1.2000.....NG0010002", "", "DEFAULT", "NG0010002", "", "", "20140701", "1", "", ""],
   ["169840002452006.020001", "169840002452006.020001", "00000000019941", "NG0010001", "108368.00", "213", "CLEARED DETAILS", "CLEARED DETAILS", "", "1000", "1", "1010", "20140701", "", "FT1418200463", "",
    "20140701", "1", "3", "FT1418200463", "FT", "20140701", "169840002452006.02;1-2", "", "", "1", "1407011426", "NGN", "108368.00", "", "CREDIT", "", "20140701", "20140701", "", "", "", "", "", "", "", "", "",
    "", "", "", "", "AC.1.TR.NGN.1010.7.5020.4900.NG.NG.1000-1.1000.....NG0010001", "", "", "NG0010001", "", "", "20140701", "1", "", ""]
 ]}
iex>
1 Like

Another problem might be that you are appending to your accumulator all the time… This requires rebuilding the list on every append, e.g.:

defp _r(<<"LI", 1::32, ";", b::bits>>, val), do: _r(b, val ++ [nil])

I’m not sure how to refactor it, though; I will not currently try to understand the full parser, as I do not know JQL anyway…

1 Like

@NobbZ I just uploaded a 3 KB sample that contains 5 rows of data.

If I can optimize the way I handle this, the much larger file may be processable without memory issues as well.

I see your point.

The mistake might be in trying to build out / parse the entire file into memory in one go.

What you pointed out would not be an issue if I were processing each row of data and dumping it row by row into the database, instead of parsing all rows first before dumping to the DB, like I do now.

@CharlesO At a minimum, if you are going to parse it all in memory, you need to never append to a list in a loop; that is an O(n^2) operation. You should prepend to the list and then reverse at the end. Or use a body-recursive function that builds the list front to back via recursion.
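
A minimal sketch of the prepend-then-reverse shape, built around the clause quoted above. The catch-all and terminating clauses are placeholders I added for illustration, not code from the gist:

defmodule PrependDemo do
  # The only change vs `val ++ [nil]` is prepending with [nil | acc] (O(1) per element)
  # and a single Enum.reverse/1 once the input is exhausted.
  def parse(bin), do: _r(bin, [])

  defp _r(<<"LI", 1::32, ";", b::bits>>, acc), do: _r(b, [nil | acc])
  # placeholder clause: skip any other byte
  defp _r(<<_skip, b::bits>>, acc), do: _r(b, acc)
  defp _r(<<>>, acc), do: Enum.reverse(acc)
end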

3 Likes
000007775B4C4900000136490000000049FFFFFFFF4C49000000674900000000 4900000005 # row count
	490000000049000000006A62 00000165 3136333737303030313335323832312E303130303031FE3136333737303030313335323832312E303130303031FE3030303030303030303139373339FE4E4730303130303031FE333439312E3030FE323133FE414C4C4F434154494F4EFE414C4C4F434154494F4EFEFE31FE31FE35303032FE3230313231313031FEFE465431323330363030303636FEFE3230313231313031FE31FE33FE465431323330363030303636FE4654FE3230313231313031FE3136333737303030313335323832312E3031FD312D32FEFEFE31FE31323131303131343430FE4E474EFE333439312E3030FEFE4445424954FEFE3230313231313031FE3230313231313031FEFEFEFEFEFEFEFEFEFEFEFEFEFE41432E312E54522E4E474E2E353030322E312E353032302E343930302E4E472E4E472E313030302D312E312E2E2E2E2E4E4730303130303031FEFE44454641554C54FE4E4730303130303031FEFEFE3230313231313031FE31FEFE-3B3B
	490000000049000000006A62 0000014D 3136393233303030303936333935342E303130303032FE3136393233303030303936333935342E303130303032FE4E474E31313236303030303130303031FE4E4730303130303031FE2D3439343534392E3031FE323133FEFEFEFEFE31FE3131323630FE3230313430343239FEFE465431343131393030303036FEFE3230313430343239FE31FE37FE465431343131393030303036FE4654FE3230313430343239FE3136393233303030303936333935342E3031FD312D32FEFEFE31FE31343035303131373435FE4E474EFE2D3439343534392E3031FEFE4445424954FEFE3230313430343239FE3230313430343239FEFEFEFEFEFEFEFEFEFEFEFEFEFE41432E312E54522E4E474E2E31313236302E372E2E2E2E2E313030302D312E2E2E2E2E2E4E4730303130303031FEFE44454641554C54FE4E4730303130303031FEFEFE3230313430343239FE31FEFE-3B3B
	490000000049000000006A62 00000178 3136393834303030303534313938352E303230303032FE3136393834303030303534313938352E303230303032FE3030303030303030303231303234FE4E4730303130303031FE2D36393032322E3636FE323133FE434F4C4C454354494F4EFE434F4C4C454354494F4EFEFE31303030FE31FE35303033FE3230313430363330FEFE465431343138323030313432FEFE3230313430373031FE31FE33FE465431343138323030313432FE4654FE3230313430373031FE3136393834303030303534313938352E3032FD312D32FEFEFE31FE31343037303131313339FE4E474EFE2D36393032322E3636FEFE435245444954FEFE3230313430363330FE3230313430363330FEFEFEFEFEFEFEFEFE3230313430373031FEFEFEFEFE41432E312E54522E4E474E2E353030332E372E353032302E343930302E4E472E4E472E313030302D312E313030302E2E2E2E2E4E4730303130303031FEFE44454641554C54FE4E4730303130303031FEFEFE3230313430373031FE31FEFE-3B3B
	490000000049000000006A62 0000015C 3136393834303030303634363137372E303030303031FE3136393834303030303634363137372E303030303031FE3030303030303030303139383434FE4E4730303130303032FE3230353038302E3838FE323133FEFEFEFE32303030FE31FE31323036FE3230313430373031FEFE465431343138323030333438FEFE3230313430373031FE31FE37FE465431343138323030333438FE4654FE3230313430373031FE3136393834303030303634363137372E3030FD31FD31FEFEFE31FE31343037303131323439FE4E474EFE3230353038302E3838FEFE435245444954FEFE3230313430373031FE3230313430373031FEFEFEFEFEFEFEFEFEFEFEFEFEFE41432E312E54522E4E474E2E313230362E372E353032302E343930302E4E472E4E472E323030302D312E323030302E2E2E2E2E4E4730303130303032FEFE44454641554C54FE4E4730303130303032FEFEFE3230313430373031FE31FEFE-3B3B
	490000000049000000006A62 00000173 3136393834303030323435323030362E303230303031FE3136393834303030323435323030362E303230303031FE3030303030303030303139393431FE4E4730303130303031FE3130383336382E3030FE323133FE434C45415245442044455441494C53FE434C45415245442044455441494C53FEFE31303030FE31FE31303130FE3230313430373031FEFE465431343138323030343633FEFE3230313430373031FE31FE33FE465431343138323030343633FE4654FE3230313430373031FE3136393834303030323435323030362E3032FD312D32FEFEFE31FE31343037303131343236FE4E474EFE3130383336382E3030FEFE435245444954FEFE3230313430373031FE3230313430373031FEFEFEFEFEFEFEFEFEFEFEFEFEFE41432E312E54522E4E474E2E313031302E372E353032302E343930302E4E472E4E472E313030302D312E313030302E2E2E2E2E4E4730303130303031FEFEFE4E4730303130303031FEFEFE3230313430373031FE31FEFE-3B3B
3B3B5D

JQL is not so tricky. You can see the core repeating pattern above:

data-body header row-count
row-header - row-length - row data - row terminator
...
...
data-body terminator
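
Given that layout, one row can be matched in isolation with a plain binary pattern. A sketch, assuming the 12-byte row header, 32-bit big-endian row length, and ";;" (3B3B) terminator shown above; `RowDemo.next_row/1` is an illustrative helper, not code from the gist:

defmodule RowDemo do
  # Peel one row off the front of the data body
  # (everything after the data-body header and row count).
  def next_row(<<_header::binary-size(12), len::32, row::binary-size(len), ";;", rest::bits>>) do
    {row, rest}
  end

  def next_row(_other), do: :done
end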

Thanks for this. This is recursion 101, I know, but I felt the size of each row was small enough to accumulate the parsed items left to right, as they occur in the data packet. The place where I’ve really goofed is here:

 defp _r(
         <<"LI", 0x136::32, "I", 0::32, "I", _::32, "LI", 0x67::32, "I", 0::32, "I", n::32, b::bits>>,
         []
       ) do
    v =
      for <<l::32, s::bytes-size(l), _::bits>> <-
            :binary.split(b, @_00jb, [:global]),
          do: _split(s)

    {:data, n, v}
  end

on this line:

 v =
      for <<l::32, s::bytes-size(l), _::bits>> <-
            :binary.split(b, @_00jb, [:global]),
          do: _split(s)

I’m practically saying: split the 850 MB blob of data (2.5M+ rows) and build a list out of it… all at once, in memory.

When I run the same routine for 5, 100, or even 100,000 rows, it completes without issues.

I need a way to process and discard each row rather than accumulating into lists like I’m doing now, and, as you have rightly pointed out, definitely NOT using ++ to accumulate.

You can do this; you just have to write it body-recursive instead of tail-recursive.
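
For contrast, a minimal body-recursive sketch with a hypothetical take_row/1 splitter standing in for the real row parser (not code from the gist):

defmodule RecursionShapes do
  # Hypothetical splitter: take one ";;"-terminated chunk off the front of the binary.
  defp take_row(bin), do: :binary.split(bin, ";;")

  # Body recursive: each list cell is built as the call returns, front to back,
  # so there is no accumulator to append to and nothing to reverse at the end.
  def rows(<<>>), do: []

  def rows(bin) do
    case take_row(bin) do
      [row, rest] -> [row | rows(rest)]
      [last] -> [last]
    end
  end
end

RecursionShapes.rows("row1;;row2;;")
#=> ["row1", "row2"]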

1 Like

Found a really simple way to solve this. Just read the data-header, then parse row by row:

  def read(src, cnt \\ nil) do
    {:ok, fid} = :file.open("_cache/#{src}", [:raw, :read, :binary])
    {:h, start, n} = _p(fid, :h)

    cnt = cnt || n

    Logger.debug("Reading: #{cnt} of #{n} rows")

    Process.put(:start_pos, start)

    for i <- 1..cnt do
      {:r, start, row} = _p(fid, Process.get(:start_pos), :r)
      Process.put(:start_pos, start)
      Logger.debug(inspect({i, row}, @format))
    end

    :file.close(fid)
  end

  # READER
  defp _p(fid, :h) do
    # 000007775B4C4900000136490000000049FFFFFFFF4C490000006749000000004900000005
    {:ok, <<n::32>>} = :file.pread(fid, 33, 4)
    {:h, 37, n}
  end

  defp _p(fid, start, :r) do
    # 490000000049000000006A62 00000165 3136333737303....
    {:ok, <<l::32>>} = :file.pread(fid, start + 12, 4)
    {:ok, b} = :file.pread(fid, start + 16, l)

    # next row starts after the 12-byte row header + 4-byte length + l data bytes + 2-byte ";;" terminator
    {:r, start + l + 18, _split(b)}
  end

  defp _split(v) do
    b = for(<<c <- v>>, c in 32..126 || c in [252, 253, 254], into: "", do: <<c>>)
    b = :binary.replace(b, <<253>>, ";", [:global])
    b = :binary.replace(b, <<252>>, "^", [:global])
    :binary.split(b, <<254>>, [:global])
  end
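
For what it’s worth, the same header-then-row-by-row loop can also be written without Process.put/2 by threading the offset through Stream.unfold/2. A sketch against the _p/2 and _p/3 readers above (untested against the real file; save_row/1 is a hypothetical per-row sink such as a DB insert):

  def stream_rows(src) do
    {:ok, fid} = :file.open("_cache/#{src}", [:raw, :read, :binary])
    {:h, start, n} = _p(fid, :h)

    Stream.unfold({start, n}, fn
      {_pos, 0} ->
        nil

      {pos, left} ->
        {:r, next, row} = _p(fid, pos, :r)
        {row, {next, left - 1}}
    end)
    |> Stream.each(&save_row/1)   # hypothetical: insert into the DB, log, etc.
    |> Stream.run()

    :file.close(fid)
  end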
3 Likes