CharlesO
Parse / Pattern match large binary data
I have read through the related thread binary-pattern-matching-when-input-is-a-stream, but I'm not able to make any headway.
I get this error trying to load and parse a large binary directly:
iex> JQL.save "--1566364097905", "STMT.ENTRY"
eheap_alloc: Cannot allocate 18446744071692551144 bytes of memory (of type "heap").
Crash dump is being written to: erl_crash.dump...
I’m doing an ETL (extract - transform - load) for a reporting project, moving data from a proprietary platform into SQL Server for easier reporting.
The parser has worked fine until I hit this size issue.
How can we use streams to handle this kind of situation, particularly when the existing parser was not built with streaming in mind?
JQL Parser: https://gist.github.com/CharlesOkwuagwu/4c6c89d96db7876bc0d27fecd518340e (updated)
Sample large file (~850mb unzipped): https://paperlesssolutionsltd.com.ng/java/--1566364097905.7z
Thanks.
Marked As Solved
CharlesO
Found a really simple way to solve this. Just read the data-header, then parse row by row:
# Reads the header, then reads `cnt` rows (all rows by default) one at a time.
def read(src, cnt \\ nil) do
  {:ok, fid} = :file.open("_cache/#{src}", [:raw, :read, :binary])
  {:h, start, n} = _p(fid, :h)
  cnt = cnt || n
  Logger.debug("Reading: #{cnt} of #{n} rows")
  # The offset of the next row is kept in the process dictionary between iterations.
  Process.put(:start_pos, start)
  for i <- 1..cnt do
    {:r, start, row} = _p(fid, Process.get(:start_pos), :r)
    Process.put(:start_pos, start)
    Logger.debug(inspect({i, row}, @format))
  end
  :file.close(fid)
end

# READER
# Header: the row count is a 32-bit integer at offset 33; the first row starts at offset 37.
defp _p(fid, :h) do
  # 000007775B4C4900000136490000000049FFFFFFFF4C490000006749000000004900000005
  {:ok, <<n::32>>} = :file.pread(fid, 33, 4)
  {:h, 37, n}
end

# Row: a 32-bit payload length sits 12 bytes into the row, followed by the payload itself.
defp _p(fid, start, :r) do
  # 490000000049000000006A62 00000165 3136333737303....
  {:ok, <<l::32>>} = :file.pread(fid, start + 12, 4)
  {:ok, b} = :file.pread(fid, start + 16, l)
  {:r, start + l + 18, _split(b)}
end

# Keeps printable ASCII plus the 252/253/254 marker bytes, then splits the row on 254.
defp _split(v) do
  b = for(<<c <- v>>, c in 32..126 || c in [252, 253, 254], into: "", do: <<c>>)
  b = :binary.replace(b, <<253>>, ";", [:global])
  b = :binary.replace(b, <<252>>, "^", [:global])
  :binary.split(b, <<254>>, [:global])
end
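Since the original question asked about streams, here is a minimal sketch of the same row-by-row read wrapped in Stream.resource/3, so that rows are produced lazily and the process dictionary is not needed. The module name RowStream and the function stream/1 are hypothetical; the offsets (33, 37, +12, +16, +18) are copied from the code above and are assumed to hold for every row.

```elixir
defmodule RowStream do
  # Hypothetical module: lazily yields the raw payload of each row.
  # Offsets are taken from the reader above, not from any format specification.
  def stream(path) do
    Stream.resource(
      # Start: open the file, read the row count, point at the first row.
      fn ->
        {:ok, fid} = :file.open(path, [:raw, :read, :binary])
        {:ok, <<n::32>>} = :file.pread(fid, 33, 4)
        {fid, 37, n}
      end,
      # Next: emit one payload per call until no rows are left.
      fn
        {_fid, _pos, 0} = state ->
          {:halt, state}

        {fid, pos, left} ->
          {:ok, <<len::32>>} = :file.pread(fid, pos + 12, 4)
          {:ok, payload} = :file.pread(fid, pos + 16, len)
          {[payload], {fid, pos + len + 18, left - 1}}
      end,
      # After: close the file, even if the consumer stops early.
      fn {fid, _pos, _left} -> :file.close(fid) end
    )
  end
end
```

Something like RowStream.stream(path) |> Stream.map(&split/1) |> Enum.each(&insert/1) would then hold only one row in memory at a time; split/1 and insert/1 stand in for the _split step above and whatever load step follows.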
Also Liked
NobbZ
But the error message you posted claims that it wants to allocate another 18446744071692551144 bytes, which is 18014398507512256 kiB, 17592186042492 MiB, 17179869182 GiB, 16777215 TiB, 16383 PiB, so ~16 EiB…
As you can see @hauleth even missed by a factor of ~100…
There might be plenty of reasons why such a huge amount of memory might be allocated…
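The conversion above can be reproduced with a short sketch. It also shows one observation that may or may not be relevant to the crash: the requested size is only about 2 GB short of 2^64, so the same 64 bits read as a signed integer give a negative number. The variable names are illustrative only.

```elixir
bytes = 18_446_744_071_692_551_144
units = ["KiB", "MiB", "GiB", "TiB", "PiB"]

# Divide by 1024 once per unit, keeping only whole units.
sizes = Enum.scan(units, bytes, fn _unit, acc -> div(acc, 1024) end)

for {unit, size} <- Enum.zip(units, sizes) do
  IO.puts("#{size} #{unit}")
end

# Reinterpret the same 64 bits as a signed integer.
<<signed::signed-64>> = <<bytes::unsigned-64>>
IO.inspect(signed)
```

A negative value here would be consistent with a size calculation that overflowed or went below zero somewhere, but that is an assumption, not something the error message confirms.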
hauleth
No surprise there as I assume that you do not have 163 PB of RAM available for your machine. It seems like you want to load a lot of data at once. Maybe try to split it into reasonable chunks?
NobbZ
Does the problem persist if you remove this line?
https://gist.github.com/CharlesOkwuagwu/4c6c89d96db7876bc0d27fecd518340e#file-jql-ex-L102
You are not even using the result of term_to_binary… And unless compressed, it will always require more memory than the input.
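As a rough illustration of that last point, the sketch below encodes a binary with :erlang.term_to_binary/1 and compares sizes. The 1 MB all-zero input is an arbitrary choice that compresses very well; real row data will compress far less.

```elixir
# An arbitrary 1 MB binary of zero bytes, used only for illustration.
input = :binary.copy(<<0>>, 1_000_000)

# Uncompressed: a full copy of the payload plus a small header.
plain = :erlang.term_to_binary(input)

# Compressed: can be smaller than the input, at the cost of CPU time.
packed = :erlang.term_to_binary(input, [:compressed])

IO.inspect({byte_size(input), byte_size(plain), byte_size(packed)})
```

Either way the encoded copy lives alongside the original binary, so calling it on a very large input roughly doubles the memory needed at that point.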








