How to seek in a huge text file at the speed of light (well, Elixir :)

Hi forum, it’s been a while,

I would like some input from you brainiacs on the following:
I have a huge text file (GBs in size) where each line starts with a timestamp.

timestamp0 data
timestamp1 data
...

So, given a timestamp as input, I need to locate the closest timestamp in the file and then parse line by line from there. Naturally, this seek operation should be quick!

Well we know that this has been done a billion times before, however it hasn’t been done by me :wink:

It should be mentioned that the huge text file is produced during a recording session which I control, opening up the possibility of producing a secondary metadata file during the recording. (Well, to be fair, there is actually already a metadata file containing some JSON, however with no indexing.)

Surely there must exist a “by the book” solution here.

Thanks!

Is the file ordered by timestamp? If it is, you can perform binary search using :file.pread
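
E.g. something like this (untested; the file name and sizes are placeholders): open the file once in :raw/:binary mode, probe an arbitrary byte offset with :file.pread, pick out the first complete timestamp in the chunk, and halve the search range depending on how it compares to the one you are looking for.

  # Placeholder file name; :raw + :binary keeps reads off the Elixir IO server.
  {:ok, fd} = :file.open("huge.log", [:read, :raw, :binary])

  # Probe 4 KB at an arbitrary byte offset (here: the middle of the file).
  mid = div(File.stat!("huge.log").size, 2)
  {:ok, chunk} = :file.pread(fd, mid, 4096)

  # The chunk almost certainly starts mid-line, so the first complete
  # timestamp is the one right after the first newline.
  [_partial, rest] = String.split(chunk, "\n", parts: 2)
  [timestamp | _] = String.split(rest, " ", parts: 2)
  # …then compare `timestamp` with the target and halve the range accordingly.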

1 Like

Yeps! Ordered by timestamp. I’ll definitely have a deeper look.

How about a divide-and-conquer variant? Something like:

  1. Based on the file size / block count, decide how many segments you want to split the file into.
  2. Start n separate processes, assign a segment to each process, and let them all open the file in read-only mode, positioned at the beginning of their corresponding segments (lseek/fseek).
  3. Scan for timestamps.
  4. If the timestamp you are looking for is lower than the first timestamp at your segment’s start position, drop the search and let the process exit.
  5. Note the file/stream pointer position when the right timestamp is found.

If the file is really big and was written to a single physical storage device, this approach can saturate I/O. To prevent that, you can process the file in batches.

You should probably put some emphasis on how you implement the search for the newline indicator. I feel like the main factor here is the maximum length of a line. If you have a good guess or a hard limit, that can help.
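
For what it’s worth, a rough and untested sketch of that idea could look something like the module below. The module name, the segment count, and the assumption that the timestamp is the first space-separated field (and compares correctly as a string) are all made up here, and the corner cases at the segment edges are ignored:

  defmodule SegmentScan do
    def find(path, target, segments \\ 8) do
      size = File.stat!(path).size
      seg_size = div(size, segments)

      0..(segments - 1)
      |> Task.async_stream(
        fn i ->
          # Let the last segment run all the way to the end of the file.
          stop = if i == segments - 1, do: size, else: (i + 1) * seg_size
          scan_segment(path, i * seg_size, stop, target)
        end,
        timeout: :infinity
      )
      |> Enum.find_value(fn
        {:ok, {:found, offset}} -> offset
        _ -> nil
      end)
    end

    defp scan_segment(path, start, stop, target) do
      {:ok, fd} = :file.open(path, [:read, :raw, :binary, :read_ahead])
      :file.position(fd, start)

      # Unless we start at byte 0 we are probably mid-line: skip the partial line.
      offset =
        if start == 0 do
          0
        else
          case :file.read_line(fd) do
            {:ok, partial} -> start + byte_size(partial)
            _ -> stop
          end
        end

      result = scan_lines(fd, offset, stop, target, true)
      :file.close(fd)
      result
    end

    defp scan_lines(fd, offset, stop, target, first_line?) do
      case :file.read_line(fd) do
        {:ok, line} ->
          [ts | _] = String.split(line, " ", parts: 2)

          cond do
            # Target precedes this whole segment -> drop the search; a process
            # to the left will find it.
            first_line? and ts > target -> :not_in_segment
            # First line at or past the target: this is the position we want.
            ts >= target -> {:found, offset}
            # Ran past the end of our segment without finding it.
            offset + byte_size(line) >= stop -> :not_in_segment
            true -> scan_lines(fd, offset + byte_size(line), stop, target, false)
          end

        _ ->
          :not_in_segment
      end
    end
  end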

This algorithm has a problem: as soon as you split the file into N parts, N - 1 of the processes will exit almost immediately, because they’ll find out that the target timestamp is not in their range (or segment, or block, or batch, or whatever you call it).

If we’re taking this problem to the real-world level, then your algorithm must depend on the file encoding, the disk type you’re using, and the page size.

So, your algorithm must consist of these steps:

  1. Take the left boundary as the lowest timestamp and the right boundary as the highest timestamp.
  2. Split the range between the boundaries into 2 equal parts (if you’re using an SSD, which works fine with non-sequential reads) or into N equal parts (if you’re using an HDD, which works well only with sequential reads).
  3. Read one page from every boundary of each part (except the leftmost and rightmost), using :file.read in the case of an SSD or :file.pread in the case of an HDD.
  4. Pad the page in case you have a file encoding like UTF-8. If your encoding is Latin-1, you’re fine.
  5. Find the highest or lowest timestamp of every page.
  6. Check whether the target timestamp is in one of the parts.
  7. Set the left and right boundaries to the boundaries of that part. Go to 2.

Plus, I’d definitely add a heuristic to switch to linear scanning once the part between the left and right boundaries is too small to be worth the algorithm described above.
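
A stripped-down sketch of that (untested, ignoring the HDD/encoding refinements and just doing the plain two-part split over byte offsets; it assumes lines are shorter than one page and that timestamps compare correctly as strings):

  defmodule TsSearch do
    @page 4096

    # Narrow a byte range [lo, hi) until it is one page wide, then hand the
    # resulting offset over to a plain line-by-line scan.
    def narrow(path, target) do
      {:ok, fd} = :file.open(path, [:read, :raw, :binary])
      lo = do_narrow(fd, target, 0, File.stat!(path).size)
      :file.close(fd)
      # The target line starts shortly after `lo`; linear-scan from here.
      lo
    end

    defp do_narrow(_fd, _target, lo, hi) when hi - lo <= @page, do: lo

    defp do_narrow(fd, target, lo, hi) do
      mid = div(lo + hi, 2)
      {:ok, chunk} = :file.pread(fd, mid, @page)

      # `mid` almost certainly lands mid-line, so compare the target against
      # the timestamp of the first complete line in the chunk.
      [_partial, rest] = String.split(chunk, "\n", parts: 2)
      [ts | _] = String.split(rest, " ", parts: 2)

      if ts >= target do
        do_narrow(fd, target, lo, mid)
      else
        do_narrow(fd, target, mid, hi)
      end
    end
  end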

Well, perhaps I should have put more emphasis on the “Something” in “Something like:” :wink:
I am sure that if you take a closer look, you will find more issues - it’s just an idea.
That being said, I don’t think what you wrote is correct. If the timestamp is at the end of the file, none of the processes will exit because of the condition you described.

EDIT: To clarify - in what I described, the search goes forward. If the timestamp you are looking for is lower than the first timestamp you encounter during the search, it means this particular process can’t find it. But another one can.

1 Like

If the timestamp you are looking for is lower than the first timestamp you encounter during the search, it means this particular process can’t find it

And if the timestamp you’re looking for is higher than the first timestamp of the next segment, the current process can exit straight away.

So, N-1 processes will exit straight away in the algorithm you’re describing.

I think what you actually want to implement is some kind of binary (or N-ary) search. The idea is close to divide and conquer, but they’re not the same.

1 Like

Ahh, that’s what you mean. Cool idea actually; what we are missing here is the processes having common state or talking to each other.

Yes, in that sense it could be a very interesting optimization, I agree.

1 Like

…or you can import it into SQLite and query that. :person_shrugging:

6 Likes

A lot of good thinking here. I was thinking that producing metadata (in another file) with timestamp => line_nbr (or maybe timestamp => byte_offset, if I can keep track of that for a text file) could also be of help; however, just seeking n lines into a file of this size takes quite some time (if my memory serves me right).
Splitting into smaller files is quite tempting but might add unnecessary complexity.
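
Roughly what I have in mind for the index (untested; the module/function names, the @every interval and the .idx file are all made up, and it assumes timestamps contain no spaces):

  defmodule Recorder do
    @every 10_000

    # State is {data_device, index_device, byte_offset, line_count}; both
    # devices would come from File.open!(..., [:append, :binary]) or similar.
    def write_line({data, index, offset, count}, timestamp, payload) do
      line = "#{timestamp} #{payload}\n"
      IO.binwrite(data, line)

      # Every @every lines, remember where this timestamp starts in the data file.
      if rem(count, @every) == 0 do
        IO.binwrite(index, "#{timestamp} #{offset}\n")
      end

      {data, index, offset + byte_size(line), count + 1}
    end
  end

  # At read time the index is tiny compared to the data file, so it can be
  # loaded and scanned (or binary searched) in memory to get a starting byte
  # offset; `target` is the timestamp being looked up.
  start_offset =
    "session.idx"
    |> File.stream!()
    |> Enum.reduce(0, fn entry, acc ->
      [ts, off] = entry |> String.trim_trailing() |> String.split(" ")
      if ts <= target, do: String.to_integer(off), else: acc
    end)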

1 Like

write a file each minute/hour/…

1 Like

I’ve had a chance to play with this and I’m sharing some findings. @hst337 suggested :file.pread, which is lightning fast and which I can use to find the byte offset. It also changes the file cursor, but… the Erlang docs state:

“The current position of the file after the operation is undefined for raw mode and unchanged for ram mode.”

So now I’m stuck in non-:raw mode, even though :raw is what I need for performance reasons. Notably, Enum.each(IO.stream(file_with_cursor_offset, :line), fn(line) -> … end) will not work.

So my attention moved to File.stream!

So once I’ve located the position (possibly with :file.pread, without :raw mode) I would like to use File.stream! (or similar) and then consume the file line by line. However, I don’t see how to start at an offset. The closest thing I’ve found so far is Stream.drop/2 - however, given that the stream is consumed line by line, it will drop n lines. Which is fine, but I don’t see this being as efficient as just skipping m bytes ahead.

Hopefully I’m missing something here.

TL;DR: how do I start my Stream at a byte offset? Assume I would like to skip/drop the first 100 MB of the file.
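
For reference, this is the Stream.drop/2 workaround I mean (file name and line count are placeholders); every dropped line is still read and split, which is the part I’d like to avoid:

  "huge.log"
  |> File.stream!()
  |> Stream.drop(1_000_000)   # drops the first million *lines*, not bytes
  |> Enum.each(&IO.write/1)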

Hmm, maybe I’m onto something here…

  def file_stream_with_offset(name, position) do
    Stream.resource(
      fn ->
        # :raw + :read_ahead + :binary keeps line reads fast, and we jump
        # straight to the byte offset found by the search step.
        {:ok, file} = :file.open(String.to_charlist(name), [:read, :raw, :read_ahead, :binary])
        :file.position(file, position)
        file
      end,
      fn file ->
        case :file.read_line(file) do
          :eof -> {:halt, file}
          # If `position` is not exactly at a line start, the first element
          # emitted here is the tail end of a line.
          {:ok, line} -> {[line], file}
        end
      end,
      fn file -> :file.close(file) end
    )
  end
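
Used like this (the file name and offset are placeholders); unless the offset happens to fall exactly on a line boundary, the first element emitted is the tail of a line, hence the Stream.drop(1):

  "huge.log"
  |> file_stream_with_offset(100 * 1024 * 1024)
  |> Stream.drop(1)
  |> Enum.each(&IO.write/1)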

1 Like