Newbie needs help parsing a file

Woo hoo
I need to parse an inventory file that has the following structure:

rec1
rec2
~~BOM
PVDocMgmt1.dll -- DateTime = 20160906082015;md5=MD5=2e59ba41a50a5ff2ab530519281bc572
PVDocMgmt2.dll -- DateTime = 20160906082015;md5=MD5=2e59ba41a50a5ff2ab530519281bc572
~~

I need to pull the records between the ~~BOM and the next pair of ~~

I can read the file using:

def load_letters do
  File.stream!("c:/bwdata/92913.txt")
  |> Enum.each(&IO.write/1)
end

or

def yyyy do
  # Strip trailing whitespace (including the newline) from every line.
  stream = File.stream!("c:/bwdata/92913.txt") |> Stream.map(&String.trim_trailing/1)
  stream |> Enum.each(&IO.inspect/1)
end

I now need to parse the file and eventually write the selected lines out to another file.

I need help parsing the selected set of records.

Thanks
Bryan

So it looks like you are looking for a library to parse strings into a data structure, thus I would recommend: https://hex.pm/packages/combine :slight_smile:

@bww00 If you are just starting out, I think it’s better for you to learn how to work with lists.
You read that file into a list, and then write a filter function with pattern matching.

Here’s an example I have just made for you.

It’s up to you, in the first 2 clauses of filter/4, to decide what to do with malformed files, for example when a ~~BOM is not closed and you find another one opening: whether you want to discard it or join it to the following ~~BOM data.
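
A minimal sketch of the list-and-pattern-matching approach described here. This is not the example referred to above: the module name BomFilter, the filter/3 helper and the two policies for malformed input (restart on a second ~~BOM, return what was collected if the closing ~~ is missing) are assumptions made for illustration.

```elixir
defmodule BomFilter do
  # Hypothetical module: reads the whole file into a list of lines,
  # then filters that list with pattern-matching function clauses.
  def read(path) do
    path
    |> File.read!()
    |> String.split(["\r\n", "\n"])
    |> extract()
  end

  # Returns the lines between the first "~~BOM" marker and the next "~~".
  def extract(lines), do: filter(lines, false, [])

  # A second "~~BOM" before the section was closed.
  # Assumed policy: discard what was collected and start over.
  defp filter(["~~BOM" | rest], true, _acc), do: filter(rest, true, [])
  # Opening marker: start collecting.
  defp filter(["~~BOM" | rest], false, _acc), do: filter(rest, true, [])
  # Closing marker while collecting: stop and return the section.
  defp filter(["~~" | _rest], true, acc), do: Enum.reverse(acc)
  # A line inside the section: keep it.
  defp filter([line | rest], true, acc), do: filter(rest, true, [line | acc])
  # A line outside the section: skip it.
  defp filter([_line | rest], false, acc), do: filter(rest, false, acc)
  # End of input. Assumed policy: return whatever was collected.
  defp filter([], _in_bom?, acc), do: Enum.reverse(acc)
end
```

Changing the first and last clauses is where you would pick a different behaviour for malformed files.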

Happy coding

Thanks for the help

regards

One trick that can be useful with this kind of file processing is using Stream.transform to chunk the multi-line sections into lists of maps or whatever. Not sure that’s exactly what is required in this case. Are there many ~~BOM ~~ sections or just one?
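
For the many-sections case, a rough sketch of the chunking idea with Stream.transform/3. The BomSections module and chunk/1 function are hypothetical names, and it assumes the lines have already had their newlines stripped. Each complete ~~BOM ... ~~ section is emitted as one list of lines.

```elixir
defmodule BomSections do
  # Hypothetical helper: turns a stream of lines into a stream of sections,
  # where each section is the list of lines between "~~BOM" and "~~".
  # The accumulator is nil outside a section and a (reversed) list inside one.
  def chunk(lines) do
    Stream.transform(lines, nil, fn
      # Opening marker: start a fresh section, emit nothing yet.
      "~~BOM", _acc -> {[], []}
      # Closing marker inside a section: emit the section as one element.
      "~~", acc when is_list(acc) -> {[Enum.reverse(acc)], nil}
      # A line inside a section: collect it.
      line, acc when is_list(acc) -> {[], [line | acc]}
      # Anything outside a section: ignore it.
      _line, nil -> {[], nil}
    end)
  end
end
```

Note that a section left unclosed at the end of the input is silently dropped in this sketch.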

Should be just 1 ~~BOM per file

Based on @bbense’s advice, and provided you only need to read the first ~~BOM section you encounter, you could do something like:

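defmodule RecordFileStream2 do
  def read(file) do
    File.stream!(file)
    # Strip the trailing "\n" so marker lines compare equal to "~~BOM" and "~~"
    # (a file with "\r\n" line endings would also need the "\r" stripped).
    |> Stream.map(&String.trim_trailing(&1, "\n"))
    # The accumulator is a flag: are we inside the ~~BOM section?
    |> Stream.transform(false, fn
      # Opening marker: emit nothing, start collecting.
      "~~BOM", _bom? ->
        {[], true}
      # Closing marker: stop reading the file.
      "~~", _bom? ->
        {:halt, nil}
      # Inside the section: emit the line.
      e, true ->
        {[e], true}
      # Outside the section: skip the line.
      _e, false ->
        {[], false}
    end)
    |> Enum.to_list()
  end
end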

IO.inspect RecordFileStream2.read("records.txt")

the output will be

$ elixir record_file_stream2.exs 
["PVDocMgmt1.dll -- DateTime = 20160906082015;md5=MD5=2e59ba41a50a5ff2ab530519281bc572",
 "PVDocMgmt2.dll -- DateTime = 20160906082015;md5=MD5=2e59ba41a50a5ff2ab530519281bc572"]
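
To get from these lines to the output file mentioned in the original question, each line still has to be split into its fields. A minimal sketch, assuming every record looks like the two sample lines (a name, then " -- DateTime = ", 14 digits, ";md5=MD5=" and 32 hex digits). The BomRecord module, the regex and the comma-separated output layout are all illustrative assumptions, not something specified in the thread.

```elixir
defmodule BomRecord do
  # Pattern inferred from the two sample lines in the question (an assumption).
  defp pattern do
    ~r/^(?<name>.+?) -- DateTime = (?<datetime>\d{14});md5=MD5=(?<md5>[0-9a-fA-F]{32})$/
  end

  # Returns a map with the string keys "name", "datetime" and "md5",
  # or nil when the line does not match the assumed format.
  def parse(line), do: Regex.named_captures(pattern(), line)

  # Writes one comma-separated row per line that parsed successfully;
  # lines that do not match are skipped. The layout is a hypothetical choice.
  def write(lines, out_path) do
    rows =
      for line <- lines, rec = parse(line) do
        [rec["name"], ",", rec["datetime"], ",", rec["md5"], "\n"]
      end

    File.write!(out_path, rows)
  end
end
```

It could be combined with the module above as `RecordFileStream2.read("records.txt") |> BomRecord.write("out.csv")`, where "out.csv" is just a placeholder name.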

the help is much appreciated

Just some quick notes re file parsing:

If you are interested in speed, you can trade simplicity of implementation for performance. If the file will fit in memory, slurping the entire file and then parsing the resulting binary is almost always faster.
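
As a rough illustration of the slurping approach for the file in this thread, something like the following reads everything with File.read!/1 and cuts the section out of the binary with :binary.split/2. The BomSlurp module is a hypothetical name, the sketch assumes "~~BOM" only ever appears as a marker line, and it has not been benchmarked against the streaming version.

```elixir
defmodule BomSlurp do
  # Reads the whole file into memory, then extracts the section.
  def read(path), do: path |> File.read!() |> extract()

  # Returns the lines between the first "~~BOM" and the next line that
  # starts with "~~", or [] when there is no complete section.
  def extract(contents) do
    # Normalise Windows line endings so only "\n" has to be handled.
    contents = String.replace(contents, "\r\n", "\n")

    with [_before, rest] <- :binary.split(contents, "~~BOM"),
         [section, _after] <- :binary.split(rest, "\n~~") do
      # trim: true drops the empty string left by the newline after "~~BOM".
      String.split(section, "\n", trim: true)
    else
      _ -> []
    end
  end
end
```

Usage would be `BomSlurp.read("c:/bwdata/92913.txt")`.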

You can often get orders of magnitude improvement in file parsing speed by using all the tricks available. Elixir often looks really slow compared to languages like Ruby or Python based on straightforward use of File.stream! for parsing.

If you absolutely need parsing speed, one really good trick is to use Erlang’s leex, the lexical analyzer generator that ships with Erlang/OTP. It allows a limited set of regexps in the token definitions, so you get the speed of a “real parser” with the flexibility of using regexps to define it.
