Newbie needs help parsing a file

bww00 · September 23, 2016, 2:07pm

Woo hoo
I need to parse an inventory file that has the following structure:

rec1
rec2
~~BOM
PVDocMgmt1.dll -- DateTime = 20160906082015;md5=MD5=2e59ba41a50a5ff2ab530519281bc572
PVDocMgmt2.dll -- DateTime = 20160906082015;md5=MD5=2e59ba41a50a5ff2ab530519281bc572
~~

I need to pull the records between the ~~BOM line and the next ~~ line.

I can read the file using:

def load_letters do
  File.stream!("c:/bwdata/92913.txt")
  |> Enum.each(&IO.write/1)
end

or

def yyyy do
  # String.rstrip/1 is deprecated; String.trim_trailing/1 is its replacement.
  stream = File.stream!("c:/bwdata/92913.txt") |> Stream.map(&String.trim_trailing/1)
  stream |> Enum.each(&IO.inspect/1)
end

I now need to parse the file and eventually write the selected lines out to another file.

I need help parsing the selected set of records.

Thanks
Bryan

OvermindDL1 · September 23, 2016, 2:19pm

So it looks like you are looking for a library to parse strings into a data structure, thus I would recommend: https://hex.pm/packages/combine

eksperimental · September 23, 2016, 2:43pm

@bww00 If you are just starting out, I think it’s better for you to learn how to work with lists.
Read the file into a list, and then write a filter function with pattern matching.

Here’s an example I have just made for you.

It’s up to you, in the first 2 clauses of filter/4, to decide what to do with malformed files, for example when a ~~BOM is not closed and you find another one opened: whether you want to discard it or join it to the following ~~BOM data.
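
A minimal sketch of that list-based approach, for reference. It is not the filter/4 example mentioned above: the module name BomFilter, the function names, and the way malformed input is handled are all assumptions made for illustration.

```elixir
# Hypothetical module: filters a list of lines with pattern matching.
defmodule BomFilter do
  # Reads the whole file into a list of lines, then filters it.
  def read(path) do
    path
    |> File.read!()
    |> String.split(["\r\n", "\n"])
    |> extract()
  end

  # Returns the lines between "~~BOM" and the closing "~~".
  def extract(lines), do: filter(lines, false, [])

  # End of input: return what was collected (an unclosed section is kept).
  defp filter([], _in_bom?, acc), do: Enum.reverse(acc)

  # Opening marker: start collecting. A repeated "~~BOM" inside an open
  # section is simply ignored here (assumption; change to suit your files).
  defp filter(["~~BOM" | rest], _in_bom?, acc), do: filter(rest, true, acc)

  # Closing marker while collecting: stop and return the section.
  defp filter(["~~" | _rest], true, acc), do: Enum.reverse(acc)

  # A line inside the section: keep it.
  defp filter([line | rest], true, acc), do: filter(rest, true, [line | acc])

  # A line outside the section: skip it.
  defp filter([_line | rest], false, acc), do: filter(rest, false, acc)
end
```

Calling BomFilter.extract/1 on a list of already-split lines keeps the filtering logic testable without touching the file system.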

Happy coding

bww00 · September 23, 2016, 7:18pm

Thanks for the help

regards

bbense · September 24, 2016, 1:09am

One trick that can be useful with this kind of file processing is using Stream.transform to chunk the multi-line sections into lists of maps or whatever. Not sure that’s exactly what is required in this case. Are there many ~~BOM ... ~~ sections or just one?
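
A rough sketch of that chunking idea, assuming the file may contain several ~~BOM ... ~~ sections and the lines have already had their newlines trimmed. The module name BomSections is hypothetical, and dropping an unclosed section is a choice made only for this example.

```elixir
# Hypothetical module: groups each ~~BOM ... ~~ section into its own list.
defmodule BomSections do
  # `lines` is any enumerable of trimmed lines, e.g. a mapped File.stream!/1.
  def sections(lines) do
    lines
    # The accumulator is nil outside a section, or a reversed list of the
    # lines collected so far while inside one.
    |> Stream.transform(nil, fn
      # Opening marker: start a fresh section (discarding any unclosed one).
      "~~BOM", _acc ->
        {[], []}

      # Closing marker while inside a section: emit the section as one element.
      "~~", acc when is_list(acc) ->
        {[Enum.reverse(acc)], nil}

      # A line inside a section: collect it.
      line, acc when is_list(acc) ->
        {[], [line | acc]}

      # A line outside any section: skip it.
      _line, nil ->
        {[], nil}
    end)
    |> Enum.to_list()
  end
end
```

A section that is still open when the input ends is not emitted by this sketch.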

bww00 · September 24, 2016, 1:44am

Should be just 1 ~~BOM per file

eksperimental · September 24, 2016, 2:41am

Based on @bbense’s advice, and provided you only need to read the first ~~BOM section you encounter, you could do something like:

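defmodule RecordFileStream2 do
  # Returns the lines between "~~BOM" and the next "~~" as a list.
  def read(file) do
    File.stream!(file)
    |> Stream.map(&String.trim_trailing(&1, "\n"))
    # The accumulator is true while we are inside the ~~BOM section.
    |> Stream.transform(false, fn
      "~~BOM", _bom? ->
        {[], true}
      # Halts at the first "~~" line, even if no ~~BOM has been seen yet.
      "~~", _bom? ->
        {:halt, nil}
      e, true ->
        {[e], true}
      _e, false ->
        {[], false}
    end)
    |> Enum.to_list()
  end
end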

IO.inspect RecordFileStream2.read("records.txt")

the output will be

$ elixir record_file_stream2.exs 
["PVDocMgmt1.dll -- DateTime = 20160906082015;md5=MD5=2e59ba41a50a5ff2ab530519281bc572",
 "PVDocMgmt2.dll -- DateTime = 20160906082015;md5=MD5=2e59ba41a50a5ff2ab530519281bc572"]

bww00 · September 24, 2016, 3:13pm

the help is much appreciated

bbense · September 24, 2016, 3:35pm

Just some quick notes re file parsing:

If you are interested in speed, you can trade simplicity of implementation for performance. If the file will fit in memory, slurping the entire file and then parsing the resulting binary is almost always faster.
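
As a sketch of the slurping approach (no timings claimed here), the whole file can be read with File.read!/1 and the section cut out of the binary with String.split/3. The module name BomSlurp is hypothetical, and the code assumes the markers "~~BOM" and "~~" appear only as marker lines, never inside a record.

```elixir
# Hypothetical module: reads the file in one go and slices the binary.
defmodule BomSlurp do
  def read(path), do: path |> File.read!() |> extract()

  # Returns the lines of the first ~~BOM ... ~~ section, or [] if the
  # opening or closing marker is missing.
  def extract(contents) do
    with [_before, rest] <- String.split(contents, "~~BOM", parts: 2),
         [section, _after] <- String.split(rest, "\n~~", parts: 2) do
      # Split on any common line ending and drop the empty pieces.
      String.split(section, ["\r\n", "\n", "\r"], trim: true)
    else
      _ -> []
    end
  end
end
```

Whether this beats the File.stream! version for a given file is something to measure rather than assume.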

You can often get orders of magnitude improvement in file parsing speed by using all the tricks available. Elixir often looks really slow compared to languages like Ruby or Python when the comparison is based on a straightforward use of File.stream! for parsing.

If you absolutely need parsing speed, one really good trick is to use the Erlang leex library. It allows a limited set of regexps in the parser definitions, so you get the speed of a “real parser” with the flexibility of using regexps to define the parser.