How are you dealing with CSV files that use CR line breaks?

I believe there are people here who deal with CSV file imports on a daily basis, and since Excel is a really popular tool, there is a standing problem of reading these files: File.* doesn’t support \r as a newline.
Details:
https://github.com/elixir-lang/elixir/issues/6169
https://github.com/beatrichartz/csv/issues/55

Right now I see two possible solutions:

  • read N chars, detect whether a \r is not followed by \n, and if so treat the file as using CR line breaks; then either:
  1. Create and use a NIF (still have to find a good C/C++ library), or
  2. Replace \r with \r\n (sketched below)
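
For concreteness, option 2 might look something like this. A sketch only (the module name is invented), and it reads the whole file into memory, so large files would need a streaming variant:

  # Sample the first bytes; if a bare \r (not followed by \n) shows up,
  # assume CR line endings and rewrite them as \r\n.
  defmodule CRFixup do
    @sample_bytes 4096

    def normalize!(path) do
      {:ok, io} = File.open(path, [:read, :binary])
      sample = IO.binread(io, @sample_bytes)
      File.close(io)

      if is_binary(sample) and Regex.match?(~r/\r(?!\n)/, sample) do
        fixed = path |> File.read!() |> String.replace(~r/\r(?!\n)/, "\r\n")
        File.write!(path, fixed)
      end

      :ok
    end
  end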

Is there any better way? What are you folks doing in this situation?


I do a regex on them to clean them up.

  def read!(ifile) do
    contents =
      ifile
      # local_file/1 is a helper defined elsewhere in this module.
      |> local_file()
      |> File.read!()
      # Swap non-ASCII bytes for spaces.
      |> String.replace(~r/[^\x00-\x7F]/, " ")
      # Turn a bare \r (one not followed by \n) into \r\n; the lookahead
      # avoids consuming the character after the \r.
      |> String.replace(~r/\r(?!\n)/, "\r\n")

    %{ifile | contents: contents}
  end

There’s probably a better way of going about it but this one works pretty well for me.

Have you ever encountered files that use LFCR as a newline sequence?

Why not use a proper CSV library?


We do, but all the libraries work on File.read/File.stream! output, which has no option to specify the line-separator sequence the way Python’s universal newlines do, for example (see the Glossary in the Python 2.7.18 documentation).

This is described in the second link of original message:

File.Stream uses IO.each_stream, which uses :io.get_line. LF and CRLF are handled, but not CR, which is used in some Mac OS and other legacy formats.

nimble_csv has :newlines / :line_separator options to handle this.
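
For instance, a minimal sketch (the module name is invented; listing "\r\n" before "\r" keeps CRLF files from being split twice):

  # Parser that also accepts a bare \r as a row terminator.
  NimbleCSV.define(CRTolerantCSV,
    separator: ",",
    escape: "\"",
    newlines: ["\r\n", "\n", "\r"]
  )

  CRTolerantCSV.parse_string("a,b\rc,d\r", skip_headers: false)
  #=> [["a", "b"], ["c", "d"]]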


Just what I was about to say. I use nimble_csv myself and it has that ability, plus it was made by our own Jose Valim. :slight_smile:


I use nimble_csv myself, and the thing it (and just about every other CSV library I’ve ever used, in any language) breaks down on is the weird encodings you often get from Excel users. I’ve basically given up on parsing these documents, which I’m pretty sure are broken but also very common, so I go with the regex replacement above.

The problem with setting the line endings in nimble is that you have to know ahead of time what’s in the file, and if you have a wide variety of these files coming in, you can’t just set it once and forget it.
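
One workaround is to sniff each file before choosing a parser. A sketch, assuming a couple of parsers defined up front covers the endings you actually see (module names invented):

  # One parser for ordinary LF/CRLF files, one for legacy CR-only files.
  NimbleCSV.define(LFParser, separator: ",", escape: "\"", newlines: ["\r\n", "\n"])
  NimbleCSV.define(CRParser, separator: ",", escape: "\"", newlines: ["\r"])

  defmodule SniffingCSV do
    # Pick the CR parser only when a bare \r (no \n after it) is present.
    def parse(contents) do
      parser = if Regex.match?(~r/\r(?!\n)/, contents), do: CRParser, else: LFParser
      parser.parse_string(contents, skip_headers: false)
    end
  end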

Teeeeechnically, following the spec, the line endings should (but don’t have to) be CRLF pairs (optional on the last record).

But sadly no real official spec.

Perhaps use a file format that does actually have a defined spec? :slight_smile:

I ran into this problem and I cleaned the file up on the command line using this Unix command.

  tr -d '\r' < file1.csv > file2.csv

It removes all of the carriage returns, so a CRLF file is left with plain \n. I found this to be the easiest solution for me to work with. I often just use simple Unix commands to clean files up, which makes them much easier to work with in Elixir.
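
A rough Elixir equivalent of that normalization, assuming the file fits comfortably in memory:

  # Normalize CRLF and bare CR to LF, then write the cleaned copy.
  contents =
    "file1.csv"
    |> File.read!()
    |> String.replace(["\r\n", "\r"], "\n")

  File.write!("file2.csv", contents)

Unlike tr -d, this maps a bare \r to \n instead of deleting it, so CR-only files keep their line breaks.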

I have the same problem, but I wrote a thin CSV-to-JSON converter; it’s not generic, though. I simply scan the document from the beginning and check which line ending I encounter first, and which delimiter, comma or semicolon. Then I pass it to a proper CSV-reading library with the right line ending and delimiter.

Not the fastest approach, and far from perfect, but so far in production we’ve had only one file that failed to parse, and it was really an XLS file with a .csv extension.
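
The delimiter side of that scan could be as simple as checking which candidate appears first. An Elixir sketch of the idea (the original converter is Rust):

  # Guess the delimiter from whichever of comma/semicolon occurs first.
  defmodule DelimiterSniffer do
    def sniff(chunk) do
      case {:binary.match(chunk, ","), :binary.match(chunk, ";")} do
        {:nomatch, :nomatch} -> ","
        {:nomatch, _} -> ";"
        {_, :nomatch} -> ","
        {{c, _}, {s, _}} when c <= s -> ","
        _ -> ";"
      end
    end
  end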

PS: that Rust program is run through the CLI, piping stdout back to the application.
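
Invoking such a converter from Elixir and capturing its stdout might look like this (the csv2json binary name is made up):

  # Hypothetical: run the external converter and read its stdout.
  {json, 0} = System.cmd("csv2json", ["input.csv"])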


I envy you if you work in a sector where you get to dictate the formats your customers bring you to work with :slight_smile:

Actually I make converters that they have to use. ^.^

Like, the banking reports are… random. I have SO many fixups to get them into a defined format.

Like, theirs is something kind of like CSV, except it doesn’t escape, so, say, commas in text (which happens often) are just 'there', unescaped and unquoted… It is horrifying. But my cleanup scripts fix them up. ^.^

Dang, I really don’t understand how you dealt with unescaped commas in a Comma Separated Value (CSV) file. How do you even know where the commas are supposed to go? I guess if you have a description field next to a field with known values you can say that any extra commas on the row go to the description field, but that sounds quite error-prone.
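
That heuristic can at least be sketched when the free-text column is last and the expected column count is known (the count here is invented):

  defmodule RowFixup do
    @expected 3

    # Naive split (assumes no quoting, matching the broken input described);
    # extra fields get folded back into the trailing description column.
    def fix(line) do
      case String.split(line, ",") do
        fields when length(fields) <= @expected ->
          fields

        fields ->
          {head, rest} = Enum.split(fields, @expected - 1)
          head ++ [Enum.join(rest, ",")]
      end
    end
  end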

Sometimes I wish there were a file format similar to CSV but complicated enough that everyone would realize they need an actual parser/serializer. CSV isn’t as simple as sticking commas between fields. (JSON isn’t a good alternative due to all of the duplication if you have lots of rows.)

We went the same way.

The commas were in the account-name field, so I had a whitelist of account names to scan for specifically and wrap in quotes. ^.^;
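
A sketch of that kind of fixup (the account names are invented):

  # Quote whitelisted account names so their embedded commas parse as one field.
  defmodule AccountFixup do
    @known_accounts ["Smith, John & Sons", "Acme, Inc."]

    def quote_accounts(line) do
      Enum.reduce(@known_accounts, line, fn name, acc ->
        String.replace(acc, name, ~s("#{name}"))
      end)
    end
  end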

S-Expressions!!! ^.^

Anything that can distinguish between at least 5 basic types: Integers, Floats, Strings, Composites and an ordered collection of items (List/Array).

Also it needs to support a Schema.

Therefore, whenever a client yells at me that he can provide a JSON or CSV dump of his database, I ask for an SQL export instead.

If he wants to provide a CSV export of some OOo, LO or Excel spreadsheet, I ask for a copy of the spreadsheet.

If he insists on giving us only CSV or JSON, my boss doubles the price per hour…

We’ve been through many hoops that could have been avoided if we had had access to the raw data right away… So we came up with the solution mentioned above: after adjusting the pay rate, most clients are suddenly able to provide raw data after all, and they no longer care about all the corporate policies they mentioned earlier in the discussion…


I’ve gotta remember that. ^.^