Converting variable number of CSV headers into map

onkara · January 14, 2021, 6:53pm

This post builds on the requirements mentioned in a similar post.

I am using NimbleCSV and I am a Elixir n00b

My Use case: Customer uploads CSV with 2 set of headers

fixed and known headers, this one is easy as shown here
additional variable number of headers

I can’t figure how using NimbleCSV can I capture additional variable number of headers and its corresponding columns in either the same map or a different map.

The parsed output I would like to achieve is

[
 %{
  fixed_h1: "val1",
  fixed_h2: "val2",
  fixed_h3: "val3",
  variable_h1: "val4",
  variable_h2: "val5", .....
 }
]

benwilson512 · January 14, 2021, 7:42pm

Hi @onkara, what have you tried so far?

onkara · January 16, 2021, 3:22pm

I think I managed to solve it, but I think there is a more elegant solution out there

defmodule Scratch do
  NimbleCSV.define(MyCSVParser, [])

  # core_cols = %{
  #   date_of_birth: 'DOB',
  #   external_id: 'Plan_Member_ID',
  #   first_name: 'First_Name',
  #   last_name: 'Last_Name',
  #   phone: 'Phone'
  # }
  #
  # custom_col_pos = %{
  #   0 => 'Plan_Member_ID',
  #   1 => 'First_Name',
  #   2 => 'Phone',
  #   3 => 'Last_Name',
  #   4 => 'DOB',
  #   5 => 'HbA1c',
  #   6 => 'Hypertension',
  #   7 => 'Children',
  #   8 => 'Gender', 
  #   9 => 'Pain'
  # }
  def process_csv(file_path, %{} = custom_col_pos, %{} = core_cols) do
    file_path
    |> File.stream!(read_ahead: 1000)
    |> MyCSVParser.parse_stream([{:skip_headers, true}])
    |> Stream.map(fn line ->
      mapped_row =
        line
        |> Enum.with_index()
        |> Enum.reduce(
          %{},
          fn {cell_data, index}, acc ->
            header = Map.fetch!(custom_col_pos, index)
            column_name = if core_cols[header], do: core_cols[header], else: header

            Map.put(acc, column_name, format_col(cell_data))
          end
        )

      IO.inspect(mapped_row, label: "MappedRow")
      # {:ok, date_of_birth} = DateTimeParser.parse_date(dob)
    end)
    |> Stream.run()
  end

  defp format_col(str) do
    str |> :binary.copy() |> Macro.underscore() |> String.downcase()
  end
end

the output is (trimmed)

MappedRow: %{
  :date_of_birth => "1/1/1974",
  :external_id => "120511",
  :first_name => "jane",
  :last_name => "doe",
  :phone => "1112223333",
  'Children' => "n",
  'Gender' => "f",
  'HbA1c' => "6/3",
  'Hypertension' => "y",
  'Pain' => "y"
}

As you can see in the output I need to do some data type specific transformation i.e. HbA1c should be 6.3 not 6/3. So I guess I can use pattern matching on format_col function variants using regular expressions, but not sure if this is the best approach?

fchabouis · February 10, 2022, 11:27am

I used the following code to output a map, whose keys are the column names and values are the row values.

      "a,b,c\n1,2,3\n4,5,6"
      |> NimbleCSV.RFC4180.parse_string(skip_headers: false)
      |> Stream.transform([], fn r, acc ->
        if acc == [] do
          {%{}, r |> Enum.map(fn h -> h |> String.replace_prefix("\uFEFF", "") end)}
        else
          {[Enum.zip(acc, r) |> Enum.into(%{})], acc}
        end
      end)
     |> Enum.to_list()

[%{"a" => "1", "b" => "2", "c" => "3"}, %{"a" => "4", "b" => "5", "c" => "6"}]

The idea is to keep the headers row, and use a Stream.transform. The accumulator contains the headers, and you zip the headers with each row. You don’t need to know which headers will be there.

LostKobrakai · February 10, 2022, 1:16pm

nimble_csv has options to drop byte order marks (the \uFEFF).

fchabouis · February 10, 2022, 1:23pm

In my real code, I’m using NimbleCSV.RFC4180.parse_stream, and I understand from the documentation I cannot trim_bom from a stream.

:trim_bom - automatically trims BOM (byte-order marker) when parsing string. Note the bom is not trimmed for enumerables or streams. In such cases, the BOM must be trimmed directly in the stream, such as File.stream!(path, [:trim_bom])

Here that part is probably not necessary.
Thanks for your comment!

dimitarvp · February 11, 2022, 8:07pm

That sucks but it depends how big CSV payloads you are expecting. If they cap at 4-5MB it’s likely a non-issue just to download them fully and then make use of the :trim_bom setting.

dimitarvp · February 11, 2022, 9:04pm

But at the same time, here’s a super lame implementation I came up with in 5 minutes:

  def maybe_trim_bom(stream) do
    stream
    |> Stream.with_index()
    |> Stream.drop_while(fn
      {0xFE, 0} -> true
      {0xFF, 1} -> true
      _ -> false
    end)
    |> Stream.map(fn {item, _index} -> item end)
  end

And you can put that function at the start of your pipe. But I fear that it’s super inefficient.