Hi,
my csv file that i want to read is from DE-region (Germany). There they use comma as decimal separator for floating point numbers. eg 4,58
Unfortunately I cannot identitfy a parameter in the documentation to specify another decimal separator for these functions:
Because of the decimal separator (in my case “,” instead of “.”) it will throw an error if defining the datatype of the columns in the csv file on reading with CSV.decode (…) or Dataframe.load_csv(…)
Example:
df = DF.load_csv!(content, dtypes: [{“myFloatCol”,:f64}])
→ throws the expected error:
RuntimeError{message: "Polars Error: could not parse \"4,58\"
as dtype f64
at column ‘myFloatCol’ (column number 4)
What would be a good practice?"
Thanks for your advice
Gordian
You could use field_transform
to accomplish most of this, by passing String.replace
:
String.replace("-123,45", ~r/^(-?)(\d+),(\d+)$/, "\\1\\2.\\3")
(adjust further if you’re expecting scientific notation too)
2 Likes
Can you give an example of a few CSV rows? F.ex. do they look like this?
ABC,123,4,58,DEF
And you expect ["ABC", "123", "4.58", "DEF"]
but get ["ABC", "123", "4", "58", "DEF"]
instead? Is that the problem? Sounds like you basically have an invalid CSV when we get down to it. Don’t think that even field_transform
can help you in this case.
You can use the xsv
tool to extract the “defective” columns and reformat the numbers to be dot-separated and then you can re-merge (or replace) the data back – I’ve done something very similar in the past, successfully. From then on you can just use any normal CSV parser.
1 Like
Thank you for the notice about xsv
tool.
It looks like, it can only be used on CLI. ( is it true?! )
But i am searching a solution, which can be integrated in an elixir-livebook-app.
(useCASE: a user should upload an csv-file and then data analyse is starting. But the input file has a comma as decimal separator. therefore i am looking for a possibility to replace the comma with an period.)
I found NimbleCSV — NimbleCSV v1.2.0
maybe it could help.
For clarification my question an example.
start.csv
col1,col2
"0,0","0,00"
"28,6","4,43"
Notice:
- the decimal separator in the example above is an comma (eg.: “0,0”)
- but i need as decimal separator a period (eg.: “0.0”)
goal.csv
col1,col2
"0.0","0.00"
"28.6","4.43"
Reading the csv file:
require Explorer.DataFrame, as: DF
df = DF.load_csv!(csv, dtypes: [{"col1",:f64}])
Problem to solve:
When i use the option “dtypes” to convert the “0,0” from start.csv i get an error because of the decimal separator:
"Polars Error: could not parse `\"0,00\"` as dtype `f64`
Does DataFrame.load:csv/2 has the ability to replace the decimal separator on the fly?!
If not what kind of “preprocessing” would you recommend?
THX
Oh, but your CSV is actually valid I see. Your problem is with Explorer.DataFrame
. I have not worked with that. Maybe the mutate/1
function can help you transform the values before they are being parsed? No idea though.
In this case – because your CSV is actually valid and not malformed like I assumed – you can just follow @al2o3cr’s advice to parse and then transform the values, and then you can feed the resulting CSV to Explorer.DataFrame
. Something like this should work:
content = File.read!(your_path_to_csv) # or a `Plug.Upload` stream?
content
|> CSV.decode!(field_transform: fn x -> String.replace(x, ~r/^(-?)(\d+),(\d+)$/, "\\1\\2.\\3") end)
|> DataFrame.load_something!(dtypes: [{“myFloatCol”,:f64}])
I don’t know if you can load parsed CSV into Explorer.DataFrame
, you likely could. But if you really can’t then you can just re-encode the data and feed them to load_csv!
.
Thank you both for your advice and your time!
I will try…
Try it and let us know how it goes. It’s not a difficult problem, you should drop the insecurity. You can do it! data:image/s3,"s3://crabby-images/0ff05/0ff0536a1ec31772718912ebfbbfc906fc3a8012" alt=":smiley: :smiley:"
I did some digging. Polars actually has a relevant option to polars.read_csv()
called decimal_comma
that Explorer does not expose:
decimal_comma
Parse floats using a comma as the decimal separator instead of a period.
We could certainly expose it. However even if it were exposed, you can’t use it for your example because your CSV also has a ,
as the separator:
polars.exceptions.InvalidOperationError: 'decimal_comma' argument cannot be combined with ',' separator
So it seems Polars, and therefore Explorer, requires a certain subset of CSV to parse directly into a float like you’re hoping to do.
However @dimitarvp is right to call out mutate
. General rule with Explorer: it’s usually fastest to get Rust to do the work if you can. So I suggest the following:
require Explorer.DataFrame, as: DataFrame
csv_path
|> DataFrame.from_csv!()
|> DataFrame.mutate(for col <- across() do
{col.name, col |> replace(",", ".") |> cast(:f64)}
end)
With this approach, you load what happens to be in the CSV into Rust, then let Rust do the string manipulation.
1 Like