How to read a file of numeric data as a list of integers?

bwanab · January 25, 2021, 10:01pm

When you read a file with File.read, you get a large binary. I’m reading a file of numeric data that I would like to have in a list of integers, but I haven’t found any straightforward way to do it. I suspect I’m missing something pretty basic.

I.e. instead of
<<77, 84, 104, 100, 0, 0, 0, 6, 0, 1, 0, 10, 0, 192, 77, 84>>
I want:
[77, 84, 104, 100, 0, 0, 0, 6, 0, 1, 0, 10, 0, 192, 77, 84]

seanmor5 · January 25, 2021, 10:06pm

You can just use :binary.bin_to_list/1
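
A minimal sketch of that suggestion applied to a file, assuming the whole file fits in memory. The temporary file name is a made-up placeholder, written here only so the snippet runs on its own.

```elixir
# Hypothetical path, created only to keep the example self-contained.
path = Path.join(System.tmp_dir!(), "bin_to_list_example.bin")
File.write!(path, <<77, 84, 104, 100, 0, 0, 0, 6>>)

# File.read!/1 returns the file contents as a single binary.
bytes = File.read!(path)

# :binary.bin_to_list/1 converts it to a list of integers in 0..255.
ints = :binary.bin_to_list(bytes)
IO.inspect(ints, charlists: :as_lists)
# [77, 84, 104, 100, 0, 0, 0, 6]

File.rm!(path)
```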

Eiji · January 25, 2021, 10:13pm

You can also use for with all of its features:

iex> for <<x <- <<77, 84, 104, 100, 0, 0, 0, 6, 0, 1, 0, 10, 0, 192, 77, 84>> >>, do: x
[77, 84, 104, 100, 0, 0, 0, 6, 0, 1, 0, 10, 0, 192, 77, 84]

NobbZ · January 25, 2021, 10:15pm

What does the file look like? Is it ASCII text that you can read as numbers in your editor, separated by spaces, commas or some other arbitrary separator?

Or does your file really consist of those raw bytes, 77, 84, etc.?

mpope · January 25, 2021, 10:24pm

Why not just keep it as a binary? If the binary is large, it is allocated on a shared heap and can be more memory efficient. https://medium.com/@mentels/a-short-guide-to-refc-binaries-f13f9029f6e2

bwanab · January 25, 2021, 10:29pm

It is mixed numeric data. There are strings within it, but also int8, int16, int32, floats. I’ve got code to read it, but I fell afoul of a strange phenomenon. Take the binary d = <<130, 13, 10, 60>> for example:
String.length(d)
3
But, take the binary d = <<130, 120, 201, 17>>
String.length(d)
4
It appears that String merges the 13 and 10 in the first example into a single character, "\r\n".

String.slice(<<130, 13, 10, 60>>, 1..1)
"\r\n"

This is happening with one of the files I’m reading. In that context the 13 and the 10 belong to different messages, but because String combines them I can’t parse the data correctly.

bwanab · January 25, 2021, 10:30pm

:binary.bin_to_list/1 is just what I was looking for. Thanks!

Eiji · January 25, 2021, 10:42pm

To understand this behaviour, you need to check the String documentation:

Code points and grapheme cluster

The functions in this module act according to the Unicode Standard, version 12.1.0.

As per the standard, a code point is a single Unicode Character, which may be represented by one or more bytes.

For example, although the code point “é” is a single character, its underlying representation uses two bytes:
String.length("é")
1
byte_size("é")
2
Furthermore, this module also presents the concept of grapheme cluster (from now on referenced as graphemes). Graphemes can consist of multiple code points that may be perceived as a single character by readers. For example, “é” can be represented either as a single “e with acute” code point or as the letter “e” followed by a “combining acute accent” (two code points):
string = "\u0065\u0301"
byte_size(string)
3
String.length(string)
1
String.codepoints(string)
["e", "́"]
String.graphemes(string)
["é"]
Source: String — Elixir v1.16.0

wojtekmach · January 25, 2021, 10:46pm

I was curious how to read file contents as charlist and I found:

{:ok, pid} = File.open("mix.exs", [:charlist])
IO.inspect IO.read(pid, :all)
# Outputs: {:ok, 'defmodule ...'}
File.close(pid)

kip · January 25, 2021, 10:49pm

String.length/1 counts the number of graphemes, not code points. So definitely not the right tool for the job you are tackling.

I was surprised to see that \r\n is considered a grapheme cluster but it is according to the Unicode standard.
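
A small sketch of that distinction, using only the two bytes in question. The variable name `crlf` is just illustrative.

```elixir
crlf = <<13, 10>>

# String functions work on grapheme clusters, and CR LF is one cluster.
IO.inspect(String.length(crlf))     # 1
IO.inspect(String.graphemes(crlf))  # ["\r\n"]

# At the code point level the two characters are still separate.
IO.inspect(String.codepoints(crlf)) # ["\r", "\n"]
```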

derek-zhou · January 26, 2021, 12:25am

If you just want size in bytes, there is byte_size/1. Also much faster than String.length/1.
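
To illustrate with the binary from earlier in the thread: the byte-oriented functions ignore Unicode entirely, so 13 and 10 stay separate. This is only a sketch; `binary_part/3` and `:binary.at/2` are shown as byte-level counterparts to `String.slice/2`, which nobody above proposed explicitly.

```elixir
d = <<130, 13, 10, 60>>

# Size in bytes, regardless of how String would group them.
IO.inspect(byte_size(d))          # 4

# One-byte slice starting at byte offset 1 (zero-based).
IO.inspect(binary_part(d, 1, 1))  # "\r"

# The integer value of the byte at offset 2.
IO.inspect(:binary.at(d, 2))      # 10
```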

bwanab · January 26, 2021, 12:35am

Quoting wojtekmach:
I was curious how to read file contents as charlist and I found:

{:ok, pid} = File.open("mix.exs", [:charlist])
IO.inspect IO.read(pid, :all)
# Outputs: {:ok, 'defmodule ...'}
File.close(pid)

Yes, that works, too. Thanks for the pointer.

That helps me understand the behavior, but I don’t think it helps with my underlying issue, which is that I need the consecutive bytes <<13, 10>> not to be treated as a single character. That’s why it seems that I need to have the data represented as a list of integers instead of as a binary string.

Eiji · January 26, 2021, 12:47am

I wrote that only to explain why this does not work for you. If you want us to suggest a solution for doing something with a binary, you need to be precise about what you want to do with your data. As a general tip, have a look at the documentation of Kernel.SpecialForms.<<>>/1. It is really useful in pattern matching. If you have any questions regarding binaries or any other data type, feel free to ask.
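
A hedged illustration of that kind of pattern matching, using the 16 bytes from the original question. Those bytes happen to look like a Standard MIDI File header chunk ("MThd", a 32-bit length, then three 16-bit fields); that reading is a guess, not something confirmed in this thread, and the field names are purely illustrative. The second match uses made-up bytes to show the int8, int16 and float modifiers.

```elixir
data = <<77, 84, 104, 100, 0, 0, 0, 6, 0, 1, 0, 10, 0, 192, 77, 84>>

# Integer segments are unsigned and big-endian by default.
# Field names (tag, len, format, tracks, division) are assumptions.
<<tag::binary-size(4), len::32, format::16, tracks::16, division::16, rest::binary>> = data

IO.inspect({tag, len, format, tracks, division, rest})
# {"MThd", 6, 1, 10, 192, "MT"}

# Other widths and types use modifiers; these input bytes are invented.
<<a::signed-8, b::little-16, f::float-32>> = <<255, 1, 0, 63, 128, 0, 0>>
IO.inspect({a, b, f})
# {-1, 1, 1.0}
```

Matching like this works directly on the bytes, so a 13 followed by a 10 is never combined.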

kip · January 26, 2021, 12:49am

This is only treated as a grapheme cluster by the String module (since strings are Unicode strings by definition). Even within the String module, String.codepoints(<<13, 10>>) returns what you are after.

As @Eiji pointed out, some very powerful binary pattern matches are possible, so if you’re able to share the shape of the data you are receiving and what you are trying to do with it, I know forum members would be happy to help.

bwanab · January 26, 2021, 2:22am

Don’t think that I don’t appreciate your explanations - I do very much indeed. Let me experiment a bit with what you’ve given me thus far and I’ll ask more specific questions when I formulate them.

Aetherus · January 26, 2021, 2:27am

Can you give us an example of the input binary and its expected output?

dimitarvp · January 26, 2021, 11:13am

But what are you trying to do exactly? So far this discussion has been a demonstration of the XY problem – you have a problem X and fixate on solution Y without telling us what X is.

I am joining @Aetherus in asking for an input and expected output.