How to read a file of numeric data as a list of integers?

When you read a file with File.read, you get a large binary. I’m reading a file of numeric data that I would like to have in a list of integers, but I haven’t found any straightforward way to do it. I suspect I’m missing something pretty basic.

I.e. instead of
<<77, 84, 104, 100, 0, 0, 0, 6, 0, 1, 0, 10, 0, 192, 77, 84>>
I want:
[77, 84, 104, 100, 0, 0, 0, 6, 0, 1, 0, 10, 0, 192, 77, 84]

You can just use :binary.bin_to_list/1
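
For illustration, here is a minimal sketch of that call on the bytes from the question. The variable names are placeholders, and in practice the binary would come from File.read/1 or File.read!/1 rather than a literal.

```elixir
# The binary from the question, written as a literal for the example.
data = <<77, 84, 104, 100, 0, 0, 0, 6, 0, 1, 0, 10, 0, 192, 77, 84>>

# One integer (0..255) per byte, in the original order.
bytes = :binary.bin_to_list(data)

# charlists: :as_lists keeps IO.inspect from printing printable lists as text.
IO.inspect(bytes, charlists: :as_lists)
```

The conversion is lossless, so :binary.list_to_bin/1 on the resulting list gives back the original binary.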

You can also use a for comprehension with all of its features:

iex> for <<x <- <<77, 84, 104, 100, 0, 0, 0, 6, 0, 1, 0, 10, 0, 192, 77, 84>> >>, do: x
[77, 84, 104, 100, 0, 0, 0, 6, 0, 1, 0, 10, 0, 192, 77, 84]
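
As a rough sketch of what those features can look like, the bitstring generator also accepts a size on each segment and the usual comprehension filters. The sample bytes and variable names below are made up for the example.

```elixir
data = <<0, 6, 0, 1, 0, 10, 0, 192>>

# Read each pair of bytes as one unsigned big-endian 16-bit integer.
words = for <<w::16 <- data>>, do: w
# words == [6, 1, 10, 192]

# Filters work as in any other comprehension: keep only bytes greater than 5.
big_bytes = for <<b <- data>>, b > 5, do: b
# big_bytes == [6, 10, 192]
```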

What does the file look like? Is it ASCII text that you can read as numbers in your editor, separated by spaces, commas, or some other arbitrary separator?

Or does your file really consist of those bytes 77, 84, etc.?

Why not just keep it as a binary? If the binary is large, it is allocated on a shared heap, which can be more memory efficient. https://medium.com/@mentels/a-short-guide-to-refc-binaries-f13f9029f6e2

It is mixed numeric data. There are strings within it, but also int8, int16, int32, floats. I’ve got code to read it, but I fell afoul of a strange phenomenon. Take the binary d = <<130, 13, 10, 60>> for example:
String.length(d)
3
But, take the binary d = <<130, 120, 201, 17>>
String.length(d)
4
It appears that String munges the 13 and 10 in the first example to a single character ‘\r\n’.

String.slice(<<130, 13, 10, 60>>, 1..1)
"\r\n"

This is happening to me on one of the files I’m reading and in the context the 13 and the 10 are part of different messages, but since String does that I can’t parse it correctly.

:binary.bin_to_list/1 is just what I was looking for. Thanks!

To understand this behaviour, you need to check the String documentation:

Code points and grapheme cluster

The functions in this module act according to the Unicode Standard, version 12.1.0.

As per the standard, a code point is a single Unicode Character, which may be represented by one or more bytes.

For example, although the code point “é” is a single character, its underlying representation uses two bytes:

String.length("é")
1
byte_size("é")
2

Furthermore, this module also presents the concept of grapheme cluster (from now on referenced as graphemes). Graphemes can consist of multiple code points that may be perceived as a single character by readers. For example, “é” can be represented either as a single “e with acute” code point or as the letter “e” followed by a “combining acute accent” (two code points):

string = "\u0065\u0301"
byte_size(string)
3
String.length(string)
1
String.codepoints(string)
["e", "́"]
String.graphemes(string)
["é"]

Source: String — Elixir v1.16.0

I was curious how to read file contents as charlist and I found:

{:ok, pid} = File.open("mix.exs", [:charlist])
IO.inspect IO.read(pid, :all)
# Outputs: {:ok, 'defmodule ...'}
File.close(pid)

String.length/1 counts the number of graphemes, not code points. So definitely not the right tool for the job you are tackling.

I was surprised to see that \r\n is considered a grapheme cluster, but according to the Unicode standard it is.

If you just want size in bytes, there is byte_size/1. Also much faster than String.length/1.
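
A small sketch of working at the byte level without converting to a list, using the binary from the earlier post. The variable names are only for illustration.

```elixir
d = <<130, 13, 10, 60>>

# byte_size/1 counts raw bytes, so 13 and 10 are counted separately.
size = byte_size(d)
# size == 4

# :binary.at/2 returns the byte at a zero-based position as an integer.
second = :binary.at(d, 1)
# second == 13

# binary_part/3 slices by byte offset and length, not by grapheme.
middle = binary_part(d, 1, 2)
# middle == <<13, 10>>
```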

I was curious how to read file contents as charlist and I found:

{:ok, pid} = File.open("mix.exs", [:charlist])
IO.inspect IO.read(pid, :all)
# Outputs: {:ok, 'defmodule ...'}
File.close(pid)

Yes, that works, too. Thanks for the pointer.

That helps me understand the behavior, but I don’t think it helps with my underlying issue, which is that I need the consecutive bytes <<13, 10>> not to be treated as a single character. That’s why it seems that I need the data represented as a list of integers instead of as a binary string.

I wrote that only to explain why this does not work for you. If you want us to suggest a solution for processing the binary, you need to be precise about what you want to do with your data. As a general tip, see the documentation of Kernel.SpecialForms.<<>>/1, which is really useful in pattern matching. If you have any questions about binaries or any other data type, feel free to ask.
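
To give an idea of what that pattern matching can look like, here is a minimal sketch. The record layout is hypothetical and invented purely for the example (a 4-byte tag, a 32-bit length, a signed 16-bit integer, one byte, a 32-bit float, then the remainder); it is not the layout of your file.

```elixir
# Hypothetical record, built by hand so the example is self-contained.
# Integer and float segments are big-endian unless stated otherwise.
record = <<"MThd", 6::32, -2::16-signed, 13, 1.5::float-32, 10, 60>>

# Split the record into typed fields in a single match.
<<tag::binary-size(4), len::32, a::16-signed, b::8, f::float-32, rest::binary>> = record

IO.inspect({tag, len, a, b, f, rest})
```

Here the byte 13 ends up in b and the byte 10 is the first byte of rest, so the two are never merged. Modifiers such as little can be added to a segment (for example x::little-32) if the data is little-endian.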

This is only treated as a grapheme cluster by the String module (since strings are Unicode strings by definition). Even within the String module, String.codepoints(<<13, 10>>) returns what you are after.
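
A short comparison of the three views of the same two bytes, as a sketch with illustrative variable names:

```elixir
crlf = <<13, 10>>

# Grapheme view: CR followed by LF is one cluster.
graphemes = String.graphemes(crlf)
# graphemes == ["\r\n"]

# Code point view: two one-character strings.
codepoints = String.codepoints(crlf)
# codepoints == ["\r", "\n"]

# Byte view: two integers.
bytes = :binary.bin_to_list(crlf)
# bytes == [13, 10]
```

Note that String.codepoints/1 returns strings rather than integers, so for data that is not text the byte view is probably the closer fit.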

As @eji pointed out, some very powerful binary pattern matches are possible, so if you’re able to share the shape of the data you are receiving and what you are trying to do with it, I know forum members would be happy to help.

Don’t think that I don’t appreciate your explanations - I do very much indeed. Let me experiment a bit with what you’ve given me thus far and I’ll ask more specific questions when I formulate them.

Can you give us an example of the input binary and its expected output?

But what are you trying to do exactly? So far this discussion has been a demonstration of the XY problem – you have a problem X and fixate on solution Y without telling us what X is.

I am joining @Aetherus in asking for an input and expected output.
