When you read a file with File.read, you get a large binary. I’m reading a file of numeric data that I would like to have in a list of integers, but I haven’t found any straightforward way to do it. I suspect I’m missing something pretty basic.
What does the file look like? Is it ASCII text that you can read as numbers in your editor, separated by spaces, commas or some other arbitrary separator?
Or does your file really consist of those raw bytes 77, 87, etc.?
It is mixed numeric data. There are strings within it, but also int8, int16, int32, floats. I’ve got code to read it, but I fell afoul of a strange phenomenon. Take the binary d = <<130, 13, 10, 60>> for example:
String.length(d)
3
But, take the binary d = <<130, 120, 201, 17>>
String.length(d)
4
It appears that String munges the 13 and 10 in the first example to a single character ‘\r\n’.
String.slice(<<130, 13, 10, 60>>, 1..1)
"\r\n"
This is happening to me on one of the files I’m reading and in the context the 13 and the 10 are part of different messages, but since String does that I can’t parse it correctly.
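For what it's worth, the byte-oriented functions in Kernel and in Erlang's :binary module do not apply any grapheme logic. A small sketch, reusing the binary from the example above, that keeps 13 and 10 separate:

```elixir
d = <<130, 13, 10, 60>>

# byte_size/1 counts raw bytes, not graphemes
byte_size(d)
#=> 4

# binary_part/3 slices by byte offset and byte length
binary_part(d, 1, 1)
#=> <<13>>, which IEx prints as "\r"
binary_part(d, 2, 1)
#=> <<10>>, which IEx prints as "\n"

# :binary.at/2 returns the byte at a zero-based position as an integer
:binary.at(d, 1)
#=> 13
```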
To understand the behaviour you need to check the String documentation:
Code points and grapheme cluster
The functions in this module act according to the Unicode Standard, version 12.1.0.
As per the standard, a code point is a single Unicode Character, which may be represented by one or more bytes.
For example, although the code point “é” is a single character, its underlying representation uses two bytes:
String.length("é")
1
byte_size("é")
2
Furthermore, this module also presents the concept of grapheme cluster (from now on referenced as graphemes). Graphemes can consist of multiple code points that may be perceived as a single character by readers. For example, “é” can be represented either as a single “e with acute” code point or as the letter “e” followed by a “combining acute accent” (two code points):
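The quoted passage stops just before its example. A minimal sketch of what it describes, written from the description above rather than copied from the docs (the variable name is only for illustration):

```elixir
# "e" (U+0065) followed by the combining acute accent (U+0301)
string = "\u0065\u0301"

# Three bytes: one for "e", two for the combining accent
byte_size(string)
#=> 3

# One grapheme, but two code points
String.length(string)
#=> 1
length(String.codepoints(string))
#=> 2
```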
That helps me understand the behavior, but I don’t think it helps with my underlying issue, which is that I need the consecutive bytes <<13, 10>> not to be treated as a single character. That’s why it seems that I need to have the data represented as a list of integers instead of as a binary string.
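If a list of integers really is what is wanted, there are at least two ways to get one without going through the String module. A short sketch using the same example binary:

```elixir
d = <<130, 13, 10, 60>>

# Erlang's :binary.bin_to_list/1 returns one integer per byte
:binary.bin_to_list(d)
#=> [130, 13, 10, 60]

# A binary comprehension walks the bytes the same way
for <<byte <- d>>, do: byte
#=> [130, 13, 10, 60]
```

Neither call looks at Unicode at all, so 13 and 10 stay separate elements.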
I wrote that only to explain why this does not work for you. If you want us to suggest how to do something with a binary, then you need to be precise about what you want to do with your data. As a general tip, read the documentation of Kernel.SpecialForms.<<>>/1. It is really useful in pattern matching. If you have any questions regarding binaries or any other data type, feel free to ask.
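To give an idea of what that pattern matching looks like for mixed int8/int16/int32/float data, here is a rough sketch. The record layout, the little-endian byte order, the length-prefixed string and the module name RecordSketch are all assumptions made up for illustration, since the real file format has not been shared:

```elixir
defmodule RecordSketch do
  # Hypothetical layout, little-endian: int8, int16, int32, 32-bit float,
  # then a 1-byte length followed by that many bytes of string data.
  def parse(<<
        a::signed-integer-size(8),
        b::little-signed-integer-size(16),
        c::little-signed-integer-size(32),
        f::little-float-size(32),
        len::unsigned-integer-size(8),
        name::binary-size(len),
        rest::binary
      >>) do
    {:ok, %{a: a, b: b, c: c, f: f, name: name}, rest}
  end

  # Anything that does not hold one full record falls through to here
  def parse(other) when is_binary(other), do: {:error, :incomplete}
end

# Build a sample record by hand; the trailing byte 60 is left over as `rest`
sample =
  <<-5::signed-integer-size(8), 300::little-signed-integer-size(16),
    70_000::little-signed-integer-size(32), 1.5::little-float-size(32),
    2, 13, 10, 60>>

{:ok, record, rest} = RecordSketch.parse(sample)
```

Here record.name is the two-byte binary <<13, 10>> and rest is <<60>>. Matching works on bytes and bits, so a 13 followed by a 10 is never merged the way String does it.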
This is only treated as a grapheme cluster by the String module (since strings are Unicode strings by definition). Even within the String module, String.codepoints(<<13, 10>>) returns the two characters separately, which is what you are after.
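A quick sketch of that difference (the variable name is only for illustration):

```elixir
crlf = <<13, 10>>

# One grapheme cluster
String.graphemes(crlf)
#=> ["\r\n"]

# Two code points
String.codepoints(crlf)
#=> ["\r", "\n"]
```

Note that the elements returned are still one-character binaries, not integers.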
As @eji pointed out, some very powerful binary pattern matches are possible, so if you’re able to share the shape of the data you are receiving and what you are trying to do with it, I know forum members would be happy to help.
Don’t think that I don’t appreciate your explanations - I do very much indeed. Let me experiment a bit with what you’ve given me thus far and I’ll ask more specific questions when I formulate them.
But what are you trying to do exactly? So far this discussion has been a demonstration of the XY problem – you have a problem X and fixate on solution Y without telling us what X is.
I am joining @Aetherus in asking for a sample input and the expected output.