How to determine if a string matches some character encoding in Elixir

sandergroen · December 5, 2018, 8:09am

Hi everyone,

I would like to know how I can determine if a string matches some character encoding in Elixir. So for instance if the string matches 8 bit ASCII.

Thanks

NobbZ · December 5, 2018, 8:58am

Sadly you can’t, as any valid 8-bit ASCII encoded string is also valid ISO-whatever. Any UTF-8 encoded string is valid 8-bit ASCII as well; it might just look different from what you would expect.

So unless you have some reference symbols at specific positions of the input, you can’t be sure.

dimitarvp · December 7, 2018, 7:54pm

Your best bet would probably be to use one of the text transcoding libraries. Try to convert from your expected codepage to utf8 and if that fails, fail the call or try the next one in priority order.
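
A minimal sketch of that "try in priority order" idea, using only `:unicode.characters_to_binary/3` from the Erlang standard library rather than a dedicated transcoding library. The module name `EncodingGuess` and the function `to_utf8/2` are hypothetical, made up for illustration. Note that Latin-1 accepts every byte sequence, so it only makes sense as the last candidate.

```elixir
defmodule EncodingGuess do
  # Hypothetical helper: returns {:ok, encoding, utf8_string} for the first
  # candidate encoding that decodes cleanly, or :error if none does.
  def to_utf8(bin, candidates \\ [:utf8, :latin1]) when is_binary(bin) do
    Enum.find_value(candidates, :error, fn encoding ->
      case :unicode.characters_to_binary(bin, encoding, :utf8) do
        converted when is_binary(converted) -> {:ok, encoding, converted}
        # {:error, _, _} or {:incomplete, _, _}: try the next candidate
        _ -> nil
      end
    end)
  end
end
```

A successful result only says that the bytes decode under that encoding, not that the sender actually used it.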

What are you trying to do, by the way?

sandergroen · December 13, 2018, 7:41pm

Thanks NobbZ and dimitarvp,

I need this to validate a function output in a test. I said that I wanted to match a string, but in fact I need to match a binary (although in Elixir a string is a binary).

In Ruby there is an encoding method that can be matched to Encoding::ASCII_8BIT. For example:
"\xE0\xC9\x9F\x7F\x15\x8B\xAA\xAA\xF8t\xA7\x03\x8D7\x95\x90".encoding == Encoding::ASCII_8BIT
returns true.

"\xE0\xC9\x9F\x7F\x15\x8B\xAA\xAA\xF8t\xA7\x03\x8D7\x95\x90"

Is the following binary in Elixir:

<<224, 201, 159, 127, 21, 139, 170, 170, 248, 116, 167, 3, 141, 55, 149, 144>>

But according to your replies I assume there is no such thing in Elixir.
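
For what it is worth, here is a small sketch of what can be checked on that binary in Elixir. It does not read an encoding property (a binary carries none); it only tests whether the value is a binary, how long it is, and whether it happens to be valid UTF-8.

```elixir
bin = <<224, 201, 159, 127, 21, 139, 170, 170, 248, 116, 167, 3, 141, 55, 149, 144>>

# A binary is any sequence of whole bytes, so this is true.
IO.inspect(is_binary(bin))

# 16 bytes, the same length as the Ruby string.
IO.inspect(byte_size(bin))

# false: byte 224 opens a multi-byte UTF-8 sequence, but 201 is not a
# continuation byte, so this is not a valid Elixir string.
IO.inspect(String.valid?(bin))
```

In a test, `is_binary/1` plus `byte_size/1` is probably the closest match to the Ruby check, assuming the function is expected to return 16 raw bytes.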

NobbZ · December 13, 2018, 9:06pm

As I said already, any sequence of bytes with values between 0 and 255 is valid 8-bit ASCII.

But from your Ruby snippet it seems as if you do not want to validate but rather read a property of an encoded string.

There might be libraries available, but those won’t be plain strings anymore but structs.

OvermindDL1 · December 13, 2018, 11:23pm

Actually 0-127 is valid 7-bit ascii, there is no 8-bit ascii, there are higher ‘encodings’ of ascii like EASCII (great for drawing boxes!), Latin-1, etc… etc… etc… that are 8-bit though.
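
The 7-bit range is one thing that can actually be validated. A minimal sketch, with a hypothetical module name `AsciiCheck`, that walks the binary and rejects any byte above 127:

```elixir
defmodule AsciiCheck do
  # True only if every byte is in the 7-bit ASCII range 0..127.
  def ascii?(<<>>), do: true
  def ascii?(<<byte, rest::binary>>) when byte < 128, do: ascii?(rest)
  # Reached when the first remaining byte is 128 or higher.
  def ascii?(bin) when is_binary(bin), do: false
end
```

Anything this function accepts is also valid UTF-8, Latin-1, and so on, since those encodings agree with ASCII on the lower 128 values.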

NobbZ · December 14, 2018, 6:18am

Yeah, those are commonly referred to as 8-bit ASCII. But none of those encodings has anything in the upper area that would make them “invalid”.

NobbZ · December 14, 2018, 6:28am

I’ve taken a look in the ruby documentation:

Encoding::ASCII_8BIT is a special encoding that is usually used for a byte string, not a character string. But as the name insists, its characters in the range of ASCII are considered as ASCII characters. This is useful when you use ASCII-8BIT characters with other ASCII compatible characters.

From this I read that Ruby tries to print such a string as plain ASCII but in general considers it binary data.

Also, Ruby does not check the validity of an encoding when you read that property; it just returns whatever the property was set to before.

Unless you really have to deal with input that is not UTF-8, there is usually no issue. And if you do, the proper way is to convert it at the system boundary or treat it as opaque binary data.
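
A rough sketch of both options, assuming (purely for illustration) that the external input is Latin-1. The module `Boundary` and its function names are hypothetical; `:unicode.characters_to_binary/3` and `Base.encode16/2` are standard library calls.

```elixir
defmodule Boundary do
  # Option 1: convert once at the edge of the system.
  # Assumption: the sender uses Latin-1 (ISO-8859-1).
  def from_latin1(bytes) when is_binary(bytes) do
    :unicode.characters_to_binary(bytes, :latin1, :utf8)
  end

  # Option 2: keep the data opaque and make no claim about characters;
  # hex-encode it only when it has to be displayed or logged.
  def describe(bytes) when is_binary(bytes) do
    Base.encode16(bytes, case: :lower)
  end
end
```

After `from_latin1/1`, the rest of the code can rely on ordinary `String` functions; with `describe/1`, the bytes themselves are never interpreted.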

OvermindDL1 · December 14, 2018, 3:51pm

And they are all incompatible, so lumping them just under an 8-bit ASCII term doesn’t say which is being used. ^.^;

Indeed! The whole old ‘standard’ was efficient but not scalable! :-/