Is there a way to escape escape characters i.e. turn `"\u2003"` => `"\\u2003"`?

Hey, does anybody know if it is possible to escape escape characters like \r or \u2003? Given a string containing such characters, I’d like to turn them into an escaped form as follows:

iex> Magic.escape("\r\u2003")
"\\r\\u2003"

The only thing that I am aware of after 5 minutes of searching is Regex.escape/1.

And it only covers part of these: Escape characters.

Do you mean at compile-time?

You cannot do it at runtime, because once you have a string, you cannot identify how that string was produced. For example:

iex(1)> " "
" "
iex(2)> "\s"
" "
iex(3)> <<32>>
" "

Given the string " ", there’s no way to tell whether the source code entered the value directly, used the escape character, or constructed the string some other way.

Hi @adamu
I don’t mean to restore it the way it was created. I just need to restore it to one of the escape char variants.

For example I would want this:

iex> Magic.escape("\0")
"\\0"

iex> Magic.escape(<<0>>)
"\\0"

@dimitarvp It is not exactly what I am looking for as it does a bit more.

Including for all spaces, newlines, unicode characters, and random bytes?

Excluding newlines, tabs and spaces. But others like \r, \0, \uNNNN… :wink:

The more I think about it, the more I get the feeling that it is not possible…

\r is carriage return, you probably want to skip that too. \uNNNN is all Unicode characters, so literally everything that’s valid Unicode would be converted to that.

I think the problem is that what you are asking for doesn’t make sense. :slight_smile: What are you trying to do?

All you need to do is to define a list of characters you want to escape and the rest is a simple reduce:

defmodule Example do
  @characters_to_escape [?b]

  def sample(string) when is_binary(string) do
    for <<char::utf8 <- string>>, reduce: "" do
      acc -> acc <> maybe_escape(char)
    end
  end

  defp maybe_escape(char) when char in @characters_to_escape do
    char
    |> Integer.to_string(16)
    |> String.pad_leading(4, "0")
    |> then(&"\\u" <> &1)
  end

  defp maybe_escape(char), do: <<char::utf8>>
end

iex> Example.sample("a b c")
"a \\u0062 c"

Thank you all for the input and the sparring. @Eiji I guess I was hoping for some well-defined distinction / function to tell “printable” from other unicode chars. But I learned that what needs to be escaped depends very much on the use case. Looking at Jason, there is the :escape option which can be one of 4 possible values, each one escaping more or fewer chars.
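For reference, Elixir does ship `String.printable?/1`, but its notion of "printable" is probably not the one wanted here, since it also accepts the common escape characters such as \r. A quick check, just to illustrate the mismatch:

```elixir
# String.printable?/1 is part of Elixir's standard library.
# It treats the usual escape characters (\n, \r, \t, \v, \b, \f, \e, \d, \a)
# as printable, so it cannot decide on its own what an encoder should escape.
IO.inspect(String.printable?("abc"))  # true
IO.inspect(String.printable?("\r"))   # true, although \r should be escaped here
IO.inspect(String.printable?(<<0>>))  # false
```

So it can serve as a rough filter, but the list of characters to escape still has to come from the target format.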

I think the problem is that what you are asking for doesn’t make sense.

@adamu Yes, that too, probably! :smiley:

Inspired by Jason I ended up with something like the following which works for my case for now (probably needs refinement):

defmodule NotSoMuchMagic do
  # {source codepoint, letter that follows the backslash in the escaped form}
  @escape_char_mapping Enum.zip('\b\f\r\v\"\\', 'bfrv"\\')
  @unicode_char_mapping 0x00..0x1F |> Enum.reject(&Kernel.in(&1, '\n\t'))

  for {src, dst} <- @escape_char_mapping do
    defp encode_char(unquote(src)), do: [?\\, unquote(dst)]
  end

  for uchar <- @unicode_char_mapping do
    unicode_sequence = List.to_string(:io_lib.format("\\u~4.16.0B", [uchar]))
    defp encode_char(unquote(uchar)), do: unquote(unicode_sequence)
  end

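  # Everything else passes through, re-encoded as UTF-8: a bare integer above 255
  # is not valid iodata, and 128..255 would be emitted as a single raw byte.
  defp encode_char(char), do: <<char::utf8>>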

  def encode(binary) do
    iodata = for (<< char::utf8 <- binary >>), do: encode_char(char)
    IO.iodata_to_binary(iodata)
  end
end
iex> NotSoMuchMagic.encode("\r \t \v \0 \u0028 \u0012")
"\\r \t \\v \\u0000 ( \\u0012"

Enum.reject(0x00..0x1F, & &1 in '\n\t')


What are you trying to do with that information?

Decide how to encode them. :wink:


Aren’t all Unicode “characters” printable by definition? What other encoding are you trying to convert them to?

Maybe I’m using the wrong terms. But a YAML encoder needs to encode different chars differently. For example: normal letters, digits, \n and \t can be left as they are. The bell character (\u0007) can be encoded as \u0007 or \a. Others need to be encoded in the form \u0033 or \x33.

YAML - Wikipedia.

If you want to produce “ASCII only” YAML and escape everything else, that’s fine, but it seems like YAML technically supports the full UTF-8 spectrum. My overall point is that I would pick a well-known subset of UTF-8 like ASCII and then build your filter on top of that. All ASCII chars go in as is; non-ASCII chars use the escaped form.
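A minimal sketch of that ASCII-based filter, under a few assumptions of mine: the module name `AsciiOnly` is made up, "ASCII" is taken to mean printable ASCII plus \n and \t, and everything else becomes \uNNNN (or \UNNNNNNNN above the BMP). It does not handle `"` or `\`, which a double-quoted YAML scalar would also need escaped.

```elixir
defmodule AsciiOnly do
  # Hypothetical helper, not taken from any library.
  def escape(string) when is_binary(string) do
    for <<cp::utf8 <- string>>, into: "", do: escape_codepoint(cp)
  end

  # Printable ASCII, newline and tab go in as is.
  defp escape_codepoint(cp) when cp in 0x20..0x7E or cp in [?\n, ?\t], do: <<cp>>

  # Everything else in the BMP becomes \uNNNN.
  defp escape_codepoint(cp) when cp <= 0xFFFF, do: "\\u" <> hex(cp, 4)

  # Codepoints above the BMP need the eight-digit \U form.
  defp escape_codepoint(cp), do: "\\U" <> hex(cp, 8)

  defp hex(cp, width) do
    cp |> Integer.to_string(16) |> String.pad_leading(width, "0")
  end
end

IO.puts(AsciiOnly.escape("a\r\u2003b"))
# prints a\u000D\u2003b
```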


From looking at the YAML spec, it looks like:

Escape anything that has a C escape sequence to the form listed in the spec.

Then escape anything that’s explicitly not in the ranges the spec defines as printable with \u.

For each codepoint:

  1. In printable range? Leave it alone.
  2. Has c escape? Convert to that.
  3. Escape with \u

I think this is YAML-specific behaviour, so Elixir doesn’t have a default that does it, unless it’s provided by a YAML library, but it looks like that’s what you’re making :slight_smile:

https://yaml.org/spec/1.2.2/#chapter-5-character-productions

[1] c-printable ::=
                         # 8 bit
    x09                  # Tab (\t)
  | x0A                  # Line feed (LF \n)
  | x0D                  # Carriage Return (CR \r)
  | [x20-x7E]            # Printable ASCII
                         # 16 bit
  | x85                  # Next Line (NEL)
  | [xA0-xD7FF]          # Basic Multilingual Plane (BMP)
  | [xE000-xFFFD]        # Additional Unicode Areas
  | [x010000-x10FFFF]    # 32 bit
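The production above translates fairly directly into guard clauses. A sketch, with a module and function name (`YamlChars.printable?/1`) that I made up for illustration:

```elixir
defmodule YamlChars do
  # Mirrors production [1] c-printable from the YAML 1.2.2 spec.
  # Takes a codepoint (integer), not a string.
  def printable?(cp) when cp in [0x09, 0x0A, 0x0D, 0x85], do: true
  def printable?(cp) when cp in 0x20..0x7E, do: true
  def printable?(cp) when cp in 0xA0..0xD7FF, do: true
  def printable?(cp) when cp in 0xE000..0xFFFD, do: true
  def printable?(cp) when cp in 0x10000..0x10FFFF, do: true
  def printable?(cp) when is_integer(cp), do: false
end

IO.inspect(YamlChars.printable?(?a))    # true
IO.inspect(YamlChars.printable?(0x07))  # false (bell)
IO.inspect(YamlChars.printable?(0x85))  # true (NEL)
```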

Yes, that’s almost exactly what I’m doing in the PR linked above. However, there are some things to consider. Let’s take the character \u0085. It has an escape sequence in YAML (\N), but it is also in the c-printable range. So while it can theoretically be left alone, I preferred to encode it as \N because I think that is friendlier to editors and readers. Would you agree? So my order is…

  1. Is \n or \t? Leave it alone.
  2. Has c escape? Convert to that.
  3. In printable range? Leave it alone.
  4. Escape with \u
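The four steps above could be sketched roughly like this. It is an illustration only, not the code from the PR: the module name `YamlEscape` is hypothetical, and the table lists the short escapes of YAML double-quoted scalars as I read the spec, minus `\/` and the escaped space, which are skipped on purpose.

```elixir
defmodule YamlEscape do
  # Hypothetical module for illustration only.

  # Short escapes of YAML double-quoted scalars, keyed by codepoint.
  @short_escapes %{
    0x00 => "\\0",
    0x07 => "\\a",
    0x08 => "\\b",
    0x0B => "\\v",
    0x0C => "\\f",
    0x0D => "\\r",
    0x1B => "\\e",
    0x22 => "\\\"",
    0x5C => "\\\\",
    0x85 => "\\N",
    0xA0 => "\\_",
    0x2028 => "\\L",
    0x2029 => "\\P"
  }

  def escape(string) when is_binary(string) do
    for <<cp::utf8 <- string>>, into: "", do: escape_codepoint(cp)
  end

  # 1. \n and \t are left alone.
  defp escape_codepoint(cp) when cp in [?\n, ?\t], do: <<cp>>

  # 2. Has a short escape? Convert to that.
  defp escape_codepoint(cp) do
    case Map.fetch(@short_escapes, cp) do
      {:ok, escaped} -> escaped
      :error -> printable_or_hex(cp)
    end
  end

  # 3. In the c-printable range? Leave it alone.
  defp printable_or_hex(cp)
       when cp in 0x20..0x7E or cp in 0xA0..0xD7FF or
              cp in 0xE000..0xFFFD or cp in 0x10000..0x10FFFF,
       do: <<cp::utf8>>

  # 4. Whatever is left is at most 0xFFFF, so four hex digits are enough.
  defp printable_or_hex(cp) do
    "\\u" <> (cp |> Integer.to_string(16) |> String.pad_leading(4, "0"))
  end
end

IO.puts(YamlEscape.escape("a\u0085b"))
# prints a\Nb
```

Checking the short escapes before the printable range is what makes \u0085 come out as \N instead of passing through unchanged.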

Printing to console:

iex> IO.puts("a\u0085b")
a
b
:ok

Decode YAML (escape char inside double quotes) and printing to console:

iex> ~s("a\\Nb") |> YamlElixir.read_from_string!() |> IO.puts
a
b
:ok

Decode YAML (actual character) and printing to console:

iex> "a\u0085b" |> YamlElixir.read_from_string!() |> IO.puts
a
b
:ok