How to detect Chinese characters in an Elixir string?

Ajaxdone · November 3, 2016, 4:29pm

Hello Elixir,

Say I have a sentence or an article. How can I detect whether there are Chinese characters in it? And if there are, how can I extract them?

I'm pretty new to Elixir, so I may improve the question later on.

Thanks,
Ajax

OvermindDL1 · November 3, 2016, 4:45pm

The easiest way nowadays, regardless of language, is usually a Regex matcher on a Unicode script class, something like: \p{Han}

However, Erlang's regex engine, from my quick test just now, did not seem to accept script classes (boo), so here is the manual way, an explicit code point range: [\x{4e00}-\x{9fa5}]

Make sure to enable the unicode setting in your regex matcher too!

So something like this:

iex> matcher = ~r/[\x{4e00}-\x{9fa5}]/u
iex> test_string = "This has Chinese characters here: 你好"
iex> results = Regex.scan(matcher, test_string)

Here results would be a list of lists of matches. Sadly I am on Windows right now and my terminal does not have UTF-8 capabilities, so I cannot test that my code is precisely correct. Hopefully someone else can correct it if not.
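To cover both halves of the original question (detecting and extracting), here is a minimal sketch built on the same range. The module name HanDetect and its function names are made up for illustration; Regex.match?/2, Regex.scan/2, and List.flatten/1 are standard library calls.

```elixir
defmodule HanDetect do
  # Hypothetical helper module, not part of any library.
  # Same range as above; the `u` flag makes the regex work on Unicode code points.
  defp han_regex, do: ~r/[\x{4e00}-\x{9fa5}]/u

  # Returns true if the string contains at least one character in the range.
  def contains_han?(string), do: Regex.match?(han_regex(), string)

  # Returns the matching characters as a flat list of one-character strings.
  def extract_han(string) do
    han_regex()
    |> Regex.scan(string)
    |> List.flatten()
  end
end

HanDetect.contains_han?("hello")        # false
HanDetect.extract_han("abc 你好 def")   # ["你", "好"]
```

Note that this range only covers the basic CJK Unified Ideographs block up to U+9FA5, so rarer characters in extension blocks would not be matched.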

taiansu · November 3, 2016, 5:16pm

It’s correct, and the result is:

iex(3)> results = Regex.scan(matcher, test_string)
[["你"], ["好"]]

OvermindDL1 · November 3, 2016, 5:31pm

Awesome.

And you can change ~r/[\x{4e00}-\x{9fa5}]/u into ~r/[\x{4e00}-\x{9fa5}]+/u if you want to capture runs of consecutive characters instead of just individual characters.
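As a quick sketch of the difference, the snippet below scans a made-up sample string with the `+` version. The second matcher is an assumption worth re-testing on your own setup: as far as I know, the PCRE engine behind Regex accepts the \p{Han} script property once the `u` flag is set, which would make the hand-written range unnecessary.

```elixir
# Made-up sample text with two separate runs of Chinese characters.
text = "你好, world. 再见!"

# `+` groups consecutive matching characters into a single match.
run_matcher = ~r/[\x{4e00}-\x{9fa5}]+/u
runs = Regex.scan(run_matcher, text)
# runs is [["你好"], ["再见"]]

# Assumption: \p{Han} is supported by your Erlang/OTP build when `u` is enabled.
script_matcher = ~r/\p{Han}+/u
script_runs = Regex.scan(script_matcher, text)
```

If the script property works for you, it also covers Han characters outside the 4e00-9fa5 range.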

Ajaxdone · November 3, 2016, 7:32pm

Thank you very much OvermindDL1, and thank you taiansu for testing as well.