How to detect whether an Elixir string contains any Chinese characters?

Hello Elixir,

Say I have a sentence or an article. How can I detect whether there are Chinese characters in it? And if there are, how can I extract them?

I'm pretty new to Elixir, so I may improve the question later on.

Thanks,
Ajax

The easiest way nowadays, regardless of language, is usually a regex match on a Unicode script class, something like: \p{Han}

However, Erlang's regex engine, from my quick test just now, does not seem to support those character classes (boo), so here is the manual way, an explicit code point range: [\x{4e00}-\x{9fa5}]

Make sure to enable the unicode setting in your regex matcher too!

So something like this:

iex> matcher = ~r/[\x{4e00}-\x{9fa5}]/u
iex> test_string = "This has Chinese characters here: 你好"
iex> results = Regex.scan(matcher, test_string)

Here results would be a list of lists of matches, one inner list per match. Sadly I am on Windows right now and my terminal does not have UTF-8 capabilities, so I cannot test that my code is precisely correct. Hopefully someone else can correct it if not. :)
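
To cover both halves of the question (detecting and extracting), here is a minimal sketch built on the same range. The module name HanText and the functions contains_han?/1 and extract_han/1 are hypothetical names chosen for illustration, not part of any library. Note that this range only covers the core CJK ideographs, not extension blocks or CJK punctuation.

```elixir
defmodule HanText do
  # Hypothetical helper module, for illustration only.
  # Uses the same code point range as the matcher in this post.
  defp han_regex, do: ~r/[\x{4e00}-\x{9fa5}]/u

  # Returns true if the string contains at least one character in the range.
  def contains_han?(string), do: Regex.match?(han_regex(), string)

  # Returns every matching character as a flat list of one-character strings.
  def extract_han(string) do
    han_regex()
    |> Regex.scan(string)
    |> List.flatten()
  end
end

IO.inspect(HanText.contains_han?("This has Chinese characters here: 你好"))
IO.inspect(HanText.contains_han?("plain ASCII text"))
IO.inspect(HanText.extract_han("This has Chinese characters here: 你好"))
```

Regex.scan returns one inner list per match, so List.flatten is used here to get a plain list of characters.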

It’s correct and the result is

iex(3)> results = Regex.scan(matcher, test_string)
[["你"], ["好"]]

Awesome.

And you can change ~r/[\x{4e00}-\x{9fa5}]/u into ~r/[\x{4e00}-\x{9fa5}]+/u if you want to capture runs of consecutive characters instead of just individual characters. :)
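
As a rough sketch of the difference the + quantifier makes, the snippet below scans a made-up sample string (the variable names and the sample text are only illustrative) and collects each run of consecutive characters as one string:

```elixir
# With +, each match is a whole run of adjacent characters in the range.
run_matcher = ~r/[\x{4e00}-\x{9fa5}]+/u
text = "Hello 你好, goodbye 再见"

runs =
  run_matcher
  |> Regex.scan(text)
  |> List.flatten()

IO.inspect(runs)
```

With the single-character matcher, the same text would instead yield four separate one-character matches.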

Thank you very much OvermindDL1, and thank you taiansu for testing as well.