Hello Elixir,
Say I have a sentence or an article. How can I detect whether there are Chinese characters in it? And if there are, how can I extract them?
I'm pretty new to Elixir, so I may improve the question later on.
Thanks,
Ajax
The easiest way nowadays, regardless of language, is usually a regex match on a Unicode script class, something like: \p{Han}
However, Erlang's regex engine, from my quick test just now, does not seem to support that class (boo), so here is the manual way, matching on a code point range: [\x{4e00}-\x{9fa5}]
Make sure to enable the unicode option (the u modifier) in your regex too!
So something like this:
iex> matcher = ~r/[\x{4e00}-\x{9fa5}]/u
iex> test_string = "This has Chinese characters here: 你好"
iex> results = Regex.scan(matcher, test_string)
Here results would be a list of lists, one inner list per match. Sadly I am on Windows right now and my terminal does not handle UTF-8, so I cannot test that my code is precisely correct. Hopefully someone else can correct it if not.
It’s correct, and the result is:
iex(3)> results = Regex.scan(matcher, test_string)
[["你"], ["好"]]
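For the two parts of the original question (detect, then extract), the nested result above can be flattened into a plain list, and Regex.match?/2 answers the yes/no part. This is only a minimal sketch, assuming the same code point range used earlier in the thread; the module name HanSketch and its function names are made up for illustration:

```elixir
defmodule HanSketch do
  # Hypothetical helper module, not part of any library.
  # Same range as in the thread; the `u` modifier turns on Unicode matching.
  defp han, do: ~r/[\x{4e00}-\x{9fa5}]/u

  # Returns true if the string contains at least one character in the range.
  def contains?(string), do: Regex.match?(han(), string)

  # Regex.scan/2 returns a list of lists (one inner list per match),
  # so flatten it to get a plain list of single characters.
  def extract(string) do
    han()
    |> Regex.scan(string)
    |> List.flatten()
  end
end

HanSketch.contains?("This has Chinese characters here: 你好")
# → true
HanSketch.extract("This has Chinese characters here: 你好")
# → ["你", "好"]
```

Keeping the regex behind a private function rather than in a module attribute is just one way to avoid repeating it.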
Awesome.
And you can change ~r/[\x{4e00}-\x{9fa5}]/u
into ~r/[\x{4e00}-\x{9fa5}]+/u
if you want to capture runs of consecutive characters instead of just individual characters.
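To show what the `+` quantifier changes, here is a small sketch that collects each run of adjacent characters in the range as one string. It assumes the same range as above; the module name HanRuns and the sample sentence are invented for the example:

```elixir
defmodule HanRuns do
  # Hypothetical helper module, not part of any library.
  # The trailing `+` makes each match cover a whole run of adjacent characters.
  defp han_run, do: ~r/[\x{4e00}-\x{9fa5}]+/u

  # Returns every run of characters in the range, in order of appearance.
  def runs(string) do
    han_run()
    |> Regex.scan(string)
    |> List.flatten()
  end
end

HanRuns.runs("你好, world, 世界!")
# → ["你好", "世界"]
```

Without the `+`, the same input would come back as four single-character matches.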
Thank you very much, OvermindDL1, and thank you, taiansu, for testing as well.