How can I split a string containing Unicode characters only at spaces?

I am new to Elixir and am trying to split a string only at spaces. The string contains Unicode characters too.

String.split("Freude schöner Götterfunken", ~r/[^[:alnum:]-]/u, trim: true)

As shown above, this gives me a UnicodeConversionError. Is there a way to do that?

The output I want is:

["Freude","schöner","Götterfunken"]

It would be great if someone could suggest something.

This works for me:

iex(1)> String.split("Freude schöner Götterfunken", ~r/[^[:alnum:]-]/u, trim: true)
["Freude", "schöner", "Götterfunken"]

Is your shell perhaps not using UTF-8?

If you want to split the string only at spaces, I would use this:

String.split("Freude schöner Götterfunken", " ", trim: true)

It’s working for me. Am I missing something?

String.split("Freude schöner Götterfunken") would be preferred, since it splits on any Unicode whitespace, which is much more flexible than breaking on an ASCII space character.

iex> String.split("Freude schöner Götterfunken") 
["Freude", "schöner", "Götterfunken"]

From the docs:

Divides a string into substrings at each Unicode whitespace occurrence with leading and trailing whitespace ignored. Groups of whitespace are treated as a single occurrence. Divisions do not occur on non-breaking whitespace.
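To make the quoted behaviour concrete, here is a small sketch comparing `String.split/1` with splitting on a literal space. The sample strings (tabs, newlines, and a non-breaking space) are made up for illustration and are not from the original question.

```elixir
# String.split/1 treats any run of Unicode whitespace as one separator
# and ignores leading and trailing whitespace.
words = String.split("  Freude\tschöner \n Götterfunken  ")
IO.inspect(words)
# => ["Freude", "schöner", "Götterfunken"]

# Splitting on a literal " " only breaks at that one character,
# so the tab stays inside the first element.
by_space = String.split("Freude\tschöner Götterfunken", " ", trim: true)
IO.inspect(by_space)
# => ["Freude\tschöner", "Götterfunken"]

# A non-breaking space (U+00A0) is not treated as a split point.
nbsp = String.split("no\u00a0break")
IO.inspect(length(nbsp))
# => 1
```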

When I try it in the shell it gives me the error, and when I call it via another method it splits the sentence at spaces as well as at the Unicode characters.

I am using Windows 10. How do I make sure that the shell supports UTF-8 and, if it does not, how do I make it UTF-8 compatible?

Is there a useful link to go through?

The requirement is not only to split on spaces but also to discard symbols, except the hyphen, while splitting.

The solution is elegant, but I also have to ignore special characters (symbols) while splitting, hence the need for a regex.

Thanks a lot for all your help.

My requirement was to split a string on spaces while discarding all symbols except a few special characters such as ‘-’. After a lot of trial and error, the regex below solves it:

String.split(sentence, ~r/[^[:alnum:]-]/ui, trim: true)
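As a rough illustration of what this pattern does, the sketch below runs it on a made-up sentence containing punctuation and a hyphenated word. The input string is hypothetical. The `i` modifier is left out here on the assumption that it is redundant, because the negated `[:alnum:]` class already covers both upper and lower case.

```elixir
# Hypothetical sample input; only the regex comes from the thread.
sentence = "Freude, schöner Götter-funken! (Tochter aus Elysium)"

# [^[:alnum:]-] matches any single character that is neither alphanumeric
# nor a hyphen. The u modifier makes [:alnum:] Unicode-aware, so letters
# such as ö are kept instead of being treated as separators.
# trim: true removes the empty strings left between adjacent separators.
words = String.split(sentence, ~r/[^[:alnum:]-]/u, trim: true)
IO.inspect(words)
# => ["Freude", "schöner", "Götter-funken", "Tochter", "aus", "Elysium"]
```

Because every non-alphanumeric character other than the hyphen acts as a separator, punctuation is dropped and hyphenated words stay in one piece.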

This is a great forum, and I hope this will help someone in need 🙂

Thanks again for the help. I come from Java and have just started with Elixir, so you can imagine the pain; it's a whole new world for me. I will post more questions if I have any.
