Usage of String.replace?

owaisqayum · July 25, 2020, 12:25pm

I have a sample string

sentence = "Hello, world ... 123 *** ^%&*())^% %%:>"

From this string, I want to keep only the digits, letters, and spaces. For that, I have used String.replace like this:

a = String.replace(sentence, ~r"[:_!@#$%^&*:|,./]", " ")

Here it replaces every match of the second argument (the regex) with the third one, which is a single space.

Is there any way to use String.replace and focus on what to keep rather than what to throw away?

Thanks

al2o3cr · July 25, 2020, 12:30pm

In this case, the negation option for a character class will do what you want: ~r([^A-Z]) will match anything that isn’t A-Z (indicated by the leading ^).

Eiji · July 25, 2020, 12:37pm

@owaisqayum: You can use ^ (negation character) like:

sentence = "Hello, world ... 123 *** ^%&*())^% %%:>"
String.replace(sentence, ~r/[^0-9a-zA-Z ]/, " ")
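As a minimal sketch of how the negated class behaves, the snippet below uses a shortened version of the sample string (an assumption made to keep the result easy to read). It also shows a variant with `+` that drops the space from the class, so a whole run of unwanted characters collapses into a single space.

```elixir
# Shortened sample string (hypothetical, trimmed from the one above).
sentence = "Hello, world ... 123"

# Every character outside the class is replaced by one space, one for one.
one_for_one = String.replace(sentence, ~r/[^0-9a-zA-Z ]/, " ")

# With `+` and no space in the class, each run of unwanted characters
# (spaces included) becomes a single space.
collapsed = String.replace(sentence, ~r/[^0-9a-zA-Z]+/, " ")

IO.inspect(one_for_one)
IO.inspect(collapsed)
# collapsed is "Hello world 123"
```

The one-for-one version keeps the string length unchanged, which can leave several spaces in a row; the `+` version is usually easier to split into words afterwards.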

Here is another post of mine with many helpful examples:

Pages for testing regular expressions often include a reference. For example here:

At the bottom of the page you can search for "not" and you will find a few interesting suggestions.

A CHARACTER NOT IN THE RANGE A-Z
[^a-z] Matches any characters except those in the range a-z.

/[^a-z]+/g
Anything but a-z.

owaisqayum · July 25, 2020, 1:06pm

Thank you for your valuable responses. I found out that just using the pin operator can reverse the whole scenario. For example, if we have a test case:

Assertion with == failed
     code:  assert WordCount.count("co-operative") == expected
     left:  %{"co" => 1, "operative" => 1}
     right: %{"co-operative" => 1}
     stacktrace:
       test/word_count_test.exs:35: (test)

Here the word is co-operative, and when I use the pin operator it removes the "-" in co-operative, which should be read as a single word. What is the better approach in this case?

Now the regex looks like

String.replace(b, ~r([^a-z 0-9]), " ")

Thanks
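One possible approach, sketched under the assumption that only ASCII lowercase input matters, is to list the hyphen among the characters to keep. A hyphen placed last inside the brackets is read as a literal hyphen rather than as a range. The input string below is hypothetical, modelled on the failing test.

```elixir
# Hypothetical input modelled on the failing test case above.
input = "co-operative, 2 words!"

# The trailing "-" inside the class is a literal hyphen, so it is kept.
cleaned = String.replace(input, ~r/[^a-z0-9 -]/, " ")

# String.split/1 splits on whitespace and ignores the extra spaces.
words = String.split(cleaned)

IO.inspect(words)
# ["co-operative", "2", "words"]
```

Note that this also keeps a hyphen that stands on its own (as in "a - b"), so it may need further filtering depending on the requirements.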

kip · July 25, 2020, 1:55pm

The regexes here aren’t going to work well with Unicode input, nor will they match digits outside the Indo-Arabic set 0..9. They will likely also have issues with signs and exponents. None of this may matter in your use case, of course. But if it does, then my ex_cldr_numbers library can help. Use Cldr.Number.Parser.scan/2 and then filter as you wish.

Examples

# Scan a string in a locale-sensitive fashion and extract numbers
iex> Cldr.Number.Parser.scan "Hello, world ... 123 *** ^%&*()-72.5)^% %%: 123.00>"
["Hello, world ... ", 123, " *** ^%&*()", -72.5, ")^% %%: ", 123.0, ">"]

# Scan a string in a locale-sensitive fashion and extract numbers - in the "de" locale
# Note the use of the "," as the decimal separator
iex> Cldr.Number.Parser.scan "Hello, world ... 123 *** ^%&*()-72.5)^% %%: 123,00>", locale: "de"
["Hello, world ... ", 123, " *** ^%&*()", -72.5, ")^% %%: ", 123.0, ">"]

Eiji · July 25, 2020, 4:08pm

By default I go with really simple examples, so as not to confuse people with more complex code samples.

If you are interested check what I have created here: regex101: build, test, and debug regex

I don’t know all edge-cases, but this one should work for you. If you have more requirements or have other questions about regular expressions please create a separate topic and feel free to ping me.

wolf4earth · July 25, 2020, 4:58pm

I know that test.

Let me guess, you’re doing the Word Count exercise on exercism, right? I’m a mentor there, so I recognized this immediately.

Well, then let me enlighten you: what you’re looking for, especially for the later special-case tests, are Unicode Categories. I could of course give you the working regex but then why do the exercise at all?
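For readers unfamiliar with the term, here is a small illustration of what a Unicode category looks like in an Elixir regex. It is not the exercise's solution, only a sketch on a made-up string: `\p{L}` matches any letter once the `u` modifier is set, while an ASCII range such as `[a-z]` stops at accented characters.

```elixir
# Made-up sample text containing non-ASCII letters.
text = "naïve café, 42"

# ASCII range: accented letters are not in a-z, so words get cut apart.
ascii_words = Regex.scan(~r/[a-z]+/, text)

# Unicode category "L" (letter), enabled by the `u` modifier.
unicode_words = Regex.scan(~r/\p{L}+/u, text)

IO.inspect(ascii_words)
# [["na"], ["ve"], ["caf"]]
IO.inspect(unicode_words)
# [["naïve"], ["café"]]
```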

axelson · July 25, 2020, 6:20pm

Not super important but for clarity I’d like to point out that the caret character ^ is not the pin operator in this case, it is acting as a regex negation operator.
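To make the distinction concrete, here is a short sketch with made-up values. The pin operator belongs to Elixir pattern matching, while the caret inside square brackets belongs to the regex syntax.

```elixir
# Pin operator: match against the current value of x instead of rebinding it.
x = 1

result =
  case 2 do
    ^x -> :same_as_x
    _ -> :different
  end

# Regex negation: a leading ^ inside [...] means "any character not listed".
stripped = String.replace("a-b", ~r/[^a-z]/, " ")

IO.inspect(result)
# :different
IO.inspect(stripped)
# "a b"
```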

owaisqayum · July 25, 2020, 8:34pm

You are absolutely right … and I think it’s the best way to learn. Thanks for the enlightenment, really appreciated.

owaisqayum · July 26, 2020, 6:53pm

Thanks for the clarification, really appreciated