How to find a substring's position in a string? Something like indexOf in JS

I have read the docs for the String module, and it seems that there’s nothing like indexOf in JS, which
returns the index of a substring in a string.

You can use String.split and then Enum.find_index (this finds a single character, not a longer substring):

String.split("abc", "", trim: true)
|> Enum.find_index(& &1 == "b")

Thanks, I will try it.

You can also use a regular expression:

~r/#{Regex.escape(substring)}/
|> Regex.scan(string, return: :index)

This returns the byte indices and byte lengths of all matching parts as [[{index1, length1}], [{index2, length2}], ...]
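
For illustration, here is a minimal sketch of that regex approach on a made-up input (the `string` and `substring` values are just assumptions for the example). It shows that the offsets are counted in bytes: "é" takes two bytes in UTF-8, so "llo" starts at byte 3 even though it is the third character.

```elixir
# Example input; \u00E9 is "é", which takes 2 bytes in UTF-8.
string = "h\u00E9llo h\u00E9llo"
substring = "llo"

# Regex.escape/1 keeps any regex metacharacters in the substring literal.
matches =
  ~r/#{Regex.escape(substring)}/
  |> Regex.scan(string, return: :index)

# One [{byte_offset, byte_length}] entry per match.
IO.inspect(matches)
# => [[{3, 3}], [{10, 3}]]

# The byte offsets can be fed straight back into binary_part/3.
[[{offset, len}] | _] = matches
IO.inspect(binary_part(string, offset, len))
# => "llo"
```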

Why isn’t there a straightforward function?

I am afraid that the syntax you want is not available.

So you could make a function like this:


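# Note: this only finds a needle that is a single character (grapheme);
# it returns that character's index, or nil if there is no match.
def index_of(string, needle) do
  String.split(string, "", trim: true)
  |> Enum.find_index(& &1 == needle)
end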
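
Since the version above compares the needle against one character at a time, it cannot find a needle longer than one character. One possible sketch for longer needles is to split on the needle itself and measure the part before it. The module name `SubstringIndex` is hypothetical, and nothing like this `index_of/2` ships with the String module:

```elixir
defmodule SubstringIndex do
  # Hypothetical helper, not part of Elixir's String module.
  # Returns the grapheme index of the first occurrence of `needle`,
  # or nil when `needle` does not occur in `string`.
  def index_of(string, needle) do
    case String.split(string, needle, parts: 2) do
      # The needle was found: everything before it is the prefix.
      [prefix, _rest] -> String.length(prefix)
      # Only one part came back, so there was no match.
      [_whole] -> nil
    end
  end
end

IO.inspect(SubstringIndex.index_of("h\u00E9llo w\u00F6rld", "w\u00F6r"))
# => 6
IO.inspect(SubstringIndex.index_of("hello", "xyz"))
# => nil
```

Keep in mind that the split itself works on the underlying bytes, so a needle could in principle match in the middle of a grapheme made of several codepoints; whether that matters depends on your input.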

Because strings are tricky. In Elixir, there’s no difference between strings and raw binary data. Whether you want to count by bytes or characters or graphemes is up to you, so you need to be extra explicit.

To further expand on this, you could also use String.codepoints to split the string.

And String.graphemes/1

There are a few layers here that explain why the String module has no such function:

Elixir’s String module is about UTF-8 encoded binaries, so it doesn’t provide an API for general binaries regardless of how their bytes are encoded.

For UTF-8 encoded binaries, people have already pointed out that there are multiple ways to “count” within UTF-8. The smallest meaningful unit in UTF-8 is a codepoint. Codepoints can be combined to form graphemes. Absent a specific use case, you want to default to working with graphemes. I’d suggest this excellent blog post on the matter: The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!) @ tonsky.me.
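
As a small illustration of how the three ways of counting disagree, here is a sketch with a made-up string: an "e" followed by a combining acute accent (two codepoints that render as one grapheme), then a "!".

```elixir
# "e" + U+0301 (combining acute accent) + "!"
string = "e\u0301!"

# Position of "!" when counting codepoints: "e", U+0301, "!"
codepoint_index =
  string |> String.codepoints() |> Enum.find_index(&(&1 == "!"))

# Position of "!" when counting graphemes: "e" plus the accent, then "!"
grapheme_index =
  string |> String.graphemes() |> Enum.find_index(&(&1 == "!"))

# Position of "!" when counting bytes (U+0301 takes 2 bytes in UTF-8).
{byte_index, _length} = :binary.match(string, "!")

IO.inspect({codepoint_index, grapheme_index, byte_index})
# => {2, 1, 3}
```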

Additionally, while Elixir could compute a codepoint index or a grapheme index for you, that wouldn’t be of much use to begin with. Given a binary and a grapheme or codepoint index, there’s no performant way to find where that index is in the binary. Because codepoints and graphemes have no fixed size, there’s no way to seek to such an index. One would need to walk the binary from the beginning, parsing it as UTF-8, to figure out where the index points, and do it again if that had already been done to calculate the index in the first place. Hence String.codepoints and String.graphemes, which give you a list. Having a list makes such indexes usable.

You can also count in bytes, but if you do that with UTF-8 you can easily split apart bytes that belong to a single grapheme or codepoint, leaving you with garbage.
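
To make the "garbage" concrete, here is a tiny sketch (the input string is just an example) of what happens when a byte-based slice cuts through a two-byte character:

```elixir
# \u00E9 is "é", encoded in UTF-8 as the two bytes 195 and 169.
string = "h\u00E9llo"

# Taking the first 2 bytes keeps "h" and only the first half of "é".
broken = binary_part(string, 0, 2)

IO.inspect(broken)
# => <<104, 195>>
IO.inspect(String.valid?(broken))
# => false

# Taking 3 bytes ends on a character boundary, so the result is a valid string.
IO.inspect(String.valid?(binary_part(string, 0, 3)))
# => true
```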

But what if you explicitly want to deal with bytes, or don’t even have UTF-8 in the first place? In that case you’d want to work with OTP’s :binary module, which holds APIs for arbitrary binaries. Unsurprisingly, it has :binary.match/2 (binary — stdlib v6.2), which lets you search a binary for an occurrence of a pattern and gives you back the result in the form {byte_index, byte_length}. Bytes can easily be seeked to in binaries, so it’s fine to use byte indexes here.
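
A short sketch of what that looks like in practice, using a made-up input string. Both :binary.match/2 and :binary.matches/2 are real functions in OTP; the variable names are only for the example.

```elixir
string = "h\u00E9llo w\u00F6rld"

# First occurrence: {byte_offset, byte_length}, or :nomatch if absent.
{start, len} = :binary.match(string, "w\u00F6r")
IO.inspect({start, len})
# => {7, 4}

# Seeking to a byte offset is cheap, so slicing with it is straightforward.
IO.inspect(binary_part(string, start, len) == "w\u00F6r")
# => true

IO.inspect(:binary.match(string, "xyz"))
# => :nomatch

# :binary.matches/2 returns every occurrence instead of just the first one.
IO.inspect(:binary.matches(string, "l"))
# => [{3, 1}, {4, 1}, {11, 1}]
```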

Besides @LostKobrakai’s excellent response, can you please explain to us what you are trying to do? We typically get the indexes in order to do something, and perhaps there is already an operation for what you are trying to do. In any case, :binary.matches/2 is likely what you want, but bear in mind that it returns byte positions, not character positions.

I just want to extract some text from a file. In JS, Python, or C++, you can find the index of a start tag and the index of an end tag, and then use substring with the start and end indexes to extract the text.
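
For what it's worth, that find-start, find-end, slice workflow can be sketched with byte offsets from the :binary module. The `TagExtract` module and its `between/3` function are hypothetical names for this example, not an existing API, and the sketch assumes the tags are plain strings rather than patterns:

```elixir
defmodule TagExtract do
  # Hypothetical helper: returns the text between the first `open`
  # and the next `close` after it, or nil if either one is missing.
  def between(text, open, close) do
    with {open_at, open_len} <- :binary.match(text, open),
         # Byte offset where the wanted text starts.
         from = open_at + open_len,
         # Only search for the closing tag after the opening tag.
         {close_at, _close_len} <-
           :binary.match(text, close, scope: {from, byte_size(text) - from}) do
      binary_part(text, from, close_at - from)
    else
      :nomatch -> nil
    end
  end
end

input = "one <em>two</em> three <em>four</em> five"

IO.inspect(TagExtract.between(input, "<em>", "</em>"))
# => "two"
IO.inspect(TagExtract.between(input, "<b>", "</b>"))
# => nil
```

The offsets here are bytes, which is fine because they come from searching the same binary they are later used to slice.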

This doesn’t answer your initial question, but it sounds like regular expressions may accomplish your end goal. For example, the following code returns the text between a start tag of <em> and an end tag of </em>. The Regex.run function returns the first match, and the Regex.scan function returns a list of all of the matches.

iex(1)> re = ~r"<em>(.*)</em>"U
iex(2)> input = "one <em>two</em> three <em>four</em> five"
iex(3)> Regex.run(re, input, capture: :all_but_first)
["two"]
iex(4)> Regex.scan(re, input, capture: :all_but_first)
[["two"], ["four"]]

Normally a / character is used as the regex delimiter, but I used " instead so I didn’t have to escape the slash in the end tag. As Aetherus pointed out, there is also an option for regular expressions to return index numbers.
