Extracting numbers from a string

Hey all, just started picking up Elixir last week and am writing a scraper as a learning project.

Baby step #1 is extracting the number from a URL on the target web page… here’s what I’ve written:

# url is in "https://xxyyzz.com/xxyyzz.383254/" format... goal is to extract 383254
  def get_id_from_url(url), do: Regex.run(~r"\d+\/", url) |> Enum.at(0) |> Integer.parse |> elem(0)

This seems a bit clunky of a function to me for a simple integer extraction… is there a better way of going about this?

2 Likes

How about

iex(10)> "https://xxyyzz.com/xxyyzz.383254/" |> String.replace(~r/[^\d]/, "")
"383254"
iex(11)>
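If you need an integer rather than a string at the end, one more pipe step would do it (a quick sketch):

iex> "https://xxyyzz.com/xxyyzz.383254/" |> String.replace(~r/[^\d]/, "") |> String.to_integer()
383254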
9 Likes
defmodule Test do
  def match_string("https://xxyyzz.com/xxyyzz." <> suffix) do
    case Integer.parse(suffix) do
      {number, "/"} when is_integer(number) ->
        IO.puts "suffix is #{number}"

      _ ->
        IO.puts "cannot parse suffix: #{suffix}"
    end
  end
end

Test it in iex:

iex> Test.match_string "https://xxyyzz.com/xxyyzz.383254/"
suffix is 383254
:ok
iex> Test.match_string "https://xxyyzz.com/xxyyzz.383254/!"
cannot parse suffix: 383254/!
:ok

You can take advantage of Elixir’s pattern matching on a string prefix, which binds the remaining suffix to a variable (you cannot pattern-match in the middle of a bigger string though, keep that in mind). Not sure whether I’m taking your example too literally, but if I understood you correctly, that’s how I would approach the problem.
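For illustration, the prefix match on its own in iex (just standard binary pattern matching, nothing specific to this module):

iex> "https://xxyyzz.com/xxyyzz." <> suffix = "https://xxyyzz.com/xxyyzz.383254/"
"https://xxyyzz.com/xxyyzz.383254/"
iex> suffix
"383254/"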

3 Likes

You can use hd/1 instead of Enum.at/2; that saves you some characters.

Also, you can use String.to_integer/1 instead of piping through Integer.parse/1 and elem/2.

But that’s the way to go.
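Putting both tweaks together, a minimal sketch (assuming the regex gets a lookahead so the match no longer includes the trailing slash; otherwise String.to_integer/1 would raise on "383254/"):

# hypothetical shortened version of the original function; (?=/) keeps
# the slash out of the match so String.to_integer/1 gets only digits
def get_id_from_url(url), do: Regex.run(~r/\d+(?=\/)/, url) |> hd() |> String.to_integer()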

2 Likes

Many thx for the suggestions! Realized my URL example is a bit lacking… the xxyyzz is representative of any numbers and letters of unknown length, so just stripping non-digits won’t work, nor will the pattern matching (I think; AWOL from my computer).

I guess the challenge is to find a run of digits at the end of the string, preceded by a period and followed by a slash… String.to_integer looks like it might be what I need, good enough for government work.

Nope! We have enough borked gov’t systems. Let’s do better.

If you give us a few examples and/or explain the whole URL schema, then I can help you better.

2 Likes

Hah, fair game ^^

https://foster.com/death-pancake.1468/ === 1468

https://hkd33.net/mr-rogers101.690153/ === 690153

https://space-force911.gov/sauce-master.13257777/ === 13257777

Here’s a selection of random URLs… all begin with https, have a base domain, followed by the username of the person who submitted the domain, then a period, then the id of the post, then a trailing slash…

Only the id of the post is relevant, so we can ignore the base domain, username, period, and ending slash…

So all URLs always end in a number plus a forward slash? No exceptions?

1 Like

Correctamundo

(EDIT 1: Account for invalid values.)
(EDIT 2: Trim empty strings when splitting.)
(EDIT 3: Included explanations.)

defmodule Test do
  def extract_id(url) when is_binary(url) do
    url
    |> String.split(~w(. /), parts: 1000, trim: true)
    |> List.last
    |> parse_id
    |> fetch_id
  end

  defp parse_id(nil), do: :error
  defp parse_id(x) when is_binary(x), do: Integer.parse(x)

  defp fetch_id({number, ""}) when is_integer(number), do: number
  defp fetch_id(:error), do: :error
end

Test it:

iex> urls = ["https://foster.com/death-pancake.1468/", "https://hkd33.net/mr-rogers101.690153/", "whatever_dude", "https://space-force911.gov/sauce-master.13257777/"]

iex> urls |> Enum.map(&Test.extract_id/1)
[1468, 690153, :error, 13257777]

Breaking it down:

  • ~w(. /) equals [".", "/"] (so String.split is called with multiple separators).

  • parts: 1000 is used to prevent denial-of-service attacks, in case somebody manages to smuggle huge strings to your code. trim: true removes empty strings from the result. Check String.split docs.

  • "https://foster.com/death-pancake.1468/" |> String.split(~w(. /), parts: 1000, trim: true) yields this:

["https:", "foster", "com", "death-pancake", "1468"]

…so we are calling List.last on it to give us the desired piece of data.

  • Our internal function parse_id also has to handle invalid data:
    • If String.split returns [], List.last would return nil.
    • If String.split returns ["single_invalid_url"], List.last would return "single_invalid_url".

Both cases make our internal function parse_id return :error. (Integer.parse returns :error if you give it a string that does NOT start with an integer.)

  • The fetch_id internal function uses function heads instead of if or case: it returns the integer from a successful parse, or reacts to an :error value by simply passing it down the line to your consumer code.

  • One caveat: notice that fetch_id matches on {number, ""} when is_integer(number) which means the function will be called only if a full integer string is passed, namely “123” or “456” will succeed but “123xyz” will not. If you expect URLs like “https://whatever.man/1234abcd”, this code won’t work.
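A quick iex check of that caveat (this is just how Integer.parse/1 behaves):

iex> Integer.parse("123")
{123, ""}
iex> Integer.parse("123xyz")
{123, "xyz"}
iex> Integer.parse("xyz")
:error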

6 Likes

Dang, that’s some good looking code! Copy and pasting in 3, 2, 1…

Sorry for the ton of edits. I made a quick and dirty version and then figured I’d write it as if I were being paid for it.

1 Like

Better to visit the forum and copy-paste from here rather than from the email, because I made a lot of edits.

1 Like

Now that we have more information, I have an alternative version which I prefer over @dimitarvp’s, because it is much more explicit about what we want.

  • It says that we want a URL and verifies we got one (by parsing it), and that we are only interested in the path,
  • it says that we are searching for dot, followed by at least one digit and ending with a slash as the last character of the path, but we are only interested in the actual digits (the call to Regex.named_captures/3),
  • we want those digits to cleanly parse into a number.

If all succeed, we return an :ok-tuple, and simply :error otherwise.

But which version to choose is probably a matter of taste; I have not benchmarked them.

defmodule M do
  def extract(url) do
    with %URI{path: path} when is_binary(path) <- URI.parse(url),
         %{"num" => num_str} <- Regex.named_captures(~r[\.(?<num>\d+)/$], path),
         {num, ""} <- Integer.parse(num_str) do
      {:ok, num}
    else
      _ -> :error
    end
  end
end

IO.inspect M.extract("https://foster.com/death-pancake.1468/")
IO.inspect M.extract("https://hkd33.net/mr-rogers101.690153/")
IO.inspect M.extract("https://space-force911.gov/sauce-master.13257777/")
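For reference, running that script should print:

{:ok, 1468}
{:ok, 690153}
{:ok, 13257777}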
3 Likes

The reasons I did not do it like you:

  1. I worked with governmental datasets before. Some have very blatant errors or peculiarities in them, like spaces in the domain part of the URLs, three forward slashes at the end, two dots instead of one, etc. Some of them really don’t care. As an extreme example, ~17 years ago a guy reported how he had to write his own mini-parser for malformed RSS from several agencies (that “RSS” was not even valid XML).
  2. Regexes can become a scaling problem if you have to process a lot of data. But I will admit that I am on the side of some premature optimization here, which is a 50/50 decision and depends on a lot of factors.

That being said, I like your code. :+1:

2 Likes

@makeitrein, this is not StackOverflow and we are not collecting reputation points, but whichever solution you like, please mark it as the answer. It increases the visibility of such topics on the forum, and people in the future can make use of your question and our answers.

1 Like

Legacy… No need to justify whether it’s governmental or not; legacy is a pain on its own often enough. But usually I just assume everything conforms to the corresponding specifications until I’m proven wrong.

2 Likes

Yep. I guess I am older and grumpier and just flat out assume people don’t know what they are doing. :expressionless:

But you are correct, we should be enforcing standards until we really have no other choice. Agreed.

2 Likes

@NobbZ and @dimitarvp - just woke up to the pleasant surprise of two discrete solutions that, in the words of the late Steve Jobs, are insanely great! Always a fan of getting input & insight from more seasoned programmers on some of my junior code; it exposes me to a few new patterns and methods along the way.

@dimitarvp takes the award for the problem, but just by a bit. I find his code slightly more grokable, especially if I were to revisit it a few months down the line. It doesn’t seem that I can mark responses as answers… here is what I see on my end…

I moved it to #questions-help; I also put the codereview and regular-expression tags on it. The OP should be able to accept an answer now.