makeitrein
Extracting numbers from a string
Hey all, just started picking up Elixir last week and am writing a scraper as a learning project.
Baby step #1 is extracting the number from a URL on the target web page… here’s what I’ve written:
# url is in "https://xxyyzz.com/xxyyzz.383254/" format... goal is to extract 383254
def get_id_from_url(url), do: Regex.run(~r"\d+\/", url) |> Enum.at(0) |> Integer.parse |> elem(0)
This seems like a rather clunky function to me for a simple integer extraction… is there a better way of going about this?
Marked As Solved
dimitarvp
(EDIT 1: Account for invalid values.)
(EDIT 2: Trim empty strings when splitting.)
(EDIT 3: Included explanations.)
defmodule Test do
  def extract_id(url) when is_binary(url) do
    url
    |> String.split(~w(. /), parts: 1000, trim: true)
    |> List.last
    |> parse_id
    |> fetch_id
  end

  defp parse_id(nil), do: :error
  defp parse_id(x) when is_binary(x), do: Integer.parse(x)

  defp fetch_id({number, ""}) when is_integer(number), do: number
  defp fetch_id(:error), do: :error
end
Test it:
iex> urls = ["https://foster.com/death-pancake.1468/", "https://hkd33.net/mr-rogers101.690153/", "whatever_dude", "https://space-force911.gov/sauce-master.13257777/"]
iex> urls |> Enum.map(&Test.extract_id/1)
[1468, 690153, :error, 13257777]
Breaking it down:
- ~w(. /) equals [".", "/"] (so String.split is called with multiple separators).
- parts: 1000 is used to prevent denial-of-service attacks, in case somebody manages to smuggle huge strings to your code. trim: true removes empty strings from the result. Check the String.split docs.
- "https://foster.com/death-pancake.1468/" |> String.split(~w(. /), parts: 1000, trim: true) yields this:
  ["https:", "foster", "com", "death-pancake", "1468"]
  …so we are calling List.last on it to give us the desirable piece of data.
- Our internal function parse_id also has to handle invalid data:
  - If String.split returns [], List.last would return nil.
  - If String.split returns ["single_invalid_url"], List.last would return "single_invalid_url".
  Both cases make our internal function parse_id return :error. (Integer.parse will return :error if you supply it a string that does NOT start with an integer.)
- The fetch_id internal function uses function heads instead of if or case: one clause extracts the integer from a successful parse and returns it, the other receives :error and just passes it down the line to your consumer code.
- One caveat: notice that fetch_id matches on {number, ""} when is_integer(number), which means that clause matches only if the whole string is an integer: "123" or "456" will succeed but "123xyz" will not. If you expect URLs like "https://whatever.man/1234abcd", this code won't work: no fetch_id clause matches {1234, "abcd"}, so the call raises a FunctionClauseError.
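The caveat above can be sketched as follows. This is a minimal variation, assuming you would rather get :error than a crash for a trailing segment like "1234abcd". The module name LenientTest and the catch-all fetch_id clause are illustrative additions, not part of the original answer.

```elixir
defmodule LenientTest do
  # Same pipeline as the Test module, plus one extra fetch_id clause
  # (an assumption, not in the original answer) so that partially
  # numeric segments return :error instead of raising.
  def extract_id(url) when is_binary(url) do
    url
    |> String.split(~w(. /), parts: 1000, trim: true)
    |> List.last()
    |> parse_id()
    |> fetch_id()
  end

  defp parse_id(nil), do: :error
  defp parse_id(x) when is_binary(x), do: Integer.parse(x)

  # Only a fully numeric segment (empty remainder) counts as an id.
  defp fetch_id({number, ""}) when is_integer(number), do: number
  # Covers :error as well as {integer, leftover} such as {1234, "abcd"}.
  defp fetch_id(_other), do: :error
end

LenientTest.extract_id("https://foster.com/death-pancake.1468/")
# => 1468
LenientTest.extract_id("https://whatever.man/1234abcd")
# => :error
```

Whether a partially numeric segment should count as an error or as an id is a design choice; this sketch only shows the stricter reading.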
Also Liked
hassan
How about
iex(10)> "https://xxyyzz.com/xxyyzz.383254/" |> String.replace(~r/[^\d]/, "")
"383254"
iex(11)>
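One thing to watch for with this approach: removing every non-digit character keeps all digits in the string, not only the trailing id. A small sketch, assuming the URL may contain other digits (as in the hkd33.net example used earlier in the thread); the module name DigitsOnly is made up for illustration.

```elixir
defmodule DigitsOnly do
  # Hypothetical wrapper around the String.replace call shown above.
  def strip(url) when is_binary(url), do: String.replace(url, ~r/[^\d]/, "")
end

DigitsOnly.strip("https://xxyyzz.com/xxyyzz.383254/")
# => "383254"

DigitsOnly.strip("https://hkd33.net/mr-rogers101.690153/")
# => "33101690153" (digits from the host and the slug are included)
```

Note that the result is still a string, so a String.to_integer or Integer.parse call would be needed to get an integer.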
dimitarvp
defmodule Test do
  def match_string("https://xxyyzz.com/xxyyzz." <> suffix) do
    case Integer.parse(suffix) do
      {number, "/"} when is_integer(number) ->
        IO.puts "suffix is #{number}"
      _ ->
        IO.puts "cannot parse suffix: #{suffix}"
    end
  end
end
Test it in iex:
iex> Test.match_string "https://xxyyzz.com/xxyyzz.383254/"
suffix is 383254
:ok
iex> Test.match_string "https://xxyyzz.com/xxyyzz.383254/!"
cannot parse suffix: 383254/!
:ok
You can abuse Elixir’s support for pattern matching on a literal string prefix with <>, which binds the rest of the string (the suffix) to a variable. You cannot pattern-match a string in the middle of a bigger string though, keep that in mind. I may be taking your example too literally, but if I understood you correctly, that’s how I would approach the problem.
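As a rough sketch of the same prefix-matching idea, here is a variation that returns a value instead of printing. The module name PrefixSketch, the {:ok, number} / :error return shape, and the fallback clause are assumptions made for illustration, not part of the answer above.

```elixir
defmodule PrefixSketch do
  # The literal must sit on the left of <>; the variable on the right
  # is bound to whatever follows the prefix.
  def extract("https://xxyyzz.com/xxyyzz." <> rest) do
    case Integer.parse(rest) do
      # Accept only digits followed by exactly one trailing slash.
      {number, "/"} -> {:ok, number}
      _ -> :error
    end
  end

  # Fallback (an addition): any string without the expected prefix.
  def extract(url) when is_binary(url), do: :error
end

PrefixSketch.extract("https://xxyyzz.com/xxyyzz.383254/")
# => {:ok, 383254}
PrefixSketch.extract("https://xxyyzz.com/xxyyzz.383254/!")
# => :error
PrefixSketch.extract("https://other.example/page.1/")
# => :error
```

Without the fallback clause, a URL with a different prefix would raise a FunctionClauseError, which may or may not be what you want in a scraper.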
NobbZ
Now that we have more information, I have an alternative version which I prefer over @dimitarvp's, because it is much more explicit about what we want.
- It says that we want a URL and verifies we get one (by parsing it) and that we are only interested in the path,
- it says that we are searching for a dot, followed by at least one digit and ending with a slash as the last character of the path, but we are only interested in the actual digits (the call to Regex.named_captures/3),
- we want those digits to cleanly parse into a number.
If all succeed, we return an :ok-tuple, and simply :error otherwise.
But which version to choose is probably a matter of taste, I have not benchmarked them.
defmodule M do
  def extract(url) do
    with %URI{path: path} when is_binary(path) <- URI.parse(url),
         %{"num" => num_str} <- Regex.named_captures(~r[\.(?<num>\d+)/$], path),
         {num, ""} <- Integer.parse(num_str) do
      {:ok, num}
    else
      _ -> :error
    end
  end
end
IO.inspect M.extract("https://foster.com/death-pancake.1468/")
IO.inspect M.extract("https://hkd33.net/mr-rogers101.690153/")
IO.inspect M.extract("https://space-force911.gov/sauce-master.13257777/")
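To make the three steps of the with chain easier to follow, here is a small sketch that runs them one at a time on one of the URLs above. The variable names are illustrative; the calls are the same ones used in M.extract.

```elixir
url = "https://foster.com/death-pancake.1468/"

# Step 1: URI.parse/1 splits the URL into its components; we keep the path.
%URI{path: path} = URI.parse(url)
# path => "/death-pancake.1468/"

# Step 2: the named capture keeps only the digits between "." and the final "/".
captures = Regex.named_captures(~r[\.(?<num>\d+)/$], path)
# captures => %{"num" => "1468"}

# Step 3: an empty remainder means the whole capture parsed as an integer.
parsed = Integer.parse(captures["num"])
# parsed => {1468, ""}

# A string that is not shaped like the expected URL fails at step 2:
Regex.named_captures(~r[\.(?<num>\d+)/$], URI.parse("whatever_dude").path)
# => nil, so the with falls through to the else branch and returns :error
```

Because every step has a distinct failure shape (a nil path, nil captures, or a non-empty remainder), the single `_ -> :error` clause in the else block is enough to cover all of them.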







