```elixir
defmodule Test do
  def match_string("https://xxyyzz.com/xxyyzz." <> suffix) do
    case Integer.parse(suffix) do
      {number, "/"} when is_integer(number) ->
        IO.puts "suffix is #{number}"

      _ ->
        IO.puts "cannot parse suffix: #{suffix}"
    end
  end
end
```
You can use Elixir’s ability to pattern-match on a string prefix, which binds the remaining suffix (you cannot pattern-match in the middle of a bigger string though, keep that in mind). Not sure if I am taking your example too literally, but if I understood you correctly, that’s how I would approach the problem.
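In case the restriction is unclear, here is a minimal sketch (the URL is just a placeholder):

```elixir
# Matching on a literal prefix is allowed and binds the remainder:
"https://" <> rest = "https://example.com/foo"
rest
# => "example.com/foo"

# Matching on a suffix (or in the middle) is NOT allowed, because the
# compiler cannot know the size of the variable-length leading part:
# rest <> "/foo" = "https://example.com/foo"  # compile-time error
```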
Many thanks for the suggestions! Realized my URL example is a bit lacking… the xxyyzz is representative of any numbers and letters of unknown length, so just stripping non-digits won’t work, nor will the pattern matching (I think — awol from computer).
I guess the challenge is to find a series of digits at the end of the string, preceded by a period and followed by a slash… String.to_integer looks like it might be what I need, good enough for government work.
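One thing worth knowing before reaching for String.to_integer/1: it raises on any non-digit input, while Integer.parse/1 returns :error or a tuple with the leftover string, so the latter is usually safer for untrusted input. A quick sketch:

```elixir
Integer.parse("1468")      # => {1468, ""}
Integer.parse("1468/")     # => {1468, "/"}
Integer.parse("abc")       # => :error

String.to_integer("1468")  # => 1468
# String.to_integer("1468/") raises ArgumentError, so prefer
# Integer.parse/1 when the input is not guaranteed to be clean.
```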
Here’s a selection of random URLs… all begin with https, have a base domain, followed by the username of the person who submitted the domain followed by a period followed by the id of the post followed by a trailing slash…
Only the id of the post is relevant so we can ignore the base domain, username, period, and ending slash…
(EDIT 1: Account for invalid values.) (EDIT 2: Trim empty strings when splitting.) (EDIT 3: Included explanations.)
```elixir
defmodule Test do
  def extract_id(url) when is_binary(url) do
    url
    |> String.split(~w(. /), parts: 1000, trim: true)
    |> List.last()
    |> parse_id()
    |> fetch_id()
  end

  defp parse_id(nil), do: :error
  defp parse_id(x) when is_binary(x), do: Integer.parse(x)

  defp fetch_id({number, ""}) when is_integer(number), do: number
  defp fetch_id(:error), do: :error
end
```
~w(. /) equals [".", "/"] (so String.split is called with multiple separators).
parts: 1000 is used to prevent denial-of-service attacks, in case somebody manages to smuggle huge strings into your code. trim: true removes empty strings from the result. Check the String.split docs.
…so we are calling List.last on it to give us the desired piece of data.
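To make those first two steps concrete, here is what they produce for a URL of the shape described above (the URL itself is just an example):

```elixir
parts =
  String.split(
    "https://foster.com/death-pancake.1468/",
    ~w(. /),
    parts: 1000,
    trim: true
  )

# parts == ["https:", "foster", "com", "death-pancake", "1468"]
# (the empty strings around the slashes were removed by trim: true)

List.last(parts)
# => "1468"
```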
Our internal function parse_id also has to handle invalid data:
If String.split returns [], List.last would return nil.
If String.split returns ["single_invalid_url"], List.last would return "single_invalid_url".
Both cases make our internal function parse_id return :error. (Integer.parse will return :error if you supply it a string that does NOT start with an integer.)
The fetch_id internal function uses function heads instead of if or case to extract a successfully parsed integer and return it, or react to an :error value and simply pass it down the line to your consumer code.
One caveat: notice that fetch_id matches on {number, ""} when is_integer(number), which means the function will only succeed if a full integer string is passed — “123” or “456” will succeed but “123xyz” will not. If you expect URLs like “https://whatever.man/1234abcd”, this code won’t work.
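To spell the caveat out (just illustrating Integer.parse/1 here):

```elixir
Integer.parse("1468")     # => {1468, ""}     -- matches fetch_id({number, ""})
Integer.parse("1234abcd") # => {1234, "abcd"} -- matches NO fetch_id clause,
                          #    so extract_id would raise a FunctionClauseError
                          #    instead of returning :error
```

If a crash is undesirable, a catch-all clause such as `defp fetch_id(_), do: :error` would turn it into a plain :error return.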
Now that we have more information, I have an alternative version which I prefer over @dimitarvp’s, because it is much more explicit about what we want.
It says that we want a URL and verifies we get one (by parsing it), and that we are only interested in the path;
it says that we are searching for a dot, followed by at least one digit, ending with a slash as the last character of the path, but that we are only interested in the actual digits (the call to Regex.named_captures/3);
and we want those digits to parse cleanly into a number.
If all succeed, we return an :ok-tuple, and simply :error otherwise.
But which version to choose is probably a matter of taste; I have not benchmarked them.
```elixir
defmodule M do
  def extract(url) do
    with %URI{path: path} when is_binary(path) <- URI.parse(url),
         %{"num" => num_str} <- Regex.named_captures(~r[\.(?<num>\d+)/$], path),
         {num, ""} <- Integer.parse(num_str) do
      {:ok, num}
    else
      _ -> :error
    end
  end
end
```
```elixir
IO.inspect M.extract("https://foster.com/death-pancake.1468/")
# => {:ok, 1468}
IO.inspect M.extract("https://hkd33.net/mr-rogers101.690153/")
# => {:ok, 690153}
IO.inspect M.extract("https://space-force911.gov/sauce-master.13257777/")
# => {:ok, 13257777}
```
I have worked with governmental datasets before. Some have very blatant errors or peculiarities in them, like spaces in the domain part of the URLs, three forward slashes at the end, two dots instead of one, etc. Some of them really don’t care. As an extreme example, ~17 years ago a guy reported how he had to write his own mini-parser for malformed RSS from several agencies (that “RSS” was not even valid XML).
Regexes can become a scaling problem if you have to process a lot of data. But I will admit that I am on the side of doing some preliminary optimization here, which is a 50/50 decision and depends on a lot of factors.
@makeitrein this is not StackOverflow and we are not collecting reputation points but — whichever solution you like, please mark it as the answer. It increases the visibility of such topics on the forum, and people in the future can make use of your question and our answers.
Legacy… No need to justify whether it’s governmental or not; legacy is a pain on its own so often. But usually I just assume everything conforms to the corresponding specifications until I get proven wrong.
@NobbZ and @dimitarvp — just woke up to the pleasant surprise of two distinct solutions that are, in the words of the late Steve Jobs, insanely great! Always a fan of getting input & insight from more seasoned programmers on some of my junior code; it exposes me to a few new patterns and methods along the way.
@dimitarvp takes the award for the problem, but just by a bit. I find his code slightly more grokkable, especially if I were to revisit it a few months down the line. It doesn’t seem that I can mark responses as answers… here is what I see on my end…