makeitrein

Extracting numbers from a string

Hey all, just started picking up Elixir last week and am writing a scraper as a learning project.

Baby step #1 is extracting the number from a URL on the target web page… here’s what I’ve written:

# url is in "https://xxyyzz.com/xxyyzz.383254/" format... goal is to extract 383254
  def get_id_from_url(url), do: Regex.run(~r"\d+\/", url) |> Enum.at(0) |> Integer.parse |> elem(0)

This function seems a bit clunky to me for a simple integer extraction… is there a better way of going about this?

Marked As Solved

dimitarvp

(EDIT 1: Account for invalid values.)
(EDIT 2: Trim empty strings when splitting.)
(EDIT 3: Included explanations.)

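defmodule Test do
  def extract_id(url) when is_binary(url) do
    url
    # Split on "." and "/", dropping empty pieces; at most 1000 parts.
    |> String.split(~w(. /), parts: 1000, trim: true)
    |> List.last()
    |> parse_id()
    |> fetch_id()
  end

  # List.last/1 returns nil for an empty list.
  defp parse_id(nil), do: :error
  defp parse_id(x) when is_binary(x), do: Integer.parse(x)

  # Accept only a complete integer string (nothing left over after parsing).
  defp fetch_id({number, ""}) when is_integer(number), do: number
  defp fetch_id(:error), do: :error
end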

Test it:

iex> urls = ["https://foster.com/death-pancake.1468/", "https://hkd33.net/mr-rogers101.690153/", "whatever_dude", "https://space-force911.gov/sauce-master.13257777/"]

iex> urls |> Enum.map(&Test.extract_id/1)
[1468, 690153, :error, 13257777]

Breaking it down:

  • ~w(. /) equals [".", "/"] (so String.split is called with multiple separators).

  • parts: 1000 caps the number of pieces the split can produce, which limits the work done if somebody manages to smuggle huge strings into your code (a basic guard against denial-of-service attacks). trim: true removes empty strings from the result. Check the String.split docs.

  • "https://foster.com/death-pancake.1468/" |> String.split(~w(. /), parts: 1000, trim: true) yields this:

["https:", "foster", "com", "death-pancake", "1468"]

…so we call List.last on it to get the piece of data we want.

  • Our internal function parse_id has to also handle invalid data:
    • If String.split returns [], List.last would return nil.
    • If String.split returns ["single_invalid_url"], List.last would return "single_invalid_url".

Both cases make our internal function parse_id return :error. (Integer.parse returns :error if you give it a string that does NOT start with an integer.)

  • The fetch_id internal function uses function heads instead of if or case: one clause unwraps a successfully parsed integer and returns it, the other passes an :error return value down the line to your consumer code.

  • One caveat: notice that fetch_id matches on {number, ""} when is_integer(number), which means that clause applies only if a full integer string is passed: "123" or "456" will succeed but "123xyz" will not. If you expect URLs like "https://whatever.man/1234abcd", this code won't work as written: Integer.parse returns {1234, "abcd"}, no fetch_id clause matches that tuple, and the call raises a FunctionClauseError.
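
If you would rather get :error back for those partially numeric segments, one option is a catch-all fetch_id clause. This is only a sketch of that idea, not part of the solution above: the module name TestWithFallback is made up for illustration, and everything except the last clause is the same pipeline.

```elixir
defmodule TestWithFallback do
  # Hypothetical variant of Test.extract_id/1 with a catch-all clause.
  def extract_id(url) when is_binary(url) do
    url
    |> String.split(~w(. /), parts: 1000, trim: true)
    |> List.last()
    |> parse_id()
    |> fetch_id()
  end

  defp parse_id(nil), do: :error
  defp parse_id(x) when is_binary(x), do: Integer.parse(x)

  # A complete integer string such as "1468" is unwrapped and returned.
  defp fetch_id({number, ""}) when is_integer(number), do: number
  # Everything else, e.g. {1234, "abcd"} or :error, becomes :error.
  defp fetch_id(_other), do: :error
end

IO.inspect(TestWithFallback.extract_id("https://foster.com/death-pancake.1468/"))
IO.inspect(TestWithFallback.extract_id("https://whatever.man/1234abcd"))
```

The first call prints 1468 and the second prints :error. Whether swallowing the partial parse is the right call depends on your scraper; crashing loudly on unexpected input can be the better choice while you are still learning what the URLs look like.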

Also Liked

hassan

How about

iex(10)> "https://xxyyzz.com/xxyyzz.383254/" |> String.replace(~r/[^\d]/, "")
"383254"

dimitarvp

defmodule Test do
  def match_string("https://xxyyzz.com/xxyyzz." <> suffix) do
    case Integer.parse(suffix) do
      {number, "/"} when is_integer(number) ->
        IO.puts "suffix is #{number}"

      _ ->
        IO.puts "cannot parse suffix: #{suffix}"
    end
  end
end

Test it in iex:

iex> Test.match_string "https://xxyyzz.com/xxyyzz.383254/"
suffix is 383254
:ok
iex> Test.match_string "https://xxyyzz.com/xxyyzz.383254/!"
cannot parse suffix: 383254/!
:ok

You can take advantage of Elixir’s binary pattern matching here: match a literal prefix with <> and bind the rest of the string (the suffix) to a variable. Keep in mind that the literal has to come first; you cannot bind a prefix of unknown length or match a literal in the middle of a bigger string this way. I may be taking your example too literally, but if I understood you correctly, that’s how I would approach the problem.
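
For a scraper you will probably want a value back rather than printed output. A minimal sketch of the same prefix-matching idea, assuming the prefix is always the literal from the example URL; the module name PrefixMatch and the second clause are hypothetical additions so that other strings return :error instead of raising:

```elixir
defmodule PrefixMatch do
  # The literal prefix is an assumption taken from the example URL.
  def extract_id("https://xxyyzz.com/xxyyzz." <> suffix) do
    case Integer.parse(suffix) do
      # Digits followed by exactly one trailing slash.
      {number, "/"} -> {:ok, number}
      _ -> :error
    end
  end

  # Any other string, i.e. one that does not start with the prefix.
  def extract_id(url) when is_binary(url), do: :error
end

IO.inspect(PrefixMatch.extract_id("https://xxyyzz.com/xxyyzz.383254/"))
IO.inspect(PrefixMatch.extract_id("https://xxyyzz.com/xxyyzz.383254/!"))
```

This prints {:ok, 383254} and then :error. It only helps when the part before the number is fixed, which is why it may be too literal a reading of the example.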

NobbZ

Now that we have more information, here is an alternative version which I prefer over @dimitarvp’s, because it is much more explicit about what we want.

  • It says that we want a URL and verifies we get one (by parsing it) and that we are only interested in the path,
  • it says that we are searching for dot, followed by at least one digit and ending with a slash as the last character of the path, but we are only interested in the actual digits (the call to Regex.named_captures/3),
  • we want those digits to cleanly parse into a number.

If all succeed, we return an :ok-tuple, and simply :error otherwise.

Which version to choose is probably a matter of taste; I have not benchmarked them.

defmodule M do
  def extract(url) do
    with %URI{path: path} when is_binary(path) <- URI.parse(url),
         %{"num" => num_str} <- Regex.named_captures(~r[\.(?<num>\d+)/$], path),
         {num, ""} <- Integer.parse(num_str) do
      {:ok, num}
    else
      _ -> :error
    end
  end
end

IO.inspect M.extract("https://foster.com/death-pancake.1468/")
IO.inspect M.extract("https://hkd33.net/mr-rogers101.690153/")
IO.inspect M.extract("https://space-force911.gov/sauce-master.13257777/")
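
To see the failure path as well, here is a small sketch of how calling code might branch on the result. M.extract/1 is copied unchanged from above so the snippet runs on its own; the Caller module and its describe/1 function are hypothetical names used only for illustration.

```elixir
# M.extract/1 is repeated from the post so this snippet is self-contained.
defmodule M do
  def extract(url) do
    with %URI{path: path} when is_binary(path) <- URI.parse(url),
         %{"num" => num_str} <- Regex.named_captures(~r[\.(?<num>\d+)/$], path),
         {num, ""} <- Integer.parse(num_str) do
      {:ok, num}
    else
      _ -> :error
    end
  end
end

# Hypothetical caller: branch on the tagged result instead of unwrapping blindly.
defmodule Caller do
  def describe(url) do
    case M.extract(url) do
      {:ok, id} -> "id #{id}"
      :error -> "no id found"
    end
  end
end

IO.inspect(M.extract("https://foster.com/death-pancake.1468/"))
IO.inspect(M.extract("whatever_dude"))
# No trailing slash, so the regex anchored on "/$" does not match.
IO.inspect(Caller.describe("https://foster.com/death-pancake.1468"))
```

These print {:ok, 1468}, :error and "no id found". The :ok-tuple makes the success case easy to pattern-match on in a with or case further up the call chain.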
