Getting the year out of a list of strings

I’m currently working on an application that requires me to extract the year from a label.
The labels come in different formats and are never consistent:

  examples = [
    "February 2015 Part 1",
    "2014 Part 2 February",
    "February 2015",
    "2015 Feb"
  ]

I need to split each one into a label and a year, like this:

  expected_result = [
    %{"label" => "February Part 1", "year" => "2015"},
    %{"label" => "Part 2 February", "year" => "2014"},
    %{"label" => "February", "year" => "2015"},
    %{"label" => "Feb", "year" => "2015"}
  ]

Here is my implementation

defmodule ViewHelper do

  def get_label_with_year(sentence) do
    list_of_words = String.split(sentence, " ")

    year  = get_year(list_of_words)

    label =
      list_of_words
      |> Enum.reject(fn(word) -> word == year end)
      |> Enum.join(" ")

    %{}
    |> Map.put_new("label", label)
    |> Map.put_new("year", year)
  end


  def get_year(list_of_words) do
    [year] = list_of_words
              |> Enum.map(fn(x) -> Integer.parse(x) end)
              |> Enum.filter(fn(x) -> is_tuple(x) end)
              |> Enum.filter(fn({num, _}) -> String.length(to_string(num)) == 4 end)
              |> Enum.map(fn({num, _}) -> to_string(num) end)
     year
  end


end

and my test file

defmodule ViewHelperTest do
  use ExUnit.Case
  doctest ViewHelper

  test "get_label_with_year" do

  examples = [
    "February 2015 Part 1",
    "2014 Part 2 February",
    "February 2015",
    "2015 Feb"
  ]

  expected_result = [
    %{"label" => "February Part 1", "year" => "2015"},
    %{"label" => "Part 2 February", "year" => "2014"},
    %{"label" => "February", "year" => "2015"},
    %{"label" => "Feb", "year" => "2015"}
  ]

  my_test = Enum.map(examples, fn(x) -> ViewHelper.get_label_with_year(x) end)

  assert  my_test == expected_result

  end


end

Everything works. However…
I feel like this implementation is not the best way of writing this.
Could this be written differently so it’s easier to read? Am I overthinking this?

My intuition tells me this could be written so that it’s much easier to understand in the future, or for someone else who will eventually look at this code. I’m looking for feedback and/or concepts on how to approach this differently (e.g. pattern matching, a reducer).

Maybe simply Regex?

iex> capture_year = fn label -> Regex.named_captures ~r/(?<year>\d{4})/, label end 
iex> examples |> Enum.map(&capture_year.(&1))
[
  %{"year" => "2015"}, 
  %{"year" => "2014"},
  %{"year" => "2015"},
  %{"year" => "2015"}
]

Just adapt the output.


The most efficient solution would be built around binary pattern matching and creating as little extra garbage as possible.

In your solution, the call to String.split creates a bunch of new strings and each Enum.* call creates a new list.

Here’s my attempt at solving this problem:

defmodule FindYear do
  def extract_year(string), do: find_and_extract_year(string, 0, string)

  defp find_and_extract_year("", _position, whole_string),
    do: %{
      "year" => :not_found,
      "label" => whole_string
    }

  defp find_and_extract_year(<<d1, d2, d3, d4>> <> suffix, position, whole_string)
       when d1 in ?1..?9 and d2 in ?0..?9 and d3 in ?0..?9 and d4 in ?0..?9 do
    prefix = :binary.part(whole_string, {0, position})

    %{
      "year" => <<d1, d2, d3, d4>>,
      "label" => merge_and_trim(prefix, suffix)
    }
  end

  defp find_and_extract_year(<<_>> <> rest, position, whole_string),
    do: find_and_extract_year(rest, position + 1, whole_string)

  defp merge_and_trim("", rest), do: String.trim_leading(rest)
  defp merge_and_trim(prefix, ""), do: String.trim_trailing(prefix)
  defp merge_and_trim(prefix, " " <> suffix), do: prefix <> suffix
end

The basic idea is to walk the input string 1 byte at a time looking for a sequence of 4 consecutive digits. As soon as one is found, we extract the prefix that precedes the year and concatenate it with the remainder of the string.
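
For anyone new to binary pattern matching, here is a minimal sketch of the matching step on its own. The module and function names are made up for illustration and are not part of FindYear; it only checks whether a string starts with four ASCII digits, using the same guard as above.

```elixir
defmodule BinaryMatchSketch do
  # Hypothetical helper for illustration only.
  # Each of d1..d4 binds to one byte; `rest` binds to whatever follows.
  # The guard accepts only ASCII digit bytes, with a non-zero first digit.
  def leading_year(<<d1, d2, d3, d4, rest::binary>>)
      when d1 in ?1..?9 and d2 in ?0..?9 and d3 in ?0..?9 and d4 in ?0..?9 do
    # Rebuild a 4-byte binary (a string) from the matched bytes.
    {:ok, <<d1, d2, d3, d4>>, rest}
  end

  # Anything else: shorter than 4 bytes, or not starting with four digits.
  def leading_year(_other), do: :error
end

IO.inspect(BinaryMatchSketch.leading_year("2015 Feb"))  # {:ok, "2015", " Feb"}
IO.inspect(BinaryMatchSketch.leading_year("Feb 2015"))  # :error
```

FindYear applies this same kind of clause repeatedly, dropping one byte from the front each time the match fails.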

Note that this code has a number of assumptions: it expects ASCII-only text, the year is assumed to be any 4-digit sequence that starts with a digit from 1 to 9, it expects only one year in the string, and all words are assumed to be split using a single space.

On the one hand, this solution is quite specific to the provided list of examples but, on the other hand, all the assumptions are pretty obvious from the code itself, without any documentation around it.


Yes, Regex makes more sense, and named_captures is a really good call.

Here is my updated example.
I took inspiration from your suggestion.

defmodule ViewHelper do

  def get_label_with_year(sentence) do
    %{}
    |> Map.put_new("label", capture_label(sentence))
    |> Map.merge(capture_year(sentence))
  end

  def capture_label(sentence) do
    Regex.split(~r/(?<year>\d{4})/, sentence, trim: true)
    |> Enum.map_join(" ", &String.trim/1)
    |> String.trim_leading()
  end

  def capture_year(sentence) do
    Regex.named_captures(~r/(?<year>\d{4})/, sentence)
  end

end

It definitely reads easier than before, and my test is still passing.

 sentence = "February 2015 Part 1"

 ViewHelper.get_label_with_year(sentence)

%{"label" => "February Part 1", "year" => "2015"}

I really like the function names and how you’re using pattern matching. The pattern matching on binaries is a little bit over my head, but I’m going to spend some time understanding it further. Thanks for the feedback.

Good answer.

Btw, we also need to consider years with 1, 2 and 3 digits.

And just in case @neuone’s application/product/project lasts for millennia, we need to consider 5-digit years as well.


One other way that only requires you to run the regex once per string is:

[pre, year, post] = Regex.run(~r/(.*)(\d{4})(.*)/, "February 2015 Part 1", capture: :all_but_first)
%{
   "label" => String.trim(String.trim(pre) <> " " <> String.trim(post)),
   "year" => year
}

Almost :slight_smile:

That was my thought, but there is a problem because pre or post can be nil… thus breaking the label :slight_smile:

PS I should read code with more accuracy… You trim all, and now I look stupid :slight_smile:

I only realised it would still capture an empty string when playing with it. My intuition was also that it wouldn’t.
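
A quick sketch to show that behaviour, assuming the same regex as above (the module and function names are just for illustration): because (.*) can match zero characters, the group still takes part in the match and comes back as "" rather than nil.

```elixir
defmodule EmptyCaptureSketch do
  # Hypothetical wrapper around the Regex.run call from the post above.
  # Returns [pre, year, post] on a match, or nil when no 4-digit run exists.
  def split_year(label) do
    Regex.run(~r/(.*)(\d{4})(.*)/, label, capture: :all_but_first)
  end
end

IO.inspect(EmptyCaptureSketch.split_year("2015"))           # ["", "2015", ""]
IO.inspect(EmptyCaptureSketch.split_year("2015 Feb"))       # ["", "2015", " Feb"]
IO.inspect(EmptyCaptureSketch.split_year("February 2015"))  # ["February ", "2015", ""]
IO.inspect(EmptyCaptureSketch.split_year("February"))       # nil
```

Note the last case: when there is no year at all, Regex.run returns nil, so the [pre, year, post] match would raise a MatchError.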


@amnu3387 :+1: That solution simplifies it down even further.

I’m also just posting from the docs what Regex states about capture :all_but_first for anyone in the future who reads this.

Captures
Many functions in this module handle what to capture in a regex match via the :capture option. The supported values are:

:all_but_first - all but the first matching subpattern, i.e. all explicitly captured subpatterns, but not the complete matching part of the string

Here is the full list of :capture options for future reference:

  • :all - all captured subpatterns including the complete matching string (this is the default)
  • :first - only the first captured subpattern, which is always the complete matching part of the string; all explicitly captured subpatterns are discarded
  • :all_but_first - all but the first matching subpattern, i.e. all explicitly captured subpatterns, but not the complete matching part of the string
  • :none - does not return matching subpatterns at all
  • :all_names - captures all names in the Regex
  • list(binary) - a list of named captures to capture

It’s fun to see where it started and the variations being proposed. So many ways to approach it.
