Any tips to refactor this open-source module in a more functional programming/Elixir way?

hectorperez · October 5, 2022, 2:49pm

Hi!

I’m building an open-source site to discover Livebook notebooks.

How would you improve the following code? Any tips would be appreciated, thanks!

I have this module — full code here:

defmodule Notesclub.Workers.UrlContentSyncWorker do
  def perform(%Oban.Job{args: %{"notebook_id" => notebook_id}}) do
    notebook_id
    |> get_notebook()
    |> get_urls()
    |> get_content()
    |> update_content_and_maybe_url()
  end

  
  # ...

  defp get_urls(%Sync{} = data) do
    case Urls.get_urls(data.notebook) do
      {:ok, %Urls{} = urls} ->
        Map.put(data, :urls, urls)

      {:error, error} ->
        Logger.error("get_urls/1 returned {:error, #{error}}; notebook.id=#{data.notebook.id}")
        Map.put(data, :cancel, error)
    end
  end
  
  # ...
end

I could also change the content of the Urls module— full code here — if that helps:

defmodule Notesclub.Notebooks.Urls do
	# ...
	def get_urls(%Notebook{} = notebook) do
	  notebook
	  |> get_commit_url()
	  |> get_raw_commit_url()
	  |> get_default_branch_url()
	  |> get_raw_default_branch_url()
	end
	# ...
end

Any idea about how to refactor this?
Thank you!

fuelen · October 5, 2022, 5:07pm

My main recommendation is to use as simple data structures as possible. You’ve defined a struct and use it only internally. Its state is spread across multiple private functions, which makes every single function more complex than it’s needed. Absence of notebook is handled in 3 functions, however you don’t even have to run the functions if notebook is absent. I think if a notebook is absent, then this is an exception. The code is run by Oban. The end user won’t see errors you’re trying to represent as {:error, message}. Just throw exceptions if input is really not expected and shouldn’t be transformed to :cancel response for Oban.
I recommend to setup generic telemetry handler for Oban and log the result of invocation there, not in worker implementation.
Try to rewrite code with pipes inside perform function using with macro.

Another notes:

get_notebook function has invalid specs. Specs say it returns Notebook, the module, instead of the struct.
Invalid spec definition here:
```
 @spec get_urls(%Notebook{}) :: {:ok, %Urls{}} | {:error, binary()}
  def get_urls(nil), do: {:error, "notebook can't be nil"}
```
Implementation handles nil, which is not mentioned in @spec. The code inside the worker already handles nil, and if you rewrite fetching notebook using exception, then you won’t need this handling at all.
I don’t think these developer errors are useful:
```
  def get_urls(%Notebook{user: nil}), do: {:error, "user can't be nil. It needs to be preloaded."}
  def get_urls(%Notebook{repo: nil}), do: {:error, "repo can't be nil. It needs to be preloaded."}
```
Just throw an exception if the error is for developer. Also, if the association is not preloaded, then Ecto will put Ecto.Association.NotLoaded struct, not nil.
It is not necessary to use / for regex sigil. ~r|^https://github.com/#{user.username}/#{repo.name}/blob| would also work and it makes the code less noisy. user.username and repo.name should be escaped using Regex.escape function, as values inside may contain special characters.

hectorperez · October 6, 2022, 6:49pm

Thank you for your response @fuelen!

I started a PR refactoring the code based on your recommendations.

I think if a notebook is absent, then this is an exception.

If a notebook doesn’t exist, I believe it’s better to return :cancel so the job is not retried. If it doesn’t exist now, it won’t exist later.

I don’t think these developer errors are useful:
def get_urls(%Notebook{user: nil}), do: {:error, "user can't be nil. It needs to be preloaded."}
Just throw an exception if the error is for developer.

Similarly, if I raise an exception, UrlContentSyncWorker would need to catch it as the job should be cancelled — not retried.

My main recommendation is to use as simple data structures as possible. You’ve defined a struct and use it only internally.

So should I avoid defining structs when they are only used internally in a module?

Its state is spread across multiple private functions, which makes every single function more complex than it’s needed.

What do you mean? Not sure I understand what you mean by spreading the state.

Thank you!

fuelen · October 7, 2022, 7:41am

Consider not even scheduling a job where notebook doesn’t exist or user is nil.

well, it might be okay to use structs for internal things, but I don’t see advantages for them in your code. It might be ok to represent the state of a GenServer using structs.
Also, you don’t get all the benefits of structs if you use Map.put for updating data. %{struct | field: data} ← this approach can show warnings when field doesn’t exist in the struct

I just mean, that simple functions receive way more data than needed.

hectorperez · October 7, 2022, 2:38pm

Thank you again! I got rid of the struct.

al2o3cr · October 9, 2022, 2:28am

My advice: stay away from “railway-oriented programming”. It can be very expressive in the right context, but mostly just leads to twisty code that’s hard to read.

The other downside to approaches like the one in UrlContentSyncWorker is that the path taken in subsequent steps is highly dependent on what’s happened to data before; for instance, the way that maybe_make_commit_request’s code path depends on what happened in make_default_branch_request because data will have different keys.

Here’s my refactoring of the file (as of cancel_if_missing_data/1 · notesclub/notesclub@f714e8d · GitHub):

# snip
  def perform(%Oban.Job{args: %{"notebook_id" => notebook_id}}) do
    with {:ok, notebook} <- load_notebook(notebook_id),
         {:ok, urls} <- get_urls(notebook),
         {:ok, attrs} <- attrs_for_update(notebook, urls)
    do
      update_notebook(notebook, attrs)
    end
  end

  defp load_notebook(notebook_id) do
    case Notebooks.get_notebook(notebook_id, preload: [:user, :repo]) do
      nil ->
        {:cancel, "notebook does NOT exist"}

      %Notebook{user: nil} ->
        {:cancel, "user is nil"}

      %Notebook{repo: %Repo{default_branch: nil} = repo} ->
        {:ok, _job} =
          %{repo_id: repo.id}
          |> RepoSyncWorker.new()
          |> Oban.insert()

        {:cancel, "No default branch. Enqueueing RepoSyncWorker."}

      notebook ->
        {:ok, notebook}
    end
  end

  defp get_urls(%Notebook{} = notebook) do
    case Urls.get_urls(notebook) do
      {:ok, urls} ->
        {:ok, urls}

      {:error, error} ->
        {:cancel, "get_urls/1 returned '#{error}'; notebook id: #{notebook.id}"}
    end
  end

  defp attrs_for_update(notebook, urls) do
    case ReqTools.make_request(urls.raw_default_branch_url) do
      {:ok, %Req.Response{status: 200, body: body}} ->
        {:ok, %{content: body, url: urls.default_branch_url}}

      {:ok, %Req.Response{status: 404}} ->
        attrs_from_commit(notebook, urls)

      _ ->
        # Retry several times
        raise "request to notebook default branch url failed"
    end
  end

  defp attrs_from_commit(notebook, urls) do
    case ReqTools.make_request(urls.raw_commit_url) do
      {:ok, %Req.Response{status: 200, body: body}} ->
        {:ok, %{content: body, url: nil}}

      {:ok, %Req.Response{status: 404}} ->
        {:cancel,
         "Neither notebook default branch url or commit url exists. The notebook id: #{notebook.id} was deleted or moved on Github"}

      _ ->
        # Retry job several times
        raise "request to notebook commit url failed"
    end
  end

  defp update_notebook(notebook, attrs) do
    case Notebooks.update_notebook(notebook, attrs) do
      {:ok, _notebook} ->
        {:ok, :synced}

      {:error, _} ->
        #  Retry job several times
        {:error, "Error saving the notebook id #{notebook.id}, attrs: #{inspect(attrs)}"}
    end
  end
end

IMO perform’s control flow is clearer in this shape:

all the responsibility for forming error-shapes is still left to the helper functions
the dependencies of each helper function (load_notebook/get_urls/attr_for_update) are clearly labeled

hectorperez · October 9, 2022, 9:03am

Thank you @al2o3cr, that’s brilliant!

That use of the with macro makes it clearer to read.