Why does Oban.Notifier serialize to JSON instead of term_to_binary?

For a number of reasons, I want to use Oban.Notifier over Phoenix PubSub (I need to pipe the events through Postgres instead of the Erlang cluster).

At first glance, the semantics of Notifier look similar to PubSub, apart from different naming and slightly different API design choices.

However, one significant difference is that Notifier always converts the payload to JSON (the Jason calls are hardcoded in the implementation), which hurts the developer experience on the receiving side of the event. For example, pattern matching on structs and tuples in "receive do" blocks or GenServer handlers.

I can see why that would be handy in the case of Oban.Job's args or anything else that's persisted, but is there ANY benefit to using JSON over term_to_binary/binary_to_term (maybe even with [:safe] baked in)?

One reason I'm asking is that I'm tempted to open a GitHub issue about it and to override it in a private fork (I've already verified that this works), but I can't help feeling that I'm missing something.

1 Like

The original notifier was only for Postgres. It was a transport layer through the application, but it also received notifications directly from the database via an insert trigger. Postgres can construct JSON natively, but not Erlang external term format. Beyond Postgres, keeping the message format agnostic simplifies interop with other languages (not as important at the moment, but always a consideration).

A UTF-8 string is valid JSON. You can run term_to_binary and base64-encode the result to pass it directly as a string. It's technically JSON, but not structured or subject to casting.
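
As a quick sketch of that round trip (the term and its contents here are illustrative):

```elixir
# Any Erlang term becomes a JSON-safe ASCII string via external term format + Base64.
term = {:job_finished, %{id: 42, status: :ok}}

encoded =
  term
  |> :erlang.term_to_binary()
  |> Base.encode64(padding: false)

# The string survives JSON encoding untouched; decode it on the other side.
decoded =
  encoded
  |> Base.decode64!(padding: false)
  |> :erlang.binary_to_term([:safe])

decoded == term
# → true
```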

5 Likes

Thanks a lot for the explanation. I was missing the fact that Notifier is used for comms with Postgres in addition to the user-land usage.

One more question if you don’t mind :slight_smile:

Do I understand correctly that, if I want to “teach” the notifier to use term_to_binary, my options are:

  1. Forking and patching
  2. Soft-forking just the notifier.ex and notifiers/postgres.ex, patching them, and starting them explicitly on the supervision tree myself, parallel to the rest of Oban (I actually did it as a POC and it seems to be working fine, but I might be unaware of second-order effects, e.g. extra DB connection strain)
  3. Wrapping public Oban.Notifier interface into my own “facade” without touching the internals. But this way, I’ll have to proxy all incoming messages through an extra process (e.g. a GenServer) to call binary_to_term, right?
  4. ???

At this point, should I just write a Postgres adapter to Phoenix.PubSub based on the source code of notifiers/postgres.ex? :sweat_smile:

Or am I missing a trivial solution to this?

I strongly recommend you go the opposite direction and use a different notifier, if possible. There are several drawbacks to Postgres notifications:

  1. It requires one additional, dedicated connection to the database per node.
  2. Payload size has a hard limit of 8KB, which is rather small and easy to bump up against.
  3. Notifier queries use the Ecto pool, which adds contention to the pool and load to the database.

However, if you’re set on using Postgres, you can make a lightweight version of option 3 work and avoid the burden of forking the notifier. It requires you to transparently encode data and wrap it in a map:

defmodule MyApp.Notifier do
  # Use this instead of `Oban.Notifier.notify`
  def notify(channel, data) do
    encoded =
      data
      |> :erlang.term_to_binary()
      |> Base.encode64(padding: false)

    # Wrap the data string in a map
    Oban.Notifier.notify(channel, %{data: encoded})
  end

  # Use this to decode the payload from a `{:notification, channel, payload}` message.
  # After JSON decoding the map has string keys, hence the "data" match.
  def decode(%{"data" => encoded}) when is_binary(encoded) do
    encoded
    |> Base.decode64!(padding: false)
    |> :erlang.binary_to_term()
  end
end
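
On the receiving side, a listener process could look roughly like this (a sketch; MyApp.Notifier is the module above and :my_events is a hypothetical channel name):

```elixir
defmodule MyApp.EventListener do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl GenServer
  def init(_opts) do
    # Subscribe this process to the channel; Oban.Notifier then delivers
    # `{:notification, channel, payload}` messages to it.
    :ok = Oban.Notifier.listen([:my_events])
    {:ok, nil}
  end

  @impl GenServer
  def handle_info({:notification, :my_events, payload}, state) do
    # The payload is the JSON-decoded map; recover the original term.
    term = MyApp.Notifier.decode(payload)
    # ...pattern match on structs/tuples in `term` here...
    {:noreply, state}
  end
end
```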

One already exists, but it has very few downloads.

Thanks for your answers @sorentwo

I’ll see if we can reconsider reaching for Postgres for PubSub.

The crux of the issue is that we have a solid zero-downtime deployment process, but it means there's a small-ish window when both the old and the new nodes are capable of processing jobs.

Sometimes, we broadcast events from jobs and those events are awaited in controllers.

What happens is that the controller from the old node waits for the event, but the job is picked up by the new node and the event only happens on the new node.

I guess the solution is to connect the old and the new nodes into the same Erlang cluster, but something in me is resisting that haha

Replacing PubSub with Oban.Notifier in the most critical place immediately solved the problem there since both nodes got the event.

There’s benefit in connecting the old and new instances because it allows for an immediate switch in leadership. It’s also beneficial for metrics in Oban Web because old nodes can hand off historic data to the new nodes, and older nodes that are still processing can share those events with the connected nodes.

If you still don’t want to connect them, you can isolate each release using a namespace: Oban.Notifiers.PG — Oban v2.18.1
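
For example (a sketch; it assumes the `{module, options}` notifier tuple form and a RELEASE_VERSION env var, both illustrative):

```elixir
# config/runtime.exs — give each release its own PG notifier namespace
# so old and new nodes don't see each other's messages.
config :my_app, Oban,
  notifier: {Oban.Notifiers.PG, namespace: :"release_#{System.fetch_env!("RELEASE_VERSION")}"}
```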

1 Like

Thanks for the recommendation. The main reason for the "why not" has actually been that I'd need to figure out proper discovery, expose epmd (or an epmdless endpoint), etc.

I'm deploying the services using Kamal (GitHub - basecamp/kamal: Deploy web apps anywhere) and I'll have to figure out how to make the new node discover the old one and connect to it, either manually with an explicit Node.connect in the code or via one of those fancy tools like libcluster with automatic discovery.

Relying on Postgres as the "mutually discovered bus" was so much simpler because the nodes don't have to know about each other in any way, nor do they need direct connectivity.
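
For what it's worth, a static topology without discovery can be sketched with libcluster's Epmd strategy, pointing every node at one well-known node (the hostname here is illustrative):

```elixir
# config/runtime.exs — a sketch using libcluster's Epmd strategy.
# Every app node connects to the same hub node, forming one cluster.
config :libcluster,
  topologies: [
    hub: [
      strategy: Cluster.Strategy.Epmd,
      config: [hosts: [:"hub@cluster-hub.internal"]]
    ]
  ]
```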

(Marking your original response as the solution)

1 Like

FWIW, we use Redis pub/sub as our Phoenix PubSub mechanism and it works great. I know that adds complexity, but in our case it was worth it.

Cool, thanks for chiming in!

Is there anything you can share about that decision especially vs using Distributed Erlang for PubSub?

To me, it seems like adding a Redis server would be equally or more complicated than, e.g., adding one extra static Elixir node that the actual app nodes would connect to when starting, thus forming a cluster (to avoid the need for the app nodes to discover each other somehow).

Sure! We operate in AWS EC2 and networking in the cloud is notoriously finicky. Distributed Erlang has trouble when the network has issues. Distributed Erlang also doesn’t play super well with autoscaling.

We found using Redis to be significantly simpler operationally and performant/cost effective enough for our needs, while not requiring the overhead of a fully connected mesh. Plus there was already a Redis Pub/Sub lib available, so it was a pretty easy choice for us.

If I were running a personal project somewhere like fly.io, I’d probably try meshing first, depending on my use-case.

2 Likes

From my experience using Phoenix PubSub in a mission critical part of our application, using something like Redis is actually way less complex than distributed Erlang. I don’t know your use case exactly but if it’s important that events are not lost then using something like Redis will make your life less stressful.

There’s a lot of nuance when using distributed Erlang that people don’t really talk about. IMO reaching for distributed Erlang should be the exception and not the rule.

3 Likes

Also, one issue with using :term_to_binary with structs is that you can run into trouble if the struct changes in a way that makes it incompatible with already-dispatched events. This is a bigger problem if your events are persistent, but it can also bite during rollover, when some nodes emit events using an older version of a struct and other nodes consume them with the newer version. Even when you keep this in mind, the issue can pop up in unintuitive ways, and it's a headache to debug.

Using something like JSON as the event message can help with the above issue as well as provide additional benefits similar to what others have said. You could also try to get around this issue by converting a struct to a map before using :term_to_binary but from a DX perspective I’m not sure what the difference would be.
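
The rollover hazard can be sketched like this (module and field names are illustrative): a struct serialized under an old layout still decodes, but no longer carries the fields a new node's pattern heads expect.

```elixir
# Pretend an old node serialized a user struct that only had :name.
old_binary = :erlang.term_to_binary(%{__struct__: UserV2, name: "ada"})

# The new node's struct definition has grown an :email field.
defmodule UserV2 do
  defstruct [:name, :email]
end

decoded = :erlang.binary_to_term(old_binary)

# Matching on the struct type alone still works...
%UserV2{} = decoded

# ...but a head that requires the new key fails to match,
# because the decoded map simply doesn't contain :email.
case decoded do
  %UserV2{email: email} when is_binary(email) -> {:ok, email}
  _ -> :missing_email
end
# → :missing_email
```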

2 Likes

Yes, agreed on all points. I've had a few projects where we ended up with 3 different pattern-matching function heads for a structure's parse function. It was extremely useful and prevented errors (confirmed by Prometheus metrics counting how often each head was called).

1 Like

Are there any posts or docs about the hardships of running distributed Erlang in production?

I’m toying with libcluster_postgres which just uses LISTEN/NOTIFY to broadcast the node names to all other nodes upon connection and then every 5 seconds.

So, assuming the nodes in such a cluster do have direct network connectivity, I wonder what kind of catch I should expect, especially in the context of not losing events.

Can you expand on that? I don't think anyone would suggest Erlang distribution as a tool to "not lose events". At least in my book, pubsub doesn't include any delivery guarantees, and if you don't want to lose the work, you need a persisted event queue somewhere. That's the same whether you run Phoenix PubSub over Erlang distribution or use Redis to provide the pubsub mechanism.

Sorry, I’m not sure how much I can speak to this. I don’t feel like I have a good grasp of your use case so I’m not sure what using libcluster_postgres would buy you.

My question was rather this: why would PubSub on top of distributed Erlang be considered fragile/unreliable compared to Redis?

I've always assumed that with PubSub (fan-out) systems there's no guarantee of reception by design. E.g. if the listener wasn't listening at the time of the broadcast (for whatever reason), the message is lost forever. (Again, that's the default, without introducing any fancy replay mechanisms.)

Every time I need to ensure some event has a guaranteed consequence, I typically reach for Oban or generally any persistent job processing.

If you compare two architectures for PubSub:

  1. Star-like Erlang cluster where each node aggressively connects to one central node so that all nodes end up in the same cluster. Let's assume we also handle reconnects by spamming Node.connect on an interval or something.
  2. PubSub through a central Redis node.
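
Option 1's "spamming Node.connect on an interval" could be sketched as a tiny GenServer (the hub node name is hypothetical):

```elixir
defmodule MyApp.HubConnector do
  @moduledoc "Periodically (re)connects this node to a central hub node (sketch)."
  use GenServer

  @hub :"hub@cluster-hub.internal"
  @interval :timer.seconds(5)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl GenServer
  def init(_opts) do
    send(self(), :connect)
    {:ok, nil}
  end

  @impl GenServer
  def handle_info(:connect, state) do
    # Node.connect/1 is safe to call repeatedly; it returns true
    # when already connected and retries the link otherwise.
    Node.connect(@hub)
    Process.send_after(self(), :connect, @interval)
    {:noreply, state}
  end
end
```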

What would Redis (option 2) give you that option 1 wouldn't? Including in terms of reliability.