Oban how to properly assert on enqueued jobs?

andreyuhai · June 14, 2022, 2:48pm

I am trying to insert another Oban job after one finishes by listening to [:oban, :job, :stop].
I’d like to assert that the next job was queued after one succeeded, so in my test I do something like this

test "schedules job after another" do
      # ...
      {:ok, result} =
        perform_job(Workers.MyWorker, %{
          record_id: some_record_id
        })

      assert_enqueued(
        [worker: Workers.SecondWorker, args: %{result_id: result.id}],
        1000
      ) 
end

And my telemetry event handler looks like this

  def handle_event(
        [:oban, :job, :stop],
        _measure,
        meta = %{worker: worker, result: {:ok, result}},
        _
      ) do
      Workers.SecondWorker.new(%{result_id: result.id}) |> Oban.insert()
  end

However even though I can see the job inserted after inspecting the result of Oban.insert (though it doesn’t have an ID) my test still fails and I couldn’t really figure out why.

andreyuhai · June 14, 2022, 2:59pm

If I change the Oban config to be

config :lead_engine, Oban, testing: :manual

then it works but I don’t know whether there’s any other way of making this work because now some other tests fail.

If I wrap only my test case with with_testing_mode(:manual... then I get

     ** (ArgumentError) nil given for `id`. comparison with nil is forbidden as it is unsafe. Instead write a query with is_nil/1, for example: is_nil(s.id)

al2o3cr · June 14, 2022, 4:05pm

~~IIRC :telemetry handlers run in a separate process, you may be encountering DB sandbox issues.~~ I did NOT recall correctly, see @LostKobrakai’s note below.

Also beware: telemetry doesn’t provide any delivery guarantees and aggressively (silently + permanently for the VM’s lifetime) removes listeners that crash even once. If SecondWorker is doing anything important, you may want to consider an alternative approach like Oban Pro’s workflows.

LostKobrakai · June 14, 2022, 4:06pm

They do not, as they don’t want to introduce message passing.

andreyuhai · June 14, 2022, 4:42pm

Also beware: telemetry doesn’t provide any delivery guarantees and aggressively (silently + permanently for the VM’s lifetime) removes listeners that crash even once. If SecondWorker is doing anything important, you may want to consider an alternative approach like Oban Pro’s workflows.

I was guessing that those are GenServers and somehow they come back up, but yes, I’ve already encountered that during my tests, so we’ll probably have to switch to Pro to be able to use Workflows. Thank you!

I wonder how other people use it though, I mean if there’s no guarantee and the listeners might get removed. Is it because it doesn’t hurt much (well maybe it does, I have no idea ) losing some metrics?

al2o3cr · June 14, 2022, 5:11pm

Since a crashing handler crashes the calling process, I assume the design principle is “prefer uptime over metrics” which isn’t unreasonable.

It can be kinda painful if you mess up, for instance, a listener that tracks Oban.Job crashes and sends them to Sentry so that you don’t get any reports from production after the very first one. Ask me how I learned THAT one

LostKobrakai · June 14, 2022, 5:16pm

The point here is much more about this:

The code emitting telemetry events can’t handle problems of listeners – those can be libraries, which have no relationship to your project or metric setup. So it’s the job of listeners to have their shit together and deal with any issues they encounter.

trisolaran · June 14, 2022, 5:24pm

Correct, because in :inline mode, which is what I assume your original config was using, no jobs are ever enqueued in the DB

andreyuhai · June 14, 2022, 5:27pm

But then as I mentioned, even if I wrap it with with_testing_mode(:manual... I get that argument error I shared above.

trisolaran · June 14, 2022, 6:53pm

Most likely because, as the error message suggests, perform_job is returning a result struct with id = nil.

Take a look at what happens in your worker or share the code so someone can help you.

andreyuhai · June 14, 2022, 8:27pm

It wasn’t actually about the result perform_job returns but that happens when Oban tries to complete the job, at least that’s what I understood from stacktrace.

  1) test handle_event/4 schedules the next job in the pipeline after one finishes (MyApp.ObanTelemetryHandlerTest)
     apps/my_app/test/my_app/oban_telemetry_handler_test.exs:16
     ** (ArgumentError) nil given for `id`. comparison with nil is forbidden as it is unsafe. Instead write a query with is_nil/1, for example: is_nil(s.id)
     code: Oban.Testing.with_testing_mode(:manual, fn ->
     stacktrace:
       (ecto 3.8.3) lib/ecto/query/builder/filter.ex:191: Ecto.Query.Builder.Filter.not_nil!/2
       (oban 2.12.1) lib/oban/queue/basic_engine.ex:137: Oban.Queue.BasicEngine.complete_job/2
       (oban 2.12.1) lib/oban/queue/engine.ex:234: anonymous fn/3 in Oban.Queue.Engine.with_span/4
       (telemetry 0.4.3) /Users/system/apps/my-app/deps/telemetry/src/telemetry.erl:272: :telemetry.span/3
       (oban 2.12.1) lib/oban/queue/executor.ex:203: Oban.Queue.Executor.ack_event/1
       (oban 2.12.1) lib/oban/queue/executor.ex:197: Oban.Queue.Executor.report_finished/1
       (oban 2.12.1) lib/oban/queue/executor.ex:80: anonymous fn/1 in Oban.Queue.Executor.call/1
       (oban 2.12.1) lib/oban/testing.ex:236: Oban.Testing.perform_job/3
       test/my_app/oban_telemetry_handler_test.exs:26: anonymous fn/1 in MyApp.ObanTelemetryHandlerTest."test handle_event/4 schedules the next job in the pipeline after one finishes"/1
       (oban 2.12.1) lib/oban/testing.ex:398: Oban.Testing.with_testing_mode/2
       test/my_app/oban_telemetry_handler_test.exs:24: (test)

Also when I’ve inspected the actual Oban.Job that was passed to perform/1 from perform_job, I see that the id is nil.

Not sure that’s because of perform_job but I do not think so (or maybe ). I mean perform_job doesn’t actually insert anything as far as I understood by looking at the module.

I’m also using perform_job in other tests but I’ve never encountered the same error.

Edit

Somehow using perform_job within with_testing_mode(:manual... causes some problems

      # ...
     with_testing_mode(:manual, fn ->
      {:ok, result} =
        perform_job(Workers.MyWorker, %{
          record_id: some_record_id
        })

      assert_enqueued(
        [worker: Workers.SecondWorker, args: %{result_id: result.id}],
        1000
      ) 
     end)

The above snippet somehow throws the ArgumentError I shared above.

I’ve also tried to use Oban.insert/1 instead of using perform_job/1 hoping that it would insert the job and once my job is handled it would enqueue another one

      # ...
     with_testing_mode(:manual, fn ->
      %{record_id: some_record_id}
      |> Workers.MyWorker.new()
      |> Oban.insert()

      assert_enqueued(
        [worker: Workers.SecondWorker, args: %{result_id: result.id}],
        1000
      ) 
     end)

but somehow this way my first job didn’t even get processed. At least I didn’t see the IO.inspects that I did to ensure it was working the way I assumed it would.

how I could finally make it work was by directly calling perform/1 on my worker

      # ...
     with_testing_mode(:manual, fn ->
      Workers.MyWorker.perform(%Oban.Job{args: %{record_id: some_record_id}})

      assert_enqueued(
        [worker: Workers.SecondWorker, args: %{result_id: result.id}],
        1000
      ) 
     end)

But at that point, is this test useful or necessary? Because I am calling perform/1 directly and I already know that if the job is succesfull the next thing it will do is to schedule another one. Maybe that’s the topic of another question about testing in general.

trisolaran · June 15, 2022, 8:09am

That’s expected, perform_job/3 doesn’t insert the job in the DB (it’s used to unit test a worker) so the id of the job will be nil

Yeah I verified this error with my own code and I was able to reproduce it.

Here’s what I strongly believe it’s going on:

:inline testing mode uses the Oban.Queue.InlineEngine, which doesn’t touch the DB, while :manual testing mode uses the Oban.Queue.BasicEngine, which inserts jobs in the DB
perform_job/3 is always supposed to run in :inline mode using the InlineEngine, if it accidentally uses the BasicEngine, the error you encountered will be triggered, namely it will try to update the state of a job that was never inserted in the DB
Therefore, when you call perform_job/3, it sets testing: :inline in the config here: oban/lib/oban/testing.ex at fdeb0001bbacb60c4686b86ca0610c7f7508d357 · sorentwo/oban · GitHub which is then used here: oban/lib/oban/config.ex at fdeb0001bbacb60c4686b86ca0610c7f7508d357 · sorentwo/oban · GitHub to enforce the usage of the InlineEngine (the correct one), regardless of the testing mode set in your main config. So you can have testing: :manual in your config and still have perform_job/3 run correctly using the InlineEngine.
This works well as long as you don’t use with_testing_mode(:manual, ...). Reason is that with_testing_mode/2 sets the engine to use in the process dictionary here: oban/lib/oban/testing.ex at 9b4861354f0189d548f4d5cd89273bc98f8eaede · sorentwo/oban · GitHub and this takes precedence over the engine set by perform_job/3 . The overwrite happens here: oban/lib/oban/queue/engine.ex at 1e0f61a913a2ba52c985675383e2f7a180119ac5 · sorentwo/oban · GitHub

Long story short: perform_job/3 always tries to use the InlineEngine, cause it fails otherwise, but with_testing_mode/2 gets in the way end enforces the usage of the BasicEngine, which is the wrong one and causes the issue.

This seems to be a bug, perform_job/3 should ignore the engine set by with_testing_mode/2. Although, to be honest, I see little use for explicitly setting the test mode to :manual and then calling perform_job/3. If you’re using manual mode you’re expected to insert jobs in the DB and execute them with drain_queue/2.

andreyuhai · June 15, 2022, 8:13am

Thank you for taking the time to try it!
I think you’re right, as @shanesveller has mentioned in Slack channel that he’s got a PR for that. I’ll try from that branch and see if it solves my problem.

trisolaran · June 15, 2022, 8:18am

wasted time apparently, since the bug is already known and there’s a PR for it… Well, it happens.

andreyuhai · June 15, 2022, 8:35am

Sorry, I didn’t know about it to let you know beforehand. I just saw the message this morning.

trisolaran · June 15, 2022, 8:37am

Hey no worries! Glad you found the solution to your problem. Also, it was an instructive deep dive into oban’s testing code, so not a complete waste of time

hubertlepicki · June 15, 2022, 9:54am

This is precisely the reason why I submitted this PR a while ago:

This hasn’t been merged yet, but it’s the use case I had in mind and also experienced, that a lot of libraries do not provide any additional instrumentation in addition to telemetry events. It’s not great fit, since Telemetry was never meant to do instrumentation, hence the callbacks detaching mechanism. But life is life, and libraries do use it this way.

So my idea is to add option to specify durable handlers. If you have better idea I’d be very much interested to hear.

al2o3cr · June 15, 2022, 6:06pm

I think you called out exactly what I’m thinking in a previous comment on the issue linked to that PR:

github.com/beam-telemetry/telemetry

Making execution order of handlers guaranteed

opened 10:27AM - 31 Mar 22 UTC

hubertlepicki

I have a use case, where I want to include `span_id` and `trace_id`, which come …from a "distributed tracing" library of choice into logs. If it's OpenTelemetry or Spandex doesn't really matter, in both cases, I am facing the same possible issue. I want to have `span_id` and `trace_id` attached to the log lines generated when web request starts already, but in order to achieve it, the handlers which start the OpenTelemetry/Spandex span/trace *need* to be executed *before* the handlers attached by `Phoenix.Logger` to the same event. This seems to be work, as I can disable logger initialization on `:phoenix` application start, and then attach it manually *after* I attached OpenTelemetry/Spandex events, for example: ``` # config/config.exs config :phoenix, :logger, false ``` ``` # lib/application.ex SpandexPhoenix.Telemetry.install() Phoenix.Logger.install() # or OpenTelemetry.register_application_tracer(:my_app) OpentelemetryPhoenix.setup() Phoenix.Logger.install() ``` This *works* as I'd expect it to work but, the documentation of this project clearly says: > Note that you should not rely on the order in which handlers are invoked. https://github.com/beam-telemetry/telemetry/blob/d3ba8b24ac7b46ce03b644f96f143e8abed71d62/src/telemetry.erl#L72 And this is precisely what I am relying on (and others too, without a doubt). It seems that the handlers *are* executed in the order they were inserted to underlying ETS table, as they are fetched with `:ets.lookup`, which does return them in order of insertion. https://github.com/beam-telemetry/telemetry/blob/main/src/telemetry_handler_table.erl#L48 Should we make this behavior official? I think we don't have to change anything in the project code itself, maybe we should just add a couple of tests to make sure the order is preserved, and remove the warnings from the documentation that the order is not guaranteed?

It’s almost like we need a second event dispatch mechanism, along Telemetry, where events are durable (i.e. do not detach), execute in predictable and defined way, and failures in handlers result in overall request failure as well.

IMO that second mechanism is not :telemetry, it’s a different thing entirely with different requirements and tradeoffs.

The big difference is what happens when things go wrong: :telemetry takes the position that “uptime with missing logs” is better than “downtime with 100% complete logs” under heavy load or crashing handler functions. A hypothetical “callback registry” or “event bus” would likely make the opposite choice.