Oban Job does not retry, returns: {:error, %{}

houayang · August 2, 2024, 5:38pm

I am trying to test oban job retries locally, my perform job returns {:error, %{}, and I see the job is in a retryable state, but it never runs and attempts is still 1.

oban: “~> 2.15.0”
oban_pro: “~> 0.14”
oban_web: “~> 2.6”

sorentwo · August 2, 2024, 7:30pm

That’s not much info to help you debug. Is this in the development environment? Can you verify that the queue is running with Oban.check_queue(queue: "name_of_queue")?

houayang · August 2, 2024, 8:04pm

Yes this is my local test environment. The queue is running, tries to run the job once and when it errors, just stays in a retryable state.

Oban.check_queue(Manifold.Oban, queue: :delivery_destination_queue) %{ name: "Manifold.Oban", node: "M1-Max", running: [], queue: "delivery_destination_queue", started_at: ~U[2024-08-02 19:54:57.208154Z], updated_at: ~U[2024-08-02 19:55:20.222256Z], global_limit: nil, local_limit: 1, paused: false, rate_limit: nil, retry_attempts: 5, retry_backoff: 1000 }

sorentwo · August 3, 2024, 3:37pm

Is there a custom backoff on the job? When is the scheduled_at timestamp for it to retry?

houayang · August 5, 2024, 3:43pm

no custom backoff

inserted_at: 2024-08-05 15:39:11.287173
scheduled_at: 2024-08-05 15:39:28.327155
attempted_at: 2024-08-05 15:39:11.309171
completed_at: NULL
attempted_by: "{M1-Max,0726da66-d47b-47fd-bc5e-1db1ec3a7e85}"
discarded_at: NULL
priority: 3
tags: {}
meta: "{""span_context"": null}"
cancelled_at: NULL
state: retryable

sorentwo · August 5, 2024, 4:41pm

And this is still happening, even after a reset? Please share the output of Oban.config(), feel free to omit queue names or cron workers.

Side note—there are many reliability fixes and diagnostic improvements in more recent versions of Oban (not to mention Pro and Web). If possible, I highly recommend upgrading.

houayang · August 5, 2024, 6:13pm

iex(1)> Oban.config(Manifold.Oban)
%Oban.Config{
  dispatch_cooldown: 5,
  engine: Oban.Pro.Queue.SmartEngine,
  get_dynamic_repo: nil,
  log: false,
  name: Manifold.Oban,
  node: "M1-Max",
  notifier: Oban.Notifiers.Postgres,
  peer: Oban.Peers.Disabled,
  plugins: [
    {Oban.Web.Plugins.Stats, []},
    {Oban.Pro.Plugins.DynamicLifeline, [rescue_interval: 300000]},
    {Oban.Plugins.Gossip, [interval: 5000]},
    {Oban.Pro.Plugins.DynamicPruner,
     [
       limit: 100000,
       mode: {:max_age, {2, :days}},
       queue_overrides: [
         delete_expired: {:max_age, {1, :day}},
         upload_worker: {:max_age, {1, :day}},
         file_queue: {:max_age, {1, :day}},
         delivery_destination: {:max_age, {3, :day}},
       ]
     ]}
  ],
  prefix: "public",
  queues: [
    delivery_destination: [limit: 1],
    periodic_queue: [limit: 1],
    process: [limit: 1]
  ],
  repo: Engine.Repo,
  shutdown_grace_period: 15000,
  stage_interval: 1000,
  testing: :disabled
}

houayang · August 5, 2024, 7:22pm

Also as an FYI, if I add schedule_in, to the job option, that job also is never attempted.

sorentwo · August 5, 2024, 7:38pm

See if your stager is alive with Oban.Registry.whereis(Oban, Oban.Stager). If it is alive, add this telemetry to see if it reports any errors or other information.

:telemetry.attach_many(:debug, [[:oban, :plugin, :exception], [:oban, :plugin, :stop]], fn _, time, meta, _ ->
  if meta.plugin == Oban.Stager, do: IO.inspect({time, Map.delete(meta, :conf)})
end, nil)

houayang · August 5, 2024, 8:10pm

iex(2)> Oban.Registry.whereis(Manifold.Oban, Oban.Stager)
#PID<0.3316.0>
iex(3)> :telemetry.attach_many(:debug, [[:oban, :plugin, :exception], [:oban, :plugin, :stop]], fn _, time, meta, _ ->
...(3)>   if meta.plugin == Oban.Stager, do: IO.inspect({time, Map.delete(meta, :conf)})
...(3)> end, nil)

20:08:17.173 level=[info] The function passed as a handler with ID :debug is a local function.
This means that it is either an anonymous function or a capture of a function without a module specified. That may cause a performance penalty when calling that handler. For more details see the note in `telemetry:attach/4` documentation.

https://hexdocs.pm/telemetry/telemetry.html#attach/4 line=112 pid=<0.3846.0> file=/Users/hyang/cars/cars_platform/deps/telemetry/src/telemetry.erl mfa=:telemetry.attach_many/4 
:ok
{%{monotonic_time: -576460289938607883, duration: 2329583},
 %{
   telemetry_span_context: #Reference<0.3291262840.1439170561.97292>,
   plugin: Oban.Stager,
   staged_count: 0,
   staged_jobs: []
 }}
{%{monotonic_time: -576460289931805058, duration: 1137269},
 %{
   telemetry_span_context: #Reference<0.3291262840.1439170561.97305>,
   plugin: Oban.Stager,
   staged_count: 0,
   staged_jobs: []
 }}
{%{monotonic_time: -576460289904803716, duration: 2169496},
 %{
   telemetry_span_context: #Reference<0.3291262840.1439170561.97316>,
   plugin: Oban.Stager,
   staged_count: 0,
   staged_jobs: []
 }}
{%{monotonic_time: -576460288931119893, duration: 1765531},
 %{
   telemetry_span_context: #Reference<0.3291262840.1439170561.97352>,
   plugin: Oban.Stager,
   staged_count: 0,
   staged_jobs: []
 }}
{%{monotonic_time: -576460288929637285, duration: 1357398},
 %{
   telemetry_span_context: #Reference<0.3291262840.1439170562.51763>,
   plugin: Oban.Stager,
   staged_count: 0,
   staged_jobs: []
 }}
{%{monotonic_time: -576460288902098684, duration: 1741655},
 %{
   telemetry_span_context: #Reference<0.3291262840.1439170562.51768>,
   plugin: Oban.Stager,
   staged_count: 0,
   staged_jobs: []
 }}
{%{monotonic_time: -576460287925900192, duration: 3010885},
 %{
   telemetry_span_context: #Reference<0.3291262840.1439170561.97430>,
   plugin: Oban.Stager,
   staged_count: 0,
   staged_jobs: []
 }}

sorentwo · August 5, 2024, 8:32pm

It’s not staging anything, which means it can’t see any jobs within the time window. Will you output the console time (DateTime.utc_now()) and the scheduled_at timestamp from a scheduled job in the database?

You can also try running this:

Oban.Engine.stage_jobs(Oban.config(), Oban.Job, limit: 5000)

And see if it returns a count.

houayang · August 6, 2024, 3:43pm

iex(15)> DateTime.utc_now()
~U[2024-08-06 15:38:51.337192Z]
iex(16)> Oban.Engine.stage_jobs(Oban.config(Manifold.Oban), Oban.Job, limit: 5000)
{:ok, [%{id: 350, state: "scheduled", queue: "delivery_destination"}]}
iex(17)>

job is scheduled for: 2024-08-06 15:38:48.665877

The 1 stage_job we are seeing here is a different job. We send 2 emails and each one goes to a different location. 1 of those email destinations, I am using schedule_in or failing on purpose, and that is the one that never runs or retries.

After the first email job is completed, the stage_job command returns an empty list:

iex(15)> DateTime.utc_now()
~U[2024-08-06 15:38:51.337192Z]
iex(16)> Oban.Engine.stage_jobs(Oban.config(Manifold.Oban), Oban.Job, limit: 5000)
{:ok, [%{id: 350, state: "scheduled", queue: "delivery_destination"}]}
iex(17)> Oban.Engine.stage_jobs(Oban.config(Manifold.Oban), Oban.Job, limit: 5000)
{:ok, []}
iex(18)>

sorentwo · August 6, 2024, 8:48pm

What’s the attempt and max_attempts for that job?

houayang · August 7, 2024, 3:46pm

attempt is 0 and max is 3

sorentwo · August 7, 2024, 5:02pm

Nothing appears to be wrong and I don’t see any reason why a scheduled job wouldn’t be ran. Staging is clearly working, as demonstrated by the stage call.

Will you run through this set of queries and see if it matches what you expect?

import Ecto.Query

# Check how many are scheduled or retryable
Oban.Job |> where([j], j.state in ~w(scheduled retryable)) |> select(count()) |> Repo.one()

# Add the queue check
Oban.Job
|> where([j], j.state in ~w(scheduled retryable) and not is_nil(j.queue))
|> select(count())
|> Repo.one()

# Find the max time
Oban.Job
|> where([j], j.state in ~w(scheduled retryable) and not is_nil(j.queue))
|> select([j], max(j.scheduled_at)))
|> Repo.one()

# Add the time check
Oban.Job
|> where([j], j.state in ~w(scheduled retryable) and not is_nil(j.queue))
|> where([j] j.scheduled_at <= ^DateTime.utc_now())
|> select(count())
|> Repo.one()

houayang · August 7, 2024, 5:15pm

iex(35)> import Ecto.Query
Ecto.Query
iex(36)> alias Engine.Repo
Engine.Repo
iex(37)> Oban.Job |> where([j], j.state in ~w(scheduled retryable)) |> select(count()) |> Repo.one()
34
iex(38)> Oban.Job |> where([j], j.state in ~w(scheduled retryable) and not is_nil(j.queue)) |> select(count()) |> Repo.one()
34
iex(39)> Oban.Job |> where([j], j.state in ~w(scheduled retryable) and not is_nil(j.queue)) |> select([j], max(j.scheduled_at)) |> Repo.one()
~U[2024-08-07 17:04:37.664335Z]
iex(40)> Oban.Job |> where([j], j.state in ~w(scheduled retryable) and not is_nil(j.queue)) |> where([j], j.scheduled_at <= ^DateTime.utc_now()) |> select(count()) |> Repo.one()
34
iex(41)>

sorentwo · August 7, 2024, 6:09pm

So there are 34 scheduled jobs that are ready to go? If you run stage_jobs as you did earlier, does it stage any of them?

That’s approximately the same query that stage_jobs uses, so I wouldn’t expect any difference.

houayang · August 8, 2024, 4:27pm

looks like it

iex(62)> Oban.Engine.stage_jobs(Oban.config(Manifold.Oban), Oban.Job, limit: 5000)
{:ok,
 [
   %{id: 1, state: "scheduled", queue: "recalc"},
   %{id: 2, state: "scheduled", queue: "recalc"},
   %{id: 3, state: "scheduled", queue: "recalc"},
   %{id: 4, state: "scheduled", queue: "recalc"},
   %{id: 5, state: "scheduled", queue: "recalc"},
   %{id: 6, state: "scheduled", queue: "recalc"},
   %{id: 7, state: "scheduled", queue: "recalc"},
   %{id: 13, state: "scheduled", queue: "recalc"},
   %{id: 14, state: "scheduled", queue: "recalc"},
   %{id: 16, state: "scheduled", queue: "recalc"},
   %{id: 17, state: "scheduled", queue: "recalc"},
   %{id: 18, state: "scheduled", queue: "recalc"},
   %{id: 19, state: "scheduled", queue: "recalc"},
   %{id: 20, state: "scheduled", queue: "recalc"},
   %{id: 25, state: "scheduled", queue: "recalc"},
   %{id: 26, state: "scheduled", queue: "recalc"},
   %{id: 27, state: "scheduled", queue: "recalc"},
   %{id: 28, state: "scheduled", queue: "recalc"},
   %{id: 29, state: "scheduled", queue: "recalc"},
   %{id: 30, state: "scheduled", queue: "recalc"},
   %{id: 31, state: "scheduled", queue: "recalc"},
   %{id: 32, state: "scheduled", queue: "recalc"},
   %{id: 33, state: "scheduled", queue: "recalc"},
   %{id: 34, state: "scheduled", queue: "recalc"},
   %{id: 35, state: "scheduled", queue: "recalc"},
   %{id: 36, state: "scheduled", queue: "recalc"},
   %{id: 37, state: "scheduled", queue: "recalc"},
   %{id: 38, state: "scheduled", queue: "recalc"},
   %{id: 39, state: "scheduled", queue: "recalc"},
   %{id: 40, state: "scheduled", queue: "recalc"},
   %{id: 42, state: "scheduled", queue: "process_status"},
   %{id: 43, state: "retryable", queue: "delivery_destination"},
   %{id: 47, state: "scheduled", queue: "process_status"},
   %{id: 52, state: "scheduled", queue: "process_status"}
 ]}
iex(63)>

doomspork · November 18, 2024, 9:30pm

@houayang did you find a resolution to this issue? I’ve found myself dealing with a similar issue. Following this thread and trying things as prescribed gets me similar results.

If I use Oban.Engine.stage_jobs(Oban.config(), Oban.Job, limit: 5000) directly the jobs are staged but otherwise they hang out.

sorentwo · November 18, 2024, 9:43pm

Which version of Oban and what notifier are you using? Newer versions will emit a warning in the logger when nodes aren’t connected, and should fall back to a local polling mechanism.