Oban jobs not getting executed after upgrading elixir/erlang and oban packages

madclaws · January 9, 2025, 6:56am

I recently upgraded Elixir from 1.15.2-otp-26 to 1.17.3-otp-27. I also upgraded Oban from 2.17.1 → 2.18.3 and Oban Pro from 1.2.2 → 1.4.14, among other routine package updates. Now my Oban jobs stay in either the scheduled or available state and never execute. This is not reproducible on remote environments (staging) or on other machines.

I reverted the Elixir version and it works (although it crashes after some time, since I didn't revert the libs; if I revert the libraries too, everything works fine).

I have tried removing the _build folder and doing a fresh install of the repo, but nothing works.
Postgres version → 15.8, macOS 14.5

Any help on what I am missing would be appreciated. Let me know if you need more details. cc: @sorentwo @sorenone

sorentwo · January 9, 2025, 5:28pm

We’ll need some more details, please. Which version of Postgres? What does your configuration look like? Do you have any warnings or errors logged?

madclaws · January 9, 2025, 5:34pm

postgres version is 15.8
The repo is open-source.
No warning or errors were logged.

Thanks for checking this, will get back with more details on the local job states, rn i am afk.

config can be found here.

The worker

The job is in scheduled state, with attempts 0.
@sorentwo

sorentwo · January 11, 2025, 2:18pm

The Postgres version is fine, and the config looks pretty good. You should remove Oban.Plugins.Gossip though, as it’s a no-op in recent versions and not needed for Oban Web. It should log a warning on init, actually.

Jobs stuck available would indicate that there was a problem with the queues or fetching jobs. In this case, as the jobs are scheduled, it indicates the job stager isn’t running or is failing somehow. Here are some things to check:

See if there is an issue with leadership. You can use Oban.Peer.get_leader() to see which node is the leader. That has to be an actively running node, otherwise no plugins (including the stager) will run. This will require deploying the new version.

Make sure there aren’t any staging timeouts or exceptions. Attach a telemetry handler to track plugin exceptions like this:

handler = fn _event, _measure, meta, _conf -> IO.inspect(meta) end
:telemetry.attach("oban-plugin", [:oban, :plugin, :exception], handler, [])

Attach the default logger (if you haven’t already) and see if there are any warnings about degraded connectivity.
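For reference, the first and third checks can be run from an iex session on the affected node roughly as follows. This is a minimal sketch that assumes the default instance name (`Oban`) and Oban 2.18 or later, where `Oban.Peer.get_leader/2` should be available:

```elixir
# Assumes Oban is started under the default name `Oban`.

# Check 1: which node currently holds leadership?
# Returns the leader's node name, or nil when no leader is known.
Oban.Peer.get_leader()

# Is *this* node the leader?
Oban.Peer.leader?()

# Check 3: attach the default logger so Oban telemetry events,
# including connectivity warnings, show up in the logs.
Oban.Telemetry.attach_default_logger()
```

If `get_leader/0` returns `nil` or a node that is no longer running, that points at the leadership issue described above.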

madclaws · January 13, 2025, 3:16pm

@sorentwo
I have tried all 3 points mentioned and all of them checked out fine. However, I have some new observations.

If the job is not created with scheduled_at, then it will be in the available state in the DB. Now if I run Oban.drain_queue(queue: "queue_name") in iex, those jobs get executed and moved to completed.
Also, interestingly, when I check the Stager process state it doesn’t have any plugins or queues in its state:

%Oban.Stager{
  conf: %Oban.Config{
    dispatch_cooldown: 5,
    engine: Oban.Pro.Engines.Smart,
    get_dynamic_repo: nil,
    insert_trigger: true,
    log: false,
    name: Oban,
    node: "T4Ds-MacBook-Air",
    notifier: {Oban.Notifiers.Postgres, []},
    peer: {Oban.Peers.Isolated, [leader?: false]},
    plugins: [],
    prefix: "global",
    queues: [],
    repo: Glific.Repo,
    shutdown_grace_period: 15000,
    stage_interval: 1000,
    testing: :disabled
  },
  timer: #Reference<0.418926374.546832389.242625>,
  interval: 1000,
  limit: 5000,
  mode: :local
}

But when I downgrade to the older version of the project (old Elixir/Erlang + old package versions as mentioned in the post),
the Stager process does have queues and plugins loaded in its state:

%Oban.Stager.State{
  conf: %Oban.Config{
    dispatch_cooldown: 5,
    engine: Oban.Pro.Engines.Smart,
    get_dynamic_repo: nil,
    insert_trigger: true,
    log: false,
    name: Oban,
    node: "T4Ds-MacBook-Air",
    notifier: {Oban.Notifiers.Postgres, []},
    peer: {Oban.Peers.Postgres, []},
    plugins: [
      {Oban.Plugins.Gossip, []},
      {Oban.Pro.Plugins.DynamicLifeline, []},
      {Oban.Plugins.Cron,
       [
         crontab: [
            [args: %{job: :weekly_tasks}]},
           {"0 0 * * MON", Glific.Jobs.MinuteWorker,
            [args: %{job: :weekly_report}]},
           {"* 20-23 * * *", Glific.Jobs.MinuteWorker,
            [args: %{job: :daily_low_traffic_tasks}]}
         ]
       ]},
      {Oban.Pro.Plugins.DynamicPruner, [mode: {:max_age, 300}, limit: 25000]}
    ],
    prefix: "global",
    queues: [
      default: [limit: 10],
      dialogflow: [limit: 5],
      gcs: [limit: 10],
      gupshup: [limit: 10]
    ],
    repo: Glific.Repo,
    shutdown_grace_period: 15000,
    stage_interval: 1000,
    testing: :disabled
  },
  name: {:via, Registry, {Oban.Registry, {Oban, Oban.Stager}}},
  timer: #Reference<0.3080533931.501219330.202487>,
  interval: 1000,
  limit: 5000,
  mode: :global,
  ping_at_tick: 60,
  swap_at_tick: 65,
  tick: 31
}

Any idea on this?

sorentwo · January 13, 2025, 3:48pm

Inserting without a scheduled_at timestamp will mark the job as available, that’s normal. You can drain the queue with scheduled jobs by passing the with_scheduled flag, but that function is only meant for testing. The fact that you then have to drain it at all means the queues aren’t running.
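As a sketch of the flag mentioned above, assuming a queue named `:default` (one of the queues in the posted config):

```elixir
# Testing helper only: drains the queue synchronously from the caller.
# `with_scheduled: true` also includes jobs still in the scheduled state.
Oban.drain_queue(queue: :default, with_scheduled: true)
```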

The empty plugins and queues in the Stager state are definitely the problem. It looks like there’s a console check in application.ex. That’s what’s doing it.

Either remove the console check or start with mix phx.server rather than iex -S mix phx.server.
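For anyone landing here later, a console check of this kind typically looks something like the sketch below. This is an assumption modeled on a common pattern, not the actual code from the repo. The module name `MyApp.ObanOpts` and the `console?` argument are hypothetical; the argument stands in for a check such as `Code.ensure_loaded?(IEx) and IEx.started?()`.

```elixir
defmodule MyApp.ObanOpts do
  # Hypothetical helper, not taken from the repo discussed in this thread.
  # `console?` stands in for `Code.ensure_loaded?(IEx) and IEx.started?()`.
  def build(opts, console?) do
    if console? do
      # Inside an IEx console: disable queues and plugins, so nothing
      # is staged or executed on this node.
      opts
      |> Keyword.put(:queues, false)
      |> Keyword.put(:plugins, false)
    else
      # Otherwise pass the configured options through unchanged.
      opts
    end
  end
end

# Illustrative options only; the module names here are plain atoms.
opts = [repo: MyApp.Repo, queues: [default: 10], plugins: [Oban.Plugins.Cron]]

# Console session: queues are switched off.
MyApp.ObanOpts.build(opts, true)[:queues]

# Regular server start: queues are left as configured.
MyApp.ObanOpts.build(opts, false)[:queues]
```

With a check like this, starting via `iex -S mix phx.server` disables the queues and plugins, which would be consistent with the empty `queues: []` and `plugins: []` in the Stager state posted earlier.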

madclaws · January 13, 2025, 3:56pm

Damn, that console check was the problem. It's weird that it used to work until the last Elixir version (1.15-otp-26). Will check why it was working before. Thank you very much!