Appreciate advice about scheduling jobs every n minutes.. batching or continuous?

BitGonzo · February 27, 2018, 10:39am

Let’s say I currently have an application (Phoenix), a Postgres instance, a Redis instance, and a worker. The flow generally looks like this:

Phoenix (action) → create row in db | create message in redis with row id ← worker listens to redis and pulls down row, runs job → updates db with results.

That first job is scheduled by a user action. However, future jobs based on that user action will be scheduled every n minutes.

The obvious initial solution would be to query the db every n minutes and schedule all jobs to run. But I was wondering if a better solution might be to have some kind of continuous scheduling which spreads out the job scheduling and db load. In essence, scheduling 10,000 jobs in 1 second at 1 minute intervals, or scheduling ~150 jobs every second. The biggest problem with the latter is it involves more complexity and would be making db requests every second, instead of every n minutes. Having said that, the former would result in an influx of 10,000 reads every minute, closely followed by 10,000 writes.

Have taken to StackExchange instead! Elixir is not the right forum haha

OvermindDL1 · February 27, 2018, 5:11pm

I’d probably just use Quantum and turn on/off the job as necessary (if even that)? But I try to be lazy when possible. ^.^;

BitGonzo · February 27, 2018, 8:47pm

Cheers. I did notice you mention that in a related thread from a while back. I was planning to just setup a separate scheduler app and use a simple interval timer - quantum seems like a perfect fit.

BitGonzo · February 27, 2018, 9:28pm

I do have a bonus question actually…

In my SE question it was wrongly assumed I was attempting to perform 1000 queries where one query would suffice, and was understandably pounced on in typical SE fashion.

But the answer did bring about another concern that I’m sure you’ve dealt with before - db locking.

Reading aside, as I can read/write from postgres simultaneously without locking (though there are caveats), would you consider it worthwhile to engineer my app in such a way that I can build up a bulk result that is then fed to my worker jobs?

For example, let’s say I have 1000 jobs… all 1000 jobs need to fetch a particular row in my db table. The job only receives the id. I’m not that concerned about reads, but I am concerned about the 1000 writes that would happen upon completion of those jobs.

Would you consider it worthwhile to insert these jobs into an intermediary store that then builds up a bulk insert query instead of potentially performing 1000 writes? It seems like a better solution, but then also stinks of premature optimisation.

OvermindDL1 · February 27, 2018, 9:42pm

Always yes. Bulk calls are 99% of the time more efficient in DB’s then lots of smaller calls.

Personally this is one thing I ‘prematurely’ optimize, but I also design in such a way to follow this pattern anyway.

BitGonzo · February 27, 2018, 9:46pm

I mustn’t be phrasing the question properly because everybody I ask assumes I’m going to perform multiple queries where I can simply perform one

For example, how would you reduce 1000 write queries spread across a 1000 different jobs into a single insert query – these jobs are being performed on different machines. Surely you would have to use an intermediary store… which would have the same write locking implications? Would you push that data into redis, and then build up a query from this ‘primed’ data? Or, is there something I can use in Elixir/Erlang/OTP to pass/stream this data to a central place that will routinely flush it into a bulk query? – which would present all kinds of durability issues and potential data loss.

Not seeking exact instructions, just a general idea of how you’d approach.

OvermindDL1 · February 27, 2018, 9:51pm

Ah yep, you’d need an intermediary store, so that would of course be better.

For note, PostgreSQL is an immutable DB store internally, it’s very fast at inserting records that don’t have a lot of indexes. Usually just writing a record as it’s available is fine, even 1000 is not that much spread around at all.

I’d not do this, you want to get it to the store as fast as possible, if there is a power issue or the server goes down or the hardware catches fire, you want it stored, not sitting in a ‘ready to flush’ cache or so.

BitGonzo · February 27, 2018, 9:56pm

Thanks!

The table I’m inserting into only has 1 index, and I’ve purposely not normalised the data for performance reasons. However, it will be building up around 72,000 rows/day, so I’m going to have fun archiving/sharding/pruning etc,

OvermindDL1 · February 27, 2018, 10:06pm

Oh only 72k rows a day? Yeah that’s nothing, don’t worry about it. I was having to push in a few 100k an hour a bit back and I was doing them one at a time (as they came in)… ^.^;

BitGonzo · February 27, 2018, 10:08pm

An hour! haha I know Postgres can handle 100 million+ in a table without breaking a sweat, it’s just a lot of data buildup that I’m not used to managing myself Good to know it’s less of a problem than I assumed. Still, I’m sure I’ll run into some interesting problems with my resource limitations.