OTP design for job scheduler/runner/monitor

jlevy · March 9, 2017, 11:09pm

Hi All,

Since learning Elixir, I’ve found it to be a great tool for writing integrations between applications. Often these end up being a scheduled job that runs hourly or nightly, reading from some database, dumping data into some REST API, etc. After building a few of these, I thought it would be really nice to have a framework with a little web front end to monitor the execution of all of these otherwise unrelated jobs. Through this interface, jobs could be scheduled or run manually, logs and errors could be reviewed, etc.

After thinking through how I would build such a thing, I hit the limits of my OTP knowledge and experience. I came up with a couple of possibilities for the structure:

1. Create a release for the scheduler itself and then a separate release for each individual job. This doesn’t seem ideal because each job would run in its own node, which seems like a lot of memory overhead (right?). Communication about the job schedule and errors/results would probably happen through a DB or the file system. The advantage is that there is nice isolation between the different jobs. One can be updated without any risk to other scheduled or running jobs.
2. Create a single release containing the scheduler and the jobs, with each job as its own OTP Application. Running the scheduler and jobs in the same node brings a bunch of advantages when it comes to communication between the two. There could be a task Behaviour with a standard interface, and executing a task could be as simple as a function call in a new process. Hot code replacement should make it pretty safe to update one job without affecting other running jobs. The big issue I see here is dependency management. I don’t think it will be possible to have two different jobs that need two different versions of a dependency. For example, a job interfacing with Postgres leverages Ecto 2 while a job talking to MS SQL needs Ecto 1.
3. Some other possibility I’m not thinking of?
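
To make the single-release idea a bit more concrete, here is a minimal sketch of what a job behaviour and "a function call in a new process" might look like. All module and function names (`JobRunner`, `JobRunner.Job`, `JobRunner.ExampleJob`, `run_job/3`) are hypothetical, and the use of `spawn_monitor` is just one assumption about how to isolate a job; a real system would more likely run jobs under a supervisor.

```elixir
defmodule JobRunner.Job do
  @moduledoc "Hypothetical behaviour that every job module would implement."

  @doc "Runs the job once and reports the outcome."
  @callback run(args :: keyword()) :: {:ok, term()} | {:error, term()}
end

defmodule JobRunner.ExampleJob do
  @moduledoc "Placeholder job used only for illustration."
  @behaviour JobRunner.Job

  @impl true
  def run(args) do
    # A real job would read from a database or call a REST API here.
    {:ok, Keyword.get(args, :rows, 0)}
  end
end

defmodule JobRunner do
  @doc """
  Runs `job_module.run(args)` in a separate, monitored (not linked) process,
  so a crash inside the job does not take down the caller.
  """
  def run_job(job_module, args \\ [], timeout \\ 5_000) do
    # The job's return value is handed back as the process exit reason.
    {pid, ref} = spawn_monitor(fn -> exit({:job_result, job_module.run(args)}) end)

    receive do
      {:DOWN, ^ref, :process, ^pid, {:job_result, result}} -> result
      {:DOWN, ^ref, :process, ^pid, reason} -> {:error, {:crashed, reason}}
    after
      timeout ->
        # Give up on the job and discard any late :DOWN message.
        Process.exit(pid, :kill)
        Process.demonitor(ref, [:flush])
        {:error, :timeout}
    end
  end
end
```

With something like this, the scheduler would call `JobRunner.run_job(JobRunner.ExampleJob, rows: 3)` and get back the job's result, `{:error, {:crashed, reason}}`, or `{:error, :timeout}`. It does nothing about the dependency-version problem, since all jobs would still be compiled into the same release.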

If you were going to build something like this, how would you do it?

Thanks!
Jeff

minhajuddin · March 11, 2017, 9:18pm

If all I needed was to schedule a job to run at a specific time, I would just use crontab. However, if we want to add interactivity, with the ability to run jobs through a web interface, I would just make them into different apps and spin them up as nodes. So, you would have one app which acts as a controller/conductor, and it triggers work in your worker/specialised nodes. This gives you the most flexibility. I don’t think memory should be an issue.

jlevy · March 13, 2017, 10:59pm

Thanks for the response. Not only do I want to be able to run jobs through a web interface, I want the jobs to report status and log data back to the central controller.

So if each job runs in its own node, I assume you are suggesting using clustering to link the nodes together? I thought there was a limit of 50 or so nodes that could be connected this way due to the full mesh topology. I guess an alternative would be to keep the nodes isolated and use some other message passing mechanism (ZeroMQ, RabbitMQ, HTTP polling). That seems like giving up a lot of what makes OTP so awesome though.

StefanHoutzager · March 14, 2017, 4:35am

For the cron jobs you could use https://github.com/c-rack/quantum-elixir. It could also trigger a broadcast over a Phoenix channel, say every minute, to send info to clients. But you can also broadcast whenever a job's status changes. Maybe you can use agents to hold the state of your jobs, and a registry to at least manage your agents. This helped me a lot: http://elixir-lang.org/getting-started/mix-otp/supervisor-and-application.html
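
A rough sketch of the agents-plus-registry idea, using only the standard library's `Agent` and `Registry`. The `JobStatus` module, its function names, and the shape of the state map are all assumptions made for illustration, not anything Quantum or Phoenix provides.

```elixir
defmodule JobStatus do
  @moduledoc "Hypothetical per-job status store: one Agent per job, named via a Registry."

  # Name of the registry process (just an atom, not a module that must exist).
  @registry JobStatus.Registry

  # Starts the registry that maps job names to Agent pids.
  def start_registry do
    Registry.start_link(keys: :unique, name: @registry)
  end

  # Starts an Agent holding the state for one job.
  def start_job(name) do
    Agent.start_link(fn -> %{status: :idle, log: []} end, name: via(name))
  end

  # Records a new status and appends a log line.
  def update(name, status, message) do
    Agent.update(via(name), fn state ->
      %{state | status: status, log: state.log ++ [message]}
    end)
  end

  # Returns the full state map for one job.
  def get(name), do: Agent.get(via(name), & &1)

  # Builds the :via tuple used to look the Agent up by job name.
  defp via(name), do: {:via, Registry, {@registry, name}}
end
```

After `JobStatus.start_registry()` and `JobStatus.start_job("nightly_export")`, a job could call `JobStatus.update("nightly_export", :running, "started")`, and the web front end could read the state with `JobStatus.get("nightly_export")`. In a real application the registry and agents would be started under a supervisor rather than by hand, and `update/3` would be a natural place to trigger a channel broadcast.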