Planning a project

I’m working on an application and I need some help to get it right.

My project is going to be a set of news-reading and news-analysis tools.

So far I have two applications:
the “scraper” app, which contains several modules for scraping (this is a stateless app);
the “db” app, which takes care of everything database related.

I’m going to need several functions to get the latest news. Some of these would need to run regularly (cronjobs?), and some would only need to run once or just a few times, like getting old articles from an archive.

The problem is: where should I put these?
For example, the scraper app contains a module to scrape a news archive.
I need to get a “site” from my “db” app, call my ArchiveScraper from the “scraper” app to get the news from that site, then save the articles in the “db” app.
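
The flow described above could be sketched as a small coordinating module that sits outside both apps. This is only an illustration: `Db`, `Scraper.ArchiveScraper`, `Tasks.ImportArchive`, and all function names below are hypothetical stand-ins, not the poster's actual code.

```elixir
defmodule Db do
  # Hypothetical stand-in for the "db" app; a real one would query a database.
  def get_site(id), do: {:ok, %{id: id, archive_url: "https://example.com/archive"}}

  # Pretends to persist the articles and reports how many were saved.
  def save_articles(_site, articles), do: {:ok, length(articles)}
end

defmodule Scraper.ArchiveScraper do
  # Hypothetical stand-in for the stateless scraper; returns canned data.
  def scrape(%{archive_url: url}) do
    {:ok, [%{title: "Old article", url: url <> "/1"}]}
  end
end

defmodule Tasks.ImportArchive do
  # Glue code: neither Db nor Scraper knows about the other.
  def run(site_id) do
    with {:ok, site} <- Db.get_site(site_id),
         {:ok, articles} <- Scraper.ArchiveScraper.scrape(site) do
      Db.save_articles(site, articles)
    end
  end
end
```

Keeping the glue in a third place means the scraper stays stateless and the db app stays unaware of scraping; a scheduler or a one-off script can then call `Tasks.ImportArchive.run/1`.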

I’m sure I’m just overthinking it. It would be great if someone could give me some advice on this or recommend some books that could help me.

Sounds like you can add a Scheduler app, perhaps using Quantum, to call into the DB and Scraper apps.
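
A rough sketch of what such a Scheduler app might look like with Quantum (assuming Quantum 3.x as a dependency). The app name `:news`, the `News.Scheduler` module, and the `News.Tasks.fetch_latest/0` function are assumptions made up for this example.

```elixir
# lib/news/scheduler.ex
defmodule News.Scheduler do
  # Quantum generates the scheduler process from this one line.
  use Quantum, otp_app: :news
end

# config/config.exs
import Config

config :news, News.Scheduler,
  jobs: [
    # Standard cron syntax: every 15 minutes, call the hypothetical
    # News.Tasks.fetch_latest/0, which would use the db and scraper apps.
    {"*/15 * * * *", {News.Tasks, :fetch_latest, []}}
  ]
```

`News.Scheduler` would also need to be listed among the children of the application's supervisor so that it starts with the app.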

Sounds good. Thanks!

Quantum looks like all you need, but for discussion’s sake, I have a similar-ish system requirement.

I have a “job runner” GenServer that checks a “jobs” table every n seconds and pulls jobs that are :pending and past their run_at time.
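
A minimal sketch of such a polling runner, not the poster's actual implementation. It assumes the table lookup and the job execution are passed in as functions (`:fetch_due` and `:run` are hypothetical option names), and that jobs are maps with `status` and `run_at` fields.

```elixir
defmodule JobRunner do
  use GenServer

  # opts (all hypothetical names):
  #   :interval  - polling interval in milliseconds (default 5_000)
  #   :fetch_due - zero-arity function returning the jobs that are due
  #   :run       - one-arity function that executes a single job
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{
      interval: Keyword.get(opts, :interval, 5_000),
      fetch_due: Keyword.fetch!(opts, :fetch_due),
      run: Keyword.fetch!(opts, :run)
    }

    schedule(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:poll, state) do
    # Run every due job, then arm the timer for the next poll.
    Enum.each(state.fetch_due.(), state.run)
    schedule(state.interval)
    {:noreply, state}
  end

  # Pure helper: pending jobs whose run_at is not in the future.
  def due(jobs, now) do
    Enum.filter(jobs, fn job ->
      job.status == :pending and DateTime.compare(job.run_at, now) != :gt
    end)
  end

  defp schedule(interval), do: Process.send_after(self(), :poll, interval)
end
```

With a real database, `:fetch_due` would be a query doing the same filtering as `due/2`. Running jobs inline like this blocks the runner while a job executes, which may or may not be acceptable.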

User actions generate jobs that go into the table for me, as they would for you, but you might also have a self-perpetuating job (“fetch new articles from site X”) that runs and then, on completion, creates a new identical job to run in an hour’s time or whatever. You could also have some separate table/file you check to see when you last did an action, but I think it’s cleaner to have everything as jobs, even though it means you have to do that initial kickstart. (Edit: you could also have the runner check for the existence of a job type and create it if it doesn’t exist. That’s probably smartest, in case something goes down while the job is running and it somehow gets lost and not recreated.)

Most of my jobs hit an external API, so I’ve limited each one to a single request. They do their work and return {:ok, _} or {:error, :type}, and the job runner is the thing with the brains to know whether it should reschedule, fail, complete, or create additional sequential jobs. This makes each job smaller and inherently easier to test.
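
One way to picture the runner's decision logic is as a pure function from a job and its result to the job's next state. This is a sketch under assumptions: the `:rate_limited` error type, the `attempts` field, the limit of 3 attempts, and the 60-second backoff are all invented for illustration.

```elixir
defmodule JobOutcome do
  # Hypothetical retry limit.
  @max_attempts 3

  # Success: the job is done.
  def next_state(job, {:ok, _result}), do: %{job | status: :completed}

  # A retryable error (hypothetical :rate_limited): push run_at back by
  # 60 seconds and try again, as long as attempts remain.
  def next_state(%{attempts: attempts} = job, {:error, :rate_limited})
      when attempts < @max_attempts do
    %{
      job
      | status: :pending,
        attempts: attempts + 1,
        run_at: DateTime.add(job.run_at, 60, :second)
    }
  end

  # Any other error, or too many attempts: give up.
  def next_state(job, {:error, _type}), do: %{job | status: :failed}
end
```

Because the jobs only return tagged tuples and the decision is a pure function, both halves can be tested without touching the external API or the database.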

Jobs are a %Job{} struct/record that has a name and arguments. I use the name to look up the correct module (e.g. Job.Types.ArticleFetcher), which contains the actual job’s run function, additional information (local/processing or remote/API job?), and argument validators.
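
A sketch of what that name-to-module lookup could look like. Only `%Job{}` and `Job.Types.ArticleFetcher` come from the post; the struct fields, the `Job.Dispatcher` module, the `"article_fetcher"` name, and the `kind/0`, `validate/1`, and `run/1` functions are assumptions for illustration.

```elixir
defmodule Job do
  # Assumed fields: a name used for lookup, plus arguments and a status.
  defstruct [:name, args: %{}, status: :pending]
end

defmodule Job.Types.ArticleFetcher do
  # Extra information about the job: it calls a remote API.
  def kind, do: :remote

  # Argument validator.
  def validate(%{site_id: id}) when is_integer(id), do: :ok
  def validate(_args), do: {:error, :invalid_args}

  # The actual work; a real job would make one API request here.
  def run(%{site_id: id}), do: {:ok, "fetched articles for site #{id}"}
end

defmodule Job.Dispatcher do
  # Maps job names to the modules implementing them.
  @types %{"article_fetcher" => Job.Types.ArticleFetcher}

  def run(%Job{name: name, args: args}) do
    with {:ok, module} <- Map.fetch(@types, name),
         :ok <- module.validate(args) do
      module.run(args)
    else
      :error -> {:error, :unknown_job}
      {:error, _reason} = error -> error
    end
  end
end
```

An explicit map keeps the set of runnable modules closed, so a job name stored in the table can never resolve to an arbitrary module.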

No doubt what I’ve made is less tested and has overlooked things covered by major libraries so it depends on how mission critical your project is. It was a fun problem to (continue to) solve though.

Thanks for the detailed answer! Helps a lot!

@sasajuric’s library Parent has Periodic functionality for repeated things as well. I still need to look into it to compare it to Quantum, but supposedly it’s better in some ways?

Periodic looks great for actions performed ‘every x seconds’, vs Quantum which uses a cron style date/time pattern.

I quite like the combination of a Scheduler + Job Queue for queuing up a bunch of work on a schedule and letting stateless workers complete the jobs.

I wouldn’t call it better, but rather simpler. It has nowhere near the number of features Quantum provides, but I believe it’s a better fit for simple periodic execution where you need to run something every x, and don’t care about a possible drift which might occur if you restart BEAM or a parent supervisor.

The benefits of periodic are:

  • Simple setup: just inject {Periodic, run: mfa_or_zero_arity_lambda, every: interval} childspec under a desired supervisor, and you’re good to go.
  • Fine-grained control of where each periodic job sits in the supervision tree. Therefore, it’s trivial to stop each periodic job independently from others.
  • No cryptic cron-like syntax or complex structs.
  • Doesn’t require any config.exs/app env settings.
  • No singleton (registered) processes.
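
For illustration, the child spec from the first bullet might be placed in an application's supervisor like this (assuming the Parent library is a dependency). The `News.Application` module, the hypothetical `News.Tasks.fetch_latest/0` function, and the 15-minute interval are made up for the example.

```elixir
defmodule News.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Calls the hypothetical News.Tasks.fetch_latest/0 every 15 minutes.
      {Periodic, run: {News.Tasks, :fetch_latest, []}, every: :timer.minutes(15)}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end
```

Since the job is an ordinary child, it can sit under whichever supervisor owns the resources it depends on and stops when that supervisor stops.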

Yeah, periodic is mostly appropriate when the interval is small-ish. It doesn’t support other kinds of patterns (e.g. execute once every Sunday 4AM-5AM), although I feel that this can still be implemented on top of periodic, but it would require extra work (like keeping track of when the last job was executed).
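
As a rough idea of that extra work, a frequently ticking periodic job could call a check like the one below and only do the real work when it returns true. This is a sketch, not part of the library: `WeeklyWindow` and `due?/2` are hypothetical names, the timestamps are assumed to already be in the relevant time zone, and storing the last run time is left to the caller.

```elixir
defmodule WeeklyWindow do
  # True when `now` falls on a Sunday between 04:00 and 05:00 and the job
  # has not already run on that same day. `last_run` is nil or a DateTime.
  def due?(%DateTime{} = now, last_run) do
    date = DateTime.to_date(now)

    # Date.day_of_week/1 returns 7 for Sunday with the default calendar.
    Date.day_of_week(date) == 7 and now.hour == 4 and not ran_on?(last_run, date)
  end

  defp ran_on?(nil, _date), do: false
  defp ran_on?(%DateTime{} = last_run, date), do: DateTime.to_date(last_run) == date
end
```

If `last_run` is kept only in memory, a restart inside the window could trigger a second run, so it would need to be persisted to make the "once" guarantee hold.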

Other rich features, such as persistence, distributed support, and queuing, are neither supported nor planned at the moment, but I’m open to discussion :) If you have some ideas, feel free to open an issue on the repo.
