Stuck on design decisions

I am motivated to start a new project using Elixir. It will use Phoenix as the web front end, and I am trying to make some decisions about the application architecture.

Application Overview

The application allows end users to log in and upload videos, documents, and photographs to their accounts. When a media item (video, photo, or document) is uploaded, it will be evaluated by a process that pulls metadata, determines size, creates previews, etc.

Design Considerations

Each account belongs to an organization. When the application is complete and deployed, there will probably be only about a hundred organizations and a thousand or so users. I don’t foresee a web front-end cluster having much trouble handling that many users. Uploads will go directly to the cloud provider (S3 or Google Cloud Storage).

Where I am having trouble is deciding how to deal with background jobs. I think I am still thinking in Ruby terms. I first designed a push-to-queue system with workers, then scrapped that. Then I thought of just having a node sitting on a good-sized VM, waiting to spawn processes for the jobs I need. But if 300 people upload a video at about the same time, processing that many videos at once will cause a problem. So I am back to thinking about building a queueing system.

Anyway, I am not looking for THE ANSWER, just some guidance if anyone has run into the same problem. I’m open to any idea. I want to take this opportunity to further my experience with Elixir as well.


If you need durable retries, then a job queue backed by Redis or a DB is the way to go.

In the same way that a Phoenix umbrella separates the web interface from the business logic, I would do the same for the job interface and any cron schedules.

You can then create separate Distillery releases for web, job, and clock, including only the apps and dependencies required for each.
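As a rough sketch of that layout, using Distillery-era rel/config.exs syntax (the umbrella app names and versions here are purely hypothetical):

```elixir
# rel/config.exs — one release per role; :my_app_* names are assumptions
use Mix.Releases.Config, default_environment: :prod

release :web do
  set version: "0.1.0"
  set applications: [:my_app, :my_app_web]
end

release :jobs do
  set version: "0.1.0"
  set applications: [:my_app, :my_app_jobs]
end

release :clock do
  set version: "0.1.0"
  set applications: [:my_app, :my_app_clock]
end
```

Each release then boots only the applications its role needs, so a worker VM never starts the Phoenix endpoint.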

That was my original direction; I don’t know why I abandoned it :angry: I think I felt I was making the issue too complex. My design was to have the jobs application start up and pull a list of pending jobs by determining which media items have not been processed, then distribute those jobs to the list of nodes configured to receive them.

Since all the media files are on S3 already you could also do the heavy lifting in a lambda function that would then notify your web application about the video size and properties. You can even resize the video using ffmpeg or similar. Although you’d have to work with a non-Elixir language for the lambda function.

Although doing a bunch of work in worker processes is totally valid as well. Just throwing another option on the table.

@mbuhot, @axelson thank you for your feedback. I think I’m going to go all-out and just make my own queuing system again to deal with background processing. I will just run all nodes on a big box and distribute nodes out to their own VMs where appropriate.


I can vouch for this solution. It really helps to set guarantees about the system when the heavy work is done somewhere else.

For a Redis-backed queue, Exq might be a solid option.


I think I found it: Honeydew, a distributed job queue and worker pool @koudelka created. No need for Redis, as it uses Mnesia. No need to leave the Elixir/Erlang ecosystem – so far. Thanks for all the input!


When I read your question, this is what went through my head:

First I thought about just spawning a background task as a GenServer that does the work for you.
Then I realized that wouldn’t be sufficient, because you need to retry things if they fail.
Then I thought about ways to persistently store the list of retries. You could decide to only store them in a ‘task queue’ GenServer, but then the queue would be gone whenever the application shuts down. You could, however, write the queue to a file every time a task is added or finishes executing.

The file is only loaded during the startup of that GenServer, and writing to it can happen in parallel to actually enqueuing a given task, keeping things fast.

Until you get an enormous number of tasks, this solution would be fast enough and very simple to maintain. An added plus is that a function call can be stored persistently using :erlang.term_to_binary/1 directly, so storing the background tasks is extremely simple.
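The file-backed queue described above could be sketched roughly like this (the module name, file handling, and the synchronous persist/2 call are my own assumptions; the post suggests writing the file in parallel with enqueuing, which this sketch simplifies):

```elixir
defmodule TaskQueue do
  @moduledoc "Sketch of a GenServer task queue persisted to a single file."
  use GenServer

  # Client API
  def start_link(path), do: GenServer.start_link(__MODULE__, path, name: __MODULE__)
  def enqueue(task), do: GenServer.call(__MODULE__, {:enqueue, task})
  def dequeue, do: GenServer.call(__MODULE__, :dequeue)

  # Server callbacks
  @impl true
  def init(path) do
    queue =
      case File.read(path) do
        # Restore the queue after a restart
        {:ok, bin} -> :erlang.binary_to_term(bin)
        # First boot: start with an empty queue
        {:error, _} -> :queue.new()
      end

    {:ok, %{path: path, queue: queue}}
  end

  @impl true
  def handle_call({:enqueue, task}, _from, state) do
    queue = :queue.in(task, state.queue)
    persist(state.path, queue)
    {:reply, :ok, %{state | queue: queue}}
  end

  def handle_call(:dequeue, _from, state) do
    case :queue.out(state.queue) do
      {{:value, task}, queue} ->
        persist(state.path, queue)
        {:reply, {:ok, task}, %{state | queue: queue}}

      {:empty, queue} ->
        {:reply, :empty, %{state | queue: queue}}
    end
  end

  # term_to_binary lets us store any term, including {module, function, args} tuples
  defp persist(path, queue), do: File.write!(path, :erlang.term_to_binary(queue))
end
```

Tasks stored as {module, function, args} tuples can then be run with apply/3 by whatever worker dequeues them, and the queue survives a restart of the GenServer.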

Now, if you find out that it is too slow in practice, you can make the solution more efficient (with the tradeoff of added complexity) by creating a DETS table. DETS will serve you well unless you create a queue of background tasks larger than 2 GB (which is extremely unlikely, because it would mean your tasks probably run too slowly and you have hundreds of thousands of enqueued tasks). If that ever became too slow, then (and only then) would I move to the added complexity of Mnesia.
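The DETS step looks roughly like this (the table name and file location are my own assumptions, and the keyed layout is just one way to organize the tasks):

```elixir
# Sketch: a task table backed by DETS instead of a hand-rolled file.
path = Path.join(System.tmp_dir!(), "task_queue_demo.dets")
File.rm(path)

# DETS expects the file name as a charlist
{:ok, table} = :dets.open_file(:task_queue, file: String.to_charlist(path), type: :set)

# Keys address individual tasks; values are plain Elixir terms
:ok = :dets.insert(table, {1, {IO, :puts, ["extract video metadata"]}})

# Look up and remove a task once it has been executed
[{1, task}] = :dets.lookup(table, 1)
:ok = :dets.delete(table, 1)

# Close flushes the table to disk
:ok = :dets.close(table)
```

Unlike the plain-file approach, DETS gives you keyed access and incremental writes, so you no longer rewrite the whole queue on every change.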

In this scenario, I see no reason to include an external store (be it redis, a database or a message queue like RabbitMQ) because we are only interested in executing Elixir code in the background of an Elixir application. Not adding more moving parts to your server deployment will really help you to keep your system maintainable.


Depending on the hosting environment, you may want multiple worker nodes for availability, e.g. running in separate AWS availability zones.

Is there a way to keep the simplicity of the file/dets backed queue setup with multiple consumers?

Yes, there is, as long as your multiple consumers are running in a connected BEAM environment. As soon as that is no longer possible, you’ll need some other way for them to communicate, and I guess that’s when you’d look at any of the outside-the-BEAM solutions mentioned earlier in the thread. But even in that case, you could keep the main queue in your main app and have the workers connect to it through a bidirectional socket.
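The connected-nodes case can be sketched with a registered process addressed by a {name, node} tuple (the :media_queue name and the job shape are my own assumptions; on a real worker node you would first Node.connect(:"queue@host"), whereas this sketch addresses its own node so it runs standalone):

```elixir
# The queue lives in a registered process on one node...
{:ok, _} = Agent.start_link(fn -> :queue.new() end, name: :media_queue)

# ...which any connected node can reach by {name, node}.
# Here queue_node is ourselves; on a worker it would be :"queue@host".
queue_node = node()

# Enqueue a job from "anywhere" in the cluster
:ok = Agent.update({:media_queue, queue_node}, &:queue.in({:process_video, 42}, &1))

# A worker pulls a job the same way
{:value, job} =
  Agent.get_and_update({:media_queue, queue_node}, fn q ->
    {item, rest} = :queue.out(q)
    {item, rest}
  end)
```

Because GenServer-based processes accept {name, node} references, the queue API is identical whether the worker runs on the same VM or another machine in the cluster.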

There is no conceptual difference between Redis and a BEAM node running (D)ETS: both are in-memory datastores that can be (and are) used as databases, caching layers, or message queues. Both optionally persist to disk, replicate across multiple systems, etc. Redis is built in ANSI C, which might make it somewhat faster, if not for the fact that data needs to be serialized and deserialized every time it moves between your application(s) and the datastore, which is not the case when you use an in-memory datastore native to your language environment of choice.

(And to be clear: All of this is only required if you have a very high amount of users across the globe. Using simpler techniques that keep your whole system part of the same BEAM environment is a lot easier to maintain.)

It may also depend on your availability requirements. To avoid a single point of failure the queue will need to be replicated, at which point it may be simpler to just use a message broker or hosted service.

Put them in any pull-based MQ, use bare-metal workers if you need speed, put the workers on fixed-size nodes, and set max processes or use load average to limit concurrency. To scale, you just need to add more nodes.
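One simple way to cap concurrency on a worker node is Task.async_stream with max_concurrency; here process_upload is a hypothetical stand-in for the real transcoding/metadata work:

```elixir
# Stand-in for shelling out to ffmpeg, probing metadata, writing previews, etc.
process_upload = fn id ->
  {:done, id}
end

# 300 simultaneous uploads, as in the original scenario
upload_ids = Enum.to_list(1..300)

# At most 8 uploads are processed at once, no matter how many arrive together
results =
  upload_ids
  |> Task.async_stream(process_upload, max_concurrency: 8, timeout: :infinity)
  |> Enum.map(fn {:ok, result} -> result end)
```

The stream applies backpressure for you: new tasks start only as running ones finish, so a burst of uploads can never spawn 300 ffmpeg processes at once.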

I like the idea of auto-scaling workers with load, I may have to steal it for Honeydew. :slight_smile:
