Hello guys, I'm working on a report-generation application in Elixir that queries a huge database, producing around 300,000 rows per report. I have a GraphQL API that receives some parameters and generates the reports based on them. The issue is that this API can easily be queried by lots of users simultaneously, eventually running out of memory. Do you know a way to make a queue that generates the reports one after another? I don't care if a user has to wait a long time, as long as the app won't run out of memory. Thanks in advance.
Oban, GenStage, and Broadway might help for parallel/queued processing of jobs.
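For the Oban route, a minimal sketch might look like this. `MyApp.ReportWorker` and `MyApp.Reports.generate_and_upload/1` are hypothetical names standing in for your existing report-building code:

```elixir
defmodule MyApp.ReportWorker do
  # Hypothetical worker: jobs land in a dedicated :reports queue and
  # are retried up to 3 times if report generation fails.
  use Oban.Worker, queue: :reports, max_attempts: 3

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"params" => params}}) do
    # Placeholder for the existing generation + S3 upload code.
    MyApp.Reports.generate_and_upload(params)
  end
end

# In the GraphQL resolver, enqueue instead of generating inline:
# %{params: params} |> MyApp.ReportWorker.new() |> Oban.insert()
```

The resolver then returns immediately (e.g. with a job id or a "pending" status), and the worker pool controls how many reports are built at once.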
Though if you don't have enough memory to serve the load, it's probably time to scale up, either the number of hosts or the amount of memory per host, maybe dynamically?
Could you just stream the response instead of holding those 300,000 rows in memory? Instead of returning the document in the GraphQL response, you could return a unique download URL that stays alive for a few minutes, or until someone initiates the download.
There is a blog post on how to do streaming and chunking here: https://medium.com/@feymartynov/streaming-csv-report-in-phoenix-4503b065bf4a
I assume you are using SQL and Phoenix. Even without streaming from the database, if you can split your SQL query into chunks and just send the chunks from Phoenix, that should work as well.
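A hedged sketch of the streaming idea with Ecto and a chunked Phoenix response. `MyApp.Report`, `MyApp.Repo`, and `to_csv_row/1` are placeholders for your own schema, repo, and row formatting:

```elixir
def download(conn, _params) do
  conn =
    conn
    |> put_resp_content_type("text/csv")
    |> send_chunked(200)

  # Repo.stream/2 must run inside a transaction; rows are fetched
  # lazily in batches instead of loading all 300k rows at once.
  {:ok, conn} =
    MyApp.Repo.transaction(
      fn ->
        MyApp.Report
        |> MyApp.Repo.stream(max_rows: 500)
        |> Stream.map(&to_csv_row/1)
        |> Enum.reduce_while(conn, fn row, conn ->
          case chunk(conn, row) do
            {:ok, conn} -> {:cont, conn}
            # Stop streaming if the client went away.
            {:error, :closed} -> {:halt, conn}
          end
        end)
      end,
      timeout: :infinity
    )

  conn
end
```

With this shape, peak memory is roughly one batch of rows rather than the whole result set.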
One other option, if paging your database request is possible, is to use GraphQL paging. For example, the GitHub API has a mandatory `first` parameter (a number between 1 and 100), then an optional `after` parameter that can be a cursor to continue after. This way you can page the results and show partial results to the user while they are coming in.
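A sketch of what such a query could look like, in the GitHub style. The field and type names here (`reportRows`, etc.) are made up for illustration:

```graphql
# Fetch up to 100 rows, then pass pageInfo.endCursor back as $after
# to get the next page until hasNextPage is false.
query ($after: String) {
  reportRows(first: 100, after: $after) {
    edges {
      cursor
      node { id amount }
    }
    pageInfo {
      hasNextPage
      endCursor
    }
  }
}
```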
Yeah, this … but how are you going to deliver those 300k rows after the fact?
If you’re not paging through results, definitely make them page through results.
Sorry guys, I've omitted some details that might give a different impression of the situation. After querying the database, the data is used to generate xlsx files that are uploaded directly to S3. Once the upload finishes, the memory is released; the data isn't held any longer, since it is now stored in the bucket. The issue is that if many users request huge reports all at once, the system will eventually fail because the available memory will be exceeded. I don't care if users have to wait 5 minutes or 2 hours in a queue; my goal is to guarantee that every requested report will be generated and uploaded correctly, regardless of how many other reports are waiting to be generated.
Are you using ExAws? Its API seems to have chunking support. But maybe a job queue would be best in your case.
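For reference, a minimal sketch of the chunked route with ExAws.S3's multipart upload, which streams the file instead of reading it into memory. The bucket and paths are placeholders:

```elixir
# Hypothetical example: stream a generated xlsx file to S3 in parts,
# so the whole file never sits in memory at once.
"/tmp/report.xlsx"
|> ExAws.S3.Upload.stream_file()
|> ExAws.S3.upload("my-bucket", "reports/report.xlsx")
|> ExAws.request()
```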
Yup, ExAws is the one I use. Do you have a recommended dependency for job queuing?
It sounds like you don't need raw throughput in terms of job count. I'd look at Oban first, since its queue is backed by Postgres.
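And since you said waiting is fine, Oban's per-queue concurrency gives you the serialization for free. A sketch of the config, assuming an app named `:my_app`:

```elixir
# config/config.exs — concurrency of 1 on the :reports queue means
# reports are generated strictly one at a time, however many are
# enqueued; everything else just waits in Postgres.
config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [reports: 1]
```

You can later raise `reports: 1` to 2 or 3 if memory headroom allows some parallelism.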
Oban seems like a very decent option, I’ll check it out!