I’m working on an Elixir application that processes JSON files (containing transactions) from an upstream server. The average file size is 10 MB. The upstream server generates the files and pushes them to my server. I need to parse each JSON file and generate alerts where applicable, based on a threshold configuration. The upstream server will generate a lot of files, and I’m expecting to process about 10 GB in total every day. I don’t have control over the rate at which the files are generated and pushed by the upstream server.
What would be the most efficient solution for the above? (I’ve heard of Flow, Broadway and GenStage. Which one would be ideal?)
Is there any filesystem monitoring API that can watch the filesystem and notify my application once a new file is pushed by the upstream server?
What are some of the gotchas I should be aware of, especially if I want to persist the file content to a database (PostgreSQL)?
@kodepett I have worked on a somewhat similar case. I was migrating old JSON big data (thousands of small files) to a PostgreSQL database. I did not need it to be the fastest ever, but of course it should not run for hours. For that case plain Flow was enough.
From what you said, you should have about 1000 files per day, which is not that big a number. If you make parsing each file fast enough, the number of files should not be a problem for you.
I’m sure about only one thing: there is nothing ideal. It depends on your logic. Simply putting the JSON into a jsonb column without even parsing it into Elixir structs is very different from a long pipeline of parsing and processing data. If handling a single file is really quick, then maybe you do not even need to add an extra dependency you are not familiar with just for this case. However, the whole process typically takes longer than just reading the file, and it may be worth splitting the job into a few stages. For that case I would recommend Broadway.
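As a quick sanity check of that estimate, here is a small sketch of the arithmetic using only the figures from the question (about 10 GB per day, about 10 MB per file). The variable names are made up for illustration, and the result is only an average: the upstream server may still push files in bursts.

```elixir
# Back-of-the-envelope load estimate from the figures in the question.
daily_bytes = 10 * 1_000_000_000   # ~10 GB per day
avg_file_bytes = 10 * 1_000_000    # ~10 MB per file

# Integer division gives the expected number of files per day.
files_per_day = div(daily_bytes, avg_file_bytes)

# 86_400 seconds in a day, spread evenly over the files.
seconds_between_files = 86_400 / files_per_day

IO.puts("files per day: #{files_per_day}")
IO.puts("average gap between files: #{seconds_between_files} s")
```

So on average that is roughly one file every minute and a half, which leaves a lot of room per file as long as bursts are handled sensibly.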
You may also be interested in Flow by Plataformatec. Both Broadway and Flow are built on top of GenStage. Flow is a more general abstraction than Broadway that focuses on data as a whole, providing features like aggregation, joins, windows, etc. Broadway focuses on events and on operational features, such as metrics, automatic acknowledgements, failure handling, and so on.
Yes, there is. Even phoenix_live_reload uses it, so it may be worth checking its source. The library is called file_system. Please make sure the backend for your operating system is set up, as in some cases it has to be installed separately before the library works.
There are lots of gotchas. They range from basic Elixir ones to ones specific to your use case.
First of all, for well-known gotchas in Elixir you may read this forum topic:
When you perform several operations on a file (especially with a large number of files), it is worth considering File.open, or sometimes File.stream!, instead. In some cases Jaxon may also be interesting for you:
Finally, you should be aware of the typical overuse of some Elixir features like GenServer, as they are not good for every use case:
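To give an idea of how file_system is typically wired up, here is a minimal sketch. It assumes the file_system package is in your deps, and it uses `FileSystem.start_link/1` and `FileSystem.subscribe/1` as shown in that library's README. The module name `IncomingWatcher`, the `:dir` and `:handler` options and the ".json" filter are hypothetical choices for this example. The event atoms you get differ between backends, and an event can arrive before the upstream server has finished writing the file, so the handler should tolerate incomplete files.

```elixir
defmodule IncomingWatcher do
  @moduledoc """
  Hypothetical watcher: subscribes to file_system events for one directory
  and calls a handler function with the path of every .json file reported.
  """
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    dir = Keyword.fetch!(opts, :dir)
    handler = Keyword.fetch!(opts, :handler)

    # Start the OS-specific watcher and ask it to send events to this process.
    {:ok, watcher} = FileSystem.start_link(dirs: [dir])
    FileSystem.subscribe(watcher)

    {:ok, %{watcher: watcher, handler: handler}}
  end

  @impl true
  def handle_info({:file_event, watcher, {path, _events}}, %{watcher: watcher} = state) do
    # The events list is backend dependent, so this sketch only filters by extension.
    if Path.extname(path) == ".json" do
      state.handler.(path)
    end

    {:noreply, state}
  end

  def handle_info({:file_event, watcher, :stop}, %{watcher: watcher} = state) do
    # The underlying watcher stopped; stop too so a supervisor can restart us.
    {:stop, :watcher_stopped, state}
  end
end
```

Usage could look like `IncomingWatcher.start_link(dir: "/path/to/incoming", handler: &MyApp.enqueue/1)`, where both the directory and `MyApp.enqueue/1` are placeholders. Keeping the handler tiny (for example, just queueing the path) keeps the watcher process responsive.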
@Eiji, thanks for the comprehensive response. I’ve gone through the gotchas, and I did encounter most of them when I started my Elixir journey. I will take a look at Jaxon because of its support for streaming and partial parsing. I’m not sure how to use File.stream! when parsing a JSON file, since I need the entire content to build a JSON object. I was thinking of using File.read, as this reads the entire file into memory - any suggestion on that is most welcome. I will spend time looking at GenStage, Flow and Broadway to see which fits best (there is nothing ideal). Thanks.
Jaxon supports file streams. See its README:
I’m not sure that loading the whole file into memory is a good idea. You need to ask yourself what would happen if you got all the files at once and loaded all of them into memory. Remember that data processing requires some memory too. Of course, if there is no processing and you just want to parse, cast and save the data to the database, then there is not much to do, but if you are going to save it as multiple database records, then consider the streaming option.
I agree with you, loading the entire file into memory is not a good idea, especially since I don’t control the rate at which the upstream server generates and sends the files. I will peruse the Jaxon docs. I will let you know how the implementation goes and come back with any questions. Thanks for your guidance.
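One way to picture the "all files at once" concern: if you do read whole files, you can at least cap how many are in memory at the same time. The sketch below uses only the standard library (`Task.async_stream/3`). It is not Flow or Broadway, just a simpler stand-in for the same back-pressure idea. The module name `BoundedFiles`, the default of 4 and the 60 second timeout are arbitrary assumptions for illustration.

```elixir
defmodule BoundedFiles do
  @moduledoc """
  Hypothetical helper: reads and processes files with at most
  `max_concurrency` of them in flight, so peak memory is roughly
  max_concurrency * file size plus whatever `fun` allocates.
  """

  def process_all(paths, fun, max_concurrency \\ 4) when is_function(fun, 1) do
    paths
    |> Task.async_stream(
      # Each task reads one whole file and hands the binary to `fun`.
      fn path -> fun.(File.read!(path)) end,
      max_concurrency: max_concurrency,
      timeout: 60_000
    )
    # Results come back in input order as {:ok, result} tuples.
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```

With 10 MB files and a limit of 4, roughly 40 MB of raw file data is held at once, no matter how many files are waiting on disk. A task that crashes or times out takes the caller down with it in this sketch, so real code would need its own error handling.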
If the average file size is 10 MB, I think you’d probably be okay reading the whole file into memory and parsing it all at once. But you should definitely do some benchmarking of your own to see what works well for your use case.
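For a first rough measurement, something like the following sketch may be enough. It times "read the whole file, then decode" with `:timer.tc/1` from the standard library. The decoder is passed in as a function so the snippet does not assume a particular JSON library, and the module name `ReadBench` is made up for this example. A single timing like this is noisy, so treat it as a starting point rather than a real benchmark.

```elixir
defmodule ReadBench do
  @moduledoc """
  Hypothetical helper: times reading a whole file and decoding it
  with the given one-argument decoder function.
  """

  def measure(path, decode) when is_function(decode, 1) do
    # :timer.tc/1 returns {elapsed_microseconds, result_of_fun}.
    {micros, decoded} =
      :timer.tc(fn ->
        path |> File.read!() |> decode.()
      end)

    %{bytes: File.stat!(path).size, millis: micros / 1000, decoded: decoded}
  end
end
```

Assuming Jason is in your deps, a call could look like `ReadBench.measure("sample.json", &Jason.decode!/1)`, where "sample.json" is a placeholder for one of your real files. Running it on a few representative files, and again under the concurrency you expect, should show whether reading everything at once is fast enough for you.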