Processing JSON files

Eiji · October 20, 2019, 9:44am

@kodepett I have worked on a somewhat similar case. I was migrating old JSON data (thousands of small files) to a PostgreSQL database. It did not need to be as fast as possible, but of course it should not run for hours either. For that case plain Flow was enough.

From what you said, you should have about 1000 files per day, which is not a large number. If you make the parsing of each file fast enough, the number of files should not be a problem for you.
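For a volume like this, even the standard library may be enough. Here is a minimal sketch that uses `Task.async_stream/3` (plain Elixir, not Flow or Broadway) to process every `*.json` file in a directory concurrently. The module name `FileBatch` and the function `process_dir/3` are made up for illustration, and `fun` stands in for whatever parsing you actually do.

```elixir
defmodule FileBatch do
  # Hypothetical helper: runs `fun` on every *.json file in `dir` concurrently
  # and returns a list of {path, result} tuples, sorted by path.
  def process_dir(dir, fun, opts \\ []) do
    max = Keyword.get(opts, :max_concurrency, System.schedulers_online())

    dir
    |> Path.join("*.json")
    |> Path.wildcard()
    |> Enum.sort()
    # Results come back in input order; a crash in `fun` also crashes the caller.
    |> Task.async_stream(fn path -> {path, fun.(path)} end,
      max_concurrency: max,
      timeout: 30_000
    )
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```

For example, `FileBatch.process_dir("incoming", &File.read!/1)` would read all files in a hypothetical `incoming` directory in parallel. If a single file needs more than 30 seconds, raise the `timeout` value.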

I’m sure about only one thing: there is no ideal solution. It depends on your logic. Simply putting JSON into a jsonb column, without even parsing it into Elixir structs, is very different from a long pipeline of parsing and processing data. If handling a single file is really quick, then maybe you do not need to add an extra dependency you are not familiar with just for this case. Typically, however, the whole process takes longer than just reading a file, and it may be worth splitting the job into a few stages. For that case I would recommend Broadway.

You may also be interested in Flow by Plataformatec. Both Broadway and Flow are built on top of GenStage. Flow is a more general abstraction than Broadway that focuses on data as a whole, providing features like aggregation, joins, windows, etc. Broadway focuses on events and on operational features, such as metrics, automatic acknowledgements, failure handling, and so on.
Source: GitHub - dashbitco/broadway: Concurrent and multi-stage data ingestion and data processing with Elixir

Yes, it is. Even phoenix_live_reload uses it, so it may be worth checking its source. The library is called file_system. Please make sure that the backend it relies on is set up on your system, as in some cases it is a must-have.
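As a rough sketch of how file_system can be used to react to files in a directory: it assumes `{:file_system, "~> 1.0"}` is in your deps, and the module `JsonWatcher` and the `on_file` callback are hypothetical names of mine. `FileSystem.start_link/1` returns `:ignore` when the backend is not available, which is why the backend note above matters.

```elixir
defmodule JsonWatcher do
  # Blocks the calling process and calls `on_file.(path)` for every event
  # reported on a *.json path under `dir`.
  # Assumes the file_system package and its OS backend are installed.
  def watch(dir, on_file) do
    # This match fails if the backend is missing (start_link returns :ignore).
    {:ok, watcher} = FileSystem.start_link(dirs: [dir])
    FileSystem.subscribe(watcher)
    loop(watcher, on_file)
  end

  defp loop(watcher, on_file) do
    receive do
      # The event atoms in the list differ between backends, so this sketch
      # filters on the file extension only.
      {:file_event, ^watcher, {path, _events}} ->
        if json_path?(path), do: on_file.(path)
        loop(watcher, on_file)

      {:file_event, ^watcher, :stop} ->
        :ok
    end
  end

  def json_path?(path), do: Path.extname(path) == ".json"
end
```

One file may produce several events (for example, created and then modified), so `on_file` should tolerate being called more than once for the same path.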

There are lots of gotchas, from basic Elixir ones to ones specific to your use case.

First of all, for well-known gotchas in Elixir you may read this forum topic:

When you perform several operations on a file (especially across a large number of files), it is worth using File.open, or sometimes File.stream!, instead of the one-shot File functions. In some cases Jaxon may also be interesting for you:

https://moboudra.com/intro-to-jaxon-json-parser-for-elixir/
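To illustrate the File.open / File.stream! point, here is a small sketch using only the standard library. The module `FileAccess` and both function names are invented for the example, and it assumes a line-oriented file, which may not match your data.

```elixir
defmodule FileAccess do
  # Counts non-empty lines lazily: File.stream!/1 reads the file line by line,
  # so the whole file is never loaded into memory at once.
  def count_lines(path) do
    path
    |> File.stream!()
    |> Stream.reject(&(String.trim(&1) == ""))
    |> Enum.count()
  end

  # Opens the file once and writes all lines through the same handle,
  # instead of opening and closing it for every single write.
  def append_lines(path, lines) do
    File.open!(path, [:append, :utf8], fn io ->
      Enum.each(lines, &IO.puts(io, &1))
    end)
  end
end
```

The function passed to `File.open!/3` receives the open device, and the file is closed automatically when that function returns.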

Finally, be aware of the common overuse of some Elixir features, such as GenServer, as they are not a good fit for every use case:

https://learn-elixir.dev/dangers-of-genservers