XML Data Pipeline - suggestions needed

Greetings,
I'm trying to build a toy project in both Go and Elixir for fun. I'd love to hear suggestions, libraries to use, and strategies to get it done faster.

  • Ingest about 1,000 XML files containing articles (about 2 million articles in total) with slight schema differences, every day.

    • Size varies from 20 MB to 8 GB per file.
    • Schemas vary slightly: different node/element/attribute names.

    Using Saxy.Handler and File.stream!, the tasks above are already done.
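For context, here is roughly what that setup looks like — a minimal sketch, assuming an `<article>` element with flat child elements; the handler module name and field layout are made up, and the real feeds will differ per schema:

```elixir
defmodule MyApp.ArticleHandler do
  @behaviour Saxy.Handler

  # State: a stack of open element names, the article currently being built,
  # and the finished articles. Collecting into a list keeps the sketch small;
  # in the real pipeline each finished article would go to the next stage.

  @impl true
  def handle_event(:start_element, {name, _attrs}, state) do
    {:ok, %{state | stack: [name | state.stack]}}
  end

  def handle_event(:characters, chars, %{stack: [field, "article" | _]} = state) do
    # Text directly inside <article><field>…</field>. Accumulate, because
    # :characters may fire more than once for a single text node.
    {:ok, %{state | current: Map.update(state.current, field, chars, &(&1 <> chars))}}
  end

  def handle_event(:end_element, "article", state) do
    {:ok,
     %{state | articles: [state.current | state.articles], current: %{}, stack: tl(state.stack)}}
  end

  def handle_event(:end_element, _name, state) do
    {:ok, %{state | stack: tl(state.stack)}}
  end

  def handle_event(_event, _data, state), do: {:ok, state}
end

# Reading in 64 KB chunks keeps memory flat even for the 8 GB files.
{:ok, %{articles: articles}} =
  "feed.xml"
  |> File.stream!([], 65_536)
  |> Saxy.parse_stream(MyApp.ArticleHandler, %{stack: [], current: %{}, articles: []})
```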

  • For every article in an XML file:

    • do the processing
    • check whether it exists in the database, then insert or update it accordingly
    • write the changed/new articles to a daily backup XML file

    Should this be a job queue? GenStage? Flow?
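Flow (which is built on GenStage) is probably the lightest fit here, since it gives demand-driven back pressure without hand-writing stages. A minimal sketch, assuming `parsed_articles` is a stream of article maps and that `MyApp.Pipeline.process_article/1` and `MyApp.Articles.upsert_batch/1` are your own functions (hypothetical names):

```elixir
parsed_articles
|> Stream.chunk_every(500)                # the 500-article batches
|> Flow.from_enumerable(max_demand: 2)    # small demand keeps few batches in flight
|> Flow.map(fn batch ->
  batch
  |> Enum.map(&MyApp.Pipeline.process_article/1)  # your per-article processing
  |> MyApp.Articles.upsert_batch()                # one DB round trip per batch
end)
|> Flow.run()
```

For the daily backup file, Saxy can also encode: build the document with `Saxy.XML.element/3` and serialize it with `Saxy.encode!/2`.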

  • Requirements

    • Telemetry

      • counts of new/updated articles per day
      • time taken
      • error notifications
      • a dashboard
    • Back pressure

      • must not slow down or overwhelm the database
    • Optimizations

      • take articles in chunks of 500 and use a single DB query to check whether they already exist (see the sketch after this list)
      • batch the updates/inserts, etc.
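A hedged sketch of that chunked check plus batched upsert — the `MyApp.Article` schema, the `:external_id` key (with a unique index), and the replaced columns are all assumptions to adjust to your schema:

```elixir
defmodule MyApp.Articles do
  import Ecto.Query
  alias MyApp.Repo

  # Note: insert_all does not autofill timestamps, so each entry map must
  # already carry them if the table requires them.
  def upsert_batch(articles) do
    {time_us, {new, updated}} =
      :timer.tc(fn ->
        ids = Enum.map(articles, & &1.external_id)

        # One query for the whole 500-article chunk instead of 500 lookups.
        existing =
          Repo.all(from a in MyApp.Article, where: a.external_id in ^ids, select: a.external_id)

        # One statement upserts the whole chunk.
        Repo.insert_all(MyApp.Article, articles,
          on_conflict: {:replace, [:title, :body, :updated_at]},
          conflict_target: :external_id
        )

        {length(ids) - length(existing), length(existing)}
      end)

    # Per-batch telemetry: new/updated counts plus duration. Attach handlers
    # (or telemetry_metrics + Phoenix LiveDashboard) to roll these up into
    # the daily totals and dashboard listed above.
    :telemetry.execute(
      [:xml_pipeline, :batch],
      %{new: new, updated: updated, duration: time_us},
      %{}
    )

    {new, updated}
  end
end
```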

For the back pressure and chunking I use GenStage. Works like a charm 🙂 I have no computer nearby, so I can't provide an example, but you can probably find one by searching for "genstage stream chunk".
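For anyone who finds this later, here is a minimal sketch of that pattern — not the poster's code, and it assumes `parsed_articles` is an Enumerable of article maps and `MyApp.Articles.upsert_batch/1` is a batching helper like the one sketched above:

```elixir
defmodule BatchConsumer do
  use GenStage

  def start_link(producer), do: GenStage.start_link(__MODULE__, producer)

  @impl true
  def init(producer) do
    # max_demand bounds how many articles are in flight at once — that bound
    # is the back pressure that keeps the database from being overwhelmed.
    {:consumer, :ok, subscribe_to: [{producer, max_demand: 500, min_demand: 250}]}
  end

  @impl true
  def handle_events(articles, _from, state) do
    # Batches arrive up to max_demand articles at a time.
    MyApp.Articles.upsert_batch(articles)
    {:noreply, [], state}
  end
end

# Wrap any stream of parsed articles in a producer stage and subscribe:
{:ok, producer} = GenStage.from_enumerable(parsed_articles)
{:ok, _consumer} = BatchConsumer.start_link(producer)
```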
