Is GenStage the right tool for the job? Importing data from multiple sources

jdj_dk · June 6, 2018, 1:03pm

I’ve been working on Bitboard for quite some time. It’s a great side project, and I’ve learned quite a bit doing it.
Bitboard lets you manage Bitbucket issues using boards and cards. One of the first things I developed was the importer, which now needs to be cleaned up. What it does is simple:

Fetch all Bitbucket issues from a repository
Fetch comments for each issue
Fetch information about the members of the repository (so it is possible to assign and mention other members)

What I would like to add now is importing all pull requests, so that these are linked to the relevant issues. I could shoehorn this feature into the existing codebase, but I would much rather figure out how to write this properly. Right now it’s a separate Context with sub-context modules responsible for each step. But if one comment fails to import, the entire import fails, and I don’t think this is desirable behavior. I would like it to be more fault tolerant and more loosely coupled, so it’s easy to add new steps to the importer and replace existing ones.

So long story short - is this a proper use case for something like GenStage?

I was thinking that each step should be its own stage (Fetch issues, Fetch comments etc.). And after everything is fetched, I would have additional stages for importing this into the database (just to separate those concerns). I would probably kick off a process for each comment being downloaded (but it should be rate limited somehow).

So is GenStage the right tool for the job? How would you structure this? I don’t expect a complete answer from anyone. I don’t mind reading and learning on my own. But it would be great if someone could let me know if I am on the right track or not. Is there a better approach?

josevalim · June 6, 2018, 1:47pm

The GenStage docs go over this, so I will be brief, but you don’t use stages to separate organizational/behavioural concerns. If you do this, then you end up with a very slow pipeline because you are copying the data across multiple processes. As with GenServer, Tasks, etc., stages should be used to model runtime properties.
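
To illustrate the point, here is a minimal sketch in which each import step is an ordinary function and the steps are composed inside a single process. The module and function names (`PipelineSketch`, `fetch_issue/1`, `attach_comments/1`, `to_row/1`) and the data shapes are hypothetical, not taken from Bitboard:

```elixir
defmodule PipelineSketch do
  # Hypothetical step 1: stands in for fetching an issue from the API.
  def fetch_issue(id), do: %{id: id, title: "Issue #{id}"}

  # Hypothetical step 2: stands in for fetching the issue's comments.
  def attach_comments(issue), do: Map.put(issue, :comments, ["first comment"])

  # Hypothetical step 3: shapes the data for a database insert.
  def to_row(issue), do: {issue.id, issue.title, length(issue.comments)}

  # The steps are separated as functions (an organizational concern)
  # but run in the calling process, so nothing is copied between processes.
  def import_issue(id) do
    id
    |> fetch_issue()
    |> attach_comments()
    |> to_row()
  end
end
```

Steps can be added or swapped by changing the pipeline in `import_issue/1`; processes only enter the picture where concurrency or isolation is actually needed.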

I don’t think GenStage is necessary in your case. It seems Task.async_stream is enough to leverage concurrency.
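
A rough sketch of what that could look like, assuming a hypothetical `fetch_comments/1` that stands in for the Bitbucket API call and returns `{:ok, comments}` or `{:error, reason}`. The module name, the fake failure rule, and the concurrency and timeout values are illustrative assumptions:

```elixir
defmodule ImporterSketch do
  # Hypothetical stand-in for an HTTP request to the Bitbucket API.
  # Every fourth issue "fails" so the error path is exercised.
  def fetch_comments(issue_id) when rem(issue_id, 4) == 0, do: {:error, :not_found}
  def fetch_comments(issue_id), do: {:ok, ["comment on issue #{issue_id}"]}

  # Fetches comments for a list of issue ids with at most
  # `max_concurrency` requests in flight. One failed or timed-out
  # issue is recorded under :failed instead of aborting the import.
  def import_comments(issue_ids, opts \\ []) when is_list(issue_ids) do
    max_concurrency = Keyword.get(opts, :max_concurrency, 4)

    results =
      issue_ids
      |> Task.async_stream(&fetch_comments/1,
        max_concurrency: max_concurrency,
        timeout: 10_000,
        # A timed-out task is killed and reported as {:exit, :timeout}.
        on_timeout: :kill_task
      )
      |> Enum.to_list()

    # Results come back in input order (the default), so they can be
    # paired with the ids they belong to.
    issue_ids
    |> Enum.zip(results)
    |> Enum.reduce(%{ok: [], failed: []}, fn
      {id, {:ok, {:ok, comments}}}, acc ->
        %{acc | ok: [{id, comments} | acc.ok]}

      {id, {:ok, {:error, reason}}}, acc ->
        %{acc | failed: [{id, reason} | acc.failed]}

      {id, {:exit, reason}}, acc ->
        %{acc | failed: [{id, reason} | acc.failed]}
    end)
    |> Map.new(fn {key, list} -> {key, Enum.reverse(list)} end)
  end
end
```

Two caveats on this sketch: `max_concurrency` only bounds how many requests run at once, so it is not a requests-per-second rate limit; and because `Task.async_stream/3` links tasks to the caller, a raise inside the fetch function would still take the caller down unless the caller traps exits, which is where `Task.Supervisor.async_stream_nolink/4` is usually considered.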

jdj_dk · June 6, 2018, 1:58pm

Thanks for the response. I read the announcement on the Elixir website, and that is how I came to the conclusion that this might be a good fit. So I’m glad I asked about this before digging myself into a hole.

I’ve been using Task a lot already, so I’ll look into Task.async_stream to solve this problem.