I’m writing a GenStage producer that scrapes a set of RSS feed URLs and emits a stream of all the Articles it finds. The set of URLs it might scrape is fairly large, so it is given a UrlRepository module, and it calls pop_url on that module to get the next URL to parse. The UrlRepository (not coded yet) will presumably hand back the least recently parsed URL.
My current implementation of handle_demand keeps popping URLs and parsing them until it has enough articles to meet demand, and then returns them. If I end up with more articles than were demanded, I put the surplus in a buffer in the state and return only the demanded amount. That is working very nicely.
However, what if I’ve parsed all the URLs recently, and UrlRepository decides it doesn’t want to give me any of them again yet (perhaps there’s a minimum scrape interval of 1 hour or so for each URL)? Eventually pop_url returns nil, and handle_demand is left without enough articles to fill demand.
The documentation says that I need to take care of buffering the demand myself. So I keep a demand integer in my state, and if handle_demand has to return fewer articles than were demanded, I add the difference to that buffered demand. The next time handle_demand is called, I can potentially return more and fulfil that extra demand.
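To make that bookkeeping concrete, here is a minimal sketch of the buffering step on its own, with no GenStage involved. DemandBuffer and dispatch/3 are hypothetical names I made up for illustration; they are not part of GenStage.

```elixir
defmodule DemandBuffer do
  # Hypothetical helper, not part of GenStage.
  # articles: parsed articles that have not been emitted yet
  # demand:   demand that has been asked for but not yet satisfied
  defstruct articles: [], demand: 0

  # Adds the new demand and the newly parsed articles to the state, then
  # emits as many articles as the outstanding demand allows.
  # Returns {events_to_emit, new_state}.
  def dispatch(%__MODULE__{} = state, incoming_demand, new_articles) do
    articles = state.articles ++ new_articles
    demand = state.demand + incoming_demand

    # Enum.split/2 returns everything when the list is shorter than demand.
    {events, rest} = Enum.split(articles, demand)

    {events, %__MODULE__{articles: rest, demand: demand - length(events)}}
  end
end
```

In a producer, handle_demand would return {:noreply, events, new_state} from that result. Whatever is left in demand is what still has to be served once more articles show up.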
However, handle_demand will of course stop getting called if I stop fulfilling demand. It’s up to me to announce when that demand can be fulfilled. I see that in libraries such as Snowy and Twittex, when an external service such as a TCP library finally provides some data, the GenStage producer receives a handle_info call and, because there is a positive number in the demand buffer, immediately emits some events.
In my case, though, nothing external pokes the producer: because I stopped getting URLs, I stopped trying to parse. I need something to tell me to start again so that I can fulfil the buffered demand. My pop_url function returned nil, so I stopped, but it will presumably have URLs later if I check again.
Ideas I have for solutions:
- When my pop_url function returns nil, set a callback timer to check again in 20 minutes. Respond to that with handle_info to get things moving again.
- Turn my URL provider into a GenStage producer, so I can indicate that I’m demanding URLs, and just wait until there are some.
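For the first idea, this is roughly what I have in mind, as an untested sketch rather than a finished implementation. FeedProducer, the :pop_url and :parse options, and the :retry message are all names I invented here: :pop_url stands in for UrlRepository.pop_url and :parse for my RSS parser (URL in, list of articles out). The Mix.install line is only there so the snippet runs as a standalone script; in a Mix project gen_stage would be listed in deps instead.

```elixir
# Only needed when running this as a standalone .exs script.
Mix.install([{:gen_stage, "~> 1.0"}])

defmodule FeedProducer do
  use GenStage

  # Assumed retry interval, matching the 20 minutes mentioned above.
  @retry_ms :timer.minutes(20)

  def start_link(opts), do: GenStage.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{
      pop_url: Keyword.fetch!(opts, :pop_url), # hypothetical: fn -> url | nil end
      parse: Keyword.fetch!(opts, :parse),     # hypothetical: fn url -> [article] end
      articles: [],
      demand: 0,
      timer: nil
    }

    {:producer, state}
  end

  @impl true
  def handle_demand(incoming, state) do
    fill(%{state | demand: state.demand + incoming})
  end

  # The timer fired: try the repository again and serve buffered demand.
  @impl true
  def handle_info(:retry, state) do
    fill(%{state | timer: nil})
  end

  defp fill(state) do
    state = fetch(state)
    {events, rest} = Enum.split(state.articles, state.demand)
    state = %{state | articles: rest, demand: state.demand - length(events)}
    {:noreply, events, maybe_schedule(state)}
  end

  # Pop and parse until the buffer covers demand or pop_url returns nil.
  defp fetch(%{articles: articles, demand: demand} = state)
       when length(articles) >= demand,
       do: state

  defp fetch(state) do
    case state.pop_url.() do
      nil -> state
      url -> fetch(%{state | articles: state.articles ++ state.parse.(url)})
    end
  end

  # Schedule a retry only while demand is unmet and no timer is pending.
  defp maybe_schedule(%{demand: demand, timer: nil} = state) when demand > 0 do
    %{state | timer: Process.send_after(self(), :retry, @retry_ms)}
  end

  defp maybe_schedule(state), do: state
end
```

The point of routing both handle_demand and handle_info through the same fill function is that the retry emits events exactly the way a normal demand call would, so the consumer never needs to know the producer went quiet for a while.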