Fetch data from multiple pages

How do I make my code find out the number of pages available in advance, and then start several tasks to fetch those pages concurrently?

Example: example.com/api?id=1, example.com/api?id=2, and so on, fetching the response from the server until a page returns an empty array: "[]".
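To make the goal concrete, here is a minimal sketch (not anyone's actual code) assuming a hypothetical `fetch_page/1` that GETs example.com/api?id=<page> and returns the decoded list for that page. It fetches pages in concurrent batches and stops once a batch contains an empty page:

```elixir
# Minimal sketch. fetch_page/1 is hypothetical: it should GET
# example.com/api?id=<page> and return the decoded list ("[]" becomes []).
defmodule Pager do
  @batch_size 10

  # Fetch pages in concurrent batches; stop once a batch contains an empty page.
  def fetch_all(fetch_page, from \\ 1, acc \\ []) when is_function(fetch_page, 1) do
    batch =
      from..(from + @batch_size - 1)
      |> Task.async_stream(fetch_page, max_concurrency: @batch_size, timeout: 15_000)
      |> Enum.map(fn {:ok, rows} -> rows end)

    acc = acc ++ Enum.concat(batch)

    if Enum.any?(batch, &(&1 == [])) do
      acc
    else
      fetch_all(fetch_page, from + @batch_size, acc)
    end
  end
end
```

Calling `Pager.fetch_all(&fetch_page/1)` would then return all rows up to the first empty page.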

Hey cipher, welcome! Any good programming task starts with breaking the problem down into simpler parts. The first part is: how do you fetch a single page? What have you tried so far?


I’ve already requested a single page and organized all the information I want to handle; now I just need to request several pages.

The request was to look at the code you have written…

If you know how to fetch a page, you should be able to fetch one given a page param. That gives you an offset and a limit.

And the total count will give you the number of pages.
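Roughly, that arithmetic looks like this (the module name and page size below are made up for illustration):

```elixir
# Rough illustration of the page -> offset/limit and count -> pages arithmetic.
# The module name and page size are made up for the example.
defmodule Paging do
  @limit 50

  # Offset of the first row on a given 1-based page.
  def offset(page), do: (page - 1) * @limit

  # Total number of pages for `count` rows (ceiling division).
  def total_pages(count), do: div(count + @limit - 1, @limit)
end
```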

I am also doubtful about the use of Tasks to preload data when a cache could speed up db access.

Fetching data is already concurrent because you have a pool of workers in charge of Repo access.

Here is the code.

I’ll try to solve it by passing a parameter, but from what I’ve seen there are about 10 thousand pages. Would that be the best way?

It was not clear the data comes from an external API…

I have implemented a GenStage pipeline with HTTPoison to get this kind of concurrency.

But my latest web scrape used Finch to download paintings/metadata concurrently from WikiMedia.
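A simplified sketch of that kind of Finch setup (the module names and the URL list are placeholders, not the actual scraper):

```elixir
# Stripped-down sketch of concurrent downloads with Finch; names and the
# URL list are placeholders, not the original scraper.
defmodule Downloader do
  def run(urls) do
    # Finch is normally started under the application's supervision tree;
    # it is started inline here only to keep the example self-contained.
    {:ok, _} = Finch.start_link(name: MyFinch)

    urls
    |> Task.async_stream(&download/1, max_concurrency: 8, timeout: 30_000)
    |> Enum.map(fn {:ok, result} -> result end)
  end

  defp download(url) do
    Finch.build(:get, url)
    |> Finch.request(MyFinch)
  end
end
```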


Ok, what have you tried so far?

There is this old post I remember liking, though it might be a bit dated by now…

Maybe you can get some inspiration from it. It uses poolboy and HTTPoison.

Because the problem is not really being concurrent, but managing that concurrency.

You don’t want to flood the host with 10,000 concurrent requests.
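For instance, `Task.async_stream/3` with `:max_concurrency` bounds how many requests are in flight at once. A rough sketch, where `fetch_page/1` is just a placeholder for the real HTTP call:

```elixir
# Sketch: even with ~10_000 known pages, :max_concurrency caps the number of
# requests in flight, so the host never sees more than 20 at a time.
defmodule BoundedFetch do
  # Hypothetical stand-in for the real HTTP call (HTTPoison, Finch, ...).
  def fetch_page(_page), do: []

  def run do
    1..10_000
    |> Task.async_stream(&fetch_page/1,
      max_concurrency: 20,
      timeout: 30_000,
      on_timeout: :kill_task
    )
    |> Enum.flat_map(fn
      {:ok, rows} -> rows
      {:exit, _reason} -> []  # a page that timed out; drop or retry it
    end)
  end
end
```

A poolboy pool, as in the linked post, achieves the same bounding with long-lived workers instead of short-lived tasks.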
