Hello Elixir community! I’m building a workflow with a few GenServers and I would like to ask for a review of a GitHub PR!
Here’s a brief description of my project:
The PESCARTE Project has as its main goal the creation of a regional social network integrated by artisanal fishermen and their families, seeking, through educational processes, to promote, strengthen and improve their community organization and professional qualification, as well as their involvement in the participative construction and implementation of work and income generation projects.
Through the PESCARTE Project, the fishing communities that live in the municipalities of Arraial do Cabo, Cabo Frio, Macaé, Quissamã, Campos dos Goytacazes, São João da Barra, and São Francisco de Itabapoana are mobilized, encouraged, and guided to participate in different educational actions and activities. These actions and activities aim to improve the professional performance of these communities, either by increasing their productivity or by helping them to better organize themselves and carry out solidarity economy activities.
The intention is to reinforce the productive identities of these fishing communities, in order to favor the mitigation of the negative impacts that affect them and that result from the activities carried out, in that region, by the oil and natural gas exploration and production industry.
More context about this feature I’m building:
We need to build a price quote API for fish price variations, which are updated daily on the Pesagro site (https://www.pesagro.rj.gov.br/). They publish reports as PDF files with price quotes for agricultural items, fish included.
So the first step was to “scrape” all these reports, which was done in this PR: Cria app CotacoesETL e buscador de novas cotações na Pesagro by zoedsoupe · Pull Request #113 · peapescarte/pescarte-plataforma · GitHub. The second step is to convert all these PDFs into TXT to make the information easier to parse, and the last step is to ingest the information.
Also, I would like to ask for some advice on how I could test the flow that this PR implements!
This is the PR description, translated:
Description
This PR implements the second part of the script for importing fish quotes from the Pesagro website. In this part of the flow, a new worker has been implemented, to search the quote table for quotes (links) that have not yet been downloaded.
After downloading each file, a check must be made, because some Pesagro links point to a zip file containing a set of PDFs, which must be extracted.
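To make this check concrete, here is a rough sketch (not the code from the PR). It assumes the downloaded file is already in memory as a binary and tells zip and PDF apart by their leading magic bytes; the module and function names are hypothetical. Extraction uses Erlang's built-in :zip module.

```elixir
defmodule ZipSketch do
  # Hypothetical helper: returns a list of {filename, pdf_binary} tuples.

  # A zip archive starts with the local file header signature "PK\x03\x04".
  def extract_pdfs(<<0x50, 0x4B, 0x03, 0x04, _rest::binary>> = archive, _name) do
    # With the :memory option, :zip.unzip/2 returns the entries as
    # {charlist_name, binary} tuples instead of writing them to disk.
    {:ok, entries} = :zip.unzip(archive, [:memory])

    for {entry_name, content} <- entries,
        name = to_string(entry_name),
        String.ends_with?(String.downcase(name), ".pdf") do
      {name, content}
    end
  end

  # A plain PDF starts with "%PDF" and is passed through as a single entry.
  def extract_pdfs(<<"%PDF", _rest::binary>> = pdf, name), do: [{name, pdf}]

  # Anything else is ignored in this sketch.
  def extract_pdfs(_other, _name), do: []
end
```

Checking the magic bytes instead of the link's extension is a design choice of this sketch; it avoids trusting the URL when deciding whether to unzip.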
With all the PDFs extracted, we must then upload each of the extracted PDFs into the Zamzar API, converting them to TXT. We cannot exceed the rate limit of their API (5 requests per second).
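One simple way to stay under that limit is to space the uploads out, so that two consecutive requests never start less than 200 ms apart (1 second divided by 5). The sketch below is only an illustration of that idea, not the PR's implementation: the module name is hypothetical, and the upload itself is passed in as a function so no Zamzar call is assumed.

```elixir
defmodule RateLimitSketch do
  # 5 requests per second means at least 200 ms between request starts.
  @min_interval_ms div(1_000, 5)

  # Applies `fun` to each item sequentially, waiting between calls so that
  # consecutive calls start at least `min_interval_ms` apart.
  def map_throttled(items, fun, min_interval_ms \\ @min_interval_ms) do
    {results, _last_started} =
      Enum.map_reduce(items, nil, fn item, last_started ->
        wait_until_allowed(last_started, min_interval_ms)
        started = System.monotonic_time(:millisecond)
        {fun.(item), started}
      end)

    results
  end

  # The first call never waits.
  defp wait_until_allowed(nil, _min_interval_ms), do: :ok

  defp wait_until_allowed(last_started, min_interval_ms) do
    elapsed = System.monotonic_time(:millisecond) - last_started
    if elapsed < min_interval_ms, do: Process.sleep(min_interval_ms - elapsed)
    :ok
  end
end
```

For example, `RateLimitSketch.map_throttled(pdfs, &upload/1)` would call a hypothetical `upload/1` for each PDF. Because it blocks the calling process while waiting, inside a GenServer the same spacing would more likely be done with `Process.send_after/3`.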
Once we have the converted file, we need to download it so that the last worker can be started for parsing the data from each fish.
The original script can be found in the cotacoes-api repository: https://github.com/peapescarte/cotacao-api/blob/feat-etl-module/etl/crawler.py
Points for Attention
- The conversion worker should follow this flow:
  - Fetch quotes from the database that have not yet been downloaded
  - Download each quotation from the Pesagro site
  - If a quotation is a zip file, extract all the PDFs contained in the file
  - Upload each PDF to the Zamzar API for conversion into TXT, respecting their rate limit (maximum 5 requests per second)
  - Download the converted file from Zamzar if it is already ready, or schedule a new query to their API
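The last step of the flow above (download if ready, otherwise schedule a new query) could be sketched as a small GenServer that re-sends itself a :poll message. This is an assumption-heavy illustration, not the worker from the PR: the module name, the option names, and the `{:ready, file_id}` / `:pending` return shape are all hypothetical, and the Zamzar status check is injected as `check_fun` rather than being a real API call.

```elixir
defmodule PollSketch do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{
      job_id: Keyword.fetch!(opts, :job_id),
      # Hypothetical function: job_id -> {:ready, file_id} | :pending
      check_fun: Keyword.fetch!(opts, :check_fun),
      # Process that receives {:converted, job_id, file_id} when done
      notify: Keyword.fetch!(opts, :notify),
      interval_ms: Keyword.get(opts, :interval_ms, 5_000)
    }

    # Trigger the first check right after init returns.
    send(self(), :poll)
    {:ok, state}
  end

  @impl true
  def handle_info(:poll, state) do
    case state.check_fun.(state.job_id) do
      {:ready, file_id} ->
        send(state.notify, {:converted, state.job_id, file_id})
        {:stop, :normal, state}

      :pending ->
        # Not ready yet: schedule a new query instead of blocking.
        Process.send_after(self(), :poll, state.interval_ms)
        {:noreply, state}
    end
  end
end
```

Injecting `check_fun` also helps with testing: a test can pass a function that answers `:pending` a couple of times before `{:ready, file_id}` and then assert on the message it receives, without touching the network.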
Do you have new settings?
- Internal settings for the correct use of the lib mox.
- Environment variable FETCH_PESAGRO_COTACOTES, a boolean to control if workers should be started with the application or not
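As an illustration of how such a flag might gate the workers, here is a minimal sketch. It assumes the variable is read as a string and that "true" or "1" (case-insensitive) means enabled; the accepted values, the module name, and the function names are assumptions, not what the PR does.

```elixir
defmodule EnvFlagSketch do
  # Returns true only when the variable is set to "true" or "1".
  # An unset variable counts as disabled.
  def enabled?(name) do
    value =
      name
      |> System.get_env("false")
      |> String.trim()
      |> String.downcase()

    value in ["true", "1"]
  end

  # Returns the worker child specs when the flag is on, otherwise an
  # empty list, so the supervisor starts without the workers.
  def worker_children(workers, name \\ "FETCH_PESAGRO_COTACOTES") do
    if enabled?(name), do: workers, else: []
  end
end
```

The result of `worker_children/2` could then be concatenated to the other children in the application's supervision tree.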
Do you have migrations?
N/A
This is the PR link: Cria worker para converter arquivos PDF para TXT by zoedsoupe · Pull Request #119 · peapescarte/pescarte-plataforma · GitHub