New to Elixir, having trouble interfacing with a shaky external API

I’m trying to build a website that imports data from an external API that is unbearably slow as well as quite unreliable. Right now, while piecing it together, I have everything (somewhat) working by manually invoking the functions in IEx and then fixing the inevitable errors as they happen.

The API that I’m querying has a hard limit on the number of resources you can get per request, so I’ve solved that by using Task.async + Task.await to get the responses, and I’m using an Ecto.Multi to handle the bulk insertion of the entities. But since the entities are interlinked in the API as well as in my database schema, I need a consistent way to either not fail or automatically fix the errors when a linked resource doesn’t exist.
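To make the paged-fetch part concrete, here is a minimal sketch of the Task.async + Task.await pattern under a 100-per-request limit. `fetch_page` is a hypothetical function standing in for the actual API call:

```elixir
defmodule PagedFetch do
  @page_size 100

  # Fetch all pages concurrently; fetch_page is a hypothetical
  # function that takes an offset and returns {:ok, list_of_maps}.
  def fetch_all(total_count, fetch_page) do
    0..div(total_count - 1, @page_size)
    |> Enum.map(fn page -> Task.async(fn -> fetch_page.(page * @page_size) end) end)
    # A generous timeout, since the external API is slow.
    |> Enum.map(&Task.await(&1, 30_000))
    |> Enum.flat_map(fn {:ok, items} -> items end)
  end
end
```

In real code you might prefer `Task.async_stream/3` with a `max_concurrency` option so you don’t fire every request at once against a flaky API.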

I’m also going to try and have this on some sort of cron job or repeating background job to go through and refresh the data from the external API.

Is there some way to implement automatic retries for failed requests, and some mechanism that reacts to missing data and requests it from the API?

you can do retries with Tesla
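Tesla ships a retry middleware you can plug into your client. A sketch, where the client module name and base URL are placeholders:

```elixir
defmodule MarvelClient do
  use Tesla

  plug Tesla.Middleware.BaseUrl, "https://gateway.marvel.com/v1/public"
  plug Tesla.Middleware.JSON

  # Exponential backoff: 500ms, 1s, 2s, ... capped at 10s.
  plug Tesla.Middleware.Retry,
    delay: 500,
    max_retries: 5,
    max_delay: 10_000,
    should_retry: fn
      # Retry on transport errors and on typical transient statuses.
      {:error, _} -> true
      {:ok, %{status: status}} when status in [429, 500, 503] -> true
      {:ok, _} -> false
    end
end
```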

the missing-data part sounds domain-specific, so I think you’d have to roll your own — maybe validate against a changeset, and if it’s invalid, load the missing data from the API…
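One way that changeset-driven idea could look: try the insert, and if it fails because a linked resource is missing locally, fetch that resource first and retry. `fetch_and_insert_series/1`, the `Comic`/`Repo` modules, and the `:series_id` constraint are all assumptions for illustration:

```elixir
# Assumes Comic.changeset/2 declares a foreign_key_constraint on
# :series_id so a missing series surfaces as a changeset error.
def insert_comic(attrs) do
  changeset = Comic.changeset(%Comic{}, attrs)

  case Repo.insert(changeset) do
    {:ok, comic} ->
      {:ok, comic}

    {:error, %Ecto.Changeset{errors: errors} = changeset} ->
      if Keyword.has_key?(errors, :series_id) do
        # The linked series doesn't exist locally: pull it in, then retry once.
        with {:ok, _series} <- fetch_and_insert_series(attrs["series_id"]),
             do: Repo.insert(Comic.changeset(%Comic{}, attrs))
      else
        {:error, changeset}
      end
  end
end
```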

for the cron part you can use Quantum
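Quantum setup is just a scheduler module plus config; the `MyApp` names and the job below are placeholders:

```elixir
# lib/my_app/scheduler.ex
defmodule MyApp.Scheduler do
  use Quantum, otp_app: :my_app
end

# config/config.exs — e.g. refresh every 6 hours:
config :my_app, MyApp.Scheduler,
  jobs: [
    {"0 */6 * * *", {MyApp.Sync, :refresh_modified, []}}
  ]
```

Remember to add `MyApp.Scheduler` to your application’s supervision tree.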


I would consider making a mirror of the external system, i.e. dump the data (JSON, XML, whatnot) into your database as is; put that into a worker and handle external API failures there… Then add a separate worker that would transform and upload the data to the destination. This might make for a cleaner design and better performance.


The thing is that this data is constantly changing and needs to be updated on a consistent basis. I have implemented an exponential retry system which handles the failed requests well-ish, but I still don’t know how to make it work consistently.

That doesn’t seem to be an issue, since the process is going to give you eventual consistency. The exported data is always stale (unless you can update the data in both systems as part of a single transaction, or you can block updates in the old system), so it’s best to embrace this fact. You could make the staleness window smaller if you had notifications from the old API + fast response times + no API quota, but that’s not the case.


I feel like I’m not really understanding what you are saying. So I should just get the data from the API, dump that into a database, and then spawn a whole separate worker to transform it and put it into the database? What do you mean by “put that into a worker and handle external API failures there”? I already have an implementation working that sort of does that.

Because the API is limited to 100 results at a time, I do some parallel requests and then flatten the responses into a list of maps, which I run through an api_to_changeset function that takes the raw API result and transforms it into a usable changeset that I put into an Ecto.Multi to batch-insert into the database.
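That pipeline might look roughly like this — `api_to_changeset/1` is the poster’s own function, assumed to return an `Ecto.Changeset` for one raw API map, and `Repo` is a placeholder:

```elixir
def insert_batch(raw_items) do
  raw_items
  |> Enum.map(&api_to_changeset/1)
  |> Enum.with_index()
  |> Enum.reduce(Ecto.Multi.new(), fn {changeset, i}, multi ->
    # Each operation in a Multi needs a unique name; the index works.
    # on_conflict: :nothing keeps re-imports of the same entity from failing.
    Ecto.Multi.insert(multi, {:item, i}, changeset, on_conflict: :nothing)
  end)
  |> Repo.transaction()
end
```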

But currently, because of the flakiness of the API, I get `:connect_timeout` errors from hackney, which the exponential retry system attempts to combat.
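For reference, a minimal exponential-backoff wrapper similar in spirit to what’s described (the poster’s actual implementation is linked below; this is just a sketch):

```elixir
defmodule Retry do
  # Retries fun on {:error, _}, doubling the delay each attempt.
  def with_backoff(fun, retries_left \\ 5, delay \\ 500)

  def with_backoff(fun, 0, _delay), do: fun.()

  def with_backoff(fun, retries_left, delay) do
    case fun.() do
      {:error, _reason} ->
        Process.sleep(delay)
        with_backoff(fun, retries_left - 1, delay * 2)

      ok ->
        ok
    end
  end
end
```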

link to implementation

If I got that correctly, your app uses an external API to fetch data, transforms that into your own format, saves that in DB and then exposes that through your own API. The problem is that the external API is really unreliable. As a consequence, you don’t really have an option to implement it via hot calls.

So what I’m suggesting is that you “clone” the external database. You have options here:

  1. dump the responses from the API straight into the DB: the advantage here is that you can easily change your own JSON schema (add more attributes, etc.), since the external data sits in your DB
  2. store the transformed entities in the DB: the advantage here is speed (negligible)
  3. do both

Seems like the problem that you need to solve here is to decide what data and when to pull it from the external API.

As to what:

  • can you store everything locally? what’s the estimated data set? can you afford this?
  • does your application need all the data?

As to when:

  • does the external API have webhooks or any other notification mechanism?
  • does the external API have an incremental endpoint (i.e. give me what changed since a timestamp)?
  • if you need to poll for data, what’s the acceptable timespan? how is this affected by quota?

Depending on the answers, the whole thing might even turn out to be infeasible (I hope not!).

I’d implement this export as a worker that runs continuously in the background under its own supervision tree. As to how to organize this — it really depends on the structure of the data and what the external API gives you. But my first approach would be a process that has a queue of messages, where each message represents a resource to be fetched. For each message, a task is spawned to fetch the data. If it fails, the message is re-sent to try again. Some additional process would also periodically request fetching of all resources.
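A bare-bones sketch of that worker, using the GenServer mailbox as the queue and passing the fetch function in to keep it self-contained (a real version would cap retries and add backoff):

```elixir
defmodule FetchWorker do
  use GenServer

  def start_link(fetch_fun),
    do: GenServer.start_link(__MODULE__, fetch_fun, name: __MODULE__)

  # Each enqueued resource becomes one message in the mailbox.
  def enqueue(resource), do: GenServer.cast(__MODULE__, {:enqueue, resource})

  @impl true
  def init(fetch_fun), do: {:ok, %{fetch: fetch_fun}}

  @impl true
  def handle_cast({:enqueue, resource}, state) do
    # Spawn a task per resource so one slow request doesn't block the rest.
    Task.start(fn ->
      case state.fetch.(resource) do
        {:ok, _data} -> :ok
        # On failure, put the resource back on the queue to try again.
        {:error, _} -> enqueue(resource)
      end
    end)

    {:noreply, state}
  end
end
```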


Yes, I am attempting to clone the database. After a proof of concept, I realized that hot calls were completely out of the window because of the unreliability as well as the extreme slowness. While designing the proof of concept, I implemented a caching middleware for the HTTP client I’m using, which temporarily solves the speed issue and could be considered dumping the responses to a database. I’m not sure whether storing the items in the database or in ETS tables would be better, but going off gut intuition, I’m leaning towards storing them in the database so they can be persisted across restarts.

For What:

  1. Yes, I can store everything locally using either Postgres or ETS tables. The estimated dataset is around 65k items in total, IIRC. If by afford you mean whether I can pay for the database resources to store this, yes I can; a simple DO droplet is my deploy target, which can fit that amount of data easily.
  2. I’m not entirely sure my app needs all that data, but it would be nice. Once I figure out how to consistently persist a single entity, I can extrapolate the logic, since they are all quite similar. Also, having all the data would let me do statistical queries on the dataset, which is always interesting, especially with comics.

For When:

  1. Sadly, the Marvel API is pretty outdated and doesn’t have webhooks or any other notification system.
  2. I can query the data sorted by the modified parameter they provide to get a summary of new data.
  3. The API guidelines say that the data changes quite frequently and that they would like no third-party apps to have stale data, so I was thinking I’d refresh every 6 hours, searching for newly modified data, and every 48 hours do some sort of purge/flush on high-traffic resources. The quota for the API is 3000 calls/day with no rate limit, so as long as I’m not requesting the specific information for each entity, I’m probably fine.

One other thing that is causing a lot of anguish for me is the consistency of the data supplied by the API: as per their schema, nothing is a required field (including the id of the resource! thankfully they all do have IDs), and handling the inconsistencies is a bit of an issue, but I think I’ve solved it and just need to work on the implementation a bit more.

Also, thank you so so so much for your help! It has been really really nice to get detailed feedback on my problems :slight_smile:


I’m happy to be of any help. Also I’m into MCU lately, so I’ll keep an eye on your project :slight_smile:


The MCU is great! I loved infinity war so much and I absolutely cannot wait for the new Ant Man and The Wasp movie.

If you’d like a tour of my code please feel free to pm me.

As for the worker process, would a GenServer work in this case, or would something else be optimal? Do you also have any suggestions on how to normalize/validate the input data? Right now I have it working, but it feels far from elegant, and I suspect there’s some approach I could be using that would make it a lot better (here’s an example of how I’m handling it currently).


The problem you’re solving sounds similar to writing a web crawler. I’d start with a GenServer that manages a set of Tasks.

So you’re mapping one schema to another. Seems like there are two options: write it by hand, or roll out / use an existing mapping lib. That would DRY the code up by making it more declarative, i.e. for each key of each schema entity, define what to do with it (sketch):

def comic_transform do
  %{
    "isbn" => &integer_or_string/1,
    "issueNumber" => &underscore_key/1
    # ...
  }
end

def underscore_key({key, val}), do: {Macro.underscore(key), val}

def map(schema, transform_definition) do
  Enum.map(schema, fn {key, val} -> transform_definition[key].({key, val}) end)
end

It might be the case that it’s not worth the hassle, though.


Yeah, I originally wrote this about a year ago in Python using a web scraper, with mixed results: it worked, but it was expensive to keep running and required quite a bit of manual intervention. This seemed like something Elixir/Phoenix could solve really well, so it became my pet project for learning. So you’re suggesting I have a GenServer that starts and stops Tasks based on calls coming from either the UI or some sort of scheduler? Would you suggest the Tasks be under a supervision tree?

Thank you so much! This is exactly what I was looking for; the way I was doing it felt very limited and required a lot of repeated code. Time to migrate all the other schema-parsing stuff to this new format, which is so much cleaner and nicer. Thank you so much :slight_smile:!

That was my initial thinking. I don’t have enough experience with it, but I think that GenStage might be a better fit here.
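A rough GenStage pairing, assuming a producer that emits resource IDs and a consumer that fetches them with demand-driven backpressure; all names and the `fetch` placeholder are illustrative:

```elixir
defmodule ResourceProducer do
  use GenStage

  def start_link(ids), do: GenStage.start_link(__MODULE__, ids, name: __MODULE__)

  @impl true
  def init(ids), do: {:producer, ids}

  @impl true
  def handle_demand(demand, ids) do
    # Emit only as many IDs as the consumer asked for.
    {events, rest} = Enum.split(ids, demand)
    {:noreply, events, rest}
  end
end

defmodule FetchConsumer do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok) do
    # max_demand caps how many resources are in flight at once.
    {:consumer, :ok, subscribe_to: [{ResourceProducer, max_demand: 5}]}
  end

  @impl true
  def handle_events(ids, _from, state) do
    # Replace this with the actual API fetch + insert.
    Enum.each(ids, fn id -> IO.inspect(id, label: "fetching") end)
    {:noreply, [], state}
  end
end
```

The nice property here is that a slow, quota-limited API naturally throttles the pipeline instead of piling up work.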