Maintaining object URLs with unreliable data?

My Phoenix application has to work with a swathe of unreliable data, delivered in the form of JSON files. Right now, I seed the database with these data files when the app is deployed. While the app is running, the objects from these data files have URLs built from the IDs that get generated when they are imported, so something like site.com/object/1, .../2, etc. Everything, including the database, is deployed inside Docker, so at the moment the entire site is brought down, updated, and started fresh. This is fine, since all the data is static in those files anyway, so nothing is lost.

My problem is that as I develop the application, I’m adding more and more ways to cleanse the data. As a result, the number of objects often changes, or the auto-generated IDs corresponding to those objects change.

For example, say I have three objects in my database, but I change the code so that the second object (which is actually a typo of the first) is merged into the first. I’m then left with only two objects. The next time I deploy, the second object’s ID has changed, since its data now lives under the first object’s ID, and the third object has shifted up to ID 2.

I am wondering if there is a good way to give these objects more permanent URLs, so that they don’t get messed up every time I re-deploy. I was thinking about slug URLs, but that would mean that if the name of an object changes after being cleaned (or something like that), the URL will change too.


You will need some unique identifier. Auto-incrementing keys should never be assigned meaning, nor should their values be relied upon, so the fact that they are being used in URLs is exactly where this breaks down. You just need some way to distinguish the objects, which will have to be based on whatever their data is (which was not posted as of this time). :slight_smile:


It seems like you do not have a 1:1 correspondence between URLs and actual objects. You might need a generated synthetic key (a sequence or UUID) for each actual object, with a mapping of URL -> ID?
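
Roughly what I mean, with made-up table and field names, as a sketch only:

```elixir
# Each actual object gets a synthetic UUID, and a small mapping table
# records which public URL points at which object.
defmodule MyApp.UrlMapping do
  use Ecto.Schema

  schema "url_mappings" do
    field :path, :string            # e.g. "/object/abc123", the public URL
    field :object_uuid, Ecto.UUID   # synthetic key of the actual object
    timestamps()
  end
end
```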


Do those objects change their content? If not, you could hash them and use that as the identifier, but that will also break as soon as an object changes. Alternatively, you could hash only specific keys from the objects.
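
As a very rough sketch, assuming each imported object is a map with a "name" field (adjust to whichever fields actually identify an object in your data):

```elixir
defmodule MyApp.ObjectHash do
  # Hash only the fields that identify the object, not the whole record,
  # so unrelated cleanups elsewhere don't change the identifier.
  def compute(%{"name" => name}) do
    :crypto.hash(:sha256, String.downcase(String.trim(name)))
    |> Base.encode16(case: :lower)
    |> binary_part(0, 12)  # shorten for friendlier URLs
  end
end
```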


I think that’s what I’ll end up doing. I can hash the names and then use those in the URLs. This will end up with dead links when I merge two objects together but thats fine, it’s better than every link from the last deployment being invalidated. I need something that isn’t sequential so hashed something from the objects sounds pretty good! Thanks :slight_smile:
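
Roughly what I’m planning on the Phoenix side, with placeholder module and route names:

```elixir
# Router:  get "/object/:hash", ObjectController, :show
defmodule MyAppWeb.ObjectController do
  use MyAppWeb, :controller

  alias MyApp.{Repo, Object}

  def show(conn, %{"hash" => hash}) do
    # Look the object up by its stable hash instead of the auto-generated id.
    object = Repo.get_by!(Object, hash: hash)
    render(conn, :show, object: object)
  end
end
```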

That could work, but the problem is that the entire database gets reset and re-seeded from the data files on each deployment, so I’d have to save the UUIDs somewhere and reuse them on the next import, which is actually a decent option!
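
Something like a small hash-to-UUID map saved next to the data files might do it; a quick sketch with invented paths and names:

```elixir
defmodule MyApp.Seeds.Identity do
  @map_path "priv/repo/object_ids.json"

  # Load the previously saved hash => uuid map (empty on first run).
  def load do
    case File.read(@map_path) do
      {:ok, body} -> Jason.decode!(body)
      {:error, _} -> %{}
    end
  end

  # Reuse the UUID if this hash has been seen before, otherwise mint a new one.
  def id_for(map, hash), do: Map.get(map, hash, Ecto.UUID.generate())

  def save(map), do: File.write!(@map_path, Jason.encode!(map))
end
```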

Hah, yes, I agree! I need to make sure whatever I use is unique and definitely not sequential.

Why does the database have to be destroyed on each deployment? If it is within your power to change this, that is the route I would suggest. You should be able to update the image your Elixir app is running in and just leave your database container running.

I can definitely do that. The problem is that all the data cleaning, normalization, etc. happens in the seeding process. Because of that, if we change the way we handle some field, or add a mechanism for merging two similarly named objects together, we would need to rewrite that logic in a separate place and run it against the existing database, instead of keeping it alongside the rest of the logic in the seeding process.
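
For reference, this is roughly what re-running that same seeding logic against the existing database could look like as an upsert (schema and field names are assumed, and it needs a unique index on the hash column):

```elixir
alias MyApp.{Repo, Object}

# Placeholder for the output of the existing cleaning/merging pipeline;
# each entry carries the stable hash computed during cleaning.
cleaned = [
  %{hash: "a1b2c3", name: "First Object"},
  %{hash: "d4e5f6", name: "Third Object"}
]

Enum.each(cleaned, fn attrs ->
  # Upsert keyed on the hash, so re-running the seed updates rows in place
  # instead of requiring the database to be wiped first.
  Repo.insert!(
    %Object{hash: attrs.hash, name: attrs.name},
    on_conflict: {:replace, [:name]},
    conflict_target: :hash
  )
end)
```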