Image comparison with existing list of images

tmariaz · June 1, 2023, 7:09am

I am uploading an image to the server. Before it gets uploaded, I want to check whether I have already uploaded the same image.

All the uploaded images can be accessed via the url https://myserver/image/{image_id}. As a current user I can get the list of images from the repo.

Now I can select the new image and it is stored temporarily on the server. I want to compare new_image_url with the list of images stored on the server.

def compare_image(temp_img_url) do
   # code to get the list of images from repo
   # Enum.any?/2 returns true as soon as one stored image matches
   Enum.any?(images, fn m ->
     user_image = "https://myserver/image/#{m.id}"
     compare_images_from_url(temp_img_url, user_image)
   end)
end

def compare_images_from_url(tmp, current) do
    # fetch the already-stored image; `tmp` is the URL of the new upload
    response = HTTPoison.get!(current, [], hackney: [recv_timeout: 15_000, timeout: 150_000])

    case response.status_code do
         # logic to compare the images
         # .....
         # return true if there is a duplicate
    end
end

A user may have thousands of images, and comparing the new upload against every single one of them to find a duplicate sounds costly.

What is the best way to do this?

Eiji · June 1, 2023, 7:47am

First of all, why do you make a call from your server to itself? Can’t you just operate on the file instead?

I would personally generate a SHA hash and save it in the database. If possible, the client can even generate the hash before uploading the file to the server. This alone would save a lot of bandwidth. Later you can regenerate the hash on the server to verify file integrity and to guard against a client uploading a duplicate with an altered hash.
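As a minimal sketch of that idea, assuming the uploaded file is already in memory as a binary: the module name ImageDigest and the functions sha256_hex/1 and duplicate?/2 are hypothetical names for illustration, and the MapSet stands in for whatever database lookup you would really use.

```elixir
defmodule ImageDigest do
  # Hypothetical helper: hex-encoded SHA-256 digest of the raw file bytes.
  def sha256_hex(bytes) when is_binary(bytes) do
    :crypto.hash(:sha256, bytes) |> Base.encode16(case: :lower)
  end

  # `known_digests` is a MapSet of digests of previously uploaded files.
  # It is an assumption standing in for a database query.
  def duplicate?(known_digests, bytes) do
    MapSet.member?(known_digests, sha256_hex(bytes))
  end
end

known = MapSet.new([ImageDigest.sha256_hex("first image bytes")])
IO.inspect(ImageDigest.duplicate?(known, "first image bytes"))
IO.inspect(ImageDigest.duplicate?(known, "other image bytes"))
```

With the digest stored in an indexed column, the duplicate check becomes a single lookup instead of a comparison against every stored image. It only catches byte-for-byte identical files.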

Of course there are lots of edge cases here:

A hash can represent only a limited number of values, so two different files may be seen as the same. If you worry about that, you can store extra information like file name, size, etc., but nothing is as precise as comparing file by file, which is slow.
Keep in mind that changing one pixel in Paint gives you a new image. Most probably it would have a completely different hash even if the file size is the same. Here even comparing file by file would not help, and you would need image processing tools to get a similarity percentage. But even then, what should happen when images are similar but different? Think of a meme image that is exactly the same except for some text in it.
Think about whether and how you want to stop your app from accepting the same image converted to a different format, like jpeg → png. Once again you would need image processing tools, and once again you would face questions that only you can answer.

Depending on your use case, you need to choose how you want to compare files. As I’m not an expert on images and their formats, I most probably did not cover even most of the cases.

kip · June 1, 2023, 7:49am

The general strategy for “image is basically the same” is to use a perceptual hash. Image.dhash/1 will generate such a hash, which can be stored in the database for each image if you need. Or just use the hash to compare with the hashes of other images.

A perceptual hash overcomes the issues of the same image at a different resolution, in a different image format, or in a different colorspace: it is still the “same” image.

Something like:

def kinda_the_same?(image_1, image_2) do
  Image.dhash(image_1) == Image.dhash(image_2)
end
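If exact equality of the two hashes turns out to be too strict, a common variation is to count how many bits differ (the Hamming distance) and accept a small difference. The sketch below is plain Elixir working on hash binaries of equal size. The module name HashDistance and the threshold of 10 bits are assumptions for illustration, not part of the Image library.

```elixir
defmodule HashDistance do
  # Number of differing bits between two equal-sized hash binaries.
  def hamming(a, b) when is_binary(a) and is_binary(b) and byte_size(a) == byte_size(b) do
    diff = Bitwise.bxor(:binary.decode_unsigned(a), :binary.decode_unsigned(b))
    count_bits(diff, 0)
  end

  # Hypothetical threshold: treat hashes within `max_distance` bits as similar.
  def similar?(a, b, max_distance \\ 10), do: hamming(a, b) <= max_distance

  # Count the set bits of a non-negative integer.
  defp count_bits(0, acc), do: acc
  defp count_bits(n, acc), do: count_bits(Bitwise.bsr(n, 1), acc + Bitwise.band(n, 1))
end

IO.inspect(HashDistance.hamming(<<0b1010>>, <<0b0110>>))
IO.inspect(HashDistance.similar?(<<0, 0>>, <<255, 255>>))
```

The right threshold depends on the hash size and on how tolerant you want to be, so it would need tuning against real images.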

Eiji · June 1, 2023, 8:05am

Interesting, … is it faster or slower than the :crypto.hash/2 algorithms?

Is there a browser implementation that could pre-verify before upload?

tmariaz · June 1, 2023, 8:05am

As a matter of fact, the files are stored in AWS and can be accessed via CloudFront. When I query the list of images from the database, I get URLs in multiple resolutions for each ID.

%Images{
  id: 1,
  url: %{
    "400x" => url,
    "1200x" => url
    # few more res
  }
}

I want to do the comparison at upload time, which is why I call the server every time. I know it sounds bad.

@kip, is the perceptual hash still helpful with different resolutions?

D4no0 · June 1, 2023, 8:23am

I think different image sizes can be handled easily by any fingerprinting algorithm. It gets tricky when the image is partially manipulated.

kip · June 1, 2023, 8:55am

Yes, you are right. Basically the image is resized, converted to BW, convolved to sharpen the edges and that’s basically the “image hash”. You can see the code here.

~~I’m not 100% happy with the implementation, but as best I can test it works as expected (please open issues if you find otherwise).~~ I’ve fixed the implementation to return the expected 64-bit hash (not the previous 512-bit hash, which was wasting space).
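To make the idea concrete, here is a sketch of the classic difference hash in plain Elixir. It is not the Image library's implementation. It assumes the image has already been resized and converted to grayscale, and is given as a list of rows of integer intensities (8 rows of 9 values gives a 64-bit hash). The module name DHashSketch is hypothetical.

```elixir
defmodule DHashSketch do
  # `rows` is a list of rows of grayscale values (assumed already downscaled).
  # Each bit records whether a pixel is brighter than its right-hand neighbour.
  def dhash(rows) do
    bits =
      for row <- rows,
          [left, right] <- Enum.chunk_every(row, 2, 1, :discard) do
        if left > right, do: 1, else: 0
      end

    # Pack the list of 0/1 values into a bitstring, one bit per comparison.
    Enum.reduce(bits, <<>>, fn bit, acc -> <<acc::bitstring, bit::1>> end)
  end
end

grid = [[1, 2, 3], [3, 2, 1]]
brighter = for row <- grid, do: Enum.map(row, &(&1 * 2))

# Only relative brightness matters, so a uniformly brighter copy hashes the same.
IO.inspect(DHashSketch.dhash(grid) == DHashSketch.dhash(brighter))
```

Because only the ordering of neighbouring pixels is recorded, the hash tends to survive resizing and re-encoding, which is what makes it useful for "basically the same image" checks.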

kip · June 1, 2023, 9:00am

Probably slower because it involves image resizing, edge detection, contrast enhancement. But it’s not testing for identical images. Identical is not very meaningful for image comparison, since different compression algorithms and settings mean the image doesn’t round trip after decoding.

You can use mean squared error as a way to establish “similarity” between two images. I use this in the test suite to overcome some of the challenges: there can be different results across different library builds, system architectures and so on.

Basically the code for image similarity is:

    similarity =
      calculated_image
      |> Math.subtract!(validate_image)
      |> Math.pow!(2)
      |> Vix.Vips.Operation.avg!()

Note the operations are matrix operations since an image is basically just a matrix.
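For readers without the library at hand, the same computation can be sketched on plain lists. This assumes both images are given as non-empty lists of rows of numbers with identical dimensions. The module name MseSketch is hypothetical.

```elixir
defmodule MseSketch do
  # Mean squared error between two same-sized matrices (lists of rows).
  # 0.0 means identical pixel values; larger values mean less similar.
  def mse(a, b) do
    squared_diffs =
      for {row_a, row_b} <- Enum.zip(a, b),
          {x, y} <- Enum.zip(row_a, row_b) do
        (x - y) * (x - y)
      end

    Enum.sum(squared_diffs) / length(squared_diffs)
  end
end

IO.inspect(MseSketch.mse([[1, 2], [3, 4]], [[1, 2], [3, 4]]))
IO.inspect(MseSketch.mse([[0, 0]], [[2, 4]]))
```

In practice the result would be compared against a tolerance chosen for the use case rather than against zero.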

kip · June 1, 2023, 9:06am

There is very likely a JavaScript implementation for the browser (almost certainly).

Yes, the perceptual hash is deliberately intended to be resolution independent. And format independent. And colourspace independent.

D4no0 · June 1, 2023, 9:10am

I wonder if it is possible to fingerprint the image with something like a Fast Fourier Transform, as you can do with sound. This would certainly make the complexity of the sampling algorithm more linear.

kip · June 1, 2023, 9:50am

The image decoding process typically dominates processing time so I’m not sure any gains would be material.

Lucassifoni · June 1, 2023, 3:19pm

A few years ago I was quite happy adding multiple heuristics to a client app that worked mainly on images.
I stored:

SHA hash
File name
Perceptual hash (I’d have to find the library)
Luminance fingerprint (basically an 8x8 or 16x16 grayscale version of the image)
Mime type
File size
Image width
Image height

That allowed me to answer these questions quite effectively:

Are two images the same?
Are two images minor derivations of the same original (a bit resized, a bit cropped, a bit of this and that, and renamed or converted…)?

For ambiguous answers to those questions I had a “potential duplicates” list that someone could check. Maintaining and cleaning that list was useful for business purposes too, not just to limit storage use.
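A rough sketch of how such layered heuristics could be combined, under assumptions: each image is a map with a :sha digest and a :phash binary of equal size, and the module name DuplicateTriage and the 10-bit threshold are hypothetical choices, not what the original app used.

```elixir
defmodule DuplicateTriage do
  # Assumed tolerance (in differing bits) for "probably the same picture".
  @max_distance 10

  # Same SHA digest: the files are byte-for-byte identical.
  def classify(%{sha: sha}, %{sha: sha}), do: :identical

  # Otherwise fall back to the perceptual hash and flag near matches for review.
  def classify(%{phash: a}, %{phash: b}) do
    if hamming(a, b) <= @max_distance, do: :potential_duplicate, else: :different
  end

  # Count differing bits between two equal-sized binaries.
  defp hamming(a, b) when byte_size(a) == byte_size(b) do
    diff = Bitwise.bxor(:binary.decode_unsigned(a), :binary.decode_unsigned(b))
    for <<bit::1 <- :binary.encode_unsigned(diff)>>, reduce: 0 do
      acc -> acc + bit
    end
  end
end

original = %{sha: "aaa", phash: <<0, 0>>}
resized = %{sha: "bbb", phash: <<0, 1>>}
IO.inspect(DuplicateTriage.classify(original, resized))
```

Anything classified as :potential_duplicate would go onto the list for a person to check.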

The answer to “are those two images the same” mainly depends on your specific use case. Maybe identical files are a good enough answer, maybe you’d prefer something more like “do the two image files contain the same picture as seen by a user?”.

Good luck! This is a fun topic.