Calculating similarity of photos

Silencing progress: the old noisy Maxtor is out and a WD Caviar Green is in, hung in a 5.25″ bay with elastic straps.

Going to try Shotwell instead of F-Spot for photo management for a while. With over 60K photos to manage, F-Spot was getting increasingly sluggish, so why not give Shotwell a shot?

Now fiddling with a Python script that digs through directories of photos and tries to clean up some of the mess: duplicate copies, downsized versions, diverging backup copies on CDs and DVDs, and so on.

One interesting thing the script does is recognize that picture A is a downsampled version of picture B. My first attempt at comparing the photos was to calculate luminosity histograms and compare those. The second attempt, which gave much more convincing results, was to resize both photos to 10×10 and compare pixel values. With PIL:

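from PIL import Image, ImageOps

def get_fingerprint(path):
    im = Image.open(path)
    # Image.LANCZOS is the current name of the old Image.ANTIALIAS filter
    im = im.resize((10, 10), Image.LANCZOS)
    im = ImageOps.grayscale(im)
    return list(im.getdata())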
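
For contrast with the histogram attempt, here is a small pure-Python sketch of one way histograms can mislead. It is only an illustration, not necessarily what went wrong in the script: the helper name `luminosity_histogram` and the four-pixel toy "images" are made up. A histogram only counts how often each value occurs, so it cannot see where the pixels are.

```python
from collections import Counter

def luminosity_histogram(pixels):
    # Hypothetical helper: count how many pixels have each grayscale value.
    return Counter(pixels)

# Two toy "images" as flat lists of grayscale values (0-255).
dark_to_light = [0, 0, 255, 255]
light_to_dark = [255, 255, 0, 0]

# The histograms are identical because they ignore pixel positions.
same_histogram = (luminosity_histogram(dark_to_light)
                  == luminosity_histogram(light_to_dark))

# A position-by-position comparison sees that every pixel differs.
same_pixels = all(abs(a - b) <= 5
                  for a, b in zip(dark_to_light, light_to_dark))

print(same_histogram, same_pixels)
```

The tiny thumbnail keeps a coarse version of the layout, which a histogram throws away.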

The script calculates fingerprints for the two photos in question and then compares them value by value: they count as a match only if no pair of corresponding pixels differs by more than a small threshold.

def fingerprints_match(fingerprint1, fingerprint2):
    for v1, v2 in zip(fingerprint1, fingerprint2):
        if abs(v1 - v2) > 5:
            return False

    return True
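
To see the idea work without any real photos, here is a rough pure-Python sketch. Everything except `fingerprints_match` is a stand-in invented for the example: `box_resize` is a crude block-averaging resize (not the filter PIL uses), and the "photo" is a synthetic 40×40 gradient. It checks that a half-size copy produces a matching fingerprint while an inverted picture does not.

```python
def box_resize(pixels, size):
    # Shrink a square 2D list of grayscale values to size x size by
    # averaging equal blocks (a crude stand-in for a real resize filter).
    step = len(pixels) // size
    out = []
    for by in range(size):
        row = []
        for bx in range(size):
            block = [pixels[by * step + y][bx * step + x]
                     for y in range(step) for x in range(step)]
            row.append(sum(block) // len(block))
        out.append(row)
    return out

def toy_fingerprint(pixels):
    # Flatten the 10x10 thumbnail into a list of 100 values.
    return [v for row in box_resize(pixels, 10) for v in row]

def fingerprints_match(fingerprint1, fingerprint2):
    for v1, v2 in zip(fingerprint1, fingerprint2):
        if abs(v1 - v2) > 5:
            return False
    return True

# A synthetic 40x40 "photo": a diagonal gradient.
original = [[3 * x + 2 * y for x in range(40)] for y in range(40)]
downsized = box_resize(original, 20)                      # half-size copy
inverted = [[255 - v for v in row] for row in original]   # different picture

downsized_matches = fingerprints_match(toy_fingerprint(original),
                                       toy_fingerprint(downsized))
inverted_matches = fingerprints_match(toy_fingerprint(original),
                                      toy_fingerprint(inverted))
print(downsized_matches, inverted_matches)
```

With real photos the resize filter and JPEG recompression add a little noise, which is presumably what the threshold of 5 is there to absorb.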

Then there was the problem that some photos had their rotation stored in EXIF data while others didn’t, so the same picture could end up fingerprinted in two different orientations. PIL doesn’t know much about EXIF but good ol’ ImageMagick does, so the fingerprinting routine gets rewritten:

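import subprocess

def get_fingerprint(path):
    command = ["convert", path, "-auto-orient", "-type", "Grayscale", "-filter", "Cubic", "-resize", "10x10!", "/tmp/merge-fingerprint.gray"]
    # Run ImageMagick; it writes the thumbnail as raw grayscale bytes
    retcode = subprocess.call(command)
    result = []
    # bytearray yields integers 0-255, one per pixel
    for byte in bytearray(open("/tmp/merge-fingerprint.gray", "rb").read()):
        result.append(byte)
    return result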
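
A quick pure-Python sketch of why the orientation has to be normalized first. The helpers `rotate_cw` and `flatten` and the toy 10×10 thumbnail are invented for the illustration; only `fingerprints_match` mirrors the routine above. The same picture turned on its side fails the comparison, and matches again once the rotation is undone.

```python
def rotate_cw(grid):
    # Rotate a 2D list 90 degrees clockwise.
    return [list(row) for row in zip(*grid[::-1])]

def flatten(grid):
    # Turn a 2D thumbnail into a flat fingerprint list.
    return [v for row in grid for v in row]

def fingerprints_match(fingerprint1, fingerprint2):
    for v1, v2 in zip(fingerprint1, fingerprint2):
        if abs(v1 - v2) > 5:
            return False
    return True

# Toy 10x10 thumbnail: brightness increases from left to right.
upright = [[x * 25 for x in range(10)] for _ in range(10)]
sideways = rotate_cw(upright)

match_as_is = fingerprints_match(flatten(upright), flatten(sideways))

# Undo the rotation (three more clockwise turns) before comparing.
restored = rotate_cw(rotate_cw(rotate_cw(sideways)))
match_after_fix = fingerprints_match(flatten(upright), flatten(restored))

print(match_as_is, match_after_fix)
```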

Some remarks about ImageMagick flags:

  • -auto-orient tells ImageMagick to rotate the picture according to its EXIF orientation.
  • -filter Cubic is good enough and a bit faster than the default Lanczos.
  • The .gray format is a raw byte format: saving a 10×10 picture gives you a file of 100 bytes holding the grayscale values. Perfect!
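
Reading such a raw file back can be sketched as follows. This is a hedged example rather than the script's actual code: `read_gray` is a hypothetical helper, it assumes one byte per pixel, and the 100 bytes are written by Python as a stand-in for ImageMagick output so the snippet runs on its own. It also uses a temporary file instead of a fixed /tmp path, which is a design choice of this sketch.

```python
import os
import tempfile

def read_gray(path, width=10, height=10):
    # Hypothetical helper: read a raw grayscale file, assuming one byte
    # per pixel, and return the values as a list of integers.
    with open(path, "rb") as f:
        data = bytearray(f.read())
    if len(data) != width * height:
        raise ValueError("expected %d bytes, got %d"
                         % (width * height, len(data)))
    return list(data)

# Stand-in for ImageMagick output: write 100 known bytes to a temp file.
fd, tmp_path = tempfile.mkstemp(suffix=".gray")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(bytearray(range(100)))
    fingerprint = read_gray(tmp_path)
finally:
    os.remove(tmp_path)

print(len(fingerprint), fingerprint[:3])
```

The length check is there to fail loudly if the file is not the expected 100 bytes, for instance if the conversion did not run or wrote more than one byte per pixel.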