PHASH
Calculating Perceptual hashes (PHASH) and comparing image similarity
Here is a program to calculate the perceptual hash for images different than a CRC or md5 hash a perceptual hash will be similar for similar images. A slight change in the image (sharpen, blur, resize) will give perceptual hash that is similar where a CRC or MD5 hash will be completely different. I looked for a PHASH program for pascal and could not find any. I did find three methods for getting a perceptual hash. The first was using a DCT transform but the code was written in C and was not easy to link the program in. The second way is to resize it and make a hash based on whether each pixel was more or less then its next neighbor. The third way was to use image moments. ImageMagick(IM) uses this method but I found that after a certain threshold the images where radically different. Also calculating the differences required a lot of computing power as you had to square the difference of 42 numbers for each image comparison. I tried all three methods, only the first had good promise. The last one was not even part of the IM Lazarus interface.
This program uses ImageMagick and I want to express my appreciation for the work of the translators of this tool to work with Lazarus. However I had a lot of problems getting it to work properly. First because it has not kept up with the code there were some routines that did not work because they have been deprecated (and no longer available) in ImageMagick. Second because some of the newer routines were missing entirely. And third because I was using the Q8 version of ImageMagick and it appeared that the translation was for the Q16 version. PPixelPacket for instance assumes the colors are 16 bit and not 8 bit so I had to change how I handled the structure). I use 8 bit ImageMagick because 99.99999% of the pictures I wanted to process are 8 bit or less and the Q8 program is much faster then the Q16 version.
To get this to work under windows 7 you need to compile under 32bit mode (and call 32 bit ImageMagick) as the calling conventions for 64 bit are quite different and Lazarus does not support the 64 bit calling conventions. (this may change but keep it in mind).
This program takes one main argument - the directory to parse for images (currently only JPG and GIF but you can add more). The program maintains a file 'yalls' that contains information for each image it finds in that directory - deleting files that no longer exists, adding new files and if the md5 does not match recalculating data for that file. (for speed you can have it skip the md5 check and just rewrite existing lines for files by placing SKIP before the directory argument. Since I had a specific set of directories to parse if you do not use any argument it will parse those directories. The output file will consist of one line for each image it finds consisting of fields separated by blanks. The program expects the filenames are without spaces and would need modification otherwise. There was a memory leak with the program and after about 200,000 images it crashed. This may be gone but if the program dies just move the yall file to yalls and restart with JPEGINFO SKIP to restart from where it ended.
The output file contains:
MD5 - 32 hex bytes of the MD5 sum of the entire file. Signature - 64 hex bytes that comes from IM identify that is I think the md5 of the image only PHASH - 16 hex bytes of the DCT hash of the image Date of the file in YYYYMMDDHHMMSS format (from ImageMagick identify) Size in bytes of the file Bitdepth of the image Width of the image Height of the image Filename of the image Complete fileid of the image Example: md5 fb54e12c44d2b14d4d3266251109799e signature 2318170444932902d0541895584df4c2f8f38ad8b1d4dad899870760817422f2 phash CCFB4EAEF3F30C0D date 20151001010847 size 901 bitdepth 8 width 32 width 32 filename 01.gif fid F:\CCC\01.gif
I will add another program later if there is interest that calculates phash similarity. This program was written on a windows7 platform.
Basically to get phash similarity you count the number of bits that are different. If that number is 0 the pictures are likely to be the same,
if it is over 10 the pictures are likely to be different, otherwise the pictures may be similar to some degree. Note it takes about 24 hours to process 700000 images with this program on my computer. I estimate it may take another 24 hours to compare these hashes up to 4 bit differences. At least twice that for 8 bit differences.
program JPEGInfo;
{$APPTYPE CONSOLE} {$R-}
uses strutils,sysutils,classes, md5,magick_wand, ImageMagick,DateUtils;
{ Bytes = array[0..0] of Byte;}
{ LongWords = array[0..255] of Byte;}
type
clr = record
red:byte;
green:byte;
blue:byte;
opacity:byte;
end;
var
allfile:text;
infile:text;
donedir,donefname, inrec:string;
skipped,deleted,added,modified,count:longword;
wand: PMagickWand;
updatefileexists, skipmd5check,notstarted,readnextrec :boolean;
dctmatrix : array[0..7,0..31] of real;
status: MagickBooleanType;
starttime:tdatetime;
filename:pchar;
udir,ufname:string;
Size : Integer;
timed:longint;
DirName: string;
procedure ThrowWandException(wand: PMagickWand);
var
description: PChar;
severity: ExceptionType;
begin
description := MagickGetException(wand, @severity);
WriteLn(Format('An error occurred. Description: %s', [description]));
description := MagickRelinquishMemory(description);
Abort;
end;
function pad(msg:string;len:byte):string;
var par : byte;
begin
result := msg;
if length(msg) > len then result := copy(msg,1,len);
while length(result) < len do result := result + ' ';
for par := 1 to len do if ord(result[par]) < 32 then result[par] := ' ';
end;
{read the next old file record into inrec and fileid into donefname,donedir}
procedure getnextok;
var i: integer;
procedure skipitem;
begin
while (i < length(inrec)) and (inrec[i] <> ' ') do inc(i);
end;
procedure skipspaces;
begin
while inrec[i] = ' ' do inc(i)
end;
begin
if eof(infile) then donedir := chr(255)
else begin
readln(infile,inrec);
i := 1;
skipitem; {md5} skipspaces; {fb54e12c44d2b14d4d3266251109799e}
skipitem; {md6} skipspaces; {2318170444932902d0541895584df4c2f8f38ad8b1d4dad899870760817422f2}
skipitem; {phash} skipspaces; {CCFB4EAEF3F30C0D}
skipitem; {date} skipspaces; {20151001010847}
skipitem; {size} skipspaces; { 901}
skipitem; {depth} skipspaces; { 8}
skipitem; {width} skipspaces; { 32}
skipitem; {height} skipspaces; { 32}
skipitem; {name} skipspaces; {01.gif }
donefname:= copy(inrec,i,999); {F:\CCC\01.gif}
i := rpos('\',donefname);
if i = 0 then writeln('error reading line from yalls:',inrec);
donedir := StringReplace(uppercase(copy(donefname,1,i)),'\',' ',[rfReplaceAll]);;
donefname := uppercase(copy(donefname,i+1,9999));
end;
readnextrec := false;
end;
function GetDif(time:longint):string;
var part: longint; t:longint;
begin
result := '';
t := time;
if time > 86400 then begin
part := t div 86400;
t := t mod 86400;
result := result + format('%.2d',[part])+' '; end;
if time > 3600 then begin
part := t div 3600;
t := t mod 3600;
result := result + format('%.2d',[part])+':'; end;
if time > 60 then begin
part := t div 60;
t := t mod 60;
result := result + format('%.2d',[part])+':'; end;
if time > 0 then
result := result + format('%.2d',[t]);
end;
function gethash:string;
var
intermediate : array[0..7,0..31] of real;
i,j,r,c,k:byte;
img:array[0..31,0..31] of byte;
dctresult:array[0..7,0..7] of real;
pimg: Pimage;
pack:record case integer of
1:( pkt:PPixelPacket);
2:( co:^clr);
end;
hash,one:qword; mean:DOUBLE;
begin
result := 'xxxxxxxxxxxxxxxx'; { in case an error occurs}
try
{Resize the image to 32x32. If it is my test case file
'f:\ccc\01.gif' then it is already at the correct size and I
found that if imagemagic resizes it the file gets changed ---
I did not want that, it might be faster to grayscale the image
first then resize but not if you use MagickQuantizeImage too slow}
if uppercase(filename) <> 'F:\CCC\01.GIF' then
MagickResizeImage(wand, 32, 32, LanczosFilter, 1.0);
{now we convert the image to grayscale - this procedure is similar to
MagickQuantizeImage(wand, 256, GRAYColorspace, iszero, iszero, iszero);
but is a LOT faster. BTW I don't know why I had to use variables
for this function. The conversion numbers were selected to be
close to what MagickQuantizeImage used. Note I had an issue
here. Under Q8 IM GetAuthenticPixels it returns an array of
byte values for colors but the lazarus interface mapped it to word sizes.
You could get an entire array of values with getauthenticpixels but
the speed won't be much different so I didn't bother.
We dump the results into the img array}
pimg := GetImageFromMagickWand(wand);
for j := 0 to 31 do begin
for i := 0 to 31 do begin
pack.pkt := GetAuthenticPixels(pimg, i, j, 1, 1, nil);
img[i,j] :=round(pack.co^.red*0.072+
pack.co^.green*0.715+
pack.co^.blue*0.213);
end;
end;
{results are in the img array. Now do matrix multiplcation
DCTMATRIX * img * transposed(DCTMATRIX)
since we only need the top 8x8 array we only do the
calculations for that part}
{intermediate = DCTMATRIX*IMG}
fillchar(intermediate,sizeof(intermediate),0);
for r := 0 to 7 do
for c := 0 to 31 do
for k := 0 to 31 do
intermediate[r,c] := intermediate[r,c] +
dctmatrix[r,k]*img[k,c];
{dctresult = intermediate*transposed(DCTMATRIX)}
fillchar(dctresult,sizeof(dctresult),0);
for r := 0 to 7 do
for c := 0 to 7 do
for k := 0 to 31 do
dctresult[r,c] := dctresult[r,c] +
intermediate[r,k]*dctmatrix[c,k]{inverted dctmatrix};
{now we need to calculate the arithmetic mean. Note the first value
(0,0) is much too large and affects the mean value too much.
To fix it is set to 0 but perhaps some other solution can be done
such as do a log on the value}
mean := 0;
dctresult[0,0] := 0;
for r := 0 to 7 do
for c := 0 to 7 do
mean := mean + dctresult[r,c];
mean := mean / 64;
{the hash is calculated by setting the bit of
each value greater than the mean}
one := 1;
hash := 0;
for r := 0 to 7 do
for c := 0 to 7 do begin
if dctresult[r,c] > mean then inc(hash,one);
one := one shl 1;
end;
one := 1;
{return the phash}
result := inttohex(hash,16);
finally
end;
end;
procedure processfile(dir:string;fname:string);
var
F: file;
idents:pchar;
i:integer;
md5,md6, date, fsize,depth,imageheight,imagewidth,fno:string;
P: Pointer;
begin
inc(count);
if (count mod 1000) = 1 then begin
timed := SecondsBetween(Now,starttime);
writeln('at count ',count-1,' E:'+ GetDif(timed)+' D:',deleted,' A:',added,' M:',modified,' S:',skipped,' ',dir,fname);
end;
filename := pchar(dir+fname);
udir := StringReplace(uppercase(dir),'\',' ',[rfReplaceAll]);;
ufname := uppercase(fname);
if updatefileexists then begin
if readnextrec then getnextok;
while (udir > donedir) or ((udir = donedir) and (ufname > donefname)) do begin getnextok; inc(deleted); end;
if skipmd5check then
if (udir+ufname) = (donedir+donefname) then begin
inc(skipped);
writeln(allfile,inrec);
readnextrec := true;
exit;
end;
end;
TRY
Assign(F,FileName);
FileMode:=fmOpenRead + fmShareDenyNone;
Reset(F, 1);
Size := FileSize(F);
GetMem(P, Size);
BlockRead(F, P^, Size);
md5 := md5print(md5buffer(p^,Size));
finally
FreeMem(P);
CloseFile(F);
end;
if updatefileexists then begin
if (udir+ufname) = (donedir+donefname) then begin
if copy(inrec,1,32) = md5 then begin
inc(skipped);
writeln(allfile,inrec);
readnextrec := true;
exit;
end;
inc(modified);
end
else inc(added);
end;
if notstarted then begin
notstarted := false;
writeln('Started at line:',count,' with file:',filename);
end;
try
wand := NewMagickWand;
status := MagickReadImage(wand, filename);
if (status = MagickFalse) then ThrowWandException(wand);
idents := magickidentifyimage(wand);
md6 := copy(idents,pos('signature:',idents)+11,64);
i := pos('date:modify:',idents)+13;
if i = 0 then i := pos('date:create:',idents)+13;
date := copy(idents,i ,4)+copy(idents,i+ 5,2)+copy(idents,i+ 8,2)+
copy(idents,i+11,2)+copy(idents,i+14,2)+copy(idents,i+17,2);
fsize := copy(idents,pos('Filesize:',idents)+10,64);
i := POS(Chr(10),FSIZE);
if i > 0 then fsize := copy(fsize,1,i-1);
i := pos('Depth:', idents)+7;
depth := copy(idents, i,1);
i := pos('Geometry:',idents)+10;
imageheight := copy(idents, i,99);
i := pos('x',imageheight);
imagewidth := copy(imageheight,i+1,99);
imageheight := copy(imageheight,1,i-1);
i := pos('+',imagewidth);
imagewidth := copy(imagewidth,1,i-1);
fno := fname;
if length(fno) < 22 then fno := pad(fno,25);
writeln(allfile,md5,' ',md6,' ',gethash,' ',date:14,' ',
size:12, depth:2,imageheight:7,imagewidth:7,' ',
fno,' ',filename);
finally
wand := DestroyMagickWand(wand);
end;
end;
// Recursive procedure to build a list of files
procedure FindFiles(StartDir: string);
var
SR: TSearchRec;
IsFound: Boolean;
begin
if StartDir[length(StartDir)] <> '\' then StartDir := StartDir + '\';
{ Build a list of the files in directory StartDir
(but not the directories!) }
IsFound := FindFirst(StartDir+'*.*', faAnyFile-faDirectory, SR) = 0;
while IsFound do begin
if (uppercase(extractFileExt(sr.name)) = '.JPG')
or (uppercase(extractFileExt(sr.name)) = '.GIF')
then processfile(StartDir,SR.Name);
IsFound := FindNext(SR) = 0;
end;
FindClose(SR);
// Build a list of subdirectories
IsFound := FindFirst(StartDir+'*.*', faDirectory, SR) = 0;
while IsFound do begin
if ((SR.Attr and faDirectory) <> 0) and (SR.Name[1] <> '.') then
FindFiles(StartDir + SR.Name);
IsFound := FindNext(SR) = 0;
end;
FindClose(SR);
end;
{$R *.res}
procedure makedct;
var c1:real; ii,p,q: byte;
begin
c1 := 1/sqrt(32);
for ii := 0 to 31 do dctmatrix[0,ii] := c1;
c1 := c1*2;
for p := 1 to 7 do
for q := 0 to 31 do
dctmatrix[p,q] := c1 * cos(pi/64*p*(2*q+1));
end;
begin
starttime := now; {keep track of time}
{have we started processing files as opposed to copy old recs}
notstarted := true;
skipmd5check := false; {on copy skip md5 check}
DirName := ParamStr(1); {did he pass a filename}
if length(dirname) > 3 then
if copy(upcase(dirname),1,4) = 'SKIP' then begin
dirname := paramstr(2);
skipmd5check := true;
end;
count := 0;
deleted := 0; modified := 0; added := 0; skipped := 0;
{allow access to file while writing it}
FileMode:=fmOpenwrite + fmShareDenyNone;
assign(allfile,'yall');
rewrite(allfile);
readnextrec := true;
updatefileexists := fileexists('yalls');
if updatefileexists then begin
{allow access to file while reading it}
FileMode:=fmOpenwrite + fmShareDenyNone;
assign(infile,'yalls');
reset(infile);
getnextok;
end;
MagickWandGenesis;
makedct; {make DCT matrix (or at least what is needed)}
{This is a set of directories that I desired to maintain}
if (dirname = 'F:\') or (dirname = 'F:\') or
(length(dirname) < 3) then begin
FindFiles('F:\CCC\');
FindFiles('F:\CD\');
FindFiles('F:\CHECK\');
FindFiles('F:\QDISK\');
end
else FindFiles(DirName); {or use a user supplied directory name}
writeln('found:',count,' files ',' Deleted:',deleted,' Added:',added,' Changed:',modified,' SKIPPED:',skipped);
close(allfile);
if updatefileexists then close(infile);
MagickWandTerminus;
{ I get paranoid and like to keep a few versions of the files around}
if fileexists('yalls.005') then deletefile('yalls.005');
if fileexists('yalls.004') then renamefile('yalls.004','yalls.005');
if fileexists('yalls.003') then renamefile('yalls.003','yalls.004');
if fileexists('yalls.002') then renamefile('yalls.002','yalls.003');
if fileexists('yalls.001') then renamefile('yalls.001','yalls.002');
if fileexists('yalls') then renamefile('yalls','yalls.001');
renamefile('yall','yalls');
end.