File-hash based (md5, sha1, etc.) for exact duplicates
Perceptual hashing (phash) for rescaled images
Feature-based (SIFT) for modified images
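For example, if you have 1 million images, you need an array of 1 million 64-bit hash values (8 MB). On some CPUs this fits in the L2/L3 cache! In practical use I have seen a Core i7 compare at over 1 Giga-hamm/sec; it is only a question of memory bandwidth to the CPU. A database of 1 billion images is practical on a 64-bit CPU (8 GB of RAM needed), and searches will not exceed 1 second!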
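A minimal sketch of that kind of linear scan, assuming each image has already been reduced to a 64-bit integer hash. The names `hamming`, `search`, and `database` are made up for illustration, and pure Python will be far slower than the native popcount loop the throughput figures refer to:

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def search(needle: int, haystack: List[int], max_distance: int) -> List[Tuple[int, int]]:
    """Return (index, distance) for every stored hash within max_distance bits."""
    matches = []
    for index, stored in enumerate(haystack):
        distance = hamming(needle, stored)
        if distance <= max_distance:
            matches.append((index, distance))
    # Closest matches first.
    return sorted(matches, key=lambda match: match[1])


# Hypothetical 64-bit hashes; real ones would come from a perceptual hash function.
database = [0xF0F0F0F0F0F0F0F0, 0xF0F0F0F0F0F0F0F1, 0x0123456789ABCDEF]
print(search(0xF0F0F0F0F0F0F0F0, database, max_distance=4))
```

The whole haystack is one flat array of integers, which is why the cost is dominated by how fast memory can be streamed through the CPU.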
For modified/cropped images it would seem a transform-invariant feature/keypoint detector like SIFT is the way to go. SIFT will produce good keypoints that will detect crop/rotate/mirror etc. However, the descriptor compare is very slow compared to the hamming distance used by phash. This is a major limitation. There are a lot of compares to do, since there are at most IxJxK descriptor compares to look up one image (I=num haystack images, J=target keypoints per haystack image, K=target keypoints per needle image).
The first is a standard approach in computer vision, keypoint matching. This may require some background knowledge to implement, and can be slow. The second method uses only elementary image processing, and is potentially faster than the first approach, and is straightforward to implement. However, what it gains in understandability, it lacks in robustness -- matching fails on scaled, rotated, or discolored images. The third method is both fast and robust, but is potentially the hardest to implement.
Better than picking 100 random points is picking 100 important points. Certain parts of an image have more information than others (particularly at edges and corners), and these are the ones you'll want to use for smart image matching. Google "keypoint extraction" and "keypoint matching" and you'll find quite a few academic papers on the subject. These days, SIFT keypoints are arguably the most popular, since they can match images under different scales, rotations, and lighting. Some SIFT implementations can be found here.
Another less robust but potentially faster solution is to build feature histograms for each image, and choose the image with the histogram closest to the input image's histogram. I implemented this as an undergrad, and we used 3 color histograms (red, green, and blue), and two texture histograms, direction and scale. I'll give the details below, but I should note that this only worked well for matching images VERY similar to the database images. Re-scaled, rotated, or discolored images can fail with this method, but small changes like cropping won't break the algorithm.
Computing the color histograms is straightforward -- just pick the range for your histogram buckets, and for each range, tally the number of pixels with a color in that range. For example, consider the "green" histogram, and suppose we choose 4 buckets for our histogram: 0-63, 64-127, 128-191, and 192-255. Then for each pixel, we look at the green value, and add a tally to the appropriate bucket. When we're done tallying, we divide each bucket total by the number of pixels in the entire image to get a normalized histogram for the green channel.
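The tallying described above could look roughly like this. It is a sketch that assumes the image is already available as a non-empty flat list of (red, green, blue) tuples; `channel_histogram` and the sample `image` are made-up names and data:

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def channel_histogram(pixels: Sequence[Pixel], channel: int, buckets: int = 4) -> List[float]:
    """Normalized histogram of one color channel (0=red, 1=green, 2=blue)."""
    width = 256 // buckets  # 64 values per bucket when buckets == 4
    counts = [0] * buckets
    for pixel in pixels:
        # Clamp so the top value still lands in the last bucket
        # when 256 is not evenly divisible by the bucket count.
        counts[min(pixel[channel] // width, buckets - 1)] += 1
    total = len(pixels)
    return [count / total for count in counts]


# Hypothetical 4-pixel image, flattened to a list of RGB tuples.
image = [(10, 0, 0), (20, 70, 0), (30, 130, 0), (40, 255, 0)]
green = channel_histogram(image, channel=1)
red = channel_histogram(image, channel=0)
```

Here each green value falls into a different bucket, so `green` is evenly spread, while every red value is below 64 and lands in the first bucket.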
For the texture direction histogram, we started by performing edge detection on the image. Each edge point has a normal vector pointing in the direction perpendicular to the edge. We quantized the normal vector's angle into one of 6 buckets between 0 and PI (since edges have 180-degree symmetry, we converted angles between -PI and 0 to be between 0 and PI). After tallying up the number of edge points in each direction, we have an un-normalized histogram representing texture direction, which we normalized by dividing each bucket by the total number of edge points in the image.
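As a rough sketch of the quantization step only, assume edge detection has already produced one normal vector (nx, ny) per edge point. How those normals are obtained is outside this example, and `direction_histogram` is a made-up name:

```python
import math
from typing import List, Sequence, Tuple


def direction_histogram(normals: Sequence[Tuple[float, float]], buckets: int = 6) -> List[float]:
    """Normalized histogram of edge-normal angles folded into [0, PI)."""
    counts = [0] * buckets
    for nx, ny in normals:
        # atan2 returns an angle in [-PI, PI]; the modulo folds it onto
        # [0, PI) because an edge looks the same after a 180-degree turn.
        angle = math.atan2(ny, nx) % math.pi
        index = min(int(angle / math.pi * buckets), buckets - 1)
        counts[index] += 1
    total = len(normals)
    return [count / total for count in counts]


# Hypothetical normals: right, up, down, and diagonal.
hist = direction_histogram([(1, 0), (0, 1), (0, -1), (1, 1)])
```

The "up" and "down" normals end up in the same bucket, which is the 180-degree symmetry mentioned above.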
|A.green_histogram.bucket_1 - B.green_histogram.bucket_1|
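One plausible reading of the expression above, and it is an assumption on my part, is that these absolute bucket differences are summed over every bucket of every histogram, and the database image with the smallest total wins. A sketch under that assumption, with made-up names `histogram_distance` and `closest_image`:

```python
from typing import Dict, List

Histograms = Dict[str, List[float]]  # histogram name -> normalized bucket values


def histogram_distance(a: Histograms, b: Histograms) -> float:
    """Sum of absolute bucket differences over every histogram in `a`."""
    return sum(
        abs(bucket_a - bucket_b)
        for name in a
        for bucket_a, bucket_b in zip(a[name], b[name])
    )


def closest_image(query: Histograms, database: Dict[str, Histograms]) -> str:
    """Name of the database image whose histograms are nearest to the query."""
    return min(database, key=lambda name: histogram_distance(query, database[name]))


# Hypothetical data: one green histogram per image.
query = {"green": [0.5, 0.5, 0.0, 0.0]}
database = {
    "a.png": {"green": [0.5, 0.25, 0.25, 0.0]},
    "b.png": {"green": [0.0, 0.0, 0.5, 0.5]},
}
best = closest_image(query, database)
```

Because every histogram is normalized, images of different sizes can be compared directly.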
A third approach that is probably much faster than the other two is using semantic texton forests (PDF). This involves extracting simple keypoints and using a collection of decision trees to classify the image. This is faster than simple SIFT keypoint matching, because it avoids the costly matching process, and keypoints are much simpler than SIFT, so keypoint extraction is much faster. However, it preserves the SIFT method's invariance to rotation, scale, and lighting, an important feature that the histogram method lacked.
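My mistake -- the Semantic Texton Forests paper isn't specifically about image matching, but rather about region labeling. The original paper that does matching is this one: Keypoint Recognition using Randomized Trees. Also, the papers below continue to develop the ideas and represent the state of the art (c. 2010):
Fast Keypoint Recognition using Random Ferns - faster and more scalable than Lepetit 06
BRIEF: Binary Robust Independent Elementary Features - less robust but very fast - I think the goal here is real-time matching on smartphones and other handhelds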
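What we loosely refer to as duplicates can be difficult for algorithms to discern. Your duplicates can be any of:
1. Exact duplicates
2. Near-exact duplicates (minor edits of the image, etc.)
3. Perceptual duplicates (same content, but different view, camera, etc.)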
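No. 1 and No. 2 are easier to solve. No. 3 is very subjective and still a research topic. I can offer a solution for No. 1 and No. 2. Both solutions use the excellent image hashing library: https://github.com/JohannesBuchner/imagehash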
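Exact duplicates
Exact duplicates can be found using a perceptual hashing measure. The phash library is quite good at this. I routinely use it to clean training data. Usage (from the GitHub site) is as simple as: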
from PIL import Image
import imagehash

# image_fns : List of training image files
img_hashes = {}  # hash -> first file seen with that hash
for img_fn in sorted(image_fns):
    img_hash = imagehash.average_hash(Image.open(img_fn))
    if img_hash in img_hashes:
        print('{} duplicate of {}'.format(img_fn, img_hashes[img_hash]))
    else:
        img_hashes[img_hash] = img_fn
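Near-exact duplicates
In this case you will have to set a threshold and compare the hash values by their distance from each other. This has to be done by trial and error for your image content.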
from itertools import combinations
from PIL import Image
import imagehash

# image_fns : List of training image files
# Threshold on the Hamming distance between hashes. The default 8x8
# average_hash has 64 bits, so distances range from 0 to 64; tune this.
epsilon = 50
# Hash each image once, then compare every unordered pair of files.
img_hashes = {fn: imagehash.average_hash(Image.open(fn)) for fn in image_fns}
for (img_fn1, hash1), (img_fn2, hash2) in combinations(img_hashes.items(), 2):
    if hash1 - hash2 < epsilon:
        print('{} is near duplicate of {}'.format(img_fn1, img_fn2))
I believe that dropping the size of the image down to an almost icon size, say 48x48, then converting to greyscale, then taking the difference between pixels, or Delta, should work well. Because we're comparing the change in pixel color, rather than the actual pixel color, it won't matter if the image is slightly lighter or darker. Large changes will matter since pixels getting too light/dark will be lost. You can apply this across one row, or as many as you like to increase the accuracy. At most you'd have 47x47=2,209 subtractions to make in order to form a comparable Key.
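A rough sketch of that idea, assuming the image has already been shrunk and converted to greyscale by some imaging library and is passed in as rows of 0-255 values. Keeping only the sign of each delta (brighter or not than the right-hand neighbor) is a simplification I am adding here, not something stated above; `delta_key` and `key_distance` are made-up names:

```python
from typing import List


def delta_key(gray: List[List[int]]) -> List[int]:
    """1 where a pixel is brighter than its right-hand neighbor, else 0."""
    key = []
    for row in gray:
        # Pair each pixel with the one to its right: width - 1 deltas per row.
        for left, right in zip(row, row[1:]):
            key.append(1 if left > right else 0)
    return key


def key_distance(a: List[int], b: List[int]) -> int:
    """Number of positions where two keys disagree."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical 2x3 greyscale image and a uniformly brighter copy of it.
original = [[10, 20, 15], [30, 25, 40]]
brighter = [[pixel + 50 for pixel in row] for row in original]
print(key_distance(delta_key(original), delta_key(brighter)))
```

Adding the same amount to every pixel leaves every delta unchanged, so the two keys match exactly, which is the lighter/darker tolerance described above.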