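I want to create a base table of images and then compare any new image against it, to determine whether the new image is an exact (or close) duplicate of one already in the table.
For example: if you want to avoid storing the same image 100 times, you could store one copy of it and provide reference links to it. When a new image comes in, you want to compare it against the existing images to make sure it is not a duplicate... ideas?
One idea of mine was to reduce each image to a small thumbnail, then randomly pick 100 pixel locations and compare them.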
Current answer
I have an idea that should work and is likely to be very fast. Subsample each image to, say, 80x60 resolution or comparable, and convert it to greyscale (converting after subsampling is faster). Process both images you want to compare this way. Then compute the normalised sum of squared differences between the two images (the query image and each image from the database) or, even better, the normalised cross-correlation, which gives a response closer to 1 when the images are similar. If the images are similar, you can then move on to more sophisticated techniques to verify that they really are the same image. Obviously this algorithm is linear in the number of images in your database, but it should still be very fast: up to 10,000 images per second on modern hardware. If you need invariance to rotation, you can compute a dominant gradient for the small image and rotate the whole coordinate system to a canonical orientation, though this will be slower. And no, there is no invariance to scale here.
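The comparison step can be sketched roughly as follows. This is a minimal sketch, assuming both images have already been subsampled to the same small size and converted to greyscale, and are passed in as flat lists of pixel intensities. It uses the zero-mean variant of normalised cross-correlation; the names `ncc` and `find_candidates` and the 0.95 threshold are illustrative assumptions, not part of the answer.

```python
import math

def ncc(a, b):
    """Zero-mean normalised cross-correlation of two equal-length pixel lists.

    Returns a value in [-1, 1]; values near 1 mean the images are similar.
    """
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        # At least one image is perfectly flat, so correlation is undefined.
        # Treat two flat images as matching and anything else as not.
        return 1.0 if da == db else 0.0
    return sum(x * y for x, y in zip(da, db)) / denom

def find_candidates(query, database, threshold=0.95):
    """Linear scan: return the keys of stored images that correlate strongly."""
    return [key for key, pixels in database.items()
            if ncc(query, pixels) >= threshold]
```

Anything returned by `find_candidates` is only a candidate match and would still go through the more careful verification step the answer mentions.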
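If you want something more general, or you are working with a large database (millions of images), then you need to look into image retrieval theory (loads of papers have appeared in the last 5 years). There are some pointers in the other answers. But that might be overkill, and the suggested histogram approach will do the job. Still, I would think that a combination of several different fast approaches would be even better.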
Other answers
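The best method I know of is to use a perceptual hash. There appears to be a good open source implementation of such a hash available at:
http://phash.org/
The main idea is that each image is reduced to a small hash code or "fingerprint" by identifying salient features in the original image file and hashing a compact representation of those features (rather than hashing the image data directly). This means that the false positive rate is much lower than with a simplistic approach such as reducing images to a tiny thumbprint-sized image and comparing thumbprints.
pHash offers several types of hash and can be used for images, audio or video.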
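To make the fingerprint idea concrete, here is a minimal pure-Python sketch of one common perceptual-hash scheme: take a 2-D DCT of a small greyscale image, keep the low-frequency coefficients, and threshold them. It is an illustration under stated assumptions, not the pHash library's actual implementation; the function names, the 32x32 input size and the 8x8 hash size are all assumptions made for the example.

```python
import math

def dct_1d(values):
    """Unnormalised DCT-II of a sequence of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]

def dct_hash(pixels, hash_size=8):
    """Return hash_size * hash_size bits for a square greyscale image.

    `pixels` is a list of rows (e.g. 32 rows of 32 values), already downscaled.
    """
    # 2-D DCT: transform every row, then every column of the result.
    rows = [dct_1d(row) for row in pixels]
    cols = [dct_1d(col) for col in zip(*rows)]
    # cols[u][v] is horizontal frequency u, vertical frequency v.
    # Keep only the lowest frequencies, which describe the coarse structure.
    low = [cols[u][v] for v in range(hash_size) for u in range(hash_size)]
    # Threshold at the upper of the two middle values of the sorted list.
    threshold = sorted(low)[len(low) // 2]
    return [1 if c > threshold else 0 for c in low]

def hamming(bits_a, bits_b):
    """Number of differing bits; small distances suggest similar images."""
    return sum(a != b for a, b in zip(bits_a, bits_b))
```

Two images would then be compared with `hamming(dct_hash(a), dct_hash(b))`, with the acceptable distance chosen by experiment.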
I believe that shrinking the image down to almost icon size, say 48x48, converting it to greyscale, and then taking the difference between adjacent pixels (the delta) should work well. Because we are comparing the change in pixel color rather than the actual pixel color, it won't matter if the image is slightly lighter or darker. Large changes will still matter, since detail in pixels that become too light or too dark is lost. You can apply this across one row, or across as many rows as you like to increase the accuracy. With 47 differences per row of a 48x48 image, you would have at most 48x47=2,256 subtractions to make in order to form a comparable key.
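A minimal sketch of such a delta key, assuming the image is already shrunk and greyscale and is given as a list of rows of intensities. The answer does not say how the deltas become a key; keeping only the sign of each difference (an approach often called a difference hash) is an assumption made here, as are the function names.

```python
def delta_key(pixels):
    """Build a key from the sign of the difference between horizontal neighbours.

    `pixels` is a list of rows of greyscale values (e.g. 48 rows of 48 values).
    Each bit is 1 when a pixel is brighter than the one to its right.
    """
    return [
        1 if row[x] > row[x + 1] else 0
        for row in pixels
        for x in range(len(row) - 1)
    ]

def hamming(key_a, key_b):
    """Number of positions where two equal-length keys differ."""
    return sum(a != b for a, b in zip(key_a, key_b))
```

Adding the same offset to every pixel leaves every comparison unchanged, which is why a uniformly lighter or darker copy produces the same key as long as no pixel clips.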
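If you have a large number of images, look into a Bloom filter, which uses multiple hashes for a probabilistic but efficient result. If the number of images is not huge, then a cryptographic hash like md5 should be sufficient.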
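A rough sketch of how the two suggestions could fit together, assuming images arrive as raw bytes. The `BloomFilter` class, its default sizes and the helper `is_probably_duplicate` are hypothetical names written for this example. Note that a cryptographic hash only matches byte-identical files, and a Bloom filter can return false positives, so a hit should be confirmed against the real store.

```python
import hashlib

class BloomFilter:
    """A small Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, data):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, data):
        for pos in self._positions(data):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, data):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(data))

def is_probably_duplicate(image_bytes, seen):
    """Return True if these exact bytes were probably seen before.

    Otherwise record their md5 fingerprint in `seen` and return False.
    """
    fingerprint = hashlib.md5(image_bytes).digest()
    if fingerprint in seen:
        return True
    seen.add(fingerprint)
    return False
```

For a small collection, a plain `set` of md5 digests would do the same job with no false positives; the Bloom filter only pays off when the set of digests no longer fits comfortably in memory.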
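What we loosely refer to as duplicates can be difficult for algorithms to discern. Your duplicates can be any of:
1. Exact duplicates
2. Near-exact duplicates (minor edits of the image, etc.)
3. Duplicates in content (same content, but a different viewpoint, camera, etc.)
No. 1 and No. 2 are easier to solve. No. 3 is very subjective and still a research topic. I can offer solutions for No. 1 and No. 2. Both solutions use the excellent image hashing library imagehash: https://github.com/JohannesBuchner/imagehash
Exact duplicates
Exact duplicates can be found using a perceptual hashing measure. The phash library is quite good at this. I routinely use it to clean training data. Usage (from the GitHub site) is as simple as: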
    from PIL import Image
    import imagehash

    # image_fns: list of training image file paths
    img_hashes = {}

    for img_fn in sorted(image_fns):
        # ImageHash objects are hashable, so they can be used as dict keys.
        img_hash = imagehash.average_hash(Image.open(img_fn))
        if img_hash in img_hashes:
            print('{} duplicate of {}'.format(img_fn, img_hashes[img_hash]))
        else:
            img_hashes[img_hash] = img_fn
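Near-exact duplicates
In this case you have to set a threshold and compare the hash values by their distance from each other. The threshold has to be found by trial and error for your image content.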
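    from itertools import combinations

    from PIL import Image
    import imagehash

    # image_fns: list of training image file paths
    # Hamming-distance threshold, found by trial and error. With the default
    # 8x8 hash, distances range from 0 to 64, so lower values are stricter.
    epsilon = 50

    # Hash every image once, then compare every unordered pair of images.
    img_hashes = {fn: imagehash.average_hash(Image.open(fn)) for fn in image_fns}

    for (img_fn1, hash1), (img_fn2, hash2) in combinations(img_hashes.items(), 2):
        # Subtracting two ImageHash objects gives the number of differing bits.
        if hash1 - hash2 < epsilon:
            print('{} is near duplicate of {}'.format(img_fn1, img_fn2))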