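Hashing a dictionary?

For caching purposes, I need to generate a cache key from GET arguments that are held in a dict.

Currently I'm using sha1(repr(sorted(my_dict.items()))) (sha1() is a convenience method that uses hashlib internally), but I'm curious whether there's a better way.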
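As a minimal sketch of the approach described in the question: the sha1() helper below is a hypothetical stand-in for the asker's internal convenience method, and cache_key is a name introduced here purely for illustration.

```python
import hashlib

def sha1(text):
    """Hypothetical stand-in for the convenience wrapper around hashlib."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def cache_key(params):
    # Sorting the items makes the key independent of insertion order.
    return sha1(repr(sorted(params.items())))

key_a = cache_key({"page": "2", "q": "python"})
key_b = cache_key({"q": "python", "page": "2"})
print(key_a == key_b)  # True
```

This works as long as the keys can be compared with each other and every value has a deterministic repr().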

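Current answer

The code below avoids Python's built-in hash() function, because it does not produce hashes that are consistent across restarts of Python (see "hash function in Python 3.3 returns different results between sessions"). make_hashable() converts the object into nested tuples, and make_hash_sha256() then converts the repr() of that result into a base64-encoded SHA256 hash.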

import hashlib
import base64

def make_hash_sha256(o):
    hasher = hashlib.sha256()
    hasher.update(repr(make_hashable(o)).encode())
    return base64.b64encode(hasher.digest()).decode()

def make_hashable(o):
    if isinstance(o, (tuple, list)):
        return tuple((make_hashable(e) for e in o))

    if isinstance(o, dict):
        return tuple(sorted((k,make_hashable(v)) for k,v in o.items()))

    if isinstance(o, (set, frozenset)):
        return tuple(sorted(make_hashable(e) for e in o))

    return o

o = dict(x=1,b=2,c=[3,4,5],d={6,7})
print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))

print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=
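One caveat, sketched under the assumption that keys or set elements may be of mixed types: on Python 3, sorted() raises TypeError when it has to compare, say, an int with a str. The variant below sorts by repr() instead. The name make_hashable_mixed is introduced here and is not part of the answer.

```python
def make_hashable_mixed(o):
    """Like make_hashable(), but orders entries by their repr() (an assumption)."""
    if isinstance(o, (tuple, list)):
        return tuple(make_hashable_mixed(e) for e in o)

    if isinstance(o, dict):
        pairs = ((k, make_hashable_mixed(v)) for k, v in o.items())
        return tuple(sorted(pairs, key=repr))

    if isinstance(o, (set, frozenset)):
        return tuple(sorted((make_hashable_mixed(e) for e in o), key=repr))

    return o

first = make_hashable_mixed({1: "a", "b": [2, 3]})
second = make_hashable_mixed({"b": [2, 3], 1: "a"})
print(first == second)  # True
```

Ordering by repr() is only as stable as the repr() of the contained objects, so this is suited to plain built-in types.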

2017-02-10 05:09:30

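Other answers

Edit: if all your keys are strings, then before you continue reading this answer, please see Jack O'Connor's simpler (and faster) solution (which also works for nested dictionaries).

Although an answer has been accepted, the title of the question is "Hashing a python dictionary", and the answer is incomplete with regard to that title. (With regard to the body of the question, the answer is complete.)

Nested dictionaries

If you search Stack Overflow for how to hash a dictionary, you might stumble upon this aptly titled question and leave unsatisfied if you are trying to hash multiply nested dictionaries. The answer above won't work in that case, and you'll have to implement some sort of recursive mechanism to retrieve the hash.

Here is one such mechanism: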

import copy

def make_hash(o):

  """
  Makes a hash from a dictionary, list, tuple or set to any level, that contains
  only other hashable types (including any lists, tuples, sets, and
  dictionaries).
  """

  if isinstance(o, (set, tuple, list)):

    return tuple([make_hash(e) for e in o])    

  elif not isinstance(o, dict):

    return hash(o)

  new_o = copy.deepcopy(o)
  for k, v in new_o.items():
    new_o[k] = make_hash(v)

  return hash(tuple(frozenset(sorted(new_o.items()))))
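For comparison, here is a variant sketch that skips the deepcopy and the sorting by hashing a frozenset of (key, hashed value) pairs. It is not the answer's function: the name make_hash_nocopy is introduced here, it always returns a single int, and it assumes all keys and leaf values are hashable.

```python
def make_hash_nocopy(o):
    """Recursively hash nested dicts, lists, tuples and sets without copying."""
    if isinstance(o, (tuple, list)):
        # Order matters for sequences.
        return hash(tuple(make_hash_nocopy(e) for e in o))
    if isinstance(o, (set, frozenset)):
        # Order must not matter for sets.
        return hash(frozenset(make_hash_nocopy(e) for e in o))
    if isinstance(o, dict):
        # A frozenset of pairs is independent of insertion order.
        return hash(frozenset((k, make_hash_nocopy(v)) for k, v in o.items()))
    return hash(o)

a = {"a": {"b": [1, 2]}, "c": {3, 4}}
b = {"c": {4, 3}, "a": {"b": [1, 2]}}
print(make_hash_nocopy(a) == make_hash_nocopy(b))  # True
```

Like anything built on hash(), the resulting value is only meaningful within a single Python process.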

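Bonus: hashing objects and classes

The hash() function works well when you hash classes or instances. However, here is one issue I found with hash() as regards objects: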

class Foo(object): pass
foo = Foo()
print (hash(foo)) # 1209812346789
foo.a = 1
print (hash(foo)) # 1209812346789

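The hash is the same, even after I've altered foo. This is because the identity of foo hasn't changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash whatever is actually changing. In this case, that is the __dict__ attribute: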

class Foo(object): pass
foo = Foo()
print (make_hash(foo.__dict__)) # 1209812346789
foo.a = 1
print (make_hash(foo.__dict__)) # -78956430974785
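The same idea can be folded into the class itself. The sketch below is an illustration rather than part of the answer: it assumes every attribute value is hashable, and it redefines Foo with a hypothetical keyword-argument constructor.

```python
class Foo:
    def __init__(self, **attrs):
        self.__dict__.update(attrs)

    def __eq__(self, other):
        return isinstance(other, Foo) and vars(self) == vars(other)

    def __hash__(self):
        # Hash the current attribute state instead of the object's identity.
        return hash(frozenset(vars(self).items()))

foo = Foo()
before = hash(foo)
foo.a = 1
after = hash(foo)
print(after == hash(Foo(a=1)))  # True
```

An object hashed this way should not be mutated while it is stored in a set or used as a dict key, because its hash would change underneath the container.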

Alas, when you attempt to do the same thing with the class itself:

print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy'

The class __dict__ attribute is not a normal dictionary:

print (type(Foo.__dict__)) # type <'dict_proxy'>
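On Python 3 the proxy type is exposed as types.MappingProxyType and prints as mappingproxy. A small sketch of checking for it and copying the non-dunder entries into a plain dict (the attribute x is made up for the example):

```python
import types

class Foo:
    x = 1

proxy = Foo.__dict__
print(isinstance(proxy, types.MappingProxyType))  # True on Python 3

# Copy only the entries that do not start with a double underscore.
public = {k: v for k, v in proxy.items() if not k.startswith("__")}
print(public)  # {'x': 1}
```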

Here is a mechanism similar to the previous one that handles classes appropriately:

import copy

DictProxyType = type(object.__dict__)

def make_hash(o):

  """
  Makes a hash from a dictionary, list, tuple or set to any level, that 
  contains only other hashable types (including any lists, tuples, sets, and
  dictionaries). In the case where other kinds of objects (like classes) need 
  to be hashed, pass in a collection of object attributes that are pertinent. 
  For example, a class can be hashed in this fashion:

    make_hash([cls.__dict__, cls.__name__])

  A function can be hashed like so:

    make_hash([fn.__dict__, fn.__code__])
  """

  if type(o) == DictProxyType:
    o2 = {}
    for k, v in o.items():
      if not k.startswith("__"):
        o2[k] = v
    o = o2  

  if isinstance(o, (set, tuple, list)):

    return tuple([make_hash(e) for e in o])    

  elif not isinstance(o, dict):

    return hash(o)

  new_o = copy.deepcopy(o)
  for k, v in new_o.items():
    new_o[k] = make_hash(v)

  return hash(tuple(frozenset(sorted(new_o.items()))))

You can use this to return a tuple of hashes for however many elements you like:

# -7666086133114527897
print (make_hash(func.__code__))

# (-7666086133114527897, 3527539)
print (make_hash([func.__code__, func.__dict__]))

# (-7666086133114527897, 3527539, -509551383349783210)
print (make_hash([func.__code__, func.__dict__, func.__name__]))

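Note: all of the code above assumes Python 3.x. It has not been tested on earlier versions, although I assume make_hash() will work in, say, 2.7.2. As for making the examples work there, I do know that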

func.__code__

should be replaced with

func.func_code

2012-01-03 15:05:37

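While hash(frozenset(x.items())) and hash(tuple(sorted(x.items()))) work, they do a lot of work allocating and copying all the key-value pairs. A hash function really should avoid a lot of memory allocation.

A little bit of math can help here. The problem with most hash functions is that they assume order matters. To hash an unordered structure, you need a commutative operation. Multiplication doesn't work well, because any element hashing to 0 makes the whole product 0. Bitwise & and | tend toward all 0s or all 1s. There are two good candidates: addition and xor.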

from functools import reduce
from operator import xor

class hashable(dict):
    def __hash__(self):
        return reduce(xor, map(hash, self.items()), 0)

    # Alternative
    def __hash__(self):
        return sum(map(hash, self.items()))

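One subtle point: xor works, in part, because dict guarantees that keys are unique. sum works because Python truncates the value returned by __hash__() to the platform's hash width.

If you want to hash a multiset, sum is preferable. With xor, {a} would hash to the same value as {a, a, a}, because x ^ x ^ x = x.

If you really need the guarantees that SHA provides, this won't work for you. But for using a dictionary in a set, it will work fine; Python containers are resilient to some collisions, and the underlying hash functions are quite good.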
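The difference is easy to check with small integers, which hash to themselves in CPython. xor_hash and sum_hash are helper names introduced for this sketch only:

```python
from functools import reduce
from operator import xor

def xor_hash(items):
    return reduce(xor, map(hash, items), 0)

def sum_hash(items):
    return sum(map(hash, items))

# xor cannot tell one copy of an element from three copies.
print(xor_hash([5]) == xor_hash([5, 5, 5]))        # True
# Addition keeps the multiplicity.
print(sum_hash([5]) == sum_hash([5, 5, 5]))        # False
# Both operations are commutative, so order does not matter.
print(xor_hash([1, 2, 3]) == xor_hash([3, 1, 2]))  # True
```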

2021-01-08 08:13:03

If your dictionary is not nested, you can make a frozenset from the dict's items and use hash():

hash(frozenset(my_dict.items()))

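This is much less computationally intensive than generating the JSON string or a representation of the dictionary.

Update: please see the comments below for why this approach might not produce a stable result.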
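A short sketch of where this approach does and does not hold up. Within one process the result is independent of insertion order, but it fails for unhashable values, and because str hashes are salted per process (unless PYTHONHASHSEED is fixed), the number can differ between runs:

```python
my_dict = {"a": 1, "b": 2}
same = {"b": 2, "a": 1}

# Insertion order does not affect the result within one process.
print(hash(frozenset(my_dict.items())) == hash(frozenset(same.items())))  # True

# A nested (unhashable) value makes the approach fail outright.
nested = {"a": [1, 2]}
try:
    hash(frozenset(nested.items()))
    nested_ok = True
except TypeError:
    nested_ok = False
print(nested_ok)  # False
```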

2011-05-04 13:24:33

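Updating from a 2013 reply…

None of the above answers seem reliable to me. The reason is the use of items(). As far as I know, it comes out in a machine-dependent order.

How about this instead?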

import hashlib

def dict_hash(the_dict, *ignore):
    if ignore:  # Sometimes you don't care about some items
        interesting = the_dict.copy()
        for item in ignore:
            if item in interesting:
                interesting.pop(item)
        the_dict = interesting
    result = hashlib.sha1(
        # hashlib needs bytes on Python 3, so encode the string first
        ('%s' % sorted(the_dict.items())).encode('utf-8')
    ).hexdigest()
    return result

2013-03-04 18:10:36

To preserve key order, instead of hash(str(dictionary)) or hash(json.dumps(dictionary)), I prefer this quick-and-dirty solution:

from pprint import pformat
h = hash(pformat(dictionary))

It even works for types like DateTime that are not JSON serializable.
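Note that hash() of a string is salted per process, so the value above is not comparable across runs. If a stable digest is wanted for such values, one option is sketched below. It assumes string keys and that str() gives an acceptable canonical form for types JSON cannot encode; json_hash is a name introduced here.

```python
import hashlib
import json
from datetime import datetime

def json_hash(d):
    # sort_keys gives a canonical key order; default=str handles datetime etc.
    encoded = json.dumps(d, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

one = json_hash({"a": 1, "t": datetime(2015, 1, 30)})
two = json_hash({"t": datetime(2015, 1, 30), "a": 1})
print(one == two)  # True
```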

2015-01-30 00:45:17
