Fastest way to check if a value exists in a list

What is the fastest way to check whether a value exists in a very large list?

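Current answer

It sounds like your application might benefit from using a Bloom filter data structure.

In short, a Bloom filter lookup can tell you very quickly whether a value is definitely NOT present in a set. Otherwise, you can fall back to a slower lookup to get the index of a value that MIGHT be in the list. So if your application tends to get the "not found" result much more often than the "found" result, you may see a speed-up by adding a Bloom filter.

For details, Wikipedia provides a good overview of how Bloom filters work, and a web search for "python bloom filter library" will turn up at least a couple of useful implementations.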
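To make the idea concrete, here is a minimal sketch of a Bloom filter sitting in front of a slow list lookup. It is not taken from any particular library: the class name BloomFilter, the helper find_index, and the default sizes (num_bits, num_hashes) are all assumptions chosen for illustration, and values are hashed through their repr, which assumes equal values have equal reprs.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        # Illustrative defaults, not tuned for any particular workload.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a single hash function.
        data = repr(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def find_index(values, bloom, target):
    """Return the index of target in values, or -1 if it is absent."""
    if not bloom.might_contain(target):
        return -1  # definitely absent: skip the slow scan entirely
    try:
        return values.index(target)  # possibly present: confirm the slow way
    except ValueError:
        return -1  # the filter reported a false positive
```

Build the filter once by calling add() for every element of the list, then route lookups through find_index(). The filter only pays off when most lookups are misses; a hit still costs the full list scan.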

2016-01-27 22:46:39

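Other answers

As others have noted, in can be very slow for large lists. Here is a comparison of the performance of in, set, and bisect. Note that the time (in seconds) is plotted on a log scale.

Test code: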

import random
import bisect
import matplotlib.pyplot as plt
import math
import time


def method_in(a, b, c):
    start_time = time.time()
    for i, x in enumerate(a):
        if x in b:
            c[i] = 1
    return time.time() - start_time


def method_set_in(a, b, c):
    start_time = time.time()
    s = set(b)
    for i, x in enumerate(a):
        if x in s:
            c[i] = 1
    return time.time() - start_time


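def method_bisect(a, b, c):
    start_time = time.time()
    b.sort()  # sorts b in place; the sort time is included in the measurement
    for i, x in enumerate(a):
        index = bisect.bisect_left(b, x)
        # bisect_left returns an insertion point in b, so bound-check against len(b)
        if index < len(b):
            if x == b[index]:
                c[i] = 1
    return time.time() - start_time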


def profile():
    time_method_in = []
    time_method_set_in = []
    time_method_bisect = []

    # adjust range down if runtime is too long or up if there are too many zero entries in any of the time_method lists
    Nls = [x for x in range(10000, 30000, 1000)]
    for N in Nls:
        a = [x for x in range(0, N)]
        random.shuffle(a)
        b = [x for x in range(0, N)]
        random.shuffle(b)
        c = [0 for x in range(0, N)]

        time_method_in.append(method_in(a, b, c))
        time_method_set_in.append(method_set_in(a, b, c))
        time_method_bisect.append(method_bisect(a, b, c))

    plt.plot(Nls, time_method_in, marker='o', color='r', linestyle='-', label='in')
    plt.plot(Nls, time_method_set_in, marker='o', color='b', linestyle='-', label='set')
    plt.plot(Nls, time_method_bisect, marker='o', color='g', linestyle='-', label='bisect')
    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc='upper left')
    plt.yscale('log')
    plt.show()


profile()

2016-12-04 20:44:39

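The original question was:

What's the fastest way to know whether a value exists in a list (a list with millions of values in it), and what its index is?

Thus there are two things to find:

whether an item is in the list, and what its index is (if it is in the list).

To this end, I modified @xslittlegrass's code to compute the indexes in all cases, and added an additional method.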

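Results

The methods are:

1. in: basically, if x in b: return b.index(x)
2. try: try/except around b.index(x) (skips having to check whether x is in b)
3. set: basically, if x in set(b): return b.index(x)
4. bisect: sort b together with its indexes, then binary search for x in sorted(b). Note the modification from @xslittlegrass, whose version returns the index in the sorted b rather than in the original b.
5. reverse: build a reverse lookup dictionary d for b; then d[x] gives the index of x.

The results show that method 5 is the fastest.

Interestingly, the try and set methods are equivalent in time.

Test code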

import random
import bisect
import matplotlib.pyplot as plt
import math
import timeit
import itertools

def wrapper(func, *args, **kwargs):
    """Produce a zero-argument callable that timeit can call."""
    # Reference https://www.pythoncentral.io/time-a-python-function/
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def method_in(a,b,c):
    for i,x in enumerate(a):
        if x in b:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_try(a,b,c):
    for i, x in enumerate(a):
        try:
            c[i] = b.index(x)
        except ValueError:
            c[i] = -1
    return c  # return c like the other methods do

def method_set_in(a,b,c):
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

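def method_bisect(a,b,c):
    """Find indexes using bisection."""

    # Create a sorted copy of b that remembers each value's original index
    bsorted = sorted([(x, i) for i, x in enumerate(b)], key = lambda t: t[0])

    for i,x in enumerate(a):
        # (x, ) sorts before every (x, i), so this finds the first entry for x
        index = bisect.bisect_left(bsorted,(x, ))
        c[i] = -1
        # the insertion point refers to bsorted, so bound-check against its length
        if index < len(bsorted):
            if x == bsorted[index][0]:
                c[i] = bsorted[index][1]  # index in the b array

    return c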

def method_reverse_lookup(a, b, c):
    reverse_lookup = {x:i for i, x in enumerate(b)}
    for i, x in enumerate(a):
        c[i] = reverse_lookup.get(x, -1)
    return c

def profile():
    Nls = [x for x in range(1000,20000,1000)]
    number_iterations = 10
    methods = [method_in, method_try, method_set_in, method_bisect, method_reverse_lookup]
    time_methods = [[] for _ in range(len(methods))]

    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]

        for i, func in enumerate(methods):
            wrapped = wrapper(func, a, b, c)
            time_methods[i].append(math.log(timeit.timeit(wrapped, number=number_iterations)))

    markers = itertools.cycle(('o', '+', '.', '>', '2'))
    colors = itertools.cycle(('r', 'b', 'g', 'y', 'c'))
    labels = itertools.cycle(('in', 'try', 'set', 'bisect', 'reverse'))

    for i in range(len(time_methods)):
        plt.plot(Nls,time_methods[i],marker = next(markers),color=next(colors),linestyle='-',label=next(labels))

    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()

profile()

2019-11-13 19:58:19

7 in a

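The clearest and fastest way to do it.

You can also consider using a set, but constructing that set from your list may take more time than the faster membership tests save. The only way to be certain is to benchmark properly. (It also depends on which operations you need.)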
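One way to run that benchmark is sketched below with timeit. The function names (count_hits_list, count_hits_set, compare) and the sizes are assumptions picked for illustration; the set timing deliberately includes the cost of building the set, since that is the trade-off in question.

```python
import random
import timeit


def count_hits_list(queries, values):
    # Each "in" on a list is a linear scan: O(len(values)) per query.
    return sum(1 for q in queries if q in values)


def count_hits_set(queries, values):
    # Building the set costs O(len(values)) once;
    # each lookup afterwards is O(1) on average.
    lookup = set(values)
    return sum(1 for q in queries if q in lookup)


def compare(n_values=100_000, n_queries=1, repeats=3):
    """Return (best list time, best set time including set construction)."""
    values = list(range(n_values))
    random.shuffle(values)
    # Roughly half of the queries miss, since they range over 2 * n_values.
    queries = [random.randrange(2 * n_values) for _ in range(n_queries)]
    t_list = min(timeit.repeat(
        lambda: count_hits_list(queries, values), number=1, repeat=repeats))
    t_set = min(timeit.repeat(
        lambda: count_hits_set(queries, values), number=1, repeat=repeats))
    return t_list, t_set


if __name__ == "__main__":
    for n_queries in (1, 10, 100):
        t_list, t_set = compare(n_queries=n_queries)
        print(f"{n_queries:>4} queries: list {t_list:.6f}s, "
              f"set (incl. build) {t_set:.6f}s")
```

The number of lookups per set build is the variable to watch: for a single lookup the list scan can win, while repeated lookups against the same data tend to favor the set. Measure with your own data sizes before deciding.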

2011-09-27 15:25:11

You could put your items into a set. Set lookups are very efficient.

Try:

s = set(a)
if 7 in s:
  pass  # do stuff

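In a comment you say that you'd like to get the index of the element. Unfortunately, sets have no notion of element position. An alternative is to pre-sort your list and then use binary search every time you need to find an element.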
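A minimal sketch of that pre-sort plus binary search approach, using the standard bisect module, might look like this. The helper names build_sorted_index and lookup_index are made up for illustration, and the sketch assumes the list elements are mutually comparable.

```python
import bisect


def build_sorted_index(values):
    """Sort (value, original_index) pairs once; costs O(n log n)."""
    pairs = sorted((v, i) for i, v in enumerate(values))
    keys = [v for v, _ in pairs]        # sorted values, for bisect
    positions = [i for _, i in pairs]   # matching indexes in the original list
    return keys, positions


def lookup_index(keys, positions, target):
    """Return the original index of target, or -1 if absent; O(log n)."""
    k = bisect.bisect_left(keys, target)
    if k < len(keys) and keys[k] == target:
        return positions[k]
    return -1


a = [30, 10, 20]
keys, positions = build_sorted_index(a)
print(lookup_index(keys, positions, 20))  # index of 20 in the original list
print(lookup_index(keys, positions, 25))  # -1, because 25 is not in the list
```

The sort is paid once up front, so this only makes sense when the list is searched many times and does not change between searches. With duplicate values, the pairs sort by value and then by index, so the first occurrence in the original list is returned.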

2011-09-27 15:25:12

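from collections.abc import Container

# iter is a built-in function, not a type; Container describes anything that supports "in"
def check_availability(element, collection: Container):
    return element in collection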

Usage

check_availability('a', [1,2,3,4,'a','b','c'])

I believe this is the fastest way to know whether a chosen value is in an array.

2011-09-27 15:33:49
