检查列表中是否存在值的最快方法

检查一个值是否存在于一个非常大的列表的最快方法是什么?

当前回答

def check_availability(element, collection: iter):
    return element in collection

使用

check_availability('a', [1,2,3,4,'a','b','c'])

我相信这是知道所选值是否在数组中的最快方法。

2011-09-27 15:33:49

其他回答

a = [4,2,3,1,5,6]

index = dict((y,x) for x,y in enumerate(a))
try:
   a_index = index[7]
except KeyError:
   print "Not found"
else:
   print "found"

只有在a没有改变的情况下，这才会是一个好主意，因此我们可以只执行一次dict()部分，然后重复使用它。如果a确实发生了变化，请提供更多关于您正在做什么的细节。

2011-09-27 16:26:35

你可以把你的物品放在一个集合里。设置查找非常有效。

Try:

s = set(a)
if 7 in s:
  # do stuff

在注释中，你说你想获取元素的索引。不幸的是，集合没有元素位置的概念。另一种方法是对列表进行预先排序，然后在每次需要查找元素时使用二分搜索。

2011-09-27 15:25:12

最初的问题是:

知道一个值是否存在于一个列表(一个列表它有数百万个值)，它的索引是什么?

因此，有两件事需要发现:

是列表中的一项，和索引是什么(如果在列表中)。

为此，我修改了@xslittlegrass代码来计算所有情况下的索引，并添加了一个额外的方法。

结果

方法是:

基本上，if x In b: return b。index(x) 在b.index(x)上尝试/捕获(跳过必须检查x是否在b中) Set——基本上如果x在Set (b):返回b.index(x) 平分——对b和它的下标排序，对x在排序(b)中进行二分搜索。注意来自@ xsllittlegrass的mod，它返回排序后的b的下标，而不是原来的b) 反向——为b形成一个反向查找字典d;然后 D [x]提供了x的索引。

结果表明，方法5速度最快。

有趣的是，try方法和set方法在时间上是等价的。

测试代码

import random
import bisect
import matplotlib.pyplot as plt
import math
import timeit
import itertools

def wrapper(func, *args, **kwargs):
    " Use to produced 0 argument function for call it"
    # Reference https://www.pythoncentral.io/time-a-python-function/
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def method_in(a,b,c):
    for i,x in enumerate(a):
        if x in b:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_try(a,b,c):
    for i, x in enumerate(a):
        try:
            c[i] = b.index(x)
        except ValueError:
            c[i] = -1

def method_set_in(a,b,c):
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_bisect(a,b,c):
    " Finds indexes using bisection "

    # Create a sorted b with its index
    bsorted = sorted([(x, i) for i, x in enumerate(b)], key = lambda t: t[0])

    for i,x in enumerate(a):
        index = bisect.bisect_left(bsorted,(x, ))
        c[i] = -1
        if index < len(a):
            if x == bsorted[index][0]:
                c[i] = bsorted[index][1]  # index in the b array

    return c

def method_reverse_lookup(a, b, c):
    reverse_lookup = {x:i for i, x in enumerate(b)}
    for i, x in enumerate(a):
        c[i] = reverse_lookup.get(x, -1)
    return c

def profile():
    Nls = [x for x in range(1000,20000,1000)]
    number_iterations = 10
    methods = [method_in, method_try, method_set_in, method_bisect, method_reverse_lookup]
    time_methods = [[] for _ in range(len(methods))]

    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]

        for i, func in enumerate(methods):
            wrapped = wrapper(func, a, b, c)
            time_methods[i].append(math.log(timeit.timeit(wrapped, number=number_iterations)))

    markers = itertools.cycle(('o', '+', '.', '>', '2'))
    colors = itertools.cycle(('r', 'b', 'g', 'y', 'c'))
    labels = itertools.cycle(('in', 'try', 'set', 'bisect', 'reverse'))

    for i in range(len(time_methods)):
        plt.plot(Nls,time_methods[i],marker = next(markers),color=next(colors),linestyle='-',label=next(labels))

    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()

profile()

2019-11-13 19:58:19

这不是代码，而是快速搜索的算法。

如果您的列表和您正在寻找的值都是数字，这是相当简单的。如果字符串:看底部:

-Let "n" be the length of your list -Optional step: if you need the index of the element: add a second column to the list with current index of elements (0 to n-1) - see later Order your list or a copy of it (.sort()) Loop through: Compare your number to the n/2th element of the list If larger, loop again between indexes n/2-n If smaller, loop again between indexes 0-n/2 If the same: you found it Keep narrowing the list until you have found it or only have 2 numbers (below and above the one you are looking for) This will find any element in at most 19 steps for a list of 1.000.000 (log(2)n to be precise)

如果您还需要您的数字的原始位置，请在第二索引列中查找。

如果您的列表不是由数字组成的，该方法仍然有效，并且将是最快的，但您可能需要定义一个可以比较/排序字符串的函数。

当然，这需要使用sorted()方法，但如果您不断重用相同的列表进行检查，那么这样做可能是值得的。

2015-09-24 13:35:02

空间数据的边缘情况

可能有更快的算法来处理空间数据(例如重构以使用k-d树)，但检查向量是否在数组中的特殊情况是有用的:

如果你有空间数据(即笛卡尔坐标) 如果你有整数掩码(即数组过滤)

在这种情况下，我想知道由两点定义的(无向)边是否在(无向)边的集合中，这样

(pair in unique_pairs) | (pair[::-1] in unique_pairs) for pair in pairs

其中pair构成两个任意长度的向量(即形状(2,N))。

如果这些向量之间的距离是有意义的，那么检验可以用一个浮点不等式来表示

test_result = Norm(v1 - v2) < Tol

和“值存在于列表”是简单的任何(test_result)。

下面是整数对和R3向量对的示例代码和虚拟测试集生成器。

# 3rd party
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt

# optional
try:
    from tqdm import tqdm
except ModuleNotFoundError:
    def tqdm(X, *args, **kwargs):
        return X
    print('tqdm not found. tqdm is a handy progress bar module.')
    

def get_float_r3_pairs(size):
    """ generate dummy vector pairs in R3  (i.e. case of spatial data) """
    coordinates = np.random.random(size=(size, 3))
    pairs = []
    for b in coordinates:
        for a in coordinates:
            pairs.append((a,b))
    pairs = np.asarray(pairs)
    return pairs
    
        
def get_int_pairs(size):
    """ generate dummy integer pairs (i.e. case of array masking) """
    coordinates = np.random.randint(0, size, size)
    pairs = []
    for b in coordinates:
        for a in coordinates:
            pairs.append((a,b))
    pairs = np.asarray(pairs)
    return pairs


def float_tol_pair_in_pairs(pair:np.ndarray, pairs:np.ndarray) -> np.ndarray:
    """
    True if abs(a0 - b0) <= tol & abs(a1 - b1) <= tol for (ai1, aj2), (bi1, bj2)
    in [(a01, a02), ... (aik, ajl)]
    
    NB this is expected to be called in iteration so no sanitization is performed.

    Parameters
    ----------
    pair : np.ndarray
        pair of vectors with shape (2, M)
    pairs : np.ndarray
        collection of vector pairs with shape (N, 2, M)

    Returns
    -------
    np.ndarray
        (pair in pairs) | (pair[::-1] in pairs).
    """
    m1 = np.sum( abs(LA.norm(pairs - pair, axis=2)) <= (1e-03, 1e-03), axis=1 ) == 2
    m2 = np.sum( abs(LA.norm(pairs - pair[::-1], axis=2)) <= (1e-03, 1e-03), axis=1 ) == 2
    return m1 | m2


def get_unique_pairs(pairs:np.ndarray) -> np.ndarray:
    """
    apply float_tol_pair_in_pairs for pair in pairs
    
    Parameters
    ----------
    pairs : np.ndarray
        collection of vector pairs with shape (N, 2, M)

    Returns
    -------
    np.ndarray
        pair if not ((pair in rv) | (pair[::-1] in rv)) for pair in pairs

    """
    pairs = np.asarray(pairs).reshape((len(pairs), 2, -1))
    rv = [pairs[0]]
    for pair in tqdm(pairs[1:], desc='finding unique pairs...'):
        if not any(float_tol_pair_in_pairs(pair, rv)):
            rv.append(pair)
    return np.array(rv)

2023-01-11 19:02:06

检查列表中是否存在值的最快方法

推荐文章

最新文章

标签