比较两个庞大(>50.000项)的最快(和最少资源密集型)的方法是什么,从而得到如下所示的两个列表:

在第一个列表中出现但在第二个列表中没有出现的项目 出现在第二个列表中但不在第一个列表中的项目

目前,我正在使用列表或IReadOnlyCollection,并在linq查询中解决这个问题:

var list1 = list.Where(i => !list2.Contains(i)).ToList();
var list2 = list2.Where(i => !list.Contains(i)).ToList();

但这并不像我想的那样好。 有什么想法使这更快和更少的资源密集,因为我需要处理很多列表?


当前回答

While Jon Skeet's answer is an excellent advice for everyday's practice with small to moderate number of elements (up to a few millions) it is nevertheless not the fastest approach and not very resource efficient. An obvious drawback is the fact that getting the full difference requires two passes over the data (even three if the elements that are equal are of interest as well). Clearly, this can be avoided by a customized reimplementation of the Except method, but it remains that the creation of a hash set requires a lot of memory and the computation of hashes requires time.

对于非常大的数据集(数十亿个元素),考虑特定的情况通常是有好处的。这里有一些想法可能会给你一些启发: 如果元素可以比较(在实践中几乎总是这样),那么对列表进行排序并应用以下zip方法是值得考虑的:

/// <returns>The elements of the specified (ascendingly) sorted enumerations that are
/// contained only in one of them, together with an indicator,
/// whether the element is contained in the reference enumeration (-1)
/// or in the difference enumeration (+1).</returns>
public static IEnumerable<Tuple<T, int>> FindDifferences<T>(IEnumerable<T> sortedReferenceObjects,
    IEnumerable<T> sortedDifferenceObjects, IComparer<T> comparer)
{
    var refs  = sortedReferenceObjects.GetEnumerator();
    var diffs = sortedDifferenceObjects.GetEnumerator();
    bool hasNext = refs.MoveNext() && diffs.MoveNext();
    while (hasNext)
    {
        int comparison = comparer.Compare(refs.Current, diffs.Current);
        if (comparison == 0)
        {
            // insert code that emits the current element if equal elements should be kept
            hasNext = refs.MoveNext() && diffs.MoveNext();

        }
        else if (comparison < 0)
        {
            yield return Tuple.Create(refs.Current, -1);
            hasNext = refs.MoveNext();
        }
        else
        {
            yield return Tuple.Create(diffs.Current, 1);
            hasNext = diffs.MoveNext();
        }
    }
}

例如,它可以以以下方式使用:

const int N = <Large number>;
const int omit1 = 231567;
const int omit2 = 589932;
IEnumerable<int> numberSequence1 = Enumerable.Range(0, N).Select(i => i < omit1 ? i : i + 1);
IEnumerable<int> numberSequence2 = Enumerable.Range(0, N).Select(i => i < omit2 ? i : i + 1);
var numberDiffs = FindDifferences(numberSequence1, numberSequence2, Comparer<int>.Default);

在我的计算机上对N = 1M进行基准测试,得到以下结果:

Method Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated
DiffLinq 115.19 ms 0.656 ms 0.582 ms 1.00 2800.0000 2800.0000 2800.0000 67110744 B
DiffZip 23.48 ms 0.018 ms 0.015 ms 0.20 - - - 720 B

对于N = 100M:

Method Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated
DiffLinq 12.146 s 0.0427 s 0.0379 s 1.00 13000.0000 13000.0000 13000.0000 8589937032 B
DiffZip 2.324 s 0.0019 s 0.0018 s 0.19 - - - 720 B

请注意,这个示例当然得益于列表已经排序并且可以非常有效地比较整数的事实。但这正是关键所在:如果你确实有有利的环境,一定要好好利用它们。

A few further comments: The speed of the comparison function is clearly relevant for the overall performance, so it may be beneficial to optimize it. The flexibility to do so is a benefit of the zipping approach. Furthermore, parallelization seems more feasible to me, although by no means easy and maybe not worth the effort and the overhead. Nevertheless, a simple way to speed up the process by roughly a factor of 2, is to split the lists respectively in two halfs (if it can be efficiently done) and compare the parts in parallel, one processing from front to back and the other in reverse order.

其他回答

不是针对这个问题,但是这里有一些代码来比较相等和不相等的列表!相同的对象:

public class EquatableList<T> : List<T>, IEquatable<EquatableList<T>> where    T : IEquatable<T>

/// <summary>
/// True, if this contains element with equal property-values
/// </summary>
/// <param name="element">element of Type T</param>
/// <returns>True, if this contains element</returns>
public new Boolean Contains(T element)
{
    return this.Any(t => t.Equals(element));
}

/// <summary>
/// True, if list is equal to this
/// </summary>
/// <param name="list">list</param>
/// <returns>True, if instance equals list</returns>
public Boolean Equals(EquatableList<T> list)
{
    if (list == null) return false;
    return this.All(list.Contains) && list.All(this.Contains);
}

试试这个方法:

var difList = list1.Where(a => !list2.Any(a1 => a1.id == a.id))
            .Union(list2.Where(a => !list1.Any(a1 => a1.id == a.id)));

更有效的方法是使用Enumerable。除了:

var inListButNotInList2 = list.Except(list2);
var inList2ButNotInList = list2.Except(list);

该方法是通过使用延迟执行实现的。这意味着你可以这样写:

var first10 = inListButNotInList2.Take(10);

它也很有效,因为它在内部使用Set<T>来比较对象。它的工作原理是首先从第二个序列中收集所有不同的值,然后将第一个序列的结果流式传输,检查它们是否之前没有出现过。

我已经使用这段代码来比较两个有百万记录的列表。

这种方法不会花很多时间

    //Method to compare two list of string
    private List<string> Contains(List<string> list1, List<string> list2)
    {
        List<string> result = new List<string>();

        result.AddRange(list1.Except(list2, StringComparer.OrdinalIgnoreCase));
        result.AddRange(list2.Except(list1, StringComparer.OrdinalIgnoreCase));

        return result;
    }
using System.Collections.Generic;
using System.Linq;

namespace YourProject.Extensions
{
    public static class ListExtensions
    {
        public static bool SetwiseEquivalentTo<T>(this List<T> list, List<T> other)
            where T: IEquatable<T>
        {
            if (list.Except(other).Any())
                return false;
            if (other.Except(list).Any())
                return false;
            return true;
        }
    }
}

有时,您只需要知道两个列表是否不同,而不需要知道这些差异是什么。在这种情况下,考虑将此扩展方法添加到项目中。注意,你列出的对象应该实现IEquatable!

用法:

public sealed class Car : IEquatable<Car>
{
    public Price Price { get; }
    public List<Component> Components { get; }

    ...
    public override bool Equals(object obj)
        => obj is Car other && Equals(other);

    public bool Equals(Car other)
        => Price == other.Price
            && Components.SetwiseEquivalentTo(other.Components);

    public override int GetHashCode()
        => Components.Aggregate(
            Price.GetHashCode(),
            (code, next) => code ^ next.GetHashCode()); // Bitwise XOR
}

无论Component类是什么,这里为Car显示的方法应该几乎相同地实现。

注意我们如何编写GetHashCode是非常重要的。为了正确地实现IEquatable, Equals和GetHashCode必须以逻辑兼容的方式操作实例的属性。

Two lists with the same contents are still different objects, and will produce different hash codes. Since we want these two lists to be treated as equal, we must let GetHashCode produce the same value for each of them. We can accomplish this by delegating the hashcode to every element in the list, and using the standard bitwise XOR to combine them all. XOR is order-agnostic, so it doesn't matter if the lists are sorted differently. It only matters that they contain nothing but equivalent members.

注意:这个奇怪的名字是为了暗示这个方法不考虑列表中元素的顺序。如果您确实关心列表中元素的顺序,则此方法不适合您!