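What is the fastest (and least resource-intensive) way to compare two massive lists (>50,000 items each) and get two lists like the ones below as a result: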

Items that appear in the first list but not in the second
Items that appear in the second list but not in the first

Currently I am working with List or IReadOnlyCollection and solving this with a LINQ query:

// list1 and list2 are the two input lists
var onlyInFirst = list1.Where(i => !list2.Contains(i)).ToList();
var onlyInSecond = list2.Where(i => !list1.Contains(i)).ToList();

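But this does not perform as well as I would like. Any ideas for making this faster and less resource-intensive, given that I need to process a lot of lists?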


Current answer

I think this is a simple and easy way to compare two lists element by element (Python):

x = [1, 2, 3, 5, 4, 8, 7, 11, 12, 45, 96, 25]
y = [2, 4, 5, 6, 8, 7, 88, 9, 6, 55, 44, 23]

tmp = []

# Walk both lists in lockstep; record 1 where x's element is larger, else 0.
for a, b in zip(x, y):
    if a > b:
        tmp.append(1)
    else:
        tmp.append(0)
print(tmp)

Other answers

This is the best solution you will find:

var list3 = list1.Where(l => list2.ToList().Contains(l));

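I compared 3 different methods for comparing data sets. The tests below create a string collection containing all the numbers from 0 to length - 1, then a second collection covering the same range but containing only the even numbers. I then pick the odd numbers out of the first collection.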

Using Linq Except

public void TestExcept()
{
    WriteLine($"Except {DateTime.Now}");
    int length = 20000000;
    var dateTime = DateTime.Now;
    var array = new string[length];
    for (int i = 0; i < length; i++)
    {
        array[i] = i.ToString();
    }
    Write("Populate set processing time: ");
    WriteLine(DateTime.Now - dateTime);
    var newArray = new string[length/2];
    int j = 0;
    for (int i = 0; i < length; i+=2)
    {
        newArray[j++] = i.ToString();
    }
    dateTime = DateTime.Now;
    Write("Count of items: ");
    WriteLine(array.Except(newArray).Count());
    Write("Count processing time: ");
    WriteLine(DateTime.Now - dateTime);
}

Output

Except 2021-08-14 11:43:03 AM
Populate set processing time: 00:00:03.7230479
2021-08-14 11:43:09 AM
Count of items: 10000000
Count processing time: 00:00:02.9720879

Using HashSet.Add

public void TestHashSet()
{
    WriteLine($"HashSet {DateTime.Now}");
    int length = 20000000;
    var dateTime = DateTime.Now;
    var hashSet = new HashSet<string>();
    for (int i = 0; i < length; i++)
    {
        hashSet.Add(i.ToString());
    }
    Write("Populate set processing time: ");
    WriteLine(DateTime.Now - dateTime);
    var newHashSet = new HashSet<string>();
    for (int i = 0; i < length; i+=2)
    {
        newHashSet.Add(i.ToString());
    }
    dateTime = DateTime.Now;
    Write("Count of items: ");
    // HashSet Add returns true if item is added successfully (not previously existing)
    WriteLine(hashSet.Where(s => newHashSet.Add(s)).Count());
    Write("Count processing time: ");
    WriteLine(DateTime.Now - dateTime);
}

Output

HashSet 2021-08-14 11:42:43 AM
Populate set processing time: 00:00:05.6000625
Count of items: 10000000
Count processing time: 00:00:01.7703057

Special HashSet test:

public void TestLoadingHashSet()
{
    int length = 20000000;
    var array = new string[length];
    for (int i = 0; i < length; i++)
    {
       array[i] = i.ToString();
    }
    var dateTime = DateTime.Now;
    var hashSet = new HashSet<string>(array);
    Write("Time to load hashset: ");
    WriteLine(DateTime.Now - dateTime);
}
> TestLoadingHashSet()
Time to load hashset: 00:00:01.1918160

Using .Contains

public void TestContains()
{
    WriteLine($"Contains {DateTime.Now}");
    int length = 20000000;
    var dateTime = DateTime.Now;
    var array = new string[length];
    for (int i = 0; i < length; i++)
    {
        array[i] = i.ToString();
    }
    Write("Populate set processing time: ");
    WriteLine(DateTime.Now - dateTime);
    var newArray = new string[length/2];
    int j = 0;
    for (int i = 0; i < length; i+=2)
    {
        newArray[j++] = i.ToString();
    }
    dateTime = DateTime.Now;
    WriteLine(dateTime);
    Write("Count of items: ");
    WriteLine(array.Where(a => !newArray.Contains(a)).Count());
    Write("Count processing time: ");
    WriteLine(DateTime.Now - dateTime);
}

Output

Contains 2021-08-14 11:19:44 AM
Populate set processing time: 00:00:03.1046998
2021-08-14 11:19:49 AM
Count of items: Hosting process exited with exit code 1.
(Didn't complete. Killed it after 14 minutes.)

Conclusion:

On my machine, Linq Except ran about 1 second slower than using HashSets (n=20,000,000). Using Where with Contains ran for a very long time and never finished.

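Concluding remarks on HashSets:

They hold unique data only.
Make sure to override GetHashCode (correctly) for class types.
You may need up to 2x the memory if you copy the data set, depending on the implementation.
HashSet is optimized for cloning other HashSets through the IEnumerable constructor, but converting other collections to a HashSet is slower (see the special test above).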
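The benchmark only counts one side of the difference. A minimal sketch of how the HashSet approach could produce both lists the question asks for follows. The names `ListDiff` and `Compare` are hypothetical, and the sketch assumes the element type has correct `Equals`/`GetHashCode`. Duplicates collapse, and the order of the results is not guaranteed.

```csharp
using System;
using System.Collections.Generic;

public static class ListDiff
{
    // Hypothetical helper: returns (items only in first, items only in second).
    // One HashSet is built per input, so each lookup is O(1) on average
    // instead of the O(n) scan that List.Contains performs.
    public static (List<T> OnlyInFirst, List<T> OnlyInSecond) Compare<T>(
        IEnumerable<T> first, IEnumerable<T> second)
    {
        var firstSet = new HashSet<T>(first);
        var secondSet = new HashSet<T>(second);

        var onlyInFirst = new List<T>();
        foreach (var item in firstSet)
        {
            if (!secondSet.Contains(item)) onlyInFirst.Add(item);
        }

        var onlyInSecond = new List<T>();
        foreach (var item in secondSet)
        {
            if (!firstSet.Contains(item)) onlyInSecond.Add(item);
        }

        return (onlyInFirst, onlyInSecond);
    }

    public static void Main()
    {
        var (onlyInFirst, onlyInSecond) =
            Compare(new[] { 1, 2, 3, 5 }, new[] { 2, 3, 4 });

        Console.WriteLine("Only in first:  " + string.Join(",", onlyInFirst));
        Console.WriteLine("Only in second: " + string.Join(",", onlyInSecond));
    }
}
```

This costs extra memory for the two sets, which matches the "up to 2x" remark above. If duplicates or ordering matter, the sets can be used for lookups only while iterating the original lists.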

Try this:

var difList = list1.Where(a => !list2.Any(a1 => a1.id == a.id))
            .Union(list2.Where(a => !list1.Any(a1 => a1.id == a.id)));
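The nested `Any` calls scan the other list once per element, which gets slow for large inputs. One possible variation, sketched here under assumptions, precomputes the ids into sets first. The `Item` class and the `IdDiff.SymmetricDifferenceById` helper are hypothetical stand-ins for whatever type carries the `id` property.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical element type; only the id property matters for the comparison.
public class Item
{
    public int id { get; set; }
}

public static class IdDiff
{
    // Returns the items whose id appears in exactly one of the two lists.
    public static List<Item> SymmetricDifferenceById(List<Item> list1, List<Item> list2)
    {
        // Collect the ids once so each membership check is a hash lookup.
        var ids1 = new HashSet<int>(list1.Select(a => a.id));
        var ids2 = new HashSet<int>(list2.Select(a => a.id));

        return list1.Where(a => !ids2.Contains(a.id))
                    .Union(list2.Where(a => !ids1.Contains(a.id)))
                    .ToList();
    }

    public static void Main()
    {
        var list1 = new List<Item> { new Item { id = 1 }, new Item { id = 2 } };
        var list2 = new List<Item> { new Item { id = 2 }, new Item { id = 3 } };

        foreach (var item in SymmetricDifferenceById(list1, list2))
        {
            Console.WriteLine(item.id);
        }
    }
}
```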

One-liner:

var list1 = new List<int> { 1, 2, 3 };
var list2 = new List<int> { 1, 2, 3, 4 };
if (list1.Except(list2).Count() + list2.Except(list1).Count() == 0)
    Console.WriteLine("same sets");
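If only a yes/no answer is needed, `HashSet<T>.SetEquals` might be a shorter route than running `Except` twice. The sketch below wraps it in a hypothetical helper, `SameSetCheck.HaveSameElements`. Like the `Except` version, it ignores order and duplicate entries.

```csharp
using System;
using System.Collections.Generic;

public static class SameSetCheck
{
    // Hypothetical helper: true when both sequences contain the same distinct elements.
    public static bool HaveSameElements<T>(IEnumerable<T> first, IEnumerable<T> second)
    {
        return new HashSet<T>(first).SetEquals(second);
    }

    public static void Main()
    {
        var list1 = new List<int> { 1, 2, 3 };
        var list2 = new List<int> { 1, 2, 3, 4 };

        Console.WriteLine(HaveSameElements(list1, list2));                        // False
        Console.WriteLine(HaveSameElements(list1, new List<int> { 3, 2, 1, 1 })); // True
    }
}
```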

Maybe this is funny, but it works for me:

string.Join("",List1) != string.Join("", List2)
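One thing to keep in mind with this trick: joining with an empty separator can make different lists produce the same string, and the comparison is order-sensitive. The sketch below illustrates the collision and shows `Enumerable.SequenceEqual` as one possible order-sensitive alternative. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class JoinCompare
{
    // Mirrors the expression above: reports a difference when the joined strings differ.
    public static bool JoinedDiffers(List<int> a, List<int> b)
    {
        return string.Join("", a) != string.Join("", b);
    }

    public static void Main()
    {
        var list1 = new List<int> { 1, 23 };
        var list2 = new List<int> { 12, 3 };

        // Both lists join to "123", so the string comparison sees no difference.
        Console.WriteLine(JoinedDiffers(list1, list2));   // False

        // Comparing element by element does detect the difference.
        Console.WriteLine(!list1.SequenceEqual(list2));   // True
    }
}
```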