我需要采取一个c++向量与潜在的很多元素,擦除重复,并排序。

我目前有下面的代码,但它不起作用。

vec.erase(
      std::unique(vec.begin(), vec.end()),
      vec.end());
std::sort(vec.begin(), vec.end());

我怎样才能正确地做到这一点呢?

此外,是先删除副本(类似于上面的编码)还是先执行排序更快?如果我先执行排序,它是否保证在std::unique执行后保持排序?

或者是否有另一种(也许更有效的)方法来完成这一切?


当前回答

如果您正在寻找性能并使用std::vector,我推荐使用本文档链接提供的方法。

std::vector<int> myvector{10,20,20,20,30,30,20,20,10};             // 10 20 20 20 30 30 20 20 10
std::sort(myvector.begin(), myvector.end() );
const auto& it = std::unique (myvector.begin(), myvector.end());   // 10 20 30 ?  ?  ?  ?  ?  ?
                                                                   //          ^
myvector.resize( std::distance(myvector.begin(),it) ); // 10 20 30

其他回答

如果你不想改变元素的顺序,那么你可以尝试这个解决方案:

template <class T>
void RemoveDuplicatesInVector(std::vector<T> & vec)
{
    set<T> values;
    vec.erase(std::remove_if(vec.begin(), vec.end(), [&](const T & value) { return !values.insert(value).second; }), vec.end());
}

我同意R. Pate和Todd Gardner的观点;std::set在这里可能是个好主意。即使你在使用向量时遇到了困难,如果你有足够多的副本,你最好创建一个集合来做这些肮脏的工作。

让我们来比较三种方法:

用向量,sort + unique

sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );

转换为set(手动)

set<int> s;
unsigned size = vec.size();
for( unsigned i = 0; i < size; ++i ) s.insert( vec[i] );
vec.assign( s.begin(), s.end() );

转换为set(使用构造函数)

set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );

下面是它们在重复数量变化时的表现:

总结:当副本的数量足够大时,实际上更快地将数据转换为一个集合,然后将数据转储回一个向量。

出于某种原因,手动进行set转换似乎比使用set构造函数更快——至少在我使用的随机数据上是这样。

取决于用例。如果你期望有少于100个正整数的唯一值,并且你有一个cpu能够处理avx512f指令集,那么你可以以每个元素15个时钟周期或每秒3 -5亿个插入的速度插入元素,通过与一个小型查找表进行简单的比较。

接下来的实现使用CPU寄存器对~50个惟一值进行值查找,并对~1000个惟一值进行L1缓存。对于L1缓存版本,每次插入大约需要160个时钟周期,这相当于大约每秒25M个插入,并且比使用std::set慢。对于只有4个唯一值,它以每个元素5.8个周期的速率插入,高于500M/s。

//g++  7.4.0
// time measurement taken from another answer
// valid C99 and C++

#include <stdint.h>  // <cstdint> is preferred in C++, but stdint.h works.

#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
     _mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
    uint64_t tsc = __rdtsc();
     _mm_lfence();  // optionally block later instructions until rdtsc retires
    return tsc;
}

// requires a Nehalem or newer CPU.  Not Core2 or earlier.  IDK when AMD added it.
inline
uint64_t readTSCp() {
    unsigned dummy;
    return __rdtscp(&dummy);  // waits for earlier insns to retire, but allows later to start
}



#include <iostream>

template<int n>
struct FastUnique
{
    public:
    FastUnique()
    {
         it=0;
         for(int i=0;i<n;i++)
             dict[i]=-1;
    }

    void insert(const int val)
    {
        if(!test(dict,val))
            dict[it++]=val;
    }

    const int get(const int index)
    {
        return dict[index];
    }

    const int size()
    {
        return it;
    }

    private:
    int dict[n];
    int it;
    bool test(const int * dict, const int val)
    {
        int c=0;
        for(int i=0;i<n;i++)
            c+=(dict[i]==val);
        return c>0;
    }
};

int main()
{
    std::cout << "Hello, world!\n";
    const int n=500000000;

    FastUnique<64> fastSet;

    auto t= readTSC();

    for(int i=0;i<n;i++)
        fastSet.insert(i&63);

    auto t2=readTSC();

    std::cout<<(t2-t)/(double)n<<"cycles per iteration"<<std::endl;
   
    for(int i=0;i<fastSet.size();i++)
        std::cout<<fastSet.get(i)<<std::endl;
    
    return 0;
}

下面是使用std::unique()出现重复删除问题的示例。在LINUX机器上,程序崩溃。详情请阅读评论。

// Main10.cpp
//
// Illustration of duplicate delete and memory leak in a vector<int*> after calling std::unique.
// On a LINUX machine, it crashes the progam because of the duplicate delete.
//
// INPUT : {1, 2, 2, 3}
// OUTPUT: {1, 2, 3, 3}
//
// The two 3's are actually pointers to the same 3 integer in the HEAP, which is BAD
// because if you delete both int* pointers, you are deleting the same memory
// location twice.
//
//
// Never mind the fact that we ignore the "dupPosition" returned by std::unique(),
// but in any sensible program that "cleans up after istelf" you want to call deletex
// on all int* poitners to avoid memory leaks.
//
//
// NOW IF you replace std::unique() with ptgi::unique(), all of the the problems disappear.
// Why? Because ptgi:unique merely reshuffles the data:
// OUTPUT: {1, 2, 3, 2}
// The ptgi:unique has swapped the last two elements, so all of the original elements in
// the INPUT are STILL in the OUTPUT.
//
// 130215   dbednar@ptgi.com
//============================================================================

#include <iostream>
#include <vector>
#include <algorithm>
#include <functional>

#include "ptgi_unique.hpp"

// functor used by std::unique to remove adjacent elts from vector<int*>
struct EqualToVectorOfIntegerStar: public std::equal_to<int *>
{
    bool operator() (const int* arg1, const int* arg2) const
    {
        return (*arg1 == *arg2);
    }
};

void printVector( const std::string& msg, const std::vector<int*>& vnums);

int main()
{
    int inums [] = { 1, 2, 2, 3 };
    std::vector<int*> vnums;

    // convert C array into vector of pointers to integers
    for (size_t inx = 0; inx < 4; ++ inx)
        vnums.push_back( new int(inums[inx]) );

    printVector("BEFORE UNIQ", vnums);

    // INPUT : 1, 2A, 2B, 3
    std::unique( vnums.begin(), vnums.end(), EqualToVectorOfIntegerStar() );
    // OUTPUT: 1, 2A, 3, 3 }
    printVector("AFTER  UNIQ", vnums);

    // now we delete 3 twice, and we have a memory leak because 2B is not deleted.
    for (size_t inx = 0; inx < vnums.size(); ++inx)
    {
        delete(vnums[inx]);
    }
}

// print a line of the form "msg: 1,2,3,..,5,6,7\n", where 1..7 are the numbers in vnums vector
// PS: you may pass "hello world" (const char *) because of implicit (automatic) conversion
// from "const char *" to std::string conversion.

void printVector( const std::string& msg, const std::vector<int*>& vnums)
{
    std::cout << msg << ": ";

    for (size_t inx = 0; inx < vnums.size(); ++inx)
    {
        // insert comma separator before current elt, but ONLY after first elt
        if (inx > 0)
            std::cout << ",";
        std::cout << *vnums[inx];

    }
    std::cout << "\n";
}
std::set<int> s;
std::for_each(v.cbegin(), v.cend(), [&s](int val){s.insert(val);});
v.clear();
std::copy(s.cbegin(), s.cend(), v.cbegin());