如何迭代由空格分隔的单词组成的字符串中的单词?

注意,我对C字符串函数或那种字符操作/访问不感兴趣。比起效率,我更喜欢优雅。我当前的解决方案:

#include <iostream>
#include <sstream>
#include <string>

using namespace std;

int main() {
    string s = "Somewhere down the road";
    istringstream iss(s);

    do {
        string subs;
        iss >> subs;
        cout << "Substring: " << subs << endl;
    } while (iss);
}

当前回答

这里有一个只使用标准正则表达式库的简单解决方案

#include <regex>
#include <string>
#include <vector>

std::vector<string> Tokenize( const string str, const std::regex regex )
{
    using namespace std;

    std::vector<string> result;

    sregex_token_iterator it( str.begin(), str.end(), regex, -1 );
    sregex_token_iterator reg_end;

    for ( ; it != reg_end; ++it ) {
        if ( !it->str().empty() ) //token could be empty:check
            result.emplace_back( it->str() );
    }

    return result;
}

正则表达式参数允许检查多个参数(空格、逗号等)

我通常只选中空格和逗号分隔,所以我也有这个默认函数:

std::vector<string> TokenizeDefault( const string str )
{
    using namespace std;

    regex re( "[\\s,]+" );

    return Tokenize( str, re );
}

“[\\s,]+”检查空格(\\s)和逗号(,)。

注意,如果要拆分wstring而不是string,

将所有std::regex更改为std::wregex将所有sregex_token_iterator更改为wsregex_token_idterator

注意,根据编译器的不同,您可能还希望引用字符串参数。

其他回答

这里有一个仅使用标准正则表达式库的正则表达式解决方案。(我有点生疏,所以可能会有一些语法错误,但这至少是一般的想法)

#include <regex.h>
#include <string.h>
#include <vector.h>

using namespace std;

vector<string> split(string s){
    regex r ("\\w+"); //regex matches whole words, (greedy, so no fragment words)
    regex_iterator<string::iterator> rit ( s.begin(), s.end(), r );
    regex_iterator<string::iterator> rend; //iterators to iterate thru words
    vector<string> result<regex_iterator>(rit, rend);
    return result;  //iterates through the matches to fill the vector
}

这是我的条目:

template <typename Container, typename InputIter, typename ForwardIter>
Container
split(InputIter first, InputIter last,
      ForwardIter s_first, ForwardIter s_last)
{
    Container output;

    while (true) {
        auto pos = std::find_first_of(first, last, s_first, s_last);
        output.emplace_back(first, pos);
        if (pos == last) {
            break;
        }

        first = ++pos;
    }

    return output;
}

template <typename Output = std::vector<std::string>,
          typename Input = std::string,
          typename Delims = std::string>
Output
split(const Input& input, const Delims& delims = " ")
{
    using std::cbegin;
    using std::cend;
    return split<Output>(cbegin(input), cend(input),
                         cbegin(delims), cend(delims));
}

auto vec = split("Mary had a little lamb");

第一个定义是采用两对迭代器的STL样式泛型函数。第二个是一个方便的函数,可以让你不用自己做所有的开始和结束。例如,如果要使用列表,还可以将输出容器类型指定为模板参数。

它之所以优雅(IMO),是因为与其他大多数答案不同,它不限于字符串,而是可以与任何STL兼容的容器一起使用。在不更改上述代码的情况下,您可以说:

using vec_of_vecs_t = std::vector<std::vector<int>>;

std::vector<int> v{1, 2, 0, 3, 4, 5, 0, 7, 8, 0, 9};
auto r = split<vec_of_vecs_t>(v, std::initializer_list<int>{0, 2});

这将在每次遇到0或2时将向量v分割成单独的向量。

(还有一个额外的好处,即使用字符串,这个实现比基于strtok()和getline()的版本更快,至少在我的系统上是这样。)

我的实施可以是另一种解决方案:

std::vector<std::wstring> SplitString(const std::wstring & String, const std::wstring & Seperator)
{
    std::vector<std::wstring> Lines;
    size_t stSearchPos = 0;
    size_t stFoundPos;
    while (stSearchPos < String.size() - 1)
    {
        stFoundPos = String.find(Seperator, stSearchPos);
        stFoundPos = (stFoundPos == std::string::npos) ? String.size() : stFoundPos;
        Lines.push_back(String.substr(stSearchPos, stFoundPos - stSearchPos));
        stSearchPos = stFoundPos + Seperator.size();
    }
    return Lines;
}

测试代码:

std::wstring MyString(L"Part 1SEPsecond partSEPlast partSEPend");
std::vector<std::wstring> Parts = IniFile::SplitString(MyString, L"SEP");
std::wcout << L"The string: " << MyString << std::endl;
for (std::vector<std::wstring>::const_iterator it=Parts.begin(); it<Parts.end(); ++it)
{
    std::wcout << *it << L"<---" << std::endl;
}
std::wcout << std::endl;
MyString = L"this,time,a,comma separated,string";
std::wcout << L"The string: " << MyString << std::endl;
Parts = IniFile::SplitString(MyString, L",");
for (std::vector<std::wstring>::const_iterator it=Parts.begin(); it<Parts.end(); ++it)
{
    std::wcout << *it << L"<---" << std::endl;
}

测试代码的输出:

The string: Part 1SEPsecond partSEPlast partSEPend
Part 1<---
second part<---
last part<---
end<---

The string: this,time,a,comma separated,string
this<---
time<---
a<---
comma separated<---
string<---

这是我的版本

#include <vector>

inline std::vector<std::string> Split(const std::string &str, const std::string &delim = " ")
{
    std::vector<std::string> tokens;
    if (str.size() > 0)
    {
        if (delim.size() > 0)
        {
            std::string::size_type currPos = 0, prevPos = 0;
            while ((currPos = str.find(delim, prevPos)) != std::string::npos)
            {
                std::string item = str.substr(prevPos, currPos - prevPos);
                if (item.size() > 0)
                {
                    tokens.push_back(item);
                }
                prevPos = currPos + 1;
            }
            tokens.push_back(str.substr(prevPos));
        }
        else
        {
            tokens.push_back(str);
        }
    }
    return tokens;
}

它适用于多字符分隔符。它防止空令牌进入结果。它使用单个标头。当您不提供分隔符时,它将字符串作为一个标记返回。如果字符串为空,它还会返回一个空结果。不幸的是,它的效率很低,因为存在巨大的std::vector副本,除非您使用C++11进行编译,否则应该使用移动示意图。在C++11中,这段代码应该很快。

对于那个些需要使用字符串分隔符拆分字符串的人,也许可以尝试我的以下解决方案。

std::vector<size_t> str_pos(const std::string &search, const std::string &target)
{
    std::vector<size_t> founds;

    if(!search.empty())
    {
        size_t start_pos = 0;

        while (true)
        {
            size_t found_pos = target.find(search, start_pos);

            if(found_pos != std::string::npos)
            {
                size_t found = found_pos;

                founds.push_back(found);

                start_pos = (found_pos + 1);
            }
            else
            {
                break;
            }
        }
    }

    return founds;
}

std::string str_sub_index(size_t begin_index, size_t end_index, const std::string &target)
{
    std::string sub;

    size_t size = target.length();

    const char* copy = target.c_str();

    for(size_t i = begin_index; i <= end_index; i++)
    {
        if(i >= size)
        {
            break;
        }
        else
        {
            char c = copy[i];

            sub += c;
        }
    }

    return sub;
}

std::vector<std::string> str_split(const std::string &delimiter, const std::string &target)
{
    std::vector<std::string> splits;

    if(!delimiter.empty())
    {
        std::vector<size_t> founds = str_pos(delimiter, target);

        size_t founds_size = founds.size();

        if(founds_size > 0)
        {
            size_t search_len = delimiter.length();

            size_t begin_index = 0;

            for(int i = 0; i <= founds_size; i++)
            {
                std::string sub;

                if(i != founds_size)
                {
                    size_t pos  = founds.at(i);

                    sub = str_sub_index(begin_index, pos - 1, target);

                    begin_index = (pos + search_len);
                }
                else
                {
                    sub = str_sub_index(begin_index, (target.length() - 1), target);
                }

                splits.push_back(sub);
            }
        }
    }

    return splits;
}

这些片段由3个函数组成。坏消息是使用str_split函数,您将需要另外两个函数。是的,这是一大块代码。但好消息是,这两个附加功能可以独立工作,有时也很有用

测试main()块中的函数如下:

int main()
{
    std::string s = "Hello, world! We need to make the world a better place. Because your world is also my world, and our children's world.";

    std::vector<std::string> split = str_split("world", s);

    for(int i = 0; i < split.size(); i++)
    {
        std::cout << split[i] << std::endl;
    }
}

它将产生:

Hello, 
! We need to make the 
 a better place. Because your 
 is also my 
, and our children's 
.

我认为这不是最有效的代码,但至少它可以工作。希望有帮助。