Java有一个方便的分割方法:

String str = "The quick brown fox";
String[] results = str.split(" ");

在c++中有简单的方法来做到这一点吗?


当前回答

这是一个非常简单的问题:

#include <vector>
#include <string>
using namespace std;

vector<string> split(const char *str, char c = ' ')
{
    vector<string> result;

    do
    {
        const char *begin = str;

        while(*str != c && *str)
            str++;

        result.push_back(string(begin, str));
    } while (0 != *str++);

    return result;
}

其他回答

你可以利用boost::make_find_iterator。类似于这个:

template<typename CH>
inline vector< basic_string<CH> > tokenize(
    const basic_string<CH> &Input,
    const basic_string<CH> &Delimiter,
    bool remove_empty_token
    ) {

    typedef typename basic_string<CH>::const_iterator string_iterator_t;
    typedef boost::find_iterator< string_iterator_t > string_find_iterator_t;

    vector< basic_string<CH> > Result;
    string_iterator_t it = Input.begin();
    string_iterator_t it_end = Input.end();
    for(string_find_iterator_t i = boost::make_find_iterator(Input, boost::first_finder(Delimiter, boost::is_equal()));
        i != string_find_iterator_t();
        ++i) {
        if(remove_empty_token){
            if(it != i->begin())
                Result.push_back(basic_string<CH>(it,i->begin()));
        }
        else
            Result.push_back(basic_string<CH>(it,i->begin()));
        it = i->end();
    }
    if(it != it_end)
        Result.push_back(basic_string<CH>(it,it_end));

    return Result;
}

使用regex_token_iterators的解决方案:

#include <iostream>
#include <regex>
#include <string>

using namespace std;

int main()
{
    string str("The quick brown fox");

    regex reg("\\s+");

    sregex_token_iterator iter(str.begin(), str.end(), reg, -1);
    sregex_token_iterator end;

    vector<string> vec(iter, end);

    for (auto a : vec)
    {
        cout << a << endl;
    }
}

这里有许多过于复杂的建议。试试这个简单的std::string解决方案:

using namespace std;

string someText = ...

string::size_type tokenOff = 0, sepOff = tokenOff;
while (sepOff != string::npos)
{
    sepOff = someText.find(' ', sepOff);
    string::size_type tokenLen = (sepOff == string::npos) ? sepOff : sepOff++ - tokenOff;
    string token = someText.substr(tokenOff, tokenLen);
    if (!token.empty())
        /* do something with token */;
    tokenOff = sepOff;
}

Adam Pierce的回答提供了一个采用const char*的手工标记器。使用迭代器会有一些问题,因为对字符串的结束迭代器进行递增是未定义的。也就是说,给定字符串str{"The quick brown fox"},我们当然可以做到:

auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };

while (start != cend(str)) {
    const auto finish = find(++start, cend(str), ' ');

    tokens.push_back(string(start, finish));
    start = finish;
}

生活的例子


如果你想通过使用标准功能来抽象复杂性,On Freund建议strtok是一个简单的选择:

vector<string> tokens;

for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);

如果你不能访问c++ 17,你需要像这个例子一样替换data(str): http://ideone.com/8kAGoa

虽然在示例中没有演示,但strtok不需要为每个标记使用相同的分隔符。除了这个优势,还有几个缺点:

strtok cannot be used on multiple strings at the same time: Either a nullptr must be passed to continue tokenizing the current string or a new char* to tokenize must be passed (there are some non-standard implementations which do support this however, such as: strtok_s) For the same reason strtok cannot be used on multiple threads simultaneously (this may however be implementation defined, for example: Visual Studio's implementation is thread safe) Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings, to tokenize any of these with strtok or to operate on a string who's contents need to be preserved, str would have to be copied, then the copy could be operated on


c++20为我们提供了split_view来以非破坏性的方式标记字符串:https://topanswers.xyz/cplusplus?q=749#a874


前面的方法不能就地生成标记化的向量,这意味着如果不将它们抽象为辅助函数,它们就不能初始化const vector<string>令牌。该功能和接受任何空白分隔符的能力可以使用istream_iterator来利用。例如,给定const string str{"The quick \tbrown \nfox"},我们可以这样做:

istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };

生活的例子

对于这个选项,需要构造一个istringstream的代价比前面两个选项要大得多,但是这个代价通常隐藏在字符串分配的代价中。


如果上面的选项都不够灵活,不能满足您的标记化需求,那么最灵活的选项是使用regex_token_iterator,当然这种灵活性会带来更大的开销,但同样,这可能隐藏在字符串分配成本中。例如,我们想要基于非转义的逗号进行标记化,也吃空白,给定以下输入:const string str{" the,qu\\,ick,\tbrown, fox"}我们可以这样做:

const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };

生活的例子

/// split a string into multiple sub strings, based on a separator string
/// for example, if separator="::",
///
/// s = "abc" -> "abc"
///
/// s = "abc::def xy::st:" -> "abc", "def xy" and "st:",
///
/// s = "::abc::" -> "abc"
///
/// s = "::" -> NO sub strings found
///
/// s = "" -> NO sub strings found
///
/// then append the sub-strings to the end of the vector v.
/// 
/// the idea comes from the findUrls() function of "Accelerated C++", chapt7,
/// findurls.cpp
///
void split(const string& s, const string& sep, vector<string>& v)
{
    typedef string::const_iterator iter;
    iter b = s.begin(), e = s.end(), i;
    iter sep_b = sep.begin(), sep_e = sep.end();

    // search through s
    while (b != e){
        i = search(b, e, sep_b, sep_e);

        // no more separator found
        if (i == e){
            // it's not an empty string
            if (b != e)
                v.push_back(string(b, e));
            break;
        }
        else if (i == b){
            // the separator is found and right at the beginning
            // in this case, we need to move on and search for the
            // next separator
            b = i + sep.length();
        }
        else{
            // found the separator
            v.push_back(string(b, i));
            b = i;
        }
    }
}

boost库很好,但并不总是可用的。手工做这些事情也是很好的脑力锻炼。这里我们只使用STL中的std::search()算法,参见上面的代码。