Java有一个方便的分割方法:
String str = "The quick brown fox";
String[] results = str.split(" ");
在c++中有简单的方法来做到这一点吗?
Java有一个方便的分割方法:
String str = "The quick brown fox";
String[] results = str.split(" ");
在c++中有简单的方法来做到这一点吗?
当前回答
If the maximum length of the input string to be tokenized is known, one can exploit this and implement a very fast version. I am sketching the basic idea below, which was inspired by both strtok() and the "suffix array"-data structure described Jon Bentley's "Programming Perls" 2nd edition, chapter 15. The C++ class in this case only gives some organization and convenience of use. The implementation shown can be easily extended for removing leading and trailing whitespace characters in the tokens.
基本上,可以将分隔符替换为以字符串结束的'\0'字符,并设置指向修改后字符串中的标记的指针。在极端情况下,当字符串仅由分隔符组成时,将得到字符串长度加1个空标记。复制要修改的字符串是可行的。
头文件:
class TextLineSplitter
{
public:
TextLineSplitter( const size_t max_line_len );
~TextLineSplitter();
void SplitLine( const char *line,
const char sep_char = ',',
);
inline size_t NumTokens( void ) const
{
return mNumTokens;
}
const char * GetToken( const size_t token_idx ) const
{
assert( token_idx < mNumTokens );
return mTokens[ token_idx ];
}
private:
const size_t mStorageSize;
char *mBuff;
char **mTokens;
size_t mNumTokens;
inline void ResetContent( void )
{
memset( mBuff, 0, mStorageSize );
// mark all items as empty:
memset( mTokens, 0, mStorageSize * sizeof( char* ) );
// reset counter for found items:
mNumTokens = 0L;
}
};
Implementattion文件:
TextLineSplitter::TextLineSplitter( const size_t max_line_len ):
mStorageSize ( max_line_len + 1L )
{
// allocate memory
mBuff = new char [ mStorageSize ];
mTokens = new char* [ mStorageSize ];
ResetContent();
}
TextLineSplitter::~TextLineSplitter()
{
delete [] mBuff;
delete [] mTokens;
}
void TextLineSplitter::SplitLine( const char *line,
const char sep_char /* = ',' */,
)
{
assert( sep_char != '\0' );
ResetContent();
strncpy( mBuff, line, mMaxLineLen );
size_t idx = 0L; // running index for characters
do
{
assert( idx < mStorageSize );
const char chr = line[ idx ]; // retrieve current character
if( mTokens[ mNumTokens ] == NULL )
{
mTokens[ mNumTokens ] = &mBuff[ idx ];
} // if
if( chr == sep_char || chr == '\0' )
{ // item or line finished
// overwrite separator with a 0-terminating character:
mBuff[ idx ] = '\0';
// count-up items:
mNumTokens ++;
} // if
} while( line[ idx++ ] );
}
使用的场景是:
// create an instance capable of splitting strings up to 1000 chars long:
TextLineSplitter spl( 1000 );
spl.SplitLine( "Item1,,Item2,Item3" );
for( size_t i = 0; i < spl.NumTokens(); i++ )
{
printf( "%s\n", spl.GetToken( i ) );
}
输出:
Item1
Item2
Item3
其他回答
使用strtok。在我看来,没有必要围绕标记化构建类,除非strtok不能提供您所需要的东西。可能不会,但在用C和c++编写各种解析代码的15年多时间里,我一直在使用strtok。这里有一个例子
char myString[] = "The quick brown fox";
char *p = strtok(myString, " ");
while (p) {
printf ("Token: %s\n", p);
p = strtok(NULL, " ");
}
一些注意事项(可能不适合您的需要)。该字符串在该过程中被“销毁”,这意味着EOS字符内联放置在分隔符点中。正确的用法可能需要创建字符串的非const版本。还可以在解析过程中更改分隔符列表。
在我看来,上面的代码比为它单独编写一个类要简单得多,也更容易使用。对我来说,这是语言提供的功能之一,而且它做得很好,很干净。这只是一个“基于C”的解决方案。它很合适,很简单,而且你不需要写很多额外的代码:-)
这是一个简单的stl解决方案(~5行!)使用std::find和std::find_first_not_of来处理重复的分隔符(例如空格或句号),以及开头和结尾的分隔符:
#include <string>
#include <vector>
void tokenize(std::string str, std::vector<string> &token_v){
size_t start = str.find_first_not_of(DELIMITER), end=start;
while (start != std::string::npos){
// Find next occurence of delimiter
end = str.find(DELIMITER, start);
// Push back the token found into vector
token_v.push_back(str.substr(start, end-start));
// Skip all occurences of the delimiter to find new start
start = str.find_first_not_of(DELIMITER, end);
}
}
现场试试吧!
这是一个非常简单的问题:
#include <vector>
#include <string>
using namespace std;
vector<string> split(const char *str, char c = ' ')
{
vector<string> result;
do
{
const char *begin = str;
while(*str != c && *str)
str++;
result.push_back(string(begin, str));
} while (0 != *str++);
return result;
}
Adam Pierce的回答提供了一个采用const char*的手工标记器。使用迭代器会有一些问题,因为对字符串的结束迭代器进行递增是未定义的。也就是说,给定字符串str{"The quick brown fox"},我们当然可以做到:
auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };
while (start != cend(str)) {
const auto finish = find(++start, cend(str), ' ');
tokens.push_back(string(start, finish));
start = finish;
}
生活的例子
如果你想通过使用标准功能来抽象复杂性,On Freund建议strtok是一个简单的选择:
vector<string> tokens;
for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);
如果你不能访问c++ 17,你需要像这个例子一样替换data(str): http://ideone.com/8kAGoa
虽然在示例中没有演示,但strtok不需要为每个标记使用相同的分隔符。除了这个优势,还有几个缺点:
strtok cannot be used on multiple strings at the same time: Either a nullptr must be passed to continue tokenizing the current string or a new char* to tokenize must be passed (there are some non-standard implementations which do support this however, such as: strtok_s) For the same reason strtok cannot be used on multiple threads simultaneously (this may however be implementation defined, for example: Visual Studio's implementation is thread safe) Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings, to tokenize any of these with strtok or to operate on a string who's contents need to be preserved, str would have to be copied, then the copy could be operated on
c++20为我们提供了split_view来以非破坏性的方式标记字符串:https://topanswers.xyz/cplusplus?q=749#a874
前面的方法不能就地生成标记化的向量,这意味着如果不将它们抽象为辅助函数,它们就不能初始化const vector<string>令牌。该功能和接受任何空白分隔符的能力可以使用istream_iterator来利用。例如,给定const string str{"The quick \tbrown \nfox"},我们可以这样做:
istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };
生活的例子
对于这个选项,需要构造一个istringstream的代价比前面两个选项要大得多,但是这个代价通常隐藏在字符串分配的代价中。
如果上面的选项都不够灵活,不能满足您的标记化需求,那么最灵活的选项是使用regex_token_iterator,当然这种灵活性会带来更大的开销,但同样,这可能隐藏在字符串分配成本中。例如,我们想要基于非转义的逗号进行标记化,也吃空白,给定以下输入:const string str{" the,qu\\,ick,\tbrown, fox"}我们可以这样做:
const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };
生活的例子
我贴出了类似问题的答案。 不要白费力气。我使用过许多库,我遇到过的最快、最灵活的库是:c++ String Toolkit Library。
这里有一个如何使用它的例子,我已经张贴在stackoverflow的其他地方。
#include <iostream>
#include <vector>
#include <string>
#include <strtk.hpp>
const char *whitespace = " \t\r\n\f";
const char *whitespace_and_punctuation = " \t\r\n\f;,=";
int main()
{
{ // normal parsing of a string into a vector of strings
std::string s("Somewhere down the road");
std::vector<std::string> result;
if( strtk::parse( s, whitespace, result ) )
{
for(size_t i = 0; i < result.size(); ++i )
std::cout << result[i] << std::endl;
}
}
{ // parsing a string into a vector of floats with other separators
// besides spaces
std::string s("3.0, 3.14; 4.0");
std::vector<float> values;
if( strtk::parse( s, whitespace_and_punctuation, values ) )
{
for(size_t i = 0; i < values.size(); ++i )
std::cout << values[i] << std::endl;
}
}
{ // parsing a string into specific variables
std::string s("angle = 45; radius = 9.9");
std::string w1, w2;
float v1, v2;
if( strtk::parse( s, whitespace_and_punctuation, w1, v1, w2, v2) )
{
std::cout << "word " << w1 << ", value " << v1 << std::endl;
std::cout << "word " << w2 << ", value " << v2 << std::endl;
}
}
return 0;
}