我需要在c++中加载和使用CSV文件数据。在这一点上,它实际上只是一个以逗号分隔的解析器(即不用担心转义新行和逗号)。主要需要的是逐行解析器,它将在每次调用方法时为下一行返回一个向量。
我发现这篇文章看起来很有前途: http://www.boost.org/doc/libs/1_35_0/libs/spirit/example/fundamental/list_parser.cpp
我从未使用过Boost's Spirit,但我愿意尝试一下。但前提是我忽略了一个更直接的解决方案。
我需要在c++中加载和使用CSV文件数据。在这一点上,它实际上只是一个以逗号分隔的解析器(即不用担心转义新行和逗号)。主要需要的是逐行解析器,它将在每次调用方法时为下一行返回一个向量。
我发现这篇文章看起来很有前途: http://www.boost.org/doc/libs/1_35_0/libs/spirit/example/fundamental/list_parser.cpp
我从未使用过Boost's Spirit,但我愿意尝试一下。但前提是我忽略了一个更直接的解决方案。
当前回答
CSV文件是由行组成的文本文件,每一行都由逗号分隔的令牌组成。虽然在解析时你应该知道一些事情:
(0)文件用“CP_ACP”编码页编码。您应该使用相同的编码页来解码文件内容。
(1) CSV丢失了“复合单元格”信息(比如rowspan > 1),所以当它被读回excel时,复合单元格信息丢失。
(2)单元格文本可以在头部和尾部用""" "进行引用,文字引用char将变成双引号。因此,结束匹配的引号字符必须是一个引号字符,而不是后面跟着另一个引号字符。例如,如果一个单元格有逗号,它必须在csv中被引用,因为逗号在csv中有意义。
(3)当单元格内容有多行时,它将在CSV中被引用,在这种情况下,解析器必须继续读取CSV文件中的下几行,直到获得与第一个引用字符匹配的结束引号字符,确保当前逻辑行读取完成后再解析该行的令牌。
例如:在CSV文件中,以下3个物理行是由3个令牌组成的逻辑行:
--+----------
1 |a,"b-first part
2 |b-second part
3 |b-third part",c
--+----------
其他回答
你可能想看看我的自由/开源软件项目CSVfix(更新链接),这是一个用c++编写的CSV流编辑器。CSV解析器不是什么好东西,但它完成了工作,整个包可以在不编写任何代码的情况下满足您的需要。
CSV解析器请参见alib/src/a_csv.cpp,使用示例请参见csvlib/src/csved_ioman.cpp (IOManager::ReadCSV)。
如果您所需要的只是加载一个双精度数据文件(没有整数,没有文本),那么这里有一个随时可用的函数。
#include <sstream>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>
#include <algorithm>
using namespace std;
/**
* Parse a CSV data file and fill the 2d STL vector "data".
* Limits: only "pure datas" of doubles, not encapsulated by " and without \n inside.
* Further no formatting in the data (e.g. scientific notation)
* It however handles both dots and commas as decimal separators and removes thousand separator.
*
* returnCodes[0]: file access 0-> ok 1-> not able to read; 2-> decimal separator equal to comma separator
* returnCodes[1]: number of records
* returnCodes[2]: number of fields. -1 If rows have different field size
*
*/
vector<int>
readCsvData (vector <vector <double>>& data, const string& filename, const string& delimiter, const string& decseparator){
int vv[3] = { 0,0,0 };
vector<int> returnCodes(&vv[0], &vv[0]+3);
string rowstring, stringtoken;
double doubletoken;
int rowcount=0;
int fieldcount=0;
data.clear();
ifstream iFile(filename, ios_base::in);
if (!iFile.is_open()){
returnCodes[0] = 1;
return returnCodes;
}
while (getline(iFile, rowstring)) {
if (rowstring=="") continue; // empty line
rowcount ++; //let's start with 1
if(delimiter == decseparator){
returnCodes[0] = 2;
return returnCodes;
}
if(decseparator != "."){
// remove dots (used as thousand separators)
string::iterator end_pos = remove(rowstring.begin(), rowstring.end(), '.');
rowstring.erase(end_pos, rowstring.end());
// replace decimal separator with dots.
replace(rowstring.begin(), rowstring.end(),decseparator.c_str()[0], '.');
} else {
// remove commas (used as thousand separators)
string::iterator end_pos = remove(rowstring.begin(), rowstring.end(), ',');
rowstring.erase(end_pos, rowstring.end());
}
// tokenize..
vector<double> tokens;
// Skip delimiters at beginning.
string::size_type lastPos = rowstring.find_first_not_of(delimiter, 0);
// Find first "non-delimiter".
string::size_type pos = rowstring.find_first_of(delimiter, lastPos);
while (string::npos != pos || string::npos != lastPos){
// Found a token, convert it to double add it to the vector.
stringtoken = rowstring.substr(lastPos, pos - lastPos);
if (stringtoken == "") {
tokens.push_back(0.0);
} else {
istringstream totalSString(stringtoken);
totalSString >> doubletoken;
tokens.push_back(doubletoken);
}
// Skip delimiters. Note the "not_of"
lastPos = rowstring.find_first_not_of(delimiter, pos);
// Find next "non-delimiter"
pos = rowstring.find_first_of(delimiter, lastPos);
}
if(rowcount == 1){
fieldcount = tokens.size();
returnCodes[2] = tokens.size();
} else {
if ( tokens.size() != fieldcount){
returnCodes[2] = -1;
}
}
data.push_back(tokens);
}
iFile.close();
returnCodes[1] = rowcount;
return returnCodes;
}
我写了一个只有头文件的c++ 11 CSV解析器。它经过了良好的测试,快速,支持整个CSV规范(带引号的字段,引号中的分隔符/结束符,引号转义等),并且可以配置为不符合规范的CSV。
配置是通过一个流畅的接口完成的:
// constructor accepts any input stream
CsvParser parser = CsvParser(std::cin)
.delimiter(';') // delimited by ; instead of ,
.quote('\'') // quoted fields use ' instead of "
.terminator('\0'); // terminated by \0 instead of by \r\n, \n, or \r
解析只是一个基于范围的for循环:
#include <iostream>
#include "../parser.hpp"
using namespace aria::csv;
int main() {
std::ifstream f("some_file.csv");
CsvParser parser(f);
for (auto& row : parser) {
for (auto& field : row) {
std::cout << field << " | ";
}
std::cout << std::endl;
}
}
我写了一个很好的解析CSV文件的方法,我认为我应该把它作为一个答案:
#include <algorithm>
#include <fstream>
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
struct CSVDict
{
std::vector< std::string > inputImages;
std::vector< double > inputLabels;
};
/**
\brief Splits the string
\param str String to split
\param delim Delimiter on the basis of which splitting is to be done
\return results Output in the form of vector of strings
*/
std::vector<std::string> stringSplit( const std::string &str, const std::string &delim )
{
std::vector<std::string> results;
for (size_t i = 0; i < str.length(); i++)
{
std::string tempString = "";
while ((str[i] != *delim.c_str()) && (i < str.length()))
{
tempString += str[i];
i++;
}
results.push_back(tempString);
}
return results;
}
/**
\brief Parse the supplied CSV File and obtain Row and Column information.
Assumptions:
1. Header information is in first row
2. Delimiters are only used to differentiate cell members
\param csvFileName The full path of the file to parse
\param inputColumns The string of input columns which contain the data to be used for further processing
\param inputLabels The string of input labels based on which further processing is to be done
\param delim The delimiters used in inputColumns and inputLabels
\return Vector of Vector of strings: Collection of rows and columns
*/
std::vector< CSVDict > parseCSVFile( const std::string &csvFileName, const std::string &inputColumns, const std::string &inputLabels, const std::string &delim )
{
std::vector< CSVDict > return_CSVDict;
std::vector< std::string > inputColumnsVec = stringSplit(inputColumns, delim), inputLabelsVec = stringSplit(inputLabels, delim);
std::vector< std::vector< std::string > > returnVector;
std::ifstream inFile(csvFileName.c_str());
int row = 0;
std::vector< size_t > inputColumnIndeces, inputLabelIndeces;
for (std::string line; std::getline(inFile, line, '\n');)
{
CSVDict tempDict;
std::vector< std::string > rowVec;
line.erase(std::remove(line.begin(), line.end(), '"'), line.end());
rowVec = stringSplit(line, delim);
// for the first row, record the indeces of the inputColumns and inputLabels
if (row == 0)
{
for (size_t i = 0; i < rowVec.size(); i++)
{
for (size_t j = 0; j < inputColumnsVec.size(); j++)
{
if (rowVec[i] == inputColumnsVec[j])
{
inputColumnIndeces.push_back(i);
}
}
for (size_t j = 0; j < inputLabelsVec.size(); j++)
{
if (rowVec[i] == inputLabelsVec[j])
{
inputLabelIndeces.push_back(i);
}
}
}
}
else
{
for (size_t i = 0; i < inputColumnIndeces.size(); i++)
{
tempDict.inputImages.push_back(rowVec[inputColumnIndeces[i]]);
}
for (size_t i = 0; i < inputLabelIndeces.size(); i++)
{
double test = std::atof(rowVec[inputLabelIndeces[i]].c_str());
tempDict.inputLabels.push_back(std::atof(rowVec[inputLabelIndeces[i]].c_str()));
}
return_CSVDict.push_back(tempDict);
}
row++;
}
return return_CSVDict;
}
如果你不关心转义逗号和换行符, 并且你不能在引号中嵌入逗号和换行符(如果你不能转义那么…) 那么它只有大约三行代码(好的14 ->,但它只有15读取整个文件)。
std::vector<std::string> getNextLineAndSplitIntoTokens(std::istream& str)
{
std::vector<std::string> result;
std::string line;
std::getline(str,line);
std::stringstream lineStream(line);
std::string cell;
while(std::getline(lineStream,cell, ','))
{
result.push_back(cell);
}
// This checks for a trailing comma with no data after it.
if (!lineStream && cell.empty())
{
// If there was a trailing comma then add an empty element.
result.push_back("");
}
return result;
}
我只需要创建一个表示一行的类。 然后流到该对象:
#include <iterator>
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <string>
class CSVRow
{
public:
std::string_view operator[](std::size_t index) const
{
return std::string_view(&m_line[m_data[index] + 1], m_data[index + 1] - (m_data[index] + 1));
}
std::size_t size() const
{
return m_data.size() - 1;
}
void readNextRow(std::istream& str)
{
std::getline(str, m_line);
m_data.clear();
m_data.emplace_back(-1);
std::string::size_type pos = 0;
while((pos = m_line.find(',', pos)) != std::string::npos)
{
m_data.emplace_back(pos);
++pos;
}
// This checks for a trailing comma with no data after it.
pos = m_line.size();
m_data.emplace_back(pos);
}
private:
std::string m_line;
std::vector<int> m_data;
};
std::istream& operator>>(std::istream& str, CSVRow& data)
{
data.readNextRow(str);
return str;
}
int main()
{
std::ifstream file("plop.csv");
CSVRow row;
while(file >> row)
{
std::cout << "4th Element(" << row[3] << ")\n";
}
}
但只要做一点工作,我们就可以在技术上创建一个迭代器:
class CSVIterator
{
public:
typedef std::input_iterator_tag iterator_category;
typedef CSVRow value_type;
typedef std::size_t difference_type;
typedef CSVRow* pointer;
typedef CSVRow& reference;
CSVIterator(std::istream& str) :m_str(str.good()?&str:nullptr) { ++(*this); }
CSVIterator() :m_str(nullptr) {}
// Pre Increment
CSVIterator& operator++() {if (m_str) { if (!((*m_str) >> m_row)){m_str = nullptr;}}return *this;}
// Post increment
CSVIterator operator++(int) {CSVIterator tmp(*this);++(*this);return tmp;}
CSVRow const& operator*() const {return m_row;}
CSVRow const* operator->() const {return &m_row;}
bool operator==(CSVIterator const& rhs) {return ((this == &rhs) || ((this->m_str == nullptr) && (rhs.m_str == nullptr)));}
bool operator!=(CSVIterator const& rhs) {return !((*this) == rhs);}
private:
std::istream* m_str;
CSVRow m_row;
};
int main()
{
std::ifstream file("plop.csv");
for(CSVIterator loop(file); loop != CSVIterator(); ++loop)
{
std::cout << "4th Element(" << (*loop)[3] << ")\n";
}
}
现在我们已经到了2020年,让我们添加一个CSVRange对象:
class CSVRange
{
std::istream& stream;
public:
CSVRange(std::istream& str)
: stream(str)
{}
CSVIterator begin() const {return CSVIterator{stream};}
CSVIterator end() const {return CSVIterator{};}
};
int main()
{
std::ifstream file("plop.csv");
for(auto& row: CSVRange(file))
{
std::cout << "4th Element(" << row[3] << ")\n";
}
}