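I work with huge data files, and sometimes I only need to know the number of lines in them. Usually I open the file and read it line by line until I reach the end.
I was wondering whether there is a smarter way to do that.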
Current answer
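The accepted answer has an off-by-one error for multi-line files that do not end in a newline. A one-line file with no trailing newline returns 1, but a two-line file with no trailing newline also returns 1. Below is an implementation of the accepted solution that fixes this. The endsWithoutNewLine check is wasted work on every read except the last one, but its cost should be negligible compared with the function as a whole.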
public int count(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean endsWithoutNewLine = false;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n')
                    ++count;
            }
            endsWithoutNewLine = (c[readChars - 1] != '\n');
        }
        if (endsWithoutNewLine) {
            ++count;
        }
        return count;
    } finally {
        is.close();
    }
}
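To make the off-by-one case concrete, here is a minimal sketch that applies the same counting rule to in-memory streams instead of files. The class name TrailingNewlineDemo and the String overload are illustrative assumptions, not part of the answer above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class TrailingNewlineDemo {

    // Same rule as count() above: count '\n' bytes, then add one
    // if the last byte read was not a newline.
    static int countLines(InputStream is) throws IOException {
        byte[] buffer = new byte[1024];
        int count = 0;
        int read;
        boolean endsWithoutNewLine = false;
        while ((read = is.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                if (buffer[i] == '\n') {
                    count++;
                }
            }
            endsWithoutNewLine = (buffer[read - 1] != '\n');
        }
        return endsWithoutNewLine ? count + 1 : count;
    }

    // Hypothetical helper: wraps a string in an in-memory stream for the demo.
    static int countLines(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        try {
            return countLines(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countLines("one"));        // prints 1
        System.out.println(countLines("one\ntwo"));   // prints 2 (plain '\n' counting gives 1)
        System.out.println(countLines("one\ntwo\n")); // prints 2
        System.out.println(countLines(""));           // prints 0
    }
}
```

Only the last buffer matters for endsWithoutNewLine, which is why the per-read assignment is harmless: each read simply overwrites the previous value.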
Other answers
Optimized code for multi-line files that have no newline ('\n') character at EOF.
/**
 * Counts the lines in a file, including a final line that has no trailing newline.
 *
 * @param filename path of the file to read
 * @return the number of lines in the file
 * @throws IOException if the file cannot be opened or read
 */
public static int countLines(String filename) throws IOException {
    int count = 0;
    boolean empty = true;
    // try-with-resources closes the stream even if read() throws
    try (InputStream is = new BufferedInputStream(new FileInputStream(filename))) {
        byte[] c = new byte[1024];
        int readChars = 0;
        boolean isLine = false; // true while inside a line not yet terminated by '\n'
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    isLine = false;
                    ++count;
                } else if (!isLine && c[i] != '\r') {
                    // Handles the case where no newline character is present at EOF
                    isLine = true;
                }
            }
        }
        if (isLine) {
            ++count;
        }
    }
    return (count == 0 && !empty) ? 1 : count;
}
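The count() method in the answer above gives a wrong line count if the file has no newline at the end: it fails to count the last line of the file.
This method works better for me: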
public int countLines(String filename) throws IOException {
    // try-with-resources closes the reader even if readLine() throws
    try (LineNumberReader reader = new LineNumberReader(new FileReader(filename))) {
        while (reader.readLine() != null) {
            // keep reading; the reader tracks the line number itself
        }
        return reader.getLineNumber();
    }
}
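As a quick check of how LineNumberReader treats a missing trailing newline, the sketch below runs the same loop over in-memory strings. The class name LineNumberReaderDemo and the use of StringReader are assumptions made for the demo; the counting itself relies only on readLine() and getLineNumber().

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineNumberReaderDemo {

    // Reads every line and returns the reader's final line number.
    static int countLines(String text) {
        try (LineNumberReader reader = new LineNumberReader(new StringReader(text))) {
            while (reader.readLine() != null) {
                // readLine() advances the line number; nothing else to do
            }
            return reader.getLineNumber();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countLines("a\nb"));      // prints 2: the unterminated last line counts
        System.out.println(countLines("a\nb\n"));    // prints 2
        System.out.println(countLines("a\r\nb\rc")); // prints 3: "\r\n" and "\r" both end a line
    }
}
```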
There seem to be a few different ways to use LineNumberReader.
This is how I did it:
int lines = 0;
try (LineNumberReader count = new LineNumberReader(new FileReader(fileLocation))) {
    // getLineNumber() goes up by one each time readLine() returns a line,
    // so after the loop it equals the number of lines read.
    while (count.readLine() != null) {
        lines = count.getLineNumber();
    }
}
System.out.println(lines);
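Even simpler, you can use the BufferedReader lines() method to get a stream of the lines, then call the stream's count() method to count them. Each element of the stream is one line, so the count is already the number of lines in the text file and nothing needs to be added to it.
For example:
try (LineNumberReader count = new LineNumberReader(new FileReader(fileLocation))) {
    long lines = count.lines().count();
    System.out.println(lines);
}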
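A related option, assuming Java 8 or later, is Files.lines from java.nio.file, which returns the same kind of line stream without constructing a reader by hand. This is a swapped-in variant rather than the code above, and the class name StreamLineCount and the temp-file helper countLinesOf are hypothetical, added only so the sketch can run on its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class StreamLineCount {

    // Counts lines lazily; closing the stream releases the file handle.
    static long countLines(Path path) {
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper for the demo: writes text to a temp file, counts, cleans up.
    static long countLinesOf(String text) {
        try {
            Path tmp = Files.createTempFile("linecount", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return countLines(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countLinesOf("a\nb\nc\n")); // prints 3
        System.out.println(countLinesOf("a\nb\nc"));   // prints 3
    }
}
```

Note that this decodes the file into strings, so it is likely to be slower than the byte-scanning approaches on very large files.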
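I know this is an old question, but the accepted solution did not quite do what I needed. So I refined it to accept various line terminators (rather than just line feed) and to use a specified character encoding (rather than ISO-8859-n). It is all in one method (refactor as appropriate):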
public static long getLinesCount(String fileName, String encodingName) throws IOException {
    long linesCount = 0;
    File file = new File(fileName);
    FileInputStream fileIn = new FileInputStream(file);
    try {
        Charset encoding = Charset.forName(encodingName);
        Reader fileReader = new InputStreamReader(fileIn, encoding);
        int bufferSize = 4096;
        Reader reader = new BufferedReader(fileReader, bufferSize);
        char[] buffer = new char[bufferSize];
        int prevChar = -1;
        int readCount = reader.read(buffer);
        while (readCount != -1) {
            for (int i = 0; i < readCount; i++) {
                int nextChar = buffer[i];
                switch (nextChar) {
                    case '\r': {
                        // The current line is terminated by a carriage return or by a carriage return immediately followed by a line feed.
                        linesCount++;
                        break;
                    }
                    case '\n': {
                        if (prevChar == '\r') {
                            // The current line is terminated by a carriage return immediately followed by a line feed.
                            // The line has already been counted.
                        } else {
                            // The current line is terminated by a line feed.
                            linesCount++;
                        }
                        break;
                    }
                }
                prevChar = nextChar;
            }
            readCount = reader.read(buffer);
        }
        if (prevChar != -1) {
            switch (prevChar) {
                case '\r':
                case '\n': {
                    // The last line is terminated by a line terminator.
                    // The last line has already been counted.
                    break;
                }
                default: {
                    // The last line is terminated by end-of-file.
                    linesCount++;
                }
            }
        }
    } finally {
        fileIn.close();
    }
    return linesCount;
}
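The terminator rules are easier to see on small inputs. The following sketch applies the same rules (a carriage return, a line feed, or a carriage return followed by a line feed each end exactly one line) to an in-memory character sequence. The class name TerminatorDemo and the CharSequence signature are assumptions for illustration only.

```java
class TerminatorDemo {

    // Counts lines, treating "\r", "\n" and "\r\n" each as a single terminator.
    static long countLines(CharSequence text) {
        long lines = 0;
        int prev = -1; // -1 means no character has been seen yet
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            // A '\n' directly after '\r' belongs to the terminator already counted.
            if (ch == '\r' || (ch == '\n' && prev != '\r')) {
                lines++;
            }
            prev = ch;
        }
        // A final line that ends at end-of-input rather than at a terminator still counts.
        if (prev != -1 && prev != '\r' && prev != '\n') {
            lines++;
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(countLines("a\r\nb\r\n")); // prints 2
        System.out.println(countLines("a\rb\nc"));    // prints 3
        System.out.println(countLines("\r\n\r\n"));   // prints 2
        System.out.println(countLines(""));           // prints 0
    }
}
```

Keeping prev across iterations is what lets a "\r\n" pair that straddles two buffer reads be counted once in the file-based version.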
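This solution is comparable in speed to the accepted solution: about 4% slower in my tests (though timing tests in Java are notoriously unreliable).
This is the fastest version I have found so far, about 6 times faster than readLines. On a 150MB log file it takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.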
public static int countLinesOld(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}
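Edit, 9 1/2 years later: I have practically no Java experience, but I tried to benchmark this code against the LineNumberReader solution below anyway, since it bothered me that nobody had done it. It seems that, for large files, my solution is faster, although it takes a few runs until the optimizer does a decent job. I have played with the code a bit and produced a new version that is consistently the fastest: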
public static int countLinesNew(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int readChars = is.read(c);
        if (readChars == -1) {
            // bail out if nothing to read
            return 0;
        }
        // make it easy for the optimizer to tune this loop
        int count = 0;
        while (readChars == 1024) {
            for (int i = 0; i < 1024;) {
                if (c[i++] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }
        // count remaining characters
        while (readChars != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }
        return count == 0 ? 1 : count;
    } finally {
        is.close();
    }
}
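Benchmark results for a 1.3GB text file, with the y-axis in seconds. I performed 100 runs on the same file and measured each run with System.nanoTime(). You can see that countLinesOld has a few outliers and countLinesNew has none, and although it is only slightly faster, the difference is statistically significant. LineNumberReader is clearly slower.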
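For anyone who wants to repeat this kind of measurement, here is a rough sketch of a timing loop built on System.nanoTime(). It is not the harness used for the results above. The class name LineCountTiming, the helper sampleFile, and the simplified countNewlines counter are all assumptions made to keep the example self-contained, and a proper benchmark would normally also discard warm-up runs.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;

class LineCountTiming {

    // Byte-scanning counter in the style of countLinesOld (counts '\n' bytes only).
    static int countNewlines(Path file) {
        try (InputStream is = new BufferedInputStream(Files.newInputStream(file))) {
            byte[] buf = new byte[1024];
            int count = 0;
            int n;
            while ((n = is.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        count++;
                    }
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs the counter `runs` times and returns the elapsed seconds of each run.
    static double[] timeRuns(Path file, int runs) {
        double[] seconds = new double[runs];
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            countNewlines(file);
            seconds[r] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    // Hypothetical helper: creates a temporary file with the given number of lines.
    static Path sampleFile(int lines) {
        try {
            Path tmp = Files.createTempFile("linecount-bench", ".txt");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, Collections.nCopies(lines, "some log line"));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass a path to a large file, or fall back to a generated sample.
        Path file = args.length > 0 ? Paths.get(args[0]) : sampleFile(1_000_000);
        double[] seconds = timeRuns(file, 100);
        Arrays.sort(seconds);
        System.out.printf("min %.4f s, median %.4f s, max %.4f s%n",
                seconds[0], seconds[seconds.length / 2], seconds[seconds.length - 1]);
    }
}
```

Reporting the minimum, median, and maximum makes outliers of the kind mentioned above visible without plotting every run.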