我使用巨大的数据文件,有时我只需要知道这些文件中的行数,通常我打开它们,一行一行地读取它们,直到我到达文件的末尾

我在想有没有更聪明的办法


当前回答

一个直接的方式使用扫描器

static void lineCounter (String path) throws IOException {

        int lineCount = 0, commentsCount = 0;

        Scanner input = new Scanner(new File(path));
        while (input.hasNextLine()) {
            String data = input.nextLine();

            if (data.startsWith("//")) commentsCount++;

            lineCount++;
        }

        System.out.println("Line Count: " + lineCount + "\t Comments Count: " + commentsCount);
    }

其他回答

我的结论是wc -l:s计算换行的方法是好的,但是在最后一行不以换行符结束的文件上返回非直观的结果。

和@。基于LineNumberReader的vikas解决方案,但在行数中添加一个,在最后一行以换行符结束的文件上返回非直观的结果。

因此我做了一个算法,处理如下:

@Test
public void empty() throws IOException {
    assertEquals(0, count(""));
}

@Test
public void singleNewline() throws IOException {
    assertEquals(1, count("\n"));
}

@Test
public void dataWithoutNewline() throws IOException {
    assertEquals(1, count("one"));
}

@Test
public void oneCompleteLine() throws IOException {
    assertEquals(1, count("one\n"));
}

@Test
public void twoCompleteLines() throws IOException {
    assertEquals(2, count("one\ntwo\n"));
}

@Test
public void twoLinesWithoutNewlineAtEnd() throws IOException {
    assertEquals(2, count("one\ntwo"));
}

@Test
public void aFewLines() throws IOException {
    assertEquals(5, count("one\ntwo\nthree\nfour\nfive\n"));
}

它是这样的:

static long countLines(InputStream is) throws IOException {
    try(LineNumberReader lnr = new LineNumberReader(new InputStreamReader(is))) {
        char[] buf = new char[8192];
        int n, previousN = -1;
        //Read will return at least one byte, no need to buffer more
        while((n = lnr.read(buf)) != -1) {
            previousN = n;
        }
        int ln = lnr.getLineNumber();
        if (previousN == -1) {
            //No data read at all, i.e file was empty
            return 0;
        } else {
            char lastChar = buf[previousN - 1];
            if (lastChar == '\n' || lastChar == '\r') {
                //Ending with newline, deduct one
                return ln;
            }
        }
        //normal case, return line number + 1
        return ln + 1;
    }
}

如果你想要直观的结果,你可以用这个。如果您只想要wc -l兼容性,只需使用@er即可。Vikas解决方案,但不添加一个到结果,并重试跳过:

try(LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")))) {
    while(lnr.skip(Long.MAX_VALUE) > 0){};
    return lnr.getLineNumber();
}

接受的答案有一个错误关闭多行文件,不以换行符结束。一个没有换行符的单行文件将返回1,但是一个没有换行符的两行文件也将返回1。下面是解决这个问题的公认解决方案的实现。endsWithoutNewLine检查对于除最终读取外的所有内容都是浪费的,但与整个函数相比,应该是微不足道的时间。

public int count(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean endsWithoutNewLine = false;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n')
                    ++count;
            }
            endsWithoutNewLine = (c[readChars - 1] != '\n');
        }
        if(endsWithoutNewLine) {
            ++count;
        } 
        return count;
    } finally {
        is.close();
    }
}

我测试了上面的方法来计数行,这里是我对不同方法的观察,在我的系统上进行了测试

文件大小:1.6 Gb 方法:

使用扫描仪:大约35秒 使用BufferedReader:大约5s 使用Java 8: 5s左右 使用LineNumberReader:大约5s

此外,Java8方法似乎非常方便:

Files.lines(Paths.get(filePath), Charset.defaultCharset()).count()
[Return type : long]

这是我迄今为止发现的最快的版本,大约比readLines快6倍。对于150MB的日志文件,这需要0.35秒,而在使用readLines()时需要2.40秒。只是为了好玩,linux的wc -l命令需要0.15秒。

public static int countLinesOld(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}

编辑,9年半后:我几乎没有java经验,但无论如何,我试图将这段代码与下面的LineNumberReader解决方案进行基准测试,因为没有人这样做让我感到困扰。似乎对于大文件,我的解决方案更快。虽然它似乎需要几次运行,直到优化器做一个像样的工作。我已经玩了一些代码,并产生了一个新版本,始终是最快的:

public static int countLinesNew(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        
        int readChars = is.read(c);
        if (readChars == -1) {
            // bail out if nothing to read
            return 0;
        }
        
        // make it easy for the optimizer to tune this loop
        int count = 0;
        while (readChars == 1024) {
            for (int i=0; i<1024;) {
                if (c[i++] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }
        
        // count remaining characters
        while (readChars != -1) {
            for (int i=0; i<readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }
        
        return count == 0 ? 1 : count;
    } finally {
        is.close();
    }
}

1.3GB文本文件的基准测试结果,y轴以秒为单位。我已经对同一个文件执行了100次运行,并使用System.nanoTime()对每次运行进行了测量。您可以看到countLinesOld有一些异常值,而countLinesNew没有异常值,虽然它只是稍微快一点,但差异在统计上是显著的。LineNumberReader显然更慢。

EOF处没有换行符('\n')的多行文件的最佳优化代码。

/**
 * 
 * @param filename
 * @return
 * @throws IOException
 */
public static int countLines(String filename) throws IOException {
    int count = 0;
    boolean empty = true;
    FileInputStream fis = null;
    InputStream is = null;
    try {
        fis = new FileInputStream(filename);
        is = new BufferedInputStream(fis);
        byte[] c = new byte[1024];
        int readChars = 0;
        boolean isLine = false;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if ( c[i] == '\n' ) {
                    isLine = false;
                    ++count;
                }else if(!isLine && c[i] != '\n' && c[i] != '\r'){   //Case to handle line count where no New Line character present at EOF
                    isLine = true;
                }
            }
        }
        if(isLine){
            ++count;
        }
    }catch(IOException e){
        e.printStackTrace();
    }finally {
        if(is != null){
            is.close();    
        }
        if(fis != null){
            fis.close();    
        }
    }
    LOG.info("count: "+count);
    return (count == 0 && !empty) ? 1 : count;
}