如果您有java.io.InputStream对象,应该如何处理该对象并生成String?
假设我有一个包含文本数据的InputStream,我想将其转换为String,例如,我可以将其写入日志文件。
获取InputStream并将其转换为字符串的最简单方法是什么?
public String convertStreamToString(InputStream is) {
// ???
}
如果您有java.io.InputStream对象,应该如何处理该对象并生成String?
假设我有一个包含文本数据的InputStream,我想将其转换为String,例如,我可以将其写入日志文件。
获取InputStream并将其转换为字符串的最简单方法是什么?
public String convertStreamToString(InputStream is) {
// ???
}
当前回答
当输入流来自类路径资源(这似乎是一项流行任务)时,Guava提供了更短的高效自动关闭解决方案:
byte[] bytes = Resources.toByteArray(classLoader.getResource(path));
or
String text = Resources.toString(classLoader.getResource(path), StandardCharsets.UTF_8);
还有ByteSource和CharSource的一般概念,它们温和地负责打开和关闭流。
因此,例如,不要显式打开一个小文件来读取其内容:
String content = Files.asCharSource(new File("robots.txt"), StandardCharsets.UTF_8).read();
byte[] data = Files.asByteSource(new File("favicon.ico")).read();
或者只是
String content = Files.toString(new File("robots.txt"), StandardCharsets.UTF_8);
byte[] data = Files.toByteArray(new File("favicon.ico"));
其他回答
下面是如何仅使用JDK使用字节数组缓冲区来实现这一点。这实际上就是commons-ioIOUItils.copy()方法的工作原理。如果要从Reader而不是InputStream复制,可以用char[]替换byte[]。
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
...
InputStream is = ....
ByteArrayOutputStream baos = new ByteArrayOutputStream(8192);
byte[] buffer = new byte[8192];
int count = 0;
try {
while ((count = is.read(buffer)) != -1) {
baos.write(buffer, 0, count);
}
}
finally {
try {
is.close();
}
catch (Exception ignore) {
}
}
String charset = "UTF-8";
String inputStreamAsString = baos.toString(charset);
与Okio一起:
String result = Okio.buffer(Okio.source(inputStream)).readUtf8();
我在这里对14个不同的答案做了一个基准测试(很抱歉没有提供学分,但有太多重复)。
结果非常令人惊讶。事实证明,Apache IOUtils是最慢的解决方案,ByteArrayOutputStream是最快的解决方案:
因此,首先是最好的方法:
public String inputStreamToString(InputStream inputStream) throws IOException {
try(ByteArrayOutputStream result = new ByteArrayOutputStream()) {
byte[] buffer = new byte[1024];
int length;
while ((length = inputStream.read(buffer)) != -1) {
result.write(buffer, 0, length);
}
return result.toString(UTF_8);
}
}
20个周期内20 MB随机字节的基准结果
时间(毫秒)
字节数组输出流测试:194NioStream:198Java9ISTransferTo:201Java9ISReadAllBytes:205缓冲输入流VsByteArray输出流:314ApacheStringWriter2:574GuavaCharStreams:589扫描仪读取器无下一测试:614扫描仪读数:633ApacheStringWriter:1544StreamApi:错误ParallelStreamApi:错误BufferReaderTest:错误InputStreamAndStringBuilder:错误
基准源代码
import com.google.common.io.CharStreams;
import org.apache.commons.io.IOUtils;
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
/**
* Created by Ilya Gazman on 2/13/18.
*/
public class InputStreamToString {
private static final String UTF_8 = "UTF-8";
public static void main(String... args) {
log("App started");
byte[] bytes = new byte[1024 * 1024];
new Random().nextBytes(bytes);
log("Stream is ready\n");
try {
test(bytes);
} catch (IOException e) {
e.printStackTrace();
}
}
private static void test(byte[] bytes) throws IOException {
List<Stringify> tests = Arrays.asList(
new ApacheStringWriter(),
new ApacheStringWriter2(),
new NioStream(),
new ScannerReader(),
new ScannerReaderNoNextTest(),
new GuavaCharStreams(),
new StreamApi(),
new ParallelStreamApi(),
new ByteArrayOutputStreamTest(),
new BufferReaderTest(),
new BufferedInputStreamVsByteArrayOutputStream(),
new InputStreamAndStringBuilder(),
new Java9ISTransferTo(),
new Java9ISReadAllBytes()
);
String solution = new String(bytes, "UTF-8");
for (Stringify test : tests) {
try (ByteArrayInputStream inputStream = new ByteArrayInputStream(bytes)) {
String s = test.inputStreamToString(inputStream);
if (!s.equals(solution)) {
log(test.name() + ": Error");
continue;
}
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < 20; i++) {
try (ByteArrayInputStream inputStream = new ByteArrayInputStream(bytes)) {
test.inputStreamToString(inputStream);
}
}
log(test.name() + ": " + (System.currentTimeMillis() - startTime));
}
}
private static void log(String message) {
System.out.println(message);
}
interface Stringify {
String inputStreamToString(InputStream inputStream) throws IOException;
default String name() {
return this.getClass().getSimpleName();
}
}
static class ApacheStringWriter implements Stringify {
@Override
public String inputStreamToString(InputStream inputStream) throws IOException {
StringWriter writer = new StringWriter();
IOUtils.copy(inputStream, writer, UTF_8);
return writer.toString();
}
}
static class ApacheStringWriter2 implements Stringify {
@Override
public String inputStreamToString(InputStream inputStream) throws IOException {
return IOUtils.toString(inputStream, UTF_8);
}
}
static class NioStream implements Stringify {
@Override
public String inputStreamToString(InputStream in) throws IOException {
ReadableByteChannel channel = Channels.newChannel(in);
ByteBuffer byteBuffer = ByteBuffer.allocate(1024 * 16);
ByteArrayOutputStream bout = new ByteArrayOutputStream();
WritableByteChannel outChannel = Channels.newChannel(bout);
while (channel.read(byteBuffer) > 0 || byteBuffer.position() > 0) {
byteBuffer.flip(); //make buffer ready for write
outChannel.write(byteBuffer);
byteBuffer.compact(); //make buffer ready for reading
}
channel.close();
outChannel.close();
return bout.toString(UTF_8);
}
}
static class ScannerReader implements Stringify {
@Override
public String inputStreamToString(InputStream is) throws IOException {
java.util.Scanner s = new java.util.Scanner(is).useDelimiter("\\A");
return s.hasNext() ? s.next() : "";
}
}
static class ScannerReaderNoNextTest implements Stringify {
@Override
public String inputStreamToString(InputStream is) throws IOException {
java.util.Scanner s = new java.util.Scanner(is).useDelimiter("\\A");
return s.next();
}
}
static class GuavaCharStreams implements Stringify {
@Override
public String inputStreamToString(InputStream is) throws IOException {
return CharStreams.toString(new InputStreamReader(
is, UTF_8));
}
}
static class StreamApi implements Stringify {
@Override
public String inputStreamToString(InputStream inputStream) throws IOException {
return new BufferedReader(new InputStreamReader(inputStream))
.lines().collect(Collectors.joining("\n"));
}
}
static class ParallelStreamApi implements Stringify {
@Override
public String inputStreamToString(InputStream inputStream) throws IOException {
return new BufferedReader(new InputStreamReader(inputStream)).lines()
.parallel().collect(Collectors.joining("\n"));
}
}
static class ByteArrayOutputStreamTest implements Stringify {
@Override
public String inputStreamToString(InputStream inputStream) throws IOException {
try(ByteArrayOutputStream result = new ByteArrayOutputStream()) {
byte[] buffer = new byte[1024];
int length;
while ((length = inputStream.read(buffer)) != -1) {
result.write(buffer, 0, length);
}
return result.toString(UTF_8);
}
}
}
static class BufferReaderTest implements Stringify {
@Override
public String inputStreamToString(InputStream inputStream) throws IOException {
String newLine = System.getProperty("line.separator");
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
StringBuilder result = new StringBuilder(UTF_8);
String line;
boolean flag = false;
while ((line = reader.readLine()) != null) {
result.append(flag ? newLine : "").append(line);
flag = true;
}
return result.toString();
}
}
static class BufferedInputStreamVsByteArrayOutputStream implements Stringify {
@Override
public String inputStreamToString(InputStream inputStream) throws IOException {
BufferedInputStream bis = new BufferedInputStream(inputStream);
ByteArrayOutputStream buf = new ByteArrayOutputStream();
int result = bis.read();
while (result != -1) {
buf.write((byte) result);
result = bis.read();
}
return buf.toString(UTF_8);
}
}
static class InputStreamAndStringBuilder implements Stringify {
@Override
public String inputStreamToString(InputStream inputStream) throws IOException {
int ch;
StringBuilder sb = new StringBuilder(UTF_8);
while ((ch = inputStream.read()) != -1)
sb.append((char) ch);
return sb.toString();
}
}
static class Java9ISTransferTo implements Stringify {
@Override
public String inputStreamToString(InputStream inputStream) throws IOException {
ByteArrayOutputStream bos = new ByteArrayOutputStream();
inputStream.transferTo(bos);
return bos.toString(UTF_8);
}
}
static class Java9ISReadAllBytes implements Stringify {
@Override
public String inputStreamToString(InputStream inputStream) throws IOException {
return new String(inputStream.readAllBytes(), UTF_8);
}
}
}
总结我找到的11种主要方法(见下文)。我写了一些性能测试(见下面的结果):
将InputStream转换为字符串的方法:
使用IOUtils.toString(Apache Utils)字符串结果=IOTils.toString(inputStream,StandardCharsets.UTF_8);使用CharStreams(Guava)字符串结果=CharStreams.toString(新InputStreamReader(inputStream、字符集UTF_8));使用扫描仪(JDK)扫描仪s=新扫描仪(inputStream)。使用分隔符(“\\A”);字符串结果=s.hasNext()?s.next():“”;使用流API(Java 8)。警告:此解决方案将不同的换行符(如\r\n)转换为\n。字符串结果=新BufferedReader(新InputStreamReader(inputStream)).line().collector(Collectors.joining(“\n”));使用并行流API(Java8)。警告:此解决方案将不同的换行符(如\r\n)转换为\n。字符串结果=新BufferedReader(新InputStreamReader(inputStream)).line().allel().collector(Collectors.joining(“\n”));使用InputStreamReader和StringBuilder(JDK)int bufferSize=1024;char[]缓冲区=新字符[bufferSize];StringBuilder out=新StringBuilder();Reader in=新InputStreamReader(流,StandardCharsets.UTF_8);for(int numRead;(numRead=in.read(buffer,0,buffer.length))>0;){out.append(缓冲区,0,numRead);}return out.toString();使用StringWriter和IOUtils.copy(Apache Commons)StringWriter writer=新StringWriter();IOUtils.copy(inputStream,writer,“UTF-8”);return writer.toString();使用ByteArrayOutputStream和inputStream.read(JDK)ByteArrayOutputStream result=new ByteArrayOutputStream();byte[]缓冲区=新字节[1024];for(int length;(length=inputStream.read(缓冲区))!=-1; ) {result.write(缓冲区,0,长度);}//StandardCharsets.UTF_8.name()>JDK 7return result.toString(“UTF-8”);使用BufferedReader(JDK)。警告:此解决方案将不同的换行符(如\n\r)转换为line.separator系统属性(例如,在Windows中转换为“\r\n”)。String newLine=System.getProperty(“line.sepaper”);BufferedReader读取器=新的BufferedReader(新的InputStreamReader(inputStream));StringBuilder result=new StringBuilder();for(字符串行;(行=reader.readLine())!=null;){如果(result.length()>0){result.append(换行符);}result.append(行);}return result.toString();使用BufferedInputStream和ByteArrayOutputStream(JDK)BufferedInputStream bis=新缓冲输入流(inputStream);ByteArrayOutputStream buf=新ByteArrayOutputStream();for(int result=bis.read();结果!=-1.result=bis.read()){buf.write((字节)结果);}//StandardCharsets.UTF_8.name()>JDK 7return buf.toString(“UTF-8”);使用inputStream.read()和StringBuilder(JDK)。警告:此解决方案存在Unicode问题,例如俄语文本(仅适用于非Unicode文本)StringBuilder sb=新StringBuilder();for(int ch;(ch=inputStream.read())!=-1; ) {sb.append((char)ch);}return sb.toString();
警告:
解决方案4、5和9将不同的换行符转换为一个。解决方案11无法正确使用Unicode文本
性能测试
小字符串(长度=175)的性能测试,github中的url(模式=平均时间,系统=Linux,分数1343最好):
Benchmark Mode Cnt Score Error Units
8. ByteArrayOutputStream and read (JDK) avgt 10 1,343 ± 0,028 us/op
6. InputStreamReader and StringBuilder (JDK) avgt 10 6,980 ± 0,404 us/op
10. BufferedInputStream, ByteArrayOutputStream avgt 10 7,437 ± 0,735 us/op
11. InputStream.read() and StringBuilder (JDK) avgt 10 8,977 ± 0,328 us/op
7. StringWriter and IOUtils.copy (Apache) avgt 10 10,613 ± 0,599 us/op
1. IOUtils.toString (Apache Utils) avgt 10 10,605 ± 0,527 us/op
3. Scanner (JDK) avgt 10 12,083 ± 0,293 us/op
2. CharStreams (guava) avgt 10 12,999 ± 0,514 us/op
4. Stream Api (Java 8) avgt 10 15,811 ± 0,605 us/op
9. BufferedReader (JDK) avgt 10 16,038 ± 0,711 us/op
5. parallel Stream Api (Java 8) avgt 10 21,544 ± 0,583 us/op
对大字符串(长度=50100)、github中的url进行性能测试(模式=平均时间,系统=Linux,得分200715最好):
Benchmark Mode Cnt Score Error Units
8. ByteArrayOutputStream and read (JDK) avgt 10 200,715 ± 18,103 us/op
1. IOUtils.toString (Apache Utils) avgt 10 300,019 ± 8,751 us/op
6. InputStreamReader and StringBuilder (JDK) avgt 10 347,616 ± 130,348 us/op
7. StringWriter and IOUtils.copy (Apache) avgt 10 352,791 ± 105,337 us/op
2. CharStreams (guava) avgt 10 420,137 ± 59,877 us/op
9. BufferedReader (JDK) avgt 10 632,028 ± 17,002 us/op
5. parallel Stream Api (Java 8) avgt 10 662,999 ± 46,199 us/op
4. Stream Api (Java 8) avgt 10 701,269 ± 82,296 us/op
10. BufferedInputStream, ByteArrayOutputStream avgt 10 740,837 ± 5,613 us/op
3. Scanner (JDK) avgt 10 751,417 ± 62,026 us/op
11. InputStream.read() and StringBuilder (JDK) avgt 10 2919,350 ± 1101,942 us/op
图表(性能测试取决于Windows 7系统中的输入流长度)
性能测试(平均时间)取决于Windows 7系统中的输入流长度:
length 182 546 1092 3276 9828 29484 58968
test8 0.38 0.938 1.868 4.448 13.412 36.459 72.708
test4 2.362 3.609 5.573 12.769 40.74 81.415 159.864
test5 3.881 5.075 6.904 14.123 50.258 129.937 166.162
test9 2.237 3.493 5.422 11.977 45.98 89.336 177.39
test6 1.261 2.12 4.38 10.698 31.821 86.106 186.636
test7 1.601 2.391 3.646 8.367 38.196 110.221 211.016
test1 1.529 2.381 3.527 8.411 40.551 105.16 212.573
test3 3.035 3.934 8.606 20.858 61.571 118.744 235.428
test2 3.136 6.238 10.508 33.48 43.532 118.044 239.481
test10 1.593 4.736 7.527 20.557 59.856 162.907 323.147
test11 3.913 11.506 23.26 68.644 207.591 600.444 1211.545
异-8859-1
如果您知道输入流的编码是ISO-8859-1或ASCII,这里有一种非常高效的方法来实现这一点。它(1)避免了StringWriter的内部StringBuffer中存在的不必要的同步,(2)避免了InputStreamReader的开销,(3)最小化了必须复制StringBuilder的内部字符数组的次数。
public static String iso_8859_1(InputStream is) throws IOException {
StringBuilder chars = new StringBuilder(Math.max(is.available(), 4096));
byte[] buffer = new byte[4096];
int n;
while ((n = is.read(buffer)) != -1) {
for (int i = 0; i < n; i++) {
chars.append((char)(buffer[i] & 0xFF));
}
}
return chars.toString();
}
UTF-8型
对于使用UTF-8编码的流,可以使用相同的通用策略:
public static String utf8(InputStream is) throws IOException {
StringBuilder chars = new StringBuilder(Math.max(is.available(), 4096));
byte[] buffer = new byte[4096];
int n;
int state = 0;
while ((n = is.read(buffer)) != -1) {
for (int i = 0; i < n; i++) {
if ((state = nextStateUtf8(state, buffer[i])) >= 0) {
chars.appendCodePoint(state);
} else if (state == -1) { //error
state = 0;
chars.append('\uFFFD'); //replacement char
}
}
}
return chars.toString();
}
其中nextStateUtf8()函数定义如下:
/**
* Returns the next UTF-8 state given the next byte of input and the current state.
* If the input byte is the last byte in a valid UTF-8 byte sequence,
* the returned state will be the corresponding unicode character (in the range of 0 through 0x10FFFF).
* Otherwise, a negative integer is returned. A state of -1 is returned whenever an
* invalid UTF-8 byte sequence is detected.
*/
static int nextStateUtf8(int currentState, byte nextByte) {
switch (currentState & 0xF0000000) {
case 0:
if ((nextByte & 0x80) == 0) { //0 trailing bytes (ASCII)
return nextByte;
} else if ((nextByte & 0xE0) == 0xC0) { //1 trailing byte
if (nextByte == (byte) 0xC0 || nextByte == (byte) 0xC1) { //0xCO & 0xC1 are overlong
return -1;
} else {
return nextByte & 0xC000001F;
}
} else if ((nextByte & 0xF0) == 0xE0) { //2 trailing bytes
if (nextByte == (byte) 0xE0) { //possibly overlong
return nextByte & 0xA000000F;
} else if (nextByte == (byte) 0xED) { //possibly surrogate
return nextByte & 0xB000000F;
} else {
return nextByte & 0x9000000F;
}
} else if ((nextByte & 0xFC) == 0xF0) { //3 trailing bytes
if (nextByte == (byte) 0xF0) { //possibly overlong
return nextByte & 0x80000007;
} else {
return nextByte & 0xE0000007;
}
} else if (nextByte == (byte) 0xF4) { //3 trailing bytes, possibly undefined
return nextByte & 0xD0000007;
} else {
return -1;
}
case 0xE0000000: //3rd-to-last continuation byte
return (nextByte & 0xC0) == 0x80 ? currentState << 6 | nextByte & 0x9000003F : -1;
case 0x80000000: //3rd-to-last continuation byte, check overlong
return (nextByte & 0xE0) == 0xA0 || (nextByte & 0xF0) == 0x90 ? currentState << 6 | nextByte & 0x9000003F : -1;
case 0xD0000000: //3rd-to-last continuation byte, check undefined
return (nextByte & 0xF0) == 0x80 ? currentState << 6 | nextByte & 0x9000003F : -1;
case 0x90000000: //2nd-to-last continuation byte
return (nextByte & 0xC0) == 0x80 ? currentState << 6 | nextByte & 0xC000003F : -1;
case 0xA0000000: //2nd-to-last continuation byte, check overlong
return (nextByte & 0xE0) == 0xA0 ? currentState << 6 | nextByte & 0xC000003F : -1;
case 0xB0000000: //2nd-to-last continuation byte, check surrogate
return (nextByte & 0xE0) == 0x80 ? currentState << 6 | nextByte & 0xC000003F : -1;
case 0xC0000000: //last continuation byte
return (nextByte & 0xC0) == 0x80 ? currentState << 6 | nextByte & 0x3F : -1;
default:
return -1;
}
}
自动检测编码
如果您的输入流是使用ASCII、ISO-8859-1或UTF-8编码的,但您不确定是哪一种,我们可以使用与上一种方法类似的方法,但使用额外的编码检测组件在返回字符串之前自动检测编码。
public static String autoDetect(InputStream is) throws IOException {
StringBuilder chars = new StringBuilder(Math.max(is.available(), 4096));
byte[] buffer = new byte[4096];
int n;
int state = 0;
boolean ascii = true;
while ((n = is.read(buffer)) != -1) {
for (int i = 0; i < n; i++) {
if ((state = nextStateUtf8(state, buffer[i])) > 0x7F)
ascii = false;
chars.append((char)(buffer[i] & 0xFF));
}
}
if (ascii || state < 0) { //probably not UTF-8
return chars.toString();
}
//probably UTF-8
int pos = 0;
char[] charBuf = new char[2];
for (int i = 0, len = chars.length(); i < len; i++) {
if ((state = nextStateUtf8(state, (byte)chars.charAt(i))) >= 0) {
boolean hi = Character.toChars(state, charBuf, 0) == 2;
chars.setCharAt(pos++, charBuf[0]);
if (hi) {
chars.setCharAt(pos++, charBuf[1]);
}
}
}
return chars.substring(0, pos);
}
如果您的输入流的编码既不是ISO-8859-1,也不是ASCII,也不是UTF-8,那么我就遵从已经存在的其他答案。