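If you have a java.io.InputStream object, how should you process that object and produce a String?


Suppose I have an InputStream that contains text data, and I want to convert it to a String so that, for example, I can write it to a log file.

What is the easiest way to take the InputStream and convert it to a String?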

public String convertStreamToString(InputStream is) {
// ???
}

Current answer

ISO-8859-1

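If you know that the encoding of your input stream is ISO-8859-1 or ASCII, here is a very efficient way to do it. It (1) avoids the unnecessary synchronization present in StringWriter's internal StringBuffer, (2) avoids the overhead of InputStreamReader, and (3) minimizes the number of times StringBuilder's internal char array must be copied.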

public static String iso_8859_1(InputStream is) throws IOException {
    StringBuilder chars = new StringBuilder(Math.max(is.available(), 4096));
    byte[] buffer = new byte[4096];
    int n;
    while ((n = is.read(buffer)) != -1) {
        for (int i = 0; i < n; i++) {
            chars.append((char)(buffer[i] & 0xFF));
        }
    }
    return chars.toString();
}
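
As a sanity check on the cast in the loop above, the following sketch feeds all 256 byte values through the same method and compares the result with the JDK's own ISO-8859-1 decoding. The class name Latin1Check and the test data are assumptions made for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class Latin1Check {
    // Same loop as iso_8859_1() above, repeated so the sketch compiles on its own.
    static String iso_8859_1(InputStream is) throws IOException {
        StringBuilder chars = new StringBuilder(Math.max(is.available(), 4096));
        byte[] buffer = new byte[4096];
        int n;
        while ((n = is.read(buffer)) != -1) {
            for (int i = 0; i < n; i++) {
                // Mask to 0..255 so bytes >= 0x80 do not sign-extend into the wrong char
                chars.append((char) (buffer[i] & 0xFF));
            }
        }
        return chars.toString();
    }

    // Returns true when the hand-written loop agrees with the JDK decoder.
    static boolean matchesJdk() throws IOException {
        byte[] all = new byte[256]; // hypothetical test data: every byte value once
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String fast = iso_8859_1(new ByteArrayInputStream(all));
        String reference = new String(all, StandardCharsets.ISO_8859_1);
        return fast.equals(reference);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(matchesJdk());
    }
}
```

The mask matters because Java bytes are signed: without `& 0xFF`, a byte such as 0xE9 would be cast to a char in the U+FFxx range instead of U+00E9.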

UTF-8

For a stream encoded in UTF-8, the same general strategy can be used:

public static String utf8(InputStream is) throws IOException {
    StringBuilder chars = new StringBuilder(Math.max(is.available(), 4096));
    byte[] buffer = new byte[4096];
    int n;
    int state = 0;
    while ((n = is.read(buffer)) != -1) {
        for (int i = 0; i < n; i++) {
            if ((state = nextStateUtf8(state, buffer[i])) >= 0) {
                chars.appendCodePoint(state);
            } else if (state == -1) { //error
                state = 0;
                chars.append('\uFFFD'); //replacement char
            }
        }
    }
    return chars.toString();
}

where the nextStateUtf8() function is defined as follows:

/**
 * Returns the next UTF-8 state given the next byte of input and the current state.
 * If the input byte is the last byte in a valid UTF-8 byte sequence,
 * the returned state will be the corresponding unicode character (in the range of 0 through 0x10FFFF).
 * Otherwise, a negative integer is returned. A state of -1 is returned whenever an
 * invalid UTF-8 byte sequence is detected.
 */
static int nextStateUtf8(int currentState, byte nextByte) {
    switch (currentState & 0xF0000000) {
        case 0:
            if ((nextByte & 0x80) == 0) { //0 trailing bytes (ASCII)
                return nextByte;
            } else if ((nextByte & 0xE0) == 0xC0) { //1 trailing byte
                if (nextByte == (byte) 0xC0 || nextByte == (byte) 0xC1) { //0xC0 & 0xC1 are overlong
                    return -1;
                } else {
                    return nextByte & 0xC000001F;
                }
            } else if ((nextByte & 0xF0) == 0xE0) { //2 trailing bytes
                if (nextByte == (byte) 0xE0) { //possibly overlong
                    return nextByte & 0xA000000F;
                } else if (nextByte == (byte) 0xED) { //possibly surrogate
                    return nextByte & 0xB000000F;
                } else {
                    return nextByte & 0x9000000F;
                }
            } else if ((nextByte & 0xFC) == 0xF0) { //3 trailing bytes
                if (nextByte == (byte) 0xF0) { //possibly overlong
                    return nextByte & 0x80000007;
                } else {
                    return nextByte & 0xE0000007;
                }
            } else if (nextByte == (byte) 0xF4) { //3 trailing bytes, possibly undefined
                return nextByte & 0xD0000007;
            } else {
                return -1;
            }
        case 0xE0000000: //3rd-to-last continuation byte
            return (nextByte & 0xC0) == 0x80 ? currentState << 6 | nextByte & 0x9000003F : -1;
        case 0x80000000: //3rd-to-last continuation byte, check overlong
            return (nextByte & 0xE0) == 0xA0 || (nextByte & 0xF0) == 0x90 ? currentState << 6 | nextByte & 0x9000003F : -1;
        case 0xD0000000: //3rd-to-last continuation byte, check undefined
            return (nextByte & 0xF0) == 0x80 ? currentState << 6 | nextByte & 0x9000003F : -1;
        case 0x90000000: //2nd-to-last continuation byte
            return (nextByte & 0xC0) == 0x80 ? currentState << 6 | nextByte & 0xC000003F : -1;
        case 0xA0000000: //2nd-to-last continuation byte, check overlong
            return (nextByte & 0xE0) == 0xA0 ? currentState << 6 | nextByte & 0xC000003F : -1;
        case 0xB0000000: //2nd-to-last continuation byte, check surrogate
            return (nextByte & 0xE0) == 0x80 ? currentState << 6 | nextByte & 0xC000003F : -1;
        case 0xC0000000: //last continuation byte
            return (nextByte & 0xC0) == 0x80 ? currentState << 6 | nextByte & 0x3F : -1;
        default:
            return -1;
    }
}
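
When testing a hand-written decoder like this, it can help to have a baseline built only on the JDK. The sketch below is such a baseline, not the method from this answer: it uses InputStreamReader, which is exactly the overhead the answer tries to avoid. The class and method names are hypothetical. An InputStreamReader created with a Charset replaces malformed input with U+FFFD, but it may not agree with utf8() on how many replacement characters a given malformed sequence produces, so compare the two on valid input first.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class Utf8Baseline {
    // Hypothetical reference decoder: slower, but relies only on the JDK's UTF-8 support.
    static String decodeUtf8(InputStream is) throws IOException {
        Reader reader = new InputStreamReader(is, StandardCharsets.UTF_8);
        StringBuilder chars = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            chars.append(buffer, 0, n); // append only the chars actually read
        }
        return chars.toString();
    }
}
```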

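Auto-detecting the encoding

If your input stream is encoded in ASCII, ISO-8859-1, or UTF-8 but you are not sure which, you can use an approach similar to the previous one, with an extra encoding-detection step that determines the encoding before the string is returned.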

public static String autoDetect(InputStream is) throws IOException {
    StringBuilder chars = new StringBuilder(Math.max(is.available(), 4096));
    byte[] buffer = new byte[4096];
    int n;
    int state = 0;
    boolean ascii = true;
    while ((n = is.read(buffer)) != -1) {
        for (int i = 0; i < n; i++) {
            if ((state = nextStateUtf8(state, buffer[i])) > 0x7F)
                ascii = false;
            chars.append((char)(buffer[i] & 0xFF));
        }
    }

    if (ascii || state < 0) { //probably not UTF-8
        return chars.toString();
    }
    //probably UTF-8
    int pos = 0;
    char[] charBuf = new char[2];
    for (int i = 0, len = chars.length(); i < len; i++) {
        if ((state = nextStateUtf8(state, (byte)chars.charAt(i))) >= 0) {
            boolean hi = Character.toChars(state, charBuf, 0) == 2;
            chars.setCharAt(pos++, charBuf[0]);
            if (hi) {
                chars.setCharAt(pos++, charBuf[1]);
            }
        }
    }
    return chars.substring(0, pos);
}
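
The same decision (treat the bytes as UTF-8 if they form valid UTF-8, otherwise fall back to ISO-8859-1) can also be sketched with the JDK's strict decoder. This is an alternative formulation for comparison, not the single-pass method above. It assumes the stream has already been read into a byte array, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DetectSketch {
    // Hypothetical helper: strict UTF-8 first, ISO-8859-1 as the fallback.
    static String decode(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: every byte sequence is valid ISO-8859-1, so this cannot fail
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```

Pure ASCII input decodes identically under all three encodings, so no special case is needed for it here.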

If your input stream has an encoding that is neither ISO-8859-1, nor ASCII, nor UTF-8, then I defer to the other answers already posted.

Other answers

A nice way to do this is to use Apache Commons IOUtils to copy the InputStream into a StringWriter, something like:

StringWriter writer = new StringWriter();
IOUtils.copy(inputStream, writer, encoding);
String theString = writer.toString();

or even

// NB: does not close inputStream, you'll have to use try-with-resources for that
String theString = IOUtils.toString(inputStream, encoding);

Alternatively, you could use a ByteArrayOutputStream if you don't want to mix streams and writers.
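
A minimal sketch of that ByteArrayOutputStream variant, using only the JDK, might look like the following. The class name, method name, and 4096-byte buffer size are assumptions for illustration; like IOUtils.toString, it does not close the stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

final class BaosSketch {
    // Hypothetical helper: collect all bytes first, then decode them in one step.
    static String streamToString(InputStream is, Charset charset) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int length;
        while ((length = is.read(buffer)) != -1) {
            result.write(buffer, 0, length); // write only the bytes actually read
        }
        return new String(result.toByteArray(), charset);
    }
}
```

Decoding once at the end means a multi-byte character can never be split across two buffer reads, which is the main pitfall of decoding each chunk separately.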

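Note: this is probably not a good idea. This method uses recursion, with one stack frame per byte, and so will hit a StackOverflowError very quickly: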

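public String read(InputStream is) throws IOException {
    int next = is.read(); // read() returns an int: 0-255, or -1 at end of stream
    // Recursive part: reads the next byte recursively, treating each byte as one ISO-8859-1 char
    return next == -1 ? "" : (char) next + read(is);
}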

Apache Commons allows:

String myString = IOUtils.toString(myInputStream, "UTF-8");

Of course, you can choose a character encoding other than UTF-8.

See also: (documentation)
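
To illustrate why the encoding argument matters, here is a small sketch that decodes the same two bytes under two different charsets. It swaps Apache Commons for the JDK's InputStream.readAllBytes(), which assumes Java 9 or newer; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetChoice {
    // Hypothetical helper; InputStream.readAllBytes() requires Java 9 or newer.
    static String read(InputStream is, Charset charset) throws IOException {
        return new String(is.readAllBytes(), charset);
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9}; // the UTF-8 encoding of U+00E9
        String asUtf8 = read(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        String asLatin1 = read(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);
        System.out.println(asUtf8.length());   // one char: U+00E9
        System.out.println(asLatin1.length()); // two chars: U+00C3 followed by U+00A9
    }
}
```

Picking the wrong charset does not raise an error here; it silently produces different text, which is why the encoding should be stated explicitly rather than left to the platform default.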

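I did some timing tests, because time always matters.

I tried to get the response into a String in 3 different ways (shown below). I left out the try/catch blocks for the sake of readability.

To give context, this is the preceding code for all 3 approaches: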

   String response;
   String url = "www.blah.com/path?key=value";
   GetMethod method = new GetMethod(url);
   int status = client.executeMethod(method);

1)

 response = method.getResponseBodyAsString();

2)

InputStream resp = method.getResponseBodyAsStream();
InputStreamReader is=new InputStreamReader(resp);
BufferedReader br=new BufferedReader(is);
String read = null;
StringBuffer sb = new StringBuffer();
while((read = br.readLine()) != null) {
    sb.append(read);
}
response = sb.toString();

3)

InputStream iStream  = method.getResponseBodyAsStream();
StringWriter writer = new StringWriter();
IOUtils.copy(iStream, writer, "UTF-8");
response = writer.toString();

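So, after running 500 tests on each approach with the same request/response data, here are the numbers. Once again, these are my findings, and yours may not be exactly the same, but I wrote this up to give others some indication of the efficiency differences between these approaches.

Rank:
Approach #1
Approach #3: 2.6% slower than #1
Approach #2: 4.3% slower than #1

Any of these approaches is an appropriate solution for grabbing a response and creating a String from it.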
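
One behavioral difference is worth keeping in mind alongside the timings: approach 2 appends the lines returned by readLine(), and readLine() does not include line terminators. For a multi-line body, its result is therefore not the same text that a byte-for-byte copy such as approach 3 produces. The sketch below reproduces just that loop against an in-memory stream; the class name and sample input are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class ReadLineDemo {
    // Same loop shape as approach 2, with an explicit charset and a StringBuilder.
    static String viaReadLine(InputStream in) throws IOException {
        BufferedReader br = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line); // the terminator (\n, \r, or \r\n) was already stripped
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = "a\nb\r\nc".getBytes(StandardCharsets.UTF_8); // hypothetical sample body
        System.out.println(viaReadLine(new ByteArrayInputStream(body))); // prints "abc"
    }
}
```

If the line breaks matter, append a separator after each line or use one of the copy-based approaches instead.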

InputStream is = new URL("http://www.petrol.si/api/gas_prices.json").openStream();

// Copy the raw bytes, then decode them as UTF-8 in one step
ByteArrayOutputStream baos = new ByteArrayOutputStream();
IOUtils.copy(is, baos);
String d = new String(baos.toByteArray(), "UTF-8");

System.out.println(d);