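There is an online file (such as http://www.example.com/information.asp) that I need to fetch and save to a directory. I know there are several ways to fetch and read an online file (URL) line by line, but is there a way to simply download and save the file using Java?
Current answer
There are many elegant and efficient answers here, but conciseness can cost us some useful information. In particular, one often does not want to treat a connection error as an exception, and may want to handle some network-related errors differently, for example to decide whether the download should be retried.
Here is a method that does not throw exceptions for network errors (only for truly exceptional problems, such as a bad URL or a problem writing the file):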
import java.io.BufferedInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Downloads from a (http/https) URL and saves to a file.
 * Does not consider a connection error an Exception. Instead it returns:
 *
 * 0=ok
 * 1=connection interrupted, timeout (but something was read)
 * 2=not found (FileNotFoundException) (404)
 * 3=other HTTP error (4xx/5xx, e.g. 500)
 * 4=could not connect: connection timeout (no internet?) java.net.SocketTimeoutException
 * 5=could not connect: (server down?) java.net.ConnectException
 * 6=could not resolve host (bad host, or no internet - no dns)
 *
 * @param file File to write. Parent directory will be created if necessary
 * @param url http/https url to connect
 * @param secsConnectTimeout Seconds to wait for connection establishment
 * @param secsReadTimeout Read timeout in seconds - transmission will abort if it freezes more than this
 * @return See above
 * @throws IOException If the directory or file cannot be created or written,
 *         or for any other I/O error not covered by the codes above
 */
public static int saveUrl(final Path file, final URL url,
        int secsConnectTimeout, int secsReadTimeout) throws IOException {
    // Make sure the parent dir exists (getParent() is null for a bare file name); this can throw
    final Path parent = file.getParent();
    if (parent != null) Files.createDirectories(parent);
    URLConnection conn = url.openConnection(); // can throw an exception if the URL is unusable
    if (secsConnectTimeout > 0) conn.setConnectTimeout(secsConnectTimeout * 1000);
    if (secsReadTimeout > 0) conn.setReadTimeout(secsReadTimeout * 1000);
    int ret = 0;
    boolean somethingRead = false;
    try (InputStream is = conn.getInputStream()) {
        try (BufferedInputStream in = new BufferedInputStream(is);
             OutputStream fout = Files.newOutputStream(file)) {
            final byte[] data = new byte[8192];
            int count;
            while ((count = in.read(data)) != -1) { // -1 signals end of stream
                somethingRead = true;
                fout.write(data, 0, count);
            }
        }
    } catch (IOException e) {
        int httpcode = 999; // sentinel: no HTTP status available
        try {
            httpcode = ((HttpURLConnection) conn).getResponseCode();
        } catch (Exception ee) {
            // not an HTTP connection, or no response at all: keep the sentinel
        }
        if (somethingRead && e instanceof java.net.SocketTimeoutException) ret = 1;
        else if (e instanceof FileNotFoundException && httpcode >= 400 && httpcode < 500) ret = 2;
        else if (httpcode >= 400 && httpcode < 600) ret = 3;
        else if (e instanceof java.net.SocketTimeoutException) ret = 4;
        else if (e instanceof java.net.ConnectException) ret = 5;
        else if (e instanceof java.net.UnknownHostException) ret = 6;
        else throw e; // anything else is a real error for the caller
    }
    return ret;
}
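To illustrate the retry decision mentioned above, here is a minimal sketch of a retry wrapper around the return codes. The names DownloadRetry, Attempt, isRetryable and runWithRetry are hypothetical, and treating codes 1, 4 and 5 as the transient ones is an assumption, not part of the original answer; adjust it to your own policy.

```java
import java.io.IOException;

final class DownloadRetry {

    /** Hypothetical callback: one download attempt returning a saveUrl-style code. */
    @FunctionalInterface
    interface Attempt {
        int run() throws IOException;
    }

    /** Assumption: only interrupted transfers (1) and connection failures (4, 5) are worth retrying. */
    static boolean isRetryable(int code) {
        return code == 1 || code == 4 || code == 5;
    }

    /**
     * Runs the attempt up to maxAttempts times, pausing between tries,
     * and returns the code of the last attempt that was made.
     */
    static int runWithRetry(Attempt attempt, int maxAttempts, long pauseMillis)
            throws IOException, InterruptedException {
        int code = attempt.run();
        for (int i = 1; i < maxAttempts && isRetryable(code); i++) {
            Thread.sleep(pauseMillis);
            code = attempt.run();
        }
        return code;
    }
}
```

A call could then look like DownloadRetry.runWithRetry(() -> saveUrl(file, url, 10, 30), 3, 2000L), where the timeouts, attempt count and pause are placeholder values.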
Other answers
You can do this in one line with netloader for Java:
new NetFile(new File("my/zips/1.zip"), "https://example.com/example.zip", -1).load(); // Returns true if succeed, otherwise false.
You can download the file with Apache HttpComponents instead of Commons IO's one-liner (the example still uses Commons IO's FileUtils to write the stream to disk). This code lets you download a file in Java from its URL and save it to a specific destination.
public static boolean saveFile(URL fileURL, String fileSavePath) {
    HttpGet httpGet = new HttpGet(fileURL.toString());
    httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0");
    httpGet.addHeader("Referer", "https://www.google.com");
    // try-with-resources closes the response and the client, releasing the connection
    try (CloseableHttpClient httpClient = HttpClients.createDefault();
         CloseableHttpResponse httpResponse = httpClient.execute(httpGet)) {
        HttpEntity fileEntity = httpResponse.getEntity();
        if (fileEntity != null) {
            FileUtils.copyInputStreamToFile(fileEntity.getContent(), new File(fileSavePath));
        }
        return true;
    } catch (IOException e) {
        return false;
    }
}
Compared with the one-liner:
FileUtils.copyURLToFile(fileURL, new File(fileSavePath),
URLS_FETCH_TIMEOUT, URLS_FETCH_TIMEOUT);
this code gives you more control over the process: you can specify not only timeouts but also the User-Agent and Referer values, which many websites require.
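The same kind of control is also available from the JDK alone. The following is a minimal sketch, not the answer author's code: the class HeaderDownload and its download method are hypothetical names, and it assumes a single timeout value for both connecting and reading.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class HeaderDownload {

    /**
     * Hypothetical helper: copies the resource at url to target, sending
     * custom User-Agent and Referer headers. Returns the number of bytes written.
     */
    static long download(URL url, Path target, String userAgent, String referer,
            int timeoutMillis) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(timeoutMillis); // limit on establishing the connection
        conn.setReadTimeout(timeoutMillis);    // limit on each blocking read
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setRequestProperty("Referer", referer);
        try (InputStream in = conn.getInputStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Unlike the HttpComponents version, this sketch lets any IOException propagate to the caller instead of mapping it to a boolean.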
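A simpler approach using NIO (java.nio.file):
// target: the destination java.nio.file.Path (not defined in this snippet)
URL website = new URL("http://www.website.com/information.asp");
try (InputStream in = website.openStream()) {
    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
}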
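Downloading a file requires reading it. Either way, you have to go through the file somehow. Instead of reading line by line, you can read raw bytes from the stream:
// out: an already-open OutputStream for the destination (not defined in this snippet)
BufferedInputStream in = new BufferedInputStream(new URL("http://www.website.com/information.asp").openStream());
byte[] data = new byte[1024];
int count;
while ((count = in.read(data, 0, 1024)) != -1) {
    out.write(data, 0, count);
}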
public void saveUrl(final String filename, final String urlString)
        throws MalformedURLException, IOException {
    // try-with-resources closes both streams, even if the copy fails
    try (BufferedInputStream in = new BufferedInputStream(new URL(urlString).openStream());
         FileOutputStream fout = new FileOutputStream(filename)) {
        final byte[] data = new byte[1024];
        int count;
        while ((count = in.read(data, 0, 1024)) != -1) {
            fout.write(data, 0, count);
        }
    }
}
You will need to handle the exceptions, probably outside this method.
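On Java 9 and later, the manual read loop can be replaced by InputStream.transferTo, which does the same chunked copy internally. This is a minimal sketch under that assumption; the class name TransferToDownload is hypothetical, and exceptions are still left to the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

final class TransferToDownload {

    /** Copies everything readable from url into target and returns the byte count. */
    static long saveUrl(URL url, Path target) throws IOException {
        try (InputStream in = url.openStream();
             OutputStream out = Files.newOutputStream(target)) {
            return in.transferTo(out); // requires Java 9+
        }
    }
}
```

A caller would typically wrap the call in a try/catch for IOException and decide there whether to log, retry, or give up.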