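Using wget to recursively fetch a directory with arbitrary files in it

I have a web directory where I store some config files. I'd like to use wget to pull those files down and maintain their current structure. For instance, the remote directory looks like: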

http://mysite.com/configs/.vim/

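.vim holds multiple files and directories. I want to replicate that on the client using wget. I can't seem to find the right combination of wget flags to get this done. Any ideas?

Current answer

First of all, thanks to everyone who posted their answers. Here is my "ultimate" wget script for downloading a website recursively: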

wget --recursive ${comment# self-explanatory} \
  --no-parent ${comment# will not crawl links in folders above the base of the URL} \
  --convert-links ${comment# convert links with the domain name to relative and uncrawled to absolute} \
  --random-wait --wait 3 --no-http-keep-alive ${comment# do not get banned} \
  --no-host-directories ${comment# do not create folders with the domain name} \
  --execute robots=off --user-agent=Mozilla/5.0 ${comment# I AM A HUMAN!!!} \
  --level=inf  --accept '*' ${comment# do not limit to 5 levels or common file formats} \
  --reject="index.html*" ${comment# use this option if you need an exact mirror} \
  --cut-dirs=0 ${comment# replace 0 with the number of folders in the path, 0 for the whole domain} \
$URL

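Afterwards, it may be necessary to strip the query parameters from file names such as main.css?crc=12324567, and to run a local server (e.g. via python3 -m http.server in the directory you just downloaded) in order to run JS. Note that the --convert-links option kicks in only after the full crawl has completed.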
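The renaming step mentioned above could be sketched roughly as follows. This is only an illustration: strip_query_suffixes is a made-up helper name (not part of wget), it assumes a Unix-like system where wget keeps the "?" in saved file names, and it assumes no file name contains a newline.

```shell
#!/bin/sh
# Hypothetical helper: rename files such as "main.css?crc=12324567"
# to "main.css" everywhere under the given directory (default: ".").
# The body runs in a subshell so its variables do not leak to the caller.
strip_query_suffixes() (
    dir=${1:-.}
    # Find regular files whose name contains a literal "?".
    find "$dir" -type f -name '*[?]*' | while IFS= read -r file; do
        parent=${file%/*}              # directory part of the path
        base=${file##*/}               # file name only
        target="$parent/${base%%\?*}"  # drop everything from the first "?" on
        if [ -e "$target" ]; then
            # Never overwrite an existing file; report and move on.
            echo "skipping $file: $target already exists" >&2
        else
            mv -- "$file" "$target"
        fi
    done
)
```

Run it as strip_query_suffixes path/to/download once the crawl has finished. If two URLs differ only in their query string, only the first copy is renamed and the other is left alone with a warning.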

Also, if you are trying to wget a website that may go down soon, you should get in touch with the ArchiveTeam and ask them to add your website to their ArchiveBot queue.

2020-12-24 19:56:34

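Other answers

It sounds like you're trying to get a mirror of your files. While wget has some interesting FTP and SFTP uses, a simple mirror should work. Here are a few considerations to make sure you're able to download the files properly.

Respect robots.txt

If you have a /robots.txt file in your public_html, www, or configs directory, make sure it does not prevent crawling. If it does, you need to instruct wget to ignore it by using the following option in your wget command: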

wget -e robots=off 'http://your-site.com/configs/.vim/'

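Convert remote links to local files

Additionally, wget must be instructed to convert links so that they point at the downloaded files. If you've done everything above correctly, you should be fine here. The easiest way I've found to get all files, provided nothing is hidden behind a non-public directory, is to use the mirror command.

Try this: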

wget -mpEk 'http://your-site.com/configs/.vim/'

# If robots.txt is present:

wget -mpEk -e robots=off 'http://your-site.com/configs/.vim/'

# Good practice to only deal with the highest level directory you specify (instead of downloading all of `mysite.com` you're just mirroring from `.vim`)

wget -mpEk -e robots=off --no-parent 'http://your-site.com/configs/.vim/'

Using -m instead of -r is preferred because it sets no maximum recursion depth; it is currently equivalent to -r -N -l inf --no-remove-listing. Mirror is pretty good at determining the full depth of a site; however, if the site has many external links you could end up downloading more than just your site, which is why we use -p -E -k. Here -p downloads all the prerequisite files needed to render each page, -E saves HTML pages with an .html extension, and -k converts links to point at the local files. The output should be a preserved directory structure. Since you should have a link set up, you should end up with your configs folder containing the .vim directory.

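Mirror mode also works with a directory structure that's served as ftp://.

General rule of thumb:

Depending on the size of the site you are mirroring, you may be sending many requests to the server. To avoid being blacklisted or cut off, use the wait options to rate-limit your downloads.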

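wget -mpEk --no-parent -e robots=off --random-wait 'http://your-site.com/configs/.vim/'

But if you are simply downloading the ../config/.vim/ file, you shouldn't have to worry about it, as you're ignoring parent directories and downloading a single file.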
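For a larger site, a throttled variant might look like the sketch below. As far as the wget manual describes it, --random-wait scales the interval given with --wait, so the two are paired here. The wrapper name polite_mirror, the DRY_RUN switch, and the values 2 (seconds) and 200k (bytes per second) are arbitrary choices made for illustration, not recommendations from the answer above.

```shell
#!/bin/sh
# Hypothetical wrapper around the mirror command with throttling added.
# With DRY_RUN=1 it only prints the command instead of running wget.
polite_mirror() (
    url=$1
    # Build the argument list: mirror flags from the answer above,
    # plus an explicit wait, random jitter, and a bandwidth cap.
    set -- wget --mirror --page-requisites --adjust-extension --convert-links \
        --no-parent -e robots=off \
        --wait=2 --random-wait --limit-rate=200k \
        "$url"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
)
```

Calling DRY_RUN=1 polite_mirror 'http://your-site.com/configs/.vim/' shows the full command first, which is a cheap way to check the flags before hitting the server.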

2021-09-02 05:20:20

This version downloads recursively and doesn't create parent directories.

wgetod() {
    NSLASH="$(echo "$1" | perl -pe 's|.*://[^/]+(.*?)/?$|\1|' | grep -o / | wc -l)"
    NCUT=$((NSLASH > 0 ? NSLASH-1 : 0))
    wget -r -nH --user-agent=Mozilla/5.0 --cut-dirs=$NCUT --no-parent --reject="index.html*" "$1"
}

Usage:

Add to ~/.bashrc or paste into the terminal, then run:

wgetod "http://example.com/x/"
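To see which --cut-dirs value wgetod ends up with for a given URL, the NCUT arithmetic can be reproduced on its own. The sketch below is an approximation using sed instead of perl; ncut_for is a hypothetical name, and it assumes the URL starts with a scheme such as http://.

```shell
#!/bin/sh
# Hypothetical helper: print the --cut-dirs value that wgetod would use.
ncut_for() (
    # Strip "scheme://host", then one trailing slash, leaving e.g. "/a/b/c".
    path=$(printf '%s\n' "$1" | sed -e 's|^[^:]*://[^/]*||' -e 's|/$||')
    # Count the slashes left in the path (tr -d ' ' trims wc padding).
    nslash=$(printf '%s' "$path" | tr -cd '/' | wc -c | tr -d ' ')
    # One less than the slash count, so the last directory is kept.
    echo $(( nslash > 0 ? nslash - 1 : 0 ))
)

ncut_for 'http://example.com/x/'
ncut_for 'http://mysite.com/configs/.vim/'
```

In other words, every directory above the last one in the URL is cut, so the download lands in a folder named after that last directory.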

2017-10-18 23:31:27

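The following options seem to be the perfect combination when dealing with recursive download:

wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2

Relevant snippets from the man pages, for convenience: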

   -nd
   --no-directories
       Do not create a hierarchy of directories when retrieving recursively.  With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the
       filenames will get extensions .n).


   -np
   --no-parent
       Do not ever ascend to the parent directory when retrieving recursively.  This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.

2019-09-07 15:07:53

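To download a directory recursively, rejecting index.html* files and leaving out the hostname, the parent directories and the rest of the directory structure: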

wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://mysite.com/dir1/dir2/data
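As a rough illustration of what -nH together with --cut-dirs does to the saved path, the mapping can be simulated without touching the network. This is a simplified model based on the examples in the wget manual, not wget's actual code: local_path_for and the file name file.txt are made up for the example, and query strings are ignored.

```shell
#!/bin/sh
# Hypothetical model of "-nH --cut-dirs=N": print the local path for a URL.
# $1 = URL, $2 = number of leading directory components to cut.
local_path_for() (
    rest=${1#*://}                  # drop the scheme, leaving host/path
    case $rest in
        */*) path=${rest#*/} ;;     # -nH: drop the host name as well
        *)   path= ;;
    esac
    n=$2
    while [ "$n" -gt 0 ]; do
        case $path in
            */*) path=${path#*/} ;; # cut one leading directory
            *)   break ;;           # only the file name is left; keep it
        esac
        n=$((n - 1))
    done
    printf '%s\n' "${path:-.}"
)

local_path_for 'http://mysite.com/dir1/dir2/data/file.txt' 2
```

With --cut-dirs=2 the dir1/dir2 prefix disappears, so the command above should leave everything under a local data directory.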

2011-03-17 06:17:28

For anyone else having similar issues: wget follows robots.txt, which might not allow you to grab the site. No worries, you can turn it off:

wget -e robots=off http://www.example.com/

http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html

2012-11-22 20:36:10
