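How do I prevent site scraping?

I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy artist names here and there and then do Google searches for them).

How can I prevent screen scraping? Is it even possible?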

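Current answer

Unfortunately, your best option is fairly manual: look for traffic patterns that you believe are indicative of scraping, and ban the IP addresses behind them.

Since you're talking about a public site, making the site search-engine friendly will also make it scraping-friendly. If a search engine can crawl and scrape your site, then a malicious scraper can as well. It's a fine line to walk.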

2010-07-01 20:51:04

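Other answers

Sure, it's possible. For 100% success, take your site offline.

In reality, you can do some things that make scraping a little more difficult. Google does browser checks to make sure you're not a robot scraping search results (although this, like most everything else, can be spoofed).

You can do things like require several seconds between the first connection to your site and subsequent clicks. I'm not sure what the ideal time would be or exactly how to do it, but that's another idea.

I'm sure there are other people who have a lot more experience, but I hope those ideas are at least somewhat helpful.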

2010-07-01 20:53:27

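There's really nothing you can do to completely prevent this. Scrapers can fake their user agent, use multiple IP addresses, etc., and appear as a normal user. The only thing you can do is make the text unavailable at the time the page is loaded: render it as an image or in Flash, or load it with JavaScript. However, the first two are bad ideas, and the last one would be an accessibility issue if JavaScript is not enabled for some of your regular users.

If they are absolutely slamming your site and rifling through all of your pages, you could do some kind of rate limiting.

There is some hope though. Scrapers rely on your site's data being in a consistent format. If you could randomize it somehow, it could break their scraper: things like changing the ID or class names of page elements on each load. But that is a lot of work and I'm not sure it's worth it. And even then, they could probably get around it with enough dedication.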

2010-07-01 20:51:53

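You can't stop normal screen scraping. For better or worse, it's the nature of the web.

You can make it so that no one can access certain things (including music files) unless they're logged in as a registered user. It's not too difficult to do in Apache. I assume it wouldn't be too difficult to do in IIS either.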

2010-07-02 00:43:09

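Note: Since the complete version of this answer exceeds Stack Overflow's length limit, you'll need to head to GitHub to read the extended version, with more tips and details.

In order to hinder scraping (also known as web scraping, screen scraping, web data mining, web harvesting, or web data extraction), it helps to know how these scrapers work and, by extension, what prevents them from working well.

There are various types of scraper, and each works differently: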

Spiders, such as Google's bot or website copiers like HTTrack, which recursively follow links to other pages in order to get data. These are sometimes used for targeted scraping to get specific data, often in combination with an HTML parser to extract the desired data from each page.

Shell scripts: Sometimes, common Unix tools are used for scraping: Wget or Curl to download pages, and Grep (regex) to extract the data.

HTML parsers, such as ones based on Jsoup, Scrapy, and others. Similar to shell-script regex based ones, these work by extracting data from pages based on patterns in the HTML, usually ignoring everything else. For example: if your website has a search feature, such a scraper might submit a search request, and then pull only the result links and their titles out of the results page HTML. These are the most common.

Screenscrapers, based on e.g. Selenium or PhantomJS, which open your website in a real browser, run JavaScript, AJAX, and so on, and then get the desired text from the webpage in one of two ways. The first is getting the HTML from the browser after your page has been loaded and JavaScript has run, and then using an HTML parser to extract the desired data. This is the more common of the two, so many of the methods for breaking HTML parsers / scrapers also work here. The second is taking a screenshot of the rendered pages, and then using OCR to extract the desired text from the screenshot. This is rare, and only dedicated scrapers who really want your data will set it up.

Webscraping services such as ScrapingHub or Kimono. In fact, there are people whose job is to figure out how to scrape your site and pull out the content for others to use. Unsurprisingly, professional scraping services are the hardest to deter, but if you make it hard and time-consuming to figure out how to scrape your site, these (and the people who pay them to do so) may not bother to scrape your website.

Embedding your website in other sites' pages with frames, and embedding your site in mobile apps. While not technically scraping, mobile apps (Android and iOS) can embed websites and inject custom CSS and JavaScript, thus completely changing the appearance of your pages.

Human copy-paste: People will copy and paste your content in order to use it elsewhere.

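There is a lot of overlap between these different kinds of scraper, and many scrapers will behave similarly, even if they use different technologies and methods.

These tips are mostly my own ideas, various difficulties that I've encountered while writing scrapers, as well as bits of information and ideas from around the internet.

How to stop scraping

You can't completely prevent it, since whatever you do, determined scrapers can still figure out how to scrape. However, you can stop a lot of scraping by doing a few things:

Monitor your logs and traffic patterns; limit access if you see unusual activity:

Check your logs regularly, and in case of unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, you can block or limit access.

Specifically, some ideas: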

Rate limiting: Only allow users (and scrapers) to perform a limited number of actions in a certain time. For example, only allow a few searches per second from any specific IP address or user. This will slow down scrapers and make them ineffective. You could also show a captcha if actions are completed faster than a real user would manage.

Detect unusual activity: If you see unusual activity, such as many similar requests from a specific IP address, or someone looking at an excessive number of pages or performing an unusual number of searches, you can prevent access, or show a captcha for subsequent requests.

Don't just monitor and rate limit by IP address; use other indicators too: If you do block or rate limit, don't just do it on a per-IP basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users / scrapers include:

How fast users fill out forms, and where on a button they click.

Information you can gather with JavaScript, such as screen size / resolution, timezone, installed fonts, etc.; you can use this to identify users.

HTTP headers and their order, especially User-Agent.

As an example, if you get many requests from a single IP address, all using the same User Agent and screen size (determined with JavaScript), and the user (scraper in this case) always clicks on the button in the same way and at regular intervals, it's probably a screen scraper. You can then temporarily block similar requests (e.g. block all requests with that user agent and screen size coming from that particular IP address), and this way you won't inconvenience real users on that IP address, e.g. in the case of a shared internet connection.

You can also take this further, as you can identify similar requests even if they come from different IP addresses, which is indicative of distributed scraping (a scraper using a botnet or a network of proxies). If you get a lot of otherwise identical requests that come from different IP addresses, you can block them. Again, take care not to inadvertently block real users.

This can be effective against screenscrapers which run JavaScript, as you can get a lot of information from them.

Related questions on Security Stack Exchange: "How to uniquely identify users with the same external IP address?" for more details, and "Why do people use IP address bans when IP addresses often change?" for info on the limits of these methods.

Instead of temporarily blocking access, use a captcha: The simple way to implement rate limiting would be to temporarily block access for a certain amount of time; however, using a captcha may be better. See the section on captchas further down.
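The rate-limiting idea above can be sketched in a few lines of Python. This is only a minimal in-memory illustration, not a production design: the class name `SlidingWindowLimiter`, the helper `client_key`, and the choice of IP address, User-Agent, and screen size as the identifying fields are all assumptions made for the example.

```python
from collections import defaultdict, deque


def client_key(ip: str, user_agent: str, screen_size: str) -> str:
    """Combine several indicators so that limiting is not purely per-IP.

    The three fields are illustrative; use whatever signals you collect.
    """
    return "|".join((ip, user_agent, screen_size))


class SlidingWindowLimiter:
    """Allow at most `max_actions` per `window_seconds` for each client key."""

    def __init__(self, max_actions: int, window_seconds: float):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps of allowed actions

    def allow(self, key: str, now: float) -> bool:
        events = self._events[key]
        # Forget actions that have fallen out of the time window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_actions:
            return False  # over the limit: block, throttle, or show a captcha
        events.append(now)
        return True
```

A caller would pass the current time (for example `time.monotonic()`) as `now` and, when `allow` returns `False`, either delay the response or switch to a captcha rather than a hard block.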

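Require registration and login

Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers, but it is also a good deterrent for real users.

If you require account creation and login, you can accurately track user and scraper actions. This way, you can easily detect when a specific account is being used for scraping, and ban it. Things like rate limiting or detecting abuse (such as a huge number of searches in a short time) become easier, as you can identify specific scrapers instead of just IP addresses.

In order to avoid scripts creating many accounts, you should:

Require an email address for registration, and verify that email address by sending a link that must be opened in order to activate the account. Allow only one account per email address.

Require a captcha to be solved during registration / account creation.

Requiring account creation to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.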

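Block access from cloud hosting and scraping service IP addresses

Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or GAE, or from VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services.

Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid many requests being detected.

Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.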
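Checking a request's address against a list of hosting-provider ranges could look roughly like the sketch below. The ranges shown are reserved documentation blocks used as placeholders, and `HOSTING_RANGES` and `is_hosting_ip` are hypothetical names; a real list would have to come from the providers' own published ranges.

```python
import ipaddress

# Placeholder CIDR blocks (reserved for documentation), NOT real provider ranges.
HOSTING_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("203.0.113.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_hosting_ip(ip: str) -> bool:
    """Return True if `ip` falls inside one of the listed hosting ranges."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr.version == net.version and addr in net
        for net in HOSTING_RANGES
    )
```

Given the caveat about real users behind proxies and VPNs, a match is probably better treated as a reason to show a captcha than as grounds for an outright block.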

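Make your error message nondescript if you do block

If you do block / limit access, you should ensure that you don't tell the scraper what caused the block, thereby giving them clues as to how to fix their scraper. So a bad idea would be to show error pages with text like:

Too many requests from your IP address, please try again later.

Error, User Agent header not present!

Instead, show a friendly error message that doesn't tell the scraper what caused it. Something like this is much better:

Sorry, something went wrong. You can contact support via helpdesk@example.com, should the problem persist.

This is also a lot more user friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, so that a legitimate user who trips the check can get back in without having to contact you.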

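Use captchas if you suspect that your website is being accessed by a scraper

Captchas ("Completely Automated Public Turing test to tell Computers and Humans Apart") are very effective at stopping scrapers. Unfortunately, they are also very effective at irritating users.

As such, they are useful when you suspect a possible scraper and want to stop the scraping without also blocking access in case it isn't a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.

Things to be aware of when using captchas: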

Don't roll your own; use something like Google's reCAPTCHA: It's a lot easier than implementing a captcha yourself, it's more user-friendly than some blurry and warped text solution you might come up with yourself (users often only need to tick a box), and it's also a lot harder for a scripter to solve than a simple image served from your site.

Don't include the solution to the captcha in the HTML markup: I've actually seen one website which had the solution for the captcha in the page itself (although quite well hidden), thus making it pretty useless. Don't do something like this. Again, use a service like reCAPTCHA, and you won't have this kind of problem (if you use it properly).

Captchas can be solved in bulk: There are captcha-solving services where actual, low-paid humans solve captchas in bulk. Again, using reCAPTCHA is a good idea here, as it has protections (such as the relatively short time the user has in order to solve the captcha). This kind of service is unlikely to be used unless your data is really valuable.

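Serve your text content as an image

You can render text into an image server-side and serve that to be displayed, which will hinder simple scrapers extracting text.

However, this is bad for screen readers, search engines, performance, and pretty much everything else. It's also illegal in some places (due to accessibility requirements, e.g. the Americans with Disabilities Act), and it's easy to circumvent with some OCR, so don't do it.

You can do something similar with CSS sprites, but that suffers from the same problems.

Don't expose your complete dataset: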

If feasible, don't provide a way for a script / bot to get all of your dataset. As an example: you have a news site with lots of individual articles. You could make those articles accessible only by searching for them via the on-site search; if you don't have a list of all the articles and their URLs anywhere on the site, the search feature becomes the only way to reach them. This means that a script wanting to get all the articles off your site will have to search for every phrase which may appear in your articles in order to find them all, which will be time-consuming, horribly inefficient, and will hopefully make the scraper give up.

This will be ineffective if:

The bot / script does not want / need the full dataset anyway.

Your articles are served from a URL which looks something like example.com/article.php?articleId=12345. This (and similar things) will allow scrapers to simply iterate over all the articleIds and request all the articles that way.

There are other ways to eventually find all the articles, such as by writing a script to follow links within articles which lead to other articles.

Searching for something like "and" or "the" can reveal almost everything, so that is something to be aware of. (You can avoid this by only returning the top 10 or 20 results.)

You need search engines to find your content.

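Don't expose your APIs, endpoints, and similar things:

Make sure you don't expose any APIs, even unintentionally. For example, if you are using AJAX or network requests from within Adobe Flash or Java applets (God forbid!) to load your data, it is trivial to look at the network requests from the page, figure out where those requests are going, and then reverse engineer and use those endpoints in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described in the section on obfuscation.

To deter HTML parsers and scrapers:

Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in order to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers like spiders and screenscrapers.

Frequently change your HTML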

Scrapers which process HTML directly do so by extracting contents from specific, identifiable parts of your HTML page. For example: If all pages on your website have a div with an id of article-content, which contains the text of the article, then it is trivial to write a script to visit all the article pages on your site, and extract the content text of the article-content div on each article page, and voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.

If you change the HTML and the structure of your page frequently, such scrapers will no longer work.

You can frequently change the ids and classes of elements in your HTML, perhaps even automatically. So, if your div.article-content becomes something like div.a4c36dda13eaf0, and changes every week, the scraper will work fine initially, but will break after a week. Make sure to change the length of your ids / classes too, otherwise the scraper will use div.[any-14-characters] to find the desired div instead. Beware of other similar holes too.

If there is no way to find the desired content from the markup, the scraper will do so from the way the HTML is structured. So, if all your article pages are similar in that every div inside a div which comes after an h1 is the article content, scrapers will get the article content based on that. Again, to break this, you can add / remove extra markup in your HTML, periodically and randomly, e.g. adding extra divs or spans. With modern server-side HTML processing, this should not be too hard.

Things to be aware of:

It will be tedious and difficult to implement, maintain, and debug.

You will hinder caching. Especially if you change the ids or classes of your HTML elements, this will require corresponding changes in your CSS and JavaScript files, which means that every time you change them, they will have to be re-downloaded by the browser. This will result in longer page load times for repeat visitors, and increased server load. If you only change it once a week, it will not be a big problem.

Clever scrapers will still be able to get your content by inferring where the actual content is, e.g. by knowing that a large single block of text on the page is likely to be the actual article. This makes it possible to still find and extract the desired data from the page. Boilerpipe does exactly this.

Essentially, make sure that it is not easy for a script to find the actual, desired content for every similar page.

See also "How to prevent crawlers depending on XPath from getting page contents" for details on how this can be implemented in PHP.
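One possible way to rotate class names automatically is to derive them from a secret and the current week, so that the HTML templates and the generated CSS agree without any stored state. The sketch below is written under those assumptions; `SECRET`, `rotating_class`, and the weekly schedule are illustrative choices, not something prescribed above.

```python
import hashlib
import hmac
from datetime import date

SECRET = b"replace-with-a-private-value"  # hypothetical server-side secret


def rotating_class(base_name: str, day: date, secret: bytes = SECRET) -> str:
    """Map a stable internal name to a class name that changes every ISO week.

    The result also varies in length (9 to 17 characters), so a scraper
    cannot simply match "any class of N characters".
    """
    iso_year, iso_week, _ = day.isocalendar()
    message = f"{base_name}:{iso_year}-W{iso_week}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    length = 8 + int(digest[:2], 16) % 9  # 8..16 hex characters
    # Prefix with a letter so the value is always a valid CSS identifier.
    return "c" + digest[2:2 + length]
```

The same function would have to be used when rendering both the markup and the stylesheet, which is exactly the caching cost described in the notes above.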

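Change your HTML based on the user's location

This is sort of similar to the previous tip. If you serve different HTML based on your user's location / country (determined by IP address), this may break scrapers which are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it's actually distributed to users, as those users may be in a different country and thus get different HTML, which the embedded scraper was not designed to consume.

Frequently change your HTML, actively screw with the scrapers by doing so!

An example: You have a search feature on your website, located at example.com/search?query=somesearchquery, which returns the following HTML:

<div class="search-result">
  <h3 class="search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
  <p class="search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
  <a class="search-result-link" href="/stories/story-link">Read more</a>
</div>
(And so on, lots more identically structured divs with search results)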

As you may have guessed, this is easy to scrape: all a scraper needs to do is hit the search URL with a query, and extract the desired data from the returned HTML. In addition to periodically changing your HTML as described above, you could also leave the old markup with the old ids and classes in, hide it with CSS, and fill it with fake data, thereby poisoning the scraper. Here's how the search results page could be changed:

<div class="the-real-search-result">
  <h3 class="the-real-search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
  <p class="the-real-search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
  <a class="the-real-search-result-link" href="/stories/story-link">Read more</a>
</div>

<div class="search-result" style="display:none">
  <h3 class="search-result-title">Visit Example.com now, for all the latest Stack Overflow related news !</h3>
  <p class="search-result-excerpt">Example.com is so awesome, visit now !</p>
  <a class="search-result-link" href="http://example.com/">Visit Now !</a>
</div>
(More real search results follow)

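This will mean that scrapers written to extract data from the HTML based on classes or ids will continue to seemingly work, but they will get fake data or even ads: data which real users will never see, as it is hidden with CSS.

Screw with the scraper: insert fake, invisible honeypot data into your page

Adding on to the previous example, you can add invisible honeypot items to your HTML to catch scrapers. An example which could be added to the previously described search results page:

<div class="search-result" style="display:none">
  <h3 class="search-result-title">This search result is here to prevent scraping</h3>
  <p class="search-result-excerpt">If you're a human and see this, please ignore it. If you're a scraper, please click the link below :-)
  Note that clicking the link below will block access to this site for 24 hours.</p>
  <a class="search-result-link" href="/scrapertrap/scrapertrap.php">I'm a scraper !</a>
</div>
(The actual, real, search results follow.)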

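A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (due to it being hidden with CSS), and won't visit the link. A genuine and desirable spider such as Google's will not visit the link either, because you disallowed /scrapertrap/ in your robots.txt.

You can make your scrapertrap.php do something like block access for the IP address that visited it, or force a captcha for all subsequent requests from that IP.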

Don't forget to disallow your honeypot (/scrapertrap/) in your robots.txt file so that search engine bots don't fall into it.

You can / should combine this with the previous tip of changing your HTML frequently.

Change this frequently too, as scrapers will eventually learn to avoid it. Change the honeypot URL and text. You may also want to consider changing the inline CSS used for hiding, and use an ID attribute and external CSS instead, as scrapers will learn to avoid anything which has a style attribute with CSS used to hide the content. Also try only enabling it sometimes, so the scraper works initially, but breaks after a while. This also applies to the previous tip.

Malicious people can prevent access for real users by sharing a link to your honeypot, or even embedding that link somewhere as an image (e.g. on a forum). Change the URL frequently, and make any ban times relatively short.
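The answer implements the trap as scrapertrap.php; the same bookkeeping is sketched here in Python purely for illustration. `HoneypotBans` is a hypothetical in-memory helper, and the 24-hour default simply mirrors the ban time mentioned in the honeypot text.

```python
BAN_SECONDS = 24 * 60 * 60  # 24 hours, as stated in the honeypot text


class HoneypotBans:
    """Remember which IP addresses requested the honeypot URL, and until when."""

    def __init__(self, ban_seconds: float = BAN_SECONDS):
        self.ban_seconds = ban_seconds
        self._banned_until = {}  # ip -> time at which the ban expires

    def record_hit(self, ip: str, now: float) -> None:
        """Call this from the handler that serves the honeypot URL."""
        self._banned_until[ip] = now + self.ban_seconds

    def is_banned(self, ip: str, now: float) -> bool:
        """Call this on every other request before serving real content."""
        until = self._banned_until.get(ip)
        if until is None:
            return False
        if now >= until:
            del self._banned_until[ip]  # ban expired, forget the entry
            return False
        return True
```

Because anyone can trick a real user into loading the honeypot link, a banned address is arguably better served a captcha than a hard block, and a shorter `ban_seconds` limits the damage.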

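Serve fake and useless data if you detect a scraper

If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make it impossible to distinguish such fake data from real data, so that scrapers don't know that they're being screwed with.

As an example: you have a news website; if you detect a scraper, instead of blocking access, serve up fake, randomly generated articles, and this will poison the data the scraper gets. If you make your fake data indistinguishable from the real thing, you'll make it hard for scrapers to get what they want, namely the actual, real data.

Don't accept requests if the User Agent is empty / missing

Often, lazily written scrapers will not send a User Agent header with their request, whereas all browsers as well as search engine spiders will.

If you get a request where the User Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else.)

It's trivial to spoof, but as a measure against poorly written scrapers it is worth implementing.

Don't accept requests if the User Agent is a common scraper one; blacklist ones used by scrapers

In some cases, scrapers will use a User Agent which no real browser or search engine spider uses, such as:

"Mozilla" (Just that, nothing else. I've seen a few questions about scraping here using that. A real browser will never use only that.)

"Java 1.7.43_u43" (By default, Java's HttpUrlConnection uses something like this.)

"BIZCO EasyScraping Studio 2.0"

"wget", "curl", "libcurl", ... (Wget and cURL are sometimes used for basic scraping.)

If you find that a specific User Agent string is used by scrapers on your site, and it is not used by real browsers or legitimate spiders, you can also add it to your blacklist.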
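Both User Agent checks (missing header, known scraper strings) fit into one small function. The sketch below uses the example strings listed above; the rule tables and the name `is_suspicious_user_agent` are assumptions for illustration, and matching is done case-insensitively by prefix or substring, so versioned values such as "Wget/1.21" are caught as well.

```python
from typing import Optional

BLOCKED_EXACT = {"mozilla"}  # the bare word, with nothing after it
BLOCKED_PREFIXES = ("java", "wget", "curl", "libcurl")
BLOCKED_SUBSTRINGS = ("easyscraping",)


def is_suspicious_user_agent(user_agent: Optional[str]) -> bool:
    """Return True for a missing, empty, or blacklisted User-Agent header."""
    if user_agent is None or not user_agent.strip():
        return True  # lazily written scrapers often send no header at all
    ua = user_agent.strip().lower()
    if ua in BLOCKED_EXACT:
        return True
    if ua.startswith(BLOCKED_PREFIXES):
        return True
    return any(fragment in ua for fragment in BLOCKED_SUBSTRINGS)
```

Since the header is trivial to spoof, a positive result is only a cheap first filter; it says nothing about requests that present a browser-like string.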

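If it doesn't request assets (CSS, images), it's not a real browser

A real browser will (almost always) request and download assets such as images and CSS. HTML parsers and scrapers won't, as they are only interested in the actual pages and their content.

You could log requests to your assets, and if you see lots of requests for only the HTML, it may be a scraper.

Beware that search engine bots, ancient mobile devices, screen readers, and misconfigured devices may not request assets either.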
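As a rough sketch of that log check, the function below takes (IP, path) pairs and reports clients that fetched many pages but never a single asset. The extension list, the `min_pages` threshold, and the name `html_only_clients` are assumptions chosen for the example; real log parsing is left out.

```python
from collections import Counter

# File extensions treated as assets; adjust to match your site.
ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif",
                    ".svg", ".ico", ".woff2")


def html_only_clients(requests, min_pages=20):
    """Return the sorted IPs with at least `min_pages` page requests and no asset requests.

    `requests` is an iterable of (ip, path) tuples taken from the access log.
    """
    pages = Counter()
    assets = Counter()
    for ip, path in requests:
        clean_path = path.split("?", 1)[0].lower()  # ignore the query string
        if clean_path.endswith(ASSET_EXTENSIONS):
            assets[ip] += 1
        else:
            pages[ip] += 1
    return sorted(ip for ip, count in pages.items()
                  if count >= min_pages and assets[ip] == 0)
```

Search engine bots and screen readers can show the same pattern, so the result is better used to trigger a closer look or a captcha than an automatic ban.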

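Use and require cookies; use them to track user and scraper actions

You can require cookies to be enabled in order to view your website. This will deter inexperienced and newbie scraper writers; however, it is easy for a scraper to send cookies. If you do use and require them, you can track user and scraper actions with them, and thus implement rate limiting, blocking, or showing captchas on a per-user instead of a per-IP basis.

For example: when the user performs a search, set a unique identifying cookie. When the results pages are viewed, verify that cookie. If the user opens all the search results (you can tell from the cookie), then it's probably a scraper.

Using cookies may be ineffective, as scrapers can send the cookies with their requests too, and discard them as needed. You will also prevent access for real users who have cookies disabled, if your site only works with cookies.

Note that if you use JavaScript to set and retrieve the cookie, you'll block scrapers which don't run JavaScript, since they can't retrieve and send the cookie with their request.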
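The search-cookie example can be sketched as follows. `SearchTracker`, its in-memory storage, and the `min_results` guard (so that a search with only one or two hits does not flag an ordinary user) are assumptions added for the illustration.

```python
import secrets


class SearchTracker:
    """Track, per cookie, which of a search's results the client has opened."""

    def __init__(self, min_results: int = 5):
        self.min_results = min_results
        self._sessions = {}  # cookie value -> {"results": set, "opened": set}

    def start_search(self, result_urls) -> str:
        """Create a unique cookie value to set when the results page is served."""
        cookie = secrets.token_hex(16)
        self._sessions[cookie] = {"results": set(result_urls), "opened": set()}
        return cookie

    def record_view(self, cookie: str, url: str) -> bool:
        """Record a page view; return True once every result has been opened."""
        session = self._sessions.get(cookie)
        if session is None:
            return False  # unknown cookie: handle separately (e.g. require a new search)
        if url in session["results"]:
            session["opened"].add(url)
        return (len(session["results"]) >= self.min_results
                and session["opened"] == session["results"])
```

A `True` result would then feed into whatever response you prefer: a captcha, rate limiting, or fake data.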

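Use JavaScript + Ajax to load your content

You could use JavaScript + AJAX to load your content after the page itself loads. This will make the content inaccessible to HTML parsers which do not run JavaScript. This is often an effective deterrent to newbie and inexperienced programmers writing scrapers.

Be aware of:

Using JavaScript to load the actual content will degrade user experience and performance.

Search engines may not run JavaScript either, thus preventing them from indexing your content. This may not be a problem for search results pages, but may be for other things, such as article pages.

Obfuscate your markup, network requests from scripts, and everything else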

If you use Ajax and JavaScript to load your data, obfuscate the data which is transferred. As an example, you could encode your data on the server (with something as simple as base64, or something more complex), and then decode and display it on the client after fetching it via Ajax. This will mean that someone inspecting network traffic will not immediately see how your page works and loads data, and it will be tougher for someone to directly request data from your endpoints, as they will have to reverse-engineer your descrambling algorithm.

If you do use Ajax for loading the data, you should make it hard to use the endpoints without loading the page first, e.g. by requiring some session key as a parameter, which you can embed in your JavaScript or your HTML.

You can also embed your obfuscated data directly in the initial HTML page and use JavaScript to deobfuscate and display it, which would avoid the extra network requests. Doing this will make it significantly harder to extract the data using an HTML-only parser which does not run JavaScript, as the one writing the scraper will have to reverse engineer your JavaScript (which you should obfuscate too).

You might want to change your obfuscation methods regularly, to break scrapers who have figured it out.

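There are several disadvantages to doing something like this, though:

It will be tedious and difficult to implement, maintain, and debug.

It will be ineffective against scrapers and screenscrapers which actually run JavaScript and then extract the data. (Most simple HTML parsers don't run JavaScript, though.)

It will make your site nonfunctional for real users if they have JavaScript disabled.

Performance and page-load times will suffer.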
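For the server side of the obfuscation approach, a minimal sketch might pair a base64-encoded JSON payload with a session key that the endpoint checks before answering. Deriving the key with an HMAC is an assumption of this example (the text only asks for "some session key"), as are the names `SECRET`, `issue_session_key`, `is_valid_session_key`, `encode_payload`, and `decode_payload`. Note that base64 only hides the data from casual inspection; it is not encryption.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"replace-with-a-private-value"  # hypothetical server-side secret


def issue_session_key(session_id: str) -> str:
    """Key to embed in the page's HTML or JavaScript when the page is served."""
    return hmac.new(SECRET, session_id.encode("utf-8"), hashlib.sha256).hexdigest()


def is_valid_session_key(session_id: str, key: str) -> bool:
    """Check the key sent along with an Ajax request, in constant time."""
    expected = issue_session_key(session_id)
    return hmac.compare_digest(expected.encode("utf-8"), key.encode("utf-8"))


def encode_payload(data) -> str:
    """Obfuscate JSON-serializable data for transfer (base64 over UTF-8 JSON)."""
    raw = json.dumps(data).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_payload(blob: str):
    """Inverse of encode_payload; the client-side JavaScript would mirror this."""
    return json.loads(base64.b64decode(blob).decode("utf-8"))
```

An endpoint would refuse to answer unless `is_valid_session_key` succeeds, and would return `encode_payload(results)` instead of plain JSON.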

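Non-technical:

Tell people not to scrape, and some will respect it.

Find a lawyer.

Make your data available, provide an API: You could make your data easily available and require attribution and a link back to your site. Perhaps charge $$$ for it.

Miscellaneous: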

There are also commercial scraping protection services, such as the anti-scraping offered by Cloudflare or Distil Networks, which do these things, and more, for you.

Find a balance between usability for real users and scraper-proofness: Everything you do will impact user experience negatively in one way or another; find compromises.

Don't forget your mobile site and apps. If you have a mobile app, that can be screenscraped too, and network traffic can be inspected to determine the REST endpoints it uses.

Scrapers can scrape other scrapers: If there's one website which has content scraped from yours, other scrapers can scrape from that scraper's website.

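Further reading:

Wikipedia's article on Web scraping. Many details on the technologies involved and the different types of web scraper.

"Stopping scripters from slamming your website hundreds of times a second". Q & A on a very similar problem: bots checking a website and buying things as soon as they go on sale. A lot of relevant info, especially on captchas and rate limiting.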

2016-01-16 15:06:06

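Instead of blacklisting bots, maybe you should whitelist them. If you don't want to kill your search results for the top few engines, you can whitelist their user-agent strings, which are generally well publicized. The less ethical bots tend to forge the user-agent strings of popular web browsers. The top few search engines should be driving upwards of 95% of your traffic.

Identifying the bots themselves should be fairly straightforward, using the techniques other posters have suggested.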

2010-07-02 01:22:01
