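How do I prevent site scraping?

I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy artist names here and there and then do Google searches for them).

How can I prevent screen scraping? Is it even possible?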

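Current answer

I presume you have set up robots.txt.

As others have mentioned, scrapers can fake nearly every aspect of their activities, and it is probably very difficult to identify the requests that come from the bad guys.

I would consider:

1. Set up a page, /jail.html.
2. Disallow access to the page in robots.txt (so the respectful spiders will never visit).
3. Place a link on one of your pages, hiding it with CSS (display: none).
4. Record the IP addresses of visitors to /jail.html.

This might help you quickly identify requests from scrapers that are flagrantly disregarding your robots.txt.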
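As a rough illustration of the "record IP addresses" step, here is a minimal Python sketch that scans an access log for visits to the trap. It assumes the log uses the common "combined" format written by Apache or nginx; the function names and the sample log lines are hypothetical, and only the /jail.html path comes from the answer.

```python
import re

# Start of a "combined" access-log line: client IP, identity, user, [timestamp], "METHOD path ...
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)')

def is_trap_path(path, trap_page="/jail.html", trap_dir="/jail/"):
    """True for the trap page itself or anything under the fake /jail/ site."""
    path = path.split("?", 1)[0]  # ignore any query string
    return path == trap_page or path.startswith(trap_dir)

def find_trapped_ips(log_lines):
    """Return the set of client IPs that requested a trap URL."""
    trapped = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and is_trap_path(match.group("path")):
            trapped.add(match.group("ip"))
    return trapped

# Hypothetical log lines, using reserved documentation IP addresses.
sample_log = [
    '198.51.100.7 - - [01/Jul/2010:21:00:01 +0000] "GET /artist/42 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [01/Jul/2010:21:00:02 +0000] "GET /jail.html HTTP/1.1" 200 310 "-" "-"',
    '203.0.113.9 - - [01/Jul/2010:21:00:03 +0000] "GET /jail/album/63ajdka HTTP/1.1" 200 290 "-" "-"',
]
print(find_trapped_ips(sample_log))
```

Since well-behaved spiders never reach the page, any IP in the result is a candidate for closer inspection rather than proof of abuse.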

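You might also want to make /jail.html an entire fake website that has exactly the same markup as your normal pages, but with fake data (/jail/album/63ajdka, /jail/track/3aads8, etc.). This way, the bad scrapers won't be alerted to "unusual input" until you have had the chance to block them entirely.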

2010-07-01 21:09:07

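Other answers

No, it's not possible to stop it (in any way). Embrace it. Why not publish as RDFa, become super search-engine friendly, and encourage re-use of the data? People will thank you and give credit where it is due (see MusicBrainz as an example).

This is probably not the answer you want, but why hide what you are trying to make public?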

2010-07-02 00:32:01

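You can't stop normal screen scraping. For better or worse, it's the nature of the web.

You can make it so that no one can access certain things (including music files) unless they are logged in as a registered user. That is not too difficult to do in Apache. I assume it wouldn't be too difficult in IIS either.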

2010-07-02 00:43:09

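Note: Since the complete version of this answer exceeds Stack Overflow's length limit, you'll need to head to GitHub to read the extended version, which has more tips and details.

In order to hinder scraping (also known as web scraping, screen scraping, web data mining, web harvesting, or web data extraction), it helps to know how these scrapers work and, by extension, what prevents them from working well.

There are various types of scraper, and each works differently: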

- Spiders, such as Google's bot or website copiers like HTtrack, which recursively follow links to other pages in order to get data. These are sometimes used for targeted scraping to get specific data, often in combination with an HTML parser to extract the desired data from each page.
- Shell scripts: Sometimes, common Unix tools are used for scraping: Wget or Curl to download pages, and Grep (Regex) to extract the data.
- HTML parsers, such as ones based on Jsoup, Scrapy, and others. Similar to shell-script regex based ones, these work by extracting data from pages based on patterns in HTML, usually ignoring everything else. For example: If your website has a search feature, such a scraper might submit a request for a search, and then get all the result links and their titles from the results page HTML, in order to specifically get only search result links and their titles. These are the most common.
- Screenscrapers, based on e.g. Selenium or PhantomJS, which open your website in a real browser, run JavaScript, AJAX, and so on, and then get the desired text from the webpage, usually by:
  - Getting the HTML from the browser after your page has been loaded and JavaScript has run, and then using an HTML parser to extract the desired data. These are the most common, and so many of the methods for breaking HTML parsers / scrapers also work here.
  - Taking a screenshot of the rendered pages, and then using OCR to extract the desired text from the screenshot. These are rare, and only dedicated scrapers who really want your data will set this up.
- Webscraping services such as ScrapingHub or Kimono. In fact, there are people whose job is to figure out how to scrape your site and pull out the content for others to use. Unsurprisingly, professional scraping services are the hardest to deter, but if you make it hard and time-consuming to figure out how to scrape your site, these (and people who pay them to do so) may not be bothered to scrape your website.
- Embedding your website in other sites' pages with frames, and embedding your site in mobile apps. While not technically scraping, mobile apps (Android and iOS) can embed websites, and inject custom CSS and JavaScript, thus completely changing the appearance of your pages.
- Human copy - paste: People will copy and paste your content in order to use it elsewhere.

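There is a lot of overlap between these different kinds of scraper, and many scrapers will behave similarly even if they use different technologies and methods.

These tips are mostly my own ideas, various difficulties that I've encountered while writing scrapers, as well as bits of information and ideas from around the internet.

How to stop scraping

You can't completely prevent it, since whatever you do, determined scrapers can still figure out how to scrape. However, you can stop a lot of scraping by doing a few things:

Monitor your logs & traffic patterns; limit access if you see unusual activity:

Check your logs regularly, and in case of unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, you can block or limit access.

Specifically, some ideas: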

- Rate limiting: Only allow users (and scrapers) to perform a limited number of actions in a certain time - for example, only allow a few searches per second from any specific IP address or user. This will slow down scrapers, and make them ineffective. You could also show a captcha if actions are completed too fast or faster than a real user would.
- Detect unusual activity: If you see unusual activity, such as many similar requests from a specific IP address, someone looking at an excessive number of pages or performing an unusual number of searches, you can prevent access, or show a captcha for subsequent requests.
- Don't just monitor & rate limit by IP address - use other indicators too: If you do block or rate limit, don't just do it on a per-IP address basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users / scrapers include:
  - How fast users fill out forms, and where on a button they click.
  - You can gather a lot of information with JavaScript, such as screen size / resolution, timezone, installed fonts, etc.; you can use this to identify users.
  - HTTP headers and their order, especially User-Agent.
  As an example, if you get many requests from a single IP address, all using the same User Agent and screen size (determined with JavaScript), and the user (scraper in this case) always clicks on the button in the same way and at regular intervals, it's probably a screen scraper; you can temporarily block similar requests (e.g. block all requests with that user agent and screen size coming from that particular IP address), and this way you won't inconvenience real users on that IP address, e.g. in case of a shared internet connection.
  You can also take this further, as you can identify similar requests even if they come from different IP addresses, which is indicative of distributed scraping (a scraper using a botnet or a network of proxies). If you get a lot of otherwise identical requests, but they come from different IP addresses, you can block them. Again, be careful not to inadvertently block real users.
  This can be effective against screenscrapers which run JavaScript, as you can get a lot of information from them.
  Related questions on Security Stack Exchange: "How to uniquely identify users with the same external IP address?" for more details, and "Why do people use IP address bans when IP addresses often change?" for info on the limits of these methods.
- Instead of temporarily blocking access, use a Captcha: The simple way to implement rate-limiting would be to temporarily block access for a certain amount of time; however, using a Captcha may be better, see the section on Captchas further down.
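The rate-limiting idea can be sketched in a few lines of Python. This is only an in-memory sliding-window counter under assumed names (SlidingWindowLimiter, client_key); a real site would more likely keep the counters in a shared store, and the choice of indicators in the key is an assumption based on the list above.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `max_actions` per `window_seconds` for each client key."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock                      # injectable so the limiter can be tested
        self._events = defaultdict(deque)       # key -> timestamps of recent actions

    def allow(self, key):
        now = self.clock()
        events = self._events[key]
        # Forget actions that have fallen out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_actions:
            return False                        # caller might show a captcha here
        events.append(now)
        return True

def client_key(ip, user_agent, screen_size):
    """Combine several indicators so a shared IP is not treated as one client."""
    return (ip, user_agent, screen_size)

limiter = SlidingWindowLimiter(max_actions=3, window_seconds=1.0)
key = client_key("203.0.113.9", "ExampleBot/1.0", "1920x1080")
print([limiter.allow(key) for _ in range(5)])
```

Keying on more than the IP address means a second visitor behind the same shared connection, with a different browser and screen size, keeps their own allowance.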

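Require registration & login

Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers, but it is also a good deterrent for real users.

If you require account creation and login, you can accurately track user and scraper actions. This way, you can easily detect when a specific account is being used for scraping, and ban it. Things like rate limiting or detecting abuse (such as a huge number of searches in a short time) become easier, as you can identify specific scrapers instead of just IP addresses.

In order to avoid scripts creating many accounts, you should:

- Require an email address for registration, and verify that email address by sending a link that must be opened in order to activate the account. Allow only one account per email address.
- Require a captcha to be solved during registration / account creation.

Requiring account creation to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.

Block access from cloud hosting and scraping service IP addresses

Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or GAE, or from VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services.

Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid many requests being detected.

Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.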
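A minimal sketch of such an IP-range check, using Python's standard ipaddress module. The ranges below are placeholders taken from the reserved documentation blocks, not real hosting-provider ranges; the list and the function name is_hosting_ip are assumptions for illustration.

```python
import ipaddress

# Placeholder networks (reserved for documentation). A real deployment would
# load the ranges that the hosting, proxy, or VPN providers actually use.
HOSTING_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

def is_hosting_ip(address, ranges=HOSTING_RANGES):
    """True if `address` falls inside one of the listed networks."""
    ip = ipaddress.ip_address(address)
    # Compare only against networks of the same IP version.
    return any(ip.version == net.version and ip in net for net in ranges)

for candidate in ("203.0.113.77", "198.51.100.1", "2001:db8::1"):
    print(candidate, is_hosting_ip(candidate))
```

Given the warning about real users behind VPNs, a match is probably better treated as a reason to show a captcha than as grounds for a hard block.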

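Make your error message nondescript if you do block

If you do block / limit access, you should ensure that you don't tell the scraper what caused the block, thereby giving them clues as to how to fix their scraper. So a bad idea would be to show error pages with text like:

- Too many requests from your IP address, please try again later.
- Error, User Agent header not present!

Instead, show a friendly error message that doesn't tell the scraper what caused it. Something like this is much better:

Sorry, something went wrong. You can contact support via helpdesk@example.com, should the problem persist.

This is also a lot more user friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, in case a real user sees the error message, so that you don't block legitimate users and cause them to contact you.

Use Captchas if you suspect that your website is being accessed by a scraper.

Captchas ("Completely Automated Public Turing test to tell Computers and Humans Apart") are very effective at stopping scrapers. Unfortunately, they are also very effective at irritating users.

As such, they are useful when you suspect a possible scraper and want to stop the scraping, without also blocking access in case it isn't a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.

Things to be aware of when using Captchas: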

- Don't roll your own, use something like Google's reCaptcha: It's a lot easier than implementing a captcha yourself, it's more user-friendly than some blurry and warped text solution you might come up with yourself (users often only need to tick a box), and it's also a lot harder for a scripter to solve than a simple image served from your site.
- Don't include the solution to the captcha in the HTML markup: I've actually seen one website which had the solution for the captcha in the page itself (although quite well hidden), thus making it pretty useless. Don't do something like this. Again, use a service like reCaptcha, and you won't have this kind of problem (if you use it properly).
- Captchas can be solved in bulk: There are captcha-solving services where actual, low-paid humans solve captchas in bulk. Again, using reCaptcha is a good idea here, as they have protections (such as the relatively short time the user has in order to solve the captcha). This kind of service is unlikely to be used unless your data is really valuable.

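Serve your text content as an image

You can render text into an image server-side and serve that to be displayed, which will hinder simple scrapers from extracting text.

However, this is bad for screen readers, search engines, performance, and pretty much everything else. It's also illegal in some places (due to accessibility, e.g. the Americans with Disabilities Act), and it's easy to circumvent with some OCR, so don't do it.

You can do something similar with CSS sprites, but that suffers from the same problems.

Don't expose your complete dataset: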

If feasible, don't provide a way for a script / bot to get all of your dataset. As an example: You have a news site, with lots of individual articles. You could make those articles be only accessible by searching for them via the on site search, and, if you don't have a list of all the articles on the site and their URLs anywhere, those articles will be only accessible by using the search feature. This means that a script wanting to get all the articles off your site will have to do searches for all possible phrases which may appear in your articles in order to find them all, which will be time-consuming, horribly inefficient, and will hopefully make the scraper give up.

This will be ineffective if:

- The bot / script does not want / need the full dataset anyway.
- Your articles are served from a URL which looks something like example.com/article.php?articleId=12345. This (and similar things) will allow scrapers to simply iterate over all the articleIds and request all the articles that way.
- There are other ways to eventually find all the articles, such as by writing a script to follow links within articles which lead to other articles.
- Searching for something like "and" or "the" can reveal almost everything, so that is something to be aware of. (You can avoid this by only returning the top 10 or 20 results.)
- You need search engines to find your content.

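Don't expose your APIs, endpoints, and similar things:

Make sure you don't expose any APIs, even unintentionally. For example, if you are using AJAX or network requests from within Adobe Flash or Java Applets (God forbid!) to load your data, it is trivial to look at the network requests from the page, figure out where those requests are going, and then reverse engineer and use those endpoints in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described.

To deter HTML parsers and scrapers:

Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in order to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers like spiders and screenscrapers.

Frequently change your HTML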

Scrapers which process HTML directly do so by extracting contents from specific, identifiable parts of your HTML page. For example: If all pages on your website have a div with an id of article-content, which contains the text of the article, then it is trivial to write a script to visit all the article pages on your site, and extract the content text of the article-content div on each article page, and voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.

If you change the HTML and the structure of your page frequently, such scrapers will no longer work.

- You can frequently change the ids and classes of elements in your HTML, perhaps even automatically. So, if your div.article-content becomes something like div.a4c36dda13eaf0, and changes every week, the scraper will work fine initially, but will break after a week. Make sure to change the length of your ids / classes too, otherwise the scraper will use div.[any-14-characters] to find the desired div instead. Beware of other similar holes too.
- If there is no way to find the desired content from the markup, the scraper will do so from the way the HTML is structured. So, if all your article pages are similar in that every div inside a div which comes after an h1 is the article content, scrapers will get the article content based on that. Again, to break this, you can add / remove extra markup to your HTML, periodically and randomly, e.g. adding extra divs or spans. With modern server-side HTML processing, this should not be too hard.
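One way the automatic renaming might be done is to derive each class name from a secret and the current week, so templates and stylesheets can compute the same name independently. The sketch below is an assumption, not the answer's own method; rotated_class_name, current_period, and the secret value are hypothetical.

```python
import datetime
import hashlib
import hmac

def current_period(today=None):
    """Label for the current ISO week, e.g. '2016-W02'; names change when it changes."""
    year, week, _ = (today or datetime.date.today()).isocalendar()
    return f"{year}-W{week:02d}"

def rotated_class_name(base_name, period, secret):
    """Derive a class name that depends on the period and also varies in length."""
    digest = hmac.new(secret, f"{base_name}:{period}".encode(), hashlib.sha256).hexdigest()
    length = 8 + int(digest[:2], 16) % 9      # between 8 and 16 hex characters
    return "c" + digest[2:2 + length]         # leading letter keeps it a valid CSS class

SECRET = b"replace-with-a-real-secret"        # hypothetical; keep it out of the page source
period = current_period()
print(rotated_class_name("article-content", period, SECRET))
```

Both the HTML templates and the generated CSS would call rotated_class_name with the same base name, so the page keeps working while the visible names change each week.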

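Things to be aware of:

- It will be tedious and difficult to implement, maintain, and debug.
- You will hinder caching. Especially if you change ids or classes of your HTML elements, this will require corresponding changes in your CSS and JavaScript files, which means that every time you change them, they will have to be re-downloaded by the browser. This will result in longer page load times for repeat visitors, and increased server load. If you only change it once a week, it will not be a big problem.
- Clever scrapers will still be able to get your content by inferring where the actual content is, e.g. by knowing that a large single block of text on the page is likely to be the actual article. This makes it possible to still find & extract the desired data from the page. Boilerpipe does exactly this.

Essentially, make sure that it is not easy for a script to find the actual, desired content for every similar page.

See also "How to prevent crawlers depending on XPath from getting page contents" for details on how this can be implemented in PHP.

Change your HTML based on the user's location

This is somewhat similar to the previous tip. If you serve different HTML based on your user's location / country (determined by IP address), this may break scrapers which are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it's actually distributed to users, as those users may be in a different country and thus get different HTML, which the embedded scraper was not designed to consume.

Frequently change your HTML, actively screw with the scrapers by doing so!

An example: You have a search feature on your website, located at example.com/search?query=somesearchquery, which returns the following HTML: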

<div class="search-result">
  <h3 class="search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
  <p class="search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
  <a class="search-result-link" href="/stories/story-link">Read more</a>
</div>
(And so on, lots more identically structured divs with search results)

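As you may have guessed, this is easy to scrape: all a scraper needs to do is hit the search URL with a query, and extract the desired data from the returned HTML. In addition to periodically changing your HTML as described above, you could also leave the old markup with the old ids and classes in, hide it with CSS, and fill it with fake data, thereby poisoning the scraper. Here's how the search results page could be changed: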

<div class="the-real-search-result">
  <h3 class="the-real-search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
  <p class="the-real-search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
  <a class="the-real-search-result-link" href="/stories/story-link">Read more</a>
</div>

<div class="search-result" style="display:none">
  <h3 class="search-result-title">Visit Example.com now, for all the latest Stack Overflow related news !</h3>
  <p class="search-result-excerpt">Example.com is so awesome, visit now !</p>
  <a class="search-result-link" href="http://example.com/">Visit Now !</a>
</div>
(More real search results follow)

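This will mean that scrapers written to extract data from the HTML based on classes or ids will continue to seemingly work, but they will get fake data or even ads, data which real users will never see, as it is hidden with CSS.

Screw with the scraper: Insert fake, invisible honeypot data into your page

Adding on to the previous example, you can add invisible honeypot items to your HTML to catch scrapers. An example which could be added to the previously described search results page: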

<div class="search-result" style="display:none">
  <h3 class="search-result-title">This search result is here to prevent scraping</h3>
  <p class="search-result-excerpt">If you're a human and see this, please ignore it. If you're a scraper, please click the link below :-)
  Note that clicking the link below will block access to this site for 24 hours.</p>
  <a class="search-result-link" href="/scrapertrap/scrapertrap.php">I'm a scraper !</a>
</div>
(The actual, real, search results follow.)

A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (due to it being hidden with CSS), and won't visit the link. A genuine and desirable spider such as Google's will not visit the link either, because you disallowed /scrapertrap/ in your robots.txt.

You can make your scrapertrap.php do something like block access for the IP address that visited it, or force a captcha for all subsequent requests from that IP.

- Don't forget to disallow your honeypot (/scrapertrap/) in your robots.txt file so that search engine bots don't fall into it.
- You can / should combine this with the previous tip of changing your HTML frequently.
- Change this frequently too, as scrapers will eventually learn to avoid it. Change the honeypot URL and text. Also consider changing the inline CSS used for hiding, and use an ID attribute and external CSS instead, as scrapers will learn to avoid anything which has a style attribute with CSS used to hide the content. Also try only enabling it sometimes, so the scraper works initially, but breaks after a while. This also applies to the previous tip.
- Malicious people can prevent access for real users by sharing a link to your honeypot, or even embedding that link somewhere as an image (e.g. on a forum). Change the URL frequently, and make any ban times relatively short.
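The trap logic itself can be small. The following Python sketch stands in for what scrapertrap.php might do; it is framework-agnostic, keeps bans in memory, and uses a one-hour default ban as an assumption in line with the advice to keep ban times short. TrapBans and handle_request are hypothetical names.

```python
import time

class TrapBans:
    """Remember which IPs hit the honeypot, with short-lived bans."""

    def __init__(self, ban_seconds=3600, clock=time.time):
        self.ban_seconds = ban_seconds
        self.clock = clock                 # injectable so expiry can be tested
        self._banned_until = {}            # ip -> timestamp when the ban lapses

    def record_trap_hit(self, ip):
        self._banned_until[ip] = self.clock() + self.ban_seconds

    def is_banned(self, ip):
        expires = self._banned_until.get(ip)
        if expires is None:
            return False
        if self.clock() >= expires:
            del self._banned_until[ip]     # ban lapsed; forget the entry
            return False
        return True

def handle_request(path, ip, bans, trap_prefix="/scrapertrap/"):
    """Return the action a (hypothetical) web app should take for this request."""
    if path.startswith(trap_prefix):
        bans.record_trap_hit(ip)
        return "captcha"
    return "captcha" if bans.is_banned(ip) else "serve"

bans = TrapBans()
print(handle_request("/scrapertrap/scrapertrap.php", "203.0.113.9", bans))
print(handle_request("/search?query=somesearchquery", "203.0.113.9", bans))
```

Returning a captcha rather than a hard block limits the damage if someone tricks real users into requesting the honeypot URL.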

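Serve fake and useless data if you detect a scraper

If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make it impossible to distinguish such fake data from real data, so that scrapers don't know that they're being screwed with.

As an example: you have a news website; if you detect a scraper, instead of blocking access, serve up fake, randomly generated articles, and this will poison the data the scraper gets. If you make your fake data indistinguishable from the real thing, you'll make it hard for scrapers to get what they want, namely the actual, real data.

Don't accept requests if the User Agent is empty / missing

Often, lazily written scrapers will not send a User Agent header with their request, whereas all browsers as well as search engine spiders will.

If you get a request where the User Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else.)

It's trivial to spoof, but as a measure against poorly written scrapers it is worth implementing.

Don't accept requests if the User Agent is a common scraper one; blacklist ones used by scrapers

In some cases, scrapers will use a User Agent which no real browser or search engine spider uses, such as:

- "Mozilla" (just that, nothing else. I've seen a few questions about scraping here that use it. A real browser will never use only that)
- "Java 1.7.43_u43" (by default, Java's HttpUrlConnection uses something like this)
- "BIZCO EasyScraping Studio 2.0"
- "wget", "curl", "libcurl", ... (Wget and cURL are sometimes used for basic scraping)

If you find that a specific User Agent string is used by scrapers on your site, and it is not used by real browsers or legitimate spiders, you can also add it to your blacklist.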
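Both User Agent checks (missing header, known scraper strings) fit in one small function. This is a sketch under assumptions: the patterns are modelled on the examples listed above, and classify_user_agent and its return labels are hypothetical.

```python
import re

# Patterns modelled on the examples above; extend them with agents seen in your own logs.
SUSPICIOUS_AGENTS = [
    re.compile(r"^Mozilla$", re.IGNORECASE),               # just "Mozilla", nothing else
    re.compile(r"^Java[/ ]\d", re.IGNORECASE),             # Java's default HTTP client
    re.compile(r"\b(wget|curl|libcurl)\b", re.IGNORECASE),
    re.compile(r"EasyScraping", re.IGNORECASE),
]

def classify_user_agent(user_agent):
    """Return 'missing', 'blacklisted', or 'ok' for a User-Agent header value."""
    if user_agent is None or not user_agent.strip():
        return "missing"
    if any(pattern.search(user_agent) for pattern in SUSPICIOUS_AGENTS):
        return "blacklisted"
    return "ok"

for agent in (None, "Mozilla", "curl/7.68.0", "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"):
    print(repr(agent), classify_user_agent(agent))
```

As noted above, this only filters out lazily written scrapers, since the header is trivial to spoof.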

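If it doesn't request assets (CSS, images), it's not a real browser.

A real browser will (almost always) request and download assets such as images and CSS. HTML parsers and scrapers won't, as they are only interested in the actual pages and their content.

You could log requests to your assets, and if you see lots of requests for only the HTML, it may be a scraper.

Beware that search engine bots, ancient mobile devices, screen readers, and misconfigured devices may not request assets either.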
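A rough sketch of that log check in Python: count page requests and asset requests per client, and flag clients that fetched many pages but never an asset. The suffix list, the threshold of 20 pages, and the name html_only_clients are assumptions for illustration.

```python
from collections import defaultdict

ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".svg", ".ico")

def html_only_clients(requests, min_pages=20):
    """requests: iterable of (client_id, path) pairs.

    Return the clients that fetched at least `min_pages` pages and no assets.
    """
    pages = defaultdict(int)
    assets = defaultdict(int)
    for client_id, path in requests:
        clean = path.split("?", 1)[0].lower()   # ignore query strings like ?v=3
        if clean.endswith(ASSET_SUFFIXES):
            assets[client_id] += 1
        else:
            pages[client_id] += 1
    return {client for client, count in pages.items()
            if count >= min_pages and assets[client] == 0}

sample = [("a", f"/artist/{i}") for i in range(25)]
sample += [("b", f"/artist/{i}") for i in range(25)] + [("b", "/static/site.css?v=3")]
print(html_only_clients(sample))
```

Because search engine bots and screen readers can look the same in such a log, the result is better used as one signal among several than as a block list.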

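Use and require cookies; use them to track user and scraper actions.

You can require cookies to be enabled in order to view your website. This will deter inexperienced and newbie scraper writers; however, it is easy for a scraper to send cookies. If you do use and require them, you can track user and scraper actions with them, and thus implement rate-limiting, blocking, or showing captchas on a per-user instead of a per-IP basis.

For example: when the user performs a search, set a unique identifying cookie. When the results pages are viewed, verify that cookie. If the user opens all the search results (you can tell from the cookie), then it's probably a scraper.

Using cookies may be ineffective, as scrapers can send the cookies with their requests too, and discard them as needed. You will also prevent access for real users who have cookies disabled, if your site only works with cookies.

Note that if you use JavaScript to set and retrieve the cookie, you'll block scrapers which don't run JavaScript, since they can't retrieve and send the cookie with their request.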
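The search-cookie example could be tracked server-side roughly as follows. This Python sketch keeps everything in memory and assumes the cookie value is an opaque session id; SearchSessionTracker, new_session_id, and the 90% threshold are hypothetical choices.

```python
import secrets

def new_session_id():
    """Opaque value to store in the identifying cookie."""
    return secrets.token_urlsafe(16)

class SearchSessionTracker:
    """Track which search results each cookie-identified session was shown and opened."""

    def __init__(self, max_open_ratio=0.9, min_results=10):
        self.max_open_ratio = max_open_ratio
        self.min_results = min_results
        self._shown = {}    # session id -> set of result URLs shown
        self._opened = {}   # session id -> set of those URLs later requested

    def record_search(self, session_id, result_urls):
        self._shown.setdefault(session_id, set()).update(result_urls)

    def record_open(self, session_id, url):
        if url in self._shown.get(session_id, ()):
            self._opened.setdefault(session_id, set()).add(url)

    def looks_like_scraper(self, session_id):
        shown = self._shown.get(session_id, set())
        if len(shown) < self.min_results:
            return False    # too little data to judge
        opened = self._opened.get(session_id, set())
        return len(opened) / len(shown) >= self.max_open_ratio

tracker = SearchSessionTracker()
session = new_session_id()
results = [f"/stories/{i}" for i in range(10)]
tracker.record_search(session, results)
for url in results:
    tracker.record_open(session, url)
print(tracker.looks_like_scraper(session))
```

A session that opens nearly every result it was shown is unusual for a human reader, which is the signal the example above relies on.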

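Use JavaScript + Ajax to load your content

You could use JavaScript + AJAX to load your content after the page itself loads. This will make the content inaccessible to HTML parsers which do not run JavaScript. This is often an effective deterrent to newbie and inexperienced programmers writing scrapers.

Be aware of:

- Using JavaScript to load the actual content will degrade user experience and performance.
- Search engines may not run JavaScript either, thus preventing them from indexing your content. This may not be a problem for search results pages, but may be for other things, such as article pages.

Obfuscate your markup, network requests from scripts, and everything else.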

If you use Ajax and JavaScript to load your data, obfuscate the data which is transferred. As an example, you could encode your data on the server (with something as simple as base64 or more complex), and then decode and display it on the client, after fetching via Ajax. This will mean that someone inspecting network traffic will not immediately see how your page works and loads data, and it will be tougher for someone to directly request data from your endpoints, as they will have to reverse-engineer your descrambling algorithm.

If you do use Ajax for loading the data, you should make it hard to use the endpoints without loading the page first, e.g. by requiring some session key as a parameter, which you can embed in your JavaScript or your HTML. You can also embed your obfuscated data directly in the initial HTML page and use JavaScript to deobfuscate and display it, which would avoid the extra network requests. Doing this will make it significantly harder to extract the data using an HTML-only parser which does not run JavaScript, as the one writing the scraper will have to reverse engineer your JavaScript (which you should obfuscate too). You might want to change your obfuscation methods regularly, to break scrapers who have figured it out.
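The "session key as a parameter" idea might look like the sketch below: the page embeds a short-lived signed token, and the Ajax endpoint refuses requests without a valid one. The token layout, the five-minute lifetime, and the function names are assumptions; only the general approach comes from the text above.

```python
import base64
import hashlib
import hmac
import time

def issue_page_token(secret, session_id, now=None):
    """Token to embed in the page's HTML or JavaScript when the page is rendered."""
    issued = int(time.time() if now is None else now)
    payload = f"{session_id}:{issued}"
    signature = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(f"{payload}:{signature}".encode()).decode()

def verify_page_token(secret, token, session_id, max_age=300, now=None):
    """Check the token an Ajax request sent back; reject stale or foreign tokens."""
    try:
        decoded = base64.urlsafe_b64decode(token.encode()).decode()
        token_session, issued_text, signature = decoded.rsplit(":", 2)
        issued = int(issued_text)
    except (ValueError, UnicodeDecodeError):
        return False                       # malformed token
    expected = hmac.new(secret, f"{token_session}:{issued}".encode(),
                        hashlib.sha256).hexdigest()
    current = time.time() if now is None else now
    return (hmac.compare_digest(signature.encode(), expected.encode())
            and token_session == session_id
            and 0 <= current - issued <= max_age)

SECRET = b"replace-with-a-real-secret"     # hypothetical server-side secret
token = issue_page_token(SECRET, "session-123")
print(verify_page_token(SECRET, token, "session-123"))
```

A scraper can still load the page first and reuse the token, so this raises the effort required rather than closing the endpoint off.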

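There are several disadvantages to doing something like this, though:

- It will be tedious and difficult to implement, maintain, and debug.
- It will be ineffective against scrapers and screenscrapers which actually run JavaScript and then extract the data. (Most simple HTML parsers don't run JavaScript, though.)
- It will make your site nonfunctional for real users if they have JavaScript disabled.
- Performance and page-load times will suffer.

Non-Technical:

- Tell people not to scrape, and some will respect it.
- Find a lawyer.
- Make your data available, provide an API: You could make your data easily available and require attribution and a link back to your site. Perhaps charge $$$ for it.

Miscellaneous:

- There are also commercial scraping protection services, such as the anti-scraping by Cloudflare or Distill Networks, which do these things, and more, for you.
- Find a balance between usability for real users and scraper-proofness: Everything you do will impact user experience negatively in one way or another, so find compromises.
- Don't forget your mobile site and apps. If you have a mobile app, that can be screenscraped too, and network traffic can be inspected to determine the REST endpoints it uses.
- Scrapers can scrape other scrapers: If there's one website which has content scraped from yours, other scrapers can scrape from that scraper's website.

Further reading:

- Wikipedia's article on Web scraping. Many details on the technologies involved and the different types of web scraper.
- "Stopping scripters from slamming your website hundreds of times a second". Q & A on a very similar problem: bots checking a website and buying things as soon as they go on sale. A lot of relevant info, especially on Captchas and rate-limiting.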

2016-01-16 15:06:06

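Late answer - and this answer probably isn't the one you want to hear...

I have already written many (dozens of) different specialized data-mining scrapers (just because I like the "open data" philosophy).

There is already plenty of advice in the other answers - now I will play the devil's advocate and extend and/or correct what they say about effectiveness.

First:

- if someone really wants your data
- you can't effectively (technically) hide your data
- if the data should be publicly accessible to your "regular users"

Trying to use some technical barriers isn't worth the trouble it causes:

- to your regular users, by worsening their user experience
- to regular and welcomed bots (search engines)
- etc...

Plain HTML - the easiest case is parsing plain HTML pages with a well-defined structure and CSS classes. E.g. it is enough to inspect the element with Firebug and use the right XPaths and/or CSS paths in my scraper.

You could generate the HTML structure dynamically, and you can also generate the CSS class names dynamically (and the CSS itself too), e.g. by using some random class names - but

- you want to present the information to your regular users in a consistent way
- e.g. again, it is enough to analyze the page structure once more to set up the scraper
- and it can be done automatically by analyzing some "already known content": once someone already knows (from an earlier scrape) e.g. what the page contains about "phil collins", it is enough to display the "phil collins" page and (automatically) analyze how the page is structured "today" :)

You can't change the structure for every response, because your regular users will hate you. Also, this will cause more trouble for you (maintenance) than for the scraper. The XPath or CSS path can be determined by the scraping script automatically from the known content.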
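To make the point concrete, here is a small Python sketch of how a scraper could rediscover the path from known content, using only the standard library's HTML parser. It is an illustration of the argument, not the author's actual tooling; PathFinder, find_path, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags without a closing tag

class PathFinder(HTMLParser):
    """Record the tag/class path leading to the first text node containing known text."""

    def __init__(self, known_text):
        super().__init__()
        self.known_text = known_text
        self.stack = []
        self.found_path = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.found_path is None and self.known_text in data:
            self.found_path = " > ".join(self.stack)

def find_path(html, known_text):
    finder = PathFinder(known_text)
    finder.feed(html)
    return finder.found_path

# Today's (randomized) markup for a page whose content is already known.
page = '<div class="x9f2"><h1 class="q7">Phil Collins</h1><p>Some biography text</p></div>'
print(find_path(page, "Phil Collins"))
```

Whatever class names the site generates this week, the scraper only needs one page with known content to learn the current path and apply it to every other page.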

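Ajax - a little bit harder at the start, but many times it speeds up the scraping process :) - why?

When analyzing the requests and responses, I simply set up my own proxy server (written in Perl) and point my Firefox at it. Of course, because it is my own proxy, it is completely hidden - the target server sees it as a regular browser (so, no X-Forwarded-For and similar headers). Based on the proxy logs, it is mostly possible to determine the "logic" of the ajax requests, e.g. I can skip most of the HTML scraping and just use the well-structured ajax responses (mostly in JSON format).

So, ajax doesn't help much...

Somewhat more complicated are pages which use a lot of packed JavaScript functions.

Here it is possible to use two basic methods:

- unpack and understand the JS and create a scraper which follows the JavaScript logic (the hard way)
- or (what I prefer to use myself) just use Mozilla with Mozrepl for scraping. I.e. the real scraping is done in a full-featured, JavaScript-enabled browser, which is programmed to click the right elements and simply grab the "decoded" responses directly from the browser window.

Such scraping is slow (the scraping is done as in a regular browser), but it is

- very easy to set up and use
- and nearly impossible to counter :)
- and the "slowness" is needed anyway to get around "blocking of rapid requests from the same IP"

User-Agent based filtering doesn't help at all. Any serious data-miner will set it to some correct value in his scraper.

Requiring login - doesn't help. The simplest way to beat it (without any analysis and/or scripting of the login protocol) is just to log into the site as a regular user, using Mozilla, and then run the Mozrepl-based scraper...

Remember, requiring login helps against anonymous bots, but doesn't help against someone who wants to scrape your data. He just registers himself on your site as a regular user.

Using frames isn't very effective either. This is used by many live movie services and it is not very hard to beat. The frames are simply more HTML/JavaScript pages that need to be analyzed... If the data is worth the trouble, the data-miner will do the required analysis.

IP-based limiting isn't effective at all - there are too many public proxy servers, and there is also TOR... :) It doesn't slow down the scraping (for someone who really wants your data).

Very hard to scrape is data hidden in images (e.g. simply converting the data into images server-side). Employing "tesseract" (OCR) helps many times, but honestly, the data must be worth the trouble for the scraper (which many times it isn't).

On the other hand, your users will hate you for this. I myself (even when not scraping) hate websites which don't allow copying the page content to the clipboard (because the information is in images, or (the silly ones) because they try to bind some custom JavaScript event to the right click). :)

The hardest are sites which use Java applets or Flash, where the applet itself uses secure https requests internally. But think twice - how happy will your iPhone users be... ;) Therefore, currently very few sites use them. I myself block all Flash content in my browser (in regular browsing sessions) and never use sites which depend on Flash.

Your mileage may vary, so you can try this method - just remember that you will probably lose some of your users. Also remember, some SWF files are decompilable. ;)

Captcha (the good ones, like reCaptcha) helps a lot, but your users will hate you... - just imagine how your users will love you when they need to solve some captcha on every page showing information about the music artists.

There is probably no need to continue - you already get the picture.

Now what you should do:

Remember: it is nearly impossible to hide your data if, on the other hand, you want to publish it (in a friendly way) to your regular users.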

So,

- make your data easily accessible - by some API
  - this allows easy data access
  - e.g. it offloads your server from scraping - good for you
- set up the right usage rights (e.g. the source must be cited)
- remember, a lot of data isn't copyrightable, and it is hard to protect
- add some fake data (as you have already done) and use legal tools
  - as others have already said, send a "cease and desist letter"
  - other legal actions (suing and the like) are probably too costly and hard to win (especially against non-US sites)

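Think twice before you try to use some technical barriers.

Rather than trying to block the data-miners, just put more effort into your website's usability. Your users will love you. The time (and energy) invested in technical barriers usually isn't worth it - better to spend the time making an even better website...

Also, data thieves aren't like normal thieves.

If you buy an inexpensive home alarm and add a warning "this house is connected to the police", many thieves will not even try to break in. Because one wrong move, and the thief goes to jail...

So, you invest only a few bucks, but the thief invests and risks a lot.

But the data thief has no such risks. Just the opposite - if you make one wrong move (e.g. if you introduce some BUG as a result of the technical barriers), you will lose your users. If the scraping bot doesn't work the first time, nothing happens - the data-miner will just try another approach and/or debug the script.

In this case, you need to invest much more, and the scraper invests much less.

Just think about where you want to invest your time & energy...

PS: English isn't my native language, so please forgive my broken English...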

2016-02-03 00:26:09

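Provide an XML API to access your data, in a manner that is simple to use. If people want your data, they'll get it; you might as well go all out.

This way you can provide a subset of functionality in an effective manner, ensuring that, at the very least, the scrapers won't guzzle up HTTP requests and massive amounts of bandwidth.

Then all you have to do is convince the people who want your data to use the API. ;)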

2010-07-01 21:01:50
