I've accepted an answer, but sadly, I believe we're stuck with our original worst case scenario: CAPTCHA everyone on purchase attempts of the crap. Short explanation: caching / web farms make it impossible to track hits, and any workaround (sending a non-cached web-beacon, writing to a unified table, etc.) slows the site down worse than the bots would. There is likely some pricey hardware from Cisco or the like that can help at a high level, but it's hard to justify the cost if CAPTCHA-ing everyone is an alternative. I'll attempt a fuller explanation later, and I'll clean this up for future searchers (though others are welcome to try, as it's community wiki).
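This is about the crap sales on woot.com. I'm the president of Woot Workshop, the subsidiary of Woot that does the design, writes the product descriptions, podcasts, and blog posts, and moderates the forums. I work with CSS/HTML and am barely familiar with other technologies. I work closely with the developers and have talked through all of the answers here (and many other ideas we've had).
So we're back to scanning IPs, which a) is fairly useless in this age of cloud networking and spambot zombies and b) catches too many innocents, given the number of businesses that come from one IP address (not to mention the issues with non-static-IP ISPs and the potential performance hit of trying to track this).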
The user experience sucks for humans, as they have to decipher CAPTCHA, pick out the cat, or solve a math problem.
If the perceived benefit is high enough, and the crowd large enough, some group will find their way around any tweak, leading to an arms race. (This is especially true the simpler the tweak is; hidden 'comments' form, re-arranging the form elements, mis-labeling them, hidden 'gotcha' text all will work once and then need to be changed to fight targeting this specific form.)
Even if the scripters can't 'solve' your tweak, it doesn't prevent them from slamming your front page and then sounding an alarm for the scripter to fill out the order manually. Given they get the advantage from solving [a], they will likely still win [b] since they'll be the first humans reaching the order page. Additionally, 1. still happens, causing server errors and decreased performance for everyone.
What if Woot were to intentionally decouple the queuing process after the first screen, and feed every session from that point into a sequence of fixed-minimum-time steps? The second screen wouldn't even be presented until 30 seconds had passed; after it was submitted, same for the following screens. I bet wooters would have no problem if they were told that, after the first screen, they would wait in a queue (which is already true) that would spread the load over time in a way that should take no longer than before, be more robust, and help weed out the bots. At this point you can throw in some of the bot speedbumps listed above (subtle variations in DOM objects, etc.) Just the benefit from the perception that Woot is a little more in control of things would help.
If a much higher proportion of the BOC initial hits could segue into a bot-unfriendlier non-time-critical process on their first hit (or close to it), rather than retrying, then real people who get past that point would have more confidence. For sure it would be less hostile than the current situation. It might cut down on the background-noise-ambient-bot-rate that's going on all the time even under normal Woot-Off circumstances. And the bots would lay off the main page and sit in the queue with each other (and everyone else) where they have no advantage.
Hmmm... The concept "apartment-threaded" comes to mind. I wonder if the pattern is approximately useful?
A useful core concept here is being able, after the first screen, to track accumulated total time in queue and be able to adjust to standard. As a bot-mitigation strategy, you would have a little bit of flexibility to maybe fudge the very earliest sessions by maybe 5-10 seconds; doing so would probably be undetectable, but would result in a richer non-bot purchase mix. I'm sure you have statistics to help evaluate stuff like this after the fact.
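To make the fixed-minimum-time idea a bit more concrete, here is a minimal sketch of the bookkeeping, not a description of how Woot's checkout actually works. `QueueSession`, `MIN_STEP_SECONDS`, and the `jitter` field are all names I made up for illustration; the 30-second value is just the figure used above, and `jitter` stands in for the optional 5-10 second fudge on the earliest sessions.

```python
from dataclasses import dataclass

MIN_STEP_SECONDS = 30.0  # assumed value, taken from the "30 seconds" example


@dataclass
class QueueSession:
    """Hypothetical per-session record for the post-first-screen queue."""
    entered_at: float    # time the first screen was submitted
    step: int = 1        # last screen this session has completed
    jitter: float = 0.0  # optional extra delay for the very earliest sessions

    def earliest_time_for(self, next_step: int) -> float:
        # Eligibility is based on accumulated time since entering the queue,
        # so a slow screen does not push the later screens back any further.
        return self.entered_at + self.jitter + (next_step - 1) * MIN_STEP_SECONDS

    def try_advance(self, now: float) -> bool:
        """Serve the next screen only once enough total time has accumulated."""
        if now >= self.earliest_time_for(self.step + 1):
            self.step += 1
            return True
        return False
```

Because the check uses total time in queue rather than time since the last screen, a bot that submits instantly gains nothing over a human who takes 20 seconds per screen; both reach screen N at the same earliest moment.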
Just for fun, you could (at least for one wootoff) put together your own bot that combines the best features you've seen, and then hand it out to everyone the day before. Then at least everyone would be equally armed. (Then duck ... incoming ...)
No matter what, you will have to do some IP based throttling to thwart the 'bot slamming'. Since it seems important to you to allow unauthenticated (non-logged-in) visitors to get the special offers, you only have IPs to go by initially, and although they're not perfect, they do work against single-IP bots. Botnets are a different beast, but I'll come back to those. For now, we will do some simple throttling to beat rapid-fire single-IP bots.
The performance hit is negligible if you run the IP check before all other processing, use a proxy server for the throttling logic, and store the IPs in a memcached lookup-optimized tree structure.
With rapid-fire single-IP bots throttled, we still have to address slow single-IP bots, i.e. bots that are specifically tweaked to 'fly under the radar' by spacing requests slightly further apart than the throttling prevents.
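As a rough illustration of the throttling check itself, here is a small sliding-window limiter keyed by IP. It is only a sketch: the in-memory dict stands in for whatever shared store (memcached or otherwise) the proxy would really use, and `IpThrottle`, `max_hits`, and `window_seconds` are hypothetical names and knobs, not anything from an existing library.

```python
from collections import defaultdict, deque


class IpThrottle:
    """Allow at most `max_hits` requests per IP within a sliding time window."""

    def __init__(self, max_hits: int, window_seconds: float):
        self.max_hits = max_hits
        self.window = window_seconds
        # ip -> timestamps of recently allowed requests (stand-in for a shared store)
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_hits:
            return False  # over the limit: reject before any other processing
        hits.append(now)
        return True
```

With `IpThrottle(max_hits=3, window_seconds=1.0)`, a fourth request from the same IP inside one second is refused, while a different IP is unaffected.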
To instantly render slow single-IP bots useless, simply use the strategy suggested by abelenky: serve 10-minute-old cached pages to all IPs that have been spotted in the last 24 hours (or so). That way, every IP gets one 'chance' per day/hour/week (depending on the period you choose), and there will be no visible annoyance to real users who are just hitting 'reload', except that they don't win the offer.
The beauty of this measure is that it also thwarts 'alarm bots', as long as they don't originate from a botnet.
(I know you would probably prefer it if real users were allowed to refresh over and over, but there is no way to tell a refresh-spamming human from a request-spamming bot without a CAPTCHA or similar.)
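The stale-page rule could be sketched like this. It is one possible reading of "one chance per period": the first hit from an IP gets the fresh page, and every later hit within the period gets the old cached copy. `StalePageGate` and its parameters are hypothetical, and the plain dict again stands in for a shared store.

```python
class StalePageGate:
    """Serve the fresh front page once per IP per period, the stale copy otherwise."""

    def __init__(self, period_seconds: float = 24 * 3600):
        self.period = period_seconds
        self._fresh_served_at = {}  # ip -> time this IP last received the fresh page

    def page_for(self, ip: str, now: float, fresh_page: str, stale_page: str) -> str:
        served = self._fresh_served_at.get(ip)
        if served is not None and now - served < self.period:
            # Already spotted in this period: hand back the old cached page.
            return stale_page
        # First visit in this period: the IP uses up its one 'chance'.
        self._fresh_served_at[ip] = now
        return fresh_page
```

Note that the timestamp is only updated when the fresh page is served, so a bot that keeps polling does not lock itself out forever; it just gets one fresh view per period like everyone else.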
You are right that CAPTCHAs hurt the user experience and should be avoided. However, in _one_ situation they can be your best friend: If you've designed a very restrictive system to thwart bots, that - because of its restrictiveness - also catches a number of false positives; then a CAPTCHA served as a last resort will allow those real users who get caught to slip by your throttling (thus avoiding annoying DoS situations).
The sweet spot, of course, is when ALL the bots get caught in your net, while extremely few real users get bothered by the CAPTCHA.
If you, when serving up the 10-minute-old cached pages, also offer an alternative, optional, CAPTCHA-verified 'front page refresher', then humans who really want to keep refreshing, can still do so without getting the old cached page, but at the cost of having to solve a CAPTCHA for each refresh. That is an annoyance, but an optional one just for the die-hard users, who tend to be more forgiving because they know they're gaming the system to improve their chances, and that improved chances don't come free.
Christopher Mahan had an idea that I rather liked, but I would put a different spin on it. Every time you are preparing a new offer, prepare two other 'offers' as well, that no human would pick, like a 12mm wingnut for $20. When the offer appears on the front page, put all three 'offers' in the same picture, with numbers corresponding to each offer. When the user/bot actually goes on to order the item, they will have to pick (a radio button) which offer they want, and since most bots would merely be guessing, in two out of three cases, the bots would be buying worthless junk.
Naturally, this doesn't address 'alarm bots', and there is a (slim) chance that someone could build a bot that was able to pick the correct item. However, the risk of accidentally buying junk should make scripters turn entirely from the fully automated bots.
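To put numbers on the "two out of three" claim, here is the arithmetic for a bot that guesses uniformly among the real offer and the decoys. The function names are made up for illustration, and the uniform-guessing assumption is mine; a bot that can read the picture is a different story.

```python
from fractions import Fraction


def junk_probability(decoys: int) -> Fraction:
    """Chance that a uniformly guessing bot picks a decoy on a single order."""
    return Fraction(decoys, decoys + 1)


def all_real_probability(decoys: int, orders: int) -> Fraction:
    """Chance that a guessing bot picks the real offer on every one of `orders` orders."""
    return Fraction(1, decoys + 1) ** orders
```

With two decoys, `junk_probability(2)` is 2/3, and the chance of a guessing bot getting three orders in a row right is `all_real_probability(2, 3)`, i.e. 1/27.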
Okay............ I've now spent most of my evening thinking about this, trying different approaches.... global delays.... cookie-based tokens.. queued serving... 'stranger throttling'.... And it just doesn't work. It doesn't. I realized the main reason why you hadn't accepted any answer yet was that no one had proposed a way to thwart a distributed/zombie net/botnet attack.... so I really wanted to crack it. I believe I cracked the botnet problem for authentication in a different thread, so I had high hopes for your problem as well. But my approach doesn't translate to this. You only have IPs to go by, and a large enough botnet doesn't reveal itself in any analysis based on IP addresses.