reCAPTCHA, the system I use to keep spam out of the comments, is probably one of the most popular CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) out there. And for a very good reason: it draws its source words only from texts that current optical character recognition (OCR) technology is unable to read; therefore, no spam bot should be able to read them, especially after reCAPTCHA applies some extra distortion to render them absolutely non-machine-readable. But what do we do when OCR technology advances to the point where it gets as good as humans at reading text? As technology for reading words improves, it seems likely that within the next decade or two, the level of distortion necessary to render a word unreadable by machines will also make it illegible to humans. So what next?
One class is image-recognition CAPTCHA: you present the user with ten distorted images (ten, to prevent random guessing by bots) and ask them which ones contain a cat, or which ones have been rotated upside-down, or which ones are people. This is essentially a generalization of text-based CAPTCHAs, but it has several problems. First and foremost, you need a large source of images to show the user. This is one of the huge advantages of text-based CAPTCHAs: they can be procedurally generated. If the image database for a CAPTCHA service is small, then it’ll be passed around by spam bots; since recognizing whether two images are the same is a fairly solved problem, all they have to do is answer your question for each of the images once. (The distortion’s purpose is to make image comparison harder in case spammers do get a hold of your database, not to make it impossible.) One way to build the database would be to browse Flickr for photos tagged with an object and assume that each such photo contains that object, but then you run into copyright issues, and you’re essentially relying on the assumption that nobody will tag a photo ‘cat’ just because it has a kitten in the distant background.
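To illustrate why a small image database is fragile, here is a minimal sketch of an "average hash" comparison. It assumes the images have already been reduced to small grayscale grids (real tools resize and convert first), and the function names, sample grids, and the `max_diff` threshold are all made up for illustration. It is not a claim about what any particular spam bot does.

```python
def average_hash(pixels):
    """Return one bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Count the positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def same_image(a, b, max_diff=2):
    """Treat two grids as the same image if their hashes differ in few bits.

    max_diff is an arbitrary illustrative threshold, not a tuned value.
    """
    return hamming(average_hash(a), average_hash(b)) <= max_diff


# A tiny 4x4 grayscale "image": dark on the left, bright on the right.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [10, 20, 200, 210],
    [15, 25, 205, 215],
]

# The same image with a brightness shift and a little per-pixel noise.
distorted = [[p + 12 + (i % 2) for i, p in enumerate(row)] for row in original]

# An unrelated checkerboard pattern.
different = [
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
]

print(same_image(original, distorted))
print(same_image(original, different))
```

Mild distortion like a brightness shift leaves the hash unchanged, which is why the distortion applied to a leaked database has to be much more aggressive than this to slow a bot down.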
One other idea that I’ve seen a couple of sites use is knowledge-based, relying on the fact that machines can’t yet parse natural language. Such a CAPTCHA asks a question like “what is 2 plus 2?”. The fundamental problem I see with this is that, again, you’re going to have a very small repertoire of questions unless they are generated by a computer rather than written by hand. And whatever question-generating algorithm you use could just be reverse-engineered to extract the content of each question, which could then be passed to Google or Wolfram Alpha to get the answer. Unlike images, there’s no way to ‘distort’ a question.
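As a rough sketch of how little it takes to undo a templated question, the snippet below answers questions of the “what is 2 plus 2?” kind. The word lists and the `solve` function are hypothetical and cover only one imagined template; a real attacker would tailor this to whatever generator a site actually uses.

```python
import operator
import re

# Hypothetical vocabulary for one imagined question template.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}
OPERATIONS = {
    "plus": operator.add,
    "minus": operator.sub,
    "times": operator.mul,
}


def solve(question):
    """Return the answer to a simple two-operand arithmetic question, or None."""
    tokens = re.findall(r"[a-z]+|\d+", question.lower())
    numbers = []
    operation = None
    for token in tokens:
        if token.isdigit():
            numbers.append(int(token))
        elif token in NUMBER_WORDS:
            numbers.append(NUMBER_WORDS[token])
        elif token in OPERATIONS and operation is None:
            operation = OPERATIONS[token]
    if operation is None or len(numbers) != 2:
        return None  # Not a question this template understands.
    return operation(numbers[0], numbers[1])


print(solve("What is 2 plus 2?"))
print(solve("What is seven minus three?"))
print(solve("What colour is the sky?"))
```

Questions outside the template simply return `None`, so the bot can refresh until it gets one it recognizes.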
A third possibility, orthogonal to trying to tell real people from computers, is to look at the content of the message, rather than require the message sender to pass some arbitrary test. This is the approach Akismet (which comes by default on WordPress) uses, and is similar to the way e-mail clients detect spam. This has the downside of having a higher false positive rate than CAPTCHA-based methods. A short comment saying ‘Hey, I read your article and liked it; check out this link’ can be either legitimate or spam, and determining which one it is would require knowing the contents of the link. So your spam filter would have to visit links posted by users, which is obviously highly undesirable behavior. However, it does have the advantage of not relying on some problem being ‘hard’ to solve, and it also removes the (admittedly small) barrier to commenting that CAPTCHAs produce.
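For a feel of what content-based filtering looks like, here is a toy naive Bayes scorer with add-one smoothing, the kind of technique commonly associated with e-mail spam filters. It is not Akismet's algorithm, and the class name, training comments, and labels are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class TinySpamFilter:
    """Toy naive Bayes classifier; assumes both classes have training data."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(tokenize(text))
        self.docs[label] += 1

    def spam_log_odds(self, text):
        """Positive means 'looks like spam', negative means 'looks legitimate'."""
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        totals = {label: sum(c.values()) for label, c in self.counts.items()}
        score = math.log(self.docs["spam"] / self.docs["ham"])
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out a class.
            p_spam = (self.counts["spam"][word] + 1) / (totals["spam"] + len(vocab))
            p_ham = (self.counts["ham"][word] + 1) / (totals["ham"] + len(vocab))
            score += math.log(p_spam / p_ham)
        return score


spam_filter = TinySpamFilter()
spam_filter.train("cheap viagra buy now", "spam")
spam_filter.train("buy cheap pills online now", "spam")
spam_filter.train("great article thanks for writing", "ham")
spam_filter.train("I liked your article on captchas", "ham")

print(spam_filter.spam_log_odds("buy cheap viagra"))
print(spam_filter.spam_log_odds("thanks for the article"))
print(spam_filter.spam_log_odds("I read your article and liked it; check out this link"))
```

The third comment is the ambiguous case from the paragraph above: the words alone say nothing about where the link goes, so whatever score it gets reflects only how friendly the surrounding text sounds.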
For now, reCAPTCHA will remain good enough; it’s easy to solve, and the word combinations that I can’t easily read can be dismissed with a click of the refresh button. And since I have very low traffic, I can afford to have an e-mail sent to me for every comment I get here; if it does wind up being spam (apparently, either reCAPTCHA isn’t completely impervious to computer solving or there’s some sweatshop worker whose job is to spam sites with cheap Viagra ads) I can just delete it.
Karl
/ March 16, 2010
Take a look at:
http://nedbatchelder.com/text/stopbots.html
Ken
/ March 21, 2010
A relatively simple extension to reCAPTCHA would be to present the failed OCR text in the context of the surrounding words. Then the human can read, say, the entire sentence or paragraph to figure out what the word is in context, using our brain’s capability for higher-level semantic reasoning. With a sentence or paragraph of context, a human will be able to figure out extremely illegible words that even a sophisticated OCR program will likely be unable to solve.
Now, let’s see if I can solve this CAPTCHA to allow me to post a comment.
Geoffrey
/ April 19, 2010
One of the tricks about reCAPTCHA is that it can’t actually figure out what the right answer should be. Therefore, it gives you two words, one of which it knows the OCR for and one it doesn’t. You’re assumed human if you get the known word right; if two respondents give the same reading for the unknown word, then it’s sent back to the book-digitizing people. This does mean there’s a little bit of an attack in that if you’re good at OCR you can get the known word and just guess randomly on the unknown one. Depending on how often words are reused, you might also be able to play games with running essentially a reCAPTCHA proxy.
But it is interesting for the general problem of captcha design that reCAPTCHA doesn’t need to know any solutions for its captchas past bootstrapping. The other one, of course, is to present a problem that’s easier to verify than to solve, in the spirit of NP.
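The two-word scheme described in this comment can be sketched roughly as follows. This is a simplification under stated assumptions: the class name, the `votes_needed` parameter, and the sample words are hypothetical, and here the unknown word is always paired with the same control word, which a real service would presumably vary.

```python
from collections import Counter


class TwoWordChallenge:
    """Pairs a control word (answer known) with an unknown word (answer not known)."""

    def __init__(self, control_answer, votes_needed=2):
        self.control_answer = control_answer.lower()
        self.votes_needed = votes_needed  # Agreeing readings required to accept a word.
        self.unknown_votes = Counter()

    def submit(self, control_guess, unknown_guess):
        """Return True if the respondent is assumed human.

        Only the control word is checked. The reading of the unknown word is
        recorded as a vote, but cannot be verified at submission time.
        """
        if control_guess.strip().lower() != self.control_answer:
            return False
        self.unknown_votes[unknown_guess.strip().lower()] += 1
        return True

    def resolved_unknown(self):
        """Return the agreed reading of the unknown word, or None if no consensus yet."""
        if not self.unknown_votes:
            return None
        word, votes = self.unknown_votes.most_common(1)[0]
        return word if votes >= self.votes_needed else None


challenge = TwoWordChallenge("morning")
print(challenge.submit("morning", "upon"))   # Control word right: passes.
print(challenge.resolved_unknown())          # Only one vote so far.
print(challenge.submit("evening", "upon"))   # Control word wrong: rejected, vote ignored.
print(challenge.submit("Morning", "upon"))   # Second matching reading.
print(challenge.resolved_unknown())
```

Note that `submit("morning", "anything")` would also pass, which is exactly the attack mentioned above: a bot that can read the control word is free to guess on the unknown one.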