reCAPTCHA, the system I use to keep spam out of the comments, is probably one of the most popular CAPTCHAs (Completely Automated Method[s] to Tell Computers and Humans Apart) out there. And for a very good reason: it draws its source words only from texts that current optical character recognition (OCR) technology is unable to read; therefore, no spam bot should be able to read them, especially after reCAPTCHA applies some extra distortion to render it absolutely non-machine-readable. But what do we do when the state of OCR technology advances to the point where they get as good as humans at reading text? As technology for reading words improves, it seems likely that within the next decade or two, the level of distortion necessary to render it unreadable by machines will also make it illegible to humans. So what next?
One class is image-recognition CAPTCHA: you present the user with ten distorted images (to prevent random guessing by bots) and ask them which ones contain a cat, or which ones have been rotated upside-down, or which ones are people. This is essentially a generalization of text-based CAPTCHAS, but it has several problems. First and foremost, you need a large source of images to show the user. This is one of the huge advantages of text-based CAPTCHAs: they can be procedurally generated. If the image database for a CAPTCHA service is small, then it’ll be passed around by spam bots; since recognizing whether two images are the same is a fairly solved problem, all they have to do is answer your question for each of the images once. (The distortion’s purpose is to make image comparison harder in case spammers do get a hold of your database, not to make it impossible). One method would be to browse Flickr for photos tagged with an object and assume that each such photo contains an object, but you’re running into copyright issues as well as essentially relying on the fact that someone won’t tag a photo ‘cat’ just because it has a kitten in the distant background.
One other idea that I’ve seen a couple sites use is knowledge-based, relying on the fact that machines can’t yet parse natural language. So it asks a question like “what is 2 plus 2?”. The fundamental problem I see with this is that, again, you’re going to have a very small repertoire of questions; a CAPTCHA has to be able to be generated by a computer. Not to mention the fact that whatever question-generating algorithm you use could just be reverse-engineered to extract content, then passed to Google or Wolfram Alpha to get the answer. Unlike images, there’s no way to ‘distort’ a question.
A third possibility, orthogonal to trying to tell real people from computers, is to look at the content of the message, rather than require the message sender to pass some arbitrary test. This is the approach Akismet (which comes by default on WordPress) uses, and is similar to the way e-mail clients detect spam. This has the downside of having a higher false positive rate than CAPTCHA-based methods. A short comment saying ‘Hey, I read your article and liked it; check out this link’ can either be legitimate or spam, and determining which one it is would require knowing the contents of the link. So your CAPTCHA system would have to visit links posted by users, which is obviously highly undesirable behavior. However, it does have the advantage of not relying on some problem being ‘hard’ to solve, and it also removes the (admittedly small) barrier to commenting that CAPTCHAs produce.
For now, reCAPTCHA will remain good enough; it’s easy to solve, and the word combinations that I can’t easily read can be dismissed with a click of the refresh button. And since I have very low traffic, I can afford to have an e-mail sent to me for every comment I get here; if it does wind up being spam (apparently, either reCAPTCHA isn’t completely impervious to computer solving or there’s some sweatshop worker whose job is to spam sites with cheap Viagra ads) I can just delete it.