Digitizing Books with the Help of Millions of People around the World

I came across this video podcast episode of PBS’s Wired Science with Luis von Ahn, the guy who came up with “Captcha”, those fuzzy-looking words that you sometimes have to enter on websites as proof that you are human.

“Captcha” was developed to prevent the automation (usually via scripting) of a process, such as the creation of a user account. “Captcha” images are designed to be unreadable by computers, so applying OCR (optical character recognition) technology to identify the letters within the images does not work either. Keeping OCR from being effective is actually the reason why the images always look so funny and distorted.

Luis von Ahn said that it takes a human 10 seconds on average to solve a captcha and that humans solve about 200 million captcha puzzles every day. That is a lot of time being wasted, because a captcha serves no purpose other than keeping cheaters and spammers out.
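To put those two figures into perspective, here is a quick back-of-the-envelope calculation. It only uses the numbers quoted in the video (10 seconds per captcha, about 200 million captchas per day); the function and variable names are made up for this illustration.

```python
# Figures quoted by Luis von Ahn in the video.
SECONDS_PER_CAPTCHA = 10
CAPTCHAS_PER_DAY = 200_000_000


def hours_per_day(seconds_each, solved_per_day):
    """Total human time spent per day, converted from seconds to hours."""
    return seconds_each * solved_per_day / 3600


total_hours = hours_per_day(SECONDS_PER_CAPTCHA, CAPTCHAS_PER_DAY)
print(f"{total_hours:,.0f} hours per day")  # prints "555,556 hours per day"
```

So the combined effort adds up to more than half a million hours of human time every single day.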

So, after being responsible for making mankind waste hundreds of thousands of hours every day, he set out to make up for it by developing something that takes the time spent on captchas and puts it to good use.

You have probably heard about the various book digitizing projects around the world that convert old printed books into digital format and make their content accessible online to users everywhere, for example the Google Books Library Project. The project even got the attention of the New York Times, which reported on the effort in great detail. Another big project is the Universal Digital Library Initiative, which is supported by Microsoft among other major players in the industry.

The problem these and similar projects face is words that are not very legible (older books in particular have this problem, because time has taken its toll on the paper and ink). Today’s OCR technology is unable to determine clearly what some of these words are. Where computers fail, humans are able to solve the problem. Well, the very problem the digital library projects have is solved about 200 million times every day by people around the world who solve “captcha” puzzles.

Now people help with the digital conversion of books by identifying words in scanned books that the computer was not able to identify. To keep incorrectly solved captcha puzzles from falsifying the results, for example through scripted attempts by cheaters and spammers to get around the captcha check, the system shows the user two words: one whose correct reading is already known, and one from a book whose reading is not yet known. If the user types the known word correctly, the system can assume that a human and not a computer was solving the captcha, and that the answer given for the unknown word is probably correct as well.
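The two-word check described above could be sketched roughly like this. This is only my illustration of the idea, not the actual implementation behind the project: all names are hypothetical, and the rule that an unknown word is accepted once at least two users agree on it is my own assumption (the video did not say how answers are combined).

```python
from collections import Counter, defaultdict

# Guesses collected per unknown word image: image id -> Counter of typed words.
votes = defaultdict(Counter)


def normalize(text):
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


def submit(control_answer, unknown_id, typed_control, typed_unknown):
    """Pass the user only if the known (control) word was typed correctly.

    The guess for the unknown word is recorded only for users who pass,
    so scripted or careless answers do not pollute the results.
    """
    if normalize(typed_control) != normalize(control_answer):
        return False
    votes[unknown_id][normalize(typed_unknown)] += 1
    return True


def accepted_transcription(unknown_id, min_votes=2):
    """Return the most common guess once min_votes users agree (assumed rule)."""
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= min_votes else None


# Example: two users pass the control word and agree on the unknown one.
submit("morning", "scan-17", "morning", "upon")   # passes
submit("morning", "scan-17", "mornin", "xyz")     # fails, guess is ignored
submit("morning", "scan-17", "Morning ", "Upon")  # passes
print(accepted_transcription("scan-17"))          # prints "upon"
```

The important design point is that the unknown word never decides whether the user passes; only the known word does.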

The video did not mention how you can become part of this initiative, but it seems that Luis is working for Google on their Library project and on another one that I will talk about in a second. If you are the developer of a captcha solution, I suggest contacting Google and asking them how you can become part of this initiative and help with the digitization of books.

Another problem computers have is that they do a terrible job of identifying objects and subjects in photographs. The technology has improved a lot in recent years, but we are still far away from having computers understand and recognize the content of images as humans do, which goes way beyond simplistic properties like shapes and colors. The computer might be able to tell you that there is a human face in the image (add the parameter “&imgtype=face” to a query at Google Image Search to return only images with human faces in them, for example). It may be capable of telling you whether it is an adult or a child, or male or female, but it is hard, if not impossible, for it to determine the mood a person is expressing or the person’s name and origin.

Image recognition technology has advanced a lot, but it has rarely been developed far beyond an experimental stage, like the human faces filter by Google or the object recognition of the visual search engine Like.com by Riya.

Luis introduced a game called “The ESP Game”, in which humans use tags to describe images that the game shows them. To turn simple tagging into a game that humans play without being paid for it, they added a component that not only creates a reason to play, but also solves the problem of cheaters and pranksters attaching obscure or fake tags to an image.

They show the same image to two people, who have to describe that image via tags at the same time. Whenever both people use the same word or phrase to describe the image, they get points and improve their ranking. Words and phrases that do not match are not counted. If two different people who do not know each other and cannot communicate with each other use the same word to describe what they see, this word is much more likely to be accurate and common. For the same reasons, it is also hard to skew the results.
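The matching rule is simple enough to sketch in a few lines. Again, this is just an illustration of the principle as described in the video, not the game's real code; the function names and the 10 points per match are assumptions of mine.

```python
def normalize_tag(tag):
    """Lowercase a tag and collapse whitespace so 'Dog ' and 'dog' match."""
    return " ".join(tag.lower().split())


def matching_tags(player_a_tags, player_b_tags):
    """Return only the tags that both players entered for the same image."""
    a = {normalize_tag(t) for t in player_a_tags if t.strip()}
    b = {normalize_tag(t) for t in player_b_tags if t.strip()}
    return a & b


def score_round(player_a_tags, player_b_tags, points_per_match=10):
    """Award points for agreed tags only; unmatched tags are not counted."""
    agreed = matching_tags(player_a_tags, player_b_tags)
    return sorted(agreed), len(agreed) * points_per_match


# Example: the players agree on two tags, so only those two are kept.
tags, points = score_round(["Dog", "grass", "ball"], ["ball", "dog ", "park"])
print(tags, points)  # prints "['ball', 'dog'] 20"
```

A prankster typing a fake tag gains nothing, because the tag is only kept if a stranger typed the very same thing.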

He also mentioned some other interesting figures that I did not include in my post. Check out the video recording for yourself. It is only 7 minutes long. I am sure you will enjoy it as much as I did.

With all the talk and discussion about human-powered search engine projects like Mahalo and Wikia Search, people sometimes forget that you need to make things searchable and findable first, before you can set out to create a service that performs searches across this content.

Cheers!

Carsten Cumbrowski
Carsten has been an internet marketing strategy consultant, entrepreneur, blogger and performance marketer (aka affiliate) since early 2001. Besides making his living as an affiliate who sells other people’s stuff for a commission, Carsten does some consulting to help small and big companies with their internet marketing strategy and goals. Because he likes to teach (and talk), he created a free resources website at Cumbrowski.com for fellow internet marketers like him and other marketing professionals and amateurs.