[D66] Literature is not Data: Against Digital Humanities

Sun Oct 28 18:41:14 CET 2012

http://lareviewofbooks.org/article.php?type&id=1040&fulltext=1&media

Literature is not Data: Against Digital Humanities by Stephen Marche
October 28th, 2012

BIG DATA IS COMING for your books. It’s already come for everything 
else. All human endeavor has by now generated its own monadic mass of 
data, and through these vast accumulations of ciphers the robots now 
endlessly scour for significance much the way cockroaches scour for 
nutrition in the enormous bat dung piles hiding in Bornean caves. The 
recent Automate This, a smart book with a stupid title, offers a 
fascinatingly general look at the new algorithmic culture: 60 percent of 
trades on the stock market today take place with virtually no human 
oversight. Artificial intelligence has already changed health care and 
pop music, baseball, electoral politics and several aspects of the law. 
And now, as an afterthought to an afterthought, the algorithms have 
arrived at literature, like an army which, having conquered Italy, turns 
its attention to San Marino.

The story of how literature became data in the first place is a story of 
several, related intellectual failures.

In 2002, on a Friday, Larry Page began to end the book as we know it. 
Using the 20 percent of his time that Google then allotted to its 
engineers for personal projects, Page and Vice-President Marissa Mayer 
developed a machine for turning books into data. The original was a 
crude plywood affair with simple clamps, a metronome, a scanner, and a 
blade for cutting the books into sheets. The process took 40 minutes. 
The first refinement Page developed was a means of digitizing books 
without cutting off their spines — a gesture of tender-hearted 
sentimentality towards print. The great disbinding was to be 
metaphorical rather than literal. A team of Page-supervised engineers 
developed an infrared camera that took into account the curvature of 
pages around the spine. They resurrected a long dormant piece of Optical 
Character Recognition software from Hewlett-Packard and released it to 
the open-source community for improvements. They then crowd-sourced 
textual correction at a minimal cost through a brilliant program called 
reCAPTCHA, which employs an anti-bot service to get users to read and 
type in words the Optical Character Recognition software can’t 
recognize. (A miracle of cleverness: everyone who has entered a security 
identification has also, without knowing it, aided the perfection of the 
world’s texts.) Soon after, the world’s five largest libraries signed on 
as partners. And, more or less just like that, literature became data.

..continued..