
Google Pushes Open Source OCR

Zonk posted about 7 years ago | from the google-has-taken-all-knowledge-to-be-its-provice dept.

Google 212

SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"


212 comments

Sign of times to come? (3, Interesting)

Anonymous Coward | about 7 years ago | (#18679067)

Now that they will be able to recognize and tag our images, I wonder if Picasa will finally get increased storage. Google will be able to deliver targeted ads based on our pictures.

Already done - 1GB and counting! (2, Informative)

Anonymous Coward | about 7 years ago | (#18680571)

Where have you been lately? Picasaweb.google.com has already increased from a mere 250MB to 1GB+ and counting!

Build instructions are outdated (2, Informative)

What the Frag (951841) | about 7 years ago | (#18679073)

Use this line to check out ocropus:

svn co http://ocropus.googlecode.com/svn/trunk/ ocropus

More build info; Ubuntu Feisty (4, Informative)

drinkypoo (153816) | about 7 years ago | (#18681295)

Here's some other information you might need/want to build this software; note that I am on Ubuntu Feisty.

To build tesseract-ocr you must install autoconf.

If you are smart you will figure out a way to omit ocropus/data-test-pages from your checkout. Do you really want to use their pages to test? No! You want to use your own data.

I built with gcc 4.1.2, YMMV. Some people have reported errors trying to just compile tesseract.

To build ocropus I ended up installing jam, libaspell-dev and libtiff-dev.

Thank Goog (-1, Redundant)

Anonymous Coward | about 7 years ago | (#18679087)

Thank Goog very much.

The goal of the project (4, Insightful)

user24 (854467) | about 7 years ago | (#18679089)

The goal of the project is to stop the damn email image spammers.

among other things, sure, but it's got to be a high priority for google.

Re:The goal of the project (3, Insightful)

sammy baby (14909) | about 7 years ago | (#18679253)

And of course, as a side effect they'll probably wind up with a lovely distributed system for solving captcha. ;)

Small price if it helps email spam. (4, Insightful)

Kadin2048 (468275) | about 7 years ago | (#18679373)

And of course, as a side effect they'll probably wind up with a lovely distributed system for solving captcha. ;)

True, but CAPTCHAs always seemed like a bit of an inelegant hack anyway. First, they're horrible from a disabled-access standpoint, and second they're really not all that effective against a concerted enemy when there's a lot of money on the line. Spammers can just pay a few kids in some Third World country to sit there all day and solve CAPTCHAs if they want to.

Since message boards, which are the major users of CAPTCHAs, are practically by design little fiefdoms, I don't think they're nearly as hard to patrol as a common-carrier network like email. The solution to message-board spam is to either institute a moderator-delay (for small blogs and boards), or simply make enough admins with IP-ban powers so that the second someone starts spamming, they get banned and the spam gets deleted. Lameness filters working on the same principles as email spam-filters are probably helpful, too.

captcha's (2, Insightful)

mithras invictus (1084169) | about 7 years ago | (#18680355)

Captchas are not restricted to images of letters. For example, you could ask people to solve a regular text question (this would also fix accessibility issues).

Re:Small price if it helps email spam. (4, Interesting)

Pxtl (151020) | about 7 years ago | (#18680535)

You've obviously never fought off a bb spammer. They don't use one or two accounts to spam one or two messages - they inundate the board from a long list of IPs. Even without spamming messages, they create hordes of accounts just for the pagerank provided by the links within their personal account pages. Plus, admin-approval-delays degrade quality for the user. It creates a huge headache all around to handle maintaining banlists and cleaning out garbage.

Captchas are by far the better solution.

The problem is that, long term, they will eventually be cracked. I'm imagining the ultimate solution will only be to allow users with email addresses from "approved" major ISPs.

Well... (4, Informative)

Shawn is an Asshole (845769) | about 7 years ago | (#18680543)

If you're sick of image spam, you can do what I did. Add the OpenProtect [openprotect.com] channel to SpamAssassin and then add these lines to your SpamAssassin config:

required_hits 5
score SARE_GIF_ATTACH 5


I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day.

Re:Well... (2, Interesting)

drinkypoo (153816) | about 7 years ago | (#18680949)

All I want is a plugin for thunderbird that will detect when a message is written in a language other than English and mark it spam if it is. No one ever sends me an email in anything other than English except for spam. I have no fucking idea why this has not yet been implemented. I get absolute shitloads of russian spam.
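A crude approximation of that filter is easy to sketch. The Python below is an illustration only, not a Thunderbird plugin: the function names and the 0.5 threshold are made up, and it detects the script (Latin vs. anything else) rather than the language, so Spanish or German mail would still pass while Cyrillic mail would be flagged.

```python
import unicodedata

def latin_fraction(text):
    """Fraction of alphabetic characters whose Unicode name starts with LATIN."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 1.0  # nothing to judge; treat as not suspicious
    latin = sum(1 for ch in letters
                if unicodedata.name(ch, "").startswith("LATIN"))
    return latin / len(letters)

def looks_foreign(text, threshold=0.5):
    """Flag text in which fewer than `threshold` of the letters are Latin."""
    return latin_fraction(text) < threshold

print(looks_foreign("Купите дешёвые часы сегодня"))
print(looks_foreign("Meeting moved to 3pm, see you there"))
```

Telling English apart from other Latin-script languages would need word lists or character n-gram statistics on top of this.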

Re:Well... (1)

ConceptJunkie (24823) | about 7 years ago | (#18681161)

That and the ability to import and export from your message store. I love Thunderbird and have been using it exclusively since about version 0.4, but simply cannot believe some of the functionality it lacks.

Re:Well... (2, Informative)

Auntie Virus (772950) | about 7 years ago | (#18681011)

"required_hits 5 score SARE_GIF_ATTACH 5 I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day."

Brilliant. You just automatically blocked messages from companies whose PHBs insist on attaching a .gif of the company logo. SARE_GIF_ATTACH is ok with a lower score, adding to other scoring parameters. What you REALLY want for image spam is the FuzzyOCR plugin.
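The arithmetic behind that objection: SpamAssassin adds up the scores of all rules a message hits and marks it as spam once the total reaches the required threshold. The Python below is a rough sketch of that sum, not SpamAssassin's own code; `is_spam` and `HYPOTHETICAL_OTHER_RULE` are invented for illustration, and the 2.5 scores are arbitrary.

```python
def is_spam(hit_rules, scores, required_hits=5.0):
    """Sum the scores of the rules a message hit and compare to the threshold."""
    total = sum(scores.get(rule, 0.0) for rule in hit_rules)
    return total >= required_hits

# Scores as in the quoted config: a GIF attachment alone reaches the threshold.
aggressive = {"SARE_GIF_ATTACH": 5.0, "HYPOTHETICAL_OTHER_RULE": 2.5}
# A lower score (arbitrary value) only tips the balance together with other hits.
moderate = {"SARE_GIF_ATTACH": 2.5, "HYPOTHETICAL_OTHER_RULE": 2.5}

logo_mail = ["SARE_GIF_ATTACH"]                               # just a logo .gif
image_spam = ["SARE_GIF_ATTACH", "HYPOTHETICAL_OTHER_RULE"]   # GIF plus another hit

print(is_spam(logo_mail, aggressive))   # True: the logo mail is blocked
print(is_spam(logo_mail, moderate))     # False: the logo mail gets through
print(is_spam(image_spam, moderate))    # True: combined hits reach the threshold
```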

Re:Small price if it helps email spam. (1)

mypalmike (454265) | about 7 years ago | (#18680837)

they're really not all that effective against a concerted enemy when there's a lot of money on the line... Since message boards, which are the major users of CAPTCHAs, are practically by design little fiefdoms, I don't think they're nearly as hard to patrol as a common-carrier network like email.

Little fiefdom bulletin boards don't generally have a lot of money on the line, which is why captcha works so well. The cost of paying human captcha solvers is high enough that it's fairly rare to see spam on a captcha-protected site. The effect of captchas on my own tiny personal fiefdom brought spam down from a significant daily annoyance to zero. I simply don't get spam on my site anymore.

Re:The goal of the project (2, Informative)

UbuntuDupe (970646) | about 7 years ago | (#18679851)

Isn't that the same principle behind PGP? Correct me if I'm wrong (and I freely admit encryption is not my area of expertise), but to crack (in reasonable time) PGP-encrypted data, you have to solve a problem no one in the world has been able to solve yet (quick solution for a certain class of problems). Similarly, if captchas get to the point where you need a major theoretical advance to beat them, thanks to wide use of OCR-type programs, that would either foil all spammers, or cause them to solve a mathematically/AI-significant problem.

I'm wrong, eh?

Re:The goal of the project (3, Interesting)

ajs (35943) | about 7 years ago | (#18680307)

The goal of the project is to stop the damn email image spammers.

among other things, sure, but it's got to be a high priority for google.
I don't buy either one. I think the goal of the project is to get sued.

Google knows darned well that there are tons of patents around OCR, so they're not going to roll their own internally. Instead, they'll open source the project and make as much noise about enhancing the state of the art through collaboration as possible. Then, when they get sued (and they will), they can bring this case front-and-center in the debate surrounding patent reform, citing this as the textbook example of how the promotion of the sciences and useful arts (as specified by the Constitution) is hobbled by current patent law surrounding software.

I could be wrong, but they'd be stupid to think that high-profile, open source OCR software won't be challenged by those who hold the patents....

Re:The goal of the project (4, Insightful)

slashbob22 (918040) | about 7 years ago | (#18680745)

Ok, I'll bite and play DA for a bit.

Why Google wouldn't want this:
1) Google's own patents on search techniques, distributing advertisement, etc. Yes, I understand that you are talking about the need for certain limits to patent law - but this could as easily hurt Google as help them.
2) They are already challenging the copyright laws with GooTube, I don't see any sense in tackling both at the same time.

IANIGHQ (In Google's HQ) but I don't see the value of getting sued at this point in time. Besides, if Google is doing this under appropriate conditions there shouldn't be concern of suits - but I suppose their Chinese plagiarism case doesn't support this point.

// End DA

Re:The goal of the project (1)

w_mute (40724) | about 7 years ago | (#18680317)

>The goal of the project is to stop the damn email image spammers.
>
> among other things, sure, but it's got to be a high priority for google.

OCR's application to image spam is useful but limited without lots of tweaking. OCR is geared toward dealing with readable text. Image spammers are already doing font swapping, kerning tweaks, applying image rotation to subsections, random backgrounds, etc. Warping text similar to CAPTCHAs isn't that much further along.

Also, OCR is much more computationally expensive than other text/image recognition methods. Anti-CAPTCHA algorithms can be used to segment and recognize warped text, but it's much more problematic (and expensive) than plain OCR. OCR may be an OK last resort, but there are other less finicky, faster methods that work on most image spam.

-Greg

So much for captcha (1, Redundant)

Red Flayer (890720) | about 7 years ago | (#18679111)

Oh great. I, for one, do not welcome the increase in message board spamming.

Re:So much for captcha (3, Informative)

cyphercell (843398) | about 7 years ago | (#18679673)

Captcha (warped text) will probably remain for a long time. This OCR has more practical uses when applied to text that is meant to be legible.

Re:So much for captcha (1)

Gregory Cox (997625) | about 7 years ago | (#18679681)

Captchas are a good thing, but taking a long-term view, isn't it a better thing that technology is progressing? I'm sure the positive uses of OCR outweigh the problem of spamming, and it'd be a shame if no-one wanted to work on OCR just because of captchas.

The beginning of the end? (3, Insightful)

Iphtashu Fitz (263795) | about 7 years ago | (#18679113)

... for Captchas [wikipedia.org] ? If Google is pushing OCR I could see it eventually becoming good enough to parse at least some types of captchas.

Re:The beginning of the end? (1)

X0563511 (793323) | about 7 years ago | (#18679541)

When the computer can parse a Captcha better than a human can... it means that we need to move on to something else. What that else is (do NOT mention kitten-captcha) I don't know.

What's wrong with kitten captcha? (2, Insightful)

brunes69 (86786) | about 7 years ago | (#18679753)

When we can make a computer that can tell the difference between a kitten and an adult cat (or hell even another furred mammal) with any kind of accuracy, I think the LEAST of your problems at that point is coming up with captchas. You should be more worried about how you're going to escape from Skynet.

Re:The beginning of the end? (1)

walt-sjc (145127) | about 7 years ago | (#18679769)

Then we need to move to simple logic questions such as "what is the sum of 5 and 4?" or "how many inches in a foot", etc.

Re:The beginning of the end? (4, Insightful)

user24 (854467) | about 7 years ago | (#18679955)

Please, please, please, everybody, stop claiming that "what is 2+2?" is a hard AI question. I could code something in an hour to defeat most of this sort of question, and give me a week and a budget and I'll write something to get past 95% of these types of questions.

If the text is parsable, it takes nothing to google it.
I mean, those two examples you give; just slap it into google and screenscrape it. So you're going to need harder questions than that.

So the next generation of crapchas will ask "what color is the sky".
Go and take a glance at ultraHal or another relatively advanced NLP AI; a large knowledgebase is not hard to construct. When it doesn't know, it guesses. If it gets it right, then the knowledgebase increases by one fact.

So then, what, you have to ask "Given that all bleeps are blue, and blank is a bleep, is blank blue?"
Not only is that also easily computationally solved, but also a lot of people aren't going to be able to answer (smartass questions about stopping spam and idiots aside)

So *then* I suppose you have to ask "In the first mathematical antinomy, does Kant conclusively prove both that there can have been no beginning to time and that there must have been a beginning to time?"
and give the user a 255 character textarea to put their answer in.

So... please, text question based captchas are DOOMED TO FAIL. stop thinking that they could work. They can't.
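To give a sense of how little it takes, here is a minimal Python sketch that answers the simplest arithmetic-style questions mentioned in this thread. It is an illustration only: the patterns and the `answer` function are made up for this example, and it makes no attempt at unit conversions or general-knowledge lookups.

```python
import re

WORD_OPS = {
    "sum": lambda a, b: a + b,
    "difference": lambda a, b: a - b,
    "product": lambda a, b: a * b,
}

def answer(question):
    """Return the answer to a simple arithmetic question, or None if unrecognised."""
    q = question.lower()
    # Worded form, e.g. "what is the sum of 5 and 4?"
    m = re.search(r"(sum|difference|product) of (\d+) and (\d+)", q)
    if m:
        return WORD_OPS[m.group(1)](int(m.group(2)), int(m.group(3)))
    # Symbolic form, e.g. "what is 2+2?"
    m = re.search(r"(\d+)\s*([+\-*])\s*(\d+)", q)
    if m:
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        return {"+": a + b, "-": a - b, "*": a * b}[op]
    return None

print(answer("what is the sum of 5 and 4?"))  # 9
print(answer("what is 2+2?"))                 # 4
print(answer("what color is the sky"))        # None
```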

Re:The beginning of the end? (1)

asninn (1071320) | about 7 years ago | (#18680665)

So... please, text question based captchas are DOOMED TO FAIL. stop thinking that they could work. They can't.

While I agree with you in principle, I think your definition of "work" with regard to captchas is flawed. Captchas don't need to be 100% undefeatable; they just need to work well enough so that the time/energy/computing power/manpower/money needed to solve them en gros makes sure doing so isn't worth it to the spammer.

Your claim that they're useless because they don't work perfectly makes as much sense as saying that postage paid on snail mail letters doesn't make sense since it's possible for postal spammers to just shell out the amount necessary to send a letter, anyway (especially given that they'll receive bulk discounts). Still, in reality, I hardly ever get postal spam; the rate is probably less than 1 unsolicited letter per month, while my email spam, on the other hand, is measured in thousands of mails per day.

I'd argue that the fact that bulletin boards, blogs etc. are generally pretty spam-free proves that captchas ARE working - not perfectly, but well enough.

Re:The beginning of the end? (0)

el_gordo101 (643167) | about 7 years ago | (#18680841)

So *then* I suppose you have to ask "In the first mathematical antinomy, does Kant conclusively prove both that there can have been no beginning to time and that there must have been a beginning to time?" and give the user a 255 character textarea to put their answer in.
Why 255 characters? Wouldn't a couple of radio buttons suffice?

O Yes | O No

Now the bots will get in 50% of the time, even if they are only taking a guess. I think a captcha would work better.
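The 50% figure, and how it compares with blind guessing on an image captcha, can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the 36-symbol alphabet and the six-character length are assumptions chosen for illustration, and it models pure random guessing, not OCR.

```python
import string

def guess_success(num_options, attempts=1):
    """Probability that at least one of `attempts` uniform random guesses is right."""
    p = 1.0 / num_options
    return 1.0 - (1.0 - p) ** attempts

yes_no = guess_success(2)

alphabet = string.ascii_lowercase + string.digits  # 36 symbols (assumed)
six_char = guess_success(len(alphabet) ** 6)       # 6-character captcha (assumed)

print(f"yes/no radio buttons: {yes_no:.0%}")
print(f"6-character captcha:  {six_char:.2e}")
print(f"yes/no, 3 attempts:   {guess_success(2, attempts=3):.1%}")
```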

Re:The beginning of the end? (5, Informative)

lawpoop (604919) | about 7 years ago | (#18679619)

I doubt it.

Part of what makes OCR work is that it assumes that the text was written to communicate meaning. It has regular characters in alignment, forming common words and abbreviations in more or less grammatical sentences with close-to-proper punctuation.

A good captcha has a non-sense string of characters in various cases, all skewed and distorted, with extra geometric elements obscuring the characters. This renders unavailable somewhere around half of the clues that an OCR uses.

Re:The beginning of the end? (2, Funny)

thePowerOfGrayskull (905905) | about 7 years ago | (#18680587)

A good captcha has a non-sense string of characters in various cases, all skewed and distorted, with extra geometric elements obscuring the characters. This renders unavailable somewhere around half of the clues that an OCR uses.
Hell, if we obscure it enough it can be practically buried under geometric noise; and once we do that, we've solved the AC problem on /.!

Re:The beginning of the end? (2, Informative)

Iphtashu Fitz (263795) | about 7 years ago | (#18681191)

Part of what makes OCR work is that it assumes that the text was written to communicate meaning.

As computing power continues to grow that kind of assumption is less and less important. Ten years ago I worked for a speech recognition company that developed tools similar to what Google is now using for their 800-GOOG-411 search line. Back then the state of the art was to carefully guide what a caller was likely to say, and to rely on massive dictionaries to help with the recognition. Now, 10 years later, with more research and more powerful computers, it's much easier to develop more free-formed speech recognition systems that can accurately recognize arbitrary strings of numbers/letters. (account numbers, phone numbers, etc) Given that the capabilities of speech recognition systems have grown so much I'd be willing to bet that OCR capabilities have grown in similar ways.

Re:The beginning of the end? (1)

dotoole (881696) | about 7 years ago | (#18679689)

OCR is already at the stage where simply distorting letters isn't sufficient anymore. The real trick now is to generate the letters and background clutter in such a way that the software cannot segment the image into separate characters.

Re:The beginning of the end? (1)

dimeglio (456244) | about 7 years ago | (#18680153)

I must be a computer/cyborg. I have trouble reading 50% of captchas (on first try). Can't wait to get this enhancement.

Re:The beginning of the end? (1)

Matt Perry (793115) | about 7 years ago | (#18680179)

If Google is pushing OCR I could see it eventually becoming good enough to parse at least some types of captchas.
I hope so. I'm looking forward to a Firefox extension that'll let me decode a captcha so I don't have to figure it out. Some of the captchas I've seen lately are so confusing, with warped text, noise, and fonts that make zero and oh look identical, that I have to go through two or three of them before I can get an entry correct.

the presidential papers (4, Funny)

User 956 (568564) | about 7 years ago | (#18679157)

The goal of the project is to ... deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis

So, will it work on documents written in crayon? It would be a tragic loss for Dubya's presidential documents to get lost in the sands of time. On the scale of the library of Alexandria. No, seriously.

Re:the presidential papers (2, Funny)

adickerson0 (884626) | about 7 years ago | (#18679931)

No need, Dubya turns his work in to the Secretary of Education so she can put a gold star on each page. While this may seem like a childish system it is really the only sort of oversight he would agree to. The original plan was to scan everything and place an RFID Gold Star on each page for tracking, that way the Executive Branch's work could be preserved, however this led to a few problems. Apparently the Sec of Ed got too busy and turned the work over to an intern. The intern decided to not only put a Gold Star on each page but actually started grading the papers. This led to the "Inbasion of Iwack Plans" scandal. Dubya's plan, which included a drawing of himself in a jet holding an American flag, was given an "A+ Good Work" stamp. This of course was given back to the President who decided that if it was an "A+" then there is no way his plan would fail.

Finally... (3, Interesting)

Searinox (833879) | about 7 years ago | (#18679159)

An OCR system that runs on Linux. I've been waiting for quite some time for something like this.

Re:Finally... (1)

Cocoronixx (551128) | about 7 years ago | (#18679517)

GOCR http://jocr.sourceforge.net/ [sourceforge.net]
Tesseract-OCR http://sourceforge.net/projects/tesseract-ocr [sourceforge.net]

Re:Finally... (0)

Anonymous Coward | about 7 years ago | (#18679725)

I run ABBYY FineReader 5.0 on my slackware laptop using WINE [winehq.com] ...

Re:Finally... (4, Insightful)

Feyr (449684) | about 7 years ago | (#18679979)

have you tried gocr? it's nice as a random number generator, but beside that... it's pretty much garbage

Re:Finally... (1, Offtopic)

smchris (464899) | about 7 years ago | (#18680875)

Good one. Yeah, GOCR is crap.

As someone who was consistently getting high 90s% recognition on OmniPage with preservation of basic layout and images for work in 1996, linux is a non-starter and pathetically WAY, WAY behind in this area. It isn't even a GIMP vs. Photoshop ("Yeah, well GIMP is just different and 'special'!") argument. I'll look at a couple of the other suggestions here but I had basically just given up and said this is a linux blind spot.

So if Google _also_ wants to use it to torture kittens, or whatever, I'd have to say, "Well, let's weigh the pros and cons before we make a hasty judgement."

Re:Finally... (0)

Anonymous Coward | about 7 years ago | (#18681355)

When you say "linux" here, you actually mean "open source" or "free software". Most high quality commercial OCR packages run on Linux.

Re:Finally... (2, Informative)

stilbon (69689) | about 7 years ago | (#18680029)

Vividata OCR Shop XTR

http://www.vividata.com/index.html [vividata.com]

It's not free software, but it works extremely well.

Vividata is just OK (1)

hirschma (187820) | about 7 years ago | (#18681227)

It actually has many issues, and it is lagging behind the Windows version that Nuance produces. My company owns several licenses.

It is, however, the best OCR on Linux right now. I'm looking forward to having an alternative.

Re:Finally... (0)

Anonymous Coward | about 7 years ago | (#18680097)

Google probably wants it to help with their aid in putting the Library of Congress online. [slashdot.org] I was surprised not to see this linked as a recent related article.

Captchas (1)

Radon360 (951529) | about 7 years ago | (#18679189)

So will something like this eventually render captchas used as a security/anti-spam measure obsolete?

Not like something wasn't bound to eventually come out to counter that idea, anyway.

Very cool. (5, Insightful)

Kadin2048 (468275) | about 7 years ago | (#18679305)

I've been hoping that someone with deep pockets (Google, IBM, Sun) would take on this area for a while.

There is a major need for an OSS OCR package, and right now the field is pretty bare. There's GOCR [sourceforge.net] , and a commercial offering called OCRShop, and, at least as far as I've run across, that's about it. Nothing really on par with Omnipage, or other commercial packages for other platforms.

I think there are some really neat applications for OCR that have never really been investigated, because it's so expensive to build that capability into other products. A free OCR engine that really worked could lead to some very neat book-scanning applications, just for starters. I don't think that there's really any integrated packages around for helping people scan books and manuscripts. (Right now you have to photograph the pages, keep them organized, then OCR them and proofread the text against the images. Bit of a nightmare.) I'd love to see a free application for libraries that let a user batch scan (via a digital camera -- let's not get into what I think of SANE and scanners generally) a book, and then provided a nice interface for proofreading the OCRed text against the original image.

Something like that could have a huge social impact. There are a lot of libraries where I'm sure they'd love to scan some of their out-of-copyright assets and provide them to patrons in a digital form, but it's just too technically complicated. An easy-to-use program that let the proofreading be done by nontechnical users (maybe remotely, as long as we're dreaming) could vastly increase the volume of digital materials available.

Re:Very cool. (1)

Gregory Cox (997625) | about 7 years ago | (#18679827)

Yes, this is good... but I'd be even happier if there were more definite plans about support for other scripts (like Japanese?) But that's probably a lot of work for something that's not a top priority. Maybe in a few years...

Re:Very cool. (1)

CodeShark (17400) | about 7 years ago | (#18679957)

Actually for Japanese, etc. this would still be a godsend, because most OCR work comes from typed sources, and for "typed" Chinese characters (also used in Japan, etc.) there are only a limited number of fonts in use. That would presumably lead to a great amount of work on identifying those font libraries and characters that cause problems in the OCR and the gradual elimination of those problems by inclusion in the recognition files.


Essentially, what this would open up would be a process of converting the vast library of pre-'Net hand-typed texts to scanning via OCR, and being open sourced-- it doesn't necessarily have to run on Google's machines.

Re:Very cool. (1)

CastrTroy (595695) | about 7 years ago | (#18680141)

You aren't going to get a good shot of the document with a digital camera for a lot of reasons. First of all, the lighting is uneven. Then there's the problem with the lens distorting things. Then there's problems with getting it to focus properly. I'm sure lots of people would love to point out other problems with using a digital camera to capture documents. It may work fine for a human looking at the picture, but it's going to make the job of the OCR program a lot harder. Even things like dust can throw off "good" OCR programs.

Actually it's done all the time. (1)

Kadin2048 (468275) | about 7 years ago | (#18680903)

Actually lots of people do book "scanning" with digital cameras. In fact, you can sometimes get much better results off of a book using a digital camera than you can by pressing it down against the bed of a flatbed scanner (because if the page wasn't typeset with a wide gutter, you'll start to distort some of the letters as you get close to the binding). Plus, it's a lot easier on the books, which is important when you're talking about books that are all going to be 75 years old and some much, much older.

The best way to use a flatbed scanner to scan books is actually to run them through a guillotine first, chop off the binding, and then scan the loose pages; this produces good results but it's not something most libraries are going to be willing to do.

Here's a commercial non-destructive book scanner [kirtas-tech.com] which uses cameras. Basically, what you do, is you have two cameras, each pointing at one side of the book. You use lights held at an angle to the paper with reflectors and diffusers so that it's evenly lit, and then you just flip the pages and fire the cameras once per page turn. You can build a setup to do this (with manual page turning) for a few hundred bucks plus the cost of the cameras. The auto page-turning is really what drives up the cost.

People were photographing text using cameras for a lot longer than photocopiers have been around. The standard way of reproducing photographs was by using a copy stand [marietta.edu] and a fixed camera in order to make an internegative, and prior to the introduction of all-digital typesetting, almost all offset printing was done by photographing a paste-up of the final product with a special camera, which produced the plate used in the press.

So in short, although you're correct that just holding a digital camera over a book and clicking the shutter wouldn't give great results, the issues surrounding lighting, lens distortion, and focus are all solved problems. (And if you really wanted to be slick about things like barrel distortion and dust, you could start each run by photographing a standard grey field and a checkerboard, and use that to remove dust and correct for distortion digitally, rather than mechanically/optically.)
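The grey-field half of that idea is usually called flat-field correction: whatever darkens a pixel in the shot of the uniform grey card (vignetting, dust, uneven lamps) darkens the same pixel in the page shot by the same factor, so dividing one by the other cancels it. A minimal Python sketch on plain nested lists follows. The function name and the toy pixel values are made up for illustration, and it does not cover the checkerboard-based distortion correction.

```python
def flat_field_correct(image, flat):
    """Divide out per-pixel brightness variation measured on a uniform grey card.

    `image` and `flat` are equally sized 2-D lists of pixel intensities.
    """
    flat_values = [v for row in flat for v in row]
    mean_flat = sum(flat_values) / len(flat_values)
    return [
        # max() guards against division by zero on dead pixels
        [pix * mean_flat / max(f, 1e-6) for pix, f in zip(img_row, flat_row)]
        for img_row, flat_row in zip(image, flat)
    ]

# Toy 2x3 example: the right-hand column receives only half the light.
flat = [[200.0, 200.0, 100.0],
        [200.0, 200.0, 100.0]]
# Page shot: blank paper reads 180 in good light, ink reads 40.
page = [[180.0, 40.0, 90.0],
        [180.0, 180.0, 20.0]]

corrected = flat_field_correct(page, flat)
for row in corrected:
    print([round(v, 1) for v in row])
```

After correction, blank paper has the same value in the dim column as in the well-lit ones, which makes a single global threshold workable before OCR.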

Leave the conversion to those skilled at it (1)

kence (24217) | about 7 years ago | (#18680183)

I think the potential of new Google-backed OCR software is pretty high but I'm not certain that your average library would have the manpower and technical know-how to manage a book-to-ebook conversion, Google OCR software or not.

If libraries are interested in getting their out-of-copyright assets into digital form, they really only need contact someone with Distributed Proofreaders [pgdp.net] to get the ball rolling. DPers would take care of the scanning, proofing, formatting, and post-processing of the book on behalf of the library requiring nothing but a temporary loan of the book or manuscript (something the libraries already excel at :)

Re:Very cool. (1)

ZERO1ZERO (948669) | about 7 years ago | (#18680299)

I get the feeling that is basically what this is: http://www.iiri.com/i2s/copibook.htm [iiri.com] it looks a bit home grown, but this is about £25,000 I believe. If you had some kind of framework, a lens with large DOF, and a 6-10MP sensor, you could do this on the cheap I reckon. Oh and some skills in processing the resulting image.

Re:Very cool. (1)

ZERO1ZERO (948669) | about 7 years ago | (#18680343)

Doh! I forgot to mention this thing runs Linux, which is why I was reminded of it in the first place. It's basically a PC hooked up inside the scanner.

Never saw that one specifically (1)

Kadin2048 (468275) | about 7 years ago | (#18681073)

Yeah that's similar to what I was thinking about. Actually, what I was recalling was this thing [kirtas-tech.com] , which seems to pretty clearly use off-the-shelf DSLR cameras (not sure on the lenses though, they're not visible). It probably costs a fortune because of the robotics and vacuum system necessary for the automatic page turning, but I think you could DIY something similar out of two copy stands for a lot less if you were okay with flipping pages.

The one you linked to seems like it would have more distortion of the pages because the cameras aren't being held constantly perpendicular to the page, but maybe it just corrects for that in software afterwards. (It wouldn't be hard, in fact I think all the code you'd need to do it is part of the Panorama Tools / Hugin package.)

What I think is a bigger problem for most libraries isn't the scanning per se, because that at least is a problem that most non-technical people can understand, but it's the storage and document-management that's the issue. Once you have the book scanned, you have a giant pile of JPEGs or TIFF files...unless you're careful about organization, it could become a real mess in a hurry.

So I think the missing piece has to do with getting from raw images to an actual ebook. The hardest problem seems to be in the proofreading step; if you run each image through an OCR program, and then you want to proofread it, you need some way of distributing pages out to proofreaders, and letting each of them have a page of text and the image from that page, side by side. And then managing their edits and checking changes back in, etc. It's nothing really novel -- they're all solved problems in other areas (document management, change management, remote access, web services) -- but I've never seen them combined.

If you had a software package that handled all the document management and proofreading (preferably something that your proofreaders could log into remotely and work, while storing everything centrally), then the hardware required is mostly off-the-shelf. It goes from being a $25,000 grant proposal, to some undergrad's thesis/semester project.
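To make the check-out / check-in idea concrete, here is a rough sketch of the bookkeeping such a package would need. This is not how Distributed Proofreaders or any existing tool works; `Page`, `ProofingQueue` and every field name are invented for illustration, and everything lives in memory instead of on a central server.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Page:
    number: int
    image_path: str   # scanned image, shown side by side with the text
    ocr_text: str     # raw OCR output for this page
    revisions: List[Tuple[str, str]] = field(default_factory=list)  # (proofreader, text)
    checked_out_by: Optional[str] = None


class ProofingQueue:
    """Hands pages out to proofreaders and records their corrections."""

    def __init__(self, pages):
        self.pages = {p.number: p for p in pages}

    def check_out(self, proofreader):
        """Give the lowest-numbered unclaimed, unproofed page to a proofreader."""
        for page in sorted(self.pages.values(), key=lambda p: p.number):
            if page.checked_out_by is None and not page.revisions:
                page.checked_out_by = proofreader
                return page
        return None  # nothing left to hand out

    def check_in(self, number, proofreader, corrected_text):
        """Store a correction; only the person holding the page may submit it."""
        page = self.pages[number]
        if page.checked_out_by != proofreader:
            raise PermissionError("page is not checked out by this proofreader")
        page.revisions.append((proofreader, corrected_text))
        page.checked_out_by = None

    def current_text(self, number):
        """Latest corrected text, or the raw OCR text if nobody has proofed it."""
        page = self.pages[number]
        return page.revisions[-1][1] if page.revisions else page.ocr_text
```

Keeping every revision instead of overwriting the text is what gives you the change-management part: you can see who changed what, and a second proofing round can start from the latest revision.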

Wonderful! (4, Insightful)

jshriverWVU (810740) | about 7 years ago | (#18679461)

This'll be a much needed boost for us Linux users who want to help out Project Gutenberg.

Re:Wonderful! (1, Informative)

Anonymous Coward | about 7 years ago | (#18679971)

This'll be a much needed boost for us Linux users who want to help out Project Gutenberg.

Join the Distributed proofreaders [pgdp.net]
and do any or all of:
1) do some proofreading or formatting of a PG text
or
2) Smooth read a near-finished text looking for overlooked oddities
or
3) help improve DP's processing software. Lots of extra features wanted...
or
4) Get copyright clearance, scan a book and upload to DP's OCR pool
or
5) Run your Windows OCR under WINE like I do...

More details here [pgdp.net]

One thing leads to another... (2, Insightful)

jojoba_oil (1071932) | about 7 years ago | (#18679493)

Okay, so one thing will lead to another and soon Google will be creating technology to recognize non-symbol shapes... How long before I can login to my G-Accounts by smiling at my computer?

Captcha killer? (-1, Redundant)

140Mandak262Jamuna (970587) | about 7 years ago | (#18679533)

What is google trying to do? Develop tools to help automate captcha circumventing bots?

Re:Captcha killer? (0)

Anonymous Coward | about 7 years ago | (#18679825)

Could they be prosecuted under the DMCA for this?

Re:Captcha killer? (0)

Anonymous Coward | about 7 years ago | (#18679993)

yeah that's what they're trying to do you fucking idiot

no rly gize, i no how 2 spel! (0)

Anonymous Coward | about 7 years ago | (#18679639)

And Zonk has taken all editing to be his...provice.

captchas (4, Insightful)

gEvil (beta) (945888) | about 7 years ago | (#18679651)

All you people who are worried about this breaking captchas seem to be missing something--there have been a number of fairly decent OCR packages out there for a long time. The goal of this Google project is to create an open-sourced one that does a good job deciphering HUMAN-READABLE TEXT. Captchas are far from human-readable (the good ones at least), and I seriously doubt this project will help very much in that arena.

Re:captchas (3, Informative)

arrrrg (902404) | about 7 years ago | (#18679977)

Captchas are far from human-readable (the good ones at least), and I seriously doubt this project will help very much in that arena.

Um, the whole point of CAPTCHAS is that they are human-readable (albeit often not easily so) but not machine-readable.

Re:captchas (1)

MoriaOrc (822758) | about 7 years ago | (#18680255)

Captchas are far from human-readable (the good ones at least)
While I've run into not a few captchas that are not human-readable, I would argue that they are not, in fact, the good ones. Good Captchas are human readable, but extremely difficult to solve using automation (this, other OCR software, what have you).

Re:captchas (1)

AeroIllini (726211) | about 7 years ago | (#18680775)

Captchas are far from human-readable (the good ones at least)...
Yeah, that's why they suck.

Some forums, I have to try *four* times to get past the captcha, just to post a message about how libsomething won't compile.

If they really wanted good captchas, they'd need to start using problems that are very easy for humans to solve, but very hard for computers to solve. For example, picking the one photo of a puppy out of a matrix of photos of full-grown dogs.

Computers are currently really bad at recognizing images in photos, but they do a decent job of recognizing text with commercial OCR programs (that ability will only increase when there are some hardcore OSS versions available, such as Google's project). So why are we spending our time mangling the text so that neither computers nor humans can read it, and not focusing on something computers actually are bad at, like recognizing a puppy?

searchable pdfs (4, Interesting)

radarsat1 (786772) | about 7 years ago | (#18679705)

Anyone know of an open source utility that can convert scanned image-based PDF files into searchable PDFs ?
(Extra points if it somehow re-generates the actual file so it looks nice instead of pixelated.)

Perhaps this library could be used to build such an application if none exists...

Re:searchable pdfs (1)

Nasarius (593729) | about 7 years ago | (#18680115)

Not even the best commercial software (ABBYY, OmniPage) can do more than a half-assed job of that. If you want accurate, well-formatted results, expect to do a lot of manual work.

Re:searchable pdfs (1)

gEvil (beta) (945888) | about 7 years ago | (#18680289)

Although Acrobat's OCR engine leaves a bit to be desired, the approach there works pretty well. You can have it create the OCR'd text page layout that uses the original image as an overlay. So, in essence you get a page that looks like the original scanned image, but that lets you highlight/select the text from the background text layer. I'm sure other programs out there can do this, too. None that are OS (to my knowledge), as per the GP's requirements, though.

Re:searchable pdfs (0)

Anonymous Coward | about 7 years ago | (#18680617)

1. run tesseract / gocr / other OCR package to generate the text

2. use a2ps utility to convert to postscript

3. use ghostscript ps2pdf to generate your pdfs
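The three steps above could be scripted roughly like this. It is only a sketch: it assumes `tesseract`, `a2ps` and `ps2pdf` are installed and on the PATH, that tesseract is called as `tesseract <image> <outputbase>` and writes `<outputbase>.txt`, and the function names and file names are made up. Note the result is a PDF of the recognized text only; it will not look like the original scan.

```python
import subprocess
from pathlib import Path


def pipeline_commands(image, workdir="."):
    """Build the three command lines for one scanned page (nothing is run here)."""
    image = Path(image)
    base = Path(workdir) / image.stem  # e.g. out/page001
    return [
        # 1. OCR: tesseract appends .txt to the output base itself
        ["tesseract", str(image), str(base)],
        # 2. text -> PostScript
        ["a2ps", "-o", f"{base}.ps", f"{base}.txt"],
        # 3. PostScript -> PDF via Ghostscript's ps2pdf wrapper
        ["ps2pdf", f"{base}.ps", f"{base}.pdf"],
    ]


def run_pipeline(image, workdir="."):
    """Run the steps in order, stopping at the first one that fails."""
    for cmd in pipeline_commands(image, workdir):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("scans/page001.tif", "out")` would leave `out/page001.pdf` behind if all three tools succeed. a2ps applies its own default page layout and headers, so you may want to tune its options for your setup.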

Language? (4, Interesting)

ceeam (39911) | about 7 years ago | (#18679731)

English only I suppose?

Re:Language? (5, Funny)

fireboy1919 (257783) | about 7 years ago | (#18680083)

Since the official language of the Googleplex is Googlese, and the original project was developed by the US Census bureau - notorious for their use of no languages except Esperanto, it goes without saying (though I'm saying it anyway), that it will read only Klingon.

Remember kids, there are no stupid questions.
Only people who don't RTFA who ask questions.

Re:Language? (1)

xlv (125699) | about 7 years ago | (#18680537)

It looks like the current OCR engine they use, Tesseract OCR, only supports English, as its roadmap includes "support for languages other than English", but from a quick look at the various links, they are developing other engines as well.

Besides, with the research group being based in Germany, you'd assume that German and Latin-based languages will be supported pretty soon...

We are already winners (-1, Troll)

Anonymous Coward | about 7 years ago | (#18679865)

It was a tender, stormy night as Gragurikov shambled down the hill. "Olga!" he shouted, his clear barytone voice seemingly made Sparrows in the UK can reach an approximate size of 1.3 meters in main body diameter for the males, although the majority of this is

BUY C14L15! 4 4LL J00R N33D5 TH47 W3 KN0W Y0U H4V3 L0T5 0F

1-800-LOTS-OF-PILLS

This message brought to you by the Friends of Google

Mathematics? (0)

ObsessiveMathsFreak (773371) | about 7 years ago | (#18679899)

And will it be able to recognise and latexify handwritten mathematics? The world and its mother can do OCR, but I've yet to see an honest attempt at making writing mathematics papers easier.

Re:Mathematics? (1)

frogstar_robot (926792) | about 7 years ago | (#18681017)

I suppose this could be used to build such a beast once it's a bit more fully baked. A good general purpose FOSS OCR is necessary for what you want even if it isn't entirely sufficient.

Re:Mathematics? (1)

nireus (988551) | about 7 years ago | (#18681405)

I wouldn't say so; check http://www.inftyproject.org/ [inftyproject.org]. Their OCR claims 99% success on printed documents (I've tried it, and it's true). And wait a few years: there are some really promising papers out there, and I bet you'll be amazed at the number of people who have been working on this problem since the late '90s. Three years from now I am almost sure you'll be able to enter any kind of math expression by hand using a digitizer (don't ask for handwritten offline OCR just yet, though :( )
Check out this guy as well: http://www.cs.berkeley.edu/~fateman/ [berkeley.edu]. His work is groundbreaking; I hope they will have a solid open-source system in a couple of years.

Open Source Ballot Scanning! (1)

Soong (7225) | about 7 years ago | (#18680049)

Ok, I got excited too early. Actually, ballot scanning is a specialized task and general purpose OCR probably doesn't play much of a part in that, but if any part of it does apply, then this is still awesome.

Adoption? (1)

kurbchekt (890891) | about 7 years ago | (#18680147)

Hopefully, the Linux community will adopt some of this, as some of it can be utilized for accessibility. After perusing some patents from the 1800's, it's clear that Google has made some headway in this department. There were errors in translation (namely K's and R's/P's and B's), but for several documents, things come across as intended.

Re:Adoption? (0)

Anonymous Coward | about 7 years ago | (#18680675)

As usual, the main reason OCR has not taken off in the open source world is patents. Americans have patents essentially on the concept of doing OCR. Now the hard bit of OCR, like most software, is implementing it, not having the idea "wouldn't it be nice if computers could, like, read text?". But in short, if you do all the hard work of implementing OCR, some patent troll will swoop in and claim it - unless, perhaps, there's a giant lump like Google backing you up (but that might _encourage_ the trolls...). Maybe in ten years' time... But don't forget, Americans have quietly started pushing internationally for patent terms renewable beyond the traditional 20-year mark, so they may never expire if they get their way...

The world leader in closed-source OCR is based in and operates out of Russia, partly because they're Russian, but mostly for this reason.

Comics (2, Interesting)

rbanffy (584143) | about 7 years ago | (#18680339)

Will I be able to search my comic strips (downloaded since ever) by keyword?

I would love that!

Sheesh.... (0)

Rick Richardson (87058) | about 7 years ago | (#18680799)

make[3]: Entering directory `/home/rick/tesseract-ocr/wordrec'
if g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../ccstruct -I../ccutil -I../cutil -I../classify -I../image -I../dict -I../viewer -g -O2 -MT tface.o -MD -MP -MF ".deps/tface.Tpo" -c -o tface.o tface.cpp; \
then mv -f ".deps/tface.Tpo" ".deps/tface.Po"; else rm -f ".deps/tface.Tpo"; exit 1; fi
../cutil/globals.h:46: error: previous declaration of 'int optind' with 'C++' linkage
../ccutil/getopt.h:23: error: conflicts with new declaration with 'C' linkage
../cutil/globals.h:47: error: previous declaration of 'char* optarg' with 'C++' linkage
../ccutil/getopt.h:24: error: conflicts with new declaration with 'C' linkage
make[3]: *** [tface.o] Error 1
make[3]: Leaving directory `/home/rick/tesseract-ocr/wordrec'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/rick/tesseract-ocr/wordrec'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/rick/tesseract-ocr'