jpg to text / Image to Text
Converting images to text is known as OCR (Optical Character Recognition). It is very useful when scanning paper documents into your computer and then converting them to an editable format such as MS Word. OCR is often considered a 'solved problem' these days, as OCR engines are now quite good and support a wide range of languages, alphabets and characters.
There are some freeware and open source applications which do a very good job of this. Sometimes you have only an image or screenshot of a text document and need to manipulate the text. In such cases, an OCR application is indispensable.
The best Open Source (Free) software for Optical Character Recognition to date is Tesseract.
- The Tesseract Project Page at Google
- Download Tesseract
- Tesseract page at SourceForge (Migrating to Google)
About Tesseract
Tesseract is a free optical character recognition engine. It was originally developed as proprietary software at Hewlett-Packard between 1985 and 1995. It then languished for ten years without any development, until Hewlett-Packard and UNLV released it as open source in 2005. Tesseract is now being actively developed by Google and is released under the Apache License, Version 2.0.
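As a rough illustration of the jpg-to-text conversion, here is a minimal Python sketch that calls the `tesseract` command-line program. It assumes Tesseract is installed and on your PATH, and that your build can read JPEG/PNG files directly (older releases accepted only TIFF). The helper names `build_tesseract_command` and `image_to_text` are made up for this example and are not part of Tesseract itself.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>.

    Hypothetical helper for this example; "eng" is assumed as the default language.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def image_to_text(image_path, lang="eng"):
    """Run the tesseract CLI on an image file and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        output_base = Path(tmp) / "ocr_result"
        # check=True raises CalledProcessError if tesseract exits with an error.
        subprocess.run(
            build_tesseract_command(image_path, output_base, lang),
            check=True,
            capture_output=True,
        )
        # Tesseract appends ".txt" to the output base name it is given.
        return (Path(tmp) / "ocr_result.txt").read_text(encoding="utf-8")
```

Calling `image_to_text("scan.jpg")` (with a hypothetical file name) should return the recognized text as a string, which you can then edit or save in whatever format you need. Recognition quality depends heavily on the image, so treat the result as a draft to proofread.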
Listing of other OCR applications:
| Name | License | Operating systems | Notes |
|---|---|---|---|
| ExperVision TypeReader & OpenRTK | Commercial | Windows, Mac OS X, Unix, Linux, OS/2 | |
| ABBYY FineReader OCR | Commercial | Windows | For working with localized interfaces, corresponding language support is required. |
| OmniPage | Commercial (Nuance EULA) | Windows, Mac OS | Product of Nuance Communications |
| Readiris | Commercial | Windows, Mac OS | I.R.I.S. Group of Belgium. Asian and Middle Eastern editions. |
| SmartZone (formerly known as Zonal OCR) | Commercial | Windows | SmartZone is the process by which Optical Character Recognition (OCR) applications "read" specifically zoned text from a scanned image. |
| Computhink's ViewWise & AnyDoc | Commercial | Windows | Document Management system |
| CuneiForm | BSD variant | Windows, Linux, BSD, Mac OS X | Enterprise-class system, multi-language, can save text formatting and recognizes complicated tables of any structure |
| CVISION Technologies, Inc. PdfCompressor and Maestro Recognition Server | Commercial | Windows | Fast, accurate, high volume OCR |
| GOCR | GPL | Many (open source) | Early development |
| Microsoft Office Document Imaging | Commercial | Windows, Mac OS X | Microsoft Office has some OCR capabilities built-in. |
| Microsoft Office OneNote 2007 | Commercial | Windows | $99.00 from Microsoft. |
| NovoDynamics VERUS | Commercial? | ? | Specializes in languages of the Middle East |
| Ocrad | GPL | Unix-like, OS/2 | Open Source |
| Brainware | Commercial | Windows | Data extraction and processing of data from documents into any backend system; sample document types include invoices, remittance statements, bills of lading and POs |
| HOCR | GPL | Linux | Hebrew OCR |
| OCRopus | Apache | Linux | Pluggable framework which can use Tesseract. State of the Art OCR for Linux. |
| OOCR | Open Source (GPL) | Windows | Open OCR |
| ReadSoft | Commercial | Windows | Scan, capture and classify business documents such as forms, invoices and POs. |
| Alt-N Technologies' RelayFax Network Fax Manager | Commercial | Windows | Multi-language OCR Plug-in is used to convert faxed pages into editable document formats (doc, pdf, etc.) in many different languages. |
| Scantron Cognition | Commercial | Windows | For working with localized interfaces, corresponding language support is required. |
| SimpleOCR | Freeware and commercial versions | Windows | Free! |
| SmartScore | Commercial | Windows, Mac OS | For musical scores |
| Tesseract | Apache | Windows, Mac OS X, Linux, OS/2 | HP initiative; now under development by Google |