2016年3月1日 星期二

tesseract-ocr info


rails gem

https://github.com/meh/ruby-tesseract-ocr

How to install Tesseract OCR on a Amazon EC2 (Free Tier) Linux Machine for Optical Character Recognition





Free Online OCR Tool



Using Non-Ruby Programs With Ruby





Tesseracting + ImageMagcking

To review, you'll have to install the following software in order to create a Ruby script that optimizes scanned images and uses Tesseract to extract their textual content:
  1. The ImageMagick, a library used for command-line image manipulation
  2. The RMagick gem, which provides a Ruby interface for ImageMagick
  3. Tesseract, an open source OCR program that runs from the command-line
Here's example code. Understanding it requires being familiar with RMagick, but there's not much more to it than that. Here is what the code does:
  1. Opens an image file
  2. Optimizes the image for OCR by straightening it and creating a grayscale version that is saved to disk as a TIFF
  3. Executes tesseract from the command line
  4. Reads the contents of the text file created by tesseract
require 'rubygems'
require 'rmagick'

def tessrack(oimg_name, do_gray=true, keep_temp=true)
   fname = oimg_name.chomp(File.extname(oimg_name))
   
   # create a non-crooked version of the image
   tiff = Magick::Image::read(oimg_name).first.deskew 

   # convert to grayscale if do_gray==true
   tiff = tiff.quantize(256, Magick::GRAYColorspace) if do_gray == true
   
   # create a TIFF version of the file, as Tesseract only accepts TIFFs
   tname = "#{fname}--tesseracted.tif"
   tiff.write(tname){|t| t.depth = 8}
   puts "TR:\t#{tname} created"
    
   # Run tesseract
   tc = "tesseract #{tname} #{fname}"
   puts "TR:\t#{tc}"
   `#{tc}`
   
   File.delete(tname) unless keep_temp==true
   File.open("#{fname}.txt"){|txt| txt.read}
end

txt = tessrack("data-hold/tufte-intro.tif")
puts txt
FAQ

https://github.com/tesseract-ocr/tesseract/wiki/FAQ

沒有留言:

張貼留言