rails gem
https://github.com/meh/ruby-tesseract-ocr
How to install Tesseract OCR on an Amazon EC2 (Free Tier) Linux Machine for Optical Character Recognition
Free Online OCR Tool
Using Non-Ruby Programs With Ruby
Tesseracting + ImageMagicking
To review, you'll have to install the following software in order to create a Ruby script that optimizes scanned images and uses Tesseract to extract their textual content:
- ImageMagick, a library used for command-line image manipulation
- The RMagick gem, which provides a Ruby interface for ImageMagick
- Tesseract, an open source OCR program that runs from the command-line
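Before going further, it can help to confirm that the command-line tools are actually on your PATH. The following is a minimal sketch, not part of the original tutorial: `which` is a hypothetical helper written here for illustration, and the command names `tesseract` and `convert` are assumptions (newer ImageMagick releases install a `magick` command instead). It assumes a Unix-like system.

```ruby
# Returns the full path of an executable found on PATH, or nil if it is missing.
# (Hypothetical helper; assumes a Unix-like system.)
def which(cmd)
  dirs = ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).reject(&:empty?)
  dirs.each do |dir|
    candidate = File.join(dir, cmd)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# The command names below are assumptions; adjust them for your installation.
%w[tesseract convert].each do |tool|
  path = which(tool)
  puts path ? "#{tool}: found at #{path}" : "#{tool}: not found on PATH"
end
```

If a tool is reported as missing, install it (or fix your PATH) before running the OCR script.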
Here's the example script. Understanding it requires some familiarity with RMagick, but there's not much more to it than that. Here is what the code does:
- Opens an image file
- Optimizes the image for OCR by straightening it and creating a grayscale version that is saved to disk as a TIFF
- Executes tesseract from the command line
- Reads the contents of the text file created by tesseract
require 'rmagick'

# Deskews an image, optionally converts it to grayscale, runs Tesseract on it,
# and returns the recognized text.
def tessrack(oimg_name, do_gray = true, keep_temp = true)
  fname = oimg_name.chomp(File.extname(oimg_name))

  # create a non-crooked version of the image
  tiff = Magick::Image.read(oimg_name).first.deskew

  # convert to grayscale if do_gray is true
  tiff = tiff.quantize(256, Magick::GRAYColorspace) if do_gray

  # create a TIFF version of the file, as older Tesseract versions only accept TIFFs
  tname = "#{fname}--tesseracted.tif"
  tiff.write(tname) { |t| t.depth = 8 }
  puts "TR:\t#{tname} created"

  # Run tesseract without a shell, so file names containing spaces are safe;
  # it writes its output to "#{fname}.txt"
  tc = ["tesseract", tname, fname]
  puts "TR:\t#{tc.join(' ')}"
  system(*tc)

  File.delete(tname) unless keep_temp
  File.read("#{fname}.txt")
end

txt = tessrack("data-hold/tufte-intro.tif")
puts txt
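The script above reads Tesseract's result from a text file on disk. As an alternative sketch, Ruby's standard `Open3` library can capture an external program's output directly and report failures. The helper names `run_command` and `ocr_text` are hypothetical, and `ocr_text` assumes a Tesseract version that accepts `stdout` as the output base; check your version's documentation before relying on it.

```ruby
require 'open3'

# Runs an external program without going through a shell and returns its stdout.
# Raises a RuntimeError if the program exits with a non-zero status.
def run_command(*args)
  stdout, stderr, status = Open3.capture3(*args)
  raise "#{args.first} failed: #{stderr.strip}" unless status.success?
  stdout
end

# Hypothetical wrapper: asks tesseract to print the recognized text to stdout
# instead of writing a .txt file (assumes the installed version supports this).
def ocr_text(image_path)
  run_command("tesseract", image_path, "stdout")
end

# Usage (requires tesseract to be installed):
#   puts ocr_text("data-hold/tufte-intro.tif")
```

Passing the arguments as separate strings avoids shell quoting problems, and checking the exit status surfaces errors that backticks would silently ignore.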
https://github.com/tesseract-ocr/tesseract/wiki/FAQ