This project offers some ideas on how to work with Tesseract OCR 4+ and MICR fonts. It is not really about the Python implementation: I have previously developed client-specific Tesseract OCR projects in Java and Node.js, and the basics are language-neutral and can be achieved even with shell scripts. Python currently tops programming language rankings, so a Python version should be useful to a wider audience.
In general, the process can be described as follows:

1) Prepare the input data:
- Convert it to a format that Tesseract accepts (e.g. convert PDF to an image; use lossless image formats if possible).
- Apply additional image transformations to get the best Tesseract OCR results, such as:
  - Rescale (upscale or downscale) the image. The reference points are resolution (300 DPI is optimal) and font size (12 pt is optimal).
  - Crop the region to be recognized, to point Tesseract at the necessary OCR area.
  - Sharpen the image.
  - Invert the image if necessary.
  - Rotate/deskew the image.
  - Remove horizontal and vertical lines, if any.
  - Threshold the image.
  - Fix the illumination of the image.
  - Add or fix borders around the image.
  - Apply other filters on demand, such as noise removal, dilation and erosion.
  - Convert to monochrome black-and-white mode (binarisation).

2) Execute Tesseract.

3) Post-process the results. In a real project this will probably mean working with hOCR or TSV output. These formats contain text geometry and confidence values, which can be used for regression based on the nature of the recognized text. For these reasons, Tesseract version 4.1+ is recommended.

Note: several 1 -> 2 -> 3 workflows can be executed in parallel. It is hard to predict which kind of image preparation in step 1 works best, so a few candidate approaches can be selected based on the data and on test results.

In this project, step 1 is demonstrated with the libvips image processing library. It does not cover 100% of the functionality listed above, but it is fast, cross-platform, language-neutral and effective. Step 2 uses Tesseract 4.1+.
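As an illustration of steps 2 and 3, the following is a minimal sketch, not the code used in this project. It assumes the `tesseract` command-line tool is on the PATH and uses only the Python standard library. The function names `run_tesseract_tsv` and `parse_tsv_words`, the default language `eng` and the confidence threshold of 60 are hypothetical choices made for the example; a MICR-specific traineddata file would be passed through `lang` under whatever name it was installed with.

```python
import csv
import io
import subprocess


def run_tesseract_tsv(image_path, lang="eng", dpi=300):
    """Run the tesseract CLI on one image and return its TSV output as text.

    Assumption: `lang` names an installed traineddata file.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang, "--dpi", str(dpi), "tsv"]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def parse_tsv_words(tsv_text, min_conf=60.0):
    """Return word-level rows of Tesseract TSV output with conf >= min_conf.

    Each result is a dict with the recognized text, its confidence and its
    bounding box (left, top, width, height) in pixels.
    """
    # QUOTE_NONE: recognized text may itself contain quote characters.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; page/block/paragraph/line rows carry conf -1.
        if row.get("level") != "5" or not text:
            continue
        try:
            conf = float(row.get("conf") or -1)
        except ValueError:
            continue
        if conf < min_conf:
            continue
        words.append(
            {
                "text": text,
                "conf": conf,
                "left": int(row["left"]),
                "top": int(row["top"]),
                "width": int(row["width"]),
                "height": int(row["height"]),
            }
        )
    return words
```

A typical use would be `parse_tsv_words(run_tesseract_tsv("check.png"))`, where `check.png` is a placeholder file name. Because each call handles a single image independently, several differently pre-processed versions of the same image can be run in parallel and the result with the highest confidence kept.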
Source code is available on GitHub: https://github.com/vmdocua/tesseract_micr