Pytesseract OCR with Single Digit Recognition and Number-Only Constraints
Getting Tesseract, through Pytesseract, to recognize a single digit and return only numbers takes two settings: a page segmentation mode and a character whitelist. Both are passed in the config string.
Tesseract Page Segmentation Modes
Tesseract offers several page segmentation modes (the --psm option) to handle different text layouts. For single-character recognition, the appropriate mode is 10, which treats the image as a single character.
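As a rough illustration of how the mode might be chosen in code, the sketch below maps a few layout names to their mode numbers. The mode numbers follow Tesseract's --help-psm listing; the names PSM_BY_LAYOUT and psm_option are hypothetical helpers introduced here, not part of Pytesseract.

```python
# Page segmentation modes relevant to short inputs, as listed by
# `tesseract --help-psm`. The dictionary keys are hypothetical labels.
PSM_BY_LAYOUT = {
    "block": 6,   # a single uniform block of text
    "line": 7,    # a single text line
    "word": 8,    # a single word
    "char": 10,   # a single character
}


def psm_option(layout: str) -> str:
    """Return the --psm flag for a named layout (hypothetical helper)."""
    try:
        return f"--psm {PSM_BY_LAYOUT[layout]}"
    except KeyError:
        raise ValueError(f"unknown layout: {layout!r}") from None
```

If the image may hold more than one digit, a mode such as 7 (single line) or 8 (single word) is likely a better fit than 10.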
Character Whitelist
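To limit the recognized characters to numbers, we can use the tessedit_char_whitelist configuration parameter, passed with the -c flag. With 0123456789 as the whitelist, Tesseract restricts its output to these characters. Note that the LSTM engine in Tesseract 4.0 ignores the whitelist; support returned in 4.1, so check the installed version if letters still appear in the output.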
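A minimal sketch of building the whitelist fragment of the config string follows. The function name whitelist_option is an assumption made for illustration. The whitespace check is there because Pytesseract splits the config string into separate command-line arguments, so a space inside the whitelist would break the option apart.

```python
import string


def whitelist_option(chars: str = string.digits) -> str:
    """Build the -c fragment that restricts output to `chars` (hypothetical helper)."""
    # Reject empty or whitespace-containing whitelists: the config string is
    # split on whitespace before being handed to the tesseract command line.
    if not chars or any(c.isspace() for c in chars):
        raise ValueError("whitelist must be non-empty and contain no whitespace")
    return f"-c tessedit_char_whitelist={chars}"
```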
Sample Usage
Here's an example usage of image_to_string with multiple configuration options:
import pytesseract
target = pytesseract.image_to_string(image, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
With --psm 10 and the character whitelist, Tesseract treats the image as a single character and restricts its output to digits. lang selects the language data, and --oem 3 selects the default OCR engine based on what is available. Older Pytesseract releases also accepted boxes=False to disable box output; current releases of image_to_string no longer take that argument, so it is omitted here.
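Because OCR output can still be empty or contain trailing whitespace, it may help to validate the result before using it. The sketch below is one possible wrapper, assuming image is a PIL image or a file path that Pytesseract accepts. The names DIGIT_CONFIG, extract_single_digit, and read_single_digit are hypothetical.

```python
from typing import Optional

# Same options as the sample call: single character, default engine, digits only.
DIGIT_CONFIG = "--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789"


def extract_single_digit(raw: str) -> Optional[str]:
    """Return the digit if `raw` holds exactly one ASCII digit, else None."""
    # Tesseract output usually ends with a newline and sometimes a form feed.
    text = raw.strip()
    if len(text) == 1 and text in "0123456789":
        return text
    return None


def read_single_digit(image) -> Optional[str]:
    """Run OCR on `image` and return one digit, or None if nothing valid was read."""
    # Imported here so the pure helper above can be used without Tesseract installed.
    import pytesseract

    raw = pytesseract.image_to_string(image, lang="eng", config=DIGIT_CONFIG)
    return extract_single_digit(raw)
```

Returning None instead of an empty string makes a failed read explicit to the caller, which can then retry with a preprocessed image or a different segmentation mode.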