PyTesseract - Restricting OCR to a set of characters - python

I'm having trouble with pytesseract. I know that you can restrict tesseract to a specific set of characters using command line arguments :
tesseract input.tif output nobatch digits
I found some ppl saying they can restrict tesseract with the following lines in python :
import tesseract
ocr = tesseract.TessBaseAPI();
ocr.Init(".","eng",tesseract.OEM_TESSERACT_ONLY)
ocr.SetVariable("tessedit_char_whitelist", "0123456789")
But this is for using the tesseract API, and I'm using pytesseract.... Finally I also tried :
print(image_to_string(someimage, config='outputbase digits'))
But this doesn't work as I still get letters in my output. This is weird because I am using the below code and it is working :
print(image_to_string(screen, config='-psm 10'))
PSM stands for PageSegmentationMode and it allows me to parse my imagefile as a single character. I don't understand why this works and the snippet before doesn't when they are both commandline arguments to tesseract...
Can anyone help ? I want to use both options with a custom wordlist (that i created in the config folder of tesseract).

Finally found the solution, if it can ever help anyone... This is from the tesseract help page :
Simplest invocation of tesseract :
tesseract imagename outputbase
I could deduce the proper syntax from that (in fact, everything I found on stack overflow pretty much pointed me in the wrong direction, maybe because of different versions of tesseract). Keep in mind I'm using tesseract 3.05 (win installer available on GitHub) and pytesseract (installed from pip).
image_to_string(someimage, config='digits -psm 7')
As we've seen on the help page, the outputbase argument comes first after the filename and before the other options, this allows the use of both PSM & restricted charset.
All the command line args from tesseract help page can be used this way, in the config variable !!

Related

How can I identify numbers from an image?

I'm writing a script that takes an image and crops the image down to only include the number I want it to recognize. I have that part working fine. The numbers will be either single or double digit.
I've tried using Googles Vision API, which works fine and gives the correct result, but I would rather do it locally to avoid the fees associated with using that service. I'm currently working on using Tesseract OCR https://github.com/tesseract-ocr/tesseract
Example of an image I want it to recognize:
Tesseract is a command line program but I am calling it in a python file that also handles the other parts of my script. I'm not sure if Tesseract is what I want or if there is a better solution to my problem.
sudo tesseract imgName outputFile
The only results I get no matter what image I put through it returns 0 and also shows "Empty page!!"
EDIT:
I am now using pytesseract and I am trying with this code:
print(pytesseract.image_to_string(img))
Nothing is outputted from that so I tried
print(pytesseract.image_to_string(img,config ='--psm 6'))
which outputs random letters it's guessing. Is there a way with tesseract to only look for numbers so my results are narrowed down?

Extremely new user to Python. "No module named request" error while trying code to detect image subdomains in a website to extract them to a folder

I may sound rather uninformed writing this, and unfortunately, my current issue may require a very articulate answer to fix. Therefore, I will try to be specific as possible as to ensure that my problem can be concisely understood.
My apologizes for that- as this Python code was merely obtained from a friend of mine who wrote it for me in order to complete a certain task. I myself had had extremely minimal programming knowledge.
Essentially, I am running Python 3.6 on a Mac. I am trying to work out a code that allows Python to scan through a bulk of a particular website's potentially existent subdomains in order to find possibly-existent JPG images files contained within said subdomains, and download any and all of the resulting found files to a distinct folder on my Desktop.
The Setup-
The code itself, named "download.py" on my computer, is written as follows:
import urllib.request
start = int(input("Start range:100000"))
stop = int(input("End range:199999"))
for i in range(start, stop + 1):
filename = str(i).rjust(6, '0') + ".jpg"
url = "http://website.com/Image_" + filename
urllib.request.urlretrieve(url, filename)
print(url)
(Note that the words "website" and "Image" have been substituted for the actual text included in my code).
Before I proceed, perhaps some explanation would be necessary.
Basically, the website in question contains several subdomains that include .JPG images, however, the majority of the exact URLs that allow the user to access these sub-domains are unknown and are a hidden component of the internal website itself. The format is "website.com/Image_xxxxxx.jpg", wherein x indicates a particular digit, and there are 6 total numerical digits by which only when combined to make a valid code pertain to each of the existent images on the site.
So as you can see, I have calibrated the code so that Python will initially search through number values in the aforementioned URL format from 100000 to 199999, and upon discovering any .JPG images attributed to any of the thousands of link combinations, will directly download all existent uncovered images to a specific folder that resides within my Desktop. The aim would be to start from that specific portion of number values, and upon running the code and fetching any images (or not), continually renumbering the code to work my way through all of the possible 6-digit combos until the operation is ultimately a success.
(Possible Side-Issue- Although I am fairly confident that my friend's code is written in a manner so that Python will only download .JPG files to my computer from images that actually do exist on that particular URL, rather than swarming my folder with blank/bare files from every single one of URL attempts regardless of whether that URL happens to be successful or not, I am admittedly not completely certain. If the latter is the case, informing me of a more suitable edit to my code would be tremendously appreciated.)
The Execution-
Right off the bat, the code experienced a large error. I'll list through the series of steps that led to the creation of said error.
#1- Of course, I first copy-pasted the code into a text document, and saved it as "download.py". I saved it inside of a folder named "Images" where I sought the images to be directly downloaded to. I used BBEdit.
#2- I proceeded, in Terminal, to input the commands "cd Desktop/Images" (to account for the file being held within the "Images" folder on my Desktop), followed by the command "Python download.py" (to actually run the code).
As you can see, the error which I obtained following my attempt to run the code was the ImportError: No module named request. Despite me guessing that the answer to solving this is simple, I can legitimately say I have got such minimal knowledge regarding Python that I've absolutely no idea how to solve this.
Hint: Prior to making the download.py file, the folder, and typing the Terminal code the only interactions I made with Python were downloading the program (3.6) and placing it in my toolbar. I'm not even quite sure if I am required to create any additional scripts/text files, or make any additional downloads before a script like this would work and successfully download the resulting images into my "Images" folder as is my desired goal. If I sincerely missed something integral at any point during this long read, hopefully, someone in here can provide a thoroughly detailed explanation as to how to solve my issue.
Finishing statements for those who've managed to stick along this far:
Thank you. I know this is one hell of a read, and I'm getting more tired as I go along. What I hope to get out of this question is
1.) Obviously, what would constitute a direct solution to the "No module named request" Input Error in Terminal. In other words, what I did wrong there or am missing.
2.) Any other helpful information that you know would assist this code, for example, if there is any integral step or condition I've missed or failed to meet that would ultimately cause the entirety of my code to cease to work. If you do see a fault in this, I only ask of you to be specific, as I've not got much experience in the programming world. After all, I know there is a lot of developers out here that are far more informed and experienced than am I. Thanks.
urllib.request is in Python 3 only. When running 'python' on a Mac, you're running Python 2 by default. Try running executing with python3.
python --version
might need to
brew install python3
urllib.request is a Python 3 construct. Most systems run Python 2 as default and this is what you get when you run simply python.
To install Python 3, go to https://brew.sh/ and follow the instructions to install the Hombrew package manager. Then run
brew install python3
python3 download.py

Training Tesseract OCR for ambiguities

I am pretty new to data scraping and I am facing a minor issue.
I am trying to extract text from a Hindi pdf using textract and Tesseract OCR.
Following is the code in Python:
import textract
text = textract.parsers.process("test.pdf", encoding='utf_8', method='tesseract', language = 'hin')
Now, many of the words from the PDF are correctly extracted. However, there are some things that are messed up. I read the documentation and about how ambiguities can be overridden by using a file lang.unicharambigs. However, I need to run combine_tessdata in order to actually bring it into effect and override certain trained data.
However, when I try to run the command I get the following:
-bash: combine_tessdata: command not found
I have installed tesseract from the source and I can't seem to understand why this is happening. Any ideas on how to troubleshoot this?
Thanks in advance!
Tesseract training executables are built separately.
https://github.com/tesseract-ocr/tesseract/wiki/Compiling

Generating an image-report with Python

Hy,
I'm working on a project, where I have to generate a image (e.g. .png, .bmp etc) with a python script.
The Image must have:
Small boxes (8x8px) in 3 different colours
Horizontal(normal) text in 2 different sizes
and 3) vertikal text (rotate normal text) (like this: http://devcity.net/Data/ArticleImages/Dual_Labels.jpg)
So not very complex things.
I spent the last days with PiL (Python Image Library). For the small boxes, it works fine and easy. But to generate a text in the image, it doesn't work fine.
What also works is to write a normal text, with the standard font (pilfont-type).
But I can't set the px-size of this text. When using truetypes, the following error comes:
"The _imagingft C module is not installed"
I allready "googled" this and this seems to be a popular problem. My Problem is, that the script also has to run on other python systems. What I can accept is, that I have to install Pil on each system/computer, but I can't fix the problem with the truetypes each time!
I'm using Python 2.7 with pil 1.1.7.
So to my question:
For the named "forms" my script has to generate, what library (or other ways to generate an image with a script) would you recomment to me?
Would it be possible to create, e.g writing a bitmap-file with text and pixels with colour, with my script in "Pure-Python", so without any extension?(Would be the optimal solution for me)
Have you thought about using PyCairo instead? See this link for an example: https://stackoverflow.com/a/6506825/514031
This is not quite what matplotlib was designed for, but is definitely capable of producing what you're after. Have a look at the gallery, it has usage examples for almost everything you mentioned.

printing to windows printer with python or shell command

I'm tryin to script an annoying task that involves fetching, handling and printing loads of scanned docs - jpeg or pdf. I don't succeed in accessing the printer from python or from windows shell (which I could script with python subproccess module). I succeeded in printing a text file from the command line with lpr command, but not jpg or pdf.
be glad for any clues about that, including a more extensive win shell reference for printing to printer, a suitable python library I missed in my google search stackoverflow search etc (just one unanswered question)
Well, after a little research I found some links that might help you:
1) To print images using Python Shell, this link below has some code using PIL that will, hopefully, do what you want:
http://timgolden.me.uk/python/win32_how_do_i/print.html
2) To print PDF files, this link may have what you need:
http://www.darkcoding.net/software/printing-word-and-pdf-files-from-python/
I never did any of those things, but with a quick look, I could find this links and they seem to make very much sense. Hope it helps :)
I used this for a rtf (just an idea) :
subprocess.call(['loffice', '-pt', 'LaserJet', file])
I am using LibreOffice. it can print in a batch mode.
with a default pdf viewer assigned to the system you can do
import win32api
fname="C:\\somePDF.pdf"
win32api.ShellExecute(0, "print", fname, None, ".", 0)
note that this will only work on windows and will not work with all pdf viewers but it should be good with acrobat and Foxit and several other major ones.

Categories