Training Tesseract OCR for ambiguities - python

I am pretty new to data scraping and I am facing a minor issue.
I am trying to extract text from a Hindi pdf using textract and Tesseract OCR.
Following is the code in Python:
import textract
text = textract.parsers.process("test.pdf", encoding='utf_8', method='tesseract', language = 'hin')
Many of the words in the PDF are extracted correctly, but some come out garbled. I read in the documentation that ambiguities can be overridden with a lang.unicharambigs file. However, I need to run combine_tessdata for that file to take effect and override the relevant trained data.
However, when I try to run the command I get the following:
-bash: combine_tessdata: command not found
I installed Tesseract from source, so I can't understand why this is happening. Any ideas on how to troubleshoot this?
Thanks in advance!

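The Tesseract training executables, which include combine_tessdata, are built separately from the main tesseract binary. After building from source, run make training followed by sudo make training-install. See the compiling guide: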
https://github.com/tesseract-ocr/tesseract/wiki/Compiling
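A quick way to confirm that this is the cause is to check which of the two executables are actually on PATH. The following is a minimal sketch using only the standard library; the helper name find_tool is made up for illustration.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


# If tesseract is found but combine_tessdata is not, the training tools
# were most likely never built or installed.
for tool in ("tesseract", "combine_tessdata"):
    path = find_tool(tool)
    print(f"{tool}: {path or 'not found on PATH'}")
```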

Related

Using python oledump tool gives me "extraction failed" error

I would like some help with the oledump tool from the oletools python toolset.
I am doing a training exercise where I need to extract an attachment from a msg file and get its MD5 hash. However, I am having trouble extracting the attachment using oletools in a Linux environment. Keep in mind this is a training lab environment and I cannot use any tools other than the ones provided in the lab (oledump).
Checking the msg file, it clearly has an attachment (attach_version section) with binary data (ending in hex 0102)
Checking section 3, it is a word doc
Section 4 seems to be the start of the binary data
However when I try to extract this part, I always get an extraction failed error. I tried the "-d" flag but that only gives me the section and not the whole file
What would be the best way to extract the file with oledump without these errors?
I had the same problem. This finally worked for me, extracting the .docm file with oledump:
python3 oledump.py Salary-Ranges.msg -s4 -d > maldoc1.docm
The MD5 checksum matched the one expected by the task.
The '-e' argument is probably not designed for this job, or I just don't know how to use it correctly; I'm still learning these tools.
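If no md5sum utility is at hand, the checksum of the dumped file can also be computed in Python. This is a minimal sketch with the standard hashlib module; the function name md5_of_file is an assumption for illustration, and maldoc1.docm is the output file from the command above.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5_of_file("maldoc1.docm") should then give the hash to compare against the task.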

Is pytesseract safe to use with confidential images?

I am working on a project for my company which tries to read scanned pdfs and classify them depending on their contents.
After doing some research online, the easiest way to solve this seems to be by using a Python Library called pytesseract.
My question is: Is this library safe to use with images containing confidential customer data? Do the images/the extracted text get saved in some server?
I found this link which suggests that it is. But I lack an understanding of what exactly happens 'behind the scenes' every time I read an image with the module.
Thanks in advance for any help!

Is the Paragram_300_SL999 Word Embeddings file corrupt?

I need to use the Paragram_SL999_300 embeddings for a project built on the open source code from a published article (https://github.com/cecilialeiqi/adversarial_text). When I try to run Step 4 (generate adversarial examples) from https://github.com/cecilialeiqi/adversarial_text, I get a ValueError saying int() expected but got ','. According to the readme.txt for Paragram-SL999 300, the file is supposed to contain one token per line followed by its embedding. When I open Paragram_SL999_300.txt to check whether it matches this format, the text editor loads about halfway and then closes without letting me edit the file, and LibreOffice crashes if I try to open it there. This was in an Ubuntu 18.04 virtual machine.
I wasn't sure whether the cause was a bug in the author's code (in discrete_attack.py at https://github.com/cecilialeiqi/adversarial_text/blob/master/src/discrete_attack.py) or a corrupt file, so I tried downloading and extracting the Paragram-SL999 300 archive from Wieting's website (http://www.cs.cmu.edu/~jwieting/) on my Windows computer. There I get a message saying that the archive is corrupted, which prevents me from extracting and using Paragram_SL999_300.txt. On another Windows computer, I get Error Code 0x80004005: Unspecified error when trying to extract the archive.
Is there any way around this issue, or can anyone provide insight into it? Would it be better to produce the embeddings from Wieting's GitHub (https://github.com/jwieting/paragram-word) instead? I would very much appreciate any input, as these embeddings are paramount to my project.
I managed to download it from the Google Drive link at https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdrive.google.com%2Ffile%2Fd%2F0B9w48e1rj-MOck1fRGxaZW1LU2M%2Fview%3Fusp%3Dsharing&data=02%7C01%7C%7C36fd021bae0343bbe54408d7bdd28c81%7C1faf88fea9984c5b93c9210a11d9a5c2%7C0%7C0%7C637186584305548961&sdata=PouX2kyBlnQHpzAaDKjqe7gFC3ctti6tjBcGWt8pg1s%3D&reserved=0. In the end it worked, but I'm not sure why my earlier attempts failed. I also hadn't realised that the code I was using needs the vocabulary size and the embedding size on the first line of the file (1703756 300).
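Since the file is too large for a text editor, the header line can be added with a short script instead. This is only a sketch: the function name add_header and the output filename are made up for illustration, and it copies the data in binary mode so that the encoding of the embeddings file does not matter.

```python
import shutil


def add_header(src_path, dst_path, vocab_size, dim):
    """Write '<vocab_size> <dim>' as the first line, then copy the source file."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(f"{vocab_size} {dim}\n".encode("ascii"))
        # copyfileobj streams in chunks, so the whole file is never held in memory.
        shutil.copyfileobj(src, dst)
```

For the file discussed here, the call would look like add_header("Paragram_SL999_300.txt", "paragram_with_header.txt", 1703756, 300).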

Scraping PDF data into Excel *absolute beginner*

This is literally day 1 of python for me. I've coded in VBA, Java, and Swift in the past, but I am having a particularly hard time following guides online for coding a pdf scraper. Since I have no idea what I am doing, I keep running into a wall every time I want to test out some of the code I've found online.
Basic Info
Windows 7 64bit
python 3.6.0
Spyder3
I have many of the pdf related code packages (PyPDF2, pdfminer, pdfquery, pdfwrw, etc)
Goals
To create something in Python that allows me to convert PDFs from a folder into an Excel file (ideally) OR a text file (which I will then convert using VBA).
Issues
Every time I try sample code from guides I've found online, I run into syntax errors on the lines where I call the PDF that I want to test the code on. Some guide links and error examples are below. Should I be putting my test.pdf into the same folder as the .py file?
How to scrape tables in thousands of PDF files?
I got an invalid syntax error due to "for" on the last line
PDFMiner guide (Link)
runfile('C:/Users/U587208/Desktop/pdffolder/pdfminer.py', wdir='C:/Users/U587208/Desktop/pdffolder')
File "C:/Users/U587208/Desktop/pdffolder/pdfminer.py", line 79
print pdf_to_csv('test.pdf', separator, threshold)
^
SyntaxError: invalid syntax
It seems that the tutorials you are following use Python 2. There are usually few noticeable differences; the biggest is that in Python 3, print became a function, so the call needs parentheses:
print()
I would recommend either changing your version of Python or finding a tutorial for Python 3. Hope this helps.
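Applied to the failing line from the traceback, the change looks roughly like this. Note that pdf_to_csv below is only a hypothetical stand-in so the snippet runs on its own; the real function comes from the tutorial, and the separator and threshold values are placeholders.

```python
def pdf_to_csv(filename, separator, threshold):
    # Hypothetical stand-in for the tutorial's function; it only echoes its inputs.
    return separator.join([filename, str(threshold)])


# Python 2 statement form (a SyntaxError in Python 3):
#     print pdf_to_csv('test.pdf', separator, threshold)

separator = ","   # placeholder value
threshold = 1.5   # placeholder value

# Python 3 function form: wrap the argument in parentheses.
result = pdf_to_csv("test.pdf", separator, threshold)
print(result)
```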
Here
Pdfminer python 3.5: an example of how to extract information from a PDF.
But it does not solve the problem of the tables you want to export to Excel. Commercial products are probably better at doing that...
I am trying to do this exact same thing! I have been able to convert my PDF to text, but the formatting is extremely random and messy, and I need the tables to stay intact to be able to write them into Excel sheets. I am now attempting to convert to XML to see if it will be easier to extract from. If I get anywhere on this I will let you know :)
btw, use python 2 if you're going to use pdfminer. Here's some help with pdfminer https://media.readthedocs.org/pdf/pdfminer-docs/latest/pdfminer-docs.pdf

PyTesseract - Restricting OCR to a set of characters

I'm having trouble with pytesseract. I know that you can restrict tesseract to a specific set of characters using command line arguments :
tesseract input.tif output nobatch digits
I found some people saying they can restrict tesseract with the following lines in Python:
import tesseract
ocr = tesseract.TessBaseAPI()
ocr.Init(".","eng",tesseract.OEM_TESSERACT_ONLY)
ocr.SetVariable("tessedit_char_whitelist", "0123456789")
But this is for using the tesseract API, and I'm using pytesseract.... Finally I also tried :
print(image_to_string(someimage, config='outputbase digits'))
But this doesn't work as I still get letters in my output. This is weird because I am using the below code and it is working :
print(image_to_string(screen, config='-psm 10'))
PSM stands for PageSegmentationMode and it allows me to parse my imagefile as a single character. I don't understand why this works and the snippet before doesn't when they are both commandline arguments to tesseract...
Can anyone help? I want to use both options with a custom wordlist (that I created in the config folder of tesseract).
Finally found the solution, if it can ever help anyone... This is from the tesseract help page :
Simplest invocation of tesseract :
tesseract imagename outputbase
I could deduce the proper syntax from that (in fact, everything I found on stack overflow pretty much pointed me in the wrong direction, maybe because of different versions of tesseract). Keep in mind I'm using tesseract 3.05 (win installer available on GitHub) and pytesseract (installed from pip).
image_to_string(someimage, config='digits -psm 7')
As the help page shows, the outputbase argument comes right after the filename and before the other options; following that order allows the use of both PSM and a restricted charset.
All the command line arguments from the tesseract help page can be passed this way, in the config variable.
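To keep that ordering in one place, the config string can be assembled by a small helper. This is a sketch only: build_config and ocr_digits are made-up names, the config file name and PSM value are the ones from the working call above, and the psm_flag parameter exists because Tesseract 4 and later spell the option --psm rather than -psm.

```python
def build_config(configs=("digits",), psm=7, psm_flag="-psm"):
    """Join config file names and the PSM option into one config string."""
    parts = list(configs) + [psm_flag, str(psm)]
    return " ".join(parts)


def ocr_digits(image):
    """Run OCR on an image using the config built above."""
    # Imported lazily so the helper above can be used without pytesseract installed.
    import pytesseract
    return pytesseract.image_to_string(image, config=build_config())


print(build_config())
```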
