How to add the person with disability (PwD) symbol to the Tesseract dataset - Python

I am working on license plate recognition in Python and use Tesseract for the OCR step. For my project I wish to include the person with disability (PwD) symbol in the Tesseract library. I reviewed the linked tutorial on updating the Tesseract library.
I followed the steps for creating the .tiff file, but it failed with a message that the image is not in a prescribed font.
From the literature I studied, people have added various fonts and number styles, but I could not find any information about how to add an image to the Tesseract dataset.
Can anyone suggest how I could succeed in adding the image to the Tesseract dataset? I would be grateful for any links or information pertaining to the problem.

To train Tesseract on new data you need the following packages: (i) jTessBoxEditor, (ii) Notepad++, (iii) Serak Tesseract Trainer.
jTessBoxEditor can be downloaded from the jTessBoxEditor download link; it also requires a Java runtime environment. It accepts input in .txt format.
You can use Notepad++ to enter the special characters; the procedure is described under "how to enter characters in Notepad++". For example, to enter the PwD symbol, hold down the ALT key and type 9855 on the numeric keypad; the symbol will appear in Notepad++. After entering the characters, save the file as .txt.
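As an alternative to typing the symbol in Notepad++, the training text can be generated directly from Python. This is a minimal sketch; the file name pwd_training.txt and the surrounding plate text are just examples. ALT+9855 corresponds to code point 9855, which is U+267F, the Unicode wheelchair symbol:

```python
# ALT+9855 in Notepad++ enters code point 9855 = U+267F,
# the Unicode WHEELCHAIR SYMBOL used on PwD plates.
pwd_symbol = chr(9855)  # '\u267f'

# Repeat the symbol alongside ordinary plate characters so the
# TIFF/Box generator has several samples to work with.
lines = [f"{pwd_symbol} ABC 1234" for _ in range(5)]

# Save as UTF-8 text, ready to feed to jTessBoxEditor.
with open("pwd_training.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))
```

The resulting .txt file can then be fed to the TIFF/Box Generator described in the next step.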
Open jTessBoxEditor and click TIFF/Box Generator to feed it the .txt file as input, and change the font to one that supports your character; for the PwD symbol I chose Segoe UI Symbol. The .tif will be stored in the folder where the .txt file was created.
To train the tessdata you need Serak Tesseract Trainer, which can be downloaded from the Serak download link. The procedure for using it is shown in the Serak trainer video, which explains step by step how to create the tessdata, i.e. the trained-data file.
Hope this is useful for someone.

Related

Is pytesseract safe to use with confidential images?

I am working on a project for my company that reads scanned PDFs and classifies them depending on their contents.
After doing some research online, the easiest way to solve this seems to be by using a Python Library called pytesseract.
My question is: Is this library safe to use with images containing confidential customer data? Do the images/the extracted text get saved in some server?
I found this link which suggests that it is safe. But I lack an understanding of what exactly happens 'behind the scenes' every time I read an image with the module.
Thanks in advance for any help!

Is there a way to extract data contained in the private tags of a .tiff image of a graph created by software?

To preface this, I just want to say I am not overly experienced, and I am not a developer of any sort. This is also my first post on Stack Overflow so if I do anything wrong please let me know.
I am a physical chemist performing measurements with a Park Systems NX10, which outputs the data as a .tiff image. The measurement just records current, potential and time. What I want is the raw XY data of the current and potential so I can plot the graph myself.
The usual way to get this is to open the .tiff file in their proprietary software (XEI), right click the graph that comes up and export the data as a .txt file, which is what I want. The issue is I have to do a lot of these measurements so I would like to have a way to automate this data extraction.
My attempt is shown below. According to the manual for the machine I am using, the data I need is enclosed in the private tags of the image, but that is all the information it gives. Using tifffile I extracted all the tags (17 in total), and the code below writes two files, for the tags named 50438 and 50439. The only argument of .decode() that didn't give an error for every file was windows-1252. When I write the file, it gives some words that I expect (Sample Bias, Current, etc.) as well as a bunch of NULL characters when viewed in a text editor. I have also tried changing the encoding of the text editor, to no avail.
import tifffile

filepath = r"filepath\scan.tiff"
i = 0
with tifffile.TiffFile(filepath) as tif:
    for page in tif.pages:
        for tag in page.tags:
            tag_name, tag_value = tag.name, tag.value
            print('tag_name=', tag_name, 'tag_value=', tag_value)
            try:
                # windows-1252 was the only codec that decoded without errors
                new_string = tag_value.decode('windows-1252')
            except (AttributeError, UnicodeDecodeError):
                # tag value is not bytes, or not in this encoding
                print('wrong encoding')
                continue
            with open('newfile{}.txt'.format(i), 'w') as newfile:
                newfile.write(new_string)
            i += 1
            print('wrote file')
My main question is: is it possible to do what I am attempting, or is the proprietary software the only thing capable of it? If it is possible, does anyone have any idea how to achieve it?
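Building on the observation above that the decoded tags contain expected words mixed with NULL characters: one approach (a sketch, assuming the fields are NUL-separated or NUL-padded, which is common in binary metadata blobs) is to split the decoded string on NUL and keep only the fragments that look like text:

```python
def extract_strings(raw: bytes, encoding: str = "windows-1252") -> list:
    """Decode a private-tag payload and keep the readable fragments.

    The tag payload appears to mix text fields with binary data, so
    split on NUL characters and keep fragments that look like text.
    """
    text = raw.decode(encoding, errors="replace")
    fragments = [frag.strip() for frag in text.split("\x00")]
    return [f for f in fragments if len(f) >= 2 and f.isprintable()]

# Hypothetical example payload: two labels padded with NUL bytes.
sample = b"Sample Bias\x00\x00\x00Current\x00\x01\x02"
print(extract_strings(sample))  # -> ['Sample Bias', 'Current']
```

This recovers the labels but not the numeric XY values; those are presumably stored as packed binary numbers between the labels, which would need struct.unpack once their layout is known.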

Python get data from custom tab in file's properties

I'm struggling with what seems to be a custom metadata tab.
I have spent 16 hours trying to write a snippet to read data from this tab, 'Phenom-World'.
I tried different approaches without any success:
Exif
Win32com
os.stat
My attempts return information relative to File > Properties > Details (size, resolution, date, etc.) but nothing from the tab named 'Phenom-World'.
Any help, please? I'm kind of in despair now.
Thanks for your help !
Here's the file; it's a .jpeg.
I'm using Python 3.8 on Windows 10.
EDIT #1
It seems that the Phenom-World tab can only be seen on computers where software from this same brand is installed.
I sent the file 6.jpeg to another computer able to read it, and it returns exactly the same data, so the metadata is embedded in the file even though it is 'hidden' on computers without the Phenom software.
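Since the metadata appears to be embedded in the file itself, one way to locate it (a sketch; the marker string b"Phenom" is an assumption based on the tab name) is to scan the raw bytes for the vendor string and note the offsets, then inspect the surrounding bytes in a hex viewer:

```python
def find_marker(path: str, marker: bytes = b"Phenom") -> list:
    """Return the byte offsets at which `marker` occurs in the file."""
    with open(path, "rb") as f:
        data = f.read()
    offsets = []
    start = 0
    # find() returns -1 when no further occurrence exists
    while (pos := data.find(marker, start)) != -1:
        offsets.append(pos)
        start = pos + 1
    return offsets

# Usage (hypothetical file name from the question):
# print(find_marker("6.jpeg"))
```

If the marker is found, the vendor data likely sits in a JPEG APPn segment near that offset, which narrows down where a parser would need to look.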

Is the Paragram_SL999_300 word embeddings file corrupt?

I need to use the Paragram_SL999_300 embeddings for my project, which uses the open-source code from a published article (https://github.com/cecilialeiqi/adversarial_text). When I try to run Step 4 (generate adversarial examples) from that repository, I get a ValueError saying int() was expected but got ','. I know from the readme.txt for Paragram-SL999 300 that the file is supposed to contain one token per line followed by its embeddings. When I try to open Paragram_SL999_300.txt to check whether it matches this format, it loads about halfway and then the text editor closes, without letting me edit it; it also crashes LibreOffice when I open it there. This was in an Ubuntu 18.04 virtual machine. I wasn't sure whether the author's code is wrong (in discrete_attack.py at https://github.com/cecilialeiqi/adversarial_text/blob/master/src/discrete_attack.py) or the file is corrupt, so I tried downloading and extracting the Paragram-SL999 300 archive from Wieting's website (http://www.cs.cmu.edu/~jwieting/) on my Windows computer; there I get a message saying the archive is corrupted, which prevents me from extracting Paragram_SL999_300.txt. On another Windows computer, I get Error Code 0x80004005 (Unspecified error) when trying to extract the archive.
Is there any way to get around this issue, or can anyone provide insight into it? Would it be better to produce the embeddings from Wieting's GitHub instead (https://github.com/jwieting/paragram-word)? I would very much appreciate any input, as these embeddings are paramount to my project.
I managed to download it from the Google Drive link at https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdrive.google.com%2Ffile%2Fd%2F0B9w48e1rj-MOck1fRGxaZW1LU2M%2Fview%3Fusp%3Dsharing&data=02%7C01%7C%7C36fd021bae0343bbe54408d7bdd28c81%7C1faf88fea9984c5b93c9210a11d9a5c2%7C0%7C0%7C637186584305548961&sdata=PouX2kyBlnQHpzAaDKjqe7gFC3ctti6tjBcGWt8pg1s%3D&reserved=0. In the end it worked, though I'm not sure why my other attempts failed. Also, I hadn't realised that for the code I had, I needed to add the vocabulary size and the embedding dimension on the first line of the file (1703756 300).
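For anyone hitting the same ValueError: the missing first line can be prepended programmatically rather than by editing the huge file by hand. A sketch, assuming the word2vec-style text format (the counts 1703756 and 300 are taken from the post above); the function and output file names are illustrative:

```python
def prepend_header(src_path: str, dst_path: str, vocab_size: int, dim: int) -> None:
    """Copy an embedding file, adding the '<vocab> <dim>' header line
    that word2vec-style text loaders expect as the first line."""
    with open(src_path, "r", encoding="utf-8", errors="ignore") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dim}\n")
        # Stream line by line so the multi-GB file never sits in memory.
        for line in src:
            dst.write(line)

# Usage with the sizes mentioned above:
# prepend_header("Paragram_SL999_300.txt", "paragram_header.txt", 1703756, 300)
```

Streaming the copy matters here because the file is large enough to crash text editors, as described above.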

Training Tesseract OCR for ambiguities

I am pretty new to data scraping and I am facing a minor issue.
I am trying to extract text from a Hindi pdf using textract and Tesseract OCR.
Following is the code in Python:
import textract
text = textract.parsers.process("test.pdf", encoding='utf_8', method='tesseract', language = 'hin')
Now, many of the words from the PDF are extracted correctly, but some are messed up. I read in the documentation how ambiguities can be overridden using a lang.unicharambigs file. However, I need to run combine_tessdata to actually bring it into effect and override certain trained data.
However, when I try to run the command I get the following:
-bash: combine_tessdata: command not found
I installed Tesseract from source and can't understand why this is happening. Any ideas on how to troubleshoot it?
Thanks in advance!
The Tesseract training executables are built separately; see:
https://github.com/tesseract-ocr/tesseract/wiki/Compiling
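To troubleshoot the "command not found" error from Python, you can first check whether the training tool is actually on PATH (a sketch; the build targets named in the message are the standard ones from the compiling guide linked above):

```python
import shutil

def check_training_tools(tool_name: str = "combine_tessdata") -> bool:
    """Return True if the given Tesseract training tool is on PATH."""
    path = shutil.which(tool_name)
    if path is None:
        print(f"{tool_name} not found; in the Tesseract source tree run "
              "'make training' and 'make training-install' to build and "
              "install the training executables.")
        return False
    print(f"{tool_name} found at {path}")
    return True
```

If this returns False after building, the install prefix is likely not on PATH.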
