Using pytesseract for Hebrew language in tables - python

I've been trying to use pytesseract for the Hebrew language on tables (I have my own way of detecting the tables; for this question I'll focus on the character detection). The platform I'm using is Databricks with a notebook (.ipynb).
I'm using the following command:
languages_ = 'heb+eng' # I also tried only 'heb'
data = pytesseract.image_to_data(image, lang=languages_, output_type='data.frame', config=special_config)
The config options I tried are:
1) --psm 11
2) --psm 5
3) --psm 6
4) --psm 12
5) --psm 12 --oem 0
6) --psm 12 --oem 1
7) --psm 12 --oem 2
8) --psm 12 --oem 1 --dpi 3000
I got the best result with option number (6).
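For reference, option (6) spelled out as a full call would look roughly like the sketch below; image is assumed to already be loaded as a PIL image or NumPy array, and output_type='data.frame' needs pandas installed.

special_config = '--psm 12 --oem 1'
data = pytesseract.image_to_data(image, lang='heb+eng', output_type='data.frame', config=special_config)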
The problems I still have are the following:
Multiple times I see that it can't distinguish between the , and the . characters.
For some reason it recognizes both the digits 2 and 3 as 3.
Multiple times I see that if a cell has multiple values (even just 2), it separates them into multiple columns, one per word. This does not occur on every row, which is a problem: the surrounding rows are fine, so it becomes impossible to reorganize the output.
Sometimes (not on every row and column) it adds another column with some of the cells containing the character | ; I can only imagine it sometimes detects the column border as a character.
I've also noticed that images with different colors and different boldness give different results (the problems above are the most common ones I had).
I saw the GitHub issue "Hebrew issues #82" and multiple other links to articles and repos, but still no luck in my quest.
I am unable to supply an example table since the information I use is sensitive, but if it's a must, comment and I'll make something up.
If any other information is needed, please comment and I'll add it.

Related

Duplicate function returning non-duplicated results on a BLAST hit table

New to Python (3 weeks!) and unsurprisingly having difficulties. I'm working on a BLAST hit table and am trying to identify sequences coming from the same hit by using duplicated() on the accession number only. I do not want to discard these results but rather save them to a new file so I can take a look to see if anything interesting is popping up.
A snippet of the table (this is purely an example; the real table includes 11 columns but it seems excessive to print them all here):
Query  Accession      % identity  mismatches  gaps  s start  s end
Q112   ABCDEFG111222  90.99       9           3     1000     2000
Q112   HIJKLMN222111  80          14          98    128      900
Q112   OPQRSTUV33111  76          2           23    12       900
I'm importing the file to make it a data frame using pandas, then using reset_index to replace the query number with an index.
I have then done the following:
To find out if I have any duplicates in the Accession column:
print(file.Accession.duplicated().sum())
To put those results into a new data frame (n is the same):
fileDupes = file.loc[file.Accession.duplicated()]
Finally, to write it to a CSV for me to look through:
fileDupes.to_csv('Filedupes.csv', sep='\t', encoding='utf-8')
This does half work, as my CSV does contain duplicated entries based on the Accession number only, but it also contains some unique Accession number entries. These entries only seem to have the first 2 letters identical to other entries, but the rest is unique;
i.e. I have XM_JK1234343 and XM_983HSJAN and XM_83QMZBDH1 included despite having no other entry present (I have checked using find/replace). The other 11 columns are also unique to these strange entries.
I am stumped: is it my code? Have I not specified enough and allowed the above examples to be chucked in with other legitimate duplicates? I have tried to find whether someone else had asked a similar question, but no luck. Thank you kindly in advance for any insight, and apologies in advance if this is a silly mistake!
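For reference, a self-contained version of the steps described above (the input file name is hypothetical) looks like the sketch below. One thing to double-check: by default duplicated() marks only the second and later occurrences of a value, while keep=False marks every row whose Accession appears more than once.

import pandas as pd

file = pd.read_csv('blast_hittable.tsv', sep='\t')   # hypothetical file name

# count rows whose Accession was already seen earlier in the table
print(file.Accession.duplicated().sum())

# keep=False flags all rows sharing an Accession, not just the repeats
fileDupes = file.loc[file.Accession.duplicated(keep=False)]
fileDupes.to_csv('Filedupes.csv', sep='\t', encoding='utf-8')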

Tesseract OCR fails to detect varying font size and letters that are not horizontally aligned

I am trying to detect the text on these price labels, which is always clearly preprocessed. Although Tesseract can easily read the text written above the price, it fails to detect the price values. I am using the Python bindings (pytesseract), although it also fails when reading from the CLI commands. Most of the time it recognizes the part where the price is as just one or two characters.
Sample 1:
tesseract D:\tesseract\tesseract_test_images\test.png output
And the output of the sample image is this.
je Beutel
13
However, if I crop and stretch the price so the digits look separated and are the same font size, the output is just fine.
Processed image (cropped and stretched price):
je Beutel
1,89
How do I get Tesseract OCR to work as I intended, as I will be going over a lot of similar images?
Edit: Added more price tags:
sample5 sample6 sample7
The problem is that the image you are using is small. When Tesseract processes the image, it considers '8', '9' and ',' as a single letter and thus predicts it as '3', or it may consider '8' and ',' as one letter and '9' as a different letter, and so produces the wrong output. The image shown below explains it.
A simple solution could be increasing its size by a factor of 2 or 3, or even more, depending on the size of your original image, and then passing it to Tesseract so that it detects each letter individually, as shown below. (Here I increased its size by a factor of 2.)
Below is a simple Python script that should serve your purpose:
import pytesseract
import cv2

img = cv2.imread('dKC6k.png')
img = cv2.resize(img, None, fx=2, fy=2)   # scale the image up by a factor of 2
data = pytesseract.image_to_string(img)
print(data)
Detected text:
je Beutel
89
1.
Now you can simply extract the required data from the text and format it as per your requirement.
data = data.replace('\n\n', '\n')         # collapse the blank line between blocks
data = data.split('\n')                   # -> ['je Beutel', '89', '1.', ...]
dollars = data[2].strip(',').strip('.')   # '1.' -> '1'
cents = data[1]                           # '89'
print('{}.{}'.format(dollars, cents))
Desired Format:
1.89
The problem is that the Tesseract engine was not trained to read this kind of text topology.
You can:
train your own model; in particular you'll need to provide images with variations of topology (position of characters). You can actually use the same image and shuffle the positions of the characters.
reorganize the image into clusters of text and use Tesseract. In particular, I would take the cents part and move it to the right of the comma; in that case you can use Tesseract out of the box. A few relevant criteria would be the height of the clusters (to differentiate cents from integers) and the position of the clusters (read from left to right).
In general, computer vision algorithms (including CNNs) give you tools to obtain a higher-level representation of an image (features or descriptors), but they fail to create a logic or an algorithm to process intermediate results in a certain way.
In your case that would be:
"if the height of those letters is smaller, it's cents",
"if the height and vertical position are the same, it's about the same number, either on the left of the comma or on the right of the comma".
The thing is that it's difficult to reach that through training, and at the same time it's extremely simple for a human to write it as an algorithm. Sorry for not giving you an actual implementation; my text is the pseudo code.
TrainingTesseract2
TrainingTesseract4
Joint Unsupervised Learning of Deep Representations and Image Clusters
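As a rough illustration of the second option (clustering the characters and reading the integer and cent clusters separately), something along these lines could work; the file name, the 0.7 height ratio and the --psm/whitelist settings are assumptions to tune on real labels:

import cv2
import pytesseract

img = cv2.imread('price_tag.png', cv2.IMREAD_GRAYSCALE)
_, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# one bounding box per connected blob; [-2] keeps this working on OpenCV 3 and 4
contours = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 20]

tallest = max(h for (_, _, _, h) in boxes)
# "if the height of those letters is smaller, it's cents"
integers = sorted(b for b in boxes if b[3] > 0.7 * tallest)    # read left to right
cents = sorted(b for b in boxes if b[3] <= 0.7 * tallest)

def read(cluster):
    digits = []
    for (x, y, w, h) in cluster:
        roi = cv2.resize(img[y:y + h, x:x + w], None, fx=3, fy=3)   # upscaling helps tesseract
        digits.append(pytesseract.image_to_string(
            roi, config='--psm 10 -c tessedit_char_whitelist=0123456789').strip())
    return ''.join(digits)

print('{}.{}'.format(read(integers), read(cents)))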

Python - print image bit data to a ESC/POS printer with python

I've been looking for an example of how to format and print BMPs to my receipt printer (so I can add logos) for a long time, so I doubt this is a duplicate post, considering the others were for Java or other scripting languages. Usually I'm pretty good at understanding instructions, but all I seem to find are the same old instructions I can never fully understand.
I am using Python 2.7 and I have a function pI(x) which uses win32print to send data to the printer, where x is the data as a string using "\x??" for hex data such as formatting commands. It seems to work well.
The programmer manual that came with my printer says (for downloading a bit image, GS *) that the syntax is:
Hex 1D 2A x y d1...dk
and:
d=1 for printing the corresponding dot and d=0 for not printing the corresponding dot.
Here are my questions about these instructions:
Does this mean that all of x, y, d1...dk are in hex (or "\x??")? I think so.
What are x and y representing? I read a while ago on a site (maybe this one) that x + y*255 = image width, and I assume that uses the usual order of operations. Is this correct?
The instructions on my generic printer also state that x and y are both supposed to be between 1 and 48, totaling no more than 1500, unlike some manuals which say x is supposed to be between 0 and 3 and y between 1 and 128 (I think), which said x + y*255 = width, totaling about 2000. It also says k = x*y*8, which I think means the example would be 8*8*8 = 512 * "\x01", so where does the third 8 come from and how do I code that in the string? Does x = width and y = height? Then how do I get an image width of the maximum 384 dots?
Does this mean that I have to enter "\x00" or "\x01" for each dot, so one instance (a small black block of 8x8) of GS * would be 64 * "\x01"?
Do I have to send GS * for each group of 8 dots tall, or each line of 8 dots, or will that overwrite the previously programmed data?
I'd like to later include in my program a means of easily creating logos using a tkinter canvas widget and saving them to a text file for future printing using pI(), so I really need to know how to directly 'download' image data to the printer; using a third-party module probably won't work since I want to continue using my pI() function. Yes, it's ambitious and I'm probably doing it the hard way, but I'm afraid that if I start incorporating too much new stuff I'm not familiar with, I'll get too confused.
Basically, what string should I send to pI() to download an image of a solid 8x8-dot black box with a 2-dot-wide white line down the center on the printer?
Here's an example of what I would like the printer to print, so I can see a working code string.
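My reading of the GS * syntax (an assumption to verify against your printer's manual) is that x and y are single raw bytes giving the image size in units of 8 dots, i.e. the image is x*8 dots wide and y*8 dots tall, and each data byte packs 8 vertical dots (most significant bit at the top) rather than one byte per dot, so k = x*y*8 data bytes follow. Under that reading, a sketch for the 8x8 box with a 2-dot white line down the center would be:

# GS * x y d1...dk -- define downloaded bit image (x = y = 1 -> 8 x 8 dots)
# k = x * y * 8 = 8 data bytes, one byte per 8-dot-tall column, MSB at the top
define_image = "\x1d\x2a" + "\x01\x01" + "\xff\xff\xff\x00\x00\xff\xff\xff"

# GS / m -- print the downloaded bit image (m = 0 -> normal size)
print_image = "\x1d\x2f\x00"

pI(define_image)   # download the logo to the printer
pI(print_image)    # then print it

That reading would also mean the 384-dot maximum width corresponds to x = 48 (48 * 8 = 384), which matches the 1-to-48 limit in your manual.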

Tips on performing OCR - not getting desired results

So I have the following image:
I'm trying to extract three arrays:
var a = [30,31,32,35,37,40,44];
var b = [6,7,11,15,18,21,22];
var c = [5,11,15,18,23,37,28];
I tried feeding this image into tesseract (tesseract ~/Desktop/test.png out), to no avail:
9 % ooenesew #
5 ‘ 904399
And here is the result from ocrad ~/Desktop/test.ppm:
o
?
28
Can any OCR experts suggest what I might try next? I'm comfortable using Python/OpenCV, but will try anything.
If your images always look like the example, you might have to do some tidying up to remove anything that is not a number (all the black background and the circle). Then the method described in the accepted answer to the linked question might be sufficient for your needs, since it looks like you are not dealing with different fonts and sizes:
Simple Digit Recognition OCR in OpenCV-Python
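If the tidy-up route is enough, a rough sketch of that first step could look like this, assuming light digits on a dark background; the threshold value and the --psm/whitelist options are guesses to tune:

import cv2
import pytesseract

img = cv2.imread('test.png', cv2.IMREAD_GRAYSCALE)

# keep only the bright digits; the dark background and thin circle fall away
_, mask = cv2.threshold(img, 180, 255, cv2.THRESH_BINARY)
mask = cv2.medianBlur(mask, 3)    # drop small speckles
mask = cv2.bitwise_not(mask)      # tesseract prefers dark text on white

text = pytesseract.image_to_string(
    mask, config='--psm 6 -c tessedit_char_whitelist=0123456789')
print(text)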

How can I group a large dataset

I have a simple text file containing two columns, both integers:
1 5
1 12
2 5
2 341
2 12
and so on..
I need to group the dataset by the second value, such that the output will be:
5 1 2
12 1 2
341 2
Now the problem is that the file is very big, around 34 GB in size. I tried writing a Python script to group them into a dictionary with the value as an array of integers, but it still takes way too long. (I guess a lot of time is spent allocating the array('i') objects and extending them on append.)
I am now planning to write a Pig script which I plan to run on a pseudo-distributed Hadoop machine (an Amazon EC2 High-Memory Large instance).
data = load 'Net.txt';
gdata = group data by $1; -- I know it will lead to 5 (1,5) (2,5) but that's okay for this snippet
store gdata into 'res.txt';
I wanted to know if there was any simpler way of doing this.
Update:
Keeping such a big file in memory is out of the question. For the Python solution, what I planned was to conduct 4 runs: in the first run only second-column values from 1 to 10 million are considered, in the next run 10 million to 20 million are considered, and so on. But this turned out to be really slow.
The Pig / Hadoop solution is interesting because it keeps everything on disk (well, most of it).
For better understanding, this dataset contains information about the connectivity of ~45 million Twitter users, and the format in the file means that the user id given by the second number is following the first one.
The solution I had used:
import array

class AdjDict(dict):
    """
    A special dictionary class to hold an adjacency list
    """
    def __missing__(self, key):
        """
        When a key is not found, an integer array is initialized for it
        """
        self.__setitem__(key, array.array('i'))
        return self[key]

Adj = AdjDict()

for line in file("net.txt"):
    entry = line.strip().split('\t')
    node = int(entry[1])
    follower = int(entry[0])
    if node < 10 ** 6:
        Adj[node].append(follower)

# Code for writing the Adj matrix to the file
Assuming you have ~17 characters per line (a number I picked randomly to make the math easier), you have about 2 billion records in this file. Unless you are running with much physical memory on a 64-bit system, you will thrash your pagefile to death trying to hold all this in memory in a single dict. And that's just to read it in as a data structure - one presumes that after this structure is built, you plan to actually do something with it.
With such a simple data format, I should think you'd be better off doing something in C instead of Python. Cracking this data shouldn't be difficult, and you'll have much less per-value overhead. At a minimum, just holding 2 billion 4-byte integers would take 8 GB (unless you can make some simplifying assumptions about the possible range of the values you currently list as 1 and 2 - if they will fit within a byte or a short, then you can use smaller int variables, which will be worth the trouble for a data set of this size).
If I had to solve this on my current hardware, I'd probably write a few small programs:
The first would work on 500-megabyte chunks of the file, swapping columns and writing the result to new files. (You'll get 70 or more.) (This won't take much memory.)
Then I'd call the OS-supplied sort(1) on each small file. (This might take a few gigs of memory.)
Then I'd write a merge-sort program that would merge together the lines from all 70-odd sub-files. (This won't take much memory.)
Then I'd write a program that would run through the large sorted list; you'll have a bunch of lines like:
5 1
5 2
12 1
12 2
and you'll need to return:
5 1 2
12 1 2
(This won't take much memory.)
By breaking it into smaller chunks, hopefully you can keep the RSS down to something that would fit a reasonable machine -- it will take more disk I/O, but on anything but astonishing hardware, swap use would kill attempts to handle this in one big program.
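For the last step, the logic stays simple because the sorted file keeps equal keys adjacent; a sketch of that final pass in Python (file names are placeholders):

import itertools

# the merge output is assumed to be in 'sorted.txt', already "key follower" per line
with open('sorted.txt') as src, open('grouped.txt', 'w') as dst:
    pairs = (line.split() for line in src)
    for key, group in itertools.groupby(pairs, key=lambda p: p[0]):
        dst.write('{} {}\n'.format(key, ' '.join(p[1] for p in group)))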
Maybe you can do a multi-pass through the file.
Do a range of keys each pass through the file, for example if you picked a range size of 100
1st pass - work out all the keys from 0-99
2nd pass - work out all the keys from 100-199
3rd pass - work out all the keys from 200-299
4th pass - work out all the keys from 300-399
..and so on.
for your sample, the 1st pass would output
5 1 2
12 1 2
and the 4th pass would output
341 2
Choose the range size so that the dict you are creating fits into your RAM
I wouldn't bother using multiprocessing to try to speed it up by using multiple cores; unless you have a very fast hard drive, this should be I/O bound and you would just end up thrashing the disk.
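A sketch of one such pass (file names and the maximum key are placeholders):

from collections import defaultdict

RANGE_SIZE = 100   # pick this so the per-pass dict fits in RAM

def one_pass(path, low, high, out):
    # collect followers only for keys in [low, high) during this scan
    groups = defaultdict(list)
    with open(path) as src:
        for line in src:
            follower, key = line.split()
            key = int(key)
            if low <= key < high:
                groups[key].append(follower)
    for key in sorted(groups):
        out.write('{} {}\n'.format(key, ' '.join(groups[key])))

MAX_KEY = 341      # placeholder: the largest second-column value (341 in the sample)
with open('grouped.txt', 'w') as out:
    for low in range(0, MAX_KEY + 1, RANGE_SIZE):
        one_pass('Net.txt', low, low + RANGE_SIZE, out)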
If you are working with a 34 GB file, I'm assuming that the hard drive, both in terms of storage and access time, is not a problem. How about reading the pairs sequentially, and when you find pair (x,y), opening file "x", appending " y" and closing file "x"? In the end, you will have one file per Twitter user id, each containing all the users this one is connected to. You can then concatenate all those files if you want to have your result in the output format you specified. A literal sketch of that idea follows (the output directory and file names are hypothetical; the second column is used as the grouping key):
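import os

out_dir = 'by_user'                    # hypothetical output directory
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

with open('Net.txt') as src:
    for line in src:
        follower, user = line.split()  # second column is the followed user
        # append and close again immediately, so only one file is open at a time
        with open(os.path.join(out_dir, user), 'a') as out:
            out.write(follower + '\n')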
THAT SAID HOWEVER, I really do think that:
(a) for such a large data set, exact resolution is not appropriate and that
(b) there is probably some better way to measure connectivity, so perhaps you'd like to tell us about your end goal.
Indeed, you have a very large graph and a lot of efficient techniques have been devised to study the shape and properties of huge graphs---most of these techniques are built to work as streaming, online algorithms.
For instance, a technique called triangle counting, coupled with probabilistic cardinality estimation algorithms, efficiently and speedily provides information on the cliques contained in your graph. For a better idea on the triangle counting aspect, and how it is relevant to graphs, see for example this (randomly chosen) article.
I had a similar requirement, and you just need one more Pig statement to remove the redundancies in 5 (1,5) (2,5).
a = LOAD 'edgelist' USING PigStorage('\t') AS (user:int,following:int);
b = GROUP a BY user;
x = FOREACH b GENERATE group.user, a.following;
store x INTO 'following-list';
