Cut out unnecessary characters from pytesseract output - python

I'm trying to get a list of prices from an MMORPG by using pytesseract to read screenshots into strings.
Example screenshot: (image omitted)
The output for my image looks like this:
[' ', '', ' ', '', ' ', '', ' ', '', ' ', '', ' ', '', '
', ' ', ' ', '', ' ', '', "eel E J Gbasce'sthiel Sateen nach", '', ' ', '', 'Ly] Preis aufsteigend', '', '[ Tternname Anzahl Preis pro Stick Zeitraum. Verkaufer', '', ' ', '', '
', '', ' ', '', 'Holzstock 1 149,999 30 Tag#e) Heavend', '', '
I just want to get that bold section (name, amount, price) out of the output, but I really don't know how to cut it out of that text mess.
Does someone have an idea how I can achieve this?
Thank you.

I think the best method is finding the Holzstock section of your images. It's easy: you could use advanced models like YOLO, or you could try feature description and matching with classical methods like SURF and SIFT. Then crop that part and feed it to tesseract.
This method has some benefits: you will find every Holzstock section of your images before doing OCR, which decreases OCR errors and removes the unnecessary parts of the text.
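A minimal sketch of that crop-then-OCR idea using plain OpenCV template matching (the file names, the match threshold, and the crop height are all assumptions; a YOLO or SIFT/SURF detector would replace the matchTemplate step):

import cv2
import pytesseract

# Locate a known, fixed UI element (e.g. the table header) in the screenshot.
screenshot = cv2.imread('screenshot.png', cv2.IMREAD_GRAYSCALE)
template = cv2.imread('table_header.png', cv2.IMREAD_GRAYSCALE)  # hypothetical crop of the header

result = cv2.matchTemplate(screenshot, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

if max_val > 0.8:  # only trust a reasonably confident match
    x, y = max_loc
    h, w = template.shape
    # Crop the region below the header where the item rows appear
    # (the 200 px height is a guess for this particular UI).
    rows = screenshot[y + h:y + h + 200, x:x + w]
    print(pytesseract.image_to_string(rows))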

Related

Unable To Remove Whitespace On String Using Python

I'm trying to get rid of the whitespace in this list, but I am unable to. Does anyone know where I am going wrong with my code?
love_maybe_lines = ['Always ', ' in the middle of our bloodiest battles ', 'you lay down your arms', ' like flowering mines ', ' to conquer me home. ']
love_maybe_lines_joined = '\n'.join(love_maybe_lines)
love_maybe_lines_stripped = love_maybe_lines_joined.strip()
print(love_maybe_lines_stripped)
Terminal:
Always
in the middle of our bloodiest battles
you lay down your arms
like flowering mines
to conquer me home.
Your strip() only trims the very start and end of the joined string, not the whitespace around each item. Strip each item instead:
love_maybe_lines = ['Always ', ' in the middle of our bloodiest battles ', 'you lay down your arms', ' like flowering mines ', ' to conquer me home. ']
love_maybe_lines = [item.strip() for item in love_maybe_lines]
That may help.
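If you still want the joined, printed form afterwards, join the stripped items exactly as before:

love_maybe_lines_joined = '\n'.join(love_maybe_lines)
print(love_maybe_lines_joined)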

parsing a list of strings based on values in the string

I scraped data from a website with beautifulsoup and requests, and collected the results into a list that looks like this:
['1\n',
' Saul Alvarez*',
'1545\n',
'\n\n',
' middle\n',
' 30\n',
' 53\xa01\xa02\n',
' \n',
'orthodox\n',
'Guadalajara, Mexico',
'2\n',
' Tyson Fury',
'1030\n',
'\n\n',
' heavy\n',
' 32\n',
' 30\xa00\xa01\n',
' \n',
'orthodox\n',
'Wilmslow, United Kingdom',
'3\n',
' Errol Spence Jr',
'697.2\n',
'\n\n',
' welter\n',
' 30\n',
' 27\xa00\xa00\n',
' \n',
'southpaw\n',
'Desoto, USA',
'4\n',
' Terence Crawford',
'658.9\n',
'\n\n',
' welter\n',
...
I'm having difficulty parsing this list wherever there is an integer followed by '\n'.
Ideally I would like the output to be a list of lists:
[[
'1\n',
' Saul Alvarez*',
'1545\n',
'\n\n',
' middle\n',
' 30\n',
' 53\xa01\xa02\n',
' \n',
'orthodox\n',
'Guadalajara, Mexico'
],
['2\n',
' Tyson Fury',
'1030\n',
'\n\n',
' heavy\n',
' 32\n',
' 30\xa00\xa01\n',
' \n',
'orthodox\n',
'Wilmslow, United Kingdom'],
['3\n',
' Errol Spence Jr',
'697.2\n',
'\n\n',
' welter\n',
' 30\n',
' 27\xa00\xa00\n',
' \n',
'southpaw\n',
'Desoto, USA'],
...]
Well, there are two things going on here, and I'll address only the first.
You can drop the blanks and the '\n' entries, because those are just newline characters, i.e. linefeeds.
li = ['1\n',
' Saul Alvarez*',
'1545\n',
'\n\n',
' middle\n',
' 30\n',
' 53\xa01\xa02\n',
' \n',
]
li = [val.replace("\n", "") for val in li]  # "\n" rather than the raw string r"\n", so real newlines are removed
li = [val.strip() for val in li if val.strip()]  # strip and drop entries that are only whitespace
print(li)
That outputs:
['1', 'Saul Alvarez*', '1545', 'middle', '30', '53\xa01\xa02']
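As an aside on the grouping itself: in your example every record spans exactly ten list elements, so if that pattern holds across the whole page, plain slicing already yields the list of lists. A sketch under that assumption, where scraped is a hypothetical name for the full list from your question:

records = [scraped[i:i + 10] for i in range(0, len(scraped), 10)]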
The second issue, which I won't address here since we don't know the HTML format (you haven't given it), is that you are grabbing all the element values (the text in each tag) without looking at the HTML markup's structure. That's the wrong approach to take.
I assume that if you look at the page's source you might find something like <div class="name">Saul Alvarez</div><div class="weightclass">middle</div>. Using the markup's annotations and semantic context is more productive than trying to guess at the structure from a flat list of elements. BeautifulSoup can do it; try soup.select("div.name") for example.
The nice thing about soup.select, which uses CSS selectors, is that you can pre-test your query in your browser's dev tools.
Just remember that soup.select returns a list of HTML elements, from which you'll want to pull out the text values.
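A minimal sketch of that approach, using the hypothetical markup above:

from bs4 import BeautifulSoup

# hypothetical markup, matching the guess above
html = '<div class="name">Saul Alvarez</div><div class="weightclass">middle</div>'
soup = BeautifulSoup(html, 'html.parser')

names = [el.get_text(strip=True) for el in soup.select('div.name')]
print(names)  # ['Saul Alvarez']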

splitting text further while preserving line breaks

I am splitting a text paragraph while preserving the line breaks (\n), using the following:
from nltk import SpaceTokenizer
para="\n[STUFF]\n comma, with period. the new question? \n\nthe\n \nline\n new char*"
sent=SpaceTokenizer().tokenize(para)
Which gives me the following
print(sent)
['\n[STUFF]\n', '', 'comma,', '', 'with', 'period.', 'the', 'new', 'question?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']
My goal is to get the following output
['\n[STUFF]\n', '', 'comma', ',', '', 'with', 'period', '.', 'the', 'new', 'question', '?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']
That is to say, I would like to split 'comma,' into 'comma', ',', split 'period.' into 'period', '.', and split 'question?' into 'question', '?', all while preserving the \n.
I have tried word_tokenize; it does split off 'comma', ',' and so on, but it does not preserve the \n.
What can I do to further split sent as shown above while preserving the \n?
https://docs.python.org/3/library/re.html#re.split is probably what you want.
From the looks of your desired output, however, you're going to need to process the string a bit more than just applying a single function to it.
I would start by replacing all of the \n with a string like new_line_goes_here before splitting the string up, and then replacing new_line_goes_here with \n once it's all split up.
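A rough sketch of that placeholder idea (the sentinel string is an arbitrary choice; note also that the restored newlines come back as separate tokens rather than attached to their neighbours, and word_tokenize will additionally split the brackets and the trailing *):

from nltk import word_tokenize  # requires nltk's 'punkt' tokenizer data

para = '\n[STUFF]\n comma, with period. the new question? \n\nthe\n \nline\n new char*'
SENTINEL = 'new_line_goes_here'  # assumed not to occur in the real text

masked = para.replace('\n', ' %s ' % SENTINEL)
tokens = ['\n' if t == SENTINEL else t for t in word_tokenize(masked)]
print(tokens)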
Per @randy's suggestion to look at https://docs.python.org/3/library/re.html#re.split:
import re
para = re.split(r'(\W+)', '\n[STUFF]\n comma, with period. the new question? \n\nthe\n \nline\n new char*')
print(para)
Output (close to what I am looking for)
['', '\n[', 'STUFF', ']\n ', 'comma', ', ', 'with', ' ', 'period', '. ', 'the', ' ', 'new', ' ', 'question', '? \n\n', 'the', '\n \n', 'line', '\n ', 'new', ' ', 'char', '*', '']
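To close the remaining gap, one option is to keep the original SpaceTokenizer pass and then post-split trailing punctuation. A sketch whose punctuation class [,.?] is tailored to this example:

import re
from nltk import SpaceTokenizer

para = '\n[STUFF]\n comma, with period. the new question? \n\nthe\n \nline\n new char*'

tokens = []
for tok in SpaceTokenizer().tokenize(para):
    m = re.match(r'^(\w+)([,.?])$', tok)  # a word with exactly one trailing , . or ?
    if m:
        tokens.extend(m.groups())
    else:
        tokens.append(tok)

print(tokens)

For this input, that prints exactly the target list from the question.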

Scrapy returns garbage data such as spaces and newlines. How shall I filter these?

I wrote a spider and it returns data littered with spaces and newline characters. The newline characters also cause the extract() method to return a list. How do I filter these out before they reach the selector? Filtering them after extract() is called breaks the DRY principle, because a lot of the data I need to extract from a page is attributeless, which makes indexing the only way to parse it.
How do I filter these?
Source
it returns bad data like this:
{'aired': ['\n ', '\n Apr 3, 2016 to Jun 26, 2016\n '],
 'broadcast': [],
 'duration': ['\n ', '\n 24 min. per ep.\n '],
 'episodes': ['\n ', '\n 13\n '],
 'favourites': ['\n ', '\n 22,673\n'],
 'genres': ['Action', 'Comedy', 'School', 'Shounen', 'Super Power'],
 'image_url': ['https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
               'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg'],
 'licensors': ['Funimation'],
 'members': ['\n ', '\n 818,644\n'],
 'popularity': ['\n ', '\n #21\n'],
 'premiered': ['Spring 2016'],
 'producers': ['Dentsu', 'Mainichi Broadcasting System', 'Movic', 'TOHO animation', 'Shueisha'],
 'ranked': ['\n ', '\n #135', '\n ', '\n'],
 'rating': ['\n ', '\n PG-13 - Teens 13 or older\n '],
 'score': ['8.44'],
 'source': ['\n ', '\n Manga\n '],
 'status': ['\n ', '\n Finished Airing\n '],
 'studios': ['Bones'],
 'title': 'Boku no Hero Academia',
 'type': ['TV']}
Edit: the linked source code has changed since the time of posting; to see the code as it was then, look at commit faae4aff1f998f5589fab1616d21c7afc69e03eb.
Looking at your code, you could try using XPath's normalize-space:
mal_item['aired'] = border_class.xpath('normalize-space(.//div[11]/text())').extract()
*untested, but seems legit.
(normalize-space collapses runs of whitespace into single spaces and trims both ends, so a text node like '\n 24 min. per ep.\n ' comes back as '24 min. per ep.'.)
For a more general answer, yourString.strip('someChar') or yourString.replace('this','withThis') works well, though when operating on JSON objects it might not be as efficient as other approaches. If those characters are present in the original data, you need to remove or skip them manually.
"The newline characters also cause the extract() method to return a list"
It is not the line breaks that cause this behavior but the way nodes appear in the document tree. Text nodes separated by element nodes such as <a>, <br>, <hr> are seen as separate entities, and Scrapy will yield them as such (in fact extract() is supposed to always return a list, even when only a single node was selected). XPath has several basic value-processing / filtering functions, but it's very limited.
"Filtering them after extract() is called breaks the DRY principle"
You seem convinced that the only correct way to filter these outputs is inside the selector expression, but there is no use being so stringent about the principle: you are selecting text nodes from inside your target nodes, and these are bound to have excessive whitespace or be scattered all around their containers. XPath filtering by content is very sluggish, so it should be done outside the selector. Post-processing scraped fields is common practice. You might want to read about scrapy loaders and processors (see below).
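A minimal sketch of that loader/processor route (the loader class name and the usage lines are assumptions, not your real code):

from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, MapCompose

class MalItemLoader(ItemLoader):
    # strip every extracted string, drop the ones that end up empty,
    # then join whatever remains into a single value
    default_input_processor = MapCompose(str.strip, lambda s: s or None)
    default_output_processor = Join(', ')

# hypothetical usage inside the spider callback:
# loader = MalItemLoader(item=MalItem(), selector=border_class)
# loader.add_xpath('aired', './/div[11]/text()')
# mal_item = loader.load_item()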
Otherwise the simplest way is:
import re
...
def join_clean(texts):
    return re.sub(r'\s+', ' ', ' '.join(texts)).strip()
...
mal_item['type'] = join_clean(border_class.xpath('.//div[8]/a/text()').extract())

Strip off characters from output

I have the following structure generated by bs4 in Python:
['Y10765227', '9884877926, 9283183326', '', 'Dealer', 'Rgmuthu']
['L10038779', '9551154555', ',', ',']
['R10831945', '9150000747, 9282109134, 9043728565', ',', ',']
['B10750123', '9952946340', '', 'Dealer', 'Bala']
['R10763559', '9841280752, 9884797013', '', 'Dealer', 'Senthil']
I want to strip characters off so that I get something like the following:
9884877926, 9283183326, Dealer, Rgmuthu
9551154555
9150000747, 9282109134, 9043728565
9952946340 , Dealer, Bala
9841280752, 9884797013, Dealer, Senthil
I am using print re.findall("'([a-zA-Z0-9,\s]*)'", eachproperty['onclick'])
So basically I want to remove the "[]", the "''", the stray ","s, and the random ID at the start.
Update
onclick="try{appendPropertyPosition(this,'Y10765227','9884877926, 9283183326','','Dealer','Rgmuthu');jsb9onUnloadTracking();jsevt.stopBubble(event);}catch(e){};"
So I am scraping from this onclick attribute to get the above mentioned data.
You can use a combination of str.join and str.translate here:
>>> from string import punctuation, whitespace
>>> lis = [['Y10765227', '9884877926, 9283183326', '', 'Dealer', 'Rgmuthu'],
['L10038779', '9551154555', ',', ','],['R10831945', '9150000747, 9282109134, 9043728565', ',', ','],
['B10750123', '9952946340', '', 'Dealer', 'Bala'],
['R10763559', '9841280752, 9884797013', '', 'Dealer', 'Senthil']]
for item in lis:
    print ", ".join(x for x in item[1:]
                    if x.translate(None, punctuation + whitespace))
...
9884877926, 9283183326, Dealer, Rgmuthu
9551154555
9150000747, 9282109134, 9043728565
9952946340, Dealer, Bala
9841280752, 9884797013, Dealer, Senthil
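Note that x.translate(None, chars) (like the bare print above) is Python 2 syntax. A Python 3 equivalent of the filter test, for reference:

from string import punctuation, whitespace

table = str.maketrans('', '', punctuation + whitespace)  # a table that deletes these characters
print('9884877926, 9283183326'.translate(table))  # -> 98848779269283183326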
