splitting text further while preserving line breaks - python

I am splitting a text paragraph while preserving the line breaks (\n), using the following:
from nltk import SpaceTokenizer
para="\n[STUFF]\n comma, with period. the new question? \n\nthe\n \nline\n new char*"
sent=SpaceTokenizer().tokenize(para)
Which gives me the following
print(sent)
['\n[STUFF]\n', '', 'comma,', '', 'with', 'period.', 'the', 'new', 'question?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']
My goal is to get the following output
['\n[STUFF]\n', '', 'comma', ',', '', 'with', 'period', '.', 'the', 'new', 'question', '?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']
That is to say, I would like to split 'comma,' into 'comma', ',', split 'period.' into 'period', '.', and split 'question?' into 'question', '?', all while preserving the \n.
I have tried word_tokenize, which does split off the ',' etc., but it does not preserve the \n.
What can I do to further split sent as shown above while preserving \n?

https://docs.python.org/3/library/re.html#re.split is probably what you want.
From the looks of your desired output, however, you're going to need to process the string a bit more than just applying a single function to it.
I would start by replacing all of the \n with a string like new_line_goes_here before splitting the string up, and then replacing new_line_goes_here with \n once it's all split up.
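A minimal sketch of that idea, using word_tokenize from the question as the actual splitter (the placeholder name is arbitrary and assumed never to occur in the input; the exact token boundaries will depend on which tokenizer you pick):
from nltk import word_tokenize

para = "\n[STUFF]\n comma, with period. the new question? \n\nthe\n \nline\n new char*"
PLACEHOLDER = "new_line_goes_here"

protected = para.replace("\n", PLACEHOLDER)              # hide the newlines
tokens = word_tokenize(protected)                        # do the real splitting
tokens = [t.replace(PLACEHOLDER, "\n") for t in tokens]  # restore the newlines
print(tokens)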

Per @randy's suggestion to look at https://docs.python.org/3/library/re.html#re.split:
import re
para = re.split(r'(\W+)', '\n[STUFF]\n comma, with period. the new question? \n\nthe\n \nline\n new char*')
print(para)
Output (close to what I am looking for)
['', '\n[', 'STUFF', ']\n ', 'comma', ', ', 'with', ' ', 'period', '. ', 'the', ' ', 'new', ' ', 'question', '? \n\n', 'the', '\n \n', 'line', '\n ', 'new', ' ', 'char', '*', '']
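To close the remaining gap, one option is a small post-processing pass over the original SpaceTokenizer output that detaches a trailing ',', '.' or '?' from any token containing no newline (a sketch of the idea, not an NLTK feature; sent is the list shown at the top of the question):
import re

result = []
for tok in sent:
    # only detach sentence punctuation from tokens that carry no \n
    if '\n' not in tok and re.search(r'[,.?]$', tok):
        result.extend([tok[:-1], tok[-1]])
    else:
        result.append(tok)
print(result)
# ['\n[STUFF]\n', '', 'comma', ',', '', 'with', 'period', '.', 'the', 'new',
#  'question', '?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']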

Related

Cut out unnecessary characters from pytesseract output

I'm trying to get a list of prices from an MMORPG, using pytesseract to extract the data as strings from screenshots.
Example screenshot:
The output from my image looks like this:
[' ', '', ' ', '', ' ', '', ' ', '', ' ', '', ' ', '', '
', ' ', ' ', '', ' ', '', "eel E J Gbasce'sthiel Sateen nach", '', ' ', '', 'Ly] Preis aufsteigend', '', '[ Tternname Anzahl Preis pro Stick Zeitraum. Verkaufer', '', ' ', '', '
', '', ' ', '', 'Holzstock 1 149,999 30 Tag#e) Heavend', '', '
I just want to get that bold section (name, amount, price) out of the output, but I really don't know how to cut it out of that text mess.
Does anyone have an idea how I can achieve this?
Thank you.
I think the best method is to find the Holzstock section of your images first. You could use advanced models like YOLO, or try feature description and matching with classical methods like SURF or SIFT. Then crop that part and feed it to tesseract.
This method has some benefits: you will find every Holzstock section of your images and run OCR only on those crops, which reduces OCR errors and removes the unnecessary parts of the text.
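A rough sketch of that crop-then-OCR idea using plain OpenCV template matching (screenshot.png and template.png are placeholder file names; template.png is assumed to be a saved crop of the region you care about, and a YOLO detector or SIFT/SURF matching would replace the matchTemplate step):
import cv2
import pytesseract

screenshot = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)
h, w = template.shape

# find where the template region sits inside the screenshot
res = cv2.matchTemplate(screenshot, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(res)

if max_val > 0.8:                        # confidence threshold, tune for your screenshots
    x, y = max_loc
    crop = screenshot[y:y + h, x:x + w]  # crop just that region
    print(pytesseract.image_to_string(crop))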

Remove punctuation from a list

I have this:
words = ["Alice's", 'Adventures', 'in', 'Wonderland', "ALICE'S", 'ADVENTURES', 'IN', 'WONDERLAND', 'Lewis', 'Carroll', 'THE', 'MILLENNIUM', 'FULCRUM', 'EDITION', '3.0', 'CHAPTER', 'I', 'Down', 'the', 'Rabbit-Hole', 'Alice', 'was']
remove_strings = str.maketrans(' ', '!*01.23456,789-\,?\'\.(:;)\"!')
words = [s.translate(remove_strings) for s in words]
words = [words.lower() for words in words]
I want to get rid of all the punctuation and numbers.
But it just converts to lower case and does not remove the punctuation as I thought it would.
What am I doing wrong?
str.maketrans maps the characters in the first argument to the corresponding characters in the second, so you're really just mapping a space to a different character with your current code. A quick fix is therefore to swap the two arguments; note that str.maketrans requires both strings to have equal length, so pad the second one with spaces:
chars_to_remove = '!*0123456789,-?\'.(:;)"'
remove_strings = str.maketrans(chars_to_remove, ' ' * len(chars_to_remove))
An easier approach would be to use a regex substitution to replace all non-alphabetic characters with a space:
import re
words = [re.sub('[^a-z]', ' ', word, flags=re.I).lower() for word in words]

matching quoted strings and unquoted words

I am trying to write a regular expression that matches either strings surrounded by double quotes (") or words separated by spaces, and collects them in a list in Python.
I don't really understand the output of my code; can anybody give me a hint or explain what my regular expression is doing exactly?
Here is my code:
import re
regex = re.compile('(\"[^\"]*\")|( [^ ]* )')
test = '"hello world." here are some words. "and more"'
print(regex.split(test))
I expect an output like this:
['"hello world."', ' here ', ' are ', ' some ', ' words. ', '"and more"']
but I get the following:
['', '"hello world."', None, '', None, ' here ', 'are', None, ' some ', 'words.', None, ' "and ', 'more"']
Where do the empty strings and the Nones come from?
And why does it match "hello world." but not "and more"?
Thanks for your help, and a happy new year for those who celebrate it today!
EDIT:
To be precise: I don't need the surrounding spaces, but I do need the surrounding quotes. This output would be fine too:
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
EDIT2:
I ended up using shlex.split() like @PadraicCunningham suggested, because it did exactly what I need and IMHO it is much more readable than regular expressions.
I still keep @TigerhawkT3's answer as the accepted one, because it solves the problem in the way I asked it (with regular expressions).
Include the quoted match first so it prioritizes that, and then non-whitespace characters:
>>> s = '"hello world." here are some words. "and more"'
>>> re.findall(r'"[^"]*"|\S+', s)
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
You can get the same result with a non-greedy repeating pattern instead of the character set negation:
>>> re.findall(r'".*?"|\S+', s)
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
shlex.split with posix=False will do it for you:
import shlex
test = '"hello world." here are some words. "and more"'
print(shlex.split(test,posix=False))
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
If you did not want the quotes, you would leave posix as True:
print(shlex.split(test))
['hello world.', 'here', 'are', 'some', 'words.', 'and more']
Looks like CSV, so use the appropriate tools:
import csv
lines = ['"hello world." here are some words. "and more"']
list(csv.reader(lines, delimiter=' ', quotechar='"'))
returns
[['hello world.', 'here', 'are', 'some', 'words.', 'and more']]

Strip off characters from output

I have the following structure generated by bs4 in Python.
['Y10765227', '9884877926, 9283183326', '', 'Dealer', 'Rgmuthu']
['L10038779', '9551154555', ',', ',']
['R10831945', '9150000747, 9282109134, 9043728565', ',', ',']
['B10750123', '9952946340', '', 'Dealer', 'Bala']
['R10763559', '9841280752, 9884797013', '', 'Dealer', 'Senthil']
I want to strip characters off so that I get something like the following:
9884877926, 9283183326, Dealer, Rgmuthu
9551154555
9150000747, 9282109134, 9043728565
9952946340 , Dealer, Bala
9841280752, 9884797013, Dealer, Senthil
I am using:
print re.findall("'([a-zA-Z0-9,\s]*)'", eachproperty['onclick'])
So basically I want to remove the "[]", the "''", the "," and the random ID at the start.
Update
onclick="try{appendPropertyPosition(this,'Y10765227','9884877926, 9283183326','','Dealer','Rgmuthu');jsb9onUnloadTracking();jsevt.stopBubble(event);}catch(e){};"
So I am scraping from this onclick attribute to get the above mentioned data.
You can use a combination of str.join and str.translate here:
>>> from string import punctuation, whitespace
>>> lis = [['Y10765227', '9884877926, 9283183326', '', 'Dealer', 'Rgmuthu'],
...        ['L10038779', '9551154555', ',', ','],
...        ['R10831945', '9150000747, 9282109134, 9043728565', ',', ','],
...        ['B10750123', '9952946340', '', 'Dealer', 'Bala'],
...        ['R10763559', '9841280752, 9884797013', '', 'Dealer', 'Senthil']]
>>> for item in lis:
...     print ", ".join(x for x in item[1:]
...                     if x.translate(None, punctuation + whitespace))
...
9884877926, 9283183326, Dealer, Rgmuthu
9551154555
9150000747, 9282109134, 9043728565
9952946340, Dealer, Bala
9841280752, 9884797013, Dealer, Senthil
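Note that the two-argument form of str.translate used above is Python 2 only; in Python 3 the equivalent filter can be written with a translation table from str.maketrans (a sketch reusing lis from the snippet above):
from string import punctuation, whitespace

# table that deletes punctuation and whitespace characters
drop = str.maketrans('', '', punctuation + whitespace)
for item in lis:
    print(", ".join(x for x in item[1:] if x.translate(drop)))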

RegEx Tokenizer to split a text into words, digits and punctuation marks

What I want to do is split a text into its ultimate elements.
For example:
from nltk.tokenize import *
txt = "A sample sentences with digits like 2.119,99 or 2,99 are awesome."
regexp_tokenize(txt, pattern='(?:(?!\d)\w)+|\S+')
['A','sample','sentences','with','digits','like','2.119,99','or','2,99','are','awesome','.']
You can see it works fine. My problem is: what happens if a number is at the end of the text?
txt = "Today it's 07.May 2011. Or 2.999."
regexp_tokenize(txt, pattern='(?:(?!\d)\w)+|\S+')
['Today', 'it', "'s", '07.May', '2011.', 'Or', '2.999.']
The result should be:
['Today', 'it', "'s", '07.May', '2011','.', 'Or', '2.999','.']
What do I have to do to get the result above?
I created a pattern that allows periods and commas to occur inside words and numbers. Hope this helps:
txt = "Today it's 07.May 2011. Or 2.999."
regexp_tokenize(txt, pattern=r'\w+(?:[.,]\w+)*|\S+')
['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']
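The non-capturing (?:...) group matters here: with a capturing group, findall-based tokenizers can return only the group's contents instead of the whole match. If you would rather not go through NLTK for this step, the same pattern works with plain re as well:
import re
txt = "Today it's 07.May 2011. Or 2.999."
print(re.findall(r'\w+(?:[.,]\w+)*|\S+', txt))
# ['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']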
