Distinguish quotes ' and apostrophes while tokenizing with regex - python

Given some text, I want to tokenize it into words properly. The text may contain:
words with an apostrophe in the middle (Can't, I'll, the accountant‘s books)
words with an apostrophe at the end (the employers‘ association, I spent most o’ the day replacin’ the broken bit)
quotes placed directly after a word or between words, like: word'word
text that is split into sentences, but there can be many sentences inside a quote, and a word with an apostrophe can also appear inside a quote
different symbols for quotes: either ' for both opening and closing, or ' for one and ` or ´ for the other, etc.
What would you suggest to solve this?
Is it solvable with regex (Python re, for example)?
I want words with apostrophes to stay unsplit, and quotes to be split off from word tokens.
Parsing ordinary text, The Fellowship Of The Ring.txt for example, is a little tricky:
input : had hardly any 'government'.
output: ["had","hardly","any","'","government","'"] (recognized as quote)
A rather larger body, varying at need, was employed to 'beat the bounds'
is a quote, but it is tricky because of the ending s'
'It isn't natural, and trouble will come of it!' apostrophe inside a quote
'Elves and Dragons'_ I says to him. is a quote, however, s' again.

My suggestion would be to try to break down your cases. If you only want to split into words (meaning that a word has spaces on both ends), a simple split will probably do the job.
>>> my_str = "words like that'"
>>> my_str.split(' ')
['words', 'like', "that'"]
>>>
If it's more complicated, regex seems to be a better idea. You can use (a|b), meaning match a or b. My suggestion would be to experiment more; a good place to experiment is regex101.com. To make things clearer, select 'Python' in the left panel!
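As a starting point for that experimenting, here is a minimal sketch (not a complete solution). It assumes that an apostrophe belongs to a word only when it has word characters on both sides, and that every other non-space character is a token of its own. The name tokenize is made up for illustration.

```python
import re

# Hypothetical helper, not from the original answer.
# A word is a run of word characters that may contain ' or ’ only
# between word characters; any other non-space character is its own token.
TOKEN_RE = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("had hardly any 'government'."))
print(tokenize("'It isn't natural, and trouble will come of it!'"))
```

This keeps isn't and Can't whole and splits the quotes around 'government' off as separate tokens. It cannot tell a trailing apostrophe (employers', replacin') from a closing quote, and it treats word'word as one word; deciding those cases probably needs extra heuristics, such as tracking whether a quote is currently open.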

Related

Remove continuous occurrence of vowels together in a string using Python

I have a string like below:
"i'm just returning from work. *oeee* all and we can go into some detail *oo*. what is it that happened as far as you're aware *aouu*"
with some junk characters like above (highlighted with '*' marks). All I could observe was that the junk characters come as a bunch of vowels strung together. Now, I need to remove any word that has a space before and after it, consists only of vowels (like oeee, aouu, etc.) and has a length of 2 or more. How do I achieve this in Python?
Currently, I have built a tuple of replacement pairs like ((" oeee "," "),(" aouu "," ")) and I send it through a for loop with replace. But if the word is 'oeeee', I need to add a new item to the tuple. There must be a better way.
P.S: there will be no '*' in the actual text. I just put it here to highlight.
You need to use re.sub to do a regex replacement in python. You should use this regex:
\b[aeiou]{2,}\b
which will match a sequence of 2 or more vowels in a word by themselves. We use \b to match the boundaries of the word so it will match at the beginning and end of the string (in your string, aouu) as well as words adjacent to punctuation (in your string, oo). If your text may include uppercase vowels too, use the re.I flag to ignore case:
import re
text = "i'm just returning from work. oeee all and we can go into some detail oo. what is it that happened as far as you're aware aouu"
print(re.sub(r'\b[aeiou]{2,}\b', '', text, flags=re.I))
Output
i'm just returning from work. all and we can go into some detail . what is it that happened as far as you're aware
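The substitution above leaves the spaces around each removed word in place (note the space before the full stop after "detail"). If that matters, one possible variant, sketched here under the assumption that the whitespace in front of a junk word should go with it, is to let the pattern consume it as well:

```python
import re

text = ("i'm just returning from work. oeee all and we can go into some "
        "detail oo. what is it that happened as far as you're aware aouu")

# \s* also swallows the whitespace in front of the vowel-only word,
# so no doubled or dangling spaces are left behind.
cleaned = re.sub(r'\s*\b[aeiou]{2,}\b', '', text, flags=re.I)
print(cleaned)
```

Keep in mind that either pattern also removes genuine vowel-only words of two or more letters, should the text contain any.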

upper casing all values within quotes " " except those coming after certain words

I am trying to convert every word within quotes " " to upper case, except those coming right after the word "then", in a pandas column:
for example:
0 There was a "quick" "brown" fox who "jumped" over the wall then "fell" and broke its "tooth"
the output should be:
0 There was a "QUICK" "BROWN" fox who "JUMPED" over the wall then "fell" and broke its "TOOTH"
Although I am able to find the words in quotes, I am not able to exclude the word coming right after "then".
df.str.replace({r'"(.*?)"':r'\U$1') #this will select and replace all values in quotes to uppercase also values after then
please help.
You can use the regex (?<!then\s)"(\w*)" to find the words within quotes that are NOT preceded by 'then' and a whitespace character.
"(\w*)" = Look for words within quotes
(?<!then\s) = Make sure the words matched by "(\w*)" do not have 'then' and a whitespace character before them (negative look-behind)
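Python's re module has no \U escape for replacement strings, so the upper-casing itself needs a replacement function. A minimal sketch on a plain string, using the regex above (upper_quoted is a name made up for this example):

```python
import re

text = ('There was a "quick" "brown" fox who "jumped" over the wall '
        'then "fell" and broke its "tooth"')

# Quoted word that is not directly preceded by "then" plus one whitespace character.
pattern = re.compile(r'(?<!then\s)"(\w*)"')

def upper_quoted(match):
    # Rebuild the match with the captured word upper-cased.
    return '"' + match.group(1).upper() + '"'

result = pattern.sub(upper_quoted, text)
print(result)
```

For a pandas column, the same function should be usable as the replacement in Series.str.replace, which accepts a callable when regex=True, for example df['col'].str.replace(pattern, upper_quoted, regex=True), where 'col' is a placeholder for your column name.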
RegexDemo You can see the demo of the regex here (you can put several other string to check how the regex works on them as well)
Regex-info This is very comprehensive website (kind of the go-to website for all things regex) on regex, almost all concepts of regex should be answered here. It is not programming language dependent & has a lot of information which can be overwhelming.
Regex Cheat-Sheet I would say start with this cheat sheet, it is very simple & explained in simple words. I find it very helpful.
String= He "ate" a "penguin" then "played with a hamburger.
Turn the string into a list by splitting at the word then. Take list[0] as a string, split it by spaces, and use an if '"' in word check to isolate the quoted words. Upper-case those. Then use join to put the whole string back together again, and there ya go.

Python Regex Matching - Splitting on punctuation but ignoring certain words

Suppose I have the following sentence,
Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!
I'm trying to capture the punctuation (except the apostrophe and hyphen) using regular expressions, but I also want to ignore certain words. For example, I'm ignoring Dr., and so I don't want to capture the . in the word Dr.
Ideally, the regex should capture the text in between the parentheses:
Hi(, )my( )name( )is( )Dr.( )Who(. )I'm( )in( )love( )with( )fish-fingers( )and( )custard( !!)
Note that I have a Python list that contains words like "Dr." that I want to ignore. I'm also using string.punctuation to get a list of punctuation characters to use in the regex. I've tried using negative lookahead but it was still catching the "." in Dr. Any help appreciated!
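You can first throw away all your stop words (like "Dr.") and then all letters and digits.
import re
text = "Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!"
# (?:Dr|Prof) is a group of alternatives; a character class such as
# [Dr.|Prof.] would match single characters instead of whole words
tmp = re.sub(r'\b(?:Dr|Prof)\.', '', text)
# remove letters and digits, leaving punctuation and spaces
print(re.sub(r'[a-zA-Z0-9]+', '', tmp))
Would that work?
It would print (runs of spaces collapsed here):
, . ' - !!
The output is the punctuation between the words, roughly the text in between the parentheses in your question, although the apostrophe and hyphen are still included.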
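Another possible approach, sketched here rather than taken from the answer above, is to scan the text with one pattern whose first alternative is the list of ignored words, so that "Dr." is consumed as a word before its full stop can be seen as punctuation. The names ignore, token_re and seps are made up for this example.

```python
import re

# Hypothetical ignore list; the asker's own list would go here.
ignore = ["Dr.", "Prof."]

# Escape each ignored word and try longer ones first.
ignored = "|".join(re.escape(w) for w in sorted(ignore, key=len, reverse=True))

# A "word" is an ignored abbreviation or a run of word characters,
# apostrophes and hyphens; a "sep" is everything in between.
token_re = re.compile(rf"(?P<word>{ignored}|[\w'-]+)|(?P<sep>[^\w'-]+)")

text = "Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!"
seps = [m.group("sep") for m in token_re.finditer(text) if m.group("sep")]
print(seps)
```

On the sample sentence this yields the thirteen pieces shown in parentheses in the question, from ", " through ". " to " !!". Note that the underscore counts as a word character here, so it is never reported as a separator.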

Sentence splitting based in regular expression [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
I am trying to split an article into sentences, using the following code (written by someone who has left the organization). Please help me understand it:
re.split(r' *[.,:-\@/_&?!;][\s ]+', x)
It looks for punctuation marks such as full stops, commas and colons, optionally preceded by spaces and always followed by at least one whitespace character. In the commonest case that will be ". ". Then it splits the string x into pieces by removing the matched punctuation and returning whatever is left as a list.
>>> x = "First sentence. Second sentence? Third sentence."
>>> re.split(r' *[.,:-\@/_&?!;][\s ]+', x)
['First sentence', 'Second sentence', 'Third sentence.']
The regular expression is unnecessarily complex and doesn't do a very good job.
This bit: :-\@ has a redundant quoting backslash, and means the characters between ASCII 58 and 64, in other words : ; < = > ? @, but it would be better to list the 7 characters explicitly, because most people will not know what characters fall in that range. That includes me: I had to look it up. And it clearly also includes the code's author, since he redundantly specified ; again at the end.
This bit [\s ]+ means one or more spaces or whitespace characters but a space is a whitespace character so that could be more simply expressed as \s+.
Note the retained full stop in the 3rd element of the returned list. That is because when the full stop comes at the end of the line, it is not followed by a space, and the regular expression insists that it must be. Retaining the full stop is okay, but only if it is done consistently for all sentences, not just for the ones that end at a line break.
Throw away that bit of code and start from scratch. Or use nltk, which has power tools for splitting text into sentences and is likely to do a much more respectable job.
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> sent_tokenizer = PunktSentenceTokenizer()
>>> sent_tokenizer.tokenize(x)
['First sentence.', 'Second sentence?', 'Third sentence.']
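If you would rather stay with re, the points above can be sketched as follows. The first pattern assumes the range in the original was meant to be : through @ (ASCII 58 to 64) and simply spells it out; the second is an alternative of my own that splits only after ., ? or ! and keeps the terminator on every sentence.

```python
import re

x = "First sentence. Second sentence? Third sentence."

# Same character set as the original pattern, with the range written out
# and [\s ]+ reduced to \s+.
explicit = r" *[.,:;<=>?@/_&!]\s+"
print(re.split(explicit, x))

# Alternative: split on the whitespace that follows ., ? or !, so each
# sentence keeps its punctuation, including the last one.
keep_punct = r"(?<=[.?!])\s+"
print(re.split(keep_punct, x))
```

The second pattern no longer splits on commas or colons, which is probably what sentence splitting wants, but it will still break after abbreviations such as "Dr.".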

Add quotes around sentences with the word "said"

Ok regex masters, I have a very long text and I'm trying to add quotes in sentences that contain the words "he said" and similar variations.
For example:
s = 'This should have no quotes. This one should he said. But this one should not. Neither should this. But this one should she said.'
Should result in:
This should have no quotes. "This one should," he said. But this one should not. Neither should this. "But this one should," she said.
So far I can get pretty close, but not quite right:
>>> import re
>>> m = re.sub(r'\.\W(.*?) (he|she|it) said.', r'. "\1," \2 said.', s)
Results in:
>>> print(m)
This should have no quotes. "This one should," he said. But this one should not. "Neither should this. But this one should," she said.
As you can see, it puts quote properly around the first instance, but places it too early for the second. Any help appreciated!
There are some different valid situations that have been pointed out in the comments, but to address the concern you were facing:
It is quoting the whole sentence because the match starts at the period at the end of "one should not." and the lazy group is then free to run across the next period. What you really want is to quote back only to the last period. So make sure the capturing group cannot include periods, like so:
m = re.sub(r'\.\W([^\.]*?) (he|she|it) said.', r'. "\1," \2 said.', s)
This will fail for things with periods in the sentence like "Dr. Seuss likes to eat, she said" but that is another problem.
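For reference, here is the fix as a self-contained snippet. It differs from the one-liner above only in small ways that I chose: \s instead of \W, and an escaped final period so that "said." has to end with a real full stop.

```python
import re

s = ('This should have no quotes. This one should he said. '
     'But this one should not. Neither should this. '
     'But this one should she said.')

# [^.]*? cannot cross a period, so the capture starts at the nearest
# sentence start before "he/she/it said."
pattern = r'\.\s([^.]*?) (he|she|it) said\.'
result = re.sub(pattern, r'. "\1," \2 said.', s)
print(result)
```

Because the pattern starts with a period, a "he said" sentence at the very beginning of the text is not quoted.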
