Python Regex Matching - Splitting on punctuation but ignoring certain words - python

Suppose I have the following sentence,
Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!
I'm trying to capture the punctuation (except the apostrophe and hyphen) using regular expressions, but I also want to ignore certain words. For example, I'm ignoring Dr., and so I don't want to capture the . in the word Dr.
Ideally, the regex should capture the text in between the parentheses:
Hi(, )my( )name( )is( )Dr.( )Who(. )I'm( )in( )love( )with( )fish-fingers( )and( )custard( !!)
Note that I have a Python list that contains words like "Dr." that I want to ignore. I'm also using string.punctuation to get a list of punctuation characters to use in the regex. I've tried using negative lookahead but it was still catching the "." in Dr. Any help appreciated!

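You can first remove all your stop words (like "Dr.") and then all letters and digits.
import re
text = "Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!"
# Use an alternation group here, not a character class: [Dr.|Prof.] would
# match any single one of those characters, including every "."
tmp = re.sub(r'\b(?:Dr|Prof)\.', '', text)
print(re.sub(r'[a-zA-Z0-9]+', '', tmp))
Would that work?
It prints only the punctuation and the whitespace that separated the words (runs of spaces are collapsed here for readability):
, . ' - !!
That is the text between the parentheses in your question, plus the apostrophe and hyphen, which you would still need to drop if you don't want them.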
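As an alternative, here is a minimal sketch that collects the separators directly instead of deleting everything else. It assumes the ignore list is a plain Python list (the name ignore_words and the "Prof." entry are made up for illustration) and builds the separator class from string.punctuation minus the apostrophe and hyphen. The trick is that ignored words are matched first, without a capture group, so their dots are consumed before the separator alternative can see them.

```python
import re
import string

text = "Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!"
ignore_words = ["Dr.", "Prof."]  # hypothetical ignore list

# Punctuation that counts as a separator: everything except ' and -
punct = string.punctuation.replace("'", "").replace("-", "")
punct_class = re.escape(punct)
ignored = "|".join(re.escape(word) for word in ignore_words)

# The first alternative swallows an ignored word whole (no capture group),
# so its "." is never offered to the separator alternative.
pattern = re.compile(rf"(?:{ignored})|([\s{punct_class}]+)")

# Keep only the matches where the separator group actually captured text.
separators = [m.group(1) for m in pattern.finditer(text) if m.group(1)]
print(separators)
```

For the sample sentence this yields the thirteen separators marked with parentheses in the question, with ". " after Who kept and the dot in Dr. skipped.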

Related

Create Python regex for specific sentence pattern

I'm trying to build a regex pattern that can capture the following examples:
pattern1 = '.She is greatThis is annoyingWhy u do this'
pattern2 = '.Weirdly specificThis sentence is longer than the other oneSee this is great'
example = 'He went such dare good mr fact. The small own seven saved man age no offer. Suspicion did mrs nor furniture smallness. Scale whole downs often leave not eat. An expression reasonably cultivated indulgence mr he surrounded instrument. Gentleman eat and consisted are pronounce distrusts.This is where the fun startsSummer is really bothersome this yearShe is out of ideas'
example_pattern_goal = 'This is where the fun startsSummer is really bothersome this yearShe is out of ideas'
Essentially, it's always a dot followed by sentences of various length not including any numbers. I only want to capture these specific sentences, so I tried to capture instances where a dot was immediately followed by a word that starts with an uppercase and other words that include two instances where an uppercase letter is inside the word.
So far, I've only come up with the following regex that doesn't quite work:
'.\b[A-Z]\w+[\s\w]+\b\w+[A-Z]\w+\b[\s\w]+\b\w+[A-Z]\w+\b[\s\w]+'
You can use
\.([A-Z][a-z]*(?:\s+[A-Za-z]+)*\s+[a-zA-Z]+[A-Z][a-z]+(?:\s+[A-Za-z]+)*)
See the regex demo.
Details:
\. - a dot
[A-Z][a-z]* - an ASCII word starting with an uppercase letter
(?:\s+[A-Za-z]+)* - zero or more sequences of one or more whitespaces and then an ASCII word
\s+ - one or more whitespaces
[a-zA-Z]+[A-Z][a-z]+ - an ASCII word with an uppercase letter inside it
(?:\s+[A-Za-z]+)* - zero or more sequences of one or more whitespaces and then an ASCII word.
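A short sketch of how the pattern above can be applied with re.search. The example string is a shortened version of the one in the question (an assumption made only to keep the snippet small); group 1 holds the text after the dot.

```python
import re

pattern = re.compile(
    r"\.([A-Z][a-z]*(?:\s+[A-Za-z]+)*\s+[a-zA-Z]+[A-Z][a-z]+(?:\s+[A-Za-z]+)*)"
)

pattern1 = ".She is greatThis is annoyingWhy u do this"
# Shortened version of the question's example text
example = (
    "He went such dare good mr fact. Gentleman eat and consisted are "
    "pronounce distrusts.This is where the fun startsSummer is really "
    "bothersome this yearShe is out of ideas"
)

for s in (pattern1, example):
    m = pattern.search(s)
    # group(1) is the run of glued-together sentences, without the dot
    print(m.group(1) if m else None)
```

The dot in "fact. Gentleman" is skipped because it is followed by a space rather than an uppercase letter.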

How do I remove the substrings started with capital letters in a Python string?

I have this string which is a mix between a title and a regular sentence (there is no separator separating the two).
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
The title actually ends at the word Vaccines; Before the pandemic... is a completely separate sentence.
How do I remove the substring up to the word Vaccines? My idea was to remove everything from "Read more:" through the capitalized words that follow it, stopping one capitalized word early (at Before). But I don't know how to handle a conjunction or preposition that doesn't need to be capitalized in a title, like the word the.
I know there is a title() method that converts a string to title case in Python, but is there any function that can detect whether a substring is a title?
I have tried the following using regular expression.
import re
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
res = re.sub(r"\s*[A-Z]\s*", " ", text)
res
But that just replaced every capital letter (and the whitespace around it) with a space instead.
You can match the title by matching a sequence of capitalized words and words that can be non-capitalized in titles.
^(?:Read\s+more\s*:)?\s*(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)*(?=[A-Z])
See the regex demo.
Details:
^ - start of string
(?:Read\s+more\s*:)? - an optional non-capturing group matching Read, one or more whitespaces, more, zero or more whitespaces and a :
\s* - zero or more whitespaces
(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)* - zero or more sequences of
(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of) - a capitalized word that may contain any non-whitespace chars or one of the words that can stay non-capitalized in an English title
\s+ - one or more whitespaces
(?=[A-Z]) - followed by an uppercase letter.
NOTE: You mentioned your language is not English, so:
You need to find the list of words in your language that may stay non-capitalized in a title and use them instead of the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of
You might want to replace [A-Z] with \p{Lu} to match any Unicode uppercase letter and \S* with \p{L}* to match zero or more Unicode letters, BUT make sure you then use the PyPI regex library, as Python's built-in re does not support Unicode category classes.
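A minimal sketch of using the English pattern above with the built-in re module, both to extract the title and to strip it from the text. The names TITLE_RE, title and body are only illustrative.

```python
import re

TITLE_RE = re.compile(
    r"^(?:Read\s+more\s*:)?\s*"
    r"(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet"
    r"|[st]o|around|by|after|along|from|of)\s+)*(?=[A-Z])"
)

text = ("Read more: Indonesia to Get Moderna Vaccines Before the pandemic "
        "began, a lot of people were....")

match = TITLE_RE.match(text)
# The match includes the trailing whitespace before the next sentence
title = match.group().strip() if match else ""
# Remove the matched prefix, leaving only the sentence that follows
body = TITLE_RE.sub("", text, count=1)

print(title)
print(body)
```

This is a heuristic: it stops before Before only because the word after it (the) is followed by a non-capitalized word that is not in the list, which makes the regex backtrack to the last position that precedes an uppercase letter.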
Why don't you just use slicing?
title = text[:44]
print(title)
Read more: Indonesia to Get Moderna Vaccines

parsing a sentence - match inflections and skip punctuation

I'm trying to parse sentences in Python. For any sentence I get, I should take only the words that appear after the word 'say' or 'ask' (if neither word appears, I should take the whole sentence).
I simply did it with a regular expression:
sen = re.search('(?s)(?<=say|Say).*$', current_game_row["sentence"], re.M | re.I)
(This only handles 'say', but adding 'ask' is not a problem.)
The problem is that if the sentence has punctuation such as a comma or colon (,:) after the word 'say', the match includes it too.
Someone suggested that I use nltk tokenization for this, but I'm new to Python and don't understand how to use it. I see that nltk has RegexpParser, but I'm not sure how to use it.
Please help me :-)
I forgot to mention that I also want to recognize 'said', 'asked', etc., and I don't want to match words that merely contain 'say' or 'ask' (I'm not sure there are such words...).
In addition, if there are multiple occurrences of 'say' or 'ask', I only want to match from the first one in the sentence.
Everything after a Keyword
We can deal with the unwanted punctuation by using \W+ to consume the non-word characters (punctuation and whitespace) that follow the keyword.
import re
sentence = "Hearsay? With masked flasks I said: abracadabra"
keys = '|'.join(['ask', 'asks', 'asked', 'say', 'says', 'said'])
# \b on both sides keeps "Hearsay", "masked" and "flasks" from matching
result = re.search(rf'\b({keys})\b\W+(.*)', sentence, re.S | re.I)
if result is None:
    print(sentence)
else:
    print(result.group(2))
Output:
abracadabra
case-insensitive: You already pass the re.I flag, so the separate Say alternative can be removed.
multi-line: You pass re.M, which makes ^ match not only at the start of the string but also right after every \n within it (and $ before every \n). We can drop this flag since the new pattern uses neither ^ nor $.
dot-matches-all: You have (?s), which makes . match everything, including \n. This is the same as passing the re.S flag.
The two flags are independent (re.M changes ^ and $, re.S changes .), so having both does no harm. I think your sentence might be a text blob with newlines inside, so I removed re.M and kept (?s) as re.S.
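To check the requirements from the question's edit (inflections, first occurrence only, fall back to the whole sentence), the same regex can be wrapped in a small helper. This is only a sketch; the function name text_after_keyword and the extra test sentences are made up for illustration.

```python
import re

KEYWORDS = ["ask", "asks", "asked", "say", "says", "said"]
ALTERNATION = "|".join(KEYWORDS)
# \b...\b matches whole words only; \W+ skips punctuation after the keyword
KEYWORD_RE = re.compile(rf"\b(?:{ALTERNATION})\b\W+(.*)", re.S | re.I)


def text_after_keyword(sentence):
    """Return the text after the first keyword, or the whole sentence."""
    match = KEYWORD_RE.search(sentence)
    return match.group(1) if match else sentence


print(text_after_keyword("Hearsay? With masked flasks I said: abracadabra"))
print(text_after_keyword("He said, then she asked: why?"))
print(text_after_keyword("Nothing to see here"))
```

Because re.search returns the leftmost match, only the first keyword counts; later keywords simply become part of the captured remainder.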

RegEx : Capturing words within Quotation mark

I have a paragraph of text like this:
John went out for a walk. He met Mrs. Edwards and said, 'Hello Mam how are you doing today?'. She replied 'I'm fine. How are you?'.
I would like to capture the words within the single quotes.
I tried this regex
re.findall(r"(?<=([']\b))((?=(\\?))\2.)*?(?=\1))",string)
(from this question: RegEx: Grabbing values between quotation marks)
It returned only single quotes as the output. I don't know what went wrong. Can someone help me?
Python requires capturing groups to be fully closed before any backreferences (\2) to the group.
You can use Positive Lookbehind (?<=[\s,.]) and Positive Lookahead (?=[\s,.]) zero-length assertions to match words inside single quotes, including words such as I'm, i.e.:
re.findall(r"(?<=[\s,.])'.*?'(?=[\s,.])", string)
Full match 56-92 'Hello Mam how are you doing today?'
Full match 106-130 'I'm fine. How are you?'
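For reference, a runnable sketch of the call above on the paragraph from the question, using re.finditer to also show the match offsets:

```python
import re

string = ("John went out for a walk. He met Mrs. Edwards and said, "
          "'Hello Mam how are you doing today?'. "
          "She replied 'I'm fine. How are you?'.")

# A quote must be preceded and followed by whitespace, a comma or a dot,
# so the apostrophe in I'm is not treated as a closing quote.
pattern = re.compile(r"(?<=[\s,.])'.*?'(?=[\s,.])")

quotes = pattern.findall(string)
spans = [m.span() for m in pattern.finditer(string)]

for (start, end), quote in zip(spans, quotes):
    print(start, end, quote)
```

Note that the lookbehind needs a preceding character, so a quotation that opens at the very start of the string would not be matched by this pattern.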

Distinguish quotes ' and apostrophes while tokenizing with regex

Given some text, I want to tokenize it into words properly. The text may contain:
words with an apostrophe in the middle (Can't, I'll, the accountant‘s books)
words with an apostrophe at the end (the employers‘ association, I spent most o’ the day replacin’ the broken bit)
quotes directly after a word or between words, like: word'word
text that is split into sentences, but there can be many sentences inside a quote, and a word with an apostrophe can appear inside a quote
different symbols for quotes: either ' for both opening and closing, or ' on one side and ` or ´ on the other, etc.
What would be your suggestion to solve it?
Is it solvable with regex (Python re, for example)?
I want words with apostrophes to stay unsplit and quotes to be split from word tokens.
Parsing common text, The Fellowship Of The Ring.txt for example, is a little tricky:
input : had hardly any 'government'.
output: ["had","hardly","any","'","government","'"] (recognized as quote)
A rather larger body, varying at need, was employed to 'beat the bounds'
is a quote, but it is tricky because of the ending s'
'It isn't natural, and trouble will come of it!' has an apostrophe inside a quote
'Elves and Dragons'_ I says to him. is a quote; however, it ends in s' again.
My suggestion would be to try to break down your cases. If you want to split by words (meaning that a word has spaces on both ends) probably a simple split would do its job.
>>> my_str = "words like that'"
>>> my_str.split(' ')
['words', 'like', "that'"]
>>>
If it's more complicated, regex seems to be a better idea. You can use (a|b), meaning match a or b. My suggestion would be to experiment more, the perfect place to experiment is here: regex101.com. To make things clearer select 'Python' in the left panel!
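As a starting point for such experiments, here is a minimal regex sketch (not a complete solution). It assumes that an apostrophe with word characters on both sides belongs to the word and that any other punctuation mark is its own token. The names TOKEN_RE and tokenize are illustrative.

```python
import re

# Either a word that may contain internal apostrophes (isn't, accountant‘s),
# or a single non-word, non-space character such as a quote mark or a dot.
TOKEN_RE = re.compile(r"\w+(?:['’‘]\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("had hardly any 'government'."))
print(tokenize("'It isn't natural'"))
```

This handles the 'government' and isn't cases from the question, but it cannot tell a trailing apostrophe (employers‘, replacin’) from a closing quote, and it treats word'word as a single token. Those cases need context beyond a single regex pass.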
