Extract words that begin with capital letters - python

I have a string like this
text1="sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."
I want to extract the words in this text that begin with a capital letter but do not follow a full stop. So [Takocok The New England Journal of Medicine] should be extracted, but not [That's Allan].
I tried this regex, but it still extracts Allan and That's.
t=re.findall("((?:[A-Z]\w+[ -]?)+)",text1)

Here is an option using re.findall:
import re
text1 = "sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."
matches = re.findall(r'(?:(?<=^)|(?<=[^.]))\s+([A-Z][a-z]+)', text1)
print(matches)
This prints:
['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']
Here is an explanation of the regex pattern:
(?:(?<=^)|(?<=[^.])) assert that what precedes is either the start of the string,
or a non full stop character
\s+ then match (but do not capture) one or more spaces
([A-Z][a-z]+) then match AND capture a word starting with a capital letter

It's probably possible to find a single regular expression for this case, but it tends to get messy.
Instead, I suggest a two-step approach:
split the text into tokens
work on these tokens to extract the interesting words
tokens = [
'sedentary',
'.',
' ',
'Allan',
' ',
'Takocok',
'.',
' ',
'That\'s',
…
]
This token splitting is already complicated enough.
Using this list of tokens, it is easier to express the actual requirements since you now work on well-defined tokens instead of arbitrary character sequences.
I kept the spaces in the token list because you might want to distinguish between 'a.dotted.brand.name' or 'www.example.org' and the dot at the end of a sentence.
Using this token list, it is easier than before to express rules like "must be preceded immediately by a dot".
I expect that your rules will get quite complicated over time, since you are dealing with natural-language text. That is why I suggest the abstraction to tokens.
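The two-step approach could be sketched as follows. This is only an illustration: the tokenizing pattern and the helper name capitalized_not_after_dot are assumptions, not part of the answer above, and the rule "skip a word whose previous non-space token is a dot" is one possible reading of the requirement.

```python
import re

text1 = ("sedentary. Allan Takocok. That's the conclusion of two studies "
         "published in this week's issue of The New England Journal of Medicine.")

# Step 1: split into word, whitespace and punctuation tokens.
# Assumed pattern: words may contain apostrophes; anything else is one token.
tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*|\s+|[^\sA-Za-z]", text1)


def capitalized_not_after_dot(tokens):
    """Return capitalized words whose previous non-space token is not a dot."""
    result = []
    previous = None  # last non-whitespace token seen so far
    for token in tokens:
        if token.isspace():
            continue
        # Assumption: a word at the very start of the text is also skipped.
        if token[0].isupper() and previous is not None and previous != ".":
            result.append(token)
        previous = token
    return result


words = capitalized_not_after_dot(tokens)
print(words)
```

Because the rule lives in ordinary Python rather than inside the pattern, adding further conditions later does not require touching the tokenizer.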

This should be the regex you're looking for:
(?<!\.)\s+([A-Z][A-Za-z]+)
See the regex101 here: https://regex101.com/r/EoPqgw/1
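One caveat with this pattern, shown as a small sketch: the lookbehind only inspects the single character before the whitespace run, so two spaces after a full stop let the match start at the second space. The variant named stricter below is an assumption of mine, not part of the answer; it also rejects a preceding whitespace character.

```python
import re

pattern = r"(?<!\.)\s+([A-Z][A-Za-z]+)"
# Assumed variant: also refuse to start a match in the middle of a whitespace run.
stricter = r"(?<![.\s])\s+([A-Z][A-Za-z]+)"

single = "sedentary. Allan Takocok. That's the conclusion."
double = "sedentary.  Allan Takocok."  # two spaces after the full stop

print(re.findall(pattern, single))   # Allan is skipped
print(re.findall(pattern, double))   # Allan slips through
print(re.findall(stricter, double))  # Allan is skipped again
```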

Related

How to use regular expressions in python to split articles based on punctuation

I need to divide the article into sentences by punctuation. I use the following regular expression:
re.split(r'[,|.|?|!]', strContent)
It does work, but there is a problem. It will separate the following Latin names that should not be split (such as G. lucidum):
Many studies to date have described the anticancer properties of G. lucidum,
The abbreviation of this Latin name is a capital letter followed by a dot and a space.
So I try to modify the above regular expression as follows:
re.split(r'[,|(?:[^A-Z].)|?|!]', strContent)
However, the following error prompt was received:
re.error: unbalanced parenthesis
How can I modify this regular expression?
You should use a negative lookbehind, and put it before the character set that matches the sentence ending.
The negative lookbehind should match a word that's just a single capital letter. This can be done by matching a word boundary before the letter with \b.
You also don't need | inside the character set. There it is a literal pipe character; alternation with | only works outside of square brackets.
re.split(r'(?<!\b[A-Z])[,.?!]', strContent)
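As a quick check of that pattern, here is a minimal sketch on a made-up sentence (the sample text and the variable names are assumptions for illustration). The split leaves leading spaces and a trailing empty string, so the pieces are stripped afterwards.

```python
import re

str_content = "Many studies describe G. lucidum, a fungus. Is it useful? Yes!"

# The lookbehind blocks a split right after a single capital letter such as "G".
parts = re.split(r'(?<!\b[A-Z])[,.?!]', str_content)

# Drop surrounding whitespace and the empty piece after the final "!".
sentences = [p.strip() for p in parts if p.strip()]
print(sentences)
```

Note that the same lookbehind also prevents a split after any other one-letter capitalized word at the end of a sentence, for example "vitamin C.".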
Using pure regex to find complete sentences is difficult, because of edge cases such as abbreviations, which you have been seeing. You should use an NLP library like NLTK instead.
from nltk.tokenize import sent_tokenize
text = "Many studies to date have described the anticancer properties of G. lucidum. The studies are vast."
print(sent_tokenize(text))
# ['Many studies to date have described the anticancer properties of G. lucidum.', 'The studies are vast.']

Create Python regex for specific sentence pattern

I'm trying to build a regex pattern that can capture the following examples:
pattern1 = '.She is greatThis is annoyingWhy u do this'
pattern2 = '.Weirdly specificThis sentence is longer than the other oneSee this is great'
example = 'He went such dare good mr fact. The small own seven saved man age no offer. Suspicion did mrs nor furniture smallness. Scale whole downs often leave not eat. An expression reasonably cultivated indulgence mr he surrounded instrument. Gentleman eat and consisted are pronounce distrusts.This is where the fun startsSummer is really bothersome this yearShe is out of ideas'
example_pattern_goal = 'This is where the fun startsSummer is really bothersome this yearShe is out of ideas'
Essentially, it's always a dot followed by sentences of various length not including any numbers. I only want to capture these specific sentences, so I tried to capture instances where a dot was immediately followed by a word that starts with an uppercase and other words that include two instances where an uppercase letter is inside the word.
So far, I've only come up with the following regex that doesn't quite work:
'.\b[A-Z]\w+[\s\w]+\b\w+[A-Z]\w+\b[\s\w]+\b\w+[A-Z]\w+\b[\s\w]+'
You can use
\.([A-Z][a-z]*(?:\s+[A-Za-z]+)*\s+[a-zA-Z]+[A-Z][a-z]+(?:\s+[A-Za-z]+)*)
See the regex demo.
Details:
\. - a dot
[A-Z][a-z]* - an ASCII word starting with an uppercase letter
(?:\s+[A-Za-z]+)* - zero or more sequences of one or more whitespaces and then an ASCII word
\s+ - one or more whitespaces
[a-zA-Z]+[A-Z][a-z]+ - an ASCII word with an uppercase letter inside it
(?:\s+[A-Za-z]+)* - zero or more sequences of one or more whitespaces and then an ASCII word.
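A short sketch of how the pattern can be applied with re.search, using a shortened version of the example string from the question. Note that the pattern requires at least one word with an inner uppercase letter; it does not insist on exactly two.

```python
import re

pattern = re.compile(
    r"\.([A-Z][a-z]*(?:\s+[A-Za-z]+)*\s+[a-zA-Z]+[A-Z][a-z]+(?:\s+[A-Za-z]+)*)"
)

# Shortened form of the question's example (last normal sentence plus the run-on part).
example = ("Gentleman eat and consisted are pronounce distrusts."
           "This is where the fun startsSummer is really bothersome this yearShe is out of ideas")

match = pattern.search(example)
captured = match.group(1) if match else None
print(captured)
```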

How do I remove the substrings started with capital letters in a Python string?

I have this string which is a mix between a title and a regular sentence (there is no separator separating the two).
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
The title actually ends at the word Vaccines; "Before the pandemic" starts another sentence that is completely separate from the title.
How do I remove the substring up to the word Vaccines? My idea was to remove everything from "Read more:" through the following words that start with a capital letter, stopping one word early (at Before). But I don't know what to do when it meets a conjunction or preposition that doesn't need to be capitalized in a title, like the word the.
I know there is a function title() to convert a string into a title format in Python, but is there any function that can detect if a substring is a title?
I have tried the following using regular expression.
import re
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
res = re.sub(r"\s*[A-Z]\s*", " ", text)
res
But it just removed all words starting with capital letters instead.
You can match the title by matching a sequence of capitalized words and words that can be non-capitalized in titles.
^(?:Read\s+more\s*:)?\s*(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)*(?=[A-Z])
See the regex demo.
Details:
^ - start of string
(?:Read\s+more\s*:)? - an optional non-capturing group matching Read, one or more whitespaces, more, zero or more whitespaces and a :
\s* - zero or more whitespaces
(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)* - zero or more sequences of
(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of) - a capitalized word that may contain any non-whitespace chars or one of the words that can stay non-capitalized in an English title
\s+ - one or more whitespaces
(?=[A-Z]) - followed with an uppercase letter.
NOTE: You mentioned your language is not English, so
You need to find the list of words in your language that may stay non-capitalized in a title and use them instead of the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of
You might want to replace [A-Z] with \p{Lu} to match any Unicode uppercase letter and \S* with \p{L}* to match zero or more Unicode letters, BUT make sure you then use the PyPI regex library, as Python's built-in re does not support the Unicode category classes.
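For the English case, the pattern can be used with re.sub to drop the title, roughly like this. The names TITLE_PREFIX and body are made up for the sketch.

```python
import re

# The pattern from above, split over several lines for readability.
TITLE_PREFIX = re.compile(
    r"^(?:Read\s+more\s*:)?\s*"
    r"(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet"
    r"|[st]o|around|by|after|along|from|of)\s+)*"
    r"(?=[A-Z])"
)

text = ("Read more: Indonesia to Get Moderna Vaccines "
        "Before the pandemic began, a lot of people were....")

# Remove the matched title; what remains starts at the last capitalized word.
body = TITLE_PREFIX.sub("", text)
print(body)
```

This relies on the sentence after the title starting with exactly one capitalized word followed by a word that is neither capitalized nor in the list of small title words; otherwise more or less text may be removed.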
Why don't you just use slicing?
title = text[:44]
print(title)
Read more: Indonesia to Get Moderna Vaccines

Remove duplicated punctuation in a string

I'm working on cleaning some text like the one below:
Great talking with you. ? See you, the other guys and Mr. Jack Daniels next week, I hope-- ? Bobette ? ? Bobette Riner??????????????????????????????? Senior Power Markets Analyst?????? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com ? ? - cinhrly020101.doc
It has multiple spaces and question marks, to clean it I'm using regular expressions:
import re
def remove_duplicate_characters(text):
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    # Collapse runs of question marks (and any whitespace before them) into one "?"
    text = re.sub(r"\s*\?+", "?", text)
    return text
remove_duplicate_characters(msg)
Which gives me the following result:
'Great talking with you.? See you, the other guys and Mr. Jack Daniels next week, I hope--? Bobette? Bobette Riner? Senior Power Markets Analyst? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com? - cinhrly020101.doc'
For this particular case it does work, but it does not look like the best approach if I want to add more characters to remove. Is there an optimal way to solve this?
To replace all consecutive punctuation chars with their single occurrence you can use
re.sub(r"([^\w\s]|_)\1+", r"\1", text)
If the leading whitespace must be removed, use the r"\s*([^\w\s]|_)\1+" regex.
See the regex demo online.
In case you want to introduce exceptions to this generic regex, you may add an alternative on the left where you capture all the contexts where you want the consecutive punctuation to be kept:
re.sub(r'((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+', r'\1\2', text)
See this regex demo.
The ((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+ regex matches and captures a ... (not enclosed by other dots on either end) and a :// string (commonly seen in URLs), and the rest is the original regex with the adjusted backreference (since there are now two capturing groups).
The \1\2 in the replacement pattern puts the captured values back into the resulting string.
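The difference between the two patterns can be seen in a small sketch. The function names and the sample string are made up for illustration. The \1\2 replacement relies on Python 3.5 or later, where a group that did not take part in the match is substituted as an empty string.

```python
import re


def collapse_punctuation(text):
    """Replace every run of one repeated punctuation char with a single one."""
    return re.sub(r"([^\w\s]|_)\1+", r"\1", text)


def collapse_punctuation_keep(text):
    """Same as above, but keep a three-dot ellipsis and '://' untouched."""
    return re.sub(r"((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+", r"\1\2", text)


sample = "Wait... what??? See http://example.com!!"
print(collapse_punctuation(sample))       # the ellipsis and the URL are damaged
print(collapse_punctuation_keep(sample))  # only "???" and "!!" are collapsed
```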

Multiple negative lookbehind assertions in python regex?

I'm new to programming, sorry if this seems trivial: I have a text that I'm trying to split into individual sentences using regular expressions. With the .split method I search for a dot followed by a capital letter like
"\. A-Z"
However, I need to refine this rule in the following way: the . (dot) may not be preceded by either Abs or S. And if it is followed by a capital letter (A-Z), it should still not match if that letter starts a month name, like January | February | March.
I tried implementing the first half, but even this did not work. My code was:
"( (?<!Abs)\. A-Z) | (?<!S)\. A-Z) ) "
First, I think you may want to replace the space with \s+, or \s if it really is exactly one space (you often find double spaces in English text).
Second, to match an uppercase letter you have to use [A-Z]; a bare A-Z only matches the literal text "A-Z" (and remember there may be uppercase letters other than A-Z ...).
Additionally, I think I know why this does not work. With |, the regular expression engine matches \. [A-Z] if the dot is not preceded by Abs or is not preceded by S. If it is preceded by S, it is not preceded by Abs, so the first alternative matches. If it is preceded by Abs, it is not preceded by S, so the second alternative matches. Either way, one of the alternatives will match, since Abs and S are mutually exclusive.
The pattern for the first part of your question could be
(?<!Abs)(?<!S)(\. [A-Z])
or
(?<!Abs)(?<!S)(\.\s+[A-Z])
(with my suggestion)
That is because you have to avoid |. Without it, the expression says "not preceded by Abs and not preceded by S". Only if both are true does the pattern matcher go on to match the dot and the capital letter.
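The difference between alternating and chaining the lookbehinds can be checked with a small sketch (the sample sentence and variable names are made up for illustration):

```python
import re

text = "See Abs. Two and S. Three. Next one."

# Alternation: the dot passes if EITHER lookbehind succeeds, so nothing is excluded.
either = re.findall(r"(?:(?<!Abs)|(?<!S))\.\s+[A-Z]", text)

# Chaining: BOTH lookbehinds must succeed at the same position.
both = re.findall(r"(?<!Abs)(?<!S)\.\s+[A-Z]", text)

print(either)
print(both)
```

Only the chained version skips the dots after Abs and S.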
To exclude the month names I came up with this regular expression:
(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]
The same arguments hold for the negative look ahead patterns.
I'm adding a short answer to the question in the title, since this is at the top of Google's search results:
The way to have multiple negative lookbehinds of different lengths is to chain them together like this:
"(?<!1)(?<!12)(?<!123)example"
This would match example, 2example and 3example, but not 1example, 12example or 123example.
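A minimal sketch that checks this claim. It also shows why chaining is needed: the built-in re module rejects a single lookbehind whose alternatives have different lengths. The candidate list and variable names are made up for illustration.

```python
import re

pattern = re.compile(r"(?<!1)(?<!12)(?<!123)example")

candidates = ["example", "2example", "3example",
              "1example", "12example", "123example"]
matched = [s for s in candidates if pattern.search(s)]
print(matched)

# A single lookbehind with alternatives of different widths is an error in re.
try:
    re.compile(r"(?<!Abs|S)\.")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```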
Use nltk punkt tokenizer. It's probably more robust than using regex.
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
Use nltk or similar tools as suggested by @root.
To answer your regex question:
import re
import sys
print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])",
               sys.stdin.read()))
Input
First. Second. January. Third. Abs. Forth. S. Fifth.
S. Sixth. ABs. Eighth
Output
['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth',
'S. Sixth', 'ABs', 'Eighth']
You can use a character set [] when every excluded prefix is a single character.
'(?<![123])example'
This would not match 1example, 2example, or 3example.
