Regular Expression to extract Named Entities from text just based on capitalization

I want a regex in Python that extracts runs of one or more consecutive words starting with a capital letter, unless the word is the first word of a sentence. I know this is not a robust or consistent method, but it will solve my problem, as I don't want to use any statistical method (e.g. as in NLTK or StanfordNER).
Examples:
extract('His name is John Wayne.')
should return ['John Wayne'].
extract('He is The President of Neverland.')
should return ['The President', 'Neverland'] because they are capitalized words and they don't occur at the beginning of a sentence.
Another example:
extract('He came home. Although late, it was nice to have Patrick there.')
should return ['Patrick'] because 'He' and 'Although' occur at the beginning of a sentence.
It should also drop punctuation: for example, 'He was John, who came' should return 'John' and not 'John,'.

You can use this expression for this task:
(?<!\.\s)(?!^)\b([A-Z]\w*(?:\s+[A-Z]\w*)*)
RegEx Demo
RegEx Breakup:
(?<!\.\s) - Negative lookbehind to assert we don't have a DOT and space before
(?!^) - Negative lookahead to assert we are not at start
\b - Word boundary
( - Start capturing group
[A-Z]\w* - Match a word starting with a capital letter
(?: - Start non-capturing group
\s+ - Match 1 or more whitespaces
[A-Z]\w* - Match a capital letter word
)* - End non-capturing group. Match 0 or more of these
) - End capturing group
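As a minimal sketch, the expression above can be wrapped in the extract function that the question asks for. The name CAPITALIZED_RUN is only illustrative. Note that the pattern treats only a dot followed by one whitespace character as a sentence boundary, so a capitalized word after "? " or "! " would still be returned.

```python
import re

# Pattern from the answer above, compiled once for reuse.
# CAPITALIZED_RUN is an illustrative name, not part of the original answer.
CAPITALIZED_RUN = re.compile(r'(?<!\.\s)(?!^)\b([A-Z]\w*(?:\s+[A-Z]\w*)*)')


def extract(text):
    """Return runs of capitalized words that do not start a sentence."""
    # With a single capturing group, findall returns that group's text.
    return CAPITALIZED_RUN.findall(text)


print(extract('His name is John Wayne.'))
print(extract('He is The President of Neverland.'))
print(extract('He came home. Although late, it was nice to have Patrick there.'))
print(extract('He was John, who came'))
```

Because \w does not match punctuation, the trailing comma in 'John,' is never part of the match.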

Related

python regex: match the dot only, not the letter before it

I have a regex pattern as follows:
r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+'
and I am trying to modify it so that it matches only the dot at the end of a sentence, not the letter before it. Here is my string:
sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'
and here is what I have done:
import re
re.split(r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+', sent)
However, this removes the last letter of the words.
Current output:
['This is the U.A. we have r.a.d. golden 13.56 dat',' a better date 34. was ther',
'']
My desired output is:
['This is the U.A. we have r.a.d. golden 13.56 date',' a better date 34. was there',
'']
I do not know how to modify the pattern to keep the last letter of the words 'date' and 'there'.

Your pattern can be reduced to and fixed as
(?<=(?<![.\s])[a-zA-Z])\.
See the regex demo.
If you need to also match multiple dots, put back + after the \..
Details:
(?<=(?<![.\s])[a-zA-Z]) - a positive lookbehind that matches a location that is immediately preceded with
(?<![.\s]) - a negative lookbehind that fails the match if there is a . or whitespace immediately to the left of the current location
[a-zA-Z] - an ASCII letter
\. - a literal dot.
Your pattern is an alternation of two patterns, (?<!\.|\s)[a-z]\. and (?<!\.|\s)[A-Z]\., whose only difference is [a-z] versus [A-Z], so the alternation can be shortened to (?<!\.|\s)[a-zA-Z]\. The [a-zA-Z] must then be put into a non-consuming pattern so that the letter is not consumed (and dropped) when splitting; a positive lookbehind is a natural solution.
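A short sketch of the fixed pattern used with re.split on the string from the question. The names SENTENCE_DOT and split_on_sentence_dots are illustrative only.

```python
import re

# The letter before the dot sits inside a lookbehind, so it is checked
# but not consumed; only the dot itself is removed by the split.
SENTENCE_DOT = re.compile(r'(?<=(?<![.\s])[a-zA-Z])\.')


def split_on_sentence_dots(text):
    """Split text on dots that directly follow the last letter of a word."""
    return SENTENCE_DOT.split(text)


sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'
parts = split_on_sentence_dots(sent)
print(parts)
```

The dots in U.A., r.a.d., 13.56 and 34. are left alone because the character before each one is either not a letter or is a letter preceded by a dot or whitespace.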

Regex for creating an acronym

I'm trying to build a function that will collect an acronym using only regular expressions.
Example:
Data Science = DS
I'm trying to do 3 steps:
Find the first letter of each word
Translate every single letter to uppercase.
Group
Unfortunately I get errors.
I repeat that I need to use the regular expression functionality.
Regular expression for creating an acronym.
some_words = 'Data Science'
all_words_select = r'(\b\w)'
word_upper = re.sub(all_words_select, some_words.upper(), some_words)
print(word_upper)
result:
DATA SCIENCEata DATA SCIENCEcience
Why is the text duplicated?
I plan to get: DATA SCIENCE

You don't need regex for the problem you have stated. You can just split the string on spaces, take the first character of each word, convert it to upper case, and finally join the results.
>>> ''.join(w[0].upper() for w in some_words.split(' '))
'DS'
You also need to handle special cases, such as a word that starts with a non-alphabetic character, with something like if w[0].isalpha().
Another approach, using re.sub and a negative lookbehind:
>>> re.sub(r'(?<!\b).|\s','', some_words)
'DS'
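The split-and-join idea, together with the isalpha() guard mentioned above, might look like the sketch below. The function name acronym is hypothetical, and split() without an argument is a swapped-in choice (the answer uses split(' ')) so that repeated spaces do not produce empty words.

```python
def acronym(text):
    """Build an acronym from the first letter of each word.

    Words that start with a non-alphabetic character are skipped.
    """
    # split() with no argument collapses runs of whitespace, so no word
    # is ever empty and w[0] is always safe to index.
    return ''.join(w[0].upper() for w in text.split() if w[0].isalpha())


print(acronym('Data Science'))
print(acronym('data science 101'))
```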

Use
import re
some_words = 'Data Science'
all_words_select = r'\b(?![\d_])(\w)|.'
word_upper = re.sub(all_words_select, lambda z: z.group(1).upper() if z.group(1) else '', some_words, flags=re.DOTALL)
print(word_upper)
See Python proof.
EXPLANATION
Match a letter at the word beginning => capture (\b(?![\d_])(\w))
Else, match any character (|.)
Whenever capture is not empty replace with a capital variant (z.group(1).upper())
Else, remove the match ('').
Pattern:
\b - the boundary between a word char (\w) and something that is not a word char
(?! - look ahead to see if there is not:
[\d_] - any character of: digits (0-9), '_'
) - end of look-ahead
( - group and capture to \1:
\w - word characters (a-z, A-Z, 0-9, _; for str patterns in Python 3, other Unicode word characters as well)
) - end of \1
| - OR
. - any character except \n (with the re.DOTALL flag, as passed above, \n is matched too)
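To see what the re.DOTALL flag contributes, here is a small sketch that wraps the same pattern in a function and runs it on input containing a newline. The names FIRST_LETTER_OR_ANY and acronym_re are illustrative.

```python
import re

# Either capture a word-initial character that is not a digit or underscore,
# or match any single character (including newlines, thanks to DOTALL).
FIRST_LETTER_OR_ANY = re.compile(r'\b(?![\d_])(\w)|.', flags=re.DOTALL)


def acronym_re(text):
    """Keep the upper-cased first letter of each word; drop everything else."""
    return FIRST_LETTER_OR_ANY.sub(
        lambda m: m.group(1).upper() if m.group(1) else '',
        text,
    )


print(acronym_re('Data Science'))
print(acronym_re('portable network\ngraphics'))
print(acronym_re('top 10 hits'))
```

Without re.DOTALL, the newline in the second input would not be matched by the dot and would survive in the result.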

Regex (Python) - Match words with two or more distinct vowels

I'm attempting to match words in a string that contain two or more distinct vowels. The question can be restricted to lowercase.
string = 'pool pound polio papa pick pair'
Expected result:
pound, polio, pair
pool and papa would fail because they contain only one distinct vowel. However, polio is fine, because even though it contains two os, it contains two distinct vowels (i and o). (mississippi would fail, but albuquerque would pass.)
Thought process: Using a lookaround, perhaps five times (ignore uppercase), wrapped in a parenthesis, with a {2} afterward. Something like:
re.findall(r'\w*((?=a{1})|(?=e{1})|(?=i{1})|(?=o{1})|(?=u{1})){2}\w*', string)
However, this matches on all six words.
I killed the {1}s, which makes it prettier (the {1}s seem to be unnecessary), but it still returns all six:
re.findall(r'\w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w*', string)
Thanks in advance for any assistance. I checked other queries, including "How to find words with two vowels", but none seemed close enough. Also, I'm looking for pure RegEx.

You don't need 5 separate lookaheads, that's complete overkill. Just capture the first vowel in a capture group, and then use a negative lookahead to assert that it's different from the second vowel:
[a-z]*([aeiou])[a-z]*(?!\1)[aeiou][a-z]*
See the online demo.

Your \w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w* regex matches all words that contain at least one vowel. \w* matches 0+ word chars, so the first pattern grabs the whole chunk of letters, digits and underscores. Then backtracking begins: the regex engine gives characters back until it finds a location that is followed by a, e, i, o, or u. Once it finds that location, the characters that were given back are consumed again by the trailing \w*.
To match whole words with at least 2 different vowels, you may use
\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+
See the regex demo.
Details
\b - word boundary
(?=\w*([aeiou])\w*(?!\1)[aeiou]) - a positive lookahead that, immediately to the right of the current location, requires
\w* - 0+ word chars
([aeiou]) - Capturing group 1 (its value is referenced to with \1 backreference later in the pattern): any vowel
\w* - 0+ word chars
(?!\1)[aeiou] - any vowel from the [aeiou] set that is not equal to the vowel stored in Group 1 (due to the negative lookahead (?!\1) that fails the match if, immediately to the right of the current location, the lookahead pattern match is found)
\w+ - 1 or more word chars.
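A minimal sketch of using this pattern from Python. Because the pattern contains a capturing group, re.findall would return only the captured vowel, so the sketch uses finditer and takes the whole match instead. The names TWO_DISTINCT_VOWELS and words_with_two_distinct_vowels are illustrative.

```python
import re

# Lookahead checks for two different vowels; \w+ then consumes the word.
TWO_DISTINCT_VOWELS = re.compile(r'\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+')


def words_with_two_distinct_vowels(text):
    """Return the words in text that contain at least two different vowels."""
    # group(0) is the whole match; group(1) would be just the first vowel.
    return [m.group(0) for m in TWO_DISTINCT_VOWELS.finditer(text)]


string = 'pool pound polio papa pick pair'
print(words_with_two_distinct_vowels(string))
print(words_with_two_distinct_vowels('mississippi albuquerque'))
```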

The shortest pattern I know of that matches words containing at least two distinct vowels: \w*([aeiou])\w*(?!\1)[aeiou]\w*
Demo: https://regex101.com/r/uRgVVa/1
Explanation:
\w*: matches 0 or more word characters. You don't need to start with a word boundary (\b) because \w does not include spaces, so using \b would be redundant.
([aeiou]): [aeiou] matches any one vowel. It is in parentheses so we can reference which vowel was matched later. Whatever is inside these first parentheses is group 1.
\w*: matches 0 or more word characters.
(?!\1): a negative lookahead (?!...) asserting that the next character is not the vowel captured in group 1. \1 is a backreference to whatever group 1 matched, so if group 1 matched a, the character that follows cannot be a. The [aeiou] after the lookahead then matches the second, different vowel.
\w*: matches 0 or more word characters.

Python regex with \w does not work

I want to have a regex to find a phrase and two words preceding it if there are two words.
For example I have the string (one sentence per line):
Chevy is my car and Rusty is my horse.
My car is very pretty my dog is red.
If I use the regex:
re.finditer(r'[\w+\b|^][\w+\b]my car',txt)
I do not get any match.
If I use the regex:
re.finditer(r'[\S+\s|^][\S+\s]my car',txt)
I am getting:
's my car' and '. My car' (I am ignoring case and using multi-line)
Why is the regex with \w+\b not finding anything? It should find two words and 'my car'
How can I get two complete words before 'my car' if there are two words. If there is only one word preceding my car, I should get it. If there are no words preceding it I should get only 'my car'. In my string example I should get: 'Chevy is my car' and 'My car' (no preceding words here)

In your r'[\w+\b|^][\w+\b]my car' regex, [\w+\b|^] matches 1 symbol that is either a word char, a +, a backspace, |, or ^ and [\w+\b] matches 1 symbol that is either a word char, or +, or a backspace.
The point is that inside a character class, quantifiers and a lot (but not all) special characters match literal symbols. E.g. [+] matches a plus symbol, [|^] matches either a | or ^. Since you want to match a sequence, you need to provide a sequence of subpatterns outside of a character class.
It seems as if you intended to use \b as a word boundary, however, \b inside a character class matches only a backspace character.
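The difference can be checked directly. The following sketch only uses small made-up test strings; the variable names are illustrative.

```python
import re

# Inside a character class, \b is the backspace character (\x08).
backspace_hits = re.findall(r'[\b]', 'a\x08b')

# Quantifier and alternation symbols are literal inside a class.
literal_hits = re.findall(r'[+|]', 'a+b|c')

# Outside a class, \b is a zero-width word boundary.
boundary_hits = re.findall(r'\bmy\b', 'my army my')

print(backspace_hits)
print(literal_hits)
print(boundary_hits)
```

In the last call, the "my" at the end of "army" is not matched because there is no word boundary between "r" and "m".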
To find two words and 'my car', you can use, for example
\S+\s+\S+\s+my car
See the regex demo. Here, \S+ matches one or more non-whitespace symbols and \s+ matches 1 or more whitespaces; repeating this pair of subpatterns twice matches two words as a sequence.
To make the sequences before my car optional, just use a {0,2} quantifier like this:
(?:\S+[ \t]+){0,2}my car
See this regex demo (to be used with the re.IGNORECASE flag). See Python demo:
import re
txt = 'Chevy is my car and Rusty is my horse.\nMy car is very pretty my dog is red.'
print(re.findall(r'(?:\S+[ \t]+){0,2}my car', txt, re.I))
Details:
(?:\S+[ \t]+){0,2} - 0 to 2 sequences of 1+ non-whitespaces followed with 1+ space or tab symbols (you may also replace [ \t] with [^\S\r\n] to match any horizontal whitespace, or with \s if you also plan to match across linebreaks).
my car - a literal text my car.
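If the phrase or the number of preceding words needs to vary, the same pattern can be built from parameters. This is only a sketch: find_with_context and its arguments are hypothetical names, and re.escape is added so that a phrase containing regex metacharacters is still matched literally.

```python
import re


def find_with_context(text, phrase, max_words=2):
    """Find phrase together with up to max_words words preceding it on the same line."""
    # [ \t]+ keeps the preceding words on the same line as the phrase.
    pattern = r'(?:\S+[ \t]+){0,%d}%s' % (max_words, re.escape(phrase))
    return re.findall(pattern, text, flags=re.IGNORECASE)


txt = 'Chevy is my car and Rusty is my horse.\nMy car is very pretty my dog is red.'
print(find_with_context(txt, 'my car'))
print(find_with_context(txt, 'my car', max_words=1))
```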

Using Regex to capture phrase

My question is regarding the following tweets:
Credit Suisse Trims Randgold Resources Limited (RRS) Target Price to GBX
JPMorgan Chase & Co Trims Occidental Petroleum Co (OXY) Target Price to
I want to remove "Randgold Resources Limited (RRS)" from the first tweet and "Occidental Petroleum Co (OXY)" from the second tweet using Regex.
I am working in Python and so far I have tried this without much luck:
Trims\s[\w\s.()]+(?=Target)
I want to capture the phrase "Trims Target Price" in both instances. Help would be appreciated.

You can use this lookaround based regex:
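import re
test_str = 'Credit Suisse Trims Randgold Resources Limited (RRS) Target Price to GBX'
p = re.compile(r'(?<= Trims) .*?(?= Target )')
result = re.sub(p, "", test_str)  # 'Credit Suisse Trims Target Price to GBX'
(?<= Trims) .*?(?= Target ) will match any text that is between Trims and Target.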
RegEx Demo

(?<=Trims )([A-Z][a-z]+ ){3}\([A-Z]{3}\)
See it in action
The idea is:
(?<=Trims ) - find a place preceded by Trims using positive lookbehind
[A-Z][a-z]+ - a word starting with a capital letter that continues with one or more lower case letters
([A-Z][a-z]+ ){3} - three such words, each followed by a space
\( and \) - parentheses have to be escaped; otherwise they would open and close a capturing group
[A-Z]{3} - three capital letters

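Your pattern is missing a positive lookbehind, (?<=...), around Trims. A lookbehind matches only if the current position is preceded by the given text, without consuming it, so Trims is kept in the result.
re.sub(r'(?<=Trims)\s[\w\s.()]+(?=Target)', ' ', text)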
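A short sketch applying the lookaround approach to both tweets from the question. The names BETWEEN_TRIMS_AND_TARGET and strip_company are illustrative.

```python
import re

# The lookbehind keeps "Trims" and the lookahead keeps "Target";
# only the text between them is matched and removed.
BETWEEN_TRIMS_AND_TARGET = re.compile(r'(?<= Trims) .*?(?= Target )')


def strip_company(tweet):
    """Remove the company name that sits between 'Trims' and 'Target'."""
    return BETWEEN_TRIMS_AND_TARGET.sub('', tweet)


tweets = [
    'Credit Suisse Trims Randgold Resources Limited (RRS) Target Price to GBX',
    'JPMorgan Chase & Co Trims Occidental Petroleum Co (OXY) Target Price to',
]
for tweet in tweets:
    print(strip_company(tweet))
```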
