Removing varying text phrases through RegEx in a Python Data frame


Basically, I want to remove certain phrase patterns embedded in my text data:
Starts with an upper case letter and ends with an Em Dash "—"
Starts with an Em Dash "—" and ends with a "Read Next"
Say, I've got the following data:
CEBU CITY—The widow of slain human rights lawyer .... citing figures from the NUPL that showed that 34 lawyers had been killed in the past two years. —WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next
and
Manila, Philippines—President .... but justice will eventually push its way through their walls of impunity, ... —REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next
I want to remove the following phrases:
"CEBU CITY—"
"—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
"Manila, Philippines—"
"—REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next"
I assume this needs two regexes, one for each of the patterns listed above.
The regex —[A-Z].*Read Next\s*$ may work for pattern #2, but only when there are no other em dashes in the text. It fails when pattern #1 also occurs, because it removes everything from the first em dash it sees up to the "Read Next" string.
I have tried the following regex for pattern #1:
^[A-Z]([A-Za-z]).+(—)$
Why does it not work? That regex was supposed to look for a phrase that starts with any upper case letter, followed by a string of any length, as long as it ends with an "—".

The character you are treating as a hyphen (-) is not a hyphen but an em dash, so the regex needs an em dash instead of a hyphen at the start:
^—[A-Z].*Read Next\s*$
Here is the explanation for this regex,
^ --> Start of input
— --> Matches a literal Em Dash whose Unicode Decimal Code is 8212
[A-Z] --> Matches an upper case letter
.* --> Matches any character zero or more times
Read Next --> Matches these literal words
\s* --> This is for matching any optional white space that might be present at the end of line
$ --> End of input
Online demo
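A regex anchored with ^ only matches when the em dash is the very first character of the string, so it will not find a credit line that sits at the end of an article. Here is a minimal sketch of one way to handle both patterns from the question with two separate substitutions. The names DATELINE, CREDITS and strip_boilerplate are made up for illustration, and the 40-character cap on the dateline is an arbitrary assumption so that an article without a dateline is not eaten up to its first em dash.

```python
import re

# Pattern 1: a short dateline that starts with an upper-case letter and
# runs up to the first em dash (the 40-character cap is an assumption).
DATELINE = re.compile(r'^[A-Z][^—]{0,40}—')
# Pattern 2: the last em dash, everything after it, and the trailing "Read Next".
CREDITS = re.compile(r'\s*—[^—]*Read Next\s*$')

def strip_boilerplate(text):
    """Remove the leading dateline and the trailing credit line, if present."""
    text = DATELINE.sub('', text)
    return CREDITS.sub('', text)

samples = [
    "CEBU CITY—The widow of slain human rights lawyer .... citing figures "
    "from the NUPL that showed that 34 lawyers had been killed in the past "
    "two years. —WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next",
    "Manila, Philippines—President .... but justice will eventually push its "
    "way through their walls of impunity, ... —REPORTS FROM MELVIN GASCON, "
    "JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, "
    "GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next",
]
cleaned = [strip_boilerplate(s) for s in samples]
for line in cleaned:
    print(line)
```

On a DataFrame column the same two patterns could be applied with Series.str.replace(pattern, '', regex=True), one call per pattern.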

The regex that should take care of this -
^—[A-Z]+(.)*(Read Next)$
You can try implementing this regex on your data and see if it works out.

Related

Using RegEx in Python to extract contents

Good evening,
I am very new to Python and RegEx. I have the following sentence:
-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 +100.00 3257 UpAmex Top PM 9:55 +300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here
I would like to search for just '-' and the amount after that.
After that, I would like to skip 2 words and extract ALL words if need be in a single group (I will read more about groups but for now i would need in a single group, which i can later use to split and get the words from that string) just before 'Paid'
For instance, I would get
-75.76 ASIA Direct to
-400 PTE AXS to
What would be the regex command? Also, is there a good regex tutorial where I can read up on?

For now I have created one match with two groups, i.e. group 1 for the amount and group 2 for all the words (including the "to " string).
Regex:
(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid
You can check the details here: https://regex101.com/r/eUMgdW/1
Python code:
import re
# A raw string keeps the backslashes in \d and \w intact
output = re.findall(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid", your_input_string)
for found in output:
    print(found)
#('-75.76', 'ASIA DIRECT to ')
#('-400.00', 'PTE AXS to ')
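Since the question mentions splitting the captured words afterwards, here is a small sketch of that step. It uses a shortened copy of the sample statement; the group names amount and words are my own choice, and the amount part is loosened to (?:\.\d+)? so that a single-digit amount would also match, which is an assumption about the data.

```python
import re

# Shortened copy of the statement from the question
statement = (
    "-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 "
    "+100.00 3257 UpAmex Top PM 9:55 "
    "-400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57"
)

# Negative amount, two skipped words, then the words just before "Paid"
pattern = re.compile(r"(?P<amount>-\d+(?:\.\d+)?) \w+ \w+ (?P<words>[\w ]+?) ?Paid")

rows = [
    (m.group("amount"), m.group("words").split())
    for m in pattern.finditer(statement)
]
print(rows)
```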

Rather than give you the actual regex, I'll gently nudge you in the right direction. It's more satisfying that way.
"Words" here are separated by spaces. So what you're searching for is a group of characters (captured), a space, characters again, a space, characters, a space, then capture everything and end with "Paid". Try to create a regex to do that.
If you'd like to brush up on regex, check out Regex101. It's a web tool to test out regex, along with a debugger and a cheat sheet.

Capture the n previous words when matching a string

Let's say I have this text:
abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)
I want to capture these personal names:
Mark Jones, Taylor Daniel Lautner, Allan Stewart Konigsberg Farrow.
Basically, when we find (P followed by any capital letter, we capture the n previous words that start with a capital letter.
What I have achieved so far is to capture just one previous word with this code: \w+(?=\s+(\(P+[A-Z])). But I couldn't evolve from that.
I appreciate it if someone can help :)

Regex pattern
\b((?:[A-Z]\w+\s?)+)\s\(P[A-Z]
In order to find all matching occurrences of the above regex pattern we can use re.findall
import re
text = """abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)
"""
matches = re.findall(r'\b((?:[A-Z]\w+\s?)+)\s\(P[A-Z]', text)
>>> matches
['Mark Jones', 'Taylor Daniel Lautner', 'Allan Stewart Konigsberg Farrow']
Regex details
\b : Word boundary to prevent partial matches
((?:[A-Z]\w+\s?)+): First Capturing group
(?:[A-Z]\w+\s?)+: Non capturing group matches one or more times
[A-Z]: Matches a single alphabet from capital A to Z
\w+: Matches any word character one or more times
\s? : Matches any whitespace character zero or one times
\s : Matches a single whitespace character
\(: Matches the character ( literally
P : Matches the character P literally
[A-Z] : Matches a single alphabet from capital A to Z
See the online regex demo
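For comparison, the lookahead attempt from the question can also be extended to several words. This is only a sketch: it repeats a "capitalised word plus whitespace" group and strips the trailing whitespace afterwards, and the variable names are made up for the example.

```python
import re

text = """abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)"""

# One or more capitalised words, each followed by whitespace,
# and only if "(P" plus a capital letter comes next.
pattern = re.compile(r"\b(?:[A-Z]\w*\s+)+(?=\(P[A-Z])")

names = [m.group().strip() for m in pattern.finditer(text)]
print(names)
```

The \b matters here: without it the match on the second line could start at the "AS" inside "akslaskAS".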

With your shown samples, you could try the following, which uses Python's re library. First, findall with (.*?)\s+\((?=P[A-Z]) captures everything on each line up to an opening parenthesis that is followed by P and a capital letter, which gives the list lst. Then re.sub removes the first run of non-space characters and the spaces after it from each item, leaving only the names.
import re
var="""abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)"""
lst = re.findall(r'(.*?)\s+\((?=P[A-Z])',var)
[re.sub(r'^\S+\s+','',s) for s in lst]
Output will be as follows:
['Mark Jones', 'Taylor Daniel Lautner', 'Allan Stewart Konigsberg Farrow']

Regex to find name in sentence

I have some sentence like
1:
"RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held
ball is correctly called."
2:
"Nurkic (POR) maintains legal
guarding position and makes incidental contact with Wall (WAS) that
does not affect his driving shot attempt."
I need to use Python regex to find the name "Oubre Jr." ,"Nurkic" and "Nurkic", "Wall".
p = r'\s*(\w+?)\s[(]'
use this pattern,
I can find "['Nurkic', 'Wall']", but in sentence 1, I just can find ['Nurkic'], missed "Oubre Jr."
Who can help me?

You can use the following regex:
(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()
|-----Main Pattern-----|
Details:
(?:) - Creates a non-capturing group
[A-Z] - Matches 1 uppercase letter
[a-z] - Matches 1 lowercase letter
[\s\.a-z]* - Matches whitespace, periods ('.') or lowercase letters 0+ times
(?=\s\() - Lookahead: the main pattern matches only if it is followed by a whitespace character and '('
import re
# Named text rather than str, so the built-in str is not shadowed
text = '''RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called.
Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt.'''
res = re.findall(r'(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()', text)
print(res)
Demo: https://repl.it/#RahulVerma8/OvalRequiredAdvance?language=python3
Match: https://regex101.com/r/OsLTrY/1

Here is one approach:
import re
line = "RLB shows Oubre Jr (WAS) legally ties up Nurkic (POR), and a held ball is correctly called."
# Double quotes around the pattern, because the character class contains a single quote
results = re.findall(r"([A-Z][\w+']+(?: [JS][r][.]?)?)(?= \([A-Z]+\))", line, re.M|re.I)
print(results)
['Oubre Jr', 'Nurkic']
The above logic will attempt to match one name, beginning with a capital letter, which is possibly followed by either the suffix Jr. or Sr., which in turn is followed by a ([A-Z]+) term.

You need a pattern that you can match. For your sentences you could try to match the word before (XXX) and also include a list of possible "suffixes"; you would need to extract those from your sources.
import re
suffs = ["Jr."] # append more to list
rsu = r"(?:"+"|".join(suffs)+")? ?"
# combine with suffixes
regex = r"(\w+ "+rsu+r")\(\w{3}\)"
test_str = "RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called. Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt."
matches = re.finditer(regex, test_str, re.MULTILINE)
names = []
for match in matches:
    # the pattern has a single capturing group: the name
    names.extend(match.groups())
print(names)
Output:
['Oubre Jr. ', 'Nurkic ', 'Nurkic ', 'Wall ']
This should work as long as you do not have names with non-\w characters in them. Each match keeps the space before the parenthesis, so call .strip() on the names if you do not want it. If you need to adapt the regex, use https://regex101.com/r/pRr9ZU/1 as a starting point.
Explanation:
r"(?:"+"|".join(suffs)+")? ?" --> all items in the list suffs are joined with | (OR) inside a non-capturing group (?:...), which is made optional and followed by an optional space.
r"(\w+ "+rsu+r")\(\w{3}\)" --> the regex looks for word characters followed by a space and the optional suffix group we just built, followed by a literal (, then three word characters, then a literal )

getting words between m and n characters

I am trying to get all names that start with a capital letter and ends with a full-stop on the same line where the number of characters are between 3 and 5
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words between 3 and 5 but haven't succeeded.

Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches a capital letter, then between 2 and 4 of any character (the dot also matches spaces and punctuation), then a literal period. You can tighten that up if required, e.g. [A-Z][a-z]{2,4}\. would match an upper case letter followed by 2 to 4 lowercase letters followed by a literal dot/period.
If you don't want duplicates you can use a set to get rid of them (a set is unordered, so the display order may differ):
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
{'Rosse.', 'King.'}
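As a sketch of the tightened version, the pattern can also be anchored to the start of each line with re.MULTILINE, assuming the speaker names always open a line as they do in the sample. The text below is an abbreviated copy of the passage from the question.

```python
import re

# Abbreviated copy of the passage from the question
text = """King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
King. No more that Thane of Cawdor shall deceiue
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne."""

# Capital letter + 2 to 4 lowercase letters + full stop, at the start of a line
speakers = re.findall(r"^[A-Z][a-z]{2,4}\.", text, flags=re.MULTILINE)
unique_speakers = sorted(set(speakers))

print(speakers)
print(unique_speakers)
```

Sorting the set gives a stable order, which a bare set does not guarantee.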

You may have your own reasons for wanting to use regexs here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
with open('text.txt') as f:
    for line in f:
        for word in line.split():
            # Capitalised, ends with a full stop, 3 to 5 characters before it
            if word[0].isupper() and word[-1] == '.' and 3 <= len(word) - 1 <= 5:
                matched_words.append(word)
print(matched_words)

Why doesn't this regular expression work in all cases?

I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
with open('testfile.txt') as politicians:
    text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile(r'\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile(r'NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
    print(i)
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?

It's missing it because his state name contains two words: SOUTH CAROLINA
Have your second regex be this, it should help
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
Which is an optional, non-capturing group matching a whitespace character followed by one or more word characters (letters, digits or underscore)
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly, after you remove the Nos and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the state in $3 and the name in $4.
Each capturing group works as follows:
(#[\w]+?\s)
This matches an # sign followed by at least one but as few characters as possible until a space.
((?:(?:[\w]+?)\s){1,2})
This matches and captures 1 or two words, which should be the state. This only works because of the next piece, which MUST have two words
((?:[\w]+?\s){2})
Matches and captures exactly two words, where a word is as few word characters as possible followed by a whitespace character
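To see how the groups come out in Python, here is a short sketch run on two entries that already have the NOs and dashes removed. The sample string and variable names are assumptions for the example; the regex is the one given above.

```python
import re

# Two entries with the NOs and dashes already stripped
cleaned = "#jimdemint SOUTH CAROLINA Jim DeMint\n#senmikelee UTAH Mike Lee\n"

master = re.compile(r"((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))")

# Skip group 1 (the full match) and strip the trailing whitespace of each part
records = [
    tuple(part.strip() for part in m.groups()[1:])
    for m in master.finditer(cleaned)
]
print(records)
```

Because the name group must be exactly two words, an entry with a one-word state and a three-word name would most likely be split in the wrong place, so this only suits data shaped like the sample.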

First remove the NOs and dashes:
text=re.sub(r' (NO|-+)(?= |$)','',text)
And to capture everything:
re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))',text)
Or all at once:
re.findall(r'(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))',text)
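If the file has one entry per line, as in the listing in the question (that is an assumption about the real file), a line-anchored pattern with re.MULTILINE is another option. This is a sketch rather than a drop-in replacement, and the names used are made up for the example.

```python
import re

# A few entries copied from the question, one per line
text = """#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee"""

# handle, upper-case state (one or more words), one or more dashes,
# name, and an optional trailing NO that is left out of the name group
entry = re.compile(r"^(#\w+)\s+([A-Z ]+?)\s+-+\s+(.+?)(?:\s+NO)?$", re.MULTILINE)

rows = entry.findall(text)
for handle, state, name in rows:
    print(handle, state, name)
```

The dashes act as the separator between state and name, so a two-word state such as SOUTH CAROLINA needs no special handling.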
