Select string which contains punctuation

Select string which contains punctuation - python

so I'm trying to remove title from a set of professors' name.
Like Dr.Eng, Dr.rer.nat, M.S., Dr., S.Si so on and so forth. Basically any string that contains more than one dot.
This is an example list after I have split the name and the title based on ","
2 [CHOTIMAH, Dr., M.S., RINTO ANUGRAHA NQZ, S...
3 [HARSOJO, S.U., M.Sc., Dr., SUDARMAJI, S.S...
4 [IKHSAN SETIAWAN, S.Si., M.Si., ARI SETIAWAN...
5 [EKO SULISTYA, Dr., M.Si., YOSEF ROBERTUS UT...
6 [SUNARTA, Drs., M.S., WAGINI R., Drs., M.S.]
7 [BAMBANG MURDAKA EKA JATI, Drs., M.S., KAMSU...
8 [AHMAD KUSUMA ATMAJA, S.Si., M.Sc., Dr.Eng....
9 [MOH. ALI JOKO WASONO, M.S., Dr.]
I have tried r'\S*[^\w\s]\S' but it returned
CHOTIMAH, INTO ANUGRAHA NQZ, .
HARSOJO, UDARMAJI, i.
IKHSAN SETIAWAN, RI SETIAWAN, ng.
EKO SULISTYA, OSEF ROBERTUS UTOMO, Dr.
SUNARTA, AGINI .
BAMBANG MURDAKA EKA JATI, AMSUL ABRAHA, Prof.
AHMAD KUSUMA ATMAJA, ITRAYANA, Dr.
MOH. ALI JOKO WASONO, Dr.
Some professors' names are shortened to XXX. Ex: MOHAMMAD TO MOH. And I don't want that to get removed.
Any help is appreciated!

\w{0,}\.(\w{0,}\.)? This regex test string will grab any length word followed by a period, and will look for another word of any length followed by a period optionally. This captures Dr., M.S. etc. I'm pretty sure that's what you're asking for, if not let me know.
In the future you can use regexr.com to easily test regex matches. Also you've tagged this post with Python and Pandas but those aren't really relevant tags. Please either include more code to make tags relevant or avoid using irrelevant tags

Related

Using RegEx in Python to extract contents

Good evening,
I am very new to Python and RegEx. I have the following sentence:
-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 +100.00 3257 UpAmex Top PM 9:55 +300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here
I would like to search for just '-' and the amount after that.
After that, I would like to skip 2 words and extract ALL words if need be in a single group (I will read more about groups but for now i would need in a single group, which i can later use to split and get the words from that string) just before 'Paid'
For instance, I would get
-75.76 ASIA Direct to
-400 PTE AXS to
What would be the regex command? Also, is there a good regex tutorial where I can read up on?

For now I have created one match having 2 groups ie, group1 for the amount and group2 for all the words (that include "to " string also).
Regex:
(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid
You can check the details here: https://regex101.com/r/eUMgdW/1
Python code:
import re
output = re.findall("""(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid""", your_input_string)
for found in output:
print(found)
#('-75.76', 'ASIA DIRECT to ')
#('-400.00', 'PTE AXS to ')

Rather than give you the actual regex, I'll gently nudge you in the right direction. It's more satisfying that way.
"Words" here are seperated by spaces. So what you're searching for is a group of characters (captured), a space, characters again, space, characters, space, then capture everything and end with "PAID". Try to create a regex to do that.
If you'd like to brush up on regex, check out Regex101. It's a web tool to test out regex, along with a debugger and a cheat sheet.

Apply function is not working on a data-frame column

I am trying to remove special characters like ",",".","-"(except comma) from the "Actors" column of my pandas data-frame. For this I use the apply method on the "Actors" column
df['Actors']= df['Actors'].apply(lambda x : x.lower().replace("[^a-zA-Z,]","",)
df['Actors'].head()
The output of the above snippet is shown below and we can see no special characters have been replaced:
1 tim robbins, morgan freeman, bob gunton, willi...
2 marlon brando, al pacino, james caan, richard ...
3 al pacino, robert duvall, diane keaton, robert...
4 christian bale, heath ledger, aaron eckhart, m...
5 martin balsam, john fiedler, lee j. cobb, e.g....
Name: Actors, dtype: object
But when I try resolving the above issue using the snippet below, the code works:
df['Actors'] = df['Actors'].str.lower().str.replace("[^a-zA-Z,]","")
df['Actors'].head()
1 timrobbins,morganfreeman,bobgunton,williamsadler
2 marlonbrando,alpacino,jamescaan,richardscastel...
3 alpacino,robertduvall,dianekeaton,robertdeniro
4 christianbale,heathledger,aaroneckhart,michael...
5 martinbalsam,johnfiedler,leejcobb,egmarshall
Name: Actors, dtype: object
I want to know what is it with the apply function that it doesn't work properly while replacing characters ?

You call apply on series, so x in the lambda is a single string of each row of the series. So, x.lower().replace is python replace. Python replace doesn't support regex. so it considers "[^a-zA-Z,]" as a whole string and it looks for that substring in each x. It couldn't find it so nothing got replaced.
On the other hand, Pandas str.replace default option is regex=True, so it considers "[^a-zA-Z,]" as a regex pattern and replaces everything properly

It does not work because you do a replace on a string, formally you do str.replace("[^a-zA-Z,]","",). Your sting do not contain those characters [^a-zA-Z,] so nothing is removed. If you prefer, python do interpret those characters as regex, but simply as string elements.
To work you should do it like this, it's just to answer your question because the preferred way to do it is with your second exemple.
remove = re.compile(r"[^a-zA-Z,]")
df['Actors']= df['Actors'].apply(lambda x : re.sub(remove, "", x.lower()))
Herw are some documentation :
python str replace
pandas str replace

Removing varying text phrases through RegEx in a Python Data frame

Basically, I want to remove the certain phrase patterns embedded in my text data:
Starts with an upper case letter and ends with an Em Dash "—"
Starts with an Em Dash "—" and ends with a "Read Next"
Say, I've got the following data:
CEBU CITY—The widow of slain human rights lawyer .... citing figures from the NUPL that showed that 34 lawyers had been killed in the past two years. —WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next
and
Manila, Philippines—President .... but justice will eventually push its way through their walls of impunity, ... —REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next
I want to remove the following phrases:
"CEBU CITY—"
"—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
"Manila, Philippines—"
"—REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next"
I am assuming this would be needing two regex for each patterns enumerated above.
The regex: —[A-Z].*Read Next\s*$ may work on the pattern # 2 but only when there are no other em dashes in the text data. It will not work when pattern # 1 occurs as it will remove the chunk from the first em dash it has seen until the "Read Next" string.
I have tried the following regex for pattern # 1:
^[A-Z]([A-Za-z]).+(—)$
But how come it does not work. That regex was supposed to look for a phrase that starts with any upper case letter, followed by any length of string as long as it ends with an "—".

What you are considering a hyphen - is not indeed a hyphen instead called Em Dash, hence you need to use this regex which has em dash instead of hyphen in start,
^—[A-Z].*Read Next\s*$
Here is the explanation for this regex,
^ --> Start of input
— --> Matches a literal Em Dash whose Unicode Decimal Code is 8212
[A-Z] --> Matches an upper case letter
.* --> Matches any character zero or more times
Read Next --> Matches these literal words
\s* --> This is for matching any optional white space that might be present at the end of line
$ --> End of input
Online demo

The regex that should take care of this -
^—[A-Z]+(.)*(Read Next)$
You can try implementing this regex on your data and see if it works out.

Parsing name and degree?

I'm trying to parse a string containing a name and a degree. I have a long list of these. Some contain no degrees, some contain one, and some contain multiple.
Example strings:
Sam da Man J.D.
Green Eggs Jr. Ed.M.
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
As far as I can tell, the degrees come in the following patterns:
x.x.
x.x.x.
x.x.xx.
x.xx.
xx.x.
x.xxx.
two caps (ex: 'MA')
How would I parse this?
I'm new to regex and breaking down this problem has proved very time-consuming. I've been using this post and tried split = re.split('\s+|([.])',s) and split = re.split('\s+|\.',s) but these still split on the first space.
I have thought, in response to the first comment, about the degree designations. I've been trying to make a regex that recognizes 'x.x' and then a wildcard afterwards because there are several patterns within the degrees which look like this: x.x(something):
x.x.
x.x.x.
x.x.xx.
and then I'd have a few more to classify.
Alternatively, classifying the name might be easier?
Or even listing the degrees in a collection and searching for them?
{'M.A.T.','Ph.D.','MA','J.D.','Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.", 'RN', 'B.S.Ed.'}

Try to change your "Jr.", "Sr.", ... replacing them with something like this: "Jr~", "Sr~", ...
This is the the regular expression for doing that:
/ (Jr|Sr)\. / $1~ /g
(See here )
You obtain this string:
Sam da Man J.D.
Green Eggs Jr~ Ed.M.
Argle Bargle Sr~ MA
Cersei Lannister M.A. Ph.D.
Now you can easily capture degrees with this regular expression:
/ (MA|RN|([A-Z][a-z]?[a-z]?\.)+) /g
(See here )

you can use this:
'[ ](MA|RN|([A-Z][a-z]?[a-z]?\.){2,3})'
it doesn't take any word with one dot

I think the best approach is either creating a list or regex of specific degrees you're looking for, instead of trying to define patterns like x.x. that will match several different degrees. A pattern like this is too general, and may match many other values in free text (in this case, people's initials).
import re
s = """Sam da Man J.D.
Green Eggs Jr. Ed.M.
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
Albus Dumbledore M.A.T.
"""
pattern = r"M.A.T.|Ph.D.|MA|J.D.|Ed.M.|M.A.|M.B.A.|Ed.S.|M.Div.|M.Ed.|RN|B.S.Ed."
degrees = re.findall(pattern, s, re.MULTILINE)
print(degrees)
Output:
['J.D.', 'Ed.M.', 'MA', 'M.A.', 'Ph.D.', 'M.A.T.']
If you're looking to get the names that appear between the degrees in a block of text like the one above, you can use re.split.
names = re.split(pattern, s)
names = [n.strip() for n in names if n.strip()]
print(names)
Output:
['Sam da Man', 'Green Eggs Jr.', 'Argle Bargle Sr.', 'Cersei Lannister', 'Albus Dumbledore']
Note that I had to strip the remaining strings and remove empty strings from the results to capture just the names. Doing that operation on the result allows the regex to be much simpler.
Note also that this can still fail when a specific degree could also be someone's initials, (e.g., J.D. Salinger). You may need to make adjustments or other allowances based on your real data.

Why doesn't this regular expression work in all cases?

I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile('\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile('NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
print i
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?

It's missing it because his state name contains two words: SOUTH CAROLINA
Have your second regex be this, it should help
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
Which is a optional, non capturing group matching a space followed by one or more alphanumeric underscore characters
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly, after you remove the Nos and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the State in $3 And the name in $4
Each capturing group works as follows:
(#[\w]+?\s)
This matches an # sign followed by at least one but as few characters as possible until a space.
((?:(?:[\w]+?)\s){1,2})
This matches and captures 1 or two words, which should be the state. This only works because of the next piece, which MUST have two words
((?:[\w]+?\s){2})
Matches and captures exactly two words, which is defined as few characters as possible followed by a space

text=re.sub(' (NO|-+)(?= |$)','',text)
And to capture everything:
re.findall('(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))',text)
Or all at once:
re.findall('(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))',text)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.