Split Column on regex

I really struggle with regex, and I'm hoping for some help.
I have columns that look like this
import pandas as pd
data = {'Location': ['Building A, 100 First St City, State', 'Fire Station # 100, 2 Apple Row, City, State Zip', 'Church , 134 Baker Rd City, State']}
df = pd.DataFrame(data)
Location
0 Building A, 100 First St City, State
1 Fire Station # 100, 2 Apple Row, City, State Zip
2 Church , 134 Baker Rd City, State
I would like to get the result shown below by splitting wherever a comma is followed by a space and then a number. However, my split also removes the number.
   Location Name       Address
0  Building A          100 First St City, State
1  Fire Station # 100  2 Apple Row, City, State, Zip
2  Church              134 Baker Rd City, State
This is the code I've been using
df['Location Name']= df['Location'].str.split('.,\s\d', expand=True)[0]
df['Address']= df['Location'].str.split('.,\s\d', expand=True)[1]

You can use Series.str.extract:
df[['Location Name','Address']] = df['Location'].str.extract(r'^(.*?),\s(\d.*)', expand=True)
The ^(.*?),\s(\d.*) regex matches
^ - start of string
(.*?) - Group 1 ('Location Name'): any zero or more chars other than line break chars as few as possible
,\s - comma and whitespace
(\d.*) - Group 2 ('Address'): a digit and the rest of the line.
See the regex demo.
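To see what the two groups capture, the same pattern can be run with Python's re module on the sample strings from the question. This is only a sketch: the names locations, pattern and pairs are made up for illustration, and Series.str.extract should produce the same groups as columns.

```python
import re

# Sample values copied from the question's 'Location' column.
locations = [
    'Building A, 100 First St City, State',
    'Fire Station # 100, 2 Apple Row, City, State Zip',
    'Church , 134 Baker Rd City, State',
]

pattern = re.compile(r'^(.*?),\s(\d.*)')

# Group 1 is the lazy prefix; group 2 starts at the first digit after ", ".
pairs = [pattern.search(s).groups() for s in locations]
for name, address in pairs:
    print(repr(name), '|', repr(address))
```

Note that the third row keeps the space before the comma ('Church '), so a follow-up .str.strip() on the name column may be needed.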

Another simple solution to your problem is to use a positive lookahead. You want to check if there is a number ahead of your pattern, while not including the number in the match. Here's an example of a regex that solves your problem:
\s?,\s(?=\d)
Here, we optionally remove a trailing whitespace, then match a comma followed by whitespace.
The (?= ) is a positive lookahead, in this case we check for a following digit. If that's matched, the split will remove the comma and whitespace only.
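A minimal sketch of the lookahead split, using re.split on the question's sample strings. The maxsplit=1 argument is an addition here (an assumption that only the first such comma should split the value); in pandas the same pattern can be passed to Series.str.split with n=1 and expand=True.

```python
import re

locations = [
    'Building A, 100 First St City, State',
    'Fire Station # 100, 2 Apple Row, City, State Zip',
    'Church , 134 Baker Rd City, State',
]

# The lookahead (?=\d) checks for a digit without consuming it,
# so the number stays at the start of the address part.
split_pattern = re.compile(r'\s?,\s(?=\d)')

parts = [split_pattern.split(s, maxsplit=1) for s in locations]
for name, address in parts:
    print(repr(name), '|', repr(address))
```

Because the pattern starts with an optional whitespace character, the stray space in 'Church , 134 ...' is removed together with the comma.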

Related

Remove and replace multiple commas in string

I have this dataset
df = pd.DataFrame({'name':{0: 'John,Smith', 1: 'Peter,Blue', 2:'Larry,One,Stacy,Orange' , 3:'Joe,Good' , 4:'Pete,High,Anne,Green'}})
yielding:
name
0 John,Smith
1 Peter,Blue
2 Larry,One,Stacy,Orange
3 Joe,Good
4 Pete,High,Anne,Green
I would like to:
remove commas (replace them by one space)
wherever I have 2 persons in one cell, insert the "&" symbol after the first person's family name and before the second person's name.
Desired output:
name
0 John Smith
1 Peter Blue
2 Larry One & Stacy Orange
3 Joe Good
4 Pete High & Anne Green
I tried the code below, but it simply removes the commas. I could not find how to insert the "&" symbol in the same code.
df['name']= df['name'].str.replace(r',', '', regex=True)
Disclaimer: all names in this table are fictitious. No identification with actual persons (living or deceased) is intended or should be inferred.

I would do it the following way:
import pandas as pd
df = pd.DataFrame({'name':{0: 'John,Smith', 1: 'Peter,Blue', 2:'Larry,One,Stacy,Orange' , 3:'Joe,Good' , 4:'Pete,High,Anne,Green'}})
df['name'] = df['name'].str.replace(',',' ').str.replace(r'(\w+ \w+) ', r'\1 & ', regex=True)
print(df)
gives output
name
0 John Smith
1 Peter Blue
2 Larry One & Stacy Orange
3 Joe Good
4 Pete High & Anne Green
Explanation: first replace each comma with a space. The second replace then looks for one or more word characters, a space, one or more word characters, and another space. It rewrites that match as the capturing group (everything except the final space) followed by a space, the & character, and a space.
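The two steps can be tried on plain strings with the re module. This is a small sketch; the helper name join_people is hypothetical and not part of the answer.

```python
import re

names = ['John,Smith', 'Larry,One,Stacy,Orange']

def join_people(cell):
    # Step 1: every comma becomes a space.
    spaced = cell.replace(',', ' ')
    # Step 2: a "first last " pair that is followed by more text gets " & " appended.
    return re.sub(r'(\w+ \w+) ', r'\1 & ', spaced)

result = [join_people(n) for n in names]
print(result)
```

A cell with a single person has no trailing space after the family name, so the second step leaves it unchanged.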

With single regex replacement:
df['name'].str.replace(r',([^,]+)(,)?', lambda m:f" {m.group(1)}{' & ' if m.group(2) else ''}", regex=True)
0 John Smith
1 Peter Blue
2 Larry One & Stacy Orange
3 Joe Good
4 Pete High & Anne Green

This should work:
import re
def separate_names(original_str):
    # Turn every odd-numbered comma (1st, 3rd, ...) into a space.
    spaces = re.sub(r',([^,]*(?:,|$))', r' \1', original_str)
    # The commas that remain separate two people.
    return spaces.replace(',', ' & ')
df['spaced'] = df.name.map(separate_names)
df
I created a function called separate_names that uses a regex to replace the odd-numbered commas with spaces. The remaining (even-numbered) commas are then replaced with " & " using str.replace. Finally, map applies separate_names to each row, so the new spaced column holds the desired output.

In the replace call, replace the comma with a space rather than an empty string: put a space between the quotes so that '' becomes ' '.
df['name']= df['name'].str.replace(r',', ' ', regex=True)

What Python RegEx can I use to indicate a pattern only in the end of an Excel cell

I am working with a dataset where I am separating the contents of one Excel column into 3 separate columns. A mock version of the data is as follows:
Movie Titles/Category/Rating
Wolf of Wall Street A-13 x 9
Django Unchained IMDB x 8
The EXPL Haunted House FEAR x 7
Silver Lining DC-23 x 8
This is what I want the results to look like:
Title                   Category  Rating
Wolf of Wall Street     A-13      9
Django Unchained        IMDB      8
The EXPL Haunted House  FEAR      7
Silver Lining           DC-23     8
Here is the RegEx I used to successfully separate the cells:
For Rating, this RegEx worked:
data[['Movie Titles/Category/Rating', 'Rating']] = data['Movie Titles/Category/Rating'].str.split(' x ', expand = True)
However, to separate Category from movie titles, this RegEx doesn't work:
data['Category']=data['Movie Titles/Category/Rating'].str.extract('((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4}$))', expand = True)
Since the uppercase letter pattern is present in the middle of the third cell as well (EXPL and I only want to separate FEAR into a separate column), the regex pattern '\s[A-Z]{4}$' is not working. Is there a way to indicate in the RegEx pattern that I only want the uppercase text in the end of the table cell to separate (FEAR) and not the middle (EXPL)?

You can use
import pandas as pd
df = pd.DataFrame({'Movie Titles/Category/Rating':['Wolf of Wall Street A-13 x 9','Django Unchained IMDB x 8','The EXPL Haunted House FEAR x 7','Silver Lining DC-23 x 8']})
df2 = df['Movie Titles/Category/Rating'].str.extract(r'^(?P<Movie>.*?)\s+(?P<Category>\S+)\s+x\s+(?P<Rating>\d+)$', expand=True)
See the regex demo.
Details:
^ - start of string
(?P<Movie>.*?) - Group (Column) "Movie": any zero or more chars other than line break chars, as few as possible
\s+ - one or more whitespaces
(?P<Category>\S+) - Group "Category": one or more non-whitespace chars
\s+x\s+ - x enclosed with one or more whitespaces
(?P<Rating>\d+) - Group "Rating": one or more digits
$ - end of string.

Assuming there is always x between Category and Rating, and the Category has no spaces in it, then the following should get what you want:
(.*) (.*) x (\d+)
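A short check of this pattern on the question's four rows, using re.search. The variable names are illustrative only. Under the stated assumptions, the greedy first group takes the title up to the last space before the category.

```python
import re

rows = [
    'Wolf of Wall Street A-13 x 9',
    'Django Unchained IMDB x 8',
    'The EXPL Haunted House FEAR x 7',
    'Silver Lining DC-23 x 8',
]

pattern = re.compile(r'(.*) (.*) x (\d+)')

# Each tuple is (title, category, rating); the rating is still a string here.
parsed = [pattern.search(row).groups() for row in rows]
for title, category, rating in parsed:
    print(title, '|', category, '|', rating)
```

EXPL stays inside the title because only the last space-separated token before " x " can end up in the second group.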

I think
'((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4})) x'
would work for you - to indicate that you want the part of the string that comes right before the x. (Assuming that pattern is always true for your data.)

Find pattern in doc & print preceding 3 lines whenever pattern matches

I need to find a date pattern in a .doc file and, whenever the pattern matches, print the 3 lines preceding the match.
However, I only get output for the first match; later matches produce nothing.
Below is the code:
phoneNumRegex = re.compile(r'[a-zA-Z]+\s+\d{2,4}')  # pattern to look for
mo = phoneNumRegex.search(string)  # search the variable 'string', which holds the file contents
for index, line in enumerate(lines):  # enumerate the list and loop through it
    if mo.group() in line:  # check whether the current line contains the matched text
        # print from 3 lines before the match up to 2 lines after it, clamped to the list bounds
        print("".join(lines[max(0,index-3):min(index+3, len(lines)-0)]))

re.search() stops after the first match, so this is why you only find one. You can either use re.findall() or move the regex matching into your loop. Then there is no need to compare the file against the matches.
Side note: Double check your edge cases to see if you need to modify your regex:
Your regex will also match if there are more than 4 digits in the number. I assume you want exactly 2-4? If so, you could e.g. add a word boundary (\b) at the end of your expression.
Your regex will also match on any line that has a word followed by a number of at least two digits. If you don't want this, you can e.g. modify your regex by adding line start (^) and line end ($) anchors at the start and end of your expression.
See https://regex101.com/r/nPjpX2/2.
Example:
import re
phoneNumRegex = re.compile(r'[a-zA-Z]+\s+\d{2,4}')
string = '''This is a multiline string that contains
some text and also addresses and numbers that match your regex
Guy Incognito
742 Evergreen Terrace
Springfield
Whatever 1234
some more text and another address, but a longer number
Another One
12 Two St
Washington
PREFIX 654321
and a third one
Ms Final
555 Wall Street
NY
Something 99
Beware of some number, like 33, appearing somewhere in between'''
lines = string.splitlines()
for idx, line in enumerate(lines):
    if phoneNumRegex.search(line):
        print('\n'.join(lines[max(0,idx-3):idx]))
        print('---')
Output for regex [a-zA-Z]+\s+\d{2,4}:
Guy Incognito
742 Evergreen Terrace
Springfield
---
Another One
12 Two St
Washington
---
Ms Final
555 Wall Street
NY
---
555 Wall Street
NY
Something 99
---
Output for regex ^[a-zA-Z]+\s+\d{2,4}$:
Guy Incognito
742 Evergreen Terrace
Springfield
---
Ms Final
555 Wall Street
NY
---
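The word-boundary variant mentioned in the side note can be checked with a few of the sample lines. This is a sketch only; the names loose and bounded are made up for the comparison.

```python
import re

loose = re.compile(r'[a-zA-Z]+\s+\d{2,4}')
bounded = re.compile(r'[a-zA-Z]+\s+\d{2,4}\b')  # \b: the digits must end here

samples = ['Whatever 1234', 'PREFIX 654321', 'Something 99']

# The loose pattern accepts the first 4 digits of a longer number;
# the bounded one rejects numbers with more than 4 digits.
loose_hits = [bool(loose.search(s)) for s in samples]
bounded_hits = [bool(bounded.search(s)) for s in samples]
print(loose_hits)
print(bounded_hits)
```

Unlike the ^...$ version, the \b version still matches text in the middle of a longer line, such as "like 33, appearing".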

Regex for python: how do I extract a string between words?

Suppose I have a sentence:
Meet me at 201 South First St. at noon
And I want to get the address like this:
South First
What would be the appropriate regex for it? I currently have this, but it is not working:
x = re.search(r"\d+\s?=([A-Z][a-z]*)\s(Rd.|Dr.|Ave.|St.)",searchstring)
Where searchstring is the sentence. The address is always preceded by 1 or more digits followed by a space and followed by either Rd. Dr. Ave. or St. The address also always starts with a capital letter.

The first group, the part where you try to match the address is [A-Z][a-z]*, it means one uppercase letter followed by any lowercase letters. Probably what you want is any uppercase or lowercase letter or space: [A-Za-z ]*. Also note that the dots in the second group mean any character and not the literal ., so you have to escape it. The solution would look like this:
>>> re.search(r'\d+\s?([A-Za-z ]*)\s+(Rd|Dr|Ave|St)\.', 'Meet me at 201 South First St. at noon')[1]
'South First'
Or just use . to accept anything.
>>> re.search(r'\d+\s?(.*?)\s+(Rd|Dr|Ave|St)\.', 'Meet me at 201 South First St. at noon')[1]
'South First'
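The point about the unescaped dot is easy to check in isolation. In this sketch the string '12 Main Street' is an invented example, not taken from the question.

```python
import re

# Unescaped: "." matches any character, so "St." also matches "Str" in "Street".
unescaped = re.search(r'St.', '12 Main Street')

# Escaped: "\." only matches a literal dot.
escaped = re.search(r'St\.', '12 Main Street')
escaped_hit = re.search(r'St\.', 'Meet me at 201 South First St. at noon')

print(unescaped.group())    # Str
print(escaped)              # None
print(escaped_hit.group())  # St.
```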

You may use
\d+\s*([A-Z].*?)\s+(?:Rd|Dr|Ave|St)\.
See the regex demo.
Details
\d+ - one or more digits
\s* - 0 or more whitespaces
([A-Z].*?) - capturing group #1: an uppercase ASCII letter and then any 0 or more chars other than line break chars as few as possible
\s+ - 1+ whitespaces
(?:Rd|Dr|Ave|St) - Rd, Dr, Ave or St
\. - a dot
See a Python demo:
import re
text = 'Meet me at 201 South First St. at noon'
m = re.search(r'\d+\s*([A-Z].*?)\s+(?:Rd|Dr|Ave|St)\.', text)
if m:
    print(m.group(1))
Output: South First.

Here is how:
import re
s = 'Meet me at 201 South First St. at noon'
print(re.findall(r'(?<=\d )[A-Z].*?(?= (?:Rd|Dr|Ave|St)\.)', s)[0])
Output:
South First

Why doesn't this regular expression work in all cases?

I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile('\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile('NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
    print(i)
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?

It's missing it because his state name contains two words: SOUTH CAROLINA
Have your second regex be this, it should help
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
Which is an optional, non-capturing group matching a space followed by one or more alphanumeric or underscore characters
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly, after you remove the Nos and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the state in $3 and the name in $4
Each capturing group works as follows:
(#[\w]+?\s)
This matches a # sign followed by at least one word character, as few as possible, up to and including the next whitespace character.
((?:(?:[\w]+?)\s){1,2})
This matches and captures one or two words, which should be the state. This only works because of the next piece, which MUST have two words
((?:[\w]+?\s){2})
This matches and captures exactly two words, where a word is defined as as few word characters as possible followed by whitespace
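A small sketch of the master regex in Python, where $1 to $4 correspond to groups 1 to 4 of each match. The two-entry cleaned string is an assumed sample with the NO markers and dashes already removed. It ends with a newline because the pattern requires whitespace after every word.

```python
import re

# Assumed sample: cleaned entries, one of them with a two-word state.
cleaned = '#jimdemint SOUTH CAROLINA Jim DeMint\n#senmikelee UTAH Mike Lee\n'

master = re.compile(r'((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))')

# findall returns one (full, handle, state, name) tuple per entry;
# each captured part still carries its trailing whitespace, so strip it.
records = [
    tuple(part.strip() for part in (handle, state, name))
    for _full, handle, state, name in master.findall(cleaned)
]
for record in records:
    print(record)
```

For the UTAH entry the state group first tries to take two words, fails to find two more words for the name, and backtracks to a single-word state.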

First strip the NO markers and the dashes:
text = re.sub(r' (NO|-+)(?= |$)', '', text)
And to capture everything:
re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))', text)
Or all at once (the optional groups skip the dashes and a trailing NO):
re.findall(r'(#\w+) ([A-Z ]+[A-Z])(?: -+)? (.+?)(?: NO)?(?= #|$)', text)
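A sketch of the two-step version on a few of the question's entries. It assumes the entries are separated by single spaces, since the patterns look for " #" between entries, so the newlines of the file are collapsed first. The names raw, cleaned and records are illustrative.

```python
import re

raw = """#senatorleahy VERMONT - Patrick Leahy NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#johncornyn TEXAS - John Cornyn
"""

# Assumption: collapse all whitespace (including newlines) to single spaces.
text = ' '.join(raw.split())

# Step 1: drop standalone "NO" tokens and runs of dashes.
cleaned = re.sub(r' (NO|-+)(?= |$)', '', text)

# Step 2: handle, upper-case state (one or more words), then the name up to
# the next " #" or the end of the text.
records = re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))', cleaned)
for record in records:
    print(record)
```

The state group relies on backtracking: it first swallows the capital letter of the first name and then gives it back so that a space can follow the state.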
