Python pandas regular expression replace part of the matching pattern

I've got a bunch of addresses like so:
df['street'] =
5311 Whitsett Ave 34
355 Sawyer St
607 Hampshire Rd #358
342 Old Hwy 1
267 W Juniper Dr 402
What I want to do is to remove those numbers at the end of the street part of the addresses to get:
df['street'] =
5311 Whitsett Ave
355 Sawyer St
607 Hampshire Rd
342 Old Hwy 1
267 W Juniper Dr
I have my regular expression like this:
df['street'] = df.street.str.replace(r"""\s(?:dr|ave|rd)[^a-zA-Z]\D*\d+$""", '', case=False)
which gives me this:
df['street'] =
5311 Whitsett
355 Sawyer St
607 Hampshire
342 Old Hwy 1
267 W Juniper
It dropped the words 'Ave', 'Rd' and 'Dr' from my original street addresses. Is there a way to keep part of the regular expression pattern (in my case 'Ave', 'Rd', 'Dr') and replace the rest?
EDIT:
Notice the address 342 Old Hwy 1. I do not want to take out the number in such cases. That's why I specified the patterns ('Ave', 'Rd', 'Dr', etc.) to have better control over which addresses get changed.

import re

df_street = '''
5311 Whitsett Ave 34
355 Sawyer St
607 Hampshire Rd #358
342 Old Hwy 1
267 W Juniper Dr 402
'''
# digits on the end are preceded by one of (Ave, Rd, Dr), space,
# may be preceded by a #, and followed by a possible space, and by the newline
# flags must be passed by keyword: the fourth positional argument of re.sub is count
df_street = re.sub(r'(Ave|Rd|Dr)\s+#?\d+\s*\n', r'\1\n', df_street, flags=re.IGNORECASE)
print(df_street)
5311 Whitsett Ave
355 Sawyer St
607 Hampshire Rd
342 Old Hwy 1
267 W Juniper Dr
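
The same capture-group idea also works on one address at a time. The following is only a sketch: the names UNIT_RE and strip_unit are made up for illustration, and the suffix list (ave, rd, dr) is the one from the question. Group 1 holds the street suffix, and the replacement \1 writes it back, so only the trailing unit number is dropped.

```python
import re

# Group 1 captures the street suffix; the optional "#", the digits and any
# trailing whitespace after it are what gets removed.
UNIT_RE = re.compile(r'\b(ave|rd|dr)\b\s+#?\d+\s*$', re.IGNORECASE)


def strip_unit(street):
    """Drop a trailing unit number, keeping the suffix that precedes it."""
    return UNIT_RE.sub(r'\1', street)


streets = [
    "5311 Whitsett Ave 34",
    "355 Sawyer St",
    "607 Hampshire Rd #358",
    "342 Old Hwy 1",
    "267 W Juniper Dr 402",
]
cleaned = [strip_unit(s) for s in streets]
for street in cleaned:
    print(street)
```

On a pandas Series the same compiled pattern can be handed to str.replace, for example df['street'].str.replace(UNIT_RE, r'\1', regex=True).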

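If every trailing number can be removed, a simpler regex is enough (note that [^\D] is equivalent to \d, and that this pattern would also strip the final 1 from 342 Old Hwy 1):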
>>> import re
>>> example_str = "607 Hampshire Rd #358"
>>> re.sub(r"\s*\#?[^\D]+\s*$", r"", example_str)
'607 Hampshire Rd'

Related

Can I use regex to split pandas dataframe column on first match only?

A column in my dataframe has the following campaign contribution data formatted in one of two ways:
JOHN A. DONOR1234 W ROAD ST CITY, STATE 56789
And
JANE M. DONOR
1234 W ROAD ST
CITY, STATE 56789
I want to split this column into two. Column one should be the name of the donor. Column two should be the address.
Currently, I'm using the following regex code to try and accomplish this:
url = ("http://www.voterfocus.com/CampaignFinance/candidate_pr.php?op=rp&e=8&c=munmiamibeach&ca=64&sdc=116&rellevel=4&dhc=774&committee=N")
dfs = pd.read_html(url)
df = dfs[0]
df['Contributor'].str.split(r'\d\d?', expand=True)
But instead of splitting after the first match and stopping, as I intend, the regex seems to continue matching and splitting. My output should look like this:
Col1 Col2
JOHN A. DONOR 1234 W ROAD ST CITY, STATE 56789

It may be much simpler than that: you can use the plain string methods. For example, I think this is the behavior you want:
import pandas as pd
s = """JOHN A. DONOR
1234 W ROAD ST
CITY, STATE 56789"""
df = pd.DataFrame([s], columns=["donors"])
df.donors.str.split("\n", n=1, expand=True)  # n must be passed by keyword in recent pandas
output:
0 1
0 JOHN A. DONOR 1234 W ROAD ST\nCITY, STATE 56789

Splitting solution
You can use
df['Contributor'].str.split(r'(?<=\D)(?=\d)', expand=True, n=1)
The (?<=\D)(?=\d) regex finds a location between a non-digit char (\D) and a digit char (\d), splits the string there and only performs this operation once (due to n=1).
Alternative solution
You can match and capture the name up to the first number, and then capture all the text that remains from the first digit onward, using
df['Contributor'].str.extract(r'(?P<Name>\D*)(?P<Address>\d.*)', expand=True)
# => Name # Address
# 0 Contributor CHRISTIAN ULVERT 1742 W FLAGLER STMIAMI, FL 33135
# 1 Contributor Roger Thomson 4271 Alton Miami Beach , FL 33140
# 2 Contributor Steven Silverstein 691 West 247th Street Bronx , NY 10471
# 3 Contributor Cathy Raduns 691 West 247th Street Bronx, NY 10471
# 4 Contributor Asher Raduns-Silverstein 691 West 247th StreetBRONX, NY 10471
The (?P<Name>\D*)(?P<Address>\d.*) pattern means
(?P<Name>\D*) - Group "Name": zero or more chars other than digits
(?P<Address>\d.*) - Group "Address": a digit and then any zero or more chars other than line break chars.
If there are line breaks in the string, add (?s) at the start of the pattern, i.e. r'(?s)(?P<Name>\D*)(?P<Address>\d.*)'.
See the regex demo.
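
Both approaches can be tried without a DataFrame. This is a rough sketch using only the standard re module; the helper names split_once and extract_parts are invented here, and the sample strings are the two formats from the question. It assumes Python 3.7 or later, where re.split can split on an empty match.

```python
import re

# Zero-width position between a non-digit and a digit.
FIRST_DIGIT = re.compile(r'(?<=\D)(?=\d)')
# (?s) lets "." cross line breaks, as suggested above for multi-line values.
NAME_ADDRESS = re.compile(r'(?s)(?P<Name>\D*)(?P<Address>\d.*)')


def split_once(text):
    """Split at the first non-digit/digit boundary only (maxsplit=1)."""
    return [part.strip() for part in FIRST_DIGIT.split(text, maxsplit=1)]


def extract_parts(text):
    """Return (name, address) using the named groups, or None if no digit."""
    m = NAME_ADDRESS.match(text)
    if m is None:
        return None
    return m.group("Name").strip(), m.group("Address").strip()


one_line = "JOHN A. DONOR1234 W ROAD ST CITY, STATE 56789"
multi_line = "JANE M. DONOR\n1234 W ROAD ST\nCITY, STATE 56789"

print(split_once(one_line))
print(extract_parts(multi_line))
```

The maxsplit=1 argument plays the same role here as n=1 in the pandas call.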

Find number greater than given parameter in regex

I am trying to print the whole line if it contains the city name and its number of bedrooms is greater than the parameter passed to my function. So far, I have written the following regular expression, but it only finds listings with exactly the number of bedrooms given in the parameter, not more.
if re.search(r'R[0-9]{7},([-\w ]{1,30}%s[\w ]{1,30}),[0-9]{1,9},%d' % (city_name, number_of_bedrooms), string, re.IGNORECASE):
The file that I am looking into is:
R2507956,2242 Grant Street Vancouver BC V5L 2Z7,1699000,5,2,House,13
R2500627,305-1006 Beach Avenue Vancouver BC V6E 1T7,981000,2,2,Condo,34
R2512107,680 W 6th Avenue Vancouver BC V5Z 1A3,989000,2,2,Townhouse,1
R2512000,208-607 E 8th Avenue Vancouver BC V5T 1T2,574900,1,1,Condo,1
R2511923,2146 W 14th Avenue Vancouver BC V6K 2V7,2248000,3,3,House,31
R2511301,2638 Charles Street Vancouver BC V5K 3A5,1890000,8,8,House,18
R2511809,307-2080 E Kent Avenue Vancouver BC V5P 4X2,449000,1,1,Condo,2
R2511747,1408-1775 Quebec Street Vancouver BC V5T 0E3,679900,1,1,Condo,5
R2511972,306-7180 Linden Avenue Burnaby BC V5E 3G6,448800,1,1,Condo,30
R2511059,7760 Berkley Street Burnaby BC V5E 2J7,1150000,2,1,House,20
R2511262,1106-9222 University Crescent Burnaby BC V5A 0A6,629800,2,2,Condo,4
R2510818,5190 Fulwell Street Burnaby BC V5G 1P2,1390000,7,4,House,15
R2510183,5712 Grant Street Burnaby BC V5B 2K4,1698000,3,4,House,18
R2512071,8154 Gilley Avenue Burnaby BC V5J 4Y5,2488000,9,9,House,1
R2510573,5059 Norfolk Street Burnaby BC V5G 1E9,1299000,4,4,House,7
R2512173,11226 236 Street Maple Ridge BC V2W 0C8 ,900000,4,4,House,35
R2512052,21560 Ashbury Court Maple Ridge BC V2X 8Z7,775000,3,2,House,43
R2508895,227-12258 224 Street Maple Ridge BC V2X 8Y7,474900,2,2,Condo,12
R2512451,102 Croteau Court Coquitlam BC V3K 6E2,948000,4,2,House,20
R2512494,1803-1185 The High Street Coquitlam BC V3B 0A9,968000,3,2,Condo,10

You can match and capture the number of bedrooms and then compare if a match occurred.
Also, you can match city names as whole words, that is where regex comes in handy.
Here is a snippet:
import re

number_of_bedrooms = 3
city_name = 'Vancouver'
rx = r'^R[0-9]{7},([^,]*\b%s\b[^,]*),\d{1,9},(\d+)' % (city_name)
with open(filepath, 'r') as f:  # filepath: path to the listings file
    for line in f:
        m = re.search(rx, line, re.IGNORECASE)
        if m:
            if int(m.group(2)) >= number_of_bedrooms:  # Nr of bedrooms is in Group 2
                print(line, end='')  # line already ends with a newline
See an online demo. Here, as number_of_bedrooms = 3, the output is
R2507956,2242 Grant Street Vancouver BC V5L 2Z7,1699000,5,2,House,13
R2511923,2146 W 14th Avenue Vancouver BC V6K 2V7,2248000,3,3,House,31
R2511301,2638 Charles Street Vancouver BC V5K 3A5,1890000,8,8,House,18
Since the field with the city is within commas, [\w ]{1,30} can be replaced with [^,]* patterns.
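
The same match-then-compare logic can be wrapped in a function that works on any iterable of lines, which makes it easy to try without a file. This is a sketch under a few assumptions: find_listings is a hypothetical name, the sample rows are copied from the question, and re.escape is added so that a city name containing regex metacharacters is still matched literally.

```python
import re


def find_listings(lines, city_name, min_bedrooms):
    """Return lines whose address mentions city_name as a whole word and
    whose bedroom count (4th comma-separated field) is >= min_bedrooms."""
    rx = re.compile(
        r'^R[0-9]{7},[^,]*\b%s\b[^,]*,\d{1,9},(\d+)' % re.escape(city_name),
        re.IGNORECASE,
    )
    found = []
    for line in lines:
        m = rx.search(line)
        # The regex only locates the number; the comparison is done in Python.
        if m and int(m.group(1)) >= min_bedrooms:
            found.append(line.rstrip("\n"))
    return found


sample = [
    "R2507956,2242 Grant Street Vancouver BC V5L 2Z7,1699000,5,2,House,13",
    "R2500627,305-1006 Beach Avenue Vancouver BC V6E 1T7,981000,2,2,Condo,34",
    "R2511923,2146 W 14th Avenue Vancouver BC V6K 2V7,2248000,3,3,House,31",
    "R2511301,2638 Charles Street Vancouver BC V5K 3A5,1890000,8,8,House,18",
    "R2512071,8154 Gilley Avenue Burnaby BC V5J 4Y5,2488000,9,9,House,1",
    "R2512173,11226 236 Street Maple Ridge BC V2W 0C8 ,900000,4,4,House,35",
]
result = find_listings(sample, "Vancouver", 3)
for row in result:
    print(row)
```

Doing the numeric comparison in Python avoids having to encode "greater than" in the regex itself.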

Split column in DataFrame based on item in list

I have the following table and would like to split each row into three columns: state, postcode and city. State and postcode are easy, but I'm unable to extract the city. I thought about splitting each string after the street synonyms and before the state, but I seem to be getting the loop wrong as it will only use the last item in my list.
Input data:
Address Text
0 11 North Warren Circle Lisbon Falls ME 04252
1 227 Cony Street Augusta ME 04330
2 70 Buckner Drive Battle Creek MI
3 718 Perry Street Big Rapids MI
4 14857 Martinsville Road Van Buren MI
5 823 Woodlawn Ave Dallas TX 75208
6 2525 Washington Avenue Waco TX 76710
7 123 South Main St Dallas TX 75201
The output I'm trying to achieve (for all rows, but I only wrote out the first two to save time)
City State Postcode
0 Lisbon Falls ME 04252
1 Augusta ME 04330
My code:
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand = True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand = True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
# This is where I got stuck
df["Syn"] = df["Address Text"].apply(lambda x: x.split(syn))
df

Here's a way to do that:
import pandas as pd
# data
df = pd.DataFrame(
['11 North Warren Circle Lisbon Falls ME 04252',
'227 Cony Street Augusta ME 04330',
'70 Buckner Drive Battle Creek MI',
'718 Perry Street Big Rapids MI',
'14857 Martinsville Road Van Buren MI',
'823 Woodlawn Ave Dallas TX 75208',
'2525 Washington Avenue Waco TX 76710',
'123 South Main St Dallas TX 75201'],
columns=['Address Text'])
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand=True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand=True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
def find_city(address, state, street_synonyms):
    for syn in street_synonyms:
        if syn in address:
            # remove street
            city = address.split(syn)[-1]
            # remove State and postcode
            city = city.split(state)[0]
            return city
df['City'] = df.apply(lambda x: find_city(x['Address Text'], x['State'], street_synonyms), axis=1)
print(df[['City', 'State', 'Zip']])
"""
City State Zip
0 Lisbon Falls ME 04252
1 Augusta ME 04330
2 Battle Creek MI NaN
3 Big Rapids MI NaN
4 Van Buren MI 14857
5 Dallas TX 75208
6 nue Waco TX 76710
7 Dallas TX 75201
"""
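
The printed result shows two glitches: row 6 comes out as "nue Waco" because "Ave" matches inside "Avenue", and row 4 picks up the house number 14857 as a zip code. One possible way around both, sketched here with plain re and hypothetical names (ADDRESS_RE, parse_address), is to match the street word on word boundaries and to anchor the state and optional zip at the end of the string. It assumes the city name itself never contains one of the street words.

```python
import re

STREET_WORDS = ["Circle", "Street", "Drive", "Road", "Avenue", "Ave", "St"]

# \b on both sides keeps "Ave" from matching inside "Avenue"; the state and
# the optional 5-digit zip must sit at the very end of the address.
ADDRESS_RE = re.compile(
    r'^.*\b(?:%s)\b\s+(?P<city>.+?)\s+(?P<state>[A-Z]{2})(?:\s+(?P<zip>\d{5}))?\s*$'
    % "|".join(map(re.escape, STREET_WORDS))
)


def parse_address(address):
    """Return a dict with city, state and zip (zip is None when absent),
    or None when the address does not fit the assumed layout."""
    m = ADDRESS_RE.match(address)
    return m.groupdict() if m else None


addresses = [
    "11 North Warren Circle Lisbon Falls ME 04252",
    "14857 Martinsville Road Van Buren MI",
    "2525 Washington Avenue Waco TX 76710",
    "123 South Main St Dallas TX 75201",
]
parsed = [parse_address(a) for a in addresses]
for row in parsed:
    print(row)
```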

Substituting a part of array using re

9 7 316 Lake St Arran Dr St. Catharines, ON L2N 4H4 Phone: 905-934-5885 112.9 123 130 --- 1/1/18
10 Esso 142 Lakeshore Rd Geneva St St. Catharines, ON L2N 2T5 Phone: 905-646-4558 112.7 125.9 131.9 --- 1/1/18
11 Petro-Canada 533 Lake St Linwell Rd St. Catharines, ON L2N 4H6 Phone: (905) 937-7719 112.9 125.9 131.9 124.9 1/1/18
I have the data above, where I need to change (905) to 905- so that all phone numbers are in the same format. I tried reading the content as a list and using re:
import re
for line in data:
    line = re.sub(r"(905) ", "905-", line)
    print(line)
But it is not working. How do I replace it?

If all you want is a simple replacement, then you shouldn't use re:
line = line.replace("(905) ", "905-")
If you need to replace more prefixes than just 905, only then you need regular expressions:
line = re.sub(r"\((\d{3})\) ", r"\1-", line)
That would also replace (204) 342-4532 with 204-342-4532.

Escape the parentheses in the regex like this:
re.sub(r"\(905\) ", "905-", line)

You need to escape the parentheses because they are special characters:
for line in data:
    line = re.sub(r"\(905\) ", "905-", line)
    print(line)
Output:
9 7 316 Lake St Arran Dr St. Catharines, ON L2N 4H4 Phone: 905-934-5885 112.9 123 130 --- 1/1/18
10 Esso 142 Lakeshore Rd Geneva St St. Catharines, ON L2N 2T5 Phone: 905-646-4558 112.7 125.9 131.9 --- 1/1/18
11 Petro-Canada 533 Lake St Linwell Rd St. Catharines, ON L2N 4H6 Phone: 905-937-7719 112.9 125.9 131.9 124.9 1/1/18
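
To see why the original attempt changed nothing, it may help to compare the three patterns side by side. This is a small sketch on a shortened sample line; the variable names are only for illustration.

```python
import re

line = "Phone: (905) 937-7719"

# Unescaped parentheses form a capture group, so this pattern searches for
# the text "905 " (digits followed by a space). In "(905) " the digits are
# followed by ")", so nothing matches and the line comes back unchanged.
unescaped = re.sub(r"(905) ", "905-", line)

# Escaped parentheses match the literal "(" and ")" characters.
escaped = re.sub(r"\(905\) ", "905-", line)

# Escaped outer parentheses plus an inner group handle any 3-digit area code.
any_code = re.sub(r"\((\d{3})\) ", r"\1-", "Phone: (204) 342-4532")

print(unescaped)
print(escaped)
print(any_code)
```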

AWK reformat portion of results (names) within larger string

My goal is to reformat names from Last First Middle (LFM) to First Middle Last (FML), which are part of a larger string. Here's some sample data:
Name, Address1, Address2
Smith Joe M, 123 Apple Rd, Paris TX
Adams Keith Randall, 543 1st Street, Salinas CA
Price Tiffany, 11232 32nd Street, New York NY
Walker Karen E F, 98 West Ave, Denver CO
What I would like is:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
I know how to reorder the first column, but I end up dropping the rest of the row data:
# Return the first column via comma separation (name), then separate by spaces
# If there are two strings but not three (only a last and first name),
# then change the order to first last.
awk -F, '{print $1}'| awk -F" " '$2!="" && $3=="" {print $2,$1}' >> names.txt
awk -F, '{print $1}'| awk -F" " '$3!="" && $4=="" {print $3,$1,$2}' >> names.txt
...# Continue to iterate column numbers
If there's an easier way to put the last string found and move it to the front I'd like to hear about it, but here's my real interest...
My problem is that I want to reorder the space separated fields of the 1st comma separated field (what I did above), but then also print the rest of the comma separated data.
Is there a way I can store the address info in a variable and append it after the space-separated names?
Alternatively, could I do some kind of nested split?
I'm currently doing this with awk in bash, but am willing to use python/pandas or any other efficient methods.
Thanks for the help!

Using sed, looks terrible but works:
sed -E '2,$s/^([^ ,]*) ([^ ,]*)( [^,]*)?/\2\3 \1/' in
and POSIX version:
sed '2,$s/^\([^ ,]*\) \([^ ,]*\)\( [^,]*\)*/\2\3 \1/' in
output:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO

The following AWK script, as ugly as it is, works for your inputs (run with awk -F, -f script.awk):
{
    split($1, names, " ");
    for (i=2; i<=length(names); i++)
        printf("%s ", names[i]);
    printf("%s, ", names[1]);
    for(i=2; i<NF; i++)
        printf("%s,", $i);
    print($NF)
}
Input:
Smith Joe M, 123 Apple Rd, Paris TX
Adams Keith Randall, 543 1st Street, Salinas CA
Price Tiffany, 11232 32nd Street, New York NY
Walker Karen E F, 98 West Ave, Denver CO
Output:
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
The same solution in Python:
import sys
import re
for line in sys.stdin:
    # strip the trailing newline so print() does not emit blank lines
    parts = re.split(r'\s*,\s*', line.strip())
    names = parts[0].split()
    print(", ".join([" ".join(names[1:] + names[:1])] + parts[1:]))

Another awk. This one works with the header line and Madonna (ie. single word fields):
$ awk ' # using awk
BEGIN{FS=OFS=","} # csv
{
n=split($1,a," ") # split the first field to a
for(i=n;i>1;i--) # iterate back from the last element of a
a[1]=a[i] " " a[1] # prepending to the first element of a
$1=a[1] # replace the first field with the first element of a
}1' file # output
Output:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
Madonna, ...

$ awk '
BEGIN { FS=OFS=", " }
$1 ~ / / {
last = rest = $1
sub(/ .*/,"",last)
sub(/[^ ]+ /,"",rest)
$1 = rest " " last
}
{ print }
' file
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
