I have a dataframe with full addresses in a column, and I need to create a separate column with just the zip code. Some of the addresses have just the five-digit zip code, whereas others have the additional four digits.
How do I split the column to just get the zip code?
Example Data
import pandas as pd

d = {'name': ['bob', 'john'], 'address': ['123 6th Street,Sterling VA 20165-7513', '567 7th Street, Wilmington NC 28411']}
df = pd.DataFrame(d)
I tried using rpartition but I get everything before the zip code:
df['test'] = df['address'].str.rpartition(" ")[0]
print(df)
   name                                address                           test
0   bob  123 6th Street,Sterling VA 20165-7513     123 6th Street,Sterling VA
1  john    567 7th Street, Wilmington NC 28411  567 7th Street, Wilmington NC
This is what I'm trying to get:
   name                                address     zipcode
0   bob  123 6th Street,Sterling VA 20165-7513  20165-7513
1  john    567 7th Street, Wilmington NC 28411       28411
Use a regex with str.extract():
df['zip'] = df['address'].str.extract(r'(\d{5}-?\d{0,4})')
returns:
name address zip
0 bob 123 6th Street,Sterling VA 20165-7513 20165-7513
1 john 567 7th Street, Wilmington NC 28411 28411
See the pandas docs on str.extract() and the Python docs on re.
In particular, the {5} specifies that we must match 5 repetitions of \d (a numerical digit), while {0,4} indicates that we can match from 0 to 4 repetitions.
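If you want to be stricter about the ZIP+4 format, a non-capturing group ties the last four digits to the hyphen; a small variant, not part of the original answer:
df['zip'] = df['address'].str.extract(r'(\d{5}(?:-\d{4})?)')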
You can try this:
df['zip'] = [i[-1] for i in df.address.str.split(' ').values]
You need to split on the spaces and take the last item, and you'll have the zip code.
Something like this:
zipcodes = list()
for item in d['address']:
    zipcode = item.split()[-1]
    zipcodes.append(zipcode)
d['zipcodes'] = zipcodes
df = pd.DataFrame(d)
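For completeness, the same split-and-take-last idea can be vectorized with the pandas string accessor instead of a Python loop; a one-line sketch equivalent to the loop above:
df['zipcode'] = df['address'].str.split().str[-1]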
Related
I am trying to get the zip code after the specific word 'zip_code' within a string.
I have a data frame with a column named "location"; each row holds a string. I want to identify the word "zip_code" and get the value after it for each row.
Input
name location
Bar1 LA Jefferson zip_code 202378 Avenue free
Pizza Avenue 45 zip_code 45623 wichita st
Tacos Las Americas avenue 7 zip_code 67890 nicolas st
Expected output
name location
Bar1 202378
Pizza 45623
Tacos 67890
So far, following an example, I was able to extract the zip code from a single string:
s = "address1 355 Turnpike Ste 4 address3 zip_code 02021 country US "
s.split("zip_code")[1].split()[0]
>> '02021'
But I do not know how to do the same for each row of my location column.
The best way is to use extract() which accepts regex and allows searching through each row.
import pandas as pd
import numpy as np
df = pd.DataFrame({'name':['Bar1', 'Pizza', 'Tacos'],
'location':['LA Jefferson zip_code 202378 Avenue free', 'Avenue 45 zip_code 45623 wichita st', 'Las Americas avenue 7 zip_code 67890 nicolas st']})
df['location'] = df['location'].str.extract(r'zip_code\s(.*?)\s')
>>> df
name location
0 Bar1 202378
1 Pizza 45623
2 Tacos 67890
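If you would rather keep the location column intact, you can capture into a new column instead; using \d+ for the group also guarantees only digits are picked up (a variant, not from the original answer):
df['zip'] = df['location'].str.extract(r'zip_code\s+(\d+)')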
A column in my dataframe has the following campaign contribution data formatted in one of two ways:
JOHN A. DONOR1234 W ROAD ST CITY, STATE 56789
And
JANE M. DONOR
1234 W ROAD ST
CITY, STATE 56789
I want to split this column into two. Column one should be the name of the donor. Column two should be the address.
Currently, I'm using the following regex code to try and accomplish this:
url = ("http://www.voterfocus.com/CampaignFinance/candidate_pr.php?op=rp&e=8&c=munmiamibeach&ca=64&sdc=116&rellevel=4&dhc=774&committee=N")
dfs = pd.read_html(url)
df = dfs[0]
df['Contributor'].str.split(r'\d\d?', expand=True)
But instead of splitting after the first match and quitting, as I intend, the regex seems to continue matching and splitting. My output should look like this:
Col1 Col2
JOHN A. DONOR 1234 W ROAD ST CITY, STATE 56789
It may be much simpler than that. You can use the string methods. For example, I think this is the behavior you want:
import pandas as pd
s = """JOHN A. DONOR
1234 W ROAD ST
CITY, STATE 56789"""
df = pd.DataFrame([s], columns=["donors"])
df.donors.str.split("\n", n=1, expand=True)
output:
0 1
0 JOHN A. DONOR 1234 W ROAD ST\nCITY, STATE 56789
Splitting solution
You can use
df['Contributor'].str.split(r'(?<=\D)(?=\d)', expand=True, n=1)
The (?<=\D)(?=\d) regex finds a location between a non-digit char (\D) and a digit char (\d), splits the string there and only performs this operation once (due to n=1).
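Here is a minimal, self-contained sketch of that split, using the fused sample row from the question (regex=True is passed explicitly for clarity, which requires pandas 1.4+):
import pandas as pd

df = pd.DataFrame({'Contributor': ['JOHN A. DONOR1234 W ROAD ST CITY, STATE 56789']})
# split once at the first non-digit/digit boundary
print(df['Contributor'].str.split(r'(?<=\D)(?=\d)', n=1, expand=True, regex=True))
#                0                                 1
# 0  JOHN A. DONOR  1234 W ROAD ST CITY, STATE 56789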
Alternative solution
You can match and capture the names up to the first number, and then capture all text there remains after and including the first digit using
df['Contributor'].str.extract(r'(?P<Name>\D*)(?P<Address>\d.*)', expand=True)
# =>                                Name                                  Address
# 0 Contributor CHRISTIAN ULVERT 1742 W FLAGLER STMIAMI, FL 33135
# 1 Contributor Roger Thomson 4271 Alton Miami Beach , FL 33140
# 2 Contributor Steven Silverstein 691 West 247th Street Bronx , NY 10471
# 3 Contributor Cathy Raduns 691 West 247th Street Bronx, NY 10471
# 4 Contributor Asher Raduns-Silverstein 691 West 247th StreetBRONX, NY 10471
The (?P<Name>\D*)(?P<Address>\d.*) pattern means
(?P<Name>\D*) - Group "Name": zero or more chars other than digits
(?P<Address>\d.*) - Group "Address": a digit followed by zero or more chars other than line break chars.
If there are line breaks in the string, add (?s) at the start of the pattern, i.e. r'(?s)(?P<Name>\D*)(?P<Address>\d.*)'.
I have the following two data frames, taken from Excel files:
df1 = 10000 rows (like the master list that has all unique #s)
df2 = 670 rows
I am loading an Excel file (df2) that has zip, address, and state, and I want to match on that info and add the supplier # from df1, so that I end up with one file that's still 670 rows but now has the supplier number column.
Since there was no unique key between the two dataframes, I thought I could make one to merge on by joining three columns together - zip, address, and state - with a "-". Maybe this is too risky for a match? df1 has a ton of duplicate addresses, zips, and states, so I couldn't do something like joining just zip and state.
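For reference, the composite key described above can be built like this; a sketch in which the column names follow the sample frames below:
for frame in (df1, df2):
    frame['CCCjoin'] = frame['ZIP'].astype(str) + '-' + frame['ADDRESS'] + '-' + frame['STATE']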
df1 =
(10000 rows)
(unique)
supplier_num ZIP ADDRESS STATE CCCjoin
0 7100000 35481 14th street CA 35481-14th street-CA
1 7000005 45481 14th street CA 45481-14th street-CA
2 7000006 45482 140th circle CT 45482-140th circle-CT
3 7000007 35482 140th circle CT 35482-140th circle-CT
4 7000008 35483 13th road VT 35483-13th road-VT
...
df2 =
(670 rows)
ZIP ADDRESS STATE CCCjoin
0 35481 14th street CA 35481-14th street-CA
1 45481 14th street CA 45481-14th street-CA
2 45482 140th circle CT 45482-140th circle-CT
3 35482 140th circle CT 35482-140th circle-CT
4 35483 13th road VT 35483-13th road-VT
...
OUTPUT:
df3 =
(670 rows)
ZIP ADDRESS STATE CCCjoin supplier_num (unique)
0 35481 14th street CA 35481-14th street-CA 7100000
1 45481 14th street CA 45481-14th street-CA 7100005
2 45482 140th circle CT 45482-140th circle-CT 7100006
3 35482 140th circle CT 35482-140th circle-CT 7100007
4 35483 13th road VT 35483-13th road-VT 7100008
...
670 15483 13 baker road CA 15483-13 baker road-CA 7100009
I've looked around on here and found some helpful tricks, and I think I've made some progress. Here is some code that I tried:
df1['g'] = df1.groupby('CCCjoin').cumcount()
df2['g'] = df2.groupby('CCCjoin').cumcount()
then I merge:
merged_table = pd.merge(df1, df2, on=['CCCjoin', 'g'], how='inner').drop('g', axis=1)
This sort of works: I get 293 matching rows, and I cross-checked that the supplier numbers match the addresses.
What am I missing to get the remaining 377 matches? Thanks in advance!
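One thing worth checking, as an assumption since the full data isn't shown: an exact-match merge silently misses rows that differ only in case or stray whitespace. Normalizing the key columns before building CCCjoin may recover some of the missing matches:
for frame in (df1, df2):
    for col in ['ZIP', 'ADDRESS', 'STATE']:
        frame[col] = frame[col].astype(str).str.strip().str.lower()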
I have a large df called data which looks like:
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
I have another dataframe called updates. In this example the dataframe has updated information for data for a couple of records and looks like:
Identifier Surname First names(s) Date change
0 12233.0 Smith Bob 05/09/14
1 10610.0 Cooper Amy 16/08/12
I'm trying to find a way to update data with the updates df so the resulting dataframe looks like:
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob 15/09/14 FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
As you can see the Date change field for Bob in the data df has been updated with the Date change from the updates df.
What can I try next?
A while back, I was dealing with this too. The straight-up .update() was giving me issues (sorry, I can't remember the exact issue; I think it was that .update() relies on the indexes matching, and they didn't match across my two dataframes, so I wanted to use certain columns as my index to update on).
But I made a function to deal with it. It might be way more than what's needed here, but try this and see if it works.
I'm also assuming the date you want pulled in from the updates dataframe should be 15/09/14, not 05/09/14, so I used that in my sample data below.
Also, I'm assuming Identifier is a unique key. If not, you'll need to include multiple columns as your unique key.
import pandas as pd
data = pd.DataFrame([[12233.0,'Smith','Bob','','FT','NW'],
[54213.0,'Jones','Sally','15/04/15','FT','NW'],
[12237.0,'Evans','Steve','26/08/14','FT','SE'],
[10610.0,'Cooper','Amy','16/08/12','FT','SE']],
columns = ['Identifier','Surname','First names(s)','Date change','Work Pattern','Region'])
updates = pd.DataFrame([[12233.0,'Smith','Bob','15/09/14'],
[10610.0,'Cooper','Amy','16/08/12']],
columns = ['Identifier','Surname','First names(s)','Date change'])
def update(df1, df2, keys_list):
    df1 = df1.set_index(keys_list)
    df2 = df2.set_index(keys_list)
    # Index.get_duplicates() was removed in newer pandas; use duplicated() instead
    dup_idx1 = df1.index[df1.index.duplicated()]
    dup_idx2 = df2.index[df2.index.duplicated()]
    if len(dup_idx1) > 0 or len(dup_idx2) > 0:
        print('\n' + '#' * 50 + '\nError! Duplicate indices:')
        for element in dup_idx1:
            print('df1: %s' % (element,))
        for element in dup_idx2:
            print('df2: %s' % (element,))
        print('#' * 50 + '\n\n')
    df1.update(df2, overwrite=True)
    df1.reset_index(inplace=True)
    df2.reset_index(inplace=True)
    return df1
# the 3rd input is a list, in case you need multiple columns as your unique key
df = update(data, updates, ['Identifier'])
Output:
print(data)
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
print(updates)
Identifier Surname First names(s) Date change
0 12233.0 Smith Bob 15/09/14
1 10610.0 Cooper Amy 16/08/12
df = update(data, updates, ['Identifier'])
print(df)
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob 15/09/14 FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
Using DataFrame.update.
First set index:
data.set_index('Identifier', inplace=True)
updates.set_index('Identifier', inplace=True)
Then update:
data.update(updates)
print(data)
Surname First names(s) Date change Work Pattern Region
Identifier
12233.0 Smith Bob 15/09/14 FT NW
54213.0 Jones Sally 15/04/15 FT NW
12237.0 Evans Steve 26/08/14 FT SE
10610.0 Cooper Amy 16/08/12 FT SE
If you need multiple columns to create a unique index you can just set them with a list. For example:
data.set_index(['Identifier', 'Surname'], inplace=True)
updates.set_index(['Identifier', 'Surname'], inplace=True)
data.update(updates)
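If you need Identifier (or the multi-column key) back as a regular column afterwards, reset the index; a small follow-up, not part of the original answer:
data.reset_index(inplace=True)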
I have many Excel files in different formats. Some of them look like this, with a single header row, and can be read into pandas normally.
# First Column Second Column Address City State Zip
1 House The Clairs 4321 Main Street Chicago IL 54872
2 Restaurant The Monks 6323 East Wing Milwaukee WI 45458
and some of them are in various formats with multiple headers,
Table 1
Comp ID Info
# First Column Second Column Address City State Zip
1 Office The Fairs 1234 Main Street Seattle WA 54872
2 College The Blanks 4523 West Street Madison WI 45875
3 Ground The Brewers 895 Toronto Street Madrid IA 56487
Table2
Comp ID Info
# First Column Second Column Address City State Zip
1 College The Banks 568 Old Street Cleveland OH 52125
2 Professional The Circuits 695 New Street Boston MA 36521
In Excel, the extra title rows sit stacked above each header block, giving three different levels of headers (the original screenshot is omitted here). What is certain is that every file has a row that starts with First Column.
For an individual file like this, I can read it as below, which works just fine:
xls = pd.ExcelFile(r'mypath\myfile.xlsx')
df = pd.read_excel(xls, 'mysheet', header=2)
However, I need a final data frame like this (appended from all the files, including those that have only one header):
First Column Second Column Address City State Zip
0 House The Clairs 4321 Main Street Chicago IL 54872
1 Restaurant The Monks 6323 East Wing Milwaukee WI 45458
2 Office The Fairs 1234 Main Street Seattle WA 54872
3 College The Blanks 4523 West Street Madison WI 45875
4 Ground The Brewers 895 Toronto Street Madrid IA 56487
5 College The Banks 568 Old Street Cleveland OH 52125
6 Professional The Circuits 695 New Street Boston MA 36521
Since I have many files, I want to read each file in my folder and clean them up by keeping only one header row. Had I known the index position of the header row, I could simply do something like in this post.
However, as some of those files have multiple headers in different formats (I showed two extra header blocks in the example above; some files have four), I want to iterate through each file and set the row that starts with First Column as the header at the top of the file.
Additionally, I want to drop the rows in the middle of the file that contain First Column.
After I create cleaned files with headers starting at First Column, I can append each data frame and create the output file I need. How can I achieve this in pandas? Any help or suggestions would be great.
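A rough sketch of that whole loop, under two assumptions taken from the samples above: every header row contains the literal text First Column, and real data rows carry a number in the # column. The folder pattern is hypothetical; adjust it to your files.
import glob
import pandas as pd

frames = []
for path in glob.glob(r'mypath\*.xlsx'):  # hypothetical folder pattern
    # read with no header so every row, including stray titles, is data
    raw = pd.read_excel(path, header=None, dtype=str)
    # a row is a header row if any cell equals 'First Column'
    is_header = raw.eq('First Column').any(axis=1)
    header_idx = is_header.idxmax()         # first header row
    body = raw.loc[header_idx + 1:].copy()  # everything below it
    body.columns = raw.loc[header_idx]
    # keep only rows whose '#' cell is a number; this drops repeated
    # headers and section titles like 'Table 1' or 'Comp ID Info'
    body = body[body.iloc[:, 0].str.fullmatch(r'\d+', na=False)]
    frames.append(body)

df = pd.concat(frames, ignore_index=True)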