I have the following dataframe df_address containing student addresses:
student_id  address_type  Address           City
1           R             6th street        MPLS
1           P             10th street SE    Chicago
1           E             10th street SE    Chicago
2           P             Washington ST     Boston
2           E             Essex St          NYC
3           E             1040 Taft Blvd    Dallas
4           R             24th street       NYC
4           P             8th street SE     Chicago
5           T             10 Riverside Ave  Boston
6                         20th St           NYC
Each student can have multiple address types: R stands for "Residential", P for "Permanent", E for "Emergency", T for "Temporary", and address_type can also be blank.
I want to populate an "IsPrimaryAddress" column based on the following logic:
If address_type R exists for a particular student, then "Yes" should be written
in front of "R" in the IsPrimaryAddress column
and "No" in front of the other address types for that student_id.
If R doesn't exist but P does, then IsPrimaryAddress='Yes' for 'P' and 'No'
for the rest of the types.
If neither R nor P exists but E does, then IsPrimaryAddress='Yes' for 'E'.
If none of R, P, or E exists but 'T' does, then IsPrimaryAddress='Yes' for 'T'.
The resultant dataframe would look like this:
student_id  address_type  Address           City     IsPrimaryAddress
1           R             6th street        MPLS     Yes
1           P             10th street SE    Chicago  No
1           E             10th street SE    Chicago  No
2           P             Washington ST     Boston   Yes
2           E             Essex St          NYC      No
3           E             1040 Taft Blvd    Dallas   Yes
4           R             24th street       NYC      Yes
4           P             8th street SE     Chicago  No
5           T             10 Riverside Ave  Boston   Yes
6                         20th St           NYC      Yes
How can I achieve this? I tried the rank and cumcount functions on address_type but couldn't get them to work.
First, use pd.Categorical to give address_type a custom sort order:
df.address_type = pd.Categorical(df.address_type, ['R', 'P', 'E', 'T', ''], ordered=True)
# sort the values by that priority
df = df.sort_values('address_type')
# since we sorted, the first value of each group is the one to mark as Yes
df['new'] = (df.groupby('student_id').address_type.transform('first') == df.address_type).map({True: 'Yes', False: 'No'})
# sort the index back to the original order of df
df = df.sort_index()
   student_id address_type  new
0           1            R  Yes
1           1            P   No
2           1            E   No
3           2            P  Yes
4           2            E   No
5           3            E  Yes
6           4            R  Yes
7           4            P   No
8           5            T  Yes
9           6               Yes
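Putting the pieces together with the question's column name, a minimal self-contained sketch (assuming df is the question's frame and blank address_type values are empty strings):

import pandas as pd

# priority order for the primary address: R > P > E > T > blank
df['address_type'] = pd.Categorical(df['address_type'], ['R', 'P', 'E', 'T', ''], ordered=True)
ordered = df.sort_values('address_type')
# after sorting, the first row in each student group carries the highest-priority type
primary = ordered.groupby('student_id')['address_type'].transform('first') == ordered['address_type']
# assignment aligns on the index, so no re-sorting back is needed
df['IsPrimaryAddress'] = primary.map({True: 'Yes', False: 'No'})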
I have a dataframe with thousands of rows like this:
city   zip_code  name
paris  1         John
paris  1         Eric
paris  2         David
LA     3         David
LA     4         David
LA     4         NaN
How can I group by city and zip_code and get the names for each (city, zip_code) group?
Expected output: a dataframe with unique (city, zip_code) rows and the corresponding names in another column (one row per name):
city   zip_code  name
paris  1         John
                 Eric
paris  2         David
LA     3         David
LA     4         David
IIUC, you want to know the existing combinations of city and zip_code?
[k for k,_ in df.groupby(['city', 'zip_code'])]
output: [('LA', 3), ('LA', 4), ('paris', 1), ('paris', 2)]
Edit following your change to the question:
It looks like you want:
df.drop_duplicates().dropna()
output:
city zip_code name
0 paris 1 John
1 paris 1 Eric
2 paris 2 David
3 LA 3 David
4 LA 4 David
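If you would rather keep one row per (city, zip_code) with the names collected together, a groupby sketch (assuming NaN names should be dropped and duplicate rows removed first):

out = (df.dropna(subset=['name'])
         .drop_duplicates()
         .groupby(['city', 'zip_code'], as_index=False)['name']
         .agg(list))

Calling out.explode('name') then gets you back to one row per name, matching the expected output.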
I want to split one column from my dataframe into multiple columns, then attach those columns back to my original dataframe and divide my original dataframe based on whether the split columns include a specific string.
I have a dataframe that has a column with values separated by semicolons like below.
import pandas as pd
data = {'ID':['1','2','3','4','5','6','7'],
'Residence':['USA;CA;Los Angeles;Los Angeles', 'USA;MA;Suffolk;Boston', 'Canada;ON','USA;FL;Charlotte', 'NA', 'Canada;QC', 'USA;AZ'],
'Name':['Ann','Betty','Carl','David','Emily','Frank', 'George'],
'Gender':['F','F','M','M','F','M','M']}
df = pd.DataFrame(data)
Then I split the column as below, and separated the split column into two based on whether it contains the string USA or not.
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
Now if you run USA and nonUSA, you'll note that there are extra columns in nonUSA, and also a row with no country information. So I got rid of those NA values.
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA.columns = ['Country', 'State']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
Now I want to attach USA and nonUSA to my original dataframe, so that I will get two dataframes that look like below:
USAdata = pd.DataFrame({'ID':['1','2','4','7'],
'Name':['Ann','Betty','David','George'],
'Gender':['F','F','M','M'],
'Country':['USA','USA','USA','USA'],
'State':['CA','MA','FL','AZ'],
'County':['Los Angeles','Suffolk','Charlotte','None'],
'City':['Los Angeles','Boston','None','None']})
nonUSAdata = pd.DataFrame({'ID':['3','6'],
'Name':['David','Frank'],
'Gender':['M','M'],
'Country':['Canada', 'Canada'],
'State':['ON','QC']})
I'm stuck here though. How can I split my original dataframe into people whose Residence includes USA or not, and attach the split columns from Residence (USA and nonUSA) back to my original dataframe?
(Also, I just uploaded everything I had so far, but I'm curious if there's a cleaner/smarter way to do this.)
The original data has a unique index that is not changed by the following code for either DataFrame, so you can use concat to join the two pieces together and then add them to the original with DataFrame.join, or with concat using axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
#changed order to avoid error
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems it is possible to simplify:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
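From there, getting the two requested frames is just a boolean filter on Country (a sketch, assuming the simplified columns above; nonUSAdata also drops the row with no State and the empty County/City columns):

usa_mask = df['Country'] == 'USA'
USAdata = df[usa_mask].drop(columns='Residence')
nonUSAdata = (df[~usa_mask]
                .dropna(subset=['State'])
                .drop(columns=['Residence', 'County', 'City']))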
I have some addresses that I would like to clean.
You can see that in column address1 we have some entries that are just numbers, where they should be numbers plus street names, like the first three rows.
df = pd.DataFrame({'address1':['15 Main Street','10 High Street','5 Other Street',np.nan,'15','12'],
'address2':['New York','LA','London','Tokyo','Grove Street','Garden Street']})
print(df)
address1 address2
0 15 Main Street New York
1 10 High Street LA
2 5 Other Street London
3 NaN Tokyo
4 15 Grove Street
5 12 Garden Street
I'm trying to create a function that will check if address1 is a number, and if so, concat address1 and street name from address2, then delete address2.
My expected output is this. We can see index 4 and 5 now have complete address1 entries:
address1 address2
0 15 Main Street New York
1 10 High Street LA
2 5 Other Street London
3 NaN Tokyo
4 15 Grove Street NaN <---
5 12 Garden Street NaN <---
What I have tried with the .apply() function:
def f(x):
try:
#if address1 is int
if isinstance(int(x['address1']), int):
# create new address using address1 + address 2
newaddress = str(x['address1']) +' '+ str(x['address2'])
# delete address2
x['address2'] = np.nan
# return newaddress to address1 column
return newadress
except:
pass
Applying the function:
df['address1'] = df.apply(f,axis=1)
However, the column address1 is now all None.
I've tried a few variations of this function but can't get it to work. Would appreciate advice.
You may avoid apply by using str.isdigit to pick exactly the rows that need modifying. Create a mask m to identify these rows. Use agg on these rows to construct a sub-dataframe for them, and finally append it back to the original df:
m = df.address1.astype(str).str.isdigit()
df1 = df[m].agg(' '.join, axis=1).to_frame('address1').assign(address2=np.nan)
Out[179]:
address1 address2
4 15 Grove Street NaN
5 12 Garden Street NaN
Finally, append it back to df
df[~m].append(df1)
Out[200]:
address1 address2
0 15 Main Street New York
1 10 High Street LA
2 5 Other Street London
3 NaN Tokyo
4 15 Grove Street NaN
5 12 Garden Street NaN
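Note that DataFrame.append was removed in pandas 2.0; the equivalent with pd.concat, where sort_index restores the original row order:

pd.concat([df[~m], df1]).sort_index()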
If you still insist on using apply, you need to modify f so it returns outside of the if, returning unmodified rows together with the modified ones:
def f(x):
y = x.copy()
try:
#if address1 is int
if isinstance(int(x['address1']), int):
# create new address using address1 + address 2
y['address1'] = str(x['address1']) +' '+ str(x['address2'])
# delete address2
y['address2'] = np.nan
except:
pass
return y
df.apply(f, axis=1)
Out[213]:
address1 address2
0 15 Main Street New York
1 10 High Street LA
2 5 Other Street London
3 NaN Tokyo
4 15 Grove Street NaN
5 12 Garden Street NaN
Note: it is recommended that apply should not modify the passed object, so I do y = x.copy(), modify, and return y.
You can create a mask and update:
mask = pd.to_numeric(df.address1, errors='coerce').notna()
df.loc[mask, 'address1'] = df.loc[mask, 'address1'] + ' ' +df.loc[mask,'address2']
df.loc[mask, 'address2'] = np.nan
output:
address1 address2
0 15 Main Street New York
1 10 High Street LA
2 5 Other Street London
3 NaN Tokyo
4 15 Grove Street NaN
5 12 Garden Street NaN
Try this: apply with try/except to test whether address1 converts to int:
def test(row):
    try:
        int(row['address1'])
        return 1
    except:
        return 0

# flag the rows whose address1 is just a number
df['test'] = df.apply(test, axis=1)
df['address1'] = np.where(df['test']==1, df['address1'] + ' ' + df['address2'], df['address1'])
df['address2'] = np.where(df['test']==1, np.nan, df['address2'])
df.drop(['test'], axis=1, inplace=True)
address1 address2
0 15 Main Street New York
1 10 High Street LA
2 5 Other Street London
3 NaN Tokyo
4 15 Grove Street NaN
5 12 Garden Street NaN
I have a dataframe like this:
CITY LOCATION PRODUCT
CHICAGO CHI1 A
CHICAGO CHI1 B
CHICAGO CHI4 C
NEWYORK NY1 D
NEWYORK NY2 E
NEWYORK NY2 F
NEWYORK NY2 G
ATLANTA ATL1 H
ATLANTA ATL1 I
And I want to get 2 different stats based on the same grouping.
The grouping is [CITY, LOCATION]. I want to be able to get the number of products per location as well as the name of the first product (in alphabetical order) for that location.
The result would be:
CITY LOCATION FIRST COUNT
CHICAGO CHI1 A 2
CHICAGO CHI4 C 1
NEWYORK NY1 D 1
NEWYORK NY2 E 3
ATLANTA ATL1 H 2
The only way I've managed to do this is by:
gb = data.groupby(['CITY', 'LOCATION'])
df = gb.max().join(other=gb.count(), how='left', on=['CITY', 'LOCATION'], rsuffix='_r')
But I'm sure there's a better way to re-use the same groupby() object without having to join 2 dataframes.
Something similar to SQL:
SELECT city, location, max(product), count(product) FROM table GROUP BY city, location
Is there a better way to do this?
agg
df.groupby(['CITY', 'LOCATION'], sort=False).PRODUCT.agg(['min', 'count']).reset_index()
CITY LOCATION min count
0 CHICAGO CHI1 A 2
1 CHICAGO CHI4 C 1
2 NEWYORK NY1 D 1
3 NEWYORK NY2 E 3
4 ATLANTA ATL1 H 2
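If you also want the output columns named FIRST and COUNT as in the question, named aggregation (available since pandas 0.25) does it in the same call:

(df.groupby(['CITY', 'LOCATION'], sort=False)
   .agg(FIRST=('PRODUCT', 'min'), COUNT=('PRODUCT', 'count'))
   .reset_index())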
Goal: if the name in df2 in row i is a sub-string or an exact match of a name in df1 in some row N, and the state and district columns of row N in df1 match the respective state and district columns of df2 row i, combine.
Breakdown of the data frame inputs:
1. df1 is a time-series style data frame.
2. df2 is a regular data frame.
3. df1 and df2 do not have the same length.
4. df1 names contain initials, titles, and even weird character encodings.
5. df2 names are just a combination of first name, space, and last name.
My attempts have centered around taking names, districts, and state into account. They try to allow for the fact that names in df1 have initials, second names, titles, etc., whereas df2 has simply first and last names. I tried to use str.contains('[A-Za-z]') to account for this difference.
# Data Frame Samples
# Data Frame 1
CandidateName = ['Theodorick A. Bland','Aedanus Rutherford Burke','Jason Lewis','Barbara Comstock','Theodorick Bland','Aedanus Burke','Jason Initial Lewis', '','']
State = ['VA', 'SC', 'MN','VA','VA', 'SC', 'MN','NH','NH']
District = [9,2,2,10,9,2,2,1,1]
Party = ['','', '','Democrat','','','Democrat','Whig','Whig']
data1 = {'CandidateName':CandidateName, 'State':State, 'District':District,'Party':Party }
df1 = pd.DataFrame(data = data1)
print(df1)
# CandidateName District Party State
#0 Theodorick A. Bland 9 VA
#1 Aedanus Rutherford Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
#4 Theodorick Bland 9 VA
#5 Aedanus Burke 2 SC
#6 Jason Initial Lewis 2 Democrat MN
#7 '' 1 Whig NH
#8 '' 1 Whig NH
Name = ['Theodorick Bland','Aedanus Burke','Jason Lewis', 'Barbara Comstock']
State = ['VA', 'SC', 'MN','VA']
District = [9,2,2,10]
Party = ['','', 'Democrat','Democrat']
data2 = {'Name':Name, 'State':State, 'District':District, 'Party':Party}
df2 = pd.DataFrame(data = data2)
print(df2)
# Name District Party State
#0 Theodorick Bland 9 VA
#1 Aedanus Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
# Attempt code
df3 = df1.merge(df2, left_on = (df1.State, df1.District,df1.CandidateName.str.contains('[A-Za-z]')), right_on=(df2.State, df2.District,df2.Name.str.contains('[A-Za-z]')))
I included merging on District and State in order to reduce redundancies and inaccuracies. When I removed district and state from left_on and right_on, not only did the output df3 increase in size, it also contained a lot of wrong matches.
Examples include CandidateName and Name being two different people:
Theodorick A. Bland sharing the same row as Jasson Lewis Sr.
Some of the row results with the Attempt Code above are as follows:
Header
key_0 key_1 key_2 CandidateName District_x Party_x State_x District_y Name Party_y State_y
Row 6, index 4
MN 2 True Jason Lewis 2 Democrat MN 2 Jasson Lewis Sr. Republican MN
Row 11, index 3
3 VA 10 True Barbara Comstock 10 VA 10 Barbara Comstock Democrat VA
We can use difflib for this to create an artificial key column to merge on. We call this column Name, like the one in df2:
import difflib
# guard against names (such as the empty strings) that have no close match above the default cutoff
df1['Name'] = df1['CandidateName'].apply(
    lambda x: next(iter(difflib.get_close_matches(x, df2['Name'])), None))
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'])
print(df_merge)
CandidateName State District Party Name
0 Theodorick A. Bland VA 9 Theodorick Bland
1 Theodorick Bland VA 9 Theodorick Bland
2 Aedanus Rutherford Burke SC 2 Aedanus Burke
3 Aedanus Burke SC 2 Aedanus Burke
4 Jason Lewis MN 2 Jason Lewis
5 Jason Initial Lewis MN 2 Democrat Jason Lewis
6 Barbara Comstock VA 10 Democrat Barbara Comstock
Explanation of difflib.get_close_matches: it looks for similar strings in df2['Name']. This is what the new Name column in df1 looks like:
print(df1)
CandidateName State District Party Name
0 Theodorick A. Bland VA 9 Theodorick Bland
1 Aedanus Rutherford Burke SC 2 Aedanus Burke
2 Jason Lewis MN 2 Jason Lewis
3 Barbara Comstock VA 10 Democrat Barbara Comstock
4 Theodorick Bland VA 9 Theodorick Bland
5 Aedanus Burke SC 2 Aedanus Burke
6 Jason Initial Lewis MN 2 Democrat Jason Lewis
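get_close_matches also takes n and cutoff parameters (defaults n=3, cutoff=0.6); raising the cutoff makes the matching stricter, which can help avoid false positives like the Jasson Lewis Sr. case mentioned in the question. A sketch, where 0.8 is an assumed value to tune rather than a tested one:

matches = df1['CandidateName'].apply(
    lambda x: difflib.get_close_matches(x, df2['Name'], n=1, cutoff=0.8))
df1['Name'] = matches.str[0]  # NaN where nothing cleared the cutoff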