I am trying to reproduce my issue with sample data. I am using pandas to merge two dataframes on differently named key columns (Code and Number) and to bring over only one column from df2 (Location). I get either a KeyError or NaN values.
Both were imported from CSV files as data frames;
neither column name has white space;
both key columns have the same dtype.
I have looked at other answers here, literally copying and pasting the posted code and filling in my own names, and I still get errors or NaN.
df1:
[['Name', 'Income', 'Favourite superhero', 'Code', 'Colour'],
['Joe', '80000', 'Batman', '10004', 'Red'],
['Christine', '50000', 'Superman', '10005', 'Brown'],
['Joey', '90000', 'Aquaman', '10002', 'Blue']]
df2:
[['Number', 'Language', 'Location'],
['10005', 'English', 'Sudbury'],
['10002', 'French', 'Ottawa'],
['10004', 'German', 'New York']]
What I tried:
data = pd.merge(CSV1,
                CSV2[['Location']],
                left_on='Code',
                right_on='Number',
                how='left')

data = pd.merge(CSV1,
                CSV2[['Location']],
                left_on='Code',
                right_index=True,
                how='left')
I want df1 to end up with the Location column from df2 for each instance where Number and Code are the same.
For both of your commands to work, Number needs to exist in the right-side dataframe; your CSV2[['Location']] slice drops it, which is why you get a KeyError. For the first command, drop the Number column after the merge. For the second, call set_index('Number') on the sliced right dataframe; then there is no Number column to drop. I modified your commands accordingly:
CSV1.merge(CSV2[['Number', 'Location']], left_on='Code', right_on='Number', how='left').drop(columns='Number')
Or
CSV1.merge(CSV2[['Number', 'Location']].set_index('Number'), left_on='Code', right_index=True, how='left')
Out[892]:
Name Income Favourite superhero Code Colour Location
0 Joe 80000 Batman 10004 Red New York
1 Christine 50000 Superman 10005 Brown Sudbury
2 Joey 90000 Aquaman 10002 Blue Ottawa
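A third option, not from the original answer but a minimal sketch using the same frames: rename the key on the right side first, so both frames share a column name and no cleanup is needed afterwards.
data = CSV1.merge(
    CSV2.rename(columns={'Number': 'Code'})[['Code', 'Location']],
    on='Code',   # both sides now share the key name
    how='left',
)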
Assume I have two dataframes, df1 and df2, described as follows. See code below that creates each of these dfs.
df1
Has 5,000 rows and 10,000 columns.
The first column contains dates, listed oldest to newest, but not every day appears. Each date is unique.
Each column is labeled with a different person's name. Each column name is unique.
All columns other than the Date column contain a numeric value.
df2
Has 2,000,000 rows and 4 columns.
The first column contains a list of dates. These are NOT sorted by oldest to newest.
The next column contains a person's name (which matches one of the column names in df1).
The other two columns will contain data about that person based on the date listed in that row; they start out blank.
My Objective
I want to populate the two blank columns in df2 using data pulled from df1.
For instance, the first row of df2 lists a date of 2017-05-15 and a Person named Person4. I want to populate df2['Value_Today'] with 4752. I want to populate df2['Value_2_records_later'] with 4866.
For the next row of df2 (with a Date of 2019-01-28 and a Person named Person1), I want to populate df2['Value_Today'] with 1918. I want to populate df2['Value_2_records_later'] with 1912.
I want to do this for all 2 million rows in df2, so I assume that a for loop is a bad idea.
Any help would be greatly appreciated. Thank you!
Code
# Import dependencies
import pandas as pd
import numpy as np
# Create df1
df1 = pd.DataFrame(np.array([['2016-05-03', 1651, 2653, 3655, 4658, 5655],
                             ['2017-05-29', 1751, 2752, 3754, 4755, 5759],
                             ['2018-08-22', 1889, 2882, 3887, 4884, 5882],
                             ['2019-06-28', 1966, 2965, 3966, 4960, 5963],
                             ['2018-11-15', 1811, 2811, 3811, 4811, 5811],
                             ['2019-12-31', 1912, 2912, 3912, 4912, 5912],
                             ['2016-07-05', 1672, 2678, 3679, 4672, 5674],
                             ['2017-05-15', 1755, 2750, 3759, 4752, 5755],
                             ['2018-06-10', 1860, 2864, 3866, 4866, 5867],
                             ['2019-01-28', 1918, 2910, 3914, 4911, 5918],
                             ['2018-11-30', 1812, 2812, 3812, 4812, 5812],
                             ['2019-01-03', 1915, 2917, 3916, 4916, 5917]]),
                   columns=['Date', 'Person1', 'Person2', 'Person3', 'Person4',
                            'Person5'])
# Format df1['Date'] col as datetime
df1['Date'] = pd.to_datetime(df1['Date'])
# Sort df1 by 'Date'
df1 = df1.sort_values(['Date'],ascending=[True]).reset_index(drop=True)
# Create 'df2', which contains measurement data on specific dates.
df2 = pd.DataFrame(np.array([['2017-05-15', 'Person4', '', ''],
                             ['2019-01-28', 'Person1', '', ''],
                             ['2018-11-15', 'Person1', '', ''],
                             ['2018-08-22', 'Person3', '', ''],
                             ['2017-05-15', 'Person5', '', ''],
                             ['2016-05-03', 'Person2', '', '']]),
                   columns=['Date', 'Person', 'Value_Today', 'Value_2_records_later'])
df2['Date'] = pd.to_datetime(df2['Date'])
# Display dfs
display(df1)
display(df2)
### I DON'T KNOW WHAT CODE I NEED TO SOLVE MY ISSUE ###
# To capture the row that is two rows below, I think I would use the '.shift(-2)' function?
Solution with MultiIndex.map:
Set the index of df1 to Date
Stack the dataframe to create the MultiIndex mapping series s1. The index of this series will be the combination of date and person name. Similarly create another series s2 from the frame shifted up by two rows (two records later).
Set the index of df2 to Date and Person columns
Map the index of df2 through s1 and s2 and assign the corresponding results to Value_Today and Value_2_records_later
s1 = df1.set_index('Date').stack()
s2 = df1.set_index('Date').shift(-2).stack()
ix = df2.set_index(['Date', 'Person']).index
df2['Value_Today'] = ix.map(s1)
df2['Value_2_records_later'] = ix.map(s2)
Result
print(df2)
Date Person Value_Today Value_2_records_later
0 2017-05-15 Person4 4752 4866
1 2019-01-28 Person1 1918 1912
2 2018-11-15 Person1 1811 1915
3 2018-08-22 Person3 3887 3812
4 2017-05-15 Person5 5755 5867
5 2016-05-03 Person2 2653 2750
An alternative approach: first, make a shifted copy of the values for Value_2_records_later,
step1 = df1.set_index('Date')
persons = step1.columns.tolist()
c1 = [('Value_Today', p) for p in persons]
c2 = [('Value_2_records_later', p) for p in persons]
step1.columns = pd.MultiIndex.from_tuples(c1, names=('','Person'))
step1[c2] = step1[c1].shift(-2)
Then stack to move columns to rows
step1.stack()
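To actually fill df2 from this, here is a minimal sketch (not part of the original answer) that joins the stacked frame back onto df2 on the (Date, Person) pairs:
# The stacked frame is indexed by (Date, Person) with the two value columns.
lookup = step1.stack()
# Drop the empty placeholder columns, then join on the (Date, Person) key.
df2 = (df2.drop(columns=['Value_Today', 'Value_2_records_later'])
          .join(lookup, on=['Date', 'Person']))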
I have a list of names, and I want to retrieve the corresponding information for each name from different data frames, to form a new dataframe.
I converted the list into a one-column dataframe, intending to look up its corresponding values in the different dataframes.
I have tried:
import pandas as pd
data = {'Name': ["David","Mike","Lucy"]}
data_h = {'Name': ["David", "Mike", "Peter", "Lucy"],
          'Hobby': ['Music', 'Sports', 'Cooking', 'Reading'],
          'Member': ['Yes', 'Yes', 'Yes', 'No']}
data_s = {'Name': ["David", "Lancy", "Mike", "Lucy"],
          'Speed': [56, 42, 35, 66],
          'Location': ['East', 'East', 'West', 'West']}
df = pd.DataFrame(data)
df_hobby = pd.DataFrame(data_h)
df_speed = pd.DataFrame(data_s)
df['Hobby'] = df.lookup(df['Name'], df_hobby['Hobby'])
print (df)
But it returns the error message as:
ValueError: Row labels must have same size as column labels
I have also tried:
df = pd.merge(df, df_hobby, on='Name')
It works but it includes unnecessary columns.
What would be a smart and efficient way to do this, especially when there are many dataframes to look up?
Thank you.
Filter each dataframe down to the key column and the columns you want to bring over, then merge:
df = (pd.merge(df, df_hobby[['Name','Hobby']], on='Name')
        .merge(df_speed[['Name','Location']], on='Name'))
print(df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
If you want to work with a list of dataframes, use functools.reduce with the same column filtering:
dfList = [df,
          df_hobby[['Name','Hobby']],
          df_speed[['Name','Location']]]
from functools import reduce
df = reduce(lambda df1,df2: pd.merge(df1,df2,on='Name'), dfList)
print (df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
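For single-column lookups there is also a merge-free alternative, closer in spirit to the lookup attempt: build a Name-indexed Series per source frame and map it. A minimal sketch, assuming the df, df_hobby, and df_speed frames above:
# Each set_index('Name')[col] is a Series keyed by name; map pulls the values.
df['Hobby'] = df['Name'].map(df_hobby.set_index('Name')['Hobby'])
df['Location'] = df['Name'].map(df_speed.set_index('Name')['Location'])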
For this example I have two dataframes. The genre column in df1 is column 3, but in df2 it is column 2, and the header is slightly different. In my actual script I have to search for the column names, because the column location varies in each sheet it reads.
How do I recognise different header names as the same thing?
df1 = pd.DataFrame({'TITLE': ['The Matrix', 'Die Hard', 'Kill Bill'],
                    'VENDOR ID': ['1234', '4321', '4132'],
                    'GENRE(S)': ['Action', 'Adventure', 'Drama']})
df2 = pd.DataFrame({'TITLE': ['Toy Story', 'Shrek', 'Frozen'],
                    'Genre': ['Animation', 'Adventure', 'Family'],
                    'VENDOR ID': ['5678', '8765', '8576']})
column_names = ['TITLE','VENDOR ID','GENRE(S)']
appended_data = []
sheet1 = df1[df1.columns.intersection(column_names)]
appended_data.append(sheet1)
sheet2 = df2[df2.columns.intersection(column_names)]
appended_data.append(sheet2)
appended_data = pd.concat(appended_data, sort=False)
output:
TITLE VENDOR ID GENRE(S)
0 The Matrix 1234 Action
1 Die Hard 4321 Adventure
2 Kill Bill 4132 Drama
0 Toy Story 5678 NaN
1 Shrek 8765 NaN
2 Frozen 8576 NaN
desired output:
TITLE VENDOR ID GENRE(S)
0 The Matrix 1234 Action
1 Die Hard 4321 Adventure
2 Kill Bill 4132 Drama
0 Toy Story 5678 Animation
1 Shrek 8765 Adventure
2 Frozen 8576 Family
Thank you for taking the time to do that. Asking a good question is very important, and now that you have posed a coherent question I was able to find a simple solution rather quickly:
import pandas as pd
df1 = pd.DataFrame({'TITLE': ['The Matrix', 'Die Hard', 'Kill Bill'],
                    'VENDOR ID': ['1234', '4321', '4132'],
                    'GENRE(S)': ['Action', 'Adventure', 'Drama']})
df2 = pd.DataFrame({'TITLE': ['Toy Story', 'Shrek', 'Frozen'],
                    'Genre': ['Animation', 'Adventure', 'Family'],
                    'VENDOR ID': ['5678', '8765', '8576']})
Simple way:
We will use .append() below (note that .append() was removed in pandas 2.0; pd.concat([df1, df2]) is the modern equivalent), but for this to work we need the columns in df1 and df2 to match. In this case we'll simply rename df2's 'Genre' to 'GENRE(S)':
df2.columns = ['TITLE', 'GENRE(S)', 'VENDOR ID']
df3 = df1.append(df2)  # pandas >= 2.0: df3 = pd.concat([df1, df2])
print(df3)
GENRE(S) TITLE VENDOR ID
0 Action The Matrix 1234
1 Adventure Die Hard 4321
2 Drama Kill Bill 4132
0 Animation Toy Story 5678
1 Adventure Shrek 8765
2 Family Frozen 8576
More elaborate:
Now, for a single use case this works but there may be cases where you have many mismatched columns and/or have to do this repeatedly. Here is a solution using boolean indexing to find mismatched names, then zip() and .rename() to map the column names:
# RELOAD YOUR ORIGINAL DF'S
df1_find = df1.columns[~df1.columns.isin(df2.columns)] # select col name that isnt in df2
df2_find = df2.columns[~df2.columns.isin(df1.columns)] # select col name that isnt in df1
zipped = dict(zip(df2_find, df1_find)) # df2_find as key, df1_find as value
df2.rename(columns=zipped, inplace=True) # map zipped dict to the column names
df3 = df1.append(df2)  # pandas >= 2.0: df3 = pd.concat([df1, df2])
print(df3)
GENRE(S) TITLE VENDOR ID
0 Action The Matrix 1234
1 Adventure Die Hard 4321
2 Drama Kill Bill 4132
0 Animation Toy Story 5678
1 Adventure Shrek 8765
2 Family Frozen 8576
Keep in mind: this way of doing it assumes that both your dataframes have the same number of columns, and it also assumes that df1 has your ideal column names, which you will use against other dataframes to fix their column names.
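If you know the variant spellings ahead of time, an explicit rename map avoids both assumptions. The variant names below are hypothetical; only 'Genre' comes from this example:
# Map every known variant header to the canonical df1 name, then rename.
canonical = {'Genre': 'GENRE(S)', 'GENRES': 'GENRE(S)'}  # hypothetical variants
df2 = df2.rename(columns=canonical)
df3 = df1.append(df2)  # pandas >= 2.0: pd.concat([df1, df2])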
I hope this helps.
I have these two data frames:
1st df
#df1 -----
location Ethnic Origins Percent(1)
0 Beaches-East York English 18.9
1 Davenport Portuguese 22.7
2 Eglinton-Lawrence Polish 12.0
2nd df
#df2 -----
location lat lng
0 Beaches—East York, Old Toronto, Toronto, Golde... 43.681470 -79.306021
1 Davenport, Old Toronto, Toronto, Golden Horses... 43.671561 -79.448293
2 Eglinton—Lawrence, North York, Toronto, Golden... 43.719265 -79.429765
Expected Output:
I want to keep the location column of df1, as it is cleaner, and retain all other columns. I don't need the city and country info in df2's location column.
location Ethnic Origins Percent(1) lat lng
0 Beaches-East York English 18.9 43.681470 -79.306021
1 Davenport Portuguese 22.7 43.671561 -79.448293
2 Eglinton-Lawrence Polish 12.0 43.719265 -79.429765
I have tried several ways to merge them but to no avail.
This returns NaN in all the lat and lng columns:
df3 = pd.merge(df1, df2, on="location", how="left")
This returns NaN in all the Ethnic Origins and Percent columns:
df3 = pd.merge(df1, df2, on="location", how="right")
As others have noted, the problem is that the 'location' columns do not share any values. One solution to this is to use a regular expression to get rid of everything starting with the first comma and extending to the end of the string:
df2.location = df2.location.replace(r',.*', '', regex=True)
Using the exact data you provide, this still won't work, because the two data frames use different kinds of dashes. You can solve this with a substring replacement (note that Series.replace matches whole values by default, so str.replace is needed here):
df2.location = df2.location.str.replace('—', '-', regex=False)
And then merge as you suggested
df3 = pd.merge(df1, df2, on="location", how="left")
We can use findall to create the key:
df2['location']=df2.location.str.findall('|'.join(df1.location)).str[0]
df3 = pd.merge(df1, df2, on="location", how="left")
I'm guessing the problem you're having is that the columns you're trying to merge on don't contain the same values, i.e. pandas doesn't find values in df2.location corresponding to df1. Try changing those first and it should work:
df2["location"] = df2["location"].apply(lambda x: x.split(",")[0])
df3 = pd.merge(df1, df2, on="location", how="left")
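Putting the two fixes together, a minimal sketch (combining the answers above, not from any single answer): strip everything after the first comma, unify the dashes, then merge:
# Keep only the text before the first comma, replace em dashes with hyphens.
df2['location'] = (df2['location']
                   .str.split(',').str[0]
                   .str.replace('—', '-', regex=False))
df3 = pd.merge(df1, df2, on='location', how='left')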
I am trying to merge two data frames. The frames do not share columns (except the keys), so merging should just add the columns of the right frame to the left. However, I am also getting extra rows, and I don't understand where the two extra rows come from.
If I use left_index and right_index instead, it works perfectly. However, I don't understand how a normal merge on the keys produces two extra rows like in my result. Thanks.
dat1 = np.array([['Afghanistan', 2007, 'new_ep_m1524', 0],
                 ['Afghanistan', 2007, 'new_sn_m65', 0],
                 ['Afghanistan', 2012, 'new_sn_f014', 1190],
                 ['Afghanistan', 2011, 'new_sn_f014', 851],
                 ['Afghanistan', 2013, 'newrel_m014', 1705]], dtype=object)
dat2 = np.array([['ep', 'male', '15-24', 'Afghanistan', 2007],
                 ['sn', 'male', '65+', 'Afghanistan', 2007],
                 ['sn', 'female', '0-14', 'Afghanistan', 2012],
                 ['sn', 'female', '0-14', 'Afghanistan', 2011],
                 ['rel', 'male', '0-14', 'Afghanistan', 2013]], dtype=object)
left = pd.DataFrame(data=dat1, columns=['country', 'year', 'case_type', 'count'] )
right = pd.DataFrame(data=dat2, columns=['type', 'gender', 'age_group', 'country', 'year'])
display(left), display(right)
pd.merge(left,right, on=['country', 'year'], how='outer')
(The left frame, the right frame, and the merged result with its two extra rows were shown as screenshots.)
You have keys that are repeated in your dataset: Afghanistan 2007 has two rows in each data frame. When merging with a full outer join, it is not clear which of the two Afghanistan 2007 records should match between the two data frames, so every pairing is joined. This is why there are four Afghanistan 2007 records in the merged data frame (2 matching rows in the first data frame × 2 matching rows in the second = 4 combinations).
Your merge is on the columns country and year, and this key is not unique for each row.
The merging associates the row 0 of the left dataframe with the rows 0 and 1 of the right dataframe, and the row 1 of the left dataframe with the rows 0 and 1 of the right dataframe.
To avoid that, you could deduplicate the rows or add a unique id to the merge key, according to your needs.
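pandas can also surface this problem for you: merge's validate parameter raises a MergeError when the keys are not unique on the side you expect. A short sketch using the frames above:
import pandas as pd

# Raises pandas.errors.MergeError: the key Afghanistan 2007 is duplicated
# on both sides, so the merge is many-to-many rather than one-to-one.
pd.merge(left, right, on=['country', 'year'], how='outer', validate='one_to_one')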