I am trying to merge two data frames. The frames share no columns except the keys, so merging should simply add the columns of the right frame to the left. However, I am also getting extra rows, and I don't understand where the two extra rows come from.
If I merge using left_index and right_index it works perfectly. What I don't understand is how a normal merge on the keys can produce two extra rows, as in my result. Thanks
import numpy as np
import pandas as pd

dat1 = np.array([['Afghanistan', 2007, 'new_ep_m1524', 0],['Afghanistan', 2007, 'new_sn_m65', 0],
['Afghanistan', 2012, 'new_sn_f014', 1190],['Afghanistan', 2011, 'new_sn_f014', 851],
['Afghanistan', 2013, 'newrel_m014', 1705]], dtype=object)
dat2 = np.array([['ep', 'male', '15-24', 'Afghanistan', 2007],['sn', 'male', '65+', 'Afghanistan', 2007],
['sn', 'female', '0-14', 'Afghanistan', 2012],['sn', 'female', '0-14', 'Afghanistan', 2011],
['rel', 'male', '0-14', 'Afghanistan', 2013]], dtype=object)
left = pd.DataFrame(data=dat1, columns=['country', 'year', 'case_type', 'count'])
right = pd.DataFrame(data=dat2, columns=['type', 'gender', 'age_group', 'country', 'year'])
display(left)
display(right)
result = pd.merge(left, right, on=['country', 'year'], how='outer')
(The left, right, and result tables were displayed here; the merged result has two more rows than left.)
You have keys that are repeated in your dataset: Afghanistan 2007 has two rows in each data frame. When merging with a full outer join, it is not clear which of the two Afghanistan 2007 records on one side should match which on the other, so every combination is joined. That is why there are four Afghanistan 2007 records in the merged data frame (2 rows from the first data frame × 2 rows from the second = 4 combinations).
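If you expect each (country, year) pair to be unique, you can make pandas check that assumption with merge's validate argument; a minimal sketch with the frames above:
# raises pandas.errors.MergeError here, because ('Afghanistan', 2007)
# appears twice on each side of the merge
pd.merge(left, right, on=['country', 'year'], how='outer',
         validate='one_to_one')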
Your merge is on the columns country and year, and that combination is not unique for each row.
The merge therefore associates row 0 of the left dataframe with rows 0 and 1 of the right dataframe, and row 1 of the left dataframe with rows 0 and 1 of the right dataframe.
To avoid that, you could deduplicate the keys or add a unique id column to the join, according to your needs; see the sketch below.
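As a sketch of the unique-id option (the occ helper column is my own naming, not from the question): groupby().cumcount() numbers the repeated (country, year) pairs on each side so they pair up positionally instead of forming a Cartesian product:
# number repeated (country, year) pairs 0, 1, ... within each side
left['occ'] = left.groupby(['country', 'year']).cumcount()
right['occ'] = right.groupby(['country', 'year']).cumcount()
# each (country, year, occ) triple is now unique, so no extra rows appear
merged = (pd.merge(left, right, on=['country', 'year', 'occ'], how='outer')
          .drop(columns='occ'))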
I have a large CSV file of sports data, and I need to transform it so that teams with the same game_id end up on the same row, with new columns created from the homeAway column and the existing columns. Is there a way to do this with Pandas?
Existing format:
game_id school conference homeAway points
332410041 Connecticut American Athletic home 18
332410041 Towson CAA away 33
Desired format:
game_id home_school home_conference home_points away_school away_conference away_points
332410041 Connecticut American Athletic 18 Towson CAA 33
One way to solve this is to load the table into a Pandas dataframe and filter it on 'homeAway' to create separate 'home' and 'away' dataframes. The columns of the 'away' table are relabelled, while the key column keeps its original name. We then join the two to produce the desired output.
import pandas as pd

data = {'game_id': [332410041, 332410041],
        'school': ['Connecticut', 'Towson'],
        'conference': ['American Athletic', 'CAA'],
        'homeAway': ['home', 'away'],
        'points': [18, 33]}
df = pd.DataFrame(data)

# split on homeAway; .copy() avoids a SettingWithCopyWarning when
# we drop the column afterwards
home = df[df['homeAway'] == 'home'].copy()
home = home.drop(columns='homeAway')
away = df[df['homeAway'] == 'away'].copy()
away = away.drop(columns='homeAway')

# prefix the away columns; game_id keeps its name so it stays the join key
away.columns = ['game_id', 'away_school', 'away_conference', 'away_points']

# with no 'on' argument, merge joins on the shared 'game_id' column
home.merge(away)
Create two dataframes, 'home' and 'away', selected by the unique values in the 'homeAway' column, using Boolean indexing.
Drop the now-obsolete 'homeAway' column.
Rename the appropriate columns with a 'home_' or 'away_' prefix.
This can be done in a for-loop, with each dataframe added to a list, which consolidates into a simple list-comprehension.
Use pd.merge to combine the two dataframes on the common 'game_id' column.
See Merge, join, concatenate and compare and Pandas Merging 101 for additional details.
import pandas as pd
# test dataframe
data = {'game_id': [332410041, 332410041, 662410041, 662410041, 772410041, 772410041],
'school': ['Connecticut', 'Towson', 'NY', 'CA', 'FL', 'AL'],
'conference': ['American Athletic', 'CAA', 'a', 'b', 'c', 'd'],
'homeAway': ['home', 'away', 'home', 'away', 'home', 'away'], 'points': [18, 33, 1, 2, 3, 4]}
df = pd.DataFrame(data)
# create list of dataframes
dfl = [(df[df.homeAway.eq(loc)]
.drop('homeAway', axis=1)
.rename({'school': f'{loc}_school',
'conference': f'{loc}_conference',
'points': f'{loc}_points'}, axis=1))
for loc in df.homeAway.unique()]
# combine the dataframes
df_new = pd.merge(dfl[0], dfl[1])
# display(df_new)
game_id home_school home_conference home_points away_school away_conference away_points
0 332410041 Connecticut American Athletic 18 Towson CAA 33
1 662410041 NY a 1 CA b 2
2 772410041 FL c 3 AL d 4
Assume I have two dataframes, df1 and df2, described as follows. See the code below that creates each of these dfs.
df1
Has 5,000 rows and 10,000 columns.
The first column contains a list of non-sequential dates, listed oldest to newest, but not every day appears. Each date is unique.
Each column is labeled with a different person's name. Each column name is unique.
All columns other than the Date column contain a number value.
df2
Has 2,000,000 rows and 4 columns.
The first column contains a list of dates. These are NOT sorted by oldest to newest.
The next column contains a person's name (which is listed as the column name in one of the columns of df1).
The other two columns contain data about that person based on the date listed in the row.
My Objective
I want to populate the two blank columns in df2 using data pulled from df1.
For instance, the first row of df2 lists a date of 2017-05-15 and a Person named Person4. I want to populate df2['Value_Today'] with 4752 and df2['Value_2_records_later'] with 4866.
For the next row of df2 (Date 2019-01-28, Person1), I want to populate df2['Value_Today'] with 1918 and df2['Value_2_records_later'] with 1912.
I want to do this for all 2 million rows in df2, so I assume that a for loop is a bad idea.
Any help would be greatly appreciated. Thank you!
Code
# Import dependencies
import pandas as pd
import numpy as np
# Create df1
df1 = pd.DataFrame(np.array([['2016-05-03', 1651,2653,3655,4658,5655],
['2017-05-29', 1751,2752,3754,4755, 5759],
['2018-08-22', 1889, 2882,3887, 4884, 5882],
['2019-06-28', 1966, 2965, 3966, 4960, 5963],
['2018-11-15', 1811, 2811, 3811, 4811, 5811],
['2019-12-31', 1912, 2912, 3912, 4912, 5912],
['2016-07-05', 1672, 2678, 3679, 4672, 5674],
['2017-05-15', 1755, 2750, 3759, 4752, 5755],
['2018-06-10', 1860, 2864, 3866, 4866, 5867],
['2019-01-28', 1918, 2910, 3914, 4911, 5918],
['2018-11-30', 1812, 2812, 3812, 4812, 5812],
['2019-01-03', 1915, 2917, 3916, 4916, 5917],]),
columns=['Date', 'Person1', 'Person2', 'Person3', 'Person4',
'Person5',])
# Format df1['Date'] col as datetime
df1['Date'] = pd.to_datetime(df1['Date'])
# Sort df1 by 'Date'
df1 = df1.sort_values(['Date'],ascending=[True]).reset_index(drop=True)
# Create 'df2', which contains measurement data on specific dates.
df2 = pd.DataFrame(np.array([['2017-05-15', 'Person4', '', ''], ['2019-01-28', 'Person1', '', ''],
['2018-11-15', 'Person1', '', ''], ['2018-08-22', 'Person3', '', ''],
['2017-05-15', 'Person5', '', ''], ['2016-05-03', 'Person2', '', ''],]),
columns=['Date', 'Person', 'Value_Today', 'Value_2_records_later'])
df2['Date'] = pd.to_datetime(df2['Date'])
# Display dfs
display(df1)
display(df2)
### I DON'T KNOW WHAT CODE I NEED TO SOLVE MY ISSUE ###
# To capture the row that is two rows below, I think I would use the '.shift(-2)' function?
Solution with MultiIndex.map:
Set the index of df1 to Date.
Stack the dataframe to create the MultiIndex mapping series s1; the index of this series is the combination of date and person name. Create a second series s2 the same way from the frame shifted up by two rows.
Set the index of df2 to the Date and Person columns.
Map that index through s1 and s2 and assign the results to Value_Today and Value_2_records_later.
# (Date, Person) -> value mapping series
s1 = df1.set_index('Date').stack()
# the same mapping built from the frame shifted up two rows,
# i.e. the value two records later
s2 = df1.set_index('Date').shift(-2).stack()
# the (Date, Person) pairs to look up
ix = df2.set_index(['Date', 'Person']).index
df2['Value_Today'] = ix.map(s1)
df2['Value_2_records_later'] = ix.map(s2)
Result
print(df2)
Date Person Value_Today Value_2_records_later
0 2017-05-15 Person4 4752 4866
1 2019-01-28 Person1 1918 1912
2 2018-11-15 Person1 1811 1915
3 2018-08-22 Person3 3887 3812
4 2017-05-15 Person5 5755 5867
5 2016-05-03 Person2 2653 2750
First, make a shifted copy of the values for Value_2_records_later:
step1 = df1.set_index('Date')
persons = step1.columns.tolist()
# two parallel sets of MultiIndex column labels
c1 = [('Value_Today', p) for p in persons]
c2 = [('Value_2_records_later', p) for p in persons]
step1.columns = pd.MultiIndex.from_tuples(c1, names=('', 'Person'))
# the shifted copy holds the value two records later
step1[c2] = step1[c1].shift(-2)
Then stack to move the columns to rows:
step1.stack()
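The answer stops at the stacked frame; one way to finish from there (my addition, not part of the original answer) is to map the stacked columns onto df2's (Date, Person) index, mirroring the first solution:
step2 = step1.stack()  # index becomes (Date, Person)
ix = df2.set_index(['Date', 'Person']).index
df2['Value_Today'] = ix.map(step2['Value_Today'])
df2['Value_2_records_later'] = ix.map(step2['Value_2_records_later'])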
I have a list of names, and I want to retrieve the corresponding information for each name from different data frames to form a new dataframe.
I converted the list into a one-column dataframe and then tried to look up its corresponding values in the other dataframes.
I have tried:
import pandas as pd
data = {'Name': ["David","Mike","Lucy"]}
data_h = {'Name': ["David","Mike","Peter", "Lucy"],
'Hobby': ['Music','Sports','Cooking','Reading'],
'Member': ['Yes','Yes','Yes','No']}
data_s = {'Name': ["David","Lancy", "Mike","Lucy"],
'Speed': [56, 42, 35, 66],
'Location': ['East','East','West','West']}
df = pd.DataFrame(data)
df_hobby = pd.DataFrame(data_h)
df_speed = pd.DataFrame(data_s)
df['Hobby'] = df.lookup(df['Name'], df_hobby['Hobby'])
print (df)
But it returns the error message as:
ValueError: Row labels must have same size as column labels
I have also tried:
df = pd.merge(df, df_hobby, on='Name')
It works but it includes unnecessary columns.
What would be a smart and efficient way to do this, especially when there are many dataframes to look up?
Thank you.
Filter each dataframe down to the key column plus the columns to append, then chain the merges:
df = (pd.merge(df, df_hobby[['Name','Hobby']], on='Name')
.merge(df_speed[['Name','Location']], on='Name'))
print(df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
If you want to work with a list of dataframes, use this solution with the same column filtering:
from functools import reduce

dfList = [df,
          df_hobby[['Name','Hobby']],
          df_speed[['Name','Location']]]
df = reduce(lambda df1,df2: pd.merge(df1,df2,on='Name'), dfList)
print (df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
I am trying to reproduce my issue with sample data. I am trying to use pandas to merge two dataframes on differently named columns (Code and Number), bringing over only one column from df2 (Location). I get either a KeyError or NaN.
Both were imported from CSV files as data frames;
Neither column name has white space;
Both columns have the same dtype.
I have tried looking at other answers here, literally copying and pasting the code from those answers and filling in my parts, and I still get errors or NaN.
df1:
[['Name', 'Income', 'Favourite superhero', 'Code', 'Colour'],
['Joe', '80000', 'Batman', '10004', 'Red'],
['Christine', '50000', 'Superman', '10005', 'Brown'],
['Joey', '90000', 'Aquaman', '10002', 'Blue']]
df2:
[['Number', 'Language', 'Location'],
['10005', 'English', 'Sudbury'],
['10002', 'French', 'Ottawa'],
['10004', 'German', 'New York']]
What I tried:
data = pd.merge(CSV1,
CSV2[['Location']],
left_on='Code',
right_on='Number',
how='left')
data = pd.merge(CSV1,
CSV2[['Location']],
left_on='Code',
right_index=True,
how='left')
I am trying to end up with df1 plus the Location column from df2 for each instance where Number and Code are the same.
For both of your commands to work, you need Number to exist in the right-side dataframe. For the 1st command, you then need to drop the Number column after the merge. For the 2nd command, you need to set_index on the sliced right dataframe, after which there is no Number column to drop. I modified your commands accordingly:
CSV1.merge(CSV2[['Number', 'Location']], left_on='Code', right_on='Number', how='left').drop(columns='Number')
Or
CSV1.merge(CSV2[['Number', 'Location']].set_index('Number'), left_on='Code', right_index=True, how='left')
Out[892]:
Name Income Favourite superhero Code Colour Location
0 Joe 80000 Batman 10004 Red New York
1 Christine 50000 Superman 10005 Brown Sudbury
2 Joey 90000 Aquaman 10002 Blue Ottawa
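As an aside (my addition, not from the original answer): when only a single column is needed, Series.map offers a lighter alternative to a merge:
# build a Number -> Location lookup and map it over Code
CSV1['Location'] = CSV1['Code'].map(CSV2.set_index('Number')['Location'])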
Example Code & Output:
data_country1 = {'Country': [np.NaN, 'India', 'Brazil'],
'Capital': [np.NaN, 'New Delhi', 'Brasília'],
'Population': [np.NaN, 1303171035, 207847528]}
df_country1 = pd.DataFrame(data_country1, columns=['Country', 'Capital', 'Population'])
data_country2= {'Country': ['Belgium', 'India', 'Brazil'],
'Capital': ['Brussels', 'New Delhi', 'Brasília'],
'Population': [102283932, 1303171035, 207847528]}
df_country2 = pd.DataFrame(data_country2, columns=['Country', 'Capital', 'Population'])
print(df_country1)
print(df_country2)
Country Capital Population
0 NaN NaN NaN
1 India New Delhi 1.303171e+09
2 Brazil Brasília 2.078475e+08
Country Capital Population
0 Belgium Brussels 102283932
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
In the first DataFrame, for every row that consists entirely of NaN, I want to replace the whole row with the corresponding row from the other dataframe (in this example, row 0 of the second dataframe), so that the first df ends up with the same information as the second dataframe.
You can find the rows that have NaN for all elements, and replace them with the rows of the other dataframe using:
# find the indices that are all NaN
na_indices = df_country1.index[df_country1.isnull().all(axis=1)]
# replace those indices with the values of the other dataframe
df_country1.loc[na_indices,:] = df_country2.loc[na_indices,:]
This assumes that the data frames are the same shape and you want to match on the missing rows.
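A related option worth knowing (my suggestion, not part of the original answer) is DataFrame.combine_first, which fills every NaN cell of the first frame from the matching cell of the second, so an all-NaN row gets replaced wholesale; note that it also fills isolated NaN cells, not only fully empty rows:
# fill each NaN in df_country1 from the matching position in df_country2
df_country1 = df_country1.combine_first(df_country2)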
I would join the two dataframes, dropping the all-NaN rows from the first one beforehand:
data_complete = pd.merge(df_country1.dropna(), df_country2, on=['Country','Capital','Population'], how='outer')
You can combine them using pd.concat (DataFrame.append is deprecated in recent pandas), drop any duplicates (rows that were in both data frames), and then remove all the rows where every value is NaN:
#combine into one data frame with unique values
df_country = pd.concat([df_country1, df_country2], ignore_index=True).drop_duplicates()
#filter out NaN rows
df_country = df_country.drop(df_country.index[df_country.isnull().all(axis=1)])
Passing ignore_index=True gives each row a unique index, so when the all-NaN check returns index 0 you don't also delete the row of df_country2 that originally had index 0.