Join two dataframes on multiple conditions in python

I have the following problem: I am trying to join df1 = ['ID', 'Earnings', 'WC', 'Year'] and df2 = ['ID', 'F1_Earnings', 'df2_year']. So, for example: the 'F1_Earnings' of a particular company in df2 (aka the Forward Earnings), e.g. with ID = 1 and year = 1996, should get joined onto df1 in a way that they show up in df1 under ID = 1 and year = 1995.
I have no clue how to specify a join on two conditions. They obviously need to join on "ID", but how do I add a second condition so that they also join on "df1_year = df2_year - 1"?
d1 = {'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], 'Earnings': [100, 200, 400, 250, 300, 350, 400, 550, 700, 259, 300, 350], 'WC': [20, 40, 35, 55, 60, 65, 30, 28, 32, 45, 60, 52], 'Year': [1995, 1996, 1997, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998]}
df1 = pd.DataFrame(data=d1)
d2 = {'ID': [1, 2, 3, 4], 'F1_Earnings': [120, 220, 420, 280], 'WC': [23, 37, 40, 52], 'Year': [1996, 1997, 1998, 1999]}
df2 = pd.DataFrame(data=d2)
I did the following, but I guess there must be a smarter way? I am afraid it won't work for larger datasets:
df3 = pd.merge(df1, df2, how='left', on = 'ID')
df3.loc[df3['Year_x'] == df3['Year_y'] - 1]

You can use a Series as a key in merge:
df1.merge(df2, how='left',
left_on=['ID', 'Year'],
right_on=['ID', df2['Year'].sub(1)])
output:
ID Year Earnings WC_x Year_x F1_Earnings WC_y Year_y
0 1 1995 100 20 1995 120.0 23.0 1996.0
1 1 1996 200 40 1996 NaN NaN NaN
2 1 1997 400 35 1997 NaN NaN NaN
3 2 1996 250 55 1996 220.0 37.0 1997.0
4 2 1997 300 60 1997 NaN NaN NaN
5 2 1998 350 65 1998 NaN NaN NaN
6 3 1995 400 30 1995 NaN NaN NaN
7 3 1997 550 28 1997 420.0 40.0 1998.0
8 3 1998 700 32 1998 NaN NaN NaN
9 4 1996 259 45 1996 NaN NaN NaN
10 4 1997 300 60 1997 NaN NaN NaN
11 4 1998 350 52 1998 280.0 52.0 1999.0
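If you don't need the duplicated year columns afterwards, they can be dropped; a small sketch, using the column names from the output above:
out = df1.merge(df2, how='left',
                left_on=['ID', 'Year'],
                right_on=['ID', df2['Year'].sub(1)])
# Year_x duplicates the Year key and Year_y is just Year + 1
out = out.drop(columns=['Year_x', 'Year_y'])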
Or change Year to Year - 1 before the merge:
df1.merge(df2.assign(Year=df2['Year'].sub(1)),
how='left', on=['ID', 'Year'])
output:
ID Earnings WC_x Year F1_Earnings WC_y
0 1 100 20 1995 120.0 23.0
1 1 200 40 1996 NaN NaN
2 1 400 35 1997 NaN NaN
3 2 250 55 1996 220.0 37.0
4 2 300 60 1997 NaN NaN
5 2 350 65 1998 NaN NaN
6 3 400 30 1995 NaN NaN
7 3 550 28 1997 420.0 40.0
8 3 700 32 1998 NaN NaN
9 4 259 45 1996 NaN NaN
10 4 300 60 1997 NaN NaN
11 4 350 52 1998 280.0 52.0
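If you still need df2's original year after the merge, one variant is to rename it before shifting, so both survive (a sketch; NextYear is just an illustrative name):
df1.merge(df2.rename(columns={'Year': 'NextYear'})
             .assign(Year=lambda d: d['NextYear'] - 1),
          how='left', on=['ID', 'Year'])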

Related

Taking away all previous values in a column in dataframe

I am using some data where I need to find the time difference between each row and all previous rows, i.e. in row 3 I need to know the time between row 3 and row 2, row 3 and row 1, and row 3 and row 0; in row 5 I need to know the time between row 5 and row 4, row 5 and row 3, ... row 5 and row 0. I then want one big dataframe with all these differences in it (as well as the other columns).
I have made a test dataframe for this
data = {'random': [1, 3, 9, 3, 4, 7, 8, 10],
        'timestamp': [2, 138, 157, 232, 245, 302, 323, 379]}
df = pd.DataFrame(data)
I then tried to do
for i in range(len(df)):
    difference = df.timestamp.diff(periods=i + 1)
    print(difference)
This iterates through the rows, subtracting the previous row on the first iteration, the row two back on the second iteration, and so on.
I am stuck on how to combine this into one large dataframe after all the iterations, AND how to make sure the loop uses the original dataframe at the start of each iteration (not the dataframe from the previous iteration).
This is what is being output:
0 NaN
1 136.0
2 19.0
3 75.0
4 13.0
5 57.0
6 21.0
7 56.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 155.0
3 94.0
4 88.0
5 70.0
6 78.0
7 77.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 230.0
4 107.0
5 145.0
6 91.0
7 134.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 243.0
5 164.0
6 166.0
7 147.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 300.0
6 185.0
7 222.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 321.0
7 241.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 377.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
Name: timestamp, dtype: float64
If anyone knows how to solve this that would be great :)
Here is one way of solving the problem with Series.expanding:
# for each expanding window s, subtract every earlier value from the newest
# one (s.iat[-1]), most recent difference first
df['diff'] = [list(s.iat[-1] - s[-2::-1]) for s in df['timestamp'].expanding(1)]
random timestamp diff
0 1 2 []
1 3 138 [136]
2 9 157 [19, 155] #--> 157-138, 157-2
3 3 232 [75, 94, 230] #--> 232-157, 232-138, 232-2
4 4 245 [13, 88, 107, 243]
5 7 302 [57, 70, 145, 164, 300]
6 8 323 [21, 78, 91, 166, 185, 321]
7 10 379 [56, 77, 134, 147, 222, 241, 377]
I may be misunderstanding what you mean, but if you're asking how to collect these differences together:
differences = [df.timestamp.diff(periods=i + 1) for i in range(len(df))]
differences = pd.concat(differences)
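If the goal is one wide dataframe rather than one long Series, concatenating along axis=1 may be closer to what was described (a sketch; the lag_* column names are made up here):
differences = pd.concat(
    [df.timestamp.diff(periods=i + 1).rename(f'lag_{i + 1}')
     for i in range(len(df) - 1)],
    axis=1)
result = df.join(differences)  # keep the original columns alongside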
I also may be misunderstanding, but this is the best representation I could think of from what you described:
>>> df2 = df.copy()
>>> for i in df2.timestamp:
...     df2[i] = df2['timestamp'] - i
>>> df2
random timestamp 2 138 157 232 245 302 323 379
0 1 2 0 -136 -155 -230 -243 -300 -321 -377
1 3 138 136 0 -19 -94 -107 -164 -185 -241
2 9 157 155 19 0 -75 -88 -145 -166 -222
3 3 232 230 94 75 0 -13 -70 -91 -147
4 4 245 243 107 88 13 0 -57 -78 -134
5 7 302 300 164 145 70 57 0 -21 -77
6 8 323 321 185 166 91 78 21 0 -56
7 10 379 377 241 222 147 134 77 56 0
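For larger frames, the same pairwise-difference table can be built in one step with numpy broadcasting (a sketch, not part of the original answers):
import numpy as np

t = df['timestamp'].to_numpy()
# the (i, j) entry is t[i] - t[j], matching the table above
pairwise = pd.DataFrame(t[:, None] - t[None, :], index=df.index, columns=t)
df3 = pd.concat([df, pairwise], axis=1)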

Adding column in pandas based on values from other columns with conditions

I have a dataframe with information about sales of some products (unit):
unit year month price
0 1 2018 6 100
1 1 2013 4 70
2 2 2015 10 80
3 2 2015 2 110
4 3 2017 4 120
5 3 2002 6 90
6 4 2016 1 55
and I would like to add, for each sale, columns with information about the previous sales and NaN if there is no previous sale.
unit year month price prev_price prev_year prev_month
0 1 2018 6 100 70.0 2013.0 4.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 110.0 2015.0 2.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 90.0 2002.0 6.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
For the moment, I group on the unit, keep the units that have several rows, then extract for those units the information associated with the minimal date. I then join this table with my original table, keeping only the rows whose dates differ between the two merged tables.
I feel like there is a much simpler way to do this, but I am not sure how.
Use DataFrameGroupBy.shift with add_prefix, and join to append the new DataFrame to the original:
#if real data are not sorted
#df = df.sort_values(['unit','year','month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print (df)
unit year month price prev_year prev_month prev_price
0 1 2018 6 100 2013.0 4.0 70.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 2015.0 2.0 110.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 2002.0 6.0 90.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
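The desired output lists prev_price before prev_year and prev_month; if that column order matters, a simple column reindex fixes it (a sketch):
df = df[['unit', 'year', 'month', 'price', 'prev_price', 'prev_year', 'prev_month']]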

Inserting Index along with column values from one dataframe to another

If I have two data frames df1 and df2:
df1
yr
24 1984
30 1985
df2
d m yr
16 12 4 2012
17 13 10 1976
18 24 4 98
I would like to have a dataframe that gives the output below. Could you help with a function that could achieve this?
d m yr
16 12 4 2012
17 13 10 1976
18 24 4 98
24 NaN NaN 1984
30 NaN NaN 1985
You are looking to concat two dataframes:
res = pd.concat([df2, df1], sort=False)
print(res)
d m yr
16 12.0 4.0 2012
17 13.0 10.0 1976
18 24.0 4.0 98
24 NaN NaN 1984
30 NaN NaN 1985
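The introduced NaN values force d and m to float; if you would rather keep integers, pandas' nullable Int64 dtype is one option (a sketch):
res = pd.concat([df2, df1], sort=False).astype({'d': 'Int64', 'm': 'Int64'})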

How to concat two dataframes in python

I have two data frames, and I want to join them so that I can check the quantity for a given week in every year in a single data frame.
df1= City Week qty Year
hyd 35 10 2015
hyd 36 15 2015
hyd 37 11 2015
hyd 42 10 2015
hyd 23 10 2016
hyd 32 15 2016
hyd 37 11 2017
hyd 42 10 2017
pune 35 10 2015
pune 36 15 2015
pune 37 11 2015
pune 42 10 2015
pune 23 10 2016
pune 32 15 2016
pune 37 11 2017
pune 42 10 2017
df2= city Week qty Year
hyd 23 10 2015
hyd 32 15 2015
hyd 35 12 2016
hyd 36 15 2016
hyd 37 11 2016
hyd 42 10 2016
hyd 43 12 2016
hyd 44 18 2016
hyd 35 11 2017
hyd 36 15 2017
hyd 37 11 2017
hyd 42 10 2017
hyd 51 14 2017
hyd 52 17 2017
pune 35 12 2016
pune 36 15 2016
pune 37 11 2016
pune 42 10 2016
pune 43 12 2016
pune 44 18 2016
pune 35 11 2017
pune 36 15 2017
pune 37 11 2017
pune 42 10 2017
pune 51 14 2017
pune 52 17 2017
I want to join the two data frames as shown in the result below, appending the quantity for that week in every year for each city in a single data frame.
city Week qty Year y2016_wk qty y2017_wk qty y2015_week qty
hyd 35 10 2015 2016_35 12 2017_35 11 nan nan
hyd 36 15 2015 2016_36 15 2017_36 15 nan nan
hyd 37 11 2015 2016_37 11 2017_37 11 nan nan
hyd 42 10 2015 2016_42 10 2017_42 10 nan nan
hyd 23 10 2016 nan nan 2017_23 x 2015_23 10
hyd 32 15 2016 nan nan 2017_32 y 2015_32 15
hyd 37 11 2017 2016_37 11 nan nan 2015_37 x
hyd 42 10 2017 2016_42 10 nan nan 2015_42 y
pune 35 10 2015 2016_35 12 2017_35 11 nan nan
pune 36 15 2015 2016_36 15 2017_36 15 nan nan
pune 37 11 2015 2016_37 11 2017_37 11 nan nan
pune 42 10 2015 2016_42 10 2017_42 10 nan nan
You can break down your task into a few steps:
Combine your dataframes df1 and df2.
Create a list of dataframes from your combined dataframe, splitting by year.
At the same time, rename the columns to reflect the year and set the index to Week.
Finally, concatenate along axis=1 and reset_index.
Here is an example:
df = pd.concat([df1, df2], ignore_index=True)
# note: this assumes Week values are unique within each year (e.g. one city);
# duplicate Week labels would make the axis=1 concat below fail
dfs = [df[df['Year'] == y].rename(columns=lambda x: x + '_' + str(y) if x != 'Week' else x)
         .set_index('Week') for y in df['Year'].unique()]
res = pd.concat(dfs, axis=1).reset_index()
Result:
print(res)
Week qty_2015 Year_2015 qty_2016 Year_2016 qty_2017 Year_2017
0 35 10.0 2015.0 12.0 2016.0 11.0 2017.0
1 36 15.0 2015.0 15.0 2016.0 15.0 2017.0
2 37 11.0 2015.0 11.0 2016.0 11.0 2017.0
3 42 10.0 2015.0 10.0 2016.0 10.0 2017.0
4 43 NaN NaN 12.0 2016.0 NaN NaN
5 44 NaN NaN 18.0 2016.0 NaN NaN
6 51 NaN NaN NaN NaN 14.0 2017.0
7 52 NaN NaN NaN NaN 17.0 2017.0
Personally, I don't think your example output is that readable, so unless you need that format for a specific reason, I would consider using a pivot table. I also think the code required is cleaner.
import pandas as pd
df3 = pd.concat([df1, df2], ignore_index=True)
# pivot_table rather than pivot: hyd and pune share (Week, Year) pairs, and
# plain pivot raises on duplicate entries
df4 = df3.pivot_table(index='Week', columns='Year', values='qty')
print(df4)
Year 2015 2016 2017
Week
35 10.0 12.0 11.0
36 15.0 15.0 15.0
37 11.0 11.0 11.0
42 10.0 10.0 10.0
43 NaN 12.0 NaN
44 NaN 18.0 NaN
51 NaN NaN 14.0
52 NaN NaN 17.0
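If the per-city breakdown matters, keeping the city in the index avoids mixing hyd and pune (a sketch, assuming df1's City column is first renamed to match df2's city):
df5 = (pd.concat([df1.rename(columns={'City': 'city'}), df2], ignore_index=True)
         .pivot_table(index=['city', 'Week'], columns='Year', values='qty'))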

Python - Pandas: how to divide by specific key's value

I would like to calculate a column using values from other rows of a pandas dataframe.
For example, given this dataframe:
df = pd.DataFrame({
"year" : ['2017', '2017', '2017', '2017', '2017','2017', '2017', '2017', '2017'],
"rooms" : ['1', '2', '3', '1', '2', '3', '1', '2', '3'],
"city" : ['tokyo', 'tokyo', 'toyko', 'nyc','nyc', 'nyc', 'paris', 'paris', 'paris'],
"rent" : [1000, 1500, 2000, 1200, 1600, 1900, 900, 1500, 2200],
})
print(df)
city rent rooms year
0 tokyo 1000 1 2017
1 tokyo 1500 2 2017
2 toyko 2000 3 2017
3 nyc 1200 1 2017
4 nyc 1600 2 2017
5 nyc 1900 3 2017
6 paris 900 1 2017
7 paris 1500 2 2017
8 paris 2200 3 2017
I'd like to add the rent relative to another city's rent (here, nyc) for the same year and number of rooms.
Ideal results are like below:
city rent rooms year vs_nyc
0 tokyo 1000 1 2017 0.833333
1 tokyo 1500 2 2017 0.9375
2 toyko 2000 3 2017 1.052631
3 nyc 1200 1 2017 1.0
4 nyc 1600 2 2017 1.0
5 nyc 1900 3 2017 1.0
6 paris 900 1 2017 0.75
7 paris 1500 2 2017 0.9375
8 paris 2200 3 2017 1.157894
How can I add a column like vs_nyc that takes the year and rooms into account?
I tried a few things, but they didn't work:
# filtering gives NaN values, and fillna(method='pad') also didn't work
df.rent / df[df['city'] == 'nyc'].rent
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 NaN
7 NaN
8 NaN
Name: rent, dtype: float64
To illustrate:
set_index + unstack
d1 = df.set_index(['city', 'year', 'rooms']).rent.unstack('city')
d1
city nyc paris tokyo toyko
year rooms
2017 1 1200.0 900.0 1000.0 NaN
2 1600.0 1500.0 1500.0 NaN
3 1900.0 2200.0 NaN 2000.0
Then we can divide
d1.div(d1.nyc, 0)
city nyc paris tokyo toyko
year rooms
2017 1 1.0 0.750000 0.833333 NaN
2 1.0 0.937500 0.937500 NaN
3 1.0 1.157895 NaN 1.052632
solution
d1 = df.set_index(['city', 'year', 'rooms']).rent.unstack('city')
df.join(d1.div(d1.nyc, 0).stack().rename('vs_nyc'), on=['year', 'rooms', 'city'])
city rent rooms year vs_nyc
0 tokyo 1000 1 2017 0.833333
1 tokyo 1500 2 2017 0.937500
2 toyko 2000 3 2017 1.052632
3 nyc 1200 1 2017 1.000000
4 nyc 1600 2 2017 1.000000
5 nyc 1900 3 2017 1.000000
6 paris 900 1 2017 0.750000
7 paris 1500 2 2017 0.937500
8 paris 2200 3 2017 1.157895
A little cleaned up:
cols = ['city', 'year', 'rooms']
ny_rent = df.set_index(cols).rent.loc['nyc'].rename('ny_rent')
df.assign(vs_nyc=df.rent / df.join(ny_rent, on=ny_rent.index.names).ny_rent)
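For comparison, a plain merge reaches the same vs_nyc column without any reshaping (a sketch, not from the original answer; nyc_rent is just an illustrative name):
nyc = (df.loc[df['city'] == 'nyc', ['year', 'rooms', 'rent']]
         .rename(columns={'rent': 'nyc_rent'}))
out = df.merge(nyc, on=['year', 'rooms'], how='left')
out['vs_nyc'] = out['rent'] / out['nyc_rent']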
