Adding correlation result back to pandas dataframe - python

I am wondering how to add the corr() result back to a pandas DataFrame, since the current output is rather nested. I just want one column in the original DataFrame that lists the correlation value. What's the best way to achieve this?
id date water fire
0 apple 2018-01-01 100 100
1 orange 2018-01-01 110 110
2 apple 2019-01-01 90 9
3 orange 2019-01-01 50 50
4 apple 2020-01-01 40 4
5 orange 2020-01-01 60 60
6 apple 2021-01-01 70 470
7 orange 2021-01-01 80 15
8 apple 2022-01-01 90 90
9 orange 2022-01-01 100 9100
import datetime
import pandas as pd

data = pd.DataFrame({
    'id': ['apple', 'orange', 'apple', 'orange', 'apple', 'orange', 'apple', 'orange', 'apple', 'orange'],
    'date': [
        datetime.datetime(2018, 1, 1),
        datetime.datetime(2018, 1, 1),
        datetime.datetime(2019, 1, 1),
        datetime.datetime(2019, 1, 1),
        datetime.datetime(2020, 1, 1),
        datetime.datetime(2020, 1, 1),
        datetime.datetime(2021, 1, 1),
        datetime.datetime(2021, 1, 1),
        datetime.datetime(2022, 1, 1),
        datetime.datetime(2022, 1, 1),
    ],
    'water': [100, 110, 90, 50, 40, 60, 70, 80, 90, 100],
    'fire': [100, 110, 9, 50, 4, 60, 470, 15, 90, 9100],
})
data.groupby('id')[['water', 'fire']].apply(lambda x: x.rolling(3).corr())
water fire
id
apple 0 water NaN NaN
fire NaN NaN
2 water NaN NaN
fire NaN NaN
4 water 1.000000 0.663924
fire 0.663924 1.000000
6 water 1.000000 0.123983
fire 0.123983 1.000000
8 water 1.000000 0.285230
fire 0.285230 1.000000
orange 1 water NaN NaN
fire NaN NaN
3 water NaN NaN
fire NaN NaN
5 water 1.000000 1.000000
fire 1.000000 1.000000
7 water 1.000000 -0.854251
fire -0.854251 1.000000
9 water 1.000000 0.863867
fire 0.863867 1.000000

Here is one way to do it:
df = pd.concat(
    [
        data,
        data.groupby("id")[["water", "fire"]]
            .apply(lambda x: x.rolling(3).corr())
            .reset_index()
            .drop_duplicates(subset=["level_1"])
            .set_index("level_1")["fire"]
            .rename("corr"),
    ],
    axis=1,
)
print(df)
# Output
id date water fire corr
0 apple 2018-01-01 100 100 NaN
1 orange 2018-01-01 110 110 NaN
2 apple 2019-01-01 90 9 NaN
3 orange 2019-01-01 50 50 NaN
4 apple 2020-01-01 40 4 0.663924
5 orange 2020-01-01 60 60 1.000000
6 apple 2021-01-01 70 470 0.123983
7 orange 2021-01-01 80 15 -0.854251
8 apple 2022-01-01 90 90 0.285230
9 orange 2022-01-01 100 9100 0.863867
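For comparison, here is a shorter sketch of the same idea (my own variant, not part of the answer above): compute the pairwise rolling correlation of the two columns as a Series inside each group; it keeps the original row index, so it can be assigned directly as a new column.
# A minimal sketch: rolling correlation of 'water' vs 'fire' within each id group.
# group_keys=False keeps the original row labels so the result aligns on assignment.
data["corr"] = data.groupby("id", group_keys=False).apply(
    lambda g: g["water"].rolling(3).corr(g["fire"])
)
print(data)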

Related

Stick the dataframe rows and columns into one row + replace the NaN values with the day before or after

I have a df and I want to stick its values together into one row. First I want to select a specific time span and replace the NaN values with the value from the day before. Here is a simple example: I only want to keep the values in 2020, stick them together ordered by time, and also replace each NaN value with the value from the day before.
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['day'] = ['2020-01-01', '2019-01-01', '2020-01-02', '2020-01-03', '2018-01-01', '2020-01-15', '2020-03-01', '2020-02-01', '2017-01-01']
df['value_1'] = [1, np.nan, 32, 48, 5, -1, 5, 10, 2]
df['value_2'] = [np.nan, 121, 23, 34, 15, 21, 15, 12, 39]
df
day value_1 value_2
0 2020-01-01 1.0 NaN
1 2019-01-01 NaN 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
4 2018-01-01 5.0 15.0
5 2020-01-15 -1.0 21.0
6 2020-03-01 5.0 15.0
7 2020-02-01 10.0 12.0
8 2017-01-01 2.0 39.0
The output:
_1 _2 _3 _4 _5 _6 _7 _8 _9 _10 _11 _12
0 1 121 1 23 48 34 -1 21 10 12 -1 21
I have tried to use the following code, but it does not solve my problem:
val_cols = df.filter(like='value_').columns
output = (df.pivot('day', val_cols)
            .groupby(level=0, axis=1)
            .apply(lambda x: x.ffill(axis=1).bfill(axis=1))
            .sort_index(axis=1, level=1))
I don't know what the output is supposed to be, but I think this should do at least part of what you're trying to do:
# parse dates, sort chronologically, then forward-fill only the 2020 rows
df['day'] = pd.to_datetime(df['day'], format='%Y-%m-%d')
df = df.sort_values(by=['day'])
filter_2020 = df['day'].dt.year == 2020
val_cols = df.filter(like='value_').columns
df.loc[filter_2020, val_cols] = df.loc[:, val_cols].ffill().loc[filter_2020]
print(df)
day value_1 value_2
8 2017-01-01 2.0 39.0
4 2018-01-01 5.0 15.0
1 2019-01-01 NaN 121.0
0 2020-01-01 1.0 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
5 2020-01-15 -1.0 21.0
7 2020-02-01 10.0 12.0
6 2020-03-01 5.0 15.0
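The code above only handles the forward-fill part. The single-row layout asked for in the question is a bit ambiguous, but here is a hedged sketch (the _1, _2, ... column names are my assumption) that flattens the filled 2020 rows, already sorted by day above, into one wide row:
# Flatten value_1/value_2 of the 2020 rows into a single wide row.
vals = df.loc[filter_2020, val_cols].to_numpy().ravel()
one_row = pd.DataFrame([vals], columns=[f'_{i + 1}' for i in range(len(vals))])
print(one_row)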

Join two dataframes on multiple conditions in python

I have the following problem: I am trying to join df1 = ['ID', 'Earnings', 'WC', 'Year'] and df2 = ['ID', 'F1_Earnings', 'df2_year']. So, for example, the 'F1_Earnings' of a particular company in df2 (a.k.a. the forward earnings), say with ID = 1 and year = 1996, should get joined onto df1 in a way that it shows up in df1 under ID = 1 and year = 1995.
I have no clue how to specify a join on two conditions. Of course they need to join on "ID", but how do I add a second condition which specifies that they also join on "df1_year = df2_year - 1"?
import pandas as pd

d1 = {'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
      'Earnings': [100, 200, 400, 250, 300, 350, 400, 550, 700, 259, 300, 350],
      'WC': [20, 40, 35, 55, 60, 65, 30, 28, 32, 45, 60, 52],
      'Year': [1995, 1996, 1997, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998]}
df1 = pd.DataFrame(data=d1)
d2 = {'ID': [1, 2, 3, 4], 'F1_Earnings': [120, 220, 420, 280],
      'WC': [23, 37, 40, 52], 'Year': [1996, 1997, 1998, 1999]}
df2 = pd.DataFrame(data=d2)
I did the following, but I guess there must be a smarter way? I am afraid it won't work for larger datasets:
df3 = pd.merge(df1, df2, how='left', on = 'ID')
df3.loc[df3['Year_x'] == df3['Year_y'] - 1]
You can use a Series as key in merge:
df1.merge(df2, how='left',
          left_on=['ID', 'Year'],
          right_on=['ID', df2['Year'].sub(1)])
output:
ID Year Earnings WC_x Year_x F1_Earnings WC_y Year_y
0 1 1995 100 20 1995 120.0 23.0 1996.0
1 1 1996 200 40 1996 NaN NaN NaN
2 1 1997 400 35 1997 NaN NaN NaN
3 2 1996 250 55 1996 220.0 37.0 1997.0
4 2 1997 300 60 1997 NaN NaN NaN
5 2 1998 350 65 1998 NaN NaN NaN
6 3 1995 400 30 1995 NaN NaN NaN
7 3 1997 550 28 1997 420.0 40.0 1998.0
8 3 1998 700 32 1998 NaN NaN NaN
9 4 1996 259 45 1996 NaN NaN NaN
10 4 1997 300 60 1997 NaN NaN NaN
11 4 1998 350 52 1998 280.0 52.0 1999.0
Or change the Year to Year-1, before the merge:
df1.merge(df2.assign(Year=df2['Year'].sub(1)),
          how='left', on=['ID', 'Year'])
output:
ID Earnings WC_x Year F1_Earnings WC_y
0 1 100 20 1995 120.0 23.0
1 1 200 40 1996 NaN NaN
2 1 400 35 1997 NaN NaN
3 2 250 55 1996 220.0 37.0
4 2 300 60 1997 NaN NaN
5 2 350 65 1998 NaN NaN
6 3 400 30 1995 NaN NaN
7 3 550 28 1997 420.0 40.0
8 3 700 32 1998 NaN NaN
9 4 259 45 1996 NaN NaN
10 4 300 60 1997 NaN NaN
11 4 350 52 1998 280.0 52.0
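If you only need the forward earnings and not df2's other columns, a hedged variant of the second approach (the column selection is my addition, not part of the original answer) avoids the WC_x/WC_y suffixes:
# Shift df2's Year back by one, keep only the key columns plus F1_Earnings, then left-join.
df3 = df1.merge(
    df2.assign(Year=df2['Year'] - 1)[['ID', 'Year', 'F1_Earnings']],
    how='left', on=['ID', 'Year'],
)
print(df3)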

Taking away all previous values in a column in a dataframe

I am using some data where I need to find the time difference between each row and all previous rows, i.e. in row 3 I need to know the time between row 2 and row 1 and between row 2 and row 0; in row 5 I need to know the time between row 5 and row 4, row 5 and row 3, ..., row 5 and row 0. I then want to have one big dataframe with all these differences in it (as well as the other columns).
I have made a test dataframe for this
import pandas as pd

data = {'random': [1, 3, 9, 3, 4, 7, 8, 10],
        'timestamp': [2, 138, 157, 232, 245, 302, 323, 379]}
df = pd.DataFrame(data)
I then tried to do
for i in range(0,len(df-1)):
    difference = df.timestamp.diff(periods=i+1)
    print(difference)
The idea is to iterate through each row and take away the previous row on the first iteration, the row two back on the second iteration, etc.
I am stuck on how to combine this into one large dataframe after all the iterations AND how to make sure the loop uses the original dataframe at the start of each iteration (not the dataframe from the previous iteration).
This is what is being outputted
0 NaN
1 136.0
2 19.0
3 75.0
4 13.0
5 57.0
6 21.0
7 56.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 155.0
3 94.0
4 88.0
5 70.0
6 78.0
7 77.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 230.0
4 107.0
5 145.0
6 91.0
7 134.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 243.0
5 164.0
6 166.0
7 147.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 300.0
6 185.0
7 222.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 321.0
7 241.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 377.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
Name: timestamp, dtype: float64
If anyone knows how to solve this that would be great :)
Here is one way of solving the problem with Series.expanding:
df['diff'] = [list(s.iat[-1] - s[-2::-1]) for s in df['timestamp'].expanding(1)]
random timestamp diff
0 1 2 []
1 3 138 [136]
2 9 157 [19, 155] #--> 157-138, 157-2
3 3 232 [75, 94, 230] #--> 232-157, 232-138, 232-2
4 4 245 [13, 88, 107, 243]
5 7 302 [57, 70, 145, 164, 300]
6 8 323 [21, 78, 91, 166, 185, 321]
7 10 379 [56, 77, 134, 147, 222, 241, 377]
I may be misunderstanding what you mean but if you're asking how to collect these differences together:
differences = [df.timestamp.diff(periods=i+1) for i in range(0,len(df-1))]
differences = pd.concat(differences)
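If the goal is one wide dataframe with a column per lag, a hedged variant (the diff_* column names are my own, not from the answer above):
# One column per lag, aligned with the original rows.
lags = {f'diff_{i + 1}': df['timestamp'].diff(periods=i + 1) for i in range(len(df) - 1)}
wide = pd.concat([df, pd.DataFrame(lags)], axis=1)
print(wide)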
I also may be misunderstanding, but this is the best representation I could think of from what you described:
>>> df2 = df.copy()
>>> for i in df2.timestamp:
...     df2[i] = df2['timestamp'] - i
...
>>> df2
random timestamp 2 138 157 232 245 302 323 379
0 1 2 0 -136 -155 -230 -243 -300 -321 -377
1 3 138 136 0 -19 -94 -107 -164 -185 -241
2 9 157 155 19 0 -75 -88 -145 -166 -222
3 3 232 230 94 75 0 -13 -70 -91 -147
4 4 245 243 107 88 13 0 -57 -78 -134
5 7 302 300 164 145 70 57 0 -21 -77
6 8 323 321 185 166 91 78 21 0 -56
7 10 379 377 241 222 147 134 77 56 0
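A vectorized sketch of the same pairwise-difference table, using NumPy broadcasting instead of a Python loop (my addition, not from either answer):
import numpy as np

# Outer difference: element [i, j] is timestamp[i] - timestamp[j].
ts = df['timestamp'].to_numpy()
pairwise = pd.DataFrame(ts[:, None] - ts[None, :], index=df.index, columns=ts)
df3 = pd.concat([df, pairwise], axis=1)
print(df3)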

DataFrame: how do I find value in one column for a quantile in a second column

I have a DataFrame shown below with dates, offset and count.
For example, this is the start of the dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([['2018-01-01', 0, 1], ['2018-01-01', 26, 2], ['2018-01-01', 178, 8],
              ['2018-01-01', 187, 10], ['2018-01-01', 197, 13], ['2018-01-01', 208, 15],
              ['2018-01-01', 219, 16], ['2018-01-01', 224, 19], ['2018-01-01', 232, 21],
              ['2018-01-01', 233, 25], ['2018-01-01', 236, 32],
              ['2018-01-02', 0, 1], ['2018-01-02', 11, 4], ['2018-01-02', 12, 7],
              ['2018-01-02', 20, 12], ['2018-01-02', 35, 24]]),
    columns=['obs_date', 'offset', 'count'])
obs_date offset count
0 2018-01-01 0 1
1 2018-01-01 26 2
2 2018-01-01 178 8
3 2018-01-01 187 10
4 2018-01-01 197 13
5 2018-01-01 208 15
6 2018-01-01 219 16
7 2018-01-01 224 19
8 2018-01-01 232 21
9 2018-01-01 233 25
10 2018-01-01 236 32
11 2018-01-02 0 1
12 2018-01-02 11 4
13 2018-01-02 12 7
14 2018-01-02 20 12
15 2018-01-02 35 24
etc
I'd like to get the (cumulative) ['count'] quantile [0.25, 0.5, 0.75] for each date and find the row with the ['offset'] that that quantile applies to.
The total count for each date will be different, and the offsets are not regular.
So for 2018-01-01 I want the date & offset that correspond to counts of 8, 16 & 24 (0.25, 0.5 and 0.75 of 32).
something like
0 2018-01-01 178 0.25
1 2018-01-01 219 0.5
2 2018-01-01 232.75 0.75
3 2018-01-02 43 0.25
etc
This worked for me:
df['count'] = df['count'].astype(int)
quantiles = [.25, .5, .75]

def get_offset(x):
    s = x['count']
    indices = [(s.sort_values()[::-1] <= s.quantile(q)).idxmax() for q in quantiles]
    return df.iloc[indices, x.columns.get_loc('offset')]

res = df.groupby('obs_date').apply(get_offset).reset_index(level=0)
Then you can concat with quantiles:
pd.concat([res.reset_index(drop=True), pd.Series(quantiles * df.obs_date.nunique())], axis=1)
obs_date offset 0
0 2018-01-01 178 0.25
1 2018-01-01 208 0.50
2 2018-01-01 224 0.75
3 2018-01-02 11 0.25
4 2018-01-02 12 0.50
5 2018-01-02 20 0.75
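An alternative sketch of my own (assuming 'count' is already cumulative within each date, and that 'offset' and 'count' have been cast to int as above): for each date, take the first offset whose count reaches the quantile of that date's maximum count.
df['offset'] = df['offset'].astype(int)

def offsets_at_quantiles(g, qs=(0.25, 0.5, 0.75)):
    # First offset whose cumulative count reaches q * max(count), for each q.
    return pd.Series({q: g.loc[g['count'] >= q * g['count'].max(), 'offset'].iloc[0]
                      for q in qs})

print(df.groupby('obs_date').apply(offsets_at_quantiles))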

Python: doing multiple column aggregation in pandas

I have a dataframe where I want to do multiple column aggregations in pandas.
import pandas as pd
import numpy as np

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})
df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean})
With this code, I get the mean for lat. I would also like to find the mean for long.
I tried df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean}).long.agg({'avg_long': np.mean}) but this produces
AttributeError: 'DataFrame' object has no attribute 'long'
If I just do avg_long, the code works as well.
df2 = df.groupby(['ser_no', 'CTRY_NM']).long.agg({'avg_long': np.mean})
In [42]: df2
Out[42]:
avg_long
ser_no CTRY_NM
1 a 21.5
b 23.0
2 a 26.0
b 27.0
e 24.5
3 b 28.5
d 30.0
Is there a way to do this in one step or is this something I have to do separately and join back later?
I think it is simpler to use GroupBy.mean:
print(df.groupby(['ser_no', 'CTRY_NM']).mean())
lat long
ser_no CTRY_NM
1 a 1.5 21.5
b 3.0 23.0
2 a 6.0 26.0
b 7.0 27.0
e 4.5 24.5
3 b 8.5 28.5
d 10.0 30.0
Or if you need to define which columns to aggregate:
print(df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean'}))
lat long
ser_no CTRY_NM
1 a 1.5 21.5
b 3.0 23.0
2 a 6.0 26.0
b 7.0 27.0
e 4.5 24.5
3 b 8.5 28.5
d 10.0 30.0
More info in docs.
EDIT:
If you need to rename the column names, i.e. flatten the MultiIndex in the columns, you can use a list comprehension:
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                   'date': pd.date_range(pd.to_datetime('2016-02-24'),
                                         pd.to_datetime('2016-02-28'), freq='10H')})
print(df)
CTRY_NM date lat long ser_no
0 a 2016-02-24 00:00:00 1 21 1
1 a 2016-02-24 10:00:00 2 22 1
2 b 2016-02-24 20:00:00 3 23 1
3 e 2016-02-25 06:00:00 4 24 2
4 e 2016-02-25 16:00:00 5 25 2
5 a 2016-02-26 02:00:00 6 26 2
6 b 2016-02-26 12:00:00 7 27 2
7 b 2016-02-26 22:00:00 8 28 3
8 b 2016-02-27 08:00:00 9 29 3
9 d 2016-02-27 18:00:00 10 30 3
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean',
                                             'date': [min, max, 'count']})
df2.columns = ['_'.join(col) for col in df2.columns]
print(df2)
lat_mean date_min date_max date_count \
ser_no CTRY_NM
1 a 1.5 2016-02-24 00:00:00 2016-02-24 10:00:00 2
b 3.0 2016-02-24 20:00:00 2016-02-24 20:00:00 1
2 a 6.0 2016-02-26 02:00:00 2016-02-26 02:00:00 1
b 7.0 2016-02-26 12:00:00 2016-02-26 12:00:00 1
e 4.5 2016-02-25 06:00:00 2016-02-25 16:00:00 2
3 b 8.5 2016-02-26 22:00:00 2016-02-27 08:00:00 2
d 10.0 2016-02-27 18:00:00 2016-02-27 18:00:00 1
long_mean
ser_no CTRY_NM
1 a 21.5
b 23.0
2 a 26.0
b 27.0
e 24.5
3 b 28.5
d 30.0
You are getting the error because you first select the lat column of the dataframe and do operations on that column. Getting the long column through that Series is not possible; you need the DataFrame.
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean)
would do the same operation for both columns. If you want the column names changed, you can rename the columns afterwards:
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean).rename(columns={"lat": "avg_lat", "long": "avg_long"})
In [22]:
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean).rename(columns={"lat": "avg_lat", "long": "avg_long"})
df2
Out[22]:
avg_lat avg_long
ser_no CTRY_NM
1 a 1.5 21.5
b 3.0 23.0
2 a 6.0 26.0
b 7.0 27.0
e 4.5 24.5
3 b 8.5 28.5
d 10.0 30.0
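On newer pandas versions (0.25+), named aggregation gives the renamed columns in one step; a short sketch of that route (my addition, not part of the original answers):
# Named aggregation: new column name = (source column, aggregation function).
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg(avg_lat=('lat', 'mean'),
                                            avg_long=('long', 'mean'))
print(df2)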
