Correlation between columns of different dataframes - python

I have many dataframes. They all share the same column structure "date", "open_position_profit", "more columns...".
        date  open_position_profit  col2   col3
0 2008-04-01                -260.0     1  290.0
1 2008-04-02                -340.0     1  -60.0
2 2008-04-03                 100.0     1   40.0
3 2008-04-04                 180.0     1  -90.0
4 2008-04-05                   0.0     0    0.0
Although "date" is present in all dataframes, they might or might not have the same count (some dates might be in one dataframe but not the other).
I want to compute a correlation matrix of the columns "open_position_profit" of all these dataframes.
I've tried this
dfs = [df1[["date", "open_position_profit"]], df2[["date", "open_position_profit"]], ...]
pd.concat(dfs).groupby('date', as_index=False).corr()
But this gives me, for each date group, the trivial self-correlation of the single open_position_profit column (1.0, or NaN where there is too little data):
                         open_position_profit
0  open_position_profit                   1.0
1  open_position_profit                   1.0
2  open_position_profit                   1.0
3  open_position_profit                   1.0
4  open_position_profit                   NaN
I want the correlation for the entire time series, not each single cell. How can I do this?

If I understand your intention correctly, you need to do an outer join first. The following code outer-joins the two frames on the date key; missing values are represented by NaN.
df = pd.merge(df1, df2, on='date', how='outer')
date open_position_profit_x open_position_profit_y ... ...
0 2019-01-01 ...
1 2019-01-02 ...
2 2019-01-03 ...
3 2019-01-04 ...
Then you can calculate the correlation on the merged DataFrame:
df.corr()
open_position_profit_x open_position_profit_y ... ...
open_position_profit_x 1.000000 0.866025
open_position_profit_y 0.866025 1.000000
... 1.000000 1.000000
... 1.000000 1.000000
See: pd.merge
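Since the question involves many dataframes rather than just two, here is a minimal sketch of how the same idea extends, assuming dfs is the list from the question; the profit_0, profit_1, ... column names are introduced here for illustration:

import functools
import pandas as pd

# Give each frame's profit column a unique name so the merge keeps them apart.
renamed = [
    df[["date", "open_position_profit"]]
      .rename(columns={"open_position_profit": f"profit_{i}"})
    for i, df in enumerate(dfs)
]

# Outer-join every frame on "date", then correlate the profit columns.
merged = functools.reduce(
    lambda left, right: pd.merge(left, right, on="date", how="outer"), renamed
)
corr_matrix = merged.drop(columns="date").corr()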

Related

How to reshape a dataframe into a particular format

I have the following dataframe which I want to reshape:
year control Counts_annee_control
0 2014.0 0 81.0
1 2016.0 1 147.0
2 2016.0 0 168.0
3 2015.0 1 90.0
4 2016.0 1 147.0
I want the control column to be the new index, so I will have only two rows. The year values should become the columns of the dataframe, and the Counts_annee_control values will fill it.
I tried using groupby and transform but I don't think I'm doing it correctly.
You are probably looking for pivot_table.
If df is your DataFrame, then this
modified_df = df.pivot_table(
    values='Counts_annee_control', index='control', columns='year'
)
will result in
year 2014.0 2015.0 2016.0
control
0 81.0 NaN 168.0
1 NaN 90.0 147.0
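For reference, a self-contained version of the example. One thing worth noting: rows 1 and 4 are duplicates of the same (control, year) pair, and pivot_table aggregates duplicates, using the mean by default, so they collapse to a single 147.0:

import pandas as pd

df = pd.DataFrame({
    'year': [2014.0, 2016.0, 2016.0, 2015.0, 2016.0],
    'control': [0, 1, 0, 1, 1],
    'Counts_annee_control': [81.0, 147.0, 168.0, 90.0, 147.0],
})

# Duplicate (control, year) pairs are aggregated; the default aggfunc is 'mean'.
modified_df = df.pivot_table(
    values='Counts_annee_control', index='control', columns='year'
)
print(modified_df)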

How do I create a dummy variable by comparing columns in different data frames?

I would like to compare one column of a df with a column in a different df. The columns are timestamp and holiday date. I'd like to create a dummy variable that is 1 if the timestamp in df1 matches a date in df2, else 0.
For example, df1:
timestamp weight(kg)
0 2016-03-04 4.0
1 2015-02-15 5.0
2 2019-05-04 5.0
3 2018-12-25 29.0
4 2020-01-01 58.0
For example, df2:
holiday
0 2016-12-25
1 2017-01-01
2 2019-05-01
3 2018-12-26
4 2020-05-26
Ideal output:
timestamp weight(kg) holiday
0 2016-03-04 4.0 0
1 2015-02-15 5.0 0
2 2019-05-04 5.0 0
3 2018-12-25 29.0 1
4 2020-01-01 58.0 1
I have tried writing a function but it is taking very long to calculate:
def add_holiday(x):
    hols_df = hols.apply(lambda y: y['holiday_dt'] if
                         x['timestamp'] == y['holiday_dt']
                         else None, axis=1)
    hols_df = hols_df.dropna(axis=0, how='all')
    if hols_df.empty:
        hols_df = np.nan
    else:
        hols_df = hols_df.to_string(index=False)
    return hols_df

#df_hols['holidays'] = df_hols.apply(add_holiday, axis=1)
Perhaps there is a simpler way to do this, or the function is not written well. Any help will be appreciated.
Use Series.isin and convert the boolean mask to 1/0 with Series.astype:
df1['holiday'] = df1['timestamp'].isin(df2['holiday']).astype(int)
Or with numpy.where:
df1['holiday'] = np.where(df1['timestamp'].isin(df2['holiday']), 1, 0)
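A self-contained sketch of the isin approach. Both columns should share a dtype for the comparison to work, so if they arrive as strings it is safest to run them through pd.to_datetime first. (Note that with the sample data exactly as printed, no dates actually coincide, so this particular data yields all zeros; the ideal output above assumes overlapping dates.)

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2016-03-04', '2015-02-15', '2019-05-04',
                                 '2018-12-25', '2020-01-01']),
    'weight(kg)': [4.0, 5.0, 5.0, 29.0, 58.0],
})
df2 = pd.DataFrame({
    'holiday': pd.to_datetime(['2016-12-25', '2017-01-01', '2019-05-01',
                               '2018-12-26', '2020-05-26']),
})

# isin returns a boolean mask; astype(int) turns it into the 1/0 dummy.
df1['holiday'] = df1['timestamp'].isin(df2['holiday']).astype(int)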

Setting DataFrame value using a datetime as index

I have two data frames. One has 3 rows and 4 columns, with date as index, dataframeA:
TYPE UNIT PRICE PERCENT
2010-01-05 REDUCE CAR 2300.00 3.0
2010-06-03 INCREASE BOAT 1000.00 2.0
2010-07-01 INCREASE CAR 3500.00 3.0
and the other, dataframeB, has hundreds of dates as index and two columns, initially empty (shown here with the values I want to end up with):
             CAR    BOAT
2010-01-01   NaN    0.0
2010-01-02   NaN    0.0
2010-01-03   NaN    0.0
2010-01-04   NaN    0.0
2010-01-05 -69.00   0.0
.....
2010-06-03   NaN   20.00
...
2010-07-01 105.00   0.0
I need to read each row from the first data frame, find the corresponding date, and, based on the unit type, assign the corresponding percentage increase or reduction in the second data frame.
I've read that iterating over dataframes should be avoided, but I'm not sure how else to do it. How can I evaluate each row and then set the value on dataframeB?
I tried doing the following :
for index, row in dataframeA.iterrows():
    type = row['TYPE']
    unit = row['UNIT']
    price = row['PRICE']
    percent = row['PERCENT']
    # then here, with basic math, come up with the reduction or
    # increase and assign it to dataframeB; do the same for the others
My question is: is this the right approach, and how do I assign the computed value to dataframeB?
If your first dataframe is limited to just the variables stated, you can do this. Not terribly elegant, but works. If you have many more combinations in the dataframe, it'd have to be rethought. See comments inline.
import io
import pandas as pd

df = pd.read_csv(io.StringIO('''date TYPE UNIT PRICE PERCENT
2010-01-05 REDUCE CAR 2300.00 3.0
2010-06-03 INCREASE BOAT 1000.00 2.0
2010-07-01 INCREASE CAR 3500.00 3.0'''), sep=r'\s+', engine='python').set_index('date')
df1 = pd.read_csv(io.StringIO('''date
2010-01-01
2010-01-02
2010-01-03
2010-01-04
2010-01-05
2010-06-03
2010-07-01'''), engine='python').set_index('date')
# calculate your changes in first dataframe
df.loc[df.TYPE == 'REDUCE', 'Change'] = - df['PRICE'] * df['PERCENT'] / 100
df.loc[df.TYPE == 'INCREASE', 'Change'] = df['PRICE'] * df['PERCENT'] / 100
#merge the Changes into car and boat dataframes; rename columns
df_car = df[['Change']].loc[df.UNIT == 'CAR'].merge(df1, right_index=True, left_index=True, how='right')
df_car.rename(columns={'Change':'Car'}, inplace=True)
df_boat = df[['Change']].loc[df.UNIT == 'BOAT'].merge(df1, right_index=True, left_index=True, how='right')
df_boat.rename(columns={'Change':'Boat'}, inplace=True)
# merge car and boat
dfnew = df_car.merge(df_boat, right_index=True, left_index=True, how='right')
dfnew
Car Boat
date
2010-01-01 NaN NaN
2010-01-02 NaN NaN
2010-01-03 NaN NaN
2010-01-04 NaN NaN
2010-01-05 -69.000 NaN
2010-06-03 NaN 20.000
2010-07-01 105.000 NaN
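If you'd rather avoid the per-unit merges, here is a more compact sketch of the same idea, assuming df and df1 are the two frames built above: compute the signed change once, pivot UNIT into columns, and reindex onto the full date index.

import numpy as np

# Signed change: negative for REDUCE, positive for INCREASE.
sign = np.where(df['TYPE'] == 'REDUCE', -1.0, 1.0)
changes = df.assign(Change=sign * df['PRICE'] * df['PERCENT'] / 100)

# One column per UNIT, aligned to the full date index of the target frame.
wide = changes.pivot_table(index=changes.index, columns='UNIT', values='Change')
dfnew = wide.reindex(df1.index)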

Iteratively update the values of a dataframe with another one

I have a main df:
print(df)
item dt_op
0 product_1 2019-01-08
1 product_2 2019-02-08
2 product_1 2019-01-08
...
and a subset of the first one, that contains only one product and two extra columns:
print(df_1)
item dt_op DQN_Pred DQN_Inv
0 product_1 2019-01-08 6 7.0
2 product_1 2019-01-08 2 2.0
...
I am creating df_1 iteratively with a for loop (i.e., df_1 = df.loc[df.item == i] for each i in items, plus the two extra columns).
I would like to merge df_1 and df at every step of the iteration, updating df with the two extra columns:
print(final_df)
item dt_op DQN_Pred DQN_Inv
0 product_1 2019-01-08 6 7.0
1 product_2 2019-02-08 nan nan
2 product_1 2019-01-08 2 2.0
...
and have the NaN values filled in at the second step of the for loop, when df_1 contains only product_2.
How can I do it?
IIUC, you can use combine_first with reindex:
final_df=df_1.combine_first(df).reindex(columns=df_1.columns)
item dt_op DQN_Pred DQN_Inv
0 product_1 2019-01-08 6.0 7.0
1 product_2 2019-02-08 NaN NaN
2 product_1 2019-01-08 2.0 2.0
Alternatively, using merge: pandas does not allow combining on= with left_index=True/right_index=True, so pull the index into the merge keys explicitly alongside the common columns:
common_keys = df.columns.intersection(df_1.columns).tolist()
final_df = (df.reset_index()
              .merge(df_1.reset_index(), on=['index'] + common_keys, how='left')
              .set_index('index'))
item dt_op DQN_Pred DQN_Inv
0 product_1 2019-01-08 6.0 7.0
1 product_2 2019-02-08 NaN NaN
2 product_1 2019-01-08 2.0 2.0
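Put back into the loop from the question, a minimal sketch of the whole update (the placeholder DQN values stand in for whatever the model produces at each step):

# Start from the main frame with the two extra columns empty.
final_df = df.reindex(columns=list(df.columns) + ['DQN_Pred', 'DQN_Inv'])

for i in df['item'].unique():
    df_1 = df.loc[df['item'] == i].copy()
    df_1['DQN_Pred'] = 0      # placeholder for the real per-product values
    df_1['DQN_Inv'] = 0.0     # placeholder for the real per-product values
    # combine_first keeps df_1's values and falls back to final_df elsewhere;
    # reindex restores the column order that combine_first sorts alphabetically
    final_df = df_1.combine_first(final_df).reindex(columns=final_df.columns)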

pandas merge DataFrames like magnetic thing

import pandas as pd
df1 = pd.DataFrame({'date': ['2015-01-01', '2015-01-10', '2015-01-11', '2015-01-12'], 'a': [1,2,3,4]})
df2 = pd.DataFrame({'date': ['2015-01-01', '2015-01-05', '2015-01-11'], 'b': [10,20,30]})
df = df1.merge(df2, on=['date'], how='outer')
df = df.sort_values('date')
print(df)
"like magnetic thing" may not be a good expression in title. I will explain below.
I want record from df2 to match the first record of df1 which date is greater or equals df2's. For example, I want df2's '2015-01-05' to match df1's '2015-01-10'.
I cannot achieve it by simply merging them in inner, outer, left way. Though, the above result is very close to what I want.
a date b
0 1.0 2015-01-01 10.0
4 NaN 2015-01-05 20.0
1 2.0 2015-01-10 NaN
2 3.0 2015-01-11 30.0
3 4.0 2015-01-12 NaN
How can I achieve this, either building on what I have done or in some other way from scratch? The desired output:
a date b
0 1.0 2015-01-01 10.0
1 2.0 2015-01-10 20.0
2 3.0 2015-01-11 30.0
3 4.0 2015-01-12 NaN
Making sure your dates are dates first:
df1.date = pd.to_datetime(df1.date)
df2.date = pd.to_datetime(df2.date)
numpy
np.searchsorted
# position of the first df1 date >= each df2 date
ilocs = df1.date.values.searchsorted(df2.date.values)
# write each df2.b value onto that df1 row
df1.loc[df1.index[ilocs], 'b'] = df2.b.values
df1
a date b
0 1 2015-01-01 10.0
1 2 2015-01-10 20.0
2 3 2015-01-11 30.0
3 4 2015-01-12 NaN
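One caveat worth adding to the searchsorted route: if df2 ever contains a date later than df1's last date, searchsorted returns len(df1) for it and the positional lookup raises an IndexError. A guard, as a sketch, filtering those rows first:

# Keep only df2 rows that still have a df1 date at or after them.
mask = (df2.date <= df1.date.max()).to_numpy()
ilocs = df1.date.values.searchsorted(df2.date.values[mask])
df1.loc[df1.index[ilocs], 'b'] = df2.b.values[mask]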
pandas
pd.merge_asof gets you really close
pd.merge_asof(df1, df2)
a date b
0 1 2015-01-01 10
1 2 2015-01-10 20
2 3 2015-01-11 30
3 4 2015-01-12 30
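To close the gap between "really close" and the exact desired output, one sketch (assuming both frames are sorted by date, which merge_asof requires) is to tag each df2 row with its own date via a helper column, src_date, introduced here for illustration, and blank out the carried-forward repeats:

import numpy as np

# Tag each df2 row so we can tell which df1 rows matched the same df2 row.
df2m = df2.assign(src_date=df2['date'])
out = pd.merge_asof(df1, df2m, on='date')

# Keep b only on the first df1 row that received each df2 value.
out.loc[out['src_date'].duplicated(), 'b'] = np.nan
out = out.drop(columns='src_date')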
