Columnwise operation on multiple mapped columns using pandas - python

I have two dataframes, df1 and df2, and I want to fill in the "New_Amount_Dollar" column of df2. df1 holds historical currency data; for each row of df2 I want to use its Currency and Amount_Dollar, together with the matching date in df1, to compute New_Amount_Dollar.
For Currency in [AUD, BWP], we need to multiply Amount_Dollar by the respective currency value for the respective date.
For other currencies, we need to divide Amount_Dollar by the respective currency value for the respective date.
e.g. in df2 the first currency is AUD for Date = '01-01-2019', so I want to calculate the New_Amount_Dollar value such that
New_Amount_Dollar = Amount_Dollar * AUD value from df1, i.e. New_Amount_Dollar = 19298 * 98 = 1891204
Another example: in df2 the third currency is COP for Date = '03-01-2019', so I want to calculate the New_Amount_Dollar value such that
New_Amount_Dollar = Amount_Dollar / COP value from df1, i.e. New_Amount_Dollar = 5000 / 0.043 = 116279.07
import pandas as pd
data1 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019',
'04-01-2019','05-01-2019'],
'AUD':[98, 98.5, 99, 99.5, 97],
'BWP':[30,31,33,32,31],
'CAD':[0.02,0.0192,0.0196,0.0196,0.0192],
'BND':[0.99,0.952,0.970,0.980,0.970],
'COP':[0.05,0.047,0.043,0.047,0.045]}
df1 = pd.DataFrame(data1)
data2 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019','05-01-2019'],
'Currency':['AUD','AUD','COP','CAD','BND'],
'Amount_Dollar':[19298, 19210, 5000, 200, 2300],
'New_Amount_Dollar':[0,0,0,0,0]
}
df2 = pd.DataFrame(data2)
print (df2)
df1
Date AUD BWP CAD BND COP
0 01-01-2019 98.0 30 0.0200 0.990 0.050
1 02-01-2019 98.5 31 0.0192 0.952 0.047
2 03-01-2019 99.0 33 0.0196 0.970 0.043
3 04-01-2019 99.5 32 0.0196 0.980 0.047
4 05-01-2019 97.0 31 0.0192 0.970 0.045
df2
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 0
1 02-01-2019 AUD 19210 0
2 03-01-2019 COP 5000 0
3 04-01-2019 CAD 200 0
4 05-01-2019 BND 2300 0
Expected result
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 1891204
1 02-01-2019 AUD 19210 1892185.0
2 03-01-2019 COP 5000 116279.07
3 04-01-2019 CAD 200 10204.08
4 05-01-2019 BND 2300 2371.13

You want lookup and isin():
import numpy as np

# mask: True where we multiply, False where we divide
s = df2['Currency'].isin(['AUD', 'BWP'])
# the rate for each (Date, Currency) pair
m = df1.set_index('Date').lookup(df2['Date'], df2['Currency'])
df2['New_Amount_Dollar'] = df2['Amount_Dollar'] * np.where(s, m, 1/m)
Output:
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 1891204.00
1 02-01-2019 AUD 19210 1892185.00
2 03-01-2019 COP 5000 116279.07
3 04-01-2019 CAD 200 10204.08
4 05-01-2019 BND 2300 2371.13
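Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A minimal sketch of an equivalent lookup for newer pandas, using Index.get_indexer to turn each (Date, Currency) pair into positional indices (variable names wide/rows/cols are my own):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Date': ['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019', '05-01-2019'],
                    'AUD': [98, 98.5, 99, 99.5, 97],
                    'BWP': [30, 31, 33, 32, 31],
                    'CAD': [0.02, 0.0192, 0.0196, 0.0196, 0.0192],
                    'BND': [0.99, 0.952, 0.970, 0.980, 0.970],
                    'COP': [0.05, 0.047, 0.043, 0.047, 0.045]})
df2 = pd.DataFrame({'Date': ['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019', '05-01-2019'],
                    'Currency': ['AUD', 'AUD', 'COP', 'CAD', 'BND'],
                    'Amount_Dollar': [19298, 19210, 5000, 200, 2300]})

wide = df1.set_index('Date')
# positional row/column indices for each (Date, Currency) pair
rows = wide.index.get_indexer(df2['Date'])
cols = wide.columns.get_indexer(df2['Currency'])
m = wide.to_numpy()[rows, cols]

s = df2['Currency'].isin(['AUD', 'BWP'])
df2['New_Amount_Dollar'] = df2['Amount_Dollar'] * np.where(s, m, 1 / m)
```

This produces the same result as the lookup-based version above.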

Try using melt and merge:
import numpy as np

df_out = df2.merge(df1.melt('Date', var_name='Currency'), on=['Date', 'Currency'])
df_out['New_Amount_Dollar'] = (df_out['Amount_Dollar'] *
                               np.where(df_out['Currency'].isin(['AUD', 'BWP']),
                                        df_out['value'],
                                        1/df_out['value']))
print(df_out)
Output:
Date Currency Amount_Dollar New_Amount_Dollar value
0 01-01-2019 AUD 19298 1891204.000 98.000
1 02-01-2019 AUD 19210 1892185.000 98.500
2 03-01-2019 COP 5000 116279.070 0.043
3 04-01-2019 CAD 200 10204.082 0.020
4 05-01-2019 BND 2300 2371.134 0.970

Related

Modify duplicate rows with datetime

I have a dataframe with id, purchase date, price of purchase and duration in days,
df
id purchased_date price duration
1 2020-01-01 16.50 2
2 2020-01-01 24.00 4
What I'm trying to do: wherever the duration is greater than 1 day, I want the extra days split into duplicated rows, the price divided by the number of individual days, and the date to increase by 1 day for each day purchased. Effectively giving me this,
df_new
id purchased_date price duration
1 2020-01-01 8.25 1
1 2020-01-02 8.25 1
2 2020-01-01 6.00 1
2 2020-01-02 6.00 1
2 2020-01-03 6.00 1
2 2020-01-04 6.00 1
So far I've managed to duplicate the rows based on the duration using:
df['price'] = df['price']/df['duration']
df = df.loc[df.index.repeat(df.duration)]
and then I've tried using,
df.groupby(['id', 'purchased_date']).purchased_date.apply(lambda n: n + pd.to_timedelta(1, unit='d'))
however, this just gets stuck in an endless loop and I'm a bit stuck.
My plan is to put this all in a function but for now I just want to get the process working.
Thank you for any help.
Use GroupBy.cumcount for a counter, pass it to to_timedelta to get day timedeltas, and add them to the purchased_date column:
df['price'] = df['price']/df['duration']
df = df.loc[df.index.repeat(df.duration)].assign(duration=1)
df['purchased_date'] += pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df = df.reset_index(drop=True)
print (df)
id purchased_date price duration
0 1 2020-01-01 8.25 1
1 1 2020-01-02 8.25 1
2 2 2020-01-01 6.00 1
3 2 2020-01-02 6.00 1
4 2 2020-01-03 6.00 1
5 2 2020-01-04 6.00 1
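Since the question mentions wanting to put this all in a function, here is a minimal sketch of the cumcount approach above packaged up (the function name split_by_day is my own):

```python
import pandas as pd

def split_by_day(df):
    # one row per day: divide the price, repeat each row by its
    # duration, then bump each copy's date by its position in the group
    out = df.copy()
    out['price'] = out['price'] / out['duration']
    out = out.loc[out.index.repeat(out['duration'])].assign(duration=1)
    out['purchased_date'] = out['purchased_date'] + pd.to_timedelta(
        out.groupby(level=0).cumcount(), unit='d')
    return out.reset_index(drop=True)

df = pd.DataFrame({'id': [1, 2],
                   'purchased_date': pd.to_datetime(['2020-01-01', '2020-01-01']),
                   'price': [16.50, 24.00],
                   'duration': [2, 4]})
result = split_by_day(df)
```

The groupby(level=0) relies on the repeated rows sharing their original index label, so it must run before reset_index.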
An approach with pandas.date_range and explode:
(df.assign(price=df['price'].div(df['duration']),
           purchased_date=df.apply(lambda x: pd.date_range(x['purchased_date'],
                                                           periods=x['duration']),
                                   axis=1),
           duration=1)
 .explode('purchased_date', ignore_index=True)
)
output:
id purchased_date price duration
0 1 2020-01-01 8.25 1
1 1 2020-01-02 8.25 1
2 2 2020-01-01 6.00 1
3 2 2020-01-02 6.00 1
4 2 2020-01-03 6.00 1
5 2 2020-01-04 6.00 1
Here is an easy-to-understand approach:
1. Assign the averaged 'price' value
2. Create a temporary 'end_date' column
3. Modify 'purchased_date' to hold a list of datetimes
4. Explode 'purchased_date' to form new rows
5. Assign 1 to the duration column
6. Delete the temporary 'end_date' column
Code:
df['price'] = df['price']/df['duration']
df['end_date'] = df.purchased_date + pd.to_timedelta(df.duration.sub(1), unit='d')
df['purchased_date'] = df.apply(lambda x: pd.date_range(start=x['purchased_date'], end=x['end_date']), axis=1)
df = df.explode('purchased_date').reset_index(drop=True)
df = df.assign(duration=1)
del df['end_date']
print (df)
id purchased_date price duration
0 1 2020-01-01 8.25 1
1 1 2020-01-02 8.25 1
2 2 2020-01-01 6.00 1
3 2 2020-01-02 6.00 1
4 2 2020-01-03 6.00 1
5 2 2020-01-04 6.00 1

Select Pandas dataframe rows between two dates

I am working on two tables as follows:
A first table df1 giving a rate and a validity period:
rates = {'rate': [ 0.974, 0.966, 0.996, 0.998, 0.994, 1.006, 1.042, 1.072, 0.954],
'Valid from': ['31/12/2018','15/01/2019','01/02/2019','01/03/2019','01/04/2019','15/04/2019','01/05/2019','01/06/2019','30/06/2019'],
'Valid to': ['14/01/2019','31/01/2019','28/02/2019','31/03/2019','14/04/2019','30/04/2019','31/05/2019','29/06/2019','31/07/2019']}
df1 = pd.DataFrame(rates)
df1['Valid to'] = pd.to_datetime(df1['Valid to'])
df1['Valid from'] = pd.to_datetime(df1['Valid from'])
rate Valid from Valid to
0 0.974 2018-12-31 2019-01-14
1 0.966 2019-01-15 2019-01-31
2 0.996 2019-01-02 2019-02-28
3 0.998 2019-01-03 2019-03-31
4 0.994 2019-01-04 2019-04-14
5 1.006 2019-04-15 2019-04-30
6 1.042 2019-01-05 2019-05-31
7 1.072 2019-01-06 2019-06-29
8 0.954 2019-06-30 2019-07-31
A second table df2 listing recorded amounts and corresponding dates
data = {'date': ['03/01/2019','23/01/2019','27/02/2019','14/03/2019','05/04/2019','30/04/2019','14/06/2019'],
'amount': [200,305,155,67,95,174,236,]}
df2 = pd.DataFrame(data)
df2['date'] = pd.to_datetime(df2['date'])
date amount
0 2019-03-01 200
1 2019-01-23 305
2 2019-02-27 155
3 2019-03-14 67
4 2019-05-04 95
5 2019-04-30 174
6 2019-06-14 236
The objective is to retrieve from df1 the rate applicable to each row of df2, using iteration and based on the date in df2.
Example: the date on the first row in df2 is 2019-01-03, therefore the applicable rate would be 0.974
The explanations given here (https://www.interviewqs.com/ddi_code_snippets/select_pandas_dataframe_rows_between_two_dates) gives me an idea on how to retrieve the rows on df2 between two dates in df1.
But I didn't manage to retrieve from df1 the applicable rate to each row on df2 using iteration.
If your dataframes are not very big, you can simply do the join on a dummy key and then filter to narrow it down to what you need. See the example below (note that I had to update your example a little bit to get correct date formatting).
import pandas as pd
rates = {'rate': [ 0.974, 0.966, 0.996, 0.998, 0.994, 1.006, 1.042, 1.072, 0.954],
'valid_from': ['31/12/2018','15/01/2019','01/02/2019','01/03/2019','01/04/2019','15/04/2019','01/05/2019','01/06/2019','30/06/2019'],
'valid_to': ['14/01/2019','31/01/2019','28/02/2019','31/03/2019','14/04/2019','30/04/2019','31/05/2019','29/06/2019','31/07/2019']}
df1 = pd.DataFrame(rates)
df1['valid_to'] = pd.to_datetime(df1['valid_to'],format ='%d/%m/%Y')
df1['valid_from'] = pd.to_datetime(df1['valid_from'],format='%d/%m/%Y')
Then your df1 would be
rate valid_from valid_to
0 0.974 2018-12-31 2019-01-14
1 0.966 2019-01-15 2019-01-31
2 0.996 2019-02-01 2019-02-28
3 0.998 2019-03-01 2019-03-31
4 0.994 2019-04-01 2019-04-14
5 1.006 2019-04-15 2019-04-30
6 1.042 2019-05-01 2019-05-31
7 1.072 2019-06-01 2019-06-29
8 0.954 2019-06-30 2019-07-31
This is your second data frame df2
data = {'date': ['03/01/2019','23/01/2019','27/02/2019','14/03/2019','05/04/2019','30/04/2019','14/06/2019'],
'amount': [200,305,155,67,95,174,236,]}
df2 = pd.DataFrame(data)
df2['date'] = pd.to_datetime(df2['date'],format ='%d/%m/%Y')
Then your df2 would look like the following
date amount
0 2019-01-03 200
1 2019-01-23 305
2 2019-02-27 155
3 2019-03-14 67
4 2019-04-05 95
5 2019-04-30 174
6 2019-06-14 236
Your solution:
df1['key'] = 1
df2['key'] = 1
df_output = pd.merge(df1, df2, on='key').drop('key', axis=1)
# keep rows whose date falls inside the validity period (inclusive on both ends)
df_output = df_output[(df_output['date'] >= df_output['valid_from']) & (df_output['date'] <= df_output['valid_to'])]
This is how the result df_output would look:
rate valid_from valid_to date amount
0 0.974 2018-12-31 2019-01-14 2019-01-03 200
8 0.966 2019-01-15 2019-01-31 2019-01-23 305
16 0.996 2019-02-01 2019-02-28 2019-02-27 155
24 0.998 2019-03-01 2019-03-31 2019-03-14 67
32 0.994 2019-04-01 2019-04-14 2019-04-05 95
40 1.006 2019-04-15 2019-04-30 2019-04-30 174
55 1.072 2019-06-01 2019-06-29 2019-06-14 236
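As a side note: because the validity periods here are contiguous and non-overlapping, pd.merge_asof can do the same interval lookup without the quadratic cross join. A sketch on the same data (both frames must be sorted on their join keys; each date picks up the last rate whose valid_from is on or before it):

```python
import pandas as pd

df1 = pd.DataFrame({'rate': [0.974, 0.966, 0.996, 0.998, 0.994, 1.006, 1.042, 1.072, 0.954],
                    'valid_from': ['31/12/2018', '15/01/2019', '01/02/2019', '01/03/2019', '01/04/2019',
                                   '15/04/2019', '01/05/2019', '01/06/2019', '30/06/2019'],
                    'valid_to': ['14/01/2019', '31/01/2019', '28/02/2019', '31/03/2019', '14/04/2019',
                                 '30/04/2019', '31/05/2019', '29/06/2019', '31/07/2019']})
df1['valid_from'] = pd.to_datetime(df1['valid_from'], format='%d/%m/%Y')
df1['valid_to'] = pd.to_datetime(df1['valid_to'], format='%d/%m/%Y')

df2 = pd.DataFrame({'date': ['03/01/2019', '23/01/2019', '27/02/2019', '14/03/2019',
                             '05/04/2019', '30/04/2019', '14/06/2019'],
                    'amount': [200, 305, 155, 67, 95, 174, 236]})
df2['date'] = pd.to_datetime(df2['date'], format='%d/%m/%Y')

# backward asof-merge: match each date to the last valid_from <= date
out = pd.merge_asof(df2.sort_values('date'), df1.sort_values('valid_from'),
                    left_on='date', right_on='valid_from')
```

If the periods could have gaps, you would still need to filter on valid_to afterwards; with contiguous periods as here, the asof match alone is enough.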

Column-wise Mapping and operations on dataframe using pandas

I have two dataframes, df1 and df2, and I want to fill in the New_Amount_Dollar column of df2. df1 holds historical currency data; for each row of df2 I want to use its Currency and Amount_Dollar, together with the matching date in df1, to compute New_Amount_Dollar.
e.g. in df2 the first currency is AUD for Date = '01-01-2019', so I want to calculate the New_Amount_Dollar value such that
New_Amount_Dollar = Amount_Dollar / AUD value from df1,
i.e. New_Amount_Dollar = 19298 / 98 = 196.92
Another example: in df2 the third currency is COP for Date = '03-01-2019', so I want to calculate the New_Amount_Dollar value such that
New_Amount_Dollar = Amount_Dollar / COP value from df1,
i.e. New_Amount_Dollar = 5000 / 0.043 = 116279.07
import pandas as pd
data1 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019','05-01-2019'],
'AUD':[98, 98.5, 99, 99.5, 97],
'BWP':[30,31,33,32,31],
'CAD':[0.02,0.0192,0.0196,0.0196,0.0192],
'BND':[0.99,0.952,0.970,0.980,0.970],
'COP':[0.05,0.047,0.043,0.047,0.045]}
df1 = pd.DataFrame(data1)
data2 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019','05-01-2019'],
'Currency':['AUD','AUD','COP','CAD','BND'],
'Amount_Dollar':[19298, 19210, 5000, 200, 2300],
'New_Amount_Dollar':[0,0,0,0,0]
}
df2 = pd.DataFrame(data2)
df1
Date AUD BWP CAD BND COP
0 01-01-2019 98.0 30 0.0200 0.990 0.050
1 02-01-2019 98.5 31 0.0192 0.952 0.047
2 03-01-2019 99.0 33 0.0196 0.970 0.043
3 04-01-2019 99.5 32 0.0196 0.980 0.047
4 05-01-2019 97.0 31 0.0192 0.970 0.045
df2
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 0
1 02-01-2019 AUD 19210 0
2 03-01-2019 COP 5000 0
3 04-01-2019 CAD 200 0
4 05-01-2019 BND 2300 0
Expected Result
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 196.92
1 02-01-2019 AUD 19210 195.03
2 03-01-2019 COP 5000 116279.07
3 04-01-2019 CAD 200 10204.08
4 05-01-2019 BND 2300 2371.13
Use DataFrame.lookup on df1 indexed by Date (via DataFrame.set_index) to get an array of rates, then divide the Amount_Dollar column by it:
arr = df1.set_index('Date').lookup(df2['Date'], df2['Currency'])
df2['New_Amount_Dollar'] = df2['Amount_Dollar'] / arr
print (df2)
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 196.918367
1 02-01-2019 AUD 19210 195.025381
2 03-01-2019 COP 5000 116279.069767
3 04-01-2019 CAD 200 10204.081633
4 05-01-2019 BND 2300 2371.134021
But if the dates don't match, forward-fill the missing days with DataFrame.asfreq:
import pandas as pd
data1 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019',
'04-01-2019','05-01-2019','08-01-2019'],
'AUD':[98, 98.5, 99, 99.5, 97,100],
'BWP':[30,31,33,32,31,20],
'CAD':[0.02,0.0192,0.0196,0.0196,0.0192,0.2],
'BND':[0.99,0.952,0.970,0.980,0.970,.23],
'COP':[0.05,0.047,0.043,0.047,0.045,0.023]}
df1 = pd.DataFrame(data1)
data2 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019','07-01-2019'],
'Currency':['AUD','AUD','COP','CAD','BND'],
'Amount_Dollar':[19298, 19210, 5000, 200, 2300],
'New_Amount_Dollar':[0,0,0,0,0]
}
df2 = pd.DataFrame(data2)
print (df1)
Date AUD BWP CAD BND COP
0 01-01-2019 98.0 30 0.0200 0.990 0.050
1 02-01-2019 98.5 31 0.0192 0.952 0.047
2 03-01-2019 99.0 33 0.0196 0.970 0.043
3 04-01-2019 99.5 32 0.0196 0.980 0.047
4 05-01-2019 97.0 31 0.0192 0.970 0.045
5 08-01-2019 100.0 20 0.2000 0.230 0.023
print (df2)
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 0
1 02-01-2019 AUD 19210 0
2 03-01-2019 COP 5000 0
3 04-01-2019 CAD 200 0
4 07-01-2019 BND 2300 0
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date'] = pd.to_datetime(df2['Date'], dayfirst=True)
print (df1.set_index('Date').asfreq('D', method='ffill'))
AUD BWP CAD BND COP
Date
2019-01-01 98.0 30 0.0200 0.990 0.050
2019-01-02 98.5 31 0.0192 0.952 0.047
2019-01-03 99.0 33 0.0196 0.970 0.043
2019-01-04 99.5 32 0.0196 0.980 0.047
2019-01-05 97.0 31 0.0192 0.970 0.045
2019-01-06 97.0 31 0.0192 0.970 0.045
2019-01-07 97.0 31 0.0192 0.970 0.045
2019-01-08 100.0 20 0.2000 0.230 0.023
arr = df1.set_index('Date').asfreq('D', method='ffill').lookup(df2['Date'], df2['Currency'])
df2['New_Amount_Dollar'] = df2['Amount_Dollar'] / arr
print (df2)
Date Currency Amount_Dollar New_Amount_Dollar
0 2019-01-01 AUD 19298 196.918367
1 2019-01-02 AUD 19210 195.025381
2 2019-01-03 COP 5000 116279.069767
3 2019-01-04 CAD 200 10204.081633
4 2019-01-07 BND 2300 2371.134021

Convert pandas column with single list of values into rows

I have the following dataframe:
symbol PSAR
0 AAPL [nan,100,200]
1 PYPL [nan,300,400]
2 SPY [nan,500,600]
I am trying to turn the PSAR list values into rows like the following:
symbol PSAR
AAPL nan
AAPL 100
AAPL 200
PYPL nan
PYPL 300
... ...
SPY 600
I have been trying to solve it by following the answers in this post (one key difference being that that post has a list of lists) but can't get there:
How to convert column with list of values into rows in Pandas DataFrame.
(df['PSAR'].stack().reset_index(level=1, drop=True).to_frame('PSAR')
 .join(df[['symbol']], how='left'))
Not a slick one, but this does the job:
list_of_lists = []
df_as_dict = dict(df.values)  # {symbol: PSAR list}
for key, values in df_as_dict.items():
    list_of_lists += [[key, value] for value in values]
pd.DataFrame(list_of_lists)
returns:
0 1
0 AAPL NaN
1 AAPL 100.0
2 AAPL 200.0
3 PYPL NaN
4 PYPL 300.0
5 PYPL 400.0
6 SPY NaN
7 SPY 500.0
8 SPY 600.0
Pandas >= 0.25:
df1 = pd.DataFrame({'symbol':['AAPL', 'PYPL', 'SPY'],
'PSAR':[[None,100,200], [None,300,400], [None,500,600]]})
print(df1)
symbol PSAR
0 AAPL [None, 100, 200]
1 PYPL [None, 300, 400]
2 SPY [None, 500, 600]
df1.explode('PSAR')
symbol PSAR
0 AAPL None
0 AAPL 100
0 AAPL 200
1 PYPL None
1 PYPL 300
1 PYPL 400
2 SPY None
2 SPY 500
2 SPY 600
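From pandas 1.1 on, explode can also reset the index in the same call via ignore_index=True, which avoids a separate reset_index. A small sketch on the same data:

```python
import pandas as pd

df1 = pd.DataFrame({'symbol': ['AAPL', 'PYPL', 'SPY'],
                    'PSAR': [[None, 100, 200], [None, 300, 400], [None, 500, 600]]})
# each list element becomes its own row; the index is renumbered 0..n-1
out = df1.explode('PSAR', ignore_index=True)
```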

pandas calculates column value means on groups and means across whole dataframe

I have a df where df['period'] = (df['date1'] - df['date2']) / np.timedelta64(1, 'D'):
code y_m date1 date2 period
1000 201701 2017-12-10 2017-12-09 1
1000 201701 2017-12-14 2017-12-12 2
1000 201702 2017-12-15 2017-12-13 2
1000 201702 2017-12-17 2017-12-15 2
2000 201701 2017-12-19 2017-12-18 1
2000 201701 2017-12-12 2017-12-10 2
2000 201702 2017-12-11 2017-12-10 1
2000 201702 2017-12-13 2017-12-12 1
2000 201702 2017-12-11 2017-12-10 1
then groupby code and y_m to calculate the average of date1-date2,
df_avg_period = df.groupby(['code', 'y_m'])['period'].mean().reset_index(name='avg_period')
code y_m avg_period
1000 201701 1.5
1000 201702 2
2000 201701 1.5
2000 201702 1
but I like to convert df_avg_period into a matrix that transposes column code to rows and y_m to columns, like
0 1 2 3
0 -1 0 201701 201702
1 0 1.44 1.44 1.4
2 1000 1.75 1.5 2
3 2000 1.20 1.5 1
-1 represents a dummy value, indicating either that a value doesn't exist for a specific code/y_m cell or simply serving to maintain the matrix shape. 0 represents 'all' values, i.e. the average over the code, the y_m, or both: e.g. cell (1,1) averages the period values over all rows in df, and (1,2) averages the period for 201701 across all rows that have this value for y_m in df.
Apparently pivot_table cannot give correct results using mean, so I am wondering how to achieve that correctly?
pivot_table with margins=True
piv = df.pivot_table(
index='code', columns='y_m', values='period', aggfunc='mean', margins=True
)
# housekeeping
(piv.reset_index()
    .rename_axis(None, axis=1)
    .rename({'code': -1, 'All': 0}, axis=1)
    .sort_index(axis=1)
)
-1 0 201701 201702
0 1000 1.750000 1.5 2.0
1 2000 1.200000 1.5 1.0
2 All 1.444444 1.5 1.4
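To check the margins against the sample data, here is a minimal reproduction (the period values are taken from the question's table; margins=True computes the aggfunc over the underlying rows, not over the already-aggregated means):

```python
import pandas as pd

df = pd.DataFrame({'code': [1000] * 4 + [2000] * 5,
                   'y_m': [201701, 201701, 201702, 201702,
                           201701, 201701, 201702, 201702, 201702],
                   'period': [1, 2, 2, 2, 1, 2, 1, 1, 1]})
# 'All' row and 'All' column hold the grand/marginal means
piv = df.pivot_table(index='code', columns='y_m', values='period',
                     aggfunc='mean', margins=True)
```

The grand mean is 13/9 ≈ 1.4444, matching the 'All'/'All' cell in the output above.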
