I have two dataframes, df1 and df2. I want to perform an operation on the column New_Amount_Dollar in df2. Basically, in df1 I have historical currency data, and I want to perform a date-wise operation, given Currency and Amount_Dollar from df2, to calculate the values for the New_Amount_Dollar column in df2.
E.g., in df2 the first currency is AUD for Date = '01-01-2019', so I want to calculate the New_Amount_Dollar value such that
New_Amount_Dollar = Amount_Dollar / AUD value from df1
i.e. New_Amount_Dollar = 19298/98 = 196.91
Another example: in df2 the third currency is COP for Date = '03-01-2019', so I want to calculate the New_Amount_Dollar value such that
New_Amount_Dollar = Amount_Dollar / COP value from df1
i.e. New_Amount_Dollar = 5000/0.043 = 116279.06
import pandas as pd
data1 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019','05-01-2019'],
'AUD':[98, 98.5, 99, 99.5, 97],
'BWP':[30,31,33,32,31],
'CAD':[0.02,0.0192,0.0196,0.0196,0.0192],
'BND':[0.99,0.952,0.970,0.980,0.970],
'COP':[0.05,0.047,0.043,0.047,0.045]}
df1 = pd.DataFrame(data1)
data2 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019','05-01-2019'],
'Currency':['AUD','AUD','COP','CAD','BND'],
'Amount_Dollar':[19298, 19210, 5000, 200, 2300],
'New_Amount_Dollar':[0,0,0,0,0]
}
df2 = pd.DataFrame(data2)
df1
Date AUD BWP CAD BND COP
0 01-01-2019 98.0 30 0.0200 0.990 0.050
1 02-01-2019 98.5 31 0.0192 0.952 0.047
2 03-01-2019 99.0 33 0.0196 0.970 0.043
3 04-01-2019 99.5 32 0.0196 0.980 0.047
4 05-01-2019 97.0 31 0.0192 0.970 0.045
df2
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 0
1 02-01-2019 AUD 19210 0
2 03-01-2019 COP 5000 0
3 04-01-2019 CAD 200 0
4 05-01-2019 BND 2300 0
Expected Result
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 196.91
1 02-01-2019 AUD 19210 195.02
2 03-01-2019 COP 5000 116279.06
3 04-01-2019 CAD 200 10204.08
4 05-01-2019 BND 2300 2371.13
Use DataFrame.lookup with DataFrame.set_index to build an array of rates, then divide the Amount_Dollar column by it:
arr = df1.set_index('Date').lookup(df2['Date'], df2['Currency'])
df2['New_Amount_Dollar'] = df2['Amount_Dollar'] / arr
print (df2)
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 196.918367
1 02-01-2019 AUD 19210 195.025381
2 03-01-2019 COP 5000 116279.069767
3 04-01-2019 CAD 200 10204.081633
4 05-01-2019 BND 2300 2371.134021
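Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on newer versions the line above raises AttributeError. A rough equivalent (a sketch, emulating lookup with positional indexers on the rate matrix; frames trimmed to the columns used) is:

```python
import pandas as pd

# same data as above, BWP column omitted since no row uses it
df1 = pd.DataFrame({'Date': ['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019', '05-01-2019'],
                    'AUD': [98, 98.5, 99, 99.5, 97],
                    'CAD': [0.02, 0.0192, 0.0196, 0.0196, 0.0192],
                    'BND': [0.99, 0.952, 0.970, 0.980, 0.970],
                    'COP': [0.05, 0.047, 0.043, 0.047, 0.045]})
df2 = pd.DataFrame({'Date': ['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019', '05-01-2019'],
                    'Currency': ['AUD', 'AUD', 'COP', 'CAD', 'BND'],
                    'Amount_Dollar': [19298, 19210, 5000, 200, 2300]})

# emulate the removed DataFrame.lookup: translate each (Date, Currency)
# pair into a (row, column) position and index the underlying array
rates = df1.set_index('Date')
rows = rates.index.get_indexer(df2['Date'])
cols = rates.columns.get_indexer(df2['Currency'])
arr = rates.to_numpy()[rows, cols]

df2['New_Amount_Dollar'] = df2['Amount_Dollar'] / arr
```

This produces the same values as the lookup-based version.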
But if the datetimes do not match, use DataFrame.asfreq to forward-fill the missing days first:
import pandas as pd
data1 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019',
'04-01-2019','05-01-2019','08-01-2019'],
'AUD':[98, 98.5, 99, 99.5, 97,100],
'BWP':[30,31,33,32,31,20],
'CAD':[0.02,0.0192,0.0196,0.0196,0.0192,0.2],
'BND':[0.99,0.952,0.970,0.980,0.970,.23],
'COP':[0.05,0.047,0.043,0.047,0.045,0.023]}
df1 = pd.DataFrame(data1)
data2 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019','07-01-2019'],
'Currency':['AUD','AUD','COP','CAD','BND'],
'Amount_Dollar':[19298, 19210, 5000, 200, 2300],
'New_Amount_Dollar':[0,0,0,0,0]
}
df2 = pd.DataFrame(data2)
print (df1)
Date AUD BWP CAD BND COP
0 01-01-2019 98.0 30 0.0200 0.990 0.050
1 02-01-2019 98.5 31 0.0192 0.952 0.047
2 03-01-2019 99.0 33 0.0196 0.970 0.043
3 04-01-2019 99.5 32 0.0196 0.980 0.047
4 05-01-2019 97.0 31 0.0192 0.970 0.045
5 08-01-2019 100.0 20 0.2000 0.230 0.023
print (df2)
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 0
1 02-01-2019 AUD 19210 0
2 03-01-2019 COP 5000 0
3 04-01-2019 CAD 200 0
4 07-01-2019 BND 2300 0
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date'] = pd.to_datetime(df2['Date'], dayfirst=True)
print (df1.set_index('Date').asfreq('D', method='ffill'))
AUD BWP CAD BND COP
Date
2019-01-01 98.0 30 0.0200 0.990 0.050
2019-01-02 98.5 31 0.0192 0.952 0.047
2019-01-03 99.0 33 0.0196 0.970 0.043
2019-01-04 99.5 32 0.0196 0.980 0.047
2019-01-05 97.0 31 0.0192 0.970 0.045
2019-01-06 97.0 31 0.0192 0.970 0.045
2019-01-07 97.0 31 0.0192 0.970 0.045
2019-01-08 100.0 20 0.2000 0.230 0.023
arr = df1.set_index('Date').asfreq('D', method='ffill').lookup(df2['Date'], df2['Currency'])
df2['New_Amount_Dollar'] = df2['Amount_Dollar'] / arr
print (df2)
Date Currency Amount_Dollar New_Amount_Dollar
0 2019-01-01 AUD 19298 196.918367
1 2019-01-02 AUD 19210 195.025381
2 2019-01-03 COP 5000 116279.069767
3 2019-01-04 CAD 200 10204.081633
4 2019-01-07 BND 2300 2371.134021
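An alternative sketch for mismatched dates, assuming a rate should carry forward until the next quote: reshape df1 to long form and use pd.merge_asof, which picks the most recent rate on or before each date (both inputs must be sorted by Date):

```python
import pandas as pd

df1 = pd.DataFrame({'Date': ['01-01-2019', '02-01-2019', '03-01-2019',
                             '04-01-2019', '05-01-2019', '08-01-2019'],
                    'AUD': [98, 98.5, 99, 99.5, 97, 100],
                    'CAD': [0.02, 0.0192, 0.0196, 0.0196, 0.0192, 0.2],
                    'BND': [0.99, 0.952, 0.970, 0.980, 0.970, 0.23],
                    'COP': [0.05, 0.047, 0.043, 0.047, 0.045, 0.023]})
df2 = pd.DataFrame({'Date': ['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019', '07-01-2019'],
                    'Currency': ['AUD', 'AUD', 'COP', 'CAD', 'BND'],
                    'Amount_Dollar': [19298, 19210, 5000, 200, 2300]})
for df in (df1, df2):
    df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# long form: one row per (Date, Currency) rate, sorted for merge_asof
long_rates = df1.melt('Date', var_name='Currency', value_name='rate').sort_values('Date')

# for each df2 row, take the last rate at or before its Date for its Currency
out = pd.merge_asof(df2.sort_values('Date'), long_rates,
                    on='Date', by='Currency', direction='backward')
out['New_Amount_Dollar'] = out['Amount_Dollar'] / out['rate']
```

For 2019-01-07 / BND this picks up the 2019-01-05 rate 0.970, matching the asfreq + ffill result above, without materializing a daily grid.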
I have a dataframe that contains cell phone minutes usage, logged by date of call and duration.
It looks like this (30-row sample):
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
I want to group by user_id AND call_date, with the ultimate goal of calculating the number of minutes used per month over the course of the year, per user.
I thought I could accomplish this by using:
calls.groupby(['user_id','call_date'])['duration'].sum()
but the results aren't what I expected:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-08-14 13.86
2018-08-16 23.46
2018-08-17 8.11
2018-08-18 1.74
2018-08-19 10.73
2018-08-20 7.32
2018-08-21 0.00
2018-08-23 8.50
2018-08-24 8.63
2018-08-25 35.39
2018-08-27 10.57
2018-08-28 19.91
2018-08-29 0.54
2018-08-31 22.38
2018-09-01 7.53
2018-09-02 10.27
2018-09-03 30.66
2018-09-04 0.00
2018-09-05 9.09
2018-09-06 10.06
I'd hoped that it would be grouped like: user_id 1000, all calls for Jan with duration summed, all calls for Feb with duration summed, etc.
I am really new to Python and programming in general and am not sure what my next step should be to get these grouped by user_id and month of the year.
Thanks in advance for any insight you can offer.
Regards,
Jared
Something is not quite right in your setup. First of all, both of your tables are the same, so I am not sure if this is a cut-and-paste error or something else. Here is what I do with your data. Load it up like so; note we explicitly convert call_date to datetime:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(
"""
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
"""), delim_whitespace = True, index_col=0)
df['call_date'] = pd.to_datetime(df['call_date'])
Then using
df.groupby(['user_id','call_date'])['duration'].sum()
does the expected grouping by user and by each date:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-09-06 10.06
2018-09-21 5.75
2018-09-30 14.78
2018-10-12 1.00
2018-10-17 15.83
2018-10-27 0.98
2018-10-28 5.90
2018-11-09 1.00
2018-11-15 30.00
2018-11-17 2.45
2018-11-19 2.40
2018-12-04 7.19
2018-12-05 0.00
2018-12-13 6.27
2018-12-24 0.00
If you want to group by month, as you seem to suggest, you can use the Grouper functionality:
df.groupby(['user_id',pd.Grouper(key='call_date', freq='1M')])['duration'].sum()
which produces
user_id call_date
1000 2018-12-31 116.83
1001 2018-09-30 30.59
2018-10-31 23.71
2018-11-30 35.85
2018-12-31 13.46
Let me know if you are getting different results from following these steps.
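For reference, an equivalent sketch using Series.dt.to_period instead of Grouper (the only difference is the label: to_period gives a calendar month like 2018-12, while Grouper with freq='1M' labels by the month-end timestamp). The frame here is a small hypothetical subset of the data above:

```python
import pandas as pd

# a few rows of the call log above, for illustration
df = pd.DataFrame({'user_id': [1000, 1000, 1001, 1001],
                   'call_date': pd.to_datetime(['2018-12-27', '2018-12-28',
                                                '2018-09-06', '2018-09-21']),
                   'duration': [8.52, 5.76, 10.06, 5.75]})

# dt.to_period('M') maps each timestamp to its calendar month,
# so the groupby sums durations per (user, month)
monthly = df.groupby(['user_id', df['call_date'].dt.to_period('M')])['duration'].sum()
```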
I have a dataframe, which has different rates for multiple 'N' currencies over a time period.
dataframe
Dates AUD CAD CHF GBP EUR
20/05/2019 0.11 -0.25 -0.98 0.63 0.96
21/05/2019 0.14 -0.35 -0.92 1.92 0.92
...
02/01/2020 0.135 -0.99 -1.4 0.93 0.83
Firstly, I would like to reshape the dataframe table to look like the below as I would like to join another table which would be in a similar format:
dataframe
Dates Pairs Rates
20/05/2019 AUD 0.11
20/05/2019 CAD -0.25
20/05/2019 CHF -0.98
...
...
02/01/2020 AUD 0.135
02/01/2020 CAD -0.99
02/01/2020 CHF -1.4
Then, for each currency, I would like to plot a histogram. So with the above, it would be 5 separate histograms, one per currency.
I assume I would need some sort of loop, but I am not sure of the easiest way to approach this.
Thanks
Use DataFrame.melt first:
df['Dates'] = pd.to_datetime(df['Dates'], dayfirst=True)
df = df.melt('Dates', var_name='Pairs', value_name='Rates')
print (df)
Dates Pairs Rates
0 2019-05-20 AUD 0.110
1 2019-05-21 AUD 0.140
2 2020-01-02 AUD 0.135
3 2019-05-20 CAD -0.250
4 2019-05-21 CAD -0.350
5 2020-01-02 CAD -0.990
6 2019-05-20 CHF -0.980
7 2019-05-21 CHF -0.920
8 2020-01-02 CHF -1.400
9 2019-05-20 GBP 0.630
10 2019-05-21 GBP 1.920
11 2020-01-02 GBP 0.930
12 2019-05-20 EUR 0.960
13 2019-05-21 EUR 0.920
14 2020-01-02 EUR 0.830
And then DataFrameGroupBy.hist:
df.groupby('Pairs').hist()
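If you want explicit control over each figure (title, bins, where it is saved), looping over the groups also works. A minimal sketch with two hypothetical currencies, using matplotlib's non-interactive Agg backend so it runs headless:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Dates': pd.to_datetime(['2019-05-20', '2019-05-21', '2020-01-02']),
                   'AUD': [0.11, 0.14, 0.135],
                   'CAD': [-0.25, -0.35, -0.99]})
df = df.melt('Dates', var_name='Pairs', value_name='Rates')

# one histogram figure per currency
for pair, grp in df.groupby('Pairs'):
    fig, ax = plt.subplots()
    grp['Rates'].plot.hist(ax=ax, title=pair, bins=10)
    # fig.savefig(f'hist_{pair}.png')  # or plt.show() in an interactive session
    plt.close(fig)
```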
I am working on two tables as follows:
A first table df1 giving a rate and a validity period:
rates = {'rate': [ 0.974, 0.966, 0.996, 0.998, 0.994, 1.006, 1.042, 1.072, 0.954],
'Valid from': ['31/12/2018','15/01/2019','01/02/2019','01/03/2019','01/04/2019','15/04/2019','01/05/2019','01/06/2019','30/06/2019'],
'Valid to': ['14/01/2019','31/01/2019','28/02/2019','31/03/2019','14/04/2019','30/04/2019','31/05/2019','29/06/2019','31/07/2019']}
df1 = pd.DataFrame(rates)
df1['Valid to'] = pd.to_datetime(df1['Valid to'])
df1['Valid from'] = pd.to_datetime(df1['Valid from'])
rate Valid from Valid to
0 0.974 2018-12-31 2019-01-14
1 0.966 2019-01-15 2019-01-31
2 0.996 2019-01-02 2019-02-28
3 0.998 2019-01-03 2019-03-31
4 0.994 2019-01-04 2019-04-14
5 1.006 2019-04-15 2019-04-30
6 1.042 2019-01-05 2019-05-31
7 1.072 2019-01-06 2019-06-29
8 0.954 2019-06-30 2019-07-31
A second table df2 listing recorded amounts and corresponding dates
data = {'date': ['03/01/2019','23/01/2019','27/02/2019','14/03/2019','05/04/2019','30/04/2019','14/06/2019'],
'amount': [200,305,155,67,95,174,236,]}
df2 = pd.DataFrame(data)
df2['date'] = pd.to_datetime(df2['date'])
date amount
0 2019-03-01 200
1 2019-01-23 305
2 2019-02-27 155
3 2019-03-14 67
4 2019-05-04 95
5 2019-04-30 174
6 2019-06-14 236
The objective would be to retrieve from df1 the rate applicable to each row of df2, using iteration and based on the date in df2.
Example: the date in the first row of df2 is 2019-01-03, therefore the applicable rate would be 0.974.
The explanations given here (https://www.interviewqs.com/ddi_code_snippets/select_pandas_dataframe_rows_between_two_dates) give me an idea of how to retrieve the rows of df2 between two dates in df1.
But I didn't manage to retrieve from df1 the rate applicable to each row of df2 using iteration.
If your dataframes are not very big, you can simply do the join on a dummy key and then filter to narrow it down to what you need. See the example below (note that I had to update your example a little to get correct date parsing).
import pandas as pd
rates = {'rate': [ 0.974, 0.966, 0.996, 0.998, 0.994, 1.006, 1.042, 1.072, 0.954],
'valid_from': ['31/12/2018','15/01/2019','01/02/2019','01/03/2019','01/04/2019','15/04/2019','01/05/2019','01/06/2019','30/06/2019'],
'valid_to': ['14/01/2019','31/01/2019','28/02/2019','31/03/2019','14/04/2019','30/04/2019','31/05/2019','29/06/2019','31/07/2019']}
df1 = pd.DataFrame(rates)
df1['valid_to'] = pd.to_datetime(df1['valid_to'],format ='%d/%m/%Y')
df1['valid_from'] = pd.to_datetime(df1['valid_from'],format='%d/%m/%Y')
Then your df1 would be
rate valid_from valid_to
0 0.974 2018-12-31 2019-01-14
1 0.966 2019-01-15 2019-01-31
2 0.996 2019-02-01 2019-02-28
3 0.998 2019-03-01 2019-03-31
4 0.994 2019-04-01 2019-04-14
5 1.006 2019-04-15 2019-04-30
6 1.042 2019-05-01 2019-05-31
7 1.072 2019-06-01 2019-06-29
8 0.954 2019-06-30 2019-07-31
This is your second data frame df2
data = {'date': ['03/01/2019','23/01/2019','27/02/2019','14/03/2019','05/04/2019','30/04/2019','14/06/2019'],
'amount': [200,305,155,67,95,174,236,]}
df2 = pd.DataFrame(data)
df2['date'] = pd.to_datetime(df2['date'],format ='%d/%m/%Y')
Then your df2 would look like the following
date amount
0 2019-01-03 200
1 2019-01-23 305
2 2019-02-27 155
3 2019-03-14 67
4 2019-04-05 95
5 2019-04-30 174
6 2019-06-14 236
Your solution:
df1['key'] = 1
df2['key'] = 1
df_output = pd.merge(df1, df2, on='key').drop('key',axis=1)
df_output = df_output[(df_output['date'] > df_output['valid_from']) & (df_output['date'] <= df_output['valid_to'])]
This is how the result df_output would look:
rate valid_from valid_to date amount
0 0.974 2018-12-31 2019-01-14 2019-01-03 200
8 0.966 2019-01-15 2019-01-31 2019-01-23 305
16 0.996 2019-02-01 2019-02-28 2019-02-27 155
24 0.998 2019-03-01 2019-03-31 2019-03-14 67
32 0.994 2019-04-01 2019-04-14 2019-04-05 95
40 1.006 2019-04-15 2019-04-30 2019-04-30 174
55 1.072 2019-06-01 2019-06-29 2019-06-14 236
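As an aside (my addition, not part of the answer above): the dummy-key join materializes len(df1) * len(df2) rows before filtering, which gets expensive on larger frames. Assuming the validity periods never overlap, an IntervalIndex lookup avoids the blow-up. A sketch on the first three rate periods:

```python
import pandas as pd

df1 = pd.DataFrame({'rate': [0.974, 0.966, 0.996],
                    'valid_from': ['31/12/2018', '15/01/2019', '01/02/2019'],
                    'valid_to': ['14/01/2019', '31/01/2019', '28/02/2019']})
df1['valid_from'] = pd.to_datetime(df1['valid_from'], format='%d/%m/%Y')
df1['valid_to'] = pd.to_datetime(df1['valid_to'], format='%d/%m/%Y')

df2 = pd.DataFrame({'date': ['03/01/2019', '23/01/2019', '27/02/2019'],
                    'amount': [200, 305, 155]})
df2['date'] = pd.to_datetime(df2['date'], format='%d/%m/%Y')

# one interval per validity period; both endpoints included.
# get_indexer requires the intervals to be non-overlapping.
idx = pd.IntervalIndex.from_arrays(df1['valid_from'], df1['valid_to'], closed='both')
df2['rate'] = df1['rate'].to_numpy()[idx.get_indexer(df2['date'])]
```

Note that closed='both' includes valid_from itself, whereas the filter above uses a strict `>` on valid_from; adjust the closed argument if that boundary matters for your data.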
I have two dataframes, df1 and df2. I want to perform an operation on the column "New_Amount_Dollar" in df2. Basically, in df1 I have historical currency data, and I want to perform a date-wise operation, given Currency and Amount_Dollar from df2, to calculate the values for the New_Amount_Dollar column in df2.
For Currency in [AUD, BWP], we need to multiply Amount_Dollar by the respective currency value for the respective date.
For other currencies, we need to divide Amount_Dollar by the respective currency value for the respective date.
E.g., in df2 the first currency is AUD for Date = '01-01-2019', so I want to calculate the New_Amount_Dollar value such that
New_Amount_Dollar = Amount_Dollar * AUD value from df1, i.e. New_Amount_Dollar = 19298*98 = 1891204
Another example: in df2 the third currency is COP for Date = '03-01-2019', so I want to calculate the New_Amount_Dollar value such that
New_Amount_Dollar = Amount_Dollar / COP value from df1, i.e. New_Amount_Dollar = 5000/0.043 = 116279.06
import pandas as pd
data1 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019',
'04-01-2019','05-01-2019'],
'AUD':[98, 98.5, 99, 99.5, 97],
'BWP':[30,31,33,32,31],
'CAD':[0.02,0.0192,0.0196,0.0196,0.0192],
'BND':[0.99,0.952,0.970,0.980,0.970],
'COP':[0.05,0.047,0.043,0.047,0.045]}
df1 = pd.DataFrame(data1)
data2 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019','05-01-2019'],
'Currency':['AUD','AUD','COP','CAD','BND'],
'Amount_Dollar':[19298, 19210, 5000, 200, 2300],
'New_Amount_Dollar':[0,0,0,0,0]
}
df2 = pd.DataFrame(data2)
print (df2)
df1
Date AUD BWP CAD BND COP
0 01-01-2019 98.0 30 0.0200 0.990 0.050
1 02-01-2019 98.5 31 0.0192 0.952 0.047
2 03-01-2019 99.0 33 0.0196 0.970 0.043
3 04-01-2019 99.5 32 0.0196 0.980 0.047
4 05-01-2019 97.0 31 0.0192 0.970 0.045
df2
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 0
1 02-01-2019 AUD 19210 0
2 03-01-2019 COP 5000 0
3 04-01-2019 CAD 200 0
4 05-01-2019 BND 2300 0
Expected result
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 1891204
1 02-01-2019 AUD 19210 1892185.0
2 03-01-2019 COP 5000 116279.06
3 04-01-2019 CAD 200 10204.08
4 05-01-2019 BND 2300 2371.13
You want lookup and isin():
import numpy as np

# mask: True where we should multiply, False where we should divide
s = df2['Currency'].isin(['AUD', 'BWP'])
# the rate values to multiply or divide by
m = df1.set_index('Date').lookup(df2['Date'], df2['Currency'])
df2['New_Amount_Dollar'] = df2['Amount_Dollar'] * np.where(s, m, 1/m)
Output:
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 1891204.00
1 02-01-2019 AUD 19210 1892185.00
2 03-01-2019 COP 5000 116279.07
3 04-01-2019 CAD 200 10204.08
4 05-01-2019 BND 2300 2371.13
Try using melt and merge:
import numpy as np

df_out = df2.merge(df1.melt('Date', var_name='Currency'), on=['Date', 'Currency'])
df_out['New_Amount_Dollar'] = (df_out['Amount_Dollar'] *
                               np.where(df_out['Currency'].isin(['AUD', 'BWP']),
                                        df_out['value'],
                                        1/df_out['value']))
print(df_out)
Output:
Date Currency Amount_Dollar New_Amount_Dollar value
0 01-01-2019 AUD 19298 1891204.000 98.000
1 02-01-2019 AUD 19210 1892185.000 98.500
2 03-01-2019 COP 5000 116279.070 0.043
3 04-01-2019 CAD 200 10204.082 0.020
4 05-01-2019 BND 2300 2371.134 0.970
I have the following issue:
I have to calculate the percentage change between two values in a hierarchical DataFrame. The problem: I have to look up the second row by a value calculated from the first row.
Example:
Symbol Date Num Price EndDate
A 2018-01-01 1 100.00 201903
2 102.00 201907
3 104.00 201911
4 107.00 202003
2018-01-02 1 101.00 201903
2 103.00 201907
3 106.00 201911
4 110.00 202003
:
B 2018-01-01 1 5.00 201903
2 5.50 201909
3 4.23 202003
2018-02-02 1 5.20 201903
2 5.30 201909
3 5.40 202003
Now I have to calculate, for each (Symbol, Date, Num):
Price(this row) / Price(the row where EndDate = this EndDate + 100)
So I will get:
Symbol Date Num Price EndDate NewColumn
A 2018-01-01 1 100.00 201903 ( 100.00 / 107.00 )
2 102.00 201907 NaN
:
2018-01-02 1 101.00 201903 ( 101.00 / 110.00 )
:
B 2018-01-01 1 5.00 201903 ( 5.00 / 4.23 )
:
2018-02-02 1 5.20 201903 ( 5.20 / 5.40)
I hope it is a bit clearer what I mean. Thank you for your suggestions!
I've solved this issue by joining the dataset to itself with precalculated target dates.
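To make that concrete, here is a minimal sketch of such a self-join (my reconstruction, not the poster's exact code: flat columns instead of the hierarchical index, and assuming EndDate is an integer in YYYYMM form so that +100 means "one year later"):

```python
import pandas as pd

df = pd.DataFrame({'Symbol': ['A', 'A', 'A', 'A'],
                   'Date': ['2018-01-01'] * 4,
                   'Num': [1, 2, 3, 4],
                   'Price': [100.0, 102.0, 104.0, 107.0],
                   'EndDate': [201903, 201907, 201911, 202003]})

# precalculate the target EndDate: +100 on a YYYYMM integer is one year later
df['TargetEnd'] = df['EndDate'] + 100

# self-join: each row picks up the price of the row (same Symbol and Date)
# whose EndDate equals its TargetEnd; rows without a match get NaN
right = df[['Symbol', 'Date', 'EndDate', 'Price']].rename(
    columns={'EndDate': 'TargetEnd', 'Price': 'Price_target'})
out = df.merge(right, on=['Symbol', 'Date', 'TargetEnd'], how='left')
out['NewColumn'] = out['Price'] / out['Price_target']
```

Here only Num 1 (EndDate 201903, target 202003) finds a partner, giving 100.00 / 107.00; the other rows get NaN, matching the expected output in the question.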