I have two dataframes, deals:
Currency Deal_Amount
0 USD 18.40
1 USD 18.40
2 USD 5559.00
3 USD 14300.00
4 USD 1000.00
5 EUR 3072.00
6 USD 500.00
7 CAD 100000.00
8 USD 250.00
15 EUR 6000.00
and currency_rates:
currency_code year quarter from_usd_rate to_usd_rate
AED 2018 3 3.67285 0.27226813
ARS 2018 3 17.585 0.056866648
AUD 2018 3 1.27186 0.786250059
BRL 2018 3 3.1932 0.313165477
CAD 2018 3 1.2368 0.808538163
EUR 2018 3 0.852406 1.173149884
GBP 2018 3 0.747077 1.338550109
GHS 2018 3 4.4 0.227272727
I want to create a column in deals that converts the rows where deals['Currency'] != 'USD', applying currency_rates['to_usd_rate'] to deals['Deal_Amount'] to get the USD-converted amount.
So far I tried
def convert_amount(data):
    if data['Currency'] == currency_rates['currency_code']:
        converted_amount = data['Deal_Amount'] * currency_rates['to_usd_rate']
        return converted_amount
but it's not working.
You can merge, then fillna with 1 to use as the new rate (here df1 is deals and df2 is currency_rates):
A = df1.merge(df2, left_on='Currency', right_on='currency_code', how='left').fillna(1)
# merge resets the index, so assign by position rather than by label
df1['converted'] = (A['Deal_Amount'] * A['to_usd_rate']).values
Output:
index Currency Deal_Amount converted
0 0 USD 18.4 18.400000
1 1 USD 18.4 18.400000
2 2 USD 5559.0 5559.000000
3 3 USD 14300.0 14300.000000
4 4 USD 1000.0 1000.000000
5 5 EUR 3072.0 3603.916444
6 6 USD 500.0 500.000000
7 7 CAD 100000.0 80853.816300
8 8 USD 250.0 250.000000
9 15 EUR 6000.0 7038.899304
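For clarity, here is the same idea written with the question's own variable names. This is a sketch that assumes one rate row per currency, as in the sample; it fills NaN only in the rate column, so NaNs elsewhere are left alone:
rates = currency_rates[['currency_code', 'to_usd_rate']]
merged = deals.merge(rates, left_on='Currency', right_on='currency_code', how='left')
# USD rows have no match in the rates table, so their rate is NaN; fill with 1.
deals['converted'] = (merged['Deal_Amount'] * merged['to_usd_rate'].fillna(1)).values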
Related
I would like to convert values from one currency to another based on the following logic:
#df1#
id  from_curr  to_curr  Date        item_number  value_to_convert
1   AED        EUR      2017-01-12  10           2000
1   AED        EUR      2017-01-12  20           189
2   UAD        EUR      2021-05-18  10           12.5
3   DZD        EUR      2017-01-12  10           130
5   GBP        EUR      2017-01-12  10           1000
5   GBP        EUR      2017-01-12  20           1300
5   GBP        EUR      2017-01-12  30           500
6   EUR        EUR      2020-09-14  10           22.50
7   EUR        EUR      2021-09-01  10           150
6   EUR        EUR      2020-09-14  20           18
df2: #currency_table#
from_curr  To_curr  Date        rate_exchange
AED        EUR      2017-01-01  -5,123
UAD        EUR      2021-05-26  -9.5
AED        EUR      2018-03-10  -5,3
DZD        EUR      2017-01-01  -6,12
GBP        EUR      2017-01-02  -0.8015
EUR        CHF      01-01-2022  -1.22760
GBP        EUR      2017-02-01  -1.02
I would like to create a PySpark function that converts value_to_convert from df1 using the rate_exchange from currency_table, joining the two dataframes on the from_curr and Date fields. Each value should be converted with the rate_exchange belonging to the right date, producing df3 as below; note that a currency may have two rates of exchange.
id  from_curr  to_curr  Date        item_number  converted_value
1   AED        EUR      2017-01-12  10           390.39
1   AED        EUR      2017-01-12  20           19,89
2   UAD        EUR      2021-05-18  10           1,31
3   DZD        EUR      2017-01-12  10           21,24
5   GBP        EUR      2017-01-12  10           1247.66
5   GBP        EUR      2017-01-12  20           1621.95
5   GBP        EUR      2017-01-12  30           623.83
6   EUR        EUR      2020-09-14  10           22.50
7   EUR        EUR      2021-09-01  10           150
6   EUR        EUR      2020-09-14  20           18
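A minimal PySpark sketch of the requested logic, assuming df1 and currency_table are already Spark DataFrames with the columns shown above. Judging by the sample output, each value is divided by the absolute rate_exchange whose date lies closest to the deal's Date, and rows with no matching rate (such as EUR to EUR) keep their original value:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Align column names and keep the two Date columns distinct.
rates = (currency_table
         .withColumnRenamed('To_curr', 'to_curr')
         .withColumnRenamed('Date', 'rate_date'))

# For each row of df1, keep the rate dated closest to the deal date.
# Assumes id + item_number identifies a row of df1 and that both Date
# columns are date-typed (cast with F.to_date otherwise).
w = Window.partitionBy('id', 'item_number').orderBy(
        F.abs(F.datediff(F.col('rate_date'), F.col('Date'))))

df3 = (df1.join(rates, on=['from_curr', 'to_curr'], how='left')
          .withColumn('rn', F.row_number().over(w))
          .where(F.col('rn') == 1)
          # The sample output implies converted_value = value / |rate_exchange|;
          # unmatched rows (rate_exchange is null) keep value_to_convert.
          .withColumn('converted_value',
                      F.coalesce(F.col('value_to_convert') / F.abs(F.col('rate_exchange')),
                                 F.col('value_to_convert')))
          .drop('rn', 'rate_date', 'rate_exchange'))
The closest-date rule reproduces the sample (the 2017-01-12 GBP rows use the 2017-01-02 rate rather than the 2017-02-01 one); if the real business rule is "latest rate on or before the deal date", filter on rate_date <= Date and order the window by rate_date descending instead.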
Here is a sample of a dataframe, built from a CSV file, which is a monthly statement:
Date QUANTITY DES CURR
0 2020-07-06 -500.0 BETAPRO NASDAQ-100 2X DAILY BU CAD
1 2020-07-07 -18.0 AMAZON.COM USD
2 2020-07-10 -20.0 AMAZON.COM USD
3 2020-07-13 -30.0 AMAZON.COM USD
4 2020-07-15 -50.0 AMAZON.COM USD
5 2020-07-22 -32.0 AMAZON.COM USD
6 2020-07-23 -25.0 AMAZON.COM USD
7 2020-07-28 -25.0 AMAZON.COM USD
Is there a way to append a column to this dataframe with the USD/CAD exchange rate based on the dates on the left? I tried to use concat, but it simply puts the Yahoo exchange rates at the bottom of the dataframe.
You can use pd.concat() and many other methods to combine them, but merge() is the most convenient here:
import pandas as pd
import numpy as np
import io
data = '''
Date QUANTITY DES CURR
0 2020-07-06 -500.0 "BETAPRO NASDAQ-100 2X DAILY BU" CAD
1 2020-07-07 -18.0 AMAZON.COM USD
2 2020-07-10 -20.0 AMAZON.COM USD
3 2020-07-13 -30.0 AMAZON.COM USD
4 2020-07-15 -50.0 AMAZON.COM USD
5 2020-07-22 -32.0 AMAZON.COM USD
6 2020-07-23 -25.0 AMAZON.COM USD
7 2020-07-28 -25.0 AMAZON.COM USD
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
data2 = '''
Date CURR RATE
0 2020-07-06 USD 1.15098
1 2020-07-07 USD 1.16319
2 2020-07-10 USD 1.17112
3 2020-07-13 USD 1.18092
4 2020-07-15 USD 1.16503
5 2020-07-22 USD 1.17809
6 2020-07-23 USD 1.89103
7 2020-07-28 USD 1.91234
8 2020-07-06 CAD 1.5585
9 2020-07-07 CAD 1.5802
10 2020-07-10 CAD 1.5703
11 2020-07-13 CAD 1.5234
12 2020-07-15 CAD 1.5623
13 2020-07-22 CAD 1.5237
14 2020-07-23 CAD 1.5129
15 2020-07-28 CAD 1.5343
'''
df1 = pd.read_csv(io.StringIO(data2), sep=r'\s+')
# An inner join on Date + CURR attaches the matching rate to each row.
df.merge(df1, on=['Date', 'CURR'], how='inner')
Date QUANTITY DES CURR RATE
0 2020-07-06 -500.0 BETAPRO NASDAQ-100 2X DAILY BU CAD 1.55850
1 2020-07-07 -18.0 AMAZON.COM USD 1.16319
2 2020-07-10 -20.0 AMAZON.COM USD 1.17112
3 2020-07-13 -30.0 AMAZON.COM USD 1.18092
4 2020-07-15 -50.0 AMAZON.COM USD 1.16503
5 2020-07-22 -32.0 AMAZON.COM USD 1.17809
6 2020-07-23 -25.0 AMAZON.COM USD 1.89103
7 2020-07-28 -25.0 AMAZON.COM USD 1.91234
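If some statement dates could be missing from the rates table, a left join is a safer variant (a sketch; unmatched rows keep NaN in RATE, which you can then fill or forward-fill as needed):
# Keep every statement row even when no rate exists for that Date/CURR pair.
out = df.merge(df1, on=['Date', 'CURR'], how='left')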
Sorry for any possible confusion with the title. I will describe my question better with the following code and tables.
I have a dataframe with multiple columns, sorted by the first two, 'Route' and 'ID'. (Sorry about the formatting; all the rows here have a 'Route' value of 100 and 'ID' from 1 to 3.)
df1.head(9)
Route ID Year Vol Truck_Vol Truck_%
0 100 1 2017.0 7016 635.0 9.1
1 100 1 2014.0 6835 NaN NaN
2 100 1 2011.0 5959 352.0 5.9
3 100 2 2018.0 15828 NaN NaN
4 100 2 2015.0 13114 2964.0 22.6
5 100 2 2009.0 11844 1280.0 10.8
6 100 3 2016.0 15434 NaN NaN
7 100 3 2013.0 18699 2015.0 10.8
8 100 3 2010.0 15903 NaN NaN
What I want to have is
Route ID Year Vol1 Truck_Vol1 Truck_%1 Year2 Vol2 Truck_Vol2 Truck_%2 Year3 Vol3 Truck_Vol3 Truck_%3
0 100 1 2017 7016 635.0 9.1 2014 6835 NaN NaN 2011 5959 352.0 5.9
1 100 2 2018 15828 NaN NaN 2015 13114 2964.0 22.6 2009 11844 1280.0 10.8
2 100 3 2016 15434 NaN NaN 2013 18699 2015.0 10.8 2010 15903 NaN NaN
Again, sorry for the messy formatting. Let me try a simplified version.
Input:
Route ID Year Vol T_%
0 100 1 2017 100 1.0
1 100 1 2014 200 NaN
2 100 1 2011 300 2.0
3 100 2 2018 400 NaN
4 100 2 2015 500 3.0
5 100 2 2009 600 4.0
Desired Output:
Route ID Year Vol T_% Year.1 Vol.1 T_%.1 Year.2 Vol.2 T_%.2
0 100 1 2017 100 1.0 2014 200 NaN 2011 300 2
1 100 2 2018 400 NaN 2015 500 3.0 2009 600 4
So basically I just want to move each group's later rows up into new columns alongside its first row.
I am stumped here. The names for the newly generated columns don't matter.
For this current dataframe, I have three rows per 'group' like shown in the code. It will be great if the answer can accommodate any number of rows each group.
Thanks for your time.
with groupby + cumcount + set_index + unstack:
df1 = (df.assign(cid=df.groupby(['Route', 'ID']).cumcount())
         .set_index(['Route', 'ID', 'cid'])
         .unstack(-1)
         .sort_index(axis=1, level=1))
df1.columns = [f'{x}{y}' for x, y in df1.columns]
df1 = df1.reset_index()
Because cid simply numbers the rows within each (Route, ID) group, this accommodates any number of rows per group.
Output df1:
Route ID T_%0 Vol0 Year0 T_%1 Vol1 Year1 T_%2 Vol2 Year2
0 100 1 1.0 100 2017 NaN 200 2014 2.0 300 2011
1 100 2 NaN 400 2018 3.0 500 2015 4.0 600 2009
melt + pivot_table
v = df.melt(id_vars=['Route', 'ID'])
v['variable'] += v.groupby(['Route', 'ID', 'variable']).cumcount().astype(str)
res = v.pivot_table(index=['Route', 'ID'], columns='variable', values='value')
variable T_% 0 T_% 1 T_% 2 Vol 0 Vol 1 Vol 2 Year 0 Year 1 Year 2
Route ID
100 1 1.0 NaN 2.0 100.0 200.0 300.0 2017.0 2014.0 2011.0
2 NaN 3.0 4.0 400.0 500.0 600.0 2018.0 2015.0 2009.0
If you want to sort these:
import numpy as np
c = res.columns.str.extract(r'(\d+)')[0].values.astype(int)
res.iloc[:, np.argsort(c)]
variable T_%0 Vol0 Year0 T_%1 Vol1 Year1 T_%2 Vol2 Year2
Route ID
100 1 1.0 100.0 2017.0 NaN 200.0 2014.0 2.0 300.0 2011.0
2 NaN 400.0 2018.0 3.0 500.0 2015.0 4.0 600.0 2009.0
You asked about why I used cumcount. To explain, here is what v looks like from above:
Route ID variable value
0 100 1 Year 2017.0
1 100 1 Year 2014.0
2 100 1 Year 2011.0
3 100 2 Year 2018.0
4 100 2 Year 2015.0
5 100 2 Year 2009.0
6 100 1 Vol 100.0
7 100 1 Vol 200.0
8 100 1 Vol 300.0
9 100 2 Vol 400.0
10 100 2 Vol 500.0
11 100 2 Vol 600.0
12 100 1 T_% 1.0
13 100 1 T_% NaN
14 100 1 T_% 2.0
15 100 2 T_% NaN
16 100 2 T_% 3.0
17 100 2 T_% 4.0
If I used pivot_table directly on this DataFrame, the duplicated variable names would be aggregated with the default aggfunc='mean', and you would end up with something like this:
variable T_% Vol Year
Route ID
100 1 1.5 200.0 2014.0
2 3.5 500.0 2014.0
Obviously you are losing data here. cumcount is the solution, as it turns the variable series into this:
Route ID variable value
0 100 1 Year0 2017.0
1 100 1 Year1 2014.0
2 100 1 Year2 2011.0
3 100 2 Year0 2018.0
4 100 2 Year1 2015.0
5 100 2 Year2 2009.0
6 100 1 Vol0 100.0
7 100 1 Vol1 200.0
8 100 1 Vol2 300.0
9 100 2 Vol0 400.0
10 100 2 Vol1 500.0
11 100 2 Vol2 600.0
12 100 1 T_%0 1.0
13 100 1 T_%1 NaN
14 100 1 T_%2 2.0
15 100 2 T_%0 NaN
16 100 2 T_%1 3.0
17 100 2 T_%2 4.0
Where you have a count of repeated elements per unique Route and ID.
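For reference, a minimal standalone sketch of what cumcount does:
import pandas as pd

s = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b']})
# cumcount numbers the occurrences within each group, starting at 0.
print(s.groupby('key').cumcount().tolist())  # [0, 1, 2, 0, 1]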
I want to create two Running Total columns that ONLY aggregate the Amount values based on whether TYPE is ANNUAL or MONTHLY within each Deal.
So it would be DF.groupby(['Deal', 'Booking Month']), then somehow apply a sum when TYPE == ANNUAL for the first column and TYPE == MONTHLY for the second column.
This is what my grouped DF looks like, plus the two desired columns:
Deal TYPE Month Amount Running Total(ANNUAL) Running Total(Monthly)
A ANNUAL April 1000 1000 0
A ANNUAL April 2000 3000 0
A MONTHLY June 1500 3000 1500
B MONTHLY April 11150 0 11150
B ANNUAL July 700 700 11150
B ANNUAL August 303.63 1003.63 11150
C ANNUAL April 25624.59 25624.59 0
D ANNUAL June 5000 5000 0
D ANNUAL July 5000 10000 0
D ANNUAL August 5000 15000 0
E ANNUAL April 10 10 0
E MONTHLY May 1000 10 1000
E ANNUAL May 500 510 1000
E MONTHLY June 500.00 510 1500
E ANNUAL June 600 1110 1500
E MONTHLY July 300 1110 1800
E MONTHLY July 8200 1110 10000
Use filters and groupby + transform:
mask = df.TYPE.eq('ANNUAL')
cols = ['Running Total(ANNUAL)', 'Running Total(MONTHLY)']
# Seed each running-total column with the Amounts of the matching TYPE only.
df.loc[mask, 'Running Total(ANNUAL)'] = df.loc[mask, 'Amount']
df.loc[~mask, 'Running Total(MONTHLY)'] = df.loc[~mask, 'Amount']
df[cols] = df[cols].fillna(0)
# A cumulative sum within each Deal turns the seeds into running totals.
df[cols] = df.groupby('Deal')[cols].transform('cumsum')
print(df)
print(df)
Deal TYPE Month Amount Running Total(ANNUAL) \
0 A ANNUAL April 1000.00 1000.00
1 A ANNUAL April 2000.00 3000.00
2 A MONTHLY June 1500.00 3000.00
3 B MONTHLY April 11150.00 0.00
4 B ANNUAL July 700.00 700.00
5 B ANNUAL August 303.63 1003.63
6 C ANNUAL April 25624.59 25624.59
7 D ANNUAL June 5000.00 5000.00
8 D ANNUAL July 5000.00 10000.00
9 D ANNUAL August 5000.00 15000.00
10 E ANNUAL April 10.00 10.00
11 E MONTHLY May 1000.00 10.00
12 E ANNUAL May 500.00 510.00
13 E MONTHLY June 500.00 510.00
14 E ANNUAL June 600.00 1110.00
15 E MONTHLY July 300.00 1110.00
16 E MONTHLY July 8200.00 1110.00
Running Total(MONTHLY)
0 0.0
1 0.0
2 1500.0
3 11150.0
4 11150.0
5 11150.0
6 0.0
7 0.0
8 0.0
9 0.0
10 0.0
11 1000.0
12 1000.0
13 1500.0
14 1500.0
15 1800.0
16 10000.0
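A more compact sketch of the same idea (not from the original answer): zero out the other TYPE's amounts before the grouped cumulative sum.
# where(cond, 0) keeps Amount where the condition holds and writes 0 elsewhere,
# so each cumsum only ever accumulates one TYPE.
mask = df.TYPE.eq('ANNUAL')
df['Running Total(ANNUAL)'] = df.Amount.where(mask, 0).groupby(df.Deal).cumsum()
df['Running Total(MONTHLY)'] = df.Amount.where(~mask, 0).groupby(df.Deal).cumsum()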
You can do this with .expanding().sum(), which will maintain a MultiIndex of the groups that you can unstack to get separate columns for each type (a grouped expanding sum is the same running total as a grouped cumsum, but it keeps the group keys in the index for unstacking). Use another groupby to fill the missing values within each group accordingly, then concatenate it back.
The nice thing about this is that it can be done for arbitrarily many types, without needing to define them anywhere explicitly.
import pandas as pd
df2 = (df.groupby(['Deal', 'TYPE'])
         .Amount.expanding().sum()          # running total per Deal/TYPE
         .unstack(level=1)                  # one column per TYPE
         .groupby(level=0)                  # within each Deal...
         .ffill().fillna(0)                 # ...carry the last total forward, else 0
         .reset_index(level=0, drop=True)   # back to the original row index
         .drop(columns='Deal'))
pd.concat([df, df2], axis=1)
Output
Deal TYPE Month Amount ANNUAL MONTHLY
0 A ANNUAL April 1000.00 1000.00 0.0
1 A ANNUAL April 2000.00 3000.00 0.0
2 A MONTHLY June 1500.00 3000.00 1500.0
3 B MONTHLY April 11150.00 0.00 11150.0
4 B ANNUAL July 700.00 700.00 11150.0
5 B ANNUAL August 303.63 1003.63 11150.0
6 C ANNUAL April 25624.59 25624.59 0.0
7 D ANNUAL June 5000.00 5000.00 0.0
8 D ANNUAL July 5000.00 10000.00 0.0
9 D ANNUAL August 5000.00 15000.00 0.0
10 E ANNUAL April 10.00 10.00 0.0
11 E MONTHLY May 1000.00 10.00 1000.0
12 E ANNUAL May 500.00 510.00 1000.0
13 E MONTHLY June 500.00 510.00 1500.0
14 E ANNUAL June 600.00 1110.00 1500.0
15 E MONTHLY July 300.00 1110.00 1800.0
16 E MONTHLY July 8200.00 1110.00 10000.0
I would like to calculate a column from other rows of a pandas dataframe.
For example, say I have this dataframe:
df = pd.DataFrame({
"year" : ['2017', '2017', '2017', '2017', '2017','2017', '2017', '2017', '2017'],
"rooms" : ['1', '2', '3', '1', '2', '3', '1', '2', '3'],
"city" : ['tokyo', 'tokyo', 'toyko', 'nyc','nyc', 'nyc', 'paris', 'paris', 'paris'],
"rent" : [1000, 1500, 2000, 1200, 1600, 1900, 900, 1500, 2200],
})
print(df)
city rent rooms year
0 tokyo 1000 1 2017
1 tokyo 1500 2 2017
2 toyko 2000 3 2017
3 nyc 1200 1 2017
4 nyc 1600 2 2017
5 nyc 1900 3 2017
6 paris 900 1 2017
7 paris 1500 2 2017
8 paris 2200 3 2017
I'd like to add each rent compared to nyc's rent for the same year and rooms.
The ideal result looks like this:
city rent rooms year vs_nyc
0 tokyo 1000 1 2017 0.833333
1 tokyo 1500 2 2017 0.9375
2 toyko 2000 3 2017 1.052631
3 nyc 1200 1 2017 1.0
4 nyc 1600 2 2017 1.0
5 nyc 1900 3 2017 1.0
6 paris 900 1 2017 0.75
7 paris 1500 2 2017 0.9375
8 paris 2200 3 2017 1.157894
How can I add a column like vs_nyc, taking account of the year and rooms?
I tried the following, but it didn't work:
# filtering gives NaN values, and fillna(method='pad') didn't work either
df.rent / df[df['city'] == 'nyc'].rent
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 NaN
7 NaN
8 NaN
Name: rent, dtype: float64
The division above returns NaN for most rows because pandas aligns on the index, and only rows 3-5 share index labels with the nyc subset. The fix is to reshape so that matching (year, rooms) pairs line up.
set_index + unstack
d1 = df.set_index(['city', 'year', 'rooms']).rent.unstack('city')
d1
city nyc paris tokyo toyko
year rooms
2017 1 1200.0 900.0 1000.0 NaN
2 1600.0 1500.0 1500.0 NaN
3 1900.0 2200.0 NaN 2000.0
Then we can divide
d1.div(d1.nyc, 0)
city nyc paris tokyo toyko
year rooms
2017 1 1.0 0.750000 0.833333 NaN
2 1.0 0.937500 0.937500 NaN
3 1.0 1.157895 NaN 1.052632
solution
d1 = df.set_index(['city', 'year', 'rooms']).rent.unstack('city')
df.join(d1.div(d1.nyc, 0).stack().rename('vs_nyc'), on=['year', 'rooms', 'city'])
city rent rooms year vs_nyc
0 tokyo 1000 1 2017 0.833333
1 tokyo 1500 2 2017 0.937500
2 toyko 2000 3 2017 1.052632
3 nyc 1200 1 2017 1.000000
4 nyc 1600 2 2017 1.000000
5 nyc 1900 3 2017 1.000000
6 paris 900 1 2017 0.750000
7 paris 1500 2 2017 0.937500
8 paris 2200 3 2017 1.157895
A little cleaned up:
cols = ['city', 'year', 'rooms']
ny_rent = df.set_index(cols).rent.loc['nyc'].rename('ny_rent')
df.assign(vs_nyc=df.rent / df.join(ny_rent, on=ny_rent.index.names).ny_rent)
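An alternative sketch (not from the original answers) that avoids reshaping entirely, using groupby + transform to broadcast the nyc rent across each (year, rooms) group:
# Mask non-nyc rents to NaN, then spread the single nyc value over its group
# ('max' ignores NaN, so it just picks up the one nyc rent per group).
nyc_rent = df.rent.where(df.city.eq('nyc'))
df['vs_nyc'] = df.rent / nyc_rent.groupby([df.year, df.rooms]).transform('max')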