Pandas: join on partial string match, like Excel VLOOKUP - python

I am trying to perform an action in Python which is very similar to VLOOKUP in Excel. There have been many questions related to this on StackOverflow, but they are all slightly different from this use case. Hopefully someone can guide me in the right direction. I have the following two pandas dataframes:
df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})
print(df1)
Invoice Currency
0 20561 EUR
1 20562 EUR
2 20563 EUR
3 20564 USD
print(df2)
Ref Type Amount Comment
0 20561 01 150 bla
1 INV20562 03 175 bla
2 INV20563BG 04 160 bla
3 20564 02 180 bla
Now I would like to create a new dataframe (df3) where I combine the two based on the invoice numbers. The problem is that the invoice numbers are not always a "full match", but sometimes a "partial match" in df2['Ref']. So the joining on 'Invoice' does not give the desired output because it doesn't copy the data for invoices 20562 & 20563, see below:
df3 = df1.join(df2.set_index('Ref'), on='Invoice')
print(df3)
Invoice Currency Type Amount Comment
0 20561 EUR 01 150 bla
1 20562 EUR NaN NaN NaN
2 20563 EUR NaN NaN NaN
3 20564 USD 02 180 bla
Is there a way to join on a partial match? I know how to "clean" df2['Ref'] with regex, but that is not the solution I am after. With a for loop, I get a long way but this isn't very Pythonic.
df4 = df1.copy()
for i, row in df1.iterrows():
    tmp = df2[df2['Ref'].str.contains(row['Invoice'])]
    df4.loc[i, 'Amount'] = tmp['Amount'].values[0]
print(df4)
Invoice Currency Amount
0 20561 EUR 150
1 20562 EUR 175
2 20563 EUR 160
3 20564 USD 180
Can str.contains() somehow be used in a more elegant way? Thank you so much in advance for your help!

This is one way using pd.Series.apply, which is just a thinly veiled loop. A "partial string merge" is what you are looking for, but I'm not sure it exists in a vectorised form.
df4 = df1.copy()
def get_amount(x):
    return df2.loc[df2['Ref'].str.contains(x), 'Amount'].iloc[0]
df4['Amount'] = df4['Invoice'].apply(get_amount)
print(df4)
Currency Invoice Amount
0 EUR 20561 150
1 EUR 20562 175
2 EUR 20563 160
3 USD 20564 180
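If a merge-style result with all df2 columns is wanted, one possible sketch is to build every (Invoice, Ref) pair with a cross join and keep only the pairs where the invoice number occurs inside the reference. This assumes pandas 1.2 or later (for how='cross') and frames small enough that len(df1) * len(df2) rows fit in memory; the names pairs, mask and df3 are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})

# Every row of df1 paired with every row of df2 (requires pandas >= 1.2).
pairs = df1.merge(df2, how='cross')

# Keep only the pairs where the invoice number is a substring of Ref.
mask = [inv in ref for inv, ref in zip(pairs['Invoice'], pairs['Ref'])]
df3 = pairs[mask].reset_index(drop=True)

print(df3)
```

Unlike the .iloc[0] lookup, an invoice with no matching Ref simply produces no row here, and an invoice matching several refs produces several rows.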

Here are two alternative solutions, both using Pandas' merge.
# Solution 1 (checking directly if 'Invoice' string is in the 'Ref' string)
df4 = df2.copy()
df4['Invoice'] = [val for idx, val in enumerate(df1['Invoice']) if val in df2['Ref'][idx]]
df_m4 = df1.merge(df4[['Amount', 'Invoice']], on='Invoice')
# Solution 2 (regex)
import re
df5 = df2.copy()
df5['Invoice'] = [re.findall(r'(\d{5})', s)[0] for s in df2['Ref']]
df_m5 = df1.merge(df5[['Amount', 'Invoice']], on='Invoice')
Both df_m4 and df_m5 will print
Currency Invoice Amount
0 EUR 20561 150
1 EUR 20562 175
2 EUR 20563 160
3 USD 20564 180
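Note: The regex solution presented assumes that the invoice numbers are always 5 digits and only takes the first such occurrence in each Ref; it could be improved to be more robust if needed. Solution 1 compares the strings directly, but it assumes that df1 and df2 have the same number of rows in the same order, because it checks row idx of df1 against row idx of df2. If any pair does not match, the list comes out shorter than df4 and the assignment raises an error.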
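One way the regex idea might be made less dependent on the 5-digit assumption is to build the pattern from the invoice numbers in df1 themselves and pull the matching number out of each Ref with str.extract. This is only a sketch: the names pattern, keyed and df_m are illustrative, and if one invoice number is a prefix of another, the alternation picks whichever is listed first.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})

# One capture group containing every known invoice number, escaped for regex use.
pattern = '(' + '|'.join(re.escape(inv) for inv in df1['Invoice']) + ')'

# Invoice is NaN for any Ref that contains none of the known invoice numbers.
keyed = df2.assign(Invoice=df2['Ref'].str.extract(pattern, expand=False))

# A left merge keeps every invoice from df1, matched or not.
df_m = df1.merge(keyed, on='Invoice', how='left')
print(df_m)
```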

Related

How can I filter a pandas dataframe with another, smaller pandas dataframe

I have 2 dataframes. The first looks like this:
df1:
MONEY Value
0 EUR 850
1 USD 750
2 CLP 1
3 DCN 1
df2:
Money
0 USD
1 USD
2 USD
3 USD
4 EGP
... ...
25984 USD
25985 DCN
25986 USD
25987 CLP
25988 USD
I want to remove the rows of df2 whose "Money" value is not present in df1, and add a column to df2 holding the matching values from the "Value" column of df1:
Money Value
0 USD 720
1 USD 720
2 USD 720
3 USD 720
... ...
25984 USD 720
25985 DCN 1
25986 USD 720
25987 CLP 1
25000 USD 720
Step by step:
df1.set_index("MONEY")["Value"]
This code turns the column MONEY into the index and selects the Value column, giving a Series keyed by currency. It results in:
print(df1.set_index("MONEY")["Value"])
MONEY
EUR 850
USD 150
DCN 1
df2["Money"].map(df1.set_index("MONEY")["Value"])
This code looks up each value of df2["Money"] in that index and returns the matching Value, or NaN where there is no match. This returns the following:
0 150.0
1 NaN
2 850.0
3 NaN
Now we assign the previous column to a new column in df2 called Value. Putting it all together:
df2["Value"] = df2["Money"].map(df1.set_index("MONEY")["Value"])
df2 now looks like:
Money Value
0 USD 150.0
1 GBP NaN
2 EUR 850.0
3 CLP NaN
Only one thing is left to do: delete any rows that have a NaN value:
df2.dropna(inplace=True)
Entire code sample:
import pandas as pd
# Create df1
x_1 = ["EUR", 850], ["USD", 150], ["DCN", 1]
df1 = pd.DataFrame(x_1, columns=["MONEY", "Value"])
# Create df2
x_2 = "USD", "GBP", "EUR", "CLP"
df2 = pd.DataFrame(x_2, columns=["Money"])
# Create new column in df2 called 'Value'
df2["Value"] = df2["Money"].map(df1.set_index("MONEY")["Value"])
# Drops any rows that have 'NaN' in column 'Value'
df2.dropna(inplace=True)
print(df2)
Outputs:
Money Value
0 USD 150.0
2 EUR 850.0
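As an alternative sketch, the lookup and the filtering can be done in one step with an inner merge, which keeps only the rows of df2 whose currency exists in df1. The data below repeats the sample from the code above; the name filtered is illustrative. Note that, unlike map followed by dropna, the merge resets the index and keeps Value as an integer column, because no NaN is ever introduced.

```python
import pandas as pd

df1 = pd.DataFrame([["EUR", 850], ["USD", 150], ["DCN", 1]],
                   columns=["MONEY", "Value"])
df2 = pd.DataFrame(["USD", "GBP", "EUR", "CLP"], columns=["Money"])

# Rename the key so both frames share the column name, then inner-join:
# rows of df2 without a counterpart in df1 are dropped.
filtered = df2.merge(df1.rename(columns={"MONEY": "Money"}),
                     on="Money", how="inner")
print(filtered)
```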

Split string column based on number of characters

I have the following column in a dataframe:
columnA
EUR590
USD680
EUR10000,9
USD40
How can I split it after the first three characters, so that the dataframe looks like this:
columnA columnB
590 EUR
680 USD
10000,9 EUR
40 USD
df['columnB']=df['columnA'].str.slice(stop=3)
df['columnA']=df['columnA'].str.slice(start=3)
Another way would be using series.str.extract() with a pattern:
df1=df.columnA.str.extract('(?P<columnA>.{3})(?P<columnB>.{1,})')
print(df1)
columnA columnB
0 EUR 590
1 USD 680
2 EUR 10000,9
3 USD 40
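Note that the named groups above put the currency code in columnA, while the question asks for the amount in columnA and the code in columnB. A minimal sketch with the group names swapped to match the requested layout (the names parts and result are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'columnA': ['EUR590', 'USD680', 'EUR10000,9', 'USD40']})

# First three characters go to columnB, the remainder to columnA.
parts = df['columnA'].str.extract(r'^(?P<columnB>.{3})(?P<columnA>.+)$')

# Reorder so the amount comes first, as in the desired output.
result = parts[['columnA', 'columnB']]
print(result)
```

The amounts are still strings here; a value such as 10000,9 would need its decimal comma handled before converting to a number.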

Pandas DataFrame currency conversion

I have a DataFrame with two columns:
col1 | col2
20 EUR
31 GBP
5 JPY
I may have 10000 rows like this.
How can I do a fast currency conversion, with GBP as the base currency?
Should I use easymoney?
I know how to apply the conversion to a single row, but I do not know how to iterate through all the rows quickly.
EDIT:
I would like to apply something like:
def convert_currency(amount, currency_symbol):
    converted = ep.currency_converter(amount=1000, from_currency=currency_symbol, to_currency="GBP")
    return converted
df.loc[df.currency != 'GBP', 'col1'] = convert_currency(currency_data.col1, df.col2)
but it does not work yet.
Join a third column with the conversion rates for each currency, joining on the currency code in col2. Then create a column with the translated amount.
dfRate:
code | rate
EUR 1.123
USD 2.234
df2 = pd.merge(df1, dfRate, how='left', left_on=['col2'], right_on=['code'])
df2['translatedAmt'] = df2['col1'] / df2['rate']
df = pd.DataFrame([[20, 'EUR'], [31, 'GBP'], [5, 'JPY']], columns=['value', 'currency'])
print(df)
value currency
0 20 EUR
1 31 GBP
2 5 JPY
def convert_to_gbp(args):  # placeholder for your fancy conversion function
    amount, currency = args
    rates = {'EUR': 2, 'JPY': 10, 'GBP': 1}
    return rates[currency] * amount
df.assign(**{'In GBP': df.apply(convert_to_gbp, axis=1)})
value currency In GBP
0 20 EUR 40
1 31 GBP 31
2 5 JPY 50
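For 10000 rows, a row-wise apply is usually unnecessary: the rate lookup can be done once per column with map and the division done on whole columns. The sketch below assumes the rates are quoted as units of foreign currency per 1 GBP; the rate values and the names rates_per_gbp and in_gbp are made up for illustration, and the base currency is included with a rate of 1 so that GBP rows do not become NaN.

```python
import pandas as pd

df = pd.DataFrame({'col1': [20, 31, 5], 'col2': ['EUR', 'GBP', 'JPY']})

# Hypothetical rates: units of each currency per 1 GBP (not real market data).
rates_per_gbp = pd.Series({'GBP': 1.0, 'EUR': 1.25, 'JPY': 200.0})

# Look up the rate for every row at once, then divide column by column.
df['in_gbp'] = df['col1'] / df['col2'].map(rates_per_gbp)
print(df)
```

A currency missing from rates_per_gbp yields NaN in in_gbp rather than raising, which makes unmatched codes easy to spot.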

Pandas groupby stored in a new dataframe

I have the following code:
import pandas as pd
df1 = pd.DataFrame({'Counterparty':['Bank','Bank','GSE','PSE'],
'Sub Cat':['Tier1','Small','Small', 'Small'],
'Location':['US','US','UK','UK'],
'Amount':[50, 55, 65, 55],
'Amount1':[1,2,3,4]})
df2=df1.groupby(['Counterparty','Location'])[['Amount']].sum()
df2.dtypes
df1.dtypes
The df2 data frame does not have the columns that I am aggregating across (Counterparty and Location). Any idea why this is the case? Both Amount and Amount1 are numeric fields. I just want to sum across Amount and aggregate across Amount1.
The grouping columns end up in the index. To get them back as regular columns, add the as_index=False parameter or call reset_index:
df2=df1.groupby(['Counterparty','Location'])[['Amount']].sum().reset_index()
print (df2)
Counterparty Location Amount
0 Bank US 105
1 GSE UK 65
2 PSE UK 55
df2=df1.groupby(['Counterparty','Location'], as_index=False)[['Amount']].sum()
print (df2)
Counterparty Location Amount
0 Bank US 105
1 GSE UK 65
2 PSE UK 55
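If you aggregate all columns, older pandas versions automatically exclude nuisance (non-numeric) columns, so the column Sub Cat is omitted. In pandas 2.0 and later this automatic exclusion no longer happens, so pass numeric_only=True to sum() to get the same result: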
df2=df1.groupby(['Counterparty','Location']).sum().reset_index()
print (df2)
Counterparty Location Amount Amount1
0 Bank US 105 3
1 GSE UK 65 3
2 PSE UK 55 4
df2=df1.groupby(['Counterparty','Location'], as_index=False).sum()
Remove the double brackets around 'Amount' and use single brackets. You are only selecting one column, so a list is not needed.
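Since the question mentions aggregating Amount and Amount1 differently, named aggregation (available since pandas 0.25) may be a better fit. The sketch below sums Amount and, purely as an assumed example, takes the mean of Amount1; the question does not say which aggregation is wanted for Amount1.

```python
import pandas as pd

df1 = pd.DataFrame({'Counterparty': ['Bank', 'Bank', 'GSE', 'PSE'],
                    'Sub Cat': ['Tier1', 'Small', 'Small', 'Small'],
                    'Location': ['US', 'US', 'UK', 'UK'],
                    'Amount': [50, 55, 65, 55],
                    'Amount1': [1, 2, 3, 4]})

# as_index=False keeps Counterparty and Location as ordinary columns.
# 'mean' for Amount1 is an assumption made for illustration only.
df2 = (df1.groupby(['Counterparty', 'Location'], as_index=False)
          .agg(Amount=('Amount', 'sum'), Amount1=('Amount1', 'mean')))
print(df2)
```

Because only the named columns are aggregated, Sub Cat never enters the calculation regardless of the pandas version.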

Matching and adding column to data frame

I am going crazy about this one. I am trying to add a new column to a data frame DF1, based on values found in another data frame DF2. This is how they look,
DF1=
Date Amount Currency
0 2014-08-20 -20000000 EUR
1 2014-08-20 -12000000 CAD
2 2014-08-21 10000 EUR
3 2014-08-21 20000 USD
4 2014-08-22 25000 USD
DF2=
NAME OPEN
0 EUR 10
1 CAD 20
2 USD 30
Now, I would like to create a new column in DF1, named 'Amount (Local)', where each amount in 'Amount' is multiplied by the matching value found in DF2, yielding this result:
DF1=
Date Amount Currency Amount (Local)
0 2014-08-20 -20000000 EUR -200000000
1 2014-08-20 -12000000 CAD -240000000
2 2014-08-21 10000 EUR 100000
3 2014-08-21 20000 USD 600000
4 2014-08-22 25000 USD 750000
If there exists a method for adding a column to DF1 based on a function, instead of just multiplication as the above problem, that would be very much appreciated also.
Thanks,
You can use map with a dict built from your second df (called df1 in my code; it is your DF2), and then multiply the result by the amount:
In [65]:
df['Amount (Local)'] = df['Currency'].map(dict(df1[['NAME','OPEN']].values)) * df['Amount']
df
Out[65]:
Date Amount Currency Amount (Local)
index
0 2014-08-20 -20000000 EUR -200000000
1 2014-08-20 -12000000 CAD -240000000
2 2014-08-21 10000 EUR 100000
3 2014-08-21 20000 USD 600000
4 2014-08-22 25000 USD 750000
Breaking this down: map looks up each value in the dict's keys. Here the keys are the NAME values and the dict values are the OPEN values, so each Currency is replaced by its OPEN rate. The result of this would be:
In [66]:
df['Currency'].map(dict(df1[['NAME','OPEN']].values))
Out[66]:
index
0 10
1 20
2 10
3 30
4 30
Name: Currency, dtype: int64
We then simply multiply this series against the Amount column from df (DF1 in your case) to get the desired result.
Use fancy-indexing to create a currency array aligned with your data in df1, then use it in multiplication, and assign the result to a new column in df1:
import pandas as pd
ccy_series = pd.Series([10,20,30], index=['EUR', 'CAD', 'USD'])
df1 = pd.DataFrame({'amount': [-200, -120, 1, 2, 2.5], 'ccy': ['EUR', 'CAD', 'EUR', 'USD', 'USD']})
aligned_ccy = ccy_series[df1.ccy].reset_index(drop=True)
aligned_ccy
=>
0 10
1 20
2 10
3 30
4 30
dtype: int64
df1['amount_local'] = df1.amount *aligned_ccy
df1
=>
amount ccy amount_local
0 -200.0 EUR -2000.0
1 -120.0 CAD -2400.0
2 1.0 EUR 10.0
3 2.0 USD 60.0
4 2.5 USD 75.0
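On the follow-up about using a function rather than a plain multiplication: one possible sketch is to look up the rate per row with map and then pass whole columns to the function, which stays vectorised as long as the function only uses arithmetic that works on Series. The function to_local is a hypothetical placeholder for whatever calculation is needed.

```python
import pandas as pd

DF1 = pd.DataFrame({'Date': ['2014-08-20', '2014-08-20', '2014-08-21',
                             '2014-08-21', '2014-08-22'],
                    'Amount': [-20000000, -12000000, 10000, 20000, 25000],
                    'Currency': ['EUR', 'CAD', 'EUR', 'USD', 'USD']})
DF2 = pd.DataFrame({'NAME': ['EUR', 'CAD', 'USD'], 'OPEN': [10, 20, 30]})

def to_local(amount, rate):
    # Hypothetical placeholder: swap in any arithmetic that works on Series.
    return amount * rate

# One rate per row of DF1, aligned on DF1's index.
rate = DF1['Currency'].map(DF2.set_index('NAME')['OPEN'])
DF1['Amount (Local)'] = to_local(DF1['Amount'], rate)
print(DF1)
```

If the function cannot operate on whole columns, DF1.apply(..., axis=1) is the row-by-row fallback, at the cost of speed.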
