Finding the difference between two data frames - python

I have two data frames say df1, df2 each has two columns ['Name', 'Marks']
I want to find the difference between the two ifs for corresponding Name Values.
Eg:
df = pd.DataFrame([["Shivi",70],["Alex",40]],columns=['Names', 'Value'])
df2 = pd.DataFrame([["Shivi",40],["Andrew",40]],columns=['Names', 'Value'])
For df1-df2 I want
pd.DataFrame([["Shivi",30],["Alex",40],["Andrew",40]],columns=['Names', 'Value'])

You can use:
diff = df1.set_index("Name").subtract(df2.set_index("Name"), fill_value=0)
So a complete program will look like this:
import pandas as pd
data1 = {'Name': ["Ashley", "Tom"], 'Marks': [40, 50]}
data2 = {'Name': ["Ashley", "Stan"], 'Marks': [80, 90]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
diff = df1.set_index("Name").subtract(df2.set_index("Name"), fill_value=0)
print(diff)
Output:
Marks
Name
Ashley -40.0
Stan -90.0
Tom 50.0

Related

Keep the row for which the values of two columns match by group otherwise keep the first row by group

I wanted to left join df2 on df1 and then keep the row that matches by group and if there is no matching group then I would like to keep the first row of the group in order to achieve df3 (the desired result). I was hoping you guys could help me with finding the optimal solution.
Here is my code to create the two dataframes and the required result.
import pandas as pd
import numpy as np
market = ['SP', 'SP', 'SP']
underlying = ['TSLA', 'GOOG', 'MSFT']
# DF1
df = pd.DataFrame(list(zip(market, underlying)),
columns=['market', 'underlying'])
market2 = ['SP', 'SP', 'SP', 'SP', 'SP']
underlying2 = [None, 'TSLA', 'GBX', 'GBM', 'GBS']
client2 = [17, 12, 100, 21, 10]
# DF2
df2 = pd.DataFrame(list(zip(market2, underlying2, client2)),
columns=['market', 'underlying', 'client'])
market3 = ['SP', 'SP', 'SP']
underlying3 = ['TSLA', 'GOOG', 'MSFT']
client3 = [12, 17, 17]
# Desired
df3 = pd.DataFrame(list(zip(market3, underlying3, client3)),
columns =['market', 'underlying', 'client'])
# This works but feels sub optimal
df3 = pd.merge(df,
df2,
how='left',
on=['market', 'underlying'])
df3 = pd.merge(df3,
df2,
how='left',
on=['market'])
df3 = df3.drop_duplicates(['market', 'underlying_x'])
df3['client'] = df3['client_x'].combine_first(df3['client_y'])
df3 = df3.drop(labels=['underlying_y', 'client_x', 'client_y'], axis=1)
df3 = df3.rename(columns={'underlying_x': 'underlying'})
Hope you guys could help, thankyou so much!
Store the first value (a groupby might not be necessary if every single one in market is 'SP'), merge and fill with the first value:
fill_value = df2.groupby('market').client.first()
# if you are interested in filtering for None:
fill_value = df2.set_index('market').loc[lambda df: df.underlying.isna(), 'client']
(df
.merge(
df2,
on = ['market', 'underlying'],
how = 'left')
.set_index('market')
.fillna({'client':fill_value}, downcast='infer')
)
underlying client
market
SP TSLA 12
SP GOOG 17
SP MSFT 17

Separate dataframe into multiple ones and add them as columns

my dataset look similiar to this (but with a couple of more rows):
The aim is to get this:
What I tried to do is:
# Identify names that are in the dataset
names = df['name'].unique().tolist()
# Define dataframe with first name
df1 = pd.DataFrame()
df1 = df[(df == names[0]).any(axis=1)]
df1 = df1.drop(['name'], axis=1)
df1 = df1.rename({'color':'color_'+str(names[0]), 'number':'number_'+str(names[0])}, axis=1)
# Make dataframes with other names and their corresponding color and number, add them to df1
df_merged = pd.DataFrame()
for i in range(1, len(names)):
df2 = pd.DataFrame()
df2 = df[(df == names[i]).any(axis=1)]
df2 = df2.drop(['name'], axis=1)
df2 = df2.rename({'color':'color_'+str(names[i]), 'number':'number_'+str(names[i])}, axis=1)
df_merged = df1.join(df2, lsuffix="_left", rsuffix="_right", how='left')
In the end I get this result for df_merged:
As you can see the columns color_Donald and number_Donald are missing. Does anyone know why and how to improve the code? It seems as if the loop somehow skips or overwrites Donald.
Thanks in advance!
sample df
import pandas as pd
data = {'name': {'2020-01-01 00:00:00': 'Justin', '2020-01-02 00:00:00': 'Justin', '2020-01-03 00:00:00': 'Donald'}, 'color': {'2020-01-01 00:00:00': 'blue', '2020-01-02 00:00:00': 'red', '2020-01-03 00:00:00': 'green'}, 'number': {'2020-01-01 00:00:00': 1, '2020-01-02 00:00:00': 2, '2020-01-03 00:00:00': 9}}
df = pd.DataFrame(data)
print(f"{df}\n")
name color number
2020-01-01 00:00:00 Justin blue 1
2020-01-02 00:00:00 Justin red 2
2020-01-03 00:00:00 Donald green 9
final df
df = (
df
.reset_index(names="date")
.pivot(index="date", columns="name", values=["color", "number"])
.fillna("")
)
df.columns = ["_".join(x) for x in df.columns.values]
print(df)
color_Donald color_Justin number_Donald number_Justin
date
2020-01-01 00:00:00 blue 1
2020-01-02 00:00:00 red 2
2020-01-03 00:00:00 green 9
The problem is the line:
df_merged = df1.join(df2, lsuffix="_left", rsuffix="_right", how='left')
where df_merged will be set in the loop always to the join of df1 with current df2.
The result after the loop is therefore a join of df1 with the last df2 and the Donald gets lost in this process.
To fix this problem first join empty df_merged with df1 and then in the loop join df_merged with df2.
Here the full code with the changes (not tested):
# Identify names that are in the dataset
names = df['name'].unique().tolist()
# Define dataframe with first name
df1 = pd.DataFrame()
df1 = df[(df == names[0]).any(axis=1)]
df1 = df1.drop(['name'], axis=1)
df1 = df1.rename({'color':'color_'+str(names[0]), 'number':'number_'+str(names[0])}, axis=1)
# Make dataframes with other names and their corresponding color and number, add them to df1
df_merged = pd.DataFrame()
df_merged = df_merged.join(df1) # <- add options if necessary
for i in range(1, len(names)):
df2 = pd.DataFrame()
df2 = df[(df == names[i]).any(axis=1)]
df2 = df2.drop(['name'], axis=1)
df2 = df2.rename({'color':'color_'+str(names[i]), 'number':'number_'+str(names[i])}, axis=1)
# join the current df2 to df_merged:
df_merged = df_merged.join(df2, lsuffix="_left", rsuffix="_right", how='left')

Pandas-InnerJoin- Multiplication of Rows

I have two sets of data, with one common column. Some rows have repetitions so I created a similar small example.
Here are my dataframes:
#Dataframe1
import pandas as pd
data = [['tom', 10], ['tom', 11], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
#Dataframe2
data2 = [['tom', 'LA'], ['tom', 'AU'], ['nick', 'NY'], ['juli', 'London']]
df2 = pd.DataFrame(data2, columns = ['Name', 'City'])
#InnerJoin
a = pd.merge(df, df2, how= 'inner', on = 'Name')
a
The result is:
So, Instead of 2 rows with Tom, we have 4 rows. How can I solve this issue?
Thank you,
Create a temporary key for duplicate name in order, such that the first Tom in df joins to the first Tom in df2 and 2nd Tom joins to 2nd Tom in df2, etc.
df = df.assign(name_key = df.groupby('Name').cumcount())
df2 = df2.assign(name_key = df.groupby('Name').cumcount())
df.merge(df2, how='inner', on=['Name', 'name_key'])
Output:
Name Age name_key City
0 tom 10 0 LA
1 tom 11 1 AU
2 nick 15 0 NY
3 juli 14 0 London

How to update several DataFrames values with another DataFrame values?

Please would you like to know how I can update two DataFrames df1 y df2 from another DataFrame df3. All this is done within a for loop that iterates over all the elements of the DataFrame df3
for i in range(len(df3)):
df1.p_mw = ...
df2.p_mw = ...
The initial DataFrames df1 and df2 are as follows:
df1 = pd.DataFrame([['GH_1', 10, 'Hidro'],
['GH_2', 20, 'Hidro'],
['GH_3', 30, 'Hidro']],
columns= ['name','p_mw','type'])
df2 = pd.DataFrame([['GT_1', 40, 'Termo'],
['GT_2', 50, 'Termo'],
['GF_1', 10, 'Fict']],
columns= ['name','p_mw','type'])
The DataFrame from which I want to update the data is:
df3 = pd.DataFrame([[150,57,110,20,10],
[120,66,110,20,0],
[90,40,105,20,0],
[60,40,90,20,0]],
columns= ['GH_1', 'GH_2', 'GH_3', 'GT_1', 'GT_2'])
As you can see the DataFrame df3 contains data from the corresponding column p_mw for both DataFrames df1 and df2. Furthermore, the DataFrame df2 has an element named GF_1 for which there is no update and should remain the same.
After updating for the last iteration, the desired output is the following:
df1 = pd.DataFrame([['GH_1', 60, 'Hidro'],
['GH_2', 40, 'Hidro'],
['GH_3', 90, 'Hidro']],
columns= ['name','p_mw','type'])
df2 = pd.DataFrame([['GT_1', 20, 'Termo'],
['GT_2', 0, 'Termo'],
['GF_1', 10, 'Fict']],
columns= ['name','p_mw','type'])
Create a mapping series by selecting the last row from df3, then map it on the column name and fill the nan values using the values from p_mw column
s = df3.iloc[-1]
df1['p_mw'] = df1['name'].map(s).fillna(df1['p_mw'])
df2['p_mw'] = df2['name'].map(s).fillna(df2['p_mw'])
If there are multiple dataframes that needed to be updated then we can use a for loop to avoid repetition of our code:
for df in (df1, df2):
df['p_mw'] = df['name'].map(s).fillna(df['p_mw'])
>>> df1
name p_mw type
0 GH_1 60 Hidro
1 GH_2 40 Hidro
2 GH_3 90 Hidro
>>> df2
name p_mw type
0 GT_1 20.0 Termo
1 GT_2 0.0 Termo
2 GF_1 10.0 Fict
This should do as you ask. No need for a for loop.
df1 = pd.DataFrame([['GH_1', 10, 'Hidro'],
['GH_2', 20, 'Hidro'],
['GH_3', 30, 'Hidro']],
columns= ['name','p_mw','type'])
df2 = pd.DataFrame([['GT_1', 40, 'Termo'],
['GT_2', 50, 'Termo'],
['GF_1', 10, 'Fict']],
columns= ['name','p_mw','type'])
df3 = pd.DataFrame([[150,57,110,20,10],
[120,66,110,20,0],
[90,40,105,20,0],
[60,40,90,20,0]],
columns= ['GH_1', 'GH_2', 'GH_3', 'GT_1', 'GT_2'])
updates = df3.iloc[-1].values
df1["p_mw"] = updates[:3]
df2["p_mw"] = np.append(updates[3:], df2["p_mw"].iloc[-1])

Compare and join two columns in two dataframes

I have two data frames with the same column types.
First Dataframe (df1)
data = [['BTC', 2], ['ETH', 1], ['ADA', 100]]
df1 = pd.DataFrame(data, columns=['Coin', 'Quantity'])
Coin Quantity
BTC 2
ETH 1
ADA 100
... ...
Second Dataframe (df2)
data = [['BTC', 50000], ['FTM', 50], ['ETH', 1500], ['LRC', 5], ['ADA', 20]]
df2 = pd.DataFrame(data, columns=['code_name', 'selling rate'])
code_name selling rate
BTC 50000
FTM 50
ETH 1500
LRC 5
ADA 20
... ...
Expected output (FTM and LRC should be removed)
Coin Quantity selling rate
BTC 2 50000
ETH 1 1500
ADA 100 20
... ... ...
What I have tried
df1.merge(df2, how='outer', left_on=['Coin'], right_on=['code_name'])
df = np.where(df1['Coin'] == df2['code_name'])
Both codes did not give me the expected output. I searched on StackOverflow and couldn't find any helpful answer. Can anyone give a solution or make this question as duplicate if a related question exist?
What you need is an inner join, not an outer join. Inner joins only retain records that are common in the two tables you're joining together.
import pandas as pd
# Make the first data frame
df1 = pd.DataFrame({
'Coin': ['BTC', 'ETH', 'ADA'],
'Quantity': [2, 1, 100]
})
# Make the second data frame
df2 = pd.DataFrame({
'code_name': ['BTC', 'FTM', 'ETH', 'LRC', 'ADA'],
'selling_rate': [50000, 50, 1500, 5, 20]
})
# Merge the data frames via inner join. This only keeps entries that appear in
# both data frames
full_df = df1.merge(df2, how = 'inner', left_on = 'Coin', right_on = 'code_name')
# Drop the duplicate column
full_df = full_df.drop('code_name', axis = 1)
Since merge() is slow for large dataset. I prefer not to use it as long as I have a faster solution. Therefore, I suggest the following:
full_df = df1.copy()
full_df['selling_rate'] = list(
df2['selling_rate'][df2['code_name'].isin(df1['Coin'].unique())])
Note: This turns to the expected solution if df1 and df2 are in the same order with respect to Coin and code_name. If they are not, you should use sort_values() before the above code.

Categories