I have two data frames with the same column types.
First Dataframe (df1)
data = [['BTC', 2], ['ETH', 1], ['ADA', 100]]
df1 = pd.DataFrame(data, columns=['Coin', 'Quantity'])
Coin Quantity
BTC 2
ETH 1
ADA 100
... ...
Second Dataframe (df2)
data = [['BTC', 50000], ['FTM', 50], ['ETH', 1500], ['LRC', 5], ['ADA', 20]]
df2 = pd.DataFrame(data, columns=['code_name', 'selling rate'])
code_name selling rate
BTC 50000
FTM 50
ETH 1500
LRC 5
ADA 20
... ...
Expected output (FTM and LRC should be removed)
Coin Quantity selling rate
BTC 2 50000
ETH 1 1500
ADA 100 20
... ... ...
What I have tried
df1.merge(df2, how='outer', left_on=['Coin'], right_on=['code_name'])
df = np.where(df1['Coin'] == df2['code_name'])
Neither of these gave me the expected output. I searched on Stack Overflow and couldn't find a helpful answer. Can anyone give a solution, or mark this question as a duplicate if a related one exists?
What you need is an inner join, not an outer join. An inner join only retains records that are common to the two tables you're joining.
import pandas as pd
# Make the first data frame
df1 = pd.DataFrame({
'Coin': ['BTC', 'ETH', 'ADA'],
'Quantity': [2, 1, 100]
})
# Make the second data frame
df2 = pd.DataFrame({
'code_name': ['BTC', 'FTM', 'ETH', 'LRC', 'ADA'],
'selling_rate': [50000, 50, 1500, 5, 20]
})
# Merge the data frames via inner join. This only keeps entries that appear in
# both data frames
full_df = df1.merge(df2, how='inner', left_on='Coin', right_on='code_name')
# Drop the duplicate key column
full_df = full_df.drop('code_name', axis=1)
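With the sample data above, printing full_df should give something like:

  Coin  Quantity  selling_rate
0  BTC         2         50000
1  ETH         1          1500
2  ADA       100            20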
Since merge() can be slow for large datasets, I prefer not to use it when a faster solution is available. Therefore, I suggest the following:
full_df = df1.copy()
full_df['selling_rate'] = list(
df2['selling_rate'][df2['code_name'].isin(df1['Coin'].unique())])
Note: this only gives the expected result if df1 and df2 are in the same order with respect to Coin and code_name. If they are not, you should apply sort_values() before running the code above.
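If the ordering is hard to guarantee, a minimal order-independent sketch (not part of the original answer) is to build a lookup Series from df2 and map it onto the Coin column:

# Hypothetical variant: look up selling_rate by Coin, independent of row order
rate_by_coin = df2.set_index('code_name')['selling_rate']
full_df = df1.copy()
full_df['selling_rate'] = full_df['Coin'].map(rate_by_coin)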
Related
I have two dataframes that are imported CSV files
df1:
1, 204c, 204s
2, 205c, 205s
3, ..., ...
df2:
204c, 1000
205c, 3000
..., ...
..., ...
204s, 4000
205s, 5000
I would like to combine df2 into df1 based on the 'c' and 's' values at the end, so it looks something like this:
df3:
204c, 1000, 204s, 4000
205c, 3000, 205s, 5000
I believe it has to do with pandas.concat(), .merge() or .join(); however, I am a bit stuck on which one to use.
I have tried df3 = df1.merge(df2, how='cross'), which produced every combination of rows and was incorrect, and I have tried df3 = pd.concat([df1, df2], axis=1), which was closer, but it did not take the 's' values into consideration and put them into NaN categories.
Here is a generic approach that can handle any number of columns to be mapped:
import io
import pandas as pd

f1 = '''1, 204c, 204s
2, 205c, 205s'''
f2 = '''204c, 1000
205c, 3000
204s, 4000
205s, 5000
'''
df1 = pd.read_csv(io.StringIO(f1), sep=r',\s*', engine='python', header=None, index_col=0)
df2 = pd.read_csv(io.StringIO(f2), sep=r',\s*', engine='python', header=None, names=['val', 'val2'])
tmp = df1.stack().to_frame(name='val')
out = (tmp.merge(df2, how='left')
.set_axis(tmp.index)
.unstack()
.sort_index(level=1, axis=1, kind='stable', sort_remaining=False)
)
print(out)
Output:
val val2 val val2
1 1 2 2
0
1 204c 1000 204s 4000
2 205c 3000 205s 5000
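The idea is that stack() reshapes df1 into one long column of codes, the left merge looks each code up in df2, and unstack() restores the wide layout with the mapped values alongside. As a rough alternative sketch (assuming the same df1/df2 as read above), the same lookup can also be done column by column with map():

# Rough sketch, assuming the df1/df2 read above: map each code column through a lookup Series
lookup = df2.set_index('val')['val2']
out2 = df1.copy()
for col in df1.columns:
    out2[f'{col}_val2'] = df1[col].map(lookup)
print(out2)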
I want to left join df2 on df1 and keep the row that matches by group; if there is no matching row for a group, I would like to keep the first row of that group, in order to achieve df3 (the desired result). I was hoping you could help me find the optimal solution.
Here is my code to create the two dataframes and the required result.
import pandas as pd
import numpy as np
market = ['SP', 'SP', 'SP']
underlying = ['TSLA', 'GOOG', 'MSFT']
# DF1
df = pd.DataFrame(list(zip(market, underlying)),
columns=['market', 'underlying'])
market2 = ['SP', 'SP', 'SP', 'SP', 'SP']
underlying2 = [None, 'TSLA', 'GBX', 'GBM', 'GBS']
client2 = [17, 12, 100, 21, 10]
# DF2
df2 = pd.DataFrame(list(zip(market2, underlying2, client2)),
columns=['market', 'underlying', 'client'])
market3 = ['SP', 'SP', 'SP']
underlying3 = ['TSLA', 'GOOG', 'MSFT']
client3 = [12, 17, 17]
# Desired
df3 = pd.DataFrame(list(zip(market3, underlying3, client3)),
columns =['market', 'underlying', 'client'])
# This works but feels suboptimal
df3 = pd.merge(df,
df2,
how='left',
on=['market', 'underlying'])
df3 = pd.merge(df3,
df2,
how='left',
on=['market'])
df3 = df3.drop_duplicates(['market', 'underlying_x'])
df3['client'] = df3['client_x'].combine_first(df3['client_y'])
df3 = df3.drop(labels=['underlying_y', 'client_x', 'client_y'], axis=1)
df3 = df3.rename(columns={'underlying_x': 'underlying'})
Hope you can help, thank you so much!
Store the first value per market (a groupby might not be necessary if every entry in market is 'SP'), then merge and fill the missing clients with that first value:
fill_value = df2.groupby('market').client.first()
# if you are interested in filtering for None:
fill_value = df2.set_index('market').loc[lambda df: df.underlying.isna(), 'client']
(df
.merge(
df2,
on = ['market', 'underlying'],
how = 'left')
.set_index('market')
.fillna({'client':fill_value}, downcast='infer')
)
underlying client
market
SP TSLA 12
SP GOOG 17
SP MSFT 17
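Note that the fillna step relies on index alignment: the merged frame is indexed by market and fill_value is a Series indexed by market, so missing client values are filled with the corresponding market's first (or None-underlying) client.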
I have two dataframes.
DF1
DF2
I want to add a column to DF1, 'Speed', that references the track category, and the LocationFrom and LocationTo range, to result in the below.
I have looked at merge_asof and IntervalIndex, but I'm unable to figure out how to reference the category before the range.
Thanks.
Check the below code, using SQLite:
import pandas as pd
import sqlite3
conn = sqlite3.connect(':memory:')
DF1.to_sql('DF1', con = conn, index = False)
DF2.to_sql('DF2', con = conn, index = False)
pd.read_sql("""SELECT DF1.*, DF2.Speed
FROM DF1
JOIN DF2 ON DF1.Track = DF2.Track
AND DF1.Location BETWEEN DF2.LocationFrom AND DF2.LocationTo""", con=conn)
Output:
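(Assuming the sample df1/df2 constructed in a later answer in this thread, the query should return something like:)

  Track  Location  Speed
0     A         1     45
1     A         2     45
2     A         6     50
3     B        24    100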
As hinted in your question, this is a perfect use case for merge_asof:
pd.merge_asof(df1, df2, by='Track',
left_on='Location', right_on='LocationTo',
direction='forward'
)#.drop(columns=['LocationFrom', 'LocationTo'])
output:
Track Location LocationFrom LocationTo Speed
0 A 1 0 5 45
1 A 2 0 5 45
2 A 6 5 10 50
3 B 24 20 50 100
NB. uncomment the drop to remove the extra columns.
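One thing to keep in mind: merge_asof requires both frames to be sorted on the merge keys (Location on the left, LocationTo on the right), which the sample data here already is.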
It works, but I would like to see someone do this without a for loop and without creating mini dataframes.
import pandas as pd
data1 = {'Track': list('AAAB'), 'Location': [1, 2, 6, 24]}
df1 = pd.DataFrame(data1)
data2 = {'Track': list('AABB'), 'LocationFrom': [0, 5, 0, 20], 'LocationTo': [5, 10, 20, 50], 'Speed': [45, 50, 80, 100]}
df2 = pd.DataFrame(data2)
speeds = []
for k in range(len(df1)):
    track = df1['Track'].iloc[k]
    location = df1['Location'].iloc[k]
    df1_track = df1.loc[df1['Track'] == track]
    df2_track = df2.loc[df2['Track'] == track]
    speeds.append(df2_track['Speed'].loc[(df2_track['LocationFrom'] <= location) & (location < df2_track['LocationTo'])].iloc[0])
df1['Speed'] = speeds
print(df1)
Output:
Track Location Speed
0 A 1 45
1 A 2 45
2 A 6 50
3 B 24 100
This approach is probably not viable if your tables are large. It creates an intermediate table that merges all pairs of matching Tracks between df1 and df2, then removes rows where the location is not between the boundaries. Thanks @Aeronatix for the dfs.
The all_merge intermediate table gets really big really fast: if a1 rows of df1 are Track A, a2 rows of df2 are Track A, etc., then the total number of rows in all_merge will be a1*a2 + b1*b2 + ... + z1*z2, which may or may not be gigantic depending on your dataset.
all_merge = df1.merge(df2)
results = all_merge[all_merge.Location.between(all_merge.LocationFrom,all_merge.LocationTo)]
print(results)
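With the sample dfs above, results should print something like:

  Track  Location  LocationFrom  LocationTo  Speed
0     A         1             0           5     45
2     A         2             0           5     45
5     A         6             5          10     50
7     B        24            20          50    100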
I have two dataframes and wanted to check if they contain the same data or not.
df1:
df1 = [['tom', 10],['nick',15], ['juli',14]]
df1 = pd.DataFrame(df1, columns = ['Name', 'Age'])
df2:
df2 = [['nick', 15],['tom', 10], ['juli',14]]
df2 = pd.DataFrame(df2, columns = ['Name', 'Age'])
Note that the information in them is exactly the same; the only difference is the row order.
I've written code to compare the two dataframes, but it shows that the dataframes differ in the first two rows:
ne = (df1 != df2).any(axis=1)
ne_stacked = (df1 != df2).stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ['id', 'col']
difference_locations = np.where(df1 != df2)
changed_from = df1.values[difference_locations]
changed_to = df2.values[difference_locations]
divergences = pd.DataFrame({'df1': changed_from, 'df2': changed_to}, index=changed.index)
print(divergences)
I am receiving the below result:
         df1   df2
id col
0  Name   tom  nick
   Age     10    15
1  Name  nick   tom
   Age     15    10
I was expecting to receive:
Empty DataFrame
Columns: [df1, df2]
Index: []
How do I change the code so that it tests each row of the dataframes and checks whether they match?
And what if I were comparing two data frames with a different number of rows?
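A minimal sketch of one way to do an order-insensitive comparison (an assumption about the intent, not code from the post): sort both frames by their columns and reset the index before comparing; for frames of different lengths, an outer merge with an indicator shows the rows present in only one frame.

# Order-insensitive equality check on the df1/df2 defined above
a = df1.sort_values(list(df1.columns)).reset_index(drop=True)
b = df2.sort_values(list(df2.columns)).reset_index(drop=True)
print(a.equals(b))   # True when both frames contain exactly the same rows

# For frames with different numbers of rows: rows appearing in only one frame
only_in_one = df1.merge(df2, how='outer', indicator=True).query("_merge != 'both'")
print(only_in_one)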
I have two data frames, say df1 and df2, each with two columns ['Name', 'Marks'].
I want to find the difference between the two dfs for corresponding Name values.
Eg:
df1 = pd.DataFrame([["Shivi",70],["Alex",40]],columns=['Names', 'Value'])
df2 = pd.DataFrame([["Shivi",40],["Andrew",40]],columns=['Names', 'Value'])
For df1-df2 I want
pd.DataFrame([["Shivi",30],["Alex",40],["Andrew",40]],columns=['Names', 'Value'])
You can use:
diff = df1.set_index("Name").subtract(df2.set_index("Name"), fill_value=0)
So a complete program will look like this:
import pandas as pd
data1 = {'Name': ["Ashley", "Tom"], 'Marks': [40, 50]}
data2 = {'Name': ["Ashley", "Stan"], 'Marks': [80, 90]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
diff = df1.set_index("Name").subtract(df2.set_index("Name"), fill_value=0)
print(diff)
Output:
Marks
Name
Ashley -40.0
Stan -90.0
Tom 50.0
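A short note on the design choice: subtract(..., fill_value=0) treats a Name missing from one frame as 0, so names present in only one of the frames (Stan and Tom here) are kept in the result with a signed difference instead of becoming NaN.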