Merging Two Dataframes stacks rows instead of merging into one - python

I'm attempting to merge two dataframes using two columns as keys: "Date" and "Instrument".
Here is my code:
merged_df = pd.merge(df1, df2, how='outer', left_on=['Date', 'Instrument'], right_on=['Date', 'Instrument'])
df1:
df2:
You'll notice that the row in each dataframe has the same instrument and date value: AEA000201011 & 2008-01-31.
The merged dataframe is stacking the two rows instead of combining them:
merged_df:
I have ensured that the dtypes of the key columns match in both dataframes:
df1:
df2:
Any advice would be much appreciated!

I would have posted this as a comment if I could.
You have probably tried this already, but have you tried how="left" or how="right" instead of "outer"?
You could also compare the key values directly, for example:
df1["Instrument"].iloc[0] == df2["Instrument"].iloc[0]
If that returns False, the strings may contain invisible characters such as leading or trailing whitespace. In that case, try cleaning the key columns with .str.strip() before merging.
Nothing else comes to mind.
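The check suggested above can be sketched as follows. This is only an illustration under an assumption: the trailing space in df1's key is invented here to reproduce the stacking, and the "x" and "y" columns are hypothetical. The question does not confirm that whitespace is the actual cause.

```python
import pandas as pd

# Hypothetical data: the trailing space in df1's Instrument is an assumption
# made to reproduce the stacked rows, not something shown in the question.
df1 = pd.DataFrame({"Date": pd.to_datetime(["2008-01-31"]),
                    "Instrument": ["AEA000201011 "],
                    "x": [1.0]})
df2 = pd.DataFrame({"Date": pd.to_datetime(["2008-01-31"]),
                    "Instrument": ["AEA000201011"],
                    "y": [2.0]})

# indicator=True adds a _merge column showing which side each row came from.
stacked = pd.merge(df1, df2, how="outer", on=["Date", "Instrument"], indicator=True)
print(stacked["_merge"].tolist())  # rows that failed to match are left_only / right_only

# Strip whitespace from the string key on both sides, then merge again.
df1["Instrument"] = df1["Instrument"].str.strip()
df2["Instrument"] = df2["Instrument"].str.strip()
merged = pd.merge(df1, df2, how="outer", on=["Date", "Instrument"], indicator=True)
print(merged)
```

If the _merge column shows left_only and right_only for rows that look identical, the keys differ in some way that is not visible when printed.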

Related

Pandas Dataframes: Combining Columns from Two Global Datasets when the rows hold different Countries

My problem is that these two CSV files list the countries in different rows, so I can't just append the column in question to the other data frame.
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
I'm trying to think of some way to use a for loop that checks every row and adds the recovered cases to the row where the country name is the same in both data frames, but I don't know how to put that idea into code. Help?
You can do this a couple of ways:
Option 1: use pd.concat with set_index
pd.concat([df_confirmed.set_index(['Province/State', 'Country/Region']),
           df_recovered.set_index(['Province/State', 'Country/Region'])],
          axis=1, keys=['Confirmed', 'Recovered'])
Option 2: use pd.DataFrame.merge with a left join or an outer join, chosen with the how parameter
df_confirmed.merge(df_recovered, on=['Province/State', 'Country/Region'], how='left',
                   suffixes=('_confirmed', '_recovered'))
Using pd.read_csv from github raw format:
df_recovered = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
df_confirmed = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
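To see both options without downloading the files, here is a minimal sketch on two tiny frames. The rows and the single date column are made up for illustration and only mimic the layout of the real CSV files; the row order deliberately differs between the two frames.

```python
import pandas as pd

keys = ["Province/State", "Country/Region"]

# Hypothetical miniature versions of the two files, with rows in different order.
df_confirmed = pd.DataFrame({"Province/State": ["Hubei", "Ontario"],
                             "Country/Region": ["China", "Canada"],
                             "1/22/20": [444, 0]})
df_recovered = pd.DataFrame({"Province/State": ["Ontario", "Hubei"],
                             "Country/Region": ["Canada", "China"],
                             "1/22/20": [0, 28]})

# Option 1: align on the index; keys= adds a top column level per source frame.
combined = pd.concat([df_confirmed.set_index(keys),
                      df_recovered.set_index(keys)],
                     axis=1, keys=["Confirmed", "Recovered"])

# Option 2: merge on the key columns; suffixes disambiguate the shared date column.
merged = df_confirmed.merge(df_recovered, on=keys, how="left",
                            suffixes=("_confirmed", "_recovered"))
print(combined)
print(merged)
```

Both approaches match rows by key rather than by position, so no explicit for loop is needed.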

Find where three separate DataFrames overlap and create a new DataFrame

I have three separate DataFrames. Each DataFrame has the same columns - ['Email', 'Rating']. There are duplicate row values in all three DataFrames for the column Email. I'm trying to find the emails that appear in all three DataFrames and then create a new DataFrame from those rows. So far I have saved all three DataFrames to a list like this: dfs = [df1, df2, df3], and then concatenated them using df = pd.concat(dfs). I tried using groupby from here, but to no avail. Any help would be greatly appreciated.
You want to do a merge. Similar to a join in SQL, you can do an inner merge and treat the email like a foreign key. Here are the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
It would look something like this:
in_common = pd.merge(df1, df2, on=['Email'], how='inner')
You could try using .isin from pandas, e.g.:
df[df['Email'].isin(df2['Email'])]
This retrieves the rows of df whose Email value also appears in df2.
Another idea is to try an inner merge.
Good luck, and post code next time.
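For three DataFrames, one way to extend the .isin idea is to compute the set of emails common to all of them first and then filter the concatenated frame. This is a sketch on made-up data; the email addresses and ratings are hypothetical.

```python
import pandas as pd

# Hypothetical data with the columns described in the question.
df1 = pd.DataFrame({"Email": ["a@x.com", "b@x.com", "a@x.com"], "Rating": [1, 2, 3]})
df2 = pd.DataFrame({"Email": ["a@x.com", "c@x.com"], "Rating": [4, 5]})
df3 = pd.DataFrame({"Email": ["a@x.com", "b@x.com"], "Rating": [6, 7]})
dfs = [df1, df2, df3]

# Emails present in every DataFrame.
common = set.intersection(*(set(d["Email"]) for d in dfs))

# Keep every row, from any of the three frames, whose email is in that set.
df = pd.concat(dfs, ignore_index=True)
in_all = df[df["Email"].isin(common)]
print(in_all)
```

Unlike a chain of inner merges, this keeps each original row once, even when an email is duplicated within a frame.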

How to inner join in pandas as SQL , Stuck in a problem below

I have two DataFrames, named "df" and "top_wud".
df
top_wud
I join these two DataFrames with an inner join, using BOMCPNO and PRTNO as the join columns, like this:
second_level = pd.merge(df, top_wud, left_on='BOMCPNO', right_on='PRTNO', how='inner').drop_duplicates()
Then I get this DataFrame:
Result
I don't want the shared column name to come out as PRTNO_x and PRTNO_y. I want to keep only PRTNO_x in my result DataFrame, under its default name "PRTNO".
Kindly help me :)
Try this:
pd.merge(df, top_wud, on=['BOMCPNO', 'PRTNO'])
Note that this returns only the rows where the BOMCPNO and PRTNO values exist in both dataframes, because the default merge type is an inner merge.
So you could compare the size of this merged dataframe with your first one. If they are the same, you could either merge on both columns or just drop/rename the _x/_y suffixed columns.
I would spend time, though, determining whether these values are indeed the same and exist in both dataframes, in which case you may wish to perform an outer merge:
pd.merge(df1, df2, on=['A', 'B'], how='outer')
You can then drop duplicate rows (and possibly any NaN rows), which should give you a clean merged dataframe:
merged_df.drop_duplicates(subset=['BOMCPNO', 'PRTNO'], inplace=True)
Also try the other join types; I don't know exactly what you want, but I think it is a left or inner join.
Let me know whether this solves your problem.
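If the goal is only to keep the left frame's PRTNO under its original name, one possible sketch is to give the left side an empty suffix and drop the right-hand copy afterwards. The data, the WUD column, and the "_top" suffix below are hypothetical; only the BOMCPNO and PRTNO names come from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames in the question.
df = pd.DataFrame({"PRTNO": ["A", "A", "B"],
                   "BOMCPNO": ["X", "Y", "Z"]})
top_wud = pd.DataFrame({"PRTNO": ["X", "Z"],
                        "WUD": [1, 2]})

# An empty left suffix keeps the left column named PRTNO;
# the right-hand copy becomes PRTNO_top and is dropped.
second_level = (
    pd.merge(df, top_wud, left_on="BOMCPNO", right_on="PRTNO",
             how="inner", suffixes=("", "_top"))
    .drop(columns="PRTNO_top")
    .drop_duplicates()
)
print(second_level)
```

After an inner join on BOMCPNO = PRTNO, the dropped column carries the same values as BOMCPNO, so no information is lost.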

Pandas: merge_asof-like solutions for merging two multi-indexed DataFrames?

I have two dataframes, df1 and df2 say, which are both multi-indexed.
At the first index level, both dataframes share the same keys (i.e. df1.index.get_level_values(0) and df2.index.get_level_values(0) contain the same elements). Those keys are unordered strings, such as ['foo','bar','baz'].
At the second index level, both dataframes have timestamps which are ordered, but unequally spaced.
My question is as follows. I would like to merge df1 and df2 in such a way that, for each key at level 1, the values of df2 are inserted into df1 without changing the order of df1.
I tried using pd.merge, pd.merge_asof and pd.MultiIndex.searchsorted. From the descriptions of those methods, it seems like one of them should do the trick for me, but I cannot figure out how. Ideally, I would like to find a solution that avoids looping over the keys in index.get_level_values(0), since my dataframes can get large.
A few failed attempts for illustration:
df_merged = pd.merge(left=df1.reset_index(), right=df2.reset_index(),
left_on=[['some_keys', 'timestamps_df1']], right_on=[['some_keys', 'timestamps_df2']],
suffixes=('', '_2')
) # after sorting
# FAILED
df2.index.searchsorted(df1, side='right') # after sorting
# FAILED
Any help is greatly appreciated!
Based on your description, here is a solution using merge_asof. Note that merge_asof requires both frames to be sorted by their timestamp columns:
df_merged = pd.merge_asof(left=df1.reset_index(), right=df2.reset_index(),
                          left_on='timestamps_df1', right_on='timestamps_df2', by='some_keys',
                          suffixes=('', '_2')
                          )
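Here is a runnable sketch of the merge_asof approach on made-up data. The value columns "a" and "b" and the helper column "pos" are hypothetical; "pos" is one possible way to restore df1's original row order after the sort that merge_asof requires. The default direction is backward, so each df1 row receives the latest df2 row with the same key whose timestamp is not later.

```python
import pandas as pd

# Hypothetical multi-indexed frames: level 0 is a string key, level 1 a timestamp.
df1 = pd.DataFrame({
    "some_keys": ["foo", "foo", "bar"],
    "timestamps_df1": pd.to_datetime(["2020-01-01 10:00", "2020-01-01 12:00",
                                      "2020-01-01 11:00"]),
    "a": [1, 2, 3],
}).set_index(["some_keys", "timestamps_df1"])
df2 = pd.DataFrame({
    "some_keys": ["foo", "foo", "bar"],
    "timestamps_df2": pd.to_datetime(["2020-01-01 09:30", "2020-01-01 11:30",
                                      "2020-01-01 11:30"]),
    "b": [10, 20, 30],
}).set_index(["some_keys", "timestamps_df2"])

left = df1.reset_index()
left["pos"] = range(len(left))             # remember df1's original order
left = left.sort_values("timestamps_df1")  # merge_asof needs sorted timestamps
right = df2.reset_index().sort_values("timestamps_df2")

df_merged = pd.merge_asof(left, right,
                          left_on="timestamps_df1", right_on="timestamps_df2",
                          by="some_keys")

# Put the rows back in df1's original order and discard the helper column.
df_merged = df_merged.sort_values("pos").drop(columns="pos").reset_index(drop=True)
print(df_merged)
```

The "bar" row gets NaN in "b" because its only df2 timestamp (11:30) is later than the df1 timestamp (11:00); passing direction="nearest" or "forward" would change that.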

Python Dataframes not merging on index

I'm trying to merge 2 dataframes, but for some reason it's throwing KeyError: Player_Id
I'm trying to merge on Striker_Id and Player_Id
This is what my DataFrames look like:
Merge Code:
player_runs.merge(matches_played_by_players,left_on='Striker_Id',right_on='Player_Id',how='left')
What am I doing wrong?
Hmm, from looking at your problem, it seems like you're trying to merge on the indexes, but you treat them as columns? Try changing your merge code a bit -
player_runs.merge(matches_played_by_players,
                  left_index=True,
                  right_index=True,
                  how='left')
Furthermore, make sure that both indexes are of the same type (in this case, str or int?)
player_runs.index = player_runs.index.astype(int)
And,
matches_played_by_players.index = matches_played_by_players.index.astype(int)
You're basically merging on nonexistent columns. This is because reset_index returns a new DataFrame rather than changing the DataFrame it is applied to. Setting the parameter inplace=True when using reset_index should resolve this issue. Alternatively, merge on the index of each DataFrame, i.e.:
pd.merge(df1,df2,left_index=True,right_index=True,how='left')
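The point about reset_index can be illustrated with a small sketch. The data and the Runs and Matches columns are hypothetical; only the Striker_Id and Player_Id names come from the question, and it is an assumption here that both are index names rather than columns.

```python
import pandas as pd

# Hypothetical frames whose ids live in the index, not in a column.
player_runs = pd.DataFrame({"Runs": [100, 50, 20]},
                           index=pd.Index([1, 2, 3], name="Striker_Id"))
matches_played_by_players = pd.DataFrame({"Matches": [10, 5]},
                                         index=pd.Index([1, 2], name="Player_Id"))

# Calling reset_index without assigning the result leaves the original unchanged.
player_runs.reset_index()
print("Striker_Id" in player_runs.columns)

# Assign the result so the ids become real columns, then merge on them.
left = player_runs.reset_index()
right = matches_played_by_players.reset_index()
merged = left.merge(right, left_on="Striker_Id", right_on="Player_Id", how="left")
print(merged)
```

With how='left', the striker with no matching Player_Id is kept and gets NaN in the columns coming from the right frame.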
