I have two multi-index dataframes which should have the same index names (barring any format issues) that I want to join together.
DF1 =
DF2 =
I want to join DF2 to DF1 only where NAME and code match. On this dataframe, for example, it would pull in a qty of 69.0 for 10''.
I've tried different variations of join, concat and multiindex.join but can't seem to figure it out. Any help much appreciated!
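One way this could look, as a minimal sketch: when both frames share the same index level names, DataFrame.join aligns on those levels. The data below is hypothetical (only the NAME and code level names, the qty column and the 69.0 value for 10'' come from the question; the price column is made up).

```python
import pandas as pd

# Hypothetical data: only the index names NAME/code and the qty value come from the question.
idx1 = pd.MultiIndex.from_tuples(
    [("A", "10''"), ("A", "12''"), ("B", "10''")], names=["NAME", "code"]
)
df1 = pd.DataFrame({"price": [1.5, 2.0, 3.25]}, index=idx1)  # price is an assumed column

idx2 = pd.MultiIndex.from_tuples(
    [("A", "10''"), ("C", "10''")], names=["NAME", "code"]
)
df2 = pd.DataFrame({"qty": [69.0, 5.0]}, index=idx2)

# Left join on the shared index levels: every row of df1 is kept,
# and qty is filled only where both NAME and code match.
joined = df1.join(df2, how="left")
print(joined)
```

If the join returns only NaN for qty, it may be worth checking that the index level names and dtypes really are identical in both frames (for example, stray whitespace in the code values).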
So I've been looking at different ways to compare two PySpark dataframes where we have no key columns.
Let's say I have two dataframes, df1 & df2, with columns col1, col2, col3.
The idea is that I would get an output dataframe containing rows from df1 that do not match with any rows in df2 and vice versa. I would also like some kind of flag so I can distinguish between rows from df1 and rows from df2.
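So far I have looked at a full outer join as a method, such as:
from pyspark.sql.functions import col, lit, when

columns = df1.columns
# Marker columns record which side each row came from.
df1 = df1.withColumn("df1_flag", lit("X"))
df2 = df2.withColumn("df2_flag", lit("X"))
# Full outer join on every column, then derive a single FLAG column.
df3 = df1.join(df2, columns, how='full')\
    .withColumn("FLAG", when(col("df1_flag").isNotNull() & col("df2_flag").isNotNull(), "MATCHED")\
    .otherwise(when(col("df1_flag").isNotNull(), "df1").otherwise("df2"))).drop("df1_flag", "df2_flag")
# Keep only the rows that appear in exactly one of the two dataframes.
df4 = df3.filter(col("FLAG") != "MATCHED")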
The issue with the full outer join is that I may need to deal with some very large dataframes (1 million+ records), so I am concerned about efficiency.
I have thought about using a left anti join and a right anti join and then combining the results, but I have efficiency worries about that as well.
Is there any method of comparison I am overlooking here that could be more efficient for very large dataframes?
You can run a minus query on your dataframes:
mismatched_df1 = df1.exceptAll(df2)
mismatched_df2 = df2.exceptAll(df1)
I want to join two dataframes, both of which have date columns (df1[date1], df2[date2]). The joined dataframe should satisfy the condition df2[date2] > df1[date1]. The second dataframe does not have any duplicates but the first one does, so a plain join does not work as expected.
I know for certain that for every date in df2 there is a date in df1 which satisfies this condition. But I cannot figure out how to join them properly. I have tried doing this:
joined = df1.join(df2, how='inner')
joined = joined.query('date2 > date1')
But since df1 has entries with duplicate ids, the way the rows align after the join leaves a bunch of rows that do not satisfy the condition, so I am left with a smaller dataframe.
How can I accomplish this?
Based on your clarification, I suggest the following solution:
1) Concatenate (not join) the 2 dataframes.
df12 = pd.concat([df1, df2], axis=1)
I assume that the indices match. If not, reindex on id or join on id.
2) Filter the rows that match the criteria.
df12 = df12[df12['date2'] > df12['date1']]
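For the "join on id" fallback mentioned above, a rough sketch might look like the following. It assumes both frames carry an id column (the question only mentions duplicate ids in df1, so the column name and the sample data are hypothetical). Merging on id pairs each df1 row with its df2 date, and the filter then keeps only the pairs where date2 is later than date1.

```python
import pandas as pd

# Hypothetical data: the `id` column name is an assumption, not taken from the question.
df1 = pd.DataFrame({
    "id": [1, 1, 2],  # df1 contains duplicate ids
    "date1": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-01-15"]),
})
df2 = pd.DataFrame({
    "id": [1, 2],     # df2 has one row per id
    "date2": pd.to_datetime(["2020-02-01", "2020-02-01"]),
})

# Pair every df1 row with the df2 row that has the same id.
paired = df1.merge(df2, on="id", how="inner")

# Keep only the pairs that satisfy date2 > date1.
result = paired[paired["date2"] > paired["date1"]].reset_index(drop=True)
print(result)
```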
I have two dataframes, one named "df" and the other "topwud".
df
topwud
When I join these two dataframes with an inner join, using BOMCPNO and PRTNO as the join columns,
like this:
second_level=pd.merge(df,top_wud ,left_on='BOMCPNO', right_on='PRTNO', how='inner').drop_duplicates()
Then I got this data frame
Result
I don't want the common column to come through as PRTNO_x and PRTNO_y. I want to keep only PRTNO_x in my result dataframe, under its default name "PRTNO".
Kindly help me :)
Try this:
pd.merge(df, top_wud, on=['BOMCPNO', 'PRTNO'])
Note that this returns only the rows where BOMCPNO and PRTNO exist in both dataframes, because the default merge type is an inner merge.
You could compare the size of this merged dataframe with your first one. If they are the same, you can merge on both columns, or simply drop or rename the columns carrying the _x/_y suffixes.
I would spend time, though, determining whether these values are indeed the same and exist in both dataframes, in which case you may wish to perform an outer merge:
pd.merge(df1, df2, on=['A', 'B'], how='outer')
You could then drop duplicate rows (and possibly any NaN rows), which should give you a clean merged dataframe.
merged_df.drop_duplicates(subset=['BOMCPNO', 'PRTNO'], inplace=True)
Also try other types of join, as I don't know exactly what you want; I think it's a left or inner join.
Check whether this solves your problem.
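An alternative that keeps the original left_on/right_on merge is to give the left frame an empty suffix, so its PRTNO column keeps its name, and then drop the suffixed right-hand copy. This is only a sketch: the sample values and the WUD column are hypothetical, since the original tables are not shown.

```python
import pandas as pd

# Hypothetical data: the values and the WUD column are made up for illustration.
df = pd.DataFrame({"PRTNO": ["P1", "P2"], "BOMCPNO": ["C1", "C2"]})
top_wud = pd.DataFrame({"PRTNO": ["C1", "C3"], "WUD": [10, 20]})

second_level = pd.merge(
    df,
    top_wud,
    left_on="BOMCPNO",
    right_on="PRTNO",
    how="inner",
    suffixes=("", "_y"),  # left columns keep their names; right duplicates get _y
)

# Drop the right-hand copy of the key, leaving a single PRTNO column.
second_level = second_level.drop(columns=["PRTNO_y"]).drop_duplicates()
print(second_level)
```

Since PRTNO_y equals BOMCPNO after this inner merge, dropping it should lose no information.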
I have two dataframes, df1 and df2 say, which are both multi-indexed.
At the first index level, both dataframes share the same keys (i.e. df1.index.get_level_values(0) and df2.index.get_level_values(0) contain the same elements). Those keys are unordered strings, such as ['foo','bar','baz'].
At the second index level, both dataframes have timestamps which are ordered, but unequally spaced.
My question is as follows. I would like to merge df1 and df2 in such a way that, for each key at the first index level, the values of df2 are inserted into df1 without changing the order of df1.
I tried using pd.merge, pd.merge_asof and pd.MultiIndex.searchsorted. From the descriptions of those methods, it seems like one of them should do the trick for me, but I cannot figure out how. Ideally, I would like to find a solution that avoids looping over the keys in index.get_level_values(0), since my dataframes can get large.
A few failed attempts for illustration:
df_merged = pd.merge(left=df1.reset_index(), right=df2.reset_index(),
left_on=[['some_keys', 'timestamps_df1']], right_on=[['some_keys', 'timestamps_df2']],
suffixes=('', '_2')
) # after sorting
# FAILED
df2.index.searchsorted(df1, side='right') # after sorting
# FAILED
Any help is greatly appreciated!
Based on your description, here is a solution using merge_asof:
df_merged = pd.merge_asof(left=df1.reset_index(), right=df2.reset_index(),
left_on=['timestamps_df1'], right_on=['timestamps_df2'],by='some_keys',
suffixes=('', '_2')
)
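To make the behaviour concrete, here is a small self-contained sketch of the same call on flat frames (i.e. after reset_index). The column names mirror the snippet above, while the values and the a/b columns are hypothetical. merge_asof expects both frames to be sorted by their time key, and by='some_keys' restricts matches to rows with the same key.

```python
import pandas as pd

# Hypothetical data; column names follow the answer, values are made up.
df1 = pd.DataFrame({
    "some_keys": ["foo", "foo", "bar"],
    "timestamps_df1": pd.to_datetime(
        ["2021-01-01 00:00:05", "2021-01-01 00:00:10", "2021-01-01 00:00:07"]
    ),
    "a": [1, 2, 3],
})
df2 = pd.DataFrame({
    "some_keys": ["foo", "bar", "foo"],
    "timestamps_df2": pd.to_datetime(
        ["2021-01-01 00:00:04", "2021-01-01 00:00:06", "2021-01-01 00:00:09"]
    ),
    "b": [10, 20, 30],
})

# merge_asof requires both sides to be sorted by their time key.
left = df1.sort_values("timestamps_df1")
right = df2.sort_values("timestamps_df2")

# For each left row, take the latest right row with the same key
# whose timestamp is not after the left timestamp (direction="backward" is the default).
merged = pd.merge_asof(
    left,
    right,
    left_on="timestamps_df1",
    right_on="timestamps_df2",
    by="some_keys",
)
print(merged)
```

The result follows the row order of the sorted left frame, so restoring the original MultiIndex afterwards would need a set_index on some_keys and timestamps_df1.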
I have two pandas dataframes both holding irregular timeseries data.
I want to merge/join the two frames by time.
I also want to forward fill the other columns of frame2 for any "new" rows that were added through the joining process. How can I do this?
I have tried:
df = pd.merge(df1, df2, on="DateTime")
but this just leaves a frame containing only the rows with matching timestamps.
I would be grateful for any ideas!
Try this. The how='left' makes the merge keep all records of df1, and the forward fill populates missing values.
df = pd.merge(df1, df2, on='DateTime', how='left').ffill()
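If the goal is to keep the timestamps from both frames (the "new" rows the question mentions), an outer merge followed by a sort and a forward fill may be closer to what is wanted. The sketch below uses hypothetical data; only the DateTime column name comes from the question, and x and y are assumed column names.

```python
import pandas as pd

# Hypothetical irregular time series; x and y are assumed column names.
df1 = pd.DataFrame({
    "DateTime": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:10"]),
    "x": [1.0, 2.0],
})
df2 = pd.DataFrame({
    "DateTime": pd.to_datetime(["2021-01-01 00:05", "2021-01-01 00:10"]),
    "y": [10.0, 20.0],
})

df = (
    pd.merge(df1, df2, on="DateTime", how="outer")  # keep timestamps from both frames
    .sort_values("DateTime")                        # restore chronological order
    .reset_index(drop=True)
    .ffill()                                        # carry the last known values forward
)
print(df)
```

Values before the first observation of a column stay NaN, since there is nothing to carry forward. To fill only the columns that came from one frame, the ffill could be applied to that subset of columns instead.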