Merging data frames, highlighting the problematic column - python

I'm trying to merge two data frames with the aim of finding the value that causes the merging error. Most of the columns are not common across both data frames.
The following highlights which rows have a NaN value; how can I then find which column caused the merging issue? Thanks
df3 = pd.merge(df1, df2, how='outer')
df4 = df3[df3.isnull().any(axis=1)]  # rows with at least one NaN
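For reference, a quick sketch (not part of the original snippet) that lists which columns are null in each flagged row:
# for each flagged row, collect the names of the columns that are null
null_cols = df4.isnull().apply(lambda row: row.index[row].tolist(), axis=1)
print(null_cols)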

It is difficult to tell from the question, but the call shown amounts to pd.merge(df1, df2, on=None, how='outer'), and per the pandas documentation:
If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
This means that the columns shared by both DataFrames must have matching dtypes. If not, the merge raises an error indicating a type issue:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
Presupposing a dtype conflict is interfering with the outer join, examine the differences between the dtypes of the intersecting columns:
dtypes_diff = pd.concat([df1.dtypes, df2.dtypes]).reset_index().drop_duplicates(keep=False)  # (column, dtype) pairs that do not appear in both frames
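A side-by-side view can make the offending column easier to spot; a minimal sketch (dtype_cmp and mismatched are hypothetical names):
# align the two dtype Series by column name and keep only shared columns that disagree
dtype_cmp = pd.concat([df1.dtypes.rename('df1'), df2.dtypes.rename('df2')], axis=1)
mismatched = dtype_cmp[dtype_cmp['df1'] != dtype_cmp['df2']].dropna()
print(mismatched)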

Related

Merging Two Dataframes stacks rows instead of merging into one

I'm attempting to merge two dataframes using two columns as keys: "Date" and "Instrument"
Here is my code:
merge_df = pd.merge(df1 , df2, how='outer', left_on=['Date','Instrument'], right_on = ['Date','Instrument'])
The row in each dataframe (tables omitted here) has the same Instrument and Date values: AEA000201011 & 2008-01-31.
The merged dataframe stacks the two rows instead of combining them (merged_df output omitted).
I have ensured that the key columns' dtypes match in both dataframes (dtype listings omitted).
Any advice would be much appreciated!
Man, I wish I could use the comment section.
Even though you've probably already tried, have you tried using "left" or "right" instead of "outer"?
Or check the key values directly, like:
df1["Instrument"].iloc[0] == df2["Instrument"].iloc[0]
Maybe they have some invisible characters in them. If so, you can try using strip() functions, as in the sketch below.
Nothing else comes to mind.
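A minimal sketch of the strip-then-merge idea (assuming stray whitespace is the culprit and the keys are stored as strings; column names from the question):
# strip leading/trailing whitespace that can silently break key matching
for frame in (df1, df2):
    frame['Instrument'] = frame['Instrument'].str.strip()
    frame['Date'] = frame['Date'].str.strip()  # only if Date is held as strings
merge_df = pd.merge(df1, df2, how='outer', on=['Date', 'Instrument'])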

Are there any alternatives to a full outer join for comparing PySpark dataframes with no key columns?

So I've been looking at different ways to compare two PySpark dataframes where we have no key columns.
Let's say I have two dataframes, df1 & df2, with columns col1, col2, col3.
The idea is that I would get an output dataframe containing rows from df1 that do not match with any rows in df2 and vice versa. I would also like some kind of flag so I can distinguish between rows from df1 and rows from df2.
I have so far looked at a full outer join as a method, such as:
from pyspark.sql.functions import col, lit, when

columns = df1.columns
df1 = df1.withColumn("df1_flag", lit("X"))
df2 = df2.withColumn("df2_flag", lit("X"))
df3 = (df1.join(df2, columns, how='full')
       .withColumn("FLAG", when(col("df1_flag").isNotNull() & col("df2_flag").isNotNull(), "MATCHED")
                   .otherwise(when(col("df1_flag").isNotNull(), "df1").otherwise("df2")))
       .drop("df1_flag", "df2_flag"))
df4 = df3.filter(col("FLAG") != "MATCHED")
The issue with the full outer join is that I may need to deal with some very large dataframes (1 million+ records), and I am concerned about efficiency.
I have thought about using a left anti join and a right anti join and then combining them, but there are efficiency worries with that as well.
Is there any method of comparison I am overlooking that could be more efficient for very large dataframes?
You can run a minus query on your dataframes:
mismatched_df1 = df1.exceptAll(df2)
mismatched_df2 = df2.exceptAll(df1)
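To also get the df1/df2 flag the question asks for, one possible sketch (the SOURCE column and unionByName are assumptions, not part of the answer above):
from pyspark.sql.functions import lit

# exceptAll keeps duplicate rows, unlike the distinct EXCEPT/subtract
only_in_df1 = df1.exceptAll(df2).withColumn("SOURCE", lit("df1"))
only_in_df2 = df2.exceptAll(df1).withColumn("SOURCE", lit("df2"))
diff = only_in_df1.unionByName(only_in_df2)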

How to inner join in pandas as in SQL? Stuck on a problem below

I have two dataframes, named "df" and "topwud" (tables omitted).
When I inner join these two dataframes using BOMCPNO and PRTNO as the join columns, like:
second_level = pd.merge(df, topwud, left_on='BOMCPNO', right_on='PRTNO', how='inner').drop_duplicates()
I then get a dataframe (result omitted) in which the common column appears as PRTNO_x and PRTNO_y. I don't want that; I want to keep only PRTNO_x in my result dataframe, under its default name "PRTNO".
Kindly help me :)
Try this:
pd.merge(df1, top_wud, on=['BOMCPNO', 'PRTNO'])
What this will do, though, is return only the rows where BOMCPNO and PRTNO exist in both dataframes, as the default merge type is an inner merge.
So you could compare this merged df's size with your first one and see if they are the same; if so, you can merge on both columns, or else just drop/rename the _x/_y suffixed columns.
I would spend time, though, determining whether these values are indeed the same and exist in both dataframes, in which case you may wish to perform an outer merge:
pd.merge(df1, df2, on=['A', 'B'], how='outer')
Then you could drop duplicate rows (and possibly any NaN rows), and that should give you a clean merged dataframe.
merged_df.drop_duplicates(subset=['BOMCPNO', 'PRTNO'], inplace=True)  # the parameter is subset, not cols
Also try other types of join, as I don't know exactly what you want; I think it's a left inner join. Check whether this solves your problem; a drop/rename sketch follows below.
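A sketch of the drop/rename route mentioned above (column names from the question; the _x/_y suffixes assume PRTNO exists in both frames):
second_level = pd.merge(df, topwud, left_on='BOMCPNO', right_on='PRTNO',
                        how='inner').drop_duplicates()
# keep the left-hand key column and restore its default name
second_level = (second_level.drop(columns=['PRTNO_y'])
                            .rename(columns={'PRTNO_x': 'PRTNO'}))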

Pandas: merge_asof-like solutions for merging two multi-indexed DataFrames?

I have two dataframes, df1 and df2 say, which are both multi-indexed.
At the first index level, both dataframes share the same keys (i.e. df1.index.get_level_values(0) and df2.index.get_level_values(0) contain the same elements). Those keys are unordered strings, such as ['foo','bar','baz'].
At the second index level, both dataframes have timestamps which are ordered, but unequally spaced.
My question is as follows. I would like to merge df1 and df2 in such a way that, for each key at the first level, the values of df2 are inserted into df1 without changing the order of df1.
I tried using pd.merge, pd.merge_asof and pd.MultiIndex.searchsorted. From the descriptions of those methods, it seems like one of them should do the trick for me, but I cannot figure out how. Ideally, I would like to find a solution that avoids looping over the keys in index.get_level_values(0), since my dataframes can get large.
A few failed attempts for illustration:
df_merged = pd.merge(left=df1.reset_index(), right=df2.reset_index(),
                     left_on=[['some_keys', 'timestamps_df1']],
                     right_on=[['some_keys', 'timestamps_df2']],
                     suffixes=('', '_2'))  # after sorting
# FAILED
df2.index.searchsorted(df1, side='right') # after sorting
# FAILED
Any help is greatly appreciated!
Based on your description, here is a solution using merge_asof:
df_merged = pd.merge_asof(left=df1.reset_index(), right=df2.reset_index(),
                          left_on=['timestamps_df1'], right_on=['timestamps_df2'],
                          by='some_keys', suffixes=('', '_2'))
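Two caveats, shown in the sketch below (the level names are taken from the post; the sorting and index restoration are additions, not part of the answer): merge_asof requires both inputs to be sorted on the on keys, and it returns a flat frame, so the MultiIndex has to be rebuilt afterwards.
left = df1.reset_index().sort_values('timestamps_df1')
right = df2.reset_index().sort_values('timestamps_df2')
df_merged = pd.merge_asof(left, right,
                          left_on='timestamps_df1', right_on='timestamps_df2',
                          by='some_keys', suffixes=('', '_2'))
# restore df1's MultiIndex on the merged result
df_merged = df_merged.set_index(['some_keys', 'timestamps_df1'])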

Join/Merge two pandas dataframes and filling

I have two pandas dataframes both holding irregular timeseries data.
I want to merge/join the two frames by time.
I also want to forward fill the other columns of frame2 for any "new" rows that were added through the joining process. How can I do this?
I have tried:
df = pd.merge(df1, df2, on="DateTime")
but this just leaves a frame containing only the rows with matching timestamps.
I would be grateful for any ideas!
Try this. The how='left' will have the merge keep all records of df1, and the forward fill will populate the missing values.
df = pd.merge(df1, df2, on='DateTime', how='left').ffill()  # .ffill() is the current spelling of fillna(method='ffill')
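If the goal is to keep the timestamps of both frames, a merge_ordered sketch may fit better (assuming 'DateTime' is the key; fill_method forward-fills after an ordered outer merge):
df = pd.merge_ordered(df1, df2, on='DateTime', how='outer', fill_method='ffill')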
