Pandas join two dataframes with condition - python

I want to join two dataframes together; both have date columns (df1['date1'], df2['date2']). I want the joined dataframe to satisfy the condition df2['date2'] > df1['date1']. The second dataframe has no duplicates, but the first one does.
I know for certain that for every date in df2 there is a date in df1 which satisfies this condition, but I cannot figure out how to join them properly. I have tried this:
joined = df1.join(df2, how='inner')
joined = joined.query('date2 > date1')
But since df1 has entries with duplicate ids, the way the rows align after the join produces many pairs that fail the condition, so after filtering I am left with a smaller dataframe than expected.
How can I accomplish this?

Based on your clarification I suggest the following solution:
1) concatenate (not join) the 2 dataframes.
df12 = pd.concat([df1, df2], axis=1)
I assume that the indices match. If not - reindex on id or join on id.
2) filter the rows that match the criteria
df12 = df12[df12['date2'] > df12['date1']]
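For illustration, here is a minimal runnable sketch of that idea with made-up ids and dates (every value and the id column are assumptions). Because df1 in this example carries duplicate ids, the frames are aligned by merging on an explicit id column (the "join on id" variant suggested above) before the same filter is applied:
import pandas as pd

# Hypothetical data: df1 has duplicate ids, df2 has one row per id.
df1 = pd.DataFrame({'id': [1, 1, 2],
                    'date1': pd.to_datetime(['2021-01-05', '2021-04-01', '2021-03-01'])})
df2 = pd.DataFrame({'id': [1, 2],
                    'date2': pd.to_datetime(['2021-02-01', '2021-03-15'])})

# Align the frames on id, then keep only the rows where date2 falls after date1.
df12 = df1.merge(df2, on='id')
df12 = df12[df12['date2'] > df12['date1']]
print(df12)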

Related

Join column in dataframe to another dataframe - Pandas

I have 2 dataframes. One has a bunch of columns including f_uuid. The other dataframe has 2 columns, f_uuid and i_uuid.
The first dataframe may contain some f_uuids that the second dataframe doesn't, and vice versa.
I want the first dataframe to have a new column i_uuid (from the second dataframe) populated with the appropriate values for the matching f_uuid in that first dataframe.
How would I achieve this?
df1 = pd.merge(df1, df2, on='f_uuid')
If you want to keep all f_uuid from df1 (e.g. those not available in df2), you may run
df1 = pd.merge(df1, df2, on='f_uuid', how='left')
I think what you're looking for is a merge: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge
In your case, that would look like:
bunch_of_col_df.merge(other_df, on="f_uuid")
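A small self-contained sketch of the left-merge variant, with made-up f_uuid and i_uuid values (all of the data is an assumption). Note that merge defaults to an inner join, so how='left' is what keeps the df1 rows whose f_uuid has no match in df2:
import pandas as pd

# Hypothetical data: df1 has an f_uuid ('b') with no match in df2, and df2 has one ('d') absent from df1.
df1 = pd.DataFrame({'f_uuid': ['a', 'b', 'c'], 'value': [10, 20, 30]})
df2 = pd.DataFrame({'f_uuid': ['a', 'c', 'd'], 'i_uuid': ['x1', 'x3', 'x4']})

# Left merge keeps every row of df1; i_uuid is NaN where f_uuid has no match in df2.
df1 = df1.merge(df2, on='f_uuid', how='left')
print(df1)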

How to apply a function row by row in merge syntax in Python pandas

I have two dataframes:
df1:
df2:
If I map the date in df2 from df1 using the merge command below, it gives me the same output as df1:
df2.merge(df1, how='left', on='Category')
But I actually need the output as below, where:
if only one date is returned, assign that date to the category;
if multiple dates are returned and they are all the same, assign that single date once;
if multiple dates are returned and more than one distinct date is present, assign None.
Required output:
Can anyone help with this? I'm struggling here.
Thanks in advance.
STEPS:
Use groupby and filter the required groups from the 1st dataframe.
Drop the duplicates from df1.
Perform the merge with this updated df1.
df1 = df1.groupby('Category').filter(
    lambda x: x['Date'].nunique() == 1).drop_duplicates()
df2.merge(df1, how='left', on='Category')
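A runnable sketch of those three steps with invented Category/Date values (only the column names come from the question; the data is an assumption):
import pandas as pd

df1 = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C'],
                    'Date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-02-01', '2021-03-01']})
df2 = pd.DataFrame({'Category': ['A', 'B', 'C']})

# Keep only the categories whose dates are all identical, then collapse the repeats.
df1_unique = df1.groupby('Category').filter(
    lambda x: x['Date'].nunique() == 1).drop_duplicates()

# Left merge: A and C get their single date, while B (two distinct dates) ends up as NaN.
result = df2.merge(df1_unique, how='left', on='Category')
print(result)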

Given 2 data frames search for matching value and return value in second data frame

Given 2 data frames like the linked example, I need to add to df1 the "index income" from df2. I need to search by the df1 combined key in df2 and, if there is a match, return the value into a new column in df1. There is not an equal number of instances in df1 and df2; there are about 700 rows in df1 and 1000 rows in df2.
I was able to do this in excel with a vlookup but I am trying to apply it to python code now.
This should solve your issue:
df1.merge(df2, how='left', on='combined_key')
This (left join) will give you all the records of df1 and matching records from df2.
https://www.geeksforgeeks.org/how-to-do-a-vlookup-in-python-using-pandas/
Here is an answer using joins. I modified my df2 to only include useful columns then used pandas left join.
left_join = pd.merge(df, zip_df, on='State County', how='left')
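A self-contained version of that left join with invented data (the 'State County' values and the income figures are assumptions, used only to show the VLOOKUP-style lookup):
import pandas as pd

# Hypothetical data: df1 holds the combined key, df2 maps that key to an index income.
df1 = pd.DataFrame({'State County': ['CA Alameda', 'TX Travis', 'NY Kings'],
                    'population': [1.6, 1.3, 2.6]})
df2 = pd.DataFrame({'State County': ['CA Alameda', 'NY Kings', 'WA King'],
                    'index income': [112, 98, 105]})

# VLOOKUP equivalent: keep every df1 row and pull in 'index income' where the key matches.
left_join = pd.merge(df1, df2, on='State County', how='left')
print(left_join)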

Are there any alternatives to a full outer join for comparing PySpark dataframes with no key columns?

So I've been looking at different ways to compare two PySpark dataframes where we have no key columns.
Let's say I have two dataframes, df1 & df2, with columns col1, col2, col3.
The idea is that I would get an output dataframe containing rows from df1 that do not match with any rows in df2 and vice versa. I would also like some kind of flag so I can distinguish between rows from df1 and rows from df2.
I have so far looked at a full outer join as a method, such as:
from pyspark.sql.functions import col, lit, when

columns = df1.columns
df1 = df1.withColumn("df1_flag", lit("X"))
df2 = df2.withColumn("df2_flag", lit("X"))
df3 = df1.join(df2, columns, how='full')\
    .withColumn("FLAG", when(col("df1_flag").isNotNull() & col("df2_flag").isNotNull(), "MATCHED")
                .otherwise(when(col("df1_flag").isNotNull(), "df1").otherwise("df2")))\
    .drop("df1_flag", "df2_flag")
df4 = df3.filter(df3.FLAG != "MATCHED")
The issue with the full outer join is that I may need to deal with some very large dataframes (1 million+ records), so I am concerned about efficiency.
I have thought about using an anti left join and an anti right join and then combining, but still there are efficiency worries with that also.
Is there any method of comparison I am overlooking here that could be more efficient for very large dataframes?
You can run a minus query on your dataframes:
mismatched_df1 = df1.exceptAll(df2)
mismatched_df2 = df2.exceptAll(df1)
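A self-contained sketch of that exceptAll approach, with a flag column added afterwards to mark where each row came from (the sample rows and the 'source' column name are assumptions). Note that exceptAll compares whole rows and respects duplicates, so both frames need the same schema:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Hypothetical frames with identical schemas and no key columns.
df1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['col1', 'col2'])
df2 = spark.createDataFrame([(2, 'b'), (3, 'x'), (4, 'd')], ['col1', 'col2'])

# Rows present in one frame but not the other, tagged with their origin.
only_in_df1 = df1.exceptAll(df2).withColumn('source', lit('df1'))
only_in_df2 = df2.exceptAll(df1).withColumn('source', lit('df2'))

mismatches = only_in_df1.unionByName(only_in_df2)
mismatches.show()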

Joining two multi-index dataframes with pandas

I have two multi-index dataframes which should have the same index names (barring any format issues) that I want to join together.
DF1 =
DF2 =
I want to join DF2 to DF1 only where the NAME and code match, so on this dataframe, for example, it would pull in the qty for 10'' of 69.0.
I've tried different variations of join, concat and multiindex.join but can't seem to figure it out. Any help much appreciated!
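For reference, a minimal sketch of one way this can work, assuming NAME and code are index levels in both frames (the level names come from the question; every value below is invented): pandas aligns on the shared index levels during a join, so an inner join keeps only the (NAME, code) pairs present in both frames.
import pandas as pd

# Hypothetical multi-index frames that share the NAME and code index levels.
idx1 = pd.MultiIndex.from_tuples([('A', 1.0), ('B', 2.0)], names=['NAME', 'code'])
idx2 = pd.MultiIndex.from_tuples([('A', 1.0), ('C', 3.0)], names=['NAME', 'code'])

DF1 = pd.DataFrame({'price': [5.0, 6.0]}, index=idx1)
DF2 = pd.DataFrame({'qty': [100, 200]}, index=idx2)

# Inner join on the shared index levels: only ('A', 1.0) appears in both, so only it survives.
result = DF1.join(DF2, how='inner')
print(result)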
