python pandas - joining specific columns

I have a main dataframe (MbrKPI4), and I want to left-join it with another dataframe (mbrsdf). They share the same index. The following works:
MbrKPI4.join(mbrsdf['Gender'])
However, I want to join more columns from mbrsdf, and the line below fails with a MemoryError. Is there a way to join so that I can select the columns I want from mbrsdf?
MbrKPI4.join(mbrsdf['Gender'], mbrsdf['Marital Status'])

Based on the documentation for join(), I think you want to pass in a list of DataFrames/Series to left-join, or chain join calls:
d1.join([d2['Gender'], d2['Marital Status']])
d1.join(d2['Gender']).join(d2['Marital Status'])
Hope that works.
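Alternatively, selecting the wanted columns from mbrsdf first lets you join once without building a list. A minimal sketch with invented data standing in for MbrKPI4 and mbrsdf:

```python
import pandas as pd

# Hypothetical stand-ins for MbrKPI4 and mbrsdf, sharing the same index
MbrKPI4 = pd.DataFrame({'KPI': [1, 2, 3]}, index=['a', 'b', 'c'])
mbrsdf = pd.DataFrame({'Gender': ['F', 'M', 'F'],
                       'Marital Status': ['S', 'M', 'S'],
                       'Other': [0, 0, 0]}, index=['a', 'b', 'c'])

# Select just the columns you want, then left-join on the shared index
result = MbrKPI4.join(mbrsdf[['Gender', 'Marital Status']])
print(result.columns.tolist())  # ['KPI', 'Gender', 'Marital Status']
```

This avoids pulling the unwanted columns of mbrsdf into the join at all.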

Related

Pandas: Suffixes created on merge with same name columns

I am trying to do a left join using pandas merge; however, for some reason I get suffixes on my join even though the series/columns have the same name.
penalty_kicks dataframe:
backup dataframe:
Here is how I am using pd.merge to do the left join described above:
penalty_kicks = pd.merge(penalty_kicks,backup, on=['video_name','image_name'],how='left')
After that, here is the output table:
Is there a way to just make the data all go into their respective columns?
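For reference, pandas only adds suffixes when the two frames share column names that are not join keys; columns listed in on are never suffixed. A minimal sketch with made-up data illustrating where the _x/_y names come from:

```python
import pandas as pd

# Invented frames sharing a non-key column name ('score')
left = pd.DataFrame({'video_name': ['v1'], 'image_name': ['i1'], 'score': [0.5]})
right = pd.DataFrame({'video_name': ['v1'], 'image_name': ['i1'], 'score': [0.9]})

# 'score' exists in both frames but is not a join key, so pandas
# disambiguates it with the default '_x'/'_y' suffixes
merged = pd.merge(left, right, on=['video_name', 'image_name'], how='left')
print(merged.columns.tolist())
# ['video_name', 'image_name', 'score_x', 'score_y']
```

Adding the shared column to the on list (when it really is part of the key) makes the suffixes disappear.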

How to join on timestamp in dataframe

I have two dataframes. One has 4 days' worth of data while the other has 2. Dataframe one looks like this
while df2 looks like this:
I need to join these. There are two options: first, join only on dates that exist in both; second, keep every date and fill the gaps.
I am merging them like this:
using this code:
pd.merge(freq_df_two,freq_df_one, on=["date","hour"])
The issue is that if a date from df1 is not present in df2 then it is simply dropped. For example, as you can see, it doesn't have 2020-09-02. I want it to display NaN or 0 if that date and hour are not present in the second df. How do I do that?
Add how='outer' to your merge:
pd.merge(freq_df_two,freq_df_one, on=["date","hour"], how='outer')
Pandas' merge function uses the inner strategy by default, similar to INNER JOIN in SQL, meaning that rows not present in the second dataframe are simply dropped. You should use the left strategy, similar to LEFT JOIN:
pd.merge(freq_df_two,freq_df_one, on=["date","hour"], how='left')
You can specify a how parameter. Here outer is the equivalent of the SQL FULL OUTER JOIN:
pd.merge(freq_df_two,freq_df_one, on=["date","hour"], how="outer")
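Putting the answers together, an outer merge keeps every date/hour combination, with NaN where one side has no row; you can then fillna(0) if you prefer zeros. A sketch with invented counts standing in for the question's frames:

```python
import pandas as pd

# Toy stand-ins for freq_df_one and freq_df_two (data invented)
freq_df_one = pd.DataFrame({'date': ['2020-09-01', '2020-09-02'],
                            'hour': [0, 0],
                            'count_one': [5, 7]})
freq_df_two = pd.DataFrame({'date': ['2020-09-01'],
                            'hour': [0],
                            'count_two': [3]})

# Outer merge keeps 2020-09-02 even though freq_df_two lacks it
merged = pd.merge(freq_df_two, freq_df_one, on=['date', 'hour'], how='outer')

# Rows missing from one side come through as NaN; fill with 0 if preferred
merged['count_two'] = merged['count_two'].fillna(0)
print(merged)
```

With how='left' instead, only the dates present in freq_df_two would survive.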

Find where three separate DataFrames overlap and create a new DataFrame

I have three separate DataFrames. Each has the same columns: ['Email', 'Rating']. There are duplicate row values across all three DataFrames in the Email column. I'm trying to find the emails that appear in all three DataFrames and then create a new DataFrame from those rows. So far I have saved all three DataFrames to a list, dfs = [df1, df2, df3], and then concatenated them with df = pd.concat(dfs). I tried using groupby, but to no avail. Any help would be greatly appreciated.
You want to do a merge. Similar to a JOIN in SQL, you can do an inner merge and treat the email like a foreign key. Here are the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
It would look something like this:
in_common = pd.merge(df1, df2, on=['Email'], how='inner')
You could try using .isin from pandas, e.g.:
df[df['Email'].isin(df2['Email'])]
This retrieves the rows where the values in the Email column are the same in the two dataframes.
Another idea is to try an inner merge.
Good luck; post code next time.
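To cover all three frames, either answer can be chained; one way is to reduce over inner merges so that only emails present in every frame survive. A sketch with made-up addresses and ratings:

```python
from functools import reduce

import pandas as pd

# Invented sample data; only 'a@x.com' appears in all three frames
df1 = pd.DataFrame({'Email': ['a@x.com', 'b@x.com'], 'Rating': [1, 2]})
df2 = pd.DataFrame({'Email': ['a@x.com', 'c@x.com'], 'Rating': [3, 4]})
df3 = pd.DataFrame({'Email': ['a@x.com', 'b@x.com'], 'Rating': [5, 6]})

# Chain inner merges: each step keeps only emails present in both sides,
# so the final frame holds the emails common to df1, df2, and df3
in_common = reduce(
    lambda left, right: pd.merge(left, right, on='Email', how='inner'),
    [df1, df2, df3],
)
print(in_common['Email'].tolist())  # ['a@x.com']
```

The Rating columns pick up suffixes along the way, which you can rename afterwards if you need to keep all three.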

How to inner join in pandas as in SQL; stuck on the problem below

I have two dataframes, named "df" and "topwud".
df
topwud
When I join these two dataframes by inner join, using BOMCPNO and PRTNO as the join columns, like
second_level=pd.merge(df,top_wud ,left_on='BOMCPNO', right_on='PRTNO', how='inner').drop_duplicates()
Then I got this data frame
Result
I don't want the common name coming out as PRTNO_x and PRTNO_y; I want to keep only PRTNO_x in my result dataframe, under the name "PRTNO", which is the default name.
Kindly help me :)
Try this:
pd.merge(df1, top_wud, on=['BOMCPNO', 'PRTNO'])
What this will do though is return only the values where BOMCPNO and PRTNO exist in both dataframes as the default merge type is an inner merge.
So what you could do is compare this merged df's size with your first one; if they are the same, you could merge on both columns, or just drop/rename the _x/_y suffixed columns.
I would spend time though determining if these values are indeed the same and exist in both dataframes, in which case you may wish to perform an outer merge:
pd.merge(df1, df2, on=['A', 'B'], how='outer')
Then what you could do is then drop duplicate rows (and possibly any NaN rows) and that should give you a clean merged dataframe.
merged_df.drop_duplicates(subset=['BOMCPNO', 'PRTNO'], inplace=True)
Also try other types of join, as I don't know exactly what you want; I think it's a left join.
Check whether this solves your problem.
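To keep the asker's original merge but restore the plain column name, one option is to drop the _y copy and rename the _x copy afterwards. A sketch with made-up data (only the column names BOMCPNO and PRTNO come from the question; the values and the DESC column are invented):

```python
import pandas as pd

# df has both a BOMCPNO and a PRTNO column, which is why the
# merge suffixes the shared PRTNO name with _x/_y
df = pd.DataFrame({'BOMCPNO': ['p1', 'p2'], 'PRTNO': ['q1', 'q2']})
top_wud = pd.DataFrame({'PRTNO': ['p1'], 'DESC': ['widget']})

second_level = pd.merge(df, top_wud, left_on='BOMCPNO', right_on='PRTNO',
                        how='inner').drop_duplicates()

# Drop the right-hand copy and restore the default column name
second_level = (second_level
                .drop(columns='PRTNO_y')
                .rename(columns={'PRTNO_x': 'PRTNO'}))
print(second_level.columns.tolist())  # ['BOMCPNO', 'PRTNO', 'DESC']
```

Passing suffixes=('', '_dup') to pd.merge is another way to keep the left-hand name untouched.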

How to do a full outer join with pandas dataframes with date condition?

I'm trying to do a full outer join in pandas with one of the conditions being a date match. The SQL code would be like the following:
SELECT *
FROM applications apps
FULL OUTER JOIN order_data orders on apps.account = orders.account_order
and orders.[Order date] <= apps.time_stamp;
How could I achieve this considering apps and order_data are two pandas dataframes?
I tried using pysql but full outer joins are not supported.
Thank you
The Pandas .join method lets you use outer joins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html
Simply load up your two data frames and
a.join(b, how='outer')
will do it.
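Note that join and merge only express equality conditions, so the date inequality has to be handled separately. One way to emulate the SQL above is to build the matched pairs first, filter them by the date condition, and then append the unmatched rows from each side. A sketch with invented rows (column names taken from the question, data made up):

```python
import pandas as pd

# Toy stand-ins for the applications and order_data tables
apps = pd.DataFrame({'account': [1, 2],
                     'time_stamp': pd.to_datetime(['2021-01-10', '2021-01-05'])})
orders = pd.DataFrame({'account_order': [1, 3],
                       'Order date': pd.to_datetime(['2021-01-01', '2021-01-02'])})

# Step 1: matched pairs that satisfy both the key and the date condition
matched = pd.merge(apps, orders, left_on='account', right_on='account_order',
                   how='inner')
matched = matched[matched['Order date'] <= matched['time_stamp']]

# Step 2: rows from each side with no qualifying match (the "outer" part)
apps_only = apps[~apps['account'].isin(matched['account'])]
orders_only = orders[~orders['account_order'].isin(matched['account_order'])]

# Step 3: stitch everything together; unmatched sides come through as NaN
full_outer = pd.concat([matched, apps_only, orders_only], ignore_index=True)
print(len(full_outer))  # 3
```

This mirrors FULL OUTER JOIN semantics: matched rows appear once, and rows with no qualifying partner survive with NaN in the other side's columns.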
