How to join on timestamp in dataframe - python

I have two dataframes. One has 4 days' worth of data while the other has 2 days. The first dataframe looks like this
while df2 looks like this:
I need to join these. There are two options for joining them. First, join based on dates that exist in both. Second
I am merging them like this:
using this code:
pd.merge(freq_df_two,freq_df_one, on=["date","hour"])
The issue is that if a date from df1 is not present in df2, the merge simply drops it. For example, as you can see, the result doesn't have 2020-09-02. I want it to display NaN or 0 if that date and hour is not present in the second df. How do I do that?

Add how='outer' to your merge:
pd.merge(freq_df_two,freq_df_one, on=["date","hour"], how='outer')

By default, the pandas merge function uses the inner strategy, similar to INNER JOIN in SQL, meaning that rows whose keys are not present in both dataframes are simply dropped. You should use the left strategy, similar to LEFT JOIN, which keeps every row of the first (left) dataframe passed to merge and fills the missing values with NaN.
pd.merge(freq_df_two,freq_df_one, on=["date","hour"], how='left')

You can specify the how parameter.
Here, outer is the equivalent of the SQL FULL OUTER JOIN:
pd.merge(freq_df_two,freq_df_one, on=["date","hour"], how="outer")
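To make the difference between the strategies concrete, here is a minimal sketch. The frame contents and the count_one/count_two column names are made up for illustration; only the date and hour keys come from the question. It assumes freq_df_one is the frame with more dates, so it is passed first.

```python
import pandas as pd

# Hypothetical data: freq_df_one covers two dates, freq_df_two only one.
freq_df_one = pd.DataFrame({
    "date": ["2020-09-01", "2020-09-01", "2020-09-02"],
    "hour": [0, 1, 0],
    "count_one": [5, 7, 3],
})
freq_df_two = pd.DataFrame({
    "date": ["2020-09-01", "2020-09-01"],
    "hour": [0, 1],
    "count_two": [2, 4],
})

# Inner join (the default): rows without a match in both frames are dropped.
inner = pd.merge(freq_df_one, freq_df_two, on=["date", "hour"])

# Left join: every row of the first frame is kept; unmatched values become NaN.
left = pd.merge(freq_df_one, freq_df_two, on=["date", "hour"], how="left")

# Replace the NaN placeholders with 0 if zeros are preferred.
left_filled = left.fillna({"count_two": 0})

print(inner)
print(left_filled)
```

With how="outer" instead, rows that exist only in the second frame would be kept as well.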

Related

Merging Two Dataframes stacks rows instead of merging into one

I'm attempting to merge two dataframes using two columns as keys: "Date" and "Instrument"
Here is my code:
merge_df = pd.merge(df1 , df2, how='outer', left_on=['Date','Instrument'], right_on = ['Date','Instrument'])
df1:
df2:
You'll notice that the row in each dataframe has the same instrument and date value: AEA000201011 & 2008-01-31.
The merged dataframe is stacking the two rows instead of combining them:
merged_df:
I have ensured that the dataframe key columns dtypes match:
df1:
df2:
Any advice would be much appreciated!
I wish I could post this as a comment instead.
You have probably already tried this, but have you tried using "left" or "right" instead of "outer"?
Or check the key values directly, like this:
df1["Instrument"].iloc[0] == df2["Instrument"].iloc[0]
Maybe they contain some invisible characters. If so, you can try the string strip() functions.
Nothing else comes to mind.
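The invisible-character idea can be checked with a short sketch like the one below. The data is hypothetical (a trailing space is planted in df2 on purpose, and the a/b value columns are invented); whether this is the actual cause in the question is only an assumption.

```python
import pandas as pd

# Hypothetical frames: the keys look identical, but df2 has a trailing space.
df1 = pd.DataFrame({
    "Date": ["2008-01-31"],
    "Instrument": ["AEA000201011"],
    "a": [1.0],
})
df2 = pd.DataFrame({
    "Date": ["2008-01-31"],
    "Instrument": ["AEA000201011 "],
    "b": [2.0],
})

# repr() makes hidden whitespace visible.
print(repr(df1["Instrument"].iloc[0]), repr(df2["Instrument"].iloc[0]))

# The keys differ, so the outer merge stacks the two rows.
stacked = pd.merge(df1, df2, how="outer", on=["Date", "Instrument"])

# Strip surrounding whitespace from the string key in both frames, then merge again.
for frame in (df1, df2):
    frame["Instrument"] = frame["Instrument"].str.strip()

merged_df = pd.merge(df1, df2, how="outer", on=["Date", "Instrument"])
print(merged_df)
```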

Pandas Dataframe (inner) Join on the same Dataframe

I'm working on how to cluster a PATSTAT (reference database) database.
With my own algorithm I came up with a dataframe that shows the author, beginpage, endpage, volume and publication_year of a reference.
running:
dfhead = df.head(10)
shows me
Now I want the following:
An inner join of the dataframe with ITSELF, such that, for example, author, beginpage and endpage are the same (at least 3 matching columns between the rows).
I tried:
c = ['author', 'beginpage','endpage', 'volume','publication year']
df_merge = dfhead.merge(dfhead, how = 'inner',on = [c[0],c[1],c[2]])
where
The result then contains only each row joined with itself, and I don't want to include those.
In the example above, df_merge should be empty, since no two rows share 3 columns.
What if some rows do partly match? Here is an example:
x = pd.DataFrame({'author':['lee','lee'], 'beginpage':[455,456],'endpage':[477,477],'volume':[300,300]})
Note that the two rows have (at least) 3 matching columns, and therefore the merge/join should show them.
BUT note that I don't want to include the join of a row with itself!
You could do an inner join and apply filtering to exclude the same row, but maybe it would be more straightforward to use groupby instead?
df.groupby(by=['author', 'beginpage','endpage'])
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
Applying aggregations/calculations/etc. to groups:
https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html
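For the first option (inner join plus filtering), a minimal sketch could look like this. It extends the example frame x from the question with one invented non-matching row, and it joins on one fixed combination of three columns (author, endpage, volume); covering "any 3 columns" would mean repeating it per combination. The helper column name row is an assumption.

```python
import pandas as pd

x = pd.DataFrame({
    "author": ["lee", "lee", "kim"],
    "beginpage": [455, 456, 10],
    "endpage": [477, 477, 20],
    "volume": [300, 300, 5],
})

# Keep the original row label as a column so both sides of the join can be identified.
rows = x.reset_index().rename(columns={"index": "row"})

# Self-join on three of the columns.
keys = ["author", "endpage", "volume"]
pairs = rows.merge(rows, on=keys, suffixes=("_left", "_right"))

# Drop self-matches and keep each pair once (row 0 with row 1, not also row 1 with row 0).
pairs = pairs[pairs["row_left"] < pairs["row_right"]]
print(pairs)
```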

Find where three separate DataFrames overlap and create a new DataFrame

I have three separate DataFrames. Each DataFrame has the same columns: ['Email', 'Rating']. There are duplicate values in the Email column in all three DataFrames. I'm trying to find the emails that appear in all three DataFrames and then create a new DataFrame from those rows. So far I have saved all three DataFrames to a list, dfs = [df1, df2, df3], and concatenated them using df = pd.concat(dfs). I tried using groupby from here, but to no avail. Any help would be greatly appreciated.
You want to do a merge. Similar to a join in SQL, you can do an inner merge and treat the email like a foreign key. Here are the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
For two of the frames it would look something like this (merge the result with df3 in the same way to cover all three):
in_common = pd.merge(df1, df2, on=['Email'], how='inner')
You could try using .isin from pandas, e.g.:
df[df['Email'].isin(df2['Email'])]
This retrieves the rows of df whose Email value also appears in df2.
Another idea is to try an inner merge.
Good luck, and post code next time.
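Extending the .isin idea to all three frames, one possible sketch is to intersect the sets of emails first and then filter the concatenated frame. The sample data and the names common and in_all_three are invented for illustration.

```python
import pandas as pd

# Hypothetical data with duplicate emails, as described in the question.
df1 = pd.DataFrame({"Email": ["a@x.com", "b@x.com", "a@x.com"], "Rating": [1, 2, 3]})
df2 = pd.DataFrame({"Email": ["a@x.com", "c@x.com"], "Rating": [4, 5]})
df3 = pd.DataFrame({"Email": ["a@x.com", "b@x.com"], "Rating": [6, 7]})
dfs = [df1, df2, df3]

# Emails present in every frame: intersect the sets of unique values.
common = set.intersection(*(set(d["Email"]) for d in dfs))

# Stack all frames and keep only the rows whose Email is in the intersection.
df = pd.concat(dfs, ignore_index=True)
in_all_three = df[df["Email"].isin(common)]
print(in_all_three)
```

Unlike a chain of inner merges, this keeps one Rating column and does not multiply rows when an email is duplicated within a frame.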

How to inner join in pandas as SQL , Stuck in a problem below

I have two dataframes, named "df" and "topwud".
df
topwud
When I join these two dataframes with an inner join, using BOMCPNO and PRTNO as the join columns,
like
second_level=pd.merge(df,top_wud ,left_on='BOMCPNO', right_on='PRTNO', how='inner').drop_duplicates()
Then I got this data frame
Result
I don't want the common column to come out as PRTNO_x and PRTNO_y. I want to keep only PRTNO_x in my result dataframe, under its default name "PRTNO".
Kindly help me :)
Try this:
pd.merge(df1, top_wud, on=['BOMCPNO', 'PRTNO'])
Note that this returns only the rows where BOMCPNO and PRTNO exist in both dataframes, since the default merge type is an inner merge.
So you could compare the size of this merged df with your first one; if they are the same, you can merge on both columns, or otherwise just drop/rename the _x/_y suffixed columns.
I would spend time, though, determining whether these values are indeed the same and exist in both dataframes, in which case you may wish to perform an outer merge:
pd.merge(df1, df2, on=['A', 'B'], how='outer')
Then you could drop duplicate rows (and possibly any NaN rows), which should give you a clean merged dataframe.
merged_df.drop_duplicates(subset=['BOMCPNO', 'PRTNO'], inplace=True)
Also try other types of join, as I don't know exactly what you want; I think it's a left or inner join.
Check whether this solves your problem.
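The drop/rename route mentioned above could be sketched as follows. The frame contents and the WUD column are hypothetical; only the BOMCPNO and PRTNO column names and the merge call come from the question.

```python
import pandas as pd

# Hypothetical frames; both contain a PRTNO column, as in the question.
df = pd.DataFrame({"PRTNO": ["P1", "P2"], "BOMCPNO": ["C1", "C2"]})
top_wud = pd.DataFrame({"PRTNO": ["C1", "C3"], "WUD": [10, 20]})

second_level = pd.merge(
    df, top_wud, left_on="BOMCPNO", right_on="PRTNO", how="inner"
).drop_duplicates()

# Both frames have a PRTNO column, so pandas adds the _x/_y suffixes.
# Drop the right-hand copy and restore the original name for the left one.
second_level = (
    second_level
    .drop(columns="PRTNO_y")
    .rename(columns={"PRTNO_x": "PRTNO"})
)
print(second_level)
```

Since PRTNO_y equals BOMCPNO on every matched row of this join, dropping it loses no information.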

python pandas - joining specific columns

I have a main dataframe (MbrKPI4), and I want to left join it with another dataframe (mbrsdf). They have the same index. I am successful with the below.
MbrKPI4.join(mbrsdf['Gender'])
However, I want to join more columns from mbrsdf, and the below does not work (MemoryError). Is there a way to join that I can select the columns I want from mbrsdf?
MbrKPI4.join(mbrsdf['Gender'], mbrsdf['Marital Status'])
Based on the documentation for join(), the second positional argument is on, not another frame, so I think you want to pass in a list of frames to left join by, or chain join calls.
d1.join([d2['Gender'], d2['Marital Status']])
d1.join(d2['Gender']).join(d2['Marital Status'])
Hope that works.
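Another option, sketched below under the assumption that both frames share the same index, is to select the wanted columns with double brackets and join once. The sample values and the Visits and Age columns are invented.

```python
import pandas as pd

# Hypothetical frames sharing the same index.
MbrKPI4 = pd.DataFrame({"Visits": [3, 5]}, index=[101, 102])
mbrsdf = pd.DataFrame(
    {"Gender": ["F", "M"], "Marital Status": ["Single", "Married"], "Age": [30, 40]},
    index=[101, 102],
)

# Double brackets select a sub-DataFrame with only the wanted columns.
joined = MbrKPI4.join(mbrsdf[["Gender", "Marital Status"]])
print(joined)
```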
