Pandas: Merging dataframes on a string column

I am new to Python and am currently working on a project where I have to merge two dataframes. One dataframe, called cancer_df, holds cancer incidence by county, year, sex, gender, etc. The other dataframe, called hspa_df, holds a health score by county and year (FYI, it only covers counties in California). I would like to combine my two dataframes on county and year. (The original post linked screenshots of the cancer and hspa dataframes before the merge.)
I imported my data and tried the following merge:
merged_df = pd.merge(cancer_df, hspa_df, on="County", how="outer")
However, this seems to append the data rather than merge it. It adds my hspa_df at the end and fills the top of the variable they share in common as NaNs. Why is this happening? I have successfully used this merge with other dataframes, but I merged them on numerical columns, not strings.
(The original post also linked screenshots of the merged dataframe's head and tail.)

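"I would like to combine my two dataframes on county and year"
merged_df = pd.merge(cancer_df, hspa_df, on=['County', 'Year'])
Whether you want an inner, left, right, etc. join depends on your use case, but note how to specify two key columns with a list.
"It fills the top of the variable they share in common as NaNs"
This is what an outer join does: it keeps every row from both dataframes and fills the columns that have no match on the other side with NaN. If county names that should match still end up on separate rows, the key strings are probably not identical (for example, extra whitespace or different capitalization).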
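As a rough illustration of the two-key merge, here is a minimal sketch with made-up data. The Cases and HealthScore columns and the trailing space in one county name are assumptions for the example, not taken from the original dataframes. It strips the string key before merging and uses the indicator option of pd.merge to show which rows matched.

```python
import pandas as pd

# Made-up stand-ins for cancer_df and hspa_df; the real columns may differ.
cancer_df = pd.DataFrame({
    "County": ["Alameda", "Alameda", "Fresno "],  # note the trailing space
    "Year": [2010, 2011, 2010],
    "Cases": [120, 130, 80],
})
hspa_df = pd.DataFrame({
    "County": ["Alameda", "Alameda", "Fresno"],
    "Year": [2010, 2011, 2010],
    "HealthScore": [7.1, 7.3, 5.9],
})

# Normalize the string key so that "Fresno " and "Fresno" compare equal.
for frame in (cancer_df, hspa_df):
    frame["County"] = frame["County"].str.strip()

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged_df = pd.merge(cancer_df, hspa_df, on=["County", "Year"],
                     how="outer", indicator=True)
print(merged_df["_merge"].value_counts())
```

Rows marked left_only or right_only in the _merge column are the ones whose keys found no partner, which is usually the quickest way to spot mismatched county names.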

Related

Merging Two Dataframes stacks rows instead of merging into one

I'm attempting to merge two dataframes using two columns as keys: "Date" and "Instrument"
Here is my code:
merge_df = pd.merge(df1 , df2, how='outer', left_on=['Date','Instrument'], right_on = ['Date','Instrument'])
df1:
df2:
You'll notice that the row in each dataframe has the same instrument and date value: AEA000201011 & 2008-01-31.
The merged dataframe is stacking the two rows instead of combining them:
merged_df:
I have ensured that the dataframe key columns dtypes match:
df1:
df2:
Any advice would be much appreciated!

I wish I could post this as a comment instead.
You have probably tried this already, but have you tried using "left" or "right" instead of "outer"?
Or check the key values directly, for example:
df1["Instrument"].iloc[0] == df2["Instrument"].iloc[0]
Maybe they contain some invisible characters. If so, you can try removing them with the str.strip() method.
Nothing else comes to mind.
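To make the invisible-character idea concrete, here is a small sketch with made-up rows. The Price and Volume columns and the trailing space are hypothetical; the real data may have a different problem, such as one Date column holding strings and the other datetimes.

```python
import pandas as pd

# Hypothetical one-row frames; df1's Instrument has a trailing space.
df1 = pd.DataFrame({"Date": ["2008-01-31"],
                    "Instrument": ["AEA000201011 "],
                    "Price": [1.5]})
df2 = pd.DataFrame({"Date": ["2008-01-31"],
                    "Instrument": ["AEA000201011"],
                    "Volume": [100]})

# repr() makes stray whitespace visible in the printed value.
print(repr(df1["Instrument"].iloc[0]), repr(df2["Instrument"].iloc[0]))

# The keys differ, so the outer merge stacks the rows instead of combining them.
stacked = pd.merge(df1, df2, how="outer", on=["Date", "Instrument"])

# After stripping whitespace the keys match and the rows combine into one.
df1["Instrument"] = df1["Instrument"].str.strip()
df2["Instrument"] = df2["Instrument"].str.strip()
combined = pd.merge(df1, df2, how="outer", on=["Date", "Instrument"])
print(len(stacked), len(combined))
```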

How to Merge this Data-frames in Python Pandas?

I have 3 dataframes with the following shapes:
(34376, 13), (52389, 28), (16531, 14)
This is the first dataframe:
This is the second dataframe:
This is the third dataframe:
Given these shapes, the main task is to merge the dataframes on the Accession Number.
DF1 has exactly the 34376 Accession Numbers we want.
DF2 has around 28000 Accession Numbers we want. This means we don't want the remaining Accession Numbers in that table.
DF3 has around 9200 Accession Numbers we want.
How can we merge all 3 dataframes on Accession Number, so that the extra columns of DF2 and DF3 are merged with DF1? DF2 has 52389 rows, so if the same Accession Number is repeated in DF2, we still want to merge it, and the DF1 row should be repeated for each extra row of DF2; the same goes for DF3. Where an Accession Number is present in DF1 but has no values in DF2/DF3, the merged columns should be null.

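You can simply use the pandas merge function:
pd.merge(pd.merge(df1,df2,on='ACCESSION_NUMBER'),df3,on='ACCESSION_NUMBER')
or
df1.merge(df2,on='ACCESSION_NUMBER').merge(df3,on='ACCESSION_NUMBER')
Both of these default to an inner join, which drops DF1 rows that have no match. Pass how='left' to each merge to keep every DF1 row and get nulls where DF2/DF3 have no value.
Or you could use the reduce function from the functools module:
from functools import reduce
reduce(lambda x,y: pd.merge(x,y, on='ACCESSION_NUMBER', how='outer'), [df1, df2, df3])
Note that how='outer' also keeps Accession Numbers that appear only in DF2 or DF3.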
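The question asks to keep every DF1 row, repeat it for duplicate keys, and leave nulls where nothing matches, which corresponds to a left join. A minimal sketch with made-up frames follows; the x, y and z columns and the key values are hypothetical.

```python
from functools import reduce

import pandas as pd

# Tiny made-up frames standing in for DF1, DF2 and DF3.
df1 = pd.DataFrame({"ACCESSION_NUMBER": ["A1", "A2", "A3"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"ACCESSION_NUMBER": ["A1", "A1", "Z9"], "y": [10, 11, 99]})
df3 = pd.DataFrame({"ACCESSION_NUMBER": ["A2"], "z": [200]})

# Left-join each frame onto the running result, starting from df1.
result = reduce(
    lambda left, right: pd.merge(left, right, on="ACCESSION_NUMBER", how="left"),
    [df1, df2, df3],
)
print(result)
```

Here A1 appears twice because it is repeated in df2, Z9 is dropped because it is not in df1, and A3 gets NaN in both y and z.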

Pandas, when merging two dataframes and values for some columns don't carry over

I'm trying to combine two dataframes in pandas using a left merge on common columns, but when I do that the merged data doesn't carry over and I get NaN values instead. All of the columns are objects and match in that respect, so I'm not quite sure what's going on.
This is the header of my first dataframe, which is the output from a program:
This is the header of my second dataframe. The second df is a 'key' document used to match the first output with its correct id/tastant/etc., and the two share the same date/subject/procedure/etc.:
And this is my code that tries to merge them on the common columns:
combined = first.merge(second, on=['trial', 'experiment','subject', 'date', 'procedure'], how='left')
With this output (the id, ts and tastant columns should match the first dataframe correctly, but they don't):

Check your dtypes and make sure they match between the 2 dataframes. Pandas infers data types when it imports data, so it could be treating the same values as int in one dataframe and as object in the other.
For the string columns, check for extra whitespace. It can appear in datasets, and since you can't see it but pandas compares it, the keys don't match. You can remove it with df['column'].str.strip().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html
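Both checks can be sketched together as follows. The frames and the licks column are made up for illustration, and only two of the five key columns from the question are used.

```python
import pandas as pd

# Made-up frames: "trial" is int in one and str in the other,
# and one "subject" value carries a trailing space.
first = pd.DataFrame({"trial": [1, 2],
                      "subject": ["s1 ", "s2"],
                      "licks": [30, 42]})
second = pd.DataFrame({"trial": ["1", "2"],
                       "subject": ["s1", "s2"],
                       "tastant": ["NaCl", "sucrose"]})

# Inspect the dtypes of the key columns side by side.
keys = ["trial", "subject"]
print(first[keys].dtypes)
print(second[keys].dtypes)

# Bring every key column to a stripped string in both frames.
for frame in (first, second):
    for key in keys:
        frame[key] = frame[key].astype(str).str.strip()

combined = first.merge(second, on=keys, how="left")
print(combined)
```

Converting every key to a string is only one possible normalization; converting to a numeric or datetime type may suit the real data better.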

Pandas merge is giving different answers with index merge and column merge

I have three DataFrames that I am trying to merge. The common column I am merging on in each DataFrame is COUNTRY.
Case 1:
Before merging the three DataFrames, I set the index of each DataFrame to COUNTRY and ran
pd.merge(leftdf,rightdf,left_index=True,right_index=True,how="inner")
This gives the required answer. But when I do not set the index of each DataFrame to Country, leave it as a column, and perform the merge
pd.merge(leftdf,rightdf,on="Country",how="inner")
the resulting DataFrame is smaller. I am losing some rows. Why is this happening? I do not understand.
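One possible way to investigate, sketched with made-up data (the gdp and population columns are hypothetical): when the key values are identical, an inner merge on the index and an inner merge on the column keep the same countries, so a difference suggests that the key values differ between the indexed and non-indexed frames. An outer merge of just the key columns with indicator=True lists the keys that an inner join would drop.

```python
import pandas as pd

# Made-up frames; Chile and Peru each appear in only one of them.
leftdf = pd.DataFrame({"Country": ["India", "Japan", "Chile"],
                       "gdp": [1, 2, 3]})
rightdf = pd.DataFrame({"Country": ["India", "Japan", "Peru"],
                        "population": [10, 20, 30]})

# Inner merge on the column versus on the index.
by_column = pd.merge(leftdf, rightdf, on="Country", how="inner")
by_index = pd.merge(leftdf.set_index("Country"), rightdf.set_index("Country"),
                    left_index=True, right_index=True, how="inner")

# With identical key values both approaches keep the same countries.
same = set(by_column["Country"]) == set(by_index.index)

# Keys present in only one frame are dropped by an inner join; list them.
check = pd.merge(leftdf[["Country"]], rightdf[["Country"]],
                 on="Country", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(same)
print(unmatched)
```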

How to inner join in pandas as SQL , Stuck in a problem below

I have two dataframes, named "df" and "topwud".
df
topwud
When I join these two dataframes with an inner join, using BOMCPNO and PRTNO as the join columns, like
second_level = pd.merge(df, top_wud, left_on='BOMCPNO', right_on='PRTNO', how='inner').drop_duplicates()
I get this dataframe:
Result
I don't want the shared column name to come out as PRTNO_x and PRTNO_y. I want to keep only PRTNO_x in my result dataframe, under its default name "PRTNO".
Kindly help me :)

Try this:
pd.merge(df, top_wud, on=['BOMCPNO', 'PRTNO'])
Note that this returns only the rows where the BOMCPNO and PRTNO values exist in both dataframes, as the default merge type is an inner merge.
So you could compare the size of this merged df with your first one. If they are the same, you could merge on both columns, or just drop/rename the columns with the _x/_y suffixes.
I would spend time, though, determining whether these values are indeed the same and exist in both dataframes. If they do not, you may wish to perform an outer merge:
pd.merge(df1, df2, on=['A', 'B'], how='outer')
Then you could drop duplicate rows (and possibly any NaN rows), which should give you a clean merged dataframe:
merged_df.drop_duplicates(subset=['BOMCPNO', 'PRTNO'], inplace=True)
Also try other types of join (left, right, outer), as I don't know exactly what you want; I think it's a left join.
Check whether this solves your problem.
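The drop/rename option mentioned above can be sketched like this, keeping the original left_on/right_on merge from the question. The data and the WUD column are made up for the example.

```python
import pandas as pd

# Made-up frames: both contain a PRTNO column, as in the question.
df = pd.DataFrame({"PRTNO": ["A", "A"], "BOMCPNO": ["B", "C"]})
top_wud = pd.DataFrame({"PRTNO": ["B", "C", "D"], "WUD": [1, 2, 3]})

second_level = (
    pd.merge(df, top_wud, left_on="BOMCPNO", right_on="PRTNO", how="inner")
    .drop(columns="PRTNO_y")             # right-hand key, equal to BOMCPNO
    .rename(columns={"PRTNO_x": "PRTNO"})  # restore the default name
    .drop_duplicates()
)
print(second_level)
```

Dropping PRTNO_y loses no information here, because in an inner join on these keys it always equals BOMCPNO.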

We Keep Coding