combine dataframes in pandas with repeat slices of one of them - python

I want to combine 2 dataframes.
Here are samples of the dataframes:
df1:
linktoDF1
df2:
linkdoDF2
Desired output should be:
linktoResultcsv
What I want, in essence, is to extend df1 with data from df2. The key linking the two dataframes is their shared index ['latitude', 'level', 'longitude']. I want to omit rows whose index is unique to df2, i.e. I don't want to see the row with index [41, 1000, 19.25].
Any help is appreciated.

Use "merge" with how='left', which keeps every key of df1 and omits keys that appear only in df2:
rslt = pd.merge(df1, df2, on=["latitude", "level", "longitude"], how="left")
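A self-contained sketch of this, with made-up sample values standing in for the linked CSVs (the real files aren't shown here):

```python
import pandas as pd

# Hypothetical stand-ins for the linked samples: a three-column key shared by both frames
df1 = pd.DataFrame({'latitude': [40, 41], 'level': [1000, 1000],
                    'longitude': [19.0, 19.5], 'temp': [12.3, 11.8]})
df2 = pd.DataFrame({'latitude': [40, 41, 41], 'level': [1000, 1000, 1000],
                    'longitude': [19.0, 19.5, 19.25], 'humidity': [55, 60, 70]})

# how='left' keeps every key of df1 and drops keys that exist only in df2,
# so the (41, 1000, 19.25) row never appears in the result
rslt = pd.merge(df1, df2, on=['latitude', 'level', 'longitude'], how='left')
```

If the key columns are the dataframes' actual index rather than regular columns, reset the index first or merge with left_index=True/right_index=True.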

Related

How to get rows from a dataframe that are not joined when merging with other dataframe in pandas

I am trying to make 2 new dataframes by using 2 given dataframe objects:
DF1 =
    id  feature_text      length
    1   "example text"    12
    2   "example text2"   13
    ...

DF2 =
    id  case_num
    3   0
    ...
As you can see, both df1 and df2 have a column called "id". However, df1 has all id values, while df2 only has some of them: df1 has 3200 rows, each with a unique id value (1~3200), whereas df2 covers only a subset (e.g. id = [3, 7, 20, ...]).
What I want to do is 1) get a merged dataframe which contains all rows that have the id values which are included in both df1 and df2, and 2) get a dataframe, which contains the rows in the df1, which have id values that are not included in the df2.
I was able to find a solution for 1), however, have no idea how to do 2).
Thanks.
For the first case, you could use inner merge:
out = df1.merge(df2, on='id')
For the second case, you could use isin, with negation operator, so that we filter out the rows in df1 that have ids that also exist in df2:
out = df1[~df1['id'].isin(df2['id'])]
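As an alternative sketch, merge's indicator parameter can produce both results in a single pass; the data below is made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'feature_text': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'id': [2, 4], 'case_num': [0, 1]})

# indicator=True adds a '_merge' column telling where each row came from
both = df1.merge(df2, on='id', how='left', indicator=True)

# case 1): rows whose id exists in both frames
matched = both[both['_merge'] == 'both'].drop(columns='_merge')
# case 2): rows from df1 whose id is missing from df2
unmatched = both[both['_merge'] == 'left_only'].drop(columns=['_merge', 'case_num'])
```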

Why does my merged data frame have multiple columns that are populated with "none" values?

I'm merging 2 pretty large data frames, the shape of RD_4ML is (97058, 24) while the shape of NewDF is (104047, 3). They share a common column called 'personUID', below is the merge code I used.
Final_DF = RD_4ML.merge(NewDF, how='left', on='personUID')
Final_DF.fillna('none', inplace=True)
Final_DF.sample()
DF sample output:
personUID  code   Death  diagnosis_code_type  lr
abc123     ICD10  1      none                 none
Essentially the columns from RD_4ML populate, while the two columns from NewDF are filled with "none" values. Does anyone know how to fix an issue like this?
I think the 'personUID' columns do not match between the two dataframes.
Ensure that they have the same data type.
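A minimal sketch of how mismatched keys produce exactly this symptom: values that look identical can fail to match, e.g. because of stray whitespace or differing representations in the two frames (the data below is hypothetical):

```python
import pandas as pd

rd = pd.DataFrame({'personUID': ['abc123', 'def456'], 'code': ['ICD10', 'ICD9']})
# same UIDs, but with trailing whitespace - so no key ever matches
new = pd.DataFrame({'personUID': ['abc123 ', 'def456 '], 'lr': [0.5, 0.7]})

bad = rd.merge(new, how='left', on='personUID')   # 'lr' comes out all NaN

# normalize the key column, then merge again
new['personUID'] = new['personUID'].str.strip()
good = rd.merge(new, how='left', on='personUID')  # 'lr' now populates
```

Comparing `RD_4ML['personUID'].dtype` with `NewDF['personUID'].dtype`, and spot-checking a few raw values, is usually the quickest way to find the mismatch.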
Merge with how='left' takes every entry from the left dataframe and tries to find a corresponding matching id in the right dataframe. For all non-matching rows, it fills in NaN for the columns coming from the right frame. In SQL this is called a left join. As an example, have a look at this:
df1 = pd.DataFrame({"uid":range(4), "values": range(4)})
df2 = pd.DataFrame({"uid":range(5, 9), "values2": range(4)})
df1.merge(df2, how="left", on='uid')
# OUTPUT
uid values values2
0 0 0 NaN
1 1 1 NaN
2 2 2 NaN
3 3 3 NaN
Here you see that all uids from the left dataframe end up in the merged dataframe, and since no matching entry was found, the column from the right dataframe is set to NaN.
If your goal is to end up with only those rows that have a match, change "left" to "inner". For more information, have a look at the great pandas docs.

Pandas merge two data frame only to first occurrence

I have two dataframes and I am able to merge them with pd.merge(df1, df2, on='column_name'), but I only want to merge on the first occurrence in df1. It's a many-to-one merge, and I only want the first occurrence merged. Any pointer or solution? Thanks in advance!
Since you want to merge two dataframes of different lengths, the merged dataframe will have to contain NaN in the cells where df2 has no corresponding key. So let's try this: merge left, which duplicates df2's values for duplicated column_name rows in df1, then use a mask to find those duplicate rows and assign NaN to their df2 columns.
import numpy as np

mask = df1['column_name'].duplicated()  # True for every repeat after the first
new_df = df1.merge(df2, how='left', on='column_name')
new_df.loc[mask, df2.columns[df2.columns != 'column_name']] = np.nan
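A runnable version of this approach, with made-up data:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'column_name': ['a', 'a', 'b'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'column_name': ['a', 'b'], 'y': [10, 20]})

mask = df1['column_name'].duplicated()            # [False, True, False]
new_df = df1.merge(df2, how='left', on='column_name')
# blank out the df2 columns on every row after the first occurrence of the key
new_df.loc[mask, df2.columns[df2.columns != 'column_name']] = np.nan
```

Only the first 'a' row keeps its merged y value; the second 'a' row has y set back to NaN.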

is there a function to delete (in df1) the different rows in two df?

I have two DataFrames of different lengths. In df1 I need to keep just the rows that are also in df2, and delete all the rows that are not in df2.
I used the expression below, which finds the rows of df1 that are not in df2, but I wasn't able to delete those rows from df1.
df1[~(df1['F_Code'].isin(df2['Codice']))]
I think you are looking for this command:
pd.merge(df1, df2, how='inner')
You can read more about pandas.merge here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
This is quite like the SQL JOIN keyword.
Looks like you want to merge two dataframes on columns with different column names. In that case, this is what you need:
import pandas as pd
# Create Example data
df1 = pd.DataFrame({'F_Code': [1,2, 8], 'a': [3,4,5]})
df2 = pd.DataFrame({'Codice': [1,2,3], 'b': [1,2,5]})
df = pd.merge(df1, df2, how='inner',
              left_on='F_Code', right_on='Codice').drop(columns=df2.columns)
result:
F_Code a
0 1 3
1 2 4

Filter pandas dataframe columns based on other dataframe

I have two dataframes df1 and df2. df1 gives some numerical data on some elements (A,B,C ...) while df2 is a dataframe acting like a classification table with its index being the column names of df1. I would like to filter df1 by only keeping columns that are matching a certain classification in df2.
For instance, let's assume the following two dataframes and that I only want to keep elements (i.e. columns of df1) that belong to class 'C1':
df1 = pd.DataFrame({'A': [1,2],'B': [3,4],'C': [5,6]},index=[0, 1])
df2 = pd.DataFrame({'Name': ['A','B','C'],'Class': ['C1','C1','C2'],'Subclass': ['C11','C12','C21']},index=[0, 1, 2])
df2 = df2.set_index('Name')
The expected result is the dataframe df1 with only columns A and B, because in df2 we can see that A and B are in class C1. I'm not sure how to do that. I was thinking about first filtering df2 by 'C1' values in its 'Class' column and then checking whether df1.columns are in df2.index, but I suppose there is a more efficient way. Thanks for your help.
Here is one way using index slice
df1.loc[:,df2.index[df2.Class=='C1']]
Out[578]:
Name A B
0 1 3
1 2 4
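Putting the pieces together, a runnable sketch using the question's example frames (with the Subclass strings quoted):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
df2 = pd.DataFrame({'Name': ['A', 'B', 'C'],
                    'Class': ['C1', 'C1', 'C2'],
                    'Subclass': ['C11', 'C12', 'C21']}).set_index('Name')

# df2.index[df2['Class'] == 'C1'] yields the Index ['A', 'B'];
# .loc then slices df1 down to exactly those columns
filtered = df1.loc[:, df2.index[df2['Class'] == 'C1']]
```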
