How to merge two dfs which have duplicates in both - python

I have two dataframes df1 and df2, both of which contain duplicate rows. I want to merge them. What I tried so far is to remove the duplicates from df2, since I need all of the rows from df1.
This question might be a duplicate, but I didn't find any solution/hints for this particular scenario.
import pandas as pd

data = {'Name': ['ABC', 'DEF', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
        'Age': [1, 2, 3, 4, 2, 1, 2, 4]}
data2 = {'Name': ['XYZ', 'NOP', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
         'Sex': ['M', 'F', 'M', 'M', 'M', 'M', 'F', 'M']}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
dfn = df1.merge(df2.drop_duplicates('Name'), on='Name')
print(dfn)
Result of above snippet:
Name Age Sex
0 ABC 1 M
1 ABC 3 M
2 ABC 4 M
3 MNO 4 M
4 XYZ 2 M
5 XYZ 1 M
6 PQR 2 F
This works perfectly well for the data above, but on my large dataset this method behaves differently: I get far more rows than expected in dfn.
I suspect the extra rows come from the larger data and the additional duplicates, but I cannot afford to delete the duplicate rows from df1.
Apologies, I am not able to share the actual data as it is too large!
Edit:
A sample result from the actual data: df2 after removing duplicates, and the resulting dfn. I have only one entry in df1 for both ABC and XYZ.
Thanks in advance!

Try dropping the duplicates from df1 too, and merge with a left join:
dfn = pd.merge(df1.drop_duplicates(), df2.drop_duplicates('Name'),
               on='Name', how='left')
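The row explosion happens because, when the merge key is duplicated on both sides, merge pairs every left occurrence with every right occurrence, so the counts multiply. Below is a small sketch using the frames from the question; the validate argument is optional and simply makes merge raise an error if df2 still contains duplicate names after drop_duplicates:
dfn = df1.merge(df2.drop_duplicates('Name'), on='Name',
                how='left', validate='many_to_one')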

Related

Python Pandas: GROUPBY AND COUNT OF VALUES OF DIFFERENT COLUMNS in minimal steps and in a very fast way

I have a big dataframe with millions of rows and many columns, and I need to do a groupby and count the values of different columns.
I need help coding this efficiently: minimal lines of code that run very fast.
I'm giving a simpler example of my problem below.
Below is my input CSV.
UID,CONTINENT,AGE_GROUP,APPROVAL_STATUS
user1,ASIA,26-30,YES
user10,ASIA,26-30,NO
user11,ASIA,36-40,YES
user12,EUROPE,21-25,NO
user13,AMERICA,31-35,not_confirmed
user14,ASIA,26-30,YES
user15,EUROPE,41-45,not_confirmed
user16,AMERICA,21-25,NO
user17,ASIA,26-30,YES
user18,EUROPE,41-45,NO
user19,AMERICA,31-35,YES
user2,AMERICA,31-35,NO
user20,ASIA,46-50,NO
user21,EUROPE,18-20,not_confirmed
user22,ASIA,26-30,not_confirmed
user23,ASIA,36-40,YES
user24,AMERICA,26-30,YES
user25,EUROPE,36-40,NO
user26,EUROPE,Above 50,NO
user27,ASIA,46-50,YES
user28,AMERICA,31-35,NO
user29,AMERICA,Above 50,not_confirmed
user3,ASIA,36-40,YES
user30,EUROPE,41-45,YES
user4,EUROPE,41-45,NO
user5,ASIA,26-30,not_confirmed
user6,ASIA,46-50,not_confirmed
user7,ASIA,26-30,YES
user8,AMERICA,18-20,YES
user9,EUROPE,31-35,NO
I expect the output to be as below. The output should show:
the CONTINENT column as the main groupby column;
the unique values of the AGE_GROUP and APPROVAL_STATUS columns as separate column names, and the count of each unique value per CONTINENT under the respective output column.
Output:-
CONTINENT,18-20,21-25,26-30,31-35,36-40,41-45,46-50,Above 50,NO,YES,not_confirmed,USER_COUNT
AMERICA,1,1,1,4,0,0,0,1,3,3,2,8
ASIA,0,0,7,0,3,0,3,0,2,8,3,13
EUROPE,1,1,0,1,1,4,0,1,6,1,2,9
Below is how I'm achieving it currently, but this is not an efficient way.
I've also seen that this could be achieved with a pivot table in pandas, but I'm not too sure about it.
import pandas as pd

in_file = "/Users/user1/groupby.csv"
out_file = "/Users/user1/groupby1.csv"
df = pd.read_csv(in_file)
print(df)
df1 = df.groupby(['CONTINENT', 'AGE_GROUP']).size().unstack(fill_value=0).reset_index()
df1 = df1.sort_values(["CONTINENT"], axis=0, ascending=True)
print(df1)
df2 = df.groupby(['CONTINENT', 'APPROVAL_STATUS']).size().unstack(fill_value=0).reset_index()
df2 = df2.sort_values(["CONTINENT"], axis=0, ascending=True)
print(df2)
df3 = df.groupby("CONTINENT").count().reset_index()
df3 = df3[df3.columns[0:2]]
df3.columns = ["CONTINENT", "USER_COUNT"]
df3 = df3.sort_values(["CONTINENT"], axis=0, ascending=True)
df3.reset_index(drop=True, inplace=True)
# df3.to_csv(out_file, index=False)
print(df3)
df2.drop('CONTINENT', axis=1, inplace=True)
df3.drop('CONTINENT', axis=1, inplace=True)
df_final = pd.concat([df1, df2, df3], axis=1)
print(df_final)
df_final.to_csv(out_file, index=False)
Easy solution
Let us use crosstab to calculate the frequency tables, then concat them along the columns axis:
s1 = pd.crosstab(df['CONTINENT'], df['AGE_GROUP'])
s2 = pd.crosstab(df['CONTINENT'], df['APPROVAL_STATUS'])
pd.concat([s1, s2, s2.sum(1).rename('USER_COUNT')], axis=1)
18-20 21-25 26-30 31-35 36-40 41-45 46-50 Above 50 NO YES not_confirmed USER_COUNT
CONTINENT
AMERICA 1 1 1 4 0 0 0 1 3 3 2 8
ASIA 0 0 7 0 3 0 3 0 2 8 3 13
EUROPE 1 1 0 1 1 4 0 1 6 1 2 9
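Since the question mentions pivot tables: the same table can also be built with pivot_table. This is a sketch assuming df is the frame read from the CSV above; aggfunc='size' counts the rows in each CONTINENT/value cell:
age = df.pivot_table(index='CONTINENT', columns='AGE_GROUP',
                     aggfunc='size', fill_value=0)
status = df.pivot_table(index='CONTINENT', columns='APPROVAL_STATUS',
                        aggfunc='size', fill_value=0)
out = pd.concat([age, status, status.sum(axis=1).rename('USER_COUNT')], axis=1)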

Stack the columns based on one column, keeping the ids

I have a DataFrame with 100 columns (only three are shown here), and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd

df = pd.DataFrame()
df['id'] = [1, 2, 3]
df['c1'] = [1, 5, 1]
df['c2'] = [-1, 6, 5]
df
I want to stack the values of the other columns for each id and put them in one column. For example, for id=1 I want to put the values of c1 and c2 (1 and -1) in one column. Here is the DataFrame that I want.
Note: df.melt does not solve my question on its own, since I also want to keep the ids.
Note 2: I already tried stack and reset_index, and it does not help:
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id", then stack + reset_index:
out = (df.set_index('id').stack()
         .droplevel(1).reset_index(name='c'))
Output:
id c
0 1 1
1 1 -1
2 2 5
3 2 6
4 3 1
5 3 5
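For completeness, melt can keep the ids as well if you pass id_vars. A sketch on the same df; the stable sort just groups the rows by id to match the output above:
out = (df.melt(id_vars='id', value_name='c')
         .drop(columns='variable')
         .sort_values('id', kind='mergesort', ignore_index=True))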

Using the items of a df as the header of a different dataframe

I have 2 dataframes
df1= 0 2
1 _A1-Site_0_norm _A1-Site_1_norm
and df2=
0 2
2 0.500000 0.012903
3 0.010870 0.013793
4 0.011494 0.016260
I want to use df1 as the header of df2, so that df1 becomes either the column header or the first row.
1 _A1-Site_0_norm _A1-Site_1_norm
2 0.500000 0.012903
3 0.010870 0.013793
4 0.011494 0.016260
I have many columns, so it will not work to do
df2.columns = ["_A1-Site_0_norm", "_A1-Site_1_norm"]
I thought of making a list of all the items present in df1 and then passing that list to df2.columns, but I am having problems converting the elements in row 1 of df1 into items of a list.
I am not married to that approach; any alternative is welcome.
Many thanks
If I understood your question correctly, then this example should work for you:
import pandas as pd

d = {'A': [1], 'B': [2], 'C': [3]}
df = pd.DataFrame(data=d)
d2 = {'1': ['D'], '2': ['E'], '3': ['F']}
df2 = pd.DataFrame(data=d2)
df.columns = df2.values.tolist()[0]  # this is what you need to implement: use df2's single row as df's column names
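Applied to the question's frames, a sketch assuming df1 holds the single row of header strings and df2 holds the numeric data, as pasted above:
df2.columns = df1.iloc[0].tolist()  # take df1's first (and only) row as df2's column labels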

Appending only new values from a dataframe to another dataframe in pandas

I have a very large data frame and also a small data frame.
Both data frames have the same columns.
The small data frame has some rows that are already present in the big data frame. I want to append the small data frame to the big one such that there are no duplicates in the big data frame.
I can simply append and then remove the duplicates, but this wastes memory by keeping the duplicated data frame in memory.
Is there any other method that can be used to solve this efficiently?
What about isin?
Data:
df1 = pd.DataFrame({'a': [1,2,3,4,5,6,7]})
df2 = pd.DataFrame({'a': [3,4,9]})
Code:
df1.append(df2[~df2['a'].isin(df1['a'])])
Output:
a
0 1
1 2
2 3
3 4
4 5
5 6
6 7
2 9
Data:
df1 = pd.DataFrame({'a': [1,2,3,4,5,6,7]})
df2 = pd.DataFrame({'a': [3,8,4,9]})
Use merge with indicator=True to get the rows of df2 that are not already in df1 (with no on= given, merge joins on all shared columns, i.e. it compares full rows):
df3 = df2.merge(df1, how='left', indicator=True)
a _merge
0 3 both
1 8 left_only
2 4 both
3 9 left_only
Now, select rows with 'left_only',
df3 = df3[df3._merge == 'left_only'].iloc[:, :-1]
Finally, append them.
df1 = pd.concat([df1, df3], ignore_index=True)

Python Pandas - Appending data from multiple data frames onto same row by matching primary identifier, leave blank if no results from that data frame

Very new to Python and pandas; I only use them every once in a while when I'm trying to learn and automate an otherwise tedious Excel task. I've come upon a problem where I haven't been able to find what I'm looking for through Google or here on Stack Overflow.
I currently have 6 different Excel (.xlsx) files that I am able to parse and read into data frames. However, whenever I try to append them together, they're simply added on as new rows in the final output Excel file; instead, I'm trying to append similar data values onto the same row (not the same column) so that I can see whether or not each unique value shows up in these data sets. A shortened example is as follows:
[df1]
0 Col1 Col2
1 XYZ 41235
2 OAIS 15123
3 ABC 48938
[df2]
0 Col1 Col2
1 KFJ 21493
2 XYZ 43782
3 SHIZ 31299
4 ABC 33347
[Expected Output]
0 Col1 [df1] [df2]
1 XYZ 41235 43782
2 OAIS 15123
3 ABC 48938 33347
4 KFJ 21493
5 SHIZ 31299
I've tried a merge; however, the actual data sheets are much more complicated in that I want to append 23 columns of data associated with each unique identifier in each data set. For example, [XYZ] in [df2] has associated information across the next 23 columns that I would want to append after the 23 columns from the [XYZ] values in [df1].
How should I go about that? There are approximately 200 rows in each Excel sheet, and I would essentially only need to loop through until a matching unique identifier was found in [df2] with [df1], then [df3] with [df1], and so on until [df6], appending those columns onto a new dataframe which would eventually be output as a new Excel file.
df1 = pd.read_excel("set1.xlsx")
df2 = pd.read_excel("set2.xlsx")
df3 = pd.read_excel("set3.xlsx")
df4 = pd.read_excel("set4.xlsx")
df5 = pd.read_excel("set5.xlsx")
df6 = pd.read_excel("set6.xlsx")
That is how I am currently reading the Excel files into data frames. I'm sure I could loop it; however, I am unsure of the best practice for doing so instead of hard-coding each initialization of the data frame.
You need merge with the parameter how='outer':
new_df = df1.merge(df2, on = 'Col1',how = 'outer', suffixes=('_df1', '_df2'))
You get
Col1 Col2_df1 Col2_df2
0 XYZ 41235.0 43782.0
1 OAIS 15123.0 NaN
2 ABC 48938.0 33347.0
3 KFJ NaN 21493.0
4 SHIZ NaN 31299.0
For iterative merging, consider storing the data frames in a list and then running a chained merge with reduce(). Below, a list comprehension builds the dataframes from the Excel files, where enumerate() is used to rename Col2 successively to df1, df2, etc.
from functools import reduce
...
dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df'+str(i)})
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], start=1)]
df = reduce(lambda x,y: pd.merge(x, y, on=['Col1'], how='outer'), dfList)
# Col1 df1 df2
# 0 XYZ 41235.0 43782.0
# 1 OAIS 15123.0 NaN
# 2 ABC 48938.0 33347.0
# 3 KFJ NaN 21493.0
# 4 SHIZ NaN 31299.0
Alternatively, use pd.concat and outer-join the dataframes horizontally; for this you need to set Col1 as the index:
dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df'+str(i)}).set_index('Col1')
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], start=1)]
df2 = pd.concat(dfList, axis=1, join='outer', copy=False)\
.reset_index().rename(columns={'index':'Col1'})
# Col1 df1 df2
# 0 ABC 48938.0 33347.0
# 1 KFJ NaN 21493.0
# 2 OAIS 15123.0 NaN
# 3 SHIZ NaN 31299.0
# 4 XYZ 41235.0 43782.0
You can use the merge function:
pd.merge(df1, df2, on=['Col1'])
You can use multiple keys by adding them to the on list.
You can read more about the merge function here.
If you need only certain of the columns, you can get them by:
df1.merge(df2[['Col1', 'Col2']], on=['Col1'])
EDIT:
If you need to loop through several dfs, you can loop through all of them except the first and merge them one by one:
df_list = [df2, df3, df4]
for df in df_list:
    df1 = df1.merge(df[['Col1', 'Col2']], on=['Col1'])
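Since each of the six files in the question carries 23 extra columns per identifier, here is a sketch of a loop that keeps every column and suffixes it with its source file so nothing collides. The file names and the Col1 key come from the question; the suffixing scheme and the output file name are illustrative:
import pandas as pd

files = ["set1.xlsx", "set2.xlsx", "set3.xlsx",
         "set4.xlsx", "set5.xlsx", "set6.xlsx"]

merged = None
for i, f in enumerate(files, start=1):
    frame = pd.read_excel(f)
    # suffix every non-key column so the 23 columns from each file stay distinct
    frame = frame.rename(columns={c: f"{c}_df{i}" for c in frame.columns if c != 'Col1'})
    merged = frame if merged is None else merged.merge(frame, on='Col1', how='outer')
merged.to_excel("combined.xlsx", index=False)  # illustrative output file name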
