Drop only specified amount of duplicates pandas [duplicate] - python

pandas' drop_duplicates function accepts keep="first", keep="last", or keep=False. I want to be able to keep N duplicates: instead of keeping just one (with "first" or "last") or none (with False), I want to keep a certain number of the duplicates.
Any help is appreciated!

Something like this could work, but you haven't specified whether you are using one or more column(s) to deduplicate:
n = 3
df.groupby('drop_dup_col').head(n)
This keeps the first three rows for each value of the column, counted from the top (head) of the dataframe. If you want to start from the bottom of the df, you can use .tail(n) instead.
Change n to the number of rows you want to keep and change 'drop_dup_col' to the name of the column you are using to dedup your df.
Multiple columns can be specified in groupby using:
df.groupby(['col1','col5'])
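As a minimal sketch of the multi-column case, the example below keeps at most n rows per distinct combination of two columns. The sample data and the 'value' column are made up for illustration; 'col1' and 'col5' are simply the names used above.

```python
import pandas as pd

# Hypothetical sample data; only the column names 'col1' and 'col5' come from the answer.
df = pd.DataFrame({
    "col1": ["a", "a", "a", "a", "b", "b"],
    "col5": [1, 1, 1, 1, 2, 2],
    "value": [10, 11, 12, 13, 20, 21],
})

n = 3
# Keep at most n rows for each distinct (col1, col5) pair, in original row order.
kept = df.groupby(["col1", "col5"]).head(n)
print(kept)
```

The fourth ("a", 1) row is dropped, while the ("b", 2) group has fewer than n rows and is kept whole.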
Regarding the question in your comment:
It's a bit harder to implement: if you want to delete, say, 3 duplicates, there must be at least 3 duplicates of that value; otherwise, when only 2 duplicates occur, both are deleted from the data and no row is kept.
n = 3
# Count how many rows share each value of the dedup column
df['dup_count'] = df.groupby('drop_dup_col')['drop_dup_col'].transform('size')
# Rows whose value occurs at least n times
df2 = df.loc[df['dup_count'] >= n]
# Those rows now appear twice, so keep=False removes every copy of them
df3 = pd.concat([df, df2])
df3 = df3.drop_duplicates(keep=False)
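One reading of the snippet above is that it discards every value that occurs at least n times and keeps the rest. Under that assumption, the same result can be sketched with a boolean mask and no concat step. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data: "a" occurs 3 times, "b" twice, "c" once.
df = pd.DataFrame({
    "drop_dup_col": ["a", "a", "a", "b", "b", "c"],
    "value": range(6),
})

n = 3
# Size of the group each row belongs to, aligned with the original rows.
group_size = df.groupby("drop_dup_col")["drop_dup_col"].transform("size")
# Keep only rows whose value occurs fewer than n times.
small_groups = df[group_size < n]
print(small_groups)
```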

I believe a combination of groupby and tail(N) should work for this.
In this case, if you want to keep 4 duplicates in df['myColumnDuplicates']:
df.groupby('myColumnDuplicates').tail(4)
To be more precise, and to complement @Stijn's answer:
tail(n) keeps the last n duplicated values found, while head(n) keeps the first n duplicated values.
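The difference between the two can be seen on a small, made-up frame. The 'order' column is only there to show which rows survive.

```python
import pandas as pd

# Hypothetical data: "x" appears three times, "y" once.
df = pd.DataFrame({
    "myColumnDuplicates": ["x", "x", "x", "y"],
    "order": [1, 2, 3, 4],
})

# head(2) keeps the first two rows of each group, tail(2) the last two.
first_two = df.groupby("myColumnDuplicates").head(2)
last_two = df.groupby("myColumnDuplicates").tail(2)

print(first_two["order"].tolist())
print(last_two["order"].tolist())
```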


Pandas - drop rows based on two conditions on different columns

Although there are several related pandas questions already answered, I cannot solve this issue. I have a large dataframe (~49000 rows) and want to drop the rows (~120) that meet two conditions at the same time:
For one column: an exact string
For another column: a NaN value
My code is ignoring the conditions and no row is removed.
to_remove = ['string1', 'string2']
df.drop(df[df['Column 1'].isin(to_remove) & (df['Column 2'].isna())].index, inplace=True)
What am I doing wrong? Thanks for any hint!
Instead of calling drop and passing the index, you can build a boolean mask for the rows you want to remove, negate it with ~, and keep only the remaining rows. Also, there seems to be a logic error: you are checking two different conditions combined by AND for the same column values.
df[~(df['Column 1'].isin(to_remove) & df['Column 2'].isna())]
Also, if you need to check both conditions in the same column, then you probably want to combine them with or, i.e. |
If needed, you can call reset_index at the end.
Also, as a side note, your list to_remove has two identical string values; I'm assuming that's a typo in the question.
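A small self-contained sketch of the mask approach follows. The data is invented, and the column names are taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data using the question's column names.
df = pd.DataFrame({
    "Column 1": ["string1", "string1", "other", "string2"],
    "Column 2": [np.nan, 1.0, np.nan, np.nan],
})
to_remove = ["string1", "string2"]

# True for rows that match a listed string AND have NaN in the other column.
mask = df["Column 1"].isin(to_remove) & df["Column 2"].isna()

# Keep everything else and renumber the index.
cleaned = df[~mask].reset_index(drop=True)
print(cleaned)
```

Only rows that satisfy both conditions are removed. A row with a listed string but a non-NaN value, or a NaN value but an unlisted string, is kept.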

Find the sum of a column by grouping two columns [duplicate]

For this dataset, I want to find the sum of Value(£) for each combination of the three columns Year, Length Group and Port of Landing. So for example, one sum value will be for the year 2016, the Length Group 10m&Under and the Port of Landing Aberdaran.
Given your response to @berkayln, I think you want to project that column back onto your original dataframe...
Does this suit your need ?
df['sumPerYearLengthGroupPortOfLanding'] = df.groupby(['Year', 'Length Group', 'Port of Landing'])['Value(£)'].transform('sum')
You can try this one:
dataframe.groupby(['Year','Length Group','Port of Landing'])['Value(£)'].sum()
That should work.
You can use pd.DataFrame.groupby to aggregate the data.
# Change the order if you want a different hierarchy
grp_cols = ["Year", "Length Group", "Port of Landing"]
df.groupby(grp_cols)["Value(£)"].sum()
You can also do them one-by-one as such:
for col in grp_cols:
    print(df.groupby(col)["Value(£)"].sum())
You can also use .loc to get 2016 only.
df.loc[df["Year"] == 2016, "Value(£)"].sum()
The pd.DataFrame.groupby functionality also lets you aggregate with functions other than .sum, including custom functions that operate on the sub-dataframes.
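To make the options above concrete, here is a minimal sketch on invented data. The rows, the "Over10m" group, and the output names 'group_total', 'total' and 'spread' are assumptions for illustration. It contrasts an aggregated result, a per-row projection with transform, and a custom aggregation function.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the question.
df = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017],
    "Length Group": ["10m&Under", "10m&Under", "Over10m", "10m&Under"],
    "Port of Landing": ["Aberdaran"] * 4,
    "Value(£)": [100.0, 50.0, 30.0, 70.0],
})
grp_cols = ["Year", "Length Group", "Port of Landing"]

# One row per combination of the three columns.
totals = df.groupby(grp_cols)["Value(£)"].sum().reset_index()

# Same totals, repeated on every original row.
df["group_total"] = df.groupby(grp_cols)["Value(£)"].transform("sum")

# Named aggregation with a built-in and a custom function.
summary = df.groupby(grp_cols)["Value(£)"].agg(
    total="sum",
    spread=lambda s: s.max() - s.min(),
)
print(totals)
print(summary)
```

Use sum() or agg() when you want one row per group, and transform() when the result has to line up with the original rows.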

concatenate 2 dataframes while matching multiple columns [duplicate]

I have 2 almost identical pandas dataframes with 5 common columns.
I want to add the second dataframe to the first which has a new column.
Dataframe 1
Dataframe 2
But I want it to update the same row given that columns 'Lot name', 'wafer' and 'site' match (green). If the columns do not match, I want to have the value of NaN as shown below.
Desired output
I have to do this with over 160 discrete columns but with possible matching Lot name, WAFER and SITE values.
I have tried the various merge options (left, right, outer) and concat, but just can't seem to get it right. Any help or comments are appreciated.
Edit, follow up question:
I am trying to use this in a loop, where each iteration generates a new dataframe assigned to TEMP that needs to be merged with the previous dataframe. I cannot merge with an empty dataframe as it gives a merge error. How can I achieve this?
alldata = pd.DataFrame()
for i in range(len(operation)):
    temp = data[data['OPE_NO'].isin([operation[i]])]
    temp = temp[temp['PARAM_NAME'].isin([parameter[i]])]
    temp = temp.reset_index(drop=True)
    temp = temp[["LOT",'Lot name','WAFER',"SITE","PRODUCT",'PARAM_VALUE_NUMBER']]
    temp = temp.rename(columns={'PARAM_VALUE_NUMBER':'PMRM28LEMCKLYTFR.1~'+operation[i]+'~'+parameter[i]})
    alldata.merge(temp,how='outer')
The example can be done with the following code:
df1.merge(df2, how="outer")
If I have misunderstood the problem, please let me know.
My English is not good, but I am happy to help.
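For the follow-up question about merging inside a loop, one possible approach (a sketch, not tested against the original data) is to collect each iteration's dataframe in a list and merge them afterwards, which avoids merging into an empty dataframe. The frames, the key list and the 'param_1' / 'param_2' column names below are hypothetical stand-ins for the per-iteration temp frames.

```python
from functools import reduce

import pandas as pd

# Key columns that must match for two rows to be combined.
keys = ["Lot name", "WAFER", "SITE"]

# Hypothetical stand-ins for the dataframes produced by each loop iteration.
frames = [
    pd.DataFrame({"Lot name": ["A", "A"], "WAFER": [1, 2], "SITE": [1, 1],
                  "param_1": [0.1, 0.2]}),
    pd.DataFrame({"Lot name": ["A", "B"], "WAFER": [1, 1], "SITE": [1, 1],
                  "param_2": [5.0, 6.0]}),
]

# Outer-merge the frames pairwise; unmatched key combinations get NaN.
alldata = reduce(
    lambda left, right: left.merge(right, on=keys, how="outer"),
    frames,
)
print(alldata)
```

In the loop itself this would mean appending temp to a list on each pass. Note also that merge returns a new dataframe, so its result has to be assigned to have any effect.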

Python Pandas - Dataframe column gets swallowed when I add two columns from second Dataframe [duplicate]

I have two dataframes df and df2 with contents as follows
dataframe df
dataframe df2
I'd like to add to df the two columns from df2, "NUMSESSIONS_ANDROID" and "AVGSESSDUR_ANDROID".
I do this as follows:
df['NUMSESSIONS_ANDROID'] = df2['NUMSESSIONS_ANDROID']
df['AVGSESSDUR_ANDROID'] = df2['AVGSESSDUR_ANDROID']
However when I print the resulting df I see ... in place of AVGSESSDUR_IOS (i.e. it appears to have swallowed that column)
Appreciate any help resolving this ....
As ALollz stated, the fact you are seeing ... in the output means there's "hidden" data that is part of the dataframe, but not showing in your console or IDE. However you can perform an easy print to check all the columns that your dataframe contains with:
print(list(df))
This will show you the names of all the columns in your df, so you can check whether the ones you want are there.
Furthermore, you can print a specific column as a series (first line) or as a dataframe (second line):
print(df['column_name'])
print(df[['column_name']])
If successful, you will see the series/dataframe; if the column doesn't actually exist in your original dataframe, you will get a KeyError.
Leveraging @ALollz's hint above ...
"The ... indicates that only part of the DataFrame is being shown in your terminal/output, so 'AVGSESSDUR_IOS' is almost certainly still there it's just not shown. You can look at print(df.iloc[:, 0:3]) to see the first 3 columns for instance."
I added the following two lines to increase the number of columns and width of console display and it worked:
pd.set_option('display.max_columns',20)
pd.set_option('display.width', 1000)
print(df.iloc[:,0:5])
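If you would rather not change the display settings globally, pd.option_context applies them only inside a with block. The sketch below uses a made-up 30-column frame; passing None for display.max_columns removes the column limit.

```python
import pandas as pd

# Hypothetical wide frame with 30 columns.
df = pd.DataFrame({f"col_{i}": [i] for i in range(30)})

# The options apply only inside the with block and are restored afterwards.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    text = repr(df)  # the same text that print(df) would show
    print(text)
```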

Python Pandas Dataframes comparison on 2 columns (with where clause)

I'm stuck on a particular Python question here. I have 2 dataframes, DF1 and DF2. In both, I have 2 columns, pID and yID (which are not indexed, just default). I'm looking to add a column Found in DF1 that flags where the respective values of the columns (pID and yID) were found in DF2. Also, I would like to zone in on just the values in DF2 where aID == 'Text'.
I believe the below gets me the 1st part of this question; however, I'm unsure how to incorporate the where.
DF1['Found'] = (DF1[['pID', 'yID']] == DF2[['pID','yID']]).all(axis=1).astype(bool)
Suggestions or answers would be most appreciated. Thanks.
You could subset the second dataframe to the rows where aID == 'Text' to get a reduced DF, then select from it the columns to be compared against the first dataframe.
Use DF.isin() to check whether the values under these column names match. When a DataFrame is passed, isin compares values only where both the index and the column labels match, so each row of df1 is compared with the row of the reduced df2 that has the same index. .all(axis=1) returns True only if both columns are True for a row, and False otherwise. Convert the boolean series to integers via astype(int) and assign the result to the new column, Found.
df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID=="Text"')[['pID', 'yID']]
df1['Found'] = df1_sub.isin(df2_sub).all(axis=1).astype(int)
df1
Demo DF's used:
df1 = pd.DataFrame(dict(pID=[1,2,3,4,5],
                        yID=[10,20,30,40,50]))
df2 = pd.DataFrame(dict(pID=[1,2,8,4,5],
                        yID=[10,12,30,40,50],
                        aID=['Text','Best','Text','Best','Text']))
If it does not matter where those matches occur, merge the two dataframes on the common 'pID' and 'yID' columns. Calling reset_index() on the df1 subset first carries its original index through the merge as a column, which avoids combining on= with right_index=True.
Use the surviving index values, which mark the rows with a match, to assign 1 to a new column named Found, then fill its missing elements with 0.
matched = df1_sub.reset_index().merge(df2_sub, on=['pID', 'yID'])['index']
df1.loc[matched, 'Found'] = 1
df1['Found'] = df1['Found'].fillna(0)
df1 should be modified accordingly after the above steps.
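Another way to flag position-independent matches, offered here as an alternative sketch rather than the answerer's method, is a left merge with indicator=True. It reuses the demo dataframes above; the names 'wanted' and 'merged' are made up.

```python
import pandas as pd

df1 = pd.DataFrame(dict(pID=[1, 2, 3, 4, 5],
                        yID=[10, 20, 30, 40, 50]))
df2 = pd.DataFrame(dict(pID=[1, 2, 8, 4, 5],
                        yID=[10, 12, 30, 40, 50],
                        aID=['Text', 'Best', 'Text', 'Best', 'Text']))

# Unique (pID, yID) pairs from df2 where aID == 'Text'.
wanted = df2.loc[df2['aID'] == 'Text', ['pID', 'yID']].drop_duplicates()

# A left merge keeps every df1 row in order; _merge says whether a pair matched.
merged = df1.merge(wanted, on=['pID', 'yID'], how='left', indicator=True)
df1['Found'] = (merged['_merge'] == 'both').astype(int).to_numpy()
print(df1)
```

Dropping duplicate pairs from the right-hand side first ensures the left merge returns exactly one row per row of df1.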
