Pandas / Python remove duplicates based on specific row values - python

I am trying to remove duplicates based on multiple criteria:
Find duplicates in column df['A']
Check column df['Status'] and prioritize OK over Open and Open over Close
If we have duplicates with the same status, pick the latest one based on df['Col_1']
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['11', '11', '12', np.nan, '13', '13', '14', '14', '15'],
                   'Status': ['OK', 'Close', 'Close', 'OK', 'OK', 'Open', 'Open', 'Open', np.nan],
                   'Col_1': [2000, 2001, 2000, 2000, 2000, 2002, 2000, 2004, 2000]})
df
Expected output:
I have tried different solutions like the one linked below (map or loc) but I am unable to find the correct way:
Pandas : remove SOME duplicate values based on conditions

Create an ordered Categorical to prioritize Status, then sort by all columns, remove duplicates keeping the first row per column A, and finally restore the original order by sorting on the index:
c = ['OK','Open','Close']
df['Status'] = pd.Categorical(df['Status'], ordered=True, categories=c)
df = df.sort_values(['A','Status','Col_1']).drop_duplicates('A').sort_index()
print (df)
A Status Col_1
0 11 OK 2000
2 12 Close 2000
3 NaN OK 2000
4 13 OK 2000
6 14 Open 2000
8 15 NaN 2000
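Because the Categorical is ordered by the given categories list, sorting Status follows the OK, Open, Close priority rather than alphabetical order. A quick illustrative check (not part of the original answer):

import pandas as pd

s = pd.Series(pd.Categorical(['Close', 'OK', 'Open'], ordered=True,
                             categories=['OK', 'Open', 'Close']))
print(s.sort_values().tolist())   # ['OK', 'Open', 'Close']: priority order, not alphabetical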
EDIT: If you need to avoid NaN rows in A being removed as duplicates of each other, add a helper column:
df['test'] = df['A'].isna().cumsum()
c = ['OK','Open','Close']
df['Status'] = pd.Categorical(df['Status'], ordered=True, categories=c)
df = (df.sort_values(['A','Status','Col_1', 'test'])
.drop_duplicates(['A', 'test'])
.sort_index())
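To see why the helper is needed: drop_duplicates treats NaN values in the subset as equal, so two rows with NaN in A would collapse into one. The cumulative counter gives each NaN row its own group. A small illustrative sketch (not from the original answer, with made-up data):

import numpy as np
import pandas as pd

tmp = pd.DataFrame({'A': ['11', np.nan, np.nan], 'Col_1': [2000, 2001, 2002]})
print(tmp.drop_duplicates('A'))            # the second NaN row is dropped
tmp['test'] = tmp['A'].isna().cumsum()     # each NaN in A bumps the counter, so the NaN rows get 1 and 2
print(tmp.drop_duplicates(['A', 'test']))  # all three rows survive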


Find Max of a Group, and Return Another Column's Value

My goal is to group the df by the 'Grouping' column and then find the max sales and the corresponding SKU. There are a lot of sources that explain how to do the groupby and find the max sales, but I am unable to find any that explain what I am doing wrong when adding this additional SKU column.
As you can see below, I am successfully grouping by the 'Grouping' column, and displaying the max Sales per group, but the SKU is incorrect. The SKU with the highest sales for group 1 is 'E2-MKEP', and my result is 'V4-DE5U'.
import pandas as pd
list1 = [
['1', 'V4-DE5U', 956.64],
['1', 'DH-Q9OY', 642.43],
['1', 'E2-MKEP', 1071.6],
['2', 'WL-NOLZ', 389.06],
['2', 'JF-4E3C', 162.69],
['3', 'N9-DABP', 618.96],
['3', 'OO-JBHE', 1451.19],
]
cols = ['Grouping', 'SKU', 'Sales']
df = pd.DataFrame(list1, columns = cols)
df1 = df.groupby(['Grouping']).agg(max)[['Sales', 'SKU']]
print(df1)
result:
Sales SKU
Grouping
1 1071.60 V4-DE5U
2 389.06 WL-NOLZ
3 1451.19 OO-JBHE
agg(max) takes the maximum of each column independently, so the SKU you get back is just the alphabetically largest SKU in each group, not the SKU of the row with the highest Sales. Instead, you could transform 'max', create a boolean mask and filter df:
msk = df.groupby(['Grouping'])['Sales'].transform('max') == df['Sales']
out = df.loc[msk, ['Sales', 'SKU']]
Output:
Sales SKU
2 1071.60 E2-MKEP
3 389.06 WL-NOLZ
6 1451.19 OO-JBHE
If you want the "Grouping" column as well, you could use msk on df without selecting columns:
out = df[msk]
Output:
Grouping SKU Sales
2 1 E2-MKEP 1071.60
3 2 WL-NOLZ 389.06
6 3 OO-JBHE 1451.19
You can sort the "Sales" values in descending order and then drop the "Grouping" duplicates (note this starts from the original df, not the aggregated df1):
df1 = df.sort_values('Sales', ascending=False).drop_duplicates(['Grouping']).sort_values('Grouping')
df1.reset_index(drop=True, inplace=True)
>>> df1
Grouping SKU Sales
0 1 E2-MKEP 1071.60
1 2 WL-NOLZ 389.06
2 3 OO-JBHE 1451.19
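Yet another common idiom (a hedged alternative, not shown in the answers above) is to take the index of the maximum Sales per group with idxmax and select those rows:

out = df.loc[df.groupby('Grouping')['Sales'].idxmax()]
print(out)
#   Grouping      SKU    Sales
# 2        1  E2-MKEP  1071.60
# 3        2  WL-NOLZ   389.06
# 6        3  OO-JBHE  1451.19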

How to replace values in one DataFrame in Python with values from another DataFrame where the dates (in datetime format) match? [duplicate]

Given I have two DataFrames:
import pandas as pd
df1 = pd.DataFrame([['2017', '1'],
['2018', '2'],
['2019', '3'],
['2020', '4'],
['2021', '5'],
['2022', '6'],
], columns=['datetime', 'values'])
df2 = pd.DataFrame([['2018', '0'],
['2019', '0'],
['2020', '0'],
], columns=['datetime', 'values'])
print(df1)
print(df2)
(Assume the values in the column 'datetime' have datetime format and are not strings)
How can I replace the values in df1 with the values of df2 where the datetime exists in both, without using loops?
You can use combine_first after temporarily setting the index to whatever you want to use as matching columns:
(df2.set_index('datetime')
.combine_first(df1.set_index('datetime'))
.reset_index()
)
output:
datetime values
0 2017 1
1 2018 0
2 2019 0
3 2020 0
4 2021 5
5 2022 6
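A hedged alternative (not from the answer above): map df2's values onto df1 by datetime and fall back to df1's original value where there is no match:

s = df2.set_index('datetime')['values']
df1['values'] = df1['datetime'].map(s).fillna(df1['values'])

This produces the same result as combine_first above and modifies df1 in place.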

Groupby, apply function and combine results in dataframe

I would like to group the ids by the Type column and apply a function on each grouped stock that returns the first row where the Value column is not NaN and copies it into a separate data frame.
I got the following so far:
dummy data:
df1 = {'Date': ['04.12.1998','05.12.1998','06.12.1998','04.12.1998','05.12.1998','06.12.1998'],
'Type': [1,1,1,2,2,2],
'Value': ['NaN', 100, 120, 'NaN', 'NaN', 20]}
df2 = pd.DataFrame(df1, columns = ['Date', 'Type', 'Value'])
print (df2)
Date Type Value
0 04.12.1998 1 NaN
1 05.12.1998 1 100
2 06.12.1998 1 120
3 04.12.1998 2 NaN
4 05.12.1998 2 NaN
5 06.12.1998 2 20
import pandas as pd
selectedStockDates = {'Date': [], 'Type': [], 'Values': []}
selectedStockDates = pd.DataFrame(selectedStockDates, columns = ['Date', 'Type', 'Values'])
first_valid_index = df2[['Value']].first_valid_index()
selectedStockDates.loc[df2.index[first_valid_index]] = df2.iloc[first_valid_index]
The code above should work for the first id, but I am struggling to apply this to all ids in the data frame. Does anyone know how to do this?
Let's mask the rows of the dataframe where the value in column Value is NaN, then group the dataframe on Type and aggregate using first:
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')
df2.mask(df2['Value'].isna()).groupby('Type', as_index=False).first()
Type Date Value
0 1.0 05.12.1998 100.0
1 2.0 06.12.1998 20.0
Just use groupby and first but you need to make sure that your null values are np.nan and not strings like they are in your sample data:
df2.groupby('Type')['Value'].first()
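As a hedged variant that also keeps the Date of the first valid row, you could convert the 'NaN' strings to real NaN (as in the first answer) and drop them before grouping:

df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')   # turn the 'NaN' strings into real NaN
out = df2.dropna(subset=['Value']).groupby('Type', as_index=False).first()
print(out)
#    Type        Date  Value
# 0     1  05.12.1998  100.0
# 1     2  06.12.1998   20.0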

Dropping duplicate rows but keeping certain values Pandas

I have 2 similar dataframes that I concatenated that have a lot of repeated values because they are basically the same data set but for different years.
The problem is that one of the sets has some values missing whereas the other sometimes has these values.
For example:
Name   Unit  Year  Level
Nik    1     2000  12
Nik    1           12
John   2     2001  11
John   2     2001  11
Stacy  1           8
Stacy  1     1999  8
.
.
I want to drop duplicates on the subset = ['Name', 'Unit', 'Level'] since some repetitions don't have years.
However, I'm left with the rows that have no Year, and I'd like to keep the rows that do have a Year instead:
Name   Unit  Year  Level
Nik    1     2000  12
John   2     2001  11
Stacy  1     1999  8
.
.
How do I keep these values rather than the blanks?
Use sort_values with the default parameter na_position='last' (so it can be omitted), and then drop_duplicates:
print (df)
Name Unit Year Level
0 Nik 1 NaN 12
1 Nik 1 2000.0 12
2 John 2 2001.0 11
3 John 2 2001.0 11
4 Stacy 1 NaN 8
5 Stacy 1 1999.0 8
subset = ['Name', 'Unit', 'Level']
df = df.sort_values('Year').drop_duplicates(subset)
Or:
df = df.sort_values(subset + ['Year']).drop_duplicates(subset)
print (df)
Name Unit Year Level
5 Stacy 1 1999.0 8
1 Nik 1 2000.0 12
2 John 2 2001.0 11
Another solution with GroupBy.first to return the first non-missing value of Year per group:
df = df.groupby(subset, as_index=False, sort=False)['Year'].first()
print (df)
Name Unit Level Year
0 Nik 1 12 2000.0
1 John 2 11 2001.0
2 Stacy 1 8 1999.0
One solution that comes to mind is to first sort the concatenated dataframe by year with the sort_values function:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
then drop duplicates with keep='first' parameter
df.drop_duplicates(subset=['Name', 'Unit', 'Level'], keep="first")
I would suggest that you look at the creation step of your merged dataset.
When merging the data sets you can do so on multiple indices i.e.
df = pd.merge(left, right, how='outer', on=['Name', 'Unit', 'Level'], suffixes=['', '_r'])
With the outer join you collect all data sets and remove duplicates right away. The only thing left is to merge the Year column which you can do like so:
df['Year'] = df[['Year', 'Year_r']].apply(lambda x: x['Year'] if (x['Year'] is not np.nan and x['Year'] != '') else x['Year_r'], axis=1)
This fills the gaps and afterwards you are able to simply drop the 'Year_r' column.
The benefit here is that not only NaN values of missing years are covered but also missing Years which are represented as empty strings.
Following a small working example:
import pandas as pd
import numpy as np
left = pd.DataFrame({'Name': ['Adam', 'Beatrice', 'Crissy', 'Dumbo', 'Peter', 'Adam'],
'Unit': ['2', '4', '6', '2', '4', '12'],
'Year': ['', '2009', '1954', '2025', '2012', '2024'],
'Level': ['L1', 'L1', 'L0', 'L4', 'L3', 'L10']})
right = pd.DataFrame({'Name': ['Adam', 'Beatrice', 'Crissy', 'Dumbo'],
'Unit': ['2', '4', '6', '2'],
'Year': ['2010', '2009', '1954', '2025'],
'Level': ['L1', 'L1', 'L0', 'L4']})
df = pd.merge(left, right, how='outer', on=['Name', 'Unit', 'Level'], suffixes=['', '_r'])
df['Year'] = df[['Year', 'Year_r']].apply(lambda x: x['Year'] if (x['Year'] is not np.nan and x['Year'] != '') else x['Year_r'], axis=1)
df
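A hedged, vectorized alternative to the apply/lambda line above (assuming missing years are either empty strings or NaN, as in this example):

df['Year'] = df['Year'].replace('', np.nan).fillna(df['Year_r'])
df = df.drop(columns='Year_r')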

Convert row to column header for Pandas DataFrame

The data I have to work with is a bit messy. It has header names inside of its data. How can I choose a row from an existing pandas dataframe and make it (rename it to) a column header?
I want to do something like:
header = df[df['old_header_name1'] == 'new_header_name1']
df.columns = header
In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])
In [22]: df
Out[22]:
0 1 2
0 1 2 3
1 foo bar baz
2 4 5 6
Set the column labels to equal the values in the 2nd row (index location 1):
In [23]: df.columns = df.iloc[1]
If the index has unique labels, you can drop the 2nd row using:
In [24]: df.drop(df.index[1])
Out[24]:
1 foo bar baz
0 1 2 3
2 4 5 6
If the index is not unique, you could use:
In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]:
1 foo bar baz
0 1 2 3
2 4 5 6
Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it's often better to take care that the index is unique (even though Pandas does not require it).
This works (pandas v'0.19.2'):
df.rename(columns=df.iloc[0])
It would be easier to recreate the data frame.
This would also interpret the column types from scratch.
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
To rename the header without reassign df:
df.rename(columns=df.iloc[0], inplace = True)
To drop the row without reassign df:
df.drop(df.index[0], inplace = True)
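Putting the two in-place steps together on a small hypothetical frame whose first row holds the header values (just a sketch, not part of the answer above):

import pandas as pd

df = pd.DataFrame([('foo', 'bar', 'baz'), (1, 2, 3), (4, 5, 6)])

df.rename(columns=df.iloc[0], inplace=True)   # promote the first row's values to column labels
df.drop(df.index[0], inplace=True)            # remove that row from the data
df.reset_index(drop=True, inplace=True)       # optional: renumber the remaining rows

print(df)
#   foo bar baz
# 0   1   2   3
# 1   4   5   6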
You can specify the row index in the read_csv or read_html constructors via the header parameter which represents Row number(s) to use as the column names, and the start of the data. This has the advantage of automatically dropping all the preceding rows which supposedly are junk.
import pandas as pd
from io import StringIO
In[1]
csv = '''junk1, junk2, junk3, junk4, junk5
junk1, junk2, junk3, junk4, junk5
pears, apples, lemons, plums, other
40, 50, 61, 72, 85
'''
df = pd.read_csv(StringIO(csv), header=2)
print(df)
Out[1]
pears apples lemons plums other
0 40 50 61 72 85
Keeping it Python simple
Pandas DataFrames have a columns attribute, so why not use it with standard Python? It is much clearer what you are doing:
table = [['name', 'Rf', 'Rg', 'Rf,skin', 'CRI'],
['testsala.cxf', '86', '95', '92', '87'],
['testsala.cxf: 727037 lm', '86', '95', '92', '87'],
['630.cxf', '18', '8', '11', '18'],
['Huawei stk-lx1.cxf', '86', '96', '88', '83'],
['dedo uv no filtro.cxf', '52', '93', '48', '58']]
import pandas as pd
data = pd.DataFrame(table[1:],columns=table[0])
or, in case the header is not the first row but, for instance, the row at index 10:
columns = table.pop(10)
data = pd.DataFrame(table,columns=columns)
