Pandas drop consecutive duplicate rows only, ignoring specific columns - python

I have a dataframe below
df = pd.DataFrame({
'ID': ['James', 'James', 'James', 'James',
'Max', 'Max', 'Max', 'Max', 'Max',
'Park', 'Park','Park', 'Park',
'Tom', 'Tom', 'Tom', 'Tom'],
'From_num': [578, 420, 420, 'Started', 298, 78, 36, 298, 'Started', 28, 28, 311, 'Started', 60, 520, 99, 'Started'],
'To_num': [96, 578, 578, 420, 36, 298, 78, 36, 298, 112, 112, 28, 311, 150, 60, 520, 99],
'Date': ['2020-05-12', '2020-02-02', '2020-02-01', '2019-06-18',
'2019-08-26', '2019-06-20', '2019-01-30', '2018-10-23',
'2018-08-29', '2020-05-21', '2020-05-20', '2019-11-22',
'2019-04-12', '2019-10-16', '2019-08-26', '2018-12-11', '2018-10-09']})
and it is like this:
ID From_num To_num Date
0 James 578 96 2020-05-12
1 James 420 578 2020-02-02
2 James 420 578 2020-02-01 # Drop this duplicated row (ignore Date)
3 James Started 420 2019-06-18
4 Max 298 36 2019-08-26
5 Max 78 298 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park 28 112 2020-05-21
10 Park 28 112 2020-05-20 # Drop this duplicate row (ignore date)
11 Park 311 28 2019-11-22
12 Park Started 311 2019-04-12
13 Tom 60 150 2019-10-16
14 Tom 520 60 2019-08-26
15 Tom 99 520 2018-12-11
16 Tom Started 99 2018-10-09
There are some consecutive duplicated values (ignoring the Date value) within each 'ID' (name), e.g. rows 1 and 2 for James both have From_num 420, and the same goes for rows 9 and 10. I wish to drop the second duplicated row and keep the first. I wrote loop conditions, but they are very redundant and slow; I assume there is an easier way to do this, so please help if you have ideas. Many thanks. The expected result is like this:
ID From_num To_num Date
0 James 578 96 2020-05-12
1 James 420 578 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-08-26
4 Max 78 298 2019-06-20
5 Max 36 78 2019-01-30
6 Max 298 36 2018-10-23
7 Max Started 298 2018-08-29
8 Park 28 112 2020-05-21
9 Park 311 28 2019-11-22
10 Park Started 311 2019-04-12
11 Tom 60 150 2019-10-16
12 Tom 520 60 2019-08-26
13 Tom 99 520 2018-12-11
14 Tom Started 99 2018-10-09

It's a bit late, but does this do what you wanted? This drops consecutive duplicates ignoring "Date".
t = df[['ID', 'From_num', 'To_num']]
df[(t.ne(t.shift())).any(axis=1)]
ID From_num To_num Date
0 James 578 96 2020-05-12
1 James 420 578 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-08-26
5 Max 78 298 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park 28 112 2020-05-21
11 Park 311 28 2019-11-22
12 Park Started 311 2019-04-12
13 Tom 60 150 2019-10-16
14 Tom 520 60 2019-08-26
15 Tom 99 520 2018-12-11
16 Tom Started 99 2018-10-09
This drops rows with index values 2 and 10.
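If you need this in more than one place, the same idea can be wrapped in a small helper; a minimal sketch (the function name drop_consecutive_duplicates is just for illustration):
def drop_consecutive_duplicates(df, subset):
    # keep only rows that differ from the immediately preceding row on `subset`
    sub = df[subset]
    return df[sub.ne(sub.shift()).any(axis=1)]

result = drop_consecutive_duplicates(df, ['ID', 'From_num', 'To_num']).reset_index(drop=True)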

Compare each row with the row above it, then invert the boolean mask to get your result (To_num is compared as well, so only fully identical consecutive rows are dropped):
cond1 = df.ID.eq(df.ID.shift())
cond2 = df.From_num.eq(df.From_num.shift())
cond3 = df.To_num.eq(df.To_num.shift())
cond = cond1 & cond2 & cond3
df.loc[~cond].reset_index(drop=True)
Alternative, longer route:
(
    df.assign(
        temp=df.groupby(["ID", "From_num"]).From_num.transform("size"),
        check=lambda x: (x.From_num.eq(x.From_num.shift()))
                        & (x.temp.eq(x.temp.shift())),
    )
    .query("check == 0")
    .drop(["temp", "check"], axis=1)
)

It seems to me that's exactly what DataFrame.drop_duplicates does; by default it keeps the first occurrence and drops the rest:
unique_df = df.drop_duplicates(['ID', 'From_num', 'To_num'])
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
EDIT
As mentioned in the question, only consecutive rows should be processed. To do so, I propose to flag them first and then run drop_duplicates on a subset that includes the flag column (I'm not sure it's the best solution):
# every row starts out flagged with its own index
df['original_index'] = df.index
indices = df.index
for i in range(1, len(indices)):
    prev, curr = indices[i - 1], indices[i]
    # if the current row equals the previous one (ignoring Date)
    if (df.loc[prev, 'ID'] == df.loc[curr, 'ID']
            and df.loc[prev, 'From_num'] == df.loc[curr, 'From_num']
            and df.loc[prev, 'To_num'] == df.loc[curr, 'To_num']):
        # inherit the previous row's original index so the whole
        # consecutive run shares the same flag
        df.loc[curr, 'original_index'] = df.loc[prev, 'original_index']
Now we include the 'original_index' column in drop_duplicates:
unique_df = df.drop_duplicates(['ID', 'From_num', 'To_num', 'original_index'])
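For larger frames the Python loop can be avoided. A hedged sketch of the same flagging idea, building a run id that increases whenever any of the key columns changes (run_id is just an illustrative name):
cols = ['ID', 'From_num', 'To_num']
# a new run starts whenever any of the key columns changes from the previous row
run_id = df[cols].ne(df[cols].shift()).any(axis=1).cumsum()
unique_df = df[~run_id.duplicated()]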

df.groupby(['ID', 'From_num', 'To_num']).first().reset_index()
Edit - This will remove duplicates even if they are not consecutive, e.g. rows 4 and 7 in the original df.
Update
cols=['ID', 'From_num', 'To_num']
df.loc[(df[cols].shift() != df[cols]).any(axis=1)]

Related

Apply results of pandas groupby to multiple rows

I have a dataframe df that looks like this:
PO SO Date Name Qty
0 123 34 2020-01-05 Carl 5
1 111 55 2020-10-10 Beth 7
2 123 12 2020-02-03 Greg 11
3 101 55 2019-12-03 Carl 3
4 123 34 2020-11-30 Beth 24
5 111 55 2019-04-02 Greg 6
6 202 99 2020-05-06 Beth 19
What I would like to do is replace dates with the minimum date for the dataframe when grouped by PO and SO. For instance, there are two rows with a PO of '123' and an SO of '34'. Since the minimum Date among these rows is '2020-01-05', both rows should have their Date column set to '2020-01-05'.
Thus the result would look like this:
PO SO Date Name Qty
0 123 34 2020-01-05 Carl 5
1 111 55 2019-04-02 Beth 7
2 123 12 2020-02-03 Greg 11
3 101 55 2019-12-03 Carl 3
4 123 34 2020-01-05 Beth 24
5 111 55 2019-04-02 Greg 6
6 202 99 2020-05-06 Beth 19
You can use transform with groupby to create a "calculated column", so that you can avoid a messy merge:
df = pd.DataFrame({'PO': [123, 111, 123, 101, 123, 111, 202],
'SO': [34, 55, 12, 55, 34, 55, 99],
'Date': ['2020-01-05', '2020-10-10', '2020-02-03', '2019-12-03', '2020-11-30', '2019-04-02', '2020-05-06'],
'Name': ['Carl', 'Beth', 'Greg', 'Carl', 'Beth', 'Greg', 'Beth'],
'Qty': [5, 7, 11, 3, 24, 6, 19]})
df_grouped = df.copy()
df_grouped['Date'] = df_grouped.groupby(['PO', 'SO'])['Date'].transform('min')
df_grouped
Out[1]:
PO SO Date Name Qty
0 123 34 2020-01-05 Carl 5
1 111 55 2019-04-02 Beth 7
2 123 12 2020-02-03 Greg 11
3 101 55 2019-12-03 Carl 3
4 123 34 2020-01-05 Beth 24
5 111 55 2019-04-02 Greg 6
6 202 99 2020-05-06 Beth 19
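One caveat worth hedging: Date here is a plain string column, and 'min' happens to work only because ISO-formatted dates sort lexicographically; converting to real datetimes first is safer if the format ever changes:
df['Date'] = pd.to_datetime(df['Date'])
df_grouped = df.copy()
df_grouped['Date'] = df_grouped.groupby(['PO', 'SO'])['Date'].transform('min')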
In order to accomplish this, we will create a key using PO, SO, and the minimum Date for each combination of PO and SO. We use groupby with min to accomplish this.
import pandas as pd
df = pd.DataFrame({'PO': [123, 111, 123, 101, 123, 111, 202],
'SO': [34, 55, 12, 55, 34, 55, 99],
'Date': ['2020-01-05', '2020-10-10', '2020-02-03', '2019-12-03', '2020-11-30', '2019-04-02', '2020-05-06'],
'Name': ['Carl', 'Beth', 'Greg', 'Carl', 'Beth', 'Greg', 'Beth'],
'Qty': [5, 7, 11, 3, 24, 6, 19]})
df_grouped = df[['PO', 'SO', 'Date']].groupby(by=['PO', 'SO'], as_index=False, dropna=False).min()
print(df_grouped)
PO SO Date
0 101 55 2019-12-03
1 111 55 2019-04-02
2 123 12 2020-02-03
3 123 34 2020-01-05
4 202 99 2020-05-06
Now we can merge this with the original dataframe, replacing the old Date column with the Date column from df_grouped.
df = pd.merge(df.drop(columns=['Date']), df_grouped, on=['PO', 'SO'])
df = df[['PO', 'SO', 'Date', 'Name', 'Qty']] # reset column order
print(df)
PO SO Date Name Qty
0 123 34 2020-01-05 Carl 5
1 123 34 2020-01-05 Beth 24
2 111 55 2019-04-02 Beth 7
3 111 55 2019-04-02 Greg 6
4 123 12 2020-02-03 Greg 11
5 101 55 2019-12-03 Carl 3
6 202 99 2020-05-06 Beth 19

Pandas: Create a new row within each group with conditions

I have a dataframe (df),
df = pd.DataFrame({
'ID': ['James', 'James', 'James','Max', 'Max', 'Max', 'Max','Park','Tom', 'Tom', 'Tom', 'Tom','Wong'],
'From_num': [78, 420, 'Started', 298, 36, 298, 'Started', 'Started', 60, 520, 99, 'Started', 'Started'],
'To_num': [96, 78, 420, 36, 78, 36, 298, 311, 150, 520, 78, 99, 39],
'Date': ['2020-05-12', '2020-02-02', '2019-06-18',
'2019-06-20', '2019-01-30', '2018-10-23',
'2018-08-29', '2020-05-21', '2019-11-22',
'2019-08-26', '2018-12-11', '2018-10-09', '2019-02-01']})
And it is like this:
ID From_num To_num Date
0 James 78 96 2020-05-12
1 James 420 78 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-06-20
4 Max 36 78 2019-01-30
5 Max 298 36 2018-10-23
6 Max Started 298 2018-08-29
7 Park Started 311 2020-05-21
8 Tom 60 150 2019-11-22
9 Tom 520 520 2019-08-26
10 Tom 99 78 2018-12-11
11 Tom Started 99 2018-10-09
12 Wong Started 39 2019-02-01
For each person (grouped by 'ID'), I wish to create a new duplicate of the group's first row. The values of the created row in the 'ID', 'From_num' and 'To_num' columns should be the same as the original first row, but the 'Date' value should be the old first row's Date plus one day, e.g. for James the newly created row is 'James', '78', '96', '2020-05-13'; the same applies to the rest of the data. So my expected result is:
ID From_num To_num Date
0 James 78 96 2020-05-13 # row added, Date + 1
1 James 78 96 2020-05-12
2 James 420 78 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-06-21 # row added, Date + 1
5 Max 298 36 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park Started 311 2020-05-22 # Row added, Date + 1
10 Park Started 311 2020-05-21
11 Tom 60 150 2019-11-23 # Row added, Date + 1
12 Tom 60 150 2019-11-22
13 Tom 520 520 2019-08-26
14 Tom 99 78 2018-12-11
15 Tom Started 99 2018-10-09
16 Wong Started 39 2019-02-02 # Row added Date + 1
17 Wong Started 39 2019-02-01
I wish the order/sequence to be the same as in my expected result. If you have any good ideas, please help. Many thanks.
Use:
df['Date'] = pd.to_datetime(df['Date'])
df['order'] = df.groupby('ID').cumcount().add(1)
df1 = (
df.groupby('ID', as_index=False).first()
.assign(Date=lambda x: x['Date'] + pd.Timedelta(days=1), order=0)
)
df1 = pd.concat([df, df1]).sort_values(['ID', 'order'], ignore_index=True).drop(columns='order')
Details:
Convert the Date column to a pandas datetime series, then use DataFrame.groupby on column ID with groupby.cumcount to impose a total ordering within each group of the dataframe.
print(df)
ID From_num To_num Date order
0 James 78 96 2020-05-12 1
1 James 420 78 2020-02-02 2
2 James Started 420 2019-06-18 3
3 Max 298 36 2019-06-20 1
4 Max 36 78 2019-01-30 2
5 Max 298 36 2018-10-23 3
6 Max Started 298 2018-08-29 4
7 Park Started 311 2020-05-21 1
8 Tom 60 150 2019-11-22 1
9 Tom 520 520 2019-08-26 2
10 Tom 99 78 2018-12-11 3
11 Tom Started 99 2018-10-09 4
12 Wong Started 39 2019-02-01 1
Create a new dataframe df1 by using DataFrame.groupby on column ID, aggregating with groupby.first, assigning order=0 and incrementing Date by a pd.Timedelta of 1 day.
print(df1)
ID From_num To_num Date order
0 James 78 96 2020-05-13 0 # Date incremented by 1 day
1 Max 298 36 2019-06-21 0 # and order set to 0
2 Park Started 311 2020-05-22 0
3 Tom 60 150 2019-11-23 0
4 Wong Started 39 2019-02-02 0
Use pd.concat to concatenate the dataframes df and df1, then DataFrame.sort_values to sort the result on columns ID and order (and finally drop order).
print(df1)
ID From_num To_num Date
0 James 78 96 2020-05-13
1 James 78 96 2020-05-12
2 James 420 78 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-06-21
5 Max 298 36 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park Started 311 2020-05-22
10 Park Started 311 2020-05-21
11 Tom 60 150 2019-11-23
12 Tom 60 150 2019-11-22
13 Tom 520 520 2019-08-26
14 Tom 99 78 2018-12-11
15 Tom Started 99 2018-10-09
16 Wong Started 39 2019-02-02
17 Wong Started 39 2019-02-01
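As a hedged alternative sketch of the same idea (not part of the answer above; the names first_rows and out are just for illustration): take each group's first row with groupby.head(1), bump its Date by a day, and give it a slightly smaller index so it sorts in just before the original first row:
df['Date'] = pd.to_datetime(df['Date'])
first_rows = df.groupby('ID').head(1).copy()
first_rows['Date'] = first_rows['Date'] + pd.Timedelta(days=1)
# a fractional index places each new row just before its group's first row
first_rows.index = first_rows.index - 0.5
out = pd.concat([df, first_rows]).sort_index().reset_index(drop=True)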

Pandas pivot table subtotals with multi-index

I'm trying to create a simple pivot table with subtotals, Excel-style; however, I can't find a method using pandas. I've tried the solution Wes suggested in another subtotal-related question, but that doesn't give the expected results. Below are the steps to reproduce it:
Create the sample data:
sample_data = {'customer': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B'], 'product': ['astro','ball','car','astro','ball', 'car', 'astro', 'ball', 'car','astro','ball','car'],
'week': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
'qty': [10, 15, 20, 40, 20, 34, 300, 20, 304, 23, 45, 23]}
df = pd.DataFrame(sample_data)
Create the pivot table with margins (it only has the grand total, not subtotals by customer (A, B)):
piv = df.pivot_table(index=['customer','product'],columns='week',values='qty',margins=True,aggfunc=np.sum)
week 1 2 All
customer product
A astro 10 300 310
ball 15 20 35
car 20 304 324
B astro 40 23 63
ball 20 45 65
car 34 23 57
All 139 715 854
Then I tried the method Wes McKinney mentioned in another thread, using the stack function:
piv2 = df.pivot_table(index='customer',columns=['week','product'],values='qty',margins=True,aggfunc=np.sum)
piv2.stack('product')
The result has the format I want, but the rows with "All" don't have the sums:
week 1 2 All
customer product
A NaN NaN 669.0
astro 10.0 300.0 NaN
ball 15.0 20.0 NaN
car 20.0 304.0 NaN
B NaN NaN 185.0
astro 40.0 23.0 NaN
ball 20.0 45.0 NaN
car 34.0 23.0 NaN
All NaN NaN 854.0
astro 50.0 323.0 NaN
ball 35.0 65.0 NaN
car 54.0 327.0 NaN
How can I make it work as it would in Excel (sample below), with all the subtotals and totals working? What am I missing?
[Excel sample image]
Just to note, I am able to make it work using for loops, filtering by customer on each iteration and concatenating afterwards, but I hope there is a more direct solution. Thank you.
You can do it in one step, but you have to be strategic about the index name due to alphabetical sorting:
piv = df.pivot_table(index=['customer','product'],
columns='week',
values='qty',
margins=True,
margins_name='Total',
aggfunc=np.sum)
(pd.concat([piv,
piv.query('customer != "Total"')
.sum(level=0)
.assign(product='total')
.set_index('product', append=True)])
.sort_index())
Output:
week 1 2 Total
customer product
A astro 10 300 310
ball 15 20 35
car 20 304 324
total 45 624 669
B astro 40 23 63
ball 20 45 65
car 34 23 57
total 94 91 185
Total 139 715 854
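Note that DataFrame.sum(level=0) is deprecated in recent pandas and removed in 2.0; a sketch of the same step for newer versions simply swaps it for groupby(level=0).sum():
(pd.concat([piv,
            piv.query('customer != "Total"')
               .groupby(level=0).sum()
               .assign(product='total')
               .set_index('product', append=True)])
   .sort_index())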
@Scott Boston's answer is perfect and elegant. For reference, if you group just the customers and pd.concat() the results, you get the following:
piv = df.pivot_table(index=['customer','product'],columns='week',values='qty',margins=True,aggfunc=np.sum)
piv3 = df.pivot_table(index=['customer'],columns='week',values='qty',margins=True,aggfunc=np.sum)
piv4 = pd.concat([piv, piv3], axis=0)
piv4
week 1 2 All
(A, astro) 10 300 310
(A, ball) 15 20 35
(A, car) 20 304 324
(B, astro) 40 23 63
(B, ball) 20 45 65
(B, car) 34 23 57
(All, ) 139 715 854
A 45 624 669
B 94 91 185
All 139 715 854
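If the mixed tuple/scalar index is a problem, a hedged sketch is to wrap piv3's per-customer rows in a matching MultiIndex, labelled 'total' so they sort after the products (the grand-total row from piv will still sort between the customers because of the alphabetical index, as the previous answer notes):
piv3_sub = piv3.drop('All')
piv3_sub.index = pd.MultiIndex.from_arrays(
    [piv3_sub.index, ['total'] * len(piv3_sub)],
    names=piv.index.names)
piv4 = pd.concat([piv, piv3_sub]).sort_index()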

Compare each row of Pandas df1 with every row within df2 and return string value from closest matching column

I have two data frames.
df1 includes 4 men and 4 women with their weight and height (inches).
#df1
John, 236, 76
Jack, 204, 74
Jim, 156, 71
Jared, 182, 72
Suzy, 119, 60
Sally, 149, 66
Sharon, 169, 65
Sammy, 182, 75
df2 includes 4 men and 4 women with their weight and height (inches).
#df2
Aaron, 285, 77
Abe, 236, 75
Alex, 178, 72
Adam, 195, 71
Mary, 148, 66
Maylee, 155, 66
Marilyn, 199, 65
Madison, 160, 73
What I am trying to do is compare the men from df1 with the men from df2 to see who they are most like based on height and weight: subtract weight from weight and height from height, take the absolute differences for each man in df2, and return the name of the most similar man.
So in this case John's closest match is Abe, so in a new column
df1['doppelganger'] = "Abe".
I'm a beginner hobbyist so even pointing me in the right direction would be helpful. I've been looking through stack overflow for about five hours trying to figure out how to go about something like this.
First it is necessary to distinguish men from women, so a new column g is created by repeating 'm' and 'f' four times each. Then DataFrame.merge with an outer join on this new column builds all combinations, new columns are added for the absolute differences, and a last column sums them. The rows are then sorted by three columns with DataFrame.sort_values, so the first row per group of A and g can be kept with DataFrame.drop_duplicates:
df = (df1.assign(g = ['m']*4 + ['f']*4)
.merge(df2.assign(g = ['m']*4 + ['f']*4), on='g', how='outer', suffixes=('','_'))
.assign(dif1 = lambda x: x['B'].sub(x['B_']).abs(),
dif2 = lambda x: x['C'].sub(x['C_']).abs(),
sumdiff = lambda x: x['dif1'] + x['dif2'])
.sort_values(['A', 'g','sumdiff'])
.drop_duplicates(['A','g'])
.sort_index()
.rename(columns={'A_':'doppelganger'})
)
print (df)
A B C g doppelganger B_ C_ dif1 dif2 sumdiff
1 John 236 76 m Abe 236 75 0 1 1
7 Jack 204 74 m Adam 195 71 9 3 12
10 Jim 156 71 m Alex 178 72 22 1 23
14 Jared 182 72 m Alex 178 72 4 0 4
16 Suzy 119 60 f Mary 148 66 29 6 35
20 Sally 149 66 f Mary 148 66 1 0 1
25 Sharon 169 65 f Maylee 155 66 14 1 15
31 Sammy 182 75 f Madison 160 73 22 2 24
Input DataFrames:
print (df1)
A B C
0 John 236 76
1 Jack 204 74
2 Jim 156 71
3 Jared 182 72
4 Suzy 119 60
5 Sally 149 66
6 Sharon 169 65
7 Sammy 182 75
print (df2)
A B C
0 Aaron 285 77
1 Abe 236 75
2 Alex 178 72
3 Adam 195 71
4 Mary 148 66
5 Maylee 155 66
6 Marilyn 199 65
7 Madison 160 73
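For reference, a hedged numpy sketch of the same nearest-neighbour lookup without the full cross merge (closest_match is an illustrative helper, and it assumes the first four rows of each frame are the men, as above):
import numpy as np

def closest_match(sub1, sub2):
    # |weight difference| + |height difference| between every pair of rows
    dist = (np.abs(sub1['B'].to_numpy()[:, None] - sub2['B'].to_numpy())
            + np.abs(sub1['C'].to_numpy()[:, None] - sub2['C'].to_numpy()))
    return sub2['A'].to_numpy()[dist.argmin(axis=1)]

df1['doppelganger'] = np.concatenate([
    closest_match(df1.iloc[:4], df2.iloc[:4]),  # men
    closest_match(df1.iloc[4:], df2.iloc[4:]),  # women
])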

Subtract/Add existing values if contents of one dataframe is present in another using pandas

Here are 2 dataframes
df1:
Index Number Name Amount
0 123 John 31
1 124 Alle 33
2 312 Amy 33
3 314 Holly 35
df2:
Index Number Name Amount
0 312 Amy 13
1 124 Alle 35
2 317 Jack 53
The resulting dataframe should look like this
result_df:
Index Number Name Amount Curr_amount
0 123 John 31 31
1 124 Alle 33 68
2 312 Amy 33 46
3 314 Holly 35 35
4 317 Jack NaN 53
I have tried using pandas isin, but it only tells me in boolean form whether the Number value is present or not. Is there any way to do this efficiently?
Use merge with an outer join and then Series.add (or Series.sub if necessary):
df = df1.merge(df2, on=['Number','Name'], how='outer', suffixes=('','_curr'))
df['Amount_curr'] = df['Amount_curr'].add(df['Amount'], fill_value=0)
print (df)
Number Name Amount Amount_curr
0 123 John 31.0 31.0
1 124 Alle 33.0 68.0
2 312 Amy 33.0 46.0
3 314 Holly 35.0 35.0
4 317 Jack NaN 53.0
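A hedged alternative sketch that avoids the suffixed column: align the two Amount series on (Number, Name) and add them (s and out are just illustrative names):
s = (df1.set_index(['Number', 'Name'])['Amount']
        .add(df2.set_index(['Number', 'Name'])['Amount'], fill_value=0))
out = (df1.set_index(['Number', 'Name'])
          .reindex(s.index)
          .assign(Curr_amount=s)
          .reset_index())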
