I have a dataframe df that looks like this:
PO SO Date Name Qty
0 123 34 2020-01-05 Carl 5
1 111 55 2020-10-10 Beth 7
2 123 12 2020-02-03 Greg 11
3 101 55 2019-12-03 Carl 3
4 123 34 2020-11-30 Beth 24
5 111 55 2019-04-02 Greg 6
6 202 99 2020-05-06 Beth 19
What I would like to do is replace dates with the minimum date for the dataframe when grouped by PO and SO. For instance, there are two rows with a PO of '123' and an SO of '34'. Since the minimum Date among these rows is '2020-01-05', both rows should have their Date column set to '2020-01-05'.
Thus the result would look like this:
PO SO Date Name Qty
0 123 34 2020-01-05 Carl 5
1 111 55 2019-04-02 Beth 7
2 123 12 2020-02-03 Greg 11
3 101 55 2019-12-03 Carl 3
4 123 34 2020-01-05 Beth 24
5 111 55 2019-04-02 Greg 6
6 202 99 2020-05-06 Beth 19
You can use transform with groupby to create a "calculated column", so that you can avoid a messy merge:
import pandas as pd

df = pd.DataFrame({'PO': [123, 111, 123, 101, 123, 111, 202],
'SO': [34, 55, 12, 55, 34, 55, 99],
'Date': ['2020-01-05', '2020-10-10', '2020-02-03', '2019-12-03', '2020-11-30', '2019-04-02', '2020-05-06'],
'Name': ['Carl', 'Beth', 'Greg', 'Carl', 'Beth', 'Greg', 'Beth'],
'Qty': [5, 7, 11, 3, 24, 6, 19]})
df_grouped = df.copy()
df_grouped['Date'] = df_grouped.groupby(['PO', 'SO'])['Date'].transform('min')
df_grouped
Out[1]:
PO SO Date Name Qty
0 123 34 2020-01-05 Carl 5
1 111 55 2019-04-02 Beth 7
2 123 12 2020-02-03 Greg 11
3 101 55 2019-12-03 Carl 3
4 123 34 2020-01-05 Beth 24
5 111 55 2019-04-02 Greg 6
6 202 99 2020-05-06 Beth 19
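Note that Date here holds ISO-formatted strings, so taking the group 'min' happens to compare them correctly. If your dates come in a different format, converting them to real datetimes first is safer; a minimal sketch of that variation (my addition, reusing the df above):
# convert to real datetimes so 'min' compares dates rather than strings
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = df.groupby(['PO', 'SO'])['Date'].transform('min')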
Alternatively, you can do this with an explicit merge. First, build a lookup table containing PO, SO, and the minimum Date for each combination of PO and SO, using groupby with min.
import pandas as pd
df = pd.DataFrame({'PO': [123, 111, 123, 101, 123, 111, 202],
'SO': [34, 55, 12, 55, 34, 55, 99],
'Date': ['2020-01-05', '2020-10-10', '2020-02-03', '2019-12-03', '2020-11-30', '2019-04-02', '2020-05-06'],
'Name': ['Carl', 'Beth', 'Greg', 'Carl', 'Beth', 'Greg', 'Beth'],
'Qty': [5, 7, 11, 3, 24, 6, 19]})
df_grouped = df[['PO', 'SO', 'Date']].groupby(by=['PO', 'SO'], as_index=False, dropna=False).min()
print(df_grouped)
PO SO Date
0 101 55 2019-12-03
1 111 55 2019-04-02
2 123 12 2020-02-03
3 123 34 2020-01-05
4 202 99 2020-05-06
Now we can merge this with the original dataframe, replacing the old Date column with the Date column from df_grouped.
df = pd.merge(df.drop(columns=['Date']), df_grouped, on=['PO', 'SO'])
df = df[['PO', 'SO', 'Date', 'Name', 'Qty']] # reset column order
print(df)
PO SO Date Name Qty
0 123 34 2020-01-05 Carl 5
1 123 34 2020-01-05 Beth 24
2 111 55 2019-04-02 Beth 7
3 111 55 2019-04-02 Greg 6
4 123 12 2020-02-03 Greg 11
5 101 55 2019-12-03 Carl 3
6 202 99 2020-05-06 Beth 19
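One side effect of the merge is that the rows come back grouped by PO and SO rather than in their original order. If the original order matters, you can carry the index through the merge; a sketch (my addition, starting again from the original df and the df_grouped computed above):
out = (df.drop(columns='Date')
         .reset_index()                       # remember each row's original position
         .merge(df_grouped, on=['PO', 'SO'])
         .sort_values('index')                # restore the original order
         .drop(columns='index')
         .reset_index(drop=True))
out = out[['PO', 'SO', 'Date', 'Name', 'Qty']]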
Related
I have the following dataframe:
import pandas as pd
array = {'test_ID': [10, 13, 10, 13, 16],
'test_date': ['2010-09-05', '2010-10-23', '2011-09-12', '2010-05-05', '2010-06-01'],
'Value1': [40, 56, 23, 78, 67],
'Value2': [25, 0, 68, 0, 0]}
df = pd.DataFrame(array)
df
test_ID test_date Value1 Value2
0 10 2010-09-05 40 25
1 13 2010-10-23 56 0
2 10 2011-09-12 23 68
3 13 2010-05-05 78 0
4 16 2010-06-01 67 0
I would like to delete column 'Value2' and fold its values into column 'Value1', but only where Value2 != 0.
The expected output is:
test_ID test_date Value1
0 10 2010-09-05 40
1 99 2010-09-05 25
2 13 2010-10-23 56
3 10 2011-09-12 23
4 99 2011-09-12 68
5 13 2010-05-05 78
6 16 2010-06-01 67
Use DataFrame.set_index with DataFrame.stack to reshape, remove the 0 values, turn the result back into a 3-column DataFrame, and set test_ID to 99 where the value originally came from Value2:
s = df.set_index(['test_ID','test_date']).stack()
df = s[s.ne(0)].reset_index(name='Value1')
df['test_ID'] = df['test_ID'].mask(df.pop('level_2').eq('Value2'), 99)
print (df)
test_ID test_date Value1
0 10 2010-09-05 40
1 99 2010-09-05 25
2 13 2010-10-23 56
3 10 2011-09-12 23
4 99 2011-09-12 68
5 13 2010-05-05 78
6 16 2010-06-01 67
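To see why this works, it can help to look at the stacked Series before the zero rows are filtered out (my illustration, re-run on the original df from the question):
s = df.set_index(['test_ID', 'test_date']).stack()
print(s.head(4))
# test_ID  test_date
# 10       2010-09-05  Value1    40
#                      Value2    25
# 13       2010-10-23  Value1    56
#                      Value2     0
# dtype: int64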
Another solution uses DataFrame.melt and removes the 0 rows with DataFrame.loc:
df = (df.melt(['test_ID','test_date'], value_name='Value1', ignore_index=False)
.assign(test_ID = lambda x: x['test_ID'].mask(x.pop('variable').eq('Value2'), 99))
.sort_index()
.loc[lambda x: x['Value1'].ne(0)]
.reset_index(drop=True))
print (df)
test_ID test_date Value1
0 10 2010-09-05 40
1 99 2010-09-05 25
2 13 2010-10-23 56
3 10 2011-09-12 23
4 99 2011-09-12 68
5 13 2010-05-05 78
6 16 2010-06-01 67
Here is a simple solution that filters on the non-zero values.
df = pd.DataFrame(array)
# copy so the assignments below do not raise SettingWithCopyWarning
filtered_rows = df.loc[df["Value2"] != 0].copy()
filtered_rows.loc[:, 'Value1'] = filtered_rows.loc[:, 'Value2']
filtered_rows.loc[:, 'test_ID'] = 99
df = pd.concat([df, filtered_rows]).sort_index().drop(['Value2'], axis=1)
This gives us the expected data:
test_ID test_date Value1
0 10 2010-09-05 40
0 99 2010-09-05 25
1 13 2010-10-23 56
2 10 2011-09-12 23
2 99 2011-09-12 68
3 13 2010-05-05 78
4 16 2010-06-01 67
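If you prefer a clean 0..n index instead of the repeated index values above, one more step takes care of it (my addition):
# optional: rebuild a clean RangeIndex after the concat
df = df.reset_index(drop=True)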
I have a dataframe (df),
df = pd.DataFrame({
'ID': ['James', 'James', 'James','Max', 'Max', 'Max', 'Max','Park','Tom', 'Tom', 'Tom', 'Tom','Wong'],
'From_num': [78, 420, 'Started', 298, 36, 298, 'Started', 'Started', 60, 520, 99, 'Started', 'Started'],
'To_num': [96, 78, 420, 36, 78, 36, 298, 311, 150, 520, 78, 99, 39],
'Date': ['2020-05-12', '2020-02-02', '2019-06-18',
'2019-06-20', '2019-01-30', '2018-10-23',
'2018-08-29', '2020-05-21', '2019-11-22',
'2019-08-26', '2018-12-11', '2018-10-09', '2019-02-01']})
And it is like this:
ID From_num To_num Date
0 James 78 96 2020-05-12
1 James 420 78 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-06-20
4 Max 36 78 2019-01-30
5 Max 298 36 2018-10-23
6 Max Started 298 2018-08-29
7 Park Started 311 2020-05-21
8 Tom 60 150 2019-11-22
9 Tom 520 520 2019-08-26
10 Tom 99 78 2018-12-11
11 Tom Started 99 2018-10-09
12 Wong Started 39 2019-02-01
For each person (grouping by 'ID'), I wish to create a new duplicate row above the first row of the group. The values of the created row in columns 'ID', 'From_num' and 'To_num' should be the same as that first row, but the 'Date' value should be the first row's Date plus one day. For James, for example, the newly created row is 'James', '78', '96', '2020-05-13'. The rest of the data stays the same, so my expected result is:
ID From_num To_num Date
0 James 78 96 2020-05-13 # row added, Date + 1
1 James 78 96 2020-05-12
2 James 420 78 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-06-21 # row added, Date + 1
5 Max 298 36 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park Started 311 2020-05-22 # Row added, Date + 1
10 Park Started 311 2020-05-21
11 Tom 60 150 2019-11-23 # Row added, Date + 1
12 Tom 60 150 2019-11-22
13 Tom 520 520 2019-08-26
14 Tom 99 78 2018-12-11
15 Tom Started 99 2018-10-09
16 Wong Started 39 2019-02-02 # Row added Date + 1
17 Wong Started 39 2019-02-01
I would like the order/sequence to be the same as in my expected result. If you have any good ideas, please help. Many thanks.
Use:
df['Date'] = pd.to_datetime(df['Date'])
df['order'] = df.groupby('ID').cumcount().add(1)
df1 = (
df.groupby('ID', as_index=False).first()
.assign(Date=lambda x: x['Date'] + pd.Timedelta(days=1), order=0)
)
df1 = pd.concat([df, df1]).sort_values(['ID', 'order'], ignore_index=True).drop(columns='order')
Details:
Convert the Date column to a pandas datetime series, then use DataFrame.groupby on column ID with groupby.cumcount to impose a total ordering within each group in the dataframe.
print(df)
ID From_num To_num Date order
0 James 78 96 2020-05-12 1
1 James 420 78 2020-02-02 2
2 James Started 420 2019-06-18 3
3 Max 298 36 2019-06-20 1
4 Max 36 78 2019-01-30 2
5 Max 298 36 2018-10-23 3
6 Max Started 298 2018-08-29 4
7 Park Started 311 2020-05-21 1
8 Tom 60 150 2019-11-22 1
9 Tom 520 520 2019-08-26 2
10 Tom 99 78 2018-12-11 3
11 Tom Started 99 2018-10-09 4
12 Wong Started 39 2019-02-01 1
Create a new dataframe df1 by using DataFrame.groupby on column ID, aggregating with groupby.first, assigning order=0, and incrementing Date by a pd.Timedelta of 1 day.
print(df1)
ID From_num To_num Date order
0 James 78 96 2020-05-13 0 # Date incremented by 1 day
1 Max 298 36 2019-06-21 0 # and order set to 0
2 Park Started 311 2020-05-22 0
3 Tom 60 150 2019-11-23 0
4 Wong Started 39 2019-02-02 0
Finally, use pd.concat to concatenate the dataframes df and df1, use DataFrame.sort_values to sort on columns ID and order, and drop the helper order column.
print(df1)
ID From_num To_num Date
0 James 78 96 2020-05-13
1 James 78 96 2020-05-12
2 James 420 78 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-06-21
5 Max 298 36 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park Started 311 2020-05-22
10 Park Started 311 2020-05-21
11 Tom 60 150 2019-11-23
12 Tom 60 150 2019-11-22
13 Tom 520 520 2019-08-26
14 Tom 99 78 2018-12-11
15 Tom Started 99 2018-10-09
16 Wong Started 39 2019-02-02
17 Wong Started 39 2019-02-01
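A shorter route is also possible. This is my own sketch rather than part of the answer above (assuming a fresh df as defined in the question): take each group's first row, add one day to its Date, and let a stable index sort slot the copy back in directly above that row.
df['Date'] = pd.to_datetime(df['Date'])
# first row of each ID, copied so the original frame is untouched
first = df.groupby('ID', sort=False).head(1).copy()
first['Date'] += pd.Timedelta(days=1)
# 'first' is concatenated in front, so a stable sort on the index keeps each copy above its original row
out = pd.concat([first, df]).sort_index(kind='stable').reset_index(drop=True)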
I have a dataframe below
df = pd.DataFrame({
'ID': ['James', 'James', 'James', 'James',
'Max', 'Max', 'Max', 'Max', 'Max',
'Park', 'Park','Park', 'Park',
'Tom', 'Tom', 'Tom', 'Tom'],
'From_num': [578, 420, 420, 'Started', 298, 78, 36, 298, 'Started', 28, 28, 311, 'Started', 60, 520, 99, 'Started'],
'To_num': [96, 578, 578, 420, 36, 298, 78, 36, 298, 112, 112, 28, 311, 150, 60, 520, 99],
'Date': ['2020-05-12', '2020-02-02', '2020-02-01', '2019-06-18',
'2019-08-26', '2019-06-20', '2019-01-30', '2018-10-23',
'2018-08-29', '2020-05-21', '2020-05-20', '2019-11-22',
'2019-04-12', '2019-10-16', '2019-08-26', '2018-12-11', '2018-10-09']})
and it is like this:
ID From_num To_num Date
0 James 578 96 2020-05-12
1 James 420 578 2020-02-02
2 James 420 578 2020-02-01 # Drop this duplicated row (ignore Date)
3 James Started 420 2019-06-18
4 Max 298 36 2019-08-26
5 Max 78 298 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park 28 112 2020-05-21
10 Park 28 112 2020-05-20 # Drop this duplicate row (ignore date)
11 Park 311 28 2019-11-22
12 Park Started 311 2019-04-12
13 Tom 60 150 2019-10-16
14 Tom 520 60 2019-08-26
15 Tom 99 520 2018-12-11
16 Tom Started 99 2018-10-09
There are some consecutive duplicated rows (ignoring the Date value) within each 'ID' (Name); e.g. lines 1 and 2 for James both have From_num 420, and the same goes for lines 9 and 10. I wish to drop the 2nd duplicated row and keep the first. I wrote loop conditions, but they are very redundant and slow; I assume there is an easier way to do this, so please help if you have ideas. Many thanks. The expected result is like this:
ID From_num To_num Date
0 James 578 96 2020-05-12
1 James 420 578 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-08-26
4 Max 78 298 2019-06-20
5 Max 36 78 2019-01-30
6 Max 298 36 2018-10-23
7 Max Started 298 2018-08-29
8 Park 28 112 2020-05-21
9 Park 311 28 2019-11-22
10 Park Started 311 2019-04-12
11 Tom 60 150 2019-10-16
12 Tom 520 60 2019-08-26
13 Tom 99 520 2018-12-11
14 Tom Started 99 2018-10-09
It's a bit late, but does this do what you wanted? This drops consecutive duplicates ignoring "Date".
t = df[['ID', 'From_num', 'To_num']]
df[(t.ne(t.shift())).any(axis=1)]
ID From_num To_num Date
0 James 578 96 2020-05-12
1 James 420 578 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-08-26
5 Max 78 298 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park 28 112 2020-05-21
11 Park 311 28 2019-11-22
12 Park Started 311 2019-04-12
13 Tom 60 150 2019-10-16
14 Tom 520 60 2019-08-26
15 Tom 99 520 2018-12-11
16 Tom Started 99 2018-10-09
This drops rows with index values 2 and 10.
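Because rows 2 and 10 are dropped, the index keeps gaps; if a clean index is wanted, reset it at the end (my addition):
result = df[(t.ne(t.shift())).any(axis=1)].reset_index(drop=True)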
Compare each row with the row above it, then invert the boolean mask to get your result:
cond1 = df.ID.eq(df.ID.shift())
cond2 = df.From_num.eq(df.From_num.shift())
cond = cond1 & cond2
df.loc[~cond].reset_index(drop=True)
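If To_num should also be part of the duplicate check (the question only ignores Date), the same pattern extends to all three columns; a sketch of my own:
cols = ['ID', 'From_num', 'To_num']
# a row is a consecutive duplicate only if all three columns repeat the previous row
cond = df[cols].eq(df[cols].shift()).all(axis=1)
df.loc[~cond].reset_index(drop=True)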
Alternative (a longer route):
(
df.assign(
temp=df.groupby(["ID", "From_num"]).From_num.transform("size"),
check=lambda x: (x.From_num.eq(x.From_num.shift())) &
(x.temp.eq(x.temp.shift())),
)
.query("check == 0")
.drop(["temp", "check"], axis=1)
)
It seems to me that's exactly what DataFrame.drop_duplicates does; by default it keeps the first occurrence and drops the rest:
unique_df = df.drop_duplicates(['ID', 'From_num', 'To_num'])
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
EDIT
As mentioned in the question, only consecutive rows should be processed. To do so, I propose to flag them first and then run drop_duplicates with the flag column included in the subset (I'm not sure if it's the best solution):
import numpy as np

df['original_index'] = np.nan
for i in range(1, len(df)):
    prev, curr = df.index[i - 1], df.index[i]
    # if the current row equals the previous one (ignoring Date)
    if (df.loc[prev, 'ID'] == df.loc[curr, 'ID']
            and df.loc[prev, 'From_num'] == df.loc[curr, 'From_num']
            and df.loc[prev, 'To_num'] == df.loc[curr, 'To_num']):
        # reuse the original index if it has already been set on the previous row
        if pd.notnull(df.loc[prev, 'original_index']):
            df.loc[curr, 'original_index'] = df.loc[prev, 'original_index']
        else:
            # otherwise set it to the previous row's index for both rows
            df.loc[prev, 'original_index'] = prev
            df.loc[curr, 'original_index'] = prev
Now we add the column 'original_index' to the drop_duplicates subset:
unique_df = df.drop_duplicates(['ID', 'From_num', 'To_num', 'original_index'])
df.groupby(['ID', 'From_num', 'To_num']).first().reset_index()
Edit: this will remove duplicates even if they are not consecutive, e.g. rows 4 and 7 in the original df.
Update
cols=['ID', 'From_num', 'To_num']
df.loc[(df[cols].shift() != df[cols]).any(axis=1)].shape
dummy_df = pd.DataFrame({
'accnt' : [101, 102, 103, 104, 101, 102, 103, 104, 101, 102, 103, 104, 101, 102, 103, 104, 101, 102, 103, 104],
'value' : [10, 20, 30, 40, 5, 2, 6, 48, 22, 23, 24, 25, 18, 25, 26, 14, 78, 72, 54, 6],
'category' : [1,1,1,1,2,2,2,2,1,1,2,2,3,3,3,3,1,3,2,3]
})
dummy_df
accnt value category
101 10 1
102 20 1
103 30 1
104 40 1
101 5 2
102 2 2
103 6 2
104 48 2
101 22 1
102 23 1
103 24 2
104 25 2
101 18 3
102 25 3
103 26 3
104 14 3
101 78 1
102 72 3
103 54 2
104 6 3
I want to get a dataframe like below:
accnt sum_val_c1 count_c1 sum_val_c2 count_c2 sum_val_c3 count_c3
101 110 3 5 1 18 1
102 43 2 2 1 97 2
103 30 1 84 3 26 1
104 40 1 73 2 20 2
That is, counting the occurrences of each category into count_c#, summing the values of that category into sum_val_c#, and grouping by accnt. I have tried using pivot() and groupby(), but I know I'm missing something.
Use groupby, agg, and unstack:
u = dummy_df.groupby(['accnt', 'category'])['value'].agg(['sum', 'count']).unstack(1)
u.columns = u.columns.map('{0[0]}_c{0[1]}'.format)
u
sum_c1 sum_c2 sum_c3 count_c1 count_c2 count_c3
accnt
101 110 5 18 3 1 1
102 43 2 97 2 1 2
103 30 84 26 1 3 1
104 40 73 20 1 2 2
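If you want the column order from the question (sum and count interleaved per category), a stable sort on the category suffix gets you there; my addition, reusing the u computed above:
# reorder columns as sum_c1, count_c1, sum_c2, count_c2, ...
u = u[sorted(u.columns, key=lambda c: c.rsplit('_c', 1)[1])]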
Similarly, with pivot_table,
u = dummy_df.pivot_table(index=['accnt'],
columns='category',
values='value',
aggfunc=['sum', 'count'])
u.columns = u.columns.map('{0[0]}_c{0[1]}'.format)
u
sum_c1 sum_c2 sum_c3 count_c1 count_c2 count_c3
accnt
101 110 5 18 3 1 1
102 43 2 97 2 1 2
103 30 84 26 1 3 1
104 40 73 20 1 2 2
Pandas has a method to do that.
pivot2 = dummy_df.pivot_table(values='value', index='accnt', columns='category', aggfunc=['count', 'sum'])
That returns a dataframe like this:
count sum
category 1 2 3 1 2 3
accnt
101 3 1 1 110 5 18
102 2 1 2 43 2 97
103 1 3 1 30 84 26
104 1 2 2 40 73 20
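The columns of pivot2 form a MultiIndex of (aggregation, category) pairs; if flat names like count_c1 and sum_c1 are preferred, they can be joined up (a sketch, my addition):
# flatten the ('count', 1) / ('sum', 1) column MultiIndex into single labels
pivot2.columns = [f'{agg}_c{cat}' for agg, cat in pivot2.columns]
pivot2 = pivot2.reset_index()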
I have a Pandas timeseries:
days = pd.DatetimeIndex([
'2011-01-01T00:00:00.000000000',
'2011-01-02T00:00:00.000000000',
'2011-01-03T00:00:00.000000000',
'2011-01-04T00:00:00.000000000',
'2011-01-05T00:00:00.000000000',
'2011-01-06T00:00:00.000000000',
'2011-01-07T00:00:00.000000000',
'2011-01-08T00:00:00.000000000',
'2011-01-09T00:00:00.000000000',
'2011-01-11T00:00:00.000000000',
'2011-01-12T00:00:00.000000000',
'2011-01-13T00:00:00.000000000',
'2011-01-14T00:00:00.000000000',
'2011-01-16T00:00:00.000000000',
'2011-01-18T00:00:00.000000000',
'2011-01-19T00:00:00.000000000',
'2011-01-21T00:00:00.000000000',
])
counts = [85, 97, 24, 64, 3, 37, 73, 86, 87, 82, 75, 84, 43, 51, 42, 3, 70]
df = pd.DataFrame(counts,
index=days,
columns=['count'],
)
df['day of the week'] = df.index.dayofweek
And it looks like this:
count day of the week
2011-01-01 85 5
2011-01-02 97 6
2011-01-03 24 0
2011-01-04 64 1
2011-01-05 3 2
2011-01-06 37 3
2011-01-07 73 4
2011-01-08 86 5
2011-01-09 87 6
2011-01-11 82 1
2011-01-12 75 2
2011-01-13 84 3
2011-01-14 43 4
2011-01-16 51 6
2011-01-18 42 1
2011-01-19 3 2
2011-01-21 70 4
Notice that some days are missing; these should be filled with zeros. I want to convert this so it looks like a calendar: the rows are successive weeks, the columns are days of the week, and the values are the count for that particular day. So the end result should look like:
0 1 2 3 4 5 6
0 0 0 0 0 0 85 97
1 24 64 3 37 73 86 87
2 0 82 75 84 0 0 51
3 0 42 3 0 70 0 0
# create a week number: start a new week whenever the day of the week decreases
df['weeks'] = (df['day of the week'].diff() < 0).cumsum()
# pivot the table (newer pandas versions require keyword arguments here)
df.pivot(index='weeks', columns='day of the week', values='count').fillna(0).astype(int)
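The diff-based week counter works as long as no calendar week is missing entirely from the data. A variant of my own that derives the week number from the dates themselves is sketched below:
# week number relative to the week containing the first date in the index
start = df.index.min()
df['weeks'] = ((df.index - start).days + start.dayofweek) // 7
df.pivot(index='weeks', columns='day of the week', values='count').fillna(0).astype(int)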