Conditionally keep only one of the duplicates in pandas groupby groups - python

I have a dataset in this format (it can be downloaded in CSV format from here):
ID DateAcquired DateSent
1 20210518 20220110
1 20210719 20220210
1 20210719 20220310
1 20200420 20220410
1 20210328 20220510
1 20210518 20220610
2 20210108 20220110
2 20210110 20220210
2 20210119 20220310
2 20210108 20220410
2 20200109 20220510
2 20210919 20220610
2 20211214 20220612
2 20210812 20220620
2 20210909 20220630
2 20200102 20220811
2 20200608 20220909
2 20210506 20221005
2 20210130 20221101
3 20210518 20220110
3 20210519 20220210
3 20210520 20220310
3 20210518 20220410
3 20210611 20220510
3 20210521 20220610
3 20210723 20220612
3 20211211 20220620
4 20210518 20220110
4 20210519 20220210
4 20210520 20220310
4 20210618 20220410
4 20210718 20220510
4 20210818 20220610
5 20210518 20220110
5 20210818 20220210
5 20210918 20220310
5 20211018 20220410
5 20211113 20220510
5 20211218 20220610
5 20210631 20221212
6T 20200102 20201101
6T 20200102 20201101
6T 20200102 20201101
6T 20210405 20220610
6T 20210606 20220611
I am doing groupby:
data.groupby(['ID','DateAcquired'])
For each unique combination of ID and DateAcquired, I want to keep only one DateSent: the newest one. In other words, if a unique combination of ID and DateAcquired has two DateSent values available, keep only the row where DateSent is the largest/newest. This operation should apply only if ID is NOT 6T.
I am out of ideas on how to do this. Is there an easy way of doing it with pandas?

You can filter the rows that are not equal to 6T, take the row with the maximum DateSent per group via DataFrameGroupBy.idxmax, and then concatenate the 6T rows back onto the output (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead):
m = df['ID'].ne('6T')
df = pd.concat([df.loc[df[m].groupby(['ID','DateAcquired'])['DateSent'].idxmax()],
                df[~m]], ignore_index=True)
Solution with sorting and removing duplicates:
m = df['ID'].ne('6T')
df = pd.concat([df[m].sort_values(['ID','DateAcquired','DateSent'], ascending=[True, True, False])
                     .drop_duplicates(subset=['ID','DateAcquired']),
                df[~m]], ignore_index=True)
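A third variant (my addition, not part of the original answer) keeps everything in one frame: sort by DateSent, mark every row except the newest one per (ID, DateAcquired) group as a duplicate, and keep a row if it is either that newest row or a 6T row. Index alignment maps the sorted mask back onto the original row order:
# newest[i] is True only for the row holding the max DateSent of its group
newest = ~df.sort_values('DateSent').duplicated(subset=['ID','DateAcquired'], keep='last')
df = df[df['ID'].eq('6T') | newest]  # 6T rows are kept unconditionally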

Use pd.to_datetime with GroupBy.max (again with pd.concat, since append is gone in pandas 2.0):
In [835]: df.DateSent = pd.to_datetime(df.DateSent, format='%Y%m%d')
In [841]: pd.concat([df[df.ID.ne('6T')].groupby(['ID','DateAcquired'])['DateSent'].max().reset_index(), df[df.ID.eq('6T')]])
Out[841]:
ID DateAcquired DateSent
0 1 20200420 2022-04-10
1 1 20210328 2022-05-10
2 1 20210518 2022-06-10
3 1 20210719 2022-03-10
4 2 20200102 2022-08-11
5 2 20200109 2022-05-10
6 2 20200608 2022-09-09
7 2 20210108 2022-04-10
8 2 20210110 2022-02-10
9 2 20210119 2022-03-10
10 2 20210130 2022-11-01
11 2 20210506 2022-10-05
12 2 20210812 2022-06-20
13 2 20210909 2022-06-30
14 2 20210919 2022-06-10
15 2 20211214 2022-06-12
16 3 20210518 2022-04-10
17 3 20210519 2022-02-10
18 3 20210520 2022-03-10
19 3 20210521 2022-06-10
20 3 20210611 2022-05-10
21 3 20210723 2022-06-12
22 3 20211211 2022-06-20
23 4 20210518 2022-01-10
24 4 20210519 2022-02-10
25 4 20210520 2022-03-10
26 4 20210618 2022-04-10
27 4 20210718 2022-05-10
28 4 20210818 2022-06-10
29 5 20210518 2022-01-10
30 5 20210631 2022-12-12
31 5 20210818 2022-02-10
32 5 20210918 2022-03-10
33 5 20211018 2022-04-10
34 5 20211113 2022-05-10
35 5 20211218 2022-06-10
40 6T 20200102 2020-11-01
41 6T 20200102 2020-11-01
42 6T 20200102 2020-11-01
43 6T 20210405 2022-06-10
44 6T 20210606 2022-06-11
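If you need DateSent back in the original YYYYMMDD integer form afterwards (an extra step of mine; out stands for whichever result frame you built above), strftime reverses the conversion:
out['DateSent'] = out['DateSent'].dt.strftime('%Y%m%d').astype(int)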

Related

Replace row value by comparing dates

I have a date in a list:
[datetime.date(2017, 8, 9)]
I want to replace the value of a dataframe row matching that date with zero.
Dataframe:
Date Amplitude Magnitude Peaks Crests
0 2017-06-21 6.953356 1046.656154 4 3
1 2017-06-27 7.015520 1185.221306 5 4
2 2017-06-28 6.947471 908.115055 2 2
3 2017-06-29 6.921587 938.175153 3 3
4 2017-07-02 6.906078 938.273547 3 2
5 2017-07-03 6.898809 955.718452 6 5
6 2017-07-04 6.876283 846.514852 5 5
7 2017-07-26 6.862897 870.610086 6 5
8 2017-07-27 6.846426 824.403786 7 7
9 2017-07-28 6.831949 813.753420 7 7
10 2017-07-29 6.823125 841.245427 4 3
11 2017-07-30 6.816301 846.603427 5 4
12 2017-07-31 6.810133 842.287006 5 4
13 2017-08-01 6.800645 794.167590 3 3
14 2017-08-02 6.793034 801.505774 4 3
15 2017-08-03 6.790814 860.497395 7 6
16 2017-08-04 6.785664 815.055002 4 4
17 2017-08-05 6.782069 829.607640 5 4
18 2017-08-06 6.778176 819.014799 4 3
19 2017-08-07 6.774587 817.624203 5 5
20 2017-08-08 6.771193 815.101641 4 3
21 2017-08-09 6.765695 772.970000 1 1
22 2017-08-10 6.769422 945.207554 1 1
23 2017-08-11 6.773154 952.422598 4 3
24 2017-08-12 6.770926 826.700122 4 4
25 2017-08-13 6.772816 916.046905 5 5
26 2017-08-14 6.771130 834.881662 5 5
27 2017-08-15 6.769183 826.009391 5 5
28 2017-08-16 6.767313 824.650882 5 4
29 2017-08-17 6.765894 832.752100 5 5
30 2017-08-18 6.766861 894.165751 5 5
31 2017-08-19 6.768392 912.200274 4 3
I have tried this:
for x in range(len(all_details)):
    for y in selected_day:
        m = all_details['Date'] > y
        all_details.loc[m, 'Peaks'] = 0
But getting an error:
ValueError: Arrays were different lengths: 32 vs 1
Can anybody suggest the correct way to do it?
Any help would be appreciated.
First of all, your solution works fine with your sample data.
Another, faster solution is to create one mask per date in a loop and then reduce them with logical OR (or AND, as needed). It is explained in more detail here.
import datetime
import numpy as np

L = [datetime.date(2017, 8, 9)]
m = np.logical_or.reduce([all_details['Date'] > x for x in L])
all_details.loc[m, 'Peaks'] = 0
In your own solution it is better to compare against only the minimal date from the list:
all_details.loc[all_details['Date'] > min(L), 'Peaks'] = 0
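A minimal self-contained demo (the frame and the second date are made up by me) showing how the reduced mask behaves once the list holds more than one date:
import datetime

import numpy as np
import pandas as pd

all_details = pd.DataFrame({
    'Date': [datetime.date(2017, 8, 8), datetime.date(2017, 8, 9), datetime.date(2017, 8, 10)],
    'Peaks': [4, 1, 1],
})
L = [datetime.date(2017, 8, 9), datetime.date(2017, 8, 11)]

# one boolean mask per date, OR-ed together: True where ANY threshold is exceeded
m = np.logical_or.reduce([all_details['Date'] > x for x in L])
all_details.loc[m, 'Peaks'] = 0
print(all_details)  # only the 2017-08-10 row is zeroed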

pandas pct_change() in reverse

Suppose we have a dataframe and we calculate the percent change between rows:
y_axis = [1,2,3,4,5,6,7,8,9]
x_axis = [100,105,115,95,90,88,110,100,0]
DF = pd.DataFrame({'Y':y_axis, 'X':x_axis})
DF = DF[['Y','X']]
DF['PCT'] = DF['X'].pct_change()
Y X PCT
0 1 100 NaN
1 2 105 0.050000
2 3 115 0.095238
3 4 95 -0.173913
4 5 90 -0.052632
5 6 88 -0.022222
6 7 110 0.250000
7 8 100 -0.090909
8 9 0 -1.000000
That way it starts from the first row.
I want to calculate pct_change() starting from the last row. One way to do it:
DF['Reverse'] = list(reversed(x_axis))
DF['PCT_rev'] = DF['Reverse'].pct_change()
pct_rev = DF.PCT_rev.tolist()
DF['_PCT_'] = list(reversed(pct_rev))
DF2 = DF[['Y','X','PCT','_PCT_']]
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
But that is a very ugly and inefficient solution.
I was wondering if there are more elegant solutions?
DF.assign(_PCT_=DF.X.pct_change(-1))
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
Series.pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
periods : int, default 1 Periods to shift for forming percent change
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.pct_change.html
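A quick sanity check (my snippet, not from the answer): with periods=-1 each element is compared against the next one, i.e. X[i] / X[i+1] - 1, which is exactly the reversed computation above.
import pandas as pd

s = pd.Series([100, 105, 115])
print(s.pct_change(-1))     # 100/105 - 1 = -0.047619, 105/115 - 1 = -0.086957, NaN
print(s / s.shift(-1) - 1)  # identical values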
I deleted my other answer because @su79eu7k's is way better.
You can cut your time in half by using the underlying arrays, but you also have to suppress a warning (the 0 in X triggers a divide-by-zero RuntimeWarning):
import numpy as np

a = DF.X.values
# wrap in np.errstate(divide='ignore') if you want to silence the warning from a[1:] == 0
DF.assign(_PCT_=np.append((a[:-1] - a[1:]) / a[1:], np.nan))
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN

sort pandas dataframe indices

I have a dataframe df whose indices are:
df.index
Out[4]:
Index([u'2015-03-28_p001_2', u'2015-03-29_p001_2',
u'2015-03-30_p001_2', u'2015-03-31_p001_2',
u'2015-03-31_p002_3', u'2015-04-01_p001_2',
u'2015-04-01_p002_3', u'2015-04-02_p001_2',
u'2015-04-02_p002_3', u'2015-04-03_p001_2',
...
u'2016-03-31_p127_1', u'2016-04-01_p127_1',
u'2016-04-01_p128_3', u'2016-04-02_p127_1',
u'2016-04-02_p128_3', u'2016-04-03_p127_1',
u'2016-04-03_p128_3', u'2016-04-04_p127_1',
u'2016-04-05_p127_1', u'2016-04-06_p127_1'],
dtype='object', length=781)
The dataframe df is the results of a merge of 2 dataframes.
As you can see, the indices are not sorted; e.g. '2015-03-31_p002_3' (5th position) comes before '2015-04-01_p001_2' (6th position).
I would like to group together all the _p001_2 entries and sort them by date, then all the _p002_3 entries, etc.
But I didn't manage to do it...
If sort_index cannot be used, it is a bit more complicated: create a helper DataFrame by splitting the index, then sort_values, and finally reindex:
idx = pd.Index([u'2015-03-28_p001_2', u'2015-03-29_p001_2',
u'2015-03-30_p001_2', u'2015-03-31_p001_2',
u'2015-03-31_p002_3', u'2015-04-01_p001_2',
u'2015-04-01_p002_3', u'2015-04-02_p001_2',
u'2015-04-02_p002_3', u'2015-04-03_p001_2',
u'2016-03-31_p127_1', u'2016-04-01_p127_1',
u'2016-04-01_p128_3', u'2016-04-02_p127_1',
u'2016-04-02_p128_3', u'2016-04-03_p127_1',
u'2016-04-03_p128_3', u'2016-04-04_p127_1',
u'2016-04-05_p127_1', u'2016-04-06_p127_1'])
df = pd.DataFrame({'a':range(len(idx))}, index=idx)
print (df)
a
2015-03-28_p001_2 0
2015-03-29_p001_2 1
2015-03-30_p001_2 2
2015-03-31_p001_2 3
2015-03-31_p002_3 4
2015-04-01_p001_2 5
2015-04-01_p002_3 6
2015-04-02_p001_2 7
2015-04-02_p002_3 8
2015-04-03_p001_2 9
2016-03-31_p127_1 10
2016-04-01_p127_1 11
2016-04-01_p128_3 12
2016-04-02_p127_1 13
2016-04-02_p128_3 14
2016-04-03_p127_1 15
2016-04-03_p128_3 16
2016-04-04_p127_1 17
2016-04-05_p127_1 18
2016-04-06_p127_1 19
Plain sort_index only gives lexicographic order, which here leaves the order unchanged and is not the desired grouping:
df = df.sort_index()
print (df)
a
2015-03-28_p001_2 0
2015-03-29_p001_2 1
2015-03-30_p001_2 2
2015-03-31_p001_2 3
2015-03-31_p002_3 4
2015-04-01_p001_2 5
2015-04-01_p002_3 6
2015-04-02_p001_2 7
2015-04-02_p002_3 8
2015-04-03_p001_2 9
2016-03-31_p127_1 10
2016-04-01_p127_1 11
2016-04-01_p128_3 12
2016-04-02_p127_1 13
2016-04-02_p128_3 14
2016-04-03_p127_1 15
2016-04-03_p128_3 16
2016-04-04_p127_1 17
2016-04-05_p127_1 18
2016-04-06_p127_1 19
df1 = df.index.to_series().str.split('_', expand=True)
df1[0] = pd.to_datetime(df1[0])
# if necessary, change the column order used for sorting
df1 = df1.sort_values(by=[1, 2, 0])
print (df1)
0 1 2
2015-03-28_p001_2 2015-03-28 p001 2
2015-03-29_p001_2 2015-03-29 p001 2
2015-03-30_p001_2 2015-03-30 p001 2
2015-03-31_p001_2 2015-03-31 p001 2
2015-04-01_p001_2 2015-04-01 p001 2
2015-04-02_p001_2 2015-04-02 p001 2
2015-04-03_p001_2 2015-04-03 p001 2
2015-03-31_p002_3 2015-03-31 p002 3
2015-04-01_p002_3 2015-04-01 p002 3
2015-04-02_p002_3 2015-04-02 p002 3
2016-03-31_p127_1 2016-03-31 p127 1
2016-04-01_p127_1 2016-04-01 p127 1
2016-04-02_p127_1 2016-04-02 p127 1
2016-04-03_p127_1 2016-04-03 p127 1
2016-04-04_p127_1 2016-04-04 p127 1
2016-04-05_p127_1 2016-04-05 p127 1
2016-04-06_p127_1 2016-04-06 p127 1
2016-04-01_p128_3 2016-04-01 p128 3
2016-04-02_p128_3 2016-04-02 p128 3
2016-04-03_p128_3 2016-04-03 p128 3
df = df.reindex(df1.index)
print (df)
a
2015-03-28_p001_2 0
2015-03-29_p001_2 1
2015-03-30_p001_2 2
2015-03-31_p001_2 3
2015-04-01_p001_2 5
2015-04-02_p001_2 7
2015-04-03_p001_2 9
2015-03-31_p002_3 4
2015-04-01_p002_3 6
2015-04-02_p002_3 8
2016-03-31_p127_1 10
2016-04-01_p127_1 11
2016-04-02_p127_1 13
2016-04-03_p127_1 15
2016-04-04_p127_1 17
2016-04-05_p127_1 18
2016-04-06_p127_1 19
2016-04-01_p128_3 12
2016-04-02_p128_3 14
2016-04-03_p128_3 16
EDIT:
If there are duplicates in the index, it is necessary to create the helper columns on the frame itself, sort by them, and drop them at the end:
df[[0,1,2]] = df.index.to_series().str.split('_', expand=True)
df[0] = pd.to_datetime(df[0])
df = df.sort_values(by=[1,2,0])
df = df.drop([0,1,2], axis=1)
print (df)
a
2015-03-28_p001_2 0
2015-03-29_p001_2 1
2015-03-30_p001_2 2
2015-03-31_p001_2 3
2015-04-01_p001_2 5
2015-04-02_p001_2 7
2015-04-03_p001_2 9
2015-03-31_p002_3 4
2015-04-01_p002_3 6
2015-04-02_p002_3 8
2016-03-31_p127_1 10
2016-04-01_p127_1 11
2016-04-02_p127_1 13
2016-04-03_p127_1 15
2016-04-04_p127_1 17
2016-04-05_p127_1 18
2016-04-06_p127_1 19
2016-04-01_p128_3 12
2016-04-02_p128_3 14
2016-04-03_p128_3 16
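On pandas 1.1 or newer you can skip the helper frame entirely by passing a key to sort_index. This sketch (my addition, under that version assumption) rewrites each label from date_pXXX_N to pXXX_N_date, so plain lexicographic order groups by pXXX and N first and then sorts by date:
df = df.sort_index(
    key=lambda idx: idx.str.replace(r'^(\d{4}-\d{2}-\d{2})_(.*)$', r'\2_\1', regex=True)
)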

Working with datetime in pandas

I have a huge dataframe. Below is a small example:
Date Timing Day_number
17.03.2016 8 1
17.03.2016 8 2
17.03.2016 8 3
17.03.2016 8 4
17.03.2016 8 5
17.03.2016 8 6
17.03.2016 8 7
17.03.2016 8 8
30.08.2016 3 1
30.08.2016 3 2
30.08.2016 3 3
31.05.2016 3 1
31.05.2016 3 2
31.05.2016 3 3
...
I need to add a new column. I look at the value in the column "Timing". For example, if the value is 8, then I look at the date and add one day on each successive line for this case. The result is eight rows with dates from 17.03.2016 to 24.03.2016. The value in the column "Timing" can vary, and the dates also differ. For this example, I should get something like this:
Date Timing Day_number Distribution_of_days
17.03.2016 8 1 17.03.2016
17.03.2016 8 2 18.03.2016
17.03.2016 8 3 19.03.2016
17.03.2016 8 4 20.03.2016
17.03.2016 8 5 21.03.2016
17.03.2016 8 6 22.03.2016
17.03.2016 8 7 23.03.2016
17.03.2016 8 8 24.03.2016
30.08.2016 3 1 30.08.2016
30.08.2016 3 2 31.08.2016
30.08.2016 3 3 01.09.2016
31.05.2016 3 1 31.05.2016
31.05.2016 3 2 01.06.2016
31.05.2016 3 3 02.06.2016
...
At the same time I need to skip weekends!
Pandas recognizes the values of the column "Date" as non-null object. Does this mean that it does not see them as dates?
Can someone help me? I can't solve this task myself.
IIUC:
from pandas.tseries.offsets import BDay

df['Date'] = pd.to_datetime(df.Date, format='%d.%m.%Y')  # be explicit: the dates are day-first
df.assign(Distribution_of_days=df['Date'] + df['Day_number'].apply(BDay))
Output:
Date Timing Day_number Distribution_of_days
0 2016-03-17 8 1 2016-03-18
1 2016-03-17 8 2 2016-03-21
2 2016-03-17 8 3 2016-03-22
3 2016-03-17 8 4 2016-03-23
4 2016-03-17 8 5 2016-03-24
5 2016-03-17 8 6 2016-03-25
6 2016-03-17 8 7 2016-03-28
7 2016-03-17 8 8 2016-03-29
8 2016-08-30 3 1 2016-08-31
9 2016-08-30 3 2 2016-09-01
10 2016-08-30 3 3 2016-09-02
11 2016-05-31 3 1 2016-06-01
12 2016-05-31 3 2 2016-06-02
13 2016-05-31 3 3 2016-06-03
EDIT (if the count starts on the current day):
df.assign(Distribution_of_days=df['Date'] + df['Day_number'].add(-1).apply(BDay))
Output:
Date Timing Day_number Distribution_of_days
0 2016-03-17 8 1 2016-03-17
1 2016-03-17 8 2 2016-03-18
2 2016-03-17 8 3 2016-03-21
3 2016-03-17 8 4 2016-03-22
4 2016-03-17 8 5 2016-03-23
5 2016-03-17 8 6 2016-03-24
6 2016-03-17 8 7 2016-03-25
7 2016-03-17 8 8 2016-03-28
8 2016-08-30 3 1 2016-08-30
9 2016-08-30 3 2 2016-08-31
10 2016-08-30 3 3 2016-09-01
11 2016-05-31 3 1 2016-05-31
12 2016-05-31 3 2 2016-06-01
13 2016-05-31 3 3 2016-06-02
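For large frames, a vectorized alternative (a sketch of mine, assuming plain Saturday/Sunday weekends and no holidays) is numpy's busday_offset, which avoids the per-row apply:
import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
start = df['Date'].to_numpy().astype('datetime64[D]')
# move each start date forward by Day_number - 1 business days;
# roll='forward' first nudges a weekend start onto the next business day
df['Distribution_of_days'] = np.busday_offset(start, df['Day_number'] - 1, roll='forward')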
This will make it work:
import pandas as pd
#this is just creation of your dataframe
data = '17.03.2016,8,1,17.03.2016,8,2,17.03.2016,8,3,17.03.2016,8,4,17.03.2016,8,5,17.03.2016,8,6,17.03.2016,8,7,17.03.2016,8,8,30.08.2016,3,1,30.08.2016,3,2,30.08.2016,3,3,31.05.2016,3,1,31.05.2016,3,2,31.05.2016,3,3'
data = data.split(',')
date = data[::3]
timing = [int(i) for i in data[1::3]]
day_number = [int(j) for j in data[2::3]]
#here is actual code
df = pd.DataFrame({'Date': date, 'Timing': timing, 'Day_number': day_number})
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
df['Distribution_of_days'] = df.Date + pd.to_timedelta(df.Day_number - 1, unit='D')
# note: to_timedelta counts calendar days, so unlike BDay above this does not skip weekends

python replace string in a specific dataframe column

I would like to replace any string in a dataframe column by the string 'Chaudière' whenever the word starts with "chaud". The first and last name after each "Chaudière" should disappear, to anonymize the NameDevice.
My data frame is called df1 and the column name is NameDevice.
I have tried this:
df1.loc[df1['NameDevice'].str.startswith('chaud'), 'NameDevice'] = df1['NameDevice'].str.replace("chaud", "Chaudière"). I check with df1.head(); it returns:
IdDevice IdDeviceType SerialDevice NameDevice IdLocation UuidAttributeDevice IdBox IsUpdateDevice
0 119 48 00001 Chaudière Maud Ferrand 4 NaN 4 0
1 120 48 00002 Chaudière Yvan Martinod 6 NaN 6 0
2 121 48 00006 Chaudière Anne-Sophie Premereur 7 NaN 7 0
3 122 48 00005 Chaudière Denis Fauser 8 NaN 8 0
4 123 48 00004 Chaudière Elariak Djilali 3 NaN 3 0
You can do the matching by calling str.lower first, then you can use str.startswith, and then just split on the spaces and take the first entry to anonymise the data:
In [14]:
df.loc[df['NameDevice'].str.lower().str.startswith('chaud'), 'NameDevice'] = df['NameDevice'].str.split().str[0]
df
Out[14]:
IdDevice IdDeviceType SerialDevice NameDevice IdLocation \
0 119 48 1 Chaudière 4
1 120 48 2 Chaudière 6
2 121 48 6 Chaudière 7
3 122 48 5 Chaudière 8
4 123 48 4 Chaudière 3
UuidAttributeDevice IdBox IsUpdateDevice
0 NaN 4 0
1 NaN 6 0
2 NaN 7 0
3 NaN 8 0
4 NaN 3 0
Another method is to use str.extract so it only takes Chaud...:
In [27]:
df.loc[df['NameDevice'].str.lower().str.startswith('chaud'), 'NameDevice'] = df['NameDevice'].str.extract(r'(Chaud\w+)', expand=False)
df
Out[27]:
IdDevice IdDeviceType SerialDevice NameDevice IdLocation \
0 119 48 1 Chaudière 4
1 120 48 2 Chaudière 6
2 121 48 6 Chaudière 7
3 122 48 5 Chaudière 8
4 123 48 4 Chaudière 3
UuidAttributeDevice IdBox IsUpdateDevice
0 NaN 4 0
1 NaN 6 0
2 NaN 7 0
3 NaN 8 0
4 NaN 3 0
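A one-step variant (my sketch, not from the original answers) folds the case-insensitive match and the anonymisation into a single regex replace:
# (?i) makes the match case-insensitive; any value starting with 'chaud'
# is rewritten to just 'Chaudière', dropping whatever follows
df['NameDevice'] = df['NameDevice'].str.replace(r'(?i)^chaud\S*.*$', 'Chaudière', regex=True)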
