I have a Pandas dataframe containing two columns: a datetime column, and a column of integers representing station IDs. I need a new dataframe with the following modifications:
For each set of duplicate STATION_ID values, keep the row with the most recent entry for DATE_CHANGED. If the duplicate entries for the STATION_ID all contain the same DATE_CHANGED then drop the duplicates and retain a single row for the STATION_ID. If there are no duplicates for the STATION_ID value, simply retain the row.
Dataframe (sorted by STATION_ID):
DATE_CHANGED STATION_ID
0 2006-06-07 06:00:00 1
1 2000-09-26 06:00:00 1
2 2000-09-26 06:00:00 1
3 2000-09-26 06:00:00 1
4 2001-06-06 06:00:00 2
5 2005-07-29 06:00:00 2
6 2005-07-29 06:00:00 2
7 2001-06-06 06:00:00 2
8 2001-06-08 06:00:00 4
9 2003-11-25 07:00:00 4
10 2001-06-12 06:00:00 7
11 2001-06-04 06:00:00 8
12 2017-04-03 18:36:16 8
13 2017-04-03 18:36:16 8
14 2017-04-03 18:36:16 8
15 2001-06-04 06:00:00 8
16 2001-06-08 06:00:00 10
17 2001-06-08 06:00:00 10
18 2001-06-08 06:00:00 11
19 2001-06-08 06:00:00 11
20 2001-06-08 06:00:00 12
21 2001-06-08 06:00:00 12
22 2001-06-08 06:00:00 13
23 2001-06-08 06:00:00 13
24 2001-06-08 06:00:00 14
25 2001-06-08 06:00:00 14
26 2001-06-08 06:00:00 15
27 2017-08-07 17:48:25 15
28 2001-06-08 06:00:00 15
29 2017-08-07 17:48:25 15
... ... ...
157066 2018-08-06 14:11:28 71655
157067 2018-08-06 14:11:28 71656
157068 2018-08-06 14:11:28 71656
157069 2018-09-11 21:45:05 71664
157070 2018-09-11 21:45:05 71664
157071 2018-09-11 21:45:05 71664
157072 2018-09-11 21:41:04 71664
157073 2018-08-09 15:22:07 71720
157074 2018-08-09 15:22:07 71720
157075 2018-08-09 15:22:07 71720
157076 2018-08-23 12:43:12 71899
157077 2018-08-23 12:43:12 71899
157078 2018-08-23 12:43:12 71899
157079 2018-09-08 20:21:43 71969
157080 2018-09-08 20:21:43 71969
157081 2018-09-08 20:21:43 71969
157082 2018-09-08 20:21:43 71984
157083 2018-09-08 20:21:43 71984
157084 2018-09-08 20:21:43 71984
157085 2018-09-05 18:46:18 71985
157086 2018-09-05 18:46:18 71985
157087 2018-09-05 18:46:18 71985
157088 2018-09-08 20:21:44 71990
157089 2018-09-08 20:21:44 71990
157090 2018-09-08 20:21:44 71990
157091 2018-09-08 20:21:43 72003
157092 2018-09-08 20:21:43 72003
157093 2018-09-08 20:21:43 72003
157094 2018-09-10 17:06:18 72024
157095 2018-09-10 17:15:05 72024
[157096 rows x 2 columns]
DATE_CHANGED is dtype: datetime64[ns]
STATION_ID is dtype: int64
pandas==0.23.4
python==2.7.15
Try:
df.sort_values('DATE_CHANGED').drop_duplicates('STATION_ID', keep='last')
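A minimal, self-contained sketch of this approach on a toy frame (values picked from the sample above):

```python
import pandas as pd

# Toy frame mirroring the question's structure.
df = pd.DataFrame({
    'DATE_CHANGED': pd.to_datetime([
        '2006-06-07 06:00:00', '2000-09-26 06:00:00', '2000-09-26 06:00:00',
        '2001-06-06 06:00:00', '2005-07-29 06:00:00', '2001-06-12 06:00:00',
    ]),
    'STATION_ID': [1, 1, 1, 2, 2, 7],
})

# Sort so the most recent entry per station comes last, keep that row,
# then restore STATION_ID order.
result = (df.sort_values('DATE_CHANGED')
            .drop_duplicates('STATION_ID', keep='last')
            .sort_values('STATION_ID'))
print(result)
```

Note this covers all three cases in one pass: exact-duplicate rows collapse to one, differing duplicates keep the latest date, and singleton stations pass through unchanged.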
I am trying to create my own custom day-of-the-week mapping using Python. I have tried a few different methods, such as dayofweek and isoweekday, yet none of these provide what I need.
0 2026-01-01 00:00:00
1 2026-01-01 01:00:00
2 2026-01-01 02:00:00
3 2026-01-01 03:00:00
4 2026-01-01 04:00:00
5 2026-01-01 05:00:00
6 2026-01-01 06:00:00
7 2026-01-01 07:00:00
8 2026-01-01 08:00:00
9 2026-01-01 09:00:00
10 2026-01-01 10:00:00
11 2026-01-01 11:00:00
12 2026-01-01 12:00:00
13 2026-01-01 13:00:00
14 2026-01-01 14:00:00
15 2026-01-01 15:00:00
16 2026-01-01 16:00:00
17 2026-01-01 17:00:00
18 2026-01-01 18:00:00
19 2026-01-01 19:00:00
Name: Date, dtype: datetime64[ns]
It continues to:
0 2026-01-01 00:00:00
1 2026-01-01 01:00:00
2 2026-01-01 02:00:00
3 2026-01-01 03:00:00
4 2026-01-01 04:00:00
95 2026-01-04 23:00:00
96 2026-01-05 00:00:00
97 2026-01-05 01:00:00
98 2026-01-05 02:00:00
99 2026-01-05 03:00:00
Example of my code:
power_data['Day of Week'] = power_data['Date'].dt.dayofweek+1
power_data['Day of Weekv2'] = power_data['Date'].dt.isoweekday
Above is an example portion of my dataframe. The formatting I would like is Sunday = 1, Monday = 2, ..., and Saturday = 7. Please let me know if I can do this with the data as it is currently presented.
The pandas documentation for weekday states:
The day of the week with Monday=0, Sunday=6.
Which means you just need to remap those values so that Sunday becomes 1 and Saturday becomes 7. Note that adding 2 and taking modulo 8 would break for Sunday (6 + 2 = 8, and 8 % 8 = 0), so instead shift by 1, wrap with modulo 7, then add 1:
# df is your dataframe, and date is the column name holding pandas Timestamps
>>> (df['date'].dt.weekday + 1) % 7 + 1
#output:
0 5
1 5
2 5
3 5
4 5
5 5
6 5
7 5
8 5
9 5
10 5
11 5
12 5
13 5
14 5
15 5
16 5
17 5
18 5
Name: date, dtype: int64
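As a sanity check, here is a small, self-contained sketch of the Sunday=1 through Saturday=7 mapping over one full week (the week starting 2026-01-04, a Sunday, is a hypothetical sample):

```python
import pandas as pd

# One date for each day of the week; 2026-01-04 is a Sunday.
dates = pd.Series(pd.date_range('2026-01-04', periods=7, freq='D'))

# Remap pandas' Monday=0..Sunday=6 onto Sunday=1..Saturday=7.
custom = (dates.dt.weekday + 1) % 7 + 1
print(custom.tolist())  # [1, 2, 3, 4, 5, 6, 7]
```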
I'm building a basic rota/schedule for staff, and have a DataFrame from a MySQL cursor which gives a list of IDs, dates and class
id the_date class
0 195593 2017-09-12 14:00:00 3
1 193972 2017-09-13 09:15:00 2
2 195594 2017-09-13 14:00:00 3
3 195595 2017-09-15 14:00:00 3
4 193947 2017-09-16 17:30:00 3
5 195627 2017-09-17 08:00:00 2
6 193948 2017-09-19 11:30:00 2
7 195628 2017-09-21 08:00:00 2
8 193949 2017-09-21 11:30:00 2
9 195629 2017-09-24 08:00:00 2
10 193950 2017-09-24 10:00:00 2
11 193951 2017-09-27 11:30:00 2
12 195644 2017-09-28 06:00:00 1
13 194400 2017-09-28 08:00:00 1
14 195630 2017-09-28 08:00:00 2
15 193952 2017-09-29 11:30:00 2
16 195631 2017-10-01 08:00:00 2
17 194401 2017-10-06 08:00:00 1
18 195645 2017-10-06 10:00:00 1
19 195632 2017-10-07 13:30:00 3
If the class == 1, I need that instance duplicated 5 times.
first_class = df[df['class'] == 1]
non_first_class = df[df['class'] != 1]
first_class_replicated = pd.concat([first_class]*5, ignore_index=True).sort_values(['the_date'])
id the_date class
0 195644 2017-09-28 06:00:00 1
16 195644 2017-09-28 06:00:00 1
4 195644 2017-09-28 06:00:00 1
12 195644 2017-09-28 06:00:00 1
8 195644 2017-09-28 06:00:00 1
17 194400 2017-09-28 08:00:00 1
13 194400 2017-09-28 08:00:00 1
9 194400 2017-09-28 08:00:00 1
5 194400 2017-09-28 08:00:00 1
1 194400 2017-09-28 08:00:00 1
6 194401 2017-10-06 08:00:00 1
18 194401 2017-10-06 08:00:00 1
10 194401 2017-10-06 08:00:00 1
14 194401 2017-10-06 08:00:00 1
2 194401 2017-10-06 08:00:00 1
11 195645 2017-10-06 10:00:00 1
3 195645 2017-10-06 10:00:00 1
15 195645 2017-10-06 10:00:00 1
7 195645 2017-10-06 10:00:00 1
19 195645 2017-10-06 10:00:00 1
I then merge non_first_class and first_class_replicated. Before that though, I need the dates in first_class_replicated to increment by one day, grouped by id. Below is how I need it to look. Is there an elegant Pandas solution to this, or should I be looking at looping over a groupby series to modify the dates?
Desired:
id
0 195644 2017-09-28 6:00:00
16 195644 2017-09-29 6:00:00
4 195644 2017-09-30 6:00:00
12 195644 2017-10-01 6:00:00
8 195644 2017-10-02 6:00:00
17 194400 2017-09-28 8:00:00
13 194400 2017-09-29 8:00:00
9 194400 2017-09-30 8:00:00
5 194400 2017-10-01 8:00:00
1 194400 2017-10-02 8:00:00
6 194401 2017-10-06 8:00:00
18 194401 2017-10-07 8:00:00
10 194401 2017-10-08 8:00:00
14 194401 2017-10-09 8:00:00
2 194401 2017-10-10 8:00:00
11 195645 2017-10-06 10:00:00
3 195645 2017-10-07 10:00:00
15 195645 2017-10-08 10:00:00
7 195645 2017-10-09 10:00:00
19 195645 2017-10-10 10:00:00
You can use cumcount to number the occurrences within each id, then convert with to_timedelta and add the result to the column:
# another solution for the repeat step (requires numpy)
import numpy as np
first_class_replicated = (first_class.loc[np.repeat(first_class.index, 5)]
                                     .sort_values(['the_date']))
df1 = first_class_replicated.groupby('id').cumcount()
first_class_replicated['the_date'] += pd.to_timedelta(df1, unit='D')
print (first_class_replicated)
id the_date class
0 195644 2017-09-28 06:00:00 1
16 195644 2017-09-29 06:00:00 1
4 195644 2017-09-30 06:00:00 1
12 195644 2017-10-01 06:00:00 1
8 195644 2017-10-02 06:00:00 1
17 194400 2017-09-28 08:00:00 1
13 194400 2017-09-29 08:00:00 1
9 194400 2017-09-30 08:00:00 1
5 194400 2017-10-01 08:00:00 1
1 194400 2017-10-02 08:00:00 1
6 194401 2017-10-06 08:00:00 1
18 194401 2017-10-07 08:00:00 1
10 194401 2017-10-08 08:00:00 1
14 194401 2017-10-09 08:00:00 1
2 194401 2017-10-10 08:00:00 1
11 195645 2017-10-06 10:00:00 1
3 195645 2017-10-07 10:00:00 1
15 195645 2017-10-08 10:00:00 1
7 195645 2017-10-09 10:00:00 1
19 195645 2017-10-10 10:00:00 1
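For reference, a self-contained sketch of the cumcount/to_timedelta idea on a reduced two-id example (ids and timestamps taken from the question):

```python
import numpy as np
import pandas as pd

# Reduced two-id example.
first_class = pd.DataFrame({
    'id': [195644, 194400],
    'the_date': pd.to_datetime(['2017-09-28 06:00:00', '2017-09-28 08:00:00']),
    'class': [1, 1],
})

# Repeat every row 5 times, then sort by date.
replicated = (first_class.loc[np.repeat(first_class.index, 5)]
                         .sort_values('the_date')
                         .reset_index(drop=True))

# cumcount numbers the copies 0..4 within each id; as a timedelta in days,
# it pushes each successive copy one day forward.
offset = replicated.groupby('id').cumcount()
replicated['the_date'] += pd.to_timedelta(offset, unit='D')
print(replicated)
```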
I have a question about comparing datetime64[ns] data with a date like '2017-01-01'.
here is the code:
df.loc[(df['Date'] >= datetime.date(2017.1.1), 'TimeRange'] = '2017.1'
But an error was raised: descriptor 'date' requires a 'datetime.datetime' object but received a 'int'.
How can I compare a datetime64 column to a date (2017-01-01 or 2017-6-1 and the like)?
Thanks
Demo:
Source DF:
In [83]: df = pd.DataFrame({'tm':pd.date_range('2000-01-01', freq='9999T', periods=20)})
In [84]: df
Out[84]:
tm
0 2000-01-01 00:00:00
1 2000-01-07 22:39:00
2 2000-01-14 21:18:00
3 2000-01-21 19:57:00
4 2000-01-28 18:36:00
5 2000-02-04 17:15:00
6 2000-02-11 15:54:00
7 2000-02-18 14:33:00
8 2000-02-25 13:12:00
9 2000-03-03 11:51:00
10 2000-03-10 10:30:00
11 2000-03-17 09:09:00
12 2000-03-24 07:48:00
13 2000-03-31 06:27:00
14 2000-04-07 05:06:00
15 2000-04-14 03:45:00
16 2000-04-21 02:24:00
17 2000-04-28 01:03:00
18 2000-05-04 23:42:00
19 2000-05-11 22:21:00
Filtering:
In [85]: df.loc[df.tm > '2000-03-01']
Out[85]:
tm
9 2000-03-03 11:51:00
10 2000-03-10 10:30:00
11 2000-03-17 09:09:00
12 2000-03-24 07:48:00
13 2000-03-31 06:27:00
14 2000-04-07 05:06:00
15 2000-04-14 03:45:00
16 2000-04-21 02:24:00
17 2000-04-28 01:03:00
18 2000-05-04 23:42:00
19 2000-05-11 22:21:00
In [86]: df.loc[df.tm > '2000-3-1']
Out[86]:
tm
9 2000-03-03 11:51:00
10 2000-03-10 10:30:00
11 2000-03-17 09:09:00
12 2000-03-24 07:48:00
13 2000-03-31 06:27:00
14 2000-04-07 05:06:00
15 2000-04-14 03:45:00
16 2000-04-21 02:24:00
17 2000-04-28 01:03:00
18 2000-05-04 23:42:00
19 2000-05-11 22:21:00
not standard date format:
In [87]: df.loc[df.tm > pd.to_datetime('03/01/2000')]
Out[87]:
tm
9 2000-03-03 11:51:00
10 2000-03-10 10:30:00
11 2000-03-17 09:09:00
12 2000-03-24 07:48:00
13 2000-03-31 06:27:00
14 2000-04-07 05:06:00
15 2000-04-14 03:45:00
16 2000-04-21 02:24:00
17 2000-04-28 01:03:00
18 2000-05-04 23:42:00
19 2000-05-11 22:21:00
You need to ensure that the value you're comparing against is in a compatible format. datetime.date takes integer arguments (year, month, day), not 2017.1.1:
import datetime
print(df.loc[df['Date'] >= datetime.date(2017, 1, 1), 'TimeRange'])
This will create a datetime object and list out the filtered results. You can also assign the results an updated value as you have mentioned above.
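A minimal sketch of the whole pattern, including the assignment the question asked about, on toy data (the 'Date' column is datetime64[ns] as in the question):

```python
import pandas as pd

# Toy frame with a datetime64[ns] column.
df = pd.DataFrame({'Date': pd.to_datetime(['2016-12-31', '2017-01-01', '2017-06-15'])})

# A plain ISO date string compares directly against a datetime64 column.
df.loc[df['Date'] >= '2017-01-01', 'TimeRange'] = '2017.1'
print(df)
```

Rows before the cutoff are left as NaN in the new column.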
I have a dataframe named DateUnique made of all unique dates (format datetime or string) that are present in my other dataframe named A.
>>> print(A)
  'dateLivraisonDemande' 'abscisse' 'BaseASDébut' 'BaseATDébut'
0 2015-05-27 2004-01-10 05:00:00 05:00:00
1 2015-05-27 2004-02-10 18:30:00 22:30:00
2 2015-05-27 2004-01-20 23:40:00 19:30:00
3 2015-05-27 2004-03-10 12:05:00 06:00:00
4 2015-05-27 2004-01-10 23:15:00 13:10:00
5 2015-05-27 2004-02-10 18:00:00 13:45:00
6 2015-05-27 2004-01-20 02:05:00 19:15:00
7 2015-05-27 2004-03-20 08:00:00 07:45:00
8 2015-05-29 2004-01-01 18:45:00 21:00:00
9 2015-05-27 2004-02-15 04:20:00 07:30:00
10 2015-04-10 2004-01-20 13:50:00 15:30:00
And:
>>> print(DateUnique)
1 1899-12-30
2 1900-01-01
3 2004-03-10
4 2004-03-20
5 2004-01-20
6 2015-05-29
7 2015-04-10
8 2015-05-27
9 2004-02-15
10 2004-02-10
How can I get the name of the columns that contain each date?
Maybe with something similar to this:
# input:
If row == '2015-04-10':
print(df.name_Of_Column([0]))
# output:
'dateLivraisonDemande'
You can make a function that returns the appropriate column. Use the vectorized isin function, and then check if any value is True.
df = pd.DataFrame({'dateLivraisonDemande': ['2015-05-27']*7 + ['2015-05-27', '2015-05-29', '2015-04-10'],
'abscisse': ['2004-02-10', '2004-01-20', '2004-03-10', '2004-01-10',
'2004-02-10', '2004-01-20', '2004-03-10', '2004-01-10',
'2004-02-15', '2004-01-20']})
DateUnique = pd.Series(['1899-12-30', '1900-01-01', '2004-03-10', '2004-03-20',
'2004-01-20', '2015-05-29', '2015-04-10', '2015-05-27',
'2004-02-15', '2004-02-10'])
def return_date_columns(date_input):
    if df["dateLivraisonDemande"].isin([date_input]).any():
        return "dateLivraisonDemande"
    if df["abscisse"].isin([date_input]).any():
        return "abscisse"
>>> DateUnique.apply(return_date_columns)
0 None
1 None
2 abscisse
3 None
4 abscisse
5 dateLivraisonDemande
6 dateLivraisonDemande
7 dateLivraisonDemande
8 abscisse
9 abscisse
dtype: object
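If the column names should not be hardcoded, one possible generalization (a sketch on similar toy data) scans every column and returns all matches:

```python
import pandas as pd

# Toy frame with date strings, mirroring the question's columns.
df = pd.DataFrame({
    'dateLivraisonDemande': ['2015-05-27', '2015-05-29', '2015-04-10'],
    'abscisse': ['2004-03-10', '2004-01-20', '2004-02-15'],
})

def columns_containing(date_input):
    # Collect every column whose values contain the given date.
    cols = [c for c in df.columns if df[c].isin([date_input]).any()]
    return cols or None

print(columns_containing('2015-04-10'))  # ['dateLivraisonDemande']
```

Unlike the hardcoded version, this also reports a date that appears in more than one column.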
I have a dataframe that can be simplified as:
date id
0 02/04/2015 02:34 1
1 06/04/2015 12:34 2
2 09/04/2015 23:03 3
3 12/04/2015 01:00 4
4 15/04/2015 07:12 5
5 21/04/2015 12:59 6
6 29/04/2015 17:33 7
7 04/05/2015 10:44 8
8 06/05/2015 11:12 9
9 10/05/2015 08:52 10
10 12/05/2015 14:19 11
11 19/05/2015 19:22 12
12 27/05/2015 22:31 13
13 01/06/2015 11:09 14
14 04/06/2015 12:57 15
15 10/06/2015 04:00 16
16 15/06/2015 03:23 17
17 19/06/2015 05:37 18
18 23/06/2015 13:41 19
19 27/06/2015 15:43 20
It can be created using:
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"]})
The data has the following types:
tempDF.dtypes
date object
id int64
dtype: object
I have set the 'date' variable to be Pandas datetime64 format (if that's the right way to describe it) using:
import numpy as np
import pandas as pd
tempDF['date'] = pd.to_datetime(tempDF['date'])
So now, the dtypes look like:
tempDF.dtypes
date datetime64[ns]
id int64
dtype: object
I want to change the hours of the original date data. I can use .normalize() to convert to midnight via the .dt accessor:
tempDF['date'] = tempDF['date'].dt.normalize()
And, I can get access to individual datetime components (e.g. year) using:
tempDF['date'].dt.year
This produces:
0 2015
1 2015
2 2015
3 2015
4 2015
5 2015
6 2015
7 2015
8 2015
9 2015
10 2015
11 2015
12 2015
13 2015
14 2015
15 2015
16 2015
17 2015
18 2015
19 2015
Name: date, dtype: int64
The question is, how can I change specific date and time components? For example, how could I change the midday (12:00) for all the dates? I've found that datetime.datetime has a .replace() function. However, having converted dates to Pandas format, it would make sense to keep in that format. Is there a way to do that without changing the format again?
EDIT:
A vectorized way to do this would be to normalize the series, and then add 12 hours to it using timedelta. Example -
import datetime
tempDF['date'].dt.normalize() + datetime.timedelta(hours=12)
Demo -
In [59]: tempDF
Out[59]:
date id
0 2015-02-04 12:00:00 1
1 2015-06-04 12:00:00 2
2 2015-09-04 12:00:00 3
3 2015-12-04 12:00:00 4
4 2015-04-15 12:00:00 5
5 2015-04-21 12:00:00 6
6 2015-04-29 12:00:00 7
7 2015-04-05 12:00:00 8
8 2015-06-05 12:00:00 9
9 2015-10-05 12:00:00 10
10 2015-12-05 12:00:00 11
11 2015-05-19 12:00:00 12
12 2015-05-27 12:00:00 13
13 2015-01-06 12:00:00 14
14 2015-04-06 12:00:00 15
15 2015-10-06 12:00:00 16
16 2015-06-15 12:00:00 17
17 2015-06-19 12:00:00 18
18 2015-06-23 12:00:00 19
19 2015-06-27 12:00:00 20
In [60]: tempDF['date'].dt.normalize() + datetime.timedelta(hours=12)
Out[60]:
0 2015-02-04 12:00:00
1 2015-06-04 12:00:00
2 2015-09-04 12:00:00
3 2015-12-04 12:00:00
4 2015-04-15 12:00:00
5 2015-04-21 12:00:00
6 2015-04-29 12:00:00
7 2015-04-05 12:00:00
8 2015-06-05 12:00:00
9 2015-10-05 12:00:00
10 2015-12-05 12:00:00
11 2015-05-19 12:00:00
12 2015-05-27 12:00:00
13 2015-01-06 12:00:00
14 2015-04-06 12:00:00
15 2015-10-06 12:00:00
16 2015-06-15 12:00:00
17 2015-06-19 12:00:00
18 2015-06-23 12:00:00
19 2015-06-27 12:00:00
dtype: datetime64[ns]
Timing information for both methods at bottom
One method would be to use Series.apply along with the .replace() method OP mentions in his post. Example -
tempDF['date'] = tempDF['date'].apply(lambda x:x.replace(hour=12,minute=0))
Demo -
In [12]: tempDF
Out[12]:
date id
0 2015-02-04 02:34:00 1
1 2015-06-04 12:34:00 2
2 2015-09-04 23:03:00 3
3 2015-12-04 01:00:00 4
4 2015-04-15 07:12:00 5
5 2015-04-21 12:59:00 6
6 2015-04-29 17:33:00 7
7 2015-04-05 10:44:00 8
8 2015-06-05 11:12:00 9
9 2015-10-05 08:52:00 10
10 2015-12-05 14:19:00 11
11 2015-05-19 19:22:00 12
12 2015-05-27 22:31:00 13
13 2015-01-06 11:09:00 14
14 2015-04-06 12:57:00 15
15 2015-10-06 04:00:00 16
16 2015-06-15 03:23:00 17
17 2015-06-19 05:37:00 18
18 2015-06-23 13:41:00 19
19 2015-06-27 15:43:00 20
In [13]: tempDF['date'] = tempDF['date'].apply(lambda x:x.replace(hour=12,minute=0))
In [14]: tempDF
Out[14]:
date id
0 2015-02-04 12:00:00 1
1 2015-06-04 12:00:00 2
2 2015-09-04 12:00:00 3
3 2015-12-04 12:00:00 4
4 2015-04-15 12:00:00 5
5 2015-04-21 12:00:00 6
6 2015-04-29 12:00:00 7
7 2015-04-05 12:00:00 8
8 2015-06-05 12:00:00 9
9 2015-10-05 12:00:00 10
10 2015-12-05 12:00:00 11
11 2015-05-19 12:00:00 12
12 2015-05-27 12:00:00 13
13 2015-01-06 12:00:00 14
14 2015-04-06 12:00:00 15
15 2015-10-06 12:00:00 16
16 2015-06-15 12:00:00 17
17 2015-06-19 12:00:00 18
18 2015-06-23 12:00:00 19
19 2015-06-27 12:00:00 20
Timing information
In [52]: df = pd.DataFrame([[datetime.datetime.now()] for _ in range(100000)],columns=['date'])
In [54]: %%timeit
....: df['date'].dt.normalize() + datetime.timedelta(hours=12)
....:
The slowest run took 12.53 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 32.3 ms per loop
In [57]: %%timeit
....: df['date'].apply(lambda x:x.replace(hour=12,minute=0))
....:
1 loops, best of 3: 1.09 s per loop
Here's the solution I used to replace the time component of the datetime values in a Pandas DataFrame. Not sure how efficient this solution is, but it fit my needs.
import pandas as pd
# Create a list of EOCY dates for a specified period
sDate = pd.Timestamp('2022-01-31 23:59:00')
eDate = pd.Timestamp('2060-01-31 23:59:00')
dtList = pd.date_range(sDate, eDate, freq='Y').to_pydatetime()
# Create a DataFrame with a single column called 'Date' and fill the rows with the list of EOCY dates.
df = pd.DataFrame({'Date': dtList})
# Loop through the DataFrame rows using the replace function to replace the hours and minutes of each date value.
for i in range(df.shape[0]):
    df.iloc[i, 0] = df.iloc[i, 0].replace(hour=0, minute=0)
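A vectorized alternative to the row loop (a sketch on two toy rows) avoids per-row replace entirely by zeroing every time component at once with the .dt accessor:

```python
import pandas as pd

# Two toy end-of-year timestamps with a non-midnight time component.
df = pd.DataFrame({'Date': pd.to_datetime(['2022-12-31 23:59:00',
                                           '2023-12-31 23:59:00'])})

# .dt.normalize() sets the time of every value to 00:00:00 in one pass.
df['Date'] = df['Date'].dt.normalize()
print(df['Date'].tolist())
```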