Incrementing dates in pandas groupby - python

I'm building a basic rota/schedule for staff and have a DataFrame, built from a MySQL cursor, that gives a list of IDs, dates and classes:
id the_date class
0 195593 2017-09-12 14:00:00 3
1 193972 2017-09-13 09:15:00 2
2 195594 2017-09-13 14:00:00 3
3 195595 2017-09-15 14:00:00 3
4 193947 2017-09-16 17:30:00 3
5 195627 2017-09-17 08:00:00 2
6 193948 2017-09-19 11:30:00 2
7 195628 2017-09-21 08:00:00 2
8 193949 2017-09-21 11:30:00 2
9 195629 2017-09-24 08:00:00 2
10 193950 2017-09-24 10:00:00 2
11 193951 2017-09-27 11:30:00 2
12 195644 2017-09-28 06:00:00 1
13 194400 2017-09-28 08:00:00 1
14 195630 2017-09-28 08:00:00 2
15 193952 2017-09-29 11:30:00 2
16 195631 2017-10-01 08:00:00 2
17 194401 2017-10-06 08:00:00 1
18 195645 2017-10-06 10:00:00 1
19 195632 2017-10-07 13:30:00 3
If the class == 1, I need that instance duplicated 5 times:
first_class = df[df['class'] == 1]
non_first_class = df[df['class'] != 1]
first_class_replicated = pd.concat([first_class] * 5, ignore_index=True).sort_values(['the_date'])
id the_date class
0 195644 2017-09-28 06:00:00 1
16 195644 2017-09-28 06:00:00 1
4 195644 2017-09-28 06:00:00 1
12 195644 2017-09-28 06:00:00 1
8 195644 2017-09-28 06:00:00 1
17 194400 2017-09-28 08:00:00 1
13 194400 2017-09-28 08:00:00 1
9 194400 2017-09-28 08:00:00 1
5 194400 2017-09-28 08:00:00 1
1 194400 2017-09-28 08:00:00 1
6 194401 2017-10-06 08:00:00 1
18 194401 2017-10-06 08:00:00 1
10 194401 2017-10-06 08:00:00 1
14 194401 2017-10-06 08:00:00 1
2 194401 2017-10-06 08:00:00 1
11 195645 2017-10-06 10:00:00 1
3 195645 2017-10-06 10:00:00 1
15 195645 2017-10-06 10:00:00 1
7 195645 2017-10-06 10:00:00 1
19 195645 2017-10-06 10:00:00 1
I then merge non_first_class and first_class_replicated. Before that, though, I need the dates in first_class_replicated to increment by one day within each id group. Below is how I need it to look. Is there an elegant pandas solution to this, or should I be looking at looping over a groupby series to modify the dates?
Desired:
id the_date
0 195644 2017-09-28 6:00:00
16 195644 2017-09-29 6:00:00
4 195644 2017-09-30 6:00:00
12 195644 2017-10-01 6:00:00
8 195644 2017-10-02 6:00:00
17 194400 2017-09-28 8:00:00
13 194400 2017-09-29 8:00:00
9 194400 2017-09-30 8:00:00
5 194400 2017-10-01 8:00:00
1 194400 2017-10-02 8:00:00
6 194401 2017-10-06 8:00:00
18 194401 2017-10-07 8:00:00
10 194401 2017-10-08 8:00:00
14 194401 2017-10-09 8:00:00
2 194401 2017-10-10 8:00:00
11 195645 2017-10-06 10:00:00
3 195645 2017-10-07 10:00:00
15 195645 2017-10-08 10:00:00
7 195645 2017-10-09 10:00:00
19 195645 2017-10-10 10:00:00

You can use cumcount to number the repeats within each id group, convert that count to_timedelta, and add it to the column:
import numpy as np
import pandas as pd

# another solution for the repeat step
first_class_replicated = first_class.loc[np.repeat(first_class.index, 5)] \
                                    .sort_values(['the_date'])
df1 = first_class_replicated.groupby('id').cumcount()
first_class_replicated['the_date'] += pd.to_timedelta(df1, unit='D')
print(first_class_replicated)
id the_date class
0 195644 2017-09-28 06:00:00 1
16 195644 2017-09-29 06:00:00 1
4 195644 2017-09-30 06:00:00 1
12 195644 2017-10-01 06:00:00 1
8 195644 2017-10-02 06:00:00 1
17 194400 2017-09-28 08:00:00 1
13 194400 2017-09-29 08:00:00 1
9 194400 2017-09-30 08:00:00 1
5 194400 2017-10-01 08:00:00 1
1 194400 2017-10-02 08:00:00 1
6 194401 2017-10-06 08:00:00 1
18 194401 2017-10-07 08:00:00 1
10 194401 2017-10-08 08:00:00 1
14 194401 2017-10-09 08:00:00 1
2 194401 2017-10-10 08:00:00 1
11 195645 2017-10-06 10:00:00 1
3 195645 2017-10-07 10:00:00 1
15 195645 2017-10-08 10:00:00 1
7 195645 2017-10-09 10:00:00 1
19 195645 2017-10-10 10:00:00 1
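To finish the task from the question, the replicated rows can then be recombined with the unchanged rows; a minimal sketch, reusing non_first_class and first_class_replicated from above:
# stitch the two pieces back together and restore chronological order
result = (
    pd.concat([non_first_class, first_class_replicated])
      .sort_values('the_date')
      .reset_index(drop=True)
)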

Related

pandas - get max of groups based on multiple columns

I have the following pandas DataFrame df:
FFDI_SFC AET_date
latitude longitude time
-39.7650000000 140.8954000000 2017-09-30 13:00:00 1 2017-09-30
2017-09-30 14:00:00 2 2017-10-01
2017-09-30 15:00:00 1 2017-10-01
2017-09-30 16:00:00 1 2017-10-01
2017-09-30 17:00:00 2 2017-10-01
2017-09-30 18:00:00 4 2017-10-01
2017-09-30 19:00:00 3 2017-10-01
2017-09-30 20:00:00 2 2017-10-01
2017-09-30 21:00:00 4 2017-10-01
2017-09-30 22:00:00 1 2017-10-01
2017-09-30 23:00:00 3 2017-10-01
2017-10-01 00:00:00 nan 2017-10-01
2017-10-01 01:00:00 nan 2017-10-01
2017-10-01 02:00:00 4 2017-10-01
2017-10-01 03:00:00 3 2017-10-01
2017-10-01 04:00:00 nan 2017-10-01
2017-10-01 05:00:00 5 2017-10-01
2017-10-01 06:00:00 nan 2017-10-01
2017-10-01 07:00:00 4 2017-10-01
2017-10-01 08:00:00 4 2017-10-01
2017-10-01 09:00:00 4 2017-10-01
2017-10-01 10:00:00 3 2017-10-01
2017-10-01 11:00:00 4 2017-10-01
2017-10-01 12:00:00 5 2017-10-01
2017-10-01 13:00:00 3 2017-10-02
2017-10-01 13:00:00 3 2017-10-02
2017-10-01 14:00:00 nan 2017-10-02
2017-10-01 14:00:00 4 2017-10-02
2017-10-01 15:00:00 5 2017-10-02
2017-10-01 16:00:00 nan 2017-10-02
2017-10-01 17:00:00 nan 2017-10-02
2017-10-01 18:00:00 nan 2017-10-02
... ... ... ... ...
-33.9350000000 151.0466000000 2017-10-08 07:00:00 6 2017-10-08
2017-10-08 08:00:00 5 2017-10-08
2017-10-08 09:00:00 5 2017-10-08
2017-10-08 10:00:00 6 2017-10-08
2017-10-08 11:00:00 6 2017-10-08
2017-10-08 12:00:00 nan 2017-10-08
2017-10-08 13:00:00 6 2017-10-09
2017-10-08 13:00:00 nan 2017-10-09
2017-10-08 14:00:00 7 2017-10-09
2017-10-08 14:00:00 7 2017-10-09
2017-10-08 15:00:00 7 2017-10-09
... ... ... ... ... ... ... ... ... ... ...
2017-10-10 09:00:00 nan 2017-10-10
2017-10-10 10:00:00 12 2017-10-10
2017-10-10 11:00:00 10 2017-10-10
2017-10-10 12:00:00 14 2017-10-10
2017-10-10 13:00:00 13 2017-10-11
2017-10-10 14:00:00 15 2017-10-11
103554880 rows × 2 columns
The frame is multi-indexed on (latitude, longitude, time). The AET_date column gives the actual date for each record, and FFDI_SFC is a value that may be NaN.
What I want to achieve is the max of FFDI_SFC over the rows that share the same latitude, longitude and AET_date. In other words, group the rows by latitude, longitude and AET_date and take the daily max for each group.
The anticipated output will look like:
Max_Daily_FFDI_SFC
latitude longitude AET_date
-39.7650000000 140.8954000000 2017-09-30 5
2017-10-01 7
2017-10-02 5
... ... ... ... ...
-33.9350000000 151.0466000000 2017-10-08 14
2017-10-09 12
2017-10-10 16
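A minimal sketch of that grouping, assuming latitude and longitude are index levels and AET_date is a regular column (pandas resolves groupby names against columns first, then index levels):
result = (
    df.groupby(['latitude', 'longitude', 'AET_date'])['FFDI_SFC']
      .max()  # NaN entries are skipped by default
      .to_frame('Max_Daily_FFDI_SFC')
)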

How do I create a new column by matching a variable with another variable and have it repeat until the first variable changes in R?

receptor year month day hour hour.inc lat lon height pressure date
1 2018 1 3 19 0 31.768 -106.501 500.0 835.6 2018-01-03 19:00:00
1 2018 1 3 18 -1 31.628 -106.350 508.8 840.5 2018-01-03 18:00:00
1 2018 1 3 17 -2 31.489 -106.180 526.2 839.4 2018-01-03 17:00:00
1 2018 1 3 16 -3 31.372 -105.974 547.6 836.8 2018-01-03 16:00:00
1 2018 1 3 15 -4 31.289 -105.731 555.3 829.8 2018-01-03 15:00:00
1 2018 1 3 14 -5 31.265 -105.462 577.8 812.8 2018-01-03 14:00:00
1 2018 1 3 13 -6 31.337 -105.175 640.0 793.9 2018-01-03 13:00:00
1 2018 1 3 12 -7 31.492 -104.897 645.6 809.2 2018-01-03 12:00:00
1 2018 1 3 11 -8 31.671 -104.700 686.8 801.0 2018-01-03 11:00:00
1 2018 1 3 10 -9 31.913 -104.552 794.2 795.8 2018-01-03 10:00:00
2 2018 1 4 19 0 31.768 -106.501 500.0 830.9 2018-01-04 19:00:00
2 2018 1 4 18 -1 31.904 -106.635 611.5 819.5 2018-01-04 18:00:00
2 2018 1 4 17 -2 32.070 -106.749 709.7 808.0 2018-01-04 17:00:00
2 2018 1 4 16 -3 32.223 -106.855 787.3 794.9 2018-01-04 16:00:00
Above is what my dataframe looks like. I am trying to create a new column called date1 so that the frame looks like the one below.
receptor year month day hour hour.inc lat lon height pressure date date1
1 1 2018 1 3 19 0 31.768 -106.501 500.0 835.6 2018-01-03 19:00:00 2018-01-03 19:00:00
2 1 2018 1 3 18 -1 31.628 -106.350 508.8 840.5 2018-01-03 18:00:00 2018-01-03 19:00:00
3 1 2018 1 3 17 -2 31.489 -106.180 526.2 839.4 2018-01-03 17:00:00 2018-01-03 19:00:00
4 1 2018 1 3 16 -3 31.372 -105.974 547.6 836.8 2018-01-03 16:00:00 2018-01-03 19:00:00
5 1 2018 1 3 15 -4 31.289 -105.731 555.3 829.8 2018-01-03 15:00:00 2018-01-03 19:00:00
6 1 2018 1 3 14 -5 31.265 -105.462 577.8 812.8 2018-01-03 14:00:00 2018-01-03 19:00:00
7 1 2018 1 3 13 -6 31.337 -105.175 640.0 793.9 2018-01-03 13:00:00 2018-01-03 19:00:00
8 1 2018 1 3 12 -7 31.492 -104.897 645.6 809.2 2018-01-03 12:00:00 2018-01-03 19:00:00
9 1 2018 1 3 11 -8 31.671 -104.700 686.8 801.0 2018-01-03 11:00:00 2018-01-03 19:00:00
10 1 2018 1 3 10 -9 31.913 -104.552 794.2 795.8 2018-01-03 10:00:00 2018-01-03 19:00:00
38 2 2018 1 4 19 0 31.768 -106.501 500.0 830.9 2018-01-04 19:00:00 2018-01-04 19:00:00
39 2 2018 1 4 18 -1 31.904 -106.635 611.5 819.5 2018-01-04 18:00:00 2018-01-04 19:00:00
40 2 2018 1 4 17 -2 32.070 -106.749 709.7 808.0 2018-01-04 17:00:00 2018-01-04 19:00:00
41 2 2018 1 4 16 -3 32.223 -106.855 787.3 794.9 2018-01-04 16:00:00 2018-01-04 19:00:00
Disregard the index furthest to the left. I want to match each receptor (e.g. 1, 2) with the first occurrence of its date (e.g. 2018-01-03 19:00:00, 2018-01-04 19:00:00) and repeat that value until the receptor changes.
I'm working in R, so I'd like to find a solution in R, but I could also use a Python solution and make use of the reticulate package in R.
Using data.table you can try
library(data.table)
setDT(df) # convert the data.frame to a data.table by reference
df[, date1 := date[1], receptor] # take the first date per receptor
df
#Output
receptor year month day hour hour.inc lat lon height pressure date date1
1: 1 2018 1 3 19 0 31.768 -106.501 500.0 835.6 2018-01-03 19:00:00 2018-01-03 19:00:00
2: 1 2018 1 3 18 -1 31.628 -106.350 508.8 840.5 2018-01-03 18:00:00 2018-01-03 19:00:00
3: 1 2018 1 3 17 -2 31.489 -106.180 526.2 839.4 2018-01-03 17:00:00 2018-01-03 19:00:00
4: 1 2018 1 3 16 -3 31.372 -105.974 547.6 836.8 2018-01-03 16:00:00 2018-01-03 19:00:00
5: 1 2018 1 3 15 -4 31.289 -105.731 555.3 829.8 2018-01-03 15:00:00 2018-01-03 19:00:00
6: 1 2018 1 3 14 -5 31.265 -105.462 577.8 812.8 2018-01-03 14:00:00 2018-01-03 19:00:00
7: 1 2018 1 3 13 -6 31.337 -105.175 640.0 793.9 2018-01-03 13:00:00 2018-01-03 19:00:00
8: 1 2018 1 3 12 -7 31.492 -104.897 645.6 809.2 2018-01-03 12:00:00 2018-01-03 19:00:00
9: 1 2018 1 3 11 -8 31.671 -104.700 686.8 801.0 2018-01-03 11:00:00 2018-01-03 19:00:00
10: 1 2018 1 3 10 -9 31.913 -104.552 794.2 795.8 2018-01-03 10:00:00 2018-01-03 19:00:00
11: 2 2018 1 4 19 0 31.768 -106.501 500.0 830.9 2018-01-04 19:00:00 2018-01-04 19:00:00
12: 2 2018 1 4 18 -1 31.904 -106.635 611.5 819.5 2018-01-04 18:00:00 2018-01-04 19:00:00
13: 2 2018 1 4 17 -2 32.070 -106.749 709.7 808.0 2018-01-04 17:00:00 2018-01-04 19:00:00
14: 2 2018 1 4 16 -3 32.223 -106.855 787.3 794.9 2018-01-04 16:00:00 2018-01-04 19:00:00
Try filling the positions where the receptor is unchanged with np.nan and the positions where it changes with the date at that index, then simply forward-fill using .ffill().
df.receptor.shift().ne(df.receptor) tells you where the receptor value changes; it compares each value with the previous one.
import numpy as np
df['date1'] = np.where(df.receptor.shift().ne(df.receptor), df.date, np.nan)
df.date1 = df.date1.ffill()
receptor year month day hour hour.inc lat lon height pressure date date1
0 1 2018 1 3 19 0 31.768 -106.501 500.0 835.6 2018-01-03 19:00:00 2018-01-03 19:00:00
1 1 2018 1 3 18 -1 31.628 -106.350 508.8 840.5 2018-01-03 18:00:00 2018-01-03 19:00:00
2 1 2018 1 3 17 -2 31.489 -106.180 526.2 839.4 2018-01-03 17:00:00 2018-01-03 19:00:00
3 1 2018 1 3 16 -3 31.372 -105.974 547.6 836.8 2018-01-03 16:00:00 2018-01-03 19:00:00
4 1 2018 1 3 15 -4 31.289 -105.731 555.3 829.8 2018-01-03 15:00:00 2018-01-03 19:00:00
5 1 2018 1 3 14 -5 31.265 -105.462 577.8 812.8 2018-01-03 14:00:00 2018-01-03 19:00:00
6 1 2018 1 3 13 -6 31.337 -105.175 640.0 793.9 2018-01-03 13:00:00 2018-01-03 19:00:00
7 1 2018 1 3 12 -7 31.492 -104.897 645.6 809.2 2018-01-03 12:00:00 2018-01-03 19:00:00
8 1 2018 1 3 11 -8 31.671 -104.700 686.8 801.0 2018-01-03 11:00:00 2018-01-03 19:00:00
9 1 2018 1 3 10 -9 31.913 -104.552 794.2 795.8 2018-01-03 10:00:00 2018-01-03 19:00:00
10 2 2018 1 4 19 0 31.768 -106.501 500.0 830.9 2018-01-04 19:00:00 2018-01-04 19:00:00
11 2 2018 1 4 18 -1 31.904 -106.635 611.5 819.5 2018-01-04 18:00:00 2018-01-04 19:00:00
12 2 2018 1 4 17 -2 32.070 -106.749 709.7 808.0 2018-01-04 17:00:00 2018-01-04 19:00:00
13 2 2018 1 4 16 -3 32.223 -106.855 787.3 794.9 2018-01-04 16:00:00 2018-01-04 19:00:00
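A more compact pandas alternative (a sketch, equivalent in effect to the np.where/ffill approach above) is to broadcast the first date of each group with transform:
# broadcast the first date of each receptor group to every row in that group
df['date1'] = df.groupby('receptor')['date'].transform('first')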
Consider base R's ave: after computing a helper Date column, it can return the first date-time per date grouping using head:
df <- within(df, {
  date_short <- as.Date(substr(as.character(date), 1, 10), origin="1970-01-01")
  first_dt_hour <- ave(date, date_short, FUN=function(x) head(x, 1))
  rm(date_short) # DROP HELPER COLUMN
})
print(df)
# receptor year month day hour hour.inc lat lon height pressure date first_dt_hour
# 1 1 2018 1 3 19 0 31.768 -106.501 500.0 835.6 2018-01-03 19:00:00 2018-01-03 19:00:00
# 2 1 2018 1 3 18 -1 31.628 -106.350 508.8 840.5 2018-01-03 18:00:00 2018-01-03 19:00:00
# 3 1 2018 1 3 17 -2 31.489 -106.180 526.2 839.4 2018-01-03 17:00:00 2018-01-03 19:00:00
# 4 1 2018 1 3 16 -3 31.372 -105.974 547.6 836.8 2018-01-03 16:00:00 2018-01-03 19:00:00
# 5 1 2018 1 3 15 -4 31.289 -105.731 555.3 829.8 2018-01-03 15:00:00 2018-01-03 19:00:00
# 6 1 2018 1 3 14 -5 31.265 -105.462 577.8 812.8 2018-01-03 14:00:00 2018-01-03 19:00:00
# 7 1 2018 1 3 13 -6 31.337 -105.175 640.0 793.9 2018-01-03 13:00:00 2018-01-03 19:00:00
# 8 1 2018 1 3 12 -7 31.492 -104.897 645.6 809.2 2018-01-03 12:00:00 2018-01-03 19:00:00
# 9 1 2018 1 3 11 -8 31.671 -104.700 686.8 801.0 2018-01-03 11:00:00 2018-01-03 19:00:00
# 10 1 2018 1 3 10 -9 31.913 -104.552 794.2 795.8 2018-01-03 10:00:00 2018-01-03 19:00:00
# 38 2 2018 1 4 19 0 31.768 -106.501 500.0 830.9 2018-01-04 19:00:00 2018-01-04 19:00:00
# 39 2 2018 1 4 18 -1 31.904 -106.635 611.5 819.5 2018-01-04 18:00:00 2018-01-04 19:00:00
# 40 2 2018 1 4 17 -2 32.070 -106.749 709.7 808.0 2018-01-04 17:00:00 2018-01-04 19:00:00
# 41 2 2018 1 4 16 -3 32.223 -106.855 787.3 794.9 2018-01-04 16:00:00 2018-01-04 19:00:00
Data
data <- ' receptor year month day hour hour.inc lat lon height pressure date
1 1 2018 1 3 19 0 31.768 -106.501 500.0 835.6 "2018-01-03 19:00:00"
2 1 2018 1 3 18 -1 31.628 -106.350 508.8 840.5 "2018-01-03 18:00:00"
3 1 2018 1 3 17 -2 31.489 -106.180 526.2 839.4 "2018-01-03 17:00:00"
4 1 2018 1 3 16 -3 31.372 -105.974 547.6 836.8 "2018-01-03 16:00:00"
5 1 2018 1 3 15 -4 31.289 -105.731 555.3 829.8 "2018-01-03 15:00:00"
6 1 2018 1 3 14 -5 31.265 -105.462 577.8 812.8 "2018-01-03 14:00:00"
7 1 2018 1 3 13 -6 31.337 -105.175 640.0 793.9 "2018-01-03 13:00:00"
8 1 2018 1 3 12 -7 31.492 -104.897 645.6 809.2 "2018-01-03 12:00:00"
9 1 2018 1 3 11 -8 31.671 -104.700 686.8 801.0 "2018-01-03 11:00:00"
10 1 2018 1 3 10 -9 31.913 -104.552 794.2 795.8 "2018-01-03 10:00:00"
38 2 2018 1 4 19 0 31.768 -106.501 500.0 830.9 "2018-01-04 19:00:00"
39 2 2018 1 4 18 -1 31.904 -106.635 611.5 819.5 "2018-01-04 18:00:00"
40 2 2018 1 4 17 -2 32.070 -106.749 709.7 808.0 "2018-01-04 17:00:00"
41 2 2018 1 4 16 -3 32.223 -106.855 787.3 794.9 "2018-01-04 16:00:00"'
df <- read.table(text=data,
                 colClasses=c(rep("integer", 7), rep("numeric", 4), "POSIXct"),
                 header=TRUE)

How do I display a subset of a pandas dataframe?

I have a dataframe df that contains a datetime for every hour of every day between 2003-02-12 and 2017-06-30, and I want to delete all datetimes between 24th Dec and 1st Jan of EVERY year.
An extract of my data frame is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
7512,2003-12-24 00:00:00
7513,2003-12-24 01:00:00
7514,2003-12-24 02:00:00
7515,2003-12-24 03:00:00
7516,2003-12-24 04:00:00
7517,2003-12-24 05:00:00
7518,2003-12-24 06:00:00
...
7723,2004-01-01 19:00:00
7724,2004-01-01 20:00:00
7725,2004-01-01 21:00:00
7726,2004-01-01 22:00:00
7727,2004-01-01 23:00:00
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
and my expected output is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
...
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
Sample dataframe:
dates
0 2003-12-23 23:00:00
1 2003-12-24 05:00:00
2 2004-12-27 05:00:00
3 2003-12-13 23:00:00
4 2002-12-23 23:00:00
5 2004-01-01 05:00:00
6 2014-12-24 05:00:00
Solution:
If you want the dates in that window excluded for every year, extract the month and day first:
df['month'] = df['dates'].dt.month
df['day'] = df['dates'].dt.day
And now put the condition check:
dec_days = [24, 25, 26, 27, 28, 29, 30, 31]
## if the month is dec, then check for these dates
## if the month is jan, then just check for the day to be 1 like below
df = df[~(((df.month == 12) & (df.day.isin(dec_days))) | ((df.month == 1) & (df.day == 1)))]
Sample output:
dates month day
0 2003-12-23 23:00:00 12 23
3 2003-12-13 23:00:00 12 13
4 2002-12-23 23:00:00 12 23
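The same filter can also be written without the helper columns; a sketch against the dates column directly:
# day >= 24 in December is equivalent to the dec_days list above
mask = ((df['dates'].dt.month == 12) & (df['dates'].dt.day >= 24)) | \
       ((df['dates'].dt.month == 1) & (df['dates'].dt.day == 1))
df = df[~mask]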
This takes advantage of the fact that datetime-strings in the form mm-dd are sortable. Read everything in from the CSV file then filter for the dates you want:
df = pd.read_csv('...', parse_dates=['DateTime'])
s = df['DateTime'].dt.strftime('%m-%d')
excluded = (s == '01-01') | (s >= '12-24') # Jan 1 or >= Dec 24
df[~excluded]
You can try dropping rows on conditionals, maybe with a pattern match against the date string, or by parsing the date and conditionally removing rows.
datesIdontLike = df[df['colname'] == <stringPattern>].index
newDF = df.drop(datesIdontLike)  # note: with inplace=True, drop returns None
Check this out: https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/
(If you have issues, let me know.)
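For example, a minimal sketch of that idea, assuming the datetime column is named dates and matching on its month-day string:
import pandas as pd

df = pd.DataFrame({'dates': pd.to_datetime(
    ['2003-12-23 23:00:00', '2003-12-24 05:00:00', '2004-01-01 05:00:00'])})

# rows whose month-day string falls inside the excluded window
bad = ['12-24', '12-25', '12-26', '12-27', '12-28',
       '12-29', '12-30', '12-31', '01-01']
datesIdontLike = df[df['dates'].dt.strftime('%m-%d').isin(bad)].index
newDF = df.drop(datesIdontLike)  # only the 2003-12-23 row survives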
You can use pandas and boolean filtering with strftime
# version 0.23.4
import pandas as pd
# make df
df = pd.DataFrame(pd.date_range('20181223', '20190103', freq='H'), columns=['date'])
# string-format the date to only include the month and day,
# then keep rows strictly less than '12-24' AND greater than or equal to '01-02'
df = df.loc[
    (df.date.dt.strftime('%m-%d') < '12-24') &
    (df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2018-12-23 00:00:00
1 2018-12-23 01:00:00
2 2018-12-23 02:00:00
3 2018-12-23 03:00:00
4 2018-12-23 04:00:00
5 2018-12-23 05:00:00
6 2018-12-23 06:00:00
7 2018-12-23 07:00:00
8 2018-12-23 08:00:00
9 2018-12-23 09:00:00
10 2018-12-23 10:00:00
11 2018-12-23 11:00:00
12 2018-12-23 12:00:00
13 2018-12-23 13:00:00
14 2018-12-23 14:00:00
15 2018-12-23 15:00:00
16 2018-12-23 16:00:00
17 2018-12-23 17:00:00
18 2018-12-23 18:00:00
19 2018-12-23 19:00:00
20 2018-12-23 20:00:00
21 2018-12-23 21:00:00
22 2018-12-23 22:00:00
23 2018-12-23 23:00:00
240 2019-01-02 00:00:00
241 2019-01-02 01:00:00
242 2019-01-02 02:00:00
243 2019-01-02 03:00:00
244 2019-01-02 04:00:00
245 2019-01-02 05:00:00
246 2019-01-02 06:00:00
247 2019-01-02 07:00:00
248 2019-01-02 08:00:00
249 2019-01-02 09:00:00
250 2019-01-02 10:00:00
251 2019-01-02 11:00:00
252 2019-01-02 12:00:00
253 2019-01-02 13:00:00
254 2019-01-02 14:00:00
255 2019-01-02 15:00:00
256 2019-01-02 16:00:00
257 2019-01-02 17:00:00
258 2019-01-02 18:00:00
259 2019-01-02 19:00:00
260 2019-01-02 20:00:00
261 2019-01-02 21:00:00
262 2019-01-02 22:00:00
263 2019-01-02 23:00:00
264 2019-01-03 00:00:00
This will work with multiple years because we are only filtering on the month and day.
# change range to include 2017
df = pd.DataFrame(pd.date_range('20171223', '20190103', freq='H'), columns=['date'])
df = df.loc[
    (df.date.dt.strftime('%m-%d') < '12-24') &
    (df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2017-12-23 00:00:00
1 2017-12-23 01:00:00
2 2017-12-23 02:00:00
3 2017-12-23 03:00:00
4 2017-12-23 04:00:00
5 2017-12-23 05:00:00
6 2017-12-23 06:00:00
7 2017-12-23 07:00:00
8 2017-12-23 08:00:00
9 2017-12-23 09:00:00
10 2017-12-23 10:00:00
11 2017-12-23 11:00:00
12 2017-12-23 12:00:00
13 2017-12-23 13:00:00
14 2017-12-23 14:00:00
15 2017-12-23 15:00:00
16 2017-12-23 16:00:00
17 2017-12-23 17:00:00
18 2017-12-23 18:00:00
19 2017-12-23 19:00:00
20 2017-12-23 20:00:00
21 2017-12-23 21:00:00
22 2017-12-23 22:00:00
23 2017-12-23 23:00:00
240 2018-01-02 00:00:00
241 2018-01-02 01:00:00
242 2018-01-02 02:00:00
243 2018-01-02 03:00:00
244 2018-01-02 04:00:00
245 2018-01-02 05:00:00
... ...
8779 2018-12-23 19:00:00
8780 2018-12-23 20:00:00
8781 2018-12-23 21:00:00
8782 2018-12-23 22:00:00
8783 2018-12-23 23:00:00
9000 2019-01-02 00:00:00
9001 2019-01-02 01:00:00
9002 2019-01-02 02:00:00
9003 2019-01-02 03:00:00
9004 2019-01-02 04:00:00
9005 2019-01-02 05:00:00
9006 2019-01-02 06:00:00
9007 2019-01-02 07:00:00
9008 2019-01-02 08:00:00
9009 2019-01-02 09:00:00
9010 2019-01-02 10:00:00
9011 2019-01-02 11:00:00
9012 2019-01-02 12:00:00
9013 2019-01-02 13:00:00
9014 2019-01-02 14:00:00
9015 2019-01-02 15:00:00
9016 2019-01-02 16:00:00
9017 2019-01-02 17:00:00
9018 2019-01-02 18:00:00
9019 2019-01-02 19:00:00
9020 2019-01-02 20:00:00
9021 2019-01-02 21:00:00
9022 2019-01-02 22:00:00
9023 2019-01-02 23:00:00
9024 2019-01-03 00:00:00
Since you want this to happen for every year, we can first define a series in which the year is replaced by a static value (2000, for example). Let date be the column that stores the date; we can generate such a series as:
dt = pd.to_datetime({'year': 2000, 'month': df['date'].dt.month, 'day': df['date'].dt.day})
For the given sample data, we get:
>>> dt
0 2000-12-23
1 2000-12-23
2 2000-12-23
3 2000-12-23
4 2000-12-23
5 2000-12-23
6 2000-12-23
7 2000-12-24
8 2000-12-24
9 2000-12-24
10 2000-12-24
11 2000-12-24
12 2000-12-24
13 2000-12-24
14 2000-01-01
15 2000-01-01
16 2000-01-01
17 2000-01-01
18 2000-01-01
19 2000-01-02
20 2000-01-02
21 2000-01-02
22 2000-01-02
23 2000-01-02
24 2000-01-02
25 2000-01-02
26 2000-01-02
dtype: datetime64[ns]
Next we can filter the rows, like:
from datetime import date
df[(dt >= date(2000,1,2)) & (dt < date(2000,12,24))]
This gives us the following data for your sample data:
>>> df[(dt >= date(2000,1,2)) & (dt < date(2000,12,24))]
id dt
0 7505 2003-12-23 17:00:00
1 7506 2003-12-23 18:00:00
2 7507 2003-12-23 19:00:00
3 7508 2003-12-23 20:00:00
4 7509 2003-12-23 21:00:00
5 7510 2003-12-23 22:00:00
6 7511 2003-12-23 23:00:00
19 7728 2004-01-02 00:00:00
20 7729 2004-01-02 01:00:00
21 7730 2004-01-02 02:00:00
22 7731 2004-01-02 03:00:00
23 7732 2004-01-02 04:00:00
24 7733 2004-01-02 05:00:00
25 7734 2004-01-02 06:00:00
26 7735 2004-01-02 07:00:00
So regardless of the year, we will only keep dates between the 2nd of January and the 23rd of December (both inclusive).

Drop duplicates, keep most recent date, Pandas dataframe

I have a Pandas dataframe containing two columns: a datetime column, and a column of integers representing station IDs. I need a new dataframe with the following modifications:
For each set of duplicate STATION_ID values, keep the row with the most recent DATE_CHANGED entry. If the duplicate entries for a STATION_ID all contain the same DATE_CHANGED, drop the duplicates and retain a single row for that STATION_ID. If there are no duplicates for a STATION_ID value, simply retain the row.
Dataframe (sorted by STATION_ID):
DATE_CHANGED STATION_ID
0 2006-06-07 06:00:00 1
1 2000-09-26 06:00:00 1
2 2000-09-26 06:00:00 1
3 2000-09-26 06:00:00 1
4 2001-06-06 06:00:00 2
5 2005-07-29 06:00:00 2
6 2005-07-29 06:00:00 2
7 2001-06-06 06:00:00 2
8 2001-06-08 06:00:00 4
9 2003-11-25 07:00:00 4
10 2001-06-12 06:00:00 7
11 2001-06-04 06:00:00 8
12 2017-04-03 18:36:16 8
13 2017-04-03 18:36:16 8
14 2017-04-03 18:36:16 8
15 2001-06-04 06:00:00 8
16 2001-06-08 06:00:00 10
17 2001-06-08 06:00:00 10
18 2001-06-08 06:00:00 11
19 2001-06-08 06:00:00 11
20 2001-06-08 06:00:00 12
21 2001-06-08 06:00:00 12
22 2001-06-08 06:00:00 13
23 2001-06-08 06:00:00 13
24 2001-06-08 06:00:00 14
25 2001-06-08 06:00:00 14
26 2001-06-08 06:00:00 15
27 2017-08-07 17:48:25 15
28 2001-06-08 06:00:00 15
29 2017-08-07 17:48:25 15
... ... ...
157066 2018-08-06 14:11:28 71655
157067 2018-08-06 14:11:28 71656
157068 2018-08-06 14:11:28 71656
157069 2018-09-11 21:45:05 71664
157070 2018-09-11 21:45:05 71664
157071 2018-09-11 21:45:05 71664
157072 2018-09-11 21:41:04 71664
157073 2018-08-09 15:22:07 71720
157074 2018-08-09 15:22:07 71720
157075 2018-08-09 15:22:07 71720
157076 2018-08-23 12:43:12 71899
157077 2018-08-23 12:43:12 71899
157078 2018-08-23 12:43:12 71899
157079 2018-09-08 20:21:43 71969
157080 2018-09-08 20:21:43 71969
157081 2018-09-08 20:21:43 71969
157082 2018-09-08 20:21:43 71984
157083 2018-09-08 20:21:43 71984
157084 2018-09-08 20:21:43 71984
157085 2018-09-05 18:46:18 71985
157086 2018-09-05 18:46:18 71985
157087 2018-09-05 18:46:18 71985
157088 2018-09-08 20:21:44 71990
157089 2018-09-08 20:21:44 71990
157090 2018-09-08 20:21:44 71990
157091 2018-09-08 20:21:43 72003
157092 2018-09-08 20:21:43 72003
157093 2018-09-08 20:21:43 72003
157094 2018-09-10 17:06:18 72024
157095 2018-09-10 17:15:05 72024
[157096 rows x 2 columns]
DATE_CHANGED is dtype: datetime64[ns]
STATION_ID is dtype: int64
pandas==0.23.4
python==2.7.15
Try:
df.sort_values('DATE_CHANGED').drop_duplicates('STATION_ID', keep='last')
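A quick check on a toy frame (made-up rows, just to illustrate how keep='last' after sorting retains the newest entry per station):
import pandas as pd

df = pd.DataFrame({
    'DATE_CHANGED': pd.to_datetime(['2006-06-07 06:00:00',
                                    '2000-09-26 06:00:00',
                                    '2001-06-06 06:00:00',
                                    '2005-07-29 06:00:00']),
    'STATION_ID': [1, 1, 2, 2],
})

# sort so the newest entry per station comes last, then keep it
out = df.sort_values('DATE_CHANGED').drop_duplicates('STATION_ID', keep='last')
print(out)
#         DATE_CHANGED  STATION_ID
# 3 2005-07-29 06:00:00           2
# 0 2006-06-07 06:00:00           1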

Resample python list with pandas

Fairly new to Python and pandas here.
I run a query that gives me back a timeseries. I'm never sure how many data points I'll receive from the query (run for a single day), but what I do know is that I need to resample them to 24 points (one for each hour in the day).
Printing m3hstream gives
[(1479218009000L, 109), (1479287368000L, 84)]
Then I try to make a dataframe df with
df = pd.DataFrame(data = list(m3hstream), columns=['Timestamp', 'Value'])
and this gives me an output of
Timestamp Value
0 1479218009000 109
1 1479287368000 84
Following that, I do this
daily_summary = pd.DataFrame()
daily_summary['value'] = df['Value'].resample('H').mean()
daily_summary = daily_summary.truncate(before=start, after=end)
print "Now daily summary"
print daily_summary
But this is giving me a TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Could anyone please let me know how to resample it so I have 1 point for each hour in the 24 hour period that I'm querying for?
Thanks.
The first thing you need to do is convert that 'Timestamp' column to actual pd.Timestamp values; it looks like those are milliseconds.
Then resample with the on parameter set to 'Timestamp':
df = df.assign(
    Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index()
Timestamp Value
0 2016-11-15 13:00:00 109.0
1 2016-11-15 14:00:00 NaN
2 2016-11-15 15:00:00 NaN
3 2016-11-15 16:00:00 NaN
4 2016-11-15 17:00:00 NaN
5 2016-11-15 18:00:00 NaN
6 2016-11-15 19:00:00 NaN
7 2016-11-15 20:00:00 NaN
8 2016-11-15 21:00:00 NaN
9 2016-11-15 22:00:00 NaN
10 2016-11-15 23:00:00 NaN
11 2016-11-16 00:00:00 NaN
12 2016-11-16 01:00:00 NaN
13 2016-11-16 02:00:00 NaN
14 2016-11-16 03:00:00 NaN
15 2016-11-16 04:00:00 NaN
16 2016-11-16 05:00:00 NaN
17 2016-11-16 06:00:00 NaN
18 2016-11-16 07:00:00 NaN
19 2016-11-16 08:00:00 NaN
20 2016-11-16 09:00:00 84.0
If you want to fill those NaN values, use ffill, bfill, or interpolate
df.assign(
    Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index().interpolate()
Timestamp Value
0 2016-11-15 13:00:00 109.00
1 2016-11-15 14:00:00 107.75
2 2016-11-15 15:00:00 106.50
3 2016-11-15 16:00:00 105.25
4 2016-11-15 17:00:00 104.00
5 2016-11-15 18:00:00 102.75
6 2016-11-15 19:00:00 101.50
7 2016-11-15 20:00:00 100.25
8 2016-11-15 21:00:00 99.00
9 2016-11-15 22:00:00 97.75
10 2016-11-15 23:00:00 96.50
11 2016-11-16 00:00:00 95.25
12 2016-11-16 01:00:00 94.00
13 2016-11-16 02:00:00 92.75
14 2016-11-16 03:00:00 91.50
15 2016-11-16 04:00:00 90.25
16 2016-11-16 05:00:00 89.00
17 2016-11-16 06:00:00 87.75
18 2016-11-16 07:00:00 86.50
19 2016-11-16 08:00:00 85.25
20 2016-11-16 09:00:00 84.00
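Note that resample only spans from the first to the last observation, so you may get fewer than 24 rows for the queried day. To guarantee exactly one point per hour, you can reindex against an explicit hourly range; a sketch, assuming the day boundaries are known (here 2016-11-15, as in the output above):
hours = pd.date_range('2016-11-15', periods=24, freq='H')
hourly = (
    df.assign(Timestamp=pd.to_datetime(df.Timestamp, unit='ms'))
      .set_index('Timestamp')
      .resample('H').mean()
)
# exactly 24 rows; interpolate fills gaps between observations,
# but hours before the first observation stay NaN
daily_summary = hourly.reindex(hours).interpolate()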
Let's try setting the timestamp as the index first, converting it from milliseconds:
df = df.set_index('Timestamp')
df.index = pd.to_datetime(df.index, unit='ms')
For once an hour:
daily_summary = df.resample('H').mean()
or for once a day:
daily_summary = df.resample('D').mean()
