I have time-series data starting from Jan 1st with columns including 'Month', 'Hour of Day' and 'Hour of Year'. I would like to create a datetime column that expresses all of this information in the format MM/DD/YYYY HH:MM.
I have tried converting the 'Hour of Year' column to both datetime and timedelta objects, but each time I receive an error saying that the hour must have a value between 0 and 23. Since I have data for the whole year, the column ranges from 1 to 8760.
I expect to get data that looks like this: 1/1/2018 1:00.
Here is a sample of the dataset I am working with:
Month Hour_of_Day Hour_of_Year
1 1 1
1 2 2
1 3 3
1 4 4
1 5 5
1 6 6
1 7 7
1 8 8
1 9 9
1 10 10
1 11 11
1 12 12
1 13 13
1 14 14
1 15 15
1 16 16
1 17 17
1 18 18
1 19 19
1 20 20
1 21 21
1 22 22
1 23 23
1 24 24
1 1 25
1 2 26
1 3 27
1 4 28
1 5 29
1 6 30
1 7 31
1 8 32
1 9 33
1 10 34
1 11 35
1 12 36
1 13 37
1 14 38
1 15 39
1 16 40
1 17 41
1 18 42
1 19 43
1 20 44
1 21 45
1 22 46
1 23 47
1 24 48
1 1 49
1 2 50
1 3 51
1 4 52
1 5 53
1 6 54
1 7 55
1 8 56
1 9 57
1 10 58
1 11 59
1 12 60
1 13 61
pd.to_timedelta is your friend here:
df['ts'] = pd.Timestamp('2018-01-01') + pd.to_timedelta(df.Hour_of_Year, unit='h')
gives:
Month Hour_of_Day Hour_of_Year ts
0 1 1 1 2018-01-01 01:00:00
1 1 2 2 2018-01-01 02:00:00
2 1 3 3 2018-01-01 03:00:00
3 1 4 4 2018-01-01 04:00:00
4 1 5 5 2018-01-01 05:00:00
5 1 6 6 2018-01-01 06:00:00
6 1 7 7 2018-01-01 07:00:00
7 1 8 8 2018-01-01 08:00:00
8 1 9 9 2018-01-01 09:00:00
9 1 10 10 2018-01-01 10:00:00
10 1 11 11 2018-01-01 11:00:00
11 1 12 12 2018-01-01 12:00:00
12 1 13 13 2018-01-01 13:00:00
13 1 14 14 2018-01-01 14:00:00
14 1 15 15 2018-01-01 15:00:00
15 1 16 16 2018-01-01 16:00:00
16 1 17 17 2018-01-01 17:00:00
17 1 18 18 2018-01-01 18:00:00
18 1 19 19 2018-01-01 19:00:00
19 1 20 20 2018-01-01 20:00:00
20 1 21 21 2018-01-01 21:00:00
21 1 22 22 2018-01-01 22:00:00
22 1 23 23 2018-01-01 23:00:00
23 1 24 24 2018-01-02 00:00:00
24 1 1 25 2018-01-02 01:00:00
25 1 2 26 2018-01-02 02:00:00
26 1 3 27 2018-01-02 03:00:00
27 1 4 28 2018-01-02 04:00:00
28 1 5 29 2018-01-02 05:00:00
29 1 6 30 2018-01-02 06:00:00
.. ... ... ... ...
31 1 8 32 2018-01-02 08:00:00
32 1 9 33 2018-01-02 09:00:00
33 1 10 34 2018-01-02 10:00:00
34 1 11 35 2018-01-02 11:00:00
35 1 12 36 2018-01-02 12:00:00
36 1 13 37 2018-01-02 13:00:00
37 1 14 38 2018-01-02 14:00:00
38 1 15 39 2018-01-02 15:00:00
39 1 16 40 2018-01-02 16:00:00
40 1 17 41 2018-01-02 17:00:00
41 1 18 42 2018-01-02 18:00:00
42 1 19 43 2018-01-02 19:00:00
43 1 20 44 2018-01-02 20:00:00
44 1 21 45 2018-01-02 21:00:00
45 1 22 46 2018-01-02 22:00:00
46 1 23 47 2018-01-02 23:00:00
47 1 24 48 2018-01-03 00:00:00
48 1 1 49 2018-01-03 01:00:00
49 1 2 50 2018-01-03 02:00:00
50 1 3 51 2018-01-03 03:00:00
51 1 4 52 2018-01-03 04:00:00
52 1 5 53 2018-01-03 05:00:00
53 1 6 54 2018-01-03 06:00:00
54 1 7 55 2018-01-03 07:00:00
55 1 8 56 2018-01-03 08:00:00
56 1 9 57 2018-01-03 09:00:00
57 1 10 58 2018-01-03 10:00:00
58 1 11 59 2018-01-03 11:00:00
59 1 12 60 2018-01-03 12:00:00
60 1 13 61 2018-01-03 13:00:00
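To get the exact MM/DD/YYYY HH:MM text the question asks for, the resulting timestamps can be formatted with strftime. A minimal sketch (note this produces strings, so keep the datetime column around if you still need to sort or resample):

```python
import pandas as pd

df = pd.DataFrame({'Hour_of_Year': [1, 2, 25]})
# same construction as above, using the lowercase 'h' unit alias
df['ts'] = pd.Timestamp('2018-01-01') + pd.to_timedelta(df.Hour_of_Year, unit='h')
# render in the requested MM/DD/YYYY HH:MM form (this yields strings, not datetimes)
df['ts_text'] = df['ts'].dt.strftime('%m/%d/%Y %H:%M')
```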
The dataframe below is what I'm trying to plot, but each column begins with a run of duplicate entries. I would like to remove those initial repeated values, keeping only the final entry of each run, so that they do not appear in the graph (duplicates in the middle or at the end can be ignored).
Could someone please help me solve this issue?
The code I tried only removes rows that are duplicated across the entire row:
df = df.drop_duplicates(subset=df.columns[1:], keep='last')
df = df.groupby((df.shift() != df).cumsum()).filter(lambda x: len(x) < 5)
Input:
Date Build1 Build2 Build3 Build4 Build5 Build6
2022-11-26 00:00:00 30 30 30 30 30 30
2022-11-27 00:00:00 30 30 30 30 30 30
2022-11-28 00:00:00 30 30 30 30 30 30
2022-11-29 00:00:00 30 30 30 30 30 30
2022-11-30 00:00:00 30 30 30 30 30 30
2022-12-01 00:00:00 28 30 30 30 30 30
2022-12-02 00:00:00 25 30 30 30 30 30
2022-12-03 00:00:00 25 30 30 30 30 30
2022-12-04 00:00:00 22 28 30 30 30 30
2022-12-05 00:00:00 22 26 30 30 30 30
2022-12-06 00:00:00 22 23 30 30 30 30
2022-12-07 00:00:00 22 22 30 30 30 30
2022-12-08 00:00:00 22 20 30 30 30 30
2022-12-09 00:00:00 22 20 25 30 30 30
2022-12-10 00:00:00 22 20 23 30 30 30
2022-12-11 00:00:00 22 20 23 30 30 30
2022-12-12 00:00:00 22 20 18 30 30 30
2022-12-13 00:00:00 22 20 14 30 30 30
2022-12-14 00:00:00 22 20 11 30 30 30
2022-12-15 00:00:00 22 20 10 27 30 30
2022-12-16 00:00:00 22 20 10 20 30 30
2022-12-17 00:00:00 22 20 10 20 30 30
2022-12-18 00:00:00 22 20 10 20 30 30
2022-12-19 00:00:00 22 20 10 13 30 30
2022-12-20 00:00:00 22 20 10 2 30 30
2022-12-21 00:00:00 22 20 10 2 19 30
2022-12-22 00:00:00 22 20 10 2 11 30
2022-12-23 00:00:00 22 20 10 2 4 30
2022-12-24 00:00:00 22 20 10 2 0 30
2022-12-25 00:00:00 22 20 10 2 0 22
2022-12-26 00:00:00 22 20 10 2 0 15
2022-12-27 00:00:00 22 20 10 2 0 15
2022-12-28 00:00:00 22 20 10 2 0 9
Expected output:
Date Build1 Build2 Build3 Build4 Build5 Build6
2022-11-26 00:00:00
2022-11-27 00:00:00
2022-11-28 00:00:00
2022-11-29 00:00:00
2022-11-30 00:00:00 30
2022-12-01 00:00:00 28
2022-12-02 00:00:00 25
2022-12-03 00:00:00 25 30
2022-12-04 00:00:00 22 28
2022-12-05 00:00:00 22 26
2022-12-06 00:00:00 22 23
2022-12-07 00:00:00 22 22
2022-12-08 00:00:00 22 20 30
2022-12-09 00:00:00 22 20 25
2022-12-10 00:00:00 22 20 23
2022-12-11 00:00:00 22 20 23
2022-12-12 00:00:00 22 20 18
2022-12-13 00:00:00 22 20 14
2022-12-14 00:00:00 22 20 11 30
2022-12-15 00:00:00 22 20 10 27
2022-12-16 00:00:00 22 20 10 20
2022-12-17 00:00:00 22 20 10 20
2022-12-18 00:00:00 22 20 10 20
2022-12-19 00:00:00 22 20 10 13
2022-12-20 00:00:00 22 20 10 2 30
2022-12-21 00:00:00 22 20 10 2 19
2022-12-22 00:00:00 22 20 10 2 11
2022-12-23 00:00:00 22 20 10 2 4
2022-12-24 00:00:00 22 20 10 2 0 30
2022-12-25 00:00:00 22 20 10 2 0 22
2022-12-26 00:00:00 22 20 10 2 0 15
2022-12-27 00:00:00 22 20 10 2 0 15
2022-12-28 00:00:00 22 20 10 2 0 9
You can simply do
is_duplicate = df.apply(pd.Series.duplicated, axis=1)
df.where(~is_duplicate, np.nan)
which gives
Date Build1 Build2 Build3 Build4
0 2022-11-26 00:00:00 30 30 NaN NaN NaN
1 2022-11-27 00:00:00 30 30 NaN NaN NaN
2 2022-11-28 00:00:00 30 30 NaN NaN NaN
3 2022-11-29 00:00:00 30 30 NaN NaN NaN
4 2022-11-30 00:00:00 30 30 NaN NaN NaN
5 2022-12-01 00:00:00 28 30 NaN NaN NaN
6 2022-12-02 00:00:00 25 30 NaN NaN NaN
7 2022-12-03 00:00:00 25 30 NaN NaN NaN
8 2022-12-04 00:00:00 22 30 NaN NaN NaN
9 2022-12-05 00:00:00 22 30 NaN NaN NaN
10 2022-12-06 00:00:00 22 30 NaN NaN NaN
11 2022-12-07 00:00:00 22 30 NaN NaN NaN
12 2022-12-08 00:00:00 22 30 NaN NaN NaN
13 2022-12-09 00:00:00 22 25 30.0 NaN NaN
14 2022-12-10 00:00:00 22 23 30.0 NaN NaN
15 2022-12-11 00:00:00 22 23 30.0 NaN NaN
16 2022-12-12 00:00:00 22 18 30.0 NaN NaN
17 2022-12-13 00:00:00 22 14 30.0 NaN NaN
18 2022-12-14 00:00:00 22 11 30.0 NaN NaN
19 2022-12-15 00:00:00 22 10 27.0 30.0 NaN
20 2022-12-16 00:00:00 22 10 20.0 30.0 NaN
21 2022-12-17 00:00:00 22 10 20.0 30.0 NaN
22 2022-12-18 00:00:00 22 10 20.0 30.0 NaN
23 2022-12-19 00:00:00 22 10 13.0 30.0 NaN
24 2022-12-20 00:00:00 22 10 2.0 30.0 NaN
25 2022-12-21 00:00:00 22 10 2.0 19.0 30.0
26 2022-12-22 00:00:00 22 10 2.0 11.0 30.0
27 2022-12-23 00:00:00 22 10 2.0 4.0 30.0
28 2022-12-24 00:00:00 22 10 2.0 0.0 30.0
29 2022-12-25 00:00:00 22 10 2.0 0.0 22.0
30 2022-12-26 00:00:00 22 10 2.0 0.0 15.0
31 2022-12-27 00:00:00 22 10 2.0 0.0 15.0
32 2022-12-28 00:00:00 22 10 2.0 0.0 9.0
or
is_duplicate = df.apply(pd.Series.duplicated, axis=1)
print(df.where(~is_duplicate, ''))
which gives:
Date Build1 Build2 Build3 Build4
0 2022-11-26 00:00:00 30 30
1 2022-11-27 00:00:00 30 30
2 2022-11-28 00:00:00 30 30
3 2022-11-29 00:00:00 30 30
4 2022-11-30 00:00:00 30 30
5 2022-12-01 00:00:00 28 30
6 2022-12-02 00:00:00 25 30
7 2022-12-03 00:00:00 25 30
8 2022-12-04 00:00:00 22 30
9 2022-12-05 00:00:00 22 30
10 2022-12-06 00:00:00 22 30
11 2022-12-07 00:00:00 22 30
12 2022-12-08 00:00:00 22 30
13 2022-12-09 00:00:00 22 25 30
14 2022-12-10 00:00:00 22 23 30
15 2022-12-11 00:00:00 22 23 30
16 2022-12-12 00:00:00 22 18 30
17 2022-12-13 00:00:00 22 14 30
18 2022-12-14 00:00:00 22 11 30
19 2022-12-15 00:00:00 22 10 27 30
20 2022-12-16 00:00:00 22 10 20 30
21 2022-12-17 00:00:00 22 10 20 30
22 2022-12-18 00:00:00 22 10 20 30
23 2022-12-19 00:00:00 22 10 13 30
24 2022-12-20 00:00:00 22 10 2 30
25 2022-12-21 00:00:00 22 10 2 19 30
26 2022-12-22 00:00:00 22 10 2 11 30
27 2022-12-23 00:00:00 22 10 2 4 30
28 2022-12-24 00:00:00 22 10 2 0 30
29 2022-12-25 00:00:00 22 10 2 0 22
30 2022-12-26 00:00:00 22 10 2 0 15
31 2022-12-27 00:00:00 22 10 2 0 15
32 2022-12-28 00:00:00 22 10 2 0 9
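The answer above masks every repeated value within a row; the question, though, asked only to blank each column's leading run of duplicates while keeping that run's last entry. A column-wise sketch of that behaviour (the helper name and demo frame are illustrative, not from the original post):

```python
import pandas as pd

def blank_leading_run(df, cols):
    """Blank each column's initial constant run, keeping the run's last row.

    Later duplicates (in the middle or at the end) are left untouched.
    """
    out = df.copy()
    for c in cols:
        s = out[c]
        # True while we are still inside the first constant run
        leading = s.eq(s.iloc[0]).cummin()
        # every leading row except the last one of the run
        drop = leading & leading.shift(-1, fill_value=False)
        out[c] = s.mask(drop)  # NaN where dropped
    return out

demo = pd.DataFrame({'Build1': [30, 30, 30, 28, 25, 30]})
res = blank_leading_run(demo, ['Build1'])
```

The `cummin` keeps the mask from re-triggering if the starting value reappears later, which matches the "ignore duplicates in middle and last" requirement.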
receptor year month day hour hour.inc lat lon height pressure date
1 2018 1 3 19 0 31.768 -106.501 500.0 835.6 2018-01-03 19:00:00
1 2018 1 3 18 -1 31.628 -106.350 508.8 840.5 2018-01-03 18:00:00
1 2018 1 3 17 -2 31.489 -106.180 526.2 839.4 2018-01-03 17:00:00
1 2018 1 3 16 -3 31.372 -105.974 547.6 836.8 2018-01-03 16:00:00
1 2018 1 3 15 -4 31.289 -105.731 555.3 829.8 2018-01-03 15:00:00
1 2018 1 3 14 -5 31.265 -105.462 577.8 812.8 2018-01-03 14:00:00
1 2018 1 3 13 -6 31.337 -105.175 640.0 793.9 2018-01-03 13:00:00
1 2018 1 3 12 -7 31.492 -104.897 645.6 809.2 2018-01-03 12:00:00
1 2018 1 3 11 -8 31.671 -104.700 686.8 801.0 2018-01-03 11:00:00
1 2018 1 3 10 -9 31.913 -104.552 794.2 795.8 2018-01-03 10:00:00
2 2018 1 4 19 0 31.768 -106.501 500.0 830.9 2018-01-04 19:00:00
2 2018 1 4 18 -1 31.904 -106.635 611.5 819.5 2018-01-04 18:00:00
2 2018 1 4 17 -2 32.070 -106.749 709.7 808.0 2018-01-04 17:00:00
2 2018 1 4 16 -3 32.223 -106.855 787.3 794.9 2018-01-04 16:00:00
Above is what my dataframe looks like, but I am trying to create a new column called date1 that will look like the frame below.
receptor year month day hour hour.inc lat lon height pressure date date1
1 1 2018 1 3 19 0 31.768 -106.501 500.0 835.6 2018-01-03 19:00:00 2018-01-03 19:00:00
2 1 2018 1 3 18 -1 31.628 -106.350 508.8 840.5 2018-01-03 18:00:00 2018-01-03 19:00:00
3 1 2018 1 3 17 -2 31.489 -106.180 526.2 839.4 2018-01-03 17:00:00 2018-01-03 19:00:00
4 1 2018 1 3 16 -3 31.372 -105.974 547.6 836.8 2018-01-03 16:00:00 2018-01-03 19:00:00
5 1 2018 1 3 15 -4 31.289 -105.731 555.3 829.8 2018-01-03 15:00:00 2018-01-03 19:00:00
6 1 2018 1 3 14 -5 31.265 -105.462 577.8 812.8 2018-01-03 14:00:00 2018-01-03 19:00:00
7 1 2018 1 3 13 -6 31.337 -105.175 640.0 793.9 2018-01-03 13:00:00 2018-01-03 19:00:00
8 1 2018 1 3 12 -7 31.492 -104.897 645.6 809.2 2018-01-03 12:00:00 2018-01-03 19:00:00
9 1 2018 1 3 11 -8 31.671 -104.700 686.8 801.0 2018-01-03 11:00:00 2018-01-03 19:00:00
10 1 2018 1 3 10 -9 31.913 -104.552 794.2 795.8 2018-01-03 10:00:00 2018-01-03 19:00:00
38 2 2018 1 4 19 0 31.768 -106.501 500.0 830.9 2018-01-04 19:00:00 2018-01-04 19:00:00
39 2 2018 1 4 18 -1 31.904 -106.635 611.5 819.5 2018-01-04 18:00:00 2018-01-04 19:00:00
40 2 2018 1 4 17 -2 32.070 -106.749 709.7 808.0 2018-01-04 17:00:00 2018-01-04 19:00:00
41 2 2018 1 4 16 -3 32.223 -106.855 787.3 794.9 2018-01-04 16:00:00 2018-01-04 19:00:00
Disregard the index furthest to the left. I want to match each receptor (e.g. 1, 2) with the first occurrence of its date (e.g. 2018-01-03 19:00:00, 2018-01-04 19:00:00) and repeat that value until the receptor changes.
I'm working in R so I'd like to find a solution in R but I could also use a python solution and make use of the Reticulate package in R.
Using data.table you can try
library(data.table)
setDT(df) # convert the data.frame to a data.table in place
df[, date1 := date[1], receptor] # take the first date per receptor
df
#Output
receptor year month day hour hour.inc lat lon height pressure date date1
1: 1 2018 1 3 19 0 31.768 -106.501 500.0 835.6 2018-01-03 19:00:00 2018-01-03 19:00:00
2: 1 2018 1 3 18 -1 31.628 -106.350 508.8 840.5 2018-01-03 18:00:00 2018-01-03 19:00:00
3: 1 2018 1 3 17 -2 31.489 -106.180 526.2 839.4 2018-01-03 17:00:00 2018-01-03 19:00:00
4: 1 2018 1 3 16 -3 31.372 -105.974 547.6 836.8 2018-01-03 16:00:00 2018-01-03 19:00:00
5: 1 2018 1 3 15 -4 31.289 -105.731 555.3 829.8 2018-01-03 15:00:00 2018-01-03 19:00:00
6: 1 2018 1 3 14 -5 31.265 -105.462 577.8 812.8 2018-01-03 14:00:00 2018-01-03 19:00:00
7: 1 2018 1 3 13 -6 31.337 -105.175 640.0 793.9 2018-01-03 13:00:00 2018-01-03 19:00:00
8: 1 2018 1 3 12 -7 31.492 -104.897 645.6 809.2 2018-01-03 12:00:00 2018-01-03 19:00:00
9: 1 2018 1 3 11 -8 31.671 -104.700 686.8 801.0 2018-01-03 11:00:00 2018-01-03 19:00:00
10: 1 2018 1 3 10 -9 31.913 -104.552 794.2 795.8 2018-01-03 10:00:00 2018-01-03 19:00:00
11: 2 2018 1 4 19 0 31.768 -106.501 500.0 830.9 2018-01-04 19:00:00 2018-01-04 19:00:00
12: 2 2018 1 4 18 -1 31.904 -106.635 611.5 819.5 2018-01-04 18:00:00 2018-01-04 19:00:00
13: 2 2018 1 4 17 -2 32.070 -106.749 709.7 808.0 2018-01-04 17:00:00 2018-01-04 19:00:00
14: 2 2018 1 4 16 -3 32.223 -106.855 787.3 794.9 2018-01-04 16:00:00 2018-01-04 19:00:00
Fill the positions where the receptor is unchanged with np.nan, put that row's date where the receptor changes, and then forward-fill with .ffill().
df.receptor.shift().ne(df.receptor) flags the rows where the receptor value changes, by comparing each value with the previous one.
df['date1'] = np.where(df.receptor.shift().ne(df.receptor), df.date, np.nan)
df.date1 = df.date1.ffill()
receptor year month day hour hour.inc lat lon height pressure date date1
0 1 2018 1 3 19 0 31.768 -106.501 500.0 835.6 2018-01-03 19:00:00 2018-01-03 19:00:00
1 1 2018 1 3 18 -1 31.628 -106.350 508.8 840.5 2018-01-03 18:00:00 2018-01-03 19:00:00
2 1 2018 1 3 17 -2 31.489 -106.180 526.2 839.4 2018-01-03 17:00:00 2018-01-03 19:00:00
3 1 2018 1 3 16 -3 31.372 -105.974 547.6 836.8 2018-01-03 16:00:00 2018-01-03 19:00:00
4 1 2018 1 3 15 -4 31.289 -105.731 555.3 829.8 2018-01-03 15:00:00 2018-01-03 19:00:00
5 1 2018 1 3 14 -5 31.265 -105.462 577.8 812.8 2018-01-03 14:00:00 2018-01-03 19:00:00
6 1 2018 1 3 13 -6 31.337 -105.175 640.0 793.9 2018-01-03 13:00:00 2018-01-03 19:00:00
7 1 2018 1 3 12 -7 31.492 -104.897 645.6 809.2 2018-01-03 12:00:00 2018-01-03 19:00:00
8 1 2018 1 3 11 -8 31.671 -104.700 686.8 801.0 2018-01-03 11:00:00 2018-01-03 19:00:00
9 1 2018 1 3 10 -9 31.913 -104.552 794.2 795.8 2018-01-03 10:00:00 2018-01-03 19:00:00
10 2 2018 1 4 19 0 31.768 -106.501 500.0 830.9 2018-01-04 19:00:00 2018-01-04 19:00:00
11 2 2018 1 4 18 -1 31.904 -106.635 611.5 819.5 2018-01-04 18:00:00 2018-01-04 19:00:00
12 2 2018 1 4 17 -2 32.070 -106.749 709.7 808.0 2018-01-04 17:00:00 2018-01-04 19:00:00
13 2 2018 1 4 16 -3 32.223 -106.855 787.3 794.9 2018-01-04 16:00:00 2018-01-04 19:00:00
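The mask-and-ffill step above can also be collapsed into a single groupby call: transform('first') broadcasts each group's first date to every row of that group. A sketch with a trimmed-down frame:

```python
import pandas as pd

df = pd.DataFrame({
    'receptor': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2018-01-03 19:00', '2018-01-03 18:00',
                            '2018-01-03 17:00', '2018-01-04 19:00',
                            '2018-01-04 18:00']),
})
# broadcast each receptor group's first date to all of its rows
df['date1'] = df.groupby('receptor')['date'].transform('first')
```

This relies on the first row of each group being the anchor date, which holds here because the data is ordered with hour.inc == 0 first.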
Consider base R's ave: first derive a Date column, then return the first datetime per date grouping using head:
df <- within(df, {
date_short <- as.Date(substr(as.character(date), 1, 10), origin="1970-01-01")
first_dt_hour <- ave(date, date_short, FUN=function(x) head(x, 1))
rm(date_short) # DROP HELPER COLUMN
})
print(df)
# receptor year month day hour hour.inc lat lon height pressure date first_dt_hour
# 1 1 2018 1 3 19 0 31.768 -106.501 500.0 835.6 2018-01-03 19:00:00 2018-01-03 19:00:00
# 2 1 2018 1 3 18 -1 31.628 -106.350 508.8 840.5 2018-01-03 18:00:00 2018-01-03 19:00:00
# 3 1 2018 1 3 17 -2 31.489 -106.180 526.2 839.4 2018-01-03 17:00:00 2018-01-03 19:00:00
# 4 1 2018 1 3 16 -3 31.372 -105.974 547.6 836.8 2018-01-03 16:00:00 2018-01-03 19:00:00
# 5 1 2018 1 3 15 -4 31.289 -105.731 555.3 829.8 2018-01-03 15:00:00 2018-01-03 19:00:00
# 6 1 2018 1 3 14 -5 31.265 -105.462 577.8 812.8 2018-01-03 14:00:00 2018-01-03 19:00:00
# 7 1 2018 1 3 13 -6 31.337 -105.175 640.0 793.9 2018-01-03 13:00:00 2018-01-03 19:00:00
# 8 1 2018 1 3 12 -7 31.492 -104.897 645.6 809.2 2018-01-03 12:00:00 2018-01-03 19:00:00
# 9 1 2018 1 3 11 -8 31.671 -104.700 686.8 801.0 2018-01-03 11:00:00 2018-01-03 19:00:00
# 10 1 2018 1 3 10 -9 31.913 -104.552 794.2 795.8 2018-01-03 10:00:00 2018-01-03 19:00:00
# 38 2 2018 1 4 19 0 31.768 -106.501 500.0 830.9 2018-01-04 19:00:00 2018-01-04 19:00:00
# 39 2 2018 1 4 18 -1 31.904 -106.635 611.5 819.5 2018-01-04 18:00:00 2018-01-04 19:00:00
# 40 2 2018 1 4 17 -2 32.070 -106.749 709.7 808.0 2018-01-04 17:00:00 2018-01-04 19:00:00
# 41 2 2018 1 4 16 -3 32.223 -106.855 787.3 794.9 2018-01-04 16:00:00 2018-01-04 19:00:00
Data
data <- ' receptor year month day hour hour.inc lat lon height pressure date
1 1 2018 1 3 19 0 31.768 -106.501 500.0 835.6 "2018-01-03 19:00:00"
2 1 2018 1 3 18 -1 31.628 -106.350 508.8 840.5 "2018-01-03 18:00:00"
3 1 2018 1 3 17 -2 31.489 -106.180 526.2 839.4 "2018-01-03 17:00:00"
4 1 2018 1 3 16 -3 31.372 -105.974 547.6 836.8 "2018-01-03 16:00:00"
5 1 2018 1 3 15 -4 31.289 -105.731 555.3 829.8 "2018-01-03 15:00:00"
6 1 2018 1 3 14 -5 31.265 -105.462 577.8 812.8 "2018-01-03 14:00:00"
7 1 2018 1 3 13 -6 31.337 -105.175 640.0 793.9 "2018-01-03 13:00:00"
8 1 2018 1 3 12 -7 31.492 -104.897 645.6 809.2 "2018-01-03 12:00:00"
9 1 2018 1 3 11 -8 31.671 -104.700 686.8 801.0 "2018-01-03 11:00:00"
10 1 2018 1 3 10 -9 31.913 -104.552 794.2 795.8 "2018-01-03 10:00:00"
38 2 2018 1 4 19 0 31.768 -106.501 500.0 830.9 "2018-01-04 19:00:00"
39 2 2018 1 4 18 -1 31.904 -106.635 611.5 819.5 "2018-01-04 18:00:00"
40 2 2018 1 4 17 -2 32.070 -106.749 709.7 808.0 "2018-01-04 17:00:00"
41 2 2018 1 4 16 -3 32.223 -106.855 787.3 794.9 "2018-01-04 16:00:00"'
df <- read.table(text=data,
colClasses=c(rep("integer", 7), rep("numeric", 4), "POSIXct"),
header=TRUE)
I have a dataframe in Python below:
print (df)
Date Hour Weight
0 2019-01-01 8 1
1 2019-01-01 16 2
2 2019-01-01 24 6
3 2019-01-02 8 10
4 2019-01-02 16 4
5 2019-01-02 24 12
6 2019-01-03 8 10
7 2019-01-03 16 6
8 2019-01-03 24 5
How can I create a column (New_Col) that returns the value of 'Hour' at the lowest value of 'Weight' in each day? I'm expecting:
Date Hour Weight New_Col
2019-01-01 8 1 8
2019-01-01 16 2 8
2019-01-01 24 6 8
2019-01-02 8 10 16
2019-01-02 16 4 16
2019-01-02 24 12 16
2019-01-03 8 10 24
2019-01-03 16 6 24
2019-01-03 24 5 24
Use GroupBy.transform with DataFrameGroupBy.idxmin, but first set Hour as the index so that idxmin returns the Hour belonging to the minimal Weight per group:
df['New'] = df.set_index('Hour').groupby('Date')['Weight'].transform('idxmin').values
print (df)
Date Hour Weight New_Col New
0 2019-01-01 8 1 8 8
1 2019-01-01 16 2 8 8
2 2019-01-01 24 6 8 8
3 2019-01-02 8 10 16 16
4 2019-01-02 16 4 16 16
5 2019-01-02 24 12 16 16
6 2019-01-03 8 10 24 24
7 2019-01-03 16 6 24 24
8 2019-01-03 24 5 24 24
Alternative solution:
df['New'] = df['Date'].map(df.set_index('Hour').groupby('Date')['Weight'].idxmin())
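The same result can also be reached without touching the index, by locating the minimal-Weight row of each day and mapping its Hour back onto every row of that date. A sketch using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2019-01-01'] * 3 + ['2019-01-02'] * 3,
    'Hour': [8, 16, 24] * 2,
    'Weight': [1, 2, 6, 10, 4, 12],
})
# one row per Date: the row holding that day's minimal Weight
mins = df.loc[df.groupby('Date')['Weight'].idxmin(), ['Date', 'Hour']]
# broadcast that row's Hour back onto every row of the same Date
df['New_Col'] = df['Date'].map(mins.set_index('Date')['Hour'])
```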
I have a dataframe with dates and values in columns A to H. I also have some fixed variables: X1=5, X2=6, Y1=7, Y2=8, Z1=9.
Date A B C D E F G H
0 2018-01-02 00:00:00 7161 7205 -44 54920 73 7 5 47073
1 2018-01-03 00:00:00 7101 7147 -46 54710 73 6 5 46570
2 2018-01-04 00:00:00 7146 7189 -43 54730 70 7 5 46933
3 2018-01-05 00:00:00 7079 7121 -43 54720 70 6 5 46404
4 2018-01-08 00:00:00 7080 7125 -45 54280 70 6 5 46355
5 2018-01-09 00:00:00 7060 7102 -43 54440 70 6 5 46319
6 2018-01-10 00:00:00 7113 7153 -40 54510 70 7 5 46837
7 2018-01-11 00:00:00 7103 7141 -38 54690 70 7 5 46728
8 2018-01-12 00:00:00 7074 7110 -36 54310 65 6 5 46357
9 2018-01-15 00:00:00 7181 7210 -29 54320 65 6 5 46792
10 2018-01-16 00:00:00 7036 7078 -42 54420 65 6 5 45709
11 2018-01-17 00:00:00 6994 7034 -40 53690 65 6 5 45416
12 2018-01-18 00:00:00 7032 7076 -44 53590 65 6 5 45705
13 2018-01-19 00:00:00 6999 7041 -42 53560 65 6 5 45331
14 2018-01-22 00:00:00 7025 7068 -43 53500 65 6 5 45455
15 2018-01-23 00:00:00 6883 6923 -41 53490 65 6 5 44470
16 2018-01-24 00:00:00 7111 7150 -39 52630 65 6 5 45866
17 2018-01-25 00:00:00 7101 7138 -37 53470 65 6 5 45663
18 2018-01-26 00:00:00 7043 7085 -43 53380 65 6 5 45087
19 2018-01-29 00:00:00 7041 7085 -44 53370 65 6 5 44958
20 2018-01-30 00:00:00 7010 7050 -41 53040 65 6 5 44790
21 2018-01-31 00:00:00 7079 7118 -39 52880 65 6 5 45248
What I wanted to do is adding some column-wise simple calculations to this dataframe using values in column A to H as well as those fixed variables.
The tricky part is that I need to apply different variables to different date ranges.
For example, during 2018-01-01 to 2018-01-10, I wanted to calculate a new column I where the value equals to: (A+B+C)*X1*Y1+Z1;
While during 2018-01-11 to 2018-01-25, the calculation needs to take (A+B+C)*X2*Y1+Z1. Similar to Y1 and Y2 applied to each of their date ranges.
I know this will calculate/create a new column I:
df['I'] = (df['A'] + df['B'] + df['C'])*X1*Y1 + Z1
but not sure how to be able to have that flexibility to use different variables to different date ranges.
You can use np.select to define a value based on a condition:
cond = [df.Date.between('2018-01-01','2018-01-10'), df.Date.between('2018-01-11','2018-01-25')]
values = [(df['A']+df['B']+df['C'])*X1*Y1+Z1, (df['A']+df['B']+df['C'])*X2*Y2+Z1]
# select values depending on the condition
df['I'] = np.select(cond, values)
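One detail worth knowing: rows that match none of the conditions get np.select's default of 0, which can look like real data. Passing default=np.nan makes those rows explicit. A self-contained sketch (the three-row frame is illustrative, and the second range uses X2*Y1 as stated in the question):

```python
import numpy as np
import pandas as pd

X1, X2, Y1, Z1 = 5, 6, 7, 9  # the question's fixed variables
df = pd.DataFrame({
    'Date': pd.to_datetime(['2018-01-05', '2018-01-15', '2018-01-30']),
    'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
})
cond = [df.Date.between('2018-01-01', '2018-01-10'),
        df.Date.between('2018-01-11', '2018-01-25')]
values = [(df.A + df.B + df.C) * X1 * Y1 + Z1,
          (df.A + df.B + df.C) * X2 * Y1 + Z1]
# rows outside every range get NaN instead of np.select's default of 0
df['I'] = np.select(cond, values, default=np.nan)
```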
I have a pandas dataFrame like this:
content
date
2013-12-18 12:30:00 1
2013-12-19 10:50:00 1
2013-12-24 11:00:00 0
2014-01-02 11:30:00 1
2014-01-03 11:50:00 0
2013-12-17 16:40:00 10
2013-12-18 10:00:00 0
2013-12-11 10:00:00 0
2013-12-18 11:45:00 0
2013-12-11 14:40:00 4
2010-05-25 13:05:00 0
2013-11-18 14:10:00 0
2013-11-27 11:50:00 3
2013-11-13 10:40:00 0
2013-11-20 10:40:00 1
2008-11-04 14:49:00 1
2013-11-18 10:05:00 0
2013-08-27 11:00:00 0
2013-09-18 16:00:00 0
2013-09-27 11:40:00 0
date being the index.
I reduce the values to months using:
dataFrame = dataFrame.groupby([lambda x: x.year, lambda x: x.month]).agg([sum])
which outputs:
content
sum
2006 3 66
4 65
5 48
6 87
7 37
8 54
9 73
10 74
11 53
12 45
2007 1 28
2 40
3 95
4 63
5 56
6 66
7 50
8 49
9 18
10 28
Now when I plot this dataFrame, I want the x-axis to show every month/year as a tick. I have tried setting xticks, but it doesn't seem to work. How could this be achieved? This is my current plot using dataFrame.plot():
You can use set_xticks() and set_xticklabels():
idx = pd.date_range("2013-01-01", periods=1000)
val = np.random.rand(1000)
s = pd.Series(val, idx)
g = s.groupby([s.index.year, s.index.month]).mean()
ax = g.plot()
ax.set_xticks(range(len(g)));
ax.set_xticklabels(["%s-%02d" % item for item in g.index.tolist()], rotation=90);
output: (plot image omitted)
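A variant that avoids hand-building the "%s-%02d" labels is to group on a monthly PeriodIndex, whose entries already render as YYYY-MM. A sketch; the plotting calls themselves are unchanged:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2013-01-01", periods=1000)
s = pd.Series(np.random.rand(1000), idx)
# group on a monthly PeriodIndex; its string form is already 'YYYY-MM'
g = s.groupby(s.index.to_period("M")).mean()
labels = [str(p) for p in g.index]  # e.g. '2013-01', '2013-02', ...
# ax = g.plot()
# ax.set_xticks(range(len(g)))
# ax.set_xticklabels(labels, rotation=90)
```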