I want to complete my time series of % humidity with the missing records (or rows). The sensors are designed to record a mean value every 15 minutes, so that is my target frequency. Here is an example for one station (not the best in terms of gaps...), but I have 36 measurement stations, 6 parameters and more than 24 000 records each to homogenize.
I select the datetime and % humidity columns, for example:
humdt = data["la-salade"][["datetime","humidite"]]
datetime humidite
0 2019-07-09 08:30:00 87
1 2019-07-09 11:00:00 87
2 2019-07-09 17:30:00 82
3 2019-07-09 23:30:00 80
4 2019-07-11 06:15:00 79
5 2019-07-19 14:30:00 39
I set datetime as the index (so far, it works):
humdt["datetime"] = pd.to_datetime(humdt["datetime"])
humdt = humdt.set_index("datetime",drop=True)
humidite
datetime
2019-07-09 08:30:00 87
2019-07-09 11:00:00 87
2019-07-09 17:30:00 82
2019-07-09 23:30:00 80
2019-07-11 06:15:00 79
2019-07-19 14:30:00 39
Besides this, I prepare a datetime range matching my target (15-minute frequency):
date_rng = pd.period_range(start=debut, end=fin, freq='15min').strftime('%Y-%m-%d %H:%M:%S')
date_rng = pd.DataFrame(date_rng)
date_rng.columns = ["datetime"]
Then, I use this range to reindex my humidity values (expecting NaN where records are missing):
humdt = humdt.reindex(pd.DatetimeIndex(date_rng["datetime"]))
humidite
datetime
2019-07-09 08:30:00 87.0
2019-07-09 08:45:00 88.0
2019-07-09 09:00:00 88.0
2019-07-09 09:15:00 88.0
2019-07-09 09:30:00 89.0
2019-07-09 09:45:00 89.0
2019-07-09 10:00:00 88.0
2019-07-09 10:15:00 88.0
2019-07-09 10:30:00 88.0
2019-07-09 10:45:00 88.0
2019-07-09 11:00:00 87.0
As a result, I get humidity values from nowhere... not even a classical linear interpolation (e.g. between 87% at 08:30 and 87% at 11:00). Please help me, I have no clue what is going on... (I also tried merging and resampling; the behavior was not as expected there either). Thank you!
You can pass the fill_value argument to df.reindex:
import numpy as np  # needed for np.nan

humdt = humdt.reindex(pd.DatetimeIndex(date_rng["datetime"]), fill_value=np.nan)
This will fill the newly created rows with NaN (which is also what reindex does by default).
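For completeness, here is a minimal, self-contained sketch of that pattern. The sample values are taken from the question, and explicit literals stand in for the debut/fin variables; building the grid directly with pd.date_range avoids the period_range/strftime round-trip, which is a common source of silently mismatched timestamps:

import numpy as np
import pandas as pd

# Sample data from the question
humdt = pd.DataFrame(
    {"humidite": [87, 87, 82, 80]},
    index=pd.to_datetime([
        "2019-07-09 08:30:00", "2019-07-09 11:00:00",
        "2019-07-09 17:30:00", "2019-07-09 23:30:00",
    ]),
)

# Target 15-minute grid, built directly as a DatetimeIndex
date_rng = pd.date_range("2019-07-09 08:30:00", "2019-07-09 23:30:00",
                         freq="15min")

# Rows absent from the original data come back as NaN
humdt = humdt.reindex(date_rng, fill_value=np.nan)
print(humdt.head(3))
#                      humidite
# 2019-07-09 08:30:00      87.0
# 2019-07-09 08:45:00       NaN
# 2019-07-09 09:00:00       NaN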
I am trying to find the cleanest, most pandastic way to create a new column that has the minimum values from one column in the same row as the maximum values in another column. The rest of the values can be NaN, as I will be interpolating.
import datetime
import numpy as np
import pandas as pd

rng = pd.date_range(start=datetime.date(2020,8,1), end=datetime.date(2020,8,3), freq='H')
df = pd.DataFrame(rng, columns=['date'])
df.index=pd.to_datetime(df['date'])
df.drop(['date'],axis=1,inplace=True)
df['val0']=np.random.randint(0,50,49)
df['val1']=np.random.randint(0,50,49)
One realization of df (cut and paste for reproducibility):
val0 val1
date
2020-08-01 00:00:00 17 4
2020-08-01 01:00:00 89 0
2020-08-01 02:00:00 85 48
2020-08-01 03:00:00 83 13
2020-08-01 04:00:00 56 65
2020-08-01 05:00:00 48 31
2020-08-01 06:00:00 55 11
2020-08-01 07:00:00 15 87
2020-08-01 08:00:00 92 70
2020-08-01 09:00:00 95 57
2020-08-01 10:00:00 68 79
2020-08-01 11:00:00 87 7
2020-08-01 12:00:00 43 15
2020-08-01 13:00:00 23 4
2020-08-01 14:00:00 68 13
2020-08-01 15:00:00 68 63
2020-08-01 16:00:00 28 86
2020-08-01 17:00:00 12 40
2020-08-01 18:00:00 51 20
2020-08-01 19:00:00 20 48
2020-08-01 20:00:00 79 78
2020-08-01 21:00:00 67 89
2020-08-01 22:00:00 46 52
2020-08-01 23:00:00 7 47
2020-08-02 00:00:00 14 73
2020-08-02 01:00:00 70 30
2020-08-02 02:00:00 2 39
2020-08-02 03:00:00 65 81
2020-08-02 04:00:00 65 8
2020-08-02 05:00:00 83 60
2020-08-02 06:00:00 1 64
2020-08-02 07:00:00 13 63
2020-08-02 08:00:00 45 78
2020-08-02 09:00:00 83 7
2020-08-02 10:00:00 75 0
2020-08-02 11:00:00 52 3
2020-08-02 12:00:00 59 34
2020-08-02 13:00:00 54 57
2020-08-02 14:00:00 90 66
2020-08-02 15:00:00 82 56
2020-08-02 16:00:00 9 2
2020-08-02 17:00:00 5 51
2020-08-02 18:00:00 67 96
2020-08-02 19:00:00 18 77
2020-08-02 20:00:00 28 89
2020-08-02 21:00:00 96 53
2020-08-02 22:00:00 28 46
2020-08-02 23:00:00 41 87
2020-08-03 00:00:00 26 47
Now I find idxmin and idxmax:
minidx=df.groupby(pd.Grouper(freq='D')).idxmin()
maxidx=df.groupby(pd.Grouper(freq='D')).idxmax()
minidx:
val0 val1
date
2020-08-01 2020-08-01 23:00:00 2020-08-01 01:00:00
2020-08-02 2020-08-02 06:00:00 2020-08-02 10:00:00
2020-08-03 2020-08-03 00:00:00 2020-08-03 00:00:00
maxidx:
val0 val1
date
2020-08-01 2020-08-01 09:00:00 2020-08-01 21:00:00
2020-08-02 2020-08-02 21:00:00 2020-08-02 18:00:00
2020-08-03 2020-08-03 00:00:00 2020-08-03 00:00:00
In this case, I would like to put the minimum daily value (7) located at 2020-08-01 23:00:00 into a new column at 2020-08-01 21:00:00 (i.e. adjacent to 89, the daily max of val1), and do the same for all other dates so the 'new' value on 2020-08-02 18:00:00 will be 1 (i.e. the minimum daily value occurring on 2020-08-02 06:00:00).
I tried the following, but I just get a bunch of NaNs:
df.loc[maxidx['val1'].values,'new']=df.loc[minidx['val0'].values,'val0']
If I just set it to an int (df.loc[maxidx['val1'].values,'new']=6), I get the int in the places I want the new values. The values I want are given by df.loc[minidx['val0'].values,'val0'], but I can't seem to get them into the dataframe.
minidx['val0'].values and maxidx['val1'].values are arrays of the same size with elements of type numpy.datetime64, and they are all generated from the same dataframe, so the maxidx and minidx values should all exist in df.index (df.index.values).
Is there an obvious reason this isn't working? Thanks
The simplest solution I have found is to loop through the idxmin and idxmax:
for v0, v1 in zip(minidx['val0'].values, maxidx['val1'].values):
    df.loc[v1, 'new'] = df.loc[v0, 'val0']
This gives me what I want, but doesn't seem very pandastic, so any other suggestions to accomplish the same thing would be great.
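For what it's worth, the NaNs in the original attempt come from index alignment: the right-hand side of the .loc assignment is a Series indexed by the minidx timestamps, and pandas aligns it against the maxidx target labels, where none of those timestamps exist. A loop-free sketch that strips the index before assigning:

# .to_numpy() drops the index, so no alignment takes place
df.loc[maxidx['val1'].values, 'new'] = df.loc[minidx['val0'].values, 'val0'].to_numpy()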
IIUC, you can do this using NamedAgg:
df.groupby(pd.Grouper(freq='D')).agg(
    val0_min_time=('val0', 'idxmin'),
    val0_min_value=('val0', 'min'),
    val0_max_time=('val0', 'idxmax'),
    val0_max_value=('val0', 'max'),
    val1_min_time=('val1', 'idxmin'),
    val1_min_value=('val1', 'min'),
    val1_max_time=('val1', 'idxmax'),
    val1_max_value=('val1', 'max'),
)
Output:
val0_min_time val0_min_value val0_max_time val0_max_value val1_min_time val1_min_value val1_max_time val1_max_value
date
2020-08-01 2020-08-01 23:00:00 7 2020-08-01 09:00:00 95 2020-08-01 01:00:00 0 2020-08-01 21:00:00 89
2020-08-02 2020-08-02 06:00:00 1 2020-08-02 21:00:00 96 2020-08-02 10:00:00 0 2020-08-02 18:00:00 96
2020-08-03 2020-08-03 00:00:00 26 2020-08-03 00:00:00 26 2020-08-03 00:00:00 47 2020-08-03 00:00:00 47
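If the goal is still the question's 'new' column, the aggregate can feed the same assignment. A sketch, assuming the result above is stored in a variable named agg:

agg = df.groupby(pd.Grouper(freq='D')).agg(
    val1_max_time=('val1', 'idxmax'),
    val0_min_value=('val0', 'min'),
)
# Place each day's val0 minimum at the timestamp of that day's val1 maximum
df.loc[agg['val1_max_time'], 'new'] = agg['val0_min_value'].to_numpy()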
I want to create the missing records in a time series of % humidity.
datetime humidite
0 2019-07-09 08:30:00 87
1 2019-07-09 11:00:00 87
2 2019-07-09 17:30:00 82
3 2019-07-09 23:30:00 80
4 2019-07-11 06:15:00 79
5 2019-07-19 14:30:00 39
6 2019-07-21 00:00:00 80
I indexed by the existing datetime values (the result at this step is OK):
humdt["datetime"] = pd.to_datetime(humdt["datetime"])
humdt = humdt.set_index("datetime")
humidite
datetime
2019-07-09 08:30:00 87
2019-07-09 11:00:00 87
2019-07-09 17:30:00 82
2019-07-09 23:30:00 80
2019-07-11 06:15:00 79
2019-07-19 14:30:00 39
Then I resample at the 15-minute target frequency:
humdt.resample("15min").asfreq()
humidite
datetime
2019-06-26 10:00:00 34.0
2019-06-26 10:15:00 33.0
2019-06-26 10:30:00 32.0
2019-06-26 10:45:00 31.0
2019-06-26 11:00:00 30.0
2019-06-26 11:15:00 29.0
As a result, I get the wrong starting time and values; only the frequency is respected.
Can you help me, please? I also tried merging my data with a datetime range defined to match my expected records, and it doesn't work. Thank you!!!
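(No answer was recorded for this question, but for reference: on a clean DatetimeIndex, resample('15min').asfreq() does start at the first observation and fills gaps with NaN. A minimal sketch with the question's values, which suggests the stray 2019-06-26 start came from extra rows in the full dataset rather than from resample itself:)

import pandas as pd

humdt = pd.DataFrame(
    {"humidite": [87, 87, 82, 80, 79, 39, 80]},
    index=pd.to_datetime([
        "2019-07-09 08:30:00", "2019-07-09 11:00:00",
        "2019-07-09 17:30:00", "2019-07-09 23:30:00",
        "2019-07-11 06:15:00", "2019-07-19 14:30:00",
        "2019-07-21 00:00:00",
    ]),
)
print(humdt.resample("15min").asfreq().head(3))
#                      humidite
# 2019-07-09 08:30:00      87.0
# 2019-07-09 08:45:00       NaN
# 2019-07-09 09:00:00       NaN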
I'm new to pandas and I'm having some problems when I try to obtain daily averages from a data file.
So, my data is structured as follows:
DATA ESTACION
DATETIME
2020-01-15 00:00:00 175 47
2020-01-15 01:00:00 152 47
2020-01-15 02:00:00 180 47
2020-01-15 03:00:00 132 47
2020-01-15 04:00:00 115 47
... ... ...
2020-03-13 19:00:00 38 16
2020-03-13 20:00:00 53 16
2020-03-13 21:00:00 73 16
2020-03-13 22:00:00 28 16
2020-03-13 23:00:00 22 16
These are air pollution results gathered by 24 stations. Each station receives hourly information as you can see.
I'm trying to get daily average data by station. So this is what I do:
First, I group all the data by station:
grouped = data.groupby(['ESTACION'])
Then I get the daily average by resampling the grouped data:
resampled = grouped.resample('D').mean()
And this is what I've obtained:
DATA ESTACION
ESTACION DATETIME
4 2020-01-02 18.250000 4.0
2020-01-03 NaN NaN
2020-01-04 NaN NaN
2020-01-05 NaN NaN
2020-01-06 NaN NaN
... ... ...
60 2020-11-29 NaN NaN
2020-11-30 NaN NaN
2020-12-01 NaN NaN
2020-12-02 118.666667 60.0
2020-12-03 80.833333 60.0
I don't really know what's going on, because I've only got data for 2020-01-15 through 2020-03-13, yet it shows rows for other timestamps and NaN results.
If you need anything else to reproduce this case let me know.
Thanks and best regards
The output is expected, because resample always creates a consecutive DatetimeIndex.
So it is possible to remove the missing rows with DataFrame.dropna:
resampled = grouped.resample('D').mean().dropna()
Another solution is to use Series.dt.date (this assumes DATETIME is a column; reset the index first if it is the index):
data.groupby(['ESTACION', data['DATETIME'].dt.date]).mean()
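A small, self-contained sketch of both options (toy values, two stations on non-adjacent days):

import pandas as pd

data = pd.DataFrame({
    "DATETIME": pd.to_datetime(["2020-01-15 00:00", "2020-01-15 01:00",
                                "2020-01-20 00:00"]),
    "DATA": [175, 152, 38],
    "ESTACION": [47, 47, 16],
})

# Option 1: resample fills the calendar, dropna removes the filler days
opt1 = (data.set_index("DATETIME")
            .groupby("ESTACION")["DATA"]
            .resample("D").mean()
            .dropna())

# Option 2: group by the calendar date directly -- no filler rows appear
opt2 = data.groupby(["ESTACION", data["DATETIME"].dt.date])["DATA"].mean()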
Sorry for the badly phrased question. Currently, only the first hour of each day is updated with the holiday.
e.g.
2013-01-01 00:00:00 - New Years Day
2013-01-01 01:00:00 - None
2013-01-01 02:00:00 - None
I would like to apply the holiday to all rows of the same date using pandas (Python).
What would be the most efficient method to apply the holiday to the same dates? There are a number of other holidays to apply as well.
Thank you in advance!
Screenshot of CSV in question
Using a library called holidays together with pandas apply could be a great solution to your problem. Here is a short, self-contained example:
import pandas as pd
import holidays
us_holidays = holidays.UnitedStates()
# Create a sample DataFrame. You can just use your own
data = pd.DataFrame(pd.date_range('2020-01-01', '2020-01-30'), columns=['date'])
data['holiday'] = data['date'].apply(lambda x: us_holidays.get(x))
print(data)
Output
date holiday
0 2020-01-01 New Year's Day
1 2020-01-02 None
2 2020-01-03 None
3 2020-01-04 None
4 2020-01-05 None
5 2020-01-06 None
6 2020-01-07 None
7 2020-01-08 None
8 2020-01-09 None
9 2020-01-10 None
10 2020-01-11 None
11 2020-01-12 None
12 2020-01-13 None
13 2020-01-14 None
14 2020-01-15 None
15 2020-01-16 None
16 2020-01-17 None
17 2020-01-18 None
18 2020-01-19 None
19 2020-01-20 Martin Luther King, Jr. Day
20 2020-01-21 None
21 2020-01-22 None
22 2020-01-23 None
23 2020-01-24 None
24 2020-01-25 None
25 2020-01-26 None
26 2020-01-27 None
27 2020-01-28 None
28 2020-01-29 None
29 2020-01-30 None
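If your index is hourly rather than daily, the same lookup works per row; a sketch (normalizing to the calendar date keeps the intent explicit, since the holidays object keys by date):

import pandas as pd
import holidays

us_holidays = holidays.UnitedStates()
hourly = pd.DataFrame(
    {"date": pd.date_range("2013-01-01", periods=48, freq="H")})
# Every hour of a holiday gets the label, not just the first one
hourly["holiday"] = hourly["date"].dt.date.map(us_holidays.get)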
IIUC, you have only the first hour of a day listed with a holiday. Here is a small sample of a dataframe with two months of data and three holidays on three separate days.
import pandas as pd
import numpy as np
df = pd.DataFrame({'temp':np.random.randint(50,110, 60*24)}, index=pd.date_range('2013-01-01', periods=(60*24), freq='H'))
df['Holiday'] = np.nan
df.loc['2013-01-01 00:00:00', 'Holiday'] = 'New Years Day'
df.loc['2013-02-02 00:00:00', 'Holiday'] = 'Groundhog Day'
df.loc['2013-02-14 00:00:00', 'Holiday'] = "Valentine's Day"
Now, let's use groupby with the calendar date from the DatetimeIndex and ffill (grouping by df.index.day alone would mix the same day number across different months):
df['Holiday'] = df.groupby(df.index.date)['Holiday'].ffill()
Let's look at a few records:
print(df.head(40))
print(df.loc['2013-02-02'])
print(df.loc['2013-02-13':'2013-02-15'])
Output:
temp Holiday
2013-01-01 00:00:00 51 New Years Day
2013-01-01 01:00:00 71 New Years Day
2013-01-01 02:00:00 61 New Years Day
2013-01-01 03:00:00 90 New Years Day
2013-01-01 04:00:00 77 New Years Day
2013-01-01 05:00:00 69 New Years Day
2013-01-01 06:00:00 50 New Years Day
2013-01-01 07:00:00 99 New Years Day
2013-01-01 08:00:00 86 New Years Day
2013-01-01 09:00:00 72 New Years Day
2013-01-01 10:00:00 89 New Years Day
2013-01-01 11:00:00 62 New Years Day
2013-01-01 12:00:00 53 New Years Day
2013-01-01 13:00:00 91 New Years Day
2013-01-01 14:00:00 51 New Years Day
2013-01-01 15:00:00 93 New Years Day
2013-01-01 16:00:00 97 New Years Day
2013-01-01 17:00:00 83 New Years Day
2013-01-01 18:00:00 87 New Years Day
2013-01-01 19:00:00 58 New Years Day
2013-01-01 20:00:00 84 New Years Day
2013-01-01 21:00:00 92 New Years Day
2013-01-01 22:00:00 106 New Years Day
2013-01-01 23:00:00 104 New Years Day
2013-01-02 00:00:00 78 NaN
2013-01-02 01:00:00 104 NaN
2013-01-02 02:00:00 96 NaN
2013-01-02 03:00:00 103 NaN
2013-01-02 04:00:00 60 NaN
2013-01-02 05:00:00 87 NaN
2013-01-02 06:00:00 108 NaN
2013-01-02 07:00:00 85 NaN
2013-01-02 08:00:00 67 NaN
2013-01-02 09:00:00 61 NaN
2013-01-02 10:00:00 91 NaN
2013-01-02 11:00:00 79 NaN
2013-01-02 12:00:00 99 NaN
2013-01-02 13:00:00 82 NaN
2013-01-02 14:00:00 75 NaN
2013-01-02 15:00:00 90 NaN
temp Holiday
2013-02-02 00:00:00 82 Groundhog Day
2013-02-02 01:00:00 58 Groundhog Day
2013-02-02 02:00:00 102 Groundhog Day
2013-02-02 03:00:00 90 Groundhog Day
2013-02-02 04:00:00 79 Groundhog Day
2013-02-02 05:00:00 50 Groundhog Day
2013-02-02 06:00:00 50 Groundhog Day
2013-02-02 07:00:00 83 Groundhog Day
2013-02-02 08:00:00 80 Groundhog Day
2013-02-02 09:00:00 50 Groundhog Day
2013-02-02 10:00:00 52 Groundhog Day
2013-02-02 11:00:00 69 Groundhog Day
2013-02-02 12:00:00 100 Groundhog Day
2013-02-02 13:00:00 61 Groundhog Day
2013-02-02 14:00:00 62 Groundhog Day
2013-02-02 15:00:00 76 Groundhog Day
2013-02-02 16:00:00 83 Groundhog Day
2013-02-02 17:00:00 109 Groundhog Day
2013-02-02 18:00:00 109 Groundhog Day
2013-02-02 19:00:00 81 Groundhog Day
2013-02-02 20:00:00 52 Groundhog Day
2013-02-02 21:00:00 108 Groundhog Day
2013-02-02 22:00:00 68 Groundhog Day
2013-02-02 23:00:00 75 Groundhog Day
temp Holiday
2013-02-13 00:00:00 93 NaN
2013-02-13 01:00:00 93 NaN
2013-02-13 02:00:00 74 NaN
2013-02-13 03:00:00 97 NaN
2013-02-13 04:00:00 58 NaN
2013-02-13 05:00:00 103 NaN
2013-02-13 06:00:00 79 NaN
2013-02-13 07:00:00 65 NaN
2013-02-13 08:00:00 72 NaN
2013-02-13 09:00:00 100 NaN
2013-02-13 10:00:00 66 NaN
2013-02-13 11:00:00 60 NaN
2013-02-13 12:00:00 95 NaN
2013-02-13 13:00:00 51 NaN
2013-02-13 14:00:00 71 NaN
2013-02-13 15:00:00 58 NaN
2013-02-13 16:00:00 58 NaN
2013-02-13 17:00:00 98 NaN
2013-02-13 18:00:00 61 NaN
2013-02-13 19:00:00 63 NaN
2013-02-13 20:00:00 57 NaN
2013-02-13 21:00:00 102 NaN
2013-02-13 22:00:00 69 NaN
2013-02-13 23:00:00 86 NaN
2013-02-14 00:00:00 94 Valentine's Day
2013-02-14 01:00:00 64 Valentine's Day
2013-02-14 02:00:00 62 Valentine's Day
2013-02-14 03:00:00 59 Valentine's Day
2013-02-14 04:00:00 93 Valentine's Day
2013-02-14 05:00:00 99 Valentine's Day
2013-02-14 06:00:00 64 Valentine's Day
2013-02-14 07:00:00 80 Valentine's Day
2013-02-14 08:00:00 89 Valentine's Day
2013-02-14 09:00:00 96 Valentine's Day
2013-02-14 10:00:00 60 Valentine's Day
2013-02-14 11:00:00 76 Valentine's Day
2013-02-14 12:00:00 82 Valentine's Day
2013-02-14 13:00:00 65 Valentine's Day
2013-02-14 14:00:00 90 Valentine's Day
2013-02-14 15:00:00 62 Valentine's Day
2013-02-14 16:00:00 64 Valentine's Day
2013-02-14 17:00:00 98 Valentine's Day
2013-02-14 18:00:00 52 Valentine's Day
2013-02-14 19:00:00 72 Valentine's Day
2013-02-14 20:00:00 108 Valentine's Day
2013-02-14 21:00:00 85 Valentine's Day
2013-02-14 22:00:00 87 Valentine's Day
2013-02-14 23:00:00 62 Valentine's Day
2013-02-15 00:00:00 106 NaN
2013-02-15 01:00:00 82 NaN
2013-02-15 02:00:00 77 NaN
2013-02-15 03:00:00 52 NaN
2013-02-15 04:00:00 94 NaN
2013-02-15 05:00:00 71 NaN
2013-02-15 06:00:00 95 NaN
2013-02-15 07:00:00 96 NaN
2013-02-15 08:00:00 71 NaN
2013-02-15 09:00:00 69 NaN
2013-02-15 10:00:00 85 NaN
2013-02-15 11:00:00 92 NaN
2013-02-15 12:00:00 106 NaN
2013-02-15 13:00:00 77 NaN
2013-02-15 14:00:00 65 NaN
2013-02-15 15:00:00 104 NaN
2013-02-15 16:00:00 98 NaN
2013-02-15 17:00:00 107 NaN
2013-02-15 18:00:00 106 NaN
2013-02-15 19:00:00 67 NaN
2013-02-15 20:00:00 59 NaN
2013-02-15 21:00:00 81 NaN
2013-02-15 22:00:00 56 NaN
2013-02-15 23:00:00 75 NaN
Note: In this dataframe your datetime column is in the index.
You can try using the apply method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
The input to this is the function you want applied to each row, and in this case axis should be 1 so that the function is applied row by row.
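A hedged sketch of that idea on a frame shaped like the question's (the day_map helper is hypothetical; it maps each calendar date to the holiday stored at that day's first hour):

import numpy as np
import pandas as pd

# Hypothetical frame mirroring the question: hourly rows, holiday at hour 0 only
idx = pd.date_range("2013-01-01", periods=6, freq="H")
df = pd.DataFrame({"Holiday": ["New Years Day"] + [np.nan] * 5}, index=idx)

# date -> holiday name, taken from each day's first hour
day_map = df.loc[df.index.hour == 0, "Holiday"]
day_map.index = day_map.index.date

# apply with axis=1 hands the function one row at a time; row.name is the timestamp
df["Holiday"] = df.apply(lambda row: day_map.get(row.name.date()), axis=1)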
I have a dataframe with a datetime index:
df.head(6)
NUMBERES PRICE
DEAL_TIME
2015-03-02 12:40:03 5 25
2015-03-04 14:52:57 7 23
2015-03-03 08:10:09 10 43
2015-03-02 20:18:24 5 37
2015-03-05 07:50:55 4 61
2015-03-02 09:08:17 1 17
The dataframe contains one week of data. Now I need to count records per time period of the day. If the time period is 1 hour, I know the following method works:
df_grouped = df.groupby(df.index.hour).count()
But I don't know what to do when the time period is half an hour. How can I achieve this?
UPDATE:
I was told that this question is similar to How to group DataFrame by a period of time?, but I had already tried the methods mentioned there. Maybe it's my fault for not saying it clearly: 'DEAL_TIME' ranges from '2015-03-02 00:00:00' to '2015-03-08 23:59:59'. If I use pd.TimeGrouper(freq='30Min') or resample(), the time periods range from '2015-03-02 00:30' to '2015-03-08 23:30'. But what I want is a series like the one below:
COUNT
DEAL_TIME
00:00:00 53
00:30:00 49
01:00:00 31
01:30:00 22
02:00:00 1
02:30:00 24
03:00:00 27
03:30:00 41
04:00:00 41
04:30:00 76
05:00:00 33
05:30:00 16
06:00:00 15
06:30:00 4
07:00:00 60
07:30:00 85
08:00:00 3
08:30:00 37
09:00:00 18
09:30:00 29
10:00:00 31
10:30:00 67
11:00:00 35
11:30:00 60
12:00:00 95
12:30:00 37
13:00:00 30
13:30:00 62
14:00:00 58
14:30:00 44
15:00:00 45
15:30:00 35
16:00:00 94
16:30:00 56
17:00:00 64
17:30:00 43
18:00:00 60
18:30:00 52
19:00:00 14
19:30:00 9
20:00:00 31
20:30:00 71
21:00:00 21
21:30:00 32
22:00:00 61
22:30:00 35
23:00:00 14
23:30:00 21
In other words, the time period should be independent of the date.
You need a 30-minute time grouper for this:
grouper = pd.TimeGrouper(freq="30T")
You also need to remove the date part from the index (the question's index is named DEAL_TIME, which is the column name reset_index produces):
df.index = df.reset_index()['DEAL_TIME'].apply(lambda x: x - pd.Timestamp(x.date()))
Now, you can group by time alone:
df.groupby(grouper).count()
You can find the (somewhat obscure) TimeGrouper documentation in the pandas resample documentation; both features use the same rules.
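Note that pd.TimeGrouper was deprecated and later removed in favor of pd.Grouper. On current pandas, the same date-independent half-hour histogram can be built without the reset_index round-trip; a sketch, assuming a frame shaped like the question's:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"NUMBERES": rng.integers(1, 11, 500), "PRICE": rng.integers(10, 100, 500)},
    index=pd.to_datetime("2015-03-02")
          + pd.to_timedelta(rng.integers(0, 7 * 24 * 60, 500), unit="min"),
)

tod = df.index - df.index.normalize()           # time of day, date stripped
counts = df.groupby(tod.floor("30min")).size()  # one bin per half hour
print(counts.head())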
In pandas, the most common way to group by time is to use the .resample() function.
Since v0.18.0 this function is two-stage: df.resample('M') creates a resampler object to which we can apply other functions (mean, count, sum, etc.).
The code snippet will look like:
df.resample('M').count()
You can refer here for an example.
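A minimal illustration (note that on pandas 2.2+ the 'M' alias is deprecated in favor of 'ME' for month-end):

import pandas as pd

s = pd.Series(1, index=pd.date_range("2015-01-15", periods=90, freq="D"))
print(s.resample("M").count())  # one count per calendar month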