How to convert hourly data to half-hourly - Python

I have the following dataframe:
datetime temp
0 2015-01-01 00:00:00 11.22
1 2015-01-01 01:00:00 11.32
2 2015-01-01 02:00:00 11.30
3 2015-01-01 03:00:00 11.25
4 2015-01-01 04:00:00 11.32
... ... ...
31339 2018-07-29 19:00:00 17.60
31340 2018-07-29 20:00:00 17.49
31341 2018-07-29 21:00:00 17.44
31342 2018-07-29 22:00:00 17.39
31343 2018-07-29 23:00:00 17.37
I want to convert this dataframe to half-hourly data, filling each new position with the mean of the previous and following values (or a similar interpolation). For example:
datetime temp
0 2015-01-01 00:00:00 11.00
1 2015-01-01 00:30:00 11.50
2 2015-01-01 01:00:00 12.00
Is there any pandas/datetime function to assist in this operation?
Thank you

You can use the resample() function in pandas. With it you set the frequency to down/upsample to and then choose how to aggregate or fill the result (mean, sum, etc.). In your case you can interpolate between the values.
For this to work, your datetime column has to have datetime dtype and be set as the index:
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
Then you can resample to 30 minutes ('30T'; pandas 2.2+ prefers '30min') and interpolate:
df.resample('30T').interpolate()
Resulting in...
temp
datetime
2015-01-01 00:00:00 11.220
2015-01-01 00:30:00 11.270
2015-01-01 01:00:00 11.320
2015-01-01 01:30:00 11.310
2015-01-01 02:00:00 11.300
2015-01-01 02:30:00 11.275
2015-01-01 03:00:00 11.250
2015-01-01 03:30:00 11.285
2015-01-01 04:00:00 11.320
Read more about the frequency strings and resampling in the Pandas docs.
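For reference, a self-contained sketch of the whole flow, built from the first rows in the question ('30min' is the modern spelling of the '30T' alias):
import pandas as pd

# Build a small hourly frame like the one in the question.
df = pd.DataFrame({
    'datetime': pd.date_range('2015-01-01', periods=5, freq='h'),
    'temp': [11.22, 11.32, 11.30, 11.25, 11.32],
})
df['datetime'] = pd.to_datetime(df['datetime'])  # already datetime here, but mirrors the step above
df = df.set_index('datetime')

# Upsample to 30-minute frequency and linearly interpolate the new rows.
half_hourly = df.resample('30min').interpolate()
print(half_hourly)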

Related

Change value in pandas series based on hour of the day using df.apply and if statement

I have a large df with a datetime index with an hourly time step and precipitation values in several columns. My precipitation values are a cumulative total during the day (from 1:00 am to 0:00 am of the next day) and are reset every day, for example:
datetime S1
2000-01-01 00:00:00 4.5 ...
2000-01-01 01:00:00 0 ...
2000-01-01 02:00:00 0 ...
2000-01-01 03:00:00 0 ...
2000-01-01 04:00:00 0
2000-01-01 05:00:00 0
2000-01-01 06:00:00 0
2000-01-01 07:00:00 0
2000-01-01 08:00:00 0
2000-01-01 09:00:00 0
2000-01-01 10:00:00 0
2000-01-01 11:00:00 6.5
2000-01-01 12:00:00 7.5
2000-01-01 13:00:00 8.7
2000-01-01 14:00:00 8.7
...
2000-01-01 22:00:00 8.7
2000-01-01 23:00:00 8.7
2000-01-02 00:00:00 8.7
2000-01-02 01:00:00 0
I am trying to go from this to the actual hourly values: the value at 1:00 am each day is fine as-is, and for every other hour I want to subtract the value of the previous timestep.
Can I somehow use an if statement inside df.apply?
I thought of something like:
df_copy = df.copy()
df = df.apply(lambda x: if df.hour !=1: era5_T[x]=era5_T[x]-era5_T_copy[x-1])
But this is not working, since I'm not calling a function. I could use a for loop, but that doesn't seem like the most efficient way since I'm working with a big dataset.
You can use numpy.where and pd.Series.shift to achieve this:
import numpy as np
df['hourly_S1'] = np.where(df.index.hour == 1, df.S1, df.S1 - df.S1.shift())
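A minimal, runnable version of the idea on made-up data shaped like the question's:
import numpy as np
import pandas as pd

# Cumulative daily totals that reset at 01:00, as in the question.
idx = pd.date_range('2000-01-01 00:00', periods=6, freq='h')
df = pd.DataFrame({'S1': [4.5, 0.0, 0.0, 2.0, 3.5, 3.5]}, index=idx)

# At 01:00 keep the value (the counter has just reset); everywhere else
# take the difference from the previous hour. The very first row has no
# predecessor, so it comes out NaN.
df['hourly_S1'] = np.where(df.index.hour == 1, df['S1'], df['S1'] - df['S1'].shift())
print(df)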

Sampling a dataframe considering NaN values - Pandas

I have a data frame like below, and I want to sample it with '3S'.
There are situations where NaN is present. What I was expecting is that the dataframe is sampled with '3S', and if any NaN is found in between, the sampling stops there and restarts from that index. I tried using dataframe.apply to achieve this, but it gets very complex. Is there a shorter way?
df.sample(n=3)
Code to generate Input:
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=13, freq='T')
series = pd.DataFrame(range(13), index=index)
series.iloc[4] = np.nan
series.iloc[10] = np.nan
print(series)
I tried sampling, but after that I have no clue how to proceed.
2015-01-01 00:00:00 0.0
2015-01-01 01:00:00 1.0
2015-01-01 02:00:00 2.0
2015-01-01 03:00:00 2.0
2015-01-01 04:00:00 NaN
2015-01-01 05:00:00 3.0
2015-01-01 06:00:00 4.0
2015-01-01 07:00:00 4.0
2015-01-01 08:00:00 4.0
2015-01-01 09:00:00 NaN
2015-01-01 10:00:00 3.0
2015-01-01 11:00:00 4.0
2015-01-01 12:00:00 4.0
The new data frame should sample based on '3S', while also taking any NaN into account and restarting the sampling where NaN records are found.
Expected Output:
2015-01-01 02:00:00 2.0 -- Sampling after 3S
2015-01-01 03:00:00 2.0 -- Print because NaN has found in Next
2015-01-01 04:00:00 NaN -- print NaN record
2015-01-01 07:00:00 4.0 -- Sampling after 3S
2015-01-01 08:00:00 4.0 -- Print because NaN has found in Next
2015-01-01 09:00:00 NaN -- print NaN record
2015-01-01 12:00:00 4.0 -- Sampling after 3S
Use:
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=13, freq='H')
df = pd.DataFrame({'col': range(13)}, index=index)
df.iloc[4, 0] = np.nan
df.iloc[9, 0] = np.nan
print(df)
col
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 05:00:00 5.0
2000-01-01 06:00:00 6.0
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 10:00:00 10.0
2000-01-01 11:00:00 11.0
2000-01-01 12:00:00 12.0
m = df['col'].isna()
s1 = m.ne(m.shift()).cumsum()
t = pd.Timedelta(hours=2)
mask = df.index >= df.groupby(s1)['col'].transform(lambda x: x.index[0]) + t
df1 = df[mask | m]
print(df1)
col
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 12:00:00 12.0
Explanation:
Create a mask of the missing values with Series.isna.
Create groups of consecutive values by comparing the mask with its shifted self via Series.ne (!=) and taking the cumulative sum:
print(s1)
2000-01-01 00:00:00 1
2000-01-01 01:00:00 1
2000-01-01 02:00:00 1
2000-01-01 03:00:00 1
2000-01-01 04:00:00 2
2000-01-01 05:00:00 3
2000-01-01 06:00:00 3
2000-01-01 07:00:00 3
2000-01-01 08:00:00 3
2000-01-01 09:00:00 4
2000-01-01 10:00:00 5
2000-01-01 11:00:00 5
2000-01-01 12:00:00 5
Freq: H, Name: col, dtype: int32
Get the first index value per group with transform, add the Timedelta (2 hours here, to match the expected output) and compare against the DatetimeIndex.
Finally, filter by boolean indexing, chaining the masks with | (bitwise OR).
One way would be to fill the NaNs with 0:
df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)
and then do the resampling on the series (if datetime is your index):
series.resample('30S').asfreq()
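A quick sketch of that approach, assuming a datetime-indexed series at 1-second resolution (the 3-second window comes from the question):
import pandas as pd

s = pd.Series([0.0, 1.0, None, 3.0, 4.0, 5.0],
              index=pd.date_range('2000-01-01', periods=6, freq='s'))

# Replace the gap with 0, then take every 3rd second off a fixed grid.
print(s.fillna(0).resample('3s').asfreq())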

Groupby for datetime on a scale of hours (ignoring what day)

I have a series of floats with a DatetimeIndex that I have resampled into 3-hour bins. As such, I have an index containing
2015-01-01 09:00:00
2015-01-01 12:00:00
2015-01-01 15:00:00
2015-01-01 18:00:00
2015-01-01 21:00:00
2015-01-02 00:00:00
2015-01-02 03:00:00
2015-01-02 06:00:00
2015-01-02 09:00:00
and so forth. I am trying to sum the floats associated with each time of day, say 09:00:00, across all days.
The only way I can think of with my limited experience is to convert this series to a dataframe using the datetime index as another column, then iterating to check whether the hour slots are equal to one another and summing the values. I feel like this is horribly inefficient and probably not the 'correct' way to do this. Any help would be appreciated!
IIUC:
In [116]: s
Out[116]:
2015-01-01 09:00:00 3
2015-01-01 12:00:00 1
2015-01-01 15:00:00 0
2015-01-01 18:00:00 1
2015-01-01 21:00:00 0
2015-01-02 00:00:00 9
2015-01-02 03:00:00 2
2015-01-02 06:00:00 2
2015-01-02 09:00:00 7
2015-01-02 12:00:00 8
Freq: 3H, Name: val, dtype: int32
In [117]: s.groupby(s.index - s.index.normalize()).sum()
Out[117]:
00:00:00 9
03:00:00 2
06:00:00 2
09:00:00 10
12:00:00 9
15:00:00 0
18:00:00 1
21:00:00 0
Name: val, dtype: int32
Here s.index - s.index.normalize() strips each timestamp down to its time-of-day offset, so the groupby sums matching clock times across days. Alternatively, group by the hour directly:
In [118]: s.groupby(s.index.hour).sum()
Out[118]:
0 9
3 2
6 2
9 10
12 9
15 0
18 1
21 0
Name: val, dtype: int32
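For copy-paste, a self-contained version of the same data and both groupbys:
import pandas as pd

idx = pd.date_range('2015-01-01 09:00', periods=10, freq='3h')
s = pd.Series([3, 1, 0, 1, 0, 9, 2, 2, 7, 8], index=idx, name='val')

# Subtracting the normalized (midnight) timestamp leaves the time-of-day,
# so the groupby sums matching clock times across days.
print(s.groupby(s.index - s.index.normalize()).sum())

# Same result here, grouping by the integer hour instead.
print(s.groupby(s.index.hour).sum())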

Python: compare two timestamps on different dates

I have a dataframe whose index is a timestamp in the format 'YYYY-MM-DD HH:MM:SS'.
Now I want to divide this dataframe into two parts:
one with the data before 12 pm ('YYYY-MM-DD 12:00:00') each day,
and another with the data after 12 pm each day.
I've been stuck on this question for several days. Any suggestions?
Thank you.
If you have a DatetimeIndex (and if you don't, df.index = pd.to_datetime(df.index) should work to get one), then you can access .hour, e.g. df.index.hour, and select using that:
>>> df.head()
A
2015-01-01 00:00:00 0
2015-01-01 01:00:00 1
2015-01-01 02:00:00 2
2015-01-01 03:00:00 3
2015-01-01 04:00:00 4
>>> morning = df[df.index.hour < 12]
>>> afternoon = df[df.index.hour >= 12]
>>> morning.head()
A
2015-01-01 00:00:00 0
2015-01-01 01:00:00 1
2015-01-01 02:00:00 2
2015-01-01 03:00:00 3
2015-01-01 04:00:00 4
>>> afternoon.head()
A
2015-01-01 12:00:00 12
2015-01-01 13:00:00 13
2015-01-01 14:00:00 14
2015-01-01 15:00:00 15
2015-01-01 16:00:00 16
You could also use groupby, e.g. df.groupby(df.index.hour < 12), but that seems like overkill here. If you wanted a more complex division that might be the way to go, though.
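For completeness, a sketch of that groupby variant on the same kind of data:
import pandas as pd

df = pd.DataFrame({'A': range(24)},
                  index=pd.date_range('2015-01-01', periods=24, freq='h'))

# The boolean key yields two groups: True = morning (hour < 12), False = afternoon.
for is_morning, part in df.groupby(df.index.hour < 12):
    print('morning' if is_morning else 'afternoon', len(part))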

groupby.aggregate and modification then cast / reindex

I want to apply a deviation at monthly granularity to a dataframe and then cast the result back into the initial dataframe. First I do a groupby and aggregate; this part works well. Then I reindex and get NaN. I want the reindexing to match the month of each groupby element with the rows of the initial dataframe.
I also want to be able to do this operation at different granularities (yearly -> month & year, ...).
Does anyone have a solution to this problem?
>>> df['profile']
date
2015-01-01 00:00:00 3.000000
2015-01-01 01:00:00 3.000143
2015-01-01 02:00:00 3.000287
2015-01-01 03:00:00 3.000430
2015-01-01 04:00:00 3.000574
...
2015-12-31 20:00:00 2.999426
2015-12-31 21:00:00 2.999570
2015-12-31 22:00:00 2.999713
2015-12-31 23:00:00 2.999857
Freq: H, Name: profile, Length: 8760
### Deviation on monthly basis
>>> dev_monthly = np.random.uniform(0.5, 1.5, len(df['profile'].groupby(df.index.month).aggregate(np.sum)))
>>> df['profile_monthly'] = (df['profile'].groupby(df.index.month).aggregate(np.sum) * dev_monthly).reindex(df)
>>> df['profile_monthly']
date
2015-01-01 00:00:00 NaN
2015-01-01 01:00:00 NaN
2015-01-01 02:00:00 NaN
...
2015-12-31 22:00:00 NaN
2015-12-31 23:00:00 NaN
Freq: H, Name: profile_monthly, Length: 8760
Check out the documentation for resampling.
You're looking for resample followed by fillna with method='bfill':
In [105]: df = pd.DataFrame({'profile': np.random.normal(3, 0.1, size=10000)},
   .....:                   index=pd.date_range(start='2015-01-01', freq='H', periods=10000))
In [106]: df['profile_monthly'] = df['profile'].resample('M').sum()
In [107]: df
Out[107]:
profile profile_monthly
2015-01-01 00:00:00 2.8328 NaN
2015-01-01 01:00:00 3.0607 NaN
2015-01-01 02:00:00 3.0138 NaN
2015-01-01 03:00:00 3.0402 NaN
2015-01-01 04:00:00 3.0335 NaN
2015-01-01 05:00:00 3.0087 NaN
2015-01-01 06:00:00 3.0557 NaN
2015-01-01 07:00:00 2.9280 NaN
2015-01-01 08:00:00 3.1359 NaN
2015-01-01 09:00:00 2.9681 NaN
2015-01-01 10:00:00 3.1240 NaN
2015-01-01 11:00:00 3.0635 NaN
2015-01-01 12:00:00 2.9206 NaN
2015-01-01 13:00:00 3.0714 NaN
2015-01-01 14:00:00 3.0688 NaN
2015-01-01 15:00:00 3.0703 NaN
2015-01-01 16:00:00 2.9102 NaN
2015-01-01 17:00:00 2.9368 NaN
2015-01-01 18:00:00 3.0864 NaN
2015-01-01 19:00:00 3.2124 NaN
2015-01-01 20:00:00 2.8988 NaN
2015-01-01 21:00:00 3.0659 NaN
2015-01-01 22:00:00 2.7973 NaN
2015-01-01 23:00:00 3.0824 NaN
2015-01-02 00:00:00 3.0199 NaN
... ...
[10000 rows x 2 columns]
In [108]: df.dropna()
Out[108]:
profile profile_monthly
2015-01-31 2.9769 2230.9931
2015-02-28 2.9930 2016.1045
2015-03-31 2.7817 2232.4096
2015-04-30 3.1695 2158.7834
2015-05-31 2.9040 2236.5962
2015-06-30 2.8697 2162.7784
2015-07-31 2.9278 2231.7232
2015-08-31 2.8289 2236.4603
2015-09-30 3.0368 2163.5916
2015-10-31 3.1517 2233.2285
2015-11-30 3.0450 2158.6998
2015-12-31 2.8261 2228.5550
2016-01-31 3.0264 2229.2221
[13 rows x 2 columns]
In [110]: df.fillna(method='bfill')
Out[110]:
profile profile_monthly
2015-01-01 00:00:00 2.8328 2230.9931
2015-01-01 01:00:00 3.0607 2230.9931
2015-01-01 02:00:00 3.0138 2230.9931
2015-01-01 03:00:00 3.0402 2230.9931
2015-01-01 04:00:00 3.0335 2230.9931
2015-01-01 05:00:00 3.0087 2230.9931
2015-01-01 06:00:00 3.0557 2230.9931
2015-01-01 07:00:00 2.9280 2230.9931
2015-01-01 08:00:00 3.1359 2230.9931
2015-01-01 09:00:00 2.9681 2230.9931
2015-01-01 10:00:00 3.1240 2230.9931
2015-01-01 11:00:00 3.0635 2230.9931
2015-01-01 12:00:00 2.9206 2230.9931
2015-01-01 13:00:00 3.0714 2230.9931
2015-01-01 14:00:00 3.0688 2230.9931
2015-01-01 15:00:00 3.0703 2230.9931
2015-01-01 16:00:00 2.9102 2230.9931
2015-01-01 17:00:00 2.9368 2230.9931
2015-01-01 18:00:00 3.0864 2230.9931
2015-01-01 19:00:00 3.2124 2230.9931
2015-01-01 20:00:00 2.8988 2230.9931
2015-01-01 21:00:00 3.0659 2230.9931
2015-01-01 22:00:00 2.7973 2230.9931
2015-01-01 23:00:00 3.0824 2230.9931
2015-01-02 00:00:00 3.0199 2230.9931
... ...
[10000 rows x 2 columns]
When I use your code, I don't get the same value for 2015-12-31 00:00:00 and 2015-12-31 01:00:00, as you can see below:
>>> df.fillna(method='bfill')[np.logical_and(df.index.month==12, df.index.day==31)]
profile profile_monthly
2015-12-31 00:00:00 2.926504 2232.288997
2015-12-31 01:00:00 3.008543 2234.470731
2015-12-31 02:00:00 2.930133 2234.470731
2015-12-31 03:00:00 3.078552 2234.470731
2015-12-31 04:00:00 3.141578 2234.470731
2015-12-31 05:00:00 3.061820 2234.470731
2015-12-31 06:00:00 2.981626 2234.470731
2015-12-31 07:00:00 3.010749 2234.470731
2015-12-31 08:00:00 2.878577 2234.470731
2015-12-31 09:00:00 2.915487 2234.470731
2015-12-31 10:00:00 3.072721 2234.470731
2015-12-31 11:00:00 3.087866 2234.470731
2015-12-31 12:00:00 3.089208 2234.470731
2015-12-31 13:00:00 2.957047 2234.470731
2015-12-31 14:00:00 3.002072 2234.470731
2015-12-31 15:00:00 3.106656 2234.470731
2015-12-31 16:00:00 3.100891 2234.470731
2015-12-31 17:00:00 3.077835 2234.470731
2015-12-31 18:00:00 3.032497 2234.470731
2015-12-31 19:00:00 2.959838 2234.470731
2015-12-31 20:00:00 2.878819 2234.470731
2015-12-31 21:00:00 3.041171 2234.470731
2015-12-31 22:00:00 3.061970 2234.470731
2015-12-31 23:00:00 3.019011 2234.470731
[24 rows x 2 columns]
So I finally found the following solution:
>>> AA = df.groupby([df.index.year, df.index.month]).mean()
>>> AA['dev'] = np.random.normal(0, 1, len(AA))
>>> df['dev'] = AA.loc[list(zip(df.index.year, df.index.month)), 'dev'].values
Short and fast. The only remaining question:
=> How to deal with other granularities (half-year, quarter, week, ...)?
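One hedged sketch of how this might generalize: keep the same aggregate-then-look-up pattern and swap in different index-derived keys. The helper name add_deviation is made up for illustration, and it assumes a 'profile' column on a DatetimeIndex:
import numpy as np
import pandas as pd

def add_deviation(df, keys):
    # keys: a list of index-derived arrays, e.g. [df.index.year, df.index.quarter]
    agg = df.groupby(keys)['profile'].mean().to_frame('mean')
    agg['dev'] = np.random.normal(0, 1, len(agg))
    # Look each row's group back up in the aggregated table.
    labels = list(zip(*keys)) if len(keys) > 1 else list(keys[0])
    return agg.loc[labels, 'dev'].values

df = pd.DataFrame({'profile': np.random.normal(3, 0.1, size=8760)},
                  index=pd.date_range('2015-01-01', freq='h', periods=8760))
df['dev_quarter'] = add_deviation(df, [df.index.year, df.index.quarter])
# Weekly would work the same way, e.g. with
# [df.index.isocalendar().year, df.index.isocalendar().week] as the keys.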
