Pandas: correctly resampling data at the hourly frequency - python

In Python 3.6.3, I have the following dataframe df1:
dt Val
2017-04-10 08:00:00 8.0
2017-04-10 09:00:00 2.0
2017-04-10 10:00:00 7.0
2017-04-11 08:00:00 3.0
2017-04-11 09:00:00 0.0
2017-04-11 10:00:00 5.0
2017-11-26 08:00:00 8.0
2017-11-26 09:00:00 1.0
2017-11-26 10:00:00 2.0
I am trying to compute the hourly average of these values, so as to have:
Hour Val
08:00:00 7.00
09:00:00 1.00
10:00:00 4.66
My attempt:
df2 = df1.resample('H')['Val'].mean()
This returns the same data as df1. What am I doing wrong?

Inspired by the comments above, I tested that the following works for me:
df.groupby(df.index.hour).Val.mean()
Or you can convert the index values to a 'timedelta64' dtype:
df.Val.groupby(df.index.hour.astype('timedelta64[h]')).mean()
dt
08:00:00 6.333333
09:00:00 1.000000
10:00:00 4.666667
Name: Val, dtype: float64
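As a self-contained check, here is the question's df1 rebuilt by hand (the construction below is mine, not the asker's) with the groupby-on-hour approach applied:

```python
import pandas as pd

# Rebuild the question's df1 (values copied from the post)
idx = pd.to_datetime([
    '2017-04-10 08:00:00', '2017-04-10 09:00:00', '2017-04-10 10:00:00',
    '2017-04-11 08:00:00', '2017-04-11 09:00:00', '2017-04-11 10:00:00',
    '2017-11-26 08:00:00', '2017-11-26 09:00:00', '2017-11-26 10:00:00',
])
df1 = pd.DataFrame({'Val': [8.0, 2.0, 7.0, 3.0, 0.0, 5.0, 8.0, 1.0, 2.0]}, index=idx)

# Group on the hour component of the index rather than resampling, so rows
# from different days that share an hour fall into one group
hourly = df1.groupby(df1.index.hour)['Val'].mean()
print(hourly)
```

Note that the 08:00 mean is (8 + 3 + 8) / 3 = 6.33, so the output above is right and the 7.00 in the question's expected table appears to be a slip.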

Related

How to impute missing value in time series data with the value of the same day and time from the previous week(day) in python

I have a dataframe with columns of timestamp and energy usage. The timestamp is taken for every minute of the day, i.e., a total of 1440 readings per day. I have a few missing values in the data frame.
I want to impute those missing values with the mean of the same day and time from the last two or three weeks. That way, if the previous week is also missing, I can use the value from two weeks ago.
Here's an example of the data:
mains_1
timestamp
2013-01-03 00:00:00 155.00
2013-01-03 00:01:00 154.00
2013-01-03 00:02:00 NaN
2013-01-03 00:03:00 154.00
2013-01-03 00:04:00 153.00
... ...
2013-04-30 23:55:00 NaN
2013-04-30 23:56:00 182.00
2013-04-30 23:57:00 181.00
2013-04-30 23:58:00 182.00
2013-04-30 23:59:00 182.00
Right now I have this line of code:
df['mains_1'] = (df
.groupby((df.index.dayofweek * 24) + (df.index.hour) + (df.index.minute / 60))
.transform(lambda x: x.fillna(x.mean()))
)
What this does is fill each gap with the average usage at the same time of day across the whole dataset. I want it to be more precise and use the average of the last two or three weeks.
You can concat together the Series with shift in a loop, as the index alignment will ensure it matches on the previous weeks at the same hour. Then take the mean and use .fillna to update the original.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(index=pd.date_range('2010-01-01 10:00:00', freq='W', periods=10),
                  data=np.random.choice([1, 2, 3, 4, np.nan], 10),
                  columns=['mains_1'])
# mains_1
#2010-01-03 10:00:00 4.0
#2010-01-10 10:00:00 1.0
#2010-01-17 10:00:00 2.0
#2010-01-24 10:00:00 1.0
#2010-01-31 10:00:00 NaN
#2010-02-07 10:00:00 4.0
#2010-02-14 10:00:00 1.0
#2010-02-21 10:00:00 1.0
#2010-02-28 10:00:00 NaN
#2010-03-07 10:00:00 2.0
Code
# range(4): the current values plus the previous 3 weeks.
df1 = pd.concat([df.shift(periods=x, freq='W') for x in range(4)], axis=1)
# mains_1 mains_1 mains_1 mains_1
#2010-01-03 10:00:00 4.0 NaN NaN NaN
#2010-01-10 10:00:00 1.0 4.0 NaN NaN
#2010-01-17 10:00:00 2.0 1.0 4.0 NaN
#2010-01-24 10:00:00 1.0 2.0 1.0 4.0
#2010-01-31 10:00:00 NaN 1.0 2.0 1.0
#2010-02-07 10:00:00 4.0 NaN 1.0 2.0
#2010-02-14 10:00:00 1.0 4.0 NaN 1.0
#2010-02-21 10:00:00 1.0 1.0 4.0 NaN
#2010-02-28 10:00:00 NaN 1.0 1.0 4.0
#2010-03-07 10:00:00 2.0 NaN 1.0 1.0
#2010-03-14 10:00:00 NaN 2.0 NaN 1.0
#2010-03-21 10:00:00 NaN NaN 2.0 NaN
#2010-03-28 10:00:00 NaN NaN NaN 2.0
df['mains_1'] = df['mains_1'].fillna(df1.mean(axis=1))
print(df)
mains_1
2010-01-03 10:00:00 4.000000
2010-01-10 10:00:00 1.000000
2010-01-17 10:00:00 2.000000
2010-01-24 10:00:00 1.000000
2010-01-31 10:00:00 1.333333
2010-02-07 10:00:00 4.000000
2010-02-14 10:00:00 1.000000
2010-02-21 10:00:00 1.000000
2010-02-28 10:00:00 2.000000
2010-03-07 10:00:00 2.000000
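The sample above uses a weekly index for brevity; the question's data is minute-level, so the same idea needs the shift expressed in whole weeks. A minimal sketch on synthetic minute data (names and values hypothetical, not the asker's CSV):

```python
import numpy as np
import pandas as pd

# Four weeks of synthetic minute-level readings with one gap at the end
idx = pd.date_range('2013-01-03', periods=4 * 7 * 24 * 60, freq='min')
rng = np.random.default_rng(1)
df = pd.DataFrame({'mains_1': rng.random(len(idx)) * 200}, index=idx)
df.iloc[-1, 0] = np.nan

# Align each timestamp with the readings 1-3 weeks earlier, average them,
# and use that average only where the original value is missing
shifted = pd.concat([df.shift(freq=pd.Timedelta(weeks=w)) for w in range(1, 4)], axis=1)
df['mains_1'] = df['mains_1'].fillna(shifted.mean(axis=1))
```

Because shift(freq=...) moves the index rather than the values, the concat aligns each row with the same minute of the same weekday in earlier weeks, exactly as in the weekly example above.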

Python dataframe datetime based if and else condition

I have a datetime dataframe. I want to compare it with a reference date and assign 'Before' where the timestamp is earlier and 'After' where it is later.
My code:
df = pd.DataFrame({'A':np.arange(1.0,9.0)},index=pd.date_range(start='2020-05-04 08:00:00', freq='1d', periods=8))
df=
A
2020-05-04 08:00:00 1.0
2020-05-05 08:00:00 2.0
2020-05-06 08:00:00 3.0
2020-05-07 08:00:00 4.0
2020-05-08 08:00:00 5.0
2020-05-09 08:00:00 6.0
2020-05-10 08:00:00 7.0
2020-05-11 08:00:00 8.0
ref_date = '2020-05-08'
Expected answer
df=
A Condi.
2020-05-04 08:00:00 1.0 Before
2020-05-05 08:00:00 2.0 Before
2020-05-06 08:00:00 3.0 Before
2020-05-07 08:00:00 4.0 Before
2020-05-08 08:00:00 5.0 After
2020-05-09 08:00:00 6.0 After
2020-05-10 08:00:00 7.0 After
2020-05-11 08:00:00 8.0 After
My solution:
df['Cond.'] = = ['After' if df.index>=(ref_date)=='True' else 'Before cleaning']
The answer I currently get:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
ref_date = "2020-05-08"
df["Cond"] = np.where(df.index < ref_date, "Before", "After")
print(df)
Prints:
A Cond
2020-05-04 08:00:00 1.0 Before
2020-05-05 08:00:00 2.0 Before
2020-05-06 08:00:00 3.0 Before
2020-05-07 08:00:00 4.0 Before
2020-05-08 08:00:00 5.0 After
2020-05-09 08:00:00 6.0 After
2020-05-10 08:00:00 7.0 After
2020-05-11 08:00:00 8.0 After
To fix the list-comprehension approach you took, you can use:
df['Cond'] = ['After' if x >= pd.to_datetime(ref_date) else 'Before' for x in df.index]
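Put together with the question's setup, a minimal runnable version of the np.where answer (comparing against a pd.Timestamp to make the intent explicit):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(1.0, 9.0)},
                  index=pd.date_range(start='2020-05-04 08:00:00', freq='1d', periods=8))

# Midnight of the reference day; every timestamp on or after it is 'After'
ref_date = pd.Timestamp('2020-05-08')
df['Cond'] = np.where(df.index < ref_date, 'Before', 'After')
print(df)
```

Since the rows carry an 08:00:00 time component, all rows on 2020-05-08 itself compare greater than midnight and are labelled 'After', matching the expected output.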

How to fetch hours from m8[ns] object in pandas?

I have a dataframe like shown below
df = pd.DataFrame({'time':['2166-01-09 14:00:00','2166-01-09 14:08:00','2166-01-09 16:00:00','2166-01-09 20:00:00',
'2166-01-09 04:00:00','2166-01-10 05:00:00','2166-01-10 06:00:00','2166-01-10 07:00:00','2166-01-10 11:00:00',
'2166-01-10 11:30:00','2166-01-10 12:00:00','2166-01-10 13:00:00','2166-01-10 13:30:00']})
I am trying to find the time difference between rows, for which I did the below:
df['time'] = pd.to_datetime(df['time'])  # convert the string column to datetimes first
df['time2'] = df['time'].shift(-1)
df['tdiff'] = df['time2'] - df['time']
So, my result looks as shown below.
I found out that there exists a function like dt.days and I tried
df['tdiff'].dt.days
but it only gives the day component, whereas I am looking for the hours component. I would like to have my output as shown below.
I am sorry that I am not sure how to calculate the hour equivalent of the negative time in row no. 3. It might be a data issue.
In pandas it is possible to convert timedeltas to seconds with Series.dt.total_seconds and then divide by 3600:
df['tdiff'] = (df['time2'] - df['time']).dt.total_seconds() / 3600
print (df)
time time2 tdiff
0 2166-01-09 14:00:00 2166-01-09 14:08:00 0.133333
1 2166-01-09 14:08:00 2166-01-09 16:00:00 1.866667
2 2166-01-09 16:00:00 2166-01-09 20:00:00 4.000000
3 2166-01-09 20:00:00 2166-01-09 04:00:00 -16.000000
4 2166-01-09 04:00:00 2166-01-10 05:00:00 25.000000
5 2166-01-10 05:00:00 2166-01-10 06:00:00 1.000000
6 2166-01-10 06:00:00 2166-01-10 07:00:00 1.000000
7 2166-01-10 07:00:00 2166-01-10 11:00:00 4.000000
8 2166-01-10 11:00:00 2166-01-10 11:30:00 0.500000
9 2166-01-10 11:30:00 2166-01-10 12:00:00 0.500000
10 2166-01-10 12:00:00 2166-01-10 13:00:00 1.000000
11 2166-01-10 13:00:00 2166-01-10 13:30:00 0.500000
12 2166-01-10 13:30:00 NaT NaN
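A compact, self-contained version of this approach (using only the first three timestamps from the question, with the string column explicitly converted, since pd.to_datetime is needed before the subtraction works):

```python
import pandas as pd

df = pd.DataFrame({'time': ['2166-01-09 14:00:00', '2166-01-09 14:08:00',
                            '2166-01-09 16:00:00']})
df['time'] = pd.to_datetime(df['time'])   # strings -> datetime64[ns]
df['time2'] = df['time'].shift(-1)

# timedelta -> fractional hours
df['tdiff'] = (df['time2'] - df['time']).dt.total_seconds() / 3600
print(df['tdiff'])
```

The 8-minute gap comes out as 0.133333 hours and the 112-minute gap as 1.866667, matching the first two rows of the answer's output; the last row is NaN because shift(-1) leaves it with no successor.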

Sampling dataframe Considering NaN values+Pandas

I have a data frame like the one below. I want to do sampling with '3S'.
There are situations where NaN is present. What I expect is for the data frame to be sampled with '3S', and if any NaN is found in between, to stop there and restart the sampling from that index. I tried using the dataframe.apply method to achieve this, but it looks very complex. Is there any shorter way to achieve it?
df.sample(n=3)
Code to generate Input:
index = pd.date_range('1/1/2000', periods=13, freq='T')
series = pd.DataFrame(range(13), index=index)
series.iloc[4] = np.nan   # assign a real NaN, not the string 'NaN'
series.iloc[10] = np.nan
print(series)
I tried to do the sampling, but after that I have no clue how to proceed.
2015-01-01 00:00:00 0.0
2015-01-01 01:00:00 1.0
2015-01-01 02:00:00 2.0
2015-01-01 03:00:00 2.0
2015-01-01 04:00:00 NaN
2015-01-01 05:00:00 3.0
2015-01-01 06:00:00 4.0
2015-01-01 07:00:00 4.0
2015-01-01 08:00:00 4.0
2015-01-01 09:00:00 NaN
2015-01-01 10:00:00 3.0
2015-01-01 11:00:00 4.0
2015-01-01 12:00:00 4.0
The new data frame should be sampled based on '3S', taking any NaN into account and restarting the sampling where a NaN record is found.
Expected Output:
2015-01-01 02:00:00 2.0 -- Sampling after 3S
2015-01-01 03:00:00 2.0 -- Print because NaN has found in Next
2015-01-01 04:00:00 NaN -- print NaN record
2015-01-01 07:00:00 4.0 -- Sampling after 3S
2015-01-01 08:00:00 4.0 -- Print because NaN has found in Next
2015-01-01 09:00:00 NaN -- print NaN record
2015-01-01 12:00:00 4.0 -- Sampling after 3S
Use:
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=13, freq='H')
df = pd.DataFrame({'col': range(13)}, index=index)
df.iloc[4, 0] = np.nan
df.iloc[9, 0] = np.nan
print (df)
col
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 05:00:00 5.0
2000-01-01 06:00:00 6.0
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 10:00:00 10.0
2000-01-01 11:00:00 11.0
2000-01-01 12:00:00 12.0
m = df['col'].isna()
s1 = m.ne(m.shift()).cumsum()
t = pd.Timedelta(2, unit='H')
mask = df.index >= df.groupby(s1)['col'].transform(lambda x: x.index[0]) + t
df1 = df[mask | m]
print (df1)
col
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 12:00:00 12.0
Explanation:
Create a mask of the missing values with Series.isna.
Create groups of consecutive values by comparing with shifted values using Series.ne (!=):
print (s1)
2000-01-01 00:00:00 1
2000-01-01 01:00:00 1
2000-01-01 02:00:00 1
2000-01-01 03:00:00 1
2000-01-01 04:00:00 2
2000-01-01 05:00:00 3
2000-01-01 06:00:00 3
2000-01-01 07:00:00 3
2000-01-01 08:00:00 3
2000-01-01 09:00:00 4
2000-01-01 10:00:00 5
2000-01-01 11:00:00 5
2000-01-01 12:00:00 5
Freq: H, Name: col, dtype: int32
Get the first index value per group, add the timedelta (2 hours here, so only rows at least 2 hours after each group's first row survive, matching the expected output) and compare with the DatetimeIndex.
Finally, filter by boolean indexing, chaining the two masks with | (bitwise OR).
One way would be to fill the NaNs with 0:
df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)
And then have the resampling to be done on the series:
(if datetime is your index)
series.resample('30S').asfreq()

Using pandas dataframe and matplotlib to manipulate data from a csv file into a plot

Here is what I'm trying to do: build a dataframe that has a datetime index created from column 0, use the resample function over a quarterly period, and create a plot that shows the quarterly precipitation totals over the 14-year period.
Second plot:
Make a plot of the average monthly precipitation and the monthly standard deviation. Plot both values on the same axes.
Here's my code so far:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
matplotlib.rcParams['figure.figsize'] = (10.0, 4.0)
df = pd.read_csv("ColumbusPrecipData.csv")
df.set_index("date", inplace = True)
#df['date'] = pd.to_datetime(df[['']])
print(df)
#build plots
#axes = plt.subplot()
#start = pd.to_datetime
#end = pd.to_datetime
#axes.set_xlim(start, end)
#axes.set_title("")
#axes.set_ylabel("")
#axes.tick_params(axis='x', rotation=45)
#axes.legend(loc='best')
Here's what the dataframe looks like:
Unnamed: 0 Precip
0 2000-01-01 01:00:00 0.0
1 2000-01-01 02:00:00 0.0
2 2000-01-01 03:00:00 0.0
3 2000-01-01 04:00:00 0.0
4 2000-01-01 05:00:00 0.0
5 2000-01-01 06:00:00 0.0
6 2000-01-01 07:00:00 0.0
7 2000-01-01 08:00:00 0.0
8 2000-01-01 09:00:00 0.0
9 2000-01-01 10:00:00 0.0
10 2000-01-01 11:00:00 0.0
11 2000-01-01 12:00:00 0.0
12 2000-01-01 13:00:00 0.0
13 2000-01-01 14:00:00 0.0
14 2000-01-01 15:00:00 0.0
15 2000-01-01 16:00:00 0.0
16 2000-01-01 17:00:00 0.0
17 2000-01-01 18:00:00 0.0
18 2000-01-01 19:00:00 0.0
19 2000-01-01 20:00:00 0.0
20 2000-01-01 21:00:00 0.0
21 2000-01-01 22:00:00 0.0
22 2000-01-01 23:00:00 0.0
23 2000-01-02 00:00:00 0.0
24 2000-01-02 01:00:00 0.0
25 2000-01-02 02:00:00 0.0
26 2000-01-02 03:00:00 0.0
27 2000-01-02 04:00:00 0.0
28 2000-01-02 05:00:00 0.0
29 2000-01-02 06:00:00 0.0
... ... ...
122696 2013-12-30 09:00:00 0.0
122697 2013-12-30 10:00:00 0.0
122698 2013-12-30 11:00:00 0.0
122699 2013-12-30 12:00:00 0.0
122700 2013-12-30 13:00:00 0.0
122701 2013-12-30 14:00:00 0.0
122702 2013-12-30 15:00:00 0.0
122703 2013-12-30 16:00:00 0.0
122704 2013-12-30 17:00:00 0.0
122705 2013-12-30 18:00:00 0.0
122706 2013-12-30 19:00:00 0.0
122707 2013-12-30 20:00:00 0.0
122708 2013-12-30 21:00:00 0.0
122709 2013-12-30 22:00:00 0.0
122710 2013-12-30 23:00:00 0.0
122711 2013-12-31 00:00:00 0.0
122712 2013-12-31 01:00:00 0.0
122713 2013-12-31 02:00:00 0.0
122714 2013-12-31 03:00:00 0.0
122715 2013-12-31 04:00:00 0.0
122716 2013-12-31 05:00:00 0.0
122717 2013-12-31 06:00:00 0.0
122718 2013-12-31 07:00:00 0.0
122719 2013-12-31 08:00:00 0.0
122720 2013-12-31 09:00:00 0.0
122721 2013-12-31 10:00:00 0.0
122722 2013-12-31 11:00:00 0.0
122723 2013-12-31 12:00:00 0.0
122724 2013-12-31 13:00:00 0.0
122725 2013-12-31 14:00:00 0.0
[122726 rows x 2 columns]
df = df.rename( columns={"Unnamed: 0": "date"})
df = df.set_index(pd.DatetimeIndex(df['date']))
Then
df1 = df.groupby(pd.Grouper(freq='M')).mean()
plt.plot(df1)
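For completeness, a hedged sketch of both requested plots; the hourly data here is a synthetic stand-in for the CSV, and the 'Precip' column name is taken from the printout above:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend for the sketch
import matplotlib.pyplot as plt

# Hypothetical stand-in for the CSV: two years of hourly precipitation
idx = pd.date_range('2000-01-01', '2001-12-31 23:00:00', freq='h')
rng = np.random.default_rng(0)
df = pd.DataFrame({'Precip': rng.random(len(idx))}, index=idx)

# Plot 1: quarterly totals over the whole period
quarterly = df['Precip'].resample('Q').sum()
quarterly.plot(title='Quarterly precipitation totals')

# Plot 2: mean and standard deviation per calendar month, on the same axes
monthly = df['Precip'].groupby(df.index.month)
fig, ax = plt.subplots()
monthly.mean().plot(ax=ax, label='monthly mean')
monthly.std().plot(ax=ax, label='monthly std dev')
ax.set_xlabel('month')
ax.legend()
```

Grouping by df.index.month pools each calendar month across all years, which is one reading of "average monthly precip"; if per-month-per-year values are wanted instead, resample('M') would replace the groupby.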
