I have a dataframe which looks like this:
                     pressure mean  pressure std
2016-03-01 00:00:00     615.686441      0.138287
2016-03-01 01:00:00     615.555000      0.067460
2016-03-01 02:00:00     615.220000      0.262840
2016-03-01 03:00:00     614.993333      0.138841
2016-03-01 04:00:00     615.075000      0.072778
2016-03-01 05:00:00     615.513333      0.162049
................
The first column is the index column.
I want to create a new dataframe containing only the rows at 3 am and 3 pm,
so it will look like this:
                     pressure mean  pressure std
2016-03-01 03:00:00     614.993333      0.138841
2016-03-01 15:00:00     616.613333      0.129493
2016-03-02 03:00:00     615.600000      0.068889
..................
Any ideas?
Thank you!
I couldn't load your data using pd.read_clipboard(), so I'm going to recreate some data:
df = pd.DataFrame(index=pd.date_range('2016-03-01', freq='H', periods=72),
                  data=np.random.random(size=(72, 2)),
                  columns=['pressure', 'mean'])
Now your dataframe should have a DatetimeIndex. If not, you can use df.index = pd.to_datetime(df.index).
Then it's really easy using boolean indexing:
df.loc[(df.index.hour == 3) | (df.index.hour == 15)]
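Equivalently, Index.isin avoids chaining boolean conditions; a minimal sketch on the same synthetic data:

import numpy as np
import pandas as pd

# Same synthetic hourly frame as above
df = pd.DataFrame(index=pd.date_range('2016-03-01', freq='H', periods=72),
                  data=np.random.random(size=(72, 2)),
                  columns=['pressure', 'mean'])

# Keep only rows whose hour is 3 or 15
print(df.loc[df.index.hour.isin([3, 15])])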
I am working on some code that will rearrange a time series. Currently I have a standard time series with three columns, with the header being [Date, Time, Value]. I want to reformat the dataframe to index by the date and use a header with the time (i.e. 0:00, 1:00, ..., 23:00). The dataframe will be filled in with the values.
Here is the DataFrame I currently have.
Essentially I'd like to move the index to a single day and show the hours through the columns.
Thanks,
Use pivot:
df = df.pivot(index='Date', columns='Time', values='Total')
Output (first 10 columns and with random values for Total):
>>> df.pivot(index='Date', columns='Time', values='Total').iloc[:, 0:10]
Time         00:00:00  01:00:00  02:00:00  03:00:00  04:00:00  05:00:00  06:00:00  07:00:00  08:00:00  09:00:00
Date
2019-01-01   0.732494  0.087657  0.930405  0.958965  0.531928  0.891228  0.664634  0.432684  0.009653  0.604878
2019-01-02   0.471386  0.575126  0.509707  0.715290  0.337983  0.618632  0.413530  0.849033  0.725556  0.186876
You could try this.
Split the time part to get only the hour, then prefix it with 'hr'.
df = pd.DataFrame([['2019-01-01', '00:00:00', -127.57],
                   ['2019-01-01', '01:00:00', -137.57],
                   ['2019-01-02', '00:00:00', -147.57]],
                  columns=['Date', 'Time', 'Totals'])
df['hours'] = df['Time'].apply(lambda x: 'hr' + str(int(x.split(':')[0])))
print(pd.pivot_table(df, values='Totals', index=['Date'], columns='hours'))
Output
hours           hr0     hr1
Date
2019-01-01  -127.57 -137.57
2019-01-02  -147.57     NaN
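A variant sketch of the same idea, deriving the hour label with pd.to_datetime instead of string splitting (the format string assumes the Time values look like '00:00:00'):

hours = pd.to_datetime(df['Time'], format='%H:%M:%S').dt.hour
df['hours'] = 'hr' + hours.astype(str)
print(pd.pivot_table(df, values='Totals', index=['Date'], columns='hours'))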
I have a time series dataframe with dates and weather information that looks like this:
2017-01-01 5
2017-01-02 10
.
.
2017-12-31 6
I am trying to upsample it to hourly data using the following:
weather.resample('H').pad()
I expected to see 8760 entries, for 24 intervals * 365 days. However, it only returns 8737, with the last 23 intervals missing for the 31st of December. Is there something special I need to do to get 24 intervals for the last day?
Thanks in advance.
Pandas normalizes 2017-12-31 to 2017-12-31 00:00 and then creates a range that ends at that last datetime, so the final day contributes only a single hourly entry. I would include a last row before resampling with
df.loc['2018-01-01'] = 0
Edit:
You can get the result you want with numpy.repeat
Take this df
np.random.seed(1)
weather = pd.DataFrame(index=pd.date_range('2017-01-01', '2017-12-31'),
                       data={'WEATHER_MAX': np.random.random(365) * 15})
            WEATHER_MAX
2017-01-01     6.255330
2017-01-02    10.804867
2017-01-03     0.001716
2017-01-04     4.534989
2017-01-05     2.201338
...                 ...
2017-12-27     4.503725
2017-12-28     2.145087
2017-12-29    13.519627
2017-12-30     8.123391
2017-12-31    14.621106

[365 rows x 1 columns]
By repeating on axis=1 and stacking, you can then transform the default range(24) column labels into hourly timedeltas:
# repeat, then stack
hourly = pd.DataFrame(np.repeat(weather.values, 24, axis=1),
index=weather.index).stack()
# combine date and hour
hourly.index = (
hourly.index.get_level_values(0) +
pd.to_timedelta(hourly.index.get_level_values(1), unit='h')
)
hourly = hourly.rename('WEATHER_MAX').to_frame()
Output
                     WEATHER_MAX
2017-01-01 00:00:00     6.255330
2017-01-01 01:00:00     6.255330
2017-01-01 02:00:00     6.255330
2017-01-01 03:00:00     6.255330
2017-01-01 04:00:00     6.255330
...                          ...
2017-12-31 19:00:00    14.621106
2017-12-31 20:00:00    14.621106
2017-12-31 21:00:00    14.621106
2017-12-31 22:00:00    14.621106
2017-12-31 23:00:00    14.621106

[8760 rows x 1 columns]
What to do, and the reason why, are the same as in #RichieV's answer.
However, the value to use should not be 0 or another meaningless value; you need valid data actually measured on 2018-01-01.
This is because padding with a meaningless value degrades the resampled 2017-12-31 data and any results derived from it.
Prepare a valid value for 2018-01-01 at the end of the data.
Call resample.
Delete the 2018-01-01 data after resampling.
You will get 8760 rows for 2017, as in the sketch below.
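A minimal sketch of those three steps; the 7.0 is only a stand-in for the real measurement taken on 2018-01-01:

import numpy as np
import pandas as pd

# Daily data for 2017, as in the answer above
weather = pd.DataFrame(index=pd.date_range('2017-01-01', '2017-12-31'),
                       data={'WEATHER_MAX': np.random.random(365) * 15})

# Step 1: append the actual 2018-01-01 measurement (7.0 is a placeholder)
weather.loc[pd.Timestamp('2018-01-01')] = 7.0

# Step 2: upsample to hourly, forward-filling each day's value
hourly = weather.resample('H').pad()

# Step 3: drop the 2018 anchor row again
hourly = hourly[hourly.index.year == 2017]
print(len(hourly))  # 8760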
Look at #RichieV's modified answer:
I was misunderstanding the question. My answer assumed you wanted to complement resample with interpolate etc., i.e. "I want to perform extrapolation (data interpolation) using resample."
If filling every hour with the same value as the day's 00:00 is acceptable, that is a different way of thinking.
I am a Korean student.
Please understand that my English is awkward.
I want to split the datetime column into year, month, ..., second columns.
train = pd.read_csv('input/Train.csv')
The datetime column looks like this
(this is head(20); I removed the other columns to make it easier to see)
datetime
0 2011-01-01 00:00:00
1 2011-01-01 01:00:00
2 2011-01-01 02:00:00
3 2011-01-01 03:00:00
4 2011-01-01 04:00:00
5 2011-01-01 05:00:00
6 2011-01-01 06:00:00
7 2011-01-01 07:00:00
8 2011-01-01 08:00:00
9 2011-01-01 09:00:00
10 2011-01-01 10:00:00
11 2011-01-01 11:00:00
12 2011-01-01 12:00:00
13 2011-01-01 13:00:00
14 2011-01-01 14:00:00
15 2011-01-01 15:00:00
16 2011-01-01 16:00:00
17 2011-01-01 17:00:00
18 2011-01-01 18:00:00
19 2011-01-01 19:00:00
Then I wrote this code to extract each component as a column (year, month, day, hour, minute, second):
train['year'] = train['datetime'].dt.year
train['month'] = train['datetime'].dt.month
train['day'] = train['datetime'].dt.day
train['hour'] = train['datetime'].dt.hour
train['minute'] = train['datetime'].dt.minute
train['second'] = train['datetime'].dt.second
and got an error like this:
AttributeError: Can only use .dt accessor with datetimelike values
Please help me ㅠㅅㅠ
Note that by default read_csv is able to deduce column types only
for numeric and boolean columns.
Unless explicitly specified (e.g. by passing converters or dtype
parameters), all other input columns are left as strings,
and the pandas type of such columns is object.
And just this occurred in your case.
So, as this column is of object type, you cannot invoke the .dt accessor
on it, as it works only on columns of datetime type.
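A minimal sketch of the direct fix, assuming the column parses cleanly with pd.to_datetime:

train['datetime'] = pd.to_datetime(train['datetime'])
train['year'] = train['datetime'].dt.year  # likewise for month, day, hour, minute, second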
Alternatively, in this case, you can take the following approach:
do not specify any conversion of this column (it will be parsed
just as object),
after that, split the datetime column into its "parts" using str.split
(all 6 columns with a single instruction),
set proper column names in the resulting DataFrame,
join it to the original DataFrame (and drop the working DataFrame),
and only now change the type of the original column.
To do it, you can run:
wrk = df['datetime'].str.split(r'[- :]', expand=True).astype(int)
wrk.columns = ['year', 'month', 'day', 'hour', 'minute', 'second']
df = df.join(wrk)
del wrk
df['datetime'] = pd.to_datetime(df['datetime'])
Note that I added astype(int). Otherwise these columns would be left as
object (actually string) type.
Or maybe this original column is not needed any more (as you have extracted
all date / time components)? In that case, drop this column instead of
converting it.
And one last hint: datetime is widely used as a type name (in various
forms), so it is better to use some other column name here, at least one
differing in character case, e.g. DateTime.
I have a dataframe as follows
df = pd.DataFrame({'X': np.random.randn(50000)},
                  index=pd.date_range('1/1/2000', periods=50000, freq='T'))
df.head(10)
Out[37]:
                            X
2000-01-01 00:00:00 -0.699565
2000-01-01 00:01:00 -0.646129
2000-01-01 00:02:00 1.339314
2000-01-01 00:03:00 0.559563
2000-01-01 00:04:00 1.529063
2000-01-01 00:05:00 0.131740
2000-01-01 00:06:00 1.282263
2000-01-01 00:07:00 -1.003991
2000-01-01 00:08:00 -1.594918
2000-01-01 00:09:00 -0.775230
I would like to create a variable that contains the sum of X
over the last 5 days (not including the current observation)
only considering observations that fall at the exact same time of day (hour and minute) as the current observation.
In other words:
At index 2000-01-01 00:00:00, df['rolling_sum_same_hour'] contains the sum of the values of X observed at 00:00:00 during the last 5 days in the data (not including 2000-01-01, of course).
At index 2000-01-01 00:01:00, df['rolling_sum_same_hour'] contains the sum of X observed at 00:01:00 during the last 5 days, and so on.
The intuitive idea is that intraday prices have intraday seasonality, and I want to get rid of it that way.
I tried to use df['rolling_sum_same_hour']=df.at_time(df.index.minute).rolling(window=5).sum()
with no success.
Any ideas?
Many thanks!
Behold the power of groupby!
df = # as you defined above
df['rolling_sum_by_time'] = df.groupby(df.index.time)['X'].apply(lambda x: x.shift(1).rolling(10).sum())
It's a big pill to swallow, but we are grouping by time (as in Python's datetime.time), then selecting the column we care about (otherwise apply would operate over columns; now it operates on the time groups), and then applying the function you want!
IIUC, what you want is to perform a rolling sum, but only on the observations grouped by the exact same time of day. This can be done by
df.X.groupby([df.index.hour, df.index.minute]).apply(lambda g: g.rolling(window=5).sum())
(Note that the question asks for 5 periods, while the answer above used a window of 10.) For example:
In [43]: df.X.groupby([df.index.hour, df.index.minute]).apply(lambda g: g.rolling(window=5).sum()).tail()
Out[43]:
2000-02-04 17:15:00 -2.135887
2000-02-04 17:16:00 -3.056707
2000-02-04 17:17:00 0.813798
2000-02-04 17:18:00 -1.092548
2000-02-04 17:19:00 -0.997104
Freq: T, Name: X, dtype: float64
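Note that this version does not yet exclude the current observation, as the question requires; a sketch adding the shift(1) from the first answer (group_keys=False keeps the original index so the result can be assigned back):

df['rolling_sum_same_hour'] = (
    df.X.groupby([df.index.hour, df.index.minute], group_keys=False)
        .apply(lambda g: g.shift(1).rolling(window=5).sum())
)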
I have a time series of intraday data that looks like below:
ts = pd.Series(np.random.randn(60),
               index=pd.date_range('1/1/2000', periods=60, freq='2h'))
I am hoping to transform the data into a DataFrame, with a column for each date and rows for the times of day.
I have tried this:
key = lambda x:x.date()
grouped = ts.groupby(key)
But how do I transform the groups into a date-columned DataFrame? Or is there a better way?
import pandas as pd
import numpy as np
index = pd.date_range('1/1/2000', periods=60, freq='3h')
ts = pd.Series(np.random.randn(60), index = index)
key = lambda x: x.time()
groups = ts.groupby(key)
print(pd.DataFrame({k: g for k, g in groups}).resample('D').mean().T)
Output:
2000-01-01 2000-01-02 2000-01-03 2000-01-04 2000-01-05 2000-01-06 \
00:00:00 0.109959 -0.124291 -0.137365 0.054729 -1.305821 -1.928468
03:00:00 1.336467 0.874296 0.153490 -2.410259 0.906950 1.860385
06:00:00 -1.172638 -0.410272 -0.800962 0.568965 -0.270307 -2.046119
09:00:00 -0.707423 1.614732 0.779645 -0.571251 0.839890 0.435928
12:00:00 0.865577 -0.076702 -0.966020 0.589074 0.326276 -2.265566
15:00:00 1.845865 -1.421269 -0.141785 0.433011 -0.063286 0.129706
18:00:00 -0.054569 0.277901 0.383375 -0.546495 -0.644141 -0.207479
21:00:00 1.056536 0.031187 -1.667686 -0.270580 -0.678205 0.750386
2000-01-07 2000-01-08
00:00:00 -0.657398 -0.630487
03:00:00 2.205280 -0.371830
06:00:00 -0.073235 0.208831
09:00:00 1.720097 -0.312353
12:00:00 -0.774391 NaN
15:00:00 0.607250 NaN
18:00:00 1.379823 NaN
21:00:00 0.959811 NaN
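A more direct alternative sketch: materialize explicit date and time columns from the question's ts and pivot, instead of grouping and reassembling:

import numpy as np
import pandas as pd

ts = pd.Series(np.random.randn(60),
               index=pd.date_range('1/1/2000', periods=60, freq='2h'))

# Turn the index into explicit date/time columns, then pivot
frame = ts.to_frame('value')
frame['date'] = frame.index.date
frame['time'] = frame.index.time
print(frame.pivot(index='time', columns='date', values='value'))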