How to get the shifted index value of a DataFrame in pandas? - python

Consider the simple example below:
import pandas as pd

date = pd.date_range('1/1/2011', periods=5, freq='H')
df = pd.DataFrame({'cat': ['A', 'A', 'A', 'B', 'B']}, index=date)
df
Out[278]:
cat
2011-01-01 00:00:00 A
2011-01-01 01:00:00 A
2011-01-01 02:00:00 A
2011-01-01 03:00:00 B
2011-01-01 04:00:00 B
I want to create a column that contains the lagged/lead value of the index, i.e. something like:
df['index_shifted']=df.index.shift(1)
So, for instance, at time 2011-01-01 01:00:00 I expect the variable index_shifted to be 2011-01-01 00:00:00
How can I do that?
Thanks!

I think you need Index.shift with -1:
df['index_shifted']= df.index.shift(-1)
print (df)
cat index_shifted
2011-01-01 00:00:00 A 2010-12-31 23:00:00
2011-01-01 01:00:00 A 2011-01-01 00:00:00
2011-01-01 02:00:00 A 2011-01-01 01:00:00
2011-01-01 03:00:00 B 2011-01-01 02:00:00
2011-01-01 04:00:00 B 2011-01-01 03:00:00
For me it works without freq, but it may be necessary with real data:
df['index_shifted']= df.index.shift(-1, freq='H')
print (df)
cat index_shifted
2011-01-01 00:00:00 A 2010-12-31 23:00:00
2011-01-01 01:00:00 A 2011-01-01 00:00:00
2011-01-01 02:00:00 A 2011-01-01 01:00:00
2011-01-01 03:00:00 B 2011-01-01 02:00:00
2011-01-01 04:00:00 B 2011-01-01 03:00:00
EDIT:
If the freq of the DatetimeIndex is None, you need to add freq to shift:
import pandas as pd
date = pd.date_range('1/1/2011', periods=5, freq='H').union(pd.date_range('5/1/2011', periods=5, freq='H'))
df = pd.DataFrame({'cat': ['A', 'A', 'A', 'B', 'B',
                           'A', 'A', 'A', 'B', 'B']}, index=date)
print (df.index)
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 01:00:00',
'2011-01-01 02:00:00', '2011-01-01 03:00:00',
'2011-01-01 04:00:00', '2011-05-01 00:00:00',
'2011-05-01 01:00:00', '2011-05-01 02:00:00',
'2011-05-01 03:00:00', '2011-05-01 04:00:00'],
dtype='datetime64[ns]', freq=None)
df['index_shifted']= df.index.shift(-1, freq='H')
print (df)
cat index_shifted
2011-01-01 00:00:00 A 2010-12-31 23:00:00
2011-01-01 01:00:00 A 2011-01-01 00:00:00
2011-01-01 02:00:00 A 2011-01-01 01:00:00
2011-01-01 03:00:00 B 2011-01-01 02:00:00
2011-01-01 04:00:00 B 2011-01-01 03:00:00
2011-05-01 00:00:00 A 2011-04-30 23:00:00
2011-05-01 01:00:00 A 2011-05-01 00:00:00
2011-05-01 02:00:00 A 2011-05-01 01:00:00
2011-05-01 03:00:00 B 2011-05-01 02:00:00
2011-05-01 04:00:00 B 2011-05-01 03:00:00

What's wrong with df['index_shifted']=df.index.shift(-1)?
(Genuine question, not sure if I missed something)
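It only works while the index carries a usable freq; once the freq is None (as with the gapped index in the EDIT above), pandas refuses to shift without an explicit frequency. A quick sketch, assuming a reasonably recent pandas (the exact exception type, NullFrequencyError, may vary by version):

```python
import pandas as pd

# Index with a regular frequency: shift(-1) works with no explicit freq.
regular = pd.date_range('1/1/2011', periods=3, freq='H')
print(regular.shift(-1))

# A union with a second range loses the freq (freq=None); shifting
# without an explicit freq then raises an error.
gapped = regular.union(pd.date_range('5/1/2011', periods=3, freq='H'))
try:
    gapped.shift(-1)
except Exception as err:
    print(type(err).__name__, err)
```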

This is an old question, but if your timestamps have gaps or you do not want to specify the frequency, AND you are not dealing with timezones, the following will work (note the positive period: Series.shift is positional, so shift(1) pulls in the previous row's timestamp, matching the expected output in the question):
df['index_shifted'] = pd.Series(df.index).shift(1).values
If you are dealing with timezones the following will work:
df['index_shifted'] = pd.to_datetime(pd.Series(df.index).shift(1).values, utc=True).tz_convert('America/New_York')
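A minimal runnable version on the question's sample data; because Series.shift is positional, a positive period gives the previous row's timestamp (the lag the question asks for), with no freq required:

```python
import pandas as pd

date = pd.date_range('1/1/2011', periods=5, freq='H')
df = pd.DataFrame({'cat': ['A', 'A', 'A', 'B', 'B']}, index=date)

# Wrap the index in a Series so the shift is positional; .values drops
# the temporary RangeIndex so assignment aligns row by row.
df['index_shifted'] = pd.Series(df.index).shift(1).values
print(df)
```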

Related

How to convert hourly data to half hourly

I have the following dataframe:
datetime temp
0 2015-01-01 00:00:00 11.22
1 2015-01-01 01:00:00 11.32
2 2015-01-01 02:00:00 11.30
3 2015-01-01 03:00:00 11.25
4 2015-01-01 04:00:00 11.32
... ... ...
31339 2018-07-29 19:00:00 17.60
31340 2018-07-29 20:00:00 17.49
31341 2018-07-29 21:00:00 17.44
31342 2018-07-29 22:00:00 17.39
31343 2018-07-29 23:00:00 17.37
I want to convert this dataframe to half-hourly data, imputing each new position with the mean of the previous and following values (or any similar interpolation), for example:
datetime temp
0 2015-01-01 00:00:00 11.00
1 2015-01-01 00:30:00 11.50
2 2015-01-01 01:00:00 12.00
Is there any pandas/datetime function to assist in this operation?
Thank you
You can use the resample() function in pandas. With it you set the frequency to down/upsample to, and then choose how to aggregate or fill (mean, sum, etc.). In your case you can interpolate between the values.
For this to work, your datetime column must have a datetime dtype; then set it as the index.
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
Then you can resample to 30 minutes ('30T') and then interpolate.
df.resample('30T').interpolate()
Resulting in...
temp
datetime
2015-01-01 00:00:00 11.220
2015-01-01 00:30:00 11.270
2015-01-01 01:00:00 11.320
2015-01-01 01:30:00 11.310
2015-01-01 02:00:00 11.300
2015-01-01 02:30:00 11.275
2015-01-01 03:00:00 11.250
2015-01-01 03:30:00 11.285
2015-01-01 04:00:00 11.320
Read more about the frequency strings and resampling in the Pandas docs.
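Both directions in one runnable sketch on invented values:

```python
import pandas as pd

df = pd.DataFrame(
    {'temp': [11.22, 11.32, 11.30, 11.25, 11.32, 11.40]},
    index=pd.date_range('2015-01-01', periods=6, freq='H'))

# Downsample: average each two-hour bin.
down = df.resample('2H').mean()
print(down)

# Upsample: add half-hour rows, then fill them by linear interpolation.
up = df.resample('30T').interpolate()
print(up)
```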

datetime difference between dates

I have a df like so:
firstdate seconddate
0 2011-01-01 13:00:00 2011-01-01 13:00:00
1 2011-01-02 14:00:00 2011-01-01 11:00:00
2 2011-01-02 16:00:00 2011-01-02 13:00:00
3 2011-01-04 12:00:00 2011-01-03 15:00:00
...
seconddate is always before firstdate. I want to compute the difference between firstdate and seconddate in days and make it a column: if firstdate and seconddate fall on the same day, the difference is 0; if seconddate is the day before firstdate, the difference is 1; and so on, up to a week. How would I do this?
df['firstdate'] = pd.to_datetime(df['firstdate'])
df['seconddate'] = pd.to_datetime(df['seconddate'])
df['diff'] = (df['firstdate'] - df['seconddate']).dt.days
This adds a column with the difference in whole days. Note that diff is also the name of a DataFrame method, so filter with bracket access rather than attribute access:
df.drop(df[df['diff'] < 0].index)
# or
df = df[df['diff'] >= 0]
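Note that .dt.days counts elapsed whole 24-hour periods, not calendar days, so 2011-01-04 12:00 minus 2011-01-03 15:00 (21 hours) gives 0 even though seconddate is the day before. If the "day before → 1" rule matters, normalize both sides to midnight first. A sketch on the question's data (the cal_diff name is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'firstdate': ['2011-01-01 13:00:00', '2011-01-02 14:00:00',
                  '2011-01-02 16:00:00', '2011-01-04 12:00:00'],
    'seconddate': ['2011-01-01 13:00:00', '2011-01-01 11:00:00',
                   '2011-01-02 13:00:00', '2011-01-03 15:00:00']})
df['firstdate'] = pd.to_datetime(df['firstdate'])
df['seconddate'] = pd.to_datetime(df['seconddate'])

# Elapsed whole days: truncates the timedelta, so 21 hours -> 0.
df['diff'] = (df['firstdate'] - df['seconddate']).dt.days

# Calendar-day difference: compare the dates at midnight.
df['cal_diff'] = (df['firstdate'].dt.normalize()
                  - df['seconddate'].dt.normalize()).dt.days
print(df)
```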

Remove 'seconds' and 'minutes' from a Pandas dataframe column

Given a dataframe like:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'Date' : pd.date_range('1/1/2011', periods=5, freq='3675S'),
'Num' : np.random.rand(5)})
Date Num
0 2011-01-01 00:00:00 0.580997
1 2011-01-01 01:01:15 0.407332
2 2011-01-01 02:02:30 0.786035
3 2011-01-01 03:03:45 0.821792
4 2011-01-01 04:05:00 0.807869
I would like to remove the 'minutes' and 'seconds' information.
The following (mostly stolen from: How to remove the 'seconds' of Pandas dataframe index?) works okay,
df = df.assign(Date = lambda x: pd.to_datetime(x['Date'].dt.strftime('%Y-%m-%d %H')))
Date Num
0 2011-01-01 00:00:00 0.580997
1 2011-01-01 01:00:00 0.407332
2 2011-01-01 02:00:00 0.786035
3 2011-01-01 03:00:00 0.821792
4 2011-01-01 04:00:00 0.807869
but it feels strange to convert a datetime to a string then back to a datetime. Is there a way to do this more directly?
dt.floor
To drop the minutes and seconds, truncate with dt.floor:
df.assign(Date=df.Date.dt.floor('H'))
dt.round('H') produces the same output for this data (every offset is under 30 minutes), but it rounds to the nearest hour instead of truncating, so 01:45 would become 02:00.
Date Num
0 2011-01-01 00:00:00 0.577957
1 2011-01-01 01:00:00 0.995748
2 2011-01-01 02:00:00 0.864013
3 2011-01-01 03:00:00 0.468762
4 2011-01-01 04:00:00 0.866827
OLD ANSWER
One approach is to set the index and use resample
df.set_index('Date').resample('H').last().reset_index()
Date Num
0 2011-01-01 00:00:00 0.577957
1 2011-01-01 01:00:00 0.995748
2 2011-01-01 02:00:00 0.864013
3 2011-01-01 03:00:00 0.468762
4 2011-01-01 04:00:00 0.866827
Another alternative is to rebuild the timestamp from the date and hour components:
df.assign(
    Date=pd.to_datetime(df.Date.dt.date) +
         pd.to_timedelta(df.Date.dt.hour, unit='H'))
Date Num
0 2011-01-01 00:00:00 0.577957
1 2011-01-01 01:00:00 0.995748
2 2011-01-01 02:00:00 0.864013
3 2011-01-01 03:00:00 0.468762
4 2011-01-01 04:00:00 0.866827
Another option is to rebuild the timestamp with datetime:
from datetime import datetime

df.Date = pd.to_datetime(df.Date)
df.Date = df.Date.apply(lambda x: datetime(x.year, x.month, x.day, x.hour))
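For what it's worth, floor and round only diverge once a timestamp passes the half hour; a quick sketch:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2011-01-01 01:45:10',
                              '2011-01-01 02:10:00']))

# floor truncates to the start of the hour; round goes to the nearest hour.
print(s.dt.floor('H'))  # 01:00, 02:00
print(s.dt.round('H'))  # 02:00, 02:00
```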

get mean for all columns in a dataframe and create a new dataframe

I have a dataframe containing only numeric values, and I want to calculate the mean of every column and put the results in a new dataframe.
The original dataframe is indexed by a datetime field. The new dataframe should be indexed by the same field, with the value equal to the last row index of the original dataframe.
Code so far
mean_series=df.mean()
df_mean= pd.DataFrame(stddev_series)
df_mean.rename(columns=lambda x: 'std_dev_'+ x, inplace=True)
but this gives an error
df_mean.rename(columns=lambda x: 'std_mean_'+ x, inplace=True)
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S21') dtype('S21') dtype('S21')
Your question implies that you want a new DataFrame with a single row.
In [10]: df.head(10)
Out[10]:
0 1 2 3
2011-01-01 00:00:00 0.182481 0.523784 0.718124 0.063792
2011-01-01 01:00:00 0.321362 0.404686 0.481889 0.524521
2011-01-01 02:00:00 0.514426 0.735809 0.433758 0.392824
2011-01-01 03:00:00 0.616802 0.149099 0.217199 0.155990
2011-01-01 04:00:00 0.525465 0.439633 0.641974 0.270364
2011-01-01 05:00:00 0.749662 0.151958 0.200913 0.219916
2011-01-01 06:00:00 0.665164 0.396595 0.980862 0.560119
2011-01-01 07:00:00 0.797803 0.377273 0.273724 0.220965
2011-01-01 08:00:00 0.651989 0.553929 0.769008 0.545288
2011-01-01 09:00:00 0.692169 0.261194 0.400704 0.118335
In [11]: df.tail()
Out[11]:
0 1 2 3
2011-01-03 19:00:00 0.247211 0.539330 0.734206 0.781125
2011-01-03 20:00:00 0.278550 0.534943 0.804949 0.137291
2011-01-03 21:00:00 0.602246 0.108791 0.987120 0.455887
2011-01-03 22:00:00 0.003097 0.436435 0.987877 0.046066
2011-01-03 23:00:00 0.604916 0.670532 0.513927 0.610775
In [12]: df.mean()
Out[12]:
0 0.495307
1 0.477509
2 0.562590
3 0.447997
dtype: float64
In [13]: new_df = pd.DataFrame(df.mean().to_dict(),index=[df.index.values[-1]])
In [14]: new_df
Out[14]:
0 1 2 3
2011-01-03 23:00:00 0.495307 0.477509 0.56259 0.447997
In [15]: new_df.rename(columns=lambda c: "mean_"+str(c))
Out[15]:
mean_0 mean_1 mean_2 mean_3
2011-01-03 23:00:00 0.495307 0.477509 0.56259 0.447997
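A more compact variant (an alternative sketch, not from the answer above, using made-up data): transpose the mean Series into a one-row frame and prefix the columns with add_prefix:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8.0).reshape(4, 2), columns=['a', 'b'],
                  index=pd.date_range('1/1/2011', periods=4, freq='H'))

# to_frame().T turns the mean Series into a single-row DataFrame;
# add_prefix renames the columns in one call.
df_mean = df.mean().to_frame().T.add_prefix('mean_')
df_mean.index = [df.index[-1]]
print(df_mean)
```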

Finding hour of daily max using Pandas in Python

I am trying to find the hour of max demand every day in my demand time series.
I have created a dataframe that looks like:
power
2011-01-01 00:00:00 1015.70
2011-01-01 01:00:00 1015.70
2011-01-01 02:00:00 1010.30
2011-01-01 03:00:00 1010.90
2011-01-01 04:00:00 1021.10
2011-01-01 05:00:00 1046.00
2011-01-01 06:00:00 1054.60
...
and a grouped series to find the max value from each day using .max()
grouped = df.groupby(pd.TimeGrouper('D'))
grouped['power'].max()
OUTPUT
2011-01-01 1367.30
2011-01-02 1381.90
2011-01-03 1289.00
2011-01-04 1323.50
2011-01-05 1372.70
2011-01-06 1314.40
2011-01-07 1310.60
...
However I need the hour of the max value also. So something like:
2011-01-01 18 1367.30
2011-01-02 5 1381.90
2011-01-03 22 1289.00
2011-01-04 10 1323.50
...
I have tried using idxmax() but I keep getting a ValueError
UPDATE from 2018-09-19:
FutureWarning: pd.TimeGrouper is deprecated and will be removed;
Please use pd.Grouper(freq=...)
solution:
In [295]: df.loc[df.groupby(pd.Grouper(freq='D')).idxmax().iloc[:, 0]]
Out[295]:
power
2011-01-01 06:00:00 1054.6
2011-01-02 06:00:00 2054.6
Old answer:
try this:
In [376]: df.loc[df.groupby(pd.TimeGrouper('D')).idxmax().iloc[:, 0]]
Out[376]:
power
2011-01-01 06:00:00 1054.6
2011-01-02 06:00:00 2054.6
data:
In [377]: df
Out[377]:
power
2011-01-01 00:00:00 1015.7
2011-01-01 01:00:00 1015.7
2011-01-01 02:00:00 1010.3
2011-01-01 03:00:00 1010.9
2011-01-01 04:00:00 1021.1
2011-01-01 05:00:00 1046.0
2011-01-01 06:00:00 1054.6
2011-01-02 00:00:00 2015.7
2011-01-02 01:00:00 2015.7
2011-01-02 02:00:00 2010.3
2011-01-02 03:00:00 2010.9
2011-01-02 04:00:00 2021.1
2011-01-02 05:00:00 2046.0
2011-01-02 06:00:00 2054.6
You can also group by your index date with df.groupby(df.index.date) and then use idxmax() to find the index of the max value in the power column:
df.groupby(df.index.date)["power"].idxmax()
If you want the power values too, use .loc:
df.loc[df.groupby(df.index.date)["power"].idxmax()]
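To get the layout the question asks for (date, hour, power), the idxmax timestamps can be reduced to their hour; a sketch on invented data (the ts/out names are arbitrary):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2011-01-01', periods=48, freq='H')
# Toy demand curve peaking at 18:00 each day.
df = pd.DataFrame({'power': 1000 - (idx.hour - 18) ** 2}, index=idx)

# Timestamp of each day's maximum, then split out the hour and the value.
ts = df.groupby(df.index.date)['power'].idxmax()
out = pd.DataFrame({'hour': ts.dt.hour,
                    'power': df.loc[ts, 'power'].values})
print(out)
```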
