I have a df like so:
firstdate seconddate
0 2011-01-01 13:00:00 2011-01-01 13:00:00
1 2011-01-02 14:00:00 2011-01-01 11:00:00
2 2011-01-02 16:00:00 2011-01-02 13:00:00
3 2011-01-04 12:00:00 2011-01-03 15:00:00
...
seconddate is always on or before firstdate. I want to compute the difference between firstdate and seconddate in number of days and store it in a new column: if firstdate and seconddate fall on the same day, difference=0; if seconddate is the day before firstdate, difference=1; and so on, up to a week. How would I do this?
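df['firstdate'] = pd.to_datetime(df['firstdate'])
df['seconddate'] = pd.to_datetime(df['seconddate'])
df['diff'] = (df['firstdate'] - df['seconddate']).dt.days
This adds a column with the difference in whole days. Note that .dt.days counts elapsed 24-hour periods; to count calendar days instead, apply .dt.normalize() to both columns before subtracting. You can then drop rows based on the new column (use df['diff'], because df.diff is the DataFrame.diff method):
df = df.drop(df[df['diff'] < 0].index)
# or
df = df[df['diff'] >= 0]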
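The difference between elapsed days and calendar days matters for the sample data in the question. The sketch below rebuilds the four sample rows and compares the two approaches; the variable names `elapsed_days` and `calendar_days` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# The four sample rows from the question.
df = pd.DataFrame({
    "firstdate": pd.to_datetime([
        "2011-01-01 13:00:00", "2011-01-02 14:00:00",
        "2011-01-02 16:00:00", "2011-01-04 12:00:00",
    ]),
    "seconddate": pd.to_datetime([
        "2011-01-01 13:00:00", "2011-01-01 11:00:00",
        "2011-01-02 13:00:00", "2011-01-03 15:00:00",
    ]),
})

# Whole 24-hour periods between the two timestamps.
elapsed_days = (df["firstdate"] - df["seconddate"]).dt.days

# Calendar days: truncate both timestamps to midnight before subtracting.
calendar_days = (
    df["firstdate"].dt.normalize() - df["seconddate"].dt.normalize()
).dt.days

df["diff"] = calendar_days
print(df)
```

In the last row the two timestamps are only 21 hours apart, so the elapsed-time version gives 0, while the calendar version gives 1, which is what the question asks for.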
I have a dataframe with only numeric values and I want to calculate the mean for every column and create a new dataframe.
The original dataframe is indexed by a datetimefield. The new dataframe should be indexed by the same field as original dataframe with a value equal to last row index of original dataframe.
Code so far
mean_series=df.mean()
df_mean= pd.DataFrame(stddev_series)
df_mean.rename(columns=lambda x: 'std_dev_'+ x, inplace=True)
but this gives an error
df_mean.rename(columns=lambda x: 'std_mean_'+ x, inplace=True)
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S21') dtype('S21') dtype('S21')
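Your question implies that you want a new DataFrame with a single row. The TypeError most likely arises because the column labels are not strings, so 'std_dev_' + x fails; convert each label with str() before concatenating.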
In [10]: df.head(10)
Out[10]:
0 1 2 3
2011-01-01 00:00:00 0.182481 0.523784 0.718124 0.063792
2011-01-01 01:00:00 0.321362 0.404686 0.481889 0.524521
2011-01-01 02:00:00 0.514426 0.735809 0.433758 0.392824
2011-01-01 03:00:00 0.616802 0.149099 0.217199 0.155990
2011-01-01 04:00:00 0.525465 0.439633 0.641974 0.270364
2011-01-01 05:00:00 0.749662 0.151958 0.200913 0.219916
2011-01-01 06:00:00 0.665164 0.396595 0.980862 0.560119
2011-01-01 07:00:00 0.797803 0.377273 0.273724 0.220965
2011-01-01 08:00:00 0.651989 0.553929 0.769008 0.545288
2011-01-01 09:00:00 0.692169 0.261194 0.400704 0.118335
In [11]: df.tail()
Out[11]:
0 1 2 3
2011-01-03 19:00:00 0.247211 0.539330 0.734206 0.781125
2011-01-03 20:00:00 0.278550 0.534943 0.804949 0.137291
2011-01-03 21:00:00 0.602246 0.108791 0.987120 0.455887
2011-01-03 22:00:00 0.003097 0.436435 0.987877 0.046066
2011-01-03 23:00:00 0.604916 0.670532 0.513927 0.610775
In [12]: df.mean()
Out[12]:
0 0.495307
1 0.477509
2 0.562590
3 0.447997
dtype: float64
In [13]: new_df = pd.DataFrame(df.mean().to_dict(),index=[df.index.values[-1]])
In [14]: new_df
Out[14]:
0 1 2 3
2011-01-03 23:00:00 0.495307 0.477509 0.56259 0.447997
In [15]: new_df.rename(columns=lambda c: "mean_"+str(c))
Out[15]:
mean_0 mean_1 mean_2 mean_3
2011-01-03 23:00:00 0.495307 0.477509 0.56259 0.447997
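An alternative way to get the same single-row result is to turn the mean Series into a frame and transpose it. This is only a sketch on made-up random data (the seed, shape, and the name `mean_df` are assumptions for illustration); `add_prefix` formats each label as a string, so integer column labels are not a problem.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data: 72 rows, 4 numeric columns labelled 0..3.
idx = pd.date_range("2011-01-01", periods=72, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((72, 4)), index=idx)

# Series of column means -> one-column frame -> transpose to one row.
mean_df = df.mean().to_frame().T

# Index the single row by the last timestamp of the original frame.
mean_df.index = [df.index[-1]]

# Prefix every column label; integer labels are converted to strings.
mean_df = mean_df.add_prefix("mean_")
print(mean_df)
```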
I am trying to add a holiday column for France to a DataFrame using the workalendar package, but it gives me this error:
'Series' object has no attribute 'weekday'
Below is my code:
import pandas as pd
from workalendar.europe import France
df1 = pd.read_csv(r'C:\Users\ABC.csv')  # raw string, so '\U' is not read as an escape
df1['Date'] = pd.to_datetime(df1['Date'], format= '%d/%m/%Y %H:%M:%S')
df1['Date1'] = df1.Date.dt.date
dr = df1['Date1']
cal = France()
df1['Holiday'] = cal.is_working_day(df1['Date1'])
df1.head()
The original data in the file looks like this:
Date Value
17/10/2012 19:00:00 0
17/10/2012 20:00:00 0.1
17/10/2012 21:00:00 0
17/10/2012 22:00:00 0
17/10/2012 23:00:00 0
18/10/2012 00:00:00 0
18/10/2012 01:00:00 0
18/10/2012 02:00:00 0
18/10/2012 03:00:00 0.1
18/10/2012 04:00:00 0
18/10/2012 05:00:00 0
18/10/2012 06:00:00 0
18/10/2012 07:00:00 0
18/10/2012 08:00:00 0.2
18/10/2012 09:00:00 0.5
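Try this:
df1['Holiday'] = df1['Date'].apply(lambda x: cal.is_working_day(x.to_pydatetime()))
is_working_day expects a single date, not a Series, so apply it to each element and convert each pandas Timestamp to a Python datetime.
By the way, a working day is not simply the opposite of a holiday (weekends are non-working days too), so the column name Holiday may be misleading.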
See also: Converting between datetime, Timestamp and datetime64
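To see why the original call fails, here is a small sketch that does not need workalendar. The function `is_weekday` is a hypothetical stand-in for any check that calls `.weekday()` on a single date, as `cal.is_working_day` appears to do judging by the error message; the two sample dates are made up.

```python
import pandas as pd


def is_weekday(day):
    """Hypothetical scalar-only check: expects one date, not a Series."""
    return day.weekday() < 5


df1 = pd.DataFrame({
    "Date": pd.to_datetime(
        ["17/10/2012 19:00:00", "20/10/2012 08:00:00"],
        format="%d/%m/%Y %H:%M:%S",
    )
})

# Passing the whole column fails: a Series has no .weekday attribute.
try:
    is_weekday(df1["Date"])
    series_call_failed = False
except AttributeError:
    series_call_failed = True

# Applying the function element-wise works: each element is a Timestamp,
# converted here to a plain Python datetime.
df1["Weekday"] = df1["Date"].apply(lambda ts: is_weekday(ts.to_pydatetime()))
print(df1)
```

17 October 2012 was a Wednesday and 20 October 2012 a Saturday, so the new column is True for the first row and False for the second.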
Consider the simple example below:
date = pd.date_range('1/1/2011', periods=5, freq='H')
df = pd.DataFrame({'cat' : ['A', 'A', 'A', 'B',
'B']}, index = date)
df
Out[278]:
cat
2011-01-01 00:00:00 A
2011-01-01 01:00:00 A
2011-01-01 02:00:00 A
2011-01-01 03:00:00 B
2011-01-01 04:00:00 B
I want to create a variable that contains the lagged/lead value of the index. That is something like:
df['index_shifted']=df.index.shift(1)
So, for instance, at time 2011-01-01 01:00:00 I expect the variable index_shifted to be 2011-01-01 00:00:00
How can I do that?
Thanks!
I think you need Index.shift with -1:
df['index_shifted']= df.index.shift(-1)
print (df)
cat index_shifted
2011-01-01 00:00:00 A 2010-12-31 23:00:00
2011-01-01 01:00:00 A 2011-01-01 00:00:00
2011-01-01 02:00:00 A 2011-01-01 01:00:00
2011-01-01 03:00:00 B 2011-01-01 02:00:00
2011-01-01 04:00:00 B 2011-01-01 03:00:00
For me it works without freq, but maybe it is necessary in real data:
df['index_shifted']= df.index.shift(-1, freq='H')
print (df)
cat index_shifted
2011-01-01 00:00:00 A 2010-12-31 23:00:00
2011-01-01 01:00:00 A 2011-01-01 00:00:00
2011-01-01 02:00:00 A 2011-01-01 01:00:00
2011-01-01 03:00:00 B 2011-01-01 02:00:00
2011-01-01 04:00:00 B 2011-01-01 03:00:00
EDIT:
If the freq of the DatetimeIndex is None, you need to pass freq to shift:
import pandas as pd
date = pd.date_range('1/1/2011', periods=5, freq='H').union(pd.date_range('5/1/2011', periods=5, freq='H'))
df = pd.DataFrame({'cat' : ['A', 'A', 'A', 'B',
'B','A', 'A', 'A', 'B',
'B']}, index = date)
print (df.index)
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 01:00:00',
'2011-01-01 02:00:00', '2011-01-01 03:00:00',
'2011-01-01 04:00:00', '2011-05-01 00:00:00',
'2011-05-01 01:00:00', '2011-05-01 02:00:00',
'2011-05-01 03:00:00', '2011-05-01 04:00:00'],
dtype='datetime64[ns]', freq=None)
df['index_shifted']= df.index.shift(-1, freq='H')
print (df)
cat index_shifted
2011-01-01 00:00:00 A 2010-12-31 23:00:00
2011-01-01 01:00:00 A 2011-01-01 00:00:00
2011-01-01 02:00:00 A 2011-01-01 01:00:00
2011-01-01 03:00:00 B 2011-01-01 02:00:00
2011-01-01 04:00:00 B 2011-01-01 03:00:00
2011-05-01 00:00:00 A 2011-04-30 23:00:00
2011-05-01 01:00:00 A 2011-05-01 00:00:00
2011-05-01 02:00:00 A 2011-05-01 01:00:00
2011-05-01 03:00:00 B 2011-05-01 02:00:00
2011-05-01 04:00:00 B 2011-05-01 03:00:00
What's wrong with df['index_shifted']=df.index.shift(-1)?
(Genuine question, not sure if I missed something)
This is an old question, but if your timestamps have gaps or you do not want to specify the frequency, and you are not dealing with time zones, the following will work. Note that Series.shift moves values by position, so shift(-1) gives the next row's timestamp; use shift(1) to get the previous row's timestamp instead:
df['index_shifted'] = pd.Series(df.index).shift(-1).values
If you are dealing with time zones, the following will work:
df['index_shifted'] = pd.to_datetime(pd.Series(df.index).shift(-1).values, utc=True).tz_convert('America/New_York')
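The two approaches in this thread answer slightly different questions, which shows up as soon as the index has a gap. The sketch below uses a made-up three-row index; the column names `minus_1h` and `prev_row` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical index with a gap between January and May.
idx = pd.to_datetime(
    ["2011-01-01 00:00", "2011-01-01 01:00", "2011-05-01 00:00"]
)
df = pd.DataFrame({"cat": ["A", "A", "B"]}, index=idx)

# Index.shift with a freq moves every timestamp by a fixed offset,
# regardless of which rows actually exist.
df["minus_1h"] = df.index.shift(-1, freq=pd.Timedelta(hours=1))

# Series.shift moves values by position, so each row receives the
# timestamp of the row before it (NaT for the first row).
df["prev_row"] = pd.Series(df.index).shift(1).values

print(df)
```

For the May row, `minus_1h` is 2011-04-30 23:00:00 (a timestamp that is not in the data), while `prev_row` is 2011-01-01 01:00:00, the last observation before the gap.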
I am trying to find the hour of max demand every day in my demand time series.
I have created a dataframe that looks like this:
power
2011-01-01 00:00:00 1015.70
2011-01-01 01:00:00 1015.70
2011-01-01 02:00:00 1010.30
2011-01-01 03:00:00 1010.90
2011-01-01 04:00:00 1021.10
2011-01-01 05:00:00 1046.00
2011-01-01 06:00:00 1054.60
...
and a grouped series to find the max value from each day using .max()
grouped = df.groupby(pd.TimeGrouper('D'))
grouped['power'].max()
OUTPUT
2011-01-01 1367.30
2011-01-02 1381.90
2011-01-03 1289.00
2011-01-04 1323.50
2011-01-05 1372.70
2011-01-06 1314.40
2011-01-07 1310.60
...
However I need the hour of the max value also. So something like:
2011-01-01 18 1367.30
2011-01-02 5 1381.90
2011-01-03 22 1289.00
2011-01-04 10 1323.50
...
I have tried using idxmax(), but I keep getting a ValueError.
UPDATE from 2018-09-19:
FutureWarning: pd.TimeGrouper is deprecated and will be removed;
Please use pd.Grouper(freq=...)
solution:
In [295]: df.loc[df.groupby(pd.Grouper(freq='D')).idxmax().iloc[:, 0]]
Out[295]:
power
2011-01-01 06:00:00 1054.6
2011-01-02 06:00:00 2054.6
Old answer (pd.TimeGrouper was removed in pandas 1.0, so this only runs on older versions):
try this:
In [376]: df.loc[df.groupby(pd.TimeGrouper('D')).idxmax().iloc[:, 0]]
Out[376]:
power
2011-01-01 06:00:00 1054.6
2011-01-02 06:00:00 2054.6
data:
In [377]: df
Out[377]:
power
2011-01-01 00:00:00 1015.7
2011-01-01 01:00:00 1015.7
2011-01-01 02:00:00 1010.3
2011-01-01 03:00:00 1010.9
2011-01-01 04:00:00 1021.1
2011-01-01 05:00:00 1046.0
2011-01-01 06:00:00 1054.6
2011-01-02 00:00:00 2015.7
2011-01-02 01:00:00 2015.7
2011-01-02 02:00:00 2010.3
2011-01-02 03:00:00 2010.9
2011-01-02 04:00:00 2021.1
2011-01-02 05:00:00 2046.0
2011-01-02 06:00:00 2054.6
You can also group by your index date with df.groupby(df.index.date) and then use idxmax() to find the index of the max value in the power column:
df.groupby(df.index.date)["power"].idxmax()
If you want the power values too, use .loc:
df.loc[df.groupby(df.index.date)["power"].idxmax()]
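To get the date, the hour, and the peak value side by side, as the question asks, the result of idxmax() can be expanded into separate columns. This is a sketch on made-up values; the names `peak_idx` and `peaks` and the output column `hour` are assumptions.

```python
import pandas as pd

# Hypothetical demand data covering two days.
idx = pd.to_datetime([
    "2011-01-01 00:00", "2011-01-01 01:00", "2011-01-01 02:00",
    "2011-01-02 00:00", "2011-01-02 01:00", "2011-01-02 02:00",
])
df = pd.DataFrame(
    {"power": [1015.7, 1010.3, 1054.6, 2015.7, 2054.6, 2010.3]},
    index=idx,
)

# Timestamp of the daily maximum, one entry per calendar date.
peak_idx = df.groupby(df.index.date)["power"].idxmax()

# Select those rows, then split each timestamp into date and hour.
peaks = df.loc[peak_idx].copy()
peaks["hour"] = peaks.index.hour
peaks.index = peaks.index.normalize()
peaks = peaks[["hour", "power"]]
print(peaks)
```

If several hours tie for the maximum within a day, idxmax() returns the first one.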