Remove 'seconds' and 'minutes' from a Pandas dataframe column - python

Given a dataframe like:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {'Date': pd.date_range('1/1/2011', periods=5, freq='3675S'),
     'Num': np.random.rand(5)})
                 Date       Num
0 2011-01-01 00:00:00  0.580997
1 2011-01-01 01:01:15  0.407332
2 2011-01-01 02:02:30  0.786035
3 2011-01-01 03:03:45  0.821792
4 2011-01-01 04:05:00  0.807869
I would like to remove the 'minutes' and 'seconds' information.
The following (mostly stolen from: How to remove the 'seconds' of Pandas dataframe index?) works okay,
df = df.assign(Date=lambda x: pd.to_datetime(x['Date'].dt.strftime('%Y-%m-%d %H')))
                 Date       Num
0 2011-01-01 00:00:00  0.580997
1 2011-01-01 01:00:00  0.407332
2 2011-01-01 02:00:00  0.786035
3 2011-01-01 03:00:00  0.821792
4 2011-01-01 04:00:00  0.807869
but it feels strange to convert a datetime to a string then back to a datetime. Is there a way to do this more directly?

dt.round
This is how it should be done... use dt.round
df.assign(Date=df.Date.dt.round('H'))
                 Date       Num
0 2011-01-01 00:00:00  0.577957
1 2011-01-01 01:00:00  0.995748
2 2011-01-01 02:00:00  0.864013
3 2011-01-01 03:00:00  0.468762
4 2011-01-01 04:00:00  0.866827
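One caveat worth knowing: dt.round snaps to the nearest hour, while dt.floor truncates, which is what "remove the minutes and seconds" literally asks for. On the sample data they agree (every timestamp is under the half hour), but they diverge past it. A small sketch with illustrative timestamps:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2011-01-01 01:01:15',
                               '2011-01-01 01:45:00']))

# floor truncates to the hour; round snaps to the nearest hour
floored = ts.dt.floor('H')   # both -> 01:00:00
rounded = ts.dt.round('H')   # 01:00:00 and 02:00:00
print(floored.tolist())
print(rounded.tolist())
```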
OLD ANSWER
One approach is to set the index and use resample
df.set_index('Date').resample('H').last().reset_index()
                 Date       Num
0 2011-01-01 00:00:00  0.577957
1 2011-01-01 01:00:00  0.995748
2 2011-01-01 02:00:00  0.864013
3 2011-01-01 03:00:00  0.468762
4 2011-01-01 04:00:00  0.866827
Another alternative is to rebuild the timestamp from its date and hour components
df.assign(
    Date=pd.to_datetime(df.Date.dt.date) +
         pd.to_timedelta(df.Date.dt.hour, unit='H'))
                 Date       Num
0 2011-01-01 00:00:00  0.577957
1 2011-01-01 01:00:00  0.995748
2 2011-01-01 02:00:00  0.864013
3 2011-01-01 03:00:00  0.468762
4 2011-01-01 04:00:00  0.866827

Another solution could be:
from datetime import datetime

df.Date = pd.to_datetime(df.Date)
df.Date = df.Date.apply(lambda x: datetime(x.year, x.month, x.day, x.hour))

Related

datetime difference between dates

I have a df like so:
             firstdate           seconddate
0  2011-01-01 13:00:00  2011-01-01 13:00:00
1  2011-01-02 14:00:00  2011-01-01 11:00:00
2  2011-01-02 16:00:00  2011-01-02 13:00:00
3  2011-01-04 12:00:00  2011-01-03 15:00:00
...
Seconddate is always before firstdate. I want to compute the difference between firstdate and seconddate in number of days and make this a column: if firstdate and seconddate are the same day, the difference is 0; if seconddate is the day before firstdate, the difference is 1; and so on, up to a week. How would I do this?
df['firstdate'] = pd.to_datetime(df['firstdate'])
df['seconddate'] = pd.to_datetime(df['seconddate'])
df['diff'] = (df['firstdate'] - df['seconddate']).dt.days
This will add a column with the diff. You can then drop or keep rows based on it; use bracket access, since df.diff is a built-in DataFrame method:
df.drop(df[df['diff'] < 0].index)
# or
df = df[df['diff'] >= 0]
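One subtlety: .dt.days on the raw difference counts whole 24-hour spans, so 2011-01-04 12:00 minus 2011-01-03 15:00 gives 0 even though the dates are a calendar day apart, which is the count the question describes. Normalizing both columns to midnight first gives the calendar-day difference; a sketch on the sample rows:

```python
import pandas as pd

df = pd.DataFrame({
    'firstdate':  pd.to_datetime(['2011-01-01 13:00:00', '2011-01-02 14:00:00',
                                  '2011-01-02 16:00:00', '2011-01-04 12:00:00']),
    'seconddate': pd.to_datetime(['2011-01-01 13:00:00', '2011-01-01 11:00:00',
                                  '2011-01-02 13:00:00', '2011-01-03 15:00:00'])})

# calendar-day difference: strip the time-of-day before subtracting
df['diff'] = (df['firstdate'].dt.normalize()
              - df['seconddate'].dt.normalize()).dt.days
print(df['diff'].tolist())  # [0, 1, 0, 1]
```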

get mean for all columns in a dataframe and create a new dataframe

I have a dataframe with only numeric values and I want to calculate the mean for every column and create a new dataframe.
The original dataframe is indexed by a datetimefield. The new dataframe should be indexed by the same field as original dataframe with a value equal to last row index of original dataframe.
Code so far
mean_series = df.mean()
df_mean = pd.DataFrame(mean_series)
df_mean.rename(columns=lambda x: 'mean_' + x, inplace=True)
but this gives an error
df_mean.rename(columns=lambda x: 'std_mean_'+ x, inplace=True)
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S21') dtype('S21') dtype('S21')
Your question implies that you want a new DataFrame with a single row.
In [10]: df.head(10)
Out[10]:
0 1 2 3
2011-01-01 00:00:00 0.182481 0.523784 0.718124 0.063792
2011-01-01 01:00:00 0.321362 0.404686 0.481889 0.524521
2011-01-01 02:00:00 0.514426 0.735809 0.433758 0.392824
2011-01-01 03:00:00 0.616802 0.149099 0.217199 0.155990
2011-01-01 04:00:00 0.525465 0.439633 0.641974 0.270364
2011-01-01 05:00:00 0.749662 0.151958 0.200913 0.219916
2011-01-01 06:00:00 0.665164 0.396595 0.980862 0.560119
2011-01-01 07:00:00 0.797803 0.377273 0.273724 0.220965
2011-01-01 08:00:00 0.651989 0.553929 0.769008 0.545288
2011-01-01 09:00:00 0.692169 0.261194 0.400704 0.118335
In [11]: df.tail()
Out[11]:
0 1 2 3
2011-01-03 19:00:00 0.247211 0.539330 0.734206 0.781125
2011-01-03 20:00:00 0.278550 0.534943 0.804949 0.137291
2011-01-03 21:00:00 0.602246 0.108791 0.987120 0.455887
2011-01-03 22:00:00 0.003097 0.436435 0.987877 0.046066
2011-01-03 23:00:00 0.604916 0.670532 0.513927 0.610775
In [12]: df.mean()
Out[12]:
0 0.495307
1 0.477509
2 0.562590
3 0.447997
dtype: float64
In [13]: new_df = pd.DataFrame(df.mean().to_dict(),index=[df.index.values[-1]])
In [14]: new_df
Out[14]:
0 1 2 3
2011-01-03 23:00:00 0.495307 0.477509 0.56259 0.447997
In [15]: new_df.rename(columns=lambda c: "mean_"+str(c))
Out[15]:
mean_0 mean_1 mean_2 mean_3
2011-01-03 23:00:00 0.495307 0.477509 0.56259 0.447997
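A more compact route to the same one-row frame is .to_frame().T plus a rename; a sketch on a small synthetic frame with the same illustrative integer column labels:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2011-01-01', periods=4, freq='H')
df = pd.DataFrame(np.arange(16.0).reshape(4, 4), index=idx)

# one-row DataFrame of column means, indexed by the last original timestamp
new_df = df.mean().to_frame().T.rename(columns=lambda c: 'mean_' + str(c))
new_df.index = [df.index[-1]]
print(new_df)
```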

Adding holidays columns in a Dataframe in Python

I am trying to add holidays column for France in a Dataframe by using workalendar package but it gives me an error of
'Series' object has no attribute 'weekday'
Below is my code;
from workalendar.europe import France
df1 = pd.read_csv(r'C:\Users\ABC.csv')
df1['Date'] = pd.to_datetime(df1['Date'], format= '%d/%m/%Y %H:%M:%S')
df1['Date1'] = df1.Date.dt.date
dr = df1['Date1']
cal = France()
df1['Holiday'] = cal.is_working_day(df1['Date1'])
df1.head()
The original data in the file looks like this;
Date Value
17/10/2012 19:00:00 0
17/10/2012 20:00:00 0.1
17/10/2012 21:00:00 0
17/10/2012 22:00:00 0
17/10/2012 23:00:00 0
18/10/2012 00:00:00 0
18/10/2012 01:00:00 0
18/10/2012 02:00:00 0
18/10/2012 03:00:00 0.1
18/10/2012 04:00:00 0
18/10/2012 05:00:00 0
18/10/2012 06:00:00 0
18/10/2012 07:00:00 0
18/10/2012 08:00:00 0.2
18/10/2012 09:00:00 0.5
Try this.
df1['Holiday'] = df1.Date.apply(lambda x: cal.is_working_day(x.to_pydatetime()))
You have to convert each pandas Timestamp to a plain Python datetime and apply the check one element at a time, since is_working_day does not accept a whole Series.
BTW, note that is_working_day returns True on working days, so it is effectively the opposite of a holiday flag...
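The crux is that the calendar method expects one date at a time, not a whole Series, so it has to be applied element-wise. A pandas-only sketch of that pattern (a trivial weekday check stands in for workalendar's is_working_day, which may not be installed):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['17/10/2012 19:00:00', '18/10/2012 08:00:00']),
                       format='%d/%m/%Y %H:%M:%S')

# each x is a Timestamp; x.date() yields the plain datetime.date
# that workalendar's calendar methods expect
is_working = dates.apply(lambda x: x.date().weekday() < 5)  # stand-in predicate
print(is_working.tolist())  # [True, True]
```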
Converting between datetime, Timestamp and datetime64

how to get the shifted index value of a dataframe in Pandas?

Consider the simple example below:
date = pd.date_range('1/1/2011', periods=5, freq='H')
df = pd.DataFrame({'cat': ['A', 'A', 'A', 'B', 'B']}, index=date)
df
Out[278]:
cat
2011-01-01 00:00:00 A
2011-01-01 01:00:00 A
2011-01-01 02:00:00 A
2011-01-01 03:00:00 B
2011-01-01 04:00:00 B
I want to create a variable that contains the lagged/lead value of the index. That is something like:
df['index_shifted']=df.index.shift(1)
So, for instance, at time 2011-01-01 01:00:00 I expect the variable index_shifted to be 2011-01-01 00:00:00
How can I do that?
Thanks!
I think you need Index.shift with -1:
df['index_shifted']= df.index.shift(-1)
print (df)
cat index_shifted
2011-01-01 00:00:00 A 2010-12-31 23:00:00
2011-01-01 01:00:00 A 2011-01-01 00:00:00
2011-01-01 02:00:00 A 2011-01-01 01:00:00
2011-01-01 03:00:00 B 2011-01-01 02:00:00
2011-01-01 04:00:00 B 2011-01-01 03:00:00
For me it works without freq, but maybe it is necessary in real data:
df['index_shifted']= df.index.shift(-1, freq='H')
print (df)
cat index_shifted
2011-01-01 00:00:00 A 2010-12-31 23:00:00
2011-01-01 01:00:00 A 2011-01-01 00:00:00
2011-01-01 02:00:00 A 2011-01-01 01:00:00
2011-01-01 03:00:00 B 2011-01-01 02:00:00
2011-01-01 04:00:00 B 2011-01-01 03:00:00
EDIT:
If the freq of the DatetimeIndex is None, you need to add freq to shift:
import pandas as pd
date = pd.date_range('1/1/2011', periods=5, freq='H').union(
    pd.date_range('5/1/2011', periods=5, freq='H'))
df = pd.DataFrame({'cat': ['A', 'A', 'A', 'B', 'B',
                           'A', 'A', 'A', 'B', 'B']}, index=date)
print (df.index)
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 01:00:00',
'2011-01-01 02:00:00', '2011-01-01 03:00:00',
'2011-01-01 04:00:00', '2011-05-01 00:00:00',
'2011-05-01 01:00:00', '2011-05-01 02:00:00',
'2011-05-01 03:00:00', '2011-05-01 04:00:00'],
dtype='datetime64[ns]', freq=None)
df['index_shifted']= df.index.shift(-1, freq='H')
print (df)
cat index_shifted
2011-01-01 00:00:00 A 2010-12-31 23:00:00
2011-01-01 01:00:00 A 2011-01-01 00:00:00
2011-01-01 02:00:00 A 2011-01-01 01:00:00
2011-01-01 03:00:00 B 2011-01-01 02:00:00
2011-01-01 04:00:00 B 2011-01-01 03:00:00
2011-05-01 00:00:00 A 2011-04-30 23:00:00
2011-05-01 01:00:00 A 2011-05-01 00:00:00
2011-05-01 02:00:00 A 2011-05-01 01:00:00
2011-05-01 03:00:00 B 2011-05-01 02:00:00
2011-05-01 04:00:00 B 2011-05-01 03:00:00
What's wrong with df['index_shifted']=df.index.shift(-1)?
(Genuine question, not sure if I missed something)
This is an old question, but if your timestamps have gaps or you do not want to specify the frequency, AND you are not dealing with timezones, the following will work:
df['index_shifted'] = pd.Series(df.index).shift(-1).values
If you are dealing with Timezones the following will work:
df['index_shifted'] = pd.to_datetime(pd.Series(df.index).shift(-1).values, utc=True).tz_convert('America/New_York')
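A small sketch of the positional-shift idea on an index with a gap. Note the direction: Series.shift(1) carries the previous row's timestamp down (the lag the question asks for, even across gaps), whereas Index.shift(-1, freq='H') subtracts an hour from every label; the two only coincide on a regular hourly index:

```python
import pandas as pd

idx = pd.DatetimeIndex(['2011-01-01 00:00', '2011-01-01 01:00',
                        '2011-05-01 00:00'])
df = pd.DataFrame({'cat': ['A', 'A', 'B']}, index=idx)

# .values drops the temporary RangeIndex so assignment aligns by position
df['index_shifted'] = pd.Series(df.index).shift(1).values
print(df)
```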

Finding hour of daily max using Pandas in Python

I am trying to find the hour of max demand every day in my demand time series.
I have created a dataframe that looks like..
power
2011-01-01 00:00:00 1015.70
2011-01-01 01:00:00 1015.70
2011-01-01 02:00:00 1010.30
2011-01-01 03:00:00 1010.90
2011-01-01 04:00:00 1021.10
2011-01-01 05:00:00 1046.00
2011-01-01 06:00:00 1054.60
...
and a grouped series to find the max value from each day using .max()
grouped = df.groupby(pd.TimeGrouper('D'))
grouped['power'].max()
OUTPUT
2011-01-01 1367.30
2011-01-02 1381.90
2011-01-03 1289.00
2011-01-04 1323.50
2011-01-05 1372.70
2011-01-06 1314.40
2011-01-07 1310.60
...
However I need the hour of the max value also. So something like:
2011-01-01 18 1367.30
2011-01-02 5 1381.90
2011-01-03 22 1289.00
2011-01-04 10 1323.50
...
I have tried using idxmax() but I keep getting a ValueError
UPDATE from 2018-09-19:
FutureWarning: pd.TimeGrouper is deprecated and will be removed;
Please use pd.Grouper(freq=...)
solution:
In [295]: df.loc[df.groupby(pd.Grouper(freq='D')).idxmax().iloc[:, 0]]
Out[295]:
power
2011-01-01 06:00:00 1054.6
2011-01-02 06:00:00 2054.6
Old answer:
try this:
In [376]: df.loc[df.groupby(pd.TimeGrouper('D')).idxmax().iloc[:, 0]]
Out[376]:
power
2011-01-01 06:00:00 1054.6
2011-01-02 06:00:00 2054.6
data:
In [377]: df
Out[377]:
power
2011-01-01 00:00:00 1015.7
2011-01-01 01:00:00 1015.7
2011-01-01 02:00:00 1010.3
2011-01-01 03:00:00 1010.9
2011-01-01 04:00:00 1021.1
2011-01-01 05:00:00 1046.0
2011-01-01 06:00:00 1054.6
2011-01-02 00:00:00 2015.7
2011-01-02 01:00:00 2015.7
2011-01-02 02:00:00 2010.3
2011-01-02 03:00:00 2010.9
2011-01-02 04:00:00 2021.1
2011-01-02 05:00:00 2046.0
2011-01-02 06:00:00 2054.6
You can also group by your index date with df.groupby(df.index.date) and then use idxmax() to find the index of the max value in the power column:
df.groupby(df.index.date)["power"].idxmax()
If you want the power values too, use .loc:
df.loc[df.groupby(df.index.date)["power"].idxmax()]
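To get the date / hour / value layout the question shows, the idxmax rows can be reshaped afterwards; a sketch on synthetic hourly data (the power values are illustrative):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2011-01-01', periods=48, freq='H')
df = pd.DataFrame({'power': 1000.0 + np.arange(48) % 24}, index=idx)

# row holding each day's maximum, then split into date / hour / value
max_rows = df.loc[df.groupby(df.index.date)['power'].idxmax()]
out = pd.DataFrame({'hour': max_rows.index.hour,
                    'power': max_rows['power'].values},
                   index=pd.Index(max_rows.index.date, name='date'))
print(out)
```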
