I have the following code:
import datetime

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import fxcmpy

con = fxcmpy.fxcmpy(config_file='fxcm.cfg')  # broker connection; credentials live in the config file
symbols = con.get_instruments()
ticker = 'NGAS'
start = datetime.datetime(2015,1,1)
end = datetime.datetime.today()
data = con.get_candles(ticker, period='m1', number=10000)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d %hh:%mm %s')
data['hour'] = data.index.hour
data['minute'] = data.index.minute
data produces the following:
bidopen bidclose bidhigh bidlow askopen askclose askhigh asklow tickqty hour minute
date
2019-12-05 07:00:00 2.4230 2.4280 2.4300 2.422 2.4305 2.4360 2.439 2.4295 47 7 0
2019-12-05 07:01:00 2.4280 2.4265 2.4270 2.426 2.4360 2.4340 2.436 2.4340 10 7 1
2019-12-05 07:02:00 2.4265 2.4295 2.4300 2.426 2.4340 2.4370 2.438 2.4340 35 7 2
2019-12-05 07:03:00 2.4295 2.4285 2.4300 2.428 2.4370 2.4360 2.438 2.4360 20 7 3
2019-12-05 07:04:00 2.4285 2.4350 2.4360 2.428 2.4360 2.4425 2.444 2.4360 50 7 4
... ... ... ... ... ... ... ... ... ... ... ...
2019-12-17 15:07:00 2.3335 2.3340 2.3345 2.332 2.3410 2.3415 2.342 2.3395 94 15 7
2019-12-17 15:08:00 2.3340 2.3345 2.3355 2.334 2.3415 2.3420 2.344 2.3415 22 15 8
2019-12-17 15:09:00 2.3345 2.3335 2.3345 2.332 2.3420 2.3410 2.342 2.3410 15 15 9
2019-12-17 15:10:00 2.3335 2.3325 2.3345 2.331 2.3410 2.3400 2.342 2.3390 72 15 10
2019-12-17 15:11:00 2.3325 2.3270 2.3325 2.326 2.3400 2.3345 2.340 2.3335 99 15 11
In the table above, hours start at 7 and end at 15. However, when I run the following code, hour starts at 0 and ends at 59. Why is that?
df = data.groupby(['hour', 'minute']).mean()
bidopen bidclose bidhigh bidlow askopen askclose askhigh asklow tickqty
hour minute
0 0 2.302786 2.303500 2.304286 2.302071 2.310571 2.311214 2.312000 2.310143 16.285714
1 2.294917 2.294333 2.295250 2.293583 2.302667 2.302000 2.303333 2.301333 14.500000
2 2.283000 2.283333 2.283833 2.282333 2.290667 2.290833 2.292000 2.290167 18.666667
3 2.298417 2.298833 2.299167 2.297833 2.305917 2.306333 2.307000 2.305917 14.833333
4 2.283583 2.284000 2.284250 2.283000 2.291083 2.291750 2.292167 2.291083 14.166667
... ... ... ... ... ... ... ... ... ... ...
23 55 2.285500 2.285800 2.286600 2.284700 2.293100 2.293400 2.294300 2.292600 10.400000
56 2.303800 2.304000 2.304600 2.303300 2.311400 2.311700 2.312500 2.311000 11.200000
57 2.268700 2.268400 2.268900 2.268100 2.276200 2.276100 2.276700 2.275900 5.800000
58 2.302857 2.303000 2.303286 2.302357 2.310571 2.310571 2.311214 2.310286 8.000000
59 2.321300 2.321000 2.321700 2.320400 2.328900 2.328900 2.329500 2.328700 8.400000
What I am trying to do is group the data by hour, where hour runs from 7 to 15, and then take the mean() of that, i.e. the mean() over all of hour 7 to hour 15.
--
Edit 1:
How can I set hour and minute as the index?
data.set_index('minute', inplace = True)
data.set_index('hour', inplace = True)
gives me an error
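Passing both columns in a single call avoids the problem, since a second set_index call replaces the first index instead of adding to it (a sketch on a toy frame with made-up values):

```python
import pandas as pd

data = pd.DataFrame({'hour': [7, 7, 8], 'minute': [0, 1, 0],
                     'bidopen': [2.423, 2.428, 2.430]})

# set_index with a list of columns builds a MultiIndex in one call;
# calling set_index twice would discard the first index instead.
data.set_index(['hour', 'minute'], inplace=True)
print(data.index.names)  # ['hour', 'minute']
```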
Perhaps data.index = pd.to_datetime(data.index, format ='%Y-%m-%d %hh:%mm %s') should be changed to data.index = pd.to_datetime(data.index, format ='%Y-%m-%d %H:%M:%S'), since %H, %M and %S are the directives for hour, minute and second!
The results you are seeing are correct:
The date of the first line is the 5th of December and the date of the last line is the 17th of December, so there are many lines in between where the hour of the day is after 3pm or before 7am.
Try data[data['hour']>15].head() to see some of the lines which are later in the day than 3pm.
updated:
To get the mean for the hours 7 - 15, first see the example code below:
df = pd.DataFrame()
df['hour']=np.array([15,12,10,6,4,19,15,12,10])
df['price']=np.array([1,2,3,4,5,6,7,8,9])
df[(df['hour']>=7)&(df['hour']<=15)].mean().price
which returns
5.0
or for mean by hour
df[(df['hour']>=7)&(df['hour']<=15)].groupby('hour').mean()
which returns
price
hour
10 6
12 5
15 4
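Applied to a frame shaped like the question's data (toy values here, not the real quotes), the same filter-then-groupby pattern gives the per-hour means for hours 7 through 15:

```python
import pandas as pd

# toy stand-in for the question's `data` frame
data = pd.DataFrame({'hour': [6, 7, 7, 15, 16],
                     'bidclose': [1.0, 2.0, 4.0, 6.0, 8.0]})

# keep only hours 7-15, then average per hour
hourly = data[data['hour'].between(7, 15)].groupby('hour').mean()
print(hourly)
#        bidclose
# hour
# 7           3.0
# 15          6.0
```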
First of all, what you're seeing is a multi-index. You're seeing hours ranging from 0 to 23 and minutes ranging from 0 to 59.
If you'd like the mean for each hour, you simply need:
data.groupby(['hour']).mean()
If you do choose to group by an additional quantity such as in data.groupby(['hour','minute']).mean() it may be helpful to call a .reset_index() to avoid the confusion of the multi-index.
(e.g. df = data.groupby(['hour','minute']).mean().reset_index())
%hh:%mm %s isn't a supported set of datetime format directives in Python, so instead of:
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d %hh:%mm %s')
Use:
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d %H:%M:%S')
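A quick check on one of the index values from the question shows the corrected directives parsing as expected:

```python
import pandas as pd

# %H, %M and %S parse hour, minute and second from strings
# shaped like the question's index values
ts = pd.to_datetime('2019-12-05 07:00:00', format='%Y-%m-%d %H:%M:%S')
print(ts)       # 2019-12-05 07:00:00
print(ts.hour)  # 7
```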
Related
I want to convert all rows of my DataFrame that contains hours and minutes into minutes only.
I have a dataframe that looks like this:
df=
time
0 8h30
1 14h07
2 08h30
3 7h50
4 8h0
5 8h15
6 6h15
I'm using the following method to convert:
df['time'] = pd.eval(
df['time'].replace(['h'], ['*60+'], regex=True))
Output
SyntaxError: invalid syntax
I think the error comes from the format of the hour; maybe pd.eval can't accept 08h30 or 8h0. How can I solve this problem?
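The suspicion is right, and the leading zero is the culprit: after the replace, a value like 08h30 becomes the expression 08*60+30, and Python 3 rejects integer literals with leading zeros. A quick sketch using the builtin eval (pd.eval hits the same parsing rule):

```python
# "8h30" -> "8*60+30" evaluates fine...
assert eval("8*60+30") == 510

# ...but "08h30" -> "08*60+30" is invalid: Python 3 forbids
# leading zeros in decimal integer literals.
try:
    eval("08*60+30")
except SyntaxError:
    print("SyntaxError, as in the question")
```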
Pandas can already handle such strings if the units are included in the string. While 14h07 can't be parsed (how should pandas know that 07 means minutes?), 14h07m can be converted to a Timedelta:
>>> pd.to_timedelta("14h07m")
Timedelta('0 days 14:07:00')
Given this dataframe:
d1 = pd.DataFrame(['8h30m', '14h07m', '08h30m', '8h0m'],
columns=['time'])
You can convert the time series into a Timedelta series with pd.to_timedelta:
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d1
time tm
0 8h30m 0 days 08:30:00
1 14h07m 0 days 14:07:00
2 08h30m 0 days 08:30:00
3 8h0m 0 days 08:00:00
To handle the missing minutes unit in the original data, just append m:
d1['tm'] = pd.to_timedelta(d1['time'] + 'm')
Once you have a Timedelta you can calculate hours and minutes.
The components of the values can be retrieved with Timedelta.components
>>> d1.tm.dt.components.hours
0 8
1 14
2 8
3 8
Name: hours, dtype: int64
To get the total in a single unit such as minutes, change the frequency accordingly:
>>> d1.tm.astype('timedelta64[m]')
0 510.0
1 847.0
2 510.0
3 480.0
Name: tm, dtype: float64
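Note that this kind of astype cast to 'timedelta64[m]' is no longer supported in recent pandas releases; dt.total_seconds() is a stable alternative (a sketch):

```python
import pandas as pd

# same values as above, as a Timedelta series
tm = pd.Series(pd.to_timedelta(['8h30m', '14h07m', '08h30m', '8h0m']))

# total_seconds() works across pandas versions; divide to get minutes
total_minutes = tm.dt.total_seconds() / 60
print(total_minutes.tolist())  # [510.0, 847.0, 510.0, 480.0]
```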
Bringing all the operations together:
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d2 = (d1.assign(h=d1.tm.dt.components.hours,
... m=d1.tm.dt.components.minutes,
... total_minutes=d1.tm.astype('timedelta64[m]')))
>>>
>>> d2
time tm h m total_minutes
0 8h30m 0 days 08:30:00 8 30 510.0
1 14h07m 0 days 14:07:00 14 7 847.0
2 08h30m 0 days 08:30:00 8 30 510.0
3 8h0m 0 days 08:00:00 8 0 480.0
To avoid having to trim leading zeros, an alternative approach:
df[['h', 'm']] = df['time'].str.split('h', expand=True).astype(int)
df['total_min'] = df['h']*60 + df['m']
Result:
time h m total_min
0 8h30 8 30 510
1 14h07 14 7 847
2 08h30 8 30 510
3 7h50 7 50 470
4 8h0 8 0 480
5 8h15 8 15 495
6 6h15 6 15 375
Just to give an alternative approach with much the same elements as above, you could do:
df = pd.DataFrame(data=["8h30", "14h07", "08h30", "7h50", "8h0 ", "8h15", "6h15"],
columns=["time"])
First split your column on the "h":
hm = df["time"].str.split("h", expand=True)
Then combine the columns again, but zero-pad the hours and minutes in order to make valid time strings:
df2 = hm[0].str.strip().str.zfill(2) + hm[1].str.strip().str.zfill(2)
Then convert the string column with proper values to a date time column:
df3 = pd.to_datetime(df2, format="%H%M")
Finally, calculate the number of minutes by subtracting a zero time (to obtain timedeltas) and dividing by a one-minute Timedelta:
zerotime = pd.to_datetime("0000", format="%H%M")
df['minutes'] = (df3 - zerotime) / pd.Timedelta(minutes=1)
The results look like:
time minutes
0 8h30 510.0
1 14h07 847.0
2 08h30 510.0
3 7h50 470.0
4 8h0 480.0
5 8h15 495.0
6 6h15 375.0
I want a scatter plot duration(mins) versus start time like this (which is a time of day, irrespective of what date it was on):
I have a CSV file commute.csv which looks like this:
date, prediction, start, stop, duration, duration(mins), Day of week
14/08/2015, , 08:02:00, 08:22:00, 00:20:00, 20, Fri
25/08/2015, , 18:16:00, 18:27:00, 00:11:00, 11, Tue
26/08/2015, , 08:26:00, 08:46:00, 00:20:00, 20, Wed
26/08/2015, , 18:28:00, 18:46:00, 00:18:00, 18, Wed
The full CSV file is here.
I can import the CSV file like so:
import pandas as pd
times = pd.read_csv('commute.csv', parse_dates=[[0, 2], [0, 3]], dayfirst=True)
times.head()
Out:
date_start date_stop prediction duration duration(mins) Day of week
0 2015-08-14 08:02:00 2015-08-14 08:22:00 NaN 00:20:00 20 Fri
1 2015-08-25 18:16:00 2015-08-25 18:27:00 NaN 00:11:00 11 Tue
2 2015-08-26 08:26:00 2015-08-26 08:46:00 NaN 00:20:00 20 Wed
3 2015-08-26 18:28:00 2015-08-26 18:46:00 NaN 00:18:00 18 Wed
4 2015-08-28 08:37:00 2015-08-28 08:52:00 NaN 00:15:00 15 Fri
I am now struggling to plot duration(mins) versus start time (without the date). Please help!
@jezrael has been a great help... one of the comments on issue 8113 proposes using a variant of df.plot(x=x, y=y, style="."). I tried it:
times.plot(x='start', y='duration(mins)', style='.')
However, it doesn't show the same as my intended plot: the output is incorrect because the X axis has been stretched so that each data point is the same distance apart in X:
Is there no way to plot against time?
I think there is a problem with using time values in a scatter graph - see issue 8113.
But you can use hour:
df['hours'] = df.date_start.dt.hour
print(df)
date_start date_stop prediction duration \
0 2015-08-14 08:02:00 2015-08-14 08:22:00 NaN 00:20:00
1 2015-08-25 18:16:00 2015-08-25 18:27:00 NaN 00:11:00
2 2015-08-26 08:26:00 2015-08-26 08:46:00 NaN 00:20:00
3 2015-08-26 18:28:00 2015-08-26 18:46:00 NaN 00:18:00
duration(mins) Dayofweek hours
0 20 Fri 8
1 11 Tue 18
2 20 Wed 8
3 18 Wed 18
df.plot.scatter(x='hours', y='duration(mins)')
Another solution with counting time in minutes:
df['time'] = df.date_start.dt.hour * 60 + df.date_start.dt.minute
print(df)
date_start date_stop prediction duration \
0 2015-08-14 08:02:00 2015-08-14 08:22:00 NaN 00:20:00
1 2015-08-25 18:16:00 2015-08-25 18:27:00 NaN 00:11:00
2 2015-08-26 08:26:00 2015-08-26 08:46:00 NaN 00:20:00
3 2015-08-26 18:28:00 2015-08-26 18:46:00 NaN 00:18:00
duration(mins) Dayofweek time
0 20 Fri 482
1 11 Tue 1096
2 20 Wed 506
3 18 Wed 1108
df.plot.scatter(x='time', y='duration(mins)')
To follow up, as this question is close to the top of the search results and it's difficult to put the necessary answer all in a comment:
To set the proper time tick labels along the horizontal axis for start-time granularity of minutes, you need to set the frequency of the tick labels and then convert to datetime.
This code sample has the horizontal axis datetime as the index of the DataFrame, although of course that could equally be a column rather than an index; notice that when it is a DatetimeIndex you access the minute & hour directly rather than through the dt attribute of a datetime column.
This code interprets the datetimes as UTC datetimes datetime.utcfromtimestamp(), see https://stackoverflow.com/a/44572082/437948 for a subtly different approach.
You could add handling of second granularity according to a similar theme.
from datetime import datetime

df = pd.DataFrame({'value': np.random.randint(0, 11, 6 * 24 * 7)},
                  index=pd.date_range(start='2018-10-03', freq='600s',
                                      periods=6 * 24 * 7))
df['time'] = 60 * df.index.hour + df.index.minute
f, a = plt.subplots(figsize=(20, 10))
df.plot.scatter(x='time', y='value', ax=a)
plt.xticks(np.arange(0, 25 * 60, 60))
a.set_xticklabels([datetime.utcfromtimestamp(ts * 60).strftime('%H:%M')
                   for ts in a.get_xticks()])
In the end, I wrote a function to turn hours, minutes and seconds into a floating point number of hours.
def to_hours(dt):
"""Return floating point number of hours through the day in `datetime` dt."""
return dt.hour + dt.minute / 60 + dt.second / 3600
# Unit test the to_hours() function
import datetime
dt = datetime.datetime(2010, 4, 23) # Dummy date for testing
assert to_hours(dt) == 0
assert to_hours(dt.replace(hour=1)) == 1
assert to_hours(dt.replace(hour=2, minute=30)) == 2.5
assert to_hours(dt.replace(minute=15)) == 0.25
assert to_hours(dt.replace(second=30)) == 30 / 3600
Then create a column of the floating point number of hours:
# Convert start times to hours through the day
times['start_hour'] = times['date_start'].map(to_hours)
The full example is in my Jupyter notebook.
I have a file, df, that I wish to take the delta of every 7 day period and reflect the timestamp for that particular period
df:
Date Value
10/15/2020 75
10/14/2020 70
10/13/2020 65
10/12/2020 60
10/11/2020 55
10/10/2020 50
10/9/2020 45
10/8/2020 40
10/7/2020 35
10/6/2020 30
10/5/2020 25
10/4/2020 20
10/3/2020 15
10/2/2020 10
10/1/2020 5
Desired Output:
10/15/2020 to 10/9/2020 is 7 days with the delta being: 75 - 45 = 30
10/9/2020 timestamp would be: 30 and so on
Date Value
10/9/2020 30
10/2/2020 30
This is what I am doing:
df = df['Delta']=df.iloc[:,6].sub(df.iloc[:,0]),Date=pd.Series(pd.date_range(pd.Timestamp('2020-10-15'), periods=7, freq='7d')))[['Delta','Date']]
I am also thinking I may be able to do this:
Edit I updated callDate to Date
for row in df.itertuples():
    Date = datetime.strptime(row.Date, "%m/%d/%y %I:%M %p")
    previousRecord = df['Date'].shift(-6).strptime(row.Date, "%m/%d/%y %I:%M %p")
    Delta = Date - previousRecord
Any suggestion is appreciated
Don't iterate through the dataframe. You can use a merge:
(df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('6D')),
on='Date')
.assign(Value = lambda x: x['Value_y']-x['Value_x'])
[['Date','Value']]
)
Output:
Date Value
0 2020-10-09 30
1 2020-10-08 30
2 2020-10-07 30
3 2020-10-06 30
4 2020-10-05 30
5 2020-10-04 30
6 2020-10-03 30
7 2020-10-02 30
8 2020-10-01 30
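The merge can be reproduced end to end on the question's data; note that the Date column must first hold real datetimes rather than strings (a sketch that rebuilds the frame inline):

```python
import pandas as pd

# rebuild the question's frame: dates 10/15 down to 10/1, values 75 down to 5
df = pd.DataFrame({'Date': pd.date_range('2020-10-01', '2020-10-15')[::-1],
                   'Value': range(75, 0, -5)})

# shift the right side back 6 days so each row pairs with the row
# 6 days later, then take the difference of the two values
out = (df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('6D')), on='Date')
         .assign(Value=lambda x: x['Value_y'] - x['Value_x'])
         [['Date', 'Value']])
print(out.head(2))
```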
The last block of code you wrote is the way I would do it. The only problem is in Delta = Date - previousRecord: there is nothing called Date here. You should instead be accessing the value associated with callDate.
I have a pandas dataframe with a date column
I'm trying to create a function and apply it to the dataframe to create a column that returns the number of days in the month/year specified
So far I have:
from calendar import monthrange

def dom(x):
    m = dfs["load_date"].dt.month
    y = dfs["load_date"].dt.year
    monthrange(y, m)
    days = monthrange[1]
    return days
This however does not work when I attempt to apply it to the date column.
Additionally, I would like to be able to identify whether or not it is the current month, and if so return the number of days up to the current date in that month as opposed to days in the entire month.
I am not sure of the best way to do this, all I can think of is to check the month/year against datetime's today and then use a delta
thanks in advance
For pt.1 of your question, you can cast to pd.Period and retrieve days_in_month:
import pandas as pd
# create a sample df:
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].apply(lambda t: pd.Period(t, freq='S').days_in_month)
# df['daysinmonths']
# 0 31
# 1 29
# 2 31
# ...
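As a side note, newer pandas versions expose this directly on the datetime accessor, which avoids the per-row apply (a sketch using the same sample frame):

```python
import pandas as pd

# month-end dates for 2020; Series.dt.days_in_month reads the value off directly
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].dt.days_in_month
print(df['daysinmonths'].head(3).tolist())  # [31, 29, 31]
```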
For pt.2, you can take the timestamp of 'now' and create a boolean mask for your date column, i.e. where its year/month is less than "now". Then calculate the cumsum of the daysinmonth column for the section where the mask returns True. Invert the order of that series to get the days until now.
now = pd.Timestamp('now')
m = (df['date'].dt.year <= now.year) & (df['date'].dt.month < now.month)
df['daysuntilnow'] = df['daysinmonths'][m].cumsum().iloc[::-1].reset_index(drop=True)
Update after comment: to get the elapsed days per month, you can do
df['dayselapsed'] = df['daysinmonths']
m = (df['date'].dt.year == now.year) & (df['date'].dt.month == now.month)
if m.any():
df.loc[m, 'dayselapsed'] = now.day
df.loc[(df['date'].dt.year >= now.year) & (df['date'].dt.month > now.month), 'dayselapsed'] = 0
output
df
Out[13]:
date daysinmonths daysuntilnow dayselapsed
0 2020-01-31 31 213.0 31
1 2020-02-29 29 182.0 29
2 2020-03-31 31 152.0 31
3 2020-04-30 30 121.0 30
4 2020-05-31 31 91.0 31
5 2020-06-30 30 60.0 30
6 2020-07-31 31 31.0 31
7 2020-08-31 31 NaN 27
8 2020-09-30 30 NaN 0
9 2020-10-31 31 NaN 0
10 2020-11-30 30 NaN 0
11 2020-12-31 31 NaN 0
What's the best way in Pandas to resample/group/etc by year, but instead of going by calendar years, calculate full years starting with the last date in the data?
Example data set
pd.DataFrame({
'MyDate': ['2017-02-01', '2017-07-05', '2017-08-26', '2017-09-03', '2018-02-04',
'2018-08-03', '2018-08-10', '2018-12-03', '2019-07-13', '2019-08-15'],
'MyValue': [100, 90, 80, 70, 60, 50, 40, 30, 20, 10]
})
MyDate MyValue
0 2017-02-01 100
1 2017-07-05 90
2 2017-08-26 80
3 2017-09-03 70
4 2018-02-04 60
5 2018-08-03 50
6 2018-08-10 40
7 2018-12-03 30
8 2019-07-13 20
9 2019-08-15 10
Example result
Last date is 2019-08-15, so I'd like to group by the last full year 2018-08-16 - 2019-08-15, then 2017-08-16 - 2018-08-15, etc.
Here getting the last result per such year:
MyDate MyValue
0 2017-07-05 90
1 2018-08-10 40
2 2019-08-15 10
You can subtract the last value to create year groups, then pass them to groupby and take GroupBy.last:
df['MyDate'] = pd.to_datetime(df['MyDate'])
s = (df['MyDate'].sub(df['MyDate'].iat[-1]).dt.days / 365.25).astype(int)
df = df.groupby(s).last().reset_index(drop=True)
print (df)
MyDate MyValue
0 2017-07-05 90
1 2018-08-10 40
2 2019-08-15 10
You first need to parse your dates to real date objects, like:
df['MyDate'] = pd.to_datetime(df['MyDate'])
Next we can perform a group by with a relativedelta from the python-dateutil package:
>>> from operator import attrgetter
>>> from dateutil.relativedelta import relativedelta
>>> df.groupby(df['MyDate'].apply(relativedelta, dt2=df['MyDate'].max()).apply(attrgetter('years'))).last()
MyDate MyValue
MyDate
-2 2017-07-05 90
-1 2018-08-10 40
0 2019-08-15 10
One way is to use pd.cut, specifying the bins with pd.offsets.DateOffset to get calendar year separation.
import numpy as np
import pandas as pd
df['MyDate'] = pd.to_datetime(df['MyDate'])
N = int(np.ceil((df.MyDate.max()-df.MyDate.min())/np.timedelta64(1, 'Y')))+1
bins = [df.MyDate.max()-pd.offsets.DateOffset(years=y) for y in range(N)][::-1]
df.groupby(pd.cut(df.MyDate, bins)).last()
# MyDate MyValue
#MyDate
#(2016-08-15, 2017-08-15] 2017-07-05 90
#(2017-08-15, 2018-08-15] 2018-08-10 40
#(2018-08-15, 2019-08-15] 2019-08-15 10