I'm working with pandas and want to create a month-long custom date range where the week starts on Sunday night at 6pm and ends Friday afternoon at 4pm. And each day has 22 hours, so for example Sunday at 6pm to Monday at 4pm, Monday 6pm to Tuesday 4pm, etc.
I tried day_range = pd.date_range(datetime(2016,9,12,18), datetime.now(), freq='H'), but that always gives me 24-hour days.
Any suggestions?
You need Custom Business Hour with date_range:
cbh = pd.offsets.CustomBusinessHour(start='06:00',
                                    end='16:00',
                                    weekmask='Mon Tue Wed Thu Fri Sat')
print (cbh)
<CustomBusinessHour: CBH=06:00-16:00>
day_range = pd.date_range(datetime(2016,9,12,18), datetime.now(), freq=cbh)
print (day_range)
DatetimeIndex(['2016-09-13 06:00:00', '2016-09-13 07:00:00',
'2016-09-13 08:00:00', '2016-09-13 09:00:00',
'2016-09-13 10:00:00', '2016-09-13 11:00:00',
'2016-09-13 12:00:00', '2016-09-13 13:00:00',
'2016-09-13 14:00:00', '2016-09-13 15:00:00',
...
'2016-10-11 08:00:00', '2016-10-11 09:00:00',
'2016-10-11 10:00:00', '2016-10-11 11:00:00',
'2016-10-11 12:00:00', '2016-10-11 13:00:00',
'2016-10-11 14:00:00', '2016-10-11 15:00:00',
'2016-10-12 06:00:00', '2016-10-12 07:00:00'],
dtype='datetime64[ns]', length=252, freq='CBH')
Test - it omits Sunday:
day_range = pd.date_range(datetime(2016,9,12,18), datetime.now(), freq=cbh)[45:]
print (day_range)
DatetimeIndex(['2016-09-17 11:00:00', '2016-09-17 12:00:00',
'2016-09-17 13:00:00', '2016-09-17 14:00:00',
'2016-09-17 15:00:00', '2016-09-19 06:00:00',
'2016-09-19 07:00:00', '2016-09-19 08:00:00',
'2016-09-19 09:00:00', '2016-09-19 10:00:00',
...
'2016-10-11 08:00:00', '2016-10-11 09:00:00',
'2016-10-11 10:00:00', '2016-10-11 11:00:00',
'2016-10-11 12:00:00', '2016-10-11 13:00:00',
'2016-10-11 14:00:00', '2016-10-11 15:00:00',
'2016-10-12 06:00:00', '2016-10-12 07:00:00'],
dtype='datetime64[ns]', length=207, freq='CBH')
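For the 6pm-to-4pm window from the question, the same approach should work with hours that cross midnight; a minimal sketch, assuming a pandas version that accepts end < start in business-hour offsets and a Sun-Thu weekmask so each 22-hour "day" starts on the previous evening (not tested output):
from datetime import datetime
import pandas as pd

# Sketch: 22-hour "days" running 18:00 -> 16:00 the next day, anchored on Sun-Thu evenings.
# The overnight start/end and the weekmask are assumptions based on the question.
cbh = pd.offsets.CustomBusinessHour(start='18:00', end='16:00',
                                    weekmask='Sun Mon Tue Wed Thu')

day_range = pd.date_range(datetime(2016, 9, 12, 18), datetime.now(), freq=cbh)
print(day_range[:8])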
I have a dataframe with multiple rows and a date column that contains both date and time. The time is not incremented by the same amount in each row, so for each row I want to calculate the time difference, in seconds, between the current and the previous date.
import pandas as pd
data = pd.date_range('1/1/2011', periods = 10, freq ='H')
In the above snippet the time difference after each step is 1 hour, which means 3600 seconds, so I want a list of tuples having [(<prev date time>, <current_datetime>, <time_difference>),.....].
I want a list of tuples having [(prev date time, current_datetime, time_difference),.....]
In this case, use list with zip and compute the time difference with total_seconds:
data = pd.date_range("1/1/2011", periods = 10, freq ="H")
L = list(zip(data.shift(),                            # <- times shifted forward by one period
             data,                                    # <- original times
             (data.shift() - data).total_seconds()))  # <- difference in seconds
NB: If you are working with a DataFrame, replace data with df["date_column"]; since that is a Series, the shift is positional and the difference needs .dt.total_seconds() (see the sketch after the output below).
Output:
print(L)
[(Timestamp('2011-01-01 01:00:00', freq='H'),
Timestamp('2011-01-01 00:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 02:00:00', freq='H'),
Timestamp('2011-01-01 01:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 03:00:00', freq='H'),
Timestamp('2011-01-01 02:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 04:00:00', freq='H'),
Timestamp('2011-01-01 03:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 05:00:00', freq='H'),
Timestamp('2011-01-01 04:00:00', freq='H'),
3600.0),
...
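If data is a DataFrame column (i.e. a Series) rather than a DatetimeIndex, shift works positionally, so the pairs come out in (previous, current) order and the timedeltas need .dt.total_seconds(); a minimal sketch, where the column name date_column is only an assumption:
import pandas as pd

df = pd.DataFrame({"date_column": pd.date_range("1/1/2011", periods=10, freq="H")})
s = df["date_column"]

L = list(zip(s.shift(),                              # previous timestamp (NaT for the first row)
             s,                                      # current timestamp
             (s - s.shift()).dt.total_seconds()))    # difference in seconds (NaN for the first row)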
You can achieve this by using the diff function in pandas to calculate the time difference between consecutive rows in the date column. Here's an example:
df = pd.DataFrame({"date": pd.date_range("1/1/2011", periods=10, freq="H")})
# Calculate the time difference between consecutive rows in seconds
df["time_diff"] = df["date"].diff().dt.total_seconds()
# Create a list of tuples
result = [(df.iloc[i-1]["date"], row["date"], row["time_diff"]) for i, row in df[1:].iterrows()]
df:
date time_diff
0 2011-01-01 00:00:00 NaN
1 2011-01-01 01:00:00 3600.0
2 2011-01-01 02:00:00 3600.0
3 2011-01-01 03:00:00 3600.0
4 2011-01-01 04:00:00 3600.0
5 2011-01-01 05:00:00 3600.0
6 2011-01-01 06:00:00 3600.0
7 2011-01-01 07:00:00 3600.0
8 2011-01-01 08:00:00 3600.0
9 2011-01-01 09:00:00 3600.0
result:
[(Timestamp('2011-01-01 00:00:00'), Timestamp('2011-01-01 01:00:00'), 3600.0),
(Timestamp('2011-01-01 01:00:00'), Timestamp('2011-01-01 02:00:00'), 3600.0),
(Timestamp('2011-01-01 02:00:00'), Timestamp('2011-01-01 03:00:00'), 3600.0),
(Timestamp('2011-01-01 03:00:00'), Timestamp('2011-01-01 04:00:00'), 3600.0),
(Timestamp('2011-01-01 04:00:00'), Timestamp('2011-01-01 05:00:00'), 3600.0),
(Timestamp('2011-01-01 05:00:00'), Timestamp('2011-01-01 06:00:00'), 3600.0),
(Timestamp('2011-01-01 06:00:00'), Timestamp('2011-01-01 07:00:00'), 3600.0),
(Timestamp('2011-01-01 07:00:00'), Timestamp('2011-01-01 08:00:00'), 3600.0),
(Timestamp('2011-01-01 08:00:00'), Timestamp('2011-01-01 09:00:00'), 3600.0)]
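The same list can also be built without a Python-level loop by zipping the shifted column against the original (a sketch equivalent to the comprehension above; the iloc[1:] slices simply drop the first row, whose diff is NaN):
result = list(zip(df["date"].shift().iloc[1:],   # previous timestamp
                  df["date"].iloc[1:],           # current timestamp
                  df["time_diff"].iloc[1:]))     # difference in seconds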
It's possible to do this with a list comprehension. [:-1] is required because zipping with shift gives a list of 10 pairs, but there are only N-1 intervals between N points.
result = [(i[0],
           i[1],
           (i[1] - i[0]).total_seconds())
          for i in list(zip(data, data.shift(1)))[:-1]]
print(result)
[(Timestamp('2011-01-01 00:00:00', freq='H'),
Timestamp('2011-01-01 01:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 01:00:00', freq='H'),
Timestamp('2011-01-01 02:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 02:00:00', freq='H'),
Timestamp('2011-01-01 03:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 03:00:00', freq='H'),
Timestamp('2011-01-01 04:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 04:00:00', freq='H'),
Timestamp('2011-01-01 05:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 05:00:00', freq='H'),
Timestamp('2011-01-01 06:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 06:00:00', freq='H'),
Timestamp('2011-01-01 07:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 07:00:00', freq='H'),
Timestamp('2011-01-01 08:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 08:00:00', freq='H'),
Timestamp('2011-01-01 09:00:00', freq='H'),
3600.0)]
I am trying to plot the following data:
01/01/2012 01:00
01/01/2012 02:00
01/01/2012 03:00
01/01/2012 04:00
01/01/2012 05:00
01/01/2012 06:00
01/01/2012 07:00
01/01/2012 08:00
01/01/2012 09:00
01/01/2012 10:00
01/01/2012 11:00
01/01/2012 12:00
01/01/2012 13:00
01/01/2012 14:00
01/01/2012 15:00
01/01/2012 16:00
01/01/2012 17:00
01/01/2012 18:00
01/01/2012 19:00
01/01/2012 20:00
01/01/2012 21:00
01/01/2012 22:00
01/01/2012 23:00
02/01/2012 00:00
04/01/2012 23:00
................
05/01/2012 00:00
05/01/2012 01:00
................
Against wind_speed data which is in the format:
[ 3.30049159 2.25226244 1.44078451 ... 12.8397099 9.75722427
7.98525797]
My code is:
T = T[1:]
print(datetime.datetime.strptime(T, "%m/%d/%Y %H:%M:%S").strftime("%Y%m%d %I:%M:%S"))  # parsing the time
TIMESTAMP = [str (i) for i in T]
plt.plot_date(TIMESTAMP, wind_speed)
plt.show()
However, I am receiving the error message "TypeError: strptime() argument 1 must be str, not list". I am new to Python and would appreciate help on how to either convert the list to a string or find another way to resolve this. Thank you!
As the other answers suggest, you need a different approach, probably map. You may also use pd.to_datetime() and pass it the whole list, then use the result as the x-axis and wind_speed as the y-axis.
import pandas as pd
timestamp = pd.to_datetime(T[1:])
It will create a DatetimeIndex which you can again format according to your needs, like:
timestamp = timestamp.strftime("%Y%m%d %I:%M:%S")
In one line:
timestamp = pd.to_datetime(T[1:]).strftime("%Y%m%d %I:%M:%S")
After having the two lists timestamp and wind_speed, you may use something like:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(13,6))
ax.plot(timestamp, wind_speed)
plt.xticks(rotation=30)
plt.show()
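As a design note, it is often simpler to skip the strftime step and plot real datetimes, since matplotlib then formats the time axis itself; a sketch assuming T and wind_speed are as in the question and that the input format is "%m/%d/%Y %H:%M" (matching the sample data shown above):
import pandas as pd
import matplotlib.pyplot as plt

# Keep the parsed timestamps as datetimes instead of converting back to strings.
timestamp = pd.to_datetime(T[1:], format="%m/%d/%Y %H:%M")

fig, ax = plt.subplots(figsize=(13, 6))
ax.plot(timestamp, wind_speed)    # matplotlib handles the datetime tick labels
plt.xticks(rotation=30)
plt.show()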
This should help. It uses map with a lambda to convert each datetime string to your required format.
Demo:
import datetime
data = ['01/01/2012 01:00', '01/01/2012 02:00', '01/01/2012 03:00', '01/01/2012 04:00', '01/01/2012 05:00', '01/01/2012 06:00', '01/01/2012 07:00', '01/01/2012 08:00', '01/01/2012 09:00', '01/01/2012 10:00', '01/01/2012 11:00', '01/01/2012 12:00', '01/01/2012 13:00', '01/01/2012 14:00', '01/01/2012 15:00', '01/01/2012 16:00', '01/01/2012 17:00', '01/01/2012 18:00', '01/01/2012 19:00', '01/01/2012 20:00', '01/01/2012 21:00', '01/01/2012 22:00', '01/01/2012 23:00', '02/01/2012 00:00', '04/01/2012 23:00']
data = list(map(lambda x: datetime.datetime.strptime(x, "%m/%d/%Y %H:%M").strftime("%Y%m%d %I:%M:%S"), data))
print(data)
Output:
['20120101 01:00:00', '20120101 02:00:00', '20120101 03:00:00', '20120101 04:00:00', '20120101 05:00:00', '20120101 06:00:00', '20120101 07:00:00', '20120101 08:00:00', '20120101 09:00:00', '20120101 10:00:00', '20120101 11:00:00', '20120101 12:00:00', '20120101 01:00:00', '20120101 02:00:00', '20120101 03:00:00', '20120101 04:00:00', '20120101 05:00:00', '20120101 06:00:00', '20120101 07:00:00', '20120101 08:00:00', '20120101 09:00:00', '20120101 10:00:00', '20120101 11:00:00', '20120201 12:00:00', '20120401 11:00:00']
Your variable T is apparently a list of strings, not a string itself, so you need to iterate through T and pass the items in T to strptime instead.
for t in T:
    datetime.datetime.strptime(t, "%m/%d/%Y %H:%M:%S").strftime("%Y%m%d %I:%M:%S")
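If you want to keep the converted values, collect them with a list comprehension; note that the sample timestamps in the question have no seconds field, so the input format most likely has to be "%m/%d/%Y %H:%M" (a sketch under that assumption):
import datetime

formatted = [datetime.datetime.strptime(t, "%m/%d/%Y %H:%M").strftime("%Y%m%d %I:%M:%S")
             for t in T]   # T is the list of date strings from the question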
When I use pandas.date_range(), I sometimes get timestamps that have lots of milliseconds that I don't wish to keep.
Suppose I do...
import pandas as pd
dr = pd.date_range('2011-01-01', '2011-01-03', periods=15)
>>> dr
DatetimeIndex([ '2011-01-01 00:00:00',
'2011-01-01 03:25:42.857142784',
'2011-01-01 06:51:25.714285824',
'2011-01-01 10:17:08.571428608',
'2011-01-01 13:42:51.428571392',
'2011-01-01 17:08:34.285714176',
'2011-01-01 20:34:17.142857216',
'2011-01-02 00:00:00',
'2011-01-02 03:25:42.857142784',
'2011-01-02 06:51:25.714285824',
'2011-01-02 10:17:08.571428608',
'2011-01-02 13:42:51.428571392',
'2011-01-02 17:08:34.285714176',
'2011-01-02 20:34:17.142857216',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)
To ignore the current milliseconds, I am forced to do this.
>>> t = []
>>> for item in dr:
...     idx = str(item).find('.')
...     if idx != -1:
...         item = str(item)[:idx]
...     t.append(pd.to_datetime(item))
...
>>> t
[Timestamp('2011-01-01 00:00:00'),
Timestamp('2011-01-01 03:25:42'),
Timestamp('2011-01-01 06:51:25'),
Timestamp('2011-01-01 10:17:08'),
Timestamp('2011-01-01 13:42:51'),
Timestamp('2011-01-01 17:08:34'),
Timestamp('2011-01-01 20:34:17'),
Timestamp('2011-01-02 00:00:00'),
Timestamp('2011-01-02 03:25:42'),
Timestamp('2011-01-02 06:51:25'),
Timestamp('2011-01-02 10:17:08'),
Timestamp('2011-01-02 13:42:51'),
Timestamp('2011-01-02 17:08:34'),
Timestamp('2011-01-02 20:34:17'),
Timestamp('2011-01-03 00:00:00')]
Is there a better way ?
I already tried this...
dr = [ pd.to_datetime(item, format='%Y-%m-%d %H:%M:%S') for item in dr ]
But it doesn't do anything.
(pd.date_range('2011-01-01', '2011-01-03', periods=15)).astype('datetime64[s]')
But it says it can't cast it.
dr = (dr.to_series()).apply(lambda x: x.replace(microsecond=0))
But this line doesn't solve my problem, as...
2018-04-17 15:07:04.777777664 gives --> 2018-04-17 15:07:04.000000664
I believe you need DatetimeIndex.floor:
print (dr.floor('S'))
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 03:25:42',
'2011-01-01 06:51:25', '2011-01-01 10:17:08',
'2011-01-01 13:42:51', '2011-01-01 17:08:34',
'2011-01-01 20:34:17', '2011-01-02 00:00:00',
'2011-01-02 03:25:42', '2011-01-02 06:51:25',
'2011-01-02 10:17:08', '2011-01-02 13:42:51',
'2011-01-02 17:08:34', '2011-01-02 20:34:17',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)
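DatetimeIndex.round and Timestamp.floor work the same way if you prefer rounding to the nearest second or only need to fix a single value (a short sketch reusing the dr index from above):
print(dr.round('S'))                                              # round instead of truncate
print(pd.Timestamp('2018-04-17 15:07:04.777777664').floor('S'))   # 2018-04-17 15:07:04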
I have a use case where I always need a non-leap calendar, whether or not the year is a leap year. I want to construct a 6-hourly datetime list for the year 2000, for example:
import datetime
import pandas as pa
tdelta = datetime.timedelta(hours=6)
dt = datetime.datetime(2000,1,1,0,)
ts = [dt+i*tdelta for i in range(1460)]
pa.DatetimeIndex(ts)
With this block of code, I get the result:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 06:00:00',
'2000-01-01 12:00:00', '2000-01-01 18:00:00',
'2000-01-02 00:00:00', '2000-01-02 06:00:00',
'2000-01-02 12:00:00', '2000-01-02 18:00:00',
'2000-01-03 00:00:00', '2000-01-03 06:00:00',
...
'2000-12-28 12:00:00', '2000-12-28 18:00:00',
'2000-12-29 00:00:00', '2000-12-29 06:00:00',
'2000-12-29 12:00:00', '2000-12-29 18:00:00',
'2000-12-30 00:00:00', '2000-12-30 06:00:00',
'2000-12-30 12:00:00', '2000-12-30 18:00:00'],
dtype='datetime64[ns]', length=1460, freq=None, tz=None)
However, I want February to have 28 days, and thus the last member of the output should be '2000-12-31 18:00:00'. Is there a way to do this with Python? Thanks!!
All you need to do is check the .month and .day attributes of the datetime instance. So just insert a condition that checks:
if month == 2
if day == 29
If both conditions are true, you don't add it to the list.
To make it more descriptive:
ts = []
for i in range(1460):
    x = dt + i * tdelta
    if x.month == 2 and x.day == 29:
        continue
    ts.append(x)
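An alternative sketch using pandas directly: build the whole year at a 6-hour step and drop February 29 with a boolean mask, which for the year 2000 leaves 1460 entries ending at '2000-12-31 18:00:00' as requested:
import pandas as pd

rng = pd.date_range('2000-01-01 00:00', '2000-12-31 18:00', freq='6H')
rng = rng[~((rng.month == 2) & (rng.day == 29))]   # drop the leap day

print(len(rng))    # 1460
print(rng[-1])     # 2000-12-31 18:00:00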
Assume I have loaded time series data from SQL or CSV (not created in Python), the index would be:
DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
'2015-03-02 02:00:00', '2015-03-02 03:00:00',
'2015-03-02 04:00:00', '2015-03-02 05:00:00',
'2015-03-02 06:00:00', '2015-03-02 07:00:00',
'2015-03-02 08:00:00', '2015-03-02 09:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)
As you can see, the freq is None. I am wondering how I can detect the frequency of this series and set it as its freq. If possible, I would like this to work for data that isn't continuous (there are plenty of breaks in the series).
I was trying to find the mode of all the differences between consecutive timestamps, but I am not sure how to convert that into a frequency the Series can use.
It is worth mentioning that if the data is continuous, you can use the pandas.DatetimeIndex.inferred_freq property:
dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix.freq = None   # drop the stored freq so it has to be inferred
dt_ix.inferred_freq
Out[2]: 'H'
or pandas.infer_freq method:
pd.infer_freq(dt_ix)
Out[3]: 'H'
If it is not continuous, pandas.infer_freq will return None. Similarly to what has already been proposed, another alternative is using the pandas.Series.diff method:
split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')
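From there, the modal gap can be converted back into an offset that date_range (or asfreq) will accept; a sketch using to_offset on the resulting Timedelta:
step = split_ix.to_series().diff().mode()[0]           # most common gap, e.g. Timedelta('0 days 01:00:00')
freq = pd.tseries.frequencies.to_offset(step)          # e.g. <Hour>
full_rng = pd.date_range(split_ix[0], split_ix[-1], freq=freq)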
Maybe try taking the difference of the time index and using the mode (or the smallest difference) as the freq.
import pandas as pd
import numpy as np
# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df
col
2015-03-02 01:00:00 2.0261
2015-03-02 04:00:00 1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00 0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00 1.8453
... ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00 1.1962
2015-07-19 15:00:00 1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00 0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00 0.5049
2015-07-19 23:00:00 -0.5349
[2000 rows x 1 columns]
# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()
01:00:00 1181
02:00:00 499
03:00:00 180
04:00:00 93
05:00:00 24
06:00:00 10
07:00:00 9
08:00:00 3
dtype: int64
# the mode can be considered as frequency
res.index[0] # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min() # output: Timedelta('0 days 01:00:00')
# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng
DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
'2015-03-02 03:00:00', '2015-03-02 04:00:00',
'2015-03-02 05:00:00', '2015-03-02 06:00:00',
'2015-03-02 07:00:00', '2015-03-02 08:00:00',
'2015-03-02 09:00:00', '2015-03-02 10:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', length=3359, freq='H', tz=None)
The minimum time difference is found with
np.diff(df.index.values).min()
which is normally in units of ns. To get a frequency (in Hz), assuming ns:
freq = 1e9 / np.diff(df.index.values).min().astype(int)
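If the goal is a pandas frequency rather than a rate in Hz, the same minimum gap can be wrapped in a Timedelta and converted to an offset, which can then be used to reindex onto a regular grid (a sketch, assuming df has a DatetimeIndex):
import numpy as np
import pandas as pd

min_gap = pd.Timedelta(np.diff(df.index.values).min())   # smallest gap, e.g. Timedelta('0 days 01:00:00')
offset = pd.tseries.frequencies.to_offset(min_gap)       # e.g. <Hour>
df_regular = df.asfreq(offset)                           # reindex; rows without data become NaN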