I'm using pandas time series indexed with a DatetimeIndex, and I need to have support for semiannual frequencies. The basic semiannual frequency has 1H=Jan-Jun and 2H=Jul-Dec, though some series might have the last month be a month other than December, for instance 1H=Dec-May and 2H=Jun-Nov.
I imagine I could certainly achieve what I want by making a custom class that derives from pandas' DateOffset class. However, before I go and do that, I'm curious whether there is a way to simply use a built-in frequency, for instance a 6-month frequency. I have tried to do this, but cannot get resampling to work the way I want.
For example:
import numpy as np
import pandas as pd
from datetime import datetime
data = np.arange(12)
s = pd.Series(data, pd.date_range(start=datetime(2007,1,31), periods=len(data), freq="M"))
s.resample("6M").mean()
Out[11]:
2007-01-31 0.0
2007-07-31 3.5
2008-01-31 9.0
Freq: 6M
Notice how pandas is aggregating using windows from Aug-Jan and Feb-Jul. In this base case I would want Jan-Jun and Jul-Dec.
You could use a combination of the two Series.resample() parameters loffset= and closed=.
For example:
In [1]: import numpy as np, pandas as pd
In [2]: data = np.arange(1, 13)
In [3]: s = pd.Series(data, pd.date_range(start='1/31/2007', periods=len(data), freq='M'))
In [4]: s.resample('6M', how='sum', closed='left', loffset='-1M')
Out[4]:
2007-06-30 21
2007-12-31 57
I used loffset='-1M' to tell pandas to aggregate one period earlier than its default (moved us to Jan-Jun).
I used closed='left' to make the aggregator include the 'left' end of the sample window and exclude the 'right' end (closed='right' is the default behavior).
NOTE: I used how='sum' just to make sure it was doing what I thought. You can use any of the appropriate how's.
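NOTE 2: how= and loffset= have since been deprecated and removed in recent pandas releases. If you are on such a version (an assumption, not part of the original answer), a rough equivalent of the same idea is to aggregate with closed='left' and shift the resulting labels yourself:
res = s.resample('6M', closed='left').sum()
res.index = res.index - pd.offsets.MonthEnd(1)  # stands in for loffset='-1M'
# res is now:
# 2007-06-30    21
# 2007-12-31    57
# (the newest pandas releases spell month-end frequencies 'ME'/'6ME' instead of 'M'/'6M')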
Related
I can't seem to understand the difference between <M8[ns] and datetime formats, and how those types relate to why these operations do or don't work.
import pandas as pd
import datetime as dt
import numpy as np
my_dates = ['2021-02-03','2021-02-05','2020-12-25', '2021-12-27','2021-12-12']
my_numbers = [100,200,0,400,500]
df = pd.DataFrame({'a':my_dates, 'b':my_numbers})
df['a'] = pd.to_datetime(df['a'])
# ultimate goal is to be able to call df.mean() and see the mean DATE
# but this doesn't seem to work so...
df['a'].mean().strftime('%Y-%m-%d') ### ok this works... I can mess around and concat stuff...
# But why won't this work?
df2 = df.select_dtypes('datetime')
df2.mean() # WONT WORK
df2['a'].mean() # WILL WORK?
What I seem to be running into, unless I am missing something, is the difference between 'datetime' and '<M8[ns]' and how that difference matters when I'm trying to get the mean date.
You can try passing the numeric_only parameter to the mean() method:
out=df.select_dtypes('datetime').mean(numeric_only=False)
output of out:
a 2021-06-03 04:48:00
dtype: datetime64[ns]
Note: it will throw an error if the dtype is string.
The mean function you apply is different in each case.
import pandas as pd
import datetime as dt
import numpy as np
my_dates = ['2021-02-03','2021-02-05','2020-12-25', '2021-12-27','2021-12-12']
my_numbers = [100,200,0,400,500]
df = pd.DataFrame({'a':my_dates, 'b':my_numbers})
df['a']=pd.to_datetime(df['a'])
df.mean()
This is the DataFrame mean function, and it works only on numeric data. To see which columns are numeric, do:
df._get_numeric_data()
b
0 100
1 200
2 0
3 400
4 500
But df['a'] is a datetime series.
df['a'].dtype, type(df)
(dtype('<M8[ns]'), pandas.core.frame.DataFrame)
So df['a'].mean() applies a different mean function, one that works on datetime values. That's why df['a'].mean() outputs the mean of the datetime values.
df['a'].mean()
Timestamp('2021-06-03 04:48:00')
read more here:
difference-between-data-type-datetime64ns-and-m8ns
DataFrame.mean() ignores datetime series
#28108
Let's say I have a 'low-frequency' series with data points every 2 hours, which I'd like to upsample to a 1-hour freq.
Is it possible in the code snippet below to have the high-freq signal have 24 rows (instead of 23)? More precisely, I'd like the new index to range from 00:00 to 23:00 with a NaN value (instead of stopping at 22:00).
I've played quite a bit with the several options, but I still couldn't find a clean way to do it.
import pandas as pd
import numpy as np
low_f = pd.Series(np.random.randn(12),
index=pd.date_range(start='01/01/2017', freq='2H', periods=12),
name='2H').cumsum()
high_f = low_f.resample('1H').mean()
print(high_f.tail(1).index)
#Yields DatetimeIndex(['2017-01-01 22:00:00'], dtype='datetime64[ns]', freq='H')
#I'd like DatetimeIndex(['2017-01-01 23:00:00'], dtype='datetime64[ns]', freq='H')
#(w/ 24 elements)
You can use the DatetimeIndex.shift method to shift the dates by one hour (leading), then take the union of the old index and the newly formed shifted index.
Finally, reindex the series against this combined index. As the series has no value at the last index entry, it is filled with NaN, per reindex's default fill_value parameter.
high_f.reindex(high_f.index.union(high_f.index.shift(1, 'H')))
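A quick check of the result (the values are random, so only the index matters here):
extended = high_f.reindex(high_f.index.union(high_f.index.shift(1, 'H')))
print(len(extended))        # 24
print(extended.index[-1])   # 2017-01-01 23:00:00
print(extended.iloc[-1])    # nan, since low_f has no observation at 23:00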
I am trying to obtain day deltas for a wide range of pandas dates. However, for time deltas of more than 292 years I obtain negative values. For example,
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
days_delta = (dates-dates.min()).astype('timedelta64[D]')
However, using a DatetimeIndex I can do it and it works as I want it to,
import pandas as pd
import numpy as np
dates = pd.date_range('1700-01-01', periods=4500, freq='m')
days_fun = np.vectorize(lambda x: x.days)
days_delta = days_fun(dates.date - dates.date.min())
The question then is how to obtain the correct days_delta for Series objects?
Read here specifically about timedelta limitations:
Pandas represents Timedeltas in nanosecond resolution using 64 bit integers. As such, the 64 bit integer limits determine the Timedelta limits.
Incidentally, this is the same limitation that the docs mention for Timestamps in pandas:
Since pandas represents timestamps in nanosecond resolution, the timespan that can be represented using a 64-bit integer is limited to approximately 584 years
This would suggest that the same recommendations the docs make for circumventing the timestamp limitations can be applied to timedeltas. The solution to the timestamp limitations is found in the docs (here):
If you have data that is outside of the Timestamp bounds, see Timestamp limitations, then you can use a PeriodIndex and/or Series of Periods to do computations.
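This is not from the original answer, but as a rough sketch: for plain day counts you can also sidestep nanosecond-resolution Timedeltas entirely by dropping to Python date objects through the .dt accessor, mirroring what the question already does with a DatetimeIndex:
import pandas as pd

dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='M'))
# .dt.date yields plain datetime.date objects (object dtype), so the subtraction
# produces Python timedeltas, which are not limited to roughly 292 years
day_objects = dates.dt.date
days_delta = (day_objects - day_objects.min()).apply(lambda td: td.days)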
Workaround
If you have continuous dates with small, calculable gaps, as in your example, you can sort the series and then use cumsum to get around the problem, like this:
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='M'))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum().describe()
count 4500.000000
mean 68466.072444
std 39543.094524
min 0.000000
25% 34233.250000
50% 68465.500000
75% 102699.500000
max 136935.000000
dtype: float64
See the min and max are both positive.
Failaround
If the gaps are too big, this workaround will not work, as here:
dates = pd.Series(pd.to_datetime(['2016-06-06', '1700-01-01', '2200-01-01']))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum()
1 0
0 -97931
2 -30883
This is because we calculate the step between each pair of consecutive dates and then add the steps up. When the dates are sorted, we are guaranteed the smallest possible steps; however, in this case one of the steps is still too big to handle.
Resetting the order
As you see in the Failaround example, the series is no longer ordered by the index. Fix this by calling the .reset_index(inplace=True) method on the series.
I have a pandas Series which can be constructed like the following:
from datetime import datetime
import numpy as np, pandas as pd, psycopg2.tz
given_time = datetime(2013, 10, 8, 0, 0, 33, 945109,
                      tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))
given_times = np.array([given_time] * 3, dtype='datetime64[ns]')
column = pd.Series(given_times)
The dtype of my Series is datetime64[ns]
However, when I access it, column[1] somehow becomes of type pandas.tslib.Timestamp, while column.values[1] stays np.datetime64. Does pandas auto-cast my datetime into a Timestamp when accessing the item? Is it slow?
Do I need to worry about the difference in types? As far as I can see, the Timestamp seems not to have a timezone (numpy.datetime64('2013-10-08T00:00:33.945109000+0100') -> Timestamp('2013-10-07 23:00:33.945109', tz=None)).
In practice, I would do datetime arithmetic such as taking differences and comparing to a timedelta. Does the possible type inconsistency around my operators affect my use case at all?
Besides, am I encouraged to use pd.to_datetime instead of astype(dtype='datetime64') when converting datetime objects?
Pandas time types are built on top of numpy's datetime64.
In order to continue using the pandas operators, you should keep using pd.to_datetime rather than astype(dtype='datetime64'). This is especially true since you'll be taking datetime deltas, which pandas handles admirably, for example with resampling and period definitions.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#up-and-downsampling
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#period
Though I haven't measured, since the pandas times are wrapping numpy times, I suspect the conversion is quite fast. Alternatively, you can just use pandas' built-in time series definitions and avoid the conversion altogether.
As a rule of thumb, though, it's good to use the types from the package whose functions you'll be using: if you're really only going to use numpy to manipulate the arrays, stick with numpy datetimes; pandas methods => pandas datetimes.
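As a small, hypothetical illustration of the pd.to_datetime route and the kind of delta arithmetic mentioned above (the sample values are made up):
import pandas as pd

s = pd.Series(['2013-10-08 00:00:33', '2013-10-09 12:30:00'])
ts = pd.to_datetime(s)                   # datetime64[ns] Series
delta = ts.diff()                        # timedelta64[ns] Series
print(delta > pd.Timedelta(hours=12))    # element-wise comparison against a timedelta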
I had read in the documentation somewhere (apologies, can't find link) that scalar values will be converted to timestamps while arrays will keep their data type. For example:
from datetime import date
import pandas as pd
time_series = pd.Series([date(2010 + x, 1, 1) for x in range(5)])
time_series = time_series.apply(pd.to_datetime)
so that:
In[1]:time_series
Out[1]:
0 2010-01-01
1 2011-01-01
2 2012-01-01
3 2013-01-01
4 2014-01-01
dtype: datetime64[ns]
and yet:
In[2]:time_series.iloc[0]
Out[2]:Timestamp('2010-01-01 00:00:00')
while:
In[3]:time_series.values[0]
Out[3]:numpy.datetime64('2009-12-31T19:00:00.000000000-0500')
because iloc requests a scalar from pandas (type conversion to Timestamp) while values requests the full numpy array (no type conversion).
There is similar behavior for a series of length one. Additionally, referencing more than one element with a slice (e.g. iloc[1:10]) will return a series, which will always keep its datatype.
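For example (not in the original answer, just illustrating the slicing behavior described above):
time_series.iloc[0:1]   # a length-1 slice is still a Series, not a scalar
# 0   2010-01-01
# dtype: datetime64[ns]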
I'm unsure as to why pandas behaves this way.
In[4]: pd.__version__
Out[4]: '0.15.2'
I'm trying to calculate daily sums of values using pandas. Here's the test file - http://pastebin.com/uSDfVkTS
This is the code I came up so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
The first problem is that the calculated sum corresponds to the next day. I've been able to solve that by using the parameter loffset='-1d'.
Now the actual problem is that the data may not start at 00:30 of a day but at any time of day. Also, the data has gaps filled with 'nan' values.
That said, is it possible to set a lower threshold on the number of values that are necessary to calculate daily sums? (e.g. if there are fewer than 40 values in a single day, put NaN instead of a sum)
I believe it is possible to define a custom function to do that and refer to it in the 'how' parameter, but I have no clue how to code the function itself.
You can do it directly in Pandas:
import numpy as np
import pandas as pd
s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128
A much easier way is to use pd.Grouper:
d = s.groupby(pd.Grouper(freq='1D')).sum()
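Note that the pd.Grouper version drops the >= 40 threshold from the question. In reasonably recent pandas (a version assumption on my part, not part of the original answer), sum() accepts a min_count argument that restores it:
# NaN for any day with fewer than 40 non-NA values, otherwise the daily sum
d = s.groupby(pd.Grouper(freq='1D')).sum(min_count=40)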