I know there are several posts on how to split series based on consecutive values, and I have adopted some of their code, but I'm not sure what I am doing wrong.
I have a long DatetimeIndex ("times" below), and I want to split it to identify consecutive groupings. So, every time there is a gap longer than the normal time increment, I want it to split. The index is evenly incremented (10 minutes between times).
times = pd.date_range(start=start_date, end=end_date, freq=frequency).difference(x.index)
splits = ((times-pd.Series(times).shift(-1)).abs() != frequency)
consec = np.split(times, splits)
"Splits" is a boolean array that accurately indicates where the splits should occur, so that seems to be working correctly.
However, when I actually use np.split, instead of splitting into sections, the output is like this, where it is only keeping the values that are at the split indices:
[DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex(['2003-02-05 09:20:00'], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex(['2003-02-09 01:20:00'], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex([], .... etc
Any ideas on why this is happening?
The indices_or_sections argument of np.split accepts an array of integers.
To get it from splits (an array of booleans) you can use np.where, which returns a tuple of arrays, so take its first element:
splits_indexes = np.where(splits)[0]
Then you can call
consec = np.split(times, splits_indexes + 1)
The +1 is there so that each gap marks the beginning of a new part.
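Putting this together, here is a minimal runnable sketch with a made-up 10-minute index containing one deliberate gap. It compares adjacent timestamps directly instead of using shift, which is a slight variation on the approach above but gives the same split points:

```python
import numpy as np
import pandas as pd

frequency = pd.Timedelta('10min')

# a 10-minute index with one deliberate gap after the fourth timestamp
times = pd.date_range('2003-02-05 09:00', periods=4, freq='10min').append(
    pd.date_range('2003-02-05 12:00', periods=3, freq='10min'))

# compare each timestamp with the next one; np.where turns the boolean
# mask into integer positions (it returns a tuple, hence the [0])
gap_positions = np.where((times[1:] - times[:-1]) != frequency)[0] + 1

# split into runs of consecutive timestamps
consec = np.split(times, gap_positions)
```

With the sample index above, this yields two groups: the four morning timestamps and the three noon timestamps.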
I have the following datetimeindex:
dates = DatetimeIndex(['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04',
'2022-03-05', '2022-03-06', '2022-03-07', '2022-03-08',
'2022-03-09', '2022-03-10',
...
'2022-06-06', '2022-06-07', '2022-06-08', '2022-06-09',
'2022-06-10', '2022-06-11', '2022-06-12', '2022-06-13',
'2022-06-14', '2022-06-15'],
dtype='datetime64[ns]', length=107, freq='D')
I want to discard all elements except the first and last one. How do I do that? I tried this:
[dates[0]] + [dates[-1]] but it returns an actual list of Timestamps and not this:
DatetimeIndex(['2022-03-01', '2022-06-15'],
dtype='datetime64[ns]', length=2, freq='D')
Index with a list to select multiple items.
>>> dates[[0, -1]]
DatetimeIndex(['2022-03-01', '2022-06-15'], dtype='datetime64[ns]', freq=None)
This is covered in the NumPy user guide under Integer array indexing. In the Pandas user guide, there's related info under Selection by position.
Here's a way to do it:
print(dates[::len(dates)-1])
Output:
DatetimeIndex(['2022-03-01', '2022-06-15'], dtype='datetime64[ns]', freq=None)
This slices with a step of len(dates)-1, jumping straight from the first element to the last (explanation suggested by @wjandrea).
I don't see the point of the 'freq' attribute of a pandas DatetimeIndex object. It can be passed at construction time or set at any time as a property, but I don't see any difference in the behaviour of the DatetimeIndex object when this property changes.
Please look at this example. We add 1 day to a DatetimeIndex that has freq='B', but the returned index contains non-business days:
import pandas as pd
from pandas.tseries.offsets import *
rng = pd.date_range('2012-01-05', '2012-01-10', freq=BDay())
index = pd.DatetimeIndex(rng)
print(index)
index2 = index + pd.Timedelta('1D')
print(index2)
This is the output:
DatetimeIndex(['2012-01-05', '2012-01-06', '2012-01-09', '2012-01-10'], dtype='datetime64[ns]', freq='B')
DatetimeIndex(['2012-01-06', '2012-01-07', '2012-01-10', '2012-01-11'], dtype='datetime64[ns]', freq='B')
Why isn't freq considered when performing computation (+/- Timedelta) on the DatetimeIndex?
Why freq doesn't reflect the actual data contained in the DatetimeIndex? ( it says 'B' even though it contains non-business days)
You are looking for shift:
index.shift(1)
Out[336]: DatetimeIndex(['2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11'], dtype='datetime64[ns]', freq='B')
Adding a BDay offset will do the same:
from pandas.tseries.offsets import BDay
index + BDay(1)
Out[340]: DatetimeIndex(['2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11'], dtype='datetime64[ns]', freq='B')
From a GitHub issue:
The freq attribute is meant to be purely descriptive, so it doesn't
and shouldn't impact calculations. Potentially docs could be clearer.
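To make the contrast concrete, here is a small sketch (using the question's dates) of the difference between shift, which respects freq, and adding a Timedelta, which moves by absolute calendar time:

```python
import pandas as pd
from pandas.tseries.offsets import BDay

idx = pd.date_range('2012-01-05', '2012-01-10', freq='B')

# shift advances each element by the index's own freq (business days),
# so weekends are skipped
shifted = idx.shift(1)

# adding a Timedelta advances by absolute calendar time, so a Saturday
# ('2012-01-07') can appear even though freq says 'B'
plus_day = idx + pd.Timedelta('1D')

# adding BDay(1) is equivalent to shift(1) here
same = (idx + BDay(1)).equals(shifted)
```

So freq does get used by shift (and by offset arithmetic), but plain Timedelta arithmetic ignores it, which matches the "purely descriptive" explanation above.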
I have a huge DataFrame whose index represents datetimes in integer form, for example 20171001. What I want to do is change that form, for example 20171001, to the datetime format '2017-10-01'.
For simplicity, I generate such a DataFrame.
>>> df = pd.DataFrame(np.random.randn(3,2), columns=list('ab'), index=
[20171001,20171002,20171003])
>>> df
a b
20171001 2.205108 0.926963
20171002 1.104884 -0.445450
20171003 0.621504 -0.584352
>>> df.index
Int64Index([20171001, 20171002, 20171003], dtype='int64')
If we apply 'to_datetime' to df.index, we get a weird result:
>>> pd.to_datetime(df.index)
DatetimeIndex(['1970-01-01 00:00:00.020171001',
'1970-01-01 00:00:00.020171002',
'1970-01-01 00:00:00.020171003'],
dtype='datetime64[ns]', freq=None)
What I want is DatetimeIndex(['2017-10-01', '2017-10-02', '2017-10-03'], ...)
How can I manage this problem? Note that the file is given as-is.
Use format='%Y%m%d' in pd.to_datetime, i.e.
pd.to_datetime(df.index, format='%Y%m%d')
DatetimeIndex(['2017-10-01', '2017-10-02', '2017-10-03'], dtype='datetime64[ns]', freq=None)
To assign it back: df.index = pd.to_datetime(df.index, format='%Y%m%d')
pd.to_datetime is the pandas way of doing it. But here are two alternatives (using a list comprehension rather than a generator, since the index assignment needs a concrete sequence):
import datetime
df.index = [datetime.datetime.strptime(str(i), "%Y%m%d") for i in df.index]
or
import datetime
df.index = df.index.map(lambda x: datetime.datetime.strptime(str(x),"%Y%m%d"))
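For completeness, a small round-trip sketch using the question's sample data: convert the integer index with format='%Y%m%d', and go back the other way with strftime:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2), columns=list('ab'),
                  index=[20171001, 20171002, 20171003])

# integers -> DatetimeIndex, telling pandas the digits mean YYYYMMDD
df.index = pd.to_datetime(df.index, format='%Y%m%d')

# and back: DatetimeIndex -> YYYYMMDD integers
int_index = df.index.strftime('%Y%m%d').astype(int)
```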
I have two lists:
"max_" consists of datetime types:
2012-04-20 00:00:00
2012-11-29 00:00:00
2013-11-22 00:00:00
"min_" , consists of datetimeindex:
DatetimeIndex(['2012-07-11'], dtype='datetime64[ns]', name=u'Date', freq=None)
DatetimeIndex(['2013-02-05', '2013-10-23'], dtype='datetime64[ns]', name=u'Date', freq=None)
DatetimeIndex([], dtype='datetime64[ns]', name=u'Date', freq=None)
My expected output is to take a range from each max value to its respective min value, for example, the first one would be range (2012-04-20 to 2012-07-11). I've tried:
pd.date_range(max_, min_)
TypeError: Cannot convert input [DatetimeIndex(['2012-07-11'], dtype='datetime64[ns]', name=u'Date', freq=None)] of type <class 'pandas.core.indexes.datetimes.DatetimeIndex'> to Timestamp
I'm not sure how to get around the conversion part, additionally, I'd like to have only the first value for the min_ lists (and ignore any additional).
I think you just need to pick out the individual items you want to build the range from. Note that min_[0] is itself a DatetimeIndex, so index into it once more to get a Timestamp (which also gives you only the first min value, as you wanted):
pd.date_range(max_[0], min_[0][0])
If you are trying to print all the ranges:
for mx, mn in zip(max_, min_):
    if len(mn):  # skip empty DatetimeIndexes
        print(pd.date_range(mx, mn[0]))
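A runnable sketch of this pairing, with the question's sample data filled in (including the empty DatetimeIndex case, which is skipped):

```python
import pandas as pd

max_ = [pd.Timestamp('2012-04-20'),
        pd.Timestamp('2012-11-29'),
        pd.Timestamp('2013-11-22')]
min_ = [pd.DatetimeIndex(['2012-07-11']),
        pd.DatetimeIndex(['2013-02-05', '2013-10-23']),
        pd.DatetimeIndex([])]

ranges = []
for mx, mn in zip(max_, min_):
    if len(mn):                       # skip empty DatetimeIndexes
        # mn[0] pulls a Timestamp out of the DatetimeIndex,
        # which is what pd.date_range expects as an endpoint
        ranges.append(pd.date_range(mx, mn[0]))
```

The first range runs from 2012-04-20 through 2012-07-11, as described in the question; the third pair produces nothing because its min_ entry is empty.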
Is it possible to convert a pd.DatetimeIndex consisting of timestamps in a single timezone to one where each timestamp has its own, in some cases distinct timezone?
Here is an example of what I would like to have:
type(df.index)
pandas.tseries.index.DatetimeIndex
df.index[0]
Timestamp('2015-06-07 23:00:00+0100', tz='Europe/London')
df.index[1]
Timestamp('2015-06-08 00:01:00+0200', tz='Europe/Brussels')
You can have an index contain Timestamps with different timezones. But you would have to explicitly construct it as an Index.
In [33]: pd.Index([pd.Timestamp('2015-06-07 23:00:00+0100', tz='Europe/London'),pd.Timestamp('2015-06-08 00:01:00+0200', tz='Europe/Brussels')],dtype='object')
Out[33]: Index([2015-06-07 23:00:00+01:00, 2015-06-08 00:01:00+02:00], dtype='object')
In [34]: list(pd.Index([pd.Timestamp('2015-06-07 23:00:00+0100', tz='Europe/London'),pd.Timestamp('2015-06-08 00:01:00+0200', tz='Europe/Brussels')],dtype='object'))
Out[34]:
[Timestamp('2015-06-07 23:00:00+0100', tz='Europe/London'),
Timestamp('2015-06-08 00:01:00+0200', tz='Europe/Brussels')]
This is a very odd thing to do, and completely non-performant. You generally want to have a single timezone represented (UTC or otherwise). As of 0.17.0 you can efficiently represent a single column with a timezone, so one way of accomplishing what I think is your goal is to segregate the different timezones into different columns. See the docs.
If you're happy for it to not be an Index, but just a regular Series, this should be OK:
pd.Series([pd.Timestamp('2015-06-07 23:00:00+0100', tz='Europe/London'),
pd.Timestamp('2015-06-08 00:01:00+0200', tz='Europe/Brussels')])
Adding timestamps with different timezones into the same DatetimeIndex automatically yields a DatetimeIndex with UTC as the default timezone. For example:
In [269]: index = pandas.DatetimeIndex([Timestamp('2015-06-07 23:00:00+0100')])
In [270]: index
Out[270]: DatetimeIndex(['2015-06-07 23:00:00+01:00'], dtype='datetime64[ns, pytz.FixedOffset(60)]', freq=None)
In [271]: index2 = DatetimeIndex([Timestamp('2015-06-08 00:01:00+0200')])
In [272]: index2
Out[272]: DatetimeIndex(['2015-06-08 00:01:00+02:00'], dtype='datetime64[ns, pytz.FixedOffset(120)]', freq=None)
In [273]: index.append(index2)  # returns a single index containing both
Out[273]: DatetimeIndex(['2015-06-07 22:00:00+00:00', '2015-06-07 22:01:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)
Notice how the result is a UTC DatetimeIndex with the correct UTC timestamps preserved.
Similarly:
In [279]: pandas.to_datetime([Timestamp('2015-06-07 23:00:00+0100'), Timestamp('2015-06-08 00:01:00+0200')], utc=True)  # utc=True is needed
Out[279]: DatetimeIndex(['2015-06-07 22:00:00+00:00', '2015-06-07 22:01:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)
This is not a bad thing: you preserve the correct instant in time while keeping the indexing prowess of a DatetimeIndex (e.g. slicing by date range), and you can easily convert the timestamps to any other timezone. (If you really need to know the original timezone of each timestamp, though, this won't be ideal.)
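As a sketch of that last point: coerce the mixed offsets to UTC once, then convert the whole index to whatever timezone you need for display:

```python
import pandas as pd

stamps = [pd.Timestamp('2015-06-07 23:00:00+0100'),
          pd.Timestamp('2015-06-08 00:01:00+0200')]

# utc=True coerces the mixed offsets into a single UTC DatetimeIndex
idx = pd.to_datetime(stamps, utc=True)

# the index can now be converted to any single timezone in one call
london = idx.tz_convert('Europe/London')
```

The conversion changes only how each timestamp is displayed; the underlying instants are unchanged, so comparisons across timezones still work.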