Python pandas select rows by list of dates

How do I select multiple rows of a DataFrame given a list of dates?
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
In [1]: df
Out[1]:
A B C D
2013-01-01 0.084393 -2.460860 -0.118468 0.543618
2013-01-02 -0.024358 -1.012406 -0.222457 1.906462
2013-01-03 -0.305999 -0.858261 0.320587 0.302837
2013-01-04 0.527321 0.425767 -0.994142 0.556027
2013-01-05 0.411410 -1.810460 -1.172034 -1.142847
2013-01-06 -0.969854 0.469045 -0.042532 0.699582
myDates = ["2013-01-02", "2013-01-04", "2013-01-06"]
So the output should be
A B C D
2013-01-02 -0.024358 -1.012406 -0.222457 1.906462
2013-01-04 0.527321 0.425767 -0.994142 0.556027
2013-01-06 -0.969854 0.469045 -0.042532 0.699582

You can use the index.isin() method to create a logical index for subsetting:
df[df.index.isin(myDates)]

Convert your entry into a DateTimeIndex:
df.loc[pd.to_datetime(myDates)]
A B C D
2013-01-02 -0.047710 -1.827593 -0.944548 -0.149460
2013-01-04 1.437924 0.126788 0.641870 0.198664
2013-01-06 0.408820 -1.842112 -0.287346 0.071397
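Note that df.loc raises a KeyError if any of the requested dates is missing from the index (in modern pandas), while reindex fills missing rows with NaN. A minimal sketch, reusing df and assuming a date outside its range:
missing = pd.to_datetime(["2013-01-02", "2013-02-01"])  # 2013-02-01 is not in df.index
df.reindex(missing)  # returns a row of NaN for 2013-02-01
df.loc[missing]      # raises KeyError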

If you have a timeseries containing hours and minutes in the index (e.g. 2022-03-07 09:03:00+00:00 instead of 2022-03-07), and you want to filter by dates only (without hours, minutes, etc.), you can compare against df.index.date. Since df.index.date is a plain numpy array of datetime.date objects, convert the date strings to the same type first:
df.loc[np.isin(df.index.date, pd.to_datetime(myDates).date)]
If you try df.loc[df.index.date.isin(myDates)] it will not work; Python throws AttributeError: 'numpy.ndarray' object has no attribute 'isin', and this is why we use np.isin.
This is an old post but I think this can be useful to a lot of people (such as myself).
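An equivalent alternative (a sketch, assuming the same df and myDates) is to normalize the index, which zeroes out the time component while keeping a DatetimeIndex:
df.loc[df.index.normalize().isin(pd.to_datetime(myDates))]
# If the index is tz-aware, localize the targets to match first:
# df.loc[df.index.normalize().isin(pd.to_datetime(myDates).tz_localize(df.index.tz))]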

Related

Why does date_range give a result different from indexing [] for DataFrame Pandas dates?

Here is a simple example with date_range and indexing [] that I used with pandas:
period_start = '2013-01-01'
period_end = '2019-12-24'
print(pd.DataFrame({'close': aapl_close,
                    'returns': aapl_returns}, index=pd.date_range(start=period_start, periods=6)))
print(pd.DataFrame({'close': aapl_close,
                    'returns': aapl_returns})[period_start:'20130110'])
date_range gives NaN results:
close returns
2013-01-01 NaN NaN
2013-01-02 NaN NaN
2013-01-03 NaN NaN
2013-01-04 NaN NaN
Indexing gives correct results
close returns
2013-01-02 00:00:00+00:00 68.732 0.028322
2013-01-03 00:00:00+00:00 68.032 -0.010184
2013-01-04 00:00:00+00:00 66.091 -0.028531
Based on how the dates are shown by date_range, I suppose the date format of date_range does not match the date format in the pandas DataFrame.
1) Can you please explain why it gives NaN?
2) What would you suggest to get a specific time range from the pandas DataFrame?
As I'm a beginner in Python and its libraries, I didn't understand that this question refers to the Quantopian library, not to pandas.
I got a solution on their forum. All the times returned by methods on Quantopian are timezone aware with a timezone of 'UTC'. By default, the date_range method returns timezone-naive dates. Simply supply the timezone information to the date_range method, like this:
pd.DataFrame({'close': aapl_close,
              'returns': aapl_returns},
             index=pd.date_range(start=period_start, periods=6, tz='UTC'))
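To see why the tz-naive index produced NaN: pandas aligns data on index labels, and a tz-naive timestamp never matches a tz-aware one, so every row comes up empty. A minimal sketch (aapl_close here is a hypothetical stand-in for the Quantopian data):
import pandas as pd
import numpy as np
aapl_close = pd.Series(np.arange(6.0), index=pd.date_range('2013-01-01', periods=6, tz='UTC'))
# tz-naive labels match nothing in the tz-aware series -> all NaN
print(pd.DataFrame({'close': aapl_close}, index=pd.date_range('2013-01-01', periods=6))['close'].isna().all())
# tz-aware labels line up -> all values carried over
print(pd.DataFrame({'close': aapl_close}, index=pd.date_range('2013-01-01', periods=6, tz='UTC'))['close'].notna().all())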
To get a specific date or time range in pandas, perhaps the easiest way is simple bracket notation. For example, to get dates between 2013-01-04 and 2013-01-08 (inclusive), simply enter this:
df = pd.DataFrame ({'close':aapl_close, 'returns':aapl_returns,})
my_selected_dates = df['2013-01-04':'2013-01-08']
This bracket notation is really shorthand for using the loc method:
my_selected_dates = df.loc['2013-01-04':'2013-01-08']
Both work the same, but the loc method has a bit more flexibility. This notation also works with full datetimes if desired, as in the sketch below.
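For example, slicing down to the hour (assuming a DatetimeIndex with intraday timestamps; the cut-off times here are made up):
my_selected_dates = df.loc['2013-01-04 09:00':'2013-01-08 17:00']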

Offset date for a Pandas DataFrame date index

Given a Pandas dataframe created as follows:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6), index=dates, columns=list('A'))
A
2013-01-01 0.847528
2013-01-02 0.204139
2013-01-03 0.888526
2013-01-04 0.769775
2013-01-05 0.175165
2013-01-06 -1.564826
I want to add 15 days to the index.
This does not work:
# from pandas.tseries.offsets import *
df.index + relativedelta(days=15)
# df.index + DateOffset(days=5)
TypeError: relativedelta(days=+15)
I seem to be incapable of doing anything right with indexes....
You can use DateOffset:
>>> df = pd.DataFrame(np.random.randn(6),index=dates,columns=list('A'))
>>> df.index = df.index + pd.DateOffset(days=15)
>>> df
A
2013-01-16 0.015282
2013-01-17 1.214255
2013-01-18 1.023534
2013-01-19 1.355001
2013-01-20 1.289749
2013-01-21 1.484291
Marginally shorter/more direct is tshift:
df = df.tshift(15, freq='D')
A list of freq aliases can be found in the pandas docs: https://pandas.pydata.org/docs/user_guide/timeseries.html#offset-aliases
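Note that tshift was deprecated in pandas 1.1 and removed in 2.0; the equivalent today is shift with an explicit freq, which shifts the index rather than the data:
df = df.shift(15, freq='D')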
If you need to convert the index to a DatetimeIndex first and then add the days:
df.index = pd.to_datetime(df.index) + pd.Timedelta('15 days')
If it is already a DatetimeIndex:
df.index += pd.Timedelta('15 days')

How to groupby non-unique timedate index and column

Just starting with pandas. I have a DataFrame with a timedate index and a number of columns (data from parsing a log file). I have been able to convert the DataFrame index to a period index (monthly). One of the columns contains the user name associated with the event in the logfile. I would like to get an overview of the number of occurrences (i.e. rows in the DataFrame) per month per user. The index has non-unique values, but I have been able to group by it using
grp = DF_monthly.groupby(level=0)
However, I don't seem to be able to add that extra grouping on the user column. How can I do this?
Say your raw log looks like:
import pandas as pd
from io import StringIO
infile = StringIO("""datetime,user,event
2013-01-01 11:15:23,A,error
2013-01-02 11:15:23,C,warning
2013-01-03 11:15:23,C,message
2013-02-01 11:15:23,A,error
2013-02-02 11:15:23,B,warning
2013-02-03 11:15:23,A,message""")
df = pd.read_csv(infile, parse_dates=True, index_col='datetime')
user event
datetime
2013-01-01 11:15:23 A error
2013-01-02 11:15:23 C warning
2013-01-03 11:15:23 C message
2013-02-01 11:15:23 A error
2013-02-02 11:15:23 B warning
2013-02-03 11:15:23 A message
Then you can get a count per user per month with:
df.groupby([lambda x: x.strftime('%Y-%b'), 'user']).count()['event']
user
2013-Feb A 2
B 1
2013-Jan A 1
C 2
So it's not necessary to group by month first, unless you have other reasons to do so; in that case, you can apply the last groupby on the monthly df as well.
The lambda function converts each timestamp from the index to a string of 'Year-Month' and performs a groupby on that string.
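Note in the output above that '2013-Feb' sorts before '2013-Jan', because the keys are plain strings. A more idiomatic sketch (assuming the same df) is pd.Grouper, which groups the DatetimeIndex into month-end timestamps that sort chronologically:
df.groupby([pd.Grouper(freq='M'), 'user'])['event'].count()
# In pandas >= 2.2 the monthly alias is 'ME' instead of 'M'.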

pandas timeseries between_datetime function?

I have been using the between_time method of TimeSeries in pandas, which returns all values between the specified times, regardless of their date.
But I need to select both date and time, because my timeseries contains multiple dates.
One way of solving this, though quite inflexible, is to just iterate over the values and remove those which are not relevant.
Is there a more elegant way of doing this?
You can select the dates that are of interest first, and then use between_time. For example, suppose you have a time series of 72 hours:
import pandas as pd
from numpy.random import randn
rng = pd.date_range('1/1/2013', periods=72, freq='H')
ts = pd.Series(randn(len(rng)), index=rng)
To select the values between 20:00 and 22:00 on the 2nd and 3rd of January, you can simply do:
ts['2013-01-02':'2013-01-03'].between_time('20:00', '22:00')
Giving you something like this:
2013-01-02 20:00:00 0.144399
2013-01-02 21:00:00 0.886806
2013-01-02 22:00:00 0.126844
2013-01-03 20:00:00 -0.464741
2013-01-03 21:00:00 1.856746
2013-01-03 22:00:00 -0.286726
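If the dates of interest are not contiguous, one option (a sketch reusing ts from above) is to combine the date-list filtering from the first question with between_time:
import numpy as np
wanted = pd.to_datetime(['2013-01-01', '2013-01-03']).date
ts[np.isin(ts.index.date, wanted)].between_time('20:00', '22:00')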

How to convert a pandas DataFrame into a TimeSeries?

I am looking for a way to convert a DataFrame to a TimeSeries without splitting the index and value columns. Any ideas? Thanks.
In [20]: import pandas as pd
In [21]: import numpy as np
In [22]: dates = pd.date_range('20130101',periods=6)
In [23]: df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
In [24]: df
Out[24]:
A B C D
2013-01-01 -0.119230 1.892838 0.843414 -0.482739
2013-01-02 1.204884 -0.942299 -0.521808 0.446309
2013-01-03 1.899832 0.460871 -1.491727 -0.647614
2013-01-04 1.126043 0.818145 0.159674 -1.490958
2013-01-05 0.113360 0.190421 -0.618656 0.976943
2013-01-06 -0.537863 -0.078802 0.197864 -1.414924
In [25]: pd.Series(df)
Out[25]:
0 A
1 B
2 C
3 D
dtype: object
I know this is late to the game here, but a few points.
What makes a DataFrame a time series is the type of its index. In your case, the index is already a DatetimeIndex, so you are good to go. For more information on all the cool slicing you can do with a datetime index, take a look at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#datetime-indexing
Now, others might arrive here because they have a column 'DateTime' that they want to make the index, in which case the answer is simple:
ts = df.set_index('DateTime')
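If the 'DateTime' column holds strings rather than timestamps, convert it with pd.to_datetime first (a minimal sketch, assuming that column name):
df['DateTime'] = pd.to_datetime(df['DateTime'])
ts = df.set_index('DateTime')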
Here is one possibility:
In [3]: df
Out[3]:
A B C D
2013-01-01 -0.024362 0.712035 -0.913923 0.755276
2013-01-02 2.624298 0.285546 0.142265 -0.047871
2013-01-03 1.315157 -0.333630 0.398759 -1.034859
2013-01-04 0.713141 -0.109539 0.263706 -0.588048
2013-01-05 -1.172163 -1.387645 -0.171854 -0.458660
2013-01-06 -0.192586 0.480023 -0.530907 -0.872709
In [4]: df.unstack()
Out[4]:
A 2013-01-01 -0.024362
2013-01-02 2.624298
2013-01-03 1.315157
2013-01-04 0.713141
2013-01-05 -1.172163
2013-01-06 -0.192586
B 2013-01-01 0.712035
2013-01-02 0.285546
2013-01-03 -0.333630
2013-01-04 -0.109539
2013-01-05 -1.387645
2013-01-06 0.480023
C 2013-01-01 -0.913923
2013-01-02 0.142265
2013-01-03 0.398759
2013-01-04 0.263706
2013-01-05 -0.171854
2013-01-06 -0.530907
D 2013-01-01 0.755276
2013-01-02 -0.047871
2013-01-03 -1.034859
2013-01-04 -0.588048
2013-01-05 -0.458660
2013-01-06 -0.872709
dtype: float64
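Note that df.unstack() here returns a single Series with a (column, date) MultiIndex; selecting one column label recovers an ordinary time series:
s = df.unstack()
s['A']  # the 'A' column as a Series indexed by date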
