I have been using the between_time method of TimeSeries in pandas, which returns all values between the specified times, regardless of their date.
But I need to select both date and time, because my time series contains multiple dates.
One way of solving this, though quite inflexible, is to just iterate over the values and remove those that are not relevant.
Is there a more elegant way of doing this?
You can select the dates that are of interest first, and then use between_time. For example, suppose you have a time series of 72 hours:
import pandas as pd
from numpy.random import randn
rng = pd.date_range('1/1/2013', periods=72, freq='H')
ts = pd.Series(randn(len(rng)), index=rng)
To select the values between 20:00 and 22:00 on the 2nd and 3rd of January, you can simply do:
ts['2013-01-02':'2013-01-03'].between_time('20:00', '22:00')
Giving you something like this:
2013-01-02 20:00:00 0.144399
2013-01-02 21:00:00 0.886806
2013-01-02 22:00:00 0.126844
2013-01-03 20:00:00 -0.464741
2013-01-03 21:00:00 1.856746
2013-01-03 22:00:00 -0.286726
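If the dates of interest aren't a contiguous slice, a variant (just a sketch, using DatetimeIndex.normalize and isin from later pandas versions) masks on the dates first and then applies between_time:
# pick out two non-adjacent dates, then restrict to the evening hours
days = pd.to_datetime(['2013-01-02', '2013-01-04'])
ts[ts.index.normalize().isin(days)].between_time('20:00', '22:00')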
Related
Let's say I have idx = pd.DatetimeIndex with one-minute frequency. I also have a list of bad dates (each of type pd.Timestamp, without the time information) that I want to remove from the original idx. How do I do that in pandas?
Use normalize to remove the time part from your index so you can do a simple ~ + isin selection, i.e. find the dates not in that bad list. If you need to be extra safe, you can further ensure your list of dates doesn't have a time part by normalizing it too, with [x.normalize() for x in bad_dates] (shown after the sample output below).
Sample Data
import pandas as pd
df = pd.DataFrame(range(9), index=pd.date_range('2010-01-01', freq='11H', periods=9))
bad_dates = [pd.Timestamp('2010-01-02'), pd.Timestamp('2010-01-03')]
Code
df[~df.index.normalize().isin(bad_dates)]
# 0
#2010-01-01 00:00:00 0
#2010-01-01 11:00:00 1
#2010-01-01 22:00:00 2
#2010-01-04 05:00:00 7
#2010-01-04 16:00:00 8
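If the bad dates might themselves carry a time component, the extra-safe step mentioned above looks like this:
# strip any time-of-day from the bad dates before the isin check
bad_dates = [d.normalize() for d in bad_dates]
df[~df.index.normalize().isin(bad_dates)]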
Here is some simple code using date_range and bracket indexing [] that I used with pandas:
period_start = '2013-01-01'
period_end = '2019-12-24'
print(pd.DataFrame({'close': aapl_close,
                    'returns': aapl_returns},
                   index=pd.date_range(start=period_start, periods=6)))
print(pd.DataFrame({'close': aapl_close,
                    'returns': aapl_returns})[period_start:'20130110'])
date_range gives NaN results
close returns
2013-01-01 NaN NaN
2013-01-02 NaN NaN
2013-01-03 NaN NaN
2013-01-04 NaN NaN
Indexing gives correct results
close returns
2013-01-02 00:00:00+00:00 68.732 0.028322
2013-01-03 00:00:00+00:00 68.032 -0.010184
2013-01-04 00:00:00+00:00 66.091 -0.028531
Based on how the dates are shown by date_range, I suppose the date format of date_range does not match the date format in the pandas DataFrame.
1) Can you please explain why it gives NaN?
2) What would you suggest to get a specific time range from the pandas DataFrame?
As I'm a beginner in Python and its libraries, I didn't realize that this question relates to the Quantopian library, not to pandas itself.
I got a solution on their forum. All the times returned by Quantopian's methods are timezone-aware with a timezone of 'UTC', whereas by default the date_range method returns timezone-naive dates. Simply supply the timezone information to the date_range method, like this:
pd.DataFrame({'close': aapl_close,
              'returns': aapl_returns},
             index=pd.date_range(start=period_start, periods=6, tz='UTC'))
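To see why the mismatch produces NaN: a tz-naive index never matches tz-aware timestamps when the DataFrame aligns the data, so every row comes back empty. A minimal sketch (aapl_close here is a hypothetical stand-in for the Quantopian data):
import pandas as pd
# tz-aware series, as Quantopian returns it
aapl_close = pd.Series([68.732, 68.032, 66.091],
                       index=pd.date_range('2013-01-02', periods=3, tz='UTC'))
naive = pd.date_range('2013-01-02', periods=3)            # no tz -> all NaN
aware = pd.date_range('2013-01-02', periods=3, tz='UTC')  # tz matches -> values align
print(pd.DataFrame({'close': aapl_close}, index=naive))
print(pd.DataFrame({'close': aapl_close}, index=aware))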
To get a specific date or time range in pandas, perhaps the easiest way is simple bracket notation. For example, to get dates between 2013-01-04 and 2013-01-08 (inclusive) simply enter this:
df = pd.DataFrame({'close': aapl_close, 'returns': aapl_returns})
my_selected_dates = df['2013-01-04':'2013-01-08']
This bracket notation is really shorthand for using the loc method:
my_selected_dates = df.loc['2013-01-04':'2013-01-08']
Both work the same, but the loc method has a bit more flexibility. This notation also works with datetimes if desired, as in the sketch below.
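For instance, the slice endpoints can carry a time component too (a minimal sketch, assuming df has a DatetimeIndex):
# rows from noon on the 4th through 18:00 on the 8th
df.loc['2013-01-04 12:00':'2013-01-08 18:00']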
I have a couple of million DateTime objects in pandas, and I could not find anything in the documentation about doing exploratory data analysis (EDA) on them.
It looks like every single row has the same time in either data frame:
DF1
Timestamp('2018-02-20 00:00:00')
or
DF2
Timestamp('2018-01-01 05:00:00')
Is there a way to use pandas to go through each column and check whether there is a difference in the hours/minutes/seconds?
Everything I have found is about calculating differences between times.
I have tried a couple of basic techniques but all I get back are simple descriptive numbers.
min(data['date'])
data['date'].nunique()
I have tried:
print(data['TIMESTAMP_UTC'])
This does show some dates that have different hours, but I need a way to manage this information:
0 2018-01-16 05:00:00
1 2018-05-04 04:00:00
2 2018-10-22 04:00:00
3 2018-01-02 05:00:00
4 2018-01-03 05:00:00
5 2018-01-04 05:00:00
6 2018-01-05 05:00:00
......
Ideally, I am looking for something that could spit out a .value_counts() of dates that deviate from everything else
You can convert the column values from str to datetime, then use the datetime machinery to handle them.
To convert your column values into datetime:
df['TIMESTAMP_UTC'] = pd.to_datetime(df['TIMESTAMP_UTC'])
# or equivalently, parsing each string by hand (note the %m month format):
# from datetime import datetime
# df['TIMESTAMP_UTC'] = df['TIMESTAMP_UTC'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
Then you can use the power of datetime to compare or extract information, for instance the hour of each timestamp:
df['TIMESTAMP_UTC'].dt.hour
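To get the .value_counts()-style view of deviating rows asked for above, one option (just a sketch, assuming the column has been converted as shown) is to tally the distinct clock times and filter out the dominant one:
# count how often each time-of-day occurs across the column
time_counts = df['TIMESTAMP_UTC'].dt.time.value_counts()
# rows whose time-of-day differs from the most common one
dominant = time_counts.index[0]
deviants = df[df['TIMESTAMP_UTC'].dt.time != dominant]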
So I have a DataFrame where the index is a date and one column consists of np.arrays with a shape of 180x360. What I want to do is calculate the weekly mean of the data set. Example of the DataFrame:
vika geop
1990-01-01 06:00:00 [[50995.954225, 50995.954225, 50995.954225, 50...
1990-01-02 06:00:00 [[51083.0576138, 51083.0576138, 51083.0576138,...
1990-01-03 06:00:00 [[51045.6321168, 51045.6321168, 51045.6321168,...
1990-01-04 06:00:00 [[50499.8436192, 50499.8436192, 50499.8436192,...
1990-01-05 06:00:00 [[49823.5114237, 49823.5114237, 49823.5114237,...
1990-01-06 06:00:00 [[50050.5148846, 50050.5148846, 50050.5148846,...
1990-01-07 06:00:00 [[50954.5188533, 50954.5188533, 50954.5188533,...
1990-01-08 06:00:00 [[50995.954225, 50995.954225, 50995.954225, 50...
1990-01-09 06:00:00 [[50628.1596088, 50628.1596088, 50628.1596088,...
What I've tried so far is simply:
df = df.resample('W-MON')
But I get this error:
pandas.core.groupby.DataError: No numeric types to aggregate
I've tried changing the datatype of the column to list, but it still does not work. Any idea how to do this with resample, or any other method?
You can use a Panel to represent 3D data:
import pandas as pd
import numpy as np
index = pd.date_range("2012/01/01", "2012/02/01")
p = pd.Panel(np.random.rand(len(index), 3, 4), items=index)
p.resample("W-MON")
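Note that Panel was deprecated in pandas 0.20 and removed in 0.25. On a modern pandas, one alternative (just a sketch, assuming every array in the column has the same shape) is to group the object column by week and average the stacked arrays:
import numpy as np
import pandas as pd
# small stand-in arrays instead of the real 180x360 grids
idx = pd.date_range('1990-01-01 06:00', periods=9, freq='D')
df = pd.DataFrame({'geop': [np.full((2, 3), float(i)) for i in range(9)]}, index=idx)
# the object dtype is why resample alone raises "No numeric types to aggregate";
# stacking each week into a 3-D array and averaging over the time axis works around it
weekly = pd.Series({week: np.stack(group.to_list()).mean(axis=0)
                    for week, group in df['geop'].groupby(pd.Grouper(freq='W-MON'))})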
I have a series of hourly prices. Each price is valid throughout the whole 1-hour period. What is the best way to represent these prices in Pandas that would enable me to index them in arbitrary higher frequencies (such as minutes or seconds) and do arithmetic with them?
Data specifics
Sample prices might be:
>>> prices = Series(randn(5), pd.date_range('2013-01-01 12:00', periods = 5, freq='H'))
>>> prices
2013-01-01 12:00:00 -1.001692
2013-01-01 13:00:00 -1.408082
2013-01-01 14:00:00 -0.329637
2013-01-01 15:00:00 1.005882
2013-01-01 16:00:00 1.202557
Freq: H
Now, what representation to use if I want the value at 13:37:42 (I expect it to be the same as at 13:00)?
>>> prices['2013-01-01 13:37:42']
...
KeyError: <Timestamp: 2013-01-01 13:37:42>
Resampling
I know I could resample the prices and fill in the details (ffill, right?), but that doesn't seem like such a nice solution, because I would have to assume the frequency I'm going to be indexing at, and it reduces readability with too many unnecessary data points.
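For reference, that dismissed approach would look something like this (on a recent pandas, where resample returns a resampler object):
# upsample to 1-minute frequency and forward-fill the hourly price
minute_prices = prices.resample('min').ffill()
minute_prices['2013-01-01 13:37']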
Time spans
At first glance a PeriodIndex seems to work
>>> price_periods = prices.to_period()
>>> price_periods['2013-01-01 13:37:42']
-1.408082
But a time-spanned series doesn't offer some of the other functionality I expect from a Series. Say I have another series, amounts, that says how many items I bought at a certain moment. If I wanted to calculate the prices paid, I would want to multiply the two series:
>>> amounts = Series([1,2,2], pd.DatetimeIndex(['2013-01-01 13:37', '2013-01-01 13:57', '2013-01-01 14:05']))
>>> amounts*price_periods
but that yields an exception and sometimes even freezes my IPy Notebook. Indexing doesn't help either.
>>> price_periods[amounts.index]
Are PeriodIndex structures still a work in progress, or are these features not going to be added? Is there maybe some other structure I should have used (or should use for now, before PeriodIndex matures)? I'm using pandas version 0.9.0.dev-1e68fd9.
Check asof
prices.asof('2013-01-01 13:37:42')
returns the value for the previous available datetime:
prices['2013-01-01 13:00:00']
To make calculations, you can use:
prices.asof(amounts.index) * amounts
which returns a Series with amounts' index and the respective values:
>>> prices
2013-01-01 12:00:00 0.943607
2013-01-01 13:00:00 -1.019452
2013-01-01 14:00:00 -0.279136
2013-01-01 15:00:00 1.013548
2013-01-01 16:00:00 0.929920
>>> prices.asof(amounts.index)
2013-01-01 13:37:00 -1.019452
2013-01-01 13:57:00 -1.019452
2013-01-01 14:05:00 -0.279136
>>> prices.asof(amounts.index) * amounts
2013-01-01 13:37:00 -1.019452
2013-01-01 13:57:00 -2.038904
2013-01-01 14:05:00 -0.558272
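An equivalent spelling of the same lookup (a sketch, same data as above) uses reindex with forward-fill:
# forward-fill the hourly prices onto amounts' timestamps, then multiply
prices.reindex(amounts.index, method='ffill') * amounts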