I am looking for a way to convert a DataFrame to a TimeSeries without splitting the index and value columns. Any ideas? Thanks.
In [20]: import pandas as pd
In [21]: import numpy as np
In [22]: dates = pd.date_range('20130101',periods=6)
In [23]: df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
In [24]: df
Out[24]:
A B C D
2013-01-01 -0.119230 1.892838 0.843414 -0.482739
2013-01-02 1.204884 -0.942299 -0.521808 0.446309
2013-01-03 1.899832 0.460871 -1.491727 -0.647614
2013-01-04 1.126043 0.818145 0.159674 -1.490958
2013-01-05 0.113360 0.190421 -0.618656 0.976943
2013-01-06 -0.537863 -0.078802 0.197864 -1.414924
In [25]: pd.Series(df)
Out[25]:
0 A
1 B
2 C
3 D
dtype: object
I know this is late to the game here, but a few points.
Whether or not a DataFrame is considered a TimeSeries depends on the type of its index. In your case, your index is already a DatetimeIndex, so you are good to go. For more information on all the cool slicing you can do with a datetime index, take a look at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#datetime-indexing
Now, others might arrive here because they have a column 'DateTime' that they want to make an index, in which case the answer is simple:
ts = df.set_index('DateTime')
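If the 'DateTime' column is still stored as strings, you can parse it first and then set it as the index (a small sketch, assuming the column is literally named 'DateTime'):
df['DateTime'] = pd.to_datetime(df['DateTime'])
ts = df.set_index('DateTime')
ts.loc['2013-01']   # partial-string slicing now works on the DatetimeIndex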
Here is one possibility:
In [3]: df
Out[3]:
A B C D
2013-01-01 -0.024362 0.712035 -0.913923 0.755276
2013-01-02 2.624298 0.285546 0.142265 -0.047871
2013-01-03 1.315157 -0.333630 0.398759 -1.034859
2013-01-04 0.713141 -0.109539 0.263706 -0.588048
2013-01-05 -1.172163 -1.387645 -0.171854 -0.458660
2013-01-06 -0.192586 0.480023 -0.530907 -0.872709
In [4]: df.unstack()
Out[4]:
A 2013-01-01 -0.024362
2013-01-02 2.624298
2013-01-03 1.315157
2013-01-04 0.713141
2013-01-05 -1.172163
2013-01-06 -0.192586
B 2013-01-01 0.712035
2013-01-02 0.285546
2013-01-03 -0.333630
2013-01-04 -0.109539
2013-01-05 -1.387645
2013-01-06 0.480023
C 2013-01-01 -0.913923
2013-01-02 0.142265
2013-01-03 0.398759
2013-01-04 0.263706
2013-01-05 -0.171854
2013-01-06 -0.530907
D 2013-01-01 0.755276
2013-01-02 -0.047871
2013-01-03 -1.034859
2013-01-04 -0.588048
2013-01-05 -0.458660
2013-01-06 -0.872709
dtype: float64
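If you would rather have the dates as the outer level of the resulting MultiIndex (date first, then column), a small variation on the above (my own addition, not part of the original answer):
s = df.unstack()                 # MultiIndex of (column, date)
s = s.swaplevel().sort_index()   # MultiIndex of (date, column), sorted by date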
I have a dataframe with a DatetimeIndex. This is one of its columns:
>>> y.out_brd
2013-01-01 11:25:00 0.04464286
2013-01-01 11:30:00 NaN
2013-01-01 11:35:00 NaN
2013-01-01 11:40:00 0.005952381
2013-01-01 11:45:00 0.01785714
2013-01-01 11:50:00 0.008928571
Freq: 5T, Name: out_brd, dtype: object
When I try to use interpolate() on the column, absolutely nothing changes:
>>> y.out_brd.interpolate(method='time')
2013-01-01 11:25:00 0.04464286
2013-01-01 11:30:00 NaN
2013-01-01 11:35:00 NaN
2013-01-01 11:40:00 0.005952381
2013-01-01 11:45:00 0.01785714
2013-01-01 11:50:00 0.008928571
Freq: 5T, Name: out_brd, dtype: object
How do I make it work?
Update:
Here is the code for generating such a dataframe:
from datetime import datetime

time_index = pd.date_range(start=datetime(2013, 1, 1, 3),
                           end=datetime(2013, 1, 2, 2, 59),
                           freq='5T')
grid_columns = [u'in_brd', u'in_alt', u'out_brd', u'out_alt']
df = pd.DataFrame(index=time_index, columns=grid_columns)
After that I fill the cells with some data.
I have a dataframe field_data with survey data about boarding and alighting on a railroad, and a station variable.
I also have an interval_end function defined like this:
from datetime import timedelta

interval_end = lambda index, prec_lvl: index.to_datetime() \
    + timedelta(minutes=prec_lvl - 1,
                seconds=59)
The code:
for index, row in df.iterrows():
    recs = field_data[(field_data.station_name == station)
                      & (field_data.arrive_time >= index.time())
                      & (field_data.arrive_time <= interval_end(
                          index, prec_lvl).time())]
    in_recs_num = recs[recs.orientation == u'in'][u'train_number'].count()
    out_recs_num = recs[recs.orientation == u'out'][u'train_number'].count()
    if in_recs_num:
        df.loc[index, u'in_brd'] = recs[
            recs.orientation == u'in'][u'boarding'].sum() / \
            (in_recs_num * CAR_CAPACITY)
        df.loc[index, u'in_alt'] = recs[
            recs.orientation == u'in'][u'alighting'].sum() / \
            (in_recs_num * CAR_CAPACITY)
    if out_recs_num:
        df.loc[index, u'out_brd'] = recs[
            recs.orientation == u'out'][u'boarding'].sum() / \
            (out_recs_num * CAR_CAPACITY)
        df.loc[index, u'out_alt'] = recs[
            recs.orientation == u'out'][u'alighting'].sum() / \
            (out_recs_num * CAR_CAPACITY)
You need to convert your Series to have a dtype of float64 instead of its current object dtype. Here's an example to illustrate the difference. Note that, in general, object dtype Series are of limited use, the most common case being a Series containing strings. Other than that, they are very slow since they cannot take advantage of any data-type information.
In [9]: s = pd.Series(np.random.randn(6), index=pd.date_range('2013-01-01 11:25:00', freq='5T', periods=6), dtype=object)
In [10]: s.iloc[1:3] = np.nan
In [11]: s
Out[11]:
2013-01-01 11:25:00 -0.69522
2013-01-01 11:30:00 NaN
2013-01-01 11:35:00 NaN
2013-01-01 11:40:00 -0.70308
2013-01-01 11:45:00 -1.5653
2013-01-01 11:50:00 0.95893
Freq: 5T, dtype: object
In [12]: s.interpolate(method='time')
Out[12]:
2013-01-01 11:25:00 -0.69522
2013-01-01 11:30:00 NaN
2013-01-01 11:35:00 NaN
2013-01-01 11:40:00 -0.70308
2013-01-01 11:45:00 -1.5653
2013-01-01 11:50:00 0.95893
Freq: 5T, dtype: object
In [13]: s.astype(float).interpolate(method='time')
Out[13]:
2013-01-01 11:25:00 -0.6952
2013-01-01 11:30:00 -0.6978
2013-01-01 11:35:00 -0.7005
2013-01-01 11:40:00 -0.7031
2013-01-01 11:45:00 -1.5653
2013-01-01 11:50:00 0.9589
Freq: 5T, dtype: float64
I am late, but this solved my problem.
You need to assign the outcome back to a variable (or to the column itself).
y=y.out_brd.interpolate(method='time')
You could also do this without reassigning the DataFrame by using the inplace keyword:
y.out_brd.interpolate(method='time', inplace=True)
The short answer, from Phillip's answer above, which I missed the first time around:
You need to have a float series:
s.astype(float).interpolate(method='time')
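Putting the two answers together for the column in the question (assuming y is the DataFrame that holds out_brd), a minimal sketch:
y['out_brd'] = y['out_brd'].astype(float).interpolate(method='time')
This casts away the object dtype and writes the interpolated values back into the frame in one step.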
I've got a pandas dataframe organized by date that I'm trying to split up by year (using a column called 'year'). I want to return one dataframe per year, with a name something like "df19XX".
I was hoping to write a "For" loop that can handle this... something like...
for d in [1980, 1981, 1982]:
    df(d) = df[df['year']==d]
... which would return three data frames called df1980, df1981 and df1982.
thanks!
Something like this? (Also using @Andy's df.)
variables = locals()
for i in [2012, 2013]:
    variables["df{0}".format(i)] = df.loc[df.date.dt.year == i]
df2012
Out[118]:
A date
0 0.881468 2012-12-28
1 0.237672 2012-12-29
2 0.992287 2012-12-30
3 0.194288 2012-12-31
df2013
Out[119]:
A date
4 0.151854 2013-01-01
5 0.855312 2013-01-02
6 0.534075 2013-01-03
You can iterate through the groupby:
In [11]: df = pd.DataFrame({"date": pd.date_range("2012-12-28", "2013-01-03"), "A": np.random.rand(7)})
In [12]: df
Out[12]:
A date
0 0.434715 2012-12-28
1 0.208877 2012-12-29
2 0.912897 2012-12-30
3 0.226368 2012-12-31
4 0.100489 2013-01-01
5 0.474088 2013-01-02
6 0.348368 2013-01-03
In [13]: g = df.groupby(df.date.dt.year)
In [14]: for k, v in g:
    ...:     print(k)
    ...:     print(v)
    ...:     print()
    ...:
2012
A date
0 0.434715 2012-12-28
1 0.208877 2012-12-29
2 0.912897 2012-12-30
3 0.226368 2012-12-31
2013
A date
4 0.100489 2013-01-01
5 0.474088 2013-01-02
6 0.348368 2013-01-03
I would strongly argue that it is preferable to simply use a dict rather than creating lots of variables by messing around with the locals() dictionary (I claim that using locals() like that is not "pythonic"):
In [14]: {k: grp for k, grp in g}
Out[14]:
{2012: A date
0 0.434715 2012-12-28
1 0.208877 2012-12-29
2 0.912897 2012-12-30
3 0.226368 2012-12-31, 2013: A date
4 0.100489 2013-01-01
5 0.474088 2013-01-02
6 0.348368 2013-01-03}
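With the dict in hand you look frames up by year instead of by a synthesized variable name, for example:
dfs = {k: grp for k, grp in g}
dfs[2012]        # the 2012 rows
dfs[2013]['A']   # just column A for 2013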
Though you might consider calculating this on the fly (rather than storing it in a dict or indeed a variable), you can use get_group:
In [15]: g.get_group(2012)
Out[15]:
A date
0 0.865239 2012-12-28
1 0.019071 2012-12-29
2 0.362088 2012-12-30
3 0.031861 2012-12-31
How do I select multiple rows of a dataframe by a list of dates?
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
In[1]: df
Out[1]:
A B C D
2013-01-01 0.084393 -2.460860 -0.118468 0.543618
2013-01-02 -0.024358 -1.012406 -0.222457 1.906462
2013-01-03 -0.305999 -0.858261 0.320587 0.302837
2013-01-04 0.527321 0.425767 -0.994142 0.556027
2013-01-05 0.411410 -1.810460 -1.172034 -1.142847
2013-01-06 -0.969854 0.469045 -0.042532 0.699582
myDates = ["2013-01-02", "2013-01-04", "2013-01-06"]
So the output should be
A B C D
2013-01-02 -0.024358 -1.012406 -0.222457 1.906462
2013-01-04 0.527321 0.425767 -0.994142 0.556027
2013-01-06 -0.969854 0.469045 -0.042532 0.699582
You can use the index.isin() method to create a boolean mask for subsetting:
df[df.index.isin(myDates)]
Convert your list of dates into a DatetimeIndex:
df.loc[pd.to_datetime(myDates)]
A B C D
2013-01-02 -0.047710 -1.827593 -0.944548 -0.149460
2013-01-04 1.437924 0.126788 0.641870 0.198664
2013-01-06 0.408820 -1.842112 -0.287346 0.071397
If you have a timeseries containing hours and minutes in the index (e.g. 2022-03-07 09:03:00+00:00 instead of 2022-03-07), and you want to filter by dates (without hours, minutes, etc.), you can use the following:
df.loc[np.isin(df.index.date, pd.to_datetime(myDates).date)]
If you try df.loc[df.index.date.isin(myDates)] it will not work; Python throws an error saying 'numpy.ndarray' object has no attribute 'isin', which is why we use np.isin.
This is an old post but I think this can be useful to a lot of people (such as myself).
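Another option, assuming the index is a DatetimeIndex, is to normalize the timestamps to midnight and stay entirely within pandas (a sketch, not taken from the answers above):
wanted = pd.to_datetime(myDates)
df[df.index.normalize().isin(wanted)]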
I am trying to use the function numpy.unwrap to correct some phase data.
I have a long vector with 2678399 records which contains the difference in radians between 2 angles. The array contains nan values, although I thought that was not relevant since unwrap is applied to each record independently.
When I apply unwrap, from around record 400 onwards it generates nan values in the rest of the array.
If I apply np.unwrap to just one slice of the original array, it works fine.
Is that a possible bug in this function?
d90dif = (df2['d90'] - df2['d90avg']) * (np.pi / 180)  # difference between two angles, in radians
df2['d90dif'] = np.unwrap(d90dif.values)  # unwrap the array to create a new column
Just to illustrate the problem:
d90dif[700:705]  # angle difference for some records
2013-01-01 00:11:41 0.087808
2013-01-01 00:11:42 0.052901
2013-01-01 00:11:43 0.000541
2013-01-01 00:11:44 0.087808
2013-01-01 00:11:45 0.017995
dtype: float64
df2['d90dif'][700:705]  # results with unwrap for these records
2013-01-01 00:11:41 NaN
2013-01-01 00:11:42 NaN
2013-01-01 00:11:43 NaN
2013-01-01 00:11:44 NaN
2013-01-01 00:11:45 NaN
Name: d90dif, dtype: float64
Now I repeat the process with a small array:
test=d90dif[700:705]
2013-01-01 00:11:41 0.087808
2013-01-01 00:11:42 0.052901
2013-01-01 00:11:43 0.000541
2013-01-01 00:11:44 0.087808
2013-01-01 00:11:45 0.017995
dtype: float64
unw=np.unwrap(test.values)
array([ 0.08780774, 0.05290116, 0.00054128, 0.08780774, 0.01799457])
Now it is OK. If I pass a DataFrame as input to unwrap(), it works fine as well.
By looking at the documentation of unwrap, it seems that NaN would have an effect since the function is looking at differences of adjacent elements to detect jumps in the phase.
It seems that the nan values play an important role
test
2013-01-01 00:11:41 0.087808
2013-01-01 00:11:42 0.052901
2013-01-01 00:11:43 0.000541
2013-01-01 00:11:44 NaN
2013-01-01 00:11:45 0.017995
dtype: float64
If there is a nan in the column, everything from that point on becomes nan:
np.unwrap(test)
array([ 0.08780774, 0.05290116, 0.00054128, nan, nan])
I would say this is a bug but...
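Because unwrap accumulates a correction from the differences of adjacent elements, a single NaN poisons every value after it. One possible workaround (my own sketch, not part of the original answer; unwrap_ignoring_nan is a hypothetical helper) is to unwrap only the valid entries and leave the NaN positions untouched:
import numpy as np

def unwrap_ignoring_nan(series):
    # Unwrap only the non-NaN values, keeping NaNs where they were.
    result = series.copy()
    valid = series.notna()
    result[valid] = np.unwrap(series[valid].values.astype(float))
    return result

df2['d90dif'] = unwrap_ignoring_nan(d90dif)
Whether skipping over the gaps like this is acceptable depends on how long they are; a long gap can hide a genuine phase jump.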
I am using pandas TimeGrouper to group datapoints in a pandas dataframe in Python:
grouped = data.groupby(pd.TimeGrouper('30S'))
I would like to know if there's a way to achieve window overlap, as suggested in this question: Window overlap in Pandas, while keeping the pandas dataframe as the data structure.
Update: I tested the timing of the three solutions proposed below, and the rolling mean seems fastest:
%timeit df.groupby(pd.TimeGrouper('30s',closed='right')).mean()
%timeit df.resample('30s',how='mean',closed='right')
%timeit pd.rolling_mean(df,window=30).iloc[29::30]
yields:
1000 loops, best of 3: 336 µs per loop
1000 loops, best of 3: 349 µs per loop
1000 loops, best of 3: 199 µs per loop
Create some data exactly 3 x 30s long
In [51]: df = pd.DataFrame(np.random.randn(90, 2), columns=list('AB'), index=pd.date_range('20130101 9:01:01', freq='s', periods=90))
Using a TimeGrouper in this way is equivalent to resample (and that's what resample actually does under the hood).
Note that I used closed='right' to make sure that exactly 30 observations are included in each group.
In [57]: df.groupby(pd.TimeGrouper('30s',closed='right')).mean()
Out[57]:
A B
2013-01-01 09:01:00 -0.214968 -0.162200
2013-01-01 09:01:30 -0.090708 -0.021484
2013-01-01 09:02:00 -0.160335 -0.135074
In [52]: df.resample('30s',how='mean',closed='right')
Out[52]:
A B
2013-01-01 09:01:00 -0.214968 -0.162200
2013-01-01 09:01:30 -0.090708 -0.021484
2013-01-01 09:02:00 -0.160335 -0.135074
This is also equivalent if you then pick out the 30s intervals
In [55]: pd.rolling_mean(df,window=30).iloc[28:40]
Out[55]:
A B
2013-01-01 09:01:29 NaN NaN
2013-01-01 09:01:30 -0.214968 -0.162200
2013-01-01 09:01:31 -0.150401 -0.180492
2013-01-01 09:01:32 -0.160755 -0.142534
2013-01-01 09:01:33 -0.114918 -0.181424
2013-01-01 09:01:34 -0.098945 -0.221110
2013-01-01 09:01:35 -0.052450 -0.169884
2013-01-01 09:01:36 -0.011172 -0.185132
2013-01-01 09:01:37 0.100843 -0.178179
2013-01-01 09:01:38 0.062554 -0.097637
2013-01-01 09:01:39 0.048834 -0.065808
2013-01-01 09:01:40 0.003585 -0.059181
So depending on what you want to achieve, it's easy to do an overlap by using rolling_mean
and then picking out whatever 'frequency' you want. E.g. here is a 5s resample with a 30s window.
In [61]: pd.rolling_mean(df,window=30)[9::5]
Out[61]:
A B
2013-01-01 09:01:10 NaN NaN
2013-01-01 09:01:15 NaN NaN
2013-01-01 09:01:20 NaN NaN
2013-01-01 09:01:25 NaN NaN
2013-01-01 09:01:30 -0.214968 -0.162200
2013-01-01 09:01:35 -0.052450 -0.169884
2013-01-01 09:01:40 0.003585 -0.059181
2013-01-01 09:01:45 -0.055886 -0.111228
2013-01-01 09:01:50 -0.110191 -0.045032
2013-01-01 09:01:55 0.093662 -0.036177
2013-01-01 09:02:00 -0.090708 -0.021484
2013-01-01 09:02:05 -0.286759 0.020365
2013-01-01 09:02:10 -0.273221 -0.073886
2013-01-01 09:02:15 -0.222720 -0.038865
2013-01-01 09:02:20 -0.175630 0.001389
2013-01-01 09:02:25 -0.301671 -0.025603
2013-01-01 09:02:30 -0.160335 -0.135074
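A final note from me: pd.rolling_mean and the how= keyword to resample have since been removed from pandas, so on a newer version the equivalents are roughly (a sketch, same logic as above):
df.resample('30s', closed='right').mean()     # instead of df.resample('30s', how='mean', closed='right')
df.rolling(window=30).mean().iloc[29::30]     # instead of pd.rolling_mean(df, window=30).iloc[29::30]
df.rolling(window=30).mean()[9::5]            # 30s window sampled every 5 s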