I have a pandas Series series with a lot of entries, and I calculate
series.rolling(window=25000,center=False).mean()
This calculation can take a very long time, since there is a lot of data.
Now I want to add new data to the series (data that only becomes available over time and did not exist before the original calculation).
If I calculate the rolling mean again from scratch, it will once more take a very long time. Is there a way to speed up the second calculation by reusing results from the first one (possibly with another library)?
Assume the new data arrives in a Series new_series. Only the rolling mean over the last window-1 points of the original series plus the new points needs to be recomputed; everything before that is unchanged (Series.append is gone in recent pandas, so pd.concat is used here):
r1 = series.rolling(window=25000, center=False).mean()
r2 = pd.concat([series.iloc[-24999:], new_series]).rolling(window=25000, center=False).mean().dropna()
r = pd.concat([r1, r2])
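As a rough, self-contained sanity check of this idea (the random data, series lengths and names below are made up for illustration):

import numpy as np
import pandas as pd

window = 25000  # same window as above; shrink it to test quickly
series = pd.Series(np.random.randn(100_000))
new_series = pd.Series(np.random.randn(5_000),
                       index=range(len(series), len(series) + 5_000))

# Full recomputation (slow) versus incremental update (fast).
full = pd.concat([series, new_series]).rolling(window).mean()

r1 = series.rolling(window).mean()
tail = pd.concat([series.iloc[-(window - 1):], new_series])
r2 = tail.rolling(window).mean().dropna()
incremental = pd.concat([r1, r2])

# Both approaches agree wherever a full window is available.
assert np.allclose(full.dropna(), incremental.dropna())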
I am currently working with a large dataset (about 40 columns and tens of thousands of rows) and I would like to undersample it so that it is easier to work with.
For the undersampling, unlike the resample method from pandas, which resamples according to a timedelta, I am trying to specify conditions on other columns to determine which data points to keep.
I'm not sure this is very clear, but for example, say I have 3 columns (index, time and temperature) as follows:
Now for the resampling, I would like to keep a data point every 1 s or every 2 °C; the resulting dataset would look like this:
I couldn't find a simple way of doing this with pandas. The only way I found was to iterate over the rows, but that was very slow because of the size of my datasets.
I thought about using the diff method, but of course it can only take the difference over a fixed period; the same goes for pct_change, which could otherwise have been used to keep only the points in the regions where the variation is largest, to guide the undersampling.
Thanks in advance for any suggestions on how to proceed with this resampling.
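One way around row-by-row iteration with iterrows is to walk over plain numpy arrays and keep a row whenever either threshold is crossed relative to the last kept row. A minimal sketch, assuming a numeric time column in seconds and the thresholds from the example (the column names and the function name are assumptions):

def undersample(df, time_col="time", temp_col="temperature",
                max_dt=1.0, max_dtemp=2.0):
    # Keep a row whenever the time since the last kept row reaches max_dt
    # (seconds) or the temperature has drifted by at least max_dtemp (degrees).
    t = df[time_col].to_numpy(dtype=float)
    temp = df[temp_col].to_numpy(dtype=float)
    keep = [0]                      # always keep the first row
    t_ref, temp_ref = t[0], temp[0]
    for i in range(1, len(t)):
        if (t[i] - t_ref) >= max_dt or abs(temp[i] - temp_ref) >= max_dtemp:
            keep.append(i)
            t_ref, temp_ref = t[i], temp[i]
    return df.iloc[keep]

This is still a Python-level loop, but over raw arrays it is far cheaper than iterating DataFrame rows; if it is still too slow, a function of this shape is also a good candidate for numba.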
I want to window a pandas Series which has a DatetimeIndex to the last X seconds. Usually I'd use pandas.Series.rolling for windowing. However, the datetime indices are not equidistant, which means I cannot reliably translate the time span into a number of data points. How can I implement time-based windowing (e.g. by implementing a BaseIndexer subclass and passing it to the window parameter of rolling())?
The easiest way I came up with to get the last X seconds from a datetime-indexed series is to take newest_timestamp = series.index.max(), calculate the oldest timestamp to consider from it as oldest_timestamp = newest_timestamp - pd.to_timedelta(<X-seconds>, unit='s'), and slice windowed_series = series[oldest_timestamp:newest_timestamp]. Because oldest_timestamp is calculated rather than extracted from the series, the slice boundary will usually not match an existing index value exactly, but that does not matter because it is handled automatically.
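Put together, the approach described above could look roughly like this (the function name is just for illustration):

import pandas as pd

def last_x_seconds(series, seconds):
    # Trailing time-based window on a Series with a sorted DatetimeIndex.
    newest_timestamp = series.index.max()
    oldest_timestamp = newest_timestamp - pd.to_timedelta(seconds, unit='s')
    # Label-based slicing on a sorted DatetimeIndex includes both endpoints
    # and tolerates a start label that is not an actual index value.
    return series[oldest_timestamp:newest_timestamp]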
NOTE: series.rolling() is usually used in a time series data pre-processing context (e.g. weighting samples within a window dependent on a function as part of a forecasting application) not for plain windowing use cases.
At the moment I'm trying to do a time series analysis in Python. My data is stored in a pandas DataFrame, and I now want to calculate, e.g., the mean over smaller time windows (for the data in the columns). I want to save the new values directly into an array. If I only had a small dataset, I would do it manually like this:
dataframe1.values (to convert it into an array)
array2 = np.array([dataframe1[0:1000].mean(), dataframe1[1000:2000].mean(), ...])
Well, but I have a really large dataset, so it would take very long to do it this way by hand. I thought about solving the problem with a loop, but I don't really know how to write one when I want to save the new values directly into an array. Thanks in advance :)
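One loop-free way to get the same result is to label the rows in consecutive blocks and take the mean per block; a short sketch, assuming non-overlapping blocks of 1000 rows as in the example above:

import numpy as np

chunk = 1000  # block length from the example above
# Group label 0 for rows 0..999, 1 for rows 1000..1999, and so on.
labels = np.arange(len(dataframe1)) // chunk
array2 = dataframe1.groupby(labels).mean().to_numpy()
# array2 has one row per block and one column of means per original column.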
I have a dictionary of pandas Series, each with its own index and all containing float numbers.
I need to create a pandas DataFrame with all these series, which works fine by just doing:
result = pd.DataFrame( dict_of_series )
Now, I actually have to do this a large number of times, along with some heavy calculation (we're in a Monte-Carlo engine).
I noticed that this line is where my code spends most of its time, at least once I sum up all the times it is called.
I thought about caching the result, but unfortunately dict_of_series is different almost every time.
I guess what takes the time is that the constructor has to build the global index and fill in the holes, and maybe there is simply no way around that, but I'm wondering whether I'm missing something obvious that slows the process down, or whether there is something smarter I could do to speed it up.
Has anybody had the same experience?
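For what it's worth, a sketch of one thing to try: if the series happen to share an identical index (which the question does not guarantee), the alignment step can be skipped by stacking the raw values directly:

import numpy as np
import pandas as pd

keys = list(dict_of_series)
# Assumes every Series carries exactly the same index; check this before relying on it.
common_index = dict_of_series[keys[0]].index
data = np.column_stack([dict_of_series[k].to_numpy() for k in keys])
result = pd.DataFrame(data, index=common_index, columns=keys)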
I'm new to Pandas and would like some insight from the pros. I need to perform various statistical analyses (multiple regression, correlation etc) on >30 time series of financial securities' daily Open, High, Low, Close prices. Each series has 500-1500 days of data. As each analysis looks at multiple securities, I'm wondering if it's preferable from an ease of use and efficiency perspective to store each time series in a separate df, each with date as the index, or to merge them all into a single df with a single date index, which would effectively be a 3d df. If the latter, any recommendations on how to structure it?
Any thoughts much appreciated.
PS. I'm working my way up to working with intraday data across multiple timezones but that's a bit much for my first pandas project; this is a first step in that direction.
Since you're only dealing with OHLC, it's not that much data to process, so that's good.
For these types of things I usually use a MultiIndex (http://pandas.pydata.org/pandas-docs/stable/indexing.html) with symbol as the first level and date as the second. Then you can have just the columns OHLC and you're all set.
To access a MultiIndex, use the .xs method.
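A minimal sketch of that layout, assuming two date-indexed OHLC frames df_aapl and df_msft (the symbol names and the example date are made up):

import pandas as pd

frames = {"AAPL": df_aapl, "MSFT": df_msft}
panel = pd.concat(frames, names=["symbol", "date"])  # MultiIndex of (symbol, date)

# All rows for one symbol...
aapl = panel.xs("AAPL", level="symbol")
# ...or the cross-section of every symbol on one (made-up) date.
snapshot = panel.xs(pd.Timestamp("2015-06-01"), level="date")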
Unless you are going to correlate everything with everything, my suggestion is to put the series into separate dataframes and keep them all in a dictionary, i.e. {"Timeseries1": df1, "Timeseries2": df2, ...}. Then, when you want to correlate some of the time series, you can merge the relevant frames, adding suffixes to the columns of each df to tell them apart.
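For example, a small sketch of that merge step (the frame names, suffixes and the Close column are placeholders):

dfs = {"Timeseries1": df1, "Timeseries2": df2}  # frames from the dictionary above
# Align on the date index and disambiguate the identically named OHLC columns.
merged = dfs["Timeseries1"].join(dfs["Timeseries2"],
                                 lsuffix="_ts1", rsuffix="_ts2", how="inner")
merged[["Close_ts1", "Close_ts2"]].corr()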
You are probably also interested in the talk Python for Financial Data Analysis with pandas by the author of pandas himself.