python dataframe time series division into smaller time windows loop

At the moment I'm trying to do a time series analysis in Python. My data is stored in a pandas DataFrame, and I now want to calculate e.g. the mean of the columns over smaller time windows and save the new values directly into an array. If I only had a small dataset I would do it manually, like this:
dataframe1.values (to convert it into an array)
array2 = np.array([dataframe1[0:1000].mean(), dataframe1[1001:2000].mean(), ...])
But I have a really large dataset, so doing it this way by hand would take very long. I thought about solving the problem with a loop, but I don't really know how to do that when I want to save the new values directly into an array. Thanks in advance :)
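A minimal sketch of one way to do this, assuming fixed-size windows of 1000 rows and numeric columns (the window size and the name dataframe1 come from the question; everything else is illustrative):

import numpy as np
import pandas as pd

# Illustrative stand-in for the real data (assumption: numeric columns only).
dataframe1 = pd.DataFrame(np.random.rand(10_500, 3), columns=["a", "b", "c"])

window = 1000  # rows per time window

# Option 1: explicit loop, collecting one row of column means per window.
means = []
for start in range(0, len(dataframe1), window):
    means.append(dataframe1.iloc[start:start + window].mean().to_numpy())
array2 = np.array(means)  # shape: (n_windows, n_columns)

# Option 2: vectorised equivalent using groupby on a window index.
array2_alt = dataframe1.groupby(np.arange(len(dataframe1)) // window).mean().to_numpy()

If the DataFrame has a DatetimeIndex, dataframe1.resample(...).mean() with a suitable frequency would be the more idiomatic way to get time-based windows.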

Related

Undersampling a large dataset under a specific condition applied to another column in python/pandas

I'm currently working with a large dataset (about 40 columns and tens of thousands of rows) and I would like to undersample it to be able to work with it more easily.
For the undersampling, unlike the resample method from pandas, which resamples according to a timedelta, I want to specify conditions on other columns to determine which data points to keep.
I'm not sure this is very clear, but for example, let's say I have 3 columns (index, time and temperature) as follows:
Now for the resampling, I would like to keep a data point every 1 s or every 2 °C; the resulting dataset would look like this:
I couldn't find a simple way of doing this with pandas. The only way I found was to iterate over the rows, but that was very slow because of the size of my datasets.
I thought about using the diff method, but it only computes the difference over a specified period; the same goes for pct_change, which could have been used to keep only the points in the regions where the variations are largest.
Thanks in advance if you have any suggestions on how to proceed with this resampling.
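One possible sketch, using a sequential loop over plain NumPy arrays rather than over DataFrame rows (the column names time and temperature are assumed; the 1 s and 2 °C thresholds come from the post). Because each kept point resets the reference for the next comparison, the logic is inherently sequential, but working on NumPy arrays keeps the loop reasonably fast:

import numpy as np
import pandas as pd

def undersample(df, time_col="time", temp_col="temperature", dt_max=1.0, dtemp_max=2.0):
    # Keep a row whenever time has advanced by dt_max seconds or the
    # temperature has changed by dtemp_max degrees since the last kept row.
    t = df[time_col].to_numpy(dtype=float)
    temp = df[temp_col].to_numpy(dtype=float)

    keep = np.zeros(len(df), dtype=bool)
    keep[0] = True
    last_t, last_temp = t[0], temp[0]

    for i in range(1, len(df)):
        if (t[i] - last_t) >= dt_max or abs(temp[i] - last_temp) >= dtemp_max:
            keep[i] = True
            last_t, last_temp = t[i], temp[i]

    return df[keep]

# Usage with made-up data:
df = pd.DataFrame({"time": np.arange(0, 100, 0.1),
                   "temperature": np.cumsum(np.random.randn(1000))})
reduced = undersample(df)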

Fastest way to perform math on arrays that are constantly expanding?

I'm writing a Python script which takes signals from various components, such as an accelerometer, GPS position data etc., and saves all of this data in one class (which I'm calling a signal class). This way it doesn't matter where the acceleration data is coming from, because it will all be processed into the same format. Every time there is a new piece of data, I currently append the new value to a growing list. I chose a list because I believe it is dynamic, so you can add data without too much computational cost, compared to NumPy arrays, which are fixed-size.
However, I also need to perform mathematical operations on these datasets in near real time. Would it be faster to:
Store the data initially as a numpy array, and expand it as data is added
Store the data as an expanding list, and every time some math needs to be performed on the data convert what is needed into a numpy array and then use numpy functions
Keep all of the data as lists, and write custom functions to perform the math.
Some other method that I don't know about?
The update rate varies depending on where the data comes from, anywhere from 1 Hz to 1000 Hz.
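A minimal sketch of two of the options above (the SignalBuffer class is a hypothetical helper, not part of the post): keep a plain Python list and convert to a NumPy array only when maths is needed, or keep a preallocated NumPy buffer that doubles in size when it fills up, so appends stay cheap on average:

import numpy as np

class SignalBuffer:
    # Preallocated, growable buffer; doubling the capacity amortises the
    # cost of growth so each append is cheap on average.
    def __init__(self, capacity=1024):
        self._data = np.empty(capacity)
        self._size = 0

    def append(self, value):
        if self._size == len(self._data):
            new_data = np.empty(2 * len(self._data))
            new_data[:self._size] = self._data[:self._size]
            self._data = new_data
        self._data[self._size] = value
        self._size += 1

    def view(self):
        # Zero-copy view of the valid samples, ready for NumPy maths.
        return self._data[:self._size]

# Alternative: plain list, converted only when maths is needed.
samples = []
samples.append(9.81)
mean_so_far = np.mean(samples)  # the conversion happens inside np.mean

buf = SignalBuffer()
buf.append(9.81)
print(buf.view().mean())

Which approach wins in practice depends on how often the maths runs relative to the appends, so it is worth timing both on representative data.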

Making Dataframe Analysis faster

I am using three dataframes to analyze sequential numeric data - basically numeric data captured over time. There are 8 columns and 360k entries. I created three identical dataframes - one holds the raw data, the second is a "scratch pad" for analysis, and the third contains the analyzed outcome. This runs really slowly. I'm wondering if there are ways to make this analysis run faster? Would it be faster if, instead of three separate 8-column dataframes, I had one large 24-column dataframe?
Use cProfile and line_profiler to figure out where the time is being spent.
To get help from others, post your real code and your real profile results.
Optimization is an empirical process. The little tips people have are often counterproductive.
Most probably it doesn't matter because pandas stores each column separately anyway (DataFrame is a collection of Series). But you might get better data locality (all data next to each other in memory) by using a single frame, so it's worth trying. Check this empirically.
Rereading this post, I realize I could have been clearer. I have been using write statements like:
dm.iloc[p,XCol] = dh.iloc[x,XCol]
to transfer individual cells of one dataframe (dh) to a different row of a second dataframe (dm). It ran very slowly, but I needed this specific file sorted and I just lived with the performance.
According to "Learning Pandas" by Michael Heydt, pg 146, .iat is faster than .iloc for extracting (or writing) scalar values from a dataframe. I tried it and it works. With my original 300k-row files, run time was 13 hours (!) using .iloc; the same data file using .iat ran in about 5 minutes.
Net - this is faster:
dm.iat[p,XCol] = dh.iat[x,XCol]
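A small self-contained sketch of the difference (the frame names dm and dh and the column position are illustrative, not the poster's real data):

import time
import numpy as np
import pandas as pd

dh = pd.DataFrame(np.random.rand(50_000, 8))
dm = pd.DataFrame(np.zeros((50_000, 8)))
XCol = 3  # illustrative column position

start = time.perf_counter()
for i in range(len(dh)):
    dm.iloc[i, XCol] = dh.iloc[i, XCol]  # general positional indexer
t_iloc = time.perf_counter() - start

start = time.perf_counter()
for i in range(len(dh)):
    dm.iat[i, XCol] = dh.iat[i, XCol]  # scalar-only fast path
t_iat = time.perf_counter() - start

print(f".iloc: {t_iloc:.2f}s  .iat: {t_iat:.2f}s")

That said, if the whole transfer can be expressed as a vectorised assignment (e.g. copying a column or a boolean-indexed slice at once), avoiding the Python-level loop entirely is usually faster than either accessor.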

How can I improve the creation time of a pandas DataFrame?

I have a dictionary of pandas Series, each with its own index and all containing floats.
I need to create a pandas DataFrame with all these series, which works fine by just doing:
result = pd.DataFrame( dict_of_series )
Now, I actually have to do this a large number of times, along with some heavy calculation (we're in a Monte-Carlo engine).
I noticed that this line is where my code spends the most time, summed over all the times it is called.
I thought about caching the result, but unfortunately dict_of_series is different almost every time.
I guess that what takes time is that the constructor has to build the union index and fill the holes, and maybe there is simply no way around it, but I'm wondering whether I'm missing something obvious that slows the process down, or whether there is something smarter I could do to speed it up.
Has anybody had the same experience?
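One direction worth trying, under the assumption (not stated in the post) that the set of indexes is known ahead of the Monte-Carlo loop even though the values change each iteration: compute the union index once, reindex each Series to it, and build the frame from already-aligned NumPy arrays so the constructor does not have to align anything itself:

import numpy as np
import pandas as pd

# Hypothetical input: float Series with overlapping but different indexes.
dict_of_series = {
    "a": pd.Series(np.random.rand(3), index=[0, 1, 2]),
    "b": pd.Series(np.random.rand(3), index=[1, 2, 3]),
    "c": pd.Series(np.random.rand(3), index=[0, 2, 3]),
}

# Baseline: the constructor aligns everything on every call.
result = pd.DataFrame(dict_of_series)

# Possible optimisation: build the union index once (outside the hot loop,
# if the index structure is stable) and pass plain aligned arrays.
union_index = pd.Index(sorted(set().union(*(s.index for s in dict_of_series.values()))))
aligned = {k: s.reindex(union_index).to_numpy() for k, s in dict_of_series.items()}
result_fast = pd.DataFrame(aligned, index=union_index)

assert result.equals(result_fast)

Whether this actually helps depends on how much of the cost is alignment versus copying, so it needs to be confirmed with a profiler on the real workload.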

Running filter over a large amount of data points and a long time period?

I need to apply two running filters on a large amount of data. I have read that creating variables on the fly is not a good idea, but I wonder if it still might be the best solution for me.
My question:
Can I create arrays in a loop with the help of a counter (array1, array2, ...) and then refer to them via the counter (something like 'array' + str(counter) or 'array' + str(counter - 1))?
Why I want to do it:
The data are 400x700 arrays for 15-minute time steps over a year (so I have about 35,000 400x700 arrays). Each time step is read into Python individually. Now I need to apply one running filter that checks whether the last four time steps are equal (element-wise) and, if they are, sets all four values to zero. The next filter uses the data after the first filter has run and checks whether the sum of the last twelve time steps exceeds a certain value. When both filters are done I want to sum up the values, so that at the end of the year I have one 400x700 array with the filtered, accumulated values.
I do not have enough memory to read in all the data at once. So I thought I could create a loop where, for each time step, a new variable for the 400x700 array is created and the two filters run. Once older arrays have been filtered, I could add them to the yearly sum and delete them, so that I never have more than 16 (4 + 12) time steps (arrays) in memory at a time.
I don't know if it's correct of me to ask such a question without any code to show, but I would really appreciate the help.
If your question is about the best data structure to keep a certain number of arrays in memory, then in this case I would suggest using a three-dimensional array. Its shape would be (400, 700, 12), since twelve is how many time steps you need to look back at. The advantage of this is that your memory use stays constant, since you load new arrays into the larger one. The disadvantage is that you need to shift the arrays manually.
If you don't want to deal with the shifting yourself I'd suggest using a deque with a maxlen of 12.
"Can I create arrays in a loop with the help of a counter (array1, array2…) and then call them with the counter (something like: ‘array’+str(counter) or ‘array’+str(counter-1)?"
This is a very common question that I think a lot of programmers will face eventually. Two examples for Python on Stack Overflow:
generating variable names on fly in python
How do you create different variable names while in a loop? (Python)
The lesson to learn from this is to not use dynamic variable names, but instead put the pieces of data you want to work with in an encompassing data structure.
The data structure could e.g. be a list, dict or NumPy array. The collections.deque proposed by @Midnighter also seems to be a good candidate for such a running filter.
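A minimal sketch of the deque-based bookkeeping, assuming each time step arrives as a 400x700 NumPy array from some load_step() function (hypothetical, as is the threshold value); the actual filter rules from the question are only indicated with placeholder comments:

from collections import deque
import numpy as np

SHAPE = (400, 700)

def load_step(i):
    # Hypothetical stand-in for reading one 15-minute 400x700 array from disk.
    return np.random.rand(*SHAPE)

history = deque(maxlen=12)      # only the last 12 time steps stay in memory
yearly_sum = np.zeros(SHAPE)

for i in range(35_000):
    step = load_step(i)
    history.append(step)

    # First filter: are the last four steps element-wise equal?
    if len(history) >= 4:
        last4 = list(history)[-4:]
        all_equal = np.logical_and.reduce(
            [np.isclose(last4[k], last4[k + 1]) for k in range(3)]
        )
        # ... zero out the affected cells in the last four steps here ...

    # Second filter: sum over the (up to) 12-step window.
    window_sum = np.sum(history, axis=0)
    # ... apply the threshold rule to window_sum here ...

    yearly_sum += step          # accumulate toward the single yearly 400x700 array

Because maxlen=12, appending the thirteenth step automatically drops the oldest one, so memory use stays bounded without any manual shifting.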
