Fastest way to perform math on arrays that are constantly expanding? - python

I'm writing a Python script which takes signals from various components, such as an accelerometer, GPS position data, etc., and then saves all of this data in one class (I'm calling it a signal class). This way, it doesn't matter where the acceleration data is coming from, because it will all be processed into the same format. Every time there is a new piece of data, I currently add this new value to an expanding list. I chose a list as I believe it is dynamic, so you can add data without too much computational cost, compared to numpy arrays, which are static.
However, I also need to perform mathematical operations on these datasets in near real time. Would it be faster to:
Store the data initially as a numpy array, and expand it as data is added
Store the data as an expanding list, and every time some math needs to be performed on the data convert what is needed into a numpy array and then use numpy functions
Keep all of the data as lists, and write custom functions to perform the math.
Some other method that I don't know about?
The update rate varies depending on where the data comes from, anywhere from 1 Hz to 1000 Hz.
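For reference, a minimal sketch of the first two options above, assuming scalar samples (the names and helper functions are illustrative, not from any real code):

import numpy as np

# Expanding-list option: append raw samples to a plain list (amortised O(1) per
# append) and convert to an ndarray only when maths is actually needed.
samples = []

def on_new_sample(value):
    samples.append(value)

def current_mean():
    return np.asarray(samples, dtype=float).mean()

# Expanding-ndarray option: np.append reallocates and copies the whole array on
# every call, so each update costs O(n) as the signal grows.
signal = np.empty(0)

def on_new_sample_np(value):
    global signal
    signal = np.append(signal, value)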

Related

Data type to save expanding data for data logging in Python

I am writing a serial data logger in Python and am wondering which data type would be best suited for this. Every few milliseconds a new value is read from the serial interface and is saved into my variable along with the current time. I don't know how long the logger is going to run, so I can't preallocate for a known size.
Intuitively I would use a numpy array for this, but appending/concatenating elements creates a new array each time, from what I've read.
So what would be the appropriate data type to use for this?
Also, what would be the proper vocabulary to describe this problem?
Python doesn't have arrays as you think of them in most languages. It has "lists", which use the standard array syntax myList[0], but unlike arrays, lists can change size as needed. Using myList.append(newItem), you can add more data to the list without any trouble on your part.
Since you asked for the proper vocabulary, a useful concept to know would be "linked lists", which are a way of implementing array-like structures with varying lengths in other languages.
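As a rough illustration of the list-append approach (the record helper and the use of time.time() are just one possible choice, not from the question):

import time

log = []                      # a plain list; append is amortised O(1)

def record(value):
    # store each reading together with the moment it arrived
    log.append((time.time(), value))

record(0.13)                  # e.g. two readings from the serial port
record(0.17)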

Python Array Data Structure with History

I recently needed to store large array-like data (sometimes numpy, sometimes key-value indexed) whose values would be changed over time (t=1 one element changes, t=2 another element changes, etc.). This history needed to be accessible (some time in the future, I want to be able to see what t=2’s array looked like).
An easy solution was to keep a list of arrays for all timesteps, but this became too memory intensive. I ended up writing a small class that handled this by keeping all data “elements” in a dict, with each element represented by a list of (this_value, timestamp_for_this_value) pairs. That let me recreate things for arbitrary timestamps by looking for the last change before some time t, but it was surely not as efficient as it could have been.
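A minimal sketch of that kind of class (names are illustrative; it assumes numeric timestamps and values so the tuple comparison in bisect works):

import bisect

class HistoryDict:
    # per key, keep a list of (timestamp, value) pairs in insertion order
    def __init__(self):
        self._hist = {}

    def set(self, key, value, t):
        self._hist.setdefault(key, []).append((t, value))

    def get(self, key, t):
        # value of `key` as it was at time t: the last change at or before t
        pairs = self._hist[key]
        i = bisect.bisect_right(pairs, (t, float('inf'))) - 1
        if i < 0:
            raise KeyError(key)
        return pairs[i][1]

h = HistoryDict()
h.set('a', 1.0, t=1)
h.set('a', 2.5, t=4)
print(h.get('a', t=2))        # -> 1.0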
Are there data structures available for python that have these properties natively? Or some sort of class of data structure meant for this kind of thing?
Have you considered writing a log file? A good use of memory would be to have the arrays contain only the current relevant values but build in a procedure where the update statement could trigger a logging function. This function could write to a text file, database or an array/dictionary of some sort. These types of audit trails are pretty common in the database world.
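A tiny sketch of that idea, appending one audit row per update to a CSV file (the update helper and file name are hypothetical):

import csv, time
import numpy as np

def update(arr, index, new_value, logfile='audit_log.csv'):
    # change the current array in place and append one audit-trail row
    old_value = arr[index]
    arr[index] = new_value
    with open(logfile, 'a', newline='') as fh:
        csv.writer(fh).writerow([time.time(), index, old_value, new_value])

state = np.zeros(5)
update(state, 2, 7.5)         # memory holds only the current values; history is on disk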

Is there a way to get a numpy-style view to a slice of an array stored in a hdf5 file?

I have to work on large 3D cubes of data. I want to store them in HDF5 files (using h5py or maybe pytables). I often want to perform analysis on just a section of these cubes. This section is too large to hold in memory. I would like to have a numpy-style view into my slice of interest, without copying the data to memory (similar to what you could do with a numpy memmap). Is this possible? As far as I know, when you perform a slice using h5py, you get a numpy array in memory.
It has been asked why I would want to do this, since the data has to enter memory at some point anyway. My code, out of necessity, already runs piecemeal over data from these cubes, pulling small bits into memory at a time. These functions are simplest if they simply iterate over the entirety of the datasets passed to them. If I could have a view into the data on disk, I could simply pass this view to these functions unchanged. If I cannot have a view, I need to rewrite all my functions to only iterate over the slice of interest. This will add complexity to the code and make human error during analysis more likely.
Is there any way to get a view to the data on disk, without copying to memory?
One possibility is to create a generator that yields the elements of the slice one by one. Once you have such a generator, you can pass it to your existing code and iterate through the generator as normal. As an example, you can use a for loop on a generator, just as you might use it on a slice. Generators do not store all of their values at once, they 'generate' them as needed.
You might be able to create a slice of just the locations of the cube you want, but not the data itself, or you could generate the next location of your slice programmatically if you have too many locations to store in memory as well. A generator could use those locations to yield the data they contain one by one.
Assuming your slices are the (possibly higher-dimensional) equivalent of cuboids, you might generate coordinates using nested for-range() loops, or by applying product() from the itertools module to range objects.
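A sketch of such a generator using h5py (the file and dataset names here are invented); note that reading element by element trades speed for memory:

import h5py
from itertools import product

def iter_slice(path, dataset, x_range, y_range, z_range):
    # yield the elements of a cuboid slice one by one, reading from disk as needed
    with h5py.File(path, 'r') as f:
        dset = f[dataset]
        for x, y, z in product(x_range, y_range, z_range):
            yield dset[x, y, z]

# existing code that only iterates over its input can consume this unchanged
total = sum(iter_slice('cube.h5', 'data', range(100), range(50, 80), range(10)))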
Copying that section of the dataset to memory is unavoidable.
The reason is simply that you are requesting the entire section, not just a small part of it.
Therefore, it must be copied completely.
So, as h5py already allows you to use HDF5 datasets in the same way as NumPy arrays, you will have to change your code to only request the values in the dataset that you currently need.
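For example, something along these lines keeps only one small block in memory at a time (the names and block size are arbitrary):

import h5py
import numpy as np

with h5py.File('cube.h5', 'r') as f:
    dset = f['data']                     # just a handle, nothing is read yet
    running_total = np.zeros(dset.shape[1:])
    for i in range(0, dset.shape[0], 10):
        chunk = dset[i:i + 10]           # only this block is copied into memory
        running_total += chunk.sum(axis=0)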

Running filter over a large amount of data points and a long time period?

I need to apply two running filters on a large amount of data. I have read that creating variables on the fly is not a good idea, but I wonder if it still might be the best solution for me.
My question:
Can I create arrays in a loop with the help of a counter (array1, array2, …) and then call them with the counter (something like ‘array’ + str(counter) or ‘array’ + str(counter-1))?
Why I want to do it:
The data are 400x700 arrays for 15min time steps over a year (So I have 35000 400x700 arrays). Each time step is read into python individually. Now I need to apply one running filter that checks if the last four time steps are equal (element-wise) and if they are, then all four values are set to zero. The next filter uses the data after the first filter has run and checks if the sum of the last twelve time steps exceeds a certain value. When both filters are done I want to sum up the values, so that at the end of the year I have one 400x700 array with the filtered accumulated values.
I do not have enough memory to read in all the data at once. So I thought I could create a loop where for each time step a new variable for the 400x700 array is created and the two filters run. The older arrays that are filtered I could then add to the yearly sum and delete, so that I do not have more than 16 (4+12) time steps(arrays) in memory at all times.
I don’t know if it’s correct of me to ask such a question without any code to show, but I would really appreciate the help.
If your question is about the best data structure to keep a certain number of arrays in memory, in this case I would suggest using a three-dimensional array. Its shape would be (400, 700, 12), since twelve is how many arrays you need to look back at. The advantage of this is that your memory use will be constant, since you load new arrays into the larger one. The disadvantage is that you need to shift all the arrays manually.
If you don't want to deal with the shifting yourself I'd suggest using a deque with a maxlen of 12.
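A rough sketch of the deque version, just to show the bookkeeping; the two filters themselves are only hinted at in the comments:

from collections import deque
import numpy as np

window = deque(maxlen=12)        # arrays older than 12 steps are dropped automatically

def add_timestep(arr):
    # push one 400x700 array; the deque handles the shifting for you
    window.append(arr)
    if len(window) >= 4:
        last_four = np.stack(list(window)[-4:])
        equal_mask = (last_four == last_four[0]).all(axis=0)   # input to the first filter
    if len(window) == 12:
        twelve_step_sum = np.stack(window).sum(axis=0)         # input to the second filter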
"Can I create arrays in a loop with the help of a counter (array1, array2…) and then call them with the counter (something like: ‘array’+str(counter) or ‘array’+str(counter-1)?"
This is a very common question that I think a lot of programmers will face eventually. Two examples for Python on Stack Overflow:
generating variable names on fly in python
How do you create different variable names while in a loop? (Python)
The lesson to learn from this is to not use dynamic variable names, but instead put the pieces of data you want to work with in an encompassing data structure.
The data structure could, for example, be a list, dict, or Numpy array. The collections.deque proposed by @Midnighter also seems to be a good candidate for such a running filter.
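For example, a dict keyed by the counter gives the same ‘array’ + str(counter) style of lookup without dynamic variable names (the zeros array is a stand-in for a real time step):

import numpy as np

arrays = {}                                    # counter -> array, instead of array1, array2, ...
for counter in range(5):
    arrays[counter] = np.zeros((400, 700))     # stand-in for the time step just read in
    previous = arrays.get(counter - 1)         # the ‘array’ + str(counter-1) lookup, done safely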

Sensible storage of 1 billion+ values in a python list type structure

I'm writing a program that creates vario-function plots for a fixed region of a digital elevation model that has been converted to an array. I calculate the variance (difference in elevation) and lag (distance) between point pairs within the window constraints. Every array position is compared with every other array position. For each pair, the lag and variance values are appended to separate lists. Once all pairs have been compared, these lists are then used for data binning, averaging and eventually plotting.
The program runs fine for smaller window sizes (say 60x60 px). For windows up to about 120x120 px or so, which would give 2 lists of 207,360,000 entries, I am able to slowly get the program running. Greater than this, and I run into "MemoryError" reports - e.g. for a 240x240 px region, I would have 3,317,760,000 entries.
At the beginning of the program, I create an empty list:
variance = []
lag = []
Then within a for loop where I calculate my lags and variances, I append the values to the different lists:
variance.append(var_val)
lag.append(lag_val)
I've had a look over the Stack Overflow pages and have seen a similar issue discussed here. That solution would potentially improve the program's runtime; however, the solution offered only goes up to 100 million entries and therefore doesn't help me with the larger regions (as with the 240x240 px example). I've also considered using numpy arrays to store the values, but I don't think this will stave off the memory issues.
Any suggestions for ways to use some kind of list of the proportions I have defined for the larger window sizes would be much appreciated.
I'm new to python so please forgive any ignorance.
The main bulk of the code can be seen here
Use the array module of Python. It offers some list-like types that are more memory efficient (but cannot be used to store random objects, unlike regular lists). For example, you can have arrays containing regular floats ("doubles" in C terms), or even single-precision floats (four bytes each instead of eight, at the cost of a reduced precision). An array of 3 billion such single-floats would fit into 12 GB of memory.
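For example, a minimal sketch of the two lists from the question as single-precision array.array objects:

from array import array

variance = array('f')            # 'f' = single-precision float, 4 bytes per entry
lag = array('f')

variance.append(0.25)            # same append interface as a plain list
lag.append(12.5)

print(variance.itemsize)         # -> 4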
You could look into PyTables, a library wrapping the HDF5 C library that can be used with numpy and pandas.
Essentially PyTables will store your data on disk and transparently load it into memory as needed.
Alternatively if you want to stick to pure python, you could use a sqlite3 database to store and manipulate your data - the docs say the size limit for a sqlite database is 140TB, which should be enough for your data.
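A minimal sqlite3 sketch along those lines (the table and file names are invented); the binning and averaging can then be pushed into SQL rather than done on Python lists:

import sqlite3

conn = sqlite3.connect('pairs.db')     # the data lives on disk, not in RAM
conn.execute('CREATE TABLE IF NOT EXISTS pairs (lag REAL, variance REAL)')

def store_batch(rows):
    # rows: iterable of (lag, variance) tuples
    conn.executemany('INSERT INTO pairs VALUES (?, ?)', rows)
    conn.commit()

store_batch([(1.0, 0.3), (2.0, 0.7)])

# let the database do the binning and averaging
for lag_bin, mean_var in conn.execute(
        'SELECT CAST(lag AS INTEGER) AS bin, AVG(variance) FROM pairs GROUP BY bin'):
    print(lag_bin, mean_var)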
Try using heapq (import heapq). It uses the heap for storage rather than the stack, allowing you to access the computer's full memory.
