I recently needed to store large array-like data (sometimes numpy, sometimes key-value indexed) whose values would be changed over time (t=1 one element changes, t=2 another element changes, etc.). This history needed to be accessible (some time in the future, I want to be able to see what t=2’s array looked like).
An easy solution was to keep a list of arrays for all timesteps, but this became too memory-intensive. I ended up writing a small class that handled this by keeping all data "elements" in a dict, with each element represented by a list of (this_value, timestamp_for_this_value) pairs. That let me recreate things for arbitrary timestamps by looking for the last change before some time t, but it was surely not as efficient as it could have been.
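A minimal sketch of that sort of class (names hypothetical), keeping the timestamps per key in a sorted list so the "last change before t" lookup can use binary search instead of a linear scan:

import bisect
from collections import defaultdict

class VersionedStore:
    """Per key, keeps parallel lists of timestamps and values (timestamps sorted)."""

    def __init__(self):
        self._times = defaultdict(list)    # key -> [t0, t1, ...]
        self._values = defaultdict(list)   # key -> [v0, v1, ...], aligned with _times

    def set(self, key, value, t):
        # Assumes updates arrive in non-decreasing time order per key.
        self._times[key].append(t)
        self._values[key].append(value)

    def get(self, key, t):
        # Binary search for the last change at or before time t.
        i = bisect.bisect_right(self._times[key], t)
        if i == 0:
            raise KeyError("%r has no value at or before t=%r" % (key, t))
        return self._values[key][i - 1]

store = VersionedStore()
store.set("a", 1.0, t=1)
store.set("a", 2.5, t=5)
print(store.get("a", t=3))   # 1.0, the value as of t=3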
Are there data structures available for python that have these properties natively? Or some sort of class of data structure meant for this kind of thing?
Have you considered writing a log file? A good use of memory would be to have the arrays contain only the current values, but to build in a procedure whereby each update statement triggers a logging function. That function could write to a text file, a database, or an array/dictionary of some sort. These kinds of audit trails are pretty common in the database world.
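A rough sketch of that idea (file name and field layout made up): the array holds only current values, every update appends one line to a log, and the log can be replayed to reconstruct the array as of any earlier time.

import json
import numpy as np

class LoggedArray:
    """Holds only the current values; every update is appended to a log file."""

    def __init__(self, size, log_path="updates.log"):
        self.data = np.zeros(size)
        self.log_path = log_path

    def update(self, index, value, t):
        self.data[index] = value
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"t": t, "i": index, "v": value}) + "\n")

    def reconstruct(self, t):
        # Replay the log (written in time order) to rebuild the array as of time t.
        snapshot = np.zeros(self.data.shape)
        with open(self.log_path) as log:
            for line in log:
                entry = json.loads(line)
                if entry["t"] > t:
                    break
                snapshot[entry["i"]] = entry["v"]
        return snapshot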
I am writing a serial data logger in Python and am wondering which data type would be best suited for this. Every few milliseconds a new value is read from the serial interface and is saved into my variable along with the current time. I don't know how long the logger is going to run, so I can't preallocate for a known size.
Intuitively I would use a numpy array for this, but from what I've read, appending or concatenating elements creates a new array each time.
So what would be the appropriate data type to use for this?
Also, what would be the proper vocabulary to describe this problem?
Python doesn't have arrays as you think of them in most languages. It has "lists", which use the standard array syntax myList[0], but unlike arrays, lists can change size as needed. Using myList.append(newItem) you can add more data to the list without any trouble on your part.
Since you asked for proper vocabulary: a useful concept here is the "linked list", which is one way other languages implement array-like structures of varying length.
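For the logger described above, a plain list of (timestamp, value) tuples usually suffices; a minimal sketch (the serial read is stubbed out, read_from_serial is a made-up name):

import time

samples = []                       # grows as needed, no preallocation required

def record(value):
    samples.append((time.time(), value))

# inside the read loop:
#     record(read_from_serial())

# once logging is done, convert to a numpy array if you need array maths:
#     import numpy as np
#     data = np.array(samples)     # shape (n_samples, 2): time, value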
I'm trying to use HDF5 to store time-series EEG data. These files can be quite large and consist of many channels, and I like the features of the HDF5 file format (lazy I/O, dynamic compression, mpi, etc).
One common thing to do with EEG data is to mark sections of data as 'interesting'. I'm struggling with a good way to store these marks in the file. I see soft/hard links supported for linking the same dataset to other groups, etc -- but I do not see any way to link to sections of the dataset.
For example, let's assume I have a dataset called EEG containing sleep data. Let's say I run an algorithm that takes a while to process the data and generates indices corresponding to periods of REM sleep. What is the best way to store these index ranges in an HDF5 file?
The best I can think of right now is to create a dataset with three columns: the first column is a string label for the event ("REM1"), and the second/third columns contain the start/end index respectively. The only reason I don't like this solution is that HDF5 datasets are pretty set in size; if I decide later that a period of REM sleep was mis-identified and I need to add/remove that event, the dataset size would need to change (and deleting the dataset and recreating it with a new size is suboptimal). Compound that with the fact that I may have MANY events (imagine marking eye-blink events), and this becomes even more of a problem.
I'm more curious to find out if there's functionality in the HDF5 file that I'm just not aware of, because this seems like a pretty common thing that one would want to do.
I think what you want is a Region Reference — essentially, a way to store a reference to a slice of your data. In h5py, you create them with the regionref property and numpy slicing syntax, so if you have a dataset called ds and your start and end indexes of your REM period, you can do:
rem_ref = ds.regionref[start:end]
ds.attrs['REM1'] = rem_ref
ds[ds.attrs['REM1']] # Will be a 1-d set of values
You can store regionrefs pretty naturally — they can be attributes on a dataset, objects in a group, or you can create a regionref-type dataset and store them in there.
In your case, I might create a group ("REM_periods" or something) and store the references in there. Creating a "REM_periods" dataset and storing the regionrefs there is reasonable too, but you run into the whole "datasets don't handle variable lengths very well" problem.
Storing them as attrs on the dataset might be OK, too, but it'd get awkward if you wanted to have more than one event type.
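A minimal end-to-end sketch of the group-of-references layout (file, dataset, and group names are made up, and it assumes region references can be stored as attributes on a group just as on a dataset):

import numpy as np
import h5py

with h5py.File("sleep_study.h5", "w") as f:
    eeg = f.create_dataset("EEG", data=np.random.randn(100000))

    # One group per event type; each region reference is stored as an attribute.
    rem = f.create_group("REM_periods")
    rem.attrs["REM1"] = eeg.regionref[1000:5000]
    rem.attrs["REM2"] = eeg.regionref[42000:48000]

with h5py.File("sleep_study.h5", "r") as f:
    eeg = f["EEG"]
    ref = f["REM_periods"].attrs["REM1"]
    rem1_data = eeg[ref]           # 1-d array holding just the referenced slice

Adding or removing an event is then just adding or deleting an attribute, with no dataset resizing involved.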
I'm currently rewriting some python code to make it more efficient and I have a question about saving python arrays so that they can be re-used / manipulated later.
I have a large amount of data, saved in CSV files. Each file contains time-stamped values of the data that I am interested in, and I have reached the point where I have to deal with tens of millions of data points. The data set has grown so large that the processing time is excessive and inefficient: the way the current code is written, the entire data set has to be reprocessed every time new data is added.
What I want to do is this:
Read in all of the existing data to python arrays
Save the variable arrays to some kind of database/file
Then, the next time more data is added, I load my database, append the new data, and resave it. This way only a small amount of data needs to be processed at any one time.
I would like the saved data to be accessible to further python scripts but also to be fairly "human readable" so that it can be handled in programs like OriginPro or perhaps even Excel.
My question is: what's the best format to save the data in? HDF5 seems like it might have all the features I need, but would something like SQLite make more sense?
EDIT: My data is single-dimensional. I essentially have 30 arrays which are (millions, 1) in size. If it weren't for the fact that there are so many points, CSV would be an ideal format! I am unlikely to want to do lookups of single entries; more likely is that I might want to plot small subsets of the data (e.g. the last 100 hours, or the last 1,000 hours).
HDF5 is an excellent choice! It has a nice interface, is widely used (in the scientific community at least), and many programs support it (MATLAB, for example); there are libraries for C, C++, Fortran, Python, and more. It comes with a complete toolset for displaying the contents of an HDF5 file. If you later want to do complex MPI calculations on your data, HDF5 supports concurrent reads/writes. It is very well suited to handling very large datasets.
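One detail worth noting for the append workflow: HDF5 datasets can be created resizable, so new data can be appended in place rather than reprocessing everything. A sketch with h5py (file and dataset names hypothetical):

import numpy as np
import h5py

def append_values(path, name, new_values):
    # Appends a 1-d block to a resizable dataset, creating it on first use.
    with h5py.File(path, "a") as f:
        if name not in f:
            f.create_dataset(name, shape=(0,), maxshape=(None,),
                             dtype="f8", chunks=True)
        ds = f[name]
        n = len(new_values)
        ds.resize((ds.shape[0] + n,))
        ds[-n:] = new_values

append_values("measurements.h5", "channel_01", np.random.randn(1000))

# Plotting a small subset only reads that slice from disk:
with h5py.File("measurements.h5", "r") as f:
    recent = f["channel_01"][-100:]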
Maybe you could use some kind of key-value database like Redis, Berkeley DB, or MongoDB... but it would be nice to have some more info about the schema you would be using.
EDITED
If you choose Redis for example, you can index very long lists:
The max length of a list is 2^32 - 1 elements (4294967295, more than 4 billion of elements per list). The main features of Redis Lists from the point of view of time complexity are the support for constant time insertion and deletion of elements near the head and tail, even with many millions of inserted items. Accessing elements is very fast near the extremes of the list but is slow if you try accessing the middle of a very big list, as it is an O(N) operation.
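If you did go with Redis, each array could be one Redis list, appended at the tail and sliced from the end; a sketch using the redis-py client (key names made up, local server assumed):

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def log_point(key, timestamp, value):
    # RPUSH appends at the tail in O(1), even for lists with millions of items.
    r.rpush(key, json.dumps([timestamp, value]))

def last_n(key, n):
    # LRANGE near the tail is fast; avoid slicing the middle of a huge list.
    return [json.loads(item) for item in r.lrange(key, -n, -1)]

log_point("series:sensor01", 1500000000.0, 3.14)
recent = last_n("series:sensor01", 100)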
I would use a single file with fixed record length for this use case. No specialised DB solution (which seems like overkill to me here), just plain old struct (see the struct module documentation) and read()/write() on a file. If you have just millions of entries, everything should work nicely in a single file of a few dozen or a few hundred MB (hardly too large for any file system). You also have random access to subsets in case you need that later.
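A sketch of that approach (the record layout here is assumed: one double timestamp plus one double value, 16 bytes per record):

import struct

RECORD = struct.Struct("<dd")      # little-endian (timestamp, value), 16 bytes each

def append_record(path, timestamp, value):
    with open(path, "ab") as f:
        f.write(RECORD.pack(timestamp, value))

def read_records(path, start, count):
    # Random access: seek straight to record `start`, read `count` records.
    with open(path, "rb") as f:
        f.seek(start * RECORD.size)
        data = f.read(count * RECORD.size)
    return [RECORD.unpack_from(data, i * RECORD.size)
            for i in range(len(data) // RECORD.size)]

append_record("series.bin", 1500000000.0, 42.0)
print(read_records("series.bin", 0, 10))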
I have a data structure which serves as a wrapper to a 2D numpy array in order to use labeled indices and perform statements such as
myMatrix[ "rowLabel", "colLabel" ] = 1.0
Basically this is implemented as
def __setitem__(self, key, value):
    row, col = key     # myMatrix["rowLabel", "colLabel"] passes the labels as one tuple
    ...                # Check validity of row/col labels.
    self.__matrixRepresentation[self.__rowMap[row], self.__colMap[col]] = value
I am assigning the values in a database table to this data structure, and it was straightforward to write a loop for this. However, I want to execute this loop 100 million or more times, and iteratively retrieving chunks of values from the database table and moving them to this structure takes more time than I would prefer.
All of the values I retrieve from the database table have different (row,column) pairs.
Therefore, it seems that I could parallelize the above assignment, but I don't know whether numpy arrays permit simultaneous assignment via some sort of internal locking mechanism for atomic operations, or whether that approach is ruled out altogether. If anyone has suggestions or criticisms, I would appreciate it. (If possible, in this case I'd prefer not to resort to Cython or PyPy.)
Parallel execution at that level is unlikely here. The global interpreter lock will ruin your day. Besides, you're still going to have to pull each set of values out of the database sequentially, which is quite possibly going to dwarf the in-process map lookups and array assignment. Especially if the database is on a remote machine.
If at all possible, don't store your matrix in that database. Dedicated formats exist for efficiently storing large arrays. HDF5/NetCDF come to mind. There are excellent Python/NumPy libraries for using HDF5 datasets. With no further information on the format or purpose of the database and/or matrix, I can't really give you better storage recommendations.
If you don't control how that data is stored, you're just going to have to wait for it to get slurped in. Depending on what you're using it for and how often it gets updated, you might just wait for it once, and then updates from the database can be written to it from a separate thread as they become available.
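A rough sketch of that separate-thread idea (queue and function names are made up; myMatrix is the labeled wrapper from the question):

import queue
import threading

updates = queue.Queue()            # filled by whatever code fetches rows from the DB

def apply_updates(matrix, stop_event):
    # Drain queued (row_label, col_label, value) triples into the wrapper.
    while not stop_event.is_set() or not updates.empty():
        try:
            row, col, value = updates.get(timeout=0.1)
        except queue.Empty:
            continue
        matrix[row, col] = value   # goes through the wrapper's __setitem__
        updates.task_done()

stop = threading.Event()
worker = threading.Thread(target=apply_updates, args=(myMatrix, stop), daemon=True)
worker.start()
# ... the DB fetch loop calls updates.put((row_label, col_label, value)) ...
# stop.set(); worker.join()       # once the fetch loop is finished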
(Unrelated terminology issue: "CPython" is the standard Python interpreter. I assume you meant that you don't want to write an extension for CPython using C, Cython, Boost Python, or whatever.)
I'm currently writing a python application that will take a directory of text files and parse them into custom python objects based on the attributes specified in each text file. As part of my application, I compare the currently loaded object data set to a previous data set (same format) and scan it for possible duplicates, conflicts, updates, etc. However, since there can be ~10,000+ objects at a time, I'm not really sure how to approach this.
I'm currently storing the previous data set in a DB as it's being used by another web app. As of now, my python application loads the 'proposed' dataset into memory (creating the rule objects), and then I store those objects in a dictionary (problem #1). Then when it comes time to compare, I use a combination of SQL queries and failed inserts to determine new/existing and existing but updated entries (problem #2).
This is hackish and terrible at best. I'm looking for some advice on restructuring the application and handling the object storage/comparisons.
You can fake what Git does and load the entire set as basically a single file and parse from there. The biggest issue is that dictionaries are not ordered so your comparisons will not always be 1:1. A list of tuples will give you 1:1 comparisons. If a lot has changed this will be difficult.
Here is a basic flow for how you can do this.
Start with both tuple lists at index 0.
Compare a hash of each tuple, e.g. hashlib.sha1(str(tuple1).encode()).hexdigest() == hashlib.sha1(str(tuple2).encode()).hexdigest()
If they are equal, record the matching indexes and add 1 to each index and compare again
If they are unequal, search each side for a match and record the matching indexes
If there are no matches, you can assume there is an insert/update/delete happening and come back to it later
You can map your matching items as reference points to do further investigation into the ones that did not match. This technique can be applied at each level you drill down. You will end up with a map of what is different down to the individual values.
The nice thing is each of the slices that you create can be compared in parallel since they will not correspond to each other... unless you are moving things from one file to another.
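A rough sketch of the walk described above, simplified so that a mismatch only searches ahead on the new side (function names made up):

import hashlib

def digest(record):
    return hashlib.sha1(str(record).encode()).hexdigest()

def match_indexes(old_records, new_records):
    # Walk both tuple lists in step, recording index pairs whose hashes agree.
    old_hashes = [digest(r) for r in old_records]
    new_hashes = [digest(r) for r in new_records]

    matches, i, j = [], 0, 0
    while i < len(old_hashes) and j < len(new_hashes):
        if old_hashes[i] == new_hashes[j]:
            matches.append((i, j))
            i += 1
            j += 1
        else:
            # Mismatch: look ahead on the new side (an insertion); if the old
            # record never reappears, treat it as a delete/update and move on.
            try:
                j = new_hashes.index(old_hashes[i], j)
            except ValueError:
                i += 1
    return matches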
Then again, it may be easier to use a diff library to compare the two data sets. Might as well not reinvent the wheel; even if it might be a really shiny wheel.
Check out http://docs.python.org/library/difflib.html
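difflib.SequenceMatcher works directly on two lists of hashable items (tuples included) and reports which blocks are equal, inserted, deleted, or replaced; a small example with made-up rule tuples:

import difflib

old = [("rule1", "on"), ("rule2", "off"), ("rule3", "on")]
new = [("rule1", "on"), ("rule2", "on"), ("rule3", "on"), ("rule4", "off")]

matcher = difflib.SequenceMatcher(None, old, new)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    # tag is one of 'equal', 'replace', 'delete', 'insert'
    print(tag, old[i1:i2], "->", new[j1:j2])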