I'm building a monitoring tool in Python where I want to keep certain stats for a short period of time. I only want to keep a maximum of, say, 30 entries per stat, with older entries overwritten as new ones come in, so that only the 30 most recent entries are kept. What sort of file should I use for this? (I'll have multiple different stats, each of which should only keep its recent history. The stats are updated at regular intervals, roughly every 15 seconds.)
I want this to be in a file as the data will be handled in another program.
If you're only keeping a small number of samples (and you don't care about historic data), then the simplest solution is keeping the data in memory. You can use a collections.deque with a maxlen to create a fixed-length list that automatically drops the oldest items as you add newer ones.
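A minimal sketch of that idea (the stat name and sample values are made up):

from collections import deque

# a fixed-length buffer per stat; maxlen=30 keeps only the newest 30 samples
cpu_history = deque(maxlen=30)

for sample in range(100):          # stand-in for a reading every ~15 seconds
    cpu_history.append(sample)     # once full, the oldest entry is dropped automatically

print(list(cpu_history))           # the 30 most recent values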
For situations in which you want to keep data for longer periods (or you simply want it to persist in the event your application restarts, or you want to be able to access the data from multiple applications, etc), people often use a dedicated time series database, such as InfluxDB, Prometheus, Graphite, or any of a number of other solutions.
You probably want to keep it all in memory. But if you need a file that mirrors a data structure (say, a dictionary), I've had great success with pickle. It's easy, and it's fast.
https://pythontips.com/2013/08/02/what-is-pickle-in-python/
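A rough sketch of that approach, assuming the other program is also Python; the file name and stat names are placeholders:

import pickle
from collections import deque

stats = {
    'cpu': deque([12, 15, 11], maxlen=30),
    'mem': deque([512, 530, 525], maxlen=30),
}

with open('stats.pkl', 'wb') as f:     # writer side
    pickle.dump(stats, f)

with open('stats.pkl', 'rb') as f:     # reader side, in the other program
    stats = pickle.load(f)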
Alternatively, a more enterprise-style solution would be simply to store your stats in a database.
I am currently using the Python Record Linkage Toolkit to perform deduplication on data sets at work. In an ideal world, I would just use blocking or sortedneighborhood to trim down the size of the index of record pairs, but sometimes I need to do a full index on a data set with over 75k records, which results in a couple billion record pairs.
The issue I'm running into is that the workstation I'm able to use is running out of memory, so it can't store the full 2.5-3 billion pair multi-index. I know the documentation has ideas for doing record linkage with two large data sets using numpy split, which is simple enough for my usage, but doesn't provide anything for deduplication within a single dataframe. I actually incorporated this subset suggestion into a method for splitting the multiindex into subsets and running those, but it doesn't get around the issue of the .index() call seemingly loading the entire multiindex into memory and causing an out of memory error.
Is there a way to split a dataframe and compute the matched pairs iteratively so I don't have to load the whole kit and kaboodle into memory at once? I was looking at dask, but I'm still pretty green on the whole python thing, so I don't know how to incorporate the dask dataframes into the record linkage toolkit.
While I was able to solve this, sort of, I am going to leave it open because I suspect given my inexperience with python, my process could be improved.
Basically, I had to ditch the index function from the record linkage toolkit. I pulled the index out of the dataframe I was using, converted it to a list, and passed it through itertools.combinations.
from itertools import combinations, islice

candidates = fl                            # fl is the dataframe being deduplicated
candidates = candidates.index              # pull out its index
candidates = candidates.tolist()           # convert to a plain list
candidates = combinations(candidates, 2)   # lazy iterator over every unique pair
This gave me a lazy iterator of tuples without having to load everything into memory. I then consumed it in chunks with islice in a for loop:
for x in iter(lambda: list(islice(candidates, 1000000)), []):   # next million pairs; stops on an empty chunk
I then performed all of the necessary comparisons inside the for loop and added each resulting dataframe to a dictionary, which I concatenated at the end into the full result. Python's memory usage hasn't risen above 3 GB the entire time.
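Roughly, the pattern looks like this. This is a sketch rather than my exact code: the comparison rules and the 'surname'/'given_name' columns are invented for illustration, so substitute your own.

import pandas as pd
import recordlinkage
from itertools import combinations, islice

compare = recordlinkage.Compare()
compare.exact('surname', 'surname', label='surname')             # example rules only
compare.string('given_name', 'given_name', label='given_name')

candidates = combinations(fl.index.tolist(), 2)

results = {}
for i, chunk in enumerate(iter(lambda: list(islice(candidates, 1000000)), [])):
    pairs = pd.MultiIndex.from_tuples(chunk)    # rebuild a pair index for this chunk only
    results[i] = compare.compute(pairs, fl)     # compare just this chunk's pairs

matches = pd.concat(results.values())           # stitch the full comparison table together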
I would still love some information on how to incorporate dask into this, so I will accept any answer that can provide that (unless the mods think I should open a new question).
I recently needed to store large array-like data (sometimes numpy, sometimes key-value indexed) whose values would be changed over time (t=1 one element changes, t=2 another element changes, etc.). This history needed to be accessible (some time in the future, I want to be able to see what t=2’s array looked like).
An easy solution was to keep a list of arrays for all timesteps, but this became too memory intensive. I ended up writing a small class that handled this by keeping all data "elements" in a dict, with each element represented by a list of (this_value, timestamp_for_this_value) pairs. That let me recreate things for arbitrary timestamps by looking for the last change before some time t, but it was surely not as efficient as it could have been.
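For reference, a bare-bones sketch of the kind of class I mean (names invented, and it assumes updates arrive in time order):

import bisect
from collections import defaultdict

class VersionedValues:
    def __init__(self):
        self._times = defaultdict(list)     # key -> timestamps of each change, in order
        self._values = defaultdict(list)    # key -> values, parallel to _times

    def set(self, key, t, value):
        self._times[key].append(t)          # assumes t >= the previous timestamp
        self._values[key].append(value)

    def get(self, key, t):
        i = bisect.bisect_right(self._times[key], t)   # last change at or before t
        if i == 0:
            raise KeyError(f"no value for {key!r} at or before t={t}")
        return self._values[key][i - 1]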
Are there data structures available for python that have these properties natively? Or some sort of class of data structure meant for this kind of thing?
Have you considered writing a log file? A memory-friendly approach is to have the arrays hold only the current values, and to build in a procedure where each update triggers a logging function. That function could write to a text file, a database, or an array/dictionary of some sort. These kinds of audit trails are pretty common in the database world.
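A rough sketch of what that trigger could look like; the file name and columns are just an example:

import csv
import time

def log_update(key, old_value, new_value, path='audit_log.csv'):
    # append one audit row per change
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerow([time.time(), key, old_value, new_value])

def set_value(store, key, value):
    log_update(key, store.get(key), value)   # record the change before applying it
    store[key] = value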
I'm trying to use HDF5 to store time-series EEG data. These files can be quite large and consist of many channels, and I like the features of the HDF5 file format (lazy I/O, dynamic compression, mpi, etc).
One common thing to do with EEG data is to mark sections of data as 'interesting'. I'm struggling with a good way to store these marks in the file. I see soft/hard links supported for linking the same dataset to other groups, etc -- but I do not see any way to link to sections of the dataset.
For example, let's assume I have a dataset called EEG containing sleep data. Let's say I run an algorithm that takes a while to process the data and generates indices corresponding to periods of REM sleep. What is the best way to store these index ranges in an HDF5 file?
The best I can think of right now is to create a dataset with three columns: the first column is a string containing a label for the event ("REM1"), and the second and third columns contain the start and end indices, respectively. The only reason I don't like this solution is that HDF5 datasets are pretty set in size; if I decide later that a period of REM sleep was mis-identified and I need to add or remove that event, the dataset size would need to change (and deleting the dataset and recreating it with a new size is suboptimal). Compound this with the fact that I may have MANY events (imagine marking eyeblink events), and this becomes more of a problem.
I'm more curious to find out if there's functionality in the HDF5 file that I'm just not aware of, because this seems like a pretty common thing that one would want to do.
I think what you want is a region reference: essentially, a way to store a reference to a slice of your data. In h5py, you create them with the regionref property and numpy slicing syntax, so if you have a dataset called ds and the start and end indices of your REM period, you can do:
# ds is an existing h5py.Dataset; start/end bound the REM period
rem_ref = ds.regionref[start:end]     # a region reference to that slice
ds.attrs['REM1'] = rem_ref            # stored as an attribute on the dataset
ds[ds.attrs['REM1']]                  # reads back a 1-d array of the referenced values
You can store regionrefs pretty naturally: they can be attributes on a dataset or a group, or you can create a regionref-typed dataset and store them in there.
In your case, I might create a group ("REM_periods" or something) and store the references in there. Creating a "REM_periods" dataset and storing the regionrefs there is reasonable too, but then you run into the same issue you described: datasets don't handle growing and shrinking very gracefully.
Storing them as attrs on the dataset might be OK, too, but it'd get awkward if you wanted to have more than one event type.
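Something along these lines (an untested sketch; the file, group, and event names are placeholders, and rem_start/rem_end come from your detection algorithm):

import h5py

with h5py.File('sleep.h5', 'a') as f:
    eeg = f['EEG']                              # the existing EEG dataset
    periods = f.require_group('REM_periods')    # one group to hold all the event marks

    periods.attrs['REM1'] = eeg.regionref[rem_start:rem_end]

    rem1_data = eeg[periods.attrs['REM1']]      # later: pull back the samples for that period

    del periods.attrs['REM1']                   # a mis-identified period is easy to drop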
I am in the planning phase of building a simulation and need ideas on how to represent data, based on memory and speed considerations.
At each time-step, the simulation process creates 10^3 to 10^4 new data records, and looks at each new or existing record (there are 10^6 to 10^8 of them) and then either deletes or modifies it.
Each record has 3-10 simple fields, each either an integer or a string of several ASCII characters. In addition, each record has 1-5 other fields, each a variable-length list containing integers. A typical record weighs 100-500 bytes.
The modify-or-delete process works like this: for the current record, compute a function whose arguments are the values of some of this record's fields and the values of the same fields of another record. Depending on the result, the process prepares to delete the record or modify its fields in some way.
Then repeat for every other record, move on to the next record, and repeat again. When all records have been processed, the simulation is ready to move to the next time-step.
Just before moving to the next time-step, apply all the deletions and modifications as prepared.
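In code, I imagine one time-step looking roughly like this (just a sketch; interact stands in for the field-comparing function, and records are plain dicts keyed by an ID):

def step(records, interact):
    pending_delete = set()
    pending_update = {}

    for rid, rec in records.items():
        for other_id, other in records.items():
            if other_id == rid:
                continue
            action, new_fields = interact(rec, other)   # the user-supplied rule
            if action == 'delete':
                pending_delete.add(rid)
            elif action == 'modify':
                pending_update[rid] = new_fields

    # apply everything at once, just before the next time-step
    for rid in pending_delete:
        records.pop(rid, None)
    for rid, fields in pending_update.items():
        if rid in records:
            records[rid].update(fields)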
The more records allowed, the better the simulation. Keeping all records in RAM limits the size of the simulation, but presumably gives the best speed. The simulation doesn't need to be realtime, but obviously I don't want it too slow.
To represent each record in memory, I know of these options: a list or dict (with some lists nested in it), or a class instance. To store away all the records and continue the simulation another day, the options in order of decreasing familiarity to me are: a CSV file where each line is a record, or keeping all records in RAM and then writing them to a file (perhaps using pickle), or using some sort of database.
I've learned Python basics plus some concepts like generators, but haven't learned databases, haven't tried pickling, and obviously need to learn more. If possible, I'd avoid multiple computers because I have only one, and concurrency because it looks too scary.
What would you advise about how to represent records in memory, and about how to store away the simulated system?
If we take your worst case, 10**8 records at 500 bytes per record, that's on the order of 50 GB, which is a lot of RAM, so it's worth designing in some flexibility and assuming that not all records will always be resident in RAM. You could make an abstraction class that hides the details of where the records are.
class Record(object):
    def __init__(self, x, y, z):
        pass  # code goes here

    def get_record(self, id):
        pass  # code goes here
Instead of using the name get_record() you could use the name __getitem__(), and then your class can be indexed like a list, while behind the scenes it might be going out to a database, or referencing a RAM cache, or whatever. Just use integers as the ID values. Then if you change your mind about the persistence store (switching from a database to pickle or whatever), the calling code won't change.
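For instance, a sketch using shelve purely as a stand-in for whatever persistence store you pick:

import shelve

class RecordStore(object):
    def __init__(self, path):
        self._cache = {}                    # records currently resident in RAM
        self._disk = shelve.open(path)      # could just as well be a database or pickle files

    def __getitem__(self, id):
        if id in self._cache:
            return self._cache[id]
        record = self._disk[str(id)]        # shelve keys must be strings
        self._cache[id] = record
        return record

    def __setitem__(self, id, record):
        self._cache[id] = record
        self._disk[str(id)] = record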
You could also try just making a really huge swap file and letting the virtual memory system shuffle records in and out of actual RAM. This is easy to try, but it gives you no easy way to interrupt a calculation and save the state.
You could represent each record as a tuple, even a named tuple. I believe a tuple would have the lowest overhead of any "container" object in Python. (A named tuple just stores the names once in one place, so it's low overhead also.)
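For example (the field names here are made up):

from collections import namedtuple

Record = namedtuple('Record', ['id', 'kind', 'value', 'links'])

r = Record(id=42, kind='cell', value=17, links=[3, 9, 14])
print(r.kind, r.links)    # fields accessed by name, with tuple-level memory overhead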
I'm currently rewriting some python code to make it more efficient and I have a question about saving python arrays so that they can be re-used / manipulated later.
I have a large amount of data saved in CSV files. Each file contains time-stamped values of the data I am interested in, and I have reached the point where I have to deal with tens of millions of data points. The data set has become so large that the processing time is excessive and inefficient: the way the current code is written, the entire data set has to be reprocessed every time some new data is added.
What I want to do is this:
1. Read all of the existing data into Python arrays.
2. Save those arrays to some kind of database/file.
3. The next time more data is added, load the database, append the new data, and re-save it. This way only a small amount of data needs to be processed at any one time.
I would like the saved data to be accessible to further python scripts but also to be fairly "human readable" so that it can be handled in programs like OriginPro or perhaps even Excel.
My question is: what's the best format to save the data in? HDF5 seems like it might have all the features I need, but would something like SQLite make more sense?
EDIT: My data is one-dimensional. I essentially have 30 arrays which are (millions, 1) in size. If it wasn't for the fact that there are so many points, CSV would be an ideal format! I am unlikely to want to do lookups of single entries; more likely I might want to plot small subsets of the data (e.g. the last 100 hours, or the last 1000 hours).
HDF5 is an excellent choice! It has a nice interface, is widely used (in the scientific community at least), and many programs support it (MATLAB, for example); there are libraries for C, C++, Fortran, Python, and more. There is a complete toolset for displaying the contents of an HDF5 file. If you later want to do complex MPI calculations on your data, HDF5 supports concurrent reads/writes. It is very well suited to handling very large datasets.
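As a rough sketch of the append workflow with h5py (the file, dataset, and CSV names are placeholders):

import h5py
import numpy as np

new_values = np.loadtxt('new_chunk.csv', delimiter=',')    # the freshly added points

with h5py.File('timeseries.h5', 'a') as f:
    if 'sensor_01' not in f:
        f.create_dataset('sensor_01', data=new_values,
                         maxshape=(None,), chunks=True, compression='gzip')
    else:
        ds = f['sensor_01']
        old = ds.shape[0]
        ds.resize(old + new_values.shape[0], axis=0)        # grow the dataset in place
        ds[old:] = new_values                               # append only the new data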
Maybe you could use some kind of key-value database like Redis, Berkeley DB, or MongoDB, but it would be nice to have some more info about the schema you would be using.
EDITED
If you choose Redis for example, you can index very long lists:
The max length of a list is 2^32 - 1 elements (4,294,967,295, more than 4 billion elements per list). The main features of Redis lists from the point of view of time complexity are the support for constant-time insertion and deletion of elements near the head and tail, even with many millions of inserted items. Accessing elements is very fast near the extremes of the list but is slow if you try accessing the middle of a very big list, as it is an O(N) operation.
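If you go that route, a quick redis-py sketch (the key name and retention count are placeholders) of pushing samples and keeping only the recent ones:

import redis

r = redis.Redis()                      # assumes a local Redis server

def record_sample(value, key='stat:cpu', keep=1000):
    r.rpush(key, value)                # append the newest sample
    r.ltrim(key, -keep, -1)            # keep only the most recent `keep` entries

latest = r.lrange('stat:cpu', -100, -1)   # e.g. the last 100 samples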
I would use a single file with a fixed record length for this use case. No specialised DB solution (that seems like overkill to me here), just plain old struct (see the documentation for the struct module) and read()/write() on a file. If you have just millions of entries, everything should work nicely in a single file of some dozens or hundreds of MB (which is hardly too large for any file system). You also get random access to subsets in case you need that later.
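A minimal sketch of that approach; the record layout (an 8-byte id, an 8-byte float, a 16-byte ASCII label) is invented for the example:

import struct

RECORD = struct.Struct('<qd16s')    # little-endian: int64 id, float64 value, 16-byte label

def append_record(f, rec_id, value, label):
    f.write(RECORD.pack(rec_id, value, label.encode('ascii')[:16].ljust(16, b' ')))

def read_record(f, n):
    f.seek(n * RECORD.size)                         # random access by record number
    rec_id, value, label = RECORD.unpack(f.read(RECORD.size))
    return rec_id, value, label.rstrip(b' ').decode('ascii')

with open('stats.bin', 'ab') as f:
    append_record(f, 1, 3.14, 'cpu_load')
with open('stats.bin', 'rb') as f:
    print(read_record(f, 0))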