I am in the planning phase of building a simulation and need ideas on how to represent data, based on memory and speed considerations.
At each time-step, the simulation creates 10^3 to 10^4 new data records, then looks at each new or existing record (there are 10^6 to 10^8 of them) and either deletes it or modifies it.
Each record has 3-10 simple fields, each either an integer or a string of several ASCII characters. In addition, each record has 1-5 other fields, each a variable-length list containing integers. A typical record weighs 100-500 bytes.
The modify-or-delete process works like this: for the current record, compute a function whose arguments are the values of some of its fields and the values of the corresponding fields of another record. Depending on the result, the process prepares to delete the record or to modify some of its fields.
This is repeated against every other record, then the process moves on to the next record and does the same. When all records have been processed, the simulation is ready to move to the next time-step.
Just before moving to the next time-step, apply all the deletions and modifications as prepared.
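In rough Python, the per-time-step loop I have in mind looks something like this (the record layout and the interact() rule are made up purely for illustration):

def interact(rec, other):
    # hypothetical pairwise rule: mark light records for deletion, otherwise bump a counter
    if rec["weight"] + other["weight"] < 2:
        return "delete"
    return {"hits": rec["hits"] + 1}

def advance_one_step(records):
    # records: dict mapping integer id -> dict of fields
    pending_deletes = set()
    pending_updates = {}
    for rec_id, rec in records.items():
        for other_id, other in records.items():
            if other_id == rec_id:
                continue
            outcome = interact(rec, other)
            if outcome == "delete":
                pending_deletes.add(rec_id)
            else:
                pending_updates[rec_id] = outcome
    # deletions and modifications are applied only after the full pass
    for rec_id, changes in pending_updates.items():
        if rec_id not in pending_deletes:
            records[rec_id].update(changes)
    for rec_id in pending_deletes:
        del records[rec_id]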
The more records allowed, the better the simulation. Keeping all records in RAM presumably helps speed, but it caps the size of the simulation. The simulation doesn't need to be real-time, but obviously I don't want it too slow.
To represent each record in memory, I know of these options: a list or dict (with some lists nested in it), or a class instance. To store away all the records and continue the simulation another day, the options, in order of decreasing familiarity to me, are: a CSV file where each line is a record; keeping all records in RAM and then writing them to a file (perhaps using pickle); or using some sort of database.
I've learned Python basics plus some concepts like generators, but I haven't learned databases, haven't tried pickling, and obviously need to learn more. If possible, I'd avoid multiple computers (I have only one) and concurrency (it looks too scary).
What would you advise about how to represent records in memory, and about how to store away the simulated system?
If we take your worst case, 10**8 records at 500 bytes per record, that is about 50 GB, which is a lot of RAM; so it's worth designing in some flexibility and assuming that not all records will always be resident in RAM. You could make an abstraction class that hides the details of where the records actually are.
class Record(object):
    def __init__(self, x, y, z):
        pass  # code goes here

    def get_record(self, record_id):
        pass  # code goes here
Instead of using the name get_record() you could use the name __getitem__(), and then your class will act like a list: you look records up by index, but behind the scenes it might be going out to a database, or referencing a RAM cache, or whatever. Just use integers as the ID values. Then if you change your mind about the persistence store (switch from database to pickle or whatever), the calling code won't change.
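A rough sketch of such a wrapper might look like this (the cache and backend here are just placeholders):

class RecordStore(object):
    # acts like a list of records; the backing store is hidden behind __getitem__
    def __init__(self):
        self._cache = {}   # integer id -> record currently held in RAM

    def __getitem__(self, record_id):
        try:
            return self._cache[record_id]
        except KeyError:
            record = self._load_from_backend(record_id)   # database, pickle file, whatever
            self._cache[record_id] = record
            return record

    def _load_from_backend(self, record_id):
        raise NotImplementedError("plug in a database / pickle / file lookup here")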
You could also try just making a really huge swapfile and letting the virtual memory system handle shuffling records in and out of actual RAM. This is easy to try, but it gives you no easy way to interrupt a calculation and save the state.
You could represent each record as a tuple, even a named tuple. I believe a tuple would have the lowest overhead of any "container" object in Python. (A named tuple just stores the names once in one place, so it's low overhead also.)
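For instance (field names invented):

from collections import namedtuple

Record = namedtuple("Record", ["kind", "label", "weight", "links"])

r = Record(kind=3, label="abc", weight=17, links=[2, 5, 9])
print(r.label)              # fields are accessed by name...
print(r[2])                 # ...but it is still an ordinary tuple underneath
r = r._replace(weight=18)   # tuples are immutable; _replace builds a modified copy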
I recently needed to store large array-like data (sometimes numpy, sometimes key-value indexed) whose values would be changed over time (t=1 one element changes, t=2 another element changes, etc.). This history needed to be accessible (some time in the future, I want to be able to see what t=2’s array looked like).
An easy solution was to keep a list of arrays for all timesteps, but this became too memory intensive. I ended up writing a small class that handled this by keeping all data “elements” in a dict, with each element represented by a list of (this_value, timestamp_for_this_value) pairs. That let me recreate things for arbitrary timestamps by looking for the last change before some time t, but it was surely not as efficient as it could have been.
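Stripped down, the idea looks roughly like this (simplified: parallel lists of timestamps and values per element, with a bisect lookup for "last change at or before t"):

import bisect

class HistoryDict(object):
    def __init__(self):
        self._times = {}    # key -> [t0, t1, ...] (sorted, since changes only get appended)
        self._values = {}   # key -> [v0, v1, ...]

    def set(self, key, value, t):
        self._times.setdefault(key, []).append(t)
        self._values.setdefault(key, []).append(value)

    def get(self, key, t):
        # return the last value stored for key at or before time t
        i = bisect.bisect_right(self._times[key], t)
        if i == 0:
            raise KeyError("no value for %r at or before t=%r" % (key, t))
        return self._values[key][i - 1]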
Are there data structures available for python that have these properties natively? Or some sort of class of data structure meant for this kind of thing?
Have you considered writing a log file? A good use of memory would be to have the arrays contain only the current relevant values but build in a procedure where the update statement could trigger a logging function. This function could write to a text file, database or an array/dictionary of some sort. These types of audit trails are pretty common in the database world.
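A bare-bones version of that idea might look like this (the file format and names are arbitrary):

import json, time

def set_value(current, key, value, log_path="changes.log"):
    # update the in-memory structure and append the change to an audit log
    current[key] = value
    with open(log_path, "a") as log:
        log.write(json.dumps({"t": time.time(), "key": key, "value": value}) + "\n")

def state_at(t, log_path="changes.log"):
    # rebuild the state as of time t by replaying the log
    state = {}
    with open(log_path) as log:
        for line in log:
            entry = json.loads(line)
            if entry["t"] > t:
                break
            state[entry["key"]] = entry["value"]
    return state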
I'm building a sort of monitoring tool in Python where I want to keep certain stats for a short period of time. I only want to keep a maximum of, say, 30 entries per stat, with the oldest entries overwritten as new ones come in, so that only the 30 most recent entries are kept. What sort of file should I use for this? (I'll have multiple different stats, for each of which I only want to keep the recent history. The stats are updated at regular intervals, roughly every 15 seconds.)
I want this to be in a file as the data will be handled in another program.
If you're only keeping a small number of samples (and you don't care about historic data), then the simplest solution is keeping data in memory. You can use a collections.deque object with a maxlen argument to create a fixed-length list that will automatically drop older items as you add newer items.
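For example:

from collections import deque

recent = deque(maxlen=30)       # keeps only the 30 most recent entries

for sample in range(100):
    recent.append(sample)       # once full, the oldest entry is dropped automatically

print(list(recent))             # [70, 71, ..., 99]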
For situations in which you want to keep data for longer periods (or you simply want it to persist in the event your application restarts, or you want to be able to access the data from multiple applications, etc), people often use a dedicated time series database, such as InfluxDB, Prometheus, Graphite, or any of a number of other solutions.
You probably want to keep it all in memory. But if you need to keep a file that mirrors a data structure (say, a dictionary), I've had great success with pickle. It's easy, and it's fast.
https://pythontips.com/2013/08/02/what-is-pickle-in-python/
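A minimal round-trip looks like this (file name arbitrary):

import pickle

stats = {"cpu": [3.1, 2.8, 3.4], "mem": [512, 530, 498]}   # example data

with open("stats.pkl", "wb") as f:     # save the whole structure to disk
    pickle.dump(stats, f)

with open("stats.pkl", "rb") as f:     # ...and load it back later
    restored = pickle.load(f)

Bear in mind that pickle files are only readable from Python, so this works best if the other program that handles the data is also written in Python.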
Alternatively, a more enterprise-grade solution would be to simply store your stats in a database.
I have a collection that is potentially going to be very large. Now I know MongoDB doesn't really have a problem with this, but I don't really know how to go about designing a schema that can handle a very large dataset comfortably. So I'm going to give an outline of the problem.
We are collecting large amounts of data for our customers. Basically, when we gather this data it is represented as a 3-tuple, let's say (a, b, c), where b and c are members of sets B and C respectively. In this particular case we know that the B and C sets will not grow very much over time. For our current customers we are talking about ~200,000 members. However, the A set is the one that keeps growing over time. Currently we are at about ~2,000,000 members per customer, but this is going to grow (possibly rapidly). Also, there are 1->n relations between b->a and c->a.
The workload on this data set is basically split into 3 use cases. The collections will be periodically updated, where A will get the most writes, and B and C will get some, but not many. The second use case is random access into B, then aggregating over some number of documents in C that pertain to b \in B. And the last use case is basically streaming a large subset from A and B to generate some new data.
The problem that we are facing is that the indexes are getting quite big. Currently we have a test setup with about 8 small customers, the total dataset is about 15GB in size at the moment, and indexes are running at about 3GB to 4GB. The problem here is that we don't really have any hot zones in our dataset. It's basically going to get an evenly distributed load amongst all documents.
Basically we've come up with 2 options to do this. The one that I described above, where all data for all customers is piled into one collection. This means we'd have to create an index on some field that links the documents in that collection to a particular customer.
The other option is to throw all b's and c's together (these sets are relatively small) but divide up the A collection, one per customer. I can imagine this last solution being a bit harder to manage, but since we rarely access data for multiple customers at the same time, it would prevent memory problems. MongoDB would be able to load the customer's index into memory and just run from there.
What are your thoughts on this?
P.S.: I hope this wasn't too vague, if anything is unclear I'll go into some more details.
It sounds like the larger set (A, if I followed along correctly) could reasonably be put into its own database. I say database rather than collection because, now that 2.2 is released, you would want to minimize lock contention between the busier database and the others, and to do that a separate database would be best (2.2 introduced database-level locking). That is looking at this from a single replica set model, of course.
Also the index sizes sound a bit out of proportion to your data size - are you sure they are all necessary? Pruning unneeded indexes, combining and using compound indexes may well significantly reduce the pain you are hitting in terms of index growth (it would potentially make updates and inserts more efficient too). This really does need specifics and probably belongs in another question, or possibly a thread in the mongodb-user group so multiple eyes can take a look and make suggestions.
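For example, with pymongo, a single compound index can serve both the per-customer filter and the per-customer-plus-b queries, replacing two separate single-field indexes (the field and collection names here are invented):

from pymongo import MongoClient, ASCENDING

client = MongoClient()                 # default localhost connection
coll = client.mydb.measurements        # stand-in for your collection

# queries filtering on customer_id alone, or on (customer_id, b_id), can both use this
coll.create_index([("customer_id", ASCENDING), ("b_id", ASCENDING)])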
If we look at it with the possibility of sharding thrown in, then the truly important piece is to pick a shard key that allows you to make sure locality is preserved on the shards for the pieces you will frequently need to access together. That would lend itself more toward a single sharded collection (preserving locality across multiple related sharded collections is going to be very tricky unless you manually split and balance the chunks in some way). Sharding gives you the ability to scale out horizontally as your indexes hit the single instance limit etc. but it is going to make the shard key decision very important.
Again, specifics for picking that shard key are beyond the scope of this more general discussion, similar to the potential index review I mentioned above.
I'm currently rewriting some python code to make it more efficient and I have a question about saving python arrays so that they can be re-used / manipulated later.
I have a large amount of data, saved in CSV files. Each file contains time-stamped values of the data that I am interested in, and I have reached the point where I have to deal with tens of millions of data points. The data set has got so large now that the processing time is excessive and inefficient: the way the current code is written, the entire data set has to be reprocessed every time some new data is added.
What I want to do is this:
Read in all of the existing data to Python arrays
Save the variable arrays to some kind of database/file
Then, the next time more data is added, load my database, append the new data, and resave it. This way only a small amount of data needs to be processed at any one time.
I would like the saved data to be accessible to further python scripts but also to be fairly "human readable" so that it can be handled in programs like OriginPro or perhaps even Excel.
My question is: what's the best format to save the data in? HDF5 seems like it might have all the features I need, but would something like SQLite make more sense?
EDIT: My data is single dimensional. I essentially have 30 arrays which are (millions, 1) in size. If it wasn't for the fact that there are so many points, CSV would be an ideal format! I am unlikely to want to do lookups of single entries; more likely is that I might want to plot small subsets of data (e.g. the last 100 hours, or the last 1000 hours).
HDF5 is an excellent choice! It has a nice interface, is widely used (in the scientific community at least), many programs have support for it (MATLAB for example), and there are libraries for C, C++, Fortran, Python, and more. There is a complete toolset to display the contents of an HDF5 file. If you later want to do complex MPI calculations on your data, HDF5 has support for concurrent reads/writes. It's very well suited to handling very large datasets.
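With h5py, for instance, appending newly processed points to an existing file and reading back just a recent slice looks roughly like this (file and dataset names are illustrative):

import h5py
import numpy as np

new_points = np.random.rand(1000)       # stand-in for the newly processed data

with h5py.File("timeseries.h5", "a") as f:
    if "channel_01" not in f:
        # unlimited first dimension so the dataset can grow at every update
        f.create_dataset("channel_01", shape=(0,), maxshape=(None,),
                         dtype="f8", chunks=True)
    dset = f["channel_01"]
    old = dset.shape[0]
    dset.resize((old + len(new_points),))
    dset[old:] = new_points

with h5py.File("timeseries.h5", "r") as f:
    tail = f["channel_01"][-1000:]      # e.g. plot only the most recent points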
Maybe you could use some kind of key-value database like Redis, Berkeley DB, MongoDB... But it would be nice to have some more info about the schema you would be using.
EDITED
If you choose Redis for example, you can index very long lists:
The max length of a list is 2^32 - 1 elements (4,294,967,295, more than 4 billion of elements per list). The main features of Redis Lists from the point of view of time complexity are the support for constant time insertion and deletion of elements near the head and tail, even with many millions of inserted items. Accessing elements is very fast near the extremes of the list but is slow if you try accessing the middle of a very big list, as it is an O(N) operation.
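With the redis-py client, for example, that looks like this (key name arbitrary):

import redis

r = redis.Redis()                        # assumes a Redis server on localhost

r.rpush("stats:cpu", 3.1)                # append a new reading at the tail of the list
r.rpush("stats:cpu", 2.8)

last_hundred = r.lrange("stats:cpu", -100, -1)   # reads near the tail are cheap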
I would use a single file with fixed record length for this use case. No specialised DB solution (which seems overkill to me in this case), just plain old struct (see the documentation for the struct module) and read()/write() on a file. If you have just millions of entries, everything should work nicely in a single file of some dozens or hundreds of MB (which is hardly too large for any file system). You also have random access to subsets in case you need that later.
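A minimal sketch of that approach, assuming (purely for illustration) that each record is one double timestamp plus one double value:

import struct

record = struct.Struct("<dd")            # timestamp + value, 16 bytes per record

with open("data.bin", "ab") as f:        # append new measurements
    f.write(record.pack(1700000000.0, 21.4))
    f.write(record.pack(1700000015.0, 21.9))

def read_record(path, i):
    # random access: jump straight to the i-th record without reading the rest
    with open(path, "rb") as f:
        f.seek(i * record.size)
        return record.unpack(f.read(record.size))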
I have large text files upon which all kinds of operations need to be performed, mostly involving row by row validations. The data are generally of a sales / transaction nature, and thus tend to contain a huge amount of redundant information across rows, such as customer names. Iterating and manipulating this data has become such a common task that I'm writing a library in C that I hope to make available as a Python module.
In one test, I found that out of 1.3 million column values, only ~300,000 were unique. Memory overhead is a concern, as our Python based web application could be handling simultaneous requests for large data sets.
My first attempt was to read in the file and insert each column value into a binary search tree. If the value has never been seen before, memory is allocated to store the string, otherwise a pointer to the existing storage for that value is returned. This works well for data sets of ~100,000 rows. Much larger and everything grinds to a halt, and memory consumption skyrockets. I assume the overhead of all those node pointers in the tree isn't helping, and using strcmp for the binary search becomes very painful.
This unsatisfactory performance leads me to believe I should invest in using a hash table instead. This, however, raises another point -- I have no idea ahead of time how many records there are. It could be 10, or ten million. How do I strike the right balance of time / space to prevent resizing my hash table again and again?
What are the best data structure candidates in a situation like this?
Thank you for your time.
Hash table resizing isn't a concern unless you have a requirement that each insert into the table should take the same amount of time. As long as you always expand the hash table size by a constant factor (e.g. always increasing the size by 50%), the computational cost of adding an extra element is amortized O(1). This means that n insertion operations (when n is large) will take an amount of time that is proportional to n; however, the actual time per insertion may vary wildly (in practice, one of the insertions will be very slow while the others will be very fast, but the average of all operations is small). The reason for this is that when you insert an extra element that forces the table to expand from, say, 1,000,000 to 1,500,000 elements, that insert will take a lot of time, but now you've bought yourself 500,000 extremely fast future inserts before you need to resize again. In short, I'd definitely go for a hash table.
You need to use incremental resizing of your hash table. In my current project, I keep track of the hash key size used in every bucket, and if that size is below the current key size of the table, then I rehash that bucket on an insert or lookup. On a resizing of the hash table, the key size doubles (add an extra bit to the key) and in all the new buckets, I just add a pointer back to the appropriate bucket in the existing table. So if n is the number of hash buckets, the hash expand code looks like:
n = n * 2;                                        /* double the number of buckets */
bucket = realloc(bucket, sizeof(bucket[0]) * n);  /* element size, not the pointer's size */
for (i = 0, j = n / 2; j < n; i++, j++) {
    bucket[j] = bucket[i];   /* each new bucket points back at its old counterpart until rehashed */
}
"library in C that I hope to make available as a Python module"
Python already has very efficient finely-tuned hash tables built in. I'd strongly suggest that you get your library/module working in Python first. Then check the speed. If that's not fast enough, profile it and remove any speed-humps that you find, perhaps by using Cython.
setup code:
shared_table = {}
string_sharer = shared_table.setdefault  # returns the already-stored equal string, or stores this one
scrunching each input row:
for i, field in enumerate(fields):
    fields[i] = string_sharer(field, field)  # replace each field with the single shared copy
You may of course find after examining each column that some columns don't compress well and should be excluded from "scrunching".