Best data structures for 3D array in Python - python

I am developing a moving average filter for position "tracks" in Touch Designer, which implements a Python runtime. I'm new to Python and I'm unclear on the best data structures to use. The pcode is roughly:
Receive a new list of tracks formatted as:
id, posX, posY
2001, 0.54, 0.21
2002, 0.43, 0.23
...
Add incoming X and Y values to an existing data structure keyed on "id," if that id is already present
Create new entries for new ids
Remove any id entries that are not present in the incoming data
Return a moving average of X and Y values per id
Question: Would it be a good idea to do this as a hashtable where the key is id, and the values are a list of lists? eg:
ids = {2001: posVals, 2002: posVals2}
where posVals is a list of [x,y] pairs?
I think this is like a 3D array, but where I want to use id as a key for lots of operations.
Thanks

First, if the IDs are all relatively small positive integers, and the array is almost completely dense (that is, almost all IDs will exist), a dict is just adding extra overhead. On the other hand, if there are large numbers, or large gaps, or keys of different types, using a dict as a "sparse array" makes perfect sense.
Meanwhile, for the other two dimensions… how many IDs are you going to have, and how many pairs for each? If we're talking a handful of pairs per ID and a few thousands total pairs across all IDs, a list of pairs per ID is perfectly fine (although I'd probably represent each pair as a tuple rather than a list), and anything else would be adding unnecessary complexity. But if there are going to be a lot of pairs per ID, or a lot total, you may run into problems with storage, or performance.
If you can use the third-party library numpy, it can store a 2D array of numbers in much less memory than a list of pairs of numbers, and it can perform calculations like moving averages with much briefer/more readable code, and with much less CPU time. In fact, it can even store a sparse 3D array for you.
If you can only use the standard library, the array module can get you most of the same memory benefits, but without the simplicity benefits (in fact, your code becomes slightly more complex, as you have to represent the 2D array as a 1D array, old-school-C-style—although you can wrap that up pretty easily), or the time performance benefits, and it can't help with the sparseness.

Yes, this is how I would do it. It's very intuitive this way, assuming you're always looking things up by their id and don't need to sort in some other way.
Also, the terminology in Python is dict (as in dictionary), rather than hashtable.

Related

Are nested sets and dictionaries anti-pattern in Python?

I need to create a sort of similarity matrix based on user_id values. I am currently using Pandas to store the majority of my data, but I know that iteration is very anti-pattern, so I am considering creating a set/dictionary nest to store the similarities, similar to some of the proposed structures here
I would only be storing N nearest similarities, so it would amount to something like this:
{
'user_1' : {'user_2':0.5, 'user_4':0.9, 'user_3':1.0},
'user_2' : ...
}
It would be allowing me to access a neighbourhood by doing dict_name[user_id] quite easily.
Essentially the outermost dictionary key would hold a user_id which returns another dictionary of its N closest neighbours with user_id- similarity_value key-value sets.
For more context, I'm just writing a simple KNN recommender. I am doing it from scratch as I've tried using Surpriselib and sklearn but they don't have the context-aware flexibility I require.
This seems like a reasonable way to store these values to me, but is it very anti-pythonic, or should I be looking to do this using some other structures (e.g. NumPy or Pandas or something else I don't yet know about)?
As the comment says, there is nothing inherently wrong or anti-pythonic with using (one level of) nested dictionaries and writing everything from scratch.
Performance-wise you can probably beat your self-written solution if you use an existing data structure whose API works well with the transformations/operations you intend to perform on them. Numpy/Pandas only will help if your operations can be expressed as vectorized operations that operate on all (pairs of) elements along a common axis, e.g. all users in your top-level dictionary.

More efficient use of dictionaries

I'm going to store on the order of 10,000 securities X 300 date pairs X 2 Types in some caching mechanism.
I'm assuming I'm going to use a dictionary.
Question Part 1:
Which is more efficient or Faster? Assume that I'll be generally looking up knowing a list of security IDs and the 2 dates plus type. If there is a big efficiency gain by tweaking my lookup, I'm happy to do that. Also assume I can be wasteful of memory to an extent.
Method 1: store and look up using keys that look like strings "securityID_date1_date2_type"
Method 2: store and look up using keys that look like tuples (securityID, date1, date2, type)
Method 3: store and look up using nested dictionaries of some variation mentioned in methods 1 and 2
Question Part 2:
Is there an easy and better way to do this?
It's going to depend a lot on your use case. Is lookup the only activity or will you do other things, e.g:
Iterate all keys/values? For simplicity, you wouldn't want to nest dictionaries if iteration is relatively common.
What about iterating a subset of keys with a given securityID, type, etc.? Nested dictionaries (each keyed on one or more components of your key) would be beneficial if you needed to iterate "keys" with one component having a given value.
What about if you need to iterate based on a different subset of the key components? If that's the case, plain dict is probably not the best idea; you may want relational database, either the built-in sqlite3 module or a third party module for a more "production grade" DBMS.
Aside from that, it matters quite a bit how you construct and use keys. Strings cache their hash code (and can be interned for even faster comparisons), so if you reuse a string for lookup having stored it elsewhere, it's going to be fast. But tuples are usually safer (strings constructed from multiple pieces can accidentally produce the same string from different keys if the separation between components in the string isn't well maintained). And you can easily recover the original components from a tuple, where a string would need to be parsed to recover the values. Nested dicts aren't likely to win (and require some finesse with methods like setdefault to populate properly) in a simple contest of lookup speed, so it's only when iterating a subset of the data for a single component of the key that they're likely to be beneficial.
If you want to benchmark, I'd suggest populating a dict with sample data, then use the timeit module (or ipython's %timeit magic) to test something approximating your use case. Just make sure it's a fair test, e.g. don't lookup the same key each time (using itertools.cycle to repeat a few hundred keys would work better) since dict optimizes for that scenario, and make sure the key is constructed each time, not just reused (unless reuse would be common in the real scenario) so string's caching of hash codes doesn't interfere.

Sensible storage of 1 billion+ values in a python list type structure

I'm writing a program that creates vario-function plots for a fixed region of a digital elevation model that has been converted to an array. I calculate the variance (difference in elevation) and lag (distance) between point pairs within the window constraints. Every array position is compared with every other array position. For each pair, the lag and variance values are appended to separate lists. Once all pairs have been compared, these lists are then used for data binning, averaging and eventually plotting.
The program runs fine for smaller window sizes (say 60x60 px). For windows up to about 120x120 px or so, which would give 2 lists of 207,360,000 entries, I am able to slowly get the program running. Greater than this, and I run into "MemoryError" reports - e.g. for a 240x240 px region, I would have 3,317,760,000 entries
At the beginning of the program, I create an empty list:
variance = []
lag = []
Then within a for loop where I calculate my lags and variances, I append the values to the different lists:
variance.append(var_val)
lag.append(lag_val)
I've had a look over the stackoverflow pages and have seen a similar issue discussed here. This solution would potentially improve temporal program performance however the solution offered only goes up to 100 million entries and therefore doesn't help me out with the larger regions (as with the 240x240px example). I've also considered using numpy arrays to store the values but I don't think this will stave of the memory issues.
Any suggestions for ways to use some kind of list of the proportions I have defined for the larger window sizes would be much appreciated.
I'm new to python so please forgive any ignorance.
The main bulk of the code can be seen here
Use the array module of Python. It offers some list-like types that are more memory efficient (but cannot be used to store random objects, unlike regular lists). For example, you can have arrays containing regular floats ("doubles" in C terms), or even single-precision floats (four bytes each instead of eight, at the cost of a reduced precision). An array of 3 billion such single-floats would fit into 12 GB of memory.
You could look into PyTables, a library wrapping the HDF5 C library that can be used with numpy and pandas.
Essentially PyTables will store your data on disk and transparently load it into memory as needed.
Alternatively if you want to stick to pure python, you could use a sqlite3 database to store and manipulate your data - the docs say the size limit for a sqlite database is 140TB, which should be enough for your data.
try using heapq, import heapq. It uses the heap for storage rather than the stack allowing you to access the computer full memory.

Efficient Datatype Python (list or numpy array?)

I'm still confused whether to use list or numpy array.
I started with the latter, but since I have to do a lot of append
I ended up with many vstacks slowing my code down.
Using list would solve this problem, but I also need to delete elements
which again works well with delete on numpy array.
As it looks now I'll have to write my own data type (in a compiled language, and wrap).
I'm just curious if there isn't a way to get the job done using a python type.
To summarize this are the criterions my data type would have to fulfil:
2d n (variable) rows, each row k (fixed) elements
in memory in one piece (would be nice for efficient operating)
append row (with an in average constant time, like C++ vector just always k elements)
delete a set of elements (best: inplace, keep free space at the end for later append)
access element given the row and column index ( O(1) like data[row*k+ column]
It appears generally useful to me to have a data type like this and not impossible to implement in C/Fortran.
What would be the closest I could get with python?
(Or maybe, Do you think it would work to write a python class for the datatype? what performance should I expect in this case?)
As I see it, if you were doing this in C or Fortran, you'd have to have an idea of the size of the array so that you can allocate the correct amount of memory (ignoring realloc!). So assuming you do know this, why do you need to append to the array?
In any case, numpy arrays have the resize method, which you can use to extend the size of the array.

Storing and reloading large multidimensional data sets in Python

I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:
My data will basically take the form of a multidimensional array where each entry will look something like this:
data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
Each argument has roughly the following numbers of potential values:
stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10,000
Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.
So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.
I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:
a relational database
PyTables
Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)
There's one issue I run into in all three approaches, I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table, or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments then they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200] , [300, 400]] you would retrieve values by asking for the value at an address array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...
That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.
I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.
Why not using a big table for keep all the 500 millions of entries? If you use on-the-flight compression (Blosc compressor recommended here), most of the duplicated entries will be deduped, so the overhead in storage is kept under a minimum. I'd recommend give this a try; sometimes the simple solution works best ;-)
Is there a reason the basic 6 table approach doesn't apply?
i.e. Tables 1-5 would be single column tables defining the valid values for each of the fields, and then the final table would be a 5 column table defining the entries that actually exist.
Alternatively, if every value always exists for the 3rd and 4th string values as you describe, the 6th table could just consist of 3 columns (string1, string2, int1) and you generate the combinations with string3 and string4 dynamically via a Cartesian join.
I'm not entirely sure of what you're trying to do here, but it looks like you trying to create a (potentially) sparse multidimensional array. So I wont go into details for solving your specific problem, but the best package I know that deals with this is Numpy Numpy. Numpy can
be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
I've used Numpy many times for simulation data processing and it provides many useful tools including easy file storage/access.
Hopefully you'll find something in it's very easy to read documentation:
Numpy Documentation with Examples

Categories