I have a dynamic set consisting of a data series on the order of hundreds of objects, where each series must be identified (by integer) and consists of elements, also identified by an integer. Each element is a custom class.
I used a defaultdict to created a nested (2-D) dictionary. This enables me to quickly access a series and individual elements by key/ID. I needed to be able to add and delete elements and entire series, so the dict served me well. Also note that the IDs do not have to be sequential, due to add/delete. The IDs are important since they are unique and referenced elsewhere through my application.
For example, consider the following data set with keys/IDs,
[1][1,2,3,4,5]
[2][1,4,10]
[4][1]
However, now I realize I want to be able to insert elements in a series, but the dictionary doesn't quite support it. For example, I'd like to be able to insert a new element between 3 and 4 for series 1, causing the IDs above it (from 4,5) to increment (to 5,6):
[1][1,2,3,4,5] becomes
[1][1,2,3,4(new),5,6]
The order matters since the elements are part of a sequential series. I realize that this would be easier with a nested list since it supports insert(), but then I would be forced to iterate over the entire 2-D array to get element indices right?
What would be the most optimal way to implement this data structure in Python?
I think what you want is a dict with array values:
dict = {1:[...],3:[...], ....}
You can then operate on the arrays as you please. If the array values are sequential ints
just use:
dict[key].append(vals)
dict[key].sort()
Don't worry about the speed unless you find out it's a problem. Premature optimization
is the root of all evil.
In fact, don't even sort your dict vals until you have to, if you want to be really efficient.
Related
I have a list which is made up out of three layers, looking something like this for illustrative purposes:
a = [[['1'],['2'],['3'],['']],[['5'],['21','33']]]
Thus I have a top list which contains several other lists each of which again contains lists.
The first layer will contain in the tens of lists. The next layer could contain possibly millions of lists and the bottom layer will contain either an empty string, a single string, or a handful of values (each a string).
I now need to access the values in the bottom-most layer and store them in a new list in a particular order which is done inside a loop. What is the fastest way of accessing these values? The amount of memory used is not of primary concern to me (though I obviously don't want to squander it either).
I can think of two ways:
I access list a directly to retrieve the desired value, e.g. a[1][1][0] would return '21'.
I create a copy of the elements of a and then access these to flatten the list a bit more. In this case thus, e.g.: b=a[0], c=a[1] so instead of accessing a[1][1][0] I would now access b[1][0] to retrieve '21'.
Is there any performance penalty involved in accessing nested lists? Thus, is there any benefit to be gained in splitting list a it into separate lists or am I merely incurring a RAM penalty in doing so?
Accessing elements via their index (ie: a[1][1][0]) is a O(1) operation: source. You won't get much quicker than that.
Now, assignment is also a O(1) operation, so there's no difference between the two methods you've described as far as speed goes. The second one actually doesn't incur in any memory problems because assignments to lists are by reference, not by copy (except you explicitly tell it to do it otherwise).
The two methods are more or less identical, given that b=a[0] only binds another name to the list at that index. It does not copy the list. That said, the only difference is that, in your second method, the only difference is that you, in addition to access the nested lists, you end up throwing references around. So, in theory, it is a tiny little bit slower.
As pointed out by #joaquinlpereyra, the Python Wiki has a list of the complexity of such operations: https://wiki.python.org/moin/TimeComplexity
So, long answer cut short: Just accessing the list items is faster.
Porting some python code to LabVIEW and I run across the python set(). Is there a better way of representing this in LabVIEW other than with a variant or array?
The closest would be to use variant attributes.
You use a "dummy" variant to store key/value pairs. The Variant Set Attribute function prevents duplicates (overwrites existing with output indicating replaced) and the Get function will return all key/value pairs if no key value is specified.
The underlying functions use a red-black tree, making lookups very fast for large datasets.
http://forums.ni.com/t5/LabVIEW/Darren-s-Weekly-Nugget-10-09-2006/m-p/425269
As I recall LabView don't include analog of set() from box. Therefore you must create VI for delete duplicate values from Array. I hope below two links will help for you.
Remove duplicate values in an Array
Delete, Collapse, Array Duplicate Elements
Furthermore you can take some HashSet realisation (one, two, three) and call it from LabView.
Languages such as C++ require that an array hold elements of a single type. As I understand it, knowing the size of each element allows for pointer arithmetic, making access of a particular element O(1) time.
What about Python lists?
Python lists allow for mixing element types. Surely the implementation doesn't involve a slow-access data structure, such as a linked lists – right? Is accessing an element even constant time? If so, how does Python achieve it with variable element types?
Its a simple indexed lookup. Python stores references to objects in its lists, not the objects themselves. Consider a C++ list of (void*) pointers. Each pointer is a known size and array lookup is fast, but the things it points to can vary in size.
In Python, everything is an "object" (you can intuitively confirm that by something like (1).__add__(2)). So, roughly speaking, Python's list just contain references to the actual objects stored somewhere in memory. And if you look up an object via the list index - this is very, very simplified - it will redirect you to the actual object.
Here is a nice table that shows you the complexity (Big-Oh) of the different operations on lists.
I am developing a moving average filter for position "tracks" in Touch Designer, which implements a Python runtime. I'm new to Python and I'm unclear on the best data structures to use. The pcode is roughly:
Receive a new list of tracks formatted as:
id, posX, posY
2001, 0.54, 0.21
2002, 0.43, 0.23
...
Add incoming X and Y values to an existing data structure keyed on "id," if that id is already present
Create new entries for new ids
Remove any id entries that are not present in the incoming data
Return a moving average of X and Y values per id
Question: Would it be a good idea to do this as a hashtable where the key is id, and the values are a list of lists? eg:
ids = {2001: posVals, 2002: posVals2}
where posVals is a list of [x,y] pairs?
I think this is like a 3D array, but where I want to use id as a key for lots of operations.
Thanks
First, if the IDs are all relatively small positive integers, and the array is almost completely dense (that is, almost all IDs will exist), a dict is just adding extra overhead. On the other hand, if there are large numbers, or large gaps, or keys of different types, using a dict as a "sparse array" makes perfect sense.
Meanwhile, for the other two dimensions… how many IDs are you going to have, and how many pairs for each? If we're talking a handful of pairs per ID and a few thousands total pairs across all IDs, a list of pairs per ID is perfectly fine (although I'd probably represent each pair as a tuple rather than a list), and anything else would be adding unnecessary complexity. But if there are going to be a lot of pairs per ID, or a lot total, you may run into problems with storage, or performance.
If you can use the third-party library numpy, it can store a 2D array of numbers in much less memory than a list of pairs of numbers, and it can perform calculations like moving averages with much briefer/more readable code, and with much less CPU time. In fact, it can even store a sparse 3D array for you.
If you can only use the standard library, the array module can get you most of the same memory benefits, but without the simplicity benefits (in fact, your code becomes slightly more complex, as you have to represent the 2D array as a 1D array, old-school-C-style—although you can wrap that up pretty easily), or the time performance benefits, and it can't help with the sparseness.
Yes, this is how I would do it. It's very intuitive this way, assuming you're always looking things up by their id and don't need to sort in some other way.
Also, the terminology in Python is dict (as in dictionary), rather than hashtable.
I'm still confused whether to use list or numpy array.
I started with the latter, but since I have to do a lot of append
I ended up with many vstacks slowing my code down.
Using list would solve this problem, but I also need to delete elements
which again works well with delete on numpy array.
As it looks now I'll have to write my own data type (in a compiled language, and wrap).
I'm just curious if there isn't a way to get the job done using a python type.
To summarize this are the criterions my data type would have to fulfil:
2d n (variable) rows, each row k (fixed) elements
in memory in one piece (would be nice for efficient operating)
append row (with an in average constant time, like C++ vector just always k elements)
delete a set of elements (best: inplace, keep free space at the end for later append)
access element given the row and column index ( O(1) like data[row*k+ column]
It appears generally useful to me to have a data type like this and not impossible to implement in C/Fortran.
What would be the closest I could get with python?
(Or maybe, Do you think it would work to write a python class for the datatype? what performance should I expect in this case?)
As I see it, if you were doing this in C or Fortran, you'd have to have an idea of the size of the array so that you can allocate the correct amount of memory (ignoring realloc!). So assuming you do know this, why do you need to append to the array?
In any case, numpy arrays have the resize method, which you can use to extend the size of the array.