I'm porting some Python code to LabVIEW and I've run across the Python set(). Is there a better way of representing this in LabVIEW other than with a variant or an array?
The closest would be to use variant attributes.
You use a "dummy" variant to store key/value pairs. The Variant Set Attribute function prevents duplicates (overwrites existing with output indicating replaced) and the Get function will return all key/value pairs if no key value is specified.
The underlying functions use a red-black tree, making lookups very fast for large datasets.
http://forums.ni.com/t5/LabVIEW/Darren-s-Weekly-Nugget-10-09-2006/m-p/425269
As I recall, LabVIEW doesn't include an equivalent of set() out of the box, so you will have to create a VI that deletes duplicate values from an array. I hope the two links below help.
Remove duplicate values in an Array
Delete, Collapse, Array Duplicate Elements
Furthermore, you can take a HashSet implementation (one, two, three) and call it from LabVIEW.
I am trying to migrate a Python 2 dictionary to Python 3.
The dict ordering behaviour has of course changed:
Python 2: Unordered, but in practice it appears to be deterministic/consistent between subsequent runs of code using the same key-value pairs inserted in a consistent order on the same machine, etc. The number of possible reorderings is restricted (though that is computer science territory from my point of view).
Python 3.7+: Insertion order is guaranteed.
I am looping over the elements of the dictionary and adding the key-value pairs to another structure in a way that is a function of the dict values that is order-dependent. Specifically, I am removing the elements based on their geometric proximity to certain landmarks ('nearest neighbours'). If the ordering changes, the removal order changes and the outcome is not the same.
I am trying to reproduce Python 2's exact mangling of the insertion order. This can be thought of as a python function or 'behaviour'. So, how can I replicate a Python 2 dict ordering in Python 3.7+?
Caveats:
Of course I can use an OrderedDict in Python 2 (or 3.7) and often do (and should have here). But for the purposes of this question please don't suggest it, or that not using one was a bad idea etc.
I could reorder the keys of the original dictionary. Could this be a route to solving the conundrum? In general can Python 2 be "emulated" in Python 3?
I am trying to reproduce Python 2's exact mangling of the insertion order. This can be thought of as a python function or 'behaviour'. So, how can I replicate a Python 2 dict ordering in Python 3.7+?
You can't.
I could reorder the keys of the original dictionary. Could this be a route to solving the conundrum? In general can Python 2 be "emulated" in Python 3?
No. Not without reimplementing Python 2's hashmap anyway (or compiling the "old" dict under a new name).
The Python 2 ordering (and that of hashmaps in general) is a consequence of the hash function, the collision resolution algorithm and the resizing thresholds (the load factor):
a hashmap is fundamentally an array of buckets; without special handling, the iteration order is simply that of walking the buckets array and yielding the entries in the non-empty buckets
an item is stored at buckets[hash(key) % len(buckets)], hence the influence of the hash algorithm and of the size of the buckets array (a toy sketch after this list makes this concrete)
if two items have the same modular hash, the collision has to be resolved; for open addressing there are lots of algorithms (all of which end up putting one of the items in the "wrong" bucket, but how they do it and which item they move can be quite different)
different collision resolution algorithms have different sensitivities to the buckets array being full: some start degrading at 70% while others are fine at 90%. This informs the load factor limits, therefore when you resize, and thus the new locations (and order) of the items
furthermore, different hash tables can have different growth factors; even if all other factors are identical, different growth factors lead to different sizes of buckets arrays, and thus different distributions (and orders) once the modulo is applied
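To make those points concrete, here is a toy open-addressing table. This is not CPython's actual dict; the class name, the linear probing scheme and the table size are purely illustrative.
class ToyHashMap:
    # Fixed-size table with linear probing; no resizing, so keep it under-full.
    def __init__(self, size=8):
        self.buckets = [None] * size          # each slot is (key, value) or None

    def put(self, key, value):
        i = hash(key) % len(self.buckets)
        while self.buckets[i] is not None and self.buckets[i][0] != key:
            i = (i + 1) % len(self.buckets)   # collision: probe the next slot
        self.buckets[i] = (key, value)

    def items(self):
        # Iteration order is simply bucket order, not insertion order.
        return [slot for slot in self.buckets if slot is not None]

m = ToyHashMap(size=8)
for k in ["alpha", "beta", "gamma", "delta"]:
    m.put(k, None)
print(m.items())   # order depends on hash() and the table size, not on insertion
Note that string hashing is randomized per process in Python 3, so the printed order can even change between runs; change the table size or the hash function and the order changes again, which is the whole point.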
Since Python 3.6, CPython has changed the very implementation of its hashmaps to one that is "naturally ordered"[1]. The only way to get the non-insertion order would be to iterate the dk_indices field of the hashmap and explicitly dereference the corresponding dk_entries items, and neither field is part of the API (at the C level, to say nothing of the Python level). Even if they were, you would have to check the implementation details to see whether the new hashmap uses the same hashing and growth thresholds. Which again doesn't matter, because the internal details are not accessible.
[1]: you can see pypy's explanation of the scheme for a primer, though CPython's implementation is different.
I have a method for creating an image "hash" which is useful for duplicate frame detection. (Doesn't really matter for the question)
Currently I put each frame of a video in a set, and can do things like find videos that contain intersections by comparing the sets. (I have billions of hashes)
Since I have my own "hash" I don't need the values of the set, only the ability to detect duplicate items.
This would reduce my memory footprint by like half (since I would only have the hashes).
I know that internally a set actually stores hash/value pairs. There must be a way to make a "SparseSet" or a hash-only set.
Something like
2 in sparset(1,2,3)
True
but where
for s in sparset(1,2,3)
would return nothing, or hashes not values.
That's not quite how sets work. Both the hash value and the value are required, because the values must be checked for equality in case of a hash collision.
If you don't care about collisions, you can use a Bloom filter instead of a set. These are very memory efficient, but give probabilistic answers (either definitely not in the set, or maybe in the set). There's no Bloom filter in the standard library, but there are several implementations on PyPI.
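As a rough illustration of the Bloom-filter idea only (this is a hand-rolled sketch, not one of the PyPI packages; the class name, bit-array size and hash count are made up and not tuned):
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from one SHA-256 digest of the item.
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add(12345)
print(12345 in bf)   # True
print(99999 in bf)   # False with high probability; may be a false positive
The PyPI implementations also let you size the filter for a target false-positive rate, which you would want for billions of hashes.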
If you care more about optimizing space than time, you could just keep the hashes in a list and then when you need to check for an element, sort it in place and do a binary search. Python's Timsort is very efficient when the list is mostly sorted already, so subsequent sorts will be relatively fast. Python lists have a sort() method and you can implement a binary search fairly easily using the standard library bisect module.
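A minimal sketch of that sort-on-demand approach, assuming the frame hashes are plain integers (the HashOnlySet name is just for illustration):
import bisect

class HashOnlySet:
    # Store only the hash values; sort lazily right before a lookup.
    def __init__(self):
        self._hashes = []
        self._dirty = False

    def add(self, h):
        self._hashes.append(h)
        self._dirty = True

    def __contains__(self, h):
        if self._dirty:
            self._hashes.sort()      # Timsort is cheap on mostly-sorted data
            self._dirty = False
        i = bisect.bisect_left(self._hashes, h)
        return i < len(self._hashes) and self._hashes[i] == h

s = HashOnlySet()
s.add(42)
s.add(7)
print(42 in s)   # True
print(8 in s)    # False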
You can combine both techniques, i.e. don't bother sorting if the Bloom filter indicates the element is not in the set. And of course, don't bother sorting again if you haven't added any elements since last time.
The reason I am asking this question is that I am working with huge amounts of data.
In my algorithm, I basically need something like this:
users_per_document = {}
documents_per_user = {}
As you can tell from the names of the dictionaries, I need the users that clicked a specific document and the documents that were clicked by a specific user.
In that case I have "duplicated" data, and both structures together overflow the memory and my script gets killed after a while. Because I use very large data sets, I have to do this in an efficient way.
I think it is not possible, but I need to ask: is there a way to get all keys for a specific value from a dictionary?
Because if there is a way to do that, I will not need one of the dictionaries anymore.
For example:
users_per_document["document1"] obviously returns the appropriate
users,
what I need is users_per_document.getKeys("user1") because this will basically return the same thing with documents_per_user["user1"]
If it is not possible, any suggestion is pleased..
If you are using Python 3.x, you can do the following. If 2.x, just use .iteritems() instead.
documents_for_user1 = [key for key, value in users_per_document.items() if "user1" in value]
Note: This does iterate over the whole dictionary. A dictionary isn't really an ideal data structure to get all keys for a specific value, as it will be O(n^2) if you have to perform this operation n times.
I am not very sure about the Python side, but in general computer-science terms you can solve the problem the following way:
Basically, you can have a two-dimensional boolean array (a matrix): the first index is for users, the second index is for documents, and the stored value is a boolean.
The boolean value represents whether there is a relation between the specific user and the specific document.
PS: if you have a really sparse matrix, you can make it much more efficient, but that is another story.
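For illustration, a rough Python sketch of that idea (the sizes and the click data below are made up), with the sparse variant at the end:
# Dense version: clicked[u][d] is True if user u clicked document d.
num_users, num_docs = 3, 4
clicked = [[False] * num_docs for _ in range(num_users)]
clicked[0][2] = True
clicked[1][0] = True

# Sparse variant: store only the (user, doc) pairs that exist,
# and answer both directions of the query from the same structure.
clicks = {(0, 2), (1, 0)}
users_for_doc2 = [u for (u, d) in clicks if d == 2]   # users who clicked doc 2
docs_for_user0 = [d for (u, d) in clicks if u == 0]   # docs clicked by user 0
print(users_for_doc2, docs_for_user0)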
I'm going to store on the order of 10,000 securities × 300 date pairs × 2 types in some caching mechanism.
I'm assuming I'm going to use a dictionary.
Question Part 1:
Which is more efficient or faster? Assume that I'll generally be looking up with a known list of security IDs plus the two dates and the type. If there is a big efficiency gain from tweaking my lookup, I'm happy to do that. Also assume I can be wasteful of memory to an extent.
Method 1: store and look up using keys that look like strings "securityID_date1_date2_type"
Method 2: store and look up using keys that look like tuples (securityID, date1, date2, type)
Method 3: store and look up using nested dictionaries of some variation mentioned in methods 1 and 2
Question Part 2:
Is there an easy and better way to do this?
It's going to depend a lot on your use case. Is lookup the only activity, or will you do other things, e.g.:
Iterate all keys/values? For simplicity, you wouldn't want to nest dictionaries if iteration is relatively common.
What about iterating a subset of keys with a given securityID, type, etc.? Nested dictionaries (each keyed on one or more components of your key) would be beneficial if you needed to iterate "keys" with one component having a given value.
What about if you need to iterate based on a different subset of the key components? If that's the case, plain dict is probably not the best idea; you may want relational database, either the built-in sqlite3 module or a third party module for a more "production grade" DBMS.
Aside from that, it matters quite a bit how you construct and use keys. Strings cache their hash code (and can be interned for even faster comparisons), so if you reuse a string for lookup having stored it elsewhere, it's going to be fast. But tuples are usually safer (strings constructed from multiple pieces can accidentally produce the same string from different keys if the separation between components in the string isn't well maintained). And you can easily recover the original components from a tuple, where a string would need to be parsed to recover the values. Nested dicts aren't likely to win (and require some finesse with methods like setdefault to populate properly) in a simple contest of lookup speed, so it's only when iterating a subset of the data for a single component of the key that they're likely to be beneficial.
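A quick illustration of that string-key pitfall (the component values here are made up):
# Two different key triples collapse to the same string because the
# separator character also appears inside a component.
key_a = "_".join(["AB", "C_1", "2020"])
key_b = "_".join(["AB_C", "1", "2020"])
print(key_a == key_b)                                    # True: accidental collision
print(("AB", "C_1", "2020") == ("AB_C", "1", "2020"))    # False: tuples keep components distinct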
If you want to benchmark, I'd suggest populating a dict with sample data, then use the timeit module (or ipython's %timeit magic) to test something approximating your use case. Just make sure it's a fair test, e.g. don't lookup the same key each time (using itertools.cycle to repeat a few hundred keys would work better) since dict optimizes for that scenario, and make sure the key is constructed each time, not just reused (unless reuse would be common in the real scenario) so string's caching of hash codes doesn't interfere.
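A rough benchmark along those lines might look like this; all of the sample data and sizes below are synthetic placeholders, not your real shapes:
import itertools
import timeit

securities = [f"SEC{i:05d}" for i in range(1000)]
date_pairs = [("2023-01-01", "2023-12-31"), ("2022-01-01", "2022-12-31")]
types = ["bid", "ask"]

tuple_cache = {(s, d1, d2, t): 0.0
               for s in securities for (d1, d2) in date_pairs for t in types}
str_cache = {f"{s}_{d1}_{d2}_{t}": 0.0
             for s in securities for (d1, d2) in date_pairs for t in types}

# Cycle through a few hundred distinct keys so the dict isn't always hitting
# the same entry, and build each key fresh so hash caching doesn't interfere.
sample = list(itertools.islice(itertools.cycle(tuple_cache), 500))

def lookup_tuples():
    for s, d1, d2, t in sample:
        _ = tuple_cache[(s, d1, d2, t)]      # tuple key built on every lookup

def lookup_strings():
    for s, d1, d2, t in sample:
        _ = str_cache[f"{s}_{d1}_{d2}_{t}"]  # string key built on every lookup

print("tuple keys :", timeit.timeit(lookup_tuples, number=1000))
print("string keys:", timeit.timeit(lookup_strings, number=1000))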
I have a dynamic set consisting of a data series on the order of hundreds of objects, where each series must be identified (by integer) and consists of elements, also identified by an integer. Each element is a custom class.
I used a defaultdict to create a nested (2-D) dictionary. This enables me to quickly access a series and individual elements by key/ID. I needed to be able to add and delete elements and entire series, so the dict served me well. Also note that the IDs do not have to be sequential, due to add/delete. The IDs are important since they are unique and referenced elsewhere throughout my application.
For example, consider the following data set with keys/IDs,
[1][1,2,3,4,5]
[2][1,4,10]
[4][1]
However, now I realize I want to be able to insert elements in a series, but the dictionary doesn't quite support it. For example, I'd like to be able to insert a new element between 3 and 4 for series 1, causing the IDs above it (from 4,5) to increment (to 5,6):
[1][1,2,3,4,5] becomes
[1][1,2,3,4(new),5,6]
The order matters since the elements are part of a sequential series. I realize that this would be easier with a nested list since it supports insert(), but then I would be forced to iterate over the entire 2-D array to get element indices right?
What would be the most optimal way to implement this data structure in Python?
I think what you want is a dict with array values:
series = {1: [...], 3: [...], ...}
You can then operate on the lists as you please. If the list values are sequential ints,
just use:
series[key].append(val)
series[key].sort()
Don't worry about the speed unless you find out it's a problem. Premature optimization is the root of all evil.
In fact, don't even sort your dict vals until you have to, if you want to be really efficient.
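As a concrete sketch of that dict-with-list-values idea (the placeholder strings below stand in for the asker's custom element class):
# series maps a series ID to an ordered list of elements; the list position
# plays the role of the sequential element ID, so an insert automatically
# "renumbers" everything after the insertion point.
series = {
    1: ["e1", "e2", "e3", "e4", "e5"],
    2: ["e1", "e4", "e10"],
    4: ["e1"],
}

series[1].insert(3, "new")   # insert between the 3rd and 4th elements of series 1
print(series[1])             # ['e1', 'e2', 'e3', 'new', 'e4', 'e5']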