I am trying to find differences between MongoDB records. After performing my queries, I end up with a set of unique results (by applying set()).
Now, I want to compare a new extraction with the set that I just defined to see if there are any new additions to the record.
What I have done now is the following:
unique_documents = set([str(i) for i in dict_of_uniques[my_key]])
all_documents = [str(i) for i in (dict_of_all_docs[my_key])]
Basically I am trying to compare the string version of a dict among the two variables.
I have tried several approaches, among them unique_documents.difference(all_documents), but it keeps returning an empty set. I know for a fact that the all_documents variable contains two new entries in the record. I would like to know which ones they are.
Thank you,
If all_documents is the set with new elements that you want to get as the result, then you need to reverse the order of the arguments to the difference method.
unique_documents = set([str(i) for i in dict_of_uniques[my_key]])
all_documents = set([str(i) for i in (dict_of_all_docs[my_key])])
all_documents.difference(unique_documents)
See how the order matters:
>>> x = set([1,2,3])
>>> y = set([3,4,5])
>>> x.difference(y)
{1, 2}
>>> y.difference(x)
{4, 5}
difference gives you the elements of the first set that are not present in the second set.
If you want to see things that were either added or removed, you can use symmetric_difference. This function is described as "symmetric" because it gives the same results regardless of argument order.
>>> x.symmetric_difference(y)
{1, 2, 4, 5}
>>> y.symmetric_difference(x)
{1, 2, 4, 5}
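Applied to the situation in the question, a minimal sketch might look like the following. The two lists are made-up sample data standing in for dict_of_uniques[my_key] and dict_of_all_docs[my_key]; the names previous_docs, current_docs, added and removed are assumptions for illustration only.

```python
# Hypothetical sample data standing in for the two MongoDB extractions.
previous_docs = [{"_id": 1, "name": "a"}, {"_id": 2, "name": "b"}]
current_docs = [
    {"_id": 1, "name": "a"},
    {"_id": 2, "name": "b"},
    {"_id": 3, "name": "c"},
    {"_id": 4, "name": "d"},
]

unique_documents = {str(doc) for doc in previous_docs}
all_documents = {str(doc) for doc in current_docs}

# Documents in the new extraction that were not in the old one
# (equivalent to all_documents.difference(unique_documents)).
added = all_documents - unique_documents

# Documents in the old extraction that are missing from the new one.
removed = unique_documents - all_documents

print(sorted(added))
print(sorted(removed))
```

One caveat with comparing string versions: str() of a dict depends on the order in which its keys were inserted, so two equal documents whose keys arrive in a different order would be treated as different.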
It is hard to tell without a description of the dictionary structure but your code seems to be comparing single keys only. If you want to compare the content of both dictionaries, you need to get all the values:
currentData = set( str(rec) for rec in dict_of_all_docs.values() )
changedKeys = [k for k,value in dict_of_fetched.items() if str(value) not in currentData]
This doesn't seem very efficient, but without more information on the data structure it is hard to make a better suggestion. If your records can already be matched by a dictionary key, you probably don't need to use a set at all. A simple loop should do.
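As a rough sketch of the "simple loop" idea, assuming both dictionaries map the same kind of key to a record (the sample data and the names new_keys and changed_keys are made up for illustration):

```python
# Hypothetical data: both dictionaries map a record key to a record.
dict_of_all_docs = {"k1": {"v": 1}, "k2": {"v": 2}}
dict_of_fetched = {"k1": {"v": 1}, "k2": {"v": 20}, "k3": {"v": 3}}

new_keys = []      # keys that only exist in the fetched data
changed_keys = []  # keys present in both, but with different content

for key, value in dict_of_fetched.items():
    if key not in dict_of_all_docs:
        new_keys.append(key)
    elif dict_of_all_docs[key] != value:
        changed_keys.append(key)

print(new_keys, changed_keys)
```

Because the records are compared directly with !=, no conversion to strings is needed here.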
Rather than unique_documents.difference(all_documents) use all_documents.difference(unique_documents)
More on Python Sets
I have a list of sets given by,
sets1 = [{1},{2},{1}]
When I find the unique elements in this list using numpy's unique, I get
np.unique(sets1)
Out[18]: array([{1}, {2}, {1}], dtype=object)
As can be seen, the result is wrong, as {1} is repeated in the output.
When I change the order in the input by making similar elements adjacent, this doesn't happen.
sets2 = [{1},{1},{2}]
np.unique(sets2)
Out[21]: array([{1}, {2}], dtype=object)
Why does this occur? Or is there something wrong with the way I have done it?
What happens here is that the np.unique function is based on the np._unique1d function from NumPy (see the code here), which itself uses the .sort() method.
Now, sorting a list of sets that contain only one integer in each set will not result in a list with each set ordered by the value of the integer present in the set. So we will have (and that is not what we want):
sets = [{1},{2},{1}]
sets.sort()
print(sets)
# > [{1},{2},{1}]
# ie. the list has not been "sorted" like we want it to
Now, as you have pointed out, if the list of sets is already ordered in the way you want, np.unique will work (since you would have sorted the list beforehand).
One specific solution (though, please be aware that it will only work for a list of sets that each contain a single integer) would then be:
np.unique(sorted(sets, key=lambda x: next(iter(x))))
That is because set is an unhashable type, and two separately created sets are distinct objects:
{1} is {1} # will give False
You can use Python's collections.Counter if you can convert each set to a tuple, as below:
from collections import Counter
sets1 = [{1},{2},{1}]
Counter([tuple(a) for a in sets1])
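A variation on the same idea, as a sketch: frozenset is hashable like a tuple, but it does not depend on the order in which the elements happen to be iterated, which matters once the sets hold more than one element. The names counts and unique_sets are just for illustration.

```python
from collections import Counter

sets1 = [{1}, {2}, {1}]

# frozenset is hashable, so it can serve as a Counter key.
counts = Counter(frozenset(s) for s in sets1)

# Counter keeps keys in order of first occurrence (Python 3.7+).
unique_sets = [set(fs) for fs in counts]

print(counts)
print(unique_sets)
```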
There is one question about Python3.6. It's about the output of Set expressions. I do not know why the code below does not appear in order:
a = {i*2 for i in range(1, 5)}
print(a)
I expect {2, 4, 6, 8} but the output is {8, 2, 4, 6}
Why is it not in order?
If you take a look at the documentation; the first sentence of the set documentation is:
A set object is an unordered collection of distinct hashable objects.
So the order of the elements in the set is for all practical purposes random. Even in python-3.6.
Here is an example in Python: the elements of a set are not arranged in any particular order, i.e. their arrangement is arbitrary.
Python sets are not ordered, they only contain elements. If you need your data structure to have a certain order, maybe consider an OrderedDict.
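Two common ways to get a predictable order, sketched under the assumption that either sorted order or first-seen order is what is wanted (the names in_order and first_seen are illustrative):

```python
a = {i * 2 for i in range(1, 5)}

# A sorted list gives a predictable order whenever one is needed.
in_order = sorted(a)

# dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+).
first_seen = list(dict.fromkeys([6, 2, 6, 8, 2, 4]))

print(in_order)
print(first_seen)
```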
I'm trying to learn python (with a VBA background).
I've imported the following function into my interpreter:
def shuffle(dict_in_question):  # takes a dictionary as an argument and shuffles it
    shuff_dict = {}
    n = len(dict_in_question.keys())
    for i in range(0, n):
        shuff_dict[i] = pick_item(dict_in_question)
    return shuff_dict
The following is a printout from my interpreter:
>>> stuff = {"a":"Dave", "b":"Ben", "c":"Harry"}
>>> stuff
{'a': 'Dave', 'c': 'Harry', 'b': 'Ben'}
>>> decky11.shuffle(stuff)
{0: 'Harry', 1: 'Dave', 2: 'Ben'}
>>> stuff
{}
>>>
It looks like the dictionary gets shuffled, but after that, the dictionary is empty. Why? Or, am I using it wrong?
You need to assign it back to stuff too, as you're returning a new dictionary.
>>> stuff = decky11.shuffle(stuff)
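The question does not show pick_item, so the following is only a guess at what is going on: if pick_item removes each item it picks, the dictionary passed in will be empty afterwards. A sketch under that assumption, with a shuffle that works on a copy so the caller's dictionary survives:

```python
import random

def pick_item(d):
    # Hypothetical stand-in for the helper not shown in the question:
    # removes a random key from d and returns its value.
    key = random.choice(list(d))
    return d.pop(key)

def shuffle(dict_in_question):
    # Work on a shallow copy so the caller's dictionary is left untouched.
    remaining = dict(dict_in_question)
    return {i: pick_item(remaining) for i in range(len(dict_in_question))}

stuff = {"a": "Dave", "b": "Ben", "c": "Harry"}
shuffled = shuffle(stuff)
print(shuffled)
print(stuff)  # still holds all three entries
```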
Dogbert's answer solves your immediate problem, but keep in mind that dictionaries in the Python version used here don't have an order! There's no such thing as "the first element of my_dict." (Using .keys() or .values() generates a list, which does have an order, but the dictionary itself doesn't.) So, it's not really meaningful to talk about "shuffling" a dictionary. (Since Python 3.7, dictionaries do preserve insertion order, but they are still not meant to be reordered in place.)
All you've actually done here is remapped the keys from letters a, b, c, to integers 0, 1, 2. These keys have different hash values than the keys you started with, so they print in a different order. But you haven't changed the order of the dictionary, because the dictionary didn't have an order to begin with.
Depending on what you're ultimately using this for (are you iterating over keys?), you can do something more direct:
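import random
shufflekeys = list(stuff.keys())
random.shuffle(shufflekeys)  # shuffles the list in place and returns None
for key in shufflekeys:
    # do thing that requires order
    print(key, stuff[key])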
As a side note, dictionaries (aka hash tables) are a really clever, hyper-useful data structure, which I'd recommend learning deeply if you're not already familiar. A good hash function (and non-pathological data) will give you O(1) (i.e., constant) lookup time - so you can check if a key is in a dictionary of a million items as fast as you can in a dictionary of ten items! The lack of order is a critical feature of a dictionary that enables this speed.
How do I add values to an existing set?
your_set.update(your_sequence_of_values)
e.g., your_set.update([1, 2, 3, 4]). Or, if you have to produce the values in a loop for some other reason,
for value in ...:
    your_set.add(value)
But, of course, doing it in bulk with a single .update call is faster and handier, when otherwise feasible.
Define a set
a = set()
Use add to append single values
a.add(1)
a.add(2)
Use update to add elements from tuples, sets, lists or frozen-sets
a.update([3, 4])
>>> print(a)
{1, 2, 3, 4}
Note: Since set elements must be hashable, and lists are considered mutable, you cannot add a list to a set. You also cannot add other sets to a set. You can however, add the elements from lists and sets as demonstrated with the .update method.
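A short sketch of the hashability rule described above. It also shows that a frozenset or a tuple, being immutable, can be stored as a single element (the variable error_message is just for illustration):

```python
a = {1, 2}

try:
    a.add({3, 4})  # a plain set is mutable, hence unhashable
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

a.add(frozenset({3, 4}))  # a frozenset is immutable and hashable
a.add((5, 6))             # so is a tuple of hashable items

print(len(a))
```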
You can also use the | operator to concatenate two sets (union in set theory):
>>> my_set = {1}
>>> my_set = my_set | {2}
>>> my_set
{1, 2}
Or a shorter form using |=:
>>> my_set = {1}
>>> my_set |= {2}
>>> my_set
{1, 2}
Note: In versions prior to Python 2.7, use set([...]) instead of {...}.
Use update like this:
keep.update(newvalues)
This question is the first one that shows up on Google when one looks up "Python how to add elements to set", so it's worth noting explicitly that, if you want to add a whole string to a set, it should be added with .add(), not .update().
Say you have a string foo_str whose contents are 'this is a sentence', and you have some set bar_set equal to set().
If you do bar_set.update(foo_str), the contents of your set will be {'t', 'a', ' ', 'e', 's', 'n', 'h', 'c', 'i'}.
If you do bar_set.add(foo_str), the contents of your set will be {'this is a sentence'}.
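The difference can be checked with a small sketch (the set names chars, whole and wrapped are made up for illustration):

```python
foo_str = "this is a sentence"

chars = set()
chars.update(foo_str)       # iterates over the string, adding each character

whole = set()
whole.add(foo_str)          # adds the string itself as one element

wrapped = set()
wrapped.update([foo_str])   # wrapping it in a list also adds it whole

print(len(chars), whole, wrapped)
```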
The way I like to do this is to convert both the original set and the values I'd like to add into lists, add them, and then convert them back into a set, like this:
setMenu = {"Eggs", "Bacon"}
print(setMenu)
> {'Bacon', 'Eggs'}
setMenu = set(list(setMenu) + list({"Spam"}))
print(setMenu)
> {'Bacon', 'Spam', 'Eggs'}
setAdditions = {"Lobster", "Sausage"}
setMenu = set(list(setMenu) + list(setAdditions))
print(setMenu)
> {'Lobster', 'Spam', 'Eggs', 'Sausage', 'Bacon'}
This way I can also easily add multiple sets using the same logic. Passing a list of sets to .update(), as in setMenu.update([setA, setB]), gets me a TypeError: unhashable type: 'set', although .update() does accept several sets as separate arguments.
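For comparison, a sketch of adding several sets without converting to lists first. Both update() and union() accept any number of iterables as separate arguments; the snake_case names below are illustrative stand-ins for the ones in the answer above.

```python
set_menu = {"Eggs", "Bacon"}
set_additions = {"Lobster", "Sausage"}

# update() modifies the set in place and takes several iterables at once.
set_menu.update({"Spam"}, set_additions)

# union() does the same without modifying the original set.
bigger_menu = set_menu.union({"Toast"}, {"Beans"})

print(set_menu)
print(bigger_menu)
```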
I just wanted to add a quick note here. So I was looking for the fastest method among the three methods.
Using the set.add() function
Using the set.update() function
Using the "|" operator function.
I found that, whether adding a single value or multiple values to a set, set.add() was the most efficient of the three methods.
I ran a test, and here is the result:
set.add() Took: 0.5208224999951199
set.update() Took: 0.6461397000239231
"|" operator Took: 0.7649438999942504
PS: If you want to know more about the analysis, check here: Fastest way to append values to set.
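The benchmark code itself is not shown here, so the following is not the poster's test, only a sketch of how such a comparison could be set up with timeit. The function names and the input size are assumptions, and the ranking can change depending on whether values are added one at a time or in bulk, and on the machine.

```python
import timeit

values = list(range(1000))

def with_add():
    s = set()
    for v in values:
        s.add(v)          # one call per value
    return s

def with_update():
    s = set()
    s.update(values)      # one bulk call
    return s

def with_union_operator():
    s = set()
    s |= set(values)      # in-place union with another set
    return s

for fn in (with_add, with_update, with_union_operator):
    elapsed = timeit.timeit(fn, number=200)
    print(f"{fn.__name__}: {elapsed:.4f}s")
```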
For me, in Python 3, it's working simply in this way:
keep = keep.union((0,1,2,3,4,5,6,7,8,9,10))
I don't know if it may be correct...
keep.update((0,1,2,3,4,5,6,7,8,9,10))
Or
keep.update(np.arange(11))
Does anyone have experience working with pycassa? I have a question about it: how do I get all the keys that are stored in the database?
In this small snippet, we need to give the keys in order to get the associated columns (here the keys are 'foo' and 'bar'). That is fine, but my requirement is to get all the keys (only the keys) at once as a Python list or similar data structure.
cf.multiget(['foo', 'bar'])
{'foo': {'column1': 'val2'}, 'bar': {'column1': 'val3', 'column2': 'val4'}}
Thanks.
try:
list(cf.get_range().get_keys())
more good stuff here: http://github.com/vomjom/pycassa
You can try: cf.get_range(column_count=0,filter_empty=False).
# Since get_range() returns a generator - print only the keys.
for value in cf.get_range(column_count=0,filter_empty=False):
    print(value[0])
get_range([start][, finish][, columns][, column_start][, column_finish][, column_reversed][, column_count][, row_count][, include_timestamp][, super_column][, read_consistency_level][, buffer_size])
Get an iterator over rows in a specified key range.
http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.get_range
Minor improvement on Santhosh's solution
dict(cf.get_range(column_count=0,filter_empty=False)).keys()
If you care about order:
OrderedDict(cf.get_range(column_count=0,filter_empty=False)).keys()
get_range returns a generator. We can create a dict from the generator and get the keys from that.
column_count=0 limits results to the row_key. However, because these results have no columns we also need filter_empty.
filter_empty=False will allow us to get the results. However empty rows and range ghosts may be included in our result now.
If we don't mind more overhead, getting just the first column will resolve the empty rows and range ghosts.
dict(cf.get_range(column_count=1)).keys()
There's a problem with Santhosh's and kzarns' answers, as you're bringing into memory a potentially huge dict that you are immediately discarding. A better approach would be to use a list comprehension for this:
keys = [c[0] for c in cf.get_range(column_count=0, filter_empty=False)]
This iterates over the generator returned by get_range, keeps the key in memory and stores the list.
If the list of keys were also potentially too large to keep in memory all at once, and you only need to iterate once, you should use a generator expression instead of a list comprehension:
kgen = (c[0] for c in cf.get_range(column_count=0, filter_empty=False))
# you can iterate over kgen, but do not treat it as a list, it isn't!
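The difference between the two forms can be tried without a Cassandra cluster. In this sketch, fake_get_range is a hypothetical stand-in for cf.get_range(column_count=0, filter_empty=False), assumed to yield (row_key, columns) pairs; it is not part of pycassa.

```python
def fake_get_range():
    # Hypothetical stand-in for cf.get_range(column_count=0, filter_empty=False):
    # yields (row_key, columns) pairs with empty column dicts.
    for key in ("foo", "bar", "baz"):
        yield key, {}

# List comprehension: all keys are held in memory at once.
keys = [row[0] for row in fake_get_range()]

# Generator expression: keys are produced lazily and can be consumed only once.
kgen = (row[0] for row in fake_get_range())
first_pass = list(kgen)
second_pass = list(kgen)  # the generator is already exhausted

print(keys, first_pass, second_pass)
```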