I would like to loop through a big two-dimensional list:
authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"], ... ]
and get a list that contains all the names that occur in authors.
When I loop through the list, I need a container to store the names I've already seen. I'm wondering whether I should use a list or a dict:
with a list:
seen = []
for author_list in authors:
    for author in author_list:
        if author not in seen:
            seen.append(author)
result = seen
with a dict:
seen = {}
for author_list in authors:
    for author in author_list:
        if author not in seen:
            seen[author] = True
result = seen.keys()
Which one is faster? Or are there better solutions?
You really want a set. Sets are implemented as hash tables, which is also why they can only contain unique (and hashable) elements. Hash tables allow membership testing (if element in my_set) in O(1) time on average. This contrasts with lists, where the only way to check whether an element is in the list is to check every element of the list in turn (in O(n) time).
A dict is similar to a set in that both allow unique keys only, and both are implemented as hash tables. They both allow O(1) membership testing. The difference is that a set only has keys, while a dict has both keys and values (which is extra overhead you don't need in this application.)
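To make the O(1) versus O(n) point concrete, here is a rough sketch that times the same membership test against a list, a set and a dict. The names (as_list, as_set, as_dict, target) and the size of 10000 are made up for illustration, and the absolute timings will vary from machine to machine; only the gap between the list and the other two is the point.

```python
import timeit

# Hypothetical data: 10000 distinct author names.
names = ["author%d" % i for i in range(10000)]
as_list = list(names)
as_set = set(names)
as_dict = dict.fromkeys(names, True)

target = "author9999"  # worst case for the list: the last element

# All three containers give the same answer...
found = (target in as_list, target in as_set, target in as_dict)

# ...but the list has to scan its elements, while set and dict hash the key.
for label, container in [("list", as_list), ("set", as_set), ("dict", as_dict)]:
    seconds = timeit.timeit(lambda: target in container, number=200)
    print("%-4s %.6f s" % (label, seconds))
```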
Using a set, and replacing the nested for loop with an itertools.chain() to flatten the 2D list to a 1D list:
import itertools
seen = set()
for author in itertools.chain(*authors):
seen.add(author)
Or shorter:
import itertools
seen = set( itertools.chain(*authors) )
Edit (thanks, @jamylak): more memory efficient for large lists:
import itertools
seen = set( itertools.chain.from_iterable(authors) )
Example on a list of lists:
>>> a = [[1,2],[1,2],[1,2],[3,4]]
>>> set ( itertools.chain(*a) )
set([1, 2, 3, 4])
P.S. : If, instead of finding all the unique authors, you want to count the number of times you see each author, use a collections.Counter, a special kind of dictionary optimised for counting things.
Here's an example of counting characters in a string:
>>> a = "DEADBEEF CAFEBABE"
>>> import collections
>>> collections.Counter(a)
Counter({'E': 5, 'A': 3, 'B': 3, 'D': 2, 'F': 2, ' ': 1, 'C': 1})
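Applied to the nested author list from the question, counting might look like the sketch below. It simply combines collections.Counter with itertools.chain.from_iterable; the sample data is the three-pair list from the question.

```python
from collections import Counter
from itertools import chain

authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"]]

# Flatten lazily, then count each name in a single pass.
counts = Counter(chain.from_iterable(authors))

print(counts["Bob"])          # 2
print(counts["Nobody"])       # 0, missing keys count as zero
print(counts.most_common(1))  # [('Bob', 2)]
```

The keys of the Counter are exactly the unique authors, so set(counts) gives the same result as the set-based solutions.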
set should be faster.
>>> authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"]]
>>> from itertools import chain
>>> set(chain(*authors))
set(['Lisa', 'Bob', 'Jim', 'Molly', 'Alice'])
Using a dict or a set is way faster than using a list:
import itertools
result = set(itertools.chain.from_iterable(authors))
You can use a set (the built-in set type; the old sets module has been deprecated since Python 2.6 and no longer exists in Python 3):
seen = set()
for author_list in authors:
    for author in author_list:
        seen.add(author)
result = seen
This way you skip the "if" check, so the solution should be faster.
If you care about the performance of lookups, lookups in lists are O(n), while lookups in dictionaries are amortised to O(1).
You can find more information here.
Lists just store a bunch of items in a particular order. Think of your list of authors as a long line of pigeonhole boxes with author's names on bits of papers in the boxes. The names stay in the order you put them in, and you can find the author in any particular pigeonhole very easily, but if you want to know if a particular author is in any pigeonhole, then you have to look through each one until you find the name you're after. You can also have the same name in any number of pigeonholes.
Dictionaries are a bit more like a phone book. Given the author's name, you can very quickly check to see whether the author is listed in the phone book, and find the phone number listed with it. But you can only include each author once (with exactly one phone number), and you can't put the authors in there in any order you like; they have to be in the order that makes sense for the phone book. In a real phone book, that order is alphabetical; in Python dictionaries (at least before Python 3.7, which made them remember insertion order) you should treat the order as unpredictable (it can change when you add or remove things), but Python can find entries even faster in a dictionary than it could in a phone book.
Sets, on the other hand, are like phone books that just list names, not phone numbers. You still can't have the same name included more than once, it's either in the set or not. And you still can't use the order in which names are in the set for anything useful. But you can very quickly check whether a name is in the set.
Given your use case, a set would appear to be the obvious choice. You don't care about the order in which you've seen authors, or how many times you've seen each author, only that you can quickly check whether you've seen a particular author before.
Lists are bad for this case; they go to the effort of remembering duplicates in whatever order you specify, and they're slow to search. But you also don't have any need to map keys to values, which is what a dictionary does. To go back to the phone book analogy, you don't have anything equivalent to a "phone number"; in your dictionary example you're doing the equivalent of writing a phone book in which everybody's number is listed as True, so why bother listing the phone numbers at all?
A set, OTOH, does exactly what you need.
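If it ever turns out that you do care about the order in which authors were first seen, one possible variation is to keep the set for the fast "have I seen this?" check and a list alongside it for the order. This is only a sketch; unique_in_order is a made-up helper name, not something from the question.

```python
def unique_in_order(nested):
    """Return each name once, in first-seen order (hypothetical helper)."""
    seen = set()   # fast membership test
    ordered = []   # remembers the order of first appearance
    for group in nested:
        for name in group:
            if name not in seen:
                seen.add(name)
                ordered.append(name)
    return ordered


authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"]]
print(unique_in_order(authors))  # ['Bob', 'Lisa', 'Alice', 'Molly', 'Jim']
```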
I have a list of dictionaries. Which looks something like,
abc = [{"name":"bob",
"age": 33},
{"name":"fred",
"age": 18},
{"name":"mary",
"age": 64}]
Let's say I want to look up Bob's age. I know I can run a for loop through the list, etc. However, my question is whether there are any quicker ways of doing this.
One thought is to use a loop but break out of the loop once the lookup (in this case the age for bob) has been completed.
The reason for this question is that my datasets are thousands of lines long, so I'm looking for any performance gains I can get.
Edit: I can see you can use the following via a generator, but I'm not sure whether this would still iterate over all items of the list or just iterate until the first dict containing the name bob is found?
next(item for item in abc if item["name"] == "bob")
Thanks,
Depending on how many times you want to perform this operation, it might be worth defining a dictionary mapping names to the corresponding age (or to the list of corresponding ages if more than one person can share the same name).
A dictionary comprehension can help you:
abc_dict = {x["name"]:x["age"] for x in abc}
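For the case where several people share a name, a sketch of the "list of corresponding ages" variant could use collections.defaultdict. The second "bob" entry below is invented purely to show the duplicate case.

```python
from collections import defaultdict

abc = [{"name": "bob", "age": 33},
       {"name": "fred", "age": 18},
       {"name": "mary", "age": 64},
       {"name": "bob", "age": 51}]  # hypothetical second bob

# Map each name to the list of all ages recorded under it.
ages_by_name = defaultdict(list)
for person in abc:
    ages_by_name[person["name"]].append(person["age"])

print(ages_by_name["bob"])   # [33, 51]
print(ages_by_name["fred"])  # [18]
```

Note that a plain dict comprehension would silently keep only the last bob, so it is worth knowing whether names are unique in your data.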
I'd consider making another dictionary and then using that for multiple age lookups:
age_by_name = {}
for person in abc:
    age_by_name[person['name']] = person['age']
age_by_name['bob']
# this is a quick lookup!
Edit: This is equivalent to the dict comprehension listed in Josay's answer
Try indexing it first (once), and then using the index (many times).
You can index it eg. by using dict (keys would be what you are searching by, while the values would be what you are searching for), or by putting the data in the database. That should cover the case if you really have a lot more lookups and rarely need to modify the data.
Define a dictionary of dictionaries like this:
peoples = {"bob": {"name": "bob", "age": 33},
           "fred": {"name": "fred", "age": 18},
           "mary": {"name": "mary", "age": 64}}
person = peoples["bob"]
persons_age = person["age"]
Look up "bob" first, then look up "age".
This is correct, no?
You might write a helper function. Here's a take.
import itertools
# First returns the first element encountered in an iterable which
# matches the predicate.
#
# If the element is never found, StopIteration is raised.
# Args:
# pred The predicate which determines a matching element.
#
first = lambda pred, seq: next(itertools.dropwhile(lambda x: not pred(x), seq))
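On the question of whether next() over a generator expression keeps going after the first match: it does not, because the generator is only advanced until it yields once. The sketch below records which names were actually examined; is_bob and examined are illustrative names. It also passes a default to next() so that a missing name gives None instead of raising StopIteration.

```python
abc = [{"name": "bob", "age": 33},
       {"name": "fred", "age": 18},
       {"name": "mary", "age": 64}]

examined = []  # records every name the predicate was asked about


def is_bob(item):
    examined.append(item["name"])
    return item["name"] == "bob"


# The generator is advanced only until the first match is produced.
match = next((item for item in abc if is_bob(item)), None)
print(match["age"])  # 33
print(examined)      # ['bob'], fred and mary were never looked at

# With a default, a failed search returns None rather than raising.
missing = next((item for item in abc if item["name"] == "zoe"), None)
print(missing)       # None
```

This is still a linear scan in the worst case, so for many lookups the dictionary index from the other answers is the better fit.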
I'm trying to learn python (with a VBA background).
I've imported the following function into my interpreter:
def shuffle(dict_in_question):  # takes a dictionary as an argument and shuffles it
    shuff_dict = {}
    n = len(dict_in_question.keys())
    for i in range(0, n):
        shuff_dict[i] = pick_item(dict_in_question)
    return shuff_dict
The following is the output from my interpreter:
>>> stuff = {"a":"Dave", "b":"Ben", "c":"Harry"}
>>> stuff
{'a': 'Dave', 'c': 'Harry', 'b': 'Ben'}
>>> decky11.shuffle(stuff)
{0: 'Harry', 1: 'Dave', 2: 'Ben'}
>>> stuff
{}
>>>
It looks like the dictionary gets shuffled, but after that, the dictionary is empty. Why? Or, am I using it wrong?
You need to assign it back to stuff too, as you're returning a new dictionary.
>>> stuff = decky11.shuffle(stuff)
Dogbert's answer solves your immediate problem, but keep in mind that dictionaries don't have an order! There's no such thing as "the first element of my_dict." (Using .keys() or .values() generates a list, which does have an order, but the dictionary itself doesn't.) So, it's not really meaningful to talk about "shuffling" a dictionary.
All you've actually done here is remapped the keys from letters a, b, c, to integers 0, 1, 2. These keys have different hash values than the keys you started with, so they print in a different order. But you haven't changed the order of the dictionary, because the dictionary didn't have an order to begin with.
Depending on what you're ultimately using this for (are you iterating over keys?), you can do something more direct:
import random
shufflekeys = list(stuff.keys())
random.shuffle(shufflekeys)  # shuffles in place and returns None
for key in shufflekeys:
    pass  # do thing that requires order
As a side note, dictionaries (aka hash tables) are a really clever, hyper-useful data structure, which I'd recommend learning deeply if you're not already familiar. A good hash function (and non-pathological data) will give you O(1) (i.e., constant) lookup time - so you can check if a key is in a dictionary of a million items as fast as you can in a dictionary of ten items! The lack of order is a critical feature of a dictionary that enables this speed.
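If the goal is simply to visit the entries in a random order without emptying the original dictionary, something along these lines may be enough. shuffled_items is a made-up helper; it copies the items into a list and shuffles the copy, leaving the dictionary alone.

```python
import random


def shuffled_items(mapping):
    """Return the (key, value) pairs of mapping in random order (hypothetical helper)."""
    items = list(mapping.items())  # copy, so the dict itself is untouched
    random.shuffle(items)          # shuffles the list in place
    return items


stuff = {"a": "Dave", "b": "Ben", "c": "Harry"}
for key, value in shuffled_items(stuff):
    print(key, value)

print(len(stuff))  # 3, the original dictionary is still intact
```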
I'm pretty new to python (couple weeks into it) and I'm having some trouble wrapping my head around data structures. What I've done so far is extract text line-by-line from a .txt file and store them into a dictionary with the key as animal, for example.
database = {
    'dog': ['apple', 'dog', '2012-06-12-08-12-59'],
    'cat': [
        ['orange', 'cat', '2012-06-11-18-33-12'],
        ['blue', 'cat', '2012-06-13-03-23-48']
    ],
    'frog': ['kiwi', 'frog', '2012-06-12-17-12-44'],
    'cow': [
        ['pear', 'ant', '2012-06-12-14-02-30'],
        ['plum', 'cow', '2012-06-12-23-27-14']
    ]
}
# year-month-day-hour-min-sec
That way, when I print my dictionary out, it prints out by animal types, and the newest dates first.
What's the best way to go about sorting this data by time? I'm on Python 2.7. What I'm thinking is
for each key:
grab the list (or list of lists) --> get the 3rd entry --> '-'.split it, --> then maybe try the sorted(parameters)
I'm just not really sure how to go about this...
Walk through the elements of your dictionary. For each value, run sorted on your list of lists, and tell the sorting algorithm to use the third field of the list as the "key" element. This key element is what is used to compare values to other elements in the list in order to ascertain sort order. To tell sorted which element of your lists to sort with, use operator.itemgetter to specify the third element.
Since your timestamps are rigidly structured and each character in the timestamp is more temporally significant than the next one, you can sort them naturally, like strings - you don't need to convert them to times.
# Dictionary stored in d
from operator import itemgetter
# Iterate over the elements of the dictionary; below, by
# calling items(), k gets the key value of an entry and
# v gets the value of that entry
for k, v in d.items():
    if v and isinstance(v[0], list):
        v.sort(key=itemgetter(2))  # Start with 0, so third element is 2
If your dates are all in the format year-month-day-hour-min-sec, e.g. 2012-06-12-23-27-14, I think your step of splitting them is not necessary; just compare them as strings.
>>> '2012-06-12-23-27-14' > '2012-06-12-14-02-30'
True
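As a quick sanity check of that claim, the sketch below sorts the timestamps from the question once as plain strings and once as parsed datetime objects and compares the two results. The FORMAT string is my reading of the year-month-day-hour-min-sec layout; the approach relies on every field being zero-padded to a fixed width.

```python
from datetime import datetime

stamps = ['2012-06-12-23-27-14',
          '2012-06-11-18-33-12',
          '2012-06-13-03-23-48',
          '2012-06-12-14-02-30']

FORMAT = "%Y-%m-%d-%H-%M-%S"  # assumed layout: year-month-day-hour-min-sec

by_string = sorted(stamps)
by_time = sorted(stamps, key=lambda s: datetime.strptime(s, FORMAT))

print(by_string == by_time)  # True
print(by_string[0])          # 2012-06-11-18-33-12
```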
Firstly, you'll probably want each key,value item in the dict to be of a similar type. At the moment some of them (eg: database['dog'] ) are a list of strings (a line) and some (eg: database['cat']) are a list of lines. If you get them all into list of lines format (even if there's only one item in the list of lines) it will be much easier.
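One way to get everything into the "list of lines" shape, and sort by date while you are at it, might look like this. normalise_and_sort is a made-up name, and the sketch assumes a value is either a single line (a list of strings) or a list of such lines, as in the question.

```python
from operator import itemgetter


def normalise_and_sort(database):
    """Return a new dict where every value is a date-sorted list of lines."""
    result = {}
    for animal, value in database.items():
        # A single line is a list of strings; wrap it so it becomes a list of lines.
        if value and not isinstance(value[0], list):
            value = [value]
        result[animal] = sorted(value, key=itemgetter(2))
    return result


database = {
    'dog': ['apple', 'dog', '2012-06-12-08-12-59'],
    'cat': [
        ['blue', 'cat', '2012-06-13-03-23-48'],
        ['orange', 'cat', '2012-06-11-18-33-12'],
    ],
}
tidy = normalise_and_sort(database)
print(tidy['dog'])  # [['apple', 'dog', '2012-06-12-08-12-59']]
print(tidy['cat'][0][0])  # orange
```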
Then, one (old) way would be to make a comparison function for those lines. This will be easy since your dates are already in a format that's directly (string) comparable. To compare two lines, you want to compare the 3rd (2nd index) item in them:
def compare_line_by_date(x, y):
    return cmp(x[2], y[2])
Finally you can get the lines for a particular key sorted by telling the sorted builtin to use your compare_line_by_date function:
sorted(database['cat'],compare_line_by_date)
The above is suitable (but slow, and will disappear in python 3) for arbitrarily complex comparison/sorting functions. There are other ways to do your particular sort, for example by using the key parameter of sorted:
def key_for_line(line):
    return line[2]
sorted(database['cat'], key=key_for_line)
Using keys for sorting is much faster than cmp because the key function only needs to be run once per item in the list to be sorted, instead of every time items in the list are compared (which is usually much more often than the number of items in the list). The idea of a key is basically to boil each list item down into something that can be compared naturally, like a string or a number. In the example above we boiled the line down into just the date, which is then compared.
Disclaimer: I haven't tested any of the code in this answer... but it should work!
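For anyone reading this on a version where the cmp argument is gone, functools.cmp_to_key (available in 2.7 and 3.2+) can wrap an old-style comparison function. The sketch below also counts calls to illustrate the once-per-item point about key functions; the calls dictionary is instrumentation I added, and cmp() is spelled out by hand since Python 3 has no such builtin.

```python
from functools import cmp_to_key

lines = [
    ['blue', 'cat', '2012-06-13-03-23-48'],
    ['orange', 'cat', '2012-06-11-18-33-12'],
    ['plum', 'cow', '2012-06-12-23-27-14'],
]
calls = {"key": 0, "cmp": 0}  # instrumentation only


def key_for_line(line):
    calls["key"] += 1
    return line[2]


def compare_line_by_date(x, y):
    calls["cmp"] += 1
    # Hand-written replacement for cmp(x[2], y[2]): returns -1, 0 or 1.
    return (x[2] > y[2]) - (x[2] < y[2])


by_key = sorted(lines, key=key_for_line)
by_cmp = sorted(lines, key=cmp_to_key(compare_line_by_date))

print(by_key == by_cmp)  # True
print(calls["key"])      # 3, one call per item
```

How many times the comparison function runs depends on the input, but it is always at least len(lines) - 1 and grows faster than the number of items for larger lists.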
I am learning Python, and I came across a code snippet which looks like this:
my_name={'sujit','amit','ajit','arijit'}
for i, names in enumerate(my_name):
    print "%s" %(names[i])
OUTPUT
s
m
i
t
But when I modify the code as:
my_name=['sujit','amit','ajit','arijit']
for i, names in enumerate(my_name):
    print "%s" %(names[i])
OUTPUT
s
m
i
j
What is the difference between {} and []? The [] is giving me the desired result for printing the ith character of the current name from the list. But the use of {} is not.
Braces around comma-separated values, as in {'sujit', 'amit'}, create a set, whereas [] creates a list. The key differences are:
the list preserves the order, whereas the set does not;
the list preserves duplicates, whereas the set does not;
the list can be accessed through indexing (i.e. l[5]), whereas the set can not.
The first point holds the key to your puzzle. When you use a list, the loop iterates over the names in order. When you're using a set, the loop iterates over the elements in an unspecified order, which in my Python interpreter happens to be sujit, amit, arijit, ajit.
P.S. {} can also be used to create a dictionary: {'a':1, 'b':2, 'c':3}.
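The differences listed above, plus the dictionary caveat, can be checked in a few lines. This is just an illustrative sketch; the duplicated 'amit' is added on purpose to show duplicate handling.

```python
names_list = ['sujit', 'amit', 'ajit', 'amit']
names_set = {'sujit', 'amit', 'ajit', 'amit'}

print(len(names_list))  # 4, the list keeps the duplicate
print(len(names_set))   # 3, the set drops it
print(names_list[1])    # amit, lists support indexing

try:
    names_set[1]
    indexable = True
except TypeError:
    indexable = False   # sets have no positions to index
print(indexable)        # False

# Empty braces are a dict, not a set; an empty set is spelled set().
print(type({}) is dict)     # True
print(type(set()) is set)   # True
```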
The {} notation is set notation rather than list notation. That is basically the same as a list, but the items are stored in a jumbled up order, and duplicate elements are removed. (To make things even more confusing, {} is also dictionary syntax, but only when you use colons to separate keys and values -- the way you are using it, is a set.)
Secondly, you aren't using enumerate properly. (Or maybe you are, but I'm not sure...)
enumerate gives you corresponding index and value pairs. So enumerate(['sujit','amit','ajit','arijit']) gives you:
[(0, 'sujit'), (1, 'amit'), (2, 'ajit'), (3, 'arijit')]
So this will get you the first letter of "sujit", the second letter of "amit", and so on. Is that what you wanted?
{} do not enclose a list. They do not enclose any kind of sequence; they enclose (when used this way) a set (in the mathematical sense). The elements of a set do not have a specified order, so you get them enumerated in whatever order Python put them in. (It does this so that it can efficiently ensure the other important constraint on sets: they cannot contain a duplicate value).
This set-literal syntax is available in Python 2.7 and 3.x. In Python 2.6 and earlier, {} cannot be used to create a set, but only to create a dict. (Creating a dict this way works in every version.) To do this, you specify the key-value pairs separated by colons, thus: {'sujit': 'amit', 'ajit': 'arijit'}.
(Also, a general note: if you say "question" instead everywhere that you currently say "doubt", you will be wrong much less often, at least per the standards of English as spoken outside of India. I don't particularly understand how the overuse of 'doubt' has become so common in English as spoken by those from India, but I've seen it in many places across the Internet...)
sets do not preserve order:
[] is a list:
>>> print ['sujit','amit','ajit','arijit']
['sujit', 'amit', 'ajit', 'arijit']
{} is a set:
>>> print {'sujit','amit','ajit','arijit'}
set(['sujit', 'amit', 'arijit', 'ajit'])
So you get s,m,i,j in the first case; s,m,i,t in the second.
I have two Python lists of dictionaries, entries9 and entries10. I want to compare the items and write joint items to a new list called joint_items. I also want to save the unmatched items to two new lists, unmatched_items_9 and unmatched_items_10.
This is my code. Getting the joint_items and unmatched_items_9 (in the outer list) is quite easy: but how do I get unmatched_items_10 (in the inner list)?
for counter, entry1 in enumerate(entries9):
    match_found = False
    for counter2, entry2 in enumerate(entries10):
        if match_found:
            continue
        if entry1[a]==entry2[a] and entry1[b]==entry2[b]:  # the dictionaries only have some keys in common, but we care about a and b
            match_found = True
            joint_item = entry1
            joint_items.append(joint_item)
            #entries10.remove(entry2)  # Tried this originally, but realised it messes with the original list object!
    if match_found:
        continue
    else:
        unmatched_items_9.append(entry1)
Performance is not really an issue, since it's a one-off script.
The equivalent of what you're currently doing, but the other way around, is:
unmatched_items_10 = [d for d in entries10 if d not in entries9]
While more concise than your way of coding it, this has the same performance problem: it will take time proportional to the product of the lengths of the two lists. If the lengths you're interested in are about 9 or 10 (as those numbers seem to indicate), no problem.
But for lists of substantial length you can get much better performance by sorting the lists and "stepping through" them "in parallel" so to speak (time proportional to N log N where N is the length of the longer list). There are other possibilities, too (of growing complication;-) if even this more advanced approach is not sufficient to get you the performance you need. I'll refrain from suggesting very complicated stuff unless you indicate that you do require it to get good performance (in which case, please mention the typical lengths of each list and the typical contents of the dicts that are their items, since of course such "details" are the crucial consideration for picking algorithms that are a good compromise between speed and simplicity).
Edit: the OP edited his Q to show what he cares about, for any two dicts d1 and d2 one each from the two lists, is not whether d1 == d2 (which is what the in operator checks), but rather d1[a]==d2[a] and d1[b]==d2[b]. In this case the in operator cannot be used (well, not without some funky wrapping, but that's a complication that's best avoided when feasible;-), but the all builtin replaces it handily:
unmatched_items_10 = [d for d in entries10
                      if all(d[a]!=d2[a] or d[b]!=d2[b] for d2 in entries9)]
I have switched the logic around (to != and or, per De Morgan's laws) since we want the dicts that are not matched. However, if you prefer:
unmatched_items_10 = [d for d in entries10
                      if not any(d[a]==d2[a] and d[b]==d2[b] for d2 in entries9)]
Personally, I don't like if not any and if not all, for stylistic reasons, but the maths are impeccable (by what the Wikipedia page calls the Extensions to De Morgan's laws, since any is an existential quantifier and all a universal quantifier, so to speak;-). Performance should be just about equivalent (but then, the OP did clarify in a comment that performance is not very important for them on this task).
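For longer lists, the "sort both and step through them in parallel" idea mentioned above could be sketched roughly as follows. split_by_key is a made-up name, the key function stands in for the (a, b) comparison, and the sketch assumes that keys are unique within each list and can be ordered with <.

```python
def split_by_key(left, right, key):
    """Walk two key-sorted lists in step; assumes keys are unique per list."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    joint, only_left, only_right = [], [], []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl == kr:            # same key on both sides: a joint item
            joint.append(left[i])
            i += 1
            j += 1
        elif kl < kr:           # left key has no partner on the right
            only_left.append(left[i])
            i += 1
        else:                   # right key has no partner on the left
            only_right.append(right[j])
            j += 1
    only_left.extend(left[i:])    # whatever remains is unmatched
    only_right.extend(right[j:])
    return joint, only_left, only_right


entries9 = [{'a': 1, 'b': 2, 'x': 0}, {'a': 3, 'b': 4}]
entries10 = [{'a': 5, 'b': 6}, {'a': 1, 'b': 2, 'y': 9}]
joint_items, unmatched_items_9, unmatched_items_10 = split_by_key(
    entries9, entries10, key=lambda d: (d['a'], d['b']))
print(joint_items)         # [{'a': 1, 'b': 2, 'x': 0}]
print(unmatched_items_10)  # [{'a': 5, 'b': 6}]
```

The sorting dominates the cost, so this is O(N log N) rather than proportional to the product of the two lengths.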
The Python stdlib has a class, difflib.SequenceMatcher that looks like it can do what you want, though I don't know how to use it!
You may consider using sets and their associated methods, like intersection. You will however, need to turn your dictionaries into immutable data so that you can store them in a set (e.g. strings). Would something like this work?
a = set(str(x) for x in entries9)
b = set(str(x) for x in entries10)
# You'll have to change the above lines if you only care about _some_ of the keys
joint_items = a.intersection(b)
unmatched_items = a - b
# Now you can turn them back into dicts:
joint_items = [eval(i) for i in joint_items]
unmatched_items = [eval(i) for i in unmatched_items]
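If you only care about some of the keys, a variation that avoids the str/eval round trip is to put tuples of just those values into the sets and keep the original dicts in the result lists. This is a sketch under the assumption that the compared values are hashable; split_entries and key_of are made-up names.

```python
def split_entries(entries9, entries10, keys):
    """Split two lists of dicts by whether their values for `keys` match."""
    def key_of(d):
        return tuple(d[k] for k in keys)

    keys9 = set(key_of(d) for d in entries9)
    keys10 = set(key_of(d) for d in entries10)

    joint_items = [d for d in entries9 if key_of(d) in keys10]
    unmatched_items_9 = [d for d in entries9 if key_of(d) not in keys10]
    unmatched_items_10 = [d for d in entries10 if key_of(d) not in keys9]
    return joint_items, unmatched_items_9, unmatched_items_10


entries9 = [{'a': 1, 'b': 2, 'x': 0}, {'a': 3, 'b': 4}]
entries10 = [{'a': 1, 'b': 2, 'y': 9}, {'a': 5, 'b': 6}]
joint, only9, only10 = split_entries(entries9, entries10, keys=('a', 'b'))
print(joint)   # [{'a': 1, 'b': 2, 'x': 0}]
print(only9)   # [{'a': 3, 'b': 4}]
print(only10)  # [{'a': 5, 'b': 6}]
```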