"in" operation on sets and lists in Python [duplicate] - python

This question already has answers here:
What makes sets faster than lists?
I have a book on Python which says:
in is a very fast operation on sets:
stopwords_list = ["a", "an"] + hundreds_of_other_words + ["yet", "you"]
"zip" in stopwords_list # False, but have to check every element
stopwords_set = set(stopwords_list)
"zip" in stopwords_set # Very fast to check
I have two questions:
Why is in faster on sets than on lists?
If the in operator really is faster on sets, then why don't the makers of Python just rewrite the in method for lists to do x in set(list)? Why can't the idea in this book just be made part of the language?

A set uses a hash table, so an in test is an average-case O(1) lookup instead of the O(n) scan a list needs; that's why it's much faster, provided the elements are hashable (read more: What makes sets faster than lists?).
Converting the list to a set at the moment in is called requires walking the entire list to build the set, so x in set(some_list) is even slower than searching the list directly (even if the element is at the start of the list, the conversion still touches every element, so there's no short-circuiting).
There's no magic. Converting a list into a set pays off only when the set is built once, at init time, and the lookup is then done in the set many times during processing.
But in that case, creating a set directly in the first place is the best way.
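As a rough illustration, here is a minimal timing sketch (the word list, its size, and the repeat count are made up; absolute numbers will vary by machine) comparing the plain list scan, the convert-inside-the-check variant, and the precomputed set:
import timeit

words = ["word%d" % i for i in range(100000)]   # hypothetical stopword-style list
words_set = set(words)                          # one-time conversion, done up front

# O(n) scan of the list on every membership test
print(timeit.timeit('"zip" in words', globals=globals(), number=100))

# Converts the whole list to a set on every call: O(n) build plus O(1) lookup,
# so it is even slower than scanning the list directly
print(timeit.timeit('"zip" in set(words)', globals=globals(), number=100))

# Average O(1) lookup in a set that was built once, outside the timed check
print(timeit.timeit('"zip" in words_set', globals=globals(), number=100))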


Is there a different way to automatically adjust the length of a jupyter list? [duplicate]

This question already has answers here:
Efficient 2d list initialization [python]
I am using for loops to shorten my code for grabbing specific sections of a dataframe and finding the mean and standard deviation. I am then saving those values into a list, since there are different data types.
I wanted to simply initialize the list by writing list = [[0,0,0]]*len(cities) to automatically create an x-by-3 list, where x is however many cities there are. The problem with this, as shown in the picture, is that setting a single index to a value, such as list[0][1] = 1, doesn't affect only that index; it affects the same position in every row.
I was hoping for a more elegant solution than manually counting the number of cities and copy/pasting [0,0,0], ... that many times. I realize one solution is to avoid the for loops and lists altogether, but that's not the point of this question.
Simple image of the issue described above. Value being applied to all rows of list.
https://i.stack.imgur.com/hruFz.png
zeros = [[0, 0, 0] for _ in range(50)]
is one way,
but it is probably better to do
import numpy
zeros = numpy.zeros((50, 3))
(if a numpy.array will suffice).
If you have mixed datatypes (e.g. strings and numbers), you will need to use
zeros = numpy.zeros((50, 3), dtype=object)
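For completeness, a minimal sketch (the size 50 and the assigned value are arbitrary) of why the *-repetition version misbehaves while the comprehension does not:
# [[0, 0, 0]] * 50 repeats references to the SAME inner list,
# so a change made through one row shows up in every row.
aliased = [[0, 0, 0]] * 50
aliased[0][1] = 1
print(aliased[10])       # [0, 1, 0]  -- changed as well

# The comprehension builds 50 independent inner lists.
independent = [[0, 0, 0] for _ in range(50)]
independent[0][1] = 1
print(independent[10])   # [0, 0, 0]  -- untouched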

In Python, how can I iterate more than once over an iterable object? [duplicate]

This question already has answers here:
Why can't I iterate twice over the same iterator? How can I "reset" the iterator or reuse the data?
I encountered some code that gets back an iterable object from the Dynamo database, and I can do:
print [en["student_id"] for en in enrollments]
However, when I do something similar again:
print [en["course_id"] for en in enrollments]
Then the second iteration prints out nothing, because the iterable can only be traversed once and it has already reached its end.
The question is: how can we iterate over it more than once, (1) when it is known to contain only a few items, and (2) when we know there will be lots of items (say a million) and we don't want to spend a lot of additional memory?
Relatedly, I looked up rewind, and it seems to exist for PHP and Ruby, but not for Python?
enrollments is a generator. Either recreate the generator if you need to iterate again, or convert it to a list first:
enrollments = list(enrollments)
Take into account that APIs often use generators to avoid memory bloat; a list must have references to all objects it contains, so all those objects have to exist at the same time. A generator can produce the elements one by one, as needed; your list comprehension discards those objects again once the 'student_id' key has been extracted.
The alternative is to iterate just once, and do all the things with each object you want to do. So instead of running two list comprehensions, run one regular for loop and extract all the data you need in one place, appending to separate lists as you go along:
courses = []
students = []
for enrollment in enrollments:
    courses.append(enrollment['course_id'])
    students.append(enrollment['student_id'])
rewind in PHP is unrelated to this; Python has fileobj.seek(0) to do the same, but file objects are not generators.
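As a small sketch of the "recreate the generator" route (fetch_enrollments below is a hypothetical stand-in for whatever call produced the results; it just yields dummy records):
def fetch_enrollments():
    # Hypothetical stand-in for the query that returns the results as a generator
    yield {"student_id": 1, "course_id": "CS101"}
    yield {"student_id": 2, "course_id": "CS102"}

# Each call returns a fresh generator, so each loop starts from the beginning.
student_ids = [en["student_id"] for en in fetch_enrollments()]
course_ids = [en["course_id"] for en in fetch_enrollments()]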
import itertools
it1, it2 = itertools.tee(enrollments, 2)
Looks like this is the answer from here: Why can't I iterate twice over the same data?
But it is only a good option if the two iterators stay reasonably close together: tee buffers every item that one copy has consumed and the other hasn't, so exhausting one copy before touching the other costs about as much memory as building a list.

Which is faster and more efficient for iterating over a large list: a generator expression or itertools.chain?

I have a large list of strings and I want to iterate over this list. I want to figure out which is the best way to iterate over the list. I have tried the following ways:
Generator Expression: g = (x for x in list)
Itertools.chain: ch = itertools.chain(list)
Is there another approach, better than these two, for list iteration?
The fastest way is just to iterate over the list. If you already have a list, layering more iterators/generators isn't going to speed anything up.
A good old for item in a_list: is going to be just as fast as any other option, and definitely more readable.
Iterators and generators are for when you don't already have a list sitting around in memory. itertools.count() for instance just generates a single number at a time; it's not working off of an existing list of numbers.
Another possible use is when you're chaining a number of operations - your intermediate steps can create iterators/generators rather than creating intermediate lists. For instance, if you're wanting to chain a lookup for each item in the list with a sum() call, you could use a generator expression for the output of the lookups, which sum() would then consume:
total_inches_of_snow = sum(inches_of_snow(date) for date in list_of_dates)
This allows you to avoid creating an intermediate list with all of the individual inches of snow and instead just generate them as sum() consumes them, thus saving memory.
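To make that concrete, here is a small sketch (the list size and the len() work are made up) of three ways of walking an existing in-memory list; all of them visit the same elements, the wrappers just add a layer on top:
import itertools

strings = [str(i) for i in range(1000000)]

# Plain iteration: the simplest option, and as fast as anything else for a list in memory
total = 0
for s in strings:
    total += len(s)

# Generator expression: adds a generator frame on top of the same list iteration
total_gen = sum(len(s) for s in strings)

# itertools.chain with a single iterable just wraps the list's own iterator;
# chain really earns its keep only when stitching several iterables together
total_chain = sum(len(s) for s in itertools.chain(strings))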

Is there a better way to store a two-way dictionary than storing its inverse separately? [duplicate]

This question already has answers here:
How to implement an efficient bidirectional hash table?
Given a one-to-one dictionary (=bijection) generated à la
for key, value in someGenerator:
    myDict[key] = value
an inverse lookup dictionary can be trivially created by adding
invDict[value] = key
to the for loop. But is this a Pythonic way? Should I instead write a class Bijection(dict) which manages this inverted dictionary in addition and provides a second lookup function? Or does such a structure (or a similar one) already exist?
What I've done in the past is create a reversedict function, which takes a dict and returns the opposite mapping: either values to keys if I knew it was one-to-one (throwing an exception on seeing the same value twice), or values to lists of keys if it wasn't. That way, instead of having to construct two dicts at the same time whenever I wanted the inverse lookup, I could build my dicts as normal and just call the generic reversedict function at the end.
However, it seems that the bidict solution that Jon mentioned in the comments is probably the better one. (My reversedict function seems to be his bidict's ~ operator).
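For illustration, a minimal sketch of the reversedict idea described above (the name, the strict flag, and the error handling are my assumptions, not the original code):
def reversedict(d, strict=True):
    # Return a dict mapping values back to keys.
    # strict=True: the mapping must be one-to-one; a repeated value raises.
    # strict=False: each value maps to the list of keys that share it.
    inverse = {}
    for key, value in d.items():
        if strict:
            if value in inverse:
                raise ValueError("value %r appears more than once" % (value,))
            inverse[value] = key
        else:
            inverse.setdefault(value, []).append(key)
    return inverse

# Usage: build the dict as normal and invert it only when the inverse lookup is needed.
roman = {1: "I", 2: "II", 3: "III"}
assert reversedict(roman)["II"] == 2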
If you want quick lookups by value as well as by key, you will need both a representation of the map and a representation of the inverse map; with a single dict the best you can do is a fast lookup in one direction and an O(n) scan in the other.
Keeping both costs more or less the same space as a dict plus an inverse dict.
(Edit: the lookup times are average O(1), not O(log(n)) as originally written; thanks Claudiu for the correction. The point stands that you need two data structures to get quick access in both directions.)

array vs hash key search

So I'm a longtime perl scripter who's been getting used to python since I changed jobs a few months back. Often in perl, if I had a list of values that I needed to check a variable against (simply to see if there is a match in the list), I found it easier to generate hashes to check against, instead of putting the values into an array, like so:
$checklist{'val1'} = undef;
$checklist{'val2'} = undef;
...
if (exists $checklist{$value_to_check}) { ... }
Obviously this wastes some memory because of the need for a useless right-hand value, but IMO it is more efficient and easier to code than looping through an array.
Now in python, the code for this is exactly the same whether you're searching a list or a dictionary:
if value_to_check in checklist_which_can_be_list_or_dict:
<code>
So my real question here is: in perl, the hash method was preferred for speed of processing vs. iterating through an array, but is this true in python? Given the code is the same, I'm wondering if python does list iteration better? Should I still use the dictionary method for larger lists?
Dictionaries are hashes. An in test on a list has to walk through every element to check it against, while an in test on a dictionary uses hashing to see if the key exists. Python just doesn't make you explicitly loop through the list.
Python also has a set datatype. It's basically a hash/dictionary without the right-hand values. If what you want is to be able to build up a collection of things, then test whether something is already in that collection, and you don't care about the order of the things or whether a thing is in the collection multiple times, then a set is exactly what you want!
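A small sketch of the set-based version of the Perl idiom above (the values and variable names are illustrative):
# Build the collection once; membership tests are then O(1) on average,
# just like the Perl hash trick but without the dummy right-hand values.
checklist = {"val1", "val2", "val3"}

value_to_check = "val2"
if value_to_check in checklist:
    print("found it")

# Starting from an existing list, convert once up front rather than on every lookup.
checklist = set(["val1", "val2", "val3"])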
