Is set() faster than list() in Python?

Suppose I have an input which contains space-separated integers which are unique, i.e. no integer occurs twice. In such a case, will using the following,
setA = set(input().split())
be faster than using the one below? If so (I actually observed it to be), why?
listA = list(input().split())
Please do not focus on the fact that the input is not converted to int while it is read.
In a problem I am working on, using list() gives a timeout, but by using set() I am able to run it within the time limit. I wonder why this is the case?
Edit: in case it might be related, here is the relevant code,
arr = input().split()
for ele in arr:
    if ele in setA:
        happiness += 1
    elif ele in setB:
        happiness -= 1
where arr is a space-separated line of integers, with no uniqueness guarantee this time.

Python’s set class represents the mathematical notion of a set, namely a collection
of elements, without duplicates, and without an inherent order to those elements.
The major advantage of using a set, as opposed to a list, is that it has a highly
optimized method for checking whether a specific element is contained in the set.
This is based on a data structure known as a hash table.
However, there are two important restrictions due to the
algorithmic underpinnings. The first is that the set does not maintain the elements
in any particular order. The second is that only instances of immutable types can be
added to a Python set. Therefore, objects such as integers, floating-point numbers,
and character strings are eligible to be elements of a set. It is possible to maintain a
set of tuples, but not a set of lists or a set of sets, as lists and sets are mutable.
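As a quick illustration of those restrictions (a minimal sketch; frozenset is the built-in immutable counterpart of set):
s = set()
s.add(7)                   # int: immutable, hashable
s.add("seven")             # str: immutable, hashable
s.add((1, 2))              # tuple of immutables: fine
try:
    s.add([1, 2])          # a list is mutable, hence unhashable
except TypeError as err:
    print(err)             # unhashable type: 'list'
s.add(frozenset({1, 2}))   # a frozenset can be nested inside a set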

Related

Get random number from set deprecation

I am trying to get a random n number of users from a set of unique users.
Here is what I have so far
users = set()
random_users = random.sample((users), num_of_user)
This works well, but it gives me a deprecation warning. What should I be using instead? random.choice doesn't work with sets.
UPDATE
I am trying to get reactions on a post and want them to be unique, which is why I used a set. Would it be better to stick with a list for this?
users = set()
for reaction in msg.reactions:
    async for user in reaction.users():
        users.add(user)
Convert your set to a list, either
by using the list function:
random_users = random.choices(list(users), k=num_of_user)
or by using the * operator to unpack your set or dict:
random_users = random.choices([*users], k=num_of_user)
Solution 1 is three characters longer than solution 2, but solution 1 is more literal, to me. Note that random.choices picks with replacement, so the same user can be drawn more than once; random.sample, applied to a sequence, draws distinct users.
Note also that the iteration order of a set is not guaranteed to be the same across executions, Python versions, and platforms, so you may end up with different random results despite careful random number generator initialization. To resolve this, sort the list first.
You can also store the users in a list and make the elements unique later; a common way is to convert the list to a set and then back to a list again.
FWIW, random.sample() in Python 3.9.2 says this when passed a dict:
TypeError: Population must be a sequence. For dicts or sets, use sorted(d).
And this solution does seem to work for both set and dict inputs.
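Putting those pieces together for the question's use case, a minimal sketch (the user names are invented example data):
import random

users = {"alice", "bob", "carol", "dave"}   # example data only
num_of_user = 2
# sorted() turns the set into a stable sequence, so random.sample()
# emits no deprecation warning and returns distinct users
random_users = random.sample(sorted(users), num_of_user)
print(random_users)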

Python: Fastest way to search if long string is in list of strings

I have an input of about 2-5 million strings of about 400 characters each, coming from a stored text file.
I need to check for duplicates before adding them to the list that I check against (it doesn't have to be a list, it can be any other data type; the list is technically a set, since all its items are unique).
I can expect at most about 0.01% of my data to be non-unique, and I need to filter those out.
I'm wondering if there is any faster way to check whether an item exists in the list than:
a = []
for item in data:
    if item not in a:
        a.append(item)
I do not want to lose the order.
Would hashing be faster (I don't need encryption)? But then I'd have to maintain a hash table of all the values to check against first.
Is there any way I'm missing?
I'm on Python 2 and can go up to Python 3.5 at most.
It's hard to answer this question because it keeps changing ;-) The version I'm answering asks whether there's a faster way than:
a = []
for item in data:
    if item not in a:
        a.append(item)
That will be horridly slow, taking time quadratic in len(data). In any version of Python the following will take expected-case time linear in len(data):
seen = set()
for item in data:
    if item not in seen:
        seen.add(item)
        emit(item)
where emit() does whatever you like (append to a list, write to a file, whatever).
In comments I already noted ways to achieve the same thing with ordered dictionaries (whether ordered by language guarantee in Python 3.7+, or via the OrderedDict type from the collections module). The code just above is the most memory-efficient, though.
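For reference, a sketch of the ordered-dict variant (fromkeys keeps the first occurrence of each key, in insertion order):
from collections import OrderedDict

data = ["b", "a", "b", "c", "a"]
unique_in_order = list(OrderedDict.fromkeys(data))   # works on 2.7 and 3.x
print(unique_in_order)                               # ['b', 'a', 'c']
# on Python 3.7+, a plain dict suffices: list(dict.fromkeys(data))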
You can try this,
a = list(set(data))
A list is an ordered sequence of elements, whereas a set is an unordered collection of distinct elements. Note, however, that list(set(data)) does not preserve the original order, which the question asks to keep.

Python - convert list into dictionary in order to reduce complexity

Let's say I have a big list:
word_list = [elt.strip() for elt in open("bible_words.txt", "r").readlines()]
# complexity O(n) --> proportional to list length "n"
I have learned that the hash function used for building dictionaries allows lookups to be much faster, like so:
word_dict = dict((elt, 1) for elt in word_list)
# lookup complexity O(1) --> constant
Starting from word_list, what is the recommended, most efficient way to reduce the complexity of my code?
The code from the question does just one thing: fills all words from a file into a list. The complexity of that is O(n).
Filling the same words into any other type of container will still have at least O(n) complexity, because it has to read all of the words from the file and it has to put all of the words into the container.
What is different with a dict?
Finding out whether something is in a list has O(n) complexity, because the algorithm has to go through the list item by item and check whether it is the sought item. The item can be found at position 0, which is fast, or it could be the last item (or not in the list at all), which makes it O(n).
In a dict, data is organized in "buckets". When a key:value pair is saved to a dict, the hash of the key is calculated and that number is used to identify the bucket into which the data is stored. Later, when the key is looked up, hash(key) is calculated again to identify the bucket, and then only that bucket is searched. There is typically only one key:value pair per bucket, so the search can be done in O(1).
For more details, see the article about DictionaryKeys on python.org.
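To make the bucket idea concrete, here is a toy sketch; CPython's real implementation uses open addressing and resizes the table as it fills, but the principle is the same:
NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]

def put(key, value):
    bucket = buckets[hash(key) % NUM_BUCKETS]   # the hash picks the bucket
    for i, (k, _) in enumerate(bucket):
        if k == key:
            bucket[i] = (key, value)            # replace an existing pair
            return
    bucket.append((key, value))

def get(key):
    # only one bucket is scanned, not the whole table
    for k, v in buckets[hash(key) % NUM_BUCKETS]:
        if k == key:
            return v
    raise KeyError(key)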
How about a set?
A set is something like a dictionary with only keys and no values. The question contains this code:
word_dict = dict((elt, 1) for elt in word_list)
That is obviously a dictionary which does not need values, so a set would be more appropriate.
BTW, there is no need to create word_list as a list first and then convert it to a set or dict. The first step can be skipped:
set_of_words = {elt.strip() for elt in open("bible_words.txt", "r").readlines()}
Are there any drawbacks?
Always ;)
A set does not have duplicates. So counting how many times a word is in the set will never return 2. If that is needed, don't use a set.
A set is not ordered. There is no way to check which was the first word in the set. If that is needed, don't use a set.
Objects saved to sets have to be hashable, which more or less implies that they are immutable. If it were possible to modify an object, its hash would change, so it would be in the wrong bucket and searching for it would fail. Anyway, str, int, float, and tuple objects are immutable, so at least those can go into sets.
Writing to a set is probably going to be a bit slower than writing to a list. Still O(n), but a slower O(n), because it has to calculate hashes and organize into buckets, whereas a list just dumps one item after another. See timings below.
Reading everything from a set is also going to be a bit slower than reading everything from a list.
All of these apply to dict as well as to set.
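As an aside, if per-word counts are what you actually need, collections.Counter (a dict subclass) covers that case; a sketch reusing the question's file, with an invented lookup word:
from collections import Counter

word_counts = Counter(elt.strip() for elt in open("bible_words.txt", "r"))
print(word_counts["lord"])   # how many times 'lord' occurs (example word)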
Some examples with timings
Writing to list vs. set:
>>> timeit.timeit('[n for n in range(1000000)]', number=10)
0.7802875302271843
>>> timeit.timeit('{n for n in range(1000000)}', number=10)
1.025623542189976
Reading from list vs. set:
>>> timeit.timeit('989234 in values', setup='values=[n for n in range(1000000)]', number=10)
0.19846207875508526
>>> timeit.timeit('989234 in values', setup='values={n for n in range(1000000)}', number=10)
3.5699193290383846e-06
So, writing to a set seems to be about 30% slower, but finding an item in the set is thousands of times faster when there are thousands of items.

Converting strings to numbers (not parsing) for radix sort

I have a school project where I need to sort all kinds of data types with different sorting algorithms. Radix sort works well, but it can't sort anything other than integers. I'm probably not going to add sorting results for anything other than integers, since every data type will get sorted as integers.
That said, I'd like to know if there is a better way to convert strings to integers. Here's what I came up with. I didn't want to outsmart Python, and tried to use standard functions as much as possible.
def charToHex(char):
    return hex(ord(char))[2:]

def stringToHex(text):
    t = ''
    for char in text:
        t += charToHex(char)
    return t

def stringToInt(text):
    return int(stringToHex(text), 16)

print stringToInt('allo')
print stringToInt('allp')
print stringToInt('all')
It does work well, but I'd be happy to know if there is a better way to handle it. For what it's worth, sorting anything other than integers with radix sort sounds pointless, because even if you can sort the list of integer keys, you still have to get the values for all the keys back into the list.
What I had in mind is something like this: for each value in my list, compute an integer key; put that key in a hash table and the value in a list attached to that key; then replace each value in the list with its integer key and sort the list of keys. Finally, for each key in the sorted list, get the list of values for that key, pop one item, and put it into the result list.
I'd also like to know whether there is a way to optimize this process enough to make radix sort worth using instead of another sort that doesn't require any conversion. The number of items in the list may go beyond 50000.
Edit: actually, the code here doesn't work for strings of different sizes. I'm not sure how to handle that; padding the strings with spaces seems to work.
def getMaxLen(ls):
    length = 0
    for text in ls:
        length = max(length, len(text))
    return length

def convertList(ls):
    size = getMaxLen(ls)
    copy = ls[:]
    for i, val in enumerate(copy):
        copy[i] = stringToInt(val.ljust(size, ' '))
    return copy

print convertList(["allo", "all", "bal"])
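For what it's worth, if Python 3 were an option, the hex round-trip could be replaced by a direct byte interpretation; a sketch that yields the same ordering for ASCII strings:
def string_to_int(text, width):
    # pad with spaces, then read the bytes as one big-endian integer
    return int.from_bytes(text.ljust(width).encode("ascii"), "big")

words = ["allo", "all", "bal"]
width = max(len(w) for w in words)
print([string_to_int(w, width) for w in words])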
First, take a look at this article. It shows that yes, in some cases you can devise a radix sort for strings that is faster than any other sort.
Second, and more importantly, ask yourself whether you are optimizing prematurely. Sorting 50k items with Python's built-in sort() is going to be incredibly fast. Unless you're sure this is a bottleneck in your application, I wouldn't worry about it and would just use sort(). If it is a bottleneck, I'd also make sure there isn't some way you can avoid doing all of these sorts (e.g. caching, or algorithms that work on unsorted data).

How to grow a list to fit a given capacity in Python

I'm a Python newbie. I have a series of objects that need to be inserted at specific indices of a list, but they come out of order, so I can't just append them. How can I grow the list whenever necessary to avoid IndexErrors?
def set(index, item):
    if len(nodes) <= index:
        pass  # grow list to index+1
    nodes[index] = item
I know you can create a list with an initial capacity via nodes = (index+1) * [None], but what's the usual way to grow it in place? The following doesn't seem efficient:
for _ in xrange(len(nodes), index+1):
    nodes.append(None)
In addition, I suppose there's probably a class in the Standard Library that I should be using instead of built-in lists?
This is the best way of doing it:
>>> lst.extend([None]*additional_size)
Oops, it seems I misunderstood your question at first. If you are asking how to expand the length of a list so you can insert something at an index larger than the current length, then lst.extend([None]*(new_size - len(lst))) would probably be the way to go, as others have suggested. Of course, if you know in advance what the maximum index you will be needing is, it would make sense to create the list at that size up front and fill it with Nones.
For reference, I leave the original text: to insert something in the middle of the existing list, the usual way is not to worry about growing the list yourself. List objects come with an insert method that will let you insert an object at any point in the list. So instead of your set function, just use
lst.insert(index, item)
or you could do
lst[index:index] = [item]
which does the same thing. Python will take care of resizing the list for you.
There is not necessarily any class in the standard library that you should be using instead of list, especially if you need this sort of random-access insertion. However, there are some classes in the collections module which you should be aware of, since they can be useful for other situations (e.g. if you're always appending to one end of the list, and you don't know in advance how many items you need, deque would be appropriate).
Perhaps something like:
lst += [None] * additional_size
(you shouldn't call your list variable list, since that is the name of the list constructor; likewise, naming your function set shadows the built-in set).
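Putting these suggestions together, the question's helper might look like the following sketch (renamed set_at to avoid the shadowing just mentioned):
nodes = []

def set_at(index, item):
    # grow the list with None placeholders so nodes[index] exists
    if len(nodes) <= index:
        nodes.extend([None] * (index + 1 - len(nodes)))
    nodes[index] = item

set_at(5, "f")
print(nodes)   # [None, None, None, None, None, 'f']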
