Fastest way to uniqify a list in Python without preserving order? I saw many complicated solutions on the Internet - could they be faster than simply:
list(set([a,b,c,a]))
Going to a set only works for lists such that all their items are hashable -- so e.g. in your example if c = [], the code you give will raise an exception. For non-hashable, but comparable items, sorting the list, then using itertools.groupby to extract the unique items from it, is the best available solution (O(N log N)). If items are neither all hashable, nor all comparable, your only "last ditch" solution is O(N squared).
You can code a function to "uniquify" any list that uses the best available approach by trying each approach in order, with a try/except around the first and second (and a return of the result either at the end of the try clause, or, elegantly, in an else clause of the try statement;-).
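One way to sketch such a function (the name unique and the structure are mine, following the description above: try the set, fall back to sort-plus-groupby, and finally the quadratic scan):

```python
from itertools import groupby

def unique(seq):
    """Return the unique items of seq, trying the fastest applicable approach."""
    try:
        return list(set(seq))  # O(N): works when all items are hashable
    except TypeError:
        pass
    try:
        # O(N log N): works when all items are comparable
        return [key for key, _ in groupby(sorted(seq))]
    except TypeError:
        pass
    # O(N**2) last ditch: works for any items
    result = []
    for item in seq:
        if item not in result:
            result.append(item)
    return result
```

Note that only the set-based branch returns the items in arbitrary order; the groupby branch returns them sorted, and the last-ditch branch preserves first-seen order.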
set([a, b, c, a])
Leave it in that form if possible.
This updated post by Peter Bengtsson suggests two of the fastest ways to make a list of unique items in Python 3.6+ are:
# Unordered (hashable items)
list(set(seq))
# Order preserving
list(dict.fromkeys(seq))
Related
I want to perform calculations on a list and assign this to a second list, but I want to do this in the most efficient way possible as I'll be using a lot of data. What is the best way to do this? My current version uses append:
f = time_series_data
output = []
for i, f in enumerate(time_series_data):
    if f > x:
        output.append(calculation with f)
    etc etc
should I use append or declare the output list as a list of zeros at the beginning?
Appending the values is not slower than the other ways of accomplishing this.
The code looks fine, and pre-declaring a list of zeros would not help. It could even create problems, since you might not know in advance how many values will pass the condition f > x.
Since you wrote etc etc, I am not sure how long the loop body is or what operations you need to do there. If possible, use a list comprehension; that would be a little faster.
You can have a look at the article below, which compares the speed of list creation using three methods: list comprehension, append, and pre-initialization.
https://levelup.gitconnected.com/faster-lists-in-python-4c4287502f0a
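As a sketch of the comprehension version: the sample data, the threshold x, and the calculation (here a placeholder f * 2) are all made up, since the question elides them.

```python
time_series_data = [0.5, 1.2, 3.4, 0.8, 2.1]  # hypothetical sample data
x = 1.0  # hypothetical threshold

# append version, as in the question
output = []
for f in time_series_data:
    if f > x:
        output.append(f * 2)  # stand-in for the real calculation

# equivalent list comprehension; usually slightly faster
output_lc = [f * 2 for f in time_series_data if f > x]

assert output == output_lc
```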
I have an input of about 2-5 million strings of about 400 characters each, coming from a stored text file.
I need to check for duplicates before adding them to the list that I check (doesn't have to be a list, can be any other data type, the list is technically a set since all items are unique).
I can expect about 0.01% at max of my data to be non-unique and I need to filter them out.
I'm wondering if there is any faster way for me to check if the item exists in the list rather than:
a = []
for item in data:
    if item not in a:
        a.append(item)
I do not want to lose the order.
Would hashing be faster (I don't need encryption)? But then I'd have to maintain a hash table for all the values to check first.
Is there any way I'm missing?
I'm on Python 2, and can at most go up to Python 3.5.
It's hard to answer this question because it keeps changing ;-) The version I'm answering asks whether there's a faster way than:
a = []
for item in data:
    if item not in a:
        a.append(item)
That will be horridly slow, taking time quadratic in len(data). In any version of Python the following will take expected-case time linear in len(data):
seen = set()
for item in data:
    if item not in seen:
        seen.add(item)
        emit(item)
where emit() does whatever you like (append to a list, write to a file, whatever).
In comments I already noted ways to achieve the same thing with ordered dictionaries (whether ordered by language guarantee in Python 3.7, or via the OrderedDict type from the collections module). The code just above is the most memory-efficient, though.
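For reference, the ordered-dictionary variants look like this (the sample data is made up):

```python
from collections import OrderedDict

data = ["b", "a", "b", "c", "a"]  # hypothetical sample input

# CPython 3.6+ (guaranteed in 3.7+): plain dicts preserve insertion order
unique_ordered = list(dict.fromkeys(data))

# Portable back to 2.7 via OrderedDict, so it fits the 3.5 constraint above
unique_ordered_old = list(OrderedDict.fromkeys(data))

assert unique_ordered == unique_ordered_old == ["b", "a", "c"]
```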
You can try this:
a = list(set(data))
A list is an ordered sequence of elements, whereas a set is an unordered collection of distinct elements, so note that this does not preserve the original order.
In Python, we can get the unique items of a list L by using set(L). However doing this breaks the order in which the values appear in the original list. Is there an elegant way to get the unique items in the order in which they appear in the list?
If all of the items in the list are hashable, then a dictionary can be used for an order-preserving dedupe:
L = list(dict.fromkeys(L))
For older Python versions (<= 3.6) where the standard dict is not order preserving, you can do the same thing using a collections.OrderedDict.
If any of the list items are unhashable, there will be a TypeError. You can use an alternative approach in this case, at the price of poorer performance. I refer you to the answer from Patrick Haugh.
l = []
for item in list_:
    if item not in l:
        l.append(item)
This gets slow for really big, diverse list_. In those cases, it would be worth it to also keep track of a set of seen values.
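A sketch of that seen-set variant (it requires the items to be hashable, which is exactly the case the dict approach already covers):

```python
def ordered_unique(items):
    """Order-preserving dedupe; the set makes membership tests O(1)."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:  # O(1) on a set, vs O(n) on the list itself
            seen.add(item)
            result.append(item)
    return result
```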
In Python, have a list with items (strings) that look like this:
"1.120e+03 8.140e+02 3.234e+01 1.450e+00 1.623e+01 7.940e+02 3.113e+01 1.580e+00 1.463e+01"
I want to sort this list based on the size of the first number in each string (smallest to largest). In the above case that would be "1.120e+03".
I can think of a couple of ways to do it, but they involve creating new lists and a couple of for loops which I guess isn't so efficient (or elegant). Is there a quick way to do this?
If you are sorting it more than once, I suggest you create your own class for the data and override the comparison methods such as __lt__.
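For a one-off sort, a key function is simpler than a custom class: no new lists and no explicit loops. The sample strings here are made up, shortened from the question's format:

```python
rows = [
    "1.120e+03 8.140e+02 3.234e+01",
    "1.450e+00 1.623e+01 7.940e+02",
    "3.113e+01 1.580e+00 1.463e+01",
]

# Sort in place by the first number in each string;
# the key function runs once per element
rows.sort(key=lambda s: float(s.split()[0]))

assert [r.split()[0] for r in rows] == ["1.450e+00", "3.113e+01", "1.120e+03"]
```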
I'm working through some tutorials on Python and am at a position where I am trying to decide what data type/structure to use in a certain situation.
I'm not clear on the differences between arrays, lists, dictionaries and tuples.
How do you decide which one is appropriate - my current understanding doesn't let me distinguish between them at all - they seem to be the same thing.
What are the benefits/typical use cases for each one?
How do you decide which data type to use? Easy:
You look at which are available and choose the one that does what you want. And if there isn't one, you make one.
In this case a dict is a pretty obvious solution.
Tuples first. These are list-like things that cannot be modified. Because the contents of a tuple cannot change, you can use a tuple as a key in a dictionary. That's the most useful place for them in my opinion. For instance if you have a list like item = ["Ford pickup", 1993, 9995] and you want to make a little in-memory database with the prices you might try something like:
db = {}  # the in-memory "database"
ikey = tuple(item[:2])  # note: tuple() takes one iterable, not two arguments
idata = item[2]
db[ikey] = idata
Lists seem to be like arrays or vectors in other programming languages, and are usually used for the same kinds of things in Python. However, they are more flexible in that you can put different types of things into the same list. Generally, they are the most flexible data structure, since you can put a whole list into a single list element of another list, but for real data crunching they may not be efficient enough.
a = [1,"fred",7.3]
b = []
b.append(1)
b[0] = "fred"
b.append(a) # now the second element of b is the whole list a
Dictionaries are often used a lot like lists, but now you can use any hashable (immutable) object as the key. However, unlike lists, dictionaries don't have a natural order and can't be sorted in place. Of course you can create your own class that incorporates a sorted list and a dictionary in order to make a dict behave like an ordered dictionary. There are examples on the Python Cookbook site.
c = {}
d = ("ford pickup",1993)
c[d] = 9995
Arrays are getting closer to the bit level for when you are doing heavy duty data crunching and you don't want the frills of lists or dictionaries. They are not often used outside of scientific applications. Leave these until you know for sure that you need them.
Lists and Dicts are the real workhorses of Python data storage.
The best type for counting elements like this is usually defaultdict:
from collections import defaultdict
s = 'asdhbaklfbdkabhvsdybvailybvdaklybdfklabhdvhba'
d = defaultdict(int)
for c in s:
    d[c] += 1
print(d['a'])  # prints 7
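As an alternative not mentioned in the answer above, collections.Counter (available since 2.7) does the same bookkeeping in one call:

```python
from collections import Counter

s = 'asdhbaklfbdkabhvsdybvailybvdaklybdfklabhdvhba'
counts = Counter(s)  # maps each character to its number of occurrences

assert counts['a'] == 7
assert counts['z'] == 0  # missing keys count as 0, like defaultdict(int)
```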
Do you really require speed/efficiency? Then go with a pure and simple dict.
Personal:
I mostly work with lists and dictionaries.
It seems that this satisfies most cases.
Sometimes:
Tuples can be helpful if you want to pair/match elements. Besides that, I don't really use them.
However:
I write high-level scripts that don't need to drill down into the core "efficiency" where every byte and every memory/nanosecond matters. I don't believe most people need to drill this deep.