I'm looking for the Pythonic way to create a dictionary of lists, where values get appended to a list for a series of keys. For example, a dictionary n that groups the values up to 1000 by the sum of their digits, with the digit sum as the key:
n[25] = [799, 889, 898, 979, 988, 997]
Using a basic dict comprehension doesn't work, as it overwrites smaller values and keeps only one value per key (the largest):
n = {sumdigits(i): i for i in range(1000)}
I've got a two-line working version below, but I'm curious whether there is a neat one-line solution to create a dictionary of variable-length lists.
from collections import defaultdict

def sumdigits(x):
    return sum(int(i) for i in str(x))

n = defaultdict(list)
for i in range(1000):
    n[sumdigits(i)].append(i)
What you already have is very Pythonic. There's absolutely nothing wrong with it and, crucially, it's clear what's happening.
If you really want a one-line solution, I think you need two loops: one loop over the possible key values, which can be anything from 0 to 27 (sumdigits(999)), and another loop over the items in range(1000).
Here's what that would look like, but it's very inefficient from a time-complexity point of view. What you have has time complexity O(n), which is good. Doing it in a comprehension has complexity O(n * sumdigits(n-1)), because for every key you have to iterate over the entire range(1000), and most of those values are discarded.
{n: [i for i in range(1000) if sumdigits(i) == n] for n in range(sumdigits(999) + 1)}
Here is another possibility that is one line (if, as in your two-line solution, you don't count the defaultdict initialization), with the advantage that it is significantly faster than the other comprehension-based solutions.
from collections import defaultdict

n = defaultdict(list)
# the set comprehension is used only for its side effect; it merely builds {None}
{n[sum(int(d) for d in str(nb))].append(nb) for nb in range(1000)}
Or truly in one line, using the walrus operator (Python 3.8+): an empty defaultdict is falsy, so the or forces the comprehension on the right-hand side to run and fill it:

(n := defaultdict(list)) or [n[sum(int(d) for d in str(x))].append(x) for x in range(1000)]
If you really want a one-line solution, you can do the following, combining a list comprehension and a dict comprehension:
dct = {sumdigits(i): [j for j in range(1000) if sumdigits(i)==sumdigits(j)] for i in range(1000)}
That said, I do not think it gets more Pythonic than the simple for loop you've suggested yourself, and I think you should stick to that for performance reasons as well.
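For what it's worth, here is a rough timeit sketch comparing the defaultdict loop with the nested comprehension (only illustrative; the exact numbers will vary by machine):

from collections import defaultdict
from timeit import timeit

def sumdigits(x):
    return sum(int(i) for i in str(x))

def with_loop():
    n = defaultdict(list)
    for i in range(1000):
        n[sumdigits(i)].append(i)
    return n

def with_comprehension():
    return {sumdigits(i): [j for j in range(1000) if sumdigits(i) == sumdigits(j)]
            for i in range(1000)}

print('loop         :', timeit(with_loop, number=100))
print('comprehension:', timeit(with_comprehension, number=1))  # far slower, so run it only once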
Is there a way to efficiently filter items from the list of subsequences of a list?
For a minimal example, consider the list l = [-2, -1, 1, 2] and the subsequences in itertools.combinations(l, r=2). Suppose I'd like to filter the subsequences of this list so that for every number i represented in the list, at most one of (-i, i) should be in any subsequence. In this example, the desired output is [(-2, -1), (-2, 1), (-1, 2), (1, 2)].
Here's a naive way of filtering:
from itertools import combinations

def filtered(iterable, r):
    for i in combinations(iterable, r):
        for x in i:
            if x*-1 in i:  # condition line
                break
        else:
            yield i
But with increasing input size, this rejection approach wastes a lot of work. For example, filtered(list(range(-20, 20)), 5) will reject about 35% of the generated subsequences. It's also much slower (about six times) than combinations() alone.
Ideally, I'd like to keep a configurable condition so that filtering on non-numeric data remains possible.
I think that AKX is somewhat right: you definitely need to iterate through all the possible combinations, and the outer for loop already does that. However, the check inside each combination is not optimized.
The inner for loop

for x in i:
    if x*-1 in i:  # condition line
        break

is not optimized: it is O(r^2) per combination (where r is the tuple length), because the in operator scans the whole tuple.
You can make this inner loop O(r) by using a hash set for each tuple (set() or dict() in Python).
nums = set()
for num in my_tuple:
    if num*-1 in nums:
        break
    nums.add(num)
The difference is that a hash set has O(1) lookup, so the in operator is O(1) instead of O(r). We just sacrifice a little bit of space.
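Putting that together, a complete generator could look like the sketch below (filtered_set is just a name I made up for this example):

from itertools import combinations

def filtered_set(iterable, r):
    for combo in combinations(iterable, r):
        seen = set()
        for x in combo:
            if -x in seen:  # O(1) membership test instead of scanning the tuple
                break
            seen.add(x)
        else:
            yield combo

print(list(filtered_set([-2, -1, 1, 2], 2)))
# [(-2, -1), (-2, 1), (-1, 2), (1, 2)]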
The last thing you could do is re-implement the combination method yourself to include this type of checking, so that tuples that violate your condition aren't produced in the first place.
If you want to do this, the itertools documentation shows a roughly equivalent pure-Python implementation of combinations(); you just have to add your validation to it.
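If you go down that route, here is a rough sketch of the idea: a recursive generator (filtered_combinations is a made-up name, and this is not the C implementation behind itertools) that prunes conflicting branches as it builds each combination. The conflict parameter keeps the condition configurable, as you wanted for non-numeric data:

def filtered_combinations(pool, r, conflict=lambda chosen, x: -x in chosen):
    pool = list(pool)

    def rec(start, chosen):
        if len(chosen) == r:
            yield tuple(chosen)
            return
        for idx in range(start, len(pool)):
            x = pool[idx]
            if conflict(chosen, x):  # skip branches that would violate the condition
                continue
            chosen.append(x)
            yield from rec(idx + 1, chosen)
            chosen.pop()

    yield from rec(0, [])

print(list(filtered_combinations([-2, -1, 1, 2], 2)))
# [(-2, -1), (-2, 1), (-1, 2), (1, 2)]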
I have a list of tuples with duplicates and I've converted them to a dictionary using this code I found here:
https://stackoverflow.com/a/61201134/2415706
mylist = [('a', 1), ('a', 2), ('b', 3)]
result = {}
for i in mylist:
    result.setdefault(i[0], []).append(i[1])
print(result)
# {'a': [1, 2], 'b': [3]}
I recall learning that most for loops can be re-written as comprehensions so I wanted to practice but I've failed for the past hour to make one work.
I read this: https://stackoverflow.com/a/56011919/2415706, and I haven't been able to find another library that does this. I'm also not sure whether the comprehension I want to write is a bad idea in the first place, since append mutates things.
A comprehension is meant to map the items of a sequence independently of each other; it is not suitable for aggregations like the one in your question, where the sub-list an item is appended to is built up by the items that came before it.
You can produce the desired output with a nested comprehension if you must, but it would turn what would've been solved in O(n) time complexity with a loop into one that takes O(n ^ 2) instead:
{k: [v for s, v in mylist if s == k] for k, _ in mylist}
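If you really want the whole thing as a single expression, one common alternative (not from the linked answers, just a sketch) is itertools.groupby after sorting by key, which trades the O(n ^ 2) scan for an O(n log n) sort:

from itertools import groupby
from operator import itemgetter

mylist = [('a', 1), ('a', 2), ('b', 3)]
result = {k: [v for _, v in grp]
          for k, grp in groupby(sorted(mylist, key=itemgetter(0)), key=itemgetter(0))}
print(result)  # {'a': [1, 2], 'b': [3]}

Even so, the setdefault (or defaultdict) loop stays the most readable option.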
Sometimes you need to merge a nested list into a single flat list (similar to numpy's flatten()).
Say a list of lists is given like below, and you need to flatten it:
a = [[j for j in range(0, 10)] for i in range(0, 10000)]
There are two kinds of solutions: itertools.chain.from_iterable and functools.reduce.
%timeit list(itertools.chain.from_iterable(a))
%timeit reduce(lambda x, y: x+y, a)
Which one do you think is faster, and by how much?
itertools.chain.from_iterable is 1000 times faster or more (the gap grows as the list gets longer).
If somebody knows why this happens, please let me know.
Thanks as always for your support and help.
Yes, because list concatenation, i.e. using +, is an O(N) operation. When you do that to incrementally build a list of size N, it becomes O(N^2).
Instead, using chain.from_iterable will simply iterate over all N items in the final list, using the list type constructor, which will have linear performance.
This is why you shouldn't use sum to flatten a list (note, reduce(lambda x, y: x+y,...) is simply sum).
Note, the idiomatic way to flatten a nested list like this is to use a list comprehension:
[x for sub in a for x in sub]
This is such an anti-pattern that sum explicitly prevents you from doing it with str objects:
>>> sum(['here', 'is', 'some', 'strings'], '')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: sum() can't sum strings [use ''.join(seq) instead]
Note, your reduce/sum approach is equivalent to:
result = []
for sub in a:
    result = result + sub
Which demonstrates the expensive + operation in the loop quite clearly. Note that the following naive approach actually has O(N) behavior instead of O(N^2):
result = []
for sub in a:
    result += sub
That is because my_list += something is equivalent to my_list.extend(something), and .extend (along with .append) has amortized constant-time behavior per element, so overall it will be O(N).
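To see the difference concretely, here is a small, informal timing sketch of the variants discussed above; the sizes and repeat counts are arbitrary and the absolute numbers will vary:

from itertools import chain
from timeit import timeit

a = [[j for j in range(10)] for _ in range(1000)]

def concat_plus():
    # result = result + sub copies the whole accumulated list each time: O(N^2)
    result = []
    for sub in a:
        result = result + sub
    return result

def concat_extend():
    # result += sub extends in place: amortized O(N)
    result = []
    for sub in a:
        result += sub
    return result

print('chain   :', timeit(lambda: list(chain.from_iterable(a)), number=10))
print('listcomp:', timeit(lambda: [x for sub in a for x in sub], number=10))
print('extend  :', timeit(concat_extend, number=10))
print('plus    :', timeit(concat_plus, number=10))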
What is the best way to add values to a list, in terms of processing time and memory usage, and just generally, which is the better programming option?
list = []
for i in anotherArray:
    list.append(i)
or
list = [None] * len(anotherArray)
for i in range(len(anotherArray)):
    list[i] = anotherArray[i]
Consider that anotherArray is, for example, a list of tuples. (This is just a simple example.)
It really depends on your use case. There is no generic answer here as it depends on what you are trying to do.
In your example, it looks like you are just trying to create a copy of the array, in which case the best way to do this would be to use copy:
from copy import copy
list = copy(anotherArray)
If you are trying to transform the array into another array you should use list comprehension.
list = [i[0] for i in anotherArray] # get the first item from tuples in anotherArray
If you are trying to use both indexes and objects, you should use enumerate:
for i, j in enumerate(list):
which is much better than your second example.
You can also use generators, lambdas, maps, filters, etc. The reason all of these possibilities exist is that they are each "better" for different reasons. The writers of Python are pretty big on there being "one right way", so trust me, if there were one generic way that was always better, that is the only way that would exist in Python.
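For illustration, here is a rough sketch of the same kind of work done with map, filter and a generator expression (anotherArray here is just made-up example data):

anotherArray = [(1, 'a'), (2, 'b'), (3, 'c')]

firsts = list(map(lambda t: t[0], anotherArray))              # same result as the comprehension above
evens  = list(filter(lambda t: t[0] % 2 == 0, anotherArray))  # keep tuples whose first item is even
lazy   = (t[0] for t in anotherArray)                         # generator: items produced on demand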
Edit: I ran some performance tests for a tuple swap, and here are the results:
comprehension: 2.682028295999771
enumerate: 5.359116118001111
for in append: 4.177091988000029
for in indexes: 4.612594166001145
As you can tell, comprehension is usually the best bet. Using enumerate is expensive.
Here is the code for the above test:
from timeit import timeit

some_array = [(i, 'a', True) for i in range(0, 100000)]

def use_comprehension():
    return [(b, a, i) for i, a, b in some_array]

def use_enumerate():
    lst = []
    for j, k in enumerate(some_array):
        i, a, b = k
        lst.append((b, a, i))
    return lst

def use_for_in_with_append():
    lst = []
    for i in some_array:
        i, a, b = i
        lst.append((b, a, i))
    return lst

def use_for_in_with_indexes():
    lst = [None] * len(some_array)
    for j in range(len(some_array)):
        i, a, b = some_array[j]
        lst[j] = (b, a, i)
    return lst

print('comprehension:', timeit(use_comprehension, number=200))
print('enumerate:', timeit(use_enumerate, number=200))
print('for in append:', timeit(use_for_in_with_append, number=200))
print('for in indexes:', timeit(use_for_in_with_indexes, number=200))
Edit 2: It was pointed out to me that the OP just wanted to know the difference between "indexing" and "appending". Really, those are used for two different use cases as well: indexing is for replacing objects, whereas appending is for adding. However, in a case where the list starts empty, appending will always be better, because indexing has the overhead of creating the list initially. You can see from the results above that indexing is slightly slower, mostly because of that initial list creation.
The best way is a list comprehension:

my_list = [i for i in anotherArray]
But depending on your problem, you can use a generator expression instead (it is more memory-efficient than a list comprehension when you just want to loop over the items and don't need list features such as indexing or len):

my_list = (i for i in anotherArray)
I would actually say the best is a combination of index loops and value loops with enumeration:
for i, j in enumerate(list): # i is the index, j is the value, can't go wrong
I was surprised how different the speed was between two approaches to excluding one list of tuples from another, so I was wondering why.
I have a list of 1,500 tuples in the form (int, float), sorted by the float value. (ADDED NOTE: each int value in the tuple list is distinct.) I wanted to figure out the fastest way to exclude a sublist. So first I created a sublist to exclude:
exclude_list = [v for i,v in enumerate(tuple_list) if (i % 3) == 0]
Then I timed two different approaches to removing exclude_list from tuple_list (but these aren't the two approaches I finally settled on):
remainder_list = [v for v in tuple_list if v not in exclude_list]
and,
remainder_set = set(tuple_list) - set(exclude_list)
remainder_list = sorted(remainder_set, key=itemgetter(1))  # edited to change the key to 1 from 0
The difference in time was huge: 14.7235 seconds (500 runs) for the first approach and 0.3426 seconds (500 runs) for the second. I understand why these two approaches differ so much: the first has to search through the exclude list for every item in the main list. So then I came up with a better way to search/exclude:
exclude_dict = dict(exclude_list)
remainder_list = [v for v in tuple_list if v[0] not in exclude_dict]
I didn't expect this version of excluding list items to be much faster than the first. But not only was it faster than the first approach, it was faster than the second! It comes in at 0.11177 seconds (500 runs). Why is this faster than my set-difference/re-sort approach?
You might want to check the time complexity of list and set operations.
remainder_list = [v for v in tuple_list if v not in exclude_list]
The in operation here is O(N): for every element of tuple_list it scans through exclude_list to see whether that element is there. So the overall complexity is O(len(tuple_list) * len(exclude_list)).
The difference (-) operation on a set has O(n) complexity, since a set uses a hash table as its underlying data structure and has O(1) membership checking. Thus the line

remainder_set = set(tuple_list) - set(exclude_list)

has O(len(tuple_list)) complexity. The dict version is faster still because it only hashes the int keys of the (smaller) exclude list, does O(1) lookups on plain ints rather than whole tuples, and skips the final sort, whereas the set approach hashes both full lists of tuples and then re-sorts the result.
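If you want to check these costs empirically, here is a rough sketch that rebuilds the three approaches on made-up data shaped like the question's (the numbers will differ from the ones quoted above):

import random
from operator import itemgetter
from timeit import timeit

tuple_list = sorted(((i, random.random()) for i in range(1500)), key=itemgetter(1))
exclude_list = [v for i, v in enumerate(tuple_list) if i % 3 == 0]
exclude_dict = dict(exclude_list)

list_scan = lambda: [v for v in tuple_list if v not in exclude_list]
set_diff = lambda: sorted(set(tuple_list) - set(exclude_list), key=itemgetter(1))
dict_look = lambda: [v for v in tuple_list if v[0] not in exclude_dict]

for name, fn in [('list scan', list_scan), ('set diff', set_diff), ('dict lookup', dict_look)]:
    print(name, timeit(fn, number=100))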
The in operator for a list is O(N) to compute. It just does a linear search. To do better, you could change exclude_list to exclude_set:
exclude_set = {v for i,v in enumerate(tuple_list) if (i % 3) == 0}
Or, if you already have exclude_list:
exclude_set = set(exclude_list)
and then calculate your remainder_list as before:
remainder_list = [v for v in tuple_list if v not in exclude_set]
This is WAY better, because in for a set is a very impressive O(1) (on average). And here you don't need to re-sort the remainder_list either, so that removes an O(M log M) step (where M == len(remainder_list)).
Of course, with this trivial example, we could construct the whole thing with 1 list-comp:
remainder_list = [v for i,v in enumerate(tuple_list) if (i % 3) != 0]
Your algorithms are not equivalent. Your elements are pairs. With the first two methods you exclude elements by matching whole pairs. With the third method (with the dict) you exclude elements by comparing only the first element of each pair.
If the pairs have few distinct first elements, the dict method is much faster, but the result could be different.
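A tiny made-up example of how the results can diverge when first elements repeat (which, per your added note, doesn't happen in your data):

tuple_list = [(1, 0.5), (1, 0.9), (2, 0.7)]
exclude_list = [(1, 0.5)]

exclude_dict = dict(exclude_list)
by_key = [v for v in tuple_list if v[0] not in exclude_dict]
by_value = [v for v in tuple_list if v not in set(exclude_list)]

print(by_key)    # [(2, 0.7)]            -- (1, 0.9) is dropped too
print(by_value)  # [(1, 0.9), (2, 0.7)]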