Python: find the same elements in an arbitrary number of lists - python

I'm looking for a way to find an element that exists in an arbitrary number of lists.
Currently I'm able to find it in two lists:
def Mengen(liste1, liste2):
    print(liste2)
    A = set(liste1)
    B = set(liste2)
    if (A & B):
        return(A & B)
I would now like to expand that to multiple lists. I already tried to use pointers, but I wasn't able to make it work.
Does anyone have a good way to solve this problem?
Thanks for any help in advance :)
Edit:
The function is called like this:
a = [1,2,3,4]
b = [4,5,6,7]
c = [4,8,9,20]
d = [4,567,756,456,423]
Mengen(a,b)
And the output should be just 4 or [4].

Since you just want "what's in every input", this can be done easily with set's intersection method. It takes varargs, and the arguments do not need to be sets themselves, so you can pass all of your arguments except the first to it, and it does everything in one efficient operation:
def common_entries(iterable, *rest):
    return list(set(iterable).intersection(*rest))
This improves on the existing answer in a few ways (I originally began this as a comment suggesting improvements, but it got involved enough that I decided to separate it):
It removes multiple unnecessary list comprehensions (the inner listcomp is completely unnecessary; [i for i in args] needlessly makes a list from the tuple received when you could just use args directly)
It avoids a ton of temporary sets you don't strictly need (set.intersection's arguments can be any iterable; you don't need to convert them)
Since one input is mandatory (the original code would explode in a more confusing way if you passed no arguments), you can accept it separately, making the code allowed by #2 neater
For the simple test input of:
a = [1,2,3,4]
b = [4,5,6,7]
c = [4,8,9,20]
d = [4,567,756,456,423]
my common_entries(a, b, c, d) takes roughly half the time to run as the other answer's findDuplicates(a, b, c, d), and reduces the number of explicit temporaries from n + 2 to zero (internally, set.intersection does make temporary sets to represent progressive intersection results, but those can't be avoided in any reasonable way).
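For reference, here is that version run against the test data above (the definition is repeated so the snippet is self-contained; the output order is irrelevant here since there is only one common element):
def common_entries(iterable, *rest):
    return list(set(iterable).intersection(*rest))

a = [1,2,3,4]
b = [4,5,6,7]
c = [4,8,9,20]
d = [4,567,756,456,423]
print(common_entries(a, b, c, d))  # [4]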

def findDuplicated(*args):
    return list(set.intersection(*[set(j) for j in [i for i in args]]))
findDuplicated([1,2,3,4], [4,5,6,7], [4,8,9,20], [4,567,756,456,423])

This works:
Set1 = set(input().split())
Set2 = set(input().split())
Set3 = set(input().split())
Set4 = set(input().split())
Set5 = Set1.intersection(Set2)
Set6 = Set3.intersection(Set4)
print(Set5.intersection(Set6))
The input should be in this format:
1 2 3 4
4 5 6 7
4 9 10
4 567 572
The values have to be separated by spaces, but this should work.
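If the number of input lines is not fixed at four, a loop-based sketch (assuming one space-separated list per line, terminated by an empty line) avoids hard-coding the sets:
sets = []
while True:
    line = input()
    if not line.strip():
        break
    sets.append(set(line.split()))

if sets:
    print(set.intersection(*sets))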

Related

Python: alternative way of reinitializing lists in a for loop

In the code below I would like to find an alternative way of reinitializing lists.
iter_obj_main has over 200,000 iter_obj, and one iter_obj has over 1 million rows of data and over 50 columns. (In the example below I use just 3 columns, a, b, c, for demonstration.)
This makes the code look very long and ugly.
I am actually looking to empty all 50 lists after every loop iteration.
Any suggestions, Python gurus?
i = 0
for iter_obj in iter_obj_main:
    for x in iter_obj:
        i += 1
        a, b, c = ([] for j in range(3))
        if x == sometest:
            a.insert(i, x[0])
        else:
            a.insert(i, '')
        if x == sometest:
            b.insert(i, x[1])
        else:
            b.insert(i, '')
        if x == sometest:
            c.insert(i, x[2])
        else:
            c.insert(i, '')
    # moving data to database because of memory limitations and clearing lists.
maybe use:
for iter_obj in iter_obj_main:
    for x in iter_obj:
        a, b, c, *_ = x
Assigning twice to the same variable is a waste of time. There is no use in doing a = [] when you follow it by another assignment a = x[0].
Secondly, if x is always a list with 3 values, you could unpack it immediately:
for iter_obj in iter_obj_main:
    for a, b, c in iter_obj:
        ...  # write to the database
Now the writing to the database would be the bottleneck anyway, so the above optimisation is not going to make it run significantly faster. You should focus on how you can write bulk data to your database without having to send instructions one-by-one.
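To illustrate the bulk-write idea, here is a minimal sketch assuming sqlite3 and a hypothetical table my_table with three columns; executemany sends the whole batch in one call instead of issuing one INSERT per row:
import sqlite3

conn = sqlite3.connect("data.db")  # hypothetical database file
rows = [(a, b, c) for a, b, c in iter_obj]  # collect the current batch as tuples
conn.executemany("INSERT INTO my_table (col_a, col_b, col_c) VALUES (?, ?, ?)", rows)
conn.commit()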
Finally, if your database table has 50 columns, it is very likely that you have a design flaw in your database schema: you probably didn't normalise it.

Compare two lists with custom functions and custom comparison

I have two lists
a = [1,2,3]
b = [2,3,4,5]
and two custom functions F1(a[i]) and F2(b[j]), which take elements from lists a and b and return objects, say A and B. And there is a custom comparison function diff(A,B), which returns False if the objects are the same in some sense and True if they are different.
I would like to compare these two lists in a pythonic way to see if the lists are the same. In other words, that all objects A generated from list a have at least one equal object B generated from list b. For example if the outcome is the following then the lists are the same:
(diff(F1(1),F2(4)) or diff(F1(1),F2(5)) or diff(F1(2),F2(3)) or diff(F1(3),F2(2))) is False
For the function sorted there is a key function. Are there any comparison functions in Python that can take a custom function in this situation?
The only solution that I see is to loop through all elements in a and loop through all elements in b to check them element by element. But then if I want to extend the functionality this would require quite some development. If there is a way to use standard Python functions this would be very much appreciated.
all(True if any(diff(A, B) is False for B in [F2(j) for j in b]) else False for A in [F1(i) for i in a])
Honestly I just typed this quickly using my Python auto-pilot but it would look something like this I guess.
Probably the best you can get is first converting all entries of the list, and then iterate over both and check if any of a has a match in b and if that holds true for all as. So something like
af = [F1(i) for i in a]
bf = [F2(i) for i in b]
is_equal = all(any(not diff(a, b) for b in bf) for a in af)
all stops as soon as it finds the first value that is false, any stops as soon as it finds the first value that is True.
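A tiny illustration of that short-circuiting, using a hypothetical predicate that prints whenever it is called:
def noisy(x):
    print("checking", x)
    return x > 2

print(any(noisy(x) for x in [1, 2, 3, 4]))  # prints 1, 2, 3 then True; 4 is never checked
print(all(noisy(x) for x in [3, 4, 1, 2]))  # prints 3, 4, 1 then False; 2 is never checked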

Adding lists by reference

Is it possible to add two lists using a reference of each list instead of a copy?
For example -
first_list = [1,2,3]
second_list = [5,6]
new_list = first_list + second_list
print(new_list) # Will print [1,2,3,5,6]
first_list.append(4)
print(new_list) # Should print [1,2,3,4,5,6]
Is there a way to do this in Python? Or is code re-write my only option?
Edit: I removed confusing comments I made about using C++ to do this.
You can't directly do this in Python any more than you can in C++.
But you can indirectly do it in Python the exact same way you can in C++: by writing an object that holds onto both lists and dispatches appropriately.
For example:
import collections.abc

class TwoLists(collections.abc.Sequence):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def __len__(self):
        return len(self.a) + len(self.b)

    def __getitem__(self, idx):
        # cheating a bit, not handling slices or negative indexing
        if idx < len(self.a):
            return self.a[idx]
        else:
            return self.b[idx - len(self.a)]
Now:
>>> first_list = [1,2,3]
>>> second_list = [5,6]
>>> two_lists = TwoLists(first_list, second_list)
>>> print(*two_lists)
1 2 3 5 6
>>> first_list.append(4)
>>> print(*two_lists)
1 2 3 4 5 6
What I think you were missing here is a fundamental distinction between Python and C++ in how variables work. Briefly, every Python variable (and attribute and list position and so on) is, in C++ terms, a reference variable.
Less misleadingly:
C++ variables (and attributes, etc.) are memory locations—they're where values live. If you want a to be a reference to the value in b, you have to make a reference-to-b value, and store that in a. (C++ has a bit of magic that lets you define a reference variable like int& a = b, but you can't later reassign a to refer to c; if you want that, you have to explicitly use pointers, C-style.)
Python variables (and etc.) are names for values, while the values live wherever they want to. If you want a to be a reference to the value in b, you just bind a to the same value b is bound to: a = b. (And, unlike C++, you can reassign a = c at any time.)
Of course the cost is performance: there's an extra indirection to reach any value from its name in Python, while in C++, that only happens when you use pointer variables. But that cost is pretty much always invisible compared to the other overhead of Python (interpreting bytecode and dynamically looking names up in dictionaries and so on), so it makes sense for a high-level language to just not give you the choice.
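To make the binding point concrete, here is a quick plain-Python demonstration (nothing specific to this question):
b = [1, 2, 3]
a = b             # a and b are now two names for the same list object
b.append(4)
print(a)          # [1, 2, 3, 4] -- the change is visible through either name

c = [9, 9]
a = c             # rebinding a to a different object; b is untouched
print(b)          # [1, 2, 3, 4]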
All that being said, there's usually not a good reason to do this in either language. Both Python, and the C++ standard library, are designed around (similar, but different notions of) iteration.
In Python, you usually don't actually need a sequence, just an iterable. And to chain two iterables together is trivial:
>>> from itertools import chain
>>> first_list = [1,2,3]
>>> second_list = [5,6]
>>> print(*chain(first_list, second_list))
1 2 3 5 6
>>> first_list.append(4)
>>> print(*chain(first_list, second_list))
1 2 3 4 5 6
Yes, I can only iterate over the chain once, but usually that's all you need. (Just as in C++ you usually only need to loop from begin(c) to end(c), not to build a new persistent object that holds onto them.)
And if you think that's cheating because I'm using itertools, we can define it ourselves:
def chain(*its):
    for it in its:
        yield from it

How to arrange an existing Python list of integers consecutively in memory?

The following blog post shows that a list of integers is processed quicker if the list is not randomly shuffled. Due to cache locality, the unshuffled list is faster to process since its adjacent elements are located adjacently in memory.
https://rickystewart.wordpress.com/2013/09/03/why-sorting-an-array-makes-a-python-loop-faster/
I tried the following approach so that the shuffled list would be re-ordered with adjacent elements consecutive in memory.
import copy
from random import shuffle

a = [i for i in range(1000000)]
shuffle(a)

# Approach 1
a = copy.deepcopy(a)
However, that did not improve performance, suggesting that the items aren't reordered consecutively in memory.
I also tried the following modifications after shuffling, which also did not improve performance.
# Approach 2
a = [x for x in a]
# Approach 3
a = [copy.deepcopy(x) for x in a]
The following approach improves performance, suggesting that the elements are re-ordered in memory.
# Approach 4
a = [x+0 for x in a]
My question is why do approaches 1 through 3 not re-order the elements in memory, whereas approach 4 does?
Is there a suggested way to do this, different from approach 4?
It boils down to whether you are creating new objects or not. It turns out approaches 1 to 3 do not create new objects; here is why.
Approach 1 & 3: ❌
While they look different, those two approaches are the same. When calling copy.deepcopy on an integer (or any immutable builtin type), the copy module uses the following method.
def _deepcopy_atomic(x, memo):
    return x
So whenever you deepcopy an integer, the same object is returned. Likewise, deepcopying a list of integers actually returns a shallow copy.
from copy import deepcopy
l = [1000]
print(l[0] is deepcopy(l)[0]) # True
Approach 2: ❌
By doing [x for x in a], you trivially make a new list with exactly the same objects. Here is a sanity check.
l1 = [1000]
l2 = [x for x in l1]
print(l1[0] is l2[0]) # True
Approach 4: ✅
Now this approach actually creates a new object for integers bigger than 256 (CPython caches and reuses the small integers from -5 to 256).
x = 1000
print(x is x + 0) # False
Final word
While the last approach is the only one that actually creates a new object, I could not find anything in the docs stating that this is a property of the language. So keep in mind that this might be implementation specific, and it is not unlikely that you will encounter an interpreter which optimizes x + 0 to always return the same object.
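For what it's worth, a quick way to see that cache boundary on CPython (implementation-specific behaviour, so other interpreters may differ):
x = 256
print(x is x + 0)   # True on CPython: 256 comes from the small-integer cache

y = 257
print(y is y + 0)   # False on CPython: the addition produces a fresh object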

Python: List comprehension significantly faster than Filter? [duplicate]

I have a list that I want to filter by an attribute of the items.
Which of the following is preferred (readability, performance, other reasons)?
xs = [x for x in xs if x.attribute == value]
xs = filter(lambda x: x.attribute == value, xs)
It is strange how much beauty varies for different people. I find the list comprehension much clearer than filter+lambda, but use whichever you find easier.
There are two things that may slow down your use of filter.
The first is the function call overhead: as soon as you use a Python function (whether created by def or lambda) it is likely that filter will be slower than the list comprehension. It almost certainly is not enough to matter, and you shouldn't think much about performance until you've timed your code and found it to be a bottleneck, but the difference will be there.
The other overhead that might apply is that the lambda is being forced to access a scoped variable (value). That is slower than accessing a local variable, and in Python 2.x the list comprehension only accesses local variables. If you are using Python 3.x, the list comprehension runs in a separate function, so it will also be accessing value through a closure and this difference won't apply.
The other option to consider is to use a generator instead of a list comprehension:
def filterbyvalue(seq, value):
    for el in seq:
        if el.attribute == value:
            yield el
Then in your main code (which is where readability really matters) you've replaced both list comprehension and filter with a hopefully meaningful function name.
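For example, reusing the xs and value from the question (process() here is just a placeholder for whatever you do with each element):
matching = list(filterbyvalue(xs, value))   # materialise the filtered items

# or consume lazily, without building an intermediate list:
for el in filterbyvalue(xs, value):
    process(el)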
This is a somewhat religious issue in Python. Even though Guido considered removing map, filter and reduce from Python 3, there was enough of a backlash that in the end only reduce was moved from built-ins to functools.reduce.
Personally I find list comprehensions easier to read. It is more explicit what is happening from the expression [i for i in list if i.attribute == value] as all the behaviour is on the surface not inside the filter function.
I would not worry too much about the performance difference between the two approaches as it is marginal. I would really only optimise this if it proved to be the bottleneck in your application which is unlikely.
Also since the BDFL wanted filter gone from the language then surely that automatically makes list comprehensions more Pythonic ;-)
Since any speed difference is bound to be minuscule, whether to use filters or list comprehensions comes down to a matter of taste. In general I'm inclined to use comprehensions (which seems to agree with most other answers here), but there is one case where I prefer filter.
A very frequent use case is pulling out the values of some iterable X subject to a predicate P(x):
[x for x in X if P(x)]
but sometimes you want to apply some function to the values first:
[f(x) for x in X if P(f(x))]
As a specific example, consider
primes_cubed = [x*x*x for x in range(1000) if prime(x)]
I think this looks slightly better than using filter. But now consider
prime_cubes = [x*x*x for x in range(1000) if prime(x*x*x)]
In this case we want to filter against the post-computed value. Besides the issue of computing the cube twice (imagine a more expensive calculation), there is the issue of writing the expression twice, violating the DRY aesthetic. In this case I'd be apt to use
prime_cubes = filter(prime, [x*x*x for x in range(1000)])
Although filter may be the "faster way", the "Pythonic way" would be not to care about such things unless performance is absolutely critical (in which case you wouldn't be using Python!).
I thought I'd just add that in python 3, filter() is actually an iterator object, so you'd have to pass your filter method call to list() in order to build the filtered list. So in python 2:
lst_a = range(25) #arbitrary list
lst_b = [num for num in lst_a if num % 2 == 0]
lst_c = filter(lambda num: num % 2 == 0, lst_a)
lists b and c have the same values, and were completed in about the same time, as filter() was equivalent to [x for x in y if z]. However, in 3, this same code would leave list c containing a filter object, not a filtered list. To produce the same values in 3:
lst_a = range(25) #arbitrary list
lst_b = [num for num in lst_a if num % 2 == 0]
lst_c = list(filter(lambda num: num %2 == 0, lst_a))
The problem is that list() takes an iterable as its argument, and creates a new list from that argument. The result is that using filter in this way in python 3 takes up to twice as long as the [x for x in y if z] method because you have to iterate over the output from filter() as well as the original list.
An important difference is that a list comprehension will return a list, while filter returns a filter object, which you cannot manipulate like a list (e.g. calling len on it does not work).
My own self-learning brought me to some similar issue.
That being said, if there is a way to have the resulting list from a filter, a bit like you would do in .NET when you do lst.Where(i => i.something()).ToList(), I am curious to know it.
EDIT: This is the case for Python 3, not 2 (see discussion in comments).
I find the second way more readable. It tells you exactly what the intention is: filter the list.
PS: do not use 'list' as a variable name
generally filter is slightly faster if using a builtin function.
I would expect the list comprehension to be slightly faster in your case
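A rough way to check that claim yourself (all three variants below keep the truthy elements; timings vary with machine and Python version):
from timeit import timeit

setup = "data = list(range(1000))"
print(timeit("list(filter(None, data))", setup=setup, number=10000))          # built-in / None predicate
print(timeit("list(filter(lambda x: x, data))", setup=setup, number=10000))   # lambda predicate
print(timeit("[x for x in data if x]", setup=setup, number=10000))            # list comprehension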
Filter is just that: it filters out the elements of a list, and you can see that the definition says the same (in the official docs link I mentioned before). A list comprehension, on the other hand, produces a new list after acting on something in the previous list. (Both filter and list comprehension create a new list rather than performing the operation in place on the older list. A new list here can even be a list with, say, an entirely new data type, like converting integers to strings, etc.)
In your example, as per that definition, it is better to use filter than a list comprehension. However, if you want, say, other_attribute of the list elements to be retrieved as a new list, then you can use a list comprehension:
return [item.other_attribute for item in my_list if item.attribute==value]
This is how I actually remember filter and list comprehension: to remove a few things from a list and keep the other elements intact, use filter; to apply some logic of your own to the elements and create a watered-down list suitable for some purpose, use a list comprehension.
Here's a short piece I use when I need to filter on something after the list comprehension. Just a combination of filter, lambda, and lists (otherwise known as the loyalty of a cat and the cleanliness of a dog).
In this case I'm reading a file, stripping out blank lines, commented out lines, and anything after a comment on a line:
# Throw out blank lines and comments
with open('file.txt', 'r') as lines:
    # From the inside out:
    # [s.partition('#')[0].strip() for s in lines]... throws out comments
    # filter(lambda x: x != '', [s.part...           filters out blank lines
    # y for y in filter...                           converts the filter object to a list
    file_contents = [y for y in filter(lambda x: x != '', [s.partition('#')[0].strip() for s in lines])]
It took me some time to get familiar with the higher-order functions filter and map. So I got used to them, and I actually liked filter, as it was explicit that it filters by keeping whatever is truthy, and I felt cool that I knew some functional programming terms.
Then I read this passage (Fluent Python Book):
The map and filter functions are still builtins in Python 3, but since the introduction of list comprehensions and generator expressions, they are not as important. A listcomp or a genexp does the job of map and filter combined, but is more readable.
And now I think: why bother with the concept of filter / map if you can achieve the same with already widely spread idioms like list comprehensions? Furthermore, map and filter each take a function, and in that case I prefer using anonymous functions (lambdas).
Finally, just for the sake of having it tested, I've timed both methods (map and listComp) and I didn't see any relevant speed difference that would justify making arguments about it.
from timeit import Timer
timeMap = Timer(lambda: list(map(lambda x: x*x, range(10**7))))
print(timeMap.timeit(number=100))
timeListComp = Timer(lambda: [x*x for x in range(10**7)])
print(timeListComp.timeit(number=100))
#Map: 166.95695265199174
#List Comprehension 177.97208347299602
In addition to the accepted answer, there is a corner case when you should use filter instead of a list comprehension. If the list is unhashable you cannot directly process it with a list comprehension. A real-world example is using pyodbc to read results from a database. The fetchall() results from the cursor are an unhashable list. In this situation, to directly manipulate the returned results, filter should be used:
cursor.execute("SELECT * FROM TABLE1;")
data_from_db = cursor.fetchall()
processed_data = filter(lambda s: 'abc' in s.field1 or s.StartTime >= start_date_time, data_from_db)
If you use list comprehension here you will get the error:
TypeError: unhashable type: 'list'
In terms of performance, it depends.
filter does not return a list but an iterator. If you need the list 'immediately', filtering plus list conversion is slower than a list comprehension by about 40% for very large lists (>1M elements). Up to 100K elements there is almost no difference; from 600K onwards the differences start to show.
If you don't convert to a list, filter is practically instantaneous.
More info at: https://blog.finxter.com/python-lists-filter-vs-list-comprehension-which-is-faster/
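That "practically instantaneous" behaviour is simply laziness: building the filter object does no work until you iterate it. A small sketch:
data = range(10**7)
f = filter(lambda x: x % 2 == 0, data)   # returns immediately; nothing has been filtered yet
print(next(f))                           # 0 -- elements are produced on demand
evens = list(f)                          # only now is the remaining work done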
Curiously on Python 3, I see filter performing faster than list comprehensions.
I always thought that the list comprehensions would be more performant.
Something like:
[name for name in brand_names_db if name is not None]
The bytecode generated is a bit better.
>>> def f1(seq):
... return list(filter(None, seq))
>>> def f2(seq):
... return [i for i in seq if i is not None]
>>> disassemble(f1.__code__)
2 0 LOAD_GLOBAL 0 (list)
2 LOAD_GLOBAL 1 (filter)
4 LOAD_CONST 0 (None)
6 LOAD_FAST 0 (seq)
8 CALL_FUNCTION 2
10 CALL_FUNCTION 1
12 RETURN_VALUE
>>> disassemble(f2.__code__)
2 0 LOAD_CONST 1 (<code object <listcomp> at 0x10cfcaa50, file "<stdin>", line 2>)
2 LOAD_CONST 2 ('f2.<locals>.<listcomp>')
4 MAKE_FUNCTION 0
6 LOAD_FAST 0 (seq)
8 GET_ITER
10 CALL_FUNCTION 1
12 RETURN_VALUE
But they are actually slower:
>>> timeit(stmt="f1(range(1000))", setup="from __main__ import f1,f2")
21.177661532000116
>>> timeit(stmt="f2(range(1000))", setup="from __main__ import f1,f2")
42.233950221000214
I would come to the conclusion: use a list comprehension over filter since it is
more readable
more pythonic
faster (for Python 3.11, see the benchmark below)
Keep in mind that filter returns an iterator, not a list.
python3 -m timeit '[x for x in range(10000000) if x % 2 == 0]'
1 loop, best of 5: 270 msec per loop
python3 -m timeit 'list(filter(lambda x: x % 2 == 0, range(10000000)))'
1 loop, best of 5: 432 msec per loop
Summarizing other answers
Looking through the answers, we have seen a lot of back and forth, whether or not list comprehension or filter may be faster or if it is even important or pythonic to care about such an issue. In the end, the answer is as most times: it depends.
I just stumbled across this question while optimizing code where this exact question (albeit combined with an in expression, not ==) is very relevant - the filter + lambda expression is taking up a third of my computation time (of multiple minutes).
My case
In my case, the list comprehension is much faster (twice the speed). But I suspect that this varies strongly based on the filter expression as well as the Python interpreter used.
Test it for yourself
Here is a simple code snippet that should be easy to adapt. If you profile it (most IDEs can do that easily), you will be able to easily decide for your specific case which is the better option:
whitelist = set(range(0, 100000000, 27))
input_list = list(range(0, 100000000))
proximal_list = list(filter(
    lambda x: x in whitelist,
    input_list
))
proximal_list2 = [x for x in input_list if x in whitelist]
print(len(proximal_list))
print(len(proximal_list2))
If you do not have an IDE that lets you profile easily, try this instead (extracted from my codebase, so a bit more complicated). This code snippet will create a profile for you that you can easily visualize using e.g. snakeviz:
import cProfile
from time import time

class BlockProfile:
    def __init__(self, profile_path):
        self.profile_path = profile_path
        self.profiler = None
        self.start_time = None

    def __enter__(self):
        self.profiler = cProfile.Profile()
        self.start_time = time()
        self.profiler.enable()

    def __exit__(self, *args):
        self.profiler.disable()
        exec_time = int((time() - self.start_time) * 1000)
        self.profiler.dump_stats(self.profile_path)

whitelist = set(range(0, 100000000, 27))
input_list = list(range(0, 100000000))

with BlockProfile("/path/to/create/profile/in/profile.pstat"):
    proximal_list = list(filter(
        lambda x: x in whitelist,
        input_list
    ))
    proximal_list2 = [x for x in input_list if x in whitelist]

print(len(proximal_list))
print(len(proximal_list2))
Your question is simple yet interesting. It just shows how flexible Python is as a programming language. One may use any logic and write the program according to their own talent and understanding. It is fine as long as we get the answer.
Here in your case, it is just a simple filtering task which can be done by either, but I would prefer the first one, my_list = [x for x in my_list if x.attribute == value], because it seems simpler and does not need any special syntax. Anyone can understand this line and make changes to it if needed.
(Although the second method is also simple, it still has more complexity than the first one for beginner-level programmers.)
