Python find duplicates array operations - python

How can I form an array (c) composed of elements of b which are not in a?
a=[1,2,"ID123","ID126","ID124","ID125"]
b=[1,"ID123","ID124","ID125","343434","fffgfgf"]
c= []
Can this be done without using a list comprehension?

If the lists are long, you want to make a set of a first:
a_set = set(a)
c = [x for x in b if x not in a_set]
If the order of the elements doesn't matter (and duplicates in b don't need to be kept), then just use sets:
c = list(set(b) - set(a))
Python lists don't offer a direct - operator, as Ruby arrays do.

Using a list comprehension is the most straightforward:
>>> c = [i for i in b if i not in a]
>>> c
['343434', 'fffgfgf']
However, if you really do not want to use a list comprehension, you could use a generator expression:
c = (i for i in b if i not in a)
This also avoids building the whole result list in memory at once (in case that is a concern). Wrap it in list() when you need an actual list.
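For the "without a list comprehension" part of the question, here is a minimal sketch using the built-in filter() together with a set lookup. The names a_set, gen and leftover are only illustrative. Both filter() and a generator expression are lazy in Python 3, so the result has to be materialized with list() if a real list is wanted.

```python
a = [1, 2, "ID123", "ID126", "ID124", "ID125"]
b = [1, "ID123", "ID124", "ID125", "343434", "fffgfgf"]

a_set = set(a)  # set membership tests are O(1) on average

# filter() is lazy in Python 3; list() materializes the result
c = list(filter(lambda x: x not in a_set, b))
print(c)  # ['343434', 'fffgfgf']

# a generator expression is also lazy and can only be consumed once
gen = (x for x in b if x not in a_set)
c_from_gen = list(gen)
leftover = list(gen)  # the generator is exhausted now, so this is empty
```

Note that this assumes all items are hashable, which holds for the integers and strings in the question.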

The following will do it:
c = [v for v in b if v not in a]
If a is long, it might improve performance to turn it into a set:
a_set = set(a)
c = [v for v in b if v not in a_set]

Related

How to compare two lists to keep matching substrings?

As best I can describe it, I have two lists of strings and I want to return all results from list A that contain any of the strings in list B. Here are details:
A = ['dataFile1999', 'dataFile2000', 'dataFile2001', 'dataFile2002']
B = ['2000', '2001']
How do I return
C = ['dataFile2000', 'dataFile2001']?
I've been looking into list comprehensions, doing something like below
C=[x for x in A if B in A]
but I can't seem to make it work. Am I on the right track?
You were close, use any:
C=[x for x in A if any(b in x for b in B)]
More detailed:
A = ['dataFile1999', 'dataFile2000', 'dataFile2001', 'dataFile2002']
B = ['2000', '2001']
C = [x for x in A if any(b in x for b in B)]
print(C)
Output
['dataFile2000', 'dataFile2001']
You can use any() to check if any element of your list B is in x:
A = ['dataFile1999', 'dataFile2000', 'dataFile2001', 'dataFile2002']
B = ['2000', '2001']
c = [x for x in A if any(k in x for k in B)]
print(c)
Output:
['dataFile2000', 'dataFile2001']
First, I would construct a set of the years for the O(1) lookup time. [1]
>>> A = ['dataFile1999', 'dataFile2000', 'dataFile2001', 'dataFile2002']
>>> B = ['2000', '2001']
>>>
>>> years = set(B)
Now, keep only the elements of A that end with an element of years.
>>> [file for file in A if file[-4:] in years]
['dataFile2000', 'dataFile2001']
[1] If you have very small lists (two elements certainly qualify), keep the lists. Sets have O(1) lookup, but the hashing still introduces overhead.
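The any() approach and the suffix-plus-set approach are not interchangeable: the first matches a year anywhere in the name, the second only as the last four characters. A small sketch of the difference, where the extra file name 'dataFile2000_v2' is made up for illustration:

```python
A = ['dataFile1999', 'dataFile2000', 'dataFile2001', 'dataFile2000_v2']
B = ['2000', '2001']

# substring match: keep a name if any string from B occurs anywhere in it
by_substring = [x for x in A if any(b in x for b in B)]

# suffix match: keep a name only if its last four characters are in the set
years = set(B)
by_suffix = [x for x in A if x[-4:] in years]

print(by_substring)  # ['dataFile2000', 'dataFile2001', 'dataFile2000_v2']
print(by_suffix)     # ['dataFile2000', 'dataFile2001']
```

Which one is right depends on whether the year is guaranteed to sit at the end of every name.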

How to calculate the difference between the elements in three lists efficiently?

I have 3 very large lists of strings, for visualization purposes consider:
A = ['one','four', 'nine']
B = ['three','four','six','five']
C = ['four','five','one','eleven']
How can I calculate the difference between these lists in order to get only the elements that are not repeated in the other lists? For example:
A = ['nine']
B = ['three','six']
C = ['eleven']
Method 1
You can arbitrarily add more lists just by changing the first line, e.g. my_lists = (A, B, C, D, E).
my_lists = (A, B, C)
my_sets = {n: set(my_list) for n, my_list in enumerate(my_lists)}
my_unique_lists = tuple(
    list(my_sets[n].difference(*(my_sets[i] for i in range(len(my_sets)) if i != n)))
    for n in range(len(my_sets)))
>>> my_unique_lists
(['nine'], ['six', 'three'], ['eleven'])
my_sets uses a dictionary comprehension to create a set for each of the lists. Each key in the dictionary is the position of the corresponding list in my_lists.
Each set is then differenced with all other sets in the dictionary (barring itself) and then converted back to a list.
The ordering of my_unique_lists corresponds to the ordering in my_lists.
Method 2
You can use Counter to find the unique items (i.e. those that appear in only one of the lists), and then use a list comprehension to go through each list and keep the items that are unique.
from collections import Counter
c = Counter([item for my_list in my_lists for item in set(my_list)])
unique_items = {item for item, count in c.items() if count == 1}  # a set, for fast membership tests
>>> tuple([item for item in my_list if item in unique_items] for my_list in my_lists)
(['nine'], ['three', 'six'], ['eleven'])
With sets:
convert all lists to sets
take the differences
convert back to lists
A, B, C = map(set, (A, B, C))
a = A - B - C
b = B - A - C
c = C - A - B
A, B, C = map(list, (a, b, c))
The (possible) problem with this is that the final lists are no longer ordered, e.g.
>>> A
['nine']
>>> B
['six', 'three']
>>> C
['eleven']
This could be fixed by sorting by the original indices, but that adds enough work that the benefit of using sets is almost entirely lost.
With list-comps (for-loops):
convert lists to sets
use list-comps to keep only the elements of each original list that are not in the other sets
sA, sB, sC = map(set, (A, B, C))
A = [e for e in A if e not in sB and e not in sC]
B = [e for e in B if e not in sA and e not in sC]
C = [e for e in C if e not in sA and e not in sB]
which then produces a result that maintains the original order of the lists:
>>> A
['nine']
>>> B
['three', 'six']
>>> C
['eleven']
Summary:
In conclusion, if you don't care about the order of the result, convert the lists to sets and then take their differences (and don't bother converting back to lists). However, if you do care about order, still convert the lists to sets (hash tables), since the lookups done while filtering are then faster (average case O(1) vs O(n) for lists).
You can also iterate through the elements of all the lists, adding the current element to a set if it is not there yet and, if it is already there, removing it from the lists. This uses up to O(n) additional space and O(n) time, and the remaining elements stay in order.
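One possible reading of that single-pass idea, as a rough sketch: track which items have been seen in any list and which have been seen in more than one, then filter each list. The function name unique_per_list and the helper sets seen and repeated are hypothetical, and the items are assumed to be hashable.

```python
def unique_per_list(*lists):
    """Return, for each list, the items that occur in no other list (order kept)."""
    seen = set()      # items encountered in any list so far
    repeated = set()  # items encountered in more than one list
    for lst in lists:
        # set(lst) so that duplicates inside one list are counted once
        for item in set(lst):
            if item in seen:
                repeated.add(item)
            else:
                seen.add(item)
    # filter the original lists, which preserves their order
    return [[x for x in lst if x not in repeated] for lst in lists]


A = ['one', 'four', 'nine']
B = ['three', 'four', 'six', 'five']
C = ['four', 'five', 'one', 'eleven']
print(unique_per_list(A, B, C))
# [['nine'], ['three', 'six'], ['eleven']]
```

This builds new lists rather than removing elements in place, which sidesteps the problems of mutating a list while looping over it.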
You can also use a function defined specifically to compute the difference between three lists. Here's an example of such a function:
def three_list_difference(l1, l2, l3):
    lst = []
    for i in l1:
        # keep i only if it appears in neither of the other two lists
        if not (i in l2 or i in l3):
            lst.append(i)
    return lst
The function three_list_difference takes three lists and keeps each element of the first list l1 that is in neither l2 nor l3. The difference can be determined by simply calling the function with the lists in the right order:
three_list_difference(A, B, C)
three_list_difference(B, A, C)
three_list_difference(C, B, A)
with outputs:
['nine']
['three', 'six']
['eleven']
Using a function is advantageous because the code is reusable.

Assign values to an array using two values

I am trying to generate an array that is the sum of two previous arrays. e.g
c = [A + B for A in a and B in b]
Here, I get the error message
NameError: name 'B' is not defined
where
len(a) = len(b) = len(c)
Please can you let me know what I am doing wrong. Thanks.
The boolean and operator does not wire iterables together, it evaluates the truthiness (or falsiness) of its two operands.
What you're looking for is zip:
c = [A + B for A, B in zip(a, b)]
Items from the two iterables are successively assigned to A and B until one of the two is exhausted. B is now defined!
It should be
c = [A + B for A in a for B in b]
for instead of and. You might also want to consider NumPy, where you can add two arrays directly and more efficiently.
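'for' does not work the way you want it to work: nested for clauses pair every element of a with every element of b, so the result has len(a) * len(b) entries instead of len(a).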
You could use zip().
A = [1,2,3]
B = [4,5,6]
c = [ a + b for a,b in zip(A,B)]
zip iterates through A & B and produces tuples.
To see what this looks like try:
[ x for x in zip(A,B)]
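To make the difference between the two suggestions concrete, here is a short sketch comparing zip with nested for clauses on the same inputs. The variable names pairwise and all_pairs are only illustrative.

```python
a = [1, 2, 3]
b = [4, 5, 6]

# zip pairs items by position: (1, 4), (2, 5), (3, 6)
pairwise = [x + y for x, y in zip(a, b)]

# nested for clauses combine every x with every y (a Cartesian product)
all_pairs = [x + y for x in a for y in b]

print(pairwise)        # [5, 7, 9]
print(len(all_pairs))  # 9
```

Only the zip version gives a result with the same length as the inputs, which is what the question asks for.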

Remove Variable(s) in List A if Variable(s) is/are in List B, Python

Like the title states I want to remove variables in one list if they happen to be in another list. I have tried various techniques but I can't seem to get a proper code. Can anyone help with this?
You may use list comprehension if you want to maintain the order:
>>> l = [1,2,3,4]
>>> l2 = [1,5,6,3]
>>> [x for x in l if x not in l2]
[2, 4]
In case the order of elements in the original list doesn't matter, you may use set:
>>> list(set(l) - set(l2))
[2, 4]
def returnNewList(a, b):
    h = {}
    for e in b:
        h[e] = True
    return [e for e in a if e not in h]
A hash table is used to keep the run time complexity linear.
If list b is sorted, then instead of using a hash table you can perform a binary search. The complexity in this case will be O(n log m), where n = len(a) and m = len(b).
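The binary search variant could look roughly like the sketch below, using the standard library bisect module. The function name remove_if_in_sorted and the helper contains are hypothetical, and the sketch assumes sorted_b is already sorted in ascending order and that its items can be compared with the items of a.

```python
from bisect import bisect_left


def remove_if_in_sorted(a, sorted_b):
    """Return the items of a that do not occur in sorted_b (order of a kept)."""

    def contains(x):
        # bisect_left returns the leftmost position where x could be inserted;
        # x is present only if that position holds an equal item
        i = bisect_left(sorted_b, x)
        return i < len(sorted_b) and sorted_b[i] == x

    return [e for e in a if not contains(e)]


a = [0, 1, 2, 3, 4, 5, 6, 7, 8]
b = [0, 5, 8]  # already sorted
print(remove_if_in_sorted(a, b))  # [1, 2, 3, 4, 6, 7]
```

Unlike the hash table version, this works for items that are comparable but not hashable, at the cost of the extra log factor.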
There are several ways
# just make a new list
[i for i in a if i not in b]
# use sets
list(set(a).difference(set(b)))
I figured it out; however, is there a shorter way to write this code?
a = [0,1,2,3,4,5,6,7,8]
b = [0,5,8]
for i in a[:]:  # iterate over a copy, since a is modified inside the loop
    if i in b:
        a.remove(i)
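One thing to watch with this kind of loop: removing items from a list while iterating over that same list directly can skip elements, because everything after the removed item shifts one position to the left. A small sketch with made-up data (the names buggy and fixed are illustrative), followed by the shorter list comprehension form:

```python
b = [1]

# Removing inside the loop shifts later items left, so the loop skips one
buggy = [1, 1, 2]
for i in buggy:
    if i in b:
        buggy.remove(i)
print(buggy)  # [1, 2]: the second 1 was never visited

# Building a new list avoids the problem and is also shorter
a = [1, 1, 2]
fixed = [i for i in a if i not in b]
print(fixed)  # [2]
```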

Compare with each other, items inside two lists, in a for loop?

How do I check in Python if an item of a list is repeated in another list?
I suppose that I should use a for loop to go and check item by item, but I'm stuck on something like this (which I know is not correct):
def check(a, b):
    for item in b:
        for item1 in a:
            if b[item] in a[item1]:
                b.remove(b[item1])
I want to remove repeated elements in the second list in comparison with the first list.
Edit: I do assume that list a has items that are repeated in list b. Those items can be of any type.
Desired output:
a=[a,b,c]
b=[c,d,e]
I want to append both lists and print: a b c d e
Assuming that a and b do not contain duplicate items that need to be preserved and the items are all hashable, you can use Python's built-in set:
c = list(set(b) - set(a))
# c is now a list of everything in b that is not in a
This would work for:
a, b = range(7), range(5, 11)
but it would not work for:
a = [1, 2, 1, 1, 3, 4, 2]
b = [1, 3, 4]
# list(set(a) - set(b)) would be [2]
# rather than the possibly desired [2, 2]
When duplicates should be kept, you can do the following:
set_b = set(b)
c = [x for x in a if x not in set_b]
Using a set for b will make the lookups O(1) rather than O(N) (which will not matter for small lists).
You can use Python's set operations without the need for loops:
>>> a = [1,2]
>>> b = [2]
>>> set(a) - set(b)
set([1])
>>>
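To combine both lists without repeated items, concatenate them, use set to drop the duplicates, and use list to get a list back. Note that a set does not keep the original order.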
d = list(set(a + b))
You can use list.sort() if you want to sort the list as well.
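If the merged list should keep the order in which items first appear, rather than being sorted afterwards, one option is dict.fromkeys, since dictionaries preserve insertion order in Python 3.7 and later. This is a sketch that assumes the items are hashable. The strings below stand in for the a, b, c, d, e of the question.

```python
a = ['a', 'b', 'c']
b = ['c', 'd', 'e']

# dict keys are unique and keep first-seen order, so the second 'c' is dropped
merged = list(dict.fromkeys(a + b))

print(' '.join(merged))  # a b c d e
```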
