Python, finding unique words in multiple lists - python

I have the following code:
a = ['hello','how','are','hello','you']
b = ['hello','how','you','today']
len_b = len(b)
for word in a:
    count = 0
    while count < len_b:
        if word == b[count]:
            a.remove(word)
            break
        else:
            count = count + 1
print a
The goal is to output (contents of list a) - (contents of list b),
so the wanted result in this case would be a = ['are','hello'].
But when I run my code I get a = ['how','are','you'].
Can anybody either point out what is wrong with my implementation, or suggest a better way to solve this?

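You can use a set to get all the non-duplicate elements.
So you could do set(a) - set(b) for the difference of the sets. Note that a set drops repeated words, so for the lists above this gives only 'are', not the second 'hello'.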

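Your code gives the wrong result because you are mutating the list a while iterating over it: each a.remove(word) shifts the remaining items to the left, so the loop skips the element that follows.
If you want to solve it correctly, you can try the method below. It uses a dictionary to keep track of how many copies of each word remain, and a list comprehension to build the result: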
>>> a = ['hello','how','are','hello','you']
>>> b = ['hello','how','you','today']
>>>
>>> cnt_a = {}
>>> for w in a:
...     cnt_a[w] = cnt_a.get(w, 0) + 1
...
>>> for w in b:
...     if w in cnt_a:
...         cnt_a[w] -= 1
...         if cnt_a[w] == 0:
...             del cnt_a[w]
...
>>> [y for k, v in cnt_a.items() for y in [k] * v]
['hello', 'are']
It works well when there are duplicates, even in the resulting list. However, it may not preserve the order, though it can easily be modified to do so if you want.
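The order-preserving variant is not spelled out above, so here is a minimal sketch of one way it could look. It swaps the plain dictionary for collections.Counter and walks a in its original order; the function name multiset_difference is made up for this illustration and is not part of the original answer.

```python
from collections import Counter

def multiset_difference(a, b):
    """Return the items of a left after cancelling one copy per match in b, in a's order."""
    remaining = Counter(b)          # how many copies of each word b can still cancel
    result = []
    for word in a:
        if remaining[word] > 0:
            remaining[word] -= 1    # this occurrence is cancelled by one in b
        else:
            result.append(word)     # nothing left in b to cancel it, so keep it
    return result

a = ['hello', 'how', 'are', 'hello', 'you']
b = ['hello', 'how', 'you', 'today']
print(multiset_difference(a, b))    # ['are', 'hello']
```

Because the result is built as a new list, a is never modified while it is being iterated, which avoids the problem in the question.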

set(a+b) is alright, too, if what you want is the unique words across both lists (the union rather than the difference). You can use sets to get unique elements.

Related

Filtering for tuples from another list and extracting values

I am working with two lists of tuples and want to derive a result from them.
For example:
A = [('Hi','NNG'),('Good','VV'),...n]
B = [('Happy','VA',1.0),('Hi','NNG',0.5)...n]
First, I'd like to match the words between A and B,
for example compare 'Hi' with 'Happy' or 'Hi' with 'Hi'.
Second, if the words are the same, then match the word class,
that is, check whether 'NNG' equals 'NNG' or 'NNG' equals 'VV'.
Third, if both of these steps match, then extract the number.
For example, if A=[('Hi','NNG')] and B=('Hi','NNG',0.5),
extract 0.5.
Lastly, I want to multiply all the extracted numbers together.
There are more than 1,000 tuples in each of A and B, so a 'for' loop will be necessary for this process.
How can I do this in Python?
Try something like this:
A = [('Hi', 'NNG'), ('Good', 'VV')]
B = [('Happy', 'VA', 1.0), ('Hi', 'NNG', 0.5)]
print(', '.join(repr(j[2]) for i in A for j in B if i[0] == j[0] and i[1] == j[1]))
# 0.5
One way is to use a set and (optionally) a dictionary. The benefit of this method is you also keep the key data to know where your values originated.
A = [('Hi','NNG'),('Good','VV')]
B = [('Happy','VA',1.0),('Hi','NNG',0.5)]
A_set = set(A)
res = {(i[0], i[1]): i[2] for i in B if (i[0], i[1]) in A_set}
res = list(res.values())
# [0.5]
To multiply all results in the list, see How can I multiply all items in a list together with Python?
Explanation
Use a dictionary comprehension with for i in B. This builds a dictionary of results by iterating through each element of B.
For example, when iterating the first element, you will find i[0] = 'Happy', i[1] = 'VA', i[2] = 1.0.
Since we loop through the whole list, we construct a dictionary of results with tuple keys from the first 2 elements.
Additionally, we add the criterion (i[0], i[1]) in A_set to filter as per required logic.
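The final multiplication step is only linked to above, so here is a hedged sketch of the whole pipeline in one place. It turns the lookup around (a dictionary keyed on the first two fields of B, queried once per element of A) and multiplies with functools.reduce; the names scores, matched and product are illustrative, and it assumes each (word, word class) pair appears at most once in B.

```python
from functools import reduce
from operator import mul

A = [('Hi', 'NNG'), ('Good', 'VV')]
B = [('Happy', 'VA', 1.0), ('Hi', 'NNG', 0.5)]

# Map each (word, word class) pair in B to its number.
# Assumption: a pair occurs at most once in B; a later entry would overwrite an earlier one.
scores = {(word, tag): value for word, tag, value in B}

# Keep the numbers whose (word, word class) pair also appears in A.
matched = [scores[pair] for pair in A if pair in scores]

# Multiply the extracted numbers; the initial value 1 covers the no-match case.
product = reduce(mul, matched, 1)
print(matched, product)  # [0.5] 0.5
```

Each lookup in the dictionary is a hash lookup, so the work grows with len(A) + len(B) rather than len(A) * len(B).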
Python is so high level that it feels like English. So, the following working solution can be written very easily with minimum experience:
A = [('Hi','NNG'),('Good','VV')]
B = [('Happy','VA',1.0),('Hi','NNG',0.5)]
tot = 1
for ia in A:
    for ib in B:
        if ia == ib[:2]:
            tot *= ib[2]
            break  # remove this line if multiple successful checks are possible
print(tot)  # -> 0.5
zip() is your friend:
for tupA, tupB in zip(A, B):
    if tupA[:2] == tupB[:2]:
        print(tupB[2])
To use fancy pythonic list comprehension:
results = [tubB[2] for tubA,tubB in zip(A,B) if tubA[:2] == tubB[:2] ]
But... why do I have a sneaky feeling this isn't what you want to do?

removing duplicate tuples in python

I have a list of 50 numbers, [0,1,2,...49], and I would like to create a list of tuples without duplicates, where I define (a,b) to be a duplicate of (b,a). Similarly, I do not want tuples of the form (a,a).
I have this:
pairs = set([])
mylist = range(0,50)
for i in mylist:
    for j in mylist:
        pairs.update([(i,j)])
set((a,b) if a<=b else (b,a) for a,b in pairs)
print len(pairs)
>>> 2500
I get 2500, whereas I expect to get, I believe, 1225 (n(n-1)/2).
What is wrong?
You want all combinations. Python provides a module, itertools, with all sorts of combinatorial utilities like this. Where you can, I would stick with itertools: it is almost certainly faster and more memory efficient than anything you would cook up yourself. It is also battle-tested. You should not reinvent the wheel.
>>> import itertools
>>> combs = list(itertools.combinations(range(50),2))
>>> len(combs)
1225
>>>
However, as others have noted, when you have a sequence (i.e. something indexable) such as a list, and you want N choose k with k=2, the above can simply be implemented with a nested for-loop over the indices, taking care to generate your indices intelligently:
>>> numbers = list(range(50))
>>> result = []
>>> for i in range(len(numbers)):
...     for j in range(i + 1, len(numbers)):
...         result.append((numbers[i], numbers[j]))
...
>>> len(result)
1225
However, itertools.combinations takes any iterable, and also takes a second argument, r, which deals with cases where k can be something like 7 (and you don't want to write a staircase of nested loops).
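To illustrate both points, here is a small sketch that feeds a generator (which cannot be indexed) to itertools.combinations and asks for triples via r=3. The squares data is made up purely for this example.

```python
import itertools

# A generator is iterable but not indexable, so the nested-index loop would not work on it.
squares = (n * n for n in range(5))   # yields 0, 1, 4, 9, 16

# r=3 asks for all 3-element combinations; no triple-nested loop is needed.
triples = list(itertools.combinations(squares, 3))
print(len(triples))   # 10, i.e. 5 choose 3
print(triples[0])     # (0, 1, 4)
```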
Your approach essentially takes the cartesian product, and then filters. This is inefficient, but if you wanted to do that, the best way is to use frozensets:
>>> combinations = set()
>>> for i in numbers:
...     for j in numbers:
...         if i != j:
...             combinations.add(frozenset([i,j]))
...
>>> len(combinations)
1225
And one more pass to make things tuples:
>>> combinations = [tuple(fz) for fz in combinations]
Try this:
pairs = set([])
mylist = range(0,50)
for i in mylist:
    for j in mylist:
        if i < j:
            pairs.add((i,j))
print len(pairs)
The problem in your code snippet is that you filter out the unwanted values but you don't assign the result back to pairs, so the length stays the same. Also, this formula yields the wrong result because it considers (20,20) valid, for instance.
But you should just create the proper list at once:
pairs = set()
for i in range(0,50):
    for j in range(i+1,50):
        pairs.add((i,j))
print (len(pairs))
result:
1225
With that method you don't even need a set since it's guaranteed that you don't have duplicates in the first place:
pairs = []
for i in range(0,50):
    for j in range(i+1,50):
        pairs.append((i,j))
or using list comprehension:
pairs = [(i,j) for i in range(0,50) for j in range(i+1,50)]

How to find a duplicate in a list without using set in python?

I know that we can use the set in python to find if there is any duplicate in a list. I was just wondering, if we can find a duplicate in a list without using set.
Say, my list is
a=['1545','1254','1545']
then how to find a duplicate?
a=['1545','1254','1545']
from collections import Counter
print [item for item, count in Counter(a).items() if count != 1]
Output
['1545']
This solution runs in O(N). This will be a huge advantage if the list used has a lot of elements.
If you just want to find if the list has duplicates, you can simply do
a=['1545','1254','1545']
from collections import Counter
print any(count != 1 for count in Counter(a).values())
As @gnibbler suggested, this would be the fastest solution in practice:
from collections import defaultdict
def has_dup(a):
    result = defaultdict(int)
    for item in a:
        result[item] += 1
        if result[item] > 1:
            return True
    else:
        return False
a=['1545','1254','1545']
print has_dup(a)
>>> lis = []
>>> a=['1545','1254','1545']
>>> for i in a:
...     if i not in lis:
...         lis.append(i)
...
>>> lis
['1545', '1254']
>>> set(a)
set(['1254', '1545'])
use list.count:
In [309]: a=['1545','1254','1545']
...: a.count('1545')>1
Out[309]: True
Using list.count:
>>> a = ['1545','1254','1545']
>>> any(a.count(x) > 1 for x in a) # To check whether there's any duplicate
True
>>> # To retrieve any single element that is duplicated
>>> next((x for x in a if a.count(x) > 1), None)
'1545'
>>> # To get all duplicated elements (uses a set comprehension)
>>> {x for x in a if a.count(x) > 1}
set(['1545'])
Sort the list and check whether each value is equal to the previous one:
a.sort()
last_x = None
for x in a:
    if x == last_x:
        print "duplicate: %s" % x
        break  # existence of duplicates is enough
    last_x = x
This should be O(n log n), which is slower for big n than the Counter solution (but Counter uses a dict under the hood, which is not too dissimilar from a set really).
An alternative is to insert the elements one at a time and keep the list sorted; see the bisect module. It makes your inserts slower but your check for duplicates fast.
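The bisect idea is only mentioned in passing, so here is a minimal sketch of how it might look. The function name has_duplicate is invented for this example; bisect.bisect_left is the standard-library binary search.

```python
import bisect

def has_duplicate(items):
    """Return True if any value occurs more than once, without using a set."""
    seen = []  # kept in sorted order at all times
    for item in items:
        pos = bisect.bisect_left(seen, item)  # binary search for the insertion point
        if pos < len(seen) and seen[pos] == item:
            return True  # an equal value is already in the sorted list
        seen.insert(pos, item)  # list insertion shifts elements, so this step is O(n)
    return False

print(has_duplicate(['1545', '1254', '1545']))  # True
print(has_duplicate(['1545', '1254']))          # False
```

The membership check is a binary search, but every insert still shifts list elements, which is the "slower inserts" trade-off described above.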
If this is homework, your teacher is probably asking for the hideously inefficient .count() style answer.
In practice using a dict is your next best bet if set is disallowed.
>>> a = ['1545','1254','1545']
>>> D = {}
>>> for i in a:
...     if i in D:
...         print "duplicate", i
...         break
...     D[i] = i
... else:
...     print "no duplicate"
...
duplicate 1545
Here is a version using groupby, which is still much better than the .count() method:
>>> from itertools import groupby
>>> a = ['1545','1254','1545']
>>> next(k for k, g in groupby(sorted(a)) if sum(1 for i in g) > 1)
'1545'
Thanks, all, for working on this problem. I learned a lot from the different answers. This is how I answered it:
a=['1545','1254','1545']
d=[]
duplicates=False
for i in a:
    if i not in d:
        d.append(i)
if len(d)<len(a):
    duplicates=True
else:
    duplicates=False
print(duplicates)

Optimize search to find next matching value in a list

I have a program that goes through a list and, for each object, finds the next instance that has a matching value. When it finds one, it prints out the location of both objects. The program runs perfectly fine, but when I run it with a large volume of data (~6,000,000 objects in the list) it takes much too long. If anyone could provide insight into how I can make the process more efficient, I would greatly appreciate it.
def search(list):
    original = list
    matchedvalues = []
    count = 0
    for x in original:
        targetValue = x.getValue()
        count = count + 1
        copy = original[count:]
        for y in copy:
            if targetValue == y.getValue():
                print(str(x.getLocation()) + "," + str(y.getLocation()))
                break
Perhaps you can make a dictionary that contains a list of indexes that correspond to each item, something like this:
values = [1,2,3,1,2,3,4]
from collections import defaultdict
def get_matches(x):
    my_dict = defaultdict(list)
    for ind, ele in enumerate(x):
        my_dict[ele].append(ind)
    return my_dict
Result:
>>> get_matches(values)
defaultdict(<type 'list'>, {1: [0, 3], 2: [1, 4], 3: [2, 5], 4: [6]})
Edit:
I added this part, in case it helps:
values = [1,1,1,1,2,2,3,4,5,3]
def get_next_item_ind(x, ind):
    my_dict = get_matches(x)
    indexes = my_dict[x[ind]]
    temp_ind = indexes.index(ind)
    if len(indexes) > temp_ind + 1:
        return indexes[temp_ind + 1]
    return None
Result:
>>> get_next_item_ind(values, 0)
1
>>> get_next_item_ind(values, 1)
2
>>> get_next_item_ind(values, 2)
3
>>> get_next_item_ind(values, 3)
>>> get_next_item_ind(values, 4)
5
>>> get_next_item_ind(values, 5)
>>> get_next_item_ind(values, 6)
9
>>> get_next_item_ind(values, 7)
>>> get_next_item_ind(values, 8)
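Note that get_next_item_ind rebuilds the dictionary on every call. If the next match is needed for every position, one possible variation is to walk the list once from the end and remember where each value was last seen. This is only a sketch: the name next_match_indices is hypothetical, and it works on plain hashable values rather than on the objects with getValue() from the question.

```python
def next_match_indices(values):
    """For each index i, return the index of the next element equal to values[i], or None."""
    last_seen = {}                 # value -> nearest index to the right where it occurs
    result = [None] * len(values)
    for i in range(len(values) - 1, -1, -1):   # walk from the end towards the start
        v = values[i]
        result[i] = last_seen.get(v)           # None if v does not occur again
        last_seen[v] = i
    return result

values = [1, 1, 1, 1, 2, 2, 3, 4, 5, 3]
print(next_match_indices(values))
# [1, 2, 3, None, 5, None, 9, None, None, None]
```

This makes a single pass with one dictionary lookup per element, so the running time grows linearly with the length of the list.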
There are a few ways you could make this search more efficient by minimising additional memory use (particularly when your data is BIG):
You can operate directly on the list you are passing in, so you won't need original = list or copy = original[count:].
You can test against slices of the original list and use enumerate(p) to iterate. You won't need the extra variable count, and enumerate(p) is efficient in Python.
Re-implemented, this would become:
def search(p):
    # iterate over p
    for i, value in enumerate(p):
        # if value occurs more than once, print locations
        # do not re-test values that have already been tested (if value not in p[:i])
        if value not in p[:i] and value in p[(i + 1):]:
            # the last number is the offset of the next match within p[(i + 1):]
            print(value, ':', i, p[(i + 1):].index(value))
v = [1,2,3,1,2,3,4]
search(v)
1 : 0 2
2 : 1 2
3 : 2 2
Implementing it this way will only print out the values / locations where a value is repeated (which I think is what you intended in your original implementation).
Other considerations:
More than 2 occurrences of a value: if the value repeats many times in the list, then you might want to implement a function to walk recursively through the list. As it is, the question doesn't address this, and it may be that it doesn't need to in your situation.
Using a dictionary: I completely agree with Akavall above, dictionaries are a great way of looking up values in Python, especially if you need to look up values again later in the program. This will work best if you construct a dictionary instead of a list when you originally create the data. But if you are only doing this once, it is going to cost you more time to construct the dictionary and query it than simply iterating over the list as described above.
Hope this helps!

How to find common elements in list of lists?

I'm trying to figure out how to compare an n number of lists to find the common elements.
For example:
p=[ [1,2,3],
[1,9,9],
..
..
[1,2,4]
>> print common(p)
>> [1]
Now if I know the number of elements I can do comparisons like:
for a in b:
    for c in d:
        for x in y:
            ...
but that won't work if I don't know how many elements p has. I've looked at this solution that compares two lists
https://stackoverflow.com/a/1388864/1320800
but after spending 4 hrs trying to figure a way to make that recursive, a solution still eludes me so any help would be highly appreciated!
You are looking for the set intersection of all the sublists, and the data type you should use for set operations is a set:
result = set(p[0])
for s in p[1:]:
    result.intersection_update(s)
print result
A simple solution (one-line) is:
set.intersection(*[set(list) for list in p])
The set.intersection() method supports intersecting multiple inputs at a time. Use argument unpacking to pull the sublists out of the outer list and pass them into set.intersection() as separate arguments:
>>> p=[ [1,2,3],
[1,9,9],
[1,2,4]]
>>> set(p[0]).intersection(*p)
set([1])
Why not just:
set.intersection(*map(set, p))
Result:
set([1])
Or like this:
ip = iter(p)
s = set(next(ip))
s.intersection(*ip)
Result:
set([1])
edit:
copied from console:
>>> p = [[1,2,3], [1,9,9], [1,2,4]]
>>> set.intersection(*map(set, p))
set([1])
>>> ip = iter(p)
>>> s = set(next(ip))
>>> s.intersection(*ip)
set([1])
p=[ [1,2,3],
[1,9,9],
[1,2,4]]
ans = [ele[0] for ele in zip(*p) if len(set(ele)) == 1]
Result:
>>> ans
[1]
from functools import reduce  # built in on Python 2, must be imported on Python 3
reduce(lambda x, y: x & y, (set(i) for i in p))
Regarding the set intersection answer above (result = set(p[0]), then result.intersection_update(s) for each remaining sublist in p[1:]): be careful if you turn result into a list with list(result). A set is unordered, so the elements may come out in any order; in my case this showed up once there were more than 10 lists.
Make sure you call sort() on that list (or use sorted(result)) if you depend on it being ordered.
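Putting the pieces together, the following sketch wraps the set intersection in a function and returns a sorted list so the output order is predictable. The name common comes from the question; returning a sorted list, and an empty list for empty input, are assumptions about what is wanted.

```python
def common(lists):
    """Return the elements present in every sublist, as a sorted list."""
    if not lists:
        return []  # assumption: no sublists means nothing in common
    # intersection() accepts any number of iterables, so the count of sublists need not be known
    shared = set(lists[0]).intersection(*lists[1:])
    return sorted(shared)

p = [[1, 2, 3],
     [1, 9, 9],
     [1, 2, 4]]
print(common(p))  # [1]
```

sorted() requires the elements to be mutually comparable, which holds for the integers used here.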
