Counting the number of co-occurrences in a list - python

I have an array consisting of a set of lists of strings (assume each string is a single word).
I want an efficient way, in Python, to count pairs of words in this array.
It is not collocation or bigrams, as each word in the pair may be in any position in the list.

It's unclear what your list looks like. Is it something like:
li = ['hello','bye','hi','good','bye','hello']
If so the solution is simple:
In [1342]: [i for i in set(li) if li.count(i) > 1]
Out[1342]: ['bye', 'hello']
Otherwise if it is like:
li = [['hello'],['bye','hi','good'],['bye','hello']]
Then:
In [1378]: f = []
In [1379]: for x in li:
..........     for i in x:
..........         f.append(i)
In [1380]: f
Out[1380]: ['hello', 'bye', 'hi', 'good', 'bye', 'hello']
In [1381]: [i for i in set(f) if f.count(i) > 1]
Out[1381]: ['bye', 'hello']
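Calling count() inside the comprehension rescans the whole list once per distinct word. A minimal sketch of the same duplicate check with collections.Counter, which tallies everything in one pass; the helper name repeated_words is made up for illustration:

```python
from collections import Counter
from itertools import chain


def repeated_words(nested):
    """Return the words that appear more than once across all sublists."""
    # Flatten the sublists lazily and count every word in a single pass
    counts = Counter(chain.from_iterable(nested))
    # sorted() only makes the order of the result deterministic
    return sorted(word for word, n in counts.items() if n > 1)


li = [['hello'], ['bye', 'hi', 'good'], ['bye', 'hello']]
print(repeated_words(li))
```

For a flat list, passing [li] (a list containing the one list) should work the same way.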

>>> from itertools import chain, combinations
>>> from collections import Counter
>>> L = [['foo', 'bar'], ['apple', 'orange', 'mango'], ['bar']]
>>> c = Counter(frozenset(x) for x in combinations(chain.from_iterable(L), r=2))
>>> c
Counter({frozenset(['mango', 'bar']): 2, frozenset(['orange', 'bar']): 2, frozenset(['foo', 'bar']): 2, frozenset(['bar', 'apple']): 2, frozenset(['orange', 'apple']): 1, frozenset(['foo', 'apple']): 1, frozenset(['bar']): 1, frozenset(['orange', 'mango']): 1, frozenset(['foo', 'mango']): 1, frozenset(['mango', 'apple']): 1, frozenset(['orange', 'foo']): 1})
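Note that the snippet above pairs every word with every other word in the flattened data, so it also counts pairs that never share a sublist. If the goal is co-occurrence within the same sublist, a minimal sketch could look like the following; the function name count_pairs and the sample data are assumptions for illustration only:

```python
from collections import Counter
from itertools import combinations


def count_pairs(lists):
    """Count unordered word pairs that occur together in the same sublist."""
    counts = Counter()
    for words in lists:
        # set() drops repeats inside one sublist; sorted() gives each pair a
        # canonical order, so ('bar', 'foo') and ('foo', 'bar') share one key
        counts.update(combinations(sorted(set(words)), 2))
    return counts


L = [['foo', 'bar'], ['apple', 'orange', 'mango'], ['bar', 'foo']]
pairs = count_pairs(L)
print(pairs[('bar', 'foo')])
```

Sorting each sublist keeps the keys as plain tuples; frozenset keys, as used above, would work too.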

Related

how can I manipulate key with for loops to update dictionary

I am trying to put a list into a dictionary and count the number of occurrences of each word in the list. The only problem I don't understand is when I use the update function, it takes x as a dictionary key, when I want x to be the x value of list_ . I am new to python so any advice is appreciated. Thanks
list_ = ["hello", "there", "friend", "hello"]
d = {}
for x in list_:
    d.update(x = list_.count(x))
Use a Counter object if you want a simple way of converting a list of items to a dictionary that maps list_entry: number_of_occurrences.
>>> from collections import Counter
>>> words = ['hello', 'there', 'friend', 'hello']
>>> c = Counter(words)
>>> print(c)
Counter({'hello': 2, 'there': 1, 'friend': 1})
>>> print(dict(c))
{'there': 1, 'hello': 2, 'friend': 1}
An option would be using dictionary comprehension with list.count() like this:
list_ = ["hello", "there", "friend", "hello"]
d = {item: list_.count(item) for item in list_}
Output:
>>> d
{'hello': 2, 'there': 1, 'friend': 1}
But the best option should be collections.Counter(), as used in @AK47's solution.
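As for why the loop in the question misbehaves: in d.update(x=...), x is a keyword argument name, so the key is always the literal string 'x' rather than the current word. A small sketch comparing it with subscript assignment; the names broken and fixed are only for illustration:

```python
list_ = ["hello", "there", "friend", "hello"]

# Keyword form: the key is the literal string 'x', overwritten on every pass
broken = {}
for x in list_:
    broken.update(x=list_.count(x))

# Subscript assignment uses the value of x as the key
fixed = {}
for x in list_:
    fixed[x] = list_.count(x)

print(broken)
print(fixed)
```

Passing a dictionary instead, as in d.update({x: list_.count(x)}), would also use the value of x as the key.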

Efficient way of comparing multiple lists in python

I have 5 long lists with word pairs as given in the example below. Note that this could include word pair lists like [['Salad', 'Fat']] AND word pair list of lists like [['Bread', 'Oil'], ['Bread', ' Salt']]
list_1 = [ [['Salad', 'Fat']], [['Bread', 'Oil'], ['Bread', 'Salt']], [['Salt', 'Sugar']] ]
list_2 = [ [['Salad', 'Fat'], ['Salt', 'Sugar']], [['Protein', 'Soup']] ]
list_3 = [ [['Salad', ' Protein']], [['Bread', ' Oil']], [['Sugar', 'Salt']] ]
list_4 = [ [['Salad', ' Fat'], ['Salad', 'Chicken']] ]
list_5 = [ ['Sugar', 'Protein'], ['Sugar', 'Bread'] ]
Now I want to calculate the frequency of word pairs.
For example, for the 5 lists above, I should get the following output, showing each word pair and its frequency.
output_list = [{['Salad', 'Fat']: 3}, {['Bread', 'Oil']: 2}, {['Salt', 'Sugar']: 2},
{['Sugar', 'Salt']: 1}, and so on]
What is the most efficient way of doing it in python?
Because your lists are nested unevenly, the code gets ugly, so I would look to fix the input lists first.
collections.Counter() is built for this kind of thing, but lists are not hashable, so you need to turn them into tuples (as well as strip off the spurious spaces):
In []:
import itertools as it
from collections import Counter
list_1 = [ [['Salad', 'Fat']], [['Bread', 'Oil'], ['Bread', 'Salt']], [['Salt', 'Sugar'] ]]
list_2 = [ [['Salad', 'Fat'], ['Salt', 'Sugar']], [['Protein', 'Soup']] ]
list_3 = [ [['Salad', ' Protein']], [['Bread', ' Oil']], [['Sugar', 'Salt'] ]]
list_4 = [ [['Salad', ' Fat'], ['Salad', 'Chicken']] ]
list_5 = [ ['Sugar', 'Protein'], ['Sugar', 'Bread']]
t = lambda x: tuple(map(str.strip, x))
c = Counter(map(t, it.chain.from_iterable(it.chain(list_1, list_2, list_3, list_4))))
c += Counter(map(t, list_5))
c
Out[]:
Counter({('Bread', 'Oil'): 2,
         ('Bread', 'Salt'): 1,
         ('Protein', 'Soup'): 1,
         ('Salad', 'Chicken'): 1,
         ('Salad', 'Fat'): 3,
         ('Salad', 'Protein'): 1,
         ('Salt', 'Sugar'): 2,
         ('Sugar', 'Bread'): 1,
         ('Sugar', 'Protein'): 1,
         ('Sugar', 'Salt'): 1})
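If fixing the input lists is not an option, the uneven nesting can also be handled by walking the structure recursively. This is only a sketch: the helper name iter_pairs is hypothetical, and it assumes every innermost list holds nothing but strings.

```python
from collections import Counter


def iter_pairs(node):
    """Yield each innermost list of strings as a tuple of stripped words."""
    if node and all(isinstance(item, str) for item in node):
        # Reached a word pair: strip stray spaces and make it hashable
        yield tuple(item.strip() for item in node)
    else:
        # Still a container of lists: descend one level
        for child in node:
            yield from iter_pairs(child)


list_1 = [[['Salad', 'Fat']], [['Bread', 'Oil'], ['Bread', 'Salt']], [['Salt', 'Sugar']]]
list_4 = [[['Salad', ' Fat'], ['Salad', 'Chicken']]]
list_5 = [['Sugar', 'Protein'], ['Sugar', 'Bread']]

c = Counter(iter_pairs([list_1, list_4, list_5]))
print(c[('Salad', 'Fat')])
```

With this approach list_5 needs no special treatment, because the depth of nesting no longer matters.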
You could flatten all the lists. Then use Counter to count the word frequencies.
>>> import itertools
>>> from collections import Counter
>>> l = [[1,2,3],[3,4,1,5]]
>>> counts = Counter(list(itertools.chain(*l)))
>>> counts
Counter({1: 2, 3: 2, 2: 1, 4: 1, 5: 1})
NOTE: this flattening technique will work only with lists of lists. For other flattening techniques see the link provided above.
EDIT:
Thanks to AChampion: counts = Counter(list(itertools.chain(*l))) can also be written as counts = Counter(list(itertools.chain.from_iterable(l)))

Using list comprehension to add one to a value in a dictionary

I'm trying to get rid of this for loop and instead use list comprehension to give the same result.
freqdist = nltk.FreqDist()
html = requests.get("http://www.nrc.nl/nieuws/2015/04/19/louise-gunning-vertrekt-als-voorzitter-bestuur-uva/")
raw = BeautifulSoup(html.text).text
for word in nltk.word_tokenize(raw):
    freqdist[word.lower()] += 1
I'm not sure if it's possible, but I can't get it to work because of the +=1. I've tried:
[freqdist[word.lower()] +=1 for word in nltk.word_tokenize(raw)]
But that will only raise an error. Could anyone point me in the right direction?
If you want to mutate an existing list/dictionary, using a list/dictionary comprehension is considered bad style because it creates an unnecessary throwaway-list/dictionary.
To be precise, I'm talking about the following:
>>> demo = ['a', 'b', 'c']
>>> freqdist = {'a': 0, 'b': 1, 'c': 2}
>>> [freqdist.__setitem__(key, freqdist[key] + 1) for key in demo]
[None, None, None]
>>> freqdist
{'a': 1, 'c': 3, 'b': 2}
As you can see, doing what you describe is possible, but that's not how you should do it, because it is hard to read, it creates an unused throwaway list [None, None, None], and list comprehensions should be used to build a new list that you actually need.
Creating a new dictionary with a dictionary comprehension is cumbersome as well, because not every value should be incremented (only the ones in demo).
You could do
>>> demo = ['a', 'b', 'c']
>>> freqdist = {'a': 0, 'b': 1, 'c': 2}
>>> freqdist = {k:v + (k in demo) for k,v in freqdist.items()}
>>> freqdist
{'a': 1, 'c': 3, 'b': 2}
However, we now have suboptimal runtime complexity, because for each key in freqdist we do an O(len(demo)) membership test against demo.
You could use a set for demo to reduce the complexity of the dictionary building to O(len(freqdist)), but only if the elements of demo are unique.
>>> demo = set(['a', 'b', 'c'])
>>> freqdist = {'a': 0, 'b': 1, 'c': 2}
>>> freqdist = {k:v + (k in demo) for k,v in freqdist.items()}
>>> freqdist
{'a': 1, 'c': 3, 'b': 2}
I don't think this solution is particularly elegant, either.
In conclusion, your for loop is perfectly fine. The only good alternative would be to use a Counter object that you update:
>>> from collections import Counter
>>> demo = ['a', 'b', 'c']
>>> freqdist = Counter({'a': 0, 'b': 1, 'c': 2})
>>> freqdist.update(demo)
>>> freqdist
Counter({'c': 3, 'b': 2, 'a': 1})
This is the solution I would use personally.
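Applied to the loop in the question, the Counter approach might look like the sketch below. The sample text is made up, and str.split() stands in for nltk.word_tokenize as a simplifying assumption so that the example runs without NLTK installed:

```python
from collections import Counter

# Stand-in text; in the question this would be the scraped page text
raw = "The cat saw the dog and the dog saw the cat"

# str.split() is a simplification of nltk.word_tokenize (assumption)
tokens = raw.split()

# Build the counts in one expression instead of a loop with += 1
freqdist = Counter(word.lower() for word in tokens)

# An existing Counter can be extended the same way
freqdist.update(word.lower() for word in ["Cat", "bird"])

print(freqdist.most_common(2))
```

The generator expression keeps the lowercasing step from the original loop without building a throwaway list.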
This works:
>>> txt = 'Hello goodbye hello GDby Dog cat dog'
>>> txt_new = txt.lower().split()
>>> print(txt_new)
['hello', 'goodbye', 'hello', 'gdby', 'dog', 'cat', 'dog']
Now use collections:
>>> import collections
>>> collections.Counter(txt_new)
Counter({'hello': 2, 'dog': 2, 'gdby': 1, 'cat': 1, 'goodbye': 1})
If you are not allowed to use collections.Counter then:
>>> {word: txt_new.count(word) for word in set(txt_new)}
{'goodbye': 1, 'dog': 2, 'hello': 2, 'gdby': 1, 'cat': 1}

How to add String in-between list variables?

I am trying to add "!" after every variable in a list.
But my code only adds the series of "!" after the initial list.
For example:
lst = [1,2,3,4]
def addmark(lst):
    emptylst = []
    for n in range(0, len(lst)):
        lst.append("!")
    return lst
This would return [1,2,3,4,"!", "!", "!", "!"]
I want to return [1, "!", 2, "!", 3, "!", 4, "!"]
def addmark(lst):
    emptylst = []
    for i in lst:
        emptylst.append(i)
        emptylst.append("!")
    return emptylst
An alternative to the accepted answer using itertools:
from itertools import chain, repeat
lst = [1, 2, 3]
marker = repeat("!")
list(chain.from_iterable(zip(lst, marker)))
>>> [1, '!', 2, '!', 3, '!']
Using insert:
list.insert(i, x)
Insert an item at a given position. The first
argument is the index of the element before which to insert, so
a.insert(0, x) inserts at the front of the list, and a.insert(len(a),
x) is equivalent to a.append(x).
Reference: docs.python.org/2/tutorial/datastructures
Code:
def addmark(lst):
    add = 0  # needed because every insertion of '!' shifts the position where the next '!' goes
    for i in range(1, len(lst) + 1):  # (start: adding after lst[0], finish: adding after the last element)
        lst.insert(i + add, '!')
        add += 1
    return lst
Here is the code:
#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''
def addmark(lst):
    result = []
    for item in lst:
        result.append(item)
        result.append("!")
    return result
if __name__ == '__main__':
    lst = [1,2,3,4]
    print(addmark(lst))
Create a list of lists, then flatten it:
lst = [1,2,3,4]
lst2 = [[i,'!'] for i in lst]
lst3 = [item for sublist in lst2 for item in sublist]
print(lst2)
print(lst3)
>>> [[1, '!'], [2, '!'], [3, '!'], [4, '!']]
>>> [1, '!', 2, '!', 3, '!', 4, '!']
As a one-liner:
lst = [1,2,3,4]
lst2 = [item for sublist in [[i,'!'] for i in lst] for item in sublist]
print(lst2)
>>> [1, '!', 2, '!', 3, '!', 4, '!']
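One more variation, offered only as a sketch: build a list of markers twice as long as the input, then drop the original items into the even positions with extended slice assignment. The optional mark parameter is an addition for illustration and is not part of the question.

```python
def addmark(lst, mark="!"):
    """Return a new list with mark placed after every element of lst."""
    # Start with markers everywhere, two slots per original element
    result = [mark] * (2 * len(lst))
    # Even positions take the original items; odd positions keep the marker
    result[0::2] = lst
    return result


lst = [1, 2, 3, 4]
print(addmark(lst))
```

Unlike the insert-based version, this leaves the input list unchanged.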

How to count elements in a list of lists of strings

If I have a list like this:
[['welcome','a1'],['welcome','a1'],['hello','a2'],['hello','a3']]
and I want to return something like this:
[['welcome','a1', 2],['hello','a2', 1],['hello','a3', 1]]
If the same pair of strings in a sublist is encountered, increment the count
What I have so far:
counter = 0
for i in mylist:
    counter += 1
    if i[0] == i[0]:
        if i[1] == i[1]:
            counter -= 1
    output.append([mylist, counter])
I'm new at this and I appreciate your help!
Use a set here to get only unique items:
>>> lis = [['welcome','a1'],['welcome','a1'],['hello','a2'],['hello','a3']]
>>> [list(x) + [1] for x in set(map(tuple, lis))]
[['welcome', 'a1', 1], ['hello', 'a3', 1], ['hello', 'a2', 1]]
Explanation:
A set keeps only the unique items from an iterable or iterator, but because sets can only contain hashable (immutable) items, you should convert each sublist to a tuple first. Here is a verbose version of the above code; the only difference is that it will also preserve the original or
>>> lis = [['welcome','a1'],['welcome','a1'],['hello','a2'],['hello','a3']]
>>> s = set()
>>> for item in lis:
...     tup = tuple(item)  # convert to tuple
...     s.add(tup)
>>> s
set([('welcome', 'a1'), ('hello', 'a3'), ('hello', 'a2')])
Now use a list comprehension to get the expected output:
>>> [list(item) + [1] for item in s]
[['welcome', 'a1', 1], ['hello', 'a3', 1], ['hello', 'a2', 1]]
If the order of items matters (sets don't preserve order), then use this:
>>> seen = set()
>>> ans = []
>>> for item in lis:
...     tup = tuple(item)
...     if tup not in seen:
...         ans.append(item + [1])
...         seen.add(tup)
...
>>> ans
[['welcome', 'a1', 1], ['hello', 'a2', 1], ['hello', 'a3', 1]]
I am not sure what's the point of using 1 here.
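If the third element is meant to be the number of times each pair occurs, as the expected output in the question suggests, a Counter over tuples gives that directly. A minimal sketch, assuming Python 3.7+ so that the keys keep first-seen order:

```python
from collections import Counter

lis = [['welcome', 'a1'], ['welcome', 'a1'], ['hello', 'a2'], ['hello', 'a3']]

# Lists are not hashable, so count the sublists as tuples
counts = Counter(map(tuple, lis))

# Turn each (pair, count) entry back into a list with the count appended
result = [list(pair) + [n] for pair, n in counts.items()]
print(result)
```

This yields one entry per distinct pair, with the repeated ['welcome', 'a1'] pair counted twice.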
