Finding least common elements in a list - python

I want to generate an ordered list of the least common words within a large body of text, with the least common word appearing first along with a value indicating how many times it appears in the text.
I scraped the text from some online journal articles, then simply assigned and split:
article_one = """ large body of text """.split()
=> ['large', 'body', 'of', 'text']
Seems like a regex would be appropriate for the next steps, but being new to programming I'm not well versed in them.
If the best answer includes a regex, could someone point me to a good regex tutorial other than pydoc?

How about a shorter/simpler version with a defaultdict? Counter is nice but needs Python 2.7; this works from 2.5 and up :)
import collections
counter = collections.defaultdict(int)
article_one = """ large body of text """
for word in article_one.split():
    counter[word] += 1
print sorted(counter.iteritems(), key=lambda x: x[::-1])
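For anyone on Python 3, a minimal sketch of the same idea (iteritems becomes items and print becomes a function):
import collections
counter = collections.defaultdict(int)
article_one = """ large body of text """
for word in article_one.split():
    counter[word] += 1
# Sorting on (count, word) puts the least common words first
print(sorted(counter.items(), key=lambda x: x[::-1]))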

Finding the least common elements in a list: according to the Counter class in the collections module,
c.most_common()[:-n-1:-1] # n least common elements
So the code for the single least common element in a list is
from collections import Counter
Counter( mylist ).most_common()[:-2:-1]
and the two least common elements are given by
from collections import Counter
Counter( mylist ).most_common()[:-3:-1]
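As a quick illustration with a hypothetical sample list:
from collections import Counter
mylist = ['a', 'a', 'a', 'b', 'b', 'c']
print(Counter(mylist).most_common())          # [('a', 3), ('b', 2), ('c', 1)]
print(Counter(mylist).most_common()[:-3:-1])  # [('c', 1), ('b', 2)] -- the two least common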

This uses a slightly different approach but it appears to suit your needs. Uses code from this answer.
#!/usr/bin/env python
import operator
import string
article_one = """A, a b, a b c, a b c d, a b c d efg.""".split()
wordbank = {}
for word in article_one:
    # Strip word of punctuation and capitalization
    word = word.lower().strip(string.punctuation)
    if word not in wordbank:
        # Create a new dict key if necessary
        wordbank[word] = 1
    else:
        # Otherwise, increment the existing key's value
        wordbank[word] += 1
# Sort dict by value
sortedwords = sorted(wordbank.iteritems(), key=operator.itemgetter(1))
for word in sortedwords:
    print word[1], word[0]
Outputs:
1 efg
2 d
3 c
4 b
5 a
Works in Python >= 2.4, and Python 3+ if you parenthesize the print statement at the bottom and change iteritems to items.
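For example, a minimal Python 3 port of the same script (using dict.get instead of the if/else, which behaves identically) could look like this:
#!/usr/bin/env python3
import operator
import string
article_one = """A, a b, a b c, a b c d, a b c d efg.""".split()
wordbank = {}
for word in article_one:
    # Strip punctuation and capitalization, then tally
    word = word.lower().strip(string.punctuation)
    wordbank[word] = wordbank.get(word, 0) + 1
# Sort dict by value; items() replaces iteritems() in Python 3
sortedwords = sorted(wordbank.items(), key=operator.itemgetter(1))
for word in sortedwords:
    print(word[1], word[0])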

Ready-made answer from the mothership.
# From the official documentation ->>
>>> # Tally occurrences of words in a list
>>> from collections import Counter
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
## ^^^^--- from the standard documentation.
>>> # Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
>>> import heapq
>>> from operator import itemgetter
>>> def least_common(adict, n=None):
.....:     if n is None:
.....:         return sorted(adict.iteritems(), key=itemgetter(1), reverse=False)
.....:     return heapq.nsmallest(n, adict.iteritems(), key=itemgetter(1))
Obviously adapt to suit :D
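Applied to the cnt Counter built above (and with the imports shown), a call would look something like:
>>> least_common(cnt, 2)
[('green', 1), ('red', 2)]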

If you need a fixed number of least-common words, e.g., the 10 least common, you probably want a solution using a counter dict and a heapq, as suggested by sotapme's answer (with WoLpH's suggestion) or WoLpH's answer:
import collections, heapq, operator
wordcounter = collections.Counter(article_one)
leastcommon = heapq.nsmallest(10, wordcounter.items(), key=operator.itemgetter(1))
However, if you need an unbounded number of them, e.g., all words with fewer than 5 appearances, which could be 6 in one run and 69105 in the next, you might be better off just sorting the list:
import collections, itertools, operator
wordcounter = collections.Counter(article_one)
allwords = sorted(wordcounter.items(), key=operator.itemgetter(1))
leastcommon = itertools.takewhile(lambda x: x[1] < 5, allwords)
Sorting takes longer than heapifying, but extracting the first M elements is a lot faster with a list than a heap. Algorithmically, the difference is just some log N factors, so the constants are going to be important here. So the best thing to do is test.
Taking my code at pastebin, and a file made by just doing cat reut2* >reut2.sgm on the Reuters-21578 corpus (without processing it to extract the text, so this is obviously not very good for serious work, but should be fine for benchmarking, because none of the SGML tags are going to be in the least common…):
$ python leastwords.py reut2.sgm # Apple 2.7.2 64-bit
heap: 32.5963380337
sort: 22.9287009239
$ python3 leastwords.py reut2.sgm # python.org 3.3.0 64-bit
heap: 32.47026552911848
sort: 25.855643508024514
$ pypy leastwords.py reut2.sgm # 1.9.0/2.7.2 64-bit
heap: 23.95291996
sort: 16.1843900681
I tried various ways to speed up each of them (including: takewhile around a genexp instead of a loop around yield in the heap version, popping optimistic batches with nsmallest and throwing away any excess, making a list and sorting in place, decorate-sort-undecorate instead of a key, partial instead of lambda, etc.), but none of them made more than 5% improvement (and some made things significantly slower).
At any rate, these are closer than I expected, so I'd probably go with whichever one is simpler and more readable. But I think sort beats heap there, as well, so…
Once again: If you just need the N least common, for reasonable N, I'm willing to bet without even testing that the heap implementation will win.
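For anyone who wants to run a comparison like this themselves, here is a minimal timing sketch along the same lines (this is not the pastebin code; the input file comes from the command line and the threshold of 5 is just an example):
import collections
import heapq
import itertools
import operator
import sys
import time

def least_common_sort(counts, threshold):
    # Sort all (word, count) pairs by count, then take while count < threshold
    allwords = sorted(counts.items(), key=operator.itemgetter(1))
    return list(itertools.takewhile(lambda kv: kv[1] < threshold, allwords))

def least_common_heap(counts, threshold):
    # Heapify (count, word) pairs and pop until the threshold is reached
    heap = [(count, word) for word, count in counts.items()]
    heapq.heapify(heap)
    result = []
    while heap and heap[0][0] < threshold:
        count, word = heapq.heappop(heap)
        result.append((word, count))
    return result

if __name__ == '__main__':
    words = open(sys.argv[1]).read().split()
    counts = collections.Counter(words)
    for name, func in [('heap', least_common_heap), ('sort', least_common_sort)]:
        start = time.time()
        func(counts, 5)
        print(name + ':', time.time() - start)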

Related

Count occurrences of list of strings in text

I want to count occurrences of list elements in text with Python. I know that I can use .count(), but I have read that this can affect performance. Also, an element in the list can have more than one word.
my_list = ["largest", "biggest", "greatest", "the best"]
my_text = "i have the biggest house and the biggest car. My friend is the best. Best way win this is to make the largest house and largest treehouse and then you will be the greatest"
I can do this:
num = 0
for i in my_list:
    num += my_text.lower().count(i.lower())
print(num)
This way works, but what if my list has 500 elements and my string is 3000 words? In that case, performance is very poor.
Is there a way to do this but with good / fast performance?
Since my_list contains strings with more than one word, you'll have to find the n-grams of my_text to find matches, as splitting on spaces won't do. Also note that your approach is not advisable: for every single string in my_list, you'll be traversing the whole string my_text by using count. A better way would be to predefine the n-grams that you'll be looking for beforehand.
Here's one approach using nltk's ngram.
I've added another string in my_list to better illustrate the process:
from nltk import ngrams
from collections import Counter, defaultdict
my_list = ["largest", "biggest", "greatest", "the best", 'My friend is the best']
my_text = "i have the biggest house and the biggest car. My friend is the best. Best way win this is to make the largest house and largest treehouse and then you will be the greatest"
The first step is to define a dictionary containing the different lengths of the n-grams that we'll be looking up:
d = defaultdict(list)
for i in my_list:
    k = i.split()
    d[len(k)].append(tuple(k))
print(d)
defaultdict(list,
{1: [('largest',), ('biggest',), ('greatest',)],
2: [('the', 'best')],
5: [('My', 'friend', 'is', 'the', 'best')]})
Then split my_text into a list, and for each key in d find the corresponding n-grams and build a Counter from the result. Then for each value in that specific key in d, update with the counts from the Counter:
my_text_split = my_text.replace('.', '').split()
match_counts = dict()
for n,v in d.items():
    c = Counter(ngrams(my_text_split, n))
    for k in v:
        if k in c:
            match_counts[k] = c[k]
Which will give:
print(match_counts)
{('largest',): 2,
('biggest',): 2,
('greatest',): 1,
('the', 'best'): 1,
('My', 'friend', 'is', 'the', 'best'): 1}
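If you would rather avoid the nltk dependency, the same n-grams can be produced with plain zip; here is a minimal sketch of that idea, reusing my_list and my_text from above:
from collections import Counter, defaultdict

my_list = ["largest", "biggest", "greatest", "the best", 'My friend is the best']
my_text = "i have the biggest house and the biggest car. My friend is the best. Best way win this is to make the largest house and largest treehouse and then you will be the greatest"

def ngrams(tokens, n):
    # All consecutive n-token windows, as tuples
    return zip(*(tokens[i:] for i in range(n)))

d = defaultdict(list)
for phrase in my_list:
    k = phrase.split()
    d[len(k)].append(tuple(k))

my_text_split = my_text.replace('.', '').split()
match_counts = {}
for n, phrases in d.items():
    c = Counter(ngrams(my_text_split, n))
    for k in phrases:
        if k in c:
            match_counts[k] = c[k]
print(match_counts)  # same result as above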

How to write an alphabet bigram (aa, ab, bc, cd ... zz) frequency analysis counter in python?

This is my current code which prints out the frequency of each character in the input file.
from collections import defaultdict
counters = defaultdict(int)
with open("input.txt") as content_file:
content = content_file.read()
for char in content:
counters[char] += 1
for letter in counters.keys():
print letter, (round(counters[letter]*100.00/1234,3))
I want it to print the frequency of bigrams of letters only (aa, ab, ac ... zy, zz), not punctuation. How do I do this?
You can build around the current code to handle pairs as well. Keep track of 2 characters instead of just 1 by adding another variable, and use a check to eliminate non-alphabetic characters.
from collections import defaultdict
counters = defaultdict(int)
paired_counters = defaultdict(int)
with open("input.txt") as content_file:
content = content_file.read()
prev = '' #keeps track of last seen character
for char in content:
counters[char] += 1
if prev and (prev+char).isalpha(): #checks for alphabets.
paired_counters[prev+char] += 1
prev = char #assign current char to prev variable for next iteration
for letter in counters.keys(): #you can iterate through both keys and value pairs from a dictionary instead using .items in python 3 or .iteritems in python 2.
print letter, (round(counters[letter]*100.00/1234,3))
for pairs,values in paired_counters.iteritems(): #Use .items in python 3. Im guessing this is python2.
print pairs, values
(Disclaimer: I do not have Python 2 on my system; if there is an issue in the code, let me know.)
There is a more efficient way of counting bigraphs: with a Counter. Start by reading the text (assuming it is not too large):
from collections import Counter
with open("input.txt") as content_file:
content = content_file.read()
Filter out non-letters:
letters = list(filter(str.isalpha, content))
You probably should convert all letters to lower case, too, but it's up to you (note that letters is a list at this point, so use a comprehension rather than str.lower):
letters = [letter.lower() for letter in letters]
Build a zip of the remaining letters with itself, shifted by one position, and count the bigraphs:
cntr = Counter(zip(letters, letters[1:]))
Normalize the dictionary:
total = len(cntr)
{''.join(k): v / total for k,v in cntr.most_common()}
#{'ow': 0.1111111111111111, 'He': 0.05555555555555555...}
The solution can be easily generalized to trigraphs, etc., by changing the counter:
cntr = Counter(zip(letters, letters[1:], letters[2:]))
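More generally, you can zip n shifted copies of the letter list; a small sketch of that idea (the letter_ngrams name is just for illustration):
from collections import Counter

def letter_ngrams(letters, n):
    # zip n copies of the letter sequence, each shifted one position further
    return Counter(zip(*(letters[i:] for i in range(n))))

letters = list("abracadabra")
print(letter_ngrams(letters, 2))  # ('a', 'b'), ('b', 'r') and ('r', 'a') each appear twice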
If you're using nltk:
from nltk import ngrams
list(ngrams('hello', n=2))
[out]:
[('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
To do a count:
from collections import Counter
Counter(list(ngrams('hello', n=2)))
If you want a python native solution, take a look at:
Is there a more efficient way to find most common n-grams?
Fast/Optimize N-gram implementations in python
Effective 1-5 grams extraction with python

how to find the longest N words from a list, using python?

I am now studying Python, and I am trying to solve the following exercise:
Assuming there is a list of words in a text file,
my goal is to print the longest N words in this list.
There are several important points:
The print order does not matter
Words that appear later in the file are given priority to be selected (when there are several words with the same length; I added an example for this below)
Assume that each row in the file contains only a single word
Is there a simple and easy solution for a short list of words, as opposed to a more complex solution for a situation where the list contains several thousand words?
I have attached an example of my starting code, which finds a single word with the maximum length,
and an example of the output for N = 4, to illustrate my question.
Thanks for your advice,
word_list1 = open('WORDS.txt', 'r')
def find_longest_word(word_list):
    longest_word = ''
    for word in word_list:
        if len(word) > len(longest_word):
            longest_word = word
    print(longest_word)
find_longest_word(word_list1)
example(N=4):
WORDS.TXT
---------
Mother
Dad
Cat
Bicycle
House
Hat
The result will be (as I said before, print order doesn't matter):
Hat
House
Bicycle
Mother
thanks in advance!
One alternative is to use a heap to maintain the top-n elements:
import heapq
from operator import itemgetter
def top(lst, n=4):
    heap = [(0, i, '') for i in range(n)]
    heapq.heapify(heap)
    for i, word in enumerate(lst):
        item = (len(word), i, word)
        if item > heap[0]:
            heapq.heapreplace(heap, item)
    return list(map(itemgetter(2), heap))
words = ['Mother', 'Dad', 'Cat', 'Bicycle', 'House', 'Hat']
print(top(words))
Output
['Hat', 'House', 'Bicycle', 'Mother']
In the heap we keep items that correspond to length and position, so in case of ties the last one to appear gets selected.
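A roughly equivalent one-liner with heapq.nlargest, using the same (length, position) tie-breaking, could be (a sketch, not necessarily faster):
import heapq

words = ['Mother', 'Dad', 'Cat', 'Bicycle', 'House', 'Hat']
# (length, position, word) tuples: longer words win, later positions break ties
top4 = [w for _, _, w in heapq.nlargest(4, ((len(w), i, w) for i, w in enumerate(words)))]
print(top4)  # ['Bicycle', 'Mother', 'House', 'Hat']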
Sort the word_list based on the length of the words and then on a counter variable, so that words occurring later get higher priority:
>>> from itertools import count
>>> cnt = count()
>>> n = 4
>>> sorted(word_list, key=lambda word:(len(word), next(cnt)), reverse=True)[:n]
['Bicycle', 'Mother', 'House', 'Hat']
You can use sorted with a custom tuple key and then list slicing.
from io import StringIO
x = StringIO("""Mother
Dad
Cat
Bicycle
House
Hat
Brother""")
def find_longest_word(word_list, n):
    idx, words = zip(*sorted(enumerate(word_list), key=lambda x: (-len(x[1]), -x[0]))[:n])
    return words
res = find_longest_word(map(str.strip, x.readlines()), 4)
print(*res, sep='\n')
# Brother
# Bicycle
# Mother
# House

Increasing speed/performance in counting through very large lists in Python

I'm writing a program in Python 3 and part of its functionality is to find out which word occurs the most in a list and to return the number of occurrences of that word. I have code that works, but part of the requirement is that it take a list of 200,000+ words and complete this activity in under a few seconds, and my code takes a really long time to run. I was wondering what suggestions you may have for speed improvements in this method.
def max_word_frequency(words):
"""A method that takes a list and finds the word with the most
occurrences and returns the number of occurences of that word
as an integer.
"""
max_count = 0
for word in set(words):
count = words.count(word)
if count > max_count:
max_count = count
return max_count
I have contemplated using a dictionary, as lookups are hash-based and super speedy compared to lists, but I cannot quite figure out how to implement this yet.
Thank you for your time everyone!
- Finn
First, your algorithm is looping m times over the whole list of 200,000 words, where m is the number of distinct words in the list. This is really not a good idea just for counting occurrences of words and selecting the maximum. I could show you a more efficient algorithm (which would iterate only once over the list), but Python already has tools to do what you want.
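For the record, a single-pass version with a plain dict (the kind of approach the question hints at) might look like this sketch:
def max_word_frequency(words):
    # One pass: tally counts in a dict while tracking the running maximum
    counts = {}
    max_count = 0
    for word in words:
        count = counts.get(word, 0) + 1
        counts[word] = count
        if count > max_count:
            max_count = count
    return max_count

print(max_word_frequency(['abc', 'def', 'abc', 'foo', 'bar', 'foo', 'foo']))  # 3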
To solve your problem in a few lines of code, you can use the tools available in the standard library, which are implemented in C and are likely more efficient than your loop. The Counter class with its most_common method will help you:
>>> from collections import Counter
>>> counts = Counter(['abc', 'def', 'abc', 'foo', 'bar', 'foo', 'foo'])
>>> counts
Counter({'foo': 3, 'abc': 2, 'bar': 1, 'def': 1})
>>> Counter(['abc', 'def', 'abc', 'foo', 'bar', 'foo', 'foo']).most_common(1)
[('foo', 3)]
Then you just have to return the second element of the tuple (there is only one tuple here, as we asked for with the 1 argument to most_common).
Performance comparison
Just to compare, I took a sample LaTeX file (~12 KB), split it into words on spaces (giving a list x with 1835 words) and ran your function and the one below with timeit. You can see a real gain.
>>> len(x)
1835
>>> def max_word_2(words):
...     counts = Counter(words)
...     return counts.most_common(1)[0][1]
>>> timeit.timeit("max_word_2(x)", setup="from __main__ import x, max_word_2", number=1000)
1.1040630340576172
>>> timeit.timeit("max_word_frequency(x)", setup="from __main__ import x, max_word_frequency", number=1000)
35.623037815093994
Just this change might be sufficient to speed up your process :)

Common elements between two lists not using sets in Python

I want to count the common elements of two lists. The lists can have duplicate elements, so I can't convert them to sets and use the & operator.
a=[2,2,1,1]
b=[1,1,3,3]
set(a) & set(b) works
a & b doesn't work
Is it possible to do it without sets or dictionaries?
In Python 3.x (and Python 2.7, when it's released), you can use collections.Counter for this:
>>> from collections import Counter
>>> list((Counter([2,2,1,1]) & Counter([1,3,3,1])).elements())
[1, 1]
Here's an alternative using collections.defaultdict (available in Python 2.5 and later). It has the nice property that the order of the result is deterministic (it essentially corresponds to the order of the second list).
from collections import defaultdict
def list_intersection(list1, list2):
    bag = defaultdict(int)
    for elt in list1:
        bag[elt] += 1
    result = []
    for elt in list2:
        if elt in bag:
            # remove elt from bag, making sure
            # that bag counts are kept positive
            if bag[elt] == 1:
                del bag[elt]
            else:
                bag[elt] -= 1
            result.append(elt)
    return result
For both these solutions, the number of occurrences of any given element x in the output list is the minimum of the numbers of occurrences of x in the two input lists. It's not clear from your question whether this is the behavior that you want.
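For example, with the lists from the question, the defaultdict version above would give:
>>> list_intersection([2, 2, 1, 1], [1, 1, 3, 3])
[1, 1]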
Using sets is the most efficient, but you could always do r = [i for i in l1 if i in l2].
SilentGhost, Mark Dickinson and Lo'oris are right; thanks very much for reporting this problem. I need the common part of the lists, so for:
a=[1,1,1,2]
b=[1,1,3,3]
result should be [1,1]
Sorry for commenting in an unsuitable place; I registered today.
I modified your solutions:
def count_common(l1,l2):
    l2_copy=list(l2)
    counter=0
    for i in l1:
        if i in l2_copy:
            counter+=1
            l2_copy.remove(i)
    return counter
l1=[1,1,1]
l2=[1,2]
print count_common(l1,l2)
1
