Counting the number of unique words in a list - python

Using the following code from https://stackoverflow.com/a/11899925, I am able to find if a word is unique or not (by comparing if it was used once or greater than once):
helloString = ['hello', 'world', 'world']
count = {}
for word in helloString :
if word in count :
count[word] += 1
else:
count[word] = 1
But, if I were to have a string with hundreds of words, how would I be able to count the number of unique words within that string?
For example, my code has:
uniqueWordCount = 0
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
count = {}
for word in words :
if word in count :
count[word] += 1
else:
count[word] = 1
How would I be able to set uniqueWordCount to 6? Usually, I am really good at solving these types of algorithmic puzzles, but I have been unsuccessful with figuring this one out. I feel as if it is right beneath my nose.

The best way to solve this is to use the set collection type. A set is a collection in which all elements are unique. Therefore:
unique = set([ 'one', 'two', 'two'])
len(unique) # is 2
You can use a set from the outset, adding words to it as you go:
unique.add('three')
This will throw out any duplicates as they are added. Or, you can collect all the elements in a list and pass the list to the set() function, which will remove the duplicates at that time. The example I provided above shows this pattern:
unique = set([ 'one', 'two', 'two'])
unique.add('three')
# unique now contains {'one', 'two', 'three'}
Read more about sets in Python.

You have many options for this, I recommend a set, but you can also use a counter, which counts the amount a number shows up, or you can look at the number of keys for the dictionary you made.
Set
You can also convert the list to a set, where all elements have to be unique. Not unique elements are discarded:
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
helloSet = set(helloString) #=> ['doing', 'how', 'are', 'world', 'you', 'hello', 'today']
uniqueWordCount = len(set(helloString)) #=> 7
Here's a link to further reading on sets
Counter
You can also use a counter, which can also tell you how often a word was used, if you still need that information.
from collections import Counter
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
counter = Counter(helloString)
len(counter) #=> 7
counter["world"] #=> 2
Loop
At the end for your loop, you can check the len of count, also, you mistyped helloString as words:
uniqueWordCount = 0
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
count = {}
for word in helloString:
if word in count :
count[word] += 1
else:
count[word] = 1
len(count) #=> 7

You can use collections.Counter
helloString = ['hello', 'world', 'world']
from collections import Counter
c = Counter(helloString)
print("There are {} unique words".format(len(c)))
print('They are')
for k, v in c.items():
print(k)
I know the question doesn't specifically ask for this, but to maintain order
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
from collections import Counter, OrderedDict
class OrderedCounter(Counter, OrderedDict):
pass
c = OrderedCounter(helloString)
print("There are {} unique words".format(len(c)))
print('They are')
for k, v in c.items():
print(k)

In your current code you can either increment uniqueWordCount in the else case where you already set count[word], or just lookup the number of keys in the dictionary: len(count).
If you only want to know the number of unique elements, then get the elements in the set: len(set(helloString))

I would do this using a set.
def stuff(helloString):
hello_set = set(helloString)
return len(hello_set)

Counter is the efficient way to do it.
This code is similar to counter,
text = ['hello', 'world']
# create empty dictionary
freq_dict = {}
# loop through text and count words
for word in text:
# set the default value to 0
freq_dict.setdefault(word, 0)
# increment the value by 1
freq_dict[word] += 1
for key,value in freq_dict.items():
if value == 1:
print(f'Word "{key}" has single appearance in the list')
Word "hello" has single appearance in the list
Word "world" has single appearance in the list
[Program finished]

I may be misreading the question but I believe the goal is to find all elements which only occur one time in the list.
from collections import Counter
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
counter = Counter(helloString)
uniques = [value for value, count in counter.items() if count == 1]
This will give us 6 items because "world" occurs twice in our list:
>>> uniques
['you', 'are', 'doing', 'how', 'today', 'hello']

Related

Need to output the final count instead of one by one

I turned a given string into a list of words and then made a loop to count the number of 'to' contained in the list. However, it comes out as individual '1's but I need the answer to be the total word_count == 4 somehow.
Input words = ['to', 'split;', 'bill', 'to', 'lint', 'to', 'leads', 'to', 'suffer']
print(words)
word_count = 0
for target_word in words:
if (target_word == 'to'):
print(word_count + 1)
Output:
['to', 'split;', 'bill', 'to', 'lint', 'to', 'leads', 'to', 'suffer']
1
1
1
1
Thanks to anyone for their help. I am in a different timezone and cannot contact my admin for help atm.
words = text.split()
print(words)
word_count = 0
for target_word in words:
if (target_word == 'to'):
word_count += 1
print(word_count)
An even simpler approach would be
print(words.count("to"))
If you want to get a summary of words, then:
from collections import Counter
c = Counter(words)
print(c)

How to get number of occurences of items after certain item in list

I have a list of strings like the one below, written in Python. What I want to do now is to get the number of occurences of the string 'you' after each string 'hello'. The output should then be something like the number 0 for the first two 'hello's , the number 2 for the third 'hello', number 1 for the fourth 'hello' and so on.
Does anyone know how to that exactly?
my_list = ['hello', 'hello', 'hello', 'you', 'you', 'hello', 'you', 'hello',
'you', 'you', 'you', ...]
Update:
Solved it myself, though Karan Elangovans approach also works:
This is how i did it:
list_counter = []
counter = 0
# I reverse the list because the loop below counts the number of
# occurences of 'you' behind each 'hello', not in front of it
my_list_rev = reversed(my_list)
for m in my_list_rev:
if m == 'you':
counter += 1
elif m == 'hello':
list_counter.append(counter)
counter = 0
# reverse the output to match it with my_list
list_counter = list(reversed(list_counter))
print(list_counter)
This outputs:
[0, 0, 2, 1, 3]
for:
my_list = ['hello', 'hello', 'hello', 'you', 'you', 'hello', 'you', 'hello',
'you', 'you', 'you']
Maybe not the best approach, as you have to reverse both the original list and the list with the results to get the correct output, but it works for this problem.
Try this one-liner:
reduce(lambda acc, cur: acc + ([] if cur[1] == 'you' else [next((i[0] - cur[0] - 1 for i in list(enumerate(my_list))[cur[0]+1:] if i[1] == 'hello'), len(my_list) - cur[0] - 1)]), enumerate(my_list), [])
It will give an array where the nth element is the number of 'you's following the nth occurrence of 'hello'.
e.g.
If my_list = ['hello', 'hello', 'hello', 'you', 'you', 'hello', 'you', 'hello', 'you', 'you', 'you'],
it will give: [0, 0, 2, 1, 3]

How to eliminate duplicate list entries in Python while preserving case-sensitivity?

I'm looking for a way to remove duplicate entries from a Python list but with a twist; The final list has to be case sensitive with a preference of uppercase words.
For example, between cup and Cup I only need to keep Cup and not cup. Unlike other common solutions which suggest using lower() first, I'd prefer to maintain the string's case here and in particular I'd prefer keeping the one with the uppercase letter over the one which is lowercase..
Again, I am trying to turn this list:
[Hello, hello, world, world, poland, Poland]
into this:
[Hello, world, Poland]
How should I do that?
Thanks in advance.
This does not preserve the order of words, but it does produce a list of "unique" words with a preference for capitalized ones.
In [34]: words = ['Hello', 'hello', 'world', 'world', 'poland', 'Poland', ]
In [35]: wordset = set(words)
In [36]: [item for item in wordset if item.istitle() or item.title() not in wordset]
Out[36]: ['world', 'Poland', 'Hello']
If you wish to preserve the order as they appear in words, then you could use a collections.OrderedDict:
In [43]: wordset = collections.OrderedDict()
In [44]: wordset = collections.OrderedDict.fromkeys(words)
In [46]: [item for item in wordset if item.istitle() or item.title() not in wordset]
Out[46]: ['Hello', 'world', 'Poland']
Using set to track seen words:
def uniq(words):
seen = set()
for word in words:
l = word.lower() # Use `word.casefold()` if possible. (3.3+)
if l in seen:
continue
seen.add(l)
yield word
Usage:
>>> list(uniq(['Hello', 'hello', 'world', 'world', 'Poland', 'poland']))
['Hello', 'world', 'Poland']
UPDATE
Previous version does not take care of preference of uppercase over lowercase. In the updated version I used the min as #TheSoundDefense did.
import collections
def uniq(words):
seen = collections.OrderedDict() # Use {} if the order is not important.
for word in words:
l = word.lower() # Use `word.casefold()` if possible (3.3+)
seen[l] = min(word, seen.get(l, word))
return seen.values()
Since an uppercase letter is "smaller" than a lowercase letter in a comparison, I think you can do this:
orig_list = ["Hello", "hello", "world", "world", "Poland", "poland"]
unique_list = []
for word in orig_list:
for i in range(len(unique_list)):
if unique_list[i].lower() == word.lower():
unique_list[i] = min(word, unique_list[i])
break
else:
unique_list.append(word)
The min will have a preference for words with uppercase letters earlier on.
Some better answers here, but hopefully something simple, different and useful. This code satisfies the conditions of your test, sequential pairs of matching words, but would fail on anything more complicated; such as non-sequential pairs, non-pairs or non-strings. Anything more complicated and I'd take a different approach.
p1 = ['Hello', 'hello', 'world', 'world', 'Poland', 'poland']
p2 = ['hello', 'Hello', 'world', 'world', 'Poland', 'Poland']
def pref_upper(p):
q = []
a = 0
b = 1
for x in range(len(p) /2):
if p[a][0].isupper() and p[b][0].isupper():
q.append(p[a])
if p[a][0].isupper() and p[b][0].islower():
q.append(p[a])
if p[a][0].islower() and p[b][0].isupper():
q.append(p[b])
if p[a][0].islower() and p[b][0].islower():
q.append(p[b])
a +=2
b +=2
return q
print pref_upper(p1)
print pref_upper(p2)

Is there a better way to get just 'important words' from a list in python?

I wrote some code to find the most popular words in submission titles on reddit, using the reddit praw api.
import nltk
import praw
picksub = raw_input('\nWhich subreddit do you want to analyze? r/')
many = input('\nHow many of the top words would you like to see? \n\t> ')
print 'Getting the top %d most common words from r/%s:' % (many,picksub)
r = praw.Reddit(user_agent='get the most common words from chosen subreddit')
submissions = r.get_subreddit(picksub).get_top_from_all(limit=200)
hey = []
for x in submissions:
hey.extend(str(x).split(' '))
fdist = nltk.FreqDist(hey) # creates a frequency distribution for words in 'hey'
top_words = fdist.keys()
common_words = ['its','am', 'ago','took', 'got', 'will', 'been', 'get', 'such','your','don\'t', 'if', 'why', 'do', 'does', 'or', 'any', 'but', 'they', 'all', 'now','than','into','can', 'i\'m','not','so','just', 'out','about','have','when', 'would' ,'where', 'what', 'who' 'I\'m','says' 'not', '', 'over', '_', '-','after', 'an','for', 'who', 'by', 'from', 'it', 'how', 'you', 'about' 'for', 'on', 'as', 'be', 'has', 'that', 'was', 'there', 'with','what', 'we', '::', 'to', 'the', 'of', ':', '...', 'a', 'at', 'is', 'my', 'in' , 'i', 'this', 'and', 'are', 'he', 'she', 'is', 'his', 'hers']
already = []
counter = 0
number = 1
print '-----------------------'
for word in top_words:
if word.lower() not in common_words and word.lower() not in already:
print str(number) + ". '" + word + "'"
counter +=1
number +=1
already.append(word.lower())
if counter == many:
break
print '-----------------------\n'
so inputting subreddit 'python' and getting 10 posts returns:
'Python'
'PyPy'
'code'
'use'
'136'
'181'
'd...'
'IPython'
'133'
10. '158'
How can I make this script not return numbers, and error words like 'd...'? The first 4 results are acceptable, but I would like to replace this rest with words that make sense. Making a list common_words is unreasonable, and doesn't filter these errors. I'm relatively new to writing code, and I appreciate the help.
I disagree. Making a list of common words is correct, there is no easier way to filter out the, for, I, am, etc.. However, it is unreasonable to use the common_words list to filter out results that aren't words, because then you'd have to include every possible non-word you don't want. Non-words should be filtered out differently.
Some suggestions:
1) common_words should be a set(), since your list is long this should speed things up. The in operation for sets in O(1), while for lists it is O(n).
2) Getting rid of all number strings is trivial. One way you could do it is:
all([w.isdigit() for w in word])
Where if this returns True, then the word is just a series of numbers.
3) Getting rid of the d... is a little more tricky. It depends on how you define a non-word. This:
tf = [ c.isalpha() for c in word ]
Returns a list of True/False values (where it is False if the char was not a letter). You can then count the values like:
t = tf.count(True)
f = tf.count(False)
You could then define a non-word as one that has more non-letter chars in it than letters, as one that has any non-letter characters at all, etc. For example:
def check_wordiness(word):
# This returns true only if a word is all letters
return all([ c.isalpha() for c in word ])
4) In the for word in top_words: block, are you sure that you have not mixed up counter and number? Also, counter and number are pretty much redundant, you could rewrite the last bit as:
for word in top_words:
# Since you are calling .lower() so much,
# you probably want to define it up here
w = word.lower()
if w not in common_words and w not in already:
# String formatting is preferred over +'s
print "%i. '%s'" % (number, word)
number +=1
# This could go under the if statement. You only want to add
# words that could be added again. Why add words that are being
# filtered out anyways?
already.append(w)
# this wasn't indented correctly before
if number == many:
break
Hope that helps.

Python: Split list based on first character of word

Im kind of stuck on an issue and Ive gone round and round with it until ive confused myself.
What I am trying to do is take a list of words:
['About', 'Absolutely', 'After', 'Aint', 'Alabama', 'AlabamaBill', 'All', 'Also', 'Amos', 'And', 'Anyhow', 'Are', 'As', 'At', 'Aunt', 'Aw', 'Bedlam', 'Behind', 'Besides', 'Biblical', 'Bill', 'Billgone']
Then sort them under and alphabetical order:
A
About
Absolutely
After
B
Bedlam
Behind
etc...
Is there and easy way to do this?
Use itertools.groupby() to group your input by a specific key, such as the first letter:
from itertools import groupby
from operator import itemgetter
for letter, words in groupby(sorted(somelist), key=itemgetter(0)):
print letter
for word in words:
print word
print
If your list is already sorted, you can omit the sorted() call. The itemgetter(0) callable will return the first letter of each word (the character at index 0), and groupby() will then yield that key plus an iterable that consists only of those items for which the key remains the same. In this case that means looping over words gives you all items that start with the same character.
Demo:
>>> somelist = ['About', 'Absolutely', 'After', 'Aint', 'Alabama', 'AlabamaBill', 'All', 'Also', 'Amos', 'And', 'Anyhow', 'Are', 'As', 'At', 'Aunt', 'Aw', 'Bedlam', 'Behind', 'Besides', 'Biblical', 'Bill', 'Billgone']
>>> from itertools import groupby
>>> from operator import itemgetter
>>>
>>> for letter, words in groupby(sorted(somelist), key=itemgetter(0)):
... print letter
... for word in words:
... print word
... print
...
A
About
Absolutely
After
Aint
Alabama
AlabamaBill
All
Also
Amos
And
Anyhow
Are
As
At
Aunt
Aw
B
Bedlam
Behind
Besides
Biblical
Bill
Billgone
Instead of using any library imports, or anything fancy.
Here is the logic:
def splitLst(x):
dictionary = dict()
for word in x:
f = word[0]
if f in dictionary.keys():
dictionary[f].append(word)
else:
dictionary[f] = [word]
return dictionary
splitLst(['About', 'Absolutely', 'After', 'Aint', 'Alabama', 'AlabamaBill', 'All', 'Also', 'Amos', 'And', 'Anyhow', 'Are', 'As', 'At', 'Aunt', 'Aw', 'Bedlam', 'Behind', 'Besides', 'Biblical', 'Bill', 'Billgone'])
def split(n):
n2 = []
for i in n:
if i[0] not in n2:
n2.append(i[0])
n2.sort()
for j in n:
z = j[0]
z1 = n2.index(z)
n2.insert(z1+1, j)
return n2
word_list = ['be','have','do','say','get','make','go','know','take','see','come','think',
'look','want','give','use','find','tell','ask','work','seem','feel','leave','call']
print(split(word_list))

Categories