count word occurrences in a string - python

I have the following situation:
str='this is the string that Luci want to parse for a dataset uci at web'
word='uci'
str.count(word)=?
I want to count only 'uci' where it appears as a standalone word (not inside another word),
so the output should be 1, not 2!
A Python script is required.

>>> s = 'this is the string that Luci want to parse for a dataset uci at web'
>>> s.split(' ').count('uci')
1

Without giving too much away, you can use re to find patterns. In particular, you might look for 'uci' surrounded by word boundaries:
import re
string = 'this is the string that Luci want to parse for a dataset uci at web'
count = len(re.findall(r'(?:^|\W)uci(?:\W|$)', string))
Alternatively, you could split on non-word characters and count the occurrences there:
count = re.split(r'\W', string).count('uci')
Both of these approaches return 1
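An arguably simpler regex uses `\b`, which matches the boundary between a word character and a non-word character, so no explicit character classes are needed (a minimal sketch on the same sample string):

```python
import re

s = 'this is the string that Luci want to parse for a dataset uci at web'

# \b matches a word boundary, so the 'uci' inside 'Luci' is not counted
count = len(re.findall(r'\buci\b', s))
print(count)
```

Unlike the character-class patterns, `\b` is zero-width, so it also matches at the very start and end of the string.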

def count_words(str):
    words = str.split()
    counts = {}
    for word in words:
        if word in counts:
            counts[word] = counts[word] + 1
        else:
            counts[word] = 1
    return counts

count_words(str)
{'a': 1, 'web': 1, 'string': 1, 'for': 1, 'that': 1, 'this': 1, 'is': 1, 'dataset': 1, 'parse': 1, 'to': 1, 'at': 1, 'want': 1, 'the': 1, 'Luci': 1, 'uci': 1}

Related

Python is there a way to sort a list of lists by frequency of a specific element?

I know the sort() function in Python has a key parameter that allows sorting by a specification. However, is there a way to sort a list of words and their frequencies in a list of sentences by a specific word? My list would take multiple sentences in a list, then divide each sentence into its own list, making everything lower case and removing punctuation, then putting the frequency of each word next to it.
For example my list would be given a list like:
['Hello world! My name is Mary, However', 'Is the water running? Is it cold?', 'Everything is is is okay.']
And it would be transformed into:
[ {'hello': 1, 'world': 1, 'my': 1, 'name': 1, 'is': 1, 'mary': 1, 'however': 1}, {'is': 2, 'the': 1, 'water': 1, 'running': 1, 'it': 1, 'cold': 1}, {'everything': 1, 'is': 3, 'okay': 1} ]
In this scenario I would want to sort the list of sentences by the frequency of the word 'is'. How could I go about that without changing the word lists?
First of all, we need to generate the dictionaries. So we need to take each sentence, remove any punctuation, and split it into words:
sentences = ['Hello world! My name is Mary, However', 'Is the water running? Is it cold?', 'Everything is is is okay.']
processed_sentences = []
for sentence in sentences:
    sentence = sentence.replace("!", "").replace(",", "").replace("?", "").replace(".", "")
    sentence = sentence.lower()
    sentence_words = sentence.split(" ")
    processed_sentences.append(sentence_words)
Then we need to count each word. Python has a construct to handle that, called collections.Counter.
from collections import Counter

counted_sentences = []
for sentence_list in processed_sentences:
    counted_words = Counter(sentence_list)
    counted_sentences.append(counted_words)
Then, we need to sort the list. Python lists handily have a sort method, but we need to specify how we want the lists sorted. We do that using a key:
counted_sentences.sort(key=lambda cw: cw.get("is", 0))
Note that list.sort sorts the list in place, so we're not assigning it to anything.
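If you prefer not to mutate the original list, the built-in sorted() accepts the same key and returns a new, ordered list instead (a minimal sketch with made-up Counter values):

```python
from collections import Counter

counted = [Counter({'is': 2}), Counter({'okay': 1}), Counter({'is': 3})]

# sorted() leaves `counted` untouched and returns a new list
ordered = sorted(counted, key=lambda cw: cw.get('is', 0))
print([cw.get('is', 0) for cw in ordered])  # [0, 2, 3]
```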
Then we just need to clean up a little to get your desired output:
result = []
for counted_sentence in counted_sentences:
    result.append(dict(counted_sentence))
print(result)
And there you go.
You can use Counter:
from collections import Counter
import re

def replace_non_alphanumeric(word):
    return re.sub(r'[^A-Za-z0-9 ]+', '', word)

data = ['Hello world! My name is Mary, However', 'Is the water running? Is it cold?', 'Everything is is is okay.']
words_without_punct = [replace_non_alphanumeric(sentence.lower()).split(' ') for sentence in data]
sorted_counters = sorted([dict(Counter(words)) for words in words_without_punct], key=lambda x: x.get('is', 0))
print(sorted_counters)
# [{'hello': 1, 'world': 1, 'my': 1, 'name': 1, 'is': 1, 'mary': 1, 'however': 1}, {'is': 2, 'the': 1, 'water': 1, 'running': 1, 'it': 1, 'cold': 1}, {'everything': 1, 'is': 3, 'okay': 1}]

My NLTK code almost does what I need it to, but not quite

Code:
def add_lexical_features(fdist, feature_vector):
    for word, freq in fdist.items():
        fname = "unigram:{0}".format(word)
        if selected_features == None or fname in selected_features:
            feature_vector[fname] = 1
        if selected_features == None or fname in selected_features:
            feature_vector[fname] = float(freq) / fdist.N()
    print(feature_vector)

if __name__ == '__main__':
    file_name = "restaurant-training.data"
    p = process_reviews(file_name)
    for i in range(0, len(p)):
        print(p[i] + "\n")
    uni_dist = nltk.FreqDist(p[0])
    feature_vector = {}
    x = add_lexical_features(uni_dist, feature_vector)
What this is trying to do is output the frequency of words in the list of reviews (p being the list of reviews, p[0] being the first review string). And this works....except it does it by letter, not by word.
I am still new to NLTK, so this might be obvious, but I really can't get it.
For example, this currently outputs a large list of things like:
{'unigram:n': 0.0783132530120482}
This is fine, and I think that is the right number (number of time n appears over total letters) but I want it to be by word, not by letter.
Now, I also want it to do bigrams; once I can get it working on single words, making the two-word pairs might be easy, but I am not quite seeing it, so some guidance there would be nice.
Thanks.
The input to nltk.FreqDist should be a list of strings, not just a string. See the difference:
>>> import nltk
>>> uni_dist = nltk.FreqDist(['the', 'dog', 'went', 'to', 'the', 'park'])
>>> uni_dist
FreqDist({'the': 2, 'went': 1, 'park': 1, 'dog': 1, 'to': 1})
>>> uni_dist2 = nltk.FreqDist('the dog went to the park')
>>> uni_dist2
FreqDist({' ': 5, 't': 4, 'e': 3, 'h': 2, 'o': 2, 'a': 1, 'd': 1, 'g': 1, 'k': 1, 'n': 1, ...})
You can convert your string into a list of individual words using split.
Side note: I think you might want to be calling nltk.FreqDist on p[i] rather than p[0].
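The string-versus-list distinction is easy to check even without NLTK, since FreqDist tallies elements the same way collections.Counter does (a sketch with a made-up review string):

```python
from collections import Counter

review = 'great food great service'

# Counting the raw string tallies characters...
by_char = Counter(review)
# ...while counting the split string tallies words
by_word = Counter(review.split())

print(by_char['g'])      # 2
print(by_word['great'])  # 2
```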

python programs to count letters in each word of a sentence

I'm pretty new to python and I need a program that not only counts the words from an input sentence but also counts the number of letters in each word. This is what I have so far. Any help would be very much appreciated!
def main():
    s = input("Please enter your sentence: ")
    words = s.split()
    wordCount = len(words)
    print("Your word and letter counts are:", wordCount)

main()
You can generate a mapping from words to word lengths, as follows:
s = "this is a sentence"
words = s.split()
letter_count_per_word = {w:len(w) for w in words}
This yields
letter_count_per_word == {'this': 4, 'a': 1, 'is': 2, 'sentence': 8}
Actually, Python's collections module has a class called Counter which will count the number of occurrences of each word for you.
from collections import Counter

my_sentence = 'Python is a widely used programming language'
print(Counter(my_sentence.split()))
Output
Counter({'a': 1, 'used': 1, 'language': 1, 'Python': 1, 'is': 1, 'programming': 1, 'widely': 1})
Try the following code (input() already returns a string, so str() is not needed, and splitting gives the word count):
words = input("Please enter your sentence. ")
print(len(words.split()))

Counting word frequency and making a dictionary from it

I want to take every word from a text file, and count the word frequency in a dictionary.
Example: 'this is the textfile, and it is used to take words and count'
d = {'this': 1, 'is': 2, 'the': 1, ...}
I am not that far, but I just can't see how to complete it. My code so far:
import sys

argv = sys.argv[1]
data = open(argv)
words = data.read()
data.close()

wordfreq = {}
for i in words:
    # there should be a counter and somehow it must fill the dict
If you don't want to use collections.Counter, you can write your own function:
import sys

filename = sys.argv[1]
fp = open(filename)
data = fp.read()
words = data.split()
fp.close()

unwanted_chars = ".,-_ (and so on)"
wordfreq = {}
for raw_word in words:
    word = raw_word.strip(unwanted_chars)
    if word not in wordfreq:
        wordfreq[word] = 0
    wordfreq[word] += 1
For finer control, look at regular expressions.
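For example, re.findall(r'\w+', ...) pulls out runs of word characters directly, so no unwanted_chars list is needed (a sketch on an inline string rather than a file):

```python
import re

data = 'this is the textfile, and it is used to take words and count'

# \w+ matches runs of letters, digits, and underscores, skipping punctuation
wordfreq = {}
for word in re.findall(r'\w+', data):
    wordfreq[word] = wordfreq.get(word, 0) + 1

print(wordfreq)
```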
Although using Counter from the collections library as suggested by @Michael is a better approach, I am adding this answer just to improve your code. (I believe this will be a good answer for a new Python learner.)
From the comment in your code it seems like you want to improve your code, and you have already read the file content into words (note that I usually avoid the read() function and use for line in file_descriptor: style code instead).
As words is a string, in the for loop for i in words: the loop variable i is not a word but a char. You are iterating over the chars in the string instead of over the words in the string words. To understand this, look at the following code snippet:
>>> for i in "Hi, h r u?":
... print i
...
H
i
,
h
r
u
?
>>>
Iterating over the given string char by char is not what you want to achieve; to iterate word by word you should use the split method of the string class in Python.
str.split(sep, maxsplit) returns a list of all the words in the string, using sep as the separator (it splits on all whitespace if left unspecified), optionally limiting the number of splits to maxsplit.
Notice the code examples below:
Split:
>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?']
loop with split:
>>> for i in "Hi, how are you?".split():
... print i
...
Hi,
how
are
you?
And that looks like what you need, except for the word Hi,: because split() splits on whitespace by default, 'Hi,' is kept as a single string, and obviously you don't want that.
To count the frequency of words in the file, one good solution is to use regex. But first, to keep the answer simple, I will use the replace() method. The method str.replace(old, new[, max]) returns a copy of the string in which the occurrences of old have been replaced with new, optionally restricting the number of replacements to max.
Now check code example below to see what I suggested:
>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?'] # it has , with Hi
>>> "Hi, how are you?".replace(',', ' ').split()
['Hi', 'how', 'are', 'you?'] # , replaced by space then split
loop:
>>> for word in "Hi, how are you?".replace(',', ' ').split():
... print word
...
Hi
how
are
you?
Now, how to count frequency:
One way is to use Counter as @Michael suggested, but using your approach, in which you want to start from an empty dict, do something like the code sample below:
words = f.read()
wordfreq = {}
for word in words.replace(',', ' ').split():
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1
    # ^^ add 1 to 0 or to the old value from the dict
What am I doing? Because wordfreq is initially empty, you can't increment wordfreq[word] the first time (it would raise a KeyError). So I used the setdefault dict method.
dict.setdefault(key, default=None) is similar to get(), but will set dict[key] = default if key is not already in dict. So the first time a new word comes along, I set it to 0 in the dict using setdefault, then add 1 and assign it back.
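dict.get(word, 0) achieves the same result without the side effect of inserting the key; both idioms are common, as this tiny sketch shows:

```python
counts_a, counts_b = {}, {}
for word in ['to', 'be', 'or', 'not', 'to', 'be']:
    # setdefault inserts the key with 0 if missing, then we overwrite
    counts_a[word] = counts_a.setdefault(word, 0) + 1
    # get just returns 0 for a missing key without inserting anything
    counts_b[word] = counts_b.get(word, 0) + 1

print(counts_a == counts_b)  # True
print(counts_a['to'])        # 2
```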
I have written an equivalent code using with open instead of single open.
with open('~/Desktop/file') as f:
    words = f.read()

wordfreq = {}
for word in words.replace(',', ' ').split():
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1

print(wordfreq)
That runs like this:
$ cat file # file is
this is the textfile, and it is used to take words and count
$ python work.py # indented manually
{'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2,
'it': 1, 'to': 1, 'take': 1, 'words': 1,
'the': 1, 'textfile': 1}
Using re.split(pattern, string, maxsplit=0, flags=0)
Just change the for loop: for i in re.split(r"[,\s]+", words):, that should produce the correct output.
Edit: it is better to find all alphanumeric characters, because you may have more than one punctuation symbol:
>>> re.findall(r'[\w]+', words) # manually indent output
['this', 'is', 'the', 'textfile', 'and',
'it', 'is', 'used', 'to', 'take', 'words', 'and', 'count']
use for loop as: for word in re.findall(r'[\w]+', words):
Here is how to write the code without using read():
File is:
$ cat file
This is the text file, and it is used to take words and count. And multiple
Lines can be present in this file.
It is also possible that Same words repeated in with capital letters.
Code is:
$ cat work.py
import re

wordfreq = {}
with open('file') as f:
    for line in f:
        for word in re.findall(r'[\w]+', line.lower()):
            wordfreq[word] = wordfreq.setdefault(word, 0) + 1

print(wordfreq)
lower() is used to convert uppercase letters to lowercase.
output:
$python work.py # manually strip output
{'and': 3, 'letters': 1, 'text': 1, 'is': 3,
'it': 2, 'file': 2, 'in': 2, 'also': 1, 'same': 1,
'to': 1, 'take': 1, 'capital': 1, 'be': 1, 'used': 1,
'multiple': 1, 'that': 1, 'possible': 1, 'repeated': 1,
'words': 2, 'with': 1, 'present': 1, 'count': 1, 'this': 2,
'lines': 1, 'can': 1, 'the': 1}
from collections import Counter
t = 'this is the textfile, and it is used to take words and count'
dict(Counter(t.split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}
Or better with removing punctuation before counting:
dict(Counter(t.replace(',', '').replace('.', '').split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile': 1}
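Instead of chaining replace() calls for each punctuation mark, str.translate with string.punctuation strips all of them in one pass (a sketch on the same sentence):

```python
import string
from collections import Counter

t = 'this is the textfile, and it is used to take words and count'

# maketrans with two empty strings and a deletion set removes every
# character in string.punctuation
cleaned = t.translate(str.maketrans('', '', string.punctuation))
counts = dict(Counter(cleaned.split()))
print(counts['textfile'], counts['and'])  # 1 2
```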
The following takes the string, splits it into a list with split(), loops over the list, and counts the frequency of each item in the sentence with Python's count() function. Each word i and its frequency are placed as tuples in an empty list ls, and then converted into key and value pairs with dict().
sentence = 'this is the textfile, and it is used to take words and count'.split()
ls = []
for i in sentence:
    word_count = sentence.count(i)  # Python's count function, count()
    ls.append((i, word_count))

dict_ = dict(ls)
print(dict_)
output: {'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}
sentence = "this is the textfile, and it is used to take words and count"

# split the sentence into words and iterate through every word
counter_dict = {}
for word in sentence.lower().split():
    # add the word into counter_dict, initialized with 0
    if word not in counter_dict:
        counter_dict[word] = 0
    # increase its count by 1
    counter_dict[word] += 1
# open your text file, counting word frequency
File_obj = open("Counter.txt", 'r')
w_list = File_obj.read()
print(w_list.split())

di = dict()
for word in w_list.split():
    if word in di:
        di[word] = di[word] + 1
    else:
        di[word] = 1

largest = -1
maxusedword = ''
for k, v in di.items():
    print(k, v)
    if v > largest:
        largest = v
        maxusedword = k

print(maxusedword, largest)
you can also use default dictionaries with int type.
from collections import defaultdict

wordDict = defaultdict(int)
text = 'this is the textfile, and it is used to take words and count'.split(" ")
for word in text:
    wordDict[word] += 1
Explanation: we initialize a default dictionary whose values are of type int. This way the default value for any key is 0, and we don't need to check whether a key is present in the dictionary. We then split the text on spaces into a list of words, iterate through the list, and increment each word's count.
wordList = 'this is the textfile, and it is used to take words and count'.split()
wordFreq = {}

# Logic: word not in the dict, give it a value of 1. if key already present, +1.
for word in wordList:
    if word not in wordFreq:
        wordFreq[word] = 1
    else:
        wordFreq[word] += 1

print(wordFreq)
My approach is to do a few things from the ground up:
Remove punctuation from the text input.
Make a list of words.
Remove empty strings.
Iterate through the list.
Make each new word a key in the dictionary with value 1.
If a word already exists as a key, increment its value by one.
text = '''this is the textfile, and it is used to take words and count'''

word = ''      # this will hold each word
wordList = []  # this will be the collection of words

for ch in text:  # traverse the text character by character
    # if the character is a-z, A-Z or 0-9, it's a valid character: add it to the word
    if (ch >= 'a' and ch <= 'z') or (ch >= 'A' and ch <= 'Z') or (ch >= '0' and ch <= '9'):
        word += ch
    elif ch == ' ':  # a single space acts as a separator
        wordList.append(word)  # append the word to the list
        word = ''  # empty the word to collect the next one

wordList.append(word)  # append the last word, since the loop ended before adding it
print(wordList)

wordCountDict = {}  # empty dictionary which will hold the word counts
for word in wordList:  # traverse the word list
    if wordCountDict.get(word.lower(), 0) == 0:  # if the word doesn't exist, make an entry with value 1
        wordCountDict[word.lower()] = 1
    else:  # if the word exists, increment its value by one
        wordCountDict[word.lower()] = wordCountDict[word.lower()] + 1

print(wordCountDict)
Another approach:
text = '''this is the textfile, and it is used to take words and count'''
for ch in '.\'!")(,;:?-\n':
    text = text.replace(ch, ' ')

wordsArray = text.split(' ')
wordDict = {}
for word in wordsArray:
    if len(word) == 0:
        continue
    else:
        wordDict[word.lower()] = wordDict.get(word.lower(), 0) + 1

print(wordDict)
One more function:
def wcount(filename):
    counts = dict()
    with open(filename) as file:
        a = file.read().split()
        # words = [b.rstrip() for b in a]
        for word in a:
            if word in counts:
                counts[word] += 1
            else:
                counts[word] = 1
    return counts
def play_with_words(input):
    input_split = input.split(",")
    input_split.sort()
    count = {}
    for i in input_split:
        if i in count:
            count[i] += 1
        else:
            count[i] = 1
    return count

input = "i,am,here,where,u,are"
print(play_with_words(input))
Write a Python program to create a list of strings by taking input from the user and then create a dictionary containing each string along with its frequency. (e.g. if the list is [‘apple’, ‘banana’, ‘fig’, ‘apple’, ‘fig’, ‘banana’, ‘grapes’, ‘fig’, ‘grapes’, ‘apple’] then the output should be {'apple': 3, 'banana': 2, 'fig': 3, 'grapes': 2}.)
lst = []
d = dict()
print("ENTER ZERO (0) TO EXIT !!!!!!!!!!!!")
while True:
    user = input('enter string element :: -- ')
    if user == "0":
        break
    else:
        lst.append(user)

print("LIST ELEMENTS ARE :: ", lst)
l = len(lst)
for i in range(l):
    c = 0
    for j in range(l):
        if lst[i] == lst[j]:
            c += 1
    d[lst[i]] = c

print("dictionary is :: ", d)
You can also go with this approach, but you need to store the text file's content in a variable as a string first, after reading the file. This way, you don't need to import any external libraries.
s = "this is the textfile, and it is used to take words and count"
s = s.split(" ")
d = dict()

for i in s:
    c = ""
    if i.isalpha():
        if i not in d:
            d[i] = 1
        else:
            d[i] += 1
    else:
        # keep only the alphabetic characters (e.g. strip the comma from 'textfile,')
        for j in i:
            if j.isalpha():
                c += j
        if c not in d:
            d[c] = 1
        else:
            d[c] += 1

print(d)
Result: {'this': 1, 'is': 2, 'the': 1, 'textfile': 1, 'and': 2, 'it': 1, 'used': 1, 'to': 1, 'take': 1, 'words': 1, 'count': 1}

Trying to manipulate data, assign a list to the first item in a higher list, second item will be information about that list

Ok, I'm trying to transmit a list of values, alongside information regarding that list of values. I am trying to do that while manipulating the data. Let me show you what's going on:
worddictlist2 = []
for innertweet in namelist:
    worddictlist = []
    for tweet in innertweet[0]:
        worddict = {word: tweet.count(word) for word in wordlist}
        worddictlist.append(worddict)
    worddictlist2.append(worddictlist)
namelist is a variable with the following information:
[[['blah blah blah string blah blah blah blah blah blah', 'another string, blah blah blah, string string', 'string string string'], category], ['string string another string, blah', 'more words, more words, etc', 'yet again, here we go'], category2]
I am counting the number of times that a particular word occurs in each phrase. However I still want to keep the category assignment in some way.
I've been trying to append different lists throughout the various loops, I've tried different list comprehensions, and I'm just not seeing the result I want, which will be as follows:
[[{word1: 0, word2: 7, word3: 12, word4: 6}, category], {word1: 3, word2: 9, word3: 1, word4: 2}, category2]]
How can I get this output? Am I doing this inefficiently? The way I am torturing this data makes me feel like I am doing this process inefficiently.
Given data:
category = "C"
category2 = "C2"
namelist = [
[['blah blah blah string blah blah blah blah blah blah', 'another string, blah blah blah, string string', 'string string string'],
category
],
[['string string another string, blah', 'more words, more words, etc', 'yet again, here we go'],
category2
]
]
wordlist = "blah string words".split()
Then this should work as described:
from collections import defaultdict

worddictlist2 = []
for innertweet in namelist:
    worddict = defaultdict(lambda: 0)
    category = innertweet[1]
    for tweet in innertweet[0]:
        for word in wordlist:
            worddict[word] += tweet.count(word)
    # optional - transform the defaultdict into a standard dict to make it printable
    worddictClean = {}
    worddictClean.update(worddict)
    worddictlist2.append([worddictClean, category])

print(worddictlist2)
And it outputs:
[[{'blah': 12, 'string': 7, 'words': 0}, 'C'], [{'blah': 1, 'string': 3, 'words': 2}, 'C2']]
First, in the current code the worddict gets created anew for each tweet, which is probably not what you want.
Also, using the method str.count() you run the risk of counting a word that occurs inside another word: e.g. 'as is the case'.count('as') would be 2 rather than 1, since 'as' appears in the word 'case' as a substring.
I would suggest splitting the tweet on whitespace and then iterating over the unique words in that split instead, like words = tweet.split() and {word: words.count(word) for word in set(words)}, or simply iterating over the words and incrementing the counts in the dictionary for every occurrence of a word; I'm not sure which is more efficient.
So, my suggestion would be
worddictlist2 = []
for innertweet in namelist:
    worddict = {}
    for tweet in innertweet[0]:
        words = tweet.split()
        for word in words:
            if word not in worddict:
                worddict[word] = 1
            else:
                worddict[word] += 1
    worddictlist2.append([worddict, innertweet[1]])
given the input
namelist = [[['blah blah blah string blah blah blah blah blah blah', 'another string, blah blah blah, string string', 'string string string'], 'category'], [['string string another string, blah', 'more words, more words, etc', 'yet again, here we go'], 'category2']]
this code generates
[[{'blah,': 1, 'blah': 11, 'string,': 1, 'string': 6, 'another': 1}, 'category'], [{'string,': 1, 'string': 2, 'again,': 1, 'etc': 1, 'we': 1, 'here': 1, 'blah': 1, 'words,': 2, 'another': 1, 'go': 1, 'yet': 1, 'more': 2}, 'category2']]
In order to get rid of the words with commas attached, you might want to eliminate the punctuation before counting the words, e.g. by adding tweet = re.sub(r'[^a-zA-Z0-9]', ' ', tweet) to the code above:
import re

worddictlist2 = []
for innertweet in namelist:
    worddict = {}
    for tweet in innertweet[0]:
        tweet = re.sub(r'[^a-zA-Z0-9]', ' ', tweet)
        words = tweet.split()
        for word in words:
            if word not in worddict:
                worddict[word] = 1
            else:
                worddict[word] += 1
    worddictlist2.append([worddict, innertweet[1]])

print(worddictlist2)
that yields
[[{'blah': 12, 'string': 7, 'another': 1}, 'category'], [{'again': 1, 'we': 1, 'string': 3, 'etc': 1, 'here': 1, 'blah': 1, 'another': 1, 'words': 2, 'go': 1, 'yet': 1, 'more': 2}, 'category2']]
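As a side note, the str.count() pitfall described earlier is easy to demonstrate, and collections.Counter packages the split-and-count idea into a single call (a sketch on a made-up tweet):

```python
from collections import Counter

tweet = 'as is the case'

# str.count matches substrings, so 'as' inside 'case' is counted too;
# Counter on the split words counts whole words only
counts = Counter(tweet.split())
print(tweet.count('as'))  # 2
print(counts['as'])       # 1
```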
Perhaps like this:
worddictlist2 = []
wdlist = {}
for innertweet, cat in namelist:
    for i in innertweet:
        for j in i.split():
            j = j.strip(',')  # strip comma
            wdlist.setdefault(j, 0)  # in case 'j' is an unknown key
            wdlist[j] += 1
    worddictlist2.append([wdlist, cat])
    wdlist = {}

print(worddictlist2)
print(worddictlist2)
gives:
[
[{'another': 1, 'blah': 12, 'string': 7}, 'category'],
[{'again': 1, 'another': 1, 'blah': 1, 'etc': 1, 'go': 1, 'here': 1, 'more': 2, 'string': 3, 'we': 1, 'words': 2, 'yet': 1}, 'category2']
]
