Beginner here. I'm currently writing a program that will turn every word in a "movie reviews" text file into a key, storing a list value containing the review number and the number of times the word has been seen. For example:
4 I loved it
1 I hated it
... might look like this as a dictionary:
words['i'] = [5,2]
words['loved'] = [4,1]
words['it'] = [5,2]
words['hated'] = [1,1]
However, this is the output I've been getting:
{'i': [1, 2], 'loved': [4, 1], 'it': [1, 2], 'hated': [1, 1]}
I figured out the counter part, but I can't figure out how to update the review number. Here is my code so far:
def main():
    reviews = open("testing.txt", "r")
    data = reviews.read()
    reviews.close()
    # create new dictionary
    words = {}
    # iterate over every review in text file
    splitlines = data.split("\n")
    for line in splitlines:
        lower = line.lower()
        value = lower.split()
        rev = int(value[0])
        for word in value:
            if word.isalpha():
                count = 1
                if word not in words:
                    words[word] = [rev, count]
                else:
                    words[word] = [rev, count + 1]
How can I update the review number count?
This is pretty easy to do. Assuming each key has only 2 items in the value list:
if word not in words:
    words[word] = [rev, 1]
else:
    temp = words[word][1]
    words[word] = [rev, temp + 1]
When updating the count, you're using count + 1, but count will always be 1 here; you need to retrieve the existing count first, using something like: count = words[word][1]
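If the goal is the result shown in the question (words['i'] = [5, 2]), the review number has to be accumulated as well, not just the count. A minimal sketch, assuming every non-blank line starts with an integer review number; the function name build_word_index and the inline sample lines are made up for illustration and stand in for the contents of testing.txt:

```python
def build_word_index(lines):
    """Map each word to [sum of review numbers, number of occurrences]."""
    words = {}
    for line in lines:
        tokens = line.lower().split()
        if not tokens:
            continue  # skip blank lines, e.g. from a trailing newline
        rev = int(tokens[0])
        for word in tokens[1:]:
            if not word.isalpha():
                continue
            if word not in words:
                words[word] = [rev, 1]
            else:
                words[word][0] += rev  # add this review's number
                words[word][1] += 1    # one more occurrence
    return words


sample = ["4 I loved it", "1 I hated it"]
print(build_word_index(sample))
# {'i': [5, 2], 'loved': [4, 1], 'it': [5, 2], 'hated': [1, 1]}
```

Updating the stored list in place avoids rebuilding it on every occurrence.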
Link to problem statement
Please help. I am very confused on how to execute this:
This is what I currently have:
def similarityAnalysis(paragraph1, paragraph2):
    dict = {}
    for word in lst:
        if word in dict:
            dict[word] = dict[word] + 1
        else:
            dict[word] = 1
    for key, vale in dict.items():
        print(key, val)
See below.
To find the common words, we use set intersection.
For counting, we use a dict.
Code
lst1 = ['jack','Jim','apple']
lst2 = ['chair','jack','ball','steve']
common = set.intersection(set(lst1),set(lst2))
print('common words below:')
print(common)
print()
print('counter below:')
counter = dict()
for word in lst1:
    if word not in counter:
        counter[word] = [0,0]
    counter[word][0] += 1
for word in lst2:
    if word not in counter:
        counter[word] = [0,0]
    counter[word][1] += 1
print(counter)
output
common words below:
{'jack'}

counter below:
{'jack': [1, 1], 'Jim': [1, 0], 'apple': [1, 0], 'chair': [0, 1], 'ball': [0, 1], 'steve': [0, 1]}
Analysing your code as follows:
You use the variable name dict, which is the name of a built-in type (used for creating dictionaries). It is not a reserved keyword, so Python allows this, but by using it as a variable name you shadow the built-in and lose the ability to call dict() inside that function.
The function uses a variable named lst which is not one of its arguments. Where do the values for this variable come from?
In the second for loop, you use the variable name vale but then later reference a different variable called val.
Otherwise, looks good. There may be other issues; that's as far as I got.
I recommend googling the following and seeing what code you find:
"Python count the number of words in a paragraph"
Update:
There are many ways to do this, but here's one answer:
def word_counts(lst):
    counts = {}
    for word in lst:
        counts[word] = counts.get(word, 0) + 1
    return counts

def similarityAnalysis(paragraph1, paragraph2):
    lst1 = paragraph1.split()
    lst2 = paragraph2.split()
    counts1 = word_counts(lst1)
    counts2 = word_counts(lst2)
    common_words = set(lst1).intersection(lst2)
    return {word: (counts1[word], counts2[word]) for word in common_words}
paragraph1 = 'one three two one two four'
paragraph2 = 'one two one three three one'
print(similarityAnalysis(paragraph1, paragraph2))
Output:
{'three': (1, 2), 'one': (2, 3), 'two': (2, 1)}
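The same idea can probably be written more compactly with collections.Counter. A rough sketch, not necessarily what the assignment expects; the name similarity_analysis is just a snake_case variant chosen here, and the common words are sorted only to make the output order predictable:

```python
from collections import Counter


def similarity_analysis(paragraph1, paragraph2):
    """Return {word: (count in paragraph1, count in paragraph2)} for shared words."""
    counts1 = Counter(paragraph1.split())
    counts2 = Counter(paragraph2.split())
    # dict key views support set operations such as intersection (&)
    common = counts1.keys() & counts2.keys()
    return {word: (counts1[word], counts2[word]) for word in sorted(common)}


result = similarity_analysis('one three two one two four',
                             'one two one three three one')
print(result)
# {'one': (2, 3), 'three': (1, 2), 'two': (2, 1)}
```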
I'm new to python and I'm learning it slowly. I'm trying to code a simple word counter that tracks instances of words across multiple lines. I'm attempting to place the line into a list and then track each list point in a dictionary, whilst removing each word from the list as the dictionary is updated. So far I have:
dic = {}
count = ''
liste = line.split()
listes = liste[0]
num = 0
while line:
    while not liste:
        listes = liste[0]
        if listes in dic:
            count = str(dic[listes])
            count = count.rstrip("]")
            count = count.lstrip("[")
            count = int(count) + 1
            liste.pop(0)
        else:
            skadoing = 1
            dic [listes] = [skadoing]
    line = input("Enter line: ")
for word in sorted(dic):
    print(word, dic[word])
When run, it currently outputs the following:
Enter line: which witch
Enter line: is which
Enter line:
which ['']
I need it to output this:
Enter line: which witch
Enter line: is which
Enter line:
is 1
which 2
witch 1
liste is the list of words from the inputted line and listes is the word that I'm trying to update in the dictionary.
Any ideas?
I believe this is what you're looking to achieve:
dic = {}
line = input("Enter line: ")
while line:
    for word in line.split(" "):
        if word not in dic:
            dic[word] = 1
        else:
            dic[word] += 1
    line = input("Enter line: ")
for word in sorted(dic):
    print(word, dic[word])
Output:
Enter line: hello world
Enter line: world
Enter line:
hello 1
world 2
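One thing to watch with split(" "): consecutive or trailing spaces produce empty strings, which would then be counted as words. Calling split() with no argument splits on any run of whitespace instead. A small illustration (the sample string is made up):

```python
line = "which  witch "  # two spaces in the middle, one trailing space

by_space = line.split(" ")    # splits on every single space, keeps empty strings
by_whitespace = line.split()  # splits on runs of whitespace, drops empty strings

print(by_space)       # ['which', '', 'witch', '']
print(by_whitespace)  # ['which', 'witch']
```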
If you really want to implement this by yourself and count the words, then it would be great to use defaultdict:
from collections import defaultdict
sentence = '''this is a test for which is witch and which
because of which'''
words = sentence.split()
d = defaultdict(int)
for word in words:
    d[word] = d[word] + 1
print(dict(d))  # convert to a plain dict so the output is not wrapped in defaultdict(...)
Output:
{'this': 1, 'is': 2, 'a': 1, 'test': 1, 'for': 1, 'which': 3, 'witch': 1, 'and': 1, 'because': 1, 'of': 1}
Maybe you can use the collections module to do the job:
from collections import Counter
line = input("Enter line: ")
words = line.split(" ")
word_count = dict(Counter(words))
print(word_count)
Enter line: hi how are you are you fine
{'hi': 1, 'how': 1, 'are': 2, 'you': 2, 'fine': 1}
Hope this helps!!
My code creates a vector-based bag-of-words for every document I am processing.
It works and prints the frequency of every single word in the document. Additionally, I would like to print each word right in front of its count, like this:
['word', 15]
I tried it on my own. What I get right now looks like this:
This is my code:
for doc in docsClean:
    bag_vector = np.zeros(len(doc))
    for w in doc:
        for i,word in enumerate(doc):
            if word == w:
                bag_vector[i] += 1
    print(bag_vector)
    print("{0},{1}\n".format(w,bag_vector[i]))
I would suggest using a dict to store the frequency of each word.
There is already an inbuilt python feature to do this - collections.Counter.
from collections import Counter
# Random words
words = ['lacteal', 'brominating', 'postmycotic', 'legazpi', 'enclosing', 'arytaenoid', 'brominating', 'postmycotic', 'legazpi', 'enclosing']
frequency = Counter(words)
print(frequency)
Output:
Counter({'brominating': 2, 'postmycotic': 2, 'legazpi': 2, 'enclosing': 2, 'lacteal': 1, 'arytaenoid': 1})
If, for any reason, you don't want to use collections.Counter, here is a simple snippet that does the same task.
words = ['lacteal', 'brominating', 'postmycotic', 'legazpi', 'enclosing', 'arytaenoid', 'brominating', 'postmycotic', 'legazpi', 'enclosing']
freq = {} # Empty dict
for word in words:
    freq[word] = freq.get(word, 0) + 1
print(freq)
This code works by adding 1 to the frequency of word, if it is already present in freq, otherwise freq.get(word, 0) returns 0, so the frequency of a new word gets stored as 1.
Output:
{'lacteal': 1, 'brominating': 2, 'postmycotic': 2, 'legazpi': 2, 'enclosing': 2, 'arytaenoid': 1}
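To get the ['word', 15] style the question asks for, the counts can be turned into [word, count] pairs. A minimal sketch, assuming each document is a list of already-cleaned tokens; the doc list below is a made-up stand-in for one entry of docsClean:

```python
from collections import Counter

# Hypothetical tokenised document standing in for one entry of docsClean
doc = ['lacteal', 'brominating', 'postmycotic', 'brominating']

# most_common() yields (word, count) tuples, highest count first;
# words with equal counts keep the order in which they were first seen
pairs = [[word, count] for word, count in Counter(doc).most_common()]

for pair in pairs:
    print(pair)
# ['brominating', 2]
# ['lacteal', 1]
# ['postmycotic', 1]
```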
I have code which saves all the words in a sentence to a text file and saves the list of positions to another text file.
Rather than saving all the words to the list, I'm trying to find a method that saves each word only once, to avoid duplication.
Additionally, when a word appears more than once, my list of positions records the position of its first appearance, which is fine, but it then skips a position. For example, I get [1,2,3,2,5], where the last position should be 4 rather than 5 because position 4 is never used, if that makes sense.
I don't expect anyone to do this for me, but is there a method I should be using, e.g. "if word in sentence, do x", or enumerate()?
Here is my code:
#SUBROUTINES
def saveItem():
    #save an item into a new file
    print("creating a text file with the write() method")
    textfile=open("positions.txt","w")
    textfile.write(positions)
    textfile.write("\n")
    textfile.close()
    print("The file has been added!")
#SUBROUTINES
def saveItem2():
    #save an item into a new file
    print("creating a text file with the write() method")
    textfile=open("words.txt","w")
    textfile.write(str(words))
    textfile.write("\n")
    textfile.close()
    print("The file has been added!")
#mainprogram
sentence = input("Write your sentence here ")
words = sentence.split()
positions = str([words.index(word) + 1 for word in words])
print (sentence)
print (positions)
#we have finished with the file now.
a=True
while a:
    print("what would you like to do?:\n\
1.Save a list of words?\n\
2.Save a list of positions?\n\
3.quit?\n\:")
    z=int(input())
    if z == 1:
        saveItem()
    elif z==2:
        saveItem2()
    elif z ==3:
        print("Goodbye!!!")
        a=False
    else:
        print("incorrect option")
Sample input sentence:
Programming is great Programming is so much fun
Sample list of words stored in text file:
['Programming','is','great','Programming','is','so','much','fun']
(the words are repeated)
Sample positions:
[1,2,3,1,2,6,7,8]
Instead I'd like the list to be stored like:
['Programming','is','great','so','much','fun']
and the list of positions like:
[1,2,3,1,2,4,5,6]
Haven't tested it but I think this should work:
from collections import Counter
sentence = input(">>> ")  # input() in Python 3; raw_input() exists only in Python 2
words, positions, d = [], [], {}
for i, word in enumerate(sentence.split(' ')):
    if word not in d:
        d[word] = i
        words.append(word)
    positions.append(d[word])
# To further process the list
c, new_positions = Counter(positions), []
cnt = list(i for i in range(len(positions)+1) if not(i in c and c[i]>1))
new_positions = [p if c[p]>1 else cnt.pop(0) for p in positions]
print("Words list:", words)
print("Positions list:", positions)
print("New Positions list:", new_positions)
# store the positions result
with open('positions.txt','w') as f:
    f.write(' '.join(map(str,new_positions)))
# store the words result
with open('words.txt','w') as w:
    w.write(' '.join(words))
Output:
$ ./test.py
>>> Programming is great Programming is so much fun
Words list: ['Programming', 'is', 'great', 'so', 'much', 'fun']
Positions list: [0, 1, 2, 0, 1, 5, 6, 7]
New Positions list: [0, 1, 2, 0, 1, 3, 4, 5]
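A simpler way to get the numbering the question asks for may be to store, for each word, its index in the list of unique words. The sketch below is a different technique from the Counter-based renumbering above, and the function name unique_words_and_positions is made up for illustration. It uses 1-based positions to match the question's sample:

```python
def unique_words_and_positions(sentence):
    """Return (unique words in first-seen order, 1-based position of each word)."""
    words = []      # each distinct word, once
    index_of = {}   # word -> its 1-based position in `words`
    positions = []  # one entry per word of the sentence
    for word in sentence.split():
        if word not in index_of:
            words.append(word)
            index_of[word] = len(words)
        positions.append(index_of[word])
    return words, positions


words, positions = unique_words_and_positions(
    "Programming is great Programming is so much fun")
print(words)      # ['Programming', 'is', 'great', 'so', 'much', 'fun']
print(positions)  # [1, 2, 3, 1, 2, 4, 5, 6]
```

Because a new number is only handed out when a new word is appended, the positions can never skip a value.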
I am learning some basic Python 3 and have been stuck on this problem for 2 days now, and I can't seem to get anywhere...
I've been reading the "Think Python" book and I'm working on chapter 13 and the case study it contains. The chapter is all about reading a file and doing some magic with it, like counting the total number of words and finding the most used words.
One part of the program is about "dictionary subtraction", where the program fetches all the words from one text file that are not found in another text file.
What I also need the program to do is find the most common words in the first file, excluding the words found in the "dictionary" text file. This functionality has had me stuck for two days and I don't really know how to solve it...
The code for my program is as follows:
import string

def process_file(filename):
    hist = {}
    fp = open(filename)
    for line in fp:
        process_line(line, hist)
    return hist

def process_line(line, hist):
    line = line.replace('-', ' ')
    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()
        hist[word] = hist.get(word, 0) + 1

def most_common(hist):
    t = []
    for key, value in hist.items():
        t.append((value, key))
    t.sort()
    t.reverse()
    return t

def subtract(d1, d2):
    res = {}
    for key in d1:
        if key not in d2:
            res[key] = None
    return res

hist = process_file('alice-ch1.txt')
words = process_file('common-words.txt')
diff = subtract(hist, words)

def total_words(hist):
    return sum(hist.values())

def different_words(hist):
    return len(hist)

if __name__ == '__main__':
    print ('Total number of words:', total_words(hist))
    print ('Number of different words:', different_words(hist))
    t = most_common(hist)
    print ('The most common words are:')
    for freq, word in t[0:7]:
        print (word, '\t', freq)
    print("The words in the book that aren't in the word list are:")
    for word in diff.keys():
        print(word)
I then created a test dict containing a few words with made-up counts, plus a test list, to try to solve my problem. The code for that is:
histfake = {'hello': 12, 'removeme': 2, 'hi': 3, 'fish':250, 'chicken':55, 'cow':10, 'bye':20, 'the':93, 'she':79, 'to':75}
listfake =['removeme', 'fish']
newdict = {}
for key, val in histfake.items():
    for commonword in listfake:
        if key != commonword:
            newdict[key] = val
        else:
            newdict[key] = 0
sortcommongone = []
for key, value in newdict.items():
    sortcommongone.append((value, key))
sortcommongone.sort()
sortcommongone.reverse()
for freq, word in sortcommongone:
    print(word, '\t', freq)
The problem is that this code only works for one word. Only one matched word between the dict and the list gets the value 0. (I thought I could give the duplicate words the value 0, since I only need the 7 most common words that are not found in the common-words text file.)
How can I solve this? I created an account here just to try to get some help with this, since Stack Overflow has helped me before with other problems. But this time I needed to ask the question myself. Thanks!
You can filter out the items using a dict comprehension
>>> {key: value for key, value in histfake.items() if key not in listfake}
{'hi': 3, 'she': 79, 'to': 75, 'cow': 10, 'bye': 20, 'chicken': 55, 'the': 93, 'hello': 12}
Unless listfake is larger than histfake, the most efficient way is to delete the keys listed in listfake:
for key in listfake:
    histfake.pop(key, None)  # unlike del, this does not raise KeyError if the key is absent
The dict comprehension visits every entry of histfake, while this loop only visits the entries of listfake, and the list is supposedly much shorter than the dictionary.
EDIT:
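Or, if listfake has more entries than histfake has keys, it may be done the other way round:
for key in list(histfake):  # iterate over a copy of the keys; deleting while iterating the dict itself raises RuntimeError
    if key in listfake:
        del histfake[key]
You may want to time both versions.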
Then, of course, you'll have to sort the dictionary's items into a list and rebuild a dictionary from the top entries:
from operator import itemgetter
most_common_7 = dict(sorted(histfake.items(), key=itemgetter(1), reverse=True)[:7])
BTW, you may use Counter from collections to count words. And maybe part of your problem is that you don't remove all non-letter characters from your text.
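As a rough sketch of the Counter suggestion, using the test data from the question: the excluded words are put in a set so each membership test is cheap, and most_common(7) does the sorting and slicing in one step. The names excluded, filtered and top7 are chosen here for illustration.

```python
from collections import Counter

histfake = {'hello': 12, 'removeme': 2, 'hi': 3, 'fish': 250, 'chicken': 55,
            'cow': 10, 'bye': 20, 'the': 93, 'she': 79, 'to': 75}
listfake = ['removeme', 'fish']

excluded = set(listfake)  # set membership tests are O(1) on average

# Keep only the words that are not in the exclusion list
filtered = Counter({word: count for word, count in histfake.items()
                    if word not in excluded})

top7 = filtered.most_common(7)  # list of (word, count), highest count first
for word, freq in top7:
    print(word, '\t', freq)
```

With this data the loop prints the, she, to, chicken, bye, hello and cow, in that order.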