This is actually a 4-part question:
1) Returns a dictionary in which each key is a word length and its value is the number of words with that length.
e.g. if the input file's text is "Hello Python people Welcome to the world of Python", then the dictionary should be:
{2: 2, 3: 1, 5: 2, 6: 3, 7: 1}
2) Returns a dictionary in which each key is a word and its value is the number of occurrences of that word.
e.g. {'hello': 1, 'of': 1, 'people': 1, 'python': 2, 'the': 1, 'to':
1,'welcome': 1, 'world': 1}
I already completed the first two parts with the code below.
def make_length_wordcount(x):
    filename = x + '.txt'
    infile = open(filename)
    wordlist = infile.read().split()
    counter1 = {}
    for word in wordlist:
        if len(word) in counter1:
            counter1[len(word)] += 1
        else:
            counter1[len(word)] = 1
    infile.close()
    print(counter1)
def make_word_count(string):
    words = string.lower().split()
    dictionary = {}
    for word in words:
        dictionary[word] = 0
    for word in words:
        dictionary[word] += 1
    print(dictionary)
I'm having trouble figuring out how to do parts 3) and 4):
3) Uses the two functions above - make_length_wordcount() and make_word_count() - to construct (i) a length-wordcount dictionary and (ii) a word-count dictionary.
Opens a new output file "FILE_analyzed_FIRST_LAST.txt" and writes the two dictionaries into this file (in the format below). For example, if the input file is "test.txt", the output file name is
"test_analyzed_HYUN_KANG.txt" and it should contain the following lines:
Words of length 2 : 2
Words of length 3 : 1
Words of length 5 : 2
Words of length 6 : 3
Words of length 7 : 1
to : 1
of : 1
people : 1
the : 1
python : 2
welcome : 1
hello : 1
world : 1
4) In the "hw2_FIRST_LAST.py" file, run the analyze_text() function three times with the following inputs:
a. "nasdaq.txt"
b. "raven.txt"
c. "frankenstein.txt"
Your hw2.py code should generate the following three files:
"nasdaq_analyzed_FIRST_LAST.txt", "raven_analyzed_FIRST_LAST.txt",
"frankenstein_analyzed_FIRST_LAST.txt"
My instructor didn't really teach us anything about writing files, so this is very confusing to me.
A few things first:
1) you can avoid the

if len(word) in counter1:
    counter1[len(word)] += 1
else:
    counter1[len(word)] = 1

pattern by using defaultdict or Counter from the collections module (note that counter1 is keyed by word length, so the lookup stays len(word)):

from collections import defaultdict

counter1 = defaultdict(int)
for word in wordlist:
    counter1[len(word)] += 1
Same applies to make_word_count:

from collections import Counter

def make_word_count(string):
    words = string.lower().split()
    dictionary = Counter(words)
    print(dictionary.most_common(10))
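Counter also works for the length histogram, since it accepts any iterable. An untested sketch using the example sentence from the question:

```python
from collections import Counter

# Example sentence from the question
text = "Hello Python people Welcome to the world of Python"

# Feeding Counter the word lengths builds the whole histogram in one line
length_counts = Counter(len(w) for w in text.split())
print(dict(sorted(length_counts.items())))  # {2: 2, 3: 1, 5: 2, 6: 3, 7: 1}
```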
For your 3rd point (I didn't test, but you get the idea):

def make_text_wordlen(counter1):
    text = ''
    for wordlen, value in counter1.items():
        text += f'Words of length {wordlen} : {value}\n'
    with open('your_file.txt', 'w') as f:
        f.write(text)

def make_text_wordcount(dictionary):
    text = ''
    for word, count in dictionary.items():
        text += f'{word} : {count}\n'
    with open('your_file.txt', 'a') as f:  # 'a' appends to the existing file
        f.write(text)
I'll let you figure out the 4th point.
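That said, here is an untested sketch of how the pieces might be wired into an analyze_text() function. The helper bodies here return dicts instead of printing them, and the file-name pattern is taken from the assignment text; adapt the names to your own code:

```python
def make_length_wordcount(text):
    # Histogram of word lengths
    counts = {}
    for word in text.split():
        counts[len(word)] = counts.get(len(word), 0) + 1
    return counts

def make_word_count(text):
    # Histogram of lowercased words
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def analyze_text(name):
    # Reads "<name>.txt" and writes "<name>_analyzed_FIRST_LAST.txt"
    with open(name + '.txt') as infile:
        text = infile.read()
    lengths = make_length_wordcount(text)
    words = make_word_count(text)
    with open(name + '_analyzed_FIRST_LAST.txt', 'w') as outfile:
        for length in sorted(lengths):
            outfile.write('Words of length {} : {}\n'.format(length, lengths[length]))
        for word, count in words.items():
            outfile.write('{} : {}\n'.format(word, count))

# Part 4 is then just three calls:
# for name in ('nasdaq', 'raven', 'frankenstein'):
#     analyze_text(name)
```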
Beginner here. I'm currently writing a program that will turn every word in a "movie reviews" text file into a key, storing a list value containing the review number and the number of times the word has been seen. For example:
4 I loved it
1 I hated it
... might look like this as a dictionary:
words['i'] = [5,2]
words['loved'] = [4,1]
words['it'] = [5,2]
words['hated'] = [1,1]
However, this is the output I've been getting:
{'i': [1, 2], 'loved': [4, 1], 'it': [1, 2], 'hated': [1, 1]}
I figured out the counter part, but I can't figure out how to update the review number. Here is my code so far:
def main():
    reviews = open("testing.txt", "r")
    data = reviews.read()
    reviews.close()
    # create new dictionary
    words = {}
    # iterate over every review in text file
    splitlines = data.split("\n")
    for line in splitlines:
        lower = line.lower()
        value = lower.split()
        rev = int(value[0])
        for word in value:
            if word.isalpha():
                count = 1
                if word not in words:
                    words[word] = [rev, count]
                else:
                    words[word] = [rev, count + 1]
How can I update the review number count?
This is pretty easy to do. Assuming each key has only 2 items in the value list:

if word not in words:
    words[word] = [rev, 1]
else:
    words[word][0] += rev
    words[word][1] += 1
When updating the count, you're using count + 1, but count will always be 1 here; you need to add to the stored values instead: retrieve the existing count with words[word][1] and add 1, and add rev to the stored review number words[word][0] so you get running totals like [5, 2].
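Put together, an untested sketch of the whole loop with that fix, using the two example reviews as inline data (this assumes the first slot should hold the running sum of review numbers, which is what the expected output in the question suggests):

```python
# Inline stand-in for the file contents
reviews = ["4 I loved it", "1 I hated it"]

words = {}
for line in reviews:
    value = line.lower().split()
    rev = int(value[0])            # leading review number on each line
    for word in value:
        if word.isalpha():         # skips the review number itself
            if word not in words:
                words[word] = [rev, 1]
            else:
                words[word][0] += rev   # running sum of review numbers
                words[word][1] += 1     # occurrence count
print(words)  # {'i': [5, 2], 'loved': [4, 1], 'it': [5, 2], 'hated': [1, 1]}
```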
My code creates a vector-based bag-of-words for every document I am processing.
It works and prints the frequency of every single word in the document. Additionally, I would like to print every word right in front of its count, like this:
['word', 15]
I tried it on my own. What I get right now looks like this:
This is my code:
for doc in docsClean:
    bag_vector = np.zeros(len(doc))
    for w in doc:
        for i, word in enumerate(doc):
            if word == w:
                bag_vector[i] += 1
    print(bag_vector)
    print("{0},{1}\n".format(w, bag_vector[i]))
I would suggest using a dict to store the frequency of each word.
There is already a built-in Python feature for this: collections.Counter.
from collections import Counter
# Random words
words = ['lacteal', 'brominating', 'postmycotic', 'legazpi', 'enclosing', 'arytaenoid', 'brominating', 'postmycotic', 'legazpi', 'enclosing']
frequency = Counter(words)
print(frequency)
Output:
Counter({'brominating': 2, 'postmycotic': 2, 'legazpi': 2, 'enclosing': 2, 'lacteal': 1, 'arytaenoid': 1})
If, for any reason, you don't want to use collections.Counter, here is some simple code to do the same task.
words = ['lacteal', 'brominating', 'postmycotic', 'legazpi', 'enclosing', 'arytaenoid', 'brominating', 'postmycotic', 'legazpi', 'enclosing']
freq = {}  # Empty dict
for word in words:
    freq[word] = freq.get(word, 0) + 1
print(freq)
This code works by adding 1 to the frequency of word if it is already present in freq; otherwise freq.get(word, 0) returns 0, so the frequency of a new word gets stored as 1.
Output:
{'lacteal': 1, 'brominating': 2, 'postmycotic': 2, 'legazpi': 2, 'enclosing': 2, 'arytaenoid': 1}
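To get the ['word', 15]-style output the question asks for, the (word, count) pairs can be reshaped into lists. A sketch building on the freq dict above:

```python
# Same word list as above
words = ['lacteal', 'brominating', 'postmycotic', 'legazpi', 'enclosing',
         'arytaenoid', 'brominating', 'postmycotic', 'legazpi', 'enclosing']

freq = {}
for word in words:
    freq[word] = freq.get(word, 0) + 1

# Reshape each (word, count) pair into a ['word', count] list
pairs = [[word, count] for word, count in freq.items()]
for pair in pairs:
    print(pair)  # e.g. ['lacteal', 1]
```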
I have code which saves all the words in a sentence to a text file and saves the list of positions to another text file.
Rather than saving every word to the list, I'm trying to find a method so that each word is only saved once, to avoid duplication.
Additionally, for my list of positions: if a word appears more than once, it saves the position of the word's first appearance, which is fine, but then it skips a position, e.g. [1,2,3,2,5] - rather than the last position being 5 it should be 4, as there's no position 4, if that makes sense.
I don't expect anyone to do this for me, but is there a method I should be using, e.g. "if word in sentence do x" or enumerate()?
Here is my code:
#SUBROUTINES
def saveItem():
    #save an item into a new file
    print("creating a text file with the write() method")
    textfile = open("positions.txt", "w")
    textfile.write(positions)
    textfile.write("\n")
    textfile.close()
    print("The file has been added!")

#SUBROUTINES
def saveItem2():
    #save an item into a new file
    print("creating a text file with the write() method")
    textfile = open("words.txt", "w")
    textfile.write(str(words))
    textfile.write("\n")
    textfile.close()
    print("The file has been added!")

#mainprogram
sentence = input("Write your sentence here ")
words = sentence.split()
positions = str([words.index(word) + 1 for word in words])
print(sentence)
print(positions)

#we have finished with the file now.
a = True
while a:
    print("what would you like to do?:\n\
1.Save a list of words?\n\
2.Save a list of positions?\n\
3.quit?\n\:")
    z = int(input())
    if z == 1:
        saveItem()
    elif z == 2:
        saveItem2()
    elif z == 3:
        print("Goodbye!!!")
        a = False
    else:
        print("incorrect option")
Sample input sentence:
Programming is great Programming is so much fun
Sample list of words stored in text file:
['Programming','is','great','Programming','is','so','much','fun']
(the words are repeated)
Sample positions:
[1,2,3,1,2,6,7,8]
Instead I'd like the list to be stored like:
['Programming','is','great','so','much','fun']
and the list of positions like:
[1,2,3,1,2,4,5,6]
Haven't tested it, but I think this should work:

from collections import Counter

sentence = input(">>> ")
words, positions, d = [], [], {}
for i, word in enumerate(sentence.split(' ')):
    if word not in d:
        d[word] = i
        words.append(word)
    positions.append(d[word])

# To further process the list
c, new_positions = Counter(positions), []
cnt = list(i for i in range(len(positions) + 1) if not (i in c and c[i] > 1))
new_positions = [p if c[p] > 1 else cnt.pop(0) for p in positions]

# store the positions result
with open('positions.txt', 'w') as f:
    f.write(' '.join(map(str, new_positions)))

# store the words result
with open('words.txt', 'w') as w:
    w.write(' '.join(words))

print('Words list:', words)
print('Positions list:', positions)
print('New Positions list:', new_positions)
Output:
$ ./test.py
>>> Programming is great Programming is so much fun
Words list: ['Programming', 'is', 'great', 'so', 'much', 'fun']
Positions list: [0, 1, 2, 0, 1, 5, 6, 7]
New Positions list: [0, 1, 2, 0, 1, 3, 4, 5]
I am learning some basic Python 3 and have been stuck on this problem for 2 days now, and I can't seem to get anywhere.
I've been reading the "Think Python" book and I'm working on chapter 13 and the case study it contains. The chapter is all about reading a file and doing some magic with it, like counting the total number of words and the most used words.
One part of the program is about "dictionary subtraction", where the program fetches all the words from one text file that are not found in another text file.
What I also need the program to do is count the most common words from the first file, excluding the words found in the "dictionary" text file. This functionality has had me stuck for two days and I don't really know how to solve it.
The code for my program is as follows:
import string

def process_file(filename):
    hist = {}
    fp = open(filename)
    for line in fp:
        process_line(line, hist)
    return hist

def process_line(line, hist):
    line = line.replace('-', ' ')
    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()
        hist[word] = hist.get(word, 0) + 1

def most_common(hist):
    t = []
    for key, value in hist.items():
        t.append((value, key))
    t.sort()
    t.reverse()
    return t

def subtract(d1, d2):
    res = {}
    for key in d1:
        if key not in d2:
            res[key] = None
    return res

hist = process_file('alice-ch1.txt')
words = process_file('common-words.txt')
diff = subtract(hist, words)

def total_words(hist):
    return sum(hist.values())

def different_words(hist):
    return len(hist)

if __name__ == '__main__':
    print('Total number of words:', total_words(hist))
    print('Number of different words:', different_words(hist))
    t = most_common(hist)
    print('The most common words are:')
    for freq, word in t[0:7]:
        print(word, '\t', freq)
    print("The words in the book that aren't in the word list are:")
    for word in diff.keys():
        print(word)
I then created a test dict containing a few words with imaginary occurrence counts, and a test list, to try and solve my problem. The code for that is:
histfake = {'hello': 12, 'removeme': 2, 'hi': 3, 'fish': 250, 'chicken': 55,
            'cow': 10, 'bye': 20, 'the': 93, 'she': 79, 'to': 75}
listfake = ['removeme', 'fish']

newdict = {}
for key, val in histfake.items():
    for commonword in listfake:
        if key != commonword:
            newdict[key] = val
        else:
            newdict[key] = 0

sortcommongone = []
for key, value in newdict.items():
    sortcommongone.append((value, key))
sortcommongone.sort()
sortcommongone.reverse()
for freq, word in sortcommongone:
    print(word, '\t', freq)
The problem is that this code only works for one word: only one matched word between the dict and the list gets the value 0. (I thought I could give the duplicate words the value 0, since I only need the 7 most common words that are not found in the common-words text file.)
How can I solve this? I created an account here just to try and get some help with this, since Stack Overflow has helped me before with other problems, but this time I needed to ask the question myself. Thanks!
You can filter out the items using a dict comprehension
>>> {key: value for key, value in histfake.items() if key not in listfake}
{'hi': 3, 'she': 79, 'to': 75, 'cow': 10, 'bye': 20, 'chicken': 55, 'the': 93, 'hello': 12}
Unless listfake is larger than histfake, the most efficient way will be to delete the keys in listfake:

for key in listfake:
    del histfake[key]
The complexity of both the dict comprehension and this solution is O(n), but the list is supposedly much shorter than the dictionary.
EDIT:
Or, if you have more keys than actual words, it may be done as:

for key in list(histfake):
    if key in listfake:
        del histfake[key]

You may want to test the run time. Note the list(histfake) copy: deleting from a dict while iterating over it directly raises a RuntimeError.
Then, of course, you'll have to sort the dictionary into a list and recreate it:

from operator import itemgetter

most_common_7 = dict(sorted(histfake.items(), key=itemgetter(1), reverse=True)[:7])
BTW, you may use Counter from collections to count words. And maybe part of your problem is that you don't remove all non-letter characters from your text.
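For example, an untested sketch of the Counter route, using the fake data from the question:

```python
from collections import Counter

# Fake data from the question
histfake = {'hello': 12, 'removeme': 2, 'hi': 3, 'fish': 250, 'chicken': 55,
            'cow': 10, 'bye': 20, 'the': 93, 'she': 79, 'to': 75}
listfake = ['removeme', 'fish']

# Drop the common words, then let Counter handle the sorting
remaining = Counter({k: v for k, v in histfake.items() if k not in listfake})
print(remaining.most_common(7))  # [('the', 93), ('she', 79), ...]
```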
I am working on a script that should read a text file and test whether the specified letters (a, a, r, d, v, a, r, k) are on each line. I'm having a problem, as I am trying to check for 3 occurrences of 'a' instead of just one. My code is below:
#Variables
advk = ['a','a','r','d','v','a','r','k']
textfile = []
number = 0
life = 0

for line in open('input.txt', 'rU'):
    textfile.append(line.rstrip().lower())

while life == 0:
    if all(word in textfile[number] for word in advk):
        printed = number + 1
        print("Aardvark on line " + str(printed))
        number += 1
        if number == len(textfile):
            life += 1
    else:
        number += 1
Every time you want to count something in Python, keep the Counter class in mind.
from collections import Counter

advk = Counter(['a','a','r','d','v','a','r','k'])
with open('input.txt', 'rU') as file:
    for i, line in enumerate(file.readlines()):
        if not advk - Counter(line.lower()):
            print("Aardvark on line " + str(i+1))
Given the input line
dffdaardvarksdsda
the two Counters would look like this:
Counter({'d': 5, 'a': 4, 'f': 2, 's': 2, 'r': 2, 'k': 1, 'v': 1})
and
Counter({'a': 3, 'r': 2, 'd': 1, 'k': 1, 'v': 1})
for your list of letters to search.
We use a trick: simply subtract the two Counters, advk - Counter(line.lower()), and check whether the resulting Counter has no elements left.
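A quick demonstration of the subtraction trick in isolation, with no file involved:

```python
from collections import Counter

advk = Counter(['a', 'a', 'r', 'd', 'v', 'a', 'r', 'k'])

# Every required letter occurs often enough -> empty Counter -> match
print(not (advk - Counter('dffdaardvarksdsda')))  # True

# 'ardvrk' only has one 'a', so two are still missing -> no match
print(not (advk - Counter('ardvrk')))  # False
```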
Other things to note:
You can use the with statement to ensure your file gets closed.
You can use enumerate instead of counting the line numbers yourself.
If the advk list is variable and its contents are read from somewhere else, then to keep only unique elements in the list you can convert it to a set and back:

advk = ['a','a','r','d','v','a','r','k']
advk = list(set(advk))

This makes advk a unique list and avoids checking for multiple 'a's in the line.
# If the line "ardvrk" should match, this is a solution:
chars = set('aardvark')
for nr, line in enumerate(open('input.txt', 'rU')):
    if not chars - set(line):  # is subset?
        print('match', nr, line, end='')