I have around 20,000 text files, numbered 5.txt, 10.txt, and so on.
I am storing the file paths of these files in a list "list2" that I have created.
I also have a text file "temp.txt" with a list of 500 words:
vs
mln
money
and so on..
I am storing these words in another list "list" that I have created.
Now I create a nested dictionary d2[file][word] = frequency count of "word" in "file".
Now I need to iterate through these words for each text file, because I am trying to get the following output:
filename.txt - sum(d[filename][word]*log(prob(word)))
Here, filename.txt is of the form 5.txt, 10.txt, and so on, and prob(word) is a probability value that I have already obtained for each word.
Basically, for every outer key (file), I need to sum over the inner keys (words), multiplying each word's frequency by the log of its probability.
Say:
d['5.txt']['the']=6
here "the" is my word and "5.txt" is the file.Now 6 is the number of times "the" occurs in "5.txt".
Similarly:
d['5.txt']['as']=2
I need to find the weighted sum of these dictionary values.
So here, for 5.txt, I need my answer to be:
6*log(prob('the')) + 2*log(prob('as')) + ... (for all the words in "list")
I need this to be done for all the files.
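For example, with made-up probabilities just to illustrate the formula (natural log assumed): if prob('the') = 0.05 and prob('as') = 0.01, then the contribution of these two words for 5.txt would be
6*log(0.05) + 2*log(0.01) = 6*(-2.996) + 2*(-4.605) ≈ -27.19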
My problem lies in the part where I am supposed to iterate through the nested dictionary
import collections, sys, os, re
sys.stdout = open('4.txt', 'w')
from collections import Counter, defaultdict
from glob import glob

folderpath = 'd:/individual-articles'
folderpaths = 'd:/individual-articles/'
counter = Counter()
filepaths = glob(os.path.join(folderpath, '*.txt'))

# test.txt contains: d:/individual-articles/5.txt, d:/individual-articles/10.txt, d:/individual-articles/15.txt and so on...
with open('test.txt', 'r') as fi:
    list2 = [line.strip() for line in fi]

# temp.txt contains the list of words
with open('temp.txt', 'r') as fi:
    list = [line.strip() for line in fi]   # note: this shadows the built-in list()

# the nested dictionary that holds d2[file][word] = frequency
d2 = defaultdict(dict)
for fil in list2:
    with open(fil) as f:
        path, name = os.path.split(fil)
        words_c = Counter([word for line in f for word in line.split()])
        for word in list:
            d2[name][word] = words_c[word]
# this portion generates the dictionary "prob" from file 2.txt and can be overlooked!
answer_ca = defaultdict(list)   # collects the value(s) read from 2.txt for each word
with open('2.txt', 'r+') as istream:
    for line in istream.readlines():
        try:
            k, r = line.strip().split(':')
            answer_ca[k.strip()].append(r.strip())
        except ValueError:
            print('Ignoring: malformed line: "{}"'.format(line))
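(The split(':') above implies that each line of 2.txt pairs a word with a value separated by a colon; hypothetically, lines such as the following, with made-up values:)
the: 0.05
as: 0.01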
# my problem lies here
items = d2.items()
small_d2 = dict(next(items) for _ in range(10))
for fil in list2:
    total = 0
    for k, v in small_d2[fil].items():
        total = total + (v * answer_ca[k])
    print("Total of {} is {}".format(fil, total))
from math import log

for fil in list2:                  # list2 contains the filenames
    total = 0
    for k, v in d[fil].items():    # d is your nested dictionary
        total += v * log(prob[k])  # where prob is a dict mapping word -> probability
    print("Total of {} is {}".format(fil, total))
with open(f) as fil binds fil to the file object returned by open(f), not to the file's name or contents. When you later access the entries in your dictionary as
total=sum(math.log(prob)*d2[fil][word].values())
I believe you mean
total = sum(math.log(prob)*d2[f][word])
though this doesn't quite match up with the order you were expecting, so I would instead suggest something more like this:
word_list = [#list of words]
file_list = [#list of files]
dictionary = {#your dictionary}
summation = lambda file_name, prob: sum([(math.log(prob)*dictionary[word][file_name]) for word in word_list])
return_value = []
for file_name in file_list:
    prob = #something
    return_value.append(summation(file_name, prob))
The summation line there defines an anonymous function in Python; these are called lambda functions. Essentially, what that line means is:
summation = lambda file_name,prob:
is almost the same as:
def summation(file_name, prob):
and then
sum([(math.log(prob)*dictionary[word][file_name]) for word in word_list])
is almost the same as:
result = []
for word in word_list:
    result.append(math.log(prob)*dictionary[word][file_name])
return sum(result)
so in total you have:
summation = lambda file_name, prob: sum([(math.log(prob)*dictionary[word][file_name]) for word in word_list])
instead of:
def summation(file_name, prob):
    result = []
    for word in word_list:
        result.append(math.log(prob)*dictionary[word][file_name])
    return sum(result)
The list-comprehension version is usually somewhat faster and more concise than the explicit for loop, though the difference is rarely dramatic; list comprehensions are generally preferred in Python when you are simply building a list, but there are still plenty of cases where an explicit for loop is the right tool.
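If the intermediate list is not needed at all, sum() also accepts a generator expression, which avoids building the list entirely; using the same hypothetical names as above:
summation = lambda file_name, prob: sum(math.log(prob) * dictionary[word][file_name] for word in word_list)
A regular def is usually considered more readable than binding a lambda to a name, but both behave the same here.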
Related
I am new to Python and I have the following problem to solve:
"Open the file sample.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order."
I have written the following code, with some good results, but I can't understand why my result appears as multiple lists. I just need to have the words in one list.
Thanks in advance!
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
lst = fh.read().split()
final_list = list()
for line in lst:
    if line in lst not in final_list:
        final_list.append(line)
    final_list.sort()
    print(final_list)
Your code is largely correct; the major problem is the conditional on your if statement:
if line in lst not in final_list:
In Python, in and not in are comparison operators, so this is a chained comparison; it is evaluated as:
if (line in lst) and (lst not in final_list):
Both parts are always true here (line came from lst, and the list lst is never an element of final_list), so every word gets appended, duplicates included, and the growing list is printed on every iteration of the loop. What you want is simply:
if line not in final_list:
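A quick interpreter session (illustrative values only) shows how the chaining behaves:
>>> lst = ['a', 'b']
>>> final_list = []
>>> 'z' in lst not in final_list       # chained: ('z' in lst) and (lst not in final_list)
False
>>> ('z' in lst) not in final_list     # forcing the other grouping gives a different answer
True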
Right now, you're sorting and printing your list inside the loop, but it would be better to do that once at the end, making your code look like this:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
lst = fh.read().split()
final_list = list()
for line in lst:
    if line not in final_list:
        final_list.append(line)
final_list.sort()
print(final_list)
I have a few additional comments on your code:
You don't need to explicitly initialize a variable (as in lst = list()) if you're going to immediately assign something to it. You can just write:
fh = open(fname)
lst=fh.read().split()
On the other hand, you do need to initialize final_list because
you're going to try to call the .append method on it, although it
would be more common to write:
final_list = []
In practice, it would be more common to use a set to
collect the words, since a set will de-duplicate things
automatically:
final_list = set()
for line in lst:
    final_list.add(line)
print(sorted(final_list))
Lastly, if I were to write this code, it might look like this:
fname = input("Enter file name: ")
with open(fname) as fh:
    lst = fh.read().split()
final_list = set(word.lower() for word in lst)
print(sorted(final_list))
Your code has the following problems as is:
if line in lst not in final_list - it is not clear what you are trying to do here. I think you expect this to go over all the words in the line and check whether each one is already in final_list.
Your code also has some indentation issues.
The call to the close() method is missing.
You need to read the file into a list, then iterate over that list, splitting and adding elements as shown below:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
lst = fh.read().split()
final_list = list()
for word in lst:
    if word not in final_list:
        final_list.append(word)
final_list.sort()
print(final_list)
fh.close()
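Alternatively, a with block closes the file for you, so the explicit close() call becomes unnecessary; a minimal sketch of the same logic:
fname = input("Enter file name: ")
with open(fname) as fh:          # the file is closed automatically on exit
    lst = fh.read().split()

final_list = []
for word in lst:
    if word not in final_list:
        final_list.append(word)
final_list.sort()
print(final_list)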
I am making a small script that compares words from a text file. So far I have been able to extract all the words and count their frequency; now, how could I make the algorithm only extract the words from the .txt that are in a list determined by me? So far I have this:
from collections import Counter
def word_count(filename):
    with open('hola.txt', 'r') as f:
        return Counter(f.read().split())

counter = word_count('hola.txt')
for i in counter:
    print(i, ":", counter[i])
You can create your set of the words you want to consider, and feed the Counter a generator that contains only the words in your set, using a generator comprehension:
from collections import Counter

words_to_keep = {'only', 'keep', 'these', 'words'}

def word_count(filename):
    with open(filename, 'r') as f:  # use `filename`
        return Counter(w for w in f.read().split() if w in words_to_keep)

counter = word_count('hola.txt')
for w, c in counter.items():
    print(w, ":", c)
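An alternative, if you also want the full counts available, is to count every word first and then keep only the ones you care about; a small sketch along those lines, using the same hypothetical words_to_keep set:
from collections import Counter

words_to_keep = {'only', 'keep', 'these', 'words'}

def word_count(filename):
    with open(filename, 'r') as f:
        return Counter(f.read().split())

full_counter = word_count('hola.txt')
filtered = {w: c for w, c in full_counter.items() if w in words_to_keep}
for w, c in filtered.items():
    print(w, ":", c)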
I have a file with 10,000 words in it. I wrote a program to find anagram words in that file, but it's taking too much time to produce output. For a small file the program works well. How can I optimize the code?
count = 0
i = 0
j = 0
with open('file.txt') as file:
    lines = [i.strip() for i in file]
for i in range(len(lines)):
    for j in range(i):
        if sorted(lines[i]) == sorted(lines[j]):
            # print(lines[i])
            count = count + 1
        j = j + 1
    i = i + 1
print('There are ', count, 'anagram words')
I don't fully understand your code (for example, why do you increment i and j inside the loop?). But the main problem is that you have a nested loop, which makes the runtime of the algorithm O(n^2), i.e. if the file becomes 10 times as large, the execution time will become (approximately) 100 times as long.
So you need a way to avoid that. One possible way is to store the lines in a smarter way, so that you don't have to walk through all lines every time. Then the runtime becomes O(n). In this case you can use the fact that anagrams consist of the same characters (only in a different order). So you can use the "sorted" variant as a key in a dictionary to store all lines that can be made from the same letters in a list under the same dictionary key. There are other possibilities of course, but in this case I think it works out quite nicely :-)
So, fully working example code:
#!/usr/bin/env python3
from collections import defaultdict

d = defaultdict(list)
with open('file.txt') as file:
    lines = [line.strip() for line in file]
for line in lines:
    sorted_line = ''.join(sorted(line))
    d[sorted_line].append(line)

anagrams = [d[k] for k in d if len(d[k]) > 1]
# anagrams is a list of lists of lines that are anagrams
# I would say the number of anagrams is:
count = sum(map(len, anagrams))
# ... but in your example you're not counting the first words, only the "duplicates", so:
count -= len(anagrams)
print('There are', count, 'anagram words')
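As a tiny illustration of the counting logic, suppose file.txt hypothetically contained the four words listen, silent, enlist and cat:
# hypothetical contents of file.txt: listen, silent, enlist, cat
# d == {'eilnst': ['listen', 'silent', 'enlist'], 'act': ['cat']}
# anagrams == [['listen', 'silent', 'enlist']]
# sum(map(len, anagrams)) == 3, then count -= len(anagrams) leaves count == 2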
UPDATE
Without duplicates, and without using collections (although I strongly recommend using it):
#!/usr/bin/env python3
d = {}
with open('file.txt') as file:
    lines = [line.strip() for line in file]
lines = set(lines)  # remove duplicates
for line in lines:
    sorted_line = ''.join(sorted(line))
    if sorted_line in d:
        d[sorted_line].append(line)
    else:
        d[sorted_line] = [line]

anagrams = [d[k] for k in d if len(d[k]) > 1]
# anagrams is a list of lists of lines that are anagrams
# I would say the number of anagrams is:
count = sum(map(len, anagrams))
# ... but in your example you're not counting the first words, only the "duplicates", so:
count -= len(anagrams)
print('There are', count, 'anagram words')
Well, it is unclear whether you account for duplicates or not; however, if you don't, you can remove duplicates from your list of words, and in my opinion that will spare you a huge amount of runtime. You can check for anagrams and then use sum() to get their total number. This should do it:
def get_unique_words(lines):
    unique = []
    for word in " ".join(lines).split(" "):
        if word not in unique:
            unique.append(word)
    return unique

def check_for_anagrams(test_word, words):
    return sum([1 for word in words if (sorted(test_word) == sorted(word) and word != test_word)])

with open('file.txt') as file:
    lines = [line.strip() for line in file]

unique = get_unique_words(lines)
count = sum([check_for_anagrams(word, unique) for word in unique])
print('There are ', count, 'unique anagram words aka', int(count/2), 'unique anagram couples')
I'm writing code that should take in a filename and create an initial list. Then, I'm trying to sum up each item in the list. The code I've written so far looks something like this...
filename = input('Enter filename: ')
Lists = []
for line in open(filename):
    line = line.strip().split()
    Lists = line
print(Lists)
total = 0
for i in Lists:
    total = sum(int(Lists[i]))
print(total)
I take in a filename and assign all the objects in the line to the list. Then I make a variable total, which should hold the sum of the items in the list. For instance, if the list is [1,2,3] then the total will be 6. However, is it possible to append integer objects to a list? The error I'm receiving is...
File "/Users/sps329/Desktop/testss copy 2.py", line 10, in main
total = sum(int(Lists[i]))
TypeError: list indices must be integers, not str
Something like this also doesn't work, because the items in the list are strings and not numbers. Would I have to use the isdigit function even though I know the input file will always contain integers?...
total = sum(i)
Instead of
Lists = line
you need
Lists.append(line)
You can get the total sum like this
total = sum(sum(map(int, item)) for item in Lists)
If you don't want to create a list of lists, you can use the extend function:
Lists.extend(line)
...
total = sum(map(int, Lists))
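To illustrate the difference, assume a hypothetical input file whose two (stripped) lines are "1 2 3" and "4 5"; the two approaches then give:
lines = ['1 2 3', '4 5']             # hypothetical file contents, already read and stripped

appended = []
for line in lines:
    appended.append(line.split())    # -> [['1', '2', '3'], ['4', '5']]
total_appended = sum(sum(map(int, item)) for item in appended)   # 15

extended = []
for line in lines:
    extended.extend(line.split())    # -> ['1', '2', '3', '4', '5']
total_extended = sum(map(int, extended))                         # 15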
# creates a list of the lines in the file and closes the file
with open(filename) as f:
    Lists = f.readlines()
# strip whitespace/newlines, just in case
Lists = [i.strip() for i in Lists]
# convert all elements of Lists to ints (this assumes one integer per line)
int_list = [int(i) for i in Lists]
# sum the elements
total = sum(int_list)
print(sum([float(x.strip()) for x in open(filename)]))
I was trying this exercise from the book Think Python: "Write a program that reads a word list from a file (see Section 9.1) and prints all the sets of words that are anagrams."
My strategy is to read the file, sort each word, and store the sorted words in a list of strings (called listy). Then I'll go through the original list of words again and compare against listy. If a word matches, I'll store the sorted word as the key and the unsorted word from the original file as a value in a dictionary. Then I can simply print out all the values under each key; they should be anagrams.
The first function I created generates listy. I have broken the code down and checked it, and it seems fine. However, when I run it, Python hangs as though it has encountered an infinite loop. Could anyone tell me why this is so?
def newlist():
    fin = open('words.txt')
    listy = []
    for word in fin:
        n1 = word.strip()
        n2 = sorted(n1)
        red = ''.join(n2)
        if red not in listy:
            listy.append(red)
    return listy

newlist()
Use a set to check whether the word has been processed or not. The red not in listy test on a plain list has to scan the whole list every time, so with a large word list the loop becomes very slow (it is not an infinite loop, just quadratic); membership tests and insertions on a set are effectively constant time:
def newlist():
    with open('words.txt') as fin:
        listy = set()
        for word in fin:
            n1 = word.strip()
            n2 = sorted(n1)
            red = ''.join(n2)
            listy.add(red)
        return listy

newlist()
You could even write it as:
def newlist():
    with open('words.txt') as fin:
        return set(''.join(sorted(word.strip())) for word in fin)

newlist()
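Since the exercise ultimately wants the sets of words that are anagrams, and your plan was to map each sorted key back to the original words, a natural next step is to group the original words under their sorted signature rather than only collecting the signatures. A minimal sketch, assuming words.txt holds one word per line:
from collections import defaultdict

def anagram_sets(filename='words.txt'):
    groups = defaultdict(list)
    with open(filename) as fin:
        for word in fin:
            word = word.strip()
            key = ''.join(sorted(word))   # the sorted letters act as the signature
            groups[key].append(word)
    # keep only keys with more than one word: those are the anagram sets
    return [words for words in groups.values() if len(words) > 1]

for group in anagram_sets():
    print(group)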