Anagram from large file - python

I have a file with 10,000 words in it. I wrote a program to find anagram words in that file, but it is taking too much time to produce output. For a small file the program works well. Please help me optimize the code.
count = 0
i = 0
j = 0
with open('file.txt') as file:
    lines = [i.strip() for i in file]
for i in range(len(lines)):
    for j in range(i):
        if sorted(lines[i]) == sorted(lines[j]):
            #print(lines[i])
            count = count + 1
        j = j + 1
    i = i + 1
print('There are ', count, 'anagram words')

I don't fully understand your code (for example, why do you increment i and j inside the loop?). But the main problem is that you have a nested loop, which makes the runtime of the algorithm O(n^2), i.e. if the file becomes 10 times as large, the execution time will become (approximately) 100 times as long.
So you need a way to avoid that. One possible way is to store the lines in a smarter way, so that you don't have to walk through all the lines every time; then the runtime becomes O(n). In this case you can use the fact that anagrams consist of the same characters (only in a different order), so you can use the "sorted" variant of each line as a key in a dictionary and store all lines that can be made from the same letters in a list under that key. There are other possibilities of course, but in this case I think it works out quite nicely :-)
So, fully working example code:
#!/usr/bin/env python3
from collections import defaultdict

d = defaultdict(list)
with open('file.txt') as file:
    lines = [line.strip() for line in file]
for line in lines:
    sorted_line = ''.join(sorted(line))
    d[sorted_line].append(line)
anagrams = [d[k] for k in d if len(d[k]) > 1]
# anagrams is a list of lists of lines that are anagrams
# I would say the number of anagrams is:
count = sum(map(len, anagrams))
# ... but in your example you're not counting the first words, only the "duplicates", so:
count -= len(anagrams)
print('There are', count, 'anagram words')
UPDATE
Without duplicates, and without using collections (although I strongly recommend using it):
#!/usr/bin/env python3
d = {}
with open('file.txt') as file:
    lines = [line.strip() for line in file]
lines = set(lines)  # remove duplicates
for line in lines:
    sorted_line = ''.join(sorted(line))
    if sorted_line in d:
        d[sorted_line].append(line)
    else:
        d[sorted_line] = [line]
anagrams = [d[k] for k in d if len(d[k]) > 1]
# anagrams is a list of lists of lines that are anagrams
# I would say the number of anagrams is:
count = sum(map(len, anagrams))
# ... but in your example you're not counting the first words, only the "duplicates", so:
count -= len(anagrams)
print('There are', count, 'anagram words')

Well, it is unclear whether you account for duplicates or not; if you don't, you can remove duplicates from your list of words first, which will save a large amount of runtime. You can then check for anagrams and use sum() to get their total number. This should do it:
def get_unique_words(lines):
    unique = []
    for word in " ".join(lines).split(" "):
        if word not in unique:
            unique.append(word)
    return unique

def check_for_anagrams(test_word, words):
    return sum([1 for word in words if (sorted(test_word) == sorted(word) and word != test_word)])

with open('file.txt') as file:
    lines = [line.strip() for line in file]

unique = get_unique_words(lines)
count = sum([check_for_anagrams(word, unique) for word in unique])
print('There are ', count, 'unique anagram words aka', int(count/2), 'unique anagram couples')


For each word in the text file, extract surrounding 5 words

For each occurrence of a certain word, I need to display the context by showing about 5 words preceding and following the occurrence of the word.
Example output for the word 'stranger' in a text file when you enter occurs('stranger', 'movie.txt'):
My code so far:
def occurs(word, filename):
    infile = open(filename, 'r')
    lines = infile.read().splitlines()
    infile.close()
    wordsString = ''.join(lines)
    words = wordsString.split()
    print(words)
    for i in range(len(words)):
        if words[i].find(word):
            #stuck here
I'd suggest slicing words depending on i:
print(words[i-5:i+6])
(This would go where your comment is)
Alternatively, to print as shown in your example:
print("...", " ".join(words[i-5:i+6]), "...")
To account for the word being in the first 5:
if i > 5:
    print("...", " ".join(words[i-5:i+6]), "...")
else:
    print("...", " ".join(words[0:i+6]), "...")
Additionally, find is not doing what you think it is. find() returns an index, not a boolean: if it doesn't find the string it returns -1, which evaluates to True in an if statement (only 0, a match at the very start, is falsy). Try:
if word in words[i].lower():
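A quick interpreter session makes the problem with the original check concrete:
>>> "stranger was here".find("stranger")   # found at index 0 -> falsy
0
>>> "hello".find("stranger")               # not found -> -1, which is truthy
-1
>>> bool(0), bool(-1)
(False, True)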
This retrieves the index of every occurrence of the word in words, which is a list of all words in the file. Then slicing is used to get a list of the matched word and the 5 words before and after.
def occurs(word, filename):
    infile = open(filename, 'r')
    lines = infile.read().splitlines()
    infile.close()
    wordsString = ''.join(lines)
    words = wordsString.split()
    matches = [i for i, w in enumerate(words) if w.lower().find(word) != -1]
    for m in matches:
        l = " ".join(words[m-5:m+6])
        print(f"... {l} ...")
Consider the more_itertools.adjacent tool.
Given
import more_itertools as mit
s = """\
But we did not answer him, for he was a stranger and we were not used to, strangers and were shy of them.
We were simple folk, in our village, and when a stranger was a pleasant person we were soon friends.
"""
word, distance = "stranger", 5
words = s.splitlines()[0].split()
Demo
neighbors = list(mit.adjacent(lambda x: x == word, words, distance))
" ".join(word for bool_, word in neighbors if bool_)
# 'him, for he was a stranger and we were not used'
Details
more_itertools.adjacent returns an iterable of (bool, item) tuples. The boolean is True for words that satisfy the predicate and for their neighbors within the given distance. Example:
>>> neighbors
[(False, 'But'),
...
(True, 'a'),
(True, 'stranger'),
(True, 'and'),
...
(False, 'to,')]
The final join then keeps only the target word and its neighbors within the given distance.
Note: more_itertools is a third-party library. Install by pip install more_itertools.
Whenever I see rolling views of files, I think collections.deque
import collections
def occurs(needle, fname):
    with open(fname) as f:
        lines = f.readlines()
    words = iter(''.join(lines).split())
    view = collections.deque(maxlen=11)
    # prime the deque
    for _ in range(10):  # leaves an 11-length deque with 10 elements
        view.append(next(words, ""))
    for w in words:
        view.append(w)
        if view[5] == needle:
            yield list(view.copy())
Note that this approach intentionally does not handle the edge cases where the needle appears within the first 5 or last 5 words of the file. The question is ambiguous as to whether matching the third word should give the first through ninth words, or something different.
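Since occurs is a generator, a minimal usage sketch (assuming a file named movie.txt, as in the question) could look like this:
for context in occurs('stranger', 'movie.txt'):
    print("...", " ".join(context), "...")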

GC content in DNA sequence (Rosalind): how to improve my code?

Below is my code for a Rosalind question to calculate GC content without using Biopython.
Can anyone give me some suggestions on how to improve it? For example, I cannot include the last sequence in seq_list inside the for loop and have to append it one more time.
Also, is there a better way to pair seq_name and GC content so I can easily print out the sequence name with the highest GC content?
Thank you very much
# to open FASTA format sequence file:
s = open('5_GC_content.txt', 'r').readlines()
# to create two lists, one for names, one for sequences
name_list = []
seq_list = []
data = ''  # to put the sequence from several lines together
for line in s:
    line = line.strip()
    for i in line:
        if i == '>':
            name_list.append(line[1:])
            if data:
                seq_list.append(data)
                data = ''
            break
    else:
        line = line.upper()
        if all([k == k.upper() for k in line]):
            data = data + line
seq_list.append(data)  # is there a way to include the last sequence in the for loop?

GC_list = []
for seq in seq_list:
    i = 0
    for k in seq:
        if k == "G" or k == 'C':
            i += 1
    GC_cont = float(i) / len(seq) * 100.0
    GC_list.append(GC_cont)
m = max(GC_list)
print name_list[GC_list.index(m)]  # to find the index of max GC
print "{:0.6f}".format(m)
if all([k==k.upper() for k in line]):
Why don't you just check that line == line.upper() ?
i = 0
for k in seq:
    if k == "G" or k == 'C':
        i += 1
can be replaced with
i = sum(1 for k in seq if k in ['G', 'C'])
is there a way to include the last sequence in the for loop?
I think there is no better way to do that.
To avoid appending your seq list a second time remove:
if all([k == k.upper() for k in line]):
    data = data + line
and add it directly below line = line.strip().
The problem you are facing is that data is an empty string the first time you enter the for i in line loop; therefore, if data: is False.

Think Python Ch12 Ex 4: Code seems correct but hangs when compiled

I was trying this exercise from the book Think Python: write a program that reads a word list from a file (see Section 9.1) and prints all the sets of words that are anagrams.
My strategy is to read the file, sort each word, and store the result in a list of strings (called listy). Then I'll look through the original list of words again and compare against listy. If a word matches, store the sorted word as the key and the unsorted word from the original file as a value in a dictionary. Then simply print out all the values under each key; they should be anagrams.
The first function I created was to generate listy. I have broken down the code and checked it, and it seems fine. However, when I run it, Python hangs as though it encountered an infinite loop. Could anyone tell me why this is so?
def newlist():
    fin = open('words.txt')
    listy = []
    for word in fin:
        n1 = word.strip()
        n2 = sorted(n1)
        red = ''.join(n2)
        if red not in listy:
            listy.append(red)
    return listy

newlist()
Use a set to check whether the word has been processed or not. The reason your version appears to hang is that red not in listy is a linear scan of the list for every word, which makes the whole loop quadratic on a large word list; membership tests on a set are effectively constant time:
def newlist():
    with open('words.txt') as fin:
        listy = set()
        for word in fin:
            n1 = word.strip()
            n2 = sorted(n1)
            red = ''.join(n2)
            listy.add(red)
        return listy

newlist()
You could even write it as:
def newlist():
    with open('words.txt') as fin:
        return set(''.join(sorted(word.strip())) for word in fin)

newlist()
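That gives you the set of sorted keys. For the full exercise (printing the sets of words that are anagrams), a minimal sketch of the dictionary strategy described in the question, assuming words.txt has one word per line:
from collections import defaultdict

def anagram_sets(filename='words.txt'):
    groups = defaultdict(list)
    with open(filename) as fin:
        for word in fin:
            word = word.strip()
            # the sorted word is the key; all of its anagrams share it
            groups[''.join(sorted(word))].append(word)
    # only keys with more than one word form an anagram set
    return [words for words in groups.values() if len(words) > 1]

for group in anagram_sets():
    print(group)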

finding sum of values in a nested dictionary in python

I have around 20,000 text files, numbered 5.txt, 10.txt and so on..
I am storing the filepaths of these files in a list "list2" that I have created.
I also have a text file "temp.txt" with a list of 500 words
vs
mln
money
and so on..
I am storing these words in another list "list" that I have created.
Now I create a nested dictionary d2[file][word] = frequency count of "word" in "file".
Now I need to iterate through these words for each text file; I am trying to get the following output:
filename.txt - sum(d[filename][word]*log(prob))
Here, filename.txt is of the form 5.txt, 10.txt and so on, and "prob" is a value that I have already obtained.
I basically need to find the sum of the inner keys' (words') values (which are the word frequencies) for every outer key (file).
Say:
d['5.txt']['the']=6
Here "the" is my word and "5.txt" is the file. Now 6 is the number of times "the" occurs in "5.txt".
Similarly:
d['5.txt']['as']=2
I need to find the sum of the dictionary values.
So, here for 5.txt I need my answer to be:
6*log(prob('the')) + 2*log(prob('as')) + ... (for all the words in list)
I need this to be done for all the files.
My problem lies in the part where I am supposed to iterate through the nested dictionary
import collections, sys, os, re
sys.stdout = open('4.txt', 'w')
from collections import Counter, defaultdict
from glob import glob

folderpath = 'd:/individual-articles'
folderpaths = 'd:/individual-articles/'
counter = Counter()
filepaths = glob(os.path.join(folderpath, '*.txt'))

# test contains: d:/individual-articles/5.txt, d:/individual-articles/10.txt, d:/individual-articles/15.txt and so on...
with open('test.txt', 'r') as fi:
    list2 = [line.strip() for line in fi]

# temp contains the list of words
with open('temp.txt', 'r') as fi:
    list = [line.strip() for line in fi]

# the dictionary that contains d2[file][word]
d2 = defaultdict(dict)
for fil in list2:
    with open(fil) as f:
        path, name = os.path.split(fil)
        words_c = Counter([word for line in f for word in line.split()])
        for word in list:
            d2[name][word] = words_c[word]

# this portion generates the dictionary "prob" from file 2.txt and can be overlooked
with open('2.txt', 'r+') as istream:
    for line in istream.readlines():
        try:
            k, r = line.strip().split(':')
            answer_ca[k.strip()].append(r.strip())
        except ValueError:
            print('Ignoring: malformed line: "{}"'.format(line))

# my problem lies here
items = d2.items()
small_d2 = dict(next(items) for _ in range(10))
for fil in list2:
    total = 0
    for k, v in small_d2[fil].items():
        total = total + (v*answer_ca[k])
    print("Total of {} is {}".format(fil, total))
for fil in list2:  # list2 contains the filenames
    total = 0
    for k, v in d[fil].iteritems():
        total += v*log(prob[k])  # where prob is a dict
    print "Total of {} is {}".format(fil, total)
with open(f) as fil binds fil to the open file object, not to the filename f. When you later access the entries in your dictionary as
total=sum(math.log(prob)*d2[fil][word].values())
I believe you mean
total = sum(math.log(prob)*d2[f][word])
though this doesn't seem to quite match up with the order you were expecting, so I would instead suggest something more like this:
word_list = [...]   # list of words
file_list = [...]   # list of files
dictionary = {...}  # your dictionary

summation = lambda file_name, prob: sum([(math.log(prob)*dictionary[word][file_name]) for word in word_list])

return_value = []
for file_name in file_list:
    prob = ...  # something
    return_value.append(summation(file_name, prob))
The summation line there is defining an anonymous function within python. These are called lambda functions. Essentially, what that line in particular means is:
summation = lambda file_name,prob:
is almost the same as:
def summation(file_name, prob):
and then
sum([(math.log(prob)*dictionary[word][file_name]) for word in word_list])
is almost the same as:
result = []
for word in word_list:
    result.append(math.log(prob)*dictionary[word][file_name])
return sum(result)
so in total you have:
summation = lambda file_name,prob: sum([(math.log(prob)*dictionary[word][file_name]) for word in word_list])
instead of:
def summation(file_name, prob):
    result = []
    for word in word_list:
        result.append(math.log(prob)*dictionary[word][file_name])
    return sum(result)
though the list comprehension version is usually somewhat faster than the explicit for loop. There are very few cases in Python where one should build a list with a for loop instead of a list comprehension, but they certainly exist.
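If you want to check that claim on your own machine, a rough timeit comparison (the exact numbers will vary) could look like this:
import timeit

setup = "data = list(range(1000))"
loop = """
result = []
for x in data:
    result.append(x * 2)
"""
comp = "result = [x * 2 for x in data]"

print(timeit.timeit(loop, setup=setup, number=10000))
print(timeit.timeit(comp, setup=setup, number=10000))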

How many common English words of 4 letters or more can you make from the letters of a given word (each letter can only be used once)

On the back of a block calendar I found the following riddle:
How many common English words of 4 letters or more can you make from the letters
of the word 'textbook' (each letter can only be used once).
My first solution that I came up with was:
from itertools import permutations

with open('/usr/share/dict/words') as f:
    words = f.readlines()
words = map(lambda x: x.strip(), words)
given_word = 'textbook'
found_words = []

ps = (permutations(given_word, i) for i in range(4, len(given_word)+1))
for p in ps:
    for word in map(''.join, p):
        if word in words and word != given_word:
            found_words.append(word)

print set(found_words)
This gives the result set(['tote', 'oboe', 'text', 'boot', 'took', 'toot', 'book', 'toke', 'betook']) but took more than 7 minutes on my machine.
My next iteration was:
with open('/usr/share/dict/words') as f:
    words = f.readlines()
words = map(lambda x: x.strip(), words)
given_word = 'textbook'

print [word for word in words if len(word) >= 4 and sorted(filter(lambda letter: letter in word, given_word)) == sorted(word) and word != given_word]
This returns an answer almost immediately, but gives: ['book', 'oboe', 'text', 'toot']
What is the fastest, correct and most pythonic solution to this problem?
(edit: added my earlier permutations solution and its different output).
I thought I'd share this slightly interesting trick. It takes a good bit more code than the other solutions and isn't really "pythonic", but judging by the timings the others need, it should be rather quick.
We do a bit of preprocessing to speed up the computations. The basic approach is the following: we assign every letter in the alphabet a prime number, e.g. A = 2, B = 3, and so on. We then compute a hash for every word in the word list, which is simply the product of the prime representations of its characters, and store every word in a dictionary indexed by that hash.
Now if we want to find out which words are equivalent to, say, textbook, we only have to compute the same hash for the word and look it up in our dictionary. Usually (say in C++) we'd have to worry about overflow, but in Python it's even simpler than that: all words stored under the same hash contain exactly the same characters.
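A tiny illustration of the idea, with made-up prime assignments:
# illustrative only: A = 2, B = 3, C = 5
demo_primes = {'A': 2, 'B': 3, 'C': 5}
hash_abc = demo_primes['A'] * demo_primes['B'] * demo_primes['C']  # 30
hash_cab = demo_primes['C'] * demo_primes['A'] * demo_primes['B']  # 30, same dictionary key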
Here's the code, with the slight optimization that in our case we only have to worry about characters that also appear in the given word, which means we can get by with a much smaller prime table than otherwise (the obvious optimization would be to assign values only to the characters that appear in the word at all; it was fast enough anyhow, so I didn't bother, and this way we can preprocess only once and reuse it for several words). A prime generator is useful often enough that you should have one yourself anyhow ;)
from collections import defaultdict
from itertools import permutations

PRIMES = list(gen_primes(256))  # some arbitrary prime generator

def get_dict(path):
    res = defaultdict(list)
    with open(path, "r") as file:
        for line in file.readlines():
            word = line.strip().upper()
            hash = compute_hash(word)
            res[hash].append(word)
    return res

def compute_hash(word):
    hash = 1
    for char in word:
        try:
            hash *= PRIMES[ord(char) - ord(' ')]
        except IndexError:
            # contains some character out of range - always 0 for our purposes
            return 0
    return hash

def get_result(path, given_word):
    words = get_dict(path)
    given_word = given_word.upper()
    result = set()
    powerset = lambda x: powerset(x[1:]) + [x[:1] + y for y in powerset(x[1:])] if x else [x]
    for word in (word for word in powerset(given_word) if len(word) >= 4):
        hash = compute_hash(word)
        for equiv in words[hash]:
            result.add(equiv)
    return result

if __name__ == '__main__':
    path = "dict.txt"
    given_word = "textbook"
    result = get_result(path, given_word)
    print(result)
This runs on my Ubuntu word list (98k words) rather quickly, but it's not what I'd call pythonic since it's basically a port of a C++ algorithm. It's useful if you want to compare more than one word that way.
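The snippet above leaves gen_primes undefined ("some arbitrary prime generator"). A minimal sketch, assuming gen_primes(n) is meant to yield the first n primes so that every character index used above maps to a prime:
def gen_primes(n):
    # yield the first n primes by simple trial division (fine for small n)
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
            yield candidate
        candidate += 1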
How about this?
from itertools import permutations, chain

with open('/usr/share/dict/words') as fp:
    words = set(fp.read().split())

given_word = 'textbook'
perms = (permutations(given_word, i) for i in range(4, len(given_word)+1))
pwords = (''.join(p) for p in chain(*perms))
matches = words.intersection(pwords)
print matches
which gives
>>> print matches
set(['textbook', 'keto', 'obex', 'tote', 'oboe', 'text', 'boot', 'toto', 'took', 'koto', 'bott', 'tobe', 'boke', 'toot', 'book', 'bote', 'otto', 'toke', 'toko', 'oket'])
There is a generator itertools.permutations with which you can gather all permutations of a sequence with a specified length. That makes it easier:
from itertools import permutations

GIVEN_WORD = 'textbook'
with open('/usr/share/dict/words', 'r') as f:
    words = [s.strip() for s in f.readlines()]

print len(filter(lambda x: ''.join(x) in words, permutations(GIVEN_WORD, 4)))
Edit #1: Oh! It says "4 or more" ;) Forget what I said!
Edit #2: This is the second version I came up with:
LETTERS = set('textbook')

with open('/usr/share/dict/words') as f:
    WORDS = filter(lambda x: len(x) >= 4, [l.strip() for l in f])

matching = filter(lambda x: set(x).issubset(LETTERS) and all([x.count(c) == 1 for c in x]), WORDS)
print len(matching)
Create the whole power set, then check whether the dictionary word is in the set (order of the letters doesn't matter):
powerset = lambda x: powerset(x[1:]) + [x[:1] + y for y in powerset(x[1:])] if x else [x]
pw = map(lambda x: sorted(x), powerset(given_word))
filter(lambda x: sorted(x) in pw, words)
The following just checks each word in the dictionary to see whether it is of the appropriate length, and then whether all of its letters can be taken from 'textbook'. I borrowed the permutation check from
Checking if two strings are permutations of each other in Python
but changed it slightly.
given_word = 'textbook'

with open('/usr/share/dict/words', 'r') as f:
    words = [s.strip() for s in f.readlines()]

matches = []
for word in words:
    if word != given_word and 4 <= len(word) <= len(given_word):
        if all(word.count(char) <= given_word.count(char) for char in word):
            matches.append(word)

print sorted(matches)
This finishes almost immediately and gives the correct result.
Permutations get very big for longer words. Try counterrevolutionary for example.
I would filter the dict for words from 4 to len(word) (8 for textbook).
Then I would filter with a regular expression, along the lines of "oboe".matches("[textbook]+").
The remaining words I would sort and compare against a sorted version of your word ("beoo" against "bekoottx"), jumping to the next index of a matching character, to find mismatching numbers of characters:
("beoo", "bekoottx")
("eoo", "ekoottx")
("oo", "koottx")
("oo", "oottx")
("o", "ottx")
("", "ttx") => matched
("bbo", "bekoottx")
("bo", "ekoottx") => mismatch
Since I don't talk python, I leave the implementation as an exercise to the audience.
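For reference, a minimal Python sketch of this idea (length and character-set filter first, then walking both sorted strings); the helper name is illustrative:
import re

def can_be_made(word, given_word='textbook'):
    # walk both sorted strings; every character of the candidate must be
    # matched, in order, by an unused character of the given word
    j = 0
    sorted_given = sorted(given_word)
    for ch in sorted(word):
        while j < len(sorted_given) and sorted_given[j] < ch:
            j += 1
        if j >= len(sorted_given) or sorted_given[j] != ch:
            return False
        j += 1
    return True

with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f]

allowed = re.compile(r'^[textbook]+$')  # quick pre-filter on the character set
matches = [w for w in words
           if 4 <= len(w) <= len('textbook') and allowed.match(w) and can_be_made(w)]
print(sorted(set(matches)))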
