I'm writing a program in Python to find all the possible words that can be formed from a jumbled word, using a dictionary file.
Here's what I've written. It runs in O(n^2) time. My question is: can it be made faster?
import sys

dictfile = "dictionary.txt"

def get_words(text):
    """ Return a list of dict words """
    return text.split()

def get_possible_words(words, jword):
    """ Return a list of possible solutions """
    possible_words = []
    jword_length = len(jword)
    for word in words:
        jumbled_word = jword
        if len(word) == jword_length:
            letters = list(word)
            for letter in letters:
                if jumbled_word.find(letter) != -1:
                    jumbled_word = jumbled_word.replace(letter, '', 1)
            if not jumbled_word:
                possible_words.append(word)
    return possible_words

if __name__ == '__main__':
    words = get_words(file(dictfile).read())
    if len(sys.argv) != 2:
        print "Incorrect Format. Type like"
        print "python %s <jumbled word>" % sys.argv[0]
        sys.exit()
    jumbled_word = sys.argv[1]
    words = get_possible_words(words, jumbled_word)
    print "possible words :"
    print '\n'.join(words)
The usual fast solution to anagram problems is to build a mapping of sorted letters to a list of the unsorted words.
With that structure in hand, the lookups are immediate and fast:
def build_table(wordlist):
    table = {}
    for word in wordlist:
        key = ''.join(sorted(word))
        table.setdefault(key, []).append(word)
    return table

def lookup(jumble, table):
    key = ''.join(sorted(jumble))
    return table.get(key, [])

if __name__ == '__main__':
    # Build table
    with open('/usr/share/dict/words') as f:
        wordlist = f.read().lower().split()
    table = build_table(wordlist)

    # Solve some jumbles
    for jumble in ['tesb', 'amgaarn', 'lehsffu', 'tmirlohag']:
        print(lookup(jumble, table))
Notes on speed:
The lookup() code is the fast part.
The slower build_table() function is written for clarity.
Building the table is a one-time operation.
If you care about run-time across repeated runs, the table should be cached in a text file.
Text file format (alpha-order first, followed by the matching words):
aestt state taste tates testa
enost seton steno stone
...
With the preprocessed anagram file, it becomes a simple matter to use subprocess to grep the file for the appropriate line of matching words. This should give a very fast run time (because the sorts and matches were precomputed and because grep is so fast).
Build the preprocessed anagram file like this:
with open('/usr/share/dict/words') as f:
wordlist = f.read().split()
table = {}
for word in wordlist:
key = ''.join(sorted(word)).lower()
table[key] = table.get(key, '') + ' ' + word
lines = ['%s%s\n' % t for t in table.iteritems()]
with open('anagrams.txt', 'w') as f:
f.writelines(lines)
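With the anagram file written, the grep lookup might look like this. This is a sketch in the Python 2 style of the answer above; lookup_grep is an illustrative name, not part of the original code, and it assumes the anagrams.txt produced by the preceding snippet:

import subprocess

def lookup_grep(jumble):
    key = ''.join(sorted(jumble.lower()))
    try:
        # -m 1 stops after the first match; each key occupies exactly one line
        line = subprocess.check_output(['grep', '-m', '1', '^' + key + ' ', 'anagrams.txt'])
    except subprocess.CalledProcessError:
        return []  # grep exits non-zero when no line matches
    return line.split()[1:]  # drop the sorted-letters key, keep the words

print(lookup_grep('tesb'))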
I was trying to solve this using Ruby:
https://github.com/hackings/jumble_solver
Alter get_words to return a dict, making each key have a value of True or 1.
Import itertools and use itertools.permutations to make all possible anagram strings
from the jumbled word (permutations rather than combinations, since an anagram reorders all of the letters).
Then loop over the possible strings, checking whether they are keys in the dict; see the sketch below.
If you wanted a DIY algorithmic solution, then loading the dictionary into a tree might be "better", but I doubt that in the real world it would be faster.
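A minimal sketch of the dict-plus-permutations idea (illustrative names; note that an n-letter word has n! permutations, which quickly becomes infeasible for long words, and is why the sorted-key table above scales better):

import itertools

def get_words(text):
    # each key carries a cheap "present" flag, as suggested above
    return {word: 1 for word in text.split()}

def solve(jumbled_word, word_dict):
    # every distinct reordering of the letters is a candidate anagram
    candidates = {''.join(p) for p in itertools.permutations(jumbled_word)}
    return [w for w in candidates if w in word_dict]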
I'm trying to get all the words that can be made from the letters 'crbtfopkgevyqdzsh', using a file called web2.txt. The cell posted below follows a block of code which improperly returned the whole run-up to each full word, e.g. for the word shocked it would return s, sh, sho, shoc, shock, shocke, shocked.
So I tried a trie (no pun intended).
web2.txt is 2.5 MB (2,493,838 bytes) and contains words of varying length. The trie in the cell below is breaking my Google Colab notebook. I even upgraded to Google Colab Pro, and then to Google Colab Pro+, to try to accommodate the block of code, but it's still too much. Any more efficient ideas besides a trie to get the same result?
# Find the word list here: svnweb.freebsd.org/base/head/share/dict/web2?view=co
trie = {}
with open('/content/web2.txt') as words3:
    for word in words3:
        cur = trie
        for l in word:
            cur = cur.setdefault(l, {})
        cur['word'] = True  # defined if this node indicates a complete word

def findWords(word, trie = trie, cur = '', words3 = []):
    for i, letter in enumerate(word):
        if letter in trie:
            if 'word' in trie[letter]:
                words3.append(cur)
            findWords(word, trie[letter], cur+letter, words3)
            # first example: findWords(word[:i] + word[i+1:], trie[letter], cur+letter, word_list)
    return [word for word in words3 if word in words3]

words3 = findWords("crbtfopkgevyqdzsh")
I'm using Python 3.
A trie is overkill. There are about 200 thousand words, so you can just make one pass through all of them to see whether each one can be formed using the letters of the base string.
This is a good use case for collections.Counter, which gives us a clean way to get the frequency counts of the letters of an arbitrary string:
from collections import Counter

base_counter = Counter("crbtfopkgevyqdzsh")

with open("data.txt") as input_file:
    for line in input_file:
        line = line.rstrip()
        line_counter = Counter(line.lower())
        # Can use line_counter <= base_counter instead on Python 3.10+
        if line_counter & base_counter == line_counter:
            print(line)
Is there a way to obtain a random word from PyEnchant's dictionaries?
I've tried doing the following:
enchant.Dict("<language>").keys() #Type 'Dict' has no attribute 'keys'
list(enchant.Dict("<language>")) #Type 'Dict' is not iterable
I've also tried looking into the module to see where it gets its wordlist from but haven't had any success.
Using the separate "Random-Words" module is a workaround, but since it doesn't follow the same word list as PyEnchant, not all words will match. It is also quite a slow method. It is, however, the best alternative I've found so far.
Your question really got me curious, so I thought of a way to make a random word using enchant.
import enchant
import random
import string
# here I am getting hold of all the letters
letters = string.ascii_lowercase
# creating a string of random length with random letters
word = "".join([random.choice(letters) for _ in range(random.randint(3, 8))])
d = enchant.Dict("en_US")
# using the `enchant` to suggest a word based on the random string we provide
random_word = d.suggest(word)
Sometimes the suggest method will not return any suggestions, so you will need a loop that checks whether random_word got a value.
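For example, continuing from the snippet above (a sketch; suggest() returns a list, which may be empty, so we retry until something comes back):

random_word = None
while not random_word:
    word = "".join(random.choice(letters) for _ in range(random.randint(3, 8)))
    suggestions = d.suggest(word)
    if suggestions:
        random_word = random.choice(suggestions)
print(random_word)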
With the help of @furas, this question has been resolved.
Using the dict-en text file in furas' PyWordle, I wrote a short script that filters out invalid words in pyenchant's word list.
import enchant

wordList = enchant.Dict("en_US")

baseWordlist = open("dict-en.txt", "r")
lines = baseWordlist.readlines()
baseWordlist.close()

newWordlist = open("dict-en_NEW.txt", "w")  # write to new text file

for line in lines:
    word = line.strip("\n")
    if wordList.check(word):  # if word exists in pyenchant's dictionary
        print(word + " is valid.")
        newWordlist.write(line)
    else:
        print(word + " is invalid.")

newWordlist.close()
Afterwards, reading the new text file back in allows you to gather the word at any line:
validWords = open("dict-en_NEW.txt", "r")
wordList = validWords.readlines()
myWord = wordList[<line>]

# <line> can be any int (max is the file's line count), either a chosen one or a random one.
# This will return the word located at line <line>.
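To pick a random word rather than a fixed line, something like this should work (a sketch; it assumes the filtered file produced above):

import random

with open("dict-en_NEW.txt", "r") as validWords:
    wordList = validWords.readlines()

myWord = random.choice(wordList).strip()
print(myWord)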
Hi, so I have two text files. I have to read the first text file, count the frequency of each word, remove duplicates, and create a list of lists holding each word and its count in the file.
My second text file contains keywords. I need to count the frequency of these keywords in the first text file and return the result, without using any imports, dict, or zips.
I am stuck on how to go about this second part. I have the file open and punctuation removed, etc., but I have no clue how to find the frequency.
I played around with the idea of .find(), but no luck as of yet.
Any suggestions would be appreciated. This is my code at the moment; it seems to find the frequency of the keywords in the keyword file but not in the first text file.
def calculateFrequenciesTest(aString):
    listKeywords = aString
    listSize = len(listKeywords)
    keywordCountList = []
    while listSize > 0:
        targetWord = listKeywords[0]
        count = 0
        for i in range(0, listSize):
            if targetWord == listKeywords[i]:
                count = count + 1
        wordAndCount = []
        wordAndCount.append(targetWord)
        wordAndCount.append(count)
        keywordCountList.append(wordAndCount)
        for i in range(0, count):
            listKeywords.remove(targetWord)
        listSize = len(listKeywords)
    sortedFrequencyList = readKeywords(keywordCountList)
    return keywordCountList
EDIT: Currently toying with the idea of reopening my first file, but this time without turning it into a list? I think my errors are somehow coming from counting the frequency over my list of lists. These are the types of results I am getting:
[[['the', 66], 1], [['of', 32], 1], [['and', 27], 1], [['a', 23], 1], [['i', 23], 1]]
You can try something like this; I am taking a list of words as an example.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = {}
for word in word_list:
    if word not in frequency_list:
        frequency_list[word] = 1
    else:
        frequency_list[word] += 1
print(frequency_list)
RESULT: {'test': 1, 'world': 1, 'hello': 2}
Since you have put a constraint on dicts, I have made use of two lists to do the same task. I am not sure how efficient it is, but it serves the purpose.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = []
frequency_word = []
for word in word_list:
    if word not in frequency_word:
        frequency_word.append(word)
        frequency_list.append(1)
    else:
        ind = frequency_word.index(word)
        frequency_list[ind] += 1
print(frequency_word)
print(frequency_list)
RESULT : ['hello', 'world', 'test']
[2, 1, 1]
You can change it however you like, or refactor it as you wish.
I agree with @bereal that you should use Counter for this. I see that you have said that you don't want "imports, dict, or zips", so feel free to disregard this answer. Yet, one of the major advantages of Python is its great standard library, and every time you have list available, you'll also have dict, collections.Counter and re.
From your code I'm getting the impression that you want to use the same style that you would have used with C or Java. I suggest trying to be a little more Pythonic. Code written this way may look unfamiliar and can take time to get used to. Yet, you'll learn way more.
Clarifying what you're trying to achieve would help. Are you learning Python? Are you solving this specific problem? Why can't you use any imports, dict, or zips?
So here's a proposal utilizing built-in functionality (no third party), for what it's worth (tested with Python 2):
#!/usr/bin/python

import re           # String matching
import collections  # collections.Counter basically solves your problem

def loadwords(s):
    """Find the words in a long string.

    Words are separated by whitespace. Typical signs are ignored.
    """
    return (s
            .replace(".", " ")
            .replace(",", " ")
            .replace("!", " ")
            .replace("?", " ")
            .lower()).split()

def loadwords_re(s):
    """Find the words in a long string.

    Words are separated by whitespace. Only characters and ' are allowed in strings.
    """
    return (re.sub(r"[^a-z']", " ", s.lower())
            .split())

# You may want to read this from a file instead
sourcefile_words = loadwords_re("""this is a sentence. This is another sentence.
Let's write many sentences here.
Here comes another sentence.
And another one.
In English, we use plenty of "a" and "the". A whole lot, actually.
""")

# Sets are really fast for answering the question: "is this element in the set?"
# You may want to read this from a file instead
keywords = set(loadwords_re("""
of and a i the
"""))

# Count every word in sourcefile_words (not restricted to the keywords)
wordcount_all = collections.Counter(sourcefile_words)

# Look up word counts like this (Counter is a dictionary)
count_this = wordcount_all["this"]  # returns 2
count_a = wordcount_all["a"]        # returns 3

# Only count words that are in the keywords set
wordcount_keywords = collections.Counter(word
                                         for word in sourcefile_words
                                         if word in keywords)

count_and = wordcount_keywords["and"]             # returns 2
all_counted_keywords = wordcount_keywords.keys()  # returns ['a', 'and', 'the', 'of']
Here is a solution with no imports. It uses nested linear searches, which are acceptable for a small number of searches over a small input, but will become unwieldy and slow with larger inputs.
Still, the input here is quite large, and it is handled in reasonable time. I suspect that if your keywords file were larger (mine has only 3 words) the slowdown would start to show.
Here we take an input file, iterate over its lines, remove punctuation, then split on spaces and flatten all the words into a single list. The list has dupes, so to remove them we sort it so the dupes come together, then iterate over it creating a new list of [string, count] pairs: we increment the count as long as the same word keeps appearing, and move to a new entry when a new word is seen.
Now you have your word frequency list and you can search it for the required keyword and retrieve the count.
The input text file is here and the keyword file can be cobbled together with just a few words in a file, one per line.
Python 3 code; it indicates where applicable how to modify for Python 2.
# use string.punctuation if you are somehow allowed
# to import the string module.
translator = str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')

words = []
with open('hamlet.txt') as f:
    for line in f:
        if line:
            line = line.translate(translator)
            # py 2 alternative
            # line = line.translate(None, string.punctuation)
            words.extend(line.strip().split())

# sort the word list, so instances of the same word are
# contiguous in the list and can be counted together
words.sort()

thisword = ''
counts = []

# for each word in the list add to the count as long as the
# word does not change
for w in words:
    if w != thisword:
        counts.append([w, 1])
        thisword = w
    else:
        counts[-1][1] += 1

for c in counts:
    print('%s (%d)' % (c[0], c[1]))

# function to prevent need to break out of nested loop
def findword(clist, word):
    for c in clist:
        if c[0] == word:
            return c[1]
    return 0

# open keywords file and search for each word in the
# frequency list.
with open('keywords.txt') as f2:
    for line in f2:
        if line:
            word = line.strip()
            thiscount = findword(counts, word)
            print('keyword %s appears %d times in source' % (word, thiscount))
If you were so inclined, you could modify findword to use a binary search, but it's still not going to be anywhere near a dict. collections.Counter is the right solution when there are no restrictions; it's quicker and less code.
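For illustration, a binary-search version might look like this. It is a sketch only: findword_bisect and keys are illustrative names, counts is the list built above (already sorted by word because the word list was sorted before counting), and it uses the bisect import, so it only applies if the no-imports rule is relaxed:

import bisect

def findword_bisect(clist, keys, word):
    # keys is [c[0] for c in clist], built once up front so repeated
    # lookups don't rebuild it; clist is sorted by word
    i = bisect.bisect_left(keys, word)
    if i < len(keys) and keys[i] == word:
        return clist[i][1]
    return 0

keys = [c[0] for c in counts]
print(findword_bisect(counts, keys, 'the'))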
I am creating a program where I need to take a string of words and convert it into numbers, so hi bye hi hello would turn into 0 1 0 2. I have used dictionaries to do this, which is why I am having trouble with the next part. I then need to compress this into a text file, and later decompress and reconstruct it into a string again. This is the bit I am stumped on.
The way I would like to do it is by writing the indexes of the numbers (the 0 1 0 2 bit, which is how I will refer to them from now on) into the text file along with the dictionary contents, so the text file would contain 0 1 0 2 and {hi:0, bye:1, hello:2}.
To decompress, I would read this back into the Python program and use the indexes to take each word out of the dictionary and reconstruct the sentence: if a 0 came up, it would look into the dictionary, find what has a 0 definition, and pull that out to put into the string, so it would find hi and take that.
I hope that this is understandable and that at least one person knows how to do it, because I am sure it is possible; however, I have been unable to find anything here or on the internet mentioning this subject.
TheLazyScripter gave a nice workaround solution for the problem, but the runtime characteristics are not good, because for each reconstructed word you have to loop through the whole dict.
I would say you chose the wrong dict design: to be efficient, lookup should be done in one step, so you should have the numbers as keys and the words as values.
Since your problem looks like a great computer science homework (I'll consider it for my students ;-) ), I'll just give you a sketch for the solution:
use word in my_dict.values() (works in both Python 2 and 3) to test whether the word is already in the dictionary;
if not, insert the next available index as key and the word as value;
and you are done.
For reconstructing the sentence, just
loop through your list of numbers
use the number as key in your dict and print(my_dict[key])
Prepare exception handling for the case a key is not in the dict (which should not happen if you are controlling the whole process, but it's good practice).
This solution is much more efficient than your approach (and easier to implement); a minimal sketch follows.
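A minimal sketch of that design (illustrative names; it also keeps a second word-to-number map so encoding doesn't have to scan values() for every word, which is a small deviation from the membership test suggested above):

def encode(sentence):
    code = {}    # number -> word, used for reconstruction
    seen = {}    # word -> number, avoids scanning code.values() each time
    indexes = []
    for word in sentence.split():
        if word not in seen:
            seen[word] = len(code)
            code[len(code)] = word
        indexes.append(seen[word])
    return indexes, code

indexes, code = encode('hi bye hi hello')
print(indexes)                             # [0, 1, 0, 2]
print(' '.join(code[i] for i in indexes))  # hi bye hi hello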
Yes, you can just use regular dicts and lists to store the data, and use json or pickle to persist it to disk.
import pickle

s = 'hi hello hi bye'
words = s.split()
d = {}
for word in words:
    if word not in d:
        d[word] = len(d)
data = [d[word] for word in words]

with open('/path/to/file', 'w') as f:
    pickle.dump({'lookup': d, 'data': data}, f)
Then read it back in
with open('/path/to/file', 'r') as f:
    dic = pickle.load(f)

d = dic['lookup']
reverse_d = {v: k for k, v in d.iteritems()}
data = dic['data']
words = [reverse_d[index] for index in data]
line = ' '.join(words)
print line
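The code above is Python 2 (iteritems, print statement). Under Python 3 the same idea needs binary file modes for pickle and items() instead of iteritems() (a sketch of only the parts that change):

# write: pickle needs a binary-mode file in Python 3
with open('/path/to/file', 'wb') as f:
    pickle.dump({'lookup': d, 'data': data}, f)

# read back
with open('/path/to/file', 'rb') as f:
    dic = pickle.load(f)

d = dic['lookup']
reverse_d = {v: k for k, v in d.items()}  # items() replaces iteritems()
words = [reverse_d[index] for index in dic['data']]
print(' '.join(words))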
Because I don't know exactly how your keymap is created, the best I could do is guess. Here I have created 2 functions that can be used to write a string to a txt file based on a keymap, and to read a txt file and return a string based on a keymap. I hope this works for you, or at least gives you a solid understanding of the process. Good luck!
import os

def pack(out_file, string, conversion_map):
    out_string = ''
    for word in string.split(' '):
        for key, value in conversion_map.iteritems():
            if word.lower() == value.lower():
                out_string += str(key) + ' '
                break
        else:
            out_string += word + ' '
    with open(out_file, 'wb') as file:
        file.write(out_string)
    return out_string.rstrip()

def unpack(in_file, conversion_map, on_lookup_error=None):
    if not os.path.exists(in_file):
        return
    in_file = ''.join(open(in_file, 'rb').readlines())
    out_string = ''
    for word in in_file.split(' '):
        for key, value in conversion_map.iteritems():
            if word.lower() == str(key).lower():
                out_string += str(value) + ' '
                break
        else:
            if on_lookup_error:
                on_lookup_error()
            else:
                out_string += str(word) + ' '
    return out_string.rstrip()

def fail_on_lookup():
    print 'We failed to find all words in our key map.'
    raise Exception

string = 'Hello, my first name is thelazyscripter'
word_to_int_map = {0: 'first',
                   1: 'name',
                   2: 'is',
                   3: 'TheLazyScripter',
                   4: 'my'}

d = pack('data', string, word_to_int_map)  # pack and write the data based on the conversion map
print d  # the data that was written to the file
print unpack('data', word_to_int_map)  # here we unpack the data from the file
print unpack('data', word_to_int_map, fail_on_lookup)
I'm aiming to create a Python script that counts the number of times each letter appears in a text file. So if the text file contained Hi there, the output would be something like:
E is shown 2 times
H is shown 2 times
I is shown 1 time
R is shown 1 time
T is shown 1 time
I've tried different ways of getting this, but I have no output being shown, as I keep getting syntax errors. I've tried the following:
import collections
import string

def count_letters(example.txt, case_sensitive=False):
    with open(example.txt, 'r') as f:
        original_text = f.read()
    if case_sensitive:
        alphabet = string.ascii_letters
        text = original_text
    else:
        alphabet = string.ascii_lowercase
        text = original_text.lower()
    alphabet_set = set(alphabet)
    counts = collections.Counter(c for c in text if c in alphabet_set)
    for letter in alphabet:
        print(letter, counts[letter])
    print("total:", sum(counts.values()))
    return counts
And
def count_letters(example.txt, case_sensitive=False):
    alphabet = "abcdefghijlkmnopqrstuvxyzABCDEFGHIJKLMNOPQRSTUVXYZ"
    with open(example.txt, 'r') as f:
        text = f.read()
    if not case_sensitive:
        alpahbet = alphabet[:26]
        text = text.lower()
    letter_count = {ltr: 0 for ltr in alphabet}
    for char in text:
        if char in alphabet:
            letter_count[char] += 1
    for key in sorted(letter_count):
        print(key, letter_count[key])
    print("total", sum(letter_count()))
There were a few problems I found when running your script. One was correctly found by @Priyansh Goel in his answer: you can't use example.txt as a parameter. You should just choose a variable name like text_file, and when you call the function, pass in the file's name as a string.
Also there was an indentation error or two. Here's the script I got to work:
import collections
import string

def count_letters(text_file, case_sensitive=False):
    with open(text_file, 'r') as f:
        original_text = f.read()
    if case_sensitive:
        alphabet = string.ascii_letters
        text = original_text
    else:
        alphabet = string.ascii_lowercase
        text = original_text.lower()
    alphabet_set = set(alphabet)
    counts = collections.Counter(c for c in text if c in alphabet_set)
    for letter in alphabet:
        print(letter, counts[letter])
    print("total:", sum(counts.values()))
    return counts

count_letters("example.txt")
If you will only ever use this on "example.txt", just get rid of the first parameter and hard code the file name into the function:
def count_letters(case_sensitive=False):
    with open("example.txt", 'r') as f:
        ...

count_letters()
One of the best skills you can develop as a programmer is learning to read and understand the errors that get thrown. They're not meant to be scary or frustrating (although sometimes they are), they're meant to be helpful. Syntax errors like what you had are especially useful. If it isn't totally obvious what the errors are indicating, copy and paste the error into a Google search and more often than not you'll find the answer to your question already exists out there.
Good luck in learning! Python was a great choice for your (presumably) first language!
In your function you can't have example.txt as a parameter name.
The following code traverses only the letters of the text, not the whole alphabet set. I am using a dict to store the frequency of letters; isalpha is used so that only alphabetic characters go into the dictionary.
def count_letters(textfile, case_sensitive=False):
    with open(textfile, 'r') as f:
        original_text = f.read()
    if case_sensitive:
        text = original_text
    else:
        text = original_text.lower()
    p = dict()
    for i in text:
        if i in p.keys():
            p[i] += 1
        elif i.isalpha():
            p[i] = 1
    keys = p.keys()
    for k in keys:
        print str(k) + " " + str(p[k])

count_letters("example.txt")
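If you want output in the exact shape shown in the question ("E is shown 2 times"), a small Counter-based variant might look like this (a sketch; count_letters_formatted is an illustrative name, not from either answer):

import collections

def count_letters_formatted(textfile):
    with open(textfile, 'r') as f:
        text = f.read().lower()
    # count only alphabetic characters, case-insensitively
    counts = collections.Counter(c for c in text if c.isalpha())
    for letter, n in sorted(counts.items()):
        print("%s is shown %d time%s" % (letter.upper(), n, "" if n == 1 else "s"))

count_letters_formatted("example.txt")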