Trie is Using Too Much Memory - python

I'm trying to get all the words made from the letters, 'crbtfopkgevyqdzsh' from a file called web2.txt. The posted cell below follows a block of code which improperly returned the whole run up to a full word e.g. for the word shocked it would return s, sh, sho, shoc, shock, shocke, shocked
So I tried a trie (no pun intended).
web2.txt is 2.5 MB in size, and contains 2,493,838 words of varying length. The trie in the cell below is breaking my Google Colab notebook. I even upgraded to Google Colab Pro, and then to Google Colab Pro+ to try and accommodate the block of code, but it's still too much. Any more efficient ideas besides trie to get the same result?
# Find the words3 word list here: svnweb.freebsd.org/base/head/share/dict/web2?view=co
trie = {}
with open('/content/web2.txt') as words3:
    for word in words3:
        cur = trie
        for l in word:
            cur = cur.setdefault(l, {})
        cur['word'] = True # defined if this node indicates a complete word
def findWords(word, trie = trie, cur = '', words3 = []):
    for i, letter in enumerate(word):
        if letter in trie:
            if 'word' in trie[letter]:
                words3.append(cur)
            findWords(word, trie[letter], cur+letter, words3 )
            # first example: findWords(word[:i] + word[i+1:], trie[letter], cur+letter, word_list )
    return [word for word in words3 if word in words3]
words3 = findWords("crbtfopkgevyqdzsh")
I'm using Python 3.

A trie is overkill. There are only about 200 thousand words, so you can just make one pass through all of them and check whether each word can be formed from the letters in the base string.
This is a good use case for collections.Counter, which gives us a clean way to get the frequencies (i.e. "counters") of the letters of an arbitrary string:
from collections import Counter
base_counter = Counter("crbtfopkgevyqdzsh")
with open("data.txt") as input_file:
    for line in input_file:
        line = line.rstrip()
        line_counter = Counter(line.lower())
        # Can use <= instead if on Python 3.10+
        if line_counter & base_counter == line_counter:
            print(line)
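The same check can be wrapped in a function that works on any iterable of words, which makes it easy to try without the word list file. This is only a minimal sketch: the name `formable_words` and its parameters are hypothetical, not part of the answer above.

```python
from collections import Counter

def formable_words(candidates, letters):
    """Return the candidates that can be spelled using each letter at most as often as it occurs in `letters`."""
    base = Counter(letters)
    result = []
    for candidate in candidates:
        needed = Counter(candidate.lower())
        # The intersection keeps the minimum count per letter, so it equals
        # `needed` only when `base` has enough of every letter.
        if (needed & base) == needed:
            result.append(candidate)
    return result

print(formable_words(["shock", "shocked", "book", "fed"], "crbtfopkgevyqdzsh"))
```

Here "book" is rejected because it needs two o's and the base string contains only one.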

Related

Is there any way to obtain a random word from PyEnchant?

Is there a way to obtain a random word from PyEnchant's dictionaries?
I've tried doing the following:
enchant.Dict("<language>").keys() #Type 'Dict' has no attribute 'keys'
list(enchant.Dict("<language>")) #Type 'Dict' is not iterable
I've also tried looking into the module to see where it gets its wordlist from but haven't had any success.
Using the separate "Random-Words" module is a workaround, but as it doesn't follow the same wordlist as PyEnchant, not all words will match. It is also quite a slow method. It is, however, the best alternative I've found so far.
Your question really got me curious so I thought of some way to make a random word using enchant.
import enchant
import random
import string
# here I am getting hold of all the letters
letters = string.ascii_lowercase
# creating a string with a random length with random letters
word = "".join([random.choice(letters) for _ in range(random.randint(3, 8))])
d = enchant.Dict("en_US")
# using the `enchant` to suggest a word based on the random string we provide
random_word = d.suggest(word)
The suggest method returns a list of suggestions, and sometimes that list is empty, so you will need a loop that retries until random_word has a value.
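One way to write that retry loop is sketched below. To keep it runnable without PyEnchant installed, the function takes the suggestion function as a parameter; with PyEnchant you would pass `enchant.Dict("en_US").suggest`. The name `random_word` and the `max_tries` limit are assumptions made for this sketch.

```python
import random
import string

def random_word(suggest, rng=random, min_len=3, max_len=8, max_tries=100):
    """Feed random letter strings to `suggest` until it returns something.

    `suggest` is any callable that maps a string to a list of words.
    Returns None if no attempt produced a suggestion.
    """
    for _ in range(max_tries):
        length = rng.randint(min_len, max_len)
        seed = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        suggestions = suggest(seed)
        if suggestions:
            # Pick one of the suggestions instead of returning the whole list.
            return rng.choice(suggestions)
    return None
```

Note that the distribution of words obtained this way is not uniform over the dictionary; it depends on how the spell checker ranks suggestions.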
With the help of @furas this question has been resolved.
Using the dict-en text file in furas' PyWordle, I wrote a short code that filters out invalid words in pyenchant's wordlist.
import enchant
wordlist = enchant.Dict("en_US")
baseWordlist = open("dict-en.txt", "r")
lines = baseWordlist.readlines()
baseWordlist.close()
newWordlist = open("dict-en_NEW.txt", "w") #write to new text file
for line in lines:
    word = line.strip("\n")
    if wordlist.check(word): #if word exists in pyenchant's dictionary
        print(word + " is valid.")
        newWordlist.write(line)
    else:
        print(word + " is invalid.")
newWordlist.close()
Afterwards, calling the text file will allow you to gather the information in that line.
validWords = open("dict-en_NEW.txt", "r")
wordList = validWords.readlines()
myWord = wordList[<line>]
#<line> can be any int (max is .txt length), either a chosen one or a random one.
#this will return the word located at line <line>.

Print a list of unique words from a text file after removing punctuation, and find longest word

Goal is to a) print a list of unique words from a text file and also b) find the longest word.
I cannot use imports in this challenge.
File handling and main functionality are what I want, however the list needs to be cleaned. As you can see from the output, words are getting joined with punctuation and therefore maxLength is obviously incorrect.
with open("doc.txt") as reader, open("unique.txt", "w") as writer:
    unwanted = "[],."
    unique = set(reader.read().split())
    unique = list(unique)
    unique.sort(key=len)
    regex = [elem.strip(unwanted).split() for elem in unique]
    writer.write(str(regex))
    reader.close()
maxLength = len(max(regex,key=len ))
print(maxLength)
res = [word for word in regex if len(word) == maxLength]
print(res)
===========
Sample:
pioneered the integrated placement year concept over 50 years ago [7][8][9] with more than 70 per cent of students taking a placement year, the highest percentage in the UK.[10]
Here's a solution that uses str.translate() to throw away all bad characters (+ newline) before we ever do the split(). (Normally we'd use a regex with re.sub(), but you're not allowed.) This makes the cleaning a one-liner, which is really neat:
bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))
# We can directly read and clean the entire output, without a reader object:
cleaned_input = open('doc.txt').read().translate(bad_transtable)
#with open("doc.txt") as reader:
# cleaned_input = reader.read().translate(bad_transtable)
# Get list of unique words, in decreasing length
unique_words = sorted(set(cleaned_input.split()), key=lambda w: -len(w))
with open("unique.txt", "w") as writer:
    for word in unique_words:
        writer.write(f'{word}\n')
max_length = len(unique_words[0])
print([word for word in unique_words if len(word) == max_length])
Notes:
- since the input is already 100% cleaned and split, there's no need to append to a list/insert into a set as we go, then make another cleaning pass later. We can just create unique_words directly (using set() to keep only uniques). And while we're at it, we might as well use sorted(..., key=lambda w: -len(w)) to sort it in decreasing length. We only need to sort once, and there is no iterative append to lists.
- hence we guarantee that max_length = len(unique_words[0])
- this approach is also going to be more performant than nested loops (for line in <lines>: for word in line.split(): ...) with an iterative append() to a word list
- no need to do explicit writer/reader open()/close(), that's what the with statement does for you. (It's also more elegant for handling IO when exceptions happen.)
- you could also merge the printing of the max_length words inside the writer loop. But it's cleaner code to keep them separate.
- note we use f-string formatting f'{word}\n' to add the newline back when we write() an output line
- in Python we use lower_case_with_underscores for variable names, hence max_length not maxLength. See PEP8
- in fact here, we don't strictly need a with-statement for the reader, if all we're going to do is slurp its entire contents in one go with open('doc.txt').read(). (That's not scalable for huge files; you'd have to read in chunks or n lines at a time.)
- str.maketrans() is a static method of the built-in str type, but if your teacher objects to the str. reference, you can also call it on a string instance, e.g. ' '.maketrans()
- str.maketrans() is really a throwback to the days when we only had 95 printable ASCII characters, not Unicode. It still works on Unicode, but building and using huge translation dicts is annoying and uses memory. Regex on Unicode is easier, since you can define entire character classes.
Alternative solution if you don't yet know str.translate()
dirty_input = open('doc.txt').read()
cleaned_input = dirty_input
# If you can't use either 're.sub()' or 'str.translate()', have to manually
# str.replace() each bad char one-by-one (or else use a method like str.isalpha())
for bad_char in bad:
    cleaned_input = cleaned_input.replace(bad_char, ' ')
And if you wanted to be ridiculously minimalist, you could write the entire output file in one line with a list-comprehension. Don't do this, it would be terrible for debugging, e.g if you couldn't open/write/overwrite the output file, or got IOError, or unique_words wasn't a list, etc:
open("unique.txt", "w").writelines([f'{word}\n' for word in unique_words])
Here is another solution without any function.
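a = open('doc.txt').read() # the text to clean (not defined in the original snippet)
bad = '`~@#$%^&*()-_=+[]{}\\|;\':".>?<,/?'
clean = ' '
for i in a:
    if i not in bad:
        clean += i
    else:
        clean += ' '
# split() with no argument also splits on newlines and tabs
cleans = [i for i in clean.split() if len(i)]
clean_uniq = list(set(cleans))
clean_uniq.sort(key=len)
print(clean_uniq)
print(len(clean_uniq[-1]))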
Here is a solution. The trick is to use the Python str method .isalpha() to filter out non-alphabetic characters.
with open("unique.txt", "w") as writer:
    with open("doc.txt") as reader:
        cleaned_words = []
        for line in reader.readlines():
            for word in line.split():
                cleaned_word = ''.join([c for c in word if c.isalpha()])
                if len(cleaned_word):
                    cleaned_words.append(cleaned_word)
    # print unique words
    unique_words = set(cleaned_words)
    print(unique_words)
    # write words to file? depends what you need here
    for word in unique_words:
        writer.write(str(word))
        writer.write('\n')
    # print length of longest
    print(len(sorted(unique_words, key=len, reverse=True)[0]))

How can I pull out text snippets around specific words?

I have a large txt file and I'm trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above.
def occurs(word1, word2, filename):
    import os
    infile = open(filename,'r') #opens file, reads, splits into lines
    lines = infile.read().splitlines()
    infile.close()
    wordlist = [word1, word2] #this list allows for multiple words
    wordsString = ' '.join(lines) #joins the lines back together with spaces
    words = wordsString.split() #splits file into individual words
    f = open(filename, 'w')
    f.write("start")
    f.write(os.linesep)
    for word in wordlist:
        matches = [i for i, w in enumerate(words) if w.lower().find(word) != -1]
        for m in matches:
            l = " ".join(words[m-15:m+16])
            f.write(f"...{l}...") #writes the data to the external file
            f.write(os.linesep)
    f.close()
So far, when two of the same word are too close together, the program just doesn't run on one of them. Instead, I want to get out a longer chunk of text that extends 15 words behind and in front of furthest back and forward words
This snippet will get a given number of words around the chosen keyword. If several keywords are close together, it will join them into one snippet:
s = '''xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above. xxx'''
words = s.split()
from itertools import groupby, chain
word = 'xxx'
def get_snippets(words, word, l):
    snippets, current_snippet, cnt = [], [], 0
    for v, g in groupby(words, lambda w: w != word):
        w = [*g]
        if v:
            if len(w) < l:
                current_snippet += [w]
            else:
                current_snippet += [w[:l] if cnt % 2 else w[-l:]]
                snippets.append([*chain.from_iterable(current_snippet)])
                current_snippet = [w[-l:] if cnt % 2 else w[:l]]
                cnt = 0
            cnt += 1
        else:
            if current_snippet:
                current_snippet[-1].extend(w)
            else:
                current_snippet += [w]
    if current_snippet[-1][-1] == word or len(current_snippet) > 1:
        snippets.append([*chain.from_iterable(current_snippet)])
    return snippets
for snippet in get_snippets(words, word, 15):
    print(' '.join(snippet))
Prints:
xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15
other, which I'm trying to get as one large snippet of text. I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working
topic. So far, I have working code for all instances except the scenario mentioned above. xxx
With the same data and a different length:
for snippet in get_snippets(words, word, 2):
    print(' '.join(snippet))
Prints:
xxx and I'm
I have xxx trying to
trying to xxx get chunks
mentioned above. xxx
As always, a variety of solutions are available here. A fun one would be a recursive wordFind, where it searches the next 15 words and, if it finds the target word, calls itself.
A simpler, though perhaps not efficient, solution would be to add words one at a time:
for m in matches:
    start = max(m - 15, 0)
    end = m # index of the last word included so far
    remaining = 15 # words still to add after the latest target word
    while remaining > 0 and end + 1 < len(words):
        end += 1
        if words[end].lower() == word:
            remaining = 15 # another target word: extend the snippet
        else:
            remaining -= 1
    l = " ".join(words[start:end + 1])
    f.write(f"...{l}...") #writes the data to the external file
    f.write(os.linesep)
Or if you're wanting the subsequent uses to be removed...
last_written = -1 # index of the last word already written to the file
for m in matches:
    if m <= last_written:
        continue # this match is already part of the previous snippet
    start = max(m - 15, 0)
    end = m
    remaining = 15
    while remaining > 0 and end + 1 < len(words):
        end += 1
        if words[end].lower() == word:
            remaining = 15 # another target word: extend the snippet
        else:
            remaining -= 1
    last_written = end
    l = " ".join(words[start:end + 1])
    f.write(f"...{l}...")
    f.write(os.linesep)
Note, I have not tested this, so it may require a bit of debugging. But the gist is clear: add words piecemeal and extend the addition process when a target word is encountered. This also allows you to extend on target words other than the current one with a small addition to the conditional that checks for the target word.
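Another way to look at the problem is as interval merging: each match defines a window of word indices, and windows that overlap or touch are fused into one. The sketch below is self-contained and works on a list of words rather than a file. The function name `merged_snippets` and the `width` parameter are hypothetical, and it uses the same substring matching as the question's code. Note one design choice: windows are merged whenever they overlap, so matches up to roughly twice `width` apart end up in the same snippet, which is slightly more generous than "within 15 words of each other".

```python
def merged_snippets(words, target, width=15):
    """Return snippets of `width` words around each match, merging overlapping windows."""
    target = target.lower()
    matches = [i for i, w in enumerate(words) if target in w.lower()]
    windows = []  # list of [start, end) index pairs, in increasing order
    for m in matches:
        start = max(m - width, 0)
        end = min(m + width + 1, len(words))
        if windows and start <= windows[-1][1]:
            # Overlaps or touches the previous window: extend it.
            windows[-1][1] = end
        else:
            windows.append([start, end])
    return [" ".join(words[s:e]) for s, e in windows]

sample = "a b xxx c d e xxx f g h i j k xxx l".split()
for snippet in merged_snippets(sample, "xxx", width=2):
    print(f"...{snippet}...")
```

With `width=2` the first two matches share one snippet, while the third, further away, gets its own.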

Frequency of keywords in a list

Hi, I have 2 text files. I have to read the first text file, count the frequency of each word, remove duplicates, and create a list of lists containing each word and its count in the file.
My second text file contains keywords. I need to count the frequency of these keywords in the first text file and return the result without using any imports, dict, or zips.
I am stuck on how to go about this second part. I have the file open and have removed punctuation etc., but I have no clue how to find the frequency.
I played around with the idea of .find() but no luck as of yet.
Any suggestions would be appreciated. This is my code at the moment; it seems to find the frequency of the keyword in the keyword file but not in the first text file.
def calculateFrequenciesTest(aString):
    listKeywords= aString
    listSize = len(listKeywords)
    keywordCountList = []
    while listSize > 0:
        targetWord = listKeywords [0]
        count =0
        for i in range(0,listSize):
            if targetWord == listKeywords [i]:
                count = count +1
        wordAndCount = []
        wordAndCount.append(targetWord)
        wordAndCount.append(count)
        keywordCountList.append(wordAndCount)
        for i in range (0,count):
            listKeywords.remove(targetWord)
        listSize = len(listKeywords)
    sortedFrequencyList = readKeywords(keywordCountList)
    return keywordCountList;
EDIT- Currently toying around with the idea of reopening my first file again but this time without turning it into a list? I think my errors are somehow coming from it counting the frequency of my list of list. These are the types of results I am getting
[[['the', 66], 1], [['of', 32], 1], [['and', 27], 1], [['a', 23], 1], [['i', 23], 1]]
You can try something like:
I am taking a list of words as an example.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = {}
for word in word_list:
    if word not in frequency_list:
        frequency_list[word] = 1
    else:
        frequency_list[word] += 1
print(frequency_list)
RESULT: {'test': 1, 'world': 1, 'hello': 2}
Since, you have put a constraint on dicts, I have made use of two lists to do the same task. I am not sure how efficient it is, but it serves the purpose.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = []
frequency_word = []
for word in word_list:
    if word not in frequency_word:
        frequency_word.append(word)
        frequency_list.append(1)
    else:
        ind = frequency_word.index(word)
        frequency_list[ind] += 1
print(frequency_word)
print(frequency_list)
RESULT : ['hello', 'world', 'test']
[2, 1, 1]
You can change it to how you like or re-factor it as you wish
I agree with @bereal that you should use Counter for this. I see that you have said that you don't want "imports, dict, or zips", so feel free to disregard this answer. Yet, one of the major advantages of Python is its great standard library, and every time you have list available, you'll also have dict, collections.Counter and re.
From your code I'm getting the impression that you want to use the same style that you would have used with C or Java. I suggest trying to be a little more pythonic. Code written this way may look unfamiliar, and can take time getting used to. Yet, you'll learn way more.
Clarifying what you're trying to achieve would help. Are you learning Python? Are you solving this specific problem? Why can't you use any imports, dict, or zips?
So here's a proposal utilizing built in functionality (no third party) for what it's worth (tested with Python 2):
#!/usr/bin/python
import re # String matching
import collections # collections.Counter basically solves your problem
def loadwords(s):
    """Find the words in a long string.
    Words are separated by whitespace. Typical signs are ignored.
    """
    return (s
            .replace(".", " ")
            .replace(",", " ")
            .replace("!", " ")
            .replace("?", " ")
            .lower()).split()
def loadwords_re(s):
    """Find the words in a long string.
    Words are separated by whitespace. Only letters and ' are allowed in words.
    """
    return (re.sub(r"[^a-z']", " ", s.lower())
            .split())
# You may want to read this from a file instead
sourcefile_words = loadwords_re("""this is a sentence. This is another sentence.
Let's write many sentences here.
Here comes another sentence.
And another one.
In English, we use plenty of "a" and "the". A whole lot, actually.
""")
# Sets are really fast for answering the question: "is this element in the set?"
# You may want to read this from a file instead
keywords = set(loadwords_re("""
of and a i the
"""))
# Count every word in sourcefile_words, regardless of your keywords
wordcount_all = collections.Counter(sourcefile_words)
# Lookup word counts like this (Counter is a dictionary)
count_this = wordcount_all["this"] # returns 2
count_a = wordcount_all["a"] # returns 3
# Only look for words in the keywords-set
wordcount_keywords = collections.Counter(word
                                         for word in sourcefile_words
                                         if word in keywords)
count_and = wordcount_keywords["and"] # Returns 2
all_counted_keywords = wordcount_keywords.keys() # The keys 'a', 'and', 'of', 'the' (a list in Python 2, a view in Python 3)
Here is a solution with no imports. It uses nested linear searches which are acceptable with a small number of searches over a small input array, but will become unwieldy and slow with larger inputs.
Still the input here is quite large, but it handles it in reasonable time. I suspect if your keywords file was larger (mine has only 3 words) the slow down would start to show.
Here we take an input file, iterate over the lines and remove punctuation then split by spaces and flatten all the words into a single list. The list has dupes, so to remove them we sort the list so the dupes come together and then iterate over it creating a new list containing the string and a count. We can do this by incrementing the count as long the same word appears in the list and moving to a new entry when a new word is seen.
Now you have your word frequency list and you can search it for the required keyword and retrieve the count.
The input text file is here and the keyword file can be cobbled together with just a few words in a file, one per line.
This is Python 3 code; comments indicate where applicable how to modify it for Python 2.
# use string.punctuation if you are somehow allowed
# to import the string module.
translator = str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
words = []
with open('hamlet.txt') as f:
    for line in f:
        if line:
            line = line.translate(translator)
            # py 2 alternative
            #line = line.translate(None, string.punctuation)
            words.extend(line.strip().split())
# sort the word list, so instances of the same word are
# contiguous in the list and can be counted together
words.sort()
thisword = ''
counts = []
# for each word in the list add to the count as long as the
# word does not change
for w in words:
    if w != thisword:
        counts.append([w, 1])
        thisword = w
    else:
        counts[-1][1] += 1
for c in counts:
    print('%s (%d)' % (c[0], c[1]))
# function to prevent need to break out of nested loop
def findword(clist, word):
    for c in clist:
        if c[0] == word:
            return c[1]
    return 0
# open keywords file and search for each word in the
# frequency list.
with open('keywords.txt') as f2:
    for line in f2:
        if line:
            word = line.strip()
            thiscount = findword(counts, word)
            print('keyword %s appear %d times in source' % (word, thiscount))
If you were so inclined you could modify findword to use a binary search, but it's still not going to be anywhere near as fast as a dict. collections.Counter is the right solution when there are no restrictions. It's quicker and less code.
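For reference, a binary-search variant of findword could look like the sketch below. It relies on the list of [word, count] pairs being sorted by word, which holds for counts above because it was built from the sorted word list. The name `findword_binary` is hypothetical, and the search is written by hand to respect the no-imports constraint.

```python
def findword_binary(clist, word):
    """Look up `word` in a list of [word, count] pairs sorted by word.

    Returns the count, or 0 if the word is not present.
    """
    lo, hi = 0, len(clist) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        entry_word, entry_count = clist[mid]
        if entry_word == word:
            return entry_count
        if entry_word < word:
            lo = mid + 1  # the word, if present, is in the upper half
        else:
            hi = mid - 1  # the word, if present, is in the lower half
    return 0

example_counts = [['a', 3], ['and', 2], ['of', 1], ['the', 5]]
print(findword_binary(example_counts, 'and'))
```

Each lookup takes a number of steps proportional to the logarithm of the list length, compared with a full scan for the linear version.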

Read the next word in a file in python

I am looking for some words in a file in python. After I find each word I need to read the next two words from the file. I've looked for some solution but I could not find reading just the next words.
# offsetFile - file pointer
# searchTerms - list of words
for line in offsetFile:
    for word in searchTerms:
        if word in line:
            # here get the next two terms after the word
Thank you for your time.
Update: Only the first appearance is necessary. Actually only one appearance of the word is possible in this case.
file:
accept 42 2820 access 183 3145 accid 1 4589 algebra 153 16272 algem 4 17439 algol 202 6530
word: ['access', 'algebra']
Searching the file when I encounter 'access' and 'algebra', I need the values of 183 3145 and 153 16272 respectively.
An easy way to deal with this is to read the file using a generator that yields one word at a time from the file.
def words(fileobj):
    for line in fileobj:
        for word in line.split():
            yield word
Then to find the word you're interested in and read the next two words:
with open("offsetfile.txt") as wordfile:
    wordgen = words(wordfile)
    for word in wordgen:
        if word in searchterms: # searchterms should be a set() to make this fast
            break
    else:
        word = None # makes sure word is None if the word wasn't found
    foundwords = [word, next(wordgen, None), next(wordgen, None)]
Now foundwords[0] is the word you found, foundwords[1] is the word after that, and foundwords[2] is the second word after it. If there aren't enough words, then one or more elements of the list will be None.
It is a little more complex if you want to force this to match only within one line, but usually you can get away with considering the file as just a sequence of words.
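A line-bound variant could be sketched as follows. It handles several search terms at once and never reads past the end of the line a term was found on. The function name `find_with_next_two` and the choice to return a dict are assumptions made for this sketch; only the first occurrence of each term is recorded, in line with the question's update.

```python
def find_with_next_two(lines, search_terms):
    """Map each found search term to the two words that follow it on the same line.

    Missing followers (term near the end of a line) are filled with None.
    """
    terms = set(search_terms)  # set membership tests are fast
    found = {}
    for line in lines:
        tokens = line.split()
        for i, token in enumerate(tokens):
            if token in terms and token not in found:
                following = tokens[i + 1:i + 3]
                following += [None] * (2 - len(following))
                found[token] = following
    return found

sample_line = "accept 42 2820 access 183 3145 accid 1 4589 algebra 153 16272"
print(find_with_next_two([sample_line], ["access", "algebra"]))
```

Since a file object iterates over its lines, an open file can be passed directly as `lines`.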
If you need to retrieve only two first words, just do it:
offsetFile.readline().split()[:2]
word = '3' #Your word
delim = ',' #Your delim
with open('test_file.txt') as f:
    for line in f:
        if word in line:
            s_line = line.strip().split(delim)
            two_words = (s_line[s_line.index(word) + 1],
                         s_line[s_line.index(word) + 2])
            break
def searchTerm(offsetFile, searchTerms):
    # remove any found words from this list; if empty we can exit
    searchThese = searchTerms[:]
    for line in offsetFile:
        words_in_line = line.split()
        # Use this list comprehension if two numbers always follow a word.
        # Else use words_in_line.
        for word in [w for i, w in enumerate(words_in_line) if i % 3 == 0]:
            # No more words to search.
            if not searchThese:
                return
            # Search remaining words.
            if word in searchThese:
                searchThese.remove(word)
                i = words_in_line.index(word)
                print(words_in_line[i:i+3])
For 'access', 'algebra' I get this result:
['access', '183', '3145']
['algebra', '153', '16272']
