I have a large txt file and I'm trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above.
def occurs(word1, word2, filename):
    import os
    infile = open(filename,'r') #opens file, reads, splits into lines
    lines = infile.read().splitlines()
    infile.close()
    wordlist = [word1, word2] #this list allows for multiple words
    wordsString = ' '.join(lines) #rejoins the lines with spaces so words at line ends don't fuse
    words = wordsString.split() #splits file into individual words
    f = open(filename, 'w')
    f.write("start")
    f.write(os.linesep)
    for word in wordlist:
        matches = [i for i, w in enumerate(words) if w.lower().find(word) != -1]
        for m in matches:
            l = " ".join(words[m-15:m+16])
            f.write(f"...{l}...") #writes the data to the external file
            f.write(os.linesep)
    f.close()
So far, when two instances of the same word are too close together, the program just doesn't run on one of them. Instead, I want to get a longer chunk of text that extends 15 words before the first instance and 15 words after the last one.
This snippet collects a given number of words around the chosen keyword. If several keywords occur close together, it joins them into one snippet:
s = '''xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above. xxx'''
words = s.split()
from itertools import groupby, chain
word = 'xxx'
def get_snippets(words, word, l):
    snippets, current_snippet, cnt = [], [], 0
    for v, g in groupby(words, lambda w: w != word):
        w = [*g]
        if v:
            if len(w) < l:
                current_snippet += [w]
            else:
                current_snippet += [w[:l] if cnt % 2 else w[-l:]]
                snippets.append([*chain.from_iterable(current_snippet)])
                current_snippet = [w[-l:] if cnt % 2 else w[:l]]
                cnt = 0
            cnt += 1
        else:
            if current_snippet:
                current_snippet[-1].extend(w)
            else:
                current_snippet += [w]
    if current_snippet[-1][-1] == word or len(current_snippet) > 1:
        snippets.append([*chain.from_iterable(current_snippet)])
    return snippets
for snippet in get_snippets(words, word, 15):
    print(' '.join(snippet))
Prints:
xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15
other, which I'm trying to get as one large snippet of text. I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working
topic. So far, I have working code for all instances except the scenario mentioned above. xxx
With the same data and a different length:
for snippet in get_snippets(words, word, 2):
    print(' '.join(snippet))
Prints:
xxx and I'm
I have xxx trying to
trying to xxx get chunks
mentioned above. xxx
As always, a variety of solutions are available here. A fun one would be a recursive wordFind that searches the next 15 words and calls itself if it finds the target word.
A simpler, though perhaps not efficient, solution would be to add words one at a time:
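for m in matches:
    snippet = words[max(m - 15, 0):m + 1]  # up to 15 words before, plus the match itself
    end = m + 15  # index of the last word to include
    i = m + 1
    while i <= end and i < len(words):
        if words[i].lower() == word:
            end = i + 15  # another target word: restart the 15-word countdown
        snippet.append(words[i])
        i += 1
    l = " ".join(snippet)
    f.write(f"...{l}...") #writes the data to the external file
    f.write(os.linesep)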
Or if you're wanting the subsequent uses to be removed...
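covered_until = -1  # index of the last word already written out
for m in matches:
    if m <= covered_until:
        continue  # this match was already included in the previous snippet
    snippet = words[max(m - 15, 0):m + 1]
    end = m + 15
    i = m + 1
    while i <= end and i < len(words):
        if words[i].lower() == word:
            end = i + 15  # extend the snippet past the nearby match
        snippet.append(words[i])
        i += 1
    covered_until = i - 1
    l = " ".join(snippet)
    f.write(f"...{l}...")
    f.write(os.linesep)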
Note, I have not tested this, so it may require a bit of debugging. But the gist is clear: add words piecemeal and extend the addition process when a target word is encountered. This also allows you to extend on target words other than the current one, with a small addition to the second conditional.
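Another way to look at the "extend when a target word is nearby" idea is as merging overlapping index ranges. Here is a minimal sketch, assuming the text is already split into a list called words. The function name merged_snippets and its width parameter are made up for illustration, and it matches whole words case-insensitively rather than by substring as in the question:

```python
def merged_snippets(words, target, width=15):
    """Return context snippets around `target`, merging windows that overlap."""
    target = target.lower()
    # Index of every word equal to the target (case-insensitive).
    matches = [i for i, w in enumerate(words) if w.lower() == target]

    ranges = []  # list of [start, end) index pairs, in order
    for m in matches:
        start = max(m - width, 0)              # clamp so a negative index never wraps
        end = min(m + width + 1, len(words))   # exclusive end, clamped to the list
        if ranges and start <= ranges[-1][1]:
            # Overlaps or touches the previous window: extend that window instead.
            ranges[-1][1] = end
        else:
            ranges.append([start, end])

    return [" ".join(words[s:e]) for s, e in ranges]


sample = "a b x c d x e f g h i x j".split()
for snippet in merged_snippets(sample, "x", width=1):
    print(f"...{snippet}...")
```

Clamping the start index with max() also avoids the case where words[m-15:m+16] silently returns an empty list because m-15 is negative, which may be why some matches near the start of the file seem to be skipped.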
Related
I'm trying to get all the words made from the letters, 'crbtfopkgevyqdzsh' from a file called web2.txt. The posted cell below follows a block of code which improperly returned the whole run up to a full word e.g. for the word shocked it would return s, sh, sho, shoc, shock, shocke, shocked
So I tried a trie (know pun intended).
web2.txt is 2.5 MB in size, and contains 2,493,838 words of varying length. The trie in the cell below is breaking my Google Colab notebook. I even upgraded to Google Colab Pro, and then to Google Colab Pro+ to try and accommodate the block of code, but it's still too much. Any more efficient ideas besides trie to get the same result?
# Find the words3 word list here: svnweb.freebsd.org/base/head/share/dict/web2?view=co
trie = {}
with open('/content/web2.txt') as words3:
    for word in words3:
        cur = trie
        for l in word:
            cur = cur.setdefault(l, {})
        cur['word'] = True # defined if this node indicates a complete word
def findWords(word, trie = trie, cur = '', words3 = []):
    for i, letter in enumerate(word):
        if letter in trie:
            if 'word' in trie[letter]:
                words3.append(cur)
            findWords(word, trie[letter], cur+letter, words3 )
            # first example: findWords(word[:i] + word[i+1:], trie[letter], cur+letter, word_list )
    return [word for word in words3 if word in words3]
words3 = findWords("crbtfopkgevyqdzsh")
I'm using Python 3.
A trie is overkill. There's about 200 thousand words, so you can just make one pass through all of them to see if you can form the word using the letters in the base string.
This is a good use case for collections.Counter, which gives us a clean way to get the frequencies (i.e. "counters") of the letters of an arbitrary string:
from collections import Counter
base_counter = Counter("crbtfopkgevyqdzsh")
with open("data.txt") as input_file:
    for line in input_file:
        line = line.rstrip()
        line_counter = Counter(line.lower())
        # The intersection equals line_counter only if every letter is available
        # often enough. On Python 3.10+ this can be written as
        # line_counter <= base_counter instead.
        if line_counter & base_counter == line_counter:
            print(line)
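If the Counter intersection looks opaque, the same test can be spelled out letter by letter. This is only a sketch of the idea; can_form is a name I made up, and it works on any Python 3 version:

```python
from collections import Counter


def can_form(word, letters):
    """True if `word` can be built from `letters`, using each letter at most once."""
    need = Counter(word.lower())
    have = Counter(letters)
    # Every required letter must be available at least as many times as needed.
    # A Counter returns 0 for letters it has never seen, so no KeyError here.
    return all(have[ch] >= n for ch, n in need.items())


print(can_form("shocked", "crbtfopkgevyqdzsh"))  # every letter appears once in the base
print(can_form("book", "crbtfopkgevyqdzsh"))     # needs two o's, the base has one
```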
Hi, I have 2 text files. I have to read the first text file, count the frequency of each word, remove duplicates, and create a list of lists with each word and its count in the file.
My second text file contains keywords. I need to count the frequency of these keywords in the first text file and return the result without using any imports, dict, or zips.
I am stuck on how to go about this second part. I have the file open and have removed punctuation etc., but I have no clue how to find the frequency.
I played around with the idea of .find() but no luck as of yet.
Any suggestions would be appreciated. This is my code at the moment; it seems to find the frequency of the keyword in the keyword file but not in the first text file.
def calculateFrequenciesTest(aString):
    listKeywords= aString
    listSize = len(listKeywords)
    keywordCountList = []
    while listSize > 0:
        targetWord = listKeywords [0]
        count =0
        for i in range(0,listSize):
            if targetWord == listKeywords [i]:
                count = count +1
        wordAndCount = []
        wordAndCount.append(targetWord)
        wordAndCount.append(count)
        keywordCountList.append(wordAndCount)
        for i in range (0,count):
            listKeywords.remove(targetWord)
        listSize = len(listKeywords)
    sortedFrequencyList = readKeywords(keywordCountList)
    return keywordCountList;
EDIT- Currently toying around with the idea of reopening my first file again but this time without turning it into a list? I think my errors are somehow coming from it counting the frequency of my list of list. These are the types of results I am getting
[[['the', 66], 1], [['of', 32], 1], [['and', 27], 1], [['a', 23], 1], [['i', 23], 1]]
You can try something like:
I am taking a list of words as an example.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = {}
for word in word_list:
    if word not in frequency_list:
        frequency_list[word] = 1
    else:
        frequency_list[word] += 1
print(frequency_list)
RESULT: {'test': 1, 'world': 1, 'hello': 2}
Since you have put a constraint on dicts, I have used two parallel lists to do the same task. I am not sure how efficient it is, but it serves the purpose.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = []
frequency_word = []
for word in word_list:
    if word not in frequency_word:
        frequency_word.append(word)
        frequency_list.append(1)
    else:
        ind = frequency_word.index(word)
        frequency_list[ind] += 1
print(frequency_word)
print(frequency_list)
RESULT : ['hello', 'world', 'test']
[2, 1, 1]
You can change it to how you like or re-factor it as you wish
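For the second part of the question (counting only the keywords from the second file), the same list-only style can be applied. A minimal sketch, assuming both files have already been read into plain lists of cleaned words; count_keywords is a made-up name:

```python
def count_keywords(words, keywords):
    """Return [[keyword, count], ...] using only lists (no dict, imports, or zip)."""
    result = []
    for keyword in keywords:
        count = 0
        # Scan the words from the first file, not the keyword list itself.
        for word in words:
            if word == keyword:
                count += 1
        result.append([keyword, count])
    return result


word_list = ['hello', 'world', 'test', 'hello']
print(count_keywords(word_list, ['hello', 'missing']))
```

Note that it takes the flat word list as input. Passing the already-counted list of [word, count] pairs instead would produce nested results like the [['the', 66], 1] output shown in the question.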
I agree with @bereal that you should use Counter for this. I see that you have said that you don't want "imports, dict, or zips", so feel free to disregard this answer. Yet, one of the major advantages of Python is its great standard library, and every time you have list available, you'll also have dict, collections.Counter and re.
From your code I'm getting the impression that you want to use the same style that you would have used with C or Java. I suggest trying to be a little more pythonic. Code written this way may look unfamiliar, and can take time getting used to. Yet, you'll learn way more.
Clarifying what you're trying to achieve would help. Are you learning Python? Are you solving this specific problem? Why can't you use any imports, dict, or zips?
So here's a proposal utilizing built in functionality (no third party) for what it's worth (tested with Python 2):
#!/usr/bin/python
import re # String matching
import collections # collections.Counter basically solves your problem
def loadwords(s):
    """Find the words in a long string.
    Words are separated by whitespace. Typical signs are ignored.
    """
    return (s
            .replace(".", " ")
            .replace(",", " ")
            .replace("!", " ")
            .replace("?", " ")
            .lower()).split()
def loadwords_re(s):
    """Find the words in a long string.
    Words are separated by whitespace. Only characters and ' are allowed in strings.
    """
    return (re.sub(r"[^a-z']", " ", s.lower())
            .split())
# You may want to read this from a file instead
sourcefile_words = loadwords_re("""this is a sentence. This is another sentence.
Let's write many sentences here.
Here comes another sentence.
And another one.
In English, we use plenty of "a" and "the". A whole lot, actually.
""")
# Sets are really fast for answering the question: "is this element in the set?"
# You may want to read this from a file instead
keywords = set(loadwords_re("""
of and a i the
"""))
# Count every word in sourcefile_words, whether or not it is a keyword
wordcount_all = collections.Counter(sourcefile_words)
# Lookup word counts like this (Counter is a dictionary)
count_this = wordcount_all["this"] # returns 2
count_a = wordcount_all["a"] # returns 3
# Only look for words in the keywords-set
wordcount_keywords = collections.Counter(word
                                         for word in sourcefile_words
                                         if word in keywords)
count_and = wordcount_keywords["and"] # Returns 2
all_counted_keywords = wordcount_keywords.keys() # Returns ['a', 'and', 'the', 'of']
Here is a solution with no imports. It uses nested linear searches which are acceptable with a small number of searches over a small input array, but will become unwieldy and slow with larger inputs.
Still the input here is quite large, but it handles it in reasonable time. I suspect if your keywords file was larger (mine has only 3 words) the slow down would start to show.
Here we take an input file, iterate over the lines and remove punctuation then split by spaces and flatten all the words into a single list. The list has dupes, so to remove them we sort the list so the dupes come together and then iterate over it creating a new list containing the string and a count. We can do this by incrementing the count as long the same word appears in the list and moving to a new entry when a new word is seen.
Now you have your word frequency list and you can search it for the required keyword and retrieve the count.
The input text file is here and the keyword file can be cobbled together with just a few words in a file, one per line.
This is Python 3 code; the comments indicate where to modify it for Python 2.
# use string.punctuation if you are somehow allowed
# to import the string module.
translator = str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
words = []
with open('hamlet.txt') as f:
    for line in f:
        if line:
            line = line.translate(translator)
            # py 2 alternative
            #line = line.translate(None, string.punctuation)
            words.extend(line.strip().split())
# sort the word list, so instances of the same word are
# contiguous in the list and can be counted together
words.sort()
thisword = ''
counts = []
# for each word in the list add to the count as long as the
# word does not change
for w in words:
    if w != thisword:
        counts.append([w, 1])
        thisword = w
    else:
        counts[-1][1] += 1
for c in counts:
    print('%s (%d)' % (c[0], c[1]))
# function to prevent need to break out of nested loop
def findword(clist, word):
    for c in clist:
        if c[0] == word:
            return c[1]
    return 0
# open keywords file and search for each word in the
# frequency list.
with open('keywords.txt') as f2:
    for line in f2:
        if line:
            word = line.strip()
            thiscount = findword(counts, word)
            print('keyword %s appear %d times in source' % (word, thiscount))
If you were so inclined you could modify findword to use a binary search, but it's still not going to be anywhere near a dict. collections.Counter is the right solution when there are no restrictions. It's quicker and less code.
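For what it's worth, a binary-search variant might look like the sketch below. It relies on the frequency list being sorted by word, which holds here because the words were sorted before counting. The name findword_binary is made up, and no imports are needed:

```python
def findword_binary(clist, word):
    """Look up `word` in a list of [word, count] pairs sorted by word.

    Returns the count, or 0 if the word is absent.
    """
    lo, hi = 0, len(clist) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        entry = clist[mid]
        if entry[0] == word:
            return entry[1]
        if entry[0] < word:
            lo = mid + 1   # target sorts after the middle entry
        else:
            hi = mid - 1   # target sorts before the middle entry
    return 0


sample_counts = [['a', 3], ['be', 1], ['the', 5], ['to', 2]]
print(findword_binary(sample_counts, 'the'))
print(findword_binary(sample_counts, 'hamlet'))
```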
I am writing a program where I need to take a string of words and convert it into numbers, so that hi bye hi hello would turn into 0 1 0 2. I have used dictionaries to do this, and this is why I am having trouble with the next part. I then need to compress this into a text file, and then decompress and reconstruct it into a string again. This is the bit I am stumped on.
The way I would like to do it is by writing the indexes of the numbers (the 0 1 0 2 bit) into the text file along with the dictionary contents, so the text file would contain 0 1 0 2 and {hi:0, bye:1, hello:2}.
Now what I would like to do to decompress or read this into the python file, to use the indexes(this is how I will refer to the 0 1 0 2 from now on) to then take each word out of the dictionary and reconstruct the sentence, so if a 0 came up, it would look into the dictionary and then find what has a 0 definition, then pull that out to put into the string, so it would find hi and take that.
I hope that this is understandable and that at least one person knows how to do it, because I am sure it is possible, however I have been unable to find anything here or on the internet mentioning this subject.
TheLazyScripter gave a nice workaround solution for the problem, but the runtime characteristics are not good because for each reconstructed word you have to loop through the whole dict.
I would say you chose the wrong dict design: To be efficient, lookup should be done in one step, so you should have the numbers as keys and the words as items.
Since your problem looks like a great computer science homework (I'll consider it for my students ;-) ), I'll just give you a sketch for the solution:
use word in my_dict.values() #(adapt for py2/py3) to test whether the word is already in the dictionary.
If no, insert the next available index as key and the word as value.
you are done.
For reconstructing the sentence, just
loop through your list of numbers
use the number as key in your dict and print(my_dict[key])
Prepare exception handling for the case a key is not in the dict (which should not happen if you are controlling the whole process, but it's good practice).
This solution is much more efficient than your approach (and easier to implement).
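A minimal sketch of that design, with numbers as keys and words as values, could look like this. The names encode and decode are made up, and the extra word_to_index dict is my own addition so that encoding does not have to scan my_dict.values() for every word:

```python
def encode(sentence):
    """Return (indexes, index_to_word) for a whitespace-separated sentence."""
    index_to_word = {}   # number -> word, so decoding is a single lookup
    word_to_index = {}   # helper for encoding only (not part of the sketch above)
    indexes = []
    for word in sentence.split():
        if word not in word_to_index:
            new_index = len(index_to_word)   # next available index
            index_to_word[new_index] = word
            word_to_index[word] = new_index
        indexes.append(word_to_index[word])
    return indexes, index_to_word


def decode(indexes, index_to_word):
    """Rebuild the sentence from the indexes and the number -> word dict."""
    words = []
    for i in indexes:
        try:
            words.append(index_to_word[i])
        except KeyError:
            # Should not happen if both pieces come from encode().
            raise ValueError("index %r is not in the dictionary" % i)
    return " ".join(words)


indexes, lookup = encode("hi bye hi hello")
print(indexes, lookup)
print(decode(indexes, lookup))
```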
Yes, you can just use regular dicts and lists to store the data. And use json or pickle to persist the data to disk.
import pickle
s = 'hi hello hi bye'
words = s.split()
d = {}
for word in words:
    if word not in d:
        d[word] = len(d)
data = [d[word] for word in words]
with open('/path/to/file', 'wb') as f:  # pickle needs the file in binary mode
    pickle.dump({'lookup': d, 'data': data}, f)
Then read it back in
with open('/path/to/file', 'rb') as f:
    dic = pickle.load(f)
d = dic['lookup']
reverse_d = {v: k for k, v in d.items()}
data = dic['data']
words = [reverse_d[index] for index in data]
line = ' '.join(words)
print(line)
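If you want the file to be human-readable text, json works too. One caveat: JSON object keys are always strings, so this sketch stores the lookup as a list where the position of each word is its index. The function names save and load are made up for illustration:

```python
import json


def save(sentence, path):
    """Write the word list and the index sequence for `sentence` to `path` as JSON."""
    words = sentence.split()
    vocab = []       # position in this list is the word's index
    index_of = {}    # word -> index, used only while encoding
    for w in words:
        if w not in index_of:
            index_of[w] = len(vocab)
            vocab.append(w)
    payload = {"lookup": vocab, "data": [index_of[w] for w in words]}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f)


def load(path):
    """Read a file written by save() and rebuild the original sentence."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return " ".join(payload["lookup"][i] for i in payload["data"])
```

Calling save("hi bye hi hello", some_path) followed by load(some_path) should give back the original sentence, with words separated by single spaces.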
Because I don't know exactly how you have your keymap created, the best I could do is guess. Here I have created 2 functions that can be used to write a string to a txt file based on a keymap, and to read a txt file and return a string based on a keymap. I hope this works for you or at least gives you a solid understanding of the process! Good luck!
import os
def pack(out_file, string, conversion_map):
    out_string = ''
    for word in string.split(' '):
        for key, value in conversion_map.items():
            if word.lower() == value.lower():
                out_string += str(key)+' '
                break
        else:
            # no key matched: keep the word itself
            out_string += word+' '
    with open(out_file, 'w') as file:
        file.write(out_string)
    return out_string.rstrip()
def unpack(in_file, conversion_map, on_lookup_error=None):
    if not os.path.exists(in_file):
        return
    with open(in_file) as file:
        text = file.read()
    out_string = ''
    for word in text.split(' '):
        for key, value in conversion_map.items():
            if word.lower() == str(key).lower():
                out_string += str(value)+' '
                break
        else:
            # no key matched: report the error or keep the word as-is
            if on_lookup_error:
                on_lookup_error()
            else:
                out_string += str(word)+' '
    return out_string.rstrip()
def fail_on_lookup():
    print('We failed to find all words in our key map.')
    raise Exception
string = 'Hello, my first name is thelazyscripter'
word_to_int_map = {0:'first',
                   1:'name',
                   2:'is',
                   3:'TheLazyScripter',
                   4:'my'}
d = pack('data', string, word_to_int_map) #pack and write the data based on the conversion map
print(d) #the data that was written to the file
print(unpack('data', word_to_int_map)) #here we unpack the data from the file
print(unpack('data', word_to_int_map, fail_on_lookup))
I am iterating through hundreds of thousands of words in several documents, looking to find the frequencies of contractions in English. I have formatted the documents appropriately, and it's now a matter of writing the correct function and storing the data properly. I need to store information for each document on which contractions were found and how frequently they were used in the document. Ideally, my data frame would look something like the following:
filename contraction count
file1 it's 34
file1 they're 13
file1 she's 9
file2 it's 14
file2 we're 15
file3 it's 4
file4 it's 45
file4 she's 13
How can I best go about this?
Edit: Here's my code, thus far:
for i in contractions_list: # for each of the 144 contractions in my list
    for l in every_link: # for each speech
        count = 0
        word_count = 0
        content_2 = processURL_short(l)
        for word in content2.split():
            word = word.strip(p)
            word_count = word_count + 1
            if i in contractions:
                count = count + 1
Where processURL_short() is a function I wrote that scrapes a website and returns a speech as str.
Edit2:
link_store = {}
for i in contractions_list_test: # for each of the 144 contractions
    for l in every_link_test: # for each speech
        link_store[l] = {}
        count = 0
        word_count = 0
        content_2 = processURL_short(l)
        for word in content_2.split():
            word = word.strip(p)
            word_count = word_count + 1
            if word == i:
                count = count + 1
        if count: link_store[l][i] = count
        print i,l,count
Here's my file-naming code:
splitlink = l.split("/")
president = splitlink[4]
speech_num = splitlink[-1]
filename = "{0}_{1}".format(president,speech_num)
Opening and reading are slow operations: don't cycle through the entire file list 144 times.
Exceptions are slow: throwing an exception for every non-contraction in every speech will be ponderous.
Don't cycle through your list of contractions checking against words. Instead, use the built-in in operator to see whether a word is on the contraction list, and then use a dictionary to tally the entries, just as you might do by hand.
Go through the files, word by word. When you see a word on the contraction list, see whether it's already on your tally sheet. If so, add a mark, if not, add it to the sheet with a count of 1.
Here's an example. I've made very short speeches and a trivial processURL_short function.
def processURL_short(string):
    return string.lower()
every_link = [
    "It's time for going to Sardi's",
    "We're in the mood; it's about DST",
    "They're he's it's don't",
    "I'll be home for Christmas"]
contraction_list = [
    "it's",
    "don't",
    "can't",
    "i'll",
    "he's",
    "she's",
    "they're"
]
for l in every_link: # for each speech
    contraction_count = {}  # fresh tally sheet for this speech
    content = processURL_short(l)
    for word in content.split():
        if word in contraction_list:
            if word in contraction_count:
                contraction_count[word] += 1
            else:
                contraction_count[word] = 1
    for key, value in contraction_count.items():
        print(key, '\t', value)
you can have your structure set up like this:
links = {}
for l in every_link:
    links[l] = {}
    for i in contractions_list:
        count = 0
        ... #here is where you do your count, which you seem to know how to do
        ... #note that in your code, i think you meant if i in word/ if i == word for your final if statement
        if count: links[l][i] = count #only adds the value if count is not 0
you would end up with a data structure like this:
links = {
    'file1':{
        "it's":34,
        "they're":14,
        ...,
    },
    'file2':{
        ....,
    },
    ...,
}
which you could easily iterate through to write the necessary data to your file (which i again assume you know how to do since it's seemingly not part of the question)
Dictionaries seem to be the best option here, because they allow easier manipulation of your data. Your goal should be to index results by the filename extracted from the link (the URL of your speech text), mapping each filename to a dictionary of contractions and their counts.
Something like:
{"file1": {"it's": 34, "they're": 13, "she's": 9},
"file2": {"it's": 14, "we're": 15},
"file3": {"it's": 4},
"file4": {"it's": 45, "she's": 13}}
Here's the full code:
ret = {}
for link, text in ((l, processURL_short(l)) for l in every_link):
    contractions = {c:0 for c in contractions_list}
    for word in text.split():
        try:
            contractions[word] += 1
        except KeyError:
            # Word or contraction not found.
            pass
    ret[file_naming_code(link)] = contractions
Let's go into each step.
First we initialize ret, which will be the resulting dictionary. Then we use a generator expression to call processURL_short() one link at a time (instead of processing the whole link list up front). Each step yields a tuple (<link-name>, <speech-text>) so we can use the link name later.
Next comes the contractions count mapping, initialized to 0s; it will be used to count contractions.
Then we split the text into words. For each word we look it up in the contractions mapping: if found we count it, otherwise a KeyError is raised and ignored.
(Another answer stated that this will perform poorly; another possibility is checking with in, like word in contractions.)
Finally:
ret[file_naming_code(link)] = contractions
Now ret is a dictionary mapping each filename to its contraction counts.
Here's how you would get your output:
print('\t'.join(('filename', 'contraction', 'count')))
for link, counts in ret.items():
    for name, count in counts.items():
        # count is an int, so convert it before joining
        print('\t'.join((link, name, str(count))))
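If you want the three-column layout from the question as actual rows (for example to load into a data frame later), the nested dictionary can be flattened first. This is only a sketch using the standard csv module; counts_to_rows and rows_to_tsv are made-up names, and zero counts are dropped so that only contractions that were actually found appear:

```python
import csv
import io


def counts_to_rows(counts_by_file):
    """Flatten {filename: {contraction: count}} into (filename, contraction, count) rows."""
    rows = []
    for filename in sorted(counts_by_file):
        for contraction, count in sorted(counts_by_file[filename].items()):
            if count:  # skip contractions that never occurred in this file
                rows.append((filename, contraction, count))
    return rows


def rows_to_tsv(rows):
    """Render the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(("filename", "contraction", "count"))
    writer.writerows(rows)
    return buf.getvalue()


example = {"file2": {"it's": 14, "we're": 15, "she's": 0},
           "file1": {"it's": 34}}
print(rows_to_tsv(counts_to_rows(example)))
```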
Need some advice in improving the performance of my code.
I have two files ( Keyword.txt , description.txt ). Keyword.txt consists of list of keywords (11,000+ to be specific) and descriptions.txt consists of very large text descriptions(9,000+ ).
I am trying to read keywords from keyword.txt one at a time and check if the keyword exists in the description. If the keyword exists I am writing it to a new file. So this is like a many - to - many relationship ( 11,000 * 9,000).
Sample Keywords:
Xerox
VMWARE CLOUD
Sample Description(it's huge):
Planning and implementing entire IT Infrastructure. Cyberoam firewall implementation and administration in head office and branch office. Report generation and analysis. Including band width conception, internet traffic and application performance. Windows 2003/2008 Server Domain controller implementation and managing. VERITAS Backup for Clients backup, Daily backup of applications and database. Verify the backed up database for data integrity. Send backup tapes to remote location for safe storage Installing and configuring various network devices; Routers, Modems, Access Points, Wireless ADSL+ modems / Routers Monitoring, managing & optimizing Network. Maintaining Network Infrastructure for various clients. Creating Users and maintaining the Linux Proxy servers for clients. Trouble shooting, diagnosing, isolating & resolving Windows / Network Problems. Configuring CCTV camera, Biometrics attendance machine, Access Control System Kaspersky Internet Security / ESET NOD32
Below is the code which I've written:
import csv
import nltk
import re
wr = open(OUTPUTFILENAME,'w')
def match():
    c = 0
    ft = open('DESCRIPTION.TXT','r')
    ky2 = open('KEYWORD.TXT','r')
    reader = csv.reader(ft)
    keywords = []
    keyword_reader2 = csv.reader(ky2)
    for x in keyword_reader2: # Storing all the keywords to a list
        keywords.append(x[1].lower())
    string = ' '
    c = 0
    for row in reader:
        sentence = row[1].lower()
        id = row[0]
        for word in keywords:
            if re.search(r'\b{}\b'.format(re.escape(word.lower())),sentence):
                string = string + id+'$'+word.lower()+'$'+sentence+ '\n'
                c = c + 1
        if c > 5000: # I am writing 5000 lines at a time.
            print("Batch printed")
            c = 0
            wr.write(string)
            string = ' '
    wr.write(string)
    ky2.close()
    ft.close()
    wr.close()
match()
Now this code takes around 120 min to complete. I tried a couple of ways to improve the speed.
At first I was writing one line at a time, then I changed it to 5000 lines at a time since it is a small file and i can afford to put everything in memory. Did not see much improvement.
I pushed everything to stdout and used pipe from console to append everything to file. This was even slower.
I want to know if there is a better way of doing this, since I may have done something wrong in the code.
My PC Specs : Ram : 15gb Processor: i7 4th gen
If all your search-word phrases consist of whole words (begin/end on a word boundary) then parallel indexing into a word tree would be about as efficient as it gets.
Something like
# keep lowercase characters and digits
# keep apostrophes for contractions (isn't, couldn't, etc)
# convert uppercase characters to lowercase
# replace all other printable symbols with spaces
_UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
_SYMBOLS = "!#$%&()*+,-./:;<=>?@[]^_`{|}~ \t\n\r\x0b\x0c\"\\"
# str.maketrans needs both strings to be the same length,
# so the run of spaces is generated to match the symbols
TO_ALPHANUM_LOWER = str.maketrans(
    _UPPER + "'" + _SYMBOLS,
    _UPPER.lower() + "'" + " " * len(_SYMBOLS)
)
def clean(s):
    """
    Convert string `s` to canonical form for searching
    """
    return s.translate(TO_ALPHANUM_LOWER)
class WordTree:
    __slots__ = ["children", "terminal"]
    def __init__(self, phrases=None):
        self.children = {} # {"word": WordTree}
        self.terminal = '' # if end of search phrase, full phrase is stored here
        # preload tree
        if phrases:
            for phrase in phrases:
                self.add_phrase(phrase)
    def add_phrase(self, phrase):
        tree = self
        words = clean(phrase).split()
        for word in words:
            ch = tree.children
            if word in ch:
                tree = ch[word]
            else:
                tree = ch[word] = WordTree()
        tree.terminal = " ".join(words)
    def inc_search(self, word):
        """
        Search one level deeper into the tree
        Returns
            (None, '' ) if word not found
            (subtree, '' ) if word found but not terminal
            (subtree, phrase) if word found and completes a search phrase
        """
        ch = self.children
        if word in ch:
            wt = ch[word]
            return wt, wt.terminal
        else:
            return (None, '')
    def parallel_search(self, text):
        """
        Return all search phrases found in text
        """
        found = []
        fd = found.append
        partials = []
        for word in clean(text).split():
            new_partials = []
            np = new_partials.append
            # new search from root
            wt, phrase = self.inc_search(word)
            if wt: np(wt)
            if phrase: fd(phrase)
            # continue existing partial matches
            for partial in partials:
                wt, phrase = partial.inc_search(word)
                if wt: np(wt)
                if phrase: fd(phrase)
            partials = new_partials
        return found
    def tree_repr(self, depth=0, indent=" ", terminal=" *"):
        for word,tree in self.children.items():
            yield indent * depth + word + (terminal if tree.terminal else '')
            yield from tree.tree_repr(depth + 1, indent, terminal)
    def __repr__(self):
        return "\n".join(self.tree_repr())
then your program becomes
import csv
SEARCH_PHRASES = "keywords.csv"
SEARCH_INTO = "descriptions.csv"
RESULTS = "results.txt"
# get search phrases, build WordTree
with open(SEARCH_PHRASES) as inf:
    # pass the phrases as a single iterable, which is what __init__ expects
    wt = WordTree(phrase for _,phrase in csv.reader(inf))
with open(SEARCH_INTO) as inf, open(RESULTS, "w") as outf:
    # bound methods (save some look-ups)
    find_phrases = wt.parallel_search
    fmt = "{}${}${}\n".format
    write = outf.write
    # sentences to search
    for id,sentence in csv.reader(inf):
        # search phrases found
        for found in find_phrases(sentence):
            # store each result
            write(fmt(id, found, sentence))
which should be something like a thousand times faster.
I'm guessing you want to make your searches faster. In which case, if you don't care about the frequency of the keywords in the description, only that they exist, you could try the following:
For each description file, split the text into individual words, and generate a set of unique words.
Then, for each keyword in your list of keywords, check if the set contains keyword, write to file if true.
That should make your iterations faster. It should also help you skip the regexes, which are also likely part of your performance problem.
PS: My approach assumes that you filter out punctuation.
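A rough sketch of the set-based check, assuming single-word keywords only (a multi-word keyword such as "VMWARE CLOUD" can never be a member of a set of single words, so those would need a separate phrase search). The name keywords_in is made up:

```python
def keywords_in(description, keywords):
    """Return the single-word keywords that occur in `description`."""
    # Lowercase the text and replace punctuation with spaces, so "Xerox," matches "xerox".
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in description)
    unique_words = set(cleaned.split())
    # Set membership is a constant-time lookup on average, so each keyword costs one check.
    return [kw for kw in keywords if kw.lower() in unique_words]


text = "Xerox printers, and VMware cloud."
print(keywords_in(text, ["Xerox", "VMWARE CLOUD", "cloud", "linux"]))
```

The set is built once per description, so the 11,000 keyword checks against it avoid rescanning the description text with a regex each time.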