How to boost this script's speed? - Python

Hello, I am trying to filter the bad words out of a list. The lists I run through this script are usually 5 to 10 million lines of words. I tried threading to make it fast, but after the first 20k words it gets slower and slower. Why is that, and would it be faster if I used multiprocessing instead?
I run this script on Ubuntu with 48 CPU cores and 200 GB of RAM.
from tqdm import tqdm
import queue
import threading

a = input("The List: ") + ".txt"
thr = input('Threads: ')
c = input("clear old[y]: ")
inputQueue = queue.Queue()

if c == 'y' or c == 'Y':  # clean the old output file
    open("goodWord.txt", 'w').close()

s = ["bad_word"]  # bad words list

class myclass:
    def dem(self, my_word):
        for key in s:
            if key in my_word:
                return 1
        return 0

    def chk(self):
        while 1:
            old = open("goodWord.txt", "r", encoding='utf-8', errors='ignore').readlines()
            my_word = inputQueue.get()
            if my_word not in old:
                rez = self.dem(my_word)
                if rez == 0:
                    sav = open("goodWord.txt", "a+")
                    sav.write(my_word + "\n")
                    sav.close()
                self.pbar.update(1)
            else:
                self.pbar.update(1)
            inputQueue.task_done()

    def run_thread(self):
        for y in tqdm(open(a, 'r', encoding='utf-8', errors='ignore').readlines()):
            inputQueue.put(y)
        tqdm.write("All in the Queue")
        self.pbar = tqdm(total=inputQueue.qsize(), unit_divisor=1000)
        for x in range(int(thr)):
            t = threading.Thread(target=self.chk)
            t.setDaemon(True)
            t.start()
        inputQueue.join()

try:
    open("goodWord.txt", "a")
except:
    open("goodWord.txt", "w")

old = open("goodWord.txt", "r", encoding='utf-8', errors='ignore').readlines()
myclass = myclass()
myclass.run_thread()

For the sake of curiosity and education, I wrote a virtually identical (in function) program:
import pathlib

from tqdm import tqdm

# check_words_file_path = pathlib.Path(input("Enter the path of the file which contains the words to check: "))
check_words_file_path = pathlib.Path("/Users/****/Documents/Projects/AdHoc/resources/temp/check_words.txt")
good_words_file_path = pathlib.Path("/Users/****/Documents/Projects/AdHoc/resources/temp/good_words.txt")

bad_words = {"abadword", "anotherbadword"}

# load the list of good words
with open(good_words_file_path) as good_words_file:
    stripped_lines = (line.rstrip() for line in good_words_file)
    good_words = set(stripped_line for stripped_line in stripped_lines if stripped_line)

# check each word to see if it is one of the bad words;
# if it isn't, add it to the good words
with open(check_words_file_path) as check_words_file:
    for curr_word in tqdm(check_words_file):
        curr_word = curr_word.rstrip()
        if curr_word not in bad_words:
            good_words.add(curr_word)

# write the new/expanded list of good words back to file
with open(good_words_file_path, "w") as good_words_file:
    for good_word in good_words:
        good_words_file.write(good_word + "\n")
It is based on my understanding of the original program, which, as I already mentioned, I find far too complex.
I hope that this one is clearer, and it is almost certainly much faster. In fact, this might be fast enough that there is no need to consider things like multiprocessing.
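That said, if a single pass ever did become too slow on the 5-10 million line files, the filtering could be chunked across worker processes (CPython threads share the GIL, so for pure string checks they mostly add overhead). The following is only a rough sketch: the file names, chunk size, and bad-word set are placeholders, and it skips the deduplication against the existing good-word file that the version above performs.

import multiprocessing
from itertools import islice

bad_words = {"abadword", "anotherbadword"}  # placeholder list, as above

def filter_chunk(lines):
    # keep only the non-empty words that are not in the bad-word set
    return [w for w in (line.rstrip() for line in lines) if w and w not in bad_words]

def chunks(fileobj, size=100_000):
    # yield lists of `size` lines until the file is exhausted
    while True:
        block = list(islice(fileobj, size))
        if not block:
            return
        yield block

if __name__ == "__main__":
    with open("check_words.txt", encoding="utf-8", errors="ignore") as src, \
         open("good_words.txt", "w", encoding="utf-8") as dst, \
         multiprocessing.Pool() as pool:
        for good in pool.imap(filter_chunk, chunks(src)):
            dst.writelines(word + "\n" for word in good)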

Related

Make the program run faster

I have written a program which checks for curse words in a text document.
I convert the document into a list of words and pass each word to a website to check whether it is a curse word or not.
The problem is that if the text is too big, it runs very slowly.
How do I make it faster?
import urllib.request

def read_text():
    quotes = open(r"C:\Self\General\Pooja\Edu_Career\Learning\Python\Code\Udacity_prog_foundn_python\movie_quotes.txt")  # built-in function
    contents_of_file = quotes.read().split()
    # print(contents_of_file)
    quotes.close()
    check_profanity(contents_of_file)

def check_profanity(text_to_check):
    flag = 0
    for word in text_to_check:
        connection = urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + word)
        output = connection.read()
        # print(output)
        if b"true" in output:  # file is opened in bytes mode and output is in bytes, so compare bytes to bytes
            flag = flag + 1
    if flag > 0:
        print("profanity alert")
    else:
        print("the text has no curse words")
    connection.close()

read_text()
The website you are using supports more than one word per fetch. Hence, to make your code faster:
A) Break the loop when you find the first curse word.
B) Send a "super word" (many words joined together) to the site in one request.
Hence:
def check_profanity(text_to_check):
    flag = 0
    super_word = ''
    for i in range(len(text_to_check)):
        if i < 100 and i < len(text_to_check):  # 100 or max number of words you can check at the same time
            super_word = super_word + " " + text_to_check[i]
        else:
            connection = urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + super_word)
            super_word = ''
            output = connection.read()
            if b"true" in output:
                flag = flag + 1
                break
    if flag > 0:
        print("profanity alert")
    else:
        print("the text has no curse words")
First off, as Menno Van Dijk suggests, storing a subset of common known curse words locally would allow rapid checks for profanity up front, with no need to query the website at all; if a known curse word is found, you can alert immediately, without checking anything else.
Secondly, inverting that suggestion, cache at least the first few thousand most common known non-cursewords locally; there is no reason that every text containing the word "is", "the" or "a" should be rechecking those words over and over. Since the vast majority of written English uses mostly the two thousand most common words (and an even larger majority uses almost exclusively the ten thousand most common words), that can save an awful lot of checks.
Third, uniquify your words before checking them; if a word is used repeatedly, it's just as good or bad the second time as it was the first, so checking it twice is wasteful.
Lastly, as MTMD suggests, the site allows you to batch your queries, so do so.
Between all of these suggestions, you'll likely go from a 100,000 word file requiring 100,000 connections to requiring only 1-2. While multithreading might have helped your original code (at the expense of slamming the webservice), these fixes should make multithreading pointless; with only 1-2 requests, you can wait the second or two it would take for them to run sequentially.
As a purely stylistic issue, having read_text call check_profanity is odd; those should really be separate behaviors (read_text returns text which check_profanity can then be called on).
With my suggestions (assumes existence of files with one known word per line, one for bad words, one for good):
import itertools  # For islice, useful for batching
import urllib.request

def load_known_words(filename):
    with open(filename) as f:
        return frozenset(map(str.rstrip, f))

known_bad_words = load_known_words(r"C:\path\to\knownbadwords.txt")
known_good_words = load_known_words(r"C:\path\to\knowngoodwords.txt")

def read_text():
    with open(r"C:\Self\General\Pooja\Edu_Career\Learning\Python\Code\Udacity_prog_foundn_python\movie_quotes.txt") as quotes:
        return quotes.read()

def check_profanity(text_to_check):
    # Uniquify contents so words aren't checked repeatedly
    if not isinstance(text_to_check, (set, frozenset)):
        text_to_check = set(text_to_check)
    # Remove words known to be fine from set to check
    text_to_check -= known_good_words
    # Precheck for any known bad words so loop is skipped completely if found
    has_profanity = not known_bad_words.isdisjoint(text_to_check)
    while not has_profanity and text_to_check:
        block_to_check = frozenset(itertools.islice(text_to_check, 100))
        text_to_check -= block_to_check
        with urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + ' '.join(block_to_check)) as connection:
            output = connection.read()
        # print(output)
        has_profanity = b"true" in output

    if has_profanity:
        print("profanity alert")
    else:
        print("the text has no curse words")

text = read_text()
check_profanity(text.split())
There are a few things you can do:
Read batches of text.
Give each batch of text to a worker process, which then checks for profanity.
Introduce a cache which stores commonly used curse words offline, to minimize the number of HTTP requests.
Use multithreading:
Read batches of text.
Assign each batch to a thread and check all the batches separately (a threaded sketch is shown after the code below).
Send many words at once. Change number_of_words to the number of words you want to send in a single request.
import urllib.request

def read_text():
    quotes = open("test.txt")
    contents_of_file = quotes.read().split()
    quotes.close()
    check_profanity(contents_of_file)

def check_profanity(text):
    number_of_words = 200
    word_lists = [text[x:x + number_of_words] for x in range(0, len(text), number_of_words)]
    flag = False
    for word_list in word_lists:
        connection = urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + "%20".join(word_list))
        output = connection.read()
        if b"true" in output:
            flag = True
            break
        connection.close()
    if flag:
        print("profanity alert")
    else:
        print("the text has no curse words")

read_text()
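If you did want several batches checked in parallel, as the multithreading point above suggests, a concurrent.futures variant might look roughly like this. It keeps the same wdylike URL and batch size; the thread count is arbitrary, and unlike the loop above it does not cancel the remaining requests once profanity is found.

import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def batch_has_profanity(word_list):
    # one HTTP request per batch of words
    query = urllib.parse.quote(" ".join(word_list))
    with urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + query) as connection:
        return b"true" in connection.read()

def check_profanity(text, number_of_words=200, max_threads=8):
    word_lists = [text[x:x + number_of_words] for x in range(0, len(text), number_of_words)]
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        # map() runs the batch checks concurrently; any True means a curse word was found
        if any(executor.map(batch_has_profanity, word_lists)):
            print("profanity alert")
        else:
            print("the text has no curse words")

with open("test.txt") as quotes:
    check_profanity(quotes.read().split())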

Suggestion required - Python code performance improvement

Need some advice on improving the performance of my code.
I have two files (keyword.txt, descriptions.txt). keyword.txt consists of a list of keywords (11,000+, to be specific) and descriptions.txt consists of very large text descriptions (9,000+).
I am trying to read the keywords from keyword.txt one at a time and check if the keyword exists in a description. If the keyword exists, I am writing it to a new file. So this is like a many-to-many relationship (11,000 * 9,000).
Sample Keywords:
Xerox
VMWARE CLOUD
Sample Description (it's huge):
Planning and implementing entire IT Infrastructure. Cyberoam firewall implementation and administration in head office and branch office. Report generation and analysis. Including band width conception, internet traffic and application performance. Windows 2003/2008 Server Domain controller implementation and managing. VERITAS Backup for Clients backup, Daily backup of applications and database. Verify the backed up database for data integrity. Send backup tapes to remote location for safe storage Installing and configuring various network devices; Routers, Modems, Access Points, Wireless ADSL+ modems / Routers Monitoring, managing & optimizing Network. Maintaining Network Infrastructure for various clients. Creating Users and maintaining the Linux Proxy servers for clients. Trouble shooting, diagnosing, isolating & resolving Windows / Network Problems. Configuring CCTV camera, Biometrics attendance machine, Access Control System Kaspersky Internet Security / ESET NOD32
Below is the code which I've written:
import csv
import nltk
import re

wr = open(OUTPUTFILENAME, 'w')

def match():
    c = 0
    ft = open('DESCRIPTION.TXT', 'r')
    ky2 = open('KEYWORD.TXT', 'r')
    reader = csv.reader(ft)
    keywords = []
    keyword_reader2 = csv.reader(ky2)
    for x in keyword_reader2:  # Storing all the keywords to a list
        keywords.append(x[1].lower())
    string = ' '
    c = 0
    for row in reader:
        sentence = row[1].lower()
        id = row[0]
        for word in keywords:
            if re.search(r'\b{}\b'.format(re.escape(word.lower())), sentence):
                string = string + id + '$' + word.lower() + '$' + sentence + '\n'
                c = c + 1
        if c > 5000:  # I am writing 5000 lines at a time.
            print("Batch printed")
            c = 0
            wr.write(string)
            string = ' '
    wr.write(string)
    ky2.close()
    ft.close()
    wr.close()

match()
Now this code takes around 120 minutes to complete. I tried a couple of ways to improve the speed.
At first I was writing one line at a time; then I changed it to 5000 lines at a time, since it is a small file and I can afford to put everything in memory. I did not see much improvement.
I pushed everything to stdout and used a pipe from the console to append everything to the file. This was even slower.
I want to know if there is a better way of doing this, since I may have done something wrong in the code.
My PC specs: RAM: 15 GB, Processor: i7 4th gen
If all your search-word phrases consist of whole words (begin/end on a word boundary), then parallel indexing into a word tree would be about as efficient as it gets.
Something like
# keep lowercase characters and digits
# keep apostrophes for contractions (isn't, couldn't, etc)
# convert uppercase characters to lowercase
# replace all other printable symbols with spaces
_SYMBOLS = "!#$%&()*+,-./:;<=>?@[]^_`{|}~ \t\n\r\x0b\x0c\"\\"
TO_ALPHANUM_LOWER = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ'" + _SYMBOLS,
    "abcdefghijklmnopqrstuvwxyz'" + " " * len(_SYMBOLS)  # both arguments must be the same length
)

def clean(s):
    """
    Convert string `s` to canonical form for searching
    """
    return s.translate(TO_ALPHANUM_LOWER)
class WordTree:
    __slots__ = ["children", "terminal"]

    def __init__(self, phrases=None):
        self.children = {}   # {"word": WordTree}
        self.terminal = ''   # if end of search phrase, full phrase is stored here
        # preload tree
        if phrases:
            for phrase in phrases:
                self.add_phrase(phrase)

    def add_phrase(self, phrase):
        tree = self
        words = clean(phrase).split()
        for word in words:
            ch = tree.children
            if word in ch:
                tree = ch[word]
            else:
                tree = ch[word] = WordTree()
        tree.terminal = " ".join(words)

    def inc_search(self, word):
        """
        Search one level deeper into the tree

        Returns
          (None,    ''    ) if word not found
          (subtree, ''    ) if word found but not terminal
          (subtree, phrase) if word found and completes a search phrase
        """
        ch = self.children
        if word in ch:
            wt = ch[word]
            return wt, wt.terminal
        else:
            return (None, '')

    def parallel_search(self, text):
        """
        Return all search phrases found in text
        """
        found = []
        fd = found.append
        partials = []
        for word in clean(text).split():
            new_partials = []
            np = new_partials.append
            # new search from root
            wt, phrase = self.inc_search(word)
            if wt: np(wt)
            if phrase: fd(phrase)
            # continue existing partial matches
            for partial in partials:
                wt, phrase = partial.inc_search(word)
                if wt: np(wt)
                if phrase: fd(phrase)
            partials = new_partials
        return found

    def tree_repr(self, depth=0, indent=" ", terminal=" *"):
        for word, tree in self.children.items():
            yield indent * depth + word + (terminal if tree.terminal else '')
            yield from tree.tree_repr(depth + 1, indent, terminal)

    def __repr__(self):
        return "\n".join(self.tree_repr())
then your program becomes
import csv

SEARCH_PHRASES = "keywords.csv"
SEARCH_INTO = "descriptions.csv"
RESULTS = "results.txt"

# get search phrases, build WordTree
with open(SEARCH_PHRASES) as inf:
    wt = WordTree(phrase for _, phrase in csv.reader(inf))

with open(SEARCH_INTO) as inf, open(RESULTS, "w") as outf:
    # bound methods (save some look-ups)
    find_phrases = wt.parallel_search
    fmt = "{}${}${}\n".format
    write = outf.write
    # sentences to search
    for id, sentence in csv.reader(inf):
        # search phrases found
        for found in find_phrases(sentence):
            # store each result
            write(fmt(id, found, sentence))
which should be something like a thousand times faster.
I'm guessing you want to make your searches faster. In that case, if you don't care about the frequency of the keywords in the description, only that they exist, you could try the following:
For each description, split the text into individual words and generate a set of unique words.
Then, for each keyword in your list of keywords, check if the set contains the keyword; write it to the file if it does.
That should make your iterations faster. It should also let you skip the regexes, which are likely part of your performance problem.
PS: My approach assumes that you filter out punctuation.
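A rough sketch of that set-based idea, reusing the file names and CSV layout from the question (the output file name is a placeholder, and multi-word keywords such as "VMWARE CLOUD" would still need phrase handling along the lines of the WordTree answer above):

import csv
import re

# read all keywords once, lowercased
with open('KEYWORD.TXT') as ky2:
    keywords = {row[1].lower() for row in csv.reader(ky2)}

with open('DESCRIPTION.TXT') as ft, open('matches.txt', 'w') as wr:
    for row in csv.reader(ft):
        sentence = row[1].lower()
        # unique words of the description; assumes punctuation can be dropped
        words_in_sentence = set(re.findall(r"[a-z0-9']+", sentence))
        # single-word keywords only: set intersection replaces the per-keyword regex scan
        for keyword in keywords & words_in_sentence:
            wr.write(row[0] + '$' + keyword + '$' + sentence + '\n')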

Faster operation reading file

I have to process a 15MB txt file (nucleic acid sequence) and find all the different substrings (size 5). For instance:
ABCDEF
would return 2, as we have both ABCDE and BCDEF, but
AAAAAA
would return 1. My code:
control_var = 0
f = open("input.txt", "r")
list_of_substrings = []
while f.read(5) != "":
    f.seek(control_var)
    aux = f.read(5)
    if aux not in list_of_substrings:
        list_of_substrings.append(aux)
    control_var += 1
f.close()
print len(list_of_substrings)
Would another approach be faster (instead of comparing the strings direct from the file)?
Depending on what your definition of a legal substring is, here is a possible solution:
import re

regex = re.compile(r'(?=(\w{5}))')
with open('input.txt', 'r') as fh:
    input = fh.read()
print len(set(re.findall(regex, input)))
Of course, you may replace \w with whatever you see fit to qualify as a legal character in your substring. [A-Za-z0-9], for example, will match all alphanumeric characters.
Here is an execution example:
>>> import re
>>> input = "ABCDEF GABCDEF"
>>> set(re.findall(regex, input))
set(['GABCD', 'ABCDE', 'BCDEF'])
EDIT: Following your comment above that all characters in the file are valid, excluding the last one (which is \n), it seems that there is no real need for regular expressions here, and the iteration approach is much faster. You can benchmark it yourself with this code (note that I slightly modified the functions to reflect your update regarding the definition of a valid substring):
import timeit
import re

FILE_NAME = r'input.txt'

def re_approach():
    return len(set(re.findall(r'(?=(.{5}))', input[:-1])))

def iter_approach():
    return len(set([input[i:i+5] for i in xrange(len(input[:-6]))]))

with open(FILE_NAME, 'r') as fh:
    input = fh.read()

# verify that the output of both approaches is identical
assert set(re.findall(r'(?=(.{5}))', input[:-1])) == set([input[i:i+5] for i in xrange(len(input[:-6]))])

print timeit.repeat(stmt = re_approach, number = 500)
print timeit.repeat(stmt = iter_approach, number = 500)
15MB doesn't sound like a lot. Something like this probably would work fine:
import collections, re

contents = open('input.txt', 'r').read()
counter = collections.Counter(re.findall('.{5}', contents))
print len(counter)
Update
I think user590028 gave a great solution, but here is another option:
contents = open('input.txt', 'r').read()
print set(contents[start:start+5] for start in range(0, len(contents) - 4))
# Or using a dictionary
# dict([(contents[start:start+5],True) for start in range(0, len(contents) - 4)]).keys()
You could use a dictionary, where each key is a substring. It will take care of duplicates, and you can just count the keys at the end.
So: read through the file once, storing each substring in the dictionary, which will handle finding duplicate substrings & counting the distinct ones.
Reading all at once is more i/o efficient, and using a dict() is going to be faster than testing for existence in a list. Something like:
fives = {}
buf = open('input.txt').read()
for x in xrange(len(buf) - 4):
    key = buf[x:x+5]
    fives[key] = 1

for keys in fives.keys():
    print keys

Python no output comparing string with imported large dictionary file

I'm trying to write code to help me with crossword puzzles. I'm running into the following problems:
1. When I try to use the much larger text file for my word list, I get no output; only the small three-word list works.
2. The match tests positive for the first two strings of my test word list. I need it to only test true for entire words in my word list. [SOLVED SOLUTION in the code below]
lex.txt contains
dad
add
test
I call the code using the following.
./cross.py dad
[ SOLVED SOLUTION ] This is really slow.
#!/usr/bin/env python
import itertools, sys, re

sys.dont_write_bytecode = True

original_string = str(sys.argv[1])
lenth_of_string = len(original_string)
string_to_tuple = tuple(original_string)

with open('wordsEn.txt', 'r') as inF:
    for line in inF:
        for a in set(itertools.permutations(string_to_tuple, lenth_of_string)):
            joined_characters = "".join(a)
            if re.search('\\b' + joined_characters + '\\b', line):
                print joined_characters
Let's take a look at your code. You take the input string, you create all possible permutations of it, and then you look for these permutations in the dictionary.
The most significant speed impact, from my point of view, is that you create the permutations of the word over and over again, for every word in your dictionary. This is very time consuming.
Besides that, you don't even need the permutations. It's obvious that two words can be "converted" to each other by permuting if they've got the same letters. So your piece of code can be reimplemented as follows:
import itertools, sys, re
import time
from collections import Counter

sys.dont_write_bytecode = True

original_string = str(sys.argv[1]).strip()
lenth_of_string = len(original_string)
string_to_tuple = tuple(original_string)

def original_impl():
    to_return = []
    with open('wordsEn.txt', 'r') as inF:
        for line in inF:
            for a in set(itertools.permutations(string_to_tuple, lenth_of_string)):
                joined_characters = "".join(a)
                if re.search('\\b' + joined_characters + '\\b', line):
                    to_return.append(joined_characters)
    return to_return

def new_impl():
    to_return = []
    stable_counter = Counter(original_string)
    with open('wordsEn.txt', 'r') as inF:
        for line in inF:
            l = line.strip()
            c = Counter(l)
            if c == stable_counter:
                to_return.append(l)
    return to_return

t1 = time.time()
result1 = original_impl()
t2 = time.time()
result2 = new_impl()
t3 = time.time()

assert result1 == result2

print "Original impl took ", t2 - t1, ", new impl took ", t3 - t2, "i.e. new impl is ", (t2-t1) / (t3 - t2), " faster"
For a dictionary with 100 words of 8 letters, the output is :
Original impl took 42.1336319447 , new impl took 0.000784158706665 i.e. new impl is 53731.0006081 faster
The time consumed by the original implementation for 10000 records in the dictionary is unbearable.

Python algorithm - Jumble solver [closed]

I'm writing a program to find all the possible combinations of a jumbled word from a dictionary in Python.
Here's what I've written. It runs in O(n^2) time. So, my question is: can it be made faster?
import sys

dictfile = "dictionary.txt"

def get_words(text):
    """ Return a list of dict words """
    return text.split()

def get_possible_words(words, jword):
    """ Return a list of possible solutions """
    possible_words = []
    jword_length = len(jword)
    for word in words:
        jumbled_word = jword
        if len(word) == jword_length:
            letters = list(word)
            for letter in letters:
                if jumbled_word.find(letter) != -1:
                    jumbled_word = jumbled_word.replace(letter, '', 1)
            if not jumbled_word:
                possible_words.append(word)
    return possible_words

if __name__ == '__main__':
    words = get_words(file(dictfile).read())
    if len(sys.argv) != 2:
        print "Incorrect Format. Type like"
        print "python %s <jumbled word>" % sys.argv[0]
        sys.exit()
    jumbled_word = sys.argv[1]
    words = get_possible_words(words, jumbled_word)
    print "possible words :"
    print '\n'.join(words)
The usual fast solution to anagram problems is to build a mapping of sorted letters to a list of the unsorted words.
With that structure in hand, the lookups are immediate and fast:
def build_table(wordlist):
    table = {}
    for word in wordlist:
        key = ''.join(sorted(word))
        table.setdefault(key, []).append(word)
    return table

def lookup(jumble, table):
    key = ''.join(sorted(jumble))
    return table.get(key, [])

if __name__ == '__main__':

    # Build table
    with open('/usr/share/dict/words') as f:
        wordlist = f.read().lower().split()
    table = build_table(wordlist)

    # Solve some jumbles
    for jumble in ['tesb', 'amgaarn', 'lehsffu', 'tmirlohag']:
        print(lookup(jumble, table))
Notes on speed:
The lookup() code is the fast part.
The slower build_table() function is written for clarity.
Building the table is a one-time operation.
If you care about run-time across repeated runs, the table should be cached in a text file.
Text file format (alpha-order first, followed by the matching words):
aestt state taste tates testa
enost seton steno stone
...
With the preprocessed anagram file, it becomes a simple matter to use subprocess to grep the file for the appropriate line of matching words. This should give a very fast run time (because the sorts and matches were precomputed and because grep is so fast).
Build the preprocessed anagram file like this:
with open('/usr/share/dict/words') as f:
    wordlist = f.read().split()

table = {}
for word in wordlist:
    key = ''.join(sorted(word)).lower()
    table[key] = table.get(key, '') + ' ' + word

lines = ['%s%s\n' % t for t in table.iteritems()]
with open('anagrams.txt', 'w') as f:
    f.writelines(lines)
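Given that anagrams.txt file, a grep-based lookup might look something like this (a sketch only; it assumes each sorted key occurs on at most one line, which the build step above guarantees):

import subprocess

def lookup(jumble, anagram_file='anagrams.txt'):
    key = ''.join(sorted(jumble.lower()))
    try:
        # each line starts with the sorted key, so anchor the pattern at the line start
        line = subprocess.check_output(['grep', '-m', '1', '^' + key + ' ', anagram_file])
    except subprocess.CalledProcessError:
        # grep exits with a non-zero status when nothing matches
        return []
    return line.decode().split()[1:]  # drop the key, keep the matching words

print(lookup('tesb'))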
I was trying to solve this using Ruby:
https://github.com/hackings/jumble_solver
Alter get_words to return a dict(), making each key have a value of True (or 1).
Import itertools and use itertools.permutations to make all possible anagrammatic strings from the jumbled word.
Then loop over the possible strings, checking whether they are keys in the dict (a sketch of this follows below).
If you wanted a DIY algorithmic solution, then loading the dictionary into a tree might be "better", but I doubt that in the real world it would be faster.
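A minimal sketch of that idea (the dictionary file name is a guess; note that the number of permutations grows factorially with word length, which is why the sorted-key table above usually wins for longer jumbles):

from itertools import permutations

def get_words(path="dictionary.txt"):
    # dict keys give O(1) membership tests; a plain set would work just as well
    with open(path) as f:
        return {word: True for word in f.read().split()}

def solve(jumbled_word, words):
    # every distinct rearrangement of the letters
    candidates = {''.join(p) for p in permutations(jumbled_word)}
    return [candidate for candidate in candidates if candidate in words]

words = get_words()
print('\n'.join(solve('tesb', words)))  # e.g. prints "best" if it is in the dictionary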
