Python Multithreading for search

I have written a class that opens a text document and searches it line by line for keywords entered in a GUI that I created in a different file. It works great; the only problem is that the text document I am searching is long (over 60,000 entries). I was looking at ways to make the search faster and have been playing around with multithreading, but have not had any success yet. Basically, the main program calls the search function, which takes the line and breaks it into individual words. Then, in a loop, it checks each of the words against the keywords from the user. If the keyword is in that word, it counts as true and a 1 is added to a list. At the end, if there are as many true results as keywords, the line is added to a set that is returned at the end of main.
What I would like to do is incorporate multithreading so that the search runs much faster but main still returns the results at the end. Any advice or direction on how to accomplish this would be very helpful. I have read a bunch of examples and watched a bunch of YouTube videos, but none of it seemed to transfer over when I tried. Thank you for your help and your time.
import pdb
from threading import Thread

class codeBook:
    def __init__(self):
        pass

    def main(self, search):
        count = 0
        results = set()
        with open('CodeBook.txt') as current_CodeBook:
            lines = current_CodeBook.readlines()
            for line in lines:
                line = line.strip()
                new_search = self.change_search(line, search)
                line = new_search[0]
                search = new_search[1]
                #if search in line:
                if self.speed_search(line, search) == True:
                    results.add(line)
                else:
                    pass
                count = count + 1
        results = sorted(list(results))
        return results

    def change_search(self, current_line, search):
        current_line = current_line.lower()
        search = search.lower()
        return current_line, search

    # Named speed_search so that it matches the call in main().
    def speed_search(self, line, keywords):
        split_line = line.split()
        split_keywords = keywords.split()
        numberOfTrue = list()
        for i in range(0, len(split_keywords)):
            if split_keywords[i] in line:
                numberOfTrue.append(1)
        if len(split_keywords) == len(numberOfTrue):
            return True
        else:
            return False

You can split the file into several parts and create a new thread that reads and processes a specific part. You can keep a data structure global to all threads and add lines that match the search query from all the threads to it. This structure should either be thread-safe or you need to use some kind of synchronization (like a lock) to work with it.
Note: the CPython interpreter has a global interpreter lock (GIL), so if you're using it and your application is CPU-heavy (which seems to be the case here), you might not get any benefit from multithreading whatsoever.
You can use the multiprocessing module instead. It comes with means of interprocess communication. A Queue looks like the right structure for your problem (each process could add matching lines to the queue). After that, you just need to get all lines from the queue and do what you did with the results in your code.
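As a rough illustration of the multiprocessing-plus-Queue idea above, here is a minimal sketch. It assumes the lines have already been read into a list; the names line_matches, search_chunk and parallel_search are made up for this example and are not part of the asker's code. Each worker puts one list of matches on the queue, and the parent collects one message per worker.

```python
import multiprocessing


def line_matches(line, keywords):
    """Return True if every keyword occurs somewhere in the line."""
    lowered = line.lower()
    return all(keyword in lowered for keyword in keywords)


def search_chunk(lines, keywords, queue):
    """Worker: put the matching lines of one chunk on the shared queue."""
    matches = [line.strip().lower() for line in lines
               if line_matches(line, keywords)]
    queue.put(matches)  # exactly one message per worker


def parallel_search(lines, keyword_string, workers=4):
    """Split lines into chunks and search each chunk in its own process."""
    keywords = keyword_string.lower().split()
    chunk_size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + chunk_size]
              for i in range(0, len(lines), chunk_size)]
    queue = multiprocessing.Queue()
    processes = [multiprocessing.Process(target=search_chunk,
                                         args=(chunk, keywords, queue))
                 for chunk in chunks]
    for process in processes:
        process.start()
    results = set()
    # Read one message per worker before joining, so that no worker
    # is left blocked while trying to flush its data to the queue.
    for _ in processes:
        results.update(queue.get())
    for process in processes:
        process.join()
    return sorted(results)
```

To try it, read the file with readlines() and call parallel_search(lines, "some keywords") from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that start workers by spawning a fresh interpreter. Whether this beats the single-threaded version for 60,000 lines depends on how the process start-up cost compares with the search itself, so it is worth timing both.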

While threading and/or multiprocessing can speed up execution, I would first look at whether your current single-threaded algorithm can be optimized.
Looking at your implementation I believe a lot of work is done several times for no reason. To the best of my understanding the following function will perform the same operation as your codeBook.main but with less overhead:
def search_keywords(keyword_string, filename='CodeBook.txt'):
    results = set()
    # Split the query into individual keywords once, up front.
    keywords = set(keyword_string.lower().split())
    with open(filename) as code_book:
        for line in code_book:
            words = line.strip().lower()
            kws_present = True
            for keyword in keywords:
                kws_present = keyword in words
                if not kws_present:
                    break
            if kws_present:
                # Store the normalized line, as codeBook.main does.
                results.add(words)
    return sorted(results)
Try this function, as is or slightly modified for your needs, and see if it gives you a sufficient speed-up. Only when that is not enough should you look into more complex solutions, as introducing more threads or processes will invariably increase the complexity of your program.

Related

How do I add a separate function for average calculation?

I am stuck on this problem. The code I have so far works, but my professor wants to see some changes. I need to add error handling, and I need a separate function for calculating the average, which I will call in main. Here is what I have so far...
import os

def process_file(filename):
    f = open(filename,'r')
    lines = f.readlines()[1:]
    f.close()
    scores = []
    for line in lines:
        parsed = line.split(",")
        count = int(parsed[1])
        scores.append(count)
    calculate_result(scores)

def calculate_result(scores):
    print("High: ", max(scores))
    print("Low: ", min(scores))
    print("Average: ", sum(scores)/len(scores))

def main():
    filename = "scores.text"
    if os.path.isfile(filename):
        process_file(filename)
    else:
        print ("File does not exist")
        return 0

main()

I guess there are 2 parts:
I need to add error handling
and
I need a separate function for calculating average which I will call in main
The second part I don't think you need help with. But error handling is kind of an art, so I can see where you might be stuck on that. Here are some suggestions to help get started.
The most common type of error handling involves dealing with input. Thinking more broadly, we could expand that to anything that crosses the boundary of the program's memory space. This includes not just user input, but also output; filesystem interaction; using network interfaces (or any communication device or hardware interface); starting, stopping, or otherwise interacting with other programs; calling a library that does any of these things on our behalf; and many more.
So what parts of your program are interacting with "the outside" ? I can see a few:
in main() the program is making an assumption about the existence of a file. You are already checking to make sure this file exists, and returning 0 if it doesn't (you might want to change that to a non-zero value, since 0 is usually used to signal that no error occurred)
process_file() does this: f = open(filename,'r') but are you sure that will work? Are there conditions where this could fail?
What if the user that is running the program doesn't have permissions to read that file?
What if the file was deleted or changed between the time it was checked in main and the subsequent open call in process_file? This is a TOCTOU race condition, and it is something that every software developer needs to watch out for.
Probably the most obvious source of potential errors for this program is the content of the input file:
We're assuming the input is comma-separated. What if the user uses tabs or some other character?
While processing the lines, you've got: count = int(parsed[1]), but how do you know that parsed[1] can be cast to an int?
What will happen if the file exists, but is empty (hint: len(scores)==0)? Always look at these edge cases.
Finally, it looks like you are using if statements for error checking. That is fine, but another powerful tool for dealing with errors is the try-except statement. The two are not mutually exclusive: sometimes it's easier to use an if statement, and sometimes catching an exception with try-except is better. Some of the errors you'll need to deal with are easier to handle using one approach than the other.
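To make a few of the points above concrete, here is one possible sketch, not the only way to do it. The function names read_lines, parse_scores and calculate_average are made up for this example, and it assumes the same layout as your file (a header line, then comma-separated lines with the score in the second column). It uses try-except for the file and parsing errors and a plain if for the empty case.

```python
def read_lines(filename):
    """Read all lines after the header; return [] if the file is unreadable."""
    try:
        with open(filename) as f:
            return f.readlines()[1:]
    except OSError as error:  # missing file, no read permission, ...
        print("Could not read", filename, "-", error)
        return []


def parse_scores(lines):
    """Turn 'name,score' lines into a list of ints, skipping bad lines."""
    scores = []
    for number, line in enumerate(lines, start=1):
        try:
            scores.append(int(line.split(",")[1]))
        except (IndexError, ValueError):
            # No second column, or a score that is not an integer.
            print("Skipping malformed line", number)
    return scores


def calculate_average(scores):
    """Return the mean of scores, or None when there is nothing to average."""
    if not scores:
        return None
    return sum(scores) / len(scores)
```

In main you could then call calculate_average(parse_scores(read_lines(filename))) and print a message if the result is None. Opening the file inside try-except, instead of checking os.path.isfile first, also sidesteps the TOCTOU issue mentioned above.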

In python, how to use queues properly?

So far I have the following:
fnamw = input("Enter name of file:")

def carrem(fnamw):
    s = Queue( )
    for line in fnamw:
        s.enqueue(line)
    return s

print(carrem(fnamw))
The above doesn't print a list of the numbers in the file that I input. Instead, I get the following:
<__main__.Queue object at 0x0252C930>

When you print a Queue, you're printing the object itself, which is why you get that default representation.
I'm assuming you want to print the contents of the Queue rather than the object representation. To do so, you need to call the get method of the Queue (that is the name used by the standard library's queue.Queue; your own Queue class may name it differently). It's worth noting that in doing so, you will exhaust the Queue.
Replacing print(carrem(fnamw)) with print(carrem(fnamw).get()) should print the first item of the Queue.
If you really just want to print the list of items in the Queue, you should just use a list. A Queue is specifically for when you need a FIFO (first-in-first-out) data structure.
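For illustration, here is a small sketch of what "getting everything out of the Queue" could look like. It assumes the standard library's queue.Queue, not the custom Queue class from the question (which has an enqueue method, so its interface is different), and drain is a helper name made up for this example.

```python
import queue


def drain(q):
    """Remove and return every item currently in the queue, in FIFO order."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            return items


q = queue.Queue()
for line in ["1\n", "2\n", "3\n"]:
    q.put(line)

contents = drain(q)
print(contents)  # the queue itself is empty after this
```

Note that draining is destructive, which is one more reason a plain list is the simpler choice if all you want is to look at the items.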

It seems to me that you don't actually have any need for a Queue in that program. A Queue is used primarily for synchronization and data transfer in multithreaded programming. And it really doesn't seem as if that is what you're attempting to do.
For your usage, you could just as well use an ordinary Python list:
fnamw = input("Enter name of file:")

def carrem(fnamw):
    s = []
    for line in fnamw:
        s.append(line)
    return s

print(carrem(fnamw))
On that same note, however, you're not actually reading the file. The program as you quoted it simply puts each character of the filename into the list (or Queue) as an item of its own. What you really want is this:
def carrem(fnamw):
    s = []
    with open(fnamw) as fp:
        for line in fp:
            s.append(line)
    return s
Or, even simpler:
def carrem(fnamw):
    with open(fnamw) as fp:
        return list(fp)

multiprocessing when getting URLs python 3.2

I've made a script to get inventory data from the Steam API and I'm a bit unsatisfied with the speed. So I read a bit about multiprocessing in python and simply cannot wrap my head around it. The program works as such: it gets the SteamID from a list, gets the inventory and then appends the SteamID and the inventory in a dictionary with the ID as the key and inventory contents as the value.
I've also understood that there are some issues involved with using a counter when multiprocessing, which is a small problem as I'd like to be able to resume the program from the last fetched inventory rather than from the beginning again.
Anyway, what I'm asking for is really a concrete example of how to do multiprocessing when opening the URL that contains the inventory data so that the program can fetch more than one inventory at a time rather than just one.
On to the code:
import json
import socket
import urllib.error
import urllib.request

with open("index_to_name.json", "r", encoding=("utf-8")) as fp:
    index_to_name=json.load(fp)
with open("index_to_quality.json", "r", encoding=("utf-8")) as fp:
    index_to_quality=json.load(fp)
with open("index_to_name_no_the.json", "r", encoding=("utf-8")) as fp:
    index_to_name_no_the=json.load(fp)
with open("steamprofiler.json", "r", encoding=("utf-8")) as fp:
    steamprofiler=json.load(fp)
with open("itemdb.json", "r", encoding=("utf-8")) as fp:
    players=json.load(fp)

error=list()
playerinventories=dict()
c=127480
while c<len(steamprofiler):
    inventory=dict()
    items=list()
    try:
        url=urllib.request.urlopen("http://api.steampowered.com/IEconItems_440/GetPlayerItems/v0001/?key=DD5180808208B830FCA60D0BDFD27E27&steamid="+steamprofiler[c]+"&format=json")
        inv=json.loads(url.read().decode("utf-8"))
        url.close()
    except (urllib.error.HTTPError, urllib.error.URLError, socket.error, UnicodeDecodeError) as e:
        c+=1
        print("HTTP-error, continuing")
        error.append(c)
        continue
    try:
        for r in inv["result"]["items"]:
            inventory[r["id"]]=r["quality"], r["defindex"]
    except KeyError:
        c+=1
        error.append(c)
        continue
    for key in inventory:
        try:
            if index_to_quality[str(inventory[key][0])]=="":
                items.append(
                    index_to_quality[str(inventory[key][0])]
                    +""+
                    index_to_name[str(inventory[key][1])]
                )
            else:
                items.append(
                    index_to_quality[str(inventory[key][0])]
                    +" "+
                    index_to_name_no_the[str(inventory[key][1])]
                )
        except KeyError:
            print("keyerror, uppdate def_to_index")
            c+=1
            error.append(c)
            continue
    playerinventories[int(steamprofiler[c])]=items
    c+=1
    if c % 10==0:
        print(c, "inventories downloaded")
I hope my problem is clear; if not, just say so. Ideally I would avoid using 3rd party libraries, but if that's not possible, so be it. Thanks in advance.

So you're assuming that fetching the URL is what slows your program down? You'd do well to check that assumption first, but if it is indeed the case, using the multiprocessing module is overkill: for I/O-bound bottlenecks, threading is quite a bit simpler and might even be a bit faster (it takes a lot more time to spawn another Python interpreter than to spawn a thread).
Looking at your code, you might get away with sticking most of the content of your while loop in a function with c as a parameter, and starting a thread from there using another function, something like:
import threading

def process_item(c):
    # The work goes here
    # Replace all those 'continue' statements with 'return'
    pass

for c in range(127480, len(steamprofiler)):
    thread = threading.Thread(name="inventory {0}".format(c), target=process_item, args=[c])
    thread.start()
A real problem might be that there's no limit to the number of threads being spawned, which might break the program. Also, the guys at Steam might not be amused at getting hammered by your script, and they might decide to un-friend you.
A better approach would be to fill a collections.deque object with your list of c's and then start a limited set of threads to do the work:
import collections
import threading

def process_item(c):
    # The work goes here
    # Replace all those 'continue' statements with 'return'
    pass

def process():
    while True:
        process_item(work.popleft())

work = collections.deque(range(127480, len(steamprofiler)))
threads = [threading.Thread(name="worker {0}".format(n), target=process)
           for n in range(6)]
for worker in threads:
    worker.start()
Note that I'm counting on work.popleft() to throw an IndexError when we're out of work, which will kill the thread. That's a bit sneaky, so consider using a try...except instead.
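For what it's worth, the try...except variant could look roughly like the sketch below. The run_workers function and its parameters are made up for this example. It also adds a lock around the shared results dictionary and joins the threads at the end, neither of which is in the code above.

```python
import collections
import threading


def run_workers(items, process_item, worker_count=6):
    """Process every item using a fixed number of worker threads."""
    work = collections.deque(items)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = work.popleft()  # deque.popleft() is thread-safe
            except IndexError:
                return  # no work left, let the thread finish normally
            value = process_item(item)
            with lock:
                results[item] = value

    threads = [threading.Thread(name="worker {0}".format(n), target=worker)
               for n in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until all the work is done
    return results
```

Here process_item would be a function that takes c, fetches one inventory and returns the list of items, so that run_workers(range(127480, len(steamprofiler)), process_item) gives back a dictionary keyed by c.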
Two more things:
Consider using the excellent Requests library instead of urllib (which, API-wise, is by far the worst module in the entire Python standard library that I've worked with).
For Requests, there's an add-on called grequests which allows you to do fully asynchronous HTTP-requests. That would have made for even simpler code.
I hope this helps, but please keep in mind this is all untested code.

The outermost while loop looks like it could be distributed over a few processes (or tasks).
When you break the loop into tasks, note that you are sharing the playerinventories and error objects between processes. You will need to use multiprocessing.Manager to share them.
I recommend you to start modifying your code from this snippet.

Is there a hidden possible deadlock in ppmap/parallel python?

I am having some trouble with using a parallel version of map (ppmap wrapper, implementation by Kirk Strauser).
The function I am trying to run in parallel runs a simple regular expression search on a large number of strings (protein sequences), which are parsed from the filesystem using BioPython's SeqIO. Each function call uses its own file.
If I run the function using a normal map, everything works as expected. However, when using ppmap, some of the runs simply freeze: there is no CPU usage and the main program does not even react to KeyboardInterrupt. Also, when I look at the running processes, the workers are still there (but not using any CPU anymore).
e.g.
/usr/bin/python -u /usr/local/lib/python2.7/dist-packages/pp-1.6.1-py2.7.egg/ppworker.py 2>/dev/null
Furthermore, the workers do not seem to freeze on any particular data entry - if I manually kill the process and re-run the execution, it stops at a different point. (So I have temporarily resorted to keeping a list of finished entries and re-started the program multiple times).
Is there any way to see where the problem is?
Sample of the code that I am running:
def analyse_repeats(data):
    """
    Loads whole proteome in memory and then looks for repeats in sequences,
    flags both real repeats and sequences not containing particular aminoacid
    """
    (organism, organism_id, filename) = data
    import re
    letters = ['C','M','F','I','L','V','W','Y','A','G','T','S','Q','N','E','D','H','R','K','P']
    try:
        handle = open(filename)
        data = Bio.SeqIO.parse(handle, "fasta")
        records = [record for record in data]
        store_records = []
        for record in records:
            sequence = str(record.seq)
            uniprot_id = str(record.name)
            for letter in letters:
                items = set(re.compile("(%s+)" % tuple(([letter] * 1))).findall(sequence))
                if items:
                    for item in items:
                        store_records.append((organism_id,len(item), uniprot_id, letter))
                else:
                    # letter not present in the string, "zero" repeat
                    store_records.append((organism_id,0, uniprot_id, letter))
        handle.close()
        return (organism,store_records)
    except IOError as e:
        print e
        return (organism, [])

res_generator = ppmap.ppmap(
    None,
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)
for res in res_generator:
    # process the output
If I use simple map instead of the ppmap, everything works fine:
res_generator = map(
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)

You could try using one of the methods (like map) of the Pool object from the multiprocessing module instead. The advantage is that it's built in and doesn't require external packages. It also works very well.
By default, it uses as many worker processes as your computer has cores, but you can specify a higher number as well.
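A minimal sketch of what Pool.map could look like for this kind of job is shown below. It is written for Python 3 (the question uses Python 2.7, where Pool cannot be used as a context manager), it works on plain strings instead of files parsed with BioPython, and the function names longest_repeats and analyse_all are made up for this example.

```python
import multiprocessing
import re


def longest_repeats(sequence):
    """Return {letter: longest run length} for the letters in one sequence."""
    longest = {}
    # Each match is a maximal run of one repeated character.
    for match in re.finditer(r"(.)\1*", sequence):
        letter = match.group(1)
        longest[letter] = max(longest.get(letter, 0), len(match.group(0)))
    return longest


def analyse_all(sequences, workers=None):
    """Map longest_repeats over sequences with a pool of worker processes."""
    # workers=None lets Pool pick a default based on the CPU count.
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(longest_repeats, sequences)
```

Calling analyse_all(["AAABAC", "GGT"]) from under an `if __name__ == "__main__":` guard returns one dictionary per input sequence, in the same order as the input. The worker function has to be defined at module level so that it can be sent to the worker processes.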

May I suggest using dispy (http://dispy.sourceforge.net)? Disclaimer: I am the author. I understand it doesn't address the question directly, but hopefully helps you.

Save memory in Python. How to iterate over the lines and save them efficiently with a 2 million line file?

I have a tab-separated data file with a little over 2 million lines and 19 columns.
You can find it, in US.zip: http://download.geonames.org/export/dump/.
I started to run the following but with for l in f.readlines(). I understand that just iterating over the file is supposed to be more efficient so I'm posting that below. Still, with this small optimization, I'm using 30% of my memory on the process and have only done about 6.5% of the records. It looks like, at this pace, it will run out of memory like it did before. Also, the function I have is very slow. Is there anything obvious I can do to speed it up? Would it help to del the objects with each pass of the for loop?
def run():
    from geonames.models import POI
    f = file('data/US.txt')
    for l in f:
        li = l.split('\t')
        try:
            p = POI()
            p.geonameid = li[0]
            p.name = li[1]
            p.asciiname = li[2]
            p.alternatenames = li[3]
            p.point = "POINT(%s %s)" % (li[5], li[4])
            p.feature_class = li[6]
            p.feature_code = li[7]
            p.country_code = li[8]
            p.ccs2 = li[9]
            p.admin1_code = li[10]
            p.admin2_code = li[11]
            p.admin3_code = li[12]
            p.admin4_code = li[13]
            p.population = li[14]
            p.elevation = li[15]
            p.gtopo30 = li[16]
            p.timezone = li[17]
            p.modification_date = li[18]
            p.save()
        except IndexError:
            pass

if __name__ == "__main__":
    run()
EDIT, More details (the apparently important ones):
The memory consumption is going up as the script runs and saves more lines.
The .save() method is a modified Django model method, using the unique_slug snippet, that writes to a PostgreSQL/PostGIS db.
SOLVED: DEBUG database logging in Django eats memory.

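Make sure that Django's DEBUG setting is set to False. With DEBUG enabled, Django keeps a record of every SQL query it runs in memory (connection.queries), so memory use grows with every save.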

This looks perfectly fine to me. Iterating over the file like that or using xreadlines() will read each line as needed (with sane buffering behind the scenes). Memory usage should not grow as you read in more and more data.
As for performance, you should profile your app. Most likely the bottleneck is somewhere in a deeper function, like POI.save().

There's no reason to worry in the data you've given us: is memory consumption going UP as you read more and more lines? Now that would be cause for worry -- but there's no indication that this would happen in the code you've shown, assuming that p.save() saves the object to some database or file and not in memory, of course. There's nothing real to be gained by adding del statements, as the memory is getting recycled at each leg of the loop anyway.
This could be sped up if there's a faster way to populate a POI instance than binding its attributes one by one -- e.g., passing those attributes (maybe as keyword arguments? positional would be faster...) to the POI constructor. But whether that's the case depends on that geonames.models module, of which I know nothing, so I can only offer very generic advice -- e.g., if the module lets you save a bunch of POIs in a single gulp, then making them (say) 100 at a time and saving them in bunches should yield a speedup (at the cost of slightly higher memory consumption).
