In Python, how to use queues properly?

So far I have the following:
    fnamw = input("Enter name of file:")
    def carrem(fnamw):
        s = Queue()
        for line in fnamw:
            s.enqueue(line)
        return s
    print(carrem(fnamw))
The above doesn't print a list of the numbers in the file I enter. Instead, I get the following:
<__main__.Queue object at 0x0252C930>

When you print a Queue, you are printing the object itself, and because the class does not define its own string representation you get the default one shown above.
Presumably you want to print the contents of the Queue rather than the object representation. With the standard library's queue.Queue you would do that by calling its get method. It's worth noting that each call removes the item, so reading everything this way exhausts the Queue. (Your Queue is defined in your own program and uses enqueue, so the matching method there may have a different name.)
With queue.Queue, replacing print(carrem(fnamw)) with print(carrem(fnamw).get()) would print the first item of the Queue.
If you really just want to print the list of items, you should just use a list. A Queue is meant for cases where you need a FIFO (first-in-first-out) data structure.
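As a rough illustration of the point about get exhausting the queue, here is a minimal sketch that assumes the standard library's queue.Queue rather than the asker's own class. The helper name drain is made up for this example.

```python
import queue

def drain(q):
    """Remove and return every item currently in q, oldest first."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            # Nothing left: the queue has been emptied by the reads above.
            return items

q = queue.Queue()
for line in ["10\n", "20\n", "30\n"]:
    q.put(line)

contents = drain(q)
print(contents)   # items come back in insertion (FIFO) order
print(q.empty())  # the queue no longer holds anything
```

After drain returns, the items exist only in the returned list, which is one more reason a plain list is the simpler choice when the goal is just to print everything.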

It seems to me that you don't actually need a Queue in that program. The standard library's Queue is used primarily for synchronization and data transfer in multithreaded programming, and that doesn't seem to be what you're attempting to do.
For your usage, you could just as well use an ordinary Python list:
    fnamw = input("Enter name of file:")
    def carrem(fnamw):
        s = []
        for line in fnamw:
            s.append(line)
        return s
    print(carrem(fnamw))
On that same note, however, you're not actually reading the file. The program as you quoted it iterates over the filename string, so it puts each character of the filename into the list (or Queue) as a separate item. What you really want is this:
    def carrem(fnamw):
        s = []
        with open(fnamw) as fp:
            for line in fp:
                s.append(line)
        return s
Or, even simpler:
    def carrem(fnamw):
        with open(fnamw) as fp:
            return list(fp)
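Since the question asks for a list of the numbers in the file, a small variation on the functions above may be closer to the goal. This is only a sketch: the name read_numbers, the temporary file, and the assumption that the file holds one number per line are all made up for illustration.

```python
import os
import tempfile

def read_numbers(path):
    """Return the numbers in a text file, one per non-blank line."""
    with open(path) as fp:
        # float() ignores the trailing newline; blank lines are skipped.
        return [float(line) for line in fp if line.strip()]

# Build a small throwaway file to try the function on.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("3\n1.5\n\n42\n")

numbers = read_numbers(tmp.name)
os.remove(tmp.name)
print(numbers)
```

Converting while reading means the printed list shows numeric values instead of strings with trailing newline characters.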

Related

Python multiprocessing imap without list comprehension?

I've parallelized my code using imap and the pyfastx library, but the problem is that the sequences get loaded using a list comprehension. When the input file is large this becomes problematic, because all seq values are loaded into memory. Is there a way to do this without fully populating the list that is passed to imap?
    import pyfastx
    import multiprocessing
    def pSeq(seq):
        ...
        return (A1, A2, B)
    pool = multiprocessing.Pool(5)
    for (A1, A2, B) in pool.imap(pSeq, [seq for _, seq, _ in pyfastx.Fastq(temp2.name, build_index=False)], chunksize=100000):
        if A1 == A2 and A1 != B:
            matchedA[A1][B] += 1
I also tried skipping the list comprehension and using apply_async, since pyfastx supports loading the sequences one at a time. But because each individual task is fairly short and apply_async has no chunksize argument, this ends up taking far longer than not using multiprocessing at all.
    import pyfastx
    import multiprocessing
    def pSeq(seq):
        ...
        return (A1, A2, B)
    pool = multiprocessing.Pool(5)
    results = []
    for _, seq, _ in pyfastx.Fastq(temp2.name, build_index=False):
        # args must be a tuple, otherwise the string is unpacked character by character
        results.append(pool.apply_async(pSeq, (seq,)))
    pool.close()  # close() has to be called before join()
    pool.join()
    for result in results:
        A1, A2, B = result.get()  # apply_async returns an AsyncResult, not the value
        if A1 == A2 and A1 != B:
            matchedA[A1][B] += 1
Any suggestions?
I know it's been a while since the original post, but I actually dealt with a similar issue so thought this may be helpful for someone at some point.
First of all, the general solution is to pass imap an iterator or generator object, instead of a list. In this case, you would modify pSeq to accept a tuple of 3 and simply drop the list comprehension.
I am including some code below to demonstrate what I mean, but let me preempt anybody trying this: it doesn't work (at least in my hands). My guess is that this happens because, for some reason, pyfastx.Fastq doesn't return an iterator or generator object (I did verify this tidbit: the returned object doesn't implement __next__)...
I worked around this by using fastq-and-furious, which is comparably fast and does return a generator (and also has a more flexible python API). That workaround code is at the bottom, if you want to skip the "solution that should have worked".
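For anyone who wants to repeat the check mentioned above, the difference between an object that is merely iterable and one that is an iterator can be tested with collections.abc. The sketch below uses a made-up Reads class as a stand-in; it says nothing about how pyfastx itself is implemented.

```python
from collections.abc import Iterable, Iterator

class Reads:
    """Hypothetical container that is iterable but is not itself an iterator."""
    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)

def describe(obj):
    """Classify obj as 'iterator', 'iterable' or 'neither'."""
    if isinstance(obj, Iterator):   # has both __iter__ and __next__
        return "iterator"
    if isinstance(obj, Iterable):   # has __iter__ only
        return "iterable"
    return "neither"

reads = Reads(["ACGT", "TTGA"])
kinds = [
    describe(reads),                 # the container itself
    describe(iter(reads)),           # what iter() returns for it
    describe(r for r in reads),      # a generator expression over it
    describe(42),                    # not iterable at all
]
print(kinds)
```

Wrapping an iterable in iter() or in a generator expression is a cheap way to obtain a true iterator when a piece of code insists on one.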
At any rate, here is what I would have liked to work:
    def pSeq(seq_tuple):
        _, seq, _ = seq_tuple
        ...
        return (A1, A2, B)
    ...
    import multiprocessing as mp
    with mp.Pool(5) as pool:
        # this fails (when I ran it on Mac, the program hung and I had to keyboard interrupt)
        # most likely due to pyfastx.Fastq not returning a generator or iterator
        parser = pyfastx.Fastq(temp2.name, build_index=False)
        result_iterator = pool.imap(pSeq, parser, chunksize=100000)
        for result in result_iterator:
            ...  # do something
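To show the general idea in isolation, here is a minimal sketch of passing a generator to imap. It makes two assumptions that are not part of the original code: fake_reads is a made-up stand-in for a FASTQ parser, and multiprocessing.pool.ThreadPool is used instead of Pool so that the snippet runs anywhere without a __main__ guard. ThreadPool offers the same imap method, so for CPU-bound work a process Pool could be swapped in.

```python
from multiprocessing.pool import ThreadPool

def fake_reads(n):
    """Hypothetical parser: lazily yields (name, seq, quality) tuples."""
    for i in range(n):
        seq = "ACGT" * (i % 3 + 1)
        yield ("read%d" % i, seq, "I" * len(seq))

def seq_length(record):
    """Worker that unpacks the 3-tuple itself, as suggested above."""
    _, seq, _ = record
    return len(seq)

with ThreadPool(2) as pool:
    # The generator is handed to imap directly; no list is built from it first.
    lengths = list(pool.imap(seq_length, fake_reads(6), chunksize=2))

print(lengths)
```

imap returns results in input order, so they can be consumed in a plain for loop exactly as in the original code.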
Just to make this answer complete, I am also adding my workaround code, which does work for me. Unfortunately, I couldn't get it to run properly while still using pyfastx:
    import fastqandfurious.fastqandfurious as fqf
    import fastqandfurious._fastqandfurious as _fqf
    # if you don't supply an entry function, fqf returns (name, seq, quality) as byte-strings
    def pfx_like_entry(buf, pos, offset=0):
        """
        Return a tuple with identical format to pyfastx, so reads can be
        processed with the same function regardless of which parser we use
        """
        name = buf[pos[0]:pos[1]].decode('ascii')
        seq = buf[pos[2]:pos[3]].decode('ascii')
        quality = buf[pos[4]:pos[5]].decode('ascii')
        return name, seq, quality
    # can be replaced with fqf.automagic_open(), gzip.open(), some other equivalent
    with open(temp2.name, mode='rb') as handle, \
            mp.Pool(5) as pool:
        # this does work. You can also use biopython's fastq parsers
        # (or any other parser that returns an iterator/ generator)
        parser = fqf.readfastq_iter(fh=handle,
                                    fbufsize=20000,
                                    entryfunc=pfx_like_entry,
                                    _entrypos=_fqf.entrypos
                                    )
        result_iterator = pool.imap(pSeq, parser, chunksize=100000)
        for result in result_iterator:
            ...  # do something

Printing in loop

I have the following code to print to the system default printer:
    import subprocess
    def printFile(file):
        print("printing file...")
        with open(file, "rb") as source:
            printer = subprocess.Popen('/usr/bin/lpr', stdin=subprocess.PIPE)
            printer.stdin.write(source.read())
This function works quite well if I use it on its own. But if I use it in a loop construct like this:
    while True:
        printFile(file)
        (...)
the print job won't run, although the loop continues without error.
I tried building in a time delay, but it didn't help.
[Edit]: further investigation showed that the printing function, when called from the loop, appears to put the print jobs on hold.
In modern Python 3, it is advised to use subprocess.run() in most cases instead of using subprocess.Popen directly. And I would leave it to lpr to read the file, rather than passing it to standard input:
    def printFile(file):
        print("printing file...")
        cp = subprocess.run(['/usr/bin/lpr', file])
        return cp.returncode
Using subprocess.run allows you to ascertain that the lpr process finished correctly. And this way you don't have to read and write the complete file. You can even remove the file once lpr is finished.
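A minimal sketch of checking the outcome of subprocess.run follows. Because a printer may not be available, the Python interpreter stands in for lpr here; the helper name run_and_report and the file name in the comment are made up for illustration.

```python
import subprocess
import sys

def run_and_report(cmd):
    """Run cmd, wait for it to finish, and return (returncode, stderr text)."""
    cp = subprocess.run(cmd, stderr=subprocess.PIPE, text=True)
    return cp.returncode, cp.stderr

# For a printer this could be ['/usr/bin/lpr', 'report.txt'] (hypothetical file).
# Here a child Python process fails on purpose so there is an error to inspect.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit('no printer')"]
code, err = run_and_report(failing_cmd)
print(code, err.strip())
```

A non-zero return code together with the captured error text makes it visible when a job was rejected, instead of the loop silently carrying on.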
Using Popen directly has some disadvantages here:
Using Popen.stdin might produce a deadlock if it overfills the OS pipe buffers (according to the Python docs).
Since you don't wait() for the Popen process to finish, you don't know if it finished without errors.
Depending on how lpr is set up, it might have rate controls. That is, it might stop printing if it gets a lot of print requests in a short span of time.
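If the data really has to go through standard input, subprocess.run can feed it via its input argument and wait for the process to finish, which sidesteps the first two points above. The sketch below again uses a child Python process as a stand-in for lpr; send_to_stdin is a made-up helper name.

```python
import subprocess
import sys

def send_to_stdin(cmd, data):
    """Feed data to cmd's standard input and wait for the process to exit."""
    # run() uses communicate() internally, which writes stdin and reads the
    # output pipes concurrently, so a full pipe buffer does not stall things.
    cp = subprocess.run(cmd, input=data, stdout=subprocess.PIPE)
    return cp.returncode, cp.stdout

# Stand-in for lpr: a child process that counts the bytes it receives.
counter = [sys.executable, "-c",
           "import sys; print(len(sys.stdin.buffer.read()))"]
code, out = send_to_stdin(counter, b"x" * 200000)
print(code, out.decode().strip())
```

The payload is deliberately larger than a typical pipe buffer to show that the call still returns once the child has read everything.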
Edit: I just thought of something. Most lpr implementations allow you to print more than one file at a time. So you could also do:
    def printFile(files):
        """
        Print file(s).
        Arguments:
            files: string or sequence of strings.
        """
        if isinstance(files, str):
            files = [files]
        # if you want to be super strict...
        if not isinstance(files, (list, tuple)):
            raise ValueError('files must be a sequence type')
        else:
            if not all(isinstance(f, str) for f in files):
                raise ValueError('files must be a sequence of strings')
        # list() so that a tuple of file names can be concatenated as well
        cp = subprocess.run(['/usr/bin/lpr'] + list(files))
        return cp.returncode
That would print a single file or a whole bunch of them in one go...

Python Multithreading for search

I have written a class that opens a text document and searches it line by line for keywords entered through a GUI that I created in a different file. It works great; the only problem is that the text document I am searching is long (over 60,000 entries). I was looking at ways to make the search faster and have been playing around with multithreading, but have not had any success yet. Basically, the main program calls the search function, which takes the line and breaks it into individual words. It then loops over the keywords from the user and checks each one against the line. If the keyword is found, it counts as true and a 1 is added to a list. At the end, if there are as many true results as keywords, the line is added to a set that is returned at the end of main.
What I would like to do is incorporate multithreading into this so that it will run much faster but at the end of the main function will still return results. Any advice or direction with being able to accomplish this will be very helpful. I have tried to read a bunch of examples and watched a bunch of youtube videos but it didn't seem to transfer over when I tried. Thank you for your help and your time.
    import pdb
    from threading import Thread
    class codeBook:
        def __init__(self):
            pass
        def main(self, search):
            count = 0
            results = set()
            with open('CodeBook.txt') as current_CodeBook:
                lines = current_CodeBook.readlines()
                for line in lines:
                    line = line.strip()
                    new_search = self.change_search(line,search)
                    line = new_search[0]
                    search = new_search[1]
                    #if search in line:
                    if self.speed_search(line,search) == True:
                        results.add(line)
                    else:
                        pass
                    count = count + 1
            results = sorted(list(results))
            return results
        def change_search(self, current_line, search):
            current_line = current_line.lower()
            search = search.lower()
            return current_line, search
        # named speed_search to match the call in main()
        def speed_search(self,line,keywords):
            split_line = line.split()
            split_keywords = keywords.split()
            numberOfTrue = list()
            for i in range(0,len(split_keywords)):
                if split_keywords[i] in line:
                    numberOfTrue.append(1)
            if len(split_keywords) == len(numberOfTrue):
                return True
            else:
                return False
You can split the file into several parts and create a new thread that reads and processes a specific part. You can keep a data structure global to all threads and add lines that match the search query from all the threads to it. This structure should either be thread-safe or you need to use some kind of synchronization (like a lock) to work with it.
Note: the CPython interpreter has a global interpreter lock (GIL), so if you're using it and your application is CPU-heavy (which seems to be the case here), you might not get any benefit from multithreading whatsoever.
You can use the multiprocessing module instead. It comes with means of interprocess communication. A Queue looks like the right structure for your problem (each process could add matching lines to the queue). After that, you just need to get all lines from the queue and do what you did with the results in your code.
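The thread-based layout described above (split the input, let each thread search its part, guard the shared result structure with a lock) could look roughly like the sketch below. The function names and the sample lines are made up, and because of the GIL this illustrates the structure rather than promising a speed-up.

```python
import threading

def search_chunk(lines, keywords, results, lock):
    """Add every line of this chunk that contains all keywords to results."""
    matches = [line for line in lines if all(k in line for k in keywords)]
    with lock:  # only one thread updates the shared set at a time
        results.update(matches)

def threaded_search(lines, keyword_string, n_threads=4):
    keywords = keyword_string.lower().split()
    lines = [line.strip().lower() for line in lines]
    results, lock = set(), threading.Lock()
    # Thread i gets every n_threads-th line, starting at offset i.
    threads = [
        threading.Thread(target=search_chunk,
                         args=(lines[i::n_threads], keywords, results, lock))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for all parts before returning the combined result
    return sorted(results)

sample = ["Red apple pie", "Green apple", "Red cherry", "apple RED tart"]
found = threaded_search(sample, "red apple")
print(found)
```

Joining every thread before returning is what lets the caller still receive one complete, sorted result, as the question asks for.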
While threading and/or multiprocessing can be beneficial and speed up execution, I would first look into optimizing your current single-threaded algorithm before doing that.
Looking at your implementation, I believe a lot of work is done several times for no reason. To the best of my understanding, the following function performs the same operation as your codeBook.main but with less overhead:
    def search_keywords(keyword_string, filename='CodeBook.txt'):
        results = set()
        # split the search string into individual lowercase keywords
        keywords = set(keyword_string.lower().split())
        with open(filename) as code_book:
            for line in code_book:
                words = line.strip().lower()
                kws_present = True
                for keyword in keywords:
                    kws_present = keyword in words
                    if not kws_present:
                        break
                if kws_present:
                    # store the stripped, lowercased line, as codeBook.main does
                    results.add(words)
        return sorted(results)
Try this function, as is or slightly modified for your needs, and see if it gives you a sufficient speed-up. Only if that is not enough should you look into more complex solutions, as introducing more threads/processes will invariably increase the complexity of your program.
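As one possible "slightly modified" variant, the keyword check can be written with all(), and the function can take any iterable of lines so it is easy to try without a file. The name matching_lines and the sample data are made up for this sketch.

```python
def matching_lines(keyword_string, lines):
    """Return the sorted, lowercased lines that contain every keyword."""
    keywords = keyword_string.lower().split()
    results = set()
    for line in lines:
        text = line.strip().lower()
        # all() stops at the first missing keyword, like an explicit break
        if all(keyword in text for keyword in keywords):
            results.add(text)
    return sorted(results)

codes = ["A12 Steel Bolt\n", "B07 steel nut\n", "C33 Brass Bolt\n"]
hits = matching_lines("bolt STEEL", codes)
print(hits)
```

An open file object can be passed as lines directly, since iterating over a file yields its lines one at a time.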

Is there a hidden possible deadlock in ppmap/parallel python?

I am having some trouble using a parallel version of map (the ppmap wrapper, implementation by Kirk Strauser).
The function I am trying to run in parallel performs a simple regular expression search on a large number of strings (protein sequences), which are parsed from the filesystem using BioPython's SeqIO. Each function call uses its own file.
If I run the function using a normal map, everything works as expected. However, when using ppmap, some of the runs simply freeze: there is no CPU usage and the main program does not even react to KeyboardInterrupt. Also, when I look at the running processes, the workers are still there (but not using any CPU anymore).
e.g.
/usr/bin/python -u /usr/local/lib/python2.7/dist-packages/pp-1.6.1-py2.7.egg/ppworker.py 2>/dev/null
Furthermore, the workers do not seem to freeze on any particular data entry - if I manually kill the process and re-run the execution, it stops at a different point. (So I have temporarily resorted to keeping a list of finished entries and re-started the program multiple times).
Is there any way to see where the problem is?
Sample of the code that I am running:
    def analyse_repeats(data):
        """
        Loads whole proteome in memory and then looks for repeats in sequences,
        flags both real repeats and sequences not containing particular aminoacid
        """
        (organism, organism_id, filename) = data
        import re
        letters = ['C','M','F','I','L','V','W','Y','A','G','T','S','Q','N','E','D','H','R','K','P']
        try:
            handle = open(filename)
            data = Bio.SeqIO.parse(handle, "fasta")
            records = [record for record in data]
            store_records = []
            for record in records:
                sequence = str(record.seq)
                uniprot_id = str(record.name)
                for letter in letters:
                    items = set(re.compile("(%s+)" % tuple(([letter] * 1))).findall(sequence))
                    if items:
                        for item in items:
                            store_records.append((organism_id,len(item), uniprot_id, letter))
                    else:
                        # letter not present in the string, "zero" repeat
                        store_records.append((organism_id,0, uniprot_id, letter))
            handle.close()
            return (organism,store_records)
        except IOError as e:
            print e
            return (organism, [])
    res_generator = ppmap.ppmap(
        None,
        analyse_repeats,
        zip(todo_list, organism_ids, filenames)
    )
    for res in res_generator:
        # process the output
If I use simple map instead of the ppmap, everything works fine:
    res_generator = map(
        analyse_repeats,
        zip(todo_list, organism_ids, filenames)
    )
You could try using one of the methods (like map) of the Pool object from the multiprocessing module instead. The advantage is that it's built in and doesn't require external packages. It also works very well.
By default, it uses as many worker processes as your computer has cores, but you can specify a higher number as well.
May I suggest using dispy (http://dispy.sourceforge.net)? Disclaimer: I am the author. I understand it doesn't address the question directly, but hopefully helps you.

Troubles with python list and file saving

I really don't know why my code is not saving the readings from the ADC and GPS receiver to the file that I open in the first line of the code. It saves only one record from both the ADC and the GPS receiver.
This is my code:
    import MDM
    f = open("cord+adc.txt", 'w')
    def getADC():
        res = MDM.send('AT#ADC?\r', 0)
        res = MDM.receive(100)
        if(res.find('OK') != -1):
            return res
        else:
            return ""
    def AcquiredPosition():
        res = MDM.send('AT$GPSACP\r', 0)
        res = MDM.receive(30)
        if(res.find('OK') != -1):
            tmp = res.split("\r\n")
            res = tmp[1]
            tmp = res.split(" ")
            return tmp[1]
        else:
            return ""
    while (1):
        cordlist = []
        adclist = []
        p = AcquiredPosition()
        res = MDM.receive(60)
        cordlist.append(p)
        cordlist.append("\r\n")
        f.writelines(cordlist)
        q = getADC()
        res = MDM.receive(60)
        adclist.append(q)
        adclist.append("\r\n")
        f.writelines(adclist)
and this is the file called "cord+adc.txt":
174506.000,2612.7354N,05027.5971E,1.0,23.1,3,192.69,0.18,0.09,191109,07
#ADC: 0
If there is another way to write my code, please advise me, or just point out the error in the code above.
Thanks for any suggestions.
You have two problems here. First, you are not closing your file. Second, and more importantly, your while loop will run forever (or until something else goes wrong in your program) because there is no terminating condition: you loop while 1 but never explicitly break out of the loop. I assume you want the loop to terminate when AcquiredPosition() returns an empty string, so I added if not p: break after the call to that function. When the loop terminates, the file is closed thanks to the with statement. You should restructure your while loop like below:
    with open("cord+adc.txt", 'w') as f:
        while (1):
            cordlist = []
            adclist = []
            p = AcquiredPosition()
            if not p:
                break
            res = MDM.receive(60)
            cordlist.append(p)
            cordlist.append("\r\n")
            f.writelines(cordlist)
            q = getADC()
            res = MDM.receive(60)
            adclist.append(q)
            adclist.append("\r\n")
            f.writelines(adclist)
Because you never explicitly flush() or close() your file, there's no guarantee at all about what will wind up in it. You should probably flush() it after each packet, and you must explicitly close() it when you wish your program to exit.
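The advice to flush after each packet and close at the end might be sketched as follows. The modem calls are left out so that the snippet can run anywhere; log_records and the sample readings are made up for illustration.

```python
import os
import tempfile

def log_records(path, records):
    """Append each record on its own line, flushing after every write."""
    with open(path, "a") as f:      # the with block closes the file on exit
        for record in records:
            f.write(record + "\n")
            f.flush()               # hand Python's buffer to the operating system

# Hypothetical readings standing in for the GPS and ADC responses.
path = os.path.join(tempfile.mkdtemp(), "cord+adc.txt")
log_records(path, ["2612.7354N,05027.5971E", "#ADC: 0"])
log_records(path, ["2612.7360N,05027.5980E", "#ADC: 3"])

with open(path) as f:
    saved = f.read().splitlines()
print(saved)
```

Opening in append mode means a restarted program adds to the existing log instead of overwriting it, which may or may not be what is wanted here.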
If your modem connection is a socket, make sure the socket is functioning by calling getADC() and AcquiredPosition() directly from the interactive interpreter. Just move the while(1) loop into a function (main() is the common practice), then import the module from the interactive prompt.
Your example is missing the initialization of the socket object, MDM. Make sure it is correctly set up to connect to the appropriate address, with code like:
    import socket
    MDM = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    MDM.connect((HOST, PORT))
If MDM doesn't refer to a TCP socket, you can still try calling the mentioned methods interactively.
I don't see you closing the file anywhere. Add this as the last line of your code:
    f.close()
That should contribute to fixing your problem. I don't know much about sockets, etc., so I can't help you there.
When you write a line to a file, it is actually buffered in memory first (this is how C handles files as well). When the buffer reaches its maximum size or you close the file, the buffer is emptied into the file.
That may make file handling sound scary, but the fix is simple: flush the buffer's contents to the file. Once flush() has run and the buffer is empty, all the content written so far has been handed to the file. Of course it would be good to close the file as well, but in an infinite loop that is hardly possible. You could, for example, catch whatever event stops the loop and close the file at that point, when the program exits. That is just a suggestion, of course; flush() should do the trick.
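The buffering described above can be observed directly by comparing the file size on disk before and after flush(). This is a small sketch with a made-up file name; it assumes a write far smaller than the default buffer size.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "buffer_demo.txt")

f = open(path, "w")
f.write("reading 1")                        # small write: stays in the buffer
size_before_flush = os.path.getsize(path)   # nothing has reached the file yet
f.flush()                                   # push the buffered text to the file
size_after_flush = os.path.getsize(path)
f.close()

print(size_before_flush, size_after_flush)
```

If the program is killed between the write and the flush, the buffered text is lost, which matches the symptom of a log file that is missing most of its records.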
