Run a function in parallel with Python Tornado

I'm currently developing in Python 3 (still a beginner) with the Tornado framework, and I have a function which I would like to run in the background. To be more precise, the task of the function is to download a big file (chunk by chunk) and probably do some more things after each chunk is downloaded. But the calling function should not wait for the download function to complete; it should rather continue execution.
Here are some code examples:
@gen.coroutine
def dosomethingfunc(self, env):
    print("Do something")
    self.downloadfunc(file_url, target_path)  # I don't want to wait here
    print("Do something else")

@gen.coroutine
def downloadfunc(self, file_url, target_path):
    response = urllib.request.urlopen(file_url)
    CHUNK = 16 * 1024
    with open(target_path, 'wb') as f:
        while True:
            chunk = response.read(CHUNK)
            if not chunk:
                break
            f.write(chunk)
            time.sleep(0.1)  # do something after a chunk is downloaded - sleep only as example
I've read this answer on Stack Overflow https://stackoverflow.com/a/25083098/2492068 and tried to use it.
I thought that if I used @gen.coroutine but no yield, dosomethingfunc would continue without waiting for downloadfunc to finish. But actually the behaviour is the same (with yield or without) - "Do something else" will only be printed after downloadfunc has finished the download.
What am I missing here?

To benefit from Tornado's asynchronous behaviour, a non-blocking function must be yielded at some point. Since the code of downloadfunc is entirely blocking, dosomethingfunc does not get control back until the called function has finished.
There are a couple of issues with your code:
time.sleep is blocking, use tornado.gen.sleep instead,
urllib's urlopen is blocking, use tornado.httpclient.AsyncHTTPClient
So the downloadfunc could look like:
@gen.coroutine
def downloadfunc(self, file_url, target_path):
    client = tornado.httpclient.AsyncHTTPClient()
    # the code below starts the download and gives control
    # back to the IOLoop while waiting for data
    res = yield client.fetch(file_url)
    with open(target_path, 'wb') as f:
        f.write(res.body)
    yield tornado.gen.sleep(0.1)
To implement it with streaming (chunk-by-chunk) support, you might want to do it like this:
# for large files you must increase max_body_size,
# because the default body limit in Tornado is set to 100MB
tornado.httpclient.AsyncHTTPClient.configure(None, max_body_size=2*1024**3)

@gen.coroutine
def downloadfunc(self, file_url, target_path):
    client = tornado.httpclient.AsyncHTTPClient()

    def write_chunk(chunk):
        # note the "a" mode, to append to the file
        with open(target_path, 'ab') as f:
            print('chunk %s' % len(chunk))
            f.write(chunk)

    # the streaming_callback will be called with each received portion of data
    yield client.fetch(file_url, streaming_callback=write_chunk)
Now you can call it in dosomethingfunc without yield and the rest of the function will proceed.
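For illustration, a minimal sketch of dosomethingfunc applying that advice is shown below (file_url and target_path are placeholders from the question); calling the coroutine without yield starts it on the IOLoop but does not wait for it:

@gen.coroutine
def dosomethingfunc(self, env):
    print("Do something")
    # no yield here: downloadfunc keeps running on the IOLoop,
    # but we do not wait for it to finish
    self.downloadfunc(file_url, target_path)
    print("Do something else")

If you also want unhandled download errors to be logged, IOLoop.current().spawn_callback(self.downloadfunc, file_url, target_path) is an equivalent fire-and-forget way to schedule it.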
edit
Modifying the chunk size is not supported (exposed), on either the server or the client side. You may also look at https://groups.google.com/forum/#!topic/python-tornado/K8zerl1JB5o

Related

Is it possible to read multiple asyncio Streams concurrently?

I need to read the output of several asyncio tasks running concurrently.
These tasks are actually created using asyncio.create_subprocess_exec().
In the simplest form I would need to print stdout/stderr of a single process while accumulating lines in separate strings.
My current (working) code is:
async def run_command(*args, stdin=None, can_fail=False, echo=False):
    """
    Run command asynchronously in subprocess.
    Waits for command completion and returns return code, stdout and stderr.
    Example from:
    http://asyncio.readthedocs.io/en/latest/subprocess.html
    """
    # Create subprocess
    try:
        process = await asyncio.create_subprocess_exec(
            *args,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE
        )
    except (FileNotFoundError, OSError):
        if not can_fail:
            log.error("run_command(%s): Error FileNotFound", args)
        return -1, '', 'File "%s" NotFound' % args[0]
    # Status
    log.debug("run_command(%s): pid=%s", args, process.pid)
    # Wait for the subprocess to finish
    stdout, stderr = await process.communicate(stdin)
    # Progress
    if process.returncode == 0:
        log.debug("run_command(%s): ok: %s", process.pid, stdout.decode().strip())
    else:
        log.debug("run_command(%s): ko: %s", process.pid, stderr.decode().strip())
    # Result
    result = process.returncode, stdout.decode().strip(), stderr.decode().strip()
    # Return (return code, stdout, stderr)
    return result
The problem with this code is that I see nothing until the process terminates; some of the spawned processes may take several minutes to complete and would print "interesting" info while executing. How can I print (or log) the output as soon as it happens, while still capturing it? (I am aware that without capturing, the underlying process would print, but I also need the capture.)
I tried to do something along the lines of:
_stdout = ''
while True:
    data = await process.stdout.readline()
    if not data:
        break
    print(data)
    _stdout += data.decode()
but I have no idea how to extend this to multiple streams (in this case just stdout/stderr, but potentially expanding to multiple programs). Is there something akin to select() call?
Any hint welcome
Is there something akin to select() call?
The answer to this must be yes, as asyncio is wholly built around a call to select(). However it's not always obvious how to translate that to a select on the level of streams. The thing to notice is that you shouldn't try to select the stream exactly - instead, start reading on the stream and rely on the ability to select the progress of the coroutines. The equivalent of select() would thus be to use asyncio.wait(return_when=FIRST_COMPLETED) to drive the reads in a loop.
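For reference, a rough sketch of that wait()-based loop might look like this (the function name stream_lines and the line-by-line framing are illustrative assumptions, not part of the original code):

import asyncio

async def stream_lines(process):
    # keep one pending read per stream and react to whichever finishes first
    pending = {
        asyncio.ensure_future(process.stdout.readline()): process.stdout,
        asyncio.ensure_future(process.stderr.readline()): process.stderr,
    }
    while pending:
        done, _ = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            stream = pending.pop(task)
            line = task.result()
            if line:  # b'' means the stream reached EOF
                print(line.decode().rstrip())
                # schedule the next read on the same stream
                pending[asyncio.ensure_future(stream.readline())] = stream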
An even more elegant alternative is to spawn separate tasks where each does its thing, and just let them run in parallel. The code is easier to understand than with a select, boiling down to a single call to gather, and yet under the hood asyncio performs exactly the kind of select() that was requested:
import asyncio, sys, io

async def _read_all(stream, echo):
    # helper function to read the whole stream, optionally
    # displaying data as it arrives
    buf = io.BytesIO()  # BytesIO is preferred to +=
    while True:
        chunk = await stream.read(4096)
        if len(chunk) == 0:
            break
        buf.write(chunk)
        if echo:
            sys.stdout.buffer.write(chunk)
    return buf.getvalue()
async def run_command(*args, stdin=None, echo=False):
    process = await asyncio.create_subprocess_exec(
        *args,
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )
    if stdin is not None:
        process.stdin.write(stdin)
        process.stdin.close()
    stdout, stderr = await asyncio.gather(
        _read_all(process.stdout, echo),
        _read_all(process.stderr, echo)
    )
    await process.wait()  # both streams are at EOF; make sure returncode is set
    return process.returncode, stdout.decode().strip(), stderr.decode().strip()
Note that asyncio's write() is not a coroutine; it defaults to writing in the background, so we don't need to include the write among the coroutines we gather().
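For completeness, a hypothetical call site (the command and flags are just placeholders), awaited from a running event loop, could look like:

rc, out, err = await run_command("ls", "-l", echo=True)
print("exit code:", rc)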

Python: how to use threading for multiple downloads

I am using threading to do parallel downloads. I have a URL list img_list and I want to download it in 2 threads, so I defined two download functions.
I put half of the list into download_1 and half into download_2 so it completes faster, but when I run the script my downloads still run serially. I don't know why; how can I modify my script?
Here's the code:
import requests, threading

img_list = [...]
num = len(img_list)

def download_1(img_list):
    n = 0
    for i in img_list:
        n += 1
        with open('./img/' + str(n) + '.jpg', 'wb') as f:
            f.write(requests.get(i).content)
            print(str(n) + "download1 complete")

def download_2(img_list):
    n = len(img_list)
    for i in img_list:
        n += 1
        with open('./img/' + str(n) + '.jpg', 'wb') as f:
            f.write(requests.get(i).content)
            print(str(n) + "download2 complete")

thread_1 = threading.Thread(target=download_1(img_list[:int(num/2)]))
thread_2 = threading.Thread(target=download_2(img_list[int(num/2):]))
thread_1.start()
thread_2.start()
In this line
threading.Thread(target=download_1(img_list[:int(num/2)]))
you call download_1(...) and pass its result (None) to the thread. That's why it runs serially. Instead you want to pass the download_1 function itself (not the result of calling it) to the thread, like this:
threading.Thread(target=download_1, args=(img_list[:int(num/2)],))
Do it in both places.
Side note: you should t.join() both threads at the end.
You are calling both functions at the time you create the threads, so the threads are passed None as their target and therefore do nothing. You should change the code like this:
thread_1 = threading.Thread(target=download_1, args=(img_list[:int(num/2)],))
thread_2 = threading.Thread(target=download_2, args=(img_list[int(num/2):],))
thread_1.start()
thread_2.start()
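Following the side note from the first answer, the script should also wait for both threads at the end:

thread_1.join()  # block until download_1 has finished
thread_2.join()  # block until download_2 has finished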

Receiving socket data in one thread, writing the data in another

I'm currently writing a Python program to receive data from either a TCP or UDP socket and then write the data to a file. Right now, my program is I/O bound by writing each datagram to the file as it comes in (I'm doing this for very large files, so the slowdown is considerable). With that in mind, I've decided I'd like to try receiving the data from the socket in one thread and writing that data in a different thread. So far, I've come up with the following rough draft. At the moment, it only writes a single data chunk (512 bytes) to a file.
f = open("t1.txt", "wb")

def write_to_file(data):
    f.write(data)

def recv_data():
    dataChunk, addr = sock.recvfrom(buf)  # THIS IS THE DATA THAT GETS WRITTEN
    try:
        w = threading.Thread(target=write_to_file, args=(dataChunk,))
        threads.append(w)
        w.start()
        while(dataChunk):
            sock.settimeout(4)
            dataChunk, addr = sock.recvfrom(buf)
    except socket.timeout:
        print "Timeout"
        sock.close()
        f.close()

threads = []
r = threading.Thread(target=recv_data)
threads.append(r)
r.start()
I imagine I'm doing something wrong; I'm just not sure what the best way to use threading is. Right now, my issue is that I have to supply an argument when I create the thread, but the value of that argument doesn't change to reflect the new data chunks that come in. However, if I put the line w = threading.Thread(target=write_to_file, args=(dataChunk,)) inside the while(dataChunk) loop, wouldn't I be creating a new thread each iteration?
Also, for what it's worth, this is just my small proof-of-concept for using separate receive and write threads. This is not the larger program that should ultimately make use of this concept.
You need to have a buffer that the reading thread writes to, and the writing thread reads from. A deque from the collections module is perfect, as it allows append/pop from either side without performance degradation.
So, don't pass dataChunk to your thread(s), but the buffer.
import collections  # for the buffer
import time         # to ease polling
import threading

def write_to_file(path, buffer, terminate_signal):
    with open(path, 'wb') as out_file:  # close file automatically on exit
        while not terminate_signal.is_set() or buffer:  # go on until end is signaled
            try:
                data = buffer.pop()  # pop from RIGHT end of buffer
            except IndexError:
                time.sleep(0.5)  # wait for new data
            else:
                out_file.write(data)  # write a chunk

def read_from_socket(sock, buffer, terminate_signal):
    sock.settimeout(4)
    try:
        while True:
            data, _ = sock.recvfrom(buf)
            buffer.appendleft(data)  # append to LEFT of buffer
    except socket.timeout:
        print "Timeout"
        terminate_signal.set()  # signal writer that we are done
        sock.close()

buffer = collections.deque()          # buffer for reading/writing
terminate_signal = threading.Event()  # shared signal

threads = [
    threading.Thread(target=read_from_socket, kwargs=dict(
        sock=sock,
        buffer=buffer,
        terminate_signal=terminate_signal
    )),
    threading.Thread(target=write_to_file, kwargs=dict(
        path="t1.txt",
        buffer=buffer,
        terminate_signal=terminate_signal
    ))
]

for t in threads:  # start both threads
    t.start()

for t in threads:  # wait for both threads to finish
    t.join()
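As a design note, the standard library's Queue offers the same producer/consumer hand-off with the blocking built in, so the polling sleep and the Event become unnecessary. A minimal sketch under that assumption (buf and sock come from the question, and None is used as an end-of-stream sentinel):

import socket
import threading
import Queue  # named "queue" in Python 3

def write_to_file(path, q):
    with open(path, 'wb') as out_file:
        while True:
            data = q.get()    # blocks until a chunk is available
            if data is None:  # sentinel: the reader is done
                break
            out_file.write(data)

def read_from_socket(sock, q):
    sock.settimeout(4)
    try:
        while True:
            data, _ = sock.recvfrom(buf)
            q.put(data)
    except socket.timeout:
        q.put(None)  # tell the writer to stop
        sock.close()

The thread setup stays the same as above, except a Queue.Queue() is shared instead of the deque and there is no terminate_signal to pass.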

Printing to a file from multiple pool workers

I'm trying to make a pool which prints data into a file.
from multiprocessing import Pool

def get_and_print_something(url):
    with open('file.txt', 'a') as f:
        f.write(get_line(url))

pool = Pool(50)
for url in urls:
    pool.apply_async(get_and_print_something, args=(url,))
The problem is that sometimes it writes corrupted data, because two workers manipulate the same file at the same time. Is it possible to make a worker wait until the file can be modified?
Example of the txt:
This is a correct line.
This is a correct line.
orrect line.
This is a correct line.
...
You can take the example from e.g. this site:
http://effbot.org/zone/thread-synchronization.htm#locks, or
https://pymotw.com/2/threading/
which basically boils down to:
import threading

lock = threading.Lock()

def get_and_print_something(url):
    # Not yet in critical section because we want this to happen concurrently:
    line = get_line(url)

    lock.acquire()  # Will wait if necessary until any other thread has finished its file access.
    # In critical section now. Only one thread may run this at any one time.
    try:
        with open('file.txt', 'a') as f:
            f.write(line)
    finally:
        lock.release()  # Release lock, so that other threads can access the file again.
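One caveat: the question uses a multiprocessing Pool, and a plain threading.Lock is not shared across worker processes. A hedged sketch of a process-safe variant (init_worker is an illustrative name; get_line and urls come from the question) passes a multiprocessing.Lock to every worker through the Pool initializer:

import multiprocessing

lock = None  # set in each worker process by init_worker

def init_worker(shared_lock):
    global lock
    lock = shared_lock

def get_and_print_something(url):
    line = get_line(url)  # do the slow part outside the critical section
    with lock:            # only one worker may write at a time
        with open('file.txt', 'a') as f:
            f.write(line)

if __name__ == '__main__':
    shared_lock = multiprocessing.Lock()
    pool = multiprocessing.Pool(50, initializer=init_worker, initargs=(shared_lock,))
    for url in urls:
        pool.apply_async(get_and_print_something, args=(url,))
    pool.close()
    pool.join()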

Have Python wait for a function to finish before proceeding with the program

I have written a Python program. This program calls a function within a module I have also written and passes it some data.
program:
def Response(Response):
    Resp = Response

def main():
    myModule.process_this("hello")  # Send string to myModule process_this function
    # Should wait around here for Resp to contain the Response
    print Resp
That function processes it and passes it back as a response to function Response in the main program.
myModule:
def process_this(data):
    # process data
    program.Response(data)
I checked and all the data is being passed correctly. I have left out all the imports and the data processing to keep this question as concise as possible.
I need to find some way of having Python wait for Resp to actually contain the response before proceeding with the program. I've been looking at threading with semaphores, and at the Queue module, but I'm not 100% sure how I would incorporate either into my program.
Here's a working solution with queues and the threading module. Note: if your tasks are CPU-bound rather than I/O-bound, you should use multiprocessing instead.
import threading
import Queue

def worker(in_q, out_q):
    """ threadsafe worker """
    abort = False
    while not abort:
        try:
            # make sure we don't wait forever
            task = in_q.get(True, .5)
        except Queue.Empty:
            abort = True
        else:
            # process task
            response = task
            # return result
            out_q.put(response)
            in_q.task_done()

# one queue to pass tasks, one to get results
task_q = Queue.Queue()
result_q = Queue.Queue()

# start threads
t = threading.Thread(target=worker, args=(task_q, result_q))
t.start()

# submit some work
task_q.put("hello")

# wait for results
task_q.join()
print "result", result_q.get()
