Receiving socket data in one thread, writing the data in another - python

I'm currently writing a Python program to receive data from either a TCP or UDP socket and then write the data to a file. Right now, my program is I/O bound by writing each datagram to the file as it comes in (I'm doing this for very large files, so the slowdown is considerable). With that in mind, I've decided I'd like to try receiving the data from the socket in one thread and writing that data in a different thread. So far, I've come up with the following rough draft. At the moment, it only writes a single data chunk (512 bytes) to a file.
f = open("t1.txt", "wb")

def write_to_file(data):
    f.write(data)

def recv_data():
    dataChunk, addr = sock.recvfrom(buf)  # THIS IS THE DATA THAT GETS WRITTEN
    try:
        w = threading.Thread(target=write_to_file, args=(dataChunk,))
        threads.append(w)
        w.start()
        while(dataChunk):
            sock.settimeout(4)
            dataChunk, addr = sock.recvfrom(buf)
    except socket.timeout:
        print "Timeout"
        sock.close()
        f.close()

threads = []
r = threading.Thread(target=recv_data)
threads.append(r)
r.start()
I imagine I'm doing something wrong; I'm just not sure what the best way to use threading is. Right now, my issue is that I have to supply an argument when I create my thread, but the value of that argument doesn't change to reflect the new data chunks that come in. However, if I put the line w = threading.Thread(target=write_to_file, args=(dataChunk,)) inside the while(dataChunk) loop, wouldn't I be creating a new thread each iteration?
Also, for what it's worth, this is just my small proof-of-concept for using separate receive and write threads. This is not the larger program that should ultimately make use of this concept.

You need to have a buffer that the reading thread writes to, and the writing thread reads from. A deque from the collections module is perfect, as it allows append/pop from either side without performance degradation.
So, don't pass dataChunk to your thread(s), but the buffer.
import collections  # for the buffer
import time         # to ease polling
import threading

def write_to_file(path, buffer, terminate_signal):
    with open(path, 'wb') as out_file:                  # close file automatically on exit
        while not terminate_signal.is_set() or buffer:  # go on until the end is signaled
            try:
                data = buffer.pop()                     # pop from RIGHT end of buffer
            except IndexError:
                time.sleep(0.5)                         # wait for new data
            else:
                out_file.write(data)                    # write a chunk

def read_from_socket(sock, buffer, terminate_signal):
    sock.settimeout(4)
    try:
        while True:
            data, _ = sock.recvfrom(buf)
            buffer.appendleft(data)                     # append to LEFT of buffer
    except socket.timeout:
        print "Timeout"
        terminate_signal.set()                          # signal the writer that we are done
        sock.close()

buffer = collections.deque()          # buffer for reading/writing
terminate_signal = threading.Event()  # shared signal

threads = [
    threading.Thread(target=read_from_socket, kwargs=dict(
        sock=sock,
        buffer=buffer,
        terminate_signal=terminate_signal
    )),
    threading.Thread(target=write_to_file, kwargs=dict(
        path="t1.txt",
        buffer=buffer,
        terminate_signal=terminate_signal
    ))
]

for t in threads:  # start both threads
    t.start()

for t in threads:  # wait for both threads to finish
    t.join()
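If you would rather avoid the polling sleep, the standard library's Queue gives you the same producer/consumer split with blocking built in. Below is a minimal sketch of that variant (not part of the answer above), assuming sock and buf are already set up as in the question; a sentinel value marks the end of the stream so the writer knows when to stop.

import queue      # named Queue on Python 2
import socket
import threading

chunk_queue = queue.Queue()
SENTINEL = None   # marks the end of the stream

def write_to_file(path):
    with open(path, 'wb') as out_file:
        while True:
            data = chunk_queue.get()  # blocks until a chunk is available
            if data is SENTINEL:      # reader signaled end of data
                break
            out_file.write(data)

def read_from_socket(sock, buf):
    sock.settimeout(4)
    try:
        while True:
            data, _ = sock.recvfrom(buf)
            chunk_queue.put(data)
    except socket.timeout:
        pass
    finally:
        chunk_queue.put(SENTINEL)     # wake the writer so it can exit
        sock.close()

writer = threading.Thread(target=write_to_file, args=("t1.txt",))
reader = threading.Thread(target=read_from_socket, args=(sock, buf))
writer.start()
reader.start()
reader.join()
writer.join()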

Related

Ensuring a python queue that can be populated by multiple threads will always be cleared without polling

I have the code below, which shows how a queue would always be cleared even with multiple threads adding to it. It uses recursion, but a while loop could work as well. Is this bad practice, or could there be a scenario where the queue holds an item that won't get pulled until something else gets added to the queue?
The primary purpose of this is to have a queue that ensures order of execution without the need to continually poll or block with q.get().
import queue
import threading

lock = threading.RLock()
q = queue.Queue()

def execute():
    with lock:
        if not q.empty():
            text = q.get()
            print(text)
            execute()

def add_to_queue(text):
    q.put(text)
    execute()

# Assume multiple threads can call add_to_queue
add_to_queue("Hello")
This is one solution that uses a timeout on the .get call: one thread pushes to the queue and one reads from it. You could have multiple readers and writers.
import queue
import threading

q = queue.Queue()

def read():
    try:
        while True:
            text = q.get(timeout=1)
            print(text)
    except queue.Empty:
        print("exiting")

def write():
    q.put("Hello")
    q.put("There")
    q.put("My")
    q.put("Friend")

writer = threading.Thread(target=write)
reader = threading.Thread(target=read)
writer.start()
reader.start()
reader.join()
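If relying on the timeout feels fragile (a slow writer could trip queue.Empty before it is really done), a common alternative is a sentinel value: the writer pushes a marker when it has finished and the reader blocks until it sees it. A minimal sketch of that idea (my variation, not from the answer above):

import queue
import threading

q = queue.Queue()
SENTINEL = object()  # unique marker meaning "no more data"

def read():
    while True:
        text = q.get()        # block until something arrives
        if text is SENTINEL:  # writer is finished
            print("exiting")
            break
        print(text)

def write():
    for word in ("Hello", "There", "My", "Friend"):
        q.put(word)
    q.put(SENTINEL)           # tell the reader to stop

writer = threading.Thread(target=write)
reader = threading.Thread(target=read)
writer.start()
reader.start()
reader.join()

With multiple readers, the writer would need to push one sentinel per reader so each of them gets its own stop signal.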

run function parallel on python tornado

I'm currently developing in Python 3 (still a beginner) on the Tornado framework, and I have a function which I would like to run in the background. To be more precise, the task of the function is to download a big file (chunk by chunk) and probably do some more things after each chunk is downloaded. But the calling function should not wait for the download function to complete; it should rather continue execution.
Here some code examples:
@gen.coroutine
def dosomethingfunc(self, env):
    print("Do something")
    self.downloadfunc(file_url, target_path)  # I don't want to wait here
    print("Do something else")

@gen.coroutine
def downloadfunc(self, file_url, target_path):
    response = urllib.request.urlopen(file_url)
    CHUNK = 16 * 1024
    with open(target_path, 'wb') as f:
        while True:
            chunk = response.read(CHUNK)
            if not chunk:
                break
            f.write(chunk)
            time.sleep(0.1)  # do something after a chunk is downloaded - sleep only as example
I've read this answer on Stack Overflow https://stackoverflow.com/a/25083098/2492068 and tried to use it.
Actually I thought that if I used @gen.coroutine but no yield, dosomethingfunc would continue without waiting for downloadfunc to finish. But the behaviour is the same (with yield or without): "Do something else" is only printed after downloadfunc has finished the download.
What am I missing here?
To benefit from Tornado's asynchrony, a non-blocking function must be yielded at some point. Since the code in downloadfunc is entirely blocking, dosomethingfunc does not get control back until the called function is finished.
There are a couple of issues with your code:
time.sleep is blocking, use tornado.gen.sleep instead,
urllib's urlopen is blocking, use tornado.httpclient.AsyncHTTPClient
So the downloadfunc could look like:
@gen.coroutine
def downloadfunc(self, file_url, target_path):
    client = tornado.httpclient.AsyncHTTPClient()
    # the code below starts the download and
    # gives control back to the ioloop while waiting for data
    res = yield client.fetch(file_url)
    with open(target_path, 'wb') as f:
        f.write(res.body)  # the response body holds the downloaded bytes
    yield tornado.gen.sleep(0.1)
To implement it with streaming (by chunk) support, you might want to do it like this:
# for large files you must increase max_body_size,
# because the default body limit in Tornado is set to 100MB
tornado.httpclient.AsyncHTTPClient.configure(None, max_body_size=2*1024**3)

@gen.coroutine
def downloadfunc(self, file_url, target_path):
    client = tornado.httpclient.AsyncHTTPClient()
    # the streaming_callback will be called with each received portion of data
    yield client.fetch(file_url, streaming_callback=write_chunk)

def write_chunk(chunk):
    # note the "a" mode, to append to the file
    with open(target_path, 'ab') as f:
        print('chunk %s' % len(chunk))
        f.write(chunk)
Now you can call it in dosomethingfunc without yield and the rest of the function will proceed.
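For illustration, here is a rough sketch of what the caller could look like, assuming file_url and target_path are defined as in the question; calling the coroutine without yield returns a Future that we deliberately ignore, so the download runs on the IOLoop while the caller keeps going:

@gen.coroutine
def dosomethingfunc(self, env):
    print("Do something")
    # start the download; without yield we do not wait for the returned Future
    self.downloadfunc(file_url, target_path)
    # alternatively, tornado.ioloop.IOLoop.current().spawn_callback(
    #     self.downloadfunc, file_url, target_path)
    # does the same but also logs unhandled exceptions from the coroutine
    print("Do something else")  # printed immediately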
edit
Modifying the chunk size is not supported (exposed) on either the server or the client side. You may also look at https://groups.google.com/forum/#!topic/python-tornado/K8zerl1JB5o

Multiple stdout w/ flush going on in Python threading

I have a small piece of code that I made to test out and hopefully debug the problem without having to modify the code in my main applet in Python. This has led me to build this code:
#!/usr/bin/env python
import sys, threading, time

def loop1():
    count = 0
    while True:
        sys.stdout.write('\r thread 1: ' + str(count))
        sys.stdout.flush()
        count = count + 1
        time.sleep(.3)

def loop2():
    count = 0
    print ""
    while True:
        sys.stdout.write('\r thread 2: ' + str(count))
        sys.stdout.flush()
        count = count + 2
        time.sleep(.3)

if __name__ == '__main__':
    try:
        th = threading.Thread(target=loop1)
        th.start()
        th1 = threading.Thread(target=loop2)
        th1.start()
    except KeyboardInterrupt:
        print ""
My goal with this code is to have both of these threads display output on stdout (with flushing) at the same time, side by side or something similar. The problem is, I assume, that since each one is flushing, it overwrites the other string by default. I don't quite know how to get this to work, if it is even possible.
If you run only one of the threads, it works fine. However, I want to be able to run both threads, each with its own string, at the same time in the terminal output. Here is a picture displaying what I'm getting:
terminal screenshot
Let me know if you need more info. Thanks in advance.
Instead of allowing each thread to output to stdout, a better solution is to have one thread control stdout exclusively. Then provide a threadsafe channel for the other threads to dispatch data to be output.
One good method to achieve this is to share a Queue between all threads. Ensure that only the output thread is accessing data after it has been added to the queue.
The output thread can store the last message from each other thread and use that data to format stdout nicely. This can include clearing the output to display something like the following, updating it as each thread generates new data:
Threads
#1: 0
#2: 0
Example
Some decisions were made to simplify this example:
There are gotchas to be wary of when giving arguments to threads.
Daemon threads terminate themselves when the main thread exits. They are used to avoid adding complexity to this answer. Using them on long-running or large applications can pose problems. Other
questions discuss how to exit a multithreaded application without leaking memory or locking system resources. You will need to think about how your program needs to signal an exit. Consider using asyncio to save yourself these considerations.
No newlines are used because \r carriage returns cannot clear the whole console. They only allow the current line to be rewritten.
import queue, threading
import time, sys

q = queue.Queue()
keepRunning = True

def loop_output():
    thread_outputs = dict()
    while keepRunning:
        try:
            thread_id, data = q.get_nowait()
            thread_outputs[thread_id] = data
        except queue.Empty:
            # because the queue is only used for updates, there's no need to wait or block
            pass
        pretty_output = ""
        for thread_id, data in thread_outputs.items():
            pretty_output += '({}:{}) '.format(thread_id, str(data))
        sys.stdout.write('\r' + pretty_output)
        sys.stdout.flush()
        time.sleep(1)

def loop_count(thread_id, increment):
    count = 0
    while keepRunning:
        msg = (thread_id, count)
        try:
            q.put_nowait(msg)
        except queue.Full:
            pass
        count = count + increment
        time.sleep(.3)

if __name__ == '__main__':
    try:
        th_out = threading.Thread(target=loop_output)
        th_out.start()
        # make sure to use args, not pass arguments directly
        th0 = threading.Thread(target=loop_count, args=("Thread0", 1))
        th0.daemon = True
        th0.start()
        th1 = threading.Thread(target=loop_count, args=("Thread1", 3))
        th1.daemon = True
        th1.start()
        # Keep the main thread alive to wait for KeyboardInterrupt
        while True:
            time.sleep(.1)
    except KeyboardInterrupt:
        print("Ended by keyboard stroke")
        keepRunning = False
        for th in [th0, th1]:
            th.join()
Example Output:
(Thread0:110) (Thread1:330)

Printing to file by multiple pool workers

I'm trying to make a pool which prints data into a file.
from multiprocessing import Pool  # or multiprocessing.dummy.Pool for a thread pool

def get_and_print_something(url):
    with open('file.txt', 'a') as f:
        f.write(get_line(url))

pool = Pool(50)
for url in urls:
    pool.apply_async(get_and_print_something, args=(url,))
The problem is that it sometimes writes corrupted data, because two workers manipulate the same file at the same time. Is it possible to make a worker wait until the file can be modified?
Example of the txt:
This is a correct line.
This is a correct line.
orrect line.
This is a correct line.
...
You can take the example from e.g. this site:
http://effbot.org/zone/thread-synchronization.htm#locks, or
https://pymotw.com/2/threading/
which basically boils down to:
import threading

lock = threading.Lock()

def get_and_print_something(url):
    # Not yet in the critical section, because we want this part to happen concurrently:
    line = get_line(url)

    lock.acquire()  # Will wait if necessary until any other thread has finished its file access.
    # In the critical section now. Only one thread may run this at any one time.
    try:
        with open('file.txt', 'a') as f:
            f.write(line)
    finally:
        lock.release()  # Release the lock, so that other threads can access the file again.
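Note that a threading.Lock is only shared between threads inside one process. If the workers really are separate processes from multiprocessing.Pool, as the question's Pool(50) suggests, one option (a sketch under that assumption, with get_line and urls taken from the question) is to hand a multiprocessing.Lock to every worker through the pool's initializer:

import multiprocessing

def init_worker(shared_lock):
    # store the lock in a module-level global of each worker process
    global lock
    lock = shared_lock

def get_and_print_something(url):
    line = get_line(url)  # the slow part runs outside the critical section
    with lock:            # only one process appends to the file at a time
        with open('file.txt', 'a') as f:
            f.write(line)

if __name__ == '__main__':
    write_lock = multiprocessing.Lock()
    pool = multiprocessing.Pool(50, initializer=init_worker, initargs=(write_lock,))
    for url in urls:
        pool.apply_async(get_and_print_something, args=(url,))
    pool.close()
    pool.join()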

The thread hangs using FTP LIST with Python

I'm using ftplib for connecting and getting file list from FTP server.
The problem I have is that the connection hangs from time to time and I don't know why. I'm running the Python script as a daemon, using threads.
See what I mean:
def main():
    signal.signal(signal.SIGINT, signal_handler)
    app.db = MySQLWrapper()
    try:
        app.opener = FTP_Opener()
        mainloop = MainLoop()
        while not app.terminate:
            # suspend the main thread until the queue terminates;
            # this lets the queue restart automatically in case of an unexpected shutdown
            mainloop.join(10)
            while (not app.terminate) and (not mainloop.isAlive()):
                time.sleep(script_timeout)
                print time.ctime(), "main: trying to restart the queue"
                try:
                    mainloop = MainLoop()
                except Exception:
                    time.sleep(60)
    finally:
        app.db.close()
        app.db = None
        app.opener = None
        mainloop = None
        try:
            os.unlink(PIDFILE)
        except:
            pass
        # give other threads time to terminate
        time.sleep(1)
        print time.ctime(), "main: main thread terminated"
MainLoop() has some functions to connect to the FTP server, download specific files, and disconnect from it.
Here's how I get the file list:
file_list = app.opener.load_list()
And here's what the FTP_Opener.load_list() function looks like:
def load_list(self):
    attempts = 0
    while attempts <= ftp_config.load_max_attempts:
        attempts += 1
        filelist = []
        try:
            self._connect()
            self._chdir()
            # retrieve the file list into the 'filelist' var
            self.FTP.retrlines('LIST', lambda s: filelist.append(s))
            filelist = self._filter_filelist(self._parse_filelist(filelist))
            return filelist
        except Exception:
            print sys.exc_info()
            self._disconnect()
            sleep(0.1)
    print time.ctime(), "FTP Opener: can't load file list"
    return []
Why does the FTP connection sometimes hang, and how can I monitor for this? If it happens, I would like to terminate the thread somehow and start a new one.
Thanks
If you are building for robustness, I would highly recommend that you look into using an event-driven approach. One such framework with FTP support is Twisted (API).
The advantage is that you don't block the thread while waiting for I/O, and you can create simple timer functions to monitor your connections if you prefer. It also scales a lot better. It is slightly more complicated to code using event-driven patterns, so if this is just a simple script it may or may not be worth the extra effort, but since you say you are writing a daemon, it might be worth looking into.
Here is an example of an FTP client: ftpclient.py
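If a full event-driven rewrite is more than this daemon needs, a lighter-weight mitigation (my suggestion, not part of the answer above) is to give ftplib a socket timeout so a stalled LIST raises socket.timeout instead of hanging the thread; host, user, passwd and path below are placeholders:

import ftplib

def load_list_with_timeout(host, user, passwd, path, timeout=30):
    # the timeout (seconds) covers the connection attempt and socket operations,
    # so a stalled transfer raises an exception instead of blocking forever
    ftp = ftplib.FTP(host, timeout=timeout)
    try:
        ftp.login(user, passwd)
        ftp.cwd(path)
        filelist = []
        ftp.retrlines('LIST', filelist.append)
        return filelist
    finally:
        try:
            ftp.quit()
        except ftplib.all_errors:
            ftp.close()

Because the exception propagates, the retry loop already present in load_list() would catch it, disconnect, and try again rather than leaving the thread stuck.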
