Trying to fix a friend's code where the while loop doesn't continue until the for loop is satisfied. I feel something is wrong with the readbuffer. Basically, we want the while loop to run continuously, and whenever the for loop has lines to process, run it. If someone could help me understand what is happening with readbuffer and temp, I'd be greatly thankful.
Here's the snippet:
s = openSocket()
joinRoom(s)
readbuffer = ""

while True:
    readbuffer = readbuffer + s.recv(1024)
    temp = string.split(readbuffer, "\n")
    readbuffer = temp.pop()

    for line in temp:
        user = getUser(line)
        message = getMessage(line)
Based on my understanding of your question, you want to execute the for loop while continuing to receive packets.
I'm not sure what you do in getUser and getMessage. If there are I/O operations (reading/writing files, DB I/O, send/recv, ...) in them, you can use Python's async features to write asynchronous programs. (See: https://docs.python.org/3/library/asyncio-task.html)
I assume, however, that you are just extracting a single element from line, which involves no I/O. In that case, async won't help. If getUser and getMessage really take too much CPU time, you can put the for loop in a new thread, making the string operations non-blocking. (See: https://docs.python.org/3/library/threading.html)
from threading import Thread


def getUserProfile(lines, profiles):
    for line in lines:
        user = getUser(line)
        message = getMessage(line)
        profiles.append((user, message))


profiles = []
threads = []

s = openSocket()
joinRoom(s)

while True:
    readbuffer = s.recv(1024)
    lines = readbuffer.decode('utf-8').split('\n')
    t = Thread(target=getUserProfile, args=(lines, profiles))
    t.start()  # start(), not run(): run() would execute in the calling thread
    threads.append(t)

# If the loop can somehow be interrupted, add these two lines
# to wait for all worker threads to finish:
for th in threads:
    th.join()  # blocks the main thread until all threads have terminated
Update
Of course this is not a typical way to solve this issue; it's just easier for beginners to understand, and fine for simple assignments.
One better way is to use something like a Future, making send/recv asynchronous, and pass it a callback so that it can hand the received data to your callback. If you want to move the heavy CPU workload to another thread or an endless loop (routine), just create a Thread in the callback or somewhere else, depending on your architecture.
I implemented a lightweight distributed computing framework for my network programming course, and I wrote my own future class for the project, if anyone is interested.
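For reference, here is a minimal sketch of that Future-plus-callback idea using the standard library's concurrent.futures instead of a hand-written future class; openSocket, joinRoom, getUser and getMessage are the placeholders from the question, so treat this as an illustration rather than drop-in code:
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=2)
profiles = []


def parse_lines(lines):
    # CPU-side work runs in a worker thread
    return [(getUser(line), getMessage(line)) for line in lines]


def on_parsed(future):
    # called when parse_lines finishes; collect the results
    profiles.extend(future.result())


s = openSocket()
joinRoom(s)
readbuffer = ""

while True:
    readbuffer += s.recv(1024).decode('utf-8')
    *lines, readbuffer = readbuffer.split('\n')
    if lines:
        fut = pool.submit(parse_lines, lines)  # returns a Future immediately
        fut.add_done_callback(on_parsed)       # the recv loop never blocks on parsing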
I am working on a school project. I set some rules in iptables which log INPUT and OUTPUT connections. My goal is to read these logs line by line, parse them, and find out which process with which PID is causing each connection.
My problem starts when I use psutil to match an (ip, port) tuple to the corresponding PID. iptables writes log entries very quickly, roughly every 1x10^-6 seconds, and my Python script reads lines just as fast. But when I use the following code:
def get_proc(src: str, spt: str, dst: str, dpt: str) -> str:
    proc_info = ""
    if not (src and spt and dst and dpt):
        return proc_info

    for proc in psutil.process_iter(["pid", "name"]):
        for conn in proc.connections(kind="all"):
            if flag.is_set():
                return proc_info
            if not all([
                hasattr(conn.laddr, "ip"), hasattr(conn.laddr, "port"),
                hasattr(conn.raddr, "ip"), hasattr(conn.raddr, "port"),
            ]):
                continue
            if not all([
                conn.laddr.ip == src, conn.laddr.port == int(spt),
                conn.raddr.ip == dst, conn.raddr.port == int(dpt),
            ]):
                continue
            return f"pid={proc.pid},name={proc.name()}"
    return proc_info
psutil finishes its job in about 1x10^-3 seconds, which means it is roughly 10^3 times slower than the reading process. What happens is: while get_proc runs once, I read 1000 lines. This slowness quickly becomes a problem when 1x10^6 lines have been read at the end, because in order to find the PID I need to run this method immediately when each log line is received.
I thought of using multithreading, but as far as I understand it won't solve my problem, because the same latency problem remains.
I haven't done much coding so far because I still can't find an algorithm to use. That's why there is no more code here.
How can I solve this problem, with or without multithreading? I can't speed up the execution of psutil itself, so I believe there must be better approaches.
Edit
Code part for reading logs from iptables.log:
flag = threading.Event()


def stop(signum, _frame):
    """
    Tell everything to stop themselves.
    :param signum: The captured signal number.
    :param _frame: No use.
    """
    if flag.is_set():
        return
    sys.stderr.write(f"Signal {signum} received.")
    flag.set()


signal.signal(signal.SIGINT, stop)


def receive_logs(file, queue__):
    global CURSOR_POSITION
    with open(file, encoding="utf-8") as _f:
        _f.seek(CURSOR_POSITION)
        while not flag.is_set():
            line = re.sub(r"[\[\]]", "", _f.readline().rstrip())
            if not line:
                continue
            # If all goes okay do some parsing...
            # .
            # .
            queue__.put_nowait((nettup, additional_info))
            CURSOR_POSITION = _f.tell()
Here is an approach that may help a bit. As I've mentioned in comments, the issue cannot be entirely avoided unless you change to a better approach entirely.
The idea here is to scan the list of processes not once per connection but for all connections that have arrived since the last scan. Since checking connections can be done with a simple hash table lookup in O(1) time, we can process messages much faster.
I chose to go with a simple 1-producer-1-consumer multithreading approach. I think this will work fine because most time is spent in system calls, so Python's global interpreter lock (GIL) is less of an issue. But that requires testing. Possible variations:
Use no multithreading, instead read incoming logs nonblocking, then process what you've got
Swap the threading module and queue for multiprocessing module
Use multiple consumer threads and maybe batch block sizes to have multiple scans through the process list in parallel
import psutil
import queue
import threading


def receive_logs(consumer_queue):
    """Placeholder for actual code reading iptables log"""
    for connection in log:
        nettup = (connection.src, int(connection.spt),
                  connection.dst, int(connection.dpt))
        additional_info = connection.additional_info
        consumer_queue.put((nettup, additional_info))
The log reading is not part of the posted code, so this is just some placeholder.
Now we consume all queued connections in a second thread:
def get_procs(producer_queue):
    # 1. Construct a set of connections to search for.
    # Blocks until at least one is available
    nettup, additional_info = producer_queue.get()
    connections = {nettup: additional_info}
    try:  # read as many as possible
        while True:
            nettup, additional_info = producer_queue.get_nowait()
            connections[nettup] = additional_info
    except queue.Empty:
        pass

    found = []
    for proc in psutil.process_iter(["pid", "name"]):
        for conn in proc.connections(kind="all"):
            try:
                src = conn.laddr.ip
                spt = conn.laddr.port
                dst = conn.raddr.ip
                dpt = conn.raddr.port
            except AttributeError:  # not an IP address
                continue
            nettup = (src, spt, dst, dpt)
            if nettup in connections:
                additional_info = connections[nettup]
                found.append((proc, nettup, additional_info))

    found_connections = {nettup for _, nettup, _ in found}
    lost = [(nettup, additional_info)
            for nettup, additional_info in connections.items()
            if nettup not in found_connections]
    return found, lost
I don't really understand parts of the posted code in the question, such as the if flag.is_set(): return proc_info part so I just left those out. Also, I got rid of some of the less pythonic and potentially slow parts such as hasattr(). Adapt as needed.
Now we tie it all together by calling the consumer repeatedly and starting both threads:
def consume(producer_queue):
    while True:
        found, lost = get_procs(producer_queue)
        for proc, (src, spt, dst, dpt), additional_info in found:
            print(f"pid={proc.pid},name={proc.name()}")


def main():
    producer_consumer_queue = queue.SimpleQueue()
    producer = threading.Thread(
        target=receive_logs, args=(producer_consumer_queue, ))
    consumer = threading.Thread(
        target=consume, args=(producer_consumer_queue, ))
    consumer.start()
    producer.start()
    consumer.join()
    producer.join()
My application needs remote control over SSH.
I wish to use this example: https://asyncssh.readthedocs.io/en/latest/#simple-server-with-input
The original app is rather big, using GPIO, 600 lines of code and 10 libraries, so I've made a simple example here:
import asyncio, asyncssh, sys, time

# here would be 10 libraries in the original 600-line application

is_open = True
return_value = 0


async def handle_client(process):
    process.stdout.write('Enter numbers one per line, or EOF when done:\n')
    process.stdout.write(is_open)
    total = 0
    try:
        async for line in process.stdin:
            line = line.rstrip('\n')
            if line:
                try:
                    total += int(line)
                except ValueError:
                    process.stderr.write('Invalid number: %s\n' % line)
    except asyncssh.BreakReceived:
        pass
    process.stdout.write('Total = %s\n' % total)
    process.exit(0)


async def start_server():
    await asyncssh.listen('', 8022, server_host_keys=['key'],
                          authorized_client_keys='key.pub',
                          process_factory=handle_client)


loop = asyncio.get_event_loop()

try:
    loop.run_until_complete(start_server())
except (OSError, asyncssh.Error) as exc:
    sys.exit('Error starting server: ' + str(exc))

loop.run_forever()

# here is the "old" program: that would not run now as loop.run_forever() runs.
# while True:
#     print(return_value)
#     time.sleep(0.1)
The main app is mostly driven by a while True loop with lots of functions and sleep.
I've commented that part out in the simple example above.
My question is: how should I implement the SSH part, which uses loop.run_forever(), and still be able to run my main loop?
Also: handle_client(process) must be able to interact with variables in the main program (read/write).
You have basically three options:
Rewrite your main loop to be asyncio compatible
A main while True loop with lots of sleeps is exactly the kind of code you want to write asynchronously. Convert this:
while True:
    task_1()  # takes n ms
    sleep(0.2)
    task_2()  # takes n ms
    sleep(0.4)
into this:
async def task_1():
    while True:
        stuff()
        await asyncio.sleep(0.6)


async def task_2():
    while True:
        stuff()
        await asyncio.sleep(0.01)
        other_stuff()
        await asyncio.sleep(0.8)


loop = asyncio.get_event_loop()
loop.create_task(task_1())
loop.create_task(task_2())
...
loop.run_forever()
This is the most work, but it is almost certain that your current code will be better written, clearer, easier to maintain and easier to develop if written as a bunch of coroutines. If you do this the problem goes away: with cooperative multitasking you tell the code when to yield, so sharing state is generally pretty easy. By not awaiting anything in between getting and using a state var you prevent race conditions: no need for any kind of thread-safe var.
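As a small illustration of that last point, here is a sketch (assuming a shared state dict; the names are made up) where one coroutine updates state and another reads it, with no locks needed because neither awaits between reading and writing the variable:
import asyncio

state = {"return_value": 0}


async def updater():
    while True:
        # read-modify-write with no await in between: no other coroutine
        # can run in the middle, so no lock is required
        state["return_value"] += 1
        await asyncio.sleep(0.1)


async def reporter():
    while True:
        print("return_value =", state["return_value"])
        await asyncio.sleep(1.0)


async def main():
    await asyncio.gather(updater(), reporter())


asyncio.run(main())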
Run your asyncio loop in a thread
Leave your current loop intact, but run your asyncio loop in a thread (or process) with either threading or multiprocessing. Expose some kind of thread-safe variable to allow the background thread to change state, or transition to a (thread-safe) messaging paradigm, where the SSH thread emits messages into a queue which your main loop handles in its own time (a message could be something like ("a", 5), which would be handled by doing something like state_dict[msg[0]] = msg[1] for everything in the queue).
If you want to go this way, have a look at the multiprocessing and/or threading docs for examples of the right ways to pass variables or messages between threads. Note that this version will likely be less performant than a pure asyncio solution, particularly if your code is mostly sleeping in the main loop anyhow.
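A rough sketch of this option, assuming the start_server() coroutine from the question and a plain queue.Queue as the message channel (command_queue is a made-up name):
import asyncio
import queue
import threading
import time

command_queue = queue.Queue()  # thread-safe channel from the ssh thread to the main loop
state = {"is_open": True, "return_value": 0}


def run_ssh_loop():
    # run the asyncio/asyncssh server in a background thread with its own loop
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(start_server())  # start_server() as in the question
    loop.run_forever()


threading.Thread(target=run_ssh_loop, daemon=True).start()

# the original synchronous main loop stays as it is
while True:
    try:
        key, value = command_queue.get_nowait()  # e.g. ("a", 5) put there by handle_client
        state[key] = value
    except queue.Empty:
        pass
    print(state["return_value"])
    time.sleep(0.1)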
Run your synchronous code in a thread, and have asyncio in the foreground
As #MisterMiyagi points out, asyncio has loop.run_in_executor() for running blocking code in an executor (a thread or process pool). It's more generally used to run the odd blocking call without tying up the whole loop, but you can run your whole main loop in it. The same concerns about some kind of thread-safe variable or message sharing apply. This has the advantage (as #MisterMiyagi points out) of keeping asyncio where it expects to be. I have a few projects which use background asyncio threads in generally non-asyncio code (event-driven GUI code with an asyncio thread interacting with custom hardware over USB). It can be done, but you do have to be careful about how you write it.
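A sketch of this option, again reusing start_server() from the question and assuming the old while True body (which prints the question's return_value) has been moved into a function called main_loop:
import asyncio
import time


def main_loop():
    # the original blocking while True loop, unchanged
    while True:
        print(return_value)
        time.sleep(0.1)


async def main():
    await start_server()  # start_server() as in the question
    loop = asyncio.get_running_loop()
    # run the blocking loop in the default thread pool so asyncio keeps running
    await loop.run_in_executor(None, main_loop)


asyncio.run(main())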
Note btw that if you do decide to use multiple threads, message-passing (with a queue) is usually easier than directly sharing variables.
I am running a python app where I for various reasons have to host my program on a server in one part of the world and then have my database in another.
I tested via a simple script: from my home, which is in a neighboring country to the database server, the time to write and retrieve a row from the database is about 0.035 seconds (which is a nice speed imo), compared to 0.16 seconds when my Python server on the other side of the world performs the same action.
This is an issue as I am trying to keep my python app as fast as possible so I was wondering if there is a smart way to do this?
As I am running my code synchronously, my program waits every time it has to write to the db, which is about 3 times a second, so the time adds up. Is it possible to run the connection to the database in a separate thread or something, so it doesn't halt the whole program while it sends data to the database? Or can this be done using asyncio (I have no experience with async code)?
I am really struggling figuring out a good way to solve this issue.
In advance, many thanks!
Yes, you can create a thread that does the writes in the background. In your case, it seems reasonable to have a queue where the main thread puts things to be written and the db thread gets and writes them. The queue can have a maximum depth so that when too much stuff is pending, the main thread waits. You could also do something different, like dropping writes that arrive too fast. Or use a db with synchronization and write to a local copy. You may also have an opportunity to speed up the writes a bit by committing several at once.
This is a sketch of a worker thread
import threading
import queue


class SqlWriterThread(threading.Thread):
    def __init__(self, db_connect_info, maxsize=8):
        super().__init__()
        self.db_connect_info = db_connect_info
        self.q = queue.Queue(maxsize)
        # TODO: Can expose q.put directly if you don't need to
        # intercept the call
        # self.put = q.put
        self.start()

    def put(self, statement):
        print(f"DEBUG: Putting\n{statement}")
        self.q.put(statement)

    def run(self):
        db_conn = None
        while True:
            # get all the statements you can, waiting on the first
            statements = [self.q.get()]
            try:
                while True:
                    statements.append(self.q.get(block=False))
            except queue.Empty:
                pass
            try:
                # early exit before connecting if channel is closed.
                if statements[0] is None:
                    return
                if not db_conn:
                    db_conn = do_my_sql_connect()
                try:
                    print("Debug: Executing\n", "--------\n".join(f"{id(s)} {s}" for s in statements))
                    # todo: need to detect closed connection, then reconnect and restart loop
                    cursor = db_conn.cursor()
                    for statement in statements:
                        if statement is None:
                            return
                        cursor.execute(*statement)
                finally:
                    db_conn.commit()
            finally:
                for _ in statements:
                    self.q.task_done()


sql_writer = SqlWriterThread(('user', 'host', 'credentials'))
sql_writer.put(('execute some stuff',))
Currently, I have a list of URLs to grab content from, and I am doing it serially. I would like to change it to grabbing them in parallel. This is pseudocode. I would like to ask: is the design sound? I understand that .start() starts the thread; however, my database is not updated. Do I need to use q.get()? Thanks.
import threading
import Queue

q = Queue.Queue()

def do_database(url):
    """ grab url then input to database """
    webdata = grab_url(url)
    try:
        insert_data_into_database(webdata)
    except:
        ....
    else:
        < do I need to do anything with the queue after each db operation is done? >

def put_queue(q, url):
    q.put(do_database(url))

for myfiles in currentdir:
    url = myfiles + some_other_string
    t = threading.Thread(target=put_queue, args=(q, url))
    t.daemon = True
    t.start()
It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import Queue

NUM_THREADS = 5  # whatever

q = Queue.Queue()
END_OF_DATA = object()  # a unique object

class Worker(threading.Thread):
    def run(self):
        while True:
            url = q.get()
            if url is END_OF_DATA:
                break
            webdata = grab_url(url)
            try:
                # Does your database support concurrent updates
                # from multiple threads?  If not, need to put
                # this in a "with some_global_mutex:" block.
                insert_data_into_database(webdata)
            except:
                # ....
                pass

threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for myfiles in currentdir:
    url = myfiles + some_other_string
    q.put(url)

# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
    q.put(END_OF_DATA)

# Shut down cleanly.  `daemon` is way overused.
for t in threads:
    t.join()
You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).
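To make that concrete, here is a rough, untested sketch using Twisted's Agent; insert_data_into_database, currentdir and some_other_string are the placeholders from your question, and the exact API may differ between Twisted versions:
from twisted.internet import reactor
from twisted.python import log
from twisted.web.client import Agent, readBody

agent = Agent(reactor)


def fetch(url):
    d = agent.request(b"GET", url.encode("utf-8"))
    d.addCallback(readBody)                   # fires with the response body
    d.addCallback(insert_data_into_database)  # populate the database as results come in
    d.addErrback(log.err)                     # report failures instead of dropping them
    return d


for myfiles in currentdir:
    fetch(myfiles + some_other_string)

reactor.run()  # single thread; the event loop multiplexes all requests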
For the DB, you have to commit before your changes become effective. But committing after every insert is not optimal; committing after bulk changes gives much better performance.
For parallelism, Python isn't really built for it. For your use case, I suppose using Python with gevent would be a painless solution.
Here is a much more efficient pseudo implementation FYI:
import gevent
from gevent.monkey import patch_all
patch_all()  # to use with urllib, etc
from gevent.queue import Queue


def web_worker(q, url):
    grab_something
    q.put(result)


def db_worker(q):
    buf = []
    while True:
        buf.append(q.get())
        if len(buf) > 20:
            insert_stuff_in_buf_to_db
            db_commit
            buf = []


def run(urls):
    q = Queue()
    gevent.spawn(db_worker, q)
    for url in urls:
        gevent.spawn(web_worker, q, url)


run(urls)
Plus, since this implementation is totally single-threaded, you can safely manipulate shared data between workers, like the queue, db connection, global variables, etc.
The problem I've got right now concerns this chat client I've been trying to get working for some days now. It's supposed to be an upgrade of my original chat client, which could only reply to people if it received a message first.
So after asking around and doing some research, I decided to use select.select to handle my client.
Problem is it has the same problem as always.
*The loop gets stuck on receiving and won't complete until it receives something*
Here's what I wrote so far:
import select
import sys  # because why not?
import threading
import queue
from socket import socket, AF_INET, SOCK_STREAM

print("New Chat Client Using Select Module")

HOST = input("Host: ")
PORT = int(input("Port: "))
s = socket(AF_INET, SOCK_STREAM)
print("Trying to connect....")
s.connect((HOST, PORT))
s.setblocking(0)
# Not including setblocking(0) because select handles that.
print("You just connected to", HOST,)

# Lets now try to handle the client a different way!
while True:
    # Attempting to create a few threads
    Reading_Thread = threading.Thread(None, s)
    Reading_Thread.start()
    Writing_Thread = threading.Thread()
    Writing_Thread.start()

    Incoming_data = [s]
    Exportable_data = []
    Exceptions = []
    User_input = input("Your message: ")
    rlist, wlist, xlist = select.select(Incoming_data, Exportable_data, Exceptions)
    if User_input == True:
        Exportable_data += [User_input]
You're probably wondering why I've got threading and queues in there.
That's because people told me I could solve the problem by using threading and queues, but after reading the documentation and looking for video tutorials or examples that matched my case, I still don't know how I can use them to make my client work.
Could someone please help me out here? I just need to find a way to let the client enter messages as much as they'd like without waiting for a reply. This is just one of the ways I am trying to do it.
Normally you'd create a function in which your while True loop runs and receives the data, writing it to some buffer or queue to which your main thread has access.
You'd need to synchronize access to this queue so as to avoid data races.
I'm not too familiar with Python's threading API, however creating a function which runs in a thread can't be that hard. Lemme find an example.
It turns out you can create a class that derives from threading.Thread and override its run method. Then you can create an instance of your class and start the thread that way.
import threading
import time


class WorkerThread(threading.Thread):
    def run(self):
        while True:
            print('Working hard')
            time.sleep(0.5)


def runstuff():
    worker = WorkerThread()
    worker.start()  # start thread here, which will call run()
You can also use a simpler, lower-level API: define a function and call _thread.start_new_thread(fun, args) on it (the Python 2 thread module is named _thread in Python 3), which will run that function in a thread.
import _thread


def fun():
    while True:
        pass  # do stuff


_thread.start_new_thread(fun, ())  # run in a thread
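Putting those pieces together for your chat client, here is a minimal sketch (assuming the same HOST, PORT and line-oriented protocol as your code): a receiver thread blocks on recv and pushes incoming text onto a queue, while the main loop prints whatever has arrived before asking for the next message.
import queue
import threading
from socket import socket, AF_INET, SOCK_STREAM

incoming = queue.Queue()


def receiver(sock, q):
    # runs in the background, so blocking on recv never freezes the main loop
    while True:
        data = sock.recv(1024)
        if not data:
            break
        q.put(data.decode("utf-8"))


s = socket(AF_INET, SOCK_STREAM)
s.connect((HOST, PORT))  # HOST and PORT as read in your script

threading.Thread(target=receiver, args=(s, incoming), daemon=True).start()

while True:
    # drain anything that arrived while we were typing
    try:
        while True:
            print(incoming.get_nowait())
    except queue.Empty:
        pass
    msg = input("Your message: ")
    if msg:
        s.send(msg.encode("utf-8"))
Note that input() still blocks, so new messages only show up the next time around the loop; that is usually acceptable for a simple console client.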