I am working on a school project. I set up some iptables rules that log INPUT and OUTPUT connections. My goal is to read these logs line by line, parse them, and find out which process (and PID) is responsible for each connection.
My problem starts when I use psutil to match an (ip, port) tuple to the corresponding PID. iptables writes log lines to the file very quickly, roughly one every 1x10^-6 seconds, and my Python script reads lines just as fast. But when I use the following code:
def get_proc(src: str, spt: str, dst: str, dpt: str) -> str:
    proc_info = ""
    if not (src and spt and dst and dpt):
        return proc_info
    for proc in psutil.process_iter(["pid", "name"]):
        for conn in proc.connections(kind="all"):
            if flag.is_set():
                return proc_info
            if not all([
                hasattr(conn.laddr, "ip"), hasattr(conn.laddr, "port"),
                hasattr(conn.raddr, "ip"), hasattr(conn.raddr, "port"),
            ]):
                continue
            if not all([
                conn.laddr.ip == src, conn.laddr.port == int(spt),
                conn.raddr.ip == dst, conn.raddr.port == int(dpt),
            ]):
                continue
            return f"pid={proc.pid},name={proc.name()}"
    return proc_info
psutil finishes its job in about 1x10^-3 seconds, which is 10^3 times slower than reading a line. In other words, while get_proc runs once, about 1000 new lines are read. This slowness quickly becomes a problem once 1x10^6 lines have been read, because to find the PID I need to run this function as soon as the log line is received.
I thought of using multithreading, but as far as I understand it won't solve my problem, because every lookup still has the same latency.
I haven't done much coding so far because I still can't find an algorithm to use. That's why there is no more code here.
How can I solve this problem, with or without multithreading? I can't speed up the execution of psutil, but I believe there must be better approaches.
Edit
Code part for reading logs from iptables.log:
flag = threading.Event()

def stop(signum, _frame):
    """
    Tell everything to stop themselves.
    :param signum: The captured signal number.
    :param _frame: No use.
    """
    if flag.is_set():
        return
    sys.stderr.write(f"Signal {signum} received.")
    flag.set()

signal.signal(signal.SIGINT, stop)

def receive_logs(file, queue__):
    global CURSOR_POSITION
    with open(file, encoding="utf-8") as _f:
        _f.seek(CURSOR_POSITION)
        while not flag.is_set():
            line = re.sub(r"[\[\]]", "", _f.readline().rstrip())
            if not line:
                continue
            # If all goes okay do some parsing...
            # .
            # .
            queue__.put_nowait((nettup, additional_info))
            CURSOR_POSITION = _f.tell()
Here is an approach that may help a bit. As I've mentioned in comments, the issue cannot be entirely avoided unless you change to a better approach entirely.
The idea here is to scan the list of processes not once per connection but for all connections that have arrived since the last scan. Since checking connections can be done with a simple hash table lookup in O(1) time, we can process messages much faster.
I chose to go with a simple 1-producer-1-consumer multithreading approach. I think this will work fine because most time is spent in system calls, so Python's global interpreter lock (GIL) is less of an issue. But that requires testing. Possible variations:
Use no multithreading, instead read incoming logs nonblocking, then process what you've got
Swap the threading module and queue for multiprocessing module
Use multiple consumer threads and maybe batch block sizes to have multiple scans through the process list in parallel
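For the first variation, a minimal single-threaded sketch could look like the following. It assumes the log is an ordinary text file that another process appends to; `read_available_lines` is a hypothetical helper name, not part of the posted code. It returns every complete line that can be read right now and leaves a half-written last line for the next call, so a whole batch can then be matched against one scan of the process list.

```python
import io


def read_available_lines(stream):
    """Return every complete line currently readable from `stream`.

    A trailing fragment without a newline is not consumed: the stream is
    rewound to the start of that fragment so a later call can pick it up
    once the writer has finished the line.
    """
    lines = []
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line:
            break  # nothing more to read right now
        if not line.endswith("\n"):
            stream.seek(pos)  # partial line: rewind and retry later
            break
        lines.append(line.rstrip("\n"))
    return lines


# Stand-in for an open log file; a real caller would pass the file object.
demo = io.StringIO("first entry\nsecond entry\nhalf-writ")
batch = read_available_lines(demo)
```

The caller would loop over this, sleeping briefly whenever the returned batch is empty.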
import psutil
import queue
import threading

def receive_logs(consumer_queue):
    """Placeholder for actual code reading iptables log"""
    for connection in log:
        nettup = (connection.src, int(connection.spt),
                  connection.dst, int(connection.dpt))
        additional_info = connection.additional_info
        consumer_queue.put((nettup, additional_info))
The log reading is not part of the posted code, so this is just some placeholder.
Now we consume all queued connections in a second thread:
def get_procs(producer_queue):
    # 1. Construct a set of connections to search for
    # Blocks until at least one available
    nettup, additional_info = producer_queue.get()
    connections = {nettup: additional_info}
    try:  # read as many as possible
        while True:
            nettup, additional_info = producer_queue.get_nowait()
            connections[nettup] = additional_info
    except queue.Empty:
        pass
    # 2. Scan the process list once for the whole batch
    found = []
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            proc_connections = proc.connections(kind="all")
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # process exited or may not be inspected
        for conn in proc_connections:
            try:
                src = conn.laddr.ip
                spt = conn.laddr.port
                dst = conn.raddr.ip
                dpt = conn.raddr.port
            except AttributeError:  # not an IP address
                continue
            nettup = (src, spt, dst, dpt)
            if nettup in connections:
                additional_info = connections[nettup]
                found.append((proc, nettup, additional_info))
    found_connections = {nettup for _, nettup, _ in found}
    lost = [(nettup, additional_info)
            for nettup, additional_info in connections.items()
            if nettup not in found_connections]
    return found, lost
I don't really understand parts of the posted code in the question, such as the if flag.is_set(): return proc_info part so I just left those out. Also, I got rid of some of the less pythonic and potentially slow parts such as hasattr(). Adapt as needed.
Now we tie it all together by calling the consumer repeatedly and starting both threads:
def consume(producer_queue):
    while True:
        found, lost = get_procs(producer_queue)
        for proc, (src, spt, dst, dpt), additional_info in found:
            print(f"pid={proc.pid},name={proc.name()}")

def main():
    producer_consumer_queue = queue.SimpleQueue()
    producer = threading.Thread(
        target=receive_logs, args=(producer_consumer_queue,))
    consumer = threading.Thread(
        target=consume, args=(producer_consumer_queue,))
    consumer.start()
    producer.start()
    consumer.join()
    producer.join()
Related
I am running a Python app where, for various reasons, I have to host my program on a server in one part of the world and my database in another.
I tested with a simple script. From my home, which is in a country neighboring the database server, writing and retrieving a row takes about 0.035 seconds (a nice speed, in my opinion), compared to 0.16 seconds when my Python server on the other side of the world performs the same action.
This is an issue as I am trying to keep my python app as fast as possible so I was wondering if there is a smart way to do this?
As I am running my code synchronously my program is waiting every time it has to write to the db, which is about 3 times a second so the time adds up. Is it possible to run the connection to the database in a separate thread or something, so it doesn't halt the whole program while it tries to send data to the database? Or can this be done using asyncio (I have no experience with async code)?
I am really struggling figuring out a good way to solve this issue.
In advance, many thanks!
Yes, you can create a thread that does the writes in the background. In your case, it seems reasonable to have a queue where the main thread puts things to be written and the db thread gets and writes them. The queue can have a maximum depth so that when too much stuff is pending, the main thread waits. You could also do something different like drop things that happen too fast. Or, use a db with synchronization and write a local copy. You also may have an opportunity to speed up the writes a bit by committing multiple at once.
This is a sketch of a worker thread
import threading
import queue

class SqlWriterThread(threading.Thread):

    def __init__(self, db_connect_info, maxsize=8):
        super().__init__()
        self.db_connect_info = db_connect_info
        self.q = queue.Queue(maxsize)
        # TODO: Can expose q.put directly if you don't need to
        # intercept the call
        # self.put = self.q.put
        self.start()

    def put(self, statement):
        print(f"DEBUG: Putting\n{statement}")
        self.q.put(statement)

    def run(self):
        db_conn = None
        while True:
            # get all the statements you can, waiting on first
            statements = [self.q.get()]
            try:
                while True:
                    # block=False belongs to get(), not to append()
                    statements.append(self.q.get(block=False))
            except queue.Empty:
                pass
            try:
                # early exit before connecting if channel is closed.
                if statements[0] is None:
                    return
                if not db_conn:
                    db_conn = do_my_sql_connect()
                try:
                    print("Debug: Executing\n", "--------\n".join(f"{id(s)} {s}" for s in statements))
                    # todo: need to detect closed connection, then reconnect and restart loop
                    cursor = db_conn.cursor()
                    for statement in statements:
                        if statement is None:
                            return
                        cursor.execute(*statement)
                finally:
                    # commit on the connection: DB-API cursors have no commit()
                    db_conn.commit()
            finally:
                for _ in statements:
                    self.q.task_done()

sql_writer = SqlWriterThread(('user', 'host', 'credentials'))
sql_writer.put(('execute some stuff',))
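The "drop things that happen too fast" option mentioned above could be sketched with a bounded queue and a non-blocking put. This is only an illustration: `offer` is a hypothetical helper name, and the queue size of 2 is arbitrary.

```python
import queue


def offer(q, item):
    """Try to enqueue `item` without blocking; report whether it was accepted."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False  # queue is at maxsize: the caller decides what to do


pending = queue.Queue(maxsize=2)
accepted = [offer(pending, n) for n in range(3)]
print(accepted)  # [True, True, False]
```

The main thread never waits here; whether silently dropping a write is acceptable depends on the data.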
Trying to fix a friend's code where the loop doesn't continue until a for loop is satisfied. I suspect something is wrong with the readbuffer. Basically, we want the while loop to loop continuously, but run the for loop whenever it has something to process. If someone could help me understand what is happening with readbuffer and temp, I'd be very thankful.
Here's the snippet:
s = openSocket()
joinRoom(s)
readbuffer = ""
while True:
    readbuffer = readbuffer + s.recv(1024)
    temp = string.split(readbuffer, "\n")
    readbuffer = temp.pop()
    for line in temp:
        user = getUser(line)
        message = getMessage(line)
Based on my understanding of your question, you want to execute the for loop while continuing to receive packets.
I'm not sure what getUser and getMessage do. If they perform I/O operations (reading/writing files, DB I/O, send/recv, ...), you can use Python's async features to write an asynchronous program. (See: https://docs.python.org/3/library/asyncio-task.html)
I assume, however, that you are just extracting a single element from line, which involves no I/O. In that case, async won't help. If getUser and getMessage really take a lot of CPU time, you can put the for loop in a new thread so that the string operations don't block the receive loop. (See: https://docs.python.org/3/library/threading.html)
from threading import Thread

def getUserProfile(lines, profiles):
    for line in lines:
        user = getUser(line)
        message = getMessage(line)
        profiles.append((user, message))

profiles = []
threads = []
s = openSocket()
joinRoom(s)
readbuffer = ""
while True:
    readbuffer += s.recv(1024).decode('utf-8')
    lines = readbuffer.split('\n')
    # the last element is an incomplete line; keep it for the next recv
    readbuffer = lines.pop()
    t = Thread(target=getUserProfile, args=(lines, profiles))
    t.start()  # start() runs the target in a new thread; run() would run it inline
    threads.append(t)

# If somehow the loop may be interrupted,
# These two lines should be added to wait for all threads to finish
for th in threads:
    th.join()  # will block main thread until all threads are terminated
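On the readbuffer/temp part of the question: recv can return data that ends in the middle of a line, so the code splits on newlines and carries the unfinished tail over to the next iteration. A small self-contained sketch of just that step (the function name `split_complete_lines` and the sample chat lines are made up for illustration):

```python
def split_complete_lines(buffer, chunk):
    """Append `chunk` to `buffer`; return (complete_lines, leftover)."""
    *lines, leftover = (buffer + chunk).split("\n")
    return lines, leftover


# The first chunk ends in the middle of bob's message.
lines, rest = split_complete_lines("", "alice: hi\nbob: he")
# lines == ["alice: hi"], rest == "bob: he"

# The next chunk completes it.
lines2, rest2 = split_complete_lines(rest, "llo\n")
# lines2 == ["bob: hello"], rest2 == ""
```

In the original snippet, temp plays the role of the split result and temp.pop() removes the leftover, which becomes the new readbuffer.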
Update
Of course, this is not a typical way to solve this issue; it's just easier to understand for beginners and good enough for simple assignments.
A better way is to use something like a Future: make send/recv asynchronous and register a callback that receives the data once it arrives. If you want to move a heavy CPU workload to another thread that runs an endless loop (routine), just create the Thread in the callback or somewhere else, depending on your architecture design.
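As a rough illustration of the Future-plus-callback idea, here is a sketch using the standard library's concurrent.futures rather than a hand-written future class. `parse_line` and the sample lines are hypothetical stand-ins for getUser/getMessage and real socket data.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_line(line):
    """Stand-in for getUser/getMessage: split 'user: message'."""
    user, _, message = line.partition(":")
    return user.strip(), message.strip()


profiles = []


def on_parsed(future):
    # Called once the future has a result (or an exception).
    profiles.append(future.result())


with ThreadPoolExecutor(max_workers=2) as pool:
    for line in ["alice: hi", "bob: hello"]:
        pool.submit(parse_line, line).add_done_callback(on_parsed)
# Leaving the with-block waits for all submitted work to finish.
```

The callback may run in a worker thread, so anything it touches should be safe to use from there; appending to a list is.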
I implemented a lightweight distributed computing framework for my network programming course. And I wrote my own future class for the project if anyone is interested.
I'm new to python and I'm having trouble understanding how threading works. By skimming through the documentation, my understanding is that calling join() on a thread is the recommended way of blocking until it completes.
To give a bit of background, I have 48 large csv files (multiple GB) which I am trying to parse in order to find inconsistencies. The threads share no state. This can be done in a single thread in a reasonable amount of time for a one-off, but I am trying to do it concurrently as an exercise.
Here's a skeleton of the file processing:
def process_file(data_file):
    with open(data_file) as f:
        print "Start processing {0}".format(data_file)
        line = f.readline()
        while line:
            # logic omitted for brevity; can post if required
            # pretty certain it works as expected, single 'thread' works fine
            line = f.readline()
        print "Finished processing file {0} with {1} errors".format(data_file, error_count)

def process_file_callable(data_file):
    try:
        process_file(data_file)
    except:
        print >> sys.stderr, "Error processing file {0}".format(data_file)
And the concurrent bit:
def partition_list(l, n):
    """ Yield successive n-sized partitions from a list.
    """
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

partitions = list(partition_list(data_files, 4))
for partition in partitions:
    threads = []
    for data_file in partition:
        print "Processing file {0}".format(data_file)
        t = Thread(name=data_file, target=process_file_callable, args = (data_file,))
        threads.append(t)
        t.start()
    for t in threads:
        print "Joining {0}".format(t.getName())
        t.join(5)
    print "Joined the first chunk of {0}".format(map(lambda t: t.getName(), threads))
I run this as:
python -u datautils/cleaner.py > cleaner.out 2> cleaner.err
My understanding is that join() should block the calling thread waiting for the thread it's called on to finish, however the behaviour I'm observing is inconsistent with my expectation.
I never see errors in the error file, but I also never see the expected log messages on stdout.
The parent process does not terminate unless I explicitly kill it from the shell. If I check how many prints I have for Finished ... it's never the expected 48, but somewhere between 12 and 15. However, having run this single-threadedly, I can confirm that the multithreaded run is actually processing everything and doing all the expected validation, only it does not seem to terminate cleanly.
I know I must be doing something wrong, but I would really appreciate if you can point me in the right direction.
I can't tell where the mistake in your code is, but I can recommend refactoring it a little.
First of all, threads in CPython do not run Python code in parallel. Because of the Global Interpreter Lock (GIL), only one thread executes Python bytecode at a time, although blocking I/O does release the lock. That's why I recommend using the multiprocessing module:
from multiprocessing import Pool, cpu_count

pool = Pool(cpu_count())  # cpu_count must be called; Pool expects a number
for partition in partition_list(data_files, 4):
    res = pool.map(process_file_callable, partition)
    print res
Secondly, you are reading the file in a non-Pythonic way:
with open(...) as f:
    line = f.readline()
    while line:
        ... # do(line)
        line = f.readline()
Here is the Pythonic way:
with open(...) as f:
    for line in f:
        ... # do(line)
This is memory efficient, fast, and leads to simple code. (c) PyDoc
By the way, I have only one hypothesis about what could go wrong in the multithreaded version: the app becomes slower, because unordered access to a hard disk drive is significantly slower than ordered access. You can check this hypothesis using iostat or htop, if you are using Linux.
If your app does not finish its work and shows no activity in a process monitor (neither CPU nor disk is active), you have some kind of deadlock or blocked access to a shared resource.
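As a variation on the pool idea above, the standard library's concurrent.futures can replace the manual partitioning and joining. This is a sketch under the assumption that the work is mostly I/O-bound, so threads are good enough; `run_all` and `check_file` are hypothetical names, and swapping in ProcessPoolExecutor would give the multiprocessing flavour.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(func, items, max_workers=4):
    """Run func(item) for every item, at most max_workers at a time.

    Returns (results, errors): two dicts keyed by item, so a failure in
    one file is recorded instead of being lost in a worker thread.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(func, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                errors[item] = exc
    return results, errors


def check_file(name):
    """Stand-in for process_file: pretend one input is corrupted."""
    if name == "c.csv":
        raise ValueError("jumbled lines")
    return len(name)


done, failed = run_all(check_file, ["a.csv", "b.csv", "c.csv"])
```

Note that a worker stuck in an infinite loop would still block the with-block from exiting; a pool does not fix logic that never returns.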
Thanks everybody for your input and sorry for not replying sooner - I'm working on this on and off as a hobby project.
I've managed to write a simple example that proves it was my bad:
from itertools import groupby
from threading import Thread
from random import randint
from time import sleep

for key, partition in groupby(range(1, 50), lambda k: k//10):
    threads = []
    for idx in list(partition):
        thread_name = 'thread-%d' % idx
        t = Thread(name=thread_name, target=sleep, args=(randint(1, 5),))
        threads.append(t)
        print 'Starting %s' % t.getName()
        t.start()
    for t in threads:
        print 'Joining %s' % t.getName()
        t.join()
    print 'Joined the first group of %s' % map(lambda t: t.getName(), threads)
The reason it was failing initially: the while loop in the 'logic omitted for brevity' part worked fine on good input, but some of the input files being fed in were corrupted (had jumbled lines) and the logic went into an infinite loop on them. This is the reason some threads were never joined. The timeout for the join made sure that they were all started, but some never finished, hence the inconsistency between 'starting' and 'joining'. The other fun fact was that the corruption was on the last line, so all the expected data was being processed.
Thanks again for your advice - the comment about processing files in a while instead of the pythonic way pointed me in the right direction, and yes, threading behaves as expected.
TL;DR: Getting different results after running code with threading and multiprocessing and single threaded. Need guidance on troubleshooting.
Hello, I apologize in advance if this may be a bit too generic, but I need a bit of help troubleshooting an issue and I am not sure how best to proceed.
Here is the story; I have a bunch of data indexed into a Solr Collection (~250m items), all items in that collection have a sessionid. Some items can share the same session id. I am combing through the collection to extract all items that have the same session, massage the data a bit and spit out another JSON file for indexing later.
The code has two main functions:
proc_day - accepts a day and processes all the sessions for that day
and
proc_session - does everything that needs to happen for a single session.
Multiprocessing is implemented on proc_day, so each day is processed by a separate process; the proc_session function can be run with threads. Below is the code I am using for threading/multiprocessing. It accepts a function, a list of arguments and the number of threads/processes. It then creates a queue from the input args, creates the processes/threads and lets them work through it. I am not posting the actual processing code, since it generally runs fine single-threaded without any issues, but I can post it if needed.
autoprocs.py
import sys
import logging
from multiprocessing import Process, Queue,JoinableQueue
import time
import multiprocessing
import os

def proc_proc(func,data,threads,delay=10):
    if threads < 0:
        return
    q = JoinableQueue()
    procs = []
    for i in range(threads):
        thread = Process(target=proc_exec,args=(func,q))
        thread.daemon = True
        thread.start()
        procs.append(thread)
    for item in data:
        q.put(item)
    logging.debug(str(os.getpid()) + ' *** Processes started and data loaded into queue waiting')
    s = q.qsize()
    while s > 0:
        logging.info(str(os.getpid()) + " - Proc Queue Size is:" + str(s))
        s = q.qsize()
        time.sleep(delay)
    for p in procs:
        logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
        p.join(1)
    logging.debug(str(os.getpid()) + ' - *** Main Proc waiting')
    q.join()
    logging.debug(str(os.getpid()) + ' - *** Done')

def proc_exec(func,q):
    p = multiprocessing.current_process()
    logging.debug(str(os.getpid()) + ' - Starting:{},{}'.format(p.name, p.pid))
    while True:
        d = q.get()
        try:
            logging.debug(str(os.getpid()) + " - Starting to Process {}".format(d))
            func(d)
            sys.stdout.flush()
            logging.debug(str(os.getpid()) + " - Marking Task as Done")
            q.task_done()
        except:
            logging.error(str(os.getpid()) + " - Exception in subprocess execution")
            logging.error(sys.exc_info()[0])
    logging.debug(str(os.getpid()) + 'Ending:{},{}'.format(p.name, p.pid))
autothreads.py:
import threading
import logging
import time
from queue import Queue

def thread_proc(func,data,threads):
    if threads < 0:
        return "Thead Count not specified"
    q = Queue()
    for i in range(threads):
        thread = threading.Thread(target=thread_exec,args=(func,q))
        thread.daemon = True
        thread.start()
    for item in data:
        q.put(item)
    logging.debug('*** Main thread waiting')
    s = q.qsize()
    while s > 0:
        logging.debug("Queue Size is:" + str(s))
        s = q.qsize()
        time.sleep(1)
    logging.debug('*** Main thread waiting')
    q.join()
    logging.debug('*** Done')

def thread_exec(func,q):
    while True:
        d = q.get()
        #logging.debug("Working...")
        try:
            func(d)
        except:
            pass
        q.task_done()
I am running into problems with validating data after python runs under different multiprocessing/threading configs. There is a lot of data, so I really need to get multiprocessing working. Here are the results of my test yesterday.
Configuration                                           Days Processed  Sessions Found  Sessions Processed  Files    Data Output
Multiprocessing only (10 procs)                         30              3,507,475       3,514,496           162,140  1.9G
Multiprocessing and multithreading (10 procs, 10 thr.)  30              3,356,362       3,272,402           424,005  2.2GB
Threading only (10 threads)                             31              3,595,263       3,595,263           733,664  3.3GB
Single process, no threading                            31              3,595,263       3,595,263           162,190  1.9GB
These counts were gathered by grepping and counting entries in the log files (1 per main process). The first thing that jumps out is that the days processed don't match. However, I manually checked the log files and it looks like a log entry was missing; there are follow-on log entries indicating that the day was actually processed. I have no idea why it was omitted.
I really don't want to write more code to validate this code, just seems like a terrible waste of time, is there any alternative?
I gave some general hints in the comments above. I think there are multiple problems with your approach, at very different levels of abstraction. You are also not showing all code of relevance.
The issue might very well be
in the method you are using to read from solr or in preparing read data before feeding it to your workers.
in the architecture you have come up with for distributing the work among multiple processes.
in your logging infrastructure (as you have pointed out yourself).
in your analysis approach.
You have to go through all of these points, and given the complexity of the issue, probably nobody here will be able to identify the exact problems for you.
Regarding points (3) and (4):
If you are not sure about the completeness of your log files, you should perform the analysis based on the payload output of your processing engine. What I am trying to say: the log files probably are just a side product of your data processing. The primary product is the thing you should analyze. Of course it is also important to get your logs right. But these two problems should be treated independently.
My contribution regarding point (2) in the list above:
What is especially suspicious about your multiprocessing-based solution is your way to wait for the workers to finish. You seem not to be sure by which method you should wait for your workers, so you apply three different methods:
First, you are monitoring the size of the queue in a while loop and wait for it to become 0. This is a non-canonical approach, which might actually work.
Secondly, you join() your processes in a weird way:
for p in procs:
    logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
    p.join(1)
Why do you define a timeout of one second here but never check whether the process actually terminated within that time frame? You should either really join a process, i.e. wait until it has terminated, or specify a timeout and, if that timeout expires before the process finishes, treat that situation specially. Your code does not distinguish these situations, so p.join(1) is like writing time.sleep(1) instead.
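To make the point about timeouts concrete: join() with a timeout returns None either way, so the caller has to ask is_alive() afterwards. The sketch below uses threads to stay short; multiprocessing.Process offers the same join/is_alive pair. `join_with_timeout` is a hypothetical helper name.

```python
import threading


def join_with_timeout(worker, timeout):
    """Wait up to `timeout` seconds; return True only if `worker` has finished."""
    worker.join(timeout)
    return not worker.is_alive()


release = threading.Event()
stuck = threading.Thread(target=release.wait)  # blocks until released
stuck.start()

finished_in_time = join_with_timeout(stuck, 0.1)  # timeout expires first
release.set()                                     # let the worker finish
finished_later = join_with_timeout(stuck, 5)
```

A caller that gets False back can then log the straggler, retry, or terminate it, instead of silently moving on.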
Thirdly, you join the queue.
So, after making sure that q.qsize() returns 0 and after waiting for another second, do you really think that joining the queue is important? Does it make any difference? One of these approaches should be enough, and you need to think about which of these criteria is most important to your problem. That is, one of these conditions should deterministically implicate the other two.
All this looks like a quick & dirty hack of a multiprocessing solution, whereas you yourself are not really sure how that solution should behave. One of the most important insights I have obtained while working on concurrency architectures: You, the architect, must be 100 % aware of how the communication and control flow works in your system. Not properly monitoring and controlling the state of your worker processes may very well be the source of the issues you are observing.
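One way to end up with a single, deterministic wait condition is to call task_done() in a finally block, so that joining the queue alone implies every item was handled, including the ones that raised. The following is a thread-based sketch of that idea, not a drop-in replacement for the posted module; `run_workers` and `risky` are made-up names.

```python
import queue
import threading


def run_workers(func, data, n_workers):
    """Process every item of `data` with `func`; return (results, errors)."""
    q = queue.Queue()
    results, errors = [], []
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()
            try:
                out = func(item)
                with lock:
                    results.append(out)
            except Exception as exc:
                with lock:
                    errors.append((item, exc))
            finally:
                q.task_done()  # always counted, even when func raised

    for _ in range(n_workers):
        threading.Thread(target=worker, daemon=True).start()
    for item in data:
        q.put(item)
    q.join()  # the only wait condition: every item was marked done
    return results, errors


def risky(n):
    if n == 2:
        raise RuntimeError("bad session")
    return n * 10


ok, failed = run_workers(risky, range(5), n_workers=3)
```

The daemon workers stay blocked on q.get() after the function returns; a longer-lived program would send them a sentinel and join them.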
I figured it out. I followed Jan-Philip's advice and started examining the output data of the multiprocess/multithreaded runs. It turned out that an object that does all the work on the data from Solr was shared among threads. I did not have any locking mechanism, so in some cases it mixed data from multiple sessions, which caused inconsistent output. I validated this by instantiating a new object for every thread, and the counts matched up. It is a bit slower, but still workable.
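A per-thread instance like the one described can be arranged with threading.local, so each thread lazily gets its own object and no lock is needed. `SessionBuilder` below is a hypothetical stand-in for the real stateful object, which was not posted.

```python
import threading


class SessionBuilder:
    """Hypothetical stateful helper that must not be shared across threads."""

    def __init__(self):
        self.rows = []

    def build(self, session_id, records):
        self.rows.clear()
        for record in records:
            self.rows.append((session_id, record))
        return list(self.rows)


_local = threading.local()


def get_builder():
    """Return this thread's own SessionBuilder, creating it on first use."""
    if not hasattr(_local, "builder"):
        _local.builder = SessionBuilder()
    return _local.builder
```

Inside proc_session, calling get_builder() instead of using a module-level object would keep each thread's intermediate state separate while still reusing the object across sessions handled by the same thread.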
Thanks
Currently, I have a list of URLs to grab content from, and I am doing it serially. I would like to change it to grab them in parallel. This is pseudocode. I would like to ask: is the design sound? I understand that .start() starts the thread; however, my database is not updated. Do I need to use q.get()? Thanks.
import threading
import Queue

q = Queue.Queue()

def do_database(url):
    """ grab url then input to database """
    webdata = grab_url(url)
    try:
        insert_data_into_database(webdata)
    except:
        ....
    else:
        < do I need to do anything with the queue after each db operation is done?>

def put_queue(q, url ):
    q.put( do_database(url) )

for myfiles in currentdir:
    url = myfiles + some_other_string
    t=threading.Thread(target=put_queue,args=(q,url))
    t.daemon=True
    t.start()
It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import Queue

NUM_THREADS = 5  # whatever
q = Queue.Queue()
END_OF_DATA = object()  # a unique object

class Worker(threading.Thread):
    def run(self):
        while True:
            url = q.get()
            if url is END_OF_DATA:
                break
            webdata = grab_url(url)
            try:
                # Does your database support concurrent updates
                # from multiple threads? If not, need to put
                # this in a "with some_global_mutex:" block.
                insert_data_into_database(webdata)
            except:
                #....
                pass  # placeholder so the sketch parses; handle or log the error here

threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for myfiles in currentdir:
    url = myfiles + some_other_string
    q.put(url)

# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
    q.put(END_OF_DATA)

# Shut down cleanly. `daemon` is way overused.
for t in threads:
    t.join()
You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).
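The same single-threaded, event-driven idea can also be sketched with the standard library's asyncio instead of Twisted (a swapped-in library, not what the answer above uses). Everything here is illustrative: `fetch` only simulates a request with a short sleep, and the URLs are placeholders.

```python
import asyncio


async def fetch(url):
    """Stand-in for a real HTTP request; just waits briefly."""
    await asyncio.sleep(0.01)
    return url, f"content of {url}"


async def crawl(urls, limit=5):
    """Fetch all urls concurrently, with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(u) for u in urls))


pages = asyncio.run(crawl(["http://a.example", "http://b.example", "http://c.example"]))
```

The database insert would happen where the results come back, in the same thread, so no locking is needed around it.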
For the DB, you have to commit before your changes take effect. But committing after every insert is not optimal; committing after a batch of changes gives much better performance.
For parallelism, Python wasn't born for this. For your use case, I suppose using Python with gevent would be a painless solution.
Here is a much more efficient pseudo-implementation, FYI:
import gevent
from gevent.monkey import patch_all
patch_all()  # to use with urllib, etc
from gevent.queue import Queue

def web_worker(q, url):
    grab_something
    q.put(result)  # gevent queues use put(), not push()

def db_worker(q):
    buf = []
    while True:
        buf.append(q.get())
        if len(buf) > 20:
            insert_stuff_in_buf_to_db
            db_commit
            buf = []

def run(urls):
    q = Queue()
    gevent.spawn(db_worker, q)
    for url in urls:
        gevent.spawn(web_worker, q, url)

run(urls)
Plus, since this implementation is entirely single-threaded, you can safely manipulate shared data between workers, such as the queue, the DB connection, global variables, etc.
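The point about committing once per batch rather than once per insert can be illustrated with the standard library's sqlite3 module. The table name, columns and URLs are made up for the example; other DB-API drivers follow the same executemany-then-commit pattern, but their details may differ.

```python
import sqlite3


def insert_batch(conn, rows):
    """Insert all rows in one transaction: one commit for the whole batch."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany("INSERT INTO pages (url, body) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, body TEXT)")
insert_batch(conn, [("http://a.example", "A"), ("http://b.example", "B")])
count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
```

A db_worker like the one above would call something like insert_batch each time its buffer fills, and once more for the remainder when the input ends.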