I'm processing a list of thousands of domain names from a DNSBL through dig, creating a CSV of URLs and IPs. This is a very time-consuming process that can take several hours. My server's DNSBL updates every fifteen minutes. Is there a way I can increase throughput in my Python script to keep pace with the server's updates?
Edit: the script, as requested.
import re
import subprocess as sp

text = open("domainslist", 'r')
text = text.read()
text = re.split("\n+", text)
file = open('final.csv', 'w')
for url in text:
    try:
        ip = sp.Popen(["dig", "+short", url], stdout = sp.PIPE)
        ip = re.split("\n+", ip.stdout.read())
        file.write(url + "," + ip[0] + "\n")
    except:
        pass
Well, it's probably the name resolution that's taking you so long. If you count that out (i.e., if somehow dig returned very quickly), Python should be able to deal with thousands of entries easily.
That said, you should try a threaded approach. That would (theoretically) resolve several addresses at the same time, instead of sequentially. You could just as well continue to use dig for that, and it should be trivial to modify my example code below for that, but, to make things interesting (and hopefully more pythonic), let's use an existing module for that: dnspython
So, install it with:
sudo pip install -f http://www.dnspython.org/kits/1.8.0/ dnspython
And then try something like the following:
import threading
from dns import resolver


class Resolver(threading.Thread):
    def __init__(self, address, result_dict):
        threading.Thread.__init__(self)
        self.address = address
        self.result_dict = result_dict

    def run(self):
        try:
            result = resolver.query(self.address)[0].to_text()
            self.result_dict[self.address] = result
        except resolver.NXDOMAIN:
            pass


def main():
    infile = open("domainlist", "r")
    intext = infile.readlines()
    threads = []
    results = {}
    for address in [address.strip() for address in intext if address.strip()]:
        resolver_thread = Resolver(address, results)
        threads.append(resolver_thread)
        resolver_thread.start()

    for thread in threads:
        thread.join()

    outfile = open('final.csv', 'w')
    outfile.write("\n".join("%s,%s" % (address, ip) for address, ip in results.iteritems()))
    outfile.close()


if __name__ == '__main__':
    main()
If that proves to start too many threads at the same time, you could try doing it in batches, or using a queue (see http://www.ibm.com/developerworks/aix/library/au-threadingpython/ for an example)
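A bounded pool is one way to cap the number of simultaneous lookups. The following is a minimal Python 3 sketch, not the code from the answer above: it swaps in the standard library's concurrent.futures.ThreadPoolExecutor and socket.gethostbyname in place of dnspython, and the names resolve_all, write_csv, lookup and max_workers are made up for illustration.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_all(domains, lookup=socket.gethostbyname, max_workers=50):
    """Resolve every domain with at most max_workers lookups in flight.

    Returns a dict mapping each domain that resolved to one address.
    """
    def safe_lookup(domain):
        try:
            return domain, lookup(domain)
        except OSError:  # socket.gaierror is a subclass of OSError
            return domain, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pairs = list(pool.map(safe_lookup, domains))
    return {domain: ip for domain, ip in pairs if ip is not None}


def write_csv(results, path="final.csv"):
    """Write one 'domain,ip' row per resolved domain."""
    with open(path, "w") as outfile:
        for domain, ip in sorted(results.items()):
            outfile.write("%s,%s\n" % (domain, ip))
```

The pool size is a guess to tune against your resolver; a value that is too high may simply move the bottleneck to the DNS server.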
The vast majority of the time here is spent in the external calls to dig, so to improve that speed, you'll need to multithread. This will allow you to run multiple calls to dig at the same time. See for example: Python Subprocess.Popen from a thread . Or, you can use Twisted ( http://twistedmatrix.com/trac/ ).
EDIT: You're correct, much of that was unnecessary.
I'd consider using a pure-Python library to do the DNS queries, rather than delegating to dig, because invoking another process can be relatively time-consuming. (Of course, looking up anything on the internet is also relatively time-consuming, so what gilesc said about multithreading still applies) A Google search for python dns will give you some options to get started with.
In order to keep pace with the server updates, one must take less than 15 minutes to execute. Does your script take 15 minutes to run? If it doesn't take 15 minutes, you're done!
I would investigate caching and diffs from previous runs in order to increase performance.
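As a rough illustration of the caching idea, the sketch below (Python 3) reuses results from the previous final.csv and resolves only the domains that are new in the current list. The helper names load_previous and plan_update are hypothetical, and it assumes the previous run wrote plain "domain,ip" rows.

```python
import csv
import os


def load_previous(path):
    """Read the previous 'domain,ip' CSV into a dict; empty if it is missing."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="") as infile:
        return {row[0]: row[1] for row in csv.reader(infile) if len(row) >= 2}


def plan_update(current_domains, previous):
    """Split the current list into cached results and domains still to resolve."""
    current = set(current_domains)
    cached = {d: previous[d] for d in current if d in previous}
    to_resolve = sorted(current - set(previous))
    return cached, to_resolve
```

Cached addresses can go stale, so in practice you would probably want to re-resolve old entries from time to time rather than trust them forever.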
I am working on a school project. I set some rules in iptables which log INPUT and OUTPUT connections. My goal is to read these logs line by line, parse them and find out which process (and PID) is responsible for each connection.
My problem starts when I use psutil to match an (ip, port) tuple with the corresponding PID. iptables writes log lines to the file very quickly, roughly one every 1x10^-6 seconds, and my Python script reads them just as fast. But when I use the following code:
def get_proc(src: str, spt: str, dst: str, dpt: str) -> str:
    proc_info = ""
    if not (src and spt and dst and dpt):
        return proc_info
    for proc in psutil.process_iter(["pid", "name"]):
        for conn in proc.connections(kind="all"):
            if flag.is_set():
                return proc_info
            if not all([
                hasattr(conn.laddr, "ip"), hasattr(conn.laddr, "port"),
                hasattr(conn.raddr, "ip"), hasattr(conn.raddr, "port"),
            ]):
                continue
            if not all([
                conn.laddr.ip == src, conn.laddr.port == int(spt),
                conn.raddr.ip == dst, conn.raddr.port == int(dpt),
            ]):
                continue
            return f"pid={proc.pid},name={proc.name()}"
    return proc_info
psutil takes about 1x10^-3 seconds per call, which is 10^3 times slower than reading a line. In other words, in the time it takes to run get_proc once, about 1000 new lines have been read. This quickly becomes a problem once 1x10^6 lines have been read, because I need to run this function as soon as each log line is received in order to find the PID.
I thought of using multithreading, but as far as I understand it won't solve my problem, because each lookup would still have the same latency.
I haven't done much coding so far because I still can't find an algorithm to use. That's why there is no more code here.
How can I solve this problem, with or without multithreading? I can't speed up psutil itself, so I believe there must be a better approach.
Edit
Code part for reading logs from iptables.log:
flag = threading.Event()


def stop(signum, _frame):
    """
    Tell everything to stop themselves.
    :param signum: The captured signal number.
    :param _frame: No use.
    """
    if flag.is_set():
        return
    sys.stderr.write(f"Signal {signum} received.")
    flag.set()


signal.signal(signal.SIGINT, stop)


def receive_logs(file, queue__):
    global CURSOR_POSITION
    with open(file, encoding="utf-8") as _f:
        _f.seek(CURSOR_POSITION)
        while not flag.is_set():
            line = re.sub(r"[\[\]]", "", _f.readline().rstrip())
            if not line:
                continue
            # If all goes okay do some parsing...
            # .
            # .
            queue__.put_nowait((nettup, additional_info))
            CURSOR_POSITION = _f.tell()
Here is an approach that may help a bit. As I've mentioned in comments, the issue cannot be entirely avoided unless you change to a better approach entirely.
The idea here is to scan the list of processes not once per connection but for all connections that have arrived since the last scan. Since checking connections can be done with a simple hash table lookup in O(1) time, we can process messages much faster.
I chose to go with a simple 1-producer-1-consumer multithreading approach. I think this will work fine because most time is spent in system calls, so Python's global interpreter lock (GIL) is less of an issue. But that requires testing. Possible variations:
Use no multithreading, instead read incoming logs nonblocking, then process what you've got
Swap the threading module and queue for multiprocessing module
Use multiple consumer threads and maybe batch block sizes to have multiple scans through the process list in parallel
import psutil
import queue
import threading


def receive_logs(consumer_queue):
    """Placeholder for actual code reading iptables log"""
    for connection in log:
        nettup = (connection.src, int(connection.spt),
                  connection.dst, int(connection.dpt))
        additional_info = connection.additional_info
        consumer_queue.put((nettup, additional_info))
The log reading is not part of the posted code, so this is just some placeholder.
Now we consume all queued connections in a second thread:
def get_procs(producer_queue):
    # 1. Construct a set of connections to search for
    # Blocks until at least one available
    nettup, additional_info = producer_queue.get()
    connections = {nettup: additional_info}
    try:  # read as many as possible
        while True:
            nettup, additional_info = producer_queue.get_nowait()
            connections[nettup] = additional_info
    except queue.Empty:
        pass
    found = []
    for proc in psutil.process_iter(["pid", "name"]):
        for conn in proc.connections(kind="all"):
            try:
                src = conn.laddr.ip
                spt = conn.laddr.port
                dst = conn.raddr.ip
                dpt = conn.raddr.port
            except AttributeError:  # not an IP address
                continue
            nettup = (src, spt, dst, dpt)
            if nettup in connections:
                additional_info = connections[nettup]
                found.append((proc, nettup, additional_info))
    found_connections = {nettup for _, nettup, _ in found}
    lost = [(nettup, additional_info)
            for nettup, additional_info in connections.items()
            if nettup not in found_connections]
    return found, lost
I don't really understand parts of the posted code in the question, such as the if flag.is_set(): return proc_info part so I just left those out. Also, I got rid of some of the less pythonic and potentially slow parts such as hasattr(). Adapt as needed.
Now we tie it all together by calling the consumer repeatedly and starting both threads:
def consume(producer_queue):
    while True:
        found, lost = get_procs(producer_queue)
        for proc, (src, spt, dst, dpt), additional_info in found:
            print(f"pid={proc.pid},name={proc.name()}")


def main():
    producer_consumer_queue = queue.SimpleQueue()
    producer = threading.Thread(
        target=receive_logs, args=(producer_consumer_queue,))
    consumer = threading.Thread(
        target=consume, args=(producer_consumer_queue,))
    consumer.start()
    producer.start()
    consumer.join()
    producer.join()
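A possible variation on the scan above, offered only as a sketch: instead of walking every process, take one system-wide snapshot and index it by endpoint tuple, so each queued connection becomes a dictionary lookup. The functions below are plain Python and the names index_by_endpoints and match_batch are hypothetical. They assume entries shaped like the ones psutil.net_connections(kind="inet") returns (laddr and raddr with ip and port fields, plus a pid that may be None), which may require elevated privileges on some platforms.

```python
from collections import namedtuple


def index_by_endpoints(system_conns):
    """Map (src ip, src port, dst ip, dst port) -> pid for connected sockets."""
    index = {}
    for conn in system_conns:
        # Listening or unconnected sockets have an empty raddr tuple.
        if not conn.laddr or not conn.raddr or conn.pid is None:
            continue
        key = (conn.laddr.ip, conn.laddr.port, conn.raddr.ip, conn.raddr.port)
        index[key] = conn.pid
    return index


def match_batch(wanted, index):
    """Split wanted {nettup: info} into found [(pid, nettup, info)] and lost."""
    found = [(index[n], n, info) for n, info in wanted.items() if n in index]
    lost = [(n, info) for n, info in wanted.items() if n not in index]
    return found, lost


# Stand-ins shaped like the snapshot entries, used here for illustration only.
Addr = namedtuple("Addr", "ip port")
Conn = namedtuple("Conn", "laddr raddr pid")
```

With psutil installed, index_by_endpoints(psutil.net_connections(kind="inet")) would build the index once per batch. Short-lived connections can still disappear between the log line and the snapshot, so the "lost" list does not go away entirely.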
To begin with, we're given the following piece of code:
from validate_email import validate_email
import time
import os


def verify_emails(email_path, good_filepath, bad_filepath):
    good_emails = open(good_filepath, 'w+')
    bad_emails = open(bad_filepath, 'w+')

    emails = set()

    with open(email_path) as f:
        for email in f:
            email = email.strip()

            if email in emails:
                continue
            emails.add(email)

            if validate_email(email, verify=True):
                good_emails.write(email + '\n')
            else:
                bad_emails.write(email + '\n')


if __name__ == "__main__":
    os.system('cls')
    verify_emails("emails.txt", "good_emails.txt", "bad_emails.txt")
I expect contacting SMTP servers to be by far the most expensive part of my program when emails.txt contains a large number of lines (>1k). Using some form of parallel or asynchronous I/O should speed this up a lot, since I can wait for multiple servers to respond at once instead of waiting for each one sequentially.
As far as I have read:
Asynchronous I/O operates by queuing a request for I/O to the file
descriptor, tracked independently of the calling process. For a file
descriptor that supports asynchronous I/O (raw disk devices
typically), a process can call aio_read() (for instance) to request a
number of bytes be read from the file descriptor. The system call
returns immediately, whether or not the I/O has completed. Some time
later, the process then polls the operating system for the completion
of the I/O (that is, buffer is filled with data).
To be honest, I don't quite understand how to implement async I/O in my program. Could somebody take a little time and explain the whole process?
EDIT, as PArakleta suggested:
from validate_email import validate_email
import time
import os
from multiprocessing import Pool
import itertools


def validate_map(e):
    return (validate_email(e.strip(), verify=True), e)


seen_emails = set()


def unique(e):
    if e in seen_emails:
        return False
    seen_emails.add(e)
    return True


def verify_emails(email_path, good_filepath, bad_filepath):
    good_emails = open(good_filepath, 'w+')
    bad_emails = open(bad_filepath, 'w+')

    with open(email_path, "r") as f:
        for result in Pool().imap_unordered(validate_map,
                                            itertools.ifilter(unique, f)):
            (good, email) = result
            if good:
                good_emails.write(email)
            else:
                bad_emails.write(email)
    good_emails.close()
    bad_emails.close()


if __name__ == "__main__":
    os.system('cls')
    verify_emails("emails.txt", "good_emails.txt", "bad_emails.txt")
You're asking the wrong question
Having looked at the validate_email package, your real problem is that you're not batching your queries efficiently. You should do the MX lookup only once per domain, connect to each MX server only once, go through the handshake, and then check all of the addresses for that server in a single batch. Thankfully the validate_email package does the MX result caching for you, but you still need to group the email addresses by server to batch the queries to the server itself.
You need to edit the validate_email package to implement batching, and then probably give a thread to each domain using the actual threading library rather than multiprocessing.
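The grouping step can be sketched on its own, before touching the validate_email package. This is only an illustration (Python 3): group_by_domain is a hypothetical helper, it groups by the domain part of the address rather than by the resolved MX server, and it silently skips lines that contain no "@".

```python
from collections import defaultdict


def group_by_domain(addresses):
    """Return {domain: [addresses]} with duplicates and blank lines dropped."""
    groups = defaultdict(list)
    seen = set()
    for raw in addresses:
        email = raw.strip()
        if not email or email in seen or "@" not in email:
            continue
        seen.add(email)
        # Domains are case-insensitive, so normalise the grouping key.
        domain = email.rsplit("@", 1)[1].lower()
        groups[domain].append(email)
    return dict(groups)
```

Each value in the returned dict is then one batch that could be checked over a single connection to that domain's mail server.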
It's always important to profile your program if it's slow and figure out where it is actually spending the time rather than trying to apply optimisation tricks blindly.
The requested solution
IO is already asynchronous if you are using buffered IO and your use case fits with the OS buffering. The only place you could potentially get some advantage is in read-ahead but Python already does this if you use the iterator access to a file (which you are doing). AsyncIO is an advantage to programs that are moving large amounts of data and have disabled the OS buffers to prevent copying the data twice.
You need to actually profile/benchmark your program to see if it has any room for improvement. If your disks aren't already throughput bound then there is a chance to improve the performance by parallel execution of the processing of each email (address?). The easiest way to check this is probably to check to see if the core running your program is maxed out (i.e. you are CPU bound and not IO bound).
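One rough way to make that check from inside the program, as a sketch rather than a substitute for a real profiler: compare CPU time with wall-clock time for a single call. The helper name cpu_fraction is made up; a ratio near 1 suggests the call is CPU bound, while a ratio near 0 suggests it mostly waits on I/O.

```python
import time


def cpu_fraction(func, *args, **kwargs):
    """Run func once and return (result, CPU time / wall-clock time)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, (cpu / wall if wall > 0 else 0.0)
```

For example, cpu_fraction(validate_email, "someone@example.com", verify=True) would measure a single validation, assuming the validate_email call from the question.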
If you are CPU bound then you need to look at threading. Unfortunately Python threading doesn't work in parallel unless you have non-Python work to be done so instead you'll have to use multiprocessing (I'm assuming validate_email is a Python function).
How exactly you proceed depends on where the bottlenecks in your program are and how much of a speed-up you need to get to the point where you are IO bound (you cannot go any faster than that, so you can stop optimising when you hit that point).
The emails set object is hard to share because you'll need to lock around it so it's probably best that you keep that in one thread. Looking at the multiprocessing library the easiest mechanism to use is probably Process Pools.
Using this you would need to wrap your file iterable in an itertools.ifilter which discards duplicates, and then feed this into a Pool.imap_unordered and then iterate that result and write into your two output files.
Something like:
with open(email_path) as f:
    for result in Pool().imap_unordered(validate_map,
                                        itertools.ifilter(unique, f)):
        (good, email) = result
        if good:
            good_emails.write(email)
        else:
            bad_emails.write(email)
The validate_map function should be something simple like:
def validate_map(e):
    return (validate_email(e.strip(), verify=True), e)
The unique function should be something like:
seen_emails = set()

def unique(e):
    if e in seen_emails:
        return False
    seen_emails.add(e)
    return True
ETA: I just realised that validate_email is a library which actually contacts SMTP servers. Given that it's not busy in Python code you can use threading. The threading API though is not as convenient as the multiprocessing library but you can use multiprocessing.dummy to have a thread based Pool.
If you are CPU bound then it's not really worth having more threads/processes than cores but since your bottleneck is network IO you can benefit from many more threads/processes. Since processes are expensive you want to swap to threads and then crank up the number running in parallel (although you should be polite not to DOS-attack the servers you are connecting to).
Consider from multiprocessing.dummy import Pool as ThreadPool and then call ThreadPool(processes=32).imap_unordered().
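Putting the thread-based pool together might look like the sketch below (Python 3). The function name check_all and the validator parameter are assumptions made for illustration; the validator stands in for the validate_email call so the example does not contact any server.

```python
from multiprocessing.dummy import Pool as ThreadPool


def check_all(emails, validator, workers=32):
    """Validate unique addresses on a thread pool; return (good, bad) lists."""
    # dict.fromkeys drops duplicates while keeping the original order.
    unique = list(dict.fromkeys(e.strip() for e in emails if e.strip()))
    good, bad = [], []

    def validate_map(email):
        return validator(email), email

    with ThreadPool(processes=workers) as pool:
        # Results arrive in completion order, not input order.
        for ok, email in pool.imap_unordered(validate_map, unique):
            (good if ok else bad).append(email)
    return good, bad
```

With the package from the question, the validator could be lambda e: validate_email(e, verify=True). Because these are threads rather than processes, nothing has to be pickled, so a local function or lambda works.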
I'm new to python and I'm having trouble understanding how threading works. By skimming through the documentation, my understanding is that calling join() on a thread is the recommended way of blocking until it completes.
To give a bit of background, I have 48 large CSV files (multiple GB each) which I am trying to parse in order to find inconsistencies. The threads share no state. This can be done in a single thread in a reasonable amount of time for a one-off, but I am trying to do it concurrently as an exercise.
Here's a skeleton of the file processing:
def process_file(data_file):
    with open(data_file) as f:
        print "Start processing {0}".format(data_file)
        line = f.readline()
        while line:
            # logic omitted for brevity; can post if required
            # pretty certain it works as expected, single 'thread' works fine
            line = f.readline()

    print "Finished processing file {0} with {1} errors".format(data_file, error_count)


def process_file_callable(data_file):
    try:
        process_file(data_file)
    except:
        print >> sys.stderr, "Error processing file {0}".format(data_file)
And the concurrent bit:
def partition_list(l, n):
    """ Yield successive n-sized partitions from a list.
    """
    for i in xrange(0, len(l), n):
        yield l[i:i+n]


partitions = list(partition_list(data_files, 4))
for partition in partitions:
    threads = []
    for data_file in partition:
        print "Processing file {0}".format(data_file)
        t = Thread(name=data_file, target=process_file_callable, args = (data_file,))
        threads.append(t)
        t.start()

    for t in threads:
        print "Joining {0}".format(t.getName())
        t.join(5)

    print "Joined the first chunk of {0}".format(map(lambda t: t.getName(), threads))
I run this as:
python -u datautils/cleaner.py > cleaner.out 2> cleaner.err
My understanding is that join() should block the calling thread waiting for the thread it's called on to finish, however the behaviour I'm observing is inconsistent with my expectation.
I never see errors in the error file, but I also never see the expected log messages on stdout.
The parent process does not terminate unless I explicitly kill it from the shell. If I check how many prints I have for Finished ... it's never the expected 48, but somewhere between 12 and 15. However, having run this single-threadedly, I can confirm that the multithreaded run is actually processing everything and doing all the expected validation, only it does not seem to terminate cleanly.
I know I must be doing something wrong, but I would really appreciate if you can point me in the right direction.
I can't see where the mistake in your code is, but I can recommend refactoring it a little.
First of all, threads in Python do not run Python code in parallel. The Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, so for CPU-heavy parsing the concurrency is largely an illusion (threads do still overlap while blocked on I/O). That's why I recommend the multiprocessing module:
from multiprocessing import Pool, cpu_count

pool = Pool(cpu_count())
for partition in partition_list(data_files, 4):
    res = pool.map(process_file_callable, partition)
    print res
Second, you are reading the file in an unpythonic way:
with open(...) as f:
    line = f.readline()
    while line:
        ...  # do(line)
        line = f.readline()
Here is the pythonic way:
with open(...) as f:
    for line in f:
        ...  # do(line)
This is memory efficient, fast, and leads to simple code. (c) PyDoc
By the way, I have only one hypothesis about what could be happening in the multithreaded version: the app became slower, because unordered access to a hard disk drive is significantly slower than sequential access. You can check this hypothesis with iostat or htop if you are on Linux.
If your app does not finish and shows no activity in a process monitor (neither CPU nor disk is busy), you have some kind of deadlock or blocked access to a shared resource.
Thanks everybody for your input and sorry for not replying sooner - I'm working on this on and off as a hobby project.
I've managed to write a simple example that proves it was my bad:
from itertools import groupby
from threading import Thread
from random import randint
from time import sleep

for key, partition in groupby(range(1, 50), lambda k: k//10):
    threads = []
    for idx in list(partition):
        thread_name = 'thread-%d' % idx
        t = Thread(name=thread_name, target=sleep, args=(randint(1, 5),))
        threads.append(t)
        print 'Starting %s' % t.getName()
        t.start()
    for t in threads:
        print 'Joining %s' % t.getName()
        t.join()
    print 'Joined the first group of %s' % map(lambda t: t.getName(), threads)
The reason it was failing initially lay in the while loop: the 'logic omitted for brevity' worked fine on valid input, but some of the input files were corrupted (they had jumbled lines) and the logic went into an infinite loop on them. This is why some threads were never joined. The timeout on join() made sure that all threads were started, but some never finished, hence the inconsistency between 'starting' and 'joining'. The other fun fact was that the corruption was on the last line, so all the expected data was still being processed.
Thanks again for your advice - the comment about processing files in a while instead of the pythonic way pointed me in the right direction, and yes, threading behaves as expected.
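The behaviour of join() with a timeout is easy to miss: it returns None whether the thread finished or the timeout expired, so the caller has to check is_alive() afterwards. Here is a small Python 3 sketch of that check; the names join_all and demo are made up, and the "hung" thread is only a stand-in for a worker stuck in an endless loop.

```python
import threading
import time


def join_all(threads, timeout):
    """Join each thread with a timeout; return the names still running."""
    stuck = []
    for t in threads:
        t.join(timeout)
        # join() returns None either way, so is_alive() is the only signal.
        if t.is_alive():
            stuck.append(t.name)
    return stuck


def demo():
    release = threading.Event()
    quick = threading.Thread(name="quick", target=time.sleep, args=(0.01,))
    # Stand-in for a worker caught in an endless loop on a corrupted file.
    hung = threading.Thread(name="hung", target=release.wait, daemon=True)
    for t in (quick, hung):
        t.start()
    stuck = join_all([quick, hung], timeout=0.5)
    release.set()  # let the stand-in finish so the demo exits cleanly
    hung.join()
    return stuck
```

Logging the names returned by join_all after each partition would have pointed straight at the files whose threads never finished.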
Currently, I have a list of URLs to grab contents from, and I am doing it serially. I would like to change it to grab them in parallel. This is pseudocode. Is the design sound? I understand that .start() starts the thread; however, my database is not updated. Do I need to use q.get()? Thanks.
import threading
import Queue

q = Queue.Queue()


def do_database(url):
    """ grab url then input to database """
    webdata = grab_url(url)
    try:
        insert_data_into_database(webdata)
    except:
        ....
    else:
        < do I need to do anything with the queue after each db operation is done?>


def put_queue(q, url ):
    q.put( do_database(url) )


for myfiles in currentdir:
    url = myfiles + some_other_string
    t=threading.Thread(target=put_queue,args=(q,url))
    t.daemon=True
    t.start()
It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import Queue

NUM_THREADS = 5  # whatever
q = Queue.Queue()
END_OF_DATA = object()  # a unique object


class Worker(threading.Thread):
    def run(self):
        while True:
            url = q.get()
            if url is END_OF_DATA:
                break
            webdata = grab_url(url)
            try:
                # Does your database support concurrent updates
                # from multiple threads? If not, need to put
                # this in a "with some_global_mutex:" block.
                insert_data_into_database(webdata)
            except:
                pass  #....


threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for myfiles in currentdir:
    url = myfiles + some_other_string
    q.put(url)

# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
    q.put(END_OF_DATA)

# Shut down cleanly. `daemon` is way overused.
for t in threads:
    t.join()
You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).
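To give a feel for the single-threaded, event-driven model, here is a minimal sketch that uses the standard library's asyncio rather than Twisted (a deliberate substitution, so the example needs no third-party package). The names fetch_all, fake_fetch, store and limit are hypothetical, and fake_fetch only simulates a request.

```python
import asyncio


async def fetch_all(urls, fetch, store, limit=10):
    """Run fetch(url) for every URL with at most `limit` in flight."""
    gate = asyncio.Semaphore(limit)

    async def one(url):
        async with gate:
            data = await fetch(url)
        store(url, data)  # runs on the event loop thread, one call at a time

    await asyncio.gather(*(one(url) for url in urls))


async def fake_fetch(url):
    """Hypothetical stand-in for a real non-blocking HTTP request."""
    await asyncio.sleep(0.01)
    return "contents of " + url
```

A caller would run it with something like asyncio.run(fetch_all(urls, fake_fetch, results.__setitem__)), replacing fake_fetch with a real asynchronous HTTP client and store with the database insert.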
For the database, you have to commit before your changes take effect. Committing after every insert is not optimal, though; committing after a batch of changes gives much better performance.
For parallelism, Python isn't a natural fit. For your use case, I suppose using Python with gevent would be a painless solution.
Here is a much more efficient pseudo-implementation, FYI:
import gevent
from gevent.monkey import patch_all
patch_all()  # to use with urllib, etc
from gevent.queue import Queue


def web_worker(q, url):
    grab_something
    q.put(result)


def db_worker(q):
    buf = []
    while True:
        buf.append(q.get())
        if len(buf) > 20:
            insert_stuff_in_buf_to_db
            db_commit
            buf = []


def run(urls):
    q = Queue()
    gevent.spawn(db_worker, q)
    for url in urls:
        gevent.spawn(web_worker, q, url)


run(urls)
Plus, since this implementation is entirely single-threaded, you can safely manipulate data shared between workers, such as the queue, the DB connection, global variables, etc.
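The commit advice can be tried in isolation with the standard library's sqlite3 module. This is a sketch under assumptions: the pages table, its columns and the helper names insert_in_batches and make_db are invented for the example, and your own database driver may batch differently.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size=20):
    """Insert (url, body) rows, committing once per batch instead of per row."""
    batch = []
    commits = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO pages (url, body) VALUES (?, ?)", batch)
            conn.commit()
            commits += 1
            batch = []
    if batch:  # flush whatever is left over
        conn.executemany("INSERT INTO pages (url, body) VALUES (?, ?)", batch)
        conn.commit()
        commits += 1
    return commits


def make_db():
    """Create an in-memory database with the hypothetical pages table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (url TEXT, body TEXT)")
    return conn
```

Note the final flush: a worker that only commits when the buffer is full would leave the last partial batch unwritten.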
The name kind of says it all. I'm writing this program in python 2.7, and I'm trying to take advantage of threaded queues to make a whole bunch of web requests. Here's the problem: I would like to have two different queues, one to handle the threaded requests, and a separate one to handle the responses. If I have a queue in my program that isn't named "queue", for example if I want the initial queue to be named "input_q", then the program crashes and just refuses to work. This makes absolutely no sense to me. In the code below, all of the imported custom modules work just fine (at least, they did independently, passed all unit tests, and don't see any reason they could be the source of the problem).
Also, via diagnostic statements, I have determined that it crashes just before it spawns the thread pool.
Thanks in advance.
EDIT: Crash may be the wrong term here. It actually just stops. Even after waiting half an hour to complete, when the original program ran in under thirty seconds, the program wouldn't run. When I told it to print out toCheck, it would only make it part way through the list, stop in the middle of an entry, and do nothing.
EDIT2: Sorry for wasting everyone's time, I forgot about this post. Someone had changed one of my custom modules (threadcheck). It looks like the program was initializing the module and then running along its merry way with the rest of the program. Threadcheck was crashing after initialization, when the program was in the middle of computations, and that crash was taking the whole thing down with it.
code:
from binMod import binExtract
from grabZip import grabZip
import random
import Queue
import time
import threading
import urllib2
from threadCheck import threadUrl
import datetime

queue = Queue.Queue()
#output_q = Queue.Queue()
#input_q = Queue.Queue()
#output = queue

p=90
qb = 22130167533
url = grabZip(qb)
logFile = "log.txt"
metaC = url.grabMetacell()
toCheck = []
print metaC[0]['images']
print "beginning random selection"
for i in range(4):
    if (len(metaC[i]['images'])>0):
        print metaC[i]['images'][0]
        for j in range(len(metaC[i]['images'])):
            chance = random.randint(0, 100)
            if chance <= p:
                toCheck.append(metaC[i]['images'][j]['resolution 7 url'])

print "Spawning threads..."
for i in range(20):
    t = threadUrl(queue)
    t.setDaemon(True)
    t.start()

print "initializing queue..."
for i in range(len(toCheck)):
    queue.put(toCheck[i])

queue.join()
#input_q.join()
output = open(logFile, 'a')
done = datetime.datetime.now()
results = "\n %s \t %s \t %s \t %s"%(done, qb, good, bad)
output.write(results)
What the names are is irrelevant to Python -- Python doesn't care, and the objects themselves (for the most part) don't even know the names they have been assigned to. So the problem has to be somewhere else.
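A tiny demonstration of that point, written for Python 3 (where the module is called queue rather than Queue): two names bound to the same queue object behave identically, and the object has no idea what it is called. The variable names are arbitrary.

```python
import queue

input_q = queue.Queue()
alias = input_q            # a second name bound to the same object

alias.put("job")           # visible through either name
item = input_q.get()
same_object = alias is input_q
```

If a rename breaks the program, the likely cause is a place where the old name is still used, for instance inside a module that expects a variable called queue.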
As has been suggested in the comments, carefully check your renames of queue.
Also, try it without daemon mode.