I want to fetch WHOIS data for a text file with 50,000 URLs. It works, but it takes at least 20 minutes. What can I do to improve its performance?
import whois
from concurrent.futures import ThreadPoolExecutor
import threading
import time

pool = ThreadPoolExecutor(max_workers=2500)

def query(domain):
    while True:
        try:
            w = whois.whois(domain)
            fwrite = open("whoISSuccess.txt", "a")
            fwrite.write('\n{0} : {1}'.format(w.domain, w.expiration_date))
            fwrite.close()
        except:
            time.sleep(3)
            continue
        else:
            break

with open('urls.txt') as f:
    for line in f:
        lines = line.rstrip("\n\r")
        pool.submit(query, lines)

pool.shutdown(wait=True)
You can do two things to improve the speed:
Use multiprocessing rather than threading: Python threads do not really run in parallel, while processes are managed by the OS and truly run in parallel.
Secondly, have each process write to its own file (e.g. <url>.txt), because having all processes write to the same file causes lock contention on the writes, which significantly slows your program down. After all processes have completed, you can aggregate the files into a single one if that is a hard requirement. Alternatively, you can keep the whois results in memory and write them out to a file at the end; a rough sketch of that second variant is shown below.
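For illustration, a minimal sketch of the in-memory variant using a process pool. It assumes the same whois package and urls.txt from the question; the pool size and the give-up-on-error policy are placeholder choices rather than tuned values.

import multiprocessing
import whois

def query(domain):
    # Return a formatted result, or None on failure, instead of retrying
    # forever, so that one bad domain cannot tie up a worker.
    try:
        w = whois.whois(domain)
        return '{0} : {1}'.format(w.domain, w.expiration_date)
    except Exception:
        return None

if __name__ == '__main__':
    with open('urls.txt') as f:
        domains = [line.strip() for line in f if line.strip()]

    # 32 workers is a guess; WHOIS servers rate-limit aggressively, so more
    # workers is not automatically faster.
    with multiprocessing.Pool(processes=32) as pool:
        results = pool.map(query, domains)

    # Write everything once at the end to avoid contention on the output file.
    with open('whoISSuccess.txt', 'w') as fout:
        fout.writelines(r + '\n' for r in results if r is not None)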
Related
I'm writing a piece of code that needs to compare a Python set to many other sets and retain the names of the files which have a minimum intersection length. I currently have a synchronous version but was wondering if it could benefit from async/await. I wanted to start by comparing the loading of the sets, so I wrote a simple script that writes a small set to disk and just reads it in n times. I was surprised to see that the sync version was a lot faster. Is this to be expected? And if not, is there a flaw in the way I have coded it below?
My code is the following:
Synchronous version:
import pickle
import asyncio
import time
import aiofiles

pickle.dump(set(range(1000)), open('set.pkl', 'wb'))

def count():
    print("Started Loading")
    with open('set.pkl', mode='rb') as f:
        contents = pickle.loads(f.read())
    print("Finished Loading")

def main():
    for _ in range(100):
        count()

if __name__ == "__main__":
    s = time.perf_counter()
    main()
    elapsed = time.perf_counter() - s
    print(f"{__file__} executed in {elapsed:0.3f} seconds.")
Asynchronous version:
import pickle
import asyncio
import time
import aiofiles

pickle.dump(set(range(1000)), open('set.pkl', 'wb'))

async def count():
    print("Started Loading")
    async with aiofiles.open('set.pkl', mode='rb') as f:
        contents = pickle.loads(await f.read())
    print("Finished Loading")

async def main():
    await asyncio.gather(*(count() for _ in range(100)))

if __name__ == "__main__":
    s = time.perf_counter()
    asyncio.run(main())
    elapsed = time.perf_counter() - s
    print(f"{__file__} executed in {elapsed:0.3f} seconds.")
Executing them led to:
async.py executed in 0.052 seconds.
sync.py executed in 0.011 seconds.
Asyncio doesn't help in this case because your workload is basically disk-I/O bound and CPU bound.
A CPU-bound workload cannot be sped up by asyncio.
A disk-I/O-bound workload can benefit from async operation if the disk operation is slow and your program has other things to do during that time. That is not your situation here.
So the slower asyncio performance is mainly due to the additional overhead it introduces.
aiofiles is implemented using threads, so each time you ask it to read a file, the read is handed off to another thread.
The file being read is very small: it fits in about 3 KB, which is under one page of memory and smaller than a core's L1 cache. Most of the time the computer isn't actually reading anything from disk; the data is just being moved between parts of your memory.
In the async case it is being moved from one core's cache to another, which is slower than keeping everything within one core's cache. For larger files that are actually read from disk, and with other tasks to attend to (reading from sockets, reading different files from disk, doing some processing concurrently), you will find the async version faster, because it uses threads under the hood, and some tasks release the GIL, such as reading from files and sockets and some processing libraries.
You are still reading files at the same speed in both cases, because you are limited by your drive's read speed; you are only reducing the "dead time" when you are not reading files. Your example has no dead time, and it isn't even reading the file from disk.
An exception to the above is when you are reading data from multiple HDDs and SSDs concurrently, where one thread can never read the data fast enough; then the async version will be faster, because it can read from multiple drives at the same time (assuming your CPU has the cores and I/O lanes for it).
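As an illustration of the CPU-bound point (this sketch is not from the original post): if the unpickling itself were the bottleneck, you could push it off the event loop into a process pool with run_in_executor. For a 3 KB file the process overhead would still dominate, so this only pays off when each load does substantial work.

import asyncio
import pickle
from concurrent.futures import ProcessPoolExecutor

def load_set(path):
    # CPU-bound unpickling runs in a worker process, off the event loop.
    with open(path, 'rb') as f:
        return pickle.loads(f.read())

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        tasks = [loop.run_in_executor(pool, load_set, 'set.pkl')
                 for _ in range(100)]
        results = await asyncio.gather(*tasks)
    print(len(results), "sets loaded")

if __name__ == "__main__":
    asyncio.run(main())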
I have a function readFile that I need to call 8.5 million times (essentially stress-testing a logger to ensure the log rotates correctly). I don't care about the output/result of the function, only that I run it N times as quickly as possible.
My current solution is this:
from threading import Thread
import subprocess

def readFile(filename):
    args = ["/usr/bin/ls", filename]
    subprocess.run(args)

def main():
    filename = "test.log"
    threads = set()

    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)

    # Wait for all the reads to finish
    while len(threads):
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)
readFile has been simplified, but the concept is the same. I need to run readFile 8.5 million times, and I need to wait for all the reads to finish. Based on my mental math, this spawns ~60 threads per second, which means it will take ~40 hours to finish. Ideally, this would finish within 1-8 hours.
Is this possible? Is the number of iterations simply too high for this to be done in a reasonable span of time?
Oddly enough, when I wrote a test script, I was able to generate a thread roughly every 0.0005 seconds, which should equate to ~2000 threads per second, but that is not the case here.
I considered iterating 8,500,000 / 10 times and spawning a thread which then runs the readFile function 10 times, which should decrease the time by ~90%, but it caused some issues with blocking resources, and I think passing a lock around would be a bit complicated insofar as keeping the function usable by methods that don't incorporate threading.
Any tips?
Based on #blarg's comment, and scripts I've used with multiprocessing, the following can be considered.
It simply reads the same file repeatedly, based on the size of the list. Here I'm looking at 1M reads.
With 1 core it takes around 50 seconds. With 8 cores it's down to around 22 seconds. This is on a Windows PC, but I use these scripts on Linux EC2 (AWS) instances as well.
Just put this in a Python file and run it:
import os
import time
from multiprocessing import Pool

def readfile(fn):
    # Open (and close) the file; we don't care about the contents here.
    with open(fn, "r") as f:
        pass

def _multiprocess(mylist, num_proc):
    with Pool(num_proc) as pool:
        r = pool.starmap(readfile, zip(mylist))
        pool.close()
        pool.join()
    return r

if __name__ == "__main__":
    __spec__ = None  # workaround for multiprocessing in some interactive environments

    # use the system cpus or change explicitly
    num_proc = os.cpu_count()
    num_proc = 1  # override here to compare against a single core

    start = time.time()
    # Here you'll want 8.5M, but first test that it works with a smaller number.
    # Note that multiprocessing is slow for a low number of reads: 8 cores is
    # slower than 1 core until you reach a certain point, after which it is worth it.
    mylist = ["test.txt"] * 1000000
    rs = _multiprocess(mylist, num_proc=num_proc)
    print('total seconds,', time.time() - start)
I think you should reconsider using subprocess here: if you just want to execute the ls command, it's better to use os.system, since it reduces the resource consumption in your current process.
Also, you should add a small delay with time.sleep() while waiting for the threads to finish, to reduce resource consumption:
from threading import Thread
import os
import time

def readFile(filename):
    os.system("/usr/bin/ls " + filename)

def main():
    filename = "test.log"
    threads = set()

    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)

    # Wait for all the reads to finish
    while len(threads):
        time.sleep(0.1)  # delay to reduce resource consumption while waiting
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)
I want to read and process a file by using multiprocessing with low memory consumption, high throughput (sentence/s), and - especially important - ordered results.
I was wondering whether we can use linecache's getline for this purpose. The following code reads a file, hopefully in parallel, and executes some function on the lines that are gathered in the subprocess. Here I opted for running some tokenisation on the files with spaCy.
import datetime
from multiprocessing import Pool, current_process
from os import cpu_count
from pathlib import Path
from functools import partial
from linecache import getline

import spacy

class Processor:
    def __init__(self, spacy_model='en_core_web_sm', batch_size=2048):
        self.nlp = spacy.load(spacy_model, disable=['ner', 'textcat'])
        self.batch_size = batch_size

    @staticmethod
    def get_n_lines(pfin):
        with pfin.open(encoding='utf-8') as fhin:
            for line_idx, _ in enumerate(fhin, 1):
                pass
        return line_idx

    def process_file(self, fin):
        pfin = Path(fin).resolve()
        total_lines = self.get_n_lines(pfin)
        start_time = datetime.datetime.now()
        procfunc = partial(self.process_batch, pfin)

        with Pool(cpu_count() - 1) as pool:
            # map the starting indexes to the processes
            for _ in pool.imap(procfunc, range(0, total_lines+1, self.batch_size)):
                pass

        print('done', (datetime.datetime.now() - start_time).total_seconds())

    def process_batch(self, pfin, start):
        lines = [getline(str(pfin), i) for i in range(start, start+self.batch_size)]
        # Parse text with spaCy
        docs = list(self.nlp.pipe(lines))
        # Chop into sentences
        spacy_sents = [str(sent) for doc in docs for sent in doc.sents]
        return str(current_process()), spacy_sents

if __name__ == '__main__':
    fn = r'data/train.tok.low.en'
    proc = Processor()
    proc.process_file(fn)
I found that on my work laptop, running with 3 active cores on a file of 140K sentences, the duration is 261 seconds. When running with a single core (n_workers=1), the processing time is 431 seconds. I am not sure how to interpret this difference, but I guess it comes down to the question: does linecache.getline allow for concurrent reading? Parallel execution is faster, but considering that getline expects a file name (rather than a file object), I expect it has to open the file every time and thus block access for the other processes. Is this assumption correct? Because parallel execution still seems much faster. Is there a better way to read files fast and in parallel whilst also keeping the results ordered?
You don't need linecache, and it doesn't help.
First, you don't need any special tricks to read the same file simultaneously from multiple processes. You can just do it. It'll work.
Second, linecache loads a whole file immediately as soon as a single line is requested from that file. You're not splitting the work of reading the file at all. You're doing more I/O than if you just had the parent process read the file and let the workers inherit the data. If you're getting any speedup from multiprocessing, it's probably due to parallelizing the NLP work, not the file reading.
Third, linecache is designed to support the traceback module, and it does a lot of stuff that doesn't make sense for a general-purpose file reading module, including searching the import path for a file if it doesn't find the file in the current directory.
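As a rough sketch of that simpler pattern (this is not code from the question): the parent process does all of the reading and batching, the workers only do the per-batch processing, and imap keeps the results in input order. The process_batch body here is just a stand-in for the spaCy work; the file path is the one used in the question.

from itertools import islice
from multiprocessing import Pool

def process_batch(lines):
    # Stand-in for the real per-batch work (e.g. spaCy tokenisation).
    return [line.strip().lower() for line in lines]

def read_batches(path, batch_size=2048):
    # Only the parent reads the file; workers receive the text itself.
    with open(path, encoding='utf-8') as fhin:
        while True:
            batch = list(islice(fhin, batch_size))
            if not batch:
                break
            yield batch

if __name__ == '__main__':
    with Pool(3) as pool:
        # imap preserves the input order, so the results come back ordered.
        for sents in pool.imap(process_batch, read_batches('data/train.tok.low.en')):
            pass  # write or collect the ordered results here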
I have the following code that writes the md5sums to a logfile
for file in files_output:
    p = subprocess.Popen(['md5sum', file], stdout=logfile)
p.wait()
Will these be written in parallel? i.e. if md5sum takes a long time for one of the files, will another one be started before waiting for a previous one to complete?
If the answer to the above is yes, can I assume the order of the md5sums written to logfile may differ based upon how long md5sum takes for each file? (some files can be huge, some small)
Yes, these md5sum processes will be started in parallel.
Yes, the order of the md5sum writes will be unpredictable. And it is generally considered bad practice to share a single resource, like a file, among many processes this way.
Also, calling p.wait() after the for loop will wait only for the last of the md5sum processes to finish; the rest of them might still be running.
But you can modify this code slightly and still have the benefits of parallel processing, with a predictable, synchronized output, if you collect the md5sum output into temporary files and merge it back into one file once all processes are done.
import subprocess
import tempfile

processes = []
for file in files_output:
    f = tempfile.TemporaryFile()
    p = subprocess.Popen(['md5sum', file], stdout=f)
    processes.append((p, f))

for p, f in processes:
    p.wait()
    f.seek(0)
    logfile.write(f.read())
    f.close()
All subprocesses run in parallel. (To avoid this, one has to wait explicitly for their completion.) They can even write into the log file at the same time, thereby garbling the output. To avoid this you should let each process write into a different logfile and collect all outputs when all processes are finished. Another option is to collect the checksums in memory, using a queue and a pool of worker threads:
import queue
import threading

q = queue.Queue()
result = {}  # used to store the results

for fileName in fileNames:
    q.put(fileName)

def worker():
    while True:
        fileName = q.get()
        if fileName is None:  # Sentinel?
            return
        # Placeholders for running md5sum and collecting its output:
        subprocess_stuff_using(fileName)
        wait_for_finishing_subprocess()
        checksum = collect_md5_result_for(fileName)
        result[fileName] = checksum  # store it

threads = [threading.Thread(target=worker) for _i in range(20)]
for thread in threads:
    thread.start()
for _ in threads:
    q.put(None)  # one sentinel marker for each thread
for thread in threads:
    thread.join()
After this the results should be stored in result.
A simple way to collect output from parallel md5sum subprocesses is to use a thread pool and write to the file from the main process:
from multiprocessing.dummy import Pool  # use threads
from subprocess import check_output

def md5sum(filename):
    try:
        return check_output(["md5sum", filename]), None
    except Exception as e:
        return None, e

if __name__ == "__main__":
    p = Pool(number_of_processes)  # specify number of concurrent processes
    with open("md5sums.txt", "wb") as logfile:
        for output, error in p.imap(md5sum, filenames):  # provide filenames
            if error is None:
                logfile.write(output)
The output from md5sum is small, so you can store it in memory.
imap preserves order.
number_of_processes may be different from the number of files or CPU cores (larger values don't automatically mean faster: it depends on the relative performance of I/O (disks) and the CPU).
You can try to pass several files at once to the md5sum subprocesses.
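For example, a possible sketch of that batching idea (the chunk size and the filenames list are placeholders): md5sum accepts several paths per invocation and prints one checksum line per file, so each worker amortizes the process-spawn cost over a whole chunk.

from multiprocessing.dummy import Pool  # threads, as above
from subprocess import check_output

def md5sum_batch(filename_chunk):
    # One md5sum invocation checksums every file in the chunk.
    return check_output(["md5sum"] + list(filename_chunk))

def chunks(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

if __name__ == "__main__":
    filenames = ["a.bin", "b.bin", "c.bin"]  # placeholder list
    with Pool(4) as pool, open("md5sums.txt", "wb") as logfile:
        for output in pool.imap(md5sum_batch, chunks(filenames, 2)):
            logfile.write(output)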
You don't need external subprocess in this case; you can calculate md5 in Python:
import hashlib
from functools import partial

def md5sum(filename, chunksize=2**15, bufsize=-1):
    m = hashlib.md5()
    with open(filename, 'rb', bufsize) as f:
        for chunk in iter(partial(f.read, chunksize), b''):
            m.update(chunk)
    return m.hexdigest()
To use multiple processes instead of threads (to allow the pure Python md5sum() to run in parallel utilizing multiple CPUs) just drop .dummy from the import in the above code.
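A short usage sketch of that process-based variant (the filenames list is a placeholder, and the checksum function is a simplified whole-file version of the one above):

import hashlib
from multiprocessing import Pool  # .dummy dropped: real processes on multiple CPUs

def md5sum(filename):
    # Simplified pure-Python checksum; reads the whole file at once.
    with open(filename, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

if __name__ == "__main__":
    filenames = ["a.bin", "b.bin", "c.bin"]  # placeholder list
    with Pool() as pool, open("md5sums.txt", "w") as logfile:
        # imap preserves order, so checksums line up with the input names.
        for filename, digest in zip(filenames, pool.imap(md5sum, filenames)):
            logfile.write("{}  {}\n".format(digest, filename))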
[Python 3.1]
My program takes a long time to run just because of the pickle.load method on a huge data structure. This makes debugging very annoying and time-consuming: every time I make a small change, I need to wait for a few minutes to see if the regression tests passed.
I would like to replace pickle with an in-memory data structure.
I thought of starting a Python program in one process and connecting to it from another, but I am afraid the inter-process communication overhead will be huge.
Perhaps I could run a python function from the interpreter to load the structure in memory. Then as I modify the rest of the program, I can run it many times (without exiting the interpreter in between). This seems like it would work, but I'm not sure if I will suffer any overhead or other problems.
You can use mmap to open a view on the same file in multiple processes, with access at almost the speed of memory once the file is loaded.
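A minimal sketch of the mmap idea (the file name is a placeholder). Note that if the data is pickled you still pay the unpickling cost each time; the shared mapping mainly helps when the raw bytes are directly usable.

import mmap

# Map the file read-only. The OS page cache is shared, so several processes
# mapping the same file do not duplicate the data in memory.
with open('huge_data.bin', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = mm[:16]   # slicing reads straight from the mapping
    print(len(mm), len(header))
    mm.close()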
First you can pickle the different parts of the whole object using this method:
# gen_objects.py
import random
import pickle

class BigBadObject(object):
    def __init__(self):
        self.a_dictionary = {}
        for x in range(random.randint(1, 1000)):
            self.a_dictionary[random.randint(1, 98675676)] = random.random()
        self.a_list = []
        for x in range(random.randint(1000, 10000)):
            self.a_list.append(random.random())
        self.a_string = ''.join([chr(random.randint(65, 90))
                                 for x in range(random.randint(100, 10000))])

if __name__ == "__main__":
    output = open('lotsa_objects.pickled', 'wb')
    for i in range(10000):
        pickle.dump(BigBadObject(), output, pickle.HIGHEST_PROTOCOL)
    output.close()
Once you have generated the big file as many separately pickled objects, you can read it with a Python program in which a reader thread unpickles objects in the background while the main program consumes them:
# reader.py
from threading import Thread
from queue import Queue, Empty
import pickle
import operator
from functools import reduce

from gen_objects import BigBadObject

class Reader(Thread):
    def __init__(self, filename, q):
        Thread.__init__(self, target=None)
        self._file = open(filename, 'rb')
        self._queue = q

    def run(self):
        while True:
            try:
                one_object = pickle.load(self._file)
            except EOFError:
                break
            self._queue.put(one_object)

class uncached(object):
    def __init__(self, filename, queue_size=100):
        self._my_queue = Queue(maxsize=queue_size)
        self._my_reader = Reader(filename, self._my_queue)
        self._my_reader.start()

    def __iter__(self):
        while True:
            if not self._my_reader.is_alive():
                break
            # Loop until we get something or the thread is done processing.
            try:
                print("Getting from the queue. Queue size=", self._my_queue.qsize())
                o = self._my_queue.get(True, timeout=0.1)  # Block for 0.1 seconds
                yield o
            except Empty:
                pass
        return

# Compute an average of all the numbers in a_list, just for show.
list_avg = 0.0
list_count = 0
for x in uncached('lotsa_objects.pickled'):
    list_avg += reduce(operator.add, x.a_list)
    list_count += len(x.a_list)

print("Average: ", list_avg / list_count)
Reading the pickle file this way means you can start processing as soon as the first object has been unpickled, instead of waiting for one huge pickle.load to finish: the reader thread keeps filling the queue (up to 100 objects at a time) in the background while the main loop consumes them.