I have data, which is in a text file. Each line is a computation to do. This file has around 100 000 000 lines.
First I load everything into the ram, then I have a a method that performs the computation and gives the following results:
def process(data_line):
#do computation
return result
Then I call it like this with packets of 2000 lines and then save the result to disk :
POOL_SIZE = 15 #nbcore - 1
PACKET_SIZE = 2000
pool = Pool(processes=POOL_SIZE)
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_packets = int(number_of_lines/ PACKET_SIZE)
for i in range(number_of_packets):
lines_packet = data_lines[:PACKET_SIZE]
data_lines = data_lines[PACKET_SIZE:]
results = pool.map(process, lines_packet)
save_computed_data_to_disk(to_be_computed_filename, results)
# process the last packet, which is smaller
results.extend(pool.map(process, data_lines))
save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")
The problem is, while I was writing to disk, my CPU is computing nothing and has 8 cores. It is looking at the task manager and it seems that quite a lot of CPU time is lost.
I have to write to disk after having completed my computation because the results are 1000 times larger than the input.
Anyways, I would have to write to the disk at some point. If time is not lost here, it will be lost later.
What could I do to allow one core to write to disk, while still computing with the others? Switch to C?
At this rate I can process 100 millions lines in 75h, but I have 12 billions lines to process, so any improvement is welcome.
example of timings:
Processing packet 2/15 953 of C:/processing/drop_zone\to_be_processed_txt_files\t_to_compute_303620.txt
Launching task and waiting for it to finish...
Task completed, Continuing
Packet was processed in 11.534576654434204 seconds
We are currently going at a rate of 0.002306915330886841 sec/words
Which is 433.47928145051293 words per seconds
Saving in temporary file
Printing writing 5000 computed line to disk took 0.04400920867919922 seconds
saving word to resume from : 06 20 25 00 00
Estimated time for processing the remaining packets is : 51:19:25
Note: This SharedMemory works only for Python >= 3.8 since it first appeared there
Start 3 kinds of processes: Reader, Processor(s), Writer.
Have Reader process read the file incrementally, sharing the result via shared_memory and Queue.
Have the Processor(s) consume the Queue, consume the shared_memory, and return the result(s) via another Queue. Again, as shared_memory.
Have the Writer process consume the second Queue, writing to the destination file.
Have them all communicate through, say, some Events or DictProxy, with the MainProcess who will act as the orchestrator.
Example:
import time
import random
import hashlib
import multiprocessing as MP
from queue import Queue, Empty
# noinspection PyCompatibility
from multiprocessing.shared_memory import SharedMemory
from typing import Dict, List
def readerfunc(
shm_arr: List[SharedMemory], q_out: Queue, procr_ready: Dict[str, bool]
):
numshm = len(shm_arr)
for batch in range(1, 6):
print(f"Reading batch #{batch}")
for shm in shm_arr:
#### Simulated Reading ####
for j in range(0, shm.size):
shm.buf[j] = random.randint(0, 255)
#### ####
q_out.put((batch, shm))
# Need to sync here because we're reusing the same SharedMemory,
# so gotta wait until all processors are done before sending the
# next batch
while not q_out.empty() or not all(procr_ready.values()):
time.sleep(1.0)
def processorfunc(
q_in: Queue, q_out: Queue, suicide: type(MP.Event()), procr_ready: Dict[str, bool]
):
pname = MP.current_process().name
procr_ready[pname] = False
while True:
time.sleep(1.0)
procr_ready[pname] = True
if q_in.empty() and suicide.is_set():
break
try:
batch, shm = q_in.get_nowait()
except Empty:
continue
print(pname, "got batch", batch)
procr_ready[pname] = False
#### Simulated Processing ####
h = hashlib.blake2b(shm.buf, digest_size=4, person=b"processor")
time.sleep(random.uniform(5.0, 7.0))
#### ####
q_out.put((pname, h.hexdigest()))
def writerfunc(q_in: Queue, suicide: type(MP.Event())):
while True:
time.sleep(1.0)
if q_in.empty() and suicide.is_set():
break
try:
pname, digest = q_in.get_nowait()
except Empty:
continue
print("Writing", pname, digest)
#### Simulated Writing ####
time.sleep(random.uniform(3.0, 6.0))
#### ####
print("Writing", pname, digest, "done")
def main():
shm_arr = [
SharedMemory(create=True, size=1024)
for _ in range(0, 5)
]
q_read = MP.Queue()
q_write = MP.Queue()
procr_ready = MP.Manager().dict()
poison = MP.Event()
poison.clear()
reader = MP.Process(target=readerfunc, args=(shm_arr, q_read, procr_ready))
procrs = []
for n in range(0, 3):
p = MP.Process(
target=processorfunc, name=f"Proc{n}", args=(q_read, q_write, poison, procr_ready)
)
procrs.append(p)
writer = MP.Process(target=writerfunc, args=(q_write, poison))
reader.start()
[p.start() for p in procrs]
writer.start()
reader.join()
print("Reader has ended")
while not all(procr_ready.values()):
time.sleep(5.0)
poison.set()
[p.join() for p in procrs]
print("Processors have ended")
writer.join()
print("Writer has ended")
[shm.close() for shm in shm_arr]
[shm.unlink() for shm in shm_arr]
if __name__ == '__main__':
main()
You say you have 8 cores, yet you have:
POOL_SIZE = 15 #nbcore - 1
Assuming you want to leave one processor free (presumably for the main process?) why wouldn't this number be 7? But why do you even want to read a processor free? You are making successive calls to map. While the main process is waiting for these calls to return, it requires know CPU. This is why if you do not specify a pool size when you instantiate your pool it defaults to the number of CPUs you have and not that number minus one. I will have more to say about this below.
Since you have a very large, in-memory list, is it possible that you are expending waisted cycles in your loop rewriting this list on each iteration of the loop. Instead, you can just take a slice of the list and pass that as the iterable argument to map:
POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)
with Pool(processes=POOL_SIZE) as pool:
offset = 0
for i in range(number_of_packets):
results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
offset += PACKET_SIZE
save_computed_data_to_disk(to_be_computed_filename, results)
if remainder:
results = pool.map(process, data_lines[offset:offset+remainder])
save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")
Between each call to map the main process is writing out the results to to_be_computed_filename. In the meanwhile, every process in your pool is sitting idle. This should be given to another process (actually a thread running under the main process):
import multiprocessing
import queue
import threading
POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)
def save_data(q):
while True:
results = q.get()
if results is None:
return # signal to terminate
save_computed_data_to_disk(to_be_computed_filename, results)
q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()
with Pool(processes=POOL_SIZE) as pool:
offset = 0
for i in range(number_of_packets):
results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
offset += PACKET_SIZE
q.put(results)
if remainder:
results = pool.map(process, data_lines[offset:offset+remainder])
q.put(results)
q.put(None)
t.join() # wait for thread to terminate
print("Done")
I've chosen to run save_data in a thread of the main process. This could also be another process in which case you would need to use a multiprocessing.Queue instance. But I figured the main process thread is mostly waiting for the map to complete and there would not be competition for the GIL. Now if you do not leave a processor free for the threading job, save_data, it may end up doing most of the saving only after all of the results have been created. You would need to experiment a bit with this.
Ideally, I would also modify the reading of the input file so as to not have to first read it all into memory but rather read it line by line yielding 2000 line chunks and submitting those as jobs for map to process:
import multiprocessing
import queue
import threading
POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
def save_data(q):
while True:
results = q.get()
if results is None:
return # signal to terminate
save_computed_data_to_disk(to_be_computed_filename, results)
def read_data():
"""
yield lists of PACKET_SIZE
"""
lines = []
with open(some_file, 'r') as f:
for line in iter(f.readline(), ''):
lines.append(line)
if len(lines) == PACKET_SIZE:
yield lines
lines = []
if lines:
yield lines
q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()
with Pool(processes=POOL_SIZE) as pool:
for l in read_data():
results = pool.map(process, l)
q.put(results)
q.put(None)
t.join() # wait for thread to terminate
print("Done")
I made two assumptions: The writing is hitting the I/O bound, not the CPU bound - meaning that throwing more cores onto writing would not improve the performance. And the process function contains some heavy computations.
I would approach it differently:
Split up the large list into a list of list
Feed it than into the processes
Store the total result
Here is the example code:
import multiprocessing as mp
data_lines = [0]*10000 # read it from file
size = 2000
# Split the list into a list of list (with chunksize `size`)
work = [data_lines[i:i + size] for i in range(0, len(data_lines), size)]
def process(data):
result = len(data) # some something fancy
return result
with mp.Pool() as p:
result = p.map(process, work)
save_computed_data_to_disk(file_name, result)
On meta: You may also have a look into numpy or pandas (depending on the data) because it sounds that you would like to do something into that direction.
The first thing that comes to mind for the code is to run the saving function in the thread. By this we exclude the bottelneck of waiting disk writing. Like so:
executor = ThreadPoolExecutor(max_workers=2)
future = executor.submit(save_computed_data_to_disk, to_be_computed_filename, results)
saving_futures.append(future)
...
concurrent.futures.wait(saving_futures, return_when=ALL_COMPLETED) # wait all saved to disk after processing
print("Done")
Hello I have a script that does a GET request and I need to measure the thread that is loaded with that function. This is the code that I have written but it doesn`t show the correct time it shows 0 and sometimes 0.001 or something like that.
import requests
import threading
import time
def functie():
URL = "http://10.250.100.170:9082/SPVWS2/rest/listaMesaje"
r = requests.get(url = URL)
data = r.json()
threads = []
for i in range(5):
start = time.clock_gettime_ns()
t = threading.Thread(target=functie)
threads.append(t)
t.start()
end = time.clock_gettime_ns()
print(end-start)
I need an example on how to get in my code the exact thread execution time. Thanks
The code in this script runs on the main thread and you are trying to measure the timing of thread t. To do that, you can tell main thread to wait until thread t has finished like this:
import requests
import threading
import time
threads = []
start = []
end = []
def functie():
start.append(time.clock_gettime_ns())
URL = "http://10.250.100.170:9082/SPVWS2/rest/listaMesaje"
r = requests.get(url = URL)
data = r.json()
end.append(time.clock_gettime_ns())
for i in range(5):
start.append(time.clock_gettime_ns())
t = threading.Thread(target=functie)
threads.append(t)
t.start()
for (i,t) in enumerate(threads):
t.join()
print(end[i]-start[i])
The other answer would produce incorrect results. If the first thread takes longer than the second, the time of the second will be recorded as the same as the first. This is because the end times are recorded sequentially after each join finishes rather than when the thread's target function actually finishes which may be in any order.
A better way would be to wrap the target functions of the threads with code that does this:
def thread_time(target):
def wrapper(*args, **kwargs):
st = time.time()
try:
return target(*args, **kwargs)
finally:
et = time.time()
print(et - st)
threading.currentThread().duration = et - st
return wrapper
def functie():
print "starting"
time.sleep(1)
print "ending"
t = threading.Thread(target=thread_time(functie))
t.start()
t.join()
print(t.duration)
I am currently making a http request that has over 3,000,000 records by using the pagination method. Sometimes the call fails because of a 104 server error, so I retry and it works on the second or third time.
Because there are so many requests, I am using the multiprocess function in python to speed this along. I'm using a ubuntu 16 machine, python3.5 and an 8 core machine. The odd thing here is that all the files get written, and the process "Finishes" i.e reaches the end of the range (regardless of the size so 1 million or 2 million or 3 million) but it wont pass the pool line. So my tmux sessions just says "Working on date (lastrecordnumber)" I need that to occur so I can send an email to let me know the task has finished.
I've tried pool.map(); pool.aysnc(); pool.map_async(), they all seem to have the same issue.
import http.client
from multiprocessing import Pool
from functools import partial
def get_raw_data(auth, url_conn, skip):
headers = {'authorization': "Basic {}".format(auth)}
sucess = None
loop = 0
while not sucess:
try:
conn = http.client.HTTPSConnection(url_conn)
conn.request("GET", "SOME_API&$skip={}".format(skip), headers=headers)
res = conn.getresponse()
data = res.read()
raw_data = json.loads(data.decode("utf-8"))
sucess = 'yes'
except Exception as e:
print('stuck in loop {} {} {}'.format(skip, loop, e))
loop += 1
with open('{}.json'.format(skip), 'w') as outfile:
json.dump(raw_data, outfile)
def process_skips(skip):
print('Working on date {}'.format(skip))
get_raw_data(skip)
if __name__ == '__main__':
print("We started at {}".format(dt.datetime.now()))
n = range(0,3597351,5000)
n = list(n)
pool = Pool(8)
pool.map_async(process_skips, n)
pool.close()
pool.join()
Using pool as a context manager using with which takes care of closing/joining the processes and seems to be the preferred method in the docs.
if __name__ == '__main__':
print("We started at {}".format(dt.datetime.now()))
n = list(range(0,3597351,5000))
with Pool(8) as pool:
pool.map_async(process_skips, n)
If your main process is working and writing your file correctly, that should make your processes close out correctly.
I expect to have maybe something like 100k URLs from different domains. I wrote this code which has a list of URLs in all_urls and forms N threads to run in one batch. Currently I'm using threading module to make these requests in parallel.
import requests
import os
import threading
import time
all_urls = [] # a list of URLs to request, can have up to 100k
global success, fail
success = 0
fail = 0
def func(url_to_request):
global success, fail
try:
r = requests.get(url_to_request, timeout=5)
c = r.content
success = success +1
except:
fail = fail +1
return
batch_count = 1
N = 200 # number of threads
all_threads_urls = []
time_start = time.time()
for item in all_urls:
all_threads_urls.append(item)
if all_urls.index(item) == len(all_urls)-1 or len(all_threads_urls) == N:
# call it
all_threads = []
for link in all_threads_urls:
current_thread = threading.Thread(target=func, args=(link,))
all_threads.append(current_thread)
current_thread.start()
for thr in all_threads:
thr.join()
all_threads_urls = [] # for the next batch
time_end = time.time()
print "Request number", all_urls.index(item)+1, "Good:", success, "Bad:", fail, "Duration:", round(time_end - time_start,2 ), "seconds."
time_start = time_end
Results for this are a bit weird, it seems that the script starts very fast but then slows down a lot (see image). Printed durations are for each batch.
Can someone explain what is the bottleneck here? Is there maybe a better module for this or there is no way around this?
So I've started learning python now, and I absolutely am in love with it.
I'm building a small scale facebook data scraper. Basically, it will use the Graph API and scrape the first names of the specified number of users. It works fine in a single thread (or no thread I guess).
I used online tutorials to come up with the following multithreaded version (updated code):
import requests
import json
import time
import threading
import Queue
GraphURL = 'http://graph.facebook.com/'
first_names = {} # will store first names and their counts
queue = Queue.Queue()
def getOneUser(url):
http_response = requests.get(url) # open the request URL
if http_response.status_code == 200:
data = http_response.text.encode('utf-8', 'ignore') # Get the text of response, and encode it
json_obj = json.loads(data) # load it as a json object
# name = json_obj['name']
return json_obj['first_name']
# last = json_obj['last_name']
return None
class ThreadGet(threading.Thread):
""" Threaded name scraper """
def __init__(self, queue):
threading.Thread.__init__(self)
self.queue = queue
def run(self):
while True:
#print 'thread started\n'
url = GraphURL + str(self.queue.get())
first = getOneUser(url) # get one user's first name
if first is not None:
if first_names.has_key(first): # if name has been encountered before
first_names[first] = first_names[first] + 1 # increment the count
else:
first_names[first] = 1 # add the new name
self.queue.task_done()
#print 'thread ended\n'
def main():
start = time.time()
for i in range(6):
t = ThreadGet(queue)
t.setDaemon(True)
t.start()
for i in range(100):
queue.put(i)
queue.join()
for name in first_names.keys():
print name + ': ' + str(first_names[name])
print '----------------------------------------------------------------'
print '================================================================'
# Print top first names
for key in first_names.keys():
if first_names[key] > 2:
print key + ': ' + str(first_names[key])
print 'It took ' + str(time.time()-start) + 's'
main()
To be honest, I don't understand some of the parts of the code but I get the main idea. The output is nothing. I mean the shell has nothing in it, so I believe it keeps on running.
So what I am doing is filling queue with integers that are the user id's on fb. Then each ID is used to build the api call URL. getOneUser returns the name of one user at a time. That task (ID) is marked as 'done' and it moves on.
What is wrong with the code above?
Your usage of first_names is not thread-safe. You could add a lock to protect the increment. Otherwise the code should work. You might be hitting some facebook api limit i.e., you should limit your request rate.
You could simplify your code by using a thread pool and counting the names in the main thread:
#!/usr/bin/env python
import json
import urllib2
from collections import Counter
from multiprocessing.dummy import Pool # use threads
def get_name(url):
try:
return json.load(urllib2.urlopen(url))['first_name']
except Exception:
return None # error
urls = ('http://graph.facebook.com/%d' % i for i in xrange(100))
p = Pool(5) # 5 concurrent connections
first_names = Counter(p.imap_unordered(get_name, urls))
print first_names.most_common()
To see what errors you get, you could add logging:
#!/usr/bin/env python
import json
import logging
import urllib2
from collections import Counter
from multiprocessing.dummy import Pool # use threads
logging.basicConfig(level=logging.DEBUG,
format="%(asctime)s %(threadName)s %(message)s")
def get_name(url):
try:
name = json.load(urllib2.urlopen(url))['first_name']
except Exception as e:
logging.debug('error: %s url: %s', e, url)
return None # error
else:
logging.debug('done url: %s', url)
return name
urls = ('http://graph.facebook.com/%d' % i for i in xrange(100))
p = Pool(5) # 5 concurrent connections
first_names = Counter(p.imap_unordered(get_name, urls))
print first_names.most_common()
A simple way to limit number of requests per given time period is to use a semaphore:
#!/usr/bin/env python
import json
import logging
import time
import urllib2
from collections import Counter
from multiprocessing.dummy import Pool # use threads
from threading import _BoundedSemaphore as BoundedSemaphore, Timer
logging.basicConfig(level=logging.DEBUG,
format="%(asctime)s %(threadName)s %(message)s")
class RatedSemaphore(BoundedSemaphore):
"""Limit to 1 request per `period / value` seconds (over long run)."""
def __init__(self, value=1, period=1):
BoundedSemaphore.__init__(self, value)
t = Timer(period, self._add_token_loop,
kwargs=dict(time_delta=float(period) / value))
t.daemon = True
t.start()
def _add_token_loop(self, time_delta):
"""Add token every time_delta seconds."""
while True:
try:
BoundedSemaphore.release(self)
except ValueError: # ignore if already max possible value
pass
time.sleep(time_delta) # ignore EINTR
def release(self):
pass # do nothing (only time-based release() is allowed)
def get_name(gid, rate_limit=RatedSemaphore(value=100, period=600)):
url = 'http://graph.facebook.com/%d' % gid
try:
with rate_limit:
name = json.load(urllib2.urlopen(url))['first_name']
except Exception as e:
logging.debug('error: %s url: %s', e, url)
return None # error
else:
logging.debug('done url: %s', url)
return name
p = Pool(5) # 5 concurrent connections
first_names = Counter(p.imap_unordered(get_name, xrange(200)))
print first_names.most_common()
After the initial burst, it should make a single request every 6 seconds.
Consider using batch requests.
Your original run function only processed one item from the queue. In all you've only removed 5 items from the queue.
Usually run functions look like
run(self):
while True:
doUsefulWork()
i.e. they have a loop which causes the recurring work to be done.
[Edit] OP edited code to include this change.
Some other useful things to try:
Add a print statement into the run function: you'll find that it is only called 5 times.
Remove the queue.join() call, this is what is causing the module to block, then you will be able to probe the state of the queue.
put the entire body of run into a function. Verify that you can use that function in a single threaded manner to get the desired results, then
try it with just a single worker thread, then finally go for
multiple worker threads.