Count total CPU instructions issued in Python - python

I need to test the performance of code in a way that is objective and roughly the same across all machines. Timing the code does not work since it's tied to your or my machine's specs, but counting the instructions issued by the CPU does (with minor differences).
I can use strace on Linux, but my god it's slow, and I just want a total, not individual calls.
Say:
def foo(bar):
    for i in range(bar):
        print(i)

foo(10)
This will execute at different speeds on different machines (bear with me, and imagine a more complicated algorithm). But the amount of work done is the same: 10 I/O operations. This matters because on a faster computer you won't notice a millisecond of difference in something that might take 5 seconds on my machine.
Is there a way to count the number of CPU instructions executed in Python?
I'm asking because I want to know if a refactor will 2x my CPU instructions.
Thank you.

You can use the Python profiler cProfile:
$ python -m cProfile euler048.py
1007 function calls in 0.061 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.061 0.061 <string>:1(<module>)
1000 0.051 0.000 0.051 0.000 euler048.py:2(<lambda>)
1 0.005 0.005 0.061 0.061 euler048.py:2(<module>)
1 0.000 0.000 0.061 0.061 {execfile}
1 0.002 0.002 0.053 0.053 {map}
1 0.000 0.000 0.000 0.000 {method someMethod}
1 0.000 0.000 0.000 0.000 {range}
1 0.003 0.003 0.003 0.003 {sum}
This is an excerpt from the previous question I linked; I hope it helps.
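If you would rather drive the profiler from inside a script than from the command line, here is a minimal sketch using the foo example from the question (the dump file name 'foo.prof' and the sort key are just illustrative). Note that cProfile counts function calls and time, not CPU instructions, but the call counts are deterministic for the same code path, which gives a machine-independent number to compare before and after a refactor:

import cProfile
import pstats

def foo(bar):
    for i in range(bar):
        print(i)

# Profile one call and dump the raw stats to a file.
cProfile.run('foo(10)', 'foo.prof')

# Load the dump and sort by call count rather than time.
stats = pstats.Stats('foo.prof')
stats.sort_stats('calls').print_stats()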

Related

What is "{method 'recv_into' of '_socket.socket' objects}" in the cProfile result? How can I reduce its time consumption?

This is the profiling result of my Python code.
As you can see below, method 'recv_into' of '_socket.socket' objects takes too much time (17.265 s tottime).
What is it? And is there any way to reduce its time?
When is it called?
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.402 0.402 37.668 37.668 c:\Users\user\Google 드라이브\Business\Project\Jessica Project\jessica-1\simulation\simulatorW.py:239(backtestWithArgumentsList)
1 0.173 0.173 26.762 26.762 c:\Users\user\Google 드라이브\Business\Project\Jessica Project\jessica-1\simulation\simulatorW.py:110(getPrices)
1 0.000 0.000 26.588 26.588 c:\Users\user\Google 드라이브\Business\Project\Jessica Project\jessica-1\dto\__init__.py:5(__init__)
1 1.734 1.734 25.380 25.380 c:\Users\user\Google 드라이브\Business\Project\Jessica Project\jessica-1\dto\__init__.py:21(priceInfoListToDeque)
815679 2.204 0.000 23.473 0.000 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\cursor.py:1152(next)
13 0.021 0.002 20.631 1.587 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\cursor.py:1039(_refresh)
12 0.008 0.001 20.609 1.717 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\cursor.py:937(__send_message)
12 0.000 0.000 20.601 1.717 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\mongo_client.py:1306(_run_operation_with_response)
12 0.000 0.000 20.601 1.717 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\mongo_client.py:1437(_retryable_read)
12 0.000 0.000 20.597 1.716 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\mongo_client.py:1334(_cmd)
12 0.001 0.000 20.597 1.716 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\server.py:70(run_operation_with_response)
18 0.001 0.000 17.386 0.966 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\network.py:192(receive_message)
12 0.013 0.001 17.379 1.448 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\pool.py:637(receive_message)
36 0.066 0.002 17.331 0.481 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\network.py:249(_receive_data_on_socket)
19984 17.265 0.001 17.265 0.001 {method 'recv_into' of '_socket.socket' objects}
1 2.499 2.499 6.522 6.522 c:\Users\user\Google 드라이브\Business\Project\Jessica Project\jessica-1\simulation\simulatorW.py:138(filterIndicesWithTimeCondition)
It's a low-level networking call: this is the time spent reading whatever you are loading. Take a look at its callers:
p.print_callers("{method 'recv_into' of '_socket.socket' objects}")
Keep going up the caller tree, picking the entries with the longest times. Remember that the restriction is a regexp, so use escapes when necessary:
p.sort_stats("tottime").print_callers("api.py:104\(post\)")
The top 4 lines are more interesting than the recv_into one; if you go up the caller tree, you're likely to end up in one of them. There could be many ways to optimize those, since no details are provided: caching, compressing, fetching only what you need, and otherwise reducing the network footprint.
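As a concrete starting point, a minimal sketch of that caller walk, assuming the profile was saved to a dump file (the file name "profile.out" is a placeholder):

import pstats

# Assumes the run was profiled with: python -m cProfile -o profile.out simulatorW.py
p = pstats.Stats("profile.out")

# Who calls the low-level socket read? These callers are the frames worth optimizing.
p.sort_stats("cumulative").print_callers("recv_into")

# Then repeat for the heavier callers you find, e.g. the pymongo cursor method.
# The restriction is a regular expression, so escape the parentheses.
p.print_callers(r"cursor.py:1152\(next\)")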

How to extract useful info from cProfile with Pandas and Numpy?

I have some Python code that is generating a large data set via numerical simulation. The code is using Numpy for a lot of the calculations and Pandas for a lot of the top-level data. The data sets are large so the code is running slowly, and now I'm trying to see if I can use cProfile to find and fix some hot spots.
The trouble is that cProfile is identifying a lot of the hot spots as pieces of code within Pandas, within Numpy, and/or Python builtins. Here are the cProfile statistics sorted by 'tottime' (total time within the function itself). Note that I'm obscuring the project name and file names, since the code itself is not owned by me and I don't have permission to share details.
foo.sort_stats('tottime').print_stats(50)
Wed Jun 5 13:18:28 2019 c:\localwork\xxxxxx\profile_data
297514385 function calls (291105230 primitive calls) in 306.898 seconds
Ordered by: internal time
List reduced from 4141 to 50 due to restriction <50>
ncalls tottime percall cumtime percall filename:lineno(function)
281307 31.918 0.000 34.731 0.000 {pandas._libs.lib.infer_dtype}
800 31.443 0.039 31.476 0.039 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\numpy\lib\function_base.py:4703(delete)
109668 23.837 0.000 23.837 0.000 {method 'clear' of 'dict' objects}
153481 19.369 0.000 19.369 0.000 {method 'ravel' of 'numpy.ndarray' objects}
5861614 14.182 0.000 78.492 0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexes\base.py:3090(get_value)
5861614 8.891 0.000 8.891 0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}
5861614 8.376 0.000 99.084 0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\series.py:764(__getitem__)
26840695 7.032 0.000 11.009 0.000 {built-in method builtins.isinstance}
26489324 6.547 0.000 14.410 0.000 {built-in method builtins.getattr}
11846279 6.177 0.000 19.809 0.000 {pandas._libs.lib.values_from_object}
[...]
Is there a sensible way for me to figure out which parts of my code are excessively leaning on these library functions and built-ins? I anticipate one answer would be "look at the cumulative time statistics; that will probably indicate where these costly calls are originating". The cumulative times give a little bit of insight:
foo.sort_stats('cumulative').print_stats(50)
Wed Jun 5 13:18:28 2019 c:\localwork\xxxxxx\profile_data
297514385 function calls (291105230 primitive calls) in 306.898 seconds
Ordered by: cumulative time
List reduced from 4141 to 50 due to restriction <50>
ncalls tottime percall cumtime percall filename:lineno(function)
643/1 0.007 0.000 307.043 307.043 {built-in method builtins.exec}
1 0.000 0.000 307.043 307.043 xxxxxx.py:1(<module>)
1 0.002 0.002 306.014 306.014 xxxxxx.py:264(write_xxx_data)
1 0.187 0.187 305.991 305.991 xxxxxx.py:256(write_yyyy_data)
1 0.077 0.077 305.797 305.797 xxxxxx.py:250(make_zzzzzzz)
1 0.108 0.108 187.845 187.845 xxxxxx.py:224(generate_xyzxyz)
108223 1.977 0.000 142.816 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexing.py:298(_setitem_with_indexer)
1 0.799 0.799 126.733 126.733 xxxxxx.py:63(populate_abcabc_data)
1 0.030 0.030 117.874 117.874 xxxxxx.py:253(<listcomp>)
7201 0.077 0.000 116.612 0.016 C:\LocalWork\xxxxxx\yyyyyyyyyyyy.py:234(xxx_yyyyyy)
108021 0.497 0.000 112.908 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexing.py:182(__setitem__)
5861614 8.376 0.000 99.084 0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\series.py:764(__getitem__)
110024 0.917 0.000 81.210 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\internals.py:3500(apply)
108021 0.185 0.000 80.685 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\internals.py:3692(setitem)
5861614 14.182 0.000 78.492 0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexes\base.py:3090(get_value)
108021 1.887 0.000 73.064 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\internals.py:819(setitem)
[...]
Is there a good way to pin down the hot spots -- better than "crawl through xxxxxx.py and search for all the places where Pandas might be inferring a datatype, and where Numpy might be deleting objects"?
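One way to get that view without crawling through the source is to keep using the same pstats object but restrict the listings: filter the cumulative view to frames from your own script (so library time is charged to the application-level functions that call it), and ask for the callers of one specific library hot spot. A minimal sketch, assuming the stats were dumped to the profile_data file shown in the listing header:

import pstats

# Path taken from the listing header; adjust to wherever your dump file lives.
foo = pstats.Stats(r"c:\localwork\xxxxxx\profile_data")

# Show cumulative times only for frames in your own script, so expensive library
# calls are attributed to the application functions that invoke them.
foo.sort_stats('cumulative').print_stats('xxxxxx.py')

# For one specific hot spot, list its callers (the restriction is a regexp).
foo.sort_stats('tottime').print_callers('infer_dtype')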

Initializing a large number of objects in Python: time complexity?

I have an application which requires initializing a large number of objects in Python (3.5.2), and I encounter occasional slow-downs.
The slow-down seems to occur on a specific initialization: most of the calls to __init__ take a negligible amount of time, but one of them sometimes lasts several dozen seconds.
I've been able to reproduce this using the following snippet, which initializes 500k simple objects.
import cProfile

class A:
    def __init__(self):
        pass

cProfile.run('[A() for _ in range(500000)]')
I'm running this code in a notebook. Most of the time (9 runs out of 10), it produces the following output (normal execution):
500004 function calls in 0.675 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
500000 0.031 0.000 0.031 0.000 <ipython-input-5-634b77609653>:2(__init__)
1 0.627 0.627 0.657 0.657 <string>:1(<listcomp>)
1 0.018 0.018 0.675 0.675 <string>:1(<module>)
1 0.000 0.000 0.675 0.675 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
The other times, it outputs the following (slow execution):
500004 function calls in 40.154 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
500000 0.031 0.000 0.031 0.000 <ipython-input-74-634b77609653>:2(__init__)
1 40.110 40.110 40.140 40.140 <string>:1(<listcomp>)
1 0.014 0.014 40.154 40.154 <string>:1(<module>)
1 0.000 0.000 40.154 40.154 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Using tqdm, the loop seems to get stuck on one iteration. It's important to note that I was able to reproduce this in a notebook that already had a lot of memory allocated.
I suspect that it comes from the list of references to objects kept by the garbage collector, which might need to be copied from time to time.
What exactly is happening here, and is there any way to avoid it?
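One cheap way to test the garbage-collector hypothesis raised in the question is to disable the cyclic collector around the allocation burst and see whether the multi-second outlier disappears. A minimal sketch of that experiment (it only diagnoses the pause; it is not offered as a general fix):

import cProfile
import gc

class A:
    def __init__(self):
        pass

# Disable the cyclic garbage collector for the duration of the allocation burst.
# If the occasional ~40-second run no longer occurs, the pauses are GC-related.
gc.disable()
try:
    cProfile.run('[A() for _ in range(500000)]')
finally:
    gc.enable()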

Increase the speed of my code

I have created the code below. It takes a series of values and generates 10 numbers between x and r with an average value of 8000.
In order to meet the specification of covering the range as well as possible, I also calculate the standard deviation, which is a good measure of spread. So whenever a sample set meets the criterion of a mean of 8000, I compare it to previous matches and keep the samples that have the highest standard deviation (the mean is always 8000).
def node_timing(average_block_response_computational_time, min_block_response_computational_time, max_block_response_computational_time):
    sample_count = 10
    num_of_trials = 1
    # print average_block_response_computational_time
    # print min_block_response_computational_time
    # print max_block_response_computational_time
    target_sum = sample_count * average_block_response_computational_time
    samples_list = []
    curr_stdev_max = 0
    for trials in range(num_of_trials):
        samples = [0] * sample_count
        while sum(samples) != target_sum:
            samples = [rd.randint(min_block_response_computational_time, max_block_response_computational_time) for trial in range(sample_count)]
            # print ("Mean: ", st.mean(samples), "Std Dev: ", st.stdev(samples), )
            # print (samples, "\n")
            if st.stdev(samples) > curr_stdev_max:
                curr_stdev_max = st.stdev(samples)
                samples_best = samples[:]
    return samples_best[0]
I take the first value in the list and use it as a timing value. However, this code is REALLY slow, and I need to call it several thousand times during the simulation, so I need to improve its efficiency somehow.
Does anyone have suggestions on how to do that?
To see where we'd get the best speed improvements, I started by profiling your code.
import cProfile

pr = cProfile.Profile()
pr.enable()
for i in range(100):
    print(node_timing(8000, 7000, 9000))
pr.disable()
pr.print_stats(sort='time')
The top of the results show where your code is spending most of its time:
23561178 function calls (23561176 primitive calls) in 10.612 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
4502300 3.694 0.000 7.258 0.000 random.py:172(randrange)
4502300 2.579 0.000 3.563 0.000 random.py:222(_randbelow)
4502300 1.533 0.000 8.791 0.000 random.py:216(randint)
450230 1.175 0.000 9.966 0.000 counter.py:19(<listcomp>)
4608421 0.690 0.000 0.690 0.000 {method 'getrandbits' of '_random.Random' objects}
100 0.453 0.005 10.596 0.106 counter.py:5(node_timing)
4502300 0.294 0.000 0.294 0.000 {method 'bit_length' of 'int' objects}
450930 0.141 0.000 0.150 0.000 {built-in method builtins.sum}
100 0.016 0.000 0.016 0.000 {built-in method builtins.print}
600 0.007 0.000 0.025 0.000 statistics.py:105(_sum)
2200 0.005 0.000 0.006 0.000 fractions.py:84(__new__)
...
From this output, we can see that we're spending ~7.5 seconds (out of 10.6 seconds) generating random numbers. Therefore, the only way to make this noticeably faster is to generate fewer random numbers or generate them faster. You're already using the standard (non-cryptographic) random number generator, so I don't see a way to make generating the numbers themselves faster. However, we can fudge the algorithm a bit and drastically reduce the number of values we need to generate.
Instead of only accepting samples with a mean of exactly 8000, what if we accepted samples with a mean of 8000 ± 0.1% (i.e., a mean between 7992 and 8008)? By being a tiny bit inexact, we can drastically speed up the algorithm. I replaced the while condition with:
while abs(sum(samples) - target_sum) > epsilon:
where epsilon = target_sum * 0.001. Then I ran the script again and got much better profiler numbers:
232439 function calls (232437 primitive calls) in 0.163 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
100 0.032 0.000 0.032 0.000 {built-in method builtins.print}
31550 0.026 0.000 0.053 0.000 random.py:172(randrange)
31550 0.019 0.000 0.027 0.000 random.py:222(_randbelow)
31550 0.011 0.000 0.064 0.000 random.py:216(randint)
4696 0.010 0.000 0.013 0.000 fractions.py:84(__new__)
3155 0.008 0.000 0.073 0.000 counter.py:19(<listcomp>)
600 0.008 0.000 0.039 0.000 statistics.py:105(_sum)
100 0.006 0.000 0.131 0.001 counter.py:4(node_timing)
32293 0.005 0.000 0.005 0.000 {method 'getrandbits' of '_random.Random' objects}
1848 0.004 0.000 0.009 0.000 fractions.py:401(_add)
Allowing the mean to be up to 0.1% off the target dropped the number of calls to randint by roughly 100x. Naturally, the code also runs about 100x faster (and now spends most of its time printing to the console).
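For reference, a sketch of node_timing with only that change applied (parameter names are kept from the question; the num_of_trials loop is dropped since it was always 1):

import random as rd
import statistics as st

def node_timing(average_block_response_computational_time,
                min_block_response_computational_time,
                max_block_response_computational_time):
    sample_count = 10
    target_sum = sample_count * average_block_response_computational_time
    epsilon = target_sum * 0.001          # accept a mean within 0.1% of the target
    curr_stdev_max = 0
    samples_best = None
    samples = [0] * sample_count
    # Resample until the sum is close enough to the target instead of exactly equal.
    while abs(sum(samples) - target_sum) > epsilon:
        samples = [rd.randint(min_block_response_computational_time,
                              max_block_response_computational_time)
                   for _ in range(sample_count)]
        stdev = st.stdev(samples)         # compute once and reuse
        if stdev > curr_stdev_max:
            curr_stdev_max = stdev
            samples_best = samples[:]
    return samples_best[0]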

Multi-processing in Python: Numpy + Vector Summation -> Huge Slowdown

Please don't be discouraged by the long post. I try to present as much data as I can, and I really need help with this problem. I'll update daily if there are new tips or ideas.
Problem:
I'm trying to run Python code in parallel on a two-core machine using separate processes (to avoid the GIL), but the code slows down significantly. For example, a run on a one-core machine takes 600 sec per workload, but a run on a two-core machine takes 1600 sec (800 sec per workload).
What I already tried:
I measured the memory, and there appears to be no memory problem [just using 20% at the high point].
I used “htop” to check whether I am really running the program on different cores, or whether my core affinity is messed up. No luck either; my program is running on all of my cores.
The problem is a CPU-bounded problem, and so I checked and confirmed that my code is running at 100% CPU on all cores, most of the time.
I checked the process ID’s and I am, indeed, spawning two different processes.
I changed the function which I submit to the executor [ e.submit(function,[…]) ] to a calculate-pi function and observed a huge speedup. So the problem is probably in my process_function(…) that I submit to the executor, not in the code before it.
Currently I'm using "futures" from "concurrent" to parallelize the task. I also tried the "Pool" class from "multiprocessing", but the result remained the same.
Code:
Spawn processes:
result = [None] * psutil.cpu_count()
e = futures.ProcessPoolExecutor(max_workers=psutil.cpu_count())
for i in range(psutil.cpu_count()):
    result[i] = e.submit(process_function, ...)
process_function:
from math import floor
from math import ceil
import numpy
import MySQLdb
import time

db = MySQLdb.connect(...)
cursor = db.cursor()
query = "SELECT ...."
cursor.execute(query)

[...] #save db results into the variable db_matrix (30 columns, 5.000 rows)
[...] #save db results into the variable bp_vector (3 columns, 500 rows)
[...] #save db results into the variable option_vector (3 columns, 4000 rows)

cursor.close()
db.close()

counter = 0
for i in range(4000):
    for j in range(500):
        helper[:] = ((1 - bp_vector[j,0] - bp_vector[j,1] - bp_vector[j,2]) * db_matrix[:,0]
                     + db_matrix[:,option_vector[i,0]] * bp_vector[j,0]
                     + db_matrix[:,option_vector[i,1]] * bp_vector[j,1]
                     + db_matrix[:,option_vector[i,2]] * bp_vector[j,2])
        result[counter,0] = (helper < -7.55).sum()
        counter = counter + 1
return result
My guess:
My guess is that, for some reason, the weighted vector multiplication which creates the vector "helper" is causing problems. [I believe the Time Measurement section confirms this guess.]
Could it be that numpy creates these problems? Is numpy compatible with multi-processing? If not, what can I do? [Already answered in the comments]
Could it be because of the cache memory? I read about it on the forum, but to be honest I didn't really understand it. But if the problem is rooted there, I will make myself familiar with this topic.
Time Measurement: (edit)
One core: time to get the data from the db: 8 sec.
Two core: time to get the data from the db: 12 sec.
One core: time to do the double-loop in the process_function: ~ 640 sec.
Two core: time to do the double-loop in the process_function: ~ 1600 sec
Update: (edit)
When I measure the time with two processes for every 100 i's in the loop, I see that it is roughly 220% of the time I observe when I measure the same thing running only one process. But what is even more mysterious is that if I kill one process during the run, the other process speeds up! It actually speeds up to the same level it had during the solo run. So there must be some dependency between the processes that I just don't see at the moment.
Update-2: (edit)
So, I did a few more test runs and measurements. In the test runs, I used as compute instances either a one-core Linux machine (n1-standard-1, 1 vCPU, 3.75 GB memory) or a two-core Linux machine (n1-standard-2, 2 vCPUs, 7.5 GB memory) from Google Cloud Compute Engine. However, I also ran tests on my local computer and observed roughly the same results (so the virtualized environment should be fine). Here are the results:
P.S.: The times here differ from the measurements above because I limited the loop a little and did the testing on Google Cloud instead of on my home PC.
1-core machine, started 1 process:
time: 225sec , CPU utilization: ~100%
1-core machine, started 2 processes:
time: 557sec , CPU utilization: ~100%
1-core machine, started 1 process, limited max. CPU-utilization to 50%:
time: 488sec , CPU utilization: ~50%
2-core machine, started 2 processes:
time: 665sec , CPU-1 utilization: ~100% , CPU-2 utilization: ~100%
the processes did not jump between the cores; each used one core
(at least, htop displayed these results in the “Process” column)
2-core machine, started 1 process:
time: 222sec , CPU-1 utilization: ~100% (0%) , CPU-2 utilization: ~0% (100%)
however, the process sometimes jumped between the cores
2-core machine, started 1 process, limited max. CPU-utilization to 50%:
time: 493sec , CPU-1 utilization: ~50% (0%) , CPU-2 utilization: ~0% (100%)
however, the process jumped extremely often between the cores
I used "htop" and the python module "time" to obtain these results.
Update - 3: (edit)
I used cProfile to profile my code:
python -m cProfile -s cumtime fun_name.py
The files are too long to post here, but I believe that if they contain valuable information at all, it is probably near the top of the output. Therefore, I will post only the first lines of the results here:
1-core machine, started 1 process:
623158 function calls (622735 primitive calls) in 229.286 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.371 0.371 229.287 229.287 20_with_multiprocessing.py:1(<module>)
3 0.000 0.000 225.082 75.027 threading.py:309(wait)
1 0.000 0.000 225.082 225.082 _base.py:378(result)
25 225.082 9.003 225.082 9.003 {method 'acquire' of 'thread.lock' objects}
1 0.598 0.598 3.081 3.081 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
3 0.000 0.000 2.877 0.959 cursors.py:164(execute)
3 0.000 0.000 2.877 0.959 cursors.py:353(_query)
3 0.000 0.000 1.958 0.653 cursors.py:315(_do_query)
3 0.000 0.000 1.943 0.648 cursors.py:142(_do_get_result)
3 0.000 0.000 1.943 0.648 cursors.py:351(_get_result)
3 1.943 0.648 1.943 0.648 {method 'store_result' of '_mysql.connection' objects}
3 0.001 0.000 0.919 0.306 cursors.py:358(_post_get_result)
3 0.000 0.000 0.917 0.306 cursors.py:324(_fetch_row)
3 0.917 0.306 0.917 0.306 {built-in method fetch_row}
591314 0.161 0.000 0.161 0.000 {range}
1-core machine, started 2 processes:
626052 function calls (625616 primitive calls) in 578.086 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.310 0.310 578.087 578.087 20_with_multiprocessing.py:1(<module>)
30 574.310 19.144 574.310 19.144 {method 'acquire' of 'thread.lock' objects}
2 0.000 0.000 574.310 287.155 _base.py:378(result)
3 0.000 0.000 574.310 191.437 threading.py:309(wait)
1 0.544 0.544 2.854 2.854 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
3 0.000 0.000 2.563 0.854 cursors.py:164(execute)
3 0.000 0.000 2.563 0.854 cursors.py:353(_query)
3 0.000 0.000 1.715 0.572 cursors.py:315(_do_query)
3 0.000 0.000 1.701 0.567 cursors.py:142(_do_get_result)
3 0.000 0.000 1.701 0.567 cursors.py:351(_get_result)
3 1.701 0.567 1.701 0.567 {method 'store_result' of '_mysql.connection' objects}
3 0.001 0.000 0.848 0.283 cursors.py:358(_post_get_result)
3 0.000 0.000 0.847 0.282 cursors.py:324(_fetch_row)
3 0.847 0.282 0.847 0.282 {built-in method fetch_row}
591343 0.152 0.000 0.152 0.000 {range}
2-core machine, started 1 process:
623164 function calls (622741 primitive calls) in 235.954 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.246 0.246 235.955 235.955 20_with_multiprocessing.py:1(<module>)
3 0.000 0.000 232.003 77.334 threading.py:309(wait)
25 232.003 9.280 232.003 9.280 {method 'acquire' of 'thread.lock' objects}
1 0.000 0.000 232.003 232.003 _base.py:378(result)
1 0.593 0.593 3.104 3.104 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
3 0.000 0.000 2.774 0.925 cursors.py:164(execute)
3 0.000 0.000 2.774 0.925 cursors.py:353(_query)
3 0.000 0.000 1.981 0.660 cursors.py:315(_do_query)
3 0.000 0.000 1.970 0.657 cursors.py:142(_do_get_result)
3 0.000 0.000 1.969 0.656 cursors.py:351(_get_result)
3 1.969 0.656 1.969 0.656 {method 'store_result' of '_mysql.connection' objects}
3 0.001 0.000 0.794 0.265 cursors.py:358(_post_get_result)
3 0.000 0.000 0.792 0.264 cursors.py:324(_fetch_row)
3 0.792 0.264 0.792 0.264 {built-in method fetch_row}
591314 0.144 0.000 0.144 0.000 {range}
2-core machine, started 2 processes:
626072 function calls (625636 primitive calls) in 682.460 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.334 0.334 682.461 682.461 20_with_multiprocessing.py:1(<module>)
4 0.000 0.000 678.231 169.558 threading.py:309(wait)
33 678.230 20.552 678.230 20.552 {method 'acquire' of 'thread.lock' objects}
2 0.000 0.000 678.230 339.115 _base.py:378(result)
1 0.527 0.527 2.974 2.974 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
3 0.000 0.000 2.723 0.908 cursors.py:164(execute)
3 0.000 0.000 2.723 0.908 cursors.py:353(_query)
3 0.000 0.000 1.749 0.583 cursors.py:315(_do_query)
3 0.000 0.000 1.736 0.579 cursors.py:142(_do_get_result)
3 0.000 0.000 1.736 0.579 cursors.py:351(_get_result)
3 1.736 0.579 1.736 0.579 {method 'store_result' of '_mysql.connection' objects}
3 0.001 0.000 0.975 0.325 cursors.py:358(_post_get_result)
3 0.000 0.000 0.973 0.324 cursors.py:324(_fetch_row)
3 0.973 0.324 0.973 0.324 {built-in method fetch_row}
5 0.093 0.019 0.304 0.061 __init__.py:1(<module>)
1 0.017 0.017 0.275 0.275 __init__.py:106(<module>)
1 0.005 0.005 0.198 0.198 add_newdocs.py:10(<module>)
591343 0.148 0.000 0.148 0.000 {range}
Personally, I don't really know what to do with these results. I would be glad to receive tips, hints, or any other help - thanks :)
Reply to Answer-1: (edit)
Roland Smith looked at the data and suggested that multiprocessing might hurt performance more than it helps. Therefore, I did one more measurement without multiprocessing, using code like what he suggested below:
Am I right in concluding that this is not the case, given that the measured times appear similar to the times measured before with multiprocessing?
1-core machine:
Database access took 2.53 seconds
Matrix manipulation took 236.71 seconds
1842384 function calls (1841974 primitive calls) in 241.114 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 219.036 219.036 241.115 241.115 20_with_multiprocessing.py:1(<module>)
406000 0.873 0.000 18.097 0.000 {method 'sum' of 'numpy.ndarray' objects}
406000 0.502 0.000 17.224 0.000 _methods.py:31(_sum)
406001 16.722 0.000 16.722 0.000 {method 'reduce' of 'numpy.ufunc' objects}
1 0.587 0.587 3.222 3.222 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
3 0.000 0.000 2.964 0.988 cursors.py:164(execute)
3 0.000 0.000 2.964 0.988 cursors.py:353(_query)
3 0.000 0.000 1.958 0.653 cursors.py:315(_do_query)
3 0.000 0.000 1.944 0.648 cursors.py:142(_do_get_result)
3 0.000 0.000 1.944 0.648 cursors.py:351(_get_result)
3 1.944 0.648 1.944 0.648 {method 'store_result' of '_mysql.connection' objects}
3 0.001 0.000 1.006 0.335 cursors.py:358(_post_get_result)
3 0.000 0.000 1.005 0.335 cursors.py:324(_fetch_row)
3 1.005 0.335 1.005 0.335 {built-in method fetch_row}
591285 0.158 0.000 0.158 0.000 {range}
2-core machine:
Database access took 2.32 seconds
Matrix manipulation took 242.45 seconds
1842390 function calls (1841980 primitive calls) in 246.535 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 224.705 224.705 246.536 246.536 20_with_multiprocessing.py:1(<module>)
406000 0.911 0.000 17.971 0.000 {method 'sum' of 'numpy.ndarray' objects}
406000 0.526 0.000 17.060 0.000 _methods.py:31(_sum)
406001 16.534 0.000 16.534 0.000 {method 'reduce' of 'numpy.ufunc' objects}
1 0.617 0.617 3.113 3.113 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
3 0.000 0.000 2.789 0.930 cursors.py:164(execute)
3 0.000 0.000 2.789 0.930 cursors.py:353(_query)
3 0.000 0.000 1.938 0.646 cursors.py:315(_do_query)
3 0.000 0.000 1.920 0.640 cursors.py:142(_do_get_result)
3 0.000 0.000 1.920 0.640 cursors.py:351(_get_result)
3 1.920 0.640 1.920 0.640 {method 'store_result' of '_mysql.connection' objects}
3 0.001 0.000 0.851 0.284 cursors.py:358(_post_get_result)
3 0.000 0.000 0.849 0.283 cursors.py:324(_fetch_row)
3 0.849 0.283 0.849 0.283 {built-in method fetch_row}
591285 0.160 0.000 0.160 0.000 {range}
Your programs seem to spend most of their time acquiring locks. That seems to indicate that in your case multiprocessing hurts more than it helps.
Remove all the multiprocessing stuff and start measuring how long things take without it, e.g. like this:
from math import floor
from math import ceil
import numpy
import MySQLdb
import time

start = time.clock()
db = MySQLdb.connect(...)
cursor = db.cursor()
query = "SELECT ...."
cursor.execute(query)
stop = time.clock()
print "Database access took {:.2f} seconds".format(stop - start)

start = time.clock()
[...] #save db results into the variable db_matrix (30 columns, 5.000 rows)
[...] #save db results into the variable bp_vector (3 columns, 500 rows)
[...] #save db results into the variable option_vector (3 columns, 4000 rows)
stop = time.clock()
print "Creating matrices took {:.2f} seconds".format(stop - start)

cursor.close()
db.close()

counter = 0
start = time.clock()
for i in range(4000):
    for j in range(500):
        helper[:] = ((1 - bp_vector[j,0] - bp_vector[j,1] - bp_vector[j,2]) * db_matrix[:,0]
                     + db_matrix[:,option_vector[i,0]] * bp_vector[j,0]
                     + db_matrix[:,option_vector[i,1]] * bp_vector[j,1]
                     + db_matrix[:,option_vector[i,2]] * bp_vector[j,2])
        result[counter,0] = (helper < -7.55).sum()
        counter = counter + 1
stop = time.clock()
print "Matrix manipulation took {:.2f} seconds".format(stop - start)
Edit-1
Based on your measurements I stand by my conclusion (in a slightly rephrased form) that on a multi-core machine, using multiprocessing as you are doing now hurts your performance very much. On a dual-core machine the program with multiprocessing takes much longer than the one without it!
That there is no difference between using multiprocessing or not on a single-core machine isn't very relevant, I think. A single-core machine won't see much benefit from multiprocessing anyway.
The new measurements show that most of the time is spent in the matrix manipulation. This is logical since you are using an explicit nested for-loop, which is not very fast.
There are basically four possible solutions:
The first is to rewrite your nested loop as numpy operations. Numpy operations have implicit loops (written in C) instead of explicit loops in Python and are thus faster. (A rare case where explicit is worse than implicit. ;-) ) The downside is that this will probably use a significant amount of memory. (A vectorized sketch is given after this list.)
The second option is to split up the calculation of helper, which consists of 4 parts. Execute each part in a separate process and add the results together at the end. This does incur some overhead: each process has to retrieve all the data from the database and has to transfer its partial result back to the main process (maybe also via the database?).
The third option might be to use PyPy instead of CPython. It can be significantly faster.
A fourth option would be to rewrite the critical matrix manipulation in Cython or C.
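To illustrate the first option, here is a sketch that vectorizes the inner loop over j while keeping the outer loop over i, so the intermediate array stays at a modest 5000 x 500 instead of materializing everything at once. The data here is synthetic (random stand-ins with the shapes described in the question), so treat it as a shape-checked outline rather than a drop-in replacement:

import numpy as np

# Synthetic stand-ins with the shapes from the question:
# db_matrix: 5000 rows x 30 columns, bp_vector: 500 x 3, option_vector: 4000 x 3 column indices.
np.random.seed(0)
db_matrix = np.random.randn(5000, 30)
bp_vector = np.random.uniform(0.0, 0.3, size=(500, 3))
option_vector = np.random.randint(1, 30, size=(4000, 3))

result = np.empty((option_vector.shape[0] * bp_vector.shape[0], 1))

# Weight of the base column for every j: 1 - sum of the three bp weights, shape (500,).
base_weight = 1.0 - bp_vector.sum(axis=1)

for i in range(option_vector.shape[0]):
    # The three columns selected by option_vector[i], shape (5000, 3).
    cols = db_matrix[:, option_vector[i]]
    # helper for all j at once: (5000, 1) * (1, 500) + (5000, 3) . (3, 500) -> (5000, 500).
    helper_all = db_matrix[:, [0]] * base_weight[None, :] + cols.dot(bp_vector.T)
    # Count entries below the threshold for every j and store them in the flattened result,
    # preserving the original counter = i * 500 + j layout.
    counts = (helper_all < -7.55).sum(axis=0)
    result[i * bp_vector.shape[0]:(i + 1) * bp_vector.shape[0], 0] = counts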
