This is the profiling result of my python code.
As you can see below, the method 'recv_into' of '_socket.socket' objects takes too much time (17.265 s as tottime).
What is it? And is there any way to reduce its time?
When is it called?
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.402 0.402 37.668 37.668 c:\Users\user\Google ����̺�\Business\Project\Jessica Project\jessica-1\simulation\simulatorW.py:239(backtestWithArgumentsList)
1 0.173 0.173 26.762 26.762 c:\Users\user\Google ����̺�\Business\Project\Jessica Project\jessica-1\simulation\simulatorW.py:110(getPrices)
1 0.000 0.000 26.588 26.588 c:\Users\user\Google ����̺�\Business\Project\Jessica Project\jessica-1\dto\__init__.py:5(__init__)
1 1.734 1.734 25.380 25.380 c:\Users\user\Google ����̺�\Business\Project\Jessica Project\jessica-1\dto\__init__.py:21(priceInfoListToDeque)
815679 2.204 0.000 23.473 0.000 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\cursor.py:1152(next)
13 0.021 0.002 20.631 1.587 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\cursor.py:1039(_refresh)
12 0.008 0.001 20.609 1.717 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\cursor.py:937(__send_message)
12 0.000 0.000 20.601 1.717 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\mongo_client.py:1306(_run_operation_with_response)
12 0.000 0.000 20.601 1.717 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\mongo_client.py:1437(_retryable_read)
12 0.000 0.000 20.597 1.716 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\mongo_client.py:1334(_cmd)
12 0.001 0.000 20.597 1.716 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\server.py:70(run_operation_with_response)
18 0.001 0.000 17.386 0.966 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\network.py:192(receive_message)
12 0.013 0.001 17.379 1.448 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\pool.py:637(receive_message)
36 0.066 0.002 17.331 0.481 C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\network.py:249(_receive_data_on_socket)
19984 17.265 0.001 17.265 0.001 {method 'recv_into' of '_socket.socket' objects}
1 2.499 2.499 6.522 6.522 c:\Users\user\Google ����̺�\Business\Project\Jessica Project\jessica-1\simulation\simulatorW.py:138(filterIndicesWithTimeCondition)
It's a low-level networking call. This is the time spent reading whatever you are loading over the socket. Take a look at its callers.
p.print_callers("{method 'recv_into' of '_socket.socket' objects}")
Keep going up the callers tree picking the ones that have longer times. Remember that the restriction is a regexp. Use escapes when necessary:
p.sort_stats("tottime").print_callers(r"api.py:104\(post\)")
The top 4 lines are more interesting than the recv_into one. If you go up the caller tree, you're likely to end up in one of those. Since no details are provided, there could be many ways to optimize them: caching, compressing, fetching only what you need, and otherwise reducing the network footprint.
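As a minimal, self-contained sketch of walking up the caller tree, the following profiles a toy program and asks pstats who called one function. The functions leaf, caller_a, caller_b and main are hypothetical stand-ins, not part of the code in the question; the restriction is escaped with re.escape because it is treated as a regular expression.

```python
import cProfile
import io
import pstats
import re


def leaf():
    # Stand-in for the expensive low-level call (hypothetical).
    return sum(range(1000))


def caller_a():
    total = 0
    for _ in range(50):
        total += leaf()
    return total


def caller_b():
    return leaf()


def main():
    caller_a()
    caller_b()


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Send the report to a string buffer so it can be inspected or logged.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)

# The restriction is a regular expression matched against
# "filename:lineno(function)", so the parentheses must be escaped.
stats.sort_stats("tottime").print_callers(re.escape("(leaf)"))

callers_report = buf.getvalue()
print(callers_report)
```

The report lists each caller of leaf with its call count and times; repeating the same call with the name of the most expensive caller moves one level up the tree.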
I have an application that needs to initialize a large number of objects in Python (3.5.2), and I encounter occasional slow-downs.
The slow-down seems to occur on one specific initialization: most calls to __init__ take well under 1 µs, but one of them sometimes lasts several dozen seconds.
I've been able to reproduce this using the following snippet, which initializes 500k instances of a simple object.
import cProfile

class A:
    def __init__(self):
        pass

cProfile.run('[A() for _ in range(500000)]')
I'm running this code in a notebook. Most of the time (9 runs out of 10), it outputs the following (normal execution):
500004 function calls in 0.675 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
500000 0.031 0.000 0.031 0.000 <ipython-input-5-634b77609653>:2(__init__)
1 0.627 0.627 0.657 0.657 <string>:1(<listcomp>)
1 0.018 0.018 0.675 0.675 <string>:1(<module>)
1 0.000 0.000 0.675 0.675 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
The other times, it outputs the following (slow execution)
500004 function calls in 40.154 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
500000 0.031 0.000 0.031 0.000 <ipython-input-74-634b77609653>:2(__init__)
1 40.110 40.110 40.140 40.140 <string>:1(<listcomp>)
1 0.014 0.014 40.154 40.154 <string>:1(<module>)
1 0.000 0.000 40.154 40.154 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Using tqdm, the loop seems to get stuck on one iteration. It's important to note that I was able to reproduce this in a notebook that already had a lot of memory allocated.
I suspect that it comes from the list of object references used by the garbage collector, which might need to be copied from time to time.
What exactly is happening here, and is there any way to avoid it?
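One way to test the garbage-collector suspicion is to time the same allocation with the cyclic collector switched off. The sketch below is only a diagnostic under that assumption, not a confirmed explanation of the slow-down; the helper name build is hypothetical.

```python
import gc
import time


class A:
    def __init__(self):
        pass


def build(n, disable_gc):
    """Create n instances of A, optionally with the cyclic GC paused."""
    was_enabled = gc.isenabled()
    if disable_gc:
        gc.disable()
    try:
        start = time.perf_counter()
        objs = [A() for _ in range(n)]
        elapsed = time.perf_counter() - start
    finally:
        # Restore the collector to whatever state it was in before.
        if was_enabled:
            gc.enable()
    return objs, elapsed


for flag in (False, True):
    objs, elapsed = build(500000, disable_gc=flag)
    print("gc disabled=%s: %d objects in %.3fs" % (flag, len(objs), elapsed))
```

If the occasional long stall disappears when the collector is paused, a full collection over an already large heap is a plausible cause; if it does not, the cause lies elsewhere (for example, memory pressure in the notebook process).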
I am trying to parallelize an embarrassingly parallel for loop (previously asked here) and settled on this implementation that fit my parameters:
from functools import partial
from multiprocessing import Manager, Pool

with Manager() as proxy_manager:
    shared_inputs = proxy_manager.list([datasets, train_size_common, feat_sel_size, train_perc,
                                        total_test_samples, num_classes, num_features, label_set,
                                        method_names, pos_class_index, out_results_dir, exhaustive_search])
    partial_func_holdout = partial(holdout_trial_compare_datasets, *shared_inputs)

    with Pool(processes=num_procs) as pool:
        cv_results = pool.map(partial_func_holdout, range(num_repetitions))
The reason I need to use a proxy object (shared between processes) is the first element of the shared proxy list, datasets, which is a list of large objects (each about 200-300MB). This datasets list usually has 5-25 elements. I typically need to run this program on an HPC cluster.
Here is the question: when I run this program with 32 processes and 50GB of memory (num_repetitions=200, with datasets being a list of 10 objects, each 250MB), I do not see a speedup of even a factor of 16 (with 32 parallel processes). I do not understand why. Any clues? Any obvious mistakes or bad choices? Where can I improve this implementation? Any alternatives?
I am sure this has been discussed before, and the reasons can be varied and very specific to implementation - hence I request you to provide me your 2 cents. Thanks.
Update: I did some profiling with cProfile to get a better idea - here is some info, sorted by cumulative time.
In [19]: p.sort_stats('cumulative').print_stats(50)
Mon Oct 16 16:43:59 2017 profiling_log.txt
555404 function calls (543552 primitive calls) in 662.201 seconds
Ordered by: cumulative time
List reduced from 4510 to 50 due to restriction <50>
ncalls tottime percall cumtime percall filename:lineno(function)
897/1 0.044 0.000 662.202 662.202 {built-in method builtins.exec}
1 0.000 0.000 662.202 662.202 test_rhst.py:2(<module>)
1 0.001 0.001 661.341 661.341 test_rhst.py:70(test_chance_classifier_binary)
1 0.000 0.000 661.336 661.336 /Users/Reddy/dev/neuropredict/neuropredict/rhst.py:677(run)
4 0.000 0.000 661.233 165.308 /Users/Reddy/anaconda/envs/py36/lib/python3.6/threading.py:533(wait)
4 0.000 0.000 661.233 165.308 /Users/Reddy/anaconda/envs/py36/lib/python3.6/threading.py:263(wait)
23 661.233 28.749 661.233 28.749 {method 'acquire' of '_thread.lock' objects}
1 0.000 0.000 661.233 661.233 /Users/Reddy/anaconda/envs/py36/lib/python3.6/multiprocessing/pool.py:261(map)
1 0.000 0.000 661.233 661.233 /Users/Reddy/anaconda/envs/py36/lib/python3.6/multiprocessing/pool.py:637(get)
1 0.000 0.000 661.233 661.233 /Users/Reddy/anaconda/envs/py36/lib/python3.6/multiprocessing/pool.py:634(wait)
866/8 0.004 0.000 0.868 0.108 <frozen importlib._bootstrap>:958(_find_and_load)
866/8 0.003 0.000 0.867 0.108 <frozen importlib._bootstrap>:931(_find_and_load_unlocked)
720/8 0.003 0.000 0.865 0.108 <frozen importlib._bootstrap>:641(_load_unlocked)
596/8 0.002 0.000 0.865 0.108 <frozen importlib._bootstrap_external>:672(exec_module)
1017/8 0.001 0.000 0.863 0.108 <frozen importlib._bootstrap>:197(_call_with_frames_removed)
522/51 0.001 0.000 0.765 0.015 {built-in method builtins.__import__}
The profiling info now sorted by time
In [20]: p.sort_stats('time').print_stats(20)
Mon Oct 16 16:43:59 2017 profiling_log.txt
555404 function calls (543552 primitive calls) in 662.201 seconds
Ordered by: internal time
List reduced from 4510 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
23 661.233 28.749 661.233 28.749 {method 'acquire' of '_thread.lock' objects}
115/80 0.177 0.002 0.211 0.003 {built-in method _imp.create_dynamic}
595 0.072 0.000 0.072 0.000 {built-in method marshal.loads}
1 0.045 0.045 0.045 0.045 {method 'acquire' of '_multiprocessing.SemLock' objects}
897/1 0.044 0.000 662.202 662.202 {built-in method builtins.exec}
3 0.042 0.014 0.042 0.014 {method 'read' of '_io.BufferedReader' objects}
2037/1974 0.037 0.000 0.082 0.000 {built-in method builtins.__build_class__}
286 0.022 0.000 0.061 0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/site-packages/scipy/misc/doccer.py:12(docformat)
2886 0.021 0.000 0.021 0.000 {built-in method posix.stat}
79 0.016 0.000 0.016 0.000 {built-in method posix.read}
597 0.013 0.000 0.021 0.000 <frozen importlib._bootstrap_external>:830(get_data)
276 0.011 0.000 0.013 0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/sre_compile.py:250(_optimize_charset)
108 0.011 0.000 0.038 0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:626(_construct_argparser)
1225 0.011 0.000 0.050 0.000 <frozen importlib._bootstrap_external>:1233(find_spec)
7179 0.009 0.000 0.009 0.000 {method 'splitlines' of 'str' objects}
33 0.008 0.000 0.008 0.000 {built-in method posix.waitpid}
283 0.008 0.000 0.015 0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/site-packages/scipy/misc/doccer.py:128(indentcount_lines)
3 0.008 0.003 0.008 0.003 {method 'poll' of 'select.poll' objects}
7178 0.008 0.000 0.008 0.000 {method 'expandtabs' of 'str' objects}
597 0.007 0.000 0.007 0.000 {method 'read' of '_io.FileIO' objects}
More profiling info sorted by percall info:
Update 2
The elements in the large list datasets I mentioned earlier are not usually that big; they are typically 10-25MB each. But depending on the floating-point precision used and the number of samples and features, this can easily grow to 500MB-1GB per element. Hence I'd prefer a solution that can scale.
Update 3:
The code inside holdout_trial_compare_datasets uses GridSearchCV from scikit-learn, which internally uses the joblib library if we set n_jobs > 1 (or possibly whenever we set it at all). This might lead to some bad interactions between multiprocessing and joblib. So I am trying another config where I do not set n_jobs at all (which should default to no parallelism within scikit-learn). Will keep you posted.
Based on the discussion in the comments, I did a mini experiment comparing three versions of the implementation:
v1: essentially the same as your approach. Because partial(f1, *shared_inputs) unpacks proxy_manager.list immediately, the managed list is not actually involved; the data is passed to the workers through the Pool's internal queue.
v2: uses the managed list. The worker function receives a ListProxy object and fetches the shared data through an internal connection to the manager's server process.
v3: the child processes share the data from the parent, taking advantage of the fork(2) system call.
import multiprocessing as mp
from functools import partial
from time import time

def f1(*args):
    for e in args[0]: pow(e, 2)

def f2(*args):
    for e in args[0][0]: pow(e, 2)

def f3(n):
    for i in datasets: pow(i, 2)

def v1(np):
    with mp.Manager() as proxy_manager:
        shared_inputs = proxy_manager.list([datasets,])
        pf = partial(f1, *shared_inputs)
        with mp.Pool(processes=np) as pool:
            r = pool.map(pf, range(16))

def v2(np):
    with mp.Manager() as proxy_manager:
        shared_inputs = proxy_manager.list([datasets,])
        pf = partial(f2, shared_inputs)
        with mp.Pool(processes=np) as pool:
            r = pool.map(pf, range(16))

def v3(np):
    with mp.Pool(processes=np) as pool:
        r = pool.map(f3, range(16))

datasets = [2.0 for _ in range(10 * 1000 * 1000)]
for f in (v1, v2, v3):
    print(f.__code__.co_name)
    for np in (2, 4, 8, 16):
        s = time()
        f(np)
        print("%s %.2fs" % (np, time()-s))
The results were taken on a 16-core E5-2682 VPC, and v3 clearly scales best.
{method 'acquire' of '_thread.lock' objects}
Looking at your profiler output I would say that the shared object lock/unlock overhead overwhelms the speed gains of multithreading.
Refactor so that the work is farmed out to workers that do not need to talk to one another as much.
Specifically, if possible, derive one answer per data pile and then act on the accumulated results.
This is why Queues can seem so much faster: they involve a type of work that does not require an object that has to be 'managed' and so locked/unlocked.
Only 'manage' things that absolutely need to be shared between processes. Your managed list contains some very complicated looking objects...
A faster paradigm is:
allwork = manager.list([a, b, c])
theresult = manager.list()
and then
while allwork:
    unitofwork = allwork.pop()
    theresult.append(myfunction(unitofwork))
If you do not need a complex shared object, then only use a list of the most simple objects imaginable.
Then tell the workers to acquire the complex data that they can process in their own little world.
Try:
def myfunction(unitofworkid):
    # Each worker loads the heavy data itself, from the id alone.
    thework = acquiredataset(unitofworkid)
    return holdout_trial_compare_datasets(thework, ...)

allwork = manager.list([datasetid1, datasetid2, ...])
theresult = manager.list()
while allwork:
    unitofworkid = allwork.pop()
    theresult.append(myfunction(unitofworkid))
I hope that this makes sense. It should not take too much time to refactor in this direction. And you should see that {method 'acquire' of '_thread.lock' objects} number drop like a rock when you profile.
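The "pass ids, load inside the worker" idea can be sketched with a plain Pool and no managed objects at all. This is a minimal illustration under assumptions: load_dataset, run_trial and run_all are hypothetical names, and load_dataset only fakes the loading step that a real program would do from disk.

```python
from multiprocessing import Pool


def load_dataset(dataset_id):
    """Hypothetical loader: stands in for reading one dataset from disk."""
    return [float(dataset_id)] * 1000


def run_trial(dataset_id):
    """Load the data inside the worker and return only a small result."""
    data = load_dataset(dataset_id)
    return dataset_id, sum(x * x for x in data)


def run_all(dataset_ids, num_procs=2):
    """Send only the ids to the workers and collect one result per id."""
    with Pool(processes=num_procs) as pool:
        return dict(pool.map(run_trial, dataset_ids))
```

Calling run_all([1, 2, 3]) from under an if __name__ == "__main__": guard sends three small integers to the workers and gets back three small tuples, so nothing large is pickled or locked between processes.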
I have created the code below. It takes a series of values
and generates 10 numbers between x and r with an average value of 8000.
In order to meet the specification to cover the range as well as possible, I also calculated the standard deviation, which is a good measure of spread. So whenever a sample set meets the criterion of a mean of 8000, I compare it to previous matches and always keep the samples that have the highest std dev (the mean is always 8000).
import random as rd
import statistics as st

def node_timing(average_block_response_computational_time, min_block_response_computational_time, max_block_response_computational_time):
    sample_count = 10
    num_of_trials = 1

    # print average_block_response_computational_time
    # print min_block_response_computational_time
    # print max_block_response_computational_time

    target_sum = sample_count * average_block_response_computational_time
    samples_list = []
    curr_stdev_max = 0
    for trials in range(num_of_trials):
        samples = [0] * sample_count
        # Redraw until the samples sum exactly to the target (mean == average).
        while sum(samples) != target_sum:
            samples = [rd.randint(min_block_response_computational_time, max_block_response_computational_time) for trial in range(sample_count)]
        # print ("Mean: ", st.mean(samples), "Std Dev: ", st.stdev(samples), )
        # print (samples, "\n")
        if st.stdev(samples) > curr_stdev_max:
            curr_stdev_max = st.stdev(samples)
            samples_best = samples[:]
    return samples_best[0]
I take the first value in the list and use it as a timing value. However, this code is REALLY slow, and I need to call it several thousand times during the simulation, so I need to improve its efficiency somehow.
Does anyone have any suggestions on how to do that?
To see where we'd get the best speed improvements, I started by profiling your code.
import cProfile
pr = cProfile.Profile()
pr.enable()
for i in range(100):
    print(node_timing(8000, 7000, 9000))
pr.disable()
pr.print_stats(sort='time')
The top of the results show where your code is spending most of its time:
23561178 function calls (23561176 primitive calls) in 10.612 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
4502300 3.694 0.000 7.258 0.000 random.py:172(randrange)
4502300 2.579 0.000 3.563 0.000 random.py:222(_randbelow)
4502300 1.533 0.000 8.791 0.000 random.py:216(randint)
450230 1.175 0.000 9.966 0.000 counter.py:19(<listcomp>)
4608421 0.690 0.000 0.690 0.000 {method 'getrandbits' of '_random.Random' objects}
100 0.453 0.005 10.596 0.106 counter.py:5(node_timing)
4502300 0.294 0.000 0.294 0.000 {method 'bit_length' of 'int' objects}
450930 0.141 0.000 0.150 0.000 {built-in method builtins.sum}
100 0.016 0.000 0.016 0.000 {built-in method builtins.print}
600 0.007 0.000 0.025 0.000 statistics.py:105(_sum)
2200 0.005 0.000 0.006 0.000 fractions.py:84(__new__)
...
From this output, we can see that we're spending ~7.5 seconds (out of 10.6 seconds) generating random numbers. Therefore, the only way to make this noticeably faster is to generate fewer random numbers or generate them faster. You're not using a cryptographic random number generator so I don't have a way to make generating numbers faster. However, we can fudge the algorithm a bit and drastically reduce the number of values we need to generate.
Instead of only accepting samples with a mean of exactly 8000, what if we accepted samples with a mean of 8000 ± 0.1% (so we're taking samples with a mean of 7992 to 8008)? By being a tiny bit inexact, we can drastically speed up the algorithm. I replaced the while condition with:
while abs(sum(samples) - target_sum) > epsilon:
Where epsilon = target_sum * 0.001. Then I ran the script again and got much better profiler numbers.
232439 function calls (232437 primitive calls) in 0.163 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
100 0.032 0.000 0.032 0.000 {built-in method builtins.print}
31550 0.026 0.000 0.053 0.000 random.py:172(randrange)
31550 0.019 0.000 0.027 0.000 random.py:222(_randbelow)
31550 0.011 0.000 0.064 0.000 random.py:216(randint)
4696 0.010 0.000 0.013 0.000 fractions.py:84(__new__)
3155 0.008 0.000 0.073 0.000 counter.py:19(<listcomp>)
600 0.008 0.000 0.039 0.000 statistics.py:105(_sum)
100 0.006 0.000 0.131 0.001 counter.py:4(node_timing)
32293 0.005 0.000 0.005 0.000 {method 'getrandbits' of '_random.Random' objects}
1848 0.004 0.000 0.009 0.000 fractions.py:401(_add)
Allowing the mean to be up to 0.1% off the target cut the number of calls to randint by more than 100x (from 4,502,300 to 31,550). Naturally, the code also runs far faster (10.6 s down to 0.16 s), and it now spends most of its time printing to the console.
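Put together, the relaxed acceptance test might look like the sketch below. The function name node_timing_approx and the tolerance parameter are illustrative assumptions, and the standard-deviation bookkeeping from the original function is left out to keep the focus on the changed while condition.

```python
import random
import statistics


def node_timing_approx(average, minimum, maximum, sample_count=10, tolerance=0.001):
    """Draw sample_count integers whose mean is within tolerance of average."""
    target_sum = sample_count * average
    epsilon = target_sum * tolerance  # allowed deviation of the sum

    samples = [random.randint(minimum, maximum) for _ in range(sample_count)]
    # Redraw until the sum is close enough to the target, not exactly equal.
    while abs(sum(samples) - target_sum) > epsilon:
        samples = [random.randint(minimum, maximum) for _ in range(sample_count)]
    return samples


samples = node_timing_approx(8000, 7000, 9000)
print(samples, statistics.mean(samples))
```

Setting tolerance to 0 recovers the original exact-match behaviour, so the trade-off between speed and exactness stays a single tunable number.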
I'm trying to profile a few lines of Pandas code, and when I run %prun I'm finding that most of my time is taken by {isinstance}. This seems to happen a lot. Can anyone suggest what that means and, for bonus points, suggest a way to avoid it?
This isn't meant to be application specific, but here's a thinned out version of the code if that's important:
def flagOtherGroup(df):
    try:
        mostUsed0 = df[df.subGroupDummy == 0].siteid.iloc[0]
    except IndexError:  # no rows with subGroupDummy == 0
        mostUsed0 = -1
    try:
        mostUsed1 = df[df.subGroupDummy == 1].siteid.iloc[0]
    except IndexError:  # no rows with subGroupDummy == 1
        mostUsed1 = -1
    df['mostUsed'] = 0
    df.loc[(df.subGroupDummy == 0) & (df.siteid == mostUsed1), 'mostUsed'] = 1
    df.loc[(df.subGroupDummy == 1) & (df.siteid == mostUsed0), 'mostUsed'] = 1
    return df[['mostUsed']]
%prun -l15 temp = test.groupby('userCode').apply(flagOtherGroup)
And top lines of prun:
Ordered by: internal time
List reduced from 531 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
834472 1.908 0.000 2.280 0.000 {isinstance}
497048/395400 1.192 0.000 1.572 0.000 {len}
32722 0.879 0.000 4.479 0.000 series.py:114(__init__)
34444 0.613 0.000 1.792 0.000 internals.py:3286(__init__)
25990 0.568 0.000 0.568 0.000 {method 'reduce' of 'numpy.ufunc' objects}
82266/78821 0.549 0.000 0.744 0.000 {numpy.core.multiarray.array}
42201 0.544 0.000 1.195 0.000 internals.py:62(__init__)
42201 0.485 0.000 1.812 0.000 internals.py:2015(make_block)
166244 0.476 0.000 0.615 0.000 {getattr}
4310 0.455 0.000 1.121 0.000 internals.py:2217(_rebuild_blknos_and_blklocs)
12054 0.417 0.000 2.134 0.000 internals.py:2355(apply)
9474 0.385 0.000 1.284 0.000 common.py:727(take_nd)
isinstance, len and getattr are just built-in functions. There is a huge number of calls to isinstance() here: no single call takes long, but the function was called 834472 times.
Presumably it is the pandas code that calls it.
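To see how a cheap built-in can end up near the top of a tottime listing purely through call count, here is a small sketch that profiles a loop of isinstance checks and reads the call count back from the profiler. The function check_many is a hypothetical example, not taken from pandas.

```python
import cProfile
import pstats


def check_many(values):
    """Count the ints in values with one isinstance() call per element."""
    count = 0
    for v in values:
        if isinstance(v, int):
            count += 1
    return count


profiler = cProfile.Profile()
profiler.enable()
check_many(list(range(200_000)))
profiler.disable()

stats = pstats.Stats(profiler)
isinstance_calls = 0
# stats.stats maps (filename, lineno, name) to (cc, nc, tt, ct, callers).
for (filename, lineno, name), (cc, nc, tt, ct, callers) in stats.stats.items():
    if "isinstance" in name:
        isinstance_calls += nc
        print("%s: %d calls, %.3fs total" % (name, nc, tt))
```

Each call is tiny, but the total grows linearly with the number of calls, so the practical fix is usually to make fewer calls (for example, by doing less per-group work in Python) rather than to speed up isinstance itself.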