I have recently started using the Python memory profiler from here. As a test run, I tried to profile the toy code from here, following the instructions therein. I have some naive questions about the output I saw.
import time

@profile
def test1():
    n = 10000
    a = [1] * n
    time.sleep(1)
    return a

@profile
def test2():
    n = 100000
    b = [1] * n
    time.sleep(1)
    return b

if __name__ == "__main__":
    test1()
    test2()
This is the output using the mprof run and then mprof plot command-line options:
After removing the @profile decorators, I ran the profiler again and obtained the following result:
Except for the brackets marking the functions, I was expecting almost identical plots (since the code is simple), but I am seeing significant differences, such as the ending time of the plot and the variations within the brackets.
Can someone please shed light on these differences?
Edit:
For small intervals, the plot with function profiling looks like:
The differences you are seeing are probably because the bookkeeping information stored by @profile is counted within the total memory used by the program. There is also a slight overhead from storing this information, hence the different running times.
Also, you might get slightly different plots across runs simply due to variations in how Python manages memory.
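If you want to see how much a single function allocates without decorating the whole script, memory_profiler also provides memory_usage(), which samples the process while a callable runs. A minimal sketch using test2() from the script above (the 0.1 s sampling interval is an arbitrary choice):

from memory_profiler import memory_usage
import time

def test2():
    n = 100000
    b = [1] * n
    time.sleep(1)
    return b

# (callable, args, kwargs) form; sample process memory every 0.1 s
usage = memory_usage((test2, (), {}), interval=0.1)
print("peak-to-trough: %.2f MiB" % (max(usage) - min(usage)))

Because this samples the whole process, the numbers still include the interpreter's own allocations.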
I have been working on a Python script that uses dask to speed up processing. At a high level, the script calls a dask delayed function a number of times to perform new computations; each call is independent of the previous one. A simplified example of what I have is shown below.
import dask

def func1(list1):
    output = []
    for subList in list1:
        t2 = dask.delayed(func2)(subList)
        output.append(t2)
    output = dask.compute(*output)
    return output

def func2(subList):
    # Some operations
    return tuple2  # large tuple with a combination of lists and numpy arrays

if __name__ == '__main__':
    largeList = [..]
    for list1 in largeList:
        output = func1(list1)
        print(output)
I noticed that as this program executed, each successive call to func1 took longer to complete. At first I believed this was a memory issue, because the variable output is typically a large tuple with many arrays and lists. However, watching the Dask dashboard while the program ran, the 'Bytes Stored' plot didn't seem to max out. I know this is a very vague example, but does anybody have any ideas about why func1 slows down the more times it is called? My hunch is still that the large tuple output has something to do with the issue. If so, how can I fix this problem? Any feedback would be greatly appreciated.
My problem seems to be a simple one, but so far I haven't found a satisfactory answer. The code I am running is very time consuming, and I need to run it many times (ideally 100 times or more) and average the results from each trial. I have been told to try multiprocessing, and I have made some progress (in JupyterLab).
# my_code.py
def Run_Code(trial):  # pool.map passes one argument per call
    <code>
    return result
import multiprocessing as mp
import numpy as np
import my_code as mc

trial_amount = 2

if __name__ == '__main__':
    pool = mp.Pool(2)
    result = pool.map(mc.Run_Code, np.arange(trial_amount))
    print(result)
I was guided by this introduction (https://sebastianraschka.com/Articles/2014_multiprocessing.html#sections). The ultimate goal is to run the trials simultaneously (or as many as possible at once, starting another trial whenever one finishes) and put the results in a list that will then be averaged. I tried this, and it continued running for hours, much longer than a single trial, and never finished.
Try mpi4py - https://stackoverflow.com/a/15717768/1021819 gives an example of how it goes.
There is a great and simple tutorial here:
https://mpi4py.readthedocs.io/en/stable/tutorial.html
It really is just a few lines. The following also has enough to get you going. It splits a loop of work over cores then aggregates the results on master:
https://stackoverflow.com/a/51318100/1021819
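For concreteness, here is a minimal sketch of that split-and-aggregate pattern. run_trial() is a stand-in for your actual computation, and the assumption that each trial returns a single number is mine:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_trials = 100

def run_trial(seed):
    # placeholder for the real, time-consuming computation
    rng = np.random.default_rng(seed)
    return rng.normal()

# each rank handles every size-th trial
local_results = [run_trial(i) for i in range(rank, n_trials, size)]

# gather the per-rank lists on rank 0 (the master) and average
all_results = comm.gather(local_results, root=0)
if rank == 0:
    flat = [r for sub in all_results for r in sub]
    print(np.mean(flat))

You would launch it with something like mpirun -n 4 python script.py, where -n sets the number of processes.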
Suppose the following code stands in for a huge, complicated computation that runs for minutes or hours, and we want to tell the user what percentage of it has completed. What should I do?
num = 1
for i in range(1, 100000):
    num = num * i
print(num)
I want to show the user a progress bar, similar to what you see when installing software.
I checked here, but I did not understand how to tie a progress bar to my code's actual progress.
The examples in posts like the one linked rely on a fixed sleep or delay time. That is not acceptable here, because we do not know in advance how long a Python computation with different functions will take.
If your index i corresponds to your actual progress, the tqdm package is a good option. A simple example:
from tqdm import tqdm
import time

for i in tqdm(range(100000)):
    time.sleep(0.01)  # sleep 0.01 s
Output:
1%| | 1010/100000 [00:10<16:46, 98.30it/s]
Edit: the progress bar also works if the total length is not known in advance.
def loop_without_known_length():
    # same as above, but the length is not known outside of this function
    for i in range(1000):
        yield i

for i in tqdm(loop_without_known_length()):
    time.sleep(0.01)
Output:
60it [00:00, 97.23it/s]
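Applied to the loop from the question, wrapping the range in tqdm is all that is needed; a minimal sketch:

from tqdm import tqdm

num = 1
for i in tqdm(range(1, 100000)):
    num = num * i

tqdm advances the bar once per iteration, so no sleep or precomputed timing is required.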
I have been playing with memory_profiler for some time and got these interesting but confusing results from the small program below:
import pandas as pd
import numpy as np

@profile
def f(p):
    tmp = []
    for _, frame in p.iteritems():
        tmp.append([list(record) for record in frame.to_records(index=False)])

# initialize a list of pandas panels
lp = []
for j in xrange(50):
    d = {}
    for i in xrange(50):
        df = pd.DataFrame(np.random.randn(200, 50))
        d[i] = df
    lp.append(pd.Panel(d))

# execution (iteration)
for panel in lp:
    f(panel)
Then if I use memory_profiler's mprof to analyze the memory usage at runtime (mprof run test.py, without any other parameters), I get this:
There seems to be memory that is not released after each call to f().
tmp is just a local list; it should be reassigned, and its memory reallocated, each time f() is called. Obviously there is some discrepancy here in the attached graph. I know that Python has its own memory-management blocks and keeps free lists for ints and other types, and gc.collect() should do the magic. It turns out that an explicit gc.collect() doesn't work. (Maybe that is because we are working with pandas objects, panels and frames? I don't know.)
The most confusing part is that I don't change or modify any variable in f(). All it does is put some list-representation copies in a local list, so Python shouldn't need to make a copy of anything. Then why and how does this happen?
=================
Some other observations:
1) If I call f() with f(panel.copy()) (in the last line of the code), passing a copy instead of the original object reference, I get a totally different memory-usage result. Is Python smart enough to tell that the value passed is a copy, so that it can do some internal trick to release the memory after each function call?
2) I think it might be because of df.to_records(). If I change it to frame.values, I get a similarly flat memory curve during the iteration, just like memory_profiling_results_2.png shown above. (I do need to_records(), though, because it maintains the column dtypes, while .values messes the dtypes up.) But I looked at frame.py's implementation of to_records(), and I don't see why it would hold on to memory there while .values works just fine.
I am running the program on Windows, with Python 2.7.8, memory_profiler 0.43 and psutil 5.0.1.
This is not a memory leak. What you are seeing is a side effect of pandas.core.NDFrame caching some results. This allows it to return the same information the second time you ask for it without running the calculations again. Change the end of your sample code to look like the following code and run it. You should find that the second time through the memory increase will not happen, and the execution time will be less.
import time

# execution (iteration)
start_time = time.time()
for panel in lp:
    f(panel)
print(time.time() - start_time)

print('-------------------------------------')

start_time = time.time()
for panel in lp:
    f(panel)
print(time.time() - start_time)
I have a Python script, but it takes more than 20 hours to run to completion.
Since my code is pretty big, I will post a simplified version.
The first part of the code:
flag = 1
mydic = {}
for i in mylist:
    mydic[flag] = myfunction(i)
    flag += 1
mylist has more than 700 entries, and each call to myfunction runs for around 20 seconds.
So I was wondering whether I can use parallel programming to split the iteration into two groups and run them simultaneously. Is that possible, and would it take half the time it does now?
The second part of the code:
mymatrix = []
for n1 in range(1, flag):  # mydic keys start at 1
    mat = []
    for n2 in range(1, flag):
        if n1 >= n2:
            mat.append(0)
        else:
            res = myfunction2(mydic[n1], mydic[n2])  # dict lookup, not a call
            mat.append(res)
    mymatrix.append(mat)
So, if mylist has 700 entries, I want to create a 700x700 upper-triangular matrix. But myfunction2() needs around 30 seconds each time, and I don't know if I can use parallel programming here too.
I cannot simplify myfunction() and myfunction2(), since they call an external API and return the results.
Do you have any suggestions for how I can change this to make it faster?
Based on your comments, I think it's very likely that the 30 seconds of time is mostly spent in the external API calls. I would add some timing code to test which portions of your code are actually responsible for the slowness.
If it is the external API calls, there are some easy fixes. The external API calls block, so you'll get a speedup if you can move to a parallel model (though 30 s of blocking sounds huge to me).
I think it would be easiest to create a quick "task list" by having the output of the two loops be a matrix of arguments to pass into a function. Then I'd pipe them into Celery to run the tasks. That should give you a decent speedup with a minimal amount of work.
You would probably save a lot more time with the threading or multiprocessing modules to run the tasks (or sections), or even by writing it all in Twisted Python, but that usually takes longer than a simple Celery function. A sketch of the multiprocessing route follows below.
The one caveat with the Celery approach is that you'll be dispatching a lot of work, so you'll need some functionality to poll for results. That could be a while loop that just sleeps(10) and repeats itself until Celery has a result for every task. If you do it in Twisted, you can access/track results on finish. I've never had to do something like this with multiprocessing, so I don't know how that would fit in.
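A minimal sketch of that task-list idea, using the standard-library multiprocessing.Pool instead of Celery. myfunction, myfunction2, and mylist are the asker's names; the pool size and the placeholder bodies are assumptions:

from itertools import combinations
from multiprocessing import Pool

def myfunction(item):
    ...  # placeholder: the first blocking external API call (~20 s each)

def myfunction2(a, b):
    ...  # placeholder: the second blocking external API call (~30 s each)

if __name__ == '__main__':
    mylist = [...]  # the 700+ entries

    with Pool(processes=8) as pool:
        # first part: one independent API call per entry, run in parallel
        values = pool.map(myfunction, mylist)

        # second part: the "task list" is every upper-triangular pair;
        # starmap unpacks each pair into myfunction2's two arguments
        pairs = list(combinations(values, 2))
        results = pool.starmap(myfunction2, pairs)

Since the work is dominated by blocking API calls, the wall time should drop roughly in proportion to the pool size.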
How about using a generator for the second part, instead of one of the for loops?
def fn():
    for n1 in range(1, flag):
        yield n1

mymatrix = []
generate = fn()
while True:
    try:
        a = next(generate)
    except StopIteration:
        break
    mat = []
    for n2 in range(1, flag):
        if a >= n2:
            mat.append(0)
        else:
            mat.append(myfunction2(mydic[a], mydic[n2]))
    mymatrix.append(mat)