Amount of memory used in a function? - python

Which Python module can I use to calculate the amount of memory spent executing a function? I'm using memory_profiler, but it shows the amount of memory spent by each line of the algorithm; in this case I want one that shows the total amount spent.

You can use tracemalloc to do by hand what memory_profiler does automatically. It's a little less friendly, but I think it does what you want pretty well.
Just follow the code snippet below.
import tracemalloc

def my_func():
    tracemalloc.start()
    # your code here
    print(tracemalloc.get_traced_memory())
    tracemalloc.stop()
The output is given in the form (current, peak): current is the memory the code is using at that point, and peak is the maximum memory the program used while executing.
The code is adapted from GeeksforGeeks; check the link for more info, including a couple of other ways to trace memory.
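If you only need the total for a whole function call rather than a line-by-line report, you can wrap the same tracemalloc calls in a decorator. A minimal sketch, where the decorator name measure_memory and the sample function build_list are just illustrations:

import functools
import tracemalloc

def measure_memory(func):
    # report the memory allocated while the wrapped function runs
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        try:
            return func(*args, **kwargs)
        finally:
            current, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            print(f"{func.__name__}: current={current} bytes, peak={peak} bytes")
    return wrapper

@measure_memory
def build_list(n):
    return [i ** 2 for i in range(n)]

build_list(100_000)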

Related

long-running python program ram usage

I am currently working on a project where a Python program is supposed to run for several days, essentially in an endless loop, until a user intervenes.
I have observed that the RAM usage (as shown in the Windows Task Manager) rises slowly but steadily, for example from ~80 MB at program start to ~120 MB after one day. To get a closer look at this, I started to log the allocated memory with
tracemalloc.get_traced_memory() at regular intervals throughout the program execution. The output was written to a time series DB (see figure below).
(Figure: tracemalloc output for one day of runtime)
To me it looks like the memory needed by the program does not accumulate over time. How does this fit with the output of the Windows Task Manager? Should I go through my program and search for growing data structures?
Thank you very much in advance!
Okay, it turns out the answer is: no, this is not proper behaviour; the RAM usage can stay absolutely stable. I have tested this for three weeks now and the RAM usage never exceeded 80 MB.
The problem was in the usage of the influxdb v2 client.
You need to close both the write_api (done implicitly by the "with ... as write_api:" statement) and the client itself (done explicitly via client.close() in the example below).
In my previous version, which had increasing memory usage, I only closed the write_api and not the client.
import time

import influxdb_client
from influxdb_client.client.write_api import SYNCHRONOUS

client = influxdb_client.InfluxDBClient(url=self.url, token=self.token, org=self.org)
with client.write_api(write_options=SYNCHRONOUS) as write_api:
    # force datatypes, because influx does not do fluffy ducktyping
    datapoint = influxdb_client.Point("TaskPriorities")\
        .tag("task_name", str(task_name))\
        .tag("run_uuid", str(run_uuid))\
        .tag("task_type", str(task_type))\
        .field("priority", float(priority))\
        .field("process_time_h", float(process_time))\
        .time(time.time_ns())
    answer = write_api.write(bucket=self.bucket, org=self.org, record=datapoint)
client.close()

How can I measure code coverage (in a production system)?

I would like to measure the coverage of my Python code which gets executed in the production system.
I want an answer to this question:
Which lines get executed often (hot spots) and which lines are never used (dead code)?
Of course this must not slow down my production site.
I am not talking about measuring the coverage of tests.
I assume you are not talking about test suite code coverage which the other answer is referring to. That is a job for CI indeed.
If you want to know which code paths are hit often in your production system, then you're going to have to do some instrumentation / profiling. This will have a cost. You cannot add measurements for free. You can do it cheaply though and typically you would only run it for short amounts of time, long enough until you have your data.
Python has cProfile to do full profiling, measuring call counts per function etc. This will give you the most accurate data but will likely have relatively high impact on performance.
Alternatively, you can do statistical profiling which basically means you sample the stack on a timer instead of instrumenting everything. This can be much cheaper, even with high sampling rate! The downside of course is a loss of precision.
Even though it is surprisingly easy to do in Python, this stuff is still a bit much to put into an answer here. There is an excellent blog post by the Nylas team on this exact topic though.
The sampler below was lifted from the Nylas blog with some tweaks. After you start it, it fires an interrupt every millisecond and records the current call stack:
import collections
import signal

class Sampler(object):
    def __init__(self, interval=0.001):
        self.stack_counts = collections.defaultdict(int)
        self.interval = interval

    def start(self):
        signal.signal(signal.SIGVTALRM, self._sample)
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)

    def _sample(self, signum, frame):
        # walk the current call stack and record it as one formatted string
        stack = []
        while frame is not None:
            formatted_frame = '{}({})'.format(
                frame.f_code.co_name,
                frame.f_globals.get('__name__'))
            stack.append(formatted_frame)
            frame = frame.f_back
        formatted_stack = ';'.join(reversed(stack))
        self.stack_counts[formatted_stack] += 1
        # re-arm the timer for the next sample
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)
You inspect stack_counts to see what your program has been up to. This data can be plotted as a flame graph, which makes it really obvious to see in which code paths your program is spending the most time.
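A rough usage sketch (run_workload is a placeholder for whatever code you actually want to observe): start the sampler, let the program run, then dump the most frequent stacks.

sampler = Sampler()
sampler.start()

run_workload()          # placeholder for the code you want to observe

# print the ten most frequently sampled stacks
top = sorted(sampler.stack_counts.items(), key=lambda item: item[1], reverse=True)
for stack, count in top[:10]:
    print(count, stack)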
If I understand it right, you want to learn which parts of your application are used most often by users.
TL;DR
Use one of the metrics frameworks for Python if you do not want to do it by hand. Some of them are listed below:
DataDog
Prometheus
Prometheus Python Client
Splunk
It is usually done at the function level, and the right approach depends on the application:
If it is a desktop app with internet access:
You can create a simple DB and collect how many times your functions are called. To accomplish this, you can write a simple function and call it inside every function that you want to track, as in the sketch below. After that you can define an asynchronous task to upload your data to the internet.
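A minimal sketch of that idea, assuming a local SQLite file as the "simple db" (the file name, table name and count_call helper are made up for illustration):

import sqlite3

_conn = sqlite3.connect("usage_stats.db")
_conn.execute("CREATE TABLE IF NOT EXISTS calls (name TEXT PRIMARY KEY, hits INTEGER)")

def count_call(name):
    # bump the hit counter for one tracked function
    _conn.execute("INSERT OR IGNORE INTO calls (name, hits) VALUES (?, 0)", (name,))
    _conn.execute("UPDATE calls SET hits = hits + 1 WHERE name = ?", (name,))
    _conn.commit()

def export_report():
    # read out the counters, e.g. from the asynchronous upload task
    return list(_conn.execute("SELECT name, hits FROM calls ORDER BY hits DESC"))

def generate_invoice():
    count_call("generate_invoice")
    # ... actual work of the tracked function ...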
If it is a web application:
You can track which functions are called from JS (mostly preferred for user-behaviour tracking) or from the web API. It is good practice to start from the outside and work inwards. First detect which endpoints are frequently called (if you are using a proxy like nginx, you can analyze the server logs to gather this information; it is the easiest and cleanest way, see the sketch below). After that, insert a logger into every other function that you want to track and simply analyze your logs every week or month.
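As a rough sketch of the "analyze server logs" step, assuming the default nginx combined log format where the request line ("GET /path HTTP/1.1") is the first quoted field, you can count endpoint hits like this:

import collections

counts = collections.Counter()
with open("/var/log/nginx/access.log") as log:
    for line in log:
        try:
            request = line.split('"')[1]   # e.g. 'GET /orders HTTP/1.1'
            path = request.split()[1]
        except IndexError:
            continue                       # skip malformed lines
        counts[path] += 1

for path, hits in counts.most_common(20):
    print(hits, path)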
But if you want to analyze your production code line by line (which is a very bad idea), you can start your application under a Python profiler. Python already ships with one: cProfile.
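For completeness, running a program under cProfile and reading the results can look roughly like this (the script and output file names are placeholders):

# profile the whole program and write the raw stats to a file:
#   python -m cProfile -o app.prof myapp.py
import pstats

stats = pstats.Stats("app.prof")
stats.sort_stats("cumulative").print_stats(20)   # the 20 most expensive call paths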
Maybe make a text file and, in every method of your program, append a short marker like "Method one executed". Run the web application thoroughly about 10 times, as a viewer would, and afterwards write a small Python program that reads the file, counts each marker (or pattern) and prints the totals, as in the sketch below.
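A tiny sketch of that counting step, assuming one marker string per line in a file called usage.log (the file name is just an example):

import collections

with open("usage.log") as f:
    counts = collections.Counter(line.strip() for line in f if line.strip())

for marker, hits in counts.most_common():
    print(f"{marker}: {hits}")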

Long-running Python program (using Pandas) keeps ramping up memory usage

I'm running a python script that handles and processes data using Pandas functions inside an infinite loop. But the program seems to be leaking memory over time.
This is the graph produced by the memory-profiler package:
Sadly, I cannot identify the source of the increasing memory usage. To my knowledge, all data (pandas timeseries) are stored in the object Obj, and I track the memory usage of this object using the pandas function .memory_usage and the objsize function get_deep_size(). According to their output, the memory usage should be stable around 90-100 MB. Other than this, I don't see where memory can ramp up.
It may be useful to know that the python program is running inside a docker container.
Below is a simplified version of the script which should illuminate the basic working principle.
from datetime import datetime
from time import sleep

import objsize
import requests
from dateutil import relativedelta

def update_data(Obj, now_utctime):
    # attaining the newest timeseries data
    new_data = requests.get(host, start=Obj.data[0].index, end=now_utctime)
    Obj.data.append(new_data)
    # cut off data older than 1 day
    Obj.data.truncate(before=now_utctime - relativedelta.relativedelta(days=1))

class ExampleClass():
    def __init__(self):
        now_utctime = datetime.utcnow()
        self.data = requests.get(host, start=now_utctime - relativedelta.relativedelta(days=1), end=now_utctime)

Obj = ExampleClass()

while True:
    update_data(Obj, datetime.utcnow())
    logger.info(f"Average at {datetime.utcnow()} is at {Obj.data.mean()}")
    logger.info(f"Stored timeseries memory usage at {Obj.data.memory_usage(deep=True) * 10 ** -6} MB")
    logger.info(f"Stored Object memory usage at {objsize.get_deep_size(Obj) * 10 ** -6} MB")
    sleep(60)
Any advice into where memory could ramp up, or how to further investigate, would be appreciated.
EDIT: Looking at the chart, it makes sense that there are spikes before I truncate, but since the data ingress is steady I don't know why the usage wouldn't normalize but instead remains at a higher point. Then there is this sudden drop after every 4th cycle, even though the process does not have another, broader cycle that could explain this ...
As suggested by moooeeeep, the increase in memory usage was related to a memory leak, the exact source of which remains to be identified. However, I was able to resolve the issue by manually calling the garbage collector after every loop iteration, via gc.collect().
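In terms of the simplified script above, the workaround amounts to something like this sketch (not the full program):

import gc

while True:
    update_data(Obj, datetime.utcnow())
    # ... logging as in the script above ...
    gc.collect()   # force a full collection pass after every iteration
    sleep(60)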

Why is Spark's show() function very slow?

I have
df.select("*").filter(df.itemid==itemid).show()
and that never terminates; however, if I do
print df.select("*").filter(df.itemid==itemid)
It prints in less than a second. Why is this?
That's because select and filter are just building up the execution plan; they aren't doing anything with the data yet. Then, when you call show, it actually executes those instructions. If it isn't terminating, I'd review the logs to see if there are any errors or connection issues. Or maybe the dataset is simply too large; try taking only 5 rows to see if that comes back quickly, for example:
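df.select("*").filter(df.itemid == itemid).show(5)            # print only the first 5 rows
# or cap the result explicitly before showing it
df.select("*").filter(df.itemid == itemid).limit(5).show()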
This usually happens if you don't have enough available memory on your computer. Free up some memory and try again.

Python script runs slower when executed multiple times

I have a Python script with a normal runtime of ~90 seconds. However, when I change only minor things in it (like alternating the colors in my final pyplot figure) and execute it multiple times in quick succession, its runtime increases to close to 10 minutes.
Some bullet points of what I'm doing:
I'm not downloading anything, nor creating new files with my script.
I merely open some locally saved .dat-files using numpy.genfromtxt and crunch some numbers with them.
I transform my data into a rec-array and use indexing via array.columnname extensively.
For each file I loop over a range of criteria that basically constitute different maximum and minimum values for evaluation, and embedded in that I use an inner loop over the lines of the data arrays. A few if's here and there but nothing fancy, really.
I use the multiprocessing module as follows
import multiprocessing
npro = multiprocessing.cpu_count() # Count the number of processors
pool = multiprocessing.Pool(processes=npro)
bigdata = list(pool.map(analyze, range(len(FileEndings))))
pool.close()
with analyze being my main function and FileEndings its input, a string, used to build the right name of the file I want to load and then evaluate. Afterwards, I use it a second time with
pool2 = multiprocessing.Pool(processes=npro)
listofaverages = list(pool2.map(averaging, range(8)))
pool2.close()
averaging being another function of mine.
I use numba's @jit decorator to speed up the basic calculations I do in my inner loops, with nogil, nopython, and cache all set to True. Commenting these out doesn't resolve the issue.
I run the script on Ubuntu 16.04 and am using a recent Anaconda build of Python.
I write the code in PyCharm and run it in its console most of the time. However, changing to bash doesn't help either.
Simply not running the script for about 3 minutes lets it go back to its normal runtime.
Using htop reveals that all processors are at full capacity when running. I am also seeing a lot of processes stemming from PyCharm (50 or so) that are each at equal MEM% of 7.9. The CPU% is at 0 for most of them, a few exceptions are in the range of several %.
Has anyone experienced such an issue before? And if so, any suggestions what might help? Or are any of the things I use simply prone to cause these problems?
This may be closed; the problem was caused by a malfunctioning fan in my machine.
