long-running python program ram usage - python

I am currently working on a project where a Python program is supposed to run for several days, essentially in an endless loop until a user intervenes.
I have observed that the RAM usage (as shown in the Windows Task Manager) rises slowly but steadily, for example from ~80 MB at program start to ~120 MB after one day. To get a closer look at this, I started to log the allocated memory with
tracemalloc.get_traced_memory() at regular intervals throughout the program execution. The output was written to a time series DB (see image below).
[Image: tracemalloc output for one day of runtime]
To me it looks like the memory that is needed for the program does not accumulate over time. How does this fit with the output of the Windows Task Manager? Should I go through my program to search for growing data structures?
Thank you very much in advance!

Okay, it turns out the answer is: no, this is not expected behaviour; the RAM usage can stay absolutely stable. I have tested this for three weeks now and the RAM usage never exceeded 80 MB.
The problem was in the usage of the influxdb v2 client.
You need to close both the write_api (implicitly done with the "with... as write_api:" statement) and the client itself (explicitly done via the "client.close()" in the example below).
In my previous version that had increasing memory usage, I only closed the write_api and not the client.
import time
import influxdb_client
from influxdb_client.client.write_api import SYNCHRONOUS
# Fragment from inside a method; self.url, self.token, self.org and self.bucket are set elsewhere.
client = influxdb_client.InfluxDBClient(url=self.url, token=self.token, org=self.org)
with client.write_api(write_options=SYNCHRONOUS) as write_api:
    # force datatypes, because influx does not do fluffy ducktyping
    datapoint = influxdb_client.Point("TaskPriorities")\
        .tag("task_name", str(task_name))\
        .tag("run_uuid", str(run_uuid))\
        .tag("task_type", str(task_type))\
        .field("priority", float(priority))\
        .field("process_time_h", float(process_time))\
        .time(time.time_ns())
    answer = write_api.write(bucket=self.bucket, org=self.org, record=datapoint)
# close the client itself as well, not only the write_api
client.close()
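One caveat with the snippet above: if write() raises, the final client.close() is never reached. A minimal sketch of one way to guard against that, using contextlib.closing from the standard library. FakeClient and FakeWriteApi are hypothetical stand-ins so the example runs without an InfluxDB server; they are not the real client classes.

```python
from contextlib import closing


class FakeWriteApi:
    """Hypothetical stand-in for a write API object."""

    def __init__(self):
        self.closed = False
        self.records = []

    def write(self, record):
        self.records.append(record)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.closed = True
        return False


class FakeClient:
    """Hypothetical stand-in for a client that owns network resources."""

    def __init__(self):
        self.closed = False
        self.api = FakeWriteApi()

    def write_api(self):
        return self.api

    def close(self):
        self.closed = True


def write_point(client, record):
    # closing() calls client.close() on exit, even if write() raises.
    with closing(client), client.write_api() as write_api:
        write_api.write(record)


client = FakeClient()
write_point(client, {"priority": 1.0})
print(client.closed, client.api.closed)
```

The same pattern should apply to any object that exposes a close() method; whether it is worth it depends on how often the write path can fail.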

Related

Python long time script running huge RAM and swap memory consumption

I am trying to write a Python script that periodically (every 20 ms) reads data from a USB port and writes the obtained data to a .csv file. The program needs to run on a Raspberry Pi 3B for at least one week, but I am facing a problem with RAM and swap memory consumption. After 9 hours of running, Linux kills my process with just one word, 'Killed', in the terminal. I have checked the RAM usage using the psutil module and it seems that the problem is the RAM and swap usage (one minute before the crash, 100% of swap was in use across all processes and 57% of RAM was in use by this process). I tried to find out where this memory leak is happening by using memory profiler. It seems like the problem is in the csv_append function (after 10 minutes of running it has grown by 7 MB), but when I take a closer look at this function with the profiler decorator, there seems to be no leakage.
Here is an example of this function:
def _csv_append(self, data):
    """
    Appends to .csv file
    """
    with open(self.last_file_name, 'a') as csvfile:
        csv_writter = csv.writer(csvfile)
        csv_writter.writerow(data)
Is there anything I can improve in my program so that it stops leaking memory and runs for a long time without getting killed by the Linux OOM killer? The main loop does nothing more than read bytes, interpret them as int using int.from_bytes(), call csv_append(), and wait for any remaining time to keep the 0.02 s period.
Thank you for your help :)
Analyzed memory consumption using memory profiler; it gave no helpful information. The problem seems to be in csv_append(), but there is no leakage there.
Deleted all variables each cycle and called the garbage collector with gc.collect().
I just ran the following little script:
import csv
Iterations = 1_000
def csv_append(data):
    with open("log.csv", "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(data)
for i in range(Iterations):
    data = [i, "foo", "bar"]
    csv_append(data)
and got some basic stats with /usr/bin/time -l ./main.py:
Iterations   Real_time   Peak_mem_footprint
1_000        0.06        5_718_848
1_000_000    22.08       5_833_664
I'm not even clearing data, and memory is virtually unchanged with 1000 times more iterations. I don't think it's the CSV file opening/writing.
I think there's something else in your program/setup you need to consider.
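If you want to repeat that check from inside Python rather than with /usr/bin/time, here is a minimal sketch using tracemalloc. The helper name traced_growth, the temporary file, and the iteration count are my own choices for illustration. Note that tracemalloc only sees allocations made through Python's allocator, so it will not show growth inside native libraries or the OS-level footprint of the process.

```python
import csv
import os
import tempfile
import tracemalloc


def csv_append(path, data):
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(data)


def traced_growth(iterations, path):
    """Return the change in traced Python allocations (bytes) over the loop."""
    tracemalloc.start()
    # Warm-up call so one-time allocations (codecs, caches) are not counted.
    csv_append(path, [0, "foo", "bar"])
    before, _ = tracemalloc.get_traced_memory()
    for i in range(iterations):
        csv_append(path, [i, "foo", "bar"])
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


with tempfile.TemporaryDirectory() as tmp:
    growth = traced_growth(2_000, os.path.join(tmp, "log.csv"))

print(f"traced growth after 2,000 appends: {growth} bytes")
```

If the number stays small no matter how many iterations you run, the append function is probably not where the memory is going.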

Long-running Python program (using Pandas) keeps ramping up memory usage

I'm running a python script that handles and processes data using Pandas functions inside an infinite loop. But the program seems to be leaking memory over time.
This is the graph produced by the memory-profiler package:
Sadly, I cannot identify the source of the increasing memory usage. To my knowledge, all data (pandas timeseries) are stored in the object Obj, and I track the memory usage of this object using the pandas function .memory_usage and the objsize function get_deep_size(). According to their output, the memory usage should be stable around 90-100 MB. Other than this, I don't see where memory can ramp up.
It may be useful to know that the python program is running inside a docker container.
Below is a simplified version of the script which should illuminate the basic working principle.
# Simplified version; requests, host and logger are assumed to be defined elsewhere.
from datetime import datetime
from time import sleep
import objsize
from dateutil import relativedelta
def update_data(Obj, now_utctime):
    # attaining the newest timeseries data
    new_data = requests.get(host, start=Obj.data[0].index, end=now_utctime)
    Obj.data.append(new_data)
    # cut off data older than 1 day
    Obj.data.truncate(before=now_utctime-relativedelta.relativedelta(days=1))
class ExampleClass():
    def __init__(self):
        now_utctime = datetime.utcnow()
        # store the data on the instance so update_data() can reach it
        self.data = requests.get(host, start=now_utctime-relativedelta.relativedelta(days=1), end=now_utctime)
Obj = ExampleClass()
while True:
    update_data(Obj, datetime.utcnow())
    logger.info(f"Average at {datetime.utcnow()} is at {Obj.data.mean()}")
    logger.info(f"Stored timeseries memory usage at {Obj.data.memory_usage(deep=True)* 10 ** -6} MB")
    logger.info(f"Stored Object memory usage at {objsize.get_deep_size(Obj) * 10 ** -6} MB")
    sleep(60)
Any advice into where memory could ramp up, or how to further investigate, would be appreciated.
EDIT: Looking at the chart, it makes sense that there will be spikes before I truncate, but since the data ingress is steady I don't know why it wouldn't normalize, but remain at a higher point. Then there is this sudden drop after every 4th cycle, even though the process does not have another, broader cycle that could explain this ...
As suggested by moooeeeep, the increase of memory usage was related to a memory leak, the exact source of which remains to be identified. However, I was able to resolve the issue by manually calling the garbage collector after every loop, via gc.collect().
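For context on why a manual gc.collect() can help: objects that refer to each other in a cycle are not freed by reference counting alone and wait for the cyclic garbage collector. The sketch below only illustrates that mechanism; it is not a diagnosis of the program above, and the Node class is hypothetical.

```python
import gc


class Node:
    """Hypothetical object that ends up in a reference cycle."""

    def __init__(self):
        self.partner = None


def make_cycle():
    a, b = Node(), Node()
    # a and b keep each other alive after the function returns.
    a.partner, b.partner = b, a


gc.disable()  # only to make the demo deterministic; not needed in real code
gc.collect()  # start from a clean slate
for _ in range(1_000):
    make_cycle()
unreachable = gc.collect()  # returns the number of unreachable objects found
gc.enable()

print(f"gc.collect() found {unreachable} unreachable objects")
```

In a long-running loop, the equivalent would be a single gc.collect() call at the end of each iteration, as described above.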

Why does Python use only 4% CPU?

I have a Python script that counts the occurrences of every k-long substring in a very large text. This is how it does it, after having stored and deduplicated the substrings:
counts = {}
for s in all_substrings:
counts[s] = full_text.count(s)
I was surprised to see that this script uses only 4% CPU on average. I have a 4-core, 8-thread CPU, but no core is used at more than single-digit percentages. I would have expected the script to use 100% of one core, since it doesn't do IO.
Why does it use so little computing power, and how can I improve that?
Your snippet is likely a memory-bandwidth-limited program. You do next to no computation in that loop. However, most CPU monitoring programs report 100% for a program that just accesses memory, so I am puzzled too.
If you want to see more of what is happening, I recommend you play with the excellent perf tool on Linux.
You could start by looking at page faults:
# Sample page faults with stack traces, until Ctrl-C:
perf record -e page-faults -ag
I was a victim of the BD PROCHOT bug in Dell's XPS 15, which constantly limited the CPU to 800 MHz. Disabling BD PROCHOT with ThrottleStop fixed the problem and my script now uses 100% of the CPU cores.

Is it possible to force a 2 second looping callback in Python?

I'm trying to get a looping call to run every 2 seconds. Sometimes I get the desired functionality, but other times I have to wait up to ~30 seconds, which is unacceptable for my application's purposes.
I reviewed this SO post and found that looping call might not be reliable for this by default. Is there a way to fix this?
My usage/reason for needing a consistent ~2 seconds:
The function I am calling scans an image (using CV2) for a dollar value and if it finds that amount it sends a websocket message to my point of sale client. I can't have customers waiting 30 seconds for the POS terminal to ask them to pay.
My source code is very long and not well commented as of yet, so here is a short example of what I'm doing:
from twisted.internet import reactor
from twisted.internet.task import LoopingCall
# scan the image for sales every 2 seconds
def scanForSale():
    print("Now Scanning for sale requests")
# retrieve a new image every 2 seconds
def getImagePreview():
    print("Loading Image From Capture Card")
lc = LoopingCall(scanForSale)
lc.start(2)
lc2 = LoopingCall(getImagePreview)
lc2.start(2)
reactor.run()
I'm using a Raspberry Pi 3 for this application, which is why I suspect it hangs for so long. Can I utilize multithreading to fix this issue?
Raspberry Pi is not a real time computing platform. Python is not a real time computing language. Twisted is not a real time computing library.
Any one of these by itself is enough to eliminate the possibility of a guarantee that you can run anything once every two seconds. You can probably get close but just how close depends on many things.
The program you included in your question doesn't actually do much. If this program can't reliably print each of the two messages once every two seconds then presumably you've overloaded your Raspberry Pi - a Linux-based system with multitasking capabilities. You need to scale back your usage of its resources until there are enough available to satisfy the needs of this (or whatever) program.
It's not clear whether multithreading will help - however, I doubt it. It's not clear because you've only included an over-simplified version of your program. I would have to make a lot of wild guesses about what your real program does in order to think about making any suggestions of how to improve it.
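To find out how close "close" is on a given machine, you can measure it. This is a plain standard-library sketch, not Twisted code: it runs a placeholder function on a fixed schedule and records how late each call starts. The names run_periodic and scan_for_sale and the short interval are assumptions for illustration.

```python
import time


def run_periodic(func, interval, ticks):
    """Call func on a fixed schedule; return how late each call started (seconds)."""
    lateness = []
    next_deadline = time.monotonic()
    for _ in range(ticks):
        delay = next_deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        lateness.append(time.monotonic() - next_deadline)
        func()
        # Schedule from the previous deadline, not from "now",
        # so a single slow call does not shift every later tick.
        next_deadline += interval
    return lateness


def scan_for_sale():
    pass  # placeholder for the real work


late = run_periodic(scan_for_sale, interval=0.05, ticks=5)
print("max lateness: %.4f s" % max(late))
```

If the measured lateness grows once the real image-processing work is plugged in, that points at the work itself (or an overloaded system) rather than at the scheduler.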

64 bit python fills up memory until computer freezes with no memerror

I used to run 32-bit Python on a 32-bit OS, and whenever I accidentally appended values to an array in an infinite loop or tried to load too big a file, Python would just stop with an out-of-memory error. However, I now use 64-bit Python on a 64-bit OS, and instead of raising an exception, Python uses up every last bit of memory and causes my computer to freeze, so I am forced to restart it.
I looked around Stack Overflow and it doesn't seem as if there is a good way to control or limit memory usage. For example, this solution: How to set memory limit for thread or process in python? limits the resources Python can use, but it would be impractical to paste into every piece of code I want to write.
How can I prevent this from happening?
I don't know if this will be the solution for anyone else but me, as my case was very specific, but I thought I'd post it here in case someone could use my procedure.
I had a VERY large dataset with millions of rows of data. Once I queried this data through a PostgreSQL database, I used up a lot of my available memory (63.9 GB available in total on a Windows 10 64-bit PC using Python 3.x 64-bit), and each of my queries used around 28-40 GB of memory, as the rows of data were kept in memory while Python did calculations on them. I used the psycopg2 module to connect to my PostgreSQL database.
My initial procedure was to perform calculations and then append the result to a list which I would return from my methods. I quite quickly ended up having too much stored in memory and my PC started freaking out (froze up, logged me out of Windows, display driver stopped responding, etc.).
Therefore I changed my approach using Python Generators. And as I would want to store the data I did calculations on back in my database, I would write each row, as I was done performing calculations on it, to my database.
def fetch_rows(cursor, arraysize=1000):
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
And with this approach I would do calculations on my yielded result by using my generator:
import psycopg2
def main():
    connection_string = "...."
    connection = psycopg2.connect(connection_string)
    cursor = connection.cursor()
    # Using generator
    for row in fetch_rows(cursor):
        # placeholder functions
        result = do_calculations(row)
        write_to_db(result)
This procedure does however indeed require that you have enough physical RAM to store the data in memory.
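Here is a self-contained sketch of the same generator pattern that you can run without a PostgreSQL server. It uses sqlite3 from the standard library as a stand-in, since both drivers expose fetchmany() on their cursors; the table name, row count, and batch size are made up for the demo. The RAM caveat above has a likely cause in psycopg2: a default (client-side) cursor transfers the whole result set to the client when the query executes, so batching with fetchmany() only limits the size of the Python lists you handle. A named, server-side cursor is the documented way to stream rows from the server.

```python
import sqlite3


def fetch_rows(cursor, arraysize=1000):
    """Yield rows one at a time while fetching them in batches."""
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        yield from results


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE samples (value INTEGER)")
connection.executemany(
    "INSERT INTO samples (value) VALUES (?)", ((i,) for i in range(10_000))
)

read_cursor = connection.cursor()
read_cursor.execute("SELECT value FROM samples")

total = 0
for (value,) in fetch_rows(read_cursor, arraysize=500):
    total += value  # placeholder for the real per-row calculation

connection.close()
print(total)
```

Only one batch of rows is held as Python objects at a time, and the running result is a single number instead of a growing list.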
I hope this helps whoever is out there with the same problem.
