Long-running Python script with huge RAM and swap memory consumption - python

I am trying to write a Python script that periodically (every 20 ms) reads data from a USB port and writes the obtained data to a .csv file. The program needs to run on a Raspberry Pi 3B for at least one week, but I am now facing a problem with RAM and swap memory consumption. After 9 hours of running, Linux kills my process with just one word, 'Killed', in the terminal. I checked the RAM usage with the psutil module, and the problem does seem to be RAM and swap usage (one minute before the crash, 100% of swap was in use across all processes and 57% of RAM was in use by this process). I tried to find out where the memory leak happens using memory_profiler; it looks like the problem is in the csv_append function (after 10 minutes of running it accumulates 7 MB of data), but when I take a closer look at this function with the @profile decorator, there appears to be no leak.
Here is an example of this function:
def _csv_append(self, data):
    """
    Appends to .csv file
    """
    with open(self.last_file_name, 'a') as csvfile:
        csv_writer = csv.writer(csvfile)
        csv_writer.writerow(data)
Is there anything I can improve in my program so that it stops leaking memory and runs for a long time without getting killed by the Linux OOM killer? In the main loop there is nothing more than reading bytes, interpreting them as an int with int.from_bytes(), calling _csv_append(), and waiting out whatever time is left to keep the 0.02 s period.
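Roughly, the main loop is shaped like this (a simplified sketch, not my exact code; names like self.port and self.frame_size stand in for the real serial-port details):
import time

PERIOD = 0.02  # 20 ms target cycle time

def _main_loop(self):
    next_tick = time.monotonic()
    while self.running:
        raw = self.port.read(self.frame_size)            # read raw bytes from the USB device
        value = int.from_bytes(raw, byteorder="little")  # interpret the bytes as an int
        self._csv_append([time.time(), value])
        next_tick += PERIOD
        remaining = next_tick - time.monotonic()
        if remaining > 0:                                # wait out whatever is left of the 20 ms
            time.sleep(remaining)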
Thank you for your help :)
I analyzed memory consumption using memory_profiler, but got no information that helps; it seems like the problem is in _csv_append(), yet there is no leak there.
I also tried deleting all variables each cycle and calling the garbage collector with gc.collect().

I just ran the following little script:
import csv

Iterations = 1_000

def csv_append(data):
    with open("log.csv", "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(data)

for i in range(Iterations):
    data = [i, "foo", "bar"]
    csv_append(data)
and got some basic stats with /usr/bin/time -l ./main.py:
Iterations    Real_time (s)    Peak_mem_footprint (bytes)
1_000         0.06             5_718_848
1_000_000     22.08            5_833_664
I'm not even clearing data, and memory is virtually unchanged with 1,000 times more iterations. I don't think it's the CSV file opening/writing.
I think there's something else in your program/setup you need to consider.
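If you want to see what is actually growing, one option (my suggestion, not something from your post) is to compare two tracemalloc snapshots taken a few minutes apart; the diff shows which lines allocated the most new memory in between:
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# ... let the acquisition loop run for a while ...

later = tracemalloc.take_snapshot()
for stat in later.compare_to(baseline, "lineno")[:10]:
    print(stat)  # top 10 sources of newly allocated memory since the baseline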

Related

Long-running Python program RAM usage

I am currently working on a project where a Python program is supposed to run for several days, essentially in an endless loop, until a user intervenes.
I have observed that the RAM usage (as shown in the Windows Task Manager) rises slowly but steadily, for example from ~80 MB at program start to ~120 MB after one day. To get a closer look at this, I started to log the allocated memory with
tracemalloc.get_traced_memory() at regular intervals throughout the program execution. The output was written to the time series DB (see image below).
(Image: tracemalloc output for one day of runtime)
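The logging itself is essentially just this (a simplified sketch; in the real program the values go to the time series DB instead of stdout):
import time
import tracemalloc

tracemalloc.start()
while True:
    current, peak = tracemalloc.get_traced_memory()  # bytes currently allocated / peak so far
    print(current, peak)                             # written to the time series DB in the real program
    time.sleep(60)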
To me it looks like the memory that is needed for the program does not accumulate over time. How does this fit with the output of the Windows Task Manager? Should I go through my program and search for growing data structures?
Thank you very much in advance!
Okay, it turns out the answer is: no, this is not proper behaviour; the RAM usage can stay absolutely stable. I have tested this for three weeks now and the RAM usage never exceeded 80 MB.
The problem was in the usage of the influxdb v2 client.
You need to close both the write_api (done implicitly with the "with ... as write_api:" statement) and the client itself (done explicitly via client.close() in the example below).
In my previous version, which had increasing memory usage, I only closed the write_api and not the client.
client = influxdb_client.InfluxDBClient(url=self.url, token=self.token, org=self.org)
with client.write_api(write_options=SYNCHRONOUS) as write_api:
    # force datatypes, because influx does not do fluffy ducktyping
    datapoint = influxdb_client.Point("TaskPriorities")\
        .tag("task_name", str(task_name))\
        .tag("run_uuid", str(run_uuid))\
        .tag("task_type", str(task_type))\
        .field("priority", float(priority))\
        .field("process_time_h", float(process_time))\
        .time(time.time_ns())
    answer = write_api.write(bucket=self.bucket, org=self.org, record=datapoint)
client.close()

Reading WARC Files Efficiently

I am reading a WARC file with Python's 'warc' library. The current file that I am using is around 4.50 GB. The thing is:
file = warc.open("random.warc")
html_lists = [line for line in file]
Executing these 2 lines takes up to 40 seconds. Since there will be 64000 more files like this one, it is not acceptable that it takes 40 seconds per file. Do you guys have any tips to improve performance or any different approaches?
Edit: I found out that the BeautifulSoup operations take some time, so I removed them and wrote the necessary parts myself. It is 100x faster now: it takes about 60 seconds to read and process 4.50 GB of data. With this line of code I remove the scripts from the data:
clean = re.sub(r"<script.*?</script>", "", string=text)
And with this one I split the text and remove the stamp which I don't need:
warc_stamp = str(soup).split(r"\r\n\r\n")
As I said, it is faster, but 60 seconds is still not that good in this case. Any suggestions?
but 60 seconds are not that good in this case
Of course, it would mean that processing all 64,000 WARC files takes 45 days if not done in parallel. But as a comparison: the Hadoop jobs to crawl the content of the WARC files and also those to transform WARCs into WAT and WET files need around 600 CPU days each.
WARC files are gzip-compressed because disk space and download bandwidth are usually the limiting factors. Decompression defines the baseline for any optimization. E.g., decompressing a 946 MB WARC file takes 21 seconds:
% time zcat CC-MAIN-20170629154125-20170629174125-00719.warc.gz >/dev/null
real 0m21.546s
user 0m21.304s
sys 0m0.240s
Iterating over the WARC records adds only a little extra time:
% cat benchmark_warc.py
import gzip
import sys
import warc
n_records = 0
for record in warc.WARCFile(fileobj=gzip.open(sys.argv[1])):
    if record['Content-Type'] == 'application/http; msgtype=response':
        n_records += 1
print("{} records".format(n_records))
% time python benchmark_warc.py CC-MAIN-20170629154125-20170629174125-00719.warc.gz
43799 records
real 0m23.048s
user 0m22.169s
sys 0m0.878s
If processing the payload only doubles or triples the time needed anyway for decompression (I cannot imagine that you can outperform the GNU gzip implementation significantly), you're close to the optimum. If 45 days is too long, the development time is better invested in parallelizing the processing. There are already plenty of examples of how to achieve this for Common Crawl data, e.g. cc-mrjob or cc-pyspark.
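For example (a rough sketch of that idea, not code from cc-mrjob or cc-pyspark; process_one_warc() is a placeholder for your per-file processing), a pool of worker processes spreads the files across cores:
import glob
from concurrent.futures import ProcessPoolExecutor

def process_one_warc(path):
    # placeholder: open the WARC, iterate over the records, extract what you need
    return path

if __name__ == "__main__":
    warc_paths = glob.glob("*.warc.gz")
    with ProcessPoolExecutor(max_workers=8) as pool:
        for finished in pool.map(process_one_warc, warc_paths):
            print("done:", finished)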
Get the source code of that module, and check for optimization potential.
Use a profiler to identify performance bottlenecks, then focus on these for optimization.
It can make a huge difference to rewrite Python code in Cython and compile it into native code, so that is likely worth a try.
But in any case, rather than speculating on an internet forum about how to accelerate a two-line script, you really need to work with the actual code underneath!
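For the profiling step, the standard library's cProfile is enough to get started (a sketch; process_file is a placeholder for whatever your per-file processing actually does):
import cProfile
import pstats

def process_file(path):
    pass  # placeholder for your WARC reading/processing code

profiler = cProfile.Profile()
profiler.enable()
process_file("random.warc")
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)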

NVMe Throughput Testing with Python

Currently I need to do some throughput testing. My hardware setup is a Samsung 950 Pro connected to an NVMe controller that is hooked to the motherboard via a PCIe port. I have a Linux nvme device corresponding to the drive, which I have mounted at a location on the filesystem.
My hope was to use Python to do this. I was planning on opening a file on the filesystem where the SSD is mounted, recording the time, writing an n-byte stream to the file, recording the time again, and then closing the file using the os module's file-operation utilities. Here is the function to gauge write throughput:
import os
import time

def perform_timed_write(num_bytes, blocksize, fd):
    """
    This function writes to a file and records the time.
    The function has three steps. The first is to write, the second is to
    record the time, and the third is to calculate the rate.

    Parameters
    ----------
    num_bytes: int
        number of bytes to write to the file
    blocksize: int
        blocksize that needs to be written to the file
    fd: string
        location on filesystem to write to

    Returns
    -------
    bytes_per_second: float
        rate of transfer
    """
    # generate random string
    random_byte_string = os.urandom(blocksize)

    # open the file
    write_file = os.open(fd, os.O_CREAT | os.O_WRONLY | os.O_NONBLOCK)

    # set time, write, record time
    bytes_written = 0
    before_write = time.clock()
    while bytes_written < num_bytes:
        os.write(write_file, random_byte_string)
        bytes_written += blocksize
    after_write = time.clock()

    # close the file
    os.close(write_file)

    # calculate elapsed time
    elapsed_time = after_write - before_write

    # calculate bytes per second
    bytes_per_second = num_bytes / elapsed_time
    return bytes_per_second
My other method of testing is to use Linux fio utility.
https://linux.die.net/man/1/fio
After mounting the SSD at /fsmnt/fs1, I used this jobfile to test the throughput
;Write to 1 file on partition
[global]
ioengine=libaio
buffered=0
rw=write
bs=4k
size=1g
openfiles=1
[file1]
directory=/fsmnt/fs1
I noticed that the write speed returned from the Python function is significantly higher than that of fio. Because Python is so high-level, there is a lot of control you give up. I am wondering if Python is doing something under the hood to cheat its speeds higher. Does anyone know why Python would report write speeds so much higher than those generated by fio?
The reason your Python program does better than your fio job is because this is not a fair comparison and they are testing different things:
You banned fio from using Linux's buffer cache (by using buffered=0 which is the same as saying direct=1) by telling it to do O_DIRECT operations. With the job you specified, fio will have to send down a single 4k write and then wait for that write to complete at the device (and that acknowledgement has to get all the way back to fio) before it can send the next.
Your Python script is allowed to send down writes that can be buffered at multiple levels (e.g. within userspace by the C library and then again in the buffer cache of the kernel) before touching your SSD. This will generally mean the writes will be accumulated and merged together before being sent down to the lower level resulting in chunkier I/Os that have less overhead. Further, since you don't do any explicit flushing in theory no I/O has to be sent to the disk before your program exits (in practice this will depend on a number of factors like how much I/O you do, the amount of RAM Linux can set aside for buffers, the maximum time the filesystem will hold dirty data for, how long you do the I/O for etc)! Your os.close(write_file) will just be turned into an fclose() which says this in its Linux man page:
Note that fclose() flushes only the user-space buffers provided by the C library. To ensure that the data is physically stored on disk the kernel buffers must be flushed too, for example, with sync(2) or fsync(2).
In fact you take your final time before calling os.close(), so you may even be omitting the time it took for the final "batches" of data to be sent only to the kernel let alone the SSD!
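To make the Python timing actually include the flush to the device, one small tweak (my sketch, not part of the original question's script) is to fsync the file descriptor before taking the final timestamp:
import os
import time

def timed_write_with_flush(path, num_bytes, blocksize):
    buf = os.urandom(blocksize)
    fd = os.open(path, os.O_CREAT | os.O_WRONLY)
    written = 0
    start = time.perf_counter()
    while written < num_bytes:
        written += os.write(fd, buf)
    os.fsync(fd)                      # force the kernel's dirty pages out to the device
    elapsed = time.perf_counter() - start
    os.close(fd)
    return num_bytes / elapsed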
Your Python script is closer to this fio job:
[global]
ioengine=psync
rw=write
bs=4k
size=1g
[file1]
filename=/fsmnt/fio.tmp
Even with this fio is still at a disadvantage because your Python program has userspace buffering (so bs=8k may be closer).
The key takeaway is that your Python program is not really testing your SSD's speed at your specified block size, and your original fio job is a bit weird, heavily restricted (the libaio ioengine is asynchronous, but with a depth of 1 you're not going to be able to benefit from that, and that's before we get to the behaviour of Linux AIO when using filesystems) and does different things to your Python program. If you're not doing significantly more buffered I/O than the size of the largest buffer (and on Linux the kernel's buffer size scales with RAM), and if the buffered I/Os are small, the exercise turns into a demonstration of the effectiveness of buffering.
If you need the exact performance of the NVMe device, fio is the best choice. FIO can write test data to the device directly, without any file system. Here is an example:
[global]
ioengine=libaio
invalidate=1
iodepth=32
time_based
direct=1
filename=/dev/nvme0n1
[write-nvme]
stonewall
bs=128K
rw=write
numjobs=1
runtime=10000
SPDK is another choice. There is an existing example of a performance test at https://github.com/spdk/spdk/tree/master/examples/nvme/perf.
Pynvme, which is based on SPDK, is a Python extension. You can write performance tests with its ioworker().

64-bit Python fills up memory until the computer freezes with no MemoryError

I used to run 32-bit Python on a 32-bit OS, and whenever I accidentally appended values to an array in an infinite loop or tried to load too big a file, Python would just stop with an out-of-memory error. However, I now use 64-bit Python on a 64-bit OS, and instead of raising an exception, Python uses up every last bit of memory and causes my computer to freeze, so I am forced to restart it.
I looked around Stack Overflow and it doesn't seem as if there is a good way to control or limit memory usage. For example, this solution, How to set memory limit for thread or process in python?, limits the resources Python can use, but it would be impractical to paste it into every piece of code I want to write.
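For reference, that linked approach boils down to something like this (Unix-only; the 4 GiB figure is just an example): it makes allocations beyond the limit raise MemoryError instead of dragging the whole machine down.
import resource

def limit_memory(max_bytes):
    # cap this process's address space; allocations past the cap raise MemoryError
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))

limit_memory(4 * 1024 ** 3)  # e.g. 4 GiB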
How can I prevent this from happening?
I don't know if this will be the solution for anyone else but me, as my case was very specific, but I thought I'd post it here in case someone could use my procedure.
I had a very large dataset with millions of rows of data. Whenever I queried this data from a PostgreSQL database, I used up a lot of my available memory (63.9 GB available in total on a Windows 10 64-bit PC running 64-bit Python 3.x), and each of my queries used around 28-40 GB of memory because the rows of data were kept in memory while Python did calculations on them. I used the psycopg2 module to connect to PostgreSQL.
My initial procedure was to perform calculations and then append the result to a list which I would return from my methods. I quite quickly ended up having too much stored in memory, and my PC started freaking out (it froze, logged me out of Windows, the display driver stopped responding, etc.).
Therefore I changed my approach and used Python generators. And since I wanted to store the data I did calculations on back in my database, I wrote each row to the database as soon as I was done performing calculations on it.
def fetch_rows(cursor, arraysize=1000):
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
And with this approach I would do calculations on my yielded result by using my generator:
import psycopg2

def main():
    connection_string = "...."
    connection = psycopg2.connect(connection_string)
    cursor = connection.cursor()

    # Using the generator
    for row in fetch_rows(cursor):
        # placeholder functions
        result = do_calculations(row)
        write_to_db(result)
This procedure does, however, require that you have enough physical RAM to store the data in memory.
I hope this helps whoever out there has the same problems.

Python's mmap() performance down with time

I am wondering why Python's mmap() performance goes down over time. I mean, I have a little app which makes changes to N files; if the set is big (not really that big, say 1000 files), the first 200 go at demon speed, but after that it gets slower and slower. It looks like I should free memory once in a while, but I don't know how, and most importantly I don't know why Python does not do this automagically.
Any help?
-- edit --
It's something like this:
import os
from mmap import mmap

def function(filename, N):
    fd = open(filename, 'rb+')
    size = os.path.getsize(filename)
    mapped = mmap(fd.fileno(), size)
    for i in range(N):
        some_operations_on_mmaped_block()  # placeholder for the real work on the mapped region
    mapped.close()
Your OS caches the mmap'd pages in RAM. Reads and writes go at RAM speed from the cache, and dirty pages are eventually flushed. On Linux, performance will be great until you have to start flushing pages; this is controlled by the vm.dirty_ratio sysctl variable. Once you start flushing dirty pages to disk, the reads will compete with the writes on your busy I/O bus/device. Another thing to consider is simply whether your OS has enough RAM to cache all the files (the buffers counter in top's output). So I would watch the output of "vmstat 1" while your program runs and watch the cache/buff counters go up until suddenly you start doing I/O.
