SDL_BlitSurface in PySDL2 causing segfault on larger surfaces - python

Background
I am creating a window with PySDL2 and using SDL_BlitSurface to embed a skia-python surface inside this window, with the following code:
import skia
import sdl2 as sdl
from ctypes import byref as pointer


class Window:
    DEFAULT_FLAGS = sdl.SDL_WINDOW_SHOWN
    BYTE_ORDER = {
        # ---------- -> RED GREEN BLUE ALPHA
        "BIG_ENDIAN": (0xff000000, 0x00ff0000, 0x0000ff00, 0x000000ff),
        "LIL_ENDIAN": (0x000000ff, 0x0000ff00, 0x00ff0000, 0xff000000)
    }
    PIXEL_DEPTH = 32  # BITS PER PIXEL
    PIXEL_PITCH_FACTOR = 4  # Multiplied by Width to get BYTES PER ROW

    def __init__(self, title, width, height, x=None, y=None, flags=None, handlers=None):
        self.title = bytes(title, "utf8")
        self.width = width
        self.height = height

        # Center Window By default
        self.x, self.y = x, y
        if x is None:
            self.x = sdl.SDL_WINDOWPOS_CENTERED
        if y is None:
            self.y = sdl.SDL_WINDOWPOS_CENTERED

        # Override flags
        self.flags = flags
        if flags is None:
            self.flags = self.DEFAULT_FLAGS

        # Handlers
        self.handlers = handlers
        if self.handlers is None:
            self.handlers = {}

        # SET RGBA MASKS BASED ON BYTE_ORDER
        is_big_endian = sdl.SDL_BYTEORDER == sdl.SDL_BIG_ENDIAN
        self.RGBA_MASKS = self.BYTE_ORDER["BIG_ENDIAN" if is_big_endian else "LIL_ENDIAN"]

        # CALCULATE PIXEL PITCH
        self.PIXEL_PITCH = self.PIXEL_PITCH_FACTOR * self.width

        # SKIA INIT
        self.skia_surface = self.__create_skia_surface()

        # SDL INIT
        sdl.SDL_Init(sdl.SDL_INIT_EVENTS)  # INITIALIZE SDL EVENTS
        self.sdl_window = self.__create_SDL_Window()

    def __create_SDL_Window(self):
        window = sdl.SDL_CreateWindow(
            self.title,
            self.x, self.y,
            self.width, self.height,
            self.flags
        )
        return window

    def __create_skia_surface(self):
        """
        Initializes the main skia surface that will be drawn upon,
        creates a raster surface.
        """
        surface_blueprint = skia.ImageInfo.Make(
            self.width, self.height,
            ct=skia.kRGBA_8888_ColorType,
            at=skia.kUnpremul_AlphaType
        )
        # noinspection PyArgumentList
        surface = skia.Surface.MakeRaster(surface_blueprint)
        return surface

    def __pixels_from_skia_surface(self):
        """
        Converts Skia Surface into a bytes object containing pixel data
        """
        image = self.skia_surface.makeImageSnapshot()
        pixels = image.tobytes()
        return pixels

    def __transform_skia_surface_to_SDL_surface(self):
        """
        Converts Skia Surface to an SDL surface by first converting
        Skia Surface to Pixel Data using .__pixels_from_skia_surface()
        """
        pixels = self.__pixels_from_skia_surface()
        sdl_surface = sdl.SDL_CreateRGBSurfaceFrom(
            pixels,
            self.width, self.height,
            self.PIXEL_DEPTH, self.PIXEL_PITCH,
            *self.RGBA_MASKS
        )
        return sdl_surface

    def update(self):
        window_surface = sdl.SDL_GetWindowSurface(self.sdl_window)  # the SDL surface associated with the window
        transformed_skia_surface = self.__transform_skia_surface_to_SDL_surface()

        # Transfer skia surface to SDL window's surface
        sdl.SDL_BlitSurface(
            transformed_skia_surface, None,
            window_surface, None
        )

        # Update window with new copied data
        sdl.SDL_UpdateWindowSurface(self.sdl_window)

    def event_loop(self):
        handled_events = self.handlers.keys()
        event = sdl.SDL_Event()

        while True:
            sdl.SDL_WaitEvent(pointer(event))

            if event.type == sdl.SDL_QUIT:
                break
            elif event.type in handled_events:
                self.handlers[event.type](event)


if __name__ == "__main__":
    skiaSDLWindow = Window("Browser Test", 500, 500, flags=sdl.SDL_WINDOW_SHOWN | sdl.SDL_WINDOW_RESIZABLE)
    skiaSDLWindow.event_loop()
I monitor my CPU usage for the above code and it stays well below 20% with hardly any change in RAM usage.
Problem
The problem is that as soon as I make a window larger than 690 x 549 (or any other size where the product of width and height is the same), I get a segfault (core dumped), with CPU usage going up to 100% and no change in RAM usage.
What I have already tried/know
I know the fault is with SDL_BlitSurface, as reported by the faulthandler module in Python and the classic print("here") lines.
I am not familiar with languages like C, so from my basic understanding of a segfault I tried to compare the size of the byte string returned by Window.__pixels_from_skia_surface (measured with sys.getsizeof) against the sizes of C data types, because I suspected an overflow (forgive me if this is the stupidest debugging method you have ever seen). But the size didn't come close to any of the C data types.

As the SDL_CreateRGBSurfaceFrom documentation says, it doesn't allocate memory for the pixel data but uses the external memory buffer passed to it. While there's a benefit in having no copy operation at all, this has lifetime implications - note "you must free the surface before you free the pixel data".
Python tracks references for its objects and automatically destroys an object once its reference count reaches 0 (i.e. no references to that object are possible - delete it immediately). But neither SDL nor skia are Python libraries, and whatever references they keep in their native code are not exposed to Python. So Python's automatic memory management doesn't help you here.
What's happening is this: you get the pixel data from skia as a bytes object (a Python object, automatically freed when no longer referenced), then pass it to SDL_CreateRGBSurfaceFrom (native code; Python doesn't know that it keeps an internal reference), and then your pixels buffer goes out of scope and Python deletes it. You still have the surface, but SDL says that for a surface created this way the pixels must not be destroyed (there are other constructors, like SDL_CreateRGBSurface, that do allocate their own memory). Then you try to blit it, and the surface still points to the location where the pixels were, but that buffer is no longer there.
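One way to avoid the crash is to keep a Python reference to the pixel buffer for as long as the SDL surface borrowing its memory is in use. A minimal sketch against the question's code (the attribute name _pixels is my own, not from the original):

    def __transform_skia_surface_to_SDL_surface(self):
        # Keep the bytes object referenced on the instance so it outlives
        # the SDL surface that borrows its memory.
        self._pixels = self.__pixels_from_skia_surface()
        sdl_surface = sdl.SDL_CreateRGBSurfaceFrom(
            self._pixels,
            self.width, self.height,
            self.PIXEL_DEPTH, self.PIXEL_PITCH,
            *self.RGBA_MASKS
        )
        return sdl_surface

Calling SDL_FreeSurface on the returned surface before dropping (or overwriting) self._pixels then respects the documented ordering.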
[Everything that follows is an explanation of why exactly it didn't crash with a smaller surface size, and that turned out to require many more words than I thought. Sorry. If you're not interested in that stuff, don't read any further.]
What happens next purely depends on the memory allocator used by Python. First, a segmentation fault is a critical signal sent by the operating system to your program, and it happens when you access memory pages in a way you're not supposed to - e.g. reading memory that has no mapped pages, or writing to pages that are mapped read-only. All that, and the way to map/unmap pages, is provided by your operating system kernel (e.g. in Linux it is handled by the mmap/munmap calls), but the OS kernel only operates at page level; you can't request half a page, but you can have a large block backed by N pages. For most current operating systems the minimal page size is 4 kB; some OSes support 2 MB or even larger 'huge' pages.
So, you get a segmentation fault when the surface is larger, but don't get it when the surface is smaller. That means that for the larger surface, BlitSurface hits memory that has already been unmapped, and the OS sends your program a polite "sorry, can't allow that, correct yourself immediately or you're going down". But when the surface is smaller, the memory the pixels were kept in is still mapped; that doesn't necessarily mean it still contains the same data (e.g. Python could have placed some other object there), but as far as the OS is concerned this memory region is still 'yours' to read. And the difference in behaviour is indeed caused by the size of the allocated buffer (but of course you can't rely on that behaviour being the same on another OS, another Python version, or even with a different set of environment variables).
As I've said before, you can only mmap entire pages, but Python (that's just an example, as you'll see later) has a lot of smaller objects (integers, floats, small strings, short arrays, ...) that are much smaller than a page. Allocating an entire page for each of them would be a massive waste of memory (plus other problems, like reduced performance because of bad caching). To handle that, what we do ('we' being every single program that needs smaller allocations, i.e. 99% of the programs you use every day) is allocate a larger block of memory and track which parts of that block are allocated/freed entirely in userspace (as opposed to pages, which are tracked by the OS kernel - in kernelspace). That way you get very tight packing of small allocations without too much overhead, but the downside is that these allocations are not distinguishable at the OS level. When you 'free' a small allocation placed in that kind of pre-allocated block, you just internally mark the region as unused, and the next time some other part of your program requests memory you start searching for a place to put it. It also means you usually don't return (unmap) memory to the OS, as you can't give back a block if even one byte of it is still in use.
Python internally manages small objects (<512 bytes) itself, by allocating 256 kB blocks and placing objects in those blocks. If a larger allocation is required, it passes the request to libc's malloc (Python itself is written in C and uses libc; the most popular libc for Linux is glibc). And the malloc documentation for glibc says the following:
When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the
glibc malloc() implementation allocates the memory as a private
anonymous mapping using mmap(2). MMAP_THRESHOLD is 128 kB by default,
but is adjustable using mallopt(3)
So, allocations for larger objects should go through mmap/munmap, and freeing those pages should make them inaccessible (causing a segfault if you try to access them, instead of silently reading potentially garbage data; bonus points if you try to write into it - so-called memory stomping, overwriting something else, possibly even the internal markers libc uses to track which memory is in use; anything can happen after that). While there is still a chance that the next mmap will randomly place the next page at the same address, I'm going to neglect that. Unfortunately this is very old documentation that, while it explains the basic intent, no longer reflects how glibc behaves nowadays. Take a look at this comment in the glibc source (emphasis is mine):
M_MMAP_THRESHOLD is the request size threshold for using mmap()
to service a request. Requests of at least this size that cannot be
allocated using already-existing space will be serviced via mmap.
(If enough normal freed space already exists it is used instead.)
...
The implementation works with a sliding threshold, which is by
default limited to go between 128Kb and 32Mb (64Mb for 64
bitmachines) and starts out at 128Kb as per the 2001 default.
...
The threshold goes up in value when the application frees
memory that was allocated with the mmap allocator. The idea is
that once the application starts freeing memory of a certain size,
it's highly probable that this is a size the application uses for
transient allocations.
So, it tries to adapt to your allocation behaviour to balance performance with releasing memory back to the OS.
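To put the question's numbers next to those thresholds, a back-of-the-envelope sketch (4 bytes per RGBA pixel, as implied by the question's PIXEL_PITCH_FACTOR):

    # Size of the skia pixel buffer for the two window sizes from the question.
    for w, h in [(500, 500), (690, 549)]:
        size = w * h * 4
        print(f"{w}x{h}: {size} bytes (~{size // 1024} kB)")
    # 500x500: 1000000 bytes (~976 kB)
    # 690x549: 1515240 bytes (~1479 kB)
    # Both are above the static 128 kB default, so whether each buffer gets its
    # own mmap'd pages depends on the sliding threshold and on whether already
    # free heap space can serve the request, as described above.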
But different OSes behave differently, and even on Linux alone we have multiple libc implementations (e.g. musl) that implement malloc differently, plus a lot of different memory allocators (jemalloc, tcmalloc, dlmalloc, you name it) that could be injected via LD_PRELOAD, so your program (e.g. Python itself in this case) would use a different allocator with different rules on mmap usage. There are even debug allocators that inject "guard" pages around every allocation, with no access rights at all (can't read, write or execute), to catch common memory-related programming mistakes, at the cost of massively larger memory usage.
To sum it up - you had a lifetime management bug in your code, and unfortunately it didn't crash immediately due to the internals of libc's memory allocation scheme, but it did crash when the surface size got larger and libc decided to allocate exclusive pages for that buffer. That is an unfortunate turn of events that languages without automatic memory management are exposed to, and by virtue of using Python C bindings your Python program is, to some extent, exposed as well.

Related

How to free up memory in Lambda due to numpy.core._exceptions._ArrayMemoryError using pandas compare()

When I run this code in a lambda function in which the memory allocation setting is set to max (10240):
df_compare = first_less_dupes[compare_columns].compare(second_less_dupes[compare_columns])
I'm seeing this error:
Unable to allocate 185. MiB for an array with shape (2697080, 9) and data type float64 - Error type:<class 'numpy.core._exceptions._ArrayMemoryError'>
I've run this code many times with smaller dfs without issue. So I began attacking this from a memory capacity/clean-up approach, my assumption being: I need to free up memory. I use two snippets of code to audit my memory usage:
import os
import psutil

def print_current_memory():
    '''
    Gets the current process and checks current memory usage
    '''
    process = psutil.Process(os.getpid())
    mbs = round(process.memory_info().rss / 1024 / 1024, 2)
    print('Current memory usage:', mbs, 'MB')
And
import sys

for obj_name in list(locals().keys()):
    size = str(sys.getsizeof(locals()[obj_name]))
    mbs = str(round(int(size) / 1024 / 1024, 2))
    print(f'{obj_name}: {mbs}MB. {size}B.')
The print_current_memory function does just what it says in its comments. The loop prints out a list of all local variables and their sizes. Using the loop, I identified several objects that I did not need. (Strangely, the summed size of the listed objects should have greatly exceeded the Lambda limit, even before the error.)
So I delete those objects and garbage collect (I understand gc may not be necessary).
print_current_memory()
print('Deleting first & limited')
del first
del first_limited
print('Deleting second & limited')
del second
del second_limited
print('Deleting both_df')
del both_df
print('Garbage collecting')
gc.collect()
print_current_memory()
After running this I see:
I am clearly doing something wrong, since the current memory usage doesn't decrease. And that is my main concern: how do I decrease memory usage to make space for this new dataframe? But perhaps I'm asking the wrong question and need to question my assumptions, like: Can I monitor current memory usage in a Lambda the same way I would on Windows? Am I deleting objects the right way? My use of gc probably illustrates how little I know about it, so am I using that correctly?
del does not properly delete objects, but simply drops the reference tied to the variable name being deleted. You must make sure that every other reference is properly dropped too.
Then you might still need to wait for garbage collection to happen. However, with pandas and numpy the actual data is managed in C/C++, and therefore the memory should be released immediately once the last reference is dropped.
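A small sketch of that point (the array shape is borrowed from the error message; psutil and the helper rss_mb are my own additions): as long as any other name still references the data, del and gc.collect() free nothing, and the memory only drops when the last reference goes away.

import gc
import os

import numpy as np
import psutil

def rss_mb():
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024

a = np.ones((2697080, 9))   # ~185 MiB, like the array in the error message
b = a                       # a second reference to the same data
del a                       # only removes the name 'a'; data kept alive by 'b'
gc.collect()
print(rss_mb())             # still high
del b                       # last reference gone; numpy frees the buffer here
gc.collect()                # not strictly needed for this case
print(rss_mb())             # typically drops back down (large buffers are mmap'd and returned to the OS)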
Since you are working in an Amazon Lambda, the data you transform does not need to be kept, because you just want your result out. Then it is certainly safe for you to use inplace operations to replace the original data with your processed data, and thereby free up space. Perhaps such a tutorial could get you started.

ZMQ pub/sub for transferring base64 images

I'm facing a task: using a zmq socket to send and receive base64 strings (generated from 800x600 images). Currently, I'm using a pub/sub connection to perform this task, but it looks like the messages are large, so the socket can't transfer them immediately and the later messages get stuck in the network buffer. Although I don't want to lose too many messages, I must restrict the HWM value so that the socket works properly. So I have some questions:
Is there another effective library/way to perform my task? Or should I use another connection type that zmq provides (router/dealer or request/reply)?
To transfer an image (processed by OpenCV), is there an approach I can use to minimize the size of the image being sent, other than converting it into base64 format?
If I must continue using the zmq pub/sub connection, how can I limit the time for storing old messages, rather than the number of them - say, keep them for 3 minutes?
Here my python code for the socket:
Publisher
import numpy as np
import zmq
import base64
context = zmq.Context()
footage_socket = context.socket(zmq.PUB)
footage_socket.setsockopt(zmq.SNDHWM, 1)
footage_socket.connect(<tcp address>)
def send_func(frame, camera_link):
    height, width, _ = frame.shape
    frame = np.ascontiguousarray(frame)
    base64_img = base64.b64encode(frame).decode("ascii")  # JSON needs str, not bytes
    data = {"camera_link": camera_link, "base64_img": base64_img, "img_width": width, "img_height": height}
    footage_socket.send_json(data)
Subscriber
import base64
import json

import numpy as np
import zmq

context = zmq.Context()
footage_socket = context.socket(zmq.SUB)
footage_socket.bind(<tcp address>)
footage_socket.setsockopt(zmq.RCVHWM, 1)

def rcv_func():
    while True:
        print("run socket")
        try:
            framex = footage_socket.recv_string()
            data = json.loads(framex)
            frame = data['base64_img']
            img = np.frombuffer(base64.b64decode(frame), np.uint8)
            img = img.reshape(int(data['img_height']), int(data['img_width']), 3)
        except Exception as e:
            print(e)
Before we start, let me take a few notes:
- avoid re-packing data into JSON, if it were just for the ease of coding. JSON-re-serialised data "grow" in size, without delivering you a single value-added for ultra-fast & resources-efficient stream-processing. Professional systems "resort" to JSON-format only if they have plenty of time and almost unlimited spare CPU-processing power, which they waste on re-packing the valuable data into just another box-of-data-inside-another-box-of-data. Where feasible, they can pay all the costs and inefficiencies - here, you get nothing in exchange for the spent CPU-clocks, more than double the RAM needed for the re-packing itself, and you also have to transport even larger data (a sketch of skipping the JSON/base64 re-packing follows these notes)
- review, if the camera indeed provides image-data that "deserves" to become 8-Byte / 64-bit "deep"; if not, you have the first remarkable image-data reduction free of charge
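A minimal sketch of that first note (the addresses and helper names are placeholders, not the asker's code): send a tiny JSON header plus the raw pixel buffer as a two-part message, so no base64 string and no big JSON body is ever built.

import numpy as np
import zmq

context = zmq.Context()

pub = context.socket(zmq.PUB)
pub.setsockopt(zmq.SNDHWM, 1)
pub.connect("tcp://127.0.0.1:5555")           # placeholder address

def send_frame(frame, camera_link):
    frame = np.ascontiguousarray(frame)       # make sure the buffer is contiguous
    header = {"camera_link": camera_link,
              "dtype": str(frame.dtype),
              "shape": frame.shape}
    pub.send_json(header, flags=zmq.SNDMORE)  # small header part
    pub.send(frame, copy=False)               # raw pixel bytes, no base64

sub = context.socket(zmq.SUB)
sub.bind("tcp://*:5555")                      # placeholder address
sub.setsockopt(zmq.RCVHWM, 1)
sub.setsockopt_string(zmq.SUBSCRIBE, "")      # subscribe to everything

def recv_frame():
    header = sub.recv_json()
    buf = sub.recv()                          # second part of the message
    img = np.frombuffer(buf, dtype=header["dtype"]).reshape(header["shape"])
    return header["camera_link"], img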
Using sys.getsizeof() may surprise you:
>>> aa = np.ones( 1000 )
>>> sys.getsizeof( aa )
8096 <---------------------------- 8096 [B] object here "contains"
>>> (lambda array2TEST: array2TEST.itemsize * array2TEST.size )( aa )
8000 <---------------------------- 8000 [B] of data
>>> bb = aa.view() # a similar effect happen in smart VECTORISED computing
>>> sys.getsizeof( bb )
96 <------------------------------ 96 [B] object here "contains"
>>> (lambda array2TEST: array2TEST.itemsize * array2TEST.size )( bb )
8000 <---------------------------- 8000 [B] of data
>>> bb.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False <-------------------------------||||||||||
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
>>> bb.dtype
dtype('float64') <-------------- 8 [B] per image-pixel even for {1|0} B/W
Q : is there an approach I can use to minimize the size of the sending image...?
Yes, millions of [man * years] of R&D have already been spent on solving this problem, and the best-in-class methods for doing it are still evolving.
The best results, as anyone may have already expected on one's own, are needed for extreme corner-cases - for satellite imagery transported from far away, deep in space, back home - like when JAXA was on its second asteroid rendezvous mission, this time visiting the Ryugu asteroid.
Your as-is code produces 800x600-image-frames at a so far unspecified fps-rate and color-depth. A brief view shows how much data that can easily generate within those said -3-minutes-, if the process is not handled with more attention and due care:
>>> (lambda a2T: a2T.itemsize * a2T.size )( np.ones( ( 800, 600, 3 ) ) ) / 1E6
11.52 <---- each 800x600-RGB-FRAME handled/processed this way takes ~ 11.5 [MB]
#~30 fps ~345.6 [MB/s]
~ 62.2 [GB/3min]
Solution? Take the inspiration from The Best in the Class know-how :
There you have limited power ( both the energy-wise and the processing-wise - do not forget, the CPU-s "inside" this satellite were already manufactured more than some 5 - 7 years ago, before the Project launch - no one serious will dare to send a mission with bright and hot new, but unproven, COTS chips ), limited RAM ( again, the power plus weight limits, as the amount of the fuel needed to liftoff and fly "there" grows with every single gram of "The Useful Payload" ) and the last but not least - the most limiting factor - you have very limited means of R/F-COMMs - a so "loooooooong"-wire ( it takes almost half a day, to get a first bit from "there" back "here" + the same, if you try to ACK/NACK from "here" answering any remote-request or requesting a re-send after an error was detected ). The current DSN effective-telemetry data transport-speeds are about 6.4 ~ 9.6 kbps ( yes, not more than about 7000 bits/sec )
Here, the brightest minds have put all the art of the human intellect, into making this happen:
ultimate means of image compression - never send a bit unless it is indeed vital & necessary
ultimate means of transcoded-image data error self-correction added - if anything is worth adding, the error-detection is not ( you will have to wait for almost a day, to get it "re-transmitted" again, hopefully without another error there ). Here we need a means of ( limited - see the costs of sending a single bit above, so this has to be a very economic add-on ) self-correction, which can indeed repair some limited scope of signal/data-transport errors that may appear and do appear during the R/F-COMMs signal travelling from deep space back home. On larger errors, you have to wait a few days to get a re-scheduled image-data error recovery, solved by another try to send a larger pack that was not recoverable from the "damaged"-data by the capabilities engineered into the built-in error self-correction.
Where to start from?
If your use-case does not have the remarkable amount of the "spare" CPU-power available ( it is indeed needed to have pretty enough "free" CPU+RAM-resources to perform any such advanced image-data trans-coding & error-recovery re-processing, both in scale ( volume of additional data for the trans-coding and re-processing - both of which come at large sizes - orders of magnitude larger than a size of a single image-frame ) and in time ( speed of the additional CPU-processing ) ) there is no magic trick to get the ultimate image-data compression and your story ends here.
If your use-case can spin up more CPU-power, your next enemy is the time. Both the time to design a clever-enough image-processing and the time to process each image-frame, using your engineered image-data trans-coding, within a reasonably short amount of time, before sending over to the recipient end. The former is manageable by your Project-resources ( by finance - to get the right skilled engineers on board, and by people who execute (do) the actual design & engineering phase ). The latter is not manageable, it depends on your Project's needs - how fast ( fps ) and bearing what latency ( how late, in accumulated [ms] of delays ) your Project can still survive to perform the intended function.
python is an easy prototyping eco-system, once you need to boost the throughput ( ref. above ), this most probably ( 30+ years of experience make me awfully well confident in saying this - even if you pull in add-on steroids, like moving into cython + C-extensions for doing the whole circus indeed a bit, but only a bit faster, at an immense add-on cost of having to ( acquire new skill if not already on board - having an expensive learning curve duration and grows in salaries for those well-skilled ) re-engineer and re-factor your so far well-prototyped code-base ) will be the first blocker of the show going on
OpenCV can and will provide you some elementary image-manipulation tools to start from
image-data trans-coding and ordinary or ultimate data-compression have to follow, to reduce the data-size
ZeroMQ is the least problematic part - both performance-wise scalable and having unique low-latency throughput capabilities. Without any details, one may forget about the PUB/SUB, unless you prevent and avoid any subscription-list processing at all ( the costs of doing this would cause immense side-effects on the { central-node | network-dataflows + all remote-nodes }-overloads, having no practical effect for the intended fast and right-sized image-data pipeline-processing ).
Q : If I must continue using zmq pub/sub connection, how can I limit the time for storing old messages, not the number of them, like that for 3 minutes?
ZeroMQ is a smart tool, yet one has to understand its powers - ZeroCopy will help you keep a low RAM-profile in production, yet if you plan to store -3-minutes of image-data streaming, you will need both immense loads of RAM and CPU-power, and it all also heavily depends on the actual number of .recv()-ing peers.
ZeroMQ is a broker-less system, so you do not actually "store" messages; the .send()-method just tells the ZeroMQ infrastructure that the provided data are free-to-get-sent, whenever the ZeroMQ infrastructure sees a chance to dispatch 'em to the designated peer-recipient ( be it locally or over the Atlantic or over a satellite-connection ). This means proper ZeroMQ configuration is a must, if you plan to have the sending/receiving sides ready to enqueue / transmit / receive / dequeue ~3 minutes of even the most compressed image-data stream(s), potentially providing multiples of that, in case 1:many-party communication appears in production.
Proper analysis and sound design decisions are the only chance for your Project to survive all these requirements, given that the CPU, RAM and transport-means are a-priori known to be limited.

Writing to memory in a single operation

I'm writing a userspace driver for accessing FPGA registers in Python 3.5 that mmaps the FPGA's PCI address space, obtains a memoryview to provide direct access to the memory-mapped register space, and then uses struct.pack_into("<I", ...) to write a 32-bit value into the selected 32-bit aligned address.
import mmap
import pathlib
import struct

def write_u32(address, data):
    assert address % 4 == 0, "Address must be 32-bit aligned"
    path = pathlib.Path("/dev/uio0")
    file_size = path.stat().st_size
    with path.open(mode='w+b') as f:
        mv = memoryview(mmap.mmap(f.fileno(), file_size))
        struct.pack_into("<I", mv, address, data)
Unfortunately, it appears that struct.pack_into does a memset(buf, 0, ...) that clears the register before the actual value is written. By examining write operations within the FPGA, I can see that the register is set to 0x00000000 before the true value is set, so there are at least two writes across the PCI bus (in fact, for 32-bit access there are three: two zero writes, then the actual data; 64-bit involves six writes). This causes side effects with some registers that count the number of write operations, or some that "clear on write" or trigger some event when written.
I'd like to use an alternative method to write the register data in a single write to the memory-mapped register space. I've looked into ctypes.memmove and it looks promising (not yet working), but I'm wondering if there are other ways to do this.
Note that a register read using struct.unpack_from works perfectly.
Note that I've also eliminated the FPGA from this by using a QEMU driver that logs all accesses - I see the same double zero-write access before data is written.
I revisited this in 2022 and the situation hasn't really changed. If you're considering using memoryview to write blocks of data at once, you may find this interesting.
Perhaps this would work as needed?
mv[address:address+4] = struct.pack("<I", data)
Update:
As seen from the comments, the code above does not solve the problem. The following variation of it does, however:
mv_as_int = mv.cast('I')
mv_as_int[address // 4] = data
Unfortunately, precise understanding of what happens under the hood and why exactly memoryview behaves this way is beyond the capabilities of modern technology and will thus stay open for the researchers of the future to tackle.
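For completeness, a sketch of the question's write_u32 with the cast-based store from the update dropped in (device path and open mode as in the question; untested here):

import mmap
import pathlib

def write_u32(address, data):
    assert address % 4 == 0, "Address must be 32-bit aligned"
    path = pathlib.Path("/dev/uio0")
    file_size = path.stat().st_size
    with path.open(mode='w+b') as f:
        mv = memoryview(mmap.mmap(f.fileno(), file_size))
        mv_as_int = mv.cast('I')        # native byte order; same as "<I" on a little-endian host
        mv_as_int[address // 4] = data  # one aligned 32-bit store, no memset beforehand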
You could try something like this:
import mmap
import os

import numpy as np

class MemMappedRegisters:  # illustrative wrapper; the original answer shows only the methods
    def __init__(self, offset, size=0x10000):
        self.offset = offset
        self.size = size
        mmap_file = os.open('/dev/mem', os.O_RDWR | os.O_SYNC)
        mem = mmap.mmap(mmap_file, self.size,
                        mmap.MAP_SHARED,
                        mmap.PROT_READ | mmap.PROT_WRITE,
                        offset=self.offset)
        os.close(mmap_file)
        # A uint32 view over the mapping; element assignment is a single store.
        self.array = np.frombuffer(mem, np.uint32, self.size >> 2)

    def wread(self, address):
        idx = address >> 2
        return_val = int(self.array[idx])
        return return_val

    def wwrite(self, address, data):
        idx = address >> 2
        self.array[idx] = np.uint32(data)
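Hypothetical usage of the sketch above (the physical offset is a placeholder; it depends entirely on where the registers live on your system):

regs = MemMappedRegisters(offset=0x43C00000, size=0x10000)  # placeholder base address
regs.wwrite(0x10, 0xDEADBEEF)   # single 32-bit store via the numpy view
print(hex(regs.wread(0x10)))    # read the register back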

Python Memory Leak - Why is it happening?

For some background on my problem, I'm importing a module, data_read_module.pyd, written by someone else, and I cannot see the contents of that module.
I have one file, let's call it myfunctions. Ignore the ### for now; I'll comment on the commented portions later.
import data_read_module

def processData(fname):
    data = data_read_module.read_data(fname)
    ''' process data here '''
    return t, x
    ### return 1
I call this within the framework of a larger program, a Tkinter GUI specifically. For the purposes of this post, I've pared it down to the bare essentials. Within the GUI code, I call the above as follows:
import myfunctions

class MyApplication:
    def __init__(self, parent):
        self.t = []
        self.x = []

    def openFileAndProcessData(self):
        # self.t = None
        # self.x = None
        self.t, self.x = myfunctions.processData(fname)
        ## myfunctions.processData(fname)
I noticed that every time I run openFileAndProcessData, Windows Task Manager reports that my memory usage increases, so I thought that I had a memory leak somewhere in my GUI application. So the first thing I tried is the
# self.t = None
# self.x = None
that you see commented above. Next, I tried calling myfunctions.processData without assigning the output to any variables, as follows:
## myfunctions.processData(fname)
This also had no effect. As a last-ditch effort, I changed the processData function so it simply returns 1 without even processing any of the data that comes from data_read_module.pyd. Unfortunately, even this results in more memory being taken up with each successive call to processData, which narrows the problem down to data_read_module.read_data. I thought that within the Python framework this is the exact type of thing that is automatically taken care of. Referring to this website, it seems that memory taken up by a function will be released when the function terminates. In my case, I would expect the memory used in processData to be released after a call [with the exception of the output that I am keeping track of with self.t and self.x]. I understand I won't get a fix to this kind of issue without access to data_read_module.pyd, but I'd like to understand how this can happen to begin with.
A .pyd file is basically a DLL. You're calling code written in C, C++, or another such compiled language. If that code allocates memory and doesn't release it properly, you will get a memory leak. The fact that the code is being called from Python won't magically fix it.
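One way to back that up empirically (a sketch, not from the original thread; fname is a placeholder and psutil is an extra dependency): tracemalloc only sees allocations made through Python's own allocator, so if the process RSS keeps climbing while the tracemalloc total stays flat across calls, the memory is being held by the native code inside the .pyd.

import os
import tracemalloc

import psutil
import data_read_module

fname = "example.dat"            # placeholder path
proc = psutil.Process(os.getpid())
tracemalloc.start()

for i in range(10):
    data_read_module.read_data(fname)         # result deliberately discarded
    tracked, _peak = tracemalloc.get_traced_memory()
    rss = proc.memory_info().rss
    print(f"call {i}: python-tracked={tracked / 1e6:.1f} MB, rss={rss / 1e6:.1f} MB")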

readinto() replacement?

Copying a File using a straight-forward approach in Python is typically like this:
def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)
(This code snippet is from shutil.py, by the way).
Unfortunately, this has drawbacks in my special use-case (involving threading and very large buffers) [Italics part added later]. First, it means that with each call of read() a new memory chunk is allocated and when buf is overwritten in the next iteration this memory is freed, only to allocate new memory again for the same purpose. This can slow down the whole process and put unnecessary load on the host.
To avoid this I'm using the file.readinto() method which, unfortunately, is documented as deprecated and "don't use":
import array

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    buffer = array.array('c')
    buffer.fromstring('-' * length)
    while True:
        count = fsrc.readinto(buffer)
        if count == 0:
            break
        if count != len(buffer):
            fdst.write(buffer.tostring()[:count])
        else:
            buffer.tofile(fdst)
My solution works, but there are two drawbacks as well: First, readinto() is not to be used. It might go away (says the documentation). Second, with readinto() I cannot decide how many bytes I want to read into the buffer and with buffer.tofile() I cannot decide how many I want to write, hence the cumbersome special case for the last block (which also is unnecessarily expensive).
I've looked at array.array.fromfile(), but it cannot be used to read "all there is" (reads, then throws EOFError and doesn't hand out the number of processed items). Also it is no solution for the ending special-case problem.
Is there a proper way to do what I want to do? Maybe I'm just overlooking a simple buffer class or similar which does what I want.
This code snippet is from shutil.py
Which is a standard library module. Why not just use it?
First, it means that with each call of read() a new memory chunk is allocated and when buf is overwritten in the next iteration this memory is freed, only to allocate new memory again for the same purpose. This can slow down the whole process and put unnecessary load on the host.
This is tiny compared to the effort required to actually grab a page of data from disk.
Normal Python code would not be in need of such tweaks as this - however, if you really need all that performance tweaking to read files from inside Python code (as in, you are rewriting some server code you already have working, for performance or memory usage), I'd rather call the OS directly using ctypes - thus having the copy performed at as low a level as I want.
It may even be possible that simply calling the "cp" executable as an external process is less of a hurdle in your case (and it would take full advantage of all OS- and filesystem-level optimizations for you).
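For reference, a sketch of the buffer-reuse idea the question is after, in modern Python 3 (where readinto is available and not deprecated on binary file objects; the function name is mine): one preallocated bytearray is reused for every chunk, and a memoryview slice handles the short final block without an extra copy.

def copyfileobj_reuse(fsrc, fdst, length=16 * 1024):
    """Copy fsrc to fdst, reusing one buffer for all reads."""
    buf = bytearray(length)
    view = memoryview(buf)
    while True:
        count = fsrc.readinto(buf)   # fills at most len(buf) bytes
        if not count:
            break
        fdst.write(view[:count])     # write exactly what was read, no copy

# Usage: both files must be opened in binary mode.
# with open("src.bin", "rb") as fsrc, open("dst.bin", "wb") as fdst:
#     copyfileobj_reuse(fsrc, fdst)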
