Regularly loading a large file into Python

So I have this large file (about 1.5 GB) which I load into Python pretty regularly. The loading parses it into an object that ends up taking about 3 GB of RAM.
The loading process is not that long (about 40 seconds on my PC), but it still becomes an issue when I want to debug programs that load it.
I was trying to come up with a way to load it more quickly. At first I thought about pickling the resulting Python object, but as I said it's 3 GB, so unpickling it ended up taking even longer than the parsing process.
Is there a way to let Python access it more quickly? I am not really opposed to any working solution (cloud server? another programming language?) but I am not even sure whether this is technically possible at all.
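For reference, a minimal sketch of the pickling approach described above (the file names and the parse_file helper are placeholders, not the actual code); pickling with the highest protocol is the variant most likely to be worth timing:

import pickle

def parse_file(path):
    # Placeholder for the real parser that builds the large in-memory object.
    with open(path) as f:
        return f.read().split()

# First run: parse once, then cache the parsed object with the fastest protocol.
obj = parse_file("big_input.dat")
with open("parsed.pkl", "wb") as f:
    pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

# Later runs: load the cached object instead of re-parsing.
with open("parsed.pkl", "rb") as f:
    obj = pickle.load(f)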

Related

Is there a way to reduce resource usage when reading and writing large dataframes with polars?

For my specific problem I have been converting ".csv" files to ".parquet" files. The CSV files on disk are about 10-20 GB each.
A while back I was converting ".SAS7BDAT" files of similar size to ".parquet" files of similar data, but now I get the data as CSVs, so this might not be a good comparison. I used the pyreadstat library to read those files in (with the multi-threading parameter turned on, which didn't make a difference for some reason) and pandas to write. It was also a tiny bit faster, but I feel the code ran on a single thread, and it took a week to convert all my data.
This time I tried the polars library and it was blazing fast. CPU usage was near 100% and memory usage was also quite high. I tested it on a single file that would have taken hours before, only for it to complete in minutes. The problem is that it uses too much of my computer's resources and my PC stalls. VSCode has crashed on some occasions. I have tried passing in the low-memory parameter, but it still uses a lot of resources. My suspicion is the "reader.next_batches(500)" call, but I don't know for sure.
Regardless, is there a way to limit the CPU and memory usage while running this operation, so I can at least browse the internet or listen to music while it runs in the background? With pandas the process is too slow; with polars it is fast, but my PC becomes unusable at times. See image for the code I used.
Thanks.
I tried the low-memory parameter with polars, but memory usage was still quite high. I was expecting to at least be able to use my PC while this worked in the background. My hope is to use 50-80% of my PC's resources, so that enough is left free for other work while the files are being converted.
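The image with the code isn't reproduced here; a rough sketch of the kind of batched CSV-to-parquet conversion the question describes might look like the following (the file names, the batch size of 500 and the one-parquet-file-per-batch-group layout are assumptions, not the asker's actual code):

import polars as pl

reader = pl.read_csv_batched("big_input.csv", low_memory=True)

part = 0
while True:
    batches = reader.next_batches(500)   # returns None once the file is exhausted
    if batches is None:
        break
    # Write each group of batches to its own parquet "part" file so the whole
    # CSV never has to sit in memory at once.
    pl.concat(batches).write_parquet(f"big_input_part{part}.parquet")
    part += 1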
I see you're on Windows, so convert your notebook into a .py script, then from the command line run
start /low python yourscript.py
And/or use Task Manager to lower the priority of your Python process once it's running.
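A programmatic variant of the same idea, as a sketch rather than part of the original answer: polars honours the POLARS_MAX_THREADS environment variable (it must be set before polars is imported), and the third-party psutil package can lower the process priority from inside the script.

import os
os.environ["POLARS_MAX_THREADS"] = "4"   # cap polars' thread pool; set before importing polars

import psutil
import polars as pl

# Lower our own priority so the OS favours interactive programs.
# BELOW_NORMAL_PRIORITY_CLASS is Windows-only; on Linux/macOS a positive
# nice value such as 10 would be used instead.
psutil.Process().nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)

# ... run the CSV -> parquet conversion here ...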

Keep data in memory persistently

I want to write a Python script which loads 2 GB of data from the hard disk into memory, and then, whenever requested by another program, takes an input and does some calculations on this data based on that input. The important thing for me is to keep this 2 GB of data in memory persistently, to speed up the calculations and, more importantly, to avoid a huge I/O load.
How should I keep the data in memory forever? Or, more generally, how should I solve such a problem in Python?
Depending on what kind of data you have, you can keep the data in a Python list, set, dict (hash map) or any other data structure. If this is just meant to be a cache, you can use a server like Redis or memcached too.
There is nothing special about loading data into memory "forever" versus doing it every time you need it. You can just load it into Python variables and keep them around.
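One minimal way to do what these answers describe (keep the data in ordinary Python variables inside a long-running process and let other programs query it) is a small RPC server from the standard library; this is only a sketch, and load_data, calculate, the file name and the port are placeholders:

from xmlrpc.server import SimpleXMLRPCServer

def load_data(path):
    # Placeholder for whatever builds the 2 GB in-memory structure.
    with open(path) as f:
        return f.read().splitlines()

# Loaded once at process start; stays in memory for as long as the process runs.
DATA = load_data("big_input.dat")

def calculate(query):
    # Placeholder calculation against the in-memory data.
    return sum(1 for line in DATA if query in line)

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(calculate)
server.serve_forever()   # other programs call calculate() over XML-RPC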
Make sure you have 2GB of free and available RAM and then use the mmap module (https://docs.python.org/3/library/mmap.html) to map the entire array into active memory.
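A minimal sketch of that mmap approach (the file name is a placeholder); the file is mapped read-only and the operating system pages it in on demand rather than copying everything up front:

import mmap

with open("big_input.dat", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The mapping behaves like a read-only bytes-like object.
        print(mm[:100])        # first 100 bytes
        print(mm.find(b","))   # byte offset of the first comma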

Python 3 - Faster Print & I/O

I'm currently involved in a Python project that involves handling massive amounts of data. As part of it, I have to print massive amounts of data to files. The outputs are always one-liners, but sometimes consist of millions of digits.
The actual mathematical operations in Python only take seconds, minutes at most. Printing them to a file takes up to several hours, which I don't always have.
Is there any way of speeding up the I/O?
From what I figure, the number is stored in RAM (or at least I assume so; it's the only thing that would take up 11 GB of RAM), but Python does not print it to a text file immediately. Is there a way to dump that information, if it is the number, to a file? I've tried Task Manager's dump, which gave me a 22 GB dump file (yes, you read that right), and it doesn't look like what I was looking for is in there, although it wasn't very clear.
If it makes a difference, I have Python 3.5.1 (Anaconda and Spyder), Windows 8.1 x64 and 16GB RAM.
By the way, I do run garbage collection (the gc module) inside the script, and I delete variables that are not needed, so those 11 GB aren't just junk.
If you are indeed I/O bound by the time it takes to write the file, multi-threading with a pool of threads may help. Of course, there is a limit to that, but at least it would allow you to issue non-blocking file writes.
Multithreading could speed it up: have writer threads with a queue, so the main thread just pushes the output into memory and the writers handle the file I/O.
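A minimal sketch of that queue-based writer idea (the file name and chunk contents are placeholders); the main thread keeps computing while a background thread drains the queue to disk:

import threading
import queue

q = queue.Queue()
SENTINEL = None

def writer(path):
    with open(path, "w") as f:
        while True:
            chunk = q.get()
            if chunk is SENTINEL:
                break
            f.write(chunk)

t = threading.Thread(target=writer, args=("output.txt",))
t.start()

# Main thread: compute and hand results off without blocking on disk I/O.
for n in range(5):
    q.put(str(2 ** (10 * n)) + "\n")

q.put(SENTINEL)   # tell the writer to finish
t.join()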
From more of a system-design standpoint: evaluate whether or not you need to write everything to the file. Perhaps consider creating various levels of logging so that a release mode could run faster (if that makes sense in your context).
Use HDF5 file format
The problem is, you have to write a lot of data.
HDF5 is a format that is very efficient in size and allows the data to be accessed by various tools.
Be prepared for a few challenges:
there are multiple Python packages for HDF5; you will have to find the one which fits your needs
installation is not always very simple (but there might be a Windows installation binary)
expect a bit of study to understand the data structures to be stored
it will occasionally need some CPU cycles: typically you write a lot of data quickly, and at some moment it has to be flushed to disk. At that moment it starts compressing the data, which can take a few seconds. See GIL for IO bounded thread in C extension (HDF5)
Anyway, I think it is very likely you will manage, and apart from faster writes you will also get smaller files, which are simpler to handle.
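As one concrete possibility (h5py is just one of the several HDF5 packages alluded to above; the dataset name and compression settings here are assumptions):

import numpy as np
import h5py

data = np.random.random(1_000_000)   # stand-in for the real output

with h5py.File("output.h5", "w") as f:
    # Chunked, gzip-compressed dataset: smaller on disk, still readable by many tools.
    f.create_dataset("results", data=data, compression="gzip", chunks=True)

# Reading it back later:
with h5py.File("output.h5", "r") as f:
    results = f["results"][:]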

Frustrating behavior for Google modules/backends

I'm working with GAE, and I'm trying to process a large zip file (~150 MB zipped, 500 MB unzipped), which I need to do every day for my app.
I created a module to load a file from Google Cloud Storage and parse through it, saving specific pieces of information to Google Datastore along the way. The problem is that it will shut itself down within a few minutes, and I basically lose where I am in the file. I am giving the instance more than enough CPU/memory, so that's not the issue.
Is there some way to handle this? The documentation for handling shutdowns is quite limited, and it seems shutdown requests aren't even guaranteed. It seems really odd to me that GAE isn't able to handle a ~150 MB file, nor can GAE guarantee 10-15 minutes of uptime at a time. Is there a way to get around these limitations? Thanks.
EDIT:
Why, when I go to load my module ([modulename].[appname].appspot.com), does it load all available instances?
The documentation states
"http://module.app-id.appspot.com
Send the request to an available instance of the default version of the named module (round robin scheduling is used)."
Did you really measure that there is enough memory?! If you load the 500 MB unzipped into memory, that's a lot.
I have seen this behavior when running out of memory. I would suggest trying with a smaller test file. If that works, try to implement a streaming solution where the size of the file won't matter, since it is never fully loaded into memory.
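A rough sketch of that streaming idea using only the standard library (the file and member names are placeholders, and in the question's setup the zip would come from Google Cloud Storage rather than the local disk):

import io
import zipfile

def process(line):
    # Hypothetical per-record handler; in the question's setup this would save
    # the relevant pieces to Google Datastore.
    pass

with zipfile.ZipFile("daily_export.zip") as zf:
    for name in zf.namelist():
        # zf.open() returns a file-like object that decompresses on the fly,
        # so the ~500 MB of unzipped data never has to sit in memory at once.
        with zf.open(name) as member:
            for line in io.TextIOWrapper(member, encoding="utf-8"):
                process(line)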

How to debug a MemoryError in Python? Tools for tracking memory use?

I have a Python program that dies with a MemoryError when I feed it a large file. Are there any tools that I could use to figure out what's using the memory?
This program ran fine on smaller input files. The program obviously needs some scalability improvements; I'm just trying to figure out where. "Benchmark before you optimize", as a wise person once said.
(Just to forestall the inevitable "add more RAM" answer: This is running on a 32-bit WinXP box with 4GB RAM, so Python has access to 2GB of usable memory. Adding more memory is not technically possible. Reinstalling my PC with 64-bit Windows is not practical.)
EDIT: Oops, this is a duplicate of Which Python memory profiler is recommended?
Heapy is a memory profiler for Python, which is the type of tool you need.
The simplest and most lightweight way would likely be to use the built-in memory query capabilities of Python, such as sys.getsizeof: just run it on your objects for a reduced problem (i.e. a smaller file) and see what takes a lot of memory.
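For example (a small sketch; note that sys.getsizeof reports only an object's own size, not the sizes of the objects it references):

import sys

rows = [("some", "parsed", "record")] * 1_000_000

print(sys.getsizeof(rows))       # size of the list object itself, in bytes
print(sys.getsizeof(rows[0]))    # size of one tuple, not counting its strings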
In your case, the answer is probably very simple: do not read the whole file at once, but process it chunk by chunk. That may be very easy or complicated depending on your usage scenario. Just as an example, an MD5 checksum computation can be done much more efficiently for huge files without reading the whole file in. The latter change dramatically reduced memory consumption in some SCons usage scenarios but was almost impossible to trace with a memory profiler.
If you still need a memory profiler: eliben already suggested sys.getsizeof. If that doesn't cut it, try Heapy or Pympler.
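To illustrate the chunk-by-chunk point with the MD5 example above (the file name and chunk size are arbitrary):

import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read a fixed-size chunk at a time, so memory use stays around 1 MB
        # no matter how large the file is.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

print(md5_of_file("big_input.dat"))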
You asked for a tool recommendation:
Python Memory Validator allows you to monitor the memory usage, allocation locations, GC collections, object instances, memory snapshots, etc. of your Python application. Windows only.
http://www.softwareverify.com/python/memory/index.html
Disclaimer: I was involved in the creation of this software.
