I am trying to merge two big pandas DataFrames, but it raises a MemoryError on my 4 GB RAM laptop. I tried again on a computer-lab machine with 16 GB RAM, but it still raises the same error (the code crashes on the same line).
I cannot work out why pandas raises the same error instead of using the 16 GB of RAM. Please help me resolve it.
import pandas as pd

feature_AtomPairs2DFingerprintCount = pd.read_csv("/home/adarsh/big_data_features/AtomPairs2DFingerprintCount.csv")
feature_AtomPairs2DFingerprinter = pd.read_csv("/home/adarsh/big_data_features/AtomPairs2DFingerprinter.csv")
merged_data_2 = pd.merge(feature_AtomPairs2DFingerprinter, feature_AtomPairs2DFingerprintCount, how='left')
MERGED_DATA = pd.read_csv('/home/adarsh/comp_des.csv')
total_merged = pd.merge(MERGED_DATA, merged_data_2, how='left')
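(For context, a minimal sketch, not the poster's code, of how the same merge might be attempted with less memory. It assumes the CSVs share a single key column, called "Name" here purely for illustration, and that the remaining columns are numeric.)

import pandas as pd

KEY = "Name"  # assumed key column name, for illustration only

def load_small(path):
    # Downcast numeric columns (float64 -> float32) to roughly halve memory.
    df = pd.read_csv(path)
    for col in df.columns:
        if col != KEY and pd.api.types.is_numeric_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
    return df

left = load_small("/home/adarsh/big_data_features/AtomPairs2DFingerprinter.csv")
right = load_small("/home/adarsh/big_data_features/AtomPairs2DFingerprintCount.csv")

# Merging on an explicit key avoids accidentally joining on every shared column.
merged_data_2 = pd.merge(left, right, on=KEY, how="left")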
The resource.getrlimit call will tell you the hard and soft limits for various system resources. For memory (address space):
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_AS)
The soft limit is the value at which the operating system will typically restrict the process or notify it with a signal. The hard limit is an upper bound on the soft limit. The soft limit can be modified with an appropriate call to resource.setrlimit(). The hard limit is typically controlled by a system-wide parameter set by the system administrator; it cannot be raised by user-level processes, although it can be lowered. This is reported to work on Linux but not on macOS or Windows, where it returns -1 for both values.
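For illustration, a small sketch of lowering the soft limit with resource.setrlimit (Linux only; the 4 GiB cap is an arbitrary example value):

import resource

# Cap the address space at an example value of 4 GiB so oversized allocations
# fail with MemoryError instead of the OS stepping in. The new soft limit may
# not exceed the current hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
four_gib = 4 * 1024 ** 3
new_soft = four_gib if hard == resource.RLIM_INFINITY else min(four_gib, hard)
resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))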
I suspect that you are running up against the OS's max for process size.
I am running a Python script which uses scipy.optimize.differential_evolution to find optimum parameters for given data samples. I process my samples sequentially, and the script runs fine on its own, but when I make use of the parallel computing options implemented in the package by calling it via:
res = scipy.optimize.differential_evolution(min_fun,
                                            bounds=bnds,
                                            updating='deferred',
                                            workers=120)
it throws an error after a few evaluations of res:
File "[...]/miniconda3/envs/remote_fit_environment/lib/python3.8/multiprocessing/popen_fork.py", line 69, in _launch
child_r, parent_w = os.pipe()
OSError: [Errno 24] Too many open files
If I allocate fewer CPUs, e.g. workers=20, it takes longer (i.e. more calls to differential_evolution()) until the error occurs.
I am aware that I could raise the limit for open files (1024 by default on that machine), but this seems strange to me. Since the error originates in the multiprocessing module and is traced back to differential_evolution, I am wondering whether something in the parallelisation implementation within scipy.optimize.differential_evolution might be wrong or "not clean" (although it is much more likely that I am missing something, as I am completely new to this whole parallel/multiprocessing thing).
Any similar experiences or ideas on how to solve this?
The root cause is the O/S not having enough file descriptors available for so many process-to-process IPC channels: Python IPC (as in joblib.parallel() or the multiprocessing module) uses os.pipe instances for communicating parameters between the "main" Python interpreter process and the (safely) spawn()-ed or (unsafely, as documented elsewhere) fork()-ed subordinate processes, which SciPy re-wraps and re-uses under the workers = 120 parameter of the calling signature.
Even if you increase the configured number of O/S file descriptors (which intelligent operating systems permit), so as to have more os.pipe instances to work with, the whole intent is not reasonable: end-to-end performance will collapse, because 120 worker processes will have unresolvable, self-defeating bottlenecks on the CPU-to-RAM physical I/O channels under such "over-booked" CPU-to-RAM traffic density.
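That said, for reference, a hedged sketch of checking and raising the per-process open-file limit from inside Python; 4096 is an arbitrary example value, and the soft limit still cannot exceed the hard limit:

import resource

# Per-process limits on open file descriptors (1024 is a typical default soft limit).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limits, soft/hard:", soft, hard)

# Raise the soft limit towards the hard limit; 4096 is only an example value.
new_soft = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))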
Asking for more, you may easily receive less (or nothing at all).
Details matter.
I'm working on a system that has about 128 KB of RAM, and one of my scripts occasionally causes an ERRNO 12 Cannot Allocate Memory error.
I have a few solutions I want to test.
But how can I replicate the problem when it seemingly happens at random, about once a day?
Are there any bad scripts that will cause an ERRNO 12 Cannot Allocate Memory error?
Most posts are about solving memory errors; I want to cause one, to test the robustness of my code.
If you want to force Python to run out of memory, this should make it happen very quickly regardless of how much memory is available:
x = [None]
while True:
    x += x
This will double the length of x on every iteration until it fails.
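If the goal is to exercise recovery code rather than just crash the interpreter, a hedged variant that catches the failure might look like this; note that, depending on the OS's overcommit settings, the process can be killed before a MemoryError is ever raised:

def exhaust_memory():
    # Keep doubling a list until the allocator gives up.
    x = [None]
    while True:
        x += x

try:
    exhaust_memory()
except MemoryError:
    # Placeholder for whatever recovery path is being tested.
    print("MemoryError raised; run the recovery code under test here")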
I am seeking insight into a vexing problem I have encountered using numpy in a multiprocessing environment. Initially, some weeks ago, "things did not work" at all and as a result of guidance from this site, the problem was traced to numpy needing to be recompiled to allow multiprocessing. This was done and appears at first blush to be functional EXCEPT that now there are limits on the problem size that I can attack.
When numpy was used as downloaded (a binary distribution) in single-processor mode, I could solve large problems using linalg.pinv and linalg.svd without complaint (except for throughput). Now, relatively small problems fail. Specifically, when linalg.lstsq is called (which appears to invoke svd and/or pinv), the code expends the CPU effort (as shown by the process monitor) but never executes the statement after the linalg call. No exception is raised, no error indication is given, no results are generated; it just bails, and the process that contains the call is deemed done, and that is that.
These matrices are not that large: 8 x 4096 solves and produces results in a timely manner; 8 x 8192 fails with no error indication, no exception raised, and no results produced. The single-process version of numpy solved 12 x 50,000 without issue. My platform is a 12-core Mac running the most recent OS X (10.11.6) with 64 GB of physical RAM. Minimal memory pressure is put on the machine during these runs; there is plenty of headroom. Python is 2.7.11 under Anaconda, and numpy was recompiled for multiprocessing using gcc.
I am running out of places to look; I have not found a numpy control that addresses this issue. Does anyone have any pointers as to where I should concentrate my search for a resolution?
This is a show-stopper for me and I have to assume that the underlying cause is simple, but perhaps not obvious. I am rather new to python, so I would appreciate any insight fellow over-flowers may have.
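For what it's worth, a minimal sketch (written against a current Python 3 / NumPy, not the 2.7 setup described) of the call pattern in question; with a BLAS build that is not fork-safe, the larger lstsq call inside a forked worker is where a silent hang like the one described could occur:

import numpy as np
from multiprocessing import Pool

def solve(n_cols):
    # Hypothetical problem sizes mirroring the 8 x 4096 / 8 x 8192 cases above.
    a = np.random.rand(8, n_cols)
    b = np.random.rand(8)
    coeffs, residuals, rank, sv = np.linalg.lstsq(a, b, rcond=None)
    return coeffs.shape

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # If the underlying BLAS misbehaves after fork(), the second case may
        # never return, with no exception raised.
        print(pool.map(solve, [4096, 8192]))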
I am using pyspark to aggregate and group a largish CSV on a low-end machine: 4 GB RAM and 2 CPU cores. This is done to check the memory limits for the prototype. After aggregation I need to store the RDD in Cassandra, which is running on another machine.
I am using the DataStax cassandra-python driver. First I used rdd.toLocalIterator, iterated through the RDD, and used the driver's synchronous API session.execute. I managed to insert about 100,000 records in 5 minutes - very slow. Investigating this, I found (as explained here: python driver cpu bound) by running the nload network monitor on the Cassandra node that the data put out by the Python driver comes in at a very slow rate, which causes the slowness.
So I tried session.execute_async, and I could see the network transfer happening at very high speed, and insertion was also very fast.
This would have been a happy story but for the fact that, using session.execute_async, I am now running out of memory when inserting into a few more tables (with different primary keys).
Since rdd.toLocalIterator is said to need memory equal to one partition, I shifted the writes to the Spark workers using rdd.foreachPartition(x), but it still runs out of memory.
I suspect that it is not the RDD iteration that causes this, but rather the fast serialization (?) of execute_async in the Python driver (which uses Cython).
Of course I could shift to a node with more RAM and try, but it would be sweet to solve this problem on this node; maybe I will also try multiprocessing next, but if there are better suggestions, please reply.
The memory error I am getting is a JVM/OS out-of-memory error:
6/05/27 05:58:45 INFO MapOutputTrackerMaster: Size of output statuses for
shuffle 0 is 183 bytes
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fdea10cc000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ec2-user/hs_err_pid3208.log
I tried the execution on a machine with more RAM - 16 GB; this time I am able to avert the out-of-memory scenario above.
However, this time I changed the insert a bit, to insert into multiple tables.
So even with session.execute_async I am finding that the Python driver is CPU bound (and, I guess because of the GIL, not able to make use of all CPU cores), and what goes out on the network is a trickle.
So I am not able to attain Case 2; planning to change to Scala now.
Case 1: Very little output to the network - write speed is fast but there is little to write.
Case 2 (the ideal case): Inserts are IO bound - Cassandra writes very fast.
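For completeness, a hedged sketch of one common way to keep execute_async from exhausting memory: bound the number of in-flight requests with a semaphore so unacknowledged writes cannot pile up. The keyspace, table, and column names below are hypothetical placeholders.

from threading import Semaphore
from cassandra.cluster import Cluster

MAX_IN_FLIGHT = 64  # assumption: tune to the node's capacity
in_flight = Semaphore(MAX_IN_FLIGHT)

cluster = Cluster(["cassandra-host"])        # hypothetical contact point
session = cluster.connect("my_keyspace")     # hypothetical keyspace
insert = session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)")

def on_success(_):
    in_flight.release()

def on_error(exc):
    in_flight.release()
    print("insert failed:", exc)

def insert_row(row):
    # Blocks once MAX_IN_FLIGHT requests are pending, throttling the producer.
    in_flight.acquire()
    future = session.execute_async(insert, (row[0], row[1]))
    future.add_callbacks(on_success, on_error)

Each partition iterator (e.g. inside rdd.foreachPartition) would then call insert_row for its rows.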
I used to run 32-bit Python on a 32-bit OS, and whenever I accidentally appended values to an array in an infinite loop or tried to load too big a file, Python would just stop with an out-of-memory error. However, I now use 64-bit Python on a 64-bit OS, and instead of raising an exception, Python uses up every last bit of memory and causes my computer to freeze up, so I am forced to restart it.
I looked around Stack Overflow and it doesn't seem as if there is a good way to control or limit memory usage. For example, this solution: How to set memory limit for thread or process in python? limits the resources Python can use, but it would be impractical to paste into every piece of code I want to write.
How can I prevent this from happening?
I don't know if this will be the solution for anyone else but me, as my case was very specific, but I thought I'd post it here in case someone could use my procedure.
I had a VERY huge dataset with millions of rows of data. Once I queried this data from a PostgreSQL database, I used up a lot of my available memory (63.9 GB available in total on a Windows 10 64-bit PC using Python 3.x 64-bit), and each of my queries used around 28-40 GB of memory, as the rows of data had to be kept in memory while Python did calculations on them. I used the psycopg2 module to connect to my PostgreSQL database.
My initial procedure was to perform calculations and then append the result to a list which I would return from my methods. I quite quickly ended up having too much stored in memory and my PC started freaking out (it froze up, logged me out of Windows, the display driver stopped responding, and so on).
Therefore I changed my approach to use Python generators. And as I wanted to store the data I did calculations on back in my database, I would write each row to my database as soon as I was done performing calculations on it.
def fetch_rows(cursor, arraysize=1000):
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
And with this approach I would do calculations on my yielded result by using my generator:
import psycopg2

def main():
    connection_string = "...."
    connection = psycopg2.connect(connection_string)
    cursor = connection.cursor()

    # Using the generator
    for row in fetch_rows(cursor):
        # placeholder functions
        result = do_calculations(row)
        write_to_db(result)
This procedure does, however, require that you have enough physical RAM to store the data in memory.
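One note on that caveat: psycopg2 also supports server-side ("named") cursors, which stream rows from PostgreSQL in batches instead of materialising the whole result set on the client. A minimal sketch, with a placeholder query and the same placeholder functions as above:

import psycopg2

connection = psycopg2.connect("....")  # placeholder connection string, as above

# Naming the cursor makes it a server-side cursor: rows are pulled from
# PostgreSQL in chunks of `itersize` instead of all at once.
cursor = connection.cursor(name="streaming_cursor")
cursor.itersize = 1000
cursor.execute("SELECT * FROM my_table")  # hypothetical query

for row in cursor:
    result = do_calculations(row)  # placeholder functions, as above
    write_to_db(result)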
I hope this helps whoever is out there with the same problems.