Spark: PySpark slowness, memory issue when writing to Cassandra - python

I am using PySpark to aggregate and group a largish CSV on a low-end machine: 4 GB RAM and 2 CPU cores. This is done to check the memory limits for the prototype. After aggregation I need to store the RDD to Cassandra, which is running on another machine.
I am using the DataStax Cassandra Python driver. First I used rdd.toLocalIterator, iterated through the RDD, and used the driver's synchronous API session.execute. I managed to insert about 100,000 records in 5 minutes - very slow. Investigating this, I found (as explained here: python driver cpu bound) that when running the nload network monitor on the Cassandra node, the data put out by the Python driver arrives at a very slow rate, which causes the slowness.
So I tried session.execute_async, and I could see the network transfer running at very high speed, and the insertion time was also very fast.
This would have been a happy story but for the fact that, using session.execute_async, I am now running out of memory while inserting into a few more tables (with different primary keys).
Since rdd.toLocalIterator is said to need memory equal to a partition, I shifted the write to the Spark workers using rdd.foreachPartition(x), but it still goes out of memory.
I suspect that it is not the RDD iteration that causes this, but the fast serialization (?) done by the driver's execute_async (which uses Cython).
Of course I can shift to a node with more RAM and try, but it would be sweet to solve this problem on this node. Maybe I will also try multiprocessing next, but if there are better suggestions, please reply. (A sketch of one idea for bounding the in-flight async writes follows the error log below.)
The memory error I am getting is a JVM/OS out-of-memory:
6/05/27 05:58:45 INFO MapOutputTrackerMaster: Size of output statuses for
shuffle 0 is 183 bytes
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fdea10cc000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ec2-user/hs_err_pid3208.log
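
For illustration, here is a minimal sketch of one way to bound the number of in-flight writes inside foreachPartition, using the driver's execute_concurrent_with_args helper. The host, keyspace, table, and columns are placeholders, not my actual schema, and this is only a sketch of the idea rather than the code I ran.

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

def write_partition(rows):
    # Placeholder host/keyspace/table; one session per Spark partition.
    cluster = Cluster(['cassandra-host'])
    session = cluster.connect('my_keyspace')
    insert = session.prepare("INSERT INTO my_table (key, value) VALUES (?, ?)")
    # Bounded concurrency keeps only a limited number of async requests
    # (and their futures) in flight, unlike an unthrottled execute_async loop.
    execute_concurrent_with_args(session, insert,
                                 [(row[0], row[1]) for row in rows],
                                 concurrency=50)
    cluster.shutdown()

rdd.foreachPartition(write_partition)  # rdd is the aggregated RDD from above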

I tried the execution on a machine with more RAM - 16 GB. This time I am able to avert the out-of-memory scenario above.
However, this time I changed the insert a bit, to insert into multiple tables.
So even with session.execute_async I am finding that the Python driver is CPU bound (and, I guess due to the GIL, not able to make use of all CPU cores), and what goes out on the network is a trickle.
So I am not able to attain Case 2; I am planning to change to Scala now.
Case 1: Very little output to the network - write speed is fast, but there is almost nothing to write.
Case 2 (ideal case): Inserts are I/O bound - Cassandra writes very fast.

Related

Why am I getting an out-of-memory error on the Spark driver when trying to read lots of Avro files? No collect operation happening

I am trying to read a large number of Avro files in subdirectories from S3 using spark.read.load on Databricks. I either get an error because the max result size exceeds spark.driver.maxResultSize, or, if I increase that limit, the driver runs out of memory.
I am not performing any collect operation so I'm not sure why so much memory is being used on the driver. I wondered if it was something to do with an excessive number of partitions, so I tried playing around with different values of spark.sql.files.maxPartitionBytes, to no avail. I also tried increasing memory on the driver and using a bigger cluster.
The only thing that seemed to help slightly was specifying the Avro schema beforehand rather than inferring it; this meant spark.read.load finished without error, but memory usage on the driver was still extremely high, and the driver still crashed if I attempted any further operations on the resulting DataFrame.
I discovered the problem was the spark.sql.sources.parallelPartitionDiscovery.parallelism option. This was set too low for the large number of files I was trying to read, resulting in the driver crashing. I increased this value and now my code is working.
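For reference, a minimal sketch of applying that setting together with an explicit schema; the parallelism value, S3 path, and schema below are placeholders, not the ones from my job.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = (SparkSession.builder
         # Illustrative value; tune it to the number of paths being listed.
         .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")
         .getOrCreate())

# Placeholder schema; supplying it avoids the extra pass for schema inference.
avro_schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

# The "avro" format is built into Databricks runtimes; elsewhere it needs the
# external spark-avro package.
df = spark.read.load("s3://my-bucket/avro/*/*", format="avro", schema=avro_schema)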

long-running python program ram usage

I am currently working on a project where a Python program is supposed to run for several days, essentially in an endless loop until a user intervenes.
I have observed that the RAM usage (as shown in the Windows Task Manager) rises slowly but steadily, for example from ~80 MB at program start to ~120 MB after one day. To get a closer look at this, I started to log the allocated memory with tracemalloc.get_traced_memory() at regular intervals throughout the program execution. The output was written to the time-series DB (see image below).
[Image: tracemalloc output for one day of runtime]
To me it looks like the memory needed by the program does not accumulate over time. How does this fit with the output of the Windows Task Manager? Should I go through my program and search for growing data structures?
Thank you very much in advance!
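For context, here is a minimal sketch of the periodic sampling described above; the interval is arbitrary and the print is a stand-in for my time-series write.

import time
import tracemalloc

tracemalloc.start()

while True:
    current, peak = tracemalloc.get_traced_memory()
    # Stand-in for writing the sample to the time-series DB.
    print(f"traced current={current / 1e6:.2f} MB, peak={peak / 1e6:.2f} MB")
    time.sleep(60)  # sampling interval is arbitrary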
Okay, it turns out the answer is: no, this is not proper behaviour; the RAM usage can stay absolutely stable. I have tested this for three weeks now and the RAM usage never exceeded 80 MB.
The problem was in the usage of the InfluxDB v2 client.
You need to close both the write_api (implicitly done with the "with... as write_api:" statement) and the client itself (explicitly done via the "client.close()" in the example below).
In my previous version that had increasing memory usage, I only closed the write_api and not the client.
client = influxdb_client.InfluxDBClient(url=self.url, token=self.token, org=self.org)
with client.write_api(write_options=SYNCHRONOUS) as write_api:
    # force datatypes, because influx does not do fluffy ducktyping
    datapoint = influxdb_client.Point("TaskPriorities")\
        .tag("task_name", str(task_name))\
        .tag("run_uuid", str(run_uuid))\
        .tag("task_type", str(task_type))\
        .field("priority", float(priority))\
        .field("process_time_h", float(process_time))\
        .time(time.time_ns())
    answer = write_api.write(bucket=self.bucket, org=self.org, record=datapoint)
client.close()

How to specify memory allocation for Pandas Dataframes?

I am trying to merge two big pandas DataFrames, but it raises a memory error on my 4 GB RAM laptop, so I tried it in the computer lab on a 16 GB RAM machine, and it still raises the same error (the code crashes on the same line).
I am not able to figure out why pandas raises the same error and does not use the 16 GB of RAM. Please help me resolve it.
import pandas as pd

feature_AtomPairs2DFingerprintCount = pd.read_csv("/home/adarsh/big_data_features/AtomPairs2DFingerprintCount.csv")
feature_AtomPairs2DFingerprinter = pd.read_csv("/home/adarsh/big_data_features/AtomPairs2DFingerprinter.csv")
merged_data_2 = pd.merge(feature_AtomPairs2DFingerprinter, feature_AtomPairs2DFingerprintCount, how='left')
MERGED_DATA = pd.read_csv('/home/adarsh/comp_des.csv')
total_merged = pd.merge(MERGED_DATA, merged_data_2, how='left')
The resource.getrlimit call will tell you the hard and soft limits for various system resources. For memory:
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
The soft limit is the value at which, when reached, the operating system will typically restrict the process or notify it with a signal. The hard limit represents an upper bound on the values for the soft limit. The soft limit can be modified with an appropriate call to resource.setrlimit(). The hard limit is typically controlled by a system-wide parameter set by the system administrator. It cannot be raised by user-level processes, although it can be lowered. This is reported to work on Linux but not on macOS or Windows, where it returns -1 for both values.
I suspect that you are running up against the OS's max for process size.
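A minimal sketch of checking and adjusting the limits with the resource module (Linux only; the printed values depend on your system):

import resource

# Query the current address-space limits; RLIM_INFINITY means "unlimited".
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("soft:", soft, "hard:", hard)

# The soft limit can be raised, but only up to the hard limit set by the
# system administrator.
resource.setrlimit(resource.RLIMIT_AS, (hard, hard))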

64-bit Python fills up memory until the computer freezes with no MemoryError

I used to run 32-bit Python on a 32-bit OS, and whenever I accidentally appended values to an array in an infinite loop or tried to load too big a file, Python would just stop with an out-of-memory error. However, I now use 64-bit Python on a 64-bit OS, and instead of raising an exception, Python uses up every last bit of memory and causes my computer to freeze up, so I am forced to restart it.
I looked around Stack Overflow and it doesn't seem as if there is a good way to control or limit memory usage. For example, this solution: How to set memory limit for thread or process in python? limits the resources Python can use, but it would be impractical to paste into every piece of code I want to write.
How can I prevent this from happening?
I don't know if this will be the solution for anyone else but me, as my case was very specific, but I thought I'd post it here in case someone could use my procedure.
I had a VERY huge dataset with millions of rows of data. Once I queried this data through a PostgreSQL database, I used up a lot of my available memory (63.9 GB available in total on a Windows 10 64-bit PC using 64-bit Python 3.x), and for each of my queries I used around 28-40 GB of memory, as the rows of data were kept in memory while Python did calculations on them. I used the psycopg2 module to connect to PostgreSQL.
My initial procedure was to perform calculations and then append the result to a list, which I would return from my methods. I quite quickly ended up having too much stored in memory, and my PC started freaking out (it froze up, logged me out of Windows, the display driver stopped responding, etc.).
Therefore I changed my approach to use Python generators. And as I wanted to store the data I did calculations on back in my database, I would write each row to my database as soon as I was done performing calculations on it.
def fetch_rows(cursor, arraysize=1000):
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
And with this approach I would do calculations on my yielded result by using my generator:
import psycopg2

def main():
    connection_string = "...."
    connection = psycopg2.connect(connection_string)
    cursor = connection.cursor()

    # Using generator
    for row in fetch_rows(cursor):
        # placeholder functions
        result = do_calculations(row)
        write_to_db(result)
This procedure does, however, still require that you have enough physical RAM to store the data in memory.
I hope this helps whoever is out there with the same problems.

psycopg2 leaking memory after large query

I'm running a large query in a Python script against my Postgres database using psycopg2 (I upgraded to version 2.5). After the query is finished, I close the cursor and connection, and even run gc, but the process still consumes a ton of memory (7.3 GB to be exact). Am I missing a cleanup step?
import psycopg2
conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cursor = conn.cursor()
cursor.execute("""large query""")
rows = cursor.fetchall()
del rows
cursor.close()
conn.close()
import gc
gc.collect()
I ran into a similar problem and after a couple of hours of blood, sweat and tears, found the answer simply requires the addition of one parameter.
Instead of
cursor = conn.cursor()
write
cursor = conn.cursor(name="my_cursor_name")
or simpler yet
cursor = conn.cursor("my_cursor_name")
The details are found at http://initd.org/psycopg/docs/usage.html#server-side-cursors
I found the instructions a little confusing in that I thought I'd need to rewrite my SQL to include "DECLARE my_cursor_name ...." and then a "FETCH count 2000 FROM my_cursor_name", but it turns out psycopg does all that for you under the hood if you simply override the "name=None" default parameter when creating a cursor.
The suggestion above of using fetchone or fetchmany doesn't resolve the problem on its own since, if you leave the name parameter unset, psycopg will by default attempt to load the entire query result into RAM. The only other thing you may need to do (besides declaring a name parameter) is change the cursor.itersize attribute from the default 2000 to, say, 1000 if you still have too little memory.
Joeblog has the correct answer. The way you deal with the fetching is important but far more obvious than the way you must define the cursor. Here is a simple example to illustrate this and give you something to copy-paste to start with.
import datetime as dt
import psycopg2
import sys
import time

conPG = psycopg2.connect("dbname='myDearDB'")
curPG = conPG.cursor('testCursor')
curPG.itersize = 100000  # Rows fetched at one time from the server
curPG.execute("SELECT * FROM myBigTable LIMIT 10000000")
# Warning: curPG.rowcount == -1 ALWAYS !!
cptLigne = 0
for rec in curPG:
    cptLigne += 1
    if cptLigne % 10000 == 0:
        print('.', end='')
        sys.stdout.flush()  # To see the progression
conPG.commit()  # Also closes the cursor
conPG.close()
As you will see, the dots come in groups rapidly, then pause while a buffer of rows (itersize) is fetched, so you don't need to use fetchmany for performance. When I run this with /usr/bin/time -v, I get the result in less than 3 minutes, using only 200 MB of RAM (instead of 60 GB with a client-side cursor) for 10 million rows. The server doesn't need more RAM, as it uses a temporary table.
Please see the next answer by @joeblog for the better solution.
First, you shouldn't need all that RAM in the first place. What you should be doing here is fetching chunks of the result set. Don't do a fetchall(). Instead, use the much more efficient cursor.fetchmany method. See the psycopg2 documentation.
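As an illustration, here is a minimal sketch of chunked fetching with fetchmany; the connection string, query, and process() function are placeholders, not from the original post.

import psycopg2

def process(row):
    pass  # placeholder for whatever you do with each row

conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
# A named (server-side) cursor, as described in the answers above, keeps the
# result set on the server instead of buffering it all client-side.
cursor = conn.cursor("my_cursor_name")
cursor.execute("""large query""")

# Consume the result set in fixed-size chunks instead of one big fetchall().
while True:
    rows = cursor.fetchmany(2000)
    if not rows:
        break
    for row in rows:
        process(row)

cursor.close()
conn.close()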
Now, the explanation for why it isn't freed, and why that isn't a memory leak in the formally correct use of that term.
Most processes don't release memory back to the OS when it's freed, they just make it available for re-use elsewhere in the program.
Memory may only be released to the OS if the program can compact the remaining objects scattered through memory. This is only possible if indirect handle references are used, since otherwise moving an object would invalidate existing pointers to the object. Indirect references are rather inefficient, especially on modern CPUs where chasing pointers around does horrible things to performance.
What usually ends up happening, unless extra caution is exercised by the program, is that each large chunk of memory allocated with brk() ends up with a few small pieces still in use.
The OS can't tell whether the program considers this memory still in use or not, so it can't just claim it back. Since the program doesn't tend to access the memory the OS will usually swap it out over time, freeing physical memory for other uses. This is one of the reasons you should have swap space.
It's possible to write programs that hand memory back to the OS, but I'm not sure that you can do it with Python.
See also:
python - memory not being given back to kernel
Why doesn't memory get released to system after large queries (or series of queries) in django?
Releasing memory in Python
So: this isn't actually a memory leak. If you do something else that uses lots of memory, the process shouldn't grow much, if at all; it'll re-use the previously freed memory from the last big allocation.
