How to find PySpark DataFrame memory usage? - python

For a pandas DataFrame, the info() function reports memory usage.
Is there any equivalent in PySpark?
Thanks

Try to use the _to_java_object_rdd() function:
import py4j.protocol
from py4j.protocol import Py4JJavaError
from py4j.java_gateway import JavaObject
from py4j.java_collections import JavaArray, JavaList
from pyspark import RDD, SparkContext
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

# df is the DataFrame whose size you want to estimate

# Helper function to convert Python objects to Java objects
def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    It will convert each Python object into a Java object via Pyrolite,
    whether or not the RDD is serialized in batch.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# First you have to convert the DataFrame to an RDD
JavaObj = _to_java_object_rdd(df.rdd)

# Now we can run the estimator
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)

I have something in mind, but it's just a rough estimate. As far as I know, Spark doesn't have a straightforward way to get a DataFrame's memory usage, but a pandas DataFrame does. So what you can do is:
Select 1% of the data: sample = df.sample(fraction = 0.01)
pdf = sample.toPandas()
Get the pandas DataFrame's memory usage with pdf.info()
Multiply that value by 100; this should give a rough estimate of your whole Spark DataFrame's memory usage.
Correct me if I am wrong :|
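A minimal sketch of the scaling step above, using a locally built pandas frame as a hypothetical stand-in for the result of sample.toPandas() (the column names and sizes are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sample.toPandas() on a 1% sample.
pdf = pd.DataFrame({"id": np.arange(10_000), "value": np.random.rand(10_000)})

# memory_usage(deep=True) reports per-column bytes; sum them for the frame.
sample_bytes = int(pdf.memory_usage(deep=True).sum())

# Scale the 1% sample back up to estimate the full DataFrame's footprint.
estimated_total_bytes = sample_bytes * 100
print(f"sample: {sample_bytes} B, estimated total: {estimated_total_bytes} B")
```

memory_usage(deep=True) is more precise than the summary printed by info(), since it also accounts for object-dtype (string) columns.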

As per the documentation:
The best way to size the amount of memory consumption a dataset will require is to create an RDD, put it into cache, and look at the “Storage” page in the web UI. The page will tell you how much memory the RDD is occupying.
To estimate the memory consumption of a particular object, use SizeEstimator’s estimate method. This is useful for experimenting with different data layouts to trim memory usage, as well as determining the amount of space a broadcast variable will occupy on each executor heap.

For a DataFrame df you can do this:
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)

You can persist the DataFrame in memory and trigger an action such as df.count(). You will then be able to check its size under the Storage tab in the Spark web UI. Let me know if it works for you.

How about the below? The size shown is in KB; multiply by 100 to get the estimated real size.
df.sample(fraction = 0.01).cache().count()

Related

Why does manipulating xarrays takes up so much memory permanently?

I'm new to using xarray (inside Jupyter notebooks), and up to now everything has worked like a charm, except when I started to look at how much RAM my functions use (e.g. with htop), which is confusing me (I didn't find anything on Stack Exchange).
I am combining monthly data into yearly means, taking into account month lengths, masking nan values and also using specific months only, which requires the use of groupby and resample.
As I can see from the memory profiler, these operations temporarily take up ~15 GB of RAM, which as such is not a problem because I have 64 GB of RAM at hand.
Nonetheless, some memory seems to stay blocked permanently, even though I call these methods inside a function. For the function below it blocks ~4 GB of memory, although the resulting xarray only has a size of ~440 MB (55*10**6 float64 entries); with more complex operations it blocks more memory.
Explicitly using del, gc.collect() or DataArray.close() inside the function did not change anything.
A basic function to compute a weighted yearly mean from monthly data looks like this:
import xarray as xr

test = xr.open_dataset(path)['pr']

def weighted_temporal_mean(ds):
    """
    Taken from https://ncar.github.io/esds/posts/2021/yearly-averages-xarray/
    Compute the yearly average from monthly data, taking into account
    month length and masking nan values.
    """
    # Determine the month length
    month_length = ds.time.dt.days_in_month
    # Calculate the weights
    wgts = month_length.groupby("time.year") / month_length.groupby("time.year").sum()
    # Set up the masking for nan values
    cond = ds.isnull()
    ones = xr.where(cond, 0.0, 1.0)
    # Calculate the numerator
    obs_sum = (ds * wgts).resample(time="AS").sum(dim="time")
    # Calculate the denominator
    ones_out = (ones * wgts).resample(time="AS").sum(dim="time")
    # Return the weighted average
    return obs_sum / ones_out

wm = weighted_temporal_mean(test)
print("nbytes in MB:", wm.nbytes / (1024*1024))
Any idea how to ensure that the memory is freed up, or am I overlooking something and this behavior is actually expected?
Thank you!
The only hypothesis I have for this behavior is that some of the operations involving the passed-in ds modify it in place, increasing its size, as, apart from the returned objects, this is the only object that should survive after the function executes.
That can easily be verified by using del on the ds structure used as input after the function has run. (If you need the data afterwards, re-read it, or make a deep copy before calling the function.)
If that does not resolve the problem, then this is an issue with the xarray project, and I'd advise you to open an issue there.
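A minimal sketch of that verification, using a NumPy array as a stand-in for the opened xarray dataset (the names and the reduction are illustrative, not the actual weighted_temporal_mean):

```python
import gc
import numpy as np

def compute_summary(ds):
    # Stand-in for weighted_temporal_mean: any reduction over the input.
    return ds.mean(axis=0)

ds = np.random.rand(1_000, 100)   # hypothetical stand-in for xr.open_dataset(...)
result = compute_summary(ds)

# Drop the last live reference to the input, then force a collection;
# if memory is still held afterwards, the input object was not the culprit.
del ds
gc.collect()
print(result.shape)
```

If deleting the input releases the blocked memory, the function was keeping the (possibly grown) input alive; if not, the leak lies elsewhere.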

How to get data from object in Python

I want to get the discord.user_id. I am VERY new to Python and just need help getting this data.
I have tried everything and there is no clear answer online.
Currently, this works to get a data point in the attributes section:
pledge.relationship('patron').attribute('first_name')
You should try this:
import pandas as pd
df = pd.read_json("path_to_your/file.json")
The output will be a DataFrame, which is a matrix in which the JSON attributes become the names of the columns. You will have to manipulate it afterwards, which is preferable, as operations on DataFrames are optimized for processing time.
Here is the official documentation; take a look.
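A small, self-contained sketch of that approach; the file path, record shape and field names below are made up for illustration:

```python
import json
import os
import tempfile

import pandas as pd

# Write a tiny JSON file shaped like a list of records (illustrative data).
records = [
    {"first_name": "Ada", "discord_user_id": "111"},
    {"first_name": "Alan", "discord_user_id": "222"},
]
path = os.path.join(tempfile.mkdtemp(), "file.json")
with open(path, "w") as f:
    json.dump(records, f)

# read_json turns the records into a DataFrame with one column per attribute.
df = pd.read_json(path)
print(df["first_name"].tolist())
```

Each JSON attribute becomes a column, so the value you want is then a plain column lookup.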
Assuming the whole object is called myObject, you can obtain the discord.user_id by calling myObject.json_data.attributes.social_connections.discord.user_id
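If the object arrives as parsed JSON (i.e. nested dicts rather than attribute access), a defensive version of that chain can be sketched like this; the payload below is hypothetical:

```python
# Hypothetical payload shaped like the object described above.
my_object = {
    "json_data": {
        "attributes": {
            "social_connections": {
                "discord": {"user_id": "123456789"},
            },
        },
    },
}

# Chained .get() calls return None instead of raising KeyError
# when any intermediate level is missing.
discord = (
    my_object.get("json_data", {})
    .get("attributes", {})
    .get("social_connections", {})
    .get("discord", {})
)
user_id = discord.get("user_id")
print(user_id)
```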

Save intermediate states of my Apache Spark pipeline

I have a quite complex Apache PySpark pipeline which performs several transformations on a (very large) set of text files. The intended output of my pipeline consists of the different stages of the pipeline. What is the best way (i.e. more efficient, but also more sparkling, in the sense of better fitting the Spark programming model and style) to do this?
Right now, my code looks like the following:
# initialize the pipeline and perform the first set of transformations
ctx = pyspark.SparkContext('local', 'MyPipeline')
rdd = ctx.textFile(...).map(...).map(...)

# first checkpoint: the `first_serialization` function serializes
# the data into a properly formatted string
rdd.map(first_serialization).saveAsTextFile("ckpt1")

# here, I have to read again from the previously saved checkpoint
# using a `first_deserialization` function that deserializes what has
# been serialized by the `first_serialization` function, then performs
# other transformations
rdd = ctx.textFile("ckpt1").map(...).map(...)
and so on. I would like to get rid of the serialization methods and of the multiple save/read steps -- by the way, does this impact efficiency? I assume it does.
Any hint?
Thanks in advance.
This seems obviously simple, because it is, but I would recommend writing out the intermediate stages while continuing to reuse the existing RDD (side note: use Datasets/DataFrames instead of RDDs for better performance) and continuing to process, writing out intermediate results as you go.
There's no need to pay the penalty of reading from disk/network when you already have the data processed (ideally even cached!) for further usage.
Example using your own code:
# initialize the pipeline and perform the first set of transformations
ctx = pyspark.SparkContext('local', 'MyPipeline')
rdd = ctx.textFile(...).map(...).map(...)

# first checkpoint: the `first_serialization` function serializes
# the data into a properly formatted string
string_rdd = rdd.map(first_serialization)
string_rdd.saveAsTextFile("ckpt1")

# reuse the existing RDD after writing out the intermediate results;
# `rdd` here is the same variable used to create `string_rdd` above.
# Alternatively, you may want to continue from `string_rdd` instead.
rdd = rdd.map(...).map(...)

(Py)Spark combineByKey mergeCombiners output type != mergeCombinerVal type

I'm trying to optimize a piece of software written in Python using pandas DataFrames. The algorithm takes a pandas DataFrame as input, can't be distributed, and outputs a metric for each client.
Maybe it's not the best solution, but my time-efficient approach is to load all files in parallel and then build a DataFrame for each client.
This works fine, BUT a very few clients have a really HUGE amount of data, so I need to save memory when creating their DataFrames.
In order to do this, I'm performing a groupBy() (actually a combineByKey, but logically it's a groupBy), and then for each group (now a single row of an RDD) I build a list and, from it, a pandas DataFrame.
However, this makes many copies of the data (RDD rows, list and pandas DataFrame...) in a single task/node and crashes, and I would like to eliminate those extra copies on a single node.
I was thinking of a "special" combineByKey, with the following pseudo-code:
def createCombiner(val):
    return [val]

def mergeCombinerVal(x, val):
    x.append(val)
    return x

def mergeCombiners(x, y):
    # Not checking whether y is already a pandas DF, but we could do that too
    if isinstance(x, list):
        pandasDF = pd.DataFrame(data=x, columns=myCols)
        pandasDF = pandasDF.append(y)
        return pandasDF
    else:
        x.append(y)
        return x
My question here: the docs say nothing, but does anyone know whether it's safe to assume that this will work? (The return type of merging two combiners is not the same as the combiner's type.) I could control the datatype in mergeCombinerVal too if the number of "bad" calls is marginal, but it would be very inefficient to append to a pandas DF row by row.
Any better idea to do what I want?
Thanks!
PS: Right now I'm packing Spark Rows; would switching from Spark Rows to Python lists without column names help reduce memory usage?
Just writing my comment as an answer:
In the end I used a regular combineByKey. It's faster than groupByKey (I don't know the exact reason; I guess it helps to pack the rows, because my rows are small but there are many of them), and it also allows me to group them into a "real" Python list (groupByKey groups into a kind of iterable which pandas doesn't support, forcing me to create another copy of the structure, which doubles memory usage and crashes). That helps with memory management when packing them into pandas/C datatypes.
Now I can use those lists to build a DataFrame directly without any extra transformation (I don't know what structure Spark's groupByKey "list" is, but pandas won't accept it in the constructor).
Still, my original idea should have used a little less memory (at most 1x DF + 0.5x list, whereas now I have 1x DF + 1x list), but as user8371915 said it's not guaranteed by the API/docs..., better not to put that into production :)
For now, my biggest clients fit into a reasonable amount of memory. I process most of my clients in a very parallel, low-memory-per-executor job and the biggest ones in a not-so-parallel, high-memory-per-executor job, deciding based on a pre-count I perform.
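The list-building combiners described above can be sketched in plain Python as a local simulation of the combineByKey contract (not actual Spark code; the partitioning here is illustrative). Note that mergeCombiners returns the same type as the combiner, which is the guarantee the API documents:

```python
# The three functions follow Spark's combineByKey contract, building
# plain Python lists per key (same type in and out for mergeCombiners).
def create_combiner(v):
    return [v]

def merge_value(acc, v):
    acc.append(v)
    return acc

def merge_combiners(a, b):
    a.extend(b)
    return a

def combine_by_key(partitions):
    # Fold each "partition" independently, then merge the per-partition
    # combiners, mimicking what Spark does across executors.
    per_partition = []
    for part in partitions:
        combiners = {}
        for k, v in part:
            if k in combiners:
                combiners[k] = merge_value(combiners[k], v)
            else:
                combiners[k] = create_combiner(v)
        per_partition.append(combiners)
    merged = {}
    for combiners in per_partition:
        for k, c in combiners.items():
            merged[k] = merge_combiners(merged[k], c) if k in merged else c
    return merged

result = combine_by_key([[("a", 1), ("b", 2)], [("a", 3)]])
print(result)
```

Each per-key list can then be handed straight to pd.DataFrame(data=lst, columns=...) without the extra copy that groupByKey's iterable forces.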

Using map in Spark Python to manipulate Resilient Distributed Datasets

I created an array using NumPy's arange, and want to convert that array into an RDD using spark.sparkContext.parallelize.
import numpy as np
np_array = [np.arange(0, 300)]
rdd_numbers = spark.sparkContext.parallelize(np_array)
times_twelve = rdd_numbers.map(lambda rdd_numbers: rdd_numbers * 12)
I would now like to make an RDD called times_twelve that is basically every number in rdd_numbers multiplied by twelve. For some reason times_twelve does not print properly; any ideas where I could have gone wrong?
Reading the comments, I can say that Shagun Sodhani is right when he says:
print(anyrdd) will not print the content of the RDD
If you want to see the content of the RDD on the screen, you can use the following (recommended only for small RDDs):
print(times_twelve.take(times_twelve.count()))
You can check the docs here for these actions supported by Spark.
