Prevent Spark Out of Memory when applying Pandas UDF - python

I am running Spark 3.1.3 on a YARN cluster with 2 nodes of 15 GB each. I am trying to run DBSCAN on Spark based on this answer:
https://stackoverflow.com/a/60153719/13378785
spark_dbscan takes a dataframe that contains the features plus a partition column; the partition column defines sub-groups on which sklearn DBSCAN is later applied locally.
Run_dbscan is the sklearn DBSCAN implementation (the only difference being that I mark noise as 0 instead of -1).
As output I just get a dataframe with local labels for each group (which are later merged by another process).
An example of the input and output is shown below:
from pyspark.sql.types import StructType, FloatType, IntegerType
from sklearn.cluster import DBSCAN

def create_scheme(dimensions):
    schema = StructType()
    for i in range(0, dimensions):
        schema.add(str(i), FloatType())
    schema.add('index', IntegerType())
    schema.add('partitions', IntegerType())
    schema.add('cluster', IntegerType())
    return schema

def Run_dbscan(df, eps=0.2, min_samples=45, colums=None, dim=2):
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(df[colums[:dim]])
    df['cluster'] = db.labels_
    df['cluster'] = df['cluster'] + 1  # shift labels so that noise is 0 instead of -1
    return df

def spark_dbscan(train_d_path, spark, esp=0.3, min_pts=3, dim=2, part=8, roungd=0):
    def dbscan(data):
        return Run_dbscan(data, eps=esp, colums=col_d, min_samples=min_pts, dim=dim)

    # partition_df has been created earlier and is the dataframe displayed below as Input
    col_d = partition_df.columns
    partition_df = partition_df.repartition(X)  # X integer [2~200]
    df = partition_df.groupby("partitions").applyInPandas(dbscan, schema=create_scheme(dim))
    df = df.select([c for c in df.columns if c in ['cluster', 'index', 'partitions']])
    # write.csv will either throw OOM or the task will fail in YARN
    df.write.csv('apath', header='true')
    return df
Input :
+-----------+------------+-----+----------+
| 0| 1|index|partitions|
+-----------+------------+-----+----------+
|0.009805393| -1.15588| 0| 400|
| -0.3587753| -0.8127602| 1| 400|
| -0.3587753| -0.8127602| 1| 600|
| 0.39499614| 0.57317185| 2| 500|
| 0.39499614| 0.57317185| 2| 800|
|-0.11886594| -1.3287356| 3| 400|
|0.045839764| -1.333411| 4| 400|
| -1.406546| 0.75050515| 5| 200|
| -1.406546| 0.75050515| 5| 300|
| 1.3432932| -1.1720116| 6| 700|
| 0.23757286| 0.89215803| 7| 800|
| -1.0563164| 1.4410132| 8| 200|
| -0.3718215| -0.49574503| 9| 600|
| 0.7345064| -1.3823572| 10| 700|
| 1.5790144| -0.79659116| 11| 100|
| 1.5790144| -0.79659116| 11| 700|
|-0.04173269| 1.3471143| 12| 800|
| -1.4477109| 0.5785008| 13| 200|
| -1.4477109| 0.5785008| 13| 300|
| 0.59039986|-0.057216637| 14| 500|
+-----------+------------+-----+----------+
Output:
+-----+----------+-------+
|index|partitions|cluster|
+-----+----------+-------+
|98663| 300| 2|
|80841| 300| 1|
|62544| 300| 1|
| 2977| 300| 1|
|23948| 300| 1|
|44470| 300| 1|
|84420| 300| 1|
|30907| 300| 1|
|93084| 300| 1|
|58671| 300| 1|
|58051| 300| 1|
|46485| 300| 2|
|29175| 300| 1|
|41179| 300| 1|
|95224| 300| 1|
|56119| 300| 1|
|22074| 300| 1|
|88336| 300| 1|
|16965| 300| 1|
|35793| 300| 0|
+-----+----------+-------+
only showing top 20 rows
For small datasets (less than 1 million rows) this works fine, but for larger datasets it always ends with either a heap error or an OOM, due to the memory required by DBSCAN. Sklearn's DBSCAN is memory-based, so I expect the data for a group to be moved into memory;
however, I expected this implementation to "split" the original dataset into smaller chunks and prevent these errors.
So my question is how to overcome this, essentially by trading speed for lower memory usage.
I have tried the following (a sketch of the Spark-side settings follows this list):
1) Increasing the number of partitions, which lowers the number of rows grouped together, so Spark computes the labels more slowly but on smaller groups.
2) Increasing the number of dataframe/shuffle partitions (the X value).
Neither of these prevented the issue described above.
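For reference, a rough sketch of those Spark-side settings (the values are placeholders, not the exact ones I used):
# more shuffle partitions for the groupby/applyInPandas exchange,
# and a higher repartition count (the X value above)
spark.conf.set("spark.sql.shuffle.partitions", 200)
partition_df = partition_df.repartition(200)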
The only method I haven't tried, but which might work, is a filter-like approach: create N dataframes (N = the number of partitions/sub-groups), run DBSCAN on each of them in a for loop, and then union them back together. However, I couldn't find any post supporting such a solution.
from functools import reduce
import pyspark.sql.functions as f

# df here is the input dataframe with the features (the Input table above)
d = {}
num_p = [1, 2, 3, 4]  # list with the partition numbers 1, 2, 3, 4, etc.
for ie in num_p:
    d[ie] = df.filter(f.col("partitions") == ie)
    pandas_df = d[ie].toPandas()
    # apply DBSCAN locally, as in Run_dbscan above
    pandas_df = Run_dbscan(pandas_df, colums=col_d)
    # drop the features and convert back to a Spark dataframe with the schema
    d[ie] = spark.createDataFrame(pandas_df[['index', 'partitions', 'cluster']])
# merge all Spark dataframes into one with union and move on to global merging
merged = reduce(lambda a, b: a.union(b), d.values())

Related

Apply value to a dataframe after filtering another dataframe based on conditions

My question:
There are two dataframes, and the info one is still being built.
What I want to do is filter the reference dataframe based on a condition: when key is b, apply its value (2) to the info table as a whole new column.
The output dataframe below is the final result I want.
Dataframe (info)
+-----+-----+
| key|value|
+-----+-----+
| a| 10|
| b| 20|
| c| 50|
| d| 40|
+-----+-----+
Dataframe (Reference)
+-----+-----+
| key|value|
+-----+-----+
| a| 42|
| b| 2|
| c| 9|
| d| 100|
+-----+-----+
Below is the output I want:
Dataframe (Output)
+-----+-----+-----+
| key|value|const|
+-----+-----+-----+
| a| 10| 2|
| b| 20| 2|
| c| 50| 2|
| d| 40| 2|
+-----+-----+-----+
I have tried several methods; below is the latest one I tried, but the system warns me that PySpark dataframes do not have a loc function.
df_cal = (
    info
    .join(reference)
    .withColumn('const', reference.loc[reference['key']=='b', 'value'].iloc[0])
    .select('key', 'result', 'const')
)
df_cal.show()
And below is the warning raised by the system:
AttributeError: 'DataFrame' object has no attribute 'loc'
This solves it:
from pyspark.sql.functions import lit

# df1 is the info dataframe, df2 is the reference dataframe
target = 'b'
const = [i['value'] for i in df2.collect() if i['key'] == target]
df_cal = df1.withColumn('const', lit(const[0]))
df_cal.show()
+---+-----+-----+
|key|value|const|
+---+-----+-----+
| a| 10| 2|
| b| 20| 2|
| c| 30| 2|
| d| 40| 2|
+---+-----+-----+
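If you want to avoid collect(), a cross join with the single filtered reference row is a possible alternative. A sketch, reusing the df1/df2 names from the answer above:
import pyspark.sql.functions as F

# keep everything on the executors by cross-joining the one filtered
# reference row instead of collecting it to the driver
const_df = df2.filter(F.col('key') == 'b').select(F.col('value').alias('const'))
df_cal = df1.crossJoin(const_df)
df_cal.show()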

Delta Lake Table Storage Sorting

I have a Delta Lake table and am inserting data into it. The business asked to sort the data while storing it in the table.
I sorted my dataframe before creating the Delta table as below:
df.sort()
and then created the Delta table as below:
df.write.format('delta').option('mergeSchema', 'true').save('deltalocation')
When retrieving this data into a dataframe I see that it is still unsorted, and I have to call df.sort in order to display the sorted data.
Per my understanding, the data cannot actually be stored in sorted order, and the user has to write a sorting query while extracting the data from the table.
I need to understand if this is correct, and also how Delta Lake internally stores the data.
My understanding is that it partitions the data and doesn't care about the sort order; data is spread across multiple partitions.
Can someone please clarify this in more detail and advise whether my understanding is correct?
Delta Lake itself does not enforce sorting, because that would require every engine writing to the table to sort the data. To balance simplicity, speed of ingestion, and speed of query, Delta Lake does not require or enable sorting per se; i.e., your statement is correct.
My understanding is that it partitions the data and doesn't care about the sort order. data is spread across multiple partitions.
Note that Delta Lake includes data skipping and OPTIMIZE ZORDER. These allow you to skip files/data using column statistics and by clustering the data. While sorting can be helpful for a single column, Z-ordering provides better multi-column data clustering. More info is available in Delta 2.0 - The Foundation of your Data Lakehouse is Open.
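For example, a minimal sketch of Z-ordering an existing table, assuming Delta Lake 2.0+ with the Delta SQL extensions enabled (the path is the df_sorted location used later in this answer):
# co-locate data by the given column across files; requires Delta Lake 2.0+
spark.sql("OPTIMIZE delta.`/tmp/df_sorted` ZORDER BY (id)")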
That said, how Delta Lake stores the data is largely a product of what the writer itself does. If you were to specify a sort during the write phase, e.g.:
df_sorted = df.repartition("date").sortWithinPartitions("date", "id")
df_sorted.write.format("delta").partitionBy("date").save('deltalocation')
Then the data should be sorted and when read it will be sorted as well.
In response to the question about the potential order, allow me to provide a simple example:
from pyspark.sql.functions import expr
data = spark.range(0, 100)
df = data.withColumn("mod", expr("mod(id, 10)"))
df.show()
# Write unsorted table
df.write.format("delta").partitionBy("mod").save("/tmp/df")
# Sort within partitions
df_sorted = df.repartition("mod").sortWithinPartitions("mod", "id")
# Write sorted table
df_sorted.write.format("delta").partitionBy("mod").save("/tmp/df_sorted")
The two data frames have been saved as Delta tables to their respective df and df_sorted locations.
You can read the data by the following:
# Load data
spark.read.format("delta").load("/tmp/df").show()
spark.read.format("delta").load("/tmp/df").orderBy("mod").show()
spark.read.format("delta").load("/tmp/df_sorted").show()
spark.read.format("delta").load("/tmp/df_sorted").orderBy("mod").show()
For the unsorted table, here are the first 20 rows; as expected, the data is not sorted.
+---+---+
| id|mod|
+---+---+
| 63| 3|
| 73| 3|
| 83| 3|
| 93| 3|
| 3| 3|
| 13| 3|
| 23| 3|
| 33| 3|
| 43| 3|
| 53| 3|
| 88| 8|
| 98| 8|
| 28| 8|
| 38| 8|
| 48| 8|
| 58| 8|
| 8| 8|
| 18| 8|
| 68| 8|
| 78| 8|
+---+---+
But in the case of df_sorted:
+---+---+
| id|mod|
+---+---+
| 2| 2|
| 12| 2|
| 22| 2|
| 32| 2|
| 42| 2|
| 52| 2|
| 62| 2|
| 72| 2|
| 82| 2|
| 92| 2|
| 9| 9|
| 19| 9|
| 29| 9|
| 39| 9|
| 49| 9|
| 59| 9|
| 69| 9|
| 79| 9|
| 89| 9|
| 99| 9|
+---+---+
As noted, the data within each partition is sorted. The partitions themselves are not returned in order, because different worker threads read different partitions, so there is no guarantee of partition order unless you explicitly specify one (e.g. with an orderBy at read time).

AttributeError: 'numpy.int64' object has no attribute '_get_object_id'

I have a dataset in pyspark for which I create a row_num column, so my data looks like:
#data:
+-----------------+-----------------+-----+------------------+-------+
|F1_imputed |F2_imputed |label| features|row_num|
+-----------------+-----------------+-----+------------------+-------+
| -0.002353| 0.9762| 0|[-0.002353,0.9762]| 1|
| 0.1265| 0.1176| 0| [0.1265,0.1176]| 2|
| -0.08637| 0.06524| 0|[-0.08637,0.06524]| 3|
| -0.1428| 0.4705| 0| [-0.1428,0.4705]| 4|
| -0.1015| 0.6811| 0| [-0.1015,0.6811]| 5|
| -0.01146| 0.8273| 0| [-0.01146,0.8273]| 6|
| 0.0853| 0.2525| 0| [0.0853,0.2525]| 7|
| 0.2186| 0.2725| 0| [0.2186,0.2725]| 8|
| -0.145| 0.3592| 0| [-0.145,0.3592]| 9|
| -0.1176| 0.4225| 0| [-0.1176,0.4225]| 10|
+-----------------+-----------------+-----+------------------+-------+
I'm trying to filter out a random selection of rows using:
count = data.count()
sample = [np.random.choice(np.arange(count), replace=True, size=50)]
filtered = data.filter(data.row_num.isin(sample))
However, the filter line gives an error:
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
What is causing this? I use the same filtering code to split the rows by label (a binary column of ones and zeros), which does work, but reapplying the code for sampling does not.
Numpy data types don't interact well with Spark. You can convert them to Python data types using .tolist() before calling .isin:
sample = np.random.choice(np.arange(count), replace=True, size=50).tolist()
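Putting it together with the row_num column from the question, a small sketch using the same variable names:
import numpy as np

count = data.count()
# .tolist() turns the numpy.int64 values into plain Python ints that Spark can handle
sample = np.random.choice(np.arange(count), replace=True, size=50).tolist()
filtered = data.filter(data.row_num.isin(sample))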

PySpark: Using Window Functions to roll-up dataframe

I have a dataframe my_df that contains 4 columns:
+----------------+---------------+--------+---------+
| user_id| domain|isp_flag|frequency|
+----------------+---------------+--------+---------+
| josh| wanadoo.fr| 1| 15|
| josh| random.it| 0| 12|
| samantha| wanadoo.fr| 1| 16|
| bob| eidsiva.net| 1| 5|
| bob| media.net| 0| 1|
| dylan| vodafone.it| 1| 448|
| dylan| somesite.net| 0| 20|
| dylan| yolosite.net| 0| 49|
| dylan| random.it| 0| 3|
| don| vodafone.it| 1| 39|
| don| popsugar.com| 0| 10|
| don| fabio.com| 1| 49|
+----------------+---------------+--------+---------+
This is what I'm planning to do-
Find all the user_id where the maximum frequency domain with isp_flag=0 has a frequency that is less than 25% of the maximum frequency domain with isp_flag=1.
So, in the example that I have above, my output_df would look like-
+----------------+---------------+--------+---------+
| user_id| domain|isp_flag|frequency|
+----------------+---------------+--------+---------+
| bob| eidsiva.net| 1| 5|
| bob| media.net| 0| 1|
| dylan| vodafone.it| 1| 448|
| dylan| yolosite.net| 0| 49|
| don| fabio.com| 1| 49|
| don| popsugar.com| 0| 10|
+----------------+---------------+--------+---------+
I believe I need window functions to do this, so I tried the following to first find the maximum-frequency domains for isp_flag=0 and isp_flag=1 respectively, for each user_id:
>>> win_1 = Window().partitionBy("user_id", "domain", "isp_flag").orderBy((col("frequency").desc()))
>>> final_df = my_df.select("*", rank().over(win_1).alias("rank")).filter(col("rank")==1)
>>> final_df.show(5) # this just gives me the original dataframe back
What am I doing wrong here? How do I get to the final output_df I printed above?
IIUC, you can try the following: calculate the max frequencies (max_0, max_1) for each user for isp_flag == 0 and 1 respectively, then filter by the condition max_0 < 0.25*max_1, plus frequency in (max_1, max_0) to keep only the records with the maximum frequency.
from pyspark.sql import Window, functions as F

# set up the Window to calculate max_0 and max_1 for each user
# having isp_flag = 0 and 1 respectively
w1 = Window.partitionBy('user_id').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn('max_1', F.max(F.expr("IF(isp_flag==1, frequency, NULL)")).over(w1)) \
  .withColumn('max_0', F.max(F.expr("IF(isp_flag==0, frequency, NULL)")).over(w1)) \
  .where('max_0 < 0.25*max_1 AND frequency in (max_1, max_0)') \
  .show()
+-------+------------+--------+---------+-----+-----+
|user_id| domain|isp_flag|frequency|max_1|max_0|
+-------+------------+--------+---------+-----+-----+
| don|popsugar.com| 0| 10| 49| 10|
| don| fabio.com| 1| 49| 49| 10|
| dylan| vodafone.it| 1| 448| 448| 49|
| dylan|yolosite.net| 0| 49| 448| 49|
| bob| eidsiva.net| 1| 5| 5| 1|
| bob| media.net| 0| 1| 5| 1|
+-------+------------+--------+---------+-----+-----+
Some explanations, per request:
The WindowSpec w1 is set to examine all records for the same user (partitionBy), so that the F.max() function compares all rows of the same user.
We use IF(isp_flag==1, frequency, NULL) to pick the frequency of rows having isp_flag==1; it returns NULL when isp_flag is not 1, which is then skipped by F.max(). This is a SQL expression, so we need F.expr() to run it.
F.max(...).over(w1) takes the max of the result of the above SQL expression; this calculation is based on the Window w1.
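For comparison, the same logic can be expressed with a groupBy plus a join instead of the window; this is only a sketch of an equivalent formulation:
from pyspark.sql import functions as F

# aggregate max_1/max_0 per user, then join back and apply the same filter
maxes = df.groupBy('user_id').agg(
    F.max(F.expr("IF(isp_flag==1, frequency, NULL)")).alias('max_1'),
    F.max(F.expr("IF(isp_flag==0, frequency, NULL)")).alias('max_0'))

df.join(maxes, 'user_id') \
  .where('max_0 < 0.25*max_1 AND frequency in (max_1, max_0)') \
  .show()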

Implementing an Auto-Increment column in a DataFrame

I'm trying to implement an auto-increment column in a DataFrame.
I already found a solution but I want to know if there's a better way to do this.
I'm using monotonically_increasing_id() function from pyspark.sql.functions.
The problem with this is that it starts at 0 and I want it to start at 1.
So I did the following, and it works fine:
(F.monotonically_increasing_id()+1).alias("songplay_id")
dfLog.join(dfSong, (dfSong.artist_name == dfLog.artist) & (dfSong.title == dfLog.song)) \
    .select((F.monotonically_increasing_id()+1).alias("songplay_id"),
            dfLog.ts.alias("start_time"), dfLog.userId.alias("user_id"),
            dfLog.level,
            dfSong.song_id,
            dfSong.artist_id,
            dfLog.sessionId.alias("session_id"),
            dfLog.location,
            dfLog.userAgent.alias("user_agent"))
Is there a better way to implement what I'm trying to do?
I think it's too much work to implement a UDF just for that, or is it just me?
Thanks.
The values produced by monotonically_increasing_id are not guaranteed to be consecutive, but they are guaranteed to be monotonically increasing. Each task of your job is assigned a starting integer from which it increments by 1 at every row, but you'll have gaps between the last id of one partition and the first id of another.
To verify this behavior, you can create a job containing two tasks by repartitioning a sample data frame:
import pandas as pd
import pyspark.sql.functions as psf

spark.createDataFrame(pd.DataFrame([[i] for i in range(10)], columns=['value'])) \
    .repartition(2) \
    .withColumn('id', psf.monotonically_increasing_id()) \
    .show()
+-----+----------+
|value| id|
+-----+----------+
| 3| 0|
| 0| 1|
| 6| 2|
| 2| 3|
| 4| 4|
| 7|8589934592|
| 5|8589934593|
| 8|8589934594|
| 9|8589934595|
| 1|8589934596|
+-----+----------+
In order to make sure your index yields consecutive values, you can use a window function.
from pyspark.sql import Window

w = Window.orderBy('id')
spark.createDataFrame(pd.DataFrame([[i] for i in range(10)], columns=['value'])) \
    .withColumn('id', psf.monotonically_increasing_id()) \
    .withColumn('id2', psf.row_number().over(w)) \
    .show()
+-----+---+---+
|value| id|id2|
+-----+---+---+
| 0| 0| 1|
| 1| 1| 2|
| 2| 2| 3|
| 3| 3| 4|
| 4| 4| 5|
| 5| 5| 6|
| 6| 6| 7|
| 7| 7| 8|
| 8| 8| 9|
| 9| 9| 10|
+-----+---+---+
Notes:
monotonically_increasing_id lets you establish an order on your rows as they are read; it starts at 0 for the first task and increases, but not necessarily sequentially
row_number sequentially indexes the rows in an ordered window and starts at 1
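Applied to the original songplay_id example, a sketch might look like the following (the ordering column is an assumption; pick whatever column defines the order you want):
from pyspark.sql import Window
import pyspark.sql.functions as F

# row_number() needs an ordering; ordering by a monotonically increasing id keeps
# the read order and yields a consecutive index starting at 1
w = Window.orderBy(F.monotonically_increasing_id())

dfLog.join(dfSong, (dfSong.artist_name == dfLog.artist) & (dfSong.title == dfLog.song)) \
    .select(F.row_number().over(w).alias("songplay_id"),
            dfLog.ts.alias("start_time"),
            dfLog.userId.alias("user_id"))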
