Implementing an Auto-Increment column in a DataFrame - python

I'm trying to implement an auto-increment column in a DataFrame.
I already found a solution but I want to know if there's a better way to do this.
I'm using the monotonically_increasing_id() function from pyspark.sql.functions.
The problem with this is that it starts at 0 and I want it to start at 1.
So I did the following and it is working fine:
(F.monotonically_increasing_id()+1).alias("songplay_id")
dfLog.join(dfSong, (dfSong.artist_name == dfLog.artist) & (dfSong.title == dfLog.song)) \
    .select((F.monotonically_increasing_id() + 1).alias("songplay_id"),
            dfLog.ts.alias("start_time"),
            dfLog.userId.alias("user_id"),
            dfLog.level,
            dfSong.song_id,
            dfSong.artist_id,
            dfLog.sessionId.alias("session_id"),
            dfLog.location,
            dfLog.userAgent.alias("user_agent"))
Is there a better way to implement what I'm trying to do?
I think it's too much work to implement a UDF just for that, or is it just me?
Thanks.

The values produced by monotonically_increasing_id are not guaranteed to be consecutive, but they are guaranteed to be monotonically increasing. Each task of your job is assigned a starting integer, from which it increments by 1 at every row, but there will be gaps between the last id of one task and the first id of the next.
To verify this behavior, you can create a job containing two tasks by repartitioning a sample data frame:
import pandas as pd
import pyspark.sql.functions as psf
spark.createDataFrame(pd.DataFrame([[i] for i in range(10)], columns=['value'])) \
.repartition(2) \
.withColumn('id', psf.monotonically_increasing_id()) \
.show()
+-----+----------+
|value|        id|
+-----+----------+
|    3|         0|
|    0|         1|
|    6|         2|
|    2|         3|
|    4|         4|
|    7|8589934592|
|    5|8589934593|
|    8|8589934594|
|    9|8589934595|
|    1|8589934596|
+-----+----------+
In order to make sure your index yields consecutive values, you can use a window function.
from pyspark.sql import Window
w = Window.orderBy('id')
spark.createDataFrame(pd.DataFrame([[i] for i in range(10)], columns=['value'])) \
.withColumn('id', psf.monotonically_increasing_id()) \
.withColumn('id2', psf.row_number().over(w)) \
.show()
+-----+---+---+
|value| id|id2|
+-----+---+---+
|    0|  0|  1|
|    1|  1|  2|
|    2|  2|  3|
|    3|  3|  4|
|    4|  4|  5|
|    5|  5|  6|
|    6|  6|  7|
|    7|  7|  8|
|    8|  8|  9|
|    9|  9| 10|
+-----+---+---+
Notes:
monotonically_increasing_id gives you an ordering over your rows as they are read; it starts at 0 for the first task and increases, but not necessarily sequentially
row_number sequentially indexes the rows in an ordered window and starts at 1
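Applied to the songplay query from the question, a minimal sketch (assuming dfLog and dfSong are already loaded, and keeping this answer's pattern of indexing on a monotonically_increasing_id helper column) could look like this; row_number already starts at 1, so no +1 offset is needed:
from pyspark.sql import Window
import pyspark.sql.functions as F

songplays = (
    dfLog
    .join(dfSong, (dfSong.artist_name == dfLog.artist) & (dfSong.title == dfLog.song))
    .select(
        dfLog.ts.alias("start_time"),
        dfLog.userId.alias("user_id"),
        dfLog.level,
        dfSong.song_id,
        dfSong.artist_id,
        dfLog.sessionId.alias("session_id"),
        dfLog.location,
        dfLog.userAgent.alias("user_agent"),
    )
    # helper id keeps the read order, then row_number makes it consecutive from 1
    .withColumn("_mono_id", F.monotonically_increasing_id())
    .withColumn("songplay_id", F.row_number().over(Window.orderBy("_mono_id")))
    .drop("_mono_id")
)
Note that a window without partitionBy pulls all rows into a single partition, so Spark will warn about performance; that is the cost of a globally consecutive index.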

Related

Apply value to a dataframe after filtering another dataframe based on conditions

My question:
There are two dataframes, and the info one is still being built.
What I want to do is filter the reference dataframe on a condition: when key is b, apply its value (2) to the info table as a whole new column.
The output dataframe below is the final result I want.
Dataframe (info)
+-----+-----+
| key|value|
+-----+-----+
| a| 10|
| b| 20|
| c| 50|
| d| 40|
+-----+-----+
Dataframe (Reference)
+-----+-----+
| key|value|
+-----+-----+
| a| 42|
| b| 2|
| c| 9|
| d| 100|
+-----+-----+
Below is the output I want:
Dataframe (Output)
+-----+-----+-----+
| key|value|const|
+-----+-----+-----+
| a| 10| 2|
| b| 20| 2|
| c| 50| 2|
| d| 40| 2|
+-----+-----+-----+
I have tried several methods; below is the latest one, but the system warned me that PySpark does not have a loc function.
df_cal = (
    info
    .join(reference)
    .withColumn('const', reference.loc[reference['key'] == 'b', 'value'].iloc[0])
    .select('key', 'result', 'const')
)
df_cal.show()
And below is the warning raised by the system:
AttributeError: 'DataFrame' object has no attribute 'loc'
This solves it:
from pyspark.sql.functions import lit

target = 'b'
# df1 is the info dataframe, df2 is the reference dataframe
const = [i['value'] for i in df2.collect() if i['key'] == target]
df_cal = df1.withColumn('const', lit(const[0]))
df_cal.show()
+---+-----+-----+
|key|value|const|
+---+-----+-----+
| a| 10| 2|
| b| 20| 2|
| c| 30| 2|
| d| 40| 2|
+---+-----+-----+
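A variation that stays entirely in Spark, with no collect on the driver (a sketch assuming the dataframes are named info and reference as in the question), is to filter the single reference row and cross-join it onto info:
from pyspark.sql import functions as F

const_df = (reference
            .filter(F.col('key') == 'b')              # keep only the row with key b
            .select(F.col('value').alias('const')))   # one row, one column

# the one-row dataframe is broadcast and attached to every row of info
df_cal = info.crossJoin(F.broadcast(const_df))
df_cal.show()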

remove all rows with the values in all column similar with another pyspark dataframe

For example, say I have 2 PySpark dataframes with 100+ columns.
I wish to compare dataframe A with dataframe B and check whether there are rows with the same values across all 100+ columns. If yes, remove those rows.
I do not have any code because I do not know how to begin; please help.
I am wondering whether I could use an outer join between the two dataframes.
A left_anti join keeps only the rows of the left table that have no match in the right table, which is well suited for your use case. Here's an example.
data1_sdf.show()
# +---+---+---+---+
# | c1| c2| c3| c4|
# +---+---+---+---+
# | 1| 2| 3| 4|
# | 1| 2| 3| 4|
# | 3| 2| 1| 4|
# +---+---+---+---+
data2_sdf.show()
# +---+---+---+---+
# | c1| c2| c3| c4|
# +---+---+---+---+
# | 1| 2| 3| 4|
# | 5| 7| 3| 4|
# +---+---+---+---+
# left anti join to remove matching rows from data1_sdf
import pyspark.sql.functions as func

data1_sdf \
    .withColumn('allcol_concat', func.concat_ws('_', *data1_sdf.columns)) \
    .join(data2_sdf
          .withColumn('allcol_concat', func.concat_ws('_', *data2_sdf.columns))
          .select('allcol_concat'),
          'allcol_concat',
          'leftanti') \
    .show()
# +-------------+---+---+---+---+
# |allcol_concat| c1| c2| c3| c4|
# +-------------+---+---+---+---+
# | 3_2_1_4| 3| 2| 1| 4|
# +-------------+---+---+---+---+
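If the two dataframes share exactly the same columns, a simpler route is possible; a sketch, with the caveat that join equality never matches NULLs (unlike the concat_ws approach, which turns them into empty strings):
# anti join on every column, no helper concat column needed
result = data1_sdf.join(data2_sdf, on=data1_sdf.columns, how='left_anti')

# or, respecting duplicate multiplicity like SQL EXCEPT ALL (Spark 2.4+)
result = data1_sdf.exceptAll(data2_sdf)
result.show()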

Prevent Spark Out of Memory when applying Pandas UDF

I am running Spark version 3.1.3 on a YARN cluster of 2 nodes with 15 GB each. I am trying to run DBSCAN on Spark based on this answer:
1) https://stackoverflow.com/a/60153719/13378785
spark_dbscan takes a dataframe which contains the features and a partition column, based on which some sub-groups have been created; sklearn DBSCAN is later applied locally to each group.
Run_dbscan is the DBSCAN implementation from sklearn (the only difference being that I treat noise as 0 instead of -1).
As an output I just get a dataframe with local labels for each group (which are later merged by another process).
An example of the input and output is shown below:
from pyspark.sql.types import StructType, FloatType, IntegerType
from sklearn.cluster import DBSCAN

def create_scheme(dimensions):
    schema = StructType()
    for i in range(0, dimensions):
        schema.add(str(i), FloatType())
    schema.add('index', IntegerType())
    schema.add('partitions', IntegerType())
    schema.add('cluster', IntegerType())
    return schema

def Run_dbscan(df, eps=0.2, min_samples=45, colums=None, dim=2):
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(df[colums[:dim]])
    df['cluster'] = db.labels_
    df['cluster'] = df['cluster'] + 1  # treat noise as 0 instead of -1
    return df

def spark_dbscan(train_d_path, spark, esp=0.3, min_pts=3, dim=2, part=8, roungd=0):
    def dbscan(data):
        return Run_dbscan(data, eps=esp, colums=col_d, min_samples=min_pts, dim=dim)

    # partition_df has been created earlier and is the one displayed below as Input
    col_d = partition_df.columns
    partition_df = partition_df.repartition(X)  # X integer [2~200]
    df = partition_df.groupby("partitions").applyInPandas(dbscan, schema=create_scheme(dim))
    df = df.select([c for c in df.columns if c in ['cluster', 'index', 'partitions']])
    # write.csv will either throw OOM or the task will fail in YARN
    df.write.csv('apath', header='true')
    return df
Input :
+-----------+------------+-----+----------+
| 0| 1|index|partitions|
+-----------+------------+-----+----------+
|0.009805393| -1.15588| 0| 400|
| -0.3587753| -0.8127602| 1| 400|
| -0.3587753| -0.8127602| 1| 600|
| 0.39499614| 0.57317185| 2| 500|
| 0.39499614| 0.57317185| 2| 800|
|-0.11886594| -1.3287356| 3| 400|
|0.045839764| -1.333411| 4| 400|
| -1.406546| 0.75050515| 5| 200|
| -1.406546| 0.75050515| 5| 300|
| 1.3432932| -1.1720116| 6| 700|
| 0.23757286| 0.89215803| 7| 800|
| -1.0563164| 1.4410132| 8| 200|
| -0.3718215| -0.49574503| 9| 600|
| 0.7345064| -1.3823572| 10| 700|
| 1.5790144| -0.79659116| 11| 100|
| 1.5790144| -0.79659116| 11| 700|
|-0.04173269| 1.3471143| 12| 800|
| -1.4477109| 0.5785008| 13| 200|
| -1.4477109| 0.5785008| 13| 300|
| 0.59039986|-0.057216637| 14| 500|
+-----------+------------+-----+----------+
Output:
+-----+----------+-------+
|index|partitions|cluster|
+-----+----------+-------+
|98663| 300| 2|
|80841| 300| 1|
|62544| 300| 1|
| 2977| 300| 1|
|23948| 300| 1|
|44470| 300| 1|
|84420| 300| 1|
|30907| 300| 1|
|93084| 300| 1|
|58671| 300| 1|
|58051| 300| 1|
|46485| 300| 2|
|29175| 300| 1|
|41179| 300| 1|
|95224| 300| 1|
|56119| 300| 1|
|22074| 300| 1|
|88336| 300| 1|
|16965| 300| 1|
|35793| 300| 0|
+-----+----------+-------+
only showing top 20 rows
For small datasets (less than 1 million rows) this works fine; however, for larger datasets it always ends up with either a heap error or an OOM because of the memory DBSCAN requires. sklearn's DBSCAN is memory based, so I expect all of the data within a group to be moved into memory;
however, I expected this implementation to be able to split the original dataset into smaller chunks and prevent these errors.
So my question is how to overcome this, basically trading memory for speed.
I have tried the following:
1) Increasing the number of partitions, which lowers the number of rows grouped together, so Spark computes the labels more slowly but with less memory per group.
2) Increasing the number of dataframe partitions/shuffles (the X value).
Neither of those solutions could prevent the issue I mentioned from happening.
The only method which I haven't tried, but which might work, is a filter-like approach where I would create N dataframes (N being the number of partitions/sub-groups), run DBSCAN on each of them in a for loop, and then join them together, but I couldn't find any post supporting such a solution.
d = {}
num_p = [1, 2, 3, 4]  # list with the partition numbers 1, 2, 3, 4, etc.
for ie in num_p:
    d[ie] = df.filter(f.col("partitions") == ie)
    pandas_df = d[ie].toPandas()
    # apply dbscan
    # drop features and convert back to a spark dataframe with the schema
# then merge all spark dataframes into one with union and move on to the global merging
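A minimal sketch of that idea (assuming partition_df, Run_dbscan and create_scheme from above, and that the largest single sub-group fits in driver memory) processes one sub-group at a time on the driver and unions the label frames back together:
from functools import reduce
import pyspark.sql.functions as F

col_d = partition_df.columns
partition_values = [r['partitions'] for r in
                    partition_df.select('partitions').distinct().collect()]

label_dfs = []
for p in partition_values:
    # pull a single sub-group onto the driver and label it with sklearn
    pdf = partition_df.filter(F.col('partitions') == p).toPandas()
    labelled = Run_dbscan(pdf, eps=0.3, min_samples=3, colums=col_d, dim=2)
    label_dfs.append(spark.createDataFrame(labelled[['index', 'partitions', 'cluster']]))

# union the per-group label frames back into one dataframe for the global merge
result = reduce(lambda a, b: a.unionByName(b), label_dfs)
result.write.csv('apath', header='true')
This caps peak memory at the size of the largest sub-group, at the cost of running the groups sequentially.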

PySpark: Using Window Functions to roll-up dataframe

I have a dataframe my_df that contains 4 columns:
+----------------+---------------+--------+---------+
| user_id| domain|isp_flag|frequency|
+----------------+---------------+--------+---------+
| josh| wanadoo.fr| 1| 15|
| josh| random.it| 0| 12|
| samantha| wanadoo.fr| 1| 16|
| bob| eidsiva.net| 1| 5|
| bob| media.net| 0| 1|
| dylan| vodafone.it| 1| 448|
| dylan| somesite.net| 0| 20|
| dylan| yolosite.net| 0| 49|
| dylan| random.it| 0| 3|
| don| vodafone.it| 1| 39|
| don| popsugar.com| 0| 10|
| don| fabio.com| 1| 49|
+----------------+---------------+--------+---------+
This is what I'm planning to do-
Find all the user_id where the maximum frequency domain with isp_flag=0 has a frequency that is less than 25% of the maximum frequency domain with isp_flag=1.
So, in the example that I have above, my output_df would look like-
+----------------+---------------+--------+---------+
| user_id| domain|isp_flag|frequency|
+----------------+---------------+--------+---------+
| bob| eidsiva.net| 1| 5|
| bob| media.net| 0| 1|
| dylan| vodafone.it| 1| 448|
| dylan| yolosite.net| 0| 49|
| don| fabio.com| 1| 49|
| don| popsugar.com| 0| 10|
+----------------+---------------+--------+---------+
I believe I need window functions to do this, and so I tried the following to first find the maximum frequency domains for isp_flag=0 and isp_flag=1 respectively, for each of the user_id-
>>> win_1 = Window().partitionBy("user_id", "domain", "isp_flag").orderBy((col("frequency").desc()))
>>> final_df = my_df.select("*", rank().over(win_1).alias("rank")).filter(col("rank")==1)
>>> final_df.show(5) # this just gives me the original dataframe back
What am I doing wrong here? How do I get to the final output_df I printed above?
IIUC, you can try the following: calculate the max frequencies (max_0, max_1) for each user for isp_flag == 0 and 1 respectively, then filter on the condition max_0 < 0.25*max_1 plus frequency in (max_1, max_0) to keep only the records with the maximum frequency.
from pyspark.sql import Window, functions as F

# set up the Window to calculate max_0 and max_1 for each user
# having isp_flag = 0 and 1 respectively
w1 = Window.partitionBy('user_id').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn('max_1', F.max(F.expr("IF(isp_flag==1, frequency, NULL)")).over(w1)) \
  .withColumn('max_0', F.max(F.expr("IF(isp_flag==0, frequency, NULL)")).over(w1)) \
  .where('max_0 < 0.25*max_1 AND frequency in (max_1, max_0)') \
  .show()
+-------+------------+--------+---------+-----+-----+
|user_id| domain|isp_flag|frequency|max_1|max_0|
+-------+------------+--------+---------+-----+-----+
| don|popsugar.com| 0| 10| 49| 10|
| don| fabio.com| 1| 49| 49| 10|
| dylan| vodafone.it| 1| 448| 448| 49|
| dylan|yolosite.net| 0| 49| 448| 49|
| bob| eidsiva.net| 1| 5| 5| 1|
| bob| media.net| 0| 1| 5| 1|
+-------+------------+--------+---------+-----+-----+
Some explanations per request:
The WindowSpec w1 is set to examine all records for the same user (partitionBy), so that the F.max() function compares all rows of the same user.
We use IF(isp_flag==1, frequency, NULL) to get the frequency of rows having isp_flag==1; it returns NULL when isp_flag is not 1 and is thus skipped by the F.max() function. This is a SQL expression, so we need the F.expr() function to run it.
F.max(...).over(w1) takes the max of the result of the above SQL expression; this calculation is based on the Window w1.
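An equivalent formulation without window functions (a sketch using the same column names) aggregates the two maxima per user with a groupBy, joins them back, and drops the helper columns so the result matches the requested output_df exactly:
from pyspark.sql import functions as F

# per-user maxima for isp_flag 1 and 0; F.when without otherwise yields NULL,
# which F.max ignores
maxima = df.groupBy('user_id').agg(
    F.max(F.when(F.col('isp_flag') == 1, F.col('frequency'))).alias('max_1'),
    F.max(F.when(F.col('isp_flag') == 0, F.col('frequency'))).alias('max_0'))

output_df = (df.join(maxima, 'user_id')
               .where('max_0 < 0.25*max_1 AND frequency in (max_1, max_0)')
               .drop('max_1', 'max_0'))
output_df.show()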

best way to generate per key auto increment numerals after sorting

I wanted to ask what's the best way to achieve per-key auto-increment numerals after sorting, for example:
raw file:
1,a,b,c,1,1
1,a,b,d,0,0
1,a,b,e,1,0
2,a,e,c,0,0
2,a,f,d,1,0
post-output (the last column is the position number after grouping on
first three fields and reverse sorting on last two values)
1,a,b,c,1,1,1
1,a,b,d,0,0,3
1,a,b,e,1,0,2
2,a,e,c,0,0,2
2,a,f,d,1,0,1
I am using a solution based on groupByKey, but it is running into some issues (possibly a bug with pyspark/spark?). I am wondering if there is a better way to achieve this.
My solution:
A = (sc.textFile("train.csv")
     .filter(lambda x: not isHeader(x))
     .map(split)
     .map(parse_train)
     .filter(lambda x: x is not None))
B = (A.map(lambda k: ((k.first_field, k.second_field, k.third_field), k[0:5]))
     .groupByKey())
B.map(sort_n_set_position) \
 .flatMap(lambda line: line)
where sort_n_set_position iterates over each group's iterable, sorts it, and appends the position as the last column
Since you have big keys (the first 3 values), I'll assume you will not have a ton of rows per key. Given this, I would just use groupByKey([numTasks]) and then use plain Python to sort and add your index to each row in the resulting iterables, as sketched below.
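A minimal sketch of that approach (hypothetical names; it assumes each record has already been parsed into a 6-tuple like (1, 'a', 'b', 'c', 1, 1)):
def sort_n_set_position(key_and_rows):
    key, rows = key_and_rows
    # reverse-sort on the last two values, then append the 1-based position
    ranked = sorted(rows, key=lambda r: (r[4], r[5]), reverse=True)
    return [tuple(r) + (pos,) for pos, r in enumerate(ranked, start=1)]

result = (parsed_rdd.map(lambda r: ((r[0], r[1], r[2]), r))
                    .groupByKey()
                    .flatMap(sort_n_set_position))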
A slightly different approach combining spark-csv, DataFrames and window functions. I assume that the header line is x1,x2,x3,x4,x5,x6 for brevity:
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber, col

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("train.csv"))

w = (Window()
     .partitionBy(col("x1"), col("x2"), col("x3"))
     .orderBy(col("x5").desc(), col("x6").desc()))

df_with_rn = df.select(col("*"), rowNumber().over(w).alias("x7"))
df_with_rn.show()
## +---+---+---+---+---+---+---+
## | x1| x2| x3| x4| x5| x6| x7|
## +---+---+---+---+---+---+---+
## | 2| a| e| c| 0| 0| 1|
## | 2| a| f| d| 1| 0| 1|
## | 1| a| b| c| 1| 1| 1|
## | 1| a| b| e| 1| 0| 2|
## | 1| a| b| d| 0| 0| 3|
## +---+---+---+---+---+---+---+
If you want a plain RDD as an output you can simply map as follows:
df_with_rn.map(lambda r: r.asDict())
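On current Spark versions the same answer would read roughly like this (a sketch: rowNumber was renamed row_number, spark-csv is built in as spark.read.csv, and DataFrames no longer expose map directly, so the RDD conversion goes through .rdd):
from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

df = spark.read.csv("train.csv", header=True, inferSchema=True)

w = (Window
     .partitionBy("x1", "x2", "x3")
     .orderBy(col("x5").desc(), col("x6").desc()))

df_with_rn = df.withColumn("x7", row_number().over(w))
df_with_rn.show()

# plain RDD output, if needed
rdd_out = df_with_rn.rdd.map(lambda r: r.asDict())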
