How to do a cumsum in a lambda call using PySpark - python

I am trying to replicate some pandas code in PySpark, and I have run into a problem. This is the code I am trying to replicate:
df_act = df_act.assign(n_cycles=lambda x: (x.cycles_bol != x.cycles_bol.shift(1)).cumsum())
Keep in mind that I am working with a DataFrame, and that cycles_bol is a column of the DataFrame "df_act".
I simply can't get it to work. The closest I think I have gotten to the solution is the following:
df_act = df_act.withColumn(
    "grp",
    when(df_act['cycles_bol'] == lead("cycles_bol").over(Window.partitionBy("user_id").orderBy("timestamp")), 0)
        .otherwise(1)
        .over(Window.orderBy("timestamp"))
).drop("grp").show()
Can anyone please help me?
Thanks in advance!

You didn't give much information.
You have to order by something, use lag to check whether consecutive cycles_bol values are the same, and conditionally add with a running sum. Use an existing column to order by if it won't change the order of cycles_bol. If you don't have such a column, generate one using monotonically_increasing_id like I did.
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, monotonically_increasing_id, sum, when

w = Window.orderBy('id')
df_act.withColumn('id', monotonically_increasing_id()) \
    .withColumn('n_cycles', sum(when(lag('cycles_bol').over(w) != col('cycles_bol'), 1).otherwise(0)).over(w)) \
    .drop('id') \
    .show()
+----------+--------+
|cycles_bol|n_cycles|
+----------+--------+
| A| 0|
| B| 1|
| B| 1|
| A| 2|
| B| 3|
| A| 4|
| A| 4|
| B| 5|
| C| 6|
+----------+--------+
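Your original attempt partitions by user_id and orders by timestamp, so if those columns exist you can plug them straight into the same pattern instead of generating an id. A sketch under that assumption (note that the pandas cumsum starts the count at 1, because the first comparison against the shifted NaN is True, so add 1 if you need identical numbering):
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, sum, when

# assumes user_id and timestamp exist, as in your attempt
w = Window.partitionBy("user_id").orderBy("timestamp")
df_act = df_act.withColumn(
    "n_cycles",
    # running sum of "value changed" flags == cumsum of (x != x.shift(1))
    sum(when(lag("cycles_bol").over(w) != col("cycles_bol"), 1).otherwise(0)).over(w)
)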

Related

How to iterate over a pyspark dataframe and create a dictionary out of it

I have the following pyspark dataframe:
import pandas as pd
foo = pd.DataFrame({'id': ['a','a','a','a', 'b','b','b','b'],
'time': [1,2,3,4,1,2,3,4],
'col': ['1','2','1','2','3','2','3','2']})
foo_df = spark.createDataFrame(foo)
foo_df.show()
+---+----+---+
| id|time|col|
+---+----+---+
| a| 1| 1|
| a| 2| 2|
| a| 3| 1|
| a| 4| 2|
| b| 1| 3|
| b| 2| 2|
| b| 3| 3|
| b| 4| 2|
+---+----+---+
I would like to iterate over all ids and obtain a Python dictionary that has the ids as keys and the col values as values, like this:
foo_dict = {'a': ['1','2','1','2'], 'b': ['3','2','3','2']}
I have in total 10k ids and around 10m rows in foo, so I am looking for an efficient implementation.
Any ideas ?
It's a pandas dataframe. You should check out the documentation. The dataframe object has built-in methods to help you iterate, slice and dice your data. There is also this fun tool to help you visualize what is going on.
pandas has a ready-made method to convert a dataframe to a dict.
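If the 10m rows actually live in a Spark dataframe rather than pandas, one way to avoid iterating row by row is to aggregate per id in Spark and only collect the 10k grouped rows to the driver. A sketch, assuming the per-id lists should follow the time order:
from pyspark.sql.functions import collect_list, sort_array, struct

# one row per id, with the (time, col) pairs collected and sorted by time
grouped = (foo_df
           .groupBy('id')
           .agg(sort_array(collect_list(struct('time', 'col'))).alias('pairs')))

# only ~10k small rows reach the driver
foo_dict = {row['id']: [p['col'] for p in row['pairs']] for row in grouped.collect()}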

AttributeError: 'numpy.int64' object has no attribute '_get_object_id'

I have a dataset in pyspark for which I create a row_num column, so my data looks like:
#data:
+-----------------+-----------------+-----+------------------+-------+
|F1_imputed |F2_imputed |label| features|row_num|
+-----------------+-----------------+-----+------------------+-------+
| -0.002353| 0.9762| 0|[-0.002353,0.9762]| 1|
| 0.1265| 0.1176| 0| [0.1265,0.1176]| 2|
| -0.08637| 0.06524| 0|[-0.08637,0.06524]| 3|
| -0.1428| 0.4705| 0| [-0.1428,0.4705]| 4|
| -0.1015| 0.6811| 0| [-0.1015,0.6811]| 5|
| -0.01146| 0.8273| 0| [-0.01146,0.8273]| 6|
| 0.0853| 0.2525| 0| [0.0853,0.2525]| 7|
| 0.2186| 0.2725| 0| [0.2186,0.2725]| 8|
| -0.145| 0.3592| 0| [-0.145,0.3592]| 9|
| -0.1176| 0.4225| 0| [-0.1176,0.4225]| 10|
+-----------------+-----------------+-----+------------------+-------+
I'm trying to filter out a random selection of rows using:
count = data.count()
sample = [np.random.choice(np.arange(count), replace=True, size=50)]
filtered = data.filter(data.row_num.isin(sample))
However the second line gives an error:
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
What is causing this? I use the same filtering code to split the rows by label (a binary column of ones and zeros), which does work, but reapplying the code now doesn't work for sampling.
Numpy data types don't interact well with Spark. You can convert them to Python data types using .tolist() before calling .isin:
sample = np.random.choice(np.arange(count), replace=True, size=50).tolist()
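For reference, the corrected snippet from the question would then look like this (note the original also wrapped the NumPy array in an extra list, which passes one array object to isin instead of the individual values):
import numpy as np

count = data.count()
# a plain Python list of ints, not a list containing a numpy array
sample = np.random.choice(np.arange(count), replace=True, size=50).tolist()
filtered = data.filter(data.row_num.isin(sample))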

Implementing an Auto-Increment column in a DataFrame

I'm trying to implement an auto-increment column in a DataFrame.
I already found a solution but I want to know if there's a better way to do this.
I'm using monotonically_increasing_id() function from pyspark.sql.functions.
The problem with this is that it starts at 0 and I want it to start at 1.
So, I did the following and it is working fine:
(F.monotonically_increasing_id()+1).alias("songplay_id")
dfLog.join(dfSong, (dfSong.artist_name == dfLog.artist) & (dfSong.title == dfLog.song))\
.select((F.monotonically_increasing_id()+1).alias("songplay_id"), \
dfLog.ts.alias("start_time"), dfLog.userId.alias("user_id"), \
dfLog.level, \
dfSong.song_id, \
dfSong.artist_id, \
dfLog.sessionId.alias("session_id"), \
dfLog.location, \
dfLog.userAgent.alias("user_agent"))
Is there a better way to implement what I'm trying to do?
I think it's too much work to implement a UDF just for that, or is it just me?
Thanks.
The sequence monotonically_increasing_id is not guaranteed to be consecutive, but it is guaranteed to be monotonically increasing. Each task of your job will be assigned a starting integer, from which it increments by 1 at every row, but you'll have gaps between the last id of one batch and the first id of another.
To verify this behavior, you can create a job containing two tasks by repartitioning a sample data frame:
import pandas as pd
import pyspark.sql.functions as psf
spark.createDataFrame(pd.DataFrame([[i] for i in range(10)], columns=['value'])) \
.repartition(2) \
.withColumn('id', psf.monotonically_increasing_id()) \
.show()
+-----+----------+
|value| id|
+-----+----------+
| 3| 0|
| 0| 1|
| 6| 2|
| 2| 3|
| 4| 4|
| 7|8589934592|
| 5|8589934593|
| 8|8589934594|
| 9|8589934595|
| 1|8589934596|
+-----+----------+
In order to make sure your index yields consecutive values, you can use a window function.
from pyspark.sql import Window
w = Window.orderBy('id')
spark.createDataFrame(pd.DataFrame([[i] for i in range(10)], columns=['value'])) \
.withColumn('id', psf.monotonically_increasing_id()) \
.withColumn('id2', psf.row_number().over(w)) \
.show()
+-----+---+---+
|value| id|id2|
+-----+---+---+
| 0| 0| 1|
| 1| 1| 2|
| 2| 2| 3|
| 3| 3| 4|
| 4| 4| 5|
| 5| 5| 6|
| 6| 6| 7|
| 7| 7| 8|
| 8| 8| 9|
| 9| 9| 10|
+-----+---+---+
Notes:
monotonically_increasing_id allows you to set an order on your rows as they are read; it starts at 0 for the first task and increases, but not necessarily in a sequential manner
row_number sequentially indexes the rows in an ordered window and starts at 1
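Applied to the join from the question, a sketch of the row_number variant might look like this (the orderBy column is an assumption; pick whatever ordering makes sense, e.g. the event timestamp, and note that a window without partitionBy pulls everything into a single partition to number it):
from pyspark.sql import Window
import pyspark.sql.functions as F

# assumption: number the rows in event-timestamp order
w = Window.orderBy("start_time")

songplays = (dfLog.join(dfSong, (dfSong.artist_name == dfLog.artist) & (dfSong.title == dfLog.song))
             .select(dfLog.ts.alias("start_time"), dfLog.userId.alias("user_id"),
                     dfLog.level, dfSong.song_id, dfSong.artist_id,
                     dfLog.sessionId.alias("session_id"), dfLog.location,
                     dfLog.userAgent.alias("user_agent"))
             # row_number already starts at 1, so no +1 is needed
             .withColumn("songplay_id", F.row_number().over(w)))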

PySpark: Add a new column with a tuple created from columns

Here I have a dataframe created as follows:
df = spark.createDataFrame([('a',5,'R','X'),('b',7,'G','S'),('c',8,'G','S')],
["Id","V1","V2","V3"])
It looks like
+---+---+---+---+
| Id| V1| V2| V3|
+---+---+---+---+
| a| 5| R| X|
| b| 7| G| S|
| c| 8| G| S|
+---+---+---+---+
I'm looking to add a column that is a tuple consisting of V1,V2,V3.
The result should look like
+---+---+---+---+-------+
| Id| V1| V2| V3|V_tuple|
+---+---+---+---+-------+
| a| 5| R| X|(5,R,X)|
| b| 7| G| S|(7,G,S)|
| c| 8| G| S|(8,G,S)|
+---+---+---+---+-------+
I've tried to use similar syntax as in Python but it didn't work:
df.withColumn("V_tuple",list(zip(df.V1,df.V2,df.V3)))
TypeError: zip argument #1 must support iteration.
Any help would be appreciated!
I'm coming from Scala but I do believe that there's a similar way in Python.
Using the sql.functions package methods:
If you want to get a StructType with these three columns, use the struct(cols: Column*): Column method like this:
from pyspark.sql.functions import struct
df.withColumn("V_tuple", struct(df.V1, df.V2, df.V3))
But if you want to get it as a String you can use the concat(exprs: Column*): Column method like this:
from pyspark.sql.functions import concat
df.withColumn("V_tuple", concat(df.V1, df.V2, df.V3))
With this second method you may have to cast the columns into Strings.
I'm not sure about the Python syntax, just edit the answer if there's a syntax error.
Hope this helps you. Best regards
Use struct:
from pyspark.sql.functions import struct
df.withColumn("V_tuple", struct(df.V1,df.V2,df.V3))

best way to generate per key auto increment numerals after sorting

I wanted to ask what's the best way to achieve per-key auto-increment numerals after sorting, e.g.:
raw file:
1,a,b,c,1,1
1,a,b,d,0,0
1,a,b,e,1,0
2,a,e,c,0,0
2,a,f,d,1,0
expected output (the last column is the position number after grouping on the first three fields and reverse-sorting on the last two values):
1,a,b,c,1,1,1
1,a,b,d,0,0,3
1,a,b,e,1,0,2
2,a,e,c,0,0,2
2,a,f,d,1,0,1
I am using a solution that uses groupByKey, but it is running into some issues (possibly a bug with pyspark/spark?), and I am wondering if there is a better way to achieve this.
My solution:
A = sc.textFile("train.csv")
.filter(lambda x:not isHeader(x))
.map(split)
.map(parse_train)
.filter(lambda x: not x is None)
B = A.map(lambda k:((k.first_field,k.second_field,k.first_field,k.third_field),(k[0:5])))
.groupByKey()
B.map(sort_n_set_position)
.flatMap(lambda line: line)
where sort_n_set_position iterates over each group's iterator, performs the sorting and adds the last column.
Since you have big keys (all 3 first values), I'll assume you will not have a ton of rows per key. Given this, I would just use groupByKey([numTasks]) and then use normal code to sort and add your index to each row on the resulting iterables.
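sort_n_set_position isn't shown in the question, so purely as an illustration of what that per-group code could look like (an assumed implementation: reverse-sort each group on its last two values and append a 1-based position):
def sort_n_set_position(rows):
    # rows is the iterable of value tuples for one key; sort descending on the
    # last two fields and append the 1-based position to each row
    ordered = sorted(rows, key=lambda r: (r[-2], r[-1]), reverse=True)
    return [tuple(r) + (i + 1,) for i, r in enumerate(ordered)]

# flatMapValues keeps the key and flattens each group's list of rows
positioned = B.flatMapValues(sort_n_set_position)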
A little bit different approach combining spark-csv, DataFrames and window functions. I assume that the header line is x1,x2,x3,x4,x5,x6 for brevity:
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber, col
df = (sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("train.csv"))
w = (Window()
.partitionBy(col("x1"), col("x2"), col("x3"))
.orderBy(col("x5").desc(), col("x6").desc()))
df_with_rn = df.select(col("*"), rowNumber().over(w).alias("x7"))
df_with_rn.show()
## +---+---+---+---+---+---+---+
## | x1| x2| x3| x4| x5| x6| x7|
## +---+---+---+---+---+---+---+
## | 2| a| e| c| 0| 0| 1|
## | 2| a| f| d| 1| 0| 1|
## | 1| a| b| c| 1| 1| 1|
## | 1| a| b| e| 1| 0| 2|
## | 1| a| b| d| 0| 0| 3|
## +---+---+---+---+---+---+---+
If you want a plain RDD as an output you can simply map as follows:
df_with_rn.map(lambda r: r.asDict())
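Note that rowNumber and the external spark-csv package date from the Spark 1.x era; on Spark 2.0+ the same idea reads roughly as follows (a sketch, with the same column-name assumptions as above):
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# the csv source is built in since Spark 2.0
df = spark.read.csv("train.csv", header=True, inferSchema=True)

w = (Window
     .partitionBy("x1", "x2", "x3")
     .orderBy(col("x5").desc(), col("x6").desc()))

df_with_rn = df.withColumn("x7", row_number().over(w))

# DataFrames no longer expose .map directly; go through .rdd for a plain RDD
rdd_out = df_with_rn.rdd.map(lambda r: r.asDict())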
