Joining with a lookup table in PySpark - python

I have 2 tables: Table 'A' and Table 'Lookup'
Table A:
ID Day
A 1
B 1
C 2
D 4
The lookup table has percentage values for each ID-Day combination.
Table Lookup:
ID 1 2 3 4
A 20 10 50 30
B 0 50 0 50
C 50 10 10 30
D 10 25 25 40
My expected output is to have an additional field in Table 'A' named 'Percent' with values filled in from the lookup table:
ID Day Percent
A 1 20
B 1 0
C 2 10
D 4 40
Since both the tables are large, I do not want to pivot any of the tables.

I have written code in scala. You can refer same for python.
scala> TableA.show()
+---+---+
| ID|Day|
+---+---+
| A| 1|
| B| 1|
| C| 2|
| D| 4|
+---+---+
scala> lookup.show()
+---+---+---+---+---+
| ID| 1| 2| 3| 4|
+---+---+---+---+---+
| A| 20| 10| 50| 30|
| B| 0| 50| 0| 50|
| C| 50| 10| 10| 30|
| D| 10| 25| 25| 40|
+---+---+---+---+---+
//UDF Functon to retrieve data from lookup table
val lookupUDF = (r:Row, s:String) => {
r.getAs(s).toString}
//Join over Key column "ID"
val joindf = TableA.join(lookup,"ID")
//final output DataFrame creation
val final_df = joindf.map(x => (x.getAs("ID").toString, x.getAs("Day").toString, lookupUDF(x,x.getAs("Day")))).toDF("ID","Day","Percentage")
final_df.show()
+---+---+----------+
| ID|Day|Percentage|
+---+---+----------+
| A| 1| 20|
| B| 1| 0|
| C| 2| 10|
| D| 4| 40|
+---+---+----------+

(Posting my answer a day after I posted the question)
I was able to solve this by converting the tables to a pandas dataframe.
from pyspark.sql.types import *
schema = StructType([StructField("id", StringType())\
,StructField("day", StringType())\
,StructField("1", IntegerType())\
,StructField("2", IntegerType())\
,StructField("3", IntegerType())\
,StructField("4", IntegerType())])
# Day field is String type
data = [['A', 1, 20, 10, 50, 30], ['B', 1, 0, 50, 0, 50], ['C', 2, 50, 10, 10, 30], ['D', 4, 10, 25, 25, 40]]
df = spark.createDataFrame(data,schema=schema)
df.show()
# After joining the 2 tables on "id", the tables would look like this:
+---+---+---+---+---+---+
| id|day| 1| 2| 3| 4|
+---+---+---+---+---+---+
| A| 1| 20| 10| 50| 30|
| B| 1| 0| 50| 0| 50|
| C| 2| 50| 10| 10| 30|
| D| 4| 10| 25| 25| 40|
+---+---+---+---+---+---+
# Converting to a pandas dataframe
pandas_df = df.toPandas()
id day 1 2 3 4
A 1 20 10 50 30
B 1 0 50 0 50
C 2 50 10 10 30
D 4 10 25 25 40
# UDF:
def udf(x):
return x[x['day']]
pandas_df['percent'] = pandas_df.apply(udf, axis=1)
# Converting back to a Spark DF:
spark_df = sqlContext.createDataFrame(pandas_df)
+---+---+---+---+---+---+---+
| id|day| 1| 2| 3| 4|new|
+---+---+---+---+---+---+---+
| A| 1| 20| 10| 50| 30| 20|
| B| 1| 0| 50| 0| 50| 0|
| C| 2| 50| 10| 10| 30| 10|
| D| 4| 10| 25| 25| 40| 40|
+---+---+---+---+---+---+---+
spark_df.select("id", "day", "percent").show()
+---+---+-------+
| id|day|percent|
+---+---+-------+
| A| 1| 20|
| B| 1| 0|
| C| 2| 10|
| D| 4| 40|
+---+---+-------+
I would appreciate if someone posts an answer in PySpark without the pandas-df conversion.

df = spark.createDataFrame([{'ID':'A','Day':1}
,{'ID':'B','Day':1}
,{'ID':'C','Day':2}
,{'ID':'D','Day':4}])
df1 = spark.createDataFrame([{'ID':'A','1':20,'2':10,'3':50,'4':30},
{'ID':'B','1':0,'2':50,'3':0,'4':50},
{'ID':'C','1':50,'2':10,'3':10,'4':30},
{'ID':'D','1':10,'2':25,'3':25,'4':40}
])
df1=df1.withColumn('1',col('1').cast('int')).withColumn('2',col('2').cast('int')).withColumn('3',col('3').cast('int')).withColumn('4',col('4').cast('int'))
df=df.withColumn('Day',col('Day').cast('int'))
df_final = df.join(df1,'ID')
df_final_rdd = df_final.rdd
print(df_final_rdd.collect())
def create_list(r,s):
s=str(s)
k = (r['ID'],r['Day'],r[s])
return k
l=[]
for element in df_final_rdd.collect():
l.append(create_list(element,element['Day']))
rdd = sc.parallelize(l)
df= spark.createDataFrame(rdd).toDF('ID','Day','Percent')

Related

How to filter dataframe using percentiles to filter out outliers?

Suppose I have a spark dataframe like this:
+------------+-----------+
|category |value |
+------------+-----------+
| a| 1|
| a| 2|
| b| 2|
| a| 3|
| b| 4|
| a| 4|
| b| 6|
| b| 8|
+------------+-----------+
I want to set values higher than 0.75 percentile to nan for each category.
That being;
a_values = [1,2,3,4] => a_values_filtered = [1,2,3,nan]
b_values = [2,4,6,8] => b_values_filtered = [2,3,6,nan]
So the expected output is:
+------------+-----------+
|category |value |
+------------+-----------+
| a| 1|
| a| 2|
| b| 2|
| a| 3|
| b| 4|
| a| nan|
| b| 6|
| b| nan|
+------------+-----------+
Any idea how to do it cleanly?
PS: I am new to spark
Use percent_rank function to get the percentiles, and then use when to assign values > 0.75 percent_rank to null.
from pyspark.sql import Window
from pyspark.sql.functions import percent_rank,when
w = Window.partitionBy(df.category).orderBy(df.value)
percentiles_df = df.withColumn('percentile',percent_rank().over(w))
result = percentiles_df.select(percentiles_df.category
,when(percentiles_df.percentile <= 0.75,percentiles_df.value).alias('value'))
result.show()
Here is another snippet similar to Prabhala's answer, I use percentile_approx UDF instead.
from pyspark.sql import Window
import pyspark.sql.functions as F
window = Window.partitionBy('category')
percentile = F.expr('percentile_approx(value, 0.75)')
tmp_df = df.withColumn('percentile_value', percentile.over(window))
result = tmp_df.select('category', when(tmp_df.percentile_value >= tmp_df.value, tmp_df.value).alias('value'))
result.show()
+--------+-----+
|category|value|
+--------+-----+
| b| 2|
| b| 4|
| b| 6|
| b| null|
| a| 1|
| a| 2|
| a| 3|
| a| null|
+--------+-----+

pyspark count not null values for pairs in two column within group

I have some data like this
A B C
1 Null 3
1 2 4
2 Null 6
2 2 Null
2 1 2
3 Null 4
and I want to groupby A and then calculat the number of rows that don't contain Null value. So, the result should be
A count
1 1
2 1
3 0
I don't think this will work..., does it?
df.groupby('A').agg(count('B','C'))
Personally, I would use an auxiliary column saying whether B or C is Null. Negative result in this solution and return 1 or 0. And use sum for this column.
from pyspark.sql.functions import sum, when
# ...
df.withColumn("isNotNull", when(df.B.isNull() | df.C.isNull(), 0).otherwise(1))\
.groupBy("A").agg(sum("isNotNull"))
Demo:
df.show()
# +---+----+----+
# | _1| _2| _3|
# +---+----+----+
# | 1|null| 3|
# | 1| 2| 4|
# | 2|null| 6|
# | 2| 2|null|
# | 2| 1| 2|
# | 3|null| 4|
# +---+----+----+
df.withColumn("isNotNull", when(df._2.isNull() | df._3.isNull(), 0).otherwise(1)).show()
# +---+----+----+---------+
# | _1| _2| _3|isNotNull|
# +---+----+----+---------+
# | 1|null| 3| 0|
# | 1| 2| 4| 1|
# | 2|null| 6| 0|
# | 2| 2|null| 0|
# | 2| 1| 2| 1|
# | 3|null| 4| 0|
# +---+----+----+---------+
df.withColumn("isNotNull", when(df._2.isNull() | df._3.isNull(), 0).otherwise(1))\
.groupBy("_1").agg(sum("isNotNull")).show()
# +---+--------------+
# | _1|sum(isNotNull)|
# +---+--------------+
# | 1| 1|
# | 3| 0|
# | 2| 1|
# +---+--------------+
You can drop rows that contain null values and then groupby + count:
df.select('A').dropDuplicates().join(
df.dropna(how='any').groupby('A').count(), on=['A'], how='left'
).show()
+---+-----+
| A|count|
+---+-----+
| 1| 1|
| 3| null|
| 2| 1|
+---+-----+
If you don't want to do the join, create another column to indicate whether there is null in columns B or C:
import pyspark.sql.functions as f
df.selectExpr('*',
'case when B is not null and C is not null then 1 else 0 end as D'
).groupby('A').agg(f.sum('D').alias('count')).show()
+---+-----+
| A|count|
+---+-----+
| 1| 1|
| 3| 0|
| 2| 1|
+---+-----+

Pyspark: how to duplicate a row n time in dataframe?

I've got a dataframe like this and I want to duplicate the row n times if the column n is bigger than one:
A B n
1 2 1
2 9 1
3 8 2
4 1 1
5 3 3
And transform like this:
A B n
1 2 1
2 9 1
3 8 2
3 8 2
4 1 1
5 3 3
5 3 3
5 3 3
I think I should use explode, but I don't understand how it works...
Thanks
With Spark 2.4.0+, this is easier with builtin functions: array_repeat + explode:
from pyspark.sql.functions import expr
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)], ["A", "B", "n"])
new_df = df.withColumn('n', expr('explode(array_repeat(n,int(n)))'))
>>> new_df.show()
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
The explode function returns a new row for each element in the given array or map.
One way to exploit this function is to use a udf to create a list of size n for each row. Then explode the resulting array.
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, IntegerType
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)] ,["A", "B", "n"])
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
+---+---+---+
# use udf function to transform the n value to n times
n_to_array = udf(lambda n : [n] * n, ArrayType(IntegerType()))
df2 = df.withColumn('n', n_to_array(df.n))
+---+---+---------+
| A| B| n|
+---+---+---------+
| 1| 2| [1]|
| 2| 9| [1]|
| 3| 8| [2, 2]|
| 4| 1| [1]|
| 5| 3|[3, 3, 3]|
+---+---+---------+
# now use explode
df2.withColumn('n', explode(df2.n)).show()
+---+---+---+
| A | B | n |
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
I think the udf answer by #Ahmed is the best way to go, but here is an alternative method, that may be as good or better for small n:
First, collect the maximum value of n over the whole DataFrame:
max_n = df.select(f.max('n').alias('max_n')).first()['max_n']
print(max_n)
#3
Now create an array for each row of length max_n, containing numbers in range(max_n). The output of this intermediate step will result in a DataFrame like:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)])).show()
#+---+---+---+---------+
#| A| B| n| n_array|
#+---+---+---+---------+
#| 1| 2| 1|[0, 1, 2]|
#| 2| 9| 1|[0, 1, 2]|
#| 3| 8| 2|[0, 1, 2]|
#| 4| 1| 1|[0, 1, 2]|
#| 5| 3| 3|[0, 1, 2]|
#+---+---+---+---------+
Now we explode the n_array column, and filter to keep only the values in the array that are less than n. This will ensure that we have n copies of each row. Finally we drop the exploded column to get the end result:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)]))\
.select('A', 'B', 'n', f.explode('n_array').alias('col'))\
.where(f.col('col') < f.col('n'))\
.drop('col')\
.show()
#+---+---+---+
#| A| B| n|
#+---+---+---+
#| 1| 2| 1|
#| 2| 9| 1|
#| 3| 8| 2|
#| 3| 8| 2|
#| 4| 1| 1|
#| 5| 3| 3|
#| 5| 3| 3|
#| 5| 3| 3|
#+---+---+---+
However, we are creating a max_n length array for each row- as opposed to just an n length array in the udf solution. It's not immediately clear to me how this will scale vs. udf for large max_n, but I suspect the udf will win out.

cumulative sum function in pyspark grouping on multiple columns based on condition

I need to create a event_id basically a counter grouping on multiple columns(v_id,d_id,ip,l_id) and incrementing it when delta > 40 to get
the output like this
v_id d_id ip l_id delta event_id last_event_flag
1 20 30 40 1 1 N
1 20 30 40 2 1 N
1 20 30 40 3 1 N
1 20 30 40 4 1 Y
1 20 20 40 1 1 Y
1 30 30 40 2 1 N
1 30 30 40 3 1 N
1 30 30 40 4 1 N
1 30 30 40 5 1 Y
i was able to achieve this using pandas data frame
df['event_id'] = (df.delta >=40.0).groupby([df.l_id,df.v_id,d_id,ip]).cumsum() + 1
df.append(df['event_id'], ignore_index=True
but seeing memory error when executing it on a larger data .
How to do similar thing in pyspark.
In pyspark you can do it using a window function:
First let's create the dataframe. Note that you can also directly load it as a dataframe from a csv:
df = spark.createDataFrame(
sc.parallelize(
[[1,20,30,40,1,1],
[1,20,30,40,2,1],
[1,20,30,40,3,1],
[1,20,30,40,4,1],
[1,20,30,40,45,2],
[1,20,30,40,1,2],
[1,30,30,40,2,1],
[1,30,30,40,3,1],
[1,30,30,40,4,1],
[1,30,30,40,5,1]]
),
["v_id","d_id","ip","l_id","delta","event_id"]
)
You have an implicit ordering in your table, we need to create a monotonically increasing id so that we don't end up shuffling it around:
import pyspark.sql.functions as psf
df = df.withColumn(
"rn",
psf.monotonically_increasing_id()
)
+----+----+---+----+-----+--------+----------+
|v_id|d_id| ip|l_id|delta|event_id| rn|
+----+----+---+----+-----+--------+----------+
| 1| 20| 30| 40| 1| 1| 0|
| 1| 20| 30| 40| 2| 1| 1|
| 1| 20| 30| 40| 3| 1| 2|
| 1| 20| 30| 40| 4| 1| 3|
| 1| 20| 30| 40| 45| 2| 4|
| 1| 20| 30| 40| 1| 2|8589934592|
| 1| 30| 30| 40| 2| 1|8589934593|
| 1| 30| 30| 40| 3| 1|8589934594|
| 1| 30| 30| 40| 4| 1|8589934595|
| 1| 30| 30| 40| 5| 1|8589934596|
+----+----+---+----+-----+--------+----------+
Now to compute event_id and last_event_flag:
from pyspark.sql import Window
w1 = Window.partitionBy("v_id", "d_id", "l_id", "ip").orderBy("rn")
w2 = Window.partitionBy("v_id", "d_id", "l_id", "ip").orderBy(psf.desc("rn"))
df.withColumn(
"event_id",
psf.sum((df.delta >= 40).cast("int")).over(w1) + 1
).withColumn(
"last_event_flag",
psf.row_number().over(w2) == 1
).drop("rn")
+----+----+---+----+-----+--------+---------------+
|v_id|d_id| ip|l_id|delta|event_id|last_event_flag|
+----+----+---+----+-----+--------+---------------+
| 1| 20| 30| 40| 1| 1| false|
| 1| 20| 30| 40| 2| 1| false|
| 1| 20| 30| 40| 3| 1| false|
| 1| 20| 30| 40| 4| 1| false|
| 1| 20| 30| 40| 45| 2| false|
| 1| 20| 30| 40| 1| 2| true|
| 1| 30| 30| 40| 2| 1| false|
| 1| 30| 30| 40| 3| 1| false|
| 1| 30| 30| 40| 4| 1| false|
| 1| 30| 30| 40| 5| 1| true|
+----+----+---+----+-----+--------+---------------+
Perhaps you should calculate df = df[df.delta>=40] before running the groupby- I'm not sure if that matters.
Also you can look into chunksize to perform calculations based on chunks of the csv for memory efficiency. So you might break up the data into chunks of 10000 lines and then run the calculations to avoid memory error.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
How to read a 6 GB csv file with pandas

Window timeseries with step in Spark/Scala

I have this input :
timestamp,user
1,A
2,B
5,C
9,E
12,F
The result wanted is :
timestampRange,userList
1 to 2,[A,B]
3 to 4,[] Or null
5 to 6,[C]
7 to 8,[] Or null
9 to 10,[E]
11 to 12,[F]
I tried using Window, but the problem, it doesn't include the empty timestamp range.
Any hints would be helpful.
Don't know if widowing function will cover the gaps between ranges, but you can take the following approach :
Define a dataframe, df_ranges:
val ranges = List((1,2), (3,4), (5,6), (7,8), (9,10))
val df_ranges = sc.parallelize(ranges).toDF("start", "end")
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 3| 4|
| 5| 6|
| 7| 8|
| 9| 10|
+-----+---+
Data with the timestamp column, df_data :
val data = List((1,"A"), (2,"B"), (5,"C"), (9,"E"))
val df_data = sc.parallelize(data).toDF("timestamp", "user")
+---------+----+
|timestamp|user|
+---------+----+
| 1| A|
| 2| B|
| 5| C|
| 9| E|
+---------+----+
Join the two dataframe on the start, end, timestamp columns:
df_ranges.join(df_data, df_ranges.col("start").equalTo(df_data.col("timestamp")).or(df_ranges.col("end").equalTo(df_data.col("timestamp"))), "left")
+-----+---+---------+----+
|start|end|timestamp|user|
+-----+---+---------+----+
| 1| 2| 1| A|
| 1| 2| 2| B|
| 5| 6| 5| C|
| 9| 10| 9| E|
| 3| 4| null|null|
| 7| 8| null|null|
+-----+---+---------+----+
Now do a simple aggregation with collect_list function :
res4.groupBy("start", "end").agg(collect_list("user")).orderBy("start")
+-----+---+------------------+
|start|end|collect_list(user)|
+-----+---+------------------+
| 1| 2| [A, B]|
| 3| 4| []|
| 5| 6| [C]|
| 7| 8| []|
| 9| 10| [E]|
+-----+---+------------------+

Categories