For example, I have a DataFrame with categorical features in the name column:
from pyspark.sql import SparkSession

spark = (SparkSession.builder.master("local").appName("example")
         .config("spark.some.config.option", "some-value").getOrCreate())
features = [(['a', 'b', 'c'], 1),
            (['a', 'c'], 2),
            (['d'], 3),
            (['b', 'c'], 4),
            (['a', 'b', 'd'], 5)]
df = spark.createDataFrame(features, ['name','id'])
df.show()
Out:
+---------+----+
| name| id |
+---------+----+
|[a, b, c]| 1|
| [a, c]| 2|
| [d]| 3|
| [b, c]| 4|
|[a, b, d]| 5|
+---------+----+
What I want to get:
+--------+--------+--------+--------+----+
| name_a | name_b | name_c | name_d | id |
+--------+--------+--------+--------+----+
| 1 | 1 | 1 | 0 | 1 |
+--------+--------+--------+--------+----+
| 1 | 0 | 1 | 0 | 2 |
+--------+--------+--------+--------+----+
| 0 | 0 | 0 | 1 | 3 |
+--------+--------+--------+--------+----+
| 0 | 1 | 1 | 0 | 4 |
+--------+--------+--------+--------+----+
| 1 | 1 | 0 | 1 | 5 |
+--------+--------+--------+--------+----+
I found the same question, but there was nothing helpful in it.
I tried to use VectorIndexer from pyspark.ml, but I ran into problems transforming the name field to a vector type.
from pyspark.ml.feature import VectorIndexer
indexer = VectorIndexer(inputCol="name", outputCol="indexed", maxCategories=5)
indexerModel = indexer.fit(df)
I get the following error:
Column name must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually ArrayType
I found a solution here, but it looks overcomplicated. I'm also not sure whether this can be done with VectorIndexer alone.
If you want to use the output with Spark ML, it is best to use CountVectorizer:
from pyspark.ml.feature import CountVectorizer
# Add binary=True if needed
df_enc = (CountVectorizer(inputCol="name", outputCol="name_vector")
          .fit(df)
          .transform(df))
df_enc.show(truncate=False)
+---------+---+-------------------------+
|name |id |name_vector |
+---------+---+-------------------------+
|[a, b, c]|1 |(4,[0,1,2],[1.0,1.0,1.0])|
|[a, c] |2 |(4,[0,1],[1.0,1.0]) |
|[d] |3 |(4,[3],[1.0]) |
|[b, c] |4 |(4,[1,2],[1.0,1.0]) |
|[a, b, d]|5 |(4,[0,2,3],[1.0,1.0,1.0])|
+---------+---+-------------------------+
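If you also want the separate name_* columns rather than a single vector column, you can expand the vector afterwards. A sketch, assuming Spark 3.0+ (for pyspark.ml.functions.vector_to_array); note that the CountVectorizer vocabulary is ordered by term frequency, not alphabetically, so take the column names from model.vocabulary:
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

# binary=True gives 0/1 indicators instead of term counts
model = CountVectorizer(inputCol="name", outputCol="name_vector", binary=True).fit(df)
df_cols = (model.transform(df)
           .withColumn("name_array", vector_to_array("name_vector"))
           .select("id", *[F.col("name_array")[i].cast("integer").alias("name_{}".format(term))
                           for i, term in enumerate(model.vocabulary)]))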
Otherwise, collect the distinct values:
from pyspark.sql.functions import array_contains, col, explode
names = [x[0] for x in
         df.select(explode("name").alias("name")).distinct().orderBy("name").collect()]
and select the columns with array_contains:
df_sep = df.select("*", *[
    array_contains("name", name).cast("integer").alias("name_{}".format(name))
    for name in names])
df_sep.show()
+---------+---+------+------+------+------+
| name| id|name_a|name_b|name_c|name_d|
+---------+---+------+------+------+------+
|[a, b, c]| 1| 1| 1| 1| 0|
| [a, c]| 2| 1| 0| 1| 0|
| [d]| 3| 0| 0| 0| 1|
| [b, c]| 4| 0| 1| 1| 0|
|[a, b, d]| 5| 1| 1| 0| 1|
+---------+---+------+------+------+------+
With explode from pyspark.sql.functions and pivot:
from pyspark.sql import functions as F
features = [(['a', 'b', 'c'], 1),
            (['a', 'c'], 2),
            (['d'], 3),
            (['b', 'c'], 4),
            (['a', 'b', 'd'], 5)]
df = spark.createDataFrame(features, ['name','id'])
df.show()
+---------+---+
| name| id|
+---------+---+
|[a, b, c]| 1|
| [a, c]| 2|
| [d]| 3|
| [b, c]| 4|
|[a, b, d]| 5|
+---------+---+
df = df.withColumn('exploded', F.explode('name'))
df.drop('name').groupby('id').pivot('exploded').count().show()
+---+----+----+----+----+
| id| a| b| c| d|
+---+----+----+----+----+
| 5| 1| 1|null| 1|
| 1| 1| 1| 1|null|
| 3|null|null|null| 1|
| 2| 1|null| 1|null|
| 4|null| 1| 1|null|
+---+----+----+----+----+
Sort by id and fill the nulls with 0:
df.drop('name').groupby('id').pivot('exploded').count().na.fill(0).sort(F.col('id').asc()).show()
+---+---+---+---+---+
| id| a| b| c| d|
+---+---+---+---+---+
| 1| 1| 1| 1| 0|
| 2| 1| 0| 1| 0|
| 3| 0| 0| 0| 1|
| 4| 0| 1| 1| 0|
| 5| 1| 1| 0| 1|
+---+---+---+---+---+
explode returns a new row for each element in the given array or map. You can then use pivot to "transpose" the new column.
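As a side note, pivot triggers an extra job to determine the distinct values of the pivot column unless you pass them explicitly. If the categories are known up front, a sketch of the same line with the values supplied:
values = ['a', 'b', 'c', 'd']  # assumed to be known in advance
df.drop('name').groupby('id').pivot('exploded', values).count().na.fill(0).sort(F.col('id').asc()).show()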
I have a PySpark DataFrame like this:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| 1| 3|
| 2| NaN| 4|
| 3| 3| 5|
+---+----+----+
I would like to sum col1 and col2 so that the result looks like this:
+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
| 1| 1| 3| 4|
| 2| NaN| 4| 4|
| 3| 3| 5| 8|
+---+----+----+---+
Here's what I have tried:
import pandas as pd
import pyspark.sql.functions as F

test = pd.DataFrame({
    'id': [1, 2, 3],
    'col1': [1, None, 3],
    'col2': [3, 4, 5]
})
test = spark.createDataFrame(test)
test.withColumn('sum', F.col('col1') + F.col('col2')).show()
This code returns:
+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
| 1| 1| 3| 4|
| 2| NaN| 4|NaN| # <-- I want a 4 here, not this NaN
| 3| 3| 5| 8|
+---+----+----+---+
Can anyone help me with this?
Use F.nanvl to replace NaN with a given value (0 here):
import pyspark.sql.functions as F
result = test.withColumn('sum', F.nanvl(F.col('col1'), F.lit(0)) + F.col('col2'))
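Note that NaN and null are different things in Spark, so coalesce alone would not catch the NaN. If col1 could contain both, a sketch that handles the two cases together:
# coalesce handles null, nanvl handles NaN (a sketch; adjust the default as needed)
result = test.withColumn(
    'sum',
    F.nanvl(F.coalesce(F.col('col1'), F.lit(0.0)), F.lit(0.0)) + F.col('col2'))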
If, as asked in the comments, you want the sum to stay NaN only when both columns are NaN:
result = test.withColumn('sum',
F.when(
F.isnan(F.col('col1')) & F.isnan(F.col('col2')),
F.lit(float('nan'))
).otherwise(
F.nanvl(F.col('col1'), F.lit(0)) + F.nanvl(F.col('col2'), F.lit(0))
)
)
I need to add the distinct count of a column to each row in a PySpark DataFrame.
Example:
If the original dataframe is this:
+----+----+
|col1|col2|
+----+----+
|abc | 1|
|xyz | 1|
|dgc | 2|
|ydh | 3|
|ujd | 1|
|ujx | 3|
+----+----+
Then I want something like this:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|abc | 1| 3|
|xyz | 1| 3|
|dgc | 2| 3|
|ydh | 3| 3|
|ujd | 1| 3|
|ujx | 3| 3|
+----+----+----+
I tried df.withColumn('total_count', f.countDistinct('col2')) but it gives an error.
You can count the distinct elements in the column and create a new column with that value:
import pyspark.sql.functions as f

distincts = df.dropDuplicates(["col2"]).count()
df = df.withColumn("col3", f.lit(distincts))
Cross join with the distinct count, as below:
from pyspark.sql import functions as F

df2 = df.crossJoin(df.select(F.countDistinct('col2').alias('col3')))
df2.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc| 1| 3|
| xyz| 1| 3|
| dgc| 2| 3|
| ydh| 3| 3|
| ujd| 1| 3|
| ujx| 3| 3|
+----+----+----+
You can use Window, collect_set and size:
from pyspark.sql import functions as F, Window
df = spark.createDataFrame([("abc", 1), ("xyz", 1), ("dgc", 2), ("ydh", 3), ("ujd", 1), ("ujx", 3)], ['col1', 'col2'])
window = Window.orderBy("col2").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("col3", F.size(F.collect_set(F.col("col2")).over(window))).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc| 1| 3|
| xyz| 1| 3|
| dgc| 2| 3|
| ydh| 3| 3|
| ujd| 1| 3|
| ujx| 3| 3|
+----+----+----+
Suppose I have a Spark DataFrame like this:
+--------+-----+
|category|value|
+--------+-----+
|       a|    1|
|       a|    2|
|       b|    2|
|       a|    3|
|       b|    4|
|       a|    4|
|       b|    6|
|       b|    8|
+--------+-----+
I want to set values higher than the 0.75 percentile to nan for each category.
That is:
a_values = [1,2,3,4] => a_values_filtered = [1,2,3,nan]
b_values = [2,4,6,8] => b_values_filtered = [2,4,6,nan]
So the expected output is:
+--------+-----+
|category|value|
+--------+-----+
|       a|    1|
|       a|    2|
|       b|    2|
|       a|    3|
|       b|    4|
|       a|  nan|
|       b|    6|
|       b|  nan|
+--------+-----+
Any idea how to do it cleanly?
PS: I am new to spark
Use the percent_rank function to get the percentiles, then use when to set values with a percent_rank above 0.75 to null.
from pyspark.sql import Window
from pyspark.sql.functions import percent_rank,when
w = Window.partitionBy(df.category).orderBy(df.value)
percentiles_df = df.withColumn('percentile', percent_rank().over(w))
result = percentiles_df.select(
    percentiles_df.category,
    when(percentiles_df.percentile <= 0.75, percentiles_df.value).alias('value'))
result.show()
Here is another snippet, similar to Prabhala's answer; it uses the percentile_approx SQL function (via expr) instead.
from pyspark.sql import Window
import pyspark.sql.functions as F
window = Window.partitionBy('category')
percentile = F.expr('percentile_approx(value, 0.75)')
tmp_df = df.withColumn('percentile_value', percentile.over(window))
result = tmp_df.select('category', F.when(tmp_df.percentile_value >= tmp_df.value, tmp_df.value).alias('value'))
result.show()
+--------+-----+
|category|value|
+--------+-----+
| b| 2|
| b| 4|
| b| 6|
| b| null|
| a| 1|
| a| 2|
| a| 3|
| a| null|
+--------+-----+
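If you are on Spark 3.1 or later, percentile_approx is also available directly in pyspark.sql.functions, so the expr string is not needed; a sketch along the same lines:
from pyspark.sql import functions as F, Window

w = Window.partitionBy('category')
result = (df.withColumn('p75', F.percentile_approx('value', 0.75).over(w))
            .withColumn('value', F.when(F.col('value') <= F.col('p75'), F.col('value')))
            .drop('p75'))
result.show()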
I have 2 tables: Table 'A' and Table 'Lookup'
Table A:
ID Day
A 1
B 1
C 2
D 4
The lookup table has percentage values for each ID-Day combination.
Table Lookup:
ID 1 2 3 4
A 20 10 50 30
B 0 50 0 50
C 50 10 10 30
D 10 25 25 40
My expected output is to have an additional field in Table 'A' named 'Percent' with values filled in from the lookup table:
ID Day Percent
A 1 20
B 1 0
C 2 10
D 4 40
Since both the tables are large, I do not want to pivot any of the tables.
I have written the code in Scala; you can adapt the same approach for Python.
scala> TableA.show()
+---+---+
| ID|Day|
+---+---+
| A| 1|
| B| 1|
| C| 2|
| D| 4|
+---+---+
scala> lookup.show()
+---+---+---+---+---+
| ID| 1| 2| 3| 4|
+---+---+---+---+---+
| A| 20| 10| 50| 30|
| B| 0| 50| 0| 50|
| C| 50| 10| 10| 30|
| D| 10| 25| 25| 40|
+---+---+---+---+---+
import org.apache.spark.sql.Row
import spark.implicits._  // needed for map/toDF on the DataFrame

// Helper function to look up the value of the column named s in a Row
val lookupUDF = (r: Row, s: String) => r.getAs[Any](s).toString

// Join over the key column "ID"
val joindf = TableA.join(lookup, "ID")

// Final output DataFrame: for each row, pick the column whose name matches Day
val final_df = joindf.map(x => (x.getAs[Any]("ID").toString, x.getAs[Any]("Day").toString, lookupUDF(x, x.getAs[Any]("Day").toString))).toDF("ID", "Day", "Percentage")
final_df.show()
+---+---+----------+
| ID|Day|Percentage|
+---+---+----------+
| A| 1| 20|
| B| 1| 0|
| C| 2| 10|
| D| 4| 40|
+---+---+----------+
(Posting my answer a day after I posted the question)
I was able to solve this by converting the tables to a pandas dataframe.
from pyspark.sql.types import *

# Schema of the joined table; the day field is kept as a string so it can be
# used to index the lookup columns ("1".."4") later on
schema = StructType([StructField("id", StringType()),
                     StructField("day", StringType()),
                     StructField("1", IntegerType()),
                     StructField("2", IntegerType()),
                     StructField("3", IntegerType()),
                     StructField("4", IntegerType())])

data = [['A', '1', 20, 10, 50, 30], ['B', '1', 0, 50, 0, 50],
        ['C', '2', 50, 10, 10, 30], ['D', '4', 10, 25, 25, 40]]
df = spark.createDataFrame(data, schema=schema)
df.show()
# This represents the two tables already joined on "id":
+---+---+---+---+---+---+
| id|day| 1| 2| 3| 4|
+---+---+---+---+---+---+
| A| 1| 20| 10| 50| 30|
| B| 1| 0| 50| 0| 50|
| C| 2| 50| 10| 10| 30|
| D| 4| 10| 25| 25| 40|
+---+---+---+---+---+---+
# Converting to a pandas dataframe
pandas_df = df.toPandas()
id day 1 2 3 4
A 1 20 10 50 30
B 1 0 50 0 50
C 2 50 10 10 30
D 4 10 25 25 40
# Row-wise lookup with pandas apply: pick the column whose name matches the day value
def lookup_percent(x):
    return x[x['day']]

pandas_df['percent'] = pandas_df.apply(lookup_percent, axis=1)
# Converting back to a Spark DataFrame:
spark_df = spark.createDataFrame(pandas_df)
+---+---+---+---+---+---+-------+
| id|day|  1|  2|  3|  4|percent|
+---+---+---+---+---+---+-------+
|  A|  1| 20| 10| 50| 30|     20|
|  B|  1|  0| 50|  0| 50|      0|
|  C|  2| 50| 10| 10| 30|     10|
|  D|  4| 10| 25| 25| 40|     40|
+---+---+---+---+---+---+-------+
spark_df.select("id", "day", "percent").show()
+---+---+-------+
| id|day|percent|
+---+---+-------+
| A| 1| 20|
| B| 1| 0|
| C| 2| 10|
| D| 4| 40|
+---+---+-------+
I would appreciate it if someone could post an answer in PySpark without the pandas conversion.
from pyspark.sql.functions import col

df = spark.createDataFrame([{'ID': 'A', 'Day': 1},
                            {'ID': 'B', 'Day': 1},
                            {'ID': 'C', 'Day': 2},
                            {'ID': 'D', 'Day': 4}])

df1 = spark.createDataFrame([{'ID': 'A', '1': 20, '2': 10, '3': 50, '4': 30},
                             {'ID': 'B', '1': 0, '2': 50, '3': 0, '4': 50},
                             {'ID': 'C', '1': 50, '2': 10, '3': 10, '4': 30},
                             {'ID': 'D', '1': 10, '2': 25, '3': 25, '4': 40}])

# Cast the lookup columns and Day to int
df1 = (df1.withColumn('1', col('1').cast('int'))
          .withColumn('2', col('2').cast('int'))
          .withColumn('3', col('3').cast('int'))
          .withColumn('4', col('4').cast('int')))
df = df.withColumn('Day', col('Day').cast('int'))

# Join on ID and collect the rows to the driver
df_final = df.join(df1, 'ID')
df_final_rdd = df_final.rdd
print(df_final_rdd.collect())

# For each row, pick the value of the column whose name matches Day
def create_list(r, s):
    s = str(s)
    return (r['ID'], r['Day'], r[s])

l = []
for element in df_final_rdd.collect():
    l.append(create_list(element, element['Day']))

rdd = spark.sparkContext.parallelize(l)
df = spark.createDataFrame(rdd).toDF('ID', 'Day', 'Percent')
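For reference, the lookup can also be done without collecting to the driver, by building a chained when expression over the day columns of the joined DataFrame. A sketch, reusing df_final from above:
from functools import reduce
from pyspark.sql import functions as F

day_cols = ['1', '2', '3', '4']
# Equivalent to CASE WHEN Day = 1 THEN `1` WHEN Day = 2 THEN `2` ... END
percent = reduce(
    lambda acc, c: acc.when(F.col('Day') == int(c), F.col(c)),
    day_cols[1:],
    F.when(F.col('Day') == int(day_cols[0]), F.col(day_cols[0])))
df_result = df_final.select('ID', 'Day', percent.alias('Percent'))
df_result.show()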
I have some data like this
A B C
1 Null 3
1 2 4
2 Null 6
2 2 Null
2 1 2
3 Null 4
and I want to group by A and then calculate the number of rows that don't contain a Null value. So the result should be:
A count
1 1
2 1
3 0
I don't think this will work, does it?
df.groupby('A').agg(count('B','C'))
Personally, I would use an auxiliary column that indicates whether B or C is Null: it is 0 when either is Null and 1 otherwise. Then sum that column.
from pyspark.sql.functions import sum, when
# ...
df.withColumn("isNotNull", when(df.B.isNull() | df.C.isNull(), 0).otherwise(1))\
.groupBy("A").agg(sum("isNotNull"))
Demo:
df.show()
# +---+----+----+
# | _1| _2| _3|
# +---+----+----+
# | 1|null| 3|
# | 1| 2| 4|
# | 2|null| 6|
# | 2| 2|null|
# | 2| 1| 2|
# | 3|null| 4|
# +---+----+----+
df.withColumn("isNotNull", when(df._2.isNull() | df._3.isNull(), 0).otherwise(1)).show()
# +---+----+----+---------+
# | _1| _2| _3|isNotNull|
# +---+----+----+---------+
# | 1|null| 3| 0|
# | 1| 2| 4| 1|
# | 2|null| 6| 0|
# | 2| 2|null| 0|
# | 2| 1| 2| 1|
# | 3|null| 4| 0|
# +---+----+----+---------+
df.withColumn("isNotNull", when(df._2.isNull() | df._3.isNull(), 0).otherwise(1))\
.groupBy("_1").agg(sum("isNotNull")).show()
# +---+--------------+
# | _1|sum(isNotNull)|
# +---+--------------+
# | 1| 1|
# | 3| 0|
# | 2| 1|
# +---+--------------+
You can drop the rows that contain null values, group by A and count, and then left-join the counts back to the distinct values of A (otherwise groups where every row has a null, like A = 3, would disappear). Add .na.fill(0) if you want 0 instead of null for those groups:
df.select('A').dropDuplicates().join(
df.dropna(how='any').groupby('A').count(), on=['A'], how='left'
).show()
+---+-----+
| A|count|
+---+-----+
| 1| 1|
| 3| null|
| 2| 1|
+---+-----+
If you don't want to do the join, create another column indicating whether there is a null in column B or C:
import pyspark.sql.functions as f
df.selectExpr('*',
'case when B is not null and C is not null then 1 else 0 end as D'
).groupby('A').agg(f.sum('D').alias('count')).show()
+---+-----+
| A|count|
+---+-----+
| 1| 1|
| 3| 0|
| 2| 1|
+---+-----+
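The same aggregation expressed with when instead of a SQL expression string, in case you prefer staying in the DataFrame API, a sketch:
import pyspark.sql.functions as f

df.groupby('A').agg(
    f.sum(f.when(f.col('B').isNotNull() & f.col('C').isNotNull(), 1).otherwise(0)).alias('count')
).show()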