PySpark drop-dupes based on a column condition

Still new to Spark and I'm trying to do this final transformation as cleanly and efficiently as possible.
Say I have a dataframe that looks like the following
+------+-----+
| ID   | Hit |
+------+-----+
| 123  | 0   |
| 456  | 1   |
| 789  | 0   |
| 123  | 1   |
| 123  | 0   |
| 789  | 1   |
| 1234 | 0   |
| 1234 | 0   |
+------+-----+
I'm trying to end up with a new dataframe (or two, depending on what's more efficient) where, if an ID has any row with a 1 in Hit, all of that ID's rows with a 0 in Hit are removed, and IDs that only ever have 0s are reduced to a single distinct row per ID.
Here's one of the methods I tried, but I'm not sure if it is:
1. The most efficient way possible
2. The cleanest way possible
dfhits = df.filter(df.Hit == 1)
dfnonhits = df.filter(df.Hit == 0)
# isin needs plain values, so collect the IDs that have a hit first
hit_ids = [row.ID for row in dfhits.select('ID').distinct().collect()]
dfnonhitsdistinct = dfnonhits.filter(~dfnonhits['ID'].isin(hit_ids)).dropDuplicates()
The end dataset would look like the following:
+------+-----+
| ID   | Hit |
+------+-----+
| 456  | 1   |
| 123  | 1   |
| 789  | 1   |
| 1234 | 0   |
+------+-----+

# Creating the Dataframe.
from pyspark.sql.functions import col
df = sqlContext.createDataFrame([(123,0),(456,1),(789,0),(123,1),(123,0),(789,1),(500,0),(500,0)],
['ID','Hit'])
df.show()
+---+---+
| ID|Hit|
+---+---+
|123| 0|
|456| 1|
|789| 0|
|123| 1|
|123| 0|
|789| 1|
|500| 0|
|500| 0|
+---+---+
The idea is to find the total of Hit per ID; if it is greater than 0, there is at least one 1 present in Hit for that ID. When this condition is true, we remove all of that ID's rows with a Hit value of 0.
# Registering the dataframe as a temporary view
# (registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is the newer equivalent).
df.registerTempTable('table_view')
df = sqlContext.sql(
    'select ID, Hit, sum(Hit) over (partition by ID) as sum_Hit from table_view'
)
df.show()
+---+---+-------+
| ID|Hit|sum_Hit|
+---+---+-------+
|789| 0| 1|
|789| 1| 1|
|500| 0| 0|
|500| 0| 0|
|123| 0| 1|
|123| 1| 1|
|123| 0| 1|
|456| 1| 1|
+---+---+-------+
df = df.filter(~((col('Hit')==0) & (col('sum_Hit')>0))).drop('sum_Hit').dropDuplicates()
df.show()
+---+---+
| ID|Hit|
+---+---+
|789| 1|
|500| 0|
|123| 1|
|456| 1|
+---+---+
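For reference, the same logic can be expressed with the DataFrame API instead of SQL. A minimal sketch, assuming we start again from the original df built at the top of this answer:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('ID')

result = (df
    .withColumn('sum_Hit', F.sum('Hit').over(w))               # total Hits per ID
    .filter(~((F.col('Hit') == 0) & (F.col('sum_Hit') > 0)))   # drop 0-rows for IDs that also have a 1
    .drop('sum_Hit')
    .dropDuplicates())
result.show()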

Related

Groupby and create a new column in PySpark dataframe

I have a pyspark dataframe like this,
+-----+---+
| id_ | p |
+-----+---+
| 1   | A |
| 1   | B |
| 1   | B |
| 1   | A |
| 1   | A |
| 1   | B |
| 2   | C |
| 2   | C |
| 2   | C |
| 2   | A |
| 2   | A |
| 2   | C |
+-----+---+
I want to create another column for each group of id_. Currently the column is made in pandas with this code:
sample.groupby(by=['id_'], group_keys=False).apply(lambda grp: grp['p'].ne(grp['p'].shift()).cumsum())
How can I do this with a PySpark dataframe?
Currently I am doing this with the help of a pandas UDF, which runs very slowly.
What are the alternatives?
The expected column will be like this:
1
2
2
3
3
4
1
1
1
2
2
3
You can use a combination of a udf and window functions to achieve your result:
# required imports
from pyspark.sql.window import Window
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
# define a window, which we will use to calculate lag values
w = Window().partitionBy().orderBy(F.col('id_'))
# define user defined function (udf) to perform calculation on each row
def f(lag_val, current_val):
    if lag_val != current_val:
        return 1
    return 0
# register udf so we can use with our dataframe
func_udf = F.udf(f, IntegerType())
# read csv file
df = spark.read.csv('/path/to/file.csv', header=True)
# create new column with lag on window we created earlier, apply udf on lagged
# and current value and then apply window function again to calculate cumsum
df.withColumn("new_column", func_udf(F.lag("p").over(w), df['p'])) \
  .withColumn('cumsum', F.sum('new_column').over(
      w.partitionBy(F.col('id_')).rowsBetween(Window.unboundedPreceding, 0))) \
  .show()
+---+---+----------+------+
|id_| p|new_column|cumsum|
+---+---+----------+------+
| 1| A| 1| 1|
| 1| B| 1| 2|
| 1| B| 0| 2|
| 1| A| 1| 3|
| 1| A| 0| 3|
| 1| B| 1| 4|
| 2| C| 1| 1|
| 2| C| 0| 1|
| 2| C| 0| 1|
| 2| A| 1| 2|
| 2| A| 0| 2|
| 2| C| 1| 3|
+---+---+----------+------+
# where:
# w.partitionBy : to partition by id_ column
# w.rowsBetween : to specify frame boundaries
# ref https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/expressions/Window.html#rowsBetween-long-long-
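If you would rather avoid the udf entirely, the same flag can be built with when/lag and then summed per group. A rough sketch, with the caveat that (as in the code above) ordering by id_ alone does not pin down the row order inside a group, so a dedicated ordering column is preferable if one exists:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# caveat: id_ alone does not define a stable order within a group;
# replace the orderBy with a real ordering column if you have one
w = Window.partitionBy('id_').orderBy('id_')

prev_p = F.lag('p').over(w)
flag = F.when(prev_p.isNull() | (prev_p != F.col('p')), 1).otherwise(0)

(df
 .withColumn('new_column', flag)
 .withColumn('cumsum', F.sum('new_column').over(
     w.rowsBetween(Window.unboundedPreceding, 0)))
 .show())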

Most performant way to perform custom one-hot-encoding on a PySpark dataframe?

Let's say we have this PySpark dataframe:
+----+-------------+
| id | string_data |
+----+-------------+
| 1 | "test" |
+----+-------------+
| 2 | null |
+----+-------------+
| 3 | "9" |
+----+-------------+
| 4 | "deleted__" |
I want to perform some operation on this that will result in this dataframe:
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| id | string_data | is_string_data_null | is_string_data_a_number | does_string_data_contain_keyword_test | is_string_data_normal |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 1 | "test" | 0 | 0 | 1 | 0 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 2 | null | 1 | 0 | 0 | 0 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 3 | "9" | 0 | 1 | 0 | 0 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 4 | "deleted__" | 0 | 0 | 0 | 1 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
Where each of the new columns has either a 1 or a 0 depending on the truth value. I have currently implemented this using a custom UDF that checks the value of the string_data column, but this is incredibly slow. I have also tried implementing a UDF that does not create new columns but instead overwrites the original one with an encoded vector [1, 0, 0...], etc. This is also too slow because we have to apply this to millions of rows and thousands of columns.
Is there any better way of doing this? I understand UDFs are not the most efficient way to solve things in PySpark but I can't seem to find any built-in PySpark functions that work.
Any thoughts would be appreciated!
Edit: Sorry, from mobile I didn't see the full expected output so my previous answer was very incomplete.
Anyway, your operation has to be done in two steps, starting with this DataFrame:
>>> df.show()
+---+-----------+
| id|string_data|
+---+-----------+
| 1| test|
| 2| null|
| 3| 9|
| 4| deleted__|
+---+-----------+
Create the boolean fields based on the conditions in the string_data field:
>>> from pyspark.sql.functions import coalesce, col, lit
>>> df = (df
    .withColumn('is_string_data_null', df.string_data.isNull())
    .withColumn('is_string_data_a_number', df.string_data.cast('integer').isNotNull())
    .withColumn('does_string_data_contain_keyword_test', coalesce(df.string_data, lit('')).contains('test'))
    .withColumn('is_string_normal', ~(col('is_string_data_null') | col('is_string_data_a_number') | col('does_string_data_contain_keyword_test')))
)
>>> df.show()
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| 1| test| false| false| true| false|
| 2| null| true| false| false| false|
| 3| 9| false| true| false| false|
| 4| deleted__| false| false| false| true|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
Now that we have our columns, we can cast them to integers:
>>> df = (df
    .withColumn('is_string_data_null', df.is_string_data_null.cast('integer'))
    .withColumn('is_string_data_a_number', df.is_string_data_a_number.cast('integer'))
    .withColumn('does_string_data_contain_keyword_test', df.does_string_data_contain_keyword_test.cast('integer'))
    .withColumn('is_string_normal', df.is_string_normal.cast('integer'))
)
>>> df.show()
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| 1| test| 0| 0| 1| 0|
| 2| null| 1| 0| 0| 0|
| 3| 9| 0| 1| 0| 0|
| 4| deleted__| 0| 0| 0| 1|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
This should be far more performant than a UDF, as all the operations are done by Spark itself, so there is no context switch from the JVM to Python.
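If the number of flag columns grows large (the question mentions thousands of columns), the same logic can also be generated in a single select. A sketch, assuming the original two-column DataFrame and reusing the column names from above:
from pyspark.sql import functions as F

# boolean expressions for each flag; the dict keys become the output column names
flags = {
    'is_string_data_null': F.col('string_data').isNull(),
    'is_string_data_a_number': F.col('string_data').cast('integer').isNotNull(),
    'does_string_data_contain_keyword_test':
        F.coalesce(F.col('string_data'), F.lit('')).contains('test'),
}
flags['is_string_normal'] = ~(flags['is_string_data_null']
                              | flags['is_string_data_a_number']
                              | flags['does_string_data_contain_keyword_test'])

# cast every boolean to int in one pass
df = df.select('*', *[v.cast('integer').alias(k) for k, v in flags.items()])
df.show()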

pyspark count not null values for pairs in two column within group

I have some data like this
A B C
1 Null 3
1 2 4
2 Null 6
2 2 Null
2 1 2
3 Null 4
and I want to group by A and then calculate the number of rows that don't contain any null values. So, the result should be
A count
1 1
2 1
3 0
I don't think this will work..., does it?
df.groupby('A').agg(count('B','C'))
Personally, I would add an auxiliary column that is 1 when neither B nor C is null and 0 otherwise, and then sum that column per group.
from pyspark.sql.functions import sum, when
# ...
df.withColumn("isNotNull", when(df.B.isNull() | df.C.isNull(), 0).otherwise(1))\
.groupBy("A").agg(sum("isNotNull"))
Demo:
df.show()
# +---+----+----+
# | _1| _2| _3|
# +---+----+----+
# | 1|null| 3|
# | 1| 2| 4|
# | 2|null| 6|
# | 2| 2|null|
# | 2| 1| 2|
# | 3|null| 4|
# +---+----+----+
df.withColumn("isNotNull", when(df._2.isNull() | df._3.isNull(), 0).otherwise(1)).show()
# +---+----+----+---------+
# | _1| _2| _3|isNotNull|
# +---+----+----+---------+
# | 1|null| 3| 0|
# | 1| 2| 4| 1|
# | 2|null| 6| 0|
# | 2| 2|null| 0|
# | 2| 1| 2| 1|
# | 3|null| 4| 0|
# +---+----+----+---------+
df.withColumn("isNotNull", when(df._2.isNull() | df._3.isNull(), 0).otherwise(1))\
.groupBy("_1").agg(sum("isNotNull")).show()
# +---+--------------+
# | _1|sum(isNotNull)|
# +---+--------------+
# | 1| 1|
# | 3| 0|
# | 2| 1|
# +---+--------------+
You can drop rows that contain null values and then groupby + count:
df.select('A').dropDuplicates().join(
df.dropna(how='any').groupby('A').count(), on=['A'], how='left'
).show()
+---+-----+
| A|count|
+---+-----+
| 1| 1|
| 3| null|
| 2| 1|
+---+-----+
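If you would rather see 0 than null for groups like A = 3 where no row is fully non-null, one small tweak (assuming the default count column name) is to fill after the join:
df.select('A').dropDuplicates().join(
    df.dropna(how='any').groupby('A').count(), on=['A'], how='left'
).fillna(0, subset=['count']).show()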
If you don't want to do the join, create another column to indicate whether there is a null in column B or C:
import pyspark.sql.functions as f
df.selectExpr('*',
'case when B is not null and C is not null then 1 else 0 end as D'
).groupby('A').agg(f.sum('D').alias('count')).show()
+---+-----+
| A|count|
+---+-----+
| 1| 1|
| 3| 0|
| 2| 1|
+---+-----+
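The same aggregation can also be written with the DataFrame API alone; a sketch:
import pyspark.sql.functions as f

df.groupby('A').agg(
    f.sum(
        f.when(f.col('B').isNotNull() & f.col('C').isNotNull(), 1).otherwise(0)
    ).alias('count')
).show()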

Subtract 2 pyspark dataframes based on column

I have 2 pyspark dataframes,
i
+---+-----+
| ID|COL_A|
+---+-----+
| 1| 123|
| 2| 456|
| 3| 111|
| 4| 678|
+---+-----+
j
+----+-----+
|ID_B|COL_B|
+----+-----+
| 2| 456|
| 3| 111|
| 4| 876|
+----+-----+
I'm trying to subtract j from i based on the values of a particular column, i.e., keep only the rows of i whose COL_A value is not present in COL_B of j.
Expected output should be,
diff
+---+-----+
| ID|COL_A|
+---+-----+
| 1| 123|
| 4| 678|
+---+-----+
This is my code,
common = i.join(j.withColumnRenamed('COL_B', 'COL_A'), ['COL_A'], 'leftsemi')
diff = i.subtract(common)
diff.show()
But the output comes out wrong:
diff
+---+-----+
| ID|COL_A|
+---+-----+
| 2| 456|
| 1| 123|
| 4| 678|
| 3| 111|
+---+-----+
Am I doing something wrong here? Thanks in advance.
Try:
left_join = i.join(j, j.COL_B == i.COL_A, how='left')
left_join.filter(left_join.COL_B.isNull()).show()
If you have the column names as variables, you can do it like this:
left_join = i.join(j, j[colb] == i[cola], how='left')
left_join.filter(left_join[colb].isNull()).show()
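Alternatively, a left anti join expresses "keep rows of i whose COL_A has no match in j.COL_B" in a single step; a sketch using the frames above:
# left_anti keeps only the rows of i with no matching row in j
diff = i.join(j, i.COL_A == j.COL_B, how='left_anti')
diff.show()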

PySpark Dataframe take mean of list within column and create new column with 1 & 0 depending on a condition

I am trying to calculate the mean of a list (cost) within a PySpark DataFrame column; values less than the mean get the value 1 and values above the mean get 0.
This is the current dataframe:
+----------+--------------------+--------------------+
| id| collect_list(p_id)|collect_list(cost) |
+----------+--------------------+--------------------+
| 7|[10, 987, 872] |[12.0, 124.6, 197.0]|
| 6|[11, 858, 299] |[15.0, 167.16, 50.0]|
| 17| [2]| [65.4785]|
| 1|[34359738369, 343...|[16.023384, 104.9...|
| 3|[17179869185, 0, ...|[48.3255, 132.025...|
+----------+--------------------+--------------------+
This is the desired output:
+----------+--------------------+--------------------+-----------+
| id| p_id |cost | result |
+----------+--------------------+--------------------+-----------+
| 7|10 |12.0 | 1 |
| 7|987 |124.6 | 0 |
| 7|872 |197.0 | 0 |
| 6|11 |15.0 | 1 |
| 6|858 |167.16 | 0 |
| 6|299 |50.0 | 1 |
| 17|2 |65.4785 | 1 |
+----------+--------------------+--------------------+-----------+
from pyspark.sql.functions import col, mean
#sample data
df = sc.parallelize([(7, [10, 987, 872], [12.0, 124.6, 197.0]),
                     (6, [11, 858, 299], [15.0, 167.16, 50.0]),
                     (17, [2], [65.4785])]).toDF(["id", "collect_list(p_id)", "collect_list(cost)"])
#unpack collect_list in desired output format
df = df.rdd.flatMap(lambda row: [(row[0], x, y) for x, y in zip(row[1], row[2])]).toDF(["id", "p_id", "cost"])
df1 = df.\
    join(df.groupBy("id").agg(mean("cost").alias("mean_cost")), "id", 'left').\
    withColumn("result", (col("cost") <= col("mean_cost")).cast("int")).\
    drop("mean_cost")
df1.show()
Output is :
+---+----+-------+------+
| id|p_id| cost|result|
+---+----+-------+------+
| 7| 10| 12.0| 1|
| 7| 987| 124.6| 0|
| 7| 872| 197.0| 0|
| 6| 11| 15.0| 1|
| 6| 858| 167.16| 0|
| 6| 299| 50.0| 1|
| 17| 2|65.4785| 1|
+---+----+-------+------+
You can create a result list for every row and then zip pid, cost and result list. After that use explode on the zipped column.
import numpy as np
from pyspark.sql.functions import udf, explode, col
from pyspark.sql.types import *

# assumes df is the grouped DataFrame from the question:
# (id, collect_list(p_id), collect_list(cost))
def zip_cols(pid_list, cost_list):
    mean = np.mean(cost_list)
    res_list = list(map(lambda cost: 1 if mean >= cost else 0, cost_list))
    return [(x, y, z) for x, y, z in zip(pid_list, cost_list, res_list)]

udf_zip = udf(zip_cols, ArrayType(StructType([StructField("pid", IntegerType()),
                                              StructField("cost", DoubleType()),
                                              StructField("result", IntegerType())])))
df1 = (df.withColumn("temp", udf_zip("collect_list(p_id)", "collect_list(cost)")).
       drop("collect_list(p_id)", "collect_list(cost)"))
df2 = (df1.withColumn("temp", explode(df1.temp)).
       select("id", col("temp.pid").alias("pid"),
              col("temp.cost").alias("cost"),
              col("temp.result").alias("result")))
df2.show()
output
+---+---+-------+------+
| id|pid| cost|result|
+---+---+-------+------+
| 7| 10| 12.0| 1|
| 7|987| 124.6| 0|
| 7|872| 197.0| 0|
| 6| 11| 15.0| 1|
| 6|858| 167.16| 0|
| 6|299| 50.0| 1|
| 17| 2|65.4785| 1|
+---+---+-------+------+
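For completeness, the unpacking and flagging can also be done without any udf, using posexplode plus a window mean. A sketch, assuming Spark 2.4+ and a DataFrame named grouped with the three columns shown in the question:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# flatten the two parallel arrays by position
exploded = (grouped
    .select('id',
            F.posexplode(grouped['collect_list(p_id)']).alias('pos', 'p_id'),
            grouped['collect_list(cost)'].alias('costs'))
    .withColumn('cost', F.expr('element_at(costs, pos + 1)'))   # element_at is 1-based
    .drop('pos', 'costs'))

# flag costs at or below the per-id mean
w = Window.partitionBy('id')
result = exploded.withColumn(
    'result', (F.col('cost') <= F.mean('cost').over(w)).cast('int'))
result.show()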
