Still new to Spark and I'm trying to do this final transformation as cleanly and efficiently as possible.
Say I have a dataframe that looks like the following
+------+--------+
|ID | Hit |
+------+--------+
|123 | 0 |
|456 | 1 |
|789 | 0 |
|123 | 1 |
|123 | 0 |
|789 | 1 |
|1234 | 0 |
| 1234 | 0 |
+------+--------+
I'm trying to end up with a new dataframe(or two, depending on what's more efficient), where if a row has a 1 in "hit", it cannot have a row with a 0 in hit and if there is, the 0's would be to a distinct level based on the ID column.
Here's one of the methods I tried but I'm not sure if this is
1. The most efficient way possible
2. The cleanest way possible
dfhits = df.filter(df.Hit == 1)
dfnonhits = df.filter(df.Hit == 0)
dfnonhitsdistinct = dfnonhits.filter(~dfnonhits['ID'].isin(dfhits))
Enddataset would look like the following:
+------+--------+
|ID | Hit |
+------+--------+
|456 | 1 |
|123 | 1 |
|789 | 1 |
|1234 | 0 |
+------+--------+
# Creating the Dataframe.
from pyspark.sql.functions import col
df = sqlContext.createDataFrame([(123,0),(456,1),(789,0),(123,1),(123,0),(789,1),(500,0),(500,0)],
['ID','Hit'])
df.show()
+---+---+
| ID|Hit|
+---+---+
|123| 0|
|456| 1|
|789| 0|
|123| 1|
|123| 0|
|789| 1|
|500| 0|
|500| 0|
+---+---+
The idea is to find the total of Hit per ID and in case it is more than 0, it means that there is atleast one 1 present in Hit. So, when this condition is true, we will remove all rows with Hit values 0.
# Registering the dataframe as a temporary view.
df.registerTempTable('table_view')
df=sqlContext.sql(
'select ID, Hit, sum(Hit) over (partition by ID) as sum_Hit from table_view'
)
df.show()
+---+---+-------+
| ID|Hit|sum_Hit|
+---+---+-------+
|789| 0| 1|
|789| 1| 1|
|500| 0| 0|
|500| 0| 0|
|123| 0| 1|
|123| 1| 1|
|123| 0| 1|
|456| 1| 1|
+---+---+-------+
df = df.filter(~((col('Hit')==0) & (col('sum_Hit')>0))).drop('sum_Hit').dropDuplicates()
df.show()
+---+---+
| ID|Hit|
+---+---+
|789| 1|
|500| 0|
|123| 1|
|456| 1|
+---+---+
Related
I have a pyspark dataframe like this,
+----------+--------+
|id_ | p |
+----------+--------+
| 1 | A |
| 1 | B |
| 1 | B |
| 1 | A |
| 1 | A |
| 1 | B |
| 2 | C |
| 2 | C |
| 2 | C |
| 2 | A |
| 2 | A |
| 2 | C |
---------------------
I want to create another column for each group of id_. Column is made using pandas now with the code,
sample.groupby(by=['id_'], group_keys=False).apply(lambda grp : grp['p'].ne(grp['p'].shift()).cumsum())
How can I do this in pyspark dataframe.?
Currently I am doing this with a help of a pandas UDF, which runs very slow.
What are the alternatives.?
Expected column will be like this,
1
2
2
3
3
4
1
1
1
2
2
3
You can combination of udf and window functions to achieve your results:
# required imports
from pyspark.sql.window import Window
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
# define a window, which we will use to calculate lag values
w = Window().partitionBy().orderBy(F.col('id_'))
# define user defined function (udf) to perform calculation on each row
def f(lag_val, current_val):
if lag_val != current_val:
return 1
return 0
# register udf so we can use with our dataframe
func_udf = F.udf(f, IntegerType())
# read csv file
df = spark.read.csv('/path/to/file.csv', header=True)
# create new column with lag on window we created earlier, apply udf on lagged
# and current value and then apply window function again to calculate cumsum
df.withColumn("new_column", func_udf(F.lag("p").over(w), df['p'])).withColumn('cumsum', F.sum('new_column').over(w.partitionBy(F.col('id_')).rowsBetween(Window.unboundedPreceding, 0))).show()
+---+---+----------+------+
|id_| p|new_column|cumsum|
+---+---+----------+------+
| 1| A| 1| 1|
| 1| B| 1| 2|
| 1| B| 0| 2|
| 1| A| 1| 3|
| 1| A| 0| 3|
| 1| B| 1| 4|
| 2| C| 1| 1|
| 2| C| 0| 1|
| 2| C| 0| 1|
| 2| A| 1| 2|
| 2| A| 0| 2|
| 2| C| 1| 3|
+---+---+----------+------+
# where:
# w.partitionBy : to partition by id_ column
# w.rowsBetween : to specify frame boundaries
# ref https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/expressions/Window.html#rowsBetween-long-long-
Let's say we have this PySpark dataframe:
+----+-------------+
| id | string_data |
+----+-------------+
| 1 | "test" |
+----+-------------+
| 2 | null |
+----+-------------+
| 3 | "9" |
+----+-------------+
| 4 | "deleted__" |
I want to perform some operation on this that will result in this dataframe:
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| id | string_data | is_string_data_null | is_string_data_a_number | does_string_data_contain_keyword_test | is_string_data_normal |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 1 | "test" | 0 | 0 | 1 | 0 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 2 | null | 1 | 0 | 0 | 0 |im
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 3 | "9" | 0 | 1 | 0 | 0 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 4 | "deleted__" | 0 | 0 | 0 | 1 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| | | | | | |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
Where each of the new columns has either a 1 or a 0 depending on the truth value. I have currently implemented this using a custom UDF that checks the value of the string_data column, but this is incredibly slow. I have also tried implementing a UDF that does not create new columns but instead overwrites the original one with an encoded vector [1, 0, 0...], etc. This is also too slow because we have to apply this to millions of rows and thousands of columns.
Is there any better way of doing this? I understand UDFs are not the most efficient way to solve things in PySpark but I can't seem to find any built-in PySpark functions that work.
Any thoughts would be appreciated!
Edit: Sorry, from mobile I didn't see the full expected output so my previous answer was very incomplete.
Anyway, your operation has to be done in two steps, starting with this DataFrame:
>>> df.show()
+---+-----------+
| id|string_data|
+---+-----------+
| 1| test|
| 2| null|
| 3| 9|
| 4| deleted__|
+---+-----------+
Create the boolean fields based on the conditions in the string_data field:
>>> df = (df
.withColumn('is_string_data_null', df.string_data.isNull())
.withColumn('is_string_data_a_number', df.string_data.cast('integer').isNotNull())
.withColumn('does_string_data_contain_keyword_test', coalesce(df.string_data, lit('')).contains('test'))
.withColumn('is_string_normal', ~(col('is_string_data_null') | col('is_string_data_a_number') | col('does_string_data_contain_keyword_test')))
)
>>> df.show()
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| 1| test| false| false| true| false|
| 2| null| true| false| false| false|
| 3| 9| false| true| false| false|
| 4| deleted__| false| false| false| true|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
Now that we have our columns, we can cast them to integers:
>>> df = (df
.withColumn('is_string_data_null', df.is_string_data_null.cast('integer'))
.withColumn('is_string_data_a_number', df.is_string_data_a_number.cast('integer'))
.withColumn('does_string_data_contain_keyword_test', df.does_string_data_contain_keyword_test.cast('integer'))
.withColumn('is_string_normal', df.is_string_normal.cast('integer'))
)
>>> df.show()
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| 1| test| 0| 0| 1| 0|
| 2| null| 1| 0| 0| 0|
| 3| 9| 0| 1| 0| 0|
| 4| deleted__| 0| 0| 0| 1|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
This should be far more performant than an UDF, as all the operations are done by Spark itself so there's no context switch from Spark to Python.
I have some data like this
A B C
1 Null 3
1 2 4
2 Null 6
2 2 Null
2 1 2
3 Null 4
and I want to groupby A and then calculat the number of rows that don't contain Null value. So, the result should be
A count
1 1
2 1
3 0
I don't think this will work..., does it?
df.groupby('A').agg(count('B','C'))
Personally, I would use an auxiliary column saying whether B or C is Null. Negative result in this solution and return 1 or 0. And use sum for this column.
from pyspark.sql.functions import sum, when
# ...
df.withColumn("isNotNull", when(df.B.isNull() | df.C.isNull(), 0).otherwise(1))\
.groupBy("A").agg(sum("isNotNull"))
Demo:
df.show()
# +---+----+----+
# | _1| _2| _3|
# +---+----+----+
# | 1|null| 3|
# | 1| 2| 4|
# | 2|null| 6|
# | 2| 2|null|
# | 2| 1| 2|
# | 3|null| 4|
# +---+----+----+
df.withColumn("isNotNull", when(df._2.isNull() | df._3.isNull(), 0).otherwise(1)).show()
# +---+----+----+---------+
# | _1| _2| _3|isNotNull|
# +---+----+----+---------+
# | 1|null| 3| 0|
# | 1| 2| 4| 1|
# | 2|null| 6| 0|
# | 2| 2|null| 0|
# | 2| 1| 2| 1|
# | 3|null| 4| 0|
# +---+----+----+---------+
df.withColumn("isNotNull", when(df._2.isNull() | df._3.isNull(), 0).otherwise(1))\
.groupBy("_1").agg(sum("isNotNull")).show()
# +---+--------------+
# | _1|sum(isNotNull)|
# +---+--------------+
# | 1| 1|
# | 3| 0|
# | 2| 1|
# +---+--------------+
You can drop rows that contain null values and then groupby + count:
df.select('A').dropDuplicates().join(
df.dropna(how='any').groupby('A').count(), on=['A'], how='left'
).show()
+---+-----+
| A|count|
+---+-----+
| 1| 1|
| 3| null|
| 2| 1|
+---+-----+
If you don't want to do the join, create another column to indicate whether there is null in columns B or C:
import pyspark.sql.functions as f
df.selectExpr('*',
'case when B is not null and C is not null then 1 else 0 end as D'
).groupby('A').agg(f.sum('D').alias('count')).show()
+---+-----+
| A|count|
+---+-----+
| 1| 1|
| 3| 0|
| 2| 1|
+---+-----+
I have 2 pyspark dataframes,
i
+---+-----+
| ID|COL_A|
+---+-----+
| 1| 123|
| 2| 456|
| 3| 111|
| 4| 678|
+---+-----+
j
+----+-----+
|ID_B|COL_B|
+----+-----+
| 2| 456|
| 3| 111|
| 4| 876|
+----+-----+
I'm trying to subtract i from j based on values of a particular column i.e., values present in COL_A of i should not be present in COL_B of j.
Expected output should be,
diff
+---+-----+
| ID|COL_A|
+---+-----+
| 1| 123|
| 4| 678|
+---+-----+
This is my code,
common = i.join(j.withColumnRenamed('COL_B', 'COL_A'), ['COL_A'], 'leftsemi')
diff = i.subtract(common)
diff.show()
But the output is coming wrong,
diff
+---+-----+
| ID|COL_A|
+---+-----+
| 2| 456|
| 1| 123|
| 4| 678|
| 3| 111|
+---+-----+
Am I doing something wrong here? Thanks in advance.
Try:
left_join = i.join(j, j.COL_B == i.COL_A,how='left')
left_join.filter(left_join.COL_A.isNull()).show()
If you are having column names as args, you can do like:
left_join = i.join(j, j[colb] == i[cola],how='left')
left_join.filter(left_join[cola].isNull()).show()
I am trying to calculate the mean of a list (cost) within a PySpark Dataframe column, the values that are less than the mean get the value 1 and above the mean a 0.
This is the current dataframe:
+----------+--------------------+--------------------+
| id| collect_list(p_id)|collect_list(cost) |
+----------+--------------------+--------------------+
| 7|[10, 987, 872] |[12.0, 124.6, 197.0]|
| 6|[11, 858, 299] |[15.0, 167.16, 50.0]|
| 17| [2]| [65.4785]|
| 1|[34359738369, 343...|[16.023384, 104.9...|
| 3|[17179869185, 0, ...|[48.3255, 132.025...|
+----------+--------------------+--------------------+
This is the desired output:
+----------+--------------------+--------------------+-----------+
| id| p_id |cost | result |
+----------+--------------------+--------------------+-----------+
| 7|10 |12.0 | 1 |
| 7|987 |124.6 | 0 |
| 7|872 |197.0 | 0 |
| 6|11 |15.0 | 1 |
| 6|858 |167.16 | 0 |
| 6|299 |50.0 | 1 |
| 17|2 |65.4785 | 1 |
+----------+--------------------+--------------------+-----------+
from pyspark.sql.functions import col, mean
#sample data
df = sc.parallelize([(7,[10, 987, 872],[12.0, 124.6, 197.0]),
(6,[11, 858, 299],[15.0, 167.16, 50.0]),
(17,[2],[65.4785])]).toDF(["id", "collect_list(p_id)","collect_list(cost)"])
#unpack collect_list in desired output format
df = df.rdd.flatMap(lambda row: [(row[0], x, y) for x,y in zip(row[1],row[2])]).toDF(["id", "p_id","cost"])
df1 = df.\
join(df.groupBy("id").agg(mean("cost").alias("mean_cost")), "id", 'left').\
withColumn("result",(col("cost") <= col("mean_cost")).cast("int")).\
drop("mean_cost")
df1.show()
Output is :
+---+----+-------+------+
| id|p_id| cost|result|
+---+----+-------+------+
| 7| 10| 12.0| 1|
| 7| 987| 124.6| 0|
| 7| 872| 197.0| 0|
| 6| 11| 15.0| 1|
| 6| 858| 167.16| 0|
| 6| 299| 50.0| 1|
| 17| 2|65.4785| 1|
+---+----+-------+------+
You can create a result list for every row and then zip pid, cost and result list. After that use explode on the zipped column.
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import *
def zip_cols(pid_list,cost_list):
mean = np.mean(cost_list)
res_list = list(map(lambda cost:1 if mean >= cost else 0,cost_list))
return[(x,y,z) for x,y,z in zip(pid_list, cost_list, res_list)]
udf_zip = udf(zip_cols, ArrayType(StructType([StructField("pid",IntegerType()),
StructField("cost", DoubleType()),
StructField("result",IntegerType())])))
df1 = (df.withColumn("temp",udf_zip("collect_list(p_id)","collect_list(cost)")).
drop("collect_list(p_id)","collect_list(cost)"))
df2 = (df1.withColumn("temp",explode(df1.temp)).
select("id",col("temp.pid").alias("pid"),
col("temp.cost").alias("cost"),
col("temp.result").alias("result")))
df2.show()
output
+---+---+-------+------+
| id|pid| cost|result|
+---+---+-------+------+
| 7| 10| 12.0| 1|
| 7| 98| 124.6| 0|
| 7|872| 197.0| 0|
| 6| 11| 15.0| 1|
| 6|858| 167.16| 0|
| 6|299| 50.0| 1|
| 17| 2|65.4758| 1|
+---+---+-------+------+