Compute median of column in pyspark - python

I have a dataframe as shown below:
+-----------+------------+
|parsed_date|       count|
+-----------+------------+
| 2017-12-16|           2|
| 2017-12-16|           2|
| 2017-12-17|           2|
| 2017-12-17|           2|
| 2017-12-18|           1|
| 2017-12-19|           4|
| 2017-12-19|           4|
| 2017-12-19|           4|
| 2017-12-19|           4|
| 2017-12-20|           1|
+-----------+------------+
I want to compute the median of the entire 'count' column and add the result as a new column.
I tried:
median = df.approxQuantile('count',[0.5],0.1).alias('count_median')
But of course I am doing something wrong as it gives the following error:
AttributeError: 'list' object has no attribute 'alias'
Please help.

You need to add a column with withColumn because approxQuantile returns a list of floats, not a Spark column.
import pyspark.sql.functions as F

df2 = df.withColumn('count_median', F.lit(df.approxQuantile('count', [0.5], 0.1)[0]))
df2.show()
+-----------+-----+------------+
|parsed_date|count|count_median|
+-----------+-----+------------+
| 2017-12-16|    2|         2.0|
| 2017-12-16|    2|         2.0|
| 2017-12-17|    2|         2.0|
| 2017-12-17|    2|         2.0|
| 2017-12-18|    1|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-20|    1|         2.0|
+-----------+-----+------------+
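If you need the exact median rather than an approximation, the Spark docs note that passing a relative error of 0 to approxQuantile computes the exact quantile, at a higher cost on large data. A minimal sketch of that variant:
import pyspark.sql.functions as F

# relativeError = 0.0 makes approxQuantile compute the exact quantile (more expensive on big data)
exact_median = df.approxQuantile('count', [0.5], 0.0)[0]
df2 = df.withColumn('count_median', F.lit(exact_median))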
You can also use the approx_percentile / percentile_approx function in Spark SQL:
import pyspark.sql.functions as F

df2 = df.withColumn('count_median', F.expr("approx_percentile(count, 0.5, 10) over ()"))
df2.show()
+-----------+-----+------------+
|parsed_date|count|count_median|
+-----------+-----+------------+
| 2017-12-16|    2|           2|
| 2017-12-16|    2|           2|
| 2017-12-17|    2|           2|
| 2017-12-17|    2|           2|
| 2017-12-18|    1|           2|
| 2017-12-19|    4|           2|
| 2017-12-19|    4|           2|
| 2017-12-19|    4|           2|
| 2017-12-19|    4|           2|
| 2017-12-20|    1|           2|
+-----------+-----+------------+
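Not what is asked here, but a common follow-up: if you ever need one median per group instead of one for the whole frame, the same SQL function works inside an aggregation. A sketch, assuming you want one approximate median per parsed_date:
import pyspark.sql.functions as F

# one approximate median per parsed_date, joined back onto the original rows
medians = df.groupBy('parsed_date').agg(F.expr('percentile_approx(count, 0.5)').alias('count_median'))
df2 = df.join(medians, on='parsed_date', how='left')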

Related

pyspark aggregation with udf for each row value

I am very new to PySpark and my problem is as follows:
I have this sample data frame, and I want to add a column (Avg Coeff) that holds, for each row, the average coefficient of all rows with the same Country value.
This is a self-made dataframe, similar to what I am actually working with:
+--------+--------+---+-----+----+----+-----+------------+---------+----+
|Relegion| Country|Day|Month|Year|Week|cases|Weekly Coeff|Avg Coeff|rank|
+--------+--------+---+-----+----+----+-----+------------+---------+----+
| Hindu| India| 3| 1| 20| 1| 30| 0.5| 0.616| 3|
| Hindu| India| 2| 1| 20| 1| 20| 0.7| 0.616| 3|
| Hindu| India| 5| 2| 20| 2| 100| 0.9| 0.616| 3|
| Hindu| India| 6| 2| 20| 2| 160| 0.4| 0.616| 3|
| Hindu| India| 6| 2| 20| 1| 160| 0.4| 0.616| 3|
| Hindu| India| 1| 1| 20| 2| 5| 0.6| 0.616| 3|
| Hindu| India| 1| 1| 20| 1| 5| 0.6| 0.616| 3|
| Hindu| India| 2| 1| 20| 2| 20| 0.7| 0.616| 3|
| Hindu| India| 4| 2| 20| 1| 10| 0.6| 0.616| 3|
| Hindu| India| 3| 1| 20| 2| 30| 0.5| 0.616| 3|
| Hindu| India| 4| 2| 20| 2| 10| 0.6| 0.616| 3|
| Hindu| India| 5| 2| 20| 1| 100| 0.9| 0.616| 3|
| Muslim|Pakistan| 1| 1| 20| 1| 100| 0.6| 0.683| 2|
| Muslim|Pakistan| 4| 2| 20| 1| 200| 0.6| 0.683| 2|
| Muslim|Pakistan| 2| 1| 20| 1| 20| 0.9| 0.683| 2|
| Muslim|Pakistan| 5| 2| 20| 1| 300| 0.8| 0.683| 2|
| Muslim|Pakistan| 2| 1| 20| 2| 20| 0.9| 0.683| 2|
| Muslim|Pakistan| 3| 1| 20| 1| 50| 0.4| 0.683| 2|
| Muslim|Pakistan| 6| 2| 20| 1| 310| 0.8| 0.683| 2|
| Muslim|Pakistan| 3| 1| 20| 2| 50| 0.4| 0.683| 2|
+--------+--------+---+-----+----+----+-----+------------+---------+----+
I have to find the average coefficient (one value per country). I added the Avg Coeff column manually to show the expected result, but I haven't been able to compute it.
You can use avg over a window partitioned by Country:
from pyspark.sql import functions as F, Window
df2 = df.withColumn('Avg_coeff', F.avg('Weekly Coeff').over(Window.partitionBy('Country')))
Note that it's not a good practice to have spaces in column names.
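A possible cleanup step (a sketch, assuming you control the schema and these exact column names): rename the columns once up front so that later F.expr / SQL snippets don't need backticks:
# hypothetical rename step; adjust the names to your real schema
df = (df.withColumnRenamed('Weekly Coeff', 'Weekly_Coeff')
        .withColumnRenamed('Avg Coeff', 'Avg_Coeff'))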

Merge Rows in Apache Spark by eliminating null values

I have a spark data frame like below
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 2|null|null|null| 102| 202| 302|
| 4|null|null|null| 104| 204| 304|
| 1|null|null|null| 101| 201| 301|
| 3|null|null|null| 103| 203| 303|
| 1| 11| 21| 31|null|null|null|
| 2| 12| 22| 32|null|null|null|
| 4| 14| 24| 34|null|null|null|
| 3| 13| 23| 33|null|null|null|
+---+----+----+----+----+----+----+
I want to transform the data frame as below, merging the rows for each id and eliminating the null values
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 1| 11| 21| 31| 101| 201| 301|
| 2| 12| 22| 32| 102| 202| 302|
| 4| 14| 24| 34| 104| 204| 304|
| 3| 13| 23| 33| 103| 203| 303|
+---+----+----+----+----+----+----+
preferably in Scala.
You can group on id and aggregate the other columns using first with ignorenulls, aliasing each result back to its original column name:
import pyspark.sql.functions as F

(df.groupBy('id')
   .agg(*[F.first(x, ignorenulls=True).alias(x) for x in df.columns if x != 'id'])
   .show())
+---+----+----+----+-----+-----+-----+
| id| 1| 2| 3| sf_1| sf_2| sf_3|
+---+----+----+----+-----+-----+-----+
| 1|11.0|21.0|31.0|101.0|201.0|301.0|
| 3|13.0|23.0|33.0|103.0|203.0|303.0|
| 2|12.0|22.0|32.0|102.0|202.0|302.0|
| 4|14.0|24.0|34.0|104.0|204.0|304.0|
+---+----+----+----+-----+-----+-----+
The Scala way of doing it:
import org.apache.spark.sql.functions.first

val inputColumns = inputLoadDF.columns.toList.filterNot(_ == "id")
val exprs = inputColumns.map(x => first(x, ignoreNulls = true).alias(x))
inputLoadDF.groupBy("id").agg(exprs.head, exprs.tail: _*).show()

How do I count consecutive values with pyspark?

I am trying to count consecutive values that appear in a column with PySpark. I have the column "a" in my dataframe and expect to create the column "b".
+---+---+
| a| b|
+---+---+
| 0| 1|
| 0| 2|
| 0| 3|
| 0| 4|
| 0| 5|
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 1| 5|
| 1| 6|
| 2| 1|
| 2| 2|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 6|
| 3| 1|
| 3| 2|
| 3| 3|
+---+---+
I have tried to create the column "b" with the lag function over a window, but without success.
w = Window \
    .partitionBy(df.some_id) \
    .orderBy(df.timestamp_column)

df.withColumn(
    "b",
    f.when(df.a == f.lag(df.a).over(w),
           f.sum(f.lit(1)).over(w)).otherwise(f.lit(0))
)
I could resolve this issue with the following code:
df.withColumn("b",
f.row_number().over(Window.partitionBy("a").orderBy("timestamp_column"))
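Note that row_number partitioned by "a" only reproduces the expected output because each value of "a" appears in one contiguous run. If the same value can come back later and each run should restart the count, a gaps-and-islands step is needed first. A sketch, assuming timestamp_column defines the row order:
from pyspark.sql import functions as f, Window

w = Window.orderBy("timestamp_column")
df2 = (df
    # flag rows where "a" differs from the previous row (null for the first row)
    .withColumn("change", (f.col("a") != f.lag("a").over(w)).cast("int"))
    # running sum of the flags gives one group id per consecutive run
    .withColumn("grp", f.sum(f.coalesce(f.col("change"), f.lit(0))).over(w))
    # count within each run
    .withColumn("b", f.row_number().over(Window.partitionBy("grp").orderBy("timestamp_column")))
    .drop("change", "grp"))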

Pyspark - ranking column replacing ranking of ties by Mean of rank

Consider a data set with ranking
+--------+----+-----------+--------------+
| colA|colB|colA_rank |colA_rank_mean|
+--------+----+-----------+--------------+
| 21| 50| 1| 1|
| 9| 23| 2| 2.5|
| 9| 21| 3| 2.5|
| 8| 21| 4| 3|
| 2| 21| 5| 5.5|
| 2| 5| 6| 5.5|
| 1| 5| 7| 7.5|
| 1| 4| 8| 7.5|
| 0| 4| 9| 11|
| 0| 3| 10| 11|
| 0| 3| 11| 11|
| 0| 2| 12| 11|
| 0| 2| 13| 11|
+--------+----+-----------+--------------+
colA_rank is a normal ranking, while in colA_rank_mean I would like to resolve ties by replacing the rank with the mean rank of the tied rows. Is this achievable in a single pass with some particular ranking method?
Currently I am thinking of two passes, but that seems to require ordering the dataset twice on colA, once without a partition and once with one.
# Step 1: normal rank
df = df.withColumn("colA_rank", F.row_number().over(Window.orderBy("colA")))
# Step 2: resolve ties
df = df.withColumn("colA_rank_mean", F.mean("colA_rank").over(Window.partitionBy("colA")))

Window timeseries with step in Spark/Scala

I have this input :
timestamp,user
1,A
2,B
5,C
9,E
12,F
The result wanted is :
timestampRange,userList
1 to 2,[A,B]
3 to 4,[] Or null
5 to 6,[C]
7 to 8,[] Or null
9 to 10,[E]
11 to 12,[F]
I tried using Window, but the problem is that it doesn't include the empty timestamp ranges.
Any hints would be helpful.
I don't know if a windowing function will cover the gaps between ranges, but you can take the following approach:
Define a dataframe, df_ranges:
val ranges = List((1,2), (3,4), (5,6), (7,8), (9,10))
val df_ranges = sc.parallelize(ranges).toDF("start", "end")
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 3| 4|
| 5| 6|
| 7| 8|
| 9| 10|
+-----+---+
Data with the timestamp column, df_data:
val data = List((1,"A"), (2,"B"), (5,"C"), (9,"E"))
val df_data = sc.parallelize(data).toDF("timestamp", "user")
+---------+----+
|timestamp|user|
+---------+----+
| 1| A|
| 2| B|
| 5| C|
| 9| E|
+---------+----+
Join the two dataframes on the start, end and timestamp columns:
val joined = df_ranges.join(df_data, df_ranges.col("start").equalTo(df_data.col("timestamp")).or(df_ranges.col("end").equalTo(df_data.col("timestamp"))), "left")
+-----+---+---------+----+
|start|end|timestamp|user|
+-----+---+---------+----+
| 1| 2| 1| A|
| 1| 2| 2| B|
| 5| 6| 5| C|
| 9| 10| 9| E|
| 3| 4| null|null|
| 7| 8| null|null|
+-----+---+---------+----+
Now do a simple aggregation with the collect_list function:
import org.apache.spark.sql.functions.collect_list
joined.groupBy("start", "end").agg(collect_list("user")).orderBy("start").show()
+-----+---+------------------+
|start|end|collect_list(user)|
+-----+---+------------------+
| 1| 2| [A, B]|
| 3| 4| []|
| 5| 6| [C]|
| 7| 8| []|
| 9| 10| [E]|
+-----+---+------------------+
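If you prefer not to hard-code the ranges (the list above stops at (9,10), so the 11 to 12 bucket from the expected output is missing), they can be generated from the data. A sketch of the same idea in PySpark, assuming a fixed step of 2 and the "start to end" string format; the Scala version is analogous:
from pyspark.sql import functions as F

# build (start, end) pairs covering 1 .. max timestamp with a step of 2
max_ts = df_data.agg(F.max("timestamp")).first()[0]
df_ranges = (spark.range(1, max_ts + 1, 2)
             .withColumnRenamed("id", "start")
             .withColumn("end", F.col("start") + 1))

result = (df_ranges
          .join(df_data, F.col("timestamp").between(F.col("start"), F.col("end")), "left")
          .groupBy("start", "end")
          .agg(F.collect_list("user").alias("userList"))
          .withColumn("timestampRange", F.concat_ws(" to ", "start", "end"))
          .orderBy("start"))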
