I am new to PySpark and my problem is as follows:
I have a sample data frame, and I want to compute an average coefficient per country and attach it (with withColumn) to every row that shares the same value in the Country column.
This is a self-made dataframe similar to the one I have actually been working on:
+--------+--------+---+-----+----+----+-----+------------+---------+----+
|Relegion| Country|Day|Month|Year|Week|cases|Weekly Coeff|Avg Coeff|rank|
+--------+--------+---+-----+----+----+-----+------------+---------+----+
| Hindu| India| 3| 1| 20| 1| 30| 0.5| 0.616| 3|
| Hindu| India| 2| 1| 20| 1| 20| 0.7| 0.616| 3|
| Hindu| India| 5| 2| 20| 2| 100| 0.9| 0.616| 3|
| Hindu| India| 6| 2| 20| 2| 160| 0.4| 0.616| 3|
| Hindu| India| 6| 2| 20| 1| 160| 0.4| 0.616| 3|
| Hindu| India| 1| 1| 20| 2| 5| 0.6| 0.616| 3|
| Hindu| India| 1| 1| 20| 1| 5| 0.6| 0.616| 3|
| Hindu| India| 2| 1| 20| 2| 20| 0.7| 0.616| 3|
| Hindu| India| 4| 2| 20| 1| 10| 0.6| 0.616| 3|
| Hindu| India| 3| 1| 20| 2| 30| 0.5| 0.616| 3|
| Hindu| India| 4| 2| 20| 2| 10| 0.6| 0.616| 3|
| Hindu| India| 5| 2| 20| 1| 100| 0.9| 0.616| 3|
| Muslim|Pakistan| 1| 1| 20| 1| 100| 0.6| 0.683| 2|
| Muslim|Pakistan| 4| 2| 20| 1| 200| 0.6| 0.683| 2|
| Muslim|Pakistan| 2| 1| 20| 1| 20| 0.9| 0.683| 2|
| Muslim|Pakistan| 5| 2| 20| 1| 300| 0.8| 0.683| 2|
| Muslim|Pakistan| 2| 1| 20| 2| 20| 0.9| 0.683| 2|
| Muslim|Pakistan| 3| 1| 20| 1| 50| 0.4| 0.683| 2|
| Muslim|Pakistan| 6| 2| 20| 1| 310| 0.8| 0.683| 2|
| Muslim|Pakistan| 3| 1| 20| 2| 50| 0.4| 0.683| 2|
+--------+--------+---+-----+----+----+-----+------------+---------+----+
I have to find the average coefficient (one value per country). I added the Avg Coeff column manually to show the expected result, but I haven't been able to compute it.
You can use an average over a window:
from pyspark.sql import functions as F, Window

# one average per country, repeated on every row of that country
df2 = df.withColumn('Avg Coeff', F.avg('Weekly Coeff').over(Window.partitionBy('Country')))
Note that it's not good practice to have spaces in column names.
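As a hedged sanity check, here is a minimal, self-contained sketch of the same idea on a couple of made-up rows (the SparkSession setup and sample values are assumptions, not taken from the question):
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# toy data, assumed for illustration only
df = spark.createDataFrame(
    [("India", 0.5), ("India", 0.7), ("Pakistan", 0.6), ("Pakistan", 0.9)],
    ["Country", "Weekly Coeff"])

# one average per country, broadcast back to every row of that country
w = Window.partitionBy("Country")
df.withColumn("Avg Coeff", F.avg("Weekly Coeff").over(w)).show()
If you only need one value per country rather than the value repeated on every row, df.groupBy('Country').agg(F.avg('Weekly Coeff')) gives the same averages without a window.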
I have a spark data frame like below
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 2|null|null|null| 102| 202| 302|
| 4|null|null|null| 104| 204| 304|
| 1|null|null|null| 101| 201| 301|
| 3|null|null|null| 103| 203| 303|
| 1| 11| 21| 31|null|null|null|
| 2| 12| 22| 32|null|null|null|
| 4| 14| 24| 34|null|null|null|
| 3| 13| 23| 33|null|null|null|
+---+----+----+----+----+----+----+
I want to transform the data frame as below by merging the rows that contain nulls:
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 1| 11| 21| 31| 101| 201| 301|
| 2| 12| 22| 32| 102| 202| 302|
| 4| 14| 24| 34| 104| 204| 304|
| 3| 13| 23| 33| 103| 203| 303|
+---+----+----+----+----+----+----+
Preferably in Scala.
You can group by id and aggregate the other columns using first with ignorenulls=True:
import pyspark.sql.functions as F

(df.groupBy('id')
   .agg(*[F.first(x, ignorenulls=True).alias(x) for x in df.columns if x != 'id'])
   .show())
+---+----+----+----+-----+-----+-----+
| id| 1| 2| 3| sf_1| sf_2| sf_3|
+---+----+----+----+-----+-----+-----+
| 1|11.0|21.0|31.0|101.0|201.0|301.0|
| 3|13.0|23.0|33.0|103.0|203.0|303.0|
| 2|12.0|22.0|32.0|102.0|202.0|302.0|
| 4|14.0|24.0|34.0|104.0|204.0|304.0|
+---+----+----+----+-----+-----+-----+
The Scala way of doing it:
import org.apache.spark.sql.functions.first

// keep every column except the grouping key "id"
val inputColumns = inputLoadDF.columns.filterNot(_ == "id").toList
val exprs = inputColumns.map(x => first(x, ignoreNulls = true).alias(x))
inputLoadDF.groupBy("id").agg(exprs.head, exprs.tail: _*).show()
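As a hedged cross-check in PySpark, here is a self-contained reproduction of the merge (the literal rows below are assumptions reconstructed from the tables above; first without an ordering is safe here because each id has exactly one non-null value per column):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# assumed reconstruction of the example: each id appears twice, once with
# the plain columns filled and once with the sf_ columns filled
df = spark.createDataFrame(
    [(1, None, None, None, 101, 201, 301),
     (1, 11, 21, 31, None, None, None),
     (2, None, None, None, 102, 202, 302),
     (2, 12, 22, 32, None, None, None)],
    ["id", "1", "2", "3", "sf_1", "sf_2", "sf_3"])

df.groupBy("id").agg(
    *[F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != "id"]
).show()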
I am trying to count consecutive values that appear in a column with PySpark. I have the column "a" in my dataframe and want to create the column "b":
+---+---+
| a| b|
+---+---+
| 0| 1|
| 0| 2|
| 0| 3|
| 0| 4|
| 0| 5|
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 1| 5|
| 1| 6|
| 2| 1|
| 2| 2|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 6|
| 3| 1|
| 3| 2|
| 3| 3|
+---+---+
I have tried to create the column "b" with a lag function over a window, but without success:
w = Window \
    .partitionBy(df.some_id) \
    .orderBy(df.timestamp_column)

df.withColumn(
    "b",
    f.when(df.a == f.lag(df.a).over(w),
           f.sum(f.lit(1)).over(w)).otherwise(f.lit(0))
)
I could resolve this issue with the following code:
df.withColumn("b",
f.row_number().over(Window.partitionBy("a").orderBy("timestamp_column"))
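Note that partitioning only by "a" counts every occurrence of a value, not only consecutive ones; it works for the data above because equal values are already adjacent. For strictly consecutive runs, a common alternative is the lag-plus-cumulative-sum trick sketched below; this is only a hedged sketch, and some_id and timestamp_column are assumed column names:
from pyspark.sql import functions as f, Window

# order of events; some_id / timestamp_column are assumed column names
w = Window.partitionBy("some_id").orderBy("timestamp_column")

# flag the start of a new run whenever the value of "a" changes
df2 = df.withColumn(
    "run_start",
    (f.lag("a").over(w).isNull() | (f.col("a") != f.lag("a").over(w))).cast("int"))

# a cumulative sum of run starts gives a run id, then number the rows inside each run
df3 = df2.withColumn("run_id", f.sum("run_start").over(w))
w_run = Window.partitionBy("some_id", "run_id").orderBy("timestamp_column")
df3 = df3.withColumn("b", f.row_number().over(w_run)).drop("run_start", "run_id")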
Consider a data set with a ranking:
+--------+----+-----------+--------------+
| colA|colB|colA_rank |colA_rank_mean|
+--------+----+-----------+--------------+
| 21| 50| 1| 1|
| 9| 23| 2| 2.5|
| 9| 21| 3| 2.5|
| 8| 21| 4| 3|
| 2| 21| 5| 5.5|
| 2| 5| 6| 5.5|
| 1| 5| 7| 7.5|
| 1| 4| 8| 7.5|
| 0| 4| 9| 11|
| 0| 3| 10| 11|
| 0| 3| 11| 11|
| 0| 2| 12| 11|
| 0| 2| 13| 11|
+--------+----+-----------+--------------+
colA_rank is a normal ranking, while with colA_rank_mean I would like to resolve ties by replacing the ranking with the mean rank of the tied rows. Is this achievable in a single pass, with some particular ranking method?
Currently I am thinking of two passes, but that seems to require ordering the dataset on colA twice, once without a partition and once with one:
# Step 1: normal rank (descending, so the largest colA gets rank 1)
df = df.withColumn("colA_rank", F.row_number().over(Window.orderBy(F.desc("colA"))))
# Step 2: resolve ties with the mean rank per colA value
df = df.withColumn("colA_rank_mean", F.mean("colA_rank").over(Window.partitionBy("colA")))
I have this input:
timestamp,user
1,A
2,B
5,C
9,E
12,F
The desired result is:
timestampRange,userList
1 to 2,[A,B]
3 to 4,[] or null
5 to 6,[C]
7 to 8,[] or null
9 to 10,[E]
11 to 12,[F]
I tried using a Window, but the problem is that it doesn't include the empty timestamp ranges.
Any hints would be helpful.
I don't know if a windowing function will cover the gaps between ranges, but you can take the following approach:
Define a dataframe, df_ranges:
val ranges = List((1,2), (3,4), (5,6), (7,8), (9,10))
val df_ranges = sc.parallelize(ranges).toDF("start", "end")
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 3| 4|
| 5| 6|
| 7| 8|
| 9| 10|
+-----+---+
Data with the timestamp column, df_data:
val data = List((1,"A"), (2,"B"), (5,"C"), (9,"E"))
val df_data = sc.parallelize(data).toDF("timestamp", "user")
+---------+----+
|timestamp|user|
+---------+----+
| 1| A|
| 2| B|
| 5| C|
| 9| E|
+---------+----+
Left join the two dataframes, matching timestamp against the start and end columns:
val joined = df_ranges.join(df_data,
  df_ranges.col("start").equalTo(df_data.col("timestamp"))
    .or(df_ranges.col("end").equalTo(df_data.col("timestamp"))),
  "left")
joined.show()
+-----+---+---------+----+
|start|end|timestamp|user|
+-----+---+---------+----+
| 1| 2| 1| A|
| 1| 2| 2| B|
| 5| 6| 5| C|
| 9| 10| 9| E|
| 3| 4| null|null|
| 7| 8| null|null|
+-----+---+---------+----+
Now do a simple aggregation with the collect_list function:
import org.apache.spark.sql.functions.collect_list

joined.groupBy("start", "end").agg(collect_list("user")).orderBy("start").show()
+-----+---+------------------+
|start|end|collect_list(user)|
+-----+---+------------------+
| 1| 2| [A, B]|
| 3| 4| []|
| 5| 6| [C]|
| 7| 8| []|
| 9| 10| [E]|
+-----+---+------------------+
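As a hedged PySpark variant, the ranges do not have to be hardcoded: they can be generated from the min/max timestamp with sequence and explode (Spark 2.4+), assuming buckets of width 2 that start at the minimum timestamp. This also picks up the 11 to 12 bucket from the original input:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df_data = spark.createDataFrame(
    [(1, "A"), (2, "B"), (5, "C"), (9, "E"), (12, "F")], ["timestamp", "user"])

# build one row per 2-wide bucket covering the observed timestamps
bounds = df_data.agg(F.min("timestamp").alias("lo"), F.max("timestamp").alias("hi"))
df_ranges = (bounds
    .select(F.explode(F.sequence(F.col("lo"), F.col("hi"), F.lit(2))).alias("start"))
    .withColumn("end", F.col("start") + 1))

# the left join keeps the empty buckets; collect_list then yields [] for them
(df_ranges
    .join(df_data,
          (df_data["timestamp"] >= df_ranges["start"]) &
          (df_data["timestamp"] <= df_ranges["end"]),
          "left")
    .groupBy("start", "end")
    .agg(F.collect_list("user").alias("userList"))
    .orderBy("start")
    .show())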