I am new to Spark and have some doubts about working with DataFrames.
My problem is that I need to apply a formula to a PySpark DataFrame column using values from other columns.
I have the following df:
+-------+-------+-------+-------+-------+-------+
| count1| count2|val__00|val__01|val__02|val__03|
+-------+-------+-------+-------+-------+-------+
| 3| 1| 1.0| 0.0| 8.0| 0.0|
| 4| 2| 0.0| 1.379| 1.49| 1.373|
| 2| 5| 0.7| 0.0| 70.4| 0.0|
| 1| 8| 0.0| 4.0| 0.0| 0.0|
+-------+-------+-------+-------+-------+-------+
I need to apply the following formula to the val__xx columns in each row:
val__xx = val__xx + (count1*count2)
so the final df will be:
+-------+-------+-------+-------+-------+-------+
| count1| count2|val__00|val__01|val__02|val__03|
+-------+-------+-------+-------+-------+-------+
| 3| 1| 4.0| 3.0| 11.0| 3.0|
| 4| 2| 8.0| 9.379| 9.49| 9.373|
| 2| 5| 10.7| 10.0| 80.4| 10.0|
| 1| 8| 8.0| 12.0| 8.0| 8.0|
+-------+-------+-------+-------+-------+-------+
I am thinking of applying a UDF, but I don't know how to pass more than one column. Is it possible to write a function that takes more than one column?
I have implemented the code below, but I don't know how to pass the val__xx columns:
def calculate(c, count1, count2):
    return c + (count1 * count2)

calculateUDF = udf(lambda x: calculate(x, count1, count2))
df_final = df.apply(calculateUDF(col(val__xx????), col(count1), col(count2))
You can do this with withColumn in a for loop; there is no need for a UDF.
from pyspark.sql import functions as f
for i in range(4):
    df = df.withColumn(f'val__0{i}', f.col('count1') * f.col('count2') + f.col(f'val__0{i}'))
df.show()
+------+------+-------+-------+-------+-------+
|count1|count2|val__00|val__01|val__02|val__03|
+------+------+-------+-------+-------+-------+
| 3| 1| 4.0| 3.0| 11.0| 3.0|
| 4| 2| 8.0| 9.379| 9.49| 9.373|
| 2| 5| 10.7| 10.0| 80.4| 10.0|
| 1| 8| 8.0| 12.0| 8.0| 8.0|
+------+------+-------+-------+-------+-------+
If the number of 'value' columns reaches double digits, you'll need to left-pad i with zeros.
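For example, a zero-padded format spec keeps the names consistent; a minimal sketch, assuming (hypothetically) twelve value columns with two-digit suffixes:
for i in range(12):
    name = f'val__{i:02d}'  # val__00, val__01, ..., val__11
    df = df.withColumn(name, f.col('count1') * f.col('count2') + f.col(name))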
I have a DataFrame like the one below in PySpark:
df = spark.createDataFrame(
    [
        ('14_100_00', 'A', 25, 0),
        ('14_100_00', 'A', 0, 24),
        ('15_100_00', 'A', 20, 1),
        ('150_100', 'C', 21, 0),
        ('16', 'A', 0, 20),
        ('16', 'A', 20, 0)], ("rust", "name", "value_1", "value_2"))
df.show()
+---------+----+-------+-------+
| rust|name|value_1|value_2|
+---------+----+-------+-------+
|14_100_00| A| 25| 0|
|14_100_00| A| 0| 24|
|15_100_00| A| 20| 1|
| 150_100| C| 21| 0|
| 16| A| 0| 20|
| 16| A| 20| 0|
+---------+----+-------+-------+
I am trying to update the value_1 and value_2 columns based on the conditions below:
when the rust and name columns are the same, use the sum of value_1 as value_1 for that group
when the rust and name columns are the same, use the sum of value_2 as value_2 for that group
Expected result:
+---------+----+-------+-------+
| rust|name|value_1|value_2|
+---------+----+-------+-------+
|14_100_00| A| 25| 24|
|15_100_00| A| 20| 1|
| 150_100| C| 21| 0|
| 16| A| 20| 20|
+---------+----+-------+-------+
I have tried this:
df1 = df.withColumn("VALUE_1", f.sum("VALUE_1").over(Window.partitionBy("rust", "name"))).withColumn("VALUE_2", f.sum("VALUE_2").over(Window.partitionBy("rust", "name")))
df1.show()
+---------+----+-------+-------+
| rust|name|VALUE_1|VALUE_2|
+---------+----+-------+-------+
| 150_100| C| 21| 0|
| 16| A| 20| 20|
| 16| A| 20| 20|
|14_100_00| A| 25| 24|
|14_100_00| A| 25| 24|
|15_100_00| A| 20| 1|
+---------+----+-------+-------+
Is there a better way to achieve this without having duplicates?
Use groupBy instead of window functions:
df1 = df.groupBy("rust", "name").agg(
    F.sum("value_1").alias("value_1"),
    F.sum("value_2").alias("value_2"),
)
df1.show()
#+---------+----+-------+-------+
#| rust|name|value_1|value_2|
#+---------+----+-------+-------+
#|14_100_00| A| 25| 24|
#|15_100_00| A| 20| 1|
#| 150_100| C| 21| 0|
#| 16| A| 20| 20|
#+---------+----+-------+-------+
I'm using the function below to explode a deeply nested JSON (it has nested structs and arrays).
# Flatten nested df
def flatten_df(nested_df):
    for col in nested_df.columns:
        array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
        for col in array_cols:
            nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col]))
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    return flatten_df(flat_df)
I'm successfully able to explode. But I also want to add the order or index of the elements in the exploded DataFrame, so in the above code I replaced the explode_outer function with posexplode_outer. But I get the error below:
An error was encountered:
'The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases'
I tried changing nested_df.withColumn to nested_df.select, but I wasn't successful. Can anyone help me explode the nested JSON and at the same time keep the order of the array elements as a column in the exploded DataFrame?
Read the JSON data as a DataFrame and create a view or table. In Spark SQL you can use any number of LATERAL VIEW EXPLODE clauses with alias references. If the JSON data is a struct type, you can use dot notation to reach into the structure, e.g. level1.level2.
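For illustration, a minimal sketch of that Spark SQL route; the view name json_view and the columns id, items and payload.level1.level2 are only assumed example names, not from the original data:
df.createOrReplaceTempView("json_view")
exploded = spark.sql("""
    SELECT j.id,
           x.pos  AS item_pos,   -- index of the element within the array
           x.item AS item,
           j.payload.level1.level2 AS level2_value  -- dot notation into a struct
    FROM json_view j
    LATERAL VIEW OUTER POSEXPLODE(j.items) x AS pos, item
""")
exploded.show()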
Replace nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col])) with nested_df = nested_df.selectExpr("*", f"posexplode({col}) as (position,col)").drop(col)
You might need to write some logic to rename the columns back to the originals, but it should be simple.
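A minimal, self-contained sketch of what that replacement does (the toy columns id and vals are assumptions, not from the original data):
df = spark.createDataFrame([(1, ["a", "b"]), (2, ["c"])], ["id", "vals"])
col = "vals"
# posexplode emits two columns: the element position and the element value (named col here)
nested_df = df.selectExpr("*", f"posexplode({col}) as (position, col)").drop(col)
nested_df.show()
# id=1 yields two rows (positions 0 and 1); id=2 yields one row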
The error is because posexplode_outer returns two columns, pos and col, so you cannot use it with withColumn(). It can be used in a select, as shown in the code below.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
tst = sqlContext.createDataFrame([(1,7,80),(1,8,40),(1,5,100),(5,8,90),(7,6,50),(0,3,60)], schema=['col1','col2','col3'])
tst_new = tst.withColumn("arr", F.array(tst.columns))
expr = tst.columns
expr.append(F.posexplode_outer('arr'))
#%%
tst_explode = tst_new.select(*expr)
Results:
tst_explode.show()
+----+----+----+---+---+
|col1|col2|col3|pos|col|
+----+----+----+---+---+
| 1| 7| 80| 0| 1|
| 1| 7| 80| 1| 7|
| 1| 7| 80| 2| 80|
| 1| 8| 40| 0| 1|
| 1| 8| 40| 1| 8|
| 1| 8| 40| 2| 40|
| 1| 5| 100| 0| 1|
| 1| 5| 100| 1| 5|
| 1| 5| 100| 2|100|
| 5| 8| 90| 0| 5|
| 5| 8| 90| 1| 8|
| 5| 8| 90| 2| 90|
| 7| 6| 50| 0| 7|
| 7| 6| 50| 1| 6|
| 7| 6| 50| 2| 50|
| 0| 3| 60| 0| 0|
| 0| 3| 60| 1| 3|
| 0| 3| 60| 2| 60|
+----+----+----+---+---+
If you need to rename the columns, you can use the .withColumnRenamed() function:
df_final = tst_explode.withColumnRenamed('pos', 'position').withColumnRenamed('col', 'column')
You can try select with a list comprehension to posexplode the ArrayType columns in your existing code:
for col in array_cols:
    nested_df = nested_df.select([F.posexplode_outer(col).alias(col+'_pos', col) if c == col else c for c in nested_df.columns])
Example:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1,"n1", ["a", "b", "c"]),(2,"n2", ["foo", "bar"])],["id", "name", "vals"])
#+---+----+----------+
#| id|name| vals|
#+---+----+----------+
#| 1| n1| [a, b, c]|
#| 2| n2|[foo, bar]|
#+---+----+----------+
col = "vals"
df.select([F.posexplode_outer(col).alias(col+'_pos', col) if c == col else c for c in df.columns]).show()
#+---+----+--------+----+
#| id|name|vals_pos|vals|
#+---+----+--------+----+
#| 1| n1| 0| a|
#| 1| n1| 1| b|
#| 1| n1| 2| c|
#| 2| n2| 0| foo|
#| 2| n2| 1| bar|
#+---+----+--------+----+
I have a Spark DataFrame like the one below:
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 2|null|null|null| 102| 202| 302|
| 4|null|null|null| 104| 204| 304|
| 1|null|null|null| 101| 201| 301|
| 3|null|null|null| 103| 203| 303|
| 1| 11| 21| 31|null|null|null|
| 2| 12| 22| 32|null|null|null|
| 4| 14| 24| 34|null|null|null|
| 3| 13| 23| 33|null|null|null|
+---+----+----+----+----+----+----+
I want to transform the DataFrame as below by merging the null rows:
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 1| 11| 21| 31| 101| 201| 301|
| 2| 12| 22| 32| 102| 202| 302|
| 4| 14| 24| 34| 104| 204| 304|
| 3| 13| 23| 33| 103| 203| 303|
+---+----+----+----+----+----+----+
Preferably in Scala.
You can group on id and aggregate using first with ignorenulls for other columns:
import pyspark.sql.functions as F
(df.groupBy('id').agg(*[F.first(x,ignorenulls=True) for x in df.columns if x!='id'])
.show())
+---+----+----+----+-----+-----+-----+
| id| 1| 2| 3| sf_1| sf_2| sf_3|
+---+----+----+----+-----+-----+-----+
| 1|11.0|21.0|31.0|101.0|201.0|301.0|
| 3|13.0|23.0|33.0|103.0|203.0|303.0|
| 2|12.0|22.0|32.0|102.0|202.0|302.0|
| 4|14.0|24.0|34.0|104.0|204.0|304.0|
+---+----+----+----+-----+-----+-----+
The Scala way of doing it:
val inputColumns = inputLoadDF.columns.toList.drop(0)
val exprs = inputColumns.map(x => first(x,true))
inputLoadDF.groupBy("id").agg(exprs.head,exprs.tail:_*).show()
Consider a data set with a ranking:
+--------+----+-----------+--------------+
| colA|colB|colA_rank |colA_rank_mean|
+--------+----+-----------+--------------+
| 21| 50| 1| 1|
| 9| 23| 2| 2.5|
| 9| 21| 3| 2.5|
| 8| 21| 4| 4|
| 2| 21| 5| 5.5|
| 2| 5| 6| 5.5|
| 1| 5| 7| 7.5|
| 1| 4| 8| 7.5|
| 0| 4| 9| 11|
| 0| 3| 10| 11|
| 0| 3| 11| 11|
| 0| 2| 12| 11|
| 0| 2| 13| 11|
+--------+----+-----------+--------------+
colA_rank is a normal ranking, while with colA_rank_mean I would like to resolve ties by replacing the ranking with the mean rank of the ties. Is this achievable in a single pass with some particular ranking method?
Currently I am thinking of two passes, but that seems to require ordering the dataset twice on colA, once without a partition and once with a partition.
# Step 1: normal rank
df = df.withColumn("colA_rank", F.row_number().over(Window.orderBy(F.desc("colA"))))
# Step 2: resolve ties
df = df.withColumn("colA_rank_mean", F.mean("colA_rank").over(Window.partitionBy("colA")))
I have this input:
timestamp,user
1,A
2,B
5,C
9,E
12,F
The wanted result is:
timestampRange,userList
1 to 2,[A,B]
3 to 4,[] Or null
5 to 6,[C]
7 to 8,[] Or null
9 to 10,[E]
11 to 12,[F]
I tried using Window, but the problem is that it doesn't include the empty timestamp ranges.
Any hints would be helpful.
I don't know if a windowing function will cover the gaps between the ranges, but you can take the following approach:
Define a dataframe, df_ranges:
val ranges = List((1,2), (3,4), (5,6), (7,8), (9,10))
val df_ranges = sc.parallelize(ranges).toDF("start", "end")
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 3| 4|
| 5| 6|
| 7| 8|
| 9| 10|
+-----+---+
Data with the timestamp column, df_data:
val data = List((1,"A"), (2,"B"), (5,"C"), (9,"E"))
val df_data = sc.parallelize(data).toDF("timestamp", "user")
+---------+----+
|timestamp|user|
+---------+----+
| 1| A|
| 2| B|
| 5| C|
| 9| E|
+---------+----+
Join the two DataFrames on the start, end and timestamp columns:
val df_joined = df_ranges.join(df_data, df_ranges.col("start").equalTo(df_data.col("timestamp")).or(df_ranges.col("end").equalTo(df_data.col("timestamp"))), "left")
+-----+---+---------+----+
|start|end|timestamp|user|
+-----+---+---------+----+
| 1| 2| 1| A|
| 1| 2| 2| B|
| 5| 6| 5| C|
| 9| 10| 9| E|
| 3| 4| null|null|
| 7| 8| null|null|
+-----+---+---------+----+
Now do a simple aggregation with the collect_list function:
df_joined.groupBy("start", "end").agg(collect_list("user")).orderBy("start").show()
+-----+---+------------------+
|start|end|collect_list(user)|
+-----+---+------------------+
| 1| 2| [A, B]|
| 3| 4| []|
| 5| 6| [C]|
| 7| 8| []|
| 9| 10| [E]|
+-----+---+------------------+