PySpark divide column by its sum [duplicate]

This question already has an answer here:
How to divide a column by its sum in a Spark DataFrame
(1 answer)
Closed 4 years ago.
I am trying to divide columns in PySpark by their respective sums. My dataframe (using only one column here) looks like this:
event_rates = [[1,10.461016949152542], [2, 10.38953488372093], [3, 10.609418282548477]]
event_rates = spark.createDataFrame(event_rates, ['cluster_id','mean_encoded'])
event_rates.show()
+----------+------------------+
|cluster_id| mean_encoded|
+----------+------------------+
| 1|10.461016949152542|
| 2| 10.38953488372093|
| 3|10.609418282548477|
+----------+------------------+
I tried two methods to do this but failed to get results.
from pyspark.sql.functions import sum as spark_sum

cols = event_rates.columns[1:]
for each in cols:
    event_rates = event_rates.withColumn(each + "_scaled", event_rates[each] / spark_sum(event_rates[each]))
This gives me the following error
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and '`cluster_id`' is not an aggregate function. Wrap '((`mean_encoded` / sum(`mean_encoded`)) AS `mean_encoded_scaled`)' in windowing function(s) or wrap '`cluster_id`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [cluster_id#22356L, mean_encoded#22357, (mean_encoded#22357 / sum(mean_encoded#22357)) AS mean_encoded_scaled#2
and, following the question here, I tried the following:
stats = (event_rates.agg([spark_sum(x).alias(x + '_sum') for x in cols]))
event_rates = event_rates.join(broadcast(stats))
exprs = [event_rates[x] / event_rates[x + '_sum'] for x in cols]
event_rates.select(exprs)
But I get an error from the first line stating
AssertionError: all exprs should be Column
How do I get around this?

Here is an example of how to divide the column mean_encoded by its sum. You need to compute the sum of the column first, then crossJoin it back to the original dataframe. After that, you can divide any column by its sum.
import pyspark.sql.functions as fn

event_rates = event_rates.crossJoin(
    event_rates.groupby().agg(fn.sum('mean_encoded').alias('sum_mean_encoded'))
)
event_rates_div = event_rates.select('cluster_id',
                                     'mean_encoded',
                                     fn.col('mean_encoded') / fn.col('sum_mean_encoded'))
Output
+----------+------------------+---------------------------------+
|cluster_id| mean_encoded|(mean_encoded / sum_mean_encoded)|
+----------+------------------+---------------------------------+
| 1|10.461016949152542| 0.3325183371367686|
| 2| 10.38953488372093| 0.3302461777809474|
| 3|10.609418282548477| 0.3372354850822839|
+----------+------------------+---------------------------------+
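As the error message in the question suggests, the same result can also be obtained without a join by computing the sum over an unpartitioned window. This is a minimal sketch of that alternative (my addition, not part of the original answer):
from pyspark.sql import Window
import pyspark.sql.functions as fn

# An unpartitioned window spans every row, so the windowed sum is the column total
w = Window.partitionBy()
event_rates_div = event_rates.withColumn(
    'mean_encoded_scaled',
    fn.col('mean_encoded') / fn.sum('mean_encoded').over(w)
)
Note that an unpartitioned window pulls all rows into a single partition, which is fine for a small dataframe like this one but can become a bottleneck at scale.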

Try this out:
from pyspark.sql import functions as F
total = event_rates.groupBy().agg(F.sum("mean_encoded"),F.sum("cluster_id")).collect()
total
The answer will be:
[Row(sum(mean_encoded)=31.459970115421946, sum(cluster_id)=6)]
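This answer stops after collecting the totals; a possible continuation (my sketch, not part of the original answer) divides the column by its collected sum using lit:
from pyspark.sql import functions as F

# total is a list with a single Row; pull the sums out by position
sum_mean_encoded = total[0][0]
event_rates = event_rates.withColumn("mean_encoded_scaled",
                                     F.col("mean_encoded") / F.lit(sum_mean_encoded))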

Related

Get last / delimited value from Dataframe column in PySpark

I am trying to get the last string after '/'.
The column can look like this: "lala/mae.da/rg1/zzzzz" (not necessarily only 3 slashes), and I'd like to return: zzzzz
In SQL and Python it's very easy, but I would like to know if there is a way to do it in PySpark.
Solving it in Python:
original_string = "lala/mae.da/rg1/zzzzz"
last_char_index = original_string.rfind("/")
new_string = original_string[last_char_index+1:]
or directly:
new_string = original_string.rsplit('/', 1)[1]
And in SQL:
RIGHT(MyColumn, CHARINDEX('/', REVERSE(MyColumn))-1)
For PySpark I was thinking something like this:
df = df.select(col("MyColumn").rsplit('/', 1)[1])
but I get the following error: TypeError: 'Column' object is not callable and I am not even sure Spark allows me to do rsplit at all.
Do you have any suggestion on how can I solve this?
Adding another solution even though Pav3k's answer is great: element_at, which gets an item at a specific position out of an array:
from pyspark.sql import functions as F

df = df.withColumn('my_col_split', F.split(df['MyColumn'], '/'))\
       .select('MyColumn',
               F.element_at(F.col('my_col_split'), -1).alias('rsplit'))
>>> df.show(truncate=False)
+---------------------+------+
|MyColumn |rsplit|
+---------------------+------+
|lala/mae.da/rg1/zzzzz|zzzzz |
|fefe |fefe |
|fe/fe/frs/fs/fe32/4 |4 |
+---------------------+------+
Pav3k's DataFrame was used; it is built as follows:
import pandas as pd
from pyspark.sql import functions as F
df = pd.DataFrame({"MyColumn": ["lala/mae.da/rg1/zzzzz", "fefe", "fe/fe/frs/fs/fe32/4"]})
df = spark.createDataFrame(df)
df.show(truncate=False)
# output
+---------------------+
|MyColumn |
+---------------------+
|lala/mae.da/rg1/zzzzz|
|fefe |
|fe/fe/frs/fs/fe32/4 |
+---------------------+
(
    df
    .withColumn("NewCol",
                F.split("MyColumn", "/")
                )
    .withColumn("NewCol", F.col("NewCol")[F.size("NewCol") - 1])
    .show()
)
# output
+--------------------+------+
| MyColumn|NewCol|
+--------------------+------+
|lala/mae.da/rg1/z...| zzzzz|
| fefe| fefe|
| fe/fe/frs/fs/fe32/4| 4|
+--------------------+------+
Since Spark 2.4, you can use the split built-in function to split your string, then the element_at built-in function to get the last element of the resulting array, as follows:
from pyspark.sql import functions as F
df = df.select(F.element_at(F.split(F.col("MyColumn"), '/'), -1))
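Another option, assuming the delimiter is always a literal '/', is the substring_index built-in function, which keeps everything after the last occurrence when given a negative count (my addition, not from the original answers):
from pyspark.sql import functions as F

# substring_index(col, '/', -1) returns the substring after the last '/'
df = df.withColumn("rsplit", F.substring_index(F.col("MyColumn"), "/", -1))
For a value with no '/' at all (such as "fefe"), substring_index simply returns the whole string, matching the behaviour shown above.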

Lead/Lag and Window Function with concat function

I have to transform the data by merging lines until |#| is found in the data.
Output Needed
I have transformed the data using the lead/lag functions but am unsure how to proceed.
from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark.sql.functions import *
df = spark.read.text('text.dat')
# Add an index column so each row gets its row number; Spark distributes the data, so we need this step to maintain the order of the rows
df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
df_1.createOrReplaceTempView("linenumber")
# zipWithIndex nested the original value in a struct; extract it back out as a string
df_2 = spark.sql("select value.value as value , index from linenumber")
df_2.createOrReplaceTempView("linenumber2")
# Split and extract the location value from the header, assigning null otherwise
df_new = spark.sql("select value,case when value like '%|##|' then value else null end as orgval,case when value like '%|#|' then 1 else 0 end as valrow,index from linenumber2")
w = Window().partitionBy().orderBy(col("index"))
df_new=df_new.select("*", lag("valrow").over(w).alias("validrows"))
df_new.createOrReplaceTempView("linenumber3")
spark.sql("select * from linenumber3 order by index").show(100)
Please help.
Here is my code and explanation:
from pyspark.sql import functions as f, Row
from pyspark.sql.window import Window
df = spark.createDataFrame([
    Row(Value='A', LineNumber=6),
    Row(Value='B', LineNumber=7),
    Row(Value='C', LineNumber=8),
    Row(Value='D|#|', LineNumber=9),
    Row(Value='A|#|', LineNumber=10),
    Row(Value='E', LineNumber=11),
    Row(Value='F', LineNumber=12),
    Row(Value='G|#|', LineNumber=13),
    Row(Value='I', LineNumber=23),
    Row(Value='J', LineNumber=24),
    Row(Value='K', LineNumber=25),
    Row(Value='L', LineNumber=25)
])
df = df.withColumn('filename', f.input_file_name())
df = df.repartition('filename')
w = Window.partitionBy('filename').orderBy('index')
# Creating an id to enable window functions
df = df.withColumn('index', f.monotonically_increasing_id())
# Identifying if the previous row has |#| delimiter
df = df.withColumn('delimiter', f.lag('Value', default=False).over(w).contains('|#|'))
# Creating a column to group all values that must be concatenated
df = df.withColumn('group', f.sum(f.col('delimiter').cast('int')).over(w))
# Grouping them, removing |#|, collecting all values and concatenate them
df = (df
      .groupBy('group')
      .agg(f.concat_ws(',', f.collect_list(f.regexp_replace('Value', r'\|#\|', ''))).alias('ConcalValue'),
           f.min('LineNumber').alias('LineNumber')))
# Selecting only desired columns
(df
 .select(f.col('ConcalValue').alias('Concal Value'), f.col('LineNumber').alias('Initial Line Number'))
 .sort('LineNumber')
 .show(truncate=False))
Output:
+------------+-------------------+
|Concal Value|Initial Line Number|
+------------+-------------------+
| A,B,C,D| 6|
| A| 10|
| E,F,G| 11|
| I,J,K,L| 23|
+------------+-------------------+

pyspark - Grouping and calculating data

I have the following csv file.
Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt
0,1424696633908,1424696631913248572,-5.958191,0.6880646,8.135345,a,nexus4,nexus4_1,stand
1,1424696633909,1424696631918283972,-5.95224,0.6702118,8.136536,a,nexus4,nexus4_1,stand
2,1424696633918,1424696631923288855,-5.9950867,0.6535491999999999,8.204376,a,nexus4,nexus4_1,stand
3,1424696633919,1424696631928385290,-5.9427185,0.6761626999999999,8.128204,a,nexus4,nexus4_1,stand
I have to create an RDD where USER, MODEL and GT are the PRIMARY KEY; I don't know if I have to do it using them as a tuple.
Then, once I have the primary key fields, I have to calculate the AVG, MAX and MIN of 'x', 'y' and 'z'.
Here is an output:
User,Model,gt,media(x,y,z),desviacion(x,y,z),max(x,y,z),min(x,y,z)
a, nexus4,stand,-3.0,0.7,8.2,2.8,0.14,0.0,-1.0,0.8,8.2,-5.0,0.6,8.2
Any idea how to group them and, for example, get the mean (media) values from "x"?
With my current code I get the following.
# Data loading
lectura = sc.textFile("Phones_accelerometer.csv")
datos = lectura.map(lambda x: ((x.split(",")[6], x.split(",")[7], x.split(",")[9]),(x.split(",")[3], x.split(",")[4], x.split(",")[5])))
sumCount = datos.combineByKey(lambda value: (value, 1), lambda x, value: (x[0] + value, x[1] + 1), lambda x, y: (x[0] + y[0], x[1] + y[1]))
An example of my tuples:
[(('a', 'nexus4', 'stand'), ('-5.958191', '0.6880646', '8.135345'))]
If you have the csv data in a file as given in the question, then you can use sqlContext to read it as a dataframe and cast the columns to the appropriate types:
df = sqlContext.read.format("com.databricks.spark.csv").option("header", True).load("path to csv file")
import pyspark.sql.functions as F
import pyspark.sql.types as T
df = df.select(F.col('User'), F.col('Model'), F.col('gt'), F.col('x').cast('float'), F.col('y').cast('float'), F.col('z').cast('float'))
I have selected only the primary keys and the necessary columns, which should give you
+----+------+-----+----------+---------+--------+
|User|Model |gt |x |y |z |
+----+------+-----+----------+---------+--------+
|a |nexus4|stand|-5.958191 |0.6880646|8.135345|
|a |nexus4|stand|-5.95224 |0.6702118|8.136536|
|a |nexus4|stand|-5.9950867|0.6535492|8.204376|
|a |nexus4|stand|-5.9427185|0.6761627|8.128204|
+----+------+-----+----------+---------+--------+
All of your requirements (mean, deviation, max and min) depend on the list of x, y and z values when grouped by the primary keys User, Model and gt.
So you would need the groupBy and collect_list built-in functions and a udf function to calculate all of your requirements. The final step is to separate them into different columns, as shown below.
from math import sqrt

def calculation(array):
    num_items = len(array)
    print(num_items, sum(array))  # debug output, visible in the executor logs
    mean = sum(array) / num_items
    differences = [x - mean for x in array]
    sq_differences = [d ** 2 for d in differences]
    ssd = sum(sq_differences)
    variance = ssd / (num_items - 1)
    sd = sqrt(variance)
    # return the mean, sample standard deviation, max and min of the collected values
    return [mean, sd, max(array), min(array)]
calcUdf = F.udf(calculation, T.ArrayType(T.FloatType()))
df.groupBy('User', 'Model', 'gt')\
    .agg(calcUdf(F.collect_list(F.col('x'))).alias('x'),
         calcUdf(F.collect_list(F.col('y'))).alias('y'),
         calcUdf(F.collect_list(F.col('z'))).alias('z'))\
    .select(F.col('User'), F.col('Model'), F.col('gt'),
            F.col('x')[0].alias('mean_x'), F.col('y')[0].alias('mean_y'), F.col('z')[0].alias('mean_z'),
            F.col('x')[1].alias('deviation_x'), F.col('y')[1].alias('deviation_y'), F.col('z')[1].alias('deviation_z'),
            F.col('x')[2].alias('max_x'), F.col('y')[2].alias('max_y'), F.col('z')[2].alias('max_z'),
            F.col('x')[3].alias('min_x'), F.col('y')[3].alias('min_y'), F.col('z')[3].alias('min_z'))\
    .show(truncate=False)
So finally you should have
+----+------+-----+---------+---------+--------+-----------+-----------+-----------+----------+---------+--------+----------+---------+--------+
|User|Model |gt |mean_x |mean_y |mean_z|deviation_x|deviation_y|deviation_z|max_x |max_y |max_z |min_x |min_y |min_z |
+----+------+-----+---------+---------+--------+-----------+-----------+-----------+----------+---------+--------+----------+---------+--------+
|a |nexus4|stand|-5.962059|0.6719971|8.151115|0.022922019|0.01436464 |0.0356973 |-5.9427185|0.6880646|8.204376|-5.9950867|0.6535492|8.128204|
+----+------+-----+---------+---------+--------+-----------+-----------+-----------+----------+---------+--------+----------+---------+--------+
I hope the answer is helpful.
You'll have to use groupByKey to get the median. While generally not preferred for performance reasons, finding the median of a list of numbers cannot be parallelized easily: the logic requires the entire list of numbers. groupByKey is the aggregation method to use when you need to process all the values for a key at the same time.
Also, as mentioned in the comments, this task would be easier using Spark DataFrames.
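To illustrate the DataFrame route, here is a sketch (my addition, not part of the original answers) that computes the same statistics with built-in aggregate functions only, avoiding the Python UDF:
from pyspark.sql import functions as F

stats = df.groupBy('User', 'Model', 'gt').agg(
    F.avg('x').alias('mean_x'), F.stddev('x').alias('deviation_x'), F.max('x').alias('max_x'), F.min('x').alias('min_x'),
    F.avg('y').alias('mean_y'), F.stddev('y').alias('deviation_y'), F.max('y').alias('max_y'), F.min('y').alias('min_y'),
    F.avg('z').alias('mean_z'), F.stddev('z').alias('deviation_z'), F.max('z').alias('max_z'), F.min('z').alias('min_z'))
stats.show(truncate=False)
F.stddev is the sample standard deviation, matching the variance/(n-1) used in the UDF above. If an actual median is wanted as well, F.expr('percentile_approx(x, 0.5)') (or F.percentile_approx in Spark 3.1+) can be added to the same agg.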

flatMap doesn't preserve order when creating lists from pyspark dataframe columns

I have a PySpark dataframe df:
+---------+------------------+
|ceil_temp| test2|
+---------+------------------+
| -1|[6397024, 6425417]|
| 0|[6397024, 6425417]|
| 0|[6397024, 6425417]|
| 0|[6469640, 6531963]|
| 0|[6469640, 6531963]|
| 1|[6469640, 6531963]|
+---------+------------------+
I eventually want to add a new column (final) to this dataframe whose values are elements of the list in the test2 column, based on the index given by the ceil_temp column. For example: if the ceil_temp column has a value of 0 or less, the final column gets the element at index 0 of the test2 column. Something like this:
+---------+------------------+--------
|ceil_temp| test2|final |
+---------+------------------+--------
| -1|[6397024, 6425417]|6397024|
| 0|[6397024, 6425417]|6397024|
| 0|[6397024, 6425417]|6397024|
| 0|[6469640, 6531963]|6469640|
| 0|[6469640, 6531963]|6469640|
| 1|[6469640, 6531963]|6531963|
+---------+------------------+--------
To achieve this, I tried to extract ceil_temp and test2 as lists using flatMap:
m = df.select("ceil_temp").rdd.flatMap(lambda x: x).collect()
q = df.select("test2").rdd.flatMap(lambda x: x).collect()
l = []
for i in range(len(m)):
    if m[i] < 0:
        m[i] = 0
    else:
        pass
    l.append(q[i][m[i]])
Then I convert this list l to a new dataframe and join it with the original dataframe based on a row index column that I add using a window function:
w = Window().orderBy()
df=df.withColumn("columnindex", rowNumber().over(w)).
However, the order of the lists extracted by flatMap doesn't seem to remain the same as that of the parent dataframe df. I get the following:
m=[-1,0,0,0,0,1]
q=[[6469640, 6531963],[6469640, 6531963],[6469640, 6531963],[6397024, 6425417],[6397024, 6425417],[6397024, 6425417]]
Expected result:
m=[-1,0,0,0,0,1]
q=[[6397024, 6425417],[6397024, 6425417],[6397024, 6425417],[6469640, 6531963],[6469640, 6531963],[6469640, 6531963]]
Please advise on how to achieve the "final" column.
I think you could achieve your desired outcome using a UDF on the rows of your dataframe.
You could then call withColumn with the result of your UDF.
val df = spark.sparkContext.parallelize(List(
  (-1, List(6397024, 6425417)),
  (0, List(6397024, 6425417)),
  (0, List(6397024, 6425417)),
  (0, List(6469640, 6531963)),
  (0, List(6469640, 6531963)),
  (1, List(6469640, 6531963)))).toDF("ceil_temp", "test2")
import org.apache.spark.sql.functions.udf

val selectRightElement = udf {
  (ceilTemp: Int, test2: Seq[Int]) => {
    // dummy code for the example
    if (ceilTemp <= 0) test2(0) else test2(1)
  }
}
df.withColumn("final", selectRightElement(df("ceil_temp"), df("test2"))).show
Doing it like that will prevent shuffling of your row order.
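Since the question is about PySpark and the snippet above is Scala, a rough PySpark translation of the same idea might look like this (my sketch, not part of the original answer):
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

@F.udf(returnType=IntegerType())
def select_right_element(ceil_temp, test2):
    # mirror the dummy logic of the Scala example
    return test2[0] if ceil_temp <= 0 else test2[1]

df = df.withColumn("final", select_right_element(F.col("ceil_temp"), F.col("test2")))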
I solved the above issue by:
df=df.withColumn("final",(df.test2).getItem(df.ceil_temp))
