Merge Rows in Apache Spark by eliminating null values - Python

I have a Spark data frame like the one below
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 2|null|null|null| 102| 202| 302|
| 4|null|null|null| 104| 204| 304|
| 1|null|null|null| 101| 201| 301|
| 3|null|null|null| 103| 203| 303|
| 1| 11| 21| 31|null|null|null|
| 2| 12| 22| 32|null|null|null|
| 4| 14| 24| 34|null|null|null|
| 3| 13| 23| 33|null|null|null|
+---+----+----+----+----+----+----+
I want to transform the data frame as below by merging the rows containing nulls
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 1| 11| 21| 31| 101| 201| 301|
| 2| 12| 22| 32| 102| 202| 302|
| 4| 14| 24| 34| 104| 204| 304|
| 3| 13| 23| 33| 103| 203| 303|
+---+----+----+----+----+----+----+
Preferably in Scala.

You can group on id and aggregate the other columns with first using ignorenulls=True:
import pyspark.sql.functions as F
(df.groupBy('id')
   .agg(*[F.first(x, ignorenulls=True) for x in df.columns if x != 'id'])
   .show())
+---+----+----+----+-----+-----+-----+
| id| 1| 2| 3| sf_1| sf_2| sf_3|
+---+----+----+----+-----+-----+-----+
| 1|11.0|21.0|31.0|101.0|201.0|301.0|
| 3|13.0|23.0|33.0|103.0|203.0|303.0|
| 2|12.0|22.0|32.0|102.0|202.0|302.0|
| 4|14.0|24.0|34.0|104.0|204.0|304.0|
+---+----+----+----+-----+-----+-----+
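Depending on the Spark version, the aggregated columns can come back named first(colname, true) rather than with their original names; if that happens, you can alias each aggregate back. A minimal sketch of that variant:
import pyspark.sql.functions as F

# alias each aggregated column back to its original name
(df.groupBy('id')
   .agg(*[F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != 'id'])
   .show())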

The Scala way of doing it:
import org.apache.spark.sql.functions.first

val inputColumns = inputLoadDF.columns.toList.drop(1)   // drop the leading "id" column
val exprs = inputColumns.map(x => first(x, ignoreNulls = true))
inputLoadDF.groupBy("id").agg(exprs.head, exprs.tail: _*).show()

Related

Join dataframes matching a column with a range determined by two columns in the other one with PySpark

I have a df on the left like this one:
+----+-----+
| id|value|
+----+-----+
| 2| xx|
| 4| xx|
| 11| xx|
| 14| xx|
| 27| xx|
| 28| xx|
| 56| xx|
| 55| xx|
+----+-----+
And another one on the right like this one:
+-----+---+----+
|start|end| ov |
+-----+---+----+
| 0| 9| A|
| 10| 19| B|
| 20| 29| C|
| 30| 39| D|
| 40| 49| F|
+-----+---+----+
And I need to join the rows where the id from the first table falls between start and end in the second table. The output should look like this:
+----+-----+----+
| id|value| ov |
+----+-----+----+
| 2| xx| A|
| 4| xx| A|
| 11| xx| B|
| 14| xx| B|
| 27| xx| C|
| 28| xx| C|
| 56| xx| |
| 55| xx| |
+----+-----+----+
How can I achieve this result with PySpark?
Use the between operator with a left join.
Example:
#using dataframes api
df.join(df1, (df['id'] >= df1['start']) & (df['id'] <= df1['end']), 'left')\
  .select(df['*'], df1['ov'])\
  .show(10, False)
#using spark sql api
df.createOrReplaceTempView("t1")
df1.createOrReplaceTempView("t2")
spark.sql("select t1.*,t2.ov from t1 left join t2 on t1.id between t2.start and t2.end").show()
#sample output (the test data here differs slightly from the question's)
#+---+-----+----+
#| id|value| ov|
#+---+-----+----+
#| 2| xx| A|
#| 4| xx| A|
#| 55| zz|null|
#+---+-----+----+
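If you prefer to stay in the DataFrame API throughout, the same condition can also be written with between; a minimal sketch equivalent to the >=/<= join above:
#between(lower, upper) is inclusive on both bounds, matching the >=/<= condition
df.join(df1, df['id'].between(df1['start'], df1['end']), 'left')\
  .select(df['*'], df1['ov'])\
  .show(10, False)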

Compute median of a column in PySpark

I have a dataframe as shown below:
+-----------+------------+
|parsed_date| count|
+-----------+------------+
| 2017-12-16| 2|
| 2017-12-16| 2|
| 2017-12-17| 2|
| 2017-12-17| 2|
| 2017-12-18| 1|
| 2017-12-19| 4|
| 2017-12-19| 4|
| 2017-12-19| 4|
| 2017-12-19| 4|
| 2017-12-20| 1|
+-----------+------------+
I want to compute the median of the entire 'count' column and add the result as a new column.
I tried:
median = df.approxQuantile('count',[0.5],0.1).alias('count_median')
But of course I am doing something wrong as it gives the following error:
AttributeError: 'list' object has no attribute 'alias'
Please help.
You need to add a column with withColumn because approxQuantile returns a list of floats, not a Spark column.
import pyspark.sql.functions as F
df2 = df.withColumn('count_media', F.lit(df.approxQuantile('count',[0.5],0.1)[0]))
df2.show()
+-----------+-----+-----------+
|parsed_date|count|count_media|
+-----------+-----+-----------+
| 2017-12-16| 2| 2.0|
| 2017-12-16| 2| 2.0|
| 2017-12-17| 2| 2.0|
| 2017-12-17| 2| 2.0|
| 2017-12-18| 1| 2.0|
| 2017-12-19| 4| 2.0|
| 2017-12-19| 4| 2.0|
| 2017-12-19| 4| 2.0|
| 2017-12-19| 4| 2.0|
| 2017-12-20| 1| 2.0|
+-----------+-----+-----------+
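The third argument to approxQuantile is the relative error; setting it to 0 requests the exact quantile, at the cost of more computation on large data. A minimal sketch (the df_exact / count_median_exact names are just illustrative):
# relativeError = 0 computes the exact quantile (more expensive on large data)
exact_median = df.approxQuantile('count', [0.5], 0.0)[0]
df_exact = df.withColumn('count_median_exact', F.lit(exact_median))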
You can also use the approx_percentile / percentile_approx function in Spark SQL:
import pyspark.sql.functions as F
df2 = df.withColumn('count_media', F.expr("approx_percentile(count, 0.5, 10) over ()"))
df2.show()
+-----------+-----+-----------+
|parsed_date|count|count_media|
+-----------+-----+-----------+
| 2017-12-16| 2| 2|
| 2017-12-16| 2| 2|
| 2017-12-17| 2| 2|
| 2017-12-17| 2| 2|
| 2017-12-18| 1| 2|
| 2017-12-19| 4| 2|
| 2017-12-19| 4| 2|
| 2017-12-19| 4| 2|
| 2017-12-19| 4| 2|
| 2017-12-20| 1| 2|
+-----------+-----+-----------+
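On Spark 3.1 and later, percentile_approx is also exposed directly in pyspark.sql.functions, so the median can be aggregated once and attached with lit instead of a window expression. A sketch assuming Spark 3.1+:
# percentile_approx is available in pyspark.sql.functions from Spark 3.1 onwards
median_value = df.agg(F.percentile_approx('count', 0.5).alias('m')).first()['m']
df2 = df.withColumn('count_media', F.lit(median_value))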

Comparing 3 columns in PySpark

I want to compare 3 columns in PySpark (percentages summing to 100%) and create a new column containing the name of the column with the maximum of the three or, when the maximum is not unique, the names of all columns that share that value. I have seen similar examples here, but they don't handle the case where the maximum is not unique. Below is my brute-force solution, but it takes so long to run that it is useless:
from pyspark.sql.functions import col
import pyspark.sql.functions as F

df\
.withColumn("MaxName",
    F.when((col("A") > col("B")) & (col("A") > col("C")), "A")\
    .when((col("B") > col("A")) & (col("B") > col("C")), "B")\
    .when((col("C") > col("A")) & (col("C") > col("B")), "C")\
    .when((col("A") == col("B")) &\
        (col("A") > col("C")) | (col("B") > col("C")), "AB")\
    .when((col("C") == col("B")) | (col("C") == col("A")) &\
        (col("C") > col("B")) | (col("C") > col("A")), "CAB")\
    .otherwise("ABC"))
Any insights to build a more efficient solution?
If I understand correctly, you can compare each column with greatest, return the matching column names, and then concat them:
Example:
Input:
import numpy as np
import pandas as pd

np.random.seed(111)
df = spark.createDataFrame(pd.DataFrame(np.random.randint(0, 100, (5, 5)),
                                        columns=list('ABCDE')))
df.show()
+---+---+---+---+---+
| A| B| C| D| E|
+---+---+---+---+---+
| 84| 84| 84| 86| 19|
| 41| 66| 82| 40| 71|
| 57| 7| 12| 10| 65|
| 88| 28| 14| 34| 21|
| 54| 72| 37| 76| 58|
+---+---+---+---+---+
Proposed solution:
import pyspark.sql.functions as F
cols = ['A','B','C']
df.withColumn("max_of_ABC",F.concat_ws("",
*[F.when(F.col(i) == F.greatest(*cols),i) for i in cols])).show()
+---+---+---+---+---+----------+
| A| B| C| D| E|max_of_ABC|
+---+---+---+---+---+----------+
| 84| 84| 84| 86| 19| ABC|
| 41| 66| 82| 40| 71| C|
| 57| 7| 12| 10| 65| A|
| 88| 28| 14| 34| 21| A|
| 54| 72| 37| 76| 58| B|
+---+---+---+---+---+----------+
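If you would rather see tied names separated, the same pattern works with a separator, since concat_ws skips the nulls produced by when; a minimal sketch:
# the first row then becomes "A,B,C" instead of "ABC"
df.withColumn("max_of_ABC", F.concat_ws(",",
    *[F.when(F.col(c) == F.greatest(*cols), c) for c in cols])).show()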

How do I replace the null values in a non-numeric column with the mode of that column?

There are null values in the Continent_Name column of my DataFrame, and I wish to replace them with the mode of that column.
+-----------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
| Country_Name|Number_of_Beer_Servings|Number_of_Spirit_Servings|Number_of_Wine_servings|Pure_alcohol_Consumption_litres|Continent_Name|
+-----------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
| Afghanistan| 0| 0| 0| 0.0| AS|
| Albania| 89| 132| 54| 4.9| EU|
| Algeria| 25| 0| 14| 0.7| AF|
| Andorra| 245| 138| 312| 12.4| EU|
| Angola| 217| 57| 45| 5.9| AF|
|Antigua & Barbuda| 102| 128| 45| 4.9| null|
| Argentina| 193| 25| 221| 8.3| SA|
| Armenia| 21| 179| 11| 3.8| EU|
| Australia| 261| 72| 212| 10.4| OC|
| Austria| 279| 75| 191| 9.7| EU|
| Azerbaijan| 21| 46| 5| 1.3| EU|
| Bahamas| 122| 176| 51| 6.3| null|
| Bahrain| 42| 63| 7| 2.0| AS|
| Bangladesh| 0| 0| 0| 0.0| AS|
| Barbados| 143| 173| 36| 6.3| null|
| Belarus| 142| 373| 42| 14.4| EU|
| Belgium| 295| 84| 212| 10.5| EU|
| Belize| 263| 114| 8| 6.8| null|
| Benin| 34| 4| 13| 1.1| AF|
| Bhutan| 23| 0| 0| 0.4| AS|
+-----------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
only showing top 20 rows
I tried in the following way:
for column in df_copy['Continent_Name']:
    df_copy['Continent_Name'].fillna(df_copy['Continent_Name'].mode()[0], inplace=True)
The error that showed up:
TypeError: Column is not iterable
Creating the DataFrame below.
df = spark.createDataFrame([('Afghanistan',0,0,0,0.0,'AS'),('Albania',89,132,54,4.9,'EU'),
('Algeria',25,0,14,0.7,'AF'),('Andorra',245,138,312,12.4,'EU'),
('Angola',217,57,45,5.9,'AF'),('Antigua&Barbuda',102,128,45,4.9,None),
('Argentina',193,25,221,8.3,'SA'),('Armenia',21,179,11,3.8,'EU'),
('Australia',261,72,212,10.4,'OC'),('Austria',279,75,191,9.7,'EU'),
('Azerbaijan',21,46,5,1.3,'EU'),('Bahamas',122,176,51,6.3,None),
('Bahrain',42,63,7,2.0,'AS'),('Bangladesh',0,0,0,0.0,'AS'),
('Barbados',143,173,36,6.3,None),('Belarus',142,373,42,14.4,'EU'),
('Belgium',295,84,212,10.5,'EU'),('Belize',263,114,8,6.8,None),
('Benin',34,4,13,1.1,'AF'),('Bhutan',23,0,0,0.4,'AS')],
['Country_Name','Number_of_Beer_Servings','Number_of_Spirit_Servings',
'Number_of_Wine_servings','Pure_alcohol_Consumption_litres',
'Continent_Name'])
Since we intend to find the mode, we need the value of Continent_Name that occurs most frequently. First, filter out the rows where it is null:
from pyspark.sql.functions import col

df1 = df.where(col('Continent_Name').isNotNull())
Register the DataFrame as a view and use SQL to group by Continent_Name and count each value:
df1.createOrReplaceTempView('table')
df2 = spark.sql(
    'SELECT Continent_Name, COUNT(Continent_Name) AS count FROM table GROUP BY Continent_Name ORDER BY count DESC'
)
df2.show()
+--------------+-----+
|Continent_Name|count|
+--------------+-----+
| EU| 7|
| AS| 4|
| AF| 3|
| SA| 1|
| OC| 1|
+--------------+-----+
Finally, take Continent_Name from the first (most frequent) row.
mode_value = df2.first()['Continent_Name']
print(mode_value)
EU
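The same mode can also be computed without SQL, staying entirely in the DataFrame API; a minimal sketch:
import pyspark.sql.functions as F
from pyspark.sql.functions import col

# group the non-null values, count them, and keep the most frequent one
mode_value = (df.where(col('Continent_Name').isNotNull())
                .groupBy('Continent_Name')
                .count()
                .orderBy(F.desc('count'))
                .first()['Continent_Name'])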
Once mode_value is obtained, just fill the nulls in with the .fillna() function.
df = df.fillna({'Continent_Name':mode_value})
df.show()
+---------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
| Country_Name|Number_of_Beer_Servings|Number_of_Spirit_Servings|Number_of_Wine_servings|Pure_alcohol_Consumption_litres|Continent_Name|
+---------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
| Afghanistan| 0| 0| 0| 0.0| AS|
| Albania| 89| 132| 54| 4.9| EU|
| Algeria| 25| 0| 14| 0.7| AF|
| Andorra| 245| 138| 312| 12.4| EU|
| Angola| 217| 57| 45| 5.9| AF|
|Antigua&Barbuda| 102| 128| 45| 4.9| EU|
| Argentina| 193| 25| 221| 8.3| SA|
| Armenia| 21| 179| 11| 3.8| EU|
| Australia| 261| 72| 212| 10.4| OC|
| Austria| 279| 75| 191| 9.7| EU|
| Azerbaijan| 21| 46| 5| 1.3| EU|
| Bahamas| 122| 176| 51| 6.3| EU|
| Bahrain| 42| 63| 7| 2.0| AS|
| Bangladesh| 0| 0| 0| 0.0| AS|
| Barbados| 143| 173| 36| 6.3| EU|
| Belarus| 142| 373| 42| 14.4| EU|
| Belgium| 295| 84| 212| 10.5| EU|
| Belize| 263| 114| 8| 6.8| EU|
| Benin| 34| 4| 13| 1.1| AF|
| Bhutan| 23| 0| 0| 0.4| AS|
+---------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+

Window timeseries with step in Spark/Scala

I have this input:
timestamp,user
1,A
2,B
5,C
9,E
12,F
The desired result is:
timestampRange,userList
1 to 2,[A,B]
3 to 4,[] or null
5 to 6,[C]
7 to 8,[] or null
9 to 10,[E]
11 to 12,[F]
I tried using a window, but the problem is that it doesn't include the empty timestamp ranges.
Any hints would be helpful.
I don't know whether a windowing function can cover the gaps between ranges, but you can take the following approach:
Define a dataframe, df_ranges:
val ranges = List((1,2), (3,4), (5,6), (7,8), (9,10))
val df_ranges = sc.parallelize(ranges).toDF("start", "end")
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 3| 4|
| 5| 6|
| 7| 8|
| 9| 10|
+-----+---+
Data with the timestamp column, df_data:
val data = List((1,"A"), (2,"B"), (5,"C"), (9,"E"))
val df_data = sc.parallelize(data).toDF("timestamp", "user")
+---------+----+
|timestamp|user|
+---------+----+
| 1| A|
| 2| B|
| 5| C|
| 9| E|
+---------+----+
Left-join the two data frames on the start, end and timestamp columns:
val joined = df_ranges.join(df_data,
  df_ranges.col("start").equalTo(df_data.col("timestamp"))
    .or(df_ranges.col("end").equalTo(df_data.col("timestamp"))),
  "left")
joined.show()
+-----+---+---------+----+
|start|end|timestamp|user|
+-----+---+---------+----+
| 1| 2| 1| A|
| 1| 2| 2| B|
| 5| 6| 5| C|
| 9| 10| 9| E|
| 3| 4| null|null|
| 7| 8| null|null|
+-----+---+---------+----+
Now do a simple aggregation with the collect_list function:
import org.apache.spark.sql.functions.collect_list

joined.groupBy("start", "end").agg(collect_list("user")).orderBy("start").show()
+-----+---+------------------+
|start|end|collect_list(user)|
+-----+---+------------------+
| 1| 2| [A, B]|
| 3| 4| []|
| 5| 6| [C]|
| 7| 8| []|
| 9| 10| [E]|
+-----+---+------------------+
