PySpark: Pivot column into single row - python

I am trying to pivot a simple dataframe in pyspark and I must be missing something. I have a dataframe df in the form of:
+----+----+
|Item| Key|
+----+----+
| 1| A|
+----+----+
| 2| A|
+----+----+
I attempt to pivot it on Item such as
df.groupBy("Item").\
pivot("Item", ["1","2"]).\
agg(first("Key"))
and I receive:
+----+----+----+
|Item| 1| 2|
+----+----+----+
| 1| A|null|
+----+----+----+
| 2|null| A|
+----+----+----+
But what I want is:
+----+----+
| 1| 2|
+----+----+
| A| A|
+----+----+
How do I keep the Item column from remaining in my output pivot table which I assume messes up my result? I am running Spark 2.3.2 and Python 3.7.0

Try without define aggregate column
>>> df.show()
+----+---+
|Item|Key|
+----+---+
| 1| A|
| 2| A|
+----+---+
>>> df.groupBy().pivot("Item", ["1","2"]).agg(first("Key")).show()
+---+---+
| 1| 2|
+---+---+
| A| A|
+---+---+

Related

Apply value to a dataframe after filtering another dataframe based on conditions

My question:
There are two dataframes and the info one is in progress to build.
What I want to do is filtering in reference dataframe based on the condition. When key is b, then apply value is 2 to the into table as whole column.
The output dataframe is the final one I want to do.
Dataframe (info)
+-----+-----+
| key|value|
+-----+-----+
| a| 10|
| b| 20|
| c| 50|
| d| 40|
+-----+-----+
Dataframe (Reference)
+-----+-----+
| key|value|
+-----+-----+
| a| 42|
| b| 2|
| c| 9|
| d| 100|
+-----+-----+
Below is the output I want:
Dataframe (Output)
+-----+-----+-----+
| key|value|const|
+-----+-----+-----+
| a| 10| 2|
| b| 20| 2|
| c| 50| 2|
| d| 40| 2|
+-----+-----+-----+
I have tried several methods and below one is the latest one I tried, but system warm me that pyspark do not have loc function.
df_cal = (
info
.join(reference)
.withColumn('const', reference.loc[reference['key']=='b', 'value'].iloc[0])
.select('key', 'result', 'const')
)
df_cal.show()
And below is the warming that reminded by system:
AttributeError: 'Dataframe' object has no attribute 'loc'
This solve:
from pyspark.sql.functions import lit
target = 'b'
const = [i['value'] for i in df2.collect() if i['key'] == f'{target}']
df_cal = df1.withColumn('const', lit(const[0]))
df_cal.show()
+---+-----+-----+
|key|value|const|
+---+-----+-----+
| a| 10| 2|
| b| 20| 2|
| c| 30| 2|
| d| 40| 2|
+---+-----+-----+

Add distinct count of a column to each row in PySpark

I need to add distinct count of a column to each row in PySpark dataframe.
Example:
If the original dataframe is this:
+----+----+
|col1|col2|
+----+----+
|abc | 1|
|xyz | 1|
|dgc | 2|
|ydh | 3|
|ujd | 1|
|ujx | 3|
+----+----+
Then I want something like this:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|abc | 1| 3|
|xyz | 1| 3|
|dgc | 2| 3|
|ydh | 3| 3|
|ujd | 1| 3|
|ujx | 3| 3|
+----+----+----+
I tried df.withColumn('total_count', f.countDistinct('col2')) but it's giving error.
You can count distinct elements in the column and create new column with the value:
distincts = df.dropDuplicates(["col2"]).count()
df = df.withColumn("col3", f.lit(distincts))
Cross join to the count distinct as below:
df2 = df.crossJoin(df.select(F.countDistinct('col2').alias('col3')))
df2.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc| 1| 3|
| xyz| 1| 3|
| dgc| 2| 3|
| ydh| 3| 3|
| ujd| 1| 3|
| ujx| 3| 3|
+----+----+----+
You can use Window, collect_set and size:
from pyspark.sql import functions as F, Window
df = spark.createDataFrame([("abc", 1), ("xyz", 1), ("dgc", 2), ("ydh", 3), ("ujd", 1), ("ujx", 3)], ['col1', 'col2'])
window = Window.orderBy("col2").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("col3", F.size(F.collect_set(F.col("col2")).over(window))).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc| 1| 3|
| xyz| 1| 3|
| dgc| 2| 3|
| ydh| 3| 3|
| ujd| 1| 3|
| ujx| 3| 3|
+----+----+----+

how to select first n row items based on multiple conditions in pyspark

Now I have data like this:
+----+----+
|col1| d|
+----+----+
| A| 4|
| A| 10|
| A| 3|
| B| 3|
| B| 6|
| B| 4|
| B| 5.5|
| B| 13|
+----+----+
col1 is StringType, d is TimestampType, here I use DoubleType instead.
I want to generate data based on conditions tuples.
Given a tuple[(A,3.5),(A,8),(B,3.5),(B,10)]
I want to have the result like
+----+---+
|col1| d|
+----+---+
| A| 4|
| A| 10|
| B| 4|
| B| 13|
+----+---+
That is for each element in the tuple, we select from the pyspark dataframe the first 1 row that d is larger than the tuple number and col1 is equal to the tuple string.
What I've already written is:
df_res=spark_empty_dataframe
for (x,y) in tuples:
dft=df.filter(df.col1==x).filter(df.d>y).limit(1)
df_res=df_res.union(dft)
But I think this might have efficiency problem, I do not know if I were right.
A possible approach avoiding loops can be creating a dataframe from the tuple you have as input:
t = [('A',3.5),('A',8),('B',3.5),('B',10)]
ref=spark.createDataFrame([(i[0],float(i[1])) for i in t],("col1_y","d_y"))
Then we can join on the input dataframe(df) on condition and then group on the keys and values of tuple which will be repeated to get the first value on each group, then drop the extra columns:
(df.join(ref,(df.col1==ref.col1_y)&(df.d>ref.d_y),how='inner').orderBy("col1","d")
.groupBy("col1_y","d_y").agg(F.first("col1").alias("col1"),F.first("d").alias("d"))
.drop("col1_y","d_y")).show()
+----+----+
|col1| d|
+----+----+
| A|10.0|
| A| 4.0|
| B| 4.0|
| B|13.0|
+----+----+
Note, if order of the dataframe is important , you can assign an index column with monotonically_increasing_id and include them in the aggregation then orderBy the index column.
EDIT another way instead of ordering and get first directly with min:
(df.join(ref,(df.col1==ref.col1_y)&(df.d>ref.d_y),how='inner')
.groupBy("col1_y","d_y").agg(F.min("col1").alias("col1"),F.min("d").alias("d"))
.drop("col1_y","d_y")).show()
+----+----+
|col1| d|
+----+----+
| B| 4.0|
| B|13.0|
| A| 4.0|
| A|10.0|
+----+----+

How to add an ID to pyspark dataframe rows that increases only if a certain condition is met?

I have a pyspark dataframe, and I want to add an Id column to it that only increases if a condition is met.
Example:
over a Window on col1, if col2 value changes, the Id needs to be incremented by 1.
Input:
+----+----+
|col1|col2|
+----+----+
| 1| A|
| 1| A|
| 1| B|
| 1| C|
| 2| A|
| 2| B|
| 2| B|
| 2| B|
| 2| C|
| 2| C|
+----+----+
output:
+----+----+----+
|col1|col2| ID|
+----+----+----+
| 1| A| 1|
| 1| A| 1|
| 1| B| 2|
| 1| C| 3|
| 2| A| 1|
| 2| B| 2|
| 2| B| 2|
| 2| B| 2|
| 2| C| 3|
| 2| C| 3|
+----+----+----+
Thanks :)
What you are looking for is the dense_rank function (pyspark doc here).
Assuming your dataframe variable is df you can do something like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
...
df.withColumn('ID', F.dense_rank().over(Window.partitionBy('col1').orderBy('col2')))

How to filter dataframe using percentiles to filter out outliers?

Suppose I have a spark dataframe like this:
+------------+-----------+
|category |value |
+------------+-----------+
| a| 1|
| a| 2|
| b| 2|
| a| 3|
| b| 4|
| a| 4|
| b| 6|
| b| 8|
+------------+-----------+
I want to set values higher than 0.75 percentile to nan for each category.
That being;
a_values = [1,2,3,4] => a_values_filtered = [1,2,3,nan]
b_values = [2,4,6,8] => b_values_filtered = [2,3,6,nan]
So the expected output is:
+------------+-----------+
|category |value |
+------------+-----------+
| a| 1|
| a| 2|
| b| 2|
| a| 3|
| b| 4|
| a| nan|
| b| 6|
| b| nan|
+------------+-----------+
Any idea how to do it cleanly?
PS: I am new to spark
Use percent_rank function to get the percentiles, and then use when to assign values > 0.75 percent_rank to null.
from pyspark.sql import Window
from pyspark.sql.functions import percent_rank,when
w = Window.partitionBy(df.category).orderBy(df.value)
percentiles_df = df.withColumn('percentile',percent_rank().over(w))
result = percentiles_df.select(percentiles_df.category
,when(percentiles_df.percentile <= 0.75,percentiles_df.value).alias('value'))
result.show()
Here is another snippet similar to Prabhala's answer, I use percentile_approx UDF instead.
from pyspark.sql import Window
import pyspark.sql.functions as F
window = Window.partitionBy('category')
percentile = F.expr('percentile_approx(value, 0.75)')
tmp_df = df.withColumn('percentile_value', percentile.over(window))
result = tmp_df.select('category', when(tmp_df.percentile_value >= tmp_df.value, tmp_df.value).alias('value'))
result.show()
+--------+-----+
|category|value|
+--------+-----+
| b| 2|
| b| 4|
| b| 6|
| b| null|
| a| 1|
| a| 2|
| a| 3|
| a| null|
+--------+-----+

Categories