I started converting my Pandas implementations to PySpark but I'm having trouble with some basic operations. So I have this table:
+-----+-----+----+
| Col1|Col2 |Col3|
+-----+-----+----+
| 1 |[1,3]| 0|
| 44 |[2,0]| 1|
| 77 |[1,5]| 7|
+-----+-----+----+
My desired output is:
+-----+-----+----+----+
| Col1|Col2 |Col3|Col4|
+-----+-----+----+----+
| 1 |[1,3]| 0|2.67|
| 44 |[2,0]| 1|2.67|
| 77 |[1,5]| 7|2.67|
+-----+-----+----+----+
To get here:
I averaged the first item of every array in Col2 and averaged the second item of every array in Col2. Since the average of the second "sub-column" ((3+0+5)/3) is bigger than the average of the first "sub-column" ((1+2+1)/3), it is the "winning" one. I then created a new column with the "winning" average replicated over the number of rows of the table (3 in this example).
I was already able to do this by "manually" selecting the column, averaging it and then using lit to replicate the result. The problem with my implementation is that collect() takes a lot of time and, as far as I know, it is not recommended.
Could you please help me with this one?
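For reference, the collect-based version described above probably looks something like this (a sketch of the assumed baseline, not the asker's actual code, and assuming Col2 is a plain array column):

from pyspark.sql import functions as F

# Average each "sub-column" of Col2, pull the result to the driver, take the max,
# and broadcast it back with lit() -- the collect()-based pattern being replaced.
avgs = df.select(F.avg(F.col('Col2')[0]).alias('a0'),
                 F.avg(F.col('Col2')[1]).alias('a1')).collect()[0]
df2 = df.withColumn('Col4', F.lit(max(avgs['a0'], avgs['a1'])))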
You can use greatest to get the greatest average of each (sub-)column in the array:
from pyspark.sql import functions as F, Window

# UDF that converts Col2 (a vector of numbers) into array<double> so its elements can be indexed
to_array = F.udf(lambda v: [float(x) for x in v.toArray()], 'array<double>')

df2 = df.withColumn(
    'Col4',
    F.greatest(*[F.avg(to_array('Col2')[i]).over(Window.orderBy()) for i in range(2)])
)
df2.show()
+----+------+----+------------------+
|Col1| Col2|Col3| Col4|
+----+------+----+------------------+
| 1|[1, 3]| 0|2.6666666666666665|
| 44|[2, 0]| 1|2.6666666666666665|
| 77|[1, 5]| 7|2.6666666666666665|
+----+------+----+------------------+
If you want the array size to be dynamic, you can determine it first:
arr_size = df.select(F.max(F.size(to_array('Col2')))).head()[0]

df2 = df.withColumn(
    'Col4',
    F.greatest(*[F.avg(to_array('Col2')[i]).over(Window.orderBy()) for i in range(arr_size)])
)
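As a side note: the lambda/.toArray() UDF suggests Col2 is a Spark ML vector column (an assumption). If you are on Spark 3.0+, pyspark.ml.functions.vector_to_array can do that conversion without a Python UDF; a sketch:

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F, Window

# Native vector -> array<double> conversion (Spark 3.0+), replacing the Python UDF
arr = vector_to_array(F.col('Col2'))
df2 = df.withColumn(
    'Col4',
    F.greatest(*[F.avg(arr[i]).over(Window.orderBy()) for i in range(2)])
)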
I have SQL code which I am trying to convert into PySpark.
The SQL query looks like this: I need to concatenate '0' at the start of ADDRESS_HOME if the conditions in the query below are satisfied.
UPDATE STUDENT_DATA
SET STUDENT_DATA.ADDRESS_HOME = "0" & [STUDENT_DATA].ADDRESS_HOME
WHERE (((STUDENT_DATA.STATE_ABB)="TURIN" Or
(STUDENT_DATA.STATE_ABB)="RUSH" Or
(STUDENT_DATA.STATE_ABB)="MEXIC" Or
(STUDENT_DATA.STATE_ABB)="VINTA")
AND ((Len([ADDRESS_HOME])) < "5"));
Thank you in advance for your responses.
# +---+------------+---------+
# | ID|ADDRESS_HOME|STATE_ABB|
# +---+------------+---------+
# |  1|        7645|     RUSH|
# |  2|       98364|    MEXIC|
# |  3|        2980|    TURIN|
# |  4|        6728|    VINTA|
# |  5|         128|    VINTA|
# +---+------------+---------+
EXPECTED OUTPUT
# +---+------------+---------+
# | ID|ADDRESS_HOME|STATE_ABB|
# +---+------------+---------+
# |  1|       07645|     RUSH|
# |  2|       98364|    MEXIC|
# |  3|       02980|    TURIN|
# |  4|       06728|    VINTA|
# |  5|        0128|    VINTA|
# +---+------------+---------+
If you want to pad ADDRESS_HOME to 5 digits with leading zeros, you can use lpad.
df = df.withColumn('ADDRESS_HOME', F.lpad('ADDRESS_HOME', 5, '0'))
If you only want to prepend a single '0' when ADDRESS_HOME has fewer than 5 characters:
df = df.withColumn('ADDRESS_HOME',
                   F.when(F.length('ADDRESS_HOME') < 5, F.concat(F.lit('0'), F.col('ADDRESS_HOME')))
                    .otherwise(F.col('ADDRESS_HOME')))
UPDATE:
You can convert all the OR criteria to an IN clause (isin), then combine it with the length criterion using a logical AND.
states = ['RUSH', 'MEXIC', 'TURIN', 'VINTA']
df = (df.withColumn('ADDRESS_HOME',
F.when(F.col('STATE_ABB').isin(states) & (F.length('ADDRESS_HOME') < 5),
F.concat(F.lit('0'), F.col('ADDRESS_HOME')))
.otherwise(F.col('ADDRESS_HOME'))))
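Applied to the sample data above, this should give the expected output. A quick way to test (assuming an active SparkSession named spark; the rows are taken from the question):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, '7645', 'RUSH'), (2, '98364', 'MEXIC'), (3, '2980', 'TURIN'),
     (4, '6728', 'VINTA'), (5, '128', 'VINTA')],
    ['ID', 'ADDRESS_HOME', 'STATE_ABB'])

states = ['RUSH', 'MEXIC', 'TURIN', 'VINTA']
df.withColumn('ADDRESS_HOME',
              F.when(F.col('STATE_ABB').isin(states) & (F.length('ADDRESS_HOME') < 5),
                     F.concat(F.lit('0'), F.col('ADDRESS_HOME')))
               .otherwise(F.col('ADDRESS_HOME'))).show()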
First you filter your DF, searching for the values you want to update.
Then you update the column (the first withColumn).
After updating, you join the updated DF with your original dataframe (do this to get all values back in one dataframe) and use coalesce to build the final address.
Finally, you select the values from the original DF (ID and STATE_ABB) and the updated value (FINAL_ADDRESS): since you did a coalesce, the rows that were not updated will not be null; they get the updated value where the filter condition matched and the original value where it did not.
This answer should solve your problem, BUT @Emma's answer is more efficient.
from pyspark.sql import functions as f

states = ["TURIN", "RUSH", "MEXIC", "VINTA"]

# Rows that need updating, with the '0' already prepended
df_filtered = (
    df.filter(f.col("STATE_ABB").isin(states) & (f.length("ADDRESS_HOME") < 5))
      .withColumn("ADDRESS_HOME_CONCAT", f.concat(f.lit("0"), f.col("ADDRESS_HOME")))
      .alias("df_filtered")
)

# Join back to the original so every row survives, then coalesce to the final address
result = (
    df.alias("original_df")
      .join(df_filtered, on=f.col("original_df.ID") == f.col("df_filtered.ID"), how="left")
      .withColumn("FINAL_ADDRESS",
                  f.coalesce(f.col("df_filtered.ADDRESS_HOME_CONCAT"),
                             f.col("original_df.ADDRESS_HOME")))
      .select(
          f.col("original_df.ID").alias("ID"),
          f.col("FINAL_ADDRESS").alias("ADDRESS_HOME"),
          f.col("original_df.STATE_ABB").alias("STATE_ABB"),
      )
)
Sorry for any typos, I've posted this from my cellphone!
I have a dataframe (df) like this:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| One| Two|   x|
| One| Two|full|
| One| Two|   y|
| One| Two|   z|
| One| Two|full|
| One| Two|   u|
| One| Two|   e|
+----+----+----+
Using PySpark, I want to mark the rows immediately after col3 == "full" with 1, otherwise 0, like this:
+----+----+----+----+
|col1|col2|col3|flag|
+----+----+----+----+
| One| Two|   x|   0|
| One| Two|full|   0|
| One| Two|   y|   1|
| One| Two|   z|   0|
| One| Two|full|   0|
| One| Two|   u|   1|
| One| Two|   e|   0|
+----+----+----+----+
At the moment this is my idea, but it doesn't flag the row immediately after the "full" row:
df.withColumn('flag',f.when(f.col('CD_OPERAZIONE')=='full',1).otherwise(0))
Can you help me?
Use lag with a when statement:
from pyspark.sql import Window
from pyspark.sql.functions import lag, when

w = Window.partitionBy('col1', 'col2').orderBy('col1')
df.withColumn('x', when(lag('col3').over(w) == 'full', 1).otherwise(0)).show()
+----+----+----+---+
|col1|col2|col3| x|
+----+----+----+---+
| One| Two| x| 0|
| One| Two|full| 0|
| One| Two| y| 1|
| One| Two| z| 0|
| One| Two|full| 0|
| One| Two| u| 1|
| One| Two| e| 0|
+----+----+----+---+
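Note that ordering the window by col1 (a constant in this example) leaves the actual row order up to Spark. If you need the original input order, one option is to capture it first; a sketch:

from pyspark.sql import functions as F, Window

# Capture the current row order before applying the window
df_ord = df.withColumn('id', F.monotonically_increasing_id())
w = Window.partitionBy('col1', 'col2').orderBy('id')
df_ord.withColumn('x', F.when(F.lag('col3').over(w) == 'full', 1).otherwise(0)).drop('id').show()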
Step 1: assign a row number to each row using the row_number function.
Step 2: filter the dataframe with col3 == 'full'; now you have the row numbers where col3 is 'full'. Call this dataframe2.
Step 3: create a new column in dataframe2 by adding one to the row number; now you have the row numbers of the rows immediately after the ones having col3 as 'full'.
Step 4: join dataframe1 with dataframe2 via an inner join, after selecting the new column from dataframe2, on dataframe1's row_number and dataframe2's new row number column.
Pardon for the lack of code, I'm on my mobile; a rough sketch of these steps is below. Let me know if you still want help.
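A minimal sketch of those four steps (assuming the col1/col2/col3 example above, that the current dataframe order is the desired one, and using a left join rather than an inner join so the unmatched rows can be flagged 0):

from pyspark.sql import functions as F, Window

# Step 1: capture the input order and assign a row number
df1 = (df.withColumn('id', F.monotonically_increasing_id())
         .withColumn('rn', F.row_number().over(Window.orderBy('id'))))

# Steps 2-3: rows where col3 == 'full', shifted down by one position
df2 = df1.filter(F.col('col3') == 'full').select((F.col('rn') + 1).alias('rn_next'))

# Step 4: join back; matched rows are the ones right after a 'full'
flagged = (df1.join(df2, df1['rn'] == df2['rn_next'], 'left')
              .withColumn('flag', F.when(F.col('rn_next').isNotNull(), 1).otherwise(0))
              .drop('id', 'rn', 'rn_next'))
flagged.show()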
I have a sales list like this (PySpark):
+---------+----------+
|productId| date|
+---------+----------+
| 868|2020-11-01|
| 878|2020-11-01|
| 878|2020-11-01|
| 913|2020-11-01|
| 746|2020-11-01|
| 878|2020-11-01|
| 657|2020-11-02|
| 746|2020-11-02|
| 101|2020-11-02|
+---------+----------+
So, I want to get a new column: item position by number of purchases. The most popular item of the day will have rank 1, etc. I've tried to implement a window function, but I can't figure out how to do it correctly. What is the best way to do it? Thanks.
import pyspark.sql.functions as F
from pyspark.sql import Window

# Count of each product per day (partition by both date and productId)
count_per_day = F.count('productId').over(Window.partitionBy('date', 'productId')).alias('count_per_day')
df = df.select('*', count_per_day)

# Rank products within each day by that count, highest first
rank = F.rank().over(Window.partitionBy('date').orderBy(F.col('count_per_day').desc())).alias('rank')
df = df.select('*', rank)
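If you would rather have consecutive rank numbers when several products tie (rank() leaves gaps after ties), dense_rank() is a drop-in replacement for rank(); a self-contained sketch of the same idea:

import pyspark.sql.functions as F
from pyspark.sql import Window

count_w = Window.partitionBy('date', 'productId')
rank_w = Window.partitionBy('date').orderBy(F.col('count_per_day').desc())

ranked = (df
          .select('*', F.count('productId').over(count_w).alias('count_per_day'))
          .select('*', F.dense_rank().over(rank_w).alias('rank')))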
I have the below PySpark dataframe.
Column_1 Column_2 Column_3 Column_4
1 A U1 12345
1 A A1 549BZ4G
Expected output:
Group by column 1 and column 2, and collect the sets of column 3 and column 4 while preserving the order of the input dataframe. The output should be in the same order as the input. There is no ordering dependency between column 3 and column 4; both have to retain the input dataframe's ordering.
Column_1 Column_2 Column_3 Column_4
1 A U1,A1 12345,549BZ4G
What I tried so far:
I first tried using the window method, where I partitioned by columns 1 and 2 and ordered by columns 1 and 2. I then grouped by columns 1 and 2 and did a collect_set on columns 3 and 4.
I didn't get the expected output. My result was as below:
Column_1 Column_2 Column_3 Column_4
1 A U1,A1 549BZ4G,12345
I also tried using monotonically_increasing_id to create an index, then ordered by the index and did a group by and collect_set to get the output. But still no luck.
Is it due to the mix of alphanumeric and numeric values?
How can I retain the order of column 3 and column 4 as it is in the input, without any change of ordering?
Use the monotonically_increasing_id function from Spark to maintain the order (you can find more info about it in the Spark documentation).
#InputDF
# +----+----+----+-------+
# |col1|col2|col3| col4|
# +----+----+----+-------+
# | 1| A| U1| 12345|
# | 1| A| A1|549BZ4G|
# +----+----+----+-------+
df1 = (df.withColumn("id", F.monotonically_increasing_id())
         .groupby("col1", "col2")
         .agg(F.collect_list("col4").alias("col4"),
              F.collect_list("col3").alias("col3")))

df1.select("col1", "col2",
           F.array_join("col3", ",").alias("col3"),
           F.array_join("col4", ",").alias("col4")).show()
# OutputDF
# +----+----+-----+-------------+
# |col1|col2| col3| col4|
# +----+----+-----+-------------+
# | 1| A|U1,A1|12345,549BZ4G|
# +----+----+-----+-------------+
Use array_distinct on top of collect_list to have distinct values and maintain order.
#InputDF
# +----+----+----+-------+
# |col1|col2|col3| col4|
# +----+----+----+-------+
# | 1| A| U1| 12345|
# | 1| A| A1|549BZ4G|
# | 1| A| U1| 123456|
# +----+----+----+-------+
df1 = (df.withColumn("id", F.monotonically_increasing_id())
         .groupby("col1", "col2")
         .agg(F.array_distinct(F.collect_list("col4")).alias("col4"),
              F.array_distinct(F.collect_list("col3")).alias("col3")))

df1.select("col1", "col2",
           F.array_join("col3", ",").alias("col3"),
           F.array_join("col4", ",").alias("col4")).show(truncate=False)
# +----+----+-----+---------------------+
# |col1|col2|col3 |col4 |
# +----+----+-----+---------------------+
# |1 |A |U1,A1|12345,549BZ4G,123456 |
# +----+----+-----+---------------------+
How about using struct?
val result = df.groupBy(
"Column_1", "Column_2"
).agg(
collect_list(
struct(col("Column_3"), col("Column_4"))
).alias("coll")
).select(
col("Column_1"), col("Column_2"), col("coll.Column_3"), col("coll.Column_4")
)
This should produce the expected result.
The trick is that struct preserves named elements so you can reference them via the . accessor. It works on ArrayType(StructType) as well.
Grouping collocated concepts via struct feels natural, since it is a structural relation that you are trying to preserve here.
In theory you might not even want to unpack the structs, since these values appear to have a dependency.
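Since the question is in PySpark, a rough equivalent of the Scala snippet above (a sketch, same column names) would be:

from pyspark.sql import functions as F

result = (df.groupBy('Column_1', 'Column_2')
            .agg(F.collect_list(F.struct('Column_3', 'Column_4')).alias('coll'))
            .select('Column_1', 'Column_2',
                    F.col('coll.Column_3').alias('Column_3'),
                    F.col('coll.Column_4').alias('Column_4')))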
All the collect functions (collect_set, collect_list) within Spark are non-deterministic, since the order of the collected result depends on the order of rows in the underlying dataframe, which is again non-deterministic.
So, to conclude, you can't really preserve the order with Spark's collect functions.
Reference: functions.scala in the Spark repo.
I am not sure why it is not showing you the correct result; for me this gives the expected result:
Input
df_a = spark.createDataFrame([(1,'A','U1','12345'),(1,'A','A1','549BZ4G')],[ "col_1","col_2","col_3","col_4"])
Logic
from pyspark.sql import functions as F
df_a = df_a.groupBy('col_1','col_2').agg(F.collect_list('col_3').alias('col_3'), F.collect_list('col_4').alias('col_4'))
Output
df_a.show()
+-----+-----+--------+----------------+
|col_1|col_2| col_3| col_4|
+-----+-----+--------+----------------+
| 1| A|[U1, A1]|[12345, 549BZ4G]|
+-----+-----+--------+----------------+
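If you need the comma-separated string form shown in the question's expected output (rather than arrays), concat_ws can flatten each array column, for example:

df_a.select('col_1', 'col_2',
            F.concat_ws(',', 'col_3').alias('col_3'),
            F.concat_ws(',', 'col_4').alias('col_4')).show()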
I'm working with PySpark on Spark 1.6, and I would like to group some values together.
+--------+-----+
| Item   |value|
+--------+-----+
| A      |  187|
| B      |  200|
| C      |    3|
| D      |   10|
+--------+-----+
I would like to group all items with less than 10% of the total value together (in this case C and D would be grouped into a new value "Other").
So the new table looks like:
+--------+-----+
| Item   |value|
+--------+-----+
| A      |  187|
| B      |  200|
| Other  |   13|
+--------+-----+
Does someone know a function or a simple way to do that?
Many thanks for your help.
You can filter the dataframe twice to get a dataframe with just the values you want to keep, and one with just the others. Perform an aggregation on the others dataframe to sum them, then union the two dataframe back together. Depending on the data you may want to persist the original dataframe before all this so that it doesn't need to be evaluated twice.
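A minimal sketch of that approach (assuming Spark 1.6-era APIs, hence unionAll, and the Item/value columns from the question):

from pyspark.sql import functions as F

df = df.persist()  # avoid recomputing the input for the two filters and the sum

# 10% of the total value as the threshold (a single-row collect)
threshold = 0.10 * df.agg(F.sum('value').alias('total')).first()['total']

keep = df.filter(F.col('value') >= threshold)
others = (df.filter(F.col('value') < threshold)
            .agg(F.sum('value').alias('value'))
            .select(F.lit('Other').alias('Item'), 'value'))

result = keep.unionAll(others)  # DataFrame.union replaces unionAll on Spark 2.x+
result.show()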