Problem
Is there a way in PySpark/Spark to select the first element over some window based on a condition?
Examples
Let's take an example input dataframe:
+---------+----------+----+----+----------------+
| id| timestamp| f1| f2| computed|
+---------+----------+----+----+----------------+
| 1|2020-01-02|null|c1f2| [f2]|
| 1|2020-01-01|c1f1|null| [f1]|
| 2|2020-01-01|c2f1|null| [f1]|
+---------+----------+----+----+----------------+
For each id I want to select the latest value of each column (f1, f2, ...) that was actually computed.
So the "code" would look like this (pseudocode; f.first does not actually take a condition argument):
from pyspark.sql import Window
from pyspark.sql import functions as f

cols = ["f1", "f2"]
w = Window.partitionBy("id").orderBy(f.desc("timestamp")).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
output_df = (
input_df.select(
"id",
*[f.first(col, condition=f.array_contains(f.col("computed"), col)).over(w).alias(col) for col in cols]
)
.groupBy("id")
.agg(*[f.first(col).alias(col) for col in cols])
)
And the output should be
+---------+----+----+
| id| f1| f2|
+---------+----+----+
| 1|c1f1|c1f2|
| 2|c2f1|null|
+---------+----+----+
If the input looks like this
+---------+----------+----+----+----------------+
| id| timestamp| f1| f2| computed|
+---------+----------+----+----+----------------+
| 1|2020-01-02|null|c1f2| [f1, f2]|
| 1|2020-01-01|c1f1|null| [f1]|
| 2|2020-01-01|c2f1|null| [f1]|
+---------+----------+----+----+----------------+
Then the output should be
+---------+----+----+
| id| f1| f2|
+---------+----+----+
| 1|null|c1f2|
| 2|c2f1|null|
+---------+----+----+
As you can see, it's not enough to just use f.first(ignorenulls=True), because in this case we don't want to skip a null that was actually computed.
Current solution
Step 1
Save the original data types:
cols = ["f1", "f2"]
orig_dtypes = [field.dataType for field in input_df.schema if field.name in cols]
Step 2
For each column, create a new column holding its value if the column was computed, and replace an original null with our "synthetic" <NULL> string:
output_df = input_df.select(
"id", "timestamp", "computed",
*[
f.when(f.array_contains(f.col("computed"), col) & f.col(col).isNotNull(), f.col(col))
.when(f.array_contains(f.col("computed"), col) & f.col(col).isNull(), "<NULL>")
.alias(col)
for col in cols
]
)
Step 3
Select the first non-null value over the window, because now we know that <NULL> won't be skipped:
output_df = (
output_df.select(
"id",
*[f.first(col, ignorenulls=True).over(w).alias(col) for col in cols],
)
.groupBy("id")
.agg(*[f.first(col).alias(col) for col in cols])
)
Step 4
Replace our "synthetic" <NULL> with real nulls again.
output_df = output_df.replace("<NULL>", None)
Step 5
Cast the columns back to their original types, because they might have been retyped to string in step 2:
output_df = output_df.select("id", *[f.col(col).cast(type_) for col, type_ in zip(cols, orig_dtypes)])
This solution works, but it does not seem like the right way to do it. Besides, it's pretty heavy and takes too long to compute.
Is there any other more "sparkish" way to do it?
Here's one way using the trick of struct ordering.
Group by id and collect a list of structs like struct<col_exists_in_computed, timestamp, col_value> for each column in the cols list, then use the array_max function on the resulting array to get the latest value you want:
from pyspark.sql import functions as F
output_df = input_df.groupBy("id").agg(
*[F.array_max(
F.collect_list(
F.struct(F.array_contains("computed", c), F.col("timestamp"), F.col(c))
)
)[c].alias(c) for c in cols]
)
# applied to your second dataframe example, it gives
output_df.show()
#+---+----+----+
#| id| f1| f2|
#+---+----+----+
#| 1|null|c1f2|
#| 2|c2f1|null|
#+---+----+----+
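This works because Spark compares structs field by field, left to right: an entry whose "was this column computed" flag is true beats any entry where it is false, and among computed entries the latest timestamp wins, so array_max picks exactly the row you want. A minimal standalone sketch of that ordering (assuming a running SparkSession named spark):
from pyspark.sql import functions as F

# Structs are ordered lexicographically: false < true on the boolean flag,
# and only then does the timestamp matter.
demo = spark.range(1).select(
    F.array_max(
        F.array(
            F.struct(F.lit(False).alias("computed"),
                     F.lit("2020-01-02").alias("ts"),
                     F.lit(None).cast("string").alias("val")),
            F.struct(F.lit(True).alias("computed"),
                     F.lit("2020-01-01").alias("ts"),
                     F.lit("c1f1").alias("val")),
        )
    ).alias("winner")
)
demo.show(truncate=False)
# The winner is the (true, 2020-01-01, c1f1) struct: the computed value is kept
# even though its timestamp is older than the non-computed entry.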
Related
I would like to compare 2 dataframes in PySpark. Below is my test case dataset (from Google). So I have 2 DFs, a base DF and a secondary DF:
baseDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3500,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Vom,5000,mex,IT,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
secDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
77,XYZ,5000,mex,ITA,2/11/2019
I have to compare secDF and baseDF on 2 keys (No and Name). If those fields match (I only need the matched records from secDF), then I have to update the Sal and Dept fields of baseDF with the values from secDF.
Expected output
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
Using PySpark I could use subtract() to find the values of table1 not present in table2 and then unionAll the two tables, or should I use withColumn to overwrite the values satisfying the condition?
Could someone suggest a good way of doing this?
You can do a left join and coalesce the resulting Sal and Dept columns, with secdf taking precedence over basedf:
import pyspark.sql.functions as F
result = basedf.alias('basedf').join(
secdf.alias('secdf'),
['No', 'Name'],
'left'
).select(
[F.coalesce('secdf.Sal', 'basedf.Sal').alias('Sal')
if c == 'Sal'
else F.coalesce('secdf.Dept', 'basedf.Dept').alias('Dept')
if c == 'Dept'
else f'basedf.{c}'
for c in basedf.columns]
)
result.show()
+---+----+----+-------+----+---------+
| No|Name| Sal|Address|Dept|Join_Date|
+---+----+----+-------+----+---------+
| 11| Sam|1000| ind| ITA|2/11/2019|
| 22| Tom|2500| usa| HRA|2/11/2019|
| 33| Kom|3000| uk| ITA|2/11/2019|
| 44| Nom|4600| can| HRA|2/11/2019|
| 55| Vom|8000| mex| ITA|2/11/2019|
| 66| XYZ|5000| mex| IT|2/11/2019|
+---+----+----+-------+----+---------+
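If more columns ever need to come from secdf, the chained conditionals can be replaced with a set of column names; a minimal sketch of that variant, assuming the same basedf and secdf as above:
import pyspark.sql.functions as F

# columns whose values should come from secdf when a match exists
cols_from_sec = {'Sal', 'Dept'}
result = basedf.alias('basedf').join(
    secdf.alias('secdf'), ['No', 'Name'], 'left'
).select(
    [F.coalesce(f'secdf.{c}', f'basedf.{c}').alias(c) if c in cols_from_sec
     else F.col(f'basedf.{c}')
     for c in basedf.columns]
)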
I'm trying to transpose a data frame (the input and the desired transposed output are shown in the answer below), and I need the transposed object to remain a Spark data frame.
Thank you!
Check this out. You can use groupBy and pivot. Please note I renamed the name column because it was ambiguous once the name values were pivoted:
df.show()
# +------------+-----+
# | name|value|
# +------------+-----+
# | Name| str|
# |lastActivity| date|
# | id| str|
# +------------+-----+
from pyspark.sql import functions as F

df1 = df.withColumnRenamed("name", "name_val").groupBy("name_val").pivot("name_val").agg(F.first("value"))
df1.show()
# +------------+----+----+------------+
# | name_val|Name| id|lastActivity|
# +------------+----+----+------------+
# | Name| str|null| null|
# | id|null| str| null|
# |lastActivity|null|null| date|
# +------------+----+----+------------+
df1.select(*[F.first(column, ignorenulls=True).alias(column) for column in df1.columns if column != 'name_val']).show()
#
# +----+---+------------+
# |Name| id|lastActivity|
# +----+---+------------+
# | str|str| date|
# +----+---+------------+
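An alternative that avoids the rename is to pivot against a constant grouping key and drop it afterwards; a minimal sketch, assuming the same df with name and value columns as above:
from pyspark.sql import functions as F

transposed = (
    df.groupBy(F.lit(1).alias("grp"))   # one constant group, so all rows collapse into a single output row
      .pivot("name")                    # the original "name" values become output column names
      .agg(F.first("value"))
      .drop("grp")
)
transposed.show()
# +----+---+------------+
# |Name| id|lastActivity|
# +----+---+------------+
# | str|str|        date|
# +----+---+------------+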
I want to take a DF and duplicate each column (with a new column name).
I want to run "stress tests" on my ML model (implemented using PySpark & Spark Pipeline) and see how well it performs if I double/triple the number of features in my input dataset.
For Example, take this DF:
+-------+-------+-----+------+
| _c0| _c1| _c2| _c3|
+-------+-------+-----+------+
| 1 |Testing| | true |
+-------+-------+-----+------+
and make it like this:
+-------+-------+-----+------+-------+-------+-----+------+
| _c0| _c1| _c2| _c3| _c4| _c5| _c6| _c7|
+-------+-------+-----+------+-------+-------+-----+------+
| 1 |Testing| | true | 1 |Testing| | true |
+-------+-------+-----+------+-------+-------+-----+------+
The easiest way I can do it is like this:
doubledDF = df
for col in df.columns:
    doubledDF = doubledDF.withColumn(col + "1dup", df[col])
However, it takes way too much time.
I would appreciate any solution, and even more the explanation why this solution approach is better.
Thank you very much!
You can do this by using selectExpr(). The asterisk * unpacks a list into separate arguments.
For example, *['_c0', '_c1', '_c2', '_c3'] expands to '_c0', '_c1', '_c2', '_c3'.
With the help of list comprehensions, this code can be generalized fairly well.
df = sqlContext.createDataFrame([(1,'Testing','',True)],('_c0','_c1','_c2','_c3'))
df.show()
+---+-------+---+----+
|_c0| _c1|_c2| _c3|
+---+-------+---+----+
| 1|Testing| |true|
+---+-------+---+----+
col_names = df.columns
print(col_names)
['_c0', '_c1', '_c2', '_c3']
df = df.selectExpr(*col_names, *[i + ' as ' + i + '_dup' for i in col_names])
df.show()
+---+-------+---+----+-------+-------+-------+-------+
|_c0| _c1|_c2| _c3|_c0_dup|_c1_dup|_c2_dup|_c3_dup|
+---+-------+---+----+-------+-------+-------+-------+
| 1|Testing| |true| 1|Testing| | true|
+---+-------+---+----+-------+-------+-------+-------+
Note: The following code will work too.
df = df.selectExpr('*', *[i + ' as ' + i + '_dup' for i in col_names])
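As for why this beats the withColumn loop: each withColumn call creates a new DataFrame and adds another projection to the query plan, so looping over many columns builds a deep plan that Spark has to re-analyze at every step, while a single selectExpr (or select) adds all the duplicated columns in one projection. An equivalent select-based sketch, assuming the same df:
from pyspark.sql import functions as F

# duplicate every column in a single projection instead of one withColumn call per column
doubled = df.select('*', *[F.col(c).alias(c + '_dup') for c in df.columns])
doubled.show()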
I have a dataframe which has a few columns like below:
category| category_id| bucket| prop_count| event_count | accum_prop_count | accum_event_count
-----------------------------------------------------------------------------------------------------
nation | nation | 1 | 222 | 444 | 555 | 6677
This dataframe starts with 0 rows, and each function of my script adds a row to it.
There is a function which needs to modify 1 or 2 cell values based on a condition. How can I do this?
Code:
schema = StructType([StructField("category", StringType()), StructField("category_id", StringType()), StructField("bucket", StringType()), StructField("prop_count", StringType()), StructField("event_count", StringType()), StructField("accum_prop_count",StringType())])
a_df = sqlContext.createDataFrame([],schema)
a_temp = sqlContext.createDataFrame([("nation","nation",1,222,444,555)],schema)
a_df = a_df.unionAll(a_temp)
Rows added from some other function:
a_temp3 = sqlContext.createDataFrame([("nation","state",2,222,444,555)],schema)
a_df = a_df.unionAll(a_temp3)
Now, to modify the dataframe, I am trying a join with a condition.
a_temp4 = sqlContext.createDataFrame([("state","state",2,444,555,666)],schema)
a_df = a_df.join(a_temp4, [(a_df.category_id == a_temp4.category_id) & (some other cond here)], how = "inner")
But this code is not working as I want. This is what I get instead of the expected result:
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
| nation| state| 2| 222| 444| 555| state| state| 2| 444| 555| 666|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
How can I fix this? The correct output should have 2 rows, and the second row should have the updated values.
1). An inner join will delete rows from your initial dataframe; if you want to keep the same number of rows as a_df (on the left), you need a left join.
2). An == condition will duplicate columns; since your join columns have the same names, you can use a list of names instead.
3). I imagine "some other condition" refers to bucket.
4). You want to keep the value from a_temp4 if it exists (the join will set its values to null if it doesn't); psf.coalesce allows you to do this.
import pyspark.sql.functions as psf
a_df = a_df.join(a_temp4, ["category_id", "bucket"], how="leftouter").select(
psf.coalesce(a_temp4.category, a_df.category).alias("category"),
"category_id",
"bucket",
psf.coalesce(a_temp4.prop_count, a_df.prop_count).alias("prop_count"),
psf.coalesce(a_temp4.event_count, a_df.event_count).alias("event_count"),
psf.coalesce(a_temp4.accum_prop_count, a_df.accum_prop_count).alias("accum_prop_count")
)
+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+
| state| state| 2| 444| 555| 666|
| nation| nation| 1| 222| 444| 555|
+--------+-----------+------+----------+-----------+----------------+
If you only work with one-line dataframes, you should consider coding the update directly instead of using a join:
def update_col(category_id, bucket, col_name, col_val):
    return psf.when(
        (a_df.category_id == category_id) & (a_df.bucket == bucket), col_val
    ).otherwise(a_df[col_name]).alias(col_name)
a_df.select(
update_col("state", 2, "category", "nation"),
"category_id",
"bucket",
update_col("state", 2, "prop_count", 444),
update_col("state", 2, "event_count", 555),
update_col("state", 2, "accum_prop_count", 666)
).show()
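If several columns need the same conditional update, the calls can also be generated from a mapping instead of being written out one by one; a minimal sketch reusing update_col and a_df from above, with a hypothetical updates dict:
import pyspark.sql.functions as psf

# hypothetical mapping: column name -> replacement value for the (category_id="state", bucket=2) row
updates = {"category": "nation", "prop_count": 444, "event_count": 555, "accum_prop_count": 666}
a_df.select(
    *[update_col("state", 2, c, updates[c]) if c in updates else psf.col(c)
      for c in a_df.columns]
).show()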
I use pyspark and work with the following dataframe:
+---------+----+--------------------+-------------------+
| id| sid| values| ratio|
+---------+----+--------------------+-------------------+
| 6052791|4178|[2#2#2#2#3#3#3#3#...|0.32673267326732675|
| 57908575|4178|[2#2#2#2#3#3#3#3#...| 0.3173076923076923|
| 78836630|4178|[2#2#2#2#3#3#3#3#...| 0.782608695652174|
|109252111|4178|[2#2#2#2#3#3#3#3#...| 0.2803738317757009|
|139428308|4385|[2#2#2#3#4#4#4#4#...| 1.140625|
|173158079|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|183739386|4390|[3#2#2#3#3#2#4#4#...|0.32080419580419584|
|206815630|4178|[2#2#2#2#3#3#3#3#...|0.14782608695652175|
|242251660|4320|[2#2#2#2#3#3#3#3#...| 0.1452991452991453|
|272670796|5038|[3#2#2#2#2#2#2#3#...| 0.2648648648648649|
|297848516|4320|[2#2#2#2#3#3#3#3#...|0.12195121951219512|
|346566485|4113|[2#3#3#2#2#2#2#3#...| 0.646823138928402|
|369667874|5038|[2#2#2#2#2#2#2#3#...| 0.4546293788454067|
|374645154|4320|[2#2#2#2#3#3#3#3#...|0.34782608695652173|
|400996010|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|401594848|4178|[3#3#6#6#3#3#4#4#...| 0.7647058823529411|
|401954629|4569|[3#3#3#3#3#3#3#3#...| 0.5520833333333333|
|417115190|4320|[2#2#2#2#3#3#3#3#...| 0.6235294117647059|
|423877535|4178|[2#2#2#2#3#3#3#3#...| 0.5538461538461539|
|445523599|4320|[2#2#2#2#3#3#3#3#...| 0.1271186440677966|
+---------+----+--------------------+-------------------+
What I want is to make each sid (e.g. 4178) a column and put the rounded ratio as its row value. If a sid exists for an id, the row is filled with the ratio; if not, it is filled with 0. The result should look as follows:
+---------+------+------+------+
|       id| 4178 | 4385 | 4390 |
+---------+------+------+------+
|  6052791| 0.32 |    0 |    0 |
+---------+------+------+------+
The number of columns is the number of sids that have the same rounded ratio.
If a sid does not exist for an id, then that sid column has to contain 0.
You need a column to group by, for which I am adding a new column called sNo:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(List((6052791, 4178, 0.42673267326732675),
(6052791, 4178, 0.22673267326732675),
(6052791, 4179, 0.62673267326732675),
(6052791, 4180, 0.72673267326732675),
(6052791, 4179, 0.82673267326732675),
(6052791, 4179, 0.92673267326732675))).toDF("id", "sid", "ratio")
df.withColumn("sNo", lit(1))
.groupBy("sNo")
.pivot("sid")
.agg(min("ratio"))
.show
This returns the following output:
+---+-------------------+------------------+------------------+
|sNo| 4178| 4179| 4180|
+---+-------------------+------------------+------------------+
| 1|0.22673267326732674|0.6267326732673267|0.7267326732673267|
+---+-------------------+------------------+------------------+
That sounds like a pivot, which could be done in Spark SQL (Scala version) as follows:
scala> ratios.
groupBy("id").
pivot("sid").
agg(first("ratio")).
show
+-------+-------------------+
| id| 4178|
+-------+-------------------+
|6052791|0.32673267326732675|
+-------+-------------------+
I'm still unsure how to select the other columns (4385 and 4390 in your example). It seems that you round the ratio and search for other sids that would match.
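Since the question itself is in PySpark, here is a minimal PySpark sketch of the same pivot with rounding and zero-filling, assuming df holds the id, sid and ratio columns from the question:
from pyspark.sql import functions as F

result = (
    df.groupBy("id")
      .pivot("sid")                          # one output column per distinct sid
      .agg(F.round(F.first("ratio"), 2))     # rounded ratio for that (id, sid) pair
      .na.fill(0)                            # sids that do not exist for an id become 0 instead of null
)
result.show()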