I'm trying to transpose a data frame (turning its rows into columns) and want to get a data frame back after it's been transposed.
I need the transposed object to remain a Spark data frame.
Thank you!
Check this out: you can use groupBy and pivot. Please note I renamed the name column to name_val, because the original name became ambiguous once its values were pivoted into column names.
df.show()
# +------------+-----+
# | name|value|
# +------------+-----+
# | Name| str|
# |lastActivity| date|
# | id| str|
# +------------+-----+
from pyspark.sql import functions as F

df1 = df.withColumnRenamed("name", "name_val").groupBy("name_val").pivot("name_val").agg(F.first("value"))
df1.show()
# +------------+----+----+------------+
# | name_val|Name| id|lastActivity|
# +------------+----+----+------------+
# | Name| str|null| null|
# | id|null| str| null|
# |lastActivity|null|null| date|
# +------------+----+----+------------+
df1.select(*[F.first(column, ignorenulls=True).alias(column) for column in df1.columns if column != 'name_val']).show()
#
# +----+---+------------+
# |Name| id|lastActivity|
# +----+---+------------+
# | str|str| date|
# +----+---+------------+
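A shorter variant of the same idea, as a rough sketch: since only a single transposed row is needed, you can pivot over an empty grouping and skip both the rename and the second pass.
from pyspark.sql import functions as F

# One global group; pivot the name values into columns and keep the first value for each.
df.groupBy().pivot("name").agg(F.first("value")).show()
# expected: a single row with columns Name, id, lastActivity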
Related
Problem
Hello, is there a way in PySpark/Spark to select the first element over some window based on some condition?
Examples
Let's have an example input dataframe
+---------+----------+----+----+----------------+
| id| timestamp| f1| f2| computed|
+---------+----------+----+----+----------------+
| 1|2020-01-02|null|c1f2| [f2]|
| 1|2020-01-01|c1f1|null| [f1]|
| 2|2020-01-01|c2f1|null| [f1]|
+---------+----------+----+----+----------------+
For each id, I want to select the latest value of each column (f1, f2, ...) that was actually computed.
So the "code" would look like this (pseudocode, since f.first does not take a condition argument):
cols = ["f1", "f2"]
w = Window().partitionBy("id").orderBy(f.desc("timestamp")).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
output_df = (
    input_df.select(
        "id",
        *[f.first(col, condition=f.array_contains(f.col("computed"), col)).over(w).alias(col) for col in cols]
    )
    .groupBy("id")
    .agg(*[f.first(col).alias(col) for col in cols])
)
And output should be
+---------+----+----+
| id| f1| f2|
+---------+----+----+
| 1|c1f1|c1f2|
| 2|c2f1|null|
+---------+----+----+
If the input looks like this
+---------+----------+----+----+----------------+
| id| timestamp| f1| f2| computed|
+---------+----------+----+----+----------------+
| 1|2020-01-02|null|c1f2| [f1, f2]|
| 1|2020-01-01|c1f1|null| [f1]|
| 2|2020-01-01|c2f1|null| [f1]|
+---------+----------+----+----+----------------+
Then the output should be
+---------+----+----+
| id| f1| f2|
+---------+----+----+
| 1|null|c1f2|
| 2|c2f1|null|
+---------+----+----+
As you can see, it's not enough just to use f.first(ignorenulls=True), because in this case we don't want to skip a null that was actually computed: null is a legitimate computed value here.
Current solution
Step 1
Save original data types
cols = ["f1", "f2"]
orig_dtypes = [field.dataType for field in input_df.schema if field.name in cols]
Step 2
For each column, create a new column with its value if the column was computed, and also replace an original null with our "synthetic" <NULL> string:
output_df = input_df.select(
    "id", "timestamp", "computed",
    *[
        f.when(f.array_contains(f.col("computed"), col) & f.col(col).isNotNull(), f.col(col))
        .when(f.array_contains(f.col("computed"), col) & f.col(col).isNull(), "<NULL>")
        .alias(col)
        for col in cols
    ]
)
Step 3
Select the first non-null value over the window, because now we know that <NULL> won't be skipped:
output_df = (
    output_df.select(
        "id",
        *[f.first(col, ignorenulls=True).over(w).alias(col) for col in cols],
    )
    .groupBy("id")
    .agg(*[f.first(col).alias(col) for col in cols])
)
Step 4
Replace our "synthetic" <NULL> with the original nulls:
output_df = output_df.replace("<NULL>", None)
Step 5
Cast the columns back to their original types, because they might have been retyped to string in step 2:
output_df = output_df.select("id", *[f.col(col).cast(type_) for col, type_ in zip(cols, orig_dtypes)])
This solution works, but it does not seem to be the right way to do it. Besides, it's pretty heavy and takes too long to compute.
Is there any other, more "sparkish" way to do it?
Here's one way, using the trick of struct ordering (structs compare field by field, left to right).
Group by id and collect a list of structs like struct<col_exists_in_computed, timestamp, col_value> for each column in the cols list; then, applying the array_max function to the resulting array, you get the latest value you want:
from pyspark.sql import functions as F
output_df = input_df.groupBy("id").agg(
    *[
        F.array_max(
            F.collect_list(
                F.struct(F.array_contains("computed", c), F.col("timestamp"), F.col(c))
            )
        )[c].alias(c)
        for c in cols
    ]
)
# applied to your second dataframe example, it gives
output_df.show()
#+---+----+----+
#| id| f1| f2|
#+---+----+----+
#| 1|null|c1f2|
#| 2|c2f1|null|
#+---+----+----+
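For reference, a minimal sketch to reproduce the second example input and try the aggregation above (the column types are assumptions on my part: plain strings, computed as array<string>, and timestamps kept as ISO strings so that string ordering matches chronological ordering):
cols = ["f1", "f2"]
input_df = spark.createDataFrame(
    [("1", "2020-01-02", None, "c1f2", ["f1", "f2"]),
     ("1", "2020-01-01", "c1f1", None, ["f1"]),
     ("2", "2020-01-01", "c2f1", None, ["f1"])],
    ["id", "timestamp", "f1", "f2", "computed"],
)
Running the groupBy/array_max aggregation above on this input_df gives the two-row output shown.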
I have written data preprocessing code as a Pandas UDF in PySpark. I'm using a lambda function to extract part of the text from all the records of a column.
Here is what my code looks like:
#pandas_udf("string", PandasUDFType.SCALAR)
def get_X(col):
return col.apply(lambda x: x.split(',')[-1] if len(x.split(',')) > 0 else x)
df = df.withColumn('X', get_first_name(df.Y))
This is working fine and giving the desired results, but I need to write the same logic in equivalent native Spark code. Is there a way to do it? Thanks.
I think one function substring_index is enough for this particular task:
from pyspark.sql.functions import substring_index
df = spark.createDataFrame([(x,) for x in ['f,l', 'g', 'a,b,cd']], ['c1'])
df.withColumn('c2', substring_index('c1', ',', -1)).show()
+------+---+
| c1| c2|
+------+---+
| f,l| l|
| g| g|
|a,b,cd| cd|
+------+---+
Given the following DataFrame df:
df.show()
# +-------------+
# | BENF_NME|
# +-------------+
# | Doe, John|
# | Foo|
# |Baz, Quux,Bar|
# +-------------+
You can simply use regexp_extract() to select the first name:
from pyspark.sql.functions import regexp_extract
df.withColumn('First_Name', regexp_extract(df.BENF_NME, r'(?:.*,\s*)?(.*)', 1)).show()
# +-------------+----------+
# | BENF_NME|First_Name|
# +-------------+----------+
# | Doe, John| John|
# | Foo| Foo|
# |Baz, Quux,Bar| Bar|
# +-------------+----------+
If you don't care about possible leading spaces, substring_index() provides a simple alternative to your original logic:
from pyspark.sql.functions import substring_index
df.withColumn('First_Name', substring_index(df.BENF_NME, ',', -1)).show()
# +-------------+----------+
# | BENF_NME|First_Name|
# +-------------+----------+
# | Doe, John| John|
# | Foo| Foo|
# |Baz, Quux,Bar| Bar|
# +-------------+----------+
In this case the first row's First_Name has a leading space:
df.withColumn(...).collect()[0]
# Row(BENF_NME=u'Doe, John', First_Name=u' John')
If you still want to use a custom function, you need to create a user-defined function (UDF) using udf():
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
get_first_name = udf(lambda s: s.split(',')[-1], StringType())
df.withColumn('First_Name', get_first_name(df.BENF_NME)).show()
# +-------------+----------+
# | BENF_NME|First_Name|
# +-------------+----------+
# | Doe, John| John|
# | Foo| Foo|
# |Baz, Quux,Bar| Bar|
# +-------------+----------+
Note that UDFs are slower than the built-in Spark functions, especially Python UDFs.
You can do the same using when to implement if-then-else logic:
First split the column, then compute its size. If the size is greater than 0, take the last element from the split array. Otherwise, return the original column.
from pyspark.sql.functions import split, size, when
def get_first_name(col):
    col_split = split(col, ',')
    split_size = size(col_split)
    return when(split_size > 0, col_split[split_size - 1]).otherwise(col)
As an example, suppose you had the following DataFrame:
df.show()
#+---------+
#| BENF_NME|
#+---------+
#|Doe, John|
#| Madonna|
#+---------+
You can call the new function just as before:
df = df.withColumn('First_Name', get_first_name(df.BENF_NME))
df.show()
#+---------+----------+
#| BENF_NME|First_Name|
#+---------+----------+
#|Doe, John| John|
#| Madonna| Madonna|
#+---------+----------+
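One caveat, assuming the leading space matters: s.split(',')[-1] on "Doe, John" yields " John" with a leading space, just like substring_index does; a small sketch wrapping the result in trim to strip it:
from pyspark.sql.functions import trim

df.withColumn('First_Name', trim(get_first_name(df.BENF_NME))).show()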
I have a column 'true_recoms' in a Spark dataframe:
-RECORD 17-----------------------------------------------------------------
item | 20380109
true_recoms | {"5556867":1,"5801144":5,"7397596":21}
I need to 'explode' this column to get something like this:
item | 20380109
recom_item | 5556867
recom_cnt | 1
..............
item | 20380109
recom_item | 5801144
recom_cnt | 5
..............
item | 20380109
recom_item | 7397596
recom_cnt | 21
I've tried to use from_json but it doesn't work:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

schema_json = StructType(fields=[
    StructField("item", StringType()),
    StructField("recoms", StringType())
])
df.select(col("true_recoms"), from_json(col("true_recoms"), schema_json)).show(5)
+--------+--------------------+------+
| item| true_recoms|true_r|
+--------+--------------------+------+
|31746548|{"32731749":3,"31...| [,]|
|17359322|{"17359392":1,"17...| [,]|
|31480894|{"31480598":1,"31...| [,]|
| 7265665|{"7265891":1,"503...| [,]|
|31350949|{"32218698":1,"31...| [,]|
+--------+--------------------+------+
only showing top 5 rows
The schema is incorrectly defined: you declare it to be a struct with two string fields, item and recoms, while neither field is present in the JSON document.
Unfortunately, from_json (at least in older Spark versions) can return only structs or arrays of structs, so redefining the schema as
MapType(StringType(), LongType())
is not an option.
Personally, I would use a udf:
from pyspark.sql.functions import udf, explode
import json
#udf("map<string, bigint>")
def parse(s):
try:
return json.loads(s)
except json.JSONDecodeError:
pass
which can be applied like this
df = spark.createDataFrame(
    [(31746548, """{"5556867":1,"5801144":5,"7397596":21}""")],
    ("item", "true_recoms")
)
df.select("item", explode(parse("true_recoms")).alias("recom_item", "recom_cnt")).show()
# +--------+----------+---------+
# | item|recom_item|recom_cnt|
# +--------+----------+---------+
# |31746548| 5801144| 5|
# |31746548| 7397596| 21|
# |31746548| 5556867| 1|
# +--------+----------+---------+
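As a side note, on Spark 2.2 and later from_json also accepts a MapType schema, so the same explode can be done without a Python udf; a minimal sketch under that version assumption:
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import MapType, StringType, LongType

# Parse the JSON string directly into a map<string, bigint>, then explode it.
df.select(
    "item",
    explode(from_json(col("true_recoms"), MapType(StringType(), LongType()))).alias("recom_item", "recom_cnt")
).show()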
I have a dataframe in pyspark:
ratings = spark.createDataFrame(
sc.textFile("transactions.json").map(lambda l: json.loads(l)),
)
ratings.show()
+--------+-------------------+------------+----------+-------------+-------+
|click_id| created_at| ip|product_id|product_price|user_id|
+--------+-------------------+------------+----------+-------------+-------+
| 123|2016-10-03 12:50:33| 10.10.10.10| 98373| 220.5| 1|
| 124|2017-02-03 11:51:33| 10.13.10.10| 97373| 320.5| 1|
| 125|2017-10-03 12:52:33| 192.168.2.1| 96373| 20.5| 1|
| 126|2017-10-03 13:50:33|172.16.11.10| 88373| 220.5| 2|
| 127|2017-10-03 13:51:33| 10.12.15.15| 87373| 320.5| 2|
| 128|2017-10-03 13:52:33|192.168.1.10| 86373| 20.5| 2|
| 129|2017-08-03 14:50:33| 10.13.10.10| 78373| 220.5| 3|
| 130|2017-10-03 14:51:33| 12.168.1.60| 77373| 320.5| 3|
| 131|2017-10-03 14:52:33| 10.10.30.30| 76373| 20.5| 3|
+--------+-------------------+------------+----------+-------------+-------+
ratings.registerTempTable("transactions")
final_df = sqlContext.sql("select * from transactions")
I want to add a new column called status to this data frame and then populate it based on created_at and user_id.
The created_at and user_id are read from the given table transactions and passed to a function get_status(user_id, created_at), which returns the status. This status needs to be put into the transactions table as a new column for the corresponding user_id and created_at.
Can I run ALTER and UPDATE commands in PySpark?
How can this be done using PySpark?
It's not clear what you want to do exactly. You should check out window functions; they allow you to compare, sum, etc. rows within a frame.
For instance
import pyspark.sql.functions as psf
from pyspark.sql import Window
w = Window.partitionBy("user_id").orderBy(psf.desc("created_at"))
ratings.withColumn(
    "status",
    psf.when(psf.row_number().over(w) == 1, "active").otherwise("inactive")
).sort("click_id").show()
+--------+-------------------+------------+----------+-------------+-------+--------+
|click_id| created_at| ip|product_id|product_price|user_id| status|
+--------+-------------------+------------+----------+-------------+-------+--------+
| 123|2016-10-03 12:50:33| 10.10.10.10| 98373| 220.5| 1|inactive|
| 124|2017-02-03 11:51:33| 10.13.10.10| 97373| 320.5| 1|inactive|
| 125|2017-10-03 12:52:33| 192.168.2.1| 96373| 20.5| 1| active|
| 126|2017-10-03 13:50:33|172.16.11.10| 88373| 220.5| 2|inactive|
| 127|2017-10-03 13:51:33| 10.12.15.15| 87373| 320.5| 2|inactive|
| 128|2017-10-03 13:52:33|192.168.1.10| 86373| 20.5| 2| active|
| 129|2017-08-03 14:50:33| 10.13.10.10| 78373| 220.5| 3|inactive|
| 130|2017-10-03 14:51:33| 12.168.1.60| 77373| 320.5| 3|inactive|
| 131|2017-10-03 14:52:33| 10.10.30.30| 76373| 20.5| 3| active|
+--------+-------------------+------------+----------+-------------+-------+--------+
This gives you each user's last click.
If you want to use a UDF to create a new column from two existing ones, say you have a function that takes user_id and created_at as arguments:
from pyspark.sql.types import *
def get_status(user_id, created_at):
    ...

get_status_udf = psf.udf(get_status, StringType())
Use StringType(), or whichever data type your function returns:
ratings.withColumn("status", get_status_udf("user_id", "created_at"))
I have a dataframe which has a few columns like below:
category| category_id| bucket| prop_count| event_count | accum_prop_count | accum_event_count
-----------------------------------------------------------------------------------------------------
nation | nation | 1 | 222 | 444 | 555 | 6677
This dataframe starts with 0 rows, and each function of my script adds a row to it.
One function needs to modify one or two cell values based on a condition. How can I do this?
Code:
schema = StructType([
    StructField("category", StringType()),
    StructField("category_id", StringType()),
    StructField("bucket", StringType()),
    StructField("prop_count", StringType()),
    StructField("event_count", StringType()),
    StructField("accum_prop_count", StringType())
])
a_df = sqlContext.createDataFrame([],schema)
a_temp = sqlContext.createDataFrame([("nation","nation",1,222,444,555)],schema)
a_df = a_df.unionAll(a_temp)
Rows added from some other function:
a_temp3 = sqlContext.createDataFrame([("nation","state",2,222,444,555)],schema)
a_df = a_df.unionAll(a_temp3)
Now, to modify the values, I am trying a join with a condition.
a_temp4 = sqlContext.createDataFrame([("state","state",2,444,555,666)],schema)
a_df = a_df.join(a_temp4, [(a_df.category_id == a_temp4.category_id) & (some other cond here)], how = "inner")
But this code is not working as expected. This is what I get:
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
| nation| state| 2| 222| 444| 555| state| state| 2| 444| 555| 666|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
How can I fix this? The correct output should have 2 rows, and the second row should have the updated values.
1) An inner join will drop rows from your initial dataframe; if you want to keep the same number of rows as a_df (on the left), you need a left join.
2) An == join condition will duplicate columns; since your columns have the same names, you can use a list of column names instead.
3) I imagine "some other condition" refers to bucket.
4) You want to keep the value from a_temp4 if it exists (the join will set its values to null if it doesn't); psf.coalesce allows you to do this.
import pyspark.sql.functions as psf
a_df = a_df.join(a_temp4, ["category_id", "bucket"], how="leftouter").select(
    psf.coalesce(a_temp4.category, a_df.category).alias("category"),
    "category_id",
    "bucket",
    psf.coalesce(a_temp4.prop_count, a_df.prop_count).alias("prop_count"),
    psf.coalesce(a_temp4.event_count, a_df.event_count).alias("event_count"),
    psf.coalesce(a_temp4.accum_prop_count, a_df.accum_prop_count).alias("accum_prop_count")
)
+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+
| state| state| 2| 444| 555| 666|
| nation| nation| 1| 222| 444| 555|
+--------+-----------+------+----------+-----------+----------------+
If you only work with one-line dataframes, you should consider coding the update directly instead of using a join:
def update_col(category_id, bucket, col_name, col_val):
    return psf.when((a_df.category_id == category_id) & (a_df.bucket == bucket), col_val).otherwise(a_df[col_name]).alias(col_name)

a_df.select(
    update_col("state", 2, "category", "nation"),
    "category_id",
    "bucket",
    update_col("state", 2, "prop_count", 444),
    update_col("state", 2, "event_count", 555),
    update_col("state", 2, "accum_prop_count", 666)
).show()