group the data in pyspark - python

I have a dataframe like the one below:
df.select(to_json(struct("items"))).show(1, False)
items
-----
[{ "id":"1","types":{ "0":"1", "price":{ "value":"1"}}},
{ "id":"1","types":{ "0":"2", "price":{ "value":"2"}}},
{ "id":"2","types":{ "0":"3", "price":{ "value":"1"}}}]
Now I want to achieve a dataframe like the one below in PySpark.
Basically, I want to group the contents based on the id.
items
-----
[{ "id":"1","types": [ {"0":"1", "price":{ "value":"1"}}, {"0":"2", "price":{ "value":"2"}} ],
{ "id":"2","types": [ {"0":"3", "price":{ "value":"1"}} ]
To reproduce it:
from pyspark.sql import Row
# Spark version: 2.4.4
df = spark.createDataFrame([
    Row(items=[Row(id='1', types=Row(o='1', price=Row(value="1"))),
               Row(id='1', types=Row(o='10', price=Row(value="1"))),
               Row(id='2', types=Row(o='13', price=Row(value="1")))]),
    Row(items=[Row(id='3', types=Row(o='1', price=Row(value="1"))),
               Row(id='4', types=Row(o='10', price=Row(value="1"))),
               Row(id='3', types=Row(o='13', price=Row(value="1")))])
], schema='items:array<struct<id:string,types:struct<`0`:string,price:struct<value:string>>>>')

First I used the higher-order function AGGREGATE to turn the items column into a map column, where the key is the id and the value is an array of the corresponding types values.
To finish, I applied another expression to transform the map back into an array of structs with your desired keys.
import pyspark.sql.functions as f

# Setting spark.sql.mapKeyDedupPolicy is only possible on Spark 3.0 or later,
# so this approach needs Spark 3+
spark.conf.set('spark.sql.mapKeyDedupPolicy', 'LAST_WIN')

agg_df = df.withColumn('items', f.expr(
    'AGGREGATE(items, CAST(MAP() AS MAP<STRING, ARRAY<STRUCT<`0`:STRING, `price`:STRUCT<`value`:STRING>>>>), (acc, item) -> '
    'IF(acc[item.id] IS NULL, '
    'MAP_CONCAT(acc, MAP(item.id, ARRAY(item.types))), '
    'MAP_CONCAT(acc, MAP(item.id, ARRAY_UNION(acc[item.id], ARRAY(item.types))))))'))

transform_df = agg_df.withColumn('items', f.expr(
    'TRANSFORM(MAP_KEYS(items), key -> STRUCT(key AS id, items[key] AS types))'))
# transform_df.printSchema()
# root
# |-- items: array (nullable = true)
# | |-- element: struct (containsNull = false)
# | | |-- id: string (nullable = true)
# | | |-- types: array (nullable = true)
# | | | |-- element: struct (containsNull = true)
# | | | | |-- 0: string (nullable = true)
# | | | | |-- price: struct (nullable = true)
# | | | | | |-- value: string (nullable = true)
# Databricks only
display(transform_df)
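Since the question mentions Spark 2.4.4 and spark.sql.mapKeyDedupPolicy only exists from Spark 3.0 on, here is a minimal Spark 2.4-compatible sketch of the same grouping using explode, groupBy and collect_list instead (the row_id helper column is hypothetical, added only to keep the original rows apart):
import pyspark.sql.functions as f

# Sketch: tag each input row, explode the items, group the types per (row, id),
# then rebuild one array of {id, types} structs per original row.
grouped = (
    df.withColumn('row_id', f.monotonically_increasing_id())
      .withColumn('item', f.explode('items'))
      .groupBy('row_id', f.col('item.id').alias('id'))
      .agg(f.collect_list('item.types').alias('types'))
      .groupBy('row_id')
      .agg(f.collect_list(f.struct('id', 'types')).alias('items'))
      .drop('row_id')
)
grouped.printSchema()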

Related

Convert array of elements to multiple columns

How can I convert an array (in a column) with a set of elements in a JSON dataset to multiple columns with Python, Spark, or pandas?
The data is structured in this form:
root
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- idAccount: long (nullable = true)
| | |-- infractionType: string (nullable = true)
| | |-- responseTime: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- transactionCode: string (nullable = true)
I'm expecting something like this, with one column per struct field:
id | idAccount | infractionType | responseTime | status | transactionCode
value | value | value | value | value | value
An array of structs can be exploded into columns using the inline SQL function.
Here's an example of how it works.
data_sdf = spark.createDataFrame(
    [([(1234, 2345, 3456), (4321, 5432, 6543)],)],
    'items array<struct<id: int, id_acc: int, foo: int>>'
)
# +----------------------------------------+
# |items |
# +----------------------------------------+
# |[{1234, 2345, 3456}, {4321, 5432, 6543}]|
# +----------------------------------------+
# root
# |-- items: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- id: integer (nullable = true)
# | | |-- id_acc: integer (nullable = true)
# | | |-- foo: integer (nullable = true)
# explode and create new columns using struct fields - using `inline`
data_sdf. \
    selectExpr('inline(items)'). \
    show()
# +----+------+----+
# | id|id_acc| foo|
# +----+------+----+
# |1234| 2345|3456|
# |4321| 5432|6543|
# +----+------+----+
You can further just select() the required fields after the explosion.
In Spark SQL, you can access an item in an ArrayType or MapType column by using getItem. For example, to get the id of the first item, you can use df.select(func.col('items').getItem(0).getItem('id')).
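A minimal runnable sketch of that, reusing the data_sdf defined above (the functions alias F is an assumption):
import pyspark.sql.functions as F

# First struct of the array, then its `id` field
data_sdf.select(F.col('items').getItem(0).getItem('id').alias('first_id')).show()
# +--------+
# |first_id|
# +--------+
# |    1234|
# +--------+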

Spark: How to rename dynamically special characters in nested struct fields

I read a large number of deeply nested JSONs with field names that contain special characters, which cause a lot of trouble.
I would like to rename the characters / and - in field names to underscore _, ideally in PySpark. For example, column a-new to a_new.
NOTE: there are thousands of field names with special characters, so it should be done dynamically. If it is easier to just wrap the fields in backticks, that would also be a solution. The problem I face is that Spark interprets only part of the struct name (a-new as a, etc.).
Ref: Rename nested field in spark dataframe
Input df:
root
|-- a-new: long (nullable = true)
|-- b/old: struct (nullable = true)
| |-- c-red: struct (nullable = true)
| | |-- d/bue: struct (nullable = true)
| | | |-- e-green: string (nullable = true)
| | | |-- f-white: struct (nullable = true)
| | | | |-- g/blue: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- date: long (nullable = true)
| | | | | | |-- val: long (nullable = true)
Required outcome:
root
|-- a_new: long (nullable = true)
|-- b_old: struct (nullable = true)
| |-- c_red: struct (nullable = true)
| | |-- d_bue: struct (nullable = true)
| | | |-- e_green: string (nullable = true)
| | | |-- f_white: struct (nullable = true)
| | | | |-- g_blue: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- date: long (nullable = true)
| | | | | | |-- val: long (nullable = true)
I'm wondering if there is a more efficient way than recreating the df with a new schema, as in the solution I found:
https://stackoverflow.com/a/58030523/9579821
json_1 = """{"a-new":1,"b/old":{"c-red":{"d/bue":{"e-green":"label_1","f-white":{"g/blue":[{"date":2020,"val":1}]}}}}}"""
df = spark.read.json(sc.parallelize([json_1]))
df.printSchema()
# Some imports
from pyspark.sql.types import DataType, StructField, StructType, ArrayType
from copy import copy

# We take a dataframe and return a new one with required changes
def clean_df(df):
    # Returns a new sanitized field name (this function can be anything really)
    def sanitizeFieldName(s: str) -> str:
        return s.replace("-", "_").replace("/", "_")

    # We call this on all fields to create a copy and to perform any
    # changes we might want to do to the field.
    def sanitizeField(field: StructField) -> StructField:
        field = copy(field)
        field.name = sanitizeFieldName(field.name)
        # We recursively call cleanSchema on all types
        field.dataType = cleanSchema(field.dataType)
        return field

    def cleanSchema(dataType: DataType) -> DataType:
        dataType = copy(dataType)
        # If the type is a StructType we need to recurse, otherwise
        # we can return since we've reached the leaf node
        if isinstance(dataType, StructType):
            # We call our sanitizer for all top level fields
            dataType.fields = [sanitizeField(f) for f in dataType.fields]
        elif isinstance(dataType, ArrayType):
            dataType.elementType = cleanSchema(dataType.elementType)
        return dataType

    # Now since we have the new schema we can create a new DataFrame
    # by using the old Frame's RDD as data and the new schema as the
    # schema for the data
    return spark.createDataFrame(df.rdd, cleanSchema(df.schema))

clean_df(df).printSchema()
# Rename columns using `withColumnRenamed`
import pyspark.sql.functions as F

for c in df.columns:
    df = df.withColumnRenamed(c, c.replace('-', '_').replace('/', '_'))

# Rename nested fields using `cast`
for c in df.columns:
    new_schema = df.select(c).schema.simpleString().replace('-', '_').replace('/', '_')[8 + len(c):-1]
    df = df.withColumn(c, F.col(c).cast(new_schema))
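The cast works because Spark matches struct fields by position when casting, so a struct column can be cast to an otherwise identical struct type whose fields are simply renamed. A quick check on the json_1 sample above (sketch, output abbreviated):
df.printSchema()
# root
# |-- a_new: long (nullable = true)
# |-- b_old: struct (nullable = true)
# | |-- c_red: struct (nullable = true)
# ...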

AWS Glue Dynamic Frame columns from array

I have a nested JSON, structured as in the following example:
{'A':[{'key':'B','value':'C'},{'key':'D','value':'E'}]}
Now I want to map this to the following schema:
|--A
|--|--B
|--|--D
e.g. a structure recovered from a JSON like:
{'A':{'B':'C','D':'E'}}
The array in 'A' has no fixed number of entries, but the contained dicts always have the two keys 'key' and 'value'.
Please find the script below.
from pyspark.sql.functions import lit, col, explode, create_map, collect_list
from itertools import chain
>>> sample.printSchema()
root
|-- A: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
>>> final_df = (sample
... .select(explode('A').alias("A"))
... .withColumn("A",create_map("A.key", "A.value"))
... .groupby().agg(collect_list("A").alias("A"))
... )
>>> final_df.printSchema()
root
|-- A: array (nullable = true)
| |-- element: map (containsNull = false)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
>>> final_df.show(truncate=False)
+--------------------+
|A |
+--------------------+
|[[B -> C], [D -> E]]|
+--------------------+
>>> (final_df
... .write
... .format("json")
... .mode("overwrite")
... .save("sample_files/2020-09-29/out")
... )
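If the goal is literally the single map per row from the question ({'A': {'B': 'C', 'D': 'E'}}), a shorter sketch using map_from_entries (available from Spark 2.4) should also work, assuming the same sample dataframe:
from pyspark.sql.functions import map_from_entries, col

# Collapse the array of {key, value} structs into one map per row
single_map_df = sample.select(map_from_entries(col("A")).alias("A"))
single_map_df.printSchema()
# root
# |-- A: map (nullable = true)
# |    |-- key: string
# |    |-- value: string (valueContainsNull = true)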

How to convert complex JSON to dataframe by using PySpark?

I require Python code to convert the JSON to a dataframe.
My JSON format is
{"feed":{"catalog":{"schema":["somekey":"somevalue"], "add":{"items":[{["somekey":"somevalue"]}]}}....
I want to turn the JSON into a dataframe built from the records present inside items.
For example:
Input JSON
{"feed":{"catalog":{"schema":["somekey":"somevalue"], "add":{"items":[{[{"sku":"10002","status":"Enabled"},{"sku":"10003","status":"Enabled"}]}]}}
Expected dataframe:
sku     status
10002   Enabled
10003   Enabled
Thanks in advance, please help to solve the problem.
You need to explode the items array to get the sku and status columns.
from pyspark.sql.functions import explode, col

#sample valid json
jsn = '{"feed":{"catalog":{"schema":["somekey","somevalue"], "add":{"items":[{"sku":"10002","status":"Enabled"},{"sku":"10003","status":"Enabled"}]}}}}'

#read the json using spark.read.json
df = spark.read.json(sc.parallelize([jsn]))
#print schema
df.printSchema()
#root
# |-- feed: struct (nullable = true)
# | |-- catalog: struct (nullable = true)
# | | |-- add: struct (nullable = true)
# | | | |-- items: array (nullable = true)
# | | | | |-- element: struct (containsNull = true)
# | | | | | |-- sku: string (nullable = true)
# | | | | | |-- status: string (nullable = true)
# | | |-- schema: array (nullable = true)
# | | | |-- element: string (containsNull = true)
df.withColumn("items",explode(col("feed.catalog.add.items"))).\
select("items.*").\
show()
#+-----+-------+
#| sku| status|
#+-----+-------+
#|10002|Enabled|
#|10003|Enabled|
#+-----+-------+
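As a side note, the inline SQL function shown earlier on this page gives the same result in one step (sketch, assuming the same df):
df.selectExpr("inline(feed.catalog.add.items)").show()
# produces the same sku/status table as above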

How to get the column by its index instead of a name?

I have the following initial PySpark DataFrame:
+----------+--------------------------------+
|product_PK|                        products|
+----------+--------------------------------+
|       686|          [[686,520.70],[645,2]]|
|       685|[[685,45.556],[678,23],[655,21]]|
|       693|                              []|
+----------+--------------------------------+
df = sqlCtx.createDataFrame(
    [(686, [[686, 520.70], [645, 2]]), (685, [[685, 45.556], [678, 23], [655, 21]]), (693, [])],
    ["product_PK", "products"]
)
The column products contains nested data. I need to extract the second value in each pair of values. I am running this code:
temp_dataframe = df.withColumn("exploded", explode(col("products"))).withColumn("score", col("exploded").getItem("_2"))
It works well with a particular DataFrame. However, I want to put this code into a function and run it on different DataFrames. All of my DataFrames have the same structure. The only difference is that the sub-column "_2" might be named differently in some DataFrames, e.g. "col1" or "col2".
For example:
DataFrame content
root
|-- product_PK: long (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: long (nullable = true)
| | |-- _2: double (nullable = true)
|-- exploded: struct (nullable = true)
| |-- _1: long (nullable = true)
| |-- _2: double (nullable = true)
DataFrame content
root
|-- product_PK: long (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- product_PK: long (nullable = true)
| | |-- col2: integer (nullable = true)
|-- exploded: struct (nullable = true)
| |-- product_PK: long (nullable = true)
| |-- col2: integer (nullable = true)
I tried to use an index like getItem(1), but it says that the name of a column must be provided.
Is there any way to avoid specifying the column name or somehow generalize this part of a code?
My goal is that exploded contains the second value of each pair in the nested data, i.e. _2 or col1 or col2.
It sounds like you are on the right track. I think the way to accomplish this is to read the schema to determine the name of the field you want to explode on. Instead of schema.names, though, you need to use schema.fields to find the struct field, and then use its properties to figure out the fields in the struct. Here is an example:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Setup the test dataframe
data = [
    (686, [(686, 520.70), (645, 2.)]),
    (685, [(685, 45.556), (678, 23.), (655, 21.)]),
    (693, [])
]

schema = StructType([
    StructField("product_PK", StringType()),
    StructField("products",
                ArrayType(StructType([
                    StructField("_1", IntegerType()),
                    StructField("col2", FloatType())
                ])))
])

df = sqlCtx.createDataFrame(data, schema)
# Find the products field in the schema, then find the name of the 2nd field
productsField = next(f for f in df.schema.fields if f.name == 'products')
target_field = productsField.dataType.elementType.names[1]
# Do your explode using the field name
temp_dataframe = df.withColumn("exploded" , explode(col("products"))).withColumn("score", col("exploded").getItem(target_field))
Now, if you examine the result you get this:
>>> temp_dataframe.printSchema()
root
|-- product_PK: string (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = true)
| | |-- col2: float (nullable = true)
|-- exploded: struct (nullable = true)
| |-- _1: integer (nullable = true)
| |-- col2: float (nullable = true)
|-- score: float (nullable = true)
Is that what you want?
>>> df.show(10, False)
+----------+-----------------------------------------------------------------------+
|product_PK|products |
+----------+-----------------------------------------------------------------------+
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|
|693 |[] |
+----------+-----------------------------------------------------------------------+
>>> import pyspark.sql.functions as F
>>> df.withColumn("exploded", F.explode("products")) \
... .withColumn("exploded", F.col("exploded").getItem(1)) \
... .show(10,False)
+----------+-----------------------------------------------------------------------+--------+
|product_PK|products |exploded|
+----------+-----------------------------------------------------------------------+--------+
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |null |
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |2 |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|null |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|23 |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|21 |
+----------+-----------------------------------------------------------------------+--------+
Given that your exploded column is a struct as
|-- exploded: struct (nullable = true)
| |-- _1: integer (nullable = true)
| |-- col2: float (nullable = true)
You can use the following logic to get the second element without knowing its name:
from pyspark.sql import functions as F
temp_dataframe = df.withColumn("exploded", F.explode(F.col("products")))
temp_dataframe.withColumn("score", F.col("exploded." + temp_dataframe.select(F.col("exploded.*")).columns[1])).show(truncate=False)
You should then see output like:
+----------+--------------------------------------+------------+------+
|product_PK|products |exploded |score |
+----------+--------------------------------------+------------+------+
|686 |[[686,520.7], [645,2.0]] |[686,520.7] |520.7 |
|686 |[[686,520.7], [645,2.0]] |[645,2.0] |2.0 |
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[685,45.556]|45.556|
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[678,23.0] |23.0 |
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[655,21.0] |21.0 |
+----------+--------------------------------------+------------+------+
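Putting the answers above together, a hedged helper sketch that works for any second field name (the function name and its default arguments are made up for illustration):
from pyspark.sql import functions as F

def explode_second_field(df, array_col="products", out_col="score"):
    # Look up the element type of the array column and take its second field name,
    # then explode and extract that field - generalizes the approaches shown above.
    element_type = df.schema[array_col].dataType.elementType
    second_field = element_type.names[1]
    exploded = df.withColumn("exploded", F.explode(F.col(array_col)))
    return exploded.withColumn(out_col, F.col("exploded").getItem(second_field))

explode_second_field(df).show(truncate=False)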
