I have a JSON object with an unfortunate combination of nesting and arrays, so it's not obvious how to query it with Spark SQL.
Here is a sample object:
{
  stuff: [
    {a:1,b:2,c:3}
  ]
}
So, in JavaScript, to get the value for c, I'd write myData.stuff[0].c
In my Spark SQL query, if that array weren't there, I could use dot notation:
SELECT stuff.c FROM blah
but I can't, because the innermost object is wrapped in an array.
I've tried:
SELECT stuff.0.c FROM blah // FAIL
SELECT stuff.[0].c FROM blah // FAIL
So, what is the magical way to select that data? Or is that even supported yet?
It is not clear what you mean by a JSON object, so let's consider two different cases:
An array of structs
import tempfile
path = tempfile.mktemp()
with open(path, "w") as fw:
    fw.write('''{"stuff": [{"a": 1, "b": 2, "c": 3}]}''')
df = sqlContext.read.json(path)
df.registerTempTable("df")
df.printSchema()
## root
## |-- stuff: array (nullable = true)
## | |-- element: struct (containsNull = true)
## | | |-- a: long (nullable = true)
## | | |-- b: long (nullable = true)
## | | |-- c: long (nullable = true)
sqlContext.sql("SELECT stuff[0].a FROM df").show()
## +---+
## |_c0|
## +---+
## | 1|
## +---+
An array of maps
# Note: schema inference from dictionaries has been deprecated
# don't use this in practice
df = sc.parallelize([{"stuff": [{"a": 1, "b": 2, "c": 3}]}]).toDF()
df.registerTempTable("df")
df.printSchema()
## root
## |-- stuff: array (nullable = true)
## | |-- element: map (containsNull = true)
## | | |-- key: string
## | | |-- value: long (valueContainsNull = true)
sqlContext.sql("SELECT stuff[0]['a'] FROM df").show()
## +---+
## |_c0|
## +---+
## | 1|
## +---+
See also Querying Spark SQL DataFrame with complex types
Related
How can I convert an array column (holding a set of elements) in a JSON dataset into multiple columns with Python, Spark, or pandas?
The data is structured in this form:
root
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- idAccount: long (nullable = true)
| | |-- infractionType: string (nullable = true)
| | |-- responseTime: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- transactionCode: string (nullable = true)
I'm expecting something like this:
id     idAccount
value  value
value  value
An array of structs can be exploded into columns using the inline SQL function.
Here's an example of how it works.
data_sdf = spark.createDataFrame(
    [([(1234, 2345, 3456), (4321, 5432, 6543)],)],
    'items array<struct<id: int, id_acc: int, foo: int>>'
)
# +----------------------------------------+
# |items |
# +----------------------------------------+
# |[{1234, 2345, 3456}, {4321, 5432, 6543}]|
# +----------------------------------------+
# root
# |-- items: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- id: integer (nullable = true)
# | | |-- id_acc: integer (nullable = true)
# | | |-- foo: integer (nullable = true)
# explode and create new columns using struct fields - using `inline`
data_sdf. \
    selectExpr('inline(items)'). \
    show()
# +----+------+----+
# | id|id_acc| foo|
# +----+------+----+
# |1234| 2345|3456|
# |4321| 5432|6543|
# +----+------+----+
You can then select() just the required fields after the explosion.
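As a mental model, `inline` turns every struct in the array into its own row and every struct field into a column. Here is a plain-Python sketch of that reshaping, using the same sample values (the field names are taken from the schema string above; no Spark is involved):

```python
fields = ("id", "id_acc", "foo")

# One DataFrame row whose `items` cell holds two structs.
table = [{"items": [(1234, 2345, 3456), (4321, 5432, 6543)]}]

# Each struct becomes a row; each field becomes a column.
inlined = [dict(zip(fields, struct))
           for row in table
           for struct in row["items"]]

# Keep only the required fields afterwards, like a select() would.
selected = [{"id": r["id"], "foo": r["foo"]} for r in inlined]
print(selected)
```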
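In Spark SQL, you can access an item in an ArrayType or MapType column by using getItem, which is a method on Column rather than a function in pyspark.sql.functions. For example, to get the id of the first item, you can use df.select(func.col('items').getItem(0).getItem('id')), where func is pyspark.sql.functions. For a struct field, getField('id') is the more explicit choice.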
I read a large number of deeply nested JSON files whose field names contain special characters, which cause a lot of trouble.
I would like to replace the characters / and - in field names with an underscore _, ideally in PySpark. For example, column a-new should become a_new.
NOTE: there are thousands of field names with special characters, so it should be done dynamically. If it is easier to simply wrap the field names in backticks, that would also be a solution. The problem I face is that Spark interprets only part of the struct name (a-new as a, etc.).
Ref: Rename nested field in spark dataframe
Input df:
root
|-- a-new: long (nullable = true)
|-- b/old: struct (nullable = true)
| |-- c-red: struct (nullable = true)
| | |-- d/bue: struct (nullable = true)
| | | |-- e-green: string (nullable = true)
| | | |-- f-white: struct (nullable = true)
| | | | |-- g/blue: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- date: long (nullable = true)
| | | | | | |-- val: long (nullable = true)
Required outcome:
root
|-- a_new: long (nullable = true)
|-- b_old: struct (nullable = true)
| |-- c_red: struct (nullable = true)
| | |-- d_bue: struct (nullable = true)
| | | |-- e_green: string (nullable = true)
| | | |-- f_white: struct (nullable = true)
| | | | |-- g_blue: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- date: long (nullable = true)
| | | | | | |-- val: long (nullable = true)
I'm wondering if there is a more efficient way than recreating the df with a new schema, as in the solution I found:
https://stackoverflow.com/a/58030523/9579821
json_1 = """{"a-new":1,"b/old":{"c-red":{"d/bue":{"e-green":"label_1","f-white":{"g/blue":[{"date":2020,"val":1}]}}}}}"""
df = spark.read.json(sc.parallelize([json_1]))
df.printSchema()
# Some imports
from pyspark.sql.types import DataType, StructType, StructField, ArrayType
from copy import copy
# We take a dataframe and return a new one with required changes
def clean_df(df):
    # Returns a new sanitized field name (this function can be anything really)
    def sanitizeFieldName(s: str) -> str:
        return s.replace("-", "_").replace("/", "_")
    # We call this on all fields to create a copy and to perform any
    # changes we might want to do to the field.
    def sanitizeField(field: StructField) -> StructField:
        field = copy(field)
        field.name = sanitizeFieldName(field.name)
        # We recursively call cleanSchema on all types
        field.dataType = cleanSchema(field.dataType)
        return field
    def cleanSchema(dataType: DataType) -> DataType:
        dataType = copy(dataType)
        # If the type is a StructType we need to recurse otherwise
        # we can return since we've reached the leaf node
        if isinstance(dataType, StructType):
            # We call our sanitizer for all top level fields
            dataType.fields = [sanitizeField(f) for f in dataType.fields]
        elif isinstance(dataType, ArrayType):
            dataType.elementType = cleanSchema(dataType.elementType)
        return dataType
    # Now since we have the new schema we can create a new DataFrame
    # by using the old Frame's RDD as data and the new schema as the
    # schema for the data
    return spark.createDataFrame(df.rdd, cleanSchema(df.schema))
clean_df(df).printSchema()
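The same recursion can also be run over the schema's JSON form rather than over copied StructType objects. The sketch below is plain Python and assumes the dictionary layout that `df.schema.jsonValue()` returns (`fields`, `elementType`, and so on). `sanitize_name` and `clean_schema_json` are illustrative names, not part of PySpark, and map types are not handled.

```python
def sanitize_name(name: str) -> str:
    """Replace the characters that break dotted column paths."""
    return name.replace("-", "_").replace("/", "_")


def clean_schema_json(node):
    """Return a copy of a schema dict with every struct field renamed."""
    if not isinstance(node, dict):
        # Leaf types are plain strings such as "long" or "string".
        return node
    if node.get("type") == "struct":
        return {**node, "fields": [
            {**f, "name": sanitize_name(f["name"]),
             "type": clean_schema_json(f["type"])}
            for f in node["fields"]
        ]}
    if node.get("type") == "array":
        return {**node, "elementType": clean_schema_json(node["elementType"])}
    return node


# Hand-written stand-in for df.schema.jsonValue() (assumed layout).
schema_json = {
    "type": "struct",
    "fields": [
        {"name": "a-new", "type": "long", "nullable": True, "metadata": {}},
        {"name": "b/old", "nullable": True, "metadata": {}, "type": {
            "type": "array", "containsNull": True, "elementType": {
                "type": "struct",
                "fields": [{"name": "c-red", "type": "string",
                            "nullable": True, "metadata": {}}],
            },
        }},
    ],
}

cleaned = clean_schema_json(schema_json)
print([f["name"] for f in cleaned["fields"]])
```

With a live session, `StructType.fromJson(cleaned)` should turn the result back into a schema that can be passed to `spark.createDataFrame(df.rdd, ...)` as in the answer above.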
import pyspark.sql.functions as F
# Rename columns using `withColumnRenamed`
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace('-','_').replace('/','_'))
# Rename nested fields using `cast`
for c in df.columns:
    new_schema = df.select(c).schema.simpleString().replace('-','_').replace('/','_')[8+len(c):-1]
    df = df.withColumn(c, F.col(c).cast(new_schema))
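The slice `[8+len(c):-1]` in the second loop is easy to misread. `simpleString()` for a single selected column has the form `struct<name:type>`, so skipping `len('struct<') + len(c) + len(':')` characters and dropping the final `>` leaves just the column's type. Here is a plain-Python sketch, using a hand-written string in place of a real schema:

```python
# Hand-written stand-in for df.select(c).schema.simpleString()
# after the top-level column has already been renamed to "b_old".
c = "b_old"
simple = "struct<b_old:struct<c-red:struct<d/bue:string>>>"

# Rename nested fields, then strip "struct<b_old:" and the closing ">".
renamed = simple.replace("-", "_").replace("/", "_")
new_type = renamed[8 + len(c):-1]

print(new_type)  # struct<c_red:struct<d_bue:string>>
```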
I have a dataframe like the one below:
df.select(to_json(struct("items"))).show(1, False)
items
-----
[{ "id":"1","types":{ "0":"1", "price":{ "value":"1"}}},
{ "id":"1","types":{ "0":"2", "price":{ "value":"2"}}},
{ "id":"2","types":{ "0":"3", "price":{ "value":"1"}}}]
Now I want to produce a dataframe like this in PySpark.
Basically, I want to group the contents by id.
items
-----
[{ "id":"1","types": [ {"0":"1", "price":{ "value":"1"}}, {"0":"2", "price":{ "value":"2"}} ]},
{ "id":"2","types": [ {"0":"3", "price":{ "value":"1"}} ]}]
To reproduce it:
from pyspark.sql import Row
# Spark version: 2.4.4
df = spark.createDataFrame([
Row(items=[Row(id='1',types=Row(o='1',price=Row(value="1"))),
Row(id='1',types=Row(o='10',price=Row(value="1"))),
Row(id='2',types=Row(o='13',price=Row(value="1")))]),
Row(items=[Row(id='3',types=Row(o='1',price=Row(value="1"))),
Row(id='4',types=Row(o='10',price=Row(value="1"))),
Row(id='3',types=Row(o='13',price=Row(value="1")))])
], schema='items:array<struct<id:string,types:struct<`0`:string,price:struct<value:string>>>>')
First, I used the higher-order function AGGREGATE to turn the items column into a map column whose key is your id and whose value is an array of types values.
To finish, I applied a TRANSFORM expression to convert the map back into an array of structs with your desired keys.
import pyspark.sql.functions as f
# It's necessary to run using Spark 3 or later
spark.conf.set('spark.sql.mapKeyDedupPolicy', 'LAST_WIN')
agg_df = df.withColumn('items', f.expr('AGGREGATE(items, CAST(MAP() AS MAP<STRING, ARRAY<STRUCT<`0`:STRING, `price`:STRUCT<`value`:STRING>>>>), (acc, item) -> ' \
'IF(acc[item.id] IS NULL, ' \
'MAP_CONCAT(acc, MAP(item.id, ARRAY(item.types))), ' \
'MAP_CONCAT(acc, MAP(item.id, ARRAY_UNION(acc[item.id], ARRAY(item.types))))))'))
transform_df = agg_df.withColumn('items', f.expr('TRANSFORM(MAP_KEYS(items), key -> STRUCT(key AS id, items[key] AS types))'))
# transform_df.printSchema()
# root
# |-- items: array (nullable = true)
# | |-- element: struct (containsNull = false)
# | | |-- id: string (nullable = true)
# | | |-- types: array (nullable = true)
# | | | |-- element: struct (containsNull = true)
# | | | | |-- 0: string (nullable = true)
# | | | | |-- price: struct (nullable = true)
# | | | | | |-- value: string (nullable = true)
# Databricks only
display(transform_df)
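For readers less familiar with higher-order functions, the AGGREGATE expression above is a fold over the array: it carries a map from id to a list of types and adds each item to it. Here is a minimal plain-Python sketch of the same logic for one row (the `group_types` name and the sample dicts are illustrative only):

```python
def group_types(items):
    """Fold a list of {"id", "types"} dicts into one entry per id."""
    acc = {}  # id -> list of distinct types, in first-seen order
    for item in items:
        bucket = acc.setdefault(item["id"], [])
        # ARRAY_UNION drops duplicates, so skip values already present.
        if item["types"] not in bucket:
            bucket.append(item["types"])
    # Mirrors TRANSFORM(MAP_KEYS(items), key -> STRUCT(...)).
    return [{"id": key, "types": types} for key, types in acc.items()]


items = [
    {"id": "1", "types": {"0": "1", "price": {"value": "1"}}},
    {"id": "1", "types": {"0": "2", "price": {"value": "2"}}},
    {"id": "2", "types": {"0": "3", "price": {"value": "1"}}},
]
grouped = group_types(items)
print(grouped)
```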
I have a nested json, structured as the following example:
{'A':[{'key':'B','value':'C'},{'key':'D','value':'E'}]}
Now I want to map this to the following schema:
|--A
|--|--B
|--|--D
i.e., the structure you would get from JSON like:
{'A':{'B':'C','D':'E'}}
The array in 'A' has no fixed number of entries, but the contained dicts always have the two keys 'key' and 'value'.
Please find the script below.
from pyspark.sql.functions import lit, col, explode, create_map, collect_list
from itertools import chain
>>> sample.printSchema()
root
|-- A: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
>>> final_df = (sample
... .select(explode('A').alias("A"))
... .withColumn("A",create_map("A.key", "A.value"))
... .groupby().agg(collect_list("A").alias("A"))
... )
>>> final_df.printSchema()
root
|-- A: array (nullable = true)
| |-- element: map (containsNull = false)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
>>> final_df.show(truncate=False)
+--------------------+
|A |
+--------------------+
|[[B -> C], [D -> E]]|
+--------------------+
>>> (final_df
... .write
... .format("json")
... .mode("overwrite")
... .save("sample_files/2020-09-29/out")
... )
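Note that the script produces an array of single-entry maps rather than one object. If the goal is the single object shown in the question, the target shape can be sketched in plain Python as below. On Spark 2.4 or later, `map_from_entries` is the built-in that turns an array of two-field structs into one map, though it is not used in the script above.

```python
import json

raw = '{"A": [{"key": "B", "value": "C"}, {"key": "D", "value": "E"}]}'
record = json.loads(raw)

# Collapse the list of key/value pairs into one dictionary.
record["A"] = {entry["key"]: entry["value"] for entry in record["A"]}

print(json.dumps(record))  # {"A": {"B": "C", "D": "E"}}
```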
I need Python code to convert JSON to a DataFrame.
My JSON format is
{"feed":{"catalog":{"schema":["somekey":"somevalue"], "add":{"items":[{["somekey":"somevalue"]}]}}....
I want to convert the records inside items into a DataFrame.
For example:
Input JSON
{"feed":{"catalog":{"schema":["somekey":"somevalue"], "add":{"items":[{[{"sku":"10002","status":"Enabled"},{"sku":"10003","status":"Enabled"}]}]}}
Expected DataFrame
sku    status
10002  Enabled
10003  Enabled
Thanks in advance, please help me solve this problem.
You need to explode the items array to get the sku and status columns.
#sample valid json
jsn='{"feed":{"catalog":{"schema":["somekey","somevalue"], "add":{"items":[{"sku":"10002","status":"Enabled"},{"sku":"10003","status":"Enabled"}]}}}}'
#read the json using spark.read.json
df=spark.read.json(sc.parallelize([jsn]))
#print schema
df.printSchema()
#root
# |-- feed: struct (nullable = true)
# | |-- catalog: struct (nullable = true)
# | | |-- add: struct (nullable = true)
# | | | |-- items: array (nullable = true)
# | | | | |-- element: struct (containsNull = true)
# | | | | | |-- sku: string (nullable = true)
# | | | | | |-- status: string (nullable = true)
# | | |-- schema: array (nullable = true)
# | | | |-- element: string (containsNull = true)
from pyspark.sql.functions import col, explode
df.withColumn("items",explode(col("feed.catalog.add.items"))).\
    select("items.*").\
    show()
#+-----+-------+
#| sku| status|
#+-----+-------+
#|10002|Enabled|
#|10003|Enabled|
#+-----+-------+
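What `explode` followed by `items.*` does can be checked without a cluster. This plain-Python sketch parses the same sample string with the standard `json` module and walks to the `items` list; each dict there corresponds to one output row.

```python
import json

jsn = ('{"feed":{"catalog":{"schema":["somekey","somevalue"], "add":{"items":'
       '[{"sku":"10002","status":"Enabled"},{"sku":"10003","status":"Enabled"}]}}}}')

# Follow the same path used in the Spark query: feed.catalog.add.items
items = json.loads(jsn)["feed"]["catalog"]["add"]["items"]

# One (sku, status) tuple per array element, like one row per exploded item.
rows = [(item["sku"], item["status"]) for item in items]
print(rows)  # [('10002', 'Enabled'), ('10003', 'Enabled')]
```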