I need Python code to convert JSON to a dataframe.
My JSON format is
{"feed":{"catalog":{"schema":["somekey":"somevalue"], "add":{"items":[{["somekey":"somevalue"]}]}}....
I want to turn the records inside items in the JSON into a dataframe.
For example:
Input JSON
{"feed":{"catalog":{"schema":["somekey":"somevalue"], "add":{"items":[{[{"sku":"10002","status":"Enabled"},{"sku":"10003","status":"Enabled"}]}]}}
Expected Dataframe
sku    status
10002  Enabled
10003  Enabled
Thanks in advance; please help me solve this problem.
You need to explode the items array to get the sku and status columns.
#sample valid json
jsn='{"feed":{"catalog":{"schema":["somekey","somevalue"], "add":{"items":[{"sku":"10002","status":"Enabled"},{"sku":"10003","status":"Enabled"}]}}}}'
#read the json using spark.read.json
df=spark.read.json(sc.parallelize([jsn]))
#print schema
df.printSchema()
#root
# |-- feed: struct (nullable = true)
# | |-- catalog: struct (nullable = true)
# | | |-- add: struct (nullable = true)
# | | | |-- items: array (nullable = true)
# | | | | |-- element: struct (containsNull = true)
# | | | | | |-- sku: string (nullable = true)
# | | | | | |-- status: string (nullable = true)
# | | |-- schema: array (nullable = true)
# | | | |-- element: string (containsNull = true)
df.withColumn("items",explode(col("feed.catalog.add.items"))).\
select("items.*").\
show()
#+-----+-------+
#| sku| status|
#+-----+-------+
#|10002|Enabled|
#|10003|Enabled|
#+-----+-------+
How can I convert an array (in a column) with a set of elements in a JSON dataset to multiple columns with Python, Spark, or pandas?
The data is structured in this form:
root
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- idAccount: long (nullable = true)
| | |-- infractionType: string (nullable = true)
| | |-- responseTime: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- transactionCode: string (nullable = true)
I'm expecting something like this:
id       idAccount
value    value
value    value
An array of structs can be exploded into columns using the inline SQL function.
Here's an example of how it works.
data_sdf = spark.createDataFrame(
    [([(1234, 2345, 3456), (4321, 5432, 6543)],)],
    'items array<struct<id: int, id_acc: int, foo: int>>'
)
# +----------------------------------------+
# |items |
# +----------------------------------------+
# |[{1234, 2345, 3456}, {4321, 5432, 6543}]|
# +----------------------------------------+
# root
# |-- items: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- id: integer (nullable = true)
# | | |-- id_acc: integer (nullable = true)
# | | |-- foo: integer (nullable = true)
# explode and create new columns using struct fields - using `inline`
data_sdf. \
selectExpr('inline(items)'). \
show()
# +----+------+----+
# | id|id_acc| foo|
# +----+------+----+
# |1234| 2345|3456|
# |4321| 5432|6543|
# +----+------+----+
you can further just select() the required fields after the explosion.
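For example, to keep only some of the generated columns (a minimal sketch reusing the data_sdf defined above):
data_sdf. \
    selectExpr('inline(items)'). \
    select('id', 'id_acc'). \
    show()

# +----+------+
# |  id|id_acc|
# +----+------+
# |1234|  2345|
# |4321|  5432|
# +----+------+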
In Spark SQL, you can access an item in an ArrayType or MapType column by using getItem. For example, if you want to get the value of the id of the first item, you can use something like df.select(func.col('items').getItem(0).getItem('id')), where func is pyspark.sql.functions and items is the array column.
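A minimal sketch of that on the data_sdf from the previous answer (the alias first_id and the functions import alias are my own additions):
import pyspark.sql.functions as func

data_sdf. \
    select(func.col('items').getItem(0).getItem('id').alias('first_id')). \
    show()

# +--------+
# |first_id|
# +--------+
# |    1234|
# +--------+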
I read a large number of deeply nested JSONs with field names that contain special characters, which cause a lot of trouble.
I would like to rename the characters / and - in field names to underscore _, ideally in PySpark. For example, column a-new to a_new.
NOTE: there are thousands of field names with special characters, so it should be done dynamically. If it is easier to just wrap the field names in backquotes, that would also be a solution. The problem I face is that Spark interprets only part of the struct name (a-new as a, etc.).
Ref: Rename nested field in spark dataframe
Input df:
root
|-- a-new: long (nullable = true)
|-- b/old: struct (nullable = true)
| |-- c-red: struct (nullable = true)
| | |-- d/bue: struct (nullable = true)
| | | |-- e-green: string (nullable = true)
| | | |-- f-white: struct (nullable = true)
| | | | |-- g/blue: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- date: long (nullable = true)
| | | | | | |-- val: long (nullable = true)
Required outcome:
root
|-- a_new: long (nullable = true)
|-- b_old: struct (nullable = true)
| |-- c_red: struct (nullable = true)
| | |-- d_bue: struct (nullable = true)
| | | |-- e_green: string (nullable = true)
| | | |-- f_white: struct (nullable = true)
| | | | |-- g_blue: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- date: long (nullable = true)
| | | | | | |-- val: long (nullable = true)
I'm wondering if there is a more efficient way than recreating the df with a new schema, as I found in this solution:
https://stackoverflow.com/a/58030523/9579821
json_1 = """{"a-new":1,"b/old":{"c-red":{"d/bue":{"e-green":"label_1","f-white":{"g/blue":[{"date":2020,"val":1}]}}}}}"""
df = spark.read.json(sc.parallelize([json_1]))
df.printSchema()
# Some imports
from pyspark.sql.types import DataType, StructField, StructType, ArrayType
from copy import copy

# We take a dataframe and return a new one with the required changes
def clean_df(df):
    # Returns a new sanitized field name (this function can be anything really)
    def sanitizeFieldName(s: str) -> str:
        return s.replace("-", "_").replace("/", "_")

    # We call this on all fields to create a copy and to perform any
    # changes we might want to do to the field.
    def sanitizeField(field: StructField) -> StructField:
        field = copy(field)
        field.name = sanitizeFieldName(field.name)
        # We recursively call cleanSchema on all types
        field.dataType = cleanSchema(field.dataType)
        return field

    def cleanSchema(dataType: DataType) -> DataType:
        dataType = copy(dataType)
        # If the type is a StructType we need to recurse, otherwise
        # we can return since we've reached a leaf node
        if isinstance(dataType, StructType):
            # We call our sanitizer for all top level fields
            dataType.fields = [sanitizeField(f) for f in dataType.fields]
        elif isinstance(dataType, ArrayType):
            dataType.elementType = cleanSchema(dataType.elementType)
        return dataType

    # Now since we have the new schema we can create a new DataFrame
    # by using the old frame's RDD as data and the new schema as the
    # schema for the data
    return spark.createDataFrame(df.rdd, cleanSchema(df.schema))

clean_df(df).printSchema()
# Rename top-level columns using `withColumnRenamed`
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace('-', '_').replace('/', '_'))

# Rename nested fields using `cast`
import pyspark.sql.functions as F

for c in df.columns:
    new_schema = df.select(c).schema.simpleString().replace('-', '_').replace('/', '_')[8 + len(c):-1]
    df = df.withColumn(c, F.col(c).cast(new_schema))
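For clarity, the slice at the end just peels off the struct<...> wrapper that simpleString() puts around the column's type. A plain-Python sketch, using a made-up column name:
s = "struct<b_old:struct<c_red:bigint>>"   # roughly what simpleString() returns
c = "b_old"
# 'struct<' is 7 characters, plus 1 for the ':' after the column name,
# so 8 + len(c) skips past 'struct<b_old:'; [:-1] drops the trailing '>'
print(s[8 + len(c):-1])                    # struct<c_red:bigint>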
From a pyspark dataframe I want to create a python list with the schema labels for a specific schema "level".
The schema is:
root
|-- DISPLAY: struct (nullable = true)
| |-- 1WO: struct (nullable = true)
| | |-- JPY: struct (nullable = true)
| | | |-- CHANGE24HOUR: string (nullable = true)
| | | |-- CHANGEDAY: string (nullable = true)
| |-- AAVE: struct (nullable = true)
| | |-- JPY: struct (nullable = true)
| | | |-- CHANGE24HOUR: string (nullable = true)
| | | |-- CHANGEDAY: string (nullable = true)
The expected output is:
list = 1WO, AAVE
The following code prints everything in the schema:
df.schema.jsonValue()
Is there an easy way to extract those labels, please?
Select the first layer using the asterisk notation, and then list the columns:
df.select('DISPLAY.*').columns
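A minimal sketch on a toy dataframe shaped like the question's schema (the JSON values here are made up):
jsn = '{"DISPLAY":{"1WO":{"JPY":{"CHANGE24HOUR":"1","CHANGEDAY":"2"}},"AAVE":{"JPY":{"CHANGE24HOUR":"3","CHANGEDAY":"4"}}}}'
toy_df = spark.read.json(sc.parallelize([jsn]))
print(toy_df.select('DISPLAY.*').columns)
# ['1WO', 'AAVE']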
I have this structure for my dataframe:
root: array (nullable = true)
|-- element: struct (containsNull = true)
| |-- id: long (nullable = true)
| |-- time: struct (nullable = true)
| | |-- start: string (nullable = true)
| | |-- end: string (nullable = true)
| |-- properties: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- key: string (nullable = true)
| | | |-- value: string (nullable = true)
that I need to transform into this one:
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
expanding my key-value array into columns.
Using pivot and groupby I can transform my dataframe:
df2 = df.groupby("start","end","id").pivot("prop.key").agg(last("prop.value", True))
But I also need to group by one (or more) property (key) values, and I can't:
df2 = df.groupby("start","end","id","car_type","car_loc").pivot("prop.key").agg(last("prop.value", True))
where "car_type" and "car_loc" are properties (prop.keys).
I need to refer to these properties through their aliases (not using getItem()).
Is it possible? Can anyone help me, please?
Thank you.
EDIT
An example. I have this situation:
+---+----------+----------+--------------------+
| id|     start|       end|                prop|
+---+----------+----------+--------------------+
|  1|2019-05-12|2020-05-12|    [car_type, fiat]|
|  1|2019-05-12|2020-05-12|     [car_loc, home]|
|  1|2019-05-12|2020-05-12|   [car_num, xd7890]|
|  2|2019-05-13|2020-05-13|    [car_type, fiat]|
|  2|2019-05-13|2020-05-13|     [car_loc, home]|
|  2|2019-05-13|2020-05-13|   [car_num, ae1234]|
|  1|2019-05-12|2020-05-12|    [car_type, ford]|
|  1|2019-05-12|2020-05-12|   [car_loc, office]|
|  1|2019-05-12|2020-05-12|   [car_num, gh7890]|
+---+----------+----------+--------------------+
I need to transform the dataframe to get this:
+----------+----------+---+--------+-------+-------+
|     start|       end| id|car_type|car_loc|car_num|
+----------+----------+---+--------+-------+-------+
|2019-05-12|2020-05-12|  1|    fiat|   home| xd7890|
|2019-05-13|2020-05-13|  2|    fiat|   home| ae1234|
|2019-05-12|2020-05-12|  1|    ford| office| gh7890|
+----------+----------+---+--------+-------+-------+
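For reference, a self-contained sketch of the pivot/groupby call quoted above, run on a few made-up rows shaped like this example (column and key names come from the question; the sample data itself is hypothetical):
from pyspark.sql import functions as F

sample_df = spark.createDataFrame(
    [(1, "2019-05-12", "2020-05-12", ("car_type", "fiat")),
     (1, "2019-05-12", "2020-05-12", ("car_loc", "home")),
     (1, "2019-05-12", "2020-05-12", ("car_num", "xd7890"))],
    "id long, start string, end string, prop struct<key: string, value: string>"
)

# pivot on the nested field prop.key and keep the last non-null prop.value,
# as in the question's working snippet; the result has one row per
# (start, end, id) with car_loc / car_num / car_type columns
sample_df.groupby("start", "end", "id") \
    .pivot("prop.key") \
    .agg(F.last("prop.value", True)) \
    .show()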
I have multiple JSON files that I wish to use to create a Spark dataframe. In testing with a subset, when I load the files, I get rows of the JSON information itself instead of parsed JSON information. I am doing the following:
df = spark.read.json('gutenberg/test')
df.show()
+--------------------+--------------------+--------------------+
| 1| 10| 5|
+--------------------+--------------------+--------------------+
| null|[WrappedArray(),W...| null|
| null| null|[WrappedArray(Uni...|
|[WrappedArray(Jef...| null| null|
+--------------------+--------------------+--------------------+
When I check the schema of the dataframe, it appears to be there, but I am having trouble accessing it:
df.printSchema()
root
|-- 1: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
|-- 10: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
|-- 5: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
I keep getting errors when trying to access the information, so any help would be great.
Specifically, I am looking to create a new dataframe where the columns are ('author', 'formaturi', 'language', 'rights', 'subject', 'title', 'txt')
I am using pyspark 2.2
Since I do not know exactly what the JSON files look like, assuming they are newline-delimited JSONs, this should work.
def _construct_key(previous_key, separator, new_key):
    if previous_key:
        return "{}{}{}".format(previous_key, separator, new_key)
    else:
        return new_key

def flatten(nested_dict, separator="_", root_keys_to_ignore=set()):
    assert isinstance(nested_dict, dict)
    assert isinstance(separator, str)
    flattened_dict = dict()

    def _flatten(object_, key):
        if isinstance(object_, dict):
            for object_key in object_:
                if not (not key and object_key in root_keys_to_ignore):
                    _flatten(object_[object_key],
                             _construct_key(key, separator, object_key))
        elif isinstance(object_, list) or isinstance(object_, set):
            for index, item in enumerate(object_):
                _flatten(item, _construct_key(key, separator, index))
        else:
            flattened_dict[key] = object_

    _flatten(nested_dict, None)
    return flattened_dict

# wrapper applied to each Row: convert it to a nested dict and flatten it
def flatten_row(row):
    return flatten(row.asDict(True))

df = spark.read.json('gutenberg/test',
                     primitivesAsString=True,
                     allowComments=True,
                     allowUnquotedFieldNames=True,
                     allowNumericLeadingZero=True,
                     allowBackslashEscapingAnyCharacter=True,
                     mode='DROPMALFORMED') \
         .rdd.map(flatten_row).toDF()
df.show()