How to create a Spark DataFrame from a nested JSON structure - Python

I'm trying to load data from the ExactOnline API into a spark DataFrame. Data comes out of the API in a very ugly format. I have multiple lines of valid JSON objects in one JSON file. One line of JSON looks as follows:
{
  "d": {
    "results": [
      {
        "__metadata": {
          "uri": "https://start.exactonline.nl/api_endpoint",
          "type": "Exact.Web.Api.Models.Account"
        },
        "Accountant": null,
        "AccountManager": null,
        "AccountManagerFullName": null,
        "AccountManagerHID": null,
        ...
      },
      {
        "__metadata": {
          "uri": "https://start.exactonline.nl/api_endpoint",
          "type": "Exact.Web.Api.Models.Account"
        },
        "Accountant": null,
        "AccountManager": null,
        "AccountManagerFullName": null,
        "AccountManagerHID": null,
        ...
      }
    ]
  }
}
What I need is for the keys of the dictionaries in the results list to become the dataframe columns, and for each dictionary in results to become a row. In the example I provided above, that would result in a dataframe with the following columns:
__metadata|Accountant|AccountManager|AccountManagerFullName|AccountManagerHID
And two rows, one for each entry in the "results" list.
In Python on my local machine, I am easily able to achieve this by using the following code snippet:
import json
import pandas as pd

folder_path = "path_to_json_file"

def flatten(l):
    return [item for sublist in l for item in sublist]

with open(folder_path) as f:
    # Extract relevant data from each line in the JSON structure and create a nested list,
    # where the "inner" lists are lists with dicts
    # (1 line of JSON in my file = 1 inner list, so if my JSON file has 6
    # lines the nested list will have 6 lists with a number of dictionaries)
    data = [json.loads(line)["d"]["results"] for line in f]

# Flatten the nested lists into one giant list
flat_data = flatten(data)

# Create a dataframe from that flat list.
df = pd.DataFrame(flat_data)
However, I'm using a PySpark notebook in Azure Synapse, and the JSON files reside in our Data Lake, so I cannot use with open() to read them. I am limited to using Spark functions. I have tried to achieve what I described above using Spark's explode and select functions:
from pyspark.sql import functions as sf
df = spark.read.json(path=path_to_json_file_in_data_lake)
df_subset = df.select("d.results")
df_exploded = df_subset.withColumn("results", sf.explode(sf.col("results")))
df_exploded has the right number of rows, but not the proper columns. I think I'm searching in the right direction but cannot wrap my head around it. Some assistance would be greatly appreciated.

You can read JSON files directly in Spark with spark.read.json(). Use the multiLine option if a single JSON document is spread across multiple lines. Then use the inline SQL function to explode the array and create new columns from the struct fields inside it.
json_sdf = spark.read.option("multiLine", "true").json(
    "./drive/MyDrive/samplejsonsparkread.json"
)
# json_sdf.printSchema() gives:
# root
# |-- d: struct (nullable = true)
# | |-- results: array (nullable = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- AccountManager: string (nullable = true)
# | | | |-- AccountManagerFullName: string (nullable = true)
# | | | |-- AccountManagerHID: string (nullable = true)
# | | | |-- Accountant: string (nullable = true)
# | | | |-- __metadata: struct (nullable = true)
# | | | | |-- type: string (nullable = true)
# | | | | |-- uri: string (nullable = true)
# use `inline` sql function to explode and create new fields from array of structs
json_sdf.selectExpr("inline(d.results)").show(truncate=False)
# +--------------+----------------------+-----------------+----------+-------------------------------------------------------------------------+
# |AccountManager|AccountManagerFullName|AccountManagerHID|Accountant|__metadata |
# +--------------+----------------------+-----------------+----------+-------------------------------------------------------------------------+
# |null |null |null |null |{Exact.Web.Api.Models.Account, https://start.exactonline.nl/api_endpoint}|
# |null |null |null |null |{Exact.Web.Api.Models.Account, https://start.exactonline.nl/api_endpoint}|
# +--------------+----------------------+-----------------+----------+-------------------------------------------------------------------------+
# json_sdf.selectExpr("inline(d.results)").printSchema() gives:
# root
# |-- AccountManager: string (nullable = true)
# |-- AccountManagerFullName: string (nullable = true)
# |-- AccountManagerHID: string (nullable = true)
# |-- Accountant: string (nullable = true)
# |-- __metadata: struct (nullable = true)
# | |-- type: string (nullable = true)
# | |-- uri: string (nullable = true)
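Since the file described in the question actually has one complete JSON object per line (JSON Lines), a variation of the same idea without multiLine should also work; a minimal sketch, assuming path_to_json_file_in_data_lake is the Data Lake path from the question:
# one JSON document per line, so no multiLine option is needed
df = spark.read.json(path_to_json_file_in_data_lake)
# inline() explodes the array of structs and turns every struct field into a column
flat_df = df.selectExpr("inline(d.results)")
flat_df.show(truncate=False)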

I tried your code and it is working fine. It's just missing one last step:
df_exploded = df_subset.withColumn("results", sf.explode(sf.col('results')))
df_exploded.select("results.*").show()
+--------------+----------------------+-----------------+----------+--------------------+
|AccountManager|AccountManagerFullName|AccountManagerHID|Accountant| __metadata|
+--------------+----------------------+-----------------+----------+--------------------+
| null| null| null| null|[Exact.Web.Api.Mo...|
| null| null| null| null|[Exact.Web.Api.Mo...|
+--------------+----------------------+-----------------+----------+--------------------+
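For reference, a minimal end-to-end sketch of this approach, assuming the same path variable as in the question:
from pyspark.sql import functions as sf

df = spark.read.json(path=path_to_json_file_in_data_lake)
df_exploded = df.select("d.results").withColumn("results", sf.explode(sf.col("results")))
# "results.*" promotes every field of the exploded struct to its own column
df_flat = df_exploded.select("results.*")
df_flat.show()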

Related

Few fields are missing in printSchema() when read from parquet file using PySpark

I have a parquet file with a column named "isdeleted" of boolean datatype.
When I call printSchema(), that "isdeleted" column is not present in the schema. I don't understand what's going on here.
scala> val products = sqlContext.read.format("parquet").load("s3a://bucketname/parquet/data/product")
scala> products.printSchema
The column is missing from the schema generated by the code above:
root
|-- student_name: string (nullable = true)
|-- marks: double (nullable = true)
|-- height: double (nullable = true)
|-- city: string (nullable = true)
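One thing that may be worth checking (an assumption, not something confirmed by the question): if the isdeleted column only exists in some of the parquet part files, Spark's default schema inference can pick a footer that does not contain it. The mergeSchema read option merges the schemas of all part files; a sketch in PySpark syntax, using the path from the question:
# assumption: the column exists only in a subset of the part files
products = (spark.read
    .option("mergeSchema", "true")
    .parquet("s3a://bucketname/parquet/data/product"))
products.printSchema()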

Reorganize Pyspark dataframe: Create new column using row element

I am trying to map a document with this structure to a dataframe.
root
|-- Id: "a1"
|-- Type: "Work"
|-- Tag: Array
| |--0: Object
| | |-- Tag.name : "passHolder"
| | |-- Tag.value : "Jack Ryan"
| | |-- Tag.stat : "verified"
| |-- 1: Object
| | |-- Tag.name : "passNum"
| | |-- Tag.value : "1234"
| | |-- Tag.stat : "unverified"
|-- version: 1.5
By exploding the array using explode_outer, flattening the struct, and renaming with col + alias, the dataframe looks like this:
df = df.withColumn("Tag",F.explode_outer("Tag"))
df = df.select(col("*"),
.col("Tag.name").alias("Tag_name"),
.col("Tag.value").alias("Tag_value"),
.col("Tag.stat").alias("Tag_stat")).drop("Tag")
+--+----+----------+---------+----------+-------+
|Id|Type|  Tag_name|Tag_value|  Tag_stat|version|
+--+----+----------+---------+----------+-------+
|a1|Work|passHolder|Jack Ryan|  verified|    1.5|
|a1|Work|   passNum|     1234|unverified|    1.5|
+--+----+----------+---------+----------+-------+
I am trying to reorganise the df structure so that it's more queryable, by turning certain row values into column names and populating them with the relevant values.
Can anyone give pointers on the steps required to arrive at the desired output format below? Thank you very much for the advice.
Target format:
+--+----+--------------+---------------+-----------+------------+-------+
|Id|Type|Tag_passHolder|passHolder_stat|Tag_passNum|passNum_stat|version|
+--+----+--------------+---------------+-----------+------------+-------+
|a1|Work|     Jack Ryan|       verified|       1234|  unverified|    1.5|
+--+----+--------------+---------------+-----------+------------+-------+
Based on the output df you displayed, I would do something like this:
from pyspark.sql import functions as F
passholder_df = df.select(
    "ID",
    "Type",
    F.col("Tag_value").alias("Tag_passHolder"),
    F.col("Tag_stat").alias("passHolder_stat"),
    "version",
).where("Tag_name = 'passHolder'")

passnum_df = df.select(
    "ID",
    "Type",
    F.col("Tag_value").alias("Tag_passNum"),
    F.col("Tag_stat").alias("passNum_stat"),
    "version",
).where("Tag_name = 'passNum'")
passholder_df.join(passnum_df, on=["ID", "Type", "version"], how="full")
You probably need to work a little bit on the join condition, depending on your business rules.
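As an alternative sketch (not part of the answer above), assuming Tag_name only takes the two values shown in the sample data, groupBy plus pivot can produce the wide layout in one pass:
from pyspark.sql import functions as F

# pivot on Tag_name; listing the values explicitly avoids an extra pass over the data
wide_df = (df.groupBy("Id", "Type", "version")
    .pivot("Tag_name", ["passHolder", "passNum"])
    .agg(F.first("Tag_value").alias("value"),
         F.first("Tag_stat").alias("stat")))
# columns come out as passHolder_value, passHolder_stat, passNum_value, passNum_stat;
# rename them afterwards if the exact target names are required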

Handling changing datatypes in Pyspark/Hive

I am having an issue parsing inconsistent datatypes in pyspark. As shown in the example file below, the SA key usually contains a dictionary, but sometimes it appears as a string value. When I try to fetch the column SA.SM.Name, I get the exception shown below.
How do I put null in the SA.SM.Name column in pyspark/hive for the values that are not JSON objects? Can someone help me please?
I tried casting to different datatypes but nothing worked, or maybe I am doing something wrong.
Input file contents (mypath):
{"id":1,"SA":{"SM": {"Name": "John","Email": "John#example.com"}}}
{"id":2,"SA":{"SM": {"Name": "Jerry","Email": "Jerry#example.com"}}}
{"id":3,"SA":"STRINGVALUE"}
df=spark.read.json(my_path)
df.registerTempTable("T")
spark.sql("""select id,SA.SM.Name from T """).show()
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Can't extract value from SA#6.SM: need struct type but got string; line 1 pos 10"
That is not possible using dataframes, since the column SA is read as a string when Spark loads it. But you can load the file/table with sparkContext as an RDD and then use a cleaner function to map an empty dict onto SA. Here I loaded the file with textFile, but adapt it as needed if it is a Hadoop file.
import json

def cleaner(record):
    output = ""
    print(type(record))
    try:
        output = json.loads(record)
    except Exception as e:
        print("exception happened")
    finally:
        if isinstance(output.get("SA"), str):
            print("This is string")
            output["SA"] = {}
    return output
dfx = spark.sparkContext.textFile("file://"+my_path)
dfx2 = dfx.map(cleaner)
new_df = spark.createDataFrame(dfx2)
new_df.show(truncate=False)
+---------------------------------------------------+---+
|SA |id |
+---------------------------------------------------+---+
|[SM -> [Email -> John#example.com, Name -> John]] |1 |
|[SM -> [Email -> Jerry#example.com, Name -> Jerry]]|2 |
|[] |3 |
+---------------------------------------------------+---+
new_df.printSchema()
root
|-- SA: map (nullable = true)
| |-- key: string
| |-- value: map (valueContainsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
|-- id: long (nullable = true)
Note: if the output value of Name has to be written back to the same table/column, this solution might not work. If you try to write the loaded dataframe back to the same table, the SA column will break and you will get a list of names and emails, as per the schema provided in the comments of the question.
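An alternative sketch that stays in the DataFrame API (not part of the answer above, and the schema below only covers the fields shown in the sample): read each line as plain text and parse SA explicitly with from_json, which returns null whenever the value does not match the schema, so SA.SM.Name simply becomes null for the "STRINGVALUE" row:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# schema for the SA object only; extend it if the real data has more fields
sa_schema = StructType([
    StructField("SM", StructType([
        StructField("Name", StringType()),
        StructField("Email", StringType()),
    ]))
])

raw = spark.read.text(my_path)  # one JSON document per line, in a column named "value"
parsed = raw.select(
    F.get_json_object("value", "$.id").cast("long").alias("id"),
    F.from_json(F.get_json_object("value", "$.SA"), sa_schema).alias("SA"),
)
parsed.select("id", "SA.SM.Name").show()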

__getnewargs__ error while using udf in Pyspark

There is a dataframe with 2 columns (db and tb): db stands for database and tb stands for the tableName of that database.
+--------------------+--------------------+
| database| tableName|
+--------------------+--------------------+
|aaaaaaaaaaaaaaaaa...| tttttttttttttttt|
|bbbbbbbbbbbbbbbbb...| rrrrrrrrrrrrrrrr|
|aaaaaaaaaaaaaaaaa...| ssssssssssssssssss|
I have the following method in python:
def _get_tb_db(db, tb):
    df = spark.sql("select * from {}.{}".format(db, tb))
    return df.dtypes
and this udf:
test = udf(lambda db, tb: _get_tb_db(db, tb), StringType())
while running this:
df = df.withColumn("dtype", test(col("db"), col("tb")))
I get the following error:
pickle.PicklingError: Could not serialize object: Py4JError: An
error occurred while calling o58.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
I found some discussion on Stack Overflow (Spark __getnewargs__ error), but I am still not sure how to resolve this issue.
Is the error because I am creating another dataframe inside the UDF?
Similar to the solution in the link, I tried this:
cols = copy.deepcopy(df.columns)
df = df.withColumn("dtype", scanning(cols[0], cols[1]))
but I still get the error. Any solution?
The error means that you cannot use a Spark dataframe inside a UDF. But since your dataframe containing the names of databases and tables is most likely small, it's enough to use a plain Python for loop. Below are some methods which might help get your data:
from pyspark.sql import Row
# assume dfs is the df containing database names and table names
dfs.printSchema()
root
|-- database: string (nullable = true)
|-- tableName: string (nullable = true)
Method-1: use df.dtypes
Run the SQL select * from database.tableName limit 1 to generate a df and return its dtypes, converted to a string.
data = []
DRow = Row('database', 'tableName', 'dtypes')
for row in dfs.collect():
    try:
        dtypes = spark.sql('select * from `{}`.`{}` limit 1'.format(row.database, row.tableName)).dtypes
        data.append(DRow(row.database, row.tableName, str(dtypes)))
    except Exception as e:
        print("ERROR from {}.{}: [{}]".format(row.database, row.tableName, e))
        pass

df_dtypes = spark.createDataFrame(data)
# DataFrame[database: string, tableName: string, dtypes: string]
Note:
Using dtypes instead of str(dtypes) will give the following schema, where _1 and _2 are col_name and col_dtype respectively:
root
|-- database: string (nullable = true)
|-- tableName: string (nullable = true)
|-- dtypes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: string (nullable = true)
Using this method, each table will have only one row. For the next two methods, each col_type of a table will have its own row.
Method-2: use describe
You can also retrieve this information by running spark.sql("describe tableName"), which gives you a dataframe directly, then use a reduce function to union the results from all tables.
from functools import reduce

def get_df_dtypes(db, tb):
    try:
        return spark.sql('desc `{}`.`{}`'.format(db, tb)) \
            .selectExpr(
                '"{}" as `database`'.format(db),
                '"{}" as `tableName`'.format(tb),
                'col_name',
                'data_type')
    except Exception as e:
        print("ERROR from {}.{}: [{}]".format(db, tb, e))
        pass
# an example table:
get_df_dtypes('default', 'tbl_df1').show()
+--------+---------+--------+--------------------+
|database|tableName|col_name| data_type|
+--------+---------+--------+--------------------+
| default| tbl_df1| array_b|array<struct<a:st...|
| default| tbl_df1| array_d| array<string>|
| default| tbl_df1|struct_c|struct<a:double,b...|
+--------+---------+--------+--------------------+
# use reduce function to union all tables into one df
df_dtypes = reduce(lambda d1, d2: d1.union(d2), [ get_df_dtypes(row.database, row.tableName) for row in dfs.collect() ])
Method-3: use spark.catalog.listColumns()
Use spark.catalog.listColumns(), which returns a list of Column objects; retrieve name and dataType from each and merge the data. The resulting dataframe is normalized, with col_name and col_dtype in their own columns (same as with Method-2).
data = []
DRow = Row('database', 'tableName', 'col_name', 'col_dtype')
for row in dfs.select('database', 'tableName').collect():
    try:
        for col in spark.catalog.listColumns(row.tableName, row.database):
            data.append(DRow(row.database, row.tableName, col.name, col.dataType))
    except Exception as e:
        print("ERROR from {}.{}: [{}]".format(row.database, row.tableName, e))
        pass

df_dtypes = spark.createDataFrame(data)
# DataFrame[database: string, tableName: string, col_name: string, col_dtype: string]
A note: different Spark distributions/versions might return different results from describe tbl_name and other metadata commands; make sure the correct column names are used in the queries.

How to filter JSON data by multi-value column

With the help of Spark SQL I'm trying to filter out all business items that belong to a specific group category.
The data is loaded from JSON file:
businessJSON = os.path.join(targetDir, 'yelp_academic_dataset_business.json')
businessDF = sqlContext.read.json(businessJSON)
The schema of the file is the following:
businessDF.printSchema()
root
|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
| |-- element: string (containsNull = true)
..
|-- type: string (nullable = true)
I'm trying to extract all businesses connected to the restaurant business:
restaurants = businessDF[businessDF.categories.inSet("Restaurants")]
but it doesn't work because, as I understand it, the expected column type should be a string, while in my case it is an array. The exception tells me as much:
Py4JJavaError: An error occurred while calling o1589.filter.
: org.apache.spark.sql.AnalysisException: invalid cast from string to array<string>;
Can you please suggest any other way to get what I want?
How about a UDF?
from pyspark.sql import Row
from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import BooleanType

contains = udf(lambda xs, val: val in xs, BooleanType())
df = sqlContext.createDataFrame([Row(categories=["foo", "bar"])])
df.select(contains(df.categories, lit("foo"))).show()
## +----------------------------------+
## |PythonUDF#<lambda>(categories,foo)|
## +----------------------------------+
## | true|
## +----------------------------------+
df.select(contains(df.categories, lit("foobar"))).show()
## +-------------------------------------+
## |PythonUDF#<lambda>(categories,foobar)|
## +-------------------------------------+
## | false|
## +-------------------------------------+
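A non-UDF alternative sketch: if your Spark version provides it, array_contains filters on array columns directly (assuming the same businessDF as in the question):
from pyspark.sql.functions import array_contains

# keep only rows whose categories array contains "Restaurants"
restaurants = businessDF.filter(array_contains(businessDF.categories, "Restaurants"))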
