__getnewargs__ error while using udf in Pyspark - python

There is a dataframe with 2 columns (db and tb): db stands for the database and tb stands for the tableName of that database.
+--------------------+--------------------+
| database| tableName|
+--------------------+--------------------+
|aaaaaaaaaaaaaaaaa...| tttttttttttttttt|
|bbbbbbbbbbbbbbbbb...| rrrrrrrrrrrrrrrr|
|aaaaaaaaaaaaaaaaa...| ssssssssssssssssss|
I have the following method in python:
def _get_tb_db(db, tb):
    df = spark.sql("select * from {}.{}".format(db, tb))
    return df.dtypes
and this udf:
test = udf(lambda db, tb: _get_tb_db(db, tb), StringType())
while running this:
df = df.withColumn("dtype", test(col("db"), col("tb")))
I get the following error:
pickle.PicklingError: Could not serialize object: Py4JError: An
error occurred while calling o58.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
I found some discussion on Stack Overflow (Spark __getnewargs__ error), but I am still not sure how to resolve this issue.
Is the error because I am creating another dataframe inside the UDF?
Similar to the solution in the link, I tried this:
cols = copy.deepcopy(df.columns)
df = df.withColumn("dtype", scanning(cols[0], cols[1]))
but I still get the error.
Any solution?

The error means that you cannot use a Spark dataframe (or anything else that references the SparkSession/SparkContext) inside a UDF. But since your dataframe containing database and table names is most likely small, a plain Python for loop on the driver is enough. Below are some methods which might help get your data:
from pyspark.sql import Row
# assume dfs is the df containing database names and table names
dfs.printSchema()
root
|-- database: string (nullable = true)
|-- tableName: string (nullable = true)
Method-1: use df.dtypes
Run the SQL select * from database.tableName limit 1 to generate a df and return its dtypes, converted into a string with str().
data = []
DRow = Row('database', 'tableName', 'dtypes')

for row in dfs.collect():
    try:
        dtypes = spark.sql('select * from `{}`.`{}` limit 1'.format(row.database, row.tableName)).dtypes
        data.append(DRow(row.database, row.tableName, str(dtypes)))
    except Exception as e:
        print("ERROR from {}.{}: [{}]".format(row.database, row.tableName, e))
        pass

df_dtypes = spark.createDataFrame(data)
# DataFrame[database: string, tableName: string, dtypes: string]
Note:
using dtypes instead of str(dtypes) will give the following schema, where _1 and _2 are col_name and col_dtype respectively:
root
|-- database: string (nullable = true)
|-- tableName: string (nullable = true)
|-- dtypes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: string (nullable = true)
With this method, each table will have only one row. With the next two methods, each column type of a table will have its own row.
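If you keep dtypes as the raw array instead of str(dtypes), the following sketch shows one way to flatten it into one row per column afterwards. It assumes the data.append(DRow(row.database, row.tableName, dtypes)) variant was used above, so dtypes is the array<struct<_1,_2>> column from the schema shown:

# sketch: inline() expands each (_1, _2) struct in the array into its own row
df_dtypes.selectExpr('database', 'tableName', 'inline(dtypes)') \
    .withColumnRenamed('_1', 'col_name') \
    .withColumnRenamed('_2', 'col_dtype') \
    .show()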
Method-2: use describe
You can also retrieve this information by running spark.sql("describe tableName"), which returns a dataframe directly; then use a reduce function to union the results from all tables.
from functools import reduce
def get_df_dtypes(db, tb):
    try:
        return spark.sql('desc `{}`.`{}`'.format(db, tb)) \
            .selectExpr(
                '"{}" as `database`'.format(db)
              , '"{}" as `tableName`'.format(tb)
              , 'col_name'
              , 'data_type')
    except Exception as e:
        print("ERROR from {}.{}: [{}]".format(db, tb, e))
        pass
# an example table:
get_df_dtypes('default', 'tbl_df1').show()
+--------+---------+--------+--------------------+
|database|tableName|col_name| data_type|
+--------+---------+--------+--------------------+
| default| tbl_df1| array_b|array<struct<a:st...|
| default| tbl_df1| array_d| array<string>|
| default| tbl_df1|struct_c|struct<a:double,b...|
+--------+---------+--------+--------------------+
# use reduce function to union all tables into one df
df_dtypes = reduce(lambda d1, d2: d1.union(d2), [ get_df_dtypes(row.database, row.tableName) for row in dfs.collect() ])
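One caveat with this version: get_df_dtypes returns None for a table that raised an error, and a None element would break the union. A small guard (a sketch of the same line with failed tables dropped):

frames = [get_df_dtypes(row.database, row.tableName) for row in dfs.collect()]
frames = [f for f in frames if f is not None]  # skip tables that failed inside get_df_dtypes
df_dtypes = reduce(lambda d1, d2: d1.union(d2), frames)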
Method-3: use spark.catalog.listColumns()
Use spark.catalog.listColumns(), which returns a list of pyspark.sql.catalog.Column objects; retrieve name and dataType and merge the data. The resulting dataframe is normalized with col_name and col_dtype in their own columns (same as with Method-2).
data = []
DRow = Row('database', 'tableName', 'col_name', 'col_dtype')

for row in dfs.select('database', 'tableName').collect():
    try:
        for col in spark.catalog.listColumns(row.tableName, row.database):
            data.append(DRow(row.database, row.tableName, col.name, col.dataType))
    except Exception as e:
        print("ERROR from {}.{}: [{}]".format(row.database, row.tableName, e))
        pass

df_dtypes = spark.createDataFrame(data)
# DataFrame[database: string, tableName: string, col_name: string, col_dtype: string]
A note: different Spark distributions/versions might return different results from describe tbl_name and other metadata commands, so make sure the correct column names are used in the queries.
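If the dfs dataframe of database and table names still needs to be built in the first place, the catalog API can supply it as well; this is only a sketch, assuming all tables are registered in the metastore:

# hypothetical way to build dfs from the catalog instead of `show tables`
data = []
for db in spark.catalog.listDatabases():
    for tbl in spark.catalog.listTables(db.name):
        data.append(Row(database=db.name, tableName=tbl.name))
dfs = spark.createDataFrame(data)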

Related

How to create a spark DataFrame from Nested JSON structure

I'm trying to load data from the ExactOnline API into a spark DataFrame. Data comes out of the API in a very ugly format. I have multiple lines of valid JSON objects in one JSON file. One line of JSON looks as follows:
{
  "d": {
    "results": [
      {
        "__metadata": {
          "uri": "https://start.exactonline.nl/api_endpoint",
          "type": "Exact.Web.Api.Models.Account"
        },
        "Accountant": null,
        "AccountManager": null,
        "AccountManagerFullName": null,
        "AccountManagerHID": null,
        ...
      },
      {
        "__metadata": {
          "uri": "https://start.exactonline.nl/api_endpoint",
          "type": "Exact.Web.Api.Models.Account"
        },
        "Accountant": null,
        "AccountManager": null,
        "AccountManagerFullName": null,
        "AccountManagerHID": null,
        ...
      }
    ]
  }
}
What I need is for the keys of the dictionaries in the results list to become the dataframe columns, and each dictionary in results to become a row. In the example I provided above, that would result in a dataframe with the following columns:
__metadata|Accountant|AccountManager|AccountManagerFullName|AccountManagerHID
And two rows, one for each entry in the "results" list.
In Python on my local machine, I am easily able to achieve this by using the following code snippet:
import json
import pandas as pd
folder_path = "path_to_json_file"
def flatten(l):
    return [item for sublist in l for item in sublist]

with open(folder_path) as f:
    # Extract relevant data from each line in the JSON structure and create a nested list,
    # where the "inner" lists are lists with dicts
    # (1 line of JSON in my file = 1 inner list, so if my JSON file has 6
    # lines the nested list will have 6 lists with a number of dictionaries)
    data = [json.loads(line)["d"]["results"] for line in f]

# Flatten the nested lists into one giant list
flat_data = flatten(data)

# Create a dataframe from that flat list.
df = pd.DataFrame(flat_data)
However, I'm using a Pyspark Notebook in Azure Synapse, and the JSON files reside in our Data Lake so I cannot use with open to open files. I am limited to using spark functions. I have tried to achieve what I described above using spark.explode and spark.select:
from pyspark.sql import functions as sf
df = spark.read.json(path=path_to_json_file_in_data_lake)
df_subset = df.select("d.results")
df_exploded = df_subset.withColumn("results", sf.explode(sf.col("results")))
df_exploded has the right number of rows, but not the proper columns. I think I'm searching in the right direction but cannot wrap my head around it. Some assistance would be greatly appreciated.
You can read JSON files directly in Spark with spark.read.json(); use the multiLine option if a single JSON object is spread across multiple lines. Then use the inline SQL function to explode the array and create new columns from the struct fields inside it.
json_sdf = spark.read.option("multiLine", "true").json(
"./drive/MyDrive/samplejsonsparkread.json"
)
# root
# |-- d: struct (nullable = true)
# | |-- results: array (nullable = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- AccountManager: string (nullable = true)
# | | | |-- AccountManagerFullName: string (nullable = true)
# | | | |-- AccountManagerHID: string (nullable = true)
# | | | |-- Accountant: string (nullable = true)
# | | | |-- __metadata: struct (nullable = true)
# | | | | |-- type: string (nullable = true)
# | | | | |-- uri: string (nullable = true)
# use `inline` sql function to explode and create new fields from array of structs
json_sdf.selectExpr("inline(d.results)").show(truncate=False)
# +--------------+----------------------+-----------------+----------+-------------------------------------------------------------------------+
# |AccountManager|AccountManagerFullName|AccountManagerHID|Accountant|__metadata |
# +--------------+----------------------+-----------------+----------+-------------------------------------------------------------------------+
# |null |null |null |null |{Exact.Web.Api.Models.Account, https://start.exactonline.nl/api_endpoint}|
# |null |null |null |null |{Exact.Web.Api.Models.Account, https://start.exactonline.nl/api_endpoint}|
# +--------------+----------------------+-----------------+----------+-------------------------------------------------------------------------+
# root
# |-- AccountManager: string (nullable = true)
# |-- AccountManagerFullName: string (nullable = true)
# |-- AccountManagerHID: string (nullable = true)
# |-- Accountant: string (nullable = true)
# |-- __metadata: struct (nullable = true)
# | |-- type: string (nullable = true)
# | |-- uri: string (nullable = true)
I tried your code and it is working fine. You are just missing one last step:
df_exploded = df_subset.withColumn("results", sf.explode(sf.col('results')))
df_exploded.select("results.*").show()
+--------------+----------------------+-----------------+----------+--------------------+
|AccountManager|AccountManagerFullName|AccountManagerHID|Accountant| __metadata|
+--------------+----------------------+-----------------+----------+--------------------+
| null| null| null| null|[Exact.Web.Api.Mo...|
| null| null| null| null|[Exact.Web.Api.Mo...|
+--------------+----------------------+-----------------+----------+--------------------+
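For reference, the explode and the star-expansion can also be combined into a single chain (a small sketch using the same sf alias as the question):

df_flat = df.select(sf.explode("d.results").alias("results")).select("results.*")
df_flat.show()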

Few fields are missing in printSchema() when read from parquet file using PySpark

I have a parquet file with one column named "isdeleted" with boolean as the datatype.
When I call printSchema(), that "isdeleted" column is not present in the schema. I don't understand what's going on here.
scala> val products = sqlContext.read.format("parquet").load("s3a://bucketname/parquet/data/product")
scala> products.printSchema
I am missing one column from the schema below, which is generated by the above code:
root
|-- student_name: string (nullable = true)
|-- marks: double (nullable = true)
|-- height: double (nullable = true)
|-- city: string (nullable = true)
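One possibility worth checking, purely as an assumption not confirmed by the question: if the "isdeleted" column exists only in some of the parquet part files, Spark infers the schema from a single file by default and can miss it unless schema merging is enabled. A PySpark sketch using the path from the question:

# hedged sketch: mergeSchema reconciles differing schemas across part files (at the cost of a slower read)
products = spark.read.option("mergeSchema", "true").parquet("s3a://bucketname/parquet/data/product")
products.printSchema()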

Handling changing datatypes in Pyspark/Hive

I am having an issue parsing inconsistent datatypes in pyspark. As shown in the example file below, the SA key normally contains a dictionary, but sometimes it appears as a string value. When I try to fetch the column SA.SM.Name, I get the exception shown below.
How do I put null in the SA.SM.Name column in pyspark/hive for the values that are not JSON objects? Can someone help me please?
I tried casting to different datatypes but nothing worked, or maybe I am doing something wrong.
Input file Contents: mypath
{"id":1,"SA":{"SM": {"Name": "John","Email": "John#example.com"}}}
{"id":2,"SA":{"SM": {"Name": "Jerry","Email": "Jerry#example.com"}}}
{"id":3,"SA":"STRINGVALUE"}
df=spark.read.json(my_path)
df.registerTempTable("T")
spark.sql("""select id,SA.SM.Name from T """).show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Can't extract value from SA#6.SM: need struct type but got string; line 1 pos 10"
That is not possible using dataframes, since the column SA is read as a string when Spark loads it. But you can load the file/table using sparkContext as an RDD and then use a cleaner function to map an empty dict value to SA. Here I loaded the file as textFile, but adjust the implementation accordingly if it is a Hadoop file.
import json

def cleaner(record):
    output = {}  # default to a dict so output.get("SA") below is safe even if json.loads fails
    print(type(record))
    try:
        output = json.loads(record)
    except Exception as e:
        print("exception happened")
    finally:
        if isinstance(output.get("SA"), str):
            print("This is string")
            output["SA"] = {}
    return output

dfx = spark.sparkContext.textFile("file://" + my_path)
dfx2 = dfx.map(cleaner)

new_df = spark.createDataFrame(dfx2)
new_df.show(truncate=False)
+---------------------------------------------------+---+
|SA |id |
+---------------------------------------------------+---+
|[SM -> [Email -> John#example.com, Name -> John]] |1 |
|[SM -> [Email -> Jerry#example.com, Name -> Jerry]]|2 |
|[] |3 |
+---------------------------------------------------+---+
new_df.printSchema()
root
|-- SA: map (nullable = true)
| |-- key: string
| |-- value: map (valueContainsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
|-- id: long (nullable = true)
Note: if the output value of Name has to be written back to the same table/column, this solution might not work. If you try to write the loaded dataframe back to the same table, it will cause the SA column to break, and you will get a list of names and emails as per the schema provided in the comments of the question.
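As a follow-up on reading the original field from this map schema, a small sketch; it relies on the fact that map access in Spark SQL returns null for a missing key, so the cleaned-up row with an empty dict simply yields null:

# id=3 has SA mapped to {}, so SA['SM']['Name'] evaluates to null for it
new_df.selectExpr("id", "SA['SM']['Name'] as Name").show()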

Updating column in spark dataframe with json schema

I have JSON files, and I'm trying to hash one field of them with SHA-256. These files are on AWS S3. I am currently using Spark with Python on Apache Zeppelin.
Here is my JSON schema; I am trying to hash the 'mac' field:
|-- Document: struct (nullable = true)
| |-- data: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- mac: string (nullable = true)
I've tried a couple of things:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
import hashlib
hcData = sqlc.read.option("inferSchema","true").json(inputPath)
hcData.registerTempTable("hcData")
name = 'Document'
udf = UserDefinedFunction(lambda x: hashlib.sha256(str(x).encode('utf-8')).hexdigest(), StringType())
new_df = hcData.select(*[udf(column).alias(name) if column == name else column for column in hcData.columns])
This code works fine. But when I try to hash the mac field and change the name variable, nothing happens:
name = 'Document.data[0].mac'
name = 'mac'
I guess it is because it couldn't find a column with the given name.
I've tried to change the code a bit:
def valueToCategory(value):
    return hashlib.sha256(str(value).encode('utf-8')).hexdigest()

udfValueToCategory = udf(valueToCategory, StringType())
df = hcData.withColumn("Document.data[0].mac", udfValueToCategory("Document.data.mac"))
This code hashes "Document.data.mac" and creates a new column with hashed mac addresses, but I want to update the existing column. For fields that are not nested it works without a problem, but for nested fields I couldn't find a way to update them.
So basically, I want to hash a field in a nested JSON file with Spark and Python. Does anyone know how to update a Spark dataframe with such a schema?
Here is the Python solution for my question:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
import hashlib
import re

def find(s, r):
    l = re.findall(r, s)
    if len(l) != 0:
        return l
    else:
        lis = ["null"]
        return lis

def hash(s):
    return hashlib.sha256(str(s).encode('utf-8')).hexdigest()

def hashAll(s, r):
    st = s
    # use finditer/group(0) so full MAC strings are returned (findall would yield group tuples here)
    macs = [m.group(0) for m in re.finditer(r, s)]
    for mac in macs:
        st = st.replace(mac, hash(mac))
    return st

rdd = sc.textFile(inputPath)
regex = "([0-9A-Z]{1,2}[:-]){5}([0-9A-Z]{1,2})"
hashed_rdd = rdd.map(lambda line: hashAll(line, regex))
hashed_rdd.saveAsTextFile(outputPath)
Well, I've found a solution to my question with Scala. There may be some redundant code, but it worked anyway.
import scala.util.matching.Regex
import java.security.MessageDigest

val inputPath = ""
val outputPath = ""

// finds mac addresses with given regex
def find(s: String, r: Regex): List[String] = {
  val l = r.findAllIn(s).toList
  if (!l.isEmpty) {
    return l
  } else {
    val lis: List[String] = List("null")
    return lis
  }
}

// hashes given string with sha256
def hash(s: String): String = {
  return MessageDigest.getInstance("SHA-256").digest(s.getBytes).map(0xFF & _).map { "%02x".format(_) }.foldLeft(""){_ + _}
}

// hashes given line
def hashAll(s: String, r: Regex): String = {
  var st = s
  val macs = find(s, r)
  for (mac <- macs) {
    st = st.replaceAll(mac, hash(mac))
  }
  return st
}

// read data
val rdd = sc.textFile(inputPath)
// mac address regular expression
val regex = "(([0-9A-Z]{1,2}[:-]){5}([0-9A-Z]{1,2}))".r
// hash data
val hashed_rdd = rdd.map(line => hashAll(line, regex))
// write hashed data
hashed_rdd.saveAsTextFile(outputPath)
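If the goal is still to update the nested mac field in place while keeping the dataframe schema intact (as the question originally asked), a DataFrame-only sketch is possible on newer Spark. This assumes Spark 3.1+ (for Column.withField and the Python transform helper) and uses the built-in sha2 instead of hashlib:

from pyspark.sql import functions as F

# rebuild Document.data, replacing each element's mac with its SHA-256 hash,
# leaving the rest of the struct untouched
hashed = hcData.withColumn(
    "Document",
    F.col("Document").withField(
        "data",
        F.transform(
            "Document.data",
            lambda x: x.withField("mac", F.sha2(x.getField("mac"), 256))
        )
    )
)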

How to filter JSON data by multi-value column

With the help of Spark SQL I'm trying to filter out all business items which belong to a specific group category.
The data is loaded from JSON file:
businessJSON = os.path.join(targetDir, 'yelp_academic_dataset_business.json')
businessDF = sqlContext.read.json(businessJSON)
The schema of the file is following:
businessDF.printSchema()
root
|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
| |-- element: string (containsNull = true)
..
|-- type: string (nullable = true)
I'm trying to extract all businesses connected to the restaurant business:
restaurants = businessDF[businessDF.categories.inSet("Restaurants")]
but it doesn't work because, as I understand it, the expected type of the column should be a string, while in my case it is an array. The exception tells me about it:
Py4JJavaError: An error occurred while calling o1589.filter.
: org.apache.spark.sql.AnalysisException: invalid cast from string to array<string>;
Can you please suggest any other way to get what I want?
How about a UDF?
from pyspark.sql import Row
from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import BooleanType
contains = udf(lambda xs, val: val in xs, BooleanType())
df = sqlContext.createDataFrame([Row(categories=["foo", "bar"])])
df.select(contains(df.categories, lit("foo"))).show()
## +----------------------------------+
## |PythonUDF#<lambda>(categories,foo)|
## +----------------------------------+
## | true|
## +----------------------------------+
df.select(contains(df.categories, lit("foobar"))).show()
## +-------------------------------------+
## |PythonUDF#<lambda>(categories,foobar)|
## +-------------------------------------+
## | false|
## +-------------------------------------+
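If a Python UDF is undesirable, array_contains is a built-in alternative (assuming Spark 1.5+):

from pyspark.sql.functions import array_contains

restaurants = businessDF.filter(array_contains("categories", "Restaurants"))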
