I am trying to map a document with this structure to a dataframe.
root
|-- Id: "a1"
|-- Type: "Work"
|-- Tag: Array
| |--0: Object
| | |-- Tag.name : "passHolder"
| | |-- Tag.value : "Jack Ryan"
| | |-- Tag.stat : "verified"
| |-- 1: Object
| | |-- Tag.name : "passNum"
| | |-- Tag.value : "1234"
| | |-- Tag.stat : "unverified"
|-- version: 1.5
By exploding the array with explode_outer, flattening the struct, and renaming with F.col + alias, the dataframe looks like this:
df = df.withColumn("Tag",F.explode_outer("Tag"))
df = df.select(col("*"),
.col("Tag.name").alias("Tag_name"),
.col("Tag.value").alias("Tag_value"),
.col("Tag.stat").alias("Tag_stat")).drop("Tag")
+---+----+----------+----------+----------+-------+
| Id|Type|  Tag_name| Tag_value|  Tag_stat|version|
+---+----+----------+----------+----------+-------+
| a1|Work|passHolder| Jack Ryan|  verified|    1.5|
| a1|Work|   passNum|      1234|unverified|    1.5|
+---+----+----------+----------+----------+-------+
I am trying to reorganise the df structure to make it more queryable, by turning certain row values into column names and populating them with the relevant values.
Can anyone give pointers/steps to arrive at the desired output format below? Thank you very much for the advice.
Target format:
+---+----+--------------+---------------+-----------+------------+-------+
| Id|Type|Tag_passHolder|passHolder_stat|Tag_passNum|passNum_stat|version|
+---+----+--------------+---------------+-----------+------------+-------+
| a1|Work|     Jack Ryan|       verified|       1234|  unverified|    1.5|
+---+----+--------------+---------------+-----------+------------+-------+
Based on the output df you displayed, I would do something like this:
from pyspark.sql import functions as F

passholder_df = df.select(
    "Id",
    "Type",
    F.col("Tag_value").alias("Tag_passHolder"),
    F.col("Tag_stat").alias("passHolder_stat"),
    "version",
).where("Tag_name = 'passHolder'")

passnum_df = df.select(
    "Id",
    "Type",
    F.col("Tag_value").alias("Tag_passNum"),
    F.col("Tag_stat").alias("passNum_stat"),
    "version",
).where("Tag_name = 'passNum'")

passholder_df.join(passnum_df, on=["Id", "Type", "version"], how="full")
You probably need to work a little bit on the join condition, depending on your business rules.
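If there are many distinct tag names, building one dataframe per tag gets verbose. A more generic alternative (a minimal sketch, not tested against your data; the resulting column names differ from your target and would still need renaming) is to pivot on Tag_name:
from pyspark.sql import functions as F

# Pivot on Tag_name so each tag becomes its own set of columns,
# e.g. passHolder_value, passHolder_stat, passNum_value, passNum_stat.
pivoted = (
    df.groupBy("Id", "Type", "version")
      .pivot("Tag_name")
      .agg(
          F.first("Tag_value").alias("value"),
          F.first("Tag_stat").alias("stat"),
      )
)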
Related
I'm trying to load data from the ExactOnline API into a Spark DataFrame. The data comes out of the API in a very ugly format: I have multiple lines of valid JSON objects in one JSON file. One line of JSON looks as follows:
{
  "d": {
    "results": [
      {
        "__metadata": {
          "uri": "https://start.exactonline.nl/api_endpoint",
          "type": "Exact.Web.Api.Models.Account"
        },
        "Accountant": null,
        "AccountManager": null,
        "AccountManagerFullName": null,
        "AccountManagerHID": null,
        ...
      },
      {
        "__metadata": {
          "uri": "https://start.exactonline.nl/api_endpoint",
          "type": "Exact.Web.Api.Models.Account"
        },
        "Accountant": null,
        "AccountManager": null,
        "AccountManagerFullName": null,
        "AccountManagerHID": null,
        ...
      }
    ]
  }
}
What I need is for the keys of the dictionaries in the results list to become the dataframe columns, and for each dictionary in results to become a row. In the example I provided above, that would result in a dataframe with the following columns:
__metadata|Accountant|AccountManager|AccountManagerFullName|AccountManagerHID
And two rows, one for each entry in the "results" list.
In Python on my local machine, I am easily able to achieve this by using the following code snippet:
import json
import pandas as pd

folder_path = "path_to_json_file"

def flatten(l):
    return [item for sublist in l for item in sublist]

with open(folder_path) as f:
    # Extract relevant data from each line in the JSON structure and create a nested list,
    # where the "inner" lists are lists of dicts
    # (1 line of JSON in my file = 1 inner list, so if my JSON file has 6
    # lines the nested list will have 6 lists with a number of dictionaries)
    data = [json.loads(line)["d"]["results"] for line in f]

# Flatten the nested lists into one giant list
flat_data = flatten(data)

# Create a dataframe from that flat list.
df = pd.DataFrame(flat_data)
However, I'm using a PySpark notebook in Azure Synapse, and the JSON files reside in our Data Lake, so I cannot use with open to read them; I am limited to Spark functions. I have tried to achieve the same thing using explode and select:
from pyspark.sql import functions as sf
df = spark.read.json(path=path_to_json_file_in_data_lake)
df_subset = df.select("d.results")
df_exploded = df_subset.withColumn("results", sf.explode(sf.col("results")))
df_exploded has the right number of rows, but not the proper columns. I think I'm searching in the right direction but cannot wrap my head around it. Some assistance would be greatly appreciated.
You can read JSON files directly in Spark with spark.read.json(); use the multiLine option when a single JSON object is spread across multiple lines. Then use the inline SQL function to explode the array and create new columns from the struct fields inside it.
json_sdf = spark.read.option("multiLine", "true").json(
    "./drive/MyDrive/samplejsonsparkread.json"
)
# root
# |-- d: struct (nullable = true)
# | |-- results: array (nullable = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- AccountManager: string (nullable = true)
# | | | |-- AccountManagerFullName: string (nullable = true)
# | | | |-- AccountManagerHID: string (nullable = true)
# | | | |-- Accountant: string (nullable = true)
# | | | |-- __metadata: struct (nullable = true)
# | | | | |-- type: string (nullable = true)
# | | | | |-- uri: string (nullable = true)
# use `inline` sql function to explode and create new fields from array of structs
df.selectExpr("inline(d.results)").show(truncate=False)
# +--------------+----------------------+-----------------+----------+-------------------------------------------------------------------------+
# |AccountManager|AccountManagerFullName|AccountManagerHID|Accountant|__metadata |
# +--------------+----------------------+-----------------+----------+-------------------------------------------------------------------------+
# |null |null |null |null |{Exact.Web.Api.Models.Account, https://start.exactonline.nl/api_endpoint}|
# |null |null |null |null |{Exact.Web.Api.Models.Account, https://start.exactonline.nl/api_endpoint}|
# +--------------+----------------------+-----------------+----------+-------------------------------------------------------------------------+
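# schema of the inlined result (json_sdf.selectExpr("inline(d.results)").printSchema()):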
# root
# |-- AccountManager: string (nullable = true)
# |-- AccountManagerFullName: string (nullable = true)
# |-- AccountManagerHID: string (nullable = true)
# |-- Accountant: string (nullable = true)
# |-- __metadata: struct (nullable = true)
# | |-- type: string (nullable = true)
# | |-- uri: string (nullable = true)
I tried your code, and it is working fine. It is just missing one last step:
df_exploded = df_subset.withColumn("results", sf.explode(sf.col('results')))
df_exploded.select("results.*").show()
+--------------+----------------------+-----------------+----------+--------------------+
|AccountManager|AccountManagerFullName|AccountManagerHID|Accountant| __metadata|
+--------------+----------------------+-----------------+----------+--------------------+
| null| null| null| null|[Exact.Web.Api.Mo...|
| null| null| null| null|[Exact.Web.Api.Mo...|
+--------------+----------------------+-----------------+----------+--------------------+
I have been playing with PETL to see if I can extract data from multiple XML files and combine them into one table.
I have no control over the structure of the XML files. Here are the variations I am seeing that are giving me trouble:
XML File 1 Example:
<?xml version="1.0" encoding="utf-8"?>
<Export>
<Info>
<Name>John Doe</Name>
<Date>01/01/2021</Date>
</Info>
<App>
<Description></Description>
<Type>Two</Type>
<Details>
<DetailOne>1</DetailOne>
<DetailTwo>2</DetailTwo>
</Details>
<Details>
<DetailOne>10</DetailOne>
<DetailTwo>11</DetailTwo>
</Details>
</App>
</Export>
XML File 2 Example:
<?xml version="1.0" encoding="utf-8"?>
<Export>
<Info>
<Name></Name>
<Date>01/02/2021</Date>
</Info>
<App>
<Description>Sample description here.</Description>
<Type>One</Type>
<Details>
<DetailOne>1</DetailOne>
<DetailTwo>2</DetailTwo>
<DetailOne>3</DetailOne>
<DetailTwo>4</DetailTwo>
</Details>
<Details>
<DetailOne>10</DetailOne>
<DetailTwo>11</DetailTwo>
</Details>
</App>
</Export>
My Python code scans the subfolder xmlfiles and then uses PETL to parse each file. Given the structure of the documents, I am loading three tables so far:
1. to hold the Info name and date
2. to hold the description and type
3. to collect the details
import petl as etl
import os
from lxml import etree

for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
    if filename.endswith('.xml'):
        # Get the Info children
        table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })
        # Get the App children
        table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })
        # Get the App Details children
        table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App/Details', {
            'DetailOne': 'DetailOne',
            'DetailTwo': 'DetailTwo'
        })
        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)
        print('Results:\n', result)
I cross-join the three tables because I want the Info and App data on each line along with each detail. This works until I get an XML file that has multiple DetailOne and DetailTwo elements inside a single Details block.
What I am getting as results is:
Results:
+------------+----------+-------------+------+-----------+-----------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+==========+=============+======+===========+===========+==========+
| 01/01/2021 | John Doe | None | Two | 1 | 2 | one.xml |
+------------+----------+-------------+------+-----------+-----------+----------+
| 01/01/2021 | John Doe | None | Two | 10 | 11 | one.xml |
+------------+----------+-------------+------+-----------+-----------+----------+
Results:
+------------+------+--------------------------+------+------------+------------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One | ('1', '3') | ('2', '4') | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 10 | 11 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
The second file showing DetailOne being ('1','3') and DetailTwo being ('2', '4') is not what I want.
What I want is:
+------------+------+--------------------------+------+------------+------------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One | 1 | 2 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 3 | 4 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 10 | 11 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
I believe XPath may be the way to go, but after researching:
https://petl.readthedocs.io/en/stable/io.html#xml-files (which doesn't go into depth on combining lxml and petl),
some light reading here:
https://www.w3schools.com/xml/xpath_syntax.asp
some more reading here:
https://lxml.de/tutorial.html
Any assistance on this is appreciated!
First, thanks for taking the time to write a good question. I'm happy to spend the time answering it.
I've never used PETL, but I did scan the docs for XML processing. I think your main problem is that the <Details> tag sometimes contains one pair of DetailOne/DetailTwo tags and sometimes multiple pairs. If only there were a way to extract a flat list of the DetailOne and DetailTwo tag values, without the enclosing <Details> tags getting in the way...
Fortunately there is. I used https://www.webtoolkitonline.com/xml-xpath-tester.html and the XPath expression //Details/DetailOne returns the list 1,3,10 when applied to your example XML.
So I suspect that something like this should work:
import petl as etl
import os
from lxml import etree

for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
    if filename.endswith('.xml'):
        # Get the Info children
        table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })
        # Get the App children
        table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })
        # Get the App Details children
        table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), '/App', {
            'DetailOne': '//DetailOne',
            'DetailTwo': '//DetailTwo'
        })
        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)
        print('Results:\n', result)
The leading // may be redundant. It is XPath syntax for 'at any level in the document'. I don't know how PETL processes the XPath so I'm trying to play safe. I agree btw - the documentation is rather light on details.
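If PETL's XPath handling still hands you tuples for the repeated tags, one fallback (a rough sketch, untested, using the lxml import you already have; detail_rows is a hypothetical helper) is to pair the values yourself and feed PETL through etl.fromdicts:
import os
import petl as etl
from lxml import etree

def detail_rows(path):
    # Pair up sibling DetailOne/DetailTwo values inside each <Details> element,
    # so repeated pairs become separate rows instead of tuples.
    tree = etree.parse(path)
    for details in tree.findall('.//App/Details'):
        ones = [e.text for e in details.findall('DetailOne')]
        twos = [e.text for e in details.findall('DetailTwo')]
        for one, two in zip(ones, twos):
            yield {'DetailOne': one, 'DetailTwo': two}

# Drop-in replacement for table3 inside the existing loop
table3 = etl.fromdicts(list(detail_rows(os.getcwd() + '.\\xmlfiles\\' + filename)))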
I am having an issue parsing inconsistent datatypes in PySpark. As shown in the example file below, the SA key usually contains a dictionary, but sometimes it appears as a string value. When I try to fetch the column SA.SM.Name, I get the exception shown below.
How do I put null in the SA.SM.Name column in PySpark/Hive for the values that are not JSON objects? Can someone help me please?
I tried casting to different datatypes but nothing worked, or maybe I am doing something wrong.
Input file contents (my_path):
{"id":1,"SA":{"SM": {"Name": "John","Email": "John#example.com"}}}
{"id":2,"SA":{"SM": {"Name": "Jerry","Email": "Jerry#example.com"}}}
{"id":3,"SA":"STRINGVALUE"}
df=spark.read.json(my_path)
df.registerTempTable("T")
spark.sql("""select id,SA.SM.Name from T """).show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Can't extract value from SA#6.SM: need struct type but got string; line 1 pos 10"
That is not possible using dataframes, since the SA column is read as a string when Spark loads it. But you can load the file/table through the sparkContext as an RDD and then apply a cleaner function that maps an empty dict onto SA where needed. Here I loaded the file with textFile; adapt accordingly if it is a Hadoop file.
import json

def cleaner(record):
    output = ""
    try:
        output = json.loads(record)
    except Exception as e:
        print("exception happened")
    finally:
        # If SA came through as a plain string, replace it with an empty dict
        if isinstance(output.get("SA"), str):
            output["SA"] = {}
    return output

dfx = spark.sparkContext.textFile("file://" + my_path)
dfx2 = dfx.map(cleaner)
new_df = spark.createDataFrame(dfx2)
new_df.show(truncate=False)
+---------------------------------------------------+---+
|SA |id |
+---------------------------------------------------+---+
|[SM -> [Email -> John#example.com, Name -> John]] |1 |
|[SM -> [Email -> Jerry#example.com, Name -> Jerry]]|2 |
|[] |3 |
+---------------------------------------------------+---+
new_df.printSchema()
root
|-- SA: map (nullable = true)
| |-- key: string
| |-- value: map (valueContainsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
|-- id: long (nullable = true)
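With that map schema in place, the original query can be expressed against the map keys; missing keys resolve to null, so id 3 gets a null Name (a small sketch against the dataframe above):
# Null-safe lookup: SA['SM'] is null for the empty map, and so is ['Name'] on it
new_df.selectExpr("id", "SA['SM']['Name'] AS Name").show()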
Note: if the output value of Name has to be written back to the same table/column, this solution might not work; if you write the loaded dataframe back to the same table, the SA column will break, and you will get a list of names and emails as per the schema provided in the comments of the question.
I saw that it is possible to access data from context.table in Behave when the table described in the BDD has a header, for example:
Scenario: Add new Expense
Given the user fill out the required fields
| item | name | amount |
| Wine | Julie | 30.00 |
To access this code it's simply:
for row in context.table:
context.page.fill_item(row['item'])
context.page.fill_name(row['name'])
context.page.fill_amount(row['amount'])
That works well and it's very clean; however, I end up with a huge number of step lines when there is a lot of input data, for example:
Given I am on registration page
When I fill "test#test.com" for email address
And I fill "test" for password
And I fill "Didier" for first name
And I fill "Dubois" for last name
And I fill "946132795" for phone number
And I fill "456456456" for mobile phon
And I fill "Company name" for company name
And I fill "Avenue Victor Hugo 1" for address
And I fill "97123" for postal code
And I fill "Lyon" for city
And I select "France" country
...
15 more lines for filling the form
How could I use a table like the following in Behave:
|first name | didier |
|last name | Dubois |
|phone| 4564564564 |
...and so on.
What would my step definition look like?
To use a vertical table rather than a horizontal table, you need to process each row as its own field. The table still needs a header row:
When I fill in the registration form with:
| Field | Value |
| first name | Didier |
| last name | Dubois |
| country | France |
| ... | ... |
In your step definition, loop over the table rows and call a method on your Selenium page model:
for row in context.table:
    context.page.fill_field(row['Field'], row['Value'])
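For completeness, a minimal sketch of how that loop could sit inside a step implementation (the decorator text and the context.page object are assumptions based on the step shown above):
from behave import when

# The step text must match the feature line exactly, including any trailing colon.
@when('I fill in the registration form with:')
def step_fill_registration_form(context):
    for row in context.table:
        context.page.fill_field(row['Field'], row['Value'])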
The Selenium page model method needs to do something based on the field name:
def fill_field(self, field, value):
    if field == 'first name':
        self.first_name.send_keys(value)
    elif field == 'last name':
        self.last_name.send_keys(value)
    elif field == 'country':
        # should be an instance of a Select element
        self.country.select_by_text(value)
    # elif ...: handle the remaining fields the same way
    else:
        raise NameError(f'Field {field} is not valid')
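A dict lookup can keep fill_field from growing into a long if/elif chain (a sketch; the attribute names are the same assumptions as above):
def fill_field(self, field, value):
    # Map field names from the feature table to page-object actions
    actions = {
        'first name': self.first_name.send_keys,
        'last name': self.last_name.send_keys,
        'country': self.country.select_by_text,
    }
    if field not in actions:
        raise NameError(f'Field {field} is not valid')
    actions[field](value)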
With the help of Spark SQL I'm trying to filter out all business items which belong to a specific group category.
The data is loaded from JSON file:
businessJSON = os.path.join(targetDir, 'yelp_academic_dataset_business.json')
businessDF = sqlContext.read.json(businessJSON)
The schema of the file is following:
businessDF.printSchema()
root
|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
| |-- element: string (containsNull = true)
..
|-- type: string (nullable = true)
I'm trying to extract all businesses connected to the restaurant business:
restaurants = businessDF[businessDF.categories.inSet("Restaurants")]
but it doesn't work because, as I understand it, the expected column type should be a string, while in my case it is an array. The exception tells me as much:
Py4JJavaError: An error occurred while calling o1589.filter.
: org.apache.spark.sql.AnalysisException: invalid cast from string to array<string>;
Can you please suggest any other way to get what I want?
How about a UDF?
from pyspark.sql import Row
from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import BooleanType

contains = udf(lambda xs, val: val in xs, BooleanType())

df = sqlContext.createDataFrame([Row(categories=["foo", "bar"])])
df.select(contains(df.categories, lit("foo"))).show()
## +----------------------------------+
## |PythonUDF#<lambda>(categories,foo)|
## +----------------------------------+
## | true|
## +----------------------------------+
df.select(contains(df.categories, lit("foobar"))).show()
## +-------------------------------------+
## |PythonUDF#<lambda>(categories,foobar)|
## +-------------------------------------+
## | false|
## +-------------------------------------+
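If your Spark version ships array_contains (available from 1.5 onwards), the built-in function avoids the Python UDF overhead; a small sketch applied to the original dataframe:
from pyspark.sql.functions import array_contains

# Keep only rows whose categories array contains the literal "Restaurants"
restaurants = businessDF.where(array_contains(businessDF.categories, "Restaurants"))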