AWS Glue Dynamic Frame columns from array - python

I have a nested json, structured as the following example:
{'A':[{'key':'B','value':'C'},{'key':'D','value':'E'}]}
Now I want to map this to the following schema:
|--A
|--|--B
|--|--D
e.g. a structure recovered from a JSON like:
{'A':{'B':'C','D':'E'}}
The array in 'A' has no fixed number of entries, but the contained dicts always have exactly the two keys 'key' and 'value'.

Please find the script below.
from pyspark.sql.functions import lit, col, explode, create_map, collect_list
from itertools import chain
>>> sample.printSchema()
root
|-- A: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
>>> final_df = (sample
... .select(explode('A').alias("A"))
... .withColumn("A",create_map("A.key", "A.value"))
... .groupby().agg(collect_list("A").alias("A"))
... )
>>> final_df.printSchema()
root
|-- A: array (nullable = true)
| |-- element: map (containsNull = false)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
>>> final_df.show(truncate=False)
+--------------------+
|A |
+--------------------+
|[[B -> C], [D -> E]]|
+--------------------+
>>> (final_df
... .write
... .format("json")
... .mode("overwrite")
... .save("sample_files/2020-09-29/out")
... )
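For reference, one way to get the target shape in a single step is map_from_entries (available in Spark 2.4+), which turns the array of {key, value} structs into one map per row instead of a list of single-entry maps. A minimal sketch against the sample DataFrame above (the variable name mapped is mine):
from pyspark.sql.functions import map_from_entries
mapped = sample.withColumn("A", map_from_entries("A"))
mapped.printSchema()
# root
# |-- A: map (nullable = true)
# | |-- key: string
# | |-- value: string (valueContainsNull = true)
Written out with the JSON writer, a map column serializes as an object, i.e. {"A": {"B": "C", "D": "E"}}.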

Related

Convert array of elements to multiple columns

How can I convert an array (in a column) with a set of elements in a JSON dataset to multiple columns with Python, Spark, or pandas?
The data is structured in this form:
root
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- idAccount: long (nullable = true)
| | |-- infractionType: string (nullable = true)
| | |-- responseTime: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- transactionCode: string (nullable = true)
I'm expecting something like this:
id     idAccount
value  value
value  value
An array of structs can be exploded into columns using the inline SQL function.
Here's an example of how it works.
data_sdf = spark.createDataFrame(
    [([(1234, 2345, 3456), (4321, 5432, 6543)],)],
    'items array<struct<id: int, id_acc: int, foo: int>>'
)
# +----------------------------------------+
# |items |
# +----------------------------------------+
# |[{1234, 2345, 3456}, {4321, 5432, 6543}]|
# +----------------------------------------+
# root
# |-- items: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- id: integer (nullable = true)
# | | |-- id_acc: integer (nullable = true)
# | | |-- foo: integer (nullable = true)
# explode and create new columns using struct fields - using `inline`
data_sdf. \
    selectExpr('inline(items)'). \
    show()
# +----+------+----+
# | id|id_acc| foo|
# +----+------+----+
# |1234| 2345|3456|
# |4321| 5432|6543|
# +----+------+----+
You can then select() just the required fields after the explosion.
In Spark SQL, you can access an item in an ArrayType or MapType column by using getItem. For example, to get the value of the id of the first item, you can use df.select(func.col('items').getItem(0).getItem('id')).
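A quick sketch of getItem in action, reusing the data_sdf defined above (assuming pyspark.sql.functions is imported as func):
import pyspark.sql.functions as func
# fetch the id of the first array element without exploding
data_sdf.select(func.col('items').getItem(0).getItem('id').alias('first_id')).show()
# +--------+
# |first_id|
# +--------+
# |    1234|
# +--------+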

How to flatten json file in pyspark

I have a JSON file with the structure shown below. The structure changes every time, so how can we handle flattening any kind of JSON file in PySpark? Can you help me with this?
root
|-- student: struct (nullable = true)
|-- children: struct (nullable = true)
|-- parent: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- type: string (nullable = true)
| |-- date: string (nullable = true)
|-- multipliers: array (nullable = true)
| |-- element: double (containsNull = true)
|-- spawn_time: string (nullable = true)
|-- type: array (nullable = true)
| |-- element: string (containsNull = true)
This approach uses a recursive function to determine the columns to select, by building a flat list of fully-named prefixes in the prefix accumulator parameter.
Note that it will work on any format that supports nesting, not just JSON (Parquet, Avro, etc).
Furthermore, the input can have any schema, but this example uses:
{"c1": {"c3": 4, "c4": 12}, "c2": "w1"}
{"c1": {"c3": 5, "c4": 34}, "c2": "w2"}
The original df shows as:
+-------+---+
| c1| c2|
+-------+---+
|[4, 12]| w1|
|[5, 34]| w2|
+-------+---+
The code:
from pyspark.sql.types import StructType
from pyspark.sql.functions import col

# return a list of all (possibly nested) fields to select, within a given schema
def flatten(schema, prefix: str = ""):
    # return a list of sub-items to select, within a given field
    def field_items(field):
        name = f'{prefix}.{field.name}' if prefix else field.name
        if type(field.dataType) == StructType:
            return flatten(field.dataType, name)
        else:
            return [col(name)]
    return [item for field in schema.fields for item in field_items(field)]

df = spark.read.json(path)
print('===== df =====')
df.printSchema()

flattened = flatten(df.schema)
print('flattened =', flattened)

print('===== df2 =====')
df2 = df.select(*flattened)
df2.printSchema()
df2.show()
As you will see in the output, the flatten function returns a flat list of columns, each one fully named (using the "parent_col.child_col" naming format).
Output:
===== df =====
root
|-- c1: struct (nullable = true)
| |-- c3: long (nullable = true)
| |-- c4: long (nullable = true)
|-- c2: string (nullable = true)
flattened = [Column<b'c1.c3'>, Column<b'c1.c4'>, Column<b'c2'>]
===== df2 =====
root
|-- c3: long (nullable = true)
|-- c4: long (nullable = true)
|-- c2: string (nullable = true)
+---+---+---+
| c3| c4| c2|
+---+---+---+
| 4| 12| w1|
| 5| 34| w2|
+---+---+---+
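Note that the flattened columns keep only their leaf names (c3, c4), so two sibling structs containing a field with the same name would collide. A minimal variation (my own tweak, not part of the original answer) aliases each column with its full underscore-joined path:
from pyspark.sql.types import StructType
from pyspark.sql.functions import col
# same recursion as above, but the leaf branch aliases the column as
# e.g. "c1_c3" so identically-named fields in sibling structs don't clash
def flatten_unique(schema, prefix: str = ""):
    def field_items(field):
        name = f'{prefix}.{field.name}' if prefix else field.name
        if type(field.dataType) == StructType:
            return flatten_unique(field.dataType, name)
        return [col(name).alias(name.replace('.', '_'))]
    return [item for field in schema.fields for item in field_items(field)]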

How to convert complex JSON to dataframe by using PySpark?

I require Python code to convert the JSON to a DataFrame.
My JSON format is
{"feed":{"catalog":{"schema":["somekey":"somevalue"], "add":{"items":[{["somekey":"somevalue"]}]}}....
I want to turn the JSON into a DataFrame built from the entries present inside items.
For example:
Input JSON
{"feed":{"catalog":{"schema":["somekey":"somevalue"], "add":{"items":[{[{"sku":"10002","status":"Enabled"},{"sku":"10003","status":"Enabled"}]}]}}
Expected DataFrame:
sku    status
10002  Enabled
10003  Enabled
Thanks in advance; please help me solve this problem.
You need to explode the items array to get the sku and status columns.
from pyspark.sql.functions import explode, col

# sample valid json
jsn = '{"feed":{"catalog":{"schema":["somekey","somevalue"], "add":{"items":[{"sku":"10002","status":"Enabled"},{"sku":"10003","status":"Enabled"}]}}}}'

# read the json using spark.read.json
df = spark.read.json(sc.parallelize([jsn]))

# print schema
df.printSchema()
#root
# |-- feed: struct (nullable = true)
# | |-- catalog: struct (nullable = true)
# | | |-- add: struct (nullable = true)
# | | | |-- items: array (nullable = true)
# | | | | |-- element: struct (containsNull = true)
# | | | | | |-- sku: string (nullable = true)
# | | | | | |-- status: string (nullable = true)
# | | |-- schema: array (nullable = true)
# | | | |-- element: string (containsNull = true)
df.withColumn("items", explode(col("feed.catalog.add.items"))).\
    select("items.*").\
    show()
#+-----+-------+
#| sku| status|
#+-----+-------+
#|10002|Enabled|
#|10003|Enabled|
#+-----+-------+
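As an aside, the inline SQL function gives the same two columns in one step; a sketch using the same df, with output identical to the above:
df.selectExpr("inline(feed.catalog.add.items)").show()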

How to get only particular attribute from a column in dataframe schema?

I have this schema for DataFrame df:
root
|-- id: long (nullable = true)
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _href: string (nullable = true)
| | |-- type: string (nullable = true)
How can I modify the DataFrame so that column a contains only the _href values, without _VALUE and type?
Is it possible?
I've tried something like this, but it's wrong:
df = df.withColumn('a', 'a._href')
For example, this is my data:
+---+---------------------------------------------------------------------+
|id| a |
+---+---------------------------------------------------------------------+
| 17|[[Gwendolyn Tucke,http://facebook.com],[i have , http://youtube.com]]|
| 23|[[letter, http://google.com],[hihow are you , http://google.co.il]] |
+---+---------------------------------------------------------------------+
but I want it to look like this:
+---+---------------------------------------------+
|id| a |
+---+---------------------------------------------+
| 17|[[http://facebook.com],[ http://youtube.com]]|
| 23|[[http://google.com],[http://google.co.il]] |
+---+---------------------------------------------+
PS: I don't want to use pandas at all.
You could just select a._href and assign it to a new column. Try this Scala solution.
scala> case class sub(_value:String,_href:String)
defined class sub
scala> val df = Seq((17,Array(sub("Gwendolyn Tucke","http://facebook.com"),sub("i have"," http://youtube.com"))),(23,Array(sub("letter","http://google.com"),sub("hihow are you","http://google.co.il")))).toDF("id","a")
df: org.apache.spark.sql.DataFrame = [id: int, a: array<struct<_value:string,_href:string>>]
scala> df.show(false)
+---+-----------------------------------------------------------------------+
|id |a |
+---+-----------------------------------------------------------------------+
|17 |[[Gwendolyn Tucke, http://facebook.com], [i have, http://youtube.com]]|
|23 |[[letter, http://google.com], [hihow are you, http://google.co.il]] |
+---+-----------------------------------------------------------------------+
scala> df.select("id","a._href").show(false)
+---+------------------------------------------+
|id |_href |
+---+------------------------------------------+
|17 |[http://facebook.com, http://youtube.com]|
|23 |[http://google.com, http://google.co.il] |
+---+------------------------------------------+
You can assign it to a new column
scala> val df2 = df.withColumn("result",$"a._href")
df2: org.apache.spark.sql.DataFrame = [id: int, a: array<struct<_value:string,_href:string>> ... 1 more field]
scala> df2.show(false)
+---+-----------------------------------------------------------------------+------------------------------------------+
|id |a |result |
+---+-----------------------------------------------------------------------+------------------------------------------+
|17 |[[Gwendolyn Tucke, http://facebook.com], [i have, http://youtube.com]]|[http://facebook.com, http://youtube.com]|
|23 |[[letter, http://google.com], [hihow are you, http://google.co.il]] |[http://google.com, http://google.co.il] |
+---+-----------------------------------------------------------------------+------------------------------------------+
scala> df2.printSchema
root
|-- id: integer (nullable = false)
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _value: string (nullable = true)
| | |-- _href: string (nullable = true)
|-- result: array (nullable = true)
| |-- element: string (containsNull = true)
You can try the code below:
from pyspark.sql.functions import *
df.select("id", explode("a").alias("a")).select("id", "a._href", "a.type").show()
The above code will return a DataFrame with three columns (id, _href, type) at the same level, which you can use for further analysis.
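For completeness, the same field extraction works directly in PySpark without exploding, since selecting a child field of an array-of-structs column yields an array of that field's values per row (a minimal sketch, assuming the schema from the question):
from pyspark.sql.functions import col
# array<struct<_VALUE, _href, type>> becomes array<string> of hrefs
df = df.withColumn('a', col('a._href'))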

How to get the column by its index instead of a name?

I have the following initial PySpark DataFrame:
+----------+--------------------------------+
|product_PK|                        products|
+----------+--------------------------------+
|       686|          [[686,520.70],[645,2]]|
|       685|[[685,45.556],[678,23],[655,21]]|
|       693|                              []|
+----------+--------------------------------+
df = sqlCtx.createDataFrame(
[(686, [[686,520.70], [645,2]]), (685, [[685,45.556], [678,23],[655,21]]), (693, [])],
["product_PK", "products"]
)
The column products contains nested data. I need to extract the second value in each pair of values. I am running this code:
temp_dataframe = df.withColumn("exploded", explode(col("products"))).withColumn("score", col("exploded").getItem("_2"))
It works well with this particular DataFrame. However, I want to put this code into a function and run it on different DataFrames. All of my DataFrames have the same structure. The only difference is that the sub-column "_2" might be named differently in some DataFrames, e.g. "col1" or "col2".
For example:
DataFrame content
root
|-- product_PK: long (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: long (nullable = true)
| | |-- _2: double (nullable = true)
|-- exploded: struct (nullable = true)
| |-- _1: long (nullable = true)
| |-- _2: double (nullable = true)
DataFrame content
root
|-- product_PK: long (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- product_PK: long (nullable = true)
| | |-- col2: integer (nullable = true)
|-- exploded: struct (nullable = true)
| |-- product_PK: long (nullable = true)
| |-- col2: integer (nullable = true)
I tried to use an index like getItem(1), but it says that the name of a column must be provided.
Is there any way to avoid specifying the column name or somehow generalize this part of a code?
My goal is that exploded contains the second value of each pair in the nested data, i.e. _2 or col1 or col2.
It sounds like you were on the right track. I think the way to accomplish this is to read the schema to determine the name of the field you want to explode on. Instead of schema.names, though, you need to use schema.fields to find the struct field, and then use its properties to figure out the fields in the struct. Here is an example:
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Set up the test dataframe
data = [
    (686, [(686, 520.70), (645, 2.)]),
    (685, [(685, 45.556), (678, 23.), (655, 21.)]),
    (693, [])
]

schema = StructType([
    StructField("product_PK", StringType()),
    StructField("products",
                ArrayType(StructType([
                    StructField("_1", IntegerType()),
                    StructField("col2", FloatType())
                ])))
])

df = sqlCtx.createDataFrame(data, schema)

# Find the products field in the schema, then find the name of the 2nd field
productsField = next(f for f in df.schema.fields if f.name == 'products')
target_field = productsField.dataType.elementType.names[1]

# Do your explode using the field name
temp_dataframe = df.withColumn("exploded", explode(col("products"))).withColumn("score", col("exploded").getItem(target_field))
Now, if you examine the result you get this:
>>> temp_dataframe.printSchema()
root
|-- product_PK: string (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = true)
| | |-- col2: float (nullable = true)
|-- exploded: struct (nullable = true)
| |-- _1: integer (nullable = true)
| |-- col2: float (nullable = true)
|-- score: float (nullable = true)
Is that what you want?
>>> df.show(10, False)
+----------+-----------------------------------------------------------------------+
|product_PK|products |
+----------+-----------------------------------------------------------------------+
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|
|693 |[] |
+----------+-----------------------------------------------------------------------+
>>> import pyspark.sql.functions as F
>>> df.withColumn("exploded", F.explode("products")) \
... .withColumn("exploded", F.col("exploded").getItem(1)) \
... .show(10,False)
+----------+-----------------------------------------------------------------------+--------+
|product_PK|products |exploded|
+----------+-----------------------------------------------------------------------+--------+
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |null |
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |2 |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|null |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|23 |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|21 |
+----------+-----------------------------------------------------------------------+--------+
Given that your exploded column is a struct as
|-- exploded: struct (nullable = true)
| |-- _1: integer (nullable = true)
| |-- col2: float (nullable = true)
You can use the following logic to get the second element without knowing its name:
from pyspark.sql import functions as F

temp_dataframe = df.withColumn("exploded", F.explode(F.col("products")))
temp_dataframe.withColumn("score", F.col("exploded." + temp_dataframe.select(F.col("exploded.*")).columns[1])).show(truncate=False)
You should get output like this:
+----------+--------------------------------------+------------+------+
|product_PK|products |exploded |score |
+----------+--------------------------------------+------------+------+
|686 |[[686,520.7], [645,2.0]] |[686,520.7] |520.7 |
|686 |[[686,520.7], [645,2.0]] |[645,2.0] |2.0 |
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[685,45.556]|45.556|
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[678,23.0] |23.0 |
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[655,21.0] |21.0 |
+----------+--------------------------------------+------------+------+
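Pulling the schema-inspection idea together with the question's goal of a reusable function, a minimal sketch (the helper name explode_second_field and its default argument are my own, not from the answers):
from pyspark.sql.functions import col, explode
def explode_second_field(df, array_col="products"):
    # look up the array column in the schema and take the struct's
    # second field, whatever it is called (_2, col1, col2, ...)
    field = next(f for f in df.schema.fields if f.name == array_col)
    second_name = field.dataType.elementType.names[1]
    return (df
            .withColumn("exploded", explode(col(array_col)))
            .withColumn("score", col("exploded").getItem(second_name)))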
