Transform array to column dynamically using pyspark - python

I'm having trouble with a JSON dataframe:
{
"keys":[
{
"id":1,
"start":"2019-05-10",
"end":"2019-05-11",
"property":[
{
"key":"home",
"value":"1000"
},
{
"key":"office",
"value":"exit"
},
{
"key":"car",
"value":"ford"
}
]
},
{
"id":2,
"start":"2019-05-11",
"end":"2019-05-12",
"property":[
{
"key":"home",
"value":"2000"
},
{
"key":"office",
"value":"out"
},
{
"key":"car",
"value":"fiat"
}
]
}
]
}
root
|-- id: long (nullable = true)
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- property: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I need to have key and value as columns, where key is the column name and value is the value in the dataframe.
At first I used getItem with an alias:
df.select("id", "start", "end",
          col("property.value").getItem(0).alias("home"),
          col("property.value").getItem(1).alias("office"),
          col("property.value").getItem(2).alias("car"))
But the number and position of the elements can change, so I thought about providing a new schema with all the possible values for key and filling in the values from my dataframe without relying on position, but I think that is a low-performance solution.
I also tried using pivot, but I don't get the correct result, as shown below; I need the values split into separate columns, without a comma-separated list in the column name and value:
+---+----------+----------+-------------------+
|id |start     |end       |[home, office, car]|
+---+----------+----------+-------------------+
|1  |2019-05-10|2019-05-11|[1000,exit,ford]   |
|2  |2019-05-11|2019-05-12|[2000,out,fiat]    |
+---+----------+----------+-------------------+
I need this schema, with the fields filled in dynamically (their number can be fixed):
root
|-- id: long (nullable = true)
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- home: string (nullable = true)
|-- office: string (nullable = true)
|-- car: string (nullable = true)
|-- cycle: string (nullable = true)
Can anyone help me, please?

Please find my attempt below. I deliberately expanded it into a few steps so that you can see how the final df is created (feel free to combine these steps; that would not have any impact on performance).
from pyspark.sql import functions as F

inputJSON = "/tmp/my_file.json"
dfJSON = spark.read.json(inputJSON, multiLine=True)
# Explode the top-level "keys" array into one row per element
df = dfJSON.select(F.explode(dfJSON["keys"]).alias("x"))
# Flatten the struct into plain columns
df2 = df.select(F.col("x.start").alias("start"), F.col("x.end").alias("end"), F.col("x.id").alias("id"), F.col("x.property").alias("property"))
# Explode the "property" array into one row per key/value struct
df3 = df2.select(F.col("start"), F.col("end"), F.col("id"), F.explode(df2["property"]).alias("properties"))
# Pull key and value out of the struct
df4 = df3.select(F.col("start"), F.col("end"), F.col("id"), F.col("properties.key").alias("key"), F.col("properties.value").alias("value"))
# Pivot the distinct keys into columns
df4.groupBy("start", "end", "id").pivot("key").agg(F.last("value", True)).show()
Output:
+----------+----------+---+----+----+------+
| start| end| id| car|home|office|
+----------+----------+---+----+----+------+
|2019-05-11|2019-05-12| 2|fiat|2000| out|
|2019-05-10|2019-05-11| 1|ford|1000| exit|
+----------+----------+---+----+----+------+
Schemas:
dfJSON.printSchema()
root
|-- keys: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- end: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- property: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- key: string (nullable = true)
| | | | |-- value: string (nullable = true)
| | |-- start: string (nullable = true)
df2.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- property: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
df3.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- properties: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: string (nullable = true)
df4.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
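One small optimization: if the set of keys is known up front (as in the sample data), passing them to pivot spares Spark the extra job it otherwise runs to collect the distinct key values. A minimal sketch, assuming the df4 built above; known_keys is an assumption about the data:
from pyspark.sql import functions as F

known_keys = ["home", "office", "car"]  # assumed: the keys expected in the data
df4.groupBy("start", "end", "id") \
   .pivot("key", known_keys) \
   .agg(F.last("value", True)) \
   .show()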

Try with groupBy and pivot.
from pyspark.sql.functions import *
cols=['home','office','car']
spark.read.option("multiline","true").\
json("<path>").\
selectExpr("explode(keys)").\
selectExpr("col.id","col.start","col.end","explode(col.property)").\
select("id","start","end","col.*").\
groupBy("id","start","end").\
pivot("key").\
agg(first("value")).\
withColumn("[home,office,car]",array(*cols)).\
drop(*cols).\
show()
#+---+----------+----------+------------------+
#| id| start| end| [home,office,car]|
#+---+----------+----------+------------------+
#| 1|2019-05-10|2019-05-11|[1000, exit, ford]|
#| 2|2019-05-11|2019-05-12| [2000, out, fiat]|
#+---+----------+----------+------------------+
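If you prefer not to hard-code cols, here is a sketch of the same pipeline where the key names are collected from the data first (the intermediate name exploded and the output column name values are just illustrative choices; this costs one extra Spark action):
from pyspark.sql.functions import array, col, first

# Sketch: same steps as above, but the pivot values come from the data.
exploded = (spark.read.option("multiline", "true")
            .json("<path>")
            .selectExpr("explode(keys)")
            .selectExpr("col.id", "col.start", "col.end", "explode(col.property)")
            .select("id", "start", "end", "col.*"))

# One extra action to discover the distinct keys.
cols = [r["key"] for r in exploded.select("key").distinct().collect()]

(exploded.groupBy("id", "start", "end")
         .pivot("key", cols)
         .agg(first("value"))
         .withColumn("values", array(*[col(c) for c in cols]))
         .drop(*cols)
         .show())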

Related

How to rename the first level keys of struct with PySpark in Azure Databricks?

I would like to rename the keys of the first level objects inside my payload.
from pyspark.sql.functions import *
ds = {'Fruits': {'apple': {'color': 'red'},'mango': {'color': 'green'}}, 'Vegetables': None}
df = spark.read.json(sc.parallelize([ds]))
df.printSchema()
"""
root
|-- Fruits: struct (nullable = true)
| |-- apple: struct (nullable = true)
| | |-- color: string (nullable = true)
| | |-- shape: string (nullable = true)
| |-- mango: struct (nullable = true)
| | |-- color: string (nullable = true)
|-- Vegetables: string (nullable = true)
"""
Desired output:
root
|-- Fruits: struct (nullable = true)
| |-- APPLE: struct (nullable = true)
| | |-- color: string (nullable = true)
| | |-- shape: string (nullable = true)
| |-- MANGO: struct (nullable = true)
| | |-- color: string (nullable = true)
|-- Vegetables: string (nullable = true)
In this case I would like to rename the keys in the first level to uppercase.
If I had a map type I could use transform keys:
df.select(transform_keys("Fruits", lambda k, _: upper(k)).alias("data_upper")).display()
Unfortunately, I have a struct type.
AnalysisException: cannot resolve 'transform_keys(Fruits,
lambdafunction(upper(x_18), x_18, y_19))' due to argument data type
mismatch: argument 1 requires map type, however, 'Fruits' is of
struct<apple:struct<color:string,shape:string>,mango:struct<color:string>>
type.;
I'm using Databricks runtime 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12).
The function you tried to use (transform_keys) is for map type columns. Your column type is struct.
You could use withField.
from pyspark.sql import functions as F
ds = spark.createDataFrame([], 'Fruits struct<apple:struct<color:string,shape:string>,mango:struct<color:string>>, Vegetables string')
ds.printSchema()
# root
# |-- Fruits: struct (nullable = true)
# | |-- apple: struct (nullable = true)
# | | |-- color: string (nullable = true)
# | | |-- shape: string (nullable = true)
# | |-- mango: struct (nullable = true)
# | | |-- color: string (nullable = true)
# |-- Vegetables: string (nullable = true)
ds = ds.withColumn('Fruits', F.col('Fruits').withField('APPLE', F.col('Fruits.apple')))
ds = ds.withColumn('Fruits', F.col('Fruits').withField('MANGO', F.col('Fruits.mango')))
ds.printSchema()
# root
# |-- Fruits: struct (nullable = true)
# | |-- APPLE: struct (nullable = true)
# | | |-- color: string (nullable = true)
# | | |-- shape: string (nullable = true)
# | |-- MANGO: struct (nullable = true)
# | | |-- color: string (nullable = true)
# |-- Vegetables: string (nullable = true)
You can also recreate the structure, but you will need to include all of the struct fields when recreating.
ds = ds.withColumn('Fruits', F.struct(
    F.col('Fruits.apple').alias('APPLE'),
    F.col('Fruits.mango').alias('MANGO'),
))
ds.printSchema()
# root
# |-- Fruits: struct (nullable = true)
# | |-- APPLE: struct (nullable = true)
# | | |-- color: string (nullable = true)
# | | |-- shape: string (nullable = true)
# | |-- MANGO: struct (nullable = true)
# | | |-- color: string (nullable = true)
# |-- Vegetables: string (nullable = true)
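If the first-level names aren't known in advance, a sketch of the second approach driven by the schema, so nothing is hard-coded (it simply upper-cases every field found under Fruits):
from pyspark.sql import functions as F

# Sketch: rebuild the struct, renaming every first-level field to uppercase.
fruit_fields = ds.schema["Fruits"].dataType.fieldNames()
ds = ds.withColumn(
    "Fruits",
    F.struct(*[F.col(f"Fruits.{name}").alias(name.upper()) for name in fruit_fields])
)
ds.printSchema()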

How do I get column and nested field names in pyspark?

I have a pyspark df.
df.printSchema()
root
|-- bio: string (nullable = true)
|-- city: string (nullable = true)
|-- company: string (nullable = true)
|-- custom_fields: struct (nullable = true)
| |-- nested_field1: string (nullable = true)
|-- email: string (nullable = true)
|-- first_conversion: struct (nullable = true)
| |-- nested_field2: struct (nullable = true)
| | |-- number: string (nullable = true)
| | |-- state: string (nullable = true)
I would like to iterate over column and nested fields in order to get their names (just their names). I should be able to print them and get the following result:
bio
city
company
custom_fields
nested_field1
email
first_conversion
nested_field2
number
state
I can easily print the first level with:
for st in df.schema:
    print(st.name)
But how do I check deeper levels at runtime, recursively?
dtypes will give you more details of the schema; you will have to parse it, though.
df.printSchema()
root
|-- id: integer (nullable = true)
|-- rec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = true)
| | |-- b: float (nullable = true)
df.dtypes
# [('id', 'int'), ('rec', 'array<struct<a:int,b:float>>')]
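For the recursive part of the question, here is a minimal sketch that walks a StructType and prints every field name at any depth (it also unwraps arrays so array<struct<...>> elements are visited):
from pyspark.sql.types import ArrayType, StructType

def print_field_names(schema, indent=0):
    """Recursively print the field names of a StructType."""
    for field in schema.fields:
        print(" " * indent + field.name)
        dtype = field.dataType
        # Unwrap arrays so array<struct<...>> is descended into as well.
        while isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            print_field_names(dtype, indent + 2)

print_field_names(df.schema)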

Spark : How to reuse the same array schema that has all fields defined across the data-frame

I have hundreds of columns a, b, c, ... I would like to modify the dataframe schema so that each array has the same shape: date, num, and val fields.
There are thousands of ids, so I would like to modify ONLY the schema, not the dataframe. The modified schema will be used in the next step to load the data into a dataframe efficiently. I would like to avoid using a UDF to modify the whole dataframe.
Input schema:
df.printSchema()
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true) !!! NOTE : `num` !!!
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
Required Output schema:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
To reproduce the input schema:
df = spark.read.json(sc.parallelize([
"""{"id":1,"a":[{"date":2001,"num":1},{"date":2002,},{"date":2003,}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
"""{"id":2,"a":[{"date":2001,"num":2},{"date":2002},{"date":2003}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}}"""
]))
for field in df.schema:
    print(field)
Print output:
StructField(a,ArrayType(StructType(List(StructField(date,LongType,true),StructField(num,LongType,true),StructField(val,LongType,true))),true),true)
StructField(b,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(c,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(d,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(id,LongType,true)
Solution (see OneCricketeer's answer below for details):
from pyspark.sql.types import StructField, StructType, LongType, ArrayType
jsonstr=[
"""{"id":1,"a":[{"date":2001,"val":1,"num":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
"""{"id":2,"a":[{"date":2001,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}}"""
]
array_schema = ArrayType(StructType([
    StructField('date', LongType(), True),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]),
    True)
keys = ['a', 'b', 'c', 'd']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
dff = spark.read.json(sc.parallelize(jsonstr),df_schema)
I think the true solution is to have consistent names, or at least something more descriptive if the fields are truly different; "num" and "val" are basically synonymous.
If I understand the question, you want to reuse the same array schema that has all fields defined:
array_schema = ArrayType(StructType([
    StructField('date', LongType(), False),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]), True)
df_schema = StructType([
    StructField('a', array_schema, True),
    StructField('b', array_schema, True),
    ...
    StructField('id', LongType(), True)
])
Or you can do this in a loop, which is safe because it's applied in the Spark driver
keys = ['a', 'b']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
(change each boolean to a False if there will be no nulls)
Then you need to provide this schema to your read function
spark.read.schema(df_schema).json(...
If there will still be more fields that cannot be consistently applied to all "keys", then use ArrayType(MapType(StringType(), LongType()), False)
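With hundreds of columns, the keys list itself can be derived from the first, schema-inferred read instead of being typed by hand. A sketch under that assumption (input_path is a placeholder for the same source that produced df):
from pyspark.sql.types import ArrayType, LongType, StructField, StructType

# Sketch: every non-id column gets the same fully-defined array schema.
array_schema = ArrayType(StructType([
    StructField('date', LongType(), True),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]), True)

keys = [f.name for f in df.schema.fields if f.name != 'id']  # from the inferred read
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id', LongType(), True))
df_schema = StructType(fields)

dff = spark.read.schema(df_schema).json(input_path)  # input_path: placeholder for the source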

How to efficiently process records in rdd and maintain the structure of a record

I have been working with Google Analytics data that I have in S3. I am loading the file as follows:
df = sc.textFile('s3n://BUCKET_NAME/2017/1/2/')
After this, I get an RDD. To see the schema, I loaded the data into Spark SQL, and the schema looks like this:
root
|-- channelGrouping: string (nullable = true)
|-- customDimensions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- index: string (nullable = true)
| | |-- value: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- browserSize: string (nullable = true)
| |-- browserVersion: string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |-- isMobile: boolean (nullable = true)
| |-- javaEnabled: boolean (nullable = true)
| |-- language: string (nullable = true)
| |-- mobileDeviceBranding: string (nullable = true)
| |-- mobileDeviceInfo: string (nullable = true)
| |-- mobileDeviceMarketingName: string (nullable = true)
| |-- mobileDeviceModel: string (nullable = true)
| |-- mobileInputSelector: string (nullable = true)
| |-- operatingSystem: string (nullable = true)
| |-- operatingSystemVersion: string (nullable = true)
| |-- screenColors: string (nullable = true)
| |-- screenResolution: string (nullable = true)
|-- fullVisitorId: string (nullable = true)
|-- geoNetwork: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- cityId: string (nullable = true)
| |-- continent: string (nullable = true)
| |-- country: string (nullable = true)
| |-- latitude: string (nullable = true)
| |-- longitude: string (nullable = true)
| |-- metro: string (nullable = true)
| |-- networkDomain: string (nullable = true)
| |-- networkLocation: string (nullable = true)
| |-- region: string (nullable = true)
| |-- subContinent: string (nullable = true)
What I tried:
import json

def remove_null_device(val):
    _ori = json.loads(val)
    # _ori = val
    _dic_val = _ori['device']
    for key, _value in _dic_val.items():
        if _value == "null":
            _dic_val[key] = "Hello There I am Testing this"
    _ori["device"] = _dic_val
    return _ori

device_data = df_rdd.map(remove_null_device)
Problem statement: I want to iterate over every record. Since this is a nested structure, I am thinking of passing one main key at a time (e.g. device, geoNetwork) and checking whether the values are empty or null.
But this seems to change the structure of the whole record, and the items are not getting updated; I don't know why. Please suggest a better approach.
Thanks!
To clarify: I want to check whether each field in device is empty, null, or "(not set)", update those values, and return the updated row with the schema intact.
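Not a definitive answer, but one way to keep the schema intact is to avoid the textFile/json.loads round-trip and clean the fields on the DataFrame itself. A sketch, assuming the files are line-delimited JSON and that replacing bad string values in device with a fixed marker is acceptable:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Sketch: read as a DataFrame (schema preserved), then rebuild the `device`
# struct, replacing empty / "null" / "(not set)" string values.
df = spark.read.json("s3n://BUCKET_NAME/2017/1/2/")

def cleaned(field):
    c = F.col(f"device.{field.name}")
    if isinstance(field.dataType, StringType):
        c = F.when(c.isNull() | c.isin("", "null", "(not set)"),
                   F.lit("unknown")).otherwise(c)  # replacement value is an assumption
    return c.alias(field.name)

df = df.withColumn(
    "device",
    F.struct(*[cleaned(f) for f in df.schema["device"].dataType.fields])
)
The same pattern can be repeated for geoNetwork or any other struct column.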

Spark - Creating Nested DataFrame

I'm starting with PySpark and I'm having trouble creating DataFrames with nested objects.
This is my example.
I have users.
$ cat user.json
{"id":1,"name":"UserA"}
{"id":2,"name":"UserB"}
Users have orders.
$ cat order.json
{"id":1,"price":202.30,"userid":1}
{"id":2,"price":343.99,"userid":1}
{"id":3,"price":399.99,"userid":2}
And I would like to join them to get a structure where orders are an array nested in users.
$ cat join.json
{"id":1, "name":"UserA", "orders":[{"id":1,"price":202.30,"userid":1},{"id":2,"price":343.99,"userid":1}]}
{"id":2,"name":"UserB","orders":[{"id":3,"price":399.99,"userid":2}]}
How can I do that ?
Is there any kind of nested join or something similar ?
>>> user = sqlContext.read.json("user.json")
>>> user.printSchema();
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
>>> order = sqlContext.read.json("order.json")
>>> order.printSchema();
root
|-- id: long (nullable = true)
|-- price: double (nullable = true)
|-- userid: long (nullable = true)
>>> joined = sqlContext.read.json("join.json")
>>> joined.printSchema();
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
EDIT:
I know it is possible to do this using join and foldByKey, but is there a simpler way?
EDIT 2:
I'm using the solution by zero323:
def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType="left_outer"):
    tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight]))
    tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), tmpTable._2.data.alias(columnNested))
    return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable["joinColumn"], joinType).drop("joinColumn")
I added a second nested structure, 'lines':
>>> lines = sqlContext.read.json(path + "lines.json")
>>> lines.printSchema();
root
|-- id: long (nullable = true)
|-- orderid: long (nullable = true)
|-- product: string (nullable = true)
orders = joinTable(order, lines, "id", "orderid", "lines")
joined = joinTable(user, orders, "id", "userid", "orders")
joined.printSchema()
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
| | |-- lines: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _1: long (nullable = true)
| | | | |-- _2: long (nullable = true)
| | | | |-- _3: string (nullable = true)
After this, the column names from lines are lost.
Any ideas?
EDIT 3:
I tried to manually specify the schema.
from pyspark.sql.types import *
fields = []
fields.append(StructField("_1", LongType(), True))
inner = ArrayType(lines.schema)
fields.append(StructField("_2", inner))
new_schema = StructType(fields)
print new_schema
grouped = lines.rdd.groupBy(lambda r: r.orderid)
grouped = grouped.map(lambda x: (x[0], list(x[1])))
g = sqlCtx.createDataFrame(grouped, new_schema)
Error:
TypeError: StructType(List(StructField(id,LongType,true),StructField(orderid,LongType,true),StructField(product,StringType,true))) can not accept object in type <class 'pyspark.sql.types.Row'>
This will work only in Spark 2.0 or later
First we'll need a couple of imports:
from pyspark.sql.functions import struct, collect_list
The rest is a simple aggregation and join:
orders = spark.read.json("/path/to/order.json")
users = spark.read.json("/path/to/user.json")
combined = users.join(
    orders
        .groupBy("userId")
        .agg(collect_list(struct(*orders.columns)).alias("orders"))
        .withColumnRenamed("userId", "id"),
    ["id"])
For the example data the result is:
combined.show(2, False)
+---+-----+---------------------------+
|id |name |orders |
+---+-----+---------------------------+
|1 |UserA|[[1,202.3,1], [2,343.99,1]]|
|2 |UserB|[[3,399.99,2]] |
+---+-----+---------------------------+
with schema:
combined.printSchema()
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
and JSON representation:
for x in combined.toJSON().collect():
    print(x)
{"id":1,"name":"UserA","orders":[{"id":1,"price":202.3,"userid":1},{"id":2,"price":343.99,"userid":1}]}
{"id":2,"name":"UserB","orders":[{"id":3,"price":399.99,"userid":2}]}
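For the nested lines case from EDIT 2, the same collect_list(struct(...)) pattern composes and the column names are preserved. A sketch, assuming a lines.json like the one in the edit and the users/orders DataFrames defined above:
from pyspark.sql.functions import collect_list, struct

# Sketch: nest lines into orders first, then orders (with their lines) into users.
lines = spark.read.json("/path/to/lines.json")  # assumed path

orders_with_lines = orders.join(
    lines
        .groupBy("orderid")
        .agg(collect_list(struct(*lines.columns)).alias("lines"))
        .withColumnRenamed("orderid", "id"),
    ["id"], "left_outer")

combined = users.join(
    orders_with_lines
        .groupBy("userid")
        .agg(collect_list(struct(*orders_with_lines.columns)).alias("orders"))
        .withColumnRenamed("userid", "id"),
    ["id"], "left_outer")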
First, you need to use the userid as the join key for the second DataFrame:
user.join(order, user.id == order.userid)
Then you can use a map step to transform the resulting records to your desired format.
For flattening your data frame from nested to normal, use:
dff = df.select("column with multiple columns.*")
