df = spark.read.json(['/Users/.../input/json/thisistheinputfile.json'])
df.printSchema()
This results in something like the following:
root
|-- _metadata: struct (nullable = true)
| |-- bundled: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- bundledIds: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- failedInitializations: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- unbundled: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- anonymousId: string (nullable = true)
|-- channel: string (nullable = true)
|-- context: struct (nullable = true)
| |-- campaign: struct (nullable = true)
| | |-- content: string (nullable = true)
| | |-- medium: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- source: string (nullable = true)
| | |-- term: string (nullable = true)
| | |-- utm_campaign: string (nullable = true)
| | |-- utm_medium: string (nullable = true)
| | |-- utm_term: string (nullable = true)
| |-- ip: string (nullable = true)
However, some time later, I found that in some cases the input file does not contain some of the content shown above; for instance, the campaign information may be missing:
root
|-- _metadata: struct (nullable = true)
| |-- bundled: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- bundledIds: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- failedInitializations: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- unbundled: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- anonymousId: string (nullable = true)
|-- channel: string (nullable = true)
|-- context: struct (nullable = true)
| |-- ip: string (nullable = true)
I would like to select some of the columns automatically, without the script crashing when the content is not available. Note that the number of columns to be selected is much larger than in the example below:
from pyspark.sql.functions import expr

df_2 = df.select(expr("context.campaign.source").alias("campaign_source"),
                 expr("context.campaign.utm_campaign").alias("utm_campaign"),
                 'anonymousId')
One case could be that anonymousId, ip and context.campaign.source exist but context.campaign.utm_campaign does not, and so on for all the possible combinations (which can be a lot with many columns).
I tried listing the fields I wanted, checking whether they existed, and then using that list as input to the dataframe selection. But I found this difficult since I have a nested dataframe:
lst = ['anonymousId',
       'source',
       'utm_campaign',
       'ip']

col_exists = []
for col in lst:
    if df.schema.simpleString().find(col) > 0:
        col_exists.append(col)
    else:
        print('Column', col, 'does not exist')

df_2 = df.select(col_exists)  # does of course not work...
Any tips on how to work with this kind of nested dataframe?
Thank you in advance!!
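One lightweight alternative worth noting before the full solution: probe each candidate path with a select and catch the failure. This is only a sketch (has_column is a hypothetical helper name), and it works on nested paths because select resolves them the same way. In real PySpark code you would catch pyspark.sql.utils.AnalysisException rather than the broad Exception used here to keep the sketch dependency-free:

```python
def has_column(df, path):
    """Return True if df.select(path) resolves, False otherwise.

    Real code should catch pyspark.sql.utils.AnalysisException instead
    of the broad Exception used in this dependency-free sketch.
    """
    try:
        df.select(path)
        return True
    except Exception:
        return False

# usage: keep only the paths that actually resolve
# wanted = ['anonymousId', 'context.campaign.source', 'context.ip']
# df_2 = df.select([c for c in wanted if has_column(df, c)])
```

This trades a schema traversal for one (cheap, driver-side) analysis pass per candidate column.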
The following steps helped resolve my issue:
from pyspark.sql.types import ArrayType, StructType

def flatten(schema, prefix=None):
    """Recursively collect the dotted path of every leaf field."""
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields
def intersection(lst1, lst2):
    # set membership keeps the lookup fast
    temp = set(lst2)
    return [value for value in lst1 if value in temp]
fieldsPathName = flatten(df.schema)
df_prepSchema = df.select(fieldsPathName).toDF(*fieldsPathName)

lst1 = ['context.campaign.source',
        'context.campaign.utm_campaign',
        'timestamp',
        'anonymousId']
lst2 = df_prepSchema.columns  # the flattened dotted names after toDF above

cols = intersection(lst1, lst2)
# wrap each name in backticks so the dots are treated as literal
# characters rather than struct accessors
cols = ['`' + col + '`' for col in cols]
df_2 = df_prepSchema.select(cols)
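The existence check and backtick quoting can be wrapped into a single helper. A sketch (existing_backticked is a hypothetical name; available would be the flattened column list from flatten(df.schema) above):

```python
def existing_backticked(wanted, available):
    """Keep only the wanted dotted paths present in `available` and wrap
    each in backticks so select() treats the dots as literal characters."""
    available = set(available)
    return ['`' + c + '`' for c in wanted if c in available]

cols = existing_backticked(
    ['context.campaign.source', 'timestamp', 'anonymousId'],
    ['context.campaign.source', 'context.ip', 'anonymousId'])
# cols -> ['`context.campaign.source`', '`anonymousId`']
# df_2 = df_prepSchema.select(cols)
```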
I have hundreds of columns a, b, c, ... . I would like to modify the dataframe schema so that each array has the same shape, with date, num and val fields.
There are thousands of ids, so I would like to modify ONLY the schema, not the dataframe. The modified schema will be used in the next step to load the data into a dataframe efficiently. I would like to avoid using a UDF to modify the whole dataframe.
Input schema:
df.printSchema()
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true) !!! NOTE : `num` !!!
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
Required Output schema:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
To reproduce the input schema:
df = spark.read.json(sc.parallelize([
    """{"id":1,"a":[{"date":2001,"num":1,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
    """{"id":2,"a":[{"date":2001,"num":2},{"date":2002},{"date":2003}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}"""
]))

for field in df.schema:
    print(field)
Print output:
StructField(a,ArrayType(StructType(List(StructField(date,LongType,true),StructField(num,LongType,true),StructField(val,LongType,true))),true),true)
StructField(b,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(c,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(d,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(id,LongType,true)
Solution (see OneCricketeer's answer below for details):
from pyspark.sql.types import StructField, StructType, LongType, ArrayType

jsonstr = [
    """{"id":1,"a":[{"date":2001,"val":1,"num":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
    """{"id":2,"a":[{"date":2001,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}"""
]

array_schema = ArrayType(StructType([
    StructField('date', LongType(), True),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]),
    True)

keys = ['a', 'b', 'c', 'd']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id', LongType(), True))
df_schema = StructType(fields)

dff = spark.read.json(sc.parallelize(jsonstr), df_schema)
I think the true solution is to have consistent names, or at least something more descriptive if the fields are truly different; "num" and "val" are basically synonymous.
If I understand the question, you want to reuse the same array schema that has all fields defined:
array_schema = ArrayType(StructType([
    StructField('date', LongType(), False),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]),
    True)
df_schema = StructType([
    StructField('a', array_schema, True),
    StructField('b', array_schema, True),
    ...
    StructField('id', LongType(), True)
])
Or you can do this in a loop, which is safe because it's applied in the Spark driver:
keys = ['a', 'b']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
(change each boolean to a False if there will be no nulls)
Then you need to provide this schema to your read function:
spark.read.schema(df_schema).json(...
If there will still be more fields that cannot be consistently applied to all "keys", then use ArrayType(MapType(StringType(), LongType()), False)
I have this structure for my dataframe:
root: array (nullable = true)
 |-- element: struct (containsNull = true)
 |    |-- id: long (nullable = true)
 |    |-- time: struct (nullable = true)
 |    |    |-- start: string (nullable = true)
 |    |    |-- end: string (nullable = true)
 |    |-- properties: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- key: string (nullable = true)
 |    |    |    |-- value: string (nullable = true)
that I need to transform into this one:
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
expanding my key-value array into columns.
Using pivot and groupby I can transform my dataframe:
df2 = df.groupby("start","end","id").pivot("prop.key").agg(last("prop.value", True))
But I also need to group by one (or more) property (key) values, and I can't:
df2 = df.groupby("start","end","id","car_type","car_loc").pivot("prop.key").agg(last("prop.value", True))
where "car_type" and "car_loc" are properties (prop.key values).
I need to refer to these properties through their aliases (not using getItem()).
Is it possible? Can anyone help me, please?
Thank you!
EDIT
An example. I have this situation:
+---+----------+----------+--------------------+
| id|     start|       end|                prop|
+---+----------+----------+--------------------+
|  1|2019-05-12|2020-05-12|    [car_type, fiat]|
|  1|2019-05-12|2020-05-12|     [car_loc, home]|
|  1|2019-05-12|2020-05-12|   [car_num, xd7890]|
|  2|2019-05-13|2020-05-13|    [car_type, fiat]|
|  2|2019-05-13|2020-05-13|     [car_loc, home]|
|  2|2019-05-13|2020-05-13|   [car_num, ae1234]|
|  1|2019-05-12|2020-05-12|    [car_type, ford]|
|  1|2019-05-12|2020-05-12|   [car_loc, office]|
|  1|2019-05-12|2020-05-12|   [car_num, gh7890]|
+---+----------+----------+--------------------+
I need to transform the dataframe into this:
+----------+----------+---+--------+-------+-------+
|     start|       end| id|car_type|car_loc|car_num|
+----------+----------+---+--------+-------+-------+
|2019-05-12|2020-05-12|  1|    fiat|   home| xd7890|
|2019-05-13|2020-05-13|  2|    fiat|   home| ae1234|
|2019-05-12|2020-05-12|  1|    ford| office| gh7890|
+----------+----------+---+--------+-------+-------+
I'm having trouble with a JSON dataframe:
{
    "keys": [
        {
            "id": 1,
            "start": "2019-05-10",
            "end": "2019-05-11",
            "property": [
                { "key": "home",   "value": "1000" },
                { "key": "office", "value": "exit" },
                { "key": "car",    "value": "ford" }
            ]
        },
        {
            "id": 2,
            "start": "2019-05-11",
            "end": "2019-05-12",
            "property": [
                { "key": "home",   "value": "2000" },
                { "key": "office", "value": "out" },
                { "key": "car",    "value": "fiat" }
            ]
        }
    ]
}
root
|-- id: long (nullable = true)
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- property: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I need to have key and value as columns, where key is the name of the column and value is the value in the dataframe.
At first I used getItem with an alias:
df.select("id", "start", "end",
          col("property.value").getItem(0).alias("home"),
          col("property.value").getItem(1).alias("office"),
          col("property.value").getItem(2).alias("car"))
But the number and position of the elements can change, so I thought of providing a new schema with all the possible values for key and setting the values from my dataframe without being tied to the position, but I think that is a low-performance solution.
I also tried using pivot, but I don't get the correct result, as shown below; I need separate columns, without a comma in the column name and value:
+---+----------+----------+-------------------+
| id|     start|       end|[home, office, car]|
+---+----------+----------+-------------------+
|  1|2019-05-10|2019-05-11|   [1000,exit,ford]|
|  2|2019-05-11|2019-05-12|    [2000,out,fiat]|
+---+----------+----------+-------------------+
I need this schema, with the fields updated dynamically (their number can be fixed):
root
|-- id: long (nullable = true)
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- home: string (nullable = true)
|-- office: string (nullable = true)
|-- car: string (nullable = true)
|-- cycle: string (nullable = true)
Can anyone help me, please?
Please find my try below. I deliberately expanded it into a couple of steps so that you can see how the final df is created (feel free to wrap these steps; that would not have any impact on performance).
inputJSON = "/tmp/my_file.json"
dfJSON = spark.read.json(inputJSON, multiLine=True)
from pyspark.sql import functions as F

df = dfJSON.select(F.explode(dfJSON["keys"]).alias("x"))

df2 = df.select(F.col("x.start").alias("start"),
                F.col("x.end").alias("end"),
                F.col("x.id").alias("id"),
                F.col("x.property").alias("property"))

df3 = df2.select(F.col("start"), F.col("end"), F.col("id"),
                 F.explode(df2["property"]).alias("properties"))

df4 = df3.select(F.col("start"), F.col("end"), F.col("id"),
                 F.col("properties.key").alias("key"),
                 F.col("properties.value").alias("value"))

df4.groupBy("start", "end", "id").pivot('key').agg(F.last('value', True)).show()
Output:
+----------+----------+---+----+----+------+
| start| end| id| car|home|office|
+----------+----------+---+----+----+------+
|2019-05-11|2019-05-12| 2|fiat|2000| out|
|2019-05-10|2019-05-11| 1|ford|1000| exit|
+----------+----------+---+----+----+------+
Schemas:
dfJSON.printSchema()
root
|-- keys: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- end: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- property: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- key: string (nullable = true)
| | | | |-- value: string (nullable = true)
| | |-- start: string (nullable = true)
df2.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- property: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
df3.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- properties: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: string (nullable = true)
df4.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Try with groupBy and pivot:
from pyspark.sql.functions import *
cols=['home','office','car']
spark.read.option("multiline","true").\
json("<path>").\
selectExpr("explode(keys)").\
selectExpr("col.id","col.start","col.end","explode(col.property)").\
select("id","start","end","col.*").\
groupBy("id","start","end").\
pivot("key").\
agg(first("value")).\
withColumn("[home,office,car]",array(*cols)).\
drop(*cols).\
show()
#+---+----------+----------+------------------+
#| id| start| end| [home,office,car]|
#+---+----------+----------+------------------+
#| 1|2019-05-10|2019-05-11|[1000, exit, ford]|
#| 2|2019-05-11|2019-05-12| [2000, out, fiat]|
#+---+----------+----------+------------------+