df = spark.read.json(['/Users/.../input/json/thisistheinputfile.json'])
df.printSchema()
This results in something like the following:
root
|-- _metadata: struct (nullable = true)
| |-- bundled: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- bundledIds: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- failedInitializations: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- unbundled: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- anonymousId: string (nullable = true)
|-- channel: string (nullable = true)
|-- context: struct (nullable = true)
| |-- campaign: struct (nullable = true)
| | |-- content: string (nullable = true)
| | |-- medium: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- source: string (nullable = true)
| | |-- term: string (nullable = true)
| | |-- utm_campaign: string (nullable = true)
| | |-- utm_medium: string (nullable = true)
| | |-- utm_term: string (nullable = true)
| |-- ip: string (nullable = true)
However, some time later, I found that in some cases the input file does not contain some of the content shown above; for instance, the campaign information may be missing:
root
|-- _metadata: struct (nullable = true)
| |-- bundled: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- bundledIds: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- failedInitializations: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- unbundled: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- anonymousId: string (nullable = true)
|-- channel: string (nullable = true)
|-- context: struct (nullable = true)
| |-- ip: string (nullable = true)
I would like to select some of the columns automatically, without the script crashing when the content is not available. Note that the number of columns to be selected is much larger than in the example below:
from pyspark.sql.functions import expr

df_2 = df.select(expr("context.campaign.source").alias("campaign_source"),
                 expr("context.campaign.utm_campaign").alias("utm_campaign"),
                 'anonymousId')
One case could be that anonymousId, ip and context.campaign.source exist but context.campaign.utm_campaign does not, and so on for all the possible combinations (which can be a lot with many columns).
I tried listing the fields I wanted, checking whether they existed, and then using that list as input to the dataframe selection. But I found this difficult since I have a nested dataframe:
lst = ['anonymousId',
       'source',
       'utm_campaign',
       'ip']

col_exists = []
for col in lst:
    if df.schema.simpleString().find(col) > 0:
        col_exists.append(col)
    else:
        print('Column', col, 'does not exist')

df_2 = df.select(col_exists)  # does of course not work...
Any tips on how to work with this kind of nested dataframe?
Thank you in advance!!
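One lightweight alternative worth noting before the full solution: probe each candidate path with a select and catch the failure. This is only a sketch (has_column is a hypothetical helper name), and it works on nested paths because select resolves them the same way. In real PySpark code you would catch pyspark.sql.utils.AnalysisException rather than the broad Exception used here to keep the sketch dependency-free:

```python
def has_column(df, path):
    """Return True if df.select(path) resolves, False otherwise.

    Real code should catch pyspark.sql.utils.AnalysisException instead
    of the broad Exception used in this dependency-free sketch.
    """
    try:
        df.select(path)
        return True
    except Exception:
        return False

# usage: keep only the paths that actually resolve
# wanted = ['anonymousId', 'context.campaign.source', 'context.ip']
# df_2 = df.select([c for c in wanted if has_column(df, c)])
```

This trades a schema traversal for one (cheap, driver-side) analysis pass per candidate column.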
The following steps helped resolve my issue:
from pyspark.sql.types import ArrayType, StructType

def flatten(schema, prefix=None):
    """Recursively collect the dotted path of every leaf field."""
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields
def intersection(lst1, lst2):
    # set membership keeps the lookup fast
    temp = set(lst2)
    return [value for value in lst1 if value in temp]
fieldsPathName = flatten(df.schema)
df_prepSchema = df.select(fieldsPathName).toDF(*fieldsPathName)

lst1 = ['context.campaign.source',
        'context.campaign.utm_campaign',
        'timestamp',
        'anonymousId']
lst2 = df_prepSchema.columns  # the flattened dotted names after toDF above

cols = intersection(lst1, lst2)
# wrap each name in backticks so the dots are treated as literal
# characters rather than struct accessors
cols = ['`' + col + '`' for col in cols]
df_2 = df_prepSchema.select(cols)
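The existence check and backtick quoting can be wrapped into a single helper. A sketch (existing_backticked is a hypothetical name; available would be the flattened column list from flatten(df.schema) above):

```python
def existing_backticked(wanted, available):
    """Keep only the wanted dotted paths present in `available` and wrap
    each in backticks so select() treats the dots as literal characters."""
    available = set(available)
    return ['`' + c + '`' for c in wanted if c in available]

cols = existing_backticked(
    ['context.campaign.source', 'timestamp', 'anonymousId'],
    ['context.campaign.source', 'context.ip', 'anonymousId'])
# cols -> ['`context.campaign.source`', '`anonymousId`']
# df_2 = df_prepSchema.select(cols)
```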
I have hundreds of columns a, b, c, ... . I would like to modify the dataframe schema so that each array has the same shape, with date, num and val fields.
There are thousands of ids, so I would like to modify ONLY the schema, not the dataframe. The modified schema will be used in the next step to load the data into a dataframe efficiently. I would like to avoid using a UDF to modify the whole dataframe.
Input schema:
df.printSchema()
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true) !!! NOTE : `num` !!!
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
Required Output schema:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
To reproduce the input schema:
df = spark.read.json(sc.parallelize([
    """{"id":1,"a":[{"date":2001,"num":1,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
    """{"id":2,"a":[{"date":2001,"num":2},{"date":2002},{"date":2003}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}"""
]))

for field in df.schema:
    print(field)
Print output:
StructField(a,ArrayType(StructType(List(StructField(date,LongType,true),StructField(num,LongType,true),StructField(val,LongType,true))),true),true)
StructField(b,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(c,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(d,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(id,LongType,true)
Solution (see OneCricketeer's answer below for details):
from pyspark.sql.types import StructField, StructType, LongType, ArrayType

jsonstr = [
    """{"id":1,"a":[{"date":2001,"val":1,"num":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
    """{"id":2,"a":[{"date":2001,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}"""
]

array_schema = ArrayType(StructType([
    StructField('date', LongType(), True),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]),
    True)

keys = ['a', 'b', 'c', 'd']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id', LongType(), True))
df_schema = StructType(fields)

dff = spark.read.json(sc.parallelize(jsonstr), df_schema)
I think the true solution is to have consistent names, or at least something more descriptive if the fields are truly different; "num" and "val" are basically synonymous.
If I understand the question, you want to reuse the same array schema that has all fields defined:
array_schema = ArrayType(StructType([
    StructField('date', LongType(), False),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]),
    True)
df_schema = StructType([
    StructField('a', array_schema, True),
    StructField('b', array_schema, True),
    ...
    StructField('id', LongType(), True)
])
Or you can do this in a loop, which is safe because it's applied in the Spark driver:
keys = ['a', 'b']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
(change each boolean to a False if there will be no nulls)
Then you need to provide this schema to your read function:
spark.read.schema(df_schema).json(...
If there will still be more fields that cannot be consistently applied to all "keys", then use ArrayType(MapType(StringType(), LongType()), False)
I have this structure for my dataframe:
root: array (nullable = true)
 |-- element: struct (containsNull = true)
 |    |-- id: long (nullable = true)
 |    |-- time: struct (nullable = true)
 |    |    |-- start: string (nullable = true)
 |    |    |-- end: string (nullable = true)
 |    |-- properties: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- key: string (nullable = true)
 |    |    |    |-- value: string (nullable = true)
that I need to transform into this one:
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
expanding my key-value array into columns.
Using pivot and groupby I can transform my dataframe:
df2 = df.groupby("start","end","id").pivot("prop.key").agg(last("prop.value", True))
But I also need to group by one (or more) property (key) values, and I can't:
df2 = df.groupby("start","end","id","car_type","car_loc").pivot("prop.key").agg(last("prop.value", True))
where "car_type" and "car_loc" are properties (prop.key values).
I need to refer to these properties through their aliases (not using getItem()).
Is it possible? Can anyone help me, please?
Thank you!
EDIT
An example. I have this situation:
+---+----------+----------+--------------------+
| id|     start|       end|                prop|
+---+----------+----------+--------------------+
|  1|2019-05-12|2020-05-12|    [car_type, fiat]|
|  1|2019-05-12|2020-05-12|     [car_loc, home]|
|  1|2019-05-12|2020-05-12|   [car_num, xd7890]|
|  2|2019-05-13|2020-05-13|    [car_type, fiat]|
|  2|2019-05-13|2020-05-13|     [car_loc, home]|
|  2|2019-05-13|2020-05-13|   [car_num, ae1234]|
|  1|2019-05-12|2020-05-12|    [car_type, ford]|
|  1|2019-05-12|2020-05-12|   [car_loc, office]|
|  1|2019-05-12|2020-05-12|   [car_num, gh7890]|
+---+----------+----------+--------------------+
I need to transform the dataframe into this:
+----------+----------+---+--------+-------+-------+
|     start|       end| id|car_type|car_loc|car_num|
+----------+----------+---+--------+-------+-------+
|2019-05-12|2020-05-12|  1|    fiat|   home| xd7890|
|2019-05-13|2020-05-13|  2|    fiat|   home| ae1234|
|2019-05-12|2020-05-12|  1|    ford| office| gh7890|
+----------+----------+---+--------+-------+-------+
I'm having trouble with a JSON dataframe:
{
    "keys": [
        {
            "id": 1,
            "start": "2019-05-10",
            "end": "2019-05-11",
            "property": [
                { "key": "home",   "value": "1000" },
                { "key": "office", "value": "exit" },
                { "key": "car",    "value": "ford" }
            ]
        },
        {
            "id": 2,
            "start": "2019-05-11",
            "end": "2019-05-12",
            "property": [
                { "key": "home",   "value": "2000" },
                { "key": "office", "value": "out" },
                { "key": "car",    "value": "fiat" }
            ]
        }
    ]
}
root
|-- id: long (nullable = true)
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- property: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I need to have key and value as columns, where key is the name of the column and value is the value in the dataframe.
At first I used getItem with an alias:
df.select("id", "start", "end",
          col("property.value").getItem(0).alias("home"),
          col("property.value").getItem(1).alias("office"),
          col("property.value").getItem(2).alias("car"))
But the number and position of the elements can change, so I thought of providing a new schema with all the possible values for key and setting the values from my dataframe without being tied to the position, but I think that is a low-performance solution.
I also tried using pivot, but I don't get the correct result, as shown below; I need separate columns, without a comma in the column name and value:
+---+----------+----------+-------------------+
| id|     start|       end|[home, office, car]|
+---+----------+----------+-------------------+
|  1|2019-05-10|2019-05-11|   [1000,exit,ford]|
|  2|2019-05-11|2019-05-12|    [2000,out,fiat]|
+---+----------+----------+-------------------+
I need this schema, with the fields updated dynamically (their number can be fixed):
root
|-- id: long (nullable = true)
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- home: string (nullable = true)
|-- office: string (nullable = true)
|-- car: string (nullable = true)
|-- cycle: string (nullable = true)
Can anyone help me, please?
Please find my try below. I deliberately expanded it into a couple of steps so that you can see how the final df is created (feel free to wrap these steps; that would not have any impact on performance).
inputJSON = "/tmp/my_file.json"
dfJSON = spark.read.json(inputJSON, multiLine=True)
from pyspark.sql import functions as F

df = dfJSON.select(F.explode(dfJSON["keys"]).alias("x"))

df2 = df.select(F.col("x.start").alias("start"),
                F.col("x.end").alias("end"),
                F.col("x.id").alias("id"),
                F.col("x.property").alias("property"))

df3 = df2.select(F.col("start"), F.col("end"), F.col("id"),
                 F.explode(df2["property"]).alias("properties"))

df4 = df3.select(F.col("start"), F.col("end"), F.col("id"),
                 F.col("properties.key").alias("key"),
                 F.col("properties.value").alias("value"))

df4.groupBy("start", "end", "id").pivot('key').agg(F.last('value', True)).show()
Output:
+----------+----------+---+----+----+------+
| start| end| id| car|home|office|
+----------+----------+---+----+----+------+
|2019-05-11|2019-05-12| 2|fiat|2000| out|
|2019-05-10|2019-05-11| 1|ford|1000| exit|
+----------+----------+---+----+----+------+
Schemas:
dfJSON.printSchema()
root
|-- keys: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- end: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- property: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- key: string (nullable = true)
| | | | |-- value: string (nullable = true)
| | |-- start: string (nullable = true)
df2.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- property: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
df3.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- properties: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: string (nullable = true)
df4.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Try with groupBy and pivot:
from pyspark.sql.functions import *
cols=['home','office','car']
spark.read.option("multiline","true").\
json("<path>").\
selectExpr("explode(keys)").\
selectExpr("col.id","col.start","col.end","explode(col.property)").\
select("id","start","end","col.*").\
groupBy("id","start","end").\
pivot("key").\
agg(first("value")).\
withColumn("[home,office,car]",array(*cols)).\
drop(*cols).\
show()
#+---+----------+----------+------------------+
#| id| start| end| [home,office,car]|
#+---+----------+----------+------------------+
#| 1|2019-05-10|2019-05-11|[1000, exit, ford]|
#| 2|2019-05-11|2019-05-12| [2000, out, fiat]|
#+---+----------+----------+------------------+