I have a JSON file with various levels of nested struct/array columns in one DataFrame, df_1. I have a smaller DataFrame, df_2, with fewer columns and none of the nested structure, but its column names match some of the column names in df_1.
I want to apply the schema from df_1 to df_2 so that the two share the same schema, taking the existing columns in df_2 where possible, and creating the columns/nested structure that exist in df_1 but not in df_2.
df_1
root
|-- association_info: struct (nullable = true)
| |-- ancestry: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- doi: string (nullable = true)
| |-- gwas_catalog_id: string (nullable = true)
| |-- neg_log_pval: double (nullable = true)
| |-- study_id: string (nullable = true)
| |-- pubmed_id: string (nullable = true)
| |-- url: string (nullable = true)
|-- gold_standard_info: struct (nullable = true)
| |-- evidence: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- class: string (nullable = true)
| | | |-- confidence: string (nullable = true)
| | | |-- curated_by: string (nullable = true)
| | | |-- description: string (nullable = true)
| | | |-- pubmed_id: string (nullable = true)
| | | |-- source: string (nullable = true)
| |-- gene_id: string (nullable = true)
| |-- highest_confidence: string (nullable = true)
df_2
root
|-- study_id: string (nullable = true)
|-- description: string (nullable = true)
|-- gene_id: string (nullable = true)
The expected output would have the same schema as df_1, with any columns that don't exist in df_2 simply filled with null.
I have tried completely flattening the structure of df_1 to join the two DataFrames, but then I'm unsure how to change the result back into the original schema. All solutions I've attempted so far have been in PySpark. PySpark would be preferable for performance reasons, but a solution that requires converting to a pandas DataFrame is also feasible.
df1.select('association_info.study_id',
           'gold_standard_info.evidence.description',
           'gold_standard_info.gene_id')
The above code reaches into df1 and gives you the fields required in df2; the selected fields keep their types. Could you try the same?
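For the question as actually asked, here is a minimal sketch of a more general approach (not part of the answer above): rebuild df_1's nested structure on top of df_2, taking df_2's columns where they exist and filling everything else with nulls cast to the matching type. Only this question's fields are written out; df_2's description column is left out because in df_1 it lives inside the evidence array, where the mapping is ambiguous.

from pyspark.sql.functions import lit, struct
from pyspark.sql.types import (ArrayType, DoubleType, StringType,
                               StructField, StructType)

# typed nulls for the fields df_2 does not have
null_str = lit(None).cast(StringType())
evidence_type = ArrayType(StructType([
    StructField(name, StringType(), True)
    for name in ["class", "confidence", "curated_by",
                 "description", "pubmed_id", "source"]
]))

df_2_conformed = df_2.select(
    struct(
        lit(None).cast(ArrayType(StringType())).alias("ancestry"),
        null_str.alias("doi"),
        null_str.alias("gwas_catalog_id"),
        lit(None).cast(DoubleType()).alias("neg_log_pval"),
        df_2["study_id"],                      # taken from df_2
        null_str.alias("pubmed_id"),
        null_str.alias("url"),
    ).alias("association_info"),
    struct(
        lit(None).cast(evidence_type).alias("evidence"),
        df_2["gene_id"],                       # taken from df_2
        null_str.alias("highest_confidence"),
    ).alias("gold_standard_info"),
)

After this, df_2_conformed.printSchema() should match df_1.printSchema(), so the two DataFrames can be unioned.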
I have hundreds of columns a, b, c, .... I would like to modify the DataFrame schema so that every array column has the same element shape: date, num and val fields.
There are thousands of ids, so I would like to modify ONLY the schema, not the DataFrame. The modified schema will be used in the next step to load the data into a DataFrame efficiently. I would like to avoid using a UDF to rewrite the whole DataFrame.
Input schema:
df.printSchema()
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)   <-- NOTE: only "a" has a num field
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
Required Output schema:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
To reproduce input Schema:
df = spark.read.json(sc.parallelize([
"""{"id":1,"a":[{"date":2001,"num":1},{"date":2002,},{"date":2003,}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
"""{"id":2,"a":[{"date":2001,"num":2},{"date":2002},{"date":2003}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}}"""
]))
for field in df.schema:
    print(field)
Print output:
StructField(a,ArrayType(StructType(List(StructField(date,LongType,true),StructField(num,LongType,true),StructField(val,LongType,true))),true),true)
StructField(b,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(c,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(d,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(id,LongType,true)
Solution (see OneCricketeer's answer below for details):
from pyspark.sql.types import StructField, StructType, LongType, ArrayType
jsonstr=[
"""{"id":1,"a":[{"date":2001,"val":1,"num":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
"""{"id":2,"a":[{"date":2001,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}}"""
]
array_schema = ArrayType(StructType([
    StructField('date', LongType(), True),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]),
    True)
keys = ['a', 'b', 'c', 'd']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
dff = spark.read.json(sc.parallelize(jsonstr), df_schema)
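A quick sanity check (a usage note, not from the original post): every array column should now carry all three fields, with num coming back as null wherever the source JSON lacked it.

dff.printSchema()                      # a, b, c, d all have date, num, val
dff.select('b').show(truncate=False)   # num is null in every element of b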
I think the true solution is to have consistent names, or at least something more descriptive if the fields are truly different; "num" and "val" are basically synonymous.
If I understand the question, you want to reuse the same array schema that has all fields defined:
array_schema = ArrayType(StructType([
    StructField('date', LongType(), False),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]), True)
df_schema = StructType([
    StructField('a', array_schema, True),
    StructField('b', array_schema, True),
    ...
    StructField('id', LongType(), True)
])
Or you can do this in a loop, which is safe because it runs on the Spark driver:
keys = ['a', 'b']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
(change each boolean to False if there will be no nulls)
Then you need to provide this schema to your read function:
spark.read.schema(df_schema).json(...
If there are still more fields that cannot be consistently applied to all "keys", then use ArrayType(MapType(StringType(), LongType()), False), as sketched below.
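A sketch of that MapType fallback (assuming, as in this question, that every leaf value is a long; jsonstr is the sample data defined above):

from pyspark.sql.types import ArrayType, LongType, MapType, StringType, StructField, StructType

# each array element becomes a free-form string -> long map instead of a
# struct with a fixed set of fields
map_array_schema = ArrayType(MapType(StringType(), LongType()), False)

df_schema = StructType(
    [StructField(k, map_array_schema, True) for k in ['a', 'b', 'c', 'd']]
    + [StructField('id', LongType(), True)]
)

dff = spark.read.schema(df_schema).json(sc.parallelize(jsonstr))
# an element of "a" now reads back as e.g. {"date": 2001, "num": 1}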
I have multiple JSON files I wish to use to create a Spark DataFrame. In testing with a subset, when I load the files I get rows of the raw JSON objects themselves instead of parsed JSON columns. I am doing the following:
df = spark.read.json('gutenberg/test')
df.show()
+--------------------+--------------------+--------------------+
| 1| 10| 5|
+--------------------+--------------------+--------------------+
| null|[WrappedArray(),W...| null|
| null| null|[WrappedArray(Uni...|
|[WrappedArray(Jef...| null| null|
+--------------------+--------------------+--------------------+
When I check the schema of the DataFrame, it appears to be there, but I am having trouble accessing it:
df.printSchema()
root
|-- 1: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
|-- 10: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
|-- 5: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
I keep getting errors when trying to access the information, so any help would be great.
Specifically, I am looking to create a new DataFrame where the columns are ('author', 'formaturi', 'language', 'rights', 'subject', 'title', 'txt').
I am using PySpark 2.2.
Since I do not know exactly what the JSON files look like, assuming they are newline-delimited JSON, this should work.
def _construct_key(previous_key, separator, new_key):
    if previous_key:
        return "{}{}{}".format(previous_key, separator, new_key)
    else:
        return new_key

def flatten(nested_dict, separator="_", root_keys_to_ignore=set()):
    assert isinstance(nested_dict, dict)
    assert isinstance(separator, str)
    flattened_dict = dict()

    def _flatten(object_, key):
        if isinstance(object_, dict):
            for object_key in object_:
                if not (not key and object_key in root_keys_to_ignore):
                    _flatten(object_[object_key],
                             _construct_key(key, separator, object_key))
        elif isinstance(object_, list) or isinstance(object_, set):
            for index, item in enumerate(object_):
                _flatten(item, _construct_key(key, separator, index))
        else:
            flattened_dict[key] = object_

    _flatten(nested_dict, None)
    return flattened_dict
def flatten_row(row):
    return flatten(row.asDict(True))
df = spark.read.json('gutenberg/test',
                     primitivesAsString=True,
                     allowComments=True,
                     allowUnquotedFieldNames=True,
                     allowNumericLeadingZero=True,
                     allowBackslashEscapingAnyCharacter=True,
                     mode='DROPMALFORMED') \
    .rdd.map(flatten_row).toDF()
df.show()
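An alternative that stays in the DataFrame API (a sketch, assuming every top-level column of the inferred DataFrame is one of these identically-shaped book structs): select each struct under a common alias, union the results into one row per book, and expand.

from functools import reduce

from pyspark.sql.functions import col

df = spark.read.json('gutenberg/test')

# df[c] indexing handles the numeric column names ("1", "10", "5")
per_book = [df.select(df[c].alias("book")) for c in df.columns]
books = reduce(lambda a, b: a.union(b), per_book)
books = books.where(col("book").isNotNull()).select("book.*")
books.show()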
I'm starting with PySpark and I'm having trouble creating DataFrames with nested objects.
This is my example.
I have users.
$ cat user.json
{"id":1,"name":"UserA"}
{"id":2,"name":"UserB"}
Users have orders.
$ cat order.json
{"id":1,"price":202.30,"userid":1}
{"id":2,"price":343.99,"userid":1}
{"id":3,"price":399.99,"userid":2}
And I would like to join them to get a struct where the orders are an array nested inside the users.
$ cat join.json
{"id":1, "name":"UserA", "orders":[{"id":1,"price":202.30,"userid":1},{"id":2,"price":343.99,"userid":1}]}
{"id":2,"name":"UserB","orders":[{"id":3,"price":399.99,"userid":2}]}
How can I do that?
Is there any kind of nested join or something similar?
>>> user = sqlContext.read.json("user.json")
>>> user.printSchema();
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
>>> order = sqlContext.read.json("order.json")
>>> order.printSchema();
root
|-- id: long (nullable = true)
|-- price: double (nullable = true)
|-- userid: long (nullable = true)
>>> joined = sqlContext.read.json("join.json")
>>> joined.printSchema();
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
EDIT:
I know it is possible to do this using join and foldByKey, but is there a simpler way?
EDIT2:
I'm using the solution by @zero323:
def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType="left_outer"):
    tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight]))
    tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), tmpTable._2.data.alias(columnNested))
    return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable["joinColumn"], joinType).drop("joinColumn")
I added a second nested structure, 'lines':
>>> lines = sqlContext.read.json(path + "lines.json")
>>> lines.printSchema();
root
|-- id: long (nullable = true)
|-- orderid: long (nullable = true)
|-- product: string (nullable = true)
orders = joinTable(order, lines, "id", "orderid", "lines")
joined = joinTable(user, orders, "id", "userid", "orders")
joined.printSchema()
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
| | |-- lines: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _1: long (nullable = true)
| | | | |-- _2: long (nullable = true)
| | | | |-- _3: string (nullable = true)
After this, the column names from lines are lost.
Any ideas?
EDIT 3:
I tried to manually specify the schema.
from pyspark.sql.types import *
fields = []
fields.append(StructField("_1", LongType(), True))
inner = ArrayType(lines.schema)
fields.append(StructField("_2", inner))
new_schema = StructType(fields)
print(new_schema)
grouped = lines.rdd.groupBy(lambda r: r.orderid)
grouped = grouped.map(lambda x: (x[0], list(x[1])))
g = sqlCtx.createDataFrame(grouped, new_schema)
Error:
TypeError: StructType(List(StructField(id,LongType,true),StructField(orderid,LongType,true),StructField(product,StringType,true))) can not accept object in type <class 'pyspark.sql.types.Row'>
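A likely fix for that TypeError, as a sketch: older PySpark versions verify struct values with an exact tuple/list type check, so Row objects are rejected; converting each grouped Row to a plain tuple sidesteps it.

# hypothetical fix: turn each Row into a plain tuple before applying the
# manually built schema
grouped = lines.rdd.groupBy(lambda r: r.orderid)
grouped = grouped.map(lambda x: (x[0], [tuple(r) for r in x[1]]))
g = sqlCtx.createDataFrame(grouped, new_schema)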
This will work only in Spark 2.0 or later.
First we'll need a couple of imports:
from pyspark.sql.functions import struct, collect_list
The rest is a simple aggregation and join:
orders = spark.read.json("/path/to/order.json")
users = spark.read.json("/path/to/user.json")
combined = users.join(
    orders
        .groupBy("userId")
        .agg(collect_list(struct(*orders.columns)).alias("orders"))
        .withColumnRenamed("userId", "id"),
    ["id"])
For the example data the result is:
combined.show(2, False)
+---+-----+---------------------------+
|id |name |orders |
+---+-----+---------------------------+
|1 |UserA|[[1,202.3,1], [2,343.99,1]]|
|2 |UserB|[[3,399.99,2]] |
+---+-----+---------------------------+
with schema:
combined.printSchema()
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
and JSON representation:
for x in combined.toJSON().collect():
    print(x)
{"id":1,"name":"UserA","orders":[{"id":1,"price":202.3,"userid":1},{"id":2,"price":343.99,"userid":1}]}
{"id":2,"name":"UserB","orders":[{"id":3,"price":399.99,"userid":2}]}
First, you need to use the userid as the join key for the second DataFrame:
user.join(order, user.id == order.userid)
Then you can use a map step to transform the resulting records into your desired format, as sketched below.
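For instance, that regrouping step can be expressed with the DataFrame API (a sketch; it is essentially the same collect_list aggregation as in the answer above):

from pyspark.sql.functions import collect_list, struct

# join users to orders, then collect the order columns back into one
# array of structs per user
joined = user.join(order, user.id == order.userid)
nested = (joined
          .groupBy(user.id, user.name)
          .agg(collect_list(struct(order.id, order.price, order.userid))
               .alias("orders")))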
For flattening your DataFrame from nested to normal, use:
dff = df.select("nested_struct_column.*")
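That expands a struct column. For an array of structs, such as orders in the combined DataFrame above, you would first explode the array (a usage sketch; the user id is aliased to avoid clashing with the order id):

from pyspark.sql.functions import col, explode

# one row per order, with the order struct expanded into columns
flat = (combined
        .select(col("id").alias("user_id"), "name", explode("orders").alias("order"))
        .select("user_id", "name", "order.*"))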