Pyspark: explode columns to new dataframe - python

I have a PySpark dataframe with this schema:
|-- doc_id: string (nullable = true)
|-- msp_contracts: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _el1: string (nullable = true)
| | |-- _el2: long (nullable = true)
| | |-- _el3: string (nullable = true)
| | |-- _el4: string (nullable = true)
| | |-- _el5: string (nullable = true)
How do I get this data frame:
|-- doc_id: string (nullable = true)
|-- _el1: string (nullable = true)
|-- _el3: string (nullable = true)
|-- _el4: string (nullable = true)
|-- _el5: string (nullable = true)
I tried this select:
explode('msp_contracts').select(
col(u'msp_contracts.element._el1'),
col(u'msp_contracts.element._el2')
)
but I get this error:
'Column' object is not callable

After explode('msp_contracts'), Spark will add a column named col as the result of the explode (if an alias is not provided).
df.select("doc_id",explode("msp_contracts")).show()
#+------+---+
#|doc_id|col|
#+------+---+
#| 1|[1]|
#+------+---+
Use col to select _el1. Try: df.select("doc_id", explode("msp_contracts")).select("doc_id", col("col._el1")).show()
Example:
jsn='{"doc_id":1,"msp_contracts":[{"_el1":1}]}'
df=spark.read.json(sc.parallelize([(jsn)]))
#schema
#root
# |-- doc_id: long (nullable = true)
# |-- msp_contracts: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- _el1: long (nullable = true)
df.withColumn("msp_contracts",explode(col("msp_contracts"))).\
select("doc_id","msp_contracts._el1").show()
#+------+----+
#|doc_id|_el1|
#+------+----+
#| 1| 1|
#+------+----+
UPDATE:
df.select("doc_id",explode("msp_contracts")).\
select("doc_id","col._el1").\
show()
#or
df.select("doc_id",explode("msp_contracts")).\
select("doc_id",col(u"col._el1")).\
show()
#+------+----+
#|doc_id|_el1|
#+------+----+
#| 1| 1|
#+------+----+

This works for me:
df.select("doc_id",explode("msp_contracts")).\
select("doc_id","col._el1")
With alias and custom columns:
df.select(
    'doc_id',
    explode('msp_contracts').alias("msp_contracts")
)\
.select(
    'doc_id',
    col('msp_contracts._el1').alias('last_period_44fz_customer'),
    col('msp_contracts._el2').alias('last_period_44fz_customer_inn')
)\
.withColumn("load_dtm", now_f())  # now_f() is my own helper (not shown here)

Related

Spark : How to reuse the same array schema that has all fields defined across the data-frame

I have hundreds of columns a, b, c, .... I would like to modify the dataframe schema so that each array has the same shape: date, num, and val fields.
There are thousands of ids, so I would like to modify ONLY the schema, not the dataframe. The modified schema will be used in the next step to load the data into a dataframe efficiently. I would like to avoid using a UDF to modify the whole dataframe.
Input schema:
df.printSchema()
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true) !!! NOTE : `num` !!!
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
Required Output schema:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
To reproduce the input schema:
df = spark.read.json(sc.parallelize([
    """{"id":1,"a":[{"date":2001,"num":1},{"date":2002},{"date":2003}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
    """{"id":2,"a":[{"date":2001,"num":2},{"date":2002},{"date":2003}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}"""
]))
for field in df.schema:
    print(field)
Print output:
StructField(a,ArrayType(StructType(List(StructField(date,LongType,true),StructField(num,LongType,true),StructField(val,LongType,true))),true),true)
StructField(b,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(c,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(d,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(id,LongType,true)
Solution (see OneCricketeer's answer below for details):
from pyspark.sql.types import StructField, StructType, LongType, ArrayType
jsonstr=[
"""{"id":1,"a":[{"date":2001,"val":1,"num":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
"""{"id":2,"a":[{"date":2001,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}}"""
]
array_schema = ArrayType(StructType([
    StructField('date', LongType(), True),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]),
    True)
keys = ['a', 'b', 'c', 'd']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
dff = spark.read.json(sc.parallelize(jsonstr),df_schema)
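A quick sanity check (my addition, not part of the original post) that the explicit schema took effect for all four arrays:
dff.printSchema()                 # a, b, c and d now share the date/num/val element struct
dff.select('id', 'b.num').show()  # b exposes num as well (values are null, since the data never sets it)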
I think the true solution is to have consistent names, or at least something more descriptive if the fields are truly different; "num" and "val" are basically synonymous.
If I understand the question, you want to reuse the same array schema that has all fields defined:
array_schema = ArrayType(StructType([
    StructField('date', LongType(), False),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]), True)
df_schema = StructType([
    StructField('a', array_schema, True),
    StructField('b', array_schema, True),
    ...
    StructField('id', LongType(), True)
])
Or you can do this in a loop, which is safe because it's applied in the Spark driver:
keys = ['a', 'b']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
(Change each boolean to False if there will be no nulls.)
Then you need to provide this schema to your read function:
spark.read.schema(df_schema).json(...
If there will still be more fields that cannot be consistently applied to all "keys", then use ArrayType(MapType(StringType(), LongType()), False)
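A hedged sketch of that last option (reusing keys and jsonstr from above; my assumption about how you would wire it up): each array element becomes a free-form string-to-long map instead of a fixed struct, so per-key differences no longer break the shared schema.
from pyspark.sql.types import (StructType, StructField, ArrayType,
                               MapType, StringType, LongType)

map_array_schema = ArrayType(MapType(StringType(), LongType()), False)
fields = [StructField(k, map_array_schema, True) for k in ['a', 'b', 'c', 'd']]
fields.append(StructField('id', LongType(), True))

dff = spark.read.schema(StructType(fields)).json(sc.parallelize(jsonstr))
dff.printSchema()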

Extract schema labels from pyspark dataframe

From a pyspark dataframe I want to create a python list with the schema labels for a specific schema "level".
The schema is:
root
|-- DISPLAY: struct (nullable = true)
| |-- 1WO: struct (nullable = true)
| | |-- JPY: struct (nullable = true)
| | | |-- CHANGE24HOUR: string (nullable = true)
| | | |-- CHANGEDAY: string (nullable = true)
| |-- AAVE: struct (nullable = true)
| | |-- JPY: struct (nullable = true)
| | | |-- CHANGE24HOUR: string (nullable = true)
| | | |-- CHANGEDAY: string (nullable = true)
The expected output is:
list = 1WO, AAVE
The following code prints everything in the schema:
df.schema.jsonValue()
Is there an easy way to extract those labels, please?
Select the first layer using the asterisk notation, and then list the columns:
df.select('DISPLAY.*').columns
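An equivalent sketch that reads the names from the schema object instead of running a select (assuming the same df as above):
labels = [f.name for f in df.schema['DISPLAY'].dataType.fields]
# labels == ['1WO', 'AAVE']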

How to flatten json file in pyspark

I have a JSON file with the structure shown below. The JSON file structure changes every time; how do we handle flattening any kind of JSON file in PySpark? Can you help me with this?
root
|-- student: struct (nullable = true)
|-- children: struct (nullable = true)
|-- parent: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- type: string (nullable = true)
| |-- date: string (nullable = true)
|-- multipliers: array (nullable = true)
| |-- element: double (containsNull = true)
|-- spawn_time: string (nullable = true)
|-- type: array (nullable = true)
| |-- element: string (containsNull = true)
This approach uses a recursive function to determine the columns to select, by building a flat list of fully-named prefixes in the prefix accumulator parameter.
Note that it will work on any format that supports nesting, not just JSON (Parquet, Avro, etc).
Furthermore, the input can have any schema, but this example uses:
{"c1": {"c3": 4, "c4": 12}, "c2": "w1"}
{"c1": {"c3": 5, "c4": 34}, "c2": "w2"}
The original df shows as:
+-------+---+
| c1| c2|
+-------+---+
|[4, 12]| w1|
|[5, 34]| w2|
+-------+---+
The code:
from pyspark.sql.types import StructType
from pyspark.sql.functions import col

# return a list of all (possibly nested) fields to select, within a given schema
def flatten(schema, prefix: str = ""):
    # return a list of sub-items to select, within a given field
    def field_items(field):
        name = f'{prefix}.{field.name}' if prefix else field.name
        if type(field.dataType) == StructType:
            return flatten(field.dataType, name)
        else:
            return [col(name)]
    return [item for field in schema.fields for item in field_items(field)]

df = spark.read.json(path)

print('===== df =====')
df.printSchema()

flattened = flatten(df.schema)
print('flattened =', flattened)

print('===== df2 =====')
df2 = df.select(*flattened)
df2.printSchema()
df2.show()
As you will see in the output, the flatten function returns a flat list of columns, each one fully named (using "parent_col.child_col" naming format).
Output:
===== df =====
root
|-- c1: struct (nullable = true)
| |-- c3: long (nullable = true)
| |-- c4: long (nullable = true)
|-- c2: string (nullable = true)
flattened = [Column<b'c1.c3'>, Column<b'c1.c4'>, Column<b'c2'>]
===== df2 =====
root
|-- c3: long (nullable = true)
|-- c4: long (nullable = true)
|-- c2: string (nullable = true)
+---+---+---+
| c3| c4| c2|
+---+---+---+
| 4| 12| w1|
| 5| 34| w2|
+---+---+---+
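One thing to note from the output: df2 keeps only the leaf names (c3, c4, c2). If leaf names can repeat under different parent structs, a small variation (my own sketch, not part of the answer above) can alias each column with its full path:
from pyspark.sql.types import StructType
from pyspark.sql.functions import col

def flatten_with_aliases(schema, prefix: str = ""):
    def field_items(field):
        name = f'{prefix}.{field.name}' if prefix else field.name
        if isinstance(field.dataType, StructType):
            return flatten_with_aliases(field.dataType, name)
        # alias "c1.c3" as "c1_c3" so sibling structs with equal leaf names stay distinct
        return [col(name).alias(name.replace('.', '_'))]
    return [item for field in schema.fields for item in field_items(field)]

df.select(*flatten_with_aliases(df.schema)).printSchema()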

Transform array to column dynamically using pyspark

I'm having trouble with a JSON dataframe:
{
  "keys": [
    {
      "id": 1,
      "start": "2019-05-10",
      "end": "2019-05-11",
      "property": [
        { "key": "home",   "value": "1000" },
        { "key": "office", "value": "exit" },
        { "key": "car",    "value": "ford" }
      ]
    },
    {
      "id": 2,
      "start": "2019-05-11",
      "end": "2019-05-12",
      "property": [
        { "key": "home",   "value": "2000" },
        { "key": "office", "value": "out" },
        { "key": "car",    "value": "fiat" }
      ]
    }
  ]
}
root
|-- id: long (nullable = true)
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- property: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I need to have key and value as column, where key is the name of column and value is the value in the dataframe.
At first I used getItem with an alias:
df.select("id", "start", "end", col("property.value").getItem(0).alias("home"), col("property.value").getItem(1).alias("office"), col("property.value").getItem(2).alias("car"))
But the number and position of the elements can change, so I thought about providing a new schema with all the possible values for key and filling the values from my dataframe without being tied to the position, but I think that is a low-performance solution.
I also tried using pivot, but I don't get the correct result, as shown below; I need the columns split out, without a comma in the column name or value:
+---+----------+----------+-------------------+
|id |start     |end       |[home, office, car]|
+---+----------+----------+-------------------+
|1  |2019-05-10|2019-05-11|[1000,exit,ford]   |
|2  |2019-05-11|2019-05-12|[2000,out,fiat]    |
+---+----------+----------+-------------------+
I need this schema, with the fields updated dynamically (their number can be fixed):
root
|-- id: long (nullable = true)
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- home: string (nullable = true)
|-- office: string (nullable = true)
|-- car: string (nullable = true)
|-- cycle: string (nullable = true)
Can anyone help me, please?
Please find my attempt below. I deliberately expanded it into a couple of steps so that you can see how the final df is created (feel free to wrap these steps; it would not have any impact on performance).
inputJSON = "/tmp/my_file.json"
dfJSON = spark.read.json(inputJSON, multiLine=True)
from pyspark.sql import functions as F
df = dfJSON.select(F.explode(dfJSON["keys"]).alias("x"))
df2 = df.select(F.col("x.start").alias("start"),F.col("x.end").alias("end"),F.col("x.id").alias("id"),F.col("x.property").alias("property"))
df3 = df2.select(F.col("start"),F.col("end"),F.col("id"), F.explode(df2["property"]).alias("properties"))
df4 = df3.select(F.col("start"),F.col("end"),F.col("id"), F.col("properties.key").alias("key"), F.col("properties.value").alias("value"))
df4.groupBy("start","end","id").pivot('key').agg(F.last('value', True)).show()
Output:
+----------+----------+---+----+----+------+
| start| end| id| car|home|office|
+----------+----------+---+----+----+------+
|2019-05-11|2019-05-12| 2|fiat|2000| out|
|2019-05-10|2019-05-11| 1|ford|1000| exit|
+----------+----------+---+----+----+------+
Schemas:
dfJSON.printSchema()
root
|-- keys: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- end: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- property: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- key: string (nullable = true)
| | | | |-- value: string (nullable = true)
| | |-- start: string (nullable = true)
df2.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- property: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
df3.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- properties: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: string (nullable = true)
df4.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Try with groupBy and pivot.
from pyspark.sql.functions import *
cols=['home','office','car']
spark.read.option("multiline","true").\
json("<path>").\
selectExpr("explode(keys)").\
selectExpr("col.id","col.start","col.end","explode(col.property)").\
select("id","start","end","col.*").\
groupBy("id","start","end").\
pivot("key").\
agg(first("value")).\
withColumn("[home,office,car]",array(*cols)).\
drop(*cols).\
show()
#+---+----------+----------+------------------+
#| id| start| end| [home,office,car]|
#+---+----------+----------+------------------+
#| 1|2019-05-10|2019-05-11|[1000, exit, ford]|
#| 2|2019-05-11|2019-05-12| [2000, out, fiat]|
#+---+----------+----------+------------------+
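If the key names are not known up front, a hedged variation (reusing df4 from the first answer above; untested against your real data) can collect the distinct keys and pass them to pivot, so nothing is hard-coded:
from pyspark.sql import functions as F

keys = [r['key'] for r in df4.select('key').distinct().collect()]
df4.groupBy('start', 'end', 'id').pivot('key', keys).agg(F.last('value', True)).show()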

Reading and accessing nested fields in json files using spark

I have multiple JSON files I wish to use to create a Spark dataframe. In testing with a subset, when I load the files I get rows of the JSON information itself instead of parsed JSON information. I am doing the following:
df = spark.read.json('gutenberg/test')
df.show()
+--------------------+--------------------+--------------------+
| 1| 10| 5|
+--------------------+--------------------+--------------------+
| null|[WrappedArray(),W...| null|
| null| null|[WrappedArray(Uni...|
|[WrappedArray(Jef...| null| null|
+--------------------+--------------------+--------------------+
When I check the schema of the dataframe, it appears to be there, but I am having trouble accessing it:
df.printSchema()
root
|-- 1: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
|-- 10: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
|-- 5: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
I keep getting errors when trying to access the information, so any help would be great.
Specifically, I am looking to create a new dataframe where the columns are ('author', 'formaturi', 'language', 'rights', 'subject', 'title', 'txt')
I am using PySpark 2.2.
Since I do not know exactly what the JSON files look like, and assuming they are newline-delimited JSON, this should work.
def _construct_key(previous_key, separator, new_key):
    if previous_key:
        return "{}{}{}".format(previous_key, separator, new_key)
    else:
        return new_key

def flatten(nested_dict, separator="_", root_keys_to_ignore=set()):
    assert isinstance(nested_dict, dict)
    assert isinstance(separator, str)
    flattened_dict = dict()

    def _flatten(object_, key):
        if isinstance(object_, dict):
            for object_key in object_:
                if not (not key and object_key in root_keys_to_ignore):
                    _flatten(object_[object_key],
                             _construct_key(key, separator, object_key))
        elif isinstance(object_, list) or isinstance(object_, set):
            for index, item in enumerate(object_):
                _flatten(item, _construct_key(key, separator, index))
        else:
            flattened_dict[key] = object_

    _flatten(nested_dict, None)
    return flattened_dict

# flatten each Row by converting it to a (recursive) dict first
def flatten_row(row):
    return flatten(row.asDict(True))

df = spark.read.json('gutenberg/test',
                     primitivesAsString=True,
                     allowComments=True,
                     allowUnquotedFieldNames=True,
                     allowNumericLeadingZero=True,
                     allowBackslashEscapingAnyCharacter=True,
                     mode='DROPMALFORMED')\
          .rdd.map(flatten_row).toDF()
df.show()
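If the goal is a dataframe with exactly the columns author, formaturi, language, rights, subject, title, and txt, another hedged sketch (my own, assuming the schema printed above, where each top-level column such as '1' is one book) is to unpivot the per-book structs into rows:
from functools import reduce
from pyspark.sql.functions import col, lit

raw = spark.read.json('gutenberg/test')
# one dataframe per book id, with that book's struct expanded into its fields
per_book = [
    raw.where(col('`{0}`'.format(c)).isNotNull())
       .select(lit(c).alias('book_id'), col('`{0}`.*'.format(c)))
    for c in raw.columns
]
books = reduce(lambda a, b: a.union(b), per_book)
books.show()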
