I have a JSON string that I am currently reading into a PySpark DataFrame:
rdd = sc.parallelize([json_str])
nested_df = hc.read.json(rdd)
Upon doing nested_df.show(20,False) I get:
+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+
|C_0_0                                                                                            |C_0_1                                                                                              |
+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+
|[{19.765432, 3.13}, {19.765432, 3.13}, {19.765432, 3.13}, {19.765432, 3.13}, {19.765432, 3.13}]|{{2000-12-12 23:30:30.1234567, 2000-12-12 23:30:30.1234567}, {2000-12-12 23:30:30.1234567, 3.13}}|
+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+
My current schema after reading the json_str is:
# root
# |-- C_0_0: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- C_2_0: double (nullable = true)
# | | |-- C_2_1: double (nullable = true)
# |-- C_0_1: struct (nullable = true)
# | |-- C_1_0: struct (nullable = true)
# | | |-- C_2_0: string (nullable = true)
# | | |-- C_2_1: string (nullable = true)
# | |-- C_1_1: struct (nullable = true)
# | | |-- C_2_0: string (nullable = true)
# | | |-- C_2_1: double (nullable = true)
But I want my schema to be the following, without data loss:
# root
# |-- C_0_0: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- C_2_0: decimal(6,6) (nullable = true)
# | | |-- C_2_1: double (nullable = true)
# |-- C_0_1: struct (nullable = true)
# | |-- C_1_0: struct (nullable = true)
# | | |-- C_2_0: timestamp (nullable = true)
# | | |-- C_2_1: timestamp (nullable = true)
# | |-- C_1_1: struct (nullable = true)
# | | |-- C_2_0: timestamp (nullable = true)
# | | |-- C_2_1: double (nullable = true)
This is my create table statement:
CREATE TABLE CompoundDataTypesSchema.TABLE_NAME (
  C_0_0 ARRAY OF ROW ( C_2_0 DECIMAL(6,6), C_2_1 DOUBLE ),
  C_0_1 ROW (
    C_1_0 ROW ( C_2_0 TIMESTAMP, C_2_1 TIMESTAMP ),
    C_1_1 ROW ( C_2_0 TIMESTAMP, C_2_1 DOUBLE )
  )
) in FILES_SERVICE
I am expecting the table above with the newly defined schema and no data loss, but my current table is:
+-----+----------------------------------------------------------------------------------------------+
|C_0_0|C_0_1 |
+-----+----------------------------------------------------------------------------------------------+
|null |{{2000-12-12 23:30:30.123456, 2000-12-12 23:30:30.123456}, {2000-12-12 23:30:30.123456, 3.13}}|
+-----+----------------------------------------------------------------------------------------------+
The C_0_0 column has a NULL value.
To change the schema I tried the following:
def transform_schema(self, schema, parent=""):
    if schema == None:
        return StructType()
    new_schema = []
    for f in schema.fields:
        if parent:
            field_name = parent + '.' + f.name
        else:
            field_name = f.name
        if isinstance(f.dataType, ArrayType):
            new_schema.append(StructField(f.name, ArrayType(self.transform_schema(f.dataType.elementType))))
        elif isinstance(f.dataType, StructType):
            new_schema.append(StructField(f.name, self.transform_schema(f.dataType)))
        else:
            new_datatype = self.changeDatatypeforNestedField()
            new_schema.append(StructField(f.name, new_datatype, f.nullable))
    return StructType(new_schema)
nested_df_schema = nested_df.schema
for f in nested_df_schema.fields:
    print("Name: ", f.name)
    col_name = f.name
    if isinstance(f.dataType, ArrayType):
        new_schema = ArrayType(self.transform_schema(f.dataType.elementType, parent=f.name))
        nested_df = nested_df.withColumn("col_name_json", to_json(col_name)).drop(col_name)
        nested_df = nested_df.withColumn(col_name, from_json("col_name_json", new_schema)).drop("col_name_json")
    elif isinstance(f.dataType, StructType):
        new_schema = self.transform_schema(f.dataType, parent=f.name)
        nested_df = nested_df.withColumn("col_name_json", to_json(col_name)).drop(col_name)
        nested_df = nested_df.withColumn(col_name, from_json("col_name_json", new_schema)).drop("col_name_json")
    else:
        new_datatype = self.changeDatatypeforNestedField()
        nested_df = nested_df.withColumn(col_name, nested_df[col_name].cast(new_datatype))
Can someone point out what the issue might be?
Read the data with an explicit schema first, then transform the column you want:
from pyspark.sql import functions as f
from pyspark.sql.types import (StructType, StructField, ArrayType,
                               DecimalType, DoubleType, TimestampType)

rdd = sc.parallelize([{'C_0_0': [{'C_2_0': 19.765432, 'C_2_1': 3.13}, {'C_2_0': 19.765432, 'C_2_1': 3.13}, {'C_2_0': 19.765432, 'C_2_1': 3.13}, {'C_2_0': 19.765432, 'C_2_1': 3.13}, {'C_2_0': 19.765432, 'C_2_1': 3.13}], 'C_0_1': {'C_1_0': {'C_2_0': "2000-12-12 23:30:30.1234567", 'C_2_1': "2000-12-12 23:30:30.1234567"}, 'C_1_1': {'C_2_0': "2000-12-12 23:30:30.1234567", 'C_2_1': 3.13}}}])

# Read with an explicit schema: decimal(8,6) keeps the integer digits, and the strings become timestamps.
schema = StructType([StructField('C_0_0', ArrayType(StructType([StructField('C_2_0', DecimalType(8, 6), True), StructField('C_2_1', DoubleType(), True)]), True), True), StructField('C_0_1', StructType([StructField('C_1_0', StructType([StructField('C_2_0', TimestampType(), True), StructField('C_2_1', TimestampType(), True)]), True), StructField('C_1_1', StructType([StructField('C_2_0', TimestampType(), True), StructField('C_2_1', DoubleType(), True)]), True)]), True)])
df = spark.read.json(rdd, schema=schema)

# decimal(6,6) can only hold values below 1, so keep just the fractional part before casting.
df2 = df.withColumn('C_0_0', f.transform('C_0_0', lambda e: f.struct((e['C_2_0'] % 1).cast("decimal(6, 6)"), e['C_2_1'])))
df2.show(truncate=False)
df2.printSchema()
+------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+
|C_0_0 |C_0_1 |
+------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+
|[{0.765432, 3.13}, {0.765432, 3.13}, {0.765432, 3.13}, {0.765432, 3.13}, {0.765432, 3.13}]|{{2000-12-12 23:30:30.123456, 2000-12-12 23:30:30.123456}, {2000-12-12 23:30:30.123456, 3.13}}|
+------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+
root
|-- C_0_0: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- col1: decimal(6,6) (nullable = true)
| | |-- col2: double (nullable = true)
|-- C_0_1: struct (nullable = true)
| |-- C_1_0: struct (nullable = true)
| | |-- C_2_0: timestamp (nullable = true)
| | |-- C_2_1: timestamp (nullable = true)
| |-- C_1_1: struct (nullable = true)
| | |-- C_2_0: timestamp (nullable = true)
| | |-- C_2_1: double (nullable = true)
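As a side note, decimal(6,6) has precision 6 and scale 6, so it can only hold values strictly below 1; 19.765432 cannot fit in it, which may be why C_0_0 came back NULL in your table. If your Spark version supports casting complex types with a DDL-style type string (recent 3.x versions do; verify on yours), a minimal sketch of an alternative that avoids rebuilding the structs field by field, using the column names from your schema, is:

from pyspark.sql import functions as F

# Cast each nested column in one shot; decimal(8,6) keeps the two integer digits,
# and the timestamp strings are cast field by field inside the struct.
df_cast = (nested_df
    .withColumn("C_0_0", F.col("C_0_0").cast("array<struct<C_2_0:decimal(8,6),C_2_1:double>>"))
    .withColumn("C_0_1", F.col("C_0_1").cast(
        "struct<C_1_0:struct<C_2_0:timestamp,C_2_1:timestamp>,"
        "C_1_1:struct<C_2_0:timestamp,C_2_1:double>>")))
df_cast.printSchema()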
Related
I would like to rename the keys of the first level objects inside my payload.
from pyspark.sql.functions import *
ds = {'Fruits': {'apple': {'color': 'red'},'mango': {'color': 'green'}}, 'Vegetables': None}
df = spark.read.json(sc.parallelize([ds]))
df.printSchema()
"""
root
|-- Fruits: struct (nullable = true)
| |-- apple: struct (nullable = true)
| | |-- color: string (nullable = true)
| | |-- shape: string (nullable = true)
| |-- mango: struct (nullable = true)
| | |-- color: string (nullable = true)
|-- Vegetables: string (nullable = true)
"""
Desired output:
root
|-- Fruits: struct (nullable = true)
| |-- APPLE: struct (nullable = true)
| | |-- color: string (nullable = true)
| | |-- shape: string (nullable = true)
| |-- MANGO: struct (nullable = true)
| | |-- color: string (nullable = true)
|-- Vegetables: string (nullable = true)
In this case I would like to rename the keys in the first level to uppercase.
If I had a map type I could use transform_keys:
df.select(transform_keys("Fruits", lambda k, _: upper(k)).alias("data_upper")).display()
Unfortunately, I have a struct type.
AnalysisException: cannot resolve 'transform_keys(Fruits,
lambdafunction(upper(x_18), x_18, y_19))' due to argument data type
mismatch: argument 1 requires map type, however, 'Fruits' is of
struct<apple:struct<color:string,shape:string>,mango:struct<color:string>>
type.;
I'm using Databricks runtime 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12).
The function you tried to use (transform_keys) is for map type columns. Your column type is struct.
You could use withField.
from pyspark.sql import functions as F
ds = spark.createDataFrame([], 'Fruits struct<apple:struct<color:string,shape:string>,mango:struct<color:string>>, Vegetables string')
ds.printSchema()
# root
# |-- Fruits: struct (nullable = true)
# | |-- apple: struct (nullable = true)
# | | |-- color: string (nullable = true)
# | | |-- shape: string (nullable = true)
# | |-- mango: struct (nullable = true)
# | | |-- color: string (nullable = true)
# |-- Vegetables: string (nullable = true)
ds = ds.withColumn('Fruits', F.col('Fruits').withField('APPLE', F.col('Fruits.apple')))
ds = ds.withColumn('Fruits', F.col('Fruits').withField('MANGO', F.col('Fruits.mango')))
ds.printSchema()
# root
# |-- Fruits: struct (nullable = true)
# | |-- APPLE: struct (nullable = true)
# | | |-- color: string (nullable = true)
# | | |-- shape: string (nullable = true)
# | |-- MANGO: struct (nullable = true)
# | | |-- color: string (nullable = true)
# |-- Vegetables: string (nullable = true)
You can also recreate the structure, but you will need to include all of the struct fields when recreating.
ds = ds.withColumn('Fruits', F.struct(
F.col('Fruits.apple').alias('APPLE'),
F.col('Fruits.mango').alias('MANGO'),
))
ds.printSchema()
# root
# |-- Fruits: struct (nullable = true)
# | |-- APPLE: struct (nullable = true)
# | | |-- color: string (nullable = true)
# | | |-- shape: string (nullable = true)
# | |-- MANGO: struct (nullable = true)
# | | |-- color: string (nullable = true)
# |-- Vegetables: string (nullable = true)
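If the struct has many fields, here is a rough sketch (my own variation, not part of the approach above) that rebuilds Fruits from its schema so every first-level key is uppercased automatically; it assumes Fruits is a plain struct as in your printSchema:

from pyspark.sql import functions as F

# Read the first-level field names from the schema and re-alias each one in uppercase.
fruit_fields = ds.schema['Fruits'].dataType.fieldNames()
ds = ds.withColumn('Fruits', F.struct(
    *[F.col('Fruits.' + name).alias(name.upper()) for name in fruit_fields]
))
ds.printSchema()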
df = spark.read.json(['/Users/.../input/json/thisistheinputfile.json'])
df.printSchema()
This results in something like the following:
root
|-- _metadata: struct (nullable = true)
| |-- bundled: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- bundledIds: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- failedInitializations: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- unbundled: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- anonymousId: string (nullable = true)
|-- channel: string (nullable = true)
|-- context: struct (nullable = true)
| |-- campaign: struct (nullable = true)
| | |-- content: string (nullable = true)
| | |-- medium: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- source: string (nullable = true)
| | |-- term: string (nullable = true)
| | |-- utm_campaign: string (nullable = true)
| | |-- utm_medium: string (nullable = true)
| | |-- utm_term: string (nullable = true)
| |-- ip: string (nullable = true)
However, some time later, in some cases the input file does not contain some of the content that was present above; for instance, the campaign information may not be available:
root
|-- _metadata: struct (nullable = true)
| |-- bundled: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- bundledIds: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- failedInitializations: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- unbundled: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- anonymousId: string (nullable = true)
|-- channel: string (nullable = true)
|-- context: struct (nullable = true)
| |-- ip: string (nullable = true)
I would like to automatically be able to select some of the columns, but I don't want the script to crash when the content is not available. Note that the number of columns to be selected is a lot larger than in the example below:
df_2 = df\
.select(expr("context.campaign.source").alias("campaign_source"),
expr("context.campaign.utm_campaign").alias("utm_campaign"),
'anonymousId')
One case could be that anonymousId, ip, and context.campaign.source exist, but not context.campaign.utm_campaign; with many columns there can be a lot of such combinations.
I tried listing the fields I wanted to find, checking whether they existed, and then using that list as input to the DataFrame selection. But I found this difficult since I have a nested DataFrame:
lst = ['anonymousId',
       'source',
       'utm_campaign',
       'ip']

col_exists = []
for col in lst:
    if df.schema.simpleString().find(col) > 0:
        col_exists.append(col)
    else:
        print('Column', col, 'does not exist')

df_2 = df.select(col_exists)  # does ofc not work...
Any tips on how to work with this kind of nested dataframe?
Thank you in advance!!
The following steps helped resolve my issue:
from pyspark.sql.types import ArrayType, StructType

def flatten(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields

def intersection(lst1, lst2):
    # Use of hybrid method
    temp = set(lst2)
    lst3 = [value for value in lst1 if value in temp]
    return lst3
fieldsPathName = flatten(df.schema)
df_prepSchema = df.select(fieldsPathName).toDF(*fieldsPathName)

lst1 = ['context.campaign.source',
        'context.campaign.utm_campaign',
        'timestamp',
        'anonymousId']
lst2 = df_prepSchema.columns  # the flattened (dotted) column names
cols = intersection(lst1, lst2)

# wrap each dotted name in backticks so select() treats it as a literal column name
cols = ['`' + col + '`' for col in cols]
df_2 = df_prepSchema.select(cols)
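As a variation on the above (a sketch I have not run against your data), you can also test each dotted path by walking the schema directly, which avoids renaming columns; it assumes only struct and array nesting:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType

def has_path(schema, path):
    # Return True if a dotted path such as 'context.campaign.source' exists in the schema.
    current = schema
    for part in path.split('.'):
        if isinstance(current, ArrayType):
            current = current.elementType
        if not isinstance(current, StructType) or part not in current.fieldNames():
            return False
        current = current[part].dataType
    return True

wanted = ['anonymousId', 'context.ip', 'context.campaign.utm_campaign']
existing = [c for c in wanted if has_path(df.schema, c)]
df_2 = df.select([F.col(c).alias(c.replace('.', '_')) for c in existing])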
I have hundreds of columns a, b, c, ... I would like to modify the DataFrame schema so that each array has the same shape: date, num, and val fields.
There are thousands of ids, so I would like to modify ONLY the schema, not the DataFrame. The modified schema will be used in the next step to load the data into a DataFrame efficiently. I would like to avoid using a UDF to modify the whole DataFrame.
Input schema:
df.printSchema()
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true) !!! NOTE : `num` !!!
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
Required Output schema:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
To reproduce the input schema:
df = spark.read.json(sc.parallelize([
"""{"id":1,"a":[{"date":2001,"num":1},{"date":2002,},{"date":2003,}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
"""{"id":2,"a":[{"date":2001,"num":2},{"date":2002},{"date":2003}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}}"""
]))
for field in df.schema:
    print(field)
Print output:
StructField(a,ArrayType(StructType(List(StructField(date,LongType,true),StructField(num,LongType,true),StructField(val,LongType,true))),true),true)
StructField(b,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(c,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(d,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(id,LongType,true)
Solution (see OneCricketeer's answer below for details):
from pyspark.sql.types import StructField, StructType, LongType, ArrayType
jsonstr=[
"""{"id":1,"a":[{"date":2001,"val":1,"num":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
"""{"id":2,"a":[{"date":2001,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}}"""
]
array_schema = ArrayType(StructType([
StructField('date' ,LongType(),True),
StructField('num' ,LongType(),True),
StructField('val' ,LongType(),True)]),
True)
keys = ['a', 'b', 'c', 'd']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
dff = spark.read.json(sc.parallelize(jsonstr),df_schema)
I think the true solution is to have consistent names, or at least something more descriptive if the fields are truly different; "num" and "val" are basically synonymous.
If I understand the question, you want to reuse the same array schema that has all fields defined:
array_schema = ArrayType(StructType([
    StructField('date', LongType(), False),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]), True)
df_schema = StructType([
    StructField('a', array_schema, True),
    StructField('b', array_schema, True),
    ...
    StructField('id', LongType(), True)
])
Or you can do this in a loop, which is safe because it's applied in the Spark driver:
keys = ['a', 'b']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
(change each boolean to a False if there will be no nulls)
Then you need to provide this schema to your read function
spark.read.schema(df_schema).json(...
If there will still be more fields that cannot be consistently applied to all "keys", then use ArrayType(MapType(StringType(), LongType()), False)
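A rough sketch of that MapType fallback, assuming every value inside the array elements is a long as in the sample data above (jsonstr, sc, and spark as defined earlier):

from pyspark.sql.types import ArrayType, MapType, StringType, LongType, StructType, StructField

# Each array element becomes a map of field name -> long, so mismatched keys are tolerated.
map_array = ArrayType(MapType(StringType(), LongType()), True)
keys = ['a', 'b', 'c', 'd']
fields = [StructField(k, map_array, True) for k in keys]
fields.append(StructField('id', LongType(), True))
map_schema = StructType(fields)

dff = spark.read.schema(map_schema).json(sc.parallelize(jsonstr))
dff.printSchema()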
I have multiple JSON files I wish to use to create a Spark DataFrame. In testing with a subset, when I load the files, I get rows of the JSON information itself instead of parsed JSON information. I am doing the following:
df = spark.read.json('gutenberg/test')
df.show()
+--------------------+--------------------+--------------------+
| 1| 10| 5|
+--------------------+--------------------+--------------------+
| null|[WrappedArray(),W...| null|
| null| null|[WrappedArray(Uni...|
|[WrappedArray(Jef...| null| null|
+--------------------+--------------------+--------------------+
When I check the schema of the DataFrame, it appears to be there, but I am having trouble accessing it:
df.printSchema()
root
|-- 1: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
|-- 10: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
|-- 5: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
I keep getting errors when trying to access the information, so any help would be great.
Specifically, I am looking to create a new dataframe where the columns are ('author', 'formaturi', 'language', 'rights', 'subject', 'title', 'txt')
I am using pyspark 2.2
Since I do not know exactly what the JSON files look like, assuming they are newline-delimited JSON, this should work.
def _construct_key(previous_key, separator, new_key):
    if previous_key:
        return "{}{}{}".format(previous_key, separator, new_key)
    else:
        return new_key

def flatten_dict(nested_dict, separator="_", root_keys_to_ignore=set()):
    assert isinstance(nested_dict, dict)
    assert isinstance(separator, str)
    flattened_dict = dict()

    def _flatten(object_, key):
        if isinstance(object_, dict):
            for object_key in object_:
                if not (not key and object_key in root_keys_to_ignore):
                    _flatten(object_[object_key],
                             _construct_key(key, separator, object_key))
        elif isinstance(object_, list) or isinstance(object_, set):
            for index, item in enumerate(object_):
                _flatten(item, _construct_key(key, separator, index))
        else:
            flattened_dict[key] = object_

    _flatten(nested_dict, None)
    return flattened_dict

def flatten(_json):
    # flatten a Row by converting it to a dict first
    return flatten_dict(_json.asDict(True))

df = spark.read.json('gutenberg/test',
                     primitivesAsString=True,
                     allowComments=True,
                     allowUnquotedFieldNames=True,
                     allowNumericLeadingZero=True,
                     allowBackslashEscapingAnyCharacter=True,
                     mode='DROPMALFORMED')\
        .rdd.map(flatten).toDF()
df.show()
df.show()
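As another option (a sketch, not part of the answer above): since each top-level column in the schema you printed is a struct with the same fields, you can stack those structs and then expand them into the columns you listed (author, formaturi, language, rights, subject, title, txt):

import pyspark.sql.functions as F

raw = spark.read.json('gutenberg/test')
# One struct column per book id ("1", "10", "5", ...): collect them into an array,
# explode to one row per book, drop the nulls from files missing a given id, then expand.
books = (raw
    .select(F.explode(F.array(*[F.col(c) for c in raw.columns])).alias('book'))
    .where(F.col('book').isNotNull())
    .select('book.*'))
books.show()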
I have the following initial PySpark DataFrame:
+----------+--------------------------------+
|product_PK|                        products|
+----------+--------------------------------+
|       686|          [[686,520.70],[645,2]]|
|       685|[[685,45.556],[678,23],[655,21]]|
|       693|                              []|
+----------+--------------------------------+
df = sqlCtx.createDataFrame(
[(686, [[686,520.70], [645,2]]), (685, [[685,45.556], [678,23],[655,21]]), (693, [])],
["product_PK", "products"]
)
The column products contains nested data. I need to extract the second value in each pair of values. I am running this code:
from pyspark.sql.functions import col, explode

temp_dataframe = df.withColumn("exploded", explode(col("products"))).withColumn("score", col("exploded").getItem("_2"))
It works well with a particular DataFrame. However, I want to put this code into a function and run it on different DataFrames. All of my DataFrames have the same structure. The only difference is that the sub-column "_2" might be named differently in some DataFrames, e.g. "col1" or "col2".
For example:
DataFrame content
root
|-- product_PK: long (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: long (nullable = true)
| | |-- _2: double (nullable = true)
|-- exploded: struct (nullable = true)
| |-- _1: long (nullable = true)
| |-- _2: double (nullable = true)
DataFrame content
root
|-- product_PK: long (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- product_PK: long (nullable = true)
| | |-- col2: integer (nullable = true)
|-- exploded: struct (nullable = true)
| |-- product_PK: long (nullable = true)
| |-- col2: integer (nullable = true)
I tried to use an index like getItem(1), but it says that the name of a column must be provided.
Is there any way to avoid specifying the column name or somehow generalize this part of a code?
My goal is that exploded contains the second value of each pair in the nested data, i.e. _2 or col1 or col2.
It sounds like you were on the right track. I think the way to accomplish this is to read the schema to determine the name of the field you want to explode on. Instead of schema.names though, you need to use schema.fields to find the struct field, and then use its properties to figure out the fields in the struct. Here is an example:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Setup the test dataframe
data = [
(686, [(686, 520.70), (645, 2.)]),
(685, [(685, 45.556), (678, 23.), (655, 21.)]),
(693, [])
]
schema = StructType([
StructField("product_PK", StringType()),
StructField("products",
ArrayType(StructType([
StructField("_1", IntegerType()),
StructField("col2", FloatType())
]))
)
])
df = sqlCtx.createDataFrame(data, schema)
# Find the products field in the schema, then find the name of the 2nd field
productsField = next(f for f in df.schema.fields if f.name == 'products')
target_field = productsField.dataType.elementType.names[1]
# Do your explode using the field name
temp_dataframe = df.withColumn("exploded" , explode(col("products"))).withColumn("score", col("exploded").getItem(target_field))
Now, if you examine the result you get this:
>>> temp_dataframe.printSchema()
root
|-- product_PK: string (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = true)
| | |-- col2: float (nullable = true)
|-- exploded: struct (nullable = true)
| |-- _1: integer (nullable = true)
| |-- col2: float (nullable = true)
|-- score: float (nullable = true)
Is that what you want?
>>> df.show(10, False)
+----------+-----------------------------------------------------------------------+
|product_PK|products |
+----------+-----------------------------------------------------------------------+
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|
|693 |[] |
+----------+-----------------------------------------------------------------------+
>>> import pyspark.sql.functions as F
>>> df.withColumn("exploded", F.explode("products")) \
... .withColumn("exploded", F.col("exploded").getItem(1)) \
... .show(10,False)
+----------+-----------------------------------------------------------------------+--------+
|product_PK|products |exploded|
+----------+-----------------------------------------------------------------------+--------+
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |null |
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |2 |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|null |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|23 |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|21 |
+----------+-----------------------------------------------------------------------+--------+
Given that your exploded column is a struct such as
|-- exploded: struct (nullable = true)
| |-- _1: integer (nullable = true)
| |-- col2: float (nullable = true)
You can use the following logic to get the second element without knowing its name:
from pyspark.sql import functions as F
temp_dataframe = df.withColumn("exploded" , F.explode(F.col("products")))
temp_dataframe.withColumn("score", F.col("exploded."+temp_dataframe.select(F.col("exploded.*")).columns[1]))
You should get output like this:
+----------+--------------------------------------+------------+------+
|product_PK|products |exploded |score |
+----------+--------------------------------------+------------+------+
|686 |[[686,520.7], [645,2.0]] |[686,520.7] |520.7 |
|686 |[[686,520.7], [645,2.0]] |[645,2.0] |2.0 |
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[685,45.556]|45.556|
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[678,23.0] |23.0 |
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[655,21.0] |21.0 |
+----------+--------------------------------------+------------+------+
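A closely related variant (sketch only) pulls the second field name straight from the array element's schema instead of selecting exploded.* first; it assumes products is an array of structs, as in the question:

from pyspark.sql import functions as F

# Second field name of the struct inside the 'products' array, read from the schema.
second_field = df.schema['products'].dataType.elementType.names[1]
result = (df.withColumn('exploded', F.explode('products'))
            .withColumn('score', F.col('exploded').getField(second_field)))
result.show(truncate=False)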