ArrayType of mixed data in spark - python

I want to merge two different array lists into one. Each array is a column in a Spark dataframe, so I want to use a udf:
def some_function(u, v):
    li = list()
    for x, y in zip(u, v):
        li.append(x.extend(y))
    return li

udf_object = udf(some_function, ArrayType(ArrayType(StringType())))
new_x = x.withColumn('new_name', udf_object(col('name'), col('features')))
This is the schema of the data:
root
|-- blockingkey: string (nullable = true)
|-- blocked_records: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- flattened_array: array (nullable = true)
| |-- element: string (containsNull = true)
|-- features: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: float (containsNull = true)
|-- name: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
I am trying to merge name and features, so that the first element in name is merged with the first element in features.
But this only returns an array of null values when there are integer or float values. Kindly help me resolve this issue, whether it can be done with a udf or otherwise.

If you have a dataframe and schema as
+------------------------------------------------+----------------------------------------+
|features |name |
+------------------------------------------------+----------------------------------------+
|[WrappedArray(2.0, 3.0), WrappedArray(3.0, 5.0)]|[WrappedArray(a, b), WrappedArray(c, d)]|
|[WrappedArray(2.0, 3.0), WrappedArray(3.0, 5.0)]|[WrappedArray(a, b), WrappedArray(c, d)]|
+------------------------------------------------+----------------------------------------+
root
|-- features: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: double (containsNull = true)
|-- name: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
Then you can define and call the udf function as below. The key change is x + y in place of x.extend(y): list.extend mutates in place and returns None, which is why your udf returned arrays of nulls.
import pyspark.sql.types as t
from pyspark.sql import functions as f

def some_function(u, v):
    li = []
    for x, y in zip(u, v):
        li.append(x + y)
    return li

udf_object = f.udf(some_function, t.ArrayType(t.ArrayType(t.StringType())))
new_x = x.withColumn('new_name', udf_object(f.col('name'), f.col('features')))
so new_x would be
+------------------------------------------------+----------------------------------------+------------------------------------------------------------+
|features |name |new_name |
+------------------------------------------------+----------------------------------------+------------------------------------------------------------+
|[WrappedArray(2.0, 3.0), WrappedArray(3.0, 5.0)]|[WrappedArray(a, b), WrappedArray(c, d)]|[WrappedArray(a, b, 2.0, 3.0), WrappedArray(c, d, 3.0, 5.0)]|
|[WrappedArray(2.0, 3.0), WrappedArray(3.0, 5.0)]|[WrappedArray(a, b), WrappedArray(c, d)]|[WrappedArray(a, b, 2.0, 3.0), WrappedArray(c, d, 3.0, 5.0)]|
+------------------------------------------------+----------------------------------------+------------------------------------------------------------+
root
|-- features: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: double (containsNull = true)
|-- name: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- new_name: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
I hope the answer is helpful
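Note that the float values come back as strings because the declared return type is ArrayType(ArrayType(StringType())). If you are on Spark 3.0+ and would rather avoid a udf, a minimal sketch using the built-in zip_with would be:

from pyspark.sql import functions as f

# Sketch, assuming Spark 3.0+: zip the outer arrays positionally and
# concatenate each pair of inner arrays, casting the floats to strings.
new_x = x.withColumn(
    'new_name',
    f.expr("zip_with(name, features, (n, v) -> concat(n, cast(v as array<string>)))")
)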

Related

How to iterate nested array in pyspark schema?

Currently my schema is:
root
|-- C_0_0: double (nullable = true)
|-- C_0_1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: **double** (containsNull = true)
|-- C_0_2: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: double (containsNull = true)
I want to change it to :
root
|-- C_0_0: double (nullable = true)
|-- C_0_1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: **decimal(8,6)** (containsNull = true)
|-- C_0_2: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: double (containsNull = true)
How can I iterate a nested array since no field name is present for child of an array?
You don't need to iterate, just use cast.
This would work:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DecimalType
df=df.withColumn("ArrayOfDoub", F.col("C_0_1").cast(ArrayType(ArrayType(DecimalType(8,6)))))
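Assuming the cast succeeds, printing the schema of the new column confirms the element type (a quick check, not from the original post):

df.select("ArrayOfDoub").printSchema()
# root
#  |-- ArrayOfDoub: array (nullable = true)
#  |    |-- element: array (containsNull = true)
#  |    |    |-- element: decimal(8,6) (containsNull = true)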

Python, Select from Nested Dataframe when Input File Changes Format

df = spark.read.json(['/Users/.../input/json/thisistheinputfile.json'])
df.printSchema()
Results in something like below:
root
|-- _metadata: struct (nullable = true)
| |-- bundled: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- bundledIds: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- failedInitializations: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- unbundled: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- anonymousId: string (nullable = true)
|-- channel: string (nullable = true)
|-- context: struct (nullable = true)
| |-- campaign: struct (nullable = true)
| | |-- content: string (nullable = true)
| | |-- medium: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- source: string (nullable = true)
| | |-- term: string (nullable = true)
| | |-- utm_campaign: string (nullable = true)
| | |-- utm_medium: string (nullable = true)
| | |-- utm_term: string (nullable = true)
| |-- ip: string (nullable = true)
However, some time later, in some cases the input file does not contain some of the content present above; for instance, the campaign information may not be available:
root
|-- _metadata: struct (nullable = true)
| |-- bundled: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- bundledIds: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- failedInitializations: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- unbundled: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- anonymousId: string (nullable = true)
|-- channel: string (nullable = true)
|-- context: struct (nullable = true)
| |-- ip: string (nullable = true)
I would like to be able to select some of the columns automatically, but I don't want the script to crash when the content is not available. Note that the number of columns to be selected is far larger than in the example below:
df_2 = df\
    .select(expr("context.campaign.source").alias("campaign_source"),
            expr("context.campaign.utm_campaign").alias("utm_campaign"),
            'anonymousId')
One case could be that anonymousId, ip and context.campaign.source exist, but not context.campaign.utm_campaign, and so on for all the possible combinations (there can be many with this many columns).
I tried listing the fields I wanted, checking whether they existed, and then using that list as input to the dataframe selection. But I found this difficult since I have a nested dataframe:
lst = ['anonymousId',
       'source',
       'utm_campaign',
       'ip']

col_exists = []
for col in lst:
    if df_seglog_prod.schema.simpleString().find(col) > 0:
        col_exists.append(col)
    else:
        print('Column', col, 'does not exist')

df_2 = df.select(col_exists)  # does ofc not work...
Any tips on how to work with this kind of nested dataframe?
Thank you in advance!!
The following steps helped resolve my issue:
def flatten(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields

def intersection(lst1, lst2):
    # Use of hybrid method
    temp = set(lst2)
    lst3 = [value for value in lst1 if value in temp]
    return lst3
fieldsPathName = flatten(df.schema)
df_prepSchema = df.select(fieldsPathName).toDF(*fieldsPathName)

lst1 = ['context.campaign.source',
        'context.campaign.utm_campaign',
        'timestamp',
        'anonymousId']
lst2 = df_prepSchema.columns  # the flattened frame's columns are the dotted path names

cols = intersection(lst1, lst2)
append_str = '`'
cols = [append_str + col for col in cols]
cols = [col + append_str for col in cols]
df_2 = df_prepSchema.select(cols)
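An alternative sketch that avoids flattening altogether: probe each candidate path individually and keep only the ones Spark can resolve. The has_column helper below is illustrative, not part of the original answer, and assumes AnalysisException is importable from pyspark.sql.utils.

from pyspark.sql.utils import AnalysisException

def has_column(frame, col_path):
    # Returns True if the (possibly nested) column path can be selected.
    try:
        frame.select(col_path)
        return True
    except AnalysisException:
        return False

wanted = ['anonymousId', 'context.ip',
          'context.campaign.source', 'context.campaign.utm_campaign']
existing = [c for c in wanted if has_column(df, c)]
df_2 = df.select(existing)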

Spark : How to reuse the same array schema that has all fields defined across the data-frame

I have hundreds of columns a, b, c, .... I would like to modify the dataframe schema so that each array has the same shape: date, num and val fields.
There are thousands of ids, so I would like to modify ONLY the schema, not the dataframe. The modified schema will be used in the next step to load the data into a dataframe efficiently. I would like to avoid using a UDF to modify the whole dataframe.
Input schema:
df.printSchema()
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true) !!! NOTE : `num` !!!
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
Required Output schema:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
To reproduce the input schema:
df = spark.read.json(sc.parallelize([
    """{"id":1,"a":[{"date":2001,"num":1},{"date":2002},{"date":2003}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
    """{"id":2,"a":[{"date":2001,"num":2},{"date":2002},{"date":2003}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}"""
]))

for field in df.schema:
    print(field)
Print output:
StructField(a,ArrayType(StructType(List(StructField(date,LongType,true),StructField(num,LongType,true),StructField(val,LongType,true))),true),true)
StructField(b,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(c,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(d,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(id,LongType,true)
Solution (see OneCricketeer's answer below for details):
from pyspark.sql.types import StructField, StructType, LongType, ArrayType

jsonstr = [
    """{"id":1,"a":[{"date":2001,"val":1,"num":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
    """{"id":2,"a":[{"date":2001,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}"""
]

array_schema = ArrayType(StructType([
    StructField('date', LongType(), True),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]),
    True)

keys = ['a', 'b', 'c', 'd']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id', LongType(), True))
df_schema = StructType(fields)

dff = spark.read.json(sc.parallelize(jsonstr), df_schema)
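For reference, printing the resulting schema should match the required output above (a quick check, not from the original post):

dff.printSchema()
# root
#  |-- a: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- date: long (nullable = true)
#  |    |    |-- num: long (nullable = true)
#  |    |    |-- val: long (nullable = true)
#  ... (b, c and d are identical to a)
#  |-- id: long (nullable = true)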
I think the true solution is to have consistent names, or at least something more descriptive if the fields are truly different; "num" and "val" are basically synonymous.
If I understand the question, you want to reuse the same array schema that has all fields defined:
array_schema = ArrayType(StructType([StructField('date', LongType(), False), StructField('num', LongType(), True), StructField('val', LongType(), True)]), True)
df_schema = StructType([
    StructField('a', array_schema, True),
    StructField('b', array_schema, True),
    ...
    StructField('id', LongType(), True)
])
Or you can do this in a loop, which is safe because it's applied in the Spark driver
keys = ['a', 'b']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
(change each boolean to a False if there will be no nulls)
Then you need to provide this schema to your read function
spark.read.schema(df_schema).json(...
If there will still be more fields that cannot be consistently applied to all "keys", then use ArrayType(MapType(StringType(), LongType()), False)
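A minimal sketch of that fallback, assuming string keys and long values as suggested:

from pyspark.sql.types import ArrayType, MapType, StringType, LongType, StructType, StructField

# Each array element becomes a string -> long map instead of a fixed struct,
# so extra per-key fields no longer break the shared element schema.
map_schema = ArrayType(MapType(StringType(), LongType()), False)
fields = [StructField(k, map_schema, True) for k in ['a', 'b', 'c', 'd']]
fields.append(StructField('id', LongType(), True))
df_schema_maps = StructType(fields)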

Pyspark: explode columns to new dataframe

I have a pyspark dataframe with this schema:
|-- doc_id: string (nullable = true)
|-- msp_contracts: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _el1: string (nullable = true)
| | |-- _el2: long (nullable = true)
| | |-- _el3: string (nullable = true)
| | |-- _el4: string (nullable = true)
| | |-- _el5: string (nullable = true)
How do I get this data frame:
|-- doc_id: string (nullable = true)
|-- _el1: string (nullable = true)
|-- _el3: string (nullable = true)
|-- _el4: string (nullable = true)
|-- _el5: string (nullable = true)
I tried this in a select:
explode('msp_contracts').select(
    col(u'msp_contracts.element._el1'),
    col(u'msp_contracts.element._el2')
)
but I get the error:
'Column' object is not callable
After explode('msp_contracts'), Spark adds a column named col as the result of the explode (if an alias is not provided).
df.select("doc_id",explode("msp_contracts")).show()
#+------+---+
#|doc_id|col|
#+------+---+
#| 1|[1]|
#+------+---+
Use col to select _el1. Try with df_1.select("doc_id", explode("msp_contracts")).select("doc_id", col(u"col._el1")).show()
Example:
jsn='{"doc_id":1,"msp_contracts":[{"_el1":1}]}'
df=spark.read.json(sc.parallelize([(jsn)]))
#schema
#root
# |-- doc_id: long (nullable = true)
# |-- msp_contracts: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- _el1: long (nullable = true)
df.withColumn("msp_contracts",explode(col("msp_contracts"))).\
select("doc_id","msp_contracts._el1").show()
#+------+----+
#|doc_id|_el1|
#+------+----+
#| 1| 1|
#+------+----+
UPDATE:
df.select("doc_id",explode("msp_contracts")).\
select("doc_id","col._el1").\
show()
#or
df.select("doc_id",explode("msp_contracts")).\
select("doc_id",col(u"col._el1")).\
show()
#+------+----+
#|doc_id|_el1|
#+------+----+
#| 1| 1|
#+------+----+
Works for me:
df.select("doc_id",explode("msp_contracts")).\
select("doc_id","col._el1")
With aliases and custom columns:
df.select(
'doc_id',
explode('msp_contracts').alias("msp_contracts")
)\
.select(
'doc_id',
col('msp_contracts.el_1').alias('last_period_44fz_customer'),
col('msp_contracts.el_2').alias('last_period_44fz_customer_inn')
)\
.withColumn("load_dtm", now_f())

Reading and accessing nested fields in json files using spark

I have multiple json files I wish to use to create a Spark dataframe. In testing with a subset, when I load the files I get rows of the json information itself instead of parsed json information. I am doing the following:
df = spark.read.json('gutenberg/test')
df.show()
+--------------------+--------------------+--------------------+
| 1| 10| 5|
+--------------------+--------------------+--------------------+
| null|[WrappedArray(),W...| null|
| null| null|[WrappedArray(Uni...|
|[WrappedArray(Jef...| null| null|
+--------------------+--------------------+--------------------+
When I check the schema of the dataframe, It appears to be there, but am having trouble accessing it:
df.printSchema()
root
|-- 1: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
|-- 10: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
|-- 5: struct (nullable = true)
| |-- author: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- formaturi: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- language: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- rights: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- txt: string (nullable = true)
I keep getting errors when trying to access the information, so any help would be great.
Specifically, I am looking to create a new dataframe where the columns are ('author', 'formaturi', 'language', 'rights', 'subject', 'title', 'txt')
I am using pyspark 2.2
Since I do not know exactly what the json files look like, assuming they are newline-delimited json, this should work.
def _construct_key(previous_key, separator, new_key):
    if previous_key:
        return "{}{}{}".format(previous_key, separator, new_key)
    else:
        return new_key

def flatten(nested_dict, separator="_", root_keys_to_ignore=set()):
    assert isinstance(nested_dict, dict)
    assert isinstance(separator, str)
    flattened_dict = dict()

    def _flatten(object_, key):
        if isinstance(object_, dict):
            for object_key in object_:
                if not (not key and object_key in root_keys_to_ignore):
                    _flatten(object_[object_key],
                             _construct_key(key, separator, object_key))
        elif isinstance(object_, list) or isinstance(object_, set):
            for index, item in enumerate(object_):
                _flatten(item, _construct_key(key, separator, index))
        else:
            flattened_dict[key] = object_

    _flatten(nested_dict, None)
    return flattened_dict

def flatten_row(row):
    # convert each Row to a nested dict, then flatten it
    return flatten(row.asDict(True))

df = spark.read.json('gutenberg/test',
                     primitivesAsString=True,
                     allowComments=True,
                     allowUnquotedFieldNames=True,
                     allowNumericLeadingZero=True,
                     allowBackslashEscapingAnyCharacter=True,
                     mode='DROPMALFORMED')\
    .rdd.map(flatten_row).toDF()
df.show()
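If the goal is just the columns author, formaturi, language, rights, subject, title and txt, another sketch (assuming the top-level columns really are the per-book structs named '1', '10', '5' as in the schema above) is to stack those structs into rows and expand their fields:

from pyspark.sql import functions as F

# Stack the per-book structs into one column, then expand the struct fields.
books = df.select(F.explode(F.array(df["1"], df["10"], df["5"])).alias("book")) \
          .select("book.*")
books.printSchema()  # author, formaturi, language, rights, subject, title, txt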
