Python: Select from a Nested DataFrame when the Input File Changes Format

df = spark.read.json(['/Users/.../input/json/thisistheinputfile.json'])
df.printSchema()
This results in something like the following:
root
|-- _metadata: struct (nullable = true)
| |-- bundled: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- bundledIds: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- failedInitializations: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- unbundled: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- anonymousId: string (nullable = true)
|-- channel: string (nullable = true)
|-- context: struct (nullable = true)
| |-- campaign: struct (nullable = true)
| | |-- content: string (nullable = true)
| | |-- medium: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- source: string (nullable = true)
| | |-- term: string (nullable = true)
| | |-- utm_campaign: string (nullable = true)
| | |-- utm_medium: string (nullable = true)
| | |-- utm_term: string (nullable = true)
| |-- ip: string (nullable = true)
However, some time later, some input files may not contain all of the content shown above; for instance, the campaign information might be missing:
root
|-- _metadata: struct (nullable = true)
| |-- bundled: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- bundledIds: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- failedInitializations: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- unbundled: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- anonymousId: string (nullable = true)
|-- channel: string (nullable = true)
|-- context: struct (nullable = true)
| |-- ip: string (nullable = true)
I would like to automatically select some of the columns, but I don't want the script to crash when the content is not available. Note that the number of columns to be selected is much larger than in the example below:
from pyspark.sql.functions import expr

df_2 = df.select(
    expr("context.campaign.source").alias("campaign_source"),
    expr("context.campaign.utm_campaign").alias("utm_campaign"),
    'anonymousId')
One case could be that anonymousId, ip and context.campaign.source exist but context.campaign.utm_campaign does not, and with many columns there are a lot of possible combinations.
I tried listing the fields I wanted, checking whether they existed, and then using that list as input to the selection. But I found this difficult since I have a nested dataframe:
lst = ['anonymousId',
       'source',
       'utm_campaign',
       'ip']
col_exists = []
for col in lst:
    if df.schema.simpleString().find(col) >= 0:
        col_exists.append(col)
    else:
        print('Column', col, 'does not exist')
df_2 = df.select(col_exists)  # does of course not work: the nested fields are not top-level columns
Any tips on how to work with this kind of nested dataframe?
Thank you in advance!!

The following steps helped resolve my issue:
from pyspark.sql.types import ArrayType, StructType

def flatten(schema, prefix=None):
    # Recursively collect the dotted path names of all leaf fields in the schema.
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields

def intersection(lst1, lst2):
    # Keep the elements of lst1 that also appear in lst2.
    temp = set(lst2)
    return [value for value in lst1 if value in temp]

fieldsPathName = flatten(df.schema)
df_prepSchema = df.select(fieldsPathName).toDF(*fieldsPathName)
lst1 = ['context.campaign.source',
        'context.campaign.utm_campaign',
        'timestamp',
        'anonymousId']
lst2 = df_prepSchema.columns  # the flattened (dotted) column names
cols = intersection(lst1, lst2)
# Dotted column names must be wrapped in backticks before selecting.
cols = ['`' + col + '`' for col in cols]
df_2 = df_prepSchema.select(cols)
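An alternative worth noting (a minimal sketch of my own, not part of the original answer): probe each nested path individually and keep only the ones Spark can resolve, which avoids flattening the schema first.
from pyspark.sql.functions import col
from pyspark.sql.utils import AnalysisException

def existing_paths(df, paths):
    # Return the subset of (possibly nested) paths that can actually be selected.
    found = []
    for p in paths:
        try:
            df.select(col(p))   # only resolves the plan; no Spark job is triggered
            found.append(p)
        except AnalysisException:
            print('Column', p, 'does not exist')
    return found

wanted = ['context.campaign.source', 'context.campaign.utm_campaign', 'anonymousId']
df_2 = df.select([col(p).alias(p.replace('.', '_')) for p in existing_paths(df, wanted)])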

Related

How do I get column and nested field names in pyspark?

I have a pyspark df.
df.printSchema()
root
|-- bio: string (nullable = true)
|-- city: string (nullable = true)
|-- company: string (nullable = true)
|-- custom_fields: struct (nullable = true)
| |-- nested_field1: string (nullable = true)
|-- email: string (nullable = true)
|-- first_conversion: struct (nullable = true)
| |-- nested_field2: struct (nullable = true)
| | |-- number: string (nullable = true)
| | |-- state: string (nullable = true)
I would like to iterate over the columns and nested fields in order to get their names (just their names). I should be able to print them and get the following result:
bio
city
company
custom_fields
nested_field1
email
first_conversion
nested_field2
number
state
I can easily print the first level with:
for st in df.schema:
    print(st.name)
But how do I check deeper levels recursively at runtime?
dtypes will give you more details of the schema; you will have to parse it yourself, though:
df.printSchema()
root
|-- id: integer (nullable = true)
|-- rec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = true)
| | |-- b: float (nullable = true)
df.dtypes
# [('id', 'int'), ('rec', 'array<struct<a:int,b:float>>')]
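If the goal is just the names, a small recursive walk over df.schema also works (a minimal sketch, not part of the original answer):
from pyspark.sql.types import StructType, ArrayType

def print_field_names(schema, indent=0):
    # Recursively print field names, descending into structs and arrays of structs.
    for field in schema.fields:
        print(' ' * indent + field.name)
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            print_field_names(dtype, indent + 2)

print_field_names(df.schema)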

Spark: How to reuse the same array schema that has all fields defined across the data-frame

I have hundreds of columns a, b, c, ... I would like to modify the dataframe schema so that each array has the same shape: date, num and val fields.
There are thousands of ids, so I would like to modify ONLY the schema, not the dataframe. The modified schema will be used in the next step to load the data into a dataframe efficiently. I would like to avoid using a UDF to modify the whole dataframe.
Input schema:
df.printSchema()
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true) !!! NOTE : `num` !!!
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
Required Output schema:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
To reproduce input Schema:
df = spark.read.json(sc.parallelize([
"""{"id":1,"a":[{"date":2001,"num":1},{"date":2002,},{"date":2003,}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
"""{"id":2,"a":[{"date":2001,"num":2},{"date":2002},{"date":2003}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}}"""
]))
for field in df.schema:
    print(field)
Print output:
StructField(a,ArrayType(StructType(List(StructField(date,LongType,true),StructField(num,LongType,true),StructField(val,LongType,true))),true),true)
StructField(b,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(c,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(d,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(id,LongType,true)
Solution (see the OneCricketeer answer below for details):
from pyspark.sql.types import StructField, StructType, LongType, ArrayType
jsonstr=[
"""{"id":1,"a":[{"date":2001,"val":1,"num":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
"""{"id":2,"a":[{"date":2001,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}}"""
]
array_schema = ArrayType(StructType([
    StructField('date', LongType(), True),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]),
    True)
keys = ['a', 'b', 'c', 'd']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id', LongType(), True))
df_schema = StructType(fields)
dff = spark.read.json(sc.parallelize(jsonstr), df_schema)
I think the true solution is to have consistent names, or at least something more descriptive if the fields are truly different; "num" and "val" are basically synonymous.
If I understand the question, you want to reuse the same array schema that has all fields defined:
array_schema = ArrayType(StructType([
    StructField('date', LongType(), False),
    StructField('num', LongType(), True),
    StructField('val', LongType(), True)]), True)
df_schema = StructType([
    StructField('a', array_schema, True),
    StructField('b', array_schema, True),
    ...
    StructField('id', LongType(), True)
])
Or you can do this in a loop, which is safe because it's applied in the Spark driver
keys = ['a', 'b']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
(change each boolean to a False if there will be no nulls)
Then you need to provide this schema to your read function
spark.read.schema(df_schema).json(...
If there will still be more fields that cannot be consistently applied to all "keys", then use ArrayType(MapType(StringType(), LongType()), False)
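A minimal sketch of that fallback, reusing keys and jsonstr from above (my illustration, not part of the original answer):
from pyspark.sql.types import StructType, StructField, ArrayType, MapType, StringType, LongType

# Each array element becomes a map of field name -> value, so per-record extra fields fit.
map_array_schema = ArrayType(MapType(StringType(), LongType()), False)
df_schema = StructType([StructField(k, map_array_schema, True) for k in keys] +
                       [StructField('id', LongType(), True)])
dff = spark.read.schema(df_schema).json(sc.parallelize(jsonstr))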

Rename nested struct columns to all in lower case in a Spark DataFrame using PySpark

A similar solution is already available in Scala, but I need a solution in PySpark. I am new to Python and need your help with this.
Below is the link to the Scala solution, for a better understanding of the requirement:
Rename nested struct columns in a Spark DataFrame
I am trying to change the names of a DataFrame's columns in Python. I can easily change the column names for top-level fields, but I'm having difficulty converting nested struct columns.
Below is my DataFrame schema.
|-- VkjLmnVop: string (nullable = true)
|-- KaTasLop: string (nullable = true)
|-- AbcDef: struct (nullable = true)
| |-- UvwXyz: struct (nullable = true)
| | |-- MnoPqrstUv: string (nullable = true)
| | |-- ManDevyIxyz: string (nullable = true)
But I need the schema like below
|-- vkjlmnvop: string (nullable = true)
|-- kataslop: string (nullable = true)
|-- abcdef: struct (nullable = true)
| |-- uvwxyz: struct (nullable = true)
| | |-- mnopqrstuv: string (nullable = true)
| | |-- mandevyixyz: string (nullable = true)
How can I change struct column names dynamically?
I have also found a different solution with similar logic and fewer lines of code.
import pyspark.sql.functions as spf
ds = {'AbcDef': {'UvwXyz': {'VkjLmnVop': 'abcd'}}, 'HijKS': 'fgds'}
df = spark.read.json(sc.parallelize([ds]))
df.printSchema()
"""
root
|-- AbcDef: struct (nullable = true)
| |-- UvwXyz: struct (nullable = true)
| | |-- VkjLmnVop: string (nullable = true)
|-- HijKS: string (nullable = true)
"""
for i in df.columns:
    df = df.withColumnRenamed(i, i.lower())
schemaDef = [y.replace("]", "") for y in [x.replace("DataFrame[", "") for x in df.__str__().split(", ")]]
for j in schemaDef:
    columnName = j.split(": ")[0]
    dataType = j.split(": ")[1]
    df = df.withColumn(columnName, spf.col(columnName).cast(dataType.lower()))
df.printSchema()
"""
root
|-- abcdef: struct (nullable = true)
| |-- uvwxyz: struct (nullable = true)
| | |-- vkjlmnvop: string (nullable = true)
|-- hijks: string (nullable = true)
"""
I guess this is what you wanted. Hope it helps!
import pyspark.sql.functions as spf

def get_column_wise_schema(df_string_schema, df_columns):
    # Returns a dictionary containing column name and corresponding column schema as string.
    column_schema_dict = {}
    i = 0
    while i < len(df_columns):
        current_col = df_columns[i]
        next_col = df_columns[i + 1] if i < len(df_columns) - 1 else None
        current_col_split_key = '[' + current_col + ': ' if i == 0 else ' ' + current_col + ': '
        next_col_split_key = ']' if i == len(df_columns) - 1 else ', ' + next_col + ': '
        column_schema_dict[current_col] = df_string_schema.split(current_col_split_key)[1].\
            split(next_col_split_key)[0]
        i += 1
    return column_schema_dict

def convert_colnames_to_lower(spark_df):
    columns = spark_df.columns
    column_wise_schema_dict = get_column_wise_schema(spark_df.__str__(), columns)
    col_exprs = []
    for column_name in columns:
        column_schema_lowercase = column_wise_schema_dict[column_name]
        col_exprs.append(spf.col(column_name).cast(column_schema_lowercase).
                         alias(column_name.lower()))
    return spark_df.select(*col_exprs)
ds = {'AbcDef': {'UvwXyz': {'VkjLmnVop': 'abcd'}}, 'HijKS': 'fgds'}
df = spark.read.json(sc.parallelize([ds]))
df.printSchema()
"""
root
|-- AbcDef: struct (nullable = true)
| |-- UvwXyz: struct (nullable = true)
| | |-- VkjLmnVop: string (nullable = true)
|-- HijKS: string (nullable = true)
"""
converted_df = convert_colnames_to_lower(df)
converted_df.printSchema()
"""
root
|-- abcdef: struct (nullable = true)
| |-- uvwxyz: struct (nullable = true)
| | |-- vkjlmnvop: string (nullable = true)
|-- hijks: string (nullable = true)
"""

How to efficiently process records in rdd and maintain the structure of a record

I have been working with Google Analytics data that I have in S3. I am loading the file as follows:
df = sc.textFile('s3n://BUCKET_NAME/2017/1/2/')
This gives me an RDD. To see the schema, I loaded the data into Spark SQL, and the schema looks like this:
root
|-- channelGrouping: string (nullable = true)
|-- customDimensions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- index: string (nullable = true)
| | |-- value: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- browserSize: string (nullable = true)
| |-- browserVersion: string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |-- isMobile: boolean (nullable = true)
| |-- javaEnabled: boolean (nullable = true)
| |-- language: string (nullable = true)
| |-- mobileDeviceBranding: string (nullable = true)
| |-- mobileDeviceInfo: string (nullable = true)
| |-- mobileDeviceMarketingName: string (nullable = true)
| |-- mobileDeviceModel: string (nullable = true)
| |-- mobileInputSelector: string (nullable = true)
| |-- operatingSystem: string (nullable = true)
| |-- operatingSystemVersion: string (nullable = true)
| |-- screenColors: string (nullable = true)
| |-- screenResolution: string (nullable = true)
|-- fullVisitorId: string (nullable = true)
|-- geoNetwork: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- cityId: string (nullable = true)
| |-- continent: string (nullable = true)
| |-- country: string (nullable = true)
| |-- latitude: string (nullable = true)
| |-- longitude: string (nullable = true)
| |-- metro: string (nullable = true)
| |-- networkDomain: string (nullable = true)
| |-- networkLocation: string (nullable = true)
| |-- region: string (nullable = true)
| |-- subContinent: string (nullable = true)
What I tried:
import json

def remove_null_device(val):
    _ori = json.loads(val)
    # _ori = val
    _dic_val = _ori['device']
    for key, _value in _dic_val.items():
        if _value == "null":
            _dic_val[key] = "Hello There I am Testing this"
    _ori["device"] = _dic_val
    return _ori

device_data = df_rdd.map(remove_null_device)
Problem statement: I want to iterate over every record. Since this is a nested structure, I am thinking of passing one top-level key at a time (such as device or geoNetwork) and checking whether its values are empty or null.
But this seems to change the structure of the whole record, and the items are not getting updated; I don't know why. Please suggest a better approach.
Thanks!
OK, I want to check whether all the fields in device are empty, null or "(not set)", update those values, and return the updated row while keeping the schema intact.
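One possible approach (a minimal sketch of my own, not an answer from the original page; it assumes the data is read with spark.read.json so the nested schema is inferred): rebuild the device struct with DataFrame expressions instead of mapping over the raw RDD, so the schema stays intact.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = spark.read.json('s3n://BUCKET_NAME/2017/1/2/')   # infer the nested schema directly

# Rebuild the `device` struct field by field; only string fields are cleaned,
# other types pass through unchanged, so the struct keeps its shape.
device_fields = df.schema['device'].dataType.fields
cleaned = []
for f in device_fields:
    c = F.col('device.' + f.name)
    if isinstance(f.dataType, StringType):
        c = F.when(c.isNull() | c.isin('', 'null', '(not set)'), F.lit('unknown')).otherwise(c)
    cleaned.append(c.alias(f.name))

df_clean = df.withColumn('device', F.struct(*cleaned))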

Spark - Creating Nested DataFrame

I'm starting with PySpark and I'm having troubles with creating DataFrames with nested objects.
This is my example.
I have users.
$ cat user.json
{"id":1,"name":"UserA"}
{"id":2,"name":"UserB"}
Users have orders.
$ cat order.json
{"id":1,"price":202.30,"userid":1}
{"id":2,"price":343.99,"userid":1}
{"id":3,"price":399.99,"userid":2}
And I would like to join them to get a structure where the orders are nested as an array inside the users.
$ cat join.json
{"id":1, "name":"UserA", "orders":[{"id":1,"price":202.30,"userid":1},{"id":2,"price":343.99,"userid":1}]}
{"id":2,"name":"UserB","orders":[{"id":3,"price":399.99,"userid":2}]}
How can I do that?
Is there any kind of nested join or something similar?
>>> user = sqlContext.read.json("user.json")
>>> user.printSchema();
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
>>> order = sqlContext.read.json("order.json")
>>> order.printSchema();
root
|-- id: long (nullable = true)
|-- price: double (nullable = true)
|-- userid: long (nullable = true)
>>> joined = sqlContext.read.json("join.json")
>>> joined.printSchema();
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
EDIT:
I know it is possible to do this using join and foldByKey, but is there any simpler way?
EDIT2:
I'm using the solution by @zero323.
def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType="left_outer"):
    tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight]))
    tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), tmpTable._2.data.alias(columnNested))
    return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable["joinColumn"], joinType).drop("joinColumn")
I added a second nested structure, 'lines':
>>> lines = sqlContext.read.json(path + "lines.json")
>>> lines.printSchema();
root
|-- id: long (nullable = true)
|-- orderid: long (nullable = true)
|-- product: string (nullable = true)
orders = joinTable(order, lines, "id", "orderid", "lines")
joined = joinTable(user, orders, "id", "userid", "orders")
joined.printSchema()
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
| | |-- lines: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _1: long (nullable = true)
| | | | |-- _2: long (nullable = true)
| | | | |-- _3: string (nullable = true)
After this, the column names from lines are lost.
Any ideas?
EDIT 3:
I tried to manually specify the schema.
from pyspark.sql.types import *
fields = []
fields.append(StructField("_1", LongType(), True))
inner = ArrayType(lines.schema)
fields.append(StructField("_2", inner))
new_schema = StructType(fields)
print(new_schema)
grouped = lines.rdd.groupBy(lambda r: r.orderid)
grouped = grouped.map(lambda x: (x[0], list(x[1])))
g = sqlCtx.createDataFrame(grouped, new_schema)
Error:
TypeError: StructType(List(StructField(id,LongType,true),StructField(orderid,LongType,true),StructField(product,StringType,true))) can not accept object in type <class 'pyspark.sql.types.Row'>
This will work only in Spark 2.0 or later
First we'll need a couple of imports:
from pyspark.sql.functions import struct, collect_list
The rest is a simple aggregation and join:
orders = spark.read.json("/path/to/order.json")
users = spark.read.json("/path/to/user.json")
combined = users.join(
    orders
        .groupBy("userId")
        .agg(collect_list(struct(*orders.columns)).alias("orders"))
        .withColumnRenamed("userId", "id"),
    ["id"])
For the example data the result is:
combined.show(2, False)
+---+-----+---------------------------+
|id |name |orders |
+---+-----+---------------------------+
|1 |UserA|[[1,202.3,1], [2,343.99,1]]|
|2 |UserB|[[3,399.99,2]] |
+---+-----+---------------------------+
with schema:
combined.printSchema()
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
and JSON representation:
for x in combined.toJSON().collect():
    print(x)
{"id":1,"name":"UserA","orders":[{"id":1,"price":202.3,"userid":1},{"id":2,"price":343.99,"userid":1}]}
{"id":2,"name":"UserB","orders":[{"id":3,"price":399.99,"userid":2}]}
First, you need to use the userid as the join key for the second DataFrame:
user.join(order, user.id == order.userid)
Then you can use a map step to transform the resulting records to your desired format.
For flattening your dataframe from nested to normal, use:
dff = df.select("column with multiple columns.*")
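For instance, a minimal sketch with a hypothetical struct column (the names here are assumptions, not from the question):
# `profile` is a hypothetical struct column with fields such as `city` and `zip`
flat = df.select("id", "profile.*")   # each field of `profile` becomes a top-level column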
