Spark SQL search inside an array for a struct - python

My data structure is defined approximately as follows:
schema = StructType([
    # ... fields skipped
    StructField("extra_features",
                ArrayType(StructType([
                    StructField("key", StringType(), False),
                    StructField("value", StringType(), True)
                ])),
                nullable=False)
])
Now, I'd like to search for entries in a data frame where a struct {"key": "somekey", "value": "somevalue"} exists in the array column. How do I do this?

Spark has a function array_contains that can be used to check the contents of an ArrayType column, but unfortunately it doesn't seem like it can handle arrays of complex types. It is possible to do it with a UDF (User Defined Function) however:
from pyspark.sql.types import *
from pyspark.sql import Row
import pyspark.sql.functions as F

schema = StructType([StructField("extra_features", ArrayType(StructType([
    StructField("key", StringType(), False),
    StructField("value", StringType(), True)])),
    False)])

df = spark.createDataFrame([
    Row([{'key': 'a', 'value': '1'}]),
    Row([{'key': 'b', 'value': '2'}])], schema)

# UDF to check whether {'key': 'a', 'value': '1'} is in an array
# The actual data of a (nested) StructType value is a Row
contains_keyval = F.udf(lambda extra_features: Row(key='a', value='1') in extra_features,
                        BooleanType())

df.where(contains_keyval(df.extra_features)).collect()
This results in:
[Row(extra_features=[Row(key=u'a', value=u'1')])]
You can also use the UDF to add another column that indicates whether the key-value pair is present:
df.withColumn('contains_it', contains_keyval(df.extra_features)).collect()
results in:
[Row(extra_features=[Row(key=u'a', value=u'1')], contains_it=True),
Row(extra_features=[Row(key=u'b', value=u'2')], contains_it=False)]
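If the key/value pair to search for isn't fixed, the same UDF approach can be parameterized with a closure. A small sketch (the make_contains helper is my own name, not from the answer above):
# Hypothetical helper: build the UDF for an arbitrary key/value pair.
def make_contains(key, value):
    return F.udf(lambda features: Row(key=key, value=value) in features,
                 BooleanType())

df.where(make_contains('b', '2')(df.extra_features)).collect()
# roughly: [Row(extra_features=[Row(key='b', value='2')])]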

Since Spark 2.4.0 you can use the higher-order function exists.
Example with SparkSQL:
SELECT exists(
    array(
        named_struct("key", "a", "value", "1"),
        named_struct("key", "b", "value", "2")
    ),
    x -> x = named_struct("key", "a", "value", "1")
)
Example with PySpark:
df.filter('exists(extra_features, x -> x = named_struct("key", "a", "value", "1"))')
Note that not all of the functions for manipulating arrays start with array_*; for example: exists, filter, size, ...
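Since Spark 3.1 the same higher-order function is also exposed directly in the Python API as pyspark.sql.functions.exists, which avoids the string expression. A sketch:
import pyspark.sql.functions as F

# Compare each array element against a struct built from literals.
df.filter(F.exists(
    "extra_features",
    lambda x: x == F.struct(F.lit("a").alias("key"), F.lit("1").alias("value"))
))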

Related

Nested Json Using pyspark

We have to build a nested JSON in PySpark using the structure below, and I have added the data that needs to be fed into it.
Input Data structure
Data
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

a1 = ["DA_STinf", "DA_Stinf_NA", "DA_Stinf_city", "DA_Stinf_NA_ID", "DA_Stinf_NA_ID_GRANT", "DA_country"]
a2 = ["data.studentinfo", "data.studentinfo.name", "data.studentinfo.city", "data.studentinfo.name.id", "data.studentinfo.name.id.grant", "data.country"]
columns = ["data", "action"]
df = spark.createDataFrame(zip(a1, a2), columns)

# Input data for the json structure
a1 = ["Pune"]
a2 = ["YES"]
a3 = ["India"]
col = ["DA_Stinf_city", "DA_Stinf_NA_ID_GRANT", "DA_country"]
data = spark.createDataFrame(zip(a1, a2, a3), col)
Expected result based on the above data:
{
    "data": {
        "studentinfo": {
            "city": "Pune",
            "name": {
                "id": {
                    "grant": "YES"
                }
            }
        },
        "country": "india"
    }
}
We have tried using the F.struct function manually, but we need a dynamic way to build this JSON from the df dataframe, which holds the data and action columns:
data.select(
    F.struct(
        F.struct(
            F.struct(F.col("DA_Stinf_city")).alias("city"),
            F.struct(
                F.struct(F.col("DA_Stinf_NA_ID_GRANT")).alias("id")
            ).alias("name"),
        ).alias("studentinfo"),
        F.struct(F.col("DA_country")).alias("country")
    ).alias("data")
)
The approach below should give the correct structure (with the wrong key names - if you are happy with the approach, which doesn't use DataFrame operations but rather works in the underlying RDD, then I can flesh it out):
def build_json(input, running=None):
    # Use None instead of a mutable default dict, which would be shared
    # between calls when this function runs once per RDD row.
    if running is None:
        running = {}
    new_input = {}
    for hierarchy, value in input:
        key = hierarchy.pop(0)
        if len(hierarchy) == 0:
            running[key] = value  # leaf: store the value
        else:
            new_input[key] = new_input.get(key, []) + [(hierarchy, value)]
    for key in new_input:
        running[key] = build_json(new_input[key])  # recurse into subtrees
    return running

data.rdd.map(
    lambda x: build_json(
        [(column.split("_"), value) for column, value in x.asDict().items()]
    )
)
The basic idea is to get a set of tuples from the underlying RDD consisting of the column name broken into its json hierarchy and the value to insert into the hierarchy. Then the function build_json inserts the value into its correct place in the json hierarchy, while building out the json object recursively.
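To see what build_json produces (and the wrong key names the caveat above refers to), it can be run locally on a plain dict shaped like x.asDict() would be inside the map. A small sketch:
# A made-up row dict mimicking one row of the data dataframe above.
row = {"DA_Stinf_city": "Pune", "DA_Stinf_NA_ID_GRANT": "YES", "DA_country": "India"}
print(build_json([(column.split("_"), value) for column, value in row.items()]))
# {'DA': {'country': 'India', 'Stinf': {'city': 'Pune', 'NA': {'ID': {'GRANT': 'YES'}}}}}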

Json Creation using dataframe

We are using the dataframe below to create a JSON file.
Input file
import pandas as pd
import numpy as np

a1 = ["DA_STinf", "DA_Stinf_NA", "DA_Stinf_city", "DA_Stinf_NA_ID", "DA_Stinf_NA_ID_GRANT", "DA_country"]
a2 = ["data.studentinfo", "data.studentinfo.name", "data.studentinfo.city", "data.studentinfo.name.id", "data.studentinfo.name.id.grant", "data.country"]
a3 = [np.NaN, np.NaN, "StringType", np.NaN, "BoolType", "StringType"]
d1 = pd.DataFrame(list(zip(a1, a2, a3)), columns=['data', 'action', 'datatype'])
We have to build the 2 structures below from the above dataframe in a dynamic way; we have to fit the above data into the following formats.
For the schema, e.g.:
StructType([StructField(column_name, datatype, True)])
For the data, e.g.:
F.struct(F.col(column_name)).alias(json_expected_name)
1) Expected output structure for the schema:
StructType([
    StructField("data", StructType([
        StructField("studentinfo", StructType([
            StructField("city", StringType(), True),
            StructField("name", StructType([
                StructField("id", StructType([
                    StructField("grant", BoolType(), True)
                ]))
            ]))
        ])),
        StructField("country", StringType(), True)
    ]))
])
2) Expected data fetch:
df.select(
    F.struct(
        F.struct(
            F.struct(F.col("DA_Stinf_city")).alias("city"),
            F.struct(
                F.struct(F.col("DA_Stinf_NA_ID_GRANT")).alias("id")
            ).alias("name"),
        ).alias("studentinfo"),
        F.struct(F.col("DA_country")).alias("country")
    ).alias("data")
)
We have to use a for loop to add this kind of entry for (data.studentinfo.name.id), i.e. data -> studentinfo -> name -> id, which I have already added in the expected output structure above.
This is the resulting JSON. Now you need to reassemble it into the new hierarchical JSON structure you desire. The action column carries the hierarchy elements of your tree, and the datatype column carries each element's type. I think you can assume null datatypes are numeric. The name datatype is wrong as null; it should be StringType.
import pandas as pd
import numpy as np
import json

a1 = ["DA_STinf", "DA_Stinf_NA", "DA_Stinf_city", "DA_Stinf_NA_ID", "DA_Stinf_NA_ID_GRANT", "DA_country"]
a2 = ["data.studentinfo", "data.studentinfo.name", "data.studentinfo.city", "data.studentinfo.name.id", "data.studentinfo.name.id.grant", "data.country"]
a3 = ["StructType", "StructType", "StringType", "NumberType", "BoolType", "StringType"]
df = pd.DataFrame(list(zip(a1, a2, a3)), columns=['data', 'action', 'datatype'])
json_tree = df.to_json()
{
    "data": {
        "0": "DA_STinf",
        "1": "DA_Stinf_NA",
        "2": "DA_Stinf_city",
        "3": "DA_Stinf_NA_ID",
        "4": "DA_Stinf_NA_ID_GRANT",
        "5": "DA_country"
    },
    "action": {
        "0": "data.studentinfo",
        "1": "data.studentinfo.name",
        "2": "data.studentinfo.city",
        "3": "data.studentinfo.name.id",
        "4": "data.studentinfo.name.id.grant",
        "5": "data.country"
    },
    "datatype": {
        "0": "StructType",
        "1": "StructType",
        "2": "StringType",
        "3": "NumberType",
        "4": "BoolType",
        "5": "StringType"
    }
}
def convert_action_to_hierarchy(data):
    data = json.loads(data)
    action = data['action']
    datatype_list = data['datatype']
    result = {}
    for i in range(len(action)):
        action_list = action[str(i)].split('.')
        for j in range(len(action_list)):
            datatype = datatype_list[str(j)]
            result[action_list[j]] = (j, datatype)
    return result

print(convert_action_to_hierarchy(json_tree))
output:
{'data': (0, 'StructType'), 'studentinfo': (1, 'StructType'), 'name': (2, 'StringType'), 'city': (2, 'StringType'), 'id': (3, 'NumberType'), 'grant': (4, 'BoolType'), 'country': (1, 'StringType')}
The number is the level in the hierarchy.
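One possible next step (my own sketch, not part of the answer above) is to fold the dotted action paths into a nested dict, from which a StructType could then be assembled recursively:
def paths_to_tree(actions, datatypes):
    """Fold dotted paths like 'data.studentinfo.city' into a nested dict."""
    tree = {}
    for path, dtype in zip(actions, datatypes):
        node = tree
        parts = path.split('.')
        for part in parts[:-1]:
            if not isinstance(node.get(part), dict):
                node[part] = {}  # intermediate nodes become dicts
            node = node[part]
        if not isinstance(node.get(parts[-1]), dict):
            node[parts[-1]] = dtype  # leaves keep their datatype string
    return tree

print(paths_to_tree(a2, a3))
# {'data': {'studentinfo': {'name': {'id': {'grant': 'BoolType'}},
#                           'city': 'StringType'},
#           'country': 'StringType'}}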

Flatten a column value using dataframe

I'm trying to flatten 2 columns from a table loaded into a dataframe, as below:
| u_group | t_group |
| --- | --- |
| {"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"} | {"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"} |
| {"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"} | {"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"} |
I want to separate them and get them as:
| u_group.link | u_group.value | t_group.link | t_group.value |
| --- | --- | --- | --- |
| https://hi.com/api/now/table/system/2696f18b376bca0 | 2696f18b376bca0 | https://hi.com/api/now/table/system/2696f18b376bca0 | 2696f18b376bca0 |
| https://hi.com/api/now/table/system/99b27bc1db761f4 | 99b27bc1db761f4 | https://hi.com/api/now/table/system/99b27bc1db761f4 | 99b27bc1db761f4 |
I used the below code, but wasn't successful.
import ast
from pandas.io.json import json_normalize

df12 = spark.sql("""select u_group, t_group from tbl""")

def only_dict(d):
    '''
    Convert json string representation of dictionary to a python dict
    '''
    return ast.literal_eval(d)

def list_of_dicts(ld):
    '''
    Create a mapping of the tuples formed after
    converting json strings of list to a python list
    '''
    return dict([(list(d.values())[1], list(d.values())[0]) for d in ast.literal_eval(ld)])

A = json_normalize(df12['u_group'].apply(only_dict).tolist()).add_prefix('link.')
B = json_normalize(df['u_group'].apply(list_of_dicts).tolist()).add_prefix('value.')
TypeError: 'Column' object is not callable
Kindly help or suggest if any other code would work better. A simple example with code is needed for the answer.
Example:
import pandas as pd

data = [[{'link': 'A1', 'value': 'B1'}, {'link': 'A2', 'value': 'B2'}],
        [{'link': 'C1', 'value': 'D1'}, {'link': 'C2', 'value': 'D2'}]]
df = pd.DataFrame(data, columns=['u', 't'])
output(df):
u t
0 {'link': 'A1', 'value': 'B1'} {'link': 'A2', 'value': 'B2'}
1 {'link': 'C1', 'value': 'D1'} {'link': 'C2', 'value': 'D2'}
Use the following code:
pd.concat([df[i].apply(lambda x: pd.Series(x)).add_prefix(i + '_') for i in df.columns], axis=1)
output:
u_link u_value t_link t_value
0 A1 B1 A2 B2
1 C1 D1 C2 D2
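For what it's worth, the same result can be had with pd.json_normalize (available since pandas 1.0). A sketch:
# Expand each column of dicts into its own columns, prefixed by the column name.
pd.concat([pd.json_normalize(df[c].tolist()).add_prefix(c + '_')
           for c in df.columns], axis=1)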
Here are my 2 cents: a simple way to achieve this using PySpark.
Create the dataframe as follows:
data = [
    (
        """{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}""",
        """{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}"""
    ),
    (
        """{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}""",
        """{"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"}"""
    )
]
df = spark.createDataFrame(data, schema=['u_group', 't_group'])
Then use from_json() to parse the dictionaries and fetch the individual values as follows:
from pyspark.sql.types import *
from pyspark.sql.functions import *

schema_column = StructType([
    StructField("link", StringType(), True),
    StructField("value", StringType(), True),
])

df = df.withColumn('U_GROUP_PARSE', from_json(col('u_group'), schema_column)) \
       .withColumn('T_GROUP_PARSE', from_json(col('t_group'), schema_column)) \
       .withColumn('U_GROUP.LINK', col("U_GROUP_PARSE.link")) \
       .withColumn('U_GROUP.VALUE', col("U_GROUP_PARSE.value")) \
       .withColumn('T_GROUP.LINK', col("T_GROUP_PARSE.link")) \
       .withColumn('T_GROUP.VALUE', col("T_GROUP_PARSE.value")) \
       .drop('u_group', 't_group', 'U_GROUP_PARSE', 'T_GROUP_PARSE')
Print the dataframe
df.show(truncate=False)
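With the sample data above, the output should look roughly like this (reconstructed by hand, not captured from a run):
| U_GROUP.LINK | U_GROUP.VALUE | T_GROUP.LINK | T_GROUP.VALUE |
| --- | --- | --- | --- |
| https://hi.com/api/now/table/system/2696f18b376bca0 | 2696f18b376bca0 | https://hi.com/api/now/table/system/2696f18b376bca0 | 2696f18b376bca0 |
| https://hi.com/api/now/table/system/2696f18b376bca0 | 2696f18b376bca0 | https://hi.com/api/now/table/system/99b27bc1db761f4 | 99b27bc1db761f4 |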

How to add an entire list contents into a Pyspark Dataframe row?

I am creating a new PySpark dataframe from a list of strings. What should my code look like?
This is my list: ['there', 'is', 'one', 'that', 'commands'], and this is what I want ideally:
words (header)
Row 1: ['there', 'is', 'one', 'that', 'commands']
Row 2: ['test', 'try']
I have tried out the following snippets, but none of them gave me exactly what I wanted.
test_list=['hi','bye','thanks']
test_list=sc.parallelize(test_list)
schema = StructType([StructField("name", StringType(), True)])
df3 = sqlContext.createDataFrame(test_list, schema)
AND
test_list=['hi','bye','thanks']
test_list=sc.parallelize(test_list)
df3 = sqlContext.createDataFrame(row(test_list), schema)
I cannot get the dataframes to show using df.show().
You just need to import the Row object and map each string to a Row; everything else was fine.
from pyspark.sql.types import Row, StructType, StructField, StringType

test_list = ['hi', 'bye', 'thanks']
test_list = sc.parallelize(test_list)
rdd = test_list.map(lambda t: Row(name=t))
schema = StructType([StructField("name", StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)
df.show()
+------+
|  name|
+------+
|    hi|
|   bye|
|thanks|
+------+
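Note that this puts one word per row. If, as the ideal output above shows, each row should instead hold an entire list, an ArrayType column can do that. A sketch:
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Each tuple becomes one row; its single field holds the whole list.
schema = StructType([StructField("words", ArrayType(StringType()), True)])
df = sqlContext.createDataFrame([(['there', 'is', 'one', 'that', 'commands'],),
                                 (['test', 'try'],)], schema)
df.show(truncate=False)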

Splitting dataFrame using spark python

I'm using a dataframe in Spark to split and store data in a tabular format. My data in the file looks as below -
{"click_id": 123, "created_at": "2016-10-03T10:50:33", "product_id": 98373, "product_price": 220.50, "user_id": 1, "ip": "10.10.10.10"}
{"click_id": 124, "created_at": "2017-02-03T10:51:33", "product_id": 97373, "product_price": 320.50, "user_id": 1, "ip": "10.13.10.10"}
{"click_id": 125, "created_at": "2017-10-03T10:52:33", "product_id": 96373, "product_price": 20.50, "user_id": 1, "ip": "192.168.2.1"}
and I've written this code to split the data -
from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as psf

spark = SparkSession \
    .builder \
    .appName("Hello") \
    .config("World") \
    .getOrCreate()

sc = spark.sparkContext
sqlContext = SQLContext(sc)

ratings = spark.createDataFrame(
    sc.textFile("transactions.json").map(lambda l: l.split(',')),
    ["Col1", "Col2", "Col3", "Col4", "Col5", "Col6"]
)

ratings.registerTempTable("ratings")
final_df = sqlContext.sql("select * from ratings")
final_df.show(20, False)
The above code works fine and gives the below output:
As you can see from the output, "click_id" and its number are shown together in a single column, and similarly "created_at" and the timestamp are shown together.
What I actually want is only the values in the table: click_id, created_at, product_id, and so on.
How do I get only those values into my table?
In your map function, parse the JSON object instead of splitting it:
map(lambda l: l.split(','))
should become
map(lambda l: json.loads(l))
(after you have imported json with import json).
Also, if you remove the columns definition
["Col1","Col2","Col3","Col4","Col5","Col6"]
you will get the columns from the JSON.
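Putting both changes together, the creation step might look like this (mapping each line to a Row so the schema is inferred from the JSON keys):
import json
from pyspark.sql import Row

ratings = spark.createDataFrame(
    sc.textFile("transactions.json").map(lambda l: Row(**json.loads(l)))
)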
Assuming you want to use only the DataFrame API, you could use the following code:
ratings = spark.read.json("transactions.json")
This will load the json into a dataframe, mapping the json keys into column names.
Then you can select and rename the columns with the code below.
from pyspark.sql.functions import col

ratings = ratings.select(col('click_id').alias('Col1'),
                         col('created_at').alias('Col2'),
                         col('product_id').alias('Col3'),
                         col('product_price').alias('Col4'),
                         col('user_id').alias('Col5'),
                         col('ip').alias('Col6'))
This way you can also cast columns to the relevant datatypes, e.g. col('product_price').cast('double').alias('Col4'), and save them properly to a database.
