Extract Schema from nested Json-String column in Pyspark - python

Assuming I have the following table:
body
{"Day":1,"vals":[{"id":"1", "val":"3"}, {"id":"2", "val":"4"}]}
My goal is to write down the PySpark schema for this nested JSON column. I've tried the following two approaches:
schema = StructType([
    StructField("Day", StringType()),
    StructField(
        "vals",
        StructType([
            StructType([
                StructField("id", StringType(), True),
                StructField("val", DoubleType(), True)
            ]),
            StructType([
                StructField("id", StringType(), True),
                StructField("val", DoubleType(), True)
            ])
        ])
    )
])
Here I get the following error:
'StructType' object has no attribute 'name'
Another approach was to declare the nested Arrays as ArrayType:
schema = StructType([
    StructField("Day", StringType()),
    StructField(
        "vals",
        ArrayType(
            ArrayType(
                StructField("id", StringType(), True),
                StructField("val", DoubleType(), True),
                True
            ),
            ArrayType(
                StructField("id", StringType(), True),
                StructField("val", DoubleType(), True),
                True
            ),
            True
        )
    )
])
Here I get the following error:
takes from 2 to 3 positional arguments but 5 were given
This probably comes from ArrayType only taking a SQL type (plus an optional containsNull flag) as arguments.
Can anybody tell me what their approach would be to create this schema? I'm a super newbie to the whole topic.

This is the structure you are looking for:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

Data = [
    (1, [("1", "3"), ("2", "4")])
]

schema = StructType([
    StructField('Day', IntegerType(), True),
    StructField('vals', ArrayType(StructType([
        StructField('id', StringType(), True),
        StructField('val', StringType(), True)
    ]), True))
])

df = spark.createDataFrame(data=Data, schema=schema)
df.printSchema()
df.show(truncate=False)
This will give you the following output:
root
|-- Day: integer (nullable = true)
|-- vals: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- val: string (nullable = true)
+---+----------------+
|Day|vals |
+---+----------------+
|1 |[{1, 3}, {2, 4}]|
+---+----------------+
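Coming back to the original question, the body column holds the JSON as a string, so the same schema can be applied with from_json. A minimal sketch, assuming the raw dataframe is called df_raw and the column is literally named body:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType

# Schema describing the JSON string held in the "body" column
body_schema = StructType([
    StructField('Day', LongType(), True),
    StructField('vals', ArrayType(StructType([
        StructField('id', StringType(), True),
        StructField('val', StringType(), True)
    ]), True), True)
])

# Parse the string column into a struct, then pick out the nested fields
parsed = df_raw.withColumn("parsed", F.from_json(F.col("body"), body_schema))
parsed.select("parsed.Day", F.explode("parsed.vals").alias("v")) \
      .select("Day", "v.id", "v.val") \
      .show(truncate=False)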

Related

How to change data type in pyspark dataframe automatically

I have data from a csv file and use it in a Jupyter notebook with PySpark. I have many columns and all of them have the string data type. I know how to change a data type manually, but is there any way to do it automatically?
You can use the inferSchema option when you load your csv file, to let spark try to infer the schema. With the following example csv file, you can get two different schemas depending on whether you set inferSchema to true or not:
seq,date
1,13/10/1942
2,12/02/2013
3,01/02/1959
4,06/04/1939
5,23/10/2053
6,13/03/2059
7,10/12/1983
8,28/10/1952
9,07/04/2033
10,29/11/2035
Example code:
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "false")  # default option
      .load(path))
df.printSchema()

df2 = (spark.read
       .format("csv")
       .option("header", "true")
       .option("inferSchema", "true")
       .load(path))
df2.printSchema()
Output:
root
|-- seq: string (nullable = true)
|-- date: string (nullable = true)
root
|-- seq: integer (nullable = true)
|-- date: string (nullable = true)
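Note that even with inferSchema the date column stays a string, because dd/MM/yyyy is not a format Spark infers by default. A small follow-up sketch (using df2 from above) to convert it afterwards:
from pyspark.sql import functions as F

# Convert the string column to a proper date, stating the format explicitly
df3 = df2.withColumn("date", F.to_date("date", "dd/MM/yyyy"))
df3.printSchema()  # date is now of date type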
You would need to define the schema before reading the file:
from pyspark.sql import functions as F
from pyspark.sql.types import *

data2 = [("James", "", "Smith", "36636", "M", 3000),
         ("Michael", "Rose", "", "40288", "M", 4000),
         ("Robert", "", "Williams", "42114", "M", 4000),
         ("Maria", "Anne", "Jones", "39192", "F", 4000),
         ("Jen", "Mary", "Brown", "", "F", -1)
        ]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)
df.show()
df.printSchema()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname| id|gender|salary|
+---------+----------+--------+-----+------+------+
| James| | Smith|36636| M| 3000|
| Michael| Rose| |40288| M| 4000|
| Robert| |Williams|42114| M| 4000|
| Maria| Anne| Jones|39192| F| 4000|
| Jen| Mary| Brown| | F| -1|
+---------+----------+--------+-----+------+------+
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = true)
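The snippet above defines a schema for createDataFrame; to actually apply a schema while reading the CSV file itself, you can pass it to the reader. A minimal sketch, reusing the seq/date sample from earlier (path and the dd/MM/yyyy date format are assumptions):
from pyspark.sql.types import StructType, StructField, IntegerType, DateType

csv_schema = StructType([
    StructField("seq", IntegerType(), True),
    StructField("date", DateType(), True)
])

df_typed = (spark.read
            .format("csv")
            .option("header", "true")
            .option("dateFormat", "dd/MM/yyyy")  # so the date strings parse into DateType
            .schema(csv_schema)
            .load(path))
df_typed.printSchema()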

Dataframe schema change based on filtered values while reading JSON

I have a case where I am trying to read a JSON file consisting of an overall structure:
overall json file schema:
root
|-- event: string (nullable = true)
|-- eventid: string (nullable = true)
|-- property1: struct (nullable = true)
| |-- sub_property1: string (nullable = true)
| |-- sub_property2: string (nullable = true)
|-- property2: struct (nullable = true)
| |-- sub_property1: string (nullable = true)
| |-- sub_property2: string (nullable = true)
| |-- sub_property3: string (nullable = true)
Now depending on the type of event the properties might be populated or not. For event = 'facebook_login' the schema would be
facebook_login schema:
root
|-- event: string (nullable = true)
|-- eventid: string (nullable = true)
|-- property1: struct (nullable = true)
| |-- sub_property1: string (nullable = true)
|-- property2: struct (nullable = true)
| |-- sub_property1: string (nullable = true)
| |-- sub_property3: string (nullable = true)
and when event = 'google_login' the schema would be
google_login schema:
root
|-- event: string (nullable = true)
|-- eventid: string (nullable = true)
|-- property1: struct (nullable = true)
| |-- sub_property2: string (nullable = true)
|-- property2: struct (nullable = true)
| |-- sub_property2: string (nullable = true)
| |-- sub_property3: string (nullable = true)
The problem I am facing is that when I read this file and filter by event, both filtered dataframes keep the same schema as the overall file (of course with null/missing values for the missing properties):
json_df = spark.read.json(json_file_path)
fb_login_df = json_df.filter("event='facebook_login'")
google_login_df = json_df.filter("event='google_login'")
fb_login_df.printSchema()
google_login_df.printSchema() # same schema output for both
Is there a way we can achieve this, i.e. have different schema structures based on the filtered value?
P.S.: I was thinking of defining custom schemas for each event type, but that would not scale since there are thousands of different event types in the JSON file.
Give the schema when you read the JSON.
For a try.json which contains this:
[{"event":"a","eventid":"mol","property1":{"sub1":"ex ","sub2":"ni"},"property2":{"sub1":"exe","sub2":"ad","sub3":"qui"}},{"event":"s","eventid":"cul","property1":{"sub1":"et ","sub2":"ame"},"property2":{"sub1":"o","sub2":"q","sub3":"m"}}]
you can do:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

structureSchema1 = StructType([
    StructField('event', StringType(), True),
    StructField('eventid', StringType(), True),
    StructField('property1', StructType([
        StructField('sub1', StringType(), True)
    ])),
    StructField('property2', StructType([
        StructField('sub1', StringType(), True),
        StructField('sub3', StringType(), True)
    ]))
])

structureSchema2 = StructType([
    StructField('event', StringType(), True),
    StructField('eventid', StringType(), True),
    StructField('property1', StructType([
        StructField('sub2', StringType(), True)
    ])),
    StructField('property2', StructType([
        StructField('sub2', StringType(), True),
        StructField('sub3', StringType(), True)
    ]))
])

df1 = spark.read.schema(structureSchema1).json("./try.json")
df2 = spark.read.schema(structureSchema2).json("./try.json")
I suggest reading the data in as text, 1 row = 1 event.
Filter the data (Google/Facebook).
Use from_json to create the schema as needed.
You will have to store the data in its own table as you can't mix schemas.
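A minimal sketch of that suggestion, assuming a line-delimited file where each line is one event (the file name events.json is an assumption), the two schemas structureSchema1/structureSchema2 defined above, and a simple string match as a crude pre-filter:
from pyspark.sql import functions as F

# Read each line as raw text so nothing is inferred up front
raw = spark.read.text("./events.json")

# Keep only the events of interest, then parse each subset with its own schema
fb_raw = raw.filter(F.col("value").contains('"event":"facebook_login"'))
google_raw = raw.filter(F.col("value").contains('"event":"google_login"'))

fb_df = fb_raw.select(F.from_json("value", structureSchema1).alias("e")).select("e.*")
google_df = google_raw.select(F.from_json("value", structureSchema2).alias("e")).select("e.*")

fb_df.printSchema()
google_df.printSchema()  # different schemas, driven by the schema passed to from_json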

How to define schema for Pyspark createDataFrame(rdd, schema)?

I looked at spark-rdd to dataframe.
I read my gzipped JSON into an RDD:
rdd1 =sc.textFile('s3://cw-milenko-tests/Json_gzips/ticr_calculated_2_2020-05-27T11-59-06.json.gz')
I want to convert it to a Spark dataframe. The first method from the linked SO question does not work. This is the first row from the file:
{"code_event": "1092406", "code_event_system": "LOTTO", "company_id": "2", "date_event": "2020-05-27 12:00:00.000", "date_event_real": "0001-01-01 00:00:00.000", "ecode_class": "", "ecode_event": "183", "eperiod_event": "", "etl_date": "2020-05-27", "event_no": 1, "group_no": 0, "name_event": "Ungaria Putto - 8/20", "name_event_short": "Ungaria Putto - 8/20", "odd_coefficient": 1, "odd_coefficient_entry": 1, "odd_coefficient_user": 1, "odd_ekey": "11", "odd_name": "11", "odd_status": "", "odd_type": "11", "odd_voidfactor": 0, "odd_win_types": "", "special_bet_value": "", "ticket_id": "899M-E2X93P", "id_update": 8000001036823656, "topic_group": "cwg5", "kafka_key": "899M-E2X93P", "kafka_epoch": 1590580609424, "kafka_partition": 0, "kafka_topic": "tickets-calculated_2"}
How to infer the schema?
SO answer says
schema = StructType([StructField(str(i), StringType(), True) for i in range(32)])
Why range(32) ?
To answer your question, range(32) just indicates the number of columns to which the StructField class is applied for the required schema. In your case there are 30 columns.
Based on your data, I was able to create a dataframe using the logic below:
from pyspark.sql.functions import *
from pyspark.sql.types import *
data_json = {"code_event": "1092406", "code_event_system": "LOTTO", "company_id": "2", "date_event": "2020-05-27 12:00:00.000",
             "date_event_real": "0001-01-01 00:00:00.000", "ecode_class": "", "ecode_event": "183", "eperiod_event": "",
             "etl_date": "2020-05-27", "event_no": 1, "group_no": 0, "name_event": "Ungaria Putto - 8/20", "name_event_short": "Ungaria Putto - 8/20",
             "odd_coefficient": 1, "odd_coefficient_entry": 1, "odd_coefficient_user": 1, "odd_ekey": "11", "odd_name": "11", "odd_status": "",
             "odd_type": "11", "odd_voidfactor": 0, "odd_win_types": "", "special_bet_value": "", "ticket_id": "899M-E2X93P", "id_update": 8000001036823656,
             "topic_group": "cwg5", "kafka_key": "899M-E2X93P", "kafka_epoch": 1590580609424, "kafka_partition": 0, "kafka_topic": "tickets-calculated_2"}

column_names = list(data_json.keys())
row_data = [list(data_json.values())]

# Build one StructField per key, picking the Spark type from the Python value
fields = []
for i in column_names:
    if isinstance(data_json[i], str):
        fields.append(StructField(str(i), StringType(), True))
    elif isinstance(data_json[i], int) and len(str(data_json[i])) <= 8:
        fields.append(StructField(str(i), IntegerType(), True))
    else:
        fields.append(StructField(str(i), LongType(), True))

schema = StructType(fields)
data = spark.createDataFrame(row_data, schema)
data.show()
Output
# +----------+-----------------+----------+--------------------+--------------------+-----------+-----------+-------------+----------+--------+--------+--------------------+--------------------+---------------+---------------------+--------------------+--------+--------+----------+--------+--------------+-------------+-----------------+-----------+----------------+-----------+-----------+-------------+---------------+--------------------+
# |code_event|code_event_system|company_id| date_event| date_event_real|ecode_class|ecode_event|eperiod_event| etl_date|event_no|group_no| name_event| name_event_short|odd_coefficient|odd_coefficient_entry|odd_coefficient_user|odd_ekey|odd_name|odd_status|odd_type|odd_voidfactor|odd_win_types|special_bet_value| ticket_id| id_update|topic_group| kafka_key| kafka_epoch|kafka_partition| kafka_topic|
# +----------+-----------------+----------+--------------------+--------------------+-----------+-----------+-------------+----------+--------+--------+--------------------+--------------------+---------------+---------------------+--------------------+--------+--------+----------+--------+--------------+-------------+-----------------+-----------+----------------+-----------+-----------+-------------+---------------+--------------------+
# | 1092406| LOTTO| 2|2020-05-27 12:00:...|0001-01-01 00:00:...| | 183| |2020-05-27| 1| 0|Ungaria Putto - 8/20|Ungaria Putto - 8/20| 1| 1| 1| 11| 11| | 11| 0| | |899M-E2X93P|8000001036823656| cwg5|899M-E2X93P|1590580609424| 0|tickets-calculated_2|
# +----------+-----------------+----------+--------------------+--------------------+-----------+-----------+-------------+----------+--------+--------+--------------------+--------------------+---------------+---------------------+--------------------+--------+--------+----------+--------+--------------+-------------+-----------------+-----------+----------------+-----------+-----------+-------------+---------------+--------------------+
range(32) in that example is just an example: it generates a schema with 32 columns, each named by its index. If you really want to define the schema, then you need to define every column explicitly:
from pyspark.sql.types import *

schema = StructType([
    StructField('code_event', IntegerType(), True),
    StructField('code_event_system', StringType(), True),
    ...
])
But a better way would be to avoid the RDD API and read the file directly into a dataframe with the following code (see documentation):
>>> data = spark.read.json('s3://cw-milenko-tests/Json_gzips/ticr_calculated_2_2020-05-27T11-59-06.json.gz')
>>> data.printSchema()
root
|-- code_event: string (nullable = true)
|-- code_event_system: string (nullable = true)
|-- company_id: string (nullable = true)
|-- date_event: string (nullable = true)
|-- date_event_real: string (nullable = true)
|-- ecode_class: string (nullable = true)
|-- ecode_event: string (nullable = true)
|-- eperiod_event: string (nullable = true)
|-- etl_date: string (nullable = true)
....
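If you want explicit types instead of the all-string inference shown above, the same schema-on-read approach works with the JSON reader directly. A sketch covering a few of the fields; the remaining ones follow the same pattern:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType

ticket_schema = StructType([
    StructField('code_event', StringType(), True),
    StructField('code_event_system', StringType(), True),
    StructField('event_no', IntegerType(), True),
    StructField('id_update', LongType(), True),
    StructField('kafka_epoch', LongType(), True),
    # ... remaining fields follow the same pattern
])

data = spark.read.schema(ticket_schema).json(
    's3://cw-milenko-tests/Json_gzips/ticr_calculated_2_2020-05-27T11-59-06.json.gz')
data.printSchema()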

ValueError: Some of types cannot be determined after inferring (pyspark)

I'm trying to create a dataframe with the following schema:
|-- data: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- keyNote: struct (nullable = true)
| | |-- key: string (nullable = true)
| | |-- note: string (nullable = true)
| |-- details: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
This is the best I managed to do:
schema = StructType([
    StructField("id", LongType(), True),
    StructField("keyNote", StructType([
        StructField("key", StringType(), True),
        StructField("note", StringType(), True)
    ])),
    StructField("details", MapType(StringType(), StringType(), True))
])

df = spark\
    .createDataFrame([("idd", ("keyy", "notee"), ("keyy", "valuee")), schema])
But I'm getting an exception:
ValueError: Some of types cannot be determined after inferring
Seems like the schema is correct, but the test data is wrong. Please check the example below:
from pyspark.sql.types import *

schema = StructType([
    StructField("id", LongType(), True),
    StructField("keyNote", StructType([
        StructField("key", StringType(), True),
        StructField("note", StringType(), True)
    ])),
    StructField("details", MapType(StringType(), StringType(), True))
])

test_data = [[9, {"key": "mykey", "note": "mynote"}, {"a": "val_a", "b": "val_b"}]]
df = spark.createDataFrame(test_data, schema=schema)
df.show(20, False)
df.printSchema()
output of above code:
+---+---------------+------------------------+
|id |keyNote |details |
+---+---------------+------------------------+
|9 |[mykey, mynote]|[a -> val_a, b -> val_b]|
+---+---------------+------------------------+
root
|-- id: long (nullable = true)
|-- keyNote: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- note: string (nullable = true)
|-- details: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
You have a syntax error there:
>>> spark.createDataFrame([("idd",("keyy","notee"),("keyy","valuee"))])
DataFrame[_1: string, _2: struct<_1:string,_2:string>, _3: struct<_1:string,_2:string>]
you are not closing a bracket ] properly.
Besides, you cannot give "idd" (a string) if you declare a LongType, and you must not forget about the other elements:
>>> spark.createDataFrame([(123123,[("keyy","notee"),("keyy","valuee")], {})], schema)
DataFrame[id: bigint, keyNote: struct<key:string,note:string>, details: map<string,string>]
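Note that both answers put id, keyNote and details at the top level; if you need the outer data struct from the target schema in the question, a minimal sketch (same field types, nested one level deeper) would be:
from pyspark.sql.types import StructType, StructField, StringType, LongType, MapType

schema = StructType([
    StructField("data", StructType([
        StructField("id", LongType(), True),
        StructField("keyNote", StructType([
            StructField("key", StringType(), True),
            StructField("note", StringType(), True)
        ]), True),
        StructField("details", MapType(StringType(), StringType(), True), True)
    ]), True)
])

test_data = [((9, ("mykey", "mynote"), {"a": "val_a"}),)]  # one row holding one struct column
df = spark.createDataFrame(test_data, schema)
df.printSchema()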

Add aggregation from different dataframe as column

With this dataset:
start,end,rms,state,maxTemp,minTemp
2019-02-20T16:16:31.752Z,2019-02-20T17:33:34.750Z,4.588481,charge,35.0,32.0
2019-02-20T17:33:34.935Z,2019-02-20T18:34:49.737Z,5.770562,discharge,35.0,33.0
And this:
[{"EventDate":"2019-02-02T16:17:00.579Z","Value":"23"},
{"EventDate":"2019-02-02T16:18:01.579Z","Value":"23"},
{"EventDate":"2019-02-02T16:19:02.581Z","Value":"23"},
{"EventDate":"2019-02-02T16:20:03.679Z","Value":"23"},
{"EventDate":"2019-02-02T16:21:04.684Z","Value":"23"},
{"EventDate":"2019-02-02T17:40:05.693Z","Value":"23"},
{"EventDate":"2019-02-02T17:40:06.694Z","Value":"23"},
{"EventDate":"2019-02-02T17:40:07.698Z","Value":"23"},
{"EventDate":"2019-02-02T17:40:08.835Z","Value":"23"}]
schema = StructType([
    StructField('EventDate', TimestampType(), True),
    StructField('Value', FloatType(), True)
])
I want to add max and min values of the json dataset as columns into the csv dataset.
I have tried:
cyclesWithValues = csvDf\
    .withColumn("max", jsondata.filter((col("EventDate") >= csvDf.start) & (col("EventDate") <= csvDf.end)).agg({"value": "max"}).head()["max(Value)"])\
    .withColumn("min", jsondata.filter((col("EventDate") >= csvDf.start) & (col("EventDate") <= csvDf.end)).agg({"value": "min"}).head()["min(Value)"])
But I get this error:
AnalysisException: 'Resolved attribute(s) start#38271,end#38272 missing from EventDate#38283,Value#38286 in operator !Filter ((EventDate#38283 >= start#38271) && (EventDate#38283 <= end#38272)).;;\n!Filter ((EventDate#38283 >= start#38271) && (EventDate#38283 <= end#38272))\n+- Project [EventDate#38283, cast(Value#38280 as float) AS Value#38286]\n +- Project [to_timestamp(EventDate#38279, None) AS EventDate#38283, Value#38280]\n +- Relation[EventDate#38279,Value#38280] json\n'
I have a solution based on arrays, but it seems very slow, so I was hoping something like this would speed things up a bit.
Right now I am using this solution:
from multiprocessing.pool import ThreadPool

dfTemperature = spark.read.option("multiline", "true").json("path")
dfTemperatureCast = dfTemperature.withColumn("EventDate", to_timestamp(dfTemperature.EventDate)).withColumn("Value", dfTemperature.Value.cast('float'))

def AddVAluesToDf(row):
    temperatures = dfTemperatureCast.filter((col("EventDate") >= row["start"]) & (col("EventDate") <= row["end"]))
    maxTemp = temperatures.agg({"value": "max"}).head()["max(value)"]
    minTemp = temperatures.agg({"value": "min"}).head()["min(value)"]
    return (row.start, row.end, row.rms, row.state, maxTemp, minTemp)

pool = ThreadPool(10)
withValues = pool.map(AddVAluesToDf, rmsDf)

schema = StructType([
    StructField('start', TimestampType(), True),
    StructField('end', TimestampType(), True),
    StructField('maxTemp', FloatType(), True),
    StructField('minTemp', FloatType(), True)
])

cyclesDF = spark.createDataFrame(withValues, schema)
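One way to avoid filtering the temperature dataframe once per row (a sketch that is not from the original thread, reusing csvDf and dfTemperatureCast as defined above) is a range join followed by a grouped aggregation:
from pyspark.sql import functions as F

# Join every temperature reading to the cycle whose [start, end] interval contains it
joined = csvDf.join(
    dfTemperatureCast,
    (dfTemperatureCast.EventDate >= csvDf.start) & (dfTemperatureCast.EventDate <= csvDf.end),
    "left"
)

# Aggregate per cycle; the grouped result replaces the original maxTemp/minTemp columns
cyclesWithValues = (joined
    .groupBy("start", "end", "rms", "state")
    .agg(F.max("Value").alias("maxTemp"),
         F.min("Value").alias("minTemp")))

cyclesWithValues.show(truncate=False)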
