How to define schema for Pyspark createDataFrame(rdd, schema)? - python

I looked at spark-rdd to dataframe.
I read my gzipped JSON into an RDD:
rdd1 = sc.textFile('s3://cw-milenko-tests/Json_gzips/ticr_calculated_2_2020-05-27T11-59-06.json.gz')
I want to convert it to a Spark dataframe. The first method from the linked SO question does not work. This is the first row from the file:
{"code_event": "1092406", "code_event_system": "LOTTO", "company_id": "2", "date_event": "2020-05-27 12:00:00.000", "date_event_real": "0001-01-01 00:00:00.000", "ecode_class": "", "ecode_event": "183", "eperiod_event": "", "etl_date": "2020-05-27", "event_no": 1, "group_no": 0, "name_event": "Ungaria Putto - 8/20", "name_event_short": "Ungaria Putto - 8/20", "odd_coefficient": 1, "odd_coefficient_entry": 1, "odd_coefficient_user": 1, "odd_ekey": "11", "odd_name": "11", "odd_status": "", "odd_type": "11", "odd_voidfactor": 0, "odd_win_types": "", "special_bet_value": "", "ticket_id": "899M-E2X93P", "id_update": 8000001036823656, "topic_group": "cwg5", "kafka_key": "899M-E2X93P", "kafka_epoch": 1590580609424, "kafka_partition": 0, "kafka_topic": "tickets-calculated_2"}
How to infer the schema?
SO answer says
schema = StructType([StructField(str(i), StringType(), True) for i in range(32)])
Why range(32) ?

To answer your question, range(32) just indicates the number of columns to which the StructField class is applied to build the required schema. In your case there are 30 columns.
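For illustration, a minimal sketch of what that generic pattern would produce for your 30 columns: every column is a string and gets its index as its name, which is why explicitly named fields (below) are preferable.
from pyspark.sql.types import StructType, StructField, StringType

generic_schema = StructType([StructField(str(i), StringType(), True) for i in range(30)])
print(generic_schema.fieldNames())  # ['0', '1', ..., '29']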
Based on your data I was able to create a dataframe using the logic below:
from pyspark.sql.functions import *
from pyspark.sql.types import *
data_json = {"code_event": "1092406", "code_event_system": "LOTTO", "company_id": "2", "date_event": "2020-05-27 12:00:00.000",
"date_event_real": "0001-01-01 00:00:00.000", "ecode_class": "", "ecode_event": "183", "eperiod_event": "",
"etl_date": "2020-05-27", "event_no": 1, "group_no": 0, "name_event": "Ungaria Putto - 8/20", "name_event_short": "Ungaria Putto - 8/20",
"odd_coefficient": 1, "odd_coefficient_entry": 1, "odd_coefficient_user": 1, "odd_ekey": "11", "odd_name": "11", "odd_status": "",
"odd_type": "11", "odd_voidfactor": 0, "odd_win_types": "", "special_bet_value": "", "ticket_id": "899M-E2X93P", "id_update": 8000001036823656,
"topic_group": "cwg5", "kafka_key": "899M-E2X93P", "kafka_epoch": 1590580609424, "kafka_partition": 0, "kafka_topic": "tickets-calculated_2"}
column_names = [x for x in data_json.keys()]
row_data = [([x for x in data_json.values()])]
input = []
for i in column_names:
    if str(type(data_json[i])).__contains__('str'):
        input.append(StructField(str(i), StringType(), True))
    elif str(type(data_json[i])).__contains__('int') and len(str(data_json[i])) <= 8:
        input.append(StructField(str(i), IntegerType(), True))
    else:
        input.append(StructField(str(i), LongType(), True))
schema = StructType(input)
data = spark.createDataFrame(row_data, schema)
data.show()
Output
# +----------+-----------------+----------+--------------------+--------------------+-----------+-----------+-------------+----------+--------+--------+--------------------+--------------------+---------------+---------------------+--------------------+--------+--------+----------+--------+--------------+-------------+-----------------+-----------+----------------+-----------+-----------+-------------+---------------+--------------------+
# |code_event|code_event_system|company_id| date_event| date_event_real|ecode_class|ecode_event|eperiod_event| etl_date|event_no|group_no| name_event| name_event_short|odd_coefficient|odd_coefficient_entry|odd_coefficient_user|odd_ekey|odd_name|odd_status|odd_type|odd_voidfactor|odd_win_types|special_bet_value| ticket_id| id_update|topic_group| kafka_key| kafka_epoch|kafka_partition| kafka_topic|
# +----------+-----------------+----------+--------------------+--------------------+-----------+-----------+-------------+----------+--------+--------+--------------------+--------------------+---------------+---------------------+--------------------+--------+--------+----------+--------+--------------+-------------+-----------------+-----------+----------------+-----------+-----------+-------------+---------------+--------------------+
# | 1092406| LOTTO| 2|2020-05-27 12:00:...|0001-01-01 00:00:...| | 183| |2020-05-27| 1| 0|Ungaria Putto - 8/20|Ungaria Putto - 8/20| 1| 1| 1| 11| 11| | 11| 0| | |899M-E2X93P|8000001036823656| cwg5|899M-E2X93P|1590580609424| 0|tickets-calculated_2|
# +----------+-----------------+----------+--------------------+--------------------+-----------+-----------+-------------+----------+--------+--------+--------------------+--------------------+---------------+---------------------+--------------------+--------+--------+----------+--------+--------------+-------------+-----------------+-----------+----------------+-----------+-----------+-------------+---------------+--------------------+
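If you then want to apply this schema to the RDD from the question, one way (a sketch, not part of the answer above; it assumes every line of rdd1 is a complete JSON object with exactly these keys) is to parse each line and hand the resulting tuples to createDataFrame:
import json

# Parse each JSON line, order the values by column_names, then attach the schema built above.
rdd2 = rdd1.map(json.loads).map(lambda d: tuple(d[name] for name in column_names))
df_from_rdd = spark.createDataFrame(rdd2, schema)
df_from_rdd.show()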

range(32) in that example is just an example - it generates a schema with 32 columns, each of them having its index as the name. If you really want to define the schema, then you need to define every column explicitly:
from pyspark.sql.types import *
schema = StructType([
    StructField('code_event', IntegerType(), True),
    StructField('code_event_system', StringType(), True),
    ...
])
But a better way is to avoid the RDD API altogether and read the file directly into a dataframe with the following code (see the documentation):
>>> data = spark.read.json('s3://cw-milenko-tests/Json_gzips/ticr_calculated_2_2020-05-27T11-59-06.json.gz')
>>> data.printSchema()
root
|-- code_event: string (nullable = true)
|-- code_event_system: string (nullable = true)
|-- company_id: string (nullable = true)
|-- date_event: string (nullable = true)
|-- date_event_real: string (nullable = true)
|-- ecode_class: string (nullable = true)
|-- ecode_event: string (nullable = true)
|-- eperiod_event: string (nullable = true)
|-- etl_date: string (nullable = true)
....
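If you do want to enforce types instead of relying on inference, the same kind of explicit StructType can be passed to the reader. A sketch (only a few of the fields are shown, and the chosen types are assumptions):
from pyspark.sql.types import StructType, StructField, StringType, LongType

explicit_schema = StructType([
    StructField('code_event', StringType(), True),
    StructField('code_event_system', StringType(), True),
    StructField('event_no', LongType(), True),
    # ... the remaining fields from the file ...
])

data = spark.read.schema(explicit_schema).json(
    's3://cw-milenko-tests/Json_gzips/ticr_calculated_2_2020-05-27T11-59-06.json.gz')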

Related

How to change data type in pyspark dataframe automatically

I have data from a csv file and use it in a Jupyter notebook with PySpark. I have many columns and all of them have string data type. I know how to change the data type manually, but is there any way to do it automatically?
You can use the inferSchema option when you load your csv file to let Spark try to infer the schema. With the following example csv file, you get two different schemas depending on whether you set inferSchema to true or not:
seq,date
1,13/10/1942
2,12/02/2013
3,01/02/1959
4,06/04/1939
5,23/10/2053
6,13/03/2059
7,10/12/1983
8,28/10/1952
9,07/04/2033
10,29/11/2035
Example code:
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "false")  # default option
      .load(path))
df.printSchema()

df2 = (spark.read
       .format("csv")
       .option("header", "true")
       .option("inferSchema", "true")
       .load(path))
df2.printSchema()
Output:
root
|-- seq: string (nullable = true)
|-- date: string (nullable = true)
root
|-- seq: integer (nullable = true)
|-- date: string (nullable = true)
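Note that the date column stays a string even with inferSchema set to true, because dd/MM/yyyy is not a format Spark recognizes by default. One option (a sketch, not the only way) is to cast it explicitly after reading:
from pyspark.sql import functions as F

# Convert the dd/MM/yyyy strings into a proper DateType column.
df3 = df2.withColumn("date", F.to_date("date", "dd/MM/yyyy"))
df3.printSchema()
# root
#  |-- seq: integer (nullable = true)
#  |-- date: date (nullable = true)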
Alternatively, you can define the schema yourself and pass it explicitly when creating the dataframe:
from pyspark.sql import functions as F
from pyspark.sql.types import *
data2 = [("James","","Smith","36636","M",3000),
("Michael","Rose","","40288","M",4000),
("Robert","","Williams","42114","M",4000),
("Maria","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1)
]
schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("middlename",StringType(),True), \
StructField("lastname",StringType(),True), \
StructField("id", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("salary", IntegerType(), True) \
])
df = spark.createDataFrame(data=data2,schema=schema)
df.show()
df.printSchema()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname| id|gender|salary|
+---------+----------+--------+-----+------+------+
| James| | Smith|36636| M| 3000|
| Michael| Rose| |40288| M| 4000|
| Robert| |Williams|42114| M| 4000|
| Maria| Anne| Jones|39192| F| 4000|
| Jen| Mary| Brown| | F| -1|
+---------+----------+--------+-----+------+------+
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = true)
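If the data lives in a file rather than in memory, the same StructType can be passed to the reader, which skips inference entirely. A sketch, assuming path points at a csv with matching columns:
df_from_file = (spark.read
                .format("csv")
                .option("header", "true")
                .schema(schema)
                .load(path))
df_from_file.printSchema()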

Dataframe schema change based on filtered values while reading JSON

I have a case where I am trying to read a JSON file with the following overall structure.
overall json file schema:
root
|-- event: string (nullable = true)
|-- eventid: string (nullable = true)
|-- property1: struct (nullable = true)
| |-- sub_property1: string (nullable = true)
| |-- sub_property2: string (nullable = true)
|-- property2: struct (nullable = true)
| |-- sub_property1: string (nullable = true)
| |-- sub_property2: string (nullable = true)
| |-- sub_property3: string (nullable = true)
Now, depending on the type of event, the properties might or might not be populated. For event = 'facebook_login' the schema would be:
facebook_login schema:
root
|-- event: string (nullable = true)
|-- eventid: string (nullable = true)
|-- property1: struct (nullable = true)
| |-- sub_property1: string (nullable = true)
|-- property2: struct (nullable = true)
| |-- sub_property1: string (nullable = true)
| |-- sub_property3: string (nullable = true)
and when event = 'google_login' the schema would be
google_login schema:
root
|-- event: string (nullable = true)
|-- eventid: string (nullable = true)
|-- property1: struct (nullable = true)
| |-- sub_property2: string (nullable = true)
|-- property2: struct (nullable = true)
| |-- sub_property2: string (nullable = true)
| |-- sub_property3: string (nullable = true)
The problem I am facing is that when I read this file and try to filter events, both results report the same schema as the overall file schema (with null/missing values for the missing properties, of course):
json_df = spark.read.json(json_file_path)
fb_login_df = json_df.filter("event='facebook_login'")
google_login_df = json_df.filter("event='google_login'")
fb_login_df.printSchema()
google_login_df.printSchema() # same schema output for both
Is there a way to achieve this, i.e. to have different schema structures based on the filtered value?
P.S.: I was thinking of defining custom schemas for each event type, but that would not scale, since there are thousands of different event types in the JSON file.
Give the schema when you read the JSON.
For a try.json which contains this:
[{"event":"a","eventid":"mol","property1":{"sub1":"ex ","sub2":"ni"},"property2":{"sub1":"exe","sub2":"ad","sub3":"qui"}},{"event":"s","eventid":"cul","property1":{"sub1":"et ","sub2":"ame"},"property2":{"sub1":"o","sub2":"q","sub3":"m"}}]
you can do:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
structureSchema1 = StructType([
    StructField('event', StringType(), True),
    StructField('eventid', StringType(), True),
    StructField('property1', StructType([
        StructField('sub1', StringType(), True)
    ])),
    StructField('property2', StructType([
        StructField('sub1', StringType(), True),
        StructField('sub3', StringType(), True)
    ]))])

structureSchema2 = StructType([
    StructField('event', StringType(), True),
    StructField('eventid', StringType(), True),
    StructField('property1', StructType([
        StructField('sub2', StringType(), True)
    ])),
    StructField('property2', StructType([
        StructField('sub2', StringType(), True),
        StructField('sub3', StringType(), True)
    ]))])

df1 = spark.read.schema(structureSchema1).json("./try.json")
df2 = spark.read.schema(structureSchema2).json("./try.json")
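Note that supplying a schema only prunes the columns; it does not filter rows, so you would still filter each dataframe by event afterwards. A sketch using the event values from the question (not the ones in try.json):
fb_login_df = df1.filter("event = 'facebook_login'")
google_login_df = df2.filter("event = 'google_login'")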
I suggest reading the data in as text, with 1 row = 1 event.
Filter the data (Google/Facebook).
Use from_json to create the schema as needed (see the sketch below).
You will have to store the data in its own table, as you can't mix schemas.
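A minimal sketch of that approach for the facebook_login case, assuming the field names from the question (the contains-based filter is a simplification):
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

fb_schema = StructType([
    StructField('event', StringType(), True),
    StructField('eventid', StringType(), True),
    StructField('property1', StructType([
        StructField('sub_property1', StringType(), True)
    ])),
    StructField('property2', StructType([
        StructField('sub_property1', StringType(), True),
        StructField('sub_property3', StringType(), True)
    ]))])

raw = spark.read.text(json_file_path)                # 1 row = 1 raw JSON event
fb_raw = raw.filter(F.col("value").contains('"facebook_login"'))
fb_login_df = fb_raw.select(F.from_json("value", fb_schema).alias("e")).select("e.*")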

How to flatten json file in pyspark

I have a JSON file structure as shown below. The JSON file structure will change every time; how do we handle flattening any kind of JSON file in PySpark? Can you help me with this?
root
|-- student: struct (nullable = true)
|-- children: struct (nullable = true)
|-- parent: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- type: string (nullable = true)
| |-- date: string (nullable = true)
|-- multipliers: array (nullable = true)
| |-- element: double (containsNull = true)
|-- spawn_time: string (nullable = true)
|-- type: array (nullable = true)
| |-- element: string (containsNull = true)
This approach uses a recursive function to determine the columns to select, by building a flat list of fully-named prefixes in the prefix accumulator parameter.
Note that it will work on any format that supports nesting, not just JSON (Parquet, Avro, etc).
Furthermore, the input can have any schema, but this example uses:
{"c1": {"c3": 4, "c4": 12}, "c2": "w1"}
{"c1": {"c3": 5, "c4": 34}, "c2": "w2"}
The original df shows as:
+-------+---+
| c1| c2|
+-------+---+
|[4, 12]| w1|
|[5, 34]| w2|
+-------+---+
The code:
from pyspark.sql.types import StructType
from pyspark.sql.functions import col
# return a list of all (possibly nested) fields to select, within a given schema
def flatten(schema, prefix: str = ""):
    # return a list of sub-items to select, within a given field
    def field_items(field):
        name = f'{prefix}.{field.name}' if prefix else field.name
        if type(field.dataType) == StructType:
            return flatten(field.dataType, name)
        else:
            return [col(name)]
    return [item for field in schema.fields for item in field_items(field)]
df = spark.read.json(path)
print('===== df =====')
df.printSchema()
flattened = flatten(df.schema)
print('flattened =', flatten(df.schema))
print('===== df2 =====')
df2 = df.select(*flattened)
df2.printSchema()
df2.show()
As you will see in the output, the flatten function returns a flat list of columns, each one fully named (using "parent_col.child_col" naming format).
Output:
===== df =====
root
|-- c1: struct (nullable = true)
| |-- c3: long (nullable = true)
| |-- c4: long (nullable = true)
|-- c2: string (nullable = true)
flattened = [Column<b'c1.c3'>, Column<b'c1.c4'>, Column<b'c2'>]
===== df2 =====
root
|-- c3: long (nullable = true)
|-- c4: long (nullable = true)
|-- c2: string (nullable = true)
+---+---+---+
| c3| c4| c2|
+---+---+---+
| 4| 12| w1|
| 5| 34| w2|
+---+---+---+
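If leaf names can collide across parents (for example, two structs that both contain an id field), a small variant of the same function (an addition, not part of the original answer) aliases each column with its full path:
from pyspark.sql.types import StructType
from pyspark.sql.functions import col

def flatten_with_alias(schema, prefix: str = ""):
    def field_items(field):
        name = f'{prefix}.{field.name}' if prefix else field.name
        if type(field.dataType) == StructType:
            return flatten_with_alias(field.dataType, name)
        return [col(name).alias(name.replace('.', '_'))]
    return [item for field in schema.fields for item in field_items(field)]

df.select(*flatten_with_alias(df.schema)).printSchema()
# root
#  |-- c1_c3: long (nullable = true)
#  |-- c1_c4: long (nullable = true)
#  |-- c2: string (nullable = true)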

ValueError: Some of types cannot be determined after inferring (pyspark)

I'm trying to create a dataframe with the following schema:
|-- data: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- keyNote: struct (nullable = true)
| | |-- key: string (nullable = true)
| | |-- note: string (nullable = true)
| |-- details: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
This is the best I managed to do:
schema = StructType([
    StructField("id", LongType(), True),
    StructField("keyNote", StructType([
        StructField("key", StringType(), True),
        StructField("note", StringType(), True)
    ])),
    StructField("details", MapType(StringType(), StringType(), True))
])
df = spark\
    .createDataFrame([("idd",("keyy","notee"),("keyy","valuee")),schema])
But I'm getting an exception:
ValueError: Some of types cannot be determined after inferring
Seems like the schema is correct, but the test data is wrong. Please check the example below:
from pyspark.sql.types import *
schema = StructType([
    StructField("id", LongType(), True),
    StructField("keyNote", StructType([
        StructField("key", StringType(), True),
        StructField("note", StringType(), True)
    ])),
    StructField("details", MapType(StringType(), StringType(), True))
])
test_data = [[9, {"key": "mykey", "note": "mynote"}, {"a": "val_a", "b": "val_b"}]]
df = spark.createDataFrame(test_data, schema=schema)
df.show(20, False)
df.printSchema()
output of above code:
+---+---------------+------------------------+
|id |keyNote |details |
+---+---------------+------------------------+
|9 |[mykey, mynote]|[a -> val_a, b -> val_b]|
+---+---------------+------------------------+
root
|-- id: long (nullable = true)
|-- keyNote: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- note: string (nullable = true)
|-- details: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
You have a syntax error there:
>>> spark.createDataFrame([("idd",("keyy","notee"),("keyy","valuee"))])
DataFrame[_1: string, _2: struct<_1:string,_2:string>, _3: struct<_1:string,_2:string>]
You are not closing a bracket ] properly.
Besides, you cannot give "idd" - a string - if you declare a LongType, and you must not forget about the other elements:
>>> spark.createDataFrame([(123123,[("keyy","notee"),("keyy","valuee")], {})], schema)
DataFrame[id: bigint, keyNote: struct<key:string,note:string>, details: map<string,string>]

How to filter based on array value in PySpark?

My Schema:
|-- Canonical_URL: string (nullable = true)
|-- Certifications: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Certification_Authority: string (nullable = true)
| | |-- End: string (nullable = true)
| | |-- License: string (nullable = true)
| | |-- Start: string (nullable = true)
| | |-- Title: string (nullable = true)
|-- CompanyId: string (nullable = true)
|-- Country: string (nullable = true)
|-- vendorTags: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- score: double (nullable = true)
| | |-- vendor: string (nullable = true)
I tried the below query to select nested fields from vendorTags
df3 = sqlContext.sql("select vendorTags.vendor from globalcontacts")
How can I query the nested fields in a where clause, like below, in PySpark?
df3 = sqlContext.sql("select vendorTags.vendor from globalcontacts where vendorTags.vendor = 'alpha'")
or
df3 = sqlContext.sql("select vendorTags.vendor from globalcontacts where vendorTags.score > 123.123456")
Something like this.
I tried the above queries, only to get the below error:
df3 = sqlContext.sql("select vendorTags.vendor from globalcontacts where vendorTags.vendor = 'alpha'")
16/03/15 13:16:02 INFO ParseDriver: Parsing command: select vendorTags.vendor from globalcontacts where vendorTags.vendor = 'alpha'
16/03/15 13:16:03 INFO ParseDriver: Parse Completed
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 583, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 51, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve '(vendorTags.vendor = cast(alpha as double))' due to data type mismatch: differing types in '(vendorTags.vendor = cast(alpha as double))' (array<string> and double).; line 1 pos 71"
For equality based queries you can use array_contains:
df = sc.parallelize([(1, [1, 2, 3]), (2, [4, 5, 6])]).toDF(["k", "v"])
df.createOrReplaceTempView("df")
# With SQL
sqlContext.sql("SELECT * FROM df WHERE array_contains(v, 1)")
# With DSL
from pyspark.sql.functions import array_contains
df.where(array_contains("v", 1))
If you want to use more complex predicates you'll have to either explode or use a UDF, for example something like this:
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf
def exists(f):
    return udf(lambda xs: any(f(x) for x in xs), BooleanType())
df.where(exists(lambda x: x > 3)("v"))
In Spark 2.4 or later it is also possible to use higher-order functions:
from pyspark.sql.functions import expr
df.where(expr("""aggregate(
    transform(v, x -> x > 3),
    false,
    (x, y) -> x or y
)"""))
or
df.where(expr("""
exists(v, x -> x > 3)
"""))
Python wrappers should be available in 3.1 (SPARK-30681).
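With those wrappers available (PySpark 3.1+), the same predicate can be written without an SQL string; a sketch using the df with column v from above (the import is aliased so it does not shadow the exists helper defined earlier):
from pyspark.sql.functions import exists as array_exists

df.where(array_exists("v", lambda x: x > 3)).show()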
In Spark 2.4 you can also filter array values using the filter function in the SQL API.
https://spark.apache.org/docs/2.4.0/api/sql/index.html#filter
Here's an example in PySpark. In the example we filter out all array values which are empty strings:
from pyspark.sql.functions import expr

df = df.withColumn("ArrayColumn", expr("filter(ArrayColumn, x -> x != '')"))
