How to compare two schemas in a Databricks notebook in Python

I'm going to ingest data using databricks notebook. I want to validate the schema of the data ingested against what I'm expecting the schema of these data to be.
So basically I have:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

validation_schema = StructType([
    StructField("a", StringType(), True),
    StructField("b", IntegerType(), False),
    StructField("c", StringType(), False),
    StructField("d", StringType(), False)
])
data_ingested_good = [("foo", 1, "blabla", "36636"),
                      ("foo", 2, "booboo", "40288"),
                      ("bar", 3, "fafa", "42114"),
                      ("bar", 4, "jojo", "39192"),
                      ("baz", 5, "jiji", "32432")
                      ]
data_ingested_bad = [("foo", "1", "blabla", "36636"),
                     ("foo", "2", "booboo", "40288"),
                     ("bar", "3", "fafa", "42114"),
                     ("bar", "4", "jojo", "39192"),
                     ("baz", "5", "jiji", "32432")
                     ]
# the raw lists have no printSchema(); wrap them in dataframes first
spark.createDataFrame(data_ingested_good).printSchema()
spark.createDataFrame(data_ingested_bad).printSchema()
# StructType has no printSchema() either; simpleString() shows its layout
print(validation_schema.simpleString())
I've seen similar questions but answers are always in scala.

It really depends on your exact requirements and on the complexity of the schemas you want to compare: for example, ignoring the nullability flag vs. taking it into account, the order of columns, support for maps/structs/arrays, etc. Also, do you want to see the differences, or just a flag indicating whether the schemas match?
In the simplest case, you can just compare the string representations of the schemas:
def compare_schemas(df1, df2):
    return df1.schema.simpleString() == df2.schema.simpleString()
I would personally recommend using an existing library like chispa, which has more advanced schema-comparison functions: you can tune the checks, it will show the differences, etc. After installation (you can just run %pip install chispa), the following will throw an exception if the schemas are different:
from chispa.schema_comparer import assert_schema_equality
assert_schema_equality(df1.schema, df2.schema)

Another method: you can find the difference with a simple Python list comparison.
dept = [("Finance", 10),
        ("Marketing", 20),
        ("Sales", 30),
        ("IT", 40)
        ]
deptColumns = ["dept_name", "dept_id"]
dept1 = [("Finance", 10, '999'),
         ("Marketing", 20, '999'),
         ("Sales", 30, '999'),
         ("IT", 40, '999')
         ]
deptColumns1 = ["dept_name", "dept_id", "extracol"]
deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
dept1DF = spark.createDataFrame(data=dept1, schema=deptColumns1)
deptDF_columns = deptDF.schema.names
dept1DF_columns = dept1DF.schema.names
list_difference = []
for item in dept1DF_columns:
    if item not in deptDF_columns:
        list_difference.append(item)
print(list_difference)
Output:
['extracol']
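The loop above is essentially a set difference. In plain Python (using the column lists from the example), the same result can be sketched with a comprehension, and a symmetric difference also catches columns missing from either side:

```python
deptDF_columns = ["dept_name", "dept_id"]
dept1DF_columns = ["dept_name", "dept_id", "extracol"]

# columns present in dept1DF but not in deptDF (same as the loop)
only_in_dept1 = [c for c in dept1DF_columns if c not in deptDF_columns]
# -> ['extracol']

# columns present in either schema but not both
in_either_not_both = set(deptDF_columns) ^ set(dept1DF_columns)
# -> {'extracol'}
```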

Related

Filter the pyspark dataframe schema to get a new dataframe with columns of a specific type

I want to create a generic function in pyspark that takes a dataframe and a datatype as parameters and filters out the columns that do not satisfy the criteria. I am not very good at Python, and I am stuck at the point where I cannot find how to do that.
I have a scala representation of the code that does the same thing.
// sample data
val df = Seq(("587", "mumbai", Some(5000), 5.05), ("786", "chennai", Some(40000), 7.055), ("432", "Gujarat", Some(20000), 6.75), ("2", "Delhi", None, 10.0)).toDF("Id", "City", "Salary", "Increase").withColumn("RefID", $"Id")

import org.apache.spark.sql.functions.col

def selectByType(colType: DataType, df: DataFrame) = {
  val cols = df.schema.toList
    .filter(x => x.dataType == colType)
    .map(c => col(c.name))
  df.select(cols: _*)
}

val res = selectByType(IntegerType, df)
res is the dataframe that has only the integer columns, in this case the salary column; all the other columns with different types have been dropped dynamically.
I want the same behaviour in pyspark, but I am not able to accomplish that.
This is what I have tried
# sample data
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("raise", DoubleType(), True)
])
data2 = [("James", "", "Smith", "36636", "M", 3000, 2.5),
         ("Michael", "Rose", "", "40288", "M", 4000, 4.7),
         ("Robert", "", "Williams", "42114", "M", 4000, 8.9),
         ("Maria", "Anne", "Jones", "39192", "F", 4000, 0.0),
         ("Jen", "Mary", "Brown", "", "F", -1, -1.2)
         ]
df = spark.createDataFrame(data=data2, schema=schema)
# getting the field list from the schema of the dataframe
pschema = df.schema.fields
datatypes = [IntegerType, DoubleType]  # column datatypes that I want
out = filter(lambda x: x.dataType.isin(datatypes), pschema)  # gives an error
Can someone point out what I am doing wrong? The Scala code only passes a single datatype, but for my use case I want to handle the scenario in which we can pass multiple datatypes and get back a dataframe with the required columns of those specified datatypes.
Initially, if someone can give an idea of how to make it work for a single datatype, I can try to do the same for multiple datatypes.
Note: the sample data for Scala and pyspark are different, as I copied the pyspark sample data from somewhere just to speed things up; I am only concerned with the final output requirement.

How to union Spark SQL Dataframes in Python

Here are several ways of creating a union of dataframes. Which (if any) is best/recommended when we are talking about big dataframes? Should I create an empty dataframe first, or continuously union to the first dataframe created?
Empty Dataframe creation
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("A", StringType(), False),
    StructField("B", StringType(), False),
    StructField("C", StringType(), False)
])
pred_union_df = spark_context.parallelize([]).toDF(schema)
Method 1 - Union as you go:
for ind in indications:
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind)
    pred = get_predictions(fitted_model, pred_output_df, ind)
    pred_union_df = pred_union_df.union(pred[['A', 'B', 'C']])
Method 2 - Union at the end:
all_pred = []
for ind in indications:
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind)
    pred = get_predictions(fitted_model, pred_output_df, ind)
    all_pred.append(pred)
pred_union_df = pred_union_df.union(all_pred)
Or do I have it all wrong?
Edit:
Method 2 was not possible as I thought it would be, based on this answer. I had to loop through the list and union each dataframe.
Method 2 is always preferred, since it avoids the long lineage issue.
Although DataFrame.union only takes one DataFrame as an argument, SparkContext.union does take a list of RDDs. Given your sample code, you could try to union them before calling toDF.
If your data is on disk, you could also try to load them all at once to achieve union, e.g.,
dataframe = spark.read.csv([path1, path2, path3])
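For Method 2 with DataFrames, folding the list with reduce is a common pattern; a minimal sketch (union_all is my own name, not a built-in). Since it only relies on the elements having a pairwise union method, the helper is duck-typed:

```python
from functools import reduce

def union_all(frames):
    # fold the list pairwise: ((a.union(b)).union(c)) ...
    return reduce(lambda a, b: a.union(b), frames)

# with the sample code: pred_union_df = union_all(all_pred)
```

Note this still builds the lineage one union at a time, so for very long loops checkpointing (or the SparkContext.union route above) may still be worthwhile.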

Apache Spark 2.0.0 PySpark manual dataframe creation head scratcher

When I assemble a one row data frame as follows my method successfully brings back the expected data frame.
def build_job_finish_data_frame(sql_context, job_load_id, is_success):
    job_complete_record_schema = StructType(
        [
            StructField("job_load_id", IntegerType(), False),
            StructField("terminate_datetime", TimestampType(), False),
            StructField("was_success", BooleanType(), False)
        ]
    )
    data = [
        Row(
            job_load_id=job_load_id,
            terminate_datetime=datetime.now(),
            was_success=is_success
        )
    ]
    return sql_context.createDataFrame(data, job_complete_record_schema)
If I change the "terminate_datetime" to "end_datetime" or "finish_datetime" as shown below it throws an error.
def build_job_finish_data_frame(sql_context, job_load_id, is_success):
    job_complete_record_schema = StructType(
        [
            StructField("job_load_id", IntegerType(), False),
            StructField("end_datetime", TimestampType(), False),
            StructField("was_success", BooleanType(), False)
        ]
    )
    data = [
        Row(
            job_load_id=job_load_id,
            end_datetime=datetime.now(),
            was_success=is_success
        )
    ]
    return sql_context.createDataFrame(data, job_complete_record_schema)
The error I receive is
TypeError: IntegerType can not accept object datetime.datetime(2016, 10, 5, 11, 19, 31, 915745) in type <class 'datetime.datetime'>
I can change "terminate_datetime" to "start_datetime" and have experimented with other words.
I can see no reason for field name changes breaking this code as it is doing nothing more than building a manual data frame.
This is worrying as I am using data frames to load up a data warehouse where I have no control of the field names.
I am running PySpark on Python 3.3.2 on Fedora 20.
Why does the name change things? The problem is that a Row is a tuple sorted by __fields__. So the first case creates
from pyspark.sql import Row
from datetime import datetime
x = Row(job_load_id=1, terminate_datetime=datetime.now(), was_success=True)
x.__fields__
## ['job_load_id', 'terminate_datetime', 'was_success']
while the second one creates:
y = Row(job_load_id=1, end_datetime=datetime.now(), was_success=True)
y.__fields__
## ['end_datetime', 'job_load_id', 'was_success']
This no longer matches the schema you defined, which expects (IntegerType, TimestampType, BooleanType).
Because Row is useful mostly for schema inference, and you provide a schema directly, you can address that by using a standard tuple:
def build_job_finish_data_frame(sql_context, job_load_id, is_success):
    job_complete_record_schema = StructType(
        [
            StructField("job_load_id", IntegerType(), False),
            StructField("end_datetime", TimestampType(), False),
            StructField("was_success", BooleanType(), False)
        ]
    )
    data = [(job_load_id, datetime.now(), is_success)]
    return sql_context.createDataFrame(data, job_complete_record_schema)
although creating a single element DataFrame looks strange if not pointless.
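One pitfall when switching from Row to tuples: tuple(a, b, c) raises a TypeError, because tuple() accepts at most one (iterable) argument. A plain tuple literal is what matches the schema positionally:

```python
from datetime import datetime

# a tuple literal matches the schema's fields by position
record = (42, datetime.now(), True)

# tuple(42, datetime.now(), True) would raise TypeError;
# the one-argument form only copies an existing iterable:
copied = tuple([42, "x"])
```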

Pyspark: How to transform json strings in a dataframe column

The following is more or less straight Python code which functionally extracts exactly what I want. The data schema for the column I'm filtering on within the dataframe is basically a JSON string.
However, I had to greatly bump up the memory requirement for this and I'm only running on a single node. Using a collect is probably bad and creating all of this on a single node really isn't taking advantage of the distributed nature of Spark.
I'd like a more Spark centric solution. Can anyone help me massage the logic below to better take advantage of Spark? Also, as a learning point: please provide an explanation for why/how the updates make it better.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
from pyspark.sql.types import StructType, StructField, StringType

input_schema = StructType([
    StructField('scrubbed_col_name', StringType(), nullable=True)
])
output_schema = StructType([
    StructField('val01_field_name', StringType(), nullable=True),
    StructField('val02_field_name', StringType(), nullable=True)
])
example_input = [
    '''[{"val01_field_name": "val01_a", "val02_field_name": "val02_a"},
        {"val01_field_name": "val01_a", "val02_field_name": "val02_b"},
        {"val01_field_name": "val01_b", "val02_field_name": "val02_c"}]''',
    '''[{"val01_field_name": "val01_c", "val02_field_name": "val02_a"}]''',
    '''[{"val01_field_name": "val01_a", "val02_field_name": "val02_d"}]''',
]
desired_output = {
    'val01_a': ['val02_a', 'val02_b', 'val02_d'],
    'val01_b': ['val02_c'],
    'val01_c': ['val02_a'],
}
def capture(dataframe):
    """Capture the column from the data frame if it's not empty."""
    # note: 'col != null' never matches in SQL; use 'is not null'
    data = dataframe.filter('scrubbed_col_name is not null')\
        .select('scrubbed_col_name')\
        .rdd\
        .collect()
    # Create a mapping of val1: list(val2)
    mapping = {}
    # For every row in the rdd
    for row in data:
        # For each json_string within the row
        for json_string in row:
            # For each item within the json string
            for val in json.loads(json_string):
                # Extract the data properly
                val01 = val.get('val01_field_name')
                val02 = val.get('val02_field_name')
                if val02 not in mapping.get(val01, []):
                    mapping.setdefault(val01, []).append(val02)
    return mapping
One possible solution:
(df
 .rdd  # Convert to rdd
 .flatMap(lambda x: x)  # Flatten rows
 # Parse JSON. In practice you should add proper exception handling
 .flatMap(lambda x: json.loads(x))
 # Get values
 .map(lambda x: (x.get('val01_field_name'), x.get('val02_field_name')))
 # Convert to final shape
 .groupByKey())
Given the output specification, this operation is not exactly efficient (do you really require the grouped values?), but it is still much better than collect.
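To see what the rdd pipeline computes, the same flatten/pair/group steps can be traced in plain Python on a couple of the sample strings (a sketch using only the stdlib):

```python
import json

example_input = [
    '[{"val01_field_name": "val01_a", "val02_field_name": "val02_a"},'
    ' {"val01_field_name": "val01_b", "val02_field_name": "val02_c"}]',
    '[{"val01_field_name": "val01_a", "val02_field_name": "val02_d"}]',
]

# flatMap(json.loads) then map to (val01, val02) pairs
pairs = [
    (d["val01_field_name"], d["val02_field_name"])
    for s in example_input
    for d in json.loads(s)
]

# groupByKey equivalent
mapping = {}
for k, v in pairs:
    mapping.setdefault(k, []).append(v)
```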

Correctly reading the types from file in PySpark

I have a tab-separated file containing lines as
id1 name1 ['a', 'b'] 3.0 2.0 0.0 1.0
that is, an id, a name, a list with some strings, and a series of 4 float attributes.
I am reading this file as
rdd = sc.textFile('myfile.tsv') \
    .map(lambda row: row.split('\t'))
df = sqlc.createDataFrame(rdd, schema)
where I give the schema as
schema = StructType([
    StructField('id', StringType(), True),
    StructField('name', StringType(), True),
    StructField('list', ArrayType(StringType()), True),
    StructField('att1', FloatType(), True),
    StructField('att2', FloatType(), True),
    StructField('att3', FloatType(), True),
    StructField('att4', FloatType(), True)
])
Problem is, both the list and the attributes do not get properly read, judging from a collect on the DataFrame. In fact, I get None for all of them:
Row(id=u'id1', brand_name=u'name1', list=None, att1=None, att2=None, att3=None, att4=None)
What am I doing wrong?
It is read properly; it just doesn't work as you expect. The schema argument declares what the types are (to avoid expensive schema inference), not how to cast the data. Providing input that matches the declared schema is your responsibility.
This can also be handled by the data source (take a look at spark-csv and its inferSchema option), although that won't handle complex types like arrays.
Since your schema is mostly flat and you know the types you can try something like this:
from pyspark.sql.functions import col

df = rdd.toDF([f.name for f in schema.fields])
exprs = [
    # you should exclude casting
    # on other complex types as well
    col(f.name).cast(f.dataType) if f.dataType.typeName() != "array"
    else col(f.name)
    for f in schema.fields
]
df.select(*exprs)
and handle complex types separately using string processing functions or UDFs. Alternatively, since you read data in Python anyway, just enforce desired types before you create DF.
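For the "enforce desired types before you create the DF" route, ast.literal_eval from the stdlib can parse the bracketed list, and float() handles the attributes. A sketch of converting one split line (the field layout is taken from the question; the helper name is my own):

```python
import ast

def convert(parts):
    # parts: the result of row.split('\t')
    id_, name, lst, *atts = parts
    return (id_, name, ast.literal_eval(lst), *(float(a) for a in atts))

row = convert(["id1", "name1", "['a', 'b']", "3.0", "2.0", "0.0", "1.0"])
# then: df = sqlc.createDataFrame(rdd.map(convert), schema)
```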
