I have a data frame (df1).
To show its schema I use:
from pyspark.sql.functions import *
df1.printSchema()
And I get the following result:
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
Sometimes the schema changes (the column type or name):
df2.printSchema()
#root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)
I would like to compare the two schemas (df1 and df2) and get only the differences in types and column names (a column can also move to another position).
The results should be a table (or data frame) something like this:
column   df1      df2       diff
name     string   array     type
gender   N/A      integer   new column
(The age column is the same, so it does not appear in the diff. If a column is omitted from one of the data frames, the diff column should say 'omitted'.)
How can I do it efficiently if I have many columns in each?
Without any external library, we can find the schema difference using:
from pyspark.sql.session import SparkSession
from pyspark.sql import DataFrame

def schema_diff(spark: SparkSession, df_1: DataFrame, df_2: DataFrame):
    # turn each schema into a small DataFrame of (column name, column type)
    s1 = spark.createDataFrame(df_1.dtypes, ["d1_name", "d1_type"])
    s2 = spark.createDataFrame(df_2.dtypes, ["d2_name", "d2_type"])
    difference = (
        s1.join(s2, s1.d1_name == s2.d2_name, how="outer")
        # keep rows where a column is missing on one side or its type differs
        .where(
            s1.d1_type.isNull()
            | s2.d2_type.isNull()
            | (s1.d1_type != s2.d2_type)
        )
        .select(s1.d1_name, s1.d1_type, s2.d2_name, s2.d2_type)
        .fillna("")
    )
    return difference
fillna is optional; I prefer to view missing values as empty strings.
In the where clause we also check the type, so a column that exists in both data frames but with a different type is reported as well.
This will also show all columns that are in the second data frame but not in the first (and vice versa).
Usage:
diff = schema_diff(spark, df_1, df_2)
diff.show(diff.count(), truncate=False)
You can try creating two pandas dataframes with the metadata from both df1 and df2, like below:
import pandas as pd

pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'data_type'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'data_type'])
and then join those two pandas dataframes with an 'outer' join.
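For completeness, a minimal sketch of that outer join; the type columns are renamed d1_type/d2_type here so they don't collide after the merge, and indicator=True is pandas' built-in way to mark which side each row came from:
import pandas as pd

pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'd1_type'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'd2_type'])

# outer merge keeps columns that exist on only one side
diff = pd_df1.merge(pd_df2, on='column', how='outer', indicator=True)

# keep rows where the column is missing on one side or its type changed
diff = diff[(diff['_merge'] != 'both') | (diff['d1_type'] != diff['d2_type'])]
print(diff)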
A custom function that could be useful for someone:
def SchemaDiff(DF1, DF2):
    # Getting schema for both dataframes in a dictionary
    DF1Schema = {x[0]: x[1] for x in DF1.dtypes}
    DF2Schema = {x[0]: x[1] for x in DF2.dtypes}

    # Columns present in DF1 but not in DF2
    DF1MinusDF2 = dict.fromkeys((set(DF1.columns) - set(DF2.columns)), '')
    for column_name in DF1MinusDF2:
        DF1MinusDF2[column_name] = DF1Schema[column_name]

    # Columns present in DF2 but not in DF1
    DF2MinusDF1 = dict.fromkeys((set(DF2.columns) - set(DF1.columns)), '')
    for column_name in DF2MinusDF1:
        DF2MinusDF1[column_name] = DF2Schema[column_name]

    # Find data types changed in DF1 as compared to DF2
    UpdatedDF1Schema = {k: v for k, v in DF1Schema.items() if k not in DF1MinusDF2}
    UpdatedDF1Schema = {**UpdatedDF1Schema, **DF2MinusDF1}
    DF1DataTypesChanged = {}
    for column_name in UpdatedDF1Schema:
        if UpdatedDF1Schema[column_name] != DF2Schema[column_name]:
            DF1DataTypesChanged[column_name] = DF2Schema[column_name]

    return DF1MinusDF2, DF2MinusDF1, DF1DataTypesChanged
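A quick usage sketch (df1 and df2 stand in for your own DataFrames):
only_in_df1, only_in_df2, changed_types = SchemaDiff(df1, df2)
print("Only in df1:", only_in_df1)
print("Only in df2:", only_in_df2)
print("Type changes (new type taken from df2):", changed_types)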
You can simply compare the schema objects:
df1.schema == df2.schema
(Note that df1.printSchema() == df2.printSchema() would not work: printSchema() only prints and returns None, so that comparison is always True.)
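If you just want a quick symmetric diff of (name, type) pairs rather than a full report, a one-line sketch:
set(df1.dtypes) ^ set(df2.dtypes)  # pairs present in one schema but not in the other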
Related
I have two dataframes which I wish to join and then save as a parquet table. After performing the join my resulting table has duplicate columns, preventing me from saving the dataset.
Here is my code for the join
join_conditions = [
    df1.colX == df2.colY,
    df1.col1 == df2.col1,
    df1.col2 == df2.col2,
    df1.col3 == df2.col3,
]

dfj = df1.alias("1").join(
    F.broadcast(df2.alias("2")), join_conditions, "inner"
).drop("1.col1", "1.col2", "1.col3")

dfj.write.format("parquet").mode("overwrite").saveAsTable("table")
I expected that the drop would remove the duplicate columns, but an exception is thrown saying they are still there when I try to save the table. drop() doesn't throw an exception if the columns don't exist, which means that the alias is probably wrong / not working as I expect?
I cannot do the join conditions as a list of strings as this seems to cause an error when not all columns in the join condition are called the same on each DataFrame:
join_conditions = [
    df1.colX == df2.colY,
    "col1",
    "col2",
    "col3",
]
doesn't work for example.
This join works but still results in the duplicate columns
join_conditions = [
    df1.colX == df2.colY,
    F.col("1.col1") == F.col("2.col1"),
    F.col("1.col2") == F.col("2.col2"),
    F.col("1.col3") == F.col("2.col3"),
]
also didn't work. All of these methods still result in the joined dataframe having the duplicate columns col1, col2 and col3. What am I doing wrong / not understanding correctly? Answers with pyspark sample code would be appreciated.
I'm not sure why it doesn't work; it's really weird.
This isn't so pretty, but it works:
from pyspark.sql import functions as F

data = [{'colX': "hello", 'col1': 1, 'col2': 2, 'col3': 3}]
data2 = [{'colY': "hello", 'col1': 1, 'col2': 2, 'col3': 3}]

df1 = spark.createDataFrame(data)
df2 = spark.createDataFrame(data2)

join_cond = [
    df1.colX == df2.colY,
    df1.col1 == df2.col1,
    df1.col2 == df2.col2,
    df1.col3 == df2.col3,
]

(df1.join(F.broadcast(df2), join_cond, 'inner')
    .drop(df1.col1).drop(df1.col2).drop(df1.col3)
    .printSchema())
root
|-- colX: string (nullable = true)
|-- col1: long (nullable = true)
|-- col2: long (nullable = true)
|-- col3: long (nullable = true)
|-- colY: string (nullable = true)
I have an empty dataframe in pyspark that I want to use to append machine learning results coming from model.transform(test_data) in pyspark - but when I try a union function to join the dataframes I get a "column types must match" error.
This is my code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
from pyspark.ml.classification import LogisticRegression

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

schema = StructType([
    StructField("row_num", IntegerType(), True),
    StructField("label", IntegerType(), True),
    StructField("probability", DoubleType(), True),
])
empty = spark.createDataFrame(sc.emptyRDD(), schema)

model = LogisticRegression().fit(train_data)
preds = model.transform(test_data)

all_preds = empty.unionAll(preds)
AnalysisException: Union can only be performed on tables with the compatible column types.
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> <> double at the third column of the second table;
I've tried casting the types of my empty dataframe to match, but I haven't managed to get the same types - is there any way around this? I'm aiming to run the machine learning iteratively in a for loop, with each prediction output appended to a pyspark dataframe.
For reference, preds looks like:
preds.printSchema()
root
|-- row_num: integer (nullable = true)
|-- label: integer (nullable = true)
|-- probability: vector (nullable = true)
You can create an empty dataframe based on the schema of the preds dataframe:
model = LogisticRegression().fit(train_data)
preds = model.transform(test_data)
empty = spark.createDataFrame(sc.emptyRDD(), preds.schema)
all_preds = empty.unionAll(preds)
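If you do want to declare the schema by hand instead of reusing preds.schema, note that the probability column produced by the model is an ML vector rather than a double; a sketch of that variant (assuming the pyspark.ml vector type):
from pyspark.ml.linalg import VectorUDT
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("row_num", IntegerType(), True),
    StructField("label", IntegerType(), True),
    StructField("probability", VectorUDT(), True),  # vector type to match the model output
])
empty = spark.createDataFrame(sc.emptyRDD(), schema)
all_preds = empty.unionAll(preds)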
I imported the data from a database:
df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://127.0.0.1/test.db").load()
I selected the double columns using:
double_list = [name for name,types in df.dtypes if types == 'double']
Credits to @Ramesh Maharjan.
To remove special characters we use:
removedSpecials = [''.join(y for y in x if y.isalnum()) for x in double_list]
The question is:
How can I create a new dataframe, based on df, with ONLY the double_list columns?
If you already have the list of column names with double as the datatype, then the next step is to remove the special characters, which can be done by using .isalnum() as:
removedSpecials = [''.join(y for y in x if y.isalnum()) for x in double_list]
Once you have the list of column names with the special characters removed, it's just a .withColumnRenamed() API call:
for (x, y) in zip(double_list, removedSpecials):
    df = df.withColumnRenamed(x, y)
df.show(truncate=False) should give you the renamed dataframe with the double-datatype columns.
If you don't want the columns that are not in double_list, i.e. not in the double datatype list, then you can use select as:
df.select(*removedSpecials).show(truncate=False)
The reason for doing
for (x, y) in zip(double_list, removedSpecials):
    df = df.withColumnRenamed(x, y)
before doing
df.select(*removedSpecials).show(truncate=False)
is that there might be special characters like . in the original names, which prevent concise solutions like df.select([df[x].alias(y) for (x, y) in zip(double_list, removedSpecials)]).show(truncate=False) from working.
I hope the answer is helpful.
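For what it's worth, if the special characters are dots inside the column names, back-tick quoting usually lets the one-pass select/alias version work; a hedged sketch, untested against your exact column names:
from pyspark.sql import functions as F

renamed = df.select(
    # back-ticks tell Spark to treat the dotted name as one column name,
    # not as access to a nested struct field
    *[F.col("`{}`".format(x)).alias(y) for x, y in zip(double_list, removedSpecials)]
)
renamed.show(truncate=False)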
Scala code; you can convert it to Python:
import sqlContext.implicits._
import org.apache.spark.sql.functions.col

// sample df
df.show()
+----+--------------------+--------+
|data|                Week|NumCCol1|
+----+--------------------+--------+
| aac|01/28/2018-02/03/...|     2.0|
| aac|02/04/2018-02/10/...|    23.0|
| aac|02/11/2018-02/17/...|   105.0|
+----+--------------------+--------+

df.printSchema()
root
|-- data: string (nullable = true)
|-- Week: string (nullable = true)
|-- NumCCol1: double (nullable = false)

// collect the names of the double columns, then select only those
val doubleCols = df.schema.fields
  .collect({ case x if x.dataType.typeName == "double" => x.name })
val df1 = df.select(doubleCols.map(col): _*)

// df with only double columns
df1.show()

Use df1.withColumnRenamed to rename the columns.
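Since the answer invites a Python conversion, here is a rough PySpark sketch of the same idea (column names are taken from df.dtypes; treat it as an illustration, not tested against your data):
# names of all columns whose Spark type is double
double_list = [name for name, dtype in df.dtypes if dtype == 'double']

# strip special characters from those names and rename on df first
clean_names = [''.join(c for c in n if c.isalnum()) for n in double_list]
for old, new in zip(double_list, clean_names):
    df = df.withColumnRenamed(old, new)

# keep only the (renamed) double columns
df1 = df.select(*clean_names)
df1.show()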
I have a StructField in a dataframe that is not nullable. Simple example:
import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
which returns:
[StructField(name,StringType,true),
StructField(age,LongType,true),
StructField(foo,BooleanType,false)]
Notice that the field foo is not nullable. The problem is that (for reasons I won't go into) I want it to be nullable. I found this post, Change nullable property of column in spark dataframe, which suggested a way of doing it, so I adapted the code therein to this:
import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, newSchema)
which failed with:
TypeError: StructField(name,StringType,true) is not JSON serializable
I also see this in the stack trace:
raise ValueError("Circular reference detected")
So I'm a bit stuck. Can anyone modify this example in a way that enables me to define a dataframe where column foo is nullable?
I know this question has already been answered, but I was looking for a more generic solution when I came up with this:
def set_df_columns_nullable(spark, df, column_list, nullable=True):
    for struct_field in df.schema:
        if struct_field.name in column_list:
            struct_field.nullable = nullable
    df_mod = spark.createDataFrame(df.rdd, df.schema)
    return df_mod
You can then call it like this:
set_df_columns_nullable(spark,df,['name','age'])
For the general case, one can change the nullability of a column via the nullable property of the StructField of that specific column.
Here's an example:
df.schema['col_1']
# StructField(col_1,DoubleType,false)
df.schema['col_1'].nullable = True
df.schema['col_1']
# StructField(col_1,DoubleType,true)
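Note that mutating df.schema like this only changes the schema object on the driver; to have the new nullability reflected in an actual DataFrame you still need to rebuild it from that schema, as the previous answer does. A minimal sketch:
df.schema['col_1'].nullable = True
df = spark.createDataFrame(df.rdd, df.schema)  # rebuild so the changed nullability takes effect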
Seems you missed the StructType(newSchema).
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, StructType(newSchema))
df2.show()
Alternatively, converting the dataframe to an RDD and back makes every column nullable:
df1 = df.rdd.toDF()
df1.printSchema()
Output:
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- foo: boolean (nullable = true)
I have two data frames in pyspark, df and data. The schemas are as below:
>>> df.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- nation: string (nullable = true)
|-- Date: timestamp (nullable = false)
|-- ZipCode: integer (nullable = true)
|-- car: string (nullable = true)
|-- van: string (nullable = true)
>>> data.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- nation: string (nullable = true)
|-- date: string (nullable = true)
|-- zipcode: integer (nullable = true)
Now I want to add the columns car and van to my data data frame by comparing the two schemas.
More generally, I would like to compare two data frames: if the columns are the same, do nothing, but if the columns differ, add the missing columns to the data frame that doesn't have them.
How can we achieve that in pyspark?
FYI, I am using Spark 1.6.
Once the columns are added, their values in the data frame they were added to should be null.
For example, here we are adding columns to the data data frame, so the columns car and van in data should contain null values, while the same columns in df keep their original values.
What happens if there are more than 2 new columns to be added?
As the schema is nothing but a StructType consisting of a list of StructFields, we can retrieve the fields list to compare and find the missing columns:
from pyspark.sql.functions import lit

df_schema = df.schema.fields
data_schema = data.schema.fields
df_names = [x.name.lower() for x in df_schema]
data_names = [x.name.lower() for x in data_schema]

if df_schema != data_schema:
    col_diff = set(df_names) ^ set(data_names)
    # Python 2: map(None, a, b) pairs the fields up like itertools.izip_longest
    col_list = [(x[0].name, x[0].dataType) for x in map(None, df_schema, data_schema)
                if ((x[0] is not None and x[0].name.lower() in col_diff) or x[1].name.lower() in col_diff)]
    for i in col_list:
        if i[0] in df_names:
            data = data.withColumn("%s" % i[0], lit(None).cast(i[1]))
        else:
            df = df.withColumn("%s" % i[0], lit(None).cast(i[1]))
else:
    print "Nothing to do"
You mentioned adding the column only if there are no null values, but since your schema differences are all nullable columns, I have not used that check. If you need it, add a check on nullable as below:
col_list = [(x[0].name, x[0].dataType) for x in map(None, df_schema, data_schema)
            if ((x[0].name.lower() in col_diff or x[1].name.lower() in col_diff) and not x[0].nullable)]
Please check the documentation for more about StructType and StructField:
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.types.StructType
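On more recent Spark versions (Python 3), a more compact sketch of the same idea, assuming the same df and data frames as above:
from pyspark.sql.functions import lit

def add_missing_columns(left, right):
    # columns present in one DataFrame but missing from the other are added as nulls,
    # cast to the type they have on the side where they do exist
    left_fields = {f.name.lower(): f for f in left.schema.fields}
    right_fields = {f.name.lower(): f for f in right.schema.fields}
    for name, field in left_fields.items():
        if name not in right_fields:
            right = right.withColumn(field.name, lit(None).cast(field.dataType))
    for name, field in right_fields.items():
        if name not in left_fields:
            left = left.withColumn(field.name, lit(None).cast(field.dataType))
    return left, right

df, data = add_missing_columns(df, data)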
If you have to do this for multiple tables, it might be worth generalizing the code a bit. This code takes the first non-null value in the non-matching source column to create the new column in the target table.
from pyspark.sql.functions import lit, first

def first_non_null(f, t):  # find the first non-null value of a column
    return f.select(first(f[t], ignorenulls=True)).first()[0]

def match_type(f1, f2, miss):  # add missing columns to the target table
    for i in miss:
        try:
            f1 = f1.withColumn(i, lit(first_non_null(f2, i)))
        except:
            pass
        try:
            f2 = f2.withColumn(i, lit(first_non_null(f1, i)))
        except:
            pass
    return f1, f2

def column_sync_up(d1, d2):  # test if the matching requirement is met
    missing = list(set(d1.columns) ^ set(d2.columns))
    if len(missing) > 0:
        return match_type(d1, d2, missing)
    else:
        print "Columns Match!"

df1, df2 = column_sync_up(df1, df2)  # reuse as necessary