Pyspark removing duplicate columns after broadcast join - python

I have two dataframes which I wish to join and then save as a parquet table. After performing the join my resulting table has duplicate columns, preventing me from saving the dataset.
Here is my code for the join:
join_conditions = [
    df1.colX == df2.colY,
    df1.col1 == df2.col1,
    df1.col2 == df2.col2,
    df1.col3 == df2.col3,
]
dfj = df1.alias("1").join(
    F.broadcast(df2.alias("2")), join_conditions, "inner"
).drop("1.col1", "1.col2", "1.col3")
dfj.write.format("parquet").mode("overwrite").saveAsTable("table")
I expected the drop to remove the duplicate columns, but when I try to save the table an exception is thrown saying they are still there. drop() doesn't raise an exception if the columns don't exist, which suggests the alias is wrong / not working as I expect?
I cannot pass the join conditions as a list of strings, since that causes an error when the columns in the join condition are not named the same on both DataFrames:
join_conditions = [
    df1.colX == df2.colY,
    "col1",
    "col2",
    "col3",
]
doesn't work for example.
This join works but still results in the duplicate columns:
join_conditions = [
    df1.colX == df2.colY,
    F.col("1.col1") == F.col("2.col1"),
    F.col("1.col2") == F.col("2.col2"),
    F.col("1.col3") == F.col("2.col3"),
]
This also did not remove the duplicates. All of these approaches still result in the joined dataframe having the duplicate columns col1, col2 and col3. What am I doing wrong or not understanding correctly? Answers with PySpark sample code would be appreciated.

I'm not sure why your version doesn't work; it is odd.
This isn't so pretty, but it works:
from pyspark.sql import functions as F

data = [{'colX': "hello", 'col1': 1, 'col2': 2, 'col3': 3}]
data2 = [{'colY': "hello", 'col1': 1, 'col2': 2, 'col3': 3}]
df1 = spark.createDataFrame(data)
df2 = spark.createDataFrame(data2)

join_cond = [df1.colX == df2.colY,
             df1.col1 == df2.col1,
             df1.col2 == df2.col2,
             df1.col3 == df2.col3]

df1.join(F.broadcast(df2), join_cond, 'inner').drop(df1.col1).drop(df1.col2).drop(df1.col3).printSchema()
root
|-- colX: string (nullable = true)
|-- col1: long (nullable = true)
|-- col2: long (nullable = true)
|-- col3: long (nullable = true)
|-- colY: string (nullable = true)
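As far as I can tell, the reason the string-based drop in the question appears to do nothing is that drop("1.col1") is matched against the literal column name, so the alias qualifier is never resolved, no column matches, and the call is silently ignored. If you prefer to keep the aliases, here is a sketch of an alias-based variant on the same sample dataframes (select() does resolve qualified names, unlike drop() with a plain string):
dfj = (
    df1.alias("1")
    .join(
        F.broadcast(df2.alias("2")),
        [F.col("1.colX") == F.col("2.colY"),
         F.col("1.col1") == F.col("2.col1"),
         F.col("1.col2") == F.col("2.col2"),
         F.col("1.col3") == F.col("2.col3")],
        "inner",
    )
    .select("2.*", "1.colX")  # keep df2's columns plus the extra column from df1
)
dfj.printSchema()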

Related

How to convert column types to match joining dataframes in pyspark?

I have an empty dataframe in PySpark that I want to use to append machine learning results coming from model.transform(test_data), but when I try a union function to join the dataframes I get a "column types must match" error.
This is my code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
from pyspark.ml.classification import LogisticRegression

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

schema = StructType([
    StructField("row_num", IntegerType(), True),
    StructField("label", IntegerType(), True),
    StructField("probability", DoubleType(), True),
])
empty = spark.createDataFrame(sc.emptyRDD(), schema)

model = LogisticRegression().fit(train_data)
preds = model.transform(test_data)
all_preds = empty.unionAll(preds)
AnalysisException: Union can only be performed on tables with the compatible column types.
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> <> double at the third column of the second table;
I've tried casting the types of my empty dataframe to match, but I haven't been able to get the same types; is there any way around this? I'm aiming to run the machine learning iteratively in a for loop, with each prediction output appended to a PySpark dataframe.
For reference, preds looks like:
preds.printSchema()
root
|-- row_num: integer (nullable = true)
|-- label: integer (nullable = true)
|-- probability: vector (nullable = true)
You can create an empty dataframe based on the schema of the preds dataframe:
model = LogisticRegression().fit(train_data)
preds = model.transform(test_data)
empty = spark.createDataFrame(sc.emptyRDD(), preds.schema)
all_preds = empty.unionAll(preds)
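If the goal is the iterative loop mentioned in the question, a minimal sketch building on this could look as follows (test_splits is a hypothetical list of test DataFrames; everything else follows the answer above):
model = LogisticRegression().fit(train_data)

# build the empty accumulator from the schema of one transform() output
all_preds = spark.createDataFrame(sc.emptyRDD(), model.transform(test_splits[0]).schema)

for test_split in test_splits:  # hypothetical list of test DataFrames
    preds = model.transform(test_split)
    all_preds = all_preds.unionAll(preds)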

Create new dataFrame based on reformatted columns from old dataFrame

I imported the data from a database:
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").option(
    "uri", "mongodb://127.0.0.1/test.db").load()
I have selected the double columns using:
double_list = [name for name, types in df.dtypes if types == 'double']
Credit to @Ramesh Maharjan.
To remove special characters we use:
removedSpecials = [''.join(y for y in x if y.isalnum()) for x in double_list]
The question is: how can I create a new dataframe based on df with ONLY the double_list columns?
If you already have the list of column names with double as the datatype, then the next step is to remove the special characters, which can be done using .isalnum():
removedSpecials = [''.join(y for y in x if y.isalnum()) for x in double_list]
Once you have the list of column names with the special characters removed, it's just a .withColumnRenamed() API call:
for (x, y) in zip(double_list, removedSpecials):
    df = df.withColumnRenamed(x, y)
df.show(truncate=False) should give you the dataframe with the double-type columns renamed.
If you don't want the columns that are not in double_list, i.e. not of double datatype, then you can use select:
df.select(*removedSpecials).show(truncate=False)
The reason for doing
for (x, y) in zip(double_list, removedSpecials):
    df = df.withColumnRenamed(x, y)
before doing
df.select(*removedSpecials).show(truncate=False)
is that there might be special characters like . which prevent concise solutions like df.select([df[x].alias(y) for (x, y) in zip(double_list, removedSpecials)]).show(truncate=False) from working.
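That said, if the only special character in the names is a dot, the concise select-with-alias version can usually be rescued by backtick-quoting the original names, since backticks make Spark treat "a.b" as a single column name rather than a struct access. A sketch, assuming double_list and removedSpecials from above:
from pyspark.sql import functions as F

renamed = df.select([F.col("`{}`".format(x)).alias(y)
                     for (x, y) in zip(double_list, removedSpecials)])
renamed.show(truncate=False)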
I hope the answer is helpful
Here is Scala code; you can convert it into Python:
import sqlContext.implicits._
// sample df
df.show()
+----+--------------------+--------+
|data| Week|NumCCol1|
+----+--------------------+--------+
| aac|01/28/2018-02/03/...| 2.0|
| aac|02/04/2018-02/10/...| 23.0|
| aac|02/11/2018-02/17/...| 105.0|
+----+--------------------+--------+
df.printSchema()
root
|-- data: string (nullable = true)
|-- Week: string (nullable = true)
|-- NumCCol1: double (nullable = false)
import org.apache.spark.sql.functions.col

// collect the names of the double-typed columns, then select only those
val doubleCols = df.schema.fields
  .collect { case x if x.dataType.typeName == "double" => x.name }
val df1 = df.select(doubleCols.map(col): _*)
// df with only double columns
df1.show()
use df1.withColumnRenamed to rename the columns
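A rough PySpark equivalent of the Scala snippet above, reusing df from the question (a sketch, not tested against the MongoDB source):
# keep only the columns whose data type is double
double_cols = [f.name for f in df.schema.fields if f.dataType.typeName() == "double"]
df1 = df.select(*double_cols)
df1.printSchema()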

Comparing schema of dataframe using Pyspark

I have a data frame (df).
For showing its schema I use:
from pyspark.sql.functions import *
df1.printSchema()
And I get the following result:
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
Sometimes the schema changes (the column type or name):
df2.printSchema()
#root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)
I would like to compare the two schemas (df1 and df2) and get only the differences in types and column names (sometimes a column can move to another position).
The result should be a table (or dataframe) something like this:
column    df1      df2       diff
name      string   array     type
gender    N/A      integer   new column
(The age column is the same and didn't change. If a column is omitted, there will be an indication 'omitted'.)
How can I do this efficiently if I have many columns in each?
Without any external library, we can find the schema difference using:
from pyspark.sql.session import SparkSession
from pyspark.sql import DataFrame

def schema_diff(spark: SparkSession, df_1: DataFrame, df_2: DataFrame):
    s1 = spark.createDataFrame(df_1.dtypes, ["d1_name", "d1_type"])
    s2 = spark.createDataFrame(df_2.dtypes, ["d2_name", "d2_type"])
    difference = (
        s1.join(s2,
                (s1.d1_name == s2.d2_name) & (s1.d1_type == s2.d2_type),
                how="outer")
        .where(s1.d1_type.isNull() | s2.d2_type.isNull())
        .select(s1.d1_name, s1.d1_type, s2.d2_name, s2.d2_type)
        .fillna("")
    )
    return difference
fillna is optional; I prefer to view the missing values as empty strings.
Because the join condition compares the type as well as the name, a column that exists in both dataframes but with a different type shows up as two unmatched rows, so type changes are reported as well.
This will also show all columns that are in the second dataframe but not in the first.
Usage:
diff = schema_diff(spark, df_1, df_2)
diff.show(diff.count(), truncate=False)
You can try creating two pandas dataframes with metadata from both df1 and df2, like below:
import pandas as pd

pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'data_type'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'data_type'])
and then join those two pandas dataframes through an 'outer' join.
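A sketch of that idea; rows where the two data_type columns differ (including the NaN rows produced for columns missing on one side) are the schema differences:
merged = pd_df1.merge(pd_df2, on='column', how='outer', suffixes=('_df1', '_df2'))
diff = merged[merged['data_type_df1'] != merged['data_type_df2']]  # NaN on either side marks a missing column
print(diff)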
A custom function that could be useful for someone:
def SchemaDiff(DF1, DF2):
    # Getting schema for both dataframes in a dictionary
    DF1Schema = {x[0]: x[1] for x in DF1.dtypes}
    DF2Schema = {x[0]: x[1] for x in DF2.dtypes}

    # Columns present in DF1 but not in DF2
    DF1MinusDF2 = dict.fromkeys(set(DF1.columns) - set(DF2.columns), '')
    for column_name in DF1MinusDF2:
        DF1MinusDF2[column_name] = DF1Schema[column_name]

    # Columns present in DF2 but not in DF1
    DF2MinusDF1 = dict.fromkeys(set(DF2.columns) - set(DF1.columns), '')
    for column_name in DF2MinusDF1:
        DF2MinusDF1[column_name] = DF2Schema[column_name]

    # Find data types changed in DF1 as compared to DF2
    UpdatedDF1Schema = {k: v for k, v in DF1Schema.items() if k not in DF1MinusDF2}
    UpdatedDF1Schema = {**UpdatedDF1Schema, **DF2MinusDF1}
    DF1DataTypesChanged = {}
    for column_name in UpdatedDF1Schema:
        if UpdatedDF1Schema[column_name] != DF2Schema[column_name]:
            DF1DataTypesChanged[column_name] = DF2Schema[column_name]

    return DF1MinusDF2, DF2MinusDF1, DF1DataTypesChanged
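Usage would look something like this (the variable names are just illustrative):
only_in_df1, only_in_df2, changed_types = SchemaDiff(df1, df2)
print(only_in_df1)    # {column: type} for columns present only in df1
print(only_in_df2)    # {column: type} for columns present only in df2
print(changed_types)  # {column: df2 type} for columns whose type changed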
You can simply compare the schema objects:
df1.schema == df2.schema
Note that printSchema() only prints the schema and returns None, so comparing its return values would always evaluate as equal.

Can I change the nullability of a column in my Spark dataframe?

I have a StructField in a dataframe that is not nullable. Simple example:
import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
which returns:
[StructField(name,StringType,true),
StructField(age,LongType,true),
StructField(foo,BooleanType,false)]
Notice that the field foo is not nullable. Problem is that (for reasons I won't go into) I want it to be nullable. I found this post Change nullable property of column in spark dataframe which suggested a way of doing it so I adapted the code therein to this:
import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, newSchema)
which failed with:
TypeError: StructField(name,StringType,true) is not JSON serializable
I also see this in the stack trace:
raise ValueError("Circular reference detected")
So I'm a bit stuck. Can anyone modify this example in a way that enables me to define a dataframe where column foo is nullable?
I know this question is already answered, but I was looking for a more generic solution when I came up with this:
def set_df_columns_nullable(spark, df, column_list, nullable=True):
    for struct_field in df.schema:
        if struct_field.name in column_list:
            struct_field.nullable = nullable
    df_mod = spark.createDataFrame(df.rdd, df.schema)
    return df_mod
You can then call it like this:
df = set_df_columns_nullable(spark, df, ['name', 'age'])
For the general case, one can change the nullability of a column via the nullable property of the StructField of that specific column.
Here's an example:
df.schema['col_1']
# StructField(col_1,DoubleType,false)
df.schema['col_1'].nullable = True
df.schema['col_1']
# StructField(col_1,DoubleType,true)
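One caveat, as far as I can tell: mutating the StructField like this only changes the schema object on the Python side, so to get a dataframe that actually carries the new nullability you generally still have to rebuild it, for example along the lines of the previous answer (assuming a spark session is available):
df.schema['col_1'].nullable = True
df = spark.createDataFrame(df.rdd, df.schema)  # re-create the dataframe so the change takes effect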
It seems you missed wrapping the list in StructType(newSchema).
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(), False).otherwise(True))
df.schema.fields
newSchema = [StructField('name', StringType(), True),
             StructField('age', LongType(), True),
             StructField('foo', BooleanType(), False)]
df2 = sqlContext.createDataFrame(df.rdd, StructType(newSchema))
df2.show()
df1 = df.rdd.toDF()
df1.printSchema()
Output:
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- foo: boolean (nullable = true)

Selecting only numeric/string column names from a Spark DF in pyspark

I have a Spark DataFrame in Pyspark (2.1.0) and I am looking to get the names of numeric columns only or string columns only.
For example, this is the Schema of my DF:
root
|-- Gender: string (nullable = true)
|-- SeniorCitizen: string (nullable = true)
|-- MonthlyCharges: double (nullable = true)
|-- TotalCharges: double (nullable = true)
|-- Churn: string (nullable = true)
This is what I need:
num_cols = [MonthlyCharges, TotalCharges]
str_cols = [Gender, SeniorCitizen, Churn]
How can I do that?
dtypes is a list of tuples (columnName, type), so you can use a simple filter:
columnList = [item[0] for item in df.dtypes if item[1].startswith('string')]
PySpark provides a rich API related to schema types. As @DanieldePaula mentioned, you can access the fields' metadata through df.schema.fields.
Here is a different approach based on checking against the statically defined type classes:
from pyspark.sql.types import StringType, DoubleType

df = spark.createDataFrame([
    [1, 2.3, "t1"],
    [2, 5.3, "t2"],
    [3, 2.1, "t3"],
    [4, 1.5, "t4"]
], ["cola", "colb", "colc"])
# get string
str_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
# ['colc']
# or double
dbl_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, DoubleType)]
# ['colb']
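Since the question asks for numeric columns in general rather than only doubles, the same idea extends to the NumericType base class, which IntegerType, LongType, DoubleType, DecimalType, etc. all inherit from (a sketch on the same df):
from pyspark.sql.types import NumericType

num_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
# ['cola', 'colb']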
You can do what zlidme suggested to get only the string (categorical) columns. To extend on the answer given, take a look at the example below. It will give you all numeric (continuous) columns in a list called continuousCols, all categorical columns in a list called categoricalCols, and all columns in a list called allCols.
import pandas as pd

data = {'mylongint': [0, 1, 2],
        'shoes': ['blue', 'green', 'yellow'],
        'hous': ['furnitur', 'roof', 'foundation'],
        'C': [1, 0, 0]}

play_df = pd.DataFrame(data)
play_ddf = spark.createDataFrame(play_df)

# store all column names in a list
allCols = [item[0] for item in play_ddf.dtypes]

# store all column names that are categorical in a list
categoricalCols = [item[0] for item in play_ddf.dtypes if item[1].startswith('string')]

# store all column names that are continuous in a list
continuousCols = [item[0] for item in play_ddf.dtypes if item[1].startswith('bigint')]

print(len(allCols), ' - ', len(continuousCols), ' - ', len(categoricalCols))
This will give the result: 4 - 2 - 2
