Use a parquet file with special characters in column names in PySpark - python

MAIN GOAL
Show or select columns from the Spark dataframe read from the parquet file.
None of the solutions mentioned on the forum have been successful in our case.
PROBLEM
The issue happens when the parquet file is read and queried with Spark, and it is due to the presence of the special characters ,;{}()\n\t= within the column names. The problem was reproduced with a simple parquet file with two columns and five rows. The names of the columns are:
SpeedReference_Final_01 (RifVel_G0)
SpeedReference_Final_02 (RifVel_G1)
The error raised is:
Attribute name "SpeedReference_Final_01 (RifVel_G0)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
We are using PySpark (Python), and the solutions we have tried can be categorized as follows:
Solutions based on column rename - [spark.read.parquet + rename of the obtained dataframe]
Several approaches have been tried:
withColumnRenamed (Issue N.2 within the script)
toDF (Issue N.3)
alias (Issue N.5)
None of them works in our case.
Read the parquet file into a Pandas dataframe and then create a new one from it - [pd.read_parquet + spark.createDataFrame]
This solution works with a small parquet file (Issue N.0, i.e. the WORKAROUND within the script): the resulting Spark dataframe can be queried successfully even though its column names contain special characters. Unfortunately it is impractical with our big parquet files (600000 rows x 1000 columns per parquet), since creating the Spark dataframe takes forever.
An attempt to read the parquet file into a Spark dataframe and create a new Spark dataframe from its rdd and a renamed schema is not viable either, since extracting the rdd from the Spark dataframe raises the same error (Issue N.4).
Read the parquet file with a predefined schema (that avoids the special characters) - [spark.read.schema(...).parquet]
This solution does not work: the data in the critical columns comes back as null/None, which is expected, since the renamed columns are not present in the original file. A sketch of this attempt is shown below.
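For reference, a minimal sketch of this attempt, using the spark session and filename defined in the script further below; the field names here are our renamed placeholders, which is exactly why the loaded values come back null:
from pyspark.sql.types import StructType, StructField, FloatType

renamed_schema = StructType([
    StructField("RifVelG0", FloatType(), nullable=True),
    StructField("RifVelG1", FloatType(), nullable=True),
])
DF = spark.read.schema(renamed_schema).parquet(filename)
DF.show()  # the columns exist, but every value is null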
The solutions listed above are summarized in the Python code below and have been tested with the example parquet file.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import col
import pandas as pd
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
# Select file
filename = 'D:/Simple.parquet'
issue_num = 0 # Workaround to issues (Equivalent to no issue)
#issue_num = 1 # Issue 1 - Unable to show dataframe or select column with name containing invalid character(s)
#issue_num = 2 # Issue 2 - Unable to show dataframe or select column after rename (using withColumnRenamed)
#issue_num = 3 # Issue 3 - Unable to show dataframe or select column after rename (using toDF)
#issue_num = 4 # Issue 4 - Unable to extract rdd from renamed dataframe
#issue_num = 5 # Issue 5 - Unable to select column with alias
if issue_num == 0:
    ###############################################################################################
    # WORKAROUND - Create Spark data frame from Pandas dataframe
    df_pd = pd.read_parquet(filename)
    DF = spark.createDataFrame(df_pd)
    print('WORKAROUND')
    DF.show()
    # +-----------------------------------+-----------------------------------+
    # |SpeedReference_Final_01 (RifVel_G0)|SpeedReference_Final_02 (RifVel_G1)|
    # +-----------------------------------+-----------------------------------+
    # |                  553.5228271484375|                     720.3720703125|
    # |                  553.5228271484375|                     720.3720703125|
    # |                  553.5228271484375|                     720.3720703125|
    # |                  553.5228271484375|                     720.3720703125|
    # |                  553.5228271484375|                     720.3720703125|
    # +-----------------------------------+-----------------------------------+

    ###############################################################################################
    # Correct management of columns with invalid characters when using spark.createDataFrame
    # spark.createDataFrame: Create a dataframe with two columns with invalid characters - OK
    # DFCREATED
    schema = StructType(
        [
            StructField("SpeedReference_Final_01 (RifVel_G0)", FloatType(), nullable=True),
            StructField("SpeedReference_Final_02 (RifVel_G1)", FloatType(), nullable=True)
        ]
    )
    row_in = [(553.523, 720.372), (553.523, 720.372), (553.523, 720.372), (553.523, 720.372), (553.523, 720.372)]
    rdd = spark.sparkContext.parallelize(row_in)
    DFCREATED = spark.createDataFrame(rdd, schema)
    DFCREATED.show()
    # +-----------------------------------+-----------------------------------+
    # |SpeedReference_Final_01 (RifVel_G0)|SpeedReference_Final_02 (RifVel_G1)|
    # +-----------------------------------+-----------------------------------+
    # |                            553.523|                            720.372|
    # |                            553.523|                            720.372|
    # |                            553.523|                            720.372|
    # |                            553.523|                            720.372|
    # |                            553.523|                            720.372|
    # +-----------------------------------+-----------------------------------+

    DF_SEL_VAR_CREATED = DFCREATED.select(DFCREATED.columns[0]).take(2)
    for el in DF_SEL_VAR_CREATED:
        print(el)
    # Row(SpeedReference_Final_01 (RifVel_G0)=553.5230102539062)
    # Row(SpeedReference_Final_01 (RifVel_G0)=553.5230102539062)
else:
    # spark.read: read file into dataframe - OK
    DF = spark.read.parquet(filename)
    print('ORIGINAL SCHEMA')
    DF.printSchema()
    # root
    #  |-- SpeedReference_Final_01 (RifVel_G0): float (nullable = true)
    #  |-- SpeedReference_Final_02 (RifVel_G1): float (nullable = true)

    if issue_num == 1:
        ###########################################################################################
        # Issue 1 - Unable to show dataframe or select column with name containing invalid character(s)
        DF.show()
        # DF.select(DF.columns[0]).show()
        # DF_SEL_VAR = DF.select(DF.columns[0]).take(3)
        # ECC: Attribute name "SpeedReference_Final_01 (RifVel_G0)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
        #      on all 3 previous statements
    elif issue_num == 2:
        ###########################################################################################
        # Issue 2 - Unable to show dataframe or select column after rename (using withColumnRenamed)
        DFRENAMED = DF.withColumnRenamed('SpeedReference_Final_01 (RifVel_G0)', 'RifVelG0').withColumnRenamed('SpeedReference_Final_02 (RifVel_G1)', 'RifVelG1')
        print('RENAMED SCHEMA')
        DFRENAMED.printSchema()
        # root
        #  |-- RifVelG0: float (nullable = true)
        #  |-- RifVelG1: float (nullable = true)
        DFRENAMED.show()
        # DF_SEL_VAR_RENAMED = DFRENAMED.select(DFRENAMED.RifVelG0).take(2)
        # ECC: Attribute name "SpeedReference_Final_01 (RifVel_G0)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
        #      on both previous statements
    elif issue_num == 3:
        ###########################################################################################
        # Issue 3 - Unable to show dataframe or select column after rename (using toDF)
        DFRENAMED = DF.toDF('RifVelG0', 'RifVelG1')
        print('RENAMED SCHEMA')
        DFRENAMED.printSchema()
        # root
        #  |-- RifVelG0: float (nullable = true)
        #  |-- RifVelG1: float (nullable = true)
        DFRENAMED.show()
        # DF_SEL_VAR_RENAMED = DFRENAMED.select(DFRENAMED.RifVelG0).take(2)
        # ECC: Attribute name "SpeedReference_Final_01 (RifVel_G0)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
        #      on both previous statements
    elif issue_num == 4:
        ###########################################################################################
        # Issue 4 - Unable to extract rdd from renamed dataframe
        DFRENAMED = DF.withColumnRenamed('SpeedReference_Final_01 (RifVel_G0)', 'RifVelG0').withColumnRenamed('SpeedReference_Final_02 (RifVel_G1)', 'RifVelG1')
        DFRENAMED_rdd = DFRENAMED.rdd
        # ECC: Attribute name "SpeedReference_Final_01 (RifVel_G0)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
    elif issue_num == 5:
        ###########################################################################################
        # Issue 5 - Unable to select column with alias
        DF_SEL_VAR = DF.select(col(DF.columns[0]).alias('RifVelG0')).take(3)
        # ECC: Attribute name "SpeedReference_Final_01 (RifVel_G0)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
Do you have any idea how we can solve this problem?
Any suggestion is really appreciated.

Try something like this:
import re
import pyspark.sql.functions as f

def remove_special_characters(string: str):
    return re.sub("[^a-zA-Z0-9 ]", "", string)

DFCREATED = DFCREATED.select(
    [
        f.col(column).alias(remove_special_characters(column))
        for column in DFCREATED.columns
    ]
)
You can also adjust the regex in this function to remove other characters, such as spaces.
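Where the pandas workaround is acceptable for the file size, the same sanitizer can be applied while the data is still in pandas, so the Spark dataframe never sees the special characters. This is only a sketch, reusing spark and filename from the question's script; it does not remove the size limitation mentioned above:
import re
import pandas as pd

def remove_special_characters(string: str) -> str:
    return re.sub("[^a-zA-Z0-9 ]", "", string)

# Rename the columns before handing the data to Spark.
df_pd = pd.read_parquet(filename)
df_pd.columns = [remove_special_characters(c) for c in df_pd.columns]
DF = spark.createDataFrame(df_pd)
DF.select(DF.columns[0]).show()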

Related

How to convert column types to match joining dataframes in pyspark?

I have an empty dataframe in pyspark that I want to use to append machine learning results coming from model.transform(test_data) in pyspark - but when I try a union function to join the dataframes I get a "column types must match" error.
This is my code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
from pyspark.ml.classification import LogisticRegression

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

schema = StructType([
    StructField("row_num", IntegerType(), True),
    StructField("label", IntegerType(), True),
    StructField("probability", DoubleType(), True),
])
empty = spark.createDataFrame(sc.emptyRDD(), schema)

model = LogisticRegression().fit(train_data)
preds = model.transform(test_data)
all_preds = empty.unionAll(preds)
AnalysisException: Union can only be performed on tables with the compatible column types.
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> <> double at the third column of the second table;
I've tried casting the types of my empty dataframe to match, but I haven't managed to get the same types - is there any way around this? I'm aiming to run the machine learning iteratively in a for loop, with each prediction output appended to a PySpark dataframe.
For reference, preds looks like:
preds.printSchema()
root
|-- row_num: integer (nullable = true)
|-- label: integer (nullable = true)
|-- probability: vector (nullable = true)
You can create an empty dataframe based on the schema of the preds dataframe:
model = LogisticRegression().fit(train_data)
preds = model.transform(test_data)
empty = spark.createDataFrame(sc.emptyRDD(), preds.schema)
all_preds = empty.unionAll(preds)
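If the goal is to append predictions across iterations, one possible pattern is to skip the hand-written empty dataframe entirely and seed the accumulator with the first result. This is a sketch, where splits is a hypothetical list of (train_data, test_data) pairs and LogisticRegression comes from pyspark.ml.classification as in the question:
all_preds = None
for train_data, test_data in splits:
    model = LogisticRegression().fit(train_data)
    preds = model.transform(test_data)
    # First iteration: start from preds; afterwards the schemas match by construction.
    all_preds = preds if all_preds is None else all_preds.unionAll(preds)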

TypeError: 'Column' object is not callable in PySpark when joining two tables

So I am trying to join two data frames, and in doing so I am getting the following error.
TypeError: 'Column' object is not callable
I am loading the data as simple CSV files; the following is the schema loaded from the CSVs.
root
|-- movie_id,title: string (nullable = true)
root
|-- user_id,movie_id,tag,timestamp: string (nullable = true)
The following is my implementation for loading and joining:
df1 = spark.read.format("csv").option("header", "true").load("collaborative/titles.csv", header=True, sep="|")
df2 = spark.read.format("csv").option("header", "true").load("collaborative/tags.csv", header=True, sep="|")
df1.printSchema()
df2.printSchema()
df1.alias("df1").join(df2.alias("df2"), col("df1.movie_id").equalTo(col("df2.movie_id"))).select(col("df2.*"))
There's no method called equalTo on a Column object. When you write col("df1.movie_id").equalTo, Spark assumes you are accessing a nested field named equalTo inside movie_id and returns another Column, hence the error: 'Column' object is not callable.
print(col('df1.movie_id').equalTo)
# Column<b'df1.movie_id[equalTo]'>
To fix the problem, follow the correct join syntax here.
In your case, the simplest solution is to drop irrelevant column from df1 before join so you don't have to create aliases for data frames and select later:
df1.select('movie_id').join(df2, 'movie_id').show()
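If you prefer to keep both frames and their aliases, the PySpark equivalent of equalTo is the == operator on Column objects, for example:
from pyspark.sql.functions import col

df1.alias("df1") \
    .join(df2.alias("df2"), col("df1.movie_id") == col("df2.movie_id")) \
    .select("df2.*") \
    .show()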
You can try the following:
d1 = df1.alias("df1")
d2 = df2.alias("df2")
d1.join(d2,d1.movie_id == d2.movie_id).select('df2.*')
You can refer to pyspark join examples here.

Programatically select columns from a dataframe with udf

I am new to pyspark.
I am trying to extract columns of a dataframe using a config file which contains a UDF.
If I define the select columns as a list in the client code it works, but if I import the list from a config file, each entry in the column list is just a string.
Is there an alternative way?
Opening the Spark shell using pyspark:
*******************************************************************
version 2.2.0
Using Python version 2.7.16 (default, Mar 18 2019 18:38:44)
SparkSession available as 'spark'
*******************************************************************
jsonDF = spark.read.json("/tmp/people.json")
jsonDF.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
jsonDF.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
jsonCurDF = jsonDF.filter(jsonDF.age.isNotNull()).cache()
# Define the UDF
from pyspark.sql.functions import udf
#udf("long")
def squared_udf(s):
return s * s
# Selecting the columns from a list.
colSelList = ['age', 'name', squared_udf('age')]
jsonCurDF.select(colSelList).show()
+---+------+----------------+
|age| name|squared_udf(age)|
+---+------+----------------+
| 30| Andy| 900|
| 19|Justin| 361|
+---+------+----------------+
# If I use an external config file
colSelListStr = ["age", "name" , "squared_udf('age')"]
jsonCurDF.select(colSelListStr).show()
The above command fails with "cannot resolve '`squared_udf('age')`'".
Tried registering the function, tried selectExpr and using the column function.
In the colSelList the udf call is translated to a column type.
print colSelList[2]
# Column<squared_udf(age)>
print colSelListStr[2]
# squared_udf('age')
print column(colSelListStr[2])
# Column<squared_udf('age')>
What am I doing wrong here? or is there an alternate solution?
That's because squared_udf('age') is treated as a plain string (a column name), not as a function call, when you pass it in from a list.
There is a roundabout way to do this, and you don't need to import the UDF for it.
Assume this is the list of columns you need to select:
col_list = ['age', 'name', 'squared_age']
Passing this list directly will result in an error, because squared_age is not a column of the data frame.
So first take all the columns of the existing df into a list:
existing_cols = df.columns
Then take the intersection of that list and the list of columns you need; it will give you the list of common columns:
intersection = list(set(existing_cols) & set(col_list))
Now try this:
newDF = df.select(intersection).rdd.map(lambda x: (x["age"], x["name"], x["age"] * x["age"])).toDF(col_list)
which will give you the age, the name and the squared age as columns.
Hope this helped.
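Another option, if the configuration has to stay string-based, is to register the UDF under a SQL name and keep plain SQL expressions in the config file. A sketch, reusing jsonCurDF from the question and a hypothetical alias age_squared; note that age is not quoted inside the expression:
from pyspark.sql.types import LongType

def squared(s):
    return s * s

# Register under a SQL-callable name so that string expressions can refer to it.
spark.udf.register("squared_udf", squared, LongType())

# Strings coming from the config file are now valid SQL expressions.
colSelListStr = ["age", "name", "squared_udf(age) AS age_squared"]
jsonCurDF.selectExpr(*colSelListStr).show()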

How to read JSON file (Spark/Pyspark) with dots in column names using inferred schema?

I am importing JSON files dynamically (sending multiple file names to a script in parallel) and one of my files includes dots in the field names.
When this gets read into a dataframe for processing, the schema inference breaks it up into nested structs (i.e. "A.B.C" -> A [B [C]]).
Is there a way to read in columns from the file without breaking up a column name that includes dots?
I understand that backticks can qualify a column name, but since I cannot explicitly define the schema before reading the JSON file, I cannot do this.
df = sqlContext.read.option('multiline','true').json(<location>)
df.printSchema()
I see the field "P.O. Replacement Cost" become:
|-- P: struct (nullable = true)
| |-- O: struct (nullable = true)
| | |-- Replacement Cost: double (nullable = true)
You can use dot notation inside the col function to reach nested fields:
col('colName.nestCol.nestNestCol.etc')
from pyspark.sql.functions import col
df.select(col('colName.nestCol').alias('nestCol'))
df.where(col('colName.nestCol') == 'value')
EDIT:
Sorry, I misread your question. Try this and see if it works:
from pyspark.sql.functions import col, struct

df = df \
    .withColumn(
        'p',  # Overwrite your nested struct with the new nested column
        struct(
            col('p.*'),
            col('p.a.b.c').alias('abc')  # Copy of the nested field under the new name
        )
    ) \
    .withColumn(
        'p',
        col('p').dropFields('a.b.c')  # Remove the ugly named column
    )
If you don't need to keep the column in the nested struct, you can do a simple
df.select(col('p.a.b.c').alias('abc'))
This moves the a.b.c column out of the nested struct.
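Applied to the field from the question, that might look like the following sketch; the backticks quote the leaf name because it contains a space, and the alias is just a hypothetical sanitized name:
from pyspark.sql.functions import col

flat = df.select(
    # "P.O. Replacement Cost" was inferred as the nested struct P -> O -> "Replacement Cost",
    # so walk down the struct and pull the leaf out as a flat column.
    col("P.O.`Replacement Cost`").alias("PO_Replacement_Cost")
)
flat.printSchema()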

Comparing schema of dataframe using Pyspark

I have a data frame (df).
For showing its schema I use:
from pyspark.sql.functions import *
df1.printSchema()
And I get the following result:
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
Sometimes the schema changes (the column type or name):
df2.printSchema()
#root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)
I would like to compare the two schemas (df1 and df2) and get only the differences in types and column names (sometimes a column can also move to another position).
The result should be a table (or data frame) something like this:
column    df1       df2        diff
name      string    array      type
gender    N/A       integer    new column
(The age column is the same and didn't change. If a column is omitted, there should be an indication such as 'omitted'.)
How can I do this efficiently if I have many columns in each?
Without any external library, we can find the schema difference using:
from pyspark.sql.session import SparkSession
from pyspark.sql import DataFrame

def schema_diff(spark: SparkSession, df_1: DataFrame, df_2: DataFrame):
    s1 = spark.createDataFrame(df_1.dtypes, ["d1_name", "d1_type"])
    s2 = spark.createDataFrame(df_2.dtypes, ["d2_name", "d2_type"])
    difference = (
        s1.join(s2, s1.d1_name == s2.d2_name, how="outer")
        .where(s1.d1_type.isNull() | s2.d2_type.isNull() | (s1.d1_type != s2.d2_type))
        .select(s1.d1_name, s1.d1_type, s2.d2_name, s2.d2_type)
        .fillna("")
    )
    return difference

fillna is optional; I prefer to view the missing values as empty strings.
In the where clause we also compare the types, so a column that exists in both dataframes but with a different type still shows up.
The outer join also returns all the columns that are in the second dataframe but not in the first, and vice versa.
Usage:
diff = schema_diff(spark, df_1, df_2)
diff.show(diff.count(), truncate=False)
You can try creating two pandas dataframes with the metadata from both df1 and df2, like below:
pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'data_type'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'data_type'])
and then join those two pandas dataframes with an 'outer' join.
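Completing that idea, a possible sketch of the outer join and the filtering, assuming both dtypes lists fit comfortably in memory:
import pandas as pd

pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'data_type'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'data_type'])

# Outer join on the column name; the suffixes distinguish the two sources.
merged = pd_df1.merge(pd_df2, on='column', how='outer', suffixes=('_df1', '_df2'))

# Keep rows where the type differs or the column is missing on one side (NaN).
diff = merged[merged['data_type_df1'] != merged['data_type_df2']]
print(diff)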
A custom function that could be useful for someone.
def SchemaDiff(DF1, DF2):
    # Getting schema for both dataframes in a dictionary
    DF1Schema = {x[0]: x[1] for x in DF1.dtypes}
    DF2Schema = {x[0]: x[1] for x in DF2.dtypes}

    # Columns present in DF1 but not in DF2
    DF1MinusDF2 = dict.fromkeys((set(DF1.columns) - set(DF2.columns)), '')
    for column_name in DF1MinusDF2:
        DF1MinusDF2[column_name] = DF1Schema[column_name]

    # Columns present in DF2 but not in DF1
    DF2MinusDF1 = dict.fromkeys((set(DF2.columns) - set(DF1.columns)), '')
    for column_name in DF2MinusDF1:
        DF2MinusDF1[column_name] = DF2Schema[column_name]

    # Find data types changed in DF1 as compared to DF2
    UpdatedDF1Schema = {k: v for k, v in DF1Schema.items() if k not in DF1MinusDF2}
    UpdatedDF1Schema = {**UpdatedDF1Schema, **DF2MinusDF1}
    DF1DataTypesChanged = {}
    for column_name in UpdatedDF1Schema:
        if UpdatedDF1Schema[column_name] != DF2Schema[column_name]:
            DF1DataTypesChanged[column_name] = DF2Schema[column_name]

    return DF1MinusDF2, DF2MinusDF1, DF1DataTypesChanged
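For the df1/df2 schemas from the question, a call might look like this (the printed values are illustrative):
only_in_df1, only_in_df2, type_changes = SchemaDiff(df1, df2)
print(only_in_df1)    # {}                         columns present only in df1
print(only_in_df2)    # {'gender': 'int'}          columns present only in df2
print(type_changes)   # {'name': 'array<string>'}  columns whose type changed in df2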
If you only need to know whether the two schemas are identical, you can simply compare the schema objects:
df1.schema == df2.schema
(Note that printSchema() just prints the tree and returns None, so comparing its return values would always be True.)
