Compare column names in two data frames pyspark - python

I have two data frames in pyspark, df and data. The schemas are like below:
>>> df.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- nation: string (nullable = true)
|-- Date: timestamp (nullable = false)
|-- ZipCode: integer (nullable = true)
|-- car: string (nullable = true)
|-- van: string (nullable = true)
>>> data.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- nation: string (nullable = true)
|-- date: string (nullable = true)
|-- zipcode: integer (nullable = true)
Now I want to add the columns car and van to my data data frame by comparing both schemas.
I would also like to compare the two data frames: if the columns are the same, do nothing, but if the columns differ, add the missing columns to the data frame that doesn't have them.
How can we achieve that in pyspark?
FYI, I am using Spark 1.6.
Once the columns are added to a data frame, the values for those columns should be null.
For example, here we are adding columns to the data data frame, so the columns car and van in data should contain null values, but the same columns in df should keep their original values.
What happens if there are more than 2 new columns to be added?

As the schema is nothing but a StructType consisting of a list of StructFields, we can retrieve the fields list to compare and find the missing columns:
from pyspark.sql.functions import lit

df_schema = df.schema.fields
data_schema = data.schema.fields
df_names = [f.name.lower() for f in df_schema]
data_names = [f.name.lower() for f in data_schema]

if df_names != data_names:
    # columns present in only one of the two data frames
    col_diff = set(df_names) ^ set(data_names)
    # (name, dataType) pairs for every missing column
    col_list = [(f.name, f.dataType) for f in df_schema + data_schema
                if f.name.lower() in col_diff]
    for name, dtype in col_list:
        if name.lower() in df_names:
            # column exists only in df, so add it to data
            data = data.withColumn(name, lit(None).cast(dtype))
        else:
            # column exists only in data, so add it to df
            df = df.withColumn(name, lit(None).cast(dtype))
else:
    print("Nothing to do")
You have mentioned adding the column only if there are no null values, but the columns that differ in your schemas are nullable, so I have not used that check. If you need it, then add a check on nullable as below:
col_list = [(f.name, f.dataType) for f in df_schema + data_schema if f.name.lower() in col_diff and not f.nullable]
Please check the documentation for more about StructType and StructField:
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.types.StructType
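For example, with the schemas from the question, the case-insensitive name comparison leaves only car and van in the symmetric difference, so after running the snippet above the two columns should show up in data as null-valued columns with their original types (a quick check, assuming nothing else was renamed):
# set(df_names) ^ set(data_names)  ->  {'car', 'van'}
data.select("car", "van").printSchema()
# root
#  |-- car: string (nullable = true)
#  |-- van: string (nullable = true)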

If you have to do this for multiple tables, it might be worth generalizing the code a bit. This code takes the first non-null value in the non-matching source column to create the new column in the target table.
from pyspark.sql.functions import lit, first

def first_non_null(f, t):  # find the first non-null value of a column
    return f.select(first(f[t], ignorenulls=True)).first()[0]

def match_type(f1, f2, miss):  # add missing columns to the target table
    for i in miss:
        try:
            f1 = f1.withColumn(i, lit(first_non_null(f2, i)))
        except:
            pass
        try:
            f2 = f2.withColumn(i, lit(first_non_null(f1, i)))
        except:
            pass
    return f1, f2

def column_sync_up(d1, d2):  # test if the matching requirement is met
    missing = list(set(d1.columns) ^ set(d2.columns))
    if len(missing) > 0:
        return match_type(d1, d2, missing)
    else:
        print("Columns Match!")
        return d1, d2

df1, df2 = column_sync_up(df1, df2)  # reuse as necessary
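If you want the behaviour the original question asked for, with the added columns filled with nulls rather than with a copied value, a variant along these lines should work. This is only a sketch (the helper name add_missing_as_null is mine); it reuses the data type declared in the source table's schema:
from pyspark.sql.functions import lit

def add_missing_as_null(target, source, missing):
    # look up the data type of each column in the source schema
    types_by_name = {f.name: f.dataType for f in source.schema.fields}
    for c in missing:
        if c not in target.columns and c in types_by_name:
            target = target.withColumn(c, lit(None).cast(types_by_name[c]))
    return target

# columns only in df1 are added to df2 as nulls, and vice versa
df2 = add_missing_as_null(df2, df1, set(df1.columns) - set(df2.columns))
df1 = add_missing_as_null(df1, df2, set(df2.columns) - set(df1.columns))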

Data Frames being read in with varying number of columns, how do I dynamically change data types of only columns that are Boolean to String data type?

In my notebook, I have Data Frames being read in that will have a variable number of columns every time the notebook is run. How do I dynamically change the data types of only the columns that are Boolean data types to String data type?
This is a problem I faced, so I am posting the answer in case it helps someone else.
The name of the data frame is "df".
Here we dynamically convert every column in the incoming dataset that is a Boolean data type to a String data type:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def bool_col_DataTypes(DataFrame):
    """This function accepts a Spark data frame as an argument. It returns a list of all Boolean columns in your dataframe."""
    DataFrame = dict(DataFrame.dtypes)
    list_of_bool_cols_for_conversion = [x for x, y in DataFrame.items() if y == 'boolean']
    return list_of_bool_cols_for_conversion

list_of_bool_columns = bool_col_DataTypes(df)
for i in list_of_bool_columns:
    df = df.withColumn(i, F.col(i).cast(StringType()))
new_df = df
from pyspark.sql.types import StructType, StructField, BooleanType, StringType, IntegerType

data = [(True, 'Lion', 1),
        (False, 'fridge', 2),
        (True, 'Bat', 23)]
schema = StructType([StructField('Answer', BooleanType(), True),
                     StructField('Entity', StringType(), True),
                     StructField('ID', IntegerType(), True)])
df = spark.createDataFrame(data, schema)
df.printSchema()
Schema
root
|-- Answer: boolean (nullable = true)
|-- Entity: string (nullable = true)
|-- ID: integer (nullable = true)
Transformation
from pyspark.sql.functions import col

df1 = df.select(*[col(x).cast('string').alias(x) if y == 'boolean' else col(x) for x, y in df.dtypes])
df1.printSchema()
root
|-- Answer: string (nullable = true)
|-- Entity: string (nullable = true)
|-- ID: integer (nullable = true)

combine the mx value with same name in one line pyspark

I want to convert these values:
{"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"20 alt1.aspmx.l.google.com"}
{"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"20 alt2.aspmx.l.google.com"}
{"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"30 aspmx2.googlemail.com"}
{"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"30 aspmx3.googlemail.com"}
{"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"20 alt1.aspmx.l.google.com"}
{"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"20 alt2.aspmx.l.google.com"}
{"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"30 aspmx2.googlemail.com"}
{"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"30 aspmx3.googlemail.com"}
test.printSchema()
root
|-- name: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- type: string (nullable = true)
|-- value: string (nullable = true)
I want to combine the mx values that share the same name into one line in pyspark. The result that I want:
{ "timestamp":"1601093713", "name":"exmple1.com", "type":"mx", "value":" alt1.aspmx.l.google.com,alt2.aspmx.l.google.com , aspmx2.googlemail.com, aspmx3.googlemail.com" }
{ "timestamp":"1601093713", "name":"exmple2.com", "type":"mx", "value":" alt1.aspmx.l.google.com, alt2.aspmx.l.google.com , aspmx2.googlemail.com, aspmx3.googlemail.com" }
You can do this using a groupBy, agg, and collect_list (see the docs). Please note that this will give you a list of values and not a string. How to do the conversion, if needed, is covered in Convert PySpark dataframe column from list to string.
from pyspark.sql import functions as F

df_grouped = df.groupby('name').agg(F.collect_list('value').alias('values'))
The follow-up question here would be how you want to handle the other columns, e.g. the timestamp or the type.
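To match the expected output more closely, keeping timestamp and type and producing one comma-separated string, one possible sketch (assuming the leading MX priority such as "20 " or "30 " should be dropped, as in the desired result) is to strip the priority before grouping and join the collected list with concat_ws:
from pyspark.sql import functions as F

result = (test
          .withColumn("host", F.split("value", " ").getItem(1))  # drop the "20"/"30" priority
          .groupBy("name")
          .agg(F.first("timestamp").alias("timestamp"),
               F.first("type").alias("type"),
               F.concat_ws(", ", F.collect_list("host")).alias("value")))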

converting RDD to dataframe fails on string to date conversion

I am working on extracting some data from xml. My overall workflow, which might be inefficient, is:
Read xml into a dataframe ('df_individual')
Filter unwanted columns
Make the target schema (shared below)
Convert the dataframe to RDD
Create a dataframe using the schema and RDD from steps 3 and 4
I created the RDD like below:
rddd = df_individual.rdd.map(tuple)
'df_individual' is the original dataframe from which the xml was read.
Below is the schema:
schema = types.StructType([
    types.StructField('applicaion_id', types.StringType()),
    types.StructField('cd_type', types.StringType()),
    types.StructField('cd_title', types.StringType()),
    types.StructField('firstname', types.StringType()),
    types.StructField('middlename', types.StringType()),
    types.StructField('nm_surname', types.StringType()),
    types.StructField('dt_dob', types.DateType()),
    types.StructField('cd_gender', types.StringType()),
    types.StructField('cd_citizenship', types.StringType())
])
It fails on
df_result = spark.createDataFrame(rddd, schema)
The error is
TypeError: field dt_dob: DateType can not accept object '1973-02-19' in type <class 'str'>
The main purpose of creating the 'df_result' dataframe is to have a predefined schema and implicitly cast all the columns where the RDD and the schema differ. This is my first time working with RDDs and I couldn't find a straightforward casting mechanism for such a case.
If you can help with solving the casting error or share a better workflow, that would be great.
Thanks
If your aim is only to get your data into the right schema and transform some string columns into date columns, I would use a select combined with to_date.
from pyspark.sql import functions as F

df_individual.select('applicaion_id', 'cd_type', 'cd_title', 'firstname', 'middlename', 'nm_surname',
                     F.to_date('dt_dob').alias('dt_dob'),
                     'cd_gender', 'cd_citizenship').printSchema()
prints
root
|-- applicaion_id: string (nullable = true)
|-- cd_type: string (nullable = true)
|-- cd_title: string (nullable = true)
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- nm_surname: string (nullable = true)
|-- dt_dob: date (nullable = true)
|-- cd_gender: string (nullable = true)
|-- cd_citizenship: string (nullable = true)
with the column dt_dob having a date datatype.
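If more columns than the date ever need casting, another option (just a sketch, assuming the column names in the target schema match the dataframe's columns) is to drive the casts from the schema itself and skip the RDD round trip entirely:
from pyspark.sql import functions as F

# cast every column of df_individual to the type declared in the target schema
df_result = df_individual.select([F.col(f.name).cast(f.dataType) for f in schema.fields])
df_result.printSchema()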

How to access dataframe returned by UDF inside column in Spark Streaming

I'm writing code to perform step detection on streaming sensor data. To do so, I separate all incoming values as sensor parameters. After that I detect the steps with a UDF. Inside this UDF I create a dict to store two lists of important timestamps per detected step.
@udf(StructType())
def detect_steps(timestamp, x, y, z):
    .....
    d = dict()
    d['timestamp_ic'] = times_ic
    d['timestamp_to'] = timestamp_to
    ......
    return d
In the Spark main function I created a dataframe that calculates all these steps in a sliding window like so:
stepData = LLLData \
    .withWatermark("time", "10 seconds") \
    .groupBy(
        window("time", windowDuration="5 seconds", slideDuration="1 second"),
        "sensor"
    ) \
    .agg(
        collect_list("time").alias("time_window"),
        collect_list(sensorData.Acceleration_x).alias("Acceleration_x_window"),
        collect_list(sensorData.Acceleration_y).alias("Acceleration_y_window"),
        collect_list(sensorData.Acceleration_z).alias("Acceleration_z_window"),
    ) \
    .select(
        "window",
        "sensor",
        detect_steps("time_window", "Acceleration_x_window", "Acceleration_y_window", "Acceleration_z_window")
    )
Now, when I print the df schema it looks like this:
|-- window: struct (nullable = true)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- sensor: string (nullable = true)
|-- detect_steps("time_window", "Acceleration_x_window", "Acceleration_y_window", "Acceleration_z_window"): string (nullable = true)
While I want this:
|-- window: struct (nullable = true)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- sensor: string (nullable = true)
|-- timestamp_ic: string (nullable = true)
|-- timestamp_to: string (nullable = true)
However, I cannot perform a select statement on the UDF column in stepData; it raises the error: Column is not iterable.
When I try to alter the root dataframe afterwards, by for example converting the ic column to a Spark dataframe like so:
df_stepData = spark.createDataFrame(data=stepData.select("ic"))
it gives me TypeError: data is already a DataFrame.
Looking at the dataframe schema, however, ic is typed as string.
I've also tried to read ic as a json file, but that gives the following error:
TypeError: path can be only string, list or RDD
I could fix the problem by triggering the detect_steps UDF twice, the first one returning timestamp_ic and the second one returning timestamp_to, in order to get two columns, but I'm sure there is a better, more efficient way.
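One approach that may avoid calling the UDF twice (a sketch only; the return schema below, with timestamp_ic and timestamp_to as arrays of strings, is an assumption, so adjust it to whatever detect_steps actually produces) is to declare the fields in the UDF's return StructType and then expand the aliased struct column:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

# assumed return schema: two lists of timestamps kept as strings
step_schema = StructType([
    StructField("timestamp_ic", ArrayType(StringType())),
    StructField("timestamp_to", ArrayType(StringType())),
])

@F.udf(step_schema)
def detect_steps(timestamp, x, y, z):
    # ... step detection logic producing times_ic and timestamp_to ...
    return {"timestamp_ic": times_ic, "timestamp_to": timestamp_to}
In the aggregation, alias the UDF call (for example .alias("steps")) and finish with .select("window", "sensor", "steps.*") so that timestamp_ic and timestamp_to become top-level columns.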

How to convert column with string type to int form in pyspark data frame?

I have a dataframe in pyspark. Some of its numerical columns contain nan, so when I read the data and check the schema of the dataframe, those columns have string type.
How can I change them to int type? I replaced the nan values with 0 and checked the schema again, but it still shows string type for those columns. I am using the code below:
data_df = sqlContext.read.format("csv").load('data.csv',header=True, inferSchema="true")
data_df.printSchema()
data_df = data_df.fillna(0)
data_df.printSchema()
My data looks like this: the columns Plays and drafts contain integer values, but because of the nan values present in these columns, they are treated as string type.
from pyspark.sql.types import IntegerType
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))
You can run a loop for each column, but this is the simplest way to convert a string column into an integer.
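For completeness, the loop version mentioned above could look like this (using the two columns from the question):
from pyspark.sql.types import IntegerType

for c in ["Plays", "drafts"]:
    data_df = data_df.withColumn(c, data_df[c].cast(IntegerType()))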
You could use cast('int') after replacing NaN with 0:
data_df = data_df.withColumn("Plays", data_df.Plays.cast('int'))
Another way to do it is using StructField, if you have multiple fields that need to be modified.
Ex:
from pyspark.sql.types import StructField, IntegerType, StructType, StringType

newDF = [StructField('CLICK_FLG', IntegerType(), True),
         StructField('OPEN_FLG', IntegerType(), True),
         StructField('I1_GNDR_CODE', StringType(), True),
         StructField('TRW_INCOME_CD_V4', StringType(), True),
         StructField('ASIAN_CD', IntegerType(), True),
         StructField('I1_INDIV_HHLD_STATUS_CODE', IntegerType(), True)]
finalStruct = StructType(fields=newDF)
df = spark.read.csv('ctor.csv', schema=finalStruct)
Output:
Before
root
|-- CLICK_FLG: string (nullable = true)
|-- OPEN_FLG: string (nullable = true)
|-- I1_GNDR_CODE: string (nullable = true)
|-- TRW_INCOME_CD_V4: string (nullable = true)
|-- ASIAN_CD: integer (nullable = true)
|-- I1_INDIV_HHLD_STATUS_CODE: string (nullable = true)
After:
root
|-- CLICK_FLG: integer (nullable = true)
|-- OPEN_FLG: integer (nullable = true)
|-- I1_GNDR_CODE: string (nullable = true)
|-- TRW_INCOME_CD_V4: string (nullable = true)
|-- ASIAN_CD: integer (nullable = true)
|-- I1_INDIV_HHLD_STATUS_CODE: integer (nullable = true)
This is a slightly long procedure for casting, but the advantage is that all the required fields can be handled at once.
Note that if the schema assigns data types to only the required fields, the resulting dataframe will contain only those fields.
