py4j.Py4JException: Method and([class java.lang.String]) does not exist - python

I have a Spark dataframe with the below schema.
root
|-- ME_KE: string (nullable = true)
|-- CSPD_CAT: string (nullable = true)
|-- EFF_DT: string (nullable = true)
|-- TER_DT: string (nullable = true)
|-- CREATE_DTM: string (nullable = true)
|-- ELIG_IND: string (nullable = true)
Basically, I am trying to convert Spark SQL code into the equivalent DataFrame API calls.
df = spark.read.format('csv').load(SourceFilesPath + "\\cutdetl.csv", inferSchema=True, header=True)
df.createOrReplaceTempView("cutdetl")
spark.sql(f"""select
    me_ke,
    eff_dt,
    ter_dt,
    create_dtm
from
    cutdetl
where
    (elig_ind = 'Y') and
    ((to_date({start_dt},'dd-mon-yyyy') between eff_dt and ter_dt) or
     (eff_dt between to_date({start_dt},'dd-mon-yyyy') and to_date({end_dt},'dd-mon-yyyy')))
""")
Below is the code I have tried.
df1 = (df.select("me_ke", "eff_dt", "ter_dt", "elig_ind")
       .where(col("elig_ind") == "Y" & (F.to_date('31-SEP-2022', 'dd-mon-yyyy')
              .between(col("mepe_eff_dt"), col("mepe_term_dt"))) |
              (F.to_date(col("eff_dt"))
               .between(F.to_date('31-DEC-2022'), F.to_date('31-DEC-2022')))))
I am getting the below error:
py4j.Py4JException: Method and([class java.lang.String]) does not exist
Could anyone help with converting the above code to the DataFrame API?

I'd go like this:
from pyspark.sql import functions as F
from pyspark.sql.functions import col

df = spark.read.format('csv').load(SourceFilesPath + "\\cutdetl.csv", inferSchema=True, header=True)
df.createOrReplaceTempView("cutdetl")

# Build the two date boundaries once as Column expressions; note that Spark's
# to_date pattern for values like '31-DEC-2022' is 'dd-MMM-yyyy'.
start = F.to_date(F.lit(start_dt), 'dd-MMM-yyyy')
end = F.to_date(F.lit(end_dt), 'dd-MMM-yyyy')

df1 = df.filter(col("elig_ind") == "Y")
df1 = df1.filter(col("eff_dt").between(start, end) | start.between(col("eff_dt"), col("ter_dt")))
df1 = df1.select("me_ke", "eff_dt", "ter_dt", "create_dtm")
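As a side note on the original error: in Python, & and | bind more tightly than ==, so col("elig_ind") == "Y" & (...) is evaluated as col("elig_ind") == ("Y" & (...)), and the bare string "Y" ends up being passed to the JVM-side and() method, which is exactly the Py4JException above. Wrapping every comparison in its own parentheses avoids it; a minimal sketch (dates and format here are illustrative):
from pyspark.sql import functions as F
from pyspark.sql.functions import col

# Each comparison gets its own parentheses before combining with & / |.
cond = (col("elig_ind") == "Y") & (
    F.to_date(F.lit("31-DEC-2022"), "dd-MMM-yyyy").between(col("eff_dt"), col("ter_dt"))
)
df1 = df.where(cond)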

Related

Spark 3.2.2: Joining the same dataframe multiple times does not drop column

We have some PySpark code that joins a table, table_a, twice to another table, table_b, using the following code. After joining the table twice, we drop the key_hash column from the output DataFrame.
This code was working fine in Spark 3.0.1. Since upgrading to Spark 3.2.2 the behaviour has changed: during the first transform operation the key_hash field gets dropped from the output DataFrame, but when the second transform operation is executed the key_hash field still remains in output_df.
Can someone please explain what has changed in Spark's behaviour to cause this issue?
def tr_join_sac_user(self, df_a):
    def inner(df_b):
        return (
            df_b.join(df_a, on=df_b["sac_key_hash"] == df_a["key_hash"], how="left")
            .drop(df_a.key_hash)
            .drop(df_b.sac_key_hash)
        )
    return inner

def tr_join_sec_user(self, df_a):
    def inner(df_b):
        return (
            df_b.join(df_a, on=df_b["sec_key_hash"] == df_a["key_hash"], how="left")
            .drop(df_a.key_hash)
            .drop(df_b.sec_key_hash)
        )
    return inner
table_a_df = spark.read.format("delta").load("/path/to/table_a")
table_b_df = spark.read.format("delta").load("/path/to/table_b")
output_df = table_b_df.transform(tr_join_sac_user(table_a_df))
output_df = output_df.transform(tr_join_sec_user(table_a_df))
If we use .drop('key_hash') instead of .drop(df_a.key_hash), that seems to work and the column does get dropped in the second transform as well (see the sketch below). I would like to understand what has changed in Spark behaviour between these versions (or if it's a bug), as this might have an impact in other places in our codebase as well.
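For reference, a minimal sketch of the name-based drop mentioned above, applied to one of the transforms from the question (same shape as the original code; presented only as the workaround the asker describes, not as an explanation of the behaviour change):
def tr_join_sac_user(self, df_a):
    def inner(df_b):
        return (
            df_b.join(df_a, on=df_b["sac_key_hash"] == df_a["key_hash"], how="left")
            # Dropping by column name instead of by the df_a.key_hash reference
            # removes the column after both transforms on Spark 3.2.2.
            .drop("key_hash")
            .drop("sac_key_hash")
        )
    return inner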
Hi, I also have an issue with this one. I don't know if it's a bug or not, but it doesn't seem to happen all the time.
utilization_raw = time_lab.crossJoin(approved_listing)
utilization_raw = (
    utilization_raw
    .join(availability_series,
          (utilization_raw.date_series == availability_series.availability_date) &
          (utilization_raw.listing_id == availability_series.listing_id), "left")
    .drop(availability_series.listing_id).dropDuplicates()    # --> WORKING
    .join(request_series,
          (utilization_raw.date_series == request_series.request_date_series) &
          (utilization_raw.listing_id == request_series.listing_id), "left")
    .drop(request_series.listing_id)                           # --> WORKING
    .join(listing_pricing,
          (utilization_raw.date_series == listing_pricing.price_created_date) &
          (utilization_raw.listing_id == listing_pricing.listing_id), "left")
    .drop(listing_pricing.listing_id)                          # --> NOT WORKING
)
Here's the result of printSchema()
root
|-- date_series: date (nullable = false)
|-- week_series: date (nullable = true)
|-- month_series: date (nullable = true)
|-- woy_num: integer (nullable = false)
|-- doy_num: integer (nullable = false)
|-- dow_num: integer (nullable = false)
|-- listing_id: integer (nullable = true)
|-- is_driverless: integer (nullable = false)
|-- listing_deleted_at: date (nullable = true)
|-- daily_gmv: decimal(38,23) (nullable = true)
|-- daily_nmv: decimal(38,23) (nullable = true)
|-- daily_calendar_gmv: decimal(31,13) (nullable = true)
|-- daily_calendar_nmv: decimal(31,13) (nullable = true)
|-- active_booking: long (nullable = true)
|-- is_available: integer (nullable = false)
|-- is_requested: integer (nullable = false)
|-- listing_id: integer (nullable = true) --> duplicated
|-- base_price: decimal(10,2) (nullable = true)
Update: what we did was update the Databricks runtime version from 9.1 to 11.3.

Rename nested struct columns to all lower case in a Spark DataFrame using PySpark

A similar solution is already available in Scala, but I need a solution in PySpark. I am new to Python and need your help with this.
Below is the link to the Scala solution, for a better understanding of the requirement:
Rename nested struct columns in a Spark DataFrame
I am trying to change the names of a DataFrame's columns in Python. I can easily change the column names for top-level fields, but I'm facing difficulty converting array/struct columns.
Below is my DataFrame schema.
|-- VkjLmnVop: string (nullable = true)
|-- KaTasLop: string (nullable = true)
|-- AbcDef: struct (nullable = true)
| |-- UvwXyz: struct (nullable = true)
| | |-- MnoPqrstUv: string (nullable = true)
| | |-- ManDevyIxyz: string (nullable = true)
But I need the schema like below
|-- vkjlmnvop: string (nullable = true)
|-- kataslop: string (nullable = true)
|-- abcdef: struct (nullable = true)
| |-- uvwxyz: struct (nullable = true)
| | |-- mnopqrstuv: string (nullable = true)
| | |-- mandevyixyz: string (nullable = true)
How can I change struct column names dynamically?
I have also found a different solution with similar logic in fewer lines.
import pyspark.sql.functions as spf

ds = {'AbcDef': {'UvwXyz': {'VkjLmnVop': 'abcd'}}, 'HijKS': 'fgds'}
df = spark.read.json(sc.parallelize([ds]))
df.printSchema()
"""
root
|-- AbcDef: struct (nullable = true)
| |-- UvwXyz: struct (nullable = true)
| | |-- VkjLmnVop: string (nullable = true)
|-- HijKS: string (nullable = true)
"""

# Lower-case the top-level column names.
for i in df.columns:
    df = df.withColumnRenamed(i, i.lower())

# Extract each column's type string from the DataFrame repr
# (e.g. "abcdef: struct<UvwXyz:struct<VkjLmnVop:string>>").
schemaDef = [y.replace("]", "") for y in
             [x.replace("DataFrame[", "") for x in df.__str__().split(", ")]]

# Re-cast each column to the lower-cased version of its own type,
# which renames the nested struct fields.
for j in schemaDef:
    columnName = j.split(": ")[0]
    dataType = j.split(": ")[1]
    df = df.withColumn(columnName, spf.col(columnName).cast(dataType.lower()))

df.printSchema()
"""
root
|-- abcdef: struct (nullable = true)
| |-- uvwxyz: struct (nullable = true)
| | |-- vkjlmnvop: string (nullable = true)
|-- hijks: string (nullable = true)
"""
I guess this is what you wanted. Hope it helps!
def get_column_wise_schema(df_string_schema, df_columns):
    # Returns a dictionary containing column name and corresponding column schema as string.
    column_schema_dict = {}
    i = 0
    while i < len(df_columns):
        current_col = df_columns[i]
        next_col = df_columns[i + 1] if i < len(df_columns) - 1 else None
        current_col_split_key = '[' + current_col + ': ' if i == 0 else ' ' + current_col + ': '
        next_col_split_key = ']' if i == len(df_columns) - 1 else ', ' + next_col + ': '
        column_schema_dict[current_col] = df_string_schema.split(current_col_split_key)[1].\
            split(next_col_split_key)[0]
        i += 1
    return column_schema_dict

def convert_colnames_to_lower(spark_df):
    columns = spark_df.columns
    column_wise_schema_dict = get_column_wise_schema(spark_df.__str__(), columns)
    col_exprs = []
    for column_name in columns:
        column_schema_lowercase = column_wise_schema_dict[column_name]
        col_exprs.append(spf.col(column_name).cast(column_schema_lowercase).
                         alias(column_name.lower()))
    return spark_df.select(*col_exprs)
ds = {'AbcDef': {'UvwXyz': {'VkjLmnVop': 'abcd'}}, 'HijKS': 'fgds'}
df = spark.read.json(sc.parallelize([ds]))
df.printSchema()
"""
root
|-- AbcDef: struct (nullable = true)
| |-- UvwXyz: struct (nullable = true)
| | |-- VkjLmnVop: string (nullable = true)
|-- HijKS: string (nullable = true)
"""
converted_df = convert_colnames_to_lower(df)
converted_df.printSchema()
"""
root
|-- abcdef: struct (nullable = true)
| |-- uvwxyz: struct (nullable = true)
| | |-- vkjlmnvop: string (nullable = true)
|-- hijks: string (nullable = true)
"""

PySpark - Json explode nested with Struct and array of struct

I am trying to parse nested JSON from some sample data. Below is the printed schema:
|-- batters: struct (nullable = true)
| |-- batter: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- id: string (nullable = true)
| | | |-- type: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- ppu: double (nullable = true)
|-- topping: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- type: string (nullable = true)
|-- type: string (nullable = true)
I am trying to explode batters and topping separately and combine them.
df_batter = df_json.select("batters.*")
df_explode1 = df_batter.withColumn("batter", explode("batter")).select("batter.*")
df_explode2 = df_json.withColumn("topping", explode("topping")).select(
    "id", "type", "name", "ppu", "topping.*")
I am unable to combine the two data frames.
I tried using a single query:
exploded1 = df_json.withColumn("batter", df_batter.withColumn("batter", explode("batter"))) \
    .withColumn("topping", explode("topping")) \
    .select("id", "type", "name", "ppu", "topping.*", "batter.*")
But I am getting an error. Kindly help me solve it. Thanks.
You basically have to explode the arrays together using arrays_zip, which returns a merged array of structs. Try this; I haven't tested it, but it should work.
from pyspark.sql import functions as F

df_json.select("id", "type", "name", "ppu", "topping", "batters.*") \
    .withColumn("zipped", F.explode(F.arrays_zip("batter", "topping"))) \
    .select("id", "type", "name", "ppu", "zipped.*").show()
You could also do it one by one:
from pyspark.sql import functions as F

df1 = df_json.select("id", "type", "name", "ppu", "topping", "batters.*") \
    .withColumn("batter", F.explode("batter")) \
    .select("id", "type", "name", "ppu", "topping", "batter")
df1.withColumn("topping", F.explode("topping")).select("id", "type", "name", "ppu", "topping.*", "batter.*")
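For what it's worth, arrays_zip pairs elements by position, so the merged structs assume batter and topping line up index by index; if one array is shorter, the missing side comes back as NULL. A tiny standalone illustration (example data made up for this note):
from pyspark.sql import functions as F

# Two arrays of different lengths zipped together: the third struct
# pairs 3 with NULL because the second array has only two elements.
demo = spark.createDataFrame([([1, 2, 3], ["a", "b"])], ["x", "y"])
demo.select(F.explode(F.arrays_zip("x", "y")).alias("z")).select("z.*").show()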

create nested json file managing null values

I'm working with PySpark and I have the following code, which creates a nested JSON file from a dataframe with some fields (product, quantity, from, to) nested in "requirements". Below is the code that creates the JSON, followed by one row as an example:
final2 = final.groupby('identifier', 'plant', 'family', 'familyDescription', 'type', 'name', 'description', 'batchSize', 'phantom', 'makeOrBuy', 'safetyStock', 'unit', 'unitPrice', 'version').agg(F.collect_list(F.struct(F.col("product"), F.col("quantity"), F.col("from"), F.col("to"))).alias('requirements'))
{"identifier":"xxx","plant":"xxxx","family":"xxxx","familyDescription":"xxxx","type":"assembled","name":"xxxx","description":"xxxx","batchSize":20.0,"phantom":"False","makeOrBuy":"make","safetyStock":0.0,"unit":"PZ","unitPrice":xxxx,"version":"0001","requirements":[{"product":"yyyy","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"zzzz","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"kkkk","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"wwww","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"bbbb","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"}]}
The schema of the final2 dataframe is the following:
|-- identifier: string (nullable = true)
|-- plant: string (nullable = true)
|-- family: string (nullable = true)
|-- familyDescription: string (nullable = true)
|-- type: string (nullable = false)
|-- name: string (nullable = true)
|-- description: string (nullable = true)
|-- batchSize: double (nullable = true)
|-- phantom: string (nullable = false)
|-- makeOrBuy: string (nullable = false)
|-- safetyStock: double (nullable = true)
|-- unit: string (nullable = true)
|-- unitPrice: double (nullable = true)
|-- version: string (nullable = true)
|-- requirements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- product: string (nullable = true)
| | |-- quantity: double (nullable = true)
| | |-- from: timestamp (nullable = true)
| | |-- to: timestamp (nullable = true)
I'm facing a problem because I have to add to my final dataframe some rows with product, quantity, from, and to all NULL: using the code above I get "requirements":[{}], but the database where I write the file (MongoDB) raises an error on the empty JSON object because it expects "requirements":[] for null values.
I've tried with
import pyspark.sql.functions as F

df = final_prova2.withColumn(
    "requirements",
    F.when(final_prova2.requirements.isNull(), F.array())
     .otherwise(final_prova2.requirements))
but it doesn't work.
Any suggestion on how to modify the code? I'm struggling to find a solution (I don't even know if a solution is possible given the structure required).
Thanks
You need to check if all 4 fields of requirements are NULL, not the column itself. One way you can fix this is to adjust the collect_list aggregate function when creating final2:
import pyspark.sql.functions as F

final2 = final.groupby('identifier', 'plant', 'family', 'familyDescription', 'type', 'name',
                       'description', 'batchSize', 'phantom', 'makeOrBuy', 'safetyStock',
                       'unit', 'unitPrice', 'version') \
    .agg(F.expr("""
        collect_list(
            IF(coalesce(quantity, product, from, to) is NULL
               , NULL
               , struct(product, quantity, from, to)
            )
        )
    """).alias('requirements'))
Where:
we use a SQL expression IF(condition, true_value, false_value) to set up the argument for collect_list
the condition coalesce(quantity, product, from, to) is NULL tests whether all four listed columns are NULL: if it's true we return NULL, otherwise we return struct(product, quantity, from, to). Since collect_list ignores NULL entries, rows where all four fields are NULL contribute nothing and requirements ends up as an empty array [].
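The same idea can also be written with the DataFrame API instead of a SQL string; a sketch below, with the column list copied from the question (behaviour not re-tested against the real data):
import pyspark.sql.functions as F

# NULL when every field is NULL, otherwise a struct; collect_list drops the NULLs.
requirement = (F.when(F.coalesce("quantity", "product", "from", "to").isNull(), F.lit(None))
                .otherwise(F.struct("product", "quantity", "from", "to")))

final2 = final.groupby(
    'identifier', 'plant', 'family', 'familyDescription', 'type', 'name', 'description',
    'batchSize', 'phantom', 'makeOrBuy', 'safetyStock', 'unit', 'unitPrice', 'version'
).agg(F.collect_list(requirement).alias('requirements'))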

Flatten XML dataframe in spark

from pyspark.sql.functions import *

def flatten_df(nested_df):
    exist = True
    while exist:
        flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
        nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
        if len(nested_cols) > 0:
            print(nested_cols)
            # Pull every field of every struct column up to the top level,
            # prefixing it with the parent column name.
            flat_df = nested_df.select(flat_cols +
                                       [col("`" + nc + "`.`" + c + "`").alias((nc + '_' + c).replace(".", "_"))
                                        for nc in nested_cols
                                        for c in nested_df.select("`" + nc + "`.*").columns])
            nested_df = flat_df
        else:
            exist = False
    # Return the loop variable so the function also works when the input
    # has no struct columns at all (flat_df would be undefined in that case).
    return nested_df
df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "GetDocument").load("/FileStore/tables/test.xml")
df1=flatten_df(df)
Here is the code I am using to flatten an XML document. Basically, I want to take an XML file with nested elements and flatten all of it into a single row without any struct datatypes, so that each value is a column. The above code works for the test cases I have done, but I have tried it on a very large XML file and after a couple of rounds of flattening (in the while loop) it breaks with the following error:
'Ambiguous reference to fields StructField(_Id,StringType,true), StructField(_id,StringType,true);'
I assume this is because it is trying to create two separate columns with the same name. How can I avoid this while keeping my code generic for any XML?
One thing to note: it is okay to have arrays as a datatype for a column, since I will be exploding those arrays into separate rows in a later step.
Update example
Original DF -
|-- Order: long (nullable = true)
|-- attval: string (nullable = true)
|-- children: struct (nullable = true)
| |-- id: string(nullable = true)
| |-- att: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Order: long (nullable = true)
| | | |-- attval: string (nullable = true)
DF after function -
|-- Order: long (nullable = true)
|-- attval: string (nullable = true)
|-- children_id: string(nullable = true)
|-- children_att: array (nullable = true)
| |-- children_att_element_Order: long (nullable = true)
| |-- children_att_element_attval: string (nullable = true)
I was facing a similar issue and was able to parse my XML file as follows:
Install the following Maven library: “com.databricks:spark-xml_2.10:0.4.1” on Databricks
Upload your file on DBFS using the following path: FileStore > tables > xml > sample_data
Run the following code:
data = spark.read.format("com.databricks.spark.xml").option("rootTag", "col1").option("rowTag", "col2").option("rowTag", "col3").load("dbfs:/FileStore/tables/sample_data.xml")
display(data)
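Regarding the "Ambiguous reference to fields" error itself: it is triggered by two struct fields whose names differ only by case (_Id vs _id). One possible workaround (a sketch, not verified against the asker's XML) is to make Spark's analyzer case-sensitive before running the flattening loop, so both fields survive as distinct columns:
# Treat _Id and _id as distinct names instead of an ambiguous pair.
spark.conf.set("spark.sql.caseSensitive", "true")

df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "GetDocument")
      .load("/FileStore/tables/test.xml"))
df1 = flatten_df(df)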
