Can't associate temp view with database in spark session

I'm trying to create a temp view in Spark from a CSV file.
I need to test my script locally in a way that reproduces my production scenario, where I use AWS Glue jobs with databases and tables.
In the code below, I create a database in my Spark session and switch to it; after that, I create a temp view.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pulsar_data").getOrCreate()
# the CSV reader option is spelled inferSchema
df = spark.read.format('csv')\
    .options(inferSchema=True)\
    .options(header=True)\
    .load('pulsar_stars.csv')
spark.sql('CREATE DATABASE IF NOT EXISTS MYDB')
spark.sql('USE MYDB')
df.createOrReplaceTempView('MYDB.TB_PULSAR_STARS')
spark.catalog.listTables()
spark.sql('SELECT * FROM MYDB.TB_PULSAR_STARS').show()
However, when I try to select from db.table, Spark can't resolve the temp view under my database and throws the following error:
*** pyspark.sql.utils.AnalysisException: Table or view not found: MYDB.TB_PULSAR_STARS; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [MYDB, TB_PULSAR_STARS], [], false
Debugging my code with pdb, I have listed my spark session catalog, where I find that my table is in fact associated:
(Pdb) spark.catalog.listTables()
[Table(name='tb_pulsar_stars', database='MYDB', description=None, tableType='TEMPORARY', isTemporary=True)]
How can I make this relationship work?

The name of a temporary view registered from a DataFrame can only have one part; it cannot be qualified with a database name. Spark checks this explicitly in its code. I would expect your code to throw AnalysisException: CREATE TEMPORARY VIEW or the corresponding Dataset APIs only accept single-part view names, but got: MYDB.TB_PULSAR_STARS. I'm not sure why your case behaves differently.
Anyway, use:
df.createOrReplaceTempView('TB_PULSAR_STARS')
spark.sql('SELECT * FROM TB_PULSAR_STARS').show()
And if you need to actually write this data to a table, create it using:
spark.sql("CREATE TABLE MYDB.TB_PULSAR_STARS AS SELECT * FROM TB_PULSAR_STARS")
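If the same script has to accept qualified names such as MYDB.TB_PULSAR_STARS, one option is to strip the database part before registering the view. The sketch below is only an illustration: register_temp_view is a hypothetical helper, not part of the Spark API, and it assumes names without backtick-quoted dots.

```python
def register_temp_view(df, name):
    """Register df as a temp view under the last part of `name`.

    Hypothetical helper: 'MYDB.TB_PULSAR_STARS' is registered as
    'TB_PULSAR_STARS', because temp view names must be single-part.
    Assumes the name contains no backtick-quoted dots.
    """
    view_name = name.rsplit(".", 1)[-1]
    df.createOrReplaceTempView(view_name)
    return view_name
```

After register_temp_view(df, 'MYDB.TB_PULSAR_STARS'), the view would be queried as SELECT * FROM TB_PULSAR_STARS, without the database prefix.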

Related

Snowpark Snowflake Python to run a sql statement and export to Excel

I'm creating a Snowflake procedure using the Snowpark (Python) package. It executes a query into a Snowflake dataframe, and I would like to export that to Excel. How can I accomplish that? Is there a better approach? The end goal is to export the query results to Excel. It needs to be in a Snowflake procedure since we already have other "parent" procedures. Thanks!
CREATE OR REPLACE PROCEDURE EXPORT_SP()
RETURNS string not null
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python', 'pandas')
HANDLER = 'run'
AS
$$
import pandas
def run(snowpark_session):
    ## Execute the query into a Snowflake dataframe
    results_df = snowpark_session.sql('''
        SELECT * FROM
        MY TABLES
        ;
    ''').collect()
    return results_df
$$
;

In general, you can do this by:
1. "Unloading" the data from the table using the COPY INTO <location> command.
2. Using the GET command to copy the data to your local filesystem.
3. Opening the file with Excel. If you used the CSV format and the appropriate format options in step 1, you should be able to easily open the resulting data with Excel.
Snowpark directly supports step 1 through the DataFrameWriter.copy_into_location method. An instance of DataFrameWriter is available in the DataFrame.write attribute, as described in the Snowpark documentation.
Snowpark also directly supports step 2 in the FileOperation.get method. As per the example in that documentation page, you can access this method using the .file attribute of your Snowpark session object.
Putting this all together, you should be able to do something like this in Snowpark to save a single exported file into the current working directory:
source_table = "my_table"
# stage references start with "@"
unload_location = "@my_stage/export.csv"
def run(session):
    df = session.table(source_table)
    df.write.copy_into_location(
        unload_location,
        file_format_type="csv",
        format_type_options=dict(
            compression="none",
            field_delimiter="\t",
        ),
        single=True,
        header=True,
    )
    session.file.get(unload_location, ".")
You can of course use session.sql() instead of session.table() as needed. You might also want to consider unloading data to the stage associated with the source data, instead of creating a separate stage, i.e. if the data is from table my_table then you would unload to the table stage @%my_table.
For more details, refer to the documentation pages I linked, which contain important reference information as well as several examples.
Note that I am not sure if session.file is accessible from inside a stored procedure; you will have to experiment to see what works in your specific situation.
As always, remember that this is untested code written by an unpaid volunteer. Always triple-check and test any code that is provided here. Please do ask questions in the comments if anything is still unclear.
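Since the question starts from a SQL statement rather than a table, the unload step could be wrapped as below. This is a minimal sketch, also untested against a real account: export_query_to_stage is a hypothetical name, the stage and query are placeholders, and it only covers step 1 (the file still has to be fetched and opened in Excel).

```python
def export_query_to_stage(session, query, unload_location):
    """Unload the result of `query` to one tab-delimited file on a stage.

    `session` is expected to be a Snowpark Session; `unload_location`
    is a stage path such as "@my_stage/export.csv" (placeholder).
    """
    df = session.sql(query)
    # Same options as in the answer above: uncompressed, tab-delimited,
    # with a header row, written as a single file.
    return df.write.copy_into_location(
        unload_location,
        file_format_type="csv",
        format_type_options={"compression": "none", "field_delimiter": "\t"},
        header=True,
        single=True,
    )
```

Inside a stored procedure handler this would be called as export_query_to_stage(snowpark_session, 'SELECT ...', '@my_stage/export.csv').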

PySpark: Not able to read hive orc table using spark.sql

Used spark to write df to hdfs:
df.write.partitionBy("date").mode("append").format("ORC").save("/tmp/table1")
Created Hive External Table on top of this (default.table1)
I am able to read this table using beeline.
select * from default.table1; --> works fine
I am able to read this folder using spark
spark.read.orc("/tmp/table1").show() # --> works fine
However, when I use Spark to read this Hive table, I get an error:
spark.sql("select * from default.table1").show() # --> error
Taking a count of this table also works fine:
spark.sql("select count(*) from default.table1").show() # --> works fine
Also, when I write the Spark DataFrame as CSV to HDFS, I have no issues reading it through spark.sql (Hive).
The error message is:
"Py4JJavaError: An error occurred while calling o192.showString."

I managed to fix the issue by correcting the Hive DDL.
It must include the line "WITH SERDEPROPERTIES ..." so that hiveContext is able to read the data in Spark.

MySQL table definition has changed error when reading from a table that has been written to by PySpark

I am currently working on a data pipeline with pyspark. As part of the pipeline, I write a spark dataframe to mysql using the following function:
def jdbc_insert_overwrite_table(df, mysql_user, mysql_pass, mysql_host, mysql_port, mysql_db, num_executors, table_name,
                                logger):
    mysql_url = "jdbc:mysql://{}:{}/{}?characterEncoding=utf8".format(mysql_host, mysql_port, mysql_db)
    logger.warn("JDBC Writing to table " + table_name)
    df.write.format('jdbc')\
        .options(
            url=mysql_url,
            driver='com.mysql.cj.jdbc.Driver',
            dbtable=table_name,
            user=mysql_user,
            password=mysql_pass,
            truncate=True,
            numpartitions=num_executors,
            batchsize=100000
        ).mode('Overwrite').save()
This works with no issue. However, later on in the pipeline (within the same PySpark app/ spark session), this table is a dependency for another transformation, and I try reading from this table using the following function:
def read_mysql_table_in_session_df(spark, mysql_conn, query_str, query_schema):
    cursor = mysql_conn.cursor()
    cursor.execute(query_str)
    records = cursor.fetchall()
    df = spark.createDataFrame(records, schema=query_schema)
    return df
And I get this MySQL error: Error 1412: Table definition has changed, please retry transaction.
I've been able to resolve this by closing the connection and calling ping(reconnect=True) on it, but I don't like this solution as it feels like a band-aid.
Any ideas why I'm getting this error? I've confirmed writing to the table does not change the table definition (schema wise, at least).
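One possible explanation, offered as an assumption rather than a confirmed diagnosis: with truncate=True the overwrite issues a TRUNCATE TABLE, which MySQL treats as DDL, and a long-lived connection that still has a transaction open from before the write would then hit error 1412 on its next read. Under that assumption, ending the open transaction before reading may be enough, without reconnecting. The sketch below reuses the function from the question and only adds the commit() call and cursor cleanup.

```python
def read_mysql_table_in_session_df(spark, mysql_conn, query_str, query_schema):
    # End any transaction left open on this connection so the next
    # SELECT starts from a fresh snapshot (assumed cause of error 1412).
    mysql_conn.commit()
    cursor = mysql_conn.cursor()
    try:
        cursor.execute(query_str)
        records = cursor.fetchall()
    finally:
        cursor.close()
    return spark.createDataFrame(records, schema=query_schema)
```

If the error persists with this change, the stale-transaction assumption is probably wrong for your setup.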

AWS Glue and update duplicating data

I'm using AWS Glue to move multiple files from S3 to an RDS instance. Each day I get a new file in S3 which may contain new data, but can also contain a record I have already saved with some updated values. If I run the job multiple times I will of course get duplicate records in the database. Instead of inserting multiple records, I want Glue to update the existing record if it notices a field has changed. Each record has a unique id. Is this possible?

I followed an approach similar to the second option suggested by Yuriy: get the existing data as well as the new data, merge the two, and write the result in overwrite mode. The following code should give you an idea of how to solve this problem.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
glueContext = GlueContext(sc)
# get your source data
src_data = glueContext.create_dynamic_frame.from_catalog(database = src_db, table_name = src_tbl)
src_df = src_data.toDF()
# get your destination data
dst_data = glueContext.create_dynamic_frame.from_catalog(database = dst_db, table_name = dst_tbl)
dst_df = dst_data.toDF()
# Now merge the two data frames without duplicates: keep every source row, plus the
# destination rows whose id (assumed name of the unique key column) is not in the source
merged_df = dst_df.join(src_df, on='id', how='left_anti').unionByName(src_df)
# Finally save data to destination with OVERWRITE mode
merged_df.write.format('jdbc').options(url = dest_jdbc_url,
                                       user = dest_user_name,
                                       password = dest_password,
                                       dbtable = dest_tbl).mode("overwrite").save()

Unfortunately there is no elegant way to do it with Glue. If you were writing to Redshift, you could use postactions to implement a Redshift merge operation. However, that is not possible for other JDBC sinks (afaik).
Alternatively, in your ETL script you can load the existing data from the database and filter out existing records before saving. However, if your DB table is big, the job may take a while to process it.
Another approach is to write into a staging table with mode 'overwrite' first (replacing the existing staging data) and then make a call to the DB via its API to copy only the new records into the final table.

I have used INSERT INTO table ... ON DUPLICATE KEY ... for UPSERTs into Aurora RDS running the MySQL engine. Maybe this could serve as a reference for your use case. The Spark JDBC writer cannot do this, since it currently supports only the APPEND, OVERWRITE, and ERROR modes.
I am not sure which RDS database engine you are using; the following is an example for MySQL UPSERTs.
Please see this reference, where I have posted a solution using INSERT INTO TABLE ... ON DUPLICATE KEY for MySQL:
Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array
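As a rough illustration of the ON DUPLICATE KEY approach, the statement could be generated from the column names. This is a sketch under assumptions: build_mysql_upsert is a hypothetical helper, the table and column names are placeholders, the key columns must be covered by a primary or unique key in MySQL, and the names are not escaped, so they must come from trusted code rather than user input.

```python
def build_mysql_upsert(table, columns, key_columns):
    """Return an INSERT ... ON DUPLICATE KEY UPDATE statement.

    Uses %s placeholders, as MySQL DB-API drivers expect. Every column
    that is not a key column is overwritten with the incoming value.
    """
    update_columns = [c for c in columns if c not in key_columns]
    if not update_columns:
        raise ValueError("need at least one non-key column to update")
    column_list = ", ".join(f"`{c}`" for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    assignments = ", ".join(f"`{c}` = VALUES(`{c}`)" for c in update_columns)
    return (
        f"INSERT INTO `{table}` ({column_list}) VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {assignments}"
    )
```

The resulting string can be passed to cursor.executemany(sql, rows) with one tuple per record. Note that MySQL 8.0.20 and later deprecate VALUES() in this position in favor of row aliases, although it still works.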

Dynamically creating table from csv file using psycopg2

I would like to get some clarity on something I thought I already understood. Is there any way to create a table using psycopg2, or any other Python Postgres database adapter, with a name corresponding to the .csv file and (probably most important) with the columns that are specified in the .csv file?

I'll leave you to look at the psycopg2 library properly - this is off the top of my head (not had to use it for a while, but IIRC the documentation is ample).
The steps are:
1. Read column names from the CSV file
2. Create a "CREATE TABLE whatever ( ... )" statement
3. Maybe INSERT data
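import csv
import os.path
my_csv_file = '/home/somewhere/file.csv'
# table name = file name without directory or extension
table_name = os.path.splitext(os.path.basename(my_csv_file))[0]
# column names = first row of the CSV
with open(my_csv_file, newline='') as f:
    cols = next(csv.reader(f))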
You can go from there...
Create a SQL query (possibly using a templating engine for the fields and then issue the insert if needs be)
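To make the "go from there" step concrete, here is one way the CREATE TABLE statement could be assembled. It is a sketch with assumptions: every column is typed TEXT because the CSV carries no type information, and quote_ident and create_table_sql are hypothetical helper names. In real psycopg2 code, psycopg2.sql.Identifier is the safer way to quote identifiers.

```python
import csv
import os.path


def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(csv_path, column_type="TEXT"):
    """Build a CREATE TABLE statement from a CSV file's name and header row."""
    table_name = os.path.splitext(os.path.basename(csv_path))[0]
    with open(csv_path, newline="") as f:
        cols = next(csv.reader(f))
    # Assumption: all columns share one type, since the CSV has no types.
    col_defs = ", ".join(f"{quote_ident(c)} {column_type}" for c in cols)
    return f"CREATE TABLE {quote_ident(table_name)} ({col_defs})"
```

The returned string can be passed to cursor.execute(); the data itself could then be loaded with INSERT statements or with cursor.copy_expert() and a COPY ... FROM STDIN command.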
