I am trying to load data from CSV files in S3 buckets into Snowflake using a Glue ETL job. I wrote the following Python script within the ETL job:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from py4j.java_gateway import java_import
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
## #params: [JOB_NAME, URL, ACCOUNT, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'ACCOUNT', 'WAREHOUSE', 'DB', 'SCHEMA',
'USERNAME', 'PASSWORD'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
java_import(spark._jvm, "net.snowflake.spark.snowflake")
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())
sfOptions = {
"sfURL" : args['URL'],
"sfAccount" : args['ACCOUNT'],
"sfUser" : args['USERNAME'],
"sfPassword" : args['PASSWORD'],
"sfDatabase" : args['DB'],
"sfSchema" : args['SCHEMA'],
"sfWarehouse" : args['WAREHOUSE'],
}
dyf = glueContext.create_dynamic_frame.from_catalog(database = "salesforcedb", table_name = "pr_summary_csv", transformation_ctx = "dyf")
df = dyf.toDF()
## df.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("parallelism", "8").option("dbtable", "abcdef").mode("overwrite").save()
df.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "abcdef").save()
job.commit()
The error thrown is:
error occurred while calling o81.save. Incorrect username or password was specified.
However, if I don't convert to a Spark DataFrame and instead use the DynamicFrame directly, I get an error like this:
AttributeError: 'function' object has no attribute 'format'
Could someone please look over my code and tell me what I'm doing wrong when converting a DynamicFrame to a DataFrame? Please let me know if I need to provide more information.
BTW, I am a newbie to Snowflake and this is my first attempt at loading data through AWS Glue. 😊
error occurred while calling o81.save. Incorrect username or password was specified.
The error message says there is a problem with the username or the password. If you are sure the username and the password are correct, also make sure the Snowflake account name and URL are correct.
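For reference, the connector expects sfURL to be the full account host name (without the protocol) and sfAccount to be just the account locator. A minimal sketch of what that typically looks like; the locator xy12345, the region and the other values are placeholders, not values from your job:
# Hypothetical values for illustration only; substitute your own account locator, region and credentials.
sfOptions = {
    "sfURL": "xy12345.us-east-1.snowflakecomputing.com",  # full host name, no https:// prefix
    "sfAccount": "xy12345",                               # account locator only
    "sfUser": "MY_USER",
    "sfPassword": "********",
    "sfDatabase": "MY_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "MY_WH",
}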
However, if I don't convert to a Spark DataFrame and instead use the DynamicFrame directly, I get an error like this:
AttributeError: 'function' object has no attribute 'format'
The Glue DynamicFrame's write method is different from a Spark DataFrame's, so it's normal that it doesn't have the same methods. Please check the documentation:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-write
It seems you need to give the parameters as connection_options:
write(connection_type, connection_options, format, format_options, accumulator_size)
connection_options = {"url": "jdbc-url/database", "user": "username", "password": "password","dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
Even if you use the DynamicFrame, you will probably end up with the same incorrect username or password error, so I suggest focusing on fixing the credentials first.
Here is a tested Glue script (you can copy-paste it as is; only change the table name), which you can use for setting up the Glue ETL.
You will have to add the Snowflake JDBC and Spark connector JARs. You can use the link below for the setup:
https://community.snowflake.com/s/article/How-To-Use-AWS-Glue-With-Snowflake
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from py4j.java_gateway import java_import
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
## #params: [JOB_NAME, URL, ACCOUNT, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'ACCOUNT', 'WAREHOUSE', 'DB', 'SCHEMA', 'USERNAME', 'PASSWORD'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## uj = sc._jvm.net.snowflake.spark.snowflake
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())
sfOptions = {
"sfURL" : args['URL'],
"sfAccount" : args['ACCOUNT'],
"sfUser" : args['USERNAME'],
"sfPassword" : args['PASSWORD'],
"sfDatabase" : args['DB'],
"sfSchema" : args['SCHEMA'],
"sfWarehouse" : args['WAREHOUSE'],
}
## Read from a Snowflake table into a Spark Data Frame
df = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "Select * from <tablename>").load()
df.show()
## Perform any kind of transformations on your data and save as a new Data Frame:
## df1 = df.[Insert any filter, transformation, or other operation]
## Write the Data Frame contents back to Snowflake in a new table
df1.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "[new_table_name]").mode("overwrite").save()
job.commit()
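For completeness, the URL, ACCOUNT, WAREHOUSE, DB, SCHEMA, USERNAME and PASSWORD values above are read by getResolvedOptions from the Glue job parameters, where each key is passed with a leading "--". A minimal sketch of supplying them when starting a job run with boto3; the job name and all values are placeholders, not real credentials:
import boto3
glue = boto3.client("glue", region_name="us-east-1")
# Each getResolvedOptions key maps to a job argument prefixed with "--".
glue.start_job_run(
    JobName="my-snowflake-load-job",  # hypothetical job name
    Arguments={
        "--URL": "xy12345.us-east-1.snowflakecomputing.com",
        "--ACCOUNT": "xy12345",
        "--WAREHOUSE": "MY_WH",
        "--DB": "MY_DB",
        "--SCHEMA": "PUBLIC",
        "--USERNAME": "MY_USER",
        "--PASSWORD": "********",
    },
)
In practice you would keep the password in AWS Secrets Manager or an encrypted job parameter rather than passing it in plain text.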
Related
I am running a job on a Databricks notebook that connects to my MySQL database on AWS RDS and inserts data. When I ran the notebook manually, I was able to connect to the endpoint URL and insert my data. Now I have the notebook running on a cron job every 30 minutes. The first job was successful, but every job after that failed with this error:
MySQLInterfaceError: MySQL server has gone away
I then tried to run my job manually again and I get the same error on tweets_pdf.to_sql(name='tweets', con=engine, if_exists = 'replace', index=False). This is the code that is running in the Databricks notebook:
from __future__ import print_function
import sys
import pymysql
import os
import re
import mysql.connector
from sqlalchemy import create_engine
from operator import add
import pandas as pd
from pyspark.sql.types import StructField, StructType, StringType
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql import SQLContext
import json
import boto
import boto3
from boto.s3.key import Key
import boto.s3.connection
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import *
# Get AWS credentials
aws_key_id = os.environ.get("accesskeyid")
aws_key = os.environ.get("secretaccesskey")
# Start spark instance
conf = SparkConf().setAppName("first")
sc = SparkContext.getOrCreate(conf=conf)
# Allow spark to access my S3 bucket
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId",aws_key_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey",aws_key)
config_dict = {"fs.s3n.awsAccessKeyId":aws_key_id,
"fs.s3n.awsSecretAccessKey":aws_key}
bucket = "diego-twitter-stream-sink"
prefix = "/2020/*/*/*/*"
filename = "s3n://{}/{}".format(bucket, prefix)
# Convert file from S3 bucket to an RDD
rdd = sc.hadoopFile(filename,
'org.apache.hadoop.mapred.TextInputFormat',
'org.apache.hadoop.io.Text',
'org.apache.hadoop.io.LongWritable',
conf=config_dict)
spark = SparkSession.builder.appName("PythonWordCount").config("spark.files.overwrite","true").getOrCreate()
# Map RDD to specific columns
df = spark.read.json(rdd.map(lambda x: x[1]))
features_of_interest = ["ts", "text", "sentiment"]
df_reduce = df.select(features_of_interest)
# Convert RDD to Pandas Dataframe
tweets_pdf = df_reduce.toPandas()
engine = create_engine(f'mysql+mysqlconnector://admin:{os.environ.get("databasepassword")}@{os.environ.get("databasehost")}/twitter-data')
tweets_pdf.to_sql(name='tweets', con=engine, if_exists = 'replace', index=False)
Does anyone know what could be the issue? All of the database config variables are correct, the S3 bucket that PySpark is streaming from has data, and the AWS RDS is nowhere near any capacity or compute limits.
A default max_allowed_packet setting (4 MB) can cause this issue: to_sql on a large DataFrame can produce INSERT statements bigger than that limit, and MySQL then drops the connection, which surfaces as "MySQL server has gone away".
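A minimal sketch of how to check the current value from the same notebook, reusing a SQLAlchemy engine like the one in the question (the credentials and host below are placeholders):
from sqlalchemy import create_engine, text
# Placeholder connection string; mirror the engine from the question.
engine = create_engine("mysql+mysqlconnector://admin:<password>@<rds-endpoint>/twitter-data")
with engine.connect() as conn:
    row = conn.execute(text("SHOW VARIABLES LIKE 'max_allowed_packet'")).fetchone()
    print(row)  # e.g. ('max_allowed_packet', '4194304') -> 4 MB default
On RDS the value is raised through the DB parameter group (for example max_allowed_packet = 67108864 for 64 MB) rather than with SET GLOBAL, since the admin user usually lacks the SUPER privilege.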
I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried.
#initialize pyspark
import findspark
findspark.init('C:\Spark\spark-2.4.5-bin-hadoop2.7')
#import required modules
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import *
import pandas as pd
#Create spark configuration object
conf = SparkConf()
conf.setMaster("local").setAppName("My app")
#Create spark context and sparksession
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
table = "dbo.test"
#read table data into a spark dataframe
jdbcDF = spark.read.format("jdbc") \
.option("url", f"jdbc:sqlserver://localhost:1433;databaseName=Demo;integratedSecurity=true;") \
.option("dbtable", table) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.load()
#show the data loaded into dataframe
#jdbcDF.show()
sqlQueries="execute testJoin"
resultDF=spark.sql(sqlQueries)
resultDF.show(resultDF.count(),False)
This doesn't work. How can I execute a stored procedure from PySpark instead?
In case someone is still looking for a method to do this, it's possible to use the built-in JDBC connector of your Spark session. The following code sample will do the trick:
import msal
# Set url & credentials
jdbc_url = ...
tenant_id = ...
sp_client_id = ...
sp_client_secret = ...
# Write your SQL statement as a string
name = "Some passed value"
statement = f"""
EXEC Staging.SPR_InsertDummy
@Name = '{name}'
"""
# Generate an OAuth2 access token for service principal
authority = f"https://login.windows.net/{tenant_id}"
app = msal.ConfidentialClientApplication(sp_client_id, sp_client_secret, authority)
token = app.acquire_token_for_client(scopes=["https://database.windows.net/.default"])["access_token"]  # scopes must be a list
# Create a spark properties object and pass the access token
properties = spark._sc._gateway.jvm.java.util.Properties()
properties.setProperty("accessToken", token)
# Fetch the driver manager from your spark context
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
# Create a connection object and pass the properties object
con = driver_manager.getConnection(jdbc_url, properties)
# Create callable statement and execute it
exec_statement = con.prepareCall(statement)
exec_statement.execute()
# Close connections
exec_statement.close()
con.close()
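If the procedure also returns a result set, the same py4j connection can read it through the plain java.sql API. A minimal sketch under that assumption; Staging.SPR_GetDummy is a hypothetical procedure name, not one from the post:
# Assumes `con` from the snippet above; rows are fetched on the driver.
result_statement = con.prepareCall("EXEC Staging.SPR_GetDummy")
rs = result_statement.executeQuery()  # java.sql.ResultSet
while rs.next():                      # iterate over the returned rows
    print(rs.getString(1))            # read the first column as a string
rs.close()
result_statement.close()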
For more information, a similar method using SQL-user credentials to connect over JDBC, or how to handle return parameters, I'd suggest you take a look at this blog post:
https://medium.com/delaware-pro/executing-ddl-statements-stored-procedures-on-sql-server-using-pyspark-in-databricks-2b31d9276811
Running a stored procedure through a JDBC connection from Azure Databricks is not supported as of now. But your options are:
Use the pyodbc library to connect and execute your procedure. By using this library you will be running your code on the driver node while all your workers are idle; see the sketch after this list. This article has the details:
https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark
Use a SQL table-valued function rather than a procedure. In a sense, you can use anything that you can use in the FROM clause of a SQL query.
Since you are in an Azure environment, a combination of Azure Data Factory (to execute your procedure) and Azure Databricks can help you build pretty powerful pipelines.
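A minimal sketch of the pyodbc option, runnable on the driver node; the server, credentials and ODBC driver version are placeholders, and dbo.testJoin is the procedure named in the question:
import pyodbc
# Placeholder connection details; use your own Azure SQL server and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<your-server>.database.windows.net,1433;"
    "DATABASE=Demo;UID=<user>;PWD=<password>"
)
conn.autocommit = True                # commit each statement immediately
cursor = conn.cursor()
cursor.execute("EXEC dbo.testJoin")   # runs on the driver only; the workers stay idle
conn.close()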
Python shell jobs were introduced in AWS Glue. The announcement mentioned:
You can now use Python shell jobs, for example, to submit SQL queries to services such as ... Amazon Athena ...
OK. There is an example of reading data from Athena tables here:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
persons = glueContext.create_dynamic_frame.from_catalog(
database="legislators",
table_name="persons_json")
print("Count: ", persons.count())
persons.printSchema()
# TODO query all persons
However, it uses Spark instead of Python shell. The libraries that are normally available with the Spark job type are not present, and I get this error:
ModuleNotFoundError: No module named 'awsglue.transforms'
How can I rewrite the code above to make it executable in the Python Shell job type?
The thing is, Python Shell type has its own limited set of built-in libraries.
I only managed to achieve my goal using Boto 3 to query data and Pandas to read it into a dataframe.
Here is the code snippet:
import time
import boto3
import pandas as pd
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
athena_client = boto3.client(service_name='athena', region_name='us-east-1')
bucket_name = 'bucket-with-csv'
print('Working bucket: {}'.format(bucket_name))
def run_query(client, query):
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={ 'Database': 'sample-db' },
        ResultConfiguration={ 'OutputLocation': 's3://{}/fromglue/'.format(bucket_name) },
    )
    return response
def validate_query(client, query_id):
    resp = ["FAILED", "SUCCEEDED", "CANCELLED"]
    response = client.get_query_execution(QueryExecutionId=query_id)
    # wait until query finishes
    while response["QueryExecution"]["Status"]["State"] not in resp:
        time.sleep(1)  # poll politely instead of hammering the Athena API
        response = client.get_query_execution(QueryExecutionId=query_id)
    return response["QueryExecution"]["Status"]["State"]
def read(query):
    print('start query: {}\n'.format(query))
    qe = run_query(athena_client, query)
    qstate = validate_query(athena_client, qe["QueryExecutionId"])
    print('query state: {}\n'.format(qstate))
    file_name = "fromglue/{}.csv".format(qe["QueryExecutionId"])
    obj = s3_client.get_object(Bucket=bucket_name, Key=file_name)
    return pd.read_csv(obj['Body'])
time_entries_df = read('SELECT * FROM sample-table')
SparkContext won't be available in a Glue Python shell job, hence you need to depend on Boto3 and Pandas to handle the data retrieval. But it adds a lot of overhead to query Athena using Boto3 and poll the QueryExecutionId to check whether the query execution has finished.
Recently awslabs released a new package called AWS Data Wrangler. It extends the power of the Pandas library to AWS, making it easy to interact with Athena and a lot of other AWS services.
Reference link:
https://github.com/awslabs/aws-data-wrangler
https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/006%20-%20Amazon%20Athena.ipynb
Note: the AWS Data Wrangler library won't be available by default inside the Glue Python shell. To include it in a Python shell job, follow the instructions in the following link:
https://aws-data-wrangler.readthedocs.io/en/latest/install.html#aws-glue-python-shell-jobs
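A minimal sketch of what the Athena query from the answer above might look like with AWS Data Wrangler (package name awswrangler); the database and table names are the same kind of placeholders as in that answer:
import awswrangler as wr
# Runs the query on Athena and returns the result as a Pandas DataFrame,
# handling query polling and S3 result retrieval internally.
time_entries_df = wr.athena.read_sql_query(
    sql="SELECT * FROM sample_table",
    database="sample-db",
)
print(time_entries_df.head())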
I have been using Glue for a few months; I use:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
data_frame = spark.read.format("com.databricks.spark.csv")\
.option("header","true")\
.load("<s3 path to the CSV files that Athena uses>")
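On Glue's Spark 2.x, the built-in CSV reader does the same thing without the Databricks package; a small equivalent sketch (the S3 path is a placeholder):
# Equivalent using the native Spark CSV source (Spark 2.0+); the path is hypothetical.
data_frame = spark.read \
    .option("header", "true") \
    .csv("s3://my-bucket/path/to/athena/csvs/")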
I'm trying to count the records and print the schema of my partitioned table (in the form of Parquet). I'm doing it just in the AWS Glue console (since I don't have access to connect to a developer endpoint). However, I don't think my query is producing any result. See my code below. Any suggestion?
%pyspark
from awsglue.context import GlueContext
from awsglue.transforms import *
from pyspark.context import SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "s3", table_name = "subscriber", push_down_predicate = "(date=='2018-12-06')", transformation_ctx = "datasource0")
df = datasource0.toDF()
print df.count()
df.printSchema()
I'm not sure about using print in Glue... I would recommend using logging to print results. You can get the logger object and use it like this:
spark = glueContext.spark_session
log4jLogger = spark.sparkContext._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger(__name__)
logger.info(df.count())
From the Job console you can then access to the logs of the specific Job execution. There you should be able to see your DF count for example.
I'd like to execute Spark SQL on SageMaker via AWS Glue, but haven't succeeded.
What I want to do is parameterize the Glue job, so I want it to be able to accept empty tables. However, when the method glueContext.create_dynamic_frame.from_catalog is given an empty table, it raises an error.
Here's a code what raises an error:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
df1 = glueContext.create_dynamic_frame.from_catalog(
database = "<glue's database name>",
table_name = "<glue's table name>", # I want here to be parameterized
transformation_ctx = "df1"
)
df1 = df1.toDF() # Here raises an Error
df1.createOrReplaceTempView('tmp_table')
df_sql = spark.sql("""SELECT ...""")
And this is the error:
Unable to infer schema for Parquet. It must be specified manually.
Is it impossible to use an empty table as an input to DynamicFrame? Thank you in advance.
df1 = df1.toDF() # Here raises an Error
Replace this line with:
from awsglue.dynamicframe import DynamicFrame  # import needed for fromDF
dynamic_df = DynamicFrame.fromDF(df1, glueContext, 'sample_job')  # Load pyspark df to dynamic frame