AWS Glue - Truncate destination postgres table prior to insert - python

I am trying to truncate a postgres destination table prior to insert and, in general, trying to fire external functions utilizing the connections already created in Glue.
Has anyone been able to do so?

I've tried the DROP/TRUNCATE scenario, but have not been able to do it with the connections already created in Glue; it worked with a pure Python PostgreSQL driver, pg8000:
1. Download the tar of pg8000 from PyPI
2. Create an empty __init__.py in the root folder
3. Zip up the contents & upload to S3
4. Reference the zip file in the Python lib path of the job
5. Set the DB connection details as job params (make sure to prepend all key names with --; a boto3 sketch of this is shown after this list). Tick the "Server-side encryption" box.
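For reference, a minimal boto3 sketch of steps 4 and 5, assuming you define the job programmatically; the job name, role, script location, S3 paths and connection values below are placeholders, not from the original setup:

import boto3

glue = boto3.client("glue")

# Hypothetical job definition: the library zip is referenced via --extra-py-files
# and the DB connection details are passed as default job arguments.
glue.create_job(
    Name="truncate-and-load",
    Role="MyGlueServiceRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/job.py"},
    DefaultArguments={
        "--extra-py-files": "s3://my-bucket/libs/pg8000.zip",  # step 4: Python lib path
        "--HOST": "mydb.example.com",                          # step 5: DB connection details
        "--DB": "mydb",
        "--USER": "glue_user",
        "--PW": "change-me",
    },
)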
Then you can simply create a connection and execute SQL.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
import pg8000

args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'PW',
    'HOST',
    'USER',
    'DB'
])
# ...
# Create Spark & Glue context
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# ...
# schema and table hold the target schema and table names, set elsewhere in the job
config_port = 5432
conn = pg8000.connect(
    database=args['DB'],
    user=args['USER'],
    password=args['PW'],
    host=args['HOST'],
    port=config_port
)
query = "TRUNCATE TABLE {0};".format(".".join([schema, table]))
cur = conn.cursor()
cur.execute(query)
conn.commit()
cur.close()
conn.close()
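For completeness, a sketch of how this truncate could sit in front of the usual Glue write back to the same table; the DynamicFrame, catalog connection name, database and table names below are hypothetical, not from the original job:

# After the TRUNCATE above, write the data through the Glue catalog connection.
# my_dynamic_frame, "my-postgres-connection", "mydb" and "my_schema.my_table"
# are placeholders for whatever the job actually uses.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=my_dynamic_frame,
    catalog_connection="my-postgres-connection",
    connection_options={
        "dbtable": "my_schema.my_table",
        "database": "mydb",
    },
    transformation_ctx="datasink"
)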

After following step (4) of @thenaturalist's response,
sc.addPyFile("/home/glue/downloads/python/pg8000.zip")
import pg8000
worked for me on a development endpoint (Zeppelin notebook).
More info: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html

To clarify @thenaturalist's instructions for the zip, as I still struggled with this:
Download the tar.gz of pg8000 from pypi.org and extract.
Zip the contents so you have the below structure
pg8000-1.15.3.zip
|
| -- pg8000 <dir>
|      | -- __init__.py
|      | -- _version.py <optional>
|      | -- core.py
Upload to S3 and then you should be able to just do a simple import pg8000.
NOTE: scramp is also required at the moment, so follow the same procedure as above to include the scramp module (you don't need to import it). A small Python sketch of the zipping step follows below.
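For illustration only, a minimal zipping sketch in Python, assuming the tarballs have already been extracted locally; the directory names and the scramp version are placeholders:

import os
import zipfile

def zip_package(pkg_dir, zip_path):
    """Zip pkg_dir so the package directory itself sits at the zip root."""
    base = os.path.dirname(pkg_dir.rstrip(os.sep))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(pkg_dir):
            for name in files:
                full = os.path.join(root, name)
                # store paths relative to the parent, e.g. pg8000/core.py
                zf.write(full, os.path.relpath(full, base))

# e.g. the extracted tarball looks like pg8000-1.15.3/pg8000/...
zip_package("pg8000-1.15.3/pg8000", "pg8000-1.15.3.zip")
zip_package("scramp-1.2.0/scramp", "scramp-1.2.0.zip")  # scramp version is hypothetical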

data = spark.sql(sql)
conf = glueContext.extract_jdbc_conf("jdbc-commerce")
data.write \
    .mode('overwrite') \
    .format("jdbc") \
    .option("url", conf['url']) \
    .option("database", 'Pacvue_Commerce') \
    .option("dbtable", "dbo.glue_1") \
    .option("user", conf['user']) \
    .option('truncate', 'true') \
    .option("password", conf['password']) \
    .save()
The Glue API does not support this, but the Spark API does. jdbc-commerce is the name of the connection created for your crawler; use extract_jdbc_conf to get the url, username and password from it. With mode('overwrite') and option('truncate', 'true'), Spark truncates the existing table before inserting instead of dropping and recreating it.

Related

`AnalysisException("Database 'delta' not found" ...` when creating delta table over DeltaTableBuilder API locally

I am creating delta tables in Databricks Runtime 9.1 LTS using the DeltaTableBuilder API in PySpark. This works fine. When running the same code locally (for a unit test) I get a strange error.
I'm setting up my local environment and the local PySpark session as described in the Quickstart guide.
Steps to reproduce:
Set up the environment as in DB Runtime 9.1: Python 3.8
pip install delta-spark==1.0.0
pip install pyspark==3.1.2
Create SparkSession:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

session_builder = SparkSession.builder
session_builder = (
    session_builder.config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(session_builder).getOrCreate()
Try to create a delta table:
from delta import DeltaTable

DeltaTable.createIfNotExists(spark) \
    .location("test_tables/test.delta") \
    .addColumn("id", "LONG") \
    .execute()
Error:
pyspark.sql.utils.AnalysisException: Database 'delta' not found
In Databricks this error does not appear. It does not require any database "delta"; it just creates the delta table directory with the _delta_log in it, no database involved. But OK, let's create the database locally:
from delta import DeltaTable

spark.sql("CREATE DATABASE delta")
DeltaTable.createOrReplace(spark) \
    .location("test_delta") \
    .addColumn("id", "LONG") \
    .execute()
Error:
pyspark.sql.utils.AnalysisException: `delta`.`test_delta` is not a Delta table.
How can I make the local behaviour match Databricks?

Exception: Java gateway process exited before sending its port number with pyspark

I am working with Python and PySpark in a Jupyter notebook. I am trying to read several parquet files from an AWS S3 bucket and convert them into a single JSON file.
This is what I have:
from functools import reduce
from pyspark.sql import DataFrame

bucket = s3.Bucket(name='mybucket')  # s3 is a boto3 S3 resource created earlier
keys = []
for key in bucket.objects.all():
    keys.append(key.key)
print(keys[0])
from pyspark.sql import SparkSession

# initialise sparkContext
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()
sc = spark.sparkContext
But I am getting:
Exception: Java gateway process exited before sending its port number with pyspark
I am not sure how to fix this, thank you!
You're getting this error because PySpark is not able to communicate with your cluster. You need to set the values of some environment variables, like this:
import os
import findspark

findspark.init()
# use --master yarn on a cluster, or --master local for a local run
os.environ['PYSPARK_SUBMIT_ARGS'] = """--name job_name --master local
    --conf spark.dynamicAllocation.enabled=true
    pyspark-shell"""
os.environ['PYSPARK_PYTHON'] = "python3.6"  # whatever version of Python you're using
os.environ['python'] = "python3.6"
The findspark package is optional, but it's good to use in the case of PySpark.
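To tie this together, a minimal end-to-end sketch for a local run (the app name and settings are illustrative); the key point is that the environment variables are set before the first SparkSession is created, since that is when the Java gateway process is launched:

import os
import findspark

# Must happen before the SparkSession is built, otherwise the Java gateway
# has already been started with the old settings.
os.environ['PYSPARK_SUBMIT_ARGS'] = "--name myAppName --master local[*] pyspark-shell"
findspark.init()

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('local[*]')
         .appName('myAppName')
         .config('spark.executor.memory', '5g')
         .getOrCreate())
print(spark.version)  # if this prints, the gateway came up correctly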

how to load --jars with pyspark with spark standalone on client mode

I am using Python 2.7 with a Spark standalone cluster in client mode.
I want to use JDBC for MySQL and found that I need to load the driver using the --jars argument. I have the JDBC jar locally and manage to load it with the pyspark console like here.
When I write a Python script inside my IDE, using pyspark, I don't manage to load the additional jar mysql-connector-java-5.1.26.jar and keep getting a
no suitable driver
error.
How can I load additional jar files when running a Python script in client mode, using a standalone cluster and referring to a remote master?
Edit: added some code.
This is the basic code that I am using. I use pyspark with a SparkContext in Python, i.e. I do not use spark-submit directly, and I don't understand how to use spark-submit parameters in this case...
def createSparkContext(masterAdress=algoMaster):
    """
    :return: return a spark context that is suitable for my configs
    note the ip for the master
    app name is not that important, just to show off
    """
    from pyspark.mllib.util import MLUtils
    from pyspark import SparkConf
    from pyspark import SparkContext
    import os

    SUBMIT_ARGS = "--driver-class-path /var/nfs/general/mysql-connector-java-5.1.43 pyspark-shell"
    # SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

    conf = SparkConf()
    # conf.set("spark.driver.extraClassPath", "var/nfs/general/mysql-connector-java-5.1.43")
    conf.setMaster(masterAdress)
    conf.setAppName('spark-basic')
    conf.set("spark.executor.memory", "2G")
    # conf.set("spark.executor.cores", "4")
    conf.set("spark.driver.memory", "3G")
    conf.set("spark.driver.cores", "3")
    # conf.set("spark.driver.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
    sc = SparkContext(conf=conf)
    print sc._conf.get("spark.executor.extraClassPath")
    return sc

sql = SQLContext(sc)
df = sql.read.format('jdbc').options(
    url='jdbc:mysql://ip:port?user=user&password=pass',
    dbtable='(select * from tablename limit 100) as tablename'
).load()
print df.head()
Thanks
Your SUBMIT_ARGS is going to be passed to spark-submit when creating a SparkContext from Python. You should use --jars instead of --driver-class-path, as in the sketch below.
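For example, a minimal sketch of the corrected submit arguments; the path below assumes the connector jar lives at the location mentioned in the question, with a .jar extension:

import os

# --jars ships the MySQL connector to both the driver and the executors,
# instead of only putting it on the driver class path.
SUBMIT_ARGS = "--jars /var/nfs/general/mysql-connector-java-5.1.43.jar pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS  # set before the SparkContext is created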
EDIT
Your problem is actually a lot simpler than it seems: you're missing the parameter driver in the options:
sql = SQLContext(sc)
df = sql.read.format('jdbc').options(
    url='jdbc:mysql://ip:port',
    user='user',
    password='pass',
    driver="com.mysql.jdbc.Driver",
    dbtable='(select * from tablename limit 100) as tablename'
).load()
You can also put user and password in separate arguments, as above, instead of embedding them in the URL.

pyspark using mysql database on remote machine

I am using Python 2.7 on Ubuntu and running Spark via a Python script using a SparkContext.
My DB is a remote MySQL one, with a username and password.
I try to query it using this code:
sc = createSparkContext()
sql = SQLContext(sc)
df = sql.read.format('jdbc').options(url='jdbc:mysql://ip:port?user=user&password=password', dbtable='(select * from tablename limit 100) as tablename').load()
print df.head()
And get this error
py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: java.sql.SQLException: No suitable driver
I found that I need the JDBC driver for mysql.
I downloaded the platform-independent one from here.
I tried including it using this code when starting the spark context:
conf.set("spark.driver.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
and tried to install it using
sudo apt-get install libmysql-java
on the master machine, on the db machine and on the machine running the python script with no luck.
Edit 2:
I tried using
conf.set("spark.executor.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
It seems from the output of
print sc.getConf().getAll()
which is
[(u'spark.driver.memory', u'3G'),
 (u'spark.executor.extraClassPath', u'file:///var/nfs/general/mysql-connector-java-5.1.43.jar'),
 (u'spark.app.name', u'spark-basic'),
 (u'spark.app.id', u'app-20170830'),
 (u'spark.rdd.compress', u'True'),
 (u'spark.master', u'spark://127.0.0.1:7077'),
 (u'spark.driver.port', u''),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.executor.memory', u'2G'),
 (u'spark.executor.id', u'driver'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.driver.host', u''),
 (u'spark.driver.cores', u'3')]
that it includes the correct path, but I still get the same "no suitable driver" error...
What am I missing here?
Thanks
You need to set the classpath for both the driver and the worker nodes. Add the following to the Spark configuration:
conf.set("spark.executor.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
conf.set("spark.driver.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
Or you can pass it using
import os
os.environ['SPARK_CLASSPATH'] = "/path/to/driver/mysql.jar"
For Spark >= 2.0.0 you can add a comma-separated list of jars to the spark-defaults.conf file located in the spark_home/conf directory, like this:
spark.jars path_2_jar1,path_2_jar2
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Word Count")\
    .config("spark.driver.extraClassPath", "/home/tuhin/mysql.jar")\
    .getOrCreate()

dataframe_mysql = spark.read\
    .format("jdbc")\
    .option("url", "jdbc:mysql://ip:port/db_name")\
    .option("driver", "com.mysql.jdbc.Driver")\
    .option("dbtable", "employees").option("user", "root")\
    .option("password", "12345678").load()

print(dataframe_mysql.columns)
"/home/tuhin/mysql.jar" is the location of mysql jar file

What is wrong with my boto elastic mapreduce jar jobflow parameters?

I am using the boto library to create a job flow in Amazon's Elastic MapReduce web service (EMR). The following code should create a step:
step2 = JarStep(name='Find similiar items',
                jar='s3n://recommendertest/mahout-core/mahout-core-0.5-SNAPSHOT.jar',
                main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                step_args=['s3n://bucket/output/' + run_id + '/aggregate_watched/',
                           's3n://bucket/output/' + run_id + '/similiar_items/',
                           'SIMILARITY_PEARSON_CORRELATION'
                           ])
When I run the job flow, it always fails throwing this error:
java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/JobContext
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.JobContext
This is the line in the EMR logs invoking the java code:
2011-01-24T22:18:54.491Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java \
-cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-0.18-core.jar:/home/hadoop/hadoop-0.18-tools.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* \
-Xmx1000m \
-Dhadoop.log.dir=/mnt/var/log/hadoop/steps/3 \
-Dhadoop.log.file=syslog \
-Dhadoop.home.dir=/home/hadoop \
-Dhadoop.id.str=hadoop \
-Dhadoop.root.logger=INFO,DRFA \
-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/3/tmp \
-Djava.library.path=/home/hadoop/lib/native/Linux-i386-32 \
org.apache.hadoop.mapred.JobShell \
/mnt/var/lib/hadoop/steps/3/mahout-core-0.5-SNAPSHOT.jar \
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
s3n://..../output/job_2011-01-24_23:09:29/aggregate_watched/ \
s3n://..../output/job_2011-01-24_23:09:29/similiar_items/ \
SIMILARITY_PEARSON_CORRELATION
What is wrong with the parameters? The java class definition can be found here:
https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html
I found the solution for the problem:
1. You need to specify hadoop version 0.20 in the jobflow parameters
2. You need to run the JAR step with mahout-core-0.5-SNAPSHOT-job.jar, not with the mahout-core-0.5-SNAPSHOT.jar
3. If you have an additional streaming step in your jobflow, you need to fix a bug in boto:
   1. Open boto/emr/step.py
   2. Change line 138 to "return '/home/hadoop/contrib/streaming/hadoop-streaming.jar'"
   3. Save and reinstall boto
This is how the job_flow function should be invoked to run with mahout:
jobid = emr_conn.run_jobflow(name=name,
                             log_uri='s3n://' + main_bucket_name + '/emr-logging/',
                             enable_debugging=1,
                             hadoop_version='0.20',
                             steps=[step1, step2])
The fix to boto described in step #2 above (i.e. using the non-versioned hadoop-streaming.jar file) has been incorporated into the GitHub master in this commit:
https://github.com/boto/boto/commit/a4e8e065473b5ff9af554ceb91391f286ac5cac7
For some reference, doing this from boto:
import boto.emr.connection as botocon
import boto.emr.step as step

con = botocon.EmrConnection(aws_access_key_id='', aws_secret_access_key='')
step = step.JarStep(name='Find similar items',
                    jar='s3://mahout-core-0.6-job.jar',
                    main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                    action_on_failure='CANCEL_AND_WAIT',
                    step_args=['--input', 's3://', '--output', 's3://',
                               '--similarityClassname', 'SIMILARITY_PEARSON_CORRELATION'])
con.add_jobflow_steps('jflow', [step])
Obviously you need to upload the mahout-core-0.6-job.jar to an accessible S3 location, and the input and output locations have to be accessible as well.
