I have a Cloud SQL instance storing data in a database, and I have checked the option for this Cloud SQL instance to block all unencrypted connections. When I select this option, I am given three SSL files - a server certificate, a client public key, and a client private key - as three separate .pem files (link to relevant CloudSQL+SSL documentation). These certificate files are used to establish encrypted connections to the Cloud SQL instance.
I'm able to successfully connect to Cloud SQL using MySQL from the command line using the --ssl-ca, --ssl-cert, and --ssl-key options to specify the server certificate, client public key, and client private key, respectively:
mysql -uroot -p -h <host-ip-address> \
--ssl-ca=server-ca.pem \
--ssl-cert=client-cert.pem \
--ssl-key=client-key.pem
I am now trying to run a PySpark job that connects to this Cloud SQL instance to extract the data and analyze it. The PySpark job is basically the same as this example provided by the Google Cloud training team. On line 39 of that script, a JDBC connection is made to the Cloud SQL instance:
jdbcDriver = 'com.mysql.jdbc.Driver'
jdbcUrl = 'jdbc:mysql://%s:3306/%s?user=%s&password=%s' % (CLOUDSQL_INSTANCE_IP, CLOUDSQL_DB_NAME, CLOUDSQL_USER, CLOUDSQL_PWD)
but this does not make an encrypted connection and does not provide the three certificate files. If I have unencrypted connections to the Cloud SQL instance disabled, I see the following error message:
17/09/21 06:23:21 INFO org.spark_project.jetty.util.log: Logging initialized @5353ms
17/09/21 06:23:21 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT
17/09/21 06:23:21 INFO org.spark_project.jetty.server.Server: Started @5426ms
17/09/21 06:23:21 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@74af54ac{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
[...snip...]
py4j.protocol.Py4JJavaError: An error occurred while calling o51.load.
: java.sql.SQLException: Access denied for user 'root'@'<cloud-sql-instance-ip>' (using password: YES)
whereas if I have unencrypted connections to the Cloud SQL instance enabled, the job runs just fine. (This indicates that the issue is not with Cloud SQL API permissions - the cluster I'm running the PySpark job from definitely has permission to access the Cloud SQL instance.)
The JDBC connection strings I have found involving SSL add &useSSL=true or &encrypt=true but do not point to external certificates, or they use a keystore in some Java-specific procedure. How can I modify the JDBC connection string in the Python script linked above so that JDBC (via PySpark) points to the locations of the server certificate and client public/private keys (server-ca.pem, client-cert.pem, and client-key.pem) on disk?
There's a handy initialization action for configuring the CloudSQL Proxy on Dataproc clusters. By default it assumes you intend to use CloudSQL for the Hive metastore, but if you download it, customize it by setting ENABLE_CLOUD_SQL_METASTORE=0, and re-upload it into your own bucket to use as your custom initialization action, then you should automatically get the CloudSQL Proxy installed on all your nodes. You then just set your MySQL connection string to point to localhost instead of the real CloudSQL IP.
When specifying the metadata flags, note that since you've disabled the metastore you use additional-cloud-sql-instances instead of hive-metastore-instance in your metadata:
--metadata "additional-cloud-sql-instances=<PROJECT_ID>:<REGION>:<ANOTHER_INSTANCE_NAME>=tcp:<PORT_#>"
In this case you can optionally use the same port assignment the script would've used by default for the metastore, which is port 3306.
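As a rough sketch of what that looks like in the PySpark script (reusing the variable names from the question; the Spark read and the table name are only illustrative, and it assumes the proxy is listening on each node's localhost:3306):
# Assumes the CloudSQL Proxy installed by the initialization action listens on localhost:3306
jdbcDriver = 'com.mysql.jdbc.Driver'
jdbcUrl = 'jdbc:mysql://localhost:3306/%s?user=%s&password=%s' % (CLOUDSQL_DB_NAME, CLOUDSQL_USER, CLOUDSQL_PWD)

# Illustrative read; 'my_table' is a placeholder for whatever table the job extracts
df = spark.read.format('jdbc') \
    .option('driver', jdbcDriver) \
    .option('url', jdbcUrl) \
    .option('dbtable', 'my_table') \
    .load()
The proxy handles encryption and authentication to Cloud SQL itself, so the URL no longer needs useSSL or any of the .pem files.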
Related
I am trying to connect to an Amazon Redshift table. I created the table using SQL, and now I am writing a Python script to append a data frame to the database. I am unable to connect to the database and feel that I have something wrong with my syntax or something else. My code is below.
from sqlalchemy import create_engine
conn = create_engine('jdbc:redshift://username:password@localhost:port/db_name')
Here is the error I am getting.
sqlalchemy.exc.ArgumentError: Could not parse rfc1738 URL from string
Thanks!
There are basically two options for connecting to Amazon Redshift using Python.
Option 1: JDBC Connection
This is a traditional connection to a database. The popular choice tends to be using psycopg2 to establish the connection, since Amazon Redshift resembles a PostgreSQL database. You can download specific JDBC drivers for Redshift.
This connection would require the Redshift database to be accessible to the computer making the query, and the Security Group would need to permit access on port 5439. If you are trying to connect from a computer on the Internet, the database would need to be in a Public Subnet and set to Publicly Accessible = Yes.
See: Establish a Python Redshift Connection: A Comprehensive Guide - Learn | Hevo
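As a rough sketch of this option with psycopg2 (the endpoint, credentials, and database name below are placeholders):
import psycopg2

# Placeholder connection details - replace with your cluster's endpoint and credentials
conn = psycopg2.connect(
    host='my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com',
    port=5439,
    dbname='db_name',
    user='username',
    password='password'
)
with conn.cursor() as cur:
    cur.execute('SELECT current_date;')
    print(cur.fetchone())
conn.close()
If you want to keep using SQLAlchemy as in the question, the URL should be a postgresql:// URL rather than a jdbc: one - for example 'postgresql+psycopg2://username:password@host:5439/db_name' - which is why create_engine raised the rfc1738 parse error.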
Option 2: Redshift Data API
You can directly query an Amazon Redshift database by using the Boto3 library for Python, including an execute_statement() call to query data and a get_statement_result() call to retrieve the results. This also works with IAM authentication rather than having to create additional 'database users'.
There is no need to configure Security Groups for this method, since the request is made to AWS (on the Internet). It also works with Redshift databases that are in private subnets.
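A rough sketch of this option with boto3's redshift-data client (the region, cluster identifier, database, user, and table are placeholders):
import time
import boto3

client = boto3.client('redshift-data', region_name='us-east-1')  # placeholder region

# Submit the query; authentication is handled via IAM rather than a database password
resp = client.execute_statement(
    ClusterIdentifier='my-cluster',  # placeholder
    Database='db_name',              # placeholder
    DbUser='username',               # placeholder
    Sql='SELECT COUNT(*) FROM my_table;'
)

# The API is asynchronous: poll until the statement finishes, then fetch the rows
statement_id = resp['Id']
while client.describe_statement(Id=statement_id)['Status'] not in ('FINISHED', 'FAILED', 'ABORTED'):
    time.sleep(1)
result = client.get_statement_result(Id=statement_id)
print(result['Records'])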
We have a premium subscription in the Azure portal.
I just want to know the difference between ssl_cert_reqs and ssl=True.
If I make the connection as below, will it be secure?
r = redis.StrictRedis(host=myHostname, port=6380,
password=myPassword, ssl_cert_reqs='none', ssl=True)
If it's not secure, how can I create a certificate file for that?
It looks like if SSL is enabled we need to specify ssl_ca_certs. Where do we get this file, or do we need to generate it with some Azure service?
For Azure Cache for Redis version 3.0 or higher, the TLS/SSL certificate check is enforced, and ssl_ca_certs must be explicitly set when connecting to Azure Cache for Redis.
Ref: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-python-get-started
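A rough sketch of a verified connection (reusing the variable names from the question; using certifi's CA bundle here is an assumption - any file containing the trusted root CAs works):
import certifi
import redis

r = redis.StrictRedis(
    host=myHostname,
    port=6380,
    password=myPassword,
    ssl=True,
    ssl_cert_reqs='required',      # verify the server certificate instead of skipping the check
    ssl_ca_certs=certifi.where()   # path to a CA bundle file; an assumption, not Azure-specific
)
print(r.ping())
With ssl_cert_reqs='none' the traffic is still encrypted, but the client skips verification of the server's certificate, so the connection is open to man-in-the-middle attacks. You do not generate this file in Azure; you just point ssl_ca_certs at a CA bundle you trust.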
I do not know how to write a connection string in Python using the Boto3 API to make a JDBC connection to an existing database on AWS Redshift. I am using MobaXterm or PuTTY to make an SSH connection. I have some code to create the table, but am lost as to how to connect to the database in Redshift.
import boto3
s3client = boto3.client('redshift', config=client_config)
CREATE TABLE pheaapoc_schema.green_201601_csv (
vendorid varchar(4),
pickup_ddatetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
I need to connect to database "dummy" and create a table.
TL;DR: You do not need IAM credentials or boto3 to connect to Redshift. What you need is the endpoint of the Redshift cluster, the Redshift credentials, and a Postgres client to connect with.
You can connect to a Redshift cluster just the way you connect to any database (like MySQL, PostgreSQL, or MongoDB). To connect to any database, you need five items (a connection sketch using them follows the list):
host - (This is nothing but the endpoint you get from the AWS console/Redshift)
username - (Refer again to the AWS console/Redshift. Take a look at the master username section)
password - (If you created the Redshift cluster, you should know the password for the master user)
port number - (5439 for Redshift)
database - (The default database you created at first)
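For example, a minimal psycopg2 sketch using those five values (everything below is a placeholder except the database name "dummy" and port 5439 from the question):
import psycopg2

conn = psycopg2.connect(
    host='examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com',  # 1. endpoint
    user='masteruser',       # 2. master username
    password='MyPassword1',  # 3. master user password
    port=5439,               # 4. Redshift port
    dbname='dummy'           # 5. database
)
with conn.cursor() as cur:
    cur.execute("CREATE TABLE pheaapoc_schema.green_201601_csv ( ... )")  # your DDL from above
conn.commit()
conn.close()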
What boto3 APIs do?
Boto3 provides APIs with which you can manage your Redshift cluster. For example, it provides APIs to delete your cluster, resize it, or take a snapshot of it. They do not involve connecting to the database at all.
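For instance, a sketch of how you might use boto3 only to look up the endpoint that then goes into the psycopg2 connection above (the region and cluster identifier are placeholders):
import boto3

redshift = boto3.client('redshift', region_name='us-west-2')  # placeholder region
desc = redshift.describe_clusters(ClusterIdentifier='examplecluster')  # placeholder identifier
endpoint = desc['Clusters'][0]['Endpoint']
print(endpoint['Address'], endpoint['Port'])  # host and port for the actual connection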
So, I am doing an ETL process in which I use Apache NiFi as the ETL tool along with a PostgreSQL database on Google Cloud SQL, reading CSV files from GCS. As part of the process, I need to write a query to transform the data read from the CSV files and insert it into a table in the Cloud SQL database. So, based on NiFi, I need to write a Python script to execute the SQL queries automatically on a daily basis. But the question here is: how can I write a Python script to connect to the Cloud SQL database? What configuration needs to be done? I have read something about the Cloud SQL Proxy, but can I just use the Cloud SQL instance's internal IP address, put it in some config file, and create some DB connector out of it?
Thank you
Edit: I can connect to the Cloud SQL database from my VM using psql -h [CLOUD_SQL_PRIVATE_IP_ADDR] -U postgres, but I need to run a Python script for the ETL process, and part of that process needs to execute SQL. What I am trying to ask is how I can write a Python file that executes the SQL,
e.g. in Python, query = 'select * from table ....' and then run
postgres.run_sql(query), which will execute the query. So how can I create this kind of executor?
I don't understand why you need to write any code in Python. I've done a similar process where I used GetFile (locally) to read a CSV file, parsed and transformed it, and then used ExecuteSQLRecord to insert the rows into a SQL Server instance (running on a cloud provider). The DBCPConnectionPool needs to reference your cloud provider as per their connection instructions. This means the URL will likely reference something.google.com, and you may need to open firewall rules in your cloud provider's administration console.
You can connect directly to a Cloud SQL instance via a Public IP (public meaning accessible via the public internet) mostly the same as a local database. By default, connections via Public IP require some form of authorization. Here you have 3 (maybe 4*) options:
Cloud SQL Proxy - this is an executable that listens on a local port or unix socket and uses IAM permissions to authenticate, encrypt, and forward connections to the database.
Self-managed SSL/TLS - Create an SSL/TLS key pair, providing the client key to NiFi as proof of authentication.
Whitelisting an IP - Whitelist which IPs are allowed to connect (so the IP that NiFi publicly sits on). This is the least secure option for a variety of reasons.
Any of these options should work for you to connect directly to the database. If you still need specifics for Python, I suggest looking into SQLAlchemy and using these snippets here as a reference.
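For example, a minimal SQLAlchemy sketch assuming you go with the Cloud SQL Proxy option and it is listening on the default PostgreSQL port locally (the user, password, and database name are placeholders):
import sqlalchemy

# The proxy handles encryption and authorization, so the app just talks to localhost
engine = sqlalchemy.create_engine('postgresql+psycopg2://postgres:my-password@127.0.0.1:5432/my-database')

with engine.connect() as conn:
    result = conn.execute(sqlalchemy.text('SELECT version();'))
    print(result.scalar())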
Another possible option: It looks like NiFi is using Java and allows you to specify a jar as a driver, so you could potentially also provide a driver bundled with the Cloud SQL JDBC SocketFactory to authenticate the connection as well.
To connect to a Cloud SQL instance with Python you need the Cloud SQL Proxy. You also have to set up a configuration file.
In this tutorial you can find step-by-step instructions for achieving this. It describes how to set up the configuration file needed for the connection (an example of this file is provided as well).
The tutorial also includes some examples showing how to interact with your database from Python.
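If what you are after is the small run_sql-style executor from the question, a rough psycopg2 sketch could look like this (the connection details are placeholders; with the proxy, the host becomes 127.0.0.1):
import psycopg2

def run_sql(query):
    # Bare-bones executor: open a connection, run one statement, commit, close
    conn = psycopg2.connect(
        host='CLOUD_SQL_PRIVATE_IP_ADDR',  # or 127.0.0.1 when going through the proxy
        user='postgres',
        password='my-password',  # placeholder
        dbname='my-database'     # placeholder
    )
    try:
        with conn, conn.cursor() as cur:  # the 'with conn' block commits on success
            cur.execute(query)
    finally:
        conn.close()

run_sql('insert into target_table select * from staging_table')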
I want to set up a local DB2 database to allow local users to connect without using a password (specifically via Python).
I can connect to the database from the CLI without a password with db2 connect to <DATABASE>.
However, when trying to connect from within Python using the official ibm_db API as
ibm_db.connect("database", "", "")
throws the following error:
SQL30082N Security processing failed with reason "17" ("UNSUPPORTED FUNCTION"). SQLSTATE=08001 SQLCODE=-30082
Based on the documentation for authentication options, I have set the following options:
AUTHENTICATION=CLIENT
TRUST_CLNTAUTH=CLIENT
TRUST_ALLCLNTS=YES
however, I am still getting the same error.
P.S. #1: I am not concerned about user authentication, as users have already been authenticated before being allowed to log in to the server.
P.S. #2: A similar question has already been asked at DB2 connection without specifying username and password. However, I need to connect via Python, and even with settings 1 and 3 prescribed in the accepted answer, the connection fails.
P.S. #3: Possibly relevant link - http://www-01.ibm.com/support/docview.wss?uid=swg21237107