AWS Glue read from RDS database that's in VPC - python

I have an RDS database that is sitting in a VPC. My ultimate goal is to run a nightly job that takes the data from RDS and stores it in Redshift. I am currently doing this using Glue and Glue connections. I am able to write to RDS/Redshift using connections with the following line:
datasource2 = DynamicFrame.fromDF(dfFinal, glueContext, "scans")
output = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource2, catalog_connection = "MPtest", connection_options = {"database" : "app", "dbtable" : "scans"})
where dfFinal is my final data frame after a bunch of transformations that are not essential to this post. That code works fine; however, I would like to modify it so I can read a table from RDS into a data frame.
Since the RDS database is in a VPC, I would like to use the catalog_connection parameter, but the DynamicFrameReader class has no from_jdbc_conf method, so there is no obvious way to use my Glue connection.
I have seen posts that say you could use a method like this:
url = "jdbc:postgresql://host/dbName"
properties = {
    "user": "user",
    "password": "password"
}
df = spark.read.jdbc(url=url, table="table", properties=properties)
But when I try that it times out because it's not a publicly accessible database. Any suggestions?

You are on the right track with using Glue connections.
Define a Glue connection of type JDBC for your Postgres instance:
Type: JDBC
JDBC URL: jdbc:postgresql://<RDS ip>:<RDS port>/<database_name>
VPC Id: <VPC of the RDS instance>
Subnet: <subnet of the RDS instance>
Security groups: <security group allowed to connect to RDS>
Edit the Glue job and select the Glue connection so that it appears under "Required connections".
Create a connection options dictionary such as:
options = {
    'url': connection.jdbc_url,
    'user': connection.username,
    'password': connection.password,
    'dbtable': table
}
Use the options dictionary below to create a DynamicFrame that reads from the table:
table_ddf = glueContext.create_dynamic_frame.from_options(
    connection_type='postgresql',
    connection_options=options,
    transformation_ctx=transformation_ctx
)
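For completeness, one way to populate that options dictionary is to pull the values from the Glue connection itself. Below is a minimal sketch, assuming your Glue version exposes glueContext.extract_jdbc_conf, and reusing the "MPtest" connection and the "scans" table from the question:
# Sketch: resolve JDBC settings from the Glue connection "MPtest"
# (assumes glueContext.extract_jdbc_conf is available in your Glue version).
jdbc_conf = glueContext.extract_jdbc_conf("MPtest")

options = {
    'url': jdbc_conf['url'],  # may need the database name appended, e.g. jdbc_conf['url'] + '/app'
    'user': jdbc_conf['user'],
    'password': jdbc_conf['password'],
    'dbtable': 'scans'
}

table_ddf = glueContext.create_dynamic_frame.from_options(
    connection_type='postgresql',
    connection_options=options,
    transformation_ctx='read_scans'
)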

Related

azure-devops-python-api query for work item where field == string

I'm using the azure python api (https://github.com/microsoft/azure-devops-python-api) and I need to be able to query & find a specific work item based on a custom field value.
The closest thing I can find is the function create_query, but I'm hoping to be able to run a query such as:
queryRsp = wit_5_1_client.run_query(
    posted_query='',
    project=project.id,
    query='Custom.RTCID=282739'
)
I just need to find my Azure DevOps work item where the custom field RTCID has a specific unique value.
Do I need to create a query with the API, run it, get the results, and then delete the query? Or is there a way I can run this simple query and get the results using the Azure DevOps API?
Your requirement can be achieved.
For example, on my side there are two work items that have the custom field 'RTCID'.
Below is how to do this in Python (on my side, both the organization and the project are named 'BowmanCP'):
#query workitems from azure devops
from azure.devops.connection import Connection
from msrest.authentication import BasicAuthentication
from azure.devops.v5_1.work_item_tracking.models import Wiql
import pprint
# Fill in with your personal access token and org URL
personal_access_token = '<Your Personal Access Token>'
organization_url = 'https://dev.azure.com/BowmanCP'
# Create a connection to the org
credentials = BasicAuthentication('', personal_access_token)
connection = Connection(base_url=organization_url, creds=credentials)
# Get a client (the "core" client provides access to projects, teams, etc)
core_client = connection.clients.get_core_client()
#query workitems, custom field 'RTCID' has a certain specific unique value
work_item_tracking_client = connection.clients.get_work_item_tracking_client()
query = "SELECT [System.Id], [System.WorkItemType], [System.Title], [System.AssignedTo], [System.State], [System.Tags] FROM workitems WHERE [System.TeamProject] = 'BowmanCP' AND [Custom.RTCID] = 'xxx'"
#convert query str to wiql
wiql = Wiql(query=query)
query_results = work_item_tracking_client.query_by_wiql(wiql).work_items
#get the results via title
for item in query_results:
    work_item = work_item_tracking_client.get_work_item(item.id)
    pprint.pprint(work_item.fields['System.Title'])
Successfully got them on my side.
The SDK source code is here:
https://github.com/microsoft/azure-devops-python-api/blob/451cade4c475482792cbe9e522c1fee32393139e/azure-devops/azure/devops/released/work_item_tracking/work_item_tracking_client.py#L704
You can refer to the above source code.
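As a side note, if you would rather not call get_work_item once per result, the work item tracking client also exposes a batched get_work_items call; a minimal sketch under that assumption, reusing query_results from the code above:
# Sketch: fetch all matching work items in a single batched call.
# Assumes your SDK version exposes get_work_items(ids=...) on the same client.
ids = [item.id for item in query_results]
if ids:
    work_items = work_item_tracking_client.get_work_items(ids=ids)
    for wi in work_items:
        pprint.pprint(wi.fields['System.Title'])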

Azure SQL Database Manipulation

Is it possible to create a table in an Azure SQL database using Python? I am pulling a list of things from an API and then want to push them to a table in my Azure SQL DB, but I cannot find a tutorial or guide on how to do so. Googling for it led me to tutorials on how to pull data from my DB. Thanks
If you are using Azure SQL Database, you could follow the official Azure tutorial that Mohamed Elrashid provided for you, Azure SQL Database libraries for Python:
Example:
Create a SQL Database resource and restrict access to a range of IP addresses using a firewall rule.
from azure.common.client_factory import get_client_from_cli_profile
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.sql import SqlManagementClient
RESOURCE_GROUP = 'YOUR_RESOURCE_GROUP_NAME'
LOCATION = 'eastus' # example Azure availability zone, should match resource group
SQL_SERVER = 'yourvirtualsqlserver'
SQL_DB = 'YOUR_SQLDB_NAME'
USERNAME = 'YOUR_USERNAME'
PASSWORD = 'YOUR_PASSWORD'
# create resource client
resource_client = get_client_from_cli_profile(ResourceManagementClient)
# create resource group
resource_client.resource_groups.create_or_update(RESOURCE_GROUP, {'location': LOCATION})
sql_client = get_client_from_cli_profile(SqlManagementClient)
# Create a SQL server
server = sql_client.servers.create_or_update(
    RESOURCE_GROUP,
    SQL_SERVER,
    {
        'location': LOCATION,
        'version': '12.0',  # Required for create
        'administrator_login': USERNAME,  # Required for create
        'administrator_login_password': PASSWORD  # Required for create
    }
)
# Create a SQL database in the Basic tier
database = sql_client.databases.create_or_update(
    RESOURCE_GROUP,
    SQL_SERVER,
    SQL_DB,
    {
        'location': LOCATION,
        'collation': 'SQL_Latin1_General_CP1_CI_AS',
        'create_mode': 'default',
        'requested_service_objective_name': 'Basic'
    }
)
# Open access to this server for a range of IPs
firewall_rule = sql_client.firewall_rules.create_or_update(
    RESOURCE_GROUP,
    SQL_SERVER,  # firewall rules are defined on the server, not the database
    "firewall_rule_name_123.123.123.123",
    "123.123.123.123",  # Start IP range
    "167.220.0.235"  # End IP range
)
If you are using Azure Database for MySQL, please refer to the Azure tutorial Python + Azure Database for MySQL:
Azure Database for MySQL and Python can be used together for data analysis: MySQL as the database engine and Python as the statistical tool. When dealing with large datasets that potentially exceed the memory of your machine, it is recommended to push the data into the database engine, where you can query the data in smaller, digestible chunks.
In this article we will learn how to use Python to perform the following tasks:
Create Azure Database for MySQL using the Azure Python SDK
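Coming back to the original question of creating a table inside the database and pushing your API results into it: the management SDK above provisions the server and the database, but for the table itself you would connect with a SQL driver. A minimal sketch, assuming pyodbc and the Microsoft ODBC Driver for SQL Server are installed; the server, credentials, table, and sample rows are placeholders:
import pyodbc

# Placeholder connection details; replace with your Azure SQL server and credentials.
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourvirtualsqlserver.database.windows.net;"
    "DATABASE=YOUR_SQLDB_NAME;"
    "UID=YOUR_USERNAME;PWD=YOUR_PASSWORD"
)
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()

# Create the table if it does not exist, then insert the rows pulled from the API.
cursor.execute(
    "IF OBJECT_ID('dbo.api_items', 'U') IS NULL "
    "CREATE TABLE dbo.api_items (id INT PRIMARY KEY, name NVARCHAR(200))"
)
rows = [(1, 'first'), (2, 'second')]  # placeholder data from your API
cursor.executemany("INSERT INTO dbo.api_items (id, name) VALUES (?, ?)", rows)
conn.commit()
conn.close()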
Hope this helps.

Google Cloud SQL - Can I directly access database with SQLAlchemy - not locally

I'm trying to directly access Google Cloud SQL and create a table there. I want to use as few services as possible (keep it simple), so I really don't want to use the Cloud SDK or anything like that.
I want to use something similar to what I saw here. I tried to replicate it, but I ended up with this error:
AttributeError: module 'socket' has no attribute 'AF_UNIX'
For all this I'm using Python with sqlalchemy & pymysql.
I really don't know how to debug it since I've only been using this for a few hours, but I think the problem could be with the URL or the environment variables (the app.yaml file, which I created).
I think I have already installed all the dependencies I need.
import os
import sqlalchemy

db_user = os.environ.get("db_user")
db_pass = os.environ.get("db_pass")
db_name = os.environ.get("db_name")
cloud_sql_connection_name = os.environ.get("cloud_sql_connection_name")
db = sqlalchemy.create_engine(
    # Equivalent URL:
    # mysql+pymysql://<db_user>:<db_pass>@/<db_name>?unix_socket=/cloudsql/<cloud_sql_instance_name>
    sqlalchemy.engine.url.URL(
        drivername='mysql+pymysql',
        username=db_user,
        password=db_pass,
        database=db_name,
        query={
            'unix_socket': '/cloudsql/{}'.format(cloud_sql_connection_name)
        }
    ),
    pool_size=5,
    max_overflow=2,
    pool_timeout=30,
    pool_recycle=1800,
)
with db.connect() as conn:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS votes "
        "( vote_id SERIAL NOT NULL, time_cast timestamp NOT NULL, "
        "candidate CHAR(6) NOT NULL, PRIMARY KEY (vote_id) );"
    )
I do not use db_user etc. as real values; these are just examples.
It should pass successfully and create a table in Cloud SQL.
Can I directly access the database with SQLAlchemy, not locally?
You are specifying a Unix socket with /cloudsql/{}, which requires that you set up the Cloud SQL Proxy on your local machine.
To access Cloud SQL directly, you will need to specify the public IP address of the Cloud SQL instance. In your call to sqlalchemy.engine.url.URL, specify the host and port parameters and remove the query parameter.
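A minimal sketch of that change, assuming MySQL's default port 3306 and a db_host environment variable holding the instance's public IP (both assumptions), and that your client IP is allowed in the instance's authorized networks:
import os
import sqlalchemy

db_user = os.environ.get("db_user")
db_pass = os.environ.get("db_pass")
db_name = os.environ.get("db_name")
db_host = os.environ.get("db_host")  # public IP of the Cloud SQL instance (assumed env var)

db = sqlalchemy.create_engine(
    # Equivalent URL: mysql+pymysql://<db_user>:<db_pass>@<public_ip>:3306/<db_name>
    sqlalchemy.engine.url.URL(
        drivername='mysql+pymysql',
        username=db_user,
        password=db_pass,
        host=db_host,
        port=3306,  # default MySQL port (assumption)
        database=db_name,
    ),
    pool_size=5,
    max_overflow=2,
    pool_timeout=30,
    pool_recycle=1800,
)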

Import MongoDB output result into S3 bucket using Python

MongoDB Database Name :- testdb,
Collection Name :- test_collection
MongoDB command that I want to execute :-
db.getCollection('test_collection').find({ request_time: { $gte: new Date('2018-06-22'), $lt: new Date('2018-06-26') }});
In the documents of test_collection, there is a key called request_time. I want to fetch the documents in the time range ('2018-06-22') and ('2018-06-26')
MongoDB username :- user
MongoDB Password :- password
MongoDB is running on port 27017.
I need help with two things. I can connect to the database, but how do I provide the username and password for authentication? This is my Python code:
from pymongo import Connection
connection = Connection()
connection = Connection('localhost', 27017)
db = connection.testdb
collection = db.testcollection
for post in collection.find():
    print post
The other thing is, I have an S3 bucket called mongodoc. I want to run that Mongo query and import the resulting documents into the S3 bucket.
I can connect to the S3 bucket by using a library called Boto:
from boto.s3.connection import S3Connection
conn = S3Connection(AWS_KEY, AWS_SECRET)
bucket = conn.get_bucket('mongodoc')
destination = bucket.new_key()
destination.name = filename
destination.set_contents_from_file(myfile)
destination.make_public()
What is the recommended way to achieve this ?
For authentication, you have to provide the username and password along with the host. With pymongo you can pass them as part of a MongoDB connection URI, for example:
connection = Connection('mongodb://user:password@localhost:27017/testdb')
For the S3 connection, try using boto3 rather than boto; boto3 provides a wide range of functionality through both the S3 client and resource interfaces. Once queried, your MongoDB results can be uploaded to your S3 bucket in the form of files.
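A minimal sketch of that flow, assuming pymongo and boto3 are installed and AWS credentials are available in the environment; the bucket name mongodoc, the database/collection names, and the date range come from the question, while the object key is a placeholder:
import json
from datetime import datetime

import boto3
from pymongo import MongoClient

# Connect with authentication (host, port, user, and database names from the question).
client = MongoClient('mongodb://user:password@localhost:27017/testdb')
collection = client['testdb']['test_collection']

# Run the date-range query from the question.
docs = list(collection.find({
    'request_time': {
        '$gte': datetime(2018, 6, 22),
        '$lt': datetime(2018, 6, 26),
    }
}))

# Upload the result as a single JSON file to the mongodoc bucket.
s3 = boto3.client('s3')
s3.put_object(
    Bucket='mongodoc',
    Key='test_collection_2018-06-22_2018-06-26.json',  # placeholder key name
    Body=json.dumps(docs, default=str),  # default=str handles ObjectId and datetime values
)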

Create a table from query results in Google BigQuery

We're using Google BigQuery via the Python API. How would I create a table (new one or overwrite old one) from query results? I reviewed the query documentation, but I didn't find it useful.
We want to simulate:
"SELEC ... INTO ..." from ANSI SQL.
You can do this by specifying a destination table in the query. You would need to use the Jobs.insert API rather than the Jobs.query call, and you should specify writeDisposition=WRITE_APPEND and fill out the destination table.
Here is what the configuration would look like, if you were using the raw API. If you're using Python, the Python client should give accessors to these same fields:
"configuration": {
"query": {
"query": "select count(*) from foo.bar",
"destinationTable": {
"projectId": "my_project",
"datasetId": "my_dataset",
"tableId": "my_table"
},
"createDisposition": "CREATE_IF_NEEDED",
"writeDisposition": "WRITE_APPEND",
}
}
The accepted answer is correct, but it does not provide Python code to perform the task. Here is an example, refactored out of a small custom client class I just wrote. It does not handle exceptions, and the hard-coded query should be customised to do something more interesting than just SELECT * ...
import time
from google.cloud import bigquery
from google.cloud.bigquery.table import Table
from google.cloud.bigquery.dataset import Dataset


class Client(object):

    def __init__(self, origin_project, origin_dataset, origin_table,
                 destination_dataset, destination_table):
        """
        A Client that performs a hardcoded SELECT and INSERTS the results in a
        user-specified location.

        All init args are strings. Note that the destination project is the
        default project from your Google Cloud configuration.
        """
        self.project = origin_project
        self.dataset = origin_dataset
        self.table = origin_table
        self.dest_dataset = destination_dataset
        self.dest_table_name = destination_table
        self.client = bigquery.Client()

    def run(self):
        query = ("SELECT * FROM `{project}.{dataset}.{table}`;".format(
            project=self.project, dataset=self.dataset, table=self.table))

        job_config = bigquery.QueryJobConfig()

        # Set configuration.query.destinationTable
        destination_dataset = self.client.dataset(self.dest_dataset)
        destination_table = destination_dataset.table(self.dest_table_name)
        job_config.destination = destination_table

        # Set configuration.query.createDisposition
        job_config.create_disposition = 'CREATE_IF_NEEDED'

        # Set configuration.query.writeDisposition
        job_config.write_disposition = 'WRITE_APPEND'

        # Start the query
        job = self.client.query(query, job_config=job_config)

        # Wait for the query to finish
        job.result()
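For completeness, a usage sketch of the class above; the origin names come from the earlier configuration example and the destination names are placeholders:
# Copy everything from the origin table into the destination table in the default project.
client = Client(
    origin_project='my_project',
    origin_dataset='my_dataset',
    origin_table='my_table',
    destination_dataset='my_dest_dataset',
    destination_table='my_dest_table',
)
client.run()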
Create a table from query results in Google BigQuery. Assuming you are using a Jupyter Notebook with Python 3, the following steps are explained:
How to create a new dataset on BQ (to save the results)
How to run a query and save the results as a table in the new dataset on BQ
Create a new dataset on BQ: my_dataset
from google.cloud import bigquery

bigquery_client = bigquery.Client()  # Create a BigQuery service object
dataset_id = 'my_dataset'
dataset_ref = bigquery_client.dataset(dataset_id)  # Create a DatasetReference using a chosen dataset ID.
dataset = bigquery.Dataset(dataset_ref)  # Construct a full Dataset object to send to the API.
# Specify the geographic location where the new dataset will reside. This should be the same
# location as the source dataset you are querying.
dataset.location = 'US'
# Send the dataset to the API for creation. Raises google.api_core.exceptions.AlreadyExists
# if the dataset already exists within the project.
dataset = bigquery_client.create_dataset(dataset)  # API request
print('Dataset {} created.'.format(dataset.dataset_id))
Run a query on BQ using Python:
There are 2 cases here:
Allowing large results
A query without allowing large results
I am using the public dataset bigquery-public-data:hacker_news, table id comments, to run the query.
Allowing Large Results
DestinationTableName='table_id1' #Enter new table name you want to give
!bq query --allow_large_results --destination_table=project_id:my_dataset.$DestinationTableName 'SELECT * FROM [bigquery-public-data:hacker_news.comments]'
This query will allow large query results if required.
Without mentioning --allow_large_results:
DestinationTableName='table_id2' #Enter new table name you want to give
!bq query --destination_table=project_id:my_dataset.$DestinationTableName 'SELECT * FROM [bigquery-public-data:hacker_news.comments] LIMIT 100'
This will work for queries whose results stay within the limits mentioned in the Google BQ documentation.
Output:
A new dataset on BQ with the name my_dataset
Results of the queries saved as tables in my_dataset
Note:
These bq queries are shell commands that you can run in a terminal (without the leading !). Since we are running them from a Jupyter notebook, the ! prefix lets us execute shell commands from within the Python program.
