Python - Dynamic variable in Redshift SQL Query

I'm developing in a Python environment and I want to run an SQL query using psycopg2.
Let's say I have the following UNLOAD command in an .sql file:
UNLOAD
(
'
Some SQL Query
'
)
TO 's3://%PATH%'
...
In the SQL file, %PATH% would normally be written out explicitly, e.g. 'folder1/folder3/file_name'.
But I want the Python program to set %PATH% at runtime, which means the .sql file contains the placeholder %PATH% and it only gets filled in at runtime.
Any idea of how to do it?

Implementing it this way will give you a tough time.
The best way is to dump the file at a static location:
UNLOAD
(
'
Some SQL Query
'
)
TO 's3://path/to/static/s3_bucket'
...
and then use (via a shell script, or a suitable command from any other scripting language)
aws s3 mv $source $destination
Here you may pass any value for $destination, which can easily be populated at run time.
In short, you dump the file in S3 at a fixed location (using
UNLOAD) and then move it to a location of your choice, i.e. a location
populated at run time (using aws s3 mv ...).
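If you would rather stay in Python than shell out, the same move can be done with boto3; a minimal sketch, assuming hypothetical bucket and key names (UNLOAD typically writes one or more part files, so in practice you would list the static prefix first):
import boto3

s3 = boto3.resource('s3')
bucket = 'mybucket'                                     # placeholder
source_key = 'path/to/static/s3_bucket/0000_part_00'    # where UNLOAD wrote
destination_key = 'folder1/folder3/file_name'           # decided at run time

# S3 has no real "move": copy to the new key, then delete the original.
s3.Object(bucket, destination_key).copy_from(
    CopySource={'Bucket': bucket, 'Key': source_key})
s3.Object(bucket, source_key).delete()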

You simply specify a replacement field in your SQL file, and then use a format call.
Create your file like this
UNLOAD ('Some SQL Query')
TO 's3://{bucket}/{key}'
And use this file in Python like this:
template = open('file1.sql', 'r').read()
query = template.format(bucket='mybucket', key='folder/file.csv')
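To actually run the formatted query, you can pass it straight to psycopg2; a minimal sketch, assuming placeholder connection details and that the UNLOAD statement in the file already carries its own credentials or IAM role clause:
import psycopg2

conn = psycopg2.connect(host='my-cluster.example.com', port=5439,
                        dbname='mydb', user='user', password='password')
conn.autocommit = True              # run the UNLOAD as a single statement
with conn.cursor() as cur:
    cur.execute(query)              # `query` is the formatted template above
conn.close()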

You would not be able to set the UNLOAD path dynamically within the SQL itself; however, you could put your SQL statement in something like a shell or Python script, create variables holding the path you'd like, and then pass them into the query.
This UNLOAD utility by AWS will get you started if you decide to go with a Python script.

Related

A way to export psql table (or query) directly to AWS S3 as file (csv, json)

This answer suggests using AWS Data Pipeline but I'm wondering if there's a clever way to do it with my own machine and Python.
I've been using the psycopg2, boto3 and pandas libraries. Tables have 5 to 50 columns and a few million rows. My current method doesn't work that well with large data.
Guess I can show one of my own versions here as well, which is based on copy_expert in psycopg2:
import io
import psycopg2
import boto3
resource = boto3.resource('s3')
conn = psycopg2.connect(dbname=db, user=user, password=pw, host=host)
cur = conn.cursor()
def copyFun(bucket, select_query, filename):
    # COPY ... TO STDOUT (not STDIN) is what copy_expert needs when exporting.
    query = f"""COPY {select_query} TO STDOUT \
            WITH (FORMAT csv, DELIMITER ',', QUOTE '"', HEADER TRUE)"""
    file = io.StringIO()
    cur.copy_expert(query, file)
    resource.Object(bucket, f'{filename}.csv').put(Body=file.getvalue())
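Hypothetical usage of the function above; the SELECT is wrapped in parentheses so it is valid inside the COPY statement, and the bucket and key prefix are placeholders:
copyFun('my-bucket',
        '(SELECT * FROM my_schema.example)',
        'exports/example')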
Here is what we do in our case. Performance-wise it's pretty fast, and it's a scheduled method rather than continuous streaming. I'm not 100% sure whether it's the wisest method, but it's definitely good from a speed perspective for scheduled data exports in CSV format that we eventually use for loading into the data warehouse.
Using a shell script, we fire a psql command to copy the data to a local file on the EC2 app instance.
psql [your connection options go here] -F, -A -c 'select * from my_schema.example' >example.csv
Then, using the shell script, we fire an s3cmd command to put example.csv into the designated S3 bucket location.
s3cmd put example.csv s3://your-bucket/path/to/file/
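The same two steps can be scripted from Python; a minimal sketch, assuming psql is on the PATH, the connection options come from the environment (PGHOST, PGUSER, etc.), and the bucket and key below are placeholders:
import subprocess
import boto3

# Step 1: let psql write the CSV locally, mirroring the shell command above.
with open('example.csv', 'wb') as out:
    subprocess.run(
        ['psql', '-F', ',', '-A',
         '-c', 'select * from my_schema.example'],
        stdout=out, check=True)

# Step 2: upload it, the boto3 equivalent of the s3cmd put.
boto3.client('s3').upload_file('example.csv', 'your-bucket',
                               'path/to/file/example.csv')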
This is an old question, but it comes up when searching for "aws_s3.export_query_to_s3", even though there is no mention of it here, so I thought I'd throw another answer out there.
This can be done natively with a Postgres extension if you're using AWS Aurora Postgres 11.6 or above, via aws_s3.query_export_to_s3.
Exporting data from an Aurora PostgreSQL DB cluster to Amazon S3
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html
See here for the function reference:
https://docs.amazonaws.cn/en_us/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html#postgresql-s3-export-functions
This has been available in Aurora for Postgres since 3.1.0, which was released on February 11, 2020 (I don't know why this URL says 2018): https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Updates.20180305.html#AuroraPostgreSQL.Updates.20180305.310
I would not recommend using 3.1.0/11.6, however: there is a bug that causes data corruption after 10 MB of data is exported to S3: https://forums.aws.amazon.com/thread.jspa?messageID=962494
I just tested with 3.3.1, from September 17, 2020, and the issue isn't present, so, anyone who wants a way to dump data from Postgres to S3... and is on AWS, give this a try!
Here's an example query to create JSONL for you.
JSONL is JSON, with a single JSON object per line: https://jsonlines.org/
So you can dump a whole table to a JSONL file, for example. You could also do json_agg in Postgres and dump a single JSON file with the objects in an array; it's up to you, really. Just change the query and the file extension, and leave it as text format.
select * from aws_s3.query_export_to_s3(
    'select row_to_json(data) from (<YOUR QUERY HERE>) data',
    aws_commons.create_s3_uri(
        'example-bucket/some/path',
        'whatever.jsonl',
        'us-east-1'),
    options := 'format text');
For CSV, something like this should do the trick:
select * from aws_s3.query_export_to_s3(
    '<YOUR QUERY HERE>',
    aws_commons.create_s3_uri(
        'example-bucket/some/path',
        'whatever.csv',
        'us-east-1'),
    options := 'format csv');
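And since the question asks about driving this from Python: the same call can be issued through psycopg2. A minimal sketch, assuming an Aurora PostgreSQL cluster with the aws_s3 extension installed and placeholder connection details:
import psycopg2

conn = psycopg2.connect(host='my-aurora.example.com', dbname='mydb',
                        user='user', password='password')
with conn, conn.cursor() as cur:
    cur.execute("""
        select * from aws_s3.query_export_to_s3(
            'select * from my_schema.example',
            aws_commons.create_s3_uri('example-bucket/some/path',
                                      'whatever.csv',
                                      'us-east-1'),
            options := 'format csv');
    """)
    print(cur.fetchone())   # rows uploaded, files uploaded, bytes uploaded
conn.close()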

Python - sqlite - how to save a database output to a text file

I have a Python program which uses SQLite features. The program lists first name, last name, and type of pet. The data is saved as a DB file called pets.db. I want to be able to convert this database into text. To do this, I tried to use a dump statement in the command prompt. Here is my output:
sqlite> .output file location of pets.db
Usage: .output FILE
sqlite> .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
COMMIT;
sqlite>.exit
However, pets.txt does not exist when I type
dir pets.txt /s /p
in command prompt.
Any suggestions? I used http://www.sqlitetutorial.net/sqlite-dump/ as a guide.
Your .output command is slightly off. Based on your comment, it sounds like you're typing .output file $(location of pets.db). This isn't how the command works.
First off, open the database you want to dump with the command sqlite3 pets.db. This will open your database. You can make sure you have the right database with the command .tables: if you see tables in the database, you know you've opened it correctly; if not, the command won't display anything.
Once you've opened the database, .output $(filename).txt will actually set the output to the specified file. Now you can use the .dump command. It'll take a moment to write the whole db if it's somewhat large.
Once the file is finished writing, you can exit with .exit.
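If you'd rather stay in Python than the sqlite3 shell, the standard sqlite3 module can produce the same dump; a minimal sketch, assuming pets.db is in the current directory:
import sqlite3

conn = sqlite3.connect('pets.db')
with open('pets.txt', 'w') as f:
    for line in conn.iterdump():    # same SQL text that .dump emits
        f.write(line + '\n')
conn.close()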

Use python and psycopg2 to execute a sql file that contains a DROP DATABASE statement

I am trying to use a python function to execute a .sql file.
The sql file begins with a DROP DATABASE statement.
The first lines of the .sql file look like this:
DROP DATABASE IF EXISTS myDB;
CREATE DATABASE myDB;
The rest of the .sql file defines all the tables and views for 'myDB'
Python Code:
def connect():
    conn = psycopg2.connect(dbname='template1', user='user01')
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cursor = conn.cursor()
    sqlfile = open('/path/to/myDB-schema.sql', 'r')
    cursor.execute(sqlfile.read())
    db = psycopg2.connect(dbname='myDB', user='user01')
    cursor = db.cursor()
    return db, cursor
When I run the connect() function, I get an error on the DROP DATABASE statement.
ERROR:
psycopg2.InternalError: DROP DATABASE cannot be executed from a function or multi-command string
I spent a lot of time googling this error, and I can't find a solution.
I also tried adding an AUTOCOMMIT statement to the top of the .sql file, but it didn't change anything.
SET AUTOCOMMIT TO ON;
I am aware that PostgreSQL doesn't allow you to drop a database that you are currently connected to, but I didn't think that was the problem here, because I begin the connect() function by connecting to the template1 database, and from that connection create the cursor object which runs the .sql file.
Has anyone else run into this error? Is there any way to execute the .sql file from a Python function?
This worked for me for a file consisting of one SQL query per line:
sql_file = open('file.sql','r')
cursor.execute(sql_file.read())
You are reading in the entire file and passing the whole thing to PostgreSQL as one string (as the error message says, a "multi-command string"). Is that what you are intending to do? If so, it isn't going to work.
Try this:
cursor.execute(sqlfile.readline())
Or, shell out to psql and let it do the work.
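For the shell-out route, a minimal sketch, assuming psql is on the PATH and reusing the file path and user from the question; psql runs the file statement by statement, so DROP DATABASE no longer sits inside a multi-command string:
import subprocess

subprocess.run(
    ['psql', '-v', 'ON_ERROR_STOP=1',
     '-U', 'user01', '-d', 'template1',
     '-f', '/path/to/myDB-schema.sql'],
    check=True)   # raises CalledProcessError if any statement fails
Note that if the table and view definitions are meant to land in myDB rather than template1, the file also needs a \connect myDB line right after the CREATE DATABASE statement.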
In order to deploy scripts via cron that serve as ETL jobs driven by .sql files, we had to expand how we call the SQL file itself.
sql_file = os.path.join(os.path.dirname(__file__), "../sql/ccd_parcels.sql")
sqlcurr = open(sql_file, mode='r').read()
curDest.execute(sqlcurr)
connDest.commit()
This seemed to please the CRON job...

Using Postgres's COPY FROM file query in Python without writing to a temporary file

I need to load data from some source data sources to a Postgres database.
To do this, I first write the data to a temporary CSV file and then load the data from the CSV file into the Postgres database using a COPY FROM query. I do all of this in Python.
The code looks like this:
table_name = 'products'
temp_file = "'C:\\Users\\username\\tempfile.csv'"
db_conn = psycopg2.connect(host=host, port=port, user=user, password=password, dbname=database)
cursor = db_conn.cursor()
query = """COPY """ + table_name + """ FROM """ + temp_file + " WITH NULL AS ''; """
cursor.execute(query)
I want to avoid the step of writing to the intermediate file. Instead, I would like to write to a Python object and then load data to postgres database using COPY FROM file method.
I am aware of this technique of using psycopg2's copy_from method, which copies data from a StringIO object to the Postgres database. However, I cannot use psycopg2 for a reason, and hence I don't want my COPY FROM task to depend on that library. I want it to be a Postgres query which can be run by any other Postgres driver as well.
Please advise a better way of doing this without writing to an intermediate file.
You could call the psql command-line tool from your script (i.e. using subprocess.call) and leverage its \copy command, piping the output of one instance to the input of another to avoid a temp file:
psql -X -h from_host -U user -c "\copy from_table to stdout" | psql -X -h to_host -U user -c "\copy to_table from stdin"
This assumes the table exists in the destination database. If not, a separate command would first need to create it.
Also, note that one caveat of this method is that errors from the first psql call can get swallowed by the piping process.
psycopg2 has integrated support for the COPY wire-protocol, allowing you to use COPY ... FROM STDIN / COPY ... TO STDOUT.
See Using COPY TO and COPY FROM in the psycopg2 docs.
Since you say you can't use psycopg2, you're out of luck. Drivers must understand COPY TO STDOUT / COPY FROM STDIN in order to use them, or must provide a way to write raw data to the socket so you can hijack the driver's network socket and implement the COPY protocol yourself. Driver-specific code is absolutely required for this; it is not possible to simply use the DB-API.
So khampson's suggestion, while usually a really bad idea, seems to be your only alternative.
(I'm posting this mostly to make sure that other people who find this answer who don't have restrictions against using psycopg2 do the sane thing.)
If you must use psql, please:
Use the subprocess module with the Popen constructor
Pass -qAtX and -v ON_ERROR_STOP=1 to psql to get sane behaviour for batching.
Use the array form of the command, e.g. ['psql', '-v', 'ON_ERROR_STOP=1', '-qAtX', '-c', '\copy mytable from stdin'], rather than going through a shell.
Write to psql's stdin, then close it, and wait for psql to finish.
Remember to trap exceptions thrown on command failure. Let subprocess capture stderr and wrap it in the exception object.
It's safer, cleaner, and easier to get right than the old-style os.popen2 etc.
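A minimal sketch pulling those points together, assuming hypothetical CSV data already in memory, an existing mytable in the target database, and connection settings taken from the PG* environment variables:
import subprocess

csv_data = '1,foo\n2,bar\n'     # placeholder rows

proc = subprocess.Popen(
    ['psql', '-v', 'ON_ERROR_STOP=1', '-qAtX',
     '-c', '\\copy mytable from stdin with (format csv)'],
    stdin=subprocess.PIPE, stderr=subprocess.PIPE)

_, err = proc.communicate(csv_data.encode())   # write to stdin, close, wait
if proc.returncode != 0:
    # Wrap psql's stderr so the caller sees why the COPY failed.
    raise RuntimeError('psql \\copy failed: ' + err.decode())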

Export & Map CSV output to MySQL table using Python

I have a multiple-client to single-server bidirectional iperf setup for network monitoring. The iperf server runs well and displays output in CSV format based on the cron jobs written on the client end.
I wish to write a Python script to automate the process of mapping these CSV outputs to a MySQL database, which in turn would be updated and saved at regular intervals without the need for human intervention.
I am using a Ubuntu 13.10 machine as the iperf server. Following is a sample CSV output that I get. This is not being stored to a file, just being displayed on screen.
s1:~$ iperf -s -y C
20140422105054,172.16.10.76,41065,172.16.10.65,5001,6,0.0-20.0,73138176,29215083
20140422105054,172.16.10.76,5001,172.16.10.65,56254,4,0.0-20.0,46350336,18502933
20140422105100,172.16.10.76,54550,172.16.10.50,5001,8,0.0-20.0,67895296,27129408
20140422105100,172.16.10.76,5001,172.16.10.50,58447,5,0.0-20.1,50937856,20292796
20140422105553,172.16.10.76,5001,172.16.10.65,47382,7,0.0-20.1,51118080,20358083
20140422105553,172.16.10.76,41067,172.16.10.65,5001,5,0.0-20.1,76677120,30524007
20140422105600,172.16.10.76,5001,172.16.10.50,40734,4,0.0-20.0,57606144,23001066
20140422105600,172.16.10.76,54552,172.16.10.50,5001,8,0.0-20.0,70123520,28019115
20140422110053,172.16.10.76,41070,172.16.10.65,5001,5,0.0-20.1,63438848,25284066
20140422110053,172.16.10.76,5001,172.16.10.65,46462,6,0.0-20.1,11321344,4497094
The fields I want to map them to are: timestamp, server_ip, server_port, client_ip, client_port, tag_id, interval, transferred, bandwidth
I want to map this CSV output periodically to a MySQL database, for which I understand I would have to write a Python script (inside a cron job) doing the querying and storing into the MySQL database. I am a beginner at Python scripting and database queries.
I went through another discussion on Server Fault at https://serverfault.com/questions/566737/iperf-csv-output-format and would like to build my query based on it.
Generate SQL script, then run it
If you do not want to use complex solutions like sqlalchemy, the following approach is possible:
having your csv data, convert it into an SQL script
use the mysql command-line tool to run this script
Before you do it the first time, be sure to create the needed database structure in the database (this I leave to you).
My sample below uses (just for my convenience) the docopt package, so you need to install it:
$ pip install docopt
CSV to SQL script conversion utility
csv2sql.py:
"""
Usage:
csv2sql.py [--table <tablename>] <csvfile>
Options:
--table <tablename> Name of table in database to import into [default: mytable]
Convert csv file with iperf data into sql script for importing
those data into MySQL database.
"""
from csv import DictReader
from docopt import docopt
if __name__ == "__main__":
args = docopt(__doc__)
fname = args["<csvfile>"]
tablename = args["--table"]
headers = ["timestamp",
"server_ip",
"server_port",
"client_ip",
"client_port",
"tag_id",
"interval",
"transferred",
"bandwidth"
]
sql = """insert into {tablename}
values ({timestamp},"{server_ip}",{server_port},"{client_ip}",{client_port},{tag_id},"{interval}",{transferred},{bandwidth});"""
with open(fname) as f:
reader = DictReader(f, headers, delimiter=",")
for rec in reader:
print(sql.format(tablename=tablename, **rec)) # python <= 2.6 will fail here
Convert CSV to SQL script
First, let the conversion utility introduce itself:
$ python csv2sql.py -h
Usage:
    csv2sql.py [--table <tablename>] <csvfile>

Options:
    --table <tablename>  Name of table in database to import into [default: mytable]

Convert csv file with iperf data into sql script for importing
those data into MySQL database.
Having your data in file data.csv:
$ python csv2sql.py data.csv
insert into mytable
values (20140422105054,"172.16.10.76",41065,"172.16.10.65",5001,6,"0.0-20.0",73138176,29215083);
insert into mytable
values (20140422105054,"172.16.10.76",5001,"172.16.10.65",56254,4,"0.0-20.0",46350336,18502933);
insert into mytable
values (20140422105100,"172.16.10.76",54550,"172.16.10.50",5001,8,"0.0-20.0",67895296,27129408);
insert into mytable
values (20140422105100,"172.16.10.76",5001,"172.16.10.50",58447,5,"0.0-20.1",50937856,20292796);
insert into mytable
values (20140422105553,"172.16.10.76",5001,"172.16.10.65",47382,7,"0.0-20.1",51118080,20358083);
insert into mytable
values (20140422105553,"172.16.10.76",41067,"172.16.10.65",5001,5,"0.0-20.1",76677120,30524007);
insert into mytable
values (20140422105600,"172.16.10.76",5001,"172.16.10.50",40734,4,"0.0-20.0",57606144,23001066);
insert into mytable
values (20140422105600,"172.16.10.76",54552,"172.16.10.50",5001,8,"0.0-20.0",70123520,28019115);
insert into mytable
values (20140422110053,"172.16.10.76",41070,"172.16.10.65",5001,5,"0.0-20.1",63438848,25284066);
insert into mytable
values (20140422110053,"172.16.10.76",5001,"172.16.10.65",46462,6,"0.0-20.1",11321344,4497094);
Put it all into file data.sql:
$ python csv2sql.py data.csv > data.sql
Apply data.sql to your MySQL database
And finally use the mysql command (provided by MySQL) to do the import into the database:
$ mysql --user username --password=password db_name < data.sql
If you plan on using Python, then I would recommend using sqlalchemy.
The general approach is:
define a class which has all the attributes you want to store
map all the properties of the class to database columns and types
read your data from csv (using e.g. the csv module), create for each row a corresponding object of the class prepared before, and let it be stored
sqlalchemy will provide you more details and instructions; your requirement seems rather easy.
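A minimal sketch of that approach, assuming SQLAlchemy 1.4+, a placeholder connection URL, and a hypothetical table whose columns mirror the fields listed in the question:
import csv

from sqlalchemy import create_engine, Column, Integer, BigInteger, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class IperfRecord(Base):
    __tablename__ = 'iperf_records'        # hypothetical table name
    id = Column(Integer, primary_key=True)
    timestamp = Column(String(14))
    server_ip = Column(String(15))
    server_port = Column(Integer)
    client_ip = Column(String(15))
    client_port = Column(Integer)
    tag_id = Column(Integer)
    interval = Column(String(16))
    transferred = Column(BigInteger)
    bandwidth = Column(BigInteger)

engine = create_engine('mysql+pymysql://user:password@localhost/db_name')
Base.metadata.create_all(engine)           # create the table if it is missing

headers = ['timestamp', 'server_ip', 'server_port', 'client_ip',
           'client_port', 'tag_id', 'interval', 'transferred', 'bandwidth']

# Read each CSV row into an object and let the session store it.
with Session(engine) as session, open('data.csv') as f:
    for row in csv.DictReader(f, headers):
        session.add(IperfRecord(**row))
    session.commit()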
Another option is to find an existing CSV import tool; some are already available with MySQL, and there are plenty of others too.
This probably is not the kind of answer you are looking for, but if you learn a little sqlite3 (a native Python module - "import sqlite3") by doing a basic tutorial online, you will realize that your problem is not at all difficult to solve. Then just use a standard timer, such as time.sleep(), to repeat the procedure.
