Redshift COPY operation doesn't work in SQLAlchemy

I'm trying to do a Redshift COPY in SQLAlchemy.
The following SQL correctly copies objects from my S3 bucket into my Redshift table when I execute it in psql:
COPY posts FROM 's3://mybucket/the/key/prefix'
WITH CREDENTIALS 'aws_access_key_id=myaccesskey;aws_secret_access_key=mysecretaccesskey'
JSON AS 'auto';
I have several files named
s3://mybucket/the/key/prefix.001.json
s3://mybucket/the/key/prefix.002.json
etc.
I can verify that the new rows were added to the table with select count(*) from posts.
However, when I execute the exact same SQL expression in SQLAlchemy, execute completes without error, but no rows get added to my table.
session = get_redshift_session()
session.bind.execute("COPY posts FROM 's3://mybucket/the/key/prefix' WITH CREDENTIALS 'aws_access_key_id=myaccesskey;aws_secret_access_key=mysecretaccesskey' JSON AS 'auto';")
session.commit()
It doesn't matter whether I do the above or
from sqlalchemy.sql import text
session = get_redshift_session()
session.execute(text("COPY posts FROM 's3://mybucket/the/key/prefix' WITH CREDENTIALS 'aws_access_key_id=myaccesskey;aws_secret_access_key=mysecretaccesskey' JSON AS 'auto';"))
session.commit()

I basically had the same problem, though in my case it was more:
engine = create_engine('...')
engine.execute(text("COPY posts FROM 's3://mybucket/the/key/prefix' WITH CREDENTIALS 'aws_access_key_id=myaccesskey;aws_secret_access_key=mysecretaccesskey' JSON AS 'auto';"))
Stepping through with pdb showed that the problem was that .commit() was never invoked. I don't know why session.commit() is not working in your case (maybe the session "lost track" of the sent commands?), so this might not actually fix your problem.
Anyhow, as explained in the sqlalchemy docs
Given this requirement, SQLAlchemy implements its own “autocommit” feature which works completely consistently across all backends. This is achieved by detecting statements which represent data-changing operations, i.e. INSERT, UPDATE, DELETE [...] If the statement is a text-only statement and the flag is not set, a regular expression is used to detect INSERT, UPDATE, DELETE, as well as a variety of other commands for a particular backend.
So, there are 2 solutions, either:
text("COPY posts FROM 's3://mybucket/the/key/prefix' WITH CREDENTIALS 'aws_access_key_id=myaccesskey;aws_secret_access_key=mysecretaccesskey' JSON AS 'auto';").execution_options(autocommit=True)
Or, get a fixed version of the redshift dialect... I just opened a PR about it
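To see why COPY slips through keyword-based detection, here is a minimal sketch. The pattern is a simplified stand-in written for illustration, not SQLAlchemy's actual regular expression, and needs_autocommit is a hypothetical helper name.

```python
import re

# Simplified, illustrative pattern; NOT the exact expression SQLAlchemy uses.
AUTOCOMMIT_PATTERN = re.compile(
    r"\s*(?:INSERT|UPDATE|DELETE|CREATE|DROP|ALTER)\b", re.IGNORECASE
)


def needs_autocommit(statement: str) -> bool:
    """Return True if the statement starts with a keyword the pattern knows."""
    return AUTOCOMMIT_PATTERN.match(statement) is not None


print(needs_autocommit("INSERT INTO posts VALUES (1)"))
print(needs_autocommit("COPY posts FROM 's3://mybucket/the/key/prefix' JSON AS 'auto'"))
```

A statement that starts with COPY matches none of the listed keywords, so under this kind of detection it would be executed but never committed unless autocommit is requested explicitly.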

Adding a commit to the end of the COPY statement worked for me:
<your copy sql>;commit;

I have had success using the core expression language and Connection.execute() (as opposed to the ORM and sessions) to copy delimited files to Redshift with the code below. Perhaps you could adapt it for JSON.
from sqlalchemy.sql import text

def copy_s3_to_redshift(conn, s3path, table, aws_access_key, aws_secret_key, delim='\t', uncompress='auto', ignoreheader=None):
    """Copy a TSV file from S3 into redshift.

    Note the CSV option is not used, so quotes and escapes are ignored. Empty fields are loaded as null.
    Does not commit a transaction.

    :param Connection conn: SQLAlchemy Connection
    :param str uncompress: None, 'gzip', 'lzop', or 'auto' to autodetect from `s3path` extension.
    :param int ignoreheader: Ignore this many initial rows.
    :return: Whatever a copy command returns.
    """
    if uncompress == 'auto':
        uncompress = 'gzip' if s3path.endswith('.gz') else 'lzop' if s3path.endswith('.lzo') else None
    # copy command doesn't like table name or keys single-quoted,
    # so they are formatted into the statement text rather than bound
    copy = text("""
        copy "{table}"
        from :s3path
        credentials 'aws_access_key_id={aws_access_key};aws_secret_access_key={aws_secret_key}'
        delimiter :delim
        emptyasnull
        ignoreheader :ignoreheader
        compupdate on
        comprows 1000000
        {uncompress};
        """.format(uncompress=uncompress or '', table=text(table), aws_access_key=aws_access_key, aws_secret_key=aws_secret_key))
    return conn.execute(copy, s3path=s3path, delim=delim, ignoreheader=ignoreheader or 0)

Related

How does a SQLAlchemy Session manage transactions, when executing multiple raw SQL statements at once?

None of the "similar questions" really get at this specific topic, but I am trying to find out how SQLAlchemy's Session handles transactions, when:
Passing raw SQL text to the execute() method, rather than utilizing any SQLAlchemy model objects, AND
The raw SQL text contains multiple distinct commands.
For instance:
bulk_operation = """
DELETE FROM the_table WHERE id = ...;
INSERT INTO the_table (id, name) VALUES (...);
"""
sql = text(bulk_operation)
session.execute(sql.bindparams(id=foo, name=bar))
The goal here is to restore the original state, if either the DELETE or the INSERT fails for any reason.
But does Session.execute() actually guarantee this, in this context? Is it necessary to include BEGIN and COMMIT commands within the raw SQL text itself, or manage from the Python level with session.commit() or something else? Thanks in advance!
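For illustration only, the all-or-nothing behavior described above can be sketched with the standard-library sqlite3 module rather than a SQLAlchemy Session. It shows the transaction being managed from the Python level with two separate statements; it does not show how Session.execute() treats a multi-statement string. The table layout and the replace_row helper are assumptions made for this sketch.

```python
import sqlite3


def replace_row(conn, row_id, name):
    """Delete and re-insert a row; both statements succeed or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("DELETE FROM the_table WHERE id = ?", (row_id,))
        conn.execute(
            "INSERT INTO the_table (id, name) VALUES (?, ?)", (row_id, name)
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO the_table (id, name) VALUES (1, 'original')")
conn.commit()

try:
    replace_row(conn, 1, None)  # the NOT NULL violation makes the INSERT fail
except sqlite3.IntegrityError:
    pass

rows = conn.execute("SELECT id, name FROM the_table").fetchall()
print(rows)  # [(1, 'original')]: the DELETE was rolled back too
```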

Prevent SQL Injection in BigQuery with Python for table name

I have an Airflow DAG which takes an argument from the user for table.
I then use this value in an SQL statement and execute it in BigQuery. I'm worried about exposing myself to SQL Injection.
Here is the code:
sql = f"""
CREATE OR REPLACE TABLE {PROJECT}.{dataset}.{table} PARTITION BY DATE(start_time) as (
//OTHER CODE
)
"""
client = bigquery.Client()
query_job = client.query(sql)
Both dataset and table get passed through via airflow but I'm worried someone could pass through something like: random_table; truncate other_tbl; -- as the table argument.
My fear is that the above will create a table called random_table and then truncate an existing table.
Is there a safer way to process these passed through arguments?
I've looked into parameterized queries in BigQuery but these don't work for table names.
You will have to create a table name validator. I think you can validate reasonably safely by wrapping the table name string in backticks (`) at the start and at the end. It's not a 100% solution, but it worked for the test scenarios I tried. It should work like this:
# validate should look for ` at the beginning and end of your tablename
table_name = validate(f"`{project}.{dataset}.{table}`")
sql = f"""
CREATE OR REPLACE TABLE {table_name} PARTITION BY DATE(start_time) as (
//OTHER CODE
)
"""
...
Note: I suggest checking the following post on Medium about BigQuery SQL injection.
I checked the official documentation on running parameterized queries, and sadly it only covers parameterizing values, not table names or other string parts of your query.
As a final note, I recommend opening a feature request for BigQuery for this particular scenario.
You should probably look into sanitization/validation of user input in general. This is done before passing the input to the BQ query.
With Python, you could look for malicious strings in the user input - like truncate in your example - or use a regex to filter input that for instance contains --. Those are just some quick examples. I recommend you do more research on that topic; you will also find quite a few questions on that topic on SE.
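As a rough sketch of such validation, an allowlist is usually easier to reason about than searching for malicious strings. The pattern below is a deliberately conservative assumption (ASCII letters, digits and underscores only, stricter than what BigQuery itself accepts), and validate_identifier and qualified_table are hypothetical helper names. The project is assumed to be a trusted constant, as PROJECT is in the question.

```python
import re

# Conservative allowlist chosen for this sketch; stricter than BigQuery's own rules.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_identifier(name: str) -> str:
    """Return name unchanged if it is a plain identifier, else raise ValueError."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def qualified_table(project: str, dataset: str, table: str) -> str:
    """Build a backtick-quoted table path from user-supplied dataset and table."""
    dataset = validate_identifier(dataset)
    table = validate_identifier(table)
    return f"`{project}.{dataset}.{table}`"


print(qualified_table("my-project", "my_dataset", "random_table"))
```

With this in place, an argument such as random_table; truncate other_tbl; -- raises ValueError before any SQL is built.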

Arbitrary selects failing using cx_Oracle with instant client

I am trying to select data from an ORACLE 12c database using cx_Oracle, but I am getting the exception: "cx_Oracle.OperationalError: ORA-03113: end-of-file on communication channel".
My query behaves fine using PyCharm (jdbc:oracle:thin driver). Using cx_Oracle in Python 3.6, however, the query fails unless I reduce the number of IDs in the IN clause from 500 to about 250. The Cursor.fetchall() function is what throws the exception. I do not have privileged access to the database in order to check things like locks or load, but could these be the cause of the issue? According to our DBA, there is nothing wrong on the Oracle db server, and since the query works fine otherwise, I am inclined to believe it. I have messed with the client-side sqlnet.ora as well, which has allowed exceptions to eventually be thrown instead of hanging forever, but I still cannot fetch the data.
def select(self, query, *args):
    cur = self.dbh.cursor()
    cur.prepare(query)
    try:
        cur.execute(None, args)
        return cur.fetchall()
    # my attempt to handle the issue
    except (cx_Oracle.OperationalError, cx_Oracle.DatabaseError) as e:
        # cx_Oracle.OperationalError: ORA-03113: end-of-file on communication channel
        self.logger.error('Oracle Error: {}'.format(traceback.format_exc()))
        raise e
The code calls select like this. For brevity, I've omitted the full string IDs
ids = ['1', '2', '3', ...]
query = """\
select * from my_table where id in(:0,:1,:2,:3,:4, ...)
"""
self.select(query, *ids)
The query fails without the placeholders (with the IDs placed directly in the query) as well.
I expect to be able to run any select query using an IN clause with up to 1000 IDs without receiving the ORA-03113 Exception.
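The workaround of sending fewer IDs per query could be automated along these lines. This is only a sketch: chunked, build_in_query and select_in_batches are hypothetical helpers, the batch size of 250 is simply the value that happened to work above, and the select argument stands for any callable with the same signature as the select method shown earlier. It sidesteps the error rather than explaining it.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_in_query(count):
    """Build the query text with count positional placeholders (:0, :1, ...)."""
    placeholders = ",".join(f":{i}" for i in range(count))
    return f"select * from my_table where id in({placeholders})"


def select_in_batches(select, ids, batch_size=250):
    """Run one query per batch of IDs and concatenate the fetched rows."""
    rows = []
    for batch in chunked(ids, batch_size):
        rows.extend(select(build_in_query(len(batch)), *batch))
    return rows
```

It would be called as select_in_batches(self.select, ids), issuing several smaller queries instead of one with 500 placeholders.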
Edit:
I installed oracle-instantclient18.5-basic-18.5.0.0.0-3.x86_64.rpm on Ubuntu 18.04.2, have cx_Oracle version 7.1.2, and I am connecting to Oracle 12.1.0.2.0.
The query is on the underlying tables of BMC Software's ARS. I will start working to try to replicate the problem with a local table structure, but it is a mess and will take some time. If I am able to create a local copy of the tables, I'm not sure I'd be able to replicate the issue, as identical queries with different IDs work fine. That makes it seem data driven, however, after I reduced the query to 250 IDs, I swapped the 250 from the first half to the second half, and got the same success result, so it doesn't seem to be just one bad row.
Is there more helpful logging I can enable on the client side to get more information?
Edit2: I should also add that the issue does not just occur with one query. I've seen the same issue with select queries to completely different tables.
Edit3: I just found out that commenting out some of the columns I'm selecting can also make the query work. Columns like this:
to_char(to_date('1970-01-01','YYYY-MM-DD') + numtodsinterval(EventStart,'SECOND'),'YYYY-MM-DD HH24:MI:SS')
This may indicate that some kind of timeout is being reached which may or may not be configured in my sqlnet.ora:
DISABLE_OOB=on
SQLNET.RECV_TIMEOUT=60
SQLNET.SEND_TIMEOUT=60
TCP.CONNECT_TIMEOUT=300
SQLNET.OUTBOUND_CONNECT_TIMEOUT=300
ENABLE=BROKEN
TRACE_LEVEL_CLIENT=ADMIN
TRACE_FILE_CLIENT=sqlnet
Edit 4: I've tried some more things.
I installed the same version of instant client, except on a windows 7 machine, and ran the same query against the same db instance. The query succeeded.
I also narrowed down that, for this particular query, it will accept 499 IDs but fails with 500. It doesn't matter which ID I comment out of the query.
I also tried tricking the query into thinking there were fewer IDs by using a sub-select instead:
IN(
select regexp_substr(:0,'[^,]+', 1, level) from dual connect by regexp_substr(:0, '[^,]+', 1, level) is not null
)
I got the error "cx_Oracle.DatabaseError: ORA-01460: unimplemented or unreasonable conversion requested", which I then realized made sense, because Oracle will only allow a string to be up to 4000 bytes long.
I think I finally found a way to get everything functioning. I finally came across this link:
https://ardentperf.com/2010/09/08/mysterious-oracle-net-errors/
It turns out that this solved my problem. I am still having trouble getting cx_Oracle to honor a connection string of the same format as the tnsnames.ora file, but I changed my code to refer to the tnsnames.ora for now as follows:
connection_info = {
    'user': self.config.get(self.db, 'user'),
    'pass': self.config.get(self.db, 'password')
}
connection_string = '{user}/{pass}@TEST'\
    .format(**connection_info)
connection = cx_Oracle.connect(connection_string)
where my tnsnames.ora contains the following:
TEST =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(Host = myhost.com)(Port = 1521))
)
(SDU=1024)
(CONNECT_DATA =
(SID=mysid)
)
)
The key here is the SDU=1024, which inexplicably fixes this issue.
https://docs.oracle.com/cd/B28359_01/network.111/b28317/sqlnet.htm#NETRF184
Documentation from the above link indicates that the default for SDU is 8192 bytes (8 KB) and my understanding is that there is supposed to be auto-negotiation of this value. This does not appear to be the case, and I don't know what past defaults have been.

SQLAlchemy core insert does not insert data into db

SQLAlchemy Core does not insert any rows when I call connection.execute(table.insert(), list_of_rows). I create the connection without any additional parameters (connection = engine.connect()) and the engine with only one additional parameter (engine = create_engine(uri, echo=True)).
Besides not finding the data in the db, I also can't find any "INSERT" statement in my app's logs.
It may be important that I'm reproducing this issue in py.test tests.
The DB I use is MSSQL in a Docker container.
EDIT1:
The rowcount of the result proxy is always -1, regardless of whether I use a transaction, and even if I change the insert to connection.execute(table.insert().execution_options(autocommit=True), list_of_rows).rowcount
EDIT2:
I rewrote this code and now it works. I don't see any major difference.
What is the inserted row count after connection.execute?
proxy = connection.execute(table.insert(), list_of_rows)
print(proxy.rowcount)
If rowcount is a positive integer, the data was indeed written to the DB, but it may only exist inside an uncommitted transaction. In that case, check whether autocommit is on: https://docs.sqlalchemy.org/en/latest/core/connections.html#understanding-autocommit

Execute .sql schema in psycopg2 in Python

I have a PostgreSQL schema stored in .sql file. It looks something like:
CREATE TABLE IF NOT EXISTS users (
id INTEGER PRIMARY KEY,
facebook_id TEXT NOT NULL,
name TEXT NOT NULL,
access_token TEXT,
created INTEGER NOT NULL
);
How shall I run this schema after connecting to the database?
My existing Python code works for SQLite databases:
# Create database connection
self.connection = sqlite3.connect("example.db")
# Run database schema
with self.connection as cursor:
    cursor.executescript(open("schema.sql", "r").read())
But the psycopg2 doesn't have an executescript method on the cursor. So, how can I achieve this?
You can just use execute:
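# In psycopg2, "with connection" yields the connection itself, which has no
# execute method, so open a cursor explicitly and commit afterwards.
with self.connection.cursor() as cursor:
    cursor.execute(open("schema.sql", "r").read())
self.connection.commit()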
though you may want to set psycopg2 to autocommit mode first so you can use the script's own transaction management.
It'd be nice if psycopg2 offered a smarter mode where it read the file a statement at a time and sent each one to the DB, but at present there's no such mode as far as I know. It'd need a fairly solid parser to do it correctly when faced with $$ quoting (and its $delimiter$ variant where the delimiter may be any identifier), standard_conforming_strings, E'' strings, nested function bodies, etc.
Note that this will not work with:
anything containing psql backslash commands
COPY .. FROM STDIN
very long input
... and therefore won't work with dumps from pg_dump
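A small sketch of why a statement-at-a-time mode would need a real parser: naively splitting on semicolons breaks as soon as the script contains a $$-quoted function body. The script text below is a made-up example.

```python
script = (
    "CREATE TABLE t (id INTEGER);\n"
    "CREATE FUNCTION f() RETURNS int AS $$ BEGIN RETURN 1; END; $$ LANGUAGE plpgsql;\n"
)

# Naive approach: treat every semicolon as a statement terminator.
fragments = [part.strip() for part in script.split(";") if part.strip()]

for fragment in fragments:
    print(repr(fragment))

# The script has two statements, but the split produces four fragments,
# because the semicolons inside the $$ ... $$ body are not terminators.
print(len(fragments))  # 4
```

Sending the whole file through a single execute call avoids the problem, since the server does the parsing.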
I can't comment on the selected answer for lack of reputation, so I'll post an answer to help with the COPY issue.
Depending on the volume of your DB, you can use pg_dump --inserts, which outputs INSERTs instead of COPYs.
