SQL command SELECT fetches uncommitted data from Postgresql database - python

In short:
I have a PostgreSQL database and I connect to it through Python's psycopg2 module. Such a script might look like this:
import psycopg2

# connect to my database
conn = psycopg2.connect(dbname="<my-dbname>",
                        user="postgres",
                        password="<password>",
                        host="localhost",
                        port="5432")
cur = conn.cursor()

ins = "insert into testtable (age, name) values (%s,%s);"
data = ("90", "George")
sel = "select * from testtable;"

cur.execute(sel)
print(cur.fetchall())
# prints out
# [(100, 'Paul')]
#
# db looks like this
# age | name
# ----+-----
# 100 | Paul

# insert new data - no commit!
cur.execute(ins, data)
# perform the same select again
cur.execute(sel)
print(cur.fetchall())
# prints out
# [(100, 'Paul'), (90, 'George')]
#
# db still looks the same
# age | name
# ----+-----
# 100 | Paul

cur.close()
conn.close()
That is, I connect to that database which at the start of the script looks like this:
age | name
----+-----
100 | Paul
I perform a SQL SELECT and retrieve only Paul's data. Then I do a SQL INSERT, however without any commit, yet the second SQL SELECT still fetches both Paul and George - and I don't want that. I've looked into both the psycopg2 and PostgreSQL docs and found out about the ISOLATION LEVEL (see PostgreSQL and see psycopg2). In the PostgreSQL docs (under 13.2.1. Read Committed Isolation Level) it explicitly says:
However, SELECT does see the effects of previous updates executed within its own transaction, even though they are not yet committed.
I've tried different isolation levels. I understand that Read Committed and Repeatable Read don't work; I thought that Serializable might, but it does not -- meaning that I can still fetch uncommitted data with SELECT.
I could do conn.set_isolation_level(0), where 0 represents psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT, or I could probably wrap the execute commands inside with statements (see).
After all this, I am a bit confused about whether I understand transactions and isolation (and whether the behavior of SELECT without commit is completely normal) or not. Can somebody enlighten me on this topic?

Your two SELECT statements are using the same connection, and therefore the same transaction. From the psycopg manual you linked:
By default, the first time a command is sent to the database ... a new transaction is created. The following database commands will be executed in the context of the same transaction.
Your code is therefore equivalent to the following:
BEGIN TRANSACTION;
select * from testtable;
insert into testtable (age, name) values (90, 'George');
select * from testtable;
ROLLBACK TRANSACTION;
Isolation levels control how a transaction interacts with other transactions. Within a transaction, you can always see the effects of commands within that transaction.
If you want to isolate two different parts of your code, you will need to open two connections to the database, each of which will (unless you enable autocommit) create a separate transaction.
Note that according to the document already linked, creating a new cursor will not be enough:
...not only the commands issued by the first cursor, but the ones issued by all the cursors created by the same connection
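To illustrate the two-connection approach, a minimal sketch (credentials copied from the question; the second connection's default Read Committed transaction cannot see the first connection's uncommitted row):
import psycopg2

# Two independent connections: each gets its own transaction.
conn_writer = psycopg2.connect(dbname="<my-dbname>", user="postgres",
                               password="<password>", host="localhost", port="5432")
conn_reader = psycopg2.connect(dbname="<my-dbname>", user="postgres",
                               password="<password>", host="localhost", port="5432")

wcur = conn_writer.cursor()
rcur = conn_reader.cursor()

# Uncommitted insert in the writer's transaction.
wcur.execute("insert into testtable (age, name) values (%s, %s);", ("90", "George"))

# The reader's transaction only sees committed rows,
# so this still prints [(100, 'Paul')].
rcur.execute("select * from testtable;")
print(rcur.fetchall())

conn_writer.rollback()   # or conn_writer.commit() to make George visible
conn_reader.rollback()
conn_writer.close()
conn_reader.close()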

Using autocommit will not solve your problem. When autocommit is on, every insert and update is automatically committed to the database and all subsequent reads will see that data.
It's most unusual to not want to see data that has been written to the database by you. But if that's what you want, you need two separate connections and you must make sure that your select is executed prior to the commit.

Related

Simple Python method taking forever to execute/run

My code gets stuck at the clean_up() method in MyClass()
my_class.py:
import os
import pandas as pd
import psycopg2, pymysql, pyodbc
from db_credentials_dict import db_credentials
class MyClass():
    def __init__(self, from_database, to_table_name, report_name):
        ...

    def get_sql(self):
        ...

    def get_connection(self):
        ...

    def query_to_csv(self):
        ...

    def csv_to_postgres(self):
        ...

    def extract_and_load(self):
        self.query_to_csv()
        self.csv_to_postgres()

    def get_final_sql(self):
        ...

    def postgres_to_csv(self):
        ...

    def clean_up(self):
        print('\nTruncating {}...'.format(self.to_table_name), end='')
        with self.postgres_connection.cursor() as cursor:
            cursor.execute("SELECT NOT EXISTS (SELECT 1 FROM %s)" % self.to_table_name)
            empty = cursor.fetchone()[0]
            if not empty:
                cursor.execute("TRUNCATE TABLE %s" % self.to_table_name)
                self.postgres_connection.commit()
        print('DONE')
        print('Removing {}...'.format(self.filename), end='')
        if os.path.exists(self.filepath):
            os.remove(self.filepath)
            print('DONE')
        else:
            print('{} does not exist'.format(self.filename))
main.py:
from my_class import MyClass
from time import sleep

bookings = MyClass(from_database='...', to_table_name='...', report_name='...')
bookings2 = MyClass(from_database='...', to_table_name='...', report_name='...')
channel = MyClass(from_database='...', to_table_name='...', report_name='...')
cost = MyClass(from_database='...', to_table_name='...', report_name='...')

tables = [bookings, bookings2, channel, cost]

for table in tables:
    table.extract_and_load()

daily_report = MyClass(from_database='...', to_table_name='...', report_name='...')
daily_report.postgres_to_csv()

sleep(10)

for table in tables:
    table.clean_up()
When I run main.py it runs everything up until the final loop, i.e. for table in tables: table.clean_up(). It just gets stuck there, with no errors or warnings.
When running that method on its own, it works fine, i.e. it truncates the postgres tables. I need help getting this to work and understanding why the final method doesn't execute when all the others do.
Output of when running clean_up() on its own:
Truncating bookings...DONE
Removing bookings.csv...DONE
Truncating bookings2...DONE
Removing bookings2.csv...DONE
Truncating channel...DONE
Removing channel.csv...DONE
Truncating cost...DONE
Removing cost.csv...DONE
What the overall code does:
1. Grabbing files that contain sql queries extracting data from different databases and executing those queries.
2. Exporting them to csv.
3. Importing csv's to a postgres database.
4. Writing a postgres query that brings data together and then exports as a csv to be used in a BI tool for data visualisation.
5. Truncating the data in postgres and deleting the csv files from point 2.
You're probably thinking that this is madness, and I agree. I'm currently making do with what I've got and can't store data on my computer because it's company data (hence the truncating and removing of the data).
The TRUNCATE command in Postgres requires an ACCESS EXCLUSIVE lock on the relation and may also fire some BEFORE TRUNCATE triggers.
My guess is that the problem is on the Postgres side, e.g. TRUNCATE tries to acquire an ACCESS EXCLUSIVE lock on the relation, but it is already locked by someone else.
First, check your Postgres logs.
Next, check what your Postgres backends are doing while the TRUNCATE hangs:
SELECT * FROM pg_stat_activity;
Next, investigate locks with:
-- View with readable locks info and filtered out locks on system tables
CREATE VIEW active_locks AS
SELECT clock_timestamp(), pg_class.relname, pg_locks.locktype, pg_locks.database,
pg_locks.relation, pg_locks.page, pg_locks.tuple, pg_locks.virtualtransaction,
pg_locks.pid, pg_locks.mode, pg_locks.granted
FROM pg_locks JOIN pg_class ON pg_locks.relation = pg_class.oid
WHERE relname !~ '^pg_' and relname <> 'active_locks';
-- Now when we want to see locks just type
SELECT * FROM active_locks;
I would suggest doing the desired clean-up process all in one transaction, to prevent undesired states.
Also check whether the table exists in the database using information_schema:
SELECT EXISTS (
SELECT 1
FROM information_schema.tables
WHERE table_schema = 'schema_name'
AND table_name = 'table_name'
);
If it exists, create your TRUNCATE commands and execute the whole clean-up process in one transaction.
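A rough sketch of that inside the question's clean_up method (assuming psycopg2 as in the question, psycopg2 >= 2.7 for the psycopg2.sql identifier helpers, and a table in the public schema):
from psycopg2 import sql

def clean_up(self):
    # One transaction for the whole clean-up: the outer `with` commits on
    # success and rolls back on error.
    with self.postgres_connection:
        with self.postgres_connection.cursor() as cursor:
            cursor.execute("""
                SELECT EXISTS (
                    SELECT 1
                    FROM information_schema.tables
                    WHERE table_schema = 'public'
                      AND table_name = %s
                )
            """, (self.to_table_name,))
            if cursor.fetchone()[0]:
                cursor.execute(
                    sql.SQL("TRUNCATE TABLE {}").format(sql.Identifier(self.to_table_name))
                )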

Ending a SELECT transaction psycopg2 and postgres

I am executing a number of SELECT queries on a postgres database using psycopg2, but I am getting ERROR: Out of shared memory. It suggests that I should increase max_locks_per_transaction, but this confuses me because each SELECT query operates on only one table, and max_locks_per_transaction is already set to 512, 8 times the default.
I am using TimescaleDB, which could result in a larger than normal number of locks (one for each chunk rather than one for each table, maybe), but this still can't explain running out when so many are allowed. I'm assuming that what is happening here is that all the queries are being run as part of one transaction.
The code I am using looks something like the following.
db = DatabaseConnector(**connection_params)
tables = db.get_table_list()
for table in tables:
    result = db.query(f"""SELECT a, b, COUNT(c) FROM {table} GROUP BY a, b""")
    print(result)
Where db.query is defined as:
def query(self, sql):
    with self._connection.cursor() as cur:
        cur.execute(sql)
        return_value = cur.fetchall()
    return return_value
and self._connection is:
self._connection = psycopg2.connect(**connection_params)
Do I need to explicitly end the transaction in some way to free up locks? And how can I go about doing this in psycopg2? I would have assumed that there was an implicit end to the transaction when the cursor is closed on __exit__. I know if I was inserting or deleting rows I would use COMMIT at the end, but it seems strange to use as I am not changing the table.
UPDATE: When I explicitly open and close the connection in the loop, the error does not show. However, I assume there is a better way to end the transaction after each SELECT than this.
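One such way, sketched below: in psycopg2 the implicit transaction only ends on an explicit commit() or rollback() on the connection (closing the cursor is not enough), and a rollback() after a read-only SELECT is cheap and releases the locks. This is a minimal sketch of the query method from the question with that added:
def query(self, sql):
    with self._connection.cursor() as cur:
        cur.execute(sql)
        return_value = cur.fetchall()
    # Closing the cursor does not end the transaction; finish it explicitly
    # so the locks taken by the SELECT (one per chunk with TimescaleDB) are released.
    self._connection.rollback()   # or .commit() -- both end the transaction
    return return_value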

SQLAlchemy core insert does not insert data into db

I have an issue where SQLAlchemy Core does not insert rows when I try to insert data using connection.execute(table.insert(), list_of_rows). I construct the connection object without any additional parameters, i.e. connection = engine.connect(), and the engine with only one additional parameter, engine = create_engine(uri, echo=True).
Besides not finding the data in the DB, I also can't find the "INSERT" statement in my app's logs.
It may be important that I'm reproducing this issue during py.test tests.
DB that I use is mssql in docker container.
EDIT1:
The rowcount of the result proxy is always -1, regardless of whether I use a transaction or not, and also if I change the insert to connection.execute(table.insert().execution_options(autocommit=True), list_of_rows).rowcount
EDIT2:
I rewrote this code and now it works. I don't see any major difference.
What's the inserted row count after connection.execute:
proxy = connection.execute(table.insert(), list_of_rows)
print(proxy.rowcount)
If rowcount is a positive integer, it proves the data is indeed being written to the DB, but it may only be visible inside a transaction; if so, you could then check whether autocommit is on: https://docs.sqlalchemy.org/en/latest/core/connections.html#understanding-autocommit
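For reference, a minimal sketch of running the insert in an explicit transaction so it is committed either way (SQLAlchemy 1.x Core, using the engine, table and list_of_rows names from the question):
# Run the insert inside an explicit transaction; leaving the block commits,
# and an exception inside it rolls the transaction back.
with engine.begin() as connection:
    proxy = connection.execute(table.insert(), list_of_rows)
    print(proxy.rowcount)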

How to truncate and insert on the same transaction using sqlalchemy?

I have a production_table and stage_table.
I have a python script that runs for a few hours and generates data in the stage_table.
At the end of the script, I want to COPY the data from the stage_table to the production_table.
Basically this is what I want:
1. TRUNCATE production_table
2. COPY production_table from stage_table
This is my code:
from sqlalchemy import create_engine
from sqlalchemy.sql import text as sa_text
engine = create_engine("mysql+pymysql:// AMAZON AWS")
engine.execute(sa_text('''TRUNCATE TABLE {1}; COPY TABLE {1} from {0}'''.format(stage_table, production_table)).execution_options(autocommit=True))
This should generate:
TRUNCATE TABLE production_table; COPY TABLE production_table from stage_table
However this doesn't work.
sqlalchemy.exc.ProgrammingError: (pymysql.err.ProgrammingError) (1064,
u"You have an error in your SQL syntax;
How can I make it work? And how can I make sure that the TRUNCATE and COPY happen together? I don't want the TRUNCATE to happen if the COPY aborts.
The usual way to handle multiple statements in a single transaction in SQLAlchemy would be to begin an explicit transaction and execute each statement in it:
with engine.begin() as conn:
    conn.execute(statement_1)
    conn.execute(statement_2)
    ...
As to your original attempt, there is no COPY statement in MySQL. Some other DBMS do have something of the kind. Also not all DB-API drivers support multiple statements in a single query or command, at least out of the box, which would seem to be the case here as well. See this issue and the related note in the PyMySQL ChangeLog.
The biggest issue is that not all statements in MySQL can be rolled back, the most common of which are DDL statements. In other words, you simply cannot execute TRUNCATE [TABLE] ... in the same transaction as the following INSERT INTO ... and must design your application around that limitation. As suggested in the comments by Christian W., you could perhaps create an entirely new table from your staging table and rename, or just swap the production and staging tables. RENAME TABLE ... cannot be rolled back either, but at least you'd reduce the window for error, and could undo the changes since the original production table would still be there, just under a new name. You could then remove the original production table when all else is done. Here's something that demonstrates the idea, but it requires manual intervention if something goes awry:
# No point in faking transactions here, since MySQL in use.
engine.execute("CREATE TABLE new_production AS SELECT * FROM stage_table")
engine.execute("RENAME TABLE production_table TO old_production")
engine.execute("RENAME TABLE new_production TO production_table")
# Point of no return:
engine.execute("DROP TABLE old_production")
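As a variation on the same idea, MySQL's RENAME TABLE can rename several tables in a single statement, which is performed atomically, so there is no moment at which production_table does not exist. A sketch:
engine.execute("CREATE TABLE new_production AS SELECT * FROM stage_table")
# Atomic swap: both renames happen in one statement.
engine.execute("RENAME TABLE production_table TO old_production, "
               "new_production TO production_table")
# Point of no return:
engine.execute("DROP TABLE old_production")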

pymssql ( python module ) unable to use temporary tables

This isn't a question, so much as a pre-emptive answer. (I have gotten lots of help from this website & wanted to give back.)
I was struggling with a large SQL query that was failing when I tried to run it via python using pymssql, but would run fine when run directly through MS SQL. (E.g., in my case, I was using MS SQL Server Management Studio to run it outside of python.)
Then I finally discovered the problem: pymssql cannot handle temporary tables. At least not my version, which is still 1.0.1.
As proof, here is a snippet of my code, slightly altered to protect any IP issues:
conn = pymssql.connect(host=sqlServer, user=sqlID, password=sqlPwd, \
                       database=sqlDB)
cur = conn.cursor()
cur.execute(testQuery)
The above code FAILS (returns no data, to be specific, and spits the error "pymssql.OperationalError: No data available." if you call cur.fetchone() ) if I call it with testQuery defined as below:
testQuery = """
CREATE TABLE #TEST (
[sample_id] varchar (256)
,[blah] varchar (256) )
INSERT INTO #TEST
SELECT DISTINCT
[sample_id]
,[blah]
FROM [myTableOI]
WHERE [Shipment Type] in ('test')
SELECT * FROM #TEST
"""
However, it works fine if testQuery is defined as below.
testQuery = """
SELECT DISTINCT
[sample_id]
,[blah]
FROM [myTableOI]
WHERE [Shipment Type] in ('test')
"""
I did a Google search as well as a search within Stack Overflow, and couldn't find any information regarding the particular issue. I also looked under the pymssql documentation and FAQ, found at http://code.google.com/p/pymssql/wiki/FAQ, and did not see anything mentioning that temporary tables are not allowed. So I thought I'd add this "question".
Update: July 2016
The previously-accepted answer is no longer valid. The second "will NOT work" example does indeed work with pymssql 2.1.1 under Python 2.7.11 (once conn.autocommit(1) is replaced with conn.autocommit(True) to avoid "TypeError: Cannot convert int to bool").
For those who run across this question and might have similar problems, I thought I'd pass on what I'd learned since the original post. It turns out that you CAN use temporary tables in pymssql, but you have to be very careful in how you handle commits.
I'll first explain by example. The following code WILL work:
testQuery = """
CREATE TABLE #TEST (
[name] varchar(256)
,[age] int )
INSERT INTO #TEST
values ('Mike', 12)
,('someone else', 904)
"""
conn = pymssql.connect(host=sqlServer, user=sqlID, password=sqlPwd, \
                       database=sqlDB)  ## obviously setting up proper variables here...
conn.autocommit(1)
cur = conn.cursor()
cur.execute(testQuery)
cur.execute("SELECT * FROM #TEST")
tmp = cur.fetchone()
tmp
This will then return the first item (a subsequent fetch will return the other):
('Mike', 12)
But the following will NOT work:
testQuery = """
CREATE TABLE #TEST (
[name] varchar(256)
,[age] int )
INSERT INTO #TEST
values ('Mike', 12)
,('someone else', 904)
SELECT * FROM #TEST
"""
conn = pymssql.connect(host=sqlServer, user=sqlID, password=sqlPwd, \
                       database=sqlDB)  ## obviously setting up proper variables here...
conn.autocommit(1)
cur = conn.cursor()
cur.execute(testQuery)
tmp = cur.fetchone()
tmp
This will fail saying "pymssql.OperationalError: No data available." The reason, as best I can tell, is that whether you have autocommit on or not, and whether you specifically make a commit yourself or not, all tables must explicitly be created AND COMMITTED before trying to read from them.
In the first case, you'll notice that there are two "cur.execute(...)" calls. The first one creates the temporary table. Upon finishing the "cur.execute()", since autocommit is turned on, the SQL script is committed, the temporary table is made. Then another cur.execute() is called to read from that table. In the second case, I attempt to create & read from the table "simultaneously" (at least in the mind of pymssql... it works fine in MS SQL Server Management Studio). Since the table has not previously been made & committed, I cannot query into it.
Wow... that was a hassle to discover, and it will be a hassle to adjust my code (developed on MS SQL Server Management Studio at first) so that it will work within a script. Oh well...
