Python psycopg2 - Logging events

I'm using psycopg2, and I have a problem with logging events (executed queries, notifications, errors) to a file. I want to get an effect like the one in the pgAdmin history window.
For example I'm executing this query:
insert into city(id, name, countrycode, district, population) values (4080,'Sevilla', 'ESP', 'andalucia', 1000000)
And in PgAdmin I see effect like this:
Executing query:
insert into city(id, name, countrycode, district, population) values (4080,'Sevilla', 'ESP', 'andalucia', 1000000)
Query executed in 26 ms.
One row affected.
Can I get a similar effect using psycopg2?
I tried to use LoggingCursor, but it is not satisfactory for me because it logs only the queries.
Thanks for your help.
EDIT:
My code:
conn = psycopg2.extras.LoggingConnection(DSN)
File=open('log.log','a')
File.write('================================')
psycopg2.extras.LoggingConnection.initialize(conn,File)
File.write('\n'+time.strftime("%Y-%m-%d %H:%M:%S") + '---Executing query:\n\t')
q="""insert into city(id, name, countrycode, district, population) values (4080,'Sevilla', 'ESP', 'andalucia', 10000)"""
c=conn.cursor()
c.execute(q)
File.write('\n'+time.strftime("%Y-%m-%d %H:%M:%S") + '---Executing query:\n\t')
q="""delete from city where id = 4080"""
c=conn.cursor()
c.execute(q)
conn.commit()
File.close()
And this is my output log:
================================
2012-12-30 22:42:31---Executing query:
insert into city(id, name, countrycode, district, population) values (4080,'Sevilla', 'ESP', 'andalucia', 10000)
2012-12-30 22:42:31---Executing query:
delete from city where id = 4080
I want to see in the log file information about how many rows were affected and information about errors. Finally, I want to have a complete log file with all events.

From what I can see, you have three requirements that are not fulfilled by the LoggingCursor class:
1. Query execution time
2. Number of rows affected
3. A complete log file with all events
For the first requirement, take a look at the source code for the MinTimeLoggingConnection class in psycopg2.extras. It sub-classes LoggingConnection and outputs the execution time of queries that exceed a minimum time (note that this needs to be used in conjunction with the MinTimeLoggingCursor).
For the second requirement, the rowcount attribute of the cursor class specifies
the number of rows that the last execute*() produced (for DQL
statements like SELECT) or affected (for DML statements like UPDATE or
INSERT)
Thus it should be possible to create your own type of LoggingConnection and LoggingCursor that includes this additional functionality.
My attempt is as follows. Just replace LoggingConnection with LoggingConnection2 in your code and this should all work. As a side-note, you don't need to create a new cursor for your second query. You can just call c.execute(q) again after you've defined your second query.
import os
import time

import psycopg2
import psycopg2.extras
from psycopg2.extras import LoggingConnection
from psycopg2.extras import LoggingCursor

# LoggingConnection that appends execution time and row count to every logged query
class LoggingConnection2(psycopg2.extras.LoggingConnection):
    def initialize(self, logobj):
        LoggingConnection.initialize(self, logobj)

    def filter(self, msg, curs):
        # elapsed time in milliseconds since the cursor stamped the start of execute()
        t = (time.time() - curs.timestamp) * 1000
        return msg + os.linesep + 'Query executed in: {0:.2f} ms. {1} row(s) affected.'.format(t, curs.rowcount)

    def cursor(self, *args, **kwargs):
        # hand out our timing cursor by default
        kwargs.setdefault('cursor_factory', LoggingCursor2)
        return super(LoggingConnection, self).cursor(*args, **kwargs)

class LoggingCursor2(psycopg2.extras.LoggingCursor):
    def execute(self, query, vars=None):
        # remember when the query started so filter() can compute the duration
        self.timestamp = time.time()
        return LoggingCursor.execute(self, query, vars)

    def callproc(self, procname, vars=None):
        self.timestamp = time.time()
        return LoggingCursor.callproc(self, procname, vars)
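For reference, a minimal usage sketch based on the question's code (DSN is assumed to be defined as in the question; file handling kept deliberately simple):

import psycopg2

logfile = open('log.log', 'a')
conn = psycopg2.connect(DSN, connection_factory=LoggingConnection2)
conn.initialize(logfile)                 # every query now gets logged with time and row count

cur = conn.cursor()                      # a LoggingCursor2, thanks to the cursor_factory default
cur.execute("delete from city where id = 4080")
conn.commit()
logfile.close()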
I'm not sure how to create a complete log of all events, but the notices attribute of the connection class may be of interest.
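As a rough sketch (hedged: this only captures server-side NOTICE/WARNING messages raised by the backend, not every possible event), you could flush conn.notices into the same file after each statement:

# conn and logfile are assumed to be the connection and file from the sketch above
for notice in conn.notices:
    logfile.write(notice)
del conn.notices[:]   # psycopg2 keeps appending to this list, so clear it once written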

Maybe you can get what you're looking for without writing any code.
There's a setting in PostgreSQL itself called log_min_duration_statement that might help you out.
You can set it to zero, and every query will be logged along with its run time. Or you can set it to some positive number, like, say, 500, and PostgreSQL will only record queries that take at least 500 ms to run.
You won't get the results of the query in your log file, but you'll get the exact query, including the interpolated bound parameters.
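If you would rather flip the setting from Python for a single session instead of editing postgresql.conf, a sketch like the following may work (assuming your role is allowed to change it, which normally requires superuser privileges; DSN as in the question):

import psycopg2

conn = psycopg2.connect(DSN)
cur = conn.cursor()
# 0 = log every statement with its duration; use e.g. 500 to log only statements >= 500 ms
cur.execute("SET log_min_duration_statement = 0")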
If this works well for you, later on, check out the auto_explain module.
Good luck!

Just take a look at how the LoggingCursor is implemented and write your own cursor subclass: it's very easy.
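For example, a minimal sketch of such a subclass (this is not the library's own code, and the AuditingCursor name is made up) could log the query text, the duration and the row count through the standard logging module:

import logging
import time
import psycopg2.extensions

logger = logging.getLogger('sql')

class AuditingCursor(psycopg2.extensions.cursor):
    def execute(self, query, vars=None):
        start = time.time()
        try:
            return super(AuditingCursor, self).execute(query, vars)
        finally:
            # self.query and self.rowcount are filled in by psycopg2 after execution
            elapsed = (time.time() - start) * 1000
            logger.info('%s -- %.2f ms, %s row(s)', self.query, elapsed, self.rowcount)

# usage: conn.cursor(cursor_factory=AuditingCursor)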

Related

Why is a subsequent query not able to find newly-inserted rows?

I'm using AWS RDS, which I'm accessing with pymysql. I have a Python lambda function that inserts a row into one of my tables. I then call commit() on the pymysql connection. Later, my lambda invokes a second lambda; this second lambda (using a different db connection) executes a SELECT to look for the newly-added row. Unfortunately, the row is not found immediately. As a debugging step, I added code like this:
lambda_handler.py
...
uuid_values = [uuid_value]  # A single-item list
things = queries.get_things(uuid_values)
# Added for debugging
if not things:
    print('For debugging: things not found.')
    time.sleep(5)
    things = queries.get_things(uuid_values)
    print(f'for debugging: {str(things)}')
return things
queries.py
def get_things(uuid_values):
    # Creates a string of the form 'UUID_TO_BIN(%s),UUID_TO_BIN(%s)' for use in the query below
    format_string = ','.join(['UUID_TO_BIN(%s)'] * len(uuid_values))
    tuple_of_keys = tuple([str(key) for key in uuid_values])
    with db_conn.get_cursor() as cursor:
        # Lightly simplified query
        cursor.execute('''
            SELECT ...
            FROM table1 t1
            JOIN table2 t2 ON t1.id = t2.t1_id
            WHERE
                t1.uuid_value IN ({format_string})
                AND t2.status_id = 1
            '''.format(format_string=format_string),
            tuple_of_keys)
        results = cursor.fetchall()
    db_conn.conn.commit()
    return results
This outputs:
'For debugging: things not found.'
'<thing list>'
meaning the row is not found immediately, but is found after a brief delay. I'd rather not leave this delay in when I ship to production. I'm not doing anything with transactions or isolation levels, so it's very strange to me that this second query would not find the newly-inserted row. Any idea what might be causing this?

Problems implementing a python db listener

I'm writing a module for a program that needs to listen for new entries in a db, and execute a function on the event of new rows being posted to this table... aka a trigger.
I have written some code, but it does not work. Here's my logic:
connect to db, query for the newest row, compare that row with variable, if not equal, run function, store newest row to variable, else close. Run every 2 seconds to compare newest row with whatever is stored in the variable/object.
Everything runs fine and pulls the expected results from the db; however, I'm getting a 'local variable 'last_sent' referenced before assignment' error.
This confuses me for two reasons:
1. I thought I set last_sent to 'nothing' as a global variable/object before the functions are called.
2. In order for my comparison logic to work, I can't set last_sent within the sendListener() function before the if/else.
Here's the code.
from Logger import Logger
from sendSMS import sendSMS
from Needles import dbUser, dbHost, dbPassword, pull_stmt
import pyodbc
import time

# set last_sent to something
last_sent = ''

def sendListener():
    # connect to db
    cnxn = pyodbc.connect('UID='+dbUser+';PWD='+dbPassword+';DSN='+dbHost)
    cursor = cnxn.cursor()
    # run query to pull newest row
    cursor.execute(pull_stmt)
    results = cursor.fetchone()
    # if query results different from results stored in last_sent, run function.
    # then set last_sent object to the query results for next comparison.
    if results != last_sent:
        sendSMS()
        last_sent = results
    else:
        cnxn.close()

# a loop to run the check every 2 seconds - as to lessen cpu usage
def sleepLoop():
    while 0 == 0:
        sendListener()
        time.sleep(2.0)

sleepLoop()
I'm sure there is a better way to implement this.
Here:
if results != last_sent:
    sendSMS()
    last_sent = results
else:
    cnxn.close()
Python sees that you're assigning to last_sent, but it's not marked as global in this function, so it must be local. Yet you're reading it in results != last_sent before its definition, so you get the error.
To solve this, mark it as global at the beginning of the function:
def sendListener():
    global last_sent
    ...

How can I record SQLAlchemy generated SQL and query execution times?

I'm trying to store the SQL queries generated by SQLAlchemy, and how long each one takes to run. I'm using an event listener to store the SQL:
from sqlalchemy.event import listens_for

statements = []

@listens_for(DBSession.get_bind(), "before_cursor_execute", named=True)
def before_cursor_execute(**kw):
    statements.append(kw['statement'])
Can I use the same event listener somehow to store execution time, or should I be using something else?
You can use before_cursor_execute to record the start time of your query, then calculate the difference in after_cursor_execute, something like this:
from datetime import datetime
from sqlalchemy import event

@event.listens_for(engine, "before_cursor_execute")
def _record_query_start(conn, cursor, statement, parameters, context, executemany):
    conn.info["query_start"] = datetime.now()

@event.listens_for(engine, "after_cursor_execute")
def _calculate_query_run_time(conn, cursor, statement, parameters, context, executemany):
    print("this query took {}".format(datetime.now() - conn.info["query_start"]))

transfer millions of records from sqlite to postgresql using python sqlalchemy

We have around 1500 sqlite dbs; each has 0 to 20,000,000 records in a table (violation), and the total number of violation records is around 90,000,000.
We generate each file by running a crawler on the 1500 servers. Along with this violation table we have some other tables too, which we use for further analysis.
To analyze the results we push all these sqlite violation records into a postgres violation table, along with other insertions and calculations.
Following is the code I use to transfer records:
class PolicyViolationService(object):
    def __init__(self, pg_dao, crawler_dao_s):
        self.pg_dao = pg_dao
        self.crawler_dao_s = crawler_dao_s
        self.user_violation_count = defaultdict(int)
        self.analyzer_time_id = self.pg_dao.get_latest_analyzer_tracker()

    def process(self):
        """
        transfer policy violation record from crawler db to analyzer db
        """
        for crawler_dao in self.crawler_dao_s:
            violations = self.get_violations(crawler_dao.get_violations())
            self.pg_dao.insert_rows(violations)

    def get_violations(self, violation_records):
        for violation in violation_records:
            violation = dict(violation.items())
            violation.pop('id')
            self.user_violation_count[violation.get('user_id')] += 1
            violation['analyzer_time_id'] = self.analyzer_time_id
            yield PolicyViolation(**violation)
in sqlite dao
==============
def get_violations(self):
    result_set = self.db.execute('select * from policyviolations;')
    return result_set

in pg dao
=========
def insert_rows(self, rows):
    self.session.add_all(rows)
    self.session.commit()
This code works but is taking a very long time. What is the right way to approach this problem? We have been discussing parallel processing, skipping sqlalchemy, and some other options. Please suggest the right way.
Thanks in advance!
The fastest way to get these to PostgreSQL is to use the COPY command, outside any SQLAlchemy.
Within SQLAlchemy one must note that the ORM is very slow. It is doubly slow if you have lots of objects in the ORM session that you then flush. You could make it faster by flushing after every 1000 items or so; that would also make sure the session does not grow too big. However, why not just use SQLAlchemy Core to generate the inserts:
ins = violations.insert().values(col1='value', col2='value')
conn.execute(ins)
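Going back to the COPY suggestion, here is a rough sketch of that route with psycopg2 (the policyviolations table name and the column list are assumptions based on the question; pg_conn is a plain psycopg2 connection, and the rows are assumed to be mapping-like, e.g. sqlite3.Row):

import csv
import io

def copy_violations(pg_conn, violation_records, columns):
    # Stream the sqlite rows into an in-memory CSV buffer
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in violation_records:
        writer.writerow([row[col] for col in columns])
    buf.seek(0)
    # One COPY round trip loads everything, bypassing the ORM entirely
    with pg_conn.cursor() as cur:
        cur.copy_expert(
            'COPY policyviolations ({0}) FROM STDIN WITH CSV'.format(', '.join(columns)),
            buf)
    pg_conn.commit()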

Cassandra low performance?

I have to choose Cassandra or MongoDB (or another NoSQL database; I accept suggestions) for a project with a lot of inserts (1M/day).
So I created a small test to measure write performance. Here's the code to insert in Cassandra:
import time
import os
import random
import string
import pycassa

def get_random_string(string_length):
    return ''.join(random.choice(string.letters) for i in xrange(string_length))

def connect():
    """Connect to a test database"""
    connection = pycassa.connect('test_keyspace', ['localhost:9160'])
    db = pycassa.ColumnFamily(connection, 'foo')
    return db

def random_insert(db):
    """Insert a record into the database. The record has the following format:
    ID timestamp
    4 random strings
    3 random integers"""
    record = {}
    record['id'] = str(time.time())
    record['str1'] = get_random_string(64)
    record['str2'] = get_random_string(64)
    record['str3'] = get_random_string(64)
    record['str4'] = get_random_string(64)
    record['num1'] = str(random.randint(0, 100))
    record['num2'] = str(random.randint(0, 1000))
    record['num3'] = str(random.randint(0, 10000))
    db.insert(str(time.time()), record)

if __name__ == "__main__":
    db = connect()
    start_time = time.time()
    for i in range(1000000):
        random_insert(db)
    end_time = time.time()
    print "Insert time: %lf " % (end_time - start_time)
And the code to insert into Mongo is the same, changing only the connection function:
def connect():
    """Connect to a test database"""
    connection = pymongo.Connection('localhost', 27017)
    db = connection.test_insert
    return db.foo2
The results are ~1046 seconds to insert in Cassandra, and ~437 to finish in Mongo.
Cassandra is supposed to be much faster than Mongo at inserting data. So, what am I doing wrong?
There is no equivalent to Mongo's unsafe mode in Cassandra. (We used to have one, but we took it out, because it's just a Bad Idea.)
The other main problem is that you're doing single-threaded inserts. Cassandra is designed for high concurrency; you need to use a multithreaded test. See the graph at the bottom of http://spyced.blogspot.com/2010/01/cassandra-05.html (actual numbers are over a year out of date but the principle is still true).
The Cassandra source distribution has such a test included in contrib/stress.
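As a rough illustration of what a multithreaded client could look like, reusing the connect() and random_insert() helpers from the question (the thread count and per-thread insert count below are arbitrary):

import threading

def worker(n):
    db = connect()              # give each thread its own connection
    for _ in xrange(n):
        random_insert(db)

threads = [threading.Thread(target=worker, args=(100000,)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()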
If I am not mistaken, Cassandra allows you to specify whether or not you are doing a MongoDB-equivalent "safe mode" insert. (I don't recall the name of that feature in Cassandra.)
In other words, Cassandra may be configured to write to disk and then return, as opposed to the default MongoDB configuration, which immediately returns after performing an insert without knowing whether the insert was successful or not. It just means that your application never waits for a pass/fail from the server.
You can change that behavior by using safe mode in MongoDB, but this is known to have a large impact on performance. Enable safe mode and you may see different results.
You will harness the true power of Cassandra once you have multiple nodes running. Any node will be able to take a write request. Multithreading a client only floods more requests to the same instance, which is not going to help after a point.
Check the Cassandra log for the events that happen during your tests. Cassandra will initiate a disk write once the Memtable is full (this is configurable; make it large enough and you will be working in RAM plus the commit-log disk writes). If a Memtable disk write happens during your test, it will slow it down. I do not know when MongoDB writes to disk.
Might I suggest taking a look at Membase here? It's used in exactly the same way as memcached and is fully distributed so you can continuously scale your write input rate simply by adding more servers and/or more RAM.
For this case, you'll definitely want to go with a client-side Moxi to give you the best performance. Take a look at our wiki: wiki.membase.org for examples and let me know if you need any further instruction...I'm happy to walk you through it and I'm certain that Membase can handle this load easily.
Create batch mutator for doing multiple insert, update, and remove operations using as few roundtrips as possible.
http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.batch
The batch mutator helped me cut insert time at least in half.
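For instance, a sketch of the question's insert loop rewritten around pycassa's batch interface (queue_size is arbitrary; the record is trimmed down from random_insert(), and db and get_random_string come from the question):

import random
import time

batch = db.batch(queue_size=100)           # mutations are sent automatically every 100 inserts
for i in xrange(1000000):
    record = {'str1': get_random_string(64),
              'num1': str(random.randint(0, 100))}
    batch.insert(str(time.time()), record)
batch.send()                               # flush whatever is still queued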
