I'm using peewee to interface with a MySQL database. I have a list of entries that must be inserted into the database, or updated if they are already present. I'm using the create_or_get function for this. I also use threading to speed up the process; the code looks like this:
# pool is just a map wrapper around standard threading module
pool = utils.ThreadPool()
for page in xrange(0, pages):
    pool.add_task(self.update_page, page)
pool.wait_completion()

def update_page(self, num):
    for entry in self.get_entries_from_page(num):
        self.push_entry(entry)

def push_entry(self, entry):
    with _db.execution_context():
        result, new = EntryModel.create_or_get(**entry)  # << error here
        if not new:
            if entry['date'] > result.date:
                result.hits += 1
                result.date = entry['date']
                result.save()
Database initialization:
_db.initialize(playhouse.pool.PooledMySQLDatabase(n, user = u, passwd = w, host = h, port = p))
Everything was running smoothly, but all of a sudden I began to receive a lot of errors on the marked line:
(1305, 'SAVEPOINT s449cd5a8d165440aaf47b205e2932362 does not exist')
The savepoint name changes every time, and no data is written to the database. Recreating the database did not help. What can lead to this error?
Try removing autocommit=True when creating the database connection.
I have a program that uses a couple sql databases to store data. I have a class that manages the various sql functions, such as getting a value, an entire table or just updating a value. All of the processes work fine until I run a function that uses UPDATE. I execute an UPDATE command and try to commit the change and the database is always locked. Every function I have in my custom sql class has
cursor.close
database.close
So there shouldn't be any issue with the database connection still being open. Am I missing something in this syntax that is not connecting to the database correctly? I used the extra print statements in an attempt to find out where the problem is occurring, so those can be ignored.
import sqlite3 as db
import os

databaseName = "site"

class MassDb:
    def __init__(self, databaseName):
        super(MassDb, self).__init__()
        print("Current Directory: ", os.getcwd())
        self.databaseName = databaseName

    def updateValue(self, location, metric, input_value):
        print("OPEN CONNECTION UPDATE - running updateValue: ", location, metric, input_value)
        if self.databaseName == "site":
            try:
                siteConn = db.connect("site_data.db")
                siteCursor = siteConn.cursor()
                siteCursor.execute("UPDATE sites SET " + metric + " = ? WHERE LOCATION = ?", (input_value, location))
                siteConn.commit()
            except:
                print("UPDATE FAILED")
            finally:
                siteCursor.close
                siteConn.close
        elif self.databaseName == "comp":
            try:
                compConn = db.connect("comp_data.db")
                compCursor = compConn.cursor()
                compCursor.execute("UPDATE competitors SET " + metric + " = ? WHERE NAME = ?", (input_value, location))
                compConn.commit()
            except:
                print("UPDATE FAILED")
            finally:
                compCursor.close
                compConn.close
                print("CLOSED CONNECTION UPDATE - Update Connection Closed")
        else:
            print("Update Error")

MassDb("site").updateValue("Location", "CURRENT_SCORE", "100")
As @roganjosh commented, my problem was that I wasn't properly closing the database. If commit() is used, there's no need to close the database. However, cursor.close() and conn.close() need to be written as such. Leaving off the parentheses only references the method as an attribute without calling it; to execute the close method, the () must be present. It seems obvious now, but I wasn't aware of it at the time. Hopefully this helps someone else who runs across the same thing.
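A minimal sketch of the difference, using an in-memory SQLite database (the table name demo is made up for illustration): referencing conn.close does nothing, while conn.close() actually closes the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (x INTEGER)")

# Attribute access only: this evaluates to a bound method and discards it.
conn.close

# The connection is still open, so this statement succeeds.
conn.execute("INSERT INTO demo VALUES (1)")
still_open = True

# Calling the method actually closes the connection.
conn.close()

try:
    conn.execute("SELECT * FROM demo")
    closed = False
except sqlite3.ProgrammingError:
    # sqlite3 raises ProgrammingError when a closed connection is used
    closed = True
```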
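Additionally, using the connection as a context manager works. Note that with sqlite3 the with block commits on success (and rolls back on an exception), so the explicit commit() is not needed; it does not close the connection for you.
with conn:
    # do stuff here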
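As a sketch of that pattern with sqlite3 (assuming a throwaway in-memory database; the lowercase column names and the update_score helper are hypothetical): with conn handles commit or rollback, and contextlib.closing can be layered on top when the connection should also be closed at the end.

```python
import sqlite3
from contextlib import closing

def update_score(conn, location, score):
    """Update one row; commit on success, roll back on any exception."""
    with conn:  # transaction scope only, does not close the connection
        cur = conn.execute(
            "UPDATE sites SET current_score = ? WHERE location = ?",
            (score, location),
        )
        return cur.rowcount

# closing() guarantees conn.close() is called when the block exits.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE sites (location TEXT, current_score TEXT)")
    conn.execute("INSERT INTO sites VALUES ('Location', '0')")
    conn.commit()
    changed = update_score(conn, "Location", "100")
    row = conn.execute("SELECT current_score FROM sites").fetchone()
```

Keeping the two responsibilities separate (transaction scope versus connection lifetime) makes it harder to leave a connection open and holding a lock by accident.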
My question is basically: is there a best-practice approach to db interaction, and am I doing something silly or wrong in the code below that is costing processing time?
My program pulls data from a website and writes it to a SQL database. Speed is very important, and I want to be able to refresh the data as quickly as possible. I've tried a number of ways and I feel it's still way too slow, i.e. it could be much better with a better approach or design for interacting with the db, and I'm sure I'm making all sorts of mistakes. I can download the data to memory very quickly, but the writes to the db take much longer.
The 3 main approaches I've tried are:
1. Threads that pull the data and populate a list of SQL commands; when the threads complete, run the SQL in the main thread
2. Threads that pull data and push to SQL (as per the code below)
3. Threads that pull data and populate a queue, with separate thread(s) polling the queue and pushing to the db
Code as below:
import re
import MySQLdb as mydb

class DatabaseUtility():
    def __init__(self):
        """set db parameters"""

    def updateCommand(self, cmd):
        """run SQL commands and return number of matched rows"""
        try:
            self.cur.execute(cmd)
            return int(re.search(r'Rows matched: (\d+)', self.cur._info).group(1))
        except Exception as e:
            print ('runCmd error: ' + str(e))
            print ('With SQL: ' + cmd)
            return 0

    def addCommand(self, cmd):
        """write SQL command to db"""
        try:
            self.cur.execute(cmd)
            return self.cur.rowcount
        except Exception as e:
            print ('runCmd error: ' + str(e))
            print ('With SQL: ' + cmd)
            return 0
I've created a class that instantiates a db connection and is called as below:
from Queue import Queue
from threading import Thread
import urllib2
import json
from databasemanager import DatabaseUtility as dbU
from datalinks import getDataLink, allDataLinks

numThreads = 3
q = Queue()
dbu = dbU()

class OddScrape():
    def __init__(self, name, q):
        self.name = name
        self.getOddsData(self.name, q)

    def getOddsData(self, i, q):
        """Worker thread - parse each datalink and update / insert to db"""
        while True:
            #get datalink, create db connection
            self.dbu = dbU()
            matchData = q.get()
            #load data link using urllib2 and do a bunch of stuff
            #to parse the data to the required format
            #try to update in db and insert if not found
            sql = "sql to update %s" %(params)
            update = self.dbu.updateCommand(sql)
            if update < 1:
                sql = "sql to insert %s" %(params)
                self.dbu.addCommand(sql)
            q.task_done()
            self.dbu.dbConClose()
            print eventlink

def threadQ():
    #set up some threads
    for i in range(numThreads):
        worker = Thread(target=OddScrape, args=(i, q,))
        worker.start()
    #get urldata for all matches required and add to q
    matchids = dbu.runCommand("sql code to determine scope of urls")
    for match in matchids:
        sql = "sql code to get url data %s" %match
        q.put(dbu.runCommand(sql))
    q.join()
I've also added an index to the table I'm writing to, which seemed to help a tiny bit but not noticeably:
CREATE INDEX `idx_oddsdata_bookid_datalinkid`
ON `dbname`.`oddsdata` (bookid, datalinkid) COMMENT '' ALGORITHM DEFAULT LOCK DEFAULT;
Multiple threads imply multiple connections. Although getting a connection is "fast" in MySQL, it is not instantaneous. I do not know the relative speed of getting a connection versus running a query, but I doubt that your multi-threaded idea will win.
Could you show us examples of the actual queries (SQL, not Python code) you need to run? We may have suggestions on combining queries, improved indexes, etc. Please provide SHOW CREATE TABLE, too. (You mentioned a CREATE INDEX, but it is useless out of context.)
It looks like you are doing a multi-step process that could be collapsed into INSERT ... ON DUPLICATE KEY UPDATE ....
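A rough sketch of what that could look like from Python. The table name oddsdata and the columns bookid and datalinkid come from the index in the question; the odds column, the build_upsert helper, and the assumption of a UNIQUE key on (bookid, datalinkid) are hypothetical, since the actual table definition was not shown. The helper only builds the statement; it would be passed to cursor.execute together with the parameter tuple.

```python
def build_upsert(table, key_cols, value_cols):
    """Build a parameterized MySQL INSERT ... ON DUPLICATE KEY UPDATE statement.

    A UNIQUE or PRIMARY KEY over key_cols is required for the update branch
    to fire. Identifiers are interpolated directly, so they must come from
    trusted code, never from user input.
    """
    cols = list(key_cols) + list(value_cols)
    col_list = ", ".join(cols)
    placeholders = ", ".join(["%s"] * len(cols))
    updates = ", ".join("{0} = VALUES({0})".format(c) for c in value_cols)
    return "INSERT INTO {} ({}) VALUES ({}) ON DUPLICATE KEY UPDATE {}".format(
        table, col_list, placeholders, updates
    )

# Hypothetical example values; "odds" is an assumed column name.
sql = build_upsert("oddsdata", ["bookid", "datalinkid"], ["odds"])
params = (7, 42, 1.85)
# With a DB-API cursor from MySQLdb or pymysql: cursor.execute(sql, params)
```

This replaces the UPDATE-then-INSERT pair with a single round trip per row, and it also removes the need to parse the "Rows matched" info string.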
I am running into the dreaded MySQL "Commands out of sync" error when using a custom DB library and Celery.
The library is as follows:
import pymysql
import pymysql.cursors
from furl import furl
from flask import current_app

class LegacyDB:
    """Db
    Legacy Database connectivity library
    """
    def __init__(self, app):
        with app.app_context():
            self.rc = current_app.config['RAVEN']
            self.logger = current_app.logger
            self.data = {}
            # setup Mysql
            try:
                uri = furl(current_app.config['DBCX'])
                self.dbcx = pymysql.connect(
                    host=uri.host,
                    user=uri.username,
                    passwd=uri.password,
                    db=str(uri.path.segments[0]),
                    port=int(uri.port),
                    cursorclass=pymysql.cursors.DictCursor
                )
            except:
                self.rc.captureException()

    def query(self, sql, params = None, TTL=36):
        # INPUT 1 : SQL query
        # INPUT 2 : Parameters
        # INPUT 3 : Time To Live
        # OUTPUT : Array of result
        # check that we're still connected to the
        # database before we fire off the query
        try:
            db_cursor = self.dbcx.cursor()
            if params:
                self.logger.debug("%s : %s" % (sql, params))
                db_cursor.execute(sql, params)
                self.dbcx.commit()
            else:
                self.logger.debug("%s" % sql)
                db_cursor.execute(sql)
            self.data = db_cursor.fetchall()
            if self.data is None:
                self.data = {}
            db_cursor.close()
        except Exception as ex:
            # pymysql reports error codes as integers in ex.args[0]
            if ex.args and ex.args[0] == 2006:
                db_cursor.close()
                self.connect()
                db_cursor = self.dbcx.cursor()
                if params:
                    db_cursor.execute(sql, params)
                    self.dbcx.commit()
                else:
                    db_cursor.execute(sql)
                self.data = db_cursor.fetchall()
                db_cursor.close()
            else:
                self.rc.captureException()
        return self.data
The purpose of the library is to work alongside SQLAlchemy whilst I migrate a legacy database schema from a C++-based system to a Python based system.
All configuration is done via a Flask application, and the app.config['DBCX'] value reads the same as a SQLAlchemy connection string ("mysql://user:pass@host:port/dbname"), allowing me to switch over easily in the future.
I have a number of tasks that run "INSERT" statements via Celery, all of which use this library. As you can imagine, the main reason for running Celery is to increase throughput on this application. However, I seem to be hitting an issue with the threading in my library or the application: after a while (around 500 processed messages) I see the following in the logs:
Stacktrace (most recent call last):
File "legacy/legacydb.py", line 49, in query
self.dbcx.commit()
File "pymysql/connections.py", line 662, in commit
self._read_ok_packet()
File "pymysql/connections.py", line 643, in _read_ok_packet
raise OperationalError(2014, "Command Out of Sync")
I'm obviously doing something wrong to hit this error, however it doesn't seem to matter whether MySQL has autocommit enabled/disabled or where I place my connection.commit() call.
If I leave out the connection.commit() then I don't get anything inserted into the database.
I've recently moved from mysqldb to pymysql and the occurrences appear to be lower, however given that these are simple "insert" commands and not a complicated select (there aren't even any foreign key constraints on this database!) I'm struggling to work out where the issue is.
As things stand at present, I am unable to use executemany as I cannot prepare the statements in advance (I am pulling data from a "firehose" message queue and storing it locally for later processing).
First of all, make sure that the celery thingamajig uses its own connection(s) since
>>> pymysql.threadsafety
1
Which means: "threads may share the module but not connections".
Is the init called once, or per-worker? If only once, you need to move the initialisation.
How about lazily initialising the connection in a thread-local variable the first time query is called?
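A minimal sketch of that idea, under the assumption that the connection is created by a caller-supplied factory (in the library above this would wrap the pymysql.connect(...) call). The names PerThreadConnection and fake_connect are hypothetical; the stand-in factory is only there so the sketch runs without a database.

```python
import threading

class PerThreadConnection:
    """Lazily create one connection per thread via a caller-supplied factory."""

    def __init__(self, connect):
        self._connect = connect          # e.g. lambda: pymysql.connect(...)
        self._local = threading.local()  # attributes are private to each thread

    def get(self):
        conn = getattr(self._local, "conn", None)
        if conn is None:
            # First call on this thread: open a connection and remember it.
            conn = self._connect()
            self._local.conn = conn
        return conn

# Demonstration with a stand-in factory instead of a real database.
created = []

def fake_connect():
    conn = object()
    created.append(conn)
    return conn

holder = PerThreadConnection(fake_connect)
seen = {}

def worker(name):
    first = holder.get()
    second = holder.get()  # same thread, so the same object comes back
    seen[name] = (first, second)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Inside query, self.dbcx would then be replaced by a call such as holder.get(), so no two threads ever issue commands on the same connection.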
We have a little bit of a complicated setup:
In our normal code, we connect manually to a mysql db. We're doing this because I guess the connections django normally uses are not threadsafe? So we let django make the connection, extract the information from it, and then use a mysqldb connection to do the actual querying.
Our code is largely an update process, so we have autocommit turned off to save time.
For ease of creating test data, I created django models that represent the tables, and use them to create rows to test on. So I have functions like:
def make_thing(**overrides):
    fields = deepcopy(DEFAULT_THING)
    fields.update(overrides)
    s = Thing(**fields)
    s.save()
    transaction.commit(using='ourdb')
    reset_queries()
    return s
However, it doesn't seem to actually be committing! After I make an object, I later have code that executes raw sql against the mysqldb connection:
def get_information(self, value):
    print self.api.rawSql("select count(*) from thing")[0][0]
    query = 'select info from thing where column = %s' % value
    return self.api.rawSql(query)[0][0]
This print statement prints 0! Why?
Also, if I turn autocommit off, I get
TransactionManagementError: This is forbidden when an 'atomic' block is active.
when we try to alter the autocommit level later.
EDIT: I also just tried https://groups.google.com/forum/#!topic/django-users/4lzsQAWYwG0, which did not help.
EDIT2: I checked from a shell against the database--the commit is working, it's just not getting picked up. I've tried setting the transaction isolation level but it isn't helping. I should add that a function further up from get_information uses this decorator:
def single_transaction(fn):
    from django.db import transaction
    from django.db import connection
    def wrapper(*args, **kwargs):
        prior_autocommit = transaction.get_autocommit()
        transaction.set_autocommit(False)
        connection.cursor().execute('set transaction isolation level read committed')
        connection.cursor().execute("SELECT @@session.tx_isolation")
        try:
            result = fn(*args, **kwargs)
            transaction.commit()
            return result
        finally:
            transaction.set_autocommit(prior_autocommit)
            django.db.reset_queries()
            gc.collect()
    wrapper.__name__ = fn.__name__
    return wrapper
Program 1 inserts some jobs into a table job_table.
Program 2 needs to :
1. get the job from the table
2. handle the job
-> this needs to be multi-threaded (because each job involves urllib waiting time, which should run in parallel)
3. insert the results into my_other_table, committing the result
Are there any good (standard?) ways to implement this? The issue is that committing inside one thread also commits the work of the other threads.
I was able to read the records from the MySQL table, put them in a queue, and later get them from the queue, but I am not able to insert them into a new MySQL table.
With this code I pick up only the new records whenever they arrive in the table.
Hope this helps.
If there are any mistakes, please help me correct them.
from threading import Thread
import time
import Queue
import csv
import random
import pandas as pd
import pymysql.cursors
from sqlalchemy import create_engine
import logging

queue = Queue.Queue(1000)
logging.basicConfig(level=logging.DEBUG, format='(%(threadName)-9s) %(message)s', )
conn = pymysql.connect(conn-details)
cursor = conn.cursor()

class ProducerThread(Thread):
    def run(self):
        global queue
        cursor.execute("SELECT ID FROM multi ORDER BY ID LIMIT 1")
        min_id = cursor.fetchall()
        min_id1 = list(min_id[0])
        while True:
            cursor.execute("SELECT ID FROM multi ORDER BY ID desc LIMIT 1")
            max_id = cursor.fetchall()
            max_id1 = list(max_id[0])
            sql = "select * from multi where ID between '{}' and '{}'".format(min_id1[0], max_id1[0])
            cursor.execute(sql)
            data = cursor.fetchall()
            min_id1[0] = max_id1[0] + 1
            for row in data:
                num = row
                queue.put(num)  # acquire();wait()
                logging.debug('Putting ' + str(num) + ' : ' + str(queue.qsize()) + ' items in queue')

class ConsumerThread(Thread):
    def run(self):
        global queue
        while True:
            num = queue.get()
            print num
            logging.debug('Getting ' + str(num) + ' : ' + str(queue.qsize()) + ' items in queue')
            # formatted copy of the statement, for debugging output only
            sql1 = """insert into multi_out(ID,clientname) values ('%s','%s')""" % (num[0], num[1])
            print sql1
            # cursor.execute(sql1, num)
            # placeholders must not be quoted: the driver quotes parameters itself
            cursor.execute("""insert into multi_out(ID,clientname) values (%s,%s)""", (num[0], num[1]))
            # conn.commit()
            # conn.close()

def main():
    ProducerThread().start()
    num_of_consumers = 20
    for i in range(num_of_consumers):
        ConsumerThread().start()

main()
What probably happens is you share the MySQL connection between the two threads. Try creating a new MySQL connection inside each thread.
For program 2, look at http://www.celeryproject.org/ :)
This is a common task when doing some sort of web crawling. I have implemented a single thread which grabs a job, waits for the http response, then writes the response to a database table.
The problem I have come across with my method is that you need to lock the table you are grabbing jobs from, and mark jobs as in progress or complete, so that multiple threads do not try to grab the same task.
Just use threading.Thread in Python and override the run method.
Use 1 database connection per thread. (some db libraries in python are not thread safe)
If you have X number of threads running, periodically reading from the jobs table then MySQL will do the concurrency for you.
Or if you need even more assurance, you can always lock the jobs table yourself before reading the next available entry. This way you can be 100% sure that a single job will only be processed once.
As @Martin said, keep connections separate for all threads. They can use the same credentials.
So in short:
Program one -> Insert into jobs
Program two -> Create a write lock on the jobs table, so no one else can read from it
Program two -> read next available job
Program two -> Unlock the table
Do everything else as usual, MySQL will handle concurrency
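The lock / read / unlock steps above can be sketched as follows. This is only an illustration: job_table is the table named in the question, but the id and status columns, the status values, and the claim_next_job helper are hypothetical, and a default (tuple-returning) DB-API cursor is assumed. The small recording classes stand in for a real MySQL connection so the sketch can run on its own.

```python
def claim_next_job(conn):
    """Lock job_table, take the oldest pending job, mark it, then unlock.

    Returns the job id, or None when no pending job exists.
    """
    cur = conn.cursor()
    cur.execute("LOCK TABLES job_table WRITE")
    try:
        cur.execute(
            "SELECT id FROM job_table WHERE status = 'pending' ORDER BY id LIMIT 1"
        )
        row = cur.fetchone()
        if row is None:
            return None
        cur.execute(
            "UPDATE job_table SET status = 'in_progress' WHERE id = %s", (row[0],)
        )
        conn.commit()  # commit before releasing the table lock
        return row[0]
    finally:
        cur.execute("UNLOCK TABLES")
        cur.close()

class RecordingCursor:
    """Stand-in cursor that records SQL verbs instead of talking to MySQL."""
    def __init__(self, log, pending_ids):
        self.log, self.pending_ids = log, pending_ids
    def execute(self, sql, params=None):
        self.log.append(sql.split()[0])
    def fetchone(self):
        return (self.pending_ids.pop(0),) if self.pending_ids else None
    def close(self):
        pass

class RecordingConnection:
    """Stand-in connection holding a list of pending job ids."""
    def __init__(self, pending_ids):
        self.log, self.pending_ids = [], list(pending_ids)
    def cursor(self):
        return RecordingCursor(self.log, self.pending_ids)
    def commit(self):
        self.log.append("COMMIT")

conn = RecordingConnection([5, 6])
first = claim_next_job(conn)
second = claim_next_job(conn)
third = claim_next_job(conn)  # nothing left to claim
```

Each worker thread would call claim_next_job on its own connection, then process the job outside the lock so the table is only held for the short claim step.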