Basic concurrent SQLite writer in Python

I have created a very basic script that periodically writes some data into a database:
test.py
import sqlite3
import sys
import time

DB_CREATE_TABLE = 'CREATE TABLE IF NOT EXISTS items (item TEXT)'
DB_INSERT = 'INSERT INTO items VALUES (?)'
FILENAME = 'test.db'


def main():
    index = int()
    c = sqlite3.connect(FILENAME)
    c.execute(DB_CREATE_TABLE)
    c.commit()
    while True:
        item = '{name}_{index}'.format(name=sys.argv[1], index=index)
        c.execute(DB_INSERT, (item,))
        c.commit()
        time.sleep(1)
        index += 1
    c.close()


if __name__ == '__main__':
    main()
Now I can achieve a simple concurrency by running the script several times:
python3 test.py foo &
python3 test.py bar &
I have tried to read some articles about scripts writing into the same database file at the same time, but I'm still not sure how my script will handle such an event, and I couldn't figure out a way to test it.
My expectation is that in the unlikely event that two instances of my script try to write to the database in the same millisecond, the later one will simply wait silently until the earlier one finishes its job.
Does my current implementation meet this expectation? If it does not, how does it behave in such an event, and how can I fix it?

TL;DR
This script meets the expectation.
Explanation
When the unlikely event of two script instances trying to write at the same time does happen, the first one locks the database and the second one silently waits for a while, until the first one finishes its transaction and the database is unlocked for writing again.
More precisely, the second instance waits up to 5 seconds (by default) and only then raises an OperationalError with the message database is locked. As @roganjosh commented, this behavior is actually specific to the Python SQLite wrapper. The documentation states:
When a database is accessed by multiple connections, and one of the processes modifies the database, the SQLite database is locked until that transaction is committed. The timeout parameter specifies how long the connection should wait for the lock to go away until raising an exception. The default for the timeout parameter is 5.0 (five seconds).
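If waiting longer than 5 seconds is acceptable, the timeout can simply be raised when opening the connection. A minimal sketch (the 30-second value is an arbitrary illustration, not something from the question):
import sqlite3

# Wait up to 30 seconds for a competing writer to release the lock
# before raising sqlite3.OperationalError ("database is locked").
c = sqlite3.connect('test.db', timeout=30)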
Tests
To demonstrate the collision event of the two instances I modified the main function:
def main():
    c = sqlite3.connect(FILENAME)
    c.execute(DB_CREATE_TABLE)
    c.commit()

    print('{} {}: {}'.format(time.time(), sys.argv[1], 'trying to insert ...'))
    try:
        c.execute(DB_INSERT, (sys.argv[1],))
    except sqlite3.OperationalError as e:
        print('{} {}: {}'.format(time.time(), sys.argv[1], e))
        return
    time.sleep(int(sys.argv[2]))
    c.commit()
    print('{} {}: {}'.format(time.time(), sys.argv[1], 'done'))
    c.close()
The documentation states that the database is locked until the transaction is committed, so simply sleeping during the transaction should be enough to test it.
Test 1
We run the following command:
python3 test.py first 10 & sleep 1 && python3 test.py second 0
The first instance is run and after 1s the second instance is run. The first instance opens a 10s long transaction during which the second one tries to write to the database, waits, and then raises an exception. The log demonstrates that:
1540307088.6203635 first: trying to insert ...
1540307089.6155508 second: trying to insert ...
1540307094.6333485 second: database is locked
1540307098.6353421 first: done
Test 2
We run the following command:
python3 test.py first 3 & sleep 1 && python3 test.py second 0
The first instance is run and after 1s the second instance is run. The first instance opens a 3s long transaction during which the second one tries to write to the database and waits. Since it started 1s later, it only has to wait 3s - 1s = 2s, which is less than the default 5s timeout, so both transactions finish successfully. The log demonstrates that:
1540307132.2834115 first: trying to insert ...
1540307133.2811155 second: trying to insert ...
1540307135.2912169 first: done
1540307135.3217440 second: done
Conclusion
The time needed for a transaction to finish is significantly smaller (milliseconds) than the lock timeout (5s), so in this scenario the script indeed meets the expectation. But as @HarlyH. commented, the transactions wait in a queue to be committed, so for a heavily used or very large database this is not a good solution, since communication with the database will become slow.
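If a writer must not give up after the timeout, the insert can also be wrapped in a small retry loop. The following is only a sketch of that idea, reusing DB_INSERT from the script above; insert_with_retry, the attempt count, and the delay are illustrative choices, not part of the original script:
import sqlite3
import time

def insert_with_retry(connection, item, attempts=5, delay=1.0):
    # Try the insert a few times, sleeping whenever another writer
    # currently holds the lock.
    for _ in range(attempts):
        try:
            connection.execute(DB_INSERT, (item,))
            connection.commit()
            return
        except sqlite3.OperationalError:
            time.sleep(delay)
    raise RuntimeError('could not insert {!r} after {} attempts'.format(item, attempts))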


Python script getting Killed

Environment
Flask 0.10.1
SqlAlchemy 1.0.10
Python 3.4.3
Using unittest
I have created two separate tests whose goal is to look through 700k records in the database and do some string matching. When the tests are executed one at a time, everything works fine, but when the whole script is executed with:
python name_of_script.py
it exits with "KILLED" at random places.
The main code of both tests goes something like this:
def test_redundant_categories_exist(self):
    self.assertTrue(self.get_redundant_categories() > 0, 'There are 0 redundant categories to remove. Cannot test removing them if there are none to remove.')

def get_redundant_categories(self):
    total = 0
    with get_db_session_scope(db.session) as db_session:
        records = db_session.query(Category)
        for row in records:
            if len(row.c) > 1:
                c = row.c
                # TODO: threads, each thread handles a bulk of rows
                redundant_categories = [cat_x.id
                                        for cat_x in c
                                        for cat_y in c
                                        if cat_x != cat_y and re.search(r'(^|/)' + cat_x.path + r'($|/)', cat_y.path)
                                        ]
                total += len(redundant_categories)
    records = None
    db_session.close()
    return total
The other test calls a function located in the manager.py file that does something similar, but with an added bulk delete in the database.
def test_remove_redundant_mappings(self):
    import os
    os.system("python ../../manager.py remove_redundant_mappings")
    self.assertEqual(self.get_redundant_categories(), 0, "There are redundant categories left after running manager.py remove_redundant_mappings()")
Is it possible for data to be kept in memory between tests? I don't quite understand why executing the tests individually works, but running them back to back ends with the process being Killed.
Any ideas?
Edit (things I've tried to no avail):
importing the function from manager.py and calling it directly instead of through os.system(..)
importing gc and running gc.collect() after get_redundant_categories() and after calling remove_redundant_mappings()
While searching high and low, I serendipitously came upon the following comment in this StackOverflow question/answer:
What is happening, I think, is that people are instantiating sessions and not closing them. The objects are then being garbage collected without closing the sessions. Why sqlalchemy sessions don't close themselves when the session object goes out of scope has always and will always be beyond me. @melchoir55
So I added the following to the method that was being tested:
db_session.close()
Now the unittest executes without getting killed.
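For completeness, the same guarantee can be baked into the session helper itself so the session is always closed, even when a test fails. This is only a sketch of what a helper like get_db_session_scope from the question might look like, assuming a plain SQLAlchemy session is passed in:
from contextlib import contextmanager

@contextmanager
def get_db_session_scope(session):
    # Hand the session to the caller and guarantee it is closed afterwards,
    # so ORM objects can be garbage collected instead of piling up in memory
    # across tests.
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()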

Scheduling same method to run for different inputs corresponding to its time interval

I am a beginner in Python. I have a Python script that needs to run all the time. The script takes some URLs from the DB and calls some functions to check the activity of the links. These functions should be executed at a specific interval for each URL (the value is specific to each URL and is taken from the DB while retrieving the URLs). I read about the sched module and cron tabs but got a bit confused about what to use and how to use them to achieve all this, or whether there is a better solution. The requirements are:
1) the script runs all the time
2) for each URL the interval at which to call/check a method is different, and each URL should be checked at its particular time interval
My main code will be something like:
import MySQLdb

def checkSublinks(urlId, search, domain, depth_restricted_to, links_restricted_to):
    # method here
    pass

try:
    db = MySQLdb.connect("localhost", "root", "password", "crawler")
    cursor = db.cursor()
    query = "select * from website"
    cursor.execute(query)
    result = cursor.fetchall()
    for row in result:
        depth = 0
        maxCountReached = False
        urlId = row[0]
        print "Id :", urlId
        search = row[1]
        domain = row[2]
        depth_restricted_to = row[3]
        links_restricted_to = row[4]
        website_uptime = row[5]
        link_uptime = row[6]
        checkSublinks(urlId, search, domain, depth_restricted_to, links_restricted_to)
except Exception, e:
    print e
    print "Error in creating DB Connection!"
finally:
    db.close()
Here each URL should call checkSublinks at its own time interval. I would appreciate any suggestions on how to do this.
You can try the timer mechanism provided by the threading module. Ideally the script runs forever, and on every timer interval it reads the data again. HTH!
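A minimal sketch of that idea, reusing checkSublinks and the result rows from the question. It assumes the per-URL interval (in seconds) is one of the columns read from the website table; schedule_check is a hypothetical helper name:
import threading

def schedule_check(urlId, search, domain, depth_restricted_to, links_restricted_to, interval):
    def run():
        checkSublinks(urlId, search, domain, depth_restricted_to, links_restricted_to)
        # Re-arm the timer so this URL is checked again after its own interval.
        schedule_check(urlId, search, domain, depth_restricted_to, links_restricted_to, interval)
    timer = threading.Timer(interval, run)
    timer.daemon = True  # do not keep the interpreter alive just for the timers
    timer.start()

# One timer per URL, each with its own interval (here row[5] is assumed to be
# the interval column fetched from the website table).
for row in result:
    schedule_check(row[0], row[1], row[2], row[3], row[4], float(row[5]))

# Keep the main thread alive so the daemon timers can keep firing.
threading.Event().wait()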

Caching of DB queries in django

I have a Django app that uses two external scripts. One script moves a file from A to B, stores the value for B in a database, and exits afterwards, which should commit any possibly open transactions. The second script reacts to the movement of the file (using inotify), calculates the md5sum (which apparently takes time) and then looks for an entry in the database like
x = Queue.get(filename=location).
Looking at the timestamps in my logs, I am 100% sure that the first script is long done before the second script (actually a daemon) runs the query. Interestingly enough, the whole thing works perfectly after a restart of the daemon.
This leads me to believe that somehow the queryset (I actually run the code shown above every time a new file is detected with inotify) is cached during the runtime of the daemon. I however do not want to restart the daemon all the time, but instead force the query to actually hit the DB instead of that cache.
The Django documentation doesn't say much about that - however, usually Django is not used from an external daemon :)
Thank you in advance for any hints!
Ben
PS: as requested, here is the source of the relevant part of the daemon:
def _get_info(self, path):
    try:
        obj = Queue.objects.get(filename=path)
        x = obj.x
        return x
    except Exception, e:
        self.logger.error("Error in lookup: %s" % e)
        return None
This is called by a thread every time a new file is moved to the watched directory, whereas the code in the first script looks like:
for f in Queue.objects.all():
    if (matching_stuff_here):
        f.filename = B
        f.save()
sys.exit(0)
You haven't shown any actual code, so we have to guess. My guess would be that even though the transaction in the first script is done and committed, you're still inside an open transaction in script B; because of transaction isolation, you won't see any changes in B until you finish the transaction there.
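Under that assumption, a common workaround is to make the daemon drop its long-lived connection before each lookup, so the next query runs on a fresh connection (and therefore a fresh transaction). Below is only a sketch of how the _get_info method above could be adapted; the broad exception handling from the original is narrowed to Queue.DoesNotExist purely for brevity:
from django.db import connection

def _get_info(self, path):
    # Close the daemon's long-lived connection; Django opens a new one (and a
    # new transaction) for the next query, so rows committed by the other
    # script in the meantime become visible.
    connection.close()
    try:
        obj = Queue.objects.get(filename=path)
        return obj.x
    except Queue.DoesNotExist:
        self.logger.error("No queue entry for %s" % path)
        return None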

Python's MySqlDB not getting updated row

I have a script that waits until some row in a db is updated:
con = MySQLdb.connect(server, user, pwd, db)
When the script starts, the row's value is "running", and it waits for the value to become "finished":
while True:
    sql = '''select value from table where some_condition'''
    cur = self.getCursor()
    cur.execute(sql)
    r = cur.fetchone()
    cur.close()
    res = r['value']
    if res == 'finished':
        break
    print res
    time.sleep(5)
When I run this script it hangs forever. Even though I see the value of the row has changed to "finished" when I query the table, the printout of the script is still "running".
Is there some setting I didn't set?
EDIT: The python script only queries the table. The update to the table is carried out by a tomcat webapp, using JDBC, that is set on autocommit.
This is an InnoDB table, right? InnoDB is a transactional storage engine. Setting autocommit to true will probably fix this behavior for you.
conn.autocommit(True)
Alternatively, you could change the transaction isolation level. You can read more about this here:
http://dev.mysql.com/doc/refman/5.0/en/set-transaction.html
The reason for this behavior is that inside a single transaction the reads need to be consistent. All consistent reads within the same transaction read the snapshot established by the first read. Even if your script only reads the table, this is considered a transaction too. This is the default behavior in InnoDB and you need to change that or run conn.commit() after each read.
This page explains this in more details: http://dev.mysql.com/doc/refman/5.0/en/innodb-consistent-read.html
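Applied to the polling loop from the question, that suggestion looks roughly like this (a sketch only; con is the MySQLdb connection opened at the top of the question, and a plain cursor returning tuples is assumed instead of the dict-style one behind self.getCursor()):
import time

while True:
    cur = con.cursor()
    cur.execute("select value from table where some_condition")
    r = cur.fetchone()
    cur.close()
    # End the read transaction so the next SELECT sees a fresh snapshot
    # instead of the one established by the very first read.
    con.commit()
    if r is not None and r[0] == 'finished':
        break
    time.sleep(5)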
I worked around this by running
c.execute("""set session transaction isolation level READ COMMITTED""")
early on in my reading session. Updates from other threads do come through now.
In my instance I was keeping connections open for a long time (inside mod_python) and so updates by other processes weren't being seen at all.

How can I Cause a Deadlock in MySQL for Testing Purposes

I want to make my Python library that works with MySQLdb able to detect deadlocks and try again. I believe I've coded a good solution, and now I want to test it.
Any ideas for the simplest queries I could run using MySQLdb to create a deadlock condition?
system info:
MySQL 5.0.19
Client 5.1.11
Windows XP
Python 2.4 / MySQLdb 1.2.1 p2
Here's some pseudocode for how I do it in PHP:
Script 1:
START TRANSACTION;
INSERT INTO table <anything you want>;
SLEEP(5);
UPDATE table SET field = 'foo';
COMMIT;
Script 2:
START TRANSACTION;
UPDATE table SET field = 'foo';
SLEEP(5);
INSERT INTO table <anything you want>;
COMMIT;
Execute script 1 and then immediately execute script 2 in another terminal. You'll get a deadlock if the database table already has some data in it (In other words, it starts deadlocking after the second time you try this).
Note that if MySQL won't honor the SLEEP() command, use Python's equivalent in the application itself.
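Translated to the question's environment, those two scripts could look roughly like this with MySQLdb. This is a sketch only: the connection parameters, the table_name table, and its field column are placeholder assumptions, and time.sleep() stands in for SQL SLEEP() as suggested above:
import time
import MySQLdb

def script_one():
    conn = MySQLdb.connect("localhost", "user", "password", "test")  # placeholder credentials
    cur = conn.cursor()
    cur.execute("START TRANSACTION")
    cur.execute("INSERT INTO table_name (field) VALUES ('bar')")
    time.sleep(5)  # hold the insert's locks for a while
    cur.execute("UPDATE table_name SET field = 'foo'")
    conn.commit()

def script_two():
    conn = MySQLdb.connect("localhost", "user", "password", "test")  # placeholder credentials
    cur = conn.cursor()
    cur.execute("START TRANSACTION")
    cur.execute("UPDATE table_name SET field = 'foo'")
    time.sleep(5)  # hold the update's locks for a while
    cur.execute("INSERT INTO table_name (field) VALUES ('bar')")
    conn.commit()
Run script_one() in one terminal and script_two() in another immediately afterwards; as with the original recipe, the table needs some pre-existing rows before the lock orders can collide.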
You can always run LOCK TABLES tablename WRITE from another session (the mysql CLI, for instance). That might do the trick.
It will remain locked until you release the lock or disconnect the session.
I'm not familiar with Python, so excuse my incorrect language if I'm saying this wrong... but open two sessions (in separate windows, or from separate Python processes; separate boxes would work too). Then:
In Session A:
Begin Transaction
Insert TableA() Values()...
Then in Session B:
Begin Transaction
Insert TableB() Values()...
Insert TableA() Values() ...
Then go back to Session A:
Insert TableB() Values () ...
You'll get a deadlock...
You want something along the following lines.
parent.py
import subprocess
c1= subprocess.Popen( ["python", "child.py", "1"], stdin=subprocess.PIPE, stdout=subprocess.PIPE )
c2= subprocess.Popen( ["python", "child.py", "2"], stdin=subprocess.PIPE, stdout=subprocess.PIPE )
out1, err1= c1.communicate( "to 1: hit it!" )
print " 1:", repr(out1)
print "*1:", repr(err1)
out2, err2= c2.communicate( "to 2: ready, set, go!" )
print " 2:", repr(out2)
print "*2:", repr(err2)
out1, err1= c1.communicate()
print " 1:", repr(out1)
print "*1:", repr(err1)
out2, err2= c2.communicate()
print " 2:", repr(out2)
print "*2:", repr(err2)
c1.wait()
c2.wait()
child.py
import sys
import yourDBconnection as dbapi2

def child1():
    print "Child 1 start"
    conn = dbapi2.connect( ... )
    c1 = conn.cursor()
    conn.begin()  # turn off autocommit, start a transaction
    ra = c1.execute("UPDATE A SET AC1='Achgd' WHERE AC1='AC1-1'")
    print ra
    print "Child1", raw_input()
    rb = c1.execute("UPDATE B SET BC1='Bchgd' WHERE BC1='BC1-1'")
    print rb
    c1.close()
    print "Child 1 finished"

def child2():
    print "Child 2 start"
    conn = dbapi2.connect( ... )
    c1 = conn.cursor()
    conn.begin()  # turn off autocommit, start a transaction
    rb = c1.execute("UPDATE B SET BC1='Bchgd' WHERE BC1='BC1-1'")
    print rb
    print "Child2", raw_input()
    ra = c1.execute("UPDATE A SET AC1='Achgd' WHERE AC1='AC1-1'")
    print ra
    c1.close()
    print "Child 2 finish"

try:
    if sys.argv[1] == "1":
        child1()
    else:
        child2()
except Exception, e:
    print repr(e)
Note the symmetry. Each child starts out holding one resource. Then each attempts to get the resource the other one holds. You can, for fun, have 3 children and 3 resources for a really vicious circle.
Note the difficulty of contriving a situation in which deadlock occurs. If your transactions are short and consistent, deadlock is very difficult to achieve. Deadlock requires (a) transactions which hold locks for a long time AND (b) transactions which acquire locks in an inconsistent order. I have found it easiest to prevent deadlocks by keeping my transactions short and consistent.
Also note the non-determinism. You can't predict which child will die with a deadlock and which will continue after the other has died. Only one of the two needs to die to release the resources needed by the other. Some RDBMSs claim that there's a rule based on the number of resources held, blah blah blah, but in general you'll never know how the victim was chosen.
Because the two writes are issued in a specific order, you sort of expect child 1 to die first. However, you can't guarantee that. It isn't a deadlock until child 2 tries to get child 1's resources; the sequence of who acquired first may not determine who dies.
Also note that these are processes, not threads. Threads, because of the Python GIL, might be inadvertently synchronized and would require lots of calls to time.sleep(0.001) to give the other thread a chance to catch up. Processes are slightly simpler here because they're fully independent.
Not sure if either of the above is correct.
Check out this:
http://www.xaprb.com/blog/2006/08/08/how-to-deliberately-cause-a-deadlock-in-mysql/
