Cassandra CQL UPDATE with IF - python

Newbie here (and it seems like it might be a newbie question).
Using Ubuntu 14.04 with a fresh install of Cassandra 2.1.1, CQL 3.2.0 (it says).
Writing a back-end database for a CherryPy site, initially as a session database.
I've come up with a scheme for a kind of 'row locking' as a session lock, but it doesn't seem to be hanging together, so I've reduced it to a simple test program running against a local Cassandra instance. To run this test, I open two terminal windows to run two python instances of it at the same time, each with different instance numbers ('1' and '2').
import time, sys, os, cassandra
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

instance = sys.argv[1]
cluster = Cluster(auth_provider=PlainTextAuthProvider(username='cassandra', password='cassandra'))
cdb = cluster.connect()
cdb.execute("CREATE KEYSPACE IF NOT EXISTS test WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 1}")
cdb.execute("CREATE TABLE IF NOT EXISTS test.test ( id text primary key, val int, lock text )")
cdb.execute("INSERT INTO test.test (id, val, lock) VALUES ('session_id1', 0, '') ")
raw_input('<Enter> to start ... ')

i = 0
while i < 10000:
    i += 1
    # set lock
    while True:
        r = cdb.execute("UPDATE test.test SET lock = '%s' WHERE id = 'session_id1' IF lock = '' " % instance)
        if r[0].applied == True:
            break
    # check lock and increment val
    s0 = cdb.execute("SELECT val, lock FROM test.test WHERE id = 'session_id1' ")[0]
    if s0.lock != instance:
        print 'error: instance [%s] %s %s' % (instance, s0, r[0])
    cdb.execute("UPDATE test.test SET val = %s WHERE id = 'session_id1'", (s0.val + 1,))
    # clear lock
    cdb.execute("UPDATE test.test SET lock = '' WHERE id = 'session_id1' ")
    time.sleep(.01)
So if I understand correctly, the UPDATE..IF should be 'applied' (and the break taken) only if the existing value of lock is '' (an empty string), so this should give an effective exclusive lock on the row.
The problem is that the 's0.lock != instance' test fires quite frequently, showing that despite the UPDATE being reported as applied, the value of lock afterwards is variously still '' or that of the other instance...
I know that when I roll out to a cluster I'm going to have to manage consistency issues, but this is against a single local Cass instance - surely consistency shouldn't be a problem here?
I can't imagine this CQL form is broken (tm), so it must be me. What am I doing wrong, or what is it I don't understand? TIA.
UPDATE: Ok, I googled a lot on this before I posted here, and now have spent the day since posting doing the same.
In particular, the Stack Overflow posting Cassandra Optimistic Locking addresses a similar issue (for a different reason), and his solution was:
"update table1 set version_num = 5 where name = 'abc' if version_num = 4"
which he says works for him - but it is exactly what I am doing, and it isn't working for me.
So I believe my approach to be sound, but clearly I have a problem.
Are there any environmental issues that could be affecting me? (installation, pythonic, whatever...)

Unsatisfactory Work-Around found
After trying a lot of variations of the test code (above), I have come to the view that the statement:
"UPDATE test.test SET lock = '%s' WHERE id = 'session_id1' IF lock = '' "
around 5% of the time, when it finds lock is '' (an empty string), actually fails to write the value to lock, but nevertheless returns 'applied=True'.
By way of further testing, I modified that test code as follows:
# set lock
while True:
    r = cdb.execute("UPDATE test.test SET lock = '%s' WHERE id = 'session_id1' IF lock = '' " % instance)
    if r[0].applied == True:
        s = cdb.execute("SELECT lock FROM test.test WHERE id = 'session_id1' ")
        if s[0].lock == instance:
            break
# check lock and increment val
(etc)
... so this code now confirms that the lock has actually been applied, and if not, it goes back to try again...
So this is:
1) Horrible
2) Kludgy
3) Inefficient
4) Totally reliable (the only thing that really matters to me)
I've tested this on the single local Cassandra instance, and the main point is that the 'val' column that the lock is supposed to be protecting does reach the proper terminating value (20000 with the code as above).
I've also tested it on a 2-node cluster with a replication factor of 2, with one instance of the test code running on each node, and that works too (although the "UPDATE ... IF" statement, now with a consistency of QUORUM, occasionally returns:
exception - code=1100 [Coordinator node timed out waiting for replica nodes' responses]
message="Operation timed out - received only 1 responses."
info={'received_responses': 1, 'required_responses': 2, 'write_type': 5, 'consistency': 8}
... that needs careful handling, as it appears that the lock has always been set, despite not having received all of the replies... and it cannot be retried, as the operation isn't idempotent...)
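Something like this (an untested sketch of my own, assuming the driver's WriteTimeout exception) might handle that case: since the write may or may not have landed and isn't idempotent, instead of blindly retrying, re-read the row and check whether our value actually took:
from cassandra import WriteTimeout

def try_lock(cdb, instance):
    try:
        r = cdb.execute(
            "UPDATE test.test SET lock = '%s' WHERE id = 'session_id1' IF lock = '' " % instance)
        return r[0].applied
    except WriteTimeout:
        # outcome unknown: check whether our value landed anyway
        s = cdb.execute("SELECT lock FROM test.test WHERE id = 'session_id1' ")
        return s[0].lock == instance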
So I clearly haven't fixed the underlying problem, and although I have fixed the symptom, I would still appreciate a more thorough insight into what is happening...
I'd appreciate any feedback (but at least I can make progress again). TIA

So I've had some communication with Tyler Hobbs (Datastax), and in a nutshell:
"The correct functioning of the mechanism that provides the atomic test-and-set facility (via LightWeight Transactions) depends upon using the same mechanism to clear the lock."
... so I need to use a similar 'IF' construct to clear it, even though I already know the contents...
# clear lock
cdb.execute( "UPDATE test.test SET lock = '' WHERE id = 'session_id1' IF lock = '%s'" % instance)
... and that works.
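For anyone else hitting this, the final shape of my test pattern is roughly the following (a sketch of my test code, not production-ready): both setting and clearing the lock go through a lightweight transaction.
def acquire_lock(cdb, instance):
    # spin until our conditional (LWT) update is applied
    while True:
        r = cdb.execute(
            "UPDATE test.test SET lock = '%s' WHERE id = 'session_id1' IF lock = '' " % instance)
        if r[0].applied:
            return

def release_lock(cdb, instance):
    # clear the lock via the same LWT mechanism, per Tyler's advice
    cdb.execute(
        "UPDATE test.test SET lock = '' WHERE id = 'session_id1' IF lock = '%s' " % instance)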

Related

What is the correct way to work with PostgreSQL from Python threads?

I need to increase the speed of parsing a heap of XML files. I decided to try Python threads, but I do not know how to work with the DB from them correctly.
My DB stores only links to files. I decided to add an isProcessing column to my DB to prevent multiple threads from acquiring the same rows.
So the resulting table looks like:
|xml_path|isProcessing|
Every thread sets this flag before starting processing, and other threads select for processing only rows where this flag is not set.
But I am not sure that this is the correct way, because I am not sure that the acquire is atomic, and two threads may process the same row twice.
import threading
from multiprocessing import Process

def select_single_file_for_processing():
    # ...
    sql = """UPDATE processing_files SET "isProcessing" = 'TRUE' WHERE "xml_name"='{0}'""".format(xml_name)
    cursor.execute(sql)
    conn.commit()

def worker():
    result = select_single_file_for_processing()
    # ...
    # processing()

def main():
    # ...
    while unprocessed_xml_count != 0:  # now unprocessed_xml_count is global! I know that it's wrong, but how to fix it?
        checker_thread = threading.Thread(target=select_total_unpocessed_xml_count)
        checker_thread.start()  # if we have files for processing
        for i in range(10):  # run workers
            t = Process(target=worker)
            t.start()
The second question - what is the best practice to work with DB from multiprocessing module?
As written, your isProcessing flag could have problems with multiple threads. You should include a predicate for isProcessing = FALSE and check how many rows are updated. One thread will report 1 row and any other threads will report 0 rows.
As to best practices? This is a reasonable solution. The key is to be specific. A simple update will set the values as specified. The operation you are trying to perform though is to change the value from a to b, hence including the predicate for a in the statement.
UPDATE processing_files
SET "isProcessing" = 'TRUE'
WHERE "xml_name" = '...'
  AND "isProcessing" = 'FALSE';
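In Python that check might look something like this (a sketch assuming psycopg2; the connection string is a placeholder), using parameter binding instead of format() and inspecting rowcount to see who won:
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string

def try_claim(xml_name):
    cursor = conn.cursor()
    cursor.execute(
        """UPDATE processing_files SET "isProcessing" = 'TRUE' """
        """WHERE "xml_name" = %s AND "isProcessing" = 'FALSE'""",
        (xml_name,))
    claimed = cursor.rowcount == 1  # exactly one thread sees 1; the rest see 0
    conn.commit()
    cursor.close()
    return claimed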

sqlalchemy query after flushed delete

Given this piece of code:
record = session.query(Foo).filter(Foo.id == 1).first()
session.delete(record)
session.flush()
has_record = session.query(Foo).filter(Foo.id == 1).first()
I think the 'has_record' should be None here, but it turns out to be the same row as record.
Did I miss something to get the expected result? Or is there any way to make the delete take effect without a commit?
MySQL would behave differently under a similar process:
start transaction;
select * from Foo where id = 1; # Hit one record
delete from Foo where id = 1; # Nothing goes to the disk
select * from Foo where id = 1; # Empty set
commit; # Everything goes to the disk
I made a stupid mistake here. The session I'm using is a routing session, which has a master/slave session behind it. What is probably happening is that the delete is flushed to the master while the query still goes to the slave, so of course I can query the record again.
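To confirm, with a plain single-engine session the original snippet behaves as I expected; roughly like this (a standalone sketch using an in-memory SQLite engine, not my real setup):
from sqlalchemy import create_engine, Column, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Foo(Base):
    __tablename__ = 'foo'
    id = Column(Integer, primary_key=True)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(Foo(id=1))
session.commit()

record = session.query(Foo).filter(Foo.id == 1).first()
session.delete(record)
session.flush()
print session.query(Foo).filter(Foo.id == 1).first()  # None - the flushed DELETE is visible here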

With Python sqlite3, should I use sleep() after any SQL operation and commit()?

I was doing some scripted update to the sqlite database of Clementine media player (60,000 entries) and found that in the end the database was corrupted.
So I strongly suspected that my write operations in the for-loop didn't have enough time to complete before the next iteration started. I tested by adding time.sleep() for 2 seconds after my UPDATE call, and 15 seconds after my periodic commit(). This seems to work, but the whole process became really slow. Sample code:
CommitInterval = 1000
artistCounter = 0
for artist in allArtists:
    artistCounter += 1
    for record in albumRatings:
        album = record[0]
        rating = record[1]
        dbCursor.execute('UPDATE songs SET rating = ? WHERE LOWER(songs.artist) == ? AND LOWER(songs.album) == ? AND rating < ?', (rating, artist, album, rating))
        # short sleep
        ShortSleepSec = 2
        time.sleep(ShortSleepSec)
    if artistCounter == CommitInterval:
        db.commit()
        artistCounter = 0
        # long sleep
        SleepSec = 15
        print 'Sleep %d seconds...' % (SleepSec)
        time.sleep(SleepSec)
Here are my questions:
Should I really sleep after both the UPDATE and commit() or just one of them?
How should I calculate how long I should sleep after these calls?
Thanks very much!
Sqlite doesn't need to sleep after a commit. Sqlite is synchronous, in-process, so by the time commit() returns, the operation is completed. But: it can be dangerous to use sqlite on more than one thread.
Maybe a bit late, but
you do not need to sleep between database operations
you should not insert single records over and over
The sqlite3 module provides you with cursor.executemany(). This will heavily reduce runtime. Use it like so:
par = [ (row[1], artist, row[0], row[1]) for row in albumRatings ]
sql = 'UPDATE songs SET rating = ? WHERE LOWER(songs.artist) == ? AND LOWER(songs.album) == ? AND rating < ?'
dbCursor.executemany(sql, par)
I would db.commit() immediately after that round.
Also, make sure that the media player does not access the database during the update process, and that no other invisible daemon job does either.
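Putting it together, the whole pass could look roughly like this (a sketch: the database path and the way albumRatings is obtained per artist are placeholders for whatever your script already does):
import sqlite3

db = sqlite3.connect('clementine.db')  # placeholder path
dbCursor = db.cursor()
sql = ('UPDATE songs SET rating = ? WHERE LOWER(songs.artist) == ? '
       'AND LOWER(songs.album) == ? AND rating < ?')

for artist in allArtists:  # allArtists / albumRatings come from your existing code
    par = [(row[1], artist, row[0], row[1]) for row in albumRatings]
    dbCursor.executemany(sql, par)

db.commit()  # one commit at the end; no sleeps needed anywhere
db.close()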

Python's MySqlDB not getting updated row

I have a script that waits until some row in a db is updated:
con = MySQLdb.connect(server, user, pwd, db)
When the script starts the row's value is "running", and it waits for the value to become "finished"
while True:
    sql = '''select value from table where some_condition'''
    cur = self.getCursor()
    cur.execute(sql)
    r = cur.fetchone()
    cur.close()
    res = r['value']
    if res == 'finished':
        break
    print res
    time.sleep(5)
When I run this script it hangs forever. Even though I see the value of the row has changed to "finished" when I query the table, the printout of the script is still "running".
Is there some setting I didn't set?
EDIT: The python script only queries the table. The update to the table is carried out by a tomcat webapp, using JDBC, that is set on autocommit.
This is an InnoDB table, right? InnoDB is a transactional storage engine. Setting autocommit to true will probably fix this behavior for you.
conn.autocommit(True)
Alternatively, you could change the transaction isolation level. You can read more about this here:
http://dev.mysql.com/doc/refman/5.0/en/set-transaction.html
The reason for this behavior is that inside a single transaction the reads need to be consistent. All consistent reads within the same transaction read the snapshot established by the first read. Even if your script only reads the table, this is considered a transaction too. This is the default behavior in InnoDB, and you need to change that or run conn.commit() after each read.
This page explains this in more details: http://dev.mysql.com/doc/refman/5.0/en/innodb-consistent-read.html
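In your polling loop that amounts to something like this (a sketch; the connection parameters and the query are the placeholders from your snippet):
import time
import MySQLdb
import MySQLdb.cursors

con = MySQLdb.connect(server, user, pwd, db, cursorclass=MySQLdb.cursors.DictCursor)
# con.autocommit(True)  # alternative to the explicit commit() below

while True:
    cur = con.cursor()
    cur.execute('''select value from table where some_condition''')
    r = cur.fetchone()
    cur.close()
    con.commit()  # end the read transaction so the next SELECT sees a fresh snapshot
    if r['value'] == 'finished':
        break
    print r['value']
    time.sleep(5)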
I worked around this by running
c.execute("""set session transaction isolation level READ COMMITTED""")
early on in my reading session. Updates from other threads do come through now.
In my instance I was keeping connections open for a long time (inside mod_python) and so updates by other processes weren't being seen at all.

How can I Cause a Deadlock in MySQL for Testing Purposes

I want to make my Python library, which works with MySQLdb, able to detect deadlocks and try again. I believe I've coded a good solution, and now I want to test it.
Any ideas what the simplest queries I could run using MySQLdb to create a deadlock condition would be?
system info:
MySQL 5.0.19
Client 5.1.11
Windows XP
Python 2.4 / MySQLdb 1.2.1 p2
Here's some pseudocode for how I do it in PHP:
Script 1:
START TRANSACTION;
INSERT INTO table <anything you want>;
SLEEP(5);
UPDATE table SET field = 'foo';
COMMIT;
Script 2:
START TRANSACTION;
UPDATE table SET field = 'foo';
SLEEP(5);
INSERT INTO table <anything you want>;
COMMIT;
Execute script 1 and then immediately execute script 2 in another terminal. You'll get a deadlock if the database table already has some data in it (In other words, it starts deadlocking after the second time you try this).
Note that if mysql won't honor the SLEEP() command, use Python's equivalent in the application itself.
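A rough Python/MySQLdb translation of the same pseudocode might look like this (untested; connection details and table/column names are placeholders, and the second script swaps the order of the two statements):
import time
import MySQLdb

conn = MySQLdb.connect(server, user, pwd, db)  # placeholder connection details
cur = conn.cursor()

cur.execute("START TRANSACTION")
cur.execute("INSERT INTO t (field) VALUES ('anything')")  # script 2 does the UPDATE first
time.sleep(5)  # Python-side sleep in place of SQL SLEEP()
cur.execute("UPDATE t SET field = 'foo'")
conn.commit()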
You can always run LOCK TABLES tablename WRITE from another session (the mysql CLI, for instance). That might do the trick.
It will remain locked until you release it or disconnect the session.
I'm not familiar with Python, so excuse my incorrect language if I'm saying this wrong... but open two sessions (in separate windows, or from separate Python processes - from separate boxes would work ...). Then ...
In Session A:

    Begin Transaction
    Insert TableA() Values()...

Then in Session B:

    Begin Transaction
    Insert TableB() Values()...
    Insert TableA() Values() ...

Then go back to Session A:

    Insert TableB() Values () ...
You'll get a deadlock...
You want something along the following lines.
parent.py
import subprocess

c1 = subprocess.Popen(["python", "child.py", "1"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
c2 = subprocess.Popen(["python", "child.py", "2"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
# release each child's raw_input() in turn, then collect their output
c1.stdin.write("to 1: hit it!\n")
c2.stdin.write("to 2: ready, set, go!\n")
out1, err1 = c1.communicate()
print " 1:", repr(out1)
print "*1:", repr(err1)
out2, err2 = c2.communicate()
print " 2:", repr(out2)
print "*2:", repr(err2)
c1.wait()
c2.wait()
child.py
import sys
import yourDBconnection as dbapi2

def child1():
    print "Child 1 start"
    conn = dbapi2.connect( ... )
    c1 = conn.cursor()
    conn.begin()  # turn off autocommit, start a transaction
    ra = c1.execute("UPDATE A SET AC1='Achgd' WHERE AC1='AC1-1'")
    print ra
    print "Child1", raw_input()
    rb = c1.execute("UPDATE B SET BC1='Bchgd' WHERE BC1='BC1-1'")
    print rb
    c1.close()
    print "Child 1 finished"

def child2():
    print "Child 2 start"
    conn = dbapi2.connect( ... )
    c1 = conn.cursor()
    conn.begin()  # turn off autocommit, start a transaction
    rb = c1.execute("UPDATE B SET BC1='Bchgd' WHERE BC1='BC1-1'")
    print rb
    print "Child2", raw_input()
    ra = c1.execute("UPDATE A SET AC1='Achgd' WHERE AC1='AC1-1'")
    print ra
    c1.close()
    print "Child 2 finish"

try:
    if sys.argv[1] == "1":
        child1()
    else:
        child2()
except Exception, e:
    print repr(e)
Note the symmetry. Each child starts out holding one resource. Then they attempt to get someone else's held resource. You can, for fun, have 3 children and 3 resources for a really vicious circle.
Note the difficulty in contriving a situation in which deadlock occurs. If your transactions are short -- and consistent -- deadlock is very difficult to achieve. Deadlock requires (a) transactions which hold locks for a long time AND (b) transactions which acquire locks in an inconsistent order. I have found it easiest to prevent deadlocks by keeping my transactions short and consistent.
Also note the non-determinism. You can't predict which child will die with a deadlock and which will continue after the other died. Only one of the two needs to die to release the needed resources for the other. Some RDBMS's claim that there's a rule based on the number of resources held, blah blah blah, but in general, you'll never know how the victim was chosen.
Because of the two writes being in a specific order, you sort of expect child 1 to die first. However, you can't guarantee that. It's not deadlock until child 2 tries to get child 1's resources -- the sequence of who acquired first may not determine who dies.
Also note that these are processes, not threads. Threads -- because of the Python GIL -- might be inadvertently synchronized and would require lots of calls to time.sleep( 0.001 ) to give the other thread a chance to catch up. Processes -- for this -- are slightly simpler because they're fully independent.
Not sure if either above is correct.
Check out this:
http://www.xaprb.com/blog/2006/08/08/how-to-deliberately-cause-a-deadlock-in-mysql/
