Django database storage issue - python

So I am running a particular code block in threads (5 of them), and in the code being executed by the threads, data is saved to the database, e.g.
link = AccountLinkModel.objects.create(account=account)
If I print the value of the "link" object and any field from the AccountLinkModel model, they print successfully, which suggests the data was created. But eventually some of the records were not found in the database; only a few were recorded.
Any advice on what might cause this?

Maybe try to run it with only 1 thread to see if the problem still occurs. If not, you might have a race condition in your code where you are not properly locking a shared resource across your threads.
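For illustration, a generic sketch of what locking a shared resource with threading.Lock looks like (the worker function and the shared list below are illustrative, not taken from the question's code):

import threading

lock = threading.Lock()
created_links = []  # shared between threads, so access must be guarded

def worker(account):
    link = AccountLinkModel.objects.create(account=account)
    with lock:
        # Only one thread at a time may touch the shared list.
        created_links.append(link)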

If I print the value of the "link" object and any field from the AccountLinkModel model, they print successfully, which means the data was created.
If the create call is wrapped in a transaction block like

with transaction.atomic():
    link = AccountLinkModel.objects.create(account=account)
    # ...other db stuff

then the object is not persisted to the database right after the create function is called. If you print after the create call, you are just looking at the Python attributes held in RAM; the row only becomes visible to other connections once the surrounding transaction commits.
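As a minimal sketch of that behaviour (assuming the create does happen inside an atomic block), Django's transaction.on_commit hook (available since Django 1.9) can be used to defer work until the row has really been committed:

from django.db import transaction

def announce(pk):
    # Runs only after the surrounding transaction commits successfully.
    print("row %s is now committed and visible" % pk)

with transaction.atomic():
    link = AccountLinkModel.objects.create(account=account)
    # Inside the block, "link" is just an in-memory object; other
    # connections cannot see the row yet.
    transaction.on_commit(lambda: announce(link.pk))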


How to know whether a NiFi process group has completed the data transfer using nipyapi?

I need to know the status of a data transfer job (a flow inside a Process Group): whether it has completed, failed, or is still running. I want to do this using nipyapi for a web application.
I have a process group in NiFi, inside which I have the NiFi flow. I am scheduling the process group using nipyapi:
nipyapi.canvas.schedule_process_group(id, True)
Now I want to monitor the status of the process group using nipyapi. By status I specifically want to know whether it's still running, failed or completed.
NiFi does not really have a concept of a job that can be checked for completion. Once you start all the components in a process group, they are then running indefinitely until someone stops them.
The concept of being "done" or "complete" really depends on what your data flow is doing. For example, if your first processor is, say, GetFile, once that processor is running it is going to monitor the directory for files until someone stops it. While the processor is running it has no way of knowing whether there will ever be more files, or whether it has already seen all the files that will ever be dropped in the directory. That knowledge is only known by whoever/whatever is putting the files there.
To determine failure you need to do something in your data flow to capture the failures. Most processors have a failure relationship, so you would need to route these somewhere and take some action to track the failures.
I think I found a good solution for this problem. This is how I solved it.
I have a MySQL database which basically keeps track of all the files that are to be transferred. The table has two columns: one for the filename (let's say it is unique) and a flag for whether the file has been transferred (true/false).
There are three sections of processors:
First: listSFTP and putMySQL
Second: getSFTP and putHDFS
Third: listHDFS and putMySQL
The first section is responsible for listing the files on the SFTP server. For each file it adds a row to MySQL saying the filename is 'X' and the transferred flag is false:
insert into NifiTest.Jobs values('${filename}', 0);
The third section does the same thing for HDFS: it either inserts a row with the transferred flag set to true, or updates the row if one already exists with the same filename:
insert into NifiTest.Jobs values('${filename}', 1) on duplicate key update TRANSFERRED = 1;
The second section does nothing but move the files from SFTP to HDFS.
Now, to check when the data transfer job is finished: you start the entire process group together, and once a query of the database shows every row with TRANSFERRED = 1, the job is finished.
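A hedged sketch of that completion check in Python (the MySQLdb connection details are illustrative; only the NifiTest.Jobs table and TRANSFERRED column come from the answer above):

import MySQLdb

def transfer_finished():
    # True once every file inserted by the first section has been
    # marked as transferred by the third section.
    con = MySQLdb.connect('localhost', 'user', 'passwd', 'NifiTest')
    try:
        cur = con.cursor()
        cur.execute("SELECT COUNT(*) FROM Jobs WHERE TRANSFERRED = 0")
        (pending,) = cur.fetchone()
        return pending == 0
    finally:
        con.close()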
It might feel like there are cases where this could fail, but when you carefully think through all the cases you will see that it handles all of them.
Let me know if I am wrong or if this solution can be improved.
You can achieve this without a database by using the process group variable registry APIs in NiFi. Create a custom processor, placed as the last processor in the flow, that sets a variable on the process group, say is_complete = true. Then you can monitor that variable with nipyapi.
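For example, a rough sketch of polling such a variable with nipyapi (the is_complete variable name is hypothetical, and the exact shape of the variable registry entity may differ between nipyapi/NiFi versions):

import time
import nipyapi

def wait_for_completion(pg_id, poll_seconds=10):
    pg = nipyapi.canvas.get_process_group(pg_id, identifier_type='id')
    while True:
        registry = nipyapi.canvas.get_variable_registry(pg)
        # Flatten the registry into a plain dict of name -> value.
        variables = {v.variable.name: v.variable.value
                     for v in registry.variable_registry.variables}
        if variables.get('is_complete') == 'true':
            return
        time.sleep(poll_seconds)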

Save to database inside thread

I'm working with Django.
I have a thread whose purpose is to take queued lists of database items and modify them.
Here is my model:

class MyModel(models.Model):
    boolean = models.BooleanField(editable=False)

and the problematic code:

def unqueue():
    while not queue.empty():
        myList = queue.get()
        for i in myList:
            if not i.boolean:
                break
            i.boolean = False
            i.save()  # <== error because database table is locked
        queue.task_done()

queue = Queue()
thread = Thread(target=unqueue)
thread.start()

def addToQueue(pk_list):  # can be called multiple times simultaneously
    list = []
    for pk in pk_list:
        list.append(MyModel.objects.get(pk=pk))
    queue.put(list)
I know the code is missing a lot of checks etc.; I simplified it here to make it clearer.
What can I do to be able to save to my db while inside the thread ?
EDIT: I need this to be synchronous because i.boolean (and other properties in my real code) mustn't be overwritten.
I tried to create a dedicated table in the database, but it didn't work; I still have the same issue.
EDIT 2: I should mention that I'm using SQLite3. I tried to see if I could lock/unlock a specific table in SQLite, and it seems that locking applies to the entire database only. That is probably why using a dedicated table wasn't helpful.
That is bad for me, because I need to access different tables simultaneously from different threads. Is that possible?
EDIT 3: It seems that my problem is the one described here:
https://docs.djangoproject.com/en/1.8/ref/databases/#database-is-locked-errors
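That documentation page suggests, among other things, raising SQLite's lock timeout so writers wait for the lock instead of failing immediately; a minimal sketch of the relevant settings.py fragment (database name is illustrative):

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'db.sqlite3',
        'OPTIONS': {
            # Seconds to wait for the database lock before raising
            # the "database is locked" OperationalError.
            'timeout': 20,
        },
    }
}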
Are you sure you need a synchronized queue? Maybe an asynchronous solution will solve your problem? See: Need a thread-safe asynchronous message queue.
The solution I found was to change the database. SQLite doesn't allow concurrent access like that, so I switched to MySQL and it works now.

What is the proper way of using MySQLdb connections and cursors across multiple functions in Python

I'm kind of new to Python and its MySQLdb connector.
I'm writing an API to return some data from a database using the RESTful approach. In PHP, I wrapped the Connection management part in a class, acting as an abstraction layer for MySQL queries.
In Python:
I define the connection early on in the script: con = mdb.connect('localhost', 'user', 'passwd', 'dbname')
Then, in all subsequent methods:
import MySQLdb as mdb

def insert_func():
    with con:
        cur = con.cursor(mdb.cursors.DictCursor)
        cur.execute("INSERT INTO table (col1, col2, col3) VALUES (%s, %s, %s)", (val1, val2, val3))
        rows = cur.fetchall()
        # do something with the results
        return someval
etc.
I use mdb.cursors.DictCursor because I prefer to be able to access database columns in an associative array manner.
Now the problems start popping up:
in one function, I issue an insert query to create a 'group' with unique 'groupid'.
This 'group' has a creator. Every user in the database holds a JSON array in the 'groups' column of his/her row in the table.
So when I create a new group, I want to assign the groupid to the user that created it.
I update the user's record using a similar function.
I've wrapped the 'insert' and 'update' parts in two separate function defs.
The first time I run the script, everything works fine.
The second time I run the script, the script runs endlessly (I suspect due to some idle connection to the MySQL database).
When I interrupt it using CTRL + C, I get one of the following errors:
"'Cursor' object has no attribute 'connection'"
"commands out of sync; you can't run this command now"
or any other KeyboardInterrupt exception, as would be expected.
It seems to me that these errors are caused by some erroneous way of handling connections and cursors in my code.
I read it was good practice to use with con: so that the connection will automatically close itself after the query. I use 'with' on 'con' in each function, so the connection is closed, but I decided to define the connection globally so that any function can use it. This seems incompatible with the with con: context management. I suspect the cursor needs to be 'context managed' in a similar way, but I do not know how to do this (to my knowledge, PHP doesn't use cursors for MySQL, so I have no experience with them).
I now have the following questions:
Why does it work the first time but not the second? (It will, however, work again once after the Ctrl + C interrupt.)
How should I go about using connections and cursors when using multiple functions (that can be called upon in sequence)?
I think there are two main issues going on here: one appears to be the Python code, and the other is how you're structuring your interaction with the DB.
First, you're not closing your connection. How long it should stay open depends on your application's needs; you have to decide. Reference this SO question:
from contextlib import closing

with closing(connection.cursor()) as cursor:
    ...  # use the cursor
# cursor closed. Guaranteed.

connection.close()
Right now, you have to interrupt your program with Ctrl+C because there's no reason for your with statement to stop running.
Second, start thinking about your interactions with the DB in terms of transactions: do something, commit it to the DB; if it didn't work, roll back; if it did, close the connection. Here's a tutorial.
With connections, as with file handles, the rule of thumb is: open late, close early.
So I would recommend sharing connections only where they are doing one thing. If you multiprocess, then each process gets its own connection, again following open late, close early. And if you are doing sequential operations (say in a loop), open and close outside the loop. Having global connections can get messy, mainly because you then have to keep track of which function uses the connection at what time and what it tries to do with it. A sketch of the per-operation pattern follows.
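As a sketch (not the poster's exact code) of open late, close early with MySQLdb: open a fresh connection per operation and commit or roll back explicitly (table and credential names are illustrative):

from contextlib import closing
import MySQLdb as mdb

def insert_func(val1, val2, val3):
    # Open the connection only when needed and close it as soon as possible.
    with closing(mdb.connect('localhost', 'user', 'passwd', 'dbname')) as con:
        with closing(con.cursor(mdb.cursors.DictCursor)) as cur:
            try:
                cur.execute(
                    "INSERT INTO some_table (col1, col2, col3) VALUES (%s, %s, %s)",
                    (val1, val2, val3),
                )
                con.commit()       # make the insert visible to other connections
                return cur.lastrowid
            except mdb.Error:
                con.rollback()     # undo the partial work on failure
                raise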
The "commands out of sync" issue appears because your keyboard interrupt kills the active connection.
As to part one of your question, the endless run could be anywhere. Each instance of Python gets its own connection, so when you run the script the second time it should get its own connection as well. Open up a mysql client and run

show full processlist

to see what's going on.

Synchronizing data, want to track what has changed

I have a program that queries data from a database (MySQL) every minute:
while 1:
    self.alerteng.updateAndAnalyze()
    time.sleep(60)
But the data doesn't change frequently, maybe once an hour or once a day (it is changed by another C++ program).
I think the best way would be to track changes: if a change happens, then I query and update my data.
Any advice?
It depends on what you're doing, but SQLAlchemy's events functionality might help you out.
It lets you run code whenever something happens in your database, e.g. after you insert a new row or set a column value. I've used it in Flask apps to kick off notifications or other async processes.
http://docs.sqlalchemy.org/en/rel_0_7/orm/events.html#mapper-events
Here's toy code from a Flask app that'll run the kick_off_analysis() function whenever a new YourModel model is created in the database.
from sqlalchemy import event

@event.listens_for(YourModel, "after_insert")
def kick_off_analysis(mapper, connection, your_model):
    # do stuff here
    pass
Hope that helps you get started.
I don't know how expensive updateAndAnalyze() is, but I'm pretty sure it's like most SQL commands: not something you really want to poll.
You have a textbook case for the Observer pattern. You want MySQL to somehow call into your code whenever it gets updated. I'm not positive of the exact mechanism for doing this, but there should be a way to set a trigger on your relevant tables so that your code can be notified when the underlying table has been updated. Then, instead of polling, you basically get "interrupted" with the knowledge that you need to do something. It will also eliminate the up-to-a-minute lag you're introducing, which will make whatever you're doing feel snappier.
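MySQL triggers cannot call back into application code directly, so a hedged approximation of this idea (every table, trigger, and connection name below is illustrative) is to have a trigger bump a tiny version counter and let the Python side check that cheap counter before re-running the expensive analysis:

# One-time setup, shown here only as illustrative SQL comments:
#   CREATE TABLE data_version (id INT PRIMARY KEY, version BIGINT NOT NULL);
#   INSERT INTO data_version VALUES (1, 0);
#   CREATE TRIGGER bump_version AFTER UPDATE ON your_table
#       FOR EACH ROW UPDATE data_version SET version = version + 1 WHERE id = 1;

import time
import MySQLdb

def watch_for_changes(alerteng, poll_seconds=60):
    last_seen = None
    con = MySQLdb.connect('localhost', 'user', 'passwd', 'dbname')
    try:
        while True:
            cur = con.cursor()
            cur.execute("SELECT version FROM data_version WHERE id = 1")
            (version,) = cur.fetchone()
            cur.close()
            con.commit()  # end the read snapshot so the next check sees fresh data
            if version != last_seen:
                last_seen = version
                alerteng.updateAndAnalyze()  # only do the heavy work on change
            time.sleep(poll_seconds)
    finally:
        con.close()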

Why doesn't this loop display an updated object count every five seconds?

I use this python code to output the number of Things every 5 seconds:
def my_count():
    while True:
        print "Number of Things: %d" % Thing.objects.count()
        time.sleep(5)

my_count()
If another process generates a new Thing while my_count() is running, my_count() will keep printing the same number, even though it now has changed in the database. (But if I kill my_count() and restart it, it will display the new Thing count.)
Things are stored in a MySQL InnoDB database, and this code runs on Ubuntu.
Why won't my_count() display the new Thing.objects.count() without being restarted?
Because the Python DB API is by default in AUTOCOMMIT=OFF mode, and (at least for MySQLdb) on the REPEATABLE READ isolation level. This means that behind the scenes you have an ongoing database transaction (InnoDB is a transactional engine) in which the first access to a given row (or maybe even table, I'm not sure) fixes the "view" of this resource for the remaining part of the transaction.
To prevent this behaviour, you have to 'refresh' current transaction:
from django.db import transaction

@transaction.autocommit
def my_count():
    while True:
        transaction.commit()
        print "Number of Things: %d" % Thing.objects.count()
        time.sleep(5)
Note that the transaction.autocommit decorator is only for entering transaction management mode (this could also be done manually using the transaction.enter_transaction_management/leave_transaction_management functions).
One more thing to be aware of: Django's autocommit is not the same autocommit you have in the database; it's completely independent. But this is out of scope for this question.
Edited on 22/01/2012
Here is a "twin answer" to a similar question.
