Save to database inside a thread - Python

I'm working with Django.
I have a thread whose purpose is to take a queued list of database items and modify them.
Here is my model:
class MyModel(models.Model):
    boolean = models.BooleanField(editable=False)
and the problematic code:
from queue import Queue   # Python 3; on Python 2 this is "from Queue import Queue"
from threading import Thread

def unqueue():
    while not queue.empty():
        myList = queue.get()
        for i in myList:
            if not i.boolean:
                break
            i.boolean = False
            i.save()  # <== error because database table is locked
        queue.task_done()

queue = Queue()
thread = Thread(target=unqueue)
thread.start()

def addToQueue(pk_list):  # can be called multiple times simultaneously
    items = []
    for pk in pk_list:
        items.append(MyModel.objects.get(pk=pk))
    queue.put(items)
I know the code is missing a lot of checks etc.; I simplified it here to make it clearer.
What can I do to be able to save to my DB while inside the thread?
EDIT: I need this to be synchronous because i.boolean (and other properties in my real code) mustn't be overwritten.
I tried to create a dedicated table in the database, but it didn't work; I still have the same issue.
EDIT 2: I should mention that I'm using SQLite3. I tried to see whether I could lock/unlock a specific table in SQLite, and it seems that locking applies to the entire database only. That is probably why using a dedicated table wasn't helpful.
That is bad for me, because I need to access different tables simultaneously from different threads. Is that possible?
EDIT 3: It seems that my problem is the one described here:
https://docs.djangoproject.com/en/1.8/ref/databases/#database-is-locked-errors
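That docs page suggests raising the SQLite timeout so a writer waits for the lock instead of failing immediately. A minimal sketch of that setting (the 20-second value is only an example):
# settings.py -- sketch: make SQLite wait for the lock instead of raising immediately
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'db.sqlite3',
        'OPTIONS': {
            'timeout': 20,  # seconds to wait before "database is locked" is raised
        },
    }
}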

Are you sure you need a synchronized queue? Maybe an asynchronous solution will solve your problem? See: Need a thread-safe asynchronous message queue

The solution I found was to change the database.
SQLite doesn't allow concurrent access like that.
I switched to MySQL and it works now.
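For reference, switching the backend is only a settings change; a sketch of a MySQL configuration (database name and credentials are placeholders):
# settings.py -- sketch with placeholder credentials
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydb',
        'USER': 'myuser',
        'PASSWORD': 'mypassword',
        'HOST': 'localhost',
        'PORT': '3306',
    }
}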

Related

Django database storage issue

So I am running a particular code block in threads (5 of them), and in the code executed by the threads, data is saved to the database, e.g.
link = AccountLinkModel.objects.create(account=account)
If I print the value of the "link" object and any field from the AccountLinkModel model, they print successfully, which suggests the data was created. But eventually the records for some of them were not found in the database; only a few were recorded.
Any advice on what might cause this?
Maybe try to run it with only 1 thread to see if the problem still occurs. If not, you might have a race condition in your code where you are not properly locking a resource shared between your threads.
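A minimal sketch of that kind of locking (purely illustrative; whether a lock is actually needed, and around what, depends on what the threads really share):
import threading

db_lock = threading.Lock()

def create_link(account):  # hypothetical helper run by each worker thread
    with db_lock:  # only one thread at a time touches the shared resource
        return AccountLinkModel.objects.create(account=account)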
If I print the value of the "link" object and any field from the AccountLinkModel model, they print successfully, which means the data was created.
If the create call is wrapped in a transaction block like
with transaction.atomic():
    link = AccountLinkModel.objects.create(account=account)
    # ...other db stuff
then the object is not yet persisted to the database when the create call returns; it is only committed when the atomic block exits. If you print right after the create call, those are just the Python attributes held in RAM.
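A minimal sketch of that behaviour, assuming the create call really is inside transaction.atomic() (the model and field names are taken from the question):
from django.db import transaction

with transaction.atomic():
    link = AccountLinkModel.objects.create(account=account)
    print(link.pk)  # prints fine: these are just attributes on the Python object
    # other connections cannot see this row yet
    transaction.on_commit(lambda: print("row is committed and visible now"))
# the INSERT is committed only here, when the block exits without an exception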

How to concurrently run queries using Impala with Python code?

Context
I use Python (3.7) to run several queries on a Hadoop server.
After several tests, I think Impala is the most efficient engine to query the database. So I set up a connection using the Ibis framework in order to force the use of Impala (Hive is used by default).
Considering the number of queries, I'm trying to run them concurrently.
I think I'm getting close, but I'm stuck on a problem when trying to share the Ibis connection to the server between the multiple processes I start.
I'm quite new to Python, but I'll do my best to explain my problem clearly and to use the right vocabulary. Please forgive me in advance for any mistakes!
How the queries are submitted
For submitting my queries, the code looks like this:
Connection to the database:
hdfs = ibis.hdfs_connect(host='X.X.X.X', port=Y)
client = ibis.impala.connect(host='X.X.X.X',port=Y,hdfs_client=hdfs)
Creation of the query (done several times):
query = "SELECT ... FROM ... WHERE ..."
Send the query and retrieve the results (done for each query):
query = self.client.sql(query)
data = query.execute(limit = None)
What has been done to run these queries concurrently
For now, I've created a Process class using multiprocessing, and I'm passing it the client parameter, which I thought would provide the connection, and a list containing the information required to configure the queries to run on the server:
import multiprocessing

class ParallelDataRetrieving(multiprocessing.Process):
    """Process in charge of retrieving the data."""

    def __init__(self, client, someInformations):
        multiprocessing.Process.__init__(self)
        self.client = client
        self.someInformations = someInformations

    def run(self):
        """Code to run during the execution of the process."""
        cpt1 = 0
        while cpt1 < len(self.someInformations):
            query = ...  # use self.someInformations[cpt1] to build the query string
            query = self.client.sql(query)
            data = query.execute(limit=None)
            # some work on the data...
            cpt1 += 1  # move on to the next query
        return 0
Then, from the main script, I (try to) establish the connection, and start several Processes using this connection:
hdfs = ibis.hdfs_connect(host='X.X.X.X', port=Y)
client = ibis.impala.connect(host='X.X.X.X', port=Y, hdfs_client=hdfs)
process_1 = ParallelDataRetrieving(client, someInformations)
process_1.start()
process_2 = ...
But this code does not work. I get the error "TypeError: can't pickle _thread.lock objects".
From what I understand, this comes from the fact that multiprocessing uses pickle to "encapsulate" the parameters and transfer them to the processes (which run in separate memory spaces on Windows), and it does not seem to be possible to pickle the "client" parameter.
I then found several ideas on the internet which try to solve this issue, but none of them seems applicable to my particular case (Ibis, Impala...):
I tried to create the connection directly in the run method of the Process object (which means one connection per process): this leads to "BrokenPipeError: [Errno 32] Broken pipe"
I tried to use multiprocessing.sharedctypes.RawValue, but if this is the right solution, I'm not very confident I implemented it correctly in my code...
Here is pretty much my situation at the moment. I will keep trying to solve this problem, but as something of a newcomer to Python and to multiprocessing of database queries, I thought a more advanced user could probably help me!
Thank you in advance for the time you will devote to this request!
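The usual pattern for handles that cannot be pickled is not to pass them to the workers at all, but to create one connection inside each process after it has started. This is only a sketch of the shape of that approach (the question notes a BrokenPipeError when something similar was tried, so no guarantee it resolves that), with host/port placeholders as in the question and build_query as a hypothetical helper:
import multiprocessing
import ibis

class ParallelDataRetrieving(multiprocessing.Process):
    """Each worker opens its own Impala connection once it is running."""

    def __init__(self, someInformations):
        multiprocessing.Process.__init__(self)
        self.someInformations = someInformations  # plain, picklable data only

    def run(self):
        # created in the child process, so nothing un-picklable crosses the process boundary
        hdfs = ibis.hdfs_connect(host='X.X.X.X', port=Y)
        client = ibis.impala.connect(host='X.X.X.X', port=Y, hdfs_client=hdfs)
        for info in self.someInformations:
            sql = build_query(info)  # hypothetical helper that builds the SQL string
            data = client.sql(sql).execute(limit=None)
            # some work on the data...
        return 0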

Python3 how to synchronize a manager Dict between two Threads

I am working on a TestSuite app based on Python. In this application I have a report writer running in a multiprocessing process which generates an HTML report of all the test cases executed by the main program.
The main program itself is multithreaded, although it uses the normal threading module.
Now this report writer acts like a server, and each test-case thread started by the main application can ask for an interface to this writer (IFW).
Communication from an IFW to the writer just uses a single Queue(). However, it is also possible for each of the IFWs to ask the writer for a status, and here is where it gets tricky, because this data is specific to the IFW ID.
It is not possible to just use Queues because of the IFWs' dynamic behaviour, so I used a manager to create a proxy Queue; however, this caused a lot of problems for me (it created a new manager instance for each interface). So now I'm trying the manager().dict() method, but I cannot figure out how to synchronize between the two threads. Here is my code from the IFW:
def getCurrentTestInfo(self):
    self._UpdateQueue.put_nowait(WriterCtrlCmd(self.WriterIfId, 'get_current_test_info'))
    while self.WriterIfId not in self._RequestDict:
        pass
    res = self._RequestDict[self.WriterIfId]
    del self._RequestDict[self.WriterIfId]
    return res
What happens here is that the IFW sends a request cmd to the writer, and the writer then returns the test information. Along with this request cmd there is a specific IFW ID; this ID is unique.
I know that at this point the entry does not exist in the dict(), so I wait for the entry to show up using a "poll" :) and then I read the data. However, everyone can see the potential problem in this code.
Is there a way to somehow wait for the dict to be updated by means of the manager? Maybe an event or condition? Although I would like something linked to the dict() itself.
NOTE: there is a period of processing time from the put command until the dict has been updated, so I cannot use Queue.put() instead of Queue.put_nowait().
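One way to avoid the busy poll is to pair the managed dict with a managed Condition and have the writer notify after it stores the result. A minimal sketch under that assumption (the _RequestCond attribute is hypothetical, created once via manager.Condition(); the other names come from the question):
# IFW side: wait on the condition instead of spinning
def getCurrentTestInfo(self):
    self._UpdateQueue.put_nowait(WriterCtrlCmd(self.WriterIfId, 'get_current_test_info'))
    with self._RequestCond:
        while self.WriterIfId not in self._RequestDict:
            self._RequestCond.wait()
        res = self._RequestDict.pop(self.WriterIfId)
    return res

# writer side: store the answer, then wake up any waiting IFW
def publishTestInfo(self, ifw_id, info):
    with self._RequestCond:
        self._RequestDict[ifw_id] = info
        self._RequestCond.notify_all()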

Synchronizing data, want to track what has changed

I have a program that queries data from a database (MySQL) every minute:
while 1:
    self.alerteng.updateAndAnalyze()
    time.sleep(60)
But the data doesn't change frequently; maybe once an hour or once a day (it is changed by another C++ program).
I think the best way is to track the changes: if a change happens, then I query and update my data.
Any advice?
It depends what you're doing, but SQLAlchemy's Events functionality might help you out.
It lets you run code whenever something happens in your database, e.g. after you insert a new row or set a column value. I've used it in Flask apps to kick off notifications or other async processes.
http://docs.sqlalchemy.org/en/rel_0_7/orm/events.html#mapper-events
Here's toy code from a Flask app that'll run the kick_off_analysis() function whenever a new YourModel model is created in the database.
from sqlalchemy import event

@event.listens_for(YourModel, "after_insert")
def kick_off_analysis(mapper, connection, your_model):
    # do stuff here
    pass
Hope that helps you get started.
I don't know how expensive updateAndAnalyze() is, but I'm pretty sure it's like most SQL commands: not something you really want to poll.
You have a textbook case for the Observer Pattern. You want MySQL to call something in your code somehow whenever it gets updated. I'm not positive of the exact mechanism to do this, but there should be a way to set a trigger on your relevant tables so that it can notify your code that the underlying table has been updated. Then, instead of polling, you basically get "interrupted" with the knowledge that you need to do something. It will also eliminate the up-to-a-minute lag you're introducing, which will make whatever you're doing feel snappier.
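If a full trigger/notification setup is more than is needed, a lightweight middle ground is to poll a cheap change marker and only run the expensive update when it moves. A minimal sketch, assuming the table has (or the C++ program maintains) an updated_at column; the table and column names are hypothetical:
import time
import MySQLdb  # any DB-API driver works the same way

def watch_for_changes(conn, alerteng, interval=60):
    last_seen = None
    while True:
        cur = conn.cursor()
        cur.execute("SELECT MAX(updated_at) FROM my_table")  # hypothetical marker query
        (marker,) = cur.fetchone()
        conn.commit()  # end the transaction so the next poll sees fresh data
        if marker != last_seen:
            last_seen = marker
            alerteng.updateAndAnalyze()  # expensive work only on a real change
        time.sleep(interval)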

Why doesn't this loop display an updated object count every five seconds?

I use this Python code to output the number of Things every 5 seconds:
def my_count():
    while True:
        print "Number of Things: %d" % Thing.objects.count()
        time.sleep(5)

my_count()
If another process generates a new Thing while my_count() is running, my_count() will keep printing the same number, even though it now has changed in the database. (But if I kill my_count() and restart it, it will display the new Thing count.)
Things are stored in a MySQL InnoDB database, and this code runs on Ubuntu.
Why won't my_count() display the new Thing.objects.count() without being restarted?
Because the Python DB API is by default in AUTOCOMMIT=OFF mode, and (at least for MySQLdb) at the REPEATABLE READ isolation level. This means that behind the scenes you have an ongoing database transaction (InnoDB is a transactional engine) in which the first access to a given row (or maybe even table, I'm not sure) fixes the "view" of this resource for the remaining part of the transaction.
To prevent this behaviour, you have to 'refresh' the current transaction:
from django.db import transaction

@transaction.autocommit
def my_count():
    while True:
        transaction.commit()
        print "Number of Things: %d" % Thing.objects.count()
        time.sleep(5)
-- note that the transaction.autocommit decorator is only for entering transaction management mode (this could also be done manually using the transaction.enter_transaction_management/leave_transaction_management functions).
One more thing to be aware of: Django's autocommit is not the same autocommit you have in the database; it's completely independent. But that is out of scope for this question.
Edited on 22/01/2012
Here is a "twin answer" to a similar question.
