I'm trying to create what I call a scrobbler. The task is to read a Delicious user from a queue, fetch all of their bookmarks and put them in the bookmarks queue. Then something should go through that queue, do some parsing and then store the data in a database.
This obviously calls for threading, because most of the time is spent waiting for Delicious to respond, and then for the bookmarked websites to respond and be passed through some APIs; it would be silly for everything to block on that.
However, I am having trouble with threading and keep getting strange errors, like database tables not being defined. Any help is appreciated :)
Here's the relevant code:
# relevant model #
class Bookmark(models.Model):
    account = models.ForeignKey(Delicious)
    url = models.CharField(max_length=4096)
    tags = models.TextField()
    hash = models.CharField(max_length=32)
    meta = models.CharField(max_length=32)
# bookmark queue reading #
def scrobble_bookmark(account):
    try:
        bookmark = Bookmark.objects.all()[0]
    except Bookmark.DoesNotExist:
        return False
    bookmark.delete()

    tags = bookmark.tags.split(' ')
    user = bookmark.account.user

    for concept in Concepts.extract(bookmark.url):
        for tag in tags:
            Concepts.relate(user, concept['name'], tag)

    return True

def scrobble_bookmarks(account):
    semaphore = Semaphore(10)
    for i in xrange(Bookmark.objects.count()):
        thread = Bookmark_scrobble(account, semaphore)
        thread.start()

class Bookmark_scrobble(Thread):
    def __init__(self, account, semaphore):
        Thread.__init__(self)
        self.account = account
        self.semaphore = semaphore

    def run(self):
        self.semaphore.acquire()
        try:
            scrobble_bookmark(self.account)
        finally:
            self.semaphore.release()
This is the error I get:
Exception in thread Thread-65:
Traceback (most recent call last):
File "/usr/lib/python2.6/threading.py", line 525, in __bootstrap_inner
self.run()
File "/home/swizec/Documents/trees/bookmarklet_server/../bookmarklet_server/Scrobbler/Scrobbler.py", line 60, in run
scrobble_bookmark(self.account)
File "/home/swizec/Documents/trees/bookmarklet_server/../bookmarklet_server/Scrobbler/Scrobbler.py", line 28, in scrobble_bookmark
bookmark = Bookmark.objects.all()[0]
File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 152, in __getitem__
return list(qs)[0]
File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 76, in __len__
self._result_cache.extend(list(self._iter))
File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 231, in iterator
for row in self.query.results_iter():
File "/usr/local/lib/python2.6/dist-packages/django/db/models/sql/query.py", line 281, in results_iter
for rows in self.execute_sql(MULTI):
File "/usr/local/lib/python2.6/dist-packages/django/db/models/sql/query.py", line 2373, in execute_sql
cursor.execute(sql, params)
File "/usr/local/lib/python2.6/dist-packages/django/db/backends/sqlite3/base.py", line 193, in execute
return Database.Cursor.execute(self, query, params)
OperationalError: no such table: Scrobbler_bookmark
PS: all other tests depending on the same table pass with flying colours.
You cannot use threading with in-memory databases (SQLite in this case) in Django; see this bug. It might work with PostgreSQL or MySQL.
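If you do switch databases, the change is confined to settings.py. A minimal sketch, assuming a Django version that uses the DATABASES dict (older releases use the DATABASE_ENGINE-style settings); the database name and credentials are placeholders:

# settings.py -- placeholder credentials, adjust for your setup
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'scrobbler',
        'USER': 'scrobbler',
        'PASSWORD': 'secret',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}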
I'd recommend something like celeryd instead of threads; message queues are MUCH easier to work with than threading.
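For illustration, a hedged sketch of what the queue-based version could look like; it uses the modern Celery app API rather than the celeryd-era decorators, the broker URL is a placeholder, and it assumes the question's scrobble_bookmark() is importable:

# tasks.py -- a sketch, not the poster's actual setup
from celery import Celery

app = Celery('scrobbler', broker='amqp://guest@localhost//')  # assumed broker URL

@app.task
def scrobble_bookmark_task(account_id):
    # Assumed to wrap the scrobble_bookmark() shown in the question.
    from Scrobbler.Scrobbler import scrobble_bookmark
    return scrobble_bookmark(account_id)

The producer then enqueues work with scrobble_bookmark_task.delay(account.id) instead of starting threads.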
This calls for a task queue, though not necessarily threads. You'll have your server process, one or several scrobbler processes, and a queue that lets them communicate. The queue can be in the database, or something separate like a beanstalkd. All this has nothing to do with your error, which sounds like your database is just misconfigured.
1) Does the error persist if you use a real database, not SQLite?
2) If you're using threads, you might need to create separate SQL cursors for use in threads.
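To make point 2 concrete, a minimal sketch of one connection per thread via thread-local storage; this is raw sqlite3 outside the Django ORM, and the database path is a placeholder:

import sqlite3
import threading

_local = threading.local()  # each thread gets its own connection

def get_cursor(db_path='bookmarks.db'):  # placeholder path
    # sqlite3 connections cannot be shared across threads by default,
    # so lazily open one connection per thread and hand out cursors from it.
    if not hasattr(_local, 'conn'):
        _local.conn = sqlite3.connect(db_path)
    return _local.conn.cursor()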
I think the table really doesn't exist; you have to create it first with an SQL command or in some other way. Since I have a small database for testing different modules, I just delete the database and recreate it using the syncdb command.
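If recreating the schema by hand gets tedious, the same thing can be scripted; a small sketch for a Django 1.x setup where syncdb still exists and settings are already configured:

from django.core.management import call_command

call_command('syncdb', interactive=False)  # creates any missing tables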
Related
I have an issue with TransactionRollbackError: could not serialize access due to concurrent update while updating a table. I'm using Postgres with Odoo v13; the issue occurs while updating the pool with a specific record set and context using the write() method. I'm migrating the code from Odoo v7 to v13, and the same code works in Odoo v7 with no issues. I see no syntax errors, but I still get this. I just want to understand: is this a bug in the version, or is it related to concurrent access to the data?
I have the following line of code, which is part of one function.
self.env.get('pk.status').browse(pk_id).with_context(audit_log=True).write(update_vals)
I have a model named pk.status with a write(self, update_vals) method; based on conditions, it will have to run x1(update_vals) as below.
def x1(self, update_vals):
    product_pool = self.env.get('pk.product')
    if update_vals:
        if isinstance(update_vals, int):
            update_vals = [update_vals]
        for bs_obj in self.browse(update_vals).read(['End_Date']):
            product_ids = product_pool.search([('id_pk_status', '=', bs_obj['id']),
                                               ('is_active', '=', 'Y')])
            if product_ids:
                end_date = bs_obj['End_Date'] or date.today()
                force_update = self._context.get('force_update', False)
                product_ids.with_context(audit_log=True, force_update=force_update).write(
                    {'is_active': 'N', 'end_date': end_date})
The product_ids record set has a write(self, vals) method on the 'pk.product' model.
As part of that write() and its conditions, x2() is executed:
def x2(self, vals, condition=None):
    try:
        status_pool = self.env.get('pk.status')
        product_pool = self.env.get('pk.product')
        result = False
        status_obj = status_pool.browse(vals['id_pk_status']).read()[0]
        product_obj = product_pool.browse(vals['id_pk_product']).read()[0]
        if not product_obj['end_date']:
            product_obj['end_date'] = date.today()
        extra_check = True
        if condition:
            statuses = (status_obj['Is_Active'], product_obj['is_active'])
            extra_check = statuses in condition
        if extra_check:
            result = True
            if isinstance(vals['start_date'], str):
                vals['start_date'] = datetime.strptime(vals['start_date'], '%Y-%m-%d').date()
            if not (result and vals['start_date'] >= status_obj['Start_Date']):
                result = False
    except Exception as e:
        traceback.print_exc()
    return result
The error occurs while executing the line
status_obj = status_pool.browse(vals['id_pk_status']).read()[0]
Complete Error:
2020-08-09 15:39:11,303 4224 ERROR ek_openerp_dev odoo.sql_db: bad query: UPDATE "pk_status" SET "Is_Active"='N',"write_uid"=1,"write_date"=(now() at time zone 'UTC') WHERE id IN (283150)
ERROR: could not serialize access due to concurrent update
Traceback (most recent call last):
File "/current/addons/models/pk_product.py", line 141, in x2()
status_obj = status_pool.browse(vals['id_pk_status']).read()[0]
File "/current/core/addons/nest_migration_utils/helpers/old_cr.py", line 51, in old_cursor
result = method(*args, **kwargs)
File "/current/odoo/odoo/models.py", line 2893, in read
self._read(stored_fields)
File "/current/odoo/odoo/models.py", line 2953, in _read
self.flush(fields, self)
File "/current/odoo/odoo/models.py", line 5419, in flush
process(self.env[model_name], id_vals)
File "/current/odoo/odoo/models.py", line 5374, in process
recs._write(vals)
File "/current/odoo/odoo/models.py", line 3619, in _write
cr.execute(query, params + [sub_ids])
File "/current/odoo/odoo/sql_db.py", line 163, in wrapper
return f(self, *args, **kwargs)
File "/current/odoo/odoo/sql_db.py", line 240, in execute
res = self._obj.execute(query, params)
psycopg2.extensions.TransactionRollbackError: could not serialize access due to concurrent update
I assume the concurrency mentioned in the error means that I'm doing two write operations in a single thread, but I'm not sure about that. Hope this helps.
Each subprocess needs its own global connection to the database. If you are using a Pool, you can define a function that creates a global connection and a cursor and pass it to the initializer parameter. If you are instead using a Process object, I'd recommend you create a single connection and pass the data via queues or pipes.
Like Klaver said, it would be better if you were to provide code so as to get a more accurate answer.
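For illustration anyway, a minimal sketch of the Pool-initializer pattern described above, assuming PostgreSQL via psycopg2; the DSN and the UPDATE statement are placeholders:

import multiprocessing
import psycopg2

_conn = None  # one connection per worker process

def init_worker(dsn):
    # Runs once in every worker: open a process-local connection.
    global _conn
    _conn = psycopg2.connect(dsn)

def do_update(record_id):
    # Each worker uses its own connection and cursor, never the parent's.
    with _conn.cursor() as cur:
        cur.execute("UPDATE pk_status SET write_date = now() WHERE id = %s",
                    (record_id,))  # illustrative statement only
    _conn.commit()
    return record_id

if __name__ == '__main__':
    dsn = "dbname=odoo user=odoo password=secret host=localhost"  # placeholder
    pool = multiprocessing.Pool(processes=4, initializer=init_worker, initargs=(dsn,))
    pool.map(do_update, [283150])  # id taken from the log line in the question
    pool.close()
    pool.join()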
This happens if the transaction isolation level is set to serializable and two processes are trying to update the same column values.
If you choose to go with the serializable isolation level, then you have to roll back and retry the transaction.
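A minimal retry sketch at the psycopg2 level, outside the Odoo ORM; the number of attempts, the backoff, and the do_write() callable are assumptions:

import time
from psycopg2.extensions import TransactionRollbackError

def write_with_retry(conn, do_write, attempts=5):
    # Retry the whole transaction when the serializable check trips
    # ("could not serialize access due to concurrent update").
    for attempt in range(attempts):
        try:
            do_write(conn)   # assumed callable that issues the UPDATEs
            conn.commit()
            return True
        except TransactionRollbackError:
            conn.rollback()
            time.sleep(0.1 * (attempt + 1))  # small backoff before retrying
    return False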
The Situation: Simple Class with Basic Attributes
In an application I'm working on, instances of a particular class are persisted at the end of their lifecycle, and while they are not subsequently modified, their attributes may need to be read. For example, the end_time of the instance, or its ordinal position relative to other instances of the same class (the first instance initialized gets value 1, the next has value 2, etc.).
class Foo(object):
    def __init__(self, position):
        self.start_time = time.time()
        self.end_time = None
        self.position = position
        # ...

    def finishFoo(self):
        self.end_time = time.time()
        self.duration = self.end_time - self.start_time
        # ...
The Goal: Persist an Instance using SQLAlchemy
Following what I believe to be a best practice (using a scoped SQLAlchemy Session, as suggested here, by way of contextlib.contextmanager), I save the instance in a newly created Session, which immediately commits. The very next line references the newly persistent instance by mentioning it in a log record, which throws a DetachedInstanceError because the attribute it references expired when the Session committed.
class Database(object):
    # ...
    @contextmanager  # contextlib.contextmanager, as mentioned above
    def scopedSession(self):
        session = self.sessionmaker()
        try:
            yield session
            session.commit()
        except:
            session.rollback()
            logger.warn("blah blah blah...")
        finally:
            session.close()

    # ...
    def saveMyFoo(self, foo):
        with self.scopedSession() as sql_session:
            sql_session.add(foo)
        logger.info("Foo number {0} finished at {1} has been saved."
                    "".format(foo.position, foo.end_time))
        ## Here the DetachedInstanceError is raised
Two Known Possible Solutions: No Expiring or No Scope
I know I can set the expire_on_commit flag to False to circumvent this issue, but I'm concerned this is a questionable practice -- automatic expiration exists for a reason, and I'm hesitant to arbitrarily lump all ORM-tied classes into a non-expiry state without sufficient reason and understanding behind it. Alternatively, I can forget about scoping the Session and just leave the transaction pending until I explicitly commit at a (much) later time.
So my question boils down to this:
Is a scoped/context-managed Session being used appropriately in the case I described?
Is there an alternative way to reference expired attributes that is a better/more preferred approach? (e.g. using a property to wrap the steps of catching expiration/detached exceptions or to create & update a non-ORM-linked attribute that "mirrors" the ORM-linked expired attribute)
Am I misunderstanding or misusing the SQLAlchemy Session and ORM? It seems contradictory to me to use a contextmanager approach when that precludes the ability to subsequently reference any of the persisted attributes, even for a task as simple and broadly applicable as logging.
The Actual Exception Traceback
The example above is simplified to focus on the question at hand, but should it be useful, here's the actual exact traceback produced. The issue arises when str.format() is run in the logger.debug() call, which tries to execute the Set instance's __repr__() method.
Unhandled Error
Traceback (most recent call last):
File "/opt/zenith/env/local/lib/python2.7/site-packages/twisted/python/log.py", line 73, in callWithContext
return context.call({ILogContext: newCtx}, func, *args, **kw)
File "/opt/zenith/env/local/lib/python2.7/site-packages/twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/opt/zenith/env/local/lib/python2.7/site-packages/twisted/python/context.py", line 81, in callWithContext
return func(*args,**kw)
File "/opt/zenith/env/local/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
why = selectable.doRead()
--- <exception caught here> ---
File "/opt/zenith/env/local/lib/python2.7/site-packages/twisted/internet/udp.py", line 248, in doRead
self.protocol.datagramReceived(data, addr)
File "/opt/zenith/operations/network.py", line 311, in datagramReceived
self.reactFunction(datagram, (host, port))
File "/opt/zenith/operations/schema_sqlite.py", line 309, in writeDatapoint
logger.debug("Data written: {0}".format(dataz))
File "/opt/zenith/operations/model.py", line 1770, in __repr__
repr_info = "Set: {0}, User: {1}, Reps: {2}".format(self.setNumber, self.user, self.repCount)
File "/opt/zenith/env/local/lib/python2.7/site-packages/sqlalchemy/orm/attributes.py", line 239, in __get__
return self.impl.get(instance_state(instance), dict_)
File "/opt/zenith/env/local/lib/python2.7/site-packages/sqlalchemy/orm/attributes.py", line 589, in get
value = callable_(state, passive)
File "/opt/zenith/env/local/lib/python2.7/site-packages/sqlalchemy/orm/state.py", line 424, in __call__
self.manager.deferred_scalar_loader(self, toload)
File "/opt/zenith/env/local/lib/python2.7/site-packages/sqlalchemy/orm/loading.py", line 563, in load_scalar_attributes
(state_str(state)))
sqlalchemy.orm.exc.DetachedInstanceError: Instance <Set at 0x1c96b90> is not bound to a Session; attribute refresh operation cannot proceed
1.
Most likely, yes. It's used correctly insofar as saving data to the database goes. However, because your transaction only spans the update, you may run into race conditions when updating the same row. Depending on the application, this can be okay.
2.
Not expiring attributes is the right way to do it. Expiration is on by default because it ensures that even naive code works correctly. If you are careful, it shouldn't be a problem.
3.
It's important to separate the concept of the transaction from the concept of the session. The contextmanager does two things: it maintains the session as well as the transaction. The lifecycle of each ORM instance is limited to the span of each transaction. This is so you can assume the state of the object is the same as the state of the corresponding row in the database. This is why the framework expires attributes when you commit, because it can no longer guarantee the state of the values after the transaction commits. Hence, you can only access the instance's attributes while a transaction is active.
After you commit, any subsequent attribute you access will result in a new transaction being started so that the ORM can once again guarantee the state of the values in the database.
But why do you get an error? This is because your session is gone, so the ORM has no way of starting a transaction. If you do a session.commit() in the middle of your context manager block, you'll notice a new transaction being started if you access one of the attributes.
Well, what if I want to just access the previously-fetched values? Then, you can ask the framework not to expire those attributes.
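Concretely, that is the expire_on_commit flag on the session factory; a minimal sketch (the engine URL is a placeholder):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///app.db')  # placeholder URL
# Attributes keep their last-loaded values after commit instead of expiring,
# so logging foo.position or foo.end_time after the commit no longer triggers
# a refresh against a session that is already gone.
Session = sessionmaker(bind=engine, expire_on_commit=False)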
I am developing a crawler using the multiprocessing model.
It uses multiprocessing.Queue to store the URL infos that need to be crawled, the page contents that need to be parsed, and more; it uses multiprocessing.Event to control the subprocesses; it uses multiprocessing.Manager.dict to store the hashes of crawled URLs; and each multiprocessing.Manager.dict instance uses a multiprocessing.Lock to control access.
All three kinds of parameters are shared between all subprocesses and the parent process, and they are all organized in one class; I use an instance of that class to transfer the shared parameters from the parent process to the subprocesses. Just like:
MGR = SyncManager()

class Global_Params():
    Queue_URL = multiprocessing.Queue()
    URL_RESULY = MGR.dict()
    URL_RESULY_Mutex = multiprocessing.Lock()
    STOP_EVENT = multiprocessing.Event()

global_params = Global_Params()
In my own timeout mechanism, I use process.terminate to stop a process that can't stop by itself for a long time!
In my test case there are 2500+ target sites (some are unreachable, some are huge).
The crawler crawls site by site through the target-sites file.
At the beginning the crawler worked well, but after a long time (sometimes 8 hours, sometimes 2 hours, sometimes more than 15 hours), once the crawler has crawled more than 100 sites (the exact number varies), I get the error info: "Errno 32 broken pipe"
I have tried the following methods to locate and solve the problem:
locate the site A that the crawler broke on, then crawl that site separately; the crawler worked well. Even when I take a fragment (say 20 sites) of the full target-sites file that contains site A, the crawler works well!
add "-X /tmp/pymp-* 240 /tmp" to /etc/cron.daily/tmpwatch;
when the break occurred, the /tmp/pymp-* files were still there
use multiprocessing.managers.SyncManager instead of multiprocessing.Manager and ignore most signals except SIGKILL and SIGTERM
for each target site, I clear most shared params (Queues, dicts and the event); if an error occurred, I create a new instance:
while global_params.Queue_url.qsize() > 0:
    try:
        global_params.Queue_url.get(block=False)
    except Exception, e:
        print_info(str(e))
        print_info("Clear Queue_url error!")
        time.sleep(1)
global_params.Queue_url = Queue()
The following is the traceback info; the print_info function is one I defined myself to print and store debug info:
[Errno 32] Broken pipe
Traceback (most recent call last):
File "Spider.py", line 613, in <module>
main(args)
File "Spider.py", line 565, in main
spider.start()
File "Spider.py", line 367, in start
print_info("STATIC_RESULT size:%d" % len(global_params.STATIC_RESULT))
File "<string>", line 2, in __len__
File "/usr/local/python2.7.3/lib/python2.7/multiprocessing/managers.py", line 769, in _callmethod
kind, result = conn.recv()
EOFError
I can't understand why. Does anyone know the reason?
I don't know if this fixes your problem, but there is one point to mention:
global_params.Queue_url.get(block=False)
... throws a Queue.Empty exception if the queue is empty. It's not worth recreating the queue for an empty exception.
Recreating the queue can lead to race conditions.
From my point of view, you have two possibilities:
get rid of the "queue recreation" code block
switch to another Queue implementation:
use:
from Queue import Queue
instead of:
from multiprocessing import Queue
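To make the first option concrete, a small sketch of draining the existing queue in place instead of recreating it (Python 2, to match the question):

from Queue import Empty

def drain(q):
    # Empty the queue without replacing it; Queue.Empty just means
    # "nothing left to read", not an error.
    while True:
        try:
            q.get(block=False)
        except Empty:
            break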
Here is the query in Pymongo
import mong  # just my library for initializing

collection_1 = mong.init(collect="col_1")
collection_2 = mong.init(collect="col_2")

for name in collection_2.find({"field1": {"$exists": 0}}):
    try:
        to_query = name['something']
        actual_id = collection_1.find_one({"something": to_query})['_id']
        crap_id = name['_id']
        # use the _id captured above and the correct 'upsert' keyword
        collection_2.update({"_id": crap_id}, {"$set": {"new_name": actual_id}}, upsert=True)
    except:
        open('couldn_find_id.txt', 'a').write(str(name))
All this is doing is taking a field from one collection, finding the id of that field and updating the id of another collection. It works for about 1000-5000 iterations, but periodically fails with this and then I have to restart the script.
Traceback (most recent call last):
File "my_query.py", line 6, in <module>
for name in collection_2.find({"field1":{"$exists":0}}):
File "/home/user/python_mods/pymongo/pymongo/cursor.py", line 814, in next
if len(self.__data) or self._refresh():
File "/home/user/python_mods/pymongo/pymongo/cursor.py", line 776, in _refresh
limit, self.__id))
File "/home/user/python_mods/pymongo/pymongo/cursor.py", line 720, in __send_message
self.__uuid_subtype)
File "/home/user/python_mods/pymongo/pymongo/helpers.py", line 98, in _unpack_response
cursor_id)
pymongo.errors.OperationFailure: cursor id '7578200897189065658' not valid at server
^C
bye
Does anyone have any idea what this failure is, and how I can catch it as an exception so my script can continue even when it occurs?
Thanks
The reason for the problem is described in PyMongo's FAQ:
Cursors in MongoDB can timeout on the server if they’ve been open for
a long time without any operations being performed on them. This can
lead to an OperationFailure exception being raised when attempting to
iterate the cursor.
This is because of the timeout argument of collection.find():
timeout (optional): if True (the default), any returned cursor is
closed by the server after 10 minutes of inactivity. If set to False,
the returned cursor will never time out on the server. Care should be
taken to ensure that cursors with timeout turned off are properly
closed.
Passing timeout=False to the find should fix the problem:
for name in collection_2.find({"field1":{"$exists":0}}, timeout=False):
But, be sure you are closing the cursor properly.
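For example, keep a handle on the cursor and close it in a finally block; a small sketch reusing the query from the question:

cursor = collection_2.find({"field1": {"$exists": 0}}, timeout=False)
try:
    for name in cursor:
        pass  # ... the existing per-document work from the question ...
finally:
    cursor.close()  # release the no-timeout cursor on the server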
Also see:
mongodb cursor id not valid error
I wrote my own implementation of Pyramid's ISession interface, which stores the session in a database. Everything works really nicely, but somehow pyramid_tm chokes on this. As soon as it is activated, it says this:
DetachedInstanceError: Instance <Session at 0x38036d0> is not bound to a Session;
attribute refresh operation cannot proceed
(Don't get confused here: the <Session ...> is the class name of my model; the "... to a Session" most likely refers to SQLAlchemy's Session, which I call DBSession to avoid confusion.)
I have looked through mailing lists and SO, and it seems that anytime someone has this problem, they are either
spawning a new thread or
manually calling transaction.commit()
I do neither of those things. However, the special thing here is that my session gets passed around by Pyramid a lot. First I do DBSession.add(session) and then return session. Afterwards I can work with the session, flash new messages, etc.
However, it seems once the request finishes, I get this exception. Here is the full traceback:
Traceback (most recent call last):
File "/home/javex/data/Arbeit/libraries/python/web_projects/pyramid/lib/python2.7/site-packages/waitress-0.8.1-py2.7.egg/waitress/channel.py", line 329, in service
task.service()
File "/home/javex/data/Arbeit/libraries/python/web_projects/pyramid/lib/python2.7/site-packages/waitress-0.8.1-py2.7.egg/waitress/task.py", line 173, in service
self.execute()
File "/home/javex/data/Arbeit/libraries/python/web_projects/pyramid/lib/python2.7/site-packages/waitress-0.8.1-py2.7.egg/waitress/task.py", line 380, in execute
app_iter = self.channel.server.application(env, start_response)
File "/home/javex/data/Arbeit/libraries/python/web_projects/pyramid/lib/python2.7/site-packages/pyramid/router.py", line 251, in __call__
response = self.invoke_subrequest(request, use_tweens=True)
File "/home/javex/data/Arbeit/libraries/python/web_projects/pyramid/lib/python2.7/site-packages/pyramid/router.py", line 231, in invoke_subrequest
request._process_response_callbacks(response)
File "/home/javex/data/Arbeit/libraries/python/web_projects/pyramid/lib/python2.7/site-packages/pyramid/request.py", line 243, in _process_response_callbacks
callback(self, response)
File "/home/javex/data/Arbeit/libraries/python/web_projects/pyramid/miniblog/miniblog/models.py", line 218, in _set_cookie
print("Setting cookie %s with value %s for session with id %s" % (self._cookie_name, self._cookie, self.id))
File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/attributes.py", line 168, in __get__
return self.impl.get(instance_state(instance),dict_)
File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/attributes.py", line 451, in get
value = callable_(passive)
File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/state.py", line 285, in __call__
self.manager.deferred_scalar_loader(self, toload)
File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/mapper.py", line 1668, in _load_scalar_attributes
(state_str(state)))
DetachedInstanceError: Instance <Session at 0x7f4a1c04e710> is not bound to a Session; attribute refresh operation cannot proceed
For this case I deactivated the debug toolbar; when it is active, the error gets thrown from there instead. It seems the problem here is accessing the object at any point.
I realize I could try to detach it somehow, but this doesn't seem like the right way, as the element could then not be modified without explicitly adding it to a session again.
So since I don't spawn new threads and I don't explicitly call commit, I guess the transaction is committed before the request is fully finished, and afterwards the session object is accessed again. How do I handle this problem?
I believe what you're seeing here is a quirk of the fact that response callbacks and finished callbacks are actually executed after tweens. They are positioned just between your app's egress and the middleware. pyramid_tm, being a tween, commits the transaction before your response callback executes, causing the error upon later access.
Getting the order of these things correct is difficult. A possibility off the top of my head is to register your own tween under pyramid_tm that performs a flush on the session, grabs the id, and sets the cookie on the response.
I sympathize with this issue, as anything that happens after the transaction has been committed is a real gray area in Pyramid where it's not always clear that the session should not be touched. I'll make a note to continue thinking about how to improve this workflow for Pyramid in the future.
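A rough sketch of that tween idea, assuming the standard pyramid_tm factory name, an importable DBSession, and a placeholder cookie name; it is not a drop-in solution:

from myapp.models import DBSession  # the app's scoped session (name assumed)

def session_cookie_tween_factory(handler, registry):
    def session_cookie_tween(request):
        response = handler(request)
        # Still inside the transaction managed by pyramid_tm, so the
        # instance is attached: flush to get the id, then set the cookie.
        DBSession.flush()
        session_id = request.session.id
        response.set_cookie('session_id', str(session_id))  # cookie name assumed
        return response
    return session_cookie_tween

# In the Pyramid configuration:
# config.add_tween('myapp.tweens.session_cookie_tween_factory',
#                  under='pyramid_tm.tm_tween_factory')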
I first tried registering a tween, and it worked somehow, but the data did not get saved. I then stumbled upon the SQLAlchemy event system. I found the after_commit event. Using this, I could set up detaching the session object after the commit done by pyramid_tm. I think this provides full flexibility and doesn't impose any requirements on the ordering.
My final solution:
from sqlalchemy.event import listen
from sqlalchemy.orm import Session as SASession

def detach(db_session):
    from pyramid.threadlocal import get_current_request
    request = get_current_request()
    log.debug("Expunging (detaching) session for DBSession")
    db_session.expunge(request.session)

listen(SASession, 'after_commit', detach)
Only drawback: it requires calling get_current_request(), which is discouraged. However, I saw no other way of passing the session in, as the event gets called by SQLAlchemy. I thought of some ugly wrapping, but I think that would have been way too risky and unstable.