I have a twisted daemon that does some xml feed parsing.
I store my data in PostgreSQL via twisted.enterprise.adbapi, which IIRC wraps psycopg2.
I've run into a few problems with storing data in the database -- duplicate data periodically gets in there.
To be honest, there are some underlying issues with my implementation which should be redone and designed much better. I lack the time and resources to do that though - so we're in 'just keep it running' mode for now.
I think the problem may stem from either my usage of deferToThread or how I've spawned the server at the start.
As a quick overview of the functionality I think is at fault:
Twisted queries Postgres for accounts that should be analyzed, and sets a block on them:
SELECT id
FROM xml_sources
WHERE timestamp_last_process < (CURRENT_TIMESTAMP AT TIME ZONE 'UTC' - INTERVAL '4 HOUR')
  AND is_processing_block IS NULL;
lock_ids = [i['id'] for i in results]
UPDATE xml_sources SET is_processing_block = True WHERE id IN %(lock_ids)s
What I think is happening is that (accidentally) having multiple servers running, or various other issues, results in multiple threads processing this data.
I think this would likely be fixed - or at least ruled out as an issue - if I wrap this quick section in an exclusive table lock.
I've never done table locking through Twisted before though. Can anyone point me in the right direction?
You can do a SELECT FOR UPDATE instead of a plain SELECT, and that will lock the rows returned by your query. If you actually want table locking you can just issue a LOCK statement, but based on the rest of your question I think you want row locking.
If you are using adbapi, then keep in mind that you will need to use runInteraction if you want to run more than one statement in a transaction. Functions passed to runInteraction will run in a thread, so you may need to use callFromThread or blockingCallFromThread to reach from the database interaction back into the reactor.
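For illustration, a rough sketch of that combination applied to the asker's query, assuming psycopg2 and placeholder connection settings; ANY(%s) stands in for the question's IN %(lock_ids)s so psycopg2 can adapt the Python list directly:

from twisted.enterprise import adbapi

# hypothetical credentials; the kwargs are passed straight to psycopg2.connect
dbpool = adbapi.ConnectionPool("psycopg2", dbname="mydb", user="me")

def _lock_sources(txn):
    # txn is a DB-API cursor; everything here runs in one transaction on a pool thread
    txn.execute(
        "SELECT id FROM xml_sources "
        "WHERE timestamp_last_process < (CURRENT_TIMESTAMP AT TIME ZONE 'UTC' - INTERVAL '4 HOUR') "
        "AND is_processing_block IS NULL "
        "FOR UPDATE"
    )
    lock_ids = [row[0] for row in txn.fetchall()]
    if lock_ids:
        txn.execute(
            "UPDATE xml_sources SET is_processing_block = TRUE WHERE id = ANY(%s)",
            (lock_ids,),
        )
    return lock_ids  # the transaction commits when this function returns

d = dbpool.runInteraction(_lock_sources)
d.addCallback(lambda ids: print("locked", ids))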
However, locking may not be your problem. For one thing, if you are mixing deferToThread and adbapi, something's likely wrong. adbapi is already doing the equivalent of deferToThread for you. You should be able to do everything on the main thread.
You'll have to include a representative example for a more specific answer though. Consider your question: it's basically "Sometimes I get duplicate data, with a self-admittedly problematic implementation, that is big and I can't fix and I also can't show you." This is not a question which it is possible to answer.
Related
Hello, I don't think this is the right place for this question, but I don't know where else to ask it. I want to make a website and an API for that website using the same SQLAlchemy database. Would just running them at the same time independently be safe, or would this cause corruption from two writes happening at the same time?
SQLA is a Python wrapper for SQL. It is not its own database. If you're running your website (perhaps Flask?) and managing your API from the same script, you can simply use the same reference to your instance of SQLA. Meaning, when you use SQLA to connect to a database and save it to a variable, what is really happening is that it saves the connection to a variable, and you continually reference that variable, as opposed to the more inefficient method of creating a new connection every time. So when you say
using the same SQLAlchemy database
I believe you are actually referring to the actual underlying database itself, not the SQLA wrapper/connection to it.
If your website and API are not running in the same script (or even if they are, depending on how your API handles simultaneous requests), you may encounter a race condition, which, according to Wikipedia, is defined as:
the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when one or more of the possible behaviors is undesirable.
This may be what you are referring to when you mentioned
would this cause corruption from two writes happening at the same time.
To avoid such situations, when a process accesses a file, (depending on the OS,) a check is performed to see if there is a "lock" on that file, and if so, the OS refuses to open that file. A lock is created when a process accesses a file (and there is no other process holding a lock on that file), such as by using with open(filename):, and is released when the process no longer holds an open reference to the file (such as when Python execution leaves the with open(filename): indentation block). This may be the real issue you might encounter when using two simultaneous connections to a SQLite db.
However, if you are using something like MySQL, where you connect to a SQL server process, and NOT a file, since there is no direct access to a file, there will be no lock on the database, and you may run into that nasty race condition in the following made-up scenario:
Stack Overflow queries the reputation of an account to see if it should be banned due to negative reputation.
AT THE EXACT SAME TIME, someone upvotes an answer made by that account, which sets it one point under the account ban threshold.
The outcome is now determined by the speed of execution of these 2 tasks.
If the upvoter has, say, a slow computer, and the "upvote" does not get processed by StackOverflow before the reputation query completes, the account will be banned. However, if there is some lag on Stack Overflow's end, and the upvote processes before the account query finishes, the account will not get banned.
The key concept behind this example is that all of these steps can occur within fractions of a second, and the outcome depends on the speed of execution on both ends.
To address the issue of data corruption, most databases have a system in place that properly orders database reads and writes; however, semantic issues may still arise, such as in the example given above.
Two applications can use the same database, as the DB is a separate application that will be accessed by each Flask app.
What you are asking can be done and is the methodology used by many large web applications, especially when the API is written in a different framework than the main application.
Since SQL databases are ACID compliant, they have a system in place to queue the multiple read/write requests put to them and perform them in the correct order while ensuring data reliability.
One question to ask, though, is whether it is useful to write two separate applications. For most Flask-only projects the best approach would be to separate the project using blueprints, having a "main" blueprint and an "api" blueprint.
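For illustration, a minimal sketch of that layout assuming a plain Flask app (the blueprint names and routes are made up, not from the question); both blueprints live in one process and can share one SQLAlchemy instance:

from flask import Blueprint, Flask, jsonify

main = Blueprint("main", __name__)
api = Blueprint("api", __name__, url_prefix="/api")

@main.route("/")
def index():
    return "website home"

@api.route("/items")
def items():
    return jsonify([])  # would query the shared database here

app = Flask(__name__)
app.register_blueprint(main)
app.register_blueprint(api)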
I've been struggling with "sqlite3.OperationalError database is locked" all day....
Searching around for answers to what seems to be a well-known problem, I've found that it is most often explained by the fact that SQLite does not play very nicely with multithreading: a thread can time out after waiting more than 5 seconds (the default timeout) to write to the db because another thread holds the db lock.
So, having multiple threads that work with the db, one of them using transactions and writing frequently, I began measuring how long transactions take to complete. I found that no transaction takes more than 300 ms, which makes the above explanation implausible, unless the thread that uses transactions makes ~21 (5000 ms / 300 ms) consecutive transactions while any other thread wanting to write is ignored the whole time.
So what other hypothesis could potentially explain this behavior?
I have had a lot of these problems with SQLite before. Basically, don't have multiple threads that could potentially write to the db. If this is not acceptable, you should switch to Postgres or something else that is better at concurrency.
Sqlite has a very simple implementation that relies on the file system for locking. Most file systems are not built for low-latency operations like this. This is especially true for network-mounted filesystems and the virtual filesystems used by some VPS solutions (that last one got me BTW).
Additionally, you also have the Django layer on top of all this, adding complexity. You don't know when Django releases connections (although I am pretty sure someone here can give that answer in detail :) ). But again, if you have multiple concurrent writers, you need a database layer that can handle concurrency. Period.
I solved this issue by switching to Postgres. Django makes this very simple for you; even migrating the data is a no-brainer with very little downtime.
In case anyone else might find this question via Google, here's my take on this.
SQLite is a database engine that implements the "serializable" isolation level (see here). By default, it implements this isolation level with a locking strategy (although it seems to be possible to change this to a more MVCC-like strategy by enabling the WAL mode described in that link).
But even with its fairly coarse-grained locking, the fact that SQLite has separate read and write locks, and uses deferred transactions (meaning it doesn't take the locks until necessary), means that deadlocks might still occur. It seems SQLite can detect such deadlocks and fail the transaction almost immediately.
Since SQLite does not support "select for update", the best way to grab the write lock early, and therefore avoid deadlocks, would be to start transactions with "BEGIN IMMEDIATE" or "BEGIN EXCLUSIVE" instead of just "BEGIN", but Django currently only uses "BEGIN" (when told to use transactions) and does not currently have a mechanism for telling it to use anything else. Therefore, locking failures become almost unavoidable with the combination of Django, SQLite, transactions, and concurrency (unless you issue the "BEGIN IMMEDIATE" manually, but that's pretty ugly and SQLite-specific).
But anyone familiar with databases knows that when you're using the "serializable" isolation level with many common database systems, then transactions can typically fail with a serialization error anyway. That happens in exactly the kind of situation this deadlock represents, and when a serialization error occurs, then the failing transaction must simply be retried. And, in fact, that works fine for me.
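To illustrate the retry idea, here is a hedged sketch with plain sqlite3 (not Django code); the isolation_level=None setting and the backoff numbers are choices made for this example:

import sqlite3, time

def run_in_transaction(db_path, work, retries=5):
    # isolation_level=None puts the connection in autocommit mode, so we can
    # issue BEGIN IMMEDIATE / COMMIT ourselves instead of letting the module do it.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        for attempt in range(retries):
            try:
                conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
                work(conn)
                conn.execute("COMMIT")
                return
            except sqlite3.OperationalError as exc:
                try:
                    conn.execute("ROLLBACK")
                except sqlite3.OperationalError:
                    pass  # BEGIN itself may have failed, so there is nothing to roll back
                if "locked" not in str(exc) or attempt == retries - 1:
                    raise
                time.sleep(0.1 * (attempt + 1))  # back off, then retry the whole transaction
    finally:
        conn.close()

# usage (hypothetical table): run_in_transaction("mydatabase.db",
#     lambda c: c.execute("UPDATE counters SET n = n + 1"))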
(Of course, in the end, you should probably use a less "lite" kind of database engine anyway if you need a lot of concurrency.)
I've got a lot of scripts running: scrapers, checkers, cleaners, etc. They have some things in common:
they are forever running;
they have no time constrain to finish their job;
they all access the same MySQL DB, writing and reading.
As they accumulate, they're starting to slow down the website, which runs on the same system but depends on these scripts.
I can use queues with Kombu to funnel all writes.
But do you know a way to do the same with reads?
E.g.: if one script needs to read from the DB, its request is sent to a blocking queue, and it resumes when it gets the answer? This way everybody makes requests to one process, and that process is the only one talking to the DB, making one request at a time.
I have no idea how to do this.
Of course, in the end I may have to add more servers to the mix, but before that, is there something I can do at the software level ?
You could use a connection pooler and make the connections from the scripts go through it. It would limit the number of real connections hitting your DB while being transparent to your scripts (their connections would be held in a "wait" state until a real connection is freed).
I don't know what DB you use, but for Postgres I'm using PgBouncer for similar reasons, see http://pgfoundry.org/projects/pgbouncer/
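For illustration, a hedged sketch of what "transparent to your scripts" means with psycopg2 and PgBouncer: the only change is pointing the connection at the pooler's port (6432 is PgBouncer's default; the credentials here are placeholders):

import psycopg2

# same psycopg2 code as before, just aimed at the pooler instead of Postgres directly
conn = psycopg2.connect(host="127.0.0.1", port=6432,
                        dbname="mydb", user="scripts", password="secret")
cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchone())
conn.close()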
You say that your dataset is <1GB, so the problem is CPU-bound.
Now start analyzing what is eating CPU cycles:
Which queries are really slow and executed often. MySQL can log those queries.
What about the slow queries? Can they be accelerated by using an index?
Are there unused indices? Drop them!
Nothing helps? Can you solve it by denormalizing/precomputing stuff?
You could create a function that each process must call in order to talk to the DB. You could rewrite the scripts so that they must call that function rather than talk directly to the DB. Within that function, you could have a scope-based lock so that only one process would be talking to the DB at a time.
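A minimal single-program sketch of that idea follows; note that a threading.Lock only coordinates threads inside one process, so truly separate scripts would need a cross-process lock such as a lock file, and the MySQLdb driver and credentials below are placeholders, not from the question:

import threading
import MySQLdb  # hypothetical driver choice

_db_lock = threading.Lock()

def run_query(sql, params=()):
    # The "with" block is the scope-based lock: only one caller talks to the DB at a time.
    with _db_lock:
        conn = MySQLdb.connect(host="localhost", user="scripts", passwd="secret", db="mydb")
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            conn.commit()
            return cur.fetchall()
        finally:
            conn.close()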
I have a Python program that uses the "threading" module. Once every second, my program starts a new thread that fetches some data from the web, and stores this data to my hard drive. I would like to use sqlite3 to store these results, but I can't get it to work. The issue seems to be about the following line:
conn = sqlite3.connect("mydatabase.db")
If I put this line of code inside each thread, I get an OperationalError telling me that the database file is locked. I guess this means that another thread has mydatabase.db open through a sqlite3 connection and has locked it.
If I put this line of code in the main program and pass the connection object (conn) to each thread, I get a ProgrammingError, saying that SQLite objects created in a thread can only be used in that same thread.
Previously I was storing all my results in CSV files, and did not have any of these file-locking issues. Hopefully this will be possible with sqlite. Any ideas?
Contrary to popular belief, newer versions of sqlite3 do support access from multiple threads.
This can be enabled via the optional keyword argument check_same_thread:
sqlite3.connect(":memory:", check_same_thread=False)
You can use the producer-consumer pattern. For example, you can create a queue that is shared between threads. The first thread, which fetches data from the web, enqueues this data in the shared queue. Another thread that owns the database connection dequeues data from the queue and passes it to the database.
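For illustration, a minimal sketch of that pattern with made-up names (not the asker's code): fetcher threads put rows on a shared queue, and a single writer thread owns the sqlite3 connection.

import queue, sqlite3, threading

results = queue.Queue()

def fetcher(n):
    # stand-in for the "fetch some data from the web" step
    results.put(("item-%d" % n,))

def writer():
    conn = sqlite3.connect("mydatabase.db")  # only this thread touches the connection
    conn.execute("CREATE TABLE IF NOT EXISTS data (value TEXT)")
    while True:
        row = results.get()
        if row is None:      # sentinel: shut down
            break
        conn.execute("INSERT INTO data VALUES (?)", row)
        conn.commit()
    conn.close()

w = threading.Thread(target=writer)
w.start()
fetchers = [threading.Thread(target=fetcher, args=(n,)) for n in range(10)]
for t in fetchers:
    t.start()
for t in fetchers:
    t.join()
results.put(None)   # tell the writer to stop once all fetchers are done
w.join()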
The following was found on mail.python.org.pipermail.1239789:
I have found the solution. I don't know why the Python documentation has not a single word about this option. So we have to add a new keyword argument to the connection function,
and we will be able to create cursors from it in different threads. So use:
sqlite3.connect(":memory:", check_same_thread=False)
This works out perfectly for me. Of course, from now on I need to take care
of safe multithreaded access to the db. Anyway, thanks all for trying to help.
Switch to multiprocessing. It is much better, scales well, can go beyond the use of multiple cores by using multiple CPUs, and the interface is the same as using python threading module.
Or, as Ali suggested, just use SQLAlchemy's thread pooling mechanism. It will handle everything for you automatically and has many extra features, just to quote some of them:
SQLAlchemy includes dialects for SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS Access, Sybase and Informix; IBM has also released a DB2 driver. So you don't have to rewrite your application if you decide to move away from SQLite.
The Unit Of Work system, a central part of SQLAlchemy's Object Relational Mapper (ORM), organizes pending create/insert/update/delete operations into queues and flushes them all in one batch. To accomplish this it performs a topological "dependency sort" of all modified items in the queue so as to honor foreign key constraints, and groups redundant statements together where they can sometimes be batched even further. This produces the maximum efficiency and transaction safety, and minimizes chances of deadlocks.
You shouldn't be using threads at all for this. This is a trivial task for twisted and that would likely take you significantly further anyway.
Use only one thread, and have the completion of the request trigger an event to do the write.
Twisted will take care of the scheduling, callbacks, etc. for you. It'll hand you the entire result as a string, or you can run it through a stream processor (I have a Twitter API and a FriendFeed API that both fire off events to callers as results are still being downloaded).
Depending on what you're doing with your data, you could just dump the full result into sqlite as it's complete, cook it and dump it, or cook it while it's being read and dump it at the end.
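A hedged sketch of that approach using Twisted's Agent and a made-up feed URL: everything runs in the reactor thread, so a single sqlite3 connection is enough and no locking is needed.

import sqlite3
from twisted.internet import reactor, task
from twisted.web.client import Agent, readBody

conn = sqlite3.connect("mydatabase.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (body BLOB)")
agent = Agent(reactor)

def store(body):
    # called back in the reactor thread once the whole response is downloaded
    conn.execute("INSERT INTO pages VALUES (?)", (body,))
    conn.commit()

def fetch():
    d = agent.request(b"GET", b"http://example.com/feed.xml")  # hypothetical URL
    d.addCallback(readBody)   # collect the full body as bytes
    d.addCallback(store)
    d.addErrback(lambda f: print(f.getErrorMessage()))
    return d

task.LoopingCall(fetch).start(1.0)  # fire once per second
reactor.run()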
I have a very simple application that does something close to what you're wanting on github. I call it pfetch (parallel fetch). It grabs various pages on a schedule, streams the results to a file, and optionally runs a script upon successful completion of each one. It also does some fancy stuff like conditional GETs, but still could be a good base for whatever you're doing.
Or if you are lazy, like me, you can use SQLAlchemy. It will handle the threading for you (using thread locals and some connection pooling), and the way it does it is even configurable.
As an added bonus, if/when you realise/decide that using SQLite for any concurrent application is going to be a disaster, you won't have to change your code to use MySQL, or Postgres, or anything else. You can just switch over.
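As a rough sketch of what that looks like (the engine URLs and query are placeholders, not from the question): scoped_session gives each thread its own session, and switching databases is just a different URL passed to create_engine.

from sqlalchemy import create_engine, text
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("sqlite:///mydatabase.db")   # later: "postgresql://user:pw@host/db"
Session = scoped_session(sessionmaker(bind=engine))

def worker():
    session = Session()               # thread-local session from the registry
    session.execute(text("SELECT 1"))
    session.commit()
    Session.remove()                  # give the connection back to the pool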
You need to call session.close() after every transaction to the database, so that the same cursor is only ever used in the same thread; using the same cursor across multiple threads is what causes this error.
Use threading.Lock()
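For example, a minimal sketch with a made-up table: serialize all SQLite access from the worker threads through one lock so only one thread touches the database at a time.

import sqlite3, threading

write_lock = threading.Lock()

def save(row):
    with write_lock:                       # one writer at a time
        conn = sqlite3.connect("mydatabase.db")
        conn.execute("CREATE TABLE IF NOT EXISTS results (value TEXT)")
        conn.execute("INSERT INTO results VALUES (?)", (row,))
        conn.commit()
        conn.close()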
I could not find any benchmarks in any of the above answers, so I wrote a test to benchmark everything.
I tried three approaches:
Reading and writing sequentially from the SQLite database
Using a ThreadPoolExecutor to read/write
Using a ProcessPoolExecutor to read/write
The results and takeaways from the benchmark are as follows:
Sequential reads/sequential writes work the best
If you must process in parallel, use the ProcessPoolExecutor to read in parallel
Do not perform any writes using either the ThreadPoolExecutor or the ProcessPoolExecutor, as you will run into database-locked errors and will have to retry inserting the chunk again
You can find the code and the complete solution for the benchmarks in my SO answer HERE. Hope that helps!
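As an illustration of those takeaways (a fresh sketch, not the answerer's benchmark code; table and column names are made up): write sequentially with a single connection, then read in parallel with a ProcessPoolExecutor, each worker opening its own connection since SQLite connections cannot be shared across processes.

import sqlite3
from concurrent.futures import ProcessPoolExecutor

DB = "bench.db"

def setup():
    conn = sqlite3.connect(DB)
    conn.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, value TEXT)")
    # sequential writes: a single writer avoids "database is locked" errors
    conn.executemany("INSERT INTO items (value) VALUES (?)",
                     [("row-%d" % i,) for i in range(10000)])
    conn.commit()
    conn.close()

def read_chunk(bounds):
    lo, hi = bounds
    conn = sqlite3.connect(DB)  # one connection per worker process
    rows = conn.execute("SELECT value FROM items WHERE id BETWEEN ? AND ?", (lo, hi)).fetchall()
    conn.close()
    return len(rows)

if __name__ == "__main__":
    setup()
    chunks = [(i, i + 999) for i in range(1, 10001, 1000)]
    with ProcessPoolExecutor() as pool:
        print(sum(pool.map(read_chunk, chunks)))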
Scrapy seems like a potential answer to my question. Its home page describes my exact task. (Though I'm not sure how stable the code is yet.)
I would take a look at the y_serial Python module for data persistence: http://yserial.sourceforge.net
which handles deadlock issues surrounding a single SQLite database. If demand on concurrency gets heavy, one can easily set up the class Farm of many databases to diffuse the load over stochastic time.
Hope this helps your project... it should be simple enough to implement in 10 minutes.
I like Evgeny's answer - Queues are generally the best way to implement inter-thread communication. For completeness, here are some other options:
Close the DB connection when the spawned threads have finished using it. This would fix your OperationalError, but opening and closing connections like this is generally a No-No, due to performance overhead.
Don't use child threads. If the once-per-second task is reasonably lightweight, you could get away with doing the fetch and store, then sleeping until the right moment. This is undesirable because fetch and store operations could take >1 sec, and you lose the benefit of the multiplexed resources you have with a multi-threaded approach.
You need to design the concurrency for your program. SQLite has clear limitations and you need to obey them, see the FAQ (also the following question).
Please consider checking the value of THREADSAFE for the pragma_compile_options of your SQLite installation. For instance, with
SELECT * FROM pragma_compile_options;
If THREADSAFE is equal to 1, then your SQLite installation is threadsafe, and all you have to do to avoid the threading exception is to create the Python connection with check_same_thread set to False. In your case, it means
conn = sqlite3.connect("mydatabase.db", check_same_thread=False)
That's explained in some detail in Python, SQLite, and thread safety
The most likely reason you get errors with locked databases is that you must issue
conn.commit()
after finishing a database operation. If you do not, your database will be write-locked and stay that way. The other threads that are waiting to write will time out after a while (the default is set to 5 seconds, see http://docs.python.org/2/library/sqlite3.html#sqlite3.connect for details).
An example of a correct and concurrent insertion would be this:
import threading, sqlite3

class InsertionThread(threading.Thread):

    def __init__(self, number):
        super(InsertionThread, self).__init__()
        self.number = number

    def run(self):
        conn = sqlite3.connect('yourdb.db', timeout=5)
        conn.execute('CREATE TABLE IF NOT EXISTS threadcount (threadnum, count);')
        conn.commit()

        for i in range(1000):
            conn.execute("INSERT INTO threadcount VALUES (?, ?);", (self.number, i))
            conn.commit()

# create as many of these as you wish
# but be careful to set the timeout value appropriately: thread switching in
# python takes some time
for i in range(2):
    t = InsertionThread(i)
    t.start()
If you like SQLite, or have other tools that work with SQLite databases, or want to replace CSV files with SQLite db files, or must do something rare like inter-platform IPC, then SQLite is a great tool and very fitting for the purpose. Don't let yourself be pressured into using a different solution if it doesn't feel right!
I have a unit test which contains the following line of code
Site.objects.get(name="UnitTest").delete()
and this has worked just fine until now. However, that statement is currently hanging. It'll sit there forever trying to execute the delete. If I just say
print Site.objects.get(name="UnitTest")
then it works, so I know that it can retrieve the site. No other program is connected to Oracle, so it's not like there are two developers stepping on each other somehow. I assume that some sort of table lock hasn't been released.
So short of shutting down the Oracle database and bringing it back up, how do I release that lock or whatever is blocking me? I'd like to not resort to a database shutdown because in the future that may be disruptive to some of the other developers.
EDIT: Justin suggested that I look at the DBA_BLOCKERS and DBA_WAITERS tables. Unfortunately, I don't understand these tables at all, and I'm not sure what I'm looking for. So here's the information that seemed relevant to me:
The DBA_WAITERS table has 182 entries with lock type "DML". The DBA_BLOCKERS table has 14 entries whose session ids all correspond to the username used by our application code.
Since this needs to get resolved, I'm going to just restart the web server, but I'd still appreciate any suggestions about what to do if this problem repeats itself. I'm a real novice when it comes to Oracle administration and have mostly just used MySQL in the past, so I'm definitely out of my element.
EDIT #2: It turns out that despite what I thought, another programmer was indeed accessing the database at the same time as me. So what's the best way to detect this in the future? Perhaps I should have shut down my program and then queried the DBA_WAITERS and DBA_BLOCKERS tables to make sure they were empty.
From a separate session, can you query the DBA_BLOCKERS and DBA_WAITERS data dictionary tables and post the results? That will tell you if your session is getting blocked by a lock held by some other session, as well as what other session is holding the lock.
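For illustration, a hedged sketch of doing that from Python with cx_Oracle (the DSN and credentials are placeholders; SELECT * avoids depending on the exact column layout of the views):

import cx_Oracle

conn = cx_Oracle.connect("system/password@localhost/XE")  # hypothetical admin connection
cur = conn.cursor()

cur.execute("SELECT * FROM dba_blockers")
print("blockers:", cur.fetchall())

cur.execute("SELECT * FROM dba_waiters")
print("waiters:", cur.fetchall())

conn.close()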