I have a unit test which contains the following line of code
Site.objects.get(name="UnitTest").delete()
and this has worked just fine until now. However, that statement is currently hanging. It'll sit there forever trying to execute the delete. If I just say
print Site.objects.get(name="UnitTest")
then it works, so I know that it can retrieve the site. No other program is connected to Oracle, so it's not like there are two developers stepping on each other somehow. I assume that some sort of table lock hasn't been released.
So short of shutting down the Oracle database and bringing it back up, how do I release that lock or whatever is blocking me? I'd like to not resort to a database shutdown because in the future that may be disruptive to some of the other developers.
EDIT: Justin suggested that I look at the DBA_BLOCKERS and DBA_WAITERS tables. Unfortunately, I don't understand these tables at all, and I'm not sure what I'm looking for. So here's the information that seemed relevant to me:
The DBA_WAITERS table has 182 entries with lock type "DML". The DBA_BLOCKERS table has 14 entries whose session ids all correspond to the username used by our application code.
Since this needs to get resolved, I'm going to just restart the web server, but I'd still appreciate any suggestions about what to do if this problem repeats itself. I'm a real novice when it comes to Oracle administration and have mostly just used MySQL in the past, so I'm definitely out of my element.
EDIT #2: It turns out that despite what I thought, another programmer was indeed accessing the database at the same time as me. So what's the best way to detect this in the future? Perhaps I should have shut down my program and then queried the DBA_WAITERS and DBA_BLOCKERS tables to make sure they were empty.
From a separate session, can you query the DBA_BLOCKERS and DBA_WAITERS data dictionary tables and post the results? That will tell you if your session is getting blocked by a lock held by some other session, as well as what other session is holding the lock.
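As a rough sketch of how those results could be read, assuming the standard DBA_WAITERS columns (WAITING_SESSION, HOLDING_SESSION, LOCK_TYPE, MODE_HELD, MODE_REQUESTED) and that the rows have already been fetched as tuples with whatever Oracle driver you use. The helper name and the sample rows are made up for illustration.

```python
from collections import defaultdict

# Query to run from a separate session (requires access to the DBA_* views).
WAITERS_SQL = (
    "SELECT waiting_session, holding_session, lock_type, "
    "mode_held, mode_requested FROM dba_waiters"
)


def summarize_waiters(rows):
    """Group waiting session ids by the session id that is blocking them."""
    blocked_by = defaultdict(list)
    for waiting, holding, _lock_type, _mode_held, _mode_requested in rows:
        blocked_by[holding].append(waiting)
    return dict(blocked_by)


# Hypothetical rows, shaped like the result of WAITERS_SQL.
sample_rows = [
    (101, 57, "DML", "Exclusive", "Exclusive"),
    (102, 57, "DML", "Exclusive", "Exclusive"),
    (140, 63, "DML", "Exclusive", "Share"),
]

summary = summarize_waiters(sample_rows)
for holder, waiters in sorted(summary.items()):
    print(f"session {holder} is blocking sessions {waiters}")
```

If the summary is empty, nothing is waiting on a lock. Otherwise the keys are the session ids to investigate first.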
Hello, I don't think this is the right place for this question, but I don't know where else to ask it. I want to make a website and an API for that website using the same SQLAlchemy database. Would just running them at the same time independently be safe, or would this cause corruption from two writes happening at the same time?
SQLA is a Python wrapper for SQL. It is not its own database. If you're running your website (perhaps Flask?) and managing your API from the same script, you can simply use the same reference to your instance of SQLA. That is, when you use SQLA to connect to a database and save the result to a variable, what is really saved is the connection, and you keep referencing that variable instead of using the less efficient method of creating a new connection every time. So when you say
using the same SQLAlchemy database
I believe you are actually referring to the actual underlying database itself, not the SQLA wrapper/connection to it.
If your website and API are not running in the same script (or even if they are, depending on how your API handles simultaneous requests), you may encounter a race condition, which, according to Wikipedia, is defined as:
the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when one or more of the possible behaviors is undesirable.
This may be what you are referring to when you mentioned
would this cause corruption from two writes happening at the same time.
To avoid such situations, when a process accesses a file, a check is performed (depending on the OS) to see whether there is a "lock" on that file, and if so, the OS refuses to open it. Depending on the OS, a lock may be created when a process accesses a file (and no other process holds a lock on that file), such as by using with open(filename): and is released when the process no longer holds an open reference to the file (such as when Python execution leaves the with open(filename): indentation block). This may be the real issue you encounter when using two simultaneous connections to a SQLite db.
However, if you are using something like MySQL, where you connect to a SQL server process and NOT a file, there is no direct access to a file and so no file lock on the database, and you may run into that nasty race condition in the following made-up scenario:
Stack Overflow queries the reputation of an account to see if it should be banned due to negative reputation.
AT THE EXACT SAME TIME, someone upvotes an answer made by that account, which sets it one point under the account ban threshold.
The outcome is now determined by the speed of execution of these 2 tasks.
If the upvoter has, say, a slow computer, and the "upvote" does not get processed by StackOverflow before the reputation query completes, the account will be banned. However, if there is some lag on Stack Overflow's end, and the upvote processes before the account query finishes, the account will not get banned.
The key concept behind this example is that all of these steps can occur within fractions of a second, and the outcome depends on the speed of execution on both ends.
To address the issue of data corruption, most databases have a system in place that properly orders database reads and writes. However, there are still semantic issues that may arise, such as the example given above.
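A minimal sketch of one way to narrow that window, using the standard-library sqlite3 module as a stand-in for the real database. The table layout, the threshold value and the function names are all assumptions made for illustration. The point is that the ban check and the ban itself happen in a single UPDATE statement, so the decision is never made on a reputation value that was read earlier and has since changed.

```python
import sqlite3

BAN_THRESHOLD = -10  # hypothetical threshold, not from the original post


def make_db():
    """Create an in-memory database with two accounts just below the threshold."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, reputation INTEGER, banned INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO accounts (id, reputation) VALUES (?, ?)",
        [(1, -11), (2, -11)],
    )
    conn.commit()
    return conn


def upvote(conn, account_id):
    # The increment is computed by the database, not read and written back.
    with conn:
        conn.execute(
            "UPDATE accounts SET reputation = reputation + 1 WHERE id = ?",
            (account_id,),
        )


def ban_if_below_threshold(conn, account_id):
    # Check and act in one statement; returns True if the account was banned.
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET banned = 1 WHERE id = ? AND reputation < ?",
            (account_id, BAN_THRESHOLD),
        )
    return cur.rowcount == 1


conn = make_db()
upvote(conn, 1)                              # the upvote lands first
first = ban_if_below_threshold(conn, 1)      # reputation is now -10
second = ban_if_below_threshold(conn, 2)     # no upvote arrived, still -11
```

The outcome still depends on which operation reaches the database first, which is the semantic issue described above, but each operation sees a consistent value.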
Two applications can use the same database as the DB is a separate application that will be accessed by each flask app.
What you are asking can be done and is the methodology used by many large web applications, especially when the API is written in a different framework than the main application.
Since SQL databases are ACID compliant, they have a system in place to queue the multiple read/write requests put to it and perform them in the correct order while ensuring data reliability.
One question to ask, though, is whether it is useful to write two separate applications. For most Flask-only projects the best approach would be to separate the project using blueprints, having a “main” blueprint and an “api” blueprint.
I'd like to be able to tell someone (Django, pgBouncer, or whoever can provide me with this service) to always hand me the same connection to the database (PostgreSQL in this case) on a per client/session basis, instead of getting a random one each time (or creating a new one for that matter).
To my knowledge:
Django's CONN_MAX_AGE can control the lifetime of connections, so far so good. This will also have a positive impact on performance (no connection setup penalties).
Some pooling package (pgBouncer for example) can hold the connections and hand them to me as I need them. We're almost there.
The only bit I'm missing is the possibility to ask pgBouncer (or any other similar tool for that matter) to give me a specific db connection, instead of "one from the pool". This is important because I want to have control over the lifetime of the transaction. I want to be able to open a transaction, then send a series of commands, and then manually commit all the work, or roll everything back should something fail.
Many years ago, I implemented something very similar to what I'm looking for now. It was a simple connection pool written in C which would hold as many connections to Oracle as clients needed on one hand, while on the other it would give these clients the chance to recover these exact connections based on some ID, which could have been, for example, a PHP session ID. That way users could acquire a lock on some database object/row, and the lock would persist even after the Apache process died. From that point on the session owner was in total control of that row until he decided it was time to commit it, or until the backend decided it was time to let the transaction go because of idleness.
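To make the idea concrete, here is a minimal single-process sketch of such an ID-keyed pool in Python. The class and method names are hypothetical, and sqlite3 stands in for PostgreSQL only so the example runs on its own. It is not something Django or pgBouncer provides in this form.

```python
import sqlite3
import time


class PinnedConnections:
    """Hand back the same connection every time the same session id asks."""

    def __init__(self, connect, idle_timeout=300.0):
        self._connect = connect            # factory that opens a new connection
        self._idle_timeout = idle_timeout  # seconds before an idle entry is dropped
        self._entries = {}                 # session_id -> (connection, last_used)

    def acquire(self, session_id):
        entry = self._entries.get(session_id)
        conn = entry[0] if entry else self._connect()
        self._entries[session_id] = (conn, time.monotonic())
        return conn

    def release(self, session_id, commit=True):
        # End the session's transaction explicitly, then drop the connection.
        conn, _ = self._entries.pop(session_id)
        if commit:
            conn.commit()
        else:
            conn.rollback()
        conn.close()

    def reap_idle(self):
        # Roll back and close connections that were idle for too long.
        now = time.monotonic()
        for session_id, (_conn, last_used) in list(self._entries.items()):
            if now - last_used > self._idle_timeout:
                self.release(session_id, commit=False)


pool = PinnedConnections(lambda: sqlite3.connect(":memory:"))
a = pool.acquire("sess-1")
a.execute("CREATE TABLE t (x INTEGER)")
a.execute("INSERT INTO t VALUES (1)")       # not committed yet
b = pool.acquire("sess-1")                  # same session id, same connection
pending = b.execute("SELECT count(*) FROM t").fetchone()[0]
c = pool.acquire("sess-2")                  # different session, different connection
pool.release("sess-1", commit=False)        # roll everything back
```

A real version would have to live in a long-running process shared by all web workers, which is the part the C pool described above took care of.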
I have a batch query that I'm running daily on my database. However, it seems to get stuck in idle state, and I'm having a lot of difficulty debugging what's going on.
The query is an aggregation on a table that is simultaneously receiving inserts, which I'm guessing somehow relates to the issue. (The aggregation is on the previous day's data, so the insertions shouldn't affect results.)
Clues
I'm running this inside a Python script using SQLAlchemy. However, I've set the transaction isolation level to autocommit, so I don't think things are getting wrapped inside a transaction. On the other hand, I don't see the query hang when I run it manually in a SQL terminal.
By querying pg_stat_activity, the query initially comes into the database as state='active'. After maybe 15 seconds, the state changes to 'idle' and additionally, the xact_start is set to NULL. The waiting flag is never set to true.
Before I figured out the transaction level autocommit for sqlalchemy, it would instead hang in state 'idle in transaction' rather than 'idle'. And it possibly hangs slightly less frequently since making that change?
I feel like I'm not equipped to dig any deeper than I have on this. Any feedback, even explaining more about different states and relevant postgres internals without giving a definite answer, would be greatly appreciated.
I have a twisted daemon that does some xml feed parsing.
I store my data in PostgreSQL via twisted.enterprise.adbapi, which IIRC wraps psycopg2.
I've run into a few problems with storing data into database -- with duplicate data periodically getting in there.
To be honest, there are some underlying issues with my implementation which should be redone and designed much better. I lack the time and resources to do that though - so we're in 'just keep it running' mode for now.
I think the problem may happen from either my usage of deferToThread or how I've spawned the server at the start.
As a quick overview of the functionality I think is at fault:
Twisted queries Postgres for Accounts that should be analyzed, and sets a block on them
SELECT id
FROM xml_sources
WHERE timestamp_last_process < ( CURRENT_TIMESTAMP AT TIME ZONE 'UTC' - INTERVAL '4 HOUR' )
  AND is_processing_block IS NULL ;
lock_ids = [ i['id'] for i in results ]
UPDATE xml_sources SET is_processing_block = True WHERE id IN %(lock_ids)s
What I think is happening is that (accidentally) having multiple servers running, or various other issues, results in multiple threads processing this data.
I think this would likely be fixed, or at least ruled out as an issue, if I wrap this quick section in an exclusive table lock.
I've never done table locking through Twisted before, though. Can anyone point me in the right direction?
You can do a SELECT FOR UPDATE instead of a plain SELECT, and that will lock the rows returned by your query. If you actually want table locking you can just issue a LOCK statement, but based on the rest of your question I think you want row locking.
If you are using adbapi, then keep in mind that you will need to use runInteraction if you want to run more than one statement in a transaction. Functions passed to runInteraction will run in a thread, so you may need to use callFromThread or blockingCallFromThread to reach from the database interaction back into the reactor.
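A minimal sketch of what such an interaction function could look like, reusing the query from the question. It assumes the default psycopg2 cursor (rows are tuples, so row[0] rather than i['id']) and relies on psycopg2 adapting a Python tuple for the IN clause. The function name is made up.

```python
SELECT_SQL = """
    SELECT id
    FROM xml_sources
    WHERE timestamp_last_process
          < (CURRENT_TIMESTAMP AT TIME ZONE 'UTC' - INTERVAL '4 HOUR')
      AND is_processing_block IS NULL
    FOR UPDATE
"""

UPDATE_SQL = (
    "UPDATE xml_sources SET is_processing_block = True "
    "WHERE id IN %(lock_ids)s"
)


def claim_sources(txn):
    """Lock the candidate rows and mark them, all in one transaction.

    `txn` is the cursor-like object that runInteraction passes in.
    """
    txn.execute(SELECT_SQL)
    lock_ids = tuple(row[0] for row in txn.fetchall())
    if lock_ids:  # an empty tuple would produce invalid SQL for IN
        txn.execute(UPDATE_SQL, {"lock_ids": lock_ids})
    return list(lock_ids)
```

With an existing adbapi.ConnectionPool named dbpool (an assumed name), dbpool.runInteraction(claim_sources) returns a Deferred that fires with the list of claimed ids. runInteraction commits when the function returns and rolls back if it raises. A second process running the same function should block on the SELECT until the first commits, and under PostgreSQL's default READ COMMITTED level it should then skip the rows that are already marked.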
However, locking may not be your problem. For one thing, if you are mixing deferToThread and adbapi, something's likely wrong. adbapi is already doing the equivalent of deferToThread for you. You should be able to do everything on the main thread.
You'll have to include a representative example for a more specific answer though. Consider your question: it's basically "Sometimes I get duplicate data, with a self-admittedly problematic implementation, that is big and I can't fix and I also can't show you." This is not a question which it is possible to answer.
I have a lot of scripts running: scrapers, checkers, cleaners, etc. They have some things in common:
they run forever;
they have no time constraint to finish their job;
they all access the same MySQL DB, writing and reading.
As they accumulate, they are starting to slow down the website, which runs on the same system and depends on these scripts.
I can use queues with Kombu to serialize all the writes.
But do you know a way to do the same with reads?
E.g., if one script needs to read from the DB, its request is sent to a blocking queue, and it resumes when it gets the answer? This way every script makes its requests to one process, and that process is the only one talking to the DB, making one request at a time.
I have no idea how to do this.
Of course, in the end I may have to add more servers to the mix, but before that, is there something I can do at the software level?
You could use a connection pooler and make the connections from the scripts go through it. It would limit the number of real connections hitting your DB while being transparent to your scripts (their connections would be held in a "wait" state until a real connection is freed).
I don't know what DB you use, but for Postgres I'm using PgBouncer for similar reasons, see http://pgfoundry.org/projects/pgbouncer/
You say that your dataset is <1GB, the problem is CPU bound.
Now start analyzing what is eating CPU cycles:
Which queries are really slow and executed often? MySQL can log those queries.
What about the slow queries? Can they be accelerated by using an index?
Are there unused indices? Drop them!
Nothing helps? Can you solve it by denormalizing/precomputing stuff?
You could create a function that each process must call in order to talk to the DB. You could re-write the scripts so that they must call that function rather than talk directly to the DB. Within that function, you could have a scope-based lock so that only one process would be talking to the DB at a time.
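A minimal sketch of that idea, assuming all the scripts run on the same Unix machine so they can share a lock file (fcntl is not available on Windows). The lock path and function names are hypothetical, and sqlite3 is used only so the example runs on its own; with MySQL you would pass in your own DB-API connection.

```python
import contextlib
import fcntl
import os
import sqlite3
import tempfile

# Hypothetical lock file shared by every script on the machine.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "db_access.lock")


@contextlib.contextmanager
def exclusive_db_access(lock_path=LOCK_PATH):
    """Hold an exclusive file lock for the duration of the with block."""
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def run_query(conn, sql, params=(), lock_path=LOCK_PATH):
    """The one function every script calls instead of using the DB directly."""
    with exclusive_db_access(lock_path):
        cur = conn.cursor()
        cur.execute(sql, params)
        # Only statements that return rows have a cursor description.
        rows = cur.fetchall() if cur.description else []
        conn.commit()
    return rows


conn = sqlite3.connect(":memory:")
run_query(conn, "CREATE TABLE jobs (name TEXT)")
run_query(conn, "INSERT INTO jobs VALUES (?)", ("scraper",))
rows = run_query(conn, "SELECT name FROM jobs")
```

This serializes every read and write, so one slow query holds up all the other scripts. That may be acceptable here since the scripts have no deadline.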