I am using Python with the psycopg2 module to get data from a Postgres database.
The database is quite large (tens of GB).
Everything appears to be working, and I am creating objects from the fetched data.
However, after creating ~160,000 objects I get the following error:
I suppose the reason is the amount of data, but I could not find a solution searching online. I am not aware of using any proxy and have never used one on this machine before; the database is on localhost.
It's interesting how often the "It's a local server so I'm not open to SQL injection" stance leads to people thinking that string interpolation is somehow easier than a parameterized query. In your case it's ended up with:
'... cookie_id = \'{}\''.format(cookie)
So you've ended up with something that's less legible and also fails (though from the specific error I don't know exactly how). Use parameterization:
cursor.execute("SELECT user_id, created_at FROM cookies WHERE cookie_id = %s ORDER BY created_at DESC;", (cookie,))
Bottom line: do it the correct way all the time. Note that there are cases where you must use string interpolation, e.g. for table names:
cursor.execute("SELECT * FROM %s", (table_name,)) # Not valid
cursor.execute("SELECT * FROM {}".format(table_name)) # Valid
And in those cases, you need to take other precautions if someone else can interact with the code.
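If you are on psycopg2 2.7 or newer, one such precaution is the psycopg2.sql module, which quotes identifiers for you while values still go through normal parameterization. A minimal sketch, assuming table_name, cookie and cursor from the surrounding context:

from psycopg2 import sql

# Compose the table name as a quoted identifier; the value still uses a %s placeholder.
query = sql.SQL(
    "SELECT user_id, created_at FROM {} WHERE cookie_id = %s ORDER BY created_at DESC"
).format(sql.Identifier(table_name))
cursor.execute(query, (cookie,))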
I'm migrating a Django site from MySQL to PostgreSQL. The quantity of data isn't huge, so I've taken a very simple approach: I've just used the built-in Django serialize and deserialize routines to create JSON records, and then load them in the new instance, loop over the objects, and save each one to the new database.
This works very nicely, with one hiccup: after loading all the records, I run into an IntegrityError when I try to add new data after loading the old records. The Postgres equivalent of a MySQL autoincrement ID field is a serial field, but the internal counter for serial fields isn't incremented when id values are specified explicitly. As a result, Postgres tries to start numbering records at 1 -- already used -- causing a constraint violation. (This is a known issue in Django, marked wontfix.)
There are quite a few questions and answers related to this, but none of the answers seem to address the issue directly in the context of Django. This answer gives an example of the query you'd need to run to update the counter, but I try to avoid making explicit queries when possible. I could simply delete the ID field before saving and let Postgres do the numbering itself, but there are ForeignKey references that will be broken in that case. And everything else works beautifully!
It would be nice if Django provided a routine for doing this that intelligently handles any edge cases. (This wouldn't fix the bug, but it would allow developers to work around it in a consistent and correct way.) Do we really have to just use a raw query to fix this? It seems so barbaric.
If there's really no such routine, I will simply do something like the below, which directly runs the query suggested in the answer linked above. But in that case, I'd be interested to hear about any potential issues with this approach, or any other information about what I might be doing wrong. For example, should I just modify the records to use UUIDs instead, as this suggests?
Here's the raw approach (edited to reflect a simplified version of what I actually wound up doing). It's pretty close to Pere Picornell's answer, but his looks more robust to me.
from django.db import connection

table = model._meta.db_table
cur = connection.cursor()
cur.execute(
    "SELECT setval('{}_id_seq', (SELECT max(id) FROM {}))".format(table, table)
)
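For comparison, here is a rough sketch of the same reset using Django's own sequence-reset helper (the machinery behind the sqlsequencereset management command), so the setval() SQL doesn't have to be built by hand; model is the model class as above:

from django.core.management.color import no_style
from django.db import connection

# sequence_reset_sql() returns the backend-specific statements that reset
# each model's sequence to MAX(pk); run them on a cursor.
with connection.cursor() as cursor:
    for statement in connection.ops.sequence_reset_sql(no_style(), [model]):
        cursor.execute(statement)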
About the debate: my case is a one-time migration, and my decision was to run this function right after I finish each table's migration, although you could call it anytime you suspect integrity could be broken.
from django.db import connections

def synchronize_last_sequence(model):
    # PostgreSQL auto-increment sequences don't advance their last value when
    # you specify an ID manually, so set the sequence to the table's current
    # maximum primary key.
    sequence_name = model._meta.db_table + "_" + model._meta.pk.name + "_seq"
    with connections['default'].cursor() as cursor:
        cursor.execute(
            "SELECT setval('" + sequence_name + "', (SELECT max(" + model._meta.pk.name + ") FROM " +
            model._meta.db_table + "))"
        )
    print("Last auto-increment number for sequence " + sequence_name + " synchronized.")
I did this using the SQL query you proposed in your question.
It's been very useful to find your post. Thank you!
It should work with custom PKs but not with multi-field PKs.
One option is to use natural keys during serialization and deserialization. That way, when you insert the data into PostgreSQL, it will auto-increment the primary key field and keep everything in line.
The downside to this approach is that you need to have a set of unique fields for each model that don't include the id.
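As a sketch of what that looks like, assuming a hypothetical Person model whose (first_name, last_name) pair is unique:

from django.core import serializers
from django.db import models

class PersonManager(models.Manager):
    def get_by_natural_key(self, first_name, last_name):
        # Used during deserialization to look the object up without its id.
        return self.get(first_name=first_name, last_name=last_name)

class Person(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    objects = PersonManager()

    class Meta:
        unique_together = [("first_name", "last_name")]

    def natural_key(self):
        return (self.first_name, self.last_name)

# Serialize without numeric primary keys so Postgres assigns fresh ids
# (and advances the sequence) on load.
data = serializers.serialize(
    "json",
    Person.objects.all(),
    use_natural_primary_keys=True,
    use_natural_foreign_keys=True,
)

With use_natural_primary_keys=True the id is left out of the JSON, so the serial column and its sequence stay in sync on the PostgreSQL side.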
I have been doing a fair amount of manual data analysis, reporting and dashboarding recently via SQL, and I wonder whether Python could automate a lot of this. I am not familiar with Python at all, so I hope my question makes sense. For security/performance reasons, we store databases on a number of servers (more than 5) that contain data pertinent to a query. Unfortunately, these servers are set up so they cannot talk to each other, so I can't pull data from two servers in the same query. I believe this is a limitation due to using Windows credentials/security.
For my data analysis and reporting needs, I need to be able to grab pertinent data from two or more of these servers. The way I currently do this is by running a query, grabbing the results, running another query with the results, doing some formula work in Excel, then running another query, and so on until I get what I need.
Unfortunately this is both time-consuming and forces me to pull massive datasets (in the multiple millions of rows), which I then have to continually narrow down based on criteria that live in said databases.
I know Python has the ability to query SQL Server, however I figured I would ask the experts:
Can I manipulate the data in the background with Python similar to how I can in Excel (lookups, statistical functions, etc., perhaps even XML/web APIs)?
Can Python handle connections to multiple different database servers at the same time?
Does Python handle Windows credentials well?
If Python is not the tool for this, can you name one that would work better?
Please let me know if I can provide additional pertinent details.
Ideally, I would like to end up creating our own separate database and creating automated processes to pull everything from other databases but currently that is not possible due to project constraints.
Thanks!
I haven't used Windows credentials, but I have used Python to work with multiple MS SQL databases at the same time, and it worked very well. You can use the pymssql library, or better, SQLAlchemy.
That said, I think you should start with a basic Python tutorial first. Since you want to work with millions of rows, it's very important to understand list, set, tuple and dict in Python; for good performance, you should use the right type.
A basic example with pymssql:

import pymssql

conn1 = pymssql.connect("Host1", "user1", "password1", "db1")
conn2 = pymssql.connect("Host2", "user2", "password2", "db2")

cursor1 = conn1.cursor()
cursor2 = conn2.cursor()

# Note: SQL Server uses TOP rather than MySQL-style LIMIT
cursor1.execute('SELECT TOP 10 * FROM TABLE1')
cursor2.execute('SELECT TOP 10 * FROM TABLE2')

result1 = cursor1.fetchall()
result2 = cursor2.fetchall()

# print each row from the first server
for row in result1:
    print(row)

# print each row from the second server
for row in result2:
    print(row)
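To do the Excel-style lookups you mention, you can combine the two result sets in plain Python once they are fetched. A rough sketch, assuming the first column of each table is the shared key (adjust the indexes to your schema):

# Build a lookup dict from server 2, keyed on the join column.
lookup = {row[0]: row for row in result2}

# Enrich each row from server 1 with the matching columns from server 2.
combined = []
for row in result1:
    key = row[0]
    if key in lookup:
        combined.append(row + lookup[key][1:])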
You can do all of what you asked. Python lets you create multiple connection objects via a library, so, for example, if you use a MySQL Python driver you would create two different objects like this:
NOT ACTUAL CODE, JUST EXAMPLE
conn1 = mysqlConnect(server1, user, pass)
conn2 = mysqlConnect(server2, user, pass)
This way, conn1 connects to one database and conn2 connects to a different one; usually you would then do:
conn1.execute(query_to_server_1)
conn2.execute(query_to_server_2)
This lets you maintain two different connections in the same script. If you are looking for multithreading, Python's standard library will also help you execute multiple tasks from one master script, as sketched below.
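For instance, the standard library's concurrent.futures can run the two queries in parallel. A minimal sketch using pymssql (as in the answer above), where fetch_all is a hypothetical helper that opens its own connection per thread, and the hosts, credentials and queries are placeholders:

from concurrent.futures import ThreadPoolExecutor

import pymssql

def fetch_all(host, user, password, db, query):
    # One connection per thread keeps the two servers fully independent.
    conn = pymssql.connect(host, user, password, db)
    try:
        cur = conn.cursor()
        cur.execute(query)
        return cur.fetchall()
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(fetch_all, "Host1", "user1", "password1", "db1",
                     "SELECT TOP 10 * FROM TABLE1")
    f2 = pool.submit(fetch_all, "Host2", "user2", "password2", "db2",
                     "SELECT TOP 10 * FROM TABLE2")
    rows1, rows2 = f1.result(), f2.result()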
I'm working in an environment with a very poorly managed legacy Paradox database system. (I'm not the administrator.) I've been messing around with using pyodbc to interact with our tables, and the basic functionality seems to work. Here's some (working) test code:
import pyodbc
LOCATION = "C:\test"
cnxn = pyodbc.connect(r"Driver={{Microsoft Paradox Driver (*.db )\}};DriverID=538;Fil=Paradox 5.X;DefaultDir={0};Dbq={0};CollatingSequence=ASCII;".format(LOCATION), autocommit=True, readonly=True)
cursor = cnxn.cursor()
cursor.execute("select last, first from test")
row = cursor.fetchone()
print row
The problem is that most of our important tables are going to be open in someone's Paradox GUI at pretty much all times. I get this error whenever I try to do a select from one of those tables:
pyodbc.Error: ('HY000', "[HY000] [Microsoft][ODBC Paradox Driver] Could not lock
table 'test'; currently in use by user '(unknown)' on machine '(unknown)'. (-1304)
(SQLExecDirectW)")
This is, obviously, because pyodbc tries to lock the table when cursor.execute() is called on it. This behavior makes perfect sense, since cursor.execute() runs arbitrary SQL code and could change the table.
However, Paradox itself (through its GUI) seems to handle multiple users fine. It only gives you similar errors if you try to restructure the table while people are using it.
Is there any way I can get pyodbc to use some sort of read-only mode, such that it doesn't have to lock the table when I'm just doing select and such? Or is locking a fundamental part of how it works that I'm not going to be able to get around?
Solutions that would use other modules are also totally fine.
OK, I finally figured it out.
Apparently, ODBC dislikes Paradox tables that have no primary key. You cannot update tables with no primary key under any circumstances, and you cannot read from tables with no primary key unless you are the only user trying to access that table.
Unrelatedly, you get essentially the same error messages from password-protected tables if you don't supply a password.
So I was testing my script on two different tables, one of which has both a password and a primary key, and one of which had neither. I assumed the error messages had the same root cause, but it was actually two different problems, with different solutions.
There still seems to be no way to get access to tables without primary keys if they are open in someone's GUI, but that's a smaller issue.
Make sure that you have the latest version of pyodbc (3.0.6), available here. According to the release notes, they:
Added Cursor.commit() and Cursor.rollback(). It is now possible to use
only a cursor in your code instead of keeping track of a connection
and a cursor.
Added readonly keyword to connect. If set to True, SQLSetConnectAttr
SQL_ATTR_ACCESS_MODE is set to SQL_MODE_READ_ONLY. This may provide
better locking semantics or speed for some drivers.
Fixed an error reading SQL Server XML data types longer than 4K.
Also, I have tested this on a Paradox server using readonly and it does work.
Hope this helps!
I just published a Python library for reading Paradox database files via the pxlib C library: https://github.com/mherrmann/pypxlib. This operates at the file level, so it should also let you read the database independently of who else is currently accessing it. Since it does not synchronize read/write accesses, you do have to be careful, though!
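For what it's worth, a very rough sketch of reading a table with pypxlib, based on the project's README; treat the exact API (and the file path) as assumptions and check the repository for the current interface:

from pypxlib import Table

# Open the Paradox file directly and iterate over its rows.
table = Table(r"C:\test\test.db")
for row in table:
    print(row)
table.close()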
I have an odd problem.
There are two databases. One has all products in it for an online shop. The other one has the online-shop software on it and only the active products.
I want to take the values from db1, change some of them in Python, and insert them into db2.
db1.execute("SELECT * FROM table1")
for product in db1.fetchall():
    # ... change some values here ...
    db2.execute("INSERT INTO table2 ...")
    print("Check: " + str(db2.countrow))
So I can get the values via SELECT; even selecting from db2 is no problem. My check always gives me 1, BUT there are no new rows in table2. The auto-increment value grows, but there is no data.
And I don't even get an error like "couldn't insert".
So does anybody have an idea what could be wrong? (If I do an insert manually via phpMyAdmin it works, and if I just take the SQL from my script and run it manually, the statements work as well.)
EDIT: Found the answer here How do I check if an insert was successful with MySQLdb in Python?
Is there a way to make these executes without committing every time?
I have a wrapper around MySQLdb that works perfectly for me; changing the behaviour to commit after each statement would require some big changes.
EDIT 2: OK, I found out that db1 is MyISAM (which works without committing) and db2 is InnoDB (which only seems to work with committing). I guess I have to change db2 to MyISAM as well.
Try adding db2.commit() after the inserts if you're using InnoDB.
Starting with 1.2.0, MySQLdb disables autocommit by default, as
required by the DB-API standard (PEP-249). If you are using InnoDB
tables or some other type of transactional table type, you'll need to
do connection.commit() before closing the connection, or else none of
your changes will be written to the database.
http://mysql-python.sourceforge.net/FAQ.html#my-data-disappeared-or-won-t-go-away
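In terms of the question's code, that means committing on the connection that owns the db2 cursor once the inserts are done. A hedged sketch, where conn2 is assumed to be that underlying MySQLdb connection object (the question only shows the cursors):

db1.execute("SELECT * FROM table1")
for product in db1.fetchall():
    # ... change some values in Python ...
    db2.execute("INSERT INTO table2 ...")

# Make the InnoDB inserts permanent.
conn2.commit()

# Alternatively, turn autocommit on up front so every execute is committed:
# conn2.autocommit(True)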
I have to run 40K requests against a username:
SELECT * from user WHERE login = :login
It's slow, so I figured I would just use a prepared statement.
So I do
e = sqlalchemy.create_engine(...)
c = e.connect()
c.execute("PREPARE userinfo(text) AS SELECT * from user WHERE login = $1")
r = c.execute("EXECUTE userinfo('bob')")
for x in r:
do_foo()
But I get:
InterfaceError: (InterfaceError) cursor already closed None None
I don't understand why I get this exception.
Not sure how to solve your cursor-related error message, but I don't think a prepared statement will solve your performance issue. As long as you're using SQL Server 2005 or later, the execution plan for SELECT * from user WHERE login = $login will already be reused, and there will be no performance gain from the prepared statement. I don't know about MySQL or other SQL database servers, but I suspect they too have similar optimisations for ad-hoc queries that make the prepared statement redundant.
It sounds like the cause of the performance hit is more down to the fact that you are making 40,000 round trips to the database. You should try to rewrite the query so that you are only executing one SQL statement with a list of the login names. Am I right in thinking that MySQL supports an array data type? If it doesn't (or you are using Microsoft SQL), you should look into passing in some sort of delimited list of usernames.
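To make that concrete for the SQLAlchemy setup in the question, here is a hedged sketch of folding the lookups into a single IN query using an expanding bind parameter (available in SQLAlchemy 1.2+). The engine URL and the logins list are placeholders, do_foo() is the question's own placeholder, and a 40K-item list may still need to be sent in chunks:

from sqlalchemy import bindparam, create_engine, text

engine = create_engine("postgresql://user:password@localhost/mydb")  # placeholder URL

logins = ["bob", "alice", "carol"]  # the 40K login names, possibly chunked

# The expanding bindparam turns :logins into a properly escaped IN list.
stmt = text("SELECT * FROM user WHERE login IN :logins").bindparams(
    bindparam("logins", expanding=True)
)

with engine.connect() as conn:
    for row in conn.execute(stmt, {"logins": logins}):
        do_foo()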
From this discussion, it might be a good idea to check your paster debug logs in case there is a better error message there.