At the moment I query my database this way:
for author in session.query(Author).filter(Author.queried==0).slice(0, 1000):
    print "Processing:", author
    # do stuff and commit later on
This means every 1000 authors I have to restart the script.
Is it possible to make the script run endlessly (or at least as long as there are unqueried authors)? What I mean is: is it possible to turn
session.query(Author).filter(Author.queried==0).slice(0, 1000)
into some sort of generator which yields the next author for which queried==0 is true.
Query objects can be treated as iterators as-is. SQL will be executed once you start using data from the query iterator. Example:
for author in session.query(Author).filter(Author.queried==0):
    print "Processing: ", author
Your question uses the word "endlessly", so a word of caution. SQL is not an event-processing API; you cannot simply do a query that runs "forever" and spits out each new row as it is added to a table. I wish that were possible, but it isn't.
If your intent is to detect new rows, you will have to poll regularly with the same query and build an indicator into your data model that tells you which rows are new. You seem to be representing that already with the queried column. In that case, within the for loop above you might set author.queried = 1 and session.add(author), but you can't session.commit() inside the loop.
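To make that concrete, here is a minimal polling sketch (my own, not from the answer above), assuming the Author model shown in the question and a hypothetical 60-second poll interval:

import time

while True:
    batch = session.query(Author).filter(Author.queried == 0).all()
    for author in batch:
        # do stuff with author here
        author.queried = 1          # mark the row so the next pass skips it
        session.add(author)
    session.commit()                # commit once per pass, after the loop
    if not batch:
        time.sleep(60)              # nothing new yet; wait before polling again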
Since the query is turned into an equivalent SQL SELECT statement you will only be able to get the set of Author rows where queried is 0 that existed when this transaction started. Any updates to the Author queried column will not change the current SELECT set.
If you want to continue processing all the Author rows even if there are more than 1000, then you could do:
for author in session.query(Author).filter(Author.queried==0):
    print "Processing:", author
The __iter__ method on the Query object will be called automatically, which returns the same iterator you would get from calling instances() on the Query object.
Edit: The end goal is to essentially sync both databases which are updated by different software on both ends.
Here's the setup: the main database (dbOne.db) and the new database (dbTwo.db) both contain the same table (myTable) with the same columns (account_id, guid, state, last_update).
The goal is to iterate through dbOne and search dbTwo for each account_id and guid. If no results are found, I modify the guid with some regex and search again before moving on to the next result; if I do find a match, I update it based on last_update.
Here is what the code essentially looks like.
for m in dbOne.execute("SELECT account_id, guid, state, last_update FROM myTable"):
    account_id, guid, state, last_update = m
    n = dbTwo.execute("SELECT account_id, guid, state, last_update FROM myTable "
                      "WHERE account_id = ? AND guid = ?", (account_id, guid))
    # check whether n returned anything using n.fetchone()
    row = n.fetchone()
    if row is None:
        # modify the guid using regex (guid_modified comes from that step)
        n = dbTwo.execute("SELECT account_id, guid, state, last_update FROM myTable "
                          "WHERE account_id = ? AND guid = ?", (account_id, guid_modified))
        row = n.fetchone()
        if row is None:
            continue
    # compare last_update from dbOne with the row's last_update, then do stuff
Just to note, this isn't the actual code; I've simplified it as much as possible to give a better understanding of what I've done. The code I have works, but I feel like there could be a better way to do this.
I have very little experience with SQL, but through my searches I've discovered things like ATTACH and JOIN as well as IIF and CASE statements. So I had this idea of throwing it all into one query: if the guid does not match anything in the database, try guid_modified, else continue. If there is a match, it would return the account_id, guid, state, and last_update from both databases; I would then compare the two last_update values to find the newer date and update state in one of the databases.
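For what it's worth, a rough sketch of the ATTACH + JOIN idea might look like this. It assumes dbOne is the sqlite3 connection for dbOne.db, that dbTwo.db sits at the path shown, and that last_update values compare correctly as stored:

dbOne.execute("ATTACH DATABASE 'dbTwo.db' AS two")

matches = dbOne.execute("""
    SELECT a.account_id, a.guid, a.state, a.last_update, b.state, b.last_update
    FROM myTable AS a
    JOIN two.myTable AS b
      ON b.account_id = a.account_id AND b.guid = a.guid
""").fetchall()

for account_id, guid, state_one, upd_one, state_two, upd_two in matches:
    if upd_one > upd_two:
        dbOne.execute("UPDATE two.myTable SET state = ?, last_update = ? "
                      "WHERE account_id = ? AND guid = ?",
                      (state_one, upd_one, account_id, guid))
    # (and the reverse update when upd_two is newer)

# Rows that only match after the regex/API rewrite still have to be handled in
# Python, since that step can't be expressed inside the query.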
One thing I don't know is whether it would be possible to modify the guid within the query, since the regex strips the guid and then uses an API to get some information in order to modify it. So I think I would have to run the second query regardless, but I was hoping for a second opinion, as I am still in the midst of learning Python and I want to ensure my code is as efficient as possible.
Thanks in advance.
I am optimising my code and reducing the number of queries. These used to be in a loop, but I am trying to restructure my code like this. How do I get the second query working so that, for each row, it uses the id generated by the first query? Assume that the datasets are in the right order, too.
self.c.executemany("INSERT INTO nodes (node_value, node_group) values (?, (SELECT node_group FROM nodes WHERE node_id = ?)+1)", new_values)
#my problem is here
new_id = self.c.lastrowid
connection_values.append((node_id, new_id))
#insert entry
self.c.executemany("INSERT INTO connections (parent, child, strength) VALUES (?,?,1)", connection_values)
These queries used to be in a for loop but were taking too long, so I am trying to avoid the loop and running each query individually. I believe there might be a way to combine them into one query, but I am unsure how this would be done.
You will need to either insert rows one at a time or read back the rowids that were picked by SQLite's ID assignment logic; as documented in Autoincrement in SQLite, there is no guarantee that the IDs generated will be consecutive and trying to guess them in client code is a bad idea.
You can do this implicitly if your program is single-threaded as follows:
Add the AUTOINCREMENT keyword to your table definition. This guarantees that any newly generated row IDs will be higher than any that have ever appeared in the table.
Immediately before the first statement, determine the highest ROWID in use in the table.
oldmax ← Execute("SELECT max(ROWID) from nodes").
Perform the first insert as before.
Read back the row IDs that were actually assigned with a select statement:
NewNodes ← Execute("SELECT ROWID FROM nodes WHERE ROWID > ? ORDER BY ROWID ASC", oldmax).
Construct the connection_values array by combining the parent ID from new_values and the child ID from NewNodes.
Perform the second insert as before.
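Putting those steps together, a rough sketch (my own, untested) might look like this. It assumes node_id is an INTEGER PRIMARY KEY AUTOINCREMENT column on nodes, and that new_values pairs each node_value with the parent's node_id:

import sqlite3

conn = sqlite3.connect("graph.db")                # hypothetical database file
c = conn.cursor()
new_values = [("value-a", 1), ("value-b", 1)]     # (node_value, parent node_id)

# remember the highest ROWID before the insert
oldmax = c.execute("SELECT COALESCE(MAX(ROWID), 0) FROM nodes").fetchone()[0]

# first insert, as before
c.executemany("INSERT INTO nodes (node_value, node_group) VALUES "
              "(?, (SELECT node_group FROM nodes WHERE node_id = ?) + 1)",
              new_values)

# read back the row IDs SQLite actually assigned
new_ids = [row[0] for row in c.execute(
    "SELECT ROWID FROM nodes WHERE ROWID > ? ORDER BY ROWID ASC", (oldmax,))]

# pair each parent ID from new_values with its newly created child ID
connection_values = [(parent_id, child_id)
                     for (_, parent_id), child_id in zip(new_values, new_ids)]

# second insert, as before
c.executemany("INSERT INTO connections (parent, child, strength) VALUES (?, ?, 1)",
              connection_values)
conn.commit()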
This may or may not be faster than your original code; AUTOINCREMENT can slow down performance, and without actually doing the experiment there's no way to tell.
If your program is writing to nodes from multiple threads, you'll need to guard this algorithm with a mutex as it will not work at all with multiple concurrent writers.
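For example, a minimal sketch with a threading lock (insert_nodes_and_connections is a hypothetical helper wrapping the steps above):

import threading

nodes_write_lock = threading.Lock()     # one lock shared by every writer thread

def add_nodes(new_values):
    # only one thread at a time may run the read-max / insert / read-back
    # sequence, otherwise the ROWID bookkeeping gets interleaved
    with nodes_write_lock:
        insert_nodes_and_connections(new_values)   # hypothetical helper for the steps above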
New to using Python NDB.
I have something like:
from google.appengine.ext import ndb

class User(ndb.Model):
    seen_list = ndb.KeyProperty(kind='Survey', repeated=True)

class Survey(ndb.Model):
    name = ndb.StringProperty(required=True)
I want to be able to query for users that have not seen certain surveys.
What I am doing now is:
users = User.query(User.seen_list != 'survey name').fetch()
This does not work. What would be the proper way to do this? Should I first query the Survey list to get the key of the survey with a certain name? Is the != part correct?
I could not find any examples similar to this.
Thanks.
Unfortunately, if your survey list is a repeated property, it won't work that way. When you query a repeated property, the datastore tests EVERY entry in your list, and if any one of them matches, it returns the item. So when you say != "survey name 1", if you have at least ONE entry in your list that isn't "survey name 1", the entity comes back as a match, even if another entry IS "survey name 1".
It's unintuitive if you come from an SQL background, I know. The only way to work around it is to filter programmatically, evaluating the entities your query returns. It comes from the fact that, for repeated values, Bigtable "flattens" your results, meaning it creates one index entry for EVERY value in the repeated attribute; so as it scans, it eventually finds one "matching" entry, grabs the object's key from there, and returns the object.
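A minimal sketch of that "filter in code" approach, assuming the models above (with the survey's name stored in a name property) and a user set small enough to scan:

survey = Survey.query(Survey.name == 'survey name').get()
if survey is not None:
    unseen_users = [u for u in User.query()
                    if survey.key not in u.seen_list]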
I have a PostgreSQL database which I am updating with around 100,000 records. I use session.merge() to insert/update each record, and I commit after every 1000 records.
i = 0
for record in records:
    i += 1
    session.merge(record)
    if i % 1000 == 0:
        session.commit()
This code works fine. In my database I have a table with a UNIQUE field, and some of the records I insert are duplicates on that field. An error is thrown when this happens, saying the field is not unique. Since I am inserting 1000 records at a time, a rollback will not help me skip just those records. Is there any way I can skip the session.merge() for the duplicate records (other than parsing through all the records to find the duplicates, of course)?
I think you already know this, but let's start out with a piece of dogma: you specified that the field needs to be unique, so you have to let the database check for uniqueness or deal with the errors from not letting that happen.
Checking for uniqueness:
if value not in database:
    session.add(value)
    session.commit()
Not checking for uniqueness, and catching the exception:
try:
    session.add(value)
    session.commit()
except IntegrityError:
    session.rollback()
The first one has a race condition. I tend to use the second pattern.
Now, bringing this back to your actual issue: if you want to ensure uniqueness on a column in the database, then you either let the database verify the actual uniqueness of each loaded value, or you let the database give you an error and handle it yourself.
That's obviously a lot slower than adding 100k objects to the session and just committing them all, but that's how databases work.
You might want to consider massaging the data which you are loading OUTSIDE the database and BEFORE attempting to load it, to ensure uniqueness. That way, when you load it you can drop the need to check for uniqueness. Pretty easy to do with command line tools if for example you're loading from csv or text files.
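As a rough sketch of that pre-load de-duplication (unique_field is a hypothetical stand-in for whatever your UNIQUE column is called):

seen = set()
deduped = []
for record in records:
    if record.unique_field not in seen:
        seen.add(record.unique_field)
        deduped.append(record)
# then merge/commit deduped instead of records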
You can get a "partial rollback" using SAVEPOINT, which SQLAlchemy exposes via begin_nested(). You could do it just like this:
for i, record in enumerate(records):
    try:
        with session.begin_nested():
            session.merge(record)
    except:
        print "Skipped record %s" % record
    if not i % 1000:
        session.commit()
Notes for the above:
In Python, we never do the "i = i + 1" thing; use enumerate().
with session.begin_nested(): is the same as saying begin_nested(), then commit() if no exception, or rollback() if so.
You might want to consider writing a function along the lines of this example from the PostgreSQL documentation.
This is the option which works best for me because the number of records with duplicate unique keys is minimal.
def update_exception(records, i, failed_records):
    failed_records.append(records[i]['pk'])
    session.rollback()
    # restart from the beginning of the current 1000-record batch
    start_range = (i // 1000) * 1000
    for index in range(start_range, i + 1):
        if records[index]['pk'] not in failed_records:
            ins_obj = Model()   # rebuild the object to merge from records[index]
            try:
                session.merge(ins_obj)
            except:
                failed_records.append(records[index]['pk'])
Say I hit an error at record 2375: I store the primary key 'pk' of record 2375 in failed_records and then re-commit records 2000 through 2375. It seems much faster than committing one by one.
In sqlite3 in Python, I'm trying to make a program where the ID of the row that will be inserted next has to be printed out before the row is actually written. But I've read in the documentation here that an INSERT should be used in an execute() statement. The problem is that my program asks the user for his/her information, and the primary key ID that will be assigned to that member must be displayed as his/her ID number. So, in other words, the execute("INSERT") statement must not run first, or the ID keys would be assigned to the wrong member.
I first thought that lastrowid could be read without calling execute("INSERT"), but I noticed that it always gave me the value None. Then I read the sqlite3 documentation for Python and googled for alternatives to solve this problem.
I've read somewhere through Google that SELECT last_insert_rowid() can be used, but would it be alright to ask what the syntax for it is in Python? I've tried coding it like this:
NextID = con.execute("select last_insert_rowid()")
But it just gave me a cursor object as output, not the ID value.
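For what it's worth, execute() returns a cursor, so the value has to be fetched from it. Note also that last_insert_rowid() only reflects an INSERT already made on the same connection (it returns 0 before any insert), so it can't predict the next ID by itself:

row = con.execute("select last_insert_rowid()").fetchone()
NextID = row[0]     # rowid of the most recent INSERT on this connection, or 0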
I've also been thinking of just making another table that always holds exactly one value. Whenever new data is inserted into the main table, the lastrowid of the main table would be written into (overwriting) that one-row table. Then, every time a new set of data needs to be entered into the main table and the next row ID is needed, I would just read that single value.
Or is there an alternative and easier way of doing this?
Any help is very much appreciated. (bows deeply)
You could guess the next ID by querying your table before asking the user for his/her information:
SELECT MAX(ID) + 1 AS NewID FROM DesiredTable
Before inserting the new data (including the new ID), start a transaction. Roll back only if the insert fails (because another process was faster with the same operation) and ask your user again. If everything is OK, just commit.
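A rough sketch of that guess-and-retry approach with sqlite3, assuming a Members table whose ID column is an INTEGER PRIMARY KEY (so a duplicate ID raises IntegrityError):

import sqlite3

con = sqlite3.connect("members.db")      # hypothetical database file
inserted = False
while not inserted:
    guessed_id = con.execute(
        "SELECT COALESCE(MAX(ID), 0) + 1 FROM Members").fetchone()[0]
    # show guessed_id to the user and collect his/her information here
    nick = "SomeNick"                    # stand-in for the collected data
    try:
        with con:                        # commits on success, rolls back on error
            con.execute("INSERT INTO Members (ID, Nick) VALUES (?, ?)",
                        (guessed_id, nick))
        inserted = True
    except sqlite3.IntegrityError:
        pass                             # another process took that ID; guess again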
Thanks for the answers and suggestions posted everyone but I ended up doing something like this:
#only to get the value of NextID to display
TempNick = "ThisIsADummyNickToBeDeleted"
cur.execute("insert into Members (Nick) values (?)", (TempNick, ))
NextID = cur.lastrowid
cur.execute("delete from Members where ID = ?", (NextID, ))
So basically, in order to get the lastrowid, I ended up inserting dummy data; after getting the value of lastrowid, the dummy row is deleted.
lastrowid
This read-only attribute provides the rowid of the last modified row. It is only set if you issued an INSERT statement using the execute() method. For operations other than INSERT or when executemany() is called, lastrowid is set to None.
from https://docs.python.org/2/library/sqlite3.html
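A quick illustration of that behaviour on the Python 2 sqlite3 module quoted above (my own snippet, using an in-memory database):

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (x)")

cur.execute("INSERT INTO t VALUES (1)")
print cur.lastrowid      # rowid of the row just inserted

cur.executemany("INSERT INTO t VALUES (?)", [(2,), (3,)])
print cur.lastrowid      # None -- not set by executemany() (per the docs above)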