Faster way to perform bulk insert, while avoiding duplicates, with SQLAlchemy - python

I'm using the following method to perform a bulk insert, and to optionally avoid inserting duplicates, with SQLAlchemy:
def bulk_insert_users(self, users, allow_duplicates=False):
    if not allow_duplicates:
        users_new = []
        for user in users:
            if not self.SQL_IO.db.query(User_DB.id).filter_by(user_id=user.user_id).scalar():
                users_new.append(user)
        users = users_new
    self.SQL_IO.db.bulk_save_objects(users)
    self.SQL_IO.db.commit()
Can the above functionality be implemented such that the function is faster?

You can load all user ids first, put them into a set, and then use user.user_id in existing_user_ids to decide whether to add a new user, instead of sending a SELECT query every time. Even with tens of thousands of users this will be quite fast, especially compared to contacting the database for each user.
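A minimal sketch of that approach, assuming the same SQL_IO.db session and User_DB model as in the question:

def bulk_insert_users(self, users, allow_duplicates=False):
    if not allow_duplicates:
        # One query for all existing ids instead of one SELECT per user.
        existing_user_ids = {
            row.user_id for row in self.SQL_IO.db.query(User_DB.user_id)
        }
        users = [u for u in users if u.user_id not in existing_user_ids]
    self.SQL_IO.db.bulk_save_objects(users)
    self.SQL_IO.db.commit()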

How many users do you have? You're querying for the users one at a time, on every single iteration of that loop. You might have more luck querying for ALL user IDs once, putting them in a set (or list), then checking against that:
existing_user_ids = ...  # one query for all existing user IDs
for user in new_users:
    if user.user_id not in existing_user_ids:
        ...  # do_stuff


How do large scale databases handle locks?

I have a question about databases and updating rows.
I am currently running a Flask application, and the endpoint runs a command like this (please accept this pseudocode / shorthand syntax):
select * from Accounts where used = "False"
username = (first/random row of returned set)
update Accounts set used = "True" where name = username
return username
However, what if 100 people run a call to this endpoint at the same time?
How can I avoid collisions? (Meaning two people don't get the same username from the table because the update statement hasn't run yet before the second person queries.)
The obvious solution is a lock, something like the following -- this way, if both people hit the endpoint at the exact same time, the second person has to wait for the lock to be released:
# global lock
with lock:
    select * from Accounts where used = "False"
    username = (first/random row of returned set)
    update Accounts set used = "True" where name = username
    return username
I believe this would work, but it wouldn't be a great solution. Does anyone have any better ideas for this? I'm sure companies have this issue all the time with data consistency; how do they solve it?
Thanks!
MySQL / InnoDB offers four transaction isolation levels: READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, and SERIALIZABLE.
Assuming you perform all commands in a single transaction, with REPEATABLE READ and SERIALIZABLE isolation levels, only one transaction accessing the same rows would be executed at a time, so in the case of 100 users, only 1 user would be executing the transaction while the remaining 99 would be waiting in queue.
With READ UNCOMMITTED and READ COMMITTED isolation levels, it would be possible for two or more users to read the same row when it was used = False and try to set it to used = True.
I think it would be better if you refactored your database layout into two tables: one with all possible names, and the other with used names, with a unique constraint on the name column. For every new user, you would insert a new row into the used names table. If you tried to insert multiple users with the same name, you would get a unique constraint violated error, and would be able to try again with a different name.
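A minimal sketch of that idea with SQLAlchemy, assuming a hypothetical used_names table that has a UNIQUE constraint on its name column (the table and function names are illustrative, not from the question):

from sqlalchemy import text
from sqlalchemy.exc import IntegrityError

def claim_name(session, candidate_names):
    # Try candidates until an insert succeeds; the UNIQUE constraint
    # guarantees two concurrent clients can never claim the same name.
    for name in candidate_names:
        try:
            session.execute(
                text("INSERT INTO used_names (name) VALUES (:name)"),
                {"name": name},
            )
            session.commit()
            return name
        except IntegrityError:
            session.rollback()  # name already taken, try the next one
    return None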
Global locks on a database are a VERY bad thing. They will slow everything down immensely. Instead there are table locks (to be avoided), row locks (these are fine), and transactions.
Use a transaction. This serves to isolate your changes from others and theirs from yours. It also allows you to throw away all the changes, rollback, if there's a problem so you don't leave a change halfway done. Unless you have a very good reason otherwise, you should ALWAYS be in a transaction.
MySQL supports SELECT FOR UPDATE which tells the database that you're going to update the selected rows in this transaction so those rows get locked.
To use your pseudo-code example...
begin transaction
select * from Accounts where used = "False" for update
username = (first/random row of returned set)
update Accounts set used = "True" where name = username
commit
return username
Basically, transactions make a set of SQL statements "atomic" meaning they happen in a single operation from the point-of-view of concurrent use.
Other points... you should update with the primary key of Accounts to avoid the possibility of using a non-unique field. Maybe the primary key is username, maybe it isn't.
Second, a select without an order by can return in any order it wants. If you're working through a queue of accounts, you should probably specify some sort of order to ensure the oldest ones get done first (or whatever you decide your business logic will be). Even order by rand() will do a better job than relying on the default table ordering.
Finally, if you're only going to fetch one row, add a limit 1 so the database doesn't do a bunch of extra work.
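Since the question mentions a Flask app, here is a rough Python sketch of the same pattern using SQLAlchemy; the column names and the session object are assumptions taken from the pseudocode, not a definitive implementation:

from sqlalchemy import text

def claim_account(session):
    with session.begin():  # one transaction for the select + update
        # Lock a single unused row so concurrent requests cannot pick it too.
        row = session.execute(text(
            "SELECT name FROM Accounts "
            "WHERE used = 'False' "
            "ORDER BY name "
            "LIMIT 1 "
            "FOR UPDATE"
        )).fetchone()
        if row is None:
            return None  # no free accounts left
        session.execute(
            text("UPDATE Accounts SET used = 'True' WHERE name = :name"),
            {"name": row.name},
        )
        return row.name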
First of all, I would add a new field to the table; let's call it session id. Each client that connects to that endpoint should have a unique session id that sets it apart from the other clients.
Instead of doing a select and then an update, I would first update a single record and set its session id field to the client's session id, then retrieve the record based on the session id:
update Accounts
set used = "True", sessionid = ...
where used = "False" and sessionid is null
limit 1;

select name from Accounts where sessionid = ...;
This way you avoid the need for explicit locking.

How can I improve django mysql copy performance?

I have a Django app that has a model (Person) defined, and I also have some databases (one of which contains an Appointment table) that do not have any models defined (they are not meant to be connected to the Django app).
I need to move some data from the Appointment table over to the Person table so that the Person table mirrors all the information in the Appointment table. It is set up this way because there are multiple independent databases like Appointment that need to be copied into the Person table (so I do not want to make any architectural changes to this setup).
Here is what I do now:
res = sourcedb.fetchall()  # from the Appointment table
for myrecord in res:
    try:
        existingrecord = Person.objects.filter(vendorid=myrecord[12], office=myoffice)[0]
    except IndexError:
        existingrecord = Person(vendorid=myrecord[12], office=myoffice)
    existingrecord.firstname = myrecord[0]
    existingrecord.midname = myrecord[1]
    existingrecord.lastname = myrecord[2]
    existingrecord.address1 = myrecord[3]
    existingrecord.address2 = myrecord[4]
    existingrecord.save()
The problem is that this is way too slow (takes about 8 minutes for 20K records). What can I do to speed this up?
I have considered the following approach:
1. bulk_create: Cannot use this because I have to update sometimes.
2. Delete all and then bulk_create: there are dependencies from other things on the Person model, so I cannot delete records in the Person model.
3. INSERT ... ON DUPLICATE KEY UPDATE: cannot do this because the Person table's PK (primary key) is different from the Appointment table's PK. The Appointment PK is copied into the Person table. If there were a way to check on two duplicate keys, this approach would work, I think.
A few ideas:
EDIT: See Trewq's comment on this and create indexes on your tables first of all…
Wrap it all in a transaction using with transaction.atomic():, as by default Django will create a new transaction per save() call which can become very expensive. With 20K records, one giant transaction might also be a problem, so you might have to write some code to split your transactions into multiple batches. Try it out and measure!
If RAM is not an issue (and it should not be with 20k records), first fetch all data from the Appointment table and then fetch all existing Person objects using a single SELECT query instead of one per record.
Use bulk_create even if some of them are updates. This will still issue UPDATE queries for your updates, but will reduce all your INSERT queries to just one or a few, which is still an improvement. You can distinguish inserts from updates by the fact that inserts won't have a primary key set before calling save(); collect the inserts in a Python list for a later bulk_create instead of saving them directly.
As a last resort: write raw SQL to make use of MySQL's INSERT … ON DUPLICATE KEY UPDATE syntax. You don't need the same primary key for this, a UNIQUE key would suffice. Keys can span multiple columns, see Django's Meta.unique_together model option.
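A rough sketch of how the first three ideas could be combined, using the field names from the question (splitting into batches and error handling are left out for brevity):

from django.db import transaction

res = sourcedb.fetchall()  # from the Appointment table

# One query for all existing Person rows in this office, keyed by vendorid.
existing = {p.vendorid: p for p in Person.objects.filter(office=myoffice)}

to_create = []
with transaction.atomic():
    for myrecord in res:
        person = existing.get(myrecord[12])
        if person is None:
            person = Person(vendorid=myrecord[12], office=myoffice)
            to_create.append(person)
        person.firstname = myrecord[0]
        person.midname = myrecord[1]
        person.lastname = myrecord[2]
        person.address1 = myrecord[3]
        person.address2 = myrecord[4]
        if person.pk:  # existing row: one UPDATE per changed record
            person.save()
    Person.objects.bulk_create(to_create)  # all inserts in one or a few queries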

Hosting a small database aside from Google App Engine?

I asked another question about doing large queries in GAE, to which the answer was pretty much that it is not possible.
What I want to do is this: from an iOS device, I get all the user's contacts phone numbers. So now I have a list of say 250 phone numbers. I want to send these phone numbers back to the server and check to see which of these phone numbers belong to a User account.
So I need to do a query: query = User.query(User.phone.IN(phones_list))
However, with GAE, this is quite an expensive query. It will cost 250 reads for just this one query, and I expect to do this type of query often.
So I came up with a crazy idea. Why don't I host the phone numbers on another host, on another database, where this type of query is cheaper. Then I can have GAE send a HTTP request to my other server to get the desired info.
So I have two questions:
Are there any databases better streamlined to handle these kinds of queries, with which it would be cheaper to do so? Or will it all be the same as GAE?
Is this overkill? Is it a good idea? Should I suck it up and pay the cost?
GAE's datastore should be good enough for your service, since your application looks like it could be parallelized very well.
1. Use the phone number as the key_name of User.
If you set the number as the key_name of User, the following code will increase query speed and reduce read operations.
found = memcache.get_multi(phone_numbers)                        # check cache first
missing = [n for n in phone_numbers if n not in found]
users = User.get_by_key_name(missing)                            # batch datastore lookup
memcache.set_multi({n: u for n, u in zip(missing, users) if u})  # cache what was found
2. Store multiple numbers in one datastore entity.
GAE's operation cost is not directly related to the entity's size, so a large entity that stores multiple numbers is another way to save on operation cost.
For example, store several phone numbers that share the same number_prefix together:
class Number(db.Model):
    number_prefix = db.StringProperty()
    numbers = db.StringListProperty(indexed=False)

# check numbers 01234567 and 032123124
entities = Number.get_by_key_name(["01", "03"])
# is 01234567 in entities[0].numbers ?
# is 032123124 in entities[1].numbers ?
This method could be further improved with memcache.
Generalizing slightly on other ideas offered... assuming that all your search keys are unique to a single User (e.g. email, phone, twitter handle, etc.)
At User write time, you can generate a set of SearchIndex(...) and persist that. Each SearchIndex has the key of the User.
Then at search time you can construct the keys for any SearchIndex and do two ndb.get_multi_async calls. The first to get matching SearchIndex entities, and the second to get the Users associated with those index entities.
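A rough ndb sketch of that idea; the SearchIndex model, the key naming scheme, and the synchronous get_multi calls here are assumptions for illustration (the answer suggests the async variants):

from google.appengine.ext import ndb

class SearchIndex(ndb.Model):
    # Keyed as "<field>:<value>", e.g. "phone:+15551234567".
    user_key = ndb.KeyProperty(kind='User')

def find_users_by_phone(phone_numbers):
    index_keys = [ndb.Key(SearchIndex, 'phone:%s' % p) for p in phone_numbers]
    index_entities = ndb.get_multi(index_keys)        # first batch lookup
    user_keys = [e.user_key for e in index_entities if e]
    return ndb.get_multi(user_keys)                   # second batch lookup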

how to prevent multiple votes from a single user

I am writing a web app on Google App Engine with Python. I am using Jinja2 as a templating engine.
I currently have it set up so that users can upvote and downvote posts, but right now they can vote on them as many times as they would like. I simply record the vote in a database and then recalculate the total right after. How can I efficiently prevent users from casting multiple votes?
I suggest making a toggleVote method, which accepts the key of the item you want to toggle the vote on, and the key of the user making the vote.
I'd also suggest adding a table to record the votes, basically containing two fields:
"keyOfUserVoting", "keyOfItemBeingVotedOn"
That way you can simply do a very quick query where the keys match, and if a row exists, then you know the user voted on that item. (Query where keyOfUserVoting = 'param1' and keyOfItemVoted = 'param2'; if result != None, it means the user voted.)
For the toggleVote() method the case could be very simple:
toggleVote(keyOfUserVoting, keyOfItemToVoteOn):
    if queryResultExists:
        # delete this record from the 'votes' table
    else:
        # add a record to the 'votes' table
That way you'll never have to worry about keeping track on an individual basis of how many times the user has voted or not.
Also this way, if you want to find out how many votes are on an item, you can do another query to quickly count where keyOfItemToVoteOn = paramKeyOfItem. Again, with GAE, this will be very fast.
In this setup, you can also quickly tell how many times a user has voted on one item (count where userKey = value and where itemKey = value), or how many times a user has voted in the entire system (count where userKey = value)...
Lastly, for best reliability, you can wrap the updates in the toggleVote() method in a transaction, especially if you'll be doing other things on the user or item being voted on.
Hope this helps.
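A rough ndb sketch of this approach; the Vote model and the kind names are illustrative assumptions, not code from the question:

from google.appengine.ext import ndb

class Vote(ndb.Model):
    user_key = ndb.KeyProperty(kind='User')
    item_key = ndb.KeyProperty(kind='Post')

def toggle_vote(user_key, item_key):
    existing = Vote.query(Vote.user_key == user_key,
                          Vote.item_key == item_key).get()
    if existing:
        existing.key.delete()  # user had already voted: remove the vote
    else:
        Vote(user_key=user_key, item_key=item_key).put()

def vote_count(item_key):
    return Vote.query(Vote.item_key == item_key).count()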
Store the voting user with the vote, and check for an existing vote by the current user, using your database.
You can perform the check either before you serve the page (and so disable your voting buttons), or when you get the vote attempt (and show some kind of message). You should probably write code to handle both scenarios if the voting really matters to you.

google app engine cross group transactions needing parent ancestor

From my understanding, @db.transactional(xg=True) allows for transactions across entity groups; however, the following code returns "queries inside transactions must have ancestors".
@db.transactional(xg=True)
def insertUserID(self, userName):
    user = User.gql("WHERE userName = :1", userName).get()
    highestUser = User.all().order('-userID').get()
    nextUserID = highestUser.userID + 1
    user.userID = nextUserID
    user.put()
Do you need to pass in the key for each entity despite being a cross group transaction? Can you please help modify this example accordingly?
An XG transaction can span at most 25 entity groups. An ancestor query limits the query to a single entity group, and you are able to run queries against those 25 entity groups within a single XG transaction.
A transactional query without parent would potentially include all entity groups in the application and lock everything up, so you get an error message instead.
In App Engine one usually tries to avoid monotonically increasing ids. The auto-assigned ones might go like 101, 10001, 10002 and so on. If you know that you need monotonically increasing ids and it'll work for you performance-wise, how about:
1. Have some kind of model representation of userId to enable key_name usage and direct lookup
2. Query for userId outside the transaction, and get the highest candidate id
3. In the transaction do a get_or_insert-style lookup: UserId.get_by_key_name(candidateid+1). If it is already present and pointing to a different user, try again with +2 and so on until you find a free one and create it, updating the userid attribute of the user at the same time.
If the XG transaction of updating UserId+User is too slow, perhaps create UserId plus a task in a (non-XG) transaction, and let the executing task associate UserId and User afterwards. Or use a single backend that can serialize UserId creation, perhaps with put_async and retries to avoid holes in the sequence, and do something like 50 creations per second.
If it's possible to use userName as key_name you can do direct lookup instead of query and make things faster and cheaper.
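A rough sketch of step 3 with the old db API; the UserId model, its user property, and the helper name are assumptions for illustration:

from google.appengine.ext import db

class UserId(db.Model):
    # key_name is the numeric id as a string; stores which user owns it
    user = db.ReferenceProperty(User)

def claim_next_user_id(user, candidate):
    # Try candidate, candidate+1, ... until an id nobody else owns is found.
    while True:
        key_name = str(candidate)
        def txn():
            if UserId.get_by_key_name(key_name) is not None:
                return False  # id already taken, caller retries with the next one
            UserId(key_name=key_name, user=user).put()
            return True
        if db.run_in_transaction(txn):
            return candidate
        candidate += 1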
Cross-group transactions allow you to perform a transaction across multiple entity groups, but they don't remove the prohibition on non-ancestor queries inside transactions. You need to perform the query outside the transaction and pass the ID of the entity in (and then check that any invariants specified in the query still hold) - or, as Shay suggests, use IDs so you don't have to do a query in the first place.
Every datastore entity has a key, and a key (among other things) has either a numeric id that App Engine assigns to it or a key_name which you can give it.
In your case it looks like you can use the numeric id: after you call put() on the user entity you will have user.key().id() (or user.key.id() if you're using NDB), which will be unique for each user (as long as all the users have the same parent, which is None in your code).
This id is not sequential but is guaranteed to be unique.
