I have a model in the peewee ORM with a unique=True field. I'm saving data to my MySQL db like this:
try:
    model.save()
except IntegrityError:  # do not save if it's already in db
    pass
But when peewee tries to save data that is already in the db, MySQL still increments the id, so the sequence of ids ends up with gaps. How can I avoid this behavior?
Here's the model I'm trying to save:
from datetime import datetime

class FeedItem(Model):
    vendor = ForeignKeyField(Vendor, to_field='name')
    url = CharField(unique=True)
    title = CharField()
    pub = DateTimeField()
    rating = IntegerField(default=0)
    img = CharField(default='null')

    def construct(self, vendor, url, title):
        self.vendor = vendor
        self.url = url
        self.title = title
        self.pub = datetime.now()
        self.save()

    class Meta:
        database = db
Here's how I'm saving it:
for article in feedparser.parse(vendor.feed)['items']:
    try:
        entry = FeedItem()
        entry.construct(vendor.name, article.link, article.title)
    except IntegrityError:
        pass
MySQL increments id and ids order is broken. How to avoid this behavior?
You don't.
The database-generated identifier is outside your control. It's generated by the database. There's no guarantee that identifiers will be sequential and without gaps, only that they're unique. Any number of things can result in a number not being present in the sequence, such as:
A record was deleted.
An insert was attempted, which generated an ID, but the insert then failed in some way after the ID was generated.
A record was inserted as part of a transaction which wasn't committed.
A set of IDs was generated in memory as part of an internal optimization in the database engine, and the engine went down before the IDs were used.
A record was inserted with an explicit ID, causing the auto-increment feature to re-adjust to the new value.
There may be more cases I'm not considering. But the point is that you simply don't control that value; the database engine does.
If you want to control that value, then don't use auto-increment. Be aware, though, that this comes with a whole host of other problems that you'd need to solve yourself, problems which auto-increment solves for you. Alternatively, you could switch to a GUID instead of an integer, which brings its own considerations you'd need to account for.
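For illustration, a minimal sketch of the GUID approach with peewee (my addition, not part of the original answer; the model and field names are hypothetical, and peewee's UUIDField is assumed):

import uuid
from peewee import Model, UUIDField, CharField

class Article(Model):
    # client-generated GUID primary key: globally unique, but carries
    # no ordering guarantees at all (which is the trade-off noted above)
    id = UUIDField(primary_key=True, default=uuid.uuid4)
    url = CharField(unique=True)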
I'm not positive this will work, but you can try something like:
try:
    with database.atomic():
        model.save()
except IntegrityError:
    pass  # Model already exists.
By wrapping the save in atomic(), the code executes in a transaction (or a savepoint if you are already in a transaction). This may keep the ID sequence intact.
I agree with David's answer, though, that this is really a database detail and should not be part of your application logic. If you need monotonically increasing IDs, you should implement that yourself.
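A further practical workaround (my sketch, not from either answer) is to avoid the failing INSERT altogether, so no auto-increment value is consumed in the first place. Assuming url is the unique column, as in the question:

for article in feedparser.parse(vendor.feed)['items']:
    # SELECT first and only INSERT when the row is absent. This check is
    # racy without locking, but the unique constraint and the existing
    # except IntegrityError clause still backstop correctness.
    if not FeedItem.select().where(FeedItem.url == article.link).exists():
        entry = FeedItem()
        entry.construct(vendor.name, article.link, article.title)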
I have been working a bit with the Peewee ORM, and I am trying to understand how to get the url field from the table. The condition is that the visible column needs to be true as well: if visible is True and store_id is 4, return all the urls as a set.
I have currently done something like this:
from peewee import (
    Model,
    TextField,
    BooleanField,
    IntegrityError
)
from playhouse.pool import PooledPostgresqlDatabase

# -------------------------------------------------------------------------
# Connection to Postgresql
# -------------------------------------------------------------------------
postgres_pool = PooledPostgresqlDatabase(
    'xxxxxxx',
    host='xxxxxxxx',
    user='xxxxxxxx',
    password='xxxxxx',
    max_connections=20,
    stale_timeout=30,
)

# ------------------------------------------------------------------------------- #
class Products(Model):
    store_id = TextField(column_name='store_id')
    url = TextField(column_name='url')
    visible = BooleanField(column_name='visible')

    class Meta:
        database = postgres_pool
        db_table = "develop"

    @classmethod
    def get_urls(cls):
        try:
            return set([i.url for i in cls.select().where((cls.store_id == 4) & (cls.visible))])
        except IntegrityError:
            return None
However, calling the method takes around 0.13s, which feels a bit too long for what it is supposed to do. I believe this is due to the for loop and to building the set(). Is there a way for peewee to do something like cls.select(cls.url).where((cls.store_id == 4) & cls.visible) and return the result as a set()?
How many products do you have? How big is this set? Why not use distinct() so that the database de-duplicates them for you? What indexes do you have? All of these questions are much more pertinent than "how do I make this Python loop faster".
I'd suggest that you need an index on (store_id, visible), or a partial index on store_id where visible:
create index "product_urls" on "products" ("store_id") where "visible"
You could even use a covering index, but this may take up a lot of disk space:
create index "product_urls" on "products" ("store_id", "url") where "visible"
Once you've got the actual query sped up with an index, you can also use distinct() to make the db de-dupe the URLs before sending them to Python. Additionally, since you only need the URL, select just that column and use the tuples() method to avoid creating a model instance per row:
@classmethod
def get_urls(cls):
    query = cls.select(cls.url).where((cls.store_id == 4) & cls.visible)
    return set(url for url, in query.distinct().tuples())
Lastly, please read the docs: http://docs.peewee-orm.com/en/latest/peewee/querying.html#iterating-over-large-result-sets
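For very large result sets, the iterator() method from the docs linked above can be combined with the same query; a rough sketch (my addition, assuming the Products model from the question):

query = Products.select(Products.url).where(
    (Products.store_id == 4) & Products.visible).distinct()
# iterator() streams rows instead of caching them all in memory;
# tuples() still avoids building a Products instance per row
urls = set(url for url, in query.tuples().iterator())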
I am currently trying to create a CustomUser entity in my App Engine project when a user signs in for the first time. I would like CustomUser entities to be unique, and I would like to prevent the same entity from being created more than once. This would be fairly easy to do if I could supply an ancestor upon entity creation, as this would make the query strongly consistent.
Unfortunately, this is not the case, because a CustomUser entity is a root entity, and queries on it are thus eventually consistent, not strongly consistent. Because of this, there are instances when the entity is created twice, which I would like to prevent as it will cause problems later on.
So the question is: is there a way to prevent the entity from being created more than once? Or at least a way to make the check for an existing entity strongly consistent so as to prevent duplication? Here's my code and my interim (hacky) solution.
import time
import logging

from google.appengine.ext import ndb

# sample Model
class CustomUser(ndb.Model):
    user_id = ndb.StringProperty(required=True)
    some_data = ndb.StringProperty(required=True)
    some_more_data = ndb.StringProperty(required=True)

externally_based_user_id = "id_taken_from_somewhere_else"

# check if this id already exists in the Model.
# If it does not exist yet, create it
user_entity = CustomUser.query(
    CustomUser.user_id == externally_based_user_id,
    ancestor=None).get()

if not user_entity:
    # prepare the entity
    user_entity = CustomUser(
        user_id=externally_based_user_id,
        some_data="some information",
        some_more_data="even more information",
        parent=None
    )
    # write the entity to ndb
    user_key = user_entity.put()
    # inform of success
    logging.info("user " + str(user_key) + " created")

    # eventual consistency workaround - loop and keep checking if the
    # entity has already been created
    #
    # I understand that a while loop may not be the wisest solution.
    # I can also use a for loop with n range to avoid going around the loop infinitely.
    # Both however seem to be band aid solutions
    entity_check = None
    while not entity_check:
        entity_check = CustomUser.query(
            CustomUser.user_id == externally_based_user_id,
            ancestor=None).get()
        # time.sleep to prevent the instance from consuming too much processing power and
        # memory, although I'm not certain if this has any real effect apart from
        # reducing the number of loops
        if not entity_check:
            time.sleep(0.5)
EDIT: Here's the solution I ended up using, based on both of Daniel Roseman's suggestions. It can be further simplified by using get_or_insert, as suggested by voscausa. I've stuck to the usual way of doing things to keep it clearer.
import logging

from google.appengine.ext import ndb

# ancestor Model
# we can skip the creation of an empty class like this and just use a string when
# constructing the key
class PhantomAncestor(ndb.Model):
    pass

# sample Model
class CustomUser(ndb.Model):
    # user_id is now redundant since we use it as the entity's key id
    # user_id = ndb.StringProperty(required=True)
    some_data = ndb.StringProperty(required=True)
    some_more_data = ndb.StringProperty(required=True)

externally_based_user_id = "id_taken_from_somewhere_else"

# construct the entity key using information we know.
# entity_key = ndb.Key(*arbitrary ancestor kind*, *arbitrary ancestor id*, *Model*, *user_id*)
# we can also use the string "PhantomAncestor" instead of passing in an empty class, like so:
# entity_key = ndb.Key("SomeRandomString", externally_based_user_id, CustomUser, externally_based_user_id)
# check this page on how to construct a key: https://cloud.google.com/appengine/docs/python/ndb/keyclass#Constructors
entity_key = ndb.Key(PhantomAncestor, externally_based_user_id, CustomUser, externally_based_user_id)

# check if this id already exists in the Model.
user_entity = entity_key.get()

# If it does not exist yet, create it
if not user_entity:
    # prepare the entity with the desired key value
    user_entity = CustomUser(
        # user_id=externally_based_user_id,
        some_data="some information",
        some_more_data="even more information",
        # specify the full custom key (ancestor included) here; the original
        # parent=None / id=... pair would have built a key without the
        # PhantomAncestor, so entity_key.get() above would never find it
        key=entity_key
    )
    # write the entity to ndb
    user_key = user_entity.put()
    # inform of success
    logging.info("user " + str(user_key) + " created")

# we should also be able to use CustomUser.get_or_insert to simplify the code further
A couple of things here.
First, note that the ancestor doesn't have to actually exist. If you want a strongly consistent query, you can use any arbitrary key as an ancestor.
A second option would be to use user_id as your key. Then you can do a key get, rather than a query, which again is strongly consistent.
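As a footnote to the second option: ndb's built-in get_or_insert (the approach voscausa suggested) does the key-based get and the transactional create in one call, so the entity can only ever be created once. A minimal sketch, reusing the names from the question:

# get_or_insert is transactional: it returns the existing entity with this
# key id, or creates it exactly once if it does not exist yet
user_entity = CustomUser.get_or_insert(
    externally_based_user_id,
    some_data="some information",
    some_more_data="even more information",
)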
Given a simple declarative-based class:
class Entity(db.Model):
    __tablename__ = 'brand'

    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(255), nullable=False)
And the following script:
entity = Entity()
entity.name = 'random name'
db.session.add(entity)
db.session.commit()
# Just by accessing the property name of the created object a
# SELECT statement is sent to the database.
print entity.name
When I enable echo mode in SQLAlchemy, I can see the INSERT statement in the terminal, plus an extra SELECT that is issued only when I access a property (column) of the model (table row).
If I don't access any property, that query is never issued.
What is the reason for this behavior? In this basic example, we already have the value of the name property assigned to the object, so why is an extra query needed? Is it to guarantee an up-to-date value, or something like that?
By default, SQLAlchemy expires objects in the session when you commit. This is controlled via the expire_on_commit parameter.
The reasoning behind this is that the row behind the instance could have been modified outside of the transaction, so if you are not careful you could run into data races. But if you know what you are doing, you can safely turn it off.
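Turning it off is a one-liner at the session-factory level; a minimal sketch with plain SQLAlchemy (my addition; engine stands in for your configured engine, and with Flask-SQLAlchemy the same option should be reachable via session_options):

from sqlalchemy.orm import sessionmaker

# objects keep their loaded attribute values after commit(), so accessing
# entity.name afterwards emits no refresh SELECT
Session = sessionmaker(bind=engine, expire_on_commit=False)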
I have the following two models (just for a test):
class IdGeneratorModel(models.Model):
    table = models.CharField(primary_key=True, unique=True,
                             null=False, max_length=32)
    last_created_id = models.BigIntegerField(default=0, null=False,
                                             unique=False)

    @staticmethod
    def get_id_for_table(table: str) -> int:
        try:
            last_id_set = IdGeneratorModel.objects.get(table=table)
            new_id = last_id_set.last_created_id + 1
            last_id_set.last_created_id = new_id
            last_id_set.save()
            return new_id
        except IdGeneratorModel.DoesNotExist:
            np = IdGeneratorModel()
            np.table = table
            np.save()
            return IdGeneratorModel.get_id_for_table(table)


class TestDataModel(models.Model):
    class Generator:
        @staticmethod
        def get_id():
            return IdGeneratorModel.get_id_for_table('TestDataModel')

    id = models.BigIntegerField(null=False, primary_key=True,
                                editable=False, auto_created=True,
                                default=Generator.get_id)
    data = models.CharField(max_length=16)
Now I use the normal Django admin site to create a new Test Data Set element. What I expected (and maybe I'm wrong here) is that the Generator.get_id() method is called exactly once, when saving the new dataset to the database. But what really happens is that Generator.get_id() is called three times:
The first time when I click the "add a Test Data Set" button in the admin area
A second time shortly after that (no extra interaction from the user's side)
And a third time when finally saving the new data set
The first time could be OK: this would be the value pre-filled in a form field. Since the primary key field is not displayed in my form, though, this may be an unnecessary call.
The third time is also clear: it's done before saving, when the value is really needed.
The code above is only an example and a test for me. In the real project I have to ask a remote system for an ID instead of another table model. But whenever I query that system, the delivered ID gets locked there, much like the way get_id_for_table() counts up.
I'm sure there are better ways to get an ID from a method only when it's really needed; the method should be called exactly once, when inserting the new dataset. Any idea how to achieve that?
Forgot the version: It's Django 1.8.5 on Python 3.4.
This is not an answer to your question, but it could be a solution to your problem.
I believe this issue is very complicated, especially because you want a transaction that spans a webservice call and a database insert. What I would do in this case: generate a UUID locally. This value is practically guaranteed to be unique in the 4D world (time + location), so use that as the id. Later, when the save is done, sync with your remote services.
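A rough sketch of what that could look like on the model (my illustration; Django 1.8, which the question uses, ships a UUIDField):

import uuid
from django.db import models

class TestDataModel(models.Model):
    # uuid.uuid4 runs locally, exactly where the default is evaluated;
    # no database row or webservice call is needed to reserve an id
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    data = models.CharField(max_length=16)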
I want to create a flat forum, where threads are not a separate table, using a composite primary key for posts.
So posts have two fields forming a natural key: thread_id and post_number, where the former is the ID of the thread they are part of, and the latter is their position in the thread. If you aren't convinced, check below the line.
My problem is that I don't know how to tell SQLAlchemy:
when committing the addition of new Post instances with thread_id tid, look up how many posts with thread_id tid exist, and autoincrement from that number on.
Why do I think that schema is a good idea? Because it's natural and performant:
from sqlalchemy import Column, Integer, Text

class Post(Base):
    number = Column(Integer, primary_key=True, autoincrement=False, nullable=False)
    thread_id = Column(Integer, primary_key=True, autoincrement=False, nullable=False)
    title = Column(Text)  # nullable for non-first posts
    text = Column(Text, nullable=False)
    ...
PAGESIZE = 10

# test
tid = 5
page = 4
Entire Thread (query):
thread5 = session.query(Post).filter_by(thread_id=5)
Thread title:
title = thread5.filter_by(number=0).one().title
Thread page:
page4 = thread5.filter(
    Post.number >= (page * PAGESIZE),
    Post.number < ((page + 1) * PAGESIZE)).all()
# or
page4 = thread5.offset(page * PAGESIZE).limit(PAGESIZE).all()
Number of pages:
ceil(thread5.count() / PAGESIZE)
You can probably do this with an SQL expression as a default value (see the default argument). Give it a callable like this:
from sqlalchemy.sql import func, select

def maxnumber_for_threadid(context):
    # next post number within the same thread; coalesce covers the first
    # post of a thread, where MAX(number) would otherwise be NULL
    return select(
        [func.coalesce(func.max(post_table.c.number), -1) + 1]
    ).where(post_table.c.thread_id == context.current_parameters['thread_id'])
I'm not absolutely sure you can return an SQL expression from a default callable; you may have to actually execute the query and return a scalar value inside the callback. (The cursor should be available from the context parameter.)
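If it turns out you do have to execute it eagerly, a rough sketch of that variant (my addition, not from the original answer; it assumes the execution context exposes a usable connection, which you should verify against your SQLAlchemy version):

def maxnumber_for_threadid(context):
    # run the query immediately and hand the resulting scalar to the INSERT
    return context.connection.execute(
        select([func.coalesce(func.max(post_table.c.number), -1) + 1]).where(
            post_table.c.thread_id == context.current_parameters['thread_id'])
    ).scalar()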
However, I strongly recommend you do what @kindall says and just use another auto-incrementing sequence for the number column. What you want to do is actually very tricky to get right, even without SQLAlchemy. For example, if you are using an MVCC database, you need to introduce special row-level locking so that the number of rows with a matching thread_id does not change while you are running the transaction. How this is done is database-dependent. For example, with MySQL InnoDB you need to do something like this:
BEGIN TRANSACTION;
SELECT MAX(number)+1 FROM posts WHERE thread_id=? FOR UPDATE;
INSERT INTO posts (thread_id, number) VALUES (?, ?); -- number is from previous query
COMMIT;
If you didn't use FOR UPDATE, then another connection trying to insert a new post into the same thread at the same time could conceivably get the same value for number.
So rather than being performant, post inserts are actually quite slow (relatively speaking) because of the extra query and locking required.
All this is resolved by using a separate sequence and not worrying about post number incrementing only within a thread_id.
You should just use a global post number that increments for posts in any thread. Then you don't need to figure out the right number to use for a given thread. A given thread, then, might have posts numbered 7, 20, 42, 51, and so on. This does not matter because you can easily get the number of posts in the thread from the size of the recordset you get back from the query, and you can easily number the posts in the HTML output separately from the actual post numbers.
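To make the display-numbering point concrete, a tiny sketch (my illustration; it assumes the globally sequenced schema kindall proposes, with an auto-incrementing id column on Post):

# fetch a thread's posts ordered by the global id, then number them 1..N
# for display, independently of the gaps in the actual ids (7, 20, 42, ...)
thread_posts = session.query(Post).filter_by(thread_id=tid).order_by(Post.id).all()
for display_number, post in enumerate(thread_posts, start=1):
    print("%d: %s" % (display_number, post.title))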