I have a table, Foo. I run a query on Foo to get the ids from a subset of Foo. I then want to run a more complicated set of queries, but only on those IDs. Is there an efficient way to do this? The best I can think of is creating a query such as:
SELECT ... --complicated stuff
WHERE ... --more stuff
AND id IN (1, 2, 3, 9, 413, 4324, ..., 939393)
That is, I construct a huge "IN" clause. Is this efficient? Is there a more efficient way of doing this, or is the only way to JOIN with the initial query that gets the IDs? If it helps, I'm using SQLObject to connect to a PostgreSQL database, and I have access to the cursor that executed the query to get all the IDs.
UPDATE: I should mention that the more complicated queries all either rely on these IDs, or create more IDs to look up in the other queries. If I were to make one large query, I'd end up joining six tables at once or so, which might be too slow.
One technique I've used in the past is to put the IDs into a temp table, and then use that to drive a sequence of queries. Something like:
BEGIN;
CREATE TEMP TABLE search_result ON COMMIT DROP AS
SELECT entity_id
FROM entity /* long complicated search joins and conditions ... */;
-- Fetch primary entities
SELECT entity_id, entity.x /*, ... */
FROM entity JOIN search_result USING (entity_id);
-- Fetch some related entities
SELECT entity_id, related_entity_id, related_entity.x /*, ... */
FROM related_entity JOIN search_result USING (entity_id);
-- And more, as required
END;
This is particularly useful where the search result entities have multiple one-to-many relationships which you want to fetch without either a) doing N*M+1 selects or b) doing a cartesian join of related entities.
I would think it might be useful to use a VIEW. Simply create a view with your query for IDs, then join to that view via ID. That will limit your results to the required subset of IDs without an expensive IN statement.
I do know that the IN statement is more expensive than an EXISTS statement would be.
I think the join with the criteria to select the IDs will be more efficient, because the query optimizer has more options to do the right thing. Use EXPLAIN to see how PostgreSQL will approach it.
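If it helps to see this from Python, here is a minimal sketch of running EXPLAIN ANALYZE on the join form through a DB-API cursor; psycopg2, the connection string, and the foo/flag names are assumptions for illustration, not part of the question:
# Hedged sketch: ask PostgreSQL which plan it picks for the JOIN form.
# psycopg2, the dbname, and the foo/flag names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.execute("""
    EXPLAIN ANALYZE
    SELECT f.*
    FROM foo f
    JOIN (SELECT id FROM foo WHERE flag) AS ids USING (id)
""")
for (line,) in cur.fetchall():
    print(line)  # one line of the plan per row
conn.close()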
You are almost certainly better off with a join; however, another option is to use a subselect, e.g.
SELECT ... --complicated stuff
WHERE ... --more stuff
AND id IN (select distinct id from Foo where ...)
I was following a tutorial to make my first Flask API (https://medium.com/@dushan14/create-a-web-application-with-python-flask-postgresql-and-deploy-on-heroku-243d548335cc). I finished it, but now I want to write more customized queries with SQLAlchemy and PostgreSQL. My question is how I could do something like this:
query = text("""SELECT enc.*, persona."Persona_Nombre", persona."Persona_Apellido", metodo."MetEnt_Nombre", metodo_e."MetPag_Descripcion"
FROM "Ventas"."Enc_Ventas" AS enc
INNER JOIN "General"."Persona" AS persona ON enc."PersonaId" = persona."PersonaId"
INNER JOIN "Ventas"."Metodo_Entrega" AS metodo ON enc."MetodoEntregaId" = metodo."MetodoEntregaId"
INNER JOIN "General"."Metodo_Pago" AS metodo_e ON enc."MetodoPagoId" = metodo_e."MetodoPagoId"
INNER JOIN "General"."Estatus" AS estado ON enc. """)
but with SQLAlchemy in order to use the models that I created previously. Thanks in advance for any answer!!
Edit:
The columns that I wish to see at the final result are: enc.*, persona."Persona_Nombre", persona."Persona_Apellido", metodo."MetEnt_Nombre", metodo_e."MetPag_Descripcion"
I really wish I could share more info but sadly I can't at the moment.
Doing this from the ORM layer, you would reference model names (I match the names to your query above, though I'm sure some of the model/table names are off; now adjusted slightly).
Now revised to include only the specific columns you want to see (note that I ignore your SQL aliases; the ORM layer handles the actual query construction):
selection = session.query(Enc_Ventas, Persona.Persona_Nombre, Persona.Persona_Apellido, Metodo_Entrega.MetEnt_Nombre, Metodo_Pago.MetPag_Descripcion).\
join(Persona, Enc_Ventas.PersonaId == Persona.PersonaId).\
join(Metodo_Entrega, Enc_Ventas.MetodoEntregaId == Metodo_Entrega.MetodoEntregaId).\
join(Metodo_Pago, Enc_Ventas.MetodoPagoId == Metodo_Pago.MetodoPagoId).\
join(Estatus).all()
You reference the selection collection by iterating through its rows of tuples. A more robust and stable solution would be to transform each output row into a dict.
Otherwise, because whole models are included, the columns of each returned row can be accessed individually with dot notation via the model names used in query().
If you need further access to the columns in the related tables, use the ORM technique of .options(joinedload(myTable)), which will bring in those additional columns in a single database query; you then reach them through the relationship name, again with dot notation.
You also need to define SQLAlchemy relationships within your models for this to work, as well as defining the underlying SQL foreign keys.
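As a rough illustration of those last two points, here is a minimal sketch; the model, table, and column names are guesses based on the question (the "Ventas"/"General" schemas are omitted), so treat it as a pattern rather than working code for your schema:
# Hedged sketch: a relationship on the (assumed) models plus joinedload,
# so the related columns come back in a single query.
from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship, joinedload

Base = declarative_base()

class Persona(Base):
    __tablename__ = 'Persona'
    PersonaId = Column(Integer, primary_key=True)
    Persona_Nombre = Column(String)
    Persona_Apellido = Column(String)

class Enc_Ventas(Base):
    __tablename__ = 'Enc_Ventas'
    EncVentasId = Column(Integer, primary_key=True)
    PersonaId = Column(Integer, ForeignKey('Persona.PersonaId'))
    persona = relationship('Persona')  # the relationship joinedload() uses

# With a session in hand, one query also pulls the related columns:
# ventas = session.query(Enc_Ventas).options(joinedload(Enc_Ventas.persona)).all()
# for v in ventas:
#     print(v.persona.Persona_Nombre, v.persona.Persona_Apellido)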
Much more detail and/or a more specific question is needed to help further, imo.
I need to write an SQL query for MS-Access 2000 so that a row is updated if it exists, but inserted if it does not. (I believe this is called an "upsert")
i.e.
If row exists...
UPDATE Table1 SET (...) WHERE Column1='SomeValue'
If it does not exist...
INSERT INTO Table1 VALUES (...)
Can this be done in one query?
You can simulate an upsert in Access by using an UPDATE query with a LEFT JOIN.
update b
left join a on b.id=a.id
set a.f1=b.f1
, a.f2=b.f2
, a.f3=b.f3
Assuming a unique index on Column1, you can use a DCount expression to determine whether you have zero or one row with Column1 = 'SomeValue'. Then INSERT or UPDATE based on that count.
If DCount("*", "Table1", "Column1 = 'SomeValue'") = 0 Then
Debug.Print "do INSERT"
Else
Debug.Print "do UPDATE"
End If
I prefer this approach to first attempting an INSERT, trapping the 3022 key violation error, and doing an UPDATE in response to the error. However I can't claim huge benefits from my approach. If your table includes an autonumber field, avoiding a failed INSERT would stop you from expending the next autonumber value needlessly. I can also avoid building an INSERT string when it's not needed. The Access Cookbook told me string concatenation is a moderately expensive operation in VBA, so I look for opportunities to avoid building strings unless they're actually needed. This approach will also avoid creating a lock for an unneeded INSERT.
However, none of those reasons may be very compelling for you. And in all honesty I think my preference in this case may be about what "feels right" to me. I agree with this comment by @David-W-Fenton to a previous Stack Overflow question: "It's better to write your SQL so you don't attempt to append values that already exist -- i.e., prevent the error from happening in the first place rather than depending on the database engine to save you from yourself."
An "upsert" is possible, if the tables have a unique key.
This old tip from Smart Access is one of my favourites:
Update and Append Records with One Query
By Alan Biggs
Did you know that you can use an update query in Access to both update
and add records at the same time? This is useful if you have two
versions of a table, tblOld and tblNew, and you want to integrate the
changes from tblNew into tblOld.
Follow these steps:
1. Create an update query and add the two tables. Join the two tables by dragging the key field of tblNew onto the matching field of tblOld.
2. Double-click on the relationship and choose the join option that includes all records from tblNew and only those that match from tblOld.
3. Select all the fields from tblOld and drag them onto the QBE grid.
4. For each field, in the Update To cell type in tblNew.FieldName, where FieldName matches the field name of tblOld.
5. Select Query Properties from the View menu and change Unique Records to False. (This switches off the DISTINCTROW option in the SQL view. If you leave this on you'll get only one blank record in your results, but you want one blank record for each new record to be added to tblOld.)
6. Run the query and you'll see the changes to tblNew are now in tblOld.
This will only add records to tblOld that have been added to tblNew.
Records in tblOld that aren't present in tblNew will still remain in
tblOld.
I usually run the insert statement first and then I check to see if error 3022 occurred, which indicates the row already exists. So something like this:
On Error Resume Next
CurrentDb.Execute "INSERT INTO Table1 (Fields) VALUES (Data)", dbFailOnError
If Err.Number = 3022 Then
Err.Clear
CurrentDb.Execute "UPDATE Table1 SET (Fields = Values) WHERE Column1 = 'SomeValue'", dbFailOnError
ElseIf Err.Number <> 0 Then
'Handle the error here
Err.Clear
End If
Edit1:
I want to mention that what I've posted here is a very common solution but you should be aware that planning on errors and using them as part of the normal flow of your program is generally considered a bad idea, especially if there are other ways of achieving the same results. Thanks to RolandTumble for pointing this out.
You don't need to catch the error. Instead, just run the INSERT statement and then check
CurrentDb.RecordsAffected
It will either be 1 or 0, depending.
Note: It's not good practice to execute against CurrentDB. Better to capture the database to a local variable:
Dim db As DAO.Database
Set db = CurrentDb
db.Execute "INSERT ..."
If db.RecordsAffected = 0 Then
    db.Execute "UPDATE ..."
End If
As others have mentioned, you can UPSERT with an UPDATE LEFT JOIN, using the new table as the left-hand side. This will add all missing records and update matching records, leaving deleted records intact.
If we follow the Create and run an update query article, we end up with SQL that looks like this:
UPDATE Table1
INNER JOIN NewTable1 ON Table1.ID = NewTable1.ID
SET Table1.FirstName = [NewTable1].[FirstName]
but an inner join will only update matching records, it won't add new records. So let's change that INNER to a LEFT:
UPDATE Table1
LEFT JOIN NewTable1 ON Table1.ID = NewTable1.ID
SET Table1.FirstName = [NewTable1].[FirstName]
Now save a copy of the DB. Run a test on the copy before you run this on your primary DB.
There are several parts to this question. I am working with sqlite3 in Python 2.7, but I am less concerned with the exact syntax, and more with the methods I need to use. I think the best way to ask this question is to describe my current database design, and what I am trying to accomplish. I am new to databases in general, so I apologize if I don't always use correct nomenclature.
I am modeling refrigeration systems (using Modelica--not really important to know), and I am using the database to manage input data, results data, and models used for that data.
My top parent table is Model, which contains the columns:
id, name, version, date_created
My child table under Model is called Design. It is used to create a unique id for each combination of design input parameters and the model used. The columns it contains are:
id, model_id, date_created
I then have two child tables under Design, one called Input, and the other called Result. We can just look at Input for now, since one example should be enough. The columns for input are:
id, value, design_id, parameter_id, component_id
parameter_id and component_id are foreign keys to their own tables. The Parameter table has the following columns:
id, name, units
Some example rows for Parameter under name are: length, width, speed, temperature, pressure (there are many dozens more). The Component table has the following columns:
id, name
Some example rows for Component under name are: compressor, heat_exchanger, valve.
Ultimately, in my program I want to search the database for a specific design. I want to be able to search a specific design to be able to grab specific results for that design, or to know whether or not a model simulation with that design has already been run previously, to avoid re-running the same data point.
I also want to be able to grab all the parameters for a given design, and insert it into a class I have created in Python, which is then used to provide inputs to my models. In case it helps for solving the problem, the classes I have created are based on the components. So, for example, I have a compressor class, with attributes like compressor.speed, compressor.stroke, compressor.piston_size. Each of these attributes should have their own row in the Parameter table.
So, how would I query this database efficiently to find if there is a design that matches a long list (let's assume 100+) of parameters with specific values? Just as a side note, my friend helped me design this database. He knows databases, but not my application super well. It is possible that I designed it poorly for what I want to accomplish.
Here is a simple picture trying to map a certain combination of parameters with certain values to a design_id, where I have taken out component_id for simplicity:
Picture of simplified tables
Simply join the necessary tables. Your schema properly reflects normalization (separating tables into logical groupings) and can scale for one-to-many relationships. Specifically, to answer your question ("So, how would I query this database efficiently to find if there is a design that matches a long list (let's assume 100+) of parameters with specific values?"), consider the approaches below:
Inner Join with Where Clause
For a handful of parameters, use an inner join with a WHERE ... IN () clause. The query below returns design fields joined to the input and parameters tables, filtered for specific parameter names, which Python can pass in as parameterized values (even iteratively in a loop):
SELECT d.id, d.model_id, d.date_created
FROM design d
INNER JOIN input i ON d.id = i.design_id
INNER JOIN parameters p ON p.id = i.parameter_id
WHERE p.name IN ('param1', 'param2', 'param3', 'param4', 'param5', ...)
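For instance, a small sketch of passing those names from Python with sqlite3 placeholders; the database path and the example names are placeholders only:
# Hedged sketch: build one "?" placeholder per parameter name and let
# sqlite3 bind the values (table/column names follow the query above).
import sqlite3

db = sqlite3.connect('designs.db')  # placeholder path
cur = db.cursor()

param_names = ['length', 'width', 'speed']  # example subset of the 100+ names
placeholders = ', '.join('?' for _ in param_names)
sql = """
    SELECT d.id, d.model_id, d.date_created
    FROM design d
    INNER JOIN input i ON d.id = i.design_id
    INNER JOIN parameters p ON p.id = i.parameter_id
    WHERE p.name IN ({})
""".format(placeholders)

cur.execute(sql, param_names)
matching_designs = cur.fetchall()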
Inner Join with Temp Table
If the list runs to 100+ values, consider a temp table that filters the parameters table down to the specific parameter values:
# CREATE EMPTY TABLE (SAME STRUCTURE AS parameters)
sql = "CREATE TABLE tempparams AS SELECT id, name, units FROM parameters WHERE 0;"
cur.execute(sql)
db.commit()

# ITERATIVELY APPEND TO TEMP
for i in paramslist:  # LIST OF 100+ ITEMS
    sql = "INSERT INTO tempparams (id, name, units) \
           SELECT p.id, p.name, p.units \
           FROM parameters p \
           WHERE p.name = ?;"
    cur.execute(sql, (i,))  # PARAMETER MUST BE PASSED AS A ONE-ITEM SEQUENCE
    db.commit()             # DB OBJECT COMMIT ACTION
Then, join the main design and input tables to the new temp table holding the specific parameters:
SELECT d.id, d.model_id, d.date_created
FROM design d
INNER JOIN input i ON d.id = i.design_id
INNER JOIN tempparams t ON t.id = i.parameter_id
The same process can work with the components table as well.
A very frequently asked question here is how to do an upsert, which is what MySQL calls INSERT ... ON DUPLICATE KEY UPDATE and the standard supports as part of the MERGE operation.
Given that PostgreSQL doesn't support it directly (before pg 9.5), how do you do this? Consider the following:
CREATE TABLE testtable (
id integer PRIMARY KEY,
somedata text NOT NULL
);
INSERT INTO testtable (id, somedata) VALUES
(1, 'fred'),
(2, 'bob');
Now imagine that you want to "upsert" the tuples (2, 'Joe'), (3, 'Alan'), so the new table contents would be:
(1, 'fred'),
(2, 'Joe'), -- Changed value of existing tuple
(3, 'Alan') -- Added new tuple
That's what people are talking about when discussing an upsert. Crucially, any approach must be safe in the presence of multiple transactions working on the same table - either by using explicit locking, or otherwise defending against the resulting race conditions.
This topic is discussed extensively at Insert, on duplicate update in PostgreSQL?, but that's about alternatives to the MySQL syntax, and it's grown a fair bit of unrelated detail over time. I'm working on definitive answers.
These techniques are also useful for "insert if not exists, otherwise do nothing", i.e. "insert ... on duplicate key ignore".
9.5 and newer:
PostgreSQL 9.5 and newer support INSERT ... ON CONFLICT (key) DO UPDATE (and ON CONFLICT (key) DO NOTHING), i.e. upsert.
Comparison with ON DUPLICATE KEY UPDATE.
Quick explanation.
For usage see the manual - specifically the conflict_action clause in the syntax diagram, and the explanatory text.
Unlike the solutions for 9.4 and older that are given below, this feature works with multiple conflicting rows and it doesn't require exclusive locking or a retry loop.
The commit adding the feature is here and the discussion around its development is here.
If you're on 9.5 and don't need to be backward-compatible you can stop reading now.
9.4 and older:
PostgreSQL doesn't have any built-in UPSERT (or MERGE) facility, and doing it efficiently in the face of concurrent use is very difficult.
This article discusses the problem in useful detail.
In general you must choose between two options:
Individual insert/update operations in a retry loop; or
Locking the table and doing batch merge
Individual row retry loop
Using individual row upserts in a retry loop is the reasonable option if you want many connections concurrently trying to perform inserts.
The PostgreSQL documentation contains a useful procedure that'll let you do this in a loop inside the database. It guards against lost updates and insert races, unlike most naive solutions. It will only work in READ COMMITTED mode and is only safe if it's the only thing you do in the transaction, though. The function won't work correctly if triggers or secondary unique keys cause unique violations.
This strategy is very inefficient. Whenever practical you should queue up work and do a bulk upsert as described below instead.
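If you drive this from application code instead of the in-database procedure, the loop has roughly this shape. This is only a hedged sketch using psycopg2 against the testtable from the question, not the documented plpgsql function:
# Hedged client-side sketch of the retry loop: UPDATE first, INSERT on zero
# rows, and retry if a concurrent insert wins the race (unique violation).
import psycopg2
from psycopg2 import errorcodes

def upsert_row(conn, row_id, somedata, max_retries=5):
    for _ in range(max_retries):
        try:
            with conn:  # one transaction per attempt; commits or rolls back
                with conn.cursor() as cur:
                    cur.execute(
                        "UPDATE testtable SET somedata = %s WHERE id = %s",
                        (somedata, row_id))
                    if cur.rowcount == 1:
                        return
                    cur.execute(
                        "INSERT INTO testtable (id, somedata) VALUES (%s, %s)",
                        (row_id, somedata))
                    return
        except psycopg2.IntegrityError as e:
            if e.pgcode != errorcodes.UNIQUE_VIOLATION:
                raise
            # lost the insert race; loop and try the UPDATE again
    raise RuntimeError("upsert did not converge after %d retries" % max_retries)

# usage (connection details are placeholders):
# conn = psycopg2.connect("dbname=mydb")
# upsert_row(conn, 2, 'Joe')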
Many attempted solutions to this problem fail to consider rollbacks, so they result in incomplete updates. Two transactions race with each other; one of them successfully INSERTs; the other gets a duplicate key error and does an UPDATE instead. The UPDATE blocks waiting for the INSERT to rollback or commit. When it rolls back, the UPDATE condition re-check matches zero rows, so even though the UPDATE commits it hasn't actually done the upsert you expected. You have to check the result row counts and re-try where necessary.
Some attempted solutions also fail to consider SELECT races. If you try the obvious and simple:
-- THIS IS WRONG. DO NOT COPY IT. It's an EXAMPLE.
BEGIN;
UPDATE testtable
SET somedata = 'blah'
WHERE id = 2;
-- Remember, this is WRONG. Do NOT COPY IT.
INSERT INTO testtable (id, somedata)
SELECT 2, 'blah'
WHERE NOT EXISTS (SELECT 1 FROM testtable WHERE testtable.id = 2);
COMMIT;
then when two run at once there are several failure modes. One is the already discussed issue with an update re-check. Another is where both UPDATE at the same time, matching zero rows and continuing. Then they both do the EXISTS test, which happens before the INSERT. Both get zero rows, so both do the INSERT. One fails with a duplicate key error.
This is why you need a re-try loop. You might think that you can prevent duplicate key errors or lost updates with clever SQL, but you can't. You need to check row counts or handle duplicate key errors (depending on the chosen approach) and re-try.
Please don't roll your own solution for this. Like with message queuing, it's probably wrong.
Bulk upsert with lock
Sometimes you want to do a bulk upsert, where you have a new data set that you want to merge into an older existing data set. This is vastly more efficient than individual row upserts and should be preferred whenever practical.
In this case, you typically follow this process:
CREATE a TEMPORARY table
COPY or bulk-insert the new data into the temp table
LOCK the target table IN EXCLUSIVE MODE. This permits other transactions to SELECT, but not make any changes to the table.
Do an UPDATE ... FROM of existing records using the values in the temp table;
Do an INSERT of rows that don't already exist in the target table;
COMMIT, releasing the lock.
For example, for the example given in the question, using multi-valued INSERT to populate the temp table:
BEGIN;
CREATE TEMPORARY TABLE newvals(id integer, somedata text);
INSERT INTO newvals(id, somedata) VALUES (2, 'Joe'), (3, 'Alan');
LOCK TABLE testtable IN EXCLUSIVE MODE;
UPDATE testtable
SET somedata = newvals.somedata
FROM newvals
WHERE newvals.id = testtable.id;
INSERT INTO testtable
SELECT newvals.id, newvals.somedata
FROM newvals
LEFT OUTER JOIN testtable ON (testtable.id = newvals.id)
WHERE testtable.id IS NULL;
COMMIT;
Related reading
UPSERT wiki page
UPSERTisms in Postgres
Insert, on duplicate update in PostgreSQL?
http://petereisentraut.blogspot.com/2010/05/merge-syntax.html
Upsert with a transaction
Is SELECT or INSERT in a function prone to race conditions?
SQL MERGE on the PostgreSQL wiki
Most idiomatic way to implement UPSERT in Postgresql nowadays
What about MERGE?
SQL-standard MERGE actually has poorly defined concurrency semantics and is not suitable for upserting without locking a table first.
It's a really useful OLAP statement for data merging, but it's not actually a useful solution for concurrency-safe upsert. There's lots of advice to people using other DBMSes to use MERGE for upserts, but it's actually wrong.
Other DBs:
INSERT ... ON DUPLICATE KEY UPDATE in MySQL
MERGE from MS SQL Server (but see above about MERGE problems)
MERGE from Oracle (but see above about MERGE problems)
Here are some examples for insert ... on conflict ... (pg 9.5+):
Insert, on conflict - do nothing.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict do nothing;
Insert, on conflict - do update, specify conflict target via column.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict(id)
do update set name = 'new_name', size = 3;
Insert, on conflict - do update, specify conflict target via constraint name.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict on constraint dummy_pkey
do update set name = 'new_name', size = 4;
I would like to contribute another solution for the single-insertion problem with the pre-9.5 versions of PostgreSQL. The idea is simply to try to perform the insertion first and, in case the record is already present, to update it:
do $$
begin
insert into testtable(id, somedata) values(2,'Joe');
exception when unique_violation then
update testtable set somedata = 'Joe' where id = 2;
end $$;
Note that this solution can be applied only if rows are not being deleted from the table.
I do not know about the efficiency of this solution, but it seems to me reasonable enough.
SQLAlchemy upsert for Postgres >=9.5
Since the large post above covers many different SQL approaches for Postgres versions (not only non-9.5 as in the question), I would like to add how to do it in SQLAlchemy if you are using Postgres 9.5. Instead of implementing your own upsert, you can also use SQLAlchemy's functions (which were added in SQLAlchemy 1.1). Personally, I would recommend using these, if possible. Not only because of convenience, but also because it lets PostgreSQL handle any race conditions that might occur.
Cross-posting from another answer I gave yesterday (https://stackoverflow.com/a/44395983/2156909)
SQLAlchemy supports ON CONFLICT now with two methods on_conflict_do_update() and on_conflict_do_nothing():
Copying from the documentation:
from sqlalchemy.dialects.postgresql import insert
stmt = insert(my_table).values(user_email='a@b.com', data='inserted data')
stmt = stmt.on_conflict_do_update(
index_elements=[my_table.c.user_email],
    index_where=my_table.c.user_email.like('%@gmail.com'),
set_=dict(data=stmt.excluded.data)
)
conn.execute(stmt)
http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html?highlight=conflict#insert-on-conflict-upsert
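For completeness, here is a minimal sketch of the other method mentioned above, on_conflict_do_nothing(), reusing the my_table and conn objects from the documentation snippet:
from sqlalchemy.dialects.postgresql import insert

# Hedged sketch: skip the row entirely if the key already exists.
stmt = insert(my_table).values(user_email='a@b.com', data='inserted data')
stmt = stmt.on_conflict_do_nothing(index_elements=[my_table.c.user_email])
conn.execute(stmt)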
MERGE in PostgreSQL v. 15
Since PostgreSQL v. 15, it is possible to use the MERGE command. It has actually been presented as the first of the main improvements of this new version.
It uses a WHEN MATCHED / WHEN NOT MATCHED conditional in order to choose the behaviour when there is an existing row with the same criteria.
It is even better than the standard UPSERT, as the new feature gives full control to INSERT, UPDATE, or DELETE rows in bulk.
MERGE INTO customer_account ca
USING recent_transactions t
ON t.customer_id = ca.customer_id
WHEN MATCHED THEN
UPDATE SET balance = balance + transaction_value
WHEN NOT MATCHED THEN
INSERT (customer_id, balance)
VALUES (t.customer_id, t.transaction_value)
WITH UPD AS (UPDATE TEST_TABLE SET SOME_DATA = 'Joe' WHERE ID = 2
RETURNING ID),
INS AS (SELECT '2', 'Joe' WHERE NOT EXISTS (SELECT * FROM UPD))
INSERT INTO TEST_TABLE(ID, SOME_DATA) SELECT * FROM INS
Tested on PostgreSQL 9.3
Since this question was closed, I'm posting here how you can do it using SQLAlchemy. Via recursion, it retries a bulk insert or update to combat race conditions and validation errors.
First the imports
import itertools as it
from functools import partial
from operator import itemgetter
from sqlalchemy.exc import IntegrityError
from app import session
from models import Posts
Now a couple helper functions
def chunk(content, chunksize=None):
"""Groups data into chunks each with (at most) `chunksize` items.
https://stackoverflow.com/a/22919323/408556
"""
if chunksize:
i = iter(content)
generator = (list(it.islice(i, chunksize)) for _ in it.count())
else:
generator = iter([content])
return it.takewhile(bool, generator)
def gen_resources(records):
"""Yields a dictionary if the record's id already exists, a row object
otherwise.
"""
ids = {item[0] for item in session.query(Posts.id)}
for record in records:
is_row = hasattr(record, 'to_dict')
if is_row and record.id in ids:
# It's a row but the id already exists, so we need to convert it
# to a dict that updates the existing record. Since it is duplicate,
# also yield True
yield record.to_dict(), True
elif is_row:
# It's a row and the id doesn't exist, so no conversion needed.
# Since it's not a duplicate, also yield False
yield record, False
elif record['id'] in ids:
# It's a dict and the id already exists, so no conversion needed.
# Since it is duplicate, also yield True
yield record, True
else:
# It's a dict and the id doesn't exist, so we need to convert it.
# Since it's not a duplicate, also yield False
yield Posts(**record), False
And finally the upsert function
def upsert(data, chunksize=None):
for records in chunk(data, chunksize):
resources = gen_resources(records)
sorted_resources = sorted(resources, key=itemgetter(1))
for dupe, group in it.groupby(sorted_resources, itemgetter(1)):
items = [g[0] for g in group]
if dupe:
_upsert = partial(session.bulk_update_mappings, Posts)
else:
_upsert = session.add_all
try:
_upsert(items)
session.commit()
except IntegrityError:
# A record was added or deleted after we checked, so retry
#
# modify accordingly by adding additional exceptions, e.g.,
# except (IntegrityError, ValidationError, ValueError)
                session.rollback()
upsert(items)
except Exception as e:
# Some other error occurred so reduce chunksize to isolate the
# offending row(s)
                session.rollback()
num_items = len(items)
if num_items > 1:
upsert(items, num_items // 2)
else:
print('Error adding record {}'.format(items[0]))
Here's how you use it
>>> data = [
... {'id': 1, 'text': 'updated post1'},
... {'id': 5, 'text': 'updated post5'},
... {'id': 1000, 'text': 'new post1000'}]
...
>>> upsert(data)
The advantage this has over bulk_save_objects is that it can handle relationships, error checking, etc on insert (unlike bulk operations).
class Category(models.Model):
pass
class Item(models.Model):
cat = models.ForeignKey(Category)
I want to select exactly one item for each category; what is the query syntax to do this?
Your question isn't entirely clear: since you didn't say otherwise, I'm going to assume that you don't care which item is selected for each category, just that you need any one. If that isn't the case, please update the question to clarify.
tl;dr version: there is no documented way to explicitly use GROUP BY statements in Django, except by using a raw query. See the bottom for code to do so.
The problem is that doing what you're looking for in SQL itself requires a bit of a hack. You can easily try this example by entering sqlite3 :memory: at the command line:
CREATE TABLE category
(
id INT
);
CREATE TABLE item
(
id INT,
category_id INT
);
INSERT INTO category VALUES (1);
INSERT INTO category VALUES (2);
INSERT INTO category VALUES (3);
INSERT INTO item VALUES (1,1);
INSERT INTO item VALUES (2,2);
INSERT INTO item VALUES (3,3);
INSERT INTO item VALUES (4,1);
INSERT INTO item VALUES (5,2);
SELECT id, category_id, COUNT(category_id) FROM item GROUP BY category_id;
returns
4|1|2
5|2|2
3|3|1
Which is what you're looking for (one item id for each category id), albeit with an extraneous COUNT. The count (or some other aggregate function) is needed in order to apply the GROUP BY.
Note: this will ignore categories that don't contain any items, which seems like sensible behaviour.
Now the question becomes, how to do this in Django?
The obvious answer is to use Django's aggregation/annotation support, in particular combining annotate with values, as is recommended elsewhere for GROUP BY queries in Django.
Reading those posts, it would seem we could accomplish what we're looking for with
Item.objects.values('id').annotate(unneeded_count=Count('category_id'))
However this doesn't work. What Django does here is not just GROUP BY "category_id", but groups by all fields selected (i.e. GROUP BY "id", "category_id") [1]. I don't believe there is a way (in the public API, at least) to change this behaviour.
The solution is to fall back to raw SQL:
qs = Item.objects.raw('SELECT *, COUNT(category_id) FROM myapp_item GROUP BY category_id')
[1]: Note that you can inspect what queries Django is running with:
from django.db import connection
print(connection.queries[-1])
Edit:
There are a number of other possible approaches, but most have (possibly severe) performance problems. Here are a couple:
1. Select an item from each category.
items = []
for c in Category.objects.all():
    items.append(c.item_set.all()[0])
This is a clearer and more flexible approach, but it has the obvious disadvantage of requiring many more database hits.
2. Use select_related
items = Item.objects.select_related()
and then do the grouping/filtering yourself (in Python).
Again, this is perhaps more clear than using raw SQL and only requires one query, but this one query could be very large (it will return all items and their categories) and doing the grouping/filtering yourself is probably less efficient than letting the database do it for you.
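A rough sketch of this second approach, keeping the first item seen per category; it assumes the Item model from the question lives in myapp (matching the raw-SQL table name above):
# Hedged sketch: one query with select_related, then group in Python by
# keeping the first Item encountered for each category.
from myapp.models import Item  # the models defined in the question

first_per_category = {}
for item in Item.objects.select_related('cat'):
    first_per_category.setdefault(item.cat_id, item)

one_item_per_category = list(first_per_category.values())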