SQLite: Batch aggregation with insert - python

I would like to do the following:
cur.execute("SELECT key, SUM(val) FROM table GROUP BY key")
cur.executemany("INSERT INTO table_sums VALUES(?,?)",(row for row in cur))
in a single SQLite statement, with batch processing if possible: that is, it computes the sums for only a subset of keys, inserts them, and continues until all keys are processed.
I am using Python right now, but since I am asking for a single statement (if one exists), I don't think this should matter. If it doesn't exist, is there perhaps an efficient(!) workaround in Python?
EDIT: To avoid a SELECT WHERE query, it would actually be desirable not to produce complete sums for a subset of keys, but to just sum over the first n rows and store the resulting sums so far, then continue with the next n...

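The two statements can be combined into one using a common table expression (CTE):
WITH tempsums AS
(SELECT key, SUM(val) FROM table
WHERE key IN (:k1, :k2, :k3)
GROUP BY key)
INSERT INTO table_sums SELECT * FROM tempsums;
Here :k1, :k2, :k3 stand for the keys of the current batch. SQLite cannot bind a whole list to a single parameter, so the IN list needs one placeholder per key. Note also that WHERE must come before GROUP BY.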
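A minimal sketch of the batched variant from Python, assuming the standard sqlite3 module. The source table is called items here because table is a reserved word in SQLite; that name, the batch size of 2, and the helper insert_sums_for are illustrative assumptions, not part of the original question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "items" stands in for the question's source table (hypothetical name).
conn.execute("CREATE TABLE items (key TEXT, val INTEGER)")
conn.execute("CREATE TABLE table_sums (key TEXT PRIMARY KEY, total INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)],
)


def insert_sums_for(conn, batch):
    """Aggregate and insert the sums for one batch of keys in a single statement."""
    placeholders = ",".join("?" * len(batch))  # one placeholder per key
    conn.execute(
        "INSERT INTO table_sums "
        "SELECT key, SUM(val) FROM items "
        f"WHERE key IN ({placeholders}) GROUP BY key",
        batch,
    )


# Collect the distinct keys once, then process them in fixed-size batches.
keys = [row[0] for row in conn.execute("SELECT DISTINCT key FROM items ORDER BY key")]
BATCH_SIZE = 2  # assumed value for the demo
for start in range(0, len(keys), BATCH_SIZE):
    insert_sums_for(conn, keys[start:start + BATCH_SIZE])
conn.commit()

sums = dict(conn.execute("SELECT key, total FROM table_sums"))
```

Each batch still needs a WHERE clause, so this does not address the EDIT in the question; it only shows how the SELECT and the INSERT collapse into one statement per batch.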

Related

ON DUPLICATE KEY UPDATE non-index columns

I have code that updates a few MySQL tables with data coming from a Sybase database. The table structures are exactly the same.
Since the number of tables may increase in the future, I wrote a Python script that loops over an array of table names and builds the insert statement dynamically, based on the number of columns in each table:
'''insert into databaseName.{} ({}) values ({})'''.format(table, columns, parameters)
As you can see, the value parameters are not hardcoded, which is why I can't see how to modify this query to do an "ON DUPLICATE KEY UPDATE".
For example, the insert statement may look like:
insert into databaseName.table_foo (col1,col2,col3,col4,col5) values (%s,%s,%s,%s,%s)
or
insert into databaseName.table_bar (col1,col2,col3) values (%s,%s,%s)
How can I use "ON DUPLICATE KEY UPDATE" here to update the non-index columns with their corresponding new values?
I can update this question by including more details if needed.
The easiest solution is this:
'''replace into databaseName.{} ({}) values ({})'''.format(table, columns, parameters)
This works similarly to INSERT ... ON DUPLICATE KEY UPDATE (IODKU): if the values conflict with a PRIMARY KEY or UNIQUE KEY of the table, it replaces the row, overwriting the other columns, instead of raising a duplicate key error.
The difference is that REPLACE does a DELETE of the old row followed by an INSERT of the new row, whereas IODKU does either an INSERT or an UPDATE. You can observe this by creating triggers on the table and seeing which ones are activated.
Anyway, using REPLACE would make your task a lot simpler in this case.
If you must use IODKU, you need to append an assignment list after the UPDATE keyword. Unfortunately, there is no syntax for "assign all the columns respectively to the new row's values." You must assign them individually.
For MySQL 8.0.19 or later use this syntax:
INSERT INTO t1 (a,b,c) VALUES (?,?,?) AS new
ON DUPLICATE KEY UPDATE a = new.a, b = new.b, c = new.c;
In earlier MySQL, use this syntax:
INSERT INTO t1 (a,b,c) VALUES (?,?,?)
ON DUPLICATE KEY UPDATE a = VALUES(a), b = VALUES(b), c = VALUES(c);
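Because the assignments have to be listed one by one, they can be generated from the same column list that already drives the INSERT. A minimal sketch, assuming the pre-8.0.19 VALUES() syntax; the helper name build_upsert and the key_columns argument are hypothetical, and the table and column names must come from a trusted source since they are interpolated directly into the SQL.

```python
def build_upsert(table, columns, key_columns=()):
    """Build an INSERT ... ON DUPLICATE KEY UPDATE statement for one table.

    columns      -- all column names, in the order the values will be supplied
    key_columns  -- columns to leave out of the UPDATE part (e.g. the primary key)
    """
    col_list = ", ".join(columns)
    params = ", ".join(["%s"] * len(columns))
    # One "col = VALUES(col)" assignment per non-key column.
    updates = ", ".join(
        f"{col} = VALUES({col})" for col in columns if col not in key_columns
    )
    return (
        f"INSERT INTO databaseName.{table} ({col_list}) VALUES ({params}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )


sql = build_upsert("table_bar", ["col1", "col2", "col3"], key_columns=["col1"])
```

If every column is a key column, the assignment list is empty and the statement is invalid, so that case needs separate handling (for example a plain INSERT IGNORE).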

Delete first row from SQLITE table in python

It's a simple question: how can I delete just the first row from a table without having to give a search criterion?
Normally it is:
c.execute('DELETE FROM name_table WHERE tada=?', (tadida,))
I just want to delete the first row, without the WHERE part. The reason is that I want to create a FIFO table (a queue): add at the bottom and delete from the top.
I can do this by keeping track of time and date or by giving the rows an ID, but I would prefer the described method.
Thanks.
I just want to delete first row
SQL tables have no inherent ordering, so there is no defined concept of first row, unless a column (or a set of columns) is specified for ordering.
Assuming that you do have an ordering column, say id, you can use LIMIT to restrict which row is deleted:
delete from mytable order by id limit 1
This removes the record that has the smallest id from the table.
ORDER BY and LIMIT can only be used with DELETE if SQLite was compiled with the SQLITE_ENABLE_UPDATE_DELETE_LIMIT option.
Some OS-distributed builds have that option and some don't. If yours doesn't, and building and installing your own copy is beyond your comfort level, here is an alternative. It assumes a column named id is used for ordering, with the smallest value of id being the oldest record:
DELETE FROM yourtable WHERE id = (SELECT min(id) FROM yourtable);
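A minimal sketch of a FIFO table built on that statement, assuming Python's standard sqlite3 module. The table name queue, its columns, and the helpers push and delete_oldest are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An INTEGER PRIMARY KEY column gives every row an ordering value automatically.
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, payload TEXT)")


def push(conn, payload):
    """Add a row at the 'bottom' of the queue."""
    conn.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))


def delete_oldest(conn):
    """Remove the row with the smallest id; works without ORDER BY/LIMIT on DELETE."""
    conn.execute("DELETE FROM queue WHERE id = (SELECT min(id) FROM queue)")


for item in ("first", "second", "third"):
    push(conn, item)

delete_oldest(conn)
remaining = [row[0] for row in conn.execute("SELECT payload FROM queue ORDER BY id")]
```

On an empty table the subquery returns NULL, so the DELETE simply matches nothing.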

Copy row from Cassandra database and then insert it using Python

I'm using the DataStax Python Driver for Apache Cassandra.
I want to read 100 rows from the database, change one value in each, and then insert them into the database again. I do not want to lose the previous records.
I know how to get my rows:
rows = session.execute('SELECT * FROM columnfamily LIMIT 100;')
for myrecord in rows:
    print(myrecord.timestamp)
I know how to insert new rows into database:
stmt = session.prepare('''
INSERT INTO columnfamily (rowkey, qualifier, info, act_date, log_time)
VALUES (?, ?, ?, ?, ?)
IF NOT EXISTS
''')
results = session.execute(stmt, [arg1, arg2, ...])
My problems are:
I do not know how to change only one value in a row.
I don't know how to insert rows into the database without using CQL. My column family has more than 150 columns, and writing all their names in the query does not seem like the best idea.
To conclude:
Is there a way to get rows, modify one value in each of them, and then insert these rows into the database without relying only on CQL?
First, select only the columns you need from Cassandra; transferring less data is faster. You need to include all columns of the primary key plus the column that you want to change.
After you get the data, you can use the UPDATE command to update only the necessary column (example from the documentation):
UPDATE cycling.cyclist_name
SET comments = 'Rides hard, gets along with others, a real winner'
WHERE id = fb372533-eb95-4bb4-8685-6ef61e994caa
You can also use a prepared statement to make it more performant.
But be careful: UPDATE and INSERT in CQL are really upserts, so if you change columns that are part of the primary key, a new entry is created instead of the existing one being modified.
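If you do want to re-insert whole rows, the column list does not have to be typed by hand: it can be derived from a row that was read back. A minimal sketch, assuming the driver returns rows as named tuples (its default row factory, as far as I know); a plain namedtuple stands in for a driver row here so the snippet runs without a cluster, and the helper build_insert is hypothetical.

```python
from collections import namedtuple


def build_insert(table, row, **changes):
    """Return (cql, params) that re-insert `row` with some columns overridden."""
    data = row._asdict()          # column name -> value, in column order
    data.update(changes)          # apply the modified value(s)
    columns = list(data)
    cql = "INSERT INTO {} ({}) VALUES ({})".format(
        table,
        ", ".join(columns),
        ", ".join("?" for _ in columns),
    )
    return cql, [data[col] for col in columns]


# Stand-in for a row returned by session.execute('SELECT * FROM columnfamily ...').
Row = namedtuple("Row", ["rowkey", "qualifier", "info"])
cql, params = build_insert("columnfamily", Row("k1", "q1", "old"), info="new")

# With a live session this would be executed as, for example:
#   session.execute(session.prepare(cql), params)
```

Since every row of one table yields the same statement text, it can be prepared once and reused for all 100 rows. Unless the changed column is part of the primary key, this overwrites the existing row rather than adding a second one.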

Optimizing an Update statement with many records in SQLAlchemy

I am trying to update many records at a time using SQLAlchemy, but am finding it to be very slow. Is there an optimal way to perform this?
For some reference, I am performing an update on 40,000 records and it took about 1 hour.
Below is the code I am using. The table_name refers to the table which is loaded, the column is the single column which is to be updated, and the pairs refer to the primary key and new value for the column.
from sqlalchemy import Table, bindparam

def update_records(table_name, column, pairs):
    # Reflect the existing table definition from the database
    table = Table(table_name, db.MetaData, autoload=True,
                  autoload_with=db.engine)
    conn = db.engine.connect()
    # One parameter dict per row: primary key and new column value
    values = []
    for id, value in pairs:
        values.append({'row_id': id, 'match_value': str(value)})
    stmt = table.update().where(table.c.id == bindparam('row_id')).values({column: bindparam('match_value')})
    conn.execute(stmt, values)
Passing a list of arguments to execute() essentially issues 40k individual UPDATE statements, which is going to have a lot of overhead. The solution for this is to increase the number of rows per query. For MySQL, this means inserting into a temp table and then doing an update:
# assuming temp table already created
conn.execute(temp_table.insert().values(values))
conn.execute(table.update().values({column: temp_table.c.match_value})
                           .where(table.c.id == temp_table.c.row_id))
Alternatively, you can use INSERT ... ON DUPLICATE KEY UPDATE to avoid creating the temp table. Older SQLAlchemy releases do not support that natively, so there you need a custom compiled construct (e.g. this gist); since version 1.2, the MySQL dialect's insert() construct provides on_duplicate_key_update().
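The temp-table idea can be tried out in plain SQL without a MySQL server. The following sketch uses Python's standard sqlite3 module purely for illustration, so it is not the SQLAlchemy code above; the table names records and updates are assumptions, and SQLite needs a correlated subquery where MySQL would allow a multi-table UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, match_value TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(1, "old"), (2, "old"), (3, "old")],
)

# (primary key, new value) pairs, as in the question.
pairs = [(1, "x"), (3, "z")]

# Step 1: bulk-load the new values into a temporary table.
conn.execute("CREATE TEMP TABLE updates (row_id INTEGER PRIMARY KEY, match_value TEXT)")
conn.executemany("INSERT INTO updates VALUES (?, ?)", pairs)

# Step 2: a single UPDATE applies all of them at once.
conn.execute(
    """
    UPDATE records
    SET match_value = (SELECT u.match_value FROM updates AS u
                       WHERE u.row_id = records.id)
    WHERE id IN (SELECT row_id FROM updates)
    """
)
conn.commit()

result = conn.execute("SELECT id, match_value FROM records ORDER BY id").fetchall()
```

The WHERE clause on the outer UPDATE matters: without it, rows that have no entry in the temp table would be set to NULL.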
According to the "fast execution helpers" section of the documentation, batched update statements can be issued as one statement. In my experiments, this trick reduced update or deletion time from 30 minutes to 1 minute.
engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    executemany_mode='values_plus_batch',
    executemany_values_page_size=5000, executemany_batch_page_size=5000)

SQLite get id and insert when using executemany

I am optimising my code and reducing the number of queries. These used to be in a loop, but I am trying to restructure my code to work like this. How do I get the second query working so that, for each row, it uses the id generated by the first query? Assume that the datasets are in the right order too.
self.c.executemany("INSERT INTO nodes (node_value, node_group) values (?, (SELECT node_group FROM nodes WHERE node_id = ?)+1)", new_values)
#my problem is here
new_id = self.c.lastrowid
connection_values.append((node_id, new_id))
#insert entry
self.c.executemany("INSERT INTO connections (parent, child, strength) VALUES (?,?,1)", connection_values)
These queries used to be in a for loop but were taking too long, so I am trying to avoid the loop and running each query individually. I believe there might be a way to combine them into one query, but I am unsure how this would be done.
You will need to either insert rows one at a time or read back the rowids that were picked by SQLite's ID assignment logic; as documented in Autoincrement in SQLite, there is no guarantee that the IDs generated will be consecutive and trying to guess them in client code is a bad idea.
If your program is single-threaded, you can read the IDs back as follows:
1. Set the AUTOINCREMENT keyword in your table definition. This guarantees that any generated row IDs will be higher than any that appear in the table currently.
2. Immediately before the first statement, determine the highest ROWID in use in the table:
   oldmax ← Execute("SELECT max(ROWID) FROM nodes")
3. Perform the first insert as before.
4. Read back the row IDs that were actually assigned with a SELECT statement:
   NewNodes ← Execute("SELECT ROWID FROM nodes WHERE ROWID > ? ORDER BY ROWID ASC", oldmax)
5. Construct the connection_values array by combining the parent ID from new_values and the child ID from NewNodes.
6. Perform the second insert as before.
This may or may not be faster than your original code; AUTOINCREMENT can slow down performance, and without actually doing the experiment there's no way to tell.
If your program is writing to nodes from multiple threads, you'll need to guard this algorithm with a mutex as it will not work at all with multiple concurrent writers.
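The read-back procedure can be sketched with Python's sqlite3 module as follows. The schema is an assumption reconstructed from the queries in the question, new_values is assumed to hold (node_value, parent node_id) pairs, and the sketch relies on a single writer and on rows being inserted in list order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema; AUTOINCREMENT keeps new row IDs above all existing ones.
conn.execute(
    "CREATE TABLE nodes (node_id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "node_value TEXT, node_group INTEGER)"
)
conn.execute("CREATE TABLE connections (parent INTEGER, child INTEGER, strength INTEGER)")
conn.execute("INSERT INTO nodes (node_value, node_group) VALUES ('root', 0)")

new_values = [("a", 1), ("b", 1)]  # (node_value, parent node_id)
cur = conn.cursor()

# Highest ROWID before the batch insert (0 if the table is empty).
oldmax = cur.execute("SELECT COALESCE(max(ROWID), 0) FROM nodes").fetchone()[0]

# First insert, unchanged from the question.
cur.executemany(
    "INSERT INTO nodes (node_value, node_group) "
    "VALUES (?, (SELECT node_group FROM nodes WHERE node_id = ?) + 1)",
    new_values,
)

# Read back the IDs that were actually assigned, in insertion order.
new_ids = [
    row[0]
    for row in cur.execute(
        "SELECT ROWID FROM nodes WHERE ROWID > ? ORDER BY ROWID ASC", (oldmax,)
    )
]

# Pair each parent ID with the ID of the child that was just created.
connection_values = [(parent, child) for (_, parent), child in zip(new_values, new_ids)]

# Second insert, unchanged from the question.
cur.executemany(
    "INSERT INTO connections (parent, child, strength) VALUES (?, ?, 1)",
    connection_values,
)
conn.commit()
```

Whether this beats the original per-row loop still has to be measured, as noted above.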
