Python: sqlite3 - how to speed up updating of the database

Python: sqlite3 - how to speed up updating of the database - python

I have a database, which I store as a .db file on my disk. I implemented all the function neccessary for managing this database using sqlite3. However, I noticed that updating the rows in the table takes a large amount of time. My database has currently 608042 rows. The database has one table - let's call it Table1. This table consists of the following columns:
id | name | age | address | job | phone | income
(id value is generated automaticaly while a row is inserted to the database).
After reading-in all the rows I perform some operations (ML algorithms for predicting the income) on the values from the rows, and next I have to update (for each row) the value of income (thus, for each one from 608042 rows I perform the SQL update operation).
In order to update, I'm using the following function (copied from my class):
def update_row(self, new_value, idkey):
update_query = "UPDATE Table1 SET income = ? WHERE name = ?" %
self.cursor.execute(update_query, (new_value, idkey))
self.db.commit()
And I call this function for each person registered in the database.
for each i out of 608042 rows:
update_row(new_income_i, i.name)
(values of new_income_i are different for each i).
This takes a huge amount of time, even though the dataset is not giant. Is there any way to speed up the updating of the database? Should I use something else than sqlite3? Or should I instead of storing the database as a .db file store it in memory (using sqlite3.connect(":memory:"))?

Each UPDATE statement must scan the entire table to find any row(s) that match the name.
An index on the name column would prevent this and make the search much faster. (See Query Planning and How does database indexing work?)
However, if the name column is not unique, then that value is not even suitable to find individual rows: each update with a duplicate name would modify all rows with the same name. So you should use the id column to identify the row to be updated; and as the primary key, this column already has an implicit index.

Related

Best way to perform bulk insert SQLAlchemy

I have a tabled called products
which has following columns
id, product_id, data, activity_id
What I am essentially trying to do is copy bulk of existing products and update it's activity_id and create new entry in the products table.
Example:
I already have 70 existing entries in products with activity_id 2
Now I want to create another 70 entries with same data except for updated activity_id
I could have thousands of existing entries that I'd like to make a copy of and update the copied entries activity_id to be a new id.
products = self.session.query(model.Products).filter(filter1, filter2).all()
This returns all the existing products for a filter.
Then I iterate through products, then simply clone existing products and just update activity_id field.
for product in products:
product.activity_id = new_id
self.uow.skus.bulk_save_objects(simulation_skus)
self.uow.flush()
self.uow.commit()
What is the best/ fastest way to do these bulk entries so it kills time, as of now it's OK performance, is there a better solution?

You don't need to load these objects locally, all you really want to do is have the database create these rows.
You essentially want to run a query that creates the rows from the existing rows:
INSERT INTO product (product_id, data, activity_id)
SELECT product_id, data, 2 -- the new activity_id value
FROM product
WHERE activity_id = old_id
The above query would run entirely on the database server; this is far preferable over loading your query into Python objects, then sending all the Python data back to the server to populate INSERT statements for each new row.
Queries like that are something you could do with SQLAlchemy core, the half of the API that deals with generating SQL statements. However, you can use a query built from a declarative ORM model as a starting point. You'd need to
Access the Table instance for the model, as that then lets you create an INSERT statement via the Table.insert() method.
You could also get the same object from models.Product query, more on that later.
Access the statement that would normally fetch the data for your Python instances for your filtered models.Product query; you can do so via the Query.statement property.
Update the statement to replace the included activity_id column with your new value, and remove the primary key (I'm assuming that you have an auto-incrementing primary key column).
Apply that updated statement to the Insert object for the table via Insert.from_select().
Execute the generated INSERT INTO ... FROM ... query.
Step 1 can be achieved by using the SQLAlchemy introspection API; the inspect() function, applied to a model class, gives you a Mapper instance, which in turn has a Mapper.local_table attribute.
Steps 2 and 3 require a little juggling with the Select.with_only_columns() method to produce a new SELECT statement where we swapped out the column. You can't easily remove a column from a select statement but we can, however, use a loop over the existing columns in the query to 'copy' them across to the new SELECT, and at the same time make our replacement.
Step 4 is then straightforward, Insert.from_select() needs to have the columns that are inserted and the SELECT query. We have both as the SELECT object we have gives us its columns too.
Here is the code for generating your INSERT; the **replace keyword arguments are the columns you want to replace when inserting:
from sqlalchemy import inspect, literal
from sqlalchemy.sql import ClauseElement
def insert_from_query(model, query, **replace):
# The SQLAlchemy core definition of the table
table = inspect(model).local_table
# and the underlying core select statement to source new rows from
select = query.statement
# validate asssumptions: make sure the query produces rows from the above table
assert table in select.froms, f"{query!r} must produce rows from {model!r}"
assert all(c.name in select.columns for c in table.columns), f"{query!r} must include all {model!r} columns"
# updated select, replacing the indicated columns
as_clause = lambda v: literal(v) if not isinstance(v, ClauseElement) else v
replacements = {name: as_clause(value).label(name) for name, value in replace.items()}
from_select = select.with_only_columns([
replacements.get(c.name, c)
for c in table.columns
if not c.primary_key
])
return table.insert().from_select(from_select.columns, from_select)
I included a few assertions about the model and query relationship, and the code accepts arbitrary column clauses as replacements, not just literal values. You could use func.max(models.Product.activity_id) + 1 as a replacement value (wrapped as a subselect), for example.
The above function executes steps 1-4, producing the desired INSERT SQL statement when printed (I created a products model and query that I thought might be representative):
>>> print(insert_from_query(models.Product, products, activity_id=2))
INSERT INTO products (product_id, data, activity_id) SELECT products.product_id, products.data, :param_1 AS activity_id
FROM products
WHERE products.activity_id != :activity_id_1
All you have to do is execute it:
insert_stmt = insert_from_query(models.Product, products, activity_id=2)
self.session.execute(insert_stmt)

Upsert / merge tables in SQLite

I have created a database using sqlite3 in python that has thousands of tables. Each of these tables contains thousands of rows and ten columns. One of the columns is the date and time of an event: it is a string that is formatted as YYYY-mm-dd HH:MM:SS, which I have defined to be the primary key for each table. Every so often, I collect some new data (hundreds of rows) for each of these tables. Each new dataset is pulled from a server and loaded in directly as a pandas data frame or is stored as a CSV file. The new data contains the same ten columns as my original data. I need to update the tables in my database using this new data in the following way:
Given a table in my database, for each row in the new dataset, if the date and time of the row matches the date and time of an existing row in my database, update the remaining columns of that row using the values in the new dataset.
If the date and time does not yet exist, create a new row and insert it to my database.
Below are my questions:
I've done some searching on Google and it looks like I should be using the UPSERT (merge) functionality of sqlite but I can't seem to find any examples showing how to use it. Is there an actual UPSERT command, and if so, could someone please provide an example (preferably with sqlite3 in Python) or point me to a helpful resource?
Also, is there a way to do this in bulk so that I can UPSERT each new dataset into my database without having to go row by row? (I found this link, which suggests that it is possible, but I'm new to using databases and am not sure how to actually run the UPSERT command.)
Can UPSERT also be performed directly using pandas.DataFrame.to_sql?
My backup solution is loading in the table to be UPSERTed using pd.read_sql_query("SELECT * from table", con), performing pandas.DataFrame.merge, deleting the said table from the database, and then adding in the updated table to the database using pd.DataFrame.to_sql (but this would be inefficient).

Instead of going through upsert command, why don't you create your own algorithim that will find values and replace them if date & time is found, else it will insert new row. Check out my code, i wrote for you. Let me know if you are still confused. You can even do that for hundereds of tables just by replacing table name in algorithim with some variable and changing it for the whole list of your table names.
import sqlite3
import pandas as pd
csv_data = pd.read_csv("my_CSV_file.csv") # Your CSV Data Path
def manual_upsert():
con = sqlite3.connect(connection_str)
cur = con.cursor()
cur.execute("SELECT * FROM my_CSV_data") # Viewing Data from Column
data = cur.fetchall()
old_data_list = [] # Collection of All Dates already in Database table.
for line in data:
old_data_list.append(line[0]) # I suppose you Date Column is on 0 Index.
for new_data in csv_data:
if new_data[0] in old_data_list:
cur.execute("UPDATE my_CSV_data SET column1=?, column2=?, column3=? WHERE my_date_column=?", # it will update column based on date if condition is true
(new_data[1],new_data[2],new_data[3],new_data[0]))
else:
cur.execute("INSERT INTO my_CSV_data VALUES(?,?,?,?)", # It will insert new row if date is not found.
(new_data[0],new_data[1],new_data[2],new_data[3]))
con.commit()
con.close()
manual_upsert()

First, even though the questions are related, ask them separately in the future.
There is documentation on UPSERT handling in SQLite that documents how to use it but it is a bit abstract. You can check examples and discussion here: SQLite - UPSERT *not* INSERT or REPLACE
Use a transaction and the statements are going to be executed in bulk.
As presence of this library suggests to_sql does not create UPSERT commands (only INSERT).

best way to upsert 300 million entries into postgres?

I have a new csv file every day with 400 million+ entries which I need to upsert into my database (3 tables with 2 foreign keys, indexed). The majority of the entries are already in the table, in which case I need to update a column. Some entries, which are not already in the table need to be inserted.
I tried to insert the CSV each day into a temptable then run:
INSERT INTO restaurants (name, food_id, street_id, datecreated, lastdayobservedopen) SELECT DISTINCT temptable.name, typesoffood.food_id, location.street_id, temptable.datecreated, temptable.lastdayobservedopen FROM temptable INNER JOIN typesoffood on typesoffood.food_type = temptable.food_type INNER JOIN location ON location.street_name = temptable.street_name ON CONFLICT ON CONSTRAINT restaurants_pk DO UPDATE SET lastdayobservedopen = EXCLUDED.lastdayobservedopen
But it takes over 6 hrs.
Is it possible to make this faster?
Edit:
Some more details: 3 tables- restaurants(name, food_id, street_id, datecreated, lastdayobservedopen) with pk (name, street_id) and fks (food_id and street_id); typesoffood(food_id, food_type) with pk (food_id) and index on food_type; location(street_id, street_name) with pk (street_id) and index on street_name; as for the csv file, I don’t know which are new or old entries, but I do know that the majority of the entries are already in the database which would require me to update the lastdayobserved date. The rest are to be inserted with the lastdayobserved date as today. This is supposed to help distinguish between restaurants that are no longer in operation (in which case their lastdayobserved column would not be updated) and currently operating restaurants whose date in that column should always match today’s date. Open to more efficient schema suggestions, as well. Thanks to all!

There is a function in sql called bulk insert can handle large volume of data:
bulk insert #temp
from "file location path"

If you can change you postgres settings you could take advantage of parallelism in Postgres. Otherwise you could at least speed up the csv upload using Postgres's bulk upload otherwise known as the COPY command.
Without more details it's hard to give better advice.

How to Bulk Insert SQLAlchemy subclasses (Joined Inheritance)

I am trying to bulk insert SQL-Alchemy Subclasses into the parent table and their respective tables ie fruits tables -> Apple Table and so I insert a table of APPLE and it will insert both the row into the fruits table then give me the id of the row in fruits table and put it Apple
This works when inserting one row at a time, but I need it to work with bulk insertion due to performance
I have tried to bulk insert which failed and I tried single row insertion it works with single row insertion but the thing is this data is not really unique except for the id of the row which is auto-generated so its going to be really hard to do a bulk insert to the parent table then do a bulk insert into the subclass table where the data matches and use the id by a mapping function
for data in apple_list:
db.session.add(Apple(
brand=data["brand"],
picked_date=data["picked_date"],
type=data["type"],
color=data["color"],
sub_type=data["sub_type"],
))
what I want is something more like bulk insertion
db.session.bulk_insert_mappings(model_classes['Apple'], apple_list)
Actual results are that when it tries to insert it errors out on the insertion due to it not having the foreign primary key that tells the row for the fruits table to the apple table
Expect to insert without any errors and to populate both tables like when inserting both rows

I figured it out so SQLALCHEMY has a parameter in bulk insert mapping called return_defaults
WARNING: This is straight from the docs return_defaults – when True, rows that are missing values which generate defaults, namely integer primary key defaults and sequences, will be inserted one at a time, so that the primary key value is available. In particular this will allow joined-inheritance and other multi-table mappings to insert correctly without the need to provide primary key values ahead of time; however, Session.bulk_insert_mappings.return_defaults greatly reduces the performance gains of the method overall. If the rows to be inserted only refer to a single table, then there is no reason this flag should be set as the returned default information is not used.
so then all you have to do is this
db.session.bulk_insert_mappings(model_classes['Apple'], apple_list, return_defaults=True)
it's still alot faster than db.session.add

Optimizing an Update statement with many records in SQLAlchemy

I am trying to update many records at a time using SQLAlchemy, but am finding it to be very slow. Is there an optimal way to perform this?
For some reference, I am performing an update on 40,000 records and it took about 1 hour.
Below is the code I am using. The table_name refers to the table which is loaded, the column is the single column which is to be updated, and the pairs refer to the primary key and new value for the column.
def update_records(table_name, column, pairs):
table = Table(table_name, db.MetaData, autoload=True,
autoload_with=db.engine)
conn = db.engine.connect()
values = []
for id, value in pairs:
values.append({'row_id': id, 'match_value': str(value)})
stmt = table.update().where(table.c.id == bindparam('row_id')).values({column: bindparam('match_value')})
conn.execute(stmt, values)

Passing a list of arguments to execute() essentially issues 40k individual UPDATE statements, which is going to have a lot of overhead. The solution for this is to increase the number of rows per query. For MySQL, this means inserting into a temp table and then doing an update:
# assuming temp table already created
conn.execute(temp_table.insert().values(values))
conn.execute(table.update().values({column: temp_table.c.match_value})
.where(table.c.id == temp_table.c.row_id))
Or, alternatively, you can use INSERT ... ON DUPLICATE KEY UPDATE to avoid creating the temp table, but SQLAlchemy does not support that natively, so you'll need to use a custom compiled construct for that (e.g. this gist).

According to document fast-execution-helpers, batch update statements can be issued as one statement. In my experiments, this trick reduce update or deletion time from 30 mins to 1 mins.
engine = create_engine(
"postgresql+psycopg2://scott:tiger#host/dbname",
executemany_mode='values_plus_batch',
executemany_values_page_size=5000, executemany_batch_page_size=5000)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.