Improve sqlite query speed - python

I have a list of numbers (actually percentages) to update in a database. The query is very simple: I obtain the ids of the items elsewhere in my code, and then I update those items in the database with the list of numbers. See my code:
start_time = datetime.datetime.now()

query = QtSql.QSqlQuery("files.sqlite")
for id_bdd, percentage in zip(list_id, list_percentages):
    request = "UPDATE papers SET percentage_match = ? WHERE id = ?"
    params = (percentage, id_bdd)
    query.prepare(request)
    for value in params:
        query.addBindValue(value)
    query.exec_()

elapsed_time = datetime.datetime.now() - start_time
print(elapsed_time.total_seconds())
It takes 1 second to generate list_percentages, and more than 2 minutes to write all the percentages to the database.
I use SQLite for the database, and there are about 7000 items in it. Is it normal for the query to take this long?
If not, is there a way to optimize it?
EDIT:
Comparison with the sqlite3 module from the standard library:
bdd = sqlite3.connect("test.sqlite")
bdd.row_factory = sqlite3.Row

c = bdd.cursor()
request = "UPDATE papers SET percentage_match = ? WHERE id = ?"
for id_bdd, percentage in zip(list_id, list_percentages):
    params = (percentage, id_bdd)
    c.execute(request, params)

bdd.commit()
c.close()
bdd.close()
I think QSqlQuery commits the changes on each loop iteration, while the sqlite3 module lets me commit all the queries together at the end.
For the same test database, the QSqlQuery version takes ~22 s, while the plain sqlite3 version takes ~0.3 s. I can't believe this is just a performance issue; I must be doing something wrong.
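For reference, the sqlite3 version can be collapsed even further with executemany, which binds every (percentage, id) pair in a single call inside one transaction; a minimal sketch, assuming the same list_id and list_percentages as above:

import sqlite3

bdd = sqlite3.connect("test.sqlite")
request = "UPDATE papers SET percentage_match = ? WHERE id = ?"
# executemany runs the statement once per parameter tuple, all inside a single transaction
bdd.executemany(request, zip(list_percentages, list_id))
bdd.commit()
bdd.close()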

You need to start a transaction and commit all the updates after the loop.
Not tested, but it should be close to this:
start_time = datetime.datetime.now()

# Start the transaction on the (already open) default connection
db = QtSql.QSqlDatabase.database()
db.transaction()

query = QtSql.QSqlQuery(db)
for id_bdd, percentage in zip(list_id, list_percentages):
    request = "UPDATE papers SET percentage_match = ? WHERE id = ?"
    params = (percentage, id_bdd)
    query.prepare(request)
    for value in params:
        query.addBindValue(value)
    query.exec_()

# Commit all the changes at once
if db.commit():
    print("updates ok")

elapsed_time = datetime.datetime.now() - start_time
print(elapsed_time.total_seconds())
On the other hand, this could also be a database performance issue; try creating an index on the id field: https://www.sqlite.org/lang_createindex.html
You will need direct access to the database.
CREATE INDEX idx_papers_id ON papers (id);
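If direct access is only available from Python, the index can also be created once through the standard sqlite3 module; a minimal sketch, assuming the database file is files.sqlite and idx_papers_id is just an arbitrary index name:

import sqlite3

con = sqlite3.connect("files.sqlite")
# an index on id lets the WHERE id = ? lookups avoid a full table scan
con.execute("CREATE INDEX IF NOT EXISTS idx_papers_id ON papers (id)")
con.commit()
con.close()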

Do you really need to call prepare() each time? The request never changes, so the prepare() call could be moved out of the loop, as sketched below.
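A rough sketch of that idea, combined with the transaction from the answer above (untested; it assumes the default QSqlDatabase connection is already open):

db = QtSql.QSqlDatabase.database()   # default connection, assumed to be open
db.transaction()

query = QtSql.QSqlQuery(db)
query.prepare("UPDATE papers SET percentage_match = ? WHERE id = ?")  # prepare once

for id_bdd, percentage in zip(list_id, list_percentages):
    query.bindValue(0, percentage)   # re-bind the same prepared statement
    query.bindValue(1, id_bdd)
    query.exec_()

db.commit()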


Efficient way to delete a large amount of records from a big table using python

I have a large table (about 10 million rows) from which I need to delete records that are "older" than 10 days (according to the created_at column). I have a Python script that I run to do this. created_at is a varchar(255) and holds millisecond epoch values, e.g. 1594267202000.
import mysql.connector
import sys
from mysql.connector import Error

table = sys.argv[1]
deleteDays = sys.argv[2]

sql_select_query = """SELECT COUNT(*) FROM {} WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY))""".format(table)
sql_delete_query = """DELETE FROM {} WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY)) LIMIT 100""".format(table)

try:
    connection = mysql.connector.connect(host='localhost',
                                         database='myDatabase',
                                         user='admin123',
                                         password='password123')
    cursor = connection.cursor()

    # initial count of rows before deletion
    cursor.execute(sql_select_query, (deleteDays,))
    records = cursor.fetchone()[0]

    while records >= 1:
        # stuck at the following line, and the timeout happens...
        cursor.execute(sql_delete_query, (deleteDays,))
        connection.commit()
        cursor.execute(sql_select_query, (deleteDays,))
        records = cursor.fetchone()[0]

    # final count of rows after deletion
    cursor.execute(sql_select_query, (deleteDays,))
    records = cursor.fetchone()[0]
    if records == 0:
        print("\nRows deleted")
    else:
        print("\nRows NOT deleted")

except mysql.connector.Error as error:
    print("Failed to delete: {}".format(error))
finally:
    if connection.is_connected():
        cursor.close()
        connection.close()
        print("MySQL connection is closed")
When I run this script, the DELETE query fails with:
Failed to delete: 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
I know that innodb_lock_wait_timeout is currently set to 50 seconds and that I could increase it to work around this, but I'd rather not touch the timeout. I want to delete in smaller chunks instead. How can I do that, using my code as the example?
One approach here might be to use a DELETE ... LIMIT query to batch your deletes at a certain size. Assuming batches of 100 records:
DELETE
FROM yourTable
WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY))
LIMIT 100;
Note that strictly speaking you should always have an ORDER BY clause when using LIMIT. What I wrote above might delete any 100 records matching the criteria for deletion.
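A sketch of how this might be driven from the Python script in the question (assuming the sql_delete_query, cursor, connection and deleteDays already defined there, with the LIMIT 100 in place):

while True:
    cursor.execute(sql_delete_query, (deleteDays,))
    connection.commit()           # commit each small batch so locks are released quickly
    if cursor.rowcount < 100:     # last batch was not full: nothing left to delete
        break

This also avoids the repeated COUNT(*) query between batches; mysql.connector reports the number of rows affected by the last statement in cursor.rowcount.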
created_at has no index and is a varchar(255) – Saffik
There's your problem. Two of them.
It needs to be indexed to have any hope of being performant. Without an index, MySQL has to check every record in the table. With an index, it can skip straight to the ones which match.
While storing an integer as a varchar will work (MySQL will convert it for you), it's bad practice: it wastes storage, allows bad data, and is slow.
Change created_at to a bigint so that it's stored as a number, then index it.
alter table your_table modify column created_at bigint;
create index created_at_idx on your_table(created_at);
Now that created_at is an indexed bigint, your query should use the index and it should be very fast.
Note that created_at should really be a datetime, which can store the time with up to microsecond accuracy. Then you could use MySQL's date functions without having to convert.
But that's going to mess with your code which expects a millisecond epoch number, so you're stuck with it. Keep it in mind for future tables.
For this table, you can add a generated created_at_datetime column to make working with dates easier. And, of course, index it.
alter table your_table add column created_at_datetime datetime generated always as (from_unixtime(created_at/1000));
create index created_at_datetime on your_table(created_at_datetime);
Then your where clause becomes much simpler.
WHERE created_at_datetime < DATE_SUB(NOW(), INTERVAL %s DAY)
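In the Python script from the question, only the delete query string then needs to change; a sketch, assuming the generated column created above:

sql_delete_query = """DELETE FROM {} WHERE created_at_datetime <
    DATE_SUB(NOW(), INTERVAL %s DAY) LIMIT 100""".format(table)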

update the last entered value from a selection of values in a database with python , mysql

I have a table where the student id is used as the identifier to update a row. But if the same student borrows a book twice, every row with that student id gets updated, which I don't want. I want only the last entered record for that student id to be updated, and using a serial number (Sl.No) is not practical here. I am using the Python MySQL connector. Please help :) Thanks in advance.
The code I use right now:
con = mysql.connect(host='localhost', user='root',
                    password='monkey123', database='BOOK')
c = con.cursor()
c.execute(
    f"UPDATE library set `status`='Returned',`date returned`='{str(cal.selection_get())}' WHERE `STUDENT ID`='{e_sch.get()}';")
c.execute('commit')
con.close()
messagebox.showinfo(
    'Success', 'Book has been returned successfully')
If I followed you correctly, you want to update just one record that matches the where condition. For this to be done in a reliable manner, you need a column that defines the ordering of the records. It could be a date, an incrementing id, or something else. I assume that such a column exists in your table and is called ordering_column.
A simple option is to use ORDER BY and LIMIT in the UPDATE statement, like so:
sql = """
UPDATE library
SET status = 'Returned', date returned = %s
WHERE student_id = %s
ORDER BY ordering_column DESC
LIMIT 1
"""
c = con.cursor()
c.execute(sql, (str(cal.selection_get()), e_sch.get(), )
Note that I modified your code so input values are given as parameters rather than concatenated into the query string. This is an important change that makes your code safer and more efficient.

Bigquery Partition Date Python

I would like to write the result of a query into another table in BigQuery, partitioned by date, but I couldn't find how to do it. I use Python and the Google Cloud client library, and I want to create the table using standard SQL. But I get an error.
Error: google.api_core.exceptions.BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/astute-baton-272707/queries/f4b9dadb-1390-4260-bb0e-fb525aff662c?maxResults=0&location=US: The number of columns in the column definition list does not match the number of columns produced by the query at [2:72]
Please let me know if there is another solution. The next stage of the project is to insert into the table day by day.
I may have been doing it wrong from the beginning. I am not sure.
Thank you.
client = bigquery.Client()
sql = """
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy (visitStartTime_ts INT64,date TIMESTAMP,hitsTime_ts INT64,appId STRING,fullVisitorId STRING,cUserId STRING,eventCategory STRING,eventLabel STRING,player_type STRING,PLAY_SESSION_ID STRING,CHANNEL_ID STRING,CONTENT_EPG_ID STRING,OFF_SET STRING)
PARTITION BY date
OPTIONS (
description="weather stations with precipitation, partitioned by day"
) AS
select
FORMAT_TIMESTAMP("%Y-%m-%d %H:%M:%S", TIMESTAMP_SECONDS(SAFE_CAST(visitStartTime AS INT64)), "Turkey") AS visitStartTime_ts,
date
,FORMAT_TIMESTAMP("%Y-%m-%d %H:%M:%S", TIMESTAMP_SECONDS(SAFE_CAST(visitStartTime+(h.time/1000) AS INT64)), "Turkey") AS hitsTime_ts
,h.appInfo.appId as appId
,fullVisitorId
,(SELECT value FROM h.customDimensions where index=1) as cUserId
,h.eventInfo.eventCategory as eventCategory
,h.eventInfo.eventAction as eventAction
,h.eventInfo.eventLabel as eventLabel
,REPLACE(SPLIT(h.eventInfo.eventCategory,'/{')[OFFSET(1)],'}','') as player_type
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(0)] as PLAY_SESSION_ID
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(1)] as CHANNEL_ID
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(2)] as CONTENT_EPG_ID
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(3)] as OFF_SET
FROM `zzzzz.yyyyyy.xxxxxx*` a,
UNNEST(hits) AS h
where
1=1
and SPLIT(SPLIT(h.eventInfo.eventCategory,'/{')[OFFSET(0)],'/')[OFFSET(0)] like 'player'
and _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND (BYTE_LENGTH(h.eventInfo.eventCategory) - BYTE_LENGTH(REPLACE(h.eventInfo.eventCategory,'/{','')))/2 + 1 = 2
AND h.eventInfo.eventAction='heartBeat'
"""
query_job = client.query(sql)  # API request
query_job.result()  # Waits for the query to finish
print('Query results loaded to table zzzzz.xxxxx.yyyyy')
A quick solution for the problem presented here: when creating a table from a query, you don't need to declare its schema, because the schema comes from the query itself. Right now there's a conflict between the query's output and the declared schema, so remove one of them.
Instead of starting the query with:
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy (visitStartTime_ts INT64,date TIMESTAMP,hitsTime_ts INT64,appId STRING,fullVisitorId STRING,cUserId STRING,eventCategory STRING,eventLabel STRING,player_type STRING,PLAY_SESSION_ID STRING,CHANNEL_ID STRING,CONTENT_EPG_ID STRING,OFF_SET STRING)
PARTITION BY date
Start the query with:
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy
PARTITION BY date
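Since the next stage of the project is to load the table day by day, the same client can run a daily INSERT once the partitioned table exists; a rough sketch (untested; daily_select is assumed to hold the SELECT part of the query in the question, limited to the previous day via _TABLE_SUFFIX):

from google.cloud import bigquery

client = bigquery.Client()

# daily_select: the SELECT ... FROM `zzzzz.yyyyyy.xxxxxx*` ... statement from the question
insert_sql = "INSERT INTO `zzzzz.xxxxx.yyyyy` " + daily_select

query_job = client.query(insert_sql)  # API request
query_job.result()                    # waits for the insert job to finish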

How to retrieve data from SQLite faster in python

I have the following info in my database (example):
longitude (real): 70.74
userid (int): 12
This is how I fetch it:
import sqlite3 as lite

con = lite.connect(dbpath)
with con:
    cur = con.cursor()
    cur.execute('SELECT latitude, userid FROM message')
    con.commit()
    print "executed"
    while True:
        tmp = cur.fetchone()
        if tmp != None:
            info.append([tmp[0], tmp[1]])
        else:
            break
to get the same info in the form [70.74, 12].
What else can I do to speed up this process? At 10,000,000 rows this takes approx. 50 seconds, and I'm aiming for 200,000,000 rows; I never get through it, possibly due to a memory leak or something like that.
From the sqlite3 documentation:
A Row instance serves as a highly optimized row_factory for Connection objects. It tries to mimic a tuple in most of its features.
Since a Row closely mimics a tuple, depending on your needs you may not even need to unpack the results.
However, since your numerical types are stored as strings, we do need to do some processing. As @Jon Clements pointed out, the cursor is an iterable, so we can just use a comprehension, converting to float and int at the same time.
import sqlite3 as lite

with lite.connect(dbpath) as conn:
    cur = conn.execute('SELECT latitude, userid FROM message')
    items = [[float(x[0]), int(x[1])] for x in cur]
EDIT: We're not making any changes, so we don't need to call commit.
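If named access to the columns is ever useful, the Row factory mentioned above can be enabled on the connection; a small sketch with the same message table:

import sqlite3 as lite

with lite.connect(dbpath) as conn:
    conn.row_factory = lite.Row            # rows still act like tuples, but also support names
    cur = conn.execute('SELECT latitude, userid FROM message')
    info = [[row['latitude'], row['userid']] for row in cur]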

How to make an efficient query for extracting entries of all days in a database in sets?

I have a database that includes 440 days of several series with a sampling time of 5 seconds. There is also missing data.
I want to calculate the daily average. The way I am doing it now is to make 440 queries and do the averaging afterward. But this is very time consuming, since every query searches the whole database for the related entries. I imagine there must be a more efficient way of doing this.
I am doing this in Python, and I am just learning SQL. Here's the query section of my code:
time_cur = date_begin
Data = numpy.zeros(shape=(N, NoFields - 1))
X = []
nN = 0
while time_cur < date_end:
    X.append(time_cur)
    cur = con.cursor()
    cur.execute("SELECT * FROM os_table \
                 WHERE EXTRACT(year from datetime_)=%s \
                 AND EXTRACT(month from datetime_)=%s \
                 AND EXTRACT(day from datetime_)=%s",
                (time_cur.year, time_cur.month, time_cur.day))
    Y = numpy.array([0]*(NoFields-1))
    n = 0.0
    while True:
        n = n + 1
        row = cur.fetchone()
        if row == None:
            break
        Y = Y + numpy.array(row[1:])
    Data[nN][:] = Y/n
    nN = nN + 1
    time_cur = time_cur + datetime.timedelta(days=1)
And, my data looks like this:
datetime_,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14
2012-11-13-00:07:53,42,30,0,0,1,9594,30,218,1,4556,42,1482,42
2012-11-13-00:07:58,70,55,0,0,2,23252,55,414,2,2358,70,3074,70
2012-11-13-00:08:03,44,32,0,0,0,11038,32,0,0,5307,44,1896,44
2012-11-13-00:08:08,36,26,0,0,0,26678,26,0,0,12842,36,1141,36
2012-11-13-00:08:13,33,26,0,0,0,6590,26,0,0,3521,33,851,33
I appreciate your suggestions.
Thanks
Iman
I don't know the numpy part, so I don't understand exactly what you are averaging; if you show your table and the logic used to get the average, I can be more specific.
But this is how to get the daily average for a couple of columns:
import psycopg2

conn = psycopg2.connect('host=localhost4 port=5432 dbname=cpn')
cursor = conn.cursor()
cursor.execute('''
    select
        datetime_::date as day,
        avg(c1) as c1_average,
        avg(c2) as c2_average
    from os_table
    where datetime_ between %s and %s
    group by 1
    order by 1
''',
    (date_begin, date_end)
)
rs = cursor.fetchall()
conn.close()

for day in rs:
    print day[0], day[1], day[2]
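Since the sample data has columns c1 through c14, the list of avg() expressions can also be built in Python rather than typed out by hand; a sketch along the same lines (column names assumed from the sample, date_begin and date_end as in the question):

import psycopg2

columns = ['c%d' % i for i in range(1, 15)]                        # c1 .. c14
avg_list = ', '.join('avg(%s) as %s_average' % (c, c) for c in columns)

conn = psycopg2.connect('host=localhost4 port=5432 dbname=cpn')
cursor = conn.cursor()
cursor.execute(
    'select datetime_::date as day, ' + avg_list +
    ' from os_table where datetime_ between %s and %s'
    ' group by 1 order by 1',
    (date_begin, date_end),
)
rs = cursor.fetchall()
conn.close()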
This answer uses SQL Server syntax. I am not sure how different PostgreSQL is; it should be fairly similar, though things like the DATEADD, DATEDIFF and CONVERT statements may differ (almost certainly the CONVERT statement, but I am only using it to build a report name, so it is not vital). You should be able to follow the theory of this even if the code doesn't run in PostgreSQL without tweaking.
First, create a reports table (you will use this to link to the actual table you want to report on):
CREATE TABLE Report_Periods (
    report_name       VARCHAR(30) NOT NULL PRIMARY KEY,
    report_start_date DATETIME NOT NULL,
    report_end_date   DATETIME NOT NULL,
    CONSTRAINT date_ordering CHECK (report_start_date <= report_end_date)
)
Next, populate the reports table with the dates you need to report on. There are many ways to do this; the method I've chosen here only uses the days present in your data, but you could populate it with all dates you are ever likely to use, so you only have to do it once.
INSERT INTO Report_Periods (report_name, report_start_date, report_end_date)
SELECT CONVERT(VARCHAR, [DatePartOnly], 0) AS DateName,
       [DatePartOnly] AS StartDate,
       DATEADD(ms, -3, DATEADD(dd, 1, [DatePartOnly])) AS EndDate
FROM  (SELECT DISTINCT DATEADD(DD, DATEDIFF(DD, 0, datetime_), 0) AS [DatePartOnly]
       FROM os_table) AS M
Note that in SQL Server the smallest DATETIME increment is about 3 milliseconds, so the statement above adds 1 day and then subtracts 3 milliseconds to create a start and end datetime for each day. Again, PostgreSQL may have different values.
This means you can link the reports table back to your os_table to get averages, counts etc. very simply:
SELECT AVG(value) AS AvgValue, COUNT(value) AS NumValues, R.report_name
FROM os_table AS T
JOIN Report_Periods AS R ON T.datetime_>= R.report_start_date AND T.datetime_<= R.report_end_date
GROUP BY R.report_name
