Improve speed or find faster alternative to SQL Update - python

I have a 68M-row x 77-column table (general_table) on a MySQL server that contains, amongst other things, user_id, user_name, date and media_channel.
There are rare instances (83k of them) where there is a user_id but no user_name; in those rows the value of user_name is "-". I can get the missing names from the users_table table.
To update the values on general_table I use the following UPDATE statement, but given the size of the table it takes a really long time, so I'm looking for an alternative.
UPDATE
general_table as a,
users_table as b
SET a.user_name = b.user_name
where a.date > '2020-01-01'
and a.user_id = b.user_id
and a.media_channel = b.media_channel
and a.user_name = '-';
Answers using Pandas, PyMySQL or SQLAlchemy are also welcome.
Keep in mind, for those requesting an EXPLAIN, that it only works for SELECT queries, not for updates.

For this query:
UPDATE general_table g
JOIN users_table u ON g.user_id = u.user_id AND g.media_channel = u.media_channel
SET g.user_name = u.user_name
WHERE g.date > '2020-01-01' AND g.user_name = '-'
You want indexes on general_table(user_name, date, user_id, media_channel) and on users_table(user_id, media_channel, user_name).
Note: It will still take some time to update 83k rows, so you might want to do this in batches.
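For example, here is a minimal batching sketch with PyMySQL; the connection parameters, the batch size, and the assumption that general_table has an auto-increment primary key id are placeholders, not part of the question. MySQL does not allow LIMIT in a multi-table UPDATE, so each pass selects a primary-key batch of the '-' rows, fixes just those, and commits:
import pymysql

# Sketch only: connection details and the `id` primary key are assumptions.
conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="mydb")
BATCH = 5000
last_id = 0
try:
    with conn.cursor() as cur:
        while True:
            # Grab the next batch of primary keys for rows that still need fixing.
            cur.execute(
                """SELECT id FROM general_table
                   WHERE user_name = '-' AND date > '2020-01-01' AND id > %s
                   ORDER BY id
                   LIMIT %s""",
                (last_id, BATCH),
            )
            ids = [row[0] for row in cur.fetchall()]
            if not ids:
                break
            last_id = ids[-1]
            placeholders = ",".join(["%s"] * len(ids))
            # Restrict the join UPDATE to this batch of primary keys.
            cur.execute(
                """UPDATE general_table g
                   JOIN users_table u
                     ON g.user_id = u.user_id
                    AND g.media_channel = u.media_channel
                   SET g.user_name = u.user_name
                   WHERE g.id IN (%s)""" % placeholders,
                ids,
            )
            conn.commit()  # keep each transaction (and its lock footprint) small
finally:
    conn.close()
Each batch only touches a few thousand rows, so locks are held briefly and a failure partway through loses at most one uncommitted batch.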


update the last entered value from a selection of values in a database with Python, MySQL

Okay, so I have a table that uses student id as the identifier for editing a row. But what if the same student lends a book twice? Then every row with that student id would be edited, which I don't want. I want only the last entered row for that student id to be edited, and using a Sl.No is not a practical solution here. I am using the Python connector. Please help :) Thanks in advance.
Code I use right now:
# assuming: import mysql.connector as mysql
con = mysql.connect(host='localhost', user='root',
                    password='monkey123', database='BOOK')
c = con.cursor()
c.execute(
    f"UPDATE library set `status`='Returned',`date returned`='{str(cal.selection_get())}' WHERE `STUDENT ID`='{e_sch.get()}';")
c.execute('commit')
con.close()
messagebox.showinfo(
    'Success', 'Book has been returned successfully')
If I followed you correctly, you want to update just one record that matches the where condition. For this to be done in a reliable manner, you need a column that defines the ordering of the records. It could be a date, an incrementing id, or something else. I assume that such a column exists in your table and is called ordering_column.
A simple option is to use ORDER BY and LIMIT in the UPDATE statement, like so:
sql = """
UPDATE library
SET status = 'Returned', date returned = %s
WHERE student_id = %s
ORDER BY ordering_column DESC
LIMIT 1
"""
c = con.cursor()
c.execute(sql, (str(cal.selection_get()), e_sch.get()))
Note that I modified your code so input values are passed as parameters rather than concatenated into the query string. This is an important change that makes your code safer (no SQL injection) and more efficient.
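Putting it together, a minimal usage sketch; it reuses the parameterised sql defined above, copies the connection details from your snippet, and assumes the same mysql connector import and Tkinter objects (cal, e_sch, messagebox) as in your code:
con = mysql.connect(host='localhost', user='root',
                    password='monkey123', database='BOOK')
try:
    c = con.cursor()
    c.execute(sql, (str(cal.selection_get()), e_sch.get()))
    con.commit()   # persist the single-row update
    messagebox.showinfo('Success', 'Book has been returned successfully')
finally:
    con.close()    # always release the connection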

If a condition is not fulfilled return data for that user, else another query

I have a table with these data:
ID, Name, LastName, Date, Type
I have to query the table for the user with ID 1.
Get the row; if the Type of that user is not 2, return that user, otherwise return all users that have the same LastName and Date.
What would be the most efficient way to do this?
What I have done is:
query1 = "SELECT * FROM clients WHERE ID = 1"
query2 = "SELECT * FROM clients WHERE LastName = %s AND Date = %s"
And I execute the first query:
cursor.execute(query1)
rows = cursor.fetchall()
for row in rows:
    if row['Type'] == 2:
        cursor.execute(query2, (row['LastName'], row['Date']))
        # save these results
    else:
        results = rows  # ?
Is there a more efficient way of doing this using joins?
For example, if I only had a left join, how would I also check whether the type of the user is 2?
And if there are multiple rows to be returned, how do I assign them to an array of objects in Python?
Just do two queries to avoid loops here:
q1 = """
SELECT c.* FROM clients c where c.ID = 1
"""
q2 = """
SELECT b.* FROM clients b
JOIN (SELECT c.* FROM
clients c
WHERE
c.ID = 1
AND
c.Type = 2) a
ON
a.LastName = b.LastName
AND
a.Date = b.Date
"""
Then you can just execute both queries and you'll have all the results you want without loops; the loop approach issues n extra queries, where n is the number of matching rows, instead of grabbing everything in one join in a single pass. Without more specifics about the desired data structure of the final results (it seems you only care about saving them), this should give you what you want.
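For example, a minimal sketch of running both queries; PyMySQL with a DictCursor is my assumption here (you access rows as dicts), but any DB-API driver with a dict-style cursor behaves the same way:
import pymysql
from pymysql.cursors import DictCursor

# Sketch only: driver and connection details are assumptions.
conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="mydb",
                       cursorclass=DictCursor)
with conn.cursor() as cursor:
    cursor.execute(q1)
    single = cursor.fetchall()   # the row(s) for ID = 1

    cursor.execute(q2)           # only returns rows when the ID = 1 client has Type = 2
    matches = cursor.fetchall()

    # q2 already encodes the Type = 2 condition, so pick whichever set applies.
    results = matches if matches else single
conn.close()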

How to improve query performance to select info from Postgres?

I have a Flask app:
db = SQLAlchemy(app)

@app.route('/')
def home():
    query = "SELECT door_id FROM table WHERE id = 2422628557;"
    result = db.session.execute(query)
    return json.dumps([dict(r) for r in result])
When I execute: curl http://127.0.0.1:5000/
I get the result very quickly: [{"door_id": 2063805}]
But when I reverse my query, query = "SELECT id FROM table WHERE door_id = 2063805;",
everything works very, very slowly.
Probably I have an index on the id column and no index on door_id.
How can I improve performance? How do I add an index on door_id?
If you want an index on that column, just create it:
create index i1 on table (door_id)
Then, depending on your settings, you might have to analyze the table to introduce the new statistics to the query planner, e.g.:
analyze table;
Keep in mind that all indexes require additional I/O on data manipulation.
Look at the explain plan for your slow query:
EXPLAIN
SELECT id FROM table WHERE door_id = 2063805;
It is likely you are seeing something like this:
QUERY PLAN
------------------------------------------------------------
Seq Scan on table (cost=0.00..483.00 rows=99999 width=244)
Filter: (door_id = 2063805)
The Seq Scan is checking every single row in this table, which is then filtered by the door_id you are restricting on.
What you should do in this case is add an index on the door_id column.
The plan will then change to something like:
QUERY PLAN
-----------------------------------------------------------------------------
Index Scan using [INDEX_NAME] on table (cost=0.00..8.27 rows=1 width=244)
Index Cond: (door_id = 2063805)
The optimiser will use the index to reduce the row look-ups for your query, which will speed it up.
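If you want to do both steps from the Flask app, here is a minimal sketch using SQLAlchemy's text() construct; the index name and the extra route are my own placeholders, and "table"/"door_id" are kept as the placeholder names from the question:
from sqlalchemy import text

# One-off: create the missing index (this can equally be run once in psql).
with db.engine.begin() as conn:
    conn.execute(text("CREATE INDEX IF NOT EXISTS idx_table_door_id "
                      "ON table (door_id)"))

@app.route('/doors/<int:door_id>')
def by_door(door_id):
    # Bound parameter instead of string formatting; the new index makes this lookup fast.
    result = db.session.execute(
        text("SELECT id FROM table WHERE door_id = :door_id"),
        {"door_id": door_id},
    )
    return json.dumps([dict(r) for r in result])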

cleaning a Postgres table of bad rows

I have inherited a Postgres database and am currently in the process of cleaning it. I have created an algorithm to find the rows where the data is bad; the algorithm is encoded in the function checkProblem(). Using this, I am able to find the rows that contain bad data, as shown below ...
import psycopg2
import pandas as pd
from tqdm import tqdm

schema = findTables(dbName)
conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
cur = conn.cursor()
results = []
for t in tqdm(sorted(schema.keys())):
    n = 0
    cur.execute('select * from %s' % t)
    for i, cs in enumerate(tqdm(cur)):
        if checkProblem(cs):
            n += 1
    results.append({
        'tableName': t,
        'totalRows': i + 1,
        'badRows':   n,
    })
cur.close()
conn.close()

print pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
Now, I need to delete the rows that are bad. I have two different ways of doing it. First, I can write the clean rows in a temporary table, and rename the table. I think that this option is too memory-intensive. It would be much better if I would be able to just delete the specific record at the cursor. Is this even an option?
Otherwise, what is the best way of deleting a record under such circumstances? I am guessing that this should be a relatively common thing that database administrators do ...
Of course, deleting the specific record at the cursor is better. You can do something like:
# use a second cursor for the DELETEs so the SELECT you are still
# iterating over is not clobbered
del_cur = conn.cursor()
for i, cs in enumerate(tqdm(cur)):
    if checkProblem(cs):
        # if cs is a tuple with cs[0] being the record id.
        del_cur.execute('delete from %s where id=%d' % (t, cs[0]))
conn.commit()
Or you can store the ids of the bad records and then do something like
DELETE FROM table WHERE id IN (id1,id2,id3,id4)
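A minimal sketch of that second approach with psycopg2, assuming (as in the snippet above) that the record id is the first column of each row; the list of ids is passed as a bound parameter, which psycopg2 adapts to a Postgres array:
bad_ids = []
for cs in tqdm(cur):
    if checkProblem(cs):
        bad_ids.append(cs[0])  # assumes the id is the first column

if bad_ids:
    del_cur = conn.cursor()
    # one DELETE per table; only the table name is interpolated,
    # the ids themselves are passed as a parameter
    del_cur.execute('DELETE FROM %s WHERE id = ANY(%%s)' % t, (bad_ids,))
    conn.commit()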

How to make an efficient query for extracting entries of all days in a database in sets?

I have a database that includes 440 days of several series with a sampling time of 5 seconds. There is also missing data.
I want to calculate the daily average. The way I am doing it now is that I make 440 queries and do the averaging afterward. But this is very time-consuming, since for every query the whole database is searched for related entries. I imagine there must be a more efficient way of doing this.
I am doing this in Python, and I am just learning SQL. Here's the query section of my code:
time_cur = date_begin
Data = numpy.zeros(shape=(N, NoFields - 1))
X = []
nN = 0
while time_cur < date_end:
    X.append(time_cur)
    cur = con.cursor()
    cur.execute("SELECT * FROM os_table \
        WHERE EXTRACT(year from datetime_)=%s \
        AND EXTRACT(month from datetime_)=%s \
        AND EXTRACT(day from datetime_)=%s",
        (time_cur.year, time_cur.month, time_cur.day))
    Y = numpy.array([0]*(NoFields-1))
    n = 0.0
    while True:
        n = n + 1
        row = cur.fetchone()
        if row == None:
            break
        Y = Y + numpy.array(row[1:])
    Data[nN][:] = Y/n
    nN = nN + 1
    time_cur = time_cur + datetime.timedelta(days=1)
And, my data looks like this:
datetime_,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14
2012-11-13-00:07:53,42,30,0,0,1,9594,30,218,1,4556,42,1482,42
2012-11-13-00:07:58,70,55,0,0,2,23252,55,414,2,2358,70,3074,70
2012-11-13-00:08:03,44,32,0,0,0,11038,32,0,0,5307,44,1896,44
2012-11-13-00:08:08,36,26,0,0,0,26678,26,0,0,12842,36,1141,36
2012-11-13-00:08:13,33,26,0,0,0,6590,26,0,0,3521,33,851,33
I appreciate your suggestions.
Thanks
Iman
I don't know the numpy code well enough to tell what you are averaging; it would help if you showed your table and the logic used to compute the average.
But this is how to get a daily average per column:
import psycopg2

conn = psycopg2.connect('host=localhost4 port=5432 dbname=cpn')
cursor = conn.cursor()
cursor.execute('''
    select
        datetime_::date as day,
        avg(c1) as c1_average,
        avg(c2) as c2_average
    from os_table
    where datetime_ between %s and %s
    group by 1
    order by 1
    ''',
    (date_begin, date_end)
)
rs = cursor.fetchall()
conn.close()

for day in rs:
    print day[0], day[1], day[2]
This answer uses SQL Server syntax - I am not sure how different PostgreSQL is; it should be fairly similar, though you may find that things like the DATEADD, DATEDIFF and CONVERT statements differ (almost certainly the CONVERT statement - just convert the date to a varchar instead; I am only using it as a report name, so it is not vital). You should be able to follow the theory of this even if the code doesn't run in PostgreSQL without tweaking.
First, create a reports table (you will use this to link to the actual table you want to report on):
CREATE TABLE Report_Periods (
report_name VARCHAR(30) NOT NULL PRIMARY KEY,
report_start_date DATETIME NOT NULL,
report_end_date DATETIME NOT NULL,
CONSTRAINT date_ordering
CHECK (report_start_date <= report_end_date)
)
Next, populate the report table with the dates you need to report on. There are many ways to do this - the method I've chosen here only uses the days you actually need, but you could populate it with all dates you are ever likely to use, so you only have to do it once.
INSERT INTO Report_Periods (report_name, report_start_date, report_end_date)
SELECT CONVERT(VARCHAR, [DatePartOnly], 0) AS DateName,
[DatePartOnly] AS StartDate,
DATEADD(ms, -3, DATEADD(dd,1,[DatePartOnly])) AS EndDate
FROM ( SELECT DISTINCT DATEADD(DD, DATEDIFF(DD, 0, datetime_), 0) AS [DatePartOnly]
FROM os_table ) AS M
Note that in SQL Server the DATETIME type is only accurate to about 3 milliseconds - so the above statement adds 1 day, then subtracts 3 milliseconds to create a start and end datetime for each day. Again, PostgreSQL may have different values.
You can then link the reports table back to your os_table to get averages, counts etc. very simply:
SELECT AVG(value) AS AvgValue, COUNT(value) AS NumValues, R.report_name
FROM os_table AS T
JOIN Report_Periods AS R ON T.datetime_>= R.report_start_date AND T.datetime_<= R.report_end_date
GROUP BY R.report_name
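For reference, here is a rough PostgreSQL sketch of the same idea - my own translation, not part of the original answer - using generate_series and half-open day ranges instead of the 3 ms adjustment; table and column names are taken from the question:
-- Build one row per day spanned by os_table (report_name is the day as text).
CREATE TABLE report_periods AS
SELECT d::date::text        AS report_name,
       d                    AS report_start_date,
       d + interval '1 day' AS report_end_date   -- exclusive upper bound
FROM generate_series(
        (SELECT min(datetime_)::date FROM os_table),
        (SELECT max(datetime_)::date FROM os_table),
        interval '1 day') AS g(d);

-- Join back to the data; the half-open range makes the 3 ms trick unnecessary.
SELECT r.report_name,
       AVG(t.c1)   AS avg_c1,
       COUNT(t.c1) AS num_values
FROM os_table t
JOIN report_periods r
  ON t.datetime_ >= r.report_start_date
 AND t.datetime_ <  r.report_end_date
GROUP BY r.report_name
ORDER BY r.report_name;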
