SQL Count Optimisations - python

I have been using the Django REST Framework, and part of the ORM runs the following query as part of a generic object list endpoint:
SELECT COUNT(*) AS `__count`
FROM `album`
INNER JOIN `tracks` ON (`album`.`id` = `tracks`.`album_id`)
WHERE `tracks`.`viewable` = 1
The API is supposed to only display albums with tracks that are set to viewable, but with a tracks table containing 50 million rows this query never seems to complete and hangs the endpoint's execution.
All columns referenced are indexed, so I do not know why this is taking so long to execute. If there are any potential optimisations that I might not have considered, please let me know.

For this query:
SELECT COUNT(*) AS `__count`
FROM album INNER JOIN
     tracks
     ON (album.id = tracks.album_id)
WHERE tracks.viewable = 1;
An index on tracks(viewable, album_id) and album(id) would help.
But, in all likelihood a join is not needed, so you can do:
select count(*)
from tracks
where viewable = 1;
For this, the index on tracks(viewable) will be a big help.
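If the endpoint is driven by the Django ORM, a minimal sketch of both suggestions might look like the following; the Track model name, app path, and index name are assumptions based on the question's schema:
from django.db import connection
from myapp.models import Track  # hypothetical app and model names

# Covering index for the filtered count (run once, e.g. in a migration).
with connection.cursor() as cursor:
    cursor.execute(
        "CREATE INDEX idx_tracks_viewable_album ON tracks (viewable, album_id)"
    )

# Equivalent of: SELECT COUNT(*) FROM tracks WHERE viewable = 1
viewable_count = Track.objects.filter(viewable=True).count()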

Using top level item in nested subquery with SQLAlchemy

I have a query where I'm attempting to find a link between two tables, but I require a few checks against association tables within the same query. I think my problem stems from having to check across multiple levels of relationships: I want to filter a subquery based on the top-level item, but I've hit an issue and have no idea how to proceed.
More specifically I want to query Script using the name of an Application, but narrow the results down to when the Application's Language matches the Script's Language.
Tables: Script (id, language_id), Application (id, name), Language (id)
Association Tables: ApplicationLanguage (app_id, language_id), ScriptApplication (script_id, app_id)
Current attempt: (it's important this stays as a single query)
from sqlalchemy import select, and_

value = 'appname'

# Search applications for a value
app_search = select([Application.id]).where(Application.name == value).as_scalar()

# Search for applications matching the language of the script
lang_search = select([ApplicationLanguage.app_id]).where(
    ApplicationLanguage.language_id == Script.language_id
).as_scalar()

# Find the script based on which applications appear in both subqueries
script_search = select([ScriptApplication.script_id]).where(and_(
    ScriptApplication.app_id.in_(app_search),
    ScriptApplication.app_id.in_(lang_search),
)).as_scalar()

# Turn it into an SQL expression
query = Script.id.in_(script_search)
Resulting SQL code:
SELECT script.id AS script_id
FROM script
WHERE script.id IN (
    SELECT script_application.script_id
    FROM script_application
    WHERE script_application.application_id IN (
        SELECT application.id
        FROM application
        WHERE application.name = ?
    )
    AND script_application.application_id IN (
        SELECT application_language.application_id
        FROM application_language, script
        WHERE script.language_id = application_language.language_id
    )
)
My theory
I believe the issue is the line ApplicationLanguage.language_id == Script.language_id, because if I change it to ApplicationLanguage.language_id == 3 (3 being the value I'm expecting), then it works perfectly. In the generated SQL, I assume it's the FROM application_language, script that is overriding the top-level script.
How would I go about either rearranging or fixing this query? My current method seems to work fine across a single relationship, it just doesn't work if I try to do anything more complex.
I'd still love to know how I'd go about fixing the original query as I believe it'll come in useful in the future, but I managed to rearrange it.
I reversed the lang_search to grab languages for each application from app_search, and used that as part of the final query, instead of attempting to combine it in a subquery.
value = 'appname'
app_search = select([Application.id]).where(Application.name == value).as_scalar()
lang_search = select([ApplicationLanguage.language_id]).where(
    ApplicationLanguage.app_id.in_(app_search)
).as_scalar()
script_search = select([ScriptApplication.script_id]).where(
    ScriptApplication.app_id.in_(app_search)
).as_scalar()
query = and_(
    Script.id.in_(script_search),
    Script.language_id.in_(lang_search),
)
Final SQL query:
SELECT script.id AS script_id
FROM script
WHERE script.id IN (
    SELECT script_application.script_id
    FROM script_application
    WHERE script_application.application_id IN (
        SELECT application.id
        FROM application
        WHERE lower(application.name) = ?
    )
)
AND script.language_id IN (
    SELECT application_language.language_id
    FROM application_language
    WHERE application_language.application_id IN (
        SELECT application.id
        FROM application
        WHERE lower(application.name) = ?
    )
)
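For completeness, the original single-query approach can likely be repaired with SQLAlchemy's explicit correlation control: marking the inner select as correlated against Script keeps the script table out of the subquery's FROM list, so it references the top-level one instead. A hedged sketch, reusing the names from above:
from sqlalchemy import select, and_

# Correlate the inner select against Script, so the generated subquery
# does not add its own `script` to its FROM clause.
lang_search = select([ApplicationLanguage.app_id]).where(
    ApplicationLanguage.language_id == Script.language_id
).correlate(Script).as_scalar()

script_search = select([ScriptApplication.script_id]).where(and_(
    ScriptApplication.app_id.in_(app_search),
    ScriptApplication.app_id.in_(lang_search),
)).as_scalar()

query = Script.id.in_(script_search)
With the correlation forced, the inner WHERE should compare against the outer script table rather than introducing a second FROM script.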

Speeding up GROUP BY clause in SQL (Python/Pandas)

I have searched this website thoroughly and have not been able to find a solution that works for me. I code in Python and have very little SQL knowledge. I currently need to write code that pulls data from a SQL database and organizes/summarizes it. My code is below (it has been scrubbed for data security purposes):
conn = pc.connect(host=myhost,dbname =mydb, port=myport,user=myuser,password=mypassword)
cur = conn.cursor()
query = ("""CREATE INDEX index ON myTable3 USING btree (name);
CREATE INDEX index2 ON myTable USING btree (date, state);
CREATE INDEX index3 ON myTable4 USING btree (currency, type);
SELECT tp.name AS trading_party_a,
tp2.name AS trading_party_b,
('1970-01-01 00:00:00'::timestamp without time zone + ((mc.date)::double precision * '00:00:00.001'::interval)) AS val_date,
mco.currency,
mco.type AS type,
mc.state,
COUNT(*) as call_count,
SUM(mco.call_amount) as total_call_sum,
SUM(mco.agreed_amount) as agreed_sum,
SUM(disputed_amount) as disputed_sum
FROM myTable mc
INNER JOIN myTable2 cp ON mc.a_amp_id = cp.amp_id
INNER JOIN myTable3 tp ON cp.amp_id = tp.amp_id
INNER JOIN myTable2 cp2 ON mc.b_amp_id = cp2.amp_id
INNER JOIN myTable3 tp2 ON cp2.amp_id = tp2.amp_id,
myTable4 mco
WHERE (((mc.amp_id)::text = (mco.call_amp_id)::text))
GROUP BY tp.name, tp2.name,
mc.date, mco.currency, mco.type, mc.state
LIMIT 1000""")
frame = pdsql.read_sql_query(query,conn)
The query takes over 15 minutes to run, even when my limit is set to 5. Before the GROUP BY clause was added, it would run with LIMIT 5000 in under 10 seconds. As I'm aware my SQL is not great, I was wondering if anybody has any insight into what might be causing the delay, as well as any improvements that could be made.
EDIT: I do not know how to view the performance of a SQL query, but if someone could inform me of this as well, I could post the performance of the script.
Regarding speeding up your workflow, you might be interested in checking out the 3rd part of my answer on this post: https://stackoverflow.com/a/50457922/5922920
If you want to keep a SQL-like interface while using a distributed file system you might want to have a look into Hive, Pig and Sqoop in addition to Hadoop and Spark.
Besides, to trace the performance of your SQL query, you can always track the execution time of your code on the client side, if appropriate.
For example :
import timeit

start_time = timeit.default_timer()
# Your code here
end_time = timeit.default_timer()
print(end_time - start_time)
Or use tools like these to take a deeper look at what is going on: https://stackify.com/performance-tuning-in-sql-server-find-slow-queries/
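On the EDIT about viewing query performance: since this looks like PostgreSQL, you can also ask the server itself for the actual plan and timings by prefixing the SELECT with EXPLAIN ANALYZE. A minimal sketch, assuming the same connection object as in the question and that select_query holds only the SELECT statement (without the CREATE INDEX statements):
cur = conn.cursor()
# PostgreSQL executes the query and reports the real plan,
# row counts, and per-node timings.
cur.execute("EXPLAIN ANALYZE " + select_query)
for (line,) in cur.fetchall():
    print(line)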
I think the delay is because the GROUP BY forces the database to aggregate over the entire dataset before anything can be returned: it has to scan your whole large result set to group every row and compute the counts and sums, and only then can the LIMIT be applied.
Without the GROUP BY, it does not have to process the entire dataset before it can start generating results; it can return the first matching rows right away.

When does SQLAlchemy decide to use a subquery with the .limit() method?

I have an error where SQLAlchemy produces a wrong SQL query, but I can't determine the conditions that trigger it.
I use Flask-SQLAlchemy, and initially it's just MyModel.query, represented by a simple SELECT with JOINs. But when the .limit() method is applied, it transforms the query and uses a subquery to fetch the main objects, and only then applies the JOINs. The problem is the ORDER BY statement, which remains the same and ignores the subquery definition.
Here's an example (I've simplified the selected fields):
-- Initially
SELECT *
FROM customer_rates
LEFT OUTER JOIN seasons AS seasons_1 ON seasons_1.id = customer_rates.season_id
LEFT OUTER JOIN users AS users_1 ON users_1.id = customer_rates.customer_id
-- other joins ...
ORDER BY customer_rates.id, customer_rates.id
-- Then .limit()
SELECT anon_1.*, *
FROM (
SELECT customer_rates.*
FROM customer_rates
LIMIT :param_1) AS anon_1
LEFT OUTER JOIN seasons AS seasons_1 ON seasons_1.id = anon_1.customer_rates_season_id
LEFT OUTER JOIN users AS users_1 ON users_1.id = anon_1.customer_rates_customer_id
-- other joins
ORDER BY customer_rates.id, customer_rates.id
And this query gives the following error:
ProgrammingError: (psycopg2.ProgrammingError) missing FROM-clause entry for table "customer_rates"
The last line of the query should be:
ORDER BY anon_1.customer_rates_id
The code that produces these queries is part of a large application. I've tried to reproduce this from scratch in a small Flask application, but I can't; in the small application it always uses a JOIN.
So I need to know when SQLAlchemy decides to use a subquery.
I use Python 2.7 and PostgreSQL 9.
The answer is pretty straightforward: SQLAlchemy uses a subquery when a joined table has a many-to-one relationship with the queried model, i.e. when the query eager-loads a collection. To produce the correct number of results it limits the queried rows in the subquery first, and only then applies the JOINs.
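A minimal sketch of the triggering condition, with hypothetical Parent/Child models and an already-configured session:
from sqlalchemy.orm import joinedload

# Parent.children is a one-to-many collection, so a join against it
# can return many rows per Parent.
query = (
    session.query(Parent)
    .options(joinedload(Parent.children))  # eager-load a collection
    .order_by(Parent.id)
    .limit(10)
)
# Because the join multiplies rows, SQLAlchemy selects the first 10
# Parent rows in a subquery and joins the collection afterwards; a plain
# LIMIT on the joined statement would limit join rows, not parents.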

Efficient way to run a select query millions of times

I want to run various select queries 100 million times, and I have approx. 1 million rows in a table. Therefore, I am looking for the fastest method to run all these select queries.
So far I have tried three different methods, and the results were similar.
The following three methods are, of course, not doing anything useful, but are purely for comparing performance.
First method:
for i in range(100000000):
    cur.execute("select id from testTable where name = 'aaa';")
Second method:
cur.execute("""PREPARE selectPlan AS
    SELECT id FROM testTable WHERE name = 'aaa';""")
for i in range(10000000):
    cur.execute("""EXECUTE selectPlan;""")
Third method:
def _data(n):
    cur = conn.cursor()
    for i in range(n):
        yield (i, 'test')

sql = """SELECT id FROM testTable WHERE name = 'aaa';"""
cur.executemany(sql, _data(10000000))
And the table is created like this:
cur.execute("""CREATE TABLE testTable ( id int, name varchar(1000) );""")
cur.execute("""CREATE INDEX indx_testTable ON testTable(name)""")
I thought that using the prepared statement functionality would really speed up the queries, but since that does not seem to happen, I was hoping you could give me a hint about other ways of doing this.
This sort of benchmark is unlikely to produce any useful data, but the second method should be fastest, as once the statement is prepared it is stored in memory by the database server. Further calls to repeat the query do not require the text of the query to be transmitted, saving a small amount of time.
This is likely to be moot, as the query is very small (likely the same quantity of packets over the wire as repeatedly sending the query text), and the query cache will serve the same data for every request.
What's the purpose of retrieving such an amount of data at once? I don't know your situation, but I'd definitely page the results using LIMIT and OFFSET. Take a look at:
7.6. LIMIT and OFFSET
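A minimal paging sketch along those lines, assuming a psycopg2-style cursor and an indexed ordering column (the page size is an arbitrary choice):
page_size = 1000
offset = 0
while True:
    cur.execute(
        "SELECT id FROM testTable WHERE name = %s "
        "ORDER BY id LIMIT %s OFFSET %s",
        ('aaa', page_size, offset))
    rows = cur.fetchall()
    if not rows:
        break
    # ... process this page of rows ...
    offset += page_size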
If you just want to benchmark SQL all on its own, without mixing Python into the equation, try pgbench.
http://developer.postgresql.org/pgdocs/postgres/pgbench.html
Also, what is your goal here?

sql select from a large number of IDs

I have a table, Foo. I run a query on Foo to get the ids from a subset of Foo. I then want to run a more complicated set of queries, but only on those IDs. Is there an efficient way to do this? The best I can think of is creating a query such as:
SELECT ... --complicated stuff
WHERE ... --more stuff
AND id IN (1, 2, 3, 9, 413, 4324, ..., 939393)
That is, I construct a huge "IN" clause. Is this efficient? Is there a more efficient way of doing this, or is the only way to JOIN with the initial query that gets the IDs? If it helps, I'm using SQLObject to connect to a PostgreSQL database, and I have access to the cursor that executed the query to get all the IDs.
UPDATE: I should mention that the more complicated queries all either rely on these IDs, or create more IDs to look up in the other queries. If I were to make one large query, I'd end up joining six tables at once or so, which might be too slow.
One technique I've used in the past is to put the IDs into a temp table, and then use that to drive a sequence of queries. Something like:
BEGIN;
CREATE TEMP TABLE search_result ON COMMIT DROP AS
SELECT entity_id
FROM entity /* long complicated search joins and conditions ... */;
-- Fetch primary entities
SELECT entity_id, entity.x /*, ... */
FROM entity JOIN search_result USING (entity_id);
-- Fetch some related entities
SELECT entity_id, related_entity_id, related_entity.x /*, ... */
FROM related_entity JOIN search_result USING (entity_id);
-- And more, as required
END;
This is particularly useful where the search result entities have multiple one-to-many relationships which you want to fetch without either a) doing N*M+1 selects or b) doing a cartesian join of related entities.
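Driving that pattern from Python might look like the following minimal sketch, assuming a psycopg2 connection; the LIKE condition and the name column are hypothetical stand-ins for the long complicated search in the comment above:
import psycopg2

conn = psycopg2.connect(dbname="mydb")  # hypothetical connection settings
with conn:  # one transaction, so ON COMMIT DROP cleans up the temp table
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TEMP TABLE search_result ON COMMIT DROP AS
            SELECT entity_id FROM entity WHERE name LIKE %s
        """, ('foo%',))
        # Each follow-up query joins against the small temp table.
        cur.execute("""
            SELECT entity_id, x FROM entity
            JOIN search_result USING (entity_id)
        """)
        primary_rows = cur.fetchall()
        cur.execute("""
            SELECT entity_id, related_entity_id, x FROM related_entity
            JOIN search_result USING (entity_id)
        """)
        related_rows = cur.fetchall()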
I would think it might be useful to use a VIEW. Simply create a view with your query for IDs, then join to that view by ID. That will limit your results to the required subset of IDs without an expensive IN statement.
I do know that the IN statement is more expensive than an EXISTS statement would be.
I think the join with the criteria to select the IDs will be more efficient, because the query optimizer has more options to do the right thing. Use the explain plan to see how PostgreSQL will approach it.
You are almost certainly better off with a join; however, another option is to use a sub-select, i.e.
SELECT ... --complicated stuff
WHERE ... --more stuff
AND id IN (select distinct id from Foo where ...)
