I tried to run this query:
update table1 A
set number = (select count(distinct(id)) from table2 B where B.col1 = A.col1 or B.col2 = A.col2);
but it takes forever because table1 has 1,100,000 rows and table2 has 350,000,000 rows.
Is there any faster way to do this query, in R or in Python?
I rewrote your query to use three subqueries instead of one, with a UNION and two INNER JOINs:
UPDATE table1 AS A
SET number = (SELECT COUNT(DISTINCT id)
              FROM
                ((SELECT A1.id AS id
                  FROM table1 AS A1
                  INNER JOIN table2 AS B
                  ON A1.col1 = B.col1)  -- condition for col1
                 UNION DISTINCT
                 (SELECT A2.id AS id
                  FROM table1 AS A2
                  INNER JOIN table2 AS B
                  ON A2.col2 = B.col2)  -- condition for col2
                ) AS U  -- MySQL requires an alias for the derived table
             )
My notes:
Updating all of the rows in table1 doesn't look like a good idea, because we have to touch 1.1M rows. Another data structure for storing number (e.g. a separate table of precomputed counts) would probably perform better.
Try running part of the query without the UPDATE of table1 (only the part of the query in parentheses); a sketch of doing this from Python follows these notes.
Take a look at EXPLAIN if you need a more general approach to optimizing SQL queries: https://dev.mysql.com/doc/refman/5.7/en/using-explain.html
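Since the question mentions Python, here is a minimal hedged sketch of that advice: run just the inner SELECT under EXPLAIN from Python before attempting the UPDATE. The mysql-connector-python driver and the connection details are assumptions; table and column names are taken from the question.

import mysql.connector  # assumption: mysql-connector-python is installed

conn = mysql.connector.connect(user="user", password="pw", database="db")  # placeholder credentials
cur = conn.cursor()
# EXPLAIN the inner SELECT alone, without touching table1's rows
cur.execute("""
    EXPLAIN
    SELECT COUNT(DISTINCT id)
    FROM ((SELECT A1.id AS id FROM table1 A1 JOIN table2 B ON A1.col1 = B.col1)
          UNION DISTINCT
          (SELECT A2.id AS id FROM table1 A2 JOIN table2 B ON A2.col2 = B.col2)) AS U
""")
for row in cur.fetchall():
    print(row)  # one row per step of the query plan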
I have SQL tables in the format below, and I'm trying to figure out the best approach to navigate and join them to get the final result. I could do it in Python as well, since this seems to require joins on multiple columns, which would end up duplicating rows.
Any tips?
Table 1:
Table 2 and Table 3 have different numbers of digits in Account.
Table 2:
Table 3:
Table 1, with the new columns that are needed after navigating from Table 2 and Table 3 to fetch the data into Table 1:
You seem to want to combine tables 2 and 3. The logic is not 100% clear, but something like this should work:
with t23 as (
      select t2.account, concat(t3.account, ' - ', t3.desc) as desc
      from table2 t2 join
           table3 t3
           on t2.desc = t3.desc
     )
select t1.*, t23_1.desc as desc1, t23_2.desc as desc2, t23_3.desc as desc3
from table1 t1 left join
     t23 t23_1
     on t1.c1 = t23_1.account left join
     t23 t23_2
     on t1.c2 = t23_2.account left join
     t23 t23_3
     on t1.c3 = t23_3.account;
I am not sure whether t23 should instead be defined by fiddling with the account column; joining on desc seems more natural.
Also, desc is a very bad name for a column, because it is a SQL keyword (think ORDER BY ... DESC).
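Because the original tables were posted as images, here is a self-contained, purely hypothetical sketch of the CTE plus repeated LEFT JOIN pattern using Python's sqlite3. All table contents are invented, and the column is renamed descr to sidestep the DESC keyword problem just mentioned.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE table2 (account TEXT, descr TEXT);
    CREATE TABLE table3 (account TEXT, descr TEXT);
    INSERT INTO table1 VALUES ('100', '200', '300');
    INSERT INTO table2 VALUES ('100', 'cash'), ('200', 'inventory');
    INSERT INTO table3 VALUES ('10', 'cash'), ('30', 'loans');
""")
rows = conn.execute("""
    WITH t23 AS (
        SELECT t2.account, t3.account || ' - ' || t3.descr AS descr
        FROM table2 t2 JOIN table3 t3 ON t2.descr = t3.descr
    )
    SELECT t1.*, t23_1.descr, t23_2.descr, t23_3.descr
    FROM table1 t1
    LEFT JOIN t23 t23_1 ON t1.c1 = t23_1.account
    LEFT JOIN t23 t23_2 ON t1.c2 = t23_2.account
    LEFT JOIN t23 t23_3 ON t1.c3 = t23_3.account
""").fetchall()
print(rows)  # unmatched c2/c3 values come back as None from the LEFT JOINs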
In SQL, I can sum two counts like this:
SELECT (
(SELECT count(*) FROM a WHERE val=42)
+
(SELECT count(*) FROM b WHERE val=42)
)
How do I perform this query with the Django ORM?
The closest I got is
a.objects.filter(val=42).order_by().values_list('id', flat=True).union(
b.objects.filter(val=42).order_by().values_list('id', flat=True)
).count()
This works fine if the returned count is small, but it seems bad if there are a lot of rows that the database must hold in memory just to count them.
Your solution can be simplified only a little, by using values('pk') instead of values_list('id', flat=True); that changes only the row type of the output, while the underlying SQL of both querysets is the same:
SELECT id FROM a WHERE val=42 UNION SELECT id FROM b WHERE val=42
and the .count() method only wraps that query in a subquery:
SELECT COUNT(*) FROM (... subquery ...)
It is not necessary for a database backend to hold all the values in memory; it can also just count them and discard them as it goes (not verified).
Similarly, if you run a plain SELECT COUNT(id) FROM a, it doesn't need to collect the id values.
Subqueries of the form SELECT count(*) FROM a WHERE val=42 inside a bigger query are not possible directly, because Django doesn't evaluate aggregations lazily; it evaluates them immediately.
The evaluation can be postponed, e.g. by grouping by an expression that has only one possible value, such as GROUP BY (i >= 0) (or by an outer reference, if that worked), but the query plan can be worse.
Another problem is that a SELECT is not possible without a table, so I will use an unimportant row of an unimportant table as the base of the query.
Example:
qs = Unimportant.objects.filter(pk=unimportant_pk).values('id').annotate(
    # GROUP BY val collapses each subquery to a single row, which
    # postpones the aggregation instead of evaluating it immediately
    total_a=a.objects.filter(val=42).order_by().values('val')
            .annotate(cnt=models.Count('*')).values('cnt'),
    total_b=b.objects.filter(val=42).order_by().values('val')
            .annotate(cnt=models.Count('*')).values('cnt'),
)
It is not nice, but it can easily be parallelized:
SELECT
id,
(SELECT COUNT(*) AS cnt FROM a WHERE val=42 GROUP BY val) AS total_a,
(SELECT COUNT(*) AS cnt FROM b WHERE val=42 GROUP BY val) AS total_b
FROM unimportant WHERE id = unimportant_pk
The Django docs confirm that a simple solution doesn't exist:
Using aggregates within a Subquery expression
...
... This is the only way to perform an aggregation within a Subquery, as using aggregate() attempts to evaluate the queryset (and if there is an OuterRef, this will not be possible to resolve).
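For comparison, the same idea can be spelled with an explicit Subquery, which is the pattern the quoted docs describe. A hedged sketch, reusing the models a, b, and Unimportant from above; it is the same GROUP BY trick, just wrapped explicitly:

from django.db.models import Count, IntegerField, Subquery

# each queryset collapses to one row holding the count (the GROUP BY trick)
count_a = (a.objects.filter(val=42).order_by().values('val')
           .annotate(cnt=Count('*')).values('cnt'))
count_b = (b.objects.filter(val=42).order_by().values('val')
           .annotate(cnt=Count('*')).values('cnt'))

qs = Unimportant.objects.filter(pk=unimportant_pk).annotate(
    total_a=Subquery(count_a, output_field=IntegerField()),
    total_b=Subquery(count_b, output_field=IntegerField()),
)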
I have a bunch of tables that I'm iterating through, and some of them have no rows (i.e. just a table of headers with no data).
ex: SELECT my_column FROM my_schema.my_table LIMIT 1 returns an empty result set.
What is the absolute fastest way to check that a table is one of these tables with no rows?
I've considered: SELECT my_column FROM my_schema.my_table LIMIT 1 or SELECT * FROM my_schema.my_table LIMIT 1
followed by an if result is None check (I'm working in Python). Is there any faster way to check?
This is not faster than your solution but returns a boolean regardless:
select exists (select 1 from mytable)
select exists (select * from myTab);
or
select 1 where exists (select * from myTab)
or even
SELECT reltuples FROM pg_class WHERE oid = 'schema_name.table_name'::regclass;
The third example uses Postgres's row estimator, which may not be 100% accurate but may be a tad faster.
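Since the question mentions Python, a minimal hedged sketch of the EXISTS check via psycopg2; the connection string and table name are placeholders:

import psycopg2  # assumption: psycopg2 and a reachable Postgres database

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()
cur.execute("SELECT EXISTS (SELECT 1 FROM my_schema.my_table)")
has_rows = cur.fetchone()[0]  # a single boolean: False means the table is empty
print("non-empty" if has_rows else "empty")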
SELECT COUNT(*) FROM table_name LIMIT 1;
Try this code.
I have a query to search for data that falls within a certain time period like so:
SELECT id1
FROM table1
WHERE (time > '[time goes here]'
AND time < '[time goes here]')
I am storing and using this data in Python, and then I want to search another table for data that matches the previous result, like so:
SELECT id2
FROM table2
WHERE (table2.id1 = '[results from previous query]'.id1
AND '[other conditions go here]')
SELECT id3
FROM table3
WHERE (table3.id2 = '[results from previous query]'.id2
AND '[other conditions go here]')
I would have to do this recursively, an undetermined number of times, so it cannot be done manually. Is it possible to use the results of the previous query in some way, or would I have to put the entire first query into the second query, and then the entire second query into the third? If it is the latter, is there any way to speed this up? The first query alone takes several seconds, and I can't afford to rerun it multiple times.
You can use a CTE like this:
with data1 as
(
    select id2 from table2 where <condition>
),
data2 as
(
    select id3 from table3 where <condition>
)
select * from data1, data2
or, chaining the results of data1 into data2:
with data1 as
(
    select id2 from table2 where <condition>
),
data2 as
(
    select id3 from table3, data1 where <condition>
)
select * from data2
Solution 1:
You could use a join:
select t2.id2 from table1 t1, table2 t2 where t2.id1 = t1.id1 and t1.time > x and t1.time < y
Solution 2:
You could also save the result in a list and then pass it to the next query:
select id2 from table2 where id1 in (list)
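If the number of steps is not known in advance, the list-passing idea generalizes to a loop in Python. A minimal hedged sketch, assuming sqlite3 and the table/column names from the question; for very long id lists, a temporary table would be preferable to IN:

import sqlite3

conn = sqlite3.connect("mydb.sqlite")  # placeholder database
t0, t1 = "2021-01-01", "2021-02-01"    # example time bounds

# run the expensive first query exactly once
ids = [row[0] for row in conn.execute(
    "SELECT id1 FROM table1 WHERE time > ? AND time < ?", (t0, t1))]

# each step: (table to query, column to match on, column to return)
steps = [("table2", "id1", "id2"), ("table3", "id2", "id3")]
for table, key, out in steps:
    if not ids:
        break  # nothing left to chain
    marks = ",".join("?" * len(ids))
    ids = [row[0] for row in conn.execute(
        f"SELECT {out} FROM {table} WHERE {key} IN ({marks})", ids)]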
I'm not quite even sure where / what to search for - so apologies if this is a trivial thing that has been asked before!
I have two tables in sqlite:
table_A = [id, value1, value2]
table_A$foo = [id, foo(value1), foo(value2)]
table_A$bar = [id, bar(value1), bar(value2)]
where foo() / bar() are arbitrary functions whose details aren't relevant here.
Now at the moment, I do:
select * from table_A
And use this cursor to compute all the rows for each of the derivative tables.
If something goes wrong (or I add new rows to table_A), I'd like a way to compute (within SQL, rather than in Python) which rows are already present in table_A$foo etc., and so select only the remaining ones (like an AND NOT) to compute foo() and bar(). I should be able to do this on the id column, as the ids stay the same.
I'm wondering if there is a way to do this in sqlite, which I imagine would be quicker than trying to rig this up in Python.
Many thanks!
It isn't clear whether you consider a match to be based on the value1 columns alone or on a combination of all three columns...
Using EXISTS to find those that are already present:
SELECT *
FROM TABLE_A a
WHERE EXISTS(SELECT NULL
FROM TABLE_A$foo f
WHERE a.id = f.id
AND a.value1 = f.value1
AND a.value2 = f.value2)
Using EXISTS to find those that are not present:
SELECT *
FROM TABLE_A a
WHERE NOT EXISTS(SELECT NULL
FROM TABLE_A$foo f
WHERE a.id = f.id
AND a.value1 = f.value1
AND a.value2 = f.value2)
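To round this out from the Python side, a hedged end-to-end sketch with sqlite3: select only the rows of table_A whose id is missing from table_A$foo, compute foo() in Python, and insert the results. foo() here is a placeholder, and table_A$foo is assumed to have exactly the three columns listed in the question.

import sqlite3

def foo(v):
    return v * 2  # placeholder for the real transformation

conn = sqlite3.connect("mydb.sqlite")  # placeholder database
# rows of table_A not yet present in table_A$foo, matched on id only
missing = conn.execute("""
    SELECT a.id, a.value1, a.value2
    FROM table_A a
    WHERE NOT EXISTS (SELECT NULL FROM "table_A$foo" f WHERE f.id = a.id)
""").fetchall()
conn.executemany(
    'INSERT INTO "table_A$foo" VALUES (?, ?, ?)',
    [(i, foo(v1), foo(v2)) for i, v1, v2 in missing],
)
conn.commit()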