I use psycopg2 to join two tables together, but for some reason the process is taking way too long. Both tables are indexed. The tables have around 280 rows and 5-6 columns, but they do have a geometry column with complex country boundaries.
This is what the query looks like:
sqlstr = """CREATE TABLE {new_table_name} AS (
SELECT {geom_table_columns}, {added_columns}
FROM {geom_table}
INNER JOIN {temp_table} ON
(g.{id} = {temp_table}.{id})
);
""".format(** {
'new_table_name': new_table,
'geom_table': geom_table_name,
'geom_table_columns': geom_table_columns,
'temp_table': table_name_temp,
'id': geom_table_id,
'added_columns': added_columns
})
Is there any way to make this inner join query faster?
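One way to narrow down where the time goes (a diagnostic sketch, not from the original question; the connection string and the geom_table/temp_table/id names are placeholders standing in for the .format() parameters above) is to run EXPLAIN ANALYZE on just the SELECT part from psycopg2:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        EXPLAIN (ANALYZE, BUFFERS)
        SELECT g.*, t.*
        FROM geom_table g
        INNER JOIN temp_table t ON g.id = t.id
    """)
    # EXPLAIN returns one text row per plan line
    for (line,) in cur.fetchall():
        print(line)

If the plan shows the join itself finishing in milliseconds (plausible for ~280 rows), the time is most likely spent copying the large geometry values into the new table rather than in the join itself.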
I have two querysets -
A = Bids.objects.filter(*args, **kwargs).annotate(
    highest_priority=Case(*[
        When(data_source=data_source, then=Value(i))
        for i, data_source in enumerate(data_source_order_list)
    ])
).order_by(
    "date",
    "highest_priority"
)
B = A.values("date").annotate(Min("highest_priority")).order_by("date")
The first query gives me all objects within the selected time range with the proper data sources and values. Through highest_priority I set which item should be selected. All items have additional data.
The second query gives me grouped-by information about the items on every date. In the second query I do not have important values like price etc., so I assume I have to join these two tables and filter out rows where a.highest_priority = b.highest_priority, because in that case I will get a queryset with objects and only one item per date.
I have tried using distinct - it does not work with .first()/.last(). Annotate gives me dicts from the group by, and grouping by only date cuts out a lot of important data, but I have to group by only date...
The tables look like this:
A
B
How do I join them? When I join them I can easily filter highest_prio against highest_prio and get my data with only one database shot. I want to use the ORM, because otherwise I could just distinct and put it on a list, and I do not want to hammer the database by connecting multiple queries through date.
See if this suggestion works:
SELECT *, (to_char(a.date, 'YYYYMMDD')::integer) * a.highest_priority AS prioritycalc
FROM table_a a
JOIN table_b b
    ON (to_char(a.date, 'YYYYMMDD')::integer) * a.highest_priority = (to_char(b.date, 'YYYYMMDD')::integer) * b.highest_priority
ORDER BY prioritycalc DESC;
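If you would rather stay entirely in the ORM, a correlated Subquery is another common way to keep exactly one full row per date. This is only a sketch built on the snippets above (Bids, data_source_order_list and the filter arguments are all taken from the question), not the accepted approach:

from django.db.models import Case, IntegerField, OuterRef, Subquery, Value, When

annotated = Bids.objects.filter(*args, **kwargs).annotate(
    highest_priority=Case(
        *[
            When(data_source=data_source, then=Value(i))
            for i, data_source in enumerate(data_source_order_list)
        ],
        output_field=IntegerField(),
    )
)

# For each outer row's date, pick the pk of the row with the lowest priority value.
best_per_date = (
    annotated.filter(date=OuterRef("date"))
    .order_by("highest_priority")
    .values("pk")[:1]
)

# Keep only the rows that are "best" for their date: full objects, one per date.
event_list = annotated.filter(pk=Subquery(best_per_date)).order_by("date")

This stays in a single SQL statement, and unlike the values()/annotate() grouping it still returns full model instances with price and the other fields.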
I have SQL tables in the format below, and I am figuring out the best approach to navigate and join the tables to get to the final result. I could also do it in Python, since this seems to require a join on multiple columns, which would end up duplicating rows.
Any tips?
Table 1:
Table 2 and Table 3 have a different number of digits in Account.
Table 2:
Table 3:
Table 1 - with the new columns that are needed after navigating Table 2 and Table 3 to fetch the data into Table 1.
You seem to want to combine tables 2 and 3. The logic is not 100% clear, but something like this:
with t23 as (
    select t2.account, concat(t3.account, ' - ', t3.desc) as desc
    from table2 t2
    join table3 t3
        on t2.desc = t3.desc
)
select t1.*, t23_1.desc, t23_2.desc, t23_3.desc
from table1 t1
left join t23 t23_1
    on t1.c1 = t23_1.account
left join t23 t23_2
    on t1.c2 = t23_2.account
left join t23 t23_3
    on t1.c3 = t23_3.account;
I am not sure whether t23 should really be defined by fiddling with the account column; however, joining on desc seems more obvious.
Also, desc is a very bad name for a column, because it is a SQL keyword (think ORDER BY ... DESC).
I am writing large amounts of data to an SQLite database, using a temporary dataframe to find unique values.
This SQL code takes forever in cn.execute(sql):
if upload_to_db:
    print(f'########################################WRITING TO TEMP TABLE: {symbol} #######################################################################')
    master_df.to_sql(name='tempTable', con=engine, if_exists='replace')
    with engine.begin() as cn:
        sql = """INSERT INTO instrumentsHistory (datetime, instrumentSymbol, observation, observationColName)
                 SELECT t.datetime, t.instrumentSymbol, t.observation, t.observationColName
                 FROM tempTable t
                 WHERE NOT EXISTS
                     (SELECT 1 FROM instrumentsHistory f
                      WHERE t.datetime = f.datetime
                        AND t.instrumentSymbol = f.instrumentSymbol
                        AND t.observation = f.observation
                        AND t.observationColName = f.observationColName)"""
        print(f'##############################################WRITING TO FINAL TABLE: {symbol} #################################################################')
        cn.execute(sql)
Running this takes forever to write to the database. Can someone help me understand how to speed it up?
Edit 1:
How many rows, roughly? - About 15,000 at a time. Basically it pulls data into a pandas dataframe, makes some transformations, and then writes it to an SQLite database. There are probably 600 different instruments, each with around 15,000 rows, so roughly 9M rows ultimately. Give or take a million...
Depending on your SQL database, you could try using something like INSERT IGNORE (MySQL), INSERT OR IGNORE (SQLite), or MERGE (e.g. on Oracle), which would do the insert only if it would not violate a primary key or unique constraint. This assumes that such a constraint exists on the 4 columns which you are checking.
In the absence of merge, you could try adding the following index to the instrumentsHistory table:
CREATE INDEX idx ON instrumentsHistory (datetime, instrumentSymbol, observation, observationColName);
This index would allow for rapid lookup of each incoming record, coming from the tempTable, and so might speed up the insert process.
This subquery
WHERE NOT EXISTS
(SELECT 1 FROM instrumentsHistory f
WHERE t.datetime = f.datetime
AND t.instrumentSymbol = f.instrumentSymbol
AND t.observation = f.observation
AND t.observationColName = f.observationColName)
has to check every row in the table - and match four columns - until a match is found. In the worst case, there is no match and a full table scan must be completed. Therefore, the performance of the query will deteriorate as the table grows in size.
The solution, as mentioned in Tim's answer, is to create an index over the four columns so that the db can quickly determine whether a match exists.
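Since the target database here is SQLite, a closely related option (just a sketch: it assumes the four columns really should be unique together, and it reuses the engine and tempTable names from the question) is to declare a unique index once and let INSERT OR IGNORE skip the duplicates:

with engine.begin() as cn:
    # One-time setup: enforce uniqueness on the four lookup columns.
    cn.execute("""CREATE UNIQUE INDEX IF NOT EXISTS idx_instrumentsHistory_unique
                  ON instrumentsHistory (datetime, instrumentSymbol, observation, observationColName)""")

    # Duplicates are skipped by the constraint instead of being filtered with NOT EXISTS.
    # (On SQLAlchemy 2.0+ these raw strings need to be wrapped in sqlalchemy.text().)
    cn.execute("""INSERT OR IGNORE INTO instrumentsHistory
                      (datetime, instrumentSymbol, observation, observationColName)
                  SELECT datetime, instrumentSymbol, observation, observationColName
                  FROM tempTable""")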
I want to get all the columns of a table with max(timestamp), grouped by name.
What I have tried so far:
normal_query ="Select max(timestamp) as time from table"
event_list = normal_query \
.distinct(Table.name)\
.filter_by(**filter_by_query) \
.filter(*queries) \
.group_by(*group_by_fields) \
.order_by('').all()
The query I get:
SELECT DISTINCT ON (schema.table.name) , max(timestamp)....
This query basically returns two columns, name and timestamp.
Whereas the query I want is:
SELECT DISTINCT ON (schema.table.name) * from table order by ....
which returns all the columns in that table. That is the expected behavior, and I am able to get all the columns; how could I write it down in Python to get to this statement? Basically, the asterisk is missing.
Can somebody help me?
What you seem to be after is the DISTINCT ON ... ORDER BY idiom in PostgreSQL for selecting greatest-n-per-group results (N = 1). So instead of grouping and aggregating, just:
event_list = Table.query.\
    distinct(Table.name).\
    filter_by(**filter_by_query).\
    filter(*queries).\
    order_by(Table.name, Table.timestamp.desc()).\
    all()
This will end up selecting rows "grouped" by name, having the greatest timestamp value.
You do not want to use the asterisk most of the time, not in your application code anyway, unless you're doing manual ad-hoc queries. The asterisk is basically "all columns from the FROM table/relation", which might then break your assumptions later, if you add columns, reorder them, and such.
In case you'd like to order the resulting rows based on timestamp in the final result, you can use for example Query.from_self() to turn the query into a subquery and order in the enclosing query:
event_list = Table.query.\
    distinct(Table.name).\
    filter_by(**filter_by_query).\
    filter(*queries).\
    order_by(Table.name, Table.timestamp.desc()).\
    from_self().\
    order_by(Table.timestamp.desc()).\
    all()
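As a side note (not part of the original answer), you can check what actually gets emitted by printing the Query object before calling .all(); str() on a SQLAlchemy Query renders the compiled SQL, so you should see the DISTINCT ON (name) clause and the wrapping subquery:

query = Table.query.\
    distinct(Table.name).\
    filter_by(**filter_by_query).\
    filter(*queries).\
    order_by(Table.name, Table.timestamp.desc()).\
    from_self().\
    order_by(Table.timestamp.desc())

print(query)              # shows the generated SELECT ... DISTINCT ON (...) ... statement
event_list = query.all()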
I tried to run this query:
update table1 A
set number = (select count(distinct(id)) from table2 B where B.col1 = A.col1 or B.col2 = A.col2);
but it takes forever because table1 has 1,100,000 rows and table2 has 350,000,000 rows.
Is there any faster way to do this query, in R or in Python?
I rewrote your query with three subqueries instead of one - with UNION and two INNER JOIN statements:
UPDATE table1 AS A
SET number = (SELECT COUNT(DISTINCT id)
              FROM
                ((SELECT A.id AS id
                  FROM table1 AS A
                  INNER JOIN table2 AS B
                      ON A.col1 = B.col1)  -- condition for col1
                 UNION DISTINCT
                 (SELECT A.id AS id
                  FROM table1 AS A
                  INNER JOIN table2 AS B
                      ON A.col2 = B.col2)  -- condition for col2
                ) AS matched_ids
             );
My notes:
Updating all of the rows in table1 doesn't look like a good idea, because we have to touch 1.1M rows. Probably another data structure for storing number would give better performance.
Try to run part of the query without the UPDATE of table1 (only the part of the query in parentheses).
Take a look at EXPLAIN if you need a more general approach to optimizing SQL queries: https://dev.mysql.com/doc/refman/5.7/en/using-explain.html. A short sketch combining the last two notes follows below.
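A minimal sketch of the last two notes, assuming MySQL (per the EXPLAIN link) and mysql-connector-python; the connection parameters are placeholders:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="me", password="secret", database="mydb")
cur = conn.cursor()

# Run EXPLAIN on just the inner SELECT to see which indexes (if any) the joins use
# before paying for the full 1.1M-row UPDATE.
cur.execute("""
    EXPLAIN
    SELECT COUNT(DISTINCT id)
    FROM (SELECT A.id AS id FROM table1 AS A INNER JOIN table2 AS B ON A.col1 = B.col1
          UNION DISTINCT
          SELECT A.id AS id FROM table1 AS A INNER JOIN table2 AS B ON A.col2 = B.col2
         ) AS matched_ids
""")
for row in cur.fetchall():
    print(row)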