I have three queries and another table called output_table. This code works, but it needs to be executed as a single REPLACE INTO query. I know this involves nesting and subqueries, but I have no idea whether it is possible, since my key is the set of DISTINCT coin values from target_currency.
How do I rewrite queries 2 and 3 so they execute inside query 1, that is, in the REPLACE INTO query instead of the individual UPDATE ones?
1. conn3.cursor().execute(
"""REPLACE INTO coin_best_returns(coin) SELECT DISTINCT target_currency FROM output_table"""
)
2. conn3.cursor().execute(
"""UPDATE coin_best_returns SET
highest_price = (SELECT MAX(ask_price_usd) FROM output_table WHERE coin_best_returns.coin = output_table.target_currency),
lowest_price = (SELECT MIN(bid_price_usd) FROM output_table WHERE coin_best_returns.coin = output_table.target_currency)"""
)
3. conn3.cursor().execute(
"""UPDATE coin_best_returns SET
highest_market = (SELECT exchange FROM output_table WHERE coin_best_returns.highest_price = output_table.ask_price_usd),
lowest_market = (SELECT exchange FROM output_table WHERE coin_best_returns.lowest_price = output_table.bid_price_usd)"""
)
You can do it with the help of some window functions, a subquery, and an inner join. The version below is pretty lengthy, but it is less complicated than it may appear. It uses window functions in a subquery to compute the needed per-currency statistics, and factors this out into a common table expression to facilitate joining it to itself.
Other than the inline comments, the main reason for the complication is original query number 3. Queries (1) and (2) could easily be combined as a single, simple, aggregate query, but the third query is not as easily addressed. To keep the exchange data associated with the corresponding ask and bid prices, this query uses window functions instead of aggregate queries. This also provides a vehicle different from DISTINCT for obtaining one result per currency.
Here's the bare query:
WITH output_stats AS (
-- The ask and bid information for every row of output_table, every row
-- augmented by the needed maximum ask and minimum bid statistics
SELECT
target_currency as tc,
ask_price_usd as ask,
bid_price_usd as bid,
exchange as market,
MAX(ask_price_usd) OVER (PARTITION BY target_currency) as high,
ROW_NUMBER() OVER (
PARTITION BY target_currency, ask_price_usd ORDER BY exchange DESC)
as ask_rank,
MIN(bid_price_usd) OVER (PARTITION BY target_currency) as low,
ROW_NUMBER() OVER (
PARTITION BY target_currency, bid_price_usd ORDER BY exchange ASC)
as bid_rank
FROM output_table
)
REPLACE INTO coin_best_returns(
-- you must, of course, include all the columns you want to fill in the
-- upsert column list
coin,
highest_price,
lowest_price,
highest_market,
lowest_market)
SELECT
-- ... and select a value for each column
asks.tc,
asks.ask,
bids.bid,
asks.market,
bids.market
FROM output_stats asks
JOIN output_stats bids
ON asks.tc = bids.tc
WHERE
-- These conditions choose exactly one asks row and one bids row
-- for each currency
asks.ask = asks.high
AND asks.ask_rank = 1
AND bids.bid = bids.low
AND bids.bid_rank = 1
Note well that unlike the original query 3, this will consider only exchange values associated with the target currency for setting the highest_market and lowest_market columns in the destination table. I'm supposing that that's what you really want, but if not, then a different strategy will be needed.
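For reference, here is a minimal, self-contained sketch of running the combined statement from Python with sqlite3. The table layouts and sample rows are assumptions inferred from the question, not your real schema; note that window functions require SQLite 3.25 or later:
import sqlite3

conn3 = sqlite3.connect(':memory:')
# Assumed table layouts, inferred from the question
conn3.executescript("""
    CREATE TABLE output_table (
        target_currency TEXT, exchange TEXT,
        ask_price_usd REAL, bid_price_usd REAL);
    CREATE TABLE coin_best_returns (
        coin TEXT PRIMARY KEY, highest_price REAL, lowest_price REAL,
        highest_market TEXT, lowest_market TEXT);
    INSERT INTO output_table VALUES
        ('BTC', 'kraken', 30010, 29990),
        ('BTC', 'binance', 30050, 29970),
        ('ETH', 'kraken', 2005, 1995);
""")
# The combined query from above, verbatim
conn3.cursor().execute("""
    WITH output_stats AS (
        SELECT
            target_currency AS tc,
            ask_price_usd AS ask,
            bid_price_usd AS bid,
            exchange AS market,
            MAX(ask_price_usd) OVER (PARTITION BY target_currency) AS high,
            ROW_NUMBER() OVER (
                PARTITION BY target_currency, ask_price_usd
                ORDER BY exchange DESC) AS ask_rank,
            MIN(bid_price_usd) OVER (PARTITION BY target_currency) AS low,
            ROW_NUMBER() OVER (
                PARTITION BY target_currency, bid_price_usd
                ORDER BY exchange ASC) AS bid_rank
        FROM output_table
    )
    REPLACE INTO coin_best_returns(
        coin, highest_price, lowest_price, highest_market, lowest_market)
    SELECT asks.tc, asks.ask, bids.bid, asks.market, bids.market
    FROM output_stats asks
    JOIN output_stats bids ON asks.tc = bids.tc
    WHERE asks.ask = asks.high AND asks.ask_rank = 1
      AND bids.bid = bids.low AND bids.bid_rank = 1
""")
conn3.commit()
print(conn3.execute("SELECT * FROM coin_best_returns").fetchall())
# e.g. [('BTC', 30050.0, 29970.0, 'binance', 'binance'),
#       ('ETH', 2005.0, 1995.0, 'kraken', 'kraken')]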
In SQL, I can sum two counts like
SELECT (
(SELECT count(*) FROM a WHERE val=42)
+
(SELECT count(*) FROM b WHERE val=42)
)
How do I perform this query with the Django ORM?
The closest I got is
a.objects.filter(val=42).order_by().values_list('id', flat=True).union(
b.objects.filter(val=42).order_by().values_list('id', flat=True)
).count()
This works fine if the returned count is small, but it seems bad if there are a lot of rows that the database must hold in memory just to count them.
Your solution can be only slightly simplified, by using values('pk') instead of values_list('id', flat=True); that changes only the type of the rows in the output, because the underlying SQL of both querysets is the same:
SELECT id FROM a WHERE val=42 UNION SELECT id FROM b WHERE val=42
and the .count() method merely wraps a query around the subquery:
SELECT COUNT(*) FROM (... subquery ...)
A database backend does not necessarily hold all the values in memory; it can also just count them and discard them as it goes (not verified).
Similarly, if you run a simple SELECT COUNT(id) FROM a, it doesn't need to collect the id values.
Subqueries of the form SELECT count(*) FROM a WHERE val=42 inside a bigger query are not possible directly, because Django doesn't use lazy evaluation for aggregations and evaluates them immediately.
The evaluation can be postponed, e.g. by grouping by an expression that has only one possible value, such as GROUP BY (i >= 0) (or by an outer reference, if that would work), but the query plan may be worse.
Another problem is that a SELECT is not possible without a table, so I use an unimportant row of an unimportant table as the base of the query.
Example:
qs = Unimportant.objects.filter(pk=unimportant_pk).values('id').annotate(
total_a=a.objects.filter(val=42).order_by().values('val')
.annotate(cnt=models.Count('*')).values('cnt'),
total_b=b.objects.filter(val=42).order_by().values('val')
.annotate(cnt=models.Count('*')).values('cnt')
)
It is not nice, but the two count subqueries can easily be evaluated in parallel by the database:
SELECT
id,
(SELECT COUNT(*) AS cnt FROM a WHERE val=42 GROUP BY val) AS total_a,
(SELECT COUNT(*) AS cnt FROM b WHERE val=42 GROUP BY val) AS total_b
FROM unimportant WHERE id = unimportant_pk
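A usage sketch, building on the hypothetical model names above. Note that an annotated count comes back as None rather than 0 when no rows match, because the grouped subquery then returns no row:
row = qs.first()  # a single dict, since the base queryset matches one pk
total = (row['total_a'] or 0) + (row['total_b'] or 0)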
The Django docs confirm that a simpler solution doesn't exist:
Using aggregates within a Subquery expression
...
... This is the only way to perform an aggregation within a Subquery, as using aggregate() attempts to evaluate the queryset (and if there is an OuterRef, this will not be possible to resolve).
I included an image that shows what I am trying to do in SQL. The table on the left is what I get from running a SQL query on my server. I am trying to create the table on the right using SQL or Python. I would use a pivot, but since a pivot aggregates the values, I would have to take an average, sum, min, max, or something else of the component number column. Effectively, I need a transpose over two indices. There can be multiple tests done for each serial number. I need the first test by date to show up in the first test values column, and the same for the second, third, or fourth test. I need the values for each serial number and each component to show up in "test 1", "test 2", "test 3". The tricky part is that test 1, test 2, and test 3 are different for each part, yet they still need to be grouped into buckets by sequential test date for each serial number.
If anyone could help me out with some methods or help me generate some pseudocode for what I am trying to do, I would greatly appreciate it. Thanks.
Here is the link to the image of the tables: the left is what my SQL query pulls, and the right is how I want it to be.
You can use row_number() and conditional aggregation:
select
serial_number,
component_number,
max(case when rn = 1 then test_value end) test1,
max(case when rn = 2 then test_value end) test2
from (
select
t.*,
row_number() over(partition by serial_number, component_number order by test_date) rn
from mytable t
) t
group by serial_number, component_number
In the subquery, row_number() assigns a rank to each record within groups sharing the same serial_number and component_number. Then the outer query aggregates by serial_number and component_number and spreads the test_value entries across columns according to their rank. You can expand the select clause of the outer query with more conditional max()s to handle more than two tests per (serial_number, component_number) tuple.
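If you want to experiment, here is a minimal runnable sketch of the same pattern using sqlite3 with made-up data (table and column names taken from the query above; window functions require SQLite 3.25 or later):
import sqlite3

conn = sqlite3.connect(':memory:')
# Hypothetical data shaped like the question's left-hand table
conn.executescript("""
    CREATE TABLE mytable (
        serial_number TEXT, component_number TEXT,
        test_date TEXT, test_value REAL);
    INSERT INTO mytable VALUES
        ('SN1', 'C1', '2020-01-01', 1.5),
        ('SN1', 'C1', '2020-02-01', 2.5),
        ('SN1', 'C2', '2020-01-15', 3.5);
""")
rows = conn.execute("""
    SELECT
        serial_number,
        component_number,
        MAX(CASE WHEN rn = 1 THEN test_value END) AS test1,
        MAX(CASE WHEN rn = 2 THEN test_value END) AS test2
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (
                   PARTITION BY serial_number, component_number
                   ORDER BY test_date) AS rn
        FROM mytable t
    ) t
    GROUP BY serial_number, component_number
""").fetchall()
print(rows)
# e.g. [('SN1', 'C1', 1.5, 2.5), ('SN1', 'C2', 3.5, None)]
# (None where a component has fewer tests than columns; row order may vary)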
I want to get all the columns of a table with max(timestamp) and group by name.
What I have tried so far is:
normal_query ="Select max(timestamp) as time from table"
event_list = normal_query \
.distinct(Table.name)\
.filter_by(**filter_by_query) \
.filter(*queries) \
.group_by(*group_by_fields) \
.order_by('').all()
The query I get:
SELECT DISTINCT ON (schema.table.name) , max(timestamp)....
This query basically returns two columns, name and timestamp, whereas the query I want is:
SELECT DISTINCT ON (schema.table.name) * from table order by ....
which returns all the columns in that table. That is the expected behavior, and I am able to get all the columns; how could I write this down in Python to get to this statement? Basically the asterisk is missing.
Can somebody help me?
What you seem to be after is the DISTINCT ON ... ORDER BY idiom in Postgresql for selecting greatest-n-per-group results (N = 1). So instead of grouping and aggregating, just:
event_list = Table.query.\
distinct(Table.name).\
filter_by(**filter_by_query).\
filter(*queries).\
order_by(Table.name, Table.timestamp.desc()).\
all()
This will end up selecting rows "grouped" by name, having the greatest timestamp value.
You do not want to use the asterisk most of the time, not in your application code anyway, unless you're doing manual ad-hoc queries. The asterisk is basically "all columns from the FROM table/relation", which might then break your assumptions later, if you add columns, reorder them, and such.
In case you'd like to order the resulting rows based on timestamp in the final result, you can use for example Query.from_self() to turn the query to a subquery, and order in the enclosing query:
event_list = Table.query.\
distinct(Table.name).\
filter_by(**filter_by_query).\
filter(*queries).\
order_by(Table.name, Table.timestamp.desc()).\
from_self().\
order_by(Table.timestamp.desc()).\
all()
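To see roughly what SQL either version emits, you can print the query object before calling all() (a sketch; Table stands for whatever your mapped class is):
query = Table.query.\
    distinct(Table.name).\
    order_by(Table.name, Table.timestamp.desc())
print(query)
# roughly: SELECT DISTINCT ON (table.name) table.id, table.name, table.timestamp, ...
#          FROM table ORDER BY table.name, table.timestamp DESC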
I have a DB query that matches the desired rows. Let's say (for simplicity):
select * from stats where id in (1, 2);
Now I want to extract several frequency statistics (count of distinct values) for multiple columns, across these matching rows:
-- `stats.status` is one such column
select status, count(*) from stats where id in (1, 2) group by 1 order by 2 desc;
-- `stats.category` is another column
select category, count(*) from stats where id in (1, 2) group by 1 order by 2 desc;
-- etc.
Is there a way to re-use the same underlying query in SqlAlchemy? Raw SQL works too.
Or even better, return all the histograms at once, in a single command?
I'm mostly interested in performance, because I don't want Postgres to run the same row-matching many times, once for each column, over and over. The only change is which column is used for the histogram grouping. Otherwise it's the same set of rows.
I don't want Postgres to run the same row-matching many times
That's one of the motivations behind the GROUPING SETS functionality. Try this model:
SELECT category, status, count(*)
FROM stats where id in (1,2)
GROUP BY grouping sets ((category),(status));
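The result interleaves both histograms in a single result set; each row has NULL in the column it is not grouped by. An illustrative shape (made-up numbers, not real data):
 category | status | count
----------+--------+-------
 foo      | NULL   |    12
 bar      | NULL   |     5
 NULL     | open   |    10
 NULL     | closed |     7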
User Abelisto's comment and the other answer both have the correct SQL required to generate the histograms for multiple fields in a single query.
The only edit I would suggest to their efforts is to add an ORDER BY clause, as it seems from the OP's attempts that more frequent labels are desired at the top of the result. You might find that sorting the results in Python rather than in the database is simpler; in that case, disregard the complexity brought on by the ORDER BY clause.
Thus, the modified query would be:
SELECT category, status, count(*)
FROM stats
WHERE id IN (1, 2)
GROUP BY GROUPING SETS (
(category), (status)
)
ORDER BY
GROUPING(category, status), 3 DESC
It is also possible to express the same query using sqlalchemy.
from sqlalchemy import *
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class Stats(Base):
__tablename__ = 'stats'
id = Column(Integer, primary_key=True)
category = Column(Text)
status = Column(Text)
stmt = select(
[Stats.category, Stats.status, func.count(1)]
).where(
Stats.id.in_([1, 2])
).group_by(
func.grouping_sets(tuple_(Stats.category),
tuple_(Stats.status))
).order_by(
func.grouping(Stats.category, Stats.status),
func.count(1).desc()
)
Investigating the output, we see that it generates the desired query (extra newlines added in output for legibility)
print(stmt.compile(compile_kwargs={'literal_binds': True}))
# outputs:
SELECT stats.category, stats.status, count(1) AS count_1
FROM stats
WHERE stats.id IN (1, 2)
GROUP BY GROUPING SETS((stats.category), (stats.status))
ORDER BY grouping(stats.category, stats.status), count(1) DESC
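Executing the statement is then the usual SQLAlchemy routine (the DSN below is a placeholder; any Postgres connection works, since GROUPING SETS is a Postgres feature):
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost/dbname')  # placeholder DSN
with engine.connect() as conn:
    for category, status, count in conn.execute(stmt):
        print(category, status, count)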
I am trying to extract data that corresponds to a stock that is present in both of my data sets (given in the code below).
This is my data:
#(stock,price,recommendation)
my_data_1 = [('a',1,'BUY'),('b',2,'SELL'),('c',3,'HOLD'),('d',6,'BUY')]
#(stock,price,volume)
my_data_2 = [('a',1,5),('d',6,6),('e',2,7)]
Here are my questions:
Question 1:
I am trying to extract price, recommendation, and volume that correspond to asset 'a'. Ideally I would like to get a tuple like this:
(u'a',1,u'BUY',5)
Question 2:
What if I wanted to get the intersection for all the stocks (not just 'a' as in Question 1)? In this case it is stock 'a' and stock 'd', so my desired output becomes:
(u'a',1,u'BUY',5)
(u'd',6,u'BUY',6)
How should I do this?
Here is my try (Question 1):
import sqlite3
my_data_1 = [('a',1,'BUY'),('b',2,'SELL'),('c',3,'HOLD'),('d',6,'BUY')]
my_data_2 = [('a',1,5),('d',6,6),('e',2,7)]
#I am using :memory: because I want to experiment
#with the database a lot
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''CREATE TABLE MY_TABLE_1
(stock TEXT, price REAL, recommendation TEXT )''' )
c.execute('''CREATE TABLE MY_TABLE_2
(stock TEXT, price REAL, volume REAL )''' )
for ele in my_data_1:
c.execute('''INSERT INTO MY_TABLE_1 VALUES(?,?,?)''',ele)
for ele in my_data_2:
c.execute('''INSERT INTO MY_TABLE_2 VALUES(?,?,?)''',ele)
conn.commit()
# The problem is with the following line:
c.execute( 'select* from my_table_1 where stock = ? INTERSECT select* from my_table_2 where stock = ?',('a','a') )
for entry in c:
print entry
I get no error, but also no output, so something is clearly off.
I also tried this line:
c.execute( 'select* from my_table_1 where stock = ? INTERSECT select volume from my_table_2 where stock = ?',('a','a') )
but it does not work, I get this error:
c.execute( 'select* from my_table_1 where stock = ? INTERSECT select volume from my_table_2 where stock = ?',('a','a') )
sqlite3.OperationalError: SELECTs to the left and right of INTERSECT do not have the same number of result columns
I understand why I would have a different number of resulting columns, but I don't quite get why that triggers an error.
How should I do this?
Thank you in advance.
It looks like those two questions are really the same question.
Why your query doesn't work: Let's reformat the query.
SELECT * FROM my_table_1 WHERE stock=?
INTERSECT
SELECT volume FROM my_table_2 WHERE stock=?
There are two queries in the intersection,
SELECT * FROM my_table_1 WHERE stock=?
SELECT volume FROM my_table_2 WHERE stock=?
The meaning of "intersect" is "give me the rows that are in both queries". That doesn't make any sense if the queries have a different number of columns, since it's impossible for any row to appear in both queries.
Note that SELECT volume FROM my_table_2 isn't a very useful query, since it doesn't tell you which stock the volume belongs to. The query will give you something like {100, 15, 93, 42}.
What you're actually trying to do: You want a join.
SELECT my_table_1.stock, my_table_2.price, recommendation, volume
FROM my_table_1
INNER JOIN my_table_2 ON my_table_1.stock=my_table_2.stock
WHERE my_table_1.stock=?  -- qualified, since stock exists in both tables
Think of join as "glue the rows from one table onto the rows from another table, giving data from both tables in a single row."
It's bizarre that the price appears in both tables; when you write the query with the join you have to decide whether you want my_table_1.price or my_table_2.price, or whether you want to join on my_table_1.price=my_table_2.price. You may want to consider redesigning your schema so this doesn't happen; it may make your life easier.
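For question 1 specifically, here is a parameterized sketch using the cursor from your script (note the qualified column name in the WHERE clause, since stock exists in both tables):
c.execute("""SELECT my_table_1.stock, my_table_2.price, recommendation, volume
             FROM my_table_1
             INNER JOIN my_table_2 ON my_table_1.stock=my_table_2.stock
             WHERE my_table_1.stock=?""", ('a',))
print(c.fetchall())
# e.g. [(u'a', 1.0, u'BUY', 5.0)] on Python 2; [('a', 1.0, 'BUY', 5.0)] on Python 3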
You are suffering from a misunderstanding about how to correlate different tables.
The easiest way to do this is to JOIN them with a suitable condition, producing results which automatically include the data from both joined tables. In the example below I select all columns, but you can of course select only those you want by naming them in the SELECT list. You can also restrict the output to only those rows you want with (a) further condition(s) in a WHERE clause. After you execute your code, try the following:
>>> c.execute("select * from my_table_1 t1 JOIN my_table_2 t2 ON t1.stock=t2.stock")
<sqlite3.Cursor object at 0x1004608f0>
This tells SQLite to take rows from table 1 and join them with rows in table 2 meeting the conditions in the ON clause (i.e. they have to have the same value for their STOCK attribute). Because you chose such long table names, and because I am a crappy typist, I used table aliases in the FROM clause to allow me to use shortened names in the rest of the query.
>>> c.fetchall()
then gives you the result
[(u'a', 1.0, u'BUY', u'a', 1.0, 5.0), (u'd', 6.0, u'BUY', u'd', 6.0, 6.0)]
which would seem to answer both 1) and 2). For only a particular value of STOCK just add
WHERE t1.STOCK = 'a' -- or other required value, naturally
to the query string. You can see the names of the columns returned by querying the cursor's description attribute:
>>> [d[0] for d in c.description]
['stock', 'price', 'recommendation', 'stock', 'price', 'volume']
The INTERSECT operation is used to take the outputs from two separate SELECT queries and return only those rows that occur in both. I don't think that's going to be helpful here. The reason you got the error is that the queries have to be "union compatible", which is to say they need the same number and types of columns in the intersected queries.
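For completeness, here is a sketch of a union-compatible INTERSECT against the same tables, which finds the stocks present in both (one column on each side):
c.execute("""SELECT stock FROM my_table_1
             INTERSECT
             SELECT stock FROM my_table_2""")
print(c.fetchall())  # e.g. [(u'a',), (u'd',)]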