Intersection in sqlite3 in Python

I am trying to extract data that corresponds to stocks present in both of my data sets (given in the code below).
This is my data:
#(stock,price,recommendation)
my_data_1 = [('a',1,'BUY'),('b',2,'SELL'),('c',3,'HOLD'),('d',6,'BUY')]
#(stock,price,volume)
my_data_2 = [('a',1,5),('d',6,6),('e',2,7)]
Here are my questions:
Question 1:
I am trying to extract price, recommendation, and volume that correspond to asset 'a'. Ideally I would like to get a tuple like this:
(u'a',1,u'BUY',5)
Question 2:
What if I wanted to get the intersection for all the stocks (not just 'a' as in Question 1)? In this case that is stock 'a' and stock 'd', so my desired output becomes:
(u'a',1,u'BUY',5)
(u'd',6,u'BUY',6)
How should I do this?
Here is my try (Question 1):
import sqlite3
my_data_1 = [('a',1,'BUY'),('b',2,'SELL'),('c',3,'HOLD'),('d',6,'BUY')]
my_data_2 = [('a',1,5),('d',6,6),('e',2,7)]
#I am using :memory: because I want to experiment
#with the database a lot
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''CREATE TABLE MY_TABLE_1
(stock TEXT, price REAL, recommendation TEXT )''' )
c.execute('''CREATE TABLE MY_TABLE_2
(stock TEXT, price REAL, volume REAL )''' )
for ele in my_data_1:
    c.execute('''INSERT INTO MY_TABLE_1 VALUES(?,?,?)''',ele)
for ele in my_data_2:
    c.execute('''INSERT INTO MY_TABLE_2 VALUES(?,?,?)''',ele)
conn.commit()
# The problem is with the following line:
c.execute( 'select* from my_table_1 where stock = ? INTERSECT select* from my_table_2 where stock = ?',('a','a') )
for entry in c:
    print entry
I get no error, but also no output, so something is clearly off.
I also tried this line:
c.execute( 'select* from my_table_1 where stock = ? INTERSECT select volume from my_table_2 where stock = ?',('a','a') )
but it does not work, I get this error:
c.execute( 'select* from my_table_1 where stock = ? INTERSECT select volume from my_table_2 where stock = ?',('a','a') )
sqlite3.OperationalError: SELECTs to the left and right of INTERSECT do not have the same number of result columns
I understand why I would have a different number of result columns, but I don't quite get why that triggers an error.
How should I do this?
Thank you in advance.

It looks like those two questions are really the same question.
Why your query doesn't work: Let's reformat the query.
SELECT * FROM my_table_1 WHERE stock=?
INTERSECT
SELECT volume FROM my_table_2 WHERE stock=?
There are two queries in the intersection,
SELECT * FROM my_table_1 WHERE stock=?
SELECT volume FROM my_table_2 WHERE stock=?
The meaning of "intersect" is "give me the rows that are in both queries". That doesn't make any sense if the queries have a different number of columns, since it's impossible for any row to appear in both queries.
Note that SELECT volume FROM my_table_2 isn't a very useful query, since it doesn't tell you which stock the volume belongs to. The query will give you something like {100, 15, 93, 42}.
What you're actually trying to do: You want a join.
SELECT my_table_1.stock, my_table_2.price, recommendation, volume
FROM my_table_1
INNER JOIN my_table_2 ON my_table_1.stock=my_table_2.stock
WHERE my_table_1.stock=?
Think of join as "glue the rows from one table onto the rows from another table, giving data from both tables in a single row."
It's bizarre that the price appears in both tables; when you write the query with the join, you have to decide whether you want my_table_1.price or my_table_2.price, or whether you want to join on my_table_1.price=my_table_2.price. You may want to consider redesigning your schema so this doesn't happen; it may make your life easier. Note also that stock must be qualified in the WHERE clause, since a bare stock is ambiguous once both tables are in the query.
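A minimal, runnable sqlite3 sketch of that join against the data from the question (I keep the answer's choice of my_table_2.price; the expected outputs are noted in comments):

import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("CREATE TABLE my_table_1 (stock TEXT, price REAL, recommendation TEXT)")
c.execute("CREATE TABLE my_table_2 (stock TEXT, price REAL, volume REAL)")
c.executemany("INSERT INTO my_table_1 VALUES (?,?,?)",
              [('a', 1, 'BUY'), ('b', 2, 'SELL'), ('c', 3, 'HOLD'), ('d', 6, 'BUY')])
c.executemany("INSERT INTO my_table_2 VALUES (?,?,?)",
              [('a', 1, 5), ('d', 6, 6), ('e', 2, 7)])

join_sql = """SELECT my_table_1.stock, my_table_2.price, recommendation, volume
              FROM my_table_1
              INNER JOIN my_table_2 ON my_table_1.stock = my_table_2.stock"""

# Question 1: a single stock
c.execute(join_sql + " WHERE my_table_1.stock = ?", ('a',))
print(c.fetchone())          # ('a', 1.0, 'BUY', 5.0)

# Question 2: every stock present in both tables -- just drop the WHERE clause
for row in c.execute(join_sql):
    print(row)               # ('a', 1.0, 'BUY', 5.0) then ('d', 6.0, 'BUY', 6.0)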

You are suffering from a misunderstanding about how to correlate different tables.
In order to do this, the easiest way is to JOIN them with a suitable condition, so that the results automatically include the data from both joined tables. In the example below I select all columns, but you can of course select only those you want by naming them in the SELECT clause. You can also restrict the rows returned with further conditions in a WHERE clause. After you execute your code, try the following:
>>> c.execute("select * from my_table_1 t1 JOIN my_table_2 t2 ON t1.stock=t2.stock")
<sqlite3.Cursor object at 0x1004608f0>
This tells SQLite to take rows from table 1 and join them with rows in table 2 meeting the conditions in the ON clause (i.e. they have to have the same value for their stock attribute). Because you chose such long table names, and because I am a crappy typist, I used table aliases in the FROM clause to allow shortened names in the rest of the query.
>>> c.fetchall()
then gives you the result
[(u'a', 1.0, u'BUY', u'a', 1.0, 5.0), (u'd', 6.0, u'BUY', u'd', 6.0, 6.0)]
which would seem to answer both 1) and 2). For only a particular value of STOCK just add
WHERE t1.STOCK = 'a' -- or other required value, naturally
to the query string. You can see the names of the columns returned by querying the cursor's description attribute:
>>> [d[0] for d in c.description]
['stock', 'price', 'recommendation', 'stock', 'price', 'volume']
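A short sketch combining the alias trick with a parameter placeholder; renaming the duplicated columns with AS keeps the result unambiguous. This reuses the cursor c and the tables created in the question:

c.execute("""SELECT t1.stock, t1.price AS price_1, t1.recommendation,
                    t2.price AS price_2, t2.volume
             FROM my_table_1 t1
             JOIN my_table_2 t2 ON t1.stock = t2.stock
             WHERE t1.stock = ?""", ('a',))
print([d[0] for d in c.description])
# ['stock', 'price_1', 'recommendation', 'price_2', 'volume']
print(c.fetchall())   # [('a', 1.0, 'BUY', 1.0, 5.0)]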
The INTERSECT operation is used to take the outputs from two separate SELECT queries and return only those elements that occur in both. I don't think that's going to be helpful here. The reason you got the error is because the queries have to be "UNION compatible", which is to say they need the same number and type of columns in the intersected queries.

Related

How do I nest these queries in one Replace Into query?

I have three queries and another table called output_table. This code works, but the statements need to be executed as one REPLACE INTO query. I know this involves nested queries and subqueries, but I have no idea if this is possible, since my key is the set of DISTINCT target_currency values (the coins) from output_table.
How can I rewrite 2 and 3 so they execute inside query 1, i.e. the REPLACE INTO query, instead of as the individual UPDATE ones?
1. conn3.cursor().execute(
       """REPLACE INTO coin_best_returns(coin)
          SELECT DISTINCT target_currency FROM output_table"""
   )
2. conn3.cursor().execute(
       """UPDATE coin_best_returns SET
              highest_price = (SELECT MAX(ask_price_usd) FROM output_table
                               WHERE coin_best_returns.coin = output_table.target_currency),
              lowest_price  = (SELECT MIN(bid_price_usd) FROM output_table
                               WHERE coin_best_returns.coin = output_table.target_currency)"""
   )
3. conn3.cursor().execute(
       """UPDATE coin_best_returns SET
              highest_market = (SELECT exchange FROM output_table
                                WHERE coin_best_returns.highest_price = output_table.ask_price_usd),
              lowest_market  = (SELECT exchange FROM output_table
                                WHERE coin_best_returns.lowest_price = output_table.bid_price_usd)"""
   )
You can do it with the help of some window functions, a subquery, and an inner join. The version below is pretty lengthy, but it is less complicated than it may appear. It uses window functions in a subquery to compute the needed per-currency statistics, and factors this out into a common table expression to facilitate joining it to itself.
Other than the inline comments, the main reason for the complication is original query number 3. Queries (1) and (2) could easily be combined as a single, simple, aggregate query, but the third query is not as easily addressed. To keep the exchange data associated with the corresponding ask and bid prices, this query uses window functions instead of aggregate queries. This also provides a vehicle different from DISTINCT for obtaining one result per currency.
Here's the bare query:
WITH output_stats AS (
    -- The ask and bid information for every row of output_table, every row
    -- augmented by the needed maximum ask and minimum bid statistics
    SELECT
        target_currency as tc,
        ask_price_usd as ask,
        bid_price_usd as bid,
        exchange as market,
        MAX(ask_price_usd) OVER (PARTITION BY target_currency) as high,
        ROW_NUMBER() OVER (
            PARTITION BY target_currency, ask_price_usd ORDER BY exchange DESC)
            as ask_rank,
        MIN(bid_price_usd) OVER (PARTITION BY target_currency) as low,
        ROW_NUMBER() OVER (
            PARTITION BY target_currency, bid_price_usd ORDER BY exchange ASC)
            as bid_rank
    FROM output_table
)
REPLACE INTO coin_best_returns(
    -- you must, of course, include all the columns you want to fill in the
    -- upsert column list
    coin,
    highest_price,
    lowest_price,
    highest_market,
    lowest_market)
SELECT
    -- ... and select a value for each column
    asks.tc,
    asks.ask,
    bids.bid,
    asks.market,
    bids.market
FROM output_stats asks
JOIN output_stats bids
    ON asks.tc = bids.tc
WHERE
    -- These conditions choose exactly one asks row and one bids row
    -- for each currency
    asks.ask = asks.high
    AND asks.ask_rank = 1
    AND bids.bid = bids.low
    AND bids.bid_rank = 1
Note well that unlike the original query 3, this will consider only exchange values associated with the target currency for setting the highest_market and lowest_market columns in the destination table. I'm supposing that that's what you really want, but if not, then a different strategy will be needed.
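A minimal sqlite3 harness for trying this from Python might look like the sketch below. The database file and SQL file names are hypothetical, and note that the window functions require SQLite 3.25 or newer:

import sqlite3

# Window functions (MAX(...) OVER, ROW_NUMBER()) need SQLite 3.25+;
# the sqlite3 module reports the version of the library it links against.
assert sqlite3.sqlite_version_info >= (3, 25, 0), sqlite3.sqlite_version

conn3 = sqlite3.connect('coins.db')            # hypothetical database file
with open('upsert_best_returns.sql') as f:     # hypothetical file holding the query above
    upsert_sql = f.read()

with conn3:   # the connection context manager commits, or rolls back on error
    conn3.execute(upsert_sql)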

Why is this sql statement super slow?

I am writing large amounts of data to a sqlite database. I am using a temporary dataframe to find unique values.
This SQL code takes forever in cn.execute(sql):
if upload_to_db == True:
    print(f'########################################WRITING TO TEMP TABLE: {symbol} #######################################################################')
    master_df.to_sql(name='tempTable', con=engine, if_exists='replace')
    with engine.begin() as cn:
        sql = """INSERT INTO instrumentsHistory (datetime, instrumentSymbol, observation, observationColName)
                 SELECT t.datetime, t.instrumentSymbol, t.observation, t.observationColName
                 FROM tempTable t
                 WHERE NOT EXISTS
                     (SELECT 1 FROM instrumentsHistory f
                      WHERE t.datetime = f.datetime
                        AND t.instrumentSymbol = f.instrumentSymbol
                        AND t.observation = f.observation
                        AND t.observationColName = f.observationColName)"""
        print(f'##############################################WRITING TO FINAL TABLE: {symbol} #################################################################')
        cn.execute(sql)
Running this takes forever to write to the database. Can someone help me understand how to speed it up?
Edit 1:
How many rows, roughly? About 15,000 at a time. Basically it is pulling data into a pandas dataframe, making some transformations, and then writing it to a sqlite database. There are probably 600 different instruments, each having around 15,000 rows, so roughly 9M rows ultimately. Give or take a million....
Depending on your SQL database, you could try using something like INSERT IGNORE (MySQL), or MERGE (e.g. on Oracle), which would do the insert only if it would not violate a primary key or unique constraint. This would assume that such a constraint exists on the 4 columns which you are checking.
In the absence of merge, you could try adding the following index to the instrumentsHistory table:
CREATE INDEX idx ON instrumentsHistory (datetime, instrumentSymbol,
                                        observation, observationColName);
This index would allow for rapid lookup of each incoming record, coming from the tempTable, and so might speed up the insert process.
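SQLite itself has a close analogue of those merge-style statements: INSERT OR IGNORE, which silently skips rows that would violate a unique constraint. A sketch of how the two pieces could fit together; the file name is hypothetical, and it assumes you are free to make the index UNIQUE:

import sqlite3

conn = sqlite3.connect('instruments.db')   # hypothetical file name
with conn:
    # A UNIQUE index both speeds up the duplicate lookup and gives
    # INSERT OR IGNORE a constraint to enforce.
    conn.execute("""CREATE UNIQUE INDEX IF NOT EXISTS idx
                    ON instrumentsHistory (datetime, instrumentSymbol,
                                           observation, observationColName)""")
    conn.execute("""INSERT OR IGNORE INTO instrumentsHistory
                        (datetime, instrumentSymbol, observation, observationColName)
                    SELECT datetime, instrumentSymbol, observation, observationColName
                    FROM tempTable""")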
This subquery
WHERE NOT EXISTS
(SELECT 1 FROM instrumentsHistory f
WHERE t.datetime = f.datetime
AND t.instrumentSymbol = f.instrumentSymbol
AND t.observation = f.observation
AND t.observationColName = f.observationColName)
has to check every row in the table - and match four columns - until a match is found. In the worst case, there is no match and a full table scan must be completed. Therefore, the performance of the query will deteriorate as the table grows in size.
The solution, as mentioned in Tim's answer, is to create an index over the four columns so that the db can quickly determine whether a match exists.
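You can confirm that SQLite actually uses the new index with EXPLAIN QUERY PLAN; a sketch against the same table, with hypothetical sample values for the parameters:

import sqlite3

conn = sqlite3.connect('instruments.db')   # hypothetical file name
plan = conn.execute("""EXPLAIN QUERY PLAN
                       SELECT 1 FROM instrumentsHistory f
                       WHERE f.datetime = ? AND f.instrumentSymbol = ?
                         AND f.observation = ? AND f.observationColName = ?""",
                    ('2020-01-01', 'AAPL', 1.0, 'close'))  # hypothetical sample values
for row in plan:
    print(row)   # with the index in place this should report a SEARCH using idx,
                 # not a SCAN of instrumentsHistory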

What is the correct way to use distinct on (Postgres) with SqlAlchemy?

I want to get all the columns of a table with max(timestamp) and group by name.
What I have tried so far is:
normal_query ="Select max(timestamp) as time from table"
event_list = normal_query \
.distinct(Table.name)\
.filter_by(**filter_by_query) \
.filter(*queries) \
.group_by(*group_by_fields) \
.order_by('').all()
The query I get:
SELECT DISTINCT ON (schema.table.name) , max(timestamp)....
This query basically returns two columns, name and timestamp.
Whereas the query I want:
SELECT DISTINCT ON (schema.table.name) * from table order by ....
returns all the columns in that table, which is the expected behavior. How could I write this down in Python to get to that statement? Basically the asterisk is missing.
Can somebody help me?
What you seem to be after is the DISTINCT ON ... ORDER BY idiom in Postgresql for selecting greatest-n-per-group results (N = 1). So instead of grouping and aggregating, just:
event_list = Table.query.\
    distinct(Table.name).\
    filter_by(**filter_by_query).\
    filter(*queries).\
    order_by(Table.name, Table.timestamp.desc()).\
    all()
This will end up selecting rows "grouped" by name, having the greatest timestamp value.
You do not want to use the asterisk most of the time, not in your application code anyway, unless you're doing manual ad-hoc queries. The asterisk is basically "all columns from the FROM table/relation", which might then break your assumptions later, if you add columns, reorder them, and such.
In case you'd like to order the resulting rows based on timestamp in the final result, you can use for example Query.from_self() to turn the query to a subquery, and order in the enclosing query:
event_list = Table.query.\
    distinct(Table.name).\
    filter_by(**filter_by_query).\
    filter(*queries).\
    order_by(Table.name, Table.timestamp.desc()).\
    from_self().\
    order_by(Table.timestamp.desc()).\
    all()
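If you want to see the statement this produces before running it, printing the query object renders its SQL; a small sketch, with the filters left out and the model name as in the question:

query = Table.query.\
    distinct(Table.name).\
    order_by(Table.name, Table.timestamp.desc())

# Printing a Query shows the SELECT it will emit; against a Postgresql bind it
# should contain "DISTINCT ON (table.name)" followed by the explicit column
# list and the ORDER BY -- no asterisk involved.
print(query)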

Count how many times a value occurs in an sqlite3 database table column

I have been performing a query to count how many times in my sqlite3 database table (Users), within the column "country", the value "Australia" occurs.
australia = db.session.query(Users.country).filter_by(country="Australia").count()
I need to do this in a more dynamic way for any country value that may be within this column.
I have tried the following but unfortunately I only get a count of 0 for all values that are passed in the loop variable (each).
country = list(db.session.query(Users.country))
country_dict = list(set(country))
for each in country_dict:
    print(db.session.query(Users.country).filter_by(country=(str(each))).count())
Any assistance would be greatly appreciated.
The issue is that country is a list of result tuples, not a list of strings. The end result is that the value of str(each) is something along the lines of ('Australia',), which should make it obvious why you are getting counts of 0 as results.
For when you want to extract a list of single column values, see here. When you want distinct results, use DISTINCT in SQL.
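For example, since each result row is a one-element tuple such as ('Australia',), unpacking it while iterating yields plain strings; a small sketch using the models from the question:

# Destructure each single-column row; .distinct() removes duplicates in SQL.
countries = [country for (country,) in db.session.query(Users.country).distinct()]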
But you should not first query distinct countries and then fire a query to count the occurrence of each one. Instead use GROUP BY:
country_counts = db.session.query(Users.country, db.func.count()).\
    group_by(Users.country).\
    all()

for country, count in country_counts:
    print(country, count)
The main thing to note is that SQLAlchemy does not hide the SQL when using the ORM, but works with it.
If you can use the sqlite3 module with direct SQL it is a simple query:
curs = con.execute("SELECT COUNT(*) FROM users WHERE country=?", ("Australia",))
nb = curs.fetchone()[0]
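The GROUP BY version from the answer above is just as direct in plain SQL; a sketch against the same connection:

# One (country, count) row per distinct country, computed entirely in SQLite.
for country, count in con.execute(
        "SELECT country, COUNT(*) FROM users GROUP BY country"):
    print(country, count)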

sqlite SQL query for unprocessed rows

I'm not quite even sure where / what to search for - so apologies if this is a trivial thing that has been asked before!
I have two tables in sqlite:
table_A = [id, value1, value2]
table_A$foo = [id, foo(value1), foo(value2)]
table_A$bar = [id, bar(value1), bar(value2)]
Where foo() / bar() are arbitrary functions not really relevant here
Now at the moment, I do:
select * from table_A
And use this cursor to compute all the rows for each of the derivative tables.
If something goes wrong (or I add new rows to table_A), I'd like a way to compute (within SQL, rather than in Python) which rows are already present in table_A$foo etc., and so select just the remaining ones (like an AND NOT) to compute foo() and bar(). I should be able to do this on the id column, as the ids remain the same.
Wondering if there is a way to do this in sqlite, which I imagine would be quicker than trying to rig this up in python.
Many thanks!
I'm not sure whether you consider a match to be based on the value1 columns matching, or on a combination of all three columns...
Using EXISTS to find those that are already present (the table names are quoted because of the $ in them):
SELECT *
FROM table_A a
WHERE EXISTS (SELECT NULL
              FROM "table_A$foo" f
              WHERE a.id = f.id
                AND a.value1 = f.value1
                AND a.value2 = f.value2)
Using NOT EXISTS to find those that are not present:
SELECT *
FROM table_A a
WHERE NOT EXISTS (SELECT NULL
                  FROM "table_A$foo" f
                  WHERE a.id = f.id
                    AND a.value1 = f.value1
                    AND a.value2 = f.value2)
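Since you said the ids stay the same, matching on id alone is enough for the "unprocessed rows" case, and the whole select-compute-insert round trip is easy to drive from Python. A hedged sketch: the database file name is hypothetical and foo() stands in for your real function:

import sqlite3

def foo(x):
    # placeholder for the real transformation
    return x

conn = sqlite3.connect('data.db')   # hypothetical file name
with conn:
    # Rows of table_A whose id has no counterpart in table_A$foo yet.
    todo = conn.execute("""SELECT a.id, a.value1, a.value2
                           FROM table_A a
                           WHERE NOT EXISTS (SELECT NULL
                                             FROM "table_A$foo" f
                                             WHERE f.id = a.id)""").fetchall()
    conn.executemany('INSERT INTO "table_A$foo" VALUES (?, ?, ?)',
                     [(i, foo(v1), foo(v2)) for (i, v1, v2) in todo])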
