sqlite SQL query for unprocessed rows - python

I'm not quite even sure where / what to search for - so apologies if this is a trivial thing that has been asked before!
I have a source table and two derived tables in sqlite:
table_A = [id, value1, value2]
table_A$foo = [id, foo(value1), foo(value2)]
table_A$bar = [id, bar(value1), bar(value2)]
Where foo() / bar() are arbitrary functions not really relevant here
Now at the moment, I do:
select * from table_A
And use this cursor to compute all the rows for each of the derivative tables.
If something goes wrong (or I add new rows to table_A), I'd like a way to compute (within SQL, rather than in python) which rows are already present in table_A$foo etc., and so select just the remaining ones (something like an AND NOT) to compute foo() and bar(). I should be able to do this on the id column, as the ids remain the same.
Wondering if there is a way to do this in sqlite, which I imagine would be quicker than trying to rig this up in python.
Many thanks!

I don't understand if you consider a match based on value1 columns matching, or a combination of all three columns...
Using EXISTS to find those that are already present:
SELECT *
FROM TABLE_A a
WHERE EXISTS(SELECT NULL
             FROM TABLE_A$foo f
             WHERE a.id = f.id
               AND a.value1 = f.value1
               AND a.value2 = f.value2)
Using EXISTS to find those that are not present:
SELECT *
FROM TABLE_A a
WHERE NOT EXISTS(SELECT NULL
                 FROM TABLE_A$foo f
                 WHERE a.id = f.id
                   AND a.value1 = f.value1
                   AND a.value2 = f.value2)
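Since the question says the ids stay the same while the derived tables hold foo(value1) and foo(value2), matching on id alone may be what is wanted. A minimal sketch using Python's sqlite3 module, assuming made-up sample rows and REAL value columns (neither is given in the question); the table name is quoted because it contains a `$`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_A (id INTEGER PRIMARY KEY, value1 REAL, value2 REAL);
    CREATE TABLE "table_A$foo" (id INTEGER PRIMARY KEY, value1 REAL, value2 REAL);
""")

# Hypothetical data: three source rows, only id 1 has been processed so far.
conn.executemany(
    "INSERT INTO table_A VALUES (?, ?, ?)",
    [(1, 1.0, 2.0), (2, 3.0, 4.0), (3, 5.0, 6.0)],
)
conn.execute('INSERT INTO "table_A$foo" VALUES (?, ?, ?)', (1, 10.0, 20.0))

# Select the rows of table_A whose id has no counterpart in table_A$foo.
unprocessed = conn.execute("""
    SELECT a.id, a.value1, a.value2
    FROM table_A AS a
    WHERE NOT EXISTS (SELECT 1 FROM "table_A$foo" AS f WHERE f.id = a.id)
    ORDER BY a.id
""").fetchall()

print(unprocessed)
```

The same query with "table_A$bar" in the subquery would give the rows still to be run through bar().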

Related

Sum numeric values from different tables in one query

In SQL, I can sum two counts like
SELECT (
(SELECT count(*) FROM a WHERE val=42)
+
(SELECT count(*) FROM b WHERE val=42)
)
How do I perform this query with the Django ORM?
The closest I got is
a.objects.filter(val=42).order_by().values_list('id', flat=True).union(
b.objects.filter(val=42).order_by().values_list('id', flat=True)
).count()
This works fine if the returned count is small, but seems bad if there's a lot of rows that the database must hold in memory just to count them.
Your solution can be simplified only slightly, by using values('pk') instead of values_list('id', flat=True). That changes only the type of the rows in the output; the underlying SQL of both querysets is the same:
SELECT id FROM a WHERE val=42 UNION SELECT id FROM b WHERE val=42
and the .count() method merely wraps that subquery in an outer query:
SELECT COUNT(*) FROM (... subquery ...)
The database backend does not necessarily hold all the values in memory; it may simply count the rows and discard them (I have not checked this).
Similarly, if you run a simple SELECT COUNT(id) FROM a, it doesn't need to collect the ids.
Subqueries of the form SELECT count(*) FROM a WHERE val=42 cannot be embedded in a bigger query directly, because Django doesn't evaluate aggregations lazily: it evaluates them immediately.
The evaluation can be postponed, e.g. by grouping by some expression that has only one possible value, such as GROUP BY (i >= 0) (or by an outer reference, if that works), but the query plan may be worse.
Another problem is that the ORM cannot build a SELECT without a table. I will therefore use an arbitrary row of an arbitrary table as the base of the query.
Example:
qs = Unimportant.objects.filter(pk=unimportant_pk).values('id').annotate(
    total_a=a.objects.filter(val=42).order_by().values('val')
        .annotate(cnt=models.Count('*')).values('cnt'),
    total_b=b.objects.filter(val=42).order_by().values('val')
        .annotate(cnt=models.Count('*')).values('cnt')
)
It is not pretty, but the subqueries could easily be parallelized. The resulting SQL is roughly:
SELECT
    id,
    (SELECT COUNT(*) AS cnt FROM a WHERE val=42 GROUP BY val) AS total_a,
    (SELECT COUNT(*) AS cnt FROM b WHERE val=42 GROUP BY val) AS total_b
FROM unimportant WHERE id = unimportant_pk
The Django docs confirm that a simple solution doesn't exist:
Using aggregates within a Subquery expression
...
... This is the only way to perform an aggregation within a Subquery, as using aggregate() attempts to evaluate the queryset (and if there is an OuterRef, this will not be possible to resolve).
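If raw SQL is acceptable, the statement from the question can also be run as-is. The sketch below uses the standard-library sqlite3 module rather than Django, with made-up rows, purely to show the query's behaviour. It also shows one thing to watch for in the union() attempt: a plain UNION removes duplicate rows, so an id that occurs in both tables is counted only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, val INTEGER);
    CREATE TABLE b (id INTEGER PRIMARY KEY, val INTEGER);
    INSERT INTO a (val) VALUES (42), (42), (7);        -- ids 1, 2, 3
    INSERT INTO b (val) VALUES (42), (1), (42), (42);  -- ids 1, 2, 3, 4
""")

# Sum of two scalar subqueries, as in the question's SQL.
total = conn.execute("""
    SELECT (SELECT count(*) FROM a WHERE val = ?)
         + (SELECT count(*) FROM b WHERE val = ?)
""", (42, 42)).fetchone()[0]

# Counting a plain UNION of ids: id 1 matches in both tables
# and is therefore counted once.
union_count = conn.execute("""
    SELECT count(*) FROM (
        SELECT id FROM a WHERE val = ?
        UNION
        SELECT id FROM b WHERE val = ?
    )
""", (42, 42)).fetchone()[0]

print(total, union_count)
```

In a Django project the same SQL could be sent through django.db.connection.cursor(), with %s placeholders instead of ?.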

How to delete large quantity of records from Oracle Table that has no primary key

The situation: I'm loading an entire SQL table into my program. For convenience I'm using pandas to maintain the row data. I then create a dataframe of the rows I would like removed from the SQL table. Unfortunately (and I can't change this) the table does not have any primary key other than the built-in Oracle ROWID (which isn't a real table column; it's a pseudocolumn), but I can make ROWID part of my dataframe if I need to.
The table has hundreds of thousands of rows, and I'll probably be deleting a few thousand records with each run of the program.
Question:
Using cx_Oracle, what is the best method of deleting multiple rows/records that don't have a primary key? I don't think looping to submit thousands of delete statements is very efficient or pythonic. However, I am concerned about building a single SQL delete statement keyed off ROWID that contains a clause with thousands of items:
Where ROWID IN ('eg1','eg2',........, 'eg2345')
Is this concern valid? Any Suggestions?
Using ROWID
Since you can use ROWID, that would be the ideal way to do it. Depending on the Oracle version, the query length limit is probably large enough for a query with that many elements in the IN clause. The real constraint is the number of elements in the IN expression list, which is limited to 1000.
So you'll have to either break the list of ROWIDs into sets of at most 1000, or delete a single row at a time, with or without executemany().
>>> len(delrows) # rowids to delete
5000
>>> q = 'DELETE FROM sometable WHERE ROWID IN (' + ', '.join(f"'{row}'" for row in delrows) + ')'
>>> len(q) # length of the query
55037
>>> # let's try with just the first 1000 id's and no extra spaces
... q = 'DELETE FROM sometable WHERE ROWID IN (' + ','.join(f"'{row}'" for row in delrows[:1000]) + ')'
>>> len(q)
10038
You're probably within query-length limits, and can even save some chars with a minimal ',' item separator.
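The batching can be sketched as a small helper that yields one statement per group of at most 1000 ROWIDs. The helper name build_delete_batches and the generated rowid strings are made up for illustration; the sketch uses positional bind variables (:1, :2, ...) instead of quoting the values into the SQL text, and assumes ROWIDs can be bound as strings.

```python
def build_delete_batches(rowids, table="sometable", batch_size=1000):
    """Yield (sql, params) pairs, each with at most batch_size bind variables."""
    for start in range(0, len(rowids), batch_size):
        batch = list(rowids[start:start + batch_size])
        # One positional bind variable per ROWID in this batch: :1,:2,...
        placeholders = ",".join(f":{i}" for i in range(1, len(batch) + 1))
        yield f"DELETE FROM {table} WHERE ROWID IN ({placeholders})", batch


# Hypothetical ROWID strings standing in for the dataframe column.
rowids = [f"rowid{i:04d}" for i in range(2345)]
batches = list(build_delete_batches(rowids))

for sql, params in batches:
    print(len(params), sql[:45] + "...")
```

Each pair would then be passed to cursor.execute(sql, params), followed by a single commit once all batches have run.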
Without ROWID
Without the primary key or ROWID, the only way to identify each row is to specify all the columns in the WHERE clause. To delete many rows at a time, those conditions need to be OR'd together:
DELETE FROM sometable
WHERE ( col1 = 'val1'
    AND col2 = 'val2'
    AND col3 = 'val3' ) -- row 1
   OR ( col1 = 'other2'
    AND col2 = 'value2'
    AND col3 = 'val3' ) -- row 2
   OR ( ... ) -- etc
As you can see it's not the nicest query to construct but allows you to do it without ROWIDs.
In both cases, you probably don't need to use parameterised queries, since the IN list in the first approach or the OR grouping in the second varies in length. (Yes, you could parameterise it after constructing the whole extended SQL with thousands of parameters. I'm not sure what the limit is on that.) The executemany() approach is definitely easier to write, but for speed a single large query (either of the two above) will probably outperform executemany() with thousands of items.
You can use cursor.executemany() to delete multiple rows at once. Something like the following should work:
dataToDelete = [['eg1'], ['eg2'], ...., ['eg2345']]
cursor.executemany("delete from sometable where rowid = :1", dataToDelete)

Any faster way to do mysql update query in R? in python?

I tried to run this query:
update table1 A
set number = (select count(distinct(id)) from table2 B where B.col1 = A.col1 or B.col2 = A.col2);
but it takes forever because table1 has 1,100,000 rows and table2 has 350,000,000 rows.
Is there any faster way to do this query in R? or in python?
I rewrote your query with three subqueries instead of one - with UNION and two INNER JOIN statements:
UPDATE table1 as A
SET number = (SELECT COUNT(DISTINCT(id))
    FROM
    (SELECT A.id as id
     FROM table1 as A
     INNER JOIN table2 as B
     ON A.col1 = B.col1) -- condition for col1
    UNION DISTINCT
    (SELECT A.id as id
     FROM table1 as A
     INNER JOIN table2 as B
     ON A.col2 = B.col2) -- condition for col2
)
My notes:
Updating all of the rows in table1 doesn't look like a good idea, because it touches 1.1M rows. Another data structure for storing number would probably perform better.
Try running the SELECT part of the query on its own (the part in parentheses), without the UPDATE of table1.
Take a look at EXPLAIN if you need a more general approach to optimizing SQL queries: https://dev.mysql.com/doc/refman/5.7/en/using-explain.html
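The idea behind replacing the OR with a UNION can be checked on a small scale. The sketch below uses SQLite through Python's sqlite3 module rather than MySQL, with made-up rows in table2, and compares the two formulations for one (col1, col2) pair. It only illustrates that the counts agree; it says nothing about how either form performs on 350M rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table2 (id INTEGER, col1 TEXT, col2 TEXT);
    INSERT INTO table2 VALUES
        (1, 'x', 'p'), (2, 'x', 'q'), (2, 'y', 'q'), (3, 'y', 'r'), (4, 'z', 'q');
""")


def count_with_or(col1, col2):
    # Original formulation: one scan with an OR condition.
    return conn.execute(
        "SELECT COUNT(DISTINCT id) FROM table2 WHERE col1 = ? OR col2 = ?",
        (col1, col2),
    ).fetchone()[0]


def count_with_union(col1, col2):
    # Rewritten formulation: two single-column lookups, de-duplicated by UNION.
    return conn.execute(
        """SELECT COUNT(*) FROM (
               SELECT id FROM table2 WHERE col1 = ?
               UNION
               SELECT id FROM table2 WHERE col2 = ?
           )""",
        (col1, col2),
    ).fetchone()[0]


print(count_with_or("x", "q"), count_with_union("x", "q"))
```

The point of the split is that each half filters on a single column, so each can use its own index (one on col1, one on col2), assuming such indexes exist.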

Check if entry exists in previous mysql query results

I'm using MySQLdb with python. I execute a SELECT query and store the results in a dictionary, D. The actual query is quite complicated so it's not clear how to do it in a single query, which is why I'm splitting it into two.
Then I run a second query and would like to add the condition that rows from columnB in the second query exist IN D.values(). ie, I'd like to do something like:
import MySQLdb
db = MySQLdb.connect()
cursor = db.cursor(MySQLdb.cursors.DictCursor)
cursor.execute("SELECT a, b FROM t1;")
results1 = cursor.fetchall()
I'd like to do the following, somehow passing an array from a previous query's results into the SELECT command:
cursor.execute("SELECT c, d FROM t2 WHERE d IN results1.b.values();")
Thank you,
It sounds like you can do what you want with a single query, and a join statement:
cursor.execute("SELECT t1.a, t1.b, t2.c, t2.d FROM t1 JOIN t2 ON t1.b = t2.d;")
You can put a list of values in an IN clause, but it is not advisable for the following reasons:
Slow: an IN clause with a long list is evaluated like a chain of ORs (a OR b OR ...), which can be slow to execute.
Limited: the number of values you can place in an IN clause is limited.
A better solution is to use a subquery:
SELECT c, d FROM t2 WHERE d IN (SELECT b FROM t1);
Or even better (faster) is the JOIN:
SELECT c, d FROM t2
INNER JOIN t1 ON t2.d=t1.b;
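For comparison, here is a sketch of both routes: passing the values from the first result set into an IN list with one placeholder per value, and the single-query subquery form. It uses sqlite3 and made-up rows so that it runs anywhere; with MySQLdb the placeholder would be %s instead of ?.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER, b INTEGER);
    CREATE TABLE t2 (c INTEGER, d INTEGER);
    INSERT INTO t1 VALUES (1, 10), (2, 20);
    INSERT INTO t2 VALUES (100, 10), (200, 30), (300, 20);
""")

# Two-step version: fetch b from t1, then bind each value to its own placeholder.
b_values = [row[0] for row in conn.execute("SELECT b FROM t1")]
placeholders = ", ".join("?" for _ in b_values)
two_step = conn.execute(
    f"SELECT c, d FROM t2 WHERE d IN ({placeholders}) ORDER BY c", b_values
).fetchall()

# One-step version: let the database do the lookup with a subquery.
one_step = conn.execute(
    "SELECT c, d FROM t2 WHERE d IN (SELECT b FROM t1) ORDER BY c"
).fetchall()

print(two_step, one_step)
```

The two-step version needs a guard for an empty b_values list, since an empty IN () is not valid in every database.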

Intersection in sqlite3 in Python

I am trying to extract data that corresponds to a stock that is present in both of my data sets (given in the code below).
This is my data:
#(stock,price,recommendation)
my_data_1 = [('a',1,'BUY'),('b',2,'SELL'),('c',3,'HOLD'),('d',6,'BUY')]
#(stock,price,volume)
my_data_2 = [('a',1,5),('d',6,6),('e',2,7)]
Here are my questions:
Question 1:
I am trying to extract price, recommendation, and volume that correspond to asset 'a'. Ideally I would like to get a tuple like this:
(u'a',1,u'BUY',5)
Question 2:
What if I wanted to get intersection for all the stocks (not just 'a' as in Question 1), in this case it is stock 'a', and stock 'd', then my desired output becomes:
(u'a',1,u'BUY',5)
(u'd',6,u'BUY',6)
How should I do this?
Here is my try (Question 1):
import sqlite3
my_data_1 = [('a',1,'BUY'),('b',2,'SELL'),('c',3,'HOLD'),('d',6,'BUY')]
my_data_2 = [('a',1,5),('d',6,6),('e',2,7)]
#I am using :memory: because I want to experiment
#with the database a lot
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''CREATE TABLE MY_TABLE_1
             (stock TEXT, price REAL, recommendation TEXT )''' )
c.execute('''CREATE TABLE MY_TABLE_2
             (stock TEXT, price REAL, volume REAL )''' )
for ele in my_data_1:
    c.execute('''INSERT INTO MY_TABLE_1 VALUES(?,?,?)''',ele)
for ele in my_data_2:
    c.execute('''INSERT INTO MY_TABLE_2 VALUES(?,?,?)''',ele)
conn.commit()
# The problem is with the following line:
c.execute( 'select* from my_table_1 where stock = ? INTERSECT select* from my_table_2 where stock = ?',('a','a') )
for entry in c:
    print(entry)
I get no error, but also no output, so something is clearly off.
I also tried this line:
c.execute( 'select* from my_table_1 where stock = ? INTERSECT select volume from my_table_2 where stock = ?',('a','a') )
but it does not work, I get this error:
c.execute( 'select* from my_table_1 where stock = ? INTERSECT select volume from my_table_2 where stock = ?',('a','a') )
sqlite3.OperationalError: SELECTs to the left and right of INTERSECT do not have the same number of result columns
I understand why I would have different number of resulting columns, but don't quite get why that triggers an error.
How should I do this?
Thank You in advance
It looks like those two questions are really the same question.
Why your query doesn't work: Let's reformat the query.
SELECT * FROM my_table_1 WHERE stock=?
INTERSECT
SELECT volume FROM my_table_2 WHERE stock=?
There are two queries in the intersection,
SELECT * FROM my_table_1 WHERE stock=?
SELECT volume FROM my_table_2 WHERE stock=?
The meaning of "intersect" is "give me the rows that are in both queries". That doesn't make any sense if the queries have a different number of columns, since it's impossible for any row to appear in both queries.
Note that SELECT volume FROM my_table_2 isn't a very useful query, since it doesn't tell you which stock the volume belongs to. The query will give you something like {100, 15, 93, 42}.
What you're actually trying to do: You want a join.
SELECT my_table_1.stock, my_table_2.price, recommendation, volume
FROM my_table_1
INNER JOIN my_table_2 ON my_table_1.stock=my_table_2.stock
WHERE my_table_1.stock=?
Think of join as "glue the rows from one table onto the rows from another table, giving data from both tables in a single row."
It's bizarre that the price appears in both tables; when you write the query with the join you have to decide whether you want my_table_1.price or my_table_2.price, or whether you want to join on my_table_1.price=my_table_2.price. You may want to consider redesigning your schema so this doesn't happen; it may make your life easier.
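Put together with the data from the question, the join can be sketched as follows. The aliases t1 and t2 and the choice of t1.price are illustrative; every column is qualified with its table so that SQLite does not complain about an ambiguous column name.

```python
import sqlite3

my_data_1 = [('a', 1, 'BUY'), ('b', 2, 'SELL'), ('c', 3, 'HOLD'), ('d', 6, 'BUY')]
my_data_2 = [('a', 1, 5), ('d', 6, 6), ('e', 2, 7)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table_1 (stock TEXT, price REAL, recommendation TEXT)")
conn.execute("CREATE TABLE my_table_2 (stock TEXT, price REAL, volume REAL)")
conn.executemany("INSERT INTO my_table_1 VALUES (?, ?, ?)", my_data_1)
conn.executemany("INSERT INTO my_table_2 VALUES (?, ?, ?)", my_data_2)

JOIN_SQL = """
    SELECT t1.stock, t1.price, t1.recommendation, t2.volume
    FROM my_table_1 AS t1
    INNER JOIN my_table_2 AS t2 ON t1.stock = t2.stock
"""

# Question 1: a single stock.
one_stock = conn.execute(JOIN_SQL + " WHERE t1.stock = ?", ("a",)).fetchall()

# Question 2: every stock present in both tables.
all_common = conn.execute(JOIN_SQL + " ORDER BY t1.stock").fetchall()

print(one_stock)
print(all_common)
```

The numbers come back as floats because the price and volume columns are declared REAL.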
You are suffering from a misunderstanding about how to correlate different tables.
The easiest way to do this is to JOIN them with a suitable condition, producing results that automatically include the data from both joined tables. In the example below I select all columns, but you can of course select only those you want by naming them in the SELECT list. You can also select only the rows you want with further condition(s) in a WHERE clause. After you execute your code, try the following:
>>> c.execute("select * from my_table_1 t1 JOIN my_table_2 t2 ON t1.stock=t2.stock")
<sqlite3.Cursor object at 0x1004608f0>
This tells SQLite to take rows from table 1 and join them with rows in table 2 meeting the conditions in the ON clause (i.e. they have to have the same value for their STOCK attribute). Because you chose such long table names, and because I am a crappy typist, I used table aliases in the FROM clause to allow me to use shortened names in the rest of the query.
>>> c.fetchall()
then gives you the result
[(u'a', 1.0, u'BUY', u'a', 1.0, 5.0), (u'd', 6.0, u'BUY', u'd', 6.0, 6.0)]
which would seem to answer both 1) and 2). For only a particular value of STOCK just add
WHERE t1.STOCK = 'a' -- or other required value, naturally
to the query string. You can see the names of the columns returned by querying the cursor's description attribute:
>>> [d[0] for d in c.description]
['stock', 'price', 'recommendation', 'stock', 'price', 'volume']
The INTERSECT operation takes the outputs of two separate SELECT queries and returns only those rows that occur in both. I don't think that's going to be helpful here. You got the error because the queries have to be "UNION compatible", which is to say the intersected queries need the same number of columns (and, in stricter databases, compatible column types).
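A short sketch of what INTERSECT does with the question's data may help explain the empty result from the first attempt. The whole rows differ in their third column (a recommendation in one table, a volume in the other), so no row is common to both queries. Restricting both sides to the shared columns does produce matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table_1 (stock TEXT, price REAL, recommendation TEXT)")
conn.execute("CREATE TABLE my_table_2 (stock TEXT, price REAL, volume REAL)")
conn.executemany(
    "INSERT INTO my_table_1 VALUES (?, ?, ?)",
    [('a', 1, 'BUY'), ('b', 2, 'SELL'), ('c', 3, 'HOLD'), ('d', 6, 'BUY')],
)
conn.executemany(
    "INSERT INTO my_table_2 VALUES (?, ?, ?)",
    [('a', 1, 5), ('d', 6, 6), ('e', 2, 7)],
)

# Whole rows: ('a', 1.0, 'BUY') is not equal to ('a', 1.0, 5.0), so nothing matches.
full_rows = conn.execute(
    "SELECT * FROM my_table_1 WHERE stock = ? "
    "INTERSECT SELECT * FROM my_table_2 WHERE stock = ?",
    ("a", "a"),
).fetchall()

# Only the columns both tables share: (stock, price) pairs found in both.
shared = conn.execute(
    "SELECT stock, price FROM my_table_1 "
    "INTERSECT SELECT stock, price FROM my_table_2 ORDER BY stock"
).fetchall()

print(full_rows)
print(shared)
```

This finds which stocks are common to both tables, but it cannot return recommendation and volume side by side; that still takes the join.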
