Python SQL: how to look for non-matches in other tables?

I'm looking for a solution so that I can check for non-matches in other tables.
Basically I have 3 tables (see below). I want to look into Table 1 and identify the first row that doesn't have a match in either Name or Location. If both are recognized, it should move on and check the next row.
I have tried to accomplish this with SQL and looping, but as I only want the first row that doesn't match, I haven't found a smooth solution (or a pretty one for that matter, as I'm fairly rookie-ish).
I'm pretty sure this can be accomplished with SQL.
Table 1
Id  Name   Location
1   John   Canada
2   James  Brazil
3   Jim    Hungary
Table 2 - Recognized Names
Id  Name
1   John
2   James
Table 3 - Recognized Locations
Id  Location
1   Brazil
2   Hungary
So I want to select from Table 1, where Name can't find a match in Table 2 or where Location can't find a match in Table 3.
In my example from above the result should be Id = 1, as Location is not in Table 3.
Thanks in advance.

You can use not exists to select rows for which some sub-query doesn't select a row:
select
  *
from
  Table1 t1
where
  not exists (
    select * from Table2 t2 where t2.`Name` = t1.`Name`
  ) or
  not exists (
    select * from Table3 t3 where t3.`Location` = t1.`Location`
  )
order by
  t1.Id
limit 1
It's not a very complicated query, but there are still a few things going on, so here is the same one again, with comments explaining the various parts:
select
  /* I select all columns, *, for the example, but in real-life scenarios
     it's always better to explicitly specify which columns you need. */
  *
from
  /* Optionally, you can specify a short or different alias for a table (t1).
     This can help make your query more readable by allowing you to explicitly
     specify where a column is coming from, without cluttering the query with
     long names. */
  Table1 t1
where
  /* exists takes a sub-query, which is executed for each row of the main query.
     The expression returns true if the sub-query returns a row.
     With not (not exists), the expression is inverted: true becomes false. */
  not exists (
    /* In MariaDB, backticks can be used to escape identifiers that are also
       reserved words. You are allowed to use them for any identifier, but
       for reserved-word identifiers they are often necessary. */
    select * from Table2 t2 where t2.`Name` = t1.`Name`
  )
  /* Combine the two sub-queries. We want rows that are missing a match in
     sub-query one, or missing a match in sub-query two. */
  or
  not exists (
    select * from Table3 t3 where t3.`Location` = t1.`Location`
  )
/* Specify _some_ ordering by which you can distinguish 'first' from 'second'. */
order by
  t1.Id
/* Return only 1 row (the first according to the order by clause) */
limit 1
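Since the question is tagged python, here is a minimal sketch of running this query from Python. It uses the standard-library sqlite3 module purely so the example is self-contained (the answer above targets MariaDB; the schema setup below is an assumption built from the question's sample tables):
import sqlite3

# In-memory database with the sample data from the question (assumed setup).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Table1 (Id INTEGER, Name TEXT, Location TEXT);
    CREATE TABLE Table2 (Id INTEGER, Name TEXT);
    CREATE TABLE Table3 (Id INTEGER, Location TEXT);
    INSERT INTO Table1 VALUES (1,'John','Canada'), (2,'James','Brazil'), (3,'Jim','Hungary');
    INSERT INTO Table2 VALUES (1,'John'), (2,'James');
    INSERT INTO Table3 VALUES (1,'Brazil'), (2,'Hungary');
""")

# First row (by Id) where the name or the location is unrecognized.
row = con.execute("""
    SELECT *
    FROM Table1 t1
    WHERE NOT EXISTS (SELECT * FROM Table2 t2 WHERE t2.Name = t1.Name)
       OR NOT EXISTS (SELECT * FROM Table3 t3 WHERE t3.Location = t1.Location)
    ORDER BY t1.Id
    LIMIT 1
""").fetchone()

print(row)  # (1, 'John', 'Canada'): Canada is not in Table 3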

Related

SQL database with a column being a list or a set

With a SQL database (in my case Sqlite, using Python), what is a standard way to have a column which is a set of elements?
id  name  items_set
1   Foo   apples,oranges,tomatoes,ananas
2   Bar   tomatoes,bananas
...
A simple implementation is using
CREATE TABLE data(id int, name text, items_set text);
but there are a few drawbacks:
to query all rows that have ananas, we have to use items_set LIKE '%ananas%' plus some tricks with separators so that a query for "ananas" doesn't also return rows with "bananas", etc.
when we insert a new item into a row, we have to load the whole items_set and check whether the item is already in the list before concatenating ,newitem at the end.
etc.
Surely there is something better. What is a standard SQL solution for a column which is a list or set?
Note: I don't know in advance all the possible values for the set/list.
I can see a solution with a few additional tables, but in my tests, it multiplies the size on disk by a factor x2 or x3, which is a problem with many gigabytes of data.
Is there a better solution?
To have a well-structured SQL database, you should extract the items to their own table and use a join table between the main table and the items table.
I'm not familiar with the SQLite syntax, but you should be able to create the tables with
CREATE TABLE entities(id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE entity_items(entity_id INTEGER, item_id INTEGER);
CREATE TABLE items(id INTEGER PRIMARY KEY, name TEXT);
(INTEGER PRIMARY KEY makes SQLite auto-assign the id on insert, which the inserts below rely on.)
add data
INSERT INTO entities (name) VALUES ('Foo'), ('Bar');
INSERT INTO items (name) VALUES ('tomatoes'), ('ananas'), ('bananas');
INSERT INTO entity_items (entity_id, item_id) VALUES (
(SELECT id from entities WHERE name='Foo'),
(SELECT id from items WHERE name='bananas')
);
query data
SELECT * FROM entities
LEFT JOIN entity_items
ON entities.id = entity_items.entity_id
LEFT JOIN items
ON items.id = entity_items.item_id
WHERE items.name = 'bananas';
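Since the question mentions using SQLite from Python, here is a hedged sketch of driving this many-to-many layout through the standard-library sqlite3 module; the UNIQUE constraints and the add_item helper are illustrative additions, not part of the answer above:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE entities(id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE items(id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE entity_items(entity_id INTEGER, item_id INTEGER);
""")

def add_item(entity: str, item: str) -> None:
    """Attach an item to an entity, creating either row if needed."""
    con.execute("INSERT OR IGNORE INTO entities(name) VALUES (?)", (entity,))
    con.execute("INSERT OR IGNORE INTO items(name) VALUES (?)", (item,))
    con.execute("""
        INSERT INTO entity_items(entity_id, item_id)
        VALUES ((SELECT id FROM entities WHERE name = ?),
                (SELECT id FROM items WHERE name = ?))
    """, (entity, item))

for name, item in [("Foo", "apples"), ("Foo", "ananas"), ("Bar", "bananas")]:
    add_item(name, item)

# All entities that have 'ananas': no LIKE tricks, and 'bananas' can't match.
rows = con.execute("""
    SELECT entities.name
    FROM entities
    JOIN entity_items ON entities.id = entity_items.entity_id
    JOIN items ON items.id = entity_items.item_id
    WHERE items.name = ?
""", ("ananas",)).fetchall()
print(rows)  # [('Foo',)]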
You have probably two options. One standard approach, which is more conventional, is a many-to-many relationship. For example, you have three tables: Employees, Projects, and ProjectEmployees. The latter describes your many-to-many relationship (each employee can work on multiple projects, each project has a team).
Having a set in a single value denormalizes the table and will complicate things either way. But if you must, use the JSON format and the JSON functionality provided by SQLite. If your SQLite version is not recent, it may not have the JSON extension built in; you would need either to update (best option) or to load the JSON extension dynamically. I'm not sure if you can do that with the SQLite copy supplied with Python.
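As a sketch of that JSON route, assuming your Python's bundled SQLite was built with the JSON1 functions (true for most modern builds), you could store the set as a JSON array and query it exactly with json_each:
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data(id INTEGER PRIMARY KEY, name TEXT, items_set TEXT)")
con.execute("INSERT INTO data(name, items_set) VALUES (?, json(?))",
            ("Foo", '["apples", "oranges", "tomatoes", "ananas"]'))
con.execute("INSERT INTO data(name, items_set) VALUES (?, json(?))",
            ("Bar", '["tomatoes", "bananas"]'))

# Exact-match lookup: 'ananas' no longer matches 'bananas'.
rows = con.execute("""
    SELECT data.id, data.name
    FROM data, json_each(data.items_set)
    WHERE json_each.value = ?
""", ("ananas",)).fetchall()
print(rows)  # [(1, 'Foo')]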
To elaborate on what @ussu said, ideally your table would have one row per thing & item pair, using IDs instead of names:
id  thing_id  item_id
1   1         1
2   1         2
3   1         3
4   1         4
5   2         3
6   2         4
Then look-up tables for the thing and item names:
things:
id  name
1   Foo
2   Bar
items:
id  name
1   apples
2   oranges
3   tomatoes
4   bananas
In MySQL, you have the SET type.
Creation:
CREATE TABLE myset (col SET('a', 'b', 'c', 'd'));
Select:
mysql> SELECT * FROM myset WHERE FIND_IN_SET('a', col) > 0;
mysql> SELECT * FROM myset WHERE col LIKE '%a%';
Insertion:
INSERT INTO myset (col) VALUES ('a,d'), ('d,a'), ('a,d,a'), ('a,d,d'), ('d,a,d');

Pagination for SQLite

Hey friends, I am working on an application similar to ServiceNow. I have requests coming from users and have to work on them. I am using python-flask and SQLite for this.
I am new to Flask and this is my first project. Please correct me if I am wrong.
result = cur.execute("SELECT * from orders")
orders = result.fetchmany(5)
I am trying to use orders = result.paginate(...), but it seems there's some problem.
Also, I am not sure how to display the db data on different pages.
I want the first 10 records on the 1st page, the next 10 on the 2nd page, and so on.
Could you please help me?
I've never used Flask, but assuming that you can issue a page throw, a query that introduces a value 0-9 would allow a conditional page throw.
For example, assuming an orders table that has 3 columns, orderdate, ordertype and orderdesc, and that the required order is according to those columns (see notes), the following would introduce a column that runs from 0 to 9 and thus allows checking for a page throw:
SELECT *,
       (SELECT count()
        FROM orders
        WHERE orderdate||ordertype||orderdesc < o.orderdate||o.ordertype||o.orderdesc
        ORDER BY orderdate||ordertype||orderdesc
       ) % 10 AS orderby
FROM orders AS o
ORDER BY orderdate||ordertype||orderdesc
Note that the above relies upon the sort order and the WHERE clause having the same result; a more complex WHERE clause may be needed. The above is intended as an in-principle example.
Example Usage
Consider the following example of the above in use. This generates 100 rows of orders with randomly generated orderdates and ordertypes within specific ranges, and then extracts the data according to the above query. The results of the underlying data and the extracted data are shown in the results section.
/* Create test environment */
DROP TABLE IF EXISTS orders;
/* Generate and load some random orders */
CREATE TABLE IF NOT EXISTS orders (orderdate TEXT, ordertype TEXT, orderdesc TEXT);
WITH RECURSIVE cte1(od, ot, counter) AS (
    SELECT
        datetime('now', '+'||(abs(random()) % 10)||' days'),
        (abs(random()) % 26),
        1
    UNION ALL
    SELECT
        datetime('now', '+'||(abs(random()) % 10)||' days'),
        (abs(random()) % 26),
        counter + 1
    FROM cte1 LIMIT 100
)
INSERT INTO orders SELECT * FROM cte1;
/* Display the resultant data */
SELECT rowid, * FROM orders;
/* Display data with the generated page-throw indicator */
SELECT rowid, *,
       (SELECT count()
        FROM orders
        WHERE orderdate||ordertype||orderdesc < o.orderdate||o.ordertype||o.orderdesc
        ORDER BY orderdate||ordertype||orderdesc
       ) % 10 AS orderby
FROM orders AS o
ORDER BY orderdate||ordertype||orderdesc;
/* Clean up */
DROP TABLE IF EXISTS orders;
Results (partial)
The core data (not sorted, so in rowid order; rowid is included for comparison purposes):
The extracted data with the page-throw indicator (highlighted):
Obviously you would likely not throw a page for the first row.
As concatenation of the 3 columns has been used for convenience, the results may be a little confusing: e.g. 2 would appear to be greater than 11, and so on.
The rowid indicates the original position, so it demonstrates that the data has been sorted.
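To connect this back to Python: below is a minimal sketch of consuming that query with the standard-library sqlite3 module, starting a new page whenever the generated orderby column wraps back to 0. The orders.db filename is a placeholder, and the Flask rendering side is left out:
import sqlite3

con = sqlite3.connect("orders.db")  # placeholder database file
cur = con.execute("""
    SELECT *,
           (SELECT count()
            FROM orders
            WHERE orderdate||ordertype||orderdesc < o.orderdate||o.ordertype||o.orderdesc
           ) % 10 AS orderby
    FROM orders AS o
    ORDER BY orderdate||ordertype||orderdesc
""")

pages, page = [], []
for *columns, orderby in cur:
    if orderby == 0 and page:   # indicator wrapped around: throw a page
        pages.append(page)
        page = []
    page.append(columns)
if page:
    pages.append(page)

print(f"{len(pages)} pages of up to 10 orders each")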

What's the fastest way to see if a table has no rows in postgreSQL?

I have a bunch of tables that I'm iterating through, and some of them have no rows (i.e. just a table of headers with no data).
ex: SELECT my_column FROM my_schema.my_table LIMIT 1 returns an empty result set.
What is the absolute fastest way to check that a table is one of these tables with no rows?
I've considered: SELECT my_column FROM my_schema.my_table LIMIT 1 or SELECT * FROM my_schema.my_table LIMIT 1
followed by an if result is None check (I'm working in Python). Is there any faster way to check?
This is not faster than your solution, but it returns a boolean regardless:
select exists (select 1 from mytable)
select exists (select * from myTab);
or
select 1 where exists (select * from myTab)
or even
SELECT reltuples FROM pg_class WHERE oid = 'schema_name.table_name'::regclass;
The 3rd example uses the row estimator, which may not be 100% accurate, but may be a tad faster.
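If you are querying from Python with psycopg2 (an assumption; the question only says Python), a minimal sketch of the exists check looks like this:
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string

def table_has_rows(conn, schema: str, table: str) -> bool:
    """True if the table contains at least one row; stops at the first row found."""
    with conn.cursor() as cur:
        # Identifiers can't be bound as query parameters, so they are formatted in;
        # only do this with trusted names (or use psycopg2.sql.Identifier).
        cur.execute(f"SELECT EXISTS (SELECT 1 FROM {schema}.{table})")
        return cur.fetchone()[0]

print(table_has_rows(conn, "my_schema", "my_table"))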
SELECT COUNT(*) FROM table_name LIMIT 1;
Try this code.

Selecting a date between two dates while also accounting for separate time field

I have a date and a time field in PostgreSQL. I am reading them in Python and need to sort out things on certain days past certain times.
The steps would basically be like this:
Select * from x where date > monthdayyear
In that subset, select only those that are > time given for that date
AND date2 must be < monthdayyear2, AND time2 must be less than the time2 given for that date
I know there are definitely some Python ways I could do this by iterating through the results, et cetera. I'm wondering if there is a better way than brute-forcing this. I would rather not run multiple queries or have to sort out a lot of extra results from fetchall() if possible.
If I've understood your design, this is really a schema design issue. Instead of:
CREATE TABLE sometable (
date1 date,
time1 time,
date2 date,
time2 time
);
you generally want:
CREATE TABLE sometable (
timestamp1 timestamp with time zone,
timestamp2 timestamp with time zone
);
if you want the timestamp converted automatically to UTC and back to the client's TimeZone, or timestamp without time zone if you want to store the raw timestamp without timezone conversion.
If an inclusive test is OK, you can write:
SELECT ...
FROM sometable
WHERE '2012-01-01 11:15 +0800' BETWEEN timestamp1 AND timestamp2;
If you cannot amend your schema, your best bet is something like this:
SELECT ...
FROM sometable
WHERE '2012-01-01 11:15 +0800' BETWEEN (date1 + time1) AND (date2 + time2);
This may have some unexpected quirks when it comes to clients in multiple time zones; you may land up needing to look at the AT TIME ZONE operator.
If you need an exclusive test on one side and/or the other, you can't use BETWEEN, since it's an a <= x <= b operator. Instead write:
SELECT ...
FROM sometable
WHERE '2012-01-01 11:15 +0800' > (date1 + time1)
AND '2012-01-01 11:15 +0800' < (date2 + time2);
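Because you're reading this from Python, the comparison timestamp can be bound as a query parameter so all the filtering happens in the database; here is a minimal sketch with psycopg2 (an assumed driver; sometable and its columns are the placeholder names from above):
from datetime import datetime, timezone
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
when = datetime(2012, 1, 1, 11, 15, tzinfo=timezone.utc)

with conn.cursor() as cur:
    # Inclusive test against the combined date + time columns, done in SQL,
    # so no post-filtering of fetchall() results is needed in Python.
    cur.execute("""
        SELECT *
        FROM sometable
        WHERE %s BETWEEN (date1 + time1) AND (date2 + time2)
    """, (when,))
    rows = cur.fetchall()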
Automating the schema change
Automating a schema change is possible.
You want to query INFORMATION_SCHEMA or pg_catalog.pg_class and pg_catalog.pg_attribute for tables that have pairs of date and time columns, then generate sets of ALTER TABLE commands to unify them.
Determining what a "pair" is is quite application specific; if you've used a consistent naming scheme it should be easy to do with LIKE or ~ operators and/or regexp_matches. You want to produce a set of (tablename, datecolumnname, timecolumnname) tuples.
Once you have that, you can produce, for each (tablename, datecolumnname, timecolumnname) tuple, the following ALTER TABLE statements. They must be run in a transaction to be safe and should be tested before use on any data you care about; the entries in [brackets] are substitutions:
BEGIN;
ALTER TABLE [tablename] ADD COLUMN [timestampcolumnname] TIMESTAMP WITH TIME ZONE;
--
-- WARNING: This part can lose data; if one of the columns is null and the other one isn't
-- the result is null. You should've had a CHECK constraint preventing that, but probably
-- didn't. You might need to special case that; the `coalesce` and `nullif` functions and
-- the `CASE` clause might be useful if so.
--
UPDATE [tablename] SET [timestampcolumnname] = ([datecolumnname] + [timecolumnname]);
ALTER TABLE [tablename] DROP COLUMN [datecolumnname];
ALTER TABLE [tablename] DROP COLUMN [timecolumnname];
-- Finally, if the originals were NOT NULL:
ALTER TABLE [tablename] ALTER COLUMN [timestampcolumnname] SET NOT NULL;
then check the results and COMMIT if happy. Be aware that an exclusive lock is taken on the table from the first ALTER so nothing else can use the table until you COMMIT or ROLLBACK.
If you're on a vaguely modern PostgreSQL you can generate the SQL with the format function; on older versions you can use string concatenation (||) and the quote_ident and quote_literal functions. Example:
Given the sample data:
CREATE TABLE sometable(date1 date not null, time1 time not null, date2 date not null, time2 time not null);
INSERT INTO sometable(date1,time1,date2,time2) VALUES
('2012-01-01','11:15','2012-02-03','04:00');
CREATE TABLE othertable(somedate date, sometime time);
INSERT INTO othertable(somedate, sometime) VALUES
(NULL, NULL),
(NULL, '11:15'),
('2012-03-08',NULL),
('2014-09-18','23:12');
Here's a query that generates the ALTER TABLE statements for that input data set. Note that it relies on the naming convention that matching column pairs always have a common name once any date or time word is removed from the column. You could instead use adjacency by testing for c1.attnum + 1 = c2.attnum.
BEGIN;
WITH
-- Create a set of each date/time column along with its table name, oids, and not-null flag
cols AS (
    SELECT attrelid, relname, attname, typname, atttypid, attnotnull
    FROM pg_attribute
    INNER JOIN pg_class ON pg_attribute.attrelid = pg_class.oid
    INNER JOIN pg_type ON pg_attribute.atttypid = pg_type.oid
    WHERE NOT attisdropped AND atttypid IN ('date'::regtype, 'time'::regtype)
),
-- Self-join the time and date column set, filtering the left side for only dates and
-- the right side for only times, producing two distinct sets. Then filter for entries
-- where the names are the same after replacing any appearance of the word `date` or
-- `time`.
tableinfo (tablename, datecolumnname, timecolumnname, nonnull, hastimezone) AS (
    SELECT
        c1.relname, c1.attname, c2.attname,
        c1.attnotnull AND c2.attnotnull AS nonnull,
        't'::boolean AS hastimezone
    FROM cols c1
    INNER JOIN cols c2 ON (
        c1.atttypid = 'date'::regtype
        AND c2.atttypid = 'time'::regtype
        AND c1.attrelid = c2.attrelid
        -- Match column pairs; I used name matching, you might use adjacency:
        AND replace(c1.attname, 'date', '') = replace(c2.attname, 'time', '')
    )
)
-- Finally, format the results into a series of ALTER TABLE statements.
SELECT format($$
    ALTER TABLE %1$I ADD COLUMN %4$I TIMESTAMP %5$s;
    UPDATE %1$I SET %4$I = (%2$I + %3$I);
    ALTER TABLE %1$I DROP COLUMN %2$I;
    ALTER TABLE %1$I DROP COLUMN %3$I;
$$ ||
    -- Append a clause to make the column NOT NULL now that it's populated, but only
    -- if the original date and time were NOT NULL:
    CASE
        WHEN nonnull THEN ' ALTER TABLE %1$I ALTER COLUMN %4$I SET NOT NULL;'
        ELSE ''
    END,
    -- Now the format arguments
    tablename,        -- 1
    datecolumnname,   -- 2
    timecolumnname,   -- 3
    -- You'd use a better column-name generator than this simple example:
    datecolumnname || '_' || timecolumnname,  -- 4
    CASE
        WHEN hastimezone THEN 'WITH TIME ZONE'
        ELSE 'WITHOUT TIME ZONE'
    END               -- 5
)
FROM tableinfo;
You can read the results and send them as SQL commands in a second session, or if you want to get fancy you can write a fairly simple PL/PgSQL function that LOOPs over the results and EXECUTEs each one. The query produces output like:
ALTER TABLE sometable ADD COLUMN date1_time1 TIMESTAMP WITH TIME ZONE;
UPDATE sometable SET date1_time1 = (date1 + time1);
ALTER TABLE sometable DROP COLUMN date1;
ALTER TABLE sometable DROP COLUMN time1;
ALTER TABLE sometable ALTER COLUMN date1_time1 SET NOT NULL;
ALTER TABLE sometable ADD COLUMN date2_time2 TIMESTAMP WITH TIME ZONE;
UPDATE sometable SET date2_time2 = (date2 + time2);
ALTER TABLE sometable DROP COLUMN date2;
ALTER TABLE sometable DROP COLUMN time2;
ALTER TABLE sometable ALTER COLUMN date2_time2 SET NOT NULL;
ALTER TABLE othertable ADD COLUMN somedate_sometime TIMESTAMP WITHOUT TIME ZONE;
UPDATE othertable SET somedate_sometime = (somedate + sometime);
ALTER TABLE othertable DROP COLUMN somedate;
ALTER TABLE othertable DROP COLUMN sometime;
I don't know if there's any useful way to work out on a per-column basis whether you want WITH TIME ZONE or WITHOUT TIME ZONE. It's likely you'll land up just doing it hardcoded, in which case you can just remove that column. I put it in there in case there's a good way to figure it out in your application.
If you have cases where the time can be null but the date non-null or vice versa, you will need to wrap the date and time in an expression that decide what result to return when null. The nullif and coalesce functions are useful for this, as is CASE. Remember that adding a null and a non-null value produces a null result so you may not need to do anything special.
If you use schemas you may need to further refine the query to use %I substitution of schema name prefixes to disambiguate. If you don't use schemas (if you don't know what one is, you don't) then this doesn't matter.
Consider adding CHECK constraints enforcing that time1 is less than or equal to time2 where it makes sense in your application once you've done this. Also look at exclusion constraints in the documentation.
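If you'd rather orchestrate this from Python than write the PL/pgSQL loop, a hedged sketch of the two-step flow with psycopg2 (an assumed driver) would be: run the generator query, review and execute each generated statement inside one transaction, and commit only if you're happy:
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string

# Paste the full generator query from above here; elided for brevity.
GENERATOR_SQL = "WITH cols AS (...) SELECT format(...) FROM tableinfo;"

with conn.cursor() as cur:
    cur.execute(GENERATOR_SQL)
    statements = [row[0] for row in cur.fetchall()]

    # psycopg2 opens a transaction implicitly: either every table
    # is converted or none are.
    for stmt in statements:
        print(stmt)        # review what is about to run
        cur.execute(stmt)

conn.commit()  # or conn.rollback() if the output looked wrong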

Effective query merging more than 2 subqueries

I have a database with
books (primary key: bookID)
characterNames (foreign key: books.bookID)
locations (foreign key: books.bookID)
The in-text positions of character names and locations are saved in the corresponding tables.
I'm writing a Python script using psycopg2 that finds all occurrences of given character names and locations in books. I only want the occurrences in books where both the character name AND the location are found.
I already got a solution for searching one location and one character:
WITH b AS (
    SELECT bookid
    FROM characternames
    WHERE name = 'XXX'
    GROUP BY 1
    INTERSECT
    SELECT bookid
    FROM locations
    WHERE locname = 'YYY'
    GROUP BY 1
)
SELECT bookid, position, 'char' AS what
FROM b
JOIN characternames USING (bookid)
WHERE name = 'XXX'
UNION ALL
SELECT bookid, position, 'loc' AS what
FROM b
JOIN locations USING (bookid)
WHERE locname = 'YYY'
ORDER BY bookid, position;
The CTE b contains all bookids where both the character name 'XXX' and the location 'YYY' appear.
Now I'm additionally wondering about searching for 2 places and a name (or 2 names and a place respectively). It's simple if all searched entities must occur in one book, but what about this:
Searching for: Tim, Al, Toolshop
Results: books including
(Tim, Al, Toolshop) or
(Tim, Al) or
(Tim, Toolshop) or
(Al, Toolshop)
The problem could be repeated for 4, 5, 6...conditions.
I thought about INTERSECTing more subqueries, but that wouldn't work.
Instead I would UNION the found bookids, GROUP them, and select the bookids occurring more than once:
WITH b AS (
    SELECT bookid, count(bookid) AS occurrences
    FROM
        (SELECT DISTINCT bookid
         FROM characterNames
         WHERE name = 'XXX'
         UNION
         SELECT DISTINCT bookid
         FROM characterNames
         WHERE name = 'YYY'
         UNION
         SELECT DISTINCT bookid
         FROM locations
         WHERE locname = 'ZZZ'
         GROUP BY bookid)
    WHERE occurrences > 1)
I think this works, can't test it at the moment, but is it the best way to do this?
The idea to use a count for the generalized case is sound. A couple of adjustments to the syntax, though:
WITH b AS (
    SELECT bookid
    FROM (
        SELECT DISTINCT bookid
        FROM characterNames
        WHERE name = 'XXX'
        UNION ALL
        SELECT DISTINCT bookid
        FROM characterNames
        WHERE name = 'YYY'
        UNION ALL
        SELECT DISTINCT bookid
        FROM locations
        WHERE locname = 'ZZZ'
    ) x
    GROUP BY bookid
    HAVING count(*) > 1
)
SELECT bookid, position, 'char' AS what
FROM b
JOIN characternames USING (bookid)
WHERE name IN ('XXX', 'YYY')
UNION ALL
SELECT bookid, position, 'loc' AS what
FROM b
JOIN locations USING (bookid)
WHERE locname = 'ZZZ'
ORDER BY bookid, position;
Notes
Use UNION ALL (not UNION) to preserve duplicates between the subqueries. You want them in this case to be able to count them.
The subqueries are supposed to produce distinct values. It works with DISTINCT the way you have it. You may want to try GROUP BY 1 instead and see if that performs better (I don't expect it to).
The GROUP BY has to go outside the subquery. It would only be applied to the last subquery, and it makes no sense there, as you have DISTINCT bookid already.
The check for whether there is more than one hit on a book has to go into a HAVING clause:
HAVING count(*) > 1
You cannot use aggregated values in a WHERE clause.
Combining conditions on one table
You cannot simply combine multiple conditions on one table; how would you count the number of hits? But there is a somewhat more sophisticated way. It may or may not improve performance; you'll have to test (with EXPLAIN ANALYZE). Both queries require at least two index scans on the table characterNames. At the least it shortens the syntax.
Consider how I compute the number of hits for characterNames and how I changed to sum(hits) in the outer SELECT:
WITH b AS (
    SELECT bookid
    FROM (
        SELECT bookid
             , max((name = 'XXX')::int)
             + max((name = 'YYY')::int) AS hits
        FROM characterNames
        WHERE (name = 'XXX' OR
               name = 'YYY')
        GROUP BY bookid
        UNION ALL
        SELECT DISTINCT bookid, 1 AS hits
        FROM locations
        WHERE locname = 'ZZZ'
    ) x
    GROUP BY bookid
    HAVING sum(hits) > 1
)
...
Converting a boolean to integer gives 0 for FALSE and 1 for TRUE. That helps.
Faster with EXISTS
While riding my bike to work, this thing kept kicking around in the back of my head. I have reason to believe this query might be even faster. Please give it a try (note the derived subquery below: PostgreSQL doesn't let you reference output-column aliases like c_hits in a WHERE clause at the same query level):
WITH b AS (
    SELECT *
    FROM (
        SELECT bookid
             , (EXISTS (
                   SELECT *
                   FROM characterNames c
                   WHERE c.bookid = bk.bookid
                   AND c.name = 'XXX'))::int
             + (EXISTS (
                   SELECT *
                   FROM characterNames c
                   WHERE c.bookid = bk.bookid
                   AND c.name = 'YYY'))::int AS c_hits
             , (EXISTS (
                   SELECT *
                   FROM locations l
                   WHERE l.bookid = bk.bookid
                   AND l.locname = 'ZZZ'))::int AS l_hits
        FROM books bk
    ) sub
    WHERE (c_hits + l_hits) > 1
)
SELECT c.bookid, c.position, 'char' AS what
FROM b
JOIN characternames c USING (bookid)
WHERE b.c_hits > 0
AND c.name IN ('XXX', 'YYY')
UNION ALL
SELECT l.bookid, l.position, 'loc' AS what
FROM b
JOIN locations l USING (bookid)
WHERE b.l_hits > 0
AND l.locname = 'ZZZ'
ORDER BY 1, 2, 3;
The EXISTS semi-join can stop executing at the first match. As we are only interested in an all-or-nothing answer in the CTE, this could possibly do the job much faster.
This way we also don't need to aggregate (no GROUP BY necessary).
I also remember whether any characters or locations were found and only revisit tables with actual matches.
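Since the asker is on psycopg2, the search terms can also be bound as parameters rather than spliced into the SQL string. Here is a sketch using the corrected EXISTS query from above (the connection string and the Tim/Al/Toolshop values are illustrative):
import psycopg2

conn = psycopg2.connect("dbname=books")  # placeholder connection string

names = ("Tim", "Al")    # the two character names to look for
location = "Toolshop"    # the location to look for

SQL = """
WITH b AS (
    SELECT *
    FROM (
        SELECT bookid
             , (EXISTS (SELECT 1 FROM characternames c
                        WHERE c.bookid = bk.bookid AND c.name = %(n1)s))::int
             + (EXISTS (SELECT 1 FROM characternames c
                        WHERE c.bookid = bk.bookid AND c.name = %(n2)s))::int AS c_hits
             , (EXISTS (SELECT 1 FROM locations l
                        WHERE l.bookid = bk.bookid AND l.locname = %(loc)s))::int AS l_hits
        FROM books bk
    ) sub
    WHERE (c_hits + l_hits) > 1
)
SELECT c.bookid, c.position, 'char' AS what
FROM b JOIN characternames c USING (bookid)
WHERE b.c_hits > 0 AND c.name IN (%(n1)s, %(n2)s)
UNION ALL
SELECT l.bookid, l.position, 'loc' AS what
FROM b JOIN locations l USING (bookid)
WHERE b.l_hits > 0 AND l.locname = %(loc)s
ORDER BY 1, 2, 3;
"""

with conn.cursor() as cur:
    cur.execute(SQL, {"n1": names[0], "n2": names[1], "loc": location})
    for bookid, position, what in cur.fetchall():
        print(bookid, position, what)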
