PostgreSQL - how do you get the column formats? - python

I'm using PostgreSQL 9.3.3 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit.
I've done
psycopg2.connect
, got the cursor and can run lines of code similar to
cur.execute('SELECT latitude, longitude, date from db')
table = cur.fetchall()
From what I understand at http://initd.org/psycopg/docs/cursor.html, running
print(cur.description)
should show the type_code of each column. How come I don't get this?
I get
(Column(name='column_name', type_code=1043, display_size=None, internal_size=-1, precision=None, scale=None, null_ok=None), Column(name='data_type', type_code=1043, display_size=None, internal_size=-1, precision=None, scale=None, null_ok=None))
Another solution I've been suggested is running
cur.execute('select column_name, data_type from information_schema.columns')
cur.fetchall()
cols = cur.fetchall()
but this returns an empty list.
That's what I've tried. What do you suggest for getting the column formats?

information_schema.columns should provide you with the column data-type info.
For example, given this DDL:
create table foo
(
id serial,
name text,
val int
);
insert into foo (name, val) values ('narf', 1), ('poit', 2);
And this query (filtering out the meta tables to get at your tables):
select *
from information_schema.columns
where table_schema NOT IN ('information_schema', 'pg_catalog')
order by table_schema, table_name;
Will yield 4 rows, for the table foo -- the three columns I defined, plus a FK.
SQL fiddle
Regarding psycopg2, the information_schema-related code that you have shown looks like it should work... What's the entirety of the code? I would also recommend trying to step through the code in a debugger (the built-in pdb is OK, but I would recommend pudb, as it's more full featured and easier to use, but still terminal-based. It only runs on *nix platforms, though, due to the underlying modules it uses.
Edit:
I was able to get the data_type info from information_schema using psycopg2 with the following code:
#!/usr/bin/env python
import psycopg2
import psycopg2.extras
conn = psycopg2.connect("host=<host> dbname=<dbname> user=<user> password=<password>")
cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
cur.execute("""select *
from information_schema.columns
where table_schema NOT IN ('information_schema', 'pg_catalog')
order by table_schema, table_name""")
for row in cur:
print "schema: {schema}, table: {table}, column: {col}, type: {type}".format(
schema = row['table_schema'], table = row['table_name'],
col = row['column_name'], type = row['data_type'])
I prefer to use DictCursors, as I find them much easier to work with, but it should work with a regular cursor, too -- you would just need to change how you accessed the rows.
Also, regarding cur.description, that returns a tuple of tuples. If you want to get at the type_code there, you can do so like this:
print cur.description[0][1]
Where the first dimension in the index of the column you want to look at, and the second dimension is the datum within that column. type_code is always 1. So you could iterate over the outer tuple and always look at its second item, for example.

select oid,typname from pg_type;
oid | typname
------+-------------
16 | bool
23 | int4
25 | text
1043 | varchar
1184 | timestamptz

Related

Reading an SQL query into a Dask DataFrame

I'm trying create a function that takes an SQL SELECT query as a parameter and use dask to read its results into a dask DataFrame using the dask.read_sql_query function. I am new to dask and to SQLAlchemy.
I first tried this:
import dask.dataFrame as dd
query = "SELECT name, age, date_of_birth from customer"
df = dd.read_sql_query(sql=query, con=con_string, index_col="name", npartitions=10)
As you probably already know, this won't work because the sql parameter has to be an SQLAlchemy selectable and more importantly, TextClause isn't supported.
I then wrapped the query behind a select like this:
import dask.dataFrame as dd
from sqlalchemy import sql
query = "SELECT name, age, date_of_birth from customer"
sa_query = sql.select(sql.text(query))
df = dd.read_sql_query(sql=sa_query, con=con_string, index_col="name")
This fails too with a very weird error that I have been trying to solve. The problem is that dask needs to infer the types of the columns and it does so by reading the first head_row rows in the table - 5 rows by default - and infer the types there. This line in the dask codebase adds a LIMIT ? to the query, which ends up being
SELECT name, age, date_of_birth from customer LIMIT param_1
The param_1 doesn't get substituted at all with the right value - 5 in this case. It then fails on the next line, https://github.com/dask/dask/blob/main/dask/dataframe/io/sql.py#L119, tjat evaluates the SQL expression.
sqlalchemy.exc.ProgrammingError: (mariadb.ProgrammingError) You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SELECT name, age, date_of_birth from customer
LIMIT ?' at line 1
[SQL: SELECT SELECT name, age, date_of_birth from customer
LIMIT ?]
[parameters: (5,)]
(Background on this error at: https://sqlalche.me/e/14/f405)
I can't understand why param_1 wasn't substituted with the value of head_rows. One can see from the error message that it detects there's a parameter that needs to be used for the substitution but for some reason it doesn't actually substitute it.
Perhaps, I didn't correctly create the SQLAlchemy selectable?
I can simply use pandas.read_sql and create a dask dataframe from the resulting pandas dataframe but that defeats the purpose of using dask in the first place.
I have the following constraints:
I cannot change the function to accept a ready-made sqlalchemy
selectable. This feature will be added to a private library used at
my company and various projects using this library do not use
sqlalchemy.
Passing meta to the custom function is not an option because it would require the caller do create it. However, passing a meta attribute to read_sql_query and setting head_rows=0 is completely ok as long as there's an efficient way to retrieve/create
while dask-sql might work for this case, using it is not an
option, unfortunately
How can I go about correctly reading an SQL query into dask dataframe?
The crux of the problem is this line:
sa_query = sql.select(sql.text(query))
What is happening is that we are constructing a nested SELECT query,
which can cause a problem downstream.
Let's first create a test database:
# create a test database (using https://stackoverflow.com/a/64898284/10693596)
from sqlite3 import connect
from dask.datasets import timeseries
con = "delete_me_test.sqlite"
db = connect(con)
# create a pandas df and store (timestamp is dropped to make sure
# that the index is numeric)
df = (
timeseries(start="2000-01-01", end="2000-01-02", freq="1h", seed=0)
.compute()
.reset_index()
)
df.to_sql("ticks", db, if_exists="replace")
Next, let's try to get things working with pandas without sqlalchemy:
from pandas import read_sql_query
con = "sqlite:///test.sql"
query = "SELECT * FROM ticks LIMIT 3"
meta = read_sql_query(sql=query, con=con).set_index("index")
print(meta)
# id name x y
# index
# 0 998 Ingrid 0.760997 -0.381459
# 1 1056 Ingrid 0.506099 0.816477
# 2 1056 Laura 0.316556 0.046963
Now, let's add sqlalchemy functions:
from pandas import read_sql_query
from sqlalchemy.sql import text, select
con = "sqlite:///test.sql"
query = "SELECT * FROM ticks LIMIT 3"
sa_query = select(text(query))
meta = read_sql_query(sql=sa_query, con=con).set_index("index")
# OperationalError: (sqlite3.OperationalError) near "SELECT": syntax error
# [SQL: SELECT SELECT * FROM ticks LIMIT 3]
# (Background on this error at: https://sqlalche.me/e/14/e3q8)
Note the SELECT SELECT due to running sqlalchemy.select on an existing query. This can cause problems. How to fix this? In general, I don't think there's a safe and robust way of transforming arbitrary SQL queries into their sqlalchemy equivalent, but if this is for an application where you know that users will only run SELECT statements, you can manually sanitize the query before passing it to sqlalchemy.select:
from dask.dataframe import read_sql_query
from sqlalchemy.sql import select, text
con = "sqlite:///test.sql"
query = "SELECT * FROM ticks"
def _remove_leading_select_from_query(query):
if query.startswith("SELECT "):
return query.replace("SELECT ", "", 1)
else:
return query
sa_query = select(text(_remove_leading_select_from_query(query)))
ddf = read_sql_query(sql=sa_query, con=con, index_col="index")
print(ddf)
print(ddf.head(3))
# Dask DataFrame Structure:
# id name x y
# npartitions=1
# 0 int64 object float64 float64
# 23 ... ... ... ...
# Dask Name: from-delayed, 2 tasks
# id name x y
# index
# 0 998 Ingrid 0.760997 -0.381459
# 1 1056 Ingrid 0.506099 0.816477
# 2 1056 Laura 0.316556 0.046963

Column does not exist (SQLAlchemy / PostgreSQL): Trouble with quotation marks

Im having trouble with a postgresql query using SQLAlchemy.
I created some large tables using this line of code:
frame.to_sql('Table1', con=engine, method='multi', if_exists='append')
It worked fine. Now, when I want to query data out of it, my first problem is that I have to use quotation marks for each table and column name and I dont really know why, maybe somebody can help me out there.
That is not my main problem though. My main problem is, that when querying the data, all numerical WHERE conditions work fine, but not the ones with Strings in the column data. I get an error that the column does not exist. Im using:
df = pd.read_sql_query('SELECT "variable1", "variable2" FROM "Table1" WHERE "variable1" = 123 AND "variable2" = "abc" ', engine)
I think it might be a problem that I use "abc" instead of 'abc', but I cant change it because of the ' signs in the argument of the query. If I change those ' to " then the Column names and Table names are not detected correctly (because of the problem before that they have to be in quotation marks).
This is the error message:
ProgrammingError: (psycopg2.errors.UndefinedColumn) ERROR: COLUMN »abc« does not exist
LINE 1: ...er" FROM "Table1" WHERE "variable2" = "abc"
And there is an arrow pointing to the first quotation mark of the "abc".
Im new to SQL and I would really appreciate if someone could point me in the right direction.
"Most" SQL dialects (notable exceptions being MS SQL Server and MS Access) strictly differentiate between
single quotes: for string literals, e.g., WHERE thing = 'foo'
double quotes: for object (table, column) names, e.g., WHERE "some col" = 123
PostgreSQL throws in the added wrinkle that table/column names are forced to lower case if they are not (double-)quoted and then uses case-sensitive matching, so if your table is named Table1 then
SELECT * FROM Table1 will fail because PostgreSQL will look for table1, but
SELECT * FROM "Table1" will succeed.
The way to avoid confusion in your query is to use query parameters instead of string literals:
# set up test environment
with engine.begin() as conn:
conn.exec_driver_sql('DROP TABLE IF EXISTS "Table1"')
conn.exec_driver_sql('CREATE TABLE "Table1" (variable1 int, variable2 varchar(50))')
df1 = pd.DataFrame([(123, "abc"), (456, "def")], columns=["variable1", "variable2"])
df1.to_sql("Table1", engine, index=False, if_exists="append")
# test .read_sql_query() with parameters
import sqlalchemy as sa
sql = sa.text('SELECT * FROM "Table1" WHERE variable1 = :v1 AND variable2 = :v2')
param_dict = {"v1": 123, "v2": "abc"}
df2 = pd.read_sql_query(sql, engine, params=param_dict)
print(df2)
"""
variable1 variable2
0 123 abc
"""
It should be: AND "variable2" = 'abc'.
You cannot quote strings/literals with ", as PostgreSQL will interpret it as a database object. Btw. you do not need to wrap table names and and columns with double quotes unless it is extremely necessary, e.g. case sensitive object names, names containing spaces, etc. Imho it is a bad practice and on the long run only leads to confusion. So your query could be perfectly written as follows:
SELECT variable1, variable2
FROM table1
WHERE variable1 = 123 AND variable2 = 'abc';
Keep in mind that it also applies for other objects, like tables or indexes.
CREATE TABLE Table1 (id int) - nice.
CREATE TABLE "Table1" (id int) - not nice.
CREATE TABLE "Table1" ("id" int) - definitely not nice ;)
In case you want to remove the unnecessary double quotes from your table name:
ALTER TABLE "Table1" RENAME TO table1;
Demo: db<>fiddle

How to drop all tables including dependent object

When I'm trying to remove all tables with:
base.metadata.drop_all(engine)
I'm getting following error:
ERROR:libdl.database_operations:Cannot drop table: (psycopg2.errors.DependentObjectsStillExist) cannot drop sequence <schema>.<sequence> because other objects depend on it
DETAIL: default for table <schema>.<table> column id depends on sequence <schema>.<sequence>
HINT: Use DROP ... CASCADE to drop the dependent objects too.
Is there an elegant one-line solution for that?
import psycopg2
from psycopg2 import sql
cnn = psycopg2.connect('...')
cur = cnn.cursor()
cur.execute("""
select s.nspname as s, t.relname as t
from pg_class t join pg_namespace s on s.oid = t.relnamespace
where t.relkind = 'r'
and s.nspname !~ '^pg_' and s.nspname != 'information_schema'
order by 1,2
""")
tables = cur.fetchall() # make sure they are the right ones
for t in tables:
cur.execute(
sql.SQL("drop table if exists {}.{} cascade")
.format(sql.Identifier(t[0]), sql.Identifier(t[1])))
cnn.commit() # goodbye

Python Sqlite: how to select non-existing records(rows) based on a column?

Hope everyone's doing well.
Database:
Value Date
---------------------------------
3000 2019-12-15
6000 2019-12-17
What I hope to return:
"Data:3000 on 2019-12-15"
"NO data on 2019-12-16" (non-existing column based on Date)
"Data:6000 on 2019-12-17"
I don't know how to filter non-existing records(rows) based on a column.
Possible boilerplate code:
db = sqlite3.connect("Database1.db")
cursor = db.cursor()
cursor.execute("""
SELECT * FROM Table1
WHERE Date >= "2019-12-15" and Date <= "2019-12-17"
""")
entry = cursor.fetchall()
for i in entry:
if i is None:
print("No entry found:", i)
else:
print("Entry found")
db.close()
Any help is much appreciated!
The general way you might handle this problem uses something called a calendar table, which is just a table containing all dates you want to see in your report. Consider the following query:
SELECT
d.dt,
t.Value
FROM
(
SELECT '2019-12-15' AS dt UNION ALL
SELECT '2019-12-16' UNION ALL
SELECT '2019-12-17'
) d
LEFT JOIN yourTable t
ON d.dt = t.Date
ORDER BY
d.dt;
In practice, if you had a long term need to do this and/or had a large number of dates to cover, you might setup a bona-fide calendar table in your SQLite database for this purpose. The above query is only intended to be a proof-of-concept.

DBAPI syntax with pd.read_sql_query() call

I want to read all of the tables contained in a database into pandas data frames. This answer does what I want to accomplish, but I'd like to use the DBAPI syntax with the ? instead of the %s, per the documentation. However, I ran into an error. I thought this answer may address the problem, but I'm now posting my own question because I can't figure it out.
Minimal example
import pandas as pd
import sqlite3
pd.__version__ # 0.19.1
sqlite3.version # 2.6.0
excon = sqlite3.connect('example.db')
c = excon.cursor()
c.execute('''CREATE TABLE stocks
(date text, trans text, symbol text, qty real, price real)''')
c.execute("INSERT INTO stocks VALUES ('2006-01-05', 'BUY', 'RHAT', 100, 35.14)")
c.execute('''CREATE TABLE bonds
(date text, trans text, symbol text, qty real, price real)''')
c.execute("INSERT INTO bonds VALUES ('2015-01-01', 'BUY', 'RSOCK', 90, 23.11)")
data = pd.read_sql_query('SELECT * FROM stocks', excon)
# >>> data
# date trans symbol qty price
# 0 2006-01-05 BUY RHAT 100.0 35.14
But when I include a ? or a (?) as below, I get the error message pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT * FROM (?)': near "?": syntax error.
Problem code
c.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = c.fetchall()
# >>> tables
# [('stocks',), ('bonds',)]
table = tables[0]
data = pd.read_sql_query("SELECT * FROM ?", excon, params=table)
It's probably something trivial that I'm missing, but I'm not seeing it!
The problem is that you're trying to use parameter substitution for a table name, which is not possible. There's an issue on GitHub that discusses this. The relevant part is at the very end of the thread, in a comment by #jorisvandenbossche:
Parameter substitution is not possible for the table name AFAIK.
The thing is, in sql there is often a difference between string
quoting, and variable quoting (see eg
https://sqlite.org/lang_keywords.html the difference in quoting
between string and identifier). So you are filling in a string, which
is for sql something else as a variable name (in this case a table
name).
Parameter substitution is essential to prevent SQL Injection from unsafe user-entered values.
In this particular example you are sourcing table names directly from the database's own metadata, which is already safe, so it's OK to just use normal string formatting to construct the query, but still good to wrap the table names in quotes.
If you are sourcing user-entered table names, you can also parameterize them first before using them in your normal python string formatting.
e.g.
# assume this is user-entered:
table = '; select * from members; DROP members --'
c.execute("SELECT name FROM sqlite_master WHERE type='table' and name = ?;", excon, params=table )
tables = c.fetchall()
In this case the user has entered some malicious input intended to cause havoc, and the parameterized query will cleanse it and the query will return no rows.
If the user entered a clean table e.g. table = 'stocks' then the above query would return that same name back to you, through the wash, and it is now safe.
Then it is fine to continue with normal python string formatting, in this case using f-string style:
table = tables[0]
data = pd.read_sql_query(f"""SELECT * FROM "{table}" ;""", excon)
Referring back to your original example, my first step above is entirely unnecessary. I just provided it for context. It is unnecessary, because there is no user input so you could just do something like this to get a dictionary of dataframes for every table.
c.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = c.fetchall()
# >>> tables
# [('stocks',), ('bonds',)]
dfs = dict()
for t in tables:
dfs[t] = pd.read_sql_query(f"""SELECT * FROM "{t}" ;""", excon)
Then you can fetch the dataframe from the dictionary using the tablename as the key.

Categories