How to convert SQL Query to Pandas DataFrame using SQLAlchemy ORM? - python

According to the SQLAlchemy documentation you are supposed to use a Session object when executing SQL statements. But using a Session with pandas .read_sql gives an error: AttributeError: 'Session' object has no attribute 'cursor'.
However, using a Connection object works even with an ORM mapped class:
with ENGINE.connect() as conn:
    df = pd.read_sql_query(
        sqlalchemy.select(MeterValue),
        conn
    )
where MeterValue is a mapped class.
This doesn't feel like the correct solution, because the SQLAlchemy documentation says you are not supposed to use an engine connection with the ORM. I just can't find out why.
Does anyone know if there is any issue using the connection instead of Session with ORM Mapped Class?
What is the correct way to read SQL into a DataFrame using the SQLAlchemy ORM?
I found a couple of old answers on this where you use the engine directly as the second argument, or use session.bind and so on. Nothing works.

Just reading the documentation of pandas.read_sql_query:
pandas.read_sql_query(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, chunksize=None, dtype=None)
Parameters:
sql: str SQL query or SQLAlchemy Selectable (select or text object)
SQL query to be executed.
con: SQLAlchemy connectable, str, or sqlite3 connection
Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.
...
So pandas does allow a SQLAlchemy Selectable (e.g. select(MeterValue)) and a SQLAlchemy connectable (e.g. engine.connect()); your code block is correct and pandas will handle the querying correctly.
with ENGINE.connect() as conn:
    df = pd.read_sql_query(
        sqlalchemy.select(MeterValue),
        conn,
    )
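If you would rather keep everything inside the ORM's Session, one option (a sketch, not part of the original answer, assuming SQLAlchemy 1.4+) is to hand pandas the Session's own Connection via session.connection():
from sqlalchemy.orm import Session

# Sketch: the read then runs inside the Session's transaction.
# ENGINE and MeterValue are the objects from the question.
with Session(ENGINE) as session:
    df = pd.read_sql_query(
        sqlalchemy.select(MeterValue),
        session.connection(),  # a Connection object, which pandas accepts
    )
This keeps the query defined against the mapped class while still giving pandas the connectable it expects.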

Related

SQLAlchemy: Dynamically pass schema and table name avoiding SQL Injection

How can I execute an SQL query where the schema and table name are passed in a function? Something like below?
def get(engine, schema: str, table: str):
    query = text("select * from :schema.:table")
    result = engine.connect().execute(query, schema=schema, table=table)
    return result
Two things going on here:
Avoiding SQL injection
Dynamically setting a schema with (presumably) PostgreSQL
The first question has a very broad scope; you might want to look at older questions about SQLAlchemy and SQL injection, such as this one: SQLAlchemy + SQL Injection
Your second question can be addressed in a number of ways, though I would recommend the following approach from SQLAlchemy's documentation: https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#remote-schema-table-introspection-and-postgresql-search-path
PostgreSQL supports a "search path" command which sets the schema for subsequent operations on the connection.
So your query code might look like:
qry_str = f"SET search_path TO {schema}"
Alternatively, if you use an SQLAlchemy declarative approach, you can use a MetaData object like in this question/answer SQLAlchemy support of Postgres Schemas
You could create a collection of existing table and schema names in your database and check inputs against those values before creating your query (a sketch of this follows after the query below):
-- assumes we are connected to the correct *database*
SELECT table_schema, table_name
FROM information_schema.tables;
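A minimal sketch tying the two suggestions together with SQLAlchemy's inspector (function and variable names are illustrative; it assumes PostgreSQL and SQLAlchemy 1.4+):
from sqlalchemy import inspect, text

def get(engine, schema: str, table: str):
    # Check the requested identifiers against what actually exists
    # in the database before using them in SQL.
    insp = inspect(engine)
    if schema not in insp.get_schema_names():
        raise ValueError(f"unknown schema: {schema!r}")
    if table not in insp.get_table_names(schema=schema):
        raise ValueError(f"unknown table: {table!r}")
    with engine.connect() as conn:
        # Identifiers are now known-good, so setting the search path
        # and interpolating the table name is safe.
        conn.execute(text(f"SET search_path TO {schema}"))
        return conn.execute(text(f'SELECT * FROM "{table}"')).fetchall()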

How to get database size in SQL Alchemy?

I am looking for a way to get the size of a database with SQL Alchemy. Ideally, it will be agnostic to which underlying type of database is used. Is this possible?
Edit:
By size, I mean total number of bytes that the database uses.
The way I would approach this is to find out whether there is a SQL query that gives the answer. Then you can just run that query via SQLAlchemy and read the result.
Currently, SQLAlchemy does not provide any convenience function to determine the database size in bytes, but you can execute a raw SQL statement. The caveat is that you have to use a statement specific to your database (MySQL, Postgres, etc.).
Checking this answer, for MySQL you can execute a statement manually like:
from sqlalchemy import create_engine
connection_string = 'mysql+pymysql://...'
engine = create_engine(connection_string)
statement = 'SELECT table_schema, table_name, data_length, index_length FROM information_schema.tables'
with engine.connect() as con:
    res = con.execute(statement)
    size_of_table = res.fetchall()
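If you want a single number for the whole database rather than per-table figures, a sketch along the same lines (MySQL only; text() is used so it also runs on SQLAlchemy 1.4+/2.0):
from sqlalchemy import text

# Sum data and index sizes across all tables in the current database
total_size_stmt = text(
    "SELECT SUM(data_length + index_length) "
    "FROM information_schema.tables "
    "WHERE table_schema = DATABASE()"
)
with engine.connect() as con:
    size_in_bytes = con.execute(total_size_stmt).scalar()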
For SQLite you can just check the entire database size with the os module:
import os
os.path.getsize('sqlite.db')
For PostgreSQL you can do it like this:
from sqlalchemy.orm import Session
dbsession: Session  # an existing, already-configured Session
engine = dbsession.bind
database_name = engine.url.database
# https://www.a2hosting.com/kb/developer-corner/postgresql/determining-the-size-of-postgresql-databases-and-tables
# Return the database size in bytes
database_size = dbsession.execute(f"""SELECT pg_database_size('{database_name}')""").scalar()

Speeding up Pandas to_sql()?

I have a 1,000,000 x 50 Pandas DataFrame that I am currently writing to a SQL table using:
df.to_sql('my_table', con, index=False)
It takes an incredibly long time. I've seen various explanations about how to speed up this process online, but none of them seem to work for MSSQL.
If I try the method in:
Bulk Insert A Pandas DataFrame Using SQLAlchemy
then I get a no attribute copy_from error.
If I try the multithreading method from:
http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html
then I get a QueuePool limit of size 5 overflow 10 reached, connection timed out error.
Is there any easy way to speed up to_sql() to an MSSQL table? Either via BULK COPY or some other method, but entirely from within Python code?
I've used ctds to do a bulk insert that's a lot faster with SQL Server. In the example below, df is the pandas DataFrame. The column sequence in the DataFrame is identical to the schema for mydb.
import ctds
conn = ctds.connect('server', user='user', password='password', database='mydb')
conn.bulk_insert('table', (df.to_records(index=False).tolist()))
In pandas 0.24 you can use method='multi' with a chunk size of 1000, which is the SQL Server limit on rows per INSERT:
df.to_sql('my_table', con, index=False, chunksize=1000, method='multi')
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method
New in version 0.24.0.
The parameter method controls the SQL insertion clause used. Possible values are:
None: Uses standard SQL INSERT clause (one per row).
'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backend if the table contains many columns. For more information check the SQLAlchemy documentation.
I had the same issue, so I applied SQLAlchemy's fast_executemany via an event listener.
from sqlalchemy import event, create_engine

engine = create_engine('connection_string_with_database')

@event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # switch pyodbc from executemany to fast_executemany
Make sure the listener function is registered after the engine is created and before any statements are executed. Then write the DataFrame as usual:
df.to_sql('table', con=engine, if_exists='append', index=False)  # for reference, see the pandas to_sql documentation
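On SQLAlchemy 1.3 and later, the mssql+pyodbc dialect also accepts fast_executemany directly as a create_engine() argument, which avoids the event listener entirely. A sketch with a placeholder connection string:
from sqlalchemy import create_engine

# Placeholder DSN/credentials; requires SQLAlchemy 1.3+ with mssql+pyodbc
engine = create_engine(
    "mssql+pyodbc://user:password@my_dsn",
    fast_executemany=True,
)
df.to_sql("my_table", con=engine, if_exists="append", index=False)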

SQLAlchemy/pandas to_sql for SQLServer -- CREATE TABLE in master db

Using MSSQL (version 2012), I am using SQLAlchemy and pandas (on Python 2.7) to insert rows into a SQL Server table.
After trying pymssql and pyodbc with a specific server string, I am trying an odbc name:
import sqlalchemy, pyodbc, pandas as pd
engine = sqlalchemy.create_engine("mssql+pyodbc://mssqlodbc")
sqlstring = "EXEC getfoo"
dbdataframe = pd.read_sql(sqlstring, engine)
This part works great and worked with the other methods (pymssql, etc). However, the pandas to_sql method doesn't work.
finaloutput.to_sql("MyDB.dbo.Loader_foo",engine,if_exists="append",chunksize="10000")
With this statement, I get a consistent error that pandas is trying to do a CREATE TABLE in the SQL Server master db, which it is not permissioned for.
How do I get pandas/SQLAlchemy/pyodbc to point to the correct mssql database? The to_sql method seems to ignore whatever I put in the engine connect string (although the read_sql method seems to pick it up just fine).
To have this question marked as answered: the problem is that you specify the schema in the table name itself. If you provide "MyDB.dbo.Loader_foo" as the table name, pandas will interpret this full string as the table name, instead of just "Loader_foo".
The solution is to provide only "Loader_foo" as the table name. If you need to specify a specific schema to write this table into, you can use the schema kwarg (see docs):
finaloutput.to_sql("Loader_foo", engine, if_exists="append")
finaloutput.to_sql("Loader_foo", engine, if_exists="append", schema="something_else_as_dbo")
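As for pointing at the right database: pandas writes to whichever database the engine is connected to, so it is also worth checking that the database is part of the connection URL (or the DSN configuration) rather than the table name. A rough sketch with placeholder credentials and driver name:
import sqlalchemy

# The target database (MyDB) belongs in the URL, not in the table name
engine = sqlalchemy.create_engine(
    "mssql+pyodbc://user:password@server/MyDB?driver=ODBC+Driver+17+for+SQL+Server"
)
finaloutput.to_sql("Loader_foo", engine, if_exists="append", schema="dbo")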

Python MS Access Database Table Creation From Pandas Dataframe Using SQLAlchemy

I'm trying to create an MS Access database from Python and was wondering if it's possible to create a table directly from a pandas dataframe. I know that I can use the pandas dataframe.to_sql() function to successfully write the dataframe to an SQLite database, or to some other database format by using a sqlalchemy engine (but not Access, unfortunately), but I can't get all the pieces to come together. Here's the code snippet that I've been testing with:
import pandas as pd
import sqlalchemy
import pypyodbc # Used to actually create the .mdb file
import pyodbc
# Connection function to use for sqlalchemy
def Connection():
    MDB = 'C:\\database.mdb'
    DRV = '{Microsoft Access Driver (*.mdb)}'
    connection_string = 'Driver={Microsoft Access Driver (*.mdb)};DBQ=%s' % MDB
    return pyodbc.connect('DRIVER={};DBQ={}'.format(DRV,MDB))
# Try to connect to the database
try:
    Conn = Connection()
# If it fails because it's not been created yet, create it and connect to it
except:
    pypyodbc.win_create_mdb(MDB)
    Conn = Connection()
# Create the sqlalchemy engine using the pyodbc connection
Engine = sqlalchemy.create_engine('mysql+pyodbc://', creator=Connection)
# Some dataframe
data = {'Values' : [1., 2., 3., 4.],
        'FruitsAndPets' : ["Apples", "Oranges", "Puppies", "Ducks"]}
df = pd.DataFrame(data)
# Try to send it to the access database (and fail)
df.to_sql('FruitsAndPets', Engine, index = False)
I'm not sure that what I'm trying to do is even possible with the current packages I'm using but I wanted to check here before I write my own hacky dataframe to MS Access table function. Maybe my sqlalchemy engine is set up wrong?
Here's the end of my error with mssql+pyodbc in the engine:
cursor.execute(statement, parameters)
sqlalchemy.exc.DBAPIError: (Error) ('HY000', "[HY000] [Microsoft][ODBC Microsoft Access Driver] Could not find file 'C:\\INFORMATION_SCHEMA.mdb'. (-1811) (SQLExecDirectW)") u'SELECT [COLUMNS_1].[TABLE_SCHEMA], [COLUMNS_1].[TABLE_NAME], [COLUMNS_1].[COLUMN_NAME], [COLUMNS_1].[IS_NULLABLE], [COLUMNS_1].[DATA_TYPE], [COLUMNS_1].[ORDINAL_POSITION], [COLUMNS_1].[CHARACTER_MAXIMUM_LENGTH], [COLUMNS_1].[NUMERIC_PRECISION], [COLUMNS_1].[NUMERIC_SCALE], [COLUMNS_1].[COLUMN_DEFAULT], [COLUMNS_1].[COLLATION_NAME] \nFROM [INFORMATION_SCHEMA].[COLUMNS] AS [COLUMNS_1] \nWHERE [COLUMNS_1].[TABLE_NAME] = ? AND [COLUMNS_1].[TABLE_SCHEMA] = ?' (u'FruitsAndPets', u'dbo')
and the ending error for mysql+pyodbc in the engine:
cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (ProgrammingError) ('42000', "[42000] [Microsoft][ODBC Microsoft Access Driver] Invalid SQL statement; expected 'DELETE', 'INSERT', 'PROCEDURE', 'SELECT', or 'UPDATE'. (-3500) (SQLExecDirectW)") "SHOW VARIABLES LIKE 'character_set%%'" ()
Just to note, I don't care whether I use sqlalchemy or pandas to_sql(); I'm just looking for some easy way of getting a dataframe into my MS Access database. If that means dumping to JSON and then looping to insert rows using SQL manually, whatever, if it works well I'll take it.
For those still looking into this, basically you can't use pandas to_sql method for MS Access without a great deal of difficulty. If you are determined to do it this way, here is a link where someone fixed sqlalchemy's Access dialect (and presumably the OP's code would work with this Engine):
connecting sqlalchemy to MSAccess
The best way to get a data frame into MS Access is to build the INSERT statements from the records, then simply connect via pyodbc or pypyodbc and execute them with a cursor. You have to do inserts one row at a time; it's probably best to break this up into chunks (around 5,000 rows) if you have a lot of data.
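A minimal sketch of that approach with pyodbc, committing in chunks (the driver string, file path, and table/column names follow the question's example and are assumptions):
import pyodbc

conn = pyodbc.connect(r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=C:\database.mdb")
cursor = conn.cursor()
# [Values] is bracketed because it is a reserved word in Access SQL
insert_sql = "INSERT INTO FruitsAndPets ([Values], FruitsAndPets) VALUES (?, ?)"
rows = list(df.itertuples(index=False, name=None))
chunk = 5000
for start in range(0, len(rows), chunk):
    # executemany still sends the rows one INSERT at a time to Access,
    # but committing per chunk keeps the transactions manageable
    cursor.executemany(insert_sql, rows[start:start + chunk])
    conn.commit()
cursor.close()
conn.close()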
There is a short tutorial on the pypyodbc website for executing SQL commands and populating an Access database:
https://code.google.com/p/pypyodbc/wiki/pypyodbc_for_access_mdb_file
I also found this useful Python wiki article:
https://wiki.python.org/moin/Microsoft%20Access
It states that mxODBC also has the capability to work with MS Access. A long time ago, I believe I successfully used ADOdb to connect to MS Access as well.
A few years ago, SQLAlchemy had experimental support for Microsoft Access. I used it to move an Access database to MS SQL Server at the time, using SQLAlchemy to autoload / reflect the database. It was super handy. I believe that code was in version 0.5. You can read a bit about what I did here.
