Pandas DataFrame to SQL [duplicate]

Context
I just ran into trouble while trying to do some I/O operations on some databases from a Python 3 script.
When I want to connect to a database, I habitually use psycopg2 to handle the connections and cursors.
My data are usually stored as Pandas DataFrames and/or GeoPandas' equivalent GeoDataFrames.
Difficulties
In order to read data from a database table:
Using Pandas:
I can rely on its .read_sql() method, which takes a con parameter, as stated in the doc:
con : SQLAlchemy connectable (engine/connection) or database str URI
or DBAPI2 connection (fallback mode)
Using SQLAlchemy makes it possible to use any DB supported by that
library. If a DBAPI2 object, only sqlite3 is supported. The user is responsible
for engine disposal and connection closure for the SQLAlchemy connectable. See
`here <https://docs.sqlalchemy.org/en/13/core/connections.html>`_
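For instance, reading through a SQLAlchemy engine could look like this (a minimal sketch; the URI, credentials, and the table name my_table are placeholders):
import pandas as pd
from sqlalchemy import create_engine

# a SQLAlchemy connectable: any database supported by SQLAlchemy works
engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/mydb')
df = pd.read_sql('SELECT * FROM my_table', con=engine)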
Using GeoPandas:
I can rely on its .read_postgis() method, which takes a con parameter, as stated in the doc:
con : DB connection object or SQLAlchemy engine
Active connection to the database to query.
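So here a raw psycopg2 connection is accepted directly, e.g. (a sketch; the table name parcels and geometry column geom are made up, and dict_params is the parameter dictionary discussed below):
import geopandas as gpd
import psycopg2

# a plain DBAPI connection is enough for reading with GeoPandas
conn = psycopg2.connect(**dict_params)
gdf = gpd.read_postgis('SELECT * FROM parcels', con=conn, geom_col='geom')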
In order to write data to a database table:
Using Pandas:
I can rely on the .to_sql() method, which takes a con parameter, as stated in the doc:
con : sqlalchemy.engine.Engine or sqlite3.Connection
Using SQLAlchemy makes it possible to use any DB supported by that
library. Legacy support is provided for sqlite3.Connection objects. The user
is responsible for engine disposal and connection closure for the SQLAlchemy
connectable. See `here <https://docs.sqlalchemy.org/en/13/core/connections.html>`_
Using GeoPandas:
I can rely on the .to_sql() method (which directly relies on the Pandas .to_sql()), which takes a con parameter, as stated in the doc:
con : sqlalchemy.engine.Engine or sqlite3.Connection
Using SQLAlchemy makes it possible to use any DB supported by that
library. Legacy support is provided for sqlite3.Connection objects. The user
is responsible for engine disposal and connection closure for the SQLAlchemy
connectable. See `here <https://docs.sqlalchemy.org/en/13/core/connections.html>`_
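So writing always goes through SQLAlchemy, e.g. (a sketch; mydb and my_table are placeholders):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/mydb')
df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})
# to_sql may have to CREATE (or DROP and recreate) the target table
df.to_sql('my_table', con=engine, if_exists='replace', index=False)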
From here, I easily understand that GeoPandas is built on top of Pandas, especially for its GeoDataFrame object, which is, in short, a special DataFrame that can handle geographic data.
But I'm wondering why GeoPandas has the ability to directly take a psycopg2 connection as an argument while Pandas does not, and whether this is planned for the latter.
And why is it the case for neither of them when it comes to writing data?
I would like (as probably many others [1], [2]) to directly give them a psycopg2 connection argument instead of relying on a SQLAlchemy engine.
Because even if this tool is really great, it makes me use two different frameworks to connect to my database, and thus handle two different connection strings (and I personally prefer the way psycopg2 handles parameter expansion from a dictionary to build a connection string properly, e.g. psycopg2.connect(**dict_params), vs URL injection, as explained here for example: Is it possible to pass a dictionary into create_engine function in SQLAlchemy?).
Workaround
At first, I was creating my psycopg2 connection string from a dictionary of parameters this way:
# build "user=... password=... host=... port=... dbname=..." by hand
# (relies on dict_params preserving the key order of the template)
connParams = ("user={}", "password={}", "host={}", "port={}", "dbname={}")
conn = ' '.join(connParams).format(*dict_params.values())
Then I figured out it was better and more pythonic this way:
conn = psycopg2.connect(**dict_params)
Which I finally replaced by this, so that I can interchangeably use it to build either a psycopg2 connection or a SQLAlchemy engine:
def connector():
    return psycopg2.connect(**dict_params)
a) Initializing a psycopg2 connection is now done by:
conn = connector()
curs = conn.cursor()
b) And initializing a SQLAlchemy engine by:
engine = create_engine('postgresql+psycopg2://', creator=connector)
(or with any db+driver flavor of your choice)
This is well documented here:
https://docs.sqlalchemy.org/en/13/core/engines.html#custom-dbapi-args
and here:
https://docs.sqlalchemy.org/en/13/core/engines.html#sqlalchemy.create_engine
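Putting it together, both flavors can now be exercised side by side (a sketch; the queries are trivial on purpose):
import pandas as pd
from sqlalchemy import create_engine

# a) plain psycopg2: connection and cursor from the connector
with connector() as conn, conn.cursor() as curs:
    curs.execute('SELECT version()')
    print(curs.fetchone())

# b) SQLAlchemy engine fed by the very same connector
engine = create_engine('postgresql+psycopg2://', creator=connector)
print(pd.read_sql('SELECT 1 AS ok', con=engine))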
[1] Dataframe to sql without Sql Alchemy engine
[2] How to write data frame to Postgres table without using SQLAlchemy engine?

Probably the main reason why to_sql needs a SQLAlchemy Connectable (Engine or Connection) object is that to_sql needs to be able to create the database table if it does not exist or if it needs to be replaced. Early versions of pandas worked exclusively with DBAPI connections, but I suspect that when they were adding new features to to_sql they found themselves writing a lot of database-specific code to work around the quirks of the various DDL implementations.
On realizing that they were duplicating a lot of logic that was already in SQLAlchemy, they likely decided to "outsource" all of that complexity to SQLAlchemy itself by simply accepting an Engine/Connection object and using SQLAlchemy's (database-independent) SQL Expression language to create the table.
"it makes me use two different frameworks to connect to my database"
No, because .read_sql_query() also accepts a SQLAlchemy Connectable object, so you can just use your SQLAlchemy connection for both reading and writing.
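For example, both an Engine and a Connection checked out from it are accepted (a sketch, reusing the engine built in the question's workaround; my_table is a placeholder):
df = pd.read_sql_query('SELECT * FROM my_table', con=engine)    # an Engine works
with engine.connect() as sa_conn:                               # ...and so does a Connection
    df = pd.read_sql_query('SELECT * FROM my_table', con=sa_conn)
df.to_sql('my_table_copy', con=engine, index=False)             # writing through the same engine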

Related

What is the difference between pyodbc vs sqlalchemy?

Both are supposed to parse a connection string and be able to insert into, say, SQL Server from a pandas DataFrame.
What is the real difference here?
PyODBC allows you to connect to and use an ODBC database via the standard DB API 2.0. SQLAlchemy is a toolkit that sits one level higher than that and provides a variety of features:
Object-relational mapping (ORM)
Query construction
Caching
Eager loading
and others. It can work with PyODBC or any other driver that supports DB API 2.0.
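To make the layering concrete (a sketch; mydsn and my_table are placeholders, assuming a DSN-based ODBC setup):
import pyodbc
from sqlalchemy import create_engine

# DB API 2.0 level: raw SQL, manual cursor handling
conn = pyodbc.connect('DSN=mydsn')
rows = conn.cursor().execute('SELECT * FROM my_table').fetchall()

# SQLAlchemy level: an Engine on top of the same driver,
# usable by the ORM, the expression language, or pandas
engine = create_engine('mssql+pyodbc://@mydsn')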

How to determine if Python DB-API connection object is of a certain DBMS (e.g., PostgreSQL, MySQL)

Let the driver be a DB-API driver (e.g., psycopg2, pymysql); you can get the connection object via connection = driver.connect(...). How do I check what kind of DBMS the connection object is connected to: (1) at best the DBMS name, or (2) the module name?
Use Case:
I need to make special queries that are of different SQL syntax (e.g., COPY clause for PostgreSQL vs using bulk INSERT for MySQL).
Something like this:
>>> type(con)  # for a psycopg2 connection
psycopg2.extensions.connection
>>> type(con)  # for a sqlite3 connection
sqlite3.Connection
Or shorter yet:
>>> con.__class__
psycopg2.extensions.connection
>>> con.__class__
sqlite3.Connection
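Building on that, a small helper could branch on the connection class's defining module (a sketch; the mapping is illustrative, not exhaustive):
def dbms_name(con):
    # 'psycopg2.extensions' -> 'psycopg2', 'sqlite3' -> 'sqlite3', ...
    module = type(con).__module__.split('.')[0]
    names = {'psycopg2': 'PostgreSQL', 'pymysql': 'MySQL', 'sqlite3': 'SQLite'}
    return names.get(module, module)  # fall back to the bare module name
This answers both variants: the dict lookup gives the DBMS name (1), and the fallback gives the module name (2).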

Connect to sqlite3.Connection using sqlalchemy

I am using a library that creates an SQLite database in memory by calling sqlite3.connect(':memory:'). I would like to connect to this database using SQLAlchemy to use some ORM and other nice bells and whistles. Is there, in the depths of SQLAlchemy's API, a way to pass the resulting sqlite3.Connection object through so that I can re-use it?
I cannot just re-connect with connection = sqlalchemy.create_engine('sqlite:///:memory:').connect() – as the SQLite documentation states: “The database ceases to exist as soon as the database connection is closed. Every :memory: database is distinct from every other. So, opening two database connections each with the filename ":memory:" will create two independent in-memory databases.” (Which makes sense. I also tried it, and the behaviour is as expected.)
I have tried to follow SQLAlchemy's source code to find the low level location where the database connection is established and SQLite is actually called, but so far I found nothing. It looks like SQLAlchemy uses far too much obscure alchemy to do that for me to understand when and where it happens.
Here's a way to do that:
import sqlite3
from sqlalchemy import create_engine

# some connection is created - by you or someone else
conn = sqlite3.connect(':memory:')
...

def get_connection():
    # just a debug print to verify that it's indeed getting called:
    print("returning the connection")
    return conn

# create a SQLAlchemy engine that reuses the same in-memory sqlite connection
engine = create_engine('sqlite://', creator=get_connection)
From this point on, just use the engine as you wish.
Here's a link to the documentation of this feature.
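A quick check that the engine really reuses the same in-memory database (a sketch; text() keeps the query portable across SQLAlchemy versions):
import sqlite3
from sqlalchemy import create_engine, text

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (x INTEGER)')
conn.execute('INSERT INTO t VALUES (42)')

engine = create_engine('sqlite://', creator=lambda: conn)
with engine.connect() as sa_conn:
    print(sa_conn.execute(text('SELECT x FROM t')).fetchall())  # [(42,)]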

Instantiating SQLAlchemy engine from DBAPI connection object [duplicate]

I am working in an environment where I am given an ODBC connection, which has been created using credentials to which I don't have access (for security reasons). However I would like to access the underlying database using SQLAlchemy - so my question is, can I pass this ODBC connection to something like create_engine, or alternatively, wrap it in such a way that it looks like a SQLAlchemy connection?
As a supplementary question (and working on the optimistic assumption that the first part can be satisfied) is there a way that I can tell SQLA what dialect to use for the underlying RDBMS?
Thanks
yes you can:
from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool
eng = create_engine("mssql+pyodbc://", poolclass=StaticPool, creator=lambda: my_odbc_connection)
however, if you truly have only one connection already created, as opposed to a callable that creates them, you must only use this engine in a single thread, one operation at a time. It is not threadsafe for use in a multithreaded application.
If OTOH you can actually get at a Python function that creates new connections whenever called, this is much more appropriate:
from sqlalchemy import create_engine
eng = create_engine("mssql+pyodbc://", creator=my_odbc_connection_function)
the above engine will pool connections normally and can be used freely as a source of connectivity.
I found this old post (from 14 years ago):
import sqlalchemy
sqlalchemy.create_engine('mssql://?dsn=mydsn')
my use case is using Jupyter SQL magics in new notebooks to be distributed to users who already have the DSN configured on their systems
for Jupyter, after pip install ipython-sql:
import sqlalchemy
%load_ext sql
%sql mssql://?dsn=mydsn
tested on SQL Server 2019 with Windows authentication, as all the info I found points to DSNs not holding passwords for MSSQL
there's no reason why it shouldn't work on older versions (I'll use it on SQL Server 2005 too, of course with another DSN)
