I'm trying to create an MS Access database from Python and was wondering if it's possible to create a table directly from a pandas DataFrame. I know I can use the pandas DataFrame.to_sql() function to successfully write the DataFrame to an SQLite database, or use a sqlalchemy engine for some other database format (but not Access, unfortunately), but I can't get all the pieces to come together. Here's the code snippet I've been testing with:
import pandas as pd
import sqlalchemy
import pypyodbc # Used to actually create the .mdb file
import pyodbc
MDB = 'C:\\database.mdb'
DRV = '{Microsoft Access Driver (*.mdb)}'

# Connection function to use for sqlalchemy
def Connection():
    connection_string = 'DRIVER={};DBQ={}'.format(DRV, MDB)
    return pyodbc.connect(connection_string)

# Try to connect to the database
try:
    Conn = Connection()
# If it fails because it hasn't been created yet, create it and connect to it
except:
    pypyodbc.win_create_mdb(MDB)
    Conn = Connection()
# Create the sqlalchemy engine using the pyodbc connection
Engine = sqlalchemy.create_engine('mysql+pyodbc://', creator=Connection)
# Some dataframe
data = {'Values': [1., 2., 3., 4.],
        'FruitsAndPets': ["Apples", "Oranges", "Puppies", "Ducks"]}
df = pd.DataFrame(data)
# Try to send it to the access database (and fail)
df.to_sql('FruitsAndPets', Engine, index = False)
I'm not sure that what I'm trying to do is even possible with the packages I'm currently using, but I wanted to check here before I write my own hacky DataFrame-to-Access-table function. Maybe my sqlalchemy engine is set up wrong?
Here's the end of my error with mssql+pyodbc in the engine:
cursor.execute(statement, parameters)
sqlalchemy.exc.DBAPIError: (Error) ('HY000', "[HY000] [Microsoft][ODBC Microsoft Access Driver] Could not find file 'C:\\INFORMATION_SCHEMA.mdb'. (-1811) (SQLExecDirectW)") u'SELECT [COLUMNS_1].[TABLE_SCHEMA], [COLUMNS_1].[TABLE_NAME], [COLUMNS_1].[COLUMN_NAME], [COLUMNS_1].[IS_NULLABLE], [COLUMNS_1].[DATA_TYPE], [COLUMNS_1].[ORDINAL_POSITION], [COLUMNS_1].[CHARACTER_MAXIMUM_LENGTH], [COLUMNS_1].[NUMERIC_PRECISION], [COLUMNS_1].[NUMERIC_SCALE], [COLUMNS_1].[COLUMN_DEFAULT], [COLUMNS_1].[COLLATION_NAME] \nFROM [INFORMATION_SCHEMA].[COLUMNS] AS [COLUMNS_1] \nWHERE [COLUMNS_1].[TABLE_NAME] = ? AND [COLUMNS_1].[TABLE_SCHEMA] = ?' (u'FruitsAndPets', u'dbo')
and the ending error for mysql+pyodbc in the engine:
cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (ProgrammingError) ('42000', "[42000] [Microsoft][ODBC Microsoft Access Driver] Invalid SQL statement; expected 'DELETE', 'INSERT', 'PROCEDURE', 'SELECT', or 'UPDATE'. (-3500) (SQLExecDirectW)") "SHOW VARIABLES LIKE 'character_set%%'" ()
Just to note, I don't care whether I use sqlalchemy or pandas' to_sql(); I'm just looking for some easy way of getting a DataFrame into my MS Access database. If that means dumping to JSON and then looping over the rows to insert them with SQL manually, whatever; if it works well, I'll take it.
For those still looking into this: you basically can't use the pandas to_sql method for MS Access without a great deal of difficulty. If you are determined to go that route, here is a link where someone fixed sqlalchemy's Access dialect (and presumably the OP's code would work with this Engine):
connecting sqlalchemy to MSAccess
The best way to get a DataFrame into MS Access is to build the INSERT statements from the records, then simply connect via pyodbc or pypyodbc and execute them with a cursor. You have to do the inserts one at a time, and it's probably best to break them up into chunks (around 5000) if you have a lot of data.
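For illustration, here is a minimal sketch of that approach with pyodbc, reusing the DataFrame and .mdb path from the question; the table name, column types, and chunk size are assumptions you would adjust to your own data:

import pyodbc
import pandas as pd

df = pd.DataFrame({'Values': [1., 2., 3., 4.],
                   'FruitsAndPets': ["Apples", "Oranges", "Puppies", "Ducks"]})

conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb)};DBQ=C:\\database.mdb')
cursor = conn.cursor()

# Create the target table ([Values] is bracketed because it is a reserved word in Access)
cursor.execute("CREATE TABLE FruitsAndPets ([Values] DOUBLE, [FruitsAndPets] TEXT(255))")

# One parameterized INSERT, executed row by row, committing in chunks
insert_sql = "INSERT INTO FruitsAndPets ([Values], [FruitsAndPets]) VALUES (?, ?)"
chunk_size = 5000
for i, row in enumerate(df.itertuples(index=False), start=1):
    cursor.execute(insert_sql, tuple(row))
    if i % chunk_size == 0:
        conn.commit()

conn.commit()
conn.close()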
There is a short tutorial on the pypyodbc website for executing SQL commands and populating an Access database:
https://code.google.com/p/pypyodbc/wiki/pypyodbc_for_access_mdb_file
I also found this useful Python wiki article:
https://wiki.python.org/moin/Microsoft%20Access
It states that mxODBC also has the capability to work with MS Access. A long time ago, I believe I successfully used ADOdb to connect to MS Access as well.
A few years ago, SQLAlchemy had experimental support for Microsoft Access. I used it at the time to move an Access database to MS SQL Server. I used SQLAlchemy to autoload / reflect the database. It was super handy. I believe that code was in version 0.5. You can read a bit about what I did here.
Related
According to the SQLAlchemy documentation you are supposed to use a Session object when executing SQL statements. But using a Session with pandas .read_sql gives an error: AttributeError: 'Session' object has no attribute 'cursor'.
However, using the Connection object works, even with an ORM Mapped Class:
with ENGINE.connect() as conn:
    df = pd.read_sql_query(
        sqlalchemy.select(MeterValue),
        conn
    )
Where MeterValue is a Mapped Class.
This doesn't feel like the correct solution, because the SQLAlchemy documentation says you are not supposed to use an engine connection with the ORM. I just can't figure out why.
Does anyone know if there is any issue using the connection instead of Session with ORM Mapped Class?
What is the correct way to read sql in to a DataFrame using SQLAlchemy ORM?
I found a couple of old answers on this where you use the engine directly as the second argument, or use session.bind, and so on. Nothing works.
Just reading the documentation of pandas.read_sql_query:
pandas.read_sql_query(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, chunksize=None, dtype=None)
Parameters:
sql: str SQL query or SQLAlchemy Selectable (select or text object)
SQL query to be executed.
con: SQLAlchemy connectable, str, or sqlite3 connection
Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.
...
So pandas does allow a SQLAlchemy Selectable (e.g. select(MeterValue)) and a SQLAlchemy connectable (e.g. engine.connect()), which means your code block is correct and pandas will handle the query properly.
with ENGINE.connect() as conn:
    df = pd.read_sql_query(
        sqlalchemy.select(MeterValue),
        conn,
    )
Step 1: Create a temporary table in SQL Server with pyodbc for the objects
Step 2: Select the objects from the temporary table and load them into a pandas DataFrame
Step 3: Print the DataFrame
For creating the temporary table I work with a pyodbc cursor, because the pandas.read_sql command throws errors with that statement; on the other hand, it also throws an error if I try to convert the cursor results into a pandas DataFrame, even with the special line for handling tuples into DataFrames.
Here is my program to connect, create, read and print, which works as long as the query stays as simple as it is now (my actual approach has a few hundred lines of SQL query statements):
import codecs
import os
import io
import pandas as pd
import pyodbc as po
server = 'sql_server'
database = 'sql_database'
connection = po.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';Trusted_Connection=yes;')
cursor = connection.cursor()
query1 = """
CREATE TABLE #ttobject (object_nr varchar(6), change_date datetime)
INSERT INTO #ttobject (object_nr)
VALUES
('112211'),
('113311'),
('114411');
"""
query2 = """
SELECT *
FROM #ttobject
Drop table if exists #ttobject
"""
cursor.execute(query1)
df = pd.read_sql_query(query2, connection)
print(df)
Because of the length of the actual query I'll spare you the details and instead post the error code here:
('HY000', '[HY000] [Microsoft][ODBC SQL Server Driver]Connection is busy with results for another hstmt (0) (SQLExecDirectW)')
This error gets thrown at query2, which is a multiple-SELECT statement with some joins and pivot functions.
When I try to put everything into one cursor I get issues converting it from a cursor to a DataFrame (I tried several methods; maybe someone knows one that isn't on SO already, or has a special title so I couldn't find it).
The same problem occurs if I try to only use pd.read_sql; then the creation of the temporary table doesn't work.
I don't know where to go from here.
Please let me know if I can assist with further details that I may have overlooked.
23.5.19 Further investigation:
Following Gord's suggestion, I tried setting autocommit to True, which works for simple SQL statements but not for my really long and time-consuming one.
Secondly, I tried to add "cursor.execute('SET NOCOUNT ON; EXEC schema.proc #muted = 1')" (a sketch of this idea applied to my queries is shown after this update).
At the moment I guess that the first query takes longer, so Python already starts with the second one and therefore the connection is blocked. Or that the first query returns some feedback, so Python thinks it has finished before it actually has.
I added a time.sleep(100) after execution of the first query but am still getting the hstmt is busy error. I wonder why, because it should have had enough time to process the first query.
Fun fact: the query runs smoothly as long as I'm not trying to output any result from it.
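For reference, here is a sketch of what the SET NOCOUNT ON idea looks like when applied to query1 from the snippet above (connection, cursor, and query2 as defined there); suppressing the row-count messages produced by the INSERTs is a commonly suggested way to avoid the busy-handle error, though I haven't confirmed it helps with my long query:

query1 = """
SET NOCOUNT ON;
CREATE TABLE #ttobject (object_nr varchar(6), change_date datetime);
INSERT INTO #ttobject (object_nr)
VALUES
('112211'),
('113311'),
('114411');
"""

cursor.execute(query1)
df = pd.read_sql_query(query2, connection)
print(df)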
I am new to using pyodbc for querying data from an ODBC DB, specifically a Lotus Notes DB.
This is an example where the query fails using a function in SQL:
import pyodbc
import pandas as pd
cnxn = pyodbc.connect("Driver={Lotus Notes SQL Driver (*.nsf)};SERVER=server;DATABASE=db.nsf;PWD=xxxxx;UID=userid", autocommit=True)
cursor = cnxn.cursor()
sql_addon = """SELECT REPLACE(timestamp_DT,'-','') as timestamp_DT
FROM ViewInNoteDB
"""
df_addon = pd.read_sql(sql_addon, cnxn)
This is the error I get:
': ('37000', u"[37000] [Lotus][ODBC Lotus Notes]Name, constant, or expression expected (23008) (SQLExecDirectW); [37000] [Lotus][ODBC Lotus Notes]Incorrect syntax near 'SELECT' (23064)")
I get different errors using GETDATE(), CONVERT function, and many other functions.
It seems that the issue is related to using SQL Server syntax, which is not supported by the Lotus Notes ODBC driver. CAST and CONVERT are not supported, unfortunately.
The only supported column functions are listed here: http://www-12.lotus.com/ldd/doc/notessql/2.0.6/notessql.nsf/66208c256b4136a2852563c000646f8c/1f3d9225b5e6a547852567010067254d?OpenDocument
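If the goal is just to strip the dashes from timestamp_DT, one workaround is to select the raw column and do the transformation client-side in pandas instead of in SQL. A sketch, assuming the column comes back as text or something pandas can cast to a string:

import pandas as pd
import pyodbc

cnxn = pyodbc.connect("Driver={Lotus Notes SQL Driver (*.nsf)};SERVER=server;"
                      "DATABASE=db.nsf;PWD=xxxxx;UID=userid", autocommit=True)

# Select the raw column only, so no unsupported SQL functions are involved
df_addon = pd.read_sql("SELECT timestamp_DT FROM ViewInNoteDB", cnxn)

# Equivalent of REPLACE(timestamp_DT, '-', ''), done in pandas
df_addon['timestamp_DT'] = (df_addon['timestamp_DT'].astype(str)
                                                    .str.replace('-', '', regex=False))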
I have a 1,000,000 x 50 Pandas DataFrame that I am currently writing to a SQL table using:
df.to_sql('my_table', con, index=False)
It takes an incredibly long time. I've seen various explanations about how to speed up this process online, but none of them seem to work for MSSQL.
If I try the method in:
Bulk Insert A Pandas DataFrame Using SQLAlchemy
then I get a no attribute copy_from error.
If I try the multithreading method from:
http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html
then I get a QueuePool limit of size 5 overflow 10 reach, connection timed out error.
Is there any easy way to speed up to_sql() to an MSSQL table? Either via BULK COPY or some other method, but entirely from within Python code?
I've used ctds to do a bulk insert that's a lot faster with SQL Server. In the example below, df is the pandas DataFrame. The column sequence in the DataFrame is identical to the schema for mydb.
import ctds
conn = ctds.connect('server', user='user', password='password', database='mydb')
conn.bulk_insert('table', (df.to_records(index=False).tolist()))
In pandas 0.24 you can use method='multi' with a chunksize of 1000, which is the SQL Server limit:
chunksize=1000, method='multi'
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method
New in version 0.24.0.
The parameter method controls the SQL insertion clause used. Possible values are:
None: Uses standard SQL INSERT clause (one per row).
'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backend if the table contains many columns. For more information check the SQLAlchemy documentation.
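For illustration, a minimal sketch of the full call, assuming df is the DataFrame from the question and using a placeholder connection string and table name:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder DSN/credentials; replace with your own connection string
engine = create_engine("mssql+pyodbc://user:password@my_dsn")

df.to_sql(
    "my_table",
    engine,
    if_exists="append",
    index=False,
    chunksize=1000,   # keep each INSERT within SQL Server's limits
    method="multi",   # multiple rows per INSERT statement
)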
I had the same issue as well, so I applied SQLAlchemy with fast_executemany.
from sqlalchemy import event, create_engine

engine = create_engine('connection_string_with_database')

@event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # switch from executemany to fast_executemany
Always make sure that this listener function is defined after the engine variable is created and before any cursor execute happens.
df.to_sql('table', con=engine, if_exists='append', index=False)  # for reference, see the pandas to_sql documentation
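As a side note, if you are on SQLAlchemy 1.3 or later with the mssql+pyodbc dialect, the same effect can be achieved with an engine flag instead of an event listener (a sketch; the connection string is a placeholder):

from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://my_dsn", fast_executemany=True)

df.to_sql('table', con=engine, if_exists='append', index=False)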
Using MSSQL (version 2012), I am using SQLAlchemy and pandas (on Python 2.7) to insert rows into a SQL Server table.
After trying pymssql and pyodbc with a specific server string, I am trying an odbc name:
import sqlalchemy, pyodbc, pandas as pd
engine = sqlalchemy.create_engine("mssql+pyodbc://mssqlodbc")
sqlstring = "EXEC getfoo"
dbdataframe = pd.read_sql(sqlstring, engine)
This part works great and worked with the other methods (pymssql, etc). However, the pandas to_sql method doesn't work.
finaloutput.to_sql("MyDB.dbo.Loader_foo",engine,if_exists="append",chunksize="10000")
With this statement, I get a consistent error that pandas is trying to do a CREATE TABLE in the SQL Server master db, which it does not have permission for.
How do I get pandas/SQLAlchemy/pyodbc to point to the correct MSSQL database? The to_sql method seems to ignore whatever I put in the engine connection string (although the read_sql method seems to pick it up just fine).
To have this question answered: the problem is that you specify the schema in the table name itself. If you provide "MyDB.dbo.Loader_foo" as the table name, pandas will interpret this full string as the table name, instead of just "Loader_foo".
The solution is to provide only "Loader_foo" as the table name. If you need to specify a specific schema to write this table into, you can use the schema kwarg (see the docs):
finaloutput.to_sql("Loader_foo", engine, if_exists="append")
finaloutput.to_sql("Loader_foo", engine, if_exists="append", schema="something_else_as_dbo")