Speeding up Pandas to_sql()?


I have a 1,000,000 x 50 Pandas DataFrame that I am currently writing to a SQL table using:
df.to_sql('my_table', con, index=False)
It takes an incredibly long time. I've seen various explanations about how to speed up this process online, but none of them seem to work for MSSQL.
If I try the method in:
Bulk Insert A Pandas DataFrame Using SQLAlchemy
then I get a "no attribute copy_from" error.
If I try the multithreading method from:
http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html
then I get a "QueuePool limit of size 5 overflow 10 reached, connection timed out" error.
Is there any easy way to speed up to_sql() to an MSSQL table? Either via BULK COPY or some other method, but entirely from within Python code?

I've used ctds to do a bulk insert, which is a lot faster with SQL Server. In the example below, df is the pandas DataFrame. The column order in the DataFrame is identical to the schema of the target table in mydb.
import ctds
conn = ctds.connect('server', user='user', password='password', database='mydb')
conn.bulk_insert('table', (df.to_records(index=False).tolist()))

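In pandas 0.24 you can use method='multi' with a chunk size of at most 1000, which is SQL Server's limit on rows per INSERT ... VALUES statement. SQL Server also allows at most 2100 parameters per statement, so for wide tables the chunk size has to be smaller still (rows times columns must stay below 2100).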
chunksize=1000, method='multi'
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method
New in version 0.24.0.
The parameter method controls the SQL insertion clause used. Possible values are:
None: Uses standard SQL INSERT clause (one per row).
'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backend if the table contains many columns. For more information check the SQLAlchemy documentation.
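For SQL Server, the usable chunk size for method='multi' follows from two caps: rows per VALUES clause and parameters per statement, where a multi-row INSERT uses one parameter per cell. The arithmetic can be sketched as below. The helper name and the safety margin of 3 are illustrative assumptions, not part of pandas or SQL Server.

```python
def max_multi_chunksize(n_columns, param_limit=2100, row_limit=1000, margin=3):
    """Largest chunksize for method='multi' that stays under SQL Server's limits.

    param_limit and row_limit reflect SQL Server's caps on parameters per
    statement and rows per VALUES clause; margin is an assumed safety
    buffer, not a documented value.
    """
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    by_params = (param_limit - margin) // n_columns
    return max(1, min(row_limit, by_params))


# A 50-column frame, as in the question, allows only 41 rows per statement.
print(max_multi_chunksize(50))  # 41
print(max_multi_chunksize(2))   # 1000
```

The result would then be passed along the lines of df.to_sql(..., method='multi', chunksize=max_multi_chunksize(len(df.columns))).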

I had the same issue, so I used SQLAlchemy together with pyodbc's fast_executemany.
from sqlalchemy import event, create_engine
engine = create_engine('connection_string_with_database')
@event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # switch pyodbc from executemany to fast_executemany
Always make sure that the listener function is defined after the engine variable and before any statement is executed.
conn = engine.connect()
df.to_sql('table', con=conn, if_exists='append', index=False)  # for reference see the pandas to_sql documentation

Related

Sql Select statement Optimization

I have made a test table in SQL with the following information schema as shown:
I extract this information with the following Python script:
import pandas as pd
import mysql.connector
db = mysql.connector.connect(host="localhost", user="root", passwd="abcdef")
pointer = db.cursor()
pointer.execute("use holdings")
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
pointer.execute(x)
rows = pointer.fetchall()
rows = pd.DataFrame(rows)
stock = rows[1]
The production table contains 200 unique trading symbols and has the schema similar to the test table.
My concern is that with the following statement:
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
I would have to substitute the trading symbol 200 times, which is inefficient.
Is there a more efficient way to do this?

If I understand you correctly, you want to avoid sending a separate query for each trading symbol, correct? In that case the MySQL IN operator might help. You could then send a single query to the database containing all the trading symbols you want. If you want to do different things with the various trading symbols, you could select the subsets within pandas.
Another performance improvement could be pandas.read_sql, since it speeds up the creation of the DataFrame somewhat.
Two more things to add for efficiency:
Ensure that tradingsymbol is indexed in MySQL for faster lookups.
Make tradingsymbol an ENUM so that typos and the like are rejected. IN compares whole values exactly, so a misspelled symbol stored in the table would simply not match.
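The single-query approach can be sketched as follows. To keep the example self-contained it runs against an in-memory SQLite database from the standard library rather than MySQL, and the sample rows and extra symbols are invented for illustration. With mysql.connector the placeholder is %s instead of ?, and the same SQL and parameter list could be handed to pandas.read_sql via its params argument.

```python
import sqlite3
from collections import defaultdict

# Stand-in for the MySQL "orders" table (sample rows are invented).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, tradingsymbol TEXT, quantity INTEGER)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "TATACHEM", 10), (2, "INFY", 5), (3, "TATACHEM", 7), (4, "SBIN", 3)],
)
# An index on the filter column speeds up the lookup.
db.execute("CREATE INDEX idx_orders_symbol ON orders (tradingsymbol)")

symbols = ["TATACHEM", "INFY"]

# One placeholder per symbol; the values travel as bound parameters,
# so nothing is spliced into the SQL text by hand.
placeholders = ", ".join("?" for _ in symbols)
query = (
    "SELECT order_id, tradingsymbol, quantity FROM orders "
    f"WHERE tradingsymbol IN ({placeholders}) ORDER BY order_id"
)
rows = db.execute(query, symbols).fetchall()

# Split the combined result per symbol on the client side.
by_symbol = defaultdict(list)
for order_id, symbol, quantity in rows:
    by_symbol[symbol].append((order_id, quantity))

print(dict(by_symbol))
```

One round trip replaces 200 separate queries, and the per-symbol split happens on the client.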

from sql server to pandas dataframe with pyodbc - while working with small tables, it gives an error on complex sql queries

Step 1: Create a temporary table in SQL Server with pyodbc for the objects
Step 2: Select the objects from the temporary table and load them into a pandas DataFrame
Step 3: Print the DataFrame
To create the temporary table I work with a pyodbc cursor, because the pandas.read_sql command throws errors. However, it also throws an error if I try to convert the cursor result into a pandas DataFrame, even with the special line for turning tuples into DataFrames.
Below is my program to connect, create, read and print. It works as long as the query stays as simple as it is now (my actual approach has a few hundred lines of SQL).
import codecs
import os
import io
import pandas as pd
import pyodbc as po
server = 'sql_server'
database = 'sql_database'
connection = po.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';Trusted_Connection=yes;')
cursor = connection.cursor()
query1 = """
CREATE TABLE #ttobject (object_nr varchar(6), change_date datetime)
INSERT INTO #ttobject (object_nr)
VALUES
('112211'),
('113311'),
('114411');
"""
query2 = """
SELECT *
FROM #ttobject
Drop table if exists #ttobject
"""
cursor.execute(query1)
df = pd.read_sql_query(query2, connection)
print(df)
Because of the length of the actual query I will spare you the details and post only the error message:
('HY000', '[HY000] [Microsoft][ODBC SQL Server Driver]Connection is busy with results for another hstmt (0) (SQLExecDirectW)')
This error is thrown at query2, which in my real code is a multi-statement select with some joins and pivot functions.
When I try to put everything into one cursor, I run into problems converting the cursor result to a DataFrame (I tried several methods; maybe someone knows one that isn't on SO already, or that has an unusual title so I couldn't find it).
The same problem occurs if I use only pd.read_sql: then the creation of the temporary table does not work.
I don't know where to go from here.
Please let me know if I can provide further details that I may have overlooked.
23.5.19 Further investigation:
Following Gord's suggestion I set autocommit to True, which works for simple SQL statements but not for my really long and time-consuming one.
Secondly, I tried adding
cursor.execute('SET NOCOUNT ON; EXEC schema.proc @muted = 1')
At the moment my guess is that the first query takes longer, so Python already starts the second one and the connection is therefore blocked, or that the first query returns some feedback so Python thinks it has finished before it actually has.
I added a time.sleep(100) after executing the first query but still get the "connection is busy" hstmt error. I wonder why, because that should have been enough time to process the first query.
Fun fact: the query runs smoothly as long as I don't try to output any result from it.

Update MSSQL table through SQLAlchemy using dataframes

I'm trying to replace some old MSSQL stored procedures with python, in an attempt to take some of the heavy calculations off of the sql server. The part of the procedure I'm having issues replacing is as follows
UPDATE mytable
SET calc_value = tmp.calc_value
FROM dbo.mytable mytable INNER JOIN
#my_temp_table tmp ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
WHERE (mytable.a = some_value)
and (mytable.x = tmp.x)
and (mytable.b = some_other_value)
Up to this point, I've made some queries with SQLAlchemy, stored those data in Dataframes, and done the requisite calculations on them. I don't know now how to put the data back into the server using SQLAlchemy, either with raw SQL or function calls. The dataframe I have on my end would essentially have to work in the place of the temporary table created in MSSQL Server, but I'm not sure how I can do that.
The difficulty is of course that I don't know of a way to join between a dataframe and a mssql table, and I'm guessing this wouldn't work so I'm looking for a workaround

As the pandas docs suggest:
from sqlalchemy import create_engine
engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)
dataframe.to_sql('tablename', engine, if_exists='replace')
The engine argument for MSSQL is essentially the connection string.
The if_exists parameter is a bit tricky, since 'replace' drops the table first, then recreates it and inserts all the data at once.
Setting the echo attribute to True shows all the background logging and the SQL being emitted.
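Once the computed rows sit in a table on the server, the UPDATE from the question can join against that table instead of #my_temp_table. The sketch below shows the pattern with the standard-library sqlite3 module standing in for SQL Server and a list of tuples standing in for the DataFrame. The staging table name and the sample values are invented. With pandas, the staging step would be something like dataframe.to_sql('staging_calc', engine, if_exists='replace', index=False), and on SQL Server the update could keep the UPDATE ... FROM ... INNER JOIN form from the question.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (a INTEGER, b INTEGER, c INTEGER, calc_value REAL)")
con.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [(1, 1, 1, None), (1, 1, 2, None), (2, 1, 1, None)],
)

# Values computed in Python; these play the role of the DataFrame.
computed = [(1, 1, 1, 10.5), (1, 1, 2, 20.25)]

# Step 1: upload the computed rows to a staging table.
con.execute("CREATE TABLE staging_calc (a INTEGER, b INTEGER, c INTEGER, calc_value REAL)")
con.executemany("INSERT INTO staging_calc VALUES (?, ?, ?, ?)", computed)

# Step 2: update the target from the staging table, matching on the key columns.
# A correlated subquery is used because it is portable across SQLite versions.
con.execute(
    """
    UPDATE mytable
    SET calc_value = (
        SELECT s.calc_value FROM staging_calc AS s
        WHERE s.a = mytable.a AND s.b = mytable.b AND s.c = mytable.c
    )
    WHERE EXISTS (
        SELECT 1 FROM staging_calc AS s
        WHERE s.a = mytable.a AND s.b = mytable.b AND s.c = mytable.c
    )
    """
)

# Step 3: drop the staging table once the update has been applied.
con.execute("DROP TABLE staging_calc")
con.commit()

result = con.execute("SELECT a, b, c, calc_value FROM mytable ORDER BY a, b, c").fetchall()
print(result)
```

Rows without a match in the staging table keep their old value because of the WHERE EXISTS guard.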

Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC

I would like to send a large pandas.DataFrame to a remote server running MS SQL. The way I do it now is by converting a data_frame object to a list of tuples and then send it away with pyODBC's executemany() function. It goes something like this:
import pyodbc as pdb
list_of_tuples = convert_df(data_frame)
connection = pdb.connect(cnxn_str)
cursor = connection.cursor()
cursor.fast_executemany = True
cursor.executemany(sql_statement, list_of_tuples)
connection.commit()
cursor.close()
connection.close()
I then started to wonder whether things could be sped up (or at least made more readable) by using the data_frame.to_sql() method. I came up with the following solution:
import sqlalchemy as sa
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % cnxn_str)
data_frame.to_sql(table_name, engine, index=False)
Now the code is more readable, but the upload is at least 150 times slower...
Is there a way to flip the fast_executemany when using SQLAlchemy?
I am using pandas-0.20.3, pyODBC-4.0.21 and sqlalchemy-1.1.13.

EDIT (2019-03-08): Gord Thompson commented below with good news from the SQLAlchemy changelog: since SQLAlchemy 1.3.0, released 2019-03-04, SQLAlchemy supports engine = create_engine(sqlalchemy_url, fast_executemany=True) for the mssql+pyodbc dialect. In other words, it is no longer necessary to define a function and decorate it with @event.listens_for(engine, 'before_cursor_execute'). The function below can be removed and only the flag needs to be set in the create_engine statement, while still retaining the speed-up.
Original Post:
Just made an account to post this. I wanted to comment beneath the above thread, as it's a follow-up on the already provided answer. The solution above worked for me with the Version 17 SQL driver on a Microsoft SQL storage, writing from an Ubuntu-based install.
The complete code I used to speed things up significantly (talking >100x speed-up) is below. This is a turn-key snippet provided that you alter the connection string with your relevant details. To the poster above, thank you very much for the solution as I was looking quite some time for this already.
import pandas as pd
import numpy as np
import time
from sqlalchemy import create_engine, event
from urllib.parse import quote_plus
conn = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=IP_ADDRESS;DATABASE=DataLake;UID=USER;PWD=PASS"
quoted = quote_plus(conn)
new_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quoted)
engine = create_engine(new_con)
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    print("FUNC call")
    if executemany:
        cursor.fast_executemany = True
table_name = 'fast_executemany_test'
df = pd.DataFrame(np.random.random((10**4, 100)))
s = time.time()
df.to_sql(table_name, engine, if_exists = 'replace', chunksize = None)
print(time.time() - s)
Based on the comments below I wanted to take some time to explain some limitations about the pandas to_sql implementation and the way the query is handled. There are 2 things that might cause the MemoryError being raised afaik:
1) Assuming you're writing to a remote SQL storage. When you try to write a large pandas DataFrame with the to_sql method it converts the entire dataframe into a list of values. This transformation takes up way more RAM than the original DataFrame does (on top of it, as the old DataFrame still remains present in RAM). This list is provided to the final executemany call for your ODBC connector. I think the ODBC connector has some troubles handling such large queries. A way to solve this is to provide the to_sql method a chunksize argument (10**5 seems to be around optimal giving about 600 mbit/s (!) write speeds on a 2 CPU 7GB ram MSSQL Storage application from Azure - can't recommend Azure btw). So the first limitation, being the query size, can be circumvented by providing a chunksize argument. However, this won't enable you to write a dataframe the size of 10**7 or larger, (at least not on the VM I am working with which has ~55GB RAM), being issue nr 2.
This can be circumvented by breaking up the DataFrame into chunks of 10**6 rows (I used np.split), which can then be written iteratively. I will try to make a pull request when I have a solution ready for the to_sql method in the core of pandas itself, so you won't have to do this pre-splitting every time. Anyhow, I ended up writing a function similar (not turn-key) to the following:
import pandas as pd
def write_df_to_sql(df, chunk_rows=10**6, **kwargs):
    # Slice with iloc rather than np.split, which raises unless the rows divide evenly.
    # Pass if_exists='append' in kwargs so later chunks do not fail on the existing table.
    for start in range(0, len(df), chunk_rows):
        df.iloc[start:start + chunk_rows].to_sql(**kwargs)
    return True
A more complete example of the above snippet can be viewed here: https://gitlab.com/timelord/timelord/blob/master/timelord/utils/connector.py
It's a class I wrote that incorporates the patch and eases some of the necessary overhead that comes with setting up connections with SQL. Still have to write some documentation. Also I was planning on contributing the patch to pandas itself but haven't found a nice way yet on how to do so.
I hope this helps.

After contacting the developers of SQLAlchemy, a way to solve this problem has emerged. Many thanks to them for the great work!
One has to use a cursor execution event and check if the executemany flag has been raised. If that is indeed the case, switch the fast_executemany option on. For example:
from sqlalchemy import event
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
More information on execution events can be found here.
UPDATE: Support for fast_executemany of pyodbc was added in SQLAlchemy 1.3.0, so this hack is no longer necessary.

I ran into the same problem, but using PostgreSQL. pandas 0.24.0 has just been released, and there is a new parameter in the to_sql function called method, which solved my problem.
from sqlalchemy import create_engine
engine = create_engine(your_options)
data_frame.to_sql(table_name, engine, method="multi")
Upload speed is 100x faster for me.
I also recommend setting the chunksize parameter if you are going to send lots of data.

I just wanted to post this full example as an additional, high-performance option for those who can use the new turbodbc library: http://turbodbc.readthedocs.io/en/latest/
There clearly are many options in flux between pandas .to_sql(), triggering fast_executemany through sqlalchemy, using pyodbc directly with tuples/lists/etc., or even trying BULK UPLOAD with flat files.
Hopefully, the following might make life a bit more pleasant as functionality evolves in the current pandas project or includes something like turbodbc integration in the future.
import pandas as pd
import numpy as np
from turbodbc import connect, make_options
from io import StringIO
test_data = '''id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8'''
df_test = pd.read_csv(StringIO(test_data), sep=',')
df_test['transaction_dt'] = pd.to_datetime(df_test['transaction_dt'])
options = make_options(parameter_sets_to_buffer=1000)
conn = connect(driver='{SQL Server}', server='server_nm', database='db_nm', turbodbc_options=options)
test_query = '''DROP TABLE IF EXISTS [db_name].[schema].[test]
CREATE TABLE [db_name].[schema].[test]
(
id int NULL,
transaction_dt datetime NULL,
units int NULL,
measures float NULL
)
INSERT INTO [db_name].[schema].[test] (id,transaction_dt,units,measures)
VALUES (?,?,?,?) '''
cursor = conn.cursor()
cursor.executemanycolumns(test_query, [df_test['id'].values, df_test['transaction_dt'].values, df_test['units'].values, df_test['measures'].values])
conn.commit()  # turbodbc does not autocommit by default
turbodbc should be VERY fast in many use cases (particularly with numpy arrays). Please observe how straightforward it is to pass the underlying numpy arrays from the dataframe columns as parameters to the query directly. I also believe this helps prevent the creation of intermediate objects that spike memory consumption excessively. Hope this is helpful!

It seems that Pandas 0.23.0 and 0.24.0 use multi values inserts with PyODBC, which prevents fast executemany from helping – a single INSERT ... VALUES ... statement is emitted per chunk. The multi values insert chunks are an improvement over the old slow executemany default, but at least in simple tests the fast executemany method still prevails, not to mention no need for manual chunksize calculations, as is required with multi values inserts. Forcing the old behaviour can be done by monkeypatching, if no configuration option is provided in the future:
import pandas.io.sql
def insert_statement(self, data, conn):
    return self.table.insert(), data
pandas.io.sql.SQLTable.insert_statement = insert_statement
The future is here and at least in the master branch the insert method can be controlled using the keyword argument method= of to_sql(). It defaults to None, which forces the executemany method. Passing method='multi' results in using the multi values insert. It can even be used to implement DBMS specific approaches, such as Postgresql COPY.
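A callable passed as method= receives the pandas table object, a connection, the column names and an iterator over the rows of one chunk. The following is a minimal sketch of such a callable, assuming pandas is installed and using the standard-library sqlite3 module so that it runs without a server. The function name insert_with_executemany and the demo table are hypothetical. With a plain sqlite3 connection pandas hands the callable a DB-API cursor; with a SQLAlchemy engine it would be a SQLAlchemy connection, so the body would differ there.

```python
import sqlite3

import pandas as pd

calls = []  # records the size of each chunk the callable receives


def insert_with_executemany(table, conn, keys, data_iter):
    """Custom insertion callable with the signature pandas expects for method=."""
    rows = list(data_iter)
    columns = ", ".join(f'"{k}"' for k in keys)
    placeholders = ", ".join("?" for _ in keys)
    sql = f'INSERT INTO "{table.name}" ({columns}) VALUES ({placeholders})'
    # With a plain sqlite3 connection, pandas passes a DB-API cursor as conn.
    conn.executemany(sql, rows)
    calls.append(len(rows))
    return len(rows)


con = sqlite3.connect(":memory:")
df = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})
df.to_sql("demo", con, index=False, chunksize=2, method=insert_with_executemany)

row_count = con.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(calls, row_count)
```

The callable is invoked once per chunk, so chunksize still controls how many rows each call sees.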

As pointed out by @Pylander
Turbodbc is the best choice for data ingestion, by far!
I got so excited about it that I wrote a 'blog' on it on my github and medium:
please check https://medium.com/@erickfis/etl-process-with-turbodbc-1d19ed71510e
for a working example and comparison with pandas.to_sql
Long story short,
with turbodbc
I've got 10000 lines (77 columns) in 3 seconds
with pandas.to_sql
I've got the same 10000 lines (77 columns) in 198 seconds...
And here is what I'm doing in full detail
The imports:
import sqlalchemy
import pandas as pd
import numpy as np
import turbodbc
import time
Load and treat some data - Substitute my sample.pkl for yours:
df = pd.read_pickle('sample.pkl')
df.columns = df.columns.str.strip() # remove white spaces around column names
df = df.applymap(str.strip) # remove white spaces around values
df = df.replace('', np.nan) # map nans, to drop NAs rows and columns later
df = df.dropna(how='all', axis=0) # remove rows containing only NAs
df = df.dropna(how='all', axis=1) # remove columns containing only NAs
df = df.replace(np.nan, 'NA') # turbodbc hates null values...
Create the table using sqlAlchemy
Unfortunately, turbodbc requires a lot of overhead with a lot of sql manual labor, for creating the tables and for inserting data on it.
Fortunately, Python is pure joy and we can automate this process of writing sql code.
The first step is creating the table which will receive our data. However, creating the table manually writing sql code can be problematic if your table has more than a few columns. In my case, very often the tables have 240 columns!
This is where sqlAlchemy and pandas still can help us: pandas is bad for writing a large number of rows (10000 in this example), but what about just 5 rows, the head of the table? This way, we automate the process of creating the tables.
Create sqlAlchemy connection:
mydb = 'someDB'
def make_con(db):
    """Connect to a specified db."""
    database_connection = sqlalchemy.create_engine(
        'mssql+pymssql://{0}:{1}@{2}/{3}'.format(
            myuser, mypassword,
            myhost, db
        )
    )
    return database_connection
pd_connection = make_con(mydb)
Create table on SQL Server
Using pandas + sqlAlchemy, but just for preparing room for turbodbc as previously mentioned. Please note the df.head() here: we are using pandas + sqlAlchemy for inserting only 5 rows of our data. This will run pretty fast and is being done to automate the table creation.
table = 'testing'
df.head().to_sql(table, con=pd_connection, index=False)
Now that the table is already in place, let’s get serious here.
Turbodbc connection:
def turbo_conn(mydb):
    """Connect to a specified db - turbo."""
    database_connection = turbodbc.connect(
        driver='ODBC Driver 17 for SQL Server',
        server=myhost,
        database=mydb,
        uid=myuser,
        pwd=mypassword
    )
    return database_connection
Preparing the SQL commands and data for turbodbc. Let’s automate this code generation:
def turbo_write(mydb, df, table):
    """Use turbodbc to insert data into sql."""
    start = time.time()
    # open the turbodbc connection used by the cursors below
    connection = turbo_conn(mydb)
    # preparing columns
    colunas = '('
    colunas += ', '.join(df.columns)
    colunas += ')'
    # preparing value place holders
    val_place_holder = ['?' for col in df.columns]
    sql_val = '('
    sql_val += ', '.join(val_place_holder)
    sql_val += ')'
    # writing sql query for turbodbc
    sql = f"""
    INSERT INTO {mydb}.dbo.{table} {colunas}
    VALUES {sql_val}
    """
    # writing array of values for turbodbc
    valores_df = [df[col].values for col in df.columns]
    # cleans the previous head insert
    with connection.cursor() as cursor:
        cursor.execute(f"delete from {mydb}.dbo.{table}")
        connection.commit()
    # inserts data, for real
    with connection.cursor() as cursor:
        try:
            cursor.executemanycolumns(sql, valores_df)
            connection.commit()
        except Exception:
            connection.rollback()
            print('something went wrong')
    stop = time.time() - start
    return print(f'finished in {stop} seconds')
Writing data using turbodbc - I’ve got 10000 lines (77 columns) in 3 seconds:
turbo_write(mydb, df.sample(10000), table)
Pandas method comparison - I’ve got the same 10000 lines (77 columns) in 198 seconds…
table = 'pd_testing'
def pandas_comparisson(df, table):
    """Load data using pandas."""
    start = time.time()
    df.to_sql(table, con=pd_connection, index=False)
    stop = time.time() - start
    return print(f'finished in {stop} seconds')
pandas_comparisson(df.sample(10000), table)
Environment and conditions
Python 3.6.7 :: Anaconda, Inc.
TURBODBC version ‘3.0.0’
sqlAlchemy version ‘1.2.12’
pandas version ‘0.23.4’
Microsoft SQL Server 2014
user with bulk operations privileges
Please check https://erickfis.github.io/loose-code/ for updates in this code!

SQL Server INSERT performance: pyodbc vs. turbodbc
When using to_sql to upload a pandas DataFrame to SQL Server, turbodbc will definitely be faster than pyodbc without fast_executemany. However, with fast_executemany enabled for pyodbc, both approaches yield essentially the same performance.
Test environments:
[venv1_pyodbc]
pyodbc 2.0.25
[venv2_turbodbc]
turbodbc 3.0.0
sqlalchemy-turbodbc 0.1.0
[common to both]
Python 3.6.4 64-bit on Windows
SQLAlchemy 1.3.0b1
pandas 0.23.4
numpy 1.15.4
Test code:
import time
import pandas as pd
from sqlalchemy import create_engine
# for pyodbc
engine = create_engine('mssql+pyodbc://sa:whatever@SQL_panorama', fast_executemany=True)
# for turbodbc
# engine = create_engine('mssql+turbodbc://sa:whatever@SQL_panorama')
# test data
num_rows = 10000
num_cols = 100
df = pd.DataFrame(
    [[f'row{x:04}col{y:03}' for y in range(num_cols)] for x in range(num_rows)],
    columns=[f'col{y:03}' for y in range(num_cols)]
)
t0 = time.time()
df.to_sql("sqlalchemy_test", engine, if_exists='replace', index=None)
print(f"pandas wrote {num_rows} rows in {(time.time() - t0):0.1f} seconds")
Tests were run twelve (12) times for each environment, discarding the single best and worst times for each. Results (in seconds):
rank pyodbc turbodbc
---- ------ --------
1 22.8 27.5
2 23.4 28.1
3 24.6 28.2
4 25.2 28.5
5 25.7 29.3
6 26.9 29.9
7 27.0 31.4
8 30.1 32.1
9 33.6 32.5
10 39.8 32.9
---- ------ --------
average 27.9 30.0
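The averages in the last row can be re-derived from the ten retained timings per driver. The short check below also computes the spread between the fastest and slowest retained run, which the table does not list. The helper name summarize is only illustrative.

```python
from statistics import mean

# The ten retained timings per driver, in seconds, copied from the table above.
pyodbc_times = [22.8, 23.4, 24.6, 25.2, 25.7, 26.9, 27.0, 30.1, 33.6, 39.8]
turbodbc_times = [27.5, 28.1, 28.2, 28.5, 29.3, 29.9, 31.4, 32.1, 32.5, 32.9]


def summarize(times):
    """Return (mean, spread) rounded to one decimal place."""
    return round(mean(times), 1), round(max(times) - min(times), 1)


print("pyodbc  ", summarize(pyodbc_times))    # (27.9, 17.0)
print("turbodbc", summarize(turbodbc_times))  # (30.0, 5.4)
```

In this sample pyodbc has the lower mean but a much wider spread. That fits the conclusion that the two approaches perform essentially the same.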

Just wanted to add to @J.K.'s answer.
If you are using this approach:
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
And you are getting this error:
"sqlalchemy.exc.DBAPIError: (pyodbc.Error) ('HY010', '[HY010]
[Microsoft][SQL Server Native Client 11.0]Function sequence error (0)
(SQLParamData)') [SQL: 'INSERT INTO ... (...) VALUES (?, ?)']
[parameters: ((..., ...), (..., ...)] (Background on this error at:
http://sqlalche.me/e/dbapi)"
Encode your string values like this: 'yourStringValue'.encode('ascii')
This will solve your problem.

I only modified the engine line, which sped up the insertion about 100 times.
Old Code -
import json
import maya
import time
import pandas
import pyodbc
import pandas as pd
from sqlalchemy import create_engine
retry_count = 0
retry_flag = True
hostInfoDf = pandas.read_excel('test.xlsx', sheet_name='test')
print("Read Ok")
engine = create_engine("mssql+pyodbc://server_name/db_name?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server")
while retry_flag and retry_count < 5:
    try:
        hostInfoDf.to_sql("table_name", con=engine, if_exists="replace", index=False, chunksize=5000, schema="dbo")
        retry_flag = False
    except Exception:
        # wait before retrying; give up after five attempts
        retry_count = retry_count + 1
        time.sleep(30)
Modified engine line -
From -
engine = create_engine("mssql+pyodbc://server_name/db_name?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server")
to -
engine = create_engine("mssql+pyodbc://server_name/db_name?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server", fast_executemany=True)
Feel free to ask me anything about Python-to-SQL connectivity; I will be happy to help.

SQLAlchemy/pandas to_sql for SQLServer -- CREATE TABLE in master db

Using MSSQL (version 2012), I am using SQLAlchemy and pandas (on Python 2.7) to insert rows into a SQL Server table.
After trying pymssql and pyodbc with a specific server string, I am trying an odbc name:
import sqlalchemy, pyodbc, pandas as pd
engine = sqlalchemy.create_engine("mssql+pyodbc://mssqlodbc")
sqlstring = "EXEC getfoo"
dbdataframe = pd.read_sql(sqlstring, engine)
This part works great and worked with the other methods (pymssql, etc). However, the pandas to_sql method doesn't work.
finaloutput.to_sql("MyDB.dbo.Loader_foo",engine,if_exists="append",chunksize="10000")
With this statement, I consistently get an error saying that pandas is trying to do a CREATE TABLE in the SQL Server master db, for which it does not have permission.
How do I get pandas/SQLAlchemy/pyodbc to point to the correct MSSQL database? The to_sql method seems to ignore whatever I put in the engine connection string (although the read_sql method seems to pick it up just fine).

To mark this question as answered: the problem is that you specify the schema in the table name itself. If you provide "MyDB.dbo.Loader_foo" as the table name, pandas will interpret the full string as the table name, instead of just "Loader_foo".
The solution is to provide only "Loader_foo" as the table name. If you need to write the table into a specific schema, you can use the schema kwarg (see docs):
finaloutput.to_sql("Loader_foo", engine, if_exists="append")
finaloutput.to_sql("Loader_foo", engine, if_exists="append", schema="something_else_as_dbo")
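If qualified names such as "MyDB.dbo.Loader_foo" arrive from configuration, they can be split before calling to_sql. The helper below is a hypothetical sketch, not part of pandas. It treats everything before the last dot as the schema part. SQLAlchemy's SQL Server dialect documents multipart schema names of the form "database.owner", but whether that suits a given setup is an assumption to verify.

```python
def split_qualified_name(qualified):
    """Split 'db.schema.table' or 'schema.table' into (schema, table).

    The schema part is returned as None when the name contains no dot.
    Bracket-quoted names that themselves contain dots are not handled.
    """
    schema, _, table = qualified.rpartition(".")
    return (schema or None), table


print(split_qualified_name("MyDB.dbo.Loader_foo"))  # ('MyDB.dbo', 'Loader_foo')
print(split_qualified_name("Loader_foo"))           # (None, 'Loader_foo')
```

The two parts would then go into finaloutput.to_sql(table, engine, schema=schema, if_exists="append").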
