Update MSSQL table through SQLAlchemy using dataframes - python

I'm trying to replace some old MSSQL stored procedures with python, in an attempt to take some of the heavy calculations off of the sql server. The part of the procedure I'm having issues replacing is as follows
UPDATE mytable
SET calc_value = tmp.calc_value
FROM dbo.mytable mytable INNER JOIN
#my_temp_table tmp ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
WHERE (mytable.a = some_value)
and (mytable.x = tmp.x)
and (mytable.b = some_other_value)
Up to this point, I've made some queries with SQLAlchemy, stored those data in Dataframes, and done the requisite calculations on them. I don't know now how to put the data back into the server using SQLAlchemy, either with raw SQL or function calls. The dataframe I have on my end would essentially have to work in the place of the temporary table created in MSSQL Server, but I'm not sure how I can do that.
The difficulty is of course that I don't know of a way to join between a dataframe and a mssql table, and I'm guessing this wouldn't work so I'm looking for a workaround

As the pandas doc suggests here :
from sqlalchemy import create_engine
engine = create_engine("mssql+pyodbc://user:password#DSN", echo = False)
dataframe.to_sql('tablename', engine , if_exists = 'replace')
engine parameter for msSql is basically the connection string check it here
if_exist parameter is a but tricky since 'replace' actually drops the table first and then recreates it and then inserts all data at once.
by setting the echo attribute to True it shows all background logs and sql's.

Related

Spark 3.0 and Cassandra Spark / Python Conenctors: Table is not being created prior to write

I'm currently trying to upgrade my application to Spark 3.0.1. For table creation, I drop and create a table using cassandra-driver, the Python-Cassandra connector. Then I write a dataframe into the table using the spark-cassandra connector. There isn't really a good alternative to create and drop the table using only the spark-cassandra connector.
With Spark 2.4, there were no issues with the drop-create-write flow. But with Spark 3.0, the application seems to do these things in no particular order, often trying to write before dropping and creating. I have no clue how to ensure dropping and creating the table happens first. I know the drop and create does happen even while the application errors out on write, because when I query Cassandra via cqlsh I can see the table being dropped and re-created. Any ideas about this behavior in Spark 3.0?
Note: because the schema changes, this particular table needs to be dropped and recreated instead of a straight overwrite.
A code snippet as requested:
session = self._get_python_cassandra_session(self.env_conf, self.database)
# build drop table query
drop_table_query = 'DROP TABLE IF EXISTS {}.{}'.format(self.database, tablename)
session.execute(drop_table_query)
df, table_columns, table_keys = self._create_table_metadata(df, keys=keys)
# build create query
create_table_query = 'CREATE TABLE IF NOT EXISTS {}.{} ({} PRIMARY KEY({}), );'.format(self.database, tablename, table_columns, table_keys)
# execute table creation
session.execute(create_table_query)
session.shutdown()
# spark-cassandra connection options
copts = _cassandra_cluster_spark_options(self.env_conf)
# set write mode
copts['confirm.truncate'] = overwrite
mode = 'overwrite' if overwrite else 'append'
# write dataframe to cassandra
get_dataframe_writer(df, 'cassandra', keyspace=self.database,
table=tablename, mode=mode, copts=copts).save()
I ended up building a time.sleep(5) delay with 100 second timeout to periodically ping Cassandra for the table, and then writing if the table was found.
In the Spark Cassandra Connector 3.0+ you can use new functionality - manipulating the keyspaces & tables via Catalogs API. You can create/alter/drop keyspaces & tables using the Spark SQL. For example, you can create a table in Cassandra with following command:
CREATE TABLE casscatalog.ksname.table_name (
key_1 Int,
key_2 Int,
key_3 Int,
cc1 STRING,
cc2 String,
cc3 String,
value String)
USING cassandra
PARTITIONED BY (key_1, key_2, key_3)
TBLPROPERTIES (
clustering_key='cc1.asc, cc2.desc, cc3.asc',
compaction='{class=SizeTieredCompactionStrategy,bucket_high=1001}'
)
As you can see here, you can specify quite complex primary keys, and also specify table options. The casscatalog piece is a prefix that links to the specific Cassandra cluster (you can use multiple at the same time) - it's specified when you're starting Spark job, like:
spark-shell --packages com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.0.0 \
--conf spark.sql.catalog.casscatalog=com.datastax.spark.connector.datasource.CassandraCatalog
More examples could be found in the documentation:

How to convert select_from object into a new table in sqlalchemy

I have a database that contains two tables in the data, cdr and mtr. I want a join of the two based on columns ego_id and alter_id, and I want to output this into another table in the same database, complete with the column names, without the use of pandas.
Here's my current code:
mtr_table = Table('mtr', MetaData(), autoload=True, autoload_with=engine)
print(mtr_table.columns.keys())
cdr_table = Table('cdr', MetaData(), autoload=True, autoload_with=engine)
print(cdr_table.columns.keys())
query = db.select([cdr_table])
query = query.select_from(mtr_table.join(cdr_table,
((mtr_table.columns.ego_id == cdr_table.columns.ego_id) &
(mtr_table.columns.alter_id == cdr_table.columns.alter_id))),
)
results = connection.execute(query).fetchmany()
Currently, for my test code, what I do is to convert the results as a pandas dataframe and then put it back in the original SQL database:
df = pd.DataFrame(results, columns=results[0].keys())
df.to_sql(...)
but I have two problems:
loading everything into a pandas dataframe would require too much memory when I start working with the full database
the columns names are (apparently) not included in results and would need to be accessed by results[0].keys()
I've checked this other stackoverflow question but it uses the ORM framework of sqlalchemy, which I unfortunately don't understand. If there's a simpler way to do this (like pandas' to_sql), I think this would be easier.
What's the easiest way to go about this?
So I found out how to do this via CREATE TABLE AS:
query = """
CREATE TABLE mtr_cdr AS
SELECT
mtr.idx,cdr.*
FROM mtr INNER JOIN cdr
ON (mtr.ego_id = cdr.ego_id AND mtr.alter_id = cdr.alter_id)""".format(new_table)
with engine.connect() as conn:
conn.execute(query)
The query string seems to be highly sensitive to parentheses though. If I put a parentheses enclosing the whole SELECT...FROM... statement, it doesn't work.

from sql server to pandas dataframe with pyodbc - while working with small tables, it gives an error on complex sql queries

1 step: Create a temporary table with pyodbc into sql server for objects
2 step: Select objects from temporary table and load it into pandas dataframe
3 step: print dataframe
for creating a temporary table i work with pyodbc cursor as it trohws errors with pandas.read_sql command. wheras it trohws an error if i try to convert the cursor into a pandas dataframe. even with the special line for handling tuples into dataframes.
my program to connect, create, read and print which works as long as the query stays simple as it is now. (my actual approach has a few hundred lines of sql query statement)
import codecs
import os
import io
import pandas as pd
import pyodbc as po
server = 'sql_server'
database = 'sql_database'
connection = po.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';Trusted_Connection=yes;')
cursor = connection.cursor()
query1 = """
CREATE TABLE #ttobject (object_nr varchar(6), change_date datetime)
INSERT INTO #ttobject (object_nr)
VALUES
('112211'),
('113311'),
('114411');
"""
query2 = """
SELECT *
FROM #ttobject
Drop table if exists #ttobject
"""
cursor.execute(query1)
df = pd.read_sql_query(query2, connection)
print(df)
Because of the lenght of the actually query i save you the trouble but instead post here the error code:
('HY000', '[HY000] [Microsoft][ODBC SQL Server Driver]Connection is busy with results for another hstmt (0) (SQLExecDirectW)')
This error gets thrown at query2 which is a multiple select statement with some joins and pivote functions
When I'm trying to put everything into one cursor i got issues with converting it from cursor to DataFrame (tried several methodes, maybe someone knows one which isn't on SO already or has a special title so i couldn't find it)
same problem if I'm trying to only use pd.read_sql then the creation of the temporary table is not working
I don't know where to go on from here.
Please let me know if i can assist you with further details which i may overwatched in accordance to my lostlyness :S
23.5.19 Further investigating:
According to Gord i tried to add autocommit to true which will work
for simple sql statements but not for my really long and
timeconsuming one.
Secondly i tried to add
"cursor.execute('SET NOCOUNT ON; EXEC schema.proc #muted = 1')
At the moment i guess that the first query takes longer so python already starting with the second and therefore the connection is
blocked. Or that the first query is returing some feedback so python
thinks it is finished before it actually is.
Added a time.sleep(100) after ececution of first query but still getting the hstmt is busy error. Wondering why this is becaus it should have had enough time to process the first
Funfact: The query is running smoothly as long as I'm not trying to output any result from it

Oracle DB table not updating after running merge query in SQLAlchemy [duplicate]

I'm attempting to use python with sqlalchemy to download some data, create a temporary staging table on a Teradata Server, then MERGEing that table into another table which I've created to permanently store this data. I'm using sql = slqalchemy.text(merge) and td_engine.execute(sql) where merge is a string similar to the below:
MERGE INTO perm_table as p
USING temp_table as t
ON p.Id = t.Id
WHEN MATCHED THEN
UPDATE
SET col1 = t.col1,
col2 = t.col2,
...
col50 = t.col50
WHEN NOT MATCHED THEN
INSERT (col1,
col2,
...
col50)
VALUES (t.col1,
t.col2,
...
t.col50)
The script runs all the way to the end without error and the SQL executes properly through Teradata Studio, but for some reason the table won't update when I execute it through SQLAlchemy. However, I've also run different SQL expressions, like the insert that populated perm_table from the same python script and it worked fine. Maybe there's something specific to the MERGE and SQLAlchemy combo?
Since you're using the engine directly, without using a transaction, you're probably (barring unseen configuration on your part) relying on SQLAlchemy's version of autocommit, which works by detecting data changing operations such as INSERTs etc. Possibly MERGE is not one of the detected operations. Try
sql = sqlalchemy.text(merge).execution_options(autocommit=True)
td_engine.execute(sql)

Speeding up Pandas to_sql()?

I have a 1,000,000 x 50 Pandas DataFrame that I am currently writing to a SQL table using:
df.to_sql('my_table', con, index=False)
It takes an incredibly long time. I've seen various explanations about how to speed up this process online, but none of them seem to work for MSSQL.
If I try the method in:
Bulk Insert A Pandas DataFrame Using SQLAlchemy
then I get a no attribute copy_from error.
If I try the multithreading method from:
http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html
then I get a QueuePool limit of size 5 overflow 10 reach, connection timed out error.
Is there any easy way to speed up to_sql() to an MSSQL table? Either via BULK COPY or some other method, but entirely from within Python code?
I've used ctds to do a bulk insert that's a lot faster with SQL server. In example below, df is the pandas DataFrame. The column sequence in the DataFrame is identical to the schema for mydb.
import ctds
conn = ctds.connect('server', user='user', password='password', database='mydb')
conn.bulk_insert('table', (df.to_records(index=False).tolist()))
in pandas 0.24 you can use method ='multi' with chunk size of 1000 which is the sql server limit
chunksize=1000, method='multi'
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method
New in version 0.24.0.
The parameter method controls the SQL insertion clause used. Possible values are:
None: Uses standard SQL INSERT clause (one per row).
'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backend if the table contains many columns. For more information check the SQLAlchemy documention.
even I had the same issue so I applied sqlalchemy with fast execute many.
from sqlalchemy import event, create_engine
engine = create_egine('connection_string_with_database')
#event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context,executemany):
if executemany:
cursor.fast_executemany = True # replace from execute many to fast_executemany.
cursor.commit()
always make sure that the given function should be present after the engine variable and before cursor execute.
conn = engine.execute()
df.to_sql('table', con=conn, if_exists='append', index=False) # for reference go to the pandas to_sql documentation.

Categories