I'm attempting to use Python with SQLAlchemy to download some data, create a temporary staging table on a Teradata server, and then MERGE that table into another table which I've created to permanently store this data. I'm using sql = sqlalchemy.text(merge) and td_engine.execute(sql), where merge is a string similar to the one below:
MERGE INTO perm_table as p
USING temp_table as t
ON p.Id = t.Id
WHEN MATCHED THEN
UPDATE
SET col1 = t.col1,
col2 = t.col2,
...
col50 = t.col50
WHEN NOT MATCHED THEN
INSERT (col1,
col2,
...
col50)
VALUES (t.col1,
t.col2,
...
t.col50)
The script runs all the way to the end without error and the SQL executes properly through Teradata Studio, but for some reason the table won't update when I execute it through SQLAlchemy. However, I've also run different SQL expressions, like the insert that populated perm_table from the same python script and it worked fine. Maybe there's something specific to the MERGE and SQLAlchemy combo?
Since you're using the engine directly, without using a transaction, you're probably (barring unseen configuration on your part) relying on SQLAlchemy's version of autocommit, which works by detecting data changing operations such as INSERTs etc. Possibly MERGE is not one of the detected operations. Try
sql = sqlalchemy.text(merge).execution_options(autocommit=True)
td_engine.execute(sql)
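If that doesn't help, running the MERGE inside an explicit transaction should also force a commit. A minimal sketch, assuming a placeholder connection URL (merge is the same MERGE string built earlier in your script):
import sqlalchemy

td_engine = sqlalchemy.create_engine("teradatasql://user:password@host")  # placeholder URL
sql = sqlalchemy.text(merge)  # merge is the MERGE statement string from above

# engine.begin() opens a transaction and commits it when the block exits without error
with td_engine.begin() as conn:
    conn.execute(sql)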
I'm currently trying to upgrade my application to Spark 3.0.1. For table creation, I drop and create a table using cassandra-driver, the Python Cassandra driver. Then I write a dataframe into the table using the spark-cassandra connector. There isn't really a good way to create and drop the table using only the spark-cassandra connector.
With Spark 2.4, there were no issues with the drop-create-write flow. But with Spark 3.0, the application seems to do these things in no particular order, often trying to write before dropping and creating. I have no clue how to ensure that dropping and creating the table happens first. I know the drop and create do happen even while the application errors out on the write, because when I query Cassandra via cqlsh I can see the table being dropped and re-created. Any ideas about this behavior in Spark 3.0?
Note: because the schema changes, this particular table needs to be dropped and recreated instead of a straight overwrite.
A code snippet as requested:
session = self._get_python_cassandra_session(self.env_conf, self.database)
# build drop table query
drop_table_query = 'DROP TABLE IF EXISTS {}.{}'.format(self.database, tablename)
session.execute(drop_table_query)
df, table_columns, table_keys = self._create_table_metadata(df, keys=keys)
# build create query
create_table_query = 'CREATE TABLE IF NOT EXISTS {}.{} ({} PRIMARY KEY({}), );'.format(self.database, tablename, table_columns, table_keys)
# execute table creation
session.execute(create_table_query)
session.shutdown()
# spark-cassandra connection options
copts = _cassandra_cluster_spark_options(self.env_conf)
# set write mode
copts['confirm.truncate'] = overwrite
mode = 'overwrite' if overwrite else 'append'
# write dataframe to cassandra
get_dataframe_writer(df, 'cassandra', keyspace=self.database,
table=tablename, mode=mode, copts=copts).save()
I ended up building in a polling loop: a time.sleep(5) delay with a 100-second timeout that periodically pings Cassandra for the table, and the write only happens once the table is found.
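Roughly, the check looks like the sketch below; the system_schema query is how I'd probe for the table with cassandra-driver, and the names are illustrative. I call it right after session.execute(create_table_query), before session.shutdown() and the Spark write:
import time

def wait_for_table(session, keyspace, tablename, timeout=100, interval=5):
    # Poll system_schema.tables until the table shows up or the timeout expires
    deadline = time.time() + timeout
    query = ("SELECT table_name FROM system_schema.tables "
             "WHERE keyspace_name=%s AND table_name=%s")
    while time.time() < deadline:
        if session.execute(query, (keyspace, tablename)).one() is not None:
            return
        time.sleep(interval)
    raise TimeoutError("table {}.{} not visible after {}s".format(keyspace, tablename, timeout))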
In Spark Cassandra Connector 3.0+ you can use new functionality: manipulating keyspaces & tables via the Catalogs API. You can create/alter/drop keyspaces & tables using Spark SQL. For example, you can create a table in Cassandra with the following command:
CREATE TABLE casscatalog.ksname.table_name (
  key_1 Int,
  key_2 Int,
  key_3 Int,
  cc1 String,
  cc2 String,
  cc3 String,
  value String)
USING cassandra
PARTITIONED BY (key_1, key_2, key_3)
TBLPROPERTIES (
  clustering_key='cc1.asc, cc2.desc, cc3.asc',
  compaction='{class=SizeTieredCompactionStrategy,bucket_high=1001}'
)
As you can see here, you can specify quite complex primary keys and also set table options. The casscatalog piece is a catalog prefix that links to a specific Cassandra cluster (you can use multiple at the same time); it's specified when you're starting the Spark job, like:
spark-shell --packages com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.0.0 \
--conf spark.sql.catalog.casscatalog=com.datastax.spark.connector.datasource.CassandraCatalog
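Since the question is about controlling the drop → create → write order from Python, the same flow through the catalog might look roughly like this from PySpark (the catalog name, keyspace, table, and columns are the placeholders from above, not a tested recipe):
# assumes the Spark session was started with the connector package and
# spark.sql.catalog.casscatalog configured as in the spark-shell example
spark.sql("DROP TABLE IF EXISTS casscatalog.ksname.table_name")

spark.sql("""
    CREATE TABLE casscatalog.ksname.table_name (
        key_1 Int, key_2 Int, cc1 String, value String)
    USING cassandra
    PARTITIONED BY (key_1, key_2)
    TBLPROPERTIES (clustering_key='cc1.asc')
""")

# the write now happens in the same driver program, strictly after drop/create
df.writeTo("casscatalog.ksname.table_name").append()
Because all three steps go through the same SparkSession, there is no separate cassandra-driver session whose ordering you need to coordinate with the Spark write.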
More examples can be found in the documentation.
I'm trying to replace some old MSSQL stored procedures with Python, in an attempt to take some of the heavy calculations off of the SQL server. The part of the procedure I'm having issues replacing is as follows:
UPDATE mytable
SET calc_value = tmp.calc_value
FROM dbo.mytable mytable INNER JOIN
#my_temp_table tmp ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
WHERE (mytable.a = some_value)
and (mytable.x = tmp.x)
and (mytable.b = some_other_value)
Up to this point, I've made some queries with SQLAlchemy, stored that data in DataFrames, and done the requisite calculations on them. I don't know how to put the data back into the server using SQLAlchemy, either with raw SQL or function calls. The dataframe I have on my end would essentially have to take the place of the temporary table created in MSSQL Server, but I'm not sure how I can do that.
The difficulty, of course, is that I don't know of a way to join a dataframe against an MSSQL table, and I'm guessing that won't work directly, so I'm looking for a workaround.
As the pandas docs suggest:
from sqlalchemy import create_engine
engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)
dataframe.to_sql('tablename', engine, if_exists='replace')
The engine parameter for MSSQL is basically the connection string; see the SQLAlchemy documentation for the exact format.
The if_exists parameter is a bit tricky, since 'replace' actually drops the table first, recreates it, and then inserts all the data at once.
Setting the echo attribute to True makes SQLAlchemy show all of its background logging and the SQL it emits.
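To stand in for the temp table from the stored procedure, one option is to push the calculated dataframe into a staging table with to_sql and then run the original UPDATE ... JOIN against it. A rough sketch; my_temp_table is a hypothetical staging table name and the parameter values are illustrative:
from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)

# write the calculated dataframe into a staging table on the server
dataframe.to_sql('my_temp_table', engine, if_exists='replace', index=False)

# reuse the stored procedure's UPDATE ... JOIN, pointed at the staging table
update_sql = text("""
    UPDATE mytable
    SET calc_value = tmp.calc_value
    FROM dbo.mytable mytable
    INNER JOIN my_temp_table tmp
        ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
    WHERE mytable.a = :some_value
      AND mytable.x = tmp.x
      AND mytable.b = :some_other_value
""")

with engine.begin() as conn:
    conn.execute(update_sql, {"some_value": 1, "some_other_value": 2})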
I am using the below Python code to update a Postgres DB column value based on id. This loop has to run for thousands of records and it is taking a long time.
Is there a way I can pass an array of dataframe values instead of looping over each row?
for i in range(0, len(df)):
    QUERY = """ UPDATE "Table" SET "value"='%s' WHERE "Table"."id"='%s'
    """ % (df['value'][i], df['id'][i])
    cur.execute(QUERY)
conn.commit()
It depends on the library you use to communicate with PostgreSQL, but bulk loads are usually much faster via the COPY FROM command.
If you use psycopg2 it is as simple as following:
cursor.copy_from(io.StringIO(string_variable), "destination_table", columns=('id', 'value'))
Where string_variable is a tab- and newline-delimited dataset like 1\tvalue1\n2\tvalue2\n.
To achieve a performant bulk update I would (see the sketch after this list):
Create a temporary table with the columns being updated, e.g. CREATE TEMPORARY TABLE tmp_table (id INT, value TEXT);
Insert the records with copy_from;
Update the destination table in a single statement: UPDATE destination_table SET value = t.value FROM tmp_table t WHERE destination_table.id = t.id, or any other preferred syntax.
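A minimal sketch of the whole flow with psycopg2, assuming id is an integer and value is text (adjust the types, the destination table name, and the connection details to your schema):
import io
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection string
cur = conn.cursor()

# 1. temporary table matching the columns being updated
cur.execute("CREATE TEMPORARY TABLE tmp_table (id INT, value TEXT)")

# 2. bulk-load the dataframe as tab/newline delimited text via COPY
#    (values must not contain tabs or newlines, or they need escaping)
buffer = io.StringIO()
for row_id, row_value in zip(df['id'], df['value']):
    buffer.write("{}\t{}\n".format(row_id, row_value))
buffer.seek(0)
cur.copy_from(buffer, 'tmp_table', columns=('id', 'value'))

# 3. one set-based update against the destination table
cur.execute('UPDATE "Table" SET "value" = t.value FROM tmp_table t WHERE "Table"."id" = t.id')
conn.commit()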
I'm using alembic to manage my database migrations. In my current migration I also need to populate a column based on a SELECT statement (basically copying a column from a different table).
With plain SQL I can do:
UPDATE foo_table
SET bar_id=
(SELECT bar_table.id FROM bar_table
WHERE bar_table.foo_id = foo_table.id);
However, I can't figure out how to do that with alembic:
execute(
foo_table.update().\
values({
u'bar_id': ???
})
)
I tried to use plain SQLAlchemy expressions for the '???':
select([bar_table.columns['id']],
bar_table.columns[u'foo_id'] == foo_table.columns[u'id'])
But that only generates bad SQL and a ProgrammingError during execution:
'UPDATE foo_table SET ' {}
Actually, it works exactly as I described above.
My problem was that the table definition for 'foo_table' in my alembic script did not include the 'bar_id' column, so SQLAlchemy did not include it when generating the SQL...
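For reference, a minimal sketch of the working migration once 'bar_id' is part of the table definition; the column types here are assumptions for illustration:
import sqlalchemy as sa
from alembic import op

foo_table = sa.table('foo_table',
                     sa.column('id', sa.Integer),
                     sa.column('bar_id', sa.Integer))
bar_table = sa.table('bar_table',
                     sa.column('id', sa.Integer),
                     sa.column('foo_id', sa.Integer))

# correlated subquery copying bar_table.id into foo_table.bar_id
bar_id_select = sa.select([bar_table.c.id]).where(
    bar_table.c.foo_id == foo_table.c.id)

# SQLAlchemy coerces the SELECT into a scalar subquery here;
# on 1.4+ you can be explicit with bar_id_select.scalar_subquery()
op.execute(
    foo_table.update().values(bar_id=bar_id_select)
)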
I'm looking for some help on how to do this in Python using sqlite3.
Basically, I have a process which downloads a DB (temp) and then needs to insert its records into a 2nd identical DB (the main DB), while ignoring/bypassing any possible duplicate-key errors.
I was thinking of two scenarios but am unsure how best to do this in Python.
Option 1:
create 2 connections and cursor objects, 1 to each DB
select from DB 1 eg:
dbcur.executemany('SELECT * from table1')
rows = dbcur.fetchall()
insert them into DB 2:
dbcur.execute('INSERT INTO table1 VALUES (:column1, :column2)', rows)
dbcon.commit()
This of course does not work as I'm not sure how to do it properly :)
Option 2 (which I would prefer, but not sure how to do):
SELECT and INSERT in 1 statement
Also, I have 4 tables within the DBs, each with varying columns; can I skip naming the columns in the INSERT statement?
As far as the duplicate keys go, I have read I can use 'ON DUPLICATE KEY' to handle them, e.g.:
INSERT INTO table1 VALUES (:column1, :column2) ON DUPLICATE KEY UPDATE set column1=column1
You can ATTACH two databases to the same connection with code like this:
import sqlite3
connection = sqlite3.connect('/path/to/temp.sqlite')
cursor=connection.cursor()
cursor.execute('ATTACH "/path/to/main.sqlite" AS master')
There is no ON DUPLICATE KEY syntax in SQLite as there is in MySQL; the usual alternatives are INSERT OR REPLACE and INSERT OR IGNORE.
So to do the bulk insert in one sql statement, you could use something like
cursor.execute('INSERT OR REPLACE INTO master.table1 SELECT * FROM table1')
See the SQLite documentation on REPLACE and the other ON CONFLICT options.
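Putting it together for the scenario in the question, a minimal sketch that attaches the main DB and copies all four tables while skipping duplicate keys could look like this (paths and table names are placeholders):
import sqlite3

# open the freshly downloaded temp DB and attach the main DB to the same connection
connection = sqlite3.connect('/path/to/temp.sqlite')
cursor = connection.cursor()
cursor.execute('ATTACH "/path/to/main.sqlite" AS master')

# INSERT OR IGNORE skips rows that would violate a unique/primary key in the main DB;
# identical schemas mean SELECT * works without naming the columns
for table in ('table1', 'table2', 'table3', 'table4'):
    cursor.execute('INSERT OR IGNORE INTO master.{0} SELECT * FROM {0}'.format(table))

connection.commit()
connection.close()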
The approach in option 1 is basically right, though you'd use execute for the SELECT and executemany (with positional ? placeholders) for the INSERT.
If you need filtering to bypass duplicate keys, do the insert into a temporary table and then use SQL commands to eliminate duplicates and merge them into the target table.