I'm developing a Pylons app that is based on an existing database, so I'm using reflection. I have an SQL file with the schema that I used to create my test database, which is why I can't simply use drop_all and create_all.
I would like to write some unit tests, and I've run into the problem of clearing the database content after each test. I just want to erase all the data but leave the tables intact. Is this possible?
The application uses Postgres, and that is what has to be used for the tests as well.
I asked about the same thing on the SQLAlchemy Google group, and I got a recipe that appears to work well (all my tables are emptied). See the thread for reference.
My code (excerpt) looks like this:
import contextlib
from sqlalchemy import MetaData

meta = MetaData()
# Reflect the existing schema so that meta.sorted_tables is populated
# (the engine is created elsewhere in the app).
meta.reflect(bind=engine)

with contextlib.closing(engine.connect()) as con:
    trans = con.begin()
    for table in reversed(meta.sorted_tables):
        con.execute(table.delete())
    trans.commit()
Edit: I modified the code to delete tables in reverse order; supposedly this should ensure that children are deleted before parents.
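In case it helps, here is a minimal sketch of how the recipe above could be hooked into a pytest teardown so that every test starts from empty tables (the fixture names and the connection URL are assumptions, not part of the original recipe):

import contextlib

import pytest
from sqlalchemy import MetaData, create_engine

@pytest.fixture(scope="session")
def engine():
    # Placeholder URL: point this at your Postgres test database.
    return create_engine("postgresql://localhost/test_db")

@pytest.fixture(autouse=True)
def clean_tables(engine):
    # Reflect the existing schema so sorted_tables knows the dependency order.
    meta = MetaData()
    meta.reflect(bind=engine)
    yield  # run the test first
    # After each test, empty every table, children before parents.
    with contextlib.closing(engine.connect()) as con:
        trans = con.begin()
        for table in reversed(meta.sorted_tables):
            con.execute(table.delete())
        trans.commit()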
For PostgreSQL using TRUNCATE:
with contextlib.closing(engine.connect()) as con:
    trans = con.begin()
    con.execute('TRUNCATE {} RESTART IDENTITY;'.format(
        ','.join(table.name
                 for table in reversed(Base.metadata.sorted_tables))))
    trans.commit()
Note: RESTART IDENTITY ensures that all sequences are reset as well. However, this is about 50% slower than the DELETE recipe by @aknuds1.
Another recipe is to drop all tables first and then recreate them. This is slower by another 50%:
Base.metadata.drop_all(bind=engine)
Base.metadata.create_all(bind=engine)
How about using truncate:
TRUNCATE [ TABLE ] name [, ...]
(http://www.postgresql.org/docs/8.4/static/sql-truncate.html)
This will delete all the records in the table, but leave the schema intact.
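For a single table, the same thing can be issued through SQLAlchemy roughly like this (a sketch; "mytable" is an example name, and CASCADE is only needed when other tables reference it via foreign keys, since it empties those referencing tables too):

from sqlalchemy import text

with engine.begin() as con:
    # RESTART IDENTITY resets sequences; CASCADE also empties referencing tables.
    con.execute(text("TRUNCATE TABLE mytable RESTART IDENTITY CASCADE"))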
Related
I have an Airflow DAG which takes a table name as an argument from the user.
I then use this value in a SQL statement and execute it in BigQuery. I'm worried about exposing myself to SQL injection.
Here is the code:
sql = f"""
CREATE OR REPLACE TABLE {PROJECT}.{dataset}.{table} PARTITION BY DATE(start_time) as (
//OTHER CODE
)
"""
client = bigquery.Client()
query_job = client.query(sql)
Both dataset and table get passed through via Airflow, but I'm worried someone could pass something like random_table; truncate other_tbl; -- as the table argument.
My fear is that the above will create a table called random_table and then truncate an existing table.
Is there a safer way to process these passed through arguments?
I've looked into parameterized queries in BigQuery but these don't work for table names.
You will have to create a table name validator. I think you can get reasonably safe by requiring backticks (`) at the start and at the end of your table name string. It's not a 100% solution, but it worked for the test scenarios I tried. It should work like this:
# validate should look for ` at the beginning and end of your tablename
table_name = validate(f"`{project}.{dataset}.{table}`")
sql = f"""
CREATE OR REPLACE TABLE {table_name} PARTITION BY DATE(start_time) as (
//OTHER CODE
)
"""
...
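For illustration, the validate function above could be something along these lines (a sketch; the exact characters you allow are up to you, and this helper is an assumption rather than an existing utility):

import re

# Accept only a backtick-wrapped `project.dataset.table` reference whose parts
# contain nothing but letters, digits, underscores and dashes; reject the rest.
_TABLE_RE = re.compile(r"^`[\w-]+\.[\w-]+\.[\w-]+`$")

def validate(table_name: str) -> str:
    if not _TABLE_RE.match(table_name):
        raise ValueError(f"Invalid table reference: {table_name!r}")
    return table_name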
Note: I also suggest checking the following post on Medium about BigQuery SQL injection.
I checked the official documentation about running parameterized queries, and sadly it only covers parameterization of values, not table names or other parts of the query string.
As a final note, I recommend opening a feature request for BigQuery for this particular scenario.
You should probably look into sanitization/validation of user input in general. This is done before passing the input to the BQ query.
With Python, you could look for malicious strings in the user input (like truncate in your example) or use a regex to reject input that, for instance, contains --. Those are just quick examples; I recommend you do more research on the topic, and you will also find quite a few related questions on Stack Exchange.
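As one concrete sketch of that idea, an allow-list is usually safer than trying to spot every malicious pattern (the set of table names below is purely illustrative):

# Hypothetical allow-list check: reject anything that is not a known table name.
ALLOWED_TABLES = {"daily_events", "user_sessions"}

def check_table_argument(table: str) -> str:
    if table not in ALLOWED_TABLES:
        raise ValueError(f"Table {table!r} is not allowed")
    return table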
I am working on an open source persistence layer for an MQTT broker: https://github.com/volkerjaenisch/amqtt_db
Incoming MQTT messages are irregular blobs of data, so usually the DB backend is some kind of object storage.
I do it the hard way and deserialize the blobs into typed data columns, storing them in a fast relational database. My final target will be TimescaleDB, but first I go via SQLAlchemy to access a wide range of DBs with one API.
MQTT messages are volatile (think: not always complete), so the DB schema has to adjust dynamically, e.g. adding new columns for new information.
First Message:
Time: 1234
Temperature : 23.4
Second Message:
Time: 1245
Temperature : 23.6
Rel Hum : 87 %
I have used the SQLAlchemy ORM for more than a decade, but always for quite static databases, so working dynamically is new to me.
Utilizing the ORM to build DB tables dynamically from the structure of incoming MQTT messages was quite doable and worked out perfectly.
But currently I am stuck on the case of additional information in the MQTT packages that extends the tables with new columns.
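For illustration, the column_def mapping used below could be derived from a decoded message roughly like this (a sketch; the helper and the type mapping are assumptions, not code from amqtt_db):

from sqlalchemy import Float, Integer, String

# Map the Python types of the decoded payload values to SQLAlchemy column
# types, so that columns missing from the table can be added afterwards.
def infer_column_def(payload: dict) -> dict:
    type_map = {int: Integer, float: Float, str: String}
    return {name: type_map.get(type(value), String) for name, value in payload.items()}

# infer_column_def({"Time": 1245, "Temperature": 23.6, "Rel Hum": "87 %"})
# -> {"Time": Integer, "Temperature": Float, "Rel Hum": String}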
What I did so far:
Utilizing sqlalchemy-migrate, it was quite easy to dynamically add new columns to the existing table in the DB. In the code, "topic_cls" is the declarative class and "column_def" is a mapping of column names to column types.
from migrate.versioning.schema import Table as MiTable, Column as MiColumn

def add_new_colums(self, topic_cls, column_def):
    table_name = str(topic_cls.__table__.name)
    table = MiTable(table_name, self.metadata)
    for col_name, col_type in column_def.items():
        col = MiColumn(col_name, col_type)
        col.create(table)
Works like a charm. But how do I get these changes to the DB reflected back into the declarative classes? I tried to get a new inspection of the table:
new_table = Table(topic_cls.__table__.name, self.metadata, autoload_with=self.engine)
This also works, but it gives me a new Table object, not a declarative class.
So my stupid questions are:
Is this the right way to achieve my goal?
How can I get a declarative class by inspecting an already existing table in a DB?
"Drop the ORM and use SQL" is not the answer I am looking for.
Cheers,
Volker
Found a solution but it is a bit of a hack.
new_table = Table("test/topic_growth", Base.metadata, autoload_with=self.engine)
Base.metadata.remove(topic_cls.__table__)
new_dcl = type(str(table_name), (Base,), {'__table__': new_table})
Base.metadata._add_table(table_name, None, new_table)
After you have obtained the new table via inspection, remove the old table entry from the metadata.
Then generate a new declarative class with the new table and the same table name.
Finally, add the new table back to the metadata.
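Put together, the same steps could be wrapped into a small helper roughly like this (a sketch, not tested against amqtt_db; removing the stale Table first lets the reflection register the new one itself, which avoids the private _add_table call, though the old declarative class remains in the registry):

from sqlalchemy import Table

def rebuild_declarative(Base, engine, topic_cls):
    """Re-reflect the table behind topic_cls and map a fresh declarative class."""
    table_name = topic_cls.__table__.name
    # Drop the stale Table object from the metadata so reflection does not
    # collide with the old definition.
    Base.metadata.remove(topic_cls.__table__)
    # Reflect the current state of the table, including the new columns;
    # this also registers the new Table in Base.metadata.
    new_table = Table(table_name, Base.metadata, autoload_with=engine)
    # Map a new declarative class onto the freshly reflected Table.
    return type(str(table_name), (Base,), {"__table__": new_table})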
I have an issue where SQLAlchemy Core does not insert rows when I try to insert data using connection.execute(table.insert(), list_of_rows). I construct the connection object without any additional parameters, i.e. connection = engine.connect(), and the engine with only one additional parameter: engine = create_engine(uri, echo=True).
Besides not finding the data in the DB, I also can't find any INSERT statement in my app's logs.
It may be important that I'm reproducing this issue during py.test tests.
The DB I use is MSSQL in a Docker container.
EDIT1:
The rowcount of the result proxy is always -1, regardless of whether I use a transaction, and also when I change the insert to connection.execute(table.insert().execution_options(autocommit=True), list_of_rows).rowcount.
EDIT2:
I rewrote this code and now it works. I don't see any major difference.
What's the inserted row count after connection.execute:
proxy = connection.execute(table.insert(), list_of_rows)
print(proxy.rowcount)
If rowcount is a positive integer, it proves the data is indeed being written, but it may only be visible inside a transaction; if so, you could then check whether autocommit is on: https://docs.sqlalchemy.org/en/latest/core/connections.html#understanding-autocommit
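If the rows only ever show up inside an uncommitted transaction, the usual fix is an explicit transaction block, e.g. (a sketch using the names from the question):

# engine.begin() opens a connection, starts a transaction and commits on
# success (or rolls back on error), so the INSERT is actually persisted.
with engine.begin() as connection:
    connection.execute(table.insert(), list_of_rows)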
I have a production_table and stage_table.
I have a Python script that runs for a few hours and generates data in the stage_table.
At the end of the script, I want to COPY the data from the stage_table to the production_table.
Basically this is what I want:
1. TRUNCATE production_table
2. COPY production_table from stage_table
This is my code:
from sqlalchemy import create_engine
from sqlalchemy.sql import text as sa_text
engine = create_engine("mysql+pymysql:// AMAZON AWS")
engine.execute(sa_text('''TRUNCATE TABLE {1}; COPY TABLE {1} from {0}'''.format(stage_table, production_table)).execution_options(autocommit=True))
This should generate :
TRUNCATE TABLE production_table; COPY TABLE production_table from stage_table
However this doesn't work.
sqlalchemy.exc.ProgrammingError: (pymysql.err.ProgrammingError) (1064,
u"You have an error in your SQL syntax;
How can I make it work? And how can I make sure that the TRUNCATE and COPY happen together? I don't want the TRUNCATE to happen if the COPY aborts.
The usual way to handle multiple statements in a single transaction in SQLAlchemy would be to begin an explicit transaction and execute each statement in it:
with engine.begin() as conn:
    conn.execute(statement_1)
    conn.execute(statement_2)
    ...
As to your original attempt, there is no COPY statement in MySQL. Some other DBMS do have something of the kind. Also not all DB-API drivers support multiple statements in a single query or command, at least out of the box, which would seem to be the case here as well. See this issue and the related note in the PyMySQL ChangeLog.
The biggest issue is that not all statements in MySQL can be rolled back; the most common of these are DDL statements. In other words, you simply cannot execute TRUNCATE [TABLE] ... in the same transaction as the following INSERT INTO ..., and you must design your application around that limitation.

As suggested in the comments by Christian W., you could perhaps create an entirely new table from your staging table and rename it, or just swap the production and staging tables. RENAME TABLE ... cannot be rolled back either, but at least you'd reduce the window for error, and you could undo the changes, since the original production table would still be there, just under a new name. You could then remove the original production table when all else is done. Here's something that demonstrates the idea, but requires manual intervention if something goes awry:
# No point in faking transactions here, since MySQL in use.
engine.execute("CREATE TABLE new_production AS SELECT * FROM stage_table")
engine.execute("RENAME TABLE production_table TO old_production")
engine.execute("RENAME TABLE new_production TO production_table")
# Point of no return:
engine.execute("DROP TABLE old_production")
I am creating an SQLAlchemy table like so:
myEngine = self.get_my_engine() # creates engine
metadata = MetaData(bind=myEngine)
SnapshotTable = Table("mytable", metadata, autoload=False, schema="my schema")
I have to use autoload=False because the table may or may not exist (and that code has to run before the table is created).
The problem is that with autoload=False, when I try to query the table (after it has been created by another process) with session.query(SnapshotTable), I get an:
InvalidRequestError: Query contains no columns with which to SELECT from.
error, which is understandable because the table wasn't loaded yet.
My question is: how do I "load" the table metadata after it has been defined with autoload=False?
I looked at the schema.py code and it seems that I can do:
SnapshotTable._autoload(metadata, None, None)
but that doesn't look right to me...any other ideas or thoughts?
Thanks
First declare the table model:
class MyTable(Base):
    __table__ = Table('mytable', metadata)
Or directly:
MyTable = Table("mytable", metadata)
Then, once you are ready to load it, call this:
Table('mytable', metadata, autoload_with=engine, extend_existing=True)
Where the key to it all is extend_existing=True.
All credit goes to Mike Bayer on the SQLAlchemy mailing list.
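Put together, the flow might look roughly like this in a newer SQLAlchemy (a sketch; the connection URL is a placeholder and the schema argument from the question is omitted for brevity):

from sqlalchemy import MetaData, Table, create_engine
from sqlalchemy.orm import Session

engine = create_engine("postgresql://localhost/mydb")  # placeholder URL
metadata = MetaData()

# Before the table exists in the database: a column-less placeholder.
SnapshotTable = Table("mytable", metadata)

# Later, once another process has created the table, reload it in place;
# extend_existing=True updates the existing Table object with the real columns.
Table("mytable", metadata, autoload_with=engine, extend_existing=True)

# SnapshotTable now has columns and can be queried.
with Session(engine) as session:
    rows = session.query(SnapshotTable).all()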
I was dealing with this issue just last night, and it turns out that all you need to do is load all available table definitions from the database with the help of metadata.reflect. This is very similar to @fgblomqvist's solution. The major difference is that you do not have to recreate the table. In essence, the following should help:
SnapshotTable.metadata.reflect(extend_existing=True, only=['mytable'])
The unsung hero here is the extend_existing parameter. It basically makes sure that the schema and other info associated with SnapshotTable are reloaded. The parameter only is used here to limit how much information is retrieved, which will save you a tremendous amount of time if you are dealing with a large database.
I hope this serves a purpose in the future.
I guess the problem is that the metadata has not been reflected.
You could try to load the metadata with this method call before executing any query:
metadata.reflect()
It will reload the table definition, so the framework will know how to build a proper SELECT.
And then calling
if SnapshotTable.exists():
    SnapshotTable._init_existing()
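A sketch of the same check that avoids the private and deprecated calls (Table.exists() is deprecated or removed in newer SQLAlchemy versions) might look like this, where engine stands for the myEngine from the question:

from sqlalchemy import Table, inspect

# Ask the database whether the table exists, then reflect it in place.
if "mytable" in inspect(engine).get_table_names():
    SnapshotTable = Table("mytable", metadata, autoload_with=engine,
                          extend_existing=True)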