I am using Python to perform basic ETL, transferring records from a MySQL database to a PostgreSQL database. This is the Python code that performs the transfer:
source_cursor = source_cnx.cursor()
source_cursor.execute(query.extract_query)
data = source_cursor.fetchall()
source_cursor.close()

# load data into warehouse db
if data:
    target_cursor = target_cnx.cursor()
    #target_cursor.execute("USE {};".format(datawarehouse_name))
    target_cursor.executemany(query.load_query, data)
    print('data loaded to warehouse db')
    target_cursor.close()
else:
    print('data is empty')
MySQL Extract (extract_query):
SELECT `tbl_rrc`.`id`,
`tbl_rrc`.`col_filing_operator`,
`tbl_rrc`.`col_medium`,
`tbl_rrc`.`col_district`,
`tbl_rrc`.`col_type`,
DATE_FORMAT(`tbl_rrc`.`col_timestamp`, '%Y-%m-%d %T.%f') as `col_timestamp`
from `tbl_rrc`
PostgreSQL Load (load_query):
INSERT INTO geo_data_staging.tbl_rrc
(id,
col_filing_operator,
col_medium,
col_district,
col_type,
col_timestamp)
VALUES
(%s,%s,%s,%s,%s,%s);
Of note, there is a PK constraint on id.
The problem is that, while I get no errors, I'm not seeing any of the records in the target table. I tested this by manually inserting a record, then running the script again. The code errored out with a PK constraint violation, so I know it's finding the table.
If you have any idea what I could be missing, I would greatly appreciate it.
Using psycopg2, you have to call commit() on the connection in order for the transaction to be committed. If you just close() the cursor and the connection, the transaction is implicitly rolled back.
There are a couple of exceptions to this. You can set the connection to autocommit. You can also use the connection as a context manager (with target_cnx:), which will automatically commit if the block doesn't raise an exception.
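For example, a minimal sketch of the loading step from the question with an explicit commit (names taken from the code above):

if data:
    target_cursor = target_cnx.cursor()
    target_cursor.executemany(query.load_query, data)
    target_cnx.commit()  # commit() lives on the connection, not the cursor
    target_cursor.close()
    print('data loaded to warehouse db')
else:
    print('data is empty')

Equivalently, wrapping the load in "with target_cnx:" commits on success and rolls back if an exception is raised inside the block.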
Related
I have an issue where SQLAlchemy Core does not insert rows when I try to insert data using connection.execute(table.insert(), list_of_rows). I construct the connection object without any additional parameters, i.e. connection = engine.connect(), and the engine with only one additional parameter, engine = create_engine(uri, echo=True).
Besides not finding the data in the DB, I also can't find an "INSERT" statement in my app's logs.
It may be important that I'm reproducing this issue while running py.test tests.
The DB I use is MSSQL in a Docker container.
EDIT1:
The rowcount of the result proxy is always -1, regardless of whether I use a transaction, and also when I change the insert to connection.execute(table.insert().execution_options(autocommit=True), list_of_rows).rowcount.
EDIT2:
I rewrote this code and now it works. I don't see any major difference.
What's the inserted row count after connection.execute:
proxy = connection.execute(table.insert(), list_of_rows)
print(proxy.rowcount)
If rowcount is a positive integer, that proves the data is indeed being written to the DB, but it may only be visible inside an uncommitted transaction. If so, you could then check whether autocommit is on: https://docs.sqlalchemy.org/en/latest/core/connections.html#understanding-autocommit
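If the rows do turn out to be sitting in an uncommitted transaction, one sketch of a fix (reusing the engine, table and list_of_rows names from the question) is to run the insert inside an explicit transaction:

# engine.begin() opens a connection plus a transaction and commits it
# automatically when the block exits without an exception
with engine.begin() as connection:
    connection.execute(table.insert(), list_of_rows)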
I had no problem SELECTing data from a Postgres database in Python using cursor/execute. I just changed the SQL to INSERT a row, but nothing is inserted into the DB. Can anyone let me know what should be modified? I'm a little confused because everything is the same except for the SQL statement.
#app.route("/addcontact")
def addcontact():
# this connection/cursor setting showed no problem so far
conn = pg.connect(conn_str)
cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
sql = f"INSERT INTO jna (sid, phone, email) VALUES ('123','123','123')"
cur.execute(sql)
return redirect("/contacts")
First, look at your table setup and make sure your variables are named right, in the right order, format and all that. If you're not logged into the specific database on the SQL server, it won't know where the table is; you might need to send something like 'USE databasename' before you do your insert statement so you're in the right place on the server.
I might not be up to date with the language, but is that 'f' supposed to be right before the quotes? If that's in your code, it'd probably throw an error unless it has a use I'm not aware of, or it's not relevant to the problem.
You have to commit your transaction by adding the line below after execute(sql)
conn.commit()
Ref: Using INSERT with a PostgreSQL Database using Python
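A sketch of the route with the commit added (assuming the same pg/psycopg2 imports, conn_str and redirect from the question):

@app.route("/addcontact")
def addcontact():
    conn = pg.connect(conn_str)
    cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    sql = "INSERT INTO jna (sid, phone, email) VALUES ('123','123','123')"
    cur.execute(sql)
    conn.commit()  # without this the INSERT is rolled back when the connection is closed
    cur.close()
    conn.close()
    return redirect("/contacts")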
I'm running a Python script with SQLAlchemy to export and then import data from production to a Postgres DB on a daily basis. The script runs successfully once, and then on the second run and beyond the script fails. As you will see below, the error returned suggests the dependencies between the tables (foreign keys) are the cause of the import failure; however, I do not understand why this issue is not circumvented by the sorted_tables object. I've opted to remove the initialization code, like repository imports and DB connection objects, to simplify the post and reduce clutter.
def create_db(src, dst, src_schema, dst_schema, drop_dst_schema=False):
    if drop_dst_schema:
        post_db.engine.execute('DROP SCHEMA IF EXISTS {0} CASCADE'.format(dst_schema))
        print "Schema {0} Dropped".format(dst_schema)
    post_db.engine.execute('CREATE SCHEMA IF NOT EXISTS {0}'.format(dst_schema))
    post_db.engine.execute('GRANT USAGE ON SCHEMA {0} TO {0}_ro'.format(dst_schema))
    post_db.engine.execute('GRANT USAGE ON SCHEMA {0} TO {0}_rw'.format(dst_schema))
    print "Schema {0} Created".format(dst_schema)

def create_table(tbl, dst_schema):
    dest_table = tbl
    dest_table.schema = dst_schema
    for col in dest_table.columns:
        if hasattr(col.type, 'collation'):
            col.type.collation = None
        if col.name == 'id':
            dest_table.append_constraint(PrimaryKeyConstraint(col))
        col.type = convert(col.type)
    timestamp_col = Column('timestamp', DateTime(timezone=False), server_default=func.now())
    #print tbl.c
    dest_table.append_column(timestamp_col)
    dest_table.create(post_db.engine, checkfirst=True)
    post_db.engine.execute('GRANT INSERT ON {1} to {0}_ro'.format(dst_schema, dest_table))
    post_db.engine.execute('GRANT ALL PRIVILEGES ON {1} to {0}_rw'.format(dst_schema, dest_table))
    print "Table {0} created".format(dest_table)

create_db(mysql_db.engine, post_db.engine, src_schema, dst_schema, drop_dst_schema=False)

mysql_meta = MetaData(bind=mysql_db.engine)
mysql_meta.reflect(schema=src_schema)
post_meta = MetaData(bind=post_db.engine)
post_meta.reflect(schema=dst_schema)

script_begin = time.time()
rejected_list = []

for table in mysql_meta.sorted_tables:
    df = mysql_db.sql_retrieve('select * from {0}'.format(table.name))
    df = df.where((pd.notnull(df)), None)
    print "Table {0} : {1}".format(table.name, len(df))
    dest_table = table
    dest_table.schema = dst_schema
    dest_table.drop(post_db.engine, checkfirst=True)
    create_table(dest_table, dst_schema)
    print "Table {0} emptied".format(dest_table.name)
    try:
        start = time.time()
        if len(df) > 10000:
            for g, df_new in df.groupby(np.arange(len(df)) // 10000):
                dict_items = df_new.to_dict(orient='records')
                post_db.engine.connect().execute(dest_table.insert().values(dict_items))
        else:
            dict_items = df.to_dict(orient='records')
            post_db.engine.connect().execute(dest_table.insert().values(dict_items))
        loadtime = time.time() - start
        print "Data loaded with datasize {0}".format(str(len(df)))
        print "Table {0} loaded to BI database with loadtime {1}".format(dest_table.name, loadtime)
    except:
        print "Table {0} could not be loaded".format(dest_table.name)
        rejected_list.append(dest_table.name)
If I drop the entire dst_schema before importing the data, the import succeeds.
This is the error I see:
sqlalchemy.exc.InternalError: (psycopg2.InternalError) cannot drop table A because other objects depend on it
DETAIL: constraint fk_rails_111193 on table B depends on table A
HINT: Use DROP ... CASCADE to drop the dependent objects too.
[SQL: '\nDROP TABLE A']
Can someone steer me toward a possible solution?
Are there better alternatives other than dropping the dst_schema before importing the data to the destination DB (drop_dst_schema=True)?
def create_db(src,dst,src_schema,dst_schema,drop_dst_schema=True)
Does anyone have an idea why sorted_tables does not drop the dependencies in the schema? Am I misunderstanding this object?
You have several options:
Drop the whole schema every time
If you have a complex schema, with any kind of closed loop reference chain, your best option is to always drop the whole schema.
You could have some self-referencing tables (such as a persons table, with a self-relation of type person parent-of person). You could also have a schema where table A references table B, which in turn references table A. For instance, you have one table persons and one table companies, and two relations (probably with intermediate tables): a company employs persons, and persons trade shares of companies.
In realistic cases like these, no matter what you do with sorted_tables, this will never work.
If you're actually replicating data from another DB and can afford the time, dropping and recreating the whole schema is the easiest solution to implement. Your code will be much simpler: fewer cases to consider.
DROP CASCADE
You can also drop the tables using DROP CASCADE. If one table is referenced by another, this will drop both (or as many as necessary). You have to make sure the order in which you DROP and CREATE gives you the end result you expect. I'd check very carefully that this scenario works in all cases.
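For instance, instead of dest_table.drop(post_db.engine, checkfirst=True), you could issue the DROP yourself (a sketch; dst_schema and table come from the loop in the question):

# CASCADE also removes foreign-key constraints in other tables that point at
# this one, so those referencing tables will need their constraints rebuilt
post_db.engine.execute('DROP TABLE IF EXISTS {0}.{1} CASCADE'.format(dst_schema, table.name))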
Drop all FK constraints, then recreate them at the end
Finally, there is one last possibility: drop all FK constraints for all tables before manipulating them, and recreate them at the end. This way, you'll be able to drop any table at any moment.
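A sketch of that last approach with SQLAlchemy's DDL constructs, reusing the reflected post_meta and post_db.engine from the question (the exact constraint handling may need adjusting to your schema):

from sqlalchemy.schema import DropConstraint, AddConstraint

# collect the foreign-key constraints reflected from the destination schema
fk_constraints = {fk.constraint
                  for tbl in post_meta.sorted_tables
                  for fk in tbl.foreign_keys}

# drop them all so the tables can be dropped and recreated in any order
for constraint in fk_constraints:
    post_db.engine.execute(DropConstraint(constraint))

# ... drop, recreate and reload the tables here ...

# recreate the constraints once all tables exist again
for constraint in fk_constraints:
    post_db.engine.execute(AddConstraint(constraint))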
In short:
I have a PostgreSQL database and I connect to it through Python's psycopg2 module. Such a script might look like this:
import psycopg2
# connect to my database
conn = psycopg2.connect(dbname="<my-dbname>",
                        user="postgres",
                        password="<password>",
                        host="localhost",
                        port="5432")
cur = conn.cursor()
ins = "insert into testtable (age, name) values (%s,%s);"
data = ("90", "George")
sel = "select * from testtable;"
cur.execute(sel)
print(cur.fetchall())
# prints out
# [(100, 'Paul')]
#
# db looks like this
# age | name
# ----+-----
# 100 | Paul
# insert new data - no commit!
cur.execute(ins, data)
# perform the same select again
cur.execute(sel)
print(cur.fetchall())
# prints out
# [(100, 'Paul'),(90, 'George')]
#
# db still looks the same
# age | name
# ----+-----
# 100 | Paul
cur.close()
conn.close()
That is, I connect to that database which at the start of the script looks like this:
age | name
----+-----
100 | Paul
I perform SQL select and retrieve only Paul data. Then I do SQL insert, however without any commit, but the second SQL select still fetches both Paul and George - and I don't want that. I've looked both into psycopg and Postgresql docs and found out about ISOLATION LEVEL (see Postgresql and see psycopg2). In Postgresql docs (under 13.2.1. Read Committed Isolation Level) it explicitly says:
However, SELECT does see the effects of previous updates executed within its own transaction, even though they are not yet committed.
I've tried different isolation levels. I understand that Read Committed and Repeatable Read don't work; I thought that Serializable might work, but it does not -- meaning that I can still fetch uncommitted data with SELECT.
I could do conn.set_isolation_level(0), where 0 represents psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT, or I could probably wrap the execute commands inside with statements (see).
In the end, I am a bit confused as to whether I understand transactions and isolation (and whether the behavior of SELECT without COMMIT is completely normal) or not. Can somebody enlighten me on this topic?
Your two SELECT statements are using the same connection, and therefore the same transaction. From the psycopg manual you linked:
By default, the first time a command is sent to the database ... a new transaction is created. The following database commands will be executed in the context of the same transaction.
Your code is therefore equivalent to the following:
BEGIN TRANSACTION;
select * from testtable;
insert into testtable (age, name) values (90, 'George');
select * from testtable;
ROLLBACK TRANSACTION;
Isolation levels control how a transaction interacts with other transactions. Within a transaction, you can always see the effects of commands within that transaction.
If you want to isolate two different parts of your code, you will need to open two connections to the database, each of which will (unless you enable autocommit) create a separate transaction.
Note that according to the document already linked, creating a new cursor will not be enough:
...not only the commands issued by the first cursor, but the ones issued by all the cursors created by the same connection
Using autocommit will not solve your problem. When autocommit is on, every insert and update is automatically committed to the database, and all subsequent reads will see that data.
It's most unusual not to want to see data that you have written to the database yourself. But if that's what you want, you need two separate connections, and you must make sure that your SELECT is executed prior to the commit.
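A minimal sketch of that two-connection setup, reusing the table from the question:

import psycopg2

conn_args = dict(dbname="<my-dbname>", user="postgres",
                 password="<password>", host="localhost", port="5432")
writer = psycopg2.connect(**conn_args)
reader = psycopg2.connect(**conn_args)

wcur = writer.cursor()
wcur.execute("insert into testtable (age, name) values (%s, %s);", ("90", "George"))

# the reader connection has its own transaction and does not see the
# uncommitted row
rcur = reader.cursor()
rcur.execute("select * from testtable;")
print(rcur.fetchall())   # [(100, 'Paul')]

writer.commit()          # only now does the row become visible to new reads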
I've written a program to scrape a website for data, place it into several arrays, iterate through each array and place it in a query and then execute the query. The code looks like this:
for count in range(391):
    query = #long query
    values = (doctor_names[count].encode("utf-8"), ...) #continues for about a dozen arrays
    cur.execute(query, values)

cur.close()
db.close()
I run the program and aside from a few truncation warnings everything goes fine. I open the database in MySQL Workbench and nothing has changed. I tried changing the arrays in the values to constant strings and running it but still nothing would change.
I then created an array to hold the last executed query: sql_queries.append(cur._last_executed) and pushed them out to a text file:
fo = open("foo.txt", "wb")
for q in sql_queries:
fo.write(q)
fo.close()
Which gives me a large text file with multiple queries. When I copy the whole text file and create a new query in MySQL Workbench and execute it, it populates the database as desired. What is my program missing?
If your table is using a transactional storage engine, like InnoDB, then you need to call db.commit() to have the transaction stored:
for count in range(391):
    query = #long query
    values = (doctor_names[count].encode("utf-8"), ...)
    cur.execute(query, values)

db.commit()
cur.close()
db.close()
Note that with a transactional database, besides committing you also have the opportunity to handle errors by rolling back inserts or updates with db.rollback(). The db.commit() is required to finalize the transaction. Otherwise:
Closing a connection without committing the changes first will cause
an implicit rollback to be performed.
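A sketch of that pattern around the loop from the question (db being the same connection object the cursor was created from):

try:
    for count in range(391):
        query = "..."  # the long INSERT query from the question
        values = (doctor_names[count].encode("utf-8"), ...)  # about a dozen arrays, as above
        cur.execute(query, values)
    db.commit()        # finalize the whole batch in one transaction
except Exception:
    db.rollback()      # undo the partial batch if any insert fails
    raise
finally:
    cur.close()
    db.close()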