Summary:
I need to get the IDs of the last inserted rows when using pandas .to_sql() with a SQLAlchemy connection.
In more detail:
I have a database with 3 tables:
Event table
Event type table
Reference table between them
An event can have multiple types, so I created the third table, which links the first and second tables.
As a good data scientist I of course use pandas (pd.DataFrame.to_sql) and SQLAlchemy. So far this works great for the event and event type tables.
But I also need to write the reference table. How can I retrieve the IDs of the rows just inserted when using df.to_sql()? The issue is that .to_sql() does not hand back the SQLAlchemy result of the executed statements. Is there a more elegant way than querying the highest ID first, assigning IDs to the inserted events manually (bypassing the database's auto-increment) and then using those IDs to write the reference table?
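For context, a minimal sketch of the workaround described above (querying the current maximum ID and pre-assigning IDs instead of relying on auto-increment), assuming recent pandas and SQLAlchemy versions; the table names event and event_event_type, the column names, and the connection string are all assumptions:

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@localhost/events")  # assumed connection

events = pd.DataFrame({"name": ["concert", "conference"]})  # hypothetical event rows

with engine.begin() as conn:
    # Read the current maximum ID and pre-assign IDs, bypassing the auto-increment.
    max_id = conn.execute(sa.text("SELECT COALESCE(MAX(id), 0) FROM event")).scalar()
    events["id"] = range(max_id + 1, max_id + 1 + len(events))
    events.to_sql("event", conn, if_exists="append", index=False)

    # The pre-assigned IDs can now be used to build the reference table rows.
    ref = pd.DataFrame({"event_id": events["id"], "event_type_id": [1, 2]})  # hypothetical type IDs
    ref.to_sql("event_event_type", conn, if_exists="append", index=False)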
Related
I am building a DWH based on data I am collecting from an ERP API.
Currently, I am fetching the data from the API using an incremental mechanism I built in Python: the script fetches all invoices whose last-modified date falls within the last 24 hours and inserts the data into a "staging table" (no changes are required during this step).
The next step is to insert all data from the staging area into the "final tables". The final tables include primary keys according to the ERP (for example invoice number).
There are no primary keys defined at the staging tables.
For now, I am putting aside the data manipulation and transformation.
In some cases a specific invoice is already in the "final tables", but the user then updates the invoice in the ERP system, which causes the Python script to fetch the data again from the API into the staging tables. When I then try to insert the invoice into the "final table", I get a conflict due to the primary key constraint on the "final tables".
Any idea of how to solve this?
I am thinking of adding a field that records the date and time at which the record landed in the staging table ("insert date") and then upserting the records if
insert date at the staging table > insert date at the final tables
Is this best practice?
Any other suggestions? Maybe a specific tool or data solution?
I prefer using Python scripts since this is part of a wider project.
Thank you!
Instead of a straight INSERT, use an UPSERT pattern: either the MERGE statement if your database has it, or UPDATE the existing rows followed by INSERTing the new ones.
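A sketch of the UPDATE-then-INSERT variant, run from Python through SQLAlchemy; the table names staging_invoices and invoices, the key invoice_number, and the columns amount and insert_date are assumptions, and the SQL assumes a T-SQL-style UPDATE ... FROM:

import sqlalchemy as sa

engine = sa.create_engine("mssql+pyodbc://user:pass@dsn")  # assumed connection

update_sql = """
UPDATE f
SET    f.amount = s.amount,
       f.insert_date = s.insert_date
FROM   invoices AS f
JOIN   staging_invoices AS s
       ON s.invoice_number = f.invoice_number
WHERE  s.insert_date > f.insert_date
"""

insert_sql = """
INSERT INTO invoices (invoice_number, amount, insert_date)
SELECT s.invoice_number, s.amount, s.insert_date
FROM   staging_invoices AS s
WHERE  NOT EXISTS (
    SELECT 1 FROM invoices AS f WHERE f.invoice_number = s.invoice_number
)
"""

with engine.begin() as conn:
    conn.execute(sa.text(update_sql))   # refresh invoices that were modified in the ERP
    conn.execute(sa.text(insert_sql))   # add invoices not seen before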
I've created a cloud function using Python that receives some data and inserts it into a BigQuery table. Currently, it uses the insert_rows method to insert a new row.
row = client.insert_rows(table, row_to_insert) # API request to insert row
The problem is that I already have data with unique primary keys in the table, and I just need one measurement value to be updated in those rows.
I would like it to update or replace that row instead of creating a new one (assuming the primary keys in the table and input data match). Is this possible?
BigQuery is not designed for transactional data; it prefers append-only workloads. Please refer to the BigQuery documentation on DML quotas: you can only apply a limited number of DML statements to a table per day.
Updating individual rows will not work on BigQuery tables.
Recommended solution:
Create two tables (T1 and T2).
Insert all transactional records into the T1 table through your existing function.
Then write a BQ SQL statement that reads the most recent record per key from T1 and inserts those records into T2, as in the sketch below.
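A sketch of that BQ SQL step from Python, keeping the latest record per key with ROW_NUMBER; the dataset my_dataset, the key invoice_number, and the timestamp column insert_timestamp are assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# T1 holds every appended record; T2 is rebuilt with only the latest row per key.
dedup_sql = """
CREATE OR REPLACE TABLE my_dataset.T2 AS
SELECT * EXCEPT(rn)
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (
           PARTITION BY invoice_number
           ORDER BY insert_timestamp DESC
         ) AS rn
  FROM my_dataset.T1 AS t
)
WHERE rn = 1
"""

client.query(dedup_sql).result()  # block until the job finishes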
I am new to both python and SQLite.
I have used Python to extract data from xlsx workbooks. Each workbook is one series made up of several sheets and becomes its own database, but I would also like a merged database of every series together. The structure is the same for all.
The structure of my database is:
*Table A with an autoincrement primary key id, a logical variable and 1 other variable.
*Table B with an autoincrement primary key id, a logical variable and 4 other variables.
*Table C is keyed by the Table A id and the Table B id together (a composite primary key) and also has 4 other variables specific to this instance of intersection between Table A and Table B (sketched below).
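For concreteness, a sketch of that structure as it could be created with sqlite3; the non-key column names are placeholders:

import sqlite3

con = sqlite3.connect("series1.db")
con.executescript("""
CREATE TABLE tableA (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    logical INTEGER,
    var1    TEXT
);
CREATE TABLE tableB (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    logical INTEGER,
    var1 TEXT, var2 TEXT, var3 TEXT, var4 TEXT
);
CREATE TABLE tableC (
    a_id INTEGER REFERENCES tableA(id),
    b_id INTEGER REFERENCES tableB(id),
    var1 TEXT, var2 TEXT, var3 TEXT, var4 TEXT,
    PRIMARY KEY (a_id, b_id)
);
""")
con.close()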
I tried using the answer at
Sqlite merging databases into one, with unique values, preserving foregin key relation
along with various other ATTACH solutions, but each time I got an error message ("cannot ATTACH database within transaction").
Can anyone suggest why I can't get ATTACH to work?
I also tried a ToMerge approach like the one at How can I merge many SQLite databases?
and it couldn't attach ToMerge inside the transaction either.
I also initially tried connecting to the existing SQLite db, building dictionaries from the existing tables in Python, and then adding the information from the dictionaries into a new 'merged' db, but this actually seemed to be far slower than the original process of extracting everything from the xlsx files.
I realize I can easily just run my xlsx-to-SQL Python script again and again for each series, directing it all into the one big SQL database, and that is my backup plan, but I want to learn how to do it the best, fastest way.
So, what is the best way for me to merge identically structured SQLite databases into one while maintaining my foreign keys?
TIA for any suggestions
:-)
You cannot execute the ATTACH statement from inside a transaction.
You did not start a transaction, but Python tried to be clever, got the type of your statement wrong, and automatically started a transaction for you.
Set connection.isolation_level = None.
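A minimal sketch of that fix with the sqlite3 module; the file names merged.db and series1.db and the table/column names are assumptions:

import sqlite3

con = sqlite3.connect("merged.db")
con.isolation_level = None  # autocommit: Python no longer opens implicit transactions

con.execute("ATTACH DATABASE 'series1.db' AS src")

# Copy rows inside an explicit transaction; the ATTACH/DETACH stay outside it.
con.execute("BEGIN")
con.execute("INSERT INTO tableA (logical, var1) SELECT logical, var1 FROM src.tableA")
con.execute("COMMIT")

con.execute("DETACH DATABASE src")
con.close()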
I've been trying to write a bulk insert to a table.
I've already connected and tried to use the SQLAlchemy bulk insert functions, but it's not really bulk inserting; it inserts the rows one by one (the DBA traced the database and showed me).
I wrote a class for the table:
class SystemLog(Base):
    __tablename__ = 'systemlogs'
    # fields go here...
Because the bulk insert functions don't actually bulk insert, I want to try using a stored procedure.
I have a stored procedure named 'insert_new_system_logs' that receives a table variable as a parameter.
How can I call it with a table from python SQLAlchemy?
My SQLAlchemy version is 1.0.6
I can't paste my code because it's in a closed network.
I don't have to use SQLAlchemy, I just want to bulk insert my logs.
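Separately from the stored-procedure route, here is a minimal sketch of a multi-row insert with SQLAlchemy Core, where a single execute() with a list of dicts is sent as one executemany batch; the connection string and column names are assumptions, and the reflection style matches SQLAlchemy 1.0.x:

from sqlalchemy import create_engine, MetaData, Table

engine = create_engine("mssql+pyodbc://user:pass@dsn")  # assumed connection
metadata = MetaData()
systemlogs = Table("systemlogs", metadata, autoload=True, autoload_with=engine)

rows = [
    {"message": "job started", "level": "INFO"},   # placeholder columns
    {"message": "job finished", "level": "INFO"},
]

# One execute() with a list of dicts -> a single executemany on the DBAPI side.
engine.execute(systemlogs.insert(), rows)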
Folks,
To retrieve all items from a DynamoDB table, I would like to replace the scan operation with a query.
Currently I am pulling in all the table's data via the following (python):
from boto.dynamodb2.table import Table  # boto 2 DynamoDB API, inferred from the Table usage

drivertable = Table(url['dbname'])
all_drivers = []
all_drivers_query = drivertable.scan()
for x in all_drivers_query:
    all_drivers.append(x['number'])
How would I change this to use the query API?
Thanks!
There is no way to use a query to get the entire contents of the table. As of right now, you have a few options if you want to get all of your data out of a DynamoDB table, and all of them involve actually reading the data out of DynamoDB:
Scan the table. It can be done faster, at the expense of using much more read capacity, by using a parallel scan (see the sketch after this list).
Export your data using AWS Data Pipelines. You can configure the export job for where and how it should store your data.
Use one of the AWS event platforms for new data and denormalize it. For all new data you can get a time-ordered stream of all updates to the table from DynamoDB Streams, or process the events using AWS Lambda.
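For the first option, a minimal sketch of a paginated scan with boto3; the table name drivers is an assumption and number is the attribute from the question (a reserved word in DynamoDB, hence the expression alias):

import boto3

table = boto3.resource("dynamodb").Table("drivers")  # assumed table name

all_numbers = []
kwargs = {
    "ProjectionExpression": "#n",
    "ExpressionAttributeNames": {"#n": "number"},  # 'number' is a DynamoDB reserved word
}

# Each Scan call returns at most 1 MB; follow LastEvaluatedKey to page through the table.
while True:
    response = table.scan(**kwargs)
    all_numbers.extend(item["number"] for item in response["Items"])
    if "LastEvaluatedKey" not in response:
        break
    kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]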
You can't query an entire table. Query is used to retrieve a set of items by supplying a hash key (part of the composite hash-range primary key of the table).
One cannot use Query without knowing the hash keys.
EDIT: a bounty was added to this old question, which asks:
How do I get a list of hashes from DynamoDB?
Well - as of December 2014 you still can't ask a single API call for all the hash keys of a table.
Even if you add a GSI, you still can't get a DISTINCT hash count.
The way I would solve this is with denormalization: keep another table with no range key and write every hash key there alongside the writes to the main table. This adds housekeeping overhead at the application level (mainly on deletes), but it solves the problem you asked about.
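A sketch of that bookkeeping with boto3; the table names drivers and driver_keys and the hash-key attribute driver_id are assumptions:

import boto3

dynamodb = boto3.resource("dynamodb")
main_table = dynamodb.Table("drivers")       # hash-range table
keys_table = dynamodb.Table("driver_keys")   # hash-only table that mirrors every hash key

def put_driver(item):
    """Write to the main table and record its hash key in the keys table."""
    main_table.put_item(Item=item)
    keys_table.put_item(Item={"driver_id": item["driver_id"]})

def all_hash_keys():
    """Enumerate every hash key by scanning the small keys-only table."""
    keys, kwargs = [], {}
    while True:
        response = keys_table.scan(**kwargs)
        keys.extend(item["driver_id"] for item in response["Items"])
        if "LastEvaluatedKey" not in response:
            return keys
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]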