I've created a cloud function using Python that receives some data and inserts it into a BigQuery table. Currently, it uses the insert_rows method to insert a new row.
row = client.insert_rows(table, row_to_insert) # API request to insert row
The problem is that I already have data with unique primary keys in the table, and I just need one measurement value to be updated in those rows.
I would like it to update or replace that row instead of creating a new one (assuming the primary keys in the table and input data match). Is this possible?
BigQuery is not designed for transactional workloads; it prefers append-only writes. Please refer to the documentation on BigQuery DML quotas: you can only run a limited number of DML statements against a table per day.
Frequently updating individual rows is therefore not practical on BigQuery tables.
Recommended solution:
Create 2 tables (T1 & T2).
Insert all transactional records into the T1 table through your existing function.
Then write a BigQuery SQL statement that reads the most recent record per key from T1 and inserts those records into T2.
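As a minimal sketch of that dedup step, using the same Python client the function already uses (the column names primary_key and inserted_at, and the table IDs, are hypothetical placeholders; adjust to your schema):

from google.cloud import bigquery

client = bigquery.Client()

# Rebuild T2 with only the latest record per primary key.
# `primary_key` and `inserted_at` are assumed column names.
dedup_sql = """
CREATE OR REPLACE TABLE `my_project.my_dataset.T2` AS
SELECT * EXCEPT(rn)
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY primary_key ORDER BY inserted_at DESC) AS rn
  FROM `my_project.my_dataset.T1` AS t
)
WHERE rn = 1
"""

client.query(dedup_sql).result()  # wait for the job to finish

Scheduling this query (for example with BigQuery scheduled queries) keeps T2 current without any row-level updates.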
I am building a DWH based on data I am collecting from an ERP API.
Currently, I am fetching the data from the API using an incremental mechanism I built in Python: the script fetches all invoices whose last-modified date falls within the last 24 hours and inserts the data into a "staging table" (no changes are required during this step).
The next step is to insert all data from the staging area into the "final tables". The final tables include primary keys according to the ERP (for example invoice number).
There are no primary keys defined at the staging tables.
For now, I am putting aside the data manipulation and transformation.
In some cases, a specific invoice is already in the "final tables", but then the user updates the invoice in the ERP system, which causes the Python script to fetch the data again from the API into the staging tables. In that case, when I try to insert the invoice into the "final table", I get a conflict due to the primary key restriction on the "final tables".
Any idea of how to solve this?
I am thinking of adding a field that records the timestamp at which the record landed in the staging table ("insert date") and then upserting the records if
insert date at the staging table > insert date at the final tables
Is this best practice?
Any other suggestions? Maybe a specific tool or data solution?
I prefer using python scripts since it is part of a wider project.
Thank you!
Instead of a straight INSERT, use an UPSERT pattern: either the MERGE statement if your database has it, or UPDATE the existing rows followed by INSERTing the new ones.
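A minimal sketch of the MERGE variant, assuming the warehouse supports standard-SQL MERGE (BigQuery syntax is shown, run through its Python client to stay with the Python-based project); the table names (staging.invoices, dwh.invoices) and columns (invoice_number, total, insert_date) are hypothetical placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Deduplicate the staging rows (one row per invoice, newest first), then
# upsert them into the final table, only overwriting when the staging
# insert_date is newer than the one already stored.
merge_sql = """
MERGE `my_project.dwh.invoices` AS final
USING (
  SELECT * EXCEPT(rn)
  FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY invoice_number ORDER BY insert_date DESC) AS rn
    FROM `my_project.staging.invoices` AS s
  )
  WHERE rn = 1
) AS stg
ON final.invoice_number = stg.invoice_number
WHEN MATCHED AND stg.insert_date > final.insert_date THEN
  UPDATE SET total = stg.total, insert_date = stg.insert_date
WHEN NOT MATCHED THEN
  INSERT (invoice_number, total, insert_date)
  VALUES (stg.invoice_number, stg.total, stg.insert_date)
"""

client.query(merge_sql).result()

The MATCHED condition implements the "insert date at the staging table > insert date at the final tables" rule from the question, so older re-fetches never overwrite newer data.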
Summary:
I need to get the IDs for the last inserted rows using pandas .to_sql() with an sqlalchemy connector.
In more detail:
I have a database with 3 tables:
Event table
Event type table
Reference table between them
An event can have multiple types, so I created table 3, which links tables 1 and 2.
As a good data scientist I of course use pandas (pd.DataFrame.to_sql) and SQLAlchemy. So far this works great for the event and the event type tables.
But I also need to write the reference table. How do I retrieve the latest inserted IDs when using df.to_sql()? The issue is that I don't get back the result from the SQLAlchemy library when executing. Is there a more elegant way than getting the highest ID first, manually defining the IDs for the inserted events (bypassing the auto-increment of the database), and then using them to write the reference table?
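For reference, a minimal sketch of the workaround described above (read the current maximum ID, assign IDs in pandas, reuse them for the reference table); the engine URL, table names (event, event_type_ref) and columns are hypothetical, and the approach assumes no concurrent writers competing for the same ID range:

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@localhost/mydb")  # hypothetical

events = pd.DataFrame({"name": ["concert", "conference"]})
event_types = {0: [1, 3], 1: [2]}  # row index -> list of event type IDs

with engine.begin() as conn:
    # Read the current highest event ID and assign IDs explicitly,
    # bypassing the database auto-increment.
    max_id = conn.execute(sa.text("SELECT COALESCE(MAX(id), 0) FROM event")).scalar()
    events["id"] = range(max_id + 1, max_id + 1 + len(events))

    events[["id", "name"]].to_sql("event", conn, if_exists="append", index=False)

    # Build the event <-> event-type reference rows from the known IDs.
    reference = pd.DataFrame(
        [
            {"event_id": event_id, "event_type_id": type_id}
            for idx, event_id in zip(events.index, events["id"])
            for type_id in event_types[idx]
        ]
    )
    reference.to_sql("event_type_ref", conn, if_exists="append", index=False)

On databases that support RETURNING, an alternative is to insert the events with a SQLAlchemy Core INSERT ... RETURNING id statement, which hands back the generated IDs without bypassing the auto-increment.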
We're doing streaming inserts on a BigQuery table.
We want to update the schema of a table without changing its name.
For example, we want to drop a column because it has sensitive data but we want to keep all the other data and the table name the same.
Our process is as follows:
copy original table to temp table
delete original table
create new table with original table name and new schema
populate new table with old table's data
cry because the last (up to) 90 minutes of data is stuck in the streaming buffer and was not transferred.
How to avoid the last step?
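For reference, a minimal sketch of the recreation process above using the BigQuery Python client (the table IDs and schema are hypothetical); as described, any rows still sitting in the streaming buffer when the copy runs are not carried over:

from google.cloud import bigquery

client = bigquery.Client()
original = "my_project.my_dataset.events"          # hypothetical table IDs
backup = "my_project.my_dataset.events_backup"

client.copy_table(original, backup).result()       # 1. copy original to temp
client.delete_table(original)                      # 2. delete original

new_schema = [                                     # 3. recreate without the sensitive column
    bigquery.SchemaField("id", "STRING"),
    bigquery.SchemaField("value", "FLOAT"),
]
client.create_table(bigquery.Table(original, schema=new_schema))

client.query(                                      # 4. repopulate from the backup
    f"INSERT INTO `{original}` (id, value) SELECT id, value FROM `{backup}`"
).result()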
I believe the new streaming API does not use the streaming buffer anymore; instead, it writes data directly to the destination table.
To enable the API you have to enroll via the BigQuery Streaming V2 Beta Enrollment Form:
You can find out more at the following link.
I hope it addresses your case.
I am trying to bulk insert data into MongoDB without overwriting existing data. I want to insert new data into the database if there is no match on the unique ID (sourceID). Looking at the documentation for PyMongo I have written some code, but I cannot make it work. Any ideas what I am doing wrong?
db.bulk_write(UpdateMany({"sourceID"}, test, upsert=True))
db is the name of my database, sourceID is the unique ID of the documents that I don't want to overwrite in the existing data, and test is the array that I am trying to insert.
Either I don't understand your requirement or you misunderstand the UpdateMany operation. Per the documentation, this operation modifies the existing documents (those matching the filter) and only inserts new documents if no documents match the filter and upsert=True. Are you sure you don't want to use the insert_many method?
Also, in your example, the first parameter, which should be the filter for the update, is not a valid query; it has to be in the form {"key": "value"}.
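If the goal really is "insert only the documents whose sourceID is not there yet, and leave existing ones untouched", a minimal sketch with bulk_write looks like this (the database and collection names are hypothetical; note that bulk_write is called on a collection, not on the database):

from pymongo import MongoClient, UpdateOne

client = MongoClient()
collection = client["mydb"]["mycollection"]  # hypothetical database/collection names

# One UpdateOne per document: the filter is a proper {"key": value} query,
# and $setOnInsert only takes effect when the document is being inserted,
# so existing documents are left untouched.
operations = [
    UpdateOne({"sourceID": doc["sourceID"]}, {"$setOnInsert": doc}, upsert=True)
    for doc in test  # `test` is the array of documents from the question
]

result = collection.bulk_write(operations, ordered=False)
print(result.upserted_count, "new documents inserted")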
Folks,
To retrieve all items from a DynamoDB table, I would like to replace the scan operation with a query.
Currently I am pulling in all the table's data via the following (python):
drivertable = Table(url['dbname'])
all_drivers = []
all_drivers_query = drivertable.scan()
for x in all_drivers_query:
    all_drivers.append(x['number'])
How would I change this to use the query API?
Thanks!
There is no way to use query to get the entire contents of the table. As of right now, you have a few options if you want to get all of your data out of a DynamoDB table, and all of them involve actually reading the data out of DynamoDB:
Scan the table. It can be done faster, at the expense of using much more read capacity, by using a parallel scan (see the sketch after this list).
Export your data using AWS Data Pipeline. You can configure the export job for where and how it should store your data.
Use one of the AWS event platforms for new data and denormalize it. For all new data you can get a time-ordered stream of all updates to the table from DynamoDB Streams, or process the events using AWS Lambda.
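A minimal sketch of the parallel scan option with boto3 (the table name drivers, the attribute number and the segment count are placeholders; tune TotalSegments to your table size and read capacity):

import boto3
from concurrent.futures import ThreadPoolExecutor

TABLE_NAME = "drivers"   # hypothetical table name
TOTAL_SEGMENTS = 4       # tune to table size / provisioned read capacity

def scan_segment(segment):
    # Each thread gets its own session/resource, since boto3 sessions are not thread-safe.
    table = boto3.Session().resource("dynamodb").Table(TABLE_NAME)
    items = []
    kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
    while True:
        page = table.scan(**kwargs)           # scan only this segment of the table
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:    # no more pages in this segment
            return items
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    all_drivers = [
        item["number"]
        for segment_items in pool.map(scan_segment, range(TOTAL_SEGMENTS))
        for item in segment_items
    ]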
You can't query an entire table. Query is used to retrieve a set of items by supplying a hash key (part of the composite hash-range primary key of the table).
One cannot use query without knowing the hash keys.
EDIT: a bounty was added to this old question, which asks:
How do I get a list of hashes from DynamoDB?
Well, as of Dec 2014 you still can't ask for all the hash keys of a table via a single API call.
Even if you add a GSI, you still can't get a DISTINCT hash count.
The way I would solve this is with de-normalization. Keep another table with no range key and put every hash key there, alongside the main table. This adds housekeeping overhead at the application level (mainly when removing items), but solves the problem you asked about.
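A minimal sketch of that de-normalization idea with boto3 (the table names drivers / driver_keys and the key attributes number / range are hypothetical): every write to the main table also records the hash key in a small keys-only table that is cheap to read back.

import boto3

dynamodb = boto3.resource("dynamodb")
drivers = dynamodb.Table("drivers")          # main table (hash + range key)
driver_keys = dynamodb.Table("driver_keys")  # keys-only table (hash key only)

def put_driver(item):
    # Write the item and record its hash key in the keys table.
    drivers.put_item(Item=item)
    driver_keys.put_item(Item={"number": item["number"]})

def delete_driver(number, range_key):
    # Housekeeping overhead: remove the entry from both tables.
    drivers.delete_item(Key={"number": number, "range": range_key})
    driver_keys.delete_item(Key={"number": number})

Scanning driver_keys then yields the full list of hash keys without touching the much larger main table.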