We're doing streaming inserts on a BigQuery table.
We want to update the schema of a table without changing its name.
For example, we want to drop a column because it has sensitive data but we want to keep all the other data and the table name the same.
Our process is as follows:
copy original table to temp table
delete original table
create new table with original table name and new schema
populate new table with old table's data
cry because the last (up to) 90 minutes of data is stuck in streaming buffer and was not transferred.
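In code, our process looks roughly like this (a minimal sketch with the google.cloud.bigquery Python client; the table IDs and the dropped column name are illustrative):

from google.cloud import bigquery

client = bigquery.Client()
orig = "project.dataset.events"       # illustrative table IDs
temp = "project.dataset.events_temp"

client.copy_table(orig, temp).result()   # 1. copy original table to temp table
client.delete_table(orig)                # 2. delete original table
new_schema = [f for f in client.get_table(temp).schema if f.name != "sensitive_col"]
client.create_table(bigquery.Table(orig, schema=new_schema))  # 3. recreate with new schema
client.query(                            # 4. populate new table with old table's data
    "INSERT INTO `{}` SELECT * EXCEPT(sensitive_col) FROM `{}`".format(orig, temp)
).result()
# 5. whatever was still in the original table's streaming buffer is lost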
How to avoid the last step?
I believe the new streaming API does not use the streaming buffer anymore. Instead, it writes data directly to the destination table.
To enable the API you have to enroll via the BigQuery Streaming V2 Beta Enrollment Form.
You can find out more at the following link.
I hope it addresses your case.
I've created a cloud function using Python that receives some data and inserts it into a BigQuery table. Currently, it uses the insert_rows method to insert a new row.
errors = client.insert_rows(table, row_to_insert)  # API request; returns a list of per-row insert errors (empty on success)
The problem is that I already have data with unique primary keys in the table, and I just need one measurement value to be updated in those rows.
I would like it to update or replace that row instead of creating a new one (assuming the primary keys in the table and input data match). Is this possible?
BigQuery is not designed for transactional data; it prefers append-only workloads. Please refer to the documentation on BigQuery DML quotas. That means you can only apply a limited number of DML statements to a table per day.
Frequent row-level updates will not work well on BQ tables.
Recommended solution:
Create 2 tables (T1 & T2).
Insert all transactional records into table T1 through your existing function.
Then write a BQ SQL query to read the most recent record per key from T1 and insert those records into T2.
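A minimal sketch of that step with the google.cloud.bigquery Python client; the dataset, table, and column names (dataset.T1, dataset.T2, id, updated_at) are illustrative, and it rebuilds T2 as the latest record per key:

from google.cloud import bigquery

client = bigquery.Client()

# Rebuild T2 with only the most recent record per primary key from T1.
# Dataset, table, and column names are illustrative.
dedup_sql = """
CREATE OR REPLACE TABLE dataset.T2 AS
SELECT * EXCEPT(rn)
FROM (
  SELECT t.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM dataset.T1 AS t
)
WHERE rn = 1
"""
client.query(dedup_sql).result()  # wait for the job to finish

Run on a schedule (for example as a scheduled query), this keeps T2 reflecting the latest measurement value per key.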
Consider the following scenario:
Incremental data gets ingested daily into an HDFS location, and from there I have to read the data using pyspark and find out the latest/active records.
Also, I have to handle schema changes in the data, as new fields may get added.
How can I achieve schema comparison and handle schema changes in pyspark?
How can I handle data which got loaded before the schema changes?
Is the approach below a good one?
Generate a script to create Hive tables on top of the HDFS location.
Then compare the schema of the source data and the Hive table using pyspark. If there is a schema change, use the new schema from the source to create the new DDL for table creation. Drop the existing table and recreate it with the new schema.
Create a view on the Hive tables to get the latest records using the primary key and an audit column.
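A minimal pyspark sketch of the schema comparison and latest-record selection I have in mind; the Hive table name (events), primary key (id), audit column (load_ts), and input path are illustrative:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the day's incremental load; path and format are illustrative.
incoming = spark.read.parquet("hdfs:///data/incoming/latest")

# Compare field names with the existing Hive table (name is illustrative).
existing = spark.table("events")
if set(incoming.schema.fieldNames()) != set(existing.schema.fieldNames()):
    # Schema drift: recreate the table with the new schema. Data loaded before
    # the change would need to be reloaded or backfilled with NULLs for new fields.
    spark.sql("DROP TABLE IF EXISTS events")
    incoming.write.saveAsTable("events")
else:
    incoming.write.insertInto("events")

# View exposing the latest record per primary key, ordered by the audit column.
w = Window.partitionBy("id").orderBy(F.col("load_ts").desc())
(spark.table("events")
     .withColumn("rn", F.row_number().over(w))
     .filter("rn = 1")
     .drop("rn")
     .createOrReplaceTempView("events_latest"))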
Background
I studied this and found that BigQuery doesn't accept schemas defined by online tools (which have different formats, even though the meaning is the same).
So my problem is that I want to load data (where the number of columns keeps varying and increasing dynamically) into a table which has a fixed schema.
Thoughts
What I could do as a workaround is:
First check if the data being loaded has extra fields.
If it has, a schema mismatch will occur, so first I create a temporary table in BQ and load this data into it using the "autodetect" parameter, which gives me a schema (in a format which BQ accepts for schema files).
Now I can download this schema file and use it to update my existing table in BQ and load it with the appropriate data.
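A minimal sketch of that workaround with the google.cloud.bigquery Python client; the table IDs and file name are illustrative, and it assumes newline-delimited JSON input:

from google.cloud import bigquery

client = bigquery.Client()

# 1. Load the incoming file into a temporary table with schema autodetection.
tmp_table_id = "project.dataset.tmp_autodetect"   # illustrative
load_config = bigquery.LoadJobConfig(
    autodetect=True,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)
with open("incoming.json", "rb") as f:
    client.load_table_from_file(f, tmp_table_id, job_config=load_config).result()

# 2. The autodetected schema is now available in a format BQ accepts.
detected_schema = client.get_table(tmp_table_id).schema

# 3. Add any new fields to the existing table before loading the data into it.
target = client.get_table("project.dataset.target")  # illustrative
existing_names = {field.name for field in target.schema}
target.schema = list(target.schema) + [
    field for field in detected_schema if field.name not in existing_names
]
client.update_table(target, ["schema"])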
Suggestion
Any thoughts on this? If there is a better approach, please share.
We are in the process of releasing a new feature that can update the schema of the destination table within a load/query job. With autodetect and the new feature you can directly load the new data to the existing table, and the schema will be updated as part of the load job. Please stay tuned. The current ETA is 2 weeks.
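For reference, the Python client exposes schema updates during a load via schema_update_options; a minimal sketch assuming the feature is available, with an illustrative table ID and file name:

from google.cloud import bigquery

client = bigquery.Client()

# Load new data straight into the existing table; columns not yet in the table
# are added to its schema as part of the load job. Names are illustrative.
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)
with open("incoming.json", "rb") as f:
    client.load_table_from_file(
        f, "project.dataset.target", job_config=job_config
    ).result()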
I'd like to create a snapshot of a database periodically, and execute some queries on the snapshot data to generate data for the next step. Finally, I want to discard the snapshot.
Currently I read all the data from the database and convert it into an in-memory data structure (a Python dict), then execute queries (implemented by my own code) on that structure.
The program has a bottleneck at the "execute query" step now that the data size has increased.
How can I query the snapshot data elegantly? Thanks for any advice.
You can get all tables from your database with
SHOW TABLES FROM <yourDBname>
After that you can create copies of the tables in a new DB via
CREATE TABLE copy.tableA AS SELECT * FROM <yourDBname>.tableA
Afterwards you can query the copy database instead of the real data.
If you run queries on the copied tables, please add indexes, since they are not copied by CREATE TABLE ... AS SELECT.
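A minimal Python sketch of that copy loop, assuming a MySQL-style server reachable with pymysql and an already-created copy database; connection details and names are illustrative:

import pymysql

# Connection parameters and database names are illustrative.
conn = pymysql.connect(host="localhost", user="user", password="secret")
source_db, copy_db = "yourDBname", "copy"

with conn.cursor() as cur:
    cur.execute("SHOW TABLES FROM " + source_db)
    tables = [row[0] for row in cur.fetchall()]
    for table in tables:
        # Recreate the snapshot copy; indexes are NOT carried over and should be
        # added separately for the queries you plan to run.
        cur.execute("DROP TABLE IF EXISTS {}.{}".format(copy_db, table))
        cur.execute("CREATE TABLE {}.{} AS SELECT * FROM {}.{}".format(
            copy_db, table, source_db, table))
conn.commit()

When the snapshot is no longer needed, dropping the copy database (or its tables) discards it.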
Folks,
To retrieve all items from a DynamoDB table, I would like to replace the scan operation with a query.
Currently I am pulling in all the table's data via the following (python):
from boto.dynamodb2.table import Table  # assuming the legacy boto 2 dynamodb2 API

drivertable = Table(url['dbname'])
all_drivers = []
all_drivers_query = drivertable.scan()  # full table scan
for x in all_drivers_query:
    all_drivers.append(x['number'])
How would I change this to use the query API?
Thanks!
There is no way to query and get the entire contents of the table. As of right now, you have a few options if you want to get all of your data out of a DynamoDB table, and all of them involve actually reading the data out of DynamoDB:
Scan the table. It can be done faster, at the expense of using much more read capacity, by using a parallel scan (a paginated-scan sketch follows after this list).
Export your data using AWS Data Pipelines. You can configure the export job for where and how it should store your data.
Use one of the AWS event platforms for new data and denormalize it. For all new data you can get a time-ordered stream of all updates to the table from DynamoDB Streams, or process events using AWS Lambda.
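For the first option, a minimal boto3 sketch of a plain paginated scan; the table name is illustrative, and a parallel scan would additionally pass the Segment and TotalSegments parameters:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("drivers")  # table name is illustrative

items = []
response = table.scan()
items.extend(response["Items"])
# Each Scan call returns at most 1 MB of data; keep paging until
# LastEvaluatedKey is no longer present in the response.
while "LastEvaluatedKey" in response:
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])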
You can't query an entire table. Query is used to retrieve a set of items by supplying a hash key (part of the composite hash-range primary key of the table).
One cannot use Query without knowing the hash keys.
EDIT: as a bounty was added to this old question that asks:
How do I get a list of hashes from DynamoDB?
Well, as of Dec 2014 you still can't get all the hash keys of a table via a single API call.
Even if you add a GSI, you still can't get a DISTINCT hash count.
The way I would solve this is with denormalization: keep another table with no range key and write every hash key there alongside the main table. This adds housekeeping overhead at the application level (mainly when deleting items), but solves the problem you asked about.
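A minimal boto3 sketch of that denormalization; the table names (drivers, driver_keys) and the key attribute (number) are illustrative:

import boto3

dynamodb = boto3.resource("dynamodb")
main_table = dynamodb.Table("drivers")      # main table with a hash-range key
keys_table = dynamodb.Table("driver_keys")  # bookkeeping table, hash key only

def put_driver(item):
    # Write to the main table and record its hash key in the bookkeeping table.
    main_table.put_item(Item=item)
    keys_table.put_item(Item={"number": item["number"]})

def all_hash_keys():
    # One item per hash key, so scanning the bookkeeping table stays cheap.
    keys, response = [], keys_table.scan()
    keys.extend(i["number"] for i in response["Items"])
    while "LastEvaluatedKey" in response:
        response = keys_table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        keys.extend(i["number"] for i in response["Items"])
    return keys

Deleting an item from the main table should also delete its key from driver_keys; that is the housekeeping overhead mentioned above.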