I'm coding a Python script that writes query results to a BigQuery table. After the first run, the script always errors out with the following error: google.api_core.exceptions.Conflict: 409 Already Exists: Table project-id.dataset-id. I do not understand why it is attempting to create the table every time I run the script. Do I have to specify any specific parameters?
This is from Google's documentation. I'm using it as an example, working under the assumption that the table has already been created. Where do I stop the API from creating the same table?
from google.cloud import bigquery
# TODO(developer): Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the destination table.
table_id = "your-project.your_dataset.your_table_name"
job_config = bigquery.QueryJobConfig(destination=table_id)
sql = """
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
"""
# Start the query, passing in the extra configuration.
query_job = client.query(sql, job_config=job_config) # Make an API request.
query_job.result() # Wait for the job to complete.
print("Query results loaded to the table {}".format(table_id))
If you check the class QueryJobConfig, you will see that it has a parameter called write_disposition. As you can see in the REST API reference here, this parameter can be set to one of three values:
WRITE_TRUNCATE: If the table already exists, BigQuery overwrites the table data and uses the schema from the query result.
WRITE_APPEND: If the table already exists, BigQuery appends the data to the table.
WRITE_EMPTY: If the table already exists and contains data, a 'duplicate' error is returned in the job result.
So, adding this line after the job_config definition will do the trick:
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
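For example, a minimal sketch of the snippet above with the disposition set in the constructor instead (same public dataset, placeholder table_id):
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table_name"

# WRITE_TRUNCATE overwrites the existing destination table instead of raising 409 Already Exists
job_config = bigquery.QueryJobConfig(
    destination=table_id,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = """
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus
"""
query_job = client.query(sql, job_config=job_config)  # Make an API request.
query_job.result()  # Wait for the job to complete.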
Also, look into the BigQuery client documentation here, specifically the exists_ok parameter.
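For reference, exists_ok applies when you create the table explicitly with create_table rather than letting the query create it. A small sketch, assuming the same placeholder table_id:
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table_name"

# exists_ok=True turns the 409 Already Exists conflict into a no-op
client.create_table(bigquery.Table(table_id), exists_ok=True)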
I am writing JSON records into a BigQuery table using the function bq.insert_rows_json(f'{project}.{dataset}.{table_name}', rows_to_insert). This operation is done in INSERT mode. I was wondering if I could use the same function in UPSERT mode instead. Is that possible? I checked the documentation here but did not find an argument for that.
I can't seem to find a built-in UPSERT function for Python. However, you may consider the approach below, which is derived from the comment of @Mr.Nobody.
from google.cloud import bigquery
client = bigquery.Client()
# Upsert: update rows in the main table that match on int64_field_0, insert the rest
query_job = client.query(
    """
    MERGE `my-dataset.json_table` T
    USING `my-dataset.json_table_source` S
    ON T.int64_field_0 = S.int64_field_0
    WHEN MATCHED THEN
      UPDATE SET string_field_1 = S.string_field_1
    WHEN NOT MATCHED THEN
      INSERT (int64_field_0, string_field_1) VALUES (int64_field_0, string_field_1)
    """
)
results = query_job.result() # Waits for job to complete.
In this approach, you first need to ingest all of your "updated" JSON data into a staging table before upserting it into your main BigQuery table. The query then matches each row against the main table: if the primary ID (the uniqueness check) is already there, the query does an UPDATE; if not, it does an INSERT.
(Screenshots in the original answer show the Main and Source tables before running the Python code, and the Main table after the code finished executing.)
Conclusion: the row with int64_field_0 = 4 was updated (from version 1.0.0 to 6.5.1) because it already exists in the Main table, while the row with int64_field_0 = 5 was inserted because it did not yet exist there.
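For completeness, a minimal sketch of the staging step described above, i.e. writing the incoming "updated" JSON rows into the source table with insert_rows_json before running the MERGE (table names come from the example above; the sample values are illustrative):
from google.cloud import bigquery

client = bigquery.Client()

# Land the incoming JSON records in the staging/source table first
rows_to_insert = [
    {"int64_field_0": 4, "string_field_1": "6.5.1"},  # existing ID -> will be UPDATEd
    {"int64_field_0": 5, "string_field_1": "1.2.3"},  # new ID -> will be INSERTed
]
errors = client.insert_rows_json("my-dataset.json_table_source", rows_to_insert)
if errors:
    raise RuntimeError(f"insert_rows_json returned errors: {errors}")
# ...then run the MERGE above to upsert these rows into my-dataset.json_table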
I am using Airflow's BigQueryOperator to populate the BQ table with write_disposition='WRITE_TRUNCATE'. The problem is that every time the task runs, it alters the table schema and also the column mode from Required to Nullable. The create_disposition I am using is 'CREATE_NEVER'. Since my tables are pre-created, I don't want the schemas or column modes to be altered. Using write_disposition='WRITE_APPEND' fixes the issue but my requirement is to use WRITE_TRUNCATE. Any idea why BigQueryOperator alters the schema and mode?
I had a similar issue, not with the REQUIRED/NULLABLE schema mode, but with policy tags, and the behavior is the same: the policy tags are overridden (and lost). Here is the answer from the Google support team:
If you overwrite to a destination table, any existing policy tags are removed from the table, unless you use the --destination_schema flag to specify a schema with policy tags.
For WRITE_TRUNCATE, the disposition overwrites the existing table and the schema. If you want to keep the policy tags, you can use "--destination_schema" to specify a schema with policy tags.
However, in my tests in Python, I observed two different behaviors between a QueryJob (a job that runs a SQL query and sinks the result into a table) and a LoadJob (a job that loads data from a file and sinks it into a table).
If you perform a LoadJob:
Remove the schema autodetection
Get the schema of the original table
Perform the load job
Like this in Python:
# `uri` is the source file URI (e.g. a gs:// path) and `table` is the destination table ID
job_config = bigquery.job.LoadJobConfig()
job_config.create_disposition = bigquery.job.CreateDisposition.CREATE_IF_NEEDED
job_config.write_disposition = bigquery.job.WriteDisposition.WRITE_TRUNCATE
job_config.skip_leading_rows = 1
# job_config.autodetect = True  # schema autodetection stays disabled
job_config.schema = client.get_table(table).schema  # reuse the existing table's schema
load_job = client.load_table_from_uri(uri, table, job_config=job_config)
res = load_job.result()
This solution of copying the schema doesn't work with a QueryJob.
The workaround is the following (it works for both LoadJob and QueryJob):
Truncate the table
Perform a job in WRITE_EMPTY mode
Trade-off:
WRITE_TRUNCATE is atomic: if the write fails, the data aren't truncated
the workaround takes 2 steps: if the write fails, the data are already deleted
config = bigquery.job.QueryJobConfig()
config.create_disposition = bigquery.job.CreateDisposition.CREATE_IF_NEEDED
config.write_disposition = bigquery.job.WriteDisposition.WRITE_EMPTY
# Copying the schema doesn't work for a QueryJob:
# config.schema = client.get_table(table).schema
config.destination = table
# Step 1: truncate the table
query_job = client.query(f'TRUNCATE TABLE `{table}`')
res = query_job.result()
# Step 2: load the new data (`request` holds the SQL query for the new data)
query_job = client.query(request, job_config=config)
res = query_job.result()
All of this is to tell you that the BigQuery operator in Airflow isn't the problem; it's a BigQuery issue. You have a workaround to achieve what you want.
I'm trying to copy one BigQuery table to another, within the same dataset, working off of the example at https://cloud.google.com/bigquery/docs/managing-tables#copy-table
I've created a function to do the copy, as follows:
def copy_table(source_table, dest_table):
    client = bigquery.Client()
    source_table_ref = "my_project.my_dataset." + source_table
    dest_table_ref = "my_project.my_dataset." + dest_table
    job = client.copy_table(
        source_table_ref,
        dest_table_ref)  # API request
    job.result()
But when I do the copy, the dest_table is created, with the same schema as the source_table, but none of the data is copied from the source_table to the dest_table.
Here's the overall sequence of what I'm doing:
Create source_table
Insert rows into source_table
Do query to check that the rows are in the source_table (they are -- SELECT COUNT(*) returns the right number of rows)
Copy source_table to dest_table using the function above
Do query to check that the rows are in the dest_table (they are NOT -- SELECT COUNT(*) returns zero rows)
My guess is that somehow these jobs are being executed asynchronously, but I can't figure out how to make them execute synchronously. Any ideas would be greatly appreciated.
If it helps any, my overall goal is to create and fill a new table with the results of a daily batch job (get_user_info_2020-06-27, for example), then copy it to a get_user_info_current table that always contains the current day's user info.
Edit:
Some more info, based on testing:
On the page at https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability, it says: "Data can take up to 90 minutes to become available for copy operations". So, I coded up a little helper to run after my insert statement and wait until the insert completes:
import datetime
import sys
import time

from google.cloud import bigquery

def insert_table_wait(table_name, prev_rows, rows_inserted):
    client = bigquery.Client()
    table_id = "pacs-user-analysis-dev.google_users." + table_name
    # Wait until the insert fully completes: the streaming buffer must drain and the
    # reported row count must reach prev_rows + rows_inserted.
    curr_table = client.get_table(table_id)  # Make an API request.
    sys.stderr.write(str(datetime.datetime.now()) + " " + table_name + " properties: " + str(curr_table._properties) + "\n")
    curr_rows = int(curr_table._properties.get('numRows'))
    while curr_table._properties.get('streamingBuffer') is not None or curr_rows != prev_rows + rows_inserted:
        sys.stderr.write(str(datetime.datetime.now()) + " Waiting for insert into " + str(curr_table._properties.get('id')) + " to complete. StreamingBuffer details: " + str(curr_table._properties.get('streamingBuffer')) + " prev_rows: " + str(prev_rows) + " curr_rows: " + str(curr_rows) + " should be: " + str(prev_rows + rows_inserted) + "\n")
        time.sleep(10)
        curr_table = client.get_table(table_id)
        curr_rows = int(curr_table._properties.get('numRows'))
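A possible call site after the insert step, for illustration only (the row counts here are placeholders):
# Block until the daily batch table reflects the rows just inserted
prev_rows = 0          # row count before the insert (placeholder)
rows_inserted = 1000   # number of rows just inserted (placeholder)
insert_table_wait("get_user_info_2020-06-27", prev_rows, rows_inserted)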
I'm hoping that this might fix things. What I don't understand is that the new rows appear in the BigQuery console UI almost immediately, but table._properties.get('numRows') doesn't seem to update nearly as promptly.
Since there is a limitation in copy jobs for data that is still in the streaming buffer, I suggest that you use a query job and set a destination table, just like in the code below:
from google.cloud import bigquery
client = bigquery.Client(project = "your-project")
# destination must be a fully qualified table ID, e.g. "your-project.your_dataset.your_table"
job_config = bigquery.QueryJobConfig(destination="destination-table-id")
# SELECT * to copy the whole table
sql = "SELECT * FROM <source_table>"
query_job = client.query(sql, job_config=job_config)
query_job.result()
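If the idea is to refresh get_user_info_current from each day's batch table, a write disposition can also be set so repeated runs overwrite the previous day's data. A minimal sketch, with placeholder project and dataset names:
from google.cloud import bigquery

client = bigquery.Client(project="your-project")

# WRITE_TRUNCATE lets the same destination be reused on every daily run
job_config = bigquery.QueryJobConfig(
    destination="your-project.your_dataset.get_user_info_current",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = "SELECT * FROM `your-project.your_dataset.get_user_info_2020-06-27`"
client.query(sql, job_config=job_config).result()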
Please let me know if that helps you
Hello everyone. What I am trying to do: through a query, insert data into Table A; once the data is inserted into Table A, delete the newly inserted values from A; then write the response/output into Table B, which I created.
Here is my python code :
client = bigquery.Client()
# Table B: the table I am writing my deleted output to
table_id = "ntest.practice.btabletest"
job_config = bigquery.QueryJobConfig(destination=table_id)
# columns_aaa and valueaaa are built elsewhere in the script
sql2 = """
INSERT INTO `ntest.practice.atabletest` (%s) VALUES (%s);
DELETE FROM `ntest.practice.atabletest`
WHERE name = 'HEART'
""" % (columns_aaa, valueaaa)
query_job1 = client.query(sql2,job_config=job_config) # Make an API request.
query_job1.result() # Waits for query to finish
print("Query results loaded to the table {}".format(table_id))
Yet, I get an error code saying:
google.api_core.exceptions.BadRequest: 400
configuration.query.destinationTable cannot be set for scripts
Any thoughts on how to fix this error? I don't believe that my query is wrong, nor that my tables or values are incorrect.
Although BigQuery scripting doesn't support a destination table, it doesn't seem that you'll need one for your specific query.
A DELETE query never writes any data to a destination table. You may work around the error by sending the INSERT first and then the DELETE; this way the destination table will "work" (meaning BigQuery won't complain about it), but you'll end up with an empty table.
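If the goal is still to capture the deleted rows in Table B, one option (not the only one) is to split the script into separate jobs: after the INSERT has run, select the rows that are about to be deleted into Table B (a plain SELECT can have a destination table), then run the DELETE on its own. A sketch using the table names from the question:
from google.cloud import bigquery

client = bigquery.Client()

# Step 1: copy the rows that will be deleted into Table B (a single SELECT,
# so a destination table is allowed here)
job_config = bigquery.QueryJobConfig(
    destination="ntest.practice.btabletest",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
client.query(
    "SELECT * FROM `ntest.practice.atabletest` WHERE name = 'HEART'",
    job_config=job_config,
).result()

# Step 2: run the DELETE as its own job, with no destination table configured
client.query(
    "DELETE FROM `ntest.practice.atabletest` WHERE name = 'HEART'"
).result()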
Here is a code snippet I used as a reference from GCP documentation:
job_config = bigquery.QueryJobConfig()
# Set the destination table
table_ref = client.dataset(dataset_id).table('your_table_id')
job_config.destination = table_ref
sql = """
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
"""
# Start the query, passing in the extra configuration.
query_job = client.query(
sql,
# Location must match that of the dataset(s) referenced in the query
# and of the destination table.
location='US',
job_config=job_config) # API request - starts the query
query_job.result() # Waits for the query to finish
print('Query results loaded to table {}'.format(table_ref.path))
This works fine, but if the table already exists, it spits out an error. I know how to delete the table first, but I am wondering if there is a way to make it overwrite the table in an atomic fashion so that the table always exists.
You can control how results are persisted through a combination of setting the create_disposition and write_disposition. The python library exposes these options in QueryJobConfig, and links to more details from the REST API docs.
For queries, the default write disposition is WRITE_EMPTY, which causes a failure if the table already exists. Switching that to WRITE_TRUNCATE should give you the atomic replacement of data that you're looking for.
TL;DR:
just add this to your job config:
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
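For example, building on the snippet above (reusing its client, table_ref and sql), a sketch that sets both dispositions before starting the query:
job_config = bigquery.QueryJobConfig()
job_config.destination = table_ref
# Create the table if it is missing, and atomically replace its contents otherwise
job_config.create_disposition = bigquery.CreateDisposition.CREATE_IF_NEEDED
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

query_job = client.query(sql, location='US', job_config=job_config)
query_job.result()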