I'm trying to copy one BigQuery table to another, within the same dataset, working off of the example at https://cloud.google.com/bigquery/docs/managing-tables#copy-table
I've created a function to do the copy, as follows:
from google.cloud import bigquery

def copy_table(source_table, dest_table):
    client = bigquery.Client()
    source_table_ref = "my_project.my_dataset." + source_table
    dest_table_ref = "my_project.my_dataset." + dest_table
    job = client.copy_table(
        source_table_ref,
        dest_table_ref)  # API request
    job.result()  # Waits for the copy job to complete
When I do the copy, the dest_table is created with the same schema as the source_table, but none of the data is copied from the source_table to the dest_table.
Here's the overall sequence of what I'm doing:
Create source_table
Insert rows into source_table
Run a query to check that the rows are in the source_table (they are -- SELECT COUNT(*) returns the right number of rows)
Copy source_table to dest_table using the function above
Run a query to check that the rows are in the dest_table (they are NOT -- SELECT COUNT(*) returns zero rows)
My guess is that somehow these jobs are being executed asynchronously, but I can't figure out how to make them execute synchronously. Any ideas would be greatly appreciated.
If it helps any, my overall goal is to create and fill a new table with the results of a daily batch job (get_user_info_2020-06-27, for example), then copy it to a get_user_info_current table that always contains the current day's user info.
Edit:
Some more info, based on testing:
On the page at https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability, it says: "Data can take up to 90 minutes to become available for copy operations". So, I coded up a little thing to go after my insert statement to wait until it completes:
import datetime
import sys
import time

from google.cloud import bigquery

def insert_table_wait(table_name, prev_rows, rows_inserted):
    client = bigquery.Client()
    table_id = "pacs-user-analysis-dev.google_users." + table_name
    # Wait until the insert fully completes: poll until the streaming buffer is gone
    # and numRows reflects the newly inserted rows.
    curr_table = client.get_table(table_id)  # Make an API request.
    sys.stderr.write(str(datetime.datetime.now()) + " " + table_name + " properties: " + str(curr_table._properties) + "\n")
    curr_rows = int(curr_table._properties.get('numRows'))
    while curr_table._properties.get('streamingBuffer') is not None or curr_rows != prev_rows + rows_inserted:
        sys.stderr.write(str(datetime.datetime.now()) + " Waiting for insert into " + str(curr_table._properties.get('id')) + " to complete. StreamingBuffer details: " + str(curr_table._properties.get('streamingBuffer')) + " prev_rows: " + str(prev_rows) + " curr_rows: " + str(curr_rows) + " should be: " + str(prev_rows + rows_inserted) + "\n")
        time.sleep(10)
        curr_table = client.get_table(table_id)
        curr_rows = int(curr_table._properties.get('numRows'))
I'm hoping that that might fix things. What I don't understand is that the new rows appear in the BigQuery Console UI almost immediately, but table._properties.get('numRows') doesn't seem to update nearly as promptly.
Since there is a limitation on copy jobs for data in the streaming buffer, I suggest that you use a query job and set a destination table, as in the code below:
from google.cloud import bigquery

client = bigquery.Client(project="your-project")

# The destination can be given as a full table ID: "project.dataset.table"
job_config = bigquery.QueryJobConfig(destination="destination-table-id")

# SELECT * to copy the whole table
sql = "SELECT * FROM <source_table>"

query_job = client.query(sql, job_config=job_config)
query_job.result()  # Waits for the query to finish
Please let me know if that helps you.
Related
I'm coding a Python script that writes query results to a BQ table. After the first run of the script, it always errors out with the following error: google.api_core.exceptions.Conflict: 409 Already Exists: Table project-id.dataset-id. I do not understand why it is attempting to create a table every time I run the script. Do I have to specify any specific parameters?
This is from Google's documentation. I'm using it as an example, on the assumption that the table has already been created. How do I stop the API from creating the same table?
from google.cloud import bigquery
# TODO(developer): Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the destination table.
table_id = "your-project.your_dataset.your_table_name"
job_config = bigquery.QueryJobConfig(destination=table_id)
sql = """
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
"""
# Start the query, passing in the extra configuration.
query_job = client.query(sql, job_config=job_config) # Make an API request.
query_job.result() # Wait for the job to complete.
print("Query results loaded to the table {}".format(table_id))
If you check the class QueryJobConfig, you will see that there is a parameter called write_disposition. As you can see in the REST API reference here, this parameter can be set to three different values:
WRITE_TRUNCATE: If the table already exists, BigQuery overwrites the table data and uses the schema from the query result.
WRITE_APPEND: If the table already exists, BigQuery appends the data to the table.
WRITE_EMPTY: If the table already exists and contains data, a 'duplicate' error is returned in the job result.
So, adding this line after the job_config definition will do the trick:
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
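Applied to the snippet from the question (reusing the client, table_id, and sql defined there), the configuration could look like this sketch:
job_config = bigquery.QueryJobConfig(destination=table_id)
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE  # overwrite any existing data

query_job = client.query(sql, job_config=job_config)  # Make an API request.
query_job.result()  # Wait for the job to complete.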
Look into the BigQuery client documentation here, in particular the exists_ok parameter.
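If you go the route of creating the table explicitly, a minimal sketch (assuming the table ID from the snippet above) might be:
from google.cloud import bigquery

client = bigquery.Client()

# exists_ok=True turns the 409 Conflict into a no-op if the table is already there.
table = bigquery.Table("your-project.your_dataset.your_table_name")
client.create_table(table, exists_ok=True)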
Hello everyone. What I am trying to do is insert data into Table A through a query, and once the data is inserted into Table A, delete the newly inserted values from A, then write the response/output into Table B, which I created.
Here is my Python code:
from google.cloud import bigquery

client = bigquery.Client()

# This is Table B, the table I am writing my deleted output to
table_id = "ntest.practice.btabletest"
job_config = bigquery.QueryJobConfig(destination=table_id)

sql2 = """
INSERT INTO `ntest.practice.atabletest` (%s) VALUES (%s);
DELETE FROM `ntest.practice.atabletest`
WHERE name = 'HEART';
""" % (columns_aaa, valueaaa)

query_job1 = client.query(sql2, job_config=job_config)  # Make an API request.
query_job1.result()  # Waits for the query to finish.
print("Query results loaded to the table {}".format(table_id))
Yet, I get an error code saying:
google.api_core.exceptions.BadRequest: 400
configuration.query.destinationTable cannot be set for scripts
Any thoughts on how to fix this error? I don't believe that my query is wrong, nor that my tables or values are incorrect.
Although BigQuery scripting doesn't support a destination table, it doesn't seem that you'll need it for your specific query.
A DELETE query never writes any data to a destination table. You may work around it by sending the INSERT first and then the DELETE as separate jobs; that way the destination table will "work" (meaning BigQuery won't complain about it), but you'll end up with an empty table.
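If it helps, here is a rough sketch of sending the two statements as separate jobs (columns_aaa and valueaaa are assumed to be defined as in the question):
from google.cloud import bigquery

client = bigquery.Client()

# Run the INSERT on its own; it writes into Table A directly, so no destination is needed.
insert_sql = "INSERT INTO `ntest.practice.atabletest` (%s) VALUES (%s);" % (columns_aaa, valueaaa)
client.query(insert_sql).result()

# Run the DELETE as its own job; it returns no rows, so no destination is set either.
delete_sql = "DELETE FROM `ntest.practice.atabletest` WHERE name = 'HEART';"
client.query(delete_sql).result()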
I am doing inserts via streaming. In the BigQuery UI, I can see the row counts for the table.
Is there a way to get that via the API? Currently, when I do:
from google.cloud import bigquery
client = bigquery.Client()
dataset = client.dataset("bqtesting")
table = client.get_table(dataset.table('table_streaming'))
table.num_rows
0
Obviously 0 is not the number that I'm looking to get. From the API documentation it says:
numRows (unsigned long): [Output-only] The number of rows of data in this table, excluding any data in the streaming buffer.
So then, my question is: how do we get the exact number of rows in a table? Currently I'm doing:
count=[item[0] for item in client.query('SELECT COUNT(*) FROM `bqtesting.table_streaming`').result()][0]
But this takes about 5s just to get the count (and I need to execute this query quite frequently to see if all streaming inserts have 'finished').
SELECT COUNT(1), SELECT COUNT(*), etc. have 0 scanned and billed bytes (you can see this in the job metadata after you run the query, or in a dry run), so you should be able to run them as often as you like.
If I'm reading the documentation correctly, the numbers there are not guaranteed to include rows in the buffer which have not yet been flushed to BigQuery storage.
You can also use the API mentioned at https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability, i.e. check the streamingBuffer.oldestEntryTime field from the tables.get result.
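For reference, the Python client exposes that information directly on the Table object; something along these lines (table ID taken from the question) should work:
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("bqtesting.table_streaming")  # tables.get under the hood

buf = table.streaming_buffer  # None once the buffer has been flushed
if buf is not None:
    print("Estimated rows still in the buffer:", buf.estimated_rows)
    print("Oldest entry time:", buf.oldest_entry_time)
else:
    print("No active streaming buffer; table.num_rows =", table.num_rows)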
You can use the __TABLES__ metadata table to get the information that you want. Querying __TABLES__ incurs no charges.
The query that you need is:
SELECT table_id, row_count, size_bytes
FROM `your-project-name.bqtesting.__TABLES__`
WHERE STARTS_WITH(table_id, "table_streaming")
ORDER BY table_id DESC
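If you want the same result from Python, a small sketch along these lines should do it (project and dataset names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT table_id, row_count, size_bytes
FROM `your-project-name.bqtesting.__TABLES__`
WHERE STARTS_WITH(table_id, "table_streaming")
ORDER BY table_id DESC
"""

for row in client.query(sql).result():
    print(row.table_id, row.row_count, row.size_bytes)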
Here is a code snippet I used as a reference from GCP documentation:
job_config = bigquery.QueryJobConfig()
# Set the destination table
table_ref = client.dataset(dataset_id).table('your_table_id')
job_config.destination = table_ref
sql = """
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
"""
# Start the query, passing in the extra configuration.
query_job = client.query(
sql,
# Location must match that of the dataset(s) referenced in the query
# and of the destination table.
location='US',
job_config=job_config) # API request - starts the query
query_job.result() # Waits for the query to finish
print('Query results loaded to table {}'.format(table_ref.path))
This works fine, but if the table already exists, it spits out an error. I know how to delete the table first, but I am wondering if there is a way to make this overwrite the table in an atomic fashion so that the table always exists.
You can control how results are persisted through a combination of the create_disposition and write_disposition settings. The Python library exposes these options in QueryJobConfig, and links to more details from the REST API docs.
For queries, the default write disposition is WRITE_EMPTY, which causes a failure if the table already exists. Switching that to WRITE_TRUNCATE should give you the atomic replacement of data that you're looking for.
TL;DR:
just add this to your job config:
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
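In the context of the snippet above (reusing its client, table_ref, and sql), a sketch of the full configuration, combining both dispositions mentioned earlier, might look like:
job_config = bigquery.QueryJobConfig()
job_config.destination = table_ref
job_config.create_disposition = bigquery.CreateDisposition.CREATE_IF_NEEDED  # create on the first run
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE      # replace the data on later runs

query_job = client.query(sql, job_config=job_config, location='US')
query_job.result()  # Waits for the query to finish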
I'm using Python to hit the BigQuery API. I've been successful at running queries and writing new tables, but would like to ensure those output tables are partitioned per https://cloud.google.com/bigquery/docs/creating-partitioned-tables
The output of the query would have the columns: event_date[string in the format "2017-11-12"], metric[integer]
Per the code below, I've been assigning the "partitioning_type" property to various objects, but it never returns an error.
(I guess it'd also be useful to know how to tell if my partitioning efforts are actually working, i.e. how to identify the _PARTITIONTIME pseudo column.)
import uuid

from google.cloud import bigquery

client = bigquery.Client()
dest_table_id = "BQresults"

query_job = client.run_async_query(str(uuid.uuid4()), query)
query_job.allow_large_results = True

dest_dataset = client.dataset(dest_dataset_id)
dest_table = dest_dataset.table(dest_table_id)
dest_table.partitioning_type = "DAY"

query_job.destination = dest_table
query_job.write_disposition = 'WRITE_TRUNCATE'
query_job.use_legacy_sql = False

query_job.begin()
query_job.result()
If you want to check whether the table is partitioned on a time column, use the get_table() method and check the partitioning_type property of the returned object.
You can check for integer-range partitioning by checking the range_partitioning property. You can also get the job object using get_job() with the job ID and check whether time_partitioning was set in the configuration.
I don't think the query job you're running results in a partitioned table, since time_partitioning should be set in the job configuration, and it seems like the client doesn't do this here. If that's the case, you can create the partitioned table first and use the existing table as the destination.
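As a rough sketch with the current client (table and dataset names below are placeholders, not from the question), setting time_partitioning on the job config and then verifying it looks like this:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="your-project.your_dataset.BQresults",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    time_partitioning=bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY),
)

sql = "SELECT event_date, metric FROM `your-project.your_dataset.source_table`"
client.query(sql, job_config=job_config).result()

# Verify the result: get_table() and inspect the partitioning of the destination table.
table = client.get_table("your-project.your_dataset.BQresults")
print(table.time_partitioning)   # e.g. TimePartitioning(type_=DAY)
print(table.partitioning_type)   # "DAY"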