BigQuery: add new column to existing tables using Python BQ API

Related question: Bigquery add columns to table schema using BQ command line tools
I want to add a new column to existing tables (update the existing table's schema) in BigQuery using BigQuery Python API.
However, my code does not seem to work.
Here's my code:
import httplib2
from googleapiclient.discovery import build
from oauth2client import tools
from oauth2client.client import flow_from_clientsecrets
from oauth2client.file import Storage

# authorize with OAuth2 and build the BigQuery service object
flow = flow_from_clientsecrets('secret_key_path', scope='my_scope')
storage = Storage('CREDENTIAL_PATH')
credentials = storage.get()
if credentials is None or credentials.invalid:
    credentials = tools.run_flow(flow, storage, tools.argparser.parse_args([]))
http = httplib2.Http()
http = credentials.authorize(http)
bigquery_service = build('bigquery', 'v2', http=http)

tbObject = bigquery_service.tables()
query_body = {'schema': {'name': 'new_column_name', 'type': 'STRING'}}
tbObject.update(projectId='projectId', datasetId='datasetId', tableId='tableId', body=query_body).execute()
It returns a "Provided schema doesn't match existing table's schema" error.
Can anyone give me a working Python example?
Many thanks!

Based on Mikhail Berlyant's comments, I have to pass the existing table's schema along with the new field (column) to the update() method in order to update the existing table's schema.
A Python code example is given below:
...
tbObject = bigquery_service.tables()
# get current table schema
table_data = tbObject.get(projectId=projectId, datasetId=datasetId, tableId=tableId).execute()
schema = table_data.get('schema')
new_column = {'name': 'new_column_name', 'type': 'STRING'}
# append new field to current table's schema
schema.get('fields').append(new_column)
query_body = {'schema': schema}
tbObject.update(projectId=projectId, datasetId=datasetId, tableId=tableId, body=query_body).execute()
Also, there is no way to set a value for the new column in existing rows. Thanks to Mikhail Berlyant's suggestion, the way to set values for existing rows is to create a separate table holding the new column's values together with a join key, and join the existing table with that table to replace the table with the old schema.
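For reference, the same schema change can be made with the newer google-cloud-bigquery client; this is a minimal sketch, assuming a hypothetical table ID my_project.my_dataset.my_table and a new STRING column new_column_name:
from google.cloud import bigquery

client = bigquery.Client()

# hypothetical table ID and column name
table = client.get_table("my_project.my_dataset.my_table")

# copy the existing schema and append the new field
new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField("new_column_name", "STRING"))

table.schema = new_schema
client.update_table(table, ["schema"])  # only the schema property is updated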

Summary of my comments (as I've got some minutes now for this):
the whole schema (along with the new field) needs to be supplied to the API
the new field will be added with NULL for existing rows; there is no way to set a value
you can have some logic in the queries you run against this table to compensate for this, or you can have a separate table with just this new field and some key, and join your existing table with that new table to get this field (see the sketch below)
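A minimal sketch of that join-based workaround, assuming a hypothetical key column key_column and a hypothetical side table new_column_values holding the key plus the new field:
from google.cloud import bigquery

client = bigquery.Client()

# hypothetical table and column names
sql = """
SELECT t.*, n.new_column_name
FROM `my_project.my_dataset.existing_table` AS t
LEFT JOIN `my_project.my_dataset.new_column_values` AS n
USING (key_column)
"""
for row in client.query(sql).result():
    print(row)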

Related

BigQuery Python client: get column-based partitioning column name

I am looking to include a WHERE clause in an existing query to exclude data that is still in the BigQuery streaming buffer.
To do this I would like to get the partition column name so I can add
WHERE partition_column IS NOT NULL;
to my existing query.
I have been looking at the CLI and the get_table method, however that just returns the value of the column, not the column name.
I get the same when searching .INFORMATION_SCHEMA.PARTITIONS: it returns a partition_id field, but I would prefer the column name itself. Is there a way to get this?
Additionally, the table is set up with column-based partitioning.
Based on the Python BigQuery client documentation, use the time_partitioning attribute:
from google.cloud import bigquery

client = bigquery.Client()
bq_table = client.get_table('my_partitioned_table')
bq_table.time_partitioning        # TimePartitioning(field='column_name', type_='DAY')
bq_table.time_partitioning.field  # 'column_name'
A small tip: if you don't know where to search, print the API representation:
bq_table.to_api_repr()
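Building on that, a minimal sketch (with a hypothetical table ID) that reads the partitioning field and plugs it into the WHERE clause from the question to exclude the streaming buffer:
from google.cloud import bigquery

client = bigquery.Client()

# hypothetical partitioned table
table = client.get_table('my_project.my_dataset.my_partitioned_table')
partition_column = table.time_partitioning.field

sql = f"""
SELECT *
FROM `my_project.my_dataset.my_partitioned_table`
WHERE {partition_column} IS NOT NULL
"""
rows = client.query(sql).result()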

BigQueryOperator changes the table schema and column modes when write_disposition='WRITE_TRUNCATE'

I am using Airflow's BigQueryOperator to populate the BQ table with write_disposition='WRITE_TRUNCATE'. The problem is that every time the task runs, it alters the table schema and also the column mode from Required to Nullable. The create_disposition I am using is 'CREATE_NEVER'. Since my tables are pre-created, I don't want the schemas or column modes to be altered. Using write_disposition='WRITE_APPEND' fixes the issue but my requirement is to use WRITE_TRUNCATE. Any idea why BigQueryOperator alters the schema and mode?
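For context, the setup being described boils down to something like this minimal sketch, assuming the contrib-era BigQueryOperator and hypothetical DAG, SQL, and table names:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

with DAG('populate_bq_table', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    populate_table = BigQueryOperator(
        task_id='populate_table',
        sql='SELECT * FROM `my_project.my_dataset.source_table`',  # hypothetical query
        destination_dataset_table='my_project.my_dataset.target_table',  # pre-created table
        write_disposition='WRITE_TRUNCATE',
        create_disposition='CREATE_NEVER',
        use_legacy_sql=False,
    )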
I had a similar issue, not with the REQUIRED/NULLABLE schema mode but with policy tags, and the behavior is the same: the policy tags are overridden (and lost). Here is the answer from the Google support team:
If you overwrite to a destination table, any existing policy tags are removed from the table, unless you use the --destination_schema flag to specify a schema with policy tags.
For WRITE_TRUNCATE, the disposition overwrites the existing table and the schema. If you want to keep the policy tags, you can use "--destination_schema" to specify a schema with policy tags.
However, in my tests in Python, I observed two different behaviors between the QueryJob (a job based on a SQL query that sinks the result into a table) and the LoadJob (a job that loads data from a file and sinks the data into a table).
If you perform a LoadJob:
1. Remove the schema autodetect
2. Get the schema of the original table
3. Perform the load job
Like this in Python:
from google.cloud import bigquery

client = bigquery.Client()
# `table` is the destination table ID and `uri` is the GCS path of the file to load

job_config = bigquery.job.LoadJobConfig()
job_config.create_disposition = bigquery.job.CreateDisposition.CREATE_IF_NEEDED
job_config.write_disposition = bigquery.job.WriteDisposition.WRITE_TRUNCATE
job_config.skip_leading_rows = 1
# job_config.autodetect = True  # schema autodetection removed
job_config.schema = client.get_table(table).schema  # reuse the existing table's schema
load_job = client.load_table_from_uri(uri, table, job_config=job_config)
res = load_job.result()
This solution, copying the schema, doesn't work with a QueryJob.
The workaround is the following (it works for both LoadJob and QueryJob):
1. Truncate the table
2. Perform a job in WRITE_EMPTY mode
Trade-off:
WRITE_TRUNCATE is atomic: if the write fails, the data aren't truncated.
The workaround is in 2 steps: if the write fails, the data are already deleted.
from google.cloud import bigquery

client = bigquery.Client()
# `table` is the destination table ID and `request` is the SQL text that produces the new data

config = bigquery.job.QueryJobConfig()
config.create_disposition = bigquery.job.CreateDisposition.CREATE_IF_NEEDED
config.write_disposition = bigquery.job.WriteDisposition.WRITE_EMPTY
# Doesn't work for a QueryJob:
# config.schema = client.get_table(table).schema
config.destination = table
# Step 1: truncate the table
query_job = client.query(f'TRUNCATE TABLE `{table}`')
res = query_job.result()
# Step 2: load the new data
query_job = client.query(request, job_config=config)
res = query_job.result()
All of this is to tell you that the BigQuery operator in Airflow isn't the problem; it's a BigQuery behavior, and you have a workaround to achieve what you want.

BigQuery Python 409 Already Exists: Table

I'm coding a Python script that writes query results to a BQ table. After the first run, the script always errors out with the following error: google.api_core.exceptions.Conflict: 409 Already Exists: Table project-id.dataset-id. I do not understand why it is attempting to create a table every time I run the script. Do I have to specify any particular parameters?
The code below is from Google's documentation. I'm using it as an example, on the assumption that the table has already been created. Where do I stop the API from creating the same table?
from google.cloud import bigquery
# TODO(developer): Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the destination table.
table_id = "your-project.your_dataset.your_table_name"
job_config = bigquery.QueryJobConfig(destination=table_id)
sql = """
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
"""
# Start the query, passing in the extra configuration.
query_job = client.query(sql, job_config=job_config) # Make an API request.
query_job.result() # Wait for the job to complete.
print("Query results loaded to the table {}".format(table_id))
If you check the class QueryJobConfig you will see that there is a parameter called write_disposition. As you can see in the REST API reference here, this parameter can be set to 3 different options:
WRITE_TRUNCATE: If the table already exists, BigQuery overwrites the table data and uses the schema from the query result.
WRITE_APPEND: If the table already exists, BigQuery appends the data to the table.
WRITE_EMPTY: If the table already exists and contains data, a 'duplicate' error is returned in the job result.
So, adding this line after the job_config definition will do the trick:
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
Also, look into the BigQuery client documentation for the exists_ok parameter of create_table() (see the sketch below).
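For illustration, a minimal sketch of exists_ok; note it applies when you explicitly create the table with create_table(), not to the query's destination table (the table ID is hypothetical):
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table("your-project.your_dataset.your_table_name")
# exists_ok=True turns the 409 Already Exists conflict into a no-op
client.create_table(table, exists_ok=True)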

AWS Glue Search Option

I'm currently using AWS Glue Data Catalog to organize my database. Once I set up the connection and sent my crawler to gather information, I was able to see the formulated metadata.
One feature that would be nice to have is the ability to SEARCH the entire data catalog on ONE column name. For example, if I have 5 tables in my data catalog, and one of those tables happens to have a field "age", I'd like to be able to see that table.
I was also wondering whether I can search on the "comments" field that every column of a table has in the AWS Glue Data Catalog.
Hope to get some help!
You can do that with the AWS Glue API. For example, you can use the Python SDK boto3 and the get_tables() method to retrieve all meta information about the tables in a particular database. Have a look at the response syntax returned by calling get_tables(); then you would only need to parse it, for example:
import boto3

glue_client = boto3.client('glue')
response = glue_client.get_tables(
    DatabaseName='__SOME_NAME__'
)
for table in response['TableList']:
    columns = table['StorageDescriptor']['Columns']
    for col in columns:
        col_name = col['Name']
        col_comment = col.get('Comment', '')  # 'Comment' is optional, so avoid a KeyError
        # Here you do the search for what you need
Note: if you have a table with partitioning (artificial columns), then you would also need to search through:
columns_as_partitions = table['PartitionKeys']
for col in columns_as_partitions:
    col_name = col['Name']
    col_comment = col.get('Comment', '')  # 'Comment' is optional here too
    # Here you do the search for what you need
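Since get_databases() and get_tables() are both paginated, a rough sketch that combines the snippets above to search the whole catalog for one column name could look like this (the column name "age" is just the example from the question):
import boto3

glue_client = boto3.client('glue')
search_column = 'age'  # the column name you're looking for

# walk every database and every table, including partition keys
for db_page in glue_client.get_paginator('get_databases').paginate():
    for database in db_page['DatabaseList']:
        for tbl_page in glue_client.get_paginator('get_tables').paginate(DatabaseName=database['Name']):
            for table in tbl_page['TableList']:
                columns = table.get('StorageDescriptor', {}).get('Columns', []) + table.get('PartitionKeys', [])
                for col in columns:
                    if col['Name'] == search_column:
                        print(f"{database['Name']}.{table['Name']} has column {col['Name']}")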

How to overwrite a table in atomic fashion in BigQuery client Python API

Here is a code snippet I used as a reference from GCP documentation:
job_config = bigquery.QueryJobConfig()
# Set the destination table
table_ref = client.dataset(dataset_id).table('your_table_id')
job_config.destination = table_ref
sql = """
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
"""
# Start the query, passing in the extra configuration.
query_job = client.query(
    sql,
    # Location must match that of the dataset(s) referenced in the query
    # and of the destination table.
    location='US',
    job_config=job_config)  # API request - starts the query
query_job.result() # Waits for the query to finish
print('Query results loaded to table {}'.format(table_ref.path))
This works fine but if the table already exists, this spits out an error. I know how to delete the table first but I am wondering if there is a way to make this overwrite the table in atomic fashion so that the table always exists.
You can control how results are persisted through a combination of setting the create_disposition and write_disposition. The python library exposes these options in QueryJobConfig, and links to more details from the REST API docs.
For queries, the default behavior for the write disposition is WRITE_EMPTY, which causes a failure if the table already exists. Switching that to WRITE_TRUNCATE should give you the atomic replacement of data that you're looking for.
TL;DR:
just add this to your job config:
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
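Putting it together with the current client API, a minimal sketch with a hypothetical destination table ID:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="your-project.your_dataset.your_table",  # hypothetical destination
    create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # atomic overwrite
)

sql = """
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus
"""

client.query(sql, job_config=job_config).result()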
