Using Google Colab (Python) to append to a BigQuery table

I am trying to use Colab to upload data to a GCP BigQuery table. However, after the initial upload I am struggling to append additional data, mainly because BigQuery says that my data does not match the table schema.
I was hoping to get some help on how to download the table schema and use it to upload data.
Thanks

Following the example from the BigQuery docs:
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# Set table_id to the ID of the table to fetch.
table_id = 'your-project.your_dataset.your_table'
# Make an API request.
table = client.get_table(table_id)
# View table properties
print(f"Got table {table_id}.")
print(f"Table schema: {table.schema}")
print(f"Table description: {table.description}")
print(f"Table has {table.num_rows} rows")
Once you have the table schema information in hand, ensure the data you're trying to append has the appropriate types, then either configure a load job with bigquery.LoadJobConfig or stream the data into your table with bigquery.Client().insert_rows_json or bigquery.Client().insert_rows_from_dataframe.
See the google-cloud-bigquery docs for more details on how to use those methods.
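As a rough sketch of the load-job route (the table ID is a placeholder and the DataFrame here is a stand-in for your real data, whose columns are assumed to already match the fetched schema), one option is load_table_from_dataframe:

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_id = 'your-project.your_dataset.your_table'  # placeholder

# Fetch the existing schema so the load job validates against it explicitly.
table = client.get_table(table_id)

# Stand-in DataFrame; the real frame's columns must match the table schema.
df = pd.DataFrame({'Month': ['2021-01']})

job_config = bigquery.LoadJobConfig(
    schema=table.schema,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
load_job.result()  # Wait for the append to finish.

# Alternative: stream the rows instead of running a load job.
# errors = client.insert_rows_from_dataframe(table, df)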

Thanks for that.
Slight change to your code:
# Construct a BigQuery client object.
project_id = 'gcp-project'
client = bigquery.Client(project=project_id)
# Set table_id to the ID of the table to fetch.
table_id = 'analysis.Test01'
I got the other bit working too. Essentially I save the schema to a variable, and .to_gbq has a table_schema parameter. What was giving me a hard time was that my table had over 50 columns and I needed some automated way to get the schema:
lcr_table_schema = [
    {
        "mode": "NULLABLE",
        "name": "Month",
        "type": "STRING"
    },
    # ... remaining column definitions omitted
]

pandas_gbq.to_gbq(mergedDB, table_id, project_id=project_id, if_exists='append', table_schema=lcr_table_schema)
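Since the table already exists in BigQuery, one way to avoid hand-writing 50+ column definitions is to generate that list from the live table. This is a minimal sketch, reusing the client and IDs from above; SchemaField.to_api_repr() returns the name/type/mode dictionaries that pandas_gbq's table_schema expects:

import pandas_gbq
from google.cloud import bigquery

client = bigquery.Client(project=project_id)
table = client.get_table(table_id)  # 'analysis.Test01' resolved against project_id

# Convert each SchemaField into the dict format pandas_gbq expects.
lcr_table_schema = [field.to_api_repr() for field in table.schema]

# lcr_table_schema can now be passed to
# pandas_gbq.to_gbq(mergedDB, table_id, project_id=project_id,
#                   if_exists='append', table_schema=lcr_table_schema)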

Related

BigQuery - LoadJobConfig vs QueryJobConfig in update schema

I ran into a situation like the one below. In fact, I am trying to update the schema of my partitioned table (partitioned by a time-unit column). I use this article and this example as my references, and the doc says:
schemaUpdateOptions[]: Schema update options are supported in two cases: when writeDisposition is WRITE_APPEND; when writeDisposition is WRITE_TRUNCATE and the destination table is a partition of a table, specified by partition decorators. For normal tables, WRITE_TRUNCATE will always overwrite the schema.
So what I understand is, with LoadJobConfig().schema_update_options = [bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION]:
For normal tables, depending on LoadJobConfig().write_disposition:
WRITE_APPEND: the schema is updated successfully and the new data is appended to the table.
WRITE_TRUNCATE: the schema is updated successfully BUT the table is overwritten with the new data.
For partitioned tables, depending on LoadJobConfig().write_disposition:
WRITE_APPEND: NOT ALLOWED - error message: "invalid: Schema update options should only be specified with WRITE_APPEND disposition, or with WRITE_TRUNCATE disposition on a table partition."
WRITE_TRUNCATE: the schema is updated successfully BUT the table is overwritten with the new data.
This always holds when I'm using LoadJobConfig(), but if I use QueryJobConfig() instead, things change.
In fact, it's still true for normal tables, but for partitioned tables, even when write_disposition=WRITE_APPEND, the schema is updated successfully and the new data is appended to the table!
How do we explain this situation? Is there something special about QueryJobConfig(), or did I misunderstand something?
Many thanks!
There are some slight differences between the two classes. I would recommend paying attention to the default configuration of each one; if either of them ends up returning errors, it may be due to an incorrect initialization of the configuration and its options.
QueryJobConfig
QueryJobConfig(**kwargs)
Configuration options for query jobs.
All properties in this class are optional. Values which are None use the server defaults. Set properties on the constructed configuration by using the property name as the name of a keyword argument.
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the destination table.
# table_id = "your-project.your_dataset.your_table_name"
# Retrieves the destination table and checks the length of the schema.
table = client.get_table(table_id) # Make an API request.
print("Table {} contains {} columns".format(table_id, len(table.schema)))
# Configures the query to append the results to a destination table,
# allowing field addition.
job_config = bigquery.QueryJobConfig(
    destination=table_id,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
# Start the query, passing in the extra configuration.
query_job = client.query(
    # In this example, the existing table contains only the 'full_name' and
    # 'age' columns, while the results of this query will contain an
    # additional 'favorite_color' column.
    'SELECT "Timmy" as full_name, 85 as age, "Blue" as favorite_color;',
    job_config=job_config,
)  # Make an API request.
query_job.result()  # Wait for the job to complete.
# Checks the updated length of the schema.
table = client.get_table(table_id)  # Make an API request.
print("Table {} now contains {} columns".format(table_id, len(table.schema)))
LoadJobConfig
LoadJobConfig(**kwargs)
Configuration options for load jobs.
Set properties on the constructed configuration by using the property name as the name of a keyword argument. Values which are unset or None use the BigQuery REST API default values. See the BigQuery REST API reference documentation for a list of default values.
Required options differ based on the source_format value. For example, the BigQuery API's default value for source_format is "CSV". When loading a CSV file, either schema must be set or autodetect must be set to True.
# from google.cloud import bigquery
# client = bigquery.Client()
# project = client.project
# dataset_ref = bigquery.DatasetReference(project, 'my_dataset')
# filepath = 'path/to/your_file.csv'
# Retrieves the destination table and checks the length of the schema
table_id = "my_table"
table_ref = dataset_ref.table(table_id)
table = client.get_table(table_ref)
print("Table {} contains {} columns.".format(table_id, len(table.schema)))
# Configures the load job to append the data to the destination table,
# allowing field addition
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION
]
# In this example, the existing table contains only the 'full_name' column.
# 'REQUIRED' fields cannot be added to an existing schema, so the
# additional column must be 'NULLABLE'.
job_config.schema = [
    bigquery.SchemaField("full_name", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("age", "INTEGER", mode="NULLABLE"),
]
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1

with open(filepath, "rb") as source_file:
    job = client.load_table_from_file(
        source_file,
        table_ref,
        location="US",  # Must match the destination dataset location.
        job_config=job_config,
    )  # API request

job.result()  # Waits for table load to complete.
print(
    "Loaded {} rows into {}:{}.".format(
        job.output_rows, dataset_ref.dataset_id, table_ref.table_id
    )
)
# Checks the updated length of the schema
table = client.get_table(table)
print("Table {} now contains {} columns.".format(table_id, len(table.schema)))
It should be noted that google.cloud.bigquery.job.SchemaUpdateOption is available to both classes and specifies which updates to the destination table schema are allowed as a side effect of the job.
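The partition-decorator case mentioned in the quoted documentation (WRITE_TRUNCATE on a single partition) would look roughly like the sketch below with LoadJobConfig. All names are placeholders and this is only an illustration of the pattern, assuming the client accepts a partition decorator in the destination table ID; it is not a verified reproduction of the asker's job:

from google.cloud import bigquery

client = bigquery.Client()

# Target a single partition with a decorator so WRITE_TRUNCATE replaces only
# that partition instead of the whole table.
destination = "my-project.my_dataset.my_table$20240101"  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/my_file.csv",  # placeholder source
    destination,
    job_config=job_config,
)
load_job.result()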

Cloud Storage to BigQuery via Python

I am receiving a data drop into my GCS bucket daily and have a Cloud Function that moves said CSV data to a BigQuery table (see code below).
import datetime

def load_table_uri_csv(table_id):
    # [START bigquery_load_table_gcs_csv]
    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "dataSet.dataTable"

    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )

    uri = "gs://client-data/team/looker-client-" + str(datetime.date.today()) + ".csv"

    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.
    load_job.result()  # Waits for the job to complete.

    destination_table = client.get_table(table_id)  # Make an API request.
    print("Loaded {} rows.".format(destination_table.num_rows))
    # [END bigquery_load_table_gcs_csv]
However, the data comes with a 2-day look-back, resulting in repeated data in the BigQuery table.
Is there a way for me to update this Cloud Function to only pull in the most recent date from the CSV once it is dropped off? This way I can easily avoid duplicate data within the reporting.
Or maybe there's a way for me to run a scheduled query via BigQuery to resolve this?
For reference, the date column within the CSV comes in as a TIMESTAMP.
Any and all help is appreciated!
There seems to be no way to do this directly from Google Cloud Platform, unfortunately. You will need to filter your information somehow before loading it.
You could filter the information from the CSV in your code or through another medium.
It's also possible to submit a feature request for Google to consider this functionality.
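As a rough sketch of the "filter it in your code" option: the Cloud Function could keep only the most recent day in the file before appending. The column name date is an assumption, and reading gs:// paths with pandas requires the gcsfs package (and pyarrow for DataFrame loads):

import datetime

import pandas as pd
from google.cloud import bigquery

def load_latest_day(table_id):
    client = bigquery.Client()
    uri = "gs://client-data/team/looker-client-" + str(datetime.date.today()) + ".csv"

    # Read the dropped file and keep only the rows for the newest date it contains.
    df = pd.read_csv(uri)  # requires gcsfs to read gs:// paths
    df["date"] = pd.to_datetime(df["date"])  # 'date' column name is an assumption
    latest = df[df["date"].dt.date == df["date"].dt.date.max()]

    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_dataframe(latest, table_id, job_config=job_config).result()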

BigQuery | Python | Export data from BigQuery table VIEW

This is a simple piece of code to export from BigQuery to Google Cloud Storage in CSV format:
from google.cloud import bigquery

def export_data():
    client = bigquery.Client()
    project = 'xxxxx'
    dataset_id = 'xxx'
    table_id = 'xxx'
    bucket_name = 'xxx'

    destination_uri = 'gs://{}/{}'.format(bucket_name, 'EXPORT_FILE.csv')
    dataset_ref = client.dataset(dataset_id, project=project)
    table_ref = dataset_ref.table(table_id)

    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        # Location must match that of the source table.
        location='EU')  # API request
    extract_job.result()  # Waits for job to complete.

    print('Exported {}:{}.{} to {}'.format(
        project, dataset_id, table_id, destination_uri))
It works perfectly for regular tables, BUT when I try to export data from a saved table VIEW, it fails with this error:
BadRequest: 400 Using table xxx:xxx.xxx#123456 is not allowed for this operation because of its type. Try using a different table that is of type TABLE.
Does any way exist to export data from a table view?
What I'm trying to achieve is to get the data from BigQuery in CSV format and upload it to Google Analytics Product Data.
BigQuery views are subject to a few limitations:
You cannot run a BigQuery job that exports data from a view.
There are more than 10 other limitations which I didn't post in the answer as they might change. Follow the link to read all of them.
You need to query your view and write the results to a destination table, and then issue an export job on the destination table.
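A minimal sketch of that two-step approach, assuming a reasonably recent google-cloud-bigquery release (which accepts table IDs as plain strings) and placeholder project, dataset, table, and bucket names:

from google.cloud import bigquery

client = bigquery.Client()

view_id = 'my-project.my_dataset.my_view'                # placeholder
dest_table_id = 'my-project.my_dataset.my_view_export'   # placeholder
destination_uri = 'gs://my-bucket/EXPORT_FILE.csv'       # placeholder

# Step 1: materialize the view into a regular table.
job_config = bigquery.QueryJobConfig(
    destination=dest_table_id,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query('SELECT * FROM `{}`'.format(view_id), job_config=job_config).result()

# Step 2: export the materialized table to Cloud Storage.
extract_job = client.extract_table(
    dest_table_id,
    destination_uri,
    location='EU')  # Location must match that of the destination table.
extract_job.result()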

Loading table from Cloud Storage to BigQuery using Python

Could someone share an example of a job config for uploading a newline-delimited JSON file to a new BigQuery table, please?
I've been trying to do this based on the Google docs with no success so far.
This example from the GCP repository is a good one for loading data from GCS.
The only thing you will have to adapt in your code is setting job.source_format to newline-delimited JSON, like so:
import uuid

from google.cloud import bigquery

def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(
        job_name, table, source)
    job.source_format = 'NEWLINE_DELIMITED_JSON'

    job.begin()
    wait_for_job(job)  # wait_for_job is a helper defined elsewhere in the GCP sample

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))
(The correct thing would be to receive this parameter as input to your function, but this works as an example.)
Also, the table should already exist when you run this code (I looked for schema auto-detection in the Python API but it seems there isn't one yet).
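For what it's worth, in current releases of google-cloud-bigquery the same load (including schema auto-detection for a new table) can be expressed with load_table_from_uri; the URI and table ID below are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema for the new table
)

load_job = client.load_table_from_uri(
    'gs://my-bucket/data.json',            # placeholder source file
    'my-project.my_dataset.my_new_table',  # placeholder destination table
    job_config=job_config,
)
load_job.result()  # Wait for the load to complete.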

Create a table from query results in Google BigQuery

We're using Google BigQuery via the Python API. How would I create a table (new one or overwrite old one) from query results? I reviewed the query documentation, but I didn't find it useful.
We want to simulate:
"SELEC ... INTO ..." from ANSI SQL.
You can do this by specifying a destination table in the query. You would need to use the Jobs.insert API rather than the Jobs.query call, and you should specify writeDisposition=WRITE_APPEND and fill out the destination table.
Here is what the configuration would look like, if you were using the raw API. If you're using Python, the Python client should give accessors to these same fields:
"configuration": {
"query": {
"query": "select count(*) from foo.bar",
"destinationTable": {
"projectId": "my_project",
"datasetId": "my_dataset",
"tableId": "my_table"
},
"createDisposition": "CREATE_IF_NEEDED",
"writeDisposition": "WRITE_APPEND",
}
}
The accepted answer is correct, but it does not provide Python code to perform the task. Here is an example, refactored out of a small custom client class I just wrote. It does not handle exceptions, and the hard-coded query should be customised to do something more interesting than just SELECT * ...
import time

from google.cloud import bigquery
from google.cloud.bigquery.table import Table
from google.cloud.bigquery.dataset import Dataset


class Client(object):

    def __init__(self, origin_project, origin_dataset, origin_table,
                 destination_dataset, destination_table):
        """
        A Client that performs a hardcoded SELECT and INSERTS the results in a
        user-specified location.

        All init args are strings. Note that the destination project is the
        default project from your Google Cloud configuration.
        """
        self.project = origin_project
        self.dataset = origin_dataset
        self.table = origin_table
        self.dest_dataset = destination_dataset
        self.dest_table_name = destination_table
        self.client = bigquery.Client()

    def run(self):
        query = ("SELECT * FROM `{project}.{dataset}.{table}`;".format(
            project=self.project, dataset=self.dataset, table=self.table))

        job_config = bigquery.QueryJobConfig()

        # Set configuration.query.destinationTable
        destination_dataset = self.client.dataset(self.dest_dataset)
        destination_table = destination_dataset.table(self.dest_table_name)
        job_config.destination = destination_table

        # Set configuration.query.createDisposition
        job_config.create_disposition = 'CREATE_IF_NEEDED'

        # Set configuration.query.writeDisposition
        job_config.write_disposition = 'WRITE_APPEND'

        # Start the query
        job = self.client.query(query, job_config=job_config)

        # Wait for the query to finish
        job.result()
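For example, with hypothetical project, dataset, and table names, the class above would be used like this:

client = Client('my-origin-project', 'origin_dataset', 'origin_table',
                'dest_dataset', 'dest_table')
client.run()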
Create a table from query results in Google BigQuery. Assuming you are using a Jupyter Notebook with Python 3, I'm going to explain the following steps:
How to create a new dataset on BQ (to save the results)
How to run a query and save the results in a new dataset in table format on BQ
Create a new dataset on BQ: my_dataset
from google.cloud import bigquery

bigquery_client = bigquery.Client()  # Create a BigQuery service object
dataset_id = 'my_dataset'
dataset_ref = bigquery_client.dataset(dataset_id)  # Create a DatasetReference using a chosen dataset ID.
dataset = bigquery.Dataset(dataset_ref)  # Construct a full Dataset object to send to the API.
dataset.location = 'US'  # Geographic location of the new dataset; it should match the location of the source dataset you will query.

# Send the dataset to the API for creation.
# Raises google.api_core.exceptions.AlreadyExists if the Dataset already exists within the project.
dataset = bigquery_client.create_dataset(dataset)  # API request
print('Dataset {} created.'.format(dataset.dataset_id))
Run a query on BQ using Python:
There are 2 types here:
Allowing large results
Query without mentioning large results, etc.
I am taking the public dataset here: bigquery-public-data:hacker_news and table ID: comments to run a query.
Allowing large results
DestinationTableName='table_id1'  # Enter the new table name you want to give
!bq query --allow_large_results --destination_table=project_id:my_dataset.$DestinationTableName 'SELECT * FROM [bigquery-public-data:hacker_news.comments]'
This query will allow large query results if required.
Without mentioning --allow_large_results:
DestinationTableName='table_id2'  # Enter the new table name you want to give
!bq query --destination_table=project_id:my_dataset.$DestinationTableName 'SELECT * FROM [bigquery-public-data:hacker_news.comments] LIMIT 100'
This will work for queries where the result is not going to cross the limit mentioned in the Google BQ documentation.
Output:
A new dataset on BQ with the name my_dataset
Results of the queries saved as tables in my_dataset
Note:
These queries are commands which you can run in a terminal (without the ! at the beginning). Since we are using Python to run these commands/queries, we prefix them with !, which lets us run shell commands from within the notebook.
