I receive a daily data drop into my GCS bucket and have a Cloud Function that moves that CSV data into a BigQuery table (see code below).
import datetime


def load_table_uri_csv(table_id):
    # [START bigquery_load_table_gcs_csv]
    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "dataSet.dataTable"

    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )

    uri = "gs://client-data/team/looker-client-" + str(datetime.date.today()) + ".csv"

    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.

    load_job.result()  # Waits for the job to complete.

    destination_table = client.get_table(table_id)  # Make an API request.
    print("Loaded {} rows.".format(destination_table.num_rows))
    # [END bigquery_load_table_gcs_csv]
However, the data comes with a 2-day look-back, resulting in repeated data in the BigQuery table.
Is there a way for me to update this Cloud Function to only pull in the most recent date from the CSV once it is dropped off? That way I can easily avoid duplicate data in the reporting.
Or maybe there's a way for me to run a scheduled query via BigQuery to resolve this?
For reference, the date column within the CSV comes in as a TIMESTAMP.
Any and all help is appreciated!
There seems to be no way to do this directly from Google Cloud Platform, unfortunately; you will need to filter your data somehow before loading it.
You could filter the CSV in your Cloud Function code or through another medium.
It's also possible to submit a feature request for Google to consider this functionality.
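As a rough illustration of filtering in code (a sketch only, not a tested implementation): the snippet below reads the daily CSV from GCS with pandas, keeps only the rows whose timestamp falls on the most recent date in the file, and appends the result to the table. The column name date_col is a placeholder for the TIMESTAMP column mentioned in the question, and reading gs:// paths with pandas assumes the gcsfs package is installed.
import datetime

import pandas as pd
from google.cloud import bigquery


def load_latest_day(table_id="dataSet.dataTable"):
    client = bigquery.Client()
    uri = "gs://client-data/team/looker-client-" + str(datetime.date.today()) + ".csv"

    # Read the dropped CSV; "date_col" is a placeholder for the TIMESTAMP column.
    df = pd.read_csv(uri, parse_dates=["date_col"])

    # Keep only the rows for the newest date in the file, dropping the look-back day.
    latest_day = df["date_col"].dt.date.max()
    df = df[df["date_col"].dt.date == latest_day]

    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_dataframe(df, table_id, job_config=job_config).result()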
I need to create a Google Cloud Function that automatically creates a table from a simple .csv file located in a bucket on Google Cloud Storage. I created a new function and wrote a Python script according to the schema below. It seems correct, but when I try to deploy the function, I see an error (here is the error screenshot). I really don't know what is wrong with my code. Please help.
from google.cloud import bigquery

client = bigquery.Client()
table_id = 'myprojectname.newdatasetname.newtablename'

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField('A', 'INTEGER'),
        bigquery.SchemaField('B', 'INTEGER'),
        bigquery.SchemaField('C', 'INTEGER')
    ],
    skip_leading_rows=0,
)
uri = 'gs://my-bucket-name/*.csv'

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)
load_job.result()

destination_table = client.get_table(table_id)
print('Loaded {} rows.'.format(destination_table.num_rows))
I am trying to use Colab to upload data to a GCP BigQuery table. However, after the initial upload I am struggling to add additional data, mainly because BigQuery says that my data does not match the schema.
I was hoping to get some help on how to download the table schema and use it to upload data?
Thanks
Following the example from BigQuery docs:
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# Set table_id to the ID of the table model to fetch.
table_id = 'your-project.your_dataset.your_table'
# Make an API request.
table = client.get_table(table_id)
# View table properties
print(f"Got table {table_id}.")
print(f"Table schema: {table.schema}")
print(f"Table description: {table.description}")
print(f"Table has {table.num_rows} rows")
Once you have the table schema information in hand, make sure the data you're trying to append has the appropriate types, then either configure a load job with bigquery.LoadJobConfig or stream the data into your table using bigquery.Client().insert_rows_json or bigquery.Client().insert_rows_from_dataframe.
See the google-cloud-bigquery docs for more details on how to use those methods.
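For example, here is a minimal sketch (assuming a local CSV file named new_rows.csv with the rows you want to append, and reusing the client, table_id and table objects from above) that passes the fetched schema into a load job so the appended data is validated against the existing column types:
job_config = bigquery.LoadJobConfig(
    schema=table.schema,  # schema fetched with get_table() above
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
with open("new_rows.csv", "rb") as source_file:
    load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)
load_job.result()  # Waits for the append to complete.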
Thanks for that. A slight change to your code:
# Construct a BigQuery client object.
project_id = 'gcp-project'
client = bigquery.Client(project=project_id)
# Set table_id to the ID of the table model to fetch.
table_id = 'analysis.Test01'
I got the other bit working too. Essentially, I save the schema to a variable, and .to_gbq has a table_schema field. What was giving me a hard time was that my table had over 50 columns and I needed some automated way to get the schema.
lcr_table_schema = [
    {
        "mode": "NULLABLE",
        "name": "Month",
        "type": "STRING"
    },
    # ... remaining columns omitted
]

pandas_gbq.to_gbq(mergedDB, table_id, project_id=project_id,
                  if_exists='append', table_schema=lcr_table_schema)
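Since typing out 50+ columns by hand is tedious, one possible shortcut (a sketch, using the project and table names from the snippet above) is to build the table_schema list directly from the existing table's schema:
from google.cloud import bigquery

client = bigquery.Client(project='gcp-project')
table = client.get_table('analysis.Test01')

# Convert each SchemaField into the dict format that pandas_gbq expects.
lcr_table_schema = [
    {"name": field.name, "type": field.field_type, "mode": field.mode}
    for field in table.schema
]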
This is simple code to export from BigQuery to Google Cloud Storage in CSV format:
from google.cloud import bigquery


def export_data():
    client = bigquery.Client()
    project = 'xxxxx'
    dataset_id = 'xxx'
    table_id = 'xxx'
    bucket_name = 'xxx'
    destination_uri = 'gs://{}/{}'.format(bucket_name, 'EXPORT_FILE.csv')

    dataset_ref = client.dataset(dataset_id, project=project)
    table_ref = dataset_ref.table(table_id)

    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        # Location must match that of the source table.
        location='EU')  # API request
    extract_job.result()  # Waits for job to complete.

    print('Exported {}:{}.{} to {}'.format(
        project, dataset_id, table_id, destination_uri))
It works perfectly for regular tables, but when I try to export data from a saved VIEW, it fails with this error:
BadRequest: 400 Using table xxx:xxx.xxx#123456 is not allowed for this operation because of its type. Try using a different table that is of type TABLE.
Is there any way to export data from a table view?
What I'm trying to achieve is to get the data from BigQuery in CSV format and upload it to Google Analytics Product Data.
BigQuery views are subject to a few limitations:
You cannot run a BigQuery job that exports data from a view.
There are more than ten other limitations which I didn't post in the answer, as they might change. Follow the link to read all of them.
You need to query your view and write the results to a destination table, and then issue an export job on the destination table.
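Here is a rough sketch of that two-step approach with the Python client (all project, dataset, table, and bucket names are placeholders, not values from the question):
from google.cloud import bigquery

client = bigquery.Client()
destination = bigquery.TableReference.from_string('my_project.my_dataset.view_snapshot')

# Step 1: query the view and write the results to a destination table.
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query('SELECT * FROM `my_project.my_dataset.my_view`',
             job_config=job_config).result()

# Step 2: export the materialized table to GCS as CSV.
extract_job = client.extract_table(
    destination,
    'gs://my-bucket/view_export.csv',
    location='EU',  # must match the dataset location
)
extract_job.result()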
Could someone share an example of a job config for uploading a newline-delimited JSON file to a new BigQuery table, please?
I am trying to do this based on the Google docs with no success so far.
This example from the GCP repository is a good one for loading data from GCS.
The only thing you will have to adapt in your code is setting job.source_format to newline-delimited JSON, like so:
import uuid

from google.cloud import bigquery


def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(
        job_name, table, source)
    job.source_format = 'NEWLINE_DELIMITED_JSON'
    job.begin()

    wait_for_job(job)  # helper defined in the linked GCP sample

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))
(The correct thing would be to receive this parameter as input to your function, but this works as an example.)
Also, the table should already exist when you run this code (I looked for schema auto-detection in the Python API but it seems there isn't one yet).
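For what it's worth, in current versions of the google-cloud-bigquery client the same load can be expressed with LoadJobConfig, which does support schema auto-detection; a minimal sketch, with placeholder table and URI names:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer the schema from the JSON file
)
load_job = client.load_table_from_uri(
    'gs://my-bucket/data.json', 'my_dataset.my_table', job_config=job_config)
load_job.result()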
We're using Google BigQuery via the Python API. How would I create a table (a new one, or overwrite an old one) from query results? I reviewed the query documentation, but I didn't find it useful.
We want to simulate:
"SELECT ... INTO ..." from ANSI SQL.
You can do this by specifying a destination table in the query. You would need to use the Jobs.insert API rather than the Jobs.query call, and you should specify writeDisposition=WRITE_APPEND and fill out the destination table.
Here is what the configuration would look like if you were using the raw API. If you're using Python, the Python client provides accessors for these same fields:
"configuration": {
"query": {
"query": "select count(*) from foo.bar",
"destinationTable": {
"projectId": "my_project",
"datasetId": "my_dataset",
"tableId": "my_table"
},
"createDisposition": "CREATE_IF_NEEDED",
"writeDisposition": "WRITE_APPEND",
}
}
The accepted answer is correct, but it does not provide Python code to perform the task. Here is an example, refactored out of a small custom client class I just wrote. It does not handle exceptions, and the hard-coded query should be customised to do something more interesting than just SELECT * ...
from google.cloud import bigquery


class Client(object):

    def __init__(self, origin_project, origin_dataset, origin_table,
                 destination_dataset, destination_table):
        """
        A Client that performs a hardcoded SELECT and INSERTS the results in a
        user-specified location.

        All init args are strings. Note that the destination project is the
        default project from your Google Cloud configuration.
        """
        self.project = origin_project
        self.dataset = origin_dataset
        self.table = origin_table
        self.dest_dataset = destination_dataset
        self.dest_table_name = destination_table
        self.client = bigquery.Client()

    def run(self):
        query = ("SELECT * FROM `{project}.{dataset}.{table}`;".format(
            project=self.project, dataset=self.dataset, table=self.table))

        job_config = bigquery.QueryJobConfig()

        # Set configuration.query.destinationTable
        destination_dataset = self.client.dataset(self.dest_dataset)
        destination_table = destination_dataset.table(self.dest_table_name)
        job_config.destination = destination_table

        # Set configuration.query.createDisposition
        job_config.create_disposition = 'CREATE_IF_NEEDED'

        # Set configuration.query.writeDisposition
        job_config.write_disposition = 'WRITE_APPEND'

        # Start the query
        job = self.client.query(query, job_config=job_config)

        # Wait for the query to finish
        job.result()
Create a table from query results in Google BigQuery. Assuming you are using a Jupyter Notebook with Python 3, I am going to explain the following steps:
How to create a new dataset on BQ (to save the results)
How to run a query and save the results as a table in the new dataset on BQ
Create a new DataSet on BQ: my_dataset
from google.cloud import bigquery

bigquery_client = bigquery.Client()  # Create a BigQuery service object
dataset_id = 'my_dataset'
dataset_ref = bigquery_client.dataset(dataset_id)  # Create a DatasetReference using a chosen dataset ID.
dataset = bigquery.Dataset(dataset_ref)  # Construct a full Dataset object to send to the API.

# Specify the geographic location where the new dataset will reside. This should be the
# same location as the source dataset we will query.
dataset.location = 'US'

# Send the dataset to the API for creation. Raises
# google.api_core.exceptions.AlreadyExists if the dataset already exists within the project.
dataset = bigquery_client.create_dataset(dataset)  # API request
print('Dataset {} created.'.format(dataset.dataset_id))
Run a query on BQ using Python:
There are two cases here:
Allowing large results
A query without mentioning large results
I am using the public dataset bigquery-public-data:hacker_news and the table comments to run the query.
Allowing Large Results
DestinationTableName = 'table_id1'  # Enter the new table name you want to give
!bq query --allow_large_results --destination_table=project_id:my_dataset.$DestinationTableName 'SELECT * FROM [bigquery-public-data:hacker_news.comments]'
This query will allow large query results if required.
Without mentioning --allow_large_results:
DestinationTableName = 'table_id2'  # Enter the new table name you want to give
!bq query --destination_table=project_id:my_dataset.$DestinationTableName 'SELECT * FROM [bigquery-public-data:hacker_news.comments] LIMIT 100'
This will work for queries whose results do not exceed the limit mentioned in the Google BigQuery documentation.
Output:
A new dataset on BQ with the name my_dataset
Results of the queries saved as tables in my_dataset
Note:
These queries are commands which you can run in the terminal (without the ! at the beginning). Since we are using Python to run these commands/queries, we prefix them with !, which lets us run shell commands from within the Python program.