Partitioning BigQuery Tables via API in Python

I'm using Python to hit the BigQuery API. I've been successful at running queries and writing new tables, but would like to ensure those output tables are partitioned per https://cloud.google.com/bigquery/docs/creating-partitioned-tables
The output of the query would have the columns: event_date [string in the format "2017-11-12"] and metric [integer].
Per the code below, I've been assigning "partitioning_type" on various objects; it never returns an error, but I can't tell whether it's actually having any effect.
(I guess it'd also be useful to know how to tell whether my partitioning efforts are working, i.e. how to identify the _PARTITIONTIME pseudo column.)
import uuid
from google.cloud import bigquery

client = bigquery.Client()
# `query` holds the SQL string; `dest_dataset_id` names an existing dataset.

dest_table_id = "BQresults"
query_job = client.run_async_query(str(uuid.uuid4()), query)
query_job.allow_large_results = True
dest_dataset = client.dataset(dest_dataset_id)
dest_table = dest_dataset.table(dest_table_id)
dest_table.partitioning_type = "DAY"
query_job.destination = dest_table
query_job.write_disposition = 'WRITE_TRUNCATE'
query_job.use_legacy_sql = False
query_job.begin()
query_job.result()

If you want to check whether a table is partitioned on a time column, use the get_table() method and check the partitioning_type property of the returned object.
You can check integer partitioning by looking at the range_partitioning property. You can also fetch the job object with get_job() and the job ID, and check whether time_partitioning was set in its configuration.
I don't think the query job you're running produces a partitioned table, since time_partitioning should be set in the job configuration, and it doesn't look like the client is doing that here. If that's the case, you can create the partitioned table first and use the existing table as the destination.
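With current versions of the google-cloud-bigquery client, time partitioning is set on the job configuration rather than on the table object. A minimal sketch of that approach, assuming a recent client version and hypothetical project/dataset names; leaving out the field argument gives ingestion-time partitioning and the _PARTITIONTIME pseudo column:
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names; adjust to your own project and dataset.
dest_table_id = "my-project.my_dataset.BQresults"
query = 'SELECT event_date, metric FROM `my-project.my_dataset.source_table`'

job_config = bigquery.QueryJobConfig(
    destination=dest_table_id,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    # DAY partitioning on ingestion time; pass field="event_date" instead only
    # if that column is a DATE or TIMESTAMP (in the question it is a STRING).
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY
    ),
)

query_job = client.query(query, job_config=job_config)
query_job.result()

# Confirm the partitioning actually took effect.
print(client.get_table(dest_table_id).time_partitioning)
To see the pseudo column itself, you can query SELECT _PARTITIONTIME, COUNT(*) ... GROUP BY 1 on the destination table; it only exists on ingestion-time partitioned tables.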

Related

How to perform the UPSERT operation using the python BigQuery client when writing JSON record

I am writing JSON records into a BigQuery table using the function bq.insert_rows_json(f'{project}.{dataset}.{table_name}', rows_to_insert). This operation is done in INSERT mode. I was wondering if I could use the same function but in UPSERT mode. Is it possible? I checked the documentation here but did not find an argument for that.
I can't seem to find a built-in UPSERT function for Python. However, you may try the approach below, which is derived from the comment of @Mr.Nobody.
from google.cloud import bigquery

client = bigquery.Client()

query_job = client.query(
    """
    MERGE my_dataset.json_table T
    USING my_dataset.json_table_source S
    ON T.int64_field_0 = S.int64_field_0
    WHEN MATCHED THEN
      UPDATE SET string_field_1 = S.string_field_1
    WHEN NOT MATCHED THEN
      INSERT (int64_field_0, string_field_1) VALUES (int64_field_0, string_field_1)
    """
)
results = query_job.result()  # Waits for the job to complete.
In this approach, you first need to ingest all of your supposedly "updated" JSON data into a source (staging) table before merging it into your main BigQuery table; a sketch of that staging step follows the conclusion below. The query then matches each row against the main table: if the primary ID (the uniqueness checker) is already there, the query performs an UPDATE; if not, it performs an INSERT.
Screenshots (not reproduced here) show the Main Table and Source Table before running the Python code, and the Main Table after the code finished executing.
Conclusion: the row with int64_field_0 = 4 was updated (from version 1.0.0 to 6.5.1) because it already existed in the Main Table, while the row with int64_field_0 = 5 was inserted because it did not yet exist there.
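Since the MERGE reads from a staging table, the "updated" JSON rows have to be loaded into that source table before the query runs. A minimal sketch of that staging step, assuming the hypothetical my_dataset.json_table_source table above and the same two-column schema:
from google.cloud import bigquery

client = bigquery.Client()

source_table = "my_dataset.json_table_source"  # staging table read by the MERGE

# Example rows; in practice this is your batch of updated JSON records.
rows_to_insert = [
    {"int64_field_0": 4, "string_field_1": "6.5.1"},
    {"int64_field_0": 5, "string_field_1": "2.0.0"},
]

errors = client.insert_rows_json(source_table, rows_to_insert)
if errors:
    raise RuntimeError(f"Failed to stage rows: {errors}")
# ...then run the MERGE statement shown above.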

BigQueryOperator changes the table schema and column modes when write_disposition='WRITE_TRUNCATE'

I am using Airflow's BigQueryOperator to populate the BQ table with write_disposition='WRITE_TRUNCATE'. The problem is that every time the task runs, it alters the table schema and also the column mode from Required to Nullable. The create_disposition I am using is 'CREATE_NEVER'. Since my tables are pre-created, I don't want the schemas or column modes to be altered. Using write_disposition='WRITE_APPEND' fixes the issue but my requirement is to use WRITE_TRUNCATE. Any idea why BigQueryOperator alters the schema and mode?
I had a similar issue, not with the REQUIRED/NULLABLE schema mode but with policy tags, and the behavior is the same: the policy tags are overridden (and lost). Here is the answer from the Google support team:
If you overwrite to a destination table, any existing policy tags are removed from the table, unless you use the --destination_schema flag to specify a schema with policy tags.
For WRITE_TRUNCATE, the disposition overwrites the existing table and the schema. If you want to keep the policy tags, you can use "--destination_schema" to specify a schema with policy tags.
However, in my tests with Python I observed two different behaviors between a QueryJob (a job based on a SQL query that sinks the result into a table) and a LoadJob (a job that loads data from a file and sinks it into a table).
If you perform a LoadJob:
Remove the schema autodetect
Get the schema of the original table
Perform the load job
Like this in Python:
from google.cloud import bigquery

client = bigquery.Client()
# `table` is the destination table ID and `uri` the source file location.

job_config = bigquery.job.LoadJobConfig()
job_config.create_disposition = bigquery.job.CreateDisposition.CREATE_IF_NEEDED
job_config.write_disposition = bigquery.job.WriteDisposition.WRITE_TRUNCATE
job_config.skip_leading_rows = 1
# job_config.autodetect = True  # don't autodetect; reuse the existing schema instead
job_config.schema = client.get_table(table).schema
load_job = client.load_table_from_uri(uri, table, job_config=job_config)
res = load_job.result()
This solution, to copy the schema, doesn't work with a QueryJob.
The workaround is the following (it works for both LoadJob and QueryJob):
Truncate the table
Perform a job in WRITE_EMPTY mode
Trade-off:
WRITE_TRUNCATE is atomic: if the write fails, the data aren't truncated
the workaround takes 2 steps: if the write fails, the data have already been deleted
from google.cloud import bigquery

client = bigquery.Client()
# `table` is the destination table ID and `request` the SQL query to run.

config = bigquery.job.QueryJobConfig()
config.create_disposition = bigquery.job.CreateDisposition.CREATE_IF_NEEDED
config.write_disposition = bigquery.job.WriteDisposition.WRITE_EMPTY
# Doesn't work:
# config.schema = client.get_table(table).schema
config.destination = table

# Step 1: truncate the table
query_job = client.query(f'TRUNCATE TABLE `{table}`')
res = query_job.result()

# Step 2: load the new data
query_job = client.query(request, job_config=config)
res = query_job.result()
All of this is to say that the BigQuery operator in Airflow isn't the problem; it's BigQuery behavior, and you have a workaround to achieve what you want.

Get number of rows in a BigQuery table (streaming buffer)

I am doing inserts via streaming. In the UI, I can see the table's row counts (screenshot not reproduced).
Is there a way to get that via the API? Currently, when I do:
from google.cloud import bigquery
client = bigquery.Client()
dataset = client.dataset("bqtesting")
table = client.get_table(dataset.table('table_streaming'))
table.num_rows  # returns 0
Obviously 0 is not the number that I'm looking to get. From the API documentation it says:
numRows unsigned long [Output-only] The number of rows of data in this table, excluding any data in the streaming buffer.
So then, my question is: how do we get the exact number of rows in a table? Currently I'm doing:
count=[item[0] for item in client.query('SELECT COUNT(*) FROM `bqtesting.table_streaming`').result()][0]
But this takes about 5s just to get the count (and I need to execute this query quite frequently to see if all streaming inserts have 'finished').
SELECT COUNT(1) and SELECT COUNT(*) queries have 0 scanned and billed bytes (you can see this in the job metadata after you run them, or in a dry run), so you should be able to run them as often as you like.
If I'm reading the documentation correctly, the numbers there are not guaranteed to include rows in the buffer that have not yet been flushed to BigQuery storage.
You can also use the API mentioned here, https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability, i.e. check the streamingBuffer.oldestEntryTime field from the tables.get result.
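In the Python client, the same tables.get information is exposed on the Table object. A small sketch, reusing the dataset and table names from the question:
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("bqtesting.table_streaming")

# num_rows excludes the streaming buffer; streaming_buffer (when present)
# carries estimates for rows that have not been flushed yet.
print("committed rows:", table.num_rows)

buf = table.streaming_buffer
if buf is not None:
    print("estimated buffered rows:", buf.estimated_rows)
    print("oldest entry time:", buf.oldest_entry_time)
Once the buffer has been flushed, streaming_buffer comes back as None, which is also a convenient signal that the streaming inserts have 'finished'.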
You can use the __TABLES__ metadata table to get the information that you want. Querying __TABLES__ incurs no charges.
The query that you need is:
SELECT table_id, row_count, size_bytes
FROM `your-project-name.bqtesting.__TABLES__`
WHERE STARTS_WITH(table_id, "table_streaming")
ORDER BY table_id DESC
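For completeness, a small sketch of running that metadata query from the Python client (project and dataset names taken from the question):
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT table_id, row_count, size_bytes
    FROM `your-project-name.bqtesting.__TABLES__`
    WHERE STARTS_WITH(table_id, "table_streaming")
    ORDER BY table_id DESC
"""

for row in client.query(sql).result():
    print(row.table_id, row.row_count, row.size_bytes)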

How to overwrite a table in atomic fashion in BigQuery client Python API

Here is a code snippet I used as a reference from GCP documentation:
from google.cloud import bigquery

client = bigquery.Client()
dataset_id = 'your_dataset_id'  # dataset that will hold the destination table

job_config = bigquery.QueryJobConfig()
# Set the destination table
table_ref = client.dataset(dataset_id).table('your_table_id')
job_config.destination = table_ref
sql = """
    SELECT corpus
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus;
"""
# Start the query, passing in the extra configuration.
query_job = client.query(
    sql,
    # Location must match that of the dataset(s) referenced in the query
    # and of the destination table.
    location='US',
    job_config=job_config)  # API request - starts the query

query_job.result()  # Waits for the query to finish
print('Query results loaded to table {}'.format(table_ref.path))
This works fine, but if the table already exists it throws an error. I know how to delete the table first, but I am wondering if there is a way to overwrite the table atomically so that the table always exists.
You can control how results are persisted through a combination of the create_disposition and write_disposition settings. The Python library exposes these options in QueryJobConfig, which links to more details in the REST API docs.
For queries, the default write disposition is WRITE_EMPTY, which causes a failure if the table already exists. Switching that to WRITE_TRUNCATE should give you the atomic replacement of data that you're looking for.
TL;DR:
just add this to your job config:
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
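Putting the pieces together (a minimal sketch; the table and dataset names are placeholders, and passing the destination as a full table ID string assumes a recent client version):
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT corpus
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
"""

job_config = bigquery.QueryJobConfig(
    destination="your-project.your_dataset.your_table_id",
    create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

query_job = client.query(sql, job_config=job_config)
query_job.result()  # on success, the destination table's contents are replaced atomically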

Update MSSQL table through SQLAlchemy using dataframes

I'm trying to replace some old MSSQL stored procedures with Python, in an attempt to take some of the heavy calculations off of the SQL server. The part of the procedure I'm having issues replacing is as follows:
UPDATE mytable
SET calc_value = tmp.calc_value
FROM dbo.mytable mytable INNER JOIN
#my_temp_table tmp ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
WHERE (mytable.a = some_value)
and (mytable.x = tmp.x)
and (mytable.b = some_other_value)
Up to this point, I've made some queries with SQLAlchemy, stored that data in DataFrames, and done the requisite calculations on them. What I don't know is how to put the data back into the server using SQLAlchemy, either with raw SQL or function calls. The dataframe I have on my end would essentially have to take the place of the temporary table created in MSSQL Server, but I'm not sure how to do that.
The difficulty, of course, is that I don't know of a way to join a dataframe with an MSSQL table, and I'm guessing that wouldn't work anyway, so I'm looking for a workaround.
As the pandas docs suggest here:
from sqlalchemy import create_engine
engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)
dataframe.to_sql('tablename', engine, if_exists='replace')
The engine parameter for MSSQL is basically the connection string; check it here.
The if_exists parameter is a bit tricky, since 'replace' actually drops the table first, recreates it, and then inserts all the data at once.
Setting the echo attribute to True shows all background logs and SQL statements.
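If you need to keep the original UPDATE ... JOIN semantics rather than replacing a whole table, one workaround (a sketch with hypothetical staging-table and parameter names, mirroring the stored procedure above) is to push the dataframe into a staging table with to_sql and run the join server-side:
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)

# Stand-in for the dataframe holding your calculated results.
dataframe = pd.DataFrame(
    {"a": [1], "b": [2], "c": [3], "x": [4], "calc_value": [9.5]}
)

# Step 1: write the results to a staging table (plays the role of #my_temp_table).
dataframe.to_sql("my_staging_table", engine, if_exists="replace", index=False)

# Step 2: run the UPDATE ... JOIN against the staging table.
update_sql = text("""
    UPDATE mytable
    SET calc_value = tmp.calc_value
    FROM dbo.mytable mytable
    INNER JOIN my_staging_table tmp
        ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
    WHERE mytable.a = :a_value
      AND mytable.x = tmp.x
      AND mytable.b = :b_value
""")

with engine.begin() as conn:  # transaction: commits on success, rolls back on error
    conn.execute(update_sql, {"a_value": "some_value", "b_value": "some_other_value"})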
