Google Dataflow: insert + update in BigQuery in a streaming pipeline - python

The main objective
A Python streaming pipeline in which I read the input from Pub/Sub.
After the input is analyzed, two options are available:
If x=1 -> insert
If x=2 -> update
Testing
This cannot be done using the Apache Beam functions, so you need to develop it using the 0.25 API of the BigQuery client library (currently this is the version supported in Google Dataflow).
The problem
The inserted records are still in the BigQuery streaming buffer, so the update statement fails:
UPDATE or DELETE statement over table table would affect rows in the streaming buffer, which is not supported
The code
Insert
def insertCanonicalBQ(input):
    from google.cloud import bigquery
    client = bigquery.Client(project='project')
    dataset = client.dataset('dataset')
    table = dataset.table('table')
    table.reload()  # fetch the table schema so insert_data knows the columns
    table.insert_data(rows=[[values]])  # streaming insert (0.25 API)
Update
def UpdateBQ(input):
    from google.cloud import bigquery
    import uuid
    import time
    client = bigquery.Client()
    STD = "#standardSQL"
    QUERY = STD + "\n" + """UPDATE table SET field1 = 'XXX' WHERE field2 = 'YYY'"""
    client.use_legacy_sql = False
    query_job = client.run_async_query(query=QUERY, job_name='temp-query-job_{}'.format(uuid.uuid4()))  # API request
    query_job.begin()
    while True:
        query_job.reload()  # Refreshes the state via a GET request.
        if query_job.state == 'DONE':
            if query_job.error_result:
                raise RuntimeError(query_job.errors)
            print("done")
            return input
        time.sleep(1)

Even if the row wasn't in the streaming buffer, this still wouldn't be the way to approach this problem in BigQuery. BigQuery storage is better suited for bulk mutations rather than mutating individual entities like this via UPDATE. Your pattern is aligned with something I'd expect from a transactional rather than an analytical use case.
Consider an append-based pattern for this. Each time you process an entity message, write it to BigQuery via a streaming insert. Then, when needed, you can get the latest version of all entities via a query.
As an example, let's assume an arbitrary schema: idfield is your unique entity key/identifier, and message_time represents the time the message was emitted. Your entities may have many other fields. To get the latest version of the entities, we could run the following query (and possibly write this to another table):
#standardSQL
SELECT
  idfield,
  ARRAY_AGG(
    t ORDER BY message_time DESC LIMIT 1
  )[OFFSET(0)].* EXCEPT (idfield)
FROM `myproject.mydata.mytable` AS t
GROUP BY idfield
An additional advantage of this approach is that it also allows you to perform analysis at arbitrary points in time. Performing an analysis of the entities as of their state an hour ago would simply involve adding a WHERE clause: WHERE message_time <= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
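If it helps, here is a minimal sketch of running the deduplication query from Python and writing the snapshot to another table, using the current google-cloud-bigquery client rather than the 0.25 API discussed above (the project, dataset, and table names are placeholders):
from google.cloud import bigquery

client = bigquery.Client(project="myproject")

# Write the deduplicated "latest version" snapshot to another table.
job_config = bigquery.QueryJobConfig(
    destination="myproject.mydata.mytable_latest",  # placeholder snapshot table
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = """
SELECT
  idfield,
  ARRAY_AGG(t ORDER BY message_time DESC LIMIT 1)[OFFSET(0)].* EXCEPT (idfield)
FROM `myproject.mydata.mytable` AS t
GROUP BY idfield
"""

query_job = client.query(sql, job_config=job_config)
query_job.result()  # Waits for the query job to finish.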

Related

How to perform the UPSERT operation using the python BigQuery client when writing JSON record

I am writing JSON records into a BigQuery table using the function bq.insert_rows_json(f'{project}.{dataset}.{table_name}', rows_to_insert). This operation is done in INSERT mode. I was wondering if I could use the same function but in UPSERT mode. Is it possible? I checked the documentation here but did not find an argument for that.
I can't seem to find a built-in UPSERT function for Python. However, you may try and consider the approach below, which is derived from the comment of Mr.Nobody.
from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query(
    """
    MERGE my-dataset.json_table T
    USING my-dataset.json_table_source S
    ON T.int64_field_0 = S.int64_field_0
    WHEN MATCHED THEN
      UPDATE SET string_field_1 = S.string_field_1
    WHEN NOT MATCHED THEN
      INSERT (int64_field_0, string_field_1) VALUES (int64_field_0, string_field_1)"""
)
results = query_job.result()  # Waits for the job to complete.
In this approach, you will need to ingest all of your supposedly "updated" JSON data into a staging table before inserting or updating it into your main BigQuery table. The query will then match each row against the main table: if the primary ID (the uniqueness check) is already there, the query will do an UPDATE; if not, it will do an INSERT.
[Screenshots omitted: the Main Table and Source Table before running the Python code, and the Main Table after the code finished executing.]
Conclusion: the row with int64_field_0 = 4 was updated (from version 1.0.0 to 6.5.1) because it already exists in the Main Table, while the row with int64_field_0 = 5 was inserted because it did not yet exist in the Main Table.
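For completeness, a hedged sketch of the staging step that precedes the MERGE, assuming the source table my-dataset.json_table_source already exists with the same schema as the main table (table names and values below are placeholders taken from the example above):
from google.cloud import bigquery

client = bigquery.Client()

# Stream the "updated" JSON records into the staging (source) table first.
rows_to_upsert = [
    {"int64_field_0": 4, "string_field_1": "6.5.1"},  # existing ID -> will be UPDATEd
    {"int64_field_0": 5, "string_field_1": "1.0.0"},  # new ID      -> will be INSERTed
]
errors = client.insert_rows_json("my-dataset.json_table_source", rows_to_upsert)
if errors:
    raise RuntimeError(errors)

# ...then run the MERGE shown above to upsert these rows into the main table.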

BigQuery Python API copy_table copies schema but not data

I'm trying to copy one BigQuery table to another, within the same dataset, working off of the example at https://cloud.google.com/bigquery/docs/managing-tables#copy-table
I've created a function to do the copy, as follows:
def copy_table(source_table, dest_table):
    client = bigquery.Client()
    source_table_ref = "my_project.my_dataset." + source_table
    dest_table_ref = "my_project.my_dataset." + dest_table
    job = client.copy_table(
        source_table_ref,
        dest_table_ref)  # API request
    job.result()
When I do the copy, the dest_table is created with the same schema as the source_table, but none of the data is copied from the source_table to the dest_table.
Here's the overall sequence of what I'm doing:
Create source_table
Insert rows into source_table
Do query to check that the rows are in the source_table (they are -- SELECT COUNT(*) returns the right number of rows)
Copy source_table to dest_table using the function above
Do query to check that the rows are in the dest_table (they are NOT -- SELECT COUNT(*) returns zero rows)
My guess is that somehow these jobs are being executed asynchronously, but I can't figure out how to make them execute synchronously. Any ideas would be greatly appreciated.
If it helps any, my overall goal is to create and fill a new table with the results of a daily batch job (get_user_info_2020-06-27, for example), then copy it to a get_user_info_current table that always contains the current day's user info.
Edit:
Some more info, based on testing:
On the page at https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability, it says: "Data can take up to 90 minutes to become available for copy operations". So, I coded up a little thing to go after my insert statement to wait until it completes:
import datetime
import sys
import time

from google.cloud import bigquery

def insert_table_wait(table_name, prev_rows, rows_inserted):
    client = bigquery.Client()
    table_id = "pacs-user-analysis-dev.google_users." + table_name
    # Wait until the insert fully completes: the streaming buffer is gone and
    # the reported row count reflects the newly inserted rows.
    curr_table = client.get_table(table_id)  # Make an API request.
    sys.stderr.write(str(datetime.datetime.now()) + " " + table_name + " properties: " + str(curr_table._properties) + "\n")
    curr_rows = int(curr_table._properties.get('numRows'))
    while curr_table._properties.get('streamingBuffer') is not None or curr_rows != prev_rows + rows_inserted:
        sys.stderr.write(str(datetime.datetime.now()) + " Waiting for insert into " + str(curr_table._properties.get('id')) + " to complete. StreamingBuffer details: " + str(curr_table._properties.get('streamingBuffer')) + " prev_rows: " + str(prev_rows) + " curr_rows: " + str(curr_rows) + " should be: " + str(prev_rows + rows_inserted) + "\n")
        time.sleep(10)
        curr_table = client.get_table(table_id)
        curr_rows = int(curr_table._properties.get('numRows'))
I'm hoping that might fix things. What I don't understand is that the new rows appear in the BigQuery Console UI almost immediately, but table._properties.get('numRows') doesn't seem to update nearly as promptly.
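As a side note, recent versions of the client expose the streaming buffer as a public attribute, so a hedged variant of the wait loop could poll that instead of digging into _properties (keeping in mind the buffer can reportedly take up to 90 minutes to drain, per the documentation quoted above):
import time

from google.cloud import bigquery

def wait_for_streaming_buffer(table_id, poll_seconds=10):
    """Block until BigQuery no longer reports a streaming buffer on the table."""
    client = bigquery.Client()
    table = client.get_table(table_id)          # API request for table metadata
    while table.streaming_buffer is not None:   # None once the buffer has been flushed
        time.sleep(poll_seconds)
        table = client.get_table(table_id)      # refresh the metadata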
Since there is a limitation in copy jobs for data in the streaming buffer, I suggest that you use a query job and set a destination table, just like in the code below:
from google.cloud import bigquery
client = bigquery.Client(project = "your-project")
job_config = bigquery.QueryJobConfig(destination="destination-table-id")
# SELECT * to copy the whole table
sql = "SELECT * FROM <source_table>"
query_job = client.query(sql, job_config=job_config)
query_job.result()
Please let me know if that helps you.
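If the destination table already exists (for example a get_user_info_current table that is refreshed daily, as described in the question), the job configuration presumably also needs a write disposition so the query overwrites it, e.g.:
job_config = bigquery.QueryJobConfig(
    destination="destination-table-id",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace the previous snapshot
)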

How to join PCollection in streaming mode on a static lookup table by key in Apache Beam (Python)

I'm streaming in (unbounded) data from Google Cloud Pubsub into a PCollection in the form of a dictionary. As the streamed data comes in, I'd like to enrich it by joining it by key on a static (bounded) lookup table. This table is small enough to live in memory.
I currently have a working solution that runs using the DirectRunner, but when I try to run it on the DataflowRunner, I get an error.
I've read the bounded lookup table in from a csv using the beam.io.ReadFromText function, and parsed the values into a dictionary. I've then created a ParDo function that takes my unbounded PCollection and the lookup dictionary as a side input. In the ParDo, it uses a generator to "join" on the correct row of the lookup table, and will enrich the input element.
Here are some of the main parts:
# Get bounded lookup table
lookup_dict = (pcoll | 'Read PS Table' >> beam.io.ReadFromText(...)
                     | 'Split CSV to Dict' >> beam.ParDo(SplitCSVtoDict()))

# Use lookup table as side input in ParDo func to enrich unbounded pcoll
# I found that it only worked on my local machine when decorating it with AsList
enriched = pcoll | 'join pcoll on lkup' >> beam.ParDo(JoinLkupData(), lookup_data=beam.pvalue.AsList(lookup_dict))

class JoinLkupData(beam.DoFn):
    def process(self, element, lookup_data):
        # I used a generator here to find the matching lookup row
        lkup = next((row for row in lookup_data if row[<JOIN_FIELD>] == element[<JOIN_FIELD>]), None)
        if lkup:
            # If there is a join, add new fields to the pcoll
            element['field1'] = lkup['field1']
            element['field2'] = lkup['field2']
            yield element
I was able to get the correct result when running locally using DirectRunner, but when running on the DataFlow Runner, I receive this error:
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Workflow failed. Causes: Expected custom source to have non-zero number of splits.
This post: " Error while splitting pcollections on Dataflow runner " made me think that the reason for this error has to do with the multiple workers not having access to the same lookup table when splitting the work.
In the future, please share the version of Beam and the stack trace if you can.
In this case, it is a known issue that the error message is not very good. At the time of this writing, Dataflow for Python streaming is limited to only Pubsub for reading and writing and BigQuery for writing. Using the text source in a pipeline results in this error.
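As a hedged workaround (not part of the original answer): since the lookup table is small and static, one option is to read the CSV at pipeline-construction time and pass the rows to the DoFn as an ordinary constructor argument, avoiding the text source entirely. A sketch, with a hypothetical file path and join key name:
import csv

import apache_beam as beam

class JoinLkupData(beam.DoFn):
    def __init__(self, lookup_rows):
        # Small, bounded lookup table: serialized with the DoFn and shipped to workers.
        self.lookup_rows = lookup_rows

    def process(self, element):
        lkup = next(
            (row for row in self.lookup_rows
             if row["join_field"] == element["join_field"]),  # hypothetical join key
            None,
        )
        if lkup:
            element["field1"] = lkup["field1"]
            element["field2"] = lkup["field2"]
            yield element

# Read the CSV before the pipeline is submitted (local file, or downloaded from GCS).
with open("lookup.csv") as f:  # hypothetical path
    lookup_rows = list(csv.DictReader(f))

# enriched = pcoll | "join pcoll on lkup" >> beam.ParDo(JoinLkupData(lookup_rows))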

Get number of rows in a BigQuery table (streaming buffer)

I am doing inserts via Streaming. In the UI, I can see the following row counts:
Is there a way to get that via the API? Currently, when I do:
from google.cloud import bigquery
client = bigquery.Client()
dataset = client.dataset("bqtesting")
table = client.get_table(dataset.table('table_streaming'))
table.num_rows
# returns: 0
Obviously 0 is not the number that I'm looking to get. From the API documentation it says:
numRows unsigned long [Output-only] The number of rows of data in this table, excluding any data in the streaming buffer.
So then, my question is: how do we get the exact number of rows in a table? Currently I'm doing:
count=[item[0] for item in client.query('SELECT COUNT(*) FROM `bqtesting.table_streaming`').result()][0]
But this takes about 5s just to get the count (and I need to execute this query quite frequently to see if all streaming inserts have 'finished').
SELECT COUNT(1) and SELECT COUNT(*) queries have 0 scanned and billed bytes (you can see this in the job metadata after you run the query, or in a dry run), so you should be able to run them as often as you like.
If I'm reading the documentation correctly, the numbers there are not guaranteed to include rows in the buffer which have not yet been flushed to BigQuery storage.
You can also use the API mentioned at https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability, i.e. check the streamingBuffer.oldestEntryTime field from the tables.get result.
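For example, a minimal sketch of that check with the Python client, assuming a reasonably recent google-cloud-bigquery version where the table object exposes a streaming_buffer attribute:
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("bqtesting.table_streaming")

buf = table.streaming_buffer  # None once the buffer has been flushed
if buf is not None:
    print("estimated rows in buffer:", buf.estimated_rows)
    print("oldest entry time:", buf.oldest_entry_time)
else:
    print("no active streaming buffer")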
You can use the __TABLES__ metadata table to get the information that you want. Querying __TABLES__ incurs no charges.
The query that you need is:
SELECT table_id, row_count, size_bytes
FROM `your-project-name.bqtesting.__TABLES__`
WHERE STARTS_WITH(table_id, "table_streaming")
ORDER BY table_id DESC
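Running that from Python with the client is straightforward; a small sketch (project and dataset names as in the query above):
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT table_id, row_count, size_bytes
    FROM `your-project-name.bqtesting.__TABLES__`
    WHERE STARTS_WITH(table_id, "table_streaming")
    ORDER BY table_id DESC
"""

for row in client.query(sql).result():
    print(row.table_id, row.row_count, row.size_bytes)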

Partitioning BigQuery Tables via API in python

I'm using Python to hit the BigQuery API. I've been successful at running queries and writing new tables, but would like to ensure those output tables are partitioned per https://cloud.google.com/bigquery/docs/creating-partitioned-tables
The output of the query would have the columns: event_date[string in the format "2017-11-12"], metric[integer]
Per the code below, I've been assigning "partitioning_type" to various objects, but it never returns an error.
(I guess it'd also be useful to know how to tell whether my partitioning efforts are actually working, i.e. how to identify the _PARTITIONTIME pseudo column.)
dest_table_id = "BQresults"
query_job = client.run_async_query(str(uuid.uuid4()), query))
query_job.allow_large_results = True
dest_dataset = client.dataset(dest_dataset_id)
dest_table = dest_dataset.table(dest_table_id)
dest_table.partitioning_type ="DAY"
query_job.destination = dest_table
query_job.write_disposition = 'WRITE_TRUNCATE'
query_job.use_legacy_sql = False
query_job.begin()
query_job.result()
If you want to check whether the table is partitioned on a time column or not, use the get_table() method and check the partitioning_type property of the returned object.
You can check for integer-range partitioning by checking the range_partitioning property. You can also get the job object using get_job() with the job ID and check whether time_partitioning was set in the configuration.
I don't think the query job you're running results in a partitioned table, since time_partitioning should be set in the job configuration, and it seems like the client doesn't do that here. If that is the case, you can create the partitioned table first and then use that existing table as the destination.
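With the current client (newer than the run_async_query API used in the question), time partitioning can be attached directly to the query job configuration; a hedged sketch with placeholder project/dataset/table names, using ingestion-time (DAY) partitioning since event_date is a string column:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.BQresults",  # placeholder table id
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY  # partitions on _PARTITIONTIME
    ),
)

query_job = client.query(
    "SELECT event_date, metric FROM `my-project.my_dataset.source_table`",  # placeholder query
    job_config=job_config,
)
query_job.result()

# Verify that the destination table is partitioned.
table = client.get_table("my-project.my_dataset.BQresults")
print(table.time_partitioning)  # e.g. TimePartitioning(type_='DAY')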
