How to pass dataset id to bigquery client for python

I just started playing around with bigquery, and I am trying to pass the dataset id to the python client. It should be a pretty basic operation, but I can't find it on other threads.
In practice I would like to take the following example
# import packages
import os
from google.cloud import bigquery
# set current work directory to the one with this script.
os.chdir(os.path.dirname(os.path.abspath(__file__)))
# initialize client object using the bigquery key I generated from Google clouds
google_credentials_path = 'bigquery-stackoverflow-DC-fdb49371cf87.json'
client = bigquery.Client.from_service_account_json(google_credentials_path)
# create simple query
query_job = client.query(
    """
    SELECT
      CONCAT(
        'https://stackoverflow.com/questions/',
        CAST(id as STRING)) as url,
      view_count
    FROM `bigquery-public-data.stackoverflow.posts_questions`
    WHERE tags like '%google-bigquery%'
    ORDER BY view_count DESC
    LIMIT 10"""
)
# store results in dataframe
dataframe_query = query_job.result().to_dataframe()
and make it look something like
# import packages
import os
from google.cloud import bigquery
# set current work directory to the one with this script.
os.chdir(os.path.dirname(os.path.abspath(__file__)))
# initialize client object using the bigquery key I generated from Google clouds
google_credentials_path = 'bigquery-stackoverflow-DC-fdb49371cf87.json'
client = bigquery.Client.from_service_account_json(google_credentials_path)\
.A_function_to_specify_id(bigquery-public-data.stackoverflow)
# create simple query
query_job = client.query(
    """
    SELECT
      CONCAT(
        'https://stackoverflow.com/questions/',
        CAST(id as STRING)) as url,
      view_count
    FROM `posts_questions` -- No dataset ID here anymore
    WHERE tags like 'google-bigquery'
    ORDER BY view_count DESC
    LIMIT 10"""
)
# store results in dataframe
dataframe_query = query_job.result().to_dataframe()
The documentation eludes me, so any help would be appreciated.

The closest thing to what you're asking for is the default_dataset (reference) property of the query job config. The query job config is an optional object that can be passed into the query() method of the instantiated BigQuery client.
You don't set a default dataset as part of instantiating a client, because not all resources are dataset-scoped. You're implicitly working with a query job in your example, which is a project-scoped resource.
So, to adapt your sample a bit, it might look something like this:
# skip the irrelevant bits like imports and client construction
job_config = bigquery.QueryJobConfig(default_dataset="bigquery-public-data.stackoverflow")
sql = "SELECT COUNT(1) FROM posts_questions WHERE tags like 'google-bigquery'"
dataframe = client.query(sql, job_config=job_config).to_dataframe()
If you're issuing multiple queries against this same dataset you could certainly reuse the same job config object with multiple query invocations.
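If it helps, a minimal sketch of that reuse might look like this (it assumes the client constructed in the question; both queries are made up for illustration):
from google.cloud import bigquery

job_config = bigquery.QueryJobConfig(default_dataset="bigquery-public-data.stackoverflow")

# first query against the default dataset
top_questions = client.query(
    "SELECT title, view_count FROM posts_questions ORDER BY view_count DESC LIMIT 10",
    job_config=job_config,
).to_dataframe()

# second query reusing the same job config
answer_count = client.query(
    "SELECT COUNT(1) AS n FROM posts_answers",
    job_config=job_config,
).to_dataframe()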

Related

How to avoid duplication in a dataflow beam pipeline in the writeToBq step?

We have a job running on Dataflow that ingests data from Pub/Sub and writes it to BigQuery. With a limited amount of data we did not see any duplicates, but at our current volume of about 100 events/s we get duplicates in the BigQuery tables. What we call a duplicate here is a row with the same event UUID.
Here is my code:
class CustomParse(beam.DoFn):
    """Custom ParallelDo class to apply a custom transformation"""

    def to_runner_api_parameter(self, unused_context):
        return "beam:transforms:custom_parsing:custom_v0", None

    def process(self, message: beam.io.PubsubMessage, timestamp=beam.DoFn.TimestampParam, window=beam.DoFn.WindowParam):
        import uuid
        data_parsed = {
            "data": message.data,
            "dataflow_timestamp": timestamp.to_rfc3339(),
            "uuid": uuid.uuid4().hex
        }
        yield data_parsed


def run():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input_subscription",
        help='Input PubSub subscription of the form "projects/<PROJECT>/subscriptions/<SUBSCRIPTION>."'
    )
    parser.add_argument(
        "--output_table", help="Output BigQuery Table"
    )
    known_args, pipeline_args = parser.parse_known_args()

    additional_bq_parameters = {
        'timePartitioning': {'type': 'HOUR'}}

    # Creating pipeline options
    pipeline_options = PipelineOptions(pipeline_args)

    def get_table_name(x):
        namespace = NAMESPACE_EXTRACTED
        date = x['dataflow_timestamp'][:10].replace('-', '')
        return f"{known_args.output_table}_{namespace}_{date}"

    # Defining our pipeline and its steps
    p = beam.Pipeline(options=pipeline_options)
    (
        p
        | "ReadFromPubSub" >> beam.io.gcp.pubsub.ReadFromPubSub(
            subscription=known_args.input_subscription, timestamp_attribute=None, with_attributes=True
        )
        | "Prevent fusion" >> beam.transforms.util.Reshuffle()
        | "CustomParse" >> beam.ParDo(CustomParse(broker_model))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table=get_table_name,
            schema=BIGQUERY_SCHEMA,
            additional_bq_parameters=additional_bq_parameters,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            batch_size=1000
        )
    )
    pipeline_result = p.run()


if __name__ == "__main__":
    run()
What should we do to avoid this? Are we missing a combining step? For the record, it has not happened following an error.
I'm missing some context (for example, you haven't included the BrokerParsing transform), but based on what you've included here, it seems like the issue might be that you're not including the id_label parameter in the ReadFromPubSub transform. According to the documentation:
id_label – The attribute on incoming Pub/Sub messages to use as a unique record identifier. When specified, the value of this attribute (which can be any string that uniquely identifies the record) will be used for deduplication of messages. If not provided, we cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream. In this case, deduplication of the stream will be strictly best effort.
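Note that this only helps if the publisher attaches the identifier as a message attribute before publishing; in your pipeline the UUID is generated after the read, so it can't serve as id_label. A rough sketch, assuming a hypothetical event_uuid attribute set by the producer:
messages = (
    p
    | "ReadFromPubSub" >> beam.io.gcp.pubsub.ReadFromPubSub(
        subscription=known_args.input_subscription,
        with_attributes=True,
        id_label="event_uuid",  # hypothetical attribute used as the deduplication key
    )
)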
I believe this is due to the beam.io.WriteToBigQuery streaming inserts guarantee:
Use the BigQuery streaming insert API to insert data. This provides the lowest-latency insert path into BigQuery, and therefore is the default method when the input is unbounded. BigQuery will make a strong effort to ensure no duplicates when using this path, however there are some scenarios in which BigQuery is unable to make this guarantee (see https://cloud.google.com/bigquery/streaming-data-into-bigquery). A query can be run over the output table to periodically clean these rare duplicates. Alternatively, using the FILE_LOADS insert method does guarantee no duplicates, though the latency for the insert into BigQuery will be much higher. For more information, see Streaming Data into BigQuery.
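Translated to the pipeline in the question, switching to batch loads could look roughly like this (only the sink step changes; get_table_name, BIGQUERY_SCHEMA and additional_bq_parameters are the question's own names, and the triggering frequency is an assumed value):
# Sketch: replace the sink step with periodic batch file loads instead of streaming inserts.
write_to_bq = beam.io.WriteToBigQuery(
    table=get_table_name,
    schema=BIGQUERY_SCHEMA,
    additional_bq_parameters=additional_bq_parameters,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    triggering_frequency=300,  # seconds between load jobs; required for unbounded input
)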

How to handle BigQuery insert errors in a Dataflow pipeline using Python?

I'm trying to create a streaming pipeline with Dataflow that reads messages from a PubSub topic to end up writing them on a BigQuery table. I don't want to use any Dataflow template.
For the moment I just want to create the pipeline in a Python 3 script executed from a Google VM instance, to carry out a loading and transformation process for every message that arrives from Pub/Sub (parsing the records it contains and adding a new field), and end up writing the results to a BigQuery table.
Simplifying, my code would be:
#!/usr/bin/env python
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
import apache_beam as beam
import apache_beam.io.gcp.bigquery
import logging
import argparse
import sys
import json
from datetime import datetime, timedelta

def load_pubsub(message):
    try:
        data = json.loads(message)
        records = data["messages"]
        return records
    except:
        raise ImportError("Something went wrong reading data from the Pub/Sub topic")

class ParseTransformPubSub(beam.DoFn):
    def __init__(self):
        self.water_mark = (datetime.now() + timedelta(hours=1)).strftime("%Y-%m-%d %H:%M:%S.%f")

    def process(self, records):
        for record in records:
            record["E"] = self.water_mark
            yield record

def main():
    table_schema = apache_beam.io.gcp.bigquery.parse_table_schema_from_json(open("TableSchema.json").read())
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_topic')
    parser.add_argument('--output_table')
    known_args, pipeline_args = parser.parse_known_args(sys.argv)
    with beam.Pipeline(argv=pipeline_args) as p:
        pipe = (p
                | 'ReadDataFromPubSub' >> beam.io.ReadStringsFromPubSub(known_args.input_topic)
                | 'LoadJSON' >> beam.Map(load_pubsub)
                | 'ParseTransform' >> beam.ParDo(ParseTransformPubSub())
                | 'WriteToAvailabilityTable' >> beam.io.WriteToBigQuery(
                    table=known_args.output_table,
                    schema=table_schema,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
               )
        result = p.run()
        result.wait_until_finish()

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    main()
The messages published in the Pub/Sub topic typically look like this:
'{"messages":[{"A":"Alpha", "B":"V1", "C":3, "D":12},{"A":"Alpha", "B":"V1", "C":5, "D":14},{"A":"Alpha", "B":"V1", "C":3, "D":22}]}'
Once the field "E" is added to a record, the structure of the record (a Python dictionary) and the data types of its fields match what the BigQuery table expects.
The problems I want to handle are:
If some messages arrive with an unexpected structure, I want to fork the pipeline flow and write them to another BigQuery table.
If some messages arrive with an unexpected data type in a field, an error will occur at the last stage of the pipeline, when they should be written to the table. I want to manage this type of error by diverting the record to a third table.
I read the documentation found on the following pages but I found nothing:
https://cloud.google.com/dataflow/docs/guides/troubleshooting-your-pipeline
https://cloud.google.com/dataflow/docs/guides/common-errors
By the way, if I choose the option of configuring the pipeline through the template that reads from a Pub/Sub subscription and writes into BigQuery, I get the following schema, which turns out to be the same one I'm looking for:
Template: Cloud Pub/Sub Subscription to BigQuery
You can't catch errors that occur in the BigQuery sink; the messages you write to BigQuery must already be valid.
The best pattern is to add a transform that checks the structure and field types of your messages. In case of error, you create an error flow and write that flow somewhere else, for example to a file, or to a table without a schema where you store the raw message as plain text.
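A minimal sketch of that pattern with a multi-output ParDo (the expected fields and the "invalid" tag are made up for illustration):
import json
import apache_beam as beam
from apache_beam import pvalue

EXPECTED_FIELDS = {"A", "B", "C", "D", "E"}  # assumed from the question's schema

class ValidateRecord(beam.DoFn):
    """Send well-formed records to the main output, everything else to the 'invalid' tag."""
    def process(self, record):
        if isinstance(record, dict) and set(record) >= EXPECTED_FIELDS:
            yield record
        else:
            # keep the raw payload as plain text so it can be stored without a schema
            yield pvalue.TaggedOutput("invalid", json.dumps(record, default=str))

# In the pipeline:
# outputs = parsed | "Validate" >> beam.ParDo(ValidateRecord()).with_outputs("invalid", main="valid")
# outputs.valid   -> WriteToBigQuery as in the question
# outputs.invalid -> a plain-text file or a single STRING-column error table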
We do the following when errors occur at the BigQuery sink:
send a message (without the stack trace) to GCP Error Reporting, so developers are notified
log the error to Stackdriver
stop the pipeline execution (the best place for messages to wait until a developer has fixed the issue is the incoming Pub/Sub subscription)
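As a rough illustration of that approach (the DoFn name and the validate() helper are hypothetical; it assumes the google-cloud-error-reporting package):
import logging
import apache_beam as beam
from google.cloud import error_reporting

class ReportAndFail(beam.DoFn):
    """Illustrative handler: report the problem, log it, then re-raise so the
    bundle fails and the message is not acked from the incoming subscription."""
    def process(self, element):
        try:
            yield validate(element)  # validate() is a hypothetical check/serialization step
        except Exception as exc:
            error_reporting.Client().report(f"Bad record {element!r}: {exc}")  # notify developers
            logging.exception("Record rejected before the BigQuery sink")
            raise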

Running python script to get SQL Statement for Google BigQuery

Trying to run a script that contains a SQL query:
import example_script
example_script.df.describe()
example_script.df.info()
q1 = '''
example_script.df['specific_column'])
'''
job_config = bigquery.QueryJobConfig()
query_job = client.query(q1, job_config= job_config)
q = query_job.to_dataframe()
The issue I'm having is: when I import it, how do I get that specific column name used as text? It should then run the query against GBQ, but instead it's stuck in pandas formatting that Google doesn't want to read. Are there other options?

How to retrieve Test Results in Azure DevOps with Python REST API?

How to retrieve Test Results from VSTS (Azure DevOps) by using Python REST API?
The documentation is (as of today) very light, and even the examples in the dedicated API samples repo are sparse (https://github.com/Microsoft/azure-devops-python-samples).
For some reason, Test Results are not considered WorkItems, so a regular WIQL query will not work.
Additionally, it would be great to query the results for a given Area Path.
Thanks
First you need to get the proper connection client with the client string that matches the test results.
from vsts.vss_connection import VssConnection
from msrest.authentication import BasicAuthentication
token = "hcykwckuhe6vbnigsjs7r3ai2jefsdlkfjslkfj5mxizbtfu6k53j4ia"
team_instance = "https://tfstest.toto.com:8443/tfs/Development/"
credentials = BasicAuthentication("", token)
connection = VssConnection(base_url=team_instance, creds=credentials)
TEST_CLIENT = "vsts.test.v4_1.test_client.TestClient"
test_client = connection.get_client(TEST_CLIENT)
Then, you can have a look at all the functions available in vsts/test/<api_version>/test_client.py
The following functions look interesting:
def get_test_results(self, project, run_id, details_to_include=None, skip=None, top=None, outcomes=None) (Get Test Results for a run based on filters)
def get_test_runs(self, project, build_uri=None, owner=None, tmi_run_id=None, plan_id=None, include_run_details=None, automated=None, skip=None, top=None)
def query_test_runs(self, project, min_last_updated_date, max_last_updated_date, state=None, plan_ids=None, is_automated=None, publish_context=None, build_ids=None, build_def_ids=None, branch_name=None, release_ids=None, release_def_ids=None, release_env_ids=None, release_env_def_ids=None, run_title=None, top=None, continuation_token=None) (although this function has a limitation of a 7-day range between min_last_updated_date and max_last_updated_date)
To retrieve all the results from the Test Plans in a given Area Path, I have used the following code:
tp_query = Wiql(query="""
SELECT
    [System.Id]
FROM workitems
WHERE
    [System.WorkItemType] = 'Test Plan'
    AND [Area Path] UNDER 'Development\MySoftware'
ORDER BY [System.ChangedDate] DESC""")

for plan in wit_client.query_by_wiql(tp_query).work_items:
    print(f"Results for {plan.id}")
    for run in test_client.get_test_runs(my_project, plan_id=plan.id):
        for res in test_client.get_test_results(my_project, run.id):
            tc = res.test_case
            print(f"#{run.id}. {tc.name} ({tc.id}) => {res.outcome} by {res.run_by.display_name} in {res.duration_in_ms}")
Note that a test result includes the following attributes:
duration_in_ms
build
outcome (string)
associated_bugs
run_by (Identity)
test_case (TestCase)
test_case_title (string)
area (AreaPath)
test_run (corresponding to the test run)
test_suite
test_plan
completed_date (Python datetime object)
started_date (Python datetime object)
configuration
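For example, those attributes can be combined into a short per-result summary (a sketch reusing the res objects from the loop above):
def summarize(res):
    # duration_in_ms and completed_date are attributes from the list above
    duration_s = (res.duration_in_ms or 0) / 1000.0
    return (f"{res.test_case_title}: {res.outcome} "
            f"in {duration_s:.1f}s, finished {res.completed_date:%Y-%m-%d %H:%M}")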
Hope it can help others save the number of hours I spent exploring this API.
Cheers

How to create date partitioned tables in GBQ? Can you use python?

I have just under 100M records of data that I wish to transform by denormalising a field and then load into a date-partitioned GBQ table. The dates go back to 2001.
I had hoped that I could transform the data with Python and then use GBQ directly from the script to accomplish this, but after reading up on this and particularly this document, it doesn't seem straightforward to create date-partitioned tables. I'm looking for a steer in the right direction.
Is there any working example of a python script that can do this? Or is it not possible to do via Python? Or is there another method someone can point me in the direction of?
Update
I'm not sure if I've missed something, but the tables created appear to be partitioned by the insert date at the time I create the table, whereas I want to partition by a date set within the existing dataset. I can't see any way of changing this.
Here's what I've been experimenting with:
import uuid
import os
import csv
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField
from google.cloud.bigquery import Client
from google.cloud.bigquery import Table
import logging  # logging.warning(data_store+file)
import json
import pprint

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path to service account credentials'

client = bigquery.Client()

dataset = client.dataset('test_dataset')
dataset.create()

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]
table = dataset.table('table_name', SCHEMA)
table.partitioning_type = "DAY"
table.create()

rows = [
    ('bob', 30),
    ('bill', 31)
]
table.insert_data(rows)
Is it possible to modify this to take control of the partitions as I create tables and insert data?
Update 2
It turns out I wasn't looking for table partitioning; for my use case it's enough to simply append a date serial to the end of my table name and then query with something along the lines of:
SELECT * FROM `dataset.test_dataset.table_name_*`
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170702'
I don't know whether this is technically still partitioning or not, but as far as I can see it has the same benefits.
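For what it's worth, running that sharded-table query from the Python client might look like this sketch (my_project is a placeholder; adjust the table prefix to your own naming):
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT *
    FROM `my_project.test_dataset.table_name_*`
    WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170702'
"""
df = client.query(sql).to_dataframe()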
Updated to latest version (google-cloud-bigquery==1.4.0):
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField

client = bigquery.Client()
dataset_ref = client.dataset('test_dataset')
table_ref = dataset_ref.table('test_table')

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]
table = bigquery.Table(table_ref, schema=SCHEMA)

partition = 'DAY'  # e.g. passed in by the caller
if partition not in ('DAY', ):
    raise NotImplementedError(f"BigQuery partition type unknown: {partition}")
table.time_partitioning = bigquery.table.TimePartitioning(type_=partition)

table = client.create_table(table)  # API request
You can easily create date partitioned tables using the API and Python SDK. Simply set the timePartitioning field to DAY in your script:
https://github.com/GoogleCloudPlatform/google-cloud-python/blob/a14905b6931ba3be94adac4d12d59232077b33d2/bigquery/google/cloud/bigquery/table.py#L219
Or roll your own table insert request with the following body:
{
  "tableReference": {
    "projectId": "myProject",
    "tableId": "table1",
    "datasetId": "mydataset"
  },
  "timePartitioning": {
    "type": "DAY"
  }
}
Everything is just backed by the REST API here.
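For illustration, such a request could be sent with google-auth and an authorized requests session (a sketch; the dataset and table names are placeholders):
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/bigquery"])
session = AuthorizedSession(credentials)

body = {
    "tableReference": {"projectId": project_id, "datasetId": "mydataset", "tableId": "table1"},
    "timePartitioning": {"type": "DAY"},
}
# tables.insert endpoint of the BigQuery v2 REST API
resp = session.post(
    f"https://bigquery.googleapis.com/bigquery/v2/projects/{project_id}/datasets/mydataset/tables",
    json=body,
)
resp.raise_for_status()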
Be aware that different versions of google-api-core handle time-partitioned tables differently. For example, using google-cloud-core==0.29.1, you must use the bigquery.Table object to create time-partitioned tables:
from google.cloud import bigquery
MY_SA_PATH = "/path/to/my/service-account-file.json"
MY_DATASET_NAME = "example"
MY_TABLE_NAME = "my_table"
client = bigquery.Client.from_service_account_json(MY_SA_PATH)
dataset_ref = client.dataset(MY_DATASET_NAME)
table_ref = dataset_ref.table(MY_TABLE_NAME)
actual_table = bigquery.Table(table_ref)
actual_table.partitioning_type = "DAY"
client.create_table(actual_table)
I only discovered this by looking at the 0.20.1 Table source code. I didn't see this in any docs or examples. If you're having problems creating time-partitioned tables, I suggest that you identify the version of each Google library that you're using (for example, using pip freeze), and check your work against the library's source code.
