I'm working on an ETL pipeline that pulls data from a database, applies minor transformations, and outputs to BigQuery. I have written my pipeline in Apache Beam 2.26.0 using the Python SDK. I'm loading a dozen or so tables, and I'm passing their names as arguments to beam.io.WriteToBigQuery.
Now, the documentation (https://beam.apache.org/documentation/io/built-in/google-bigquery) says:
When writing to BigQuery, you must supply a table schema for the destination table that you want to write to, unless you specify a create disposition of CREATE_NEVER.
I believe this is not exactly true. In my tests, this was the case only when passing a static table name.
If you have a bunch of tables and pass the table name dynamically, it throws an error:
ErrorProto message: 'No schema specified on job or table.'
My code:
bq_data | "Load data to BQ" >> beam.io.WriteToBigQuery(
table=lambda row: bg_config[row['table_name']],
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
)
bq_data is a dict representing a row of a pandas data frame, and it contains a table_name column.
bq_config is a dictionary where the key is row['table_name'] and the value has the format:
[project_id]:[dataset_id].[table_id]
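For illustration, a hypothetical bq_config might look like this (project, dataset, and table names are placeholders):

bq_config = {
    'customers': 'my-project:my_dataset.customers',
    'orders': 'my-project:my_dataset.orders',
}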
Anyone have some thoughts on this?
Have a look at this thread; I addressed it there. In short, I used Python's built-in time/date functions to render the variable before executing the BigQuery API request.
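Another option that may be worth exploring (a sketch only, not verified against Beam 2.26.0): WriteToBigQuery also accepts a callable for schema, which is called with the destination returned by the table callable, so each dynamic table can be given its own schema. The bq_schemas dict here is hypothetical and continues the bq_config example above.

# Hypothetical: map each destination (as returned by the table callable) to a schema string.
bq_schemas = {
    'my-project:my_dataset.customers': 'id:INT64,name:STRING',
    'my-project:my_dataset.orders': 'id:INT64,amount:FLOAT64',
}

bq_data | "Load data to BQ" >> beam.io.WriteToBigQuery(
    table=lambda row: bq_config[row['table_name']],
    schema=lambda dest: bq_schemas[dest],  # schema callable keyed on the destination
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
)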
I am writing JSON records into a BigQuery table using the function bq.insert_rows_json(f'{project}.{dataset}.{table_name}', rows_to_insert). This operation is done in INSERT mode. I was wondering if I could use the same function in UPSERT mode instead. Is that possible? I checked the documentation here but did not find an argument for that.
I can't seem to find a built-in UPSERT function in the Python client. However, you may consider the approach below, which is derived from the comment of Mr.Nobody.
from google.cloud import bigquery

client = bigquery.Client()

query_job = client.query(
    """
    MERGE my-dataset.json_table T
    USING my-dataset.json_table_source S
    ON T.int64_field_0 = S.int64_field_0
    WHEN MATCHED THEN
      UPDATE SET string_field_1 = S.string_field_1
    WHEN NOT MATCHED THEN
      INSERT (int64_field_0, string_field_1) VALUES (int64_field_0, string_field_1)"""
)

results = query_job.result()  # Waits for job to complete.
In this approach, you first need to ingest all of your "updated" JSON data into a source table before merging it into your main BigQuery table. The query then matches each row against the main table: if the primary ID (the uniqueness check) is already there, the query performs an UPDATE; if not, it performs an INSERT.
Screenshots of both tables before running the Python code:
Main Table:
Source Table:
Screenshot of the Main Table after the Python code finished executing:
Conclusion: the row with int64_field_0 = 4 was updated (from version 1.0.0 to 6.5.1) because it already exists in the Main table; the row with int64_field_0 = 5 was inserted because it does not yet exist in the Main table.
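For completeness, a minimal sketch of how the "updated" JSON rows could be staged into the source table (json_table_source) before running the MERGE, still using insert_rows_json; the table and field values here are assumed from the example above:

from google.cloud import bigquery

client = bigquery.Client()

# Assumed incoming JSON records to upsert.
rows_to_insert = [
    {"int64_field_0": 4, "string_field_1": "6.5.1"},
    {"int64_field_0": 5, "string_field_1": "1.2.0"},
]

# Stage the rows in the source table first, then run the MERGE shown above.
errors = client.insert_rows_json("my-dataset.json_table_source", rows_to_insert)
if errors:
    print(f"Errors while inserting rows: {errors}")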
I am trying to write to BigQuery with different table destinations, and I would like to create the tables dynamically if they don't already exist.
bigquery_rows | "Writing to Bigquery" >> WriteToBigQuery(lambda e: compute_table_name(e),
schema=compute_table_schema,
additional_bq_parameters=additional_bq_parameters,
write_disposition=BigQueryDisposition.WRITE_APPEND,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
)
The compute_table_name function is actually quite simple; I am just trying to get it to work.
def compute_table_name(element):
    if element['table'] == 'table_id':
        del element['table']
        return "project_id:dataset.table_id"
The schema is detected correctly and the table IS created and populated with records. The problem is, the table ID I get is something along the lines of:
datasetId: 'dataset'
projectId: 'project_id'
tableId: 'beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP...
I have also tried returning a bigquery.TableReference object in my compute_table_name function to no avail.
EDIT: I am using apache-beam 2.34.0 and I have opened an issue on JIRA here
Your pipeline code is fine. However, you can just pass the compute_table_name callable directly:
bigquery_rows | "Writing to Bigquery" >> WriteToBigQuery(compute_table_name,
schema=compute_table_schema,
additional_bq_parameters=additional_bq_parameters,
write_disposition=BigQueryDisposition.WRITE_APPEND,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
)
The 'beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP' table name in BigQuery probably means that the load job either has not finished yet or has errors; you should check the "Personal history" or "Project history" tabs in the BigQuery UI to see the status of the job.
I found the solution to my problem by following this answer. It felt like a workaround because I'm not passing a callable to WriteToBigQuery(). Testing a number of approaches, I found that passing a string/TableReference to the method directly worked, but passing a callable did not.
I process ~50 gigs of data every 15 minutes spread across 6 tables and it works decently.
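For reference, a minimal sketch of that workaround, passing a TableReference (or a plain table spec string) directly instead of a callable; the project, dataset, and table names are placeholders:

from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition
from apache_beam.io.gcp.internal.clients import bigquery

table_ref = bigquery.TableReference(
    projectId='project_id',
    datasetId='dataset',
    tableId='table_id')

bigquery_rows | "Writing to Bigquery" >> WriteToBigQuery(
    table_ref,  # a plain 'project_id:dataset.table_id' string also works here
    schema=compute_table_schema,
    additional_bq_parameters=additional_bq_parameters,
    write_disposition=BigQueryDisposition.WRITE_APPEND,
    create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
)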
I am looking to add a WHERE clause to an existing query to exclude data in the BigQuery streaming buffer.
To do this I would like to get the Partition column name so I can add
WHERE partition_column IS NOT NULL;
to my existing query.
I have been looking at the CLI and the get_table method, however that just returns the value of the column, not the column name.
I get the same when querying .INFORMATION_SCHEMA.PARTITIONS: it returns a partition_id field, but I would prefer the column name itself. Is there a way to get this?
Additionally, the table is set up with column-based partitioning.
Based on the Python BigQuery client documentation, use the time_partitioning attribute:
from google.cloud import bigquery

client = bigquery.Client()
bq_table = client.get_table('my_partitioned_table')

bq_table.time_partitioning        # TimePartitioning(field='column_name', type_='DAY')
bq_table.time_partitioning.field  # column_name
A small tip: if you don't know where to look, print the API representation:
bq_table.to_api_repr()
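Putting it together, a small sketch of how the partition column name could then be injected into the existing query (the table name and query body are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
bq_table = client.get_table('my_dataset.my_partitioned_table')
partition_column = bq_table.time_partitioning.field

query = f"""
    SELECT *
    FROM `my_dataset.my_partitioned_table`
    WHERE {partition_column} IS NOT NULL
"""
rows = client.query(query).result()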
I'm currently using AWS Glue Data Catalog to organize my database. Once I set up the connection and sent my crawler to gather information, I was able to see the formulated metadata.
One feature that would be nice to have is the ability to SEARCH the entire data catalog for ONE column name. For example, if I have 5 tables in my data catalog and one of them happens to have a field "age", I'd like to be able to find that table.
I was also wondering if I can search on the "comments" field that every column of a table has in the AWS Glue Data Catalog.
Hope to get some help!
You can do that with the AWS Glue API. For example, you can use the Python SDK boto3 and the get_tables() method to retrieve all the metadata about tables in a particular database. Have a look at the Response Syntax returned by get_tables(); you then only need to parse it, for example:
import boto3

glue_client = boto3.client('glue')

response = glue_client.get_tables(
    DatabaseName='__SOME_NAME__'
)

for table in response['TableList']:
    columns = table['StorageDescriptor']['Columns']
    for col in columns:
        col_name = col['Name']
        col_comment = col.get('Comment', '')  # 'Comment' is optional and may be missing
        # Here you do the search for what you need
Note: if you have a table with partitioning (artificial columns), then you would also need to search through
columns_as_partitions = table['PartitionKeys']
for col in columns_as_partitions:
    col_name = col['Name']
    col_comment = col.get('Comment', '')
    # Here you do the search for what you need
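One extra note: get_tables() paginates its results, so for databases with many tables a paginator avoids missing anything. A sketch that searches both regular columns and partition keys for a given column name (the database name and target column are placeholders):

import boto3

glue_client = boto3.client('glue')
paginator = glue_client.get_paginator('get_tables')

search_for = 'age'  # the column name to look for

for page in paginator.paginate(DatabaseName='__SOME_NAME__'):
    for table in page['TableList']:
        all_columns = table['StorageDescriptor']['Columns'] + table.get('PartitionKeys', [])
        for col in all_columns:
            if col['Name'] == search_for:
                print(table['Name'], col['Name'], col.get('Comment', ''))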
I'm streaming in (unbounded) data from Google Cloud Pubsub into a PCollection in the form of a dictionary. As the streamed data comes in, I'd like to enrich it by joining it by key on a static (bounded) lookup table. This table is small enough to live in memory.
I currently have a working solution that runs using the DirectRunner, but when I try to run it on the DataflowRunner, I get an error.
I've read the bounded lookup table in from a csv using the beam.io.ReadFromText function, and parsed the values into a dictionary. I've then created a ParDo function that takes my unbounded PCollection and the lookup dictionary as a side input. In the ParDo, it uses a generator to "join" on the correct row of the lookup table, and will enrich the input element.
Here are the main parts:
# Get bounded lookup table
lookup_dict = (pcoll | 'Read PS Table' >> beam.io.ReadFromText(...)
                     | 'Split CSV to Dict' >> beam.ParDo(SplitCSVtoDict()))

# Use lookup table as side input in ParDo func to enrich unbounded pcoll
# I found that it only worked on my local machine when decorating it with AsList
enriched = pcoll | 'join pcoll on lkup' >> beam.ParDo(
    JoinLkupData(), lookup_data=beam.pvalue.AsList(lookup_dict))
class JoinLkupData(beam.DoFn):
    def process(self, element, lookup_data):
        # I used a generator here
        lkup = next((row for row in lookup_data
                     if row[<JOIN_FIELD>] == element[<JOIN_FIELD>]), None)
        if lkup:
            # If there is a join, add new fields to the pcoll
            element['field1'] = lkup['field1']
            element['field2'] = lkup['field2']
            yield element
I was able to get the correct result when running locally using the DirectRunner, but when running on the DataflowRunner, I receive this error:
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Workflow failed. Causes: Expected custom source to have non-zero number of splits.
This post: " Error while splitting pcollections on Dataflow runner " made me think that the reason for this error has to do with the multiple workers not having access to the same lookup table when splitting the work.
In the future, please share the version of Beam and the stack trace if you can.
In this case, it is a known issue that the error message is not very good. At the time of this writing, Dataflow for Python streaming is limited to only Pubsub for reading and writing and BigQuery for writing. Using the text source in a pipeline results in this error.
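Given that limitation at the time, one possible workaround (a rough sketch, with the bucket, file name, and join field as placeholders) was to drop the ReadFromText source for the small lookup table and instead load the CSV directly inside the DoFn, for example from GCS in start_bundle, keeping the join in memory:

import csv

import apache_beam as beam
from google.cloud import storage


class JoinLkupData(beam.DoFn):
    def __init__(self, bucket_name, blob_name):
        self.bucket_name = bucket_name
        self.blob_name = blob_name
        self.lookup = None

    def start_bundle(self):
        # Download and parse the small lookup CSV once per bundle.
        if self.lookup is None:
            blob = storage.Client().bucket(self.bucket_name).blob(self.blob_name)
            rows = csv.DictReader(blob.download_as_text().splitlines())
            self.lookup = {row['<JOIN_FIELD>']: row for row in rows}

    def process(self, element):
        lkup = self.lookup.get(element['<JOIN_FIELD>'])
        if lkup:
            element['field1'] = lkup['field1']
            element['field2'] = lkup['field2']
            yield element


enriched = pcoll | 'join pcoll on lkup' >> beam.ParDo(JoinLkupData('my-bucket', 'lookup.csv'))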