I am building a process in Google Cloud Dataflow that will consume messages from Pub/Sub and, based on the value of one key, write them either to BQ or to GCS. I am able to split the messages, but I am not sure how to write the data to BigQuery. I've tried using beam.io.gcp.bigquery.WriteToBigQuery, but with no luck.
My full code is here: https://pastebin.com/4W9Vu4Km
Basically, my issue is that I don't know how to specify in WriteBatchesToBQ (line 73) that the variable element should be written into BQ.
I've also tried using beam.io.gcp.bigquery.WriteToBigQuery directly in the pipeline (line 128), but then I got the error AttributeError: 'list' object has no attribute 'items' [while running 'Write to BQ/_StreamToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn)']. This is probably because I am not feeding it a dictionary but a list of dictionaries (I would like to use 1-minute windows).
Any ideas, please? (Also, if there is something too stupid in the code, let me know - I have only been playing with Apache Beam for a short time and I might be overlooking some obvious issues.)
A sample WriteToBigQuery format is given below:
project_id = "proj1"
dataset_id = 'dataset1'
table_id = 'table1'
table_schema = ('id:STRING, reqid:STRING')
| 'Write-CH' >> beam.io.WriteToBigQuery(
table=table_id,
dataset=dataset_id,
project=project_id,
schema=table_schema,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
))
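For a snippet like this to work, the elements arriving at WriteToBigQuery must be dictionaries whose keys match the schema. A minimal self-contained sketch (the pipeline and the row values here are made up for illustration):

import apache_beam as beam

project_id = "proj1"
dataset_id = 'dataset1'
table_id = 'table1'
table_schema = 'id:STRING, reqid:STRING'

with beam.Pipeline() as p:
    (p
     | 'Create rows' >> beam.Create([
         {'id': '1', 'reqid': 'a'},  # one dict per BigQuery row
         {'id': '2', 'reqid': 'b'}])
     | 'Write-CH' >> beam.io.WriteToBigQuery(
         table=table_id,
         dataset=dataset_id,
         project=project_id,
         schema=table_schema,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))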
You can refer to this case; it will give you a brief understanding of a Beam data pipeline.
The second approach is the solution to this issue: you need to use the WriteToBigQuery transform directly in the pipeline. However, a beam.FlatMap step needs to be included so that WriteToBigQuery can process the list of dictionaries correctly.
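To illustrate why the FlatMap is needed (a toy sketch, not taken from the original pipeline): each windowed batch is a list of dictionaries, and FlatMap with an identity function re-emits the individual dictionaries that WriteToBigQuery expects.

import apache_beam as beam

with beam.Pipeline() as p:
    batches = p | beam.Create([
        [{'id': '1'}, {'id': '2'}],  # one element = a whole batch (a list of dicts)
        [{'id': '3'}]])
    rows = batches | beam.FlatMap(lambda elements: elements)  # emits the dicts one by one
    rows | beam.Map(print)  # {'id': '1'}, {'id': '2'}, {'id': '3'}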
Hence the complete pipeline that splits the data, groups it into time windows, and writes it into BQ is defined like this:
accepted_messages = (
    tagged_lines_result[Split.OUTPUT_TAG_BQ]
    | "Window into BQ" >> GroupWindowsIntoBatches(window_size)
    | "FlatMap" >> beam.FlatMap(lambda elements: elements)
    | "Write to BQ" >> beam.io.gcp.bigquery.WriteToBigQuery(
        table=output_table_bq,
        schema=output_table_bq_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
The complete working code is here: https://pastebin.com/WFwBvPcU
I'm learning Apache Beam with the dataframe API at the moment and coming across some unexpected behavior that I was hoping an expert could explain to me.
Here's the simplest version of my issue I could drill it down to (in the real version the dataframe transform is something more complex):
class LocationRow(NamedTuple):
    h3_index: str

with beam.Pipeline(options=beam_options) as pipeline:
    (pipeline
     | ReadFromBigQuery(
         query=f'SELECT h3_index FROM {H3_INDEX_TABLE} LIMIT 100',
         use_standard_sql=True)
         .with_output_types(LocationRow)
     | DataframeTransform(lambda df: df)
     | WriteToBigQuery(
         schema='h3_index:STRING',
         table=OUTPUT_TABLE))
Running this with DirectRunner (or DataflowRunner) crashes with the following:
message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. File: gs://analysis-dataflow-temp/temp/bq_load/0163282c2bbc47ba8ec368b158aefe2e/core-modules-development.analysis.fake_grid_power_price/5a1fc783-dcdc-44bd-9855-faea6151574f'
So, I looked into that file and it's just a json list per line:
$ cat 5a1fc783-dcdc-44bd-9855-faea6151574f
["8800459aedfffff"]
["88004536c5fffff"]
["8800418237fffff"]
["8800422b39fffff"]
["8800432451fffff"]
["88004175d7fffff"]
...
I figured out that BigQuery is expecting an object per line (like {"h3_index": "88004175d7fffff"}), and if I remove the DataframeTransform in the pipeline it works. So I tried using print to figure out what's happening, and changed the pipeline to this:
with beam.Pipeline(options=beam_options) as pipeline:
    (pipeline
     | ReadFromBigQuery(
         query=f'SELECT h3_index FROM {H3_INDEX_TABLE} LIMIT 5',
         use_standard_sql=True)
         .with_output_types(LocationRow)
     | DataframeTransform(lambda df: df)
     | beam.Map(print))
Which gives this output:
BeamSchema_574444a4_ae3e_4bb2_9cca_4867670ef2bb(h3_index='8806b00819fffff')
BeamSchema_574444a4_ae3e_4bb2_9cca_4867670ef2bb(h3_index='8806ab98d3fffff')
BeamSchema_574444a4_ae3e_4bb2_9cca_4867670ef2bb(h3_index='8806accd45fffff')
BeamSchema_574444a4_ae3e_4bb2_9cca_4867670ef2bb(h3_index='8806ac60a7fffff')
BeamSchema_574444a4_ae3e_4bb2_9cca_4867670ef2bb(h3_index='8806acb409fffff')
If I remove the DataframeTransform and keep the Map(print) I get this instead:
{'h3_index': '88012db281fffff'}
{'h3_index': '88012ea527fffff'}
{'h3_index': '88012e38c5fffff'}
{'h3_index': '88012e2135fffff'}
{'h3_index': '88012ea949fffff'}
So it looks like the DataframeTransform is returning collections of NamedTuples (or similar) rather than dictionaries, and the WriteToBigQuery fails with these tuples. I can fix it by adding a Map after the DataframeTransform to change this explicitly:
with beam.Pipeline(options=beam_options) as pipeline:
    (pipeline
     | ReadFromBigQuery(
         query=f'SELECT h3_index FROM {H3_INDEX_TABLE} LIMIT 100',
         use_standard_sql=True)
         .with_output_types(LocationRow)
     | DataframeTransform(lambda df: df)
     | beam.Map(lambda row: {'h3_index': row.h3_index})
     | WriteToBigQuery(
         schema='h3_index:STRING',
         table=OUTPUT_TABLE))
But this feels unnecessary, and I don't really understand what's happening here. What's the difference between a collection of tuples and one of dictionaries? Hoping a Beam expert can shed some light on this!
The DataframeTransform allows using the Dataframe API and returns schema-aware elements (Beam schema rows).
Unfortunately, if you need to manipulate a Dict after this step, you have to add a transformation afterwards in order to map the Beam schema rows to Dicts, as you showed in your question.
It can sometimes be interesting to use the Dataframe API and its powerful syntax, for example:
from apache_beam.dataframe.transforms import DataframeTransform

with beam.Pipeline() as p:
    ...
    | beam.Select(DOLocationID=lambda line: int(..),
                  passenger_count=lambda line: int(..))
    | DataframeTransform(lambda df: df.groupby('DOLocationID').sum())
    | beam.Map(lambda row: f"{row.DOLocationID},{row.passenger_count}")
But to be able to interact with many native Beam IOs like WriteToBigQuery, you have to transform your result schema rows into another structure (a Dict or otherwise).
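One generic way to do that conversion, assuming the schema'd elements behave like NamedTuples (as the BeamSchema_... objects printed in the question suggest), is to call ._asdict() on each row just before the sink. This is a sketch, not an official recipe:

...
| DataframeTransform(lambda df: df)
| beam.Map(lambda row: row._asdict())  # Beam schema row (NamedTuple-like) -> plain dict
| WriteToBigQuery(
    schema='h3_index:STRING',
    table=OUTPUT_TABLE)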
If you don't want to add the transformation from Beam schema to Dict before writing data to BigQuery, then instead of using DataframeTransform you can use a usual ParDo/DoFn or beam.Map with a Python method containing your transformations and business logic, returning the result as a Dict.
I am trying to write to different BigQuery table destinations, and I would like to create the tables dynamically if they don't already exist.
bigquery_rows | "Writing to Bigquery" >> WriteToBigQuery(lambda e: compute_table_name(e),
schema=compute_table_schema,
additional_bq_parameters=additional_bq_parameters,
write_disposition=BigQueryDisposition.WRITE_APPEND,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
)
The function compute_table_name is quite simple actually, I am just trying to get it to work.
def compute_table_name(element):
    if element['table'] == 'table_id':
        del element['table']
        return "project_id:dataset.table_id"
The schema is detected correctly and the table IS created and populated with records. The problem is, the table ID I get is something along the lines of:
datasetId: 'dataset'
projectId: 'project_id'
tableId: 'beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP...
I have also tried returning a bigquery.TableReference object in my compute_table_name function to no avail.
EDIT: I am using apache-beam 2.34.0 and I have opened an issue on JIRA here
Your pipeline code is fine. However, you can just pass compute_table_name as the callable directly:
bigquery_rows | "Writing to Bigquery" >> WriteToBigQuery(compute_table_name,
schema=compute_table_schema,
additional_bq_parameters=additional_bq_parameters,
write_disposition=BigQueryDisposition.WRITE_APPEND,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
)
The 'beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP' table name in BigQuery probably means that the load job either has not finished yet or has errors; you should check the "Personal history" or "Project history" tabs in the BigQuery UI to see the status of the job.
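If you prefer to check programmatically rather than in the UI, something along these lines should work with the google-cloud-bigquery client library (a sketch; the project name and result limit are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project='project_id')  # hypothetical project
# List the most recent jobs and surface any load-job errors
for job in client.list_jobs(max_results=20):
    if job.job_type == 'load':
        print(job.job_id, job.state, job.error_result)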
I have found the solution to my problem by following this answer. It felt like a workaround because I'm not passing a callable to WriteToBigQuery(). Testing a bunch of ways, I found that giving a string/TableReference to the method directly worked, but giving it a callable did not.
I process ~50 gigs of data every 15 minutes spread across 6 tables and it works decently.
I'm facing a problem. I have no clue how to fix it.
I have a pipeline in batch mode. I read a file using the method read_fwf (beam.dataframe.io.read_fwf); however, all following PTransforms are ignored. I wonder why?
My pipeline ended up having only one step, but my code defines the following pipeline:
# LOAD FILE RNS
elements = p | 'Load File' >> beam.dataframe.io.read_fwf(
    'gs://my_bucket/files/202009.txt', header=None, colspec=col_definition,
    dtype=str, keep_default_na=False, encoding='ISO-8859-1')

# PREPARE VALUES (BULK INSERT)
Script_Values = elements | 'Prepare Bulk Insert' >> beam.ParDo(Prepare_Bulk_Insert())

# GROUP ALL VALUES
Grouped_Values = Script_Values | 'Grouping values' >> beam.GroupByKey()

# BULK INSERT INTO POSTGRESQL (GROUPING BASED)
Inserted = Grouped_Values | 'Insert PostgreSQL' >> beam.Map(functionMapInsert)
Do you know what I am doing wrong?
Kindly,
Juliano
I think the problem has to do with the fact that, as you can see in the Apache Beam documentation, beam.dataframe.io.read_fwf returns a deferred Beam dataframe representing the contents of the file, not a PCollection.
You can embed DataFrames in a pipeline and convert between them and PCollections with the help of the functions defined in the apache_beam.dataframe.convert module.
The SDK documentation provides an example of this setup, fully described on GitHub.
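A rough sketch of what that conversion could look like for the pipeline above (names are copied from the question; the converted elements are schema'd rows rather than raw lines, so Prepare_Bulk_Insert may need adapting):

from apache_beam.dataframe.convert import to_pcollection

# read_fwf yields a deferred dataframe, not a PCollection
deferred_df = p | 'Load File' >> beam.dataframe.io.read_fwf(
    'gs://my_bucket/files/202009.txt', header=None, colspec=col_definition,
    dtype=str, keep_default_na=False, encoding='ISO-8859-1')

# Convert the deferred dataframe back into a PCollection so that the
# downstream PTransforms become part of the pipeline graph
elements = to_pcollection(deferred_df)

Script_Values = elements | 'Prepare Bulk Insert' >> beam.ParDo(Prepare_Bulk_Insert())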
I think it is also worth trying the DataframeTransform; perhaps it is more suitable for being integrated into the pipeline with the help of a schema definition.
In relation to this last suggestion, please consider reviewing this related SO question, especially the answer from @robertwb and the excellent linked Google Slides document; I think it can be helpful.
I'm working on an ETL which is pulling data from a database, doing minor transformations, and outputting to BigQuery. I have written my pipeline in Apache Beam 2.26.0 using the Python SDK. I'm loading a dozen or so tables, and I'm passing their names as arguments to beam.io.WriteToBigQuery.
Now, the documentation says that (https://beam.apache.org/documentation/io/built-in/google-bigquery):
When writing to BigQuery, you must supply a table schema for the destination table that you want to write to, unless you specify a create disposition of CREATE_NEVER.
I believe this is not exactly true. In my tests I saw that this is the case only when passing a static table name.
If you have a bunch of tables and want to pass the table name as an argument, it throws an error:
ErrorProto message: 'No schema specified on job or table.'
My code:
bq_data | "Load data to BQ" >> beam.io.WriteToBigQuery(
table=lambda row: bg_config[row['table_name']],
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
)
bq_data is a collection of dicts, each one a row of a pandas data frame, where I have a column table_name.
bq_config is a dictionary whose keys are row['table_name'] values and whose values have the format:
[project_id]:[dataset_id].[table_id]
Anyone have some thoughts on this?
Have a look at this thread; I addressed it there. In short, I used the internal Python time/date function to render the variable before executing the Python BigQuery API request.
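As a sketch of that idea (the names below are made up, not taken from the linked thread): resolve the destination string with Python's datetime before building the pipeline, instead of computing it per element through a callable:

from datetime import datetime
import apache_beam as beam

# Hypothetical: render the table name once, up front
table_suffix = datetime.utcnow().strftime('%Y%m%d')
output_table = f'my_project:my_dataset.events_{table_suffix}'

bq_data | "Load data to BQ" >> beam.io.WriteToBigQuery(
    table=output_table,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
)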
I'm streaming in (unbounded) data from Google Cloud Pubsub into a PCollection in the form of a dictionary. As the streamed data comes in, I'd like to enrich it by joining it by key on a static (bounded) lookup table. This table is small enough to live in memory.
I currently have a working solution that runs using the DirectRunner, but when I try to run it on the DataflowRunner, I get an error.
I've read the bounded lookup table in from a csv using the beam.io.ReadFromText function, and parsed the values into a dictionary. I've then created a ParDo function that takes my unbounded PCollection and the lookup dictionary as a side input. In the ParDo, it uses a generator to "join" on the correct row of the lookup table, and will enrich the input element.
Here are some of the main parts:
# Get bounded lookup table
lookup_dict = (pcoll
               | 'Read PS Table' >> beam.io.ReadFromText(...)
               | 'Split CSV to Dict' >> beam.ParDo(SplitCSVtoDict()))

# Use lookup table as side input in ParDo func to enrich unbounded pcoll
# I found that it only worked on my local machine when decorating it with AsList
enriched = pcoll | 'join pcoll on lkup' >> beam.ParDo(
    JoinLkupData(), lookup_data=beam.pvalue.AsList(lookup_dict))
class JoinLkupData(beam.DoFn):
    def process(self, element, lookup_data):
        # I used a generator here
        lkup = next((row for row in lookup_data if row[<JOIN_FIELD>] == element[<JOIN_FIELD>]), None)
        if lkup:
            # If there is a join, add new fields to the pcoll
            element['field1'] = lkup['field1']
            element['field2'] = lkup['field2']
            yield element
I was able to get the correct result when running locally using DirectRunner, but when running on the DataFlow Runner, I receive this error:
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Workflow failed. Causes: Expected custom source to have non-zero number of splits.
This post: " Error while splitting pcollections on Dataflow runner " made me think that the reason for this error has to do with the multiple workers not having access to the same lookup table when splitting the work.
In the future, please share the version of Beam and the stack trace if you can.
In this case, it is a known issue that the error message is not very good. At the time of this writing, Dataflow for Python streaming is limited to Pub/Sub for reading and writing, and BigQuery for writing. Using the text source in a streaming pipeline results in this error.