Batch insert to Bigquery with Dataflow - python

I am using an Apache Beam pipeline and I want to batch insert to BigQuery with Python. My data comes from Pub/Sub, which is unbounded. From my research, GlobalWindows with triggers should solve my problem. I tried my pipeline with windowing, but it still does streaming inserts. My pipeline code is the following:
p2 = (p | 'Read ' >> beam.io.ReadFromPubSub(subscription=subscription_path,
                                            with_attributes=True,
                                            timestamp_attribute=None,
                                            id_label=None)
        | 'Windowing' >> beam.WindowInto(window.GlobalWindows(),
                                         trigger=Repeatedly(
                                             AfterAny(
                                                 AfterCount(100),
                                                 AfterProcessingTime(1 * 60))),
                                         accumulation_mode=AccumulationMode.DISCARDING)
        | 'Process ' >> beam.Map(getAttributes))

p3 = (p2 | 'Filter ' >> beam.Filter(lambda msg: ("xx" in msg) and (msg["xx"].lower() == "true"))
         | 'Delete ' >> beam.Map(deleteAttribute)
         | 'Write ' >> writeTable(bq_table_test, bq_batch_size))

def writeTable(table_name, batch_size):
    return beam.io.WriteToBigQuery(
        table=table_name,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        batch_size=batch_size)
I'm checking the Billing Reports to see whether the inserts are batch or streaming. When streaming insert usage increases, I understand that the bulk insertion did not happen. Is there another way to check whether an insertion was streaming or batch? And how can I do batch inserts to BigQuery?

According to the documentation, you cannot specify the insert type; it is chosen automatically based on your input PCollection:
The Beam SDK for Python does not currently support specifying the insertion method.

BigQueryIO supports two methods of inserting data into BigQuery: load jobs and streaming inserts. Each insertion method provides different tradeoffs of cost, quota, and data consistency. See the BigQuery documentation for load jobs and streaming inserts for more information about these tradeoffs.

BigQueryIO chooses a default insertion method based on the input PCollection.

BigQueryIO uses load jobs when you apply a BigQueryIO write transform to a bounded PCollection.

BigQueryIO uses streaming inserts when you apply a BigQueryIO write transform to an unbounded PCollection.
In your case you're reading from an unbounded source (Pub/Sub), so BigQueryIO always uses streaming inserts. Windowing does not change the bounded/unbounded nature of the PCollection.
One workaround I can think of is to split the pipeline: a streaming pipeline writes the incoming messages to a collection of files in some storage (e.g. GCS), and a second, batch pipeline then reads those files and loads them into BigQuery (a PCollection read from files is bounded, so load jobs are used).
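A rough sketch of that second (batch) pipeline, assuming the streaming pipeline has already staged newline-delimited JSON files under a hypothetical gs://my-bucket/staging/ prefix (bucket, project and table names are placeholders):

import json
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     # ReadFromText over a file pattern yields a bounded PCollection,
     # so WriteToBigQuery will use load jobs rather than streaming inserts.
     | 'Read staged files' >> beam.io.ReadFromText('gs://my-bucket/staging/*.json')
     | 'Parse JSON' >> beam.Map(json.loads)
     | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
         table='my_project:my_dataset.my_table',
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))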

Related

I am trying to build a Snowflake connector for Kafka. Is there a way for the connector to handle events like update, delete and create?

I am using the Python connector for the transformations and for pushing data to Snowflake, but the inserts into Snowflake are taking a long time, since I have to handle the transformations message by message to preserve the original sequence. The existing Kafka Snowflake connector can flatten the JSON messages but cannot handle the other events.
I am looking for faster ways to transfer and transform Kafka JSON messages into Snowflake tables.
You have two options in this case IMO.
Write some consumer code that will transform the data from the original kafka topic (input topic) and then write it to a new topic (output topic). Then you can use the Kafka snowflake connector to write to snowflake.
or
Write the consumer code to do the transform and then write the data directly to Snowflake in that consumer (a rough sketch of this is shown after the Bytewax example below).
Option 1 has an extra step and requires the Kafka connector, so it is something extra to manage. Option 2 sounds like what you are currently trying. Using option 1 would allow you to leverage the community-maintained Snowflake connector, which is probably quite efficient. You will need to use partitions in your Kafka topic to get higher throughput.
Regardless of whether you choose option 1 or 2, it sounds like you need to write a consumer for your Kafka topic to transform the data first, in which case I would recommend you use a stream processor so you do not need to manage the complexities of state, recovery and parallelism. If you are dead set on using Python, the options are Bytewax or Faust. Below is some code using Bytewax to transform Kafka topic data.
import json

from bytewax.dataflow import Dataflow
from bytewax.execution import spawn_cluster
from bytewax.inputs import KafkaInputConfig
from bytewax.outputs import KafkaOutputConfig

def deserialize(key_bytes__payload_bytes):
    key_bytes, payload_bytes = key_bytes__payload_bytes
    key = json.loads(key_bytes) if key_bytes else None
    payload = json.loads(payload_bytes) if payload_bytes else None
    return key, payload

def my_transformations(data):
    ### WRITE YOUR PYTHON TRANSFORMATION CODE HERE ###
    return data

def serialize_with_key(key_payload):
    key, payload = key_payload
    new_key_bytes = key if key else json.dumps("my_key").encode("utf-8")
    return new_key_bytes, json.dumps(payload).encode("utf-8")

flow = Dataflow()
flow.input("inp", KafkaInputConfig(
    brokers=["localhost:9092"],
    topic="input_topic",
))
flow.map(deserialize)
flow.map(my_transformations)
flow.map(serialize_with_key)
flow.capture(KafkaOutputConfig(
    brokers=["localhost:9092"],
    topic="output_topic",
))

if __name__ == "__main__":
    spawn_cluster(flow, proc_count=2, worker_count_per_proc=1)
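For completeness, option 2 (transform in the consumer and write straight to Snowflake) could look roughly like the sketch below. It assumes the kafka-python and snowflake-connector-python packages; the topic name, connection details and target table are placeholders, and batching the inserts with executemany is just one reasonable way to avoid row-by-row inserts.

import json

import snowflake.connector
from kafka import KafkaConsumer

# Placeholder Snowflake connection details.
conn = snowflake.connector.connect(
    user="MY_USER", password="MY_PASSWORD", account="MY_ACCOUNT",
    warehouse="MY_WH", database="MY_DB", schema="PUBLIC")
cur = conn.cursor()

# Consume JSON messages from the input topic.
consumer = KafkaConsumer(
    "input_topic",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")))

batch = []
for message in consumer:
    record = message.value  # apply your per-message transformation here
    batch.append((json.dumps(record),))
    if len(batch) >= 500:  # insert in batches rather than row by row
        # 'raw_events' with a single VARCHAR column 'payload' is a hypothetical table.
        cur.executemany("INSERT INTO raw_events (payload) VALUES (%s)", batch)
        batch = []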

Spark Repartition Issue

Good day everyone,
I'm working on a project where I'm running an ETL process over millions of data records with the aid of Spark (2.4.4) and PySpark.
We're fetching huge compressed CSV files from an S3 bucket in AWS, converting them into Spark DataFrames, repartitioning them with the repartition() method, and writing each piece as Parquet data to lighten and speed up the process:
for file in files:
    if not self.__exists_parquet_in_s3(self.config['aws.output.folder'] + '/' + file, '.parquet'):
        # Run the parquet converter
        print('**** Processing %s ****' % file)
        # TODO: number of repartition variable
        df = SparkUtils.get_df_from_s3(self.spark_session, file, self.config['aws.bucket']).repartition(94)
        s3folderpath = 's3a://' + self.config['aws.bucket'] + \
                       '/' + self.config['aws.output.folder'] + \
                       '/%s' % file + '/'
        print('Writing down process')
        df.write.format('parquet').mode('append').save(
            '%s' % s3folderpath)
        print('**** Saving %s completed ****' % file)
        df.unpersist()
    else:
        print('Parquet files already exist!')
As a first step, this piece of code checks inside the S3 bucket whether the Parquet files already exist; if not, it enters the for loop and runs all the transformations.
Now, let's get to the point. This pipeline works fine with every CSV file except one, which has the same structure as the others but is much heavier, even after the repartition and conversion to Parquet (29 MB x 94 parts vs 900 kB x 32 parts).
This causes a bottleneck after some time during the process (which is divided into identical cycles, where the number of cycles equals the number of repartitions made), raising a Java heap space error after several warnings:
WARN TaskSetManager: Stage X contains a task of very large size (x KB). The maximum recommended size is 100 KB. (Also see pics below)
[Screenshots "Part 1" and "Part 2" of the warnings not reproduced here.]
The most logical solution would be to increase the repartition parameter further, to lower the weight of each Parquet file, BUT it does not allow me to create more than 94 partitions; after some time during the for loop mentioned above it raises this error:
ERROR FileFormatWriter: Aborting job 8fc9c89f-dccd-400c-af6f-dfb312df0c72.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: HGC6JTRN5VT5ERRR, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: 7VBu4mUEmiAWkjLoRPruTiCY3IhpK40t+lg77HDNC0lTtc8h2Zi1K1XGSnJhjwnbLagN/kS+TpQ=
Or also a second error type, shown in another screenshot (not reproduced here), where the same task-size warning is visible.
What I noticed is that I can under-partition relative to the original value: I can use 16 as the parameter instead of 94 and it runs fine, but if I increase it above 94, the original value, it won't work.
Remember that this pipeline works perfectly to the end with the other (lighter) CSV files; the only variable here seems to be the input file (its size in particular), which makes it stop after some time. If you need any other detail please let me know; I'd be extremely glad if you could help me with this. Thank you everyone in advance.
I'm not sure what the logic in your SparkUtils is, but based on the code and log you provided, the problem doesn't look related to your resources or partitioning; it may be caused by the connection between your Spark application and S3:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403
403 means your credentials don't have access to the bucket/file you are trying to read/write. The Hadoop documentation on S3A authentication lists several cases that can cause this error: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#Authentication_Failure. Since you see this error during the loop but not at the beginning of your job, check the running time of your Spark job, and also the IAM and session authentication, as it may be caused by session expiration (1 hour by default, depending on how your DevOps team configured it); for details see: https://docs.aws.amazon.com/singlesignon/latest/userguide/howtosessionduration.html.
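If session credentials are the culprit, one thing you can check is whether the job is configured with static temporary credentials that simply expire mid-run. The fs.s3a.* keys below are standard Hadoop S3A options; the credential values are placeholders, and with static session tokens the S3A connector will not refresh them for you, so a long loop either needs longer-lived credentials or has to be split into shorter runs.

from pyspark.sql import SparkSession

# Sketch only: point S3A at temporary (session) credentials explicitly, so an
# expired token shows up as a configuration problem rather than a mystery 403.
spark = (SparkSession.builder
         .appName("etl-parquet-converter")
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
         .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
         .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
         .config("spark.hadoop.fs.s3a.session.token", "<SESSION_TOKEN>")
         .getOrCreate())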

How to enable parallel reading of files in Dataflow?

I'm working on a Dataflow pipeline that reads 1000 files (50 MB each) from GCS and performs some computations on the rows across all files. Each file is a CSV with the same structure, just with different numbers in it, and I'm computing the average of each cell over all files.
The pipeline looks like this (python):
additional_side_inputs = {'key1': 'value1', 'key2': 'value2'}  # etc.

(p | 'Collect CSV files' >> MatchFiles(input_dir + "*.csv")
   | 'Read files' >> ReadMatches()
   | 'Parse contents' >> beam.ParDo(FileToRowsFn(), additional_side_inputs)
   | 'Compute average' >> beam.CombinePerKey(AverageCalculatorFn()))
The FileToRowsFn class looks like this (see below; some details omitted). The row_id is the first column and is a unique key for each row; it exists exactly once in each file, so that I can compute the average over all the files. There is an additional value provided as a side input to the transform, which is not shown inside the method body below but is still used by the real implementation. This value is a dictionary created outside of the pipeline. I mention it here in case it might be a reason for the lack of parallelization.
import csv
from io import TextIOWrapper

import apache_beam as beam

class FileToRowsFn(beam.DoFn):
    def process(self, file_element, additional_side_inputs):
        with file_element.open() as csv_file:
            for row_id, *values in csv.reader(TextIOWrapper(csv_file, encoding='utf-8')):
                yield row_id, values
The AverageCalculatorFn is a typical beam.CombineFn with an accumulator, that performs the average of each cell of a given row over all rows with the same row_id across all files.
All this works fine, but there's a problem with performance and throughput: it takes more than 60 hours to execute this pipeline. From the monitoring console, I notice that the files are read sequentially (1 file every 2 minutes). I understand that reading a file may take 2 minutes (each file is 50 MB), but I don't understand why Dataflow doesn't assign more workers to read multiple files in parallel. The CPU remains at ~2-3% because most of the time is spent in file I/O, and the number of workers doesn't exceed 2 (although no limit is set).
The output of ReadMatches is 1000 file records, so why doesn't Dataflow create lots of FileToRowsFn instances and dispatch them to new workers, each one handling a single file?
Is there a way to enforce such a behavior?
This is probably because all your steps get fused into a single step by the Dataflow runner.
For such a fused bundle to parallelize, the first step needs to be parallelizable. In your case this is a glob expansion which is not parallelizable.
To make your pipeline parallelizable, you can try to break fusion. This can be done by adding a Reshuffle transform as the consumer of one of the steps that produce many elements.
For example,
from apache_beam import Reshuffle

additional_side_inputs = {'key1': 'value1', 'key2': 'value2'}  # etc.

(p | 'Collect CSV files' >> MatchFiles(input_dir + "*.csv")
   | 'Read files' >> ReadMatches()
   | 'Reshuffle' >> Reshuffle()
   | 'Parse contents' >> beam.ParDo(FileToRowsFn(), additional_side_inputs)
   | 'Compute average' >> beam.CombinePerKey(AverageCalculatorFn()))
You should not have to do this if you use one of the standard sources available in Beam, such as textio.ReadFromText(), to read the data (unfortunately there is no CSV source, but ReadFromText supports skipping header lines).
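A rough sketch of that alternative, assuming each CSV row can be parsed on its own (skip_header_lines is a real ReadFromText parameter; parse_csv_line is a hypothetical stand-in for your row handling):

import apache_beam as beam

def parse_csv_line(line):
    # Hypothetical parser: the first column is the row_id, the rest are values.
    row_id, *values = line.split(',')
    return row_id, [float(v) for v in values]

(p | 'Read CSV rows' >> beam.io.ReadFromText(input_dir + "*.csv", skip_header_lines=1)
   | 'Parse rows' >> beam.Map(parse_csv_line)
   | 'Compute average' >> beam.CombinePerKey(AverageCalculatorFn()))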
See here for more information regarding the fusion optimization and preventing fusion.

WriteToText is only writing to temp files

I am new to Apache Beam, and attempting to write my first pipeline in Python to output data from a Google Pub/Sub subscription to flat files for later use; ideally I want to batch these up into a file for say every half an hour. I have the following code as the final transform in my pipeline:-
| 'write output' >> WriteToText('TestNewPipeline.txt')
However all the files that are created are in a directory prefixed "beam-temp-TestNewPipeline.txt-[somehash]" and batched into groups of 10, which is not what I was expecting.
I've tried playing with the window function, but it doesn't seem to have had much effect, so either I'm totally misunderstanding the concept or doing something completely wrong.
The code for the window is:-
| 'Window' >> beam.WindowInto(beam.window.FixedWindows(5))
I assumed this would result in the output being written to the text file in static five-second windows, but this is not the case.
Full code below:-
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

def format_message(message, timestamp=beam.DoFn.TimestampParam):
    formatted_message = {
        'data': message.data,
        'attributes': str(message.attributes),
        'timestamp': float(timestamp)
    }
    return formatted_message

with beam.Pipeline(options=options) as p:
    (p
     | 'Read From Pub Sub' >> ReadFromPubSub(subscription='projects/[my proj]/subscriptions/[my subscription]', with_attributes=True)
     | 'Window' >> beam.WindowInto(beam.window.FixedWindows(5))
     | 'Map Message' >> beam.Map(format_message)
     | 'write output' >> WriteToText('TestNewPipeline.txt')
     )

result = p.run()
As expected, the process runs indefinitely and successfully reads messages from the subscription; however it only writes them to the beam-temp files. Is anyone able to help point out where I'm going wrong?
Update:
Following comments from Jason, I've amended the pipeline a little more:-
class AddKeyToDict(beam.DoFn):
    def process(self, element):
        return [(element['rownumber'], element)]

with beam.Pipeline(options=options) as p:
    (p
     | 'Read From Pub Sub' >> ReadFromPubSub(subscription=known_args.input_subscription)  # can't make attributes work as yet! ,with_attributes=True)
     # failed attempt 1 | 'Map Message' >> beam.Map(format_message)
     # failed attempt 2 | 'Parse JSON' >> beam.Map(format_message_element)
     | 'Parse to Json' >> beam.Map(lambda x: json.loads(x))
     | 'Add key' >> beam.ParDo(AddKeyToDict())
     | 'Window' >> beam.WindowInto(beam.window.FixedWindows(5), trigger=AfterProcessingTime(15), accumulation_mode=AccumulationMode.DISCARDING)
     | 'Group' >> beam.GroupByKey()
     | 'write output' >> WriteToText(known_args.output_file)
     )
I've not been able to extract the message_id or publish time from Pub/Sub as yet, so I'm just using a rownumber generated in my message. At this point, I'm still only getting the temporary files created, with nothing accumulated into a final file. I'm starting to wonder if the Python implementation is still a bit lacking and I'm going to have to pick up Java...
From Apache Beam's documentation on Windowing Constraints:
If you set a windowing function using the Window transform, each element is assigned to a window, but the windows are not considered until GroupByKey or Combine aggregates across a window and key.
Since there doesn't seem to be a notion of keys in this example, can you try using Combine?
After engaging with the Apache Beam Python folks, it turns out that streaming writes to GCS (or the local filesystem) are not yet supported in Python, which is why the streaming write never completes; only sinks that support unbounded input (e.g. BigQuery tables) are currently supported.
Apparently this will be supported in the upcoming release of Beam for Python v2.14.0.
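For reference, on newer Beam releases (where streaming file writes are supported) a windowed write of the raw messages could look roughly like the sketch below; fileio.WriteToFiles is the file-based sink in apache_beam.io, and the half-hour window plus output path are placeholders:

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline(options=options) as p:
    (p
     | 'Read From Pub Sub' >> beam.io.ReadFromPubSub(subscription=known_args.input_subscription)
     | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
     # Half-hour fixed windows so each window closes into its own set of files.
     | 'Window 30 min' >> beam.WindowInto(beam.window.FixedWindows(30 * 60))
     | 'Write files' >> fileio.WriteToFiles(path='gs://my-bucket/output/'))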

Apache Beam python Bigquery change streaming insert into batch insert?

I am running an Apache Beam Dataflow job which reads from a bucket, performs some transformations and writes to BigQuery.
But the records are being inserted into the streaming buffer.
validated_data = (p1
                  | 'Read files from Storage ' + url >> beam.io.ReadFromText(url)
                  | 'Validate records ' + url >> beam.Map(data_ingestion.validate, url)
                      .with_outputs(SUCCESS_TAG_KEY, FAILED_TAG_KEY, main="main")
                  )
all_data, _, _ = validated_data
success_records = validated_data[SUCCESS_TAG_KEY]
failed_records = validated_data[FAILED_TAG_KEY]

(success_records
 | 'Extracting row from tagged row {}'.format(url) >> beam.Map(lambda row: row['row'])
 | 'Write to BigQuery table for {}'.format(url) >> beam.io.WriteToBigQuery(
     table=data_ingestion.get_table(tmp=TEST, run_date=data_ingestion.run_date),
     create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
     write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
 )
)
Actually, I need to delete the partition before running the above, as a way to avoid duplicated records in the ingestion-time partitioned table.
If I run this job more than once for the same file without truncating the table, the table will end up with duplicate records.
And because the last records are still in the streaming buffer, the delete-partition command does not actually remove the partition.
Below is the code I am using to truncate the table; it runs before the pipeline starts.
from google.cloud import bigquery

client = bigquery.Client()
dataset = TABLE_MAP['dataset']
table = TABLE_MAP[sentiment_pipeline][table_type]['table']
table_id = "{}${}".format(table, format_date(run_date, '%Y%m%d'))
table_ref = client.dataset(dataset).table(table_id)
output = client.delete_table(table_ref)
According to the BigQuery documentation, you may have to wait up to 30 minutes before running a DML statement against a table that has recently received streaming inserts, and schema changes like deleting/truncating tables might result in data loss in some scenarios. Here are some workarounds you could try for dealing with duplicates in a streaming scenario.
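One way to at least make the streaming-buffer limitation visible before deleting the partition is to check the table's streaming buffer metadata first. streaming_buffer is a real property on the google-cloud-bigquery Table object; polling the base table and the 10-minute wait below are just an illustrative sketch, reusing the dataset/table variables from the question.

import time
from google.cloud import bigquery

client = bigquery.Client()
base_table_ref = client.dataset(dataset).table(table)    # table without the $partition decorator
partition_ref = client.dataset(dataset).table(table_id)  # e.g. my_table$20190101

# Wait until BigQuery reports no active streaming buffer before deleting.
while client.get_table(base_table_ref).streaming_buffer is not None:
    print('Rows are still in the streaming buffer, waiting before deleting...')
    time.sleep(600)  # arbitrary 10-minute poll interval

client.delete_table(partition_ref)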
Additionally, Apache Beam and Dataflow now support batch (file load) inserts for Python, so using them might be a good way to avoid the streaming limitations.
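A rough sketch of what that could look like, assuming a Beam SDK version where WriteToBigQuery exposes the method parameter (FILE_LOADS routes the write through load jobs; the table helper and dispositions are carried over from the question's pipeline):

(success_records
 | 'Extract row' >> beam.Map(lambda row: row['row'])
 | 'Write via load job' >> beam.io.WriteToBigQuery(
     table=data_ingestion.get_table(tmp=TEST, run_date=data_ingestion.run_date),
     create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
     write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
     # Force load jobs instead of streaming inserts.
     method=beam.io.WriteToBigQuery.Method.FILE_LOADS))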
