I am new to Apache Beam and attempting to write my first pipeline in Python to output data from a Google Pub/Sub subscription to flat files for later use; ideally I want to batch these up into a file for, say, every half hour. I have the following code as the final transform in my pipeline:-
| 'write output' >> WriteToText('TestNewPipeline.txt')
However, all the files that are created are in a directory prefixed "beam-temp-TestNewPipeline.txt-[somehash]" and are batched into groups of 10, which is not what I was expecting.
I've tried playing with the window function, but it doesn't seem to have had much effect, so either I'm totally misunderstanding the concept or doing something completely wrong.
The code for the window is:-
| 'Window' >> beam.WindowInto(beam.window.FixedWindows(5))
I assumed this would result in the output being written to the text file in static five-second windows, but this is not the case.
Full code below:-
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

def format_message(message, timestamp=beam.DoFn.TimestampParam):
    formatted_message = {
        'data': message.data,
        'attributes': str(message.attributes),
        'timestamp': float(timestamp)
    }
    return formatted_message
with beam.Pipeline(options=options) as p:
    (p
     | 'Read From Pub Sub' >> ReadFromPubSub(subscription='projects/[my proj]/subscriptions/[my subscription]', with_attributes=True)
     | 'Window' >> beam.WindowInto(beam.window.FixedWindows(5))
     | 'Map Message' >> beam.Map(format_message)
     | 'write output' >> WriteToText('TestNewPipeline.txt')
    )
result = p.run()
As expected, the process runs indefinitely and successfully reads messages from the subscription; however, it only writes them to the beam-temp files. Is anyone able to help point out where I'm going wrong?
Update:
Following comments from Jason, I've amended the pipeline a little more:-
class AddKeyToDict(beam.DoFn):
    def process(self, element):
        return [(element['rownumber'], element)]

with beam.Pipeline(options=options) as p:
    (p
     | 'Read From Pub Sub' >> ReadFromPubSub(subscription=known_args.input_subscription)  # can't make attributes work as yet! ,with_attributes=True)
     # failed attempt 1 | 'Map Message' >> beam.Map(format_message)
     # failed attempt 2 | 'Parse JSON' >> beam.Map(format_message_element)
     | 'Parse to Json' >> beam.Map(lambda x: json.loads(x))
     | 'Add key' >> beam.ParDo(AddKeyToDict())
     | 'Window' >> beam.WindowInto(beam.window.FixedWindows(5), trigger=AfterProcessingTime(15), accumulation_mode=AccumulationMode.DISCARDING)
     | 'Group' >> beam.GroupByKey()
     | 'write output' >> WriteToText(known_args.output_file)
    )
I've not been able to extract the message_id or published time from Pub/Sub as yet, so I'm just using a rownumber generated in my message. At this point, I'm still only getting the temporary files created, and nothing accumulated into a final file. I'm starting to wonder if the Python implementation is still a bit lacking and I'm going to have to pick up Java....
From Apache Beam's documentation on Windowing Constraints:
If you set a windowing function using the Window transform, each element is assigned to a window, but the windows are not considered until GroupByKey or Combine aggregates across a window and key.
Since there doesn't seem to be a notion of keys in this example, can you try using Combine?
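For illustration, a minimal sketch of acting on that suggestion (not from the original comment): aggregate each five-second window into a single list with a global combine, so the window is actually consumed by an aggregation. On an unbounded, windowed collection the combine needs .without_defaults(). Note that, as the answer below explains, the text sink itself is still the blocker for streaming output.

import apache_beam as beam

# Sketch only: combine each window's messages into one list per window.
(p
 | 'Read From Pub Sub' >> ReadFromPubSub(subscription='projects/[my proj]/subscriptions/[my subscription]', with_attributes=True)
 | 'Map Message' >> beam.Map(format_message)
 | 'Window' >> beam.WindowInto(beam.window.FixedWindows(5))
 | 'Combine per window' >> beam.CombineGlobally(beam.combiners.ToListCombineFn()).without_defaults()
 | 'write output' >> WriteToText('TestNewPipeline.txt'))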
From engaging with the Apache Beam Python folks, streaming writes to GCS (or the local filesystem) are not yet supported in Python, which is why the streaming write does not occur; only unbounded targets are currently supported (e.g. BigQuery tables).
Apparently this will be supported in the upcoming release of Beam for Python v2.14.0.
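Once a Beam release with streaming file writes is available, a windowed file sink could look roughly like the sketch below. This is an assumption built on the apache_beam.io.fileio.WriteToFiles transform; the output path is illustrative and the messages are assumed to be UTF-8 text.

import json
import apache_beam as beam
from apache_beam.io import fileio

# Sketch only: 30-minute fixed windows, one set of files per window.
# WriteToFiles' default text sink writes one element per line
# (pass an explicit sink=... otherwise).
(p
 | 'Read From Pub Sub' >> ReadFromPubSub(subscription='projects/[my proj]/subscriptions/[my subscription]', with_attributes=True)
 | 'Map Message' >> beam.Map(format_message)
 | 'To line' >> beam.Map(lambda m: json.dumps({'data': m['data'].decode('utf-8'),
                                               'attributes': m['attributes'],
                                               'timestamp': m['timestamp']}))
 | 'Window' >> beam.WindowInto(beam.window.FixedWindows(30 * 60))
 | 'Write windowed files' >> fileio.WriteToFiles(path='gs://my-bucket/output/'))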
Related
I have a beginner question. I have ~11 million files stored in a GCS bucket with the following directory structure:
yyyy/mm/dd
I have 6 years of data with ~1500 files per year.
lines = (p
         | "ReadInputData" >> fileio.MatchFiles(file_pattern='gs://bucket/**/*')
         | "FileToBytes" >> fileio.ReadMatches()
         | "Reshuffle" >> beam.Reshuffle()
         | "GetMetadata" >> beam.Map(lambda file: post_process(file))
         | "WriteTableToBQ" >> beam.io.WriteToBigQuery(...)
        )
For each file, I check the metadata (extension, size, ...); for PDF files I open the file to count the number of pages, and for images I compute a hash. This is the post_process() Python function.
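For context, a minimal sketch of what such a post_process could look like; the field names, the count_pdf_pages helper, and the exact metadata returned are assumptions for illustration, not the real implementation.

import hashlib
import os

def post_process(readable_file):
    # readable_file is a fileio.ReadableFile produced by ReadMatches.
    path = readable_file.metadata.path
    row = {
        'path': path,
        'extension': os.path.splitext(path)[1].lower(),
        'size_bytes': readable_file.metadata.size_in_bytes,
    }
    if row['extension'] == '.pdf':
        # count_pdf_pages is a hypothetical helper (e.g. built on pypdf).
        row['num_pages'] = count_pdf_pages(readable_file.read())
    elif row['extension'] in ('.png', '.jpg', '.jpeg'):
        row['hash'] = hashlib.md5(readable_file.read()).hexdigest()
    return row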
The current Dataflow job has been running for 3 days (which is quite a lot) but is not parallelized properly, since most of the time only 1 vCPU is used.
It seems that the job searches for the files for a very long time with 1 worker (vCPU), then scales up to ~25 workers for the processing for ~10 minutes, and continues like that for days. Am I doing something wrong?
Earlier I did a test on one month of data (144,000 files) with file_pattern = 'gs://bucket/2022/04/**/*'.
I went from 4 h (only one worker was used) to 16 min (scaling up to 57 workers) by adding "Reshuffle" >> beam.Reshuffle(), which was missing.
I am using Python (Apache Beam Python 3.9 SDK 2.40.0) and a Dataflow template with a custom container (the --sdk_container_image option).
My question is how to use fileio.MatchFiles(file_pattern='gs://bucket/**/*') properly so that it uses multiple workers all the time. What I did works for file_pattern = 'gs://bucket/2022/04/**/*' but not for file_pattern = 'gs://bucket/**/*'. What is the best practice? I didn't find anything in the documentation that answers my question. If I loop over year and month, I don't know whether that will be properly parallelized or whether it is the right thing to do. Any suggestions or recommendations? Thanks.
edit 1:
I found that looping over the years gives better results:
list_yyyy_mm = []
for year in range(2017, 2023):
    list_yyyy_mm.append(p | f"MatchFile {year:04d}" >> fileio.MatchFiles(file_pattern=f"gs://bucket/{year:04d}/**/*"))

files = (
    list_yyyy_mm
    | "Flatten" >> beam.Flatten()
    ...
)
Now I can process the 6 years of data in ~2 hours, which is much better.
The listing is still not well parallelized, as it takes ~30 min with 5 workers to list the 12 million files. After that, reading the 12 million files and extracting the metadata works as expected (that last part is limited to ~230 workers because I don't have enough IPs, but this is an issue on my side).
I tried to loop over years and months, but it is even worse.
Not sure if this is expected for fileio.MatchFiles or if I should write my own version and optimize it for my use case. Maybe this is documented, but I didn't find it.
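One pattern that may help (a sketch, not tested on this bucket): build the per-year/per-month globs as a PCollection and expand them with fileio.MatchAll, so each glob can be matched on a different worker. The year and month ranges below are illustrative.

import apache_beam as beam
from apache_beam.io import fileio

# One glob per year/month, matched in parallel by MatchAll.
patterns = [f"gs://bucket/{year:04d}/{month:02d}/**/*"
            for year in range(2017, 2023)
            for month in range(1, 13)]

files = (p
         | "Patterns" >> beam.Create(patterns)
         | "MatchAll" >> fileio.MatchAll()
         | "ReadMatches" >> fileio.ReadMatches()
         | "Reshuffle" >> beam.Reshuffle()
         | "GetMetadata" >> beam.Map(post_process))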
I'm working on a Dataflow pipeline that reads 1000 files (50 MB each) from GCS and performs some computations on the rows across all files. Each file is a CSV with the same structure, just with different numbers in it, and I'm computing the average of each cell over all files.
The pipeline looks like this (python):
additional_side_inputs = {'key1': 'value1', 'key2': 'value2'}  # etc.

(p
 | 'Collect CSV files' >> MatchFiles(input_dir + "*.csv")
 | 'Read files' >> ReadMatches()
 | 'Parse contents' >> beam.ParDo(FileToRowsFn(), additional_side_inputs)
 | 'Compute average' >> beam.CombinePerKey(AverageCalculatorFn()))
The FileToRowsFn class looks like this (see below, some details omitted). The row_id is the first column and is a unique key of each row; it exists exactly once in each file, so that I can compute the average over all the files. There is an additional value provided as a side input to the transform, which is not shown inside the method body below but is still used by the real implementation. This value is a dictionary that is created outside of the pipeline. I mention it here in case this might be a reason for the lack of parallelization.
class FileToRowsFn(beam.DoFn):
    def process(self, file_element, additional_side_inputs):
        with file_element.open() as csv_file:
            for row_id, *values in csv.reader(TextIOWrapper(csv_file, encoding='utf-8')):
                yield row_id, values
The AverageCalculatorFn is a typical beam.CombineFn with an accumulator, that performs the average of each cell of a given row over all rows with the same row_id across all files.
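For completeness, a typical shape for such a CombineFn; this is a sketch of what AverageCalculatorFn might look like, not the poster's code, and it assumes the values parse as floats and that each row_id has the same number of columns in every file.

class AverageCalculatorFn(beam.CombineFn):
    # Accumulator: (per-cell sums, number of files seen for this row_id).
    def create_accumulator(self):
        return ([], 0)

    def add_input(self, accumulator, values):
        sums, count = accumulator
        values = [float(v) for v in values]
        sums = values if not sums else [s + v for s, v in zip(sums, values)]
        return (sums, count + 1)

    def merge_accumulators(self, accumulators):
        merged_sums, merged_count = [], 0
        for sums, count in accumulators:
            merged_sums = list(sums) if not merged_sums else [a + b for a, b in zip(merged_sums, sums)]
            merged_count += count
        return (merged_sums, merged_count)

    def extract_output(self, accumulator):
        sums, count = accumulator
        return [s / count for s in sums] if count else []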
All this works fine, but there's a problem with performance and throughput: it takes more than 60 hours to execute this pipeline. From the monitoring console, I notice that the files are read sequentially (1 file every 2 minutes). I understand that reading a file may take 2 minutes (each file is 50 MB), but I don't understand why Dataflow doesn't assign more workers to read multiple files in parallel. The CPU remains at ~2-3% because most of the time is spent in file IO, and the number of workers doesn't exceed 2 (although no limit is set).
The output of ReadMatches is 1000 file records, so why doesn't Dataflow create lots of FileToRowsFn instances and dispatch them to new workers, each one handling a single file?
Is there a way to enforce such a behavior?
This is probably because all your steps get fused into a single step by the Dataflow runner.
For such a fused bundle to parallelize, the first step needs to be parallelizable. In your case this is a glob expansion which is not parallelizable.
To make your pipeline parallelizable, you can try to break fusion. This can be done by adding a Reshuffle transform as the consumer of one of the steps that produce many elements.
For example,
from apache_beam import Reshuffle

additional_side_inputs = {'key1': 'value1', 'key2': 'value2'}  # etc.

(p
 | 'Collect CSV files' >> MatchFiles(input_dir + "*.csv")
 | 'Read files' >> ReadMatches()
 | 'Reshuffle' >> Reshuffle()
 | 'Parse contents' >> beam.ParDo(FileToRowsFn(), additional_side_inputs)
 | 'Compute average' >> beam.CombinePerKey(AverageCalculatorFn()))
You should not have to do this if you use one of the standard sources available in Beam, such as textio.ReadFromText(), to read the data (unfortunately we do not have a CSV source, but ReadFromText supports skipping header lines).
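For example, a sketch of reading the same files with the built-in text source instead; skip_header_lines handles the header row, and the line parsing shown is a naive split for illustration only.

rows = (p
        | 'Read CSV lines' >> beam.io.ReadFromText(input_dir + "*.csv", skip_header_lines=1)
        | 'Parse line' >> beam.Map(lambda line: line.split(',')))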
See here for more information regarding the fusion optimization and preventing fusion.
I am running an Apache Beam Dataflow job which reads from a bucket, performs some transformations and writes to BigQuery.
But the records are inserted into the streaming buffer.
validated_data = (p1
                  | 'Read files from Storage ' + url >> beam.io.ReadFromText(url)
                  | 'Validate records ' + url >> beam.Map(data_ingestion.validate, url)
                                                     .with_outputs(SUCCESS_TAG_KEY, FAILED_TAG_KEY, main="main")
                  )

all_data, _, _ = validated_data
success_records = validated_data[SUCCESS_TAG_KEY]
failed_records = validated_data[FAILED_TAG_KEY]

(success_records
 | 'Extracting row from tagged row {}'.format(url) >> beam.Map(lambda row: row['row'])
 | 'Write to BigQuery table for {}'.format(url) >> beam.io.WriteToBigQuery(
       table=data_ingestion.get_table(tmp=TEST, run_date=data_ingestion.run_date),
       create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
   )
)
Actually, I need to delete the partition before running the above, as a way to avoid duplicate records in an ingestion-time partitioned table.
If I run this job more than once for the same file without truncating the table, the table will end up with duplicate records.
And because the latest records are still in the streaming buffer, the delete-partition command does not actually remove the partition.
Below is the code I am using to truncate the table; it runs before the pipeline starts:
client = bigquery.Client()
dataset = TABLE_MAP['dataset']
table = TABLE_MAP[sentiment_pipeline][table_type]['table']
table_id = "{}${}".format(table, format_date(run_date, '%Y%m%d'))
table_ref = client.dataset(dataset).table(table_id)
output = client.delete_table(table_ref)
According to the BigQuery documentation, you may have to wait 30 minutes in order to issue a DML statement on a streaming table, and schema changes like deleting/truncating tables might result in data loss in some scenarios. Here are some workarounds you could try for dealing with duplicates in a streaming scenario.
Additionally, Apache Beam and Dataflow now support batch (file load) inserts for Python, so that might be a good way to avoid the streaming limitations.
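As a sketch of what that could look like with the pipeline from the question, assuming a Beam release where the Python SDK accepts the method argument on WriteToBigQuery:

(success_records
 | 'Extract row' >> beam.Map(lambda row: row['row'])
 | 'Write via load jobs' >> beam.io.WriteToBigQuery(
       table=data_ingestion.get_table(tmp=TEST, run_date=data_ingestion.run_date),
       create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
       method=beam.io.WriteToBigQuery.Method.FILE_LOADS))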
I am using an Apache Beam pipeline and I want to batch insert to BigQuery with Python. My data comes from Pub/Sub, which is unbounded. Based on my research, GlobalWindows with triggers should solve my problem. I tried my pipeline with windowing, but it still does streaming inserts. My pipeline code is the following:
p2 = (p
      | 'Read' >> beam.io.ReadFromPubSub(subscription=subscription_path,
                                         with_attributes=True,
                                         timestamp_attribute=None,
                                         id_label=None)
      | 'Windowing' >> beam.WindowInto(window.GlobalWindows(),
                                       trigger=Repeatedly(
                                           AfterAny(
                                               AfterCount(100),
                                               AfterProcessingTime(1 * 60))),
                                       accumulation_mode=AccumulationMode.DISCARDING)
      | 'Process' >> beam.Map(getAttributes))

p3 = (p2
      | 'Filter' >> beam.Filter(lambda msg: ("xx" in msg) and (msg["xx"].lower() == "true"))
      | 'Delete' >> beam.Map(deleteAttribute)
      | 'Write' >> writeTable(bq_table_test, bq_batch_size))
def writeTable(table_name, batch_size):
    return beam.io.WriteToBigQuery(
        table=table_name,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        batch_size=batch_size)
I'm checking from the billing reports whether the inserts are batch or streaming. When the streaming insert usage increases, I understand that the bulk insertion did not happen. Is there another way I can check whether an insertion was streaming or batch? And also, how can I do batch inserts to BigQuery?
According to the documentation, you cannot specify the insert type; it is automatically chosen based on your input PCollection:
The Beam SDK for Python does not currently support specifying the
insertion method.
BigQueryIO supports two methods of inserting data into BigQuery: load
jobs and streaming inserts. Each insertion method provides different
tradeoffs of cost, quota, and data consistency. See the BigQuery
documentation for load jobs and streaming inserts for more information
about these tradeoffs.
BigQueryIO chooses a default insertion method based on the input
PCollection.
BigQueryIO uses load jobs when you apply a BigQueryIO write transform
to a bounded PCollection.
BigQueryIO uses streaming inserts when you apply a BigQueryIO write
transform to an unbounded PCollection.
In your case you're reading from an unbounded source (Pub/Sub), so it will always be streaming writes. Windowing will not change the nature of the data.
One workaround I can think of is to split the pipeline, e.g. a streaming pipeline would write to a collection of files at some storage (GCS) and then another pipeline would read and upload those files (the files are bounded).
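A sketch of the second, batch half of that workaround; the staging path, the JSON-lines format and the pipeline options are placeholders:

import json
import apache_beam as beam

# Batch pipeline: the input is bounded, so WriteToBigQuery uses load jobs.
with beam.Pipeline(options=batch_options) as p:
    (p
     | 'Read staged files' >> beam.io.ReadFromText('gs://my-bucket/staging/*.json')
     | 'Parse' >> beam.Map(json.loads)
     | 'Write to BQ' >> beam.io.WriteToBigQuery(
           table=bq_table_test,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))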
I am using PySpark in the Community Edition of Databricks, with Python 2.7 and Spark 2.2.1. I have a PySpark dataframe "top100m":
In: type(movie_ratings_top100m)
Out: pyspark.sql.dataframe.DataFrame
Which has 3 numeric type columns:
In: top100m.printSchema()
Out: root
|-- userId: long (nullable = true)
|-- itemId: long (nullable = true)
|-- userPref: double (nullable = true)
In: top100m.show(6)
Out:
+------+-------+--------+
|userId| itemId|userPref|
+------+-------+--------+
| 243| 10| 3.5|
| 243| 34| 3.5|
| 243| 47| 4.0|
| 243| 110| 4.0|
| 243| 150| 2.5|
| 243| 153| 2.0|
+------+-------+--------+
There are no strings in the dataframe. When attempting to output this dataframe as either a CSV or a TXT file using either of the following lines of code (based on the Databricks documentation found here):
dbutils.fs.put("/FileStore/mylocation/top100m.csv", top100m)
dbutils.fs.put("/FileStore/mylocation/top100m.txt", top100m)
I get the following error:
TypeError: DataFrame[userId: bigint, itemId: bigint, userPref: double] has the wrong type - (<type 'basestring'>,) is expected.
I have a cursory understanding of the basestring supertype that existed in Python 2, and that it was abandoned in Python 3, which I don't think is relevant here, but I could be wrong. My ultimate goal with this is to be able to export my Pyspark dataframe from Databricks onto my local machine. My question is why Spark/Databricks would be expecting a basestring type in this case, and what I can do with my data to make it comply.
After reviewing the Databricks documentation including forums, it seems that there isn't a very straightforward way of transferring data to my local machine (I'm not attached to an S3 bucket). The simplest seems to be the approach I've noted above, which is giving me errors. If there is a better way to accomplish this, that would be extremely helpful.
Looking at the Databricks documentation, CSV files can be loaded into Spark from DBFS using sqlContext. Since that is the case, you can save data in a similar way (some information regarding saving RDDs is available here). In other words, there is no need to use dbutils for saving; instead do:
top100m.write.format("csv").save("/FileStore/mylocation/top100m.csv")
Due to how Spark saves files, top100m.csv will be a directory. Inside, there will be one CSV file per partition of the dataframe. These are called part-xxxxx (where xxxxx is a number starting from 00000). It's possible to get a single part file by calling coalesce(1) on the dataframe before saving it; in that case the CSV file will be called part-00000.
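For instance, a sketch of producing a single output file with a header row:

# Collapse to one partition so only a single part-00000 file is written.
(top100m
 .coalesce(1)
 .write
 .format("csv")
 .option("header", "true")
 .save("/FileStore/mylocation/top100m.csv"))

As an aside, dbutils.fs.put expects a string of file contents, e.g. dbutils.fs.put("/FileStore/tmp/hello.txt", "hello", True), which is why passing a DataFrame produces the basestring error in the question.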