Apache Beam – issue with Deduplicate function - python

I have an issue with the apache_beam.transforms.deduplicate.Deduplicate transform. Please look at the code sample below:
from typing import AnyStr

import apache_beam as beam
from apache_beam.transforms.deduplicate import Deduplicate

with beam.Pipeline() as pipeline:
    (
        pipeline
        # | 'Load' >> beam.Create(['a', 'b', 'b'])  ## <- works fine
        | 'Load' >> beam.io.ReadFromText('./input.txt')  ## <- breaks Dedup
        | 'Dedup' >> Deduplicate(processing_time_duration=1000).with_input_types(AnyStr)
        | 'Print' >> beam.Map(print)
    )
If I create the collection manually, everything works as expected. But when I try to load something from disk (a text file, Avro files, etc.), Deduplicate stops working and throws an exception:
Traceback (most recent call last):
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/direct/executor.py", line 370, in call
self.attempt_call(
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/direct/executor.py", line 404, in attempt_call
evaluator.start_bundle()
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 867, in start_bundle
self.runner.start()
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 1475, in start
self._invoke_bundle_method(self.do_fn_invoker.invoke_start_bundle)
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 1460, in _invoke_bundle_method
self._reraise_augmented(exn)
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 1507, in _reraise_augmented
raise new_exn.with_traceback(tb)
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 1458, in _invoke_bundle_method
bundle_method()
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 559, in invoke_start_bundle
self.signature.start_bundle_method.method_value())
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/direct/sdf_direct_runner.py", line 122, in start_bundle
self._invoker = DoFnInvoker.create_invoker(
TypeError: create_invoker() got an unexpected keyword argument 'output_processor' [while running 'Load/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/pair']
This happens only with Deduplicate and DeduplicatePerKey transformations. All other things like ParDo, Map, etc. work fine.
Python version: 3.9.14
Apache Beam: 2.41.0
Platform: Apple M1 (ARM)

I hope this helps.
Indeed, I tested your code and it doesn't work. I may be wrong, but I think the Deduplicate PTransform is better suited to jobs with windowing logic (processing time and event time).
It works with beam.Create (even though it's a bounded source) but not with ReadFromText, because a type is not inferred:
E TypeError: create_invoker() got an unexpected keyword argument 'output_processor'
I propose another solution that works in your case and is better adapted to deduplicating data in a batch job with a bounded source:
def test_dedup(self):
    with TestPipeline() as p:
        (
            p
            # | 'Load' >> beam.Create(['a', 'b', 'b'])  ## <- works fine
            | 'Load' >> beam.io.ReadFromText(
                f'{ROOT_DIR}/input.txt')  ## <- breaks Dedup
            # | 'Dedup' >> Deduplicate(processing_time_duration=1000).with_input_types(AnyStr)
            | 'Group by' >> beam.GroupBy(lambda el: el)
            | 'Get key' >> beam.Map(lambda t: t[0])
            | 'Print' >> beam.Map(self.print_el)
        )
The input.txt content is :
1
2
2
3
4
The output PCollection is:
1
2
3
4
I used GroupBy on the current element, which gives me a tuple per element, e.g. 2 -> [2, 2],
and then I mapped over each tuple to keep only the deduplicated key.
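As a side note, Beam also ships a built-in Distinct transform that performs this group-and-extract in one step for bounded/batch deduplication; a minimal sketch, assuming the same input.txt:
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Load' >> beam.io.ReadFromText('./input.txt')
        | 'Dedup' >> beam.Distinct()  # batch-friendly deduplication, no timers required
        | 'Print' >> beam.Map(print)
    )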


Log Apache Beam WriteToDatastore result to BigQuery - Python

I have a very simple pipeline in Python:
with beam.Pipeline(options=create_pipeline_options(pipeline_args)) as p:
rows = (p | 'ReadFromBigquery' >> beam.io.ReadFromBigQuery(table=f"{known_args.project}:{known_args.datasetId}.{known_args.tableId}", use_standard_sql=True))
entities = (rows | 'GetEntities' >> beam.ParDo(GetEntity()))
updated = (entities | 'Update Entities' >> beam.ParDo(UpdateEntity()))
_ = (updated | 'Write To Datastore' >> WriteToDatastore(known_args.project))
I want to log which entities have been correctly updated after WriteToDatastore has finished running, so I can write them to a BigQuery audit table. Ideally it would look something like this:
successful_entities, failed_entities = (updated | 'Write To Datastore' >> WriteToDatastoreWrapper(known_args.project))
_ = (successful_entities | 'Write Success To Bigquery' >> beam.io.WriteToBigQuery(table=f"{c.audit_table}:{known_args.datasetId}.{known_args.tableId}"))
_ = (failed_entities| 'Write Failed To Bigquery' >> beam.io.WriteToBigQuery(table=f"{c.audit_table}:{known_args.datasetId}.{known_args.tableId}"))
Is this possible to achieve?
Alternatively, if the whole batch fails after n retries, is it possible to catch that failure and log which batch failed (assuming I have some sort of runId to keep track of batches)?
I hope this helps.
I propose a solution with a dead letter queue before writing the result to Datastore.
Beam suggests using a dead letter queue in this case, and we can achieve that with TupleTags.
You can write it with native Beam, but the code is verbose.
I created a library for Beam in Java and Python called Asgarde.
Here is the link to the Python version: https://github.com/tosun-si/pasgarde
You can install the package with pip:
pip install asgarde==0.16.0
With Asgarde, you can catch errors in each step of the pipeline, before writing the result with the IO (Datastore in this case).
Example:
input_teams: PCollection[str] = p | 'Read' >> beam.Create(team_names)

result = (CollectionComposer.of(input_teams)
          .map('Map with country', lambda tname: TeamInfo(name=tname, country=team_countries[tname], city=''))
          .map('Map with city', lambda tinfo: TeamInfo(name=tinfo.name, country=tinfo.country, city=team_cities[tinfo.name]))
          .filter('Filter french team', lambda tinfo: tinfo.country == 'France'))

result_outputs: PCollection[TeamInfo] = result.outputs
result_failures: PCollection[Failure] = result.failures
Asgarde provides a wrapper, the CollectionComposer class, instantiated from a PCollection.
Each operator (map, flat_map, filter) then applies the operation while handling errors.
The result of CollectionComposer is a tuple of:
a PCollection of successful outputs
a PCollection of Failure
Failure is an object provided by Asgarde:
@dataclass
class Failure:
    pipeline_step: str
    input_element: str
    exception: Exception
This object gives the input_element concerned by the error and the exception that was raised.
pipeline_step is the name of the transformation in which the error occurred.
Your pipeline can be adapted in the following way with Asgarde:
with beam.Pipeline(options=create_pipeline_options(pipeline_args)) as p:
    rows = (p | 'ReadFromBigquery' >> beam.io.ReadFromBigQuery(
        table=f"{known_args.project}:{known_args.datasetId}.{known_args.tableId}",
        use_standard_sql=True))

    result = (CollectionComposer.of(rows)
              .map('GetEntities', lambda el: get_entity_function(el))
              .map('Update Entities', lambda entity: update_entity_function(entity)))

    result_outputs = result.outputs
    result_failures: PCollection[Failure] = result.failures

    (result_outputs | 'Write To Datastore' >> WriteToDatastore(known_args.project))

    (result_failures
     | 'Map before Write to BQ' >> beam.Map(failure_to_your_obj_function)
     | 'Write Failed To Bigquery' >> beam.io.WriteToBigQuery(
         table=f"{c.audit_table}:{known_args.datasetId}.{known_args.tableId}"))
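The failure_to_your_obj_function used above is user-supplied; a hypothetical sketch that flattens a Failure into a dict that WriteToBigQuery can serialize might look like this (field names are assumptions, not part of Asgarde):
def failure_to_your_obj_function(failure):
    # Flatten the Asgarde Failure into a plain dict for WriteToBigQuery
    # (the target table schema / field names here are assumptions).
    return {
        'pipeline_step': failure.pipeline_step,
        'input_element': failure.input_element,
        'exception': str(failure.exception),
    }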
You can also apply the same logic with native Beam; I'm sharing an example from my personal GitHub repository:
https://github.com/tosun-si/teams-league-python-dlq-native-beam-summit/blob/main/team_league/domain_ptransform/team_stats_transform.py
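For reference, here is a minimal sketch of that native approach using a DoFn with tagged outputs, assuming the entities PCollection from the question and an update_entity_function as in the adapted pipeline above (the tag name and failure dict are illustrative):
import apache_beam as beam
from apache_beam import pvalue


class UpdateEntityWithDeadLetter(beam.DoFn):
    """Illustrative DoFn: route failing elements to a tagged output instead of failing the bundle."""

    def process(self, element):
        try:
            yield update_entity_function(element)  # assumed business function from the pipeline above
        except Exception as e:
            # Send the failing element and its error to the 'failures' output.
            yield pvalue.TaggedOutput('failures', {'input_element': str(element), 'exception': str(e)})


results = entities | 'Update Entities' >> beam.ParDo(UpdateEntityWithDeadLetter()).with_outputs(
    'failures', main='updated')
updated, failures = results.updated, results.failures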

DocPlex giving CpoSolverException when using search phases

I am running a constraint programming model in docplex. When I add the following search phase I get an error in docplex:
model.set_parameters({'SearchType': 'DepthFirst', 'Workers': 2, 'LogVerbosity': 'Verbose'})

p1 = search_phase(
    vars=shifts.values(),
    varchooser=select_largest(var_impact()),
    valuechooser=select_largest(value_impact())
)
p2 = search_phase(
    vars=work_hours.values(),
    varchooser=select_smallest(domain_size()),
    valuechooser=select_random_value()
)
model.add(p1)
ans = model.solve(TimeLimit=100, execfile='cpoptimizer.exe')
I get the following error
(base) dipplestix#DESKTOP-37BA91G:~/classes/csci 2951/hw2$ ./run.sh input/7_14.sched
! --------------------------------------------------- CP Optimizer 20.1.0.0 --
! Satisfiability problem - 196 variables, 266 constraints, 1 phase
! Presolve : 21 extractables eliminated, 7 constraints generated
! TimeLimit = 100
! Workers = 2
! LogVerbosity = Verbose
! SearchType = DepthFirst
! Initial process time : 0.02s (0.02s extraction + 0.00s propagation)
! . Log search space : 449.3 (before), 449.3 (after)
! . Memory usage : 501.9 kB (before), 501.9 kB (after)
! Using parallel search with 2 workers.
! ----------------------------------------------------------------------------
! Branches Non-fixed W Branch decision
Traceback (most recent call last):
File "src/run.py", line 8, in <module>
p = solve(sys.argv[1])
File "/home/dipplestix/classes/csci 2951/hw2/src/solver.py", line 97, in solve
ans = model.solve(TimeLimit=100, execfile='cpoptimizer.exe')
File "/home/dipplestix/anaconda3/lib/python3.7/site-packages/docplex/cp/model.py", line 1080, in solve
msol = solver.solve()
File "/home/dipplestix/anaconda3/lib/python3.7/site-packages/docplex/cp/solver/solver.py", line 614, in solve
raise e
File "/home/dipplestix/anaconda3/lib/python3.7/site-packages/docplex/cp/solver/solver.py", line 607, in solve
msol = self.agent.solve()
File "/home/dipplestix/anaconda3/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 191, in solve
jsol = self._wait_json_result(EVT_SOLVE_RESULT)
File "/home/dipplestix/anaconda3/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 474, in _wait_json_result
data = self._wait_event(evt)
File "/home/dipplestix/anaconda3/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 424, in _wait_event
evt, data = self._read_message()
File "/home/dipplestix/anaconda3/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 533, in _read_message
frame = self._read_frame(6)
File "/home/dipplestix/anaconda3/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 593, in _read_frame
raise CpoSolverException("Nothing to read from local solver process. Process seems to have been stopped (rc={}).".format(rc))
docplex.cp.solver.solver.CpoSolverException: Nothing to read from local solver process. Process seems to have been stopped (rc=5).
However, if I use this search_phase instead, it works:
p1 = search_phase(
    vars=shifts.values(),
    varchooser=select_random_var(),
    valuechooser=select_random_value()
)
Any ideas what could be causing this?
Unfortunately, the evaluators that use statistics over the branches of the search (impacts, success rate, objective variation measures) are not available for variable and value choosers in DepthFirst search. You can use them with Restart and MultiPoint. However, docplex should raise an error in this case rather than exiting this way. We will fix this in the next release.
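For example, a minimal sketch keeping the impact-based phase from the question but switching the search type (parameters otherwise as in the question):
# Impact/statistics-based evaluators need a statistics-driven search,
# so switch SearchType from 'DepthFirst' to 'Restart' (or 'MultiPoint').
model.set_parameters({'SearchType': 'Restart', 'Workers': 2, 'LogVerbosity': 'Verbose'})
model.add(p1)
ans = model.solve(TimeLimit=100, execfile='cpoptimizer.exe')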

How To Combine Parsed TextFiles In Apache-Beam DataFlow in Python?

This seems to work fine in DirectRunner, but errors out when I switch to DataflowRunner. I basically need to somehow combine the files that are read in, but as soon as I use beam.combiners.ToList() to concatenate my data, it introduces a whole slew of issues.
Code Example:
import io

import apache_beam as beam
import pandas as pd


def convert_to_dataframe(readable_file):
    yield pd.read_csv(io.TextIOWrapper(readable_file.open()))


class merge_dataframes(beam.DoFn):
    def process(self, element):
        yield pd.concat(element).reset_index(drop=True)


with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'Match Files From GCS' >> beam.io.fileio.MatchFiles(raw_data_path)
     | 'Read Files' >> beam.io.fileio.ReadMatches()
     | 'Shuffle' >> beam.Reshuffle()
     | 'Create DataFrames' >> beam.FlatMap(convert_to_dataframe)
     | 'Combine To List' >> beam.combiners.ToList()
     | 'Merge DataFrames' >> beam.ParDo(merge_dataframes())
     | 'Apply Transformations' >> beam.ParDo(ApplyPipeline(creds_path=args.creds_path,
                                                           project_name=args.project_name,
                                                           feature_group_name=args.feature_group_name))
     | 'Write To GCS' >> beam.io.WriteToText(feature_data_path,
                                             file_name_suffix='.csv',
                                             shard_name_template='')
     )
Error:
"No objects to concatenate [while running 'Merge DataFrames']"
I don't understand this error, because the 'Combine To List' step should have produced a list of dataframes that then gets passed into 'Merge DataFrames', which is indeed what happens when I use the DirectRunner.
Given this error, I suspect that MatchFiles is not actually matching anything (e.g. due to a bad file pattern) and, consequently, the output of beam.combiners.ToList is an empty list.
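One way to confirm that suspicion is to guard the merge step against an empty list; a small sketch based on the DoFn from the question:
import logging

import apache_beam as beam
import pandas as pd


class merge_dataframes(beam.DoFn):
    def process(self, element):
        # If MatchFiles matched nothing, ToList emits an empty list and
        # pd.concat raises "No objects to concatenate"; log and skip that case.
        if not element:
            logging.warning('ToList produced an empty list; check the MatchFiles file pattern.')
            return
        yield pd.concat(element).reset_index(drop=True)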

PySpark UDF - resulting DF fails to show "value error: "mycolumn" name is not in list"

The scenario is very similar to this post with some variations: Pyspark Unsupported literal type class java.util.ArrayList
I have data of this format:
data.show()
+---------------+--------------------+--------------------+
| features| meta| telemetry|
+---------------+--------------------+--------------------+
| [seattle, 3]|[seattle, 3, 5344...|[[47, 1, 27, 92, ...|
| [miami, 1]|[miami, 1, 236881...|[[31, 84, 24, 67,...|
| [miami, 3]|[miami, 3, 02f4ca...|[[84, 5, 4, 93, 2...|
| [seattle, 3]|[seattle, 3, ec48...|[[43, 16, 94, 93,...|
| [seattle, 1]|[seattle, 1, 7d19...|[[70, 22, 45, 74,...|
|[kitty hawk, 3]|[kitty hawk, 3, d...|[[46, 15, 56, 94,...|
You can download a generated .json sample from this link: https://aiaccqualitytelcapture.blob.core.windows.net/streamanalytics/2019/08/21/10/0_43cbc7b0c9e845a187ce182b46eb4a3a_1.json?st=2019-08-22T15%3A20%3A20Z&se=2026-08-23T15%3A20%3A00Z&sp=rl&sv=2018-03-28&sr=b&sig=tsYh4oTNZXWbLnEgYypNqIsXH3BXOG8XyAH5ODi8iQg%3D
In particular, you can see that the actual data in each of these columns is a dictionary. The "features" column, which is the one of interest to us, has this form: {"factory_id":"seattle","line_id":"3"}
I'm attempting to one-hot encode the data in features via classical functional means.
See below:
import numpy as np


def one_hot(value, categories_list):
    num_cats = len(categories_list)
    one_hot = np.eye(num_cats)[categories_list.index(value)]
    return one_hot


def one_hot_features(row, feature_keys, u_features):
    """
    feature_keys must be sorted.
    """
    cur_key = feature_keys[0]
    vector = one_hot(row["features"][cur_key], u_features[cur_key])
    for i in range(1, len(feature_keys)):
        cur_key = feature_keys[i]
        n_vector = one_hot(row["features"][cur_key], u_features[cur_key])
        vector = np.concatenate((vector, n_vector), axis=None)
    return vector
The feature_keys and u_features in this case contain the following data:
feature_keys = ['factory_id', 'line_id']
u_features = {'factory_id': ['kitty hawk', 'miami', 'nags head', 'seattle'], 'line_id': ['1', '2', '3']}
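For illustration, with the categories above, one_hot returns a fixed-length indicator vector for each value (a row of np.eye selected by the category's index); for example:
one_hot('miami', u_features['factory_id'])   # -> array([0., 1., 0., 0.])
one_hot('3', u_features['line_id'])          # -> array([0., 0., 1.])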
I have created a UDF and am attempting to create a new dataframe with the new column added using it. Code below:
def calc_onehot_udf(feature_keys, u_features):
    return udf(lambda x: one_hot_features(x, feature_keys, u_features))

n_data = data.withColumn("hot_feature",
                         calc_onehot_udf(feature_keys, u_features)(col("features")))
n_data.show()
This results in the following error:
Py4JJavaError: An error occurred while calling o148257.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 91.0 failed 4 times, most recent failure: Lost task 0.3 in stage 91.0 (TID 1404, 10.139.64.5, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/databricks/spark/python/pyspark/sql/types.py", line 1514, in getitem
idx = self.fields.index(item)
ValueError: 'features' is not in list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 480, in main
process()
File "/databricks/spark/python/pyspark/worker.py", line 472, in process
serializer.dump_stream(out_iter, outfile)
File "/databricks/spark/python/pyspark/serializers.py", line 456, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/databricks/spark/python/pyspark/serializers.py", line 149, in dump_stream
for obj in iterator:
File "/databricks/spark/python/pyspark/serializers.py", line 445, in _batched
for item in iterator:
File "", line 1, in
File "/databricks/spark/python/pyspark/worker.py", line 87, in
return lambda *a: f(*a)
File "/databricks/spark/python/pyspark/util.py", line 99, in wrapper
return f(*args, **kwargs)
File "", line 4, in
File "", line 11, in one_hot_features
File "/databricks/spark/python/pyspark/sql/types.py", line 1519, in getitem
raise ValueError(item)
ValueError: features
Any assistance is greatly appreciated. I am actively investigating this.
The ideal output would be a new dataframe with a column "hot_feature" containing the one-dimensional one-hot encoded array derived from the features column.
It turns out there were a few key problems:
You should (or rather must) specify the return type in the UDF; in this case it is ArrayType(FloatType()).
Instead of returning an ndarray from one_hot_features, I called vector.tolist().
Passing col("features") sends the actual value inside the features column, row by row, not the whole row; therefore calling row["features"] as originally done is incorrect, since I already have the value for that row and there is no such accessor. I therefore renamed the first parameter from "row" to "features_val" to better reflect the expected input.
New code for one_hot_features below:
def one_hot_features(features_val, feature_keys, u_features):
    cur_key = feature_keys[0]
    vector = one_hot(features_val[cur_key], u_features[cur_key])
    for i in range(1, len(feature_keys)):
        cur_key = feature_keys[i]
        n_vector = one_hot(features_val[cur_key], u_features[cur_key])
        vector = np.concatenate((vector, n_vector), axis=None)
    return vector.tolist()
According to various other documentation I've found, numpy arrays don't play particularly well with Spark DataFrames as of this writing, so it is best to convert them to more generic Python types. This appears to have solved the problem faced here.
Updated code for the UDF definition below:
def calc_onehot_udf(feature_keys, u_features):
    return udf(lambda x: one_hot_features(x, feature_keys, u_features),
               ArrayType(FloatType()))

n_data = data.withColumn("hot_feature",
                         calc_onehot_udf(feature_keys, u_features)(col("features")))
n_data.show()
Good luck if you face this problem; hopefully documenting it here helps.

Unicode error while converting RDD to Spark DataFrame

I am getting the following error when I run the show method on the data frame:
Py4JJavaError: An error occurred while calling o14904.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23450.0 failed 1 times, most recent failure: Lost task 0.0 in stage 23450.0 (TID 120652, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
process()
File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<ipython-input-8-b76896bc4e43>", line 320, in <lambda>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-5: ordinal not in range(128)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.next(PythonRDD.scala:156)
When I only fetch 12 rows, it doesn't throw the error.
jpsa_rf.features_df.show(12)
+------------+--------------------+
|Feature_name| Importance_value|
+------------+--------------------+
| competitive|0.019380017988201638|
| new|0.012416277407924172|
|self-reliant|0.009044388916918005|
| related|0.008968947484358822|
| retail|0.008729510712416655|
| sales,|0.007680271475590303|
| work|0.007548541044789985|
| performance|0.007209008630295571|
| superior|0.007065626808393139|
| license|0.006436001036918034|
| industry|0.006416712169788629|
| record|0.006227581067732823|
+------------+--------------------+
only showing top 12 rows
But when I do .show(15) I get the error.
I created this data frame as below. It is basically a data frame of features with their importance values from a Random Forest model:
vocab = np.array(self.cvModel.bestModel.stages[3].vocabulary)

if est_name == "rf":
    feature_importance = self.cvModel.bestModel.stages[5].featureImportances.toArray()
    argsort_feature_indices = feature_importance.argsort()[::-1]
elif est_name == "blr":
    feature_importance = self.cvModel.bestModel.stages[5].coefficients.toArray()
    argsort_feature_indices = abs(feature_importance).argsort()[::-1]

# Sort the feature importance array in descending order and get the indices
feature_names = vocab[argsort_feature_indices]

self.features_df = sc.parallelize(zip(feature_names, feature_importance[argsort_feature_indices])).\
    map(lambda x: (str(x[0]), float(x[1]))).toDF(["Feature_name", "Importance_value"])
I assume you're using Python 2. The problem is most likely the str(x[0]) part in your map: x[0] refers to a unicode string, and str is supposed to convert it to a byte string. It does so, however, by implicitly assuming ASCII encoding, which only works for plain English text.
This is not how things are supposed to be done.
The short answer is: change str(x[0]) to x[0].encode('utf-8').
The long answer can be found e.g. here or here.
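Applied to the snippet from the question, the fix would look roughly like this (only the str(x[0]) call changes):
self.features_df = sc.parallelize(zip(feature_names, feature_importance[argsort_feature_indices])).\
    map(lambda x: (x[0].encode('utf-8'), float(x[1]))).toDF(["Feature_name", "Importance_value"])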
