Unicode error while converting RDD to Spark DataFrame

Unicode error while converting RDD to Spark DataFrame - python

I am getting the following error when I run show method on the data frame.
Py4JJavaError: An error occurred while calling o14904.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23450.0 failed 1 times, most recent failure: Lost task 0.0 in stage 23450.0 (TID 120652, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
process()
File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<ipython-input-8-b76896bc4e43>", line 320, in <lambda>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-5: ordinal not in range(128)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.next(PythonRDD.scala:156)
When I only fetch 12 rows, it doesn't throw the error.
jpsa_rf.features_df.show(12)
+------------+--------------------+
|Feature_name| Importance_value|
+------------+--------------------+
| competitive|0.019380017988201638|
| new|0.012416277407924172|
|self-reliant|0.009044388916918005|
| related|0.008968947484358822|
| retail|0.008729510712416655|
| sales,|0.007680271475590303|
| work|0.007548541044789985|
| performance|0.007209008630295571|
| superior|0.007065626808393139|
| license|0.006436001036918034|
| industry|0.006416712169788629|
| record|0.006227581067732823|
+------------+--------------------+
only showing top 12 rows
But when I do .show(15) I get the error.
I created this data frame as below: it is basically a data frame of features with the their importance values from a Random Forest Model
vocab=np.array(self.cvModel.bestModel.stages[3].vocabulary)
if est_name=="rf":
feature_importance=self.cvModel.bestModel.stages[5].featureImportances.toArray()
argsort_feature_indices=feature_importance.argsort()[::-1]
elif est_name=="blr":
feature_importance=self.cvModel.bestModel.stages[5].coefficients.toArray()
argsort_feature_indices=abs(feature_importance).argsort()[::-1]
# Sort the features importance array in descending order and get the indices
feature_names=vocab[argsort_feature_indices]
self.features_df=sc.parallelize(zip(feature_names,feature_importance[argsort_feature_indices])).\
map(lambda x: (str(x[0]),float(x[1]))).toDF(["Feature_name","Importance_value"])

I assume you're using Python 2. The problem at hand is most likely the str(x[0]) part in your df.map. It seems x[0] refers to a unicode string and the str is supposed to convert it to a bytestring. It does so however by implicitly assuming an ASCII encoding, which will only work for plain english text.
This is not how things are supposed to be done.
The short answer is: Change str(x[0]) to x[0].encode('utf-8').
The long answer can be found e.g. here or here.

Related

Apache Beam – issue with Deduplicate function

I have an issue with apache_beam.transforms.deduplicate.Deduplicate transformation. Please, look on a code sample below:
with beam.Pipeline() as pipeline:
(
pipeline
# | 'Load' >> beam.Create(['a', 'b', 'b']) ## <- works fine
| 'Load' >> beam.io.ReadFromText('./input.txt'). ## <- breaks Dedup
| 'Dedup' >> Deduplicate(processing_time_duration=1000).with_input_types(AnyStr)
| 'Print' >> beam.Map(print)
)
If I create collection manually – everything works fine and as expected. But when I try to load something from disk (text file, Avro files, etc.) Deduplicate stops working and throws an exception:
Traceback (most recent call last):
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/direct/executor.py", line 370, in call
self.attempt_call(
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/direct/executor.py", line 404, in attempt_call
evaluator.start_bundle()
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 867, in start_bundle
self.runner.start()
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 1475, in start
self._invoke_bundle_method(self.do_fn_invoker.invoke_start_bundle)
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 1460, in _invoke_bundle_method
self._reraise_augmented(exn)
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 1507, in _reraise_augmented
raise new_exn.with_traceback(tb)
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 1458, in _invoke_bundle_method
bundle_method()
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 559, in invoke_start_bundle
self.signature.start_bundle_method.method_value())
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/direct/sdf_direct_runner.py", line 122, in start_bundle
self._invoker = DoFnInvoker.create_invoker(
TypeError: create_invoker() got an unexpected keyword argument 'output_processor' [while running 'Load/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/pair']
This happens only with Deduplicate and DeduplicatePerKey transformations. All other things like ParDo, Map, etc. work fine.
Python version: 3.9.14
Apache Beam: 2.41.0
Platform: Apple's M1 ARM

I hope it can help.
Indeed I tested your code and it doesn't works, maybe I am wrong but I think the Deduplicate PTransform seems to be more adapted for jobs with windowing logics (processing time and event time).
It works with beam.Create (even if it's a bounded source) but not with ReadFromText because a type is not inferred :
E TypeError: create_invoker() got an unexpected keyword argument 'output_processor'
I propose you another solution that works in your case, and it's more adapted to deduplicate data in batch job and bounded source :
def test_dedup(self):
with TestPipeline() as p:
(
p
# | 'Load' >> beam.Create(['a', 'b', 'b']) ## <- works fine
| 'Load' >> beam.io.ReadFromText(
f'{ROOT_DIR}/input.txt') ## <- breaks D
# | 'Dedup' >> Deduplicate(processing_time_duration=1000).with_input_types(AnyStr)
| 'Group by' >> beam.GroupBy(lambda el: el)
| 'Get key' >> beam.Map(lambda t: t[0])
| 'Print' >> beam.Map(self.print_el)
)
The input.txt content is :
1
2
2
3
4
The output PCollections is :
1
2
3
4
I used GroupBy on the current element, it gives me a Tuple => 2 -> [2, 2]
and then I added a map on the deduplicated key on the Tuple.

ValueError: dictionary update sequence element #0 has length 1; 2 is required in Python

I am trying to create a login system for a simple text-based game I am currently making. I used the PyCrypto module to encrypt and decrypt my data, yet it still seems to error whenever I pass a dict as a string and try to turn it into a dict:
with open(f"users/{username}", "r") as encr_file:
txt = encr_file.read()
encr_file.close()
info1 = encr.decrypt(txt)
print(info1)
info = dict(info1)
And here's the error:
Traceback (most recent call last):
File "main.py", line 3, in <module>
l.login()
File "/home/runner/Space/files/login.py", line 93, in login
info = dict(info1)
ValueError: dictionary update sequence element #0 has length 1; 2 is required
And here is the dictionary that's causing it to error:
{'username': 'rrxx', 'password': '123', 'money': 761, 'ship': 'Wheeler 11', 'guild': 'Military'}
Note that this is just an example dict and I know that passwords should not be like this
Can anyone help??

PySpark UDF - resulting DF fails to show "value error: "mycolumn" name is not in list"

The scenario is very similar to this post with some variations: Pyspark Unsupported literal type class java.util.ArrayList
I have data of this format:
data.show()
+---------------+--------------------+--------------------+
| features| meta| telemetry|
+---------------+--------------------+--------------------+
| [seattle, 3]|[seattle, 3, 5344...|[[47, 1, 27, 92, ...|
| [miami, 1]|[miami, 1, 236881...|[[31, 84, 24, 67,...|
| [miami, 3]|[miami, 3, 02f4ca...|[[84, 5, 4, 93, 2...|
| [seattle, 3]|[seattle, 3, ec48...|[[43, 16, 94, 93,...|
| [seattle, 1]|[seattle, 1, 7d19...|[[70, 22, 45, 74,...|
|[kitty hawk, 3]|[kitty hawk, 3, d...|[[46, 15, 56, 94,...|
You can download a generated .json sample from this link: https://aiaccqualitytelcapture.blob.core.windows.net/streamanalytics/2019/08/21/10/0_43cbc7b0c9e845a187ce182b46eb4a3a_1.json?st=2019-08-22T15%3A20%3A20Z&se=2026-08-23T15%3A20%3A00Z&sp=rl&sv=2018-03-28&sr=b&sig=tsYh4oTNZXWbLnEgYypNqIsXH3BXOG8XyAH5ODi8iQg%3D
In particular, you can see that the actual data in each of these is actually a dictionary: the "features" column which is the one of interest to us is of this form: {"factory_id":"seattle","line_id":"3"}
I'm attempting to encode the data in features to one_hot via classical functional means.
See below:
def one_hot(value, categories_list):
num_cats = len(categories_list)
one_hot = np.eye(num_cats)[categories_list.index(value)]
return one_hot
def one_hot_features(row, feature_keys, u_features):
"""
feature_keys must be sorted.
"""
cur_key = feature_keys[0]
vector = one_hot(row["features"][cur_key], u_features[cur_key])
for i in range(1, len(feature_keys)):
cur_key = feature_keys[i]
n_vector = one_hot(row["features"][cur_key], u_features[cur_key])
vector = np.concatenate((vector, n_vector), axis=None)
return vector
The feature_keys and u_features in this case contain the following data:
feature_keys = ['factory_id', 'line_id']
u_features = {'factory_id': ['kitty hawk', 'miami', 'nags head', 'seattle'], 'line_id': ['1', '2', '3']}
I have created a udf and am attempting to create a new dataframe with the new column added using this udf. Code below:
def calc_onehot_udf(feature_keys, u_features):
return udf(lambda x: one_hot_features(x, feature_keys, u_features))
n_data = data.withColumn("hot_feature", calc_onehot_udf(feature_keys,
u_features)( col("features") ))
n_data.show()
This results in the following sets of error:
Py4JJavaError: An error occurred while calling o148257.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 91.0 failed 4 times, most recent failure: Lost task 0.3 in stage 91.0 (TID 1404, 10.139.64.5, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/databricks/spark/python/pyspark/sql/types.py", line 1514, in getitem
idx = self.fields.index(item)
ValueError: 'features' is not in list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 480, in main
process()
File "/databricks/spark/python/pyspark/worker.py", line 472, in process
serializer.dump_stream(out_iter, outfile)
File "/databricks/spark/python/pyspark/serializers.py", line 456, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/databricks/spark/python/pyspark/serializers.py", line 149, in dump_stream
for obj in iterator:
File "/databricks/spark/python/pyspark/serializers.py", line 445, in _batched
for item in iterator:
File "", line 1, in
File "/databricks/spark/python/pyspark/worker.py", line 87, in
return lambda *a: f(*a)
File "/databricks/spark/python/pyspark/util.py", line 99, in wrapper
return f(*args, **kwargs)
File "", line 4, in
File "", line 11, in one_hot_features
File "/databricks/spark/python/pyspark/sql/types.py", line 1519, in getitem
raise ValueError(item)
ValueError: features
Any assistance is greatly appreciated. I am actively investigating this.
The ideal output would be a new dataframe with the column: "hot_features" which contains the 1 dimensional one hot encoded array from the features column.

Turns out that there were a few key problems:
You should or must specify the return type in the UDF. In this case it is ArrayType(FloatType())
instead of returning an nd array from one_hot_features, I called vectors.tolist()
passing col("features") sends the actual values inside the features column row by row and not the actual row data; therefor calling row["features"] as originally done is incorrect as there is no accessor because I already have the value for that row. I therefor renamed the first parameter to be "features_val" instead of "row" to better reflect the expected input.
new code below for one_hot_features.
def one_hot_features(features_val, feature_keys, u_features):
cur_key = feature_keys[0]
vector = one_hot(features_val[cur_key], u_features[cur_key])
for i in range(1, len(feature_keys)):
cur_key = feature_keys[i]
n_vector = one_hot(features_val[cur_key], u_features[cur_key])
vector = np.concatenate((vector, n_vector), axis=None)
return vector.tolist()
According to various other documentation I've found it appears that numpy arrays don't play particularly well with spark dataframes as of this writing and therefor it is best to transform them into the more generic python types. This appears to have solved the problem faced here.
updated code for the udf definition below:
def calc_onehot_udf(feature_keys, u_features):
return udf(lambda x: one_hot_features(x, feature_keys, u_features),
ArrayType(FloatType()))
n_data = data.withColumn("hot_feature", calc_onehot_udf(feature_keys,
u_features)(col("features")))
n_data.show()
Good Luck if you face this problem; hopefully documenting here helps.

error in labelled point object pyspark

I am writing a function
which takes a RDD as input
splits the comma separated values
then convert each row into labelled point object
finally fetch the output as a dataframe
code:
def parse_points(raw_rdd):
cleaned_rdd = raw_rdd.map(lambda line: line.split(","))
new_df = cleaned_rdd.map(lambda line:LabeledPoint(line[0],[line[1:]])).toDF()
return new_df
output = parse_points(input_rdd)
upto this if I run the code, there is no error it is working fine.
But on adding the line,
output.take(5)
I am getting the error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 129.0 failed 1 times, most recent failure: Lost task 0.0 in s stage 129.0 (TID 152, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Py4JJavaError Traceback (most recent call last)
<ipython-input-100-a68c448b64b0> in <module>()
20
21 output = parse_points(raw_rdd)
---> 22 print output.show()
Please suggest me what is the mistake.

The reason you had no errors until you execute the action:
output.take(5)
Is due to the nature of spark, which is lazy.
i.e. nothing was execute in spark until you execute the action "take(5)"
You have a few issues in your code, and I think that you are failing due to extra "[" and "]" in [line[1:]]
So you need to remove extra "[" and "]" in [line[1:]] (and keep only the line[1:])
Another issue which you might need to solve is the lack of dataframe schema.
i.e. replace "toDF()" with "toDF(["features","label"])"
This will give the dataframe a schema.

Try:
>>> raw_rdd.map(lambda line: line.split(",")) \
... .map(lambda line:LabeledPoint(line[0], [float(x) for x in line[1:]])

UserWarning: X scores are null at iteration

I am trying to run CCA for a multi label/text classification problem but keep getting following warning and an error which I think are related
warnings.warn('Maximum number of iterations reached')
/Library/Python/2.7/site-packages/sklearn/cross_decomposition/pls_.py:290:
UserWarning: X scores are null at iteration 0 warnings.warn('X
scores are null at iteration %s' % k)
warnings.warn('Maximum number of iterations reached')
/Library/Python/2.7/site-packages/sklearn/cross_decomposition/pls_.py:290:
UserWarning: X scores are null at iteration 1
warnings.warn('X scores are null at iteration %s' % k)
...
for all the 400 iterations and then following error at the end which I think is a side effect of above warning:
Traceback (most recent call last): File "scikit_fb3.py", line 477,
in
getCCA(shorttestfilepathPreProcessed) File "scikit_fb3.py", line 318, in getCCA
X_CCA = cca.fit(x_array, Y_indicator).transform(X) File "/Library/Python/2.7/site-packages/sklearn/cross_decomposition/pls_.py",
line 368, in transform
Xc = (np.asarray(X) - self.x_mean_) / self.x_std_ File "/usr/local/bin/src/scipy/scipy/sparse/compressed.py", line 389, in
sub
raise NotImplementedError('adding a nonzero scalar to a ' NotImplementedError: adding a nonzero scalar to a sparse matrix is not
supported
What could possibly be wrong?

CCA doesn't support sparse matrices. By default, you should assume scikit-learn estimators do not grok sparse matrices and check their docstrings to find out if by chance you found one that does.
(I admit the warning could have been friendlier.)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Unicode error while converting RDD to Spark DataFrame - python

Related

Apache Beam – issue with Deduplicate function

ValueError: dictionary update sequence element #0 has length 1; 2 is required in Python

PySpark UDF - resulting DF fails to show "value error: "mycolumn" name is not in list"

error in labelled point object pyspark

UserWarning: X scores are null at iteration

Categories

Resources