mapper_pre_filter in MRJob

I have been trying to tweak the mapper_pre_filter example given here. Instead of specifying the command directly in steps, I wrote a method that returns the command, like this:
from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol

class KittiesJob(MRJob):
    OUTPUT_PROTOCOL = JSONValueProtocol

    def filter_input(self):
        return ''' grep 'kitty' '''

    def test_for_kitty(self, _, value):
        yield None, 0  # make sure we have some output
        if 'kitty' in value:
            yield None, 1

    def sum_missing_kitties(self, _, values):
        yield None, sum(values)

    def steps(self):
        return [
            self.mr(mapper_pre_filter=self.filter_input,
                    mapper=self.test_for_kitty,
                    reducer=self.sum_missing_kitties)]

if __name__ == '__main__':
    KittiesJob().run()
I'm getting the following exception:
Exception: error getting step information:
Traceback (most recent call last):
File "/Users/sverma/work/mrjob/filter_input.py", line 30, in <module>
KittiesJob().run()
File "/Library/Python/2.7/site-packages/mrjob/job.py", line 494, in run
mr_job.execute()
File "/Library/Python/2.7/site-packages/mrjob/job.py", line 500, in execute
self.show_steps()
File "/Library/Python/2.7/site-packages/mrjob/job.py", line 677, in show_steps
print >> self.stdout, json.dumps(self._steps_desc())
File "/Library/Python/2.7/site-packages/simplejson/__init__.py", line 370, in dumps
return _default_encoder.encode(obj)
File "/Library/Python/2.7/site-packages/simplejson/encoder.py", line 269, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/Library/Python/2.7/site-packages/simplejson/encoder.py", line 348, in iterencode
return _iterencode(o, 0)
File "/Library/Python/2.7/site-packages/simplejson/encoder.py", line 246, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <bound method KittiesJob.filter_input of <__main__.KittiesJob object at 0x10449ac90>> is not JSON serializable
Can someone please explain what I'm doing wrong?

This is a late answer, but I think you want to change this:
mapper_pre_filter=self.filter_input, to
mapper_pre_filter=self.filter_input(),.
In the example, mapper_pre_filter is expected to be a string (the shell command), not a function. Maybe this will help somebody in the future.
The stack trace shows that the bound method KittiesJob.filter_input itself was passed to json.dumps. A method object is not JSON serializable, whereas the string it returns is.
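The difference can be reproduced with the standard library alone. The sketch below uses a hypothetical stand-in class (FakeJob, which is not part of mrjob) to show that json.dumps rejects a bound method but accepts the string the method returns:

```python
import json


class FakeJob(object):
    """Hypothetical stand-in for KittiesJob; it does not use mrjob at all."""

    def filter_input(self):
        return "grep 'kitty'"


job = FakeJob()

# Passing the bound method itself: json cannot serialize a method object.
try:
    json.dumps({"mapper_pre_filter": job.filter_input})
    method_is_serializable = True
except TypeError:
    method_is_serializable = False

# Calling the method first: the result is a plain string, which json accepts.
step_description = json.dumps({"mapper_pre_filter": job.filter_input()})

print(method_is_serializable)
print(step_description)
```

This only illustrates the serialization failure seen in the traceback; it does not run a MapReduce step.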

Related

My first hello world program in MistQL is throwing exception Expected RuntimeValueType.Number, got RuntimeValueType.String

I was following the tutorial in the MistQL documentation for a personal project, but the code below throws the exception shown after it:
import mistql
data="{\"foo\": \"bar\"}"
query = '#.foo'
results = mistql.query(query, data)
print(results)
Traceback (most recent call last):
  File "C:\Temp\sample.py", line 4, in <module>
    purchaserEmails = mistql.query(query, data)
  File "C:\Python3.10.0\lib\site-packages\mistql\query.py", line 18, in query
    result = execute_outer(ast, data)
  File "C:\Python3.10.0\lib\site-packages\mistql\execute.py", line 73, in execute_outer
    return execute(ast, build_initial_stack(data, builtins))
  File "C:\Python3.10.0\lib\site-packages\typeguard\__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "C:\Python3.10.0\lib\site-packages\mistql\execute.py", line 60, in execute
    return execute_fncall(ast.fn, ast.args, stack)
  File "C:\Python3.10.0\lib\site-packages\typeguard\__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "C:\Python3.10.0\lib\site-packages\mistql\execute.py", line 28, in execute_fncall
    return function_definition(arguments, stack, execute)
  File "C:\Python3.10.0\lib\site-packages\mistql\builtins.py", line 37, in wrapped
    return fn(arguments, stack, exec)
  File "C:\Python3.10.0\lib\site-packages\mistql\builtins.py", line 168, in dot
    return _index_single(RuntimeValue.of(right.name), left)
  File "C:\Python3.10.0\lib\site-packages\mistql\builtins.py", line 295, in _index_single
    assert_type(index, RVT.Number)
  File "C:\Python3.10.0\lib\site-packages\mistql\runtime_value.py", line 344, in assert_type
    raise MistQLTypeError(f"Expected {expected_type}, got {value.type}")
mistql.exceptions.MistQLTypeError: Expected RuntimeValueType.Number, got RuntimeValueType.String

Strings passed as data into MistQL's query method correspond to MistQL string types, so the #.foo query is attempting to index the string "{\"foo\": \"bar\"}" using the "foo" value, which leads to the error above.
MistQL expects native dictionaries and lists. The correspondence between Python data types and MistQL data types can be seen here
That being said, the error message is indeed very confusing and should be fixed post-haste! Issue here
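A minimal sketch of the fix, assuming the input arrives as a JSON string as in the question: parse it with the standard json module first, then hand the resulting dictionary to mistql.query. The MistQL call is left as a comment so the snippet runs without the package installed.

```python
import json

data = "{\"foo\": \"bar\"}"

# As written in the question, data is a Python str, so MistQL sees a string.
print(type(data).__name__)

# Parse the JSON text into native Python objects before querying.
parsed = json.loads(data)
print(type(parsed).__name__, parsed)

# With mistql installed, the parsed dictionary is what should be passed in:
#     results = mistql.query(query, parsed)
```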

PicklingError when getting the result from ray

I'm working on slowly converting my very serialized text analysis engine to use Modin and Ray. Feels like I'm nearly there, however, I seem to have hit a stumbling block. My code looks like this:
vectorizer = TfidfVectorizer(
    analyzer=ngrams, encoding="ascii", stop_words="english", strip_accents="ascii"
)
tf_idf_matrix = vectorizer.fit_transform(r_strings["name"])
r_vectorizer = ray.put(vectorizer)
r_tf_idf_matrix = ray.put(tf_idf_matrix)
n = 2
match_results = []
for fn in files["c.file"]:
    match_results.append(
        match_name.remote(fn, r_vectorizer, r_tf_idf_matrix, r_strings, n)
    )
match_returns = ray.get(match_results)
I'm following the guidance in the "anti-patterns" section of the Ray documentation on what to avoid, and my code is very similar to the "better" pattern shown there. However, it fails with:
Traceback (most recent call last):
File "alt.py", line 213, in <module>
match_returns = ray.get(match_results)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
return func(*args, **kwargs)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/worker.py", line 1501, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(PicklingError): ray::match_name() (pid=23393, ip=192.168.1.173)
File "python/ray/_raylet.pyx", line 564, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 565, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1652, in ray._raylet.CoreWorker.store_task_outputs
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 327, in serialize
return self._serialize_to_msgpack(value)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 307, in _serialize_to_msgpack
self._serialize_to_pickle5(metadata, python_objects)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 267, in _serialize_to_pickle5
raise e
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 264, in _serialize_to_pickle5
value, protocol=5, buffer_callback=writer.buffer_callback)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 580, in dump
return Pickler.dump(self, obj)
_pickle.PicklingError: args[0] from __newobj__ args has the wrong class
Definitely an unexpected result. I'm not sure where to go next with this and would appreciate help from folks who have more experience with Ray and Modin.

Mutagen: TypeError: a bytes-like object is required, not 'str'

What should I do? I'm getting this error when I try to add some tags to a FLAC file.
I searched but didn't find anything. Please help me.
Traceback (most recent call last):
File "indir.py", line 50, in <module>
audio.save()
File "/usr/local/lib/python3.6/dist-packages/mutagen/_util.py", line 169, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/mutagen/_util.py", line 140, in wrapper
return func(self, h, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/mutagen/flac.py", line 847, in save
self._save(filething, self.metadata_blocks, deleteid3, padding)
File "/usr/local/lib/python3.6/dist-packages/mutagen/flac.py", line 864, in _save
metadata_blocks, available, content_size, padding)
File "/usr/local/lib/python3.6/dist-packages/mutagen/flac.py", line 154, in _writeblocks
data += cls._writeblock(block)
File "/usr/local/lib/python3.6/dist-packages/mutagen/flac.py", line 126, in _writeblock
datum = block.write()
File "/usr/local/lib/python3.6/dist-packages/mutagen/flac.py", line 620, in write
f.write(self.data)
TypeError: a bytes-like object is required, not 'str'
My Code:
audio = FLAC("music.flac")
audio['artist'] = sarki.artist.name
audio['title'] = sarki.name
pic = Picture()
pic.type = id3.PictureType.COVER_FRONT
pic.width = 640
pic.height = 640
pic.mime = 'image/jpeg'
pic.data = "music.jpg"
audio.add_picture(pic)
audio.save()

I believe the error is here:
pic.data = "music.jpg"
You are attempting to set the image data of the picture to be a string. I'm guessing you wanted to set the image data to be the contents of the file music.jpg instead. If so, try replacing this line with the following two:
with open("music.jpg", "rb") as f:
    pic.data = f.read()
This follows an example in the Mutagen API reference.
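The error message itself can be reproduced without Mutagen. The sketch below assumes that the f in the traceback's f.write(self.data) is an in-memory bytes buffer, modelled here with io.BytesIO. The image bytes are a made-up placeholder, not a real JPEG.

```python
import io

buffer = io.BytesIO()

# Writing a str (such as a file name) to a bytes buffer fails.
try:
    buffer.write("music.jpg")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Writing bytes (such as the contents read with open(..., "rb")) succeeds.
placeholder_image_bytes = b"\xff\xd8\xff\xe0 placeholder, not a real JPEG"
written = buffer.write(placeholder_image_bytes)

print(error_message)
print(written == len(placeholder_image_bytes))
```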

Pyspark UDF unable to use large dictionary

I have a dictionary whose keys are words and whose values are arrays of 300 floats.
I'm unable to use this dictionary in my PySpark UDF.
When the dictionary has 2 million keys it does not work, but when I reduce it to 200K keys it works.
This is the function I want to convert to a UDF:
def get_sentence_vector(sentence, dictionary_containing_word_vectors):
    cleanedSentence = list(clean_text(sentence))
    words_vector_list = np.zeros(300)  # 300 dimensional vector
    for x in cleanedSentence:
        try:
            words_vector_list = np.add(words_vector_list, dictionary_containing_word_vectors[str(x)])
        except Exception as e:
            print("Exception caught while finding word vector from Fast text pretrained model Dictionary: ", e)
    return words_vector_list.tolist()
This is my UDF
get_sentence_vector_udf = F.udf(lambda val: get_sentence_vector(val, fast_text_dictionary), ArrayType(FloatType()))
This is how I'm calling the UDF to add a column to my dataframe
dmp_df_with_vectors = df.filter(df.item_name.isNotNull()).withColumn("sentence_vector", get_sentence_vector_udf(df.item_name))
And this is the stack trace for the error
Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/broadcast.py", line 83, in dump
pickle.dump(value, f, 2)
SystemError: error return without exception set
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1957, in wrapper
return udf_obj(*args)
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1916, in __call__
judf = self._judf
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1900, in _judf
self._judf_placeholder = self._create_judf()
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1909, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1866, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/usr/lib/spark/python/pyspark/rdd.py", line 2377, in _prepare_for_python_RDD
broadcast = sc.broadcast(pickled_command)
File "/usr/lib/spark/python/pyspark/context.py", line 799, in broadcast
return Broadcast(self, value, self._pickled_broadcast_vars)
File "/usr/lib/spark/python/pyspark/broadcast.py", line 74, in __init__
self._path = self.dump(value, f)
File "/usr/lib/spark/python/pyspark/broadcast.py", line 90, in dump
raise pickle.PicklingError(msg)
cPickle.PicklingError: Could not serialize broadcast: SystemError: error return without exception set

How big is your fast_text_dictionary in the 2M case? It might be too big to be pickled as part of the UDF.
Try broadcasting it first before running the UDF, e.g.
broadcastVar = sc.broadcast(fast_text_dictionary)
Then use broadcastVar.value instead of fast_text_dictionary inside your UDF.
See the documentation for broadcast variables
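A rough sketch of how the broadcast suggestion could be wired into the UDF. The names sentence_vector and make_sentence_vector_udf are hypothetical, sentence.split() stands in for the question's clean_text, and plain lists replace NumPy to keep the helper dependency-free. Whether this resolves the pickling error for a 2M-key dictionary depends on its actual size, so treat it as something to try rather than a confirmed fix.

```python
def sentence_vector(words, lookup, dim=300):
    """Sum the vectors of the words found in lookup; unknown words are skipped."""
    total = [0.0] * dim
    for word in words:
        vector = lookup.get(word)
        if vector is not None:
            total = [a + b for a, b in zip(total, vector)]
    return total


def make_sentence_vector_udf(sc, dictionary, dim=300):
    """Broadcast dictionary once and return a UDF that reads it on the executors."""
    # Imported lazily so the helper above can be used without PySpark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, FloatType

    broadcast_var = sc.broadcast(dictionary)
    # The lambda captures only the broadcast handle, not the dictionary itself;
    # .value returns the dictionary on the executor.
    return F.udf(
        lambda sentence: sentence_vector(sentence.split(), broadcast_var.value, dim),
        ArrayType(FloatType()),
    )


# Small local check of the pure-Python helper (no Spark needed).
print(sentence_vector(["a", "b", "zz"], {"a": [1.0, 2.0], "b": [0.5, 0.5]}, dim=2))
```

With a live SparkContext, the returned UDF would be used in withColumn in place of get_sentence_vector_udf.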

Serializing twisted.protocols.amp.AmpList for testing

I have a command as follows:
class AddChatMessages(Command):
    arguments = [
        ('messages', AmpList([('message', Unicode()), ('type', Integer())]))]
And I have a responder for it in a controller:
def add_chat_messages(self, messages):
    for i, m in enumerate(messages):
        messages[i] = (m['message'], m['type'])
    self.main.add_chat_messages(messages)
    return {}
commands.AddChatMessages.responder(add_chat_messages)
I am writing a unit test for it. This is my code:
class AddChatMessagesTest(ProtocolTestMixin, unittest.TestCase):
    command = commands.AddChatMessages
    data = {'messages': [{'message': 'hi', 'type': 'None'}]}

    def assert_callback(self, unused):
        pass
Where ProtocolTestMixin is as follows:
class ProtocolTestMixin(object):
    def setUp(self):
        self.protocol = client.CommandProtocol()

    def assert_callback(self, unused):
        raise NotImplementedError("Has to be implemented!")

    def test_responder(self):
        responder = self.protocol.lookupFunction(
            self.command.commandName)
        d = responder(self.data)
        d.addCallback(self.assert_callback)
        return d
It works if AmpList is not involved, but when it is, I get the following error:
======================================================================
ERROR: test_responder
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/internet/defer.py", line 139, in maybeDeferred
result = f(*args, **kw)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/internet/utils.py", line 203, in runWithWarningsSuppressed
reraise(exc_info[1], exc_info[2])
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/internet/utils.py", line 199, in runWithWarningsSuppressed
result = f(*a, **kw)
File "/Users/<username>/Projects/space/tests/client_test.py", line 32, in test_responder
d = responder(self.data)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/protocols/amp.py", line 1016, in doit
kw = command.parseArguments(box, self)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/protocols/amp.py", line 1717, in parseArguments
return _stringsToObjects(box, cls.arguments, protocol)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/protocols/amp.py", line 2510, in _stringsToObjects
argparser.fromBox(argname, myStrings, objects, proto)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/protocols/amp.py", line 1209, in fromBox
objects[nk] = self.fromStringProto(st, proto)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/protocols/amp.py", line 1465, in fromStringProto
boxes = parseString(inString)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/protocols/amp.py", line 2485, in parseString
return cls.parse(StringIO(data))
TypeError: must be string or buffer, not list
Which makes sense, but how do I serialize a list in AddChatMessagesTest.data?

The responder expects to be called with a serialized box. It will then deserialize it, dispatch the objects to application code, take the object the application code returns, serialize it, and then return that serialized form.
For a few AMP types, most notably String, the serialized form is the same as the deserialized form, so it's easy to overlook this.
I think that you'll want to pass your data through Command.makeArguments in order to produce an object suitable to pass to a responder.
For example:
>>> from twisted.protocols.amp import Command, Integer
>>> class Foo(Command):
...     arguments = [("bar", Integer())]
...
>>> Foo.makeArguments({"bar": 17}, None)
AmpBox({'bar': '17'})
>>>
If you do this with a Command that uses AmpList I think you'll find makeArguments returns an encoded string for the value of that argument and that the responder is happy to accept and parse that kind of string.
