I have a scrapy script that works locally, but when I deploy it to Scrapinghub, it's giving all errors. Upon debugging, the error is coming from Yielding the item.
This is the error I get.
ERROR [scrapy.utils.signal] Error caught on signal handler: <bound method ?.item_scraped of <sh_scrapy.extension.HubstorageExtension object at 0x7fd39e6141d0>> Less
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "/usr/local/lib/python2.7/site-packages/sh_scrapy/extension.py", line 45, in item_scraped
item = self.exporter.export_item(item)
File "/usr/local/lib/python2.7/site-packages/scrapy/exporters.py", line 304, in export_item
result = dict(self._get_serialized_fields(item))
File "/usr/local/lib/python2.7/site-packages/scrapy/exporters.py", line 75, in _get_serialized_fields
value = self.serialize_field(field, field_name, item[field_name])
File "/usr/local/lib/python2.7/site-packages/scrapy/exporters.py", line 284, in serialize_field
return serializer(value)
File "/usr/local/lib/python2.7/site-packages/scrapy/exporters.py", line 290, in _serialize_value
return dict(self._serialize_dict(value))
File "/usr/local/lib/python2.7/site-packages/scrapy/exporters.py", line 300, in _serialize_dict
key = to_bytes(key) if self.binary else key
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/python.py", line 117, in to_bytes
'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got int
It doesn't specify the field with issues, but by process of elimination, I came to realize it's this part of the code:
try:
item["media"] = {}
media_index = 0
media_content = response.xpath("//audio/source/#src").extract_first()
if media_content is not None:
item["media"][media_index] = {}
preview = item["media"][media_index]
preview["Media URL"] = media_content
preview["Media Type"] = "Audio"
media_index += 1
except IndexError:
print "Index error for media " + item["asset_url"]
I cleared some parts up to make it easier to tackle, but basically this part is the issue. Something it doesn't like about the item media.
I'm beginner in both Python and Scrapy. So sorry if this turns out to be silly basic Python mistake. Any idea?
EDIT: So after getting the answer from ThunderMind, the solution was to simply do str(media_index) for key
Yeah, right here:
item["media"][media_index] = {}
media_index is a mutable. and Keys can't be mutable.
Read Python dict, to know what should be used as keys.
Related
I was trying this tutorial given in the MistQL for a personal work but this below code is throwing exception as given below
import mistql
data="{\"foo\": \"bar\"}"
query = '#.foo'
results = mistql.query(query, data)
print(results)
Traceback (most recent call last): File "C:\Temp\sample.py", line 4,
in
purchaserEmails = mistql.query(query, data) File "C:\Python3.10.0\lib\site-packages\mistql\query.py", line 18, in query
result = execute_outer(ast, data) File "C:\Python3.10.0\lib\site-packages\mistql\execute.py", line 73, in
execute_outer
return execute(ast, build_initial_stack(data, builtins)) File "C:\Python3.10.0\lib\site-packages\typeguard_init_.py", line 1033,
in wrapper
retval = func(*args, **kwargs) File "C:\Python3.10.0\lib\site-packages\mistql\execute.py", line 60, in
execute
return execute_fncall(ast.fn, ast.args, stack) File "C:\Python3.10.0\lib\site-packages\typeguard_init_.py", line 1033,
in wrapper
retval = func(*args, **kwargs) File "C:\Python3.10.0\lib\site-packages\mistql\execute.py", line 28, in
execute_fncall
return function_definition(arguments, stack, execute) File "C:\Python3.10.0\lib\site-packages\mistql\builtins.py", line 37, in
wrapped
return fn(arguments, stack, exec) File "C:\Python3.10.0\lib\site-packages\mistql\builtins.py", line 168, in
dot
return _index_single(RuntimeValue.of(right.name), left) File "C:\Python3.10.0\lib\site-packages\mistql\builtins.py", line 295, in
_index_single
assert_type(index, RVT.Number) File "C:\Python3.10.0\lib\site-packages\mistql\runtime_value.py", line 344,
in assert_type
raise MistQLTypeError(f"Expected {expected_type}, got {value.type}") mistql.exceptions.MistQLTypeError: Expected
RuntimeValueType.Number, got RuntimeValueType.String
Strings passed as data into MistQL's query method correspond to MistQL string types, and thus the #.foo query is attempting to index the string "{\"foo\": \"bar\"}") using the "foo" value, leading to the error above.
MistQL is expecting native dictionaries and lists. The correspondence between Python data types and MistQL data types can be seen here
That being said, the error message is indeed very confusing and should be fixed post-haste! Issue here
I have made this code, to get the year from an mp3, and if i print it, it works, but when i write to text box in my webpage, it gives an error(traceback below), but not always, sometimes the error not show, so i suspect it is from the way the mp3 is tagged:
nfo_year = ''
audio_filename = 'myfile.mp3'
f = mutagen.File(audio_filename)
audio = ID3(audio_filename) # path: path to file
# Year
try:
nfo_year = audio['TDRC'].text[0]
print(nfo_year)
except:
pass
time.sleep(2)
logger_internet.info('Writing Year...')
AB_author = driver.find_element_by_name('year')
AB_author.send_keys(nfo_year)
Traceback (most recent call last):
File "D:\AB\redacted.py", line 1252, in <module>
AB_author.send_keys(nfo_year)
File "D:\AB\venv\lib\site-packages\selenium\webdriver\remote\webelement.py", line 478, in send_keys
{'text': "".join(keys_to_typing(value)),
File "D:\AB\venv\lib\site-packages\selenium\webdriver\common\utils.py", line 150, in keys_to_typing
for i in range(len(val)):
TypeError: object of type 'ID3TimeStamp' has no len()
my question is: an error on the mp3 tag or am i doing something wrong?
nfo_year is a timestamp object, of type ID3TimeStamp. You have to pass strings to AB_author.send_keys. Since print worked, you can try str(nfo_year).
I set up a try catch in my code, but it appears that my exception was not correct because it did not seem to catch it.
I am using an exception from a module, and perhaps I didn't import it correctly? Here is my code:
import logging
import fhirclient.models.bundle as b
from fhirclient.server import FHIRUnauthorizedException
logging.disable(logging.WARNING)
def get_all_resources(resource, struct, smart):
'''Perform a search on a resource type and get all resources entries from all retunred bundles.\n
This function takes all paginated bundles into consideration.'''
if smart.ready == False:
smart.reauthorize
search = resource.where(struct)
bundle = search.perform(smart.server)
resources = [entry.resource for entry in bundle.entry or []]
next_url = _get_next_url(bundle.link)
while next_url != None:
try:
json_dict = smart.server.request_json(next_url)
except FHIRUnauthorizedException:
smart.reauthorize
continue
bundle = b.Bundle(json_dict)
resources += [entry.resource for entry in bundle.entry or []]
next_url = _get_next_url(bundle.link)
return resources
Now when i ran the code I got the following error:
Traceback (most recent call last):
File "code.py", line 79, in <module>
main()
File "code.py", line 42, in main
reports = get_all_resources(dr.DiagnosticReport, search, smart)
File "somepath/fhir_tools/resource.py", line 23, in get_all_resources
json_dict = smart.server.request_json(next_url)
File "/usr/local/lib/python3.6/dist-packages/fhirclient/server.py", line 153, in request_json
res = self._get(path, headers, nosign)
File "/usr/local/lib/python3.6/dist-packages/fhirclient/server.py", line 181, in _get
self.raise_for_status(res)
File "/usr/local/lib/python3.6/dist-packages/fhirclient/server.py", line 256, in raise_for_status
raise FHIRUnauthorizedException(response)
server.FHIRUnauthorizedException: <Response [401]>
Shouldn't my exception catch this?
I have been trying to tweek the mapper_pre_filter example given here. Now, if instead of specifying the command directly in steps, if I'm writing a method to return that command, like this:
from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol
class KittiesJob(MRJob):
OUTPUT_PROTOCOL = JSONValueProtocol
def filter_input(self):
return ''' grep 'kitty' '''
def test_for_kitty(self, _, value):
yield None, 0 # make sure we have some output
if 'kitty' in value:
yield None, 1
def sum_missing_kitties(self, _, values):
yield None, sum(values)
def steps(self):
return [
self.mr(mapper_pre_filter=self.filter_input,
mapper=self.test_for_kitty,
reducer=self.sum_missing_kitties)]
if __name__ == '__main__':
KittiesJob().run()
I'm getting the following exception:
Exception: error getting step information:
Traceback (most recent call last):
File "/Users/sverma/work/mrjob/filter_input.py", line 30, in <module>
KittiesJob().run()
File "/Library/Python/2.7/site-packages/mrjob/job.py", line 494, in run
mr_job.execute()
File "/Library/Python/2.7/site-packages/mrjob/job.py", line 500, in execute
self.show_steps()
File "/Library/Python/2.7/site-packages/mrjob/job.py", line 677, in show_steps
print >> self.stdout, json.dumps(self._steps_desc())
File "/Library/Python/2.7/site-packages/simplejson/__init__.py", line 370, in dumps
return _default_encoder.encode(obj)
File "/Library/Python/2.7/site-packages/simplejson/encoder.py", line 269, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/Library/Python/2.7/site-packages/simplejson/encoder.py", line 348, in iterencode
return _iterencode(o, 0)
File "/Library/Python/2.7/site-packages/simplejson/encoder.py", line 246, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <bound method KittiesJob.filter_input of <__main__.KittiesJob object at 0x10449ac90>> is not JSON serializable
Can someone please explain what I'm doing wrong ?
Wow, that's a late answere. I think you want to change this:
mapper_pre_filter=self.filter_input, to
mapper_pre_filter=self.filter_input(),.
From the example mapper_pre_filter is expected to be a string, not a function. Maybe it'll help somebody in the future.
The stack trace says that the output of the filter is not JSON serializable, because it's probably empty.
I have a command as follows:
class AddChatMessages(Command):
arguments = [
('messages', AmpList([('message', Unicode()), ('type', Integer())]))]
And I have a responder for it in a controller:
def add_chat_messages(self, messages):
for i, m in enumerate(messages):
messages[i] = (m['message'], m['type'])
self.main.add_chat_messages(messages)
return {}
commands.AddChatMessages.responder(add_chat_messages)
I am writing a unit test for it. This is my code:
class AddChatMessagesTest(ProtocolTestMixin, unittest.TestCase):
command = commands.AddChatMessages
data = {'messages': [{'message': 'hi', 'type': 'None'}]}
def assert_callback(self, unused):
pass
Where ProtocolMixin is as follows:
class ProtocolTestMixin(object):
def setUp(self):
self.protocol = client.CommandProtocol()
def assert_callback(self, unused):
raise NotImplementedError("Has to be implemented!")
def test_responder(self):
responder = self.protocol.lookupFunction(
self.command.commandName)
d = responder(self.data)
d.addCallback(self.assert_callback)
return d
It works if AmpList is not involved, but when it is - I get following error:
======================================================================
ERROR: test_responder
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/internet/defer.py", line 139, in maybeDeferred
result = f(*args, **kw)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/internet/utils.py", line 203, in runWithWarningsSuppressed
reraise(exc_info[1], exc_info[2])
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/internet/utils.py", line 199, in runWithWarningsSuppressed
result = f(*a, **kw)
File "/Users/<username>/Projects/space/tests/client_test.py", line 32, in test_responder
d = responder(self.data)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/protocols/amp.py", line 1016, in doit
kw = command.parseArguments(box, self)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/protocols/amp.py", line 1717, in parseArguments
return _stringsToObjects(box, cls.arguments, protocol)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/protocols/amp.py", line 2510, in _stringsToObjects
argparser.fromBox(argname, myStrings, objects, proto)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/protocols/amp.py", line 1209, in fromBox
objects[nk] = self.fromStringProto(st, proto)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/protocols/amp.py", line 1465, in fromStringProto
boxes = parseString(inString)
File "/Users/<username>/Projects/space/env/lib/python2.7/site-packages/twisted/protocols/amp.py", line 2485, in parseString
return cls.parse(StringIO(data))
TypeError: must be string or buffer, not list
Which makes sense, but how do I serialize a list in AddChatMessagesTest.data?
The responder expects to be called with a serialized box. It will then deserialize it, dispatch the objects to application code, take the object the application code returns, serialize it, and then return that serialized form.
For a few AMP types. most notably String, the serialized form is the same as the deserialized form, so it's easy to overlook this.
I think that you'll want to pass your data through Command.makeArguments in order to produce an object suitable to pass to a responder.
For example:
>>> from twisted.protocols.amp import Command, Integer
>>> class Foo(Command):
... arguments = [("bar", Integer())]
...
>>> Foo.makeArguments({"bar": 17}, None)
AmpBox({'bar': '17'})
>>>
If you do this with a Command that uses AmpList I think you'll find makeArguments returns an encoded string for the value of that argument and that the responder is happy to accept and parse that kind of string.