Scrapy spider converts float / int to string - python

I always receive a string in my result, even in the exported JSON.
I'm using a nested translate() in the XPath to strip out everything except the number. The decimal_serializer was only there for testing purposes; I called print(value) inside it and it printed a valid float. In my result, however, the value is always a unicode string. By contrast, add_value('offerCountNew', 1.3) produces a valid float in my result.
I also tried removing every processor and serializer. Any ideas on what I am doing wrong?
Item
offerCountNew = scrapy.Field(output_processor = TakeFirst(), serializer = decimal_serializer)
Spider
l.add_xpath('offerCountNew', 'number(translate(//*[@id="olp_feature_div"]//a[contains(@href, "new")], translate(//*[@id="olp_feature_div"]//a[contains(@href, "new")], "0123456789", ""), ""))')
Result
'offerCountNew': u'1.0',
JSON
"offerCountNew": "1.0",

def process_float_or_int(value):
    try:
        # eval() turns "1.0" into the float 1.0 and "3" into the int 3;
        # note that it will also execute arbitrary expressions.
        return eval(value)
    except Exception:
        return value

offerCountNew = scrapy.Field(input_processor=MapCompose(process_float_or_int), output_processor=TakeFirst())
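Scrapy selectors return extracted values as strings regardless of the XPath return type, which is why the number(...) result arrives as u'1.0'; the conversion has to happen in a processor. If you only expect numbers, a safer variant of the same idea (a sketch, not the code above) is to cast with float() instead of eval(), since eval() will happily execute any expression it is handed:

from scrapy.loader.processors import MapCompose, TakeFirst  # itemloaders.processors in newer Scrapy
import scrapy

def to_number(value):
    # float() handles "1.0"; fall back to the raw string if it isn't numeric.
    try:
        return float(value)
    except (TypeError, ValueError):
        return value

offerCountNew = scrapy.Field(input_processor=MapCompose(to_number),
                             output_processor=TakeFirst())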

Related

Django Rest framework - I am trying to convert a property received in Response object to JSON object and iterate through it. But response is string

In views.py, VENDOR_MAPPER is a list of dictionaries; each dictionary has id, name, placeholder, and autocommit keys. I also tried sending plain JSON instead of a Response object.
resp_object = {}
resp_object['supported_vendors'] = VENDOR_MAPPER
resp_object['vendor_name'] = ""
resp_object['create_vo_entry'] = False
resp_object['generate_signature_flag'] = False
resp_object['branch_flag'] = False
resp_object['trunk_flag'] = False
resp_object['branch_name'] = ""
resp_object['advisory'] = ""
data = {'data': resp_object}
return Response(data)
On home.html I access supported_vendors, which is a list, and iterate through it; however, the type of the variable is string instead of object.
var supported_vendors = "{{data.supported_vendors|safe}}";
console.log(supported_vendors);
console.log("Supported_vendors ", supported_vendors);
console.log("Supported_vendors_type:", typeof(supported_vendors));
data.supported_vendors|safe (Django template tagging) is used to remove the unwanted escaping in the response. I also tried it without safe, but the type was still string.
I also tried converting as well as parsing the response, but the type is still shown as string:
var supported_vendors = "{{data.supported_vendors}}";
console.log(JSON.parse(supported_vendors));
console.log(JSON.stringify(supported_vendors));
Output generated: I printed the response type and the values I get. Converting with JSON.parse and JSON.stringify did not work either; the output was a string every time.
[Screenshot of the console output]: https://i.stack.imgur.com/DuSMb.png
I want to convert the property into a JavaScript object and perform some computations on it.
You can try this instead:
return HttpResponse(json.dumps(data),
                    content_type="application/json")
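Django also ships a JsonResponse shortcut that sets the content type for you; a minimal sketch of the same return, assuming the same data dict:

from django.http import JsonResponse

# JsonResponse runs json.dumps internally and sets
# Content-Type: application/json on the response.
return JsonResponse(data)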
I got the answer: I converted
var supported_vendors = "{{data.supported_vendors}}";
to
var supported_vendors = {{data.supported_vendors}};
i.e. I removed the quotes around the template variable, so it is emitted as a JavaScript array literal rather than a string.
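Emitting the value unquoted works here because Python's repr of the list happens to be readable by JavaScript, but it can break on quotes or non-ASCII data. A more robust pattern (a sketch; variable names are illustrative, not from the question) is to serialize explicitly in the view and parse in the page:

import json

# In the view: pre-serialize the list so the template receives a real JSON string.
data = {'supported_vendors_json': json.dumps(VENDOR_MAPPER)}
return Response({'data': data})

# In the template (shown as a comment since this file is Python):
#   var supported_vendors = JSON.parse("{{ data.supported_vendors_json|escapejs }}");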

Verifying SendGrid's Signed Event Webhook in Django

I am trying to verify signed events from the SendGrid webhook:
https://docs.sendgrid.com/for-developers/tracking-events/getting-started-event-webhook-security-features
from sendgrid.helpers.eventwebhook import EventWebhook, EventWebhookHeader

def is_valid_signature(request):
    # event_webhook_signature = request.META['HTTP_X_TWILIO_EMAIL_EVENT_WEBHOOK_SIGNATURE']
    # event_webhook_timestamp = request.META['HTTP_X_TWILIO_EMAIL_EVENT_WEBHOOK_TIMESTAMP']
    event_webhook = EventWebhook()
    key = settings.SENDGRID_HEADER
    ec_public_key = event_webhook.convert_public_key_to_ecdsa(key)
    text = json.dumps(str(request.body))
    return event_webhook.verify_signature(
        text,
        request.headers[EventWebhookHeader.SIGNATURE],
        request.headers[EventWebhookHeader.TIMESTAMP],
        ec_public_key
    )
When I send the test example from SendGrid, it always returns False. I compared the keys and everything is correct, so I think the problem is the syntax of the payload:
"b[{\"email\":\"example@test.com\",\"timestamp\":1648560198,\"smtp-id\":\"\\\\u003c14c5d75ce93.dfd.64b469@ismtpd-555\\\\u003e\",\"event\":\"processed\",\"category\":[\"cat facts\"],\"sg_event_id\":\"G6NRn4zC5sGxoV2Hoz7gpw==\",\"sg_message_id\":\"14c5d75ce93.dfd.64b469.filter0001.16648.5515E0B88.0\"},{other tests},\\r\\n]\\r\\n"
I think the issue is that you are calling:
text = json.dumps(str(request.body))
json.dumps serializes an object to a JSON formatted string, but str(request.body) is already a string.
Try just
text = str(request.body)
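Note that on Python 3, str() on a bytes object produces its repr rather than the decoded text, which is exactly why the payload in the question starts with "b[{". A quick illustration (not from the thread):

body = b'[{"email":"example@test.com"}]'
print(str(body))             # b'[{"email":"example@test.com"}]'  <- repr, note the b prefix
print(body.decode('utf-8'))  # [{"email":"example@test.com"}]     <- the actual payload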
I found the solution, my function is now like this:
def is_valid_signature(request):
    # event_webhook_signature = request.META['HTTP_X_TWILIO_EMAIL_EVENT_WEBHOOK_SIGNATURE']
    # event_webhook_timestamp = request.META['HTTP_X_TWILIO_EMAIL_EVENT_WEBHOOK_TIMESTAMP']
    event_webhook = EventWebhook()
    key = settings.SENDGRID_HEADER
    ec_public_key = event_webhook.convert_public_key_to_ecdsa(key)
    return event_webhook.verify_signature(
        request.body.decode('latin-1'),
        request.headers[EventWebhookHeader.SIGNATURE],
        request.headers[EventWebhookHeader.TIMESTAMP],
        ec_public_key
    )
I had to decode the raw body instead of wrapping it with str()/json.dumps(). I used latin-1 even though our content is UTF-8; latin-1 maps every byte to a code point one-to-one, so the decode never fails and the verified text corresponds exactly to the bytes SendGrid signed.
Thanks
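A quick illustration of that byte-transparency (not from the thread): decoding bytes as latin-1 and re-encoding reproduces them exactly, even when the underlying content is UTF-8.

raw = 'h\u00e9llo'.encode('utf-8')                     # b'h\xc3\xa9llo'
print(raw.decode('latin-1').encode('latin-1') == raw)  # True: latin-1 is 1:1 on bytes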
(Does not fail on missing headers, decodes as UTF-8, and converts header values to strings.)
def flask_verifySendgridSignedWebhook(myrequest, expectedKey):
    try:
        if myrequest.is_json:
            sg_verify = EventWebhook()
            msgbody = ""
            if myrequest.data:
                msgbody = myrequest.get_data().decode('utf-8')
            # Return the verification result directly, so a failed check
            # yields False instead of falling through and returning None.
            return sg_verify.verify_signature(
                msgbody,
                str(myrequest.headers.get(EventWebhookHeader.SIGNATURE)),
                str(myrequest.headers.get(EventWebhookHeader.TIMESTAMP)),
                sg_verify.convert_public_key_to_ecdsa(expectedKey))
        else:
            # No JSON sent
            return False
    except Exception:
        return False
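A minimal usage sketch (the route, app, and key variable are illustrative, not from the answer):

from flask import Flask, request

app = Flask(__name__)
SENDGRID_PUBLIC_KEY = '...'  # verification key from the SendGrid webhook settings

@app.route('/sendgrid/events', methods=['POST'])
def sendgrid_events():
    # Reject anything that does not carry a valid SendGrid signature.
    if not flask_verifySendgridSignedWebhook(request, SENDGRID_PUBLIC_KEY):
        return 'invalid signature', 403
    return '', 204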

Logical evaluation error when looping through dictionary of conditions

I'm looping through a list of web pages with Scrapy. Some of the pages I scrape are in error. I want to keep track of the various error types, so I have set up my function to first check whether any of a series of error conditions (which I have placed in a dictionary) is true, and if none are, proceed with normal page scraping:
def parse_detail_page(self, response):
    error_value = False
    output = ""
    error_cases = {
        "' pageis not found' in response.body": 'invalid',
        "'has been transferred' in response.body": 'transferred',
    }
    for key, value in error_cases.iteritems():
        if bool(key):
            error_value = True
            output = value
    if error_value:
        for field in J1_Item.fields:
            if field == 'case':
                item[field] = id
            else:
                item[field] = output
    else:
        item['case'] = id
    ........................
However, even in cases where none of the error conditions hold, the 'invalid' option is being selected. What am I doing wrong?
Your conditions (something in response.body) are not evaluated. Instead, you evaluate the truth value of a nonempty string, which is True.
This might work:
def parse_detail_page(self, response):
    error_value = False
    output = ""
    error_cases = {
        "pageis not found": 'invalid',
        "has been transferred": 'transferred',
    }
    for key, value in error_cases.iteritems():
        if key in response.body:
            error_value = True
            output = value
            break
    .................
(Must it be "pageis not found" or "page is not found"?)
bool(key) will convert key from a string to a bool.
What it won't do is actually evaluate the condition. You could use eval() for that, but I'd recommend instead storing a list of functions (each returning an object or throwing an exception) rather than your current dict-with-string-keys-that-are-actually-Python-code.
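A sketch of that list-of-functions idea (the predicates and names here are illustrative):

# Each entry pairs a predicate with the label to report on a match.
error_checks = [
    (lambda body: ' pageis not found' in body, 'invalid'),
    (lambda body: 'has been transferred' in body, 'transferred'),
]

def classify(body):
    for predicate, label in error_checks:
        if predicate(body):
            return label
    return None  # no error case matched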
I'm not sure why you are evaluating bool(key) like that. Let's look at your error_cases: you have two keys and two values. "' pageis not found' in response.body" will be your key the first time through the for loop, and "'has been transferred' in response.body" will be the key in the second round. Neither of those will be False when you check bool(key), because each key is a non-empty string.
>>> a = "' pageis not found' in response.body"
>>> bool(a)
True
You need a different evaluation than bool(key) there, or you will always register an error.
Your conditions are strings, so they are not evaluated.
You could evaluate the strings with the eval(key) function, but that is quite unsafe.
With the help of the operator module, there is no need to evaluate unsafe strings (as long as your conditions stay fairly simple).
error['operator'] holds a reference to the contains function, which can be used as a replacement for the in operator.
from operator import contains

class ...:
    def parse_detail_page(self, response):
        error_value = False
        output = ""
        error_cases = [
            {'search': ' pageis not found', 'operator': contains, 'output': 'invalid'},
            {'search': 'has been transferred', 'operator': contains, 'output': 'transferred'},
        ]
        for error in error_cases:
            if error['operator'](error['search'], response.body):
                error_value = True
                output = error['output']
                print output
        if error_value:
            for field in J1_Item.fields:
                if field == 'case':
                    item[field] = id
                else:
                    item[field] = output
        else:
            item['case'] = id
        ...

How to check many variables in Python if not null?

I'm writing a Python scraper for OpenData and I have one question: how do I check whether the values are filled in on the site, and if one is missing, set its value to null?
My scraper is here.
Currently I'm working on optimizing it.
My variables now look like:
evcisloval = soup.find_all('td')[3].text.strip()
prinalezival = soup.find_all('td')[5].text.strip()
popisfaplnenia = soup.find_all('td')[7].text.replace('\"', '')
hodnotafaplnenia = soup.find_all('td')[9].text[:-1].replace(",", ".").replace(" ", "")
datumdfa = soup.find_all('td')[11].text
datumzfa = soup.find_all('td')[13].text
formazaplatenia = soup.find_all('td')[15].text
obchmenonazov = soup.find_all('td')[17].text
sidlofirmy = soup.find_all('td')[19].text
pravnaforma = soup.find_all('td')[21].text
sudregistracie = soup.find_all('td')[23].text
ico = soup.find_all('td')[25].text
dic = soup.find_all('td')[27].text
cislouctu = soup.find_all('td')[29].text
And the output:
scraperwiki.sqlite.save(unique_keys=["invoice_id"],
data={ "invoice_id":number,
"invoice_price":hodnotafaplnenia,
"evidence_no":evcisloval,
"paired_with":prinalezival,
"invoice_desc":popisfaplnenia,
"date_received":datumdfa,
"date_payment":datumzfa,
"pay_form":formazaplatenia,
"trade_name":obchmenonazov,
"trade_form":pravnaforma,
"company_location":sidlofirmy,
"court":sudregistracie,
"ico":ico,
"dic":dic,
"accout_no":cislouctu,
"invoice_attachment":urlfa,
"invoice_url":url})
I googled it but without success.
First, write a configuration dict of your variables in the form:
conf = {'evidence_no': (3, str.strip),
        'trade_form': (21, None),
        ...}
i.e. the key is the output key, and the value is a tuple of the index into soup.find_all('td') and an optional function to apply to the result (None otherwise). You don't need those Slavic variable names that may confuse other SO members.
Then iterate over conf and fill the data dict.
Also, run soup.find_all('td') once, before the loop:
tds = soup.find_all('td')
data = {}
for name, (num, func) in conf.iteritems():
    text = tds[num].text
    # replace text with None or "NULL" or whatever if needed
    ...
    if func is None:
        data[name] = text
    else:
        data[name] = func(text)
This will remove a lot of duplicated code. Easier to maintain.
Also, I am not sure the strings "NULL" are the best way to write missing data. Doesn't sqlite support Python's real None objects?
Just read your attached link, and it seems what you want is
evcisloval = soup.find_all('td')[3].text.strip() or "NULL"
But be careful: you should only do this with strings. If the part before or is empty, False, None, or 0, it will be replaced with "NULL".
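If falsy-but-valid values ever matter, an explicit test avoids that pitfall; a small sketch using the first field from the question:

text = soup.find_all('td')[3].text
# Only treat a genuinely empty cell as missing; an explicit check will not
# clobber other falsy values the way `or` does.
evcisloval = text.strip() if text and text.strip() else None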

How do I check if a string is valid JSON in Python?

In Python, is there a way to check if a string is valid JSON before trying to parse it?
For example, when working with something like the Facebook Graph API, sometimes it returns JSON and sometimes it could return an image file.
You can try to do json.loads(), which will throw a ValueError if the string you pass can't be decoded as JSON.
In general, the "Pythonic" philosophy for this kind of situation is called EAFP, for Easier to Ask for Forgiveness than Permission.
An example Python function that returns a boolean indicating whether a string is valid JSON:
import json

def is_json(myjson):
    try:
        json.loads(myjson)
    except ValueError:
        return False
    return True
Which prints:
print is_json("{}") #prints True
print is_json("{asdf}") #prints False
print is_json('{ "age":100}') #prints True
print is_json("{'age':100 }") #prints False
print is_json("{\"age\":100 }") #prints True
print is_json('{"age":100 }') #prints True
print is_json('{"foo":[5,6.8],"foo":"bar"}') #prints True
Convert a JSON string to a Python dictionary:
import json
mydict = json.loads('{"foo":"bar"}')
print(mydict['foo']) #prints bar
mylist = json.loads("[5,6,7]")
print(mylist)
[5, 6, 7]
Convert a python object to JSON string:
foo = {}
foo['gummy'] = 'bear'
print(json.dumps(foo)) #prints {"gummy": "bear"}
If you want access to low-level parsing, don't roll your own, use an existing library: http://www.json.org/
Great tutorial on python JSON module: https://pymotw.com/2/json/
To check whether a string is JSON and show syntax errors and error messages, you can use the json_xs command-line tool:
sudo cpan JSON::XS
echo '{"foo":[5,6.8],"foo":"bar" bar}' > myjson.json
json_xs -t none < myjson.json
Prints:
, or } expected while parsing object/hash, at character offset 28 (before "bar}
at /usr/local/bin/json_xs line 183, <STDIN> line 1.
json_xs is capable of syntax checking, parsing, prettifying, encoding, decoding and more:
https://metacpan.org/pod/json_xs
I would say parsing it is the only way you can really, entirely tell. Python's json.loads() function will (almost certainly) raise an exception if the string is not correctly formatted JSON. However, for the purposes of your example, you can probably just check the first couple of non-whitespace characters...
I'm not familiar with the JSON that Facebook sends back, but most JSON strings from web apps will start with an open square [ or curly { bracket. No image formats I know of start with those characters.
Conversely, if you know what image formats might show up, you can check the start of the string for their signatures to identify images, and assume you have JSON if it's not an image.
Another simple hack to identify a graphic, rather than a text string, in the case you're looking for a graphic, is just to test for non-ASCII characters in the first couple of dozen characters of the string (assuming the JSON is ASCII).
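A sketch of those heuristics (quick pre-checks only, not validators):

def looks_like_json(text):
    # JSON documents from web APIs almost always open with '{' or '['.
    return text.lstrip()[:1] in ('{', '[')

def looks_like_image(data):
    # Test a few well-known magic numbers: PNG, JPEG, GIF.
    signatures = (b'\x89PNG', b'\xff\xd8\xff', b'GIF8')
    return any(data.startswith(sig) for sig in signatures)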
I came up with a generic, interesting solution to this problem:
class SafeInvocator(object):
    def __init__(self, module):
        self._module = module

    def _safe(self, func):
        def inner(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                return None
        return inner

    def __getattr__(self, item):
        # Must read self._module here; self.module would re-enter
        # __getattr__ and recurse forever.
        obj = getattr(self._module, item)
        return self._safe(obj) if callable(obj) else obj
and you can use it like so:
safe_json = SafeInvocator(json)
text = "{'foo':'bar'}"
item = safe_json.loads(text)
if item:
    # do something
An effective and reliable way to check for valid JSON: if the get accessor doesn't throw an AttributeError, then the JSON is a valid object.
import json

valid_json = {'type': 'doc', 'version': 1, 'content': [{'type': 'paragraph', 'content': [{'text': 'Request for widget', 'type': 'text'}]}]}
invalid_json = 'opo'

def check_json(p, attr):
    doc = json.loads(json.dumps(p))
    try:
        doc.get(attr)  # we don't care if the value exists, only that get() is accessible
        return True
    except AttributeError:
        return False
To use, we call the function and look for a key.
# Valid JSON
print(check_json(valid_json, 'type'))
Returns 'True'
# Invalid JSON / Key not found
print(check_json(invalid_json, 'type'))
Returns 'False'
Much simpler with a try block. You can then check whether the body is valid JSON:
# Assuming a FastAPI/Starlette Request object:
from fastapi import Request

async def get_body(request: Request):
    try:
        body = await request.json()
    except Exception:
        body = await request.body()
    return body
