I'm looping through a list of web pages with Scrapy. Some of the pages that I scrape are in error. I want to keep track of the various error types, so I have set up my function to first check whether any of a series of error conditions (which I have placed in a dictionary) is true and, if none is, to proceed with normal page scraping:
def parse_detail_page(self, response):
    error_value = False
    output = ""
    error_cases = {
        "' pageis not found' in response.body" : 'invalid',
        "'has been transferred' in response.body" : 'transferred',
    }
    for key, value in error_cases.iteritems():
        if bool(key):
            error_value = True
            output = value
    if error_value:
        for field in J1_Item.fields:
            if field == 'case':
                item[field] = id
            else:
                item[field] = output
    else:
        item['case'] = id
        ........................
However, I see that even in cases where none of the error conditions holds, the 'invalid' option is being selected. What am I doing wrong?
Your conditions (something in response.body) are not evaluated. Instead, you evaluate the truth value of a nonempty string, which is True.
This might work:
def parse_detail_page(self, response):
    error_value = False
    output = ""
    error_cases = {
        "pageis not found" : 'invalid',
        "has been transferred" : 'transferred',
    }
    for key, value in error_cases.iteritems():
        if key in response.body:
            error_value = True
            output = value
            break
    .................
(Must it be "pageis not found" or "page is not found"?)
bool(key) will convert key from a string to a bool.
What it won't do is actually evaluate the condition. You could use eval() for that, but I'd recommend instead storing a list of functions (each returning an object or throwing an exception) rather than your current dict-with-string-keys-that-are-actually-Python-code.
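A minimal sketch of the list-of-functions idea, assuming each check only needs the page body as text. The names `ERROR_CHECKS` and `classify` are hypothetical, the search strings are placeholders for whatever the real pages contain, and each check here returns a bool rather than raising. Under Python 3, Scrapy's `response.body` is bytes, so `response.text` would be the value to pass in.

```python
# Hypothetical sketch: each entry pairs a predicate with the label to report.
ERROR_CHECKS = [
    (lambda body: "page is not found" in body, "invalid"),
    (lambda body: "has been transferred" in body, "transferred"),
]


def classify(body):
    """Return the label of the first matching check, or None if none match."""
    for check, label in ERROR_CHECKS:
        if check(body):
            return label
    return None


print(classify("This case has been transferred."))  # transferred
print(classify("A normal page"))                    # None
```

Because the predicates are real functions, nothing has to be evaluated from a string, and a new error type is one more tuple in the list.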
I'm not sure why you are evaluating bool(key) like you are. Let's look at your error_cases. You have two keys and two values. "' pageis not found' in response.body" will be your key the first time, and "'has been transferred' in response.body" will be the key in the second round of your for loop. Neither of those will be false when you check bool(key), because any non-empty string is truthy, regardless of what it contains.
>>> a = "' pageis not found' in response.body"
>>> bool(a)
True
You need a different test than bool(key) there, or you will always report an error.
Your conditions are strings, so they are not evaluated.
You could evaluate your strings with eval(key), but that is quite unsafe.
With the help of the operator module, there is no need to evaluate unsafe strings (as long as your conditions stay quite simple).
error['operator'] holds a reference to the contains function, which can be used as a replacement for in. Note that contains(a, b) tests b in a, so the container comes first.
from operator import contains
class ...:
    def parse_detail_page(self, response):
        error_value = False
        output = ""
        error_cases = [
            {'search': ' pageis not found', 'operator': contains, 'output': 'invalid'},
            {'search': 'has been transferred', 'operator': contains, 'output': 'transferred'},
        ]
        for error in error_cases:
            # contains(a, b) is equivalent to `b in a`
            if error['operator'](response.body, error['search']):
                error_value = True
                output = error['output']
                print output
        if error_value:
            for field in J1_Item.fields:
                if field == 'case':
                    item[field] = id
                else:
                    item[field] = output
        else:
            item['case'] = id
            ...
Using Python, I've got a function retrieving a list of operations from an API endpoint.
The function takes a filter argument in order to filter the results on a given predicate.
The function looks like this:
def list_operations(filter=None):
    # make a curl call to the product recognizer
    headers = {
        'Authorization': 'Bearer {}'.format(creds.token),
        'Content-Type': 'application/json',
    }
    response = requests.get(
        'https://{}/v1alpha1/projects/{}/locations/us-central1/operations'.format(API_ENDPOINT, project),
        headers=headers
    )
    # dump the json response and display their names
    data = json.loads(response.text)
    # add a Metadata element to the operations if it does not exist
    for item in data['operations']:
        if not item.get('metadata'):
            item['metadata'] = {}
            item['metadata']['createTime'] = ''
        else:
            if not item['metadata'].get('createTime'):
                item['metadata']['createTime'] = ''
    # Order operations by create time if the metadata exists and the createTime exists
    data['operations'] = sorted(data['operations'], key=lambda k: k['done'], reverse=True)
    if filter:
        # filter the operations by the filter value
        # Parse the filter value to get the operation name
        filter_path = filter.split('=')[0].split('.')
        filter_value = filter.split('=')[1]
        # check if the filter_value could be a Boolean
        if filter_value == 'True':
            filter_value = True
        elif filter_value == 'False':
            filter_value = False
        # iterate backwards to avoid index out of range error using reversed
        for item in reversed(data['operations']):
            # for every element in filter_path, check if it exists
            item_value = item
            for filter_el in filter_path:
                if item_value.get(filter_el):
                    item_value = item_value.get(filter_el)
            # if the item value is not equal to the filter value, remove it from the list
            if item_value != filter_value:
                data['operations'].remove(item)
My problem is that when I call the function with
list_operations(filter='done=False')
even when the done key from the response message is False, the assignment of the value to item_value does not work:
item_value = item_value.get(filter_el)
Using the debugger, item_value is {'name': 'api_path/operation-1676883175156-5f51dc9fc2ad1-b4c56f97-edd1e5be', 'done': False, 'metadata': {'createTime': ''}} instead of False
It works fine when calling
list_operations(filter='done=True')
I can't see what's missing here ...
[EDIT]
The problem was in the line
if item_value.get(filter_el):
To test for the existence of the key, I should have used:
if filter_el in item_value:
Stupid mistake ...
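The difference is easy to reproduce in isolation: `dict.get()` returns the stored value, so a stored `False` fails a truthiness test even though the key is present, while `in` only checks for presence. A small sketch with a made-up `item` and a hypothetical `resolve` helper:

```python
item = {"name": "operation-1", "done": False, "metadata": {"createTime": ""}}

# .get() returns the stored value; False is falsy, so this test fails
print(bool(item.get("done")))  # False

# membership only asks whether the key is present
print("done" in item)          # True


def resolve(obj, path):
    """Follow a list of keys through nested dicts, stopping at a missing key."""
    for key in path:
        if isinstance(obj, dict) and key in obj:
            obj = obj[key]
        else:
            break
    return obj


print(resolve(item, ["done"]))  # False
```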
json.loads() parses JSON into Python objects, so "done": false has already become {"done": False} in Python:
import json
d = json.loads("""{"name": "api_path/operation-1676883175156-5f51dc9fc2ad1-b4c56f97-edd1e5be",
"done": false,
"metadata": {"createTime": ""}}""")
print(type(d['done']))
>>> <class 'bool'>
I don't have your full response so I cannot help beyond this point.
I'm trying to use something like a ternary operator in Python: if a value exists in my dictionary, use it, otherwise leave it blank. For example, in the code below I want to get the values of creator and assignee, and if a value doesn't exist I want it to be ''. Is there a way to do this with a ternary operator in Python?
Here's my code :
in_progress_response = requests.request("GET", url, headers=headers, auth=auth).json()
issue_list = []
for issue in in_progress_response['issues'] :
    # return HttpResponse( json.dumps( issue['fields']['creator']['displayName'] ) )
    issue_list.append(
        {
            "id": issue['id'],
            "key": issue['key'],
            # DOESN'T WORK
            "creator": issue['fields']['creator']['displayName'] ? '',
            "is_creator_active": issue['fields']['creator']['active'] ? '',
            "assignee": issue['fields']['assignee']['displayName'] ? '',
            "is_assignee_active": issue['fields']['assignee']['active'] ? '',
            "updated": issue['fields']['updated'],
        }
    )
return issue_list
Ternary operators in python act as follows:
condition = True
foo = 3.14 if condition else 0
But for your particular use case, you should consider using dict.get(). The first argument specifies what you are trying to access, and the second argument specifies a default return value if the key does not exist in the dictionary.
some_dict = {'a' : 1}
foo = some_dict.get('a', '') # foo is 1
bar = some_dict.get('b', '') # bar is ''
You can use .get(…) to try to fetch an item from a dictionary, with an optional default value returned when the dictionary does not contain the given key. You can thus implement this as:
"creator": issue.get('fields', {}).get('creator', {}).get('displayName', ''),
and likewise for the other items.
If you want to use something like a ternary, you can write:
value = issue['fields']['creator']['displayName'] if issue['fields']['creator'] else ""
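Chained `.get()` calls with `{}` defaults cover missing keys, but not keys that are present with a value of `None` (for example an issue with no assignee, if the API returns `null` there). A minimal sketch of a helper that handles both cases; `dig` is a hypothetical name, not a library function, and the sample `issue` is made up:

```python
def dig(data, *keys, default=""):
    """Walk nested dicts; return default on a missing key or a non-dict level."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return default if data is None else data


issue = {
    "id": "1",
    "fields": {"creator": {"displayName": "Ann", "active": True}, "assignee": None},
}

print(dig(issue, "fields", "creator", "displayName"))         # Ann
print(repr(dig(issue, "fields", "assignee", "displayName")))  # ''
```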
I have a dictionary of values for a company name.
It has 3 keys:
{'html_attributions': [],
 'result': {'Address': '123 Street', 'website': '123street.com'},
 'status': 'Ok'}
I have a dataframe of many such dictionaries. I want to loop through each row's dictionary and get the information I need.
Currently I am writing for loops to retrieve this information. Is there a more efficient way to retrieve it?
addresses = []
for i in range(len(testing)):
    try:
        addresses.append(testing['Results_dict'][i]['result']['Address'])
    except:
        addresses.append('No info')
What I have works perfectly fine. However, I would like something more efficient, perhaps using the get() method, but I don't know how to use it to reach inside 'result'.
Try this:
def get_address(r):
    try:
        return r['result']['Address']
    except Exception:
        return 'No info'
addresses = testing['Results_dict'].map(get_address)
This guards against cases where an entry in Results_dict is None, is not a dict, or is missing any key along the path.
This may be faster if the data is big (note the {} default, so that a missing 'result' key does not raise):
addresses = list(map(lambda x: x.get('result', {}).get('Address', 'No info'), testing['Results_dict']))
Here is how I deal with nested dict keys:
Example:
def keys_exists(element, *keys):
    if not isinstance(element, dict):
        raise AttributeError('keys_exists() expects dict as first argument.')
    if len(keys) == 0:
        raise AttributeError('keys_exists() expects at least two arguments, one given.')
    _element = element
    for key in keys:
        try:
            _element = _element[key]
        except (KeyError, TypeError):
            # TypeError covers an intermediate value that is not a dict
            return False
    return True
For the data:
{'html_attributions': [],
 'result': {'Address': '123 Street', 'website': '123street.com'},
 'status': 'Ok'}
If you want to check whether result exists, use the function above like this:
print('result (exists/not): {}'.format(keys_exists(data, "result")))
To check whether Address exists inside result, try this:
print('result > Address (exists/not): {}'.format(keys_exists(data, "result", "Address")))
Each call returns True or False.
I have searched quite thoroughly and have not found a suitable solution. I am new to Python/Programming, so I appreciate any advice I can get:
I am trying to look up a string in a string set. Here is what I am trying to do, but I am not getting the value.
string_set = {'"123", "456", "789"'}
value = '123'
values_list = []
def fun():
    for i in string_set:
        if i in value:
            output=LookupTables.get('dynamo-table', i, {})
            return output
fun()
With the above, if the value is in the string set, it should return the value that is in my DynamoDB table.
Note: there could be more than 5000 values in my table, so I want to return as early as possible.
Maybe you should remove the extra '' first:
string_set = {'"123", "456", "789"'} # this set has just one value '"123", "456", "789"'
string_set_fixed = {"123", "456", "789"}
I'm assuming you're just checking whether 123 is in "123", "456", "789", since you had it wrapped in single quotes.
To represent that, let's use:
strset = {"123", "456", "789"}
What if you have to use that odd variable?
This should make it usable:
strset = {'"123", "456", "789"'}
removed = next(iter(strset))
strset.update((removed).split())
strset.remove(removed)
strset = set([i.strip(",").strip('"') for i in strset])
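Another, shorter way (exec runs the string as code, so use it only on trusted input; strset ends up as a tuple rather than a set):
strset = {'"123", "456", "789"'}
exec(f"strset = {next(iter(strset))}")
print("123" in strset)
Now, to check whether value is in there:
if value in strset:
    pass  # do code here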
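If the quoted, comma-separated string really is what arrives, `ast.literal_eval` from the standard library can parse it without executing arbitrary code, unlike `exec`. A minimal sketch, assuming the single element always holds at least two comma-separated string literals (a lone `"123"` would parse to a plain string, not a tuple):

```python
import ast

strset = {'"123", "456", "789"'}

# the set holds exactly one string: '"123", "456", "789"'
raw = next(iter(strset))

# literal_eval parses it as a tuple of strings; only literals are accepted
parsed = set(ast.literal_eval(raw))

print("123" in parsed)  # True
```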
Try this:
string_set = {"123", "456", "789"}
value = '123'
values_list = []
def fun():
    if value in string_set:
        output = LookupTables.get('dynamo-table', value, {})
        return output
fun()
Explanation:
Your definition of string_set contains an extraneous pair of ' ', so the set holds a single string.
When you test i in value, you are checking whether i is a substring of value, rather than comparing the whole strings.
I have a flask application which is receiving a request from dataTables Editor. Upon receipt at the server, request.form looks like (e.g.)
ImmutableMultiDict([('data[59282][gender]', u'M'), ('data[59282][hometown]', u''),
('data[59282][disposition]', u''), ('data[59282][id]', u'59282'),
('data[59282][resultname]', u'Joe Doe'), ('data[59282][confirm]', 'true'),
('data[59282][age]', u'27'), ('data[59282][place]', u'3'), ('action', u'remove'),
('data[59282][runnerid]', u''), ('data[59282][time]', u'29:49'),
('data[59282][club]', u'')])
I am thinking to use something similar to this really ugly code to decode it. Is there a better way?
from collections import defaultdict
# request.form comes in multidict [('data[id][field]',value), ...]
# so we need to exec this string to turn into python data structure
data = defaultdict(lambda: {}) # default is empty dict
# need to define text for each field to be received in data[id][field]
age = 'age'
club = 'club'
confirm = 'confirm'
disposition = 'disposition'
gender = 'gender'
hometown = 'hometown'
id = 'id'
place = 'place'
resultname = 'resultname'
runnerid = 'runnerid'
time = 'time'
# fill in data[id][field] = value
for formkey in request.form.keys():
    exec '{} = {}'.format(formkey, repr(request.form[formkey]))
This question has an accepted answer and is a bit old, but since the DataTables module still seems to be pretty popular in the jQuery community, I believe this approach may be useful for someone else. I've just written a simple parsing function based on regular expressions and the dpath module, though dpath does not appear to be a very reliable module. The snippet may not be very straightforward because of the exception-based fragment, but that was the only way I found to prevent dpath from trying to resolve strings as integer indices.
import re, dpath.util
rxsKey = r'(?P<key>[^\W\[\]]+)'
rxsEntry = r'(?P<primaryKey>[^\W]+)(?P<secondaryKeys>(\[' \
           + rxsKey \
           + r'\])*)\W*'
rxKey = re.compile(rxsKey)
rxEntry = re.compile(rxsEntry)
def form2dict( frmDct ):
    res = {}
    for k, v in frmDct.iteritems():
        m = rxEntry.match( k )
        if not m: continue
        mdct = m.groupdict()
        if not 'secondaryKeys' in mdct.keys():
            res[mdct['primaryKey']] = v
        else:
            fullPath = [mdct['primaryKey']]
            for sk in re.finditer( rxKey, mdct['secondaryKeys'] ):
                k = sk.groupdict()['key']
                try:
                    dpath.util.get(res, fullPath)
                except KeyError:
                    dpath.util.new(res, fullPath, [] if k.isdigit() else {})
                fullPath.append(int(k) if k.isdigit() else k)
            dpath.util.new(res, fullPath, v)
    return res
The practical usage is based on native flask request.form.to_dict() method:
# ... somewhere in a view code
pars = form2dict(request.form.to_dict())
The output structure includes both dictionaries and lists, as one would expect. E.g.:
# A little test:
rs = form2dict( {
    'columns[2][search][regex]' : False,
    'columns[2][search][value]' : None,
} )
generates:
{
"columns": [
null,
null,
{
"search": {
"regex": false,
"value": null
}
}
]
}
Update: to handle lists as dictionaries (in a more efficient way), one may simplify the snippet by using the following block in the else branch of the if statement:
# ...
else:
    fullPathStr = mdct['primaryKey']
    for sk in re.finditer( rxKey, mdct['secondaryKeys'] ):
        fullPathStr += '/' + sk.groupdict()['key']
    dpath.util.new(res, fullPathStr, v)
I decided on a way that is more secure than using exec:
from collections import defaultdict
def get_request_data(form):
    '''
    return dict list with data from request.form
    :param form: MultiDict from `request.form`
    :rtype: {id1: {field1:val1, ...}, ...} [fieldn and valn are strings]
    '''
    # request.form comes in multidict [('data[id][field]',value), ...]
    # fill in id field automatically
    data = defaultdict(lambda: {})
    # fill in data[id][field] = value
    for formkey in form.keys():
        if formkey == 'action': continue
        datapart,idpart,fieldpart = formkey.split('[')
        # ParameterError is assumed to be defined elsewhere in the application
        if datapart != 'data': raise ParameterError("invalid input in request: {}".format(formkey))
        idvalue = int(idpart[0:-1])
        fieldname = fieldpart[0:-1]
        data[idvalue][fieldname] = form[formkey]
    # return decoded result
    return data
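For Python 3, the same idea can be sketched with a regular expression instead of `split('[')`. This is an illustration only: `parse_editor_form` and `KEY_RE` are hypothetical names, a plain dict of strings stands in for `request.form`, and malformed keys raise `ValueError` in place of the `ParameterError` used above.

```python
import re
from collections import defaultdict

# matches keys such as data[59282][gender]
KEY_RE = re.compile(r"^data\[(\d+)\]\[(\w+)\]$")


def parse_editor_form(form):
    """Return (action, {id: {field: value}}) from a flat mapping of form keys."""
    data = defaultdict(dict)
    action = None
    for key, value in form.items():
        if key == "action":
            action = value
            continue
        match = KEY_RE.match(key)
        if match is None:
            raise ValueError("invalid input in request: {}".format(key))
        row_id, field = match.groups()
        data[int(row_id)][field] = value
    return action, dict(data)


form = {
    "data[59282][gender]": "M",
    "data[59282][age]": "27",
    "action": "remove",
}
print(parse_editor_form(form))
# ('remove', {59282: {'gender': 'M', 'age': '27'}})
```

Anchoring the pattern with `^` and `$` rejects keys with extra bracket levels, which the `split('[')` version would only catch through an unpacking error.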