Speeding up dictionary (list) processing in Python

I have a dictionary list of size ~250k in python (i.e 250k dictionaries in a list), which I try to process as shown below. The aim is to clean up the dictionary and return an iterable at the end. So, I have something like this:
def check_qs(dict_list_in):
    try:
        del_id=[]
        for i in dict_list_in:
            tmp=i["get_url"][0]
            if i["from"][0]=="var0":
                try:
                    URLValidator()(tmp)
                except:
                    del_id.append( i["id"] )
            elif i["from"][0]=="var1":
                try:
                    URLValidator()( tmp.split("\"")[1] )
                except:
                    del_id.append( i["id"] )
            elif i["from"][0]=="var2":
                try:
                    URLValidator()( tmp.split("\'")[1] )
                except:
                    del_id.append( i["id"] )
            else:
                del_id.append( i["id"] )
        gc.collect()
        result = filter(lambda x: x['id'] not in del_id,dict_list_in)
        return result
    except:
        return dict_list_in
What I am doing above is checking each dictionary in the list against some condition; if the check fails, I record its id and then use filter to remove those dictionaries from the list.
At the moment, this takes a long time to run, and I was wondering if there are any obvious optimizations I am missing. I think the above code is too naive.

I made a couple of changes. I moved the validator instance out of the loop so that you don't have to initialize it every time. If it really must be instantiated every time, just move it into the try/except block. Instead of deleting items from the original list, I append the items you want to keep to a new list, which removes the need for filter. I also moved the validation out of the if statements, so that when you hit the else branch you don't run the validation at all. Look at the logic of the if statements: it is the same as yours. It appears that you are using Django; if you aren't, change the except clause to except Exception.
from django.core.exceptions import ValidationError
from django.core.validators import URLValidator
def check_qs(dict_list_in):
    new_dict_list = []
    # Build the validator once instead of once per item
    validate = URLValidator()
    for i in dict_list_in:
        test_url = i["get_url"][0]
        if i["from"][0] == "var0":
            pass
        elif i["from"][0] == "var1":
            test_url = test_url.split("\"")[1]
        elif i["from"][0] == "var2":
            test_url = test_url.split("\'")[1]
        else:
            continue
        try:
            validate(test_url)
        # If you aren't using django you can change this to 'Exception'
        except ValidationError:
            continue
        new_dict_list.append(i)
    return new_dict_list
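If you would rather keep the collect-then-filter structure from the question, note that `x['id'] not in del_id` scans a list on every call, so the filter does work proportional to (number of dictionaries) x (number of rejected ids). Here is a minimal sketch of the same filter with a set, assuming the ids are hashable; `drop_by_id` is a hypothetical helper name, not something from the original post:

```python
def drop_by_id(dict_list_in, bad_ids):
    """Return the dictionaries whose 'id' is not in bad_ids."""
    # Set membership is O(1) on average; list membership is O(len(list))
    bad = set(bad_ids)
    return [d for d in dict_list_in if d["id"] not in bad]
```

With 250k dictionaries this likely matters most when many ids are rejected, since that is when the list scan gets long.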

Related

Continue a nested loop after exception

I have a nested loop to get certain JSON elements the way I want, but occasionally the API I'm fetching from gets messy and breaks some of the fields. I am not exactly sure how to handle this, since it seems to be different each time, so I'm wondering if there is a way to continue a nested for loop even if an exception occurs inside it, or at least go back to the first loop and continue again.
My code is like this:
fields = ['email', 'displayname', 'login']
sub_fields = ['level', 'name']
all_data = []
for d in data:
    login_value = d['login']
    if login_value.startswith('3b3'):
        continue
    student = fetched_student.student_data(login_value)
    student = json.loads(student)
    final_json = dict()
    try:
        for field in fields:
            #print ("Student field here: %s" % student[field])
            final_json[field] = student[field]
    except Exception as e:
        print (e) # this is where I get a random KeyValue Error
        #print ("Something happening here: %s " % final_json[field])
    finally:
        for sub_field in sub_fields:
            for element in student['users']:
                if element.get(sub_field):
                    final_json[sub_field] = element.get(sub_field)
            for element in student['campus']:
                if element.get(sub_field):
                    final_json[sub_field] = element.get(sub_field)
    all_data.append(final_json)
print (all_data)
Is there a way to just go back to the first try block and continue after the exception has occurred or simply just ignore it and continue?
Because as things are now, if the exception ever occurs it breaks everything.
EDIT1: I have tried putting continue like so:
try:
    for field in fields:
        #print ("Student field here: %s" % student[field])
        final_json[field] = student[field]
except Exception as e:
    print (e)
    continue
for sub_field in sub_fields:
    for element in student['users']:
But it still fails regardless.
Use this for the try block:
for field in fields:
    try:
        #print ("Student field here: %s" % student[field])
        final_json[field] = student[field]
    except Exception as e:
        print (e)
        continue
for sub_field in sub_fields:
    for element in student['users']:
The issue is the indentation level of the try block: with the try wrapped around the whole inner loop, the continue in the except clause applies to the outermost loop. Moving the try block inside the inner loop catches the error there and continues with the next iteration of that specific loop.
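To see that a continue inside the inner loop only skips the current field, here is a small self-contained sketch. The function name `collect` and the sample data are made up for illustration, and it catches KeyError specifically rather than Exception:

```python
def collect(records, fields):
    """Copy the requested fields from each record, skipping missing ones."""
    results = []
    for record in records:
        row = {}
        for field in fields:
            try:
                row[field] = record[field]
            except KeyError:
                # Skips only this field; the inner loop moves to the next one
                continue
        # Still reached for every record, even if some fields were missing
        results.append(row)
    return results
```

Calling `collect([{"email": "a@example.com"}], ["email", "login"])` keeps the record and simply leaves out the missing `login` key.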
Possibly you can use dict's get method like this in your try block:
try:
    for field in fields:
        #print ("Student field here: %s" % student[field])
        final_json[field] = student.get(field, "") # 2nd arg is fallback object
Depending on what is needed, you can pass in a fresh dict (aka JSON object), a fresh list (aka JSON array), or a str like above to suit your downstream needs.
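If you also want to know which fields the API dropped, one option is to fill in the fallback and record the missing names at the same time. This is only a sketch; `pick_fields` and its `default` parameter are hypothetical names, not part of the question's code:

```python
def pick_fields(student, fields, default=""):
    """Return (picked, missing): the copied fields and the names that were absent."""
    picked = {}
    missing = []
    for field in fields:
        if field in student:
            picked[field] = student[field]
        else:
            # Use the fallback value and remember that the field was absent
            picked[field] = default
            missing.append(field)
    return picked, missing
```

The `missing` list can then be logged once per student instead of printing an exception for each field.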

Multiple Try/Except for Validate Config-File

This is my first question on Stack Overflow, and I'm a total Python beginner.
To get familiar with Python, I want to write a small backup program. The main part is done, but now I want to make it a bit more "portable" by using a config file, which I want to validate.
My class "getBackupOptions" should return a validated dict, built from "GlobalOptions" and "BackupOption", so that I end up with a complete "BackupOption" dict when I access "getBackupOptions.BackupOptions".
My question is how to simplify my code. (This example is easy, because it is only the function that checks whether the path should be searched recursively or not.)
For each possible error I have to write a new try/except block. Can I simplify that?
Or is there another way to validate config files/arrays?
class getBackupOptions:
    def __init__(self,BackupOption,GlobalOptions):
        self.BackupOption = BackupOption
        self.GlobalOptions = GlobalOptions
        self.getRecusive()
    def getRecusive(self):
        try:
            if self.BackupOption['recursive'] != None:
                pass
            else:
                raise KeyError
        except KeyError:
            try:
                if self.GlobalOptions['recursive'] != None:
                    self.BackupOption['recursive'] = self.GlobalOptions['recursive']
                else:
                    raise KeyError
            except KeyError:
                print('Recusive in: ' + str(self.BackupOption) + ' and Global is not set!')
                exit()
At the moment I only catch a KeyError, but what if the key is there but holds something other than "True" or "False"?
Thanks a lot for your help!
You may try this
class getBackupOptions:
    def __init__(self,BackupOption,GlobalOptions):
        self.BackupOption = BackupOption
        self.GlobalOptions = GlobalOptions
        self.getRecusive()
    def getRecusive(self):
        if self.BackupOption.get('recursive') == 'True' and self.GlobalOptions.get('recursive') == 'True':
            self.BackupOption['recursive'] = self.GlobalOptions['recursive']
        else:
            print('Recusive in: ' + str(self.BackupOption) + ' and Global is not set!')
            exit()
Because the get method is used here, no KeyError is raised for a missing key.
Any value other than the string 'True' in the field is treated as False.
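The fallback described in the question (use the per-backup value if it is set, otherwise the global one) can be combined with a type check in one reusable helper, so that each new option does not need its own try/except block. The following is a sketch under assumptions: `resolve_bool_option` is a hypothetical name, and it assumes the config values are real booleans rather than the strings 'True'/'False':

```python
def resolve_bool_option(name, backup_option, global_options):
    """Return the first non-None value for `name`, checking the backup entry first."""
    for source in (backup_option, global_options):
        value = source.get(name)
        if value is None:
            continue  # not set here, fall through to the next source
        if not isinstance(value, bool):
            raise ValueError("%s must be True or False, got %r" % (name, value))
        return value
    raise KeyError("%s is set neither in the backup entry nor globally" % name)
```

The caller can then decide what to do with the KeyError or ValueError (print a message and exit, for example), which keeps the validation separate from the error reporting.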

I got a TypeError: 'NoneType' object is not iterable

In the function getLink(urls), I have return (cloud,parent,children).
In the main function, I have (cloud,parent,children) = getLink(urls), and I get an error on this line: TypeError: 'NoneType' object is not iterable
parent and children are both lists of http links. Since I am not able to paste them here: parent is a list containing about 30 links; children is a list containing about 30 items, and each item is about 10-100 links separated by ",".
cloud is a list containing about 100 words, like this: ['official store', 'Java Applets Centre', 'About Google', 'Web History'.....]
I don't know why I get this error. Is there anything wrong with how I pass the parameters? Or is it because the lists take too much space?
#crawler url: read webpage and return a list of url and a list of its name
def crawler(url):
    try:
        m = urllib.request.urlopen(url)
        msg = m.read()
        ....
        return (list(set(list(links))),list(set(list(titles))) )
    except Exception:
        print("url wrong!")
#this is the function has gone wrong: it throw an exception here, also the error I mentioned, also it will end while before len(parent) reach 100.
def getLink(urls):
    try:
        newUrl=[]
        parent = []
        children =[]
        cloud =[]
        i=0
        while len(parent)<=100:
            url = urls[i]
            if url in parent:
                i += 1
                continue
            (links, titles) = crawler(url)
            parent.append(url)
            children.append(",".join(links))
            cloud = cloud + titles
            newUrl= newUrl+links
            print ("links: ",links)
            i += 1
            if i == len(urls):
                urls = list(set(newUrl))
                newUrl = []
                i = 0
        return (cloud,parent,children)
    except Exception:
        print("can not get links")
def readfile(file):
    #not related, this function will return a list of url
def main():
    file='sampleinput.txt'
    urls=readfile(file)
    (cloud,parent,children) = getLink(urls)
if __name__=='__main__':
    main()
There might be a way that your function ends without reaching the explicit return statement.
Look at the following example code.
def get_values(x):
    if x:
        return 'foo', 'bar'
x, y = get_values(1)
x, y = get_values(0)
When the function is called with 0 as parameter the return is skipped and the function will return None.
You could add an explicit return as the last line of your function. In the example given in this answer it would look like this.
def get_values(x):
    if x:
        return 'foo', 'bar'
    return None, None
Update after seeing the code
When the exception is triggered in getLink you just print something and fall off the end of the function. There is no return statement on that path, so Python returns None. The calling function then tries to unpack None into three values, and that fails.
Change your exception handling to return a tuple with three values, as you do when everything is fine. Using None for each value is a good idea because it shows you that something went wrong. Additionally, I wouldn't print anything in the function. Don't mix business logic and input/output.
except Exception:
    return None, None, None
Then in your main function use the following:
cloud, parent, children = getLink(urls)
if cloud is None:
    print("can not get links")
else:
    # do some more work
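The same idea applies one level down: crawler in the question also falls off the end after printing, so `(links, titles) = crawler(url)` would fail with the same TypeError inside getLink. Here is a minimal sketch of a wrapper that always returns a two-item tuple; `safe_crawl` and its `fetch` parameter are hypothetical names used only for illustration:

```python
def safe_crawl(url, fetch):
    """Return (links, titles); both are empty lists if fetching fails.

    `fetch` is a hypothetical callable standing in for the real download
    and parsing step; it should return a (links, titles) pair.
    """
    try:
        links, titles = fetch(url)
    except Exception:
        # Same shape as the success case, so callers can always unpack
        return [], []
    return list(set(links)), list(set(titles))
```

Returning empty lists instead of None lets the caller's loop keep going with nothing to add; whether that or an explicit error value is better depends on how the caller wants to report failures.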

Exception handling in for-loop / EAFP

I have a request with JSON data, it may or may not contain 'items' key, if it does it has to be a list of objects, that I want to process individually. So I have to write something like:
json_data = request.get_json()
for item in json_data['items']:
    process_item(item)
But, since presence of the 'items' key is not mandatory, an additional measure needs to be taken. I would like to follow EAFP approach, so wrapping it up into try ... except statement:
json_data = request.get_json()
try:
    for item in json_data['items']:
        process_item(item)
except KeyError as e:
    pass
Let's assume that a KeyError can also be raised inside the process_item(...) function, where it may indicate a code error and should not go unnoticed. So I want to make sure that I catch only exceptions coming from the for statement's iterable expression. As a workaround I came up with:
json_data = request.get_json()
try:
    for item in json_data['items']:
        process_item(item)
except KeyError as e:
    if e.message != 'items':
        raise e
    pass
But:
1. It looks ugly.
2. It relies on knowledge of the process_item(...) implementation, assuming that KeyError('items') cannot be raised inside of it.
3. If the for statement becomes more complex, e.g. for json_data['raw']['items'], so will the except clause, making it even less readable and maintainable.
Update:
The suggested alternative
json_data = request.get_json()
try:
    items = json_data["items"]
except KeyError:
    items = []
for item in items:
    process_item(item)
is essentially the same as
json_data = request.get_json()
if 'items' in json_data:
    items = json_data['items']
else:
    items = []
for item in items:
    process_item(item)
So we check before we loop. I would like to know if there is any more pythonic/EAFP approach?
You can catch the exception only when accessing "items":
json_data = request.get_json()
try:
    items = json_data["items"]
except KeyError:
    items = []
for item in items:
    process_item(item)
However, we can replace the try-block with a call to the .get() function, making it much cleaner:
for item in request.get_json().get("items", []):
    process_item(item)
I think the cleanest option is to use a try block around only the code that attempts to retrieve the data associated with the 'items' key:
json_data = request.get_json()
try:
    items = json_data['items']
except KeyError:
    print "no 'items' to process" # or whatever you want to...
else:
    for item in items:
        process_item(item)
This layout allows you to clearly separate the error handling as you see fit. You can add a separate, independent try/except around the for loop if desired.
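To check that the try/except/else layout really lets a KeyError from the processing step escape, here is a self-contained sketch. The function name `process_all` and the idea of returning a count are assumptions made for the example; plain dicts stand in for the request JSON:

```python
def process_all(json_data, process_item):
    """Process every entry under 'items'; return how many were processed."""
    try:
        items = json_data["items"]
    except KeyError:
        # Only a missing 'items' key is handled here
        return 0
    else:
        count = 0
        for item in items:
            # A KeyError raised in here is NOT caught by the except above
            process_item(item)
            count += 1
        return count
```

Because the loop sits in the else clause, the except KeyError only guards the single lookup, which is what the question asks for.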

Greedy execution of statements?

I have something like this using BeautifulSoup:
for line in lines:
    code = l.find('span', {'class':'boldHeader'}).text
    coded = l.find('div', {'class':'Description'}).text
    definition = l.find('ul', {'class':'definitions'}).text
    print code, coded, def
However, not all elements exist at all times. I can enclose this in a try except so that it does not break the program execution like this:
for line in lines:
    try:
        code = l.find('span', {'class':'boldHeader'}).text
        coded = l.find('div', {'class':'Description'}).text
        definition = l.find('ul', {'class':'definitions'}).text
        print code, coded, def
    except:
        pass
But how do I execute the statements in a greedy fashion? For instance, if only two elements are available, code and coded, I just want to get those and continue with the execution. As of now, even if code and coded exist, if def does not exist, the print command is never executed.
One way of doing this is to put a try...except for every statement like this:
for line in lines:
    try:
        code = l.find('span', {'class':'boldHeader'}).text
    except:
        pass
    try:
        coded = l.find('div', {'class':'Description'}).text
    except:
        pass
    try:
        definition = l.find('ul', {'class':'definitions'}).text
    except:
        pass
    print code, coded, def
But this is an ugly approach and I want something cleaner. Any suggestions?
How about capturing the "ugly" code in a function, and just calling the function as needed:
def get_txt(l,tag,classname):
    try:
        txt=l.find(tag, {'class':classname}).text
    except AttributeError:
        # find() returned None, so there is no .text to read
        txt=None
    return txt
for line in lines:
    code = get_txt(line,'span','boldHeader')
    coded = get_txt(line,'div','Description')
    defn = get_txt(line,'ul','definitions')
    print code, coded, defn
PS. I changed def to defn because def is a Python keyword. Using it as a variable name raises a SyntaxError.
PPS. It's not good practice to use a bare except:
try:
    ....
except:
    ...
because it almost always catches more than you intend. It is much better to be explicit about what you want to catch:
try:
    ...
except AttributeError as err:
    ...
First of all, you can test for None instead of catching an exception. l.find should return None if it doesn't find your item. Exceptions should be reserved for errors and really extraordinary situations.
Second thing you can do is to create an array of all HTML elements you want to check and then have a nested for loop. Since it's been a while since I've used python, I will outline the code and then (hopefully) edit the answer when I test it.
Something like:
elementsToCheck = [
    [ 'span', {'class':'boldHeader'} ],
    [ 'div', {'class':'Description'} ],
    [ 'ul', {'class':'definitions'} ]]
concatenated = ''
for line in lines:
    for something in elementsToCheck:
        element = line.find(something[0], something[1])
        if element is not None:
            concatenated += element.text
print(concatenated)
The code above is untested, but you should get the idea. :)
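The "test for None" suggestion can also be wrapped in a small helper, so each lookup stays a single line without any try/except. This is a sketch that assumes `node` is a BeautifulSoup tag (or any object whose find(tag, attrs) returns None when nothing matches); `find_text` and its `default` parameter are hypothetical names:

```python
def find_text(node, tag, classname, default=None):
    """Return the text of the first matching child, or `default` if there is none."""
    element = node.find(tag, {"class": classname})
    # find() gives None when nothing matches, so check before reading .text
    return default if element is None else element.text
```

Inside the loop this would be used as `code = find_text(line, 'span', 'boldHeader')`, and so on for the other two fields, so a missing element yields None instead of stopping the remaining lookups.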
