I am parsing some content out of different URLs. Not all the URLs have the same structure, so the code fails for some of them. The code I came up with is this (simplified version):
meta_dict = {}
try:
    meta_dict['date_published'] = html.find('date').text
except:
    meta_dict['date_published'] = ''
try:
    meta_dict['headline'] = html.find('headline').text
except:
    meta_dict['headline'] = ''
try:
    meta_dict['description'] = html.find('description').text
except:
    meta_dict['description'] = ''
return meta_dict
This is a simplified block; the real code tries to get more than 50 variables, and writing a try/except block for every one of them feels too repetitive and makes the code ugly.
I know I could write a function for it that returns '' on failure, but I want to know if there is another way to handle this case.
l = [('date_published', 'date'), ('headline', 'headline'), ('description', 'description')]
for dict_val, html_val in l:
    try:
        meta_dict[dict_val] = html.find(html_val).text
    except:
        meta_dict[dict_val] = ''
If the list of these variables you are checking is constant, you can put them into a list and then just iterate over that list.
fields = ['date_published', 'headline', 'description', ...]
for var in fields:
    try:
        meta_dict[var] = html.find(var).text
    except:
        meta_dict[var] = ''
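The helper-function idea from the question is also worth showing; a minimal sketch, assuming html.find() returns None for a missing tag (as BeautifulSoup does) and that an empty string is an acceptable default:
def safe_text(html, tag, default=''):
    # html.find() returns None when the tag is missing, so fall back to a
    # default instead of raising AttributeError on .text
    node = html.find(tag)
    return node.text if node is not None else default

meta_dict = {
    'date_published': safe_text(html, 'date'),
    'headline': safe_text(html, 'headline'),
    'description': safe_text(html, 'description'),
}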
I'm trying to request a JSON object, loop through it with a for loop, pull out the data I need and save it to a model in Django.
I only want the first two attributes, runner_1_name and runner_2_name, but the number of runners inside each list in my JSON object varies, so I keep getting a "list index out of range" error. I have tried try/except, but when I then try to save to the model it says my variables are referenced before assignment. What's the best way of ignoring the "list index out of range" error, or of fixing the list so the indexes are correct? I also want the code to run really fast, as I will be using this function as a background task polling every two seconds.
@shared_task()
def mb_get_events():
    mb = APIClient('username', 'pass')
    tennis_events = mb.market_data.get_events()
    for data in tennis_events:
        id = data['id']
        event_name = data['name']
        sport_id = data['sport-id']
        start_time = data['start']
        is_ip = data['in-running-flag']
        par = data['event-participants']
        event_id = par[0]['event-id']
        cat_id = data['meta-tags'][0]['id']
        cat_name = data['meta-tags'][0]['name']
        cat_type = data['meta-tags'][0]['type']
        url_name = data['meta-tags'][0]['type']
        try:
            runner_1_name = data['markets'][0]['runners'][0]['name']
        except IndexError:
            pass
        try:
            runner_2_name = data['markets'][0]['runners'][1]['name']
        except IndexError:
            pass
        run1_par_id = data['markets'][0]['runners'][0]['id']
        run2_par_id = data['markets'][0]['runners'][1]['id']
        run1_back_odds = data['markets'][0]['runners'][0]['prices'][0]['odds']
        run2_back_odds = data['markets'][0]['runners'][1]['prices'][0]['odds']
        run1_lay_odds = data['markets'][0]['runners'][0]['prices'][3]['odds']
        run2_lay_odds = data['markets'][0]['runners'][1]['prices'][3]['odds']
        te, created = MBEvent.objects.update_or_create(id=id)
        te.id = id
        te.event_name = event_name
        te.sport_id = sport_id
        te.start_time = start_time
        te.is_ip = is_ip
        te.event_id = event_id
        te.runner_1_name = runner_1_name
        te.runner_2_name = runner_2_name
        te.run1_back_odds = run1_back_odds
        te.run2_back_odds = run2_back_odds
        te.run1_lay_odds = run1_lay_odds
        te.run2_lay_odds = run2_lay_odds
        te.run1_par_id = run1_par_id
        te.run2_par_id = run2_par_id
        te.cat_id = cat_id
        te.cat_name = cat_name
        te.cat_type = cat_type
        te.url_name = url_name
        te.save()
Quick Fix:
try:
    runner_1_name = data['markets'][0]['runners'][0]['name']
except IndexError:
    runner_1_name = ''  # don't just pass here
try:
    runner_2_name = data['markets'][0]['runners'][1]['name']
except IndexError:
    runner_2_name = ''
It is giving you "variable referenced before assignment" because in the except block you are just passing, so if the try fails, runner_1_name or runner_2_name is never defined. Then, when you try to use those variables, you get an error because they were never defined. So in the except block, either set the value to a blank string or to some other string like 'Runner Does Not Exist'.
Now if you want to totally avoid try/except and IndexError you can use if statements to check the length of markets and runners. Something like this:
runner_1_name = ''
runner_2_name = ''
# Make sure 'markets' exists in data, its length is greater than 0, and 'runners' exists in the first market
if 'markets' in data and len(data['markets']) > 0 and 'runners' in data['markets'][0]:
    runners = data['markets'][0]['runners']
    # get runner 1
    if len(runners) > 0 and 'name' in runners[0]:
        runner_1_name = runners[0]['name']
    else:
        runner_1_name = 'Runner 1 does not exist'
    # get runner 2
    if len(runners) > 1 and 'name' in runners[1]:
        runner_2_name = runners[1]['name']
    else:
        runner_2_name = 'Runner 2 does not exist'
As you can see, this gets too long, and it's not the recommended way to do things.
You should just assume the data is alright, try to get the names, and use try/except to catch any errors, as suggested in my first code snippet above.
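If this pattern comes up for many nested lookups, one option (a sketch, not from the answers above; the dig name is made up) is a tiny helper that walks a chain of keys/indexes and falls back to a default on any missing step:
def dig(obj, *path, default=''):
    # Walk a chain of dict keys / list indexes, returning a default
    # if any step along the way is missing.
    try:
        for step in path:
            obj = obj[step]
        return obj
    except (KeyError, IndexError, TypeError):
        return default

runner_1_name = dig(data, 'markets', 0, 'runners', 0, 'name')
runner_2_name = dig(data, 'markets', 0, 'runners', 1, 'name')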
I had this issue with a list of comments that can be empty or contain an unknown number of comments.
My solution was to initialize a counter at 0 and loop while a boolean stays true.
In the loop I try to get comments[count]; if that fails with an IndexError, I set the boolean to False to stop the loop:
count = 0
condition_continue = True
while condition_continue:
    try:
        detailsCommentDict = comments[count]
        ....
        count += 1  # move on to the next comment
    except IndexError:
        # no comment at all or no more comments
        condition_continue = False
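For completeness, a sketch of the same idea without the counter: iterating over the list directly avoids the IndexError entirely, because a plain for loop simply does nothing when the list is empty:
for detailsCommentDict in comments:
    # process each comment; an empty list just skips the loop body
    ...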
I keep getting the following error when trying to parse some json:
Traceback (most recent call last):
File "/Users/batch/projects/kl-api/api/helpers.py", line 37, in collect_youtube_data
keywords = channel_info_response_data['items'][0]['brandingSettings']['channel']['keywords']
KeyError: 'brandingSettings'
How do I make sure that I check my JSON output for a key before assigning it to a variable? If a key isn’t found, then I just want to assign a default value. Code below:
try:
    channel_id = channel_id_response_data['items'][0]['id']
    channel_info_url = YOUTUBE_URL + '/channels/?key=' + YOUTUBE_API_KEY + '&id=' + channel_id + '&part=snippet,contentDetails,statistics,brandingSettings'
    print('Querying:', channel_info_url)
    channel_info_response = requests.get(channel_info_url)
    channel_info_response_data = json.loads(channel_info_response.content)
    no_of_videos = int(channel_info_response_data['items'][0]['statistics']['videoCount'])
    no_of_subscribers = int(channel_info_response_data['items'][0]['statistics']['subscriberCount'])
    no_of_views = int(channel_info_response_data['items'][0]['statistics']['viewCount'])
    avg_views = round(no_of_views / no_of_videos, 0)
    photo = channel_info_response_data['items'][0]['snippet']['thumbnails']['high']['url']
    description = channel_info_response_data['items'][0]['snippet']['description']
    start_date = channel_info_response_data['items'][0]['snippet']['publishedAt']
    title = channel_info_response_data['items'][0]['snippet']['title']
    keywords = channel_info_response_data['items'][0]['brandingSettings']['channel']['keywords']
except Exception as e:
    raise Exception(e)
You can either wrap each of your assignments in something like
try:
    keywords = channel_info_response_data['items'][0]['brandingSettings']['channel']['keywords']
except KeyError as ignore:
    keywords = "default value"
or check for the key first with the in operator (dict.has_key() only exists in Python 2). IMHO, in your case the first solution is preferable.
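For example, a sketch of the membership-check variant against the field from your traceback:
item = channel_info_response_data['items'][0]
if 'brandingSettings' in item and 'channel' in item['brandingSettings']:
    keywords = item['brandingSettings']['channel'].get('keywords', 'default value')
else:
    keywords = 'default value'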
Suppose you have a dict; you have two options to handle the key-not-exist situation:
1) Get the key with a default value, like:
d = {}
val = d.get('k', 10)
val will be 10 since there is no key named 'k'.
2) try-except
d = {}
try:
    val = d['k']
except KeyError:
    val = 10
This way is far more flexible since you can do anything in the except block, even ignore the error with a pass statement if you really don't care about it.
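The get() approach also extends to nested dicts if you chain calls, falling back to an empty dict at each level (a sketch; this is only safe when every level that does exist is itself a dict):
d = {}
val = d.get('a', {}).get('b', {}).get('c', 10)  # 10 if any level is missing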
How do I make sure that I check my JSON output
At this point your "JSON output" is just a plain native Python dict
for a key before assigning it to a variable? If a key isn’t found, then I just want to assign a default value
Now that you know you have a dict, browsing the official documentation for dict methods should answer the question:
https://docs.python.org/3/library/stdtypes.html#dict.get
get(key[, default])
Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.
so the general case is:
var = data.get(key, default)
Now if you have deeply nested dicts/lists where any key or index could be missing, catching KeyErrors and IndexErrors can be simpler:
try:
    var = data[key1][index1][key2][index2][keyN]
except (KeyError, IndexError):
    var = default
As a side note: your code snippet is filled with repeated channel_info_response_data['items'][0]['statistics'] and channel_info_response_data['items'][0]['snippet'] expressions. Using intermediate variables will make your code more readable, easier to maintain, AND a bit faster too:
# always set a timeout if you don't want the program to hang forever
channel_info_response = requests.get(channel_info_url, timeout=30)
# always check the response status - having a response doesn't
# mean you got what you expected. Here we use the `raise_for_status()`
# shortcut which will raise an exception if we have anything else than
# a 200 OK.
channel_info_response.raise_for_status()
# requests knows how to deal with json:
channel_info_response_data = channel_info_response.json()
# we assume that the response MUST have `['items'][0]`,
# and that this item MUST have "statistics" and "snippets"
item = channel_info_response_data['items'][0]
stats = item["statistics"]
snippet = item["snippet"]
no_of_videos = int(stats.get('videoCount', 0))
no_of_subscribers = int(stats.get('subscriberCount', 0))
no_of_views = int(stats.get('viewCount', 0))
avg_views = round(no_of_views / no_of_videos, 0) if no_of_videos else 0  # avoid ZeroDivisionError when videoCount is 0
try:
    photo = snippet['thumbnails']['high']['url']
except KeyError:
    photo = None
description = snippet.get('description', "")
start_date = snippet.get('publishedAt', None)
title = snippet.get('title', "")
try:
    keywords = item['brandingSettings']['channel']['keywords']
except KeyError:
    keywords = ""
You may also want to learn about string formatting (concatenating strings is quite error prone and barely readable), and how to pass arguments to requests.get().
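For that last point, here is a sketch of what the request could look like with params instead of manual concatenation, reusing the YOUTUBE_URL, YOUTUBE_API_KEY and channel_id names from your snippet:
channel_info_response = requests.get(
    YOUTUBE_URL + '/channels/',
    params={
        'key': YOUTUBE_API_KEY,
        'id': channel_id,
        'part': 'snippet,contentDetails,statistics,brandingSettings',
    },
    timeout=30,
)
channel_info_response.raise_for_status()
channel_info_response_data = channel_info_response.json()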
I'm working on a small project retrieving information about books from the Google Books API using Python 3. For this I make a call to the API, read out the variables and store them in a list. For a search like "linkedin" this works perfectly. However, when I enter "Google", it reads the second title from the JSON input instead of the first. How can this happen?
Please find my code below (Google_Results is the class I use to initialize the variables):
import requests

def Book_Search(search_term):
    parms = {"q": search_term, "maxResults": 3}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    print(r.url)
    results = r.json()
    i = 0
    for result in results["items"]:
        try:
            isbn13 = str(result["volumeInfo"]["industryIdentifiers"][0]["identifier"])
            isbn10 = str(result["volumeInfo"]["industryIdentifiers"][1]["identifier"])
            title = str(result["volumeInfo"]["title"])
            author = str(result["volumeInfo"]["authors"])[2:-2]
            publisher = str(result["volumeInfo"]["publisher"])
            published_date = str(result["volumeInfo"]["publishedDate"])
            description = str(result["volumeInfo"]["description"])
            pages = str(result["volumeInfo"]["pageCount"])
            genre = str(result["volumeInfo"]["categories"])[2:-2]
            language = str(result["volumeInfo"]["language"])
            image_link = str(result["volumeInfo"]["imageLinks"]["thumbnail"])
            dict = Google_Results(isbn13, isbn10, title, author, publisher, published_date, description, pages, genre,
                                  language, image_link)
            gr.append(dict)
            print(gr[i].title)
            i += 1
        except:
            pass
    return

gr = []
Book_Search("Linkedin")
I am a beginner to Python, so any help would be appreciated!
It does so because there is no publisher entry in volumeInfo of the first entry, thus it raises a KeyError and your except captures it. If you're going to work with fuzzy data you have to account for the fact that it will not always have the expected structure. For simple cases you can rely on dict.get() and its default argument to return a 'valid' default entry if an entry is missing.
Also, there are a few conceptual problems with your function: it relies on a global gr, which is bad design; it shadows the built-in dict type; and it captures all exceptions, guaranteeing that you cannot exit your code even with a SIGINT... I'd suggest converting it to something a bit more sane:
def book_search(search_term, max_results=3):
    results = []  # a list to store the results
    parms = {"q": search_term, "maxResults": max_results}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    try:  # just in case the server doesn't return valid JSON
        for result in r.json().get("items", []):
            if "volumeInfo" not in result:  # invalid entry - missing volumeInfo
                continue
            result_dict = {}  # a dictionary to store our discovered fields
            result = result["volumeInfo"]  # all the data we're interested in is in volumeInfo
            isbns = result.get("industryIdentifiers", None)  # capture ISBNs
            if isinstance(isbns, list) and isbns:
                for i, t in enumerate(("isbn10", "isbn13")):
                    if len(isbns) > i and isinstance(isbns[i], dict):
                        result_dict[t] = isbns[i].get("identifier", None)
            result_dict["title"] = result.get("title", None)
            authors = result.get("authors", None)  # capture authors
            if isinstance(authors, list) and authors:
                result_dict["author"] = str(authors)[2:-2]  # same bracket-stripping trick as your original code
            result_dict["publisher"] = result.get("publisher", None)
            result_dict["published_date"] = result.get("publishedDate", None)
            result_dict["description"] = result.get("description", None)
            result_dict["pages"] = result.get("pageCount", None)
            genres = result.get("categories", None)  # capture genres
            if isinstance(genres, list) and genres:
                result_dict["genre"] = str(genres)[2:-2]
            result_dict["language"] = result.get("language", None)
            result_dict["image_link"] = result.get("imageLinks", {}).get("thumbnail", None)
            # make sure Google_Results accepts keyword arguments like title, author...
            # and make them optional as they might not be in the returned result
            gr = Google_Results(**result_dict)
            results.append(gr)  # add it to the results list
    except ValueError:
        return None  # invalid response returned, you may raise an error instead
    return results  # return the results
Then you can easily retrieve as much info as possible for a term:
gr = book_search("Google")
And it will be far more tolerant of data omissions, provided that your Google_Results type makes most of the entries optional.
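Your Google_Results class isn't shown in the question, but for the keyword-argument call above to work, every field needs to be optional; a hypothetical sketch of such a class:
class Google_Results:
    def __init__(self, isbn10=None, isbn13=None, title=None, author=None,
                 publisher=None, published_date=None, description=None,
                 pages=None, genre=None, language=None, image_link=None):
        # every field is optional so missing JSON entries simply stay None
        self.isbn10 = isbn10
        self.isbn13 = isbn13
        self.title = title
        self.author = author
        self.publisher = publisher
        self.published_date = published_date
        self.description = description
        self.pages = pages
        self.genre = genre
        self.language = language
        self.image_link = image_link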
Following @Coldspeed's recommendation it became clear that missing information in the JSON caused the exception to be raised. Since I only had a "pass" statement there, it skipped the entire result. Therefore I will have to adapt the try/except statements so errors get handled properly.
Thanks for the help guys!
I have the following function, which reads a dict and assigns some values to local variables, which are then returned as a tuple.
The problem is that some of the desired keys may not exist in the dictionary.
So far I have this code, it does what I want but I wonder if there is a more elegant way to do it.
def getNetwork(self, search):
    data = self.get('ip', search)
    handle = data['handle']
    name = data['name']
    try:
        country = data['country']
    except KeyError:
        country = ''
    try:
        type = data['type']
    except KeyError:
        type = ''
    try:
        start_addr = data['startAddress']
    except KeyError:
        start_addr = ''
    try:
        end_addr = data['endAddress']
    except KeyError:
        end_addr = ''
    try:
        parent_handle = data['parentHandle']
    except KeyError:
        parent_handle = ''
    return (handle, name, country, type, start_addr, end_addr, parent_handle)
I'm kind of afraid of the numerous try/except blocks, but if I put all the assignments inside a single try/except, it would stop assigning values as soon as the first missing dict key raised an error.
Just use dict.get. Each use of:
try:
    country = data['country']
except KeyError:
    country = ''
can be equivalently replaced with:
country = data.get('country', '')
You could instead iterate through the keys and try each one: on success append the value to a list, and on failure append a " ":
ret = []
for key in ('country', 'type', 'startAddress', 'endAddress', 'parentHandle'):
    try:
        ret.append(data[key])
    except KeyError:
        ret.append(" ")
Then at the end of the function return a tuple:
return tuple(ret)
if that is necessary.
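The same idea can be written more compactly with a generator expression inside tuple(), since dict.get() already supplies the fallback; a sketch using the keys from the question:
optional_keys = ('country', 'type', 'startAddress', 'endAddress', 'parentHandle')
return (handle, name) + tuple(data.get(key, '') for key in optional_keys)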
Thanks ShadowRanger, with your answer I ended up with the following code, which is indeed more comfortable to read:
def getNetwork(self, search):
    data = self.get('ip', search)
    handle = data.get('handle', '')
    name = data.get('name', '')
    country = data.get('country', '')
    type = data.get('type', '')
    start_addr = data.get('startAddress', '')
    end_addr = data.get('endAddress', '')
    parent_handle = data.get('parentHandle', '')
    return (handle, name, country, type, start_addr, end_addr, parent_handle)
There are many elements that I need on each page I scrape, but many pages don't have all the items I need, so I end up having to wrap each and every item grab in
try:
    itemNeeded = soup.find(text="yada yada yada").next
except AttributeError:
    pass
This balloons my code by 400%.
Is there any way to abstract this away, or at least reduce the eyesore?
Edit: I'm not only searching for strings, but doing things like this as well:
navLinks = carSoup.find("span", "nav").findAll("a")
carDict['manufacturer'] = navLinks[1].next
carDict['model'] = navLinks[2].next
Build a list and iterate over it, using a small lookup table for the more complex cases. You just need to figure out how to iterate over the whole page in a smaller, simpler fashion:
text_list = ['items', 'to', 'search', 'for']
pre_find = {'items': (('span', 'nav'), 'a', ('manufacturer', 'model'))}
carDict = {}

for text in text_list:
    try:
        if text in pre_find:
            x = 1
            navLinks = carSoup.find(*pre_find[text][0]).findAll(pre_find[text][1])
            for item in pre_find[text][2]:
                carDict[item] = navLinks[x].next
                x += 1
        else:
            carDict[text] = soup.find(text=text).next
    except AttributeError:
        pass
Have you considered writing a more global try except block, something like:
try:
    itemNeeded = soup.find(text="yada yada yada").next
    nextItem = soup.find(text="blah blah blah").next
except AttributeError:
    pass
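Another way to cut down the repetition (just a sketch, not from the answers above; find_next is a made-up helper name) is a small wrapper that returns a default when find() comes back with None, so the AttributeError never happens:
def find_next(soup, default='', **kwargs):
    # soup.find(...) returns None when nothing matches; return a default
    # instead of letting .next raise AttributeError
    tag = soup.find(**kwargs)
    return tag.next if tag is not None else default

itemNeeded = find_next(soup, text="yada yada yada")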