I have the following block to write an XML tag. Sometimes the name is already in the correct form (that is, it won't raise an error), and sometimes it is not:
if 'Name' in title_data:
    name = etree.SubElement(info, 'Name')
    try:
        name.text = title_data['Name']
    except ValueError:
        name.text = title_data['Name'].decode('utf-8')
Is there a way to simplify this? For example, something along the lines of:
name.text = title_data['Name'] if (**something**) else title_data['Name'].decode('utf-8')
I assume that you want to avoid having to write similar code for every element you want to set. This has the smell of trying to treat the symptom rather than the cause, but if nothing else, you can simply break that out into a helper function:
def assign_text(field, text):
    try:
        field.text = text
    except ValueError:
        field.text = text.decode("utf-8")

# ...
if "Name" in title_data:
    name = etree.SubElement(info, "Name")
    assign_text(name, title_data["Name"] or None)
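One way to fill in the (**something**) from the question is an explicit type check instead of a try/except. The sketch below assumes Python 3, where undecoded values arrive as bytes. The helper name to_text and the sample data are made up for illustration, and the standard library's xml.etree.ElementTree stands in for whichever etree the original code imports.

```python
import xml.etree.ElementTree as etree


def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes (hypothetical helper)."""
    return value.decode(encoding) if isinstance(value, bytes) else value


info = etree.Element("Info")
title_data = {"Name": b"Caf\xc3\xa9"}  # made-up sample data

if "Name" in title_data:
    name = etree.SubElement(info, "Name")
    # bytes are decoded, str values pass through unchanged
    name.text = to_text(title_data["Name"])
```

Checking the type up front avoids relying on the assignment itself to raise, which not every etree implementation does.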
Related
For example, I am parsing an XML element in which all 4 child elements are required. My code looks like this, using the minidom library:
id = pattern.getElementsByTagName("id")[0].firstChild.data
name = pattern.getElementsByTagName("name")[0].firstChild.data
trigger = pattern.getElementsByTagName("trigger")[0].firstChild.data
test = pattern.getElementsByTagName("test")[0].firstChild.data
If the XML document lacks any of the 4 tags, I want to raise an IndexError. Should I use 4 try ... except blocks to catch the exception for each element, or should I catch all 4 similar exceptions in one big block? The benefit of catching the errors individually is that I can print a more explicit error message about which specific XML element is missing, but it looks verbose. Is there a good practice here?
try:
    id = pattern.getElementsByTagName("id")[0].firstChild.data
except IndexError:
    raise IndexError('id must exist in the xml file!')
try:
    name = pattern.getElementsByTagName("name")[0].firstChild.data
except IndexError:
    raise IndexError('name must exist in the xml file!')
try:
    test = pattern.getElementsByTagName("test")[0].firstChild.data
except IndexError:
    raise IndexError('test must exist in the xml file!')
try:
    trigger = pattern.getElementsByTagName("trigger")[0].firstChild.data
except IndexError:
    raise IndexError('trigger must exist in the xml file!')
OR
try:
    id = pattern.getElementsByTagName("id")[0].firstChild.data
    name = pattern.getElementsByTagName("name")[0].firstChild.data
    trigger = pattern.getElementsByTagName("trigger")[0].firstChild.data
    test = pattern.getElementsByTagName("test")[0].firstChild.data
except IndexError:
    raise IndexError('id, name, trigger and test must exist in the xml file!')
Which one is better, or are both not great?
Consider using a loop over the field names and packing the results into a dict instead!
results = {}
for field_name in ("id", "name", "trigger", "test"):
    try:
        results[field_name] = xml.getElementsByTagName(field_name)[0].firstChild.data
    except IndexError as ex:
        raise IndexError(f"failed to read '{field_name}' from xml {repr(ex)}")
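A variation on the loop, as a rough sketch: instead of stopping at the first missing tag, it collects every missing name and raises a single IndexError listing all of them. The function name read_required_fields and the sample document are invented for illustration. Treating an empty tag (no text child) as missing is an extra assumption that goes beyond the question.

```python
from xml.dom import minidom

REQUIRED_FIELDS = ("id", "name", "trigger", "test")


def read_required_fields(pattern, field_names=REQUIRED_FIELDS):
    """Return {tag: text}; raise one IndexError naming every missing tag."""
    results = {}
    missing = []
    for field_name in field_names:
        nodes = pattern.getElementsByTagName(field_name)
        # An absent tag gives an empty NodeList; an empty tag has no firstChild.
        if not nodes or nodes[0].firstChild is None:
            missing.append(field_name)
        else:
            results[field_name] = nodes[0].firstChild.data
    if missing:
        raise IndexError("missing in the xml: " + ", ".join(missing))
    return results


# made-up sample document
doc = minidom.parseString(
    "<pattern><id>7</id><name>demo</name><trigger>on</trigger><test>t</test></pattern>"
)
fields = read_required_fields(doc.documentElement)
```

This keeps the explicit per-field message of the four separate try blocks without repeating the code four times.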
I would like to create generic code (using Selenium) that looks for a label, then finds the input (or select) tag next to that label and inserts the value.
Main function:
for l in label:
    try:
        xpathInput = "//label[contains(.,'{}')]/following::input".format(l)
        checkXpathInput, pathInput= check_xpath(browser,xpathInput)
        if checkXpathInput is True:
            pathInput.clear()
            pathInput.send_keys("\b{}".format(value))
            break
        for op in option:
            xpathSelect = "//label[contains(.,'{}')]/following::select/option[text()='{}']".format(l,op)
            checkXpathSelect, pathSelect= check_xpath(browser,xpathSelect)
            if checkXpathSelect is True:
                pathSelect.click()
                break
    except:
        print("Can't match: {}".format(l))
Path checker:
def check_xpath(browser,xpath):
    try:
        path = browser.find_element_by_xpath(xpath)
    except NoSuchElementException:
        return False
    return True , path
What is the current issue?
If the label is, for example, "Title", the code should first check that there is no input tag next to the "Title" label, and then go on to check whether there is a select tag next to it, and so on.
Currently, it finds the label "Title" and then fills the value into the next input it finds (which is incorrect, as "Title" uses a select tag).
I'd exploit the fact that find_elements_by_xpath returns a list of found elements and empty lists are falsy. So you wouldn't need a try/except and a function which returns bool or tuple values (which is not the most optimal behavior).
It would be easier to give a good answer with some html source example but I assume what you'd like to do is this:
def handle_label_inputs(label, value):
    # if there is such a label, this result won't be empty
    found_labels = driver.find_elements_by_xpath('//label[contains(.,"{}")]'.format(label))
    # if the list is not empty
    if found_labels:
        l = found_labels[0]
        # any options with the given value as text
        following_select_option_values = l.find_elements_by_xpath('./following::select//option[text()="{}"]'.format(value))
        # any inputs next to the label
        following_inputs = l.find_elements_by_xpath('./following::input')
        # did we find an option?
        if following_select_option_values:
            following_select_option_values[0].click()
        # or is there an input?
        elif following_inputs:
            in_field = following_inputs[0]
            in_field.clear()
            in_field.send_keys(value)
        else:
            print("Can't match: {} - {}".format(label, value))

driver.get('http://thenewcode.com/166/HTML-Forms-Drop-down-Menus')
handle_label_inputs('State / Province / Territory', 'California')
I don't know how tidy the page you are working with is, but if it is well done, your label should have a for="something" attribute. If that is the case, you can simply find the element the label refers to and check whether its tag is input (or select):
related_element_if_done_properly = driver.find_elements_by_xpath('//*[@id="{}"]'.format(label_element.get_attribute("for")))
if related_element_if_done_properly:
    your_element = related_element_if_done_properly[0]
    is_input = your_element.tag_name.lower() == "input"
else:
    print('Ohnoes')
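The label-to-control lookup can be tried out without a browser. The following is only a sketch of the same idea: it uses the standard library's xml.etree.ElementTree on a small, well-formed, made-up form rather than Selenium on a live page, and the function name control_for_label is invented here. As with the Selenium calls above, the lookup returns a list, and an empty list means nothing matched.

```python
import xml.etree.ElementTree as ET

# made-up, well-formed markup standing in for the real page
FORM = """
<form>
  <label for="title">Title</label>
  <select id="title"><option>Mr</option><option>Ms</option></select>
  <label for="first">First name</label>
  <input id="first" type="text"/>
</form>
"""


def control_for_label(root, label_text):
    """Return the element a label points at via for=..., or None."""
    for label in root.iter("label"):
        if label_text in (label.text or ""):
            target_id = label.get("for")
            if target_id is None:
                return None
            # findall returns a list; an empty list means "no such id"
            matches = root.findall(".//*[@id='{}']".format(target_id))
            return matches[0] if matches else None
    return None


root = ET.fromstring(FORM)
title_control = control_for_label(root, "Title")
first_control = control_for_label(root, "First name")
```

Here title_control.tag tells you whether to click an option or type into a field. In Selenium the equivalent check is tag_name on the found element; recent Selenium releases spell the lookup as find_elements(By.XPATH, ...).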
I keep getting the following error when trying to parse some json:
Traceback (most recent call last):
  File "/Users/batch/projects/kl-api/api/helpers.py", line 37, in collect_youtube_data
    keywords = channel_info_response_data['items'][0]['brandingSettings']['channel']['keywords']
KeyError: 'brandingSettings'
How do I make sure that I check my JSON output for a key before assigning it to a variable? If a key isn’t found, then I just want to assign a default value. Code below:
try:
    channel_id = channel_id_response_data['items'][0]['id']
    channel_info_url = YOUTUBE_URL + '/channels/?key=' + YOUTUBE_API_KEY + '&id=' + channel_id + '&part=snippet,contentDetails,statistics,brandingSettings'
    print('Querying:', channel_info_url)
    channel_info_response = requests.get(channel_info_url)
    channel_info_response_data = json.loads(channel_info_response.content)
    no_of_videos = int(channel_info_response_data['items'][0]['statistics']['videoCount'])
    no_of_subscribers = int(channel_info_response_data['items'][0]['statistics']['subscriberCount'])
    no_of_views = int(channel_info_response_data['items'][0]['statistics']['viewCount'])
    avg_views = round(no_of_views / no_of_videos, 0)
    photo = channel_info_response_data['items'][0]['snippet']['thumbnails']['high']['url']
    description = channel_info_response_data['items'][0]['snippet']['description']
    start_date = channel_info_response_data['items'][0]['snippet']['publishedAt']
    title = channel_info_response_data['items'][0]['snippet']['title']
    keywords = channel_info_response_data['items'][0]['brandingSettings']['channel']['keywords']
except Exception as e:
    raise Exception(e)
You can either wrap the assignment in something like
try:
    keywords = channel_info_response_data['items'][0]['brandingSettings']['channel']['keywords']
except KeyError:
    keywords = "default value"
or, say, test for the key first (.has_key(...) in Python 2, the in operator in Python 3). IMHO, in your case the first solution is preferable.
Suppose you have a dict. You have two options to handle a missing key:
1) get the key with a default value, like
d = {}
val = d.get('k', 10)
val will be 10 since there is no key named k
2) try-except
d = {}
try:
    val = d['k']
except KeyError:
    val = 10
This way is far more flexible since you can do anything in the except block, even ignore the error with a pass statement if you really don't care about it.
How do I make sure that I check my JSON output
At this point your "JSON output" is just a plain native Python dict
for a key before assigning it to a variable? If a key isn’t found, then I just want to assign a default value
Now that you know you have a dict, browsing the official documentation for dict methods should answer the question:
https://docs.python.org/3/library/stdtypes.html#dict.get
get(key[, default])
Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.
so the general case is:
var = data.get(key, default)
Now if you have deeply nested dicts/lists where any key or index could be missing, catching KeyErrors and IndexErrors can be simpler:
try:
    var = data[key1][index1][key2][index2][keyN]
except (KeyError, IndexError):
    var = default
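If that try/except pattern is needed in many places, it can be wrapped in a small helper. This is a minimal sketch; the name dig and the sample response are hypothetical. It also catches TypeError, which is an addition to the snippet above, so that hitting None or a string halfway down the path yields the default too.

```python
def dig(data, *path, default=None):
    """Follow path through nested dicts/lists; return default if any step fails."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


response = {"items": [{"snippet": {"title": "demo"}}]}  # made-up sample

title = dig(response, "items", 0, "snippet", "title", default="")
keywords = dig(response, "items", 0, "brandingSettings", "channel", "keywords", default="")
```

The trade-off is that a helper like this hides which step was missing, so it suits optional fields better than required ones.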
As a side note: your code snippet is filled with repeated channel_info_response_data['items'][0]['statistics'] and channel_info_response_data['items'][0]['snippet'] expressions. Using intermediate variables will make your code more readable, easier to maintain, AND a bit faster too:
# always set a timeout if you don't want the program to hang forever
channel_info_response = requests.get(channel_info_url, timeout=30)
# always check the response status - having a response doesn't
# mean you got what you expected. Here we use the `raise_for_status()`
# shortcut, which raises an exception for 4xx and 5xx responses.
channel_info_response.raise_for_status()

# requests knows how to deal with json:
channel_info_response_data = channel_info_response.json()

# we assume that the response MUST have `['items'][0]`,
# and that this item MUST have "statistics" and "snippet"
item = channel_info_response_data['items'][0]
stats = item["statistics"]
snippet = item["snippet"]

no_of_videos = int(stats.get('videoCount', 0))
no_of_subscribers = int(stats.get('subscriberCount', 0))
no_of_views = int(stats.get('viewCount', 0))
# guard against dividing by zero when videoCount is missing or 0
avg_views = round(no_of_views / no_of_videos, 0) if no_of_videos else 0

try:
    photo = snippet['thumbnails']['high']['url']
except KeyError:
    photo = None

description = snippet.get('description', "")
start_date = snippet.get('publishedAt', None)
title = snippet.get('title', "")

try:
    keywords = item['brandingSettings']['channel']['keywords']
except KeyError:
    keywords = ""
You may also want to learn about string formatting (concatenating strings is quite error prone and barely readable), and how to pass arguments to requests.get().
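As a rough illustration of that last remark, the query parameters can live in a dict rather than being glued together with +. The sketch below uses only the standard library so it runs offline. BASE_URL is a placeholder rather than the real endpoint, and both function names are invented.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.invalid"  # placeholder, not the real endpoint
PARTS = ("snippet", "contentDetails", "statistics", "brandingSettings")


def channel_info_params(api_key, channel_id):
    """Collect the query parameters in one dict instead of gluing strings."""
    return {"key": api_key, "id": channel_id, "part": ",".join(PARTS)}


def channel_info_url(api_key, channel_id):
    """Build the full URL; urlencode takes care of escaping the values."""
    query = urlencode(channel_info_params(api_key, channel_id), safe=",")
    return "{}/channels/?{}".format(BASE_URL, query)


url = channel_info_url("MY_KEY", "abc 123")
```

With requests, the same dict can be handed over directly, as in requests.get(BASE_URL + "/channels/", params=channel_info_params(key, channel_id), timeout=30), and the library does the encoding.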
After running my script I noticed that my "parse_doc" function throws an error whenever it receives a link with no value. It turns out that my "process_doc" function was supposed to produce 25 links, but it produces only 19 because a few pages don't have any link leading to another page. When my second function receives one of those empty links, it fails with a "MissingSchema" error. How can I get around this, so that when it finds a link with no value it moves on to the next one? Here is the relevant portion of my script, which should give you an idea of what I mean:
def process_doc(medium_link):
    page = requests.get(medium_link).text
    tree = html.fromstring(page)
    try:
        name = tree.xpath('//span[@id="titletextonly"]/text()')[0]
    except IndexError:
        name = ""
    try:
        link = base + tree.xpath('//section[@id="postingbody"]//a[@class="showcontact"]/@href')[0]
    except IndexError:
        link = ""
    parse_doc(name, link)  # All links get to this function, whereas some links have no value

def parse_doc(title, target_link):
    page = requests.get(target_link).text  # Error thrown here when it finds any link with None value
    tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
    print(title, tel)
The error I'm getting:
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '': No schema supplied. Perhaps you meant http://?
Btw, in my first function there is a variable named "base" which is for concatenating with the produced result to make a full-fledged link.
If you want to skip the cases where target_link is None (or empty), try:
def parse_doc(title, target_link):
    if target_link:
        page = requests.get(target_link).text
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(tel)
        print(title)
This should let you handle only non-empty links and do nothing otherwise.
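A slightly stricter guard than a plain truthiness test is to check that the link actually looks like an absolute http(s) URL before requesting it. The helper below is a sketch with a made-up name (is_fetchable) and made-up sample links, and it relies only on the standard library.

```python
from urllib.parse import urlparse


def is_fetchable(url):
    """True only for a non-empty http(s) URL with a host part."""
    if not url:
        return False
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# made-up sample links: one usable, three that should be skipped
links = ["https://example.com/contact", "", None, "/relative/path"]
usable = [link for link in links if is_fetchable(link)]
```

In the script from the question, the check could go either at the top of parse_doc or right before process_doc calls it.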
First of all, make sure that your URL, and in particular its scheme, is correct. Sometimes you are just missing a character, or have one too many, in https://.
If you need to handle the exception, though, you can do it like this:
import requests
from requests.exceptions import MissingSchema
...
try:
    res = requests.get(linkUrl)
    print(res)
except MissingSchema:
    print('URL is not complete')
I'm writing a Python scraper for OpenData and I have one question: how do I check whether a value is missing on the site and, if it is, store null instead?
My scraper is here.
I'm currently working on optimizing it.
My variables now look like:
evcisloval = soup.find_all('td')[3].text.strip()
prinalezival = soup.find_all('td')[5].text.strip()
popisfaplnenia = soup.find_all('td')[7].text.replace('\"', '')
hodnotafaplnenia = soup.find_all('td')[9].text[:-1].replace(",", ".").replace(" ", "")
datumdfa = soup.find_all('td')[11].text
datumzfa = soup.find_all('td')[13].text
formazaplatenia = soup.find_all('td')[15].text
obchmenonazov = soup.find_all('td')[17].text
sidlofirmy = soup.find_all('td')[19].text
pravnaforma = soup.find_all('td')[21].text
sudregistracie = soup.find_all('td')[23].text
ico = soup.find_all('td')[25].text
dic = soup.find_all('td')[27].text
cislouctu = soup.find_all('td')[29].text
And the output:
scraperwiki.sqlite.save(unique_keys=["invoice_id"],
                        data={"invoice_id": number,
                              "invoice_price": hodnotafaplnenia,
                              "evidence_no": evcisloval,
                              "paired_with": prinalezival,
                              "invoice_desc": popisfaplnenia,
                              "date_received": datumdfa,
                              "date_payment": datumzfa,
                              "pay_form": formazaplatenia,
                              "trade_name": obchmenonazov,
                              "trade_form": pravnaforma,
                              "company_location": sidlofirmy,
                              "court": sudregistracie,
                              "ico": ico,
                              "dic": dic,
                              "accout_no": cislouctu,
                              "invoice_attachment": urlfa,
                              "invoice_url": url})
I googled it but without success.
First, write a configuration dict of your variables in the form:
conf = {'evidence_no': (3, str.strip),
        'trade_form': (21, None),
        ...}
That is, the key is the output key, and the value is a tuple of the index into soup.find_all('td') and an optional function to apply to the result (None if there is none). You don't need those Slavic variable names that may confuse other SO members.
Then iterate over conf and fill the data dict.
Also, run soup.find_all('td') before the loop.
tds = soup.find_all('td')
data = {}
for name, (num, func) in conf.items():  # .iteritems() exists only in Python 2
    text = tds[num].text
    # replace text with None or "NULL" or whatever if needed
    ...
    if func is None:
        data[name] = text
    else:
        data[name] = func(text)
This will remove a lot of duplicated code. Easier to maintain.
Also, I am not sure the strings "NULL" are the best way to write missing data. Doesn't sqlite support Python's real None objects?
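To make the idea concrete, here is a minimal sketch that works on a plain list of cell strings instead of BeautifulSoup tags, so it runs on its own. The names CONF, clean_price and extract, the sample row, and the choice to turn missing or blank cells into None are all assumptions for illustration.

```python
def clean_price(text):
    """Drop the trailing currency sign and normalise the decimal separator."""
    return text[:-1].replace(",", ".").replace(" ", "")


# output key -> (index into the cell list, optional clean-up function)
CONF = {
    "evidence_no": (3, str.strip),
    "invoice_price": (9, clean_price),
    "trade_form": (21, None),
}


def extract(cells, conf=CONF):
    """Build the data dict; missing or blank cells become None."""
    data = {}
    for name, (index, func) in conf.items():
        # a row that is too short is treated like an empty cell
        text = cells[index] if index < len(cells) else ""
        if func is not None:
            text = func(text)
        data[name] = text if text.strip() else None
    return data


cells = [""] * 10            # made-up row with only 10 cells
cells[3] = "  EV-1  "
cells[9] = "1 234,50\u20ac"  # price with a trailing euro sign
row = extract(cells)
```

With real pages, cells would be something like [td.text for td in soup.find_all('td')], computed once before the loop.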
Just read your attached link, and it seems what you want is
evcisloval = soup.find_all('td')[3].text.strip() or "NULL"
But be careful. You should only do this with strings. If the part before the or is empty, False, None, or 0, it will be replaced with "NULL" in every one of those cases.
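The difference shows up as soon as a value such as 0 is involved. The comparison below is a small sketch; the helper name text_or_null is made up, and it replaces only None and blank strings.

```python
def text_or_null(value, placeholder="NULL"):
    """Replace only None and blank strings; keep 0 and False as they are."""
    if value is None:
        return placeholder
    if isinstance(value, str) and not value.strip():
        return placeholder
    return value


samples = ("", None, 0, "x")
plain_or = [v or "NULL" for v in samples]       # 0 is lost here
explicit = [text_or_null(v) for v in samples]   # 0 survives here
```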