Try/Except not working with BeautifulSoup

I am trying to loop over a series of pages and extract some info. However, in certain pages some exceptions occur and I need to deal with them. I created the following function to try to deal with them. See below:
def iferr(x):
    try:
        x
    except (Exception, TypeError, AttributeError) as e:
        pass
I intend to use it as part of code like this:
articles = [[iferr(dp[0].find('span', class_='citation')),\
             iferr(dp[0].find('div', class_='abstract')),\
             iferr(dp[0].find('a', rel='nofollow')['href'])] for dp in data]
The idea is that if, for example, dp[0].find('a', rel='nofollow')['href'] leads to an error (fails), it will simply ignore it (fill it with a blank or a None).
However, whenever an error/exception occurs in one of the three elements, it does not 'pass'. It just tells me that the error has occurred. The errors it displays are those I listed in the 'except' clause, which I assumed would be dealt with.
EDIT:
Per Michael's suggestion, I was able to see that the argument passed to iferr is evaluated first, so the error is always raised before the try is reached. So I worked on a workaround:
def fndwoerr(d,x,y,z,h):
    try:
        if not h:
            d.find('x',y = 'z')
        else:
            d.find('x',y = 'z')['h']
    except (Exception, TypeError, AttributeError) as e:
        pass
...
articles = [[fndwoerr(dp[0],'span','class_','citation',None),\
             fndwoerr(dp[0],'div','class_','abstract',None),\
             fndwoerr(dp[0], 'a', 'rel','nofollow','href')] for dp in data]
Now it runs without raising an error. However, everything returned becomes None. I am pretty sure it has to do with the way the parameters are entered. y should not be passed as a string in the find function, whereas z should. However, I input both as strings when I call the function. How can I go about this?

The example looks a bit strange, so it would be a good idea to improve the question so that we can reproduce your issue easily. You may want to read how to create a minimal, reproducible example.
The idea is that if, for example, dp[0].find('a',
rel='nofollow')['href'] leads to an error (fails), it will simply
ignore it (fill it with a blank or a None).
What about checking if element is available with an if-statement?
dp[0].find('a', rel='nofollow').get('href') if dp[0].find('a', rel='nofollow') else None
or with the walrus operator from Python 3.8:
l.get('href') if (l := dp[0].find('a', rel='nofollow')) else None
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<h1>This is a Heading</h1>', 'html.parser')
for e in soup.select('h1'):
    print(e.find('a').get('href') if e.find('a') else None)
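If you would rather keep a try/except helper, one possible sketch is below. It defers the lookup by wrapping it in a lambda, so the expression is evaluated inside the try rather than before the call, and it passes the keyword name as a string with ** unpacking. The names safe, find_kw and StubTag are made up for this illustration; StubTag only stands in for a BeautifulSoup tag so the snippet runs without bs4 installed.

```python
def safe(fn, default=None):
    """Call fn() and return its result, or `default` if the lookup fails."""
    try:
        return fn()
    except (TypeError, AttributeError, KeyError):
        return default


def find_kw(tag, name, key, value):
    """Call tag.find(name, key=value) with the keyword name given as a string."""
    return tag.find(name, **{key: value})


class StubTag:
    """Hypothetical stand-in for a bs4 tag: find() returns None on no match."""

    def __init__(self, children):
        self.children = children  # list of (name, attrs) pairs

    def find(self, name, **attrs):
        for child_name, child_attrs in self.children:
            if child_name == name and all(
                child_attrs.get(k) == v for k, v in attrs.items()
            ):
                return child_attrs
        return None


tag = StubTag([("a", {"rel": "nofollow", "href": "https://example.com"})])

# The lambda delays evaluation until safe() is already inside its try block.
link = safe(lambda: find_kw(tag, "a", "rel", "nofollow")["href"])
missing = safe(lambda: find_kw(tag, "div", "class_", "abstract")["href"])
print(link, missing)
```

With a real soup the call would look like safe(lambda: dp[0].find('a', rel='nofollow')['href']), assuming dp[0] is a tag.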

Related

attributes not appearing in xml2dict parse output even with xml_attribs = True

I am having a problem with python xmltodict. Following the near-consensus recommendation here, I tried xmltodict and liked it very much until I had to access attributes at the top level of my handler. I'm probably doing something wrong but it's not clear to me what. I have an xml document looking something like this
<api>
  <cons id="79550" modified_dt="1526652449">
    <firstname>Mackenzie</firstname>
    ...
  </cons>
  <cons id="79551" modified_dt="1526652549">
    <firstname>Joe</firstname>
    ...
  </cons>
</api>
I parse it with this:
xmltodict.parse(apiResult.body, item_depth=2, item_callback=handler, xml_attribs=True)
where apiResult.body contains the xml shown above. But, in spite of the xml_attribs=True, I see no @id or @modified_dt in the output after parsing in the handler, although all the elements in the original do appear.
The handler is coded as follows:
def handler(_, cons):
    print (cons)
    mc = MatchChecker(cons)
    mc.check()
    return True
What might I be doing wrong?
I've also tried xmljson and instantly don't like it as well as xmltodict, if only I had the way around this issue. Does anyone have a solution to this problem or a package that would handle this better?

xmltodict works just fine, but you are passing the argument item_depth=2, which means your handler will only see the elements inside the <cons> elements rather than the <cons> element itself.
import xmltodict

xml = """
<api>
  <cons id="79550" modified_dt="1526652449">
    <firstname>Mackenzie</firstname>
  </cons>
</api>
"""
def handler(_,arg):
    for i in arg.items():
        print(i)
    return True
xmltodict.parse(xml, item_depth=2, item_callback=handler, xml_attribs=True)
Prints ('firstname', 'Mackenzie') as expected.
Whereas:
xmltodict.parse(xml, item_depth=1, item_callback=handler, xml_attribs=True)
Prints ('cons', OrderedDict([('@id', '79550'), ('@modified_dt', '1526652449'), ('firstname', 'Mackenzie')])), again as expected.

Python: How go to next statement when "Unsuccessful command:"occur

I have the following command:
testing = plateX = g_o.getresults(g_o.Phases[0], g_o.ResultTypes.NodeToNodeAnchor.X, 'node')
It does not work for "Phases[x]" when x is 0 to 2, which gives the following error:
plxscripting.plx_scripting_exceptions.PlxScriptingError: Unsuccessful command:
The command did not deliver any results
but it works from "Phases[3]" onward, whatever the number is. However, in my script the x value in "Phases[x]" is different in every case. What I want is to test the x value until it works without error...
So I'm guessing something like this:
for phase in g_o.Phases[0:]:
    try:
        testing = plateX = g_o.getresults(phase, g_o.ResultTypes.NodeToNodeAnchor.X, 'node')
    except ???:
        pass
But what should I put in the "???"? Or is there some other way to do this?
Thanks in advance..

Use the type of error listed in the traceback. In the example error message you provided, this is a PlxScriptingError. So:
except PlxScriptingError:
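To stop at the first phase that works, the loop can return as soon as a call succeeds. The sketch below shows one way to do that; first_successful and fake_getresults are hypothetical names, and the local PlxScriptingError class only stands in for plxscripting.plx_scripting_exceptions.PlxScriptingError so that the snippet runs without Plaxis.

```python
class PlxScriptingError(Exception):
    """Stand-in for plxscripting.plx_scripting_exceptions.PlxScriptingError."""


def first_successful(phases, fetch):
    """Return (phase, result) for the first phase where fetch(phase) succeeds."""
    for phase in phases:
        try:
            return phase, fetch(phase)
        except PlxScriptingError:
            continue  # no results for this phase, try the next one
    return None, None  # every phase failed


def fake_getresults(phase):
    # Hypothetical stand-in for g_o.getresults: phases 0-2 deliver no results.
    if phase < 3:
        raise PlxScriptingError("The command did not deliver any results")
    return [float(phase)]


phase, plate_x = first_successful(range(6), fake_getresults)
print(phase, plate_x)
```

In the real script the fetch argument would presumably be something like lambda p: g_o.getresults(p, g_o.ResultTypes.NodeToNodeAnchor.X, 'node'), with the exception imported from plxscripting.plx_scripting_exceptions.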

pdfquery/PyQuery: example code shows no AttributeError but mine does...why?

I'm following the example code found here. The author has some documentation where he lists the steps that he used to write the program. When I run the whole program together it runs perfectly, but when I follow the steps he describes I get an AttributeError.
Here's my code
pdf = pdfquery.PDFQuery("Aberdeen_2015_1735t.pdf")
pdf.load()
pdf.tree.write("test3.xml", pretty_print=True, encoding="utf-8")
sept = pdf.pq('LTPage[pageid=\'1\'] LTTextLineHorizontal:contains("SEPTEMBER")')
print(sept.text())
x = float(sept.get('x0'))
y = float(sept.get('y0'))
cells = pdf.extract( [
    ('with_parent','LTPage[pageid=\'1\']'),
    ('cells', 'LTTextLineHorizontal:in_bbox("%s,%s,%s,%s")' % (x, y, x+600, y+20))
])
Everything runs fine until it gets to "sept.get" where it says that "'PyQuery' object has no attribute 'get'." Does anyone know why the program wouldn't encounter this error when it's run all together but it occurs when a piece of the code is run?

According to the PyQuery API reference, a PyQuery object indeed doesn't have a get member. The code example must be obsolete.
According to https://pypi.python.org/pypi/pdfquery, attributes are retrieved with .attr:
x = float(sept.attr('x0'))
Judging by the history of pyquery's README.rst, get was never documented and only worked due to some side effect (some delegation to a dict, perhaps).

SPSS-Python Script stops with an Error when spss commands inside spss.Submit() would create a warning

Let's assume I have two lists of variables
list a: a1 a2 a3
list b: b1 b2 b3
which I want to process in a way like this:
TEMPORARY.
SELECT IF a1=b1.
FREQUENCY someVar.
TEMPORARY.
SELECT IF a2=b2.
FREQUENCY someVar.
TEMPORARY.
SELECT IF a3=b3.
FREQUENCY someVar.
I tried to do this within a Python loop:
BEGIN PROGRAM.
import spss
la = ['a1', 'a2', 'a3']
lb = ['b1', 'b2', 'b3']
for a, b in zip(la, lb):
    spss.Submit('''
TEMPORARY.
SELECT IF %s=%s.
FREQUENCY someVar.
''' % (a, b))
END PROGRAM.
So far so good. This works except when the SELECT IF command would create an empty dataset. Outside the Python program block this would cause the following warning message in the output viewer:
No cases were input to this procedure. Either there are none in the
working data file or all of them have been filtered out. Execution of
this command stops.
But inside a Python block it causes an Error and the python script to stop.
Traceback (most recent call last):
File "", line 7, in
File "C:\PROGRA~1\ibm\SPSS\STATIS~1\23\Python\Lib\site-packages\spss\spss.py", line 1527, in Submit
raise SpssError,error
spss.errMsg.SpssError: [errLevel 3] Serious error.
Is there a way to run this loop (which might produce temporary empty data sets and therefore warnings) inside of python?

Yes, if you wrap the problematic function inside a try-except construct:
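for a, b in zip(la, lb):
    try:
        spss.Submit('''
TEMPORARY.
SELECT IF %s=%s.
FREQUENCY someVar.
''' % (a, b))
    except spss.SpssError:
        # Skip pairs whose SELECT IF leaves no cases; a bare "except:" would also hide unrelated bugs.
        pass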

You could also use Python APIs to calculate how many cases ai=bi and execute conditional blocks according to this.
So, for example, if fewer than 5 valid cases remain, you may not want to produce any output (or only some output noting that nothing is produced because of the low base size). If fewer than 50 cases remain, you may want to run frequencies, and if more than 50 cases remain, run descriptives, etc. There are a number of ways to get the case count; which approach is best and most efficient depends on your data set and end goal.
See for example:
spss.GetCaseCount
Here's another approach where you can get case count like statistics from active dataset to inspire further ideas.

Adding some explanation: Statistics syntax errors have an associated severity level between 1 (lowest) and 5. You will probably never see a 5, because that means that the system has gone down in flames. When running syntax from Python via Submit, level 1 and 2 errors, which are warnings that don't stop the syntax from running, are executed normally. Level 3 and higher raise exceptions. You can handle those in your Python code via the try/except mechanism as was suggested above.

I experimented with SET MXWARNS, however neither setting it to zero, nor setting it to a very high value (e.g. 1000000) worked. The warnings were still converted into errors. So I wrote this work-around:
import codecs
import re
import sys
import spss
from spssaux import getShow
def submit_syntax(sps_filename):
    output_on = spss.IsOutputOn()
    spss.SetOutput("off")
    unicode_mode = getShow("unicode") == u"Yes"
    encoding = "utf-8-sig" if unicode_mode else getShow("locale").split(u".")[-1]
    if output_on:
        spss.SetOutput("on")
    with codecs.open(sps_filename, encoding=encoding) as f:
        syntax = f.read()
    statements = re.split(ur"\. *\r\n", syntax, flags=re.MULTILINE)
    for stmtno, statement in enumerate(statements, 1):
        if statement.startswith(u"*"):
            continue
        try:
            spss.Submit(statement)
        except spss.SpssError:
            # "no cases were input" warnings are translated into errors.
            if not spss.GetCaseCount() and spss.GetLastErrorLevel() <= 3:
                continue
            msg = u"ERROR in statement #%d: %s [%s]"
            raise RuntimeError(msg % (stmtno, statement, sys.exc_info()[1]))

Is there a way to decode numerical COM error-codes in pywin32

Here is part of a stack-trace from a recent run of an unreliable application written in Python which controls another application written in Excel:
pywintypes.com_error: (-2147352567, 'Exception occurred.', (0, None, None, None, 0, -2146788248), None)
Obviously something has gone wrong ... but what?[1] These COM error codes seem to be excessively cryptic.
How can I decode this error message? Is there a table somewhere that allows me to convert this numerical error code into something more meaningful?
[1] I actually know what went wrong in this case: it was attempting to access a Name property on a Range object which did not have a Name property... not all bugs are this easy to find!

You are not doing anything wrong. The first item in your stack trace (the number) is the error code returned by the COM object. The second item is the description associated with the error code which in this case is "Exception Occurred". pywintypes.com_error already called the equivalent of win32api.FormatMessage(errCode) for you. We'll look at the second number in a minute.
By the way, you can use the "Error Lookup" utility that comes in Visual Studio (C:\Program Files\Microsoft Visual Studio 9.0\Common7\Tools\ErrLook.exe) as a quick launching pad to check COM error codes. That utility also calls FormatMessage for you and displays the result. Not all error codes will work with this mechanism, but many will. That's usually my first stop.
Error handling and reporting in COM is a bit messy. I'll try to give you some background.
All COM method calls will return a numeric code called an HRESULT that can indicate success or failure. All forms of error reporting in COM build on top of that.
The codes are commonly expressed in hex, although sometimes you will see them as large 32-bit numbers, like in your stack trace. There are all kinds of predefined return codes for common results and problems, or the object can return custom numeric codes for special situations. For example, the value 0 (called S_OK) universally means "No error" and 0x8007000E is E_OUTOFMEMORY on Win32. Sometimes the HRESULT codes are returned by the object, sometimes by the COM infrastructure.
A COM object can also choose to provide much richer error information by implementing an interface called IErrorInfo. When an object implements IErrorInfo, it can provide all kinds of detail about what happened, such as a detailed custom error message and even the name of a help file that describes the problem. In VB6 and VBA, the Err object allows you to access all that information (Err.Description, etc).
To complicate matters, late bound COM objects (which use a mechanism called COM Automation or IDispatch) add some layers that need to be peeled off to get information out. Excel is usually manipulated via late binding.
Now let's look at your situation again. What you are getting as the first number is a fairly generic error code: DISP_E_EXCEPTION. Note: you can usually figure out the official name of an HRESULT by googling the number, although sometimes you will have to use the hex version to find anything useful.
Errors that begin with DISP_ are IDispatch error codes. The error loosely means "There was a COM exception thrown by the object", with more information packed elsewhere (although I don't quite know where; I'll have to look it up).
From what I understand of pywintypes.com_error, the last number in your message is the actual error code that was returned by the object during the exception. It's the actual numeric code that you would get out of VBA's Err.Number.
Unfortunately, that second code -2146788248 (0x800A9C68) is in the range reserved for custom application-defined error messages (in VBA: VbObjectError + someCustomErrorNumber), so there is no centralized meaning. The same number can mean entirely different things for different programs.
In this case, we have reached a dead end:
The error code is "custom", and the application needs to document what it is, except that Excel doesn't. Also, Excel (or the actual source of the error) doesn't seem to be providing any more information via IErrorInfo.
Excel is notorious (to me at least) for cryptic error codes from automation and obscure situations that cause them. This is especially so for errors that one could consider "design-time errors" ("you should have known better than calling a method that doesn't exist in the object"). Instead of a nice "Could not read the Name property", you get "Run-time error '1004': Application defined or object-defined error" (which I just got by trying to access a Name property on a Range, from VBA in Excel). That is NOT very helpful.
The problem is not rooted in Python or its interface to Excel. Excel itself doesn't explain what happened, even to VBA.
However, the general procedure above remains valid. If you get an error from Excel in the future, you might get a better error message that you can track the same way.
Good luck!
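For reference, here is a small sketch of how the two numbers from the stack trace break down under the standard HRESULT bit layout (bit 31 is the severity, bits 16-26 the facility, bits 0-15 the code). The helper name decode_hresult is made up for this illustration and is not part of pywin32.

```python
def decode_hresult(hr):
    """Split an HRESULT (signed or unsigned) into its standard bit fields."""
    value = hr & 0xFFFFFFFF  # view the value as an unsigned 32-bit number
    return {
        "hex": "0x%08X" % value,
        "failed": bool(value >> 31),        # bit 31: severity (1 = failure)
        "facility": (value >> 16) & 0x7FF,  # bits 16-26: facility code
        "code": value & 0xFFFF,             # bits 0-15: error code
    }


outer = decode_hresult(-2147352567)  # first number in the stack trace
inner = decode_hresult(-2146788248)  # last number in the excepinfo tuple
print(outer)
print(inner)
```

The first value comes out as 0x80020009 with facility 2 (FACILITY_DISPATCH), which matches DISP_E_EXCEPTION. The second comes out as 0x800A9C68 with facility 10 (FACILITY_CONTROL), whose codes are defined by the application that raised them.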

Do it like this:
try:
    [whatever code]
except pythoncom.com_error as error:
    print(win32api.FormatMessage(error.excepinfo[5]))
More information on digesting the pythoncom.com_error object here: https://web.archive.org/web/20170831073447/http://docs.activestate.com/activepython/3.2/pywin32/com_error.html

Yes try the win32api module:
import win32api
e_msg = win32api.FormatMessage(-2147352567)
You can grab any codes returned from the exception and pass them to FormatMessage. Your example had 2 error codes.

Specifically for pythoncom, the error codes that result are more than cryptic. This is because pythoncom represents them internally as a 32-bit signed integer, whereas the constants are normally written as 32-bit unsigned hex values. As a result, the number you end up seeing in the stack trace does not match the documented constant.
In particular, your exception, according to pythoncom, is -2147352567, and your (for lack of a better word) Err.Number is -2146788248.
This however causes some issues when watching for specific errors, like below:
DISP_E_EXCEPTION = 0x80020009
#...
#except pywintypes.com_error as e:
#    print repr(e)
#    #pywintypes.com_error: (-2147352567, 'Exception occurred.', (0, None, None, None, 0, -2146788248), None)
#    hr = e.hresult
hr = -2147352567
if hr == DISP_E_EXCEPTION:
    pass #This never occurs
else:
    raise
To see why this has issues, let's look into these error codes:
>>> DISP_E_EXCEPTION = 0x80020009
>>> DISP_E_EXCEPTION
2147614729L
>>> my_hr = -2147352567
>>> my_hr == DISP_E_EXCEPTION
False
Again, this is because Python sees the constant declared as positive, while pythoncom's signed representation of the same bits is negative. Of course, the most obvious solution fails:
>>> hex(my_hr)
'-0x7ffdfff7'
The solution is to properly interpret the number. Luckily, pythoncom's representation is reversible. We need to interpret the negative number as a 32 bit signed integer, then interpret that as an unsigned integer:
def fix_com_hresult(hr):
    import struct
    return struct.unpack("L", struct.pack("l", hr))[0]
>>> DISP_E_EXCEPTION = 0x80020009
>>> my_hr = -2147352567
>>> my_hr == DISP_E_EXCEPTION
False
>>> fixed_hr = fix_com_hresult(my_hr)
>>> fixed_hr
2147614729L
>>> fixed_hr == DISP_E_EXCEPTION
True
So, putting it all together, you need to run fix_com_hresult() on that result from pythoncom, essentially all the time.
Since normally you need to do this when checking for exceptions, I created these functions:
def fix_com_exception(e):
    e.hresult = fix_com_hresult(e.hresult)
    e.args = [e.hresult] + list(e.args[1:])
    return e
def fix_com_hresult(hr):
    import struct
    return struct.unpack("L", struct.pack("l", hr))[0]
which can then be used how you expect:
DISP_E_EXCEPTION = 0x80020009
try:
    #failing call
except pywintypes.com_error as e:
    print repr(e)
    #pywintypes.com_error: (-2147352567, 'Exception occurred.', (0, None, None, None, 0, -2146788248), None)
    fix_com_exception(e)
    print repr(e)
    #pywintypes.com_error: (2147614729L, 'Exception occurred.', (0, None, None, None, 0, -2146788248), None)
    if e.hresult == DISP_E_EXCEPTION:
        print "Got expected failure"
    else:
        raise
I was unable to find a MSDN document listing all HRESULTs, but I found this: http://www.megos.ch/support/doserrors_e.txt
Also, since you have it, fix_com_hresult() should also be run on your extended error code (-2146788248), but as Euro Micelli said, it doesn't help you in this particular instance :)
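A possible alternative to the struct round trip, sketched here for Python 3, is to mask the value to 32 bits. The native "l" and "L" struct formats are 4 bytes on Windows but 8 bytes on most 64-bit Unix builds, so masking avoids depending on the platform. The function name to_unsigned_hresult is made up for this illustration.

```python
def to_unsigned_hresult(hr):
    """Reinterpret a signed 32-bit HRESULT as its unsigned 32-bit value."""
    return hr & 0xFFFFFFFF


DISP_E_EXCEPTION = 0x80020009

hr = -2147352567     # e.hresult as reported by pythoncom
scode = -2146788248  # last element of e.excepinfo

print(hex(to_unsigned_hresult(hr)))
print(hex(to_unsigned_hresult(scode)))
print(to_unsigned_hresult(hr) == DISP_E_EXCEPTION)
```

Values that are already unsigned pass through unchanged, so the helper can be applied to every code before comparing it with a hex constant.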

No-one has yet mentioned the strerror attribute of the pywintypes.com_error Exception. This returns the result of FormatMessage for the error code. So instead of doing it yourself like this
try:
    [whatever code]
except pythoncom.com_error as error:
    print(win32api.FormatMessage(error.excepinfo[5]))
You can just do this:
try:
    [whatever code]
except pythoncom.com_error as error:
    print(error.strerror)
Note it will return None if you have a non-standard HRESULT :(
