Python - Error Parsing HTML w/ BeautifulSoup


I'm developing an app that enters data into a webpage shopping cart and verifies the total. That part works fine; however, I am having an issue parsing the HTML output.
A previous discussion, "retrieving essential data from a webpage using python", recommended using BeautifulSoup to solve that user's problem.
I've borrowed some of the Python code and got it to work on a MacOS system. However, when I copied the code over to an Ubuntu installation, I started seeing a strange error.
The code (where I'm seeing the issue):
response = opener.open(req)
html = response.read()
doc = BeautifulSoup.BeautifulSoup(html)
table = doc.find('tr', {'id':'carttablerow0'})
dump = [cell.getText().strip() for cell in table.findAll('td')]
print "\n Catalog Number: %s \n Description: %s \n Price: %s\n" %(dump[0], dump[1], dump[5])
The error (on the Ubuntu server):
Traceback (most recent call last):
  File "./shopping_cart_checker.py", line 49, in <module>
    dump = [cell.getText().strip() for cell in table.findAll('td')]
TypeError: 'NoneType' object is not callable
I think I've narrowed it down to getText() being the culprit, but I'm not certain why this works on MacOS and not on Ubuntu.
Any suggestions?
Thank you.
#########################
Hi guys,
Thank you for the various suggestions. I've attempted most of them (including incorporating the "if cell" statement into the code); however, it still isn't working.
@Ignacio Vazquez-Abrams: Here's a copy of the HTML I'm attempting to strip:
http://pastebin.com/WdaeExnC

As to why it doesn't work on Ubuntu, no idea. However, you can try this:
dump = [(cell.getText() if cell.getText() else '').strip() for cell in table.findAll('td')]

It doesn't seem to be a problem with the code but with the HTML you are reading. What I would do is change your code to this:
dump = [cell.getText().strip() for cell in table.findAll('td') if cell]
That way, if cell is None, it will not try to execute getText and will just skip that cell. You should debug if you can; I recommend pdb or ipdb (the one I like to use). Here is a tutorial; with it you can stop just before the line and print values, etc.
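
One explanation that the thread does not confirm, but which fits the traceback: in BeautifulSoup 3, looking up an attribute that a tag does not define is treated as a search for a child tag of that name, and the result is None when no such child exists. On a release without getText, cell.getText would therefore be None, and calling it raises "'NoneType' object is not callable". In that case the "if cell" filter cannot help, because cell itself is not None. The sketch below reproduces the effect with a hypothetical stand-in class (OldStyleTag is not BeautifulSoup) and a hypothetical helper, cell_text, that falls back to findAll(text=True) when getText is unavailable. It is written for Python 3.

```python
class OldStyleTag(object):
    """Hypothetical stand-in for a tag from an older parser release.

    Like BeautifulSoup 3, unknown attribute names resolve to None
    instead of raising AttributeError.
    """

    def __init__(self, strings):
        self._strings = strings

    def findAll(self, text=False):
        # Only the text=True form is modeled here.
        return list(self._strings) if text else []

    def __getattr__(self, name):
        # Called only when normal lookup fails, e.g. for getText.
        return None


def cell_text(cell):
    """Return the stripped text of a cell, with or without getText."""
    get_text = getattr(cell, 'getText', None)
    if callable(get_text):
        return get_text().strip()
    # Fallback: join the text nodes ourselves.
    return ''.join(cell.findAll(text=True)).strip()


cell = OldStyleTag([' 12345 ', ''])

# Calling the missing method reproduces the error from the traceback.
try:
    cell.getText()
except TypeError as exc:
    error_message = str(exc)

# The helper avoids the call and extracts the text another way.
text = cell_text(cell)
```

Comparing BeautifulSoup.__version__ on the two machines would confirm or rule out this explanation.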

Related

Syntax error when trying to open a file through Python 3 Command Shell

I am learning Python through Udemy tutorials, and now I need to open a file from CMD (CMD is opened in the folder I need). When I type the function for opening the file, I get a syntax error, even though I have done everything the way the instructor in the tutorial does. I really don't know what I should do; I have checked all of the forums and still can't find the answer.
Here are some screenshots:

A couple of issues:
1. Your text file is called "example.txt.txt" instead of "example.txt".
2. The "example.txt","r" should be surrounded with parentheses () instead of angle brackets <>. These symbols look similar in cmd and are easy to confuse.
#instead of
file = open<"example.txt","r">
#use
file = open("example.txt","r")
This should fix your problem; let me know if it does.

You have to use parentheses (), not <>:
file = open("example.txt","r")
See https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
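
For reference, here is a minimal sketch of the corrected call, written with a with block so the file is closed automatically. It first creates example.txt in a temporary directory so the snippet runs on its own; that setup step and the file contents are only for illustration and are not part of the original question.

```python
import os
import tempfile

# Illustration only: create a small example.txt to read back.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")
with open(path, "w") as handle:
    handle.write("first line\nsecond line\n")

# Function calls use parentheses; the file closes when the block ends.
with open(path, "r") as handle:
    lines = handle.read().splitlines()

print(lines)
```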

Why isn't my code working in command line?

I am using Python 3.6 and I need to run my code from the command line. The code works when I run it in PyCharm, but when I use the command line I get this error:
  File "path", line 43, in <module>
    rb = ds.GetRasterBand(1)
AttributeError: 'NoneType' object has no attribute 'GetRasterBand'
It seems that I have a problem with these lines:
ds = gdal.Open('tif_file.tif', gdal.GA_ReadOnly)
rb = ds.GetRasterBand(1)
img_array = rb.ReadAsArray()
Does anyone know what I might have done wrong?
EDIT
Some magic just happened. I tried to run my code this morning and everything seems fine. I guess what my computer needed was a restart or something. Thanks to you all for help.

From the GDAL documentation:
from osgeo import gdal
dataset = gdal.Open(filename, gdal.GA_ReadOnly)
if not dataset:
    ...
Note that if GDALOpen() returns NULL it means the open failed, and
that an error messages will already have been emitted via CPLError().
If you want to control how errors are reported to the user review the
CPLError() documentation. Generally speaking all of GDAL uses
CPLError() for error reporting. Also, note that pszFilename need not
actually be the name of a physical file (though it usually is). It's
interpretation is driver dependent, and it might be an URL, a filename
with additional parameters added at the end controlling the open or
almost anything. Please try not to limit GDAL file selection dialogs
to only selecting physical files.
It looks like the file you are trying to open is not a valid GDAL file, or some other magic is going on in the file selection. You could try pointing the program at a known-good file online to test it.
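
Since gdal.Open returns None rather than raising when it cannot open a file, checking the result right away gives a clearer failure than the later AttributeError. One possible cause, which the thread does not confirm, is that the relative path 'tif_file.tif' resolves against a different working directory on the command line than in PyCharm. The sketch below resolves the path first and then checks the result. The names open_dataset and opener are hypothetical, and the opener is passed in as a parameter so the snippet runs without GDAL installed.

```python
import os
import tempfile


def open_dataset(path, opener):
    """Open path with opener, failing loudly instead of returning None.

    opener is any callable that returns None on failure, for example
    lambda p: gdal.Open(p, gdal.GA_ReadOnly) when GDAL is available.
    """
    full_path = os.path.abspath(path)
    if not os.path.exists(full_path):
        raise IOError("No such file: %s (cwd is %s)" % (full_path, os.getcwd()))
    dataset = opener(full_path)
    if dataset is None:
        raise RuntimeError("Could not open %s as a dataset" % full_path)
    return dataset


# Stand-in opener that always fails, to show the error path on a file
# that exists but is not a readable dataset.
with tempfile.NamedTemporaryFile(suffix=".tif") as tmp:
    try:
        open_dataset(tmp.name, lambda p: None)
    except RuntimeError as exc:
        failure = str(exc)
```

With GDAL installed, the call would look like open_dataset('tif_file.tif', lambda p: gdal.Open(p, gdal.GA_ReadOnly)).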

Write to an HTML file with Python

I have a couple of graphs I need to display in my browser offline. mpld3 outputs the HTML as a string, and I need to make an HTML file containing that string. What I'm doing right now is:
tohtml = mpld3.fig_to_html(fig, mpld3_url='/home/pi/webpage/mpld3.js',
                           d3_url='/home/pi/webpage/d3.js')
print(tohtml)
Html_file = open("graph.html","w")
Html_file.write(tohtml)
Html_file.close()
tohtml is the variable where the HTML string is stored. I've printed this string to the terminal and pasted it into an empty HTML file, and I get my desired result. However, when I run my code, I get an empty file named graph.html.

It seems like you may be reinventing the wheel here. Have you tried something like this?
mpld3_url='/home/pi/webpage/mpld3.js'
d3_url='/home/pi/webpage/d3.js'
with open('graph.html', 'w') as fileobj:
    mpld3.save_html(fig, fileobj, d3_url=d3_url, mpld3_url=mpld3_url)
Note: this is untested. I'm just going off the mpld3.save_html documentation and prior knowledge of Python I/O streams.
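
If the string from fig_to_html is already in hand, writing it inside a with block ensures the file is flushed and closed before anything else reads it. This is a minimal sketch and not a diagnosis of the empty file: html_text is a placeholder standing in for the tohtml string, and the output goes to a temporary directory rather than the original location.

```python
import os
import tempfile

# Placeholder for the string returned by mpld3.fig_to_html(fig, ...).
html_text = "<html><body><div id='fig'></div></body></html>"

out_path = os.path.join(tempfile.mkdtemp(), "graph.html")

# The with block flushes and closes the file on exit.
with open(out_path, "w") as html_file:
    html_file.write(html_text)

# Read it back to confirm the content reached the disk.
with open(out_path) as html_file:
    written = html_file.read()
```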

requests - Python command line behavior differs from behavior when script is run

I'm trying to write a script that will input data I supply into a web form at a url I supply.
To start with, I'm testing it out by simply getting the html of the page and outputting it as a text file. (I'm using Windows, hence .txt.)
import sys
import requests
sys.stdout = open('html.txt', 'a')
content = requests.get('http://www.york.ac.uk/teaching/cws/wws/webpage1.html')
content.text
When I do this (i.e., the last two lines) on the python command line (>>>), I get what I expect. When I do it in this script and run it from the normal command line, the resulting html.txt is blank. If I add print(content) then html.txt contains only: <Response [200]>.
Can anyone elucidate what's going on here? Also, as you can probably tell, I'm a beginner, and I can't for the life of me find a beginner-level tutorial that explains how to use requests (or urllib[2] or selenium or whatever) to send data to webpages and retrieve the results. Thanks!

You want:
import sys
import requests
result = requests.get('http://www.york.ac.uk/teaching/cws/wws/webpage1.html')
if result.status_code == requests.codes.ok:
    with open('html.txt', 'a') as sys.stdout:
        print result.content
Requests returns an instance of type requests.Response. When you tried to print that, the __repr__ method was called, which looks like this:
def __repr__(self):
    return '<Response [%s]>' % (self.status_code)
That is where the <Response [200]> came from.
requests.Response has a content attribute, which is an instance of str (or bytes in Python 3) and contains your HTML.
The text attribute is type unicode which may or may not be what you want. You mention in the comments that you saw a UnicodeDecodeError when you tried to write it to a file. I was able to replace the print result.content above with print result.text and I did not get that error.
If you need help solving your unicode problems, I recommend reading this unicode presentation. It explains why and when to decode and encode unicode.

The interactive interpreter echoes the result of every expression that doesn't produce None. This doesn't happen in regular scripts.
Use print to explicitly echo values:
print response.content
I used the undecoded version here as you are redirecting stdout to a file with no further encoding information.
You'd be better off writing the output directly to a file, however:
with open('html.txt', 'ab') as outputfile:
    outputfile.write(response.content)
This writes the response body, undecoded, directly to the file.
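
The difference between the two behaviors can be reproduced without a network call. In the sketch below, FakeResponse is a hypothetical stand-in that mimics only the three things discussed above (the repr, content, and text); it is not the requests class. The snippet is written for Python 3 and writes to a temporary directory.

```python
import os
import tempfile


class FakeResponse(object):
    """Hypothetical stand-in for a response object."""

    def __init__(self, status_code, body):
        self.status_code = status_code
        self.content = body                  # raw bytes
        self.text = body.decode("utf-8")     # decoded text

    def __repr__(self):
        return "<Response [%s]>" % self.status_code


response = FakeResponse(200, b"<html>hello</html>")

# In a script this expression statement is evaluated and discarded;
# only the interactive prompt would echo it.
response.text

# Printing the object shows its repr, not the page.
shown = repr(response)

# Writing the undecoded body in binary mode needs no encoding choice.
out_path = os.path.join(tempfile.mkdtemp(), "html.txt")
with open(out_path, "ab") as output_file:
    output_file.write(response.content)

with open(out_path, "rb") as output_file:
    saved = output_file.read()
```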

Google search issue in Python

I have implemented a program in Python which performs a Google search and captures the top ten links from the search results. I am using the 'pygoogle' library for the search. When I run my program the first two or three times, it gets proper hits and the entire project works fine. But afterward, once a certain number of links have been downloaded, it gives the following error (gui_two.py is my program name):
Exception in Tkinter callback
Traceback (most recent call last):
  File "/usr/lib/python2.7/lib-tk/Tkinter.py", line 1413, in __call__
    return self.func(*args)
  File "gui_two.py", line 113, in action
    result = uc.utilcorpus(self.fn1,"")
  File "/home/ci/Desktop/work/corpus/corpus.py", line 125, in utilcorpus
    for url in g1.get_urls(): #this is key sentence based search loop
  File "/home/ci/Desktop/work/corpus/pygoogle.py", line 132, in get_urls
    for result in data['responseData']['results']:
TypeError: 'NoneType' object has no attribute '__getitem__'
I know this is a very familiar error in Python, but I am not able to do anything about it since it is raised inside a library. I wonder whether my program is spamming Google, whether I need a custom Google search API, or whether there is some other reason. Please give me precise information on performing searches without this issue. I would be very grateful for your help.
Thanks.
Edited: My code is very large, so here is a small piece of it where the problem arises.
g1 = pygoogle(query)
g1.pages = 1
for url in g1.get_urls(): #error is in this line
    print "URL : ",url
It may work if we simply copy it into a simple .py file, but if we execute it many times, the program gives an error.

Here's the culprit code from pygoogle.py (from http://pygoogle.googlecode.com/svn/trunk/pygoogle.py)
def get_urls(self):
    """Returns list of result URLs"""
    results = []
    search_results = self.__search__()
    if not search_results:
        self.logger.info('No results returned')
        return results
    for data in search_results:
        if data and data.has_key('responseData') and data['responseData']['results']:
            for result in data['responseData']['results']:
                if result:
                    results.append(urllib.unquote(result['unescapedUrl']))
    return results
Unlike every other place where data['responseData']['results'] is used, they're not both being checked for existence using has_key().
I suspect that your responseData is missing results, hence the for loop fails.
Since you have the source, you can edit this yourself.
Also make an issue for the project - very similar to this one in fact.

I fixed the issue by modifying the source code of the pygoogle.py library. The bug is that the code does not check whether an element holds data or is None. The modified code is:
def get_urls(self):
    """Returns list of result URLs"""
    results = []
    for data in self.__search__():
        #following two lines are added to fix the issue
        if data['responseData'] == None or data['responseData']['results'] == None:
            break
        for result in data['responseData']['results']:
            if result:
                results.append(urllib.unquote(result['unescapedUrl']))
    return results
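
An alternative to indexing directly is to walk the nested dictionaries with .get(), so that a missing or None responseData or results entry is skipped rather than raising. The sketch below is a standalone function written for Python 3 (urllib.parse.unquote instead of urllib.unquote). The name extract_urls and the sample data are hypothetical and are not part of pygoogle; the sample shapes are assumptions about what a failed response might look like.

```python
from urllib.parse import unquote


def extract_urls(pages):
    """Collect result URLs from a list of response dictionaries."""
    urls = []
    for data in pages or []:
        # (data or {}) covers a page that is None.
        response_data = (data or {}).get('responseData') or {}
        for result in response_data.get('results') or []:
            if result and 'unescapedUrl' in result:
                urls.append(unquote(result['unescapedUrl']))
    return urls


sample_pages = [
    {'responseData': {'results': [{'unescapedUrl': 'http://example.com/a%20b'}]}},
    {'responseData': None},               # assumed shape of a failed page
    {'responseData': {'results': None}},
    None,
]
urls = extract_urls(sample_pages)
```

Unlike the break in the modified get_urls, this version skips an empty page and keeps going, which may or may not be what you want if an empty page signals that further requests will also fail.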
