Python HTMLParser Detecting the End of Data

I am using the HTMLParser library of Python 2.7 to process and extract some information from HTML content fetched from a remote URL. I do not quite understand how to detect or catch the exact moment when the parser instance finishes parsing the HTML data.
The basic implementation of my parser class looks like this:
import HTMLParser
import urllib3

class MyParser(HTMLParser.HTMLParser):
    def __init__(self, url):
        self.url = url
        self.users = set()
        # the base-class __init__ calls reset(), so users must exist first
        HTMLParser.HTMLParser.__init__(self)

    def start(self):
        self.reset()
        response = urllib3.PoolManager().request('GET', self.url)
        if not str(response.status).startswith('2'):
            raise urllib3.exceptions.HTTPError('HTTP error here..')
        self.feed(response.data.decode('utf-8'))

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            attrs = dict(attrs)
            if attrs.get('class') == 'js_userPictureOuterOnRide':
                user = attrs.get("data-name")
                if user:
                    self.users.add(user)

    def reset(self):
        HTMLParser.HTMLParser.reset(self)
        self.users.clear()
My question is: how can I detect that the parsing process has finished?
Thanks.

HTMLParser is synchronous: once feed returns, all data passed in so far has been parsed and all callbacks have been called.
self.feed(response.data.decode('utf-8'))
print 'ready!'
(if I misunderstood your question, please let me know).
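feed on its own is enough, but if you want an explicit "finished" hook, HTMLParser also provides a close() method that flushes any buffered data; you can override it and call it once after the last feed(). A minimal self-contained sketch along those lines (the on_finished hook is my own name, not part of the HTMLParser API):
import HTMLParser

class FinishAwareParser(HTMLParser.HTMLParser):
    def close(self):
        # close() processes any remaining buffered data, so calling it
        # after the last feed() marks the true end of input.
        HTMLParser.HTMLParser.close(self)
        self.on_finished()

    def on_finished(self):
        # hypothetical hook: runs exactly once, after close()
        print 'all data parsed'

parser = FinishAwareParser()
parser.feed('<div>hello</div>')
parser.close()  # triggers on_finished()
In the question's MyParser, start() would simply end with self.feed(...) followed by self.close().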

Related

Structuring python classes to determine type of parser to use

I am building a parser in Python that needs to:
- Retrieve a stored HTML page from S3 based on an ID
- Determine which parser to use based on header information in the HTML
- Return some data from the HTML using the correct parser
How can I create an elegant structure where I request the data from S3 one time, determine which parser to use based on the classes I have built, and then return the appropriate result?
This is the structure I came up with to build the first parser:
# /parser.py
from gzip import decompress
from bs4 import BeautifulSoup
import requests

class Parser:
    def __init__(self, page_id):
        self.landing_page_endpoint = f"https://my_org.org/{page_id}"
        self.parser_name = None
        self.soup = self.get_soup()

    def get_html(self):
        r = requests.get(self.landing_page_endpoint)
        html = decompress(r.content)
        return html

    def get_soup(self):
        soup = BeautifulSoup(self.get_html(), "html.parser")
        return soup

    def parse(self):
        """Core method that returns authors and associated affiliations."""
        pass

# /parsers/gregory.py
import json
from parser import Parser

class Gregory(Parser):
    def __init__(self, doi):
        super().__init__(doi)
        self.parser_name = "gregory"

    def parse(self):
        my_parsed_info = 'asdf'
        return my_parsed_info
Then I call this with:
# views.py
from flask import Flask, jsonify, request
from parsers.gregory import Gregory

app = Flask(__name__)

@app.route('/parse')  # illustrative route; the question showed only the view body
def parse_page():
    page_id = request.args.get('page_id')
    g = Gregory(page_id)
    result = g.parse()
    return jsonify(result)
My idea is to add a method to each parser class I create, such as detect_parser, that returns True if it is the correct parser. Then I can make a list of all the parser classes and go through each one, running that method until one returns True.
The problem with my current setup is that this sends the request to S3 every time I instantiate a class, which is slow and unnecessary. Should I do something where I initialize the overall Parser class once, then pass it into each subclass?
You'll want to look into a way to register parsers as you create them in your code. I'd suggest learning about decorators.
Your classes would look something like:
class ParserController:
    def __init__(self):
        self.parsers = []

    def register(self, test):
        # Decorator factory: pairs a test predicate with the parser class
        # it selects, so the controller can dispatch later.
        def decorated(cls):
            self.parsers.append((test, cls))
            return cls
        return decorated

    def parse(self, page_id):
        html = get_html(page_id)  # fetch the page once
        for test, cls in self.parsers:
            if test(html):
                parser = cls()
                break
        else:
            raise ValueError('no parser matched this page')
        return parser.parse(html)

controller = ParserController()

@controller.register(lambda html: ...)  # replace ... with the test for the Gregory parser
class Gregory(Parser):
    def parse(self, html):
        ...  # do stuff

@controller.register(lambda html: ...)  # replace ... with the test for the Mark parser
class Mark(Parser):
    def parse(self, html):
        ...  # do stuff
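With the registry in place, the view reduces to result = controller.parse(page_id). The get_html helper is left undefined in the sketch above; a minimal version, assuming the stored pages live in an S3 bucket named my-pages keyed by page_id (bucket name and key scheme are placeholders):
import boto3

def get_html(page_id):
    # One S3 round trip per request; every registered test predicate
    # then runs against the same in-memory HTML.
    obj = boto3.client("s3").get_object(Bucket="my-pages", Key=f"{page_id}.html")
    return obj["Body"].read().decode("utf-8")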

What does handle_data() return?

I tried to get a list of only the meaningful content from a webpage (there are only two pieces of text in my test markup) using handle_data() from html.parser, but I got multiple separate lists rather than one. I don't understand what handle_data() returns. Can anybody help me with it? How can I store the results in a single list?
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        a = []
        for i in data.split():
            a.append(i)
        print(a)
        return a

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')
Result:
['Test']
['Parse', 'me!']
The handle_data method of HTMLParser is called every time the parser finds text content inside an HTML tag.
In your case, handle_data is called twice: on the first call the data argument is 'Test', on the second it is 'Parse me!'. The parser ignores whatever handle_data returns, which is why you never see a combined list.
If you want to store all of the text content in one list, create a variable inside your class and append to it.
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    a = []  # class attribute, shared by all instances of MyHTMLParser

    def handle_data(self, data):
        self.a.append(data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')
print(parser.a)  # ['Test', 'Parse me!']
As far as I know, html.parser is a low-level module that helps you parse HTML but returns nothing itself. You have to decide what to do with the data: print it, put it in some variable, build a tree, etc. You have to write whatever code actually returns something.
For example, I create a class variable result to keep all the strings received in handle_data, and later I can read all of the text from this variable.
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    result = []

    def handle_data(self, data):
        self.result.append(data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')
print(parser.result)
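One caveat with the class-level list in both answers: it is shared by every instance of MyHTMLParser, so a second parser would keep appending to the same list. A small variant (my own adjustment, not from either answer) moves the list into __init__ so each parser gets its own:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.result = []  # per-instance list, not shared between parsers

    def handle_data(self, data):
        self.result.append(data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')
print(parser.result)  # ['Test', 'Parse me!']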

Access attribute of one class from another class in Python

I'm playing with OOP (the OOP concept is totally new to me) in Python 3 and trying to access an attribute (a list) of one class from another class. Obviously I am doing something wrong, but I don't understand what.
from urllib import request
from bs4 import BeautifulSoup

class getUrl(object):
    def __init__(self):
        self.appList = []
        self.page = None

    def getPage(self, url):
        url = request.urlopen(url)
        self.page = url.read()
        url.close()

    def parsePage(self):
        soup = BeautifulSoup(self.page)
        for link in soup.find_all("a"):
            self.appList.append(link.get('href'))
        return (self.appList)

class getApp(object):
    def __init__(self):
        pass

    def selectApp(self):
        for i in getUrl.appList():
            return print(i)

a = getUrl()
a.getPage("http://somepage/page")
a.parsePage()
b = getApp()
b.selectApp()
And I get:
AttributeError: type object 'getUrl' has no attribute 'appList'
Your code seems to confuse classes with functions. Normally a function name is a verb (e.g. getUrl) because it represents an action. A class name is usually a noun, because it represents a class of objects rather than actions. For example, the following is closer to how I would expect to see classes being used:
from urllib import request
from bs4 import BeautifulSoup

class Webpage(object):
    def __init__(self, url):
        self.app_list = []
        url = request.urlopen(url)
        self.page = url.read()

    def parse(self):
        soup = BeautifulSoup(self.page)
        for link in soup.find_all("a"):
            self.app_list.append(link.get('href'))
        return self.app_list

class App(object):
    def __init__(self, webpage, number):
        self.webpage = webpage
        self.link = webpage.app_list[number]

my_webpage = Webpage("http://somepage/page")
my_webpage.parse()
selected_app = App(my_webpage, 1)
print(selected_app.link)
Note that we usually make an instance of a class (e.g. my_webpage) then access methods and properties of the instance rather than of the class itself. I don't know what you intend to do with the links found on the page, so it is not clear if these need their own class (App) or not.
You need to pass in the getUrl() instance; the attributes are not present on the class itself:
class getApp(object):
    def __init__(self):
        pass

    def selectApp(self, geturl_object):
        for i in geturl_object.appList:
            print(i)
(note the removed return as well; print() returns None and you'd exit the loop early).
and
b = getApp()
b.selectApp(a)
appList is an attribute of each instance of the getUrl class, so you can only access it through an instance (object) of getUrl. The problem is here:
class getApp(object):
    def __init__(self):
        pass

    def selectApp(self):
        for i in getUrl.appList():
            return print(i)
Look at getUrl.appList(). Here you go through the class itself, not an object, and call appList as if it were a method. You might also want to look at the return print(i) statement.
Use requests instead of urllib; it's more convenient.
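For example, getPage could be rewritten with requests (a sketch of that suggestion; the method body is mine, not part of the original answer):
import requests

class getUrl(object):
    def __init__(self):
        self.appList = []
        self.page = None

    def getPage(self, url):
        # requests manages the connection for you; .content gives the raw
        # bytes, matching what url.read() returned before
        self.page = requests.get(url).content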

Removing html tags and entities from string in python

I am getting XML data from api.careerbuilder.com.
In particular, the string contains some HTML entities that I am trying to remove, to no effect!
I have tried doing this:
import re

# re.sub returns a new string; assign the result back or it is lost
job_title_text = re.sub(r'\&lt;.*?\&gt;', '', job_title_text)
and this
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # sets up the parser's internal state
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

strip_tags(job_title_text)
and finally this
import lxml.html
(lxml.html.fromstring(job_title_text)).text_content()
But all of these were failures. The second approach deleted HTML entities like '&amp;' but left the text that was inside the tags, e.g. 'pbrspan'. The third one ruined everything: no data was shown at all, just
<bound method HtmlElement.text_content of <Element html at 0x33717d8>>
Finally, I suspect that the regex I have written is entirely wrong.
Any ideas how this can be handled?
Try this regular expression
(\&lt\;).*?(\&gt\;)
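Applied with re.sub, remembering to keep the return value (a sketch; the sample string is made up):
import re

job_title_text = '&lt;p&gt;Senior Engineer&lt;/p&gt;'  # made-up sample input
job_title_text = re.sub(r'(\&lt\;).*?(\&gt\;)', '', job_title_text)
print(job_title_text)  # Senior Engineer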
Consider using BeautifulSoup to remove tags; it's pretty well documented: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Removing%20elements
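A minimal sketch of that BeautifulSoup route, assuming job_title_text holds entity-escaped markup (the sample string is made up): the entities are unescaped into real tags first, then get_text() drops the tags and keeps the text.
import html
from bs4 import BeautifulSoup

job_title_text = '&lt;p&gt;Senior &lt;span&gt;Engineer&lt;/span&gt;&lt;/p&gt;'

# Unescape once so &lt;p&gt; becomes a real <p> tag, then strip all tags.
clean = BeautifulSoup(html.unescape(job_title_text), 'html.parser').get_text()
print(clean)  # Senior Engineer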

test instance attributes with python

I am writing a unit test to determine whether an attribute is properly set during the instantiation of my parser object. Unfortunately, the only way I can think to do it is to use self.assertTrue(p.soup).
I haven't slung any Python in a while, but that doesn't seem like a very clear way to check that the instance attribute was properly set. Any ideas on how to improve it?
Here is my test class:
import unittest

class ParserTest(unittest.TestCase):
    def setUp(self):
        self.uris = ctd.Url().return_urls()
        self.uri = self.uris['test']

    def test_create_soup(self):
        p = ctd.Parser(self.uri)
        self.assertTrue(p.soup)

if __name__ == '__main__':
    unittest.main()
    # suite = unittest.TestLoader().loadTestsFromTestCase(UrlTest)
    # unittest.TextTestRunner(verbosity=2).run(suite)  # unreachable after main(); suite is commented out above
Here is the Parser class that I am testing:
from urllib.request import urlopen
from bs4 import BeautifulSoup

class Parser():
    def __init__(self, uri):
        self.uri = uri
        self.soup = self.createSoup()

    def createSoup(self):
        htmlPage = urlopen(self.uri)
        htmlText = htmlPage.read()
        self.soup = BeautifulSoup(htmlText)  # redundant: overwritten by the assignment in __init__
        return BeautifulSoup(htmlText)
I got into the bad habit over the past few years of not unit testing, so I am fairly new to the topic. Any good resources for an in-depth explanation of unit testing in Python would be appreciated. I looked at the standard library unittest documentation, but that didn't really help much...
If the p.soup attribute needs to be an instance of BeautifulSoup, you can explicitly check its type:
self.assertIsInstance(p.soup, BeautifulSoup)
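Dropped into the ParserTest case from the question (same setUp, only the assert changes; ctd comes from the question's own code), that would look like:
from bs4 import BeautifulSoup

def test_create_soup(self):
    p = ctd.Parser(self.uri)
    # fails with a descriptive message if soup is missing or the wrong type
    self.assertIsInstance(p.soup, BeautifulSoup)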
