Parse html source code into xml tree - python

I know there are many ways to do this using third-party libraries such as requests, pyparsing, selenium, etc., but I'm looking for a quick and dirty way to do it without any third-party modules.
Basically, what I want to do is take the HTML code from the page source of a webpage and parse it into an XML tree (probably using xml.etree.ElementTree). I've tried this:
import urllib.request
import xml.etree.ElementTree as ET

# 'website' holds the URL of the page to fetch
data = urllib.request.urlopen(website)
tree = ET.fromstring(data.read())
However, when I do this I get either mismatched-tag errors or complaints about invalid characters for UTF-8, even though the page source is definitely UTF-8 encoded. I was under the assumption that a functioning HTML page wouldn't have mismatched tags, so I'm thinking there's something I'm missing.
The whole reason I don't want to use a third-party library is that I only need to grab a small amount of information, and I don't think that justifies pulling in another module.
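A minimal sketch of the no-third-party route using the standard library's html.parser, which tolerates real-world HTML far better than xml.etree; the choice of <a>/href as the target is purely illustrative:
import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collect href attributes from <a> tags; adjust to whatever data you need.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# 'website' is the URL being fetched, as in the snippet above.
with urllib.request.urlopen(website) as response:
    charset = response.headers.get_content_charset() or "utf-8"
    html = response.read().decode(charset)

collector = LinkCollector()
collector.feed(html)
print(collector.links)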

Related

How do I parse HTML-like data with errors?

I have data that looks like it is part of an HTML document. However there are some bugs in it like
<td class= foo"bar">
on which all the parsers I tried (lxml, xml.etree) fail with an error.
Since I don't actually care about this specific part of the document I am looking for a more robust parser.
Ideally I'd like something that lets me ignore errors in specific subtrees (perhaps by simply not inserting those nodes), or something that only lazily parses the parts of the tree I am actually traversing.
You are using XML parsers. XML is a strict language, while the HTML standard requires parsers to be tolerant of errors.
Use a compliant HTML parser like lxml.html or html5lib, or the wrapper library BeautifulSoup (which can use either of them as its backend and offers a cleaner API). html5lib is slower but comes closest to how a modern browser would treat errors.
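As a concrete illustration (a sketch, assuming beautifulsoup4 and html5lib are installed), the broken attribute from the question parses without an error:
from bs4 import BeautifulSoup

broken = '<table><tr><td class= foo"bar">cell</td></tr></table>'
soup = BeautifulSoup(broken, "html5lib")

td = soup.find("td")
print(td.get_text())  # the cell content survives
print(td.attrs)       # inspect how the parser repaired the bad attribute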
Use lxml:
Create an HTML parser with the recover option set to True:
from io import StringIO
from lxml import etree

# recover=True tells libxml2 to keep parsing past broken markup
parser = etree.HTMLParser(recover=True)
tree = etree.parse(StringIO(broken_html), parser)
See the tutorial Parsing XML and HTML with lxml.

Is it possible to scrape webpage without using third-party libraries in python?

I am trying to understand how Beautiful Soup works in Python. I have used Beautiful Soup and lxml in the past, but now I am trying to write a script that can read data from a given webpage without any third-party libraries. The xml module doesn't seem to offer much for this and throws a lot of errors. Is there any other library with good documentation for reading data from a web page?
I am not using these scripts on any particular websites. I am just trying to read from public pages and news blogs.
Third party libraries exist to make your life easier. Yes, of course you could write a program without them (the authors of the libraries had to). However, why reinvent the wheel?
Your best options are BeautifulSoup and Scrapy. However, if you're having trouble with BeautifulSoup, I wouldn't try Scrapy.
Perhaps you can get by with just the plain text from the website?
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
pagetxt = soup.get_text()
Then you can be done with the external libraries and just work with plain text. However, if you need to do something more complicated, HTML is something you really should use a library to manipulate. There is just too much that can go wrong.
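Putting the fetch and the text extraction together, a sketch assuming beautifulsoup4 is installed and using a placeholder URL:
import urllib.request
from bs4 import BeautifulSoup

url = "https://example.com/"  # placeholder; any public page or news blog
with urllib.request.urlopen(url) as response:
    html_doc = response.read()

# html.parser is the backend bundled with Python, so no extra parser install is needed.
soup = BeautifulSoup(html_doc, "html.parser")
pagetxt = soup.get_text(separator="\n", strip=True)
print(pagetxt)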

Python and Parse HTML

My input is the URL of a page. I want to get the HTML of the page, then parse it for a specific JSON response and grab a Product ID and another URL. In the next step, I would like to append the Product ID to the URL I found.
Any advice on how to achieve this?
As far as retrieving the page goes, the requests library is a great tool, and much more sanity-friendly than cURL.
I'm not sure based on your question, but if you're getting JSON back, just import the native JSON library (import json) and use json.loads(data) to get a dictionary (or list) provided the response is valid JSON.
If you're parsing HTML, there are several good choices, including BeautifulSoup and lxml. The former is easier to use but doesn't run as quickly or efficiently; the latter can be a bit obtuse but it's blazingly fast. Which is better depends on your app's requirements.
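A rough sketch of how those pieces fit together; the URL, the script id, and the JSON keys are hypothetical placeholders, since the real page structure isn't shown in the question:
import json

import requests
from bs4 import BeautifulSoup

page = requests.get("https://example.com/product-page")  # hypothetical URL
soup = BeautifulSoup(page.text, "html.parser")

# Assume the JSON payload sits in a <script> element we can locate somehow,
# for example by an id attribute (purely illustrative).
script = soup.find("script", id="product-data")
data = json.loads(script.string)

product_id = data["productId"]   # hypothetical key
base_url = data["detailsUrl"]    # hypothetical key
print(f"{base_url}/{product_id}")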

BeautifulSoup: how to ignore spurious end tags

I've read many good things about BeautifulSoup, which is why I'm currently trying to use it to scrape a set of websites with badly formed HTML.
Unfortunately, there's one bit of BeautifulSoup behaviour that is pretty much a showstopper at the moment:
It seems that when BeautifulSoup encounters a closing tag (in my case </p>) that was never opened, it decides to end the document instead.
Also, the find method seems to not search the contents behind the (self-induced) </html> tag in this case. This means that when the block I'm interested in happens to be behind a spurious closing tag, I can't access the contents.
Is there a way I can configure BeautifulSoup to ignore unmatched closing tags rather than closing the document when they are encountered?
BeautifulSoup doesn't do any parsing itself; it uses the output of a dedicated parser (lxml, html.parser, or html5lib).
Pick a different parser if the one you are using right now doesn't handle broken HTML the way you want it to. lxml is the fastest of them and handles broken HTML quite well; html5lib comes closest to how your browser would parse broken HTML, but is a lot slower.
Also see Installing a parser in the BeautifulSoup documentation, as well as the Differences between parsers section.
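To see how the backends differ on a fragment like the one described, a minimal sketch (the markup is invented for illustration, and lxml/html5lib need to be installed to try those backends):
from bs4 import BeautifulSoup

html = "<div>before</p><span>the part I actually want</span></div>"

for backend in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(html, backend)
    span = soup.find("span")
    print(backend, "->", span.get_text() if span else "span not found")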

How do I parse a wiki page without taking a dump of it in Python?

Is it possible to parse a wiki without downloading its dump, as the dump itself is far too much data to handle? Let's say I have the URL of a certain wiki page and I fetch it through urllib; how do I then parse it and extract a certain type of data using Python?
Here, "type" means data corresponding to a semantic match against the search that was performed.
You need an HTML parser to get the useful data from the HTML.
You can use BeautifulSoup to help parse the HTML. I recommend that you read the documentation and have a look at the examples there.
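A sketch of that approach on a single wiki page; the URL and the search term are illustrative placeholders:
import urllib.request
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
term = "interpreted"

with urllib.request.urlopen(url) as response:
    soup = BeautifulSoup(response.read(), "html.parser")

# Keep only the paragraphs that mention the search term.
matches = [p.get_text(strip=True)
           for p in soup.find_all("p")
           if term.lower() in p.get_text().lower()]
print(matches)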
I'd suggest an option such as HarvestMan instead, since a semantic search is likely to involve crawling multiple pages, which goes beyond a simpler solution such as BeautifulSoup.
