How to extract nested tables from HTML? - python

I have an HTML file (encoded in UTF-8) which I open with codecs.open(). The file structure is:
<html>
// header
<body>
// some text
<table>
// some rows with cells here
// some cells contains tables
</table>
// maybe some text here
<table>
// a form and other stuff
</table>
// probably some more text
</body></html>
I need to retrieve only the first table (discarding the one with the form), omitting everything before the first <table> and after the matching </table>. Some cells also contain paragraphs, bold text and scripts. There is no more than one nested table per row of the main table.
How can I extract it to get a list of rows, where each element holds a cell's plain (unicode string) data, and a list of rows for each nested table? There's no more than one level of nesting.
I tried HTMLParser, pyparsing and the re module, but couldn't get this working.
I'm quite new to Python.

Try Beautiful Soup.
In principle you need to use a real parser (which Beautiful Soup is); regular expressions cannot deal with arbitrarily nested elements, for computer-science reasons (finite state machines can't recognize context-free languages).
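For instance, here is a minimal sketch with Beautiful Soup; the HTML is inlined for the example (the question reads it from a file), and the helper name cell_value is made up. It relies on the question's guarantee of at most one level of nesting:

```python
from bs4 import BeautifulSoup

html = """
<body>
<p>intro text</p>
<table>
<tr><td>plain cell</td><td><table><tr><td>nested cell</td></tr></table></td></tr>
</table>
<table><tr><td><form></form></td></tr></table>
</body>
"""

soup = BeautifulSoup(html, "html.parser")
first_table = soup.find("table")  # first <table> in document order; the form table is ignored

def cell_value(td):
    inner = td.find("table")
    if inner is None:
        return td.get_text(strip=True)
    # at most one level of nesting, per the question
    return [[c.get_text(strip=True) for c in row.find_all("td")]
            for row in inner.find_all("tr")]

# recursive=False keeps the nested table's rows out of the outer table's row list
rows = [[cell_value(td) for td in tr.find_all("td", recursive=False)]
        for tr in first_table.find_all("tr", recursive=False)]
print(rows)  # [['plain cell', [['nested cell']]]]
```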

You may like lxml. I'm not sure I really understood what you want to do with that structure, but maybe this example will help...
import lxml.html

def process_row(row):
    for cell in row.xpath('./td'):
        inner_tables = cell.xpath('./table')
        if len(inner_tables) < 1:
            yield cell.text_content()
        else:
            yield [process_table(t) for t in inner_tables]

def process_table(table):
    # materialize the generator so the result is a plain list of lists
    return [list(process_row(row)) for row in table.xpath('./tr')]

html = lxml.html.parse('test.html')
first_table = html.xpath('//body/table[1]')[0]
data = process_table(first_table)

If the HTML is well-formed you can parse it into a DOM tree and use XPath to extract the table you want. I usually use lxml for parsing XML, and it can parse HTML as well.
The XPath for pulling out the first table would be "//body/table[1]" (note that plain "//table[1]" would also match any nested table that happens to be the first table child of its parent).

Related

How can I load an html file into a multilevel array of elements in python

In an ideal world, I'm trying to figure out how to load an HTML document into a nested list of elements, for example:
elements=[['h1', 'This is the first heading.'], ['p', 'Someone made a paragraph. A short one.'], ['table', ['tr', ['td', 'a table cell']]]]
I've played a little with BeautifulSoup, but can't see a way to do this.
Is this currently doable, or do I need to write a parser?
In an ideal world (definition: one where the website you want to read serves well-formed XHTML), you can toss it to an XML parser like lxml and you'll get something much like that back. Very short version:
Elements act like lists: the entries are their subelements, in document order.
Elements act like dictionaries, holding the key="value" attributes from the tag.
Elements have a text attribute, which is the text between the opening tag and its first subelement.
Elements have a tail attribute, which is the text after the element's closing tag.
Once you have a tree in a shape like that, you can probably write a three-line function that rebuilds it the way you want.
XHTML is essentially HTML restricted to be well-formed XML. In theory, sites should serve your browser XHTML, since it's better in every way, but most browsers are a lot more permissive, so in practice most sites don't bother with the stricter format.
One problem most sites have, for example, is omitted closing tags. XML parsers tend to error out on those.
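A short illustration of those four points with lxml; the snippet and tag names are invented for the example:

```python
from lxml import etree

root = etree.fromstring('<p class="x">hello <b>world</b> tail-text</p>')

print(root.tag)     # 'p'
print(root.attrib)  # dict-like attribute access: {'class': 'x'}
print(root.text)    # text before the first subelement: 'hello '
child = root[0]     # elements index like lists
print(child.tag)    # 'b'
print(child.tail)   # text after the closing </b>: ' tail-text'
```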
You can use recursion:
html = """
<html>
<body>
<h1>This is the first heading.</h1>
<p>Someone made a paragraph. A short one.</p>
<table>
<tr>
<td>a table cell</td>
</tr>
</table>
</body>
</html>
"""
import bs4

def to_list(d):
    # recurse into tags, keep plain strings, skip bare newlines
    return [d.name, *[to_list(i) if not isinstance(i, bs4.element.NavigableString) else i
                      for i in d.contents if i != '\n']]

_, *r = to_list(bs4.BeautifulSoup(html, 'html.parser').body)
print(r)
Output:
[['h1', 'This is the first heading.'], ['p', 'Someone made a paragraph. A short one.'], ['table', ['tr', ['td', 'a table cell']]]]

Understanding encodings in HTML

I am parsing a .html file using BeautifulSoup4 doing the following:
data = [item.text.strip() for item in soup.find_all('span')]
The code takes all the items in a given table and stores them in data. I noticed some of the elements in data contain text with what looks like a leftover encoded character. An example element:
data[5] stores 'CSCI-GA.1144-\u200b001'
The text I expected was just 'CSCI-GA.1144-001'.
In the html file, I find it as 'CSCI-GA.1144-​001'
Why does it show up differently when I parse it than when I inspect the HTML source? And how do I parse the data so these characters are excluded?
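For what it's worth, \u200b is a zero-width space: it renders as nothing in the browser, which is why the page looks like plain CSCI-GA.1144-001 while the parsed string still contains the character. One way to handle it (a sketch; the set of code points below is a hand-picked assumption, not exhaustive) is to strip such invisible characters after extraction:

```python
import re

raw = 'CSCI-GA.1144-\u200b001'

# remove zero-width space, zero-width (non-)joiner and BOM characters
ZERO_WIDTH = re.compile('[\u200b\u200c\u200d\ufeff]')
clean = ZERO_WIDTH.sub('', raw)
print(clean)  # CSCI-GA.1144-001
```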

python how to identify block html contain text?

I have raw HTML files and I've removed the script tags.
I want to identify the block elements in the DOM (like <h1>, <p>, <div>, etc., not <a>, <em>, <b>, etc.) and enclose them in <div> tags.
Is there an easy way to do this in Python?
Is there a library in Python to identify block elements?
Thanks
UPDATE
Actually I want to extract from the HTML document: I have to identify the blocks which contain text. For each text element I have to find its closest parent element that is displayed as a block. After that, for each block, I will extract features such as the size and position of the block.
You should use something like Beautiful Soup or HTMLParser.
Have a look at their docs: Beautiful Soup or HTMLParser.
You should find what you are looking for there. If you cannot get it to work, consider asking a more specific question.
Here is a simple example of how you could go about this. Say data is the raw content of a site; then you could:
soup = BeautifulSoup(data)  # you may need to add from_encoding="utf-8" or so
Then you might want to walk through the tree looking for a specific node and do something with it. You could use a function like this:
def walker(soup):
    if soup.name is not None:
        for child in soup.children:
            # do stuff with the node
            print(':'.join([str(child.name), str(type(child))]))
            walker(child)
Note: the code is from this great tutorial.
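As for identifying block elements specifically: Beautiful Soup has no built-in notion of block-level tags, so a common approach (a sketch; the tag set below is a hand-picked assumption, not an exhaustive list) is to check tag names against your own set while walking the tree:

```python
from bs4 import BeautifulSoup

# hand-picked tags usually rendered as blocks; extend as needed
BLOCK_TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'div',
              'table', 'ul', 'ol', 'li', 'blockquote', 'pre']

html = '<div><h1>Title</h1><p>Some <b>bold</b> text.</p><a href="#">link</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# find_all accepts a list of tag names; keep only blocks that contain text
blocks_with_text = [tag.name for tag in soup.find_all(BLOCK_TAGS)
                    if tag.get_text(strip=True)]
print(blocks_with_text)  # ['div', 'h1', 'p']
```

From there, the question's "closest block-level parent of a text node" would just be a walk up .parents until the tag name is in that set.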

Writing clean code in beautifulsoup

While parsing a table on a webpage with little semantic structure, my Beautiful Soup expressions get really ugly. I might be going about it the wrong way, and would like to know how I can rewrite my code to make it more readable and less messy.
For example, a page has three tables, and the relevant data is in the third. The actual data starts in the second row. The first entry in each row is an index, and the data I need is in the second td element. That second td contains two links, and my text of interest is within the second a tag. Translating this into BeautifulSoup, I wrote:
soup.find_all('table')[2].find_all('tr')[2].find_all('td')[1].find_all('a')[1].text
This works fine, and I grab all 70 elements in the table using the same principle in a list comprehension:
relevant_data = [ x.find_all('td')[1].find_all('a')[1].text for x in soup.find_all('table')[2].find_all('tr')[2:]]
Is this kind of code OK or is there any scope for improvement?
Using lxml, you can use XPath.
For example:
html = '''
<body>
<table></table>
<table></table>
<table>
<tr></tr>
<tr></tr>
<tr><td></td><td><a>blah1</a><a>blah1-1</a></td></tr>
<tr><td></td><td><a>blah2</a><a>blah2-1</a></td></tr>
<tr><td></td><td><a>blah3</a><a>blah3-1</a></td></tr>
<tr><td></td><td><a>blah4</a><a>blah4-1</a></td></tr>
<tr><td></td><td><a>blah5</a><a>blah5-1</a></td></tr>
</table>
<table></table>
</body>
'''
import lxml.html
root = lxml.html.fromstring(html)
print(root.xpath('.//table[3]/tr[position()>=2]/td[2]/a[2]/text()'))
output:
['blah1-1', 'blah2-1', 'blah3-1', 'blah4-1', 'blah5-1']

BeautifulSoup in Python - getting the n-th tag of a type

I have some html code that contains many <table>s in it.
I'm trying to get the information in the second table. Is there a way to do this without using soup.findAll('table')?
When I do use soup.findAll('table'), I get an error:
ValueError: too many values to unpack
Is there a way to get the n-th tag in some code or another way that does not require going through all the tables? Or should I see if I can add titles to the tables? (like <table title="things">)
There are also headers (<h4>title</h4>) above each table, if that helps.
Thanks.
EDIT
Here's what I was thinking when I asked the question:
I was unpacking the objects into two values, when there were many more. I thought this would just give me the first two things from the list, but of course it kept giving me the error mentioned above. I was unaware the return value was a list; I thought it was a special object or something, and I was basing my code off of my friends'.
I was thinking this error meant there were too many tables on the page and that it couldn't handle all of them, so I was asking for a way to do it without the method I was using. I probably should have stopped assuming things.
Now I know it returns a list and I can use this in a for loop or get a value from it with soup.findAll('table')[someNumber]. I learned what unpacking was and how to use it, as well. Thanks everyone who helped.
Hopefully that clears things up, now that I know what I'm doing my question makes less sense than it did when I asked it, so I thought I'd just put a note here on what I was thinking.
EDIT 2:
This question is now pretty old, but I still see that I was never really clear about what I was doing.
If it helps anyone, I was attempting to unpack the findAll(...) results without knowing how many there were:
useless_table, table_i_want, another_useless_table = soup.findAll("table");
Since the page didn't always have the number of tables I had guessed, and all the values in the tuple need to be unpacked, I was receiving the ValueError:
ValueError: too many values to unpack
So I was looking for a way to grab the second (or whichever index) table from the returned list without running into errors about how many tables there were.
To get the second table from the call soup.findAll('table'), use it as a list, just index it:
secondtable = soup.findAll('table')[1]
Martijn Pieters' answer will indeed make it work. But I've had experience with nested table tags breaking my code when I simply grabbed the second table in the list without paying attention.
When you use find_all and take the nth element, there is a potential to mess up: you had better locate the first element you want and make sure the nth element is actually a sibling of that element, not a child.
You can use find_next_sibling() to secure your code,
or you can find the parent first and then use find_all(recursive=False) to restrict your search range.
Just in case you need it, I'll list my code below (using recursive=False).
from bs4 import BeautifulSoup
text = """
<html>
<head>
</head>
<body>
<table>
<p>Table1</p>
<table>
<p>Extra Table</p>
</table>
</table>
<table>
<p>Table2</p>
</table>
</body>
</html>
"""
soup = BeautifulSoup(text, 'html.parser')
tables = soup.find('body').find_all('table')
print(len(tables))             # 3
print(tables[1].text.strip())  # Extra Table -- not the table you want, and no warning
tables = soup.find('body').find_all('table', recursive=False)
print(len(tables))             # 2
print(tables[1].text.strip())  # Table2 -- your desired output
Here's my version
# Import bs4
from bs4 import BeautifulSoup
# Read your HTML
#html_doc = your html
# Get BS4 object
soup = BeautifulSoup(html_doc, "lxml")
# Find next Sibling Table to H3 Header with text "THE GOOD STUFF"
the_good_table = soup.find(name='h3', text='THE GOOD STUFF').find_next_sibling(name='table')
# Find Second tr in your table
your_tr = the_good_table.findAll(name='tr')[1]
# Find Text Value of First td in your tr
your_string = your_tr.td.text
print(your_string)
Output:
'I WANT THIS STRING'
