HTML div with padding - python

I apologize in advance because I'm sure this is pretty straightforward, but frontend isn't my field!
I have an html document I'm having issues with. The general structure is as follows:
<div style="padding-bottom: 1cm;">
<h4>Digital</h4>
<h5 style="text-align: left;">Data Table:</h5>
<table>
<!-- the rest of the table here -->
</table>
</div>
I have several of these div elements being generated by Jinja (Python). My intention with the padding is to add space between the div elements on the page. This works as I'd expect if the table is empty (or has very few rows), but as soon as the table grows, the padding no longer works as I'd expect.
With the HTML above, the empty space between 'Second table' and 'Third table' comes out smaller than the space between 'First table' and 'Second table'.

Use margin-bottom instead and allow overflow on your parent element; padding adds space inside the element's box, while margin adds space outside it.
For a better result, style the parent element that holds the divs: use flexbox or grid.
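A minimal sketch of the first suggestion, swapping the inline padding-bottom for margin-bottom in the question's own snippet:
<div style="margin-bottom: 1cm;">
<h4>Digital</h4>
<h5 style="text-align: left;">Data Table:</h5>
<table>
<!-- the rest of the table here -->
</table>
</div>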


XPath, nested conditions

I have the following HTML code, and I need an XPath expression that finds the table element.
<div>
<div>Dezember</div>
<div>
<div class="dash-table-container">more divs</div>
</div>
</div>
My current Xpath expression:
//div[./div[1]/text() = "Dezember"]/preceding::div[./div[2][@class="dash-table-container"]]
I don't know how to check if the dash table container is the last one loaded, since I have many of them. So I need the check if it's under the div with "Dezember" as a text because the div's before with the other months are being loaded faster.
I want the XPATH to select the "dash table container" div.
Thanks in advance
To select the div with the text content of "more divs", you can use
//div/div[@class="dash-table-container" and ../preceding-sibling::div[1]="Dezember"]
and to select its parent div element, use
//div[div/@class="dash-table-container"][preceding-sibling::div[1]="Dezember"]/..
I figured it out.
//div[preceding-sibling::div="Dezember"]/div[@class="dash-table-container"]
worked perfectly for me.
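For reference, a minimal sketch that evaluates the working expression with lxml (an illustration only; page_source is a hypothetical variable holding the fetched HTML):
from lxml import html

tree = html.fromstring(page_source)  # page_source: the fetched page (hypothetical name)
containers = tree.xpath('//div[preceding-sibling::div="Dezember"]/div[@class="dash-table-container"]')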

How can I load an html file into a multilevel array of elements in python

In an ideal world, I'm trying to figure out how to load an HTML document into a nested list of elements, for example:
elements=[['h1', 'This is the first heading.'], ['p', 'Someone made a paragraph. A short one.'], ['table', ['tr', ['td', 'a table cell']]]]
I've played a little with BeautifulSoup, but can't see a way to do this.
Is this currently doable, or do I need to write a parser?
In an ideal world (definition: one where the website you want to read has well-formed XHTML), you can toss it to an XML parser like lxml and you'll get something much like that back. The very short version:
Elements behave like lists: the entries are the subelements, in document order.
Elements behave like dictionaries: they hold the key="value" attributes from the tag.
Elements have a text attribute, which is the text between the opening tag and the first sub-element.
Elements have a tail attribute, which is the text after the element's closing tag.
Once you have a tree in a shape like that, then you can probably write a 3-line function that rebuilds it the way you want.
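For instance, a minimal sketch of that rebuild function with lxml (the exact output shape is an assumption based on the example in the question):
from lxml import etree

def to_nested(el):
    # tag name first, then any element text, then the children, recursively
    out = [el.tag]
    if el.text and el.text.strip():
        out.append(el.text.strip())
    out.extend(to_nested(child) for child in el)
    return out

root = etree.fromstring('<table><tr><td>a table cell</td></tr></table>')
print(to_nested(root))  # ['table', ['tr', ['td', 'a table cell']]]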
XHTML is basically restricted HTML: a combination of HTML and XML. In theory, sites should serve your browser XHTML, since it's better in every way, but most browsers are a lot more permissive, so most sites don't bother with the stricter form.
One problem most sites have, for example, is omitted closing tags. XML parsers tend to error out on those.
You can use recursion:
html = """
<html>
<body>
<h1>This is the first heading.</h1>
<p>Someone made a paragraph. A short one.</p>
<table>
<tr>
<td>a table cell</td>
</tr>
</table>
</body>
</html>
"""
import bs4
def to_list(d):
    # recurse into tags; keep text nodes as-is; skip bare newlines between tags
    return [d.name, *[to_list(i) if not isinstance(i, bs4.element.NavigableString) else i
                      for i in d.contents if i != '\n']]

_, *r = to_list(bs4.BeautifulSoup(html, 'html.parser').body)
print(r)
Output:
[['h1', 'This is the first heading.'], ['p', 'Someone made a paragraph. A short one.'], ['table', ['tr', ['td', 'a table cell']]]]

In BeautifulSoup, Ignore Children Elements While Getting Parent Element Data

I have html as follows:
<html>
<div class="maindiv">
text data here
<br>
continued text data
<br>
<div class="somename">
text & data I want to omit
</div>
</div>
</html>
I am trying to get only the text found in the maindiv element, without the text found in the somename element. In most cases, in my experience anyway, text data is contained within some child element. I have run into this particular case, however, where the data is placed somewhat willy-nilly and is a bit harder to filter.
My approach is as follows:
textdata = soup.find('div', class_='maindiv').get_text()
This gets all the text data found within the maindiv element, as well as the text data found in the somename div element.
The logic I'd like to use is more along the lines of:
textdata = soup.find('div', class_='maindiv').get_text(recursive=False)
which would omit any text data found within the somename element.
I know the recursive=False argument works for locating only top-level elements when searching the DOM structure with BeautifulSoup, but it can't be used with the .get_text() method.
I've considered the approach of finding all the text, then subtracting the string data found in the somename element from the string data found in the maindiv element, but I'm looking for something a little more efficient.
Not that far from your subtracting method, but one way to do it (at least in Python 3) is to discard all child divs.
s = soup.find('div', class_='maindiv')
for child in s.find_all("div"):
    child.decompose()
print(s.get_text())
Would print something like:
text data here
continued text data
That might be a bit more efficient and flexible than subtracting the strings, though it still needs to go through the children first.
from bs4 import BeautifulSoup
html ='''
<html>
<div class="maindiv">
text data here
<br>
continued text data
<br>
<div class="somename">
text & data I want to omit
</div>
</div>
</html>'''
soup = BeautifulSoup(html, 'lxml')
soup.find('div', class_="maindiv").next_element
Output:
'\n text data here \n '
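If you need every direct text node rather than just the first one, a small extension of the same idea (a sketch; find_all with string=True and recursive=False returns only the div's own text children):
texts = soup.find('div', class_="maindiv").find_all(string=True, recursive=False)
print(' '.join(t.strip() for t in texts if t.strip()))
# text data here continued text data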

lxml xpath get text between two nested tables

I have an HTML document that has nested tables. I wish to find the text between an outside table and its inside tables. I thought this was a classic question, but so far I haven't found an answer. What I have come up with is
tree.xpath('//p[not(ancestor-or-self::table)]'). But this isn't working, because all the text descends from the outside table. Also, just using preceding::table isn't enough, because the text can surround the inside tables.
For a conceptual example, if a table looks like this: [...text1...[inside table No.1]...text2...[inside table No.2]...text3...], how can I get text1/2/3 only, without contamination by text from the inside tables No.1 and No.2? Here is my thought: is it possible to build a concept of table layers via XPath, so I can tell lxml or another library, "Give me all text between layer 0 and 1"?
Below is a simplified sample HTML file. In reality, the outside table may contain many nested tables, but I just want the text between the outermost table and its first-level nested tables. Thanks, folks!
<table>
<tr><td>
<p> text I want </p>
<div> they can be in different types of nodes </div>
<table>
<tr><td><p> unwanted text </p></td></tr>
<tr><td>
<table>
<tr><td><u> unwanted text</u></td></tr>
</table>
</td></tr>
</table>
<p> text I also want </p>
<div> as long as they're inside the root table and outside the first-level inside tables </div>
</td></tr>
<tr><td>
<u> they can be between the first-level inside tables </u>
<table>
</table>
</td></tr>
</table>
The desired result is ["text I want", "they can be in different types of nodes", "text I also want", "as long as they're inside the root table and outside the first-level inside tables", "they can be between the first-level inside tables"].
One XPath that could do this, if the outermost table is the root element:
/table/descendant::table[1]/preceding::p
Here, you traverse to the first descendant table of the outermost table, and then select all its preceding p elements.
If not, you will have to take a different approach to accessing the p elements in between the tables, maybe using the generate-id() function.
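As a sketch of the "layer" idea from the question (my own illustration, not part of the answer above): in XPath 1.0 you can count table ancestors, so text nodes with exactly one table ancestor sit directly inside the outermost table and outside every nested one:
from lxml import html

tree = html.fromstring(sample_html)  # sample_html: the snippet above (hypothetical name)
texts = [t.strip() for t in tree.xpath('//text()[count(ancestor::table) = 1]') if t.strip()]
print(texts)  # yields the five wanted strings from the sample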

Using scrapy to digest loosely formatted, primitive html into blocks of text

I've just been introduced to Scrapy. I've gone through the basic tutorial, but feel I don't quite grok how to use it.
A simple case I need to parse looks like:
BASIC CASE
<li>
<b>Descriptive Title</b>
<br>
The_first_paragraph_of_text
<p>Second paragraphs of text</p>
...
<p>Nth Paragraph of text</p>
</li>
What I want to do is generate a database record with two columns, "title" and "body_text". Title comes from the 'Descriptive Title' and the body_text comes from taking all paragraphs of text and concatenating into a block of text.
I wrote something simple like
"""pulls out the Descriptive Title and all the <p>-wrapped paragraphs but misses the first paragraph (which isn't wrapped in a <p>"""
for sel in response.xpath("//li"):
    b = sel.xpath('b').extract()
    print "b = {0}\n".format(b)
    for p in sel.xpath('p'):
        paragraph = p.xpath('text()').extract()
        print "\n{0}".format(paragraph)
But this doesn't catch the unwrapped first paragraph, only paragraphs two and onwards. It's also not robust to variations of the <li> HTML blocks.
In one variation the first paragraph is sometimes wrapped in italics.
ITALICS COMPLICATION
<li>
<b>Descriptive Title</b>
<br>
<i>Occasionally The_first_paragraph_of_text is in italics</i>
<p>Second paragraphs of text</p>
...
<p>Nth Paragraph of text</p>
</li>
In another variation, sometimes <li> are embedded inside some of the paragraph blocks.
SUB LIST-ITEM COMPLICATION
<li>
<b>Descriptive Title</b>
<br>
<i>Occasionally The_first_paragraph_of_text is in italics</i>
<p>Second paragraphs of text</p>
<p>Sometimes paragraphs will have lists inside them
<li>idea 1</li>
<li>idea 2</li>
<li>idea N</li>
</p>
<p>Nth Paragraph of text</p>
</li>
I suspect I'm not really digesting the html file in a "scrapythonic" way. What's the right approach to write a more robust Selector(s) to pull out what I want?
This is more of an XPath question than a Scrapy question. If your title is always in the first element, being a <b> tag, it's as easy as sel.xpath('b[1]/text()'). If I were you, however, I would add some strict assertions to make the spider fail rather than scrape the wrong title text, because a <b> tag can often play other roles.
Try:
title, = sel.xpath('*[1][self::b and count(node())=count(text())]')
title = u'\n'.join(title.xpath('text()').extract())
This reads as follows (assertions in parentheses):
There must exist one and only one tag (asserted by title, =)
that is the first tag (asserted by [1])
and is a <b> (asserted by self::b)
and has only text nodes (asserted by count(text())=count(node()))
For the body text, you'll have to get used to the use of HTML for formatting, unless you are scraping a site that formats its main content very simply (e.g. arranged <p> tags).
You can get all the text descendants in document order with:
sel.xpath('.//text()')
However, <script> and <style> tags may contribute unwanted text; you don't want your body text to be filled with JavaScript/CSS gibberish.
You can prevent them from being selected with:
sel.xpath('.//text()[not(parent::script or parent::style)]')
Then, you'd probably want to join the extracted text with u'\n'.join(). This will solve your problem, but it will not handle all unwanted tags.
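Putting the pieces together, a minimal sketch (my own assembly of the fragments above; it assumes sel is the <li> selector, and the not(ancestor::b) filter is an extra assumption so the title isn't repeated in the body):
title_sel, = sel.xpath('*[1][self::b and count(node())=count(text())]')
title = u'\n'.join(title_sel.xpath('text()').extract())
body_text = u'\n'.join(
    t.strip()
    for t in sel.xpath('.//text()[not(parent::script or parent::style) and not(ancestor::b)]').extract()
    if t.strip())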
Play around with lxml.html.clean.Cleaner(style=True) (documented at http://lxml.de/lxmlhtml.html#cleaning-up-html), and consider finding a library that renders HTML to text.
