I have some HTML markup and I'd like to access a tag with a specific class inside a tag with a specific id. For example:
<tr id="one">
<span class="x">X</span>
.
.
.
.
</tr>
How do I get the content of the tag with the class "x" inside the tag with an id of "one"?
I'm not used to working with lxml.xpath, so I always tend to use BeautifulSoup. Here is a solution with BeautifulSoup:
>>> HTML = """<tr id="one">
... <span class="x">X</span>
... <span class="ax">X</span>
... <span class="xa">X</span>
... </tr>"""
>>>
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(HTML)
>>> tr = soup.find('tr', {'id':'one'})
>>> span = tr.find('span', {'class':'x'})
>>> span
<span class="x">X</span>
>>> span.text
u'X'
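With a current bs4 install the same lookup can also be written as a CSS selector via select_one(); a minimal sketch, assuming the question's markup:
from bs4 import BeautifulSoup

HTML = '<tr id="one"><span class="x">X</span></tr>'
soup = BeautifulSoup(HTML, 'html.parser')
# CSS selector: the span with class "x" inside the element with id "one"
span = soup.select_one('#one span.x')
print(span.text)  # X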
You need something called XPath.
from lxml import html
tree = html.fromstring(my_string)  # my_string is your HTML markup
x = tree.xpath('//*[@id="one"]/span[@class="x"]/text()')
print x[0] # X
The html is as follows:
<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>
I'm trying to get all the divs and cast them into strings:
divs = [str(i) for i in soup.find_all('div')]
However, they'll have their children too:
>>> ["<div name='tag-i-want'><span>I don't want this</span></div>"]
What I'd like it to be is:
>>> ["<div name='tag-i-want'></div>"]
I figured there is unwrap() which would return this, but it modifies the soup as well; I'd like the soup to remain untouched.
With clear() you remove the tag's content.
Without altering the soup, you can either make a full copy with copy.copy() or use a DIY approach. Here is an example with copy:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div')
div_only = copy.copy(div)
div_only.clear()
print(div_only)
print(soup.find_all('span') != [])
Output
<div name="tag-i-want"></div>
True
Remark: the DIY approaches, without copy:
Use the Tag class:
from bs4 import BeautifulSoup, Tag
...
div_only = Tag(name='div', attrs=div.attrs)
Use plain strings:
div_only = '<div {}></div>'.format(' '.join(map(lambda p: f'{p[0]}="{p[1]}"', div.attrs.items())))
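As a quick check of the string-based variant (a sketch that reuses div from the copy example above; div.attrs is {'name': 'tag-i-want'} for this sample):
attrs = ' '.join(f'{k}="{v}"' for k, v in div.attrs.items())
print(f'<div {attrs}></div>')  # <div name="tag-i-want"></div>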
@cards pointed me in the right direction with copy(). This is what I ended up using:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
def remove_children(tag):
    tag.clear()
    return tag
divs = [str(remove_children(copy.copy(i))) for i in soup.find_all('div')]
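For the sample HTML above this should give the child-free strings while leaving the soup intact; a quick check:
print(divs)                           # ['<div name="tag-i-want"></div>']
print(soup.find('span') is not None)  # True - the original soup still has its <span>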
import requests
from bs4 import BeautifulSoup
url = 'http://www.x-rates.com/table/?from=USD&amount=1'
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
print soup.prettify()
<td>
Chinese Yuan Renminbi
</td>
<td class="rtRates">
<a href="/graph/?from=USD&to=CNY">
6.887711
</a>
</td>
<td class="rtRates">
<a href="/graph/?from=CNY&to=USD">
0.145186
</a>
</td>
</tr>
May I ask how I can extract the content between the 'a' tags?
Say I want to get 6.887711 in the 6th row of the result.
If you just want to get the first tag, you can make use of the difference in the href attributes and use a regex to match the corresponding tag. For instance, the href of the first tag ends with CNY, so you can use the re module with the regex CNY$ to match the href attribute:
import re
soup.find("a", {"href": re.compile("CNY$")}).text
# '6.888069'
You can use soup.find_all() to iterate through all of them:
for tag in soup.find_all("a"):
print(tag.text.strip())
Which would output:
6.887711
0.145186
...
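If you need each rate next to its currency name, one option is to walk the table rows instead of the bare a tags; a sketch, assuming the live page still uses the <td>/<td class="rtRates"> row layout shown above:
import requests
from bs4 import BeautifulSoup

url = 'http://www.x-rates.com/table/?from=USD&amount=1'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

for row in soup.find_all('tr'):
    cells = row.find_all('td')
    # a name cell followed by the two rtRates cells, as in the snippet above
    if len(cells) == 3 and 'rtRates' in (cells[1].get('class') or []):
        print(cells[0].text.strip(), cells[1].a.text.strip())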
In a situation like this you might consider using the lxml library because it makes XPath available.
>>> from lxml import etree
>>> import requests
>>> url = 'http://www.x-rates.com/table/?from=USD&amount=1'
>>> HTML = requests.get(url).text
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(HTML, parser=parser)
>>> currency = tree.xpath('.//table[2]/tbody/tr[3]/td[1]')
>>> currency[0].text
'Bahraini Dinar'
>>> USDrate = tree.xpath('.//table[2]/tbody/tr[3]/td[3]/a')
>>> USDrate[0].text
'2.652179'
In this case I have found the second table, then the third row of that table and then the first and third cells of that row.
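For the specific value in the question (the USD to CNY rate), matching on the link's href may be less fragile than counting rows; a sketch, assuming the links still look like /graph/?from=USD&to=CNY:
cny = tree.xpath('.//a[contains(@href, "to=CNY")]')
if cny:
    print(cny[0].text)  # the USD -> CNY rate from the question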
I'm trying to retrieve what is between the td tags, without any tags, in the following:
<td class="formSummaryPosition"><b>1</b>/9</td>
This is what I have written so far:
o = []
for race in table:
    for pos in race.findAll("td", {"class":"Position"}):
        o.append(pos.contents)
I understand that .contents will provide me with the following:
[[<b>1</b>, u'/9'], [<b>4</b>, u'/11'], [<b>2</b>, u'/8'], ...]
Ultimately I would like to have:
o = [[1/9],[4/11],[2/8]...]
I would appreciate it if anyone has any ideas on how to achieve this most efficiently.
Cheers
Use the get_text() method on an element:
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.
>>> from bs4 import BeautifulSoup
>>> data = """
... <table>
... <tr>
... <td class="formSummaryPosition"><b>1</b>/9</td>
... <td class="formSummaryPosition"><b>4</b>/11</td>
... <td class="formSummaryPosition"><b>2</b>/8</td>
... </tr>
... </table>
... """
>>> soup = BeautifulSoup(data)
>>> print [td.get_text() for td in soup.find_all('td', class_='formSummaryPosition')]
[u'1/9', u'4/11', u'2/8']
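Plugged into the original loop it would look something like this (a sketch; table is assumed to be the iterable of race sections from the question):
o = []
for race in table:
    for pos in race.find_all('td', class_='formSummaryPosition'):
        o.append(pos.get_text())
# o -> ['1/9', '4/11', '2/8', ...]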
I'm having trouble parsing some html using beautifulsoup.
In this piece of HTML, for example, I want to extract the Target Text. There is more HTML like this elsewhere on the page, so I want to extract all the Target Texts. I also want to extract the "tt0082971" and put that number and the Target Text in two rows of a tab-delimited file. The numbers after 'tt' change for every instance of Target Text.
<td class="target">
<span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0082971">
</span>
<a href="/target/tt0082971/">
Target Text 1
</a>
BeautifulSoup.select accepts CSS Selectors:
>>> from bs4 import BeautifulSoup
>>>
>>> html = '''
... <td class="target">
... <span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0082971">
... </span>
... <a href="/target/tt0082971/">
... Target Text 1
... </a>
... </td>
... '''
>>> soup = BeautifulSoup(html)
>>> for td in soup.select('td.target'):
... span = td.select('span.wlb_wrapper')
... if span:
... print span[0].get('data-tconst') # To get `tt0082971`
... print td.a.text.strip() # To get target text
...
tt0082971
Target Text 1
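To get the tab-delimited file mentioned in the question, the same selectors can feed a simple writer; a sketch (the filename targets.tsv is just a placeholder, and soup is assumed to be parsed from the full page):
with open('targets.tsv', 'w') as out:
    for td in soup.select('td.target'):
        span = td.select('span.wlb_wrapper')
        if span:
            # one id/text pair per line, separated by a tab
            out.write('{}\t{}\n'.format(span[0].get('data-tconst'), td.a.text.strip()))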
I'd like to extract the content Hello world. Please note that there are multiple <table> elements and similar <td colspan="2"> cells on the page as well:
<table border="0" cellspacing="2" width="800">
<tr>
<td colspan="2"><b>Name: </b>Hello world</td>
</tr>
<tr>
...
I tried the following:
hello = soup.find(text='Name: ')
hello.findPreviousSiblings
But it returned nothing.
In addition, I'm also having a problem extracting the My home address from the following:
<td><b>Address:</b></td>
<td>My home address</td>
I'm also using the same method to search for text="Address: ", but how do I navigate down to the next line and extract the content of the <td>?
The contents attribute works well for extracting text from <tag>text</tag>.
<td>My home address</td> example:
s = '<td>My home address</td>'
soup = BeautifulSoup(s)
td = soup.find('td') #<td>My home address</td>
td.contents #['My home address']
<td><b>Address:</b></td> example:
s = '<td><b>Address:</b></td>'
soup = BeautifulSoup(s)
td = soup.find('td').find('b') #<b>Address:</b>
td.contents #['Address:']
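The "Name: " row from the question works the same way: the td's contents are the <b> tag followed by the text node, so the wanted text is the second item. A sketch using bs4:
from bs4 import BeautifulSoup

s = '<td colspan="2"><b>Name: </b>Hello world</td>'
soup = BeautifulSoup(s, 'html.parser')
td = soup.find('td')
print(td.contents)     # [<b>Name: </b>, 'Hello world']
print(td.contents[1])  # Hello world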
Use .next instead:
>>> s = '<table border="0" cellspacing="2" width="800"><tr><td colspan="2"><b>Name: </b>Hello world</td></tr><tr>'
>>> soup = BeautifulSoup(s)
>>> hello = soup.find(text='Name: ')
>>> hello.next
u'Hello world'
.next and .previous let you move through the document elements in the order they were processed by the parser, while the sibling methods work with the parse tree.
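For the Address part of the question, bs4's sibling helpers do the same hop explicitly: find the label, step up to its td, then take the next td in the row. A sketch:
from bs4 import BeautifulSoup

s = '<table><tr><td><b>Address:</b></td><td>My home address</td></tr></table>'
soup = BeautifulSoup(s, 'html.parser')
label = soup.find('b', text='Address:')                  # the label's <b> tag
value = label.find_parent('td').find_next_sibling('td')  # the next <td> in the row
print(value.text)  # My home address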
Use the code below to extract text and content from HTML tags with Python BeautifulSoup:
s = '<td>Example information</td>' # your raw html
soup = BeautifulSoup(s) # parse the html with BeautifulSoup
td = soup.find('td') # the tag of interest: <td>Example information</td>
td.text # 'Example information' - the text with the html stripped
from bs4 import BeautifulSoup, Tag

def get_tag_html(tag: Tag):
    # Return the tag's inner HTML: child tags are serialised via decode(),
    # plain text nodes are kept as-is.
    return ''.join([i.decode() if type(i) is Tag else i for i in tag.contents])
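A usage sketch: the helper returns a tag's inner HTML, children included, as one string:
soup = BeautifulSoup('<td><b>Name: </b>Hello world</td>', 'html.parser')
print(get_tag_html(soup.td))  # <b>Name: </b>Hello world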