Trouble using BeautifulSoup to parse HTML - python

I'm having trouble parsing some HTML using BeautifulSoup.
In this piece of HTML, for example, I want to extract the Target Text. The page contains more HTML like this, so I want to extract all of the Target Texts. I also want to extract the "tt0082971" and put that number and the Target Text in two rows of a tab-delimited file. The numbers after 'tt' change for every instance of Target Text.
<td class="target">
<span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0082971">
</span>
<a href="/target/tt0082971/">
Target Text 1
</a>

BeautifulSoup.select accepts CSS Selectors:
>>> from bs4 import BeautifulSoup
>>>
>>> html = '''
... <td class="target">
... <span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0082971">
... </span>
... <a href="/target/tt0082971/">
... Target Text 1
... </a>
... </td>
... '''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> for td in soup.select('td.target'):
...     span = td.select('span.wlb_wrapper')
...     if span:
...         print(span[0].get('data-tconst'))  # To get `tt0082971`
...         print(td.a.text.strip())  # To get target text
...
tt0082971
Target Text 1
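The answer above prints the values but does not cover the tab-delimited file. A minimal sketch of that step follows, assuming one line per entry with the ID and the text separated by a tab. The helper names `extract_rows` and `write_tsv`, the file name in the comment, and the second entry (`tt0000001`) are made up for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup with two cells; the second entry is invented for illustration.
html = '''
<table><tr>
<td class="target">
<span class="wlb_wrapper" data-tconst="tt0082971"></span>
<a href="/target/tt0082971/">
Target Text 1
</a>
</td>
<td class="target">
<span class="wlb_wrapper" data-tconst="tt0000001"></span>
<a href="/target/tt0000001/">
Target Text 2
</a>
</td>
</tr></table>
'''


def extract_rows(markup):
    """Return (tconst, text) pairs for every td.target that has both parts."""
    soup = BeautifulSoup(markup, 'html.parser')
    rows = []
    for td in soup.select('td.target'):
        span = td.select_one('span.wlb_wrapper')
        if span is not None and td.a is not None:
            rows.append((span.get('data-tconst'), td.a.get_text(strip=True)))
    return rows


def write_tsv(rows, out):
    """Write rows to an open text file object, one tab-separated line per row."""
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerows(rows)


rows = extract_rows(html)
buffer = io.StringIO()  # stands in for open('targets.tsv', 'w', newline='')
write_tsv(rows, buffer)
print(buffer.getvalue())
```

Using the `csv` module rather than joining strings by hand means any tab or newline inside the text gets quoted instead of corrupting the file.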

Related

How to exclude inner tags with beautifulsoup

Hey, I'm currently trying to parse a website and I'm almost done, but there's a small problem. I want to exclude inner tags from some HTML:
<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
I tried using
...find("span", "moto-color5_5") but this returns
Text 1 Text 2
instead of only returning Text 1.
Any suggestions?
Sincerely :)
Excluding inner tags would also exclude Text 1 because it's in an inner tag <strong>.
You can however just find strong inside of your current soup:
html = """<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
result = soup.find("span", "moto-color5_5").find('strong')
print(result.text) # Text 1
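If the goal is "everything except the nested span" rather than "only the strong tag", one possible alternative is to drop the nested span from the parsed tree and read what is left. This is a sketch under that assumption, not the answerer's method; note that `decompose()` modifies the parsed tree in place.

```python
from bs4 import BeautifulSoup

html = """<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span", class_="moto-color5_5")

# Remove every nested <span> from the parsed tree (this modifies `soup`).
for inner in outer.find_all("span"):
    inner.decompose()

# strip=True trims each text fragment and drops the whitespace-only ones.
remaining = outer.get_text(strip=True)
print(remaining)
```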

Python 3 get child elements (lxml)

I am using lxml with html:
from lxml import html
import requests
How would I check whether any of an element's children have class="nearby"?
My code (essentially):
url = "http://www.example.com"
Page = requests.get(url)
Tree = html.fromstring(Page.content)
resultList = Tree.xpath('//p[@class="result-info"]')
i = len(resultList) - 1  # to go through the list backwards
while i >= 0:
    if resultList[i].HasChildWithClass("nearby"):
        print('This result has a child with the class "nearby"')
    i -= 1
How would I replace "HasChildWithClass()" to make it actually work?
Here's an example tree:
...
<p class="result-info">
<span class="result-meta">
<span class="nearby">
... #this SHOULD print something
</span>
</span>
</p>
<p class="result-info">
<span class="result-meta">
<span class="FAR-AWAY">
... # this should NOT print anything
</span>
</span>
</p>
...
I'm not sure why you are using lxml to find the element; BeautifulSoup and re may be a better choice.
lxml = """
<p class="result-info">
<span class="result-meta">
<span class="nearby">
... #this SHOULD print something
</span>
</span>
</p>
<p class="result-info">
<span class="result-meta">
<span class="FAR-AWAY">
... # this should NOT print anything
</span>
</span>
</p>
"""
That said, here is what you asked for, using lxml:
from lxml import html
Tree = html.fromstring(lxml)  # `lxml` is the HTML string defined above
resultList = Tree.xpath('//p[@class="result-info"]')
for result in resultList:
    for e in result.iter():
        if e.attrib.get("class") == "nearby":
            print(e.text)
Or try bs4:
from bs4 import BeautifulSoup
soup = BeautifulSoup(lxml,"lxml")
result = soup.find_all("span", class_="nearby")
print(result[0].text)
Here is an experiment I did.
Take r = resultList[0] in a Python shell and type:
>>> dir(r)
['__bool__', '__class__', ..., 'find_class', ...
Now this find_class method is highly suspicious. If you check its help doc:
>>> help(r.find_class)
you'll confirm the guess. Indeed,
>>> r.find_class('nearby')
[<Element span at 0x109788ea8>]
For the other tag s = resultList[1] in the example xml code you gave,
>>> s.find_class('nearby')
[]
Now it's clear how to tell whether a 'nearby' child exists or not.
Cheers!
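The `find_class` experiment above can be wrapped into a drop-in replacement for the asker's placeholder. The sketch below is one way to do it; the helper name `has_descendant_with_class` and the sample markup are assumptions made for illustration. Since `find_class` can also match the element it is called on, the helper filters that element out.

```python
from lxml import html

markup = """
<div>
<p class="result-info"><span class="result-meta"><span class="nearby">close</span></span></p>
<p class="result-info"><span class="result-meta"><span class="FAR-AWAY">far</span></span></p>
</div>
"""


def has_descendant_with_class(element, class_name):
    """Hypothetical helper: True if any descendant of `element` has `class_name`."""
    matches = element.find_class(class_name)
    # find_class may return the element itself, so exclude it.
    return any(match is not element for match in matches)


tree = html.fromstring(markup)
results = tree.xpath('//p[@class="result-info"]')

# Walk the list backwards, as in the question.
flags = [has_descendant_with_class(p, "nearby") for p in reversed(results)]
print(flags)
```

Iterating with `reversed()` avoids the manual index bookkeeping of a `while` loop.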

lxml data from two tags

My code looks like this:
from urllib.request import urlopen
from lxml import etree
response = urlopen("file:///C:/data20140801.html")
page = response.read()
tree = etree.HTML(page)
data = tree.xpath("//p/span/text()")
The HTML page could have this structure:
<span style="font-size:10.0pt">Something</span>
The HTML page could also have this structure:
<p class="Normal">
<span style="font-size:10.0pt">Some</span>
<span style="font-size:10.0pt">thing</span>
</p>
Using the same code for both, I want to get "Something".
The XPath expression returns a list of values:
>>> from lxml import etree
>>> tree = etree.HTML('''\
... <p class="Normal">
... <span style="font-size:10.0pt">Some</span>
... <span style="font-size:10.0pt">thing</span>
... </p>
... ''')
>>> tree.xpath("//p/span/text()")
['Some', 'thing']
Use ''.join() to combine the two strings into one:
>>> ''.join(tree.xpath("//p/span/text()"))
'Something'
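Joining over the whole document merges the text of every paragraph into one string. When a page holds several paragraphs, a per-paragraph join may be closer to what is wanted. The sketch below assumes each value lives in the direct span children of a single p element; the sample page is invented for illustration.

```python
from lxml import etree

page = '''
<html><body>
<p class="Normal"><span>Something</span></p>
<p class="Normal">
<span>Some</span>
<span>thing</span>
</p>
</body></html>
'''

tree = etree.HTML(page)

# One joined string per <p>, built only from the text of its direct <span> children.
texts = [''.join(p.xpath('./span/text()')) for p in tree.xpath('//p')]
print(texts)
```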

Python splitting the HTML

So I have an HTML markup and I'd like to access a tag with a specific class inside a tag with a specific id. For example:
<tr id="one">
<span class="x">X</span>
.
.
.
.
</tr>
How do I get the content of the tag with the class "x" inside the tag with an id of "one"?
I'm not used to working with lxml.xpath, so I tend to use BeautifulSoup. Here is a solution with BeautifulSoup:
>>> HTML = """<tr id="one">
... <span class="x">X</span>
... <span class="ax">X</span>
... <span class="xa">X</span>
... </tr>"""
>>>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(HTML, 'html.parser')
>>> tr = soup.find('tr', {'id':'one'})
>>> span = tr.find('span', {'class':'x'})
>>> span
<span class="x">X</span>
>>> span.text
'X'
You need something called "XPath".
from lxml import html
tree = html.fromstring(my_string)  # my_string holds the HTML
x = tree.xpath('//*[@id="one"]/span[@class="x"]/text()')
print(x[0]) # X
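The same "class inside id" lookup can also be written as a single CSS selector with bs4's `select_one`. This is a sketch under the assumption that the row sits inside a table and the spans inside cells; the sample markup is adapted for illustration.

```python
from bs4 import BeautifulSoup

HTML = """<table>
<tr id="one">
<td><span class="x">X</span></td>
<td><span class="ax">AX</span></td>
<td><span class="xa">XA</span></td>
</tr>
</table>"""

soup = BeautifulSoup(HTML, 'html.parser')

# '#one span.x' means: a <span> with class "x" anywhere below the element with id "one".
span = soup.select_one('#one span.x')
content = span.get_text() if span is not None else None
print(content)
```

The class selector matches whole class names only, so the spans with class "ax" and "xa" are not picked up.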

Segregating text from bold tags within td tags using beautifulsoup

I'm trying to retrieve what is in-between the td tags without any tags in the following:
<td class="formSummaryPosition"><b>1</b>/9</td>
This is what I have written so far:
o = []
for race in table:
    for pos in race.findAll("td", {"class":"Position"}):
        o.append(pos.contents)
I understand that .contents will give me the following:
[[<b>1</b>, u'/9'], [<b>4</b>, u'/11'], [<b>2</b>, u'/8'], ...]
Ultimately I would like to have:
o = [[1/9],[4/11],[2/8]...]
I would appreciate if anyone had any idea on how to achieve this most efficiently?
Cheers
Use the get_text() method on an element:
If you only want the text part of a document or tag, you can use the
get_text() method. It returns all the text in a document or beneath a
tag, as a single Unicode string.
>>> from bs4 import BeautifulSoup
>>> data = """
... <table>
... <tr>
... <td class="formSummaryPosition"><b>1</b>/9</td>
... <td class="formSummaryPosition"><b>4</b>/11</td>
... <td class="formSummaryPosition"><b>2</b>/8</td>
... </tr>
... </table>
... """
>>> soup = BeautifulSoup(data, 'html.parser')
>>> print([td.get_text() for td in soup.find_all('td', class_='formSummaryPosition')])
['1/9', '4/11', '2/8']
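If the two numbers are needed separately rather than as one string, the text from `get_text()` can be split on the slash. A minimal sketch, assuming every cell has the form "position/total"; the helper name `parse_position` is hypothetical.

```python
from bs4 import BeautifulSoup

data = """
<table>
<tr>
<td class="formSummaryPosition"><b>1</b>/9</td>
<td class="formSummaryPosition"><b>4</b>/11</td>
<td class="formSummaryPosition"><b>2</b>/8</td>
</tr>
</table>
"""


def parse_position(text):
    """Hypothetical helper: turn '1/9' into the integer pair (1, 9)."""
    position, total = text.strip().split('/')
    return int(position), int(total)


soup = BeautifulSoup(data, 'html.parser')
cells = soup.find_all('td', class_='formSummaryPosition')
pairs = [parse_position(td.get_text()) for td in cells]
print(pairs)
```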
