I really can't figure this out. I parsed the linked page with BeautifulSoup and did this:
soup.find(text='Title').find_parent('h3')
And it does not find anything. If you take a look at the code of the linked page, you'll see an h3 tag which contains the word Titles.
The exact point is:
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>
If I make BS parse the line above only, it works perfectly. I tried also with:
soup.find(text='Title').find_parents('h3')
soup.find(text='Title').find_parent(class_='findSectionHeader')
which both work on the line only, but don't work on the entire html.
If I do a soup.find(text='Titles').find_parents('div') it works with the entire html.
Before the findSectionHeader H3 tag, there is another tag with Title in the text:
>>> soup.find(text='Title').parent
Title
You need to be more specific in your search: search for Titles instead, and loop to find the correct one:
>>> soup.find(text='Titles').parent
<option value="tt">Titles</option>
>>> for elem in soup.find_all(text='Titles'):
...     parent_h3 = elem.find_parent('h3')
...     if parent_h3 is None:
...         continue
...     print parent_h3
...
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>
find(text='...') only matches the full text, not a partial match. Use a regular expression if you need partial matches instead:
>>> import re
>>> soup.find_all(text='Title')
[u'Title']
>>> soup.find_all(text=re.compile('Title'))
[u'Titles', u'Titles', u'Titles', u'Title', u'Advanced Title Search']
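If you want to avoid the explicit loop, the same filtering can be collapsed into one expression; a rough sketch against the same soup:
>>> next(t.find_parent('h3') for t in soup.find_all(text='Titles') if t.find_parent('h3') is not None)
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>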
I'm trying to get some text without tags using BeautifulSoup. I tried using .string, .contents, .text, .find(text=True), and .next_sibling, and they are listed below.
Edit
Never mind, I just noticed that .next_sibling works for me. Anyway, this question can serve as a note collecting methods for handling similar cases.
from bs4 import BeautifulSoup
s = """
<p>
<a>
Something I can fetch but don't want
</a>
I want to fetch this line.
<a>
Something else I can fetch but don't want
</a>
</p>
"""
p = BeautifulSoup(s, 'html.parser')
print p.contents
# [u'\n', <p>
# <a>
# Something I can fetch but don't want
# </a>
# I want to fetch this line.
# <a>
# Something else I can fetch but don't want
# </a>
# </p>, u'\n']
print p.a.next_sibling.string
# I want to fetch this line.
print p.string
# None
print p.text
# all the texts, including those I can get but don't want.
print p.find(text=True)
# Returns an empty line of type bs4.element.NavigableString
print p.find(text=True)[0]
# Returns an empty line of type unicode
I'm wondering if there is a simpler method than manually parsing the string s to get the line I want to fetch?
Try this. It is still rough but at least it doesn't require you to manually parse the strings.
# get all non-empty strings from the soup
texts = [x.strip() for x in p.strings if x.strip() != '']
# get the (stripped) text of every tag
unwanted_text = [x.text.strip() for x in p.find_all()]
# take the difference
set(texts).difference(unwanted_text)
This yields:
In [87]: set(texts).difference(unwanted_text)
Out[87]: {'I want to fetch this line.'}
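As the edit above notes, .next_sibling also does the job; here is a minimal sketch, using simplified markup modeled on the question's example:
from bs4 import BeautifulSoup

s = "<p><a>don't want</a>\nI want to fetch this line.\n<a>don't want either</a></p>"
soup = BeautifulSoup(s, 'html.parser')
# The text node right after the first <a> is the line we want.
print(soup.p.a.next_sibling.strip())
# I want to fetch this line.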
I am trying to grab all text between a tag that has a specific class name. I believe I am very close to getting it right, so I think all it'll take is a simple fix.
On the website, these are the tags I'm trying to retrieve data from. I want 'SNP'.
<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>
This is what I have currently:
from lxml import html
import requests
def main():
    url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
    page = html.fromstring(requests.get(url_link).text)
    for span_tag in page.xpath("//span"):
        class_name = span_tag.get("class")
        if class_name is not None:
            if "rtq_exch" == class_name:
                print(url_link, span_tag.text)

if __name__ == "__main__":
    main()
I get this:
http://finance.yahoo.com/q?s=^GSPC&d=t None
To show that it works, when I change this line:
if "rtq_dash" == class_name:
I get this (please note the '-' which is the same content between the tags):
http://finance.yahoo.com/q?s=^GSPC&d=t -
What I think is happening is it sees the child tag and stops grabbing the data, but I'm not sure why.
I would be happy with receiving
<span class="rtq_dash">-</span>SNP
as a string for span_tag.text, as I can easily chop off what I don't want.
At a higher level, I'm trying to get the stock symbol from the page.
For reference, I have been working from the documentation for requests and for lxml (XPath).
I want to use XPath instead of BeautifulSoup for several reasons, so please don't suggest switching to that library; not that it would be any easier anyway.
There are a few possible ways. You can find the outer span and return its direct-child text node:
>>> url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
>>> page = html.fromstring(requests.get(url_link).text)
>>> for span_text in page.xpath("//span[@class='rtq_exch']/text()"):
...     print(span_text)
...
SNP
or find the inner span and get its tail:
>>> for span_tag in page.xpath("//span[@class='rtq_dash']"):
...     print(span_tag.tail)
...
SNP
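If you'd rather pull the whole text of the outer span (dash included) in one call, lxml's xpath() also evaluates the XPath string() function; a sketch against the same page object, assuming the page still contains that markup:
>>> print(page.xpath("string(//span[@class='rtq_exch'])").strip())
-SNP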
Use BeautifulSoup:
import bs4
html = """<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>"""
soup = bs4.BeautifulSoup(html)
snp = list(soup.findAll("span", class_="rtq_exch")[0].strings)[1]
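Note that snp here comes out as 'SNP ' with a trailing space carried over from the markup, so call snp.strip() if you want exactly 'SNP'.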
I am using Python with BeautifulSoup 4 to find links in an HTML page whose text matches a particular regular expression. I can find links, and I can find text matching the regex, but the two combined won't work. Here's my code:
import re
import bs4
s = '<a>Sign in&nbsp;<br /></a>'
soup = bs4.BeautifulSoup(s)
match = re.compile(r'sign\s?in', re.IGNORECASE)
print soup.find_all(text=match) # [u'Sign in\xa0']
print soup.find_all(name='a')[0].text # Sign in
print soup.find_all('a', text=match) # []
The comments show the outputs. As you can see, the combined search returns no result, which is strange.
It seems to have something to do with the br tag (or any tag) contained inside the link text. If you delete it, everything works as expected.
You can either look for the tag or look for its text content, but not both together.
Given that:
self.name = u'a'
self.text = SRE_Pattern: <_sre.SRE_Pattern object at 0xd52a58>
from the source:
# If it's text, make sure the text matches.
elif isinstance(markup, NavigableString) or \
        isinstance(markup, basestring):
    if not self.name and not self.attrs and self._matches(markup, self.text):
        found = markup
That makes @Totem's remark the way to go, by design.
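Since name and text can't be combined here, one workaround is to match the text first and then climb to the enclosing a tag; a rough sketch, reusing the markup from the question:
import re
import bs4

s = '<a>Sign in&nbsp;<br /></a>'
soup = bs4.BeautifulSoup(s, 'html.parser')
match = re.compile(r'sign\s?in', re.IGNORECASE)
# Match the text nodes first, then walk up to their <a> parents.
links = [text.find_parent('a') for text in soup.find_all(text=match)]
print(links[0].name)
# a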
Below is my code. It attempts to get the src attribute of an image tag in an HTML page.
import re
for text in open('site.html'):
    matches = re.findall(r'\ssrc="([^"]+)"', text)
    matches = ' '.join(matches)
    print(matches)
The problem is, when I put in something like:
<img src="asdfasdf">
It works, but when I put in an ENTIRE HTML page it returns nothing. Why does it do that, and how do I fix it?
site.html is just the HTML code for a website in standard format. I want the script to ignore everything else and just print the src of the image. If you would like to see what would be inside site.html, go to a basic HTML webpage and copy all of the source code.
Why use a regular expression to parse HTML when you can easily do this with something like BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """This is some text
... <img src="asdasdasd">
... <i> More HTML <b> foo </b> bar </i>
... """
>>> soup = BS(html)
>>> for imgtag in soup.find_all('img'):
...     print(imgtag['src'])
...
asdasdasd
The reason your code doesn't work is that text is one line of the file, so you are only finding matches within a single line on each iteration. Although this may sometimes work, think about what happens if the last line doesn't contain an image tag: matches will be an empty list, and join will turn it into ''. You are overwriting the variable matches on every line.
You want to call findall on the whole HTML:
import re
with open('site.html') as html:
    content = html.read()
matches = re.findall(r'\ssrc="([^"]+)"', content)
matches = ' '.join(matches)
print(matches)
Using a with statement here is much more pythonic. It also means you don't have to call file.close() afterwards, as the with statement deals with that.
I'm trying to grab the string immediately after the opening <td> tag. The following code works:
webpage = urlopen(i).read()
soup = BeautifulSoup(webpage)
for elem in soup('td', text=re.compile(".\.doc")):
    print elem.parent
when the html looks like this:
<td>plan_49913.doc</td>
but not when the html looks like this:
<td>plan_49913.doc<br />
<font color="#990000">Document superseded by: </font>January 2012</td>
I've tried playing with attrs but can't get it to work. Basically I just want to grab 'plan_49913.doc' in either version of the HTML.
Any advice would be greatly appreciated.
This works for me:
>>> html = '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: </font>January 2012</td>'
>>> soup = BeautifulSoup(html)
>>> soup.find(text=re.compile('.\.doc'))
u'plan_49913.doc'
Is there something I'm missing?
Also, note that according to the documentation:
If you use text, then any values you give for name and the keyword arguments are ignored.
So you don't need to pass 'td', since it's already being ignored; that is, matching text under any other tag will be returned as well.
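If you do want to restrict the result to text that sits inside a td, a sketch of one way (assuming the same soup and the re import from the snippet above) is to match the text and then check for a td ancestor:
>>> [t for t in soup.find_all(text=re.compile('.\.doc')) if t.find_parent('td') is not None]
[u'plan_49913.doc']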
Just use the next property; it contains the next node, and here that is a text node.
>>> html = '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: </font>January 2012</td>'
>>> bs = BeautifulSoup(html)
>>> texts = [ node.next for node in bs.findAll('td') if node.next.endswith('.doc') ]
>>> texts
[u'plan_49913.doc']
You can change the if clause to use a regex if you prefer.
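For example, a sketch of the regex version, using the same bs object as above:
>>> import re
>>> [node.next for node in bs.findAll('td') if re.search(r'\.doc$', node.next)]
[u'plan_49913.doc']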