What's wrong with my soup? - python

I am using Python with BeautifulSoup 4 to find links in an HTML page that match a particular regular expression. I can find links, and I can find text matching the regex, but combining the two searches does not work. Here's my code:
import re
import bs4
s = '<a>Sign in&nbsp;<br /></a>'
soup = bs4.BeautifulSoup(s)
match = re.compile(r'sign\s?in', re.IGNORECASE)
print soup.find_all(text=match) # [u'Sign in\xa0']
print soup.find_all(name='a')[0].text # Sign in 
print soup.find_all('a', text=match) # []
The comments show the output. As you can see, the combined search returns no result, which is strange.
It seems to have something to do with the "br" tag (or any tag) contained inside the link text. If you delete it, everything works as expected.

You can either look for the tag or look for its text content, but not both together.
Given that:
self.name = u'a'
self.text = SRE_Pattern: <_sre.SRE_Pattern object at 0xd52a58>
From the source:
# If it's text, make sure the text matches.
elif isinstance(markup, NavigableString) or \
        isinstance(markup, basestring):
    if not self.name and not self.attrs and self._matches(markup, self.text):
        found = markup
That makes @Totem's remark the way to go, by design.
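One possible workaround, sketched here under the assumption that the link text may be split across several child nodes (as with the <br /> in the question), is to select the tags first and then test each tag's full text against the regex. The sample markup and the name matching_links are illustrative and not from the original post.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: one link whose text is followed by a <br />.
html = '<a>Sign in&nbsp;<br /></a><a>Register</a>'
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"sign\s?in", re.IGNORECASE)

# Step 1: select by tag name. Step 2: filter on the concatenated text,
# which get_text() builds from all descendant strings.
matching_links = [a for a in soup.find_all("a") if pattern.search(a.get_text())]

for link in matching_links:
    print(repr(link.get_text()))
```

This keeps the tag search and the text test separate, so it does not depend on how the library combines the two arguments internally.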

Beautiful Soup Nested Tag Search

I am trying to write a Python program that counts the words on a web page. I use Beautiful Soup 4 to scrape the page, but I have difficulty accessing nested HTML tags (for example, <p class="hello"> inside <div>).
Every time I try to find such a tag using the page.findAll() method (page is the Beautiful Soup object containing the whole page), it doesn't find any, although they are there. Is there a simple method or another way to do it?
My guess is that you want to first find a specific div tag and then search for all p tags inside it, to count them or do whatever you want. For example:
soup = bs4.BeautifulSoup(content, 'html.parser')
# This will get the div
div_container = soup.find('div', class_='some_class')
# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)
Hope that helps
Try this one :
data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')
Maybe you can turn it into a lambda and make it cool, but this works. Thanks.
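The two-step searches above can also be written as a single CSS selector with select(), which Beautiful Soup 4 provides. This is only a sketch: the markup and the class names some_class and hello are placeholders taken from the examples in this thread, not from a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with paragraphs at different nesting depths.
html = """
<div class="some_class">
  <p class="hello">first nested paragraph</p>
  <section><p class="hello">deeper nested paragraph</p></section>
  <p class="other">ignored</p>
</div>
<p class="hello">outside the div</p>
"""
soup = BeautifulSoup(html, "html.parser")

# "div.some_class p.hello" matches p.hello at any depth inside the div.
nested = soup.select("div.some_class p.hello")
texts = [p.get_text() for p in nested]
print(texts)
```

The descendant combinator (the space in the selector) is what makes the search ignore how deeply the paragraphs are nested.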
UPDATE: I noticed that text does not always return the expected result. At the same time, I realized there is a built-in way to get the text: reading the docs,
we find that there is a method called get_text(). Use it as:
from bs4 import BeautifulSoup
fd = open('index.html', 'r')
website= fd.read()
fd.close()
soup = BeautifulSoup(website)
contents= soup.get_text(separator=" ")
print "number of words %d" %len(contents.split(" "))
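One detail worth checking in the count above: splitting on a single space produces empty strings whenever whitespace repeats, and it does not split on newlines or tabs. A minimal sketch of the difference, using a hypothetical helper count_words and a made-up sample string standing in for the result of get_text():

```python
def count_words(text):
    """Count whitespace-separated words; split() with no argument
    collapses runs of spaces, tabs, and newlines."""
    return len(text.split())

# Made-up stand-in for the text extracted from a page.
sample = "Hello   world\n\nthis is\ta test "

naive = len(sample.split(" "))  # counts empty strings and misses \n and \t
print(naive, count_words(sample))
```

Calling split() without arguments is usually the safer choice for text extracted from HTML, where whitespace is rarely regular.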
INCORRECT, please read above. Supposing that you have your HTML file locally in index.html, you can:
from bs4 import BeautifulSoup
import re
BLACKLIST = ["html", "head", "title", "script"] # tags to be ignored
fd = open('index.html', 'r')
website= fd.read()
soup = BeautifulSoup(website)
tags=soup.find_all(True) # find everything
print "there are %d" %len(tags)
count= 0
matcher= re.compile(r"(?:\s|\n|<br>)+") # non-capturing group: split() would otherwise also return the separators
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text) # Split using tokens such as \s and \n
    temp = filter(None, temp) # remove empty elements in the list
    count +=len(temp)
print "number of words in the document %d" %count
fd.close()
Please note that the count may not be accurate, because of formatting errors, false positives (it counts any word, even if it is code), text that is shown dynamically using JavaScript or CSS, or other reasons.
You can find all <p> tags using regular expressions (re module).
Note that r.content is a string that contains the whole HTML of the site.
For example:
r = requests.get(url,headers=headers)
p_tags = re.findall(r'<p>.*?</p>',r.content)
This should get you all the <p> tags, whether or not they are nested. If you want the a tags specifically inside those tags, pass the matched tag string as the second argument instead of r.content.
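A caveat on the pattern above: as written, it only matches a bare <p> on a single line, so it skips <p class="hello"> and paragraphs that span several lines. The sketch below shows one way to loosen it; the sample HTML is made up, and regexes remain fragile for real-world markup. Also note that on Python 3, r.content is bytes, so r.text would be the string to search.

```python
import re

# Made-up sample: one paragraph with an attribute, one spanning two lines.
html = '<div><p class="hello">one</p>\n<p>two\nlines</p></div>'

# Original pattern: requires exactly "<p>" and cannot cross newlines.
strict = re.findall(r"<p>.*?</p>", html)

# Looser pattern: \b keeps it from matching <pre>, [^>]* allows attributes,
# and re.DOTALL lets "." match newlines inside the paragraph.
tolerant = re.findall(r"<p\b[^>]*>.*?</p>", html, flags=re.DOTALL)

print(strict)
print(tolerant)
```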
Alternatively, if you just want the text, you can try this:
from readability import Document #pip install readability-lxml
import requests
r = requests.get(url,headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()
This will get you a more bare-bones form of the HTML from the site, which you can then parse.

BeautifulSoup - Get Text within tag only if a certain string is found

I am trying to scrape some scripts from a TV show. I am able to get the text as I need it using BeautifulSoup and Requests.
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.example.com')
s = BeautifulSoup(r.text, 'html.parser')
for p in s.find_all('p'):
    print p.text
This works great so far. But I want only the paragraphs spoken by a certain character. Say his name is "stackoverflow". The text would be like this:
A: sdasd sd asda
B: sdasds
STACKOVERFLOW: Help?
So I only want the stuff that STACKOVERFLOW says. Not the rest.
I have tried
s.find_all(text='STACKOVERFLOW') but I get nothing.
What would be the right way to do this? A hint in the right direction would be most appreciated.
Make the partial text match, either with:
s.find_all(text=lambda text: text and 'STACKOVERFLOW' in text)
Or:
import re
s.find_all(text=re.compile('STACKOVERFLOW'))
You can make a custom function to pass into find_all. This function should take in one argument (tag) and return True for the tags that meet your criteria.
def so_tags(tag):
    '''returns True if the tag has text and 'STACKOVERFLOW' is in the text'''
    return (tag.text and "STACKOVERFLOW" in tag.text)
soup.find_all(so_tags)
You could also make a function factory to make it a bit more dynamic.
def user_paragraphs(user):
    '''returns a function'''
    def user_tags(tag):
        '''returns True for tags that have <user> in the text'''
        return (tag.text and user in tag.text)
    return user_tags
for user in user_list:
    user_posts = soup.find_all(user_paragraphs(user))
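Putting the pieces together, here is a minimal sketch that returns only what one speaker says. It assumes that each line of dialogue sits in its own <p> and starts with "NAME:", which the question suggests but does not confirm. The helper name lines_by and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical transcript markup, one line of dialogue per paragraph.
html = """
<p>A: sdasd sd asda</p>
<p>B: sdasds</p>
<p>STACKOVERFLOW: Help?</p>
"""
soup = BeautifulSoup(html, "html.parser")

def lines_by(soup, speaker):
    """Return the spoken text of every <p> that starts with 'SPEAKER:'."""
    prefix = speaker + ":"
    result = []
    for p in soup.find_all("p"):
        text = p.get_text().strip()
        if text.startswith(prefix):
            # Drop the "SPEAKER:" prefix and surrounding whitespace.
            result.append(text[len(prefix):].strip())
    return result

spoken = lines_by(soup, "STACKOVERFLOW")
print(spoken)
```

Restricting the search to <p> tags avoids matching ancestors such as <body>, whose text also contains the speaker's name.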

using lxml and requests in python to grab text between certain tags with a specific class name

I am trying to grab all text between a tag that has a specific class name. I believe I am very close to getting it right, so I think all it'll take is a simple fix.
In the website these are the tags I'm trying to retrieve data from. I want 'SNP'.
<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>
From what I have currently:
from lxml import html
import requests
def main():
    url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
    page = html.fromstring(requests.get(url_link).text)
    for span_tag in page.xpath("//span"):
        class_name = span_tag.get("class")
        if class_name is not None:
            if "rtq_exch" == class_name:
                print(url_link, span_tag.text)
if __name__ == "__main__":
    main()
I get this:
http://finance.yahoo.com/q?s=^GSPC&d=t None
To show that it works, when I change this line:
if "rtq_dash" == class_name:
I get this (please note the '-' which is the same content between the tags):
http://finance.yahoo.com/q?s=^GSPC&d=t -
What I think is happening is it sees the child tag and stops grabbing the data, but I'm not sure why.
I would be happy with receiving
<span class="rtq_dash">-</span>SNP
as a string for span_tag.text, as I can easily chop off what I don't want.
At a higher level, I'm trying to get the stock symbol from the page.
Here is the documentation for requests, and here is the documentation for lxml (xpath).
I want to use xpath instead of BeautifulSoup for several reasons, so please don't suggest changing to use that library instead, not that it'd be any easier anyway.
There are a few possible ways. You can find the outer span and return its direct-child text node:
>>> url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
>>> page = html.fromstring(requests.get(url_link).text)
>>> for span_text in page.xpath("//span[@class='rtq_exch']/text()"):
...     print(span_text)
...
SNP
or find the inner span and get the tail:
>>> for span_tag in page.xpath("//span[@class='rtq_dash']"):
...     print(span_tag.tail)
...
SNP
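The reason span_tag.text came back as None is the text/tail model that lxml shares with the standard library's xml.etree.ElementTree: an element's text is only what precedes its first child, and anything after a child's closing tag belongs to that child's tail. The sketch below uses ElementTree so that it runs without lxml installed; the snippet is the one from the question.

```python
import xml.etree.ElementTree as ET

snippet = '<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>'
outer = ET.fromstring(snippet)
inner = outer.find("span")

print(outer.text)   # None: nothing precedes the first child
print(inner.text)   # '-'
print(inner.tail)   # 'SNP ': text after the inner span's closing tag

# itertext() walks text and tail in document order.
full_text = "".join(outer.itertext())
print(full_text)    # '-SNP '
```

lxml elements expose the same text, tail, and itertext() members, so the join above is one way to get the "-SNP" string the question asks for.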
Use BeautifulSoup:
import bs4
html = """<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>"""
soup = bs4.BeautifulSoup(html)
snp = list(soup.findAll("span", class_="rtq_exch")[0].strings)[1]

BeautifulSoup not finding parents

I really can't manage to figure this out. I parsed the following link with BeautifulSoup and I did this:
soup.find(text='Title').find_parent('h3')
And it does not find anything. If you take a look at the code of the linked page, you'll see an h3 tag which contains the word Titles.
The exact point is:
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>
If I make BS parse the line above only, it works perfectly. I tried also with:
soup.find(text='Title').find_parents('h3')
soup.find(text='Title').find_parent(class_='findSectionHeader')
which both work on the line only, but don't work on the entire html.
If I do a soup.find(text='Titles').find_parents('div') it works with the entire html.
Before the findSectionHeader H3 tag, there is another tag with Title in the text:
>>> soup.find(text='Title').parent
Title
You need to be more specific in your search, search for Titles instead, and loop to find the correct one:
>>> soup.find(text='Titles').parent
<option value="tt">Titles</option>
>>> for elem in soup.find_all(text='Titles'):
...     parent_h3 = elem.find_parent('h3')
...     if parent_h3 is None:
...         continue
...     print parent_h3
...
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>
find(text='...') only matches the full text, not a partial match. Use a regular expression if you need partial matches instead:
>>> import re
>>> soup.find_all(text='Title')
[u'Title']
>>> soup.find_all(text=re.compile('Title'))
[u'Titles', u'Titles', u'Titles', u'Title', u'Advanced Title Search']
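An alternative sketch that starts from the tag side instead of the text side: collect the h3 tags and keep the one whose full text is exactly "Titles". Filtering on get_text() is used here because this h3 has two children (the empty <a> and the text), and a tag's .string is None in that case. The markup is a reduced, hypothetical stand-in for the linked page.

```python
from bs4 import BeautifulSoup

# Reduced stand-in for the page: "Titles" appears in two different tags.
html = """
<select><option value="tt">Titles</option></select>
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only the h3 tags whose concatenated, stripped text is "Titles".
headers = [h3 for h3 in soup.find_all("h3")
           if h3.get_text(strip=True) == "Titles"]

print(headers)
```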

Match HTML tags in two strings using regex in Python

I want to verify that the HTML tags present in a source string are also present in a target string.
For example:
>> source = "<em>Hello</em><label>What's your name</label>"
>> verify_target('<em>Hi</em><label>My name is Jim</label>')
True
>> verify_target('<label>My name is Jim</label><em>Hi</em>')
True
>> verify_target('<em>Hi<label>My name is Jim</label></em>')
False
I would get rid of Regex and look at Beautiful Soup.
findAll(True) lists all the tags found in your source.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(source)
allTags = soup.findAll(True)
[tag.name for tag in allTags ]
[u'em', u'label']
Then you just need to remove possible duplicates and compare your tag lists.
This snippet verifies that ALL of source's tags are present in target's tags.
from BeautifulSoup import BeautifulSoup
def get_tags_set(source):
    soup = BeautifulSoup(source)
    all_tags = soup.findAll(True)
    return set([tag.name for tag in all_tags])
def verify(tags_source_orig, tags_source_to_verify):
    return tags_source_orig == set.intersection(tags_source_orig, tags_source_to_verify)
source= '<label>What\'s your name</label><label>What\'s your name</label><em>Hello</em>'
source_to_verify= '<em>Hello</em><label>What\'s your name</label><label>What\'s your name</label>'
print verify(get_tags_set(source),get_tags_set(source_to_verify))
I don't think regex is the right tool here, because HTML is not just a flat string; it has a more complex structure, with nested tags.
I suggest you use HTMLParser: create a class which parses the original source and builds a structure from it. Then verify that the same data structure is valid for the targets to be verified.
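A minimal sketch of the HTMLParser idea, assuming that "structure" means the path of ancestor tags leading to each tag, so that a tag nested differently in the target counts as a mismatch. The class TagPathCollector, the helper tag_paths, the two-argument form of verify_target, and the short list of void tags are all choices made for this sketch, not part of the original answers.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; partial list, enough for the sketch.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

class TagPathCollector(HTMLParser):
    """Record, for every start tag, the tuple of open tags leading to it."""

    def __init__(self):
        super().__init__()
        self._stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add(tuple(self._stack + [tag]))
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

def tag_paths(markup):
    collector = TagPathCollector()
    collector.feed(markup)
    collector.close()
    return collector.paths

def verify_target(source, target):
    """True if every tag path found in source also occurs in target."""
    return tag_paths(source) <= tag_paths(target)

source = "<em>Hello</em><label>What's your name</label>"
print(verify_target(source, "<em>Hi</em><label>My name is Jim</label>"))
print(verify_target(source, "<label>My name is Jim</label><em>Hi</em>"))
print(verify_target(source, "<em>Hi<label>My name is Jim</label></em>"))
```

Comparing sets of paths ignores sibling order and duplicates, which matches the first two examples in the question while rejecting the third, where label moves inside em.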
