Extracting images from HTML pages with Python

Below is my code. It attempts to get the src of an image within an <img> tag in HTML.
import re

for text in open('site.html'):
    matches = re.findall(r'\ssrc="([^"]+)"', text)
    matches = ' '.join(matches)
    print(matches)
The problem is that when I put in something like:
<img src="asdfasdf">
it works, but when I put in an ENTIRE HTML page it returns nothing. Why does it do that, and how do I fix it?
site.html is just the HTML code for a website in standard format. I want it to ignore everything else and just print the src of each image. If you would like to see what would be inside site.html, go to a basic HTML webpage and copy all the source code.

Why use a regular expression to parse HTML when you can easily do this with something like BeautifulSoup?
>>> from bs4 import BeautifulSoup as BS
>>> html = """This is some text
... <img src="asdasdasd">
... <i> More HTML <b> foo </b> bar </i>
... """
>>> soup = BS(html, 'html.parser')
>>> for imgtag in soup.find_all('img'):
...     print(imgtag['src'])
...
asdasdasd
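For the asker's actual file, a minimal sketch of the same approach (assuming site.html sits in the working directory):
from bs4 import BeautifulSoup

# A minimal sketch, assuming site.html is the asker's local file.
with open('site.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Print the src of every <img> tag; .get() avoids a KeyError
# for any <img> that has no src attribute.
for imgtag in soup.find_all('img'):
    print(imgtag.get('src'))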
The reason your code doesn't work is that text is a single line of the file, so each iteration only finds matches within that one line. That could still appear to work, but consider what happens when the last line has no image tag: matches will be an empty list, and join will turn it into ''. You are overwriting the variable matches on every line, so only the last line's result survives.
You want to call findall on the whole HTML:
import re

with open('site.html') as html:
    content = html.read()

matches = re.findall(r'\ssrc="([^"]+)"', content)
matches = ' '.join(matches)
print(matches)
Using a with statement here is much more Pythonic. It also means you don't have to call file.close() afterwards, as the with statement handles that for you.

Related

Putting Links in Parenthesis with BeautifulSoup

BeautifulSoup's get_text() function only returns the textual information of an HTML webpage. However, I want my program to return the href of an <a> tag in parentheses directly after it returns the actual text.
In other words, using get_text() will just return "17.602" on the following HTML:
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
However, I want my program to return "17.602 (17.602.html#FAR_17_602)". How would I go about doing this?
EDIT: What if you need to print text from other tags, such as:
<p> Sample text.
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
Sample closing text.
</p>
In other words, how would you compose a program that would print
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.
You can format the output using f-strings.
Access the tag's text using .text, and then access the href attribute.
from bs4 import BeautifulSoup
html = """
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
"""
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.find("a")
print(f"{a_tag.text} ({a_tag['href']})")
Output:
17.602 (17.602.html#FAR_17_602)
Edit: You can use .next_sibling and .previous_sibling
print(f"{a_tag.previous_sibling.strip()} {a_tag.text} ({a_tag['href']}) {a_tag.next_sibling.strip()}")
Output:
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.
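As a hedged generalization (the sample HTML below is invented for illustration): if a page has many links and you want this text-plus-href format for each of them, loop over find_all("a"):
from bs4 import BeautifulSoup

# Invented sample HTML for illustration.
html = """
<p>See <a href="17.602.html#FAR_17_602">17.602</a> and
<a href="17.603.html#FAR_17_603">17.603</a> for details.</p>
"""

soup = BeautifulSoup(html, "html.parser")
# Print each link's text followed by its href in parentheses.
for a_tag in soup.find_all("a"):
    print(f"{a_tag.text} ({a_tag['href']})")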

Beautiful Soup Nested Tag Search

I am trying to write a Python program that will count the words on a web page. I use Beautiful Soup 4 to scrape the page, but I have difficulties accessing nested HTML tags (for example: <p class="hello"> inside <div>).
Every time I try to find such a tag using the page.findAll() method (page is the Beautiful Soup object containing the whole page), it simply doesn't find any, although they are there. Is there any simple method or another way to do it?
I'm guessing that what you are trying to do is first look in a specific div tag, then search for all p tags inside it and count them, or do whatever else you want with them. For example:
import bs4

soup = bs4.BeautifulSoup(content, 'html.parser')

# This will get the div
div_container = soup.find('div', class_='some_class')

# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)
Hope that helps
Try this one:
data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')
Maybe you can turn it into a lambda and make it cool, but this works. Thanks.
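A minimal alternative sketch, not from the original answer: BeautifulSoup's CSS selector support expresses the same nested search in one call (xyz and abc are the same placeholder tag names as above):
# 'xyz abc' matches every <abc> at any depth inside an <xyz>.
data = soup.select('xyz abc')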
UPDATE: I noticed that text does not always return the expected result; at the same time, I realized there is a built-in way to get the text. Sure enough, reading the docs,
we read that there is a method called get_text(). Use it as:
from bs4 import BeautifulSoup

with open('index.html', 'r') as fd:
    website = fd.read()

soup = BeautifulSoup(website, 'html.parser')
contents = soup.get_text(separator=" ")
print("number of words %d" % len(contents.split(" ")))
INCORRECT, please read above. Supposing that you have your HTML file locally in index.html, you can:
from bs4 import BeautifulSoup
import re

BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored

with open('index.html', 'r') as fd:
    website = fd.read()

soup = BeautifulSoup(website, 'html.parser')
tags = soup.find_all(True)  # find everything
print("there are %d" % len(tags))

count = 0
# Non-capturing group: with a capturing group, re.split() would also
# return the separators and inflate the word count.
matcher = re.compile(r"(?:\s|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text)  # split on whitespace tokens
    temp = [t for t in temp if t]   # remove empty elements in the list
    count += len(temp)
print("number of words in the document %d" % count)
Please note that it may not be accurate, perhaps because of errors in formatting, false positives (it counts any token as a word, even if it is code), text that is shown or hidden dynamically using JavaScript or CSS, or other reasons.
You can find all <p> tags using regular expressions (the re module).
Note that r.text is a string which contains the whole HTML of the site.
For example:
import re
import requests

r = requests.get(url, headers=headers)
# re.DOTALL lets .*? match across newlines for multi-line paragraphs
p_tags = re.findall(r'<p>.*?</p>', r.text, re.DOTALL)
This should get you all the <p> tags, irrespective of whether they are nested or not. And if you want the <a> tags specifically inside the <p> tags, you can pass a matched tag's string as the first thing to search in, instead of r.text.
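For instance, a minimal sketch of that second step (the variable names are illustrative):
# For each matched <p>...</p> string, pull out any <a> tags inside it.
for p_tag in p_tags:
    a_tags = re.findall(r'<a.*?</a>', p_tag, re.DOTALL)
    print(a_tags)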
Alternatively, if you just want the text, you can try this:
from readability import Document  # pip install readability-lxml
import requests

r = requests.get(url, headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()
This will get you a more bare-bones form of the HTML from the site, and you can then proceed with the parsing.
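A minimal sketch of that last step (my assumption, not part of the original answer), feeding the simplified HTML back into BeautifulSoup to count words:
from bs4 import BeautifulSoup

# Parse the simplified HTML and count whitespace-separated words.
soup = BeautifulSoup(simplified_html, 'html.parser')
text = soup.get_text(separator=' ')
print("number of words %d" % len(text.split()))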

What's wrong with my soup?

I am using Python with BeautifulSoup 4 to find links in an HTML page that match a particular regular expression. I am able to find links, and text matching the regex, but the two things combined together won't work. Here's my code:
import re
import bs4

s = '<a>Sign in&nbsp;<br /></a>'
soup = bs4.BeautifulSoup(s, 'html.parser')
match = re.compile(r'sign\s?in', re.IGNORECASE)
print(soup.find_all(text=match))        # ['Sign in\xa0']
print(soup.find_all(name='a')[0].text)  # Sign in
print(soup.find_all('a', text=match))   # []
The comments are the outputs. As you can see, the combined search returns no result. This is strange.
It seems to have something to do with the br tag (or any tag) contained inside the link text. If you delete it, everything works as expected.
You can either look for the tag or look for its text content, but not both together:
Given that:
self.name = u'a'
self.text = SRE_Pattern: <_sre.SRE_Pattern object at 0xd52a58>
From the source:
# If it's text, make sure the text matches.
elif isinstance(markup, NavigableString) or \
        isinstance(markup, basestring):
    if not self.name and not self.attrs and self._matches(markup, self.text):
        found = markup
That makes @Totem's remark the way to go by design.
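A minimal sketch of the resulting workaround (my wording, not the original answer's): filter the <a> tags yourself and match the regex against each tag's full rendered text:
# Find <a> tags whose rendered text matches the pattern, even when
# other tags (like <br/>) sit inside the link.
links = [a for a in soup.find_all('a') if match.search(a.get_text())]
print(links)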

BeautifulSoup Tag Removal unexpected result

So I wrote some code to extract only what's within the <p> tags of some HTML code. Here is my code:
soup = BeautifulSoup(my_string, 'html')
no_tags = ' '.join(el.string for el in soup.find_all('p', text=True))
It works how I want it to for most of the examples it is run on, but I have noticed that in examples such as
<p>hello, how are you <code>other code</code> my name is joe</p>
it returns nothing. I suppose this is because there are other tags within the <p> tags. So just to be clear, what I would want it to return is
hello, how are you my name is joe
That is, I want everything inside the <p> tags but only the first level in. I would like to ignore everything that is enclosed in other tags within the <p> tags.
Can someone help me out regarding how to deal with such examples?
Hello, I think that you can use this to extract the text within the p tag:
from bs4 import BeautifulSoup

my_string = "<p>hello, how are you <code>other code</code> my name is joe</p>"
soup = BeautifulSoup(my_string, 'html.parser')
soup.code.extract()  # remove the <code> tag and its contents
text = soup.p.get_text()
print(text)
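A more general sketch (my assumption, not part of the original answer): to keep only the first-level text of every <p> and skip anything wrapped in a nested tag, take just the direct string children:
from bs4 import BeautifulSoup

my_string = "<p>hello, how are you <code>other code</code> my name is joe</p>"
soup = BeautifulSoup(my_string, 'html.parser')

# Keep only strings that are direct children of each <p>;
# text inside nested tags such as <code> is ignored.
for p in soup.find_all('p'):
    print(' '.join(s.strip() for s in p.find_all(string=True, recursive=False)))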

BeautifulSoup not finding parents

I really can't manage to figure this out. I parsed the following link with BeautifulSoup and I did this:
soup.find(text='Title').find_parent('h3')
And it does not find anything. If you take a look at the code of the linked page, you'll see an h3 tag which contains the word Titles.
The exact point is:
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>
If I make BS parse only the line above, it works perfectly. I also tried:
soup.find(text='Title').find_parents('h3')
soup.find(text='Title').find_parent(class_='findSectionHeader')
which both work on that line alone, but don't work on the entire HTML.
If I do a soup.find(text='Titles').find_parents('div'), it works on the entire HTML.
Before the findSectionHeader H3 tag, there is another tag with Title in the text:
>>> soup.find(text='Title').parent
Title
You need to be more specific in your search, search for Titles instead, and loop to find the correct one:
>>> soup.find(text='Titles').parent
<option value="tt">Titles</option>
>>> for elem in soup.find_all(text='Titles'):
...     parent_h3 = elem.find_parent('h3')
...     if parent_h3 is None:
...         continue
...     print(parent_h3)
...
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>
find(text='...') only matches the full text, not a partial match. Use a regular expression if you need partial matches instead:
>>> import re
>>> soup.find_all(text='Title')
['Title']
>>> soup.find_all(text=re.compile('Title'))
['Titles', 'Titles', 'Titles', 'Title', 'Advanced Title Search']
