I have a small script that uses urllib2 to get the contents of a site, finds all the link tags, appends a small piece of HTML at the top and bottom, and then tries to prettify it. It keeps returning TypeError: sequence item 1: expected string, Tag found. I have looked around, but I can't really find the issue. As always, any help is much appreciated.
import urllib2
from BeautifulSoup import BeautifulSoup
import re
reddit = 'http://www.reddit.com'
pre = '<html><head><title>Page title</title></head>'
post = '</html>'
site = urllib2.urlopen(reddit)
html = site.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a')
tags.insert(0, pre)
tags.append(post)
soup1 = BeautifulSoup(''.join(tags))
print soup1.prettify()
This is the traceback:
Traceback (most recent call last):
  File "C:\Python26\bea.py", line 21, in <module>
    soup1 = BeautifulSoup(''.join(tags))
TypeError: sequence item 1: expected string, Tag found
This works for me:
soup1 = BeautifulSoup(''.join(str(t) for t in tags))
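The join fails because ''.join only accepts strings, while findAll returns Tag objects; once the string pre is inserted at index 0, the first Tag sits at index 1, which is exactly what the error message points at. A minimal sketch of the difference (assuming BeautifulSoup 3, as in the question):
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<a href="#one">one</a><a href="#two">two</a>')
tags = soup.findAll('a')
# ''.join(['<html>'] + tags) raises TypeError: item 1 is a Tag, not a string
print ''.join(str(t) for t in tags)  # converting each Tag first works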
This pyparsing solution gives some decent output, too:
from pyparsing import makeHTMLTags, originalTextFor, SkipTo, Combine
# makeHTMLTags defines HTML tag patterns for given tag string
aTag,aEnd = makeHTMLTags("A")
# makeHTMLTags by default returns a structure containing
# the tag's attributes - we just want the original input text
aTag = originalTextFor(aTag)
aEnd = originalTextFor(aEnd)
# define an expression for a full link, and use a parse action to
# combine the returned tokens into a single string
aLink = aTag + SkipTo(aEnd) + aEnd
aLink.setParseAction(lambda tokens : ''.join(tokens))
# extract links from the input html
links = aLink.searchString(html)
# build list of strings for output
out = []
out.append(pre)
out.extend([' '+lnk[0] for lnk in links])
out.append(post)
print '\n'.join(out)
prints:
<html><head><title>Page title</title></head>
<a href="http://www.reddit.com/r/pics/" >pics</a>
<a href="http://www.reddit.com/r/reddit.com/" >reddit.com</a>
<a href="http://www.reddit.com/r/politics/" >politics</a>
<a href="http://www.reddit.com/r/funny/" >funny</a>
<a href="http://www.reddit.com/r/AskReddit/" >AskReddit</a>
<a href="http://www.reddit.com/r/WTF/" >WTF</a>
.
.
.
<a href="http://reddit.com/help/privacypolicy" >Privacy Policy</a>
close this window
<a href="http://www.reddit.com/feedback" >volunteer to translate</a>
close this window
</html>
soup1 = BeautifulSoup(''.join(unicode(tag) for tag in tags))
A slight variation on Jonathan's answer, written with an explicit list comprehension:
soup1 = BeautifulSoup(''.join([unicode(tag) for tag in tags]))
I need to get the data after a certain link with the text map, but it doesn't work when the data after the link is colored. How do I get that?
Currently, I am using next_sibling, but it only gets the data points that are not red.
The HTML is like this.
I can read the number from here:
<a class="link2" href="...">map</a>
" 2.8 "
but not from here:
<a class="link2" href="...">map</a>
<font color="red">3.1</font>
soup = BeautifulSoup(page.content, 'html.parser')
tags = soup.find_all("a", {'class': 'link2'})
output = open("file.txt", "w")
for i in tags:
    if i.get_text() == "map":
        # prints each next_sibling
        print(i.next_sibling)
        # Extracts text if needed.
        try:
            output.write(i.next_sibling.get_text().strip() + "\n")
        except AttributeError:
            output.write(i.next_sibling.strip() + "\n")
output.close()
The program writes all of the numbers that are not in red, and leaves empty spaces where there are red numbers. I want it to show everything.
If we could see more of your HTML tree there would probably be a better way to do this, but given the little bit of HTML you've shown us, here's one way that will likely work.
from bs4 import BeautifulSoup

html = """<a class="link2" href="...">map</a>
" 2.8 "
<a class="link2" href="...">map</a>
<font color="red">3.1</font>"""

soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all("a", {'class': 'link2'})
output = open("file.txt", "w")
for i in tags:
    if i.get_text() == "map":
        siblings = [sib for sib in i.next_siblings]
        map_sibling_text = siblings[0].strip()
        if map_sibling_text == '' and len(siblings) > 1:
            if siblings[1].name == 'font':
                map_sibling_text = siblings[1].get_text().strip()
        output.write("{0}\n".format(map_sibling_text))
output.close()
It depends on how your HTML looks overall. Is that class name always associated with an a tag, for example? You might be able to do the following, which requires bs4 4.7.1+.
import requests
from bs4 import BeautifulSoup as bs
html = '''
<a class="link2" href="...">map</a>
" 2.8 "
<a class="link2" href="...">map</a>
<font color="red">3.1</font>
'''
soup = bs(html, 'lxml')
data = [item.next_sibling.strip() if item.name == 'a' else item.text.strip() for item in soup.select('.link2:not(:has(+font)), .link2 + font')]
print(data)
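The selector picks up the .link2 anchors whose number lives in the text node right after them, together with the font elements that directly follow a .link2 anchor, and the conditional pulls the text out of whichever kind of node matched. If your bs4 is older than 4.7.1 and lacks :has() support, a rough equivalent using sibling navigation might look like this (a sketch against the same html as above):
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(html, 'lxml')  # same html as above
data = []
for a in soup.select('a.link2'):
    sib = a.next_sibling
    if isinstance(sib, NavigableString) and sib.strip():
        # a non-empty text node right after the link holds the plain number
        data.append(sib.strip())
    else:
        # otherwise the number sits inside a following <font> element
        font = a.find_next_sibling('font')
        if font:
            data.append(font.get_text().strip())
print(data)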
I have a basic HTML file that contains text inside tags as follows:
<head>
<title></title>
</head>
<body>
<div>{#One#}</div>
<span>{#Two#}</span>
<b>{#Three#}</b>
<i>four</i>
<td>{#five#}</td>
<sup>{#six#}</sup>
<sub>{#seven#}</sub>
<i>eight</i>
</body>
Using Python, I wanted to parse this file and check for a special character (e.g. '{'), and if this character is not present, return the line and the number of the line where it is missing. So I wrote a small snippet for it.
from bs4 import BeautifulSoup
import re
import urllib2

url = "testhtml.html"
page = open(url)
soup = BeautifulSoup(page.read())
bdy = soup.find('body')
for lines in bdy:
    for num, line in enumerate(lines, 1):
        if "{" not in line:
            print num, lines
However, when I run the program I get strange output, shown below:
1
1
1
1
1
1<i>four</i>
1
1
1
1<i>eight</i>
Instead of :
4<i>four</i>
8<i>eight</i>
What am I doing wrong here? It seems like a silly mistake.
Using find('body') returns the whole body tag with all of its contents as a single element, so iterating over bdy doesn't do what you think.
You need to use bdy.find_all(True), which will return all the tags inside body. Then change the if statement to if '{' not in tag.text:.
soup = BeautifulSoup(html, 'lxml')
bdy = soup.find('body')
for i, tag in enumerate(bdy.find_all(True), 1):
    if '{' not in tag.text:
        print(i, tag)
Output:
4 <i>four</i>
8 <i>eight</i>
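Note that enumerate here numbers the tags inside body in document order, which is what the expected output asks for. If you ever need the actual line number in the source file instead, newer bs4 (4.8.1+) records it when parsing with html.parser; a small sketch, assuming that version is available:
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.body.find_all(True):
    if '{' not in tag.text:
        print(tag.sourceline, tag)  # sourceline: line number in the input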
from bs4 import BeautifulSoup

url = "index.html"
page = open(url)
soup = BeautifulSoup(page.read(), "html.parser")
bdy = soup.find('body')
for num, lines in enumerate(bdy):
    for line in lines:
        if line != '\n' and '{' not in line:
            print num, lines
#!/usr/bin/env python
import requests, bs4

res = requests.get('https://betaunityapi.webrootcloudav.com/Docs/APIDoc/APIReference')
web_page = bs4.BeautifulSoup(res.text, "lxml")
for d in web_page.findAll("div", {"class": "actionColumnText"}):
    print d
Result:
<div class="actionColumnText">
/service/api/console/gsm/{gsmKey}/sites/{siteId}/endpoints/reactivate
</div>
<div class="actionColumnText">
Reactivates a list of endpoints, or all endpoints on a site. </div>
I am interested in seeing output with only the last line (Reactivates a list of endpoints, or all endpoints on a site), without the opening and closing div tags.
I'm not interested in the line with the href.
Any help is greatly appreciated.
In a simple case, you can just get the text:
for d in web_page.find_all("div", {"class": "actionColumnText"}):
    print(d.get_text())
Or, if there is only a single element you want to find, you can get the last match by index:
d = web_page.find_all("div", {"class": "actionColumnText"})[-1]
print(d.get_text())
Or, you can also find div elements with a specific class which don't have an a child element:
def filter_divs(elm):
    # match <div> elements whose class list contains actionColumnText and that have no <a> child
    return elm and elm.name == "div" and "actionColumnText" in elm.get("class", []) and elm.a is None

for d in web_page.find_all(filter_divs):
    print(d.get_text())
Or, in case of a single element:
web_page.find(filter_divs).get_text()
You can select the last matching div with a CSS selector plus indexing (bs4's select does not support jQuery-style pseudo-classes such as :last):
d = web_page.select("div.actionColumnText")[-1]
print(d.get_text())
If this text changes, you can use:
#!/usr/bin/env python
import requests, bs4
res = requests.get('https://betaunityapi.webrootcloudav.com/Docs/APIDoc/APIReference')
web_page = bs4.BeautifulSoup(res.text, "lxml")
yourText = web_page.findAll("div", {"class": "actionColumnText"})[-1]
yourText = yourText.get_text().strip()
Any browser that encounters the text www.domain.com or http://domain.com/etc/ in a text section of some HTML will automatically turn it into a clickable <a href="..."> tag. I have to clean up and verify some texts like this and do this replacement automatically, but the problem is that I can't insert new tags into an element's string.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup as bs

def html_content_to_soup(data):
    soup = bs(data, "html5lib")
    soup.html.unwrap()
    soup.head.unwrap()
    soup.body.unwrap()
    return soup

def create_tag(soup):
    def resubfunction(m):
        url = m.group(0)
        if not url.startswith("http://") and not url.startswith("https://"):
            _url = "http://%s" % url
        else:
            _url = url
        tag = soup.new_tag('a', href=_url)
        tag.string = url.replace(".", "[.]")
        return tag.prettify(formatter=None)
    return resubfunction

def replace_vulnerable_text(soup, data):
ex = r"(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s\`!()\[\]{};:\'\".,<>?ÂŤÂťââââ])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!#)))"
    return re.sub(ex, create_tag(soup), data)

if __name__ == "__main__":
    html = """<html><body>The website bbc.co.uk is down</body></html>"""
    soup = bs(html, "html5")
    for elem in soup.find_all(text=True):
        if elem.string is not None:
            elem.replace_with(html_content_to_soup(replace_vulnerable_text(soup, elem.string)))
    print unicode(soup)
Instead of the expected
<html><head></head><body>The website <a href="http://bbc.co.uk">bbc[.]co[.]uk</a> is down</body></html>
I'm getting
<html><head></head><body>The website &lt;a href="http://bbc.co.uk"&gt; bbc[.]co[.]uk &lt;/a&gt; is down</body></html>
The HTML tags are getting escaped. Any pointers in the right direction? I'm not sure how to approach this.
EDIT: Edited the original question with the correct answer.
import HTMLParser
html_p = HTMLParser.HTMLParser()
string = '<html><head></head><body>The website &lt;a href="http://bbc.co.uk"&gt; bbc[.]co[.]uk &lt;/a&gt; is down</body></html>'
print html_p.unescape(string)
it will give you the required output.
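On Python 3 the HTMLParser module was renamed, and the same unescaping lives in the standard html module:
from html import unescape

print(unescape(string))  # same escaped string as above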
I need a function which will return a soup from arbitrary html:
def html_content_to_soup(data):
soup = bs(data, "html5lib")
soup.html.unwrap()
soup.head.unwrap()
soup.body.unwrap()
return soup
Afterwards, we get:
elem.replace_with( html_content_to_soup(replace_vulnerable_text(soup, elem.string)) )
This produces the required content:
<html><head></head><body>The website <a href="http://bbc.co.uk">bbc[.]co[.]uk</a> is down</body></html>
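This works because html5lib always wraps a fragment in html/head/body; unwrap() removes each wrapper tag while keeping its children, so the helper hands back nodes that replace_with() inserts as real markup instead of escaped text. A quick check of the helper on its own (assuming bs4 with html5lib installed):
from bs4 import BeautifulSoup as bs

fragment = html_content_to_soup('text with a <a href="#">link</a>')
print unicode(fragment)  # text with a <a href="#">link</a>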
I want to extract all the href and src attributes inside all the divs on the page that have class = 'news_item'.
The html looks like this:
<div class="col">
<div class="group">
<h4>News</h4>
<div class="news_item">
<a href="www.link.com">
<h2 class="link">
here is a link-heading
</h2>
<div class="Img">
<img border="0" src="/image/link" />
</div>
<p></p>
</a>
</div>
From here, what I want to extract is:
www.link.com, here is a link-heading, and /image/link
My code is:
def scrape_a(url):
    news_links = soup.select("div.news_item [href]")
    for links in news_links:
        if news_links:
            return 'http://www.web.com' + news_links['href']

def scrape_headings(url):
    for news_headings in soup.select("h2.link"):
        return str(news_headings.string.strip())

def scrape_images(url):
    images = soup.select("div.Img[src]")
    for image in images:
        if images:
            return 'http://www.web.com' + news_links['src']

def top_stories():
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    link = scrape_a(soup)
    heading = scrape_headings(soup)
    image = scrape_images(soup)
    message = {'heading': heading, 'link': link, 'image': image}
    print message
The problem is that it gives me this error:
TypeError: 'NoneType' object is not callable
Here is the traceback:
Traceback (most recent call last):
  File "web_parser.py", line 40, in <module>
    top_stories()
  File "web_parser.py", line 32, in top_stories
    link = scrape_a('www.link.com')
  File "web_parser.py", line 10, in scrape_a
    news_links = soup.select_all("div.news_item [href]")
You should be grabbing all of the news items at once and then iterating through them. This makes it easy to organize the data that you get into manageable chunks (in this case, dicts). Try something like this:
url = "http://www.web.com"
r = requests.get(url)
soup = BeautifulSoup(r.text)
messages = []
news_links = soup.select("div.news_item") # selects all .news_item's
for l in news_links:
message = {}
message['heading'] = l.find("h2").text.strip()
link = l.find("a")
if not link:
continue
message['link'] = link['href']
image = l.find('img')
if not image:
continue
message['image'] = "http://www.web.com{}".format(image['src'])
messages.append(message)
print messages
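One caveat with the loop above: if a .news_item happens to lack an h2, l.find("h2") returns None and .text raises AttributeError. Guarding the heading inside the loop, the same way link and image are guarded, keeps it robust:
    heading = l.find("h2")
    if not heading:
        continue
    message['heading'] = heading.text.strip()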
Most of your errors come from the fact that the news_link is not being found properly. You aren't getting back the tag you expect.
Change:
news_links = soup.select("div.news_item [href]")
for links in news_links:
if news_links:
return 'http://www.web.com' + news_links['href']
to this and see if it helps:
news_links = soup.find_all("div", class="news_item")
for links in news_links:
if news_links:
return 'http://www.web.com' + news_links.find("a").get('href')
Also note that the return statement will give you something like http://www.web.comwww.link.com, which I don't think you want.
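A cleaner way to combine the base URL with a scraped href is urljoin from the standard library, which builds the path correctly and leaves already-absolute links alone (the last answer below takes the same approach); a small sketch using Python 2's urlparse module:
from urlparse import urljoin  # Python 3: from urllib.parse import urljoin

print urljoin('http://www.web.com', 'www.link.com')       # http://www.web.com/www.link.com
print urljoin('http://www.web.com', '/image/link')        # http://www.web.com/image/link
print urljoin('http://www.web.com', 'http://other.com/x') # http://other.com/x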
Your idea to split the tasks into different methods is pretty good: nice to read, to change, and to reuse. The errors are mostly identified now; for instance, the traceback shows select_all, which exists neither in BeautifulSoup nor in your code, among other things. Long story short, I would do it like this:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from urlparse import urljoin
import requests

def news_links(url, soup):
    links = []
    for text in soup.select("div.news_item"):
        for x in text.find_all(href=True):
            links.append(urljoin(url, x['href']))
    return links

def news_headings(soup):
    headings = []
    for news_heading in soup.select("h2.link"):
        headings.append(str(news_heading.string.strip()))
    return headings

def news_images(url, soup):
    sources = []
    for image in soup.select("img[src]"):
        sources.append(urljoin(url, image['src']))
    return sources

def top_stories():
    url = 'http://www.web.com/'
    r = requests.get(url)
    content = r.content
    soup = BeautifulSoup(content)
    message = {'heading': news_headings(soup),
               'link': news_links(url, soup),
               'image': news_images(url, soup)}
    return message

print top_stories()
BeautifulSoup is robust: if you try to find or select something that is not there, it returns an empty list. It looks like you are parsing a list of items; the code above is pretty close to being usable for that.