Unexpected output while parsing an HTML file using Beautiful Soup - Python

I have a basic HTML file that contains text inside tags, as follows:
<head>
<title></title>
</head>
<body>
<div>{#One#}</div>
<span>{#Two#}</span>
<b>{#Three#}</b>
<i>four</i>
<td>{#five#}</td>
<sup>{#six#}</sup>
<sub>{#seven#}</sub>
<i>eight</i>
</body>
Using Python, I wanted to parse this file and check for a special character (e.g. '{'); if the character is not present in a line, return that line and its line number. So I wrote a small snippet for it:
from bs4 import BeautifulSoup
import re
import urllib2

url = "testhtml.html"
page = open(url)
soup = BeautifulSoup(page.read())
bdy = soup.find('body')
for lines in bdy:
    for num, line in enumerate(lines, 1):
        if "{" not in line:
            print num, lines
However, when I run the program I get a strange output, shown below:
1
1
1
1
1
1<i>four</i>
1
1
1
1<i>eight</i>
Instead of:
4<i>four</i>
8<i>eight</i>
What am I doing wrong here? It seems like a silly mistake.

Using find('body') returns the whole body tag, with all of its contents, as a single element, so iterating over bdy doesn't give you what you think it does.
You need to use bdy.find_all(True), which returns all the tags inside body. Then change the if statement to if '{' not in tag.text:.
soup = BeautifulSoup(html, 'lxml')
bdy = soup.find('body')
for i, tag in enumerate(bdy.find_all(True), 1):
    if '{' not in tag.text:
        print(i, tag)
Output:
4 <i>four</i>
8 <i>eight</i>
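Note that these numbers are tag positions within body (which is what the expected output shows), not line numbers in the file: <i>four</i> is the 4th tag but sits on line 8 of the source. If you ever need the actual source line, Beautiful Soup 4.8.1+ records it on each tag as sourceline when the html.parser or html5lib builder is used. A minimal sketch of that (Python 3, same testhtml.html as above):

from bs4 import BeautifulSoup

# html.parser records the source line each tag starts on (bs4 >= 4.8.1)
with open("testhtml.html") as page:
    soup = BeautifulSoup(page, "html.parser")

for tag in soup.body.find_all(True):
    if "{" not in tag.text:
        # sourceline is the 1-based line number of the tag in the input file
        print(tag.sourceline, tag)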

from bs4 import BeautifulSoup

url = "index.html"
page = open(url)
soup = BeautifulSoup(page.read(), "html.parser")
bdy = soup.find('body')
for num, lines in enumerate(bdy):
    for line in lines:
        if line != '\n' and '{' not in line:
            print num, lines

Related

How to find words between 2 words with re in python [duplicate]

I have a text file of ~500k lines with fairly random HTML syntax. The rough structure of the file is as follows:
content <title> title1 </title> more words
title contents2 title more words <body> <title> title2 </title>
<body><title>title3</title></body>
I want to extract all of the contents between the <title> tags:
title1
title2
title3
This is what I have tried so far:
content_list = []
with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', errors='ignore') as openfile2:
    for line in openfile2:
        for item in line.split("<title>"):
            if "</title>" in item:
                content = item[item.find("<title>") + len("<title>"):]
                content_list.append(content)
But this method is not retrieving all of the tags. I think this is because some tags are joined to other words without spaces, e.g. <body><title>.
I've considered replacing every '<' and '>' with a space and applying the same method, but if I were to do this, I would get "contents2" as an output.
I believe you could do this with BeautifulSoup.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('file_to_read.txt', 'r'), 'html.parser')
print(soup.find_all('title'))
# [<title> title1 </title>, <title> title2 </title>, <title>title3</title>]
print(soup.find_all('title')[0].text)
# ' title1 '
An example with your code's structure:
content_list = []
with open('file.txt', 'r') as file:
    for line in file:
        for item in line.split('<title>'):
            if '</title>' in item:
                content_list.append(item.split('</title>')[0].strip())
print(content_list)
But BeautifulSoup is, in my view, the better approach anyway.
Try running:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', 'r'), 'html.parser')
content_list = []
contents = soup.find_all('title')
for content in contents:
    print(content.get_text().strip())
    content_list.append(content.get_text().strip())

Beautifulsoup Get All Header Tags and Add Id Attribute Incrementing

Ok I give up. I need some help.
I am trying to find all the header tags in an HTML document and add an incrementing id attribute to each of them, while keeping the structure of the document in place.
I have tried several different variations but just can't seem to get it right.
from bs4 import BeautifulSoup

soup = BeautifulSoup(blog.body, "html.parser")
tags = soup.find_all()
count = 0
for item in tags:
    if r"^h\d$" in item:
        print('Found')
        count += 1
        item['id'] = count
        soup.append(item)
soup.append(item)
print(soup)
If you want to do it without re, here is another solution: it searches for all the header tags with BeautifulSoup.
from bs4 import BeautifulSoup as parser

with open("test.html", "r") as readFile:
    htmlSource = readFile.read()

soup = parser(htmlSource, "html.parser")
# find_all with a list matches any of the given tag names
headerTags = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
# enumerate gives each header an incrementing id (1, 2, 3, ...)
for count, eachTag in enumerate(headerTags, 1):
    eachTag.attrs["id"] = count
with open("out.html", "w") as saveFile:
    saveFile.write(str(soup))
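For comparison, find_all also accepts a compiled regular expression as the tag name, so the re-based version is just as short. A minimal sketch, assuming the same test.html:

import re
from bs4 import BeautifulSoup

with open("test.html", "r") as readFile:
    soup = BeautifulSoup(readFile.read(), "html.parser")

# A compiled pattern is matched against tag names, so this finds <h1> through <h6>
for count, tag in enumerate(soup.find_all(re.compile(r"^h[1-6]$")), 1):
    tag["id"] = str(count)

with open("out.html", "w") as saveFile:
    saveFile.write(str(soup))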

How to rectify a TypeError in Python? Beautiful Soup string-to-tag error?

Here is a simple snippet to scrape a Wikipedia page and print each of its sections separately, like the cast in one variable, the production in another, and so on.
In the first div, named "bodyContent", there is another div named "mw-content-text". My problem is retrieving the data of the first paragraphs, the ones before the "h2" tag. I have a code snippet to work this out, but I am unable to concatenate a string onto a BeautifulSoup tag; the error is TypeError: unsupported operand type(s) for +: 'Tag' and 'str'.
import urllib
from bs4 import BeautifulSoup
url ="https://en.wikipedia.org/wiki/Deadpool_(film)"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
soup = BeautifulSoup(htmltext,"lxml")
#print soup.prettify()
movie_title = soup.find('h1',{'id':'firstHeading'})
print movie_title.text
movie_info = soup.find_all('p')
#print movie_info[0].text
#print movie_info[1].text
'''I don't want it like this, because we don't know how many
intro paragraphs there will be; we have to scrape all the paras just before that h2 tag'''
Here is where the problem arises: I want to iterate, appending .next_sibling each time, and make a try-except block to check whether
resultant_next_url.name == 'p'
def findNextSibling(base_url):
    tag_addition = 'next_sibling'
    next_url = base_url + '.' + tag_addition
    return next_url
And finally do it like this:
base_url = movie_info[0]
resultant_url = findNextSibling(base_url)
print resultant_url.text
Finally found the answer; this solves the problem:
import urllib
from bs4 import BeautifulSoup
url ="https://en.wikipedia.org/wiki/Deadpool_(film)"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
soup = BeautifulSoup(htmltext,"lxml")
#print soup.prettify()
movie_title = soup.find('h1',{'id':'firstHeading'})
print movie_title.text
movie_info = soup.find_all('p')
# print movie_info[0].text
# print movie_info[1].text
def findNextSibling(resultant_url):
    #tag_addition = 'next_sibling'
    #base_url.string = base_url.string + '.' + tag_addition
    return resultant_url.next_sibling
resultant_url = movie_info[0]
resultant_url = findNextSibling(resultant_url)
print resultant_url.text
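Since the stated goal was to collect every intro paragraph before the first h2, the sibling walk can be put in a loop that keeps each p and stops at the first h2. A minimal sketch under that assumption (intro_paragraphs is a hypothetical helper, and it relies on the paragraphs and the h2 being siblings, as they are in that Wikipedia layout):

def intro_paragraphs(first_p):
    # Walk forward through the siblings, keeping <p> text, stopping at the first <h2>.
    paras = []
    node = first_p
    while node is not None:
        name = getattr(node, 'name', None)  # plain text nodes between tags have no tag name
        if name == 'h2':
            break
        if name == 'p':
            paras.append(node.text)
        node = node.next_sibling
    return paras

print '\n'.join(intro_paragraphs(movie_info[0]))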

beautifulsoup add a tag inside a string

Any browser that encounters the text www.domain.com or http://domain.com/etc/ in a text section of some HTML will automatically turn it into a clickable <a> tag. I have to clean up and verify some texts like this and do the replacement automatically, but the problem is that I can't insert new tags into an element's string.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup as bs

def html_content_to_soup(data):
    soup = bs(data, "html5lib")
    soup.html.unwrap()
    soup.head.unwrap()
    soup.body.unwrap()
    return soup

def create_tag(soup):
    def resubfunction(m):
        url = m.group(0)
        if not url.startswith("http://") and not url.startswith("https://"):
            _url = "http://%s" % url
        else:
            _url = url
        tag = soup.new_tag('a', href=_url)
        tag.string = url.replace(".", "[.]")
        return tag.prettify(formatter=None)
    return resubfunction

def replace_vulnerable_text(soup, data):
    ex = r"(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s\`!()\[\]{};:\'\".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!#)))"
    return re.sub(ex, create_tag(soup), data)

if __name__ == "__main__":
    html = """<html><body>The website bbc.co.uk is down</body></html>"""
    soup = bs(html, "html5lib")
    for elem in soup.find_all(text=True):
        if elem.string is not None:
            elem.replace_with(html_content_to_soup(replace_vulnerable_text(soup, elem.string)))
    print unicode(soup)
Instead of the expected
<html><head></head><body>The website <a href="http://bbc.co.uk">bbc[.]co[.]uk</a> is down</body></html>
I'm getting
<html><head></head><body>The website &lt;a href="http://bbc.co.uk"&gt; bbc[.]co[.]uk &lt;/a&gt; is down</body></html>
The HTML tags are getting escaped. Any pointers in the right direction? I'm not sure how to approach this.
EDIT: Edited the original question with the correct answer.
import HTMLParser

html_p = HTMLParser.HTMLParser()
string = '<html><head></head><body>The website &lt;a href="http://bbc.co.uk"&gt; bbc[.]co[.]uk &lt;/a&gt; is down</body></html>'
print html_p.unescape(string)

It will give you the required output.
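For what it's worth, HTMLParser is the Python 2 module; on Python 3 the same unescaping lives in the standard html module (3.4+). A one-line equivalent, assuming the same string variable:

import html

# html.unescape turns &lt;, &gt;, &amp; etc. back into characters
print(html.unescape(string))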
I need a function which will return a soup from arbitrary html:
def html_content_to_soup(data):
    soup = bs(data, "html5lib")
    soup.html.unwrap()
    soup.head.unwrap()
    soup.body.unwrap()
    return soup
Afterwards, we get:
elem.replace_with( html_content_to_soup(replace_vulnerable_text(soup, elem.string)) )
This produces the required content:
<html><head></head><body>The website <a href="http://bbc.co.uk">bbc[.]co[.]uk</a> is down</body></html>

Cannot prettify scraped html in BeautifulSoup

I have a small script that uses urllib2 to get the contents of a site, finds all the anchor tags, appends a small piece of HTML at the top and bottom, and then tries to prettify it. It keeps returning TypeError: sequence item 1: expected string, Tag found. I have looked around but can't really find the issue. As always, any help is much appreciated.
import urllib2
from BeautifulSoup import BeautifulSoup
import re
reddit = 'http://www.reddit.com'
pre = '<html><head><title>Page title</title></head>'
post = '</html>'
site = urllib2.urlopen(reddit)
html=site.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a')
tags.insert(0,pre)
tags.append(post)
soup1 = BeautifulSoup(''.join(tags))
print soup1.prettify()
This is the traceback:
Traceback (most recent call last):
  File "C:\Python26\bea.py", line 21, in <module>
    soup1 = BeautifulSoup(''.join(tags))
TypeError: sequence item 1: expected string, Tag found
This works for me (''.join() requires every item to be a string, but tags mixes Tag objects with the two plain strings, so each tag has to be converted first):
soup1 = BeautifulSoup(''.join(str(t) for t in tags))
This pyparsing solution gives some decent output, too:
from pyparsing import makeHTMLTags, originalTextFor, SkipTo, Combine
# makeHTMLTags defines HTML tag patterns for given tag string
aTag,aEnd = makeHTMLTags("A")
# makeHTMLTags by default returns a structure containing
# the tag's attributes - we just want the original input text
aTag = originalTextFor(aTag)
aEnd = originalTextFor(aEnd)
# define an expression for a full link, and use a parse action to
# combine the returned tokens into a single string
aLink = aTag + SkipTo(aEnd) + aEnd
aLink.setParseAction(lambda tokens : ''.join(tokens))
# extract links from the input html
links = aLink.searchString(html)
# build list of strings for output
out = []
out.append(pre)
out.extend([' '+lnk[0] for lnk in links])
out.append(post)
print '\n'.join(out)
prints:
<html><head><title>Page title</title></head>
<a href="http://www.reddit.com/r/pics/" >pics</a>
<a href="http://www.reddit.com/r/reddit.com/" >reddit.com</a>
<a href="http://www.reddit.com/r/politics/" >politics</a>
<a href="http://www.reddit.com/r/funny/" >funny</a>
<a href="http://www.reddit.com/r/AskReddit/" >AskReddit</a>
<a href="http://www.reddit.com/r/WTF/" >WTF</a>
.
.
.
<a href="http://reddit.com/help/privacypolicy" >Privacy Policy</a>
close this window
<a href="http://www.reddit.com/feedback" >volunteer to translate</a>
close this window
</html>
soup1 = BeautifulSoup(''.join(unicode(tag) for tag in tags))
A bit of a syntax error in Jonathan's answer; here's the corrected one:
soup1 = BeautifulSoup(''.join([unicode(tag) for tag in tags]))
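For anyone on the modern bs4 package rather than the old BeautifulSoup (BS3) module the script imports, the same fix applies; only the import and the method name change. A minimal sketch, assuming the html, pre and post variables from the question:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# findAll was renamed find_all; each Tag still has to be converted to a string
parts = [pre] + [str(t) for t in soup.find_all('a')] + [post]
soup1 = BeautifulSoup(''.join(parts), "html.parser")
print(soup1.prettify())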
