I am new to using BeautifulSoup.
I have a line in an HTML file that is stored locally.
<LINK rel="stylesheet" type="text/css" href="report.css" >
I wish to remove that line, but I don't know what approach to use to find the line and remove it.
I can find the line using: old_text = soup.find("link", {"href": "report.css"})
But I can't work out how to remove it and then save the file again.
You could use .decompose() to get rid of the tag:
soup.find("link", {"href": "report.css"}).decompose()
or
soup.select_one('link[href^="report."]').decompose()
and then convert the BeautifulSoup object back to a string to save it:
str(soup)
Example
from bs4 import BeautifulSoup
html = '''
<some tag>some content</some tag>
<LINK rel="stylesheet" type="text/css" href="report.css" >
<some tag>some content</some tag>
'''
soup = BeautifulSoup(html, "html.parser")
soup.select_one('link[href^="report."]').decompose()
print(str(soup))
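Since the file is stored locally, a minimal sketch of the whole round trip could look like this (report.html is a hypothetical file name):
from bs4 import BeautifulSoup

# read the local file (report.html is a hypothetical name)
with open("report.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# remove the link tag, then write the modified markup back
soup.find("link", {"href": "report.css"}).decompose()
with open("report.html", "w") as f:
    f.write(str(soup))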
Related
I'm trying to write a script that's going to scrape 9gag for images and images only, but I have run into a problem: my requests call, or BeautifulSoup, is getting the wrong HTML page.
BeautifulSoup is currently getting the source page and not the page that contains the images.
Why is BeautifulSoup excluding the classes that contain the actual images? Or are these different HTML pages?
I have tried different formats for the BeautifulSoup parser but am still getting the wrong page.
If you go to 9gag, right-click and choose "Inspect", you can get to the images and to the page from which to extract the images with a script.
My script:
import requests
from bs4 import BeautifulSoup
import os

def download_image(url, fileName):  # save image function
    path = os.path.join("imgs", fileName)
    f = open(path, 'wb')
    f.write(requests.get(url).content)
    f.close()

def fetch_url(url):  # fetching url
    page = requests.get(url)
    return page

def parse_html(htmlPage):  # parsing the url
    soup = BeautifulSoup(htmlPage, "html.parser")
    return soup

def retrieve_jpg_urls(soup):
    list_of_urls = soup.find_all('list')  # classes wanted
    parsed_urls = []
    for index in range(len(list_of_urls)):
        try:
            parsed_urls.append(soup.find_all('img')[index].attrs['src'])  # img wanted inside class
        except:
            next
    return parsed_urls

def main():
    htmlPage = fetch_url("https://9gag.com/")
    soup = parse_html(htmlPage.content)
    jpgUrls = retrieve_jpg_urls(soup)
    for index in range(len(jpgUrls)):
        try:
            download_image(jpgUrls[index], "savedpic{}.jpg".format(index))
        except:
            print("failed to parse image with url {}".format(jpgUrls[index]))
            print("")

if __name__ == "__main__":
    main()
What BeautifulSoup is getting:
<!DOCTYPE html>
<html lang="en">
<head>
<title>9GAG: Go Fun The World</title>
<link href="https://assets-9gag-fun.9cache.com" rel="preconnect"/>
<link href="https://img-9gag-fun.9cache.com" rel="preconnect"/>
<link href="https://miscmedia-9gag-fun.9cache.com" rel="preconnect"/>
<link href="https://images-cdn.9gag.com/img/9gag-og.png" rel="image_src"/>
<link href="https://9gag.com/" rel="canonical"/>
<link href="android-app://com.ninegag.android.app/http/9gag.com/" rel="alternate"/>
<link href="https://assets-9gag-fun.9cache.com/s/fab0aa49/5aa8c9f45ee3dd77f0fdbe4812f1afcf5913a34e/static/dist/core/img/favicon.ico" rel="shortcut icon"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="9GAG has the best funny pics, gifs, videos, gaming, anime, manga, movie, tv, cosplay, sport, food, memes, cute, fail, wtf photos on the internet!" name="description"/>
I want the following:
<img src="https://img-9gag-fun.9cache.com/photo/aLgyG2V_460s.jpg" alt="There's genuine friend love there" style="min-height: 566.304px;">
Try extracting the JSON embedded in the page instead. 9gag ships the post data inside a JSON.parse(...) call in a script tag, so the string is JSON-encoded twice and needs two json.loads passes:
import re
import json
import requests
# ...
res = requests.get(...)
html = res.text
m = re.search(r'JSON\.parse\((.*)\);</script>', html)
double_encoded = m.group(1)
encoded = json.loads(double_encoded)  # strips the outer string encoding
parsed = json.loads(encoded)          # parses the embedded object
images = [p['images']['image700']['url'] for p in parsed['data']['posts']]
print(images)
Output:
['https://img-9gag-fun.9cache.com/photo/abY9Wg8_460s.jpg', 'https://img-9gag-fun.9cache.com/photo/aLgy4o5_460s.jpg', 'https://img-9gag-fun.9cache.com/photo/aE2LVeM_460s.jpg', 'https://img-9gag-fun.9cache.com/photo/amBEGb4_700b.jpg', 'https://img-9gag-fun.9cache.com/photo/aKxrv56_460s.jpg', 'https://img-9gag-fun.9cache.com/photo/a5M8wXN_460s.jpg', 'https://img-9gag-fun.9cache.com/photo/aNY6QEv_700b.jpg', 'https://img-9gag-fun.9cache.com/photo/aYY2Deq_700b.jpg', 'https://img-9gag-fun.9cache.com/photo/aQR0AEw_460s.jpg', 'https://img-9gag-fun.9cache.com/photo/aLgy19P_700b.jpg']
I'm using Beautiful Soup to parse the list of categories from http://rtw.ml.cmu.edu/rtw/kbbrowser/, and I got the HTML code of this page:
<html>
<head>
<link href="../css/browser.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript">
if (parent.location.href == self.location.href) {
    if (window.location.href.replace)
        window.location.replace('index.php');
    else
        // causes problems with back button, but works
        window.location.href = 'index.php';
}
</script>
</head>
<body id="ontology">
...
</body>
</html>
I'm using quite simple code, but when I try to get to the <body> element, I get None:
import urllib
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import mechanize
from mechanize import Browser
import requests
import re
import os
link = 'http://rtw.ml.cmu.edu/rtw/kbbrowser/ontology.php'
pageFile = urllib.urlopen(link).read()
soup = BeautifulSoup(pageFile)
print soup.head.contents[0].name
print soup.html.contents[1].name
Why does the head element in this case not have a sibling?
I'm getting:
AttributeError: 'NoneType' object has no attribute 'next_element'
when trying to get head.next_sibling as well.
This is because text nodes are also a part of contents: the whitespace between tags is kept as NavigableString objects.
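A quick demo of that (using the built-in html.parser):
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head></head>\n<body></body></html>", "html.parser")
print soup.html.contents  # [<head></head>, u'\n', <body></body>]
# contents[1] is the newline text node, which has no .name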
Instead of working through the contents property, use CSS selectors to locate the list of categories. For example, here is how you can list the top-level categories:
for li in soup.select("body#ontology > ul > li"):
    print li.find_all("a")[-1].text
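For context, a runnable sketch of the whole flow, fetching the page from the question with requests:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://rtw.ml.cmu.edu/rtw/kbbrowser/ontology.php")
soup = BeautifulSoup(page.content)
for li in soup.select("body#ontology > ul > li"):
    print li.find_all("a")[-1].text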
Consider the html as
<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>
I am using lxml (Python) and XPath and trying to extract both the content of the title tag as well as the link tag.
The code is
page=urllib.urlopen(url).read()
x=etree.HTML(page)
titles=x.xpath('//item/title/text()')
links=x.xpath('//item/link/text()')
But this returns an empty list. However, the following does return a link element:
links=x.xpath('//item/link') #returns <Element link at 0xb6b0ae0c>
Can anyone suggest how to extract the urls from the link tag?
You are using the wrong parser for the job; you don't have HTML, you have XML.
A proper HTML parser will ignore the contents of a <link> tag, because in the HTML specification that tag is always empty.
Use the etree.parse() function to parse your URL stream (no separate .read() call needed):
response = urllib.urlopen(url)
tree = etree.parse(response)
titles = tree.xpath('//item/title/text()')
links = tree.xpath('//item/link/text()')
You could also use etree.fromstring(page) but leaving the reading to the parser is easier.
When the content is parsed by etree as HTML, the <link> tag gets closed immediately, so no text value is present for the link tag.
Demo:
>>> from lxml import etree
>>> content = """<item>
... <title>this is the title</title>
... <link>www.linktoawebsite.com</link>
... </item>"""
>>> x = etree.HTML(content)
>>> etree.tostring(x)
'<html><body><item>\n<title>this is the title</title>\n<link/>www.linktoawebsite.com\n</item></body></html>'
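Parsed as XML instead, the text survives (continuing the same session):
>>> x = etree.fromstring(content)
>>> x.xpath('//item/link/text()')
['www.linktoawebsite.com']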
According to HTML, <link> is not a valid tag for wrapping text content.
The link tag structure in HTML looks like:
<head>
    <link rel="stylesheet" type="text/css" href="theme.css">
</head>
I'm trying to capitalize all the (user-visible) text in an HTML file. Here is the obvious thing:
from bs4 import BeautifulSoup

def upcaseAll(str):
    soup = BeautifulSoup(str)
    for tag in soup.find_all(True):
        for s in tag.strings:
            s.replace_with(unicode(s).upper())
    return unicode(soup)
That crashes:
File "/Users/malvolio/flip.py", line 23, in upcaseAll
for s in tag.strings:
File "/Library/Python/2.7/site-packages/bs4/element.py", line 827, in _all_strings
for descendant in self.descendants:
File "/Library/Python/2.7/site-packages/bs4/element.py", line 1198, in descendants
current = current.next_element
AttributeError: 'NoneType' object has no attribute 'next_element'
All the variations I can think of crash the same way. BS4 does not seem to like it when I replace a lot of NavigableStrings. How can I do this?
You should not use str as the function argument name, as this shadows the Python builtin.
Also, you should be able to convert the visible text by just using prettify with a formatter, like this:
...
return soup.prettify(formatter=lambda x: unicode(x).upper())
I have tested it now and it works:
from bs4 import BeautifulSoup
import requests
r = requests.get('http://www.stackoverflow.com')
soup = BeautifulSoup(r.content)
print soup.prettify(formatter=lambda x: unicode(x).upper())[:200]
<!DOCTYPE html>
<html>
<head>
<title>
STACK OVERFLOW
</title>
<link href="//CDN.SSTATIC.NET/STACKOVERFLOW/IMG/FAVICON.ICO?V=00A326F96F68" rel="SHORTCUT ICON"/>
<link href="//CDN.SSTATIC.NE
...
You can read the documentation on output formatters for more detailed information.
Hope this helps.
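As an aside: as the output above shows, the formatter also uppercases attribute values (rel="SHORTCUT ICON", the CDN host). If you want to touch only the text nodes, a minimal sketch of a fix for the original crash (hypothetical helper name) is to snapshot the strings before mutating the tree, since replace_with() breaks the generator that is walking it:
from bs4 import BeautifulSoup

def upcase_text(html):
    soup = BeautifulSoup(html)
    # list() materializes the NavigableStrings up front, because
    # replace_with() invalidates the next_element chain that the
    # .strings generator follows
    for s in list(soup.strings):
        s.replace_with(unicode(s).upper())
    return unicode(soup)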
This is part of my html code:
<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' />
<link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />
I have to find all hrefs of stylesheets.
I tried to use regular expression like
<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>
The full code is
body = '''<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' />
<link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />''''
real_viraz = '''<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>'''
r = re.findall(real_viraz, body, re.I|re.DOTALL)
print r
But the problem is that rel='stylesheet' and href='' can be in any order in <link ...>, and it can be almost everything between them.
Please help me to find the right regular expression. Thanks.
Somehow, your name looks like the power automation tool Sikuli :)
If you are trying to parse HTML/XML-based text in Python, BeautifulSoup (documentation) is an extremely powerful library to help you with that. Otherwise, you are indeed reinventing the wheel (an interesting story from Randy Sargent).
from bs4 import BeautifulSoup
# in case you need to get the page first.
#import urllib2
#url = "http://selenium-python.readthedocs.org/en/latest/"
#text = urllib2.urlopen(url).read()
text = """<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" /><link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' /><link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />"""
soup = BeautifulSoup(text)
links = soup.find_all("link", {"rel": "stylesheet"})
for link in links:
    try:
        print link['href']
    except KeyError:
        pass
the output is:
catalog/view/theme/default/stylesheet/stylesheet.css
http://1
http://2
Learn BeautifulSoup well and you are 100% ready to parse anything in HTML or XML.
(You might also want to put Selenium and Scrapy into your toolbox in the future.)
Short answer: Don't use regular expressions to parse (X)HTML, use a (X)HTML parser.
In Python, this would be lxml. You could parse the HTML using lxml's HTML Parser, and use an XPath query to get all the link elements, and collect their href attributes:
from lxml import etree
parser = etree.HTMLParser()
doc = etree.parse(open('sample.html'), parser)
links = doc.xpath("//head/link[@rel='stylesheet']")
hrefs = [l.attrib['href'] for l in links]
print hrefs
Output:
['catalog/view/theme/default/stylesheet/stylesheet.css', 'http://1', 'http://2']
I'm amazed by the many developers here on Stack Exchange who insist on using outside modules over the re module for obtaining data and parsing strings, HTML, and CSS. Nothing works more efficiently or faster than re.
These two lines not only grab the CSS stylesheet path, but also grab several stylesheets if there is more than one, and place them into a nice Python list for processing and/or for a urllib request method.
a = re.findall('link rel="stylesheet" href=".*?"', t)
a = str(a)
Also, for those unaware of them, here are what most developers know as the HTML comment-out lines:
<!-- stuff here -->
These allow re to process and grab data at will from HTML or CSS, and/or to remove chunks of pesky JavaScript when testing browser capabilities, in a single iteration, as shown below.
txt=re.sub('<script>', '<!--', txt)
txt=re.sub('</script>', '-->', txt)
txt=re.sub('<!--.*?-->', '', txt)
Python retains all the regular expressions from native C, so use them, people. That's what they're for, and nothing is as slow as Beautiful Soup and HTMLParser.
Use the re module to grab all your data from HTML tags as well as CSS, or from anything a string can contain. And if you have a problem with a variable not being of type string, then make it a string with a single tiny line of code:
var = str(var)