HTML Parsing issue with BeautifulSoup Library

HTML Parsing issue with BeautifulSoup Library - python

I am working with the BS library for HTML parsing. My task is to remove everything between the head tags. So if i have <head> A lot of Crap! </head> then the result should be <head></head>. This is the code for it
raw_html = "entire_web_document_as_string"
soup = BeautifulSoup(raw_html)
head = soup.head
head.unwrap()
print(head)
And this works fine. But i want that these changes should take place in the raw_html string that contains the entire html document. How do reflect these commands in the original string and not only in the head string? Can you share a code snippet for doing it?

You're basically asking how to export a string of HTML from BS's soup object.
You can do it this way:
# Python 2.7
modified_raw_html = unicode(soup)
# Python3
modified_raw_html = str(soup)

Related

Python HTML parsing: removing excess HTML from get request output

I am wanting to make a simple python script to automate the process of pulling .mov files from an IP camera's SD card. The Model of IP camera supports http requests which returns HTML that contains the .mov file info. My python script so far..
from bs4 import BeautifulSoup
import requests
page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
OUTPUT:
NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov
I want to only return the MOV file. So removing:
"NAME2041=Record_continiously/2018-06-02/8/"
I'm new to HTML parsing with python so I'm a bit confused with the functionality.
Is returned HTML considered a string? If so, I understand that it will be immutable and I will have to create a new string instead of "striping away" the preexisting string.
I have tried:
page.replace("NAME2041=Record_continiously/2018-06-02/8/","")
in which I receive an attribute error. Is anyone aware of any method that could accomplish this?
Here is a sample of the HTML I am working with...
<html>
<head></head>
<body>
000 Success NUM=2039 NAME0=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-17-38_60.mov SIZE0=15736218
NAME1=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-16-37_60.mov SIZE1=15683077
NAME2=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-15-36_60.mov SIZE2=15676882
NAME3=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-14-35_60.mov SIZE3=15731539
</body>
</html>

Use str.split with negative indexing.
Ex:
page = "NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov"
print( page.split("/")[-1])
Output:
MP_2018-06-03_00-33-15_60.mov

as you asked for explanation of your code here it is:
# import statements
from bs4 import BeautifulSoup
import requests
page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3") # returns response object
soup = BeautifulSoup(page.content, 'html.parser') #
page.content returns string content of response
you are passing this(page.content) string content to class BeautifulSoup which is initialized with two arguments your content(page.content) as string and parser here it is html.parser
soup is the object of BeautifulSoup
.prettify() is method used to pretty print the content
In string slicing you may get failure of result due to length of content so it's better to split your content as suggested by #Rakesh and that's the best approach in your case.

Extracting raw html from locally saved html file using BeautifulSoup

Relativally new to BeautifulSoup. Attempting to obtain raw html from locally saved html file. I've looked around and have found that I should probably be using Beautiful Soup for this. Though when I do this:
from bs4 import BeautifulSoup
url = r"C:\example.html"
soup = BeautifulSoup(url, "html.parser")
text = soup.get_text()
print (text)
An empty string is printed out. I assume I'm missing some step. Any nudge in the right direction would be greatly appreciated.

The first argument to BeautifulSoup is an actual HTML string, not a URL. Open the file, read its contents, and pass that in.

Touching upon the previous answer, there are two ways to open an HTML file:
1.
with open("example.html") as fp:
soup = BeautifulSoup(fp)
2.
soup = BeautifulSoup(open("example.html"))

how to detect the language of webpage content by using python

I have to test a bunch of URLs whether those webpages have respective translation content or not. Is there any way to return the language of content in a webpage by using the Python language? Like if the page is in Chinese, then it should return `"Chinese"``.
I checked it with langdetect module, but not able to get the results I desire. These URls are in web xml format. The content is showing under <releasehigh>

Here is a simple example demonstrating use of BeautifulSoup to extract HTML body text and langdetect for the language detection:
from bs4 import BeautifulSoup
from langdetect import detect
with open("foo.html", "rb") as f:
soup = BeautifulSoup(f, "lxml")
[s.decompose() for s in soup("script")] # remove <script> elements
body_text = soup.body.get_text()
print(detect(body_text))

You can extract a chunk of content then use some python language detection like langdetect or guess-language.

Maybe you have a header like this one :
<HTML xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
If it's the case you can see with lang="fr" that this is a french web page. If it's not the case, guessing the language of a text is not trivial.

You can use BeautifulSoup to extract the language from HTML source code.
<html class="no-js" lang="cs">
Extract the lang field from source code:
from bs4 import BeautifulSoup
import requests
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
print(soup.html["lang"])

Python code to consolidate CSS into HTML

Looking for python code that can take an HTML page and insert any linked CSS style definitions used by that page into it - so any externally referenced css page(s) are not needed.
Needed to make single files to insert as email attachments from existing pages used on web site. Thanks for any help.

Sven's answer helped me, but it didn't work out of the box. The following did it for me:
import bs4 #BeautifulSoup 3 has been replaced
soup = bs4.BeautifulSoup(open("index.html").read())
stylesheets = soup.findAll("link", {"rel": "stylesheet"})
for s in stylesheets:
t = soup.new_tag('style')
c = bs4.element.NavigableString(open(s["href"]).read())
t.insert(0,c)
t['type'] = 'text/css'
s.replaceWith(t)
open("output.html", "w").write(str(soup))

You will have to code this yourself, but BeautifulSoup will help you a long way. Assuming all your files are local, you can do something like:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open("index.html").read())
stylesheets = soup.findAll("link", {"rel": "stylesheet"})
for s in stylesheets:
s.replaceWith('<style type="text/css" media="screen">' +
open(s["href"]).read()) +
'</style>')
open("output.html", "w").write(str(soup))
If the files are not local, you can use Pythons urllib or urllib2 to retrieve them.

You can use pynliner. Example from their documentation:
html = "html string"
css = "css string"
p = Pynliner()
p.from_string(html).with_cssString(css)

Issues with BeautifulSoup parsing

I am trying to parse an html page with BeautifulSoup, but it appears that BeautifulSoup doesn't like the html or that page at all. When I run the code below, the method prettify() returns me only the script block of the page (see below). Does anybody has an idea why it happens?
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
html = "".join(urllib2.urlopen(url).readlines())
print "-- HTML ------------------------------------------"
print html
print "-- BeautifulSoup ---------------------------------"
print BeautifulSoup(html).prettify()
The is the output produced by BeautifulSoup.
-- BeautifulSoup ---------------------------------
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script language="JavaScript">
<!--
function highlight(img) {
document[img].src = "/marketing/sony/images/en/" + img + "_on.gif";
}
function unhighlight(img) {
document[img].src = "/marketing/sony/images/en/" + img + "_off.gif";
}
//-->
</script>
Thanks!
UPDATE: I am using the following version, which appears to be the latest.
__author__ = "Leonard Richardson (leonardr#segfault.org)"
__version__ = "3.1.0.1"
__copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
__license__ = "New-style BSD"

Try with version 3.0.7a as Łukasz suggested. BeautifulSoup 3.1 was designed to be compatible with Python 3.0 so they had to change the parser from SGMLParser to HTMLParser which seems more vulnerable to bad HTML.
From the changelog for BeautifulSoup 3.1:
"Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't"

Try lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup, so it might work better for you. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Blicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.

BeautifulSoup isn't magic: if the incoming HTML is too horrible then it isn't going to work.
In this case, the incoming HTML is exactly that: too broken for BeautifulSoup to figure out what to do. For instance it contains markup like:
SCRIPT type=""javascript""
(Notice the double quoting.)
The BeautifulSoup docs contains a section what you can do if BeautifulSoup can't parse you markup. You'll need to investigate those alternatives.

Samj: If I get things like
HTMLParser.HTMLParseError: bad end tag: u"</scr' + 'ipt>"
I just remove the culprit from markup before I serve it to BeautifulSoup and all is dandy:
html = urllib2.urlopen(url).read()
html = html.replace("</scr' + 'ipt>","")
soup = BeautifulSoup(html)

I had problems parsing the following code too:
<script>
function show_ads() {
document.write("<div><sc"+"ript type='text/javascript'src='http://pagead2.googlesyndication.com/pagead/show_ads.js'></scr"+"ipt></div>");
}
</script>
HTMLParseError: bad end tag: u'', at line 26, column 127
Sam

I tested this script on BeautifulSoup version '3.0.7a' and it returns what appears to be correct output. I don't know what changed between '3.0.7a' and '3.1.0.1' but give it a try.

import urllib
from BeautifulSoup import BeautifulSoup
>>> page = urllib.urlopen('http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1')
>>> soup = BeautifulSoup(page)
>>> soup.prettify()
In my case by executing the above statements, it returns the entire HTML page.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

HTML Parsing issue with BeautifulSoup Library - python

You're basically asking how to export a string of HTML from BS's soup object. You can do it this way: # Python 2.7 modified_raw_html = unicode(soup) # Python3 modified_raw_html = str(soup)

Related

Python HTML parsing: removing excess HTML from get request output

Extracting raw html from locally saved html file using BeautifulSoup

how to detect the language of webpage content by using python

Python code to consolidate CSS into HTML

Issues with BeautifulSoup parsing

Categories

Resources