BS4 corrects faulty HTML. Usually this is not a problem. I tried parsing, altering and saving the HTML of this page: ulisses-regelwiki.de/index.php/sonderfertigkeiten.html
In this case the repairing changes the rendering: after the repair, many lines of the page are no longer centered but left-aligned instead.
Since I have to work with the broken HTML of said page, I cannot simply repair the HTML code myself.
How can I prevent bs4 from repairing the HTML, or fix the "correction" somehow?
(this minimal example just shows bs4 repairing broken html-code; I couldn't create a minimal example where bs4 does this in a wrong way like with the page mentioned above)
#!/usr/bin/env python3
from bs4 import BeautifulSoup, NavigableString

html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''

def is_string_only(t):
    return isinstance(t, NavigableString)

soup = BeautifulSoup(html, 'lxml')  # or 'html.parser'
print(str(soup))
Try this lib.
from simplified_scrapy import SimplifiedDoc
html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''
doc = SimplifiedDoc(html)
print(doc.html)
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
I am a newbie to BeautifulSoup and Python. Here is my HTML:
<html>
<head></head>
<body>
<a href="http://google.com">Google</a>
<a href="http://yahoo.com">Yahoo</a>
</body>
</html>
Now my code:
from bs4 import BeautifulSoup
# The page was already fetched with requests; that step is omitted here
soup = BeautifulSoup(html,'html.parser')
print(soup.find('a'))
This gives me just one link, but I want to get all of them.
Thanks in advance
You are using .find(), which only returns the first match; use .find_all() instead to get a list of all the a tags.
print(soup.find_all('a'))
To get the hrefs with a for loop (note that tag attributes are accessed by subscription, not as Python attributes):
for link in soup.find_all('a'):
    print(link['href'])
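As a self-contained sketch (the sample HTML and URLs here are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML with two links, mirroring the question
html = """
<html><body>
<a href="http://google.com">Google</a>
<a href="http://yahoo.com">Yahoo</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# .find() returns only the first <a>; .find_all() returns every match
links = soup.find_all("a")
hrefs = [link["href"] for link in links]
print(hrefs)  # ['http://google.com', 'http://yahoo.com']
```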
I want to get the source code of only a section of a website instead of the whole page, and then parse out the section, since that would be faster than loading the whole page and then parsing it. I tried passing the section link as the URL parameter but still get the whole page.
url = 'https://stackoverflow.com/questions/19012495/smooth-scroll-to-div-id-jquery/#answer-19013712'
response = requests.get(url)
print(response.text)
You cannot get a specific section directly with the requests API, but you can use BeautifulSoup for that purpose.
A small sample is given by the dataquest website (note the page must be fetched before it is passed to BeautifulSoup):
import requests
from bs4 import BeautifulSoup

page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
Printing the soup object will output this HTML string.
<html>
<head>
<title>A simple example page
</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
First paragraph.
</p><p class="inner-text">
Second paragraph.
</p></div>
<p class="outer-text first-item" id="second"><b>
First outer paragraph.
</b></p><p class="outer-text"><b>
Second outer paragraph.
</b>
</p>
</body>
</html>
You can get specific section by finding it through tag type, class or id.
By tag-type:
soup.find_all('p')
By class:
soup.find_all('p', class_='outer-text')
By Id:
soup.find_all(id="first")
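For a runnable sketch, the three lookups can be combined on the sample page embedded inline (so no network request is needed):

```python
from bs4 import BeautifulSoup

# The sample page from above, embedded so no network request is needed
html = """
<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

all_paragraphs = soup.find_all("p")              # by tag type
outer = soup.find_all("p", class_="outer-text")  # by class
first = soup.find_all(id="first")                # by id

print(len(all_paragraphs))            # 4
print(len(outer))                     # 2
print(first[0].get_text(strip=True))  # First paragraph.
```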
HTTP will not let you do that: the #answer-19013712 part of the URL is a fragment, which is handled client-side by your browser and never even sent to the server, so the server always returns the whole page.
You can use the Stack Exchange API instead. You can pass the answer id 19013712 and thus get only that specific answer via the API.
Note, you may still have to register for an app key.
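A hedged sketch of what that API request could look like (the endpoint shape and the withbody filter follow the api.stackexchange.com conventions; treat the details as assumptions to verify against the API docs):

```python
from urllib.parse import urlencode

# Build the Stack Exchange API URL for a single answer (id from the question)
answer_id = 19013712
base = "https://api.stackexchange.com/2.3/answers/%d" % answer_id
params = {"site": "stackoverflow", "filter": "withbody"}  # withbody includes the answer body
url = base + "?" + urlencode(params)
print(url)

# To actually fetch it (requires network access):
# import requests
# body_html = requests.get(url).json()["items"][0]["body"]
```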
I was learning BeautifulSoup. It wasn't reading some of the sites properly. I found that the reason was that some HTML attributes were ill-formed. For example:
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Test</title></head>
<body>
<p id="paraone"align="center">some content <b>para1</b>.<!--there is no space before 'align' attribute -->
<p id="paratwo" align="blah">some content <b>para2</b>
</html>
"""
soup = BeautifulSoup(html)
print "soup:", soup
I think BeautifulSoup is designed not to read ill-formed HTML. If so, is there any other module that can read the HTML given above? Can't we parse ill-formed web sites?
I am trying to parse a website and retrieve the text that contains a hyperlink.
For eg:
This is an Example
I need to retrieve "This is an Example", which I am able to do for pages that don't have broken tags. I am unable to retrieve it in the following case:
<html>
<body>
<a href = "http:\\www.google.com">Google<br>
Example
</body>
</html>
In such cases the code is unable to retrieve "Google" because of the broken tag that wraps it, and only gives me "Example". Is there a way to also retrieve "Google"?
My code is here:
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
f = open("sol.html","r")
soup = BeautifulSoup(f, parse_only=SoupStrainer('a'))
for link in soup.findAll('a', text=True):
    print link.renderContents()
Please note sol.html contains the above given html code itself.
Thanks
- AJ
Remove text=True from your code and it should work just fine:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <html>
... <body>
... <a href = "http:\\www.google.com">Google<br>
... Example
... </body>
... </html>
... ''')
>>> [a.get_text().strip() for a in soup.find_all('a')]
[u'Google', u'Example']
>>> [a.get_text().strip() for a in soup.find_all('a', text=True)]
[u'Example']
Try this code:
from BeautifulSoup import BeautifulSoup
text = '''
<html>
<body>
<a href = "http:\\www.google.com">Google<br>
Example
</body>
</html>
'''
soup = BeautifulSoup(text)
for link in soup.findAll('a'):
    if link.string != None:
        print link.string
Here's the output when I ran the code:
Example
Just replace text with text = open('sol.html').read(), or whatever it is you need to go there.
I have a website that I'm scraping that has a structure similar to the following. I'd like to be able to grab the info out of the CDATA block.
I'm using BeautifulSoup to pull other info off the page, so if the solution can work with that, it would help keep my learning curve down, as I'm a Python novice.
Specifically, I want to get at the two different types of data hidden in the CDATA section. The first is just text, and I'm pretty sure I can throw a regex at it and get what I need. For the second type, if I could drop the data that has HTML elements into its own BeautifulSoup object, I could parse that.
I'm just learning Python and BeautifulSoup, so I'm struggling to find the magical incantation that will give me just the CDATA by itself.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
Cows and Sheep
</title>
</head>
<body>
<div id="main">
<div id="main-precontents">
<div id="main-contents" class="main-contents">
<script type="text/javascript">
//<![CDATA[var _ = g_cow;_[7654]={cowname_enus:'cows rule!',leather_quality:99,icon:'cow_level_23'};_[37357]={sheepname_enus:'baa breath',wool_quality:75,icon:'sheep_level_23'};_[39654].cowmeat_enus = '<table><tr><td><b class="q4">cows rule!</b><br></br>
<!--ts-->
get it now<table width="100%"><tr><td>NOW</td><th>NOW</th></tr></table><span>244 Cows</span><br></br>67 leather<br></br>68 Brains
<!--yy-->
<span class="q0">Cow Bonus: +9 Cow Power</span><br></br>Sheep Power 60 / 60<br></br>Sheep 88<br></br>Cow Level 555</td></tr></table>
<!--?5695:5:40:45-->
';
//]]>
</script>
</div>
</div>
</div>
</body>
</html>
One thing you need to be careful of when grabbing CDATA with BeautifulSoup is not to use the lxml parser.
By default, the lxml parser strips CDATA sections from the tree and replaces them with their plain-text content. Learn more here
#Trying it with html.parser
>>> from bs4 import BeautifulSoup
>>> import bs4
>>> s='''<?xml version="1.0" ?>
<foo>
<bar><![CDATA[
aaaaaaaaaaaaa
]]></bar>
</foo>'''
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find(text=lambda tag: isinstance(tag, bs4.CData)).string.strip()
'aaaaaaaaaaaaa'
>>>
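Since the question also asks about HTML hidden inside the CDATA, here is a hedged follow-up sketch (the markup is made up): once the CDATA text is extracted, it can be fed into its own BeautifulSoup object and parsed normally.

```python
import bs4
from bs4 import BeautifulSoup

# Made-up document with HTML markup hidden inside a CDATA section
doc = """<foo><bar><![CDATA[
<table><tr><td><b class="q4">cows rule!</b></td></tr></table>
]]></bar></foo>"""

# html.parser keeps CData nodes instead of stripping them like lxml
soup = BeautifulSoup(doc, "html.parser")
cdata = soup.find(string=lambda t: isinstance(t, bs4.CData))

# Parse the extracted CDATA text as its own document
inner = BeautifulSoup(str(cdata), "html.parser")
print(inner.find("b").get_text())  # cows rule!
```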
BeautifulSoup sees CData as a special case (subclass) of "navigable strings". So for example:
import BeautifulSoup
txt = '''<foobar>We have
<![CDATA[some data here]]>
and more.
</foobar>'''
soup = BeautifulSoup.BeautifulSoup(txt)
for cd in soup.findAll(text=True):
    if isinstance(cd, BeautifulSoup.CData):
        print 'CData contents: %r' % cd
In your case of course you could look in the subtree starting at the div with the 'main-contents' ID, rather than all over the document tree.
You could try this:
from BeautifulSoup import BeautifulSoup
# source.html contains your html above
f = open('source.html')
soup = BeautifulSoup(''.join(f.readlines()))
s = soup.findAll('script')
cdata = s[0].contents[0]
That should give you the contents of cdata.
Update
This may be a little cleaner:
from BeautifulSoup import BeautifulSoup
import re
# source.html contains your html above
f = open('source.html')
soup = BeautifulSoup(''.join(f.readlines()))
cdata = soup.find(text=re.compile("CDATA"))
Just personal preference, but I like the bottom one a little better.
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(content)
for x in soup.find_all('item'):
    # strip the CDATA wrapper, if present, from the string
    print re.sub(r'<!\[CDATA\[|\]\]>', '', x.string)
For anyone using BeautifulSoup4, Alex Martelli's solution works, but do this:
from bs4 import BeautifulSoup, CData

soup = BeautifulSoup(txt)
for cd in soup.findAll(text=True):
    if isinstance(cd, CData):
        print 'CData contents: %r' % cd