double encoded html code - python

I use Xinha as a WYSIWYG editor for HTML content.
I send the HTML articles via a POST form to PostgreSQL.
So far so good; they seem OK.
But when I read them back from PostgreSQL and output them to an HTML page, I see double-encoded, i.e. broken, HTML code
like this:
&lt;p&gt;&lt;a href="http://google.com"&gt;google.com&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&lt;/p&gt; &lt;p&gt;
Any idea where to search for the issue?
Thanks in advance.

import HTMLParser

hp = HTMLParser.HTMLParser()
# the double-encoded string, as it comes back out of PostgreSQL
s = '&lt;p&gt;&lt;a href="http://google.com"&gt;google.com&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&lt;/p&gt; &lt;p&gt;'
print hp.unescape(s)
# u'<p><a href="http://google.com">google.com</a></p> <p>&nbsp;</p> <p>'
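For reference, Python 3 moved this into the html module (the HTMLParser unescape method was deprecated and later removed); the one-level unescape is the same idea:

import html

s = '&lt;p&gt;&lt;a href="http://google.com"&gt;google.com&lt;/a&gt;&lt;/p&gt;'
print(html.unescape(s))
# <p><a href="http://google.com">google.com</a></p>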

Related

Can't get data from inside of span-tag with beautifulsoup

I am trying to scrape an Instagram page and want to get at the div tags inside a span tag, but I can't! The HTML of the Instagram page looks like this:
<head>--</head>
<body>
<span id="react-root" aria-hidden="false">
<form enctype="multipart/form-data" method="POST" role="presentation">…</form>
<section class="_9eogI E3X2T">
<main class="SCxLW o64aR" role="main">
<div class="v9tJq VfzDr">
<header class=" HVbuG">…</header>
<div class="_4bSq7">…</div>
<div class="fx7hk">…</div>
</div>
</main>
</section>
</body>
I do it as:
from bs4 import BeautifulSoup
import urllib.request as urllib2
html_page = urllib2.urlopen("https://www.instagram.com/cherrified_/?hl=en")
soup = BeautifulSoup(html_page,"lxml")
span_tag = soup.find('span')  # returns the span tag correctly
span_tags.find_all('divs')  # returns an empty list, why?
Please also provide an example.
Instagram is a Single Page Application powered by React, which means its source is just a simple "empty" page that loads JavaScript to dynamically generate the content in the browser after downloading.
Click "View source" or go to view-source:https://www.instagram.com/cherrified_/?hl=en in Chrome. This is the HTML you download with urllib.request.
You can see that there is a single <span> tag, which does not include a <div> tag. (Note: <div> inside a <span> is not allowed).
Scraping instagram.com this way is not possible. It also might not be legal (I am not a lawyer).
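If you do need the generated content, a common workaround for JavaScript-rendered pages is to let a real browser build the DOM and parse driver.page_source instead. A minimal sketch, assuming Selenium and a matching chromedriver are installed (and leaving aside whether Instagram's terms permit it):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://www.instagram.com/cherrified_/?hl=en")
time.sleep(5)  # crude wait for the client-side render to finish
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

span_tag = soup.find("span", id="react-root")
print(span_tag.find_all("div") if span_tag else "span not found")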
Notes:
your HTML code example doesn't include a closing tag for <span>.
your HTML code example doesn't match the link you provide in the python snippet.
in the last line of the python snippet you probably meant span_tag.find_all('div') (note the variable name and the singular 'div').

Remove comment tag but NOT content with BeautifulSoup

I'm practicing some web scraping using BeautifulSoup; specifically, I'm looking at NFL game data, and more specifically the "Team Stats" table on this page (https://www.pro-football-reference.com/boxscores/201809060phi.htm).
When looking at the HTML for the table I see something like this:
<div class="section_heading">...</div>
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_team_stats">
<table class="stats_table" id="team_stats" data-cols-to-freeze=1>
....
</table>
</div>
</div>
-->
Essentially, the HTML that is being rendered to the page is stored in the HTML as a comment, so I can find the div for the table but BeautifulSoup can't parse the table itself because it's all in the comment.
Is there a good way to get around this so I can parse the table HTML with BeautifulSoup? I figured out how to extract the comment text, but I don't know whether there's a good way to convert the resulting string into usable HTML. Alternatively, the comment tags could simply be removed, which I think would let it be parsed as HTML, but I haven't found a good way to do that either.
from bs4 import BeautifulSoup, Comment

for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
    comment.extract()  # detaches the comment from the tree and returns it
This pulls every comment out of the tree. Each extracted Comment is just a string of the markup that sat between the comment tags, so you can feed it back into BeautifulSoup to extract the data within. Hope this works.
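For the team-stats table in the question, here is a minimal sketch of that idea (assuming the requests package; "team_stats" is the table id shown in the question's HTML). Since a Comment is a string, it can be handed straight back to BeautifulSoup:

import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.pro-football-reference.com/boxscores/201809060phi.htm"
soup = BeautifulSoup(requests.get(url).text, "lxml")

for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
    if "team_stats" in comment:
        # re-parse the commented-out markup into a real tree
        table = BeautifulSoup(comment, "lxml").find("table", id="team_stats")
        if table is not None:
            print(len(table.find_all("tr")))  # the rows are now reachable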

python selenium xpath horror

I know I am doing something wrong here, but I would like your help on this one.
I have this HTML code:
<span id='some text'>some text</span>
<ul> this is what I would like to grab </ul>
<span id='some more text'>some more text</span>
So I tried something, and since this is the first time I am working with XPath, I was quite sure I was doing it wrong:
driver.find_elements_by_xpath('//ul[preceding:://span[@id="some text"] and following:://span[@id="some more text"] ')
Any help is appreciated.
An id attribute is supposed to be unique, so one is enough to select a branch.
To get the <ul> tag following the <span id='some text'>:
driver.find_elements_by_xpath("//span[#id='some text']/following-sibling::ul[1]")
and with a CSS selector:
driver.find_elements_by_css_selector("span[id='some text'] + ul")
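As a self-contained check of the XPath above (a sketch: the file path is a hypothetical stand-in for your page, and find_elements_by_xpath is the Selenium 3 spelling used here):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("file:///tmp/sample.html")  # hypothetical file containing the HTML snippet above

uls = driver.find_elements_by_xpath("//span[@id='some text']/following-sibling::ul[1]")
print(uls[0].text if uls else "no match")  # -> "this is what I would like to grab"
driver.quit()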

BeautifulSoup (bs4) parsing wrong

Parsing this sample document with bs4, from python 2.7.6:
<html>
<body>
<p>HTML allows omitting P end-tags.
<p>Like that and this.
<p>And this, too.
<p>What happened?</p>
<p>And can we <p>nest a paragraph, too?</p></p>
</body>
</html>
Using:
from bs4 import BeautifulSoup as BS
...
tree = BS(fh)
HTML has, for ages, allowed omitted end-tags for various element types, including P (check the schema, or a parser). However, bs4's prettify() on this document shows that it doesn't end any of those paragraphs until it sees </body>:
<html>
<body>
<p>
HTML allows omitting P end-tags.
<p>
Like that and this.
<p>
And this, too.
<p>
What happened?
</p>
<p>
And can we
<p>
nest a paragraph, too?
</p>
</p>
</p>
</p>
</p>
</body>
It's not prettify()'s fault, because traversing the tree manually I get the same structure:
<[document]>
<html>
␊
<body>
␊
<p>
HTML allows omitting P end-tags.␊␊
<p>
Like that and this.␊␊
<p>
And this, too.␊␊
<p>
What happened?
</p>
␊
<p>
And can we
<p>
nest a paragraph, too?
</p>
</p>
␊
</p>
</p>
</p>
</body>
␊
</html>
␊
</[document]>
Now, this would be the right result for XML (at least up to </body>, at which point it should report a WF error). But this ain't XML. What gives?
The doc at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser tells how to get BS4 to use different parsers. Apparently the default is html.parser, which the BS4 doc says is broken before Python 2.7.3, but which apparently still has the problem described above in 2.7.6.
Switching to "lxml" was unsuccessful for me, but switching to "html5lib" produces the correct result:
tree = BS(htmSource, "html5lib")
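A quick way to see the difference for yourself, assuming lxml and html5lib are installed (pip install lxml html5lib):

from bs4 import BeautifulSoup

html = "<p>one<p>two<p>three"
for parser in ("html.parser", "lxml", "html5lib"):
    # each parser applies HTML's implied-end-tag rules differently
    print("%s: %s" % (parser, BeautifulSoup(html, parser)))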

BeautifulSoup Scraping How to

Consider an HTML structure like
<div class="entry_content">
<p>
<script>some blah blah script here</script>
<fb:like--blah blah></fb:like>
<img/>
</p>
<p align="left">
content to be scraped begins here
</p>
<p>
more content to be scraped in one or many paragraphs from this paragraph onwards
</p>
-- there could be many more <p> here which also need to be included
</div>
The soup
content = soup.html.body.find('div', class_='entry_content')
gives me everything within the outermost div tag, including the JavaScript, the Facebook code and all the HTML tags.
Now, how do I remove everything before <p align="left">?
I tried something like:
content.split('<p align="left">')[1]
But this does not do the trick, since content is a bs4 Tag, not a string.
Have a look at extract or decompose.
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents.
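A minimal sketch of the decompose route against the structure from the question (the HTML here is a trimmed stand-in): locate the first <p align="left"> and destroy every tag that precedes it inside the div:

from bs4 import BeautifulSoup

html = """<div class="entry_content">
<p><script>var x = 1;</script><img/></p>
<p align="left">content to be scraped begins here</p>
<p>more content to be scraped</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", class_="entry_content")

start = content.find("p", align="left")
# list() first: decompose() mutates the tree while we iterate
for tag in list(start.find_previous_siblings()):
    tag.decompose()

print(content)  # only <p align="left"> and everything after it remain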
