I am working on a simple way to wrap each sentence of an ebook formatted in HTML in span tags.
I am using a trained machine learning model to classify end-of-sentence punctuation (".!?", ...) and get the real sentence boundaries (e.g., in "U.S.A.", the period after "S" does not end a sentence).
The problem is, in order to feed my model correct data, I need to first extract the text out of my HTML ebook (using BeautifulSoup's get_text('\n')).
Right now, I am able to wrap the output of get_text('\n') in span tags. But I can't just save that, since I lose all the other tags used in the original HTML ebook.
Sample of the HTML ebook:
<html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><link href="style.css" rel="stylesheet" type="text/css" /><title> Name. Of the book. </title></head> ...
</div>
After get_text:
Name. Of the book.
After running my algorithm:
<span>Name. Of the book.</span>
How can I get this output instead:
<html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><link href="style.css" rel="stylesheet" type="text/css" /><title> <span>Name. Of the book.</span> </title></head> ...
</div>
Thank you in advance for your help!
You can use the wrap() method (doc) to wrap each text node in a <span> tag; it updates the whole HTML structure.
Example:
from bs4 import BeautifulSoup

data = '''<html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><link href="style.css" rel="stylesheet" type="text/css" /><title> Name. Of the book. </title></head>'''
soup = BeautifulSoup(data, 'html.parser')
print('Before:')
print('-' * 80)
print(soup.prettify())
print('-' * 80)
for text in soup.find_all(text=True):
    text.wrap(soup.new_tag("span"))  # use wrap() to put the text node inside a new <span> tag
print('After:')
print('-' * 80)
print(soup.prettify())
print('-' * 80)
Prints (notice the <span> inside the <title> tag):
Before:
--------------------------------------------------------------------------------
<html>
<head>
<meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
<link href="style.css" rel="stylesheet" type="text/css"/>
<title>
Name. Of the book.
</title>
</head>
</html>
--------------------------------------------------------------------------------
After:
--------------------------------------------------------------------------------
<html>
<head>
<meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
<link href="style.css" rel="stylesheet" type="text/css"/>
<title>
<span>
Name. Of the book.
</span>
</title>
</head>
</html>
--------------------------------------------------------------------------------
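Note that find_all(text=True) also matches whitespace-only strings between tags; if you only want to wrap nodes with visible text, a small variation of the same loop (a sketch using the same bs4 calls) is:

for text in soup.find_all(text=True):
    if text.strip():  # skip whitespace-only text nodes between tags
        text.wrap(soup.new_tag("span"))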
Okay, so I have a pretty naive but quite effective approach: get the entire HTML code first, store it in a string, and then run a regular expression over it to wrap the text in span tags (see the sketch below).
This is the only way I can think of as of now. Hope this helps :)
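A minimal sketch of that idea (the pattern and sample string are illustrative assumptions; regex on HTML is fragile and won't handle every document):

import re

html = '<title> Name. Of the book. </title>'

# Wrap whatever appears between a closing '>' and the next '<' in <span> tags,
# leaving the surrounding markup untouched (whitespace-only runs are skipped)
wrapped = re.sub(r'>([^<>]*\S[^<>]*)<', r'><span>\1</span><', html)
print(wrapped)  # <title><span> Name. Of the book. </span></title>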
I want to use Python to parse HTML markup, and given one of the resultant DOM tree elements, get the start and end offsets of that element within the original, unmodified markup.
For example, given the HTML markup (with \n EOL chars)
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
<head>
<title>No Longer Human</title>
<meta content="urn:uuid:6757faf0-eef1-45d9-b2b3-7462350db7ba" name="Adept.expected.resource"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="kindle:flow:0002?mime=text/css" rel="stylesheet" type="text/css"/>
<link href="kindle:flow:0001?mime=text/css" rel="stylesheet" type="text/css"/>
</head>
<body class="calibre" aid="0">
</body>
</html>
(example with BeautifulSoup, but I'm not attached to any parser in particular)
>>> soup = bs4.BeautifulSoup(html_markup)
>>> title_tag = soup.find('title')
>>> get_offsets_in_markup(title_tag) # <-------- how do I go about doing this?
(109, 139) # <----- source mapping info I want to get
>>> html_markup[109:139]
'<title>No Longer Human</title>'
I don't see this functionality in the APIs of any of the Python HTML parsers available. Can I hack it into one of the existing parsers? How would I go about doing that? Or is there another, better approach?
I realize that str(soup_element) serializes the element back into markup (and I can hypothetically recurse down the tree saving the start and end indices as I go), but the markup returned by doing that, although semantically equivalent to the original, doesn't match the original char-for-char. None of the available Python parsers round-trip the markup exactly.
You can use a regular expression to find the corresponding element's start and end indexes, then use those indexes on the original string to extract the data:
import re
from bs4 import BeautifulSoup
from pathlib import Path
def get_offsets_in_markup(tag, html_markup):
    # escape the serialized tag so it is matched literally, not as a pattern
    elem = re.search(re.escape(str(tag)), html_markup)
    return elem.start(), elem.end()
html_markup = Path('test.html').read_text()
soup = BeautifulSoup(html_markup, 'lxml')
title_tag = soup.find('title')
indexes = get_offsets_in_markup(title_tag, html_markup)
# -> (109, 139)
given_text = html_markup[indexes[0]:indexes[1]]
# -> <title>No Longer Human</title>
This is what test.html looks like:
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
<head>
<title>No Longer Human</title>
<meta content="urn:uuid:6757faf0-eef1-45d9-b2b3-7462350db7ba" name="Adept.e$
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="kindle:flow:0002?mime=text/css" rel="stylesheet" type="text/css"/>
<link href="kindle:flow:0001?mime=text/css" rel="stylesheet" type="text/css"/>
</head>
<body class="calibre" aid="0">
</body>
</html>
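Note that this only works when the serialized tag matches the source char-for-char, which is exactly the concern raised in the question. When it does match, plain str.find gives the same offsets without the regex machinery; a minimal sketch:

def get_offsets_in_markup(tag, html_markup):
    # str(tag) re-serializes the element; find() locates that literal
    # substring in the original markup (returns -1 if it doesn't match)
    start = html_markup.find(str(tag))
    return start, start + len(str(tag))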
I am trying to get the content of a meta tag. The problem is that BS4 can't parse the tag properly on some sites where the tag is not closed as it should be. With tags like the example below, the output of my function includes tons of clutter, including other tags such as scripts, links, etc. I believe the browser automatically closes the meta tag somewhere near the end of the head, and this behavior confuses BS4.
My code works with this:
<meta name="description" content="content" />
and doesn't work with:
<meta name="description" content="content">
Here is the code of my BS4 function:
from bs4 import BeautifulSoup
html = BeautifulSoup(open('/path/file.html'), 'html.parser')
desc = html.find(attrs={'name':'description'})
print(desc)
Any way to make it work with those un-closed meta tags?
The html5lib or lxml parsers handle the problem properly:
In [1]: from bs4 import BeautifulSoup
...:
...: data = """
...: <html>
...: <head>
...: <meta name="description" content="content">
...: <script>
...: var i = 0;
...: </script>
...: </head>
...: <body>
...: <div id="content">content</div>
...: </body>
...: </html>"""
...:
In [2]: BeautifulSoup(data, 'html.parser').find(attrs={'name': 'description'})
Out[2]: <meta content="content" name="description">\n<script>\n var i = 0;\n </script>\n</meta>
In [3]: BeautifulSoup(data, 'html5lib').find(attrs={'name': 'description'})
Out[3]: <meta content="content" name="description"/>
In [4]: BeautifulSoup(data, 'lxml').find(attrs={'name': 'description'})
Out[4]: <meta content="content" name="description"/>
I've found something new and hope it can give you some help: I think that every time BeautifulSoup finds an element without a proper end tag, it keeps consuming the following elements until it reaches the end tag of that element's parent. In case that isn't clear, here is a little demo:
hello.html
<!DOCTYPE html>
<html lang="en">
<meta name="description" content="content">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<div>
<p class="title"><b>The Dormouse's story</b>
<p class="story">Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.</p>
</p></div>
</body>
</html>
Run it as you did before and you find the result below:
<meta content="content" name="description">
<head>
<meta charset="utf-8">
<title>Title</title>
</meta></head>
<body>
...
</div></body>
</meta>
OK! BeautifulSoup generates the closing </meta> tag automatically, placing it after the </body> tag but before the parent's closing tag </html>. So what I mean is: the auto-generated end tag lands just before the end tag of the unclosed element's parent. I still couldn't quite convince myself of this, so I made a test: I deleted the end tag of <p class='title'> so that there is only one </p> tag in <div>...</div>, and then ran
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('hello.html'), 'html.parser')
c = soup.find_all('p', attrs={'class': 'title'})
print(c[0])
and there are two </p> tags in the result. So it is as I said previously.
I have gotten the HTML of a webpage using Python, and I now want to find all of the .css files that are linked to in the header and save each as its own variable (I know how to do this part). I tried partitioning, as shown below, but got "IndexError: string index out of range" upon running it.
sytle = src.partition(".css")
style = style[0].partition('<link href=')
print style[2]
c =1
I do not think that this is the right way to approach this, so I would love some advice. Many thanks in advance. Here is a section of the kind of text I need to extract the .css file(s) from.
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />
You can use a regular expression for this. Try the following:
/href="(.*\.css[^"]*)/g
EDIT
import re
matches = re.findall(r'href="(.*\.css[^"]*)', html)
print(matches)
My answer is along the same lines as Jon Clements' answer, but I tested mine and added a drop of explanation.
You should not use a regex. You can't parse HTML with a regex. The regex answer might work, but writing a robust solution is very easy with lxml. This approach is guaranteed to return the full href attribute of all <link rel="stylesheet"> tags and no others.
from lxml import html
def extract_stylesheets(page_content):
    doc = html.fromstring(page_content)  # Parse
    return doc.xpath('//head/link[@rel="stylesheet"]/@href')  # Search
There is no need to check the filenames, since the results of the xpath search are already known to be stylesheet links, and there's no guarantee that the filenames will have a .css extension anyway. The simple regex will catch only a very specific form, but the general html parser solution will also do the right thing in cases such as this, where the regex would fail miserably:
<link REL="stylesheet" hREf =
'/stylesheets/print?1342791421'
media="print"
><!-- link href="/css/stylesheet.css" -->
It could also be easily extended to select only stylesheets for a particular media.
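For example, here is a sketch that keeps only print stylesheets (the function name is mine, and contains() is one way to handle multi-valued media attributes such as "screen, projection"):

from lxml import html

def extract_print_stylesheets(page_content):
    doc = html.fromstring(page_content)
    # contains() also matches media lists such as "screen, print"
    return doc.xpath('//head/link[@rel="stylesheet"][contains(@media, "print")]/@href')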
For what it's worth, here's a version using lxml.html as the parsing lib (untested).
import lxml.html
from urlparse import urlparse  # Python 2; on Python 3 use: from urllib.parse import urlparse
sample_html = """<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />
"""
page = lxml.html.fromstring(sample_html)
link_hrefs = (p.path for p in map(urlparse, page.xpath('//head/link/@href')))
for href in link_hrefs:
    if href.rsplit('.', 1)[-1].lower() == 'css':  # implement smarter error handling here
        pass  # do whatever
I want to catch some tags with BeautifulSoup: Some <p> tags, the <title> tag, some <meta> tags. But I want to catch them regardless of their case; I know that some sites do meta like this: <META> and I want to be able to catch that.
I noticed that BeautifulSoup is case-sensitive by default. How do I catch these tags in a non-case-sensitive way?
BeautifulSoup standardises the parse tree on input. It converts tags to lower-case. You don't have anything to worry about IMO.
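A quick sketch to confirm this with BeautifulSoup 4 and the built-in parser (the sample markup is made up):

from bs4 import BeautifulSoup

# Tag names are lower-cased in the parse tree, so the upper-case <META>
# in the source is still found by searching for 'meta'
soup = BeautifulSoup('<html><head><META name="a" content="b"></head></html>', 'html.parser')
print(soup.find('meta'))  # <meta content="b" name="a"/>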
You can use soup.findAll, which matches these tags regardless of their case in the source (BeautifulSoup normalises tag names when parsing):
import BeautifulSoup
html = '''<html>
<head>
<meta name="description" content="Free Web tutorials on HTML, CSS, XML" />
<META name="keywords" content="HTML, CSS, XML" />
<title>Test</title>
</head>
<body>
</body>
</html>'''
soup = BeautifulSoup.BeautifulSoup(html)
for x in soup.findAll('meta'):
    print x
Result:
<meta name="description" content="Free Web tutorials on HTML, CSS, XML" />
<meta name="keywords" content="HTML, CSS, XML" />
I've just started tinkering with scrapy in conjunction with BeautifulSoup and I'm wondering if I'm missing something very obvious but I can't seem to figure out how to get the doctype of a returned html document from the resulting soup object.
Given the following html:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta charset=utf-8 />
<meta name="viewport" content="width=620" />
<title>HTML5 Demos and Examples</title>
<link rel="stylesheet" href="/css/html5demos.css" type="text/css" />
<script src="js/h5utils.js"></script>
</head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Can anyone tell me if there's a way of extracting the declared doctype from it using BeautifulSoup?
Beautiful Soup 4 has a class for DOCTYPE declarations, so you can use that to extract all the declarations at the top level (though you're no doubt expecting one or none!):
import bs4

def doctype(soup):
    items = [item for item in soup.contents if isinstance(item, bs4.Doctype)]
    return items[0] if items else None
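Usage, assuming the question's markup is stored in html_markup:

soup = bs4.BeautifulSoup(html_markup, 'html.parser')
print(doctype(soup))
# HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"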
You can go through top-level elements and check each to see whether it is a declaration. Then you can inspect it to find out what kind of declaration it is:
# assumes BeautifulSoup 3, e.g. import BeautifulSoup as BS
for child in soup.contents:
    if isinstance(child, BS.Declaration):
        declaration_type = child.string.split()[0]
        if declaration_type.upper() == 'DOCTYPE':
            declaration = child
You could just fetch the first item in soup contents:
>>> soup.contents[0]
u'DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"'