lxml: Element is not a child of this node - python

I'm trying to change the value of title within the following html document:
<html lang="en">
<head>
<meta charset="utf-8">
<title id="title"></title>
<base href="/">
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
<app-root></app-root>
</body>
</html>
I wrote the following python script which uses lxml, in order to accomplish the task:
from lxml.html import fromstring, tostring
from lxml.html import builder as E
html = fromstring(open('./index.html').read())
html.replace(html.get_element_by_id('title'), E.TITLE('TEST'))
But after running the script, I get the following error:
ValueError: Element is not a child of this node.
What could be causing this error? Thank you.

The 'title' tag is a child of the 'head' node. In your code you call replace on the 'html' node, which has no 'title' element as a direct child, hence the ValueError.
You get the desired result if you call replace on the 'head' node instead:
html.find('head').replace(html.get_element_by_id('title'), E.TITLE('TEST'))
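For reference, a minimal end-to-end sketch of that fix (assuming index.html sits in the working directory, as in the question):
from lxml.html import fromstring, tostring
from lxml.html import builder as E

html = fromstring(open('./index.html').read())
# replace() must be called on the element's actual parent ('head'),
# not on the root 'html' node
head = html.find('head')
head.replace(html.get_element_by_id('title'), E.TITLE('TEST'))
print(tostring(html).decode())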

Related

How to parse HTML with source mapping?

I want to use Python to parse HTML markup, and given one of the resultant DOM tree elements, get the start and end offsets of that element within the original, unmodified markup.
For example, given the HTML markup (with \n EOL chars)
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
<head>
<title>No Longer Human</title>
<meta content="urn:uuid:6757faf0-eef1-45d9-b2b3-7462350db7ba" name="Adept.expected.resource"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="kindle:flow:0002?mime=text/css" rel="stylesheet" type="text/css"/>
<link href="kindle:flow:0001?mime=text/css" rel="stylesheet" type="text/css"/>
</head>
<body class="calibre" aid="0">
</body>
</html>
(example with BeautifulSoup, but I'm not attached to any parser in particular)
>>> soup = bs4.BeautifulSoup(html_markup)
>>> title_tag = soup.find('title')
>>> get_offsets_in_markup(title_tag) # <-------- how do I go about doing this?
(109, 139) # <----- source mapping info I want to get
>>> html_markup[109:139]
'<title>No Longer Human</title>'
I don't see this functionality in the APIs of any of the Python HTML parsers available. Can I hack it into one of the existing parsers? How would I go about doing that? Or is there another, better approach?
I realize that str(soup_element) serializes the element back into markup (and I can hypothetically recurse down the tree saving the start and end indices as I go), but the markup returned by doing that, although semantically equivalent to the original, doesn't match the original char-for-char. None of the available Python parsers round-trip the original exactly.
You can use a regular expression to find the corresponding element's start and end indexes, then use those indexes to slice the original string:
import re
from bs4 import BeautifulSoup
from pathlib import Path
def get_offsets_in_markup(tag, html_markup):
    # escape the serialized tag so regex metacharacters (e.g. '?') match literally
    match = re.search(re.escape(str(tag)), html_markup)
    return match.start(), match.end()
html_markup = Path('test.html').read_text()
soup = BeautifulSoup(html_markup, 'lxml')
title_tag = soup.find('title')
indexes = get_offsets_in_markup(title_tag, html_markup)
# -> (109, 139)
given_text = html_markup[indexes[0]:indexes[1]]
# -> <title>No Longer Human</title>
This is what test.html looks like:
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
<head>
<title>No Longer Human</title>
<meta content="urn:uuid:6757faf0-eef1-45d9-b2b3-7462350db7ba" name="Adept.e$
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="kindle:flow:0002?mime=text/css" rel="stylesheet" type="text/css"/>
<link href="kindle:flow:0001?mime=text/css" rel="stylesheet" type="text/css"/>
</head>
<body class="calibre" aid="0">
</body>
</html>
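If an exact string match isn't possible, another option (a rough sketch, not part of the answer above) is to pull positions out of the parse itself with the stdlib html.parser: its getpos() method reports the line and column where the current tag starts, which you can convert to absolute offsets. Reusing html_markup from above:
from html.parser import HTMLParser

class OffsetTracker(HTMLParser):
    """Record the absolute start offset of the first occurrence of each tag."""
    def __init__(self, markup):
        super().__init__()
        # absolute offset at which each line begins, so that getpos()'s
        # (line, column) pairs can be converted to string indexes
        self.line_starts = [0]
        for line in markup.splitlines(keepends=True):
            self.line_starts.append(self.line_starts[-1] + len(line))
        self.starts = {}

    def handle_starttag(self, tag, attrs):
        line, col = self.getpos()  # 1-based line, 0-based column
        self.starts.setdefault(tag, self.line_starts[line - 1] + col)

tracker = OffsetTracker(html_markup)
tracker.feed(html_markup)
tracker.starts['title']  # start offset; a matching handle_endtag would give the end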

SCRAPY - XPATH select a object inside a node

I need to get an object that is assigned to a variable inside a <script> node.
(Using Scrapy 1.8.0, haven't updated yet, hehe.)
Maybe I'm not explaining myself clearly, but as soon as you see it... you will understand.
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document</title>
<script id='myscript'>
oneVariable = {...}
theVariable = {"Data": "blahblah", "More-Data": {...}}
</script>
</head>
<body>
</body>
</html>
OK, I got the whole node with its information manually using scrapy shell and this selector:
response.xpath('//*[@id="myscript"]').get()
Can I get the "theVariable" value I want just with XPath selectors or functions (like get(), getall(), etc.)?
Thanks in advance!
Try changing your XPath expression to something like:
substring-after(//script[@id="myscript"], "theVariable = ")
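In scrapy shell that looks something like the sketch below. Two caveats: substring-after() is an XPath 1.0 string function, so .get() returns plain text rather than a node, and json.loads only succeeds if the assignment is the last statement in the script and its value is valid JSON:
import json

raw = response.xpath(
    'substring-after(//script[@id="myscript"]/text(), "theVariable = ")'
).get()
data = json.loads(raw.strip())
data["Data"]  # -> 'blahblah'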

XPath for html elements

I'd like to use Scrapy to crawl a few hundred websites and just scrape the basic (title, meta* and body) HTML elements. I know that I should use CrawlSpider for this and adjust some of the settings for broad crawls. The part that I'm having trouble figuring out is how to use XPath to create the rules for scraping just those basic HTML elements. Lots of tutorials I see involve inspecting the element and finding the CSS class for that element. That is fine for the body element, but what about the title and meta tags?
You can use both XPath and CSS selectors to select nodes in HTML.
An element is a node, but a node is not always an element.
So head, meta and body are all elements, while the class attribute on the div is the same kind of node as the charset attribute on the meta element: they are all attribute nodes.
e.g:
<!DOCTYPE html>
<html lang='zh-cn'>
<head>
<meta charset='utf-8'>
<meta http-equiv='X-UA-Compatible' content='IE=edge'>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="renderer" content="webkit">
<title>title</title>
</head>
<body>
<div>website content</div>
</body>
</html>
If you want to select
<meta http-equiv='X-UA-Compatible' content='IE=edge'>
you can use XPath like this:
//head/meta[@http-equiv="X-UA-Compatible"]
You can search for elements in <head> the same way you find them in <body>, for example:
//html/head/title
or
//html/head/meta
Well, for the title node you can write a simple XPath expression: //title, which is the abbreviated syntax of /descendant-or-self::node()/child::title, and that's it.
For the meta node, guess what, you can just write //meta too, or if you want you can use the absolute path /html/head/meta.
PS. You can do the same thing for the body node.
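Putting that together, a minimal sketch of a Scrapy spider callback that scrapes just those basic elements (the spider and field names here are illustrative, not from the question):
import scrapy

class BasicsSpider(scrapy.Spider):
    name = 'basics'  # illustrative name

    def parse(self, response):
        yield {
            'title': response.xpath('//title/text()').get(),
            'metas': response.xpath('//head/meta/@content').getall(),
            'body': ' '.join(response.xpath('//body//text()').getall()),
        }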

BeautifulSoup - proper way of dealing with self-closing tags

I have an html file with some self-closing tags, but BeautifulSoup doesn't like them.
from bs4 import BeautifulSoup
html = '<head><meta content="text/html" http-equiv="Content-Type"><meta charset="utf-8"></head>'
doc = BeautifulSoup(html, 'html.parser')
print(doc.prettify())
prints
<head>
<meta content="text/html" http-equiv="Content-Type">
<meta charset="utf-8"/>
</meta>
</head>
Must I manually check if each tag is self-closing and modify appropriately, or is there a better way of handling this?
As you may already know, you can specify different parsers that BeautifulSoup would use internally. And, as noted in BeautifulSoup docs:
There are also differences between HTML parsers. If you give Beautiful
Soup a perfectly-formed HTML document, these differences won’t matter.
One parser will be faster than another, but they’ll all give you a
data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will
give different results.
In this particular case, both lxml and html5lib produce two separate meta tags:
In [4]: doc = BeautifulSoup(html, 'lxml')
In [5]: print(doc.prettify())
<html>
<head>
<meta content="text/html" http-equiv="Content-Type"/>
<meta charset="utf-8"/>
</head>
</html>
In [6]: doc = BeautifulSoup(html, 'html5lib')
In [7]: print(doc.prettify())
<html>
<head>
<meta content="text/html" http-equiv="Content-Type"/>
<meta charset="utf-8"/>
</head>
<body>
</body>
</html>
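Note that html.parser is the only parser bundled with Python; lxml and html5lib each need a one-time pip install, and BeautifulSoup raises FeatureNotFound if you request a parser that isn't installed. A small sketch of a graceful fallback:
from bs4 import BeautifulSoup, FeatureNotFound

html = '<head><meta content="text/html" http-equiv="Content-Type"><meta charset="utf-8"></head>'
try:
    doc = BeautifulSoup(html, 'lxml')          # pip install lxml
except FeatureNotFound:
    doc = BeautifulSoup(html, 'html.parser')   # stdlib fallback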

Get document DOCTYPE with BeautifulSoup

I've just started tinkering with Scrapy in conjunction with BeautifulSoup, and I'm wondering if I'm missing something very obvious: I can't seem to figure out how to get the doctype of a returned HTML document from the resulting soup object.
Given the following html:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta charset=utf-8 />
<meta name="viewport" content="width=620" />
<title>HTML5 Demos and Examples</title>
<link rel="stylesheet" href="/css/html5demos.css" type="text/css" />
<script src="js/h5utils.js"></script>
</head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Can anyone tell me if there's a way of extracting the declared doctype from it using BeautifulSoup?
Beautiful Soup 4 has a class for DOCTYPE declarations, so you can use that to extract all the declarations at top level (though you're no doubt expecting one or none!):
def doctype(soup):
    items = [item for item in soup.contents if isinstance(item, bs4.Doctype)]
    return items[0] if items else None
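A quick usage sketch, assuming the question's markup is saved as demo.html (a hypothetical filename):
import bs4

soup = bs4.BeautifulSoup(open('demo.html').read(), 'html.parser')
doctype(soup)
# -> 'HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"'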
You can go through top-level elements and check each to see whether it is a declaration. Then you can inspect it to find out what kind of declaration it is:
for child in soup.contents:
    if isinstance(child, BS.Declaration):
        declaration_type = child.string.split()[0]
        if declaration_type.upper() == 'DOCTYPE':
            declaration = child
You could just fetch the first item in soup.contents:
>>> soup.contents[0]
u'DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"'
