Showing the full path after parsing with lxml and XPath [duplicate]

This question already has answers here:
How to get path of an element in lxml?
(4 answers)
Closed 4 years ago.
Is there a way to show:
(a) the full path to a located node?
(b) show the attributes of the path nodes even if I don't know what those attributes might be called?
For example, given a page:
<!DOCTYPE html>
<HTML lang="en">
<HEAD>
<META name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.2.0">
<META charset="utf-8">
<TITLE>blah ombid lipsum</TITLE>
</HEAD>
<BODY>
<P>I'm the expected content</P>
<DIV unexpectedattribute="very unexpected">
<P>I'm wanted but not where you thought I'd be</P>
<P class="strangeParagraphType">I'm also wanted text but also mislocated</P>
</DIV>
</BODY>
</HTML>
I can find wanted text with
from lxml import html

with open('findme.html') as f:
    page = f.read()
tree = html.fromstring(page)
wantedText = tree.xpath('//*[contains(text(), "wanted text")]')
print(len(wantedText), 'item(s) of wanted text found')
Having found it, however, I'd like to be able to print out the fact that the wanted text is located at:
/HTML/BODY/DIV/P ... or, even better, to show that it is located at /HTML/BODY/DIV/P[2]
... and much better, to show that it is located at that location with /DIV having unexpectedattribute="very unexpected" and the final <P> having the class of strangeParagraphType.

You could use something like this for the first case:
paths = [
    '/' + '/'.join(
        [a.tag for a in wt.iterancestors()][::-1] + [wt.tag]
    ).upper()
    for wt in wantedText
]
The third one can be built from the attrib property on the element objects plus some custom logic:
wantedText[0].getparent().attrib
>>> {'unexpectedattribute': 'very unexpected'}
wantedText[0].attrib
>>> {'class': 'strangeParagraphType'}
Edit: Duplicate answer link up top is definitely a better way to go.
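For the indexed-path form, lxml's ElementTree.getpath() does this directly (paths come back lowercase because lxml's HTML parser lowercases tag names), and attrib on each match shows whatever attributes happen to be there:

```python
from lxml import html

page = """<html lang="en">
<body>
<p>I'm the expected content</p>
<div unexpectedattribute="very unexpected">
<p>I'm wanted but not where you thought I'd be</p>
<p class="strangeParagraphType">I'm also wanted text but also mislocated</p>
</div>
</body>
</html>"""

tree = html.fromstring(page)
root = tree.getroottree()
for wt in tree.xpath('//*[contains(text(), "wanted")]'):
    # getpath() returns an indexed path such as /html/body/div/p[2];
    # attrib exposes the attributes without knowing their names up front.
    print(root.getpath(wt), dict(wt.attrib))
```

Walking wt.iterancestors() and printing each ancestor's attrib the same way covers the "unexpected attribute on the DIV" part.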

Related

How to extract text from an iframe in Playwright Python? [duplicate]

This question already has answers here:
In Playwright for Python, how do I retrieve a handle for elements from within an frame (iframe)?
(3 answers)
Closed yesterday.
I need to extract the text from inside the second <!DOCTYPE html> document (the iframe) using Playwright (preferably) or another Python-only tool. Can anyone help?
<!DOCTYPE html>
<html data-lang-tag="" lang="ru">
<body class="webp">
  <div id="app">
    <div id="main-container" class>
      <main class="content-wrapper">
        <main class="main">
          <div data-v-c735354e class="....">
            <iframe>
              #document
              <!DOCTYPE html>
              <html data-lang-tag="" lang="ru">
                <!-- HERE is what I need - text -->
              </html>
            </iframe>
          </div>
        </main>
      </main>
    </div>
  </div>
</body>
</html>
This can't be done with conventional locators, and when I call frame.content(), the second doctype doesn't appear in the output. Has anyone faced a task like this?
You need to search through a frame locator, frame_locator(selector); in my example it looks like this:
page.frame_locator("iframe").locator(".sc-iAEawV > .sc-gikAfH > .sc-iJnaPW")

Bottle: write to the page content from a Python code block

Is it possible to write to the page content from within a Python code block in a Bottle SimpleTemplate file?
e.g.:
<!DOCTYPE html>
<html lang="en">
<body>
<%
# A block of Python code
basket = [1, 2, 3]
print("<ul>")  # this prints on the server console, not the rendered page
for item in basket:
    print("<li>" + str(item) + "</li>")
print("</ul>")
%>
</body>
</html>
I know I could use the dedicated loop syntax in this case, but I'd like to know an alternative for more complex cases:
<ul>
% for item in basket:
<li>{{item}}</li>
% end
</ul>
<!DOCTYPE html>
<html lang="en">
% result = ''
<body>
<%
# A block of Python code
basket = [1, 2, 3]
result += "<ul>"
for item in basket:
    result += "<li>" + str(item) + "</li>"
result += "</ul>"
%>
{{result}}
</body>
</html>
No, I'm afraid it's not directly possible. And even if it were, you probably shouldn't do it.
The cleanest designs keep their logic in the server code; templates are merely for presentation.
For example, this templating tutorial says:
In general we'd like to separate the HTML content from the program logic.
You said:
I know i could use the specific syntax for loops in this case, but i'd like to know an alternative for use in more complex cases
I'm curious: why are you looking for an alternative? You allude to "more complex cases," but your example is just a loop. If you show us an example which you think can't be handled cleanly the standard way, perhaps we can help you with that specific case.

Insert multiple lines of hyperlinks in HTML by Python

To display multiple lines in an HTML body, I use this simple code:
from string import Template

websites = ["https://www.reddit.com/", "https://en.wikipedia.org/", "https://www.facebook.com/"]
html = """
<!DOCTYPE html>
<html>
<body>
<h1>Hi, friend</h1>
<p>$websites!</p>
</body>
</html>
"""
html = Template(html).safe_substitute(websites="<p>".join(websites))
Now I want to change the links to hyperlinks with friendly names.
names = ["News", "Info", "Media"]
Changed the line to:
<p><a href=$websites>$names</a></p>
and:
html = Template(html).safe_substitute(websites="<p>".join(websites),
                                      names="<p>".join(names))
What I want in the html to show is:
News
Info
Media
But it doesn't show properly.
What's the right way to do that? Thank you.
Don't do '<p>'.join(websites). That builds a single string by joining all the elements of the list with '<p>' between them, giving you https://www.reddit.com/<p>https://en.wikipedia.org/<p>https://www.facebook.com/, which is not what you want (and I don't think it's valid HTML either).
You also don't have any <a> link tags, so you need to create those. The href points to the website, and the text inside the <a> tag is the name you want displayed:
<a href="{link}">{link_name}</a>
This is what you want to do:
from string import Template

websites = ["https://www.reddit.com/", "https://en.wikipedia.org/", "https://www.facebook.com/"]
tag_names = ['News', 'Info', 'Media']

html = """
<!DOCTYPE html>
<html>
<body>
<p>$websites</p>
</body>
</html>
"""

a_links = '<br/>'.join(f'<a href="{link}">{link_name}</a>'
                       for link, link_name in zip(websites, tag_names))
html = Template(html).safe_substitute(websites=a_links)

Getting Xpath from plain text

I'm trying to run an XPath query on text instead of a URL, but I keep getting the error "AttributeError: 'HtmlElement' object has no attribute 'XPath'".
See the code below.
from lxml import html
var ='''<html lang="en">
<head>
<title>Selecting content on a web page with XPath</title>
</head>
<body>
This is the body
</body>
</html>
'''
tree = html.fromstring(var)
body = tree.XPath('//*/body')
print(body)
It has been 15 years since I last used Python, but as far as I can tell, it is a case-sensitive language, and the xpath method is all lowercase.
So try this:
body = tree.xpath('//*/body')
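The corrected snippet in full:

```python
from lxml import html

var = '''<html lang="en">
<head>
<title>Selecting content on a web page with XPath</title>
</head>
<body>
This is the body
</body>
</html>
'''

tree = html.fromstring(var)
body = tree.xpath('//*/body')  # lowercase .xpath(), not .XPath()
print(body)
```

xpath() returns a list of matching elements, so index into it (or loop) to get at the individual nodes.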

How to prevent lxml from adding a default doctype

lxml seems to add a default doctype when one is missing in the html document.
See this demo code:
import lxml.etree
import lxml.html

def beautify(html):
    parser = lxml.etree.HTMLParser(
        strip_cdata=True,
        remove_blank_text=True
    )
    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo
    return lxml.etree.tostring(
        d,
        pretty_print=True,
        doctype=docinfo.doctype,
        encoding='utf8'
    )
with_doctype = """
<!DOCTYPE html>
<html>
<head>
<title>With Doctype</title>
</head>
</html>
"""
# This passes! (beautify returns bytes, so compare against a bytes literal)
assert b"DOCTYPE" in beautify(with_doctype)
no_doctype = """<html>
<head>
<title>No Doctype</title>
</head>
</html>"""
# This fails!
assert b"DOCTYPE" not in beautify(no_doctype)
# because the returned html contains this line:
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# which was not present in the source
How can I tell lxml to not do this?
This issue was originally raised here:
https://github.com/mitmproxy/mitmproxy/issues/845
Quoting a comment on reddit as it might be helpful:
lxml is based on libxml2, which does this by default unless you pass the option HTML_PARSE_NODEFDTD, I believe. Code here.
I don't know if you can tell lxml to pass that option though.. libxml has python bindings that you could perhaps use directly but they seem really hairy.
EDIT: I did some more digging, and that option does appear in the lxml source here. It does exactly what you want, but I'm not sure how to activate it yet, if it's even possible.
There is currently no way to do this in lxml, but I've created a Pull Request on lxml which adds a default_doctype boolean to the HTMLParser.
Once the code gets merged in, the parser needs to be created like so:
parser = lxml.etree.HTMLParser(
    strip_cdata=True,
    remove_blank_text=True,
    default_doctype=False,
)
Everything else stays the same.
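The pull request has since been merged, so modern lxml releases accept default_doctype=False on HTMLParser. A sketch of the fixed flow:

```python
import lxml.etree
import lxml.html

no_doctype = "<html><head><title>No Doctype</title></head></html>"

# default_doctype=False stops libxml2 from injecting the HTML 4.0
# Transitional doctype into documents that had none.
parser = lxml.etree.HTMLParser(default_doctype=False)
d = lxml.html.fromstring(no_doctype, parser=parser)
docinfo = d.getroottree().docinfo

# docinfo.doctype is empty when the source had no doctype; pass None
# so tostring() doesn't emit one.
out = lxml.etree.tostring(d, pretty_print=True,
                          doctype=docinfo.doctype or None,
                          encoding='utf8')
print(out)
```

Documents that do carry a doctype are unaffected: docinfo.doctype still reflects whatever the source declared.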
