How to extract text from an iframe in Playwright Python? [duplicate] - python

This question already has answers here:
In Playwright for Python, how do I retrieve a handle for elements from within an frame (iframe)?
(3 answers)
Closed yesterday.
I need to extract the text from the second document (the one that starts with the second doctype, nested inside the iframe) using Playwright, or another tool if necessary. Python only.
Can anyone help?
<!DOCTYPE html>
<html data-lang-tag="" lang="ru">
<head>
<body class='webp'>
<div id="app">
<div id="main-container" class>
<main class="content-wrapper">
<main class="main">
<div data-v-c735354e class="....">
#document
<!DOCTYPE html>
<html data-lang-tag="" lang="ru">
HERE is what I need - text
</iframe>
</div>
</....>
</....>
Conventional locators cannot reach it. Has anyone faced a similar task?
When I search the output of frame.content(),
the nested document doesn't seem to exist...

You need to go through a frame locator first:
page.frame_locator(selector)
In my case it looks like this:
page.frame_locator("iframe").locator(".sc-iAEawV > .sc-gikAfH > .sc-iJnaPW")
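A minimal sketch of the same idea as a complete script, assuming the page contains a single iframe and that the sync API is in use. The helper names `iframe_text` and `print_iframe_text`, the URL passed in, and the `"body"` selector are illustrative assumptions, not part of the original question.

```python
def iframe_text(page, frame_selector, inner_selector):
    """Return the rendered text of the first matching element inside an iframe."""
    # frame_locator() scopes every following locator to the iframe's document.
    frame = page.frame_locator(frame_selector)
    return frame.locator(inner_selector).first.inner_text()


def print_iframe_text(url):
    """Open `url` (an assumed address) and print the text of the iframe body."""
    # Imported here so the helper above stays usable without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        print(iframe_text(page, "iframe", "body"))
        browser.close()
```

Calling `print_iframe_text(...)` with the real page address should print the text of the nested document; swap `"body"` for a narrower selector such as the one in the answer above if only one element is wanted.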

Related

<!DOCTYPE html> missing in Selenium Python page_source

I'm using Selenium for functional testing of a Django application and thought I'd try html5lib as a way of validating the html output. One of the validations is that the page starts with a <!DOCTYPE ...> tag.
The unit test checks with response.content.decode() all worked fine, correctly flagging errors, but I found that Selenium driver.page_source output starts with an html tag. I have double-checked that I'm using the correct template by modifying the title and making sure that the change is reflected in the page_source. There is also a missing newline and indentation between the <html> tag and the <title> tag.
This is what the first few lines look like in the Firefox browser.
<!DOCTYPE html>
<html>
<head>
<title>NetLog</title>
</head>
Here's the Python code.
self.driver.get(f"{self.live_server_url}/netlog/")
print(self.driver.page_source)
And here are the first few lines of the output when run under the Firefox web driver.
<html><head>
<title>NetLog</title>
</head>
The page body looks fine, while the missing newline is also present between </body> and </html>. Is this expected behaviour? I suppose I could just stuff the DOCTYPE tag in front of the string as a workaround but would prefer to have it behave as intended.
Chris
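For context, the WebDriver "Get Page Source" command returns a serialization of the current DOM rather than the bytes the server sent, so whitespace and the doctype need not survive. If the workaround mentioned in the question is acceptable, a minimal sketch could look like the following; `ensure_doctype` is a hypothetical helper name, and the default doctype is an assumption about what the template emits.

```python
def ensure_doctype(page_source, doctype="<!DOCTYPE html>"):
    """Prepend a doctype to serialized DOM output if it does not already have one."""
    # Compare case-insensitively, since "<!doctype html>" is equally valid.
    if page_source.lstrip().lower().startswith("<!doctype"):
        return page_source
    return f"{doctype}\n{page_source}"
```

In the test this would wrap the driver output, for example `ensure_doctype(self.driver.page_source)`. Note that it only restores the doctype for a validator; it does not prove the server actually sent one, which the `response.content` check already covers.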

bottle write to page content from python code block

Is it possible to write into the page content from within a Python code block in a Bottle SimpleTemplate file?
e.g.:
<!DOCTYPE html>
<html lang="en">
<body>
<%
# A block of python code
basket = [1,2,3]
print("<ul>") # this prints on the server console, not the rendered page
for item in basket:
print("<li>" + str(item) + "</li>")
print("</ul>")
%>
</body>
</html>
I know I could use the specific syntax for loops in this case, but I'd like to know an alternative for use in more complex cases:
<ul>
% for item in basket:
<li>{{item}}</li>
% end
</ul>
<!DOCTYPE html>
<html lang="en">
% result = ''
<body>
<%
# A block of python code
basket = [1,2,3]
result += "<ul>" # build up a string instead of printing
for item in basket:
    result += "<li>" + str(item) + "</li>"
end
result += "</ul>"
# the "!" in the expression below disables HTML escaping
%>
{{!result}}
</body>
</html>
No, I'm afraid it's not directly possible. And even if it were, you probably shouldn't do it.
The cleanest designs keep their logic in the server code; templates are merely for presentation.
For example, this templating tutorial says:
In general we'd like to separate the HTML content from the program logic.
You said:
I know I could use the specific syntax for loops in this case, but I'd like to know an alternative for use in more complex cases
I'm curious: why are you looking for an alternative? You allude to "more complex cases," but your example is just a loop. If you show us an example which you think can't be handled cleanly the standard way, perhaps we can help you with that specific case.
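To illustrate the "logic in the server code" advice, here is a minimal sketch in which the markup is built in a plain Python function and the template only places it. The names `render_basket` and `basket_page` are hypothetical, and the inline template string is an assumption made for brevity.

```python
from html import escape


def render_basket(items):
    """Build the <ul> markup in server code, escaping each item."""
    rows = "".join(f"<li>{escape(str(item))}</li>" for item in items)
    return f"<ul>{rows}</ul>"


def basket_page(items):
    """Hand the finished markup to a Bottle template for presentation."""
    # Imported lazily so render_basket() can be used and tested without Bottle.
    from bottle import template

    # "{{!markup}}" inserts the string without HTML escaping.
    return template("<body>{{!markup}}</body>", markup=render_basket(items))
```

Because `render_basket` is ordinary Python, arbitrarily complex cases can be handled and unit tested there, while the template stays a thin presentation layer.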

Getting Xpath from plain text

I'm trying to evaluate an XPath against HTML held in a string instead of a URL, but I keep getting the error "AttributeError: 'HtmlElement' object has no attribute 'XPath'".
See the code below.
from lxml import html
var ='''<html lang="en">
<head>
<title>Selecting content on a web page with XPath</title>
</head>
<body>
This is the body
</body>
</html>
'''
tree = html.fromstring(var)
body = tree.XPath('//*/body')
print(body)
It has been 15 years since I last used Python, but as far as I can tell, it is a case-sensitive language, and the xpath method is all lowercase.
So try this:
body = tree.xpath('//*/body')
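For completeness, a short runnable version of the corrected call, assuming lxml is installed. The variable names are illustrative; the HTML is the string from the question.

```python
from lxml import html

SOURCE = """<html lang="en">
<head>
<title>Selecting content on a web page with XPath</title>
</head>
<body>
This is the body
</body>
</html>"""

tree = html.fromstring(SOURCE)

# xpath() is all lowercase and returns a list of matching elements.
bodies = tree.xpath("//body")
body_text = bodies[0].text_content().strip()

# Wrapping the expression in string() returns text instead of elements.
title = tree.xpath("string(//title)")

print(body_text)
print(title)
```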

Showing the full path after parsing with LXML and XPATH [duplicate]

This question already has answers here:
How to get path of an element in lxml?
(4 answers)
Closed 4 years ago.
Is there a way to show:
(a) the full path to a located node?
(b) show the attributes of the path nodes even if I don't know what those attributes might be called?
For example, given a page:
<!DOCTYPE html>
<HTML lang="en">
<HEAD>
<META name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.2.0">
<META charset="utf-8">
<TITLE>blah ombid lipsum</TITLE>
</HEAD>
<BODY>
<P>I'm the expected content</P>
<DIV unexpectedattribute="very unexpected">
<P>I'm wanted but not where you thought I'd be</P>
<P class="strangeParagraphType">I'm also wanted text but also mislocated</P>
</DIV>
</BODY>
</HTML>
I can find wanted text with
# Import Python libraries
import sys
from lxml import html
page = open( 'findme.html' ).read()
tree = html.fromstring(page)
wantedText = tree.xpath(
'//*[contains(text(),"wanted text")]' )
print( len( wantedText ), ' item(s) of wanted text found')
Having found it, however, I'd like to be able to print out the fact that the wanted text is located at:
/HTML/BODY/DIV/P ... or, even better, to show that it is located at /HTML/BODY/DIV/P[2]
... and much better, to show that it is located at that location with /DIV having unexpectedattribute="very unexpected" and the final <P> having the class of strangeParagraphType.
You could use something like this for the first example you have:
['/'.join(list([wt.tag] + [ancestor.tag for ancestor in wt.iterancestors()])[::-1]).upper() for wt in wantedText]
The third one can be built from the attrib property of the element objects plus some custom logic:
>>> wantedText[0].getparent().attrib
{'unexpectedattribute': 'very unexpected'}
>>> wantedText[0].attrib
{'class': 'strangeParagraphType'}
Edit: Duplicate answer link up top is definitely a better way to go.
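A sketch that covers both the indexed path and the attribute-annotated path, assuming lxml is installed. `getpath` is lxml's own method and is probably what the linked duplicate recommends; `describe_path` is a hypothetical helper written for this example, and the page is a trimmed copy of the one in the question. lxml's HTML parser lowercases tag names, so the paths come out in lowercase.

```python
from lxml import html

PAGE = """<!DOCTYPE html>
<html>
<body>
<p>I'm the expected content</p>
<div unexpectedattribute="very unexpected">
<p>I'm wanted but not where you thought I'd be</p>
<p class="strangeParagraphType">I'm also wanted text but also mislocated</p>
</div>
</body>
</html>"""


def describe_path(element):
    """Hypothetical helper: root-to-element path with every attribute spelled out."""
    steps = []
    # iterancestors() walks upwards, so reverse it to start at the root.
    for node in list(element.iterancestors())[::-1] + [element]:
        attrs = "".join(f'[@{name}="{value}"]' for name, value in node.attrib.items())
        steps.append(node.tag + attrs)
    return "/" + "/".join(steps)


tree = html.fromstring(PAGE)
wanted = tree.xpath('//*[contains(text(), "wanted text")]')

# getpath() yields a structural XPath with positional indexes where needed.
plain_paths = [el.getroottree().getpath(el) for el in wanted]
detailed_paths = [describe_path(el) for el in wanted]

print(plain_paths)
print(detailed_paths)
```

The attribute names never have to be known in advance, because `node.attrib` is iterated as a whole.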

Only Firefox displays HTML Code and not the page

I have this complicated problem that I can't find an answer to.
I have a Python HTTPServer running that serves webpages. These webpages are created at runtime with the help of Beautiful Soup. The problem is that Firefox shows the HTML code for the webpage and not the actual page. I really don't know which of these is causing the problem:
- Python HTTPServer
- Beautiful Soup
- HTML Code
In any case, here are parts of the webpage HTML:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>
My title
</title>
<link href="style.css" rel="stylesheet" type="text/css" />
<script src="./123_ui.js">
</script>
</head>
<body>
<div>
Hellos
</div>
</body>
</html>
Just to help you, here are the things that I have already tried:
- I have made sure that the Python HTTPServer is sending the MIME header as text/html
- Copying the HTML code into a static file and opening it shows the correct page, so I can tell that the problem is on the HTTPServer side
- The Firebug shows that is empty and "This element has no style rules. You can create a rule for it." is displayed
I just want to know if the error is in Beautiful Soup or HTTPServer or HTML?
Thanks,
Amit
Why are you adding this at the top of the document?
<?xml version="1.0" encoding="iso-8859-1"?>
That will make the browser think the entire document is XML and not XHTML. Removing that line should make it render correctly. I assume Firefox is displaying a page with a bunch of elements which you can expand/collapse to see the content like it normally would for an XML document, even though the HTTP headers might say it's text/html.
So guys,
I have finally solved this problem. The cause was that I wasn't sending the MIME header with content type "text/html" (even though I thought I was).
In Python's HTTPServer, you always send the status line and the headers before writing anything to the response:
self.send_response(200)
self.send_header("Content-Type", "text/html")
self.end_headers()
# Once you have called the above methods, you can send the HTML to the client
self.wfile.write(b'ANY HTML CODE YOU WANT TO WRITE')
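A self-contained sketch of a handler that sends the Content-Type header before the body, using the Python 3 standard library (`http.server`). The names `HtmlHandler`, `PAGE`, and `demo` are illustrative assumptions; `demo` only exists to show the header arriving at a client.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<!DOCTYPE html><html><body><h1>Hello</h1></body></html>"


class HtmlHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Status line and headers must be sent before any body bytes.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)


def demo():
    """Serve one request on a free local port and return (content_type, body)."""
    server = HTTPServer(("127.0.0.1", 0), HtmlHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", "/")
        response = conn.getresponse()
        result = response.getheader("Content-Type"), response.read()
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

Checking the header from a client like this (or in the browser's network panel) is a quick way to confirm that the server really sends what it is believed to send.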
