I'm trying to run XPath against text instead of a URL, but I keep getting the error "AttributeError: 'HtmlElement' object has no attribute 'XPath'".
See the code below.
from lxml import html
var ='''<html lang="en">
<head>
<title>Selecting content on a web page with XPath</title>
</head>
<body>
This is the body
</body>
</html>
'''
tree = html.fromstring(var)
body = tree.XPath('//*/body')
print(body)
It has been 15 years since I last used Python, but as far as I can tell, it is a case-sensitive language, and the xpath method is all lowercase.
So try this:
body = tree.xpath('//*/body')
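For reference, here is the full corrected snippet; it is the asker's code with only the method name changed:

```python
from lxml import html

var = '''<html lang="en">
<head>
<title>Selecting content on a web page with XPath</title>
</head>
<body>
This is the body
</body>
</html>
'''

tree = html.fromstring(var)
body = tree.xpath('//*/body')  # lowercase .xpath, not .XPath
print(body)  # a list containing the single <body> element
```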
I'm using Selenium for functional testing of a Django application and thought I'd try html5lib as a way of validating the html output. One of the validations is that the page starts with a <!DOCTYPE ...> tag.
The unit-test checks using response.content.decode() all worked fine, correctly flagging validation errors, but I found that Selenium's driver.page_source output starts with an <html> tag instead. I have double-checked that I'm using the correct template by modifying the title and confirming that the change is reflected in page_source. There is also a missing newline and indentation between the <html> tag and the <title> tag.
This is what the first few lines look like in the Firefox browser.
<!DOCTYPE html>
<html>
<head>
<title>NetLog</title>
</head>
Here's the Python code.
self.driver.get(f"{self.live_server_url}/netlog/")
print(self.driver.page_source)
And here are the first few lines of the output when run under the Firefox web driver.
<html><head>
<title>NetLog</title>
</head>
The page body looks fine, and the same missing newline also appears between </body> and </html>. Is this expected behaviour? I suppose I could just stuff the DOCTYPE tag in front of the string as a workaround, but I would prefer it to behave as intended.
Chris
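The workaround the asker mentions, prepending the DOCTYPE before validating, could be sketched like this; page_source here is a hypothetical stand-in for driver.page_source:

```python
# Hypothetical stand-in for driver.page_source, which lacks the DOCTYPE.
page_source = "<html><head><title>NetLog</title></head><body></body></html>"

# Prepend the declaration only if it is actually missing.
if not page_source.lstrip().lower().startswith("<!doctype"):
    page_source = "<!DOCTYPE html>\n" + page_source
```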
To display multiple lines in an HTML body, I use this simple code:
websites = ["https://www.reddit.com/","https://en.wikipedia.org/","https://www.facebook.com/"]
html = """
<!DOCTYPE html>
<html>
<body>
<h1>Hi, friend</h1>
<p>$websites!</p>
</body>
</html>
"""
html = Template(html).safe_substitute(websites = "<p>".join(websites))
Now I want to change the links to hyperlinks with friendly names.
names = ["News", "Info", "Media"]
Changed the line to:
<p><a href=$websites>$names</a></p>
and:
html = Template(html).safe_substitute(websites="<p>".join(websites),
                                      names="<p>".join(names))
What I want in the html to show is:
News
Info
Media
But it doesn't show properly.
What's the right way to do that? Thank you.
Don't do '<p>'.join(websites). That creates a single string from the list, with '<p>' stuck between the elements.
So it gives you "https://www.reddit.com/<p>https://en.wikipedia.org/<p>https://www.facebook.com/", which is not what you want (and I don't think it's valid HTML either).
You also don't have any <a> link tags, so you need to create those.
The href will point to the website, and inside the <a> tag you put the name you want to appear:
<a href={link}>{link_name}</a>
This is what you want to do:
from string import Template

websites = ["https://www.reddit.com/", "https://en.wikipedia.org/", "https://www.facebook.com/"]
html = """
<!DOCTYPE html>
<html>
<body>
<p>$websites</p>
</body>
</html>
"""
tag_names = ['News', 'Info', 'Media']
a_links = '<br/>'.join(f'<a href={link}>{link_name}</a>'
                       for link, link_name in zip(websites, tag_names))
html = Template(html).safe_substitute(websites=a_links)
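A quick check of the substituted result (a trimmed-down version of the snippet above, assuming string.Template):

```python
from string import Template

websites = ["https://www.reddit.com/", "https://en.wikipedia.org/"]
tag_names = ["News", "Info"]

# Build one <a> tag per site, separated by line breaks.
a_links = '<br/>'.join(f'<a href={link}>{link_name}</a>'
                       for link, link_name in zip(websites, tag_names))
result = Template("<p>$websites</p>").safe_substitute(websites=a_links)
print(result)
# <p><a href=https://www.reddit.com/>News</a><br/><a href=https://en.wikipedia.org/>Info</a></p>
```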
I am trying to send an email using SendGrid. For this I created an HTML file, and I want to format some variables into it.
Very basic test.html example:
<html>
<head>
</head>
<body>
Hello World, {name}!
</body>
</html>
Now in my Python code I am trying to do something like this:
html = open("test.html", "r")
text = html.read()
msg = MIMEText(text, "html")
msg['name'] = 'Boris'
and then proceed to send the email
Sadly, this does not seem to be working. Any way to make this work?
There are a few ways to approach this, depending on how dynamic it must be and how many elements you need to insert. If it is one single value, name, then @furas is correct and you can simply put:
html = open("test.html", "r")
text = html.read().format(name="skeletor")
print(text)
And get:
<html>
<head>
</head>
<body>
Hello World, skeletor!
</body>
</html>
Alternatively you can use Jinja2 templates.
import jinja2
html = open("test.html", "r")
text = html.read()
t = jinja2.Template(text)
print(t.render(name="Skeletor"))
Helpful links: Jinja website
Real Python Primer on Jinja
Python Programming Jinja
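One caveat with the str.format approach, worth noting if the template contains CSS or JavaScript: literal braces must be doubled ({{ and }}), or format will treat them as placeholders and fail. A minimal sketch:

```python
# Literal CSS braces are doubled so that only {name} is a placeholder.
template = "<style>body {{ color: red }}</style><p>Hello World, {name}!</p>"
print(template.format(name="Boris"))
# <style>body { color: red }</style><p>Hello World, Boris!</p>
```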
I am currently trying to scrape this Amazon page "https://www.amazon.com/b/?ie=UTF8&node=11552285011&ref_=sv_kstore_5" with the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.amazon.com/b/?ie=UTF8&node=11552285011&ref_=sv_kstore_5'
r = requests.get(url)
soup = BeautifulSoup(r.content)
print(soup.prettify)
However, when I run it, instead of getting the simple HTML source code I get a bunch of lines that don't make much sense to me, starting like this:
<bound method Tag.prettify of <!DOCTYPE html>
<html class="a-no-js" data-19ax5a9jf="dingo"><head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/><!-- emit CSM JS -->
<style>
[class*=scx-line-clamp-]{overflow:hidden}.scx-offscreen-truncate{position:relative;left:-1000000px}.scx-line-clamp-1{max-height:16.75px}.scx-truncate-medium.scx-line-clamp-1{max-height:20.34px}.scx-truncate-small.scx-line-clamp-1{max-height:13px}.scx-line-clamp-2{max-height:35.5px}.scx-truncate-medium.scx-line-clamp-2{max-height:41.67px}.scx-truncate-small.scx-line-clamp-2{max-height:28px}.scx-line-clamp-3{max-height:54.25px}.scx-truncate-medium.scx-line-clamp-3{max-height:63.01px}.scx-truncate-small.scx-line-clamp-3{max-height:43px}.scx-line-clamp-4{max-height:73px}.scx-truncate-medium.scx-line-clamp-4{max-height:84.34px}.scx-truncate-small.scx-line-clamp-4{max-height:58px}.scx-line-clamp-5{max-height:91.75px}.scx-truncate-medium.scx-line-clamp-5{max-height:105.68px}.scx-truncate-small.scx-line-clamp-5{max-height:73px}.scx-line-clamp-6{max-height:110.5px}.scx-truncate-medium.scx-line-clamp-6{max-height:127.01
And even when I scroll down, there is nothing that really resembles structured HTML code with all the info I need. What am I doing wrong? (I am a beginner, so it could be anything really.) Thank you very much!
print(soup.prettify)
prints the repr of the bound method soup.prettify rather than calling it, which is why the output starts with
<bound method Tag.prettify of <!DOCTYPE html><html class="a-no-js" data-19ax5a9jf="dingo"><head>...
You need to call the prettify method instead:
print(soup.prettify())
The output:
<html class="a-no-js" data-19ax5a9jf="dingo">
<head>
<script>
var aPageStart = (new Date()).getTime();
</script>
<meta charset="utf-8"/>
<!-- emit CSM JS -->
<style>
...
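The difference is easy to reproduce offline; here a tiny document stands in for the Amazon page, so no request is needed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")

print(soup.prettify)    # the bound method object itself
print(soup.prettify())  # the indented markup
```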
When parsing an email sent from MS Outlook, I want to be able to strip the annoying Microsoft XML tags it adds; one such example is the o:p tag. But when I use Python's BeautifulSoup to parse the email as HTML, it can't seem to find these specialty tags.
For example:
from bs4 import BeautifulSoup
textToParse = """
<html>
<head>
<title>Something to parse</title>
</head>
<body>
<p><o:p>This should go</o:p>Paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(textToParse, "html5lib")
body = soup.find('body')
for otag in body.find_all('o'):
    print(otag)
for otag in body.find_all('o:p'):
    print(otag)
This outputs no text to the console, but if I switch the find_all call to search for p, it outputs the p node as expected.
How come these custom tags do not seem to work?
It's a namespace issue. Apparently, BeautifulSoup does not consider custom namespaces valid when parsing with "html5lib".
You can work around this with a regular expression (after import re), which, strangely, does work correctly:
print(soup.find_all(re.compile('o:p')))
>>> [<o:p>This should go</o:p>]
But the "proper" solution is to change the parser to "lxml-xml" and introduce o: as a valid namespace.
from bs4 import BeautifulSoup
textToParse = """
<html xmlns:o='dummy_url'>
<head>
<title>Something to parse</title>
</head>
<body>
<p><o:p>This should go</o:p>Paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(textToParse, "lxml-xml")
body = soup.find('body')

print('this should find nothing')
for otag in body.find_all('o'):
    print(otag)
print('this should find o:p')
for otag in body.find_all('o:p'):
    print(otag)
>>>
this should find nothing
this should find o:p
<o:p>This should go</o:p>
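For the asker's original goal of stripping the tags outright, here is one possible sketch; it assumes the stdlib "html.parser" backend, which keeps o:p as a literal tag name, and a hypothetical input string:

```python
import re
from bs4 import BeautifulSoup

text = "<html><body><p><o:p>This should go</o:p>Paragraph</p></body></html>"
soup = BeautifulSoup(text, "html.parser")

# Match any tag whose name starts with the Office "o:" prefix
# and unwrap it, keeping its inner text.
for otag in soup.find_all(re.compile(r'^o:')):
    otag.unwrap()

print(soup)
```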