Insert multiple lines of hyperlinks in HTML by Python - python

To display multiple lines in html body, simple codes:
websites = ["https://www.reddit.com/","https://en.wikipedia.org/","https://www.facebook.com/"]
html = """
<!DOCTYPE html>
<html>
<body>
<h1>Hi, friend</h1>
<p>$websites!</p>
</body>
</html>
"""
html = Template(html).safe_substitute(websites = "<p>".join(websites))
Now I want to change the links to hyperlinks with friendly names.
names = ["News", "Info", "Media"]
Changed the line to:
<p><a href=$websites>$names</a></p>
and:
html = Template(html).safe_substitute(websites = "<p>".join(websites),
names= "<p>".join(names))
What I want in the html to show is:
News
Info
Media
But it doesn't show properly.
What's the right way to do that? Thank you.

Don't do '<p>'.join(websites). This will create a string by joining all the elements of a list and stick the '<p>' between them.
so that gives you "https://www.reddit.com/<p>https://en.wikipedia.org/"<p>https://www.facebook.com/" which is not what you want (I don't think It's valid as well).
You don't have any <a> link tags. So you need to Create those.
The href will point to the website and inside the <a> tag you have the name you want to appear
<a href={link}>{link_name}</a>
This is what you want to do:
websites = ["https://www.reddit.com/","https://en.wikipedia.org/","https://www.facebook.com/"]
html = """
<!DOCTYPE html>
<html>
<body>
<p>$websites</p>
</body>
</html>
"""
tag_names = ['News', 'Info', 'Media']
a_links = '<br/>'.join([f'<a href={link}>{link_name}</a>' for link, link_name in zip(websites, tag_names)])
html = Template(html).safe_substitute(websites=a_links)

Related

Getting Xpath from plain text

Im trying to get xpath from text instead of a URL. But i keep getting the error "AttributeError: 'HtmlElement' object has no attribute 'XPath'"
see code below.
from lxml import html
var ='''<html lang="en">
<head>
<title>Selecting content on a web page with XPath</title>
</head>
<body>
This is the body
</body>
</html>
'''
tree = html.fromstring(var)
body = tree.XPath('//*/body')
print(body)
It has been 15 years since I last used Python, but as far as I can tell, it is a case-sensitive language, and the xpath method is all lowercase.
So try this:
body = tree.xpath('//*/body')

Insert variables into an html file

I am trying to send an email using sendgrid. For this I created an html file and I want to format some variables into it.
Very basic test.html example:
<html>
<head>
</head>
<body>
Hello World, {name}!
</body>
</html>
Now in my Python code I am trying to do something like this:
html = open("test.html", "r")
text = html.read()
msg = MIMEText(text, "html")
msg['name'] = 'Boris'
and then proceed to send the email
Sadly, this does not seem to be working. Any way to make this work?
There are a few ways to approach this depending on how dynamic this must be and how many elements you are going to need to insert. If it is one single value name then #furas is correct and you can simply put
html = open("test.html", "r")
text = html.read().format(name="skeletor")
print(text)
And get:
<html>
<head>
</head>
<body>
Hello World, skeletor!
</body>
</html>
Alternatively you can use Jinja2 templates.
import jinja2
html = open("test.html", "r")
text = html.read()
t = jinja2.Template(text)
print(t.render(name="Skeletor"))
Helpful links: Jinja website
Real Python Primer on Jinja
Python Programming Jinja

Remove outer tags from Beautiful soup output

I have an xml which is generated after below beautifulsoup statement. It generates an XML which contains html and body tags. I want to remove both html and body tags from output. Can I please know how I can achieve the same ?
Code:
soup = bs(''.join(output), "lxml")
print("soup output : {}".format(soup.html))
output:
<html>
<body>
...
</body>
</html>
try this:
body = soup.find("body")
innerbody = body.decode_contents()

Parsing MS specific html tags in BeautifulSoup

When trying to parse an email sent using MS Outlook, I want to be able to strip the annoying Microsoft XML tags that it has added. One such example is the o:p tag. When trying to use Python's BeautifulSoup to parse an email as HTML, it can't seem to find these specialty tags.
For example:
from bs4 import BeautifulSoup
textToParse = """
<html>
<head>
<title>Something to parse</title>
</head>
<body>
<p><o:p>This should go</o:p>Paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(textToParse, "html5lib")
body = soup.find('body')
for otag in body.find_all('o'):
print(otag)
for otag in body.find_all('o:p'):
print(otag)
This will output no text to the console, but if I switched the find_all call to search for p then it would output the p node as expected.
How come these custom tags do not seem to work?
It's a namespace issue. Apparently, BeautifulSoup does not consider custom namespaces valid when parsed with "html5lib".
You can work around this with a regular expression, which – strangely – does work correctly!
print (soup.find_all(re.compile('o:p')))
>>> [<o:p>This should go</o:p>]
but the "proper" solution is to change the parser to "lxml-xml" and introducing o: as a valid namespace.
from bs4 import BeautifulSoup
textToParse = """
<html xmlns:o='dummy_url'>
<head>
<title>Something to parse</title>
</head>
<body>
<p><o:p>This should go</o:p>Paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(textToParse, "lxml-xml")
body = soup.find('body')
print ('this should find nothing')
for otag in body.find_all('o'):
print(otag)
print ('this should find o:p')
for otag in body.find_all('o:p'):
print(otag)
>>>
this should find nothing
this should find o:p
<o:p>This should go</o:p>

Extracting the hyperlink from link tag using xpath

Consider the html as
<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>
I am using Lxml(python) and Xpath and trying to extract both the content of the title tag as well as the link tag.
The code is
page=urllib.urlopen(url).read()
x=etree.HTML(page)
titles=x.xpath('//item/title/text()')
links=x.xpath('//item/link/text()')
But this is returning an empty list. However, this is returning a link element.
links=x.xpath('//item/link') #returns <Element link at 0xb6b0ae0c>
Can anyone suggest how to extract the urls from the link tag?
You are using the wrong parser for the job; you don't have HTML, you have XML.
A proper HTML parser will ignore the contents of a <link> tag, because in the HTML specification that tag is always empty.
Use the etree.parse() function to parse your URL stream (no separate .read() call needed):
response = urllib.urlopen(url)
tree = etree.parse(response)
titles = tree.xpath('//item/title/text()')
links = tree.xpath('//item/link/text()')
You could also use etree.fromstring(page) but leaving the reading to the parser is easier.
By parsing content by etree, the <link> tag get closed. So no text value present for link tag
Demo:
>>> from lxml import etree
>>> content = """<item>
... <title>this is the title</title>
... <link>www.linktoawebsite.com</link>
... </item>"""
>>> x = etree.HTML(content)
>>> etree.tostring(x)
'<html><body><item>\n<title>this is the title</title>\n<link/>www.linktoawebsite.com\n</item></body></html>'
>>>
According to HTML, this is not valid tag.
I think link tag structure is like:
<head>
<link rel="stylesheet" type="text/css" href="theme.css">
</head>

Categories