Dynamically create text alternative emails [duplicate] - python

This question already has answers here:
Strip HTML from strings in Python
(28 answers)
Closed 8 years ago.
I'm using the jinja2 templating engine to create both HTML emails and their plaintext alternative that I then send out using Sendgrid. Unfortunately for my lazy self, this entails me writing and maintaining two separate templates with essentially the same content, the .html file and the .txt file. The .txt file is identical to the HTML file other than containing no HTML tags.
Is there any way to simply have the HTML template and then somehow dynamically generate the txt version, essentially just stripping the HTML tags? I know a regex could achieve this, but I also know that implementing a regex to deal with HTML tags is notoriously gotcha-ridden.

I used this trick to get text out of HTML even when the HTML is broken:
text = get_some_html()
# Python 2 only: the StringIO, htmllib and formatter modules were all removed in Python 3
import StringIO, htmllib, formatter
io = StringIO.StringIO()
# DumbWriter renders the parsed HTML as plain text into the StringIO buffer
htmllib.HTMLParser(formatter.AbstractFormatter(formatter.DumbWriter(io))).feed("<pre>" + text + "</pre>")
text = io.getvalue()
If you are sure your HTML is well-formed, you don't need those <pre> tags.
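Since htmllib and formatter are gone in Python 3, here is a rough equivalent built on the standard library's html.parser — a minimal sketch, and the TagStripper name is my own, not from any library:

```python
from html.parser import HTMLParser
from io import StringIO

class TagStripper(HTMLParser):
    """Collects only the text content of a document, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.buf = StringIO()

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves
        self.buf.write(data)

    def get_text(self):
        return self.buf.getvalue()

def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    return stripper.get_text()
```

html.parser is lenient about malformed markup, so this tolerates broken HTML much like the old htmllib trick did.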

Highlight HTML in Python String

Is there a VSCode extension that highlights HTML code within strings?
Some of my modules include a bunch of HTML responses, and I make some very simplistic mistakes at times, such as mismatched opening/closing tags, which could be helped if the entire block of text wasn't the same color.
I've found some that do this in different ways for different languages/platforms, but none for HTML in Python strings.
If your HTML is static, you could load it from files instead of writing it as string literals: create separate HTML files and read them in as strings. Your editor will then give you full HTML highlighting in those files.
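A minimal sketch of that approach (the file name and function name here are just illustrative):

```python
from pathlib import Path

def load_template(path):
    # Read the whole HTML file into a single string
    return Path(path).read_text(encoding="utf-8")

# Usage: html_body = load_template("templates/response.html")
```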

Is it possible to save an HTML page as PDF using Python? [duplicate]

This question already has answers here:
How to convert webpage into PDF by using Python
(10 answers)
Closed 4 years ago.
I'm trying to create a button which saves an HTML page in PDF format using Python. Are there any libraries which do that? If so, how would you write it up?
The HTML page I'm building contains school information such as name, url, city, state, zip, number of students, etc.
Have you tried pdfkit? It's easy to use (note that pdfkit is a Python wrapper around the wkhtmltopdf command-line tool, which has to be installed separately):
import pdfkit
pdfkit.from_file('test.html', 'out.pdf')

Matching contents of an html file with keyword python

I am making a download manager, and I want it to verify the md5 hash of a file after downloading it from a URL. The hash is published on the download page. The manager needs to compute the md5 of the downloaded file (this is done), then search the WHOLE contents of the HTML page for a matching string.
My question is: how do I make Python return the whole contents of the HTML page and find a match for my "md5 string"?
The Requests library is what you want to use here; it will save you a lot of trouble.
import urllib and use urllib.urlopen to get the contents of the HTML page (in Python 3 this is urllib.request.urlopen). import re to search for the hash with a regular expression, or use the string's find method instead of a regex.
If you encounter problems, then you can ask more specific questions. Your question is too general.
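Putting the suggestions above together, here is a minimal sketch of the check itself. The page text would be fetched elsewhere (e.g. with requests.get(url).text or urllib); the function name is hypothetical:

```python
import hashlib

def md5_found_in_page(file_bytes, page_html):
    # Compute the MD5 digest of the downloaded file as a lowercase hex string
    digest = hashlib.md5(file_bytes).hexdigest()
    # Search the whole page text for that digest (case-insensitive)
    return digest in page_html.lower()
```

Keeping the comparison in a pure function like this means it can be tested without any network access.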

HTML parsing to obtain what I want [duplicate]

This question already has answers here:
Parsing HTML using Python
(7 answers)
Closed 2 years ago.
I'm trying to do a little bit of HTML parsing in Python, which I'm horrible at, to be quite honest. I've been up googling ways to do this but can't get anything to work. Here is my situation: I have a web page with a BUNCH of links to downloads. What I want to do is specify a search string, and if the string I am searching for is there, download the file. But it needs to get the entire file name. For example, if I am searching for game-1 and the name of the actual game is game-1-something-else, I want it to download game-1-something-else. I have already used the following code to obtain the source of the page:
# Python 2; in Python 3 this is urllib.request.urlopen
import urllib2
file = urllib2.urlopen('http://www.example.com/my/example/dir')
dload = file.read()
This grabs the entire source code of the web page, which is just a directory listing. For example, I have tons of tags: <a href tags, <td> tags, etc. I want to strip the tags so all I have is a list of the files in the directory of the web page, then I want to use a regular expression or something similar to search for what I am looking for, take the entire file name, and download it.
Once you have the HTML data, parse it and then you can make selections of nodes within the page:
import lxml.html
tree = lxml.html.fromstring(dload)
for node in tree.xpath('//a'):
    # attributes are read with .get(); indexing an element with a string raises a TypeError
    print(node.get('href'))
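If you would rather stay in the standard library, the same link extraction plus the question's search-string filter can be sketched with html.parser (the class and function names are illustrative, not from any library):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Gathers the href of every <a> tag seen during feed()."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def find_matching_links(html_text, search):
    # Return every link whose href contains the search string,
    # so "game-1" matches the full name "game-1-something-else.zip"
    parser = LinkCollector()
    parser.feed(html_text)
    return [href for href in parser.links if search in href]
```

Each returned href is the complete file name, which you can then join to the directory URL and download.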

How would I look for all URLs on a web page and then save them to individual variables with urllib2 in Python?

Parse the HTML with an HTML parser and find all <a> tags (e.g. using Beautiful Soup's find_all() method), then check their href attributes.
If, however, you want to find all URLs in the page even if they aren't hyperlinks, then you can use a regular expression which could be anything from simple to ridiculously insane.
You don't do it with urllib2 alone. What you are looking for is parsing the URLs out of a web page.
You get the page with urllib2, read its contents, and then pass it through a parser like Beautiful Soup, or, as the other poster explained, use a regex to search the contents of the page.
You could simply download the raw HTML with urllib2, then search through it. There might be easier ways, but you could do this:
1. Download the source code.
2. Use string methods to split it into a list of sections.
3. Check the first 7 characters of each section.
4. If the first 7 characters are http://, write that section to a variable.
Why do you need separate variables though? Wouldn't it be easier to save them all to a list, using list.append(URL_YOU_JUST_FOUND), every time you find another url?
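The "check for http://" idea above is essentially what a small regex does; here is a minimal sketch that appends every match to a list. This simple pattern is not RFC-complete — it will miss or over-match some edge cases — but it covers the common situation:

```python
import re

def find_urls(text):
    # Match http:// or https:// followed by any run of characters
    # that cannot appear unescaped in a URL inside HTML
    return re.findall(r'https?://[^\s"\'<>]+', text)
```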
