Python custom text to html syntax - python

I'm trying to create a custom syntax for my users to format some html
so a user can enter something like:
**some text here**
the some more text down here
**Another bunch of stuff**
then some other junk
and I get:
<h1>some text here</h2>
<p>the some more text down here</p>
<h1>Another bunch of stuff</h1>
<p>then some other junk</p>
and hopefully leave some room to make up other tags as I need them
edit:
So my question is how would I write the regex function to convert some given text and have it find every instance of an opening and closing ** and replace them with the appropriate or tags.
i have:
import re
header_pattern = re.compile(r'(?P**)(?P.*)(?P**)', re.MULTILINE)
def format_headers(text):
def process_match(m):
return "<h2>%s</h2>" % m.group('header')
new_text = header_pattern.sub(process_match, text)
print new_text
but this only grabs the first ** and last ** and ignores ones in the middle.

Use some standard solution. Like Markdown. Its is most preferable.
To match your header strings use r'\*\*(.*?)\*\*'. See example
>>> re.sub(r'\*\*(.*?)\*\*','<h1>\\1</h1>', '**some text here** and **another**')
'<h1>some text here</h1> and <h1>another</h1>'

Is there any reason you couldn't use a simple template language like Cheetah and build the logic into even handlers? Cheetah would allow you to create a template objects that contain a set of methods useful for manipulating HTML.
I had a similar issue a few weeks back when I was trying to use CherryPy to create a webpage, in the tutorials I found Cheetah templates and it made writing the forms so much easier!

Related

How can I get the html tags from an input rather than the text?

I'm trying to make a program that takes input and then outputs the HTML tags. Although I've managed to do the opposite.
import re
text = '<p>I want this bit removed</p>'
tags = re.search('>(.*)<', text)
print(tags.group(1))
At the moment, if I run this, it removes the HTML tags and keeps the text. But I want it so that the output is ['p','/p']. How can I do this? I also want to make it so that it can adapt to any input.
Also, if possible, I'd like to adapt this to a for loop
Just change the regex to look for the text inside the < > instead.
import re
text = '<p>I want this bit removed</p>'
tags = re.findall('<([^>]*)>', text) # [^>] means anything except a `>`
print(tags) # tags is an iterable object (basically a list) here

Python - remove excessive html tags

So I'm currently having this text:
<i>This article is written </i><i>TEST</i><i>.</i>
I think this is a good HTML, however, I want to clean it up, remove all the excessive <i> tags and simplify it to a single <i> tag:
<i>This article is written TEST.</i>
I tried to clean it up myself, but I'd need to look ahead for the text, and haven't had much success with this. Is there a package I can use or a way that I can do it or I'd have to manually do it?
Thank you
The use of an HTML parser is definitely the most reliable solution. It would be able to cope with the tags split across many lines.
The following will solve your example, but probably not much more...
def OuterI(text):
outer = re.search("(.*?)(\<i\>.*<\/i\>)(.*)", text)
if outer:
return "%s<i>%s</i>%s" % (outer.group(1), re.sub(r"(\<\/?[iI]\>)", "", outer.group(2)), outer.group(3))
else:
return text
print OuterI('<i>This article is written </i><i>TEST</i><i>.</i>')
print OuterI('text before <i>This article is written </i><i>TEST</i><i>.</i> text after')

changing plaintext tags into HTML tags to display in browser in python

ok so I'm writing a function in python which takes a text document which is tagged with tags like ===, ==, ---, #text# etc. etc. (alot like wikipedia). Now my program basically has to replace those with HTML tags such as &ndash, &mdash, <>text etc. so that they can be displayed properly in a browser. This is what i've got so far:
def tag_change ():
for () in range ()
sub('--', '–')
sub('---', '—')
sub('''*''', '<i>*</i>')
sub("'''*'''", '<b>*</b>')
sub("==*==", "<h1>*</h1>")
sub("#*#", "<li>*</li>")
Am I on the right track? Or is there something else I need to include? I'm fairly new to this
Your best bet (if you want to write your own function and avoid using an existing tool) is to use regex, which is simple enough
import re
def subst(text):
str = '#text#'
capture = re.search('#(.+)#', str)
return '<li>'+ capture.group(1)+ '</li>'
I hope you get the idea
you could also use patterns like '==(.+)==' and so forth to capture what you want.
You can view this post to learn more about using re.search and re.match
https://stackoverflow.com/a/180993/2152321
You can also learn more about regex pattern construction here
http://www.tutorialspoint.com/python/python_reg_expressions.htm

With flask/jinja, what is a viable way to safely render a link inside a user generated block of text?

Think twitter where you paste a link next to some plain text and when your tweet is rendered, that url is now a clickable link.
Do I:
replace jinja's autoescape with my own by scanning the text for html tags and replacing them with the html entity code
use a regular expression to detect a url contained in the text and replace it within an a href=
what would this expression look like to detect any # of .tld's, http/https, www/any subdomain?
and render this all as ¦safe in the template?
Or is there a python/flask/jinja 'feature' that can better handle this kind of thing?
Jinja has a filter built-in called urlize that should do exactly what you want.

Python: Keyword to Links

I am building a blog on Google App Engine. I would like to convert some keywords in my blog posts to links, just like what you see in many WordPress blogs.
Here is one WP plugin which do the same thing:http://wordpress.org/extend/plugins/blog-mechanics-keyword-link-plugin-v01/
A plugin that allows you to define keyword/link pairs. The keywords are automatically linked in each of your posts.
I think this is more than a simple Python Replace. What I am dealing with is HTML code. It can be quite complex sometimes.
Take the following code snippet as an example. I want to conver the word example into a link to http://example.com:
Here is an example link:example.com
By a simple Python replace function which replaces example with example, it would output:
Here is an example link:example.com">example.com</a>
but I want:
Here is an example link:example.com
Is there any Python plugin that capable of this? Thanks a lot!
This is roughly what you could do using Beautifulsoup:
from BeautifulSoup import BeautifulSoup
html_body ="""
Here is an example link:<a href='http://example.com'>example.com</a>
"""
soup = BeautifulSoup(html_body)
for link_tag in soup.findAll('a'):
link_tag.string = "%s%s%s" % ('|',link_tag.string,'|')
for text in soup.findAll(text=True):
text_formatted = ['example'\
if word == 'example' and not (word.startswith('|') and word.endswith('|'))\
else word for word in foo.split() ]
text.replaceWith(' '.join(text_formatted))
for link_tag in soup.findAll('a'):
link_tag.string = link_tag.string[1:-1]
print soup
Basically I'm stripping out all the text from the post_body, replacing the example word with the given link, without touching the links text that are saved by the '|' characters during the parsing.
This is not 100% perfect, for example it does not work if the word you are trying to replace ends with a period; with some patience you could fix all the edge cases.
This would probably be better suited to client-side code. You could easily modify a word highlighter to get the desired results. By keeping this client-side, you can avoid having to expire page caches when your 'tags' change.
If you really need it to be processed server-side, then you need to look at using re.sub which lets you pass in a function, but unless you are operating on plain-text you will have to first parse the HTML using something like minidom to ensure you are not replacing something in the middle of any elements.

Categories