Print string as HTML - python

I would like to know if there is any way to convert a plain Unicode string to HTML in Genshi so that, for example, it renders newlines as <br/>.
I want this to render some text entered in a textarea.
Thanks in advance!

If Genshi works just like Kid (which it should), then all you have to do is
${XML("<p>Hi!</p>")}
We have a small function to transform from a wiki format to HTML:
def wikiFormat(text):
    patternBoldItalic = re.compile("(''''')(.+?)(''''')")
    patternBold = re.compile("(''')(.+?)(''')")
    patternItalic = re.compile("('')(.+?)('')")
    translatedText = (text or "").replace("\n", "<br/>")
    # Bold-italic must be substituted first, before the shorter patterns consume its quotes
    translatedText = patternBoldItalic.sub(r'<b><i>\2</i></b>', translatedText)
    translatedText = patternBold.sub(r'<b>\2</b>', translatedText)
    translatedText = patternItalic.sub(r'<i>\2</i>', translatedText)
    return translatedText
You should adapt it to your needs.
${XML(wikiFormat(text))}

Maybe use a <pre> tag.

Convert plain text to HTML, by escaping "<" and "&" characters (and maybe some more, but these two are the absolute minimum) as HTML entities
Substitute every newline with the text "<br />", possibly still combined with a newline.
In that order.
All in all that shouldn't be more than a few lines of Python code. (I don't do Python but any Python programmer should be able to do that, easily.)
Edit: I found code on the web for the first step. For step 2, see string.replace at the bottom of this page.
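A minimal sketch of those two steps on Python 3, using the standard library's html.escape for the entity escaping:

```python
import html

def text_to_html(text):
    # Step 1: escape "&", "<" (and ">") as HTML entities
    escaped = html.escape(text)
    # Step 2: substitute every newline with "<br />", keeping the newline for readability
    return escaped.replace("\n", "<br />\n")

print(text_to_html("a < b & c\nsecond line"))
# a &lt; b &amp; c<br />
# second line
```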

In case anyone is interested, this is how I solved it. This is the Python code before the data is sent to the Genshi template.
from trac.wiki.formatter import format_to_html
from trac.mimeview.api import Context
...
context = Context.from_request(req, 'resource')
data['comment'] = format_to_html(self.env, context, comment, True)
return template, data, None

Related

How do i remove the adsense code in requests_html?

I am using the requests_html library to scrape a website, but the grabbed text also includes the AdSense code from that site. The example looks something like this:
some text some text some text some text and then this:
(adsbygoogle = window.adsbygoogle || []).push({});
some text some text some text after a line break and then this:
sas.cmd.push(function() {
    sas.call("std", {
        siteId: 301357,   //
        pageId: 1101926,  // Page : Seneweb_AF/rg
        formatId: 49048,  // Format : Pave 2 300x250
        target: ''        // Ciblage
    });
});
Now, how can I get rid of the bold-italic text above?
If requests_html doesn't have a built-in mechanism for handling this, then a solution is to use pure Python; this is what I have found so far:
curated_article = article.text.split('\n')
curated_article = "\n".join(list(filter(lambda a: not a.startswith("&#"), curated_article)))
print(curated_article)
where article is the HTML element for a scraped article.
Assuming you are able to get hold of the text as a string before you need to remove the unwanted parts, you can search and replace.
If (adsbygoogle = window.adsbygoogle || []).push({}); is always the exact same string (including the same whitespace every time), then you can use str.replace().
See How to use string.replace() in python 3.x.
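For example, assuming the AdSense line really is byte-for-byte identical every time (whitespace included):

```python
# The exact ad line as it appears in the scraped text (taken from the question)
ad_line = "(adsbygoogle = window.adsbygoogle || []).push({});"

article_text = "some text some text\n" + ad_line + "\nmore text"
cleaned = article_text.replace(ad_line, "")
print(cleaned)
```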
If the text is not the exact same thing every time--and I am guessing that at least the second example you showed is not the same every time--then you can use regular expressions. See the python documentation of the re module.
If you only use a few regular expressions in your program you can just call re.sub,
something like this:
sanitized_text = re.sub(regularexpression, '', original_text, flags=re.MULTILINE|re.DOTALL)
It may take some trial and error to get a pattern that matches every case like the second example.
You'll need re.MULTILINE if there are newlines inside the retrieved article, as there almost certainly will be, and re.DOTALL in order to make certain regex patterns work across line boundaries, which it appears the second example will require.
If you end up having to use several regular expressions you can compile them using re.compile before you start scraping:
pattern = re.compile(regularexpression, flags=re.MULTILINE|re.DOTALL)
Later, when you have text to remove pieces from, you can do the search and replace like this:
sanitized_text = pattern.sub('', original_text)
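As a concrete sketch, here is a compiled pattern that removes the sas.cmd.push(...) block from the question; the pattern is my guess from the one sample shown and may need adjusting for other pages:

```python
import re

# DOTALL lets ".*?" span the newlines inside the pushed function body
sas_pattern = re.compile(r"sas\.cmd\.push\(function\(\)\s*\{.*?\}\);\s*\}\);", re.DOTALL)

original_text = """some text
sas.cmd.push(function() {
    sas.call("std", {
        siteId: 301357,
        pageId: 1101926
    });
});
more text"""

sanitized_text = sas_pattern.sub("", original_text)
print(sanitized_text)
```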

How to use extended ascii with bs4 url

I've been reluctant to post a question about this, but after 3 days of Google I can't get this to work. Long story short, I'm making a raid gear tracker for WoW.
I'm using BS4 to handle the web scraping. I'm able to pull the page and scrape the info I need from it. The problem I'm having is when there is an extended ASCII character in the player's name, ex: thermíte (the í is alt+161).
http://us.battle.net/wow/en/character/garrosh/thermíte/advanced
I'm trying to figure out how to re-encode the url so it is more like this:
http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
I'm using tkinter for the gui, I have the user select their realm from a dropdown and then type in the character name in an entry field.
namefield = Entry(window, textvariable=toonname)
I have a scraping function that performs the initial scrape of the main profile page; this is where I assign the value of namefield to a global variable. (I tried passing it directly to the scraper with this:
namefield = Entry(window, textvariable=toonname, command=firstscrape)
I thought I was close, because when it passed "thermíte", the scrape function would print out "therm\xC3\xADte"; all I needed to do was replace the '\x' with '%' and I'd be golden. But it wouldn't work. I could use mastername.find('\x') and it would find instances of it in the string, but doing mastername.replace('\x','%') wouldn't actually replace anything.
I tried various combinations of r'\x' '\%' r'\x' etc etc. no dice.
Lastly, when I try to do things like encode into Latin-1 and then decode back into UTF-8, I get errors about how it can't handle the extended ASCII character.
urlpart1 = "http://us.battle.net/wow/en/character/garrosh/"
urlpart2 = mastername
urlpart3 = "/advanced"
url = urlpart1 + urlpart2 + urlpart3
That's what I've been using to try to rebuild the final URL (atm I'm leaving the realm constant until I can get the name problem fixed).
Tldr:
I'm trying to take a url with extended ascii like:
http://us.battle.net/wow/en/character/garrosh/thermíte/advanced
And have it become a url that a browser can easily process like:
http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
with all of the normal extended ascii characters.
I hope this made sense.
Here is a pastebin for the full script atm; there are some things in it that aren't utilized until later on: pastebin link
There shouldn't be non-ASCII characters in the resulting URL. Make sure mastername is a Unicode string (isinstance(mastername, str) on Python 3):
#!/usr/bin/env python3
from urllib.parse import quote
mastername = "thermíte"
assert isinstance(mastername, str)
url = "http://us.battle.net/wow/en/character/garrosh/{mastername}/advanced"\
.format(mastername=quote(mastername, safe=''))
# -> http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
You can try something like this:
>>> import urllib
>>> 'http://' + '/'.join([urllib.quote(x) for x in url.strip('http://').split('/')])
'http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced'
urllib.quote() percent-encodes the characters of a string. You don't want all of the characters to be affected, just everything between the '/' characters, excluding the initial 'http://'. So the strip and split calls take those out of the equation, and then you concatenate them back together with the + operator and join. (Note that str.strip removes a set of characters rather than a literal prefix; it happens to work on this URL.)
EDIT: This one is on me for not reading the docs... Much cleaner:
>>> url = 'http://us.battle.net/wow/en/character/garrosh/thermíte/advanced'
>>> urllib.quote(url, safe=':/')
'http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced'

Strip all html lines/code from string in python

Given the following string parsed from an email body...
s = "Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay."
How do I remove all the HTML code and lines from the string to simply return "Keep all of this this is still good But this is still okay." on one line? I've looked at bleach and lxml, but they simply remove the HTML tags and return what's inside, whereas I don't want any of it.
You can still use lxml to get all of the root element's text nodes:
import lxml.html
html = '''
Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay.
'''
root = lxml.html.fromstring('<div>' + html + '</div>')
text = ' '.join(t.strip() for t in root.xpath('text()') if t.strip())
Seems to work fine:
>>> text
'Keep all of this this is still good, but But this is still okay.'
Simple solution that requires no external packages:
import re
while '<' in s:
    s = re.sub('<.+?>.+?<.+?>', '', s)
Not very efficient, since it passes over the target string many times, but it should work. Note there must be absolutely no stray < or > characters in the string outside of tags.
This one?
import re
s = # Your string here
print re.sub('[\s\n]*<.+?>.+?<.+?>[\s\n]*', ' ', s)
Edit: just made a few mods to @BoppreH's answer, albeit with an extra space.
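For the record, the tags can also be stripped without regular expressions or external packages, using the standard library's html.parser. This sketch keeps only the text nodes that sit outside any tag, which is the behaviour asked for:

```python
from html.parser import HTMLParser

class TopLevelTextExtractor(HTMLParser):
    """Collects only text that is not enclosed in any tag."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1
    def handle_data(self, data):
        # depth == 0 means we are outside every element
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

s = ("Keep all of this <h1>But remove this including the tags</h1> "
     "this is still good, but <p>removed</p> But this is still okay.")
parser = TopLevelTextExtractor()
parser.feed(s)
print(' '.join(parser.chunks))
# Keep all of this this is still good, but But this is still okay.
```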

Extracting URLs from XML in Python

I read this thread about extracting url's from a string. https://stackoverflow.com/a/840014/326905
Really nice. I got all URLs from an XML document containing http://www.blabla.com with
>>> s = '<link href="http://www.blabla.com/blah" /><link href="http://www.blabla.com" />'
>>> re.findall(r'(https?://\S+)', s)
['http://www.blabla.com/blah"', 'http://www.blabla.com"']
But I can't figure out how to customize the regex to omit the double quote at the end of the URL.
First i thought that this is the clue
re.findall(r'(https?://\S+\")', s)
or this
re.findall(r'(https?://\S+\Z")', s)
but it isn't.
Can somebody help me out and tell me how to omit the double quote at the end?
Btw, the question mark after the "s" of https means the "s" may or may not occur. Am I right?
>>> from lxml import html
>>> ht = html.fromstring(s)
>>> ht.xpath('//link/@href')
['http://www.blabla.com/blah', 'http://www.blabla.com']
You're already using a character class (albeit a shorthand version). I might suggest modifying the character class a bit, that way you don't need a lookahead. Simply add the quote as part of the character class:
re.findall(r'(https?://[^\s"]+)', s)
This still says "one or more characters not a whitespace," but has the addition of not including double quotes either. So the overall expression is "one or more character not a whitespace and not a double quote."
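Applied to the sample string from the question:

```python
import re

s = '<link href="http://www.blabla.com/blah" /><link href="http://www.blabla.com" />'
# Match one or more characters that are neither whitespace nor a double quote
urls = re.findall(r'(https?://[^\s"]+)', s)
print(urls)
# ['http://www.blabla.com/blah', 'http://www.blabla.com']
```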
You want the double quotes to appear as a look-ahead:
re.findall(r'(https?://\S+)(?=\")', s)
This way they won't appear as part of the match. Also, yes the ? means the character is optional.
See example here: http://regexr.com?347nk
I used to extract URLs from text through this piece of code:
url_rgx = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
# convert string to lower case
text = text.lower()
matches = re.findall(url_rgx, text)
# patch the 'http://' part if it is missed
urls = ['http://%s'%url[0] if not url[0].startswith('http') else url[0] for url in matches]
print urls
It works great!
Thanks. I just read this https://stackoverflow.com/a/13057368/326905
and checked out this, which is also working:
re.findall(r'"(https?://\S+)"', urls)

Problem extracting text out of html file using python regex

I'm working on a project that requires me to write some code to pull some text out of an HTML file in Python.
<tr>
<td>Target binary file name:</td>
<td class="right">Doc1.docx</td>
</tr>
^Small portion of the html file that I'm interested in.
#! /usr/bin/python
import os
import re

if __name__ == '__main__':
    f = open('./results/sample_result.html')
    soup = f.read()
    p = re.compile("binary")
    for line in soup:
        m = p.search(line)
        if m:
            print "finally"
            break
^Sample code I wrote to test if I could extract data out.
I've written several programs similar to this to extract text from txt files almost exactly the same and they have worked just fine. Is there something I'm missing out with regards to regex and html?
Is there something I'm missing out with regards to regex and html?
Yes. You're missing the fact that some HTML cannot be parsed with a simple regex.
Is this actually what you're trying to do, or just a simple example for a more complicated regex later? If the latter, listen to everyone else. If the former:
for line in file:
    if "binary" in line:
        # do stuff
If that doesn't work, are you sure "binary" is in the file? Not, I don't know, "<i>b</i>inary"?
HTML as understood by browsers is waaaay too flexible for reg expressions. Attributes can pop up in any tag, and in any order, and in upper or lower case, and with or without quotation marks about the value. Special emphasis tags can show up anywhere. Whitespace is significant in regex, but not so much in HTML, so your regex has to be littered with \s*'s everywhere. There is no requirement that opening tags be matched with closing tags. Some opening tags include a trailing '/', meaning that they are empty tags (no body, no closing tag). Lastly, HTML is often nested, which is pretty much off the chart as far as regex is concerned.
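If the goal really is just to pull the cell next to "Target binary file name:", a small parser avoids all of the regex pitfalls above. A sketch using only the standard library (the class name and the dict-pairing approach are mine; the field names come from the question's snippet):

```python
from html.parser import HTMLParser

class LabelValueParser(HTMLParser):
    """Collects the text of each <td>, so label cells can be paired with value cells."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False
    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

html_snippet = """<tr>
<td>Target binary file name:</td>
<td class="right">Doc1.docx</td>
</tr>"""

parser = LabelValueParser()
parser.feed(html_snippet)
# Pair even-indexed cells (labels) with odd-indexed cells (values)
fields = dict(zip(parser.cells[::2], parser.cells[1::2]))
print(fields['Target binary file name:'])
# Doc1.docx
```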