Replace text in HTML and BBCode sample - python

First of all I'd like to say this is my first post on SO, which has been of great help for years to me, so thank you all!
Now onto my question:
I have a string of characters containing unicode text, html tags and bbcode tags (which is obviously extracted from a forum).
Sample:
This is my sample text.
It may contain HTML tags,
[b]BBCode[b],
or even [b][u]both[/u] intricated[/b]!
I have also a list of keywords which may appear in the text described above, and for each of these words I have an associated URL.
Sample:
kw = {'sample': 'http://www.sample.fr', 'BBCode': 'http://www.bbcode.sp'}
As you can see I'm currently using Python because I'm used to the language, but I can be flexible.
My goal is to detect which word(s) in my keyword list is present in the sample text, and to "decorate" the matching word(s) with a link (preferably in bbcode) to the corresponding URL, without altering the rest of the string (just like for Wikis).
Taking further the examples above I'd like to retrieve:
This is my [url=http://www.sample.fr]sample[/url] text.
It may contain HTML tags,
[b][url=http://www.bbcode.sp]BBCode[/url][b],
or even [b][u]both[/u] intricated[/b]!
The main problem here is that sometimes, one of the keywords in my list appears inside a tag, which I do not want to "decorate" with a link for obvious reasons.
In other words, the text I'd like to replace can be located only outside the anchor tags:
**HERE** <not here>[not here] **HERE** [/not here]</not here> **HERE**
Also, I've already tried using BeautifulSoup (along with PostMarkup to convert BBCode to HTML before parsing with BeautifulSoup) but it doesn't allow me to keep the initial string...
Remark: "real" text actually can never be placed between brackets (angle nor squared) due to the general usage of my forum, so this simplifies the problem quite a bit.
I'm sorry for my very long question, I hope everything is clear!
Any help appreciated, thanks to everyone by advance!
Update: Casimir's solution in Python (see below) works just great. Thank you Casimir et Hippolyte!

To do that, the way is always the same: you must match first what you want to avoid.
Example:
(?s) # dotall mode
( # capture with all what you want to avoid
<!--.*?--> # html comment
|
<[^>]+> # html tag
|
\[[^\]]+\] # bbcode
)
| # OR
kw1|kw2|kw3|...
Then you must use a function as replacement, inside the function when the capture group 1 is defined, you return the match, otherwise you return the corresponding string for the keyword.

Related

Issues with extracting URLs from text

I am trying to find a regular expression to extract any valid URLs (not only http[s]) using a regular expression. Unfortunately, each one outputs weird things. The best results I achieved using this regex:
\b((?:[a-z][\w\-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
But I can mark at least the following issues:
http://208.206.41.61/email/email_log.cfm?useremail=3Dtana.jones#enron.com&=refdoc=3D(01-128) is extracted as http://208.206.41.61/email/email_log.cfm?useremail=3Dtana.jones#enron.com&=
http://www.onlinefilefolder.com',AJAXTHRESHOLD should be extracted without AJAXTHRESHOLD
CSS / HTML styling is extracted, for example xmlns:x="urn:schemas-microsoft-com:xslt, ze:12px;color:#666, font-size:12px;color etc
How can I improve this regex to make sure only valid URLs are extracted? I am not only extracting it from the HTML, but also from a plain text. Therefore, using only beautifulsoup is impossible for my use case.
No regex is perfect, but this one might help you:
(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&##\/%=~_|$?!:,.]*\)|[-A-Z0-9+&##\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&##\/%=~_|$?!:,.]*\)|[A-Z0-9+&##\/%=~_|$])
Flag to enable: insensitive, global, multiline (igm)
Source: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/

How to perform a tag-agnostic text string search in an html file?

I'm using LanguageTool (LT) with the --xmlfilter option enabled to spell-check HTML files. This forces LanguageTool to strip all tags before running the spell check.
This also means that all reported character positions are off because LT doesn't "see" the tags.
For example, if I check the following HTML fragment:
<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
LanguageTool will treat it as a plain text sentence:
This is kind of a stupid question.
and returns the following message:
<error category="Grammar" categoryid="GRAMMAR" context=" This is kind of a stupid question. " contextoffset="24" errorlength="9" fromx="8" fromy="8" locqualityissuetype="grammar" msg="Don't include 'a' after a classification term. Use simply 'kind of'." offset="24" replacements="kind of" ruleId="KIND_OF_A" shortmsg="Grammatical problem" subId="1" tox="17" toy="8"/>
(In this particular example, LT has flagged "kind of a.")
Since the search string might be wrapped in tags and might occur multiple times I can't do a simple index search.
What would be the most efficient Python solution to reliably locate any given text string in an HTML file? (LT returns an approximate character position, which might be off by 10-30% depending on the number of tags, as well as the words before and after the flagged word(s).)
I.e. I'd need to do a search that ignores all tags, but includes them in the character position count.
In this particular example, I'd have to locate "kind of a" and find the location of the letter k in:
kin<b>d</b> o<i>f</i>a
This may not be the speediest way to go, but pyparsing will recognize HTML tags in most forms. The following code inverts the typical scan, creating a scanner that will match any single character, and then configuring the scanner to skip over HTML open and close tags, and also common HTML '&xxx;' entities. pyparsing's scanString method returns a generator that yields the matched tokens, the starting, and the ending location of each match, so it is easy to build a list that maps every character outside of a tag to its original location. From there, the rest is pretty much just ''.join and indexing into the list. See the comments in the code below:
test = "<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"
from pyparsing import Word, printables, anyOpenTag, anyCloseTag, commonHTMLEntity
non_tag_text = Word(printables+' ', exact=1).leaveWhitespace()
non_tag_text.ignore(anyOpenTag | anyCloseTag | commonHTMLEntity)
# use scanString to get all characters outside of tags, and build list
# of (char,loc) tuples
char_locs = [(t[0], loc) for t,loc,endloc in non_tag_text.scanString(test)]
# imagine a world without HTML tags...
untagged = ''.join(ch for ch, loc in char_locs)
# look for our string in the untagged text, then index into the char,loc list
# to find the original location
search_str = 'kind of a'
orig_loc = char_locs[untagged.find(search_str)][1]
# print the test string, and mark where we found the matching text
print(test)
print(' '*orig_loc + '^')
"""
Should look like this:
<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
^
"""
The --xmlfilter option is deprecated because of issues like this. The proper solution is to remove the tags yourself but keep the positions so you have a mapping to correct the results that come back from LT. When using LT from Java, this is supported by AnnotatedText, but the algorithm should be simple enough to port it. (full disclosure: I'm the maintainer of LT)

Beautiful Soup and regular expressions

I am using Beautiful Soup to identify a specific tag and its contents. The contents are html-links and I want to extract the text of these tags.
The problem is that the text is made up of different numbers according to a specific pattern. I am only interested in number such as "61993J0417" and "61991CJ0316" and I need the regexp to match both when the number has a "J" and "CJ" in the middle.
I have used this code to achieve this:
soup.find_all(text=re.compile('[6][1-2][0-9]{3}[J]|[CJ][0-9]{4}'))
The soup variable is the contents of the specific tag. This code works in 9 out of 10 cases. However, when I run this script on one of my source files, it also matches numbers such as "51987PC0716".
I cannot understand why so I turn to you for assistance.
You haven't specified what the | applies to; by default it's the entire regex, meaning you have asked for either
[6][1-2][0-9]{3}[J]
(which is the same thing as 6[12][0-9]{3}J) or
CJ[0-9]{4}
(not [CJ], which means "either C or J"). Use parentheses to specify what the alternatives are:
^6[12][0-9]{3}(J|CJ)[0-9]{4}$
which is better written
^6[12][0-9]{3}C?J[0-9]{4}$
IIUC, you always have a "J" inside your string.
Therefore, make it obligatory, and make the "C" optional, using a question mark.
Something like:
re.compile('6[1-2][0-9]{3}C?J[0-9]{4}')
I have not tested this, but you probably can continue from here by yourself.

regex regarding symbols in urls

I want to replace consecutive symbols just one such as;
this is a dog???
to
this is a dog?
I'm using
str = re.sub("([^\s\w])(\s*\1)+", "\\1",str)
however I notice that this might replace symbols in urls that might happen in my text.
like http://example.com/this--is-a-page.html
Can someone give me some advice how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind.
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
r"""(?ix) # case-insensitive, verbose regex
# Either match a URL
# (protocol optional (if so, URL needs to start with www or ftp))
(?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$])
# or
|
# match repeated non-word characters
(?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
# and replace with both captured groups (one will always be empty)
r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(

skip over HTML tags in Regular Expression patterns

I'm trying to write a regular expression pattern (in python) for reformatting these template engine files.
Basically the scheme looks like this:
[$$price$$]
{
<h3 class="price">
$12.99
</h3>
}
I'm trying to make it remove any extra tabs\spaces\new lines so it should look like this:
[$$price$$]{<h3 class="price">$12.99</h3>}
I wrote this: (\t|\s)+? which works except it matches within the html tags, so h3 becomes h3class and I am unable to figure out how to make it ignore anything inside the tags.
Using regular expressions to deal with HTML is extremely error-prone; they're simply not the right tool.
Instead, use a HTML/XML-aware library (such as lxml) to build a DOM-style object tree; modify the text segments within the tree in-place, and generate your output again using said library.
Try this:
\r?\n[ \t]*
EDIT: The idea is to remove all newlines (either Unix: "\n", or Windows: "\r\n") plus any horizontal whitespace (TABs or spaces) that immediately follow them.
Alan,
I have to agree with Charles that the safest way is to parse the HTML, then work on the Text nodes only. Sounds overkill but that's the safest.
On the other hand, there is a way in regex to do that as long as you trust that the HTML code is correct (i.e. does not include invalid < and > in the tags as in: <a title="<this is a test>" href="look here">...)
Then, you know that any text has to be between > and < except at the very beginning and end (if you just get a snapshot of the page, otherwise there is the HTML tag minimum.)
So... You still need two regex's: find the text '>[^<]+<', then apply the other regex as you mentioned.
The other way, is to have an or with something like this (not tested!):
'(<[^>]*>)|([\r\n\f ]+)'
This will either find a tag or spaces. When you find a tag, do not replace, if you don't find a tag, replace with an empty string.

Categories