Beautiful Soup and regular expressions - python

I am using Beautiful Soup to identify a specific tag and its contents. The contents are html-links and I want to extract the text of these tags.
The problem is that the text is made up of different numbers according to a specific pattern. I am only interested in numbers such as "61993J0417" and "61991CJ0316", and I need the regexp to match both when the number has a "J" and when it has a "CJ" in the middle.
I have used this code to achieve this:
soup.find_all(text=re.compile('[6][1-2][0-9]{3}[J]|[CJ][0-9]{4}'))
The soup variable is the contents of the specific tag. This code works in 9 out of 10 cases. However, when I run this script on one of my source files, it also matches numbers such as "51987PC0716".
I cannot understand why so I turn to you for assistance.

You haven't specified what the | applies to; by default it's the entire regex, meaning you have asked for either
[6][1-2][0-9]{3}[J]
(which is the same thing as 6[12][0-9]{3}J) or
CJ[0-9]{4}
(not [CJ], which means "either C or J"). Use parentheses to specify what the alternatives are:
^6[12][0-9]{3}(J|CJ)[0-9]{4}$
which is better written
^6[12][0-9]{3}C?J[0-9]{4}$
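A quick way to see the difference, as a minimal sketch that uses only the re module and the three numbers quoted in the question (the variable names are made up for illustration):

```python
import re

# Pattern from the question: the | splits the whole expression in two.
original = re.compile('[6][1-2][0-9]{3}[J]|[CJ][0-9]{4}')
# Anchored pattern with an optional C before the mandatory J.
fixed = re.compile(r'^6[12][0-9]{3}C?J[0-9]{4}$')

samples = ["61993J0417", "61991CJ0316", "51987PC0716"]

# search() looks for a match anywhere in the string.
original_hits = [s for s in samples if original.search(s)]
fixed_hits = [s for s in samples if fixed.search(s)]

print(original_hits)
print(fixed_hits)
```

The original pattern accepts "51987PC0716" because its second alternative, [CJ][0-9]{4}, matches the "C0716" part on its own. Note that the anchored pattern assumes the text consists of the number alone; if the link text may carry surrounding whitespace, the anchors would need loosening.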

IIUC, you always have a "J" inside your string.
Therefore, make it obligatory, and make the "C" optional, using a question mark.
Something like:
re.compile('6[1-2][0-9]{3}C?J[0-9]{4}')
I have not tested this, but you probably can continue from here by yourself.

Related

Python - Injecting html tags into strings based on regex match

I wrote a Python script for a custom HTML page that finds a word within a string/line and highlights just that word using the following tags, where instance is the word being searched for.
<b><font color=\"red\">"+instance+"</font></b>
With the following result:
I need to find a word (case insensitive) let's say "port" within a string that can be port, Port, SUPPORT, Support, support etc, which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However my strings often contain 2 or more instances in single line, and I need to append
<b><font color=\"red\">"+instance+"</font></b> to each of those instances, without changing cases.
The problem with my approach is that I am attempting to iterate over each of the instances found with findall (exact match),
while the same match can also be found multiple times within the string.
for instance in find_all_instances:
    second_pattern = re.compile(instance)
    string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking, I would be able to avoid this if I was able to find out exact part of the string that the pattern.sub substitutes at the moment of doing it,
however I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone has a way I could use to insert <b><font color="red">instance</font></b> without replacing instance for all matches (case insensitive), then I would be grateful.
Maybe I'm misinterpreting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
Okay, so here are two ways I put together quickly. The second loop is definitely the way to go. It uses re.sub (as someone else commented too). Bear in mind that it replaces every match with the lowercase search term.
import re
word = "port"
with open("testing.txt", "r") as FILE:
    # THIS LOOP IS CASE SENSITIVE
    for line in FILE:
        newline = line.replace(word, "<b><font color=\"red\">" + word + "</font></b>")
        print(newline)
    # Rewind the file, otherwise the second loop has nothing left to read
    FILE.seek(0)
    # THIS LOOP IS CASE INSENSITIVE
    pattern = re.compile(word, re.IGNORECASE)
    for line in FILE:
        newline = pattern.sub("<b><font color=\"red\">" + word + "</font></b>", line)
        print(newline)
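If the original casing of each match has to survive, which is what the question asks for, one option is to give re.sub a function instead of a fixed string, so that each replacement is built from the text that was actually matched. A minimal sketch; the helper name highlight is made up for illustration:

```python
import re

def highlight(word, text):
    """Wrap every case-insensitive occurrence of word in red bold tags,
    keeping the casing found in the text."""
    # re.escape guards against search words that contain regex metacharacters.
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    # The function receives each match object; group(0) is the matched text.
    return pattern.sub(
        lambda m: '<b><font color="red">' + m.group(0) + '</font></b>',
        text,
    )

print(highlight("port", "Port and SUPPORT"))
```

Because the whole string is processed in a single pass, text that has already been wrapped is never matched a second time, which avoids the nested tags shown in the question.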

Replacing strings in a text and ignoring certain parts

I found many programs online to replace text in a string or file with words prescribed in a dictionary. For example, https://www.daniweb.com/programming/software-development/code/216636/multiple-word-replace-in-text-python
But I was wondering how to get the program to ignore certain parts of the text. For instance, I would like it to ignore parts that are ensconced within say % signs (%Please ignore this%). Better still, how do I get it to ignore the text within but remove the % sign at the end of the run.
Thank you.
This could very easily be done with regular expressions, although they may not be supported by any online programs you find. You will probably need to write something yourself and then use regexes as your dict's search keys.
Good place to start playing around with regex is: http://regexr.com
In the replacement dictionary, add an entry for each word you want ignored: for example, have teh replaced with the, but %teh% replaced with teh. For the program in the link you could have
wordDic = {
    'booster': 'rooster',
    '%booster%': 'booster'
}
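A regex-based alternative, sketched under the assumption that protected text is always wrapped in a pair of % signs and never contains a % itself: match the protected spans first, and only then the dictionary words. The function name replace_words is hypothetical and is not taken from the linked program.

```python
import re

def replace_words(text, word_dic):
    """Replace dictionary words, but leave %...% spans alone
    (only their % markers are removed)."""
    # Longest keys first, so a longer key wins over a key that is its prefix.
    keys = sorted(word_dic, key=len, reverse=True)
    pattern = re.compile(
        r"%([^%]*)%"                                      # group 1: protected span
        r"|\b(" + "|".join(map(re.escape, keys)) + r")\b"  # group 2: a dictionary word
    )

    def repl(match):
        if match.group(1) is not None:
            # Protected span: return the inner text without the % signs.
            return match.group(1)
        return word_dic[match.group(2)]

    return pattern.sub(repl, text)

print(replace_words("the booster and %the booster% here", {"booster": "rooster"}))
```

This avoids having to list every protected word in the dictionary, at the cost of treating any pair of % signs in the text as markers.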

Regular expression for checking string outside of a set

I am writing a web crawler using Scrapy and as a result I get a set of URLs like: [Dummy URLs]
http://matrix.com/en/Zion
http://matrix.com/en/Machine_World
http://matrix.com/en/Matrix:Banner_guidelines
http://matrix.com/en/File:Link_Banner.jpg
http://matrix.com/wiki/en/index.php
In the rules in Scrapy, I want to add a regex that allows ONLY URLs of the kind "http://matrix.com/en/Machine_World" or "http://matrix.com/en/Zion",
i.e. URLs that contain anything outside of the set "http://matrix.com/en/<[a-zA-Z,_]>" must not be allowed.
Constraints:
The string after "/en/" could be of any length, so I cannot ask it to look only at the first 10 or 20 characters. E.g. when I use the regex [a-zA-Z,]{1,20} or [a-zA-Z,]{1,}, it still matches URLs like http://matrix.com/en/Matrix:Banner_guidelines, because it treats the "http://matrix.com/en/Matrix" part of the URL as a successful match. I want it to look at the string starting after "/en/" up to the end of the URL and then apply this rule.
Unfortunately, I cannot extract that string and write a subroutine of any kind. It has to be done using a regex only!
i.e urls that contain anything outside of the set "http://matrix.com/en/<[a-zA-Z,_]>" must not be allowed.
Have you tried using this character class in your regex? Looks like you aren't including underscores.
Try
[a-zA-Z,_]+
The plus sign means "one or more" - which is the same as {1,} just a nice shorthand :)
If you want to exclude items with .php or .jpg, add a $ sign to the end, like so:
[a-zA-Z,_]+$
The $ means "end of line", so your matching sequence must run to the end of the line. As full stops and colons are not included in the character class, URLs containing them will be excluded.
Let me know if that works,
Elliott
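The same idea can be checked in Python with the dummy URLs from the question. This is only a sketch of the pattern itself using the re module; how the pattern is wired into the Scrapy rules is left out, and the leading ^ anchor is an addition that is not part of the suggested character class.

```python
import re

urls = [
    "http://matrix.com/en/Zion",
    "http://matrix.com/en/Machine_World",
    "http://matrix.com/en/Matrix:Banner_guidelines",
    "http://matrix.com/en/File:Link_Banner.jpg",
    "http://matrix.com/wiki/en/index.php",
]

# Everything after /en/ must consist of the allowed characters up to the end.
allowed = re.compile(r"^http://matrix\.com/en/[a-zA-Z,_]+$")

kept = [u for u in urls if allowed.search(u)]
print(kept)
```

The trailing $ is what rejects "Matrix:Banner_guidelines": the character class stops at the colon, and the end of the string is not reached there.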
Reproducible evidence that the suggested regex works:
grep("matrix.com\\/en\\/[a-zA-Z,_]+$", x, perl=TRUE, value=TRUE)
#[1] "http://matrix.com/en/Zion"
#[2] "http://matrix.com/en/Machine_World"
Data
x <- c("http://matrix.com/en/Zion", "http://matrix.com/en/Machine_World",
"http://matrix.com/en/Matrix:Banner_guidelines",
"http://matrix.com/en/File:Link_Banner.jpg",
"http://matrix.com/wiki/en/index.php")

Replace text in HTML and BBCode sample

First of all I'd like to say this is my first post on SO, which has been of great help for years to me, so thank you all!
Now onto my question:
I have a string of characters containing unicode text, html tags and bbcode tags (which is obviously extracted from a forum).
Sample:
This is my sample text.
It may contain HTML tags,
[b]BBCode[b],
or even [b][u]both[/u] intricated[/b]!
I have also a list of keywords which may appear in the text described above, and for each of these words I have an associated URL.
Sample:
kw = {'sample': 'http://www.sample.fr', 'BBCode': 'http://www.bbcode.sp'}
As you can see I'm currently using Python because I'm used to the language, but I can be flexible.
My goal is to detect which word(s) in my keyword list are present in the sample text, and to "decorate" the matching word(s) with a link (preferably in BBCode) to the corresponding URL, without altering the rest of the string (just like wikis do).
Taking further the examples above I'd like to retrieve:
This is my [url=http://www.sample.fr]sample[/url] text.
It may contain HTML tags,
[b][url=http://www.bbcode.sp]BBCode[/url][b],
or even [b][u]both[/u] intricated[/b]!
The main problem here is that sometimes, one of the keywords in my list appears inside a tag, which I do not want to "decorate" with a link for obvious reasons.
In other words, the text I'd like to replace can be located only outside the anchor tags:
**HERE** <not here>[not here] **HERE** [/not here]</not here> **HERE**
Also, I've already tried using BeautifulSoup (along with PostMarkup to convert BBCode to HTML before parsing with BeautifulSoup) but it doesn't allow me to keep the initial string...
Remark: "real" text can never actually be placed between brackets (neither angle nor square) due to the general usage of my forum, so this simplifies the problem quite a bit.
I'm sorry for my very long question, I hope everything is clear!
Any help appreciated, thanks to everyone in advance!
Update: Casimir's solution in Python (see below) works just great. Thank you Casimir et Hippolyte!
To do that, the way is always the same: you must match first what you want to avoid.
Example:
(?s)                # dotall mode
(                   # capture everything you want to avoid
    <!--.*?-->      # html comment
  |
    <[^>]+>         # html tag
  |
    \[[^\]]+\]      # bbcode
)
|                   # OR
kw1|kw2|kw3|...
Then you must use a function as the replacement. Inside the function, when capture group 1 is defined, you return the match unchanged; otherwise you return the corresponding replacement string for the keyword.
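In Python this could look roughly like the following. It is a sketch of the technique described above, not necessarily the exact code the asker ended up with; the function name decorate and the \b word boundaries around the keywords are assumptions.

```python
import re

kw = {'sample': 'http://www.sample.fr', 'BBCode': 'http://www.bbcode.sp'}

def decorate(text, keywords):
    """Wrap keywords in [url=...] tags, skipping HTML comments, HTML tags and BBCode tags."""
    alternation = "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    pattern = re.compile(
        r"(?s)"                                # dotall mode
        r"(<!--.*?-->|<[^>]+>|\[[^\]]+\])"     # group 1: what must be left alone
        r"|\b(" + alternation + r")\b"         # group 2: a keyword
    )

    def repl(match):
        if match.group(1) is not None:
            # Comment or tag: return it untouched.
            return match.group(1)
        word = match.group(2)
        return "[url={}]{}[/url]".format(keywords[word], word)

    return pattern.sub(repl, text)

print(decorate("This is my sample text, <a href='sample'>x</a> [b]BBCode[b]", kw))
```

Since tags are consumed by the first alternative before the keyword alternative gets a chance, a keyword that appears inside a tag (such as the href above) is never decorated.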

regex regarding symbols in urls

I want to collapse consecutive repeated symbols into just one, such as:
this is a dog???
to
this is a dog?
I'm using
str = re.sub(r"([^\s\w])(\s*\1)+", r"\1", str)
however I notice that this might replace symbols in urls that might happen in my text.
like http://example.com/this--is-a-page.html
Can someone give me some advice how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
    r"""(?ix) # case-insensitive, verbose regex
    # Either match a URL
    # (protocol optional (if so, URL needs to start with www or ftp))
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    # or
    |
    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
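One way to sidestep the unmatched-group complaint is to pass a function as the replacement and return whichever group took part in the match. The sketch below deliberately uses a much cruder URL pattern than the one above (anything starting with http://, https://, ftp:// or www. up to the next whitespace), so it is an illustration of the approach under that assumption rather than a drop-in replacement. Separately, the Python documentation states that since version 3.5 unmatched groups in a replacement template are replaced with an empty string.

```python
import re

pattern = re.compile(
    r"(?P<url>\b(?:https?://|ftp://|www\.)\S+)"   # simplified URL: keep as-is
    r"|(?P<rpt>[^\s\w])(?:\s*(?P=rpt))+",          # a symbol repeated one or more times
    re.IGNORECASE,
)

def collapse(match):
    # Exactly one of the two groups has matched; return that one.
    return match.group("url") or match.group("rpt")

text = "this is a dog??? see http://example.com/this--is-a-page.html now!!"
result = pattern.sub(collapse, text)
print(result)
```

With this simplified URL pattern, repeated punctuation glued directly to the end of a URL is treated as part of the URL and left alone.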
