I want to replace consecutive identical symbols with just one, such as:
this is a dog???
to
this is a dog?
I'm using:
text = re.sub(r"([^\s\w])(\s*\1)+", r"\1", text)
(Note the raw strings: in a plain string literal, "\1" is the octal escape for chr(1), not a backreference.)
However, I notice that this might also replace symbols in URLs that appear in my text, like
http://example.com/this--is-a-page.html
Can someone give me some advice on how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
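You can confirm that limitation quickly — a minimal sketch, not part of the suggestion above:
import re

# Python's re only supports fixed-width lookbehind, so compiling this fails
try:
    re.compile(r"(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()\"'\w\s])(?:\s*\1)+")
except re.error as e:
    print(e)  # e.g. "look-behind requires fixed-width pattern"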
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
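For example, with BeautifulSoup (my sketch, not part of the original answer), you'd apply the regex only to text nodes, leaving URLs in attributes untouched:
from bs4 import BeautifulSoup
import re

html = '<p>this is a dog???</p><a href="http://example.com/this--is-a-page.html">a?? link</a>'
soup = BeautifulSoup(html, "html.parser")

# only visit text nodes; href attributes are never touched
for text_node in soup.find_all(string=True):
    text_node.replace_with(re.sub(r"([^\s\w])(\s*\1)+", r"\1", text_node))

print(soup)
# <p>this is a dog?</p><a href="http://example.com/this--is-a-page.html">a? link</a>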
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
    r"""(?ix)  # case-insensitive, verbose regex
    # Either match a URL
    # (protocol optional (if so, URL needs to start with www or ftp))
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$])
    # or
    |
    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. It looks like a bug in Python (reproducible with (.)(\s*\1)+, whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, the regex compiles. But now Python complains about an unmatched group: obviously you can't reference a group in the replacement if it hasn't participated in the match. I give up... :(
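For what it's worth, both roadblocks are historical: recent Python versions compile (?:\s*(?P=rpt))+ just fine, and since Python 3.5 an unmatched group in a replacement is substituted with an empty string. A replacement function also sidesteps the group issue entirely — a minimal sketch, with a hypothetical sample string:
import re

subject = "this is a dog??? see http://example.com/this--is-a-page.html"

pattern = r"""(?ix)
(?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$])
|
(?P<rpt>[^\s\w])(?:\s*(?P=rpt))+"""

def collapse(m):
    # keep a matched URL untouched; otherwise emit the repeated symbol once
    return m.group("URL") or m.group("rpt")

print(re.sub(pattern, collapse, subject))
# this is a dog? see http://example.com/this--is-a-page.html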
Related
I have a multi-line string of around 10,000-40,000 characters (it changes as per the data returned by an API). In this string there are a number of tables (they are part of the string, but formatted in a way that makes them look like a table). The tables always follow a repeating pattern. The pattern looks like this:
==============================================================================
*THE HEADINGS/COLUMN NAMES IN THE TABLE*
------------------------------------------------------------------------------
THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS
I'm trying to display the contents as HTML on a locally hosted webpage, and I want the headings of the tables displayed in a specific way (think color, font size). For that I'm using Python's re module to identify the pattern, but I'm failing to do so due to inexperience with it. To modify the part that I need modified, I'm using the below piece of code:
re.sub(r'\={78}.*\-{78}', some_replacement_string, complete_multi_line_string)
But the above piece of code is not giving me the output I require, since it is not matching the pattern properly (I'm sure the mistake is in the pattern I'm asking re.sub to match).
However:
re.sub(r'\-{78}', some_replacement_string, complete_multi_line_string)
is working, as it returns the string with the replacement, but the slight problem here is that there are multiple ------------------------------------------------------------------------------s in the string that I do not want modified. Please help me out here. If it is helpful, the output I want is something like:
==============================================================================
<span>*THE HEADINGS/COLUMN NAMES IN THE TABLE*</span>
------------------------------------------------------------------------------
THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS
Also, please note that there are newlines (\n) after the ==============================================================================s, the <span>*THE HEADINGS/COLUMN NAMES IN THE TABLE*</span>s and the ------------------------------------------------------------------------------s, if that is helpful in getting to the solution.
The code snippet I'm currently trying to debug, if helpful:
result = re.sub(r'\={78}.*\-{78}', replacement, multi_line_string)
l = result.count('</span>')
print(l)
PS: There are 78 ='s and 78 -'s in all the occurrences.
You should try using the following version:
re.sub(r'(={78})\n(.*?)\n(-{78})', r'\1\n<span>\2</span>\n\3', complete_multi_line_string, flags=re.S)
The changes I made here include:
Match with the lazy dot .*? instead of the greedy dot .*, to ensure that we don't match across header sections
Match with the re.S flag, so that .*? can match across newlines
Put the \n back in the replacement, since the pattern consumes the newlines around the heading line
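Here is a self-contained sketch of that substitution on hypothetical sample data shaped like the question's tables (the 78-character rules are built with "=" * 78 and "-" * 78 for brevity):
import re

# hypothetical sample shaped like the tables in the question
table = ("=" * 78 + "\n"
         + "*THE HEADINGS/COLUMN NAMES IN THE TABLE*\n"
         + "-" * 78 + "\n"
         + "THE DATA IN THE TABLE")

result = re.sub(r'(={78})\n(.*?)\n(-{78})',
                r'\1\n<span>\2</span>\n\3',
                table, flags=re.S)
print(result)
# prints the heading line wrapped in <span>...</span>, rules untouched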
My project (unrelated to this question, just context) is an ML classifier. I'm trying to improve it, and I have found that when I stripped URLs from the text given to it, some of the URLs had been broken by spaces. For example:
https:// twitter.com/username/sta tus/ID
After I remove links that are not broken, I am left with things like www website com. I removed those with the following regular expression in Python:
tweet = re.sub(r'(www|http).*?(org |net |edu |com |be |tt |me |ms )', '', tweet)
I've put a space after each TLD because this happens after the regular strip and text processing (so I'm only working with parts of a URL separated by spaces), and theoretically we should only pick up the remainders of a broken link... not something like
http website strangeTLD .... communication
It's not perfect, but it works. However, I just thought that I might try to preemptively remove URLs from Twitter only, since I know that the spaces that break the regular URL strip will always be in the same places, hoping this improves my classifier's accuracy. This will get rid of the string of characters that occurs after a link... specifically pictures, which make up a lot of my data.
Specifically, is there a way to select the entity surrounding/after:
pic.twitter.com/
or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...
http.*?twitter.com/*?/sta tus/
Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.
Yes, what you are talking about is called a positive lookbehind. It works using (?<=...), where the ellipsis should be replaced by what you want to skip.
E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use
(?<=https:\/\/twitter\.com\/username\/).*
and you will get status/ID, like you can see with this live demo.
In this case I escaped the slashes / with backslashes, as some regex flavors require (Python's re does not, but it is harmless); I also used the Kleene star operator, i.e. the asterisk, to match any number of occurrences of . (any character), just like you did.
What a positive lookbehind combination does is specifying some mandatory text before the current position of your cursor; in other words, it puts the cursor after the expression you feed it (if the said text exists).
Of course this is not enough in your case, since username won't be a fixed string but a variable one. That is a real obstacle, since Python's lookbehinds do not work with variable-length patterns.
So you can just skip https://twitter.com/:
(?<=https:\/\/twitter\.com\/).*
And then, via Python, create a substring
currentText = "username/status/ID"
result = currentText.split("/", 1)[1]  # returns "status/ID"
Test it in this demo (click "Execute"); a simple explanation of how this works is in the answer to this question (in short, you just split the string at the first slash character).
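Putting the two steps together in Python — a minimal sketch, assuming the URL sits inside a longer tweet:
import re

tweet = "check https://twitter.com/username/status/ID for details"

# the fixed-width lookbehind skips the prefix; \S* grabs the rest of the URL
m = re.search(r"(?<=https://twitter\.com/)\S*", tweet)
if m:
    rest = m.group(0)              # "username/status/ID"
    print(rest.split("/", 1)[1])   # "status/ID"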
As a sidenote, blanks/spaces aren't allowed in URLs and if necessary are usually encoded as %20 or + (see e.g. this answer). In other words, every URL you got can be safely stripped of spaces before processing, so... why didn't they do it?
I wrote a script in Python for a custom HTML page that finds a word within a string/line and highlights just that word, using the following tags, where instance is the word being searched for.
<b><font color=\"red\">"+instance+"</font></b>
I need to find a word (case insensitive), let's say "port", within a string; it can appear as port, Port, SUPPORT, Support, support etc., which is easy enough:
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However, my strings often contain 2 or more instances in a single line, and I need to append
<b><font color=\"red\">"+instance+"</font></b> to each of those instances, without changing their case.
The problem with my approach is that I am iterating over each instance found with findall (an exact match), while the same match can also occur multiple times within the string:
for instance in find_all_instances:
    second_pattern = re.compile(instance)
    string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking I would be able to avoid this if I could find out the exact part of the string that pattern.sub substitutes at the moment of doing it,
however I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone has a way I could use to insert <b><font color="red">instance</font></b> without replacing instance for all matches (case insensitive), I would be grateful.
Maybe I'm misinterpreting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
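In sketch form: a single case-insensitive re.sub pass whose replacement re-inserts the exact matched text via \g<0>, so each instance keeps its original casing and nothing gets wrapped twice (sample string is hypothetical):
import re

string_to_search = "The Port supports HTTP and http proxies on every SUPPORT port."
word = "port"

# \g<0> stands for exactly what was matched, preserving its original case;
# a single pass means earlier replacements are never wrapped again
result = re.sub(re.escape(word),
                r'<b><font color="red">\g<0></font></b>',
                string_to_search, flags=re.IGNORECASE)
print(result)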
Okay, so here are two ways I tried quickly! The second loop is definitely the way to go. It uses re.sub (as someone else commented too). Bear in mind that it replaces each match with the lowercase search term.
import re

word = "port"

# THIS LOOP IS CASE-SENSITIVE
with open("testing.txt", "r") as FILE:
    for line in FILE:
        newline = line.replace(word, "<b><font color=\"red\">"+word+"</font></b>")
        print(newline)

# THIS LOOP IS CASE-INSENSITIVE (it replaces with the lowercase search term)
pattern = re.compile(word, re.IGNORECASE)
with open("testing.txt", "r") as FILE:  # reopen; the first loop exhausted the file
    for line in FILE:
        newline = pattern.sub("<b><font color=\"red\">"+word+"</font></b>", line)
        print(newline)
I am writing a web crawler using Scrapy, and as a result I get a set of URLs like these (dummy URLs):
http://matrix.com/en/Zion
http://matrix.com/en/Machine_World
http://matrix.com/en/Matrix:Banner_guidelines
http://matrix.com/en/File:Link_Banner.jpg
http://matrix.com/wiki/en/index.php
In the rules in Scrapy, I want to add a regex that allows ONLY URLs of the kind "http://matrix.com/en/Machine_World" or "http://matrix.com/en/Zion",
i.e. URLs that contain anything outside of the set "http://matrix.com/en/<[a-zA-Z,_]>" must not be allowed.
Constraints:
The string after "/en/" could be of any length, so I cannot ask it to look only at the first 10 or 20 characters. E.g. when I use the regex [a-zA-Z,]{1,20} or [a-zA-Z,]{1,}, it still matches URLs like http://matrix.com/en/Matrix:Banner_guidelines because it finds the "http://matrix.com/en/Matrix" part of the URL a successful match. I want it to look at the string starting after "/en/" until the end of the URL and then apply this rule.
Unfortunately I cannot extract that string and write a sub-routine of any kind. It has to be done using a regex only!
i.e. URLs that contain anything outside of the set "http://matrix.com/en/<[a-zA-Z,_]>" must not be allowed.
Have you tried using this character class in your regex? Looks like you aren't including underscores.
Try
[a-zA-Z,_]+
The plus sign means "one or more", which is the same as {1,}, just a nice shorthand :)
If you want to exclude items with .php or .jpg, feel free to add a $ sign to the end, as so:
[a-zA-Z,_]+$
The $ means "end of line", meaning that your matching sequence must run to the end of the line. As full stops are not included in the character class, those options will be excluded.
Let me know if that works,
Elliott
Reproducible evidence that the suggested regex works:
grep("matrix.com\\/en\\/[a-zA-Z,_]+$", x, perl=TRUE, value=TRUE)
#[1] "http://matrix.com/en/Zion"
#[2] "http://matrix.com/en/Machine_World"
Data
x <- c("http://matrix.com/en/Zion", "http://matrix.com/en/Machine_World",
       "http://matrix.com/en/Matrix:Banner_guidelines",
       "http://matrix.com/en/File:Link_Banner.jpg",
       "http://matrix.com/wiki/en/index.php")
So, here's my question:
I have a crawler that goes and downloads web pages and strips those of URLs (for future crawling). My crawler operates from a whitelist of URLs which are specified in regular expressions, so they're along the lines of:
(http://www.example.com/subdirectory/)(.*?)
...which would allow URLs that followed the pattern to be crawled in the future. The problem I'm having is that I'd like to exclude certain characters in URLs, so that (for example) addresses such as:
(http://www.example.com/subdirectory/)(somepage?param=1&param=5#print)
...in the case above, as an example, I'd like to be able to exclude URLs that feature ?, #, and = (to avoid crawling those pages). I've tried quite a few different approaches, but I can't seem to get it right:
(http://www.example.com/)([^=\?#](.*?))
etc. Any help would be really appreciated!
EDIT: sorry, should've mentioned this is written in Python, and I'm normally fairly proficient at regex (although this has me stumped)
EDIT 2: VoDurden's answer (the accepted one below) almost yields the correct result, all it needs is the $ character at the end of the expression and it works perfectly - example:
(http://www.example.com/)([^=\?#]*)$
(http://www.example.com/)([^=?#]*?)
Should do it; this will allow any URL that does not contain the characters you don't want.
It might, however, be a little hard to extend this approach. A better option is to make the system two-tiered, i.e. one set of matching regexes and one set of blocking regexes, so that only URLs which pass both are allowed. I think this solution will be a bit more transparent and flexible.
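A minimal sketch of such a two-tiered filter (the patterns and names here are illustrative, not taken from the question):
import re

# allow-list: a URL must match at least one of these
allow = [re.compile(r'^http://www\.example\.com/subdirectory/')]
# block-list: a URL must match none of these
block = [re.compile(r'[=?#]')]

def is_allowed(url):
    return (any(p.search(url) for p in allow)
            and not any(p.search(url) for p in block))

print(is_allowed("http://www.example.com/subdirectory/index.php"))     # True
print(is_allowed("http://www.example.com/subdirectory/page?param=1"))  # False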
This expression should be what you're looking for:
(http://www.example.com/subdirectory/)([^=?#]*)$
[^=?#] will match anything except the characters you specified.
For example:
http://www.example.com/subdirectory/ Match
http://www.example.com/subdirectory/index.php Match
http://www.example.com/subdirectory/somepage?param=1&param=5#print No Match
http://www.example.com/subdirectory/index.php?param=1 No Match
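A quick way to verify those four cases in Python (a minimal sketch):
import re

pattern = re.compile(r'(http://www\.example\.com/subdirectory/)([^=?#]*)$')

tests = [
    "http://www.example.com/subdirectory/",
    "http://www.example.com/subdirectory/index.php",
    "http://www.example.com/subdirectory/somepage?param=1&param=5#print",
    "http://www.example.com/subdirectory/index.php?param=1",
]
for url in tests:
    print(url, "->", "Match" if pattern.match(url) else "No Match")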
You will need to crawl the pages up to ?param=1&param=5,
because normally param=1 and param=2 could give you completely different web pages.
Pick any WordPress website to confirm that.
Try this one; it will match everything just before the # char:
(http://www.example.com/)([^#]*?)
I'm not sure what you want. If you want to match anything that doesn't contain any ?, #, or =, then the regex is
([^=?#]*)
As an alternative, there's always the urlparse module, which is designed for parsing URLs:
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

urls = [
    'http://www.example.com/subdirectory/',
    'http://www.example.com/subdirectory/index.php',
    'http://www.example.com/subdirectory/somepage?param=1&param=5#print',
    'http://www.example.com/subdirectory/index.php?param=1',
]

for url in urls:
    # keep only URLs without a query string
    if not urlparse(url).query:
        print(url)
Provides the following:
http://www.example.com/subdirectory/
http://www.example.com/subdirectory/index.php