I am writing a web crawler using Scrapy and as a result I get a set of URLs like: [Dummy URLs]
http://matrix.com/en/Zion
http://matrix.com/en/Machine_World
http://matrix.com/en/Matrix:Banner_guidelines
http://matrix.com/en/File:Link_Banner.jpg
http://matrix.com/wiki/en/index.php
In the Scrapy rules, I want to add a regex that allows URLs ONLY of the kind "http://matrix.com/en/Machine_World" or "http://matrix.com/en/Zion",
i.e. URLs that contain anything outside of the set "http://matrix.com/en/<[a-zA-Z,_]>" must not be allowed.
Constraints:
The string after "/en/" could be of any length, so I cannot ask it to look only at the first 10 or 20 characters. E.g. when I use the regex [a-zA-Z,]{1,20} or [a-zA-Z,]{1,}, it still matches URLs like http://matrix.com/en/Matrix:Banner_guidelines because it finds the "http://matrix.com/en/Matrix" part of the URL a successful match. I want it to look at the string starting after "/en/" till the end of the URL and only then apply this rule.
Unfortunately I cannot extract that string and write a subroutine of any kind. It has to be done using a regex only!
i.e. URLs that contain anything outside of the set "http://matrix.com/en/<[a-zA-Z,_]>" must not be allowed.
Have you tried using this character class in your regex? Looks like you aren't including underscores.
Try
[a-zA-Z,_]+
The plus sign means "one or more" - which is the same as {1,} just a nice shorthand :)
If you want to exclude items ending in .php or .jpg, feel free to add a $ sign to the end, like so:
[a-zA-Z,_]+$
The $ means "end of line", meaning that your matching sequence must run to the end of the line. As full stops are not included in the character class, those options will be excluded.
Let me know if that works,
Elliott
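If it helps, here is a rough sketch of how that anchored pattern could be wired into a Scrapy CrawlSpider rule. The spider name, domain, and callback below are placeholders, not from the question; LinkExtractor's allow= patterns are matched with re.search semantics, so the ^ and $ anchors matter:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MatrixSpider(CrawlSpider):
    name = "matrix"
    allowed_domains = ["matrix.com"]
    start_urls = ["http://matrix.com/en/Zion"]

    rules = (
        # only follow links whose entire path after /en/ consists of
        # letters, commas, or underscores
        Rule(LinkExtractor(allow=[r"^http://matrix\.com/en/[a-zA-Z,_]+$"]),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url}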
Reproducible evidence (in R) that the suggested regex works:
grep("matrix.com\\/en\\/[a-zA-Z,_]+$", x, perl=TRUE, value=TRUE)
#[1] "http://matrix.com/en/Zion"
#[2] "http://matrix.com/en/Machine_World"
Data
x <- c("http://matrix.com/en/Zion", "http://matrix.com/en/Machine_World",
"http://matrix.com/en/Matrix:Banner_guidelines",
"http://matrix.com/en/File:Link_Banner.jpg",
"http://matrix.com/wiki/en/index.php")
Related
My project (unrelated to this question, just context) is an ML classifier. I'm trying to improve it, and I have found that when I stripped URLs from the text given to it, some of the URLs had been broken by spaces. For example:
https:// twitter.com/username/sta tus/ID
After I remove links that are not broken, I am left with things like www website com. I removed those with the following regular expression in Python:
tweet = re.sub('(www|http).*?(org |net |edu |com |be |tt |me |ms )', '', tweet)
I've put a space after each TLD because this happens after the regular strip and text processing (so I'm only working with parts of a URL separated by spaces), and theoretically we should only pick up the remainders of a broken link... not something like
http website strangeTLD .... communication
It's not perfect, but it works. However, I just thought that I might try to preemptively remove URLs from Twitter only, since I know that the spaces that break the regular URL strip will always be in the same places, hoping this improves my classifier accuracy. This would get rid of the string of characters that occurs after a link... specifically pictures, which make up a lot of my data.
Specifically, is there a way to select the entity surrounding/after:
pic.twitter.com/
or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...
http.*?twitter.com/*?/sta tus/
Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.
Yes, what you are talking about is called Positive Lookbehind and works using (?<=...), where the ellipsis should be replaced by what you want to skip.
E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use
(?<=https:\/\/twitter\.com\/username\/).*
and you will get status/ID.
In this case I escaped the slashes / with backslashes, as some regex flavors require (Python's re does not, but the escapes are harmless); I also used the Kleene star operator, i.e. the asterisk, to match any run of . (any character), just like you did.
What a positive lookbehind does is specify some mandatory text before the current position of your cursor; in other words, it puts the cursor after the expression you feed it (if said text exists).
Of course this is not enough in your case, since username won't be a fixed string but a variable one. This is an additional complication, because lookbehinds (in Python's re module) do not work with variable-length patterns.
So you can just skip the fixed https://twitter.com/ part instead:
(?<=https:\/\/twitter\.com\/).*
And then, via Python, create a substring
currentText = "username/status/ID"
result = currentText.split("/",1)[1] # returns status/ID
In short, this splits the string at the first slash character and keeps everything after it.
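Putting the lookbehind and the split together, a minimal Python sketch (the sample URL is invented):
import re

tweet = "https://twitter.com/username/status/123456"
match = re.search(r"(?<=https://twitter\.com/).*", tweet)
if match:
    print(match.group(0).split("/", 1)[1])  # status/123456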
As a side note, blanks/spaces aren't allowed in URLs and are usually encoded as %20 or +. In other words, every URL you got could have been safely stripped of spaces before processing, so... why didn't they do it?
I'm using LanguageTool (LT) with the --xmlfilter option enabled to spell-check HTML files. This forces LanguageTool to strip all tags before running the spell check.
This also means that all reported character positions are off because LT doesn't "see" the tags.
For example, if I check the following HTML fragment:
<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
LanguageTool will treat it as a plain text sentence:
This is kind of a stupid question.
and returns the following message:
<error category="Grammar" categoryid="GRAMMAR" context=" This is kind of a stupid question. " contextoffset="24" errorlength="9" fromx="8" fromy="8" locqualityissuetype="grammar" msg="Don't include 'a' after a classification term. Use simply 'kind of'." offset="24" replacements="kind of" ruleId="KIND_OF_A" shortmsg="Grammatical problem" subId="1" tox="17" toy="8"/>
(In this particular example, LT has flagged "kind of a.")
Since the search string might be wrapped in tags and might occur multiple times, I can't do a simple index search.
What would be the most efficient Python solution to reliably locate any given text string in an HTML file? (LT returns an approximate character position, which might be off by 10-30% depending on the number of tags, as well as the words before and after the flagged word(s).)
I.e. I'd need to do a search that ignores all tags, but includes them in the character position count.
In this particular example, I'd have to locate "kind of a" and find the location of the letter k in:
kin<b>d</b> o<i>f</i> a
This may not be the speediest way to go, but pyparsing will recognize HTML tags in most forms. The following code inverts the typical scan, creating a scanner that will match any single character, and then configuring the scanner to skip over HTML open and close tags, and also common HTML '&xxx;' entities. pyparsing's scanString method returns a generator that yields the matched tokens, the starting, and the ending location of each match, so it is easy to build a list that maps every character outside of a tag to its original location. From there, the rest is pretty much just ''.join and indexing into the list. See the comments in the code below:
test = "<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"
from pyparsing import Word, printables, anyOpenTag, anyCloseTag, commonHTMLEntity
non_tag_text = Word(printables+' ', exact=1).leaveWhitespace()
non_tag_text.ignore(anyOpenTag | anyCloseTag | commonHTMLEntity)
# use scanString to get all characters outside of tags, and build list
# of (char,loc) tuples
char_locs = [(t[0], loc) for t,loc,endloc in non_tag_text.scanString(test)]
# imagine a world without HTML tags...
untagged = ''.join(ch for ch, loc in char_locs)
# look for our string in the untagged text, then index into the char,loc list
# to find the original location
search_str = 'kind of a'
orig_loc = char_locs[untagged.find(search_str)][1]
# print the test string, and mark where we found the matching text
print(test)
print(' '*orig_loc + '^')
"""
Should look like this:
<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
^
"""
The --xmlfilter option is deprecated because of issues like this. The proper solution is to remove the tags yourself but keep the positions so you have a mapping to correct the results that come back from LT. When using LT from Java, this is supported by AnnotatedText, but the algorithm should be simple enough to port it. (full disclosure: I'm the maintainer of LT)
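For anyone wanting to port that idea to Python, here is a minimal sketch of the mapping approach. This is not LT's AnnotatedText API, just an illustration; the naive <[^>]+> tag regex is fine for simple fragments like the one above, but not for arbitrary HTML:
import re

def strip_tags_with_map(html):
    """Return (plain_text, mapping) where mapping[i] is the offset in
    html of plain-text character i."""
    mapping, chars, pos = [], [], 0
    for m in re.finditer(r'<[^>]+>', html):
        for i in range(pos, m.start()):   # copy the text before this tag
            mapping.append(i)
            chars.append(html[i])
        pos = m.end()                     # skip over the tag itself
    for i in range(pos, len(html)):       # copy any trailing text
        mapping.append(i)
        chars.append(html[i])
    return ''.join(chars), mapping

html = "<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"
plain, mapping = strip_tags_with_map(html)
offset = plain.find("kind of a")   # position in the untagged text (8)
print(mapping[offset])             # corresponding offset in the HTML (11)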
I am using Beautiful Soup to identify a specific tag and its contents. The contents are html-links and I want to extract the text of these tags.
The problem is that the text is made up of different numbers according to a specific pattern. I am only interested in number such as "61993J0417" and "61991CJ0316" and I need the regexp to match both when the number has a "J" and "CJ" in the middle.
I have used this code to achieve this:
soup.find_all(text=re.compile('[6][1-2][0-9]{3}[J]|[CJ][0-9]{4}'))
The soup variable is the contents of the specific tag. This code works in 9 out of 10 cases. However, when I run this script on one of my source files, it also matches numbers such as "51987PC0716".
I cannot understand why, so I turn to you for assistance.
You haven't specified what the | applies to; by default it's the entire regex, meaning you have asked for either
[6][1-2][0-9]{3}[J]
(which is the same thing as 6[12][0-9]{3}J) or
CJ[0-9]{4}
(not [CJ], which means "either C or J"). Use parentheses to specify what the alternatives are:
^6[12][0-9]{3}(J|CJ)[0-9]{4}$
which is better written as
^6[12][0-9]{3}C?J[0-9]{4}$
IIUC, you always have a "J" inside your string.
Therefore, make it obligatory, and make the "C" optional, using a question mark.
Something like:
re.compile('6[1-2][0-9]{3}C?J[0-9]{4}')
I have not tested this, but you probably can continue from here by yourself.
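A quick way to check the suggested pattern (using fullmatch so that a partial hit inside a longer number cannot slip through):
import re

pattern = re.compile(r'6[1-2][0-9]{3}C?J[0-9]{4}')
for s in ["61993J0417", "61991CJ0316", "51987PC0716"]:
    print(s, bool(pattern.fullmatch(s)))
# 61993J0417 True
# 61991CJ0316 True
# 51987PC0716 False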
I want to replace consecutive symbols with just one, such as:
this is a dog???
to
this is a dog?
I'm using
text = re.sub(r"([^\s\w])(\s*\1)+", r"\1", text)
(note the raw strings: in a plain string literal, "\1" is the octal escape for chr(1), which silently breaks the backreference),
however, I notice that this might replace symbols in URLs that happen to appear in my text,
like http://example.com/this--is-a-page.html
Can someone give me some advice how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
r"""(?ix) # case-insensitive, verbose regex
# Either match a URL
# (protocol optional (if so, URL needs to start with www or ftp))
(?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
# or
|
# match repeated non-word characters
(?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
# and replace with both captured groups (one will always be empty)
r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
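For what it's worth, the unmatched-group complaint is version-dependent: since Python 3.5, re.sub() substitutes an empty string for unmatched groups instead of raising an error. Here is a sketch along those lines that also sidesteps the repetition issue by dropping the optional whitespace between repeated symbols (the sample text is invented):
import re

text = "is this a dog??? see http://example.com/this--is-a-page.html !!!"

pattern = r"""(?ix)                # case-insensitive, verbose
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
        [-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    |
    (?P<rpt>[^\s\w])(?P=rpt)+      # adjacent runs of a repeated symbol
"""

# one of the two groups is always empty; Python 3.5+ substitutes ""
print(re.sub(pattern, r"\g<URL>\g<rpt>", text))
# is this a dog? see http://example.com/this--is-a-page.html !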
So, here's my question:
I have a crawler that goes and downloads web pages and strips those of URLs (for future crawling). My crawler operates from a whitelist of URLs which are specified in regular expressions, so they're along the lines of:
(http://www.example.com/subdirectory/)(.*?)
...which would allow URLs that followed the pattern to be crawled in the future. The problem I'm having is that I'd like to exclude certain characters in URLs, so that (for example) addresses such as:
(http://www.example.com/subdirectory/)(somepage?param=1&param=5#print)
...in the case above, as an example, I'd like to be able to exclude URLs that feature ?, #, and = (to avoid crawling those pages). I've tried quite a few different approaches, but I can't seem to get it right:
(http://www.example.com/)([^=\?#](.*?))
etc. Any help would be really appreciated!
EDIT: sorry, should've mentioned this is written in Python, and I'm normally fairly proficient at regex (although this has me stumped)
EDIT 2: VoDurden's answer (the accepted one below) almost yields the correct result, all it needs is the $ character at the end of the expression and it works perfectly - example:
(http://www.example.com/)([^=\?#]*)$
(http://www.example.com/)([^=?#]*?)
Should do it; this will allow any URL that does not contain the characters you don't want.
It might however be a little hard to extend this approach. A better option is to have the system work two-tiered, i.e. one set of matching regexes and one set of blocking regexes. Then only URLs which pass both of these will be allowed. I think this solution will be a bit more transparent and flexible.
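A rough sketch of that two-tiered idea in Python (the pattern lists here are invented placeholders):
import re

allow_patterns = [re.compile(r'^http://www\.example\.com/subdirectory/')]
block_patterns = [re.compile(r'[=?#]')]

def is_crawlable(url):
    # must match at least one allow pattern and no block pattern
    return (any(p.search(url) for p in allow_patterns)
            and not any(p.search(url) for p in block_patterns))

print(is_crawlable('http://www.example.com/subdirectory/index.php'))    # True
print(is_crawlable('http://www.example.com/subdirectory/page?param=1')) # False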
This expression should be what you're looking for:
(http://www.example.com/subdirectory/)([^=?#]*)$
[^=?#] will match anything except for the characters you specified.
For Example:
http://www.example.com/subdirectory/ Match
http://www.example.com/subdirectory/index.php Match
http://www.example.com/subdirectory/somepage?param=1&param=5#print No Match
http://www.example.com/subdirectory/index.php?param=1 No Match
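For the record, those four cases can be verified with a quick snippet:
import re

pattern = re.compile(r'(http://www\.example\.com/subdirectory/)([^=?#]*)$')
tests = [
    'http://www.example.com/subdirectory/',
    'http://www.example.com/subdirectory/index.php',
    'http://www.example.com/subdirectory/somepage?param=1&param=5#print',
    'http://www.example.com/subdirectory/index.php?param=1',
]
for url in tests:
    print(url, '->', 'Match' if pattern.search(url) else 'No Match')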
You may actually need to crawl the pages up to ?param=1&param=5,
because param=1 and param=2 could give you completely different web pages.
Pick any WordPress website to confirm that.
Try this one; it will match everything just before the # character:
(http://www.example.com/)([^#]*?)
I'm not sure what you want. If you want to match anything that doesn't contain any ?, #, or =, then the regex is
([^=?#]*)
As an alternative, there's always the urlparse module (urllib.parse in Python 3), which is designed for parsing URLs.
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

urls = [
    'http://www.example.com/subdirectory/',
    'http://www.example.com/subdirectory/index.php',
    'http://www.example.com/subdirectory/somepage?param=1&param=5#print',
    'http://www.example.com/subdirectory/index.php?param=1',
]

for url in urls:
    # keep only URLs with an empty query string
    if not urlparse(url).query:
        print(url)
Provides the following:
http://www.example.com/subdirectory/
http://www.example.com/subdirectory/index.php