any string could be inside amongst the slashes, but there must be only 3 divisions. For instance,
Values that should match:
"90/90/90/9090"
"FDSAFDSA/90/pppppaA3/9090"
Values that should not match:
"90/90/90/9090/90"
"FDSAFDSA/90/pppppaA3/9090/90"
I am using python and the re library, I have tried lots of combination but none of them worked:
bool(re.match(r'^.*\/.*\/.*\/((?!\/).)*$', "90/90/90/9090/90"))
bool(re.match(r'^.*\/.*\/.*\/((?!/).)*$', "90/90/90/9090/90"))
bool(re.match(r'^.*\/.*\/.*\/(?!(/)$).*$', "90/90/90/9090/90"))
bool(re.match(r'^.*\/.*\/.*\/(/).*$', "90/90/90/90/90"))
bool(re.match(r'^.*\/.*\/.*\/.*(\/)$', "90/90/90/90/90"))
You ought to use negated character classes :
^[^/]*/[^/]*/[^/]*/[^/]*$
[^/] matches any character but /, so we represent a string that contains three / and anything else around them.
In your regexes, the . could match anything including /, and while you could have approximated an equivalent of negated character classes using negative lookarounds, you didn't properly apply them to every . you had : ^((?!\/).)*\/((?!\/).)*\/((?!\/).)*\/((?!\/).)*$ would have worked too, although it would have been less performant.
And there's no need to escape those /, they aren't regex meta-characters. You might need to escape them in languages or tools that use / as delimiters, such as JavaScript's /pattern/ syntax or sed's s/search/replace/ substitutions.
Related
So this is the link I have to extract:
http://www.hrmagazine.co.uk/article-details/finance-sector-dominates-working-families-benchmark
And this is what I have currently
.+\/article-details\/.+\-.+\-.+\-.+\-.+\-.+$
The issue, however, is it extracts any number of words and hyphens after the "/article-details/" part, rather than specifically 6 word titles with hyphens replacing the spaces above. So it would accept a bad result
http://www.hrmagazine.co.uk/article-details/finance-sector-dominates-working-families-benchmark-test
When I need it to only accept links like this format
http://www.hrmagazine.co.uk/article-details/one-two-three-four-five-six
What's the correct regular expression for this type of website? The current extractor I have in Scrapy/Spyder is the following
rules = (Rule(LinkExtractor(allow=['.+\/article-details\/.+\-.+\-.+\-.+\-.+\-.+$']), callback='parse_item', follow=True),)
Each of those .+ in your regex can match any number of ANY character - including hyphens. So your overall regex is just requiring a minimum of 5 hyphens, not an exact count. Use [^-]+ to match only non-hyphen characters.
Note that none of those backslashes in your regex are accomplishing anything - in no case is the following character something requiring escaping. Even if they were, you'd need to double the backslashes, or use a raw string r'whatever', so that the backslashes are being interpreted by the re module, rather than Python's string literal parsing rules.
Try replacing the . with something like [a-z]; . will also match hyphens, which is why its matching an unlimited number of words:
.+\/article-details\/[a-z]+\-[a-z]+\-[a-z]+\-[a-z]+\-[a-z]+\-[a-z]+$
If you need to match things like numbers, add them to the brackets as well ([a-z0-9], etc.).
I am trying to see which statements the following pattern matches:
\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+
I am a little confused because the grouping characters () have a \ before it. Does this mean that the statement must have a ( and )? Would that mean the statements without ( or ) be unmatched?
Statements:
'4046782347'
'(123)1247890'
'456900900'
'(678)2001236'
'4041231234'
'(404123123'
Context is important:
re.match(r'\(', content) matches a literal parenthesis.
re.match(r'\(*', content) matches 0 or more literal parentheses, thus making the parens optional (and allowing more than one of them, but that's clearly a bug).
Since the intended behavior isn't "0 or more" but rather "0 or 1", this should probably be written r'\(?' instead.
That said, there's a whole lot about this regex that's silly. I'd consider instead:
[(]?\d{3}[)]?-?\d{6,}
Using [(]? avoids backslashes, and consequently is easier to read whether it's rendered by str() or repr() (which escapes backslashes).
Mixing [0-9] and \d is silly; better to pick one and stick with it.
Using * in place of ? is silly, unless you really want to match (((123))456-----7890.
\d{3}\d\d\d+ matches three digits, then three or more additional digits. Why not just match six or more digits in the first place?
Normally, the parentheses would act as grouping characters, however regex metacharacters are reduced simply to the raw characters when preceded by a backslash. From the Python docs:
As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
In your case, the statements don't need parentheses in order to match, as each \( and \) in the expression is followed by a *, which means that the previous character can be matched any number of times, including none at all. From the Python docs:
* doesn’t match the literal character *; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.
Thus the statements with or without parentheses around the first 3 digits may match.
Source: https://docs.python.org/2/howto/regex.html
I have a question about regular expression sub in python. So, I have some lines of code and what I want is to replace all floating point values eg: 2.0f,-1.0f...etc..to doubles 2.0,-1.0. I came up with this regular expression '[-+]?[0-9]*\.?[0-9]+f' and it finds what I need but I am not sure how to replace it?
so here's what I have:
# check if floating point value exists
if re.findall('[-+]?[0-9]*\.?[0-9]+f', line):
line = re.sub('[-+]?[0-9]*\.?[0-9]+f', ????? ,line)
I am not sure what to put under ????? such that it will replace what I found in '[-+]?[0-9]*\.?[0-9]+f' without the char f in the end of the string.
Also there might be more than one floating point values, which is why I used re.findall
Any help would be great. Thanks
Capture the part of the text you want to save in a capturing group and use the \1 substitution operator:
line = re.sub(r'([-+]?[0-9]*\.?[0-9]+)f', r'\1' ,line)
Note that findall (or any kind of searching) is unnecessary since re.sub will look for the pattern itself and return the string unchanged if there are no matches.
Now, for several regular expression writing tips:
Always use raw strings (r'...') for regular expressions and substitution strings, otherwise you will need to double your backslashes to escape them from Python's string parser. It is only by accident that you didn't need to do this for \., since . is not part of an escape sequence in Python strings.
Use \d instead of [0-9] to match a digit. They are equivalent, but \d is easier to recognize for "digit", while [0-9] needs to be visually verified.
Your regular expression will not recognize 10.f, which is likely a valid decimal number in your input. Matching floating-point numbers in various formats is trickier than it seems at first, but simple googling will reveal many reasonably complete solutions for this.
The re.X flag will allow you to add arbitrary whitespace and even comments to your regexp. With small regexps that can seem downright silly, but for large expressions the added clarity is a life-saver. (Your regular expression is close to the threshold.)
Here is an example of an extended regular expression that implements the above style tips:
line = re.sub(r'''
( [-+]?
(?: \d+ (?: \.\d* )? # 12 or 12. or 12.34
|
\.\d+ # .12
)
) f''',
r'\1', line, flags=re.X)
((?:...) is a non-capturing group, only used for precedence.)
This is my goto reference for all things regex.
http://www.regular-expressions.info/named.html
The result should be something like:
line = re.sub('(<first>[-+]?[0-9]*\).?[0-9]+f', '\g<first>', line)
Surround the part of the regex you want to "keep" in a "capture group", e.g.
'([-+]?[0-9]*\.?[0-9]+)f'
^ ^
And then you can refer to these capture groups using \1 in your substitution:
r'\1'
For future reference, you can have many capture groups, i.e. \2, \3, etc. by order of the opening parentheses.
Special sequences (character classes) in Python RegEx are escapes like \w or \d that matches a set of characters.
In my case, I need to be able to match all alpha-numerical characters except numbers.
That is, \w minus \d.
I need to use the special sequence \w because I'm dealing with non-ASCII characters and need to match symbols like "Æ" and "Ø".
One would think I could use this expression: [\w^\d] but it doesn't seem to match anything and I'm not sure why.
So in short, how can I mix (add/subtract) special sequences in Python Regular Expressions?
EDIT: I accidentally used [\W^\d] instead of [\w^\d]. The latter does indeed match something, including parentheses and commas which are not alpha-numerical characters as far as I'm concerned.
You can use r"[^\W\d]", ie. invert the union of non-alphanumerics and numbers.
You cannot subtract character classes, no.
Your best bet is to use the regex project, which offers additional functionality while remaining backwards compatible with the re module in in the standard library. It supports character classes based on Unicode properties:
\p{IsAlphabetic}
This will match any character that the Unicode specification states is an alphabetic character.
Even better, regex does support character class subtraction; it views such classes as sets and allows you to create a difference with the -- operator:
[\w--\d]
matches everything in \w except anything that also matches \d.
You can exclude classes using a negative lookahead assertion, such as r'(?!\d)[\w]' to match a word character, excluding digits. For example:
>>> re.search(r'(?!\d)[\w]', '12bac')
<_sre.SRE_Match object at 0xb7779218>
>>> _.group(0)
'b'
To exclude more than one group, you can use the usual [...] syntax in the lookahead assertion, for example r'(?![0-5])[\w]' would match any alphanumeric character except for digits 0-5.
As with [...], the above construct matches a single character. To match multiple characters, add a repetition operator:
>>> re.search(r'((?!\d)[\w])+', '12bac15')
<_sre.SRE_Match object at 0x7f44cd2588a0>
>>> _.group(0)
'bac'
I don't think you can directly combine (boolean and) character sets in a single regex, whether one is negated or not. Otherwise you could simply have combined [^\d] and \w.
Note: the ^ has to be at the start of the set, and applies to the whole set. From the docs: "If the first character of the set is '^', all the characters that are not in the set will be matched.".
Your set [\w^\d] tries to match an alpha-numerical character, followed by a caret, followed by a digit. I can imagine that doesn't match anything either.
I would do it in two steps, effectly combining the regular expressions. First match by non-digits (inner regex), then match by alpha-numerical characters:
re.search('\w+', re.search('([^\d]+)', s).group(0)).group(0)
or variations to this theme.
Note that would need to surround this with a try: except: block, as it will throw an AttributeError: 'NoneType' object has no attribute 'group' in case one of the two regexes fails. But you can, of course, split this single line up in a few more lines.
consider this string
prison break: proof of innocence (2006) {abduction (#1.10)}
i just want to know whether there is (# floating point value )} in the string or not
i tried few regular expressions like
re.search('\(\#+\f+\)\}',xyz)
and
re.search('\(\#+(\d\.\d)+\)\}',xyz)
nothing worked though...can someone suggest me something here
Try r'\(#\d+\.\d+\)\}'
The (, ), ., and } are all special metacharacters, that's why they're preceded by \, so they're matched literally instead.
You also need to apply the + repetition at the right element. Here it's attached to the \d -- the shorthand for digit character class -- to mean that only the digits can appear one-or-more times.
The use of r'raw string literals' makes it easier to work with regex patterns because you don't have to escape backslashes excessively.
See also
What exactly do u and r string flags in Python do, and what are raw string literals?
Variations
For instructional purposes, let's consider a few variations. This will show a few basic features of regex. Let's first consider one of the attempted patterns:
\(\#+(\d\.\d)+\)\}
Let's space out the parts for readability:
\( \#+ ( \d \. \d )+ \) \}
\__________/
this is one group, repeated with +
So this pattern matches:
A literal (, followed by one-or-more #
Followed by one-or-more of:
A digit, a literal dot, and a digit
Followed by a literal )}
Thus, the pattern will match e.g. (###1.23.45.6)} (as seen on rubular.com). Obviously this is not the pattern we want.
Now let's try to modify the solution pattern and say that perhaps we also want to allow just a sequence of digits, without the subsequent period and following digits. We can do this by grouping that part (…), and making it optional with ?.
BEFORE
\(#\d+\.\d+\)\}
\___/
let's make this optional! (…)?
AFTER
\(#\d+(\.\d+)?\)\}
Now the pattern matches e.g. (#1.23)} as well as e.g. (#666)} (as seen on rubular.com).
References
regular-expressions.info - Optional, Brackets for Grouping
"Escape everything" and use raw-literal syntax for safety:
>>> s='prison break: proof of innocence (2006) {abduction (#1.10)}'
>>> re.search(r'\(\#\d+\.\d+\)\}', s)
<_sre.SRE_Match object at 0xec950>
>>> _.group()
'(#1.10)}'
>>>
This assumes that by "floating point value" you mean "one or more digits, a dot, one or more digits", and is not tolerant of other floating point syntax variations, multiple hashes (which you appear from your RE patterns to want to support but don't mention in your Q's text), arbitrary whitespace among the relevant parts (again, unclear from your Q whether you need it), ... -- some issues can be adjusted pretty easily, others "not so much" (it's particularly hard to guess what gamut of FP syntax variations you want to support, for example).