Python regex: using or statement - python

I may not be saying this right (I'm a total regex newbie). Here's the code I currently have:
bugs.append(re.compile("^(\d+)").match(line).group(1))
I'd like to add to the regex so it looks at either '\d+' (starts with digits) or that it starts with 2 capital letters and contains a '-' before the first whitespace. I have the regex for the capital letters:
^[A-Z]{2,}
but I'm not sure how to add the '-' and then make an OR with the \d+. Does this make sense? Thanks!

The way to do an OR in regexps is with the "alternation" or "pipe" operator, |.
For example, to match either one or more digits, or two or more capital letters:
^(\d+|[A-Z]{2,})
Debuggex Demo
You may or may not sometimes need to add/remove/move parentheses to get the precedence right. The way I've written it, you've got one group that captures either the digit string or the capitals. While you're learning the rules (in fact, even after you've learned the rules) it's helpful to look at a regular expression visualizer/debugger like the one I used.
Your rule is slightly more complicated: you want 2 or more capital letters, and a hyphen before the first space. That's a bit hard to write as is, but if you change it to two or more capital letters, zero or more non-space characters, and a hyphen, that's easy:
^(\d+|[A-Z]{2,}\S*?-)
Debuggex Demo
(Notice the \S*?—that means we're going to match as few characters as possible, instead of as many as possible, so we'll only match up to the first hyphen in THIS-IS-A-TEST instead of up to the last. If you want the other one, just drop the ?.)
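Here is a minimal sketch of how that last pattern might be dropped into the asker's code. The helper name leading_token and the sample lines are made up for illustration. One thing worth guarding against: .match() returns None when neither alternative matches, so calling .group(1) on it directly would raise AttributeError.

```python
import re

# Either leading digits, or 2+ capitals followed (lazily) by non-space
# characters up to the first hyphen.
pattern = re.compile(r"^(\d+|[A-Z]{2,}\S*?-)")

def leading_token(line):
    """Return the captured prefix, or None when the line does not match."""
    m = pattern.match(line)
    return m.group(1) if m else None

# Hypothetical sample lines, not taken from the question.
samples = [
    "12345 fixed crash",
    "THIS-IS-A-TEST passes",
    "AB 12-3 hyphen comes after the space",
    "ab-12 lowercase prefix",
]
results = [leading_token(s) for s in samples]
print(results)
```

Lines that match neither alternative come back as None, so the caller can decide whether to skip them.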

Write | for "or". For a sequence of zero or more non-whitespace characters, write \S*.
re.compile(r'^(\d+|[A-Z][A-Z]\S*-\s)')

re.compile(r"""
    ^           # beginning of the line
    (?:         # non-capturing group; do not return this group in .group()
      (\d+)     # one or more digits, captured as a group
      |         # or
      [A-Z]{2}  # exactly two uppercase letters
      \S*       # any number of non-whitespace characters
      -         # the dash you wanted
    )           # end of the non-capturing group
    """,
    re.X)       # enable comments in the regex
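A quick sketch of what this verbose pattern returns, since only the digits alternative sits inside a capturing group. The sample strings are invented. If the original .group(1) call is kept, the letters branch yields None there and the matched text is only available through .group(0).

```python
import re

pattern = re.compile(r"""
    ^               # beginning of the line
    (?:
        (\d+)       # group 1: leading digits
      |
        [A-Z]{2}    # exactly two uppercase letters (not captured)
        \S*-        # non-space characters up to a hyphen
    )
""", re.X)

digits_match = pattern.match("42 is the answer")
letters_match = pattern.match("AB-123 something")

digits_group = digits_match.group(1)     # text captured by (\d+)
letters_group = letters_match.group(1)   # group 1 did not take part in the match
letters_whole = letters_match.group(0)   # the whole matched prefix
print(digits_group, letters_group, letters_whole)
```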

Related

Regex to match (French) numbers

I'm trying to find a simple (not perfect) pattern to recognise French numbers in a French text. French numbers use comma for the Anglo-Saxon decimal, and use dot or space for the thousand separator. \u00A0 is non-breaking space, also often used in French documents for the thousand separator.
So my first attempt is:
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d', flags=re.UNICODE)
... but the trouble is that this doesn't then match a single digit.
But if I do this
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d?', flags=re.UNICODE)
it then picks up trailing space (or NBS) characters (or for that matter a trailing comma or full stop).
The thing is, the pattern must both START and END with a digit, but it is possible that these may be the SAME character.
How might I achieve this? I considered a two-stage process where you try to see whether this is in fact a single-digit number... but that in itself is not trivial: if followed by a space, NBS, comma or dot, you then have to see whether the character after that, if there is one, is or is not a digit.
Obviously I'm hoping to find a solution which involves only one regex: if there is only one regex, it is then possible to do something like:
doubled_dollars_plain_text = plain_text.replace('$', '$$')
substituted_plain_text = re.sub(number_pattern, '$number', doubled_dollars_plain_text)
... having to use a two-stage process would make this much more lengthy and fiddly.
Edit
I tried to see whether I could implement ThierryLathuille's idea, so I tried:
re.compile(r'(\d(?:[\d\., \u00A0]*\d)?)', flags=re.UNICODE)
... this seems to work pretty well. Unlike JvdV's solution it doesn't attempt to check that thousand separators are followed by 3 digits, and for that matter you could have a succession of commas and spaces in the middle and it would still pass, which is quite problematic when you have a list of numbers separated by ", ". But it's acceptable for certain purposes... until something more sophisticated can be found.
I wonder if there's a way of saying "any non-digit in this pattern must be on its own" (i.e. must be bracketed between two digits)?
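For what it's worth, here is a small sketch of how the optional-tail pattern from the edit behaves. The sample sentences are made up, and the outer capturing group is left out so that findall returns the whole match. It shows both the single-digit case working and the list-swallowing weakness mentioned above.

```python
import re

# A digit, optionally followed by a run of digits/separators that must
# itself end in a digit.
number_pattern = re.compile(r"\d(?:[\d., \u00A0]*\d)?")

# Single digits and a spaced/comma number are picked up without trailing
# punctuation or spaces.
found = number_pattern.findall("Il y a 7 pommes, 1 234,56 euros et 3.")

# A non-breaking space as thousand separator is accepted too.
nbsp = number_pattern.findall("2\u00A0500")

# Weakness: a comma-separated list is treated as one "number".
swallowed = number_pattern.findall("1, 2, 3")

print(found, nbsp, swallowed)
```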
What about:
\d{1,3}(?:[\s.]?\d{3})*(?:,\d+)?(?!\d)
See an online demo
\d{1,3} - 1-3 digits.
(?: - Open 1st non-capture group:
[\s.]? - An optional whitespace or literal dot. Note that with unicode \s should match \p{Z} to include the non-breaking whitespace.
\d{3} - Three digits.
)* - Close 1st non-capture group and match 0+ times.
(?:,\d+)? - A 2nd optional non-capture group to match a comma followed by at least 1 digit.
(?!\d) - A negative lookahead to prevent trailing digits.
Very much inspired by JvdV's answer, I suggest this:
number_pattern = re.compile(r'(\d{1,3}(?:(?:[. \u00A0])?\d{3})*(?:,\d+)?(?!\d))', flags=re.UNICODE)
... this makes the thousand separator optional, and also makes thousand groups optional. It restricts the thousand-separator to 3 possible characters: dot, space and NBS, which is necessary for French numbers as found in practice.
PS I just found today that in fact Swiss French-speakers appear sometimes to use an apostrophe (of which there are several candidates in the vastness of Unicode) as a thousand separator.
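A rough sketch of the stricter pattern in use, again without the outer capturing group so findall returns whole matches. The sample strings are invented; the '$number' substitution follows the approach described in the question.

```python
import re

number_pattern = re.compile(r"\d{1,3}(?:[. \u00A0]?\d{3})*(?:,\d+)?(?!\d)")

# Items in a ", "-separated list are now matched one by one.
found = number_pattern.findall("1 234,56 puis 1, 2, 3 et 12.500")

# Numbers written without any thousand separator still match as a whole.
year = number_pattern.findall("en 2024")

# Substitution as in the question: double the dollars, then replace.
doubled = "1 234,56 $".replace("$", "$$")
substituted = number_pattern.sub("$number", doubled)

print(found, year, substituted)
```

Adding further separators (the Swiss apostrophe, say) would presumably just mean extending the [. \u00A0] character class.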

Regular Expression in Python strings

I want to validate a string that satisfies the below three conditions using regular expression
The special characters allowed are (. , _ , - ).
Should contain only lower-case characters.
Should not start or end with special character.
To satisfy the above conditions, I have created a format as below
^[^\W_][a-z\.,_-]+
This pattern works fine up to the second character. However, it fails for the third and subsequent characters if they contain any special or upper-case characters.
Example:
The pattern works for the string S#yanthan but not for Sa#yanthan. I am expecting that pattern to pass even if the third and subsequent characters contain any special characters or upper-case characters. Can you tell me where this pattern goes wrong, please? Below is a snippet of the code.
import re
a = "Sayanthan"
exp = re.search(r"^[^\W_][a-z\.,_-]+", a)
if exp:
    print(True)
else:
    print(False)
Based on your initial rules I'd go with:
^[a-z](?:[.,_-]*[a-z])*$
See the online demo.
However, you mentioned in the comments:
"Also the third condition is "should not start with Special character" instead of "should not start or end with Special character""
In that case you could use:
^[a-z][-.,_a-z]*$
See the online demo
The pattern that you tried ^[^\W_][a-z.,_-]+ starts with [^\W_] which will match any word char except an underscore, so it could also be an uppercase char.
Then [a-z.,_-]+ will match 1+ times any of the listed, which means the string can also end with a comma for example.
Looking at the conditions listed, you could use:
^[a-z](?:[a-z.,_-]*[a-z])?\Z
^ Start of string
[a-z] Match a lower case char a-z
(?: Non capture group
[a-z.,_-]*[a-z] Match 0+ occurrences of the listed ending with a-z
)? Close group and make it optional
\Z End of string
Regex demo
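As a sanity check, here is a small sketch that runs the last pattern against a handful of made-up names. The helper name is_valid is hypothetical. It assumes the original three conditions (lower case only, no special character at the start or the end).

```python
import re

strict = re.compile(r"^[a-z](?:[a-z.,_-]*[a-z])?\Z")

def is_valid(name):
    """True when the whole string satisfies the three conditions."""
    return strict.match(name) is not None

candidates = [
    "sayanthan",     # plain lower case
    "say.an_than",   # special characters in the middle
    "s",             # single character: start and end are the same char
    "Sayanthan",     # upper case first character
    "sa#yanthan",    # '#' is not an allowed special character
    "sayanthan-",    # ends with a special character
    "_sayanthan",    # starts with a special character
]
checks = {name: is_valid(name) for name in candidates}
print(checks)
```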

How to say "match anything until a specific character, then work your way backwards"?

I am often faced with patterns where the part which is interesting is delimited by a specific character, the rest does not matter. A typical example:
/dev/sda1 472437724 231650856 216764652 52% /
I would like to extract 52 (which can also be 9, or 100 - so 1 to 3 digits) by saying "match anything, then when you get to % (which is unique in that line), see before for the matches to extract".
I tried to code this as .*(\d*)%.* but the group is not matched:
.* match anything, any number of times
% ... until you get to the literal % (the \d is also matched by .* but my understanding is that once % is matched, the regex engine will work backwards, since it now has an "anchor" on which to analyze what was before -- please tell me if this reasoning is incorrect, thank you)
(\d*) ... and now before that % you had a (\d*) to match and group
.* ... and the rest does not matter (match everything)
Your regex does not work because .* matches too much, and the group matches too little. The group \d* can match nothing at all because of the * quantifier, leaving everything before the % to be matched by the leading .*.
And your description of .* is somewhat incorrect. It actually matches everything until the end, and then moves backwards until the thing after it ((\d*)%.*) matches. For more info, see here.
In fact, I think your text can be matched simply by:
(\d{1,3})%
And getting group 1.
The logic of "keep looking until you find..." is kind of baked into the regex engine, so you don't need to explicitly say .* unless you want it in the match. In this case you just want the number before the % right?
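In Python that might look like the sketch below. It uses re.search rather than re.match so the engine does the scanning, and converts the captured text to an int. The variable names are made up.

```python
import re

line = "/dev/sda1 472437724 231650856 216764652 52% /"

# search() scans forward for the first place where 1-3 digits are
# immediately followed by '%'; no leading .* is needed.
m = re.search(r"(\d{1,3})%", line)
usage = int(m.group(1)) if m else None
print(usage)
```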
If you are looking to extract just the number, then I would use:
import re
pattern = r"\d+(?=%)"  # \d+ rather than \d*, so the empty string right before % is not returned as well
string = "/dev/sda1 472437724 231650856 216764652 52% /"
returnedMatches = re.findall(pattern, string)
The regular expression uses a positive lookahead for the special character.
In your pattern this part .* matches until the end of the string. Then it backtracks, giving up as little as possible, till it can match 0+ times a digit and a %.
The % is matched because matching 0+ digits is ok. Then you match again .* till the end of the string. There is a capturing group, only it is empty.
What you might do is add a word boundary or a space before the digits:
.* (\d{1,3})%.* or .*\b(\d{1,3})%.*
Regex demo 1 Or regex demo 2
Note that using .* (greedy) you will get the last instance of the digits and the % sign.
If you would make it non greedy, you would match the first occurrence:
.*?(\d{1,3})%.*
Regex demo
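To make the greedy/lazy difference concrete, here is a short sketch. The second sample string with two percentages is invented for illustration. It also shows why the word boundary matters with a greedy .*: without it the group can end up holding only the last digit.

```python
import re

line = "/dev/sda1 472437724 231650856 216764652 52% /"
two = "10% and 20%"   # hypothetical line with two percentages

# Original pattern: the group is present but empty.
empty_group = re.match(r".*(\d*)%.*", line).group(1)

# Greedy .* plus a word boundary: the whole number, last occurrence.
greedy_last = re.match(r".*\b(\d{1,3})%.*", two).group(1)

# Lazy .*?: the first occurrence.
lazy_first = re.match(r".*?(\d{1,3})%.*", two).group(1)

# Greedy .* without the boundary leaves only one digit for the group.
greedy_no_boundary = re.match(r".*(\d{1,3})%.*", two).group(1)

print(repr(empty_group), greedy_last, lazy_first, greedy_no_boundary)
```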
By default regex matches as greedily as possible. The initial .* in your regex sequence is matching everything up to the %:
"/dev/sda1 472437724 231650856 216764652 52"
This is acceptable for the regex, because it just chooses to have the next pattern, (\d*), match 0 characters.
In this scenario a couple of options could work for you. I would recommend using the preceding space to define a sequence which "starts with a single space, contains any number of digits in the middle, and ends with a percentage symbol":
' (\d*)%'
Try this:
.*(\b\d{1,3}(?=\%)).*
demo

Python regex: How to make a group of words/character optional?

I am trying to make regex that can match all of them:
word
word-hyphen
word-hyphen-again
That is, the -\w+ part may occur any number of times, depending on how many words are in a term. How can I make it optional?
What I have so far is here: https://regex101.com/r/Atpwze/1
Try using
\w+(-\w+)* for matching 0 or more hyphenated words after first word
\w+(-\w+){0,} same as first case
based on your exact requirement.
In order to eliminate some extreme cases like a-+-+---, you could use \w+(-\w+)*[^\W]
\W matches any non-word character, and [^\W] negates that, so it matches a single word character (the same set as \w).
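A minimal sketch of the first suggestion, using fullmatch so the whole string has to be a hyphenated term. The helper name is_term is made up, and the group is written as non-capturing here, which is an assumption about what is wanted rather than part of the answer.

```python
import re

# One word, followed by zero or more "-word" parts.
term = re.compile(r"\w+(?:-\w+)*")

def is_term(s):
    """True when the entire string is word(-word)*."""
    return term.fullmatch(s) is not None

samples = ["word", "word-hyphen", "word-hyphen-again", "word-", "-word", "a-+-+---"]
results = {s: is_term(s) for s in samples}
print(results)
```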
To catch all of your examples, I think you could use:
^\w+(?:\w+\-?|\-\w+)+$
Beginning of the string ^
Match a word character one or more times \w+
Start a non capturing group (?:
Match a word character one or more times with an optional hyphen \w+\-?
Or |
A hyphen with one or more word characters \-\w+
Close the non capturing group )
End of the string $

python regex: don't allow a specific character to repeat

I have a regex
^[a-z][a-z0-9\-]{6,10}[a-z0-9]$
Which matches the following rules:
8-12 characters in length
first character is lowercase alpha
last character is lowercase alpha or digit
internal characters can contain a hyphen
it's re-used a lot in a module, always alongside some other rules and regexes
while writing out some unit tests, i noticed that it's always used in conjunction with another specific rule.
hyphens may not repeat
i can't wrap my head around integrating that rule into this one. i've tried a few dozen approaches with lookbehinds and lookaheads, but have had no luck on isolating to the specific character AND keeping the length requirement.
No repeating hyphen ^[a-z](?:[a-z0-9]|-(?!-)){6,10}[a-z0-9]$
Explained
^ [a-z]
(?:
[a-z0-9] # alnum
| # or
- (?! - ) # hyphen if not followed by hyphen
){6,10}
[a-z0-9] $
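Since the question mentions unit tests, here is a sketch of how the combined pattern could be checked. The sample names and the expected values are invented to cover the length limits and the hyphen rule.

```python
import re

name_re = re.compile(r"^[a-z](?:[a-z0-9]|-(?!-)){6,10}[a-z0-9]$")

# Hypothetical inputs mapped to whether they should be accepted.
expected = {
    "abc-def-gh": True,      # 10 chars, single hyphens only
    "abc--defgh": False,     # doubled hyphen
    "abcdefg": False,        # 7 chars, too short
    "abcdefgh": True,        # 8 chars, lower bound
    "abcdefghijkl": True,    # 12 chars, upper bound
    "abcdefghijklm": False,  # 13 chars, too long
    "abcdefg-": False,       # ends with a hyphen
    "1bcdefgh": False,       # first character is not a letter
}
actual = {name: name_re.match(name) is not None for name in expected}
print(actual == expected)
```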
