searching for doubles with regular expressions [duplicate] - python

This question already has answers here:
How to extract a floating number from a string [duplicate]
(7 answers)
Closed 7 years ago.
Is there a regular expression which would match floating point numbers but not if this is a part of a construct like 15.01.2016?
re.match(rex, s) should be successful if s would be
1.0A
1.B
.1
and not successful for s like
1.0.0
1.0.1
20.20.20.30
12345.657.345
Edit:
The crucial part is the combination of the constraints: match "[0-9]*\.[0-9]*", but not as part of "[0-9]+\.[0-9]+(\.[0-9]+)+"

You can use this regex based on look arounds in python:
(?<![\d.])(?:\d*\.\d+|\d+\.\d*)(?![\d.])
(?![\d.]) is lookahead assertion to fail the match if next char is DOT or digit
(?<![\d.]) is lookbehind assertion to fail the match if previous char is DOT or digit
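As a quick check, the pattern above can be run against the sample strings from the question. This is a minimal sketch: the helper name find_float is hypothetical, and it uses re.search rather than the re.match call mentioned in the question so that the lookbehind is exercised at every position.

```python
import re

# Pattern copied verbatim from the answer above.
FLOAT_RE = re.compile(r"(?<![\d.])(?:\d*\.\d+|\d+\.\d*)(?![\d.])")

def find_float(s):
    """Return the first float-like substring of s, or None (hypothetical helper)."""
    m = FLOAT_RE.search(s)
    return m.group(0) if m else None

# Strings that should match, followed by strings that should not.
samples = ["1.0A", "1.B", ".1", "1.0.0", "1.0.1",
           "20.20.20.30", "12345.657.345", "15.01.2016"]
for s in samples:
    print(repr(s), "->", find_float(s))
```

The dotted constructs are rejected because every candidate start position is either followed by a further dot or digit after the longest match, or preceded by one.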

The following solution also uses lookahead and lookbehind, as anubhava mentioned. Additionally, it handles negative numbers, powers of ten (including negative exponents) and integers (without a .):
(?<![\d.])-?(?:\d+\.?\d*|\d*\.\d+)([eE]-?\d+)?(?![.\d])
If you add further characters to the lookbehind (?<![\d.]), you can avoid matches inside an arbitrary run of characters or at the end of a word (e.g. if you want no match for "python3").
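The effect of widening the lookbehind can be sketched as follows. NUMBER_RE is the pattern from this answer; STRICT_RE is one possible widened variant that swaps \d for \w in the lookbehind. That variant and the helper first_number are assumptions made for illustration, not part of the original answer.

```python
import re

# Pattern from the answer above.
NUMBER_RE = re.compile(r"(?<![\d.])-?(?:\d+\.?\d*|\d*\.\d+)([eE]-?\d+)?(?![.\d])")
# Assumed variant: the lookbehind also rejects letters and underscores.
STRICT_RE = re.compile(r"(?<![\w.])-?(?:\d+\.?\d*|\d*\.\d+)([eE]-?\d+)?(?![.\d])")

def first_number(pattern, s):
    """Return the first match of pattern in s, or None (hypothetical helper)."""
    m = pattern.search(s)
    return m.group(0) if m else None

print(first_number(NUMBER_RE, "-1.5e-3"))   # negative number with exponent
print(first_number(NUMBER_RE, "python3"))   # the trailing digit is still found
print(first_number(STRICT_RE, "python3"))   # the widened lookbehind rejects it
```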

Related

How to match version strings against a minimum version number using a regex and no other packages?

Note: This question is not about available libraries which parse version strings or about how to write regex pattern in general.
I have to feed a pattern into an API call so using another library is not an option for me.
I need to match a version string like e.g. "9.3" or "12.5" against a minimum version number "11.2". I.e. "9.9" or "9.10" must not match while "11.2", "11.10" or "20.0" should.
In Python's re module, is there a reasonable way to accomplish this? Approaches like this one lead to an expression such as:
r"^(([2-9]\d|1[2-9])\.\d{1,}|11\.([2-9]|\d{2,}))(\.\d+)*$"
which is hard to read and not at all flexible.
In 2020, isn't there an easier or more generic way to deal with decimal numbers (not digits) in regular expressions in Python?
Maybe expression generators which take a more abstract description and generate expressions?
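To match integers and decimal numbers from 11 upwards, you can start from this regular expression. Note that it requires both of the first two digits to be in the range 1 to 9, so values such as 20 or 105 are not matched: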
^[1-9][1-9](\d+)?\.?\d*
^ Beginning of string.
[1-9] First digit must be in the range 1 to 9 (inclusive).
[1-9] Same for second digit.
(\d+)? Might be followed by more digits.
\.? Might be followed by a decimal.
\d* Might be followed by digits in the fractional part.
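To see how the two patterns in this thread behave, they can be run against the version strings listed in the question. This is only a sketch: the names QUESTION_RE and ANSWER_RE are made up here, and "11.1" is an extra test value that does not appear in the question.

```python
import re

# Pattern from the question (minimum version 11.2).
QUESTION_RE = re.compile(r"^(([2-9]\d|1[2-9])\.\d{1,}|11\.([2-9]|\d{2,}))(\.\d+)*$")
# Pattern from the answer above.
ANSWER_RE = re.compile(r"^[1-9][1-9](\d+)?\.?\d*")

versions = ["9.9", "9.10", "11.1", "11.2", "11.10", "12.5", "20.0"]
for v in versions:
    # Columns: version, matched by the question's pattern, matched by the answer's pattern.
    print(v, bool(QUESTION_RE.match(v)), bool(ANSWER_RE.match(v)))
```

On these inputs the question's pattern accepts exactly the versions at or above 11.2, while the answer's pattern accepts "11.1" and rejects "20.0", because its second digit may not be 0.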

Find all strings starting and ending with given substring in a string using regex in Python [duplicate]

This question already has an answer here:
Regex including overlapping matches with same start
(1 answer)
Closed 3 years ago.
I have been given a string
ATGCCAGGCTAGCTTATTTAA
and I have to find all substrings of it that start with ATG and end with any of TAA, TAG, or TGA.
Here is what I am doing:
import re
seq = "ATGCCAGGCTAGCTTATTTAA"
pattern = re.compile(r"(ATG[ACGT]*(TAG|TAA|TGA))")
for match in re.finditer(pattern, seq):
    coding = match.group(1)
    print(coding)
This code gives me the output:
ATGCCAGGCTAGCTTATTTAA
But the actual output should be:
ATGCCAGGCTAGCTTATTTAA, ATGCCAGGCTAG
What should I change in my code?
tl;dr: can't use regex for this
The problem isn't greedy/non-greedy.
The problem isn't overlapping matches either: there's a solution for that (How to find overlapping matches with a regexp?)
The real problem with the OP's question is that regex isn't designed for matches with the same start. A regex engine scans from left to right and stops at the first match it finds from a given position. That's one of the reasons why it's fast. However, this prevents regex from returning multiple overlapping matches that start at the same character.
See
Regex including overlapping matches with same start
for more info.
Regex isn't the be-all-end-all of pattern matching. It's in the name: Regular expressions are all about single-interpretation symbol sequences, and DNA tends not to fit that paradigm.
In r"(ATG[ACGT]*(TAG|TAA|TGA))", the * operator is "greedy". Use the non-greedy modifier, like r"(ATG[ACGT]*?(TAG|TAA|TGA))", to tell the regexp to take the shortest matching string, not the longest.
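One way to get every substring, including several that share the same start, is to collect start and stop positions separately and then pair them up in plain Python. The sketch below does that; the function name find_candidates is hypothetical, and it assumes the sequence contains only the letters A, C, G and T and that no reading-frame constraint applies (the question does not mention one).

```python
import re

def find_candidates(seq):
    """Return all substrings that start with ATG and end with TAG, TAA or TGA."""
    # Zero-width lookaheads report every position, even when occurrences overlap.
    starts = [m.start() for m in re.finditer(r"(?=ATG)", seq)]
    stops = [m.start() for m in re.finditer(r"(?=TAG|TAA|TGA)", seq)]
    # A stop codon must begin after the three letters of the start codon.
    return [seq[s:e + 3] for s in starts for e in stops if e >= s + 3]

print(find_candidates("ATGCCAGGCTAGCTTATTTAA"))
```

For the sequence in the question this yields both ATGCCAGGCTAG and ATGCCAGGCTAGCTTATTTAA, the two results the question asks for.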

Regex to match word bounded instances with 'dot' inside? [duplicate]

This question already has answers here:
Regular expression for floating point numbers
(20 answers)
Closed 3 years ago.
Hope the question was understandable.
What I want to do is to match anything that constitutes a number (int and float) in Python syntax. For instance, I want to match everything of the form (including the dot):
123
123.321
123.
My attempted solution was
"\b\d+/.?\d*\b"
...but this fails. The idea is to match any sequence that starts with one or more digit (\d+), followed by an optional dot (/.?), followed by an arbitrary number of digits (\d*), with word boundaries around. This would match all three number forms specified above.
The word boundary is important because I do not want to match the numbers in
foo123
123foo
and want to match the numbers in
a=123.
foo_method(123., 789.1, 10)
However, the problem is that the last word boundary is recognised right before the optional dot. This prevents the regex from matching 123. and 123.321; it matches 123 and 321 instead.
How can I do this if word boundaries are out of the question? Is it possible to make the program treat the dot as a word character?
The float spec is a little more complicated than you've got covered there.
This matches Python's float spec, though there are others as well.
r"[+-]?\d+\.?\d*([eE][+-]?\d+)?"
You can add positive lookaheads and lookbehinds to this if you are doing something relatively simple, but you may want to split everything you are parsing by word boundary before parsing for something more complex.
This would be the version ensuring word boundaries:
r"(?<=\b)[+-]?\d+\.?\d*([eE][+-]?\d+)?(?=\b)"
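Because \b never sits between a trailing dot and a following non-word character, a number such as 123. loses its dot when word boundaries are used. An alternative, sketched here as an assumption rather than as part of the answer above, is to replace the boundaries with negative lookarounds. The pattern NUM_RE and the helper numbers_in are hypothetical, and signs and exponents are left out to keep the example short.

```python
import re

# No word character or dot directly before, no word character directly after.
NUM_RE = re.compile(r"(?<![\w.])\d+\.?\d*(?!\w)")

def numbers_in(text):
    """Return all int-like and float-like tokens in text (hypothetical helper)."""
    return NUM_RE.findall(text)

print(numbers_in("foo_method(123., 789.1, 10)"))
print(numbers_in("a=123."))
print(numbers_in("foo123"), numbers_in("123foo"))
```

On the examples from the question this keeps the trailing dot of 123. and finds nothing in foo123 or 123foo.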

Regular expression for a fixed-length number with no digit directly before or after it, without using \b [duplicate]

This question already has answers here:
Regex matching 5-digit substrings not enclosed with digits
(2 answers)
Closed 3 years ago.
I want a regular expression that matches a number of fixed length only when no digit occurs directly before or after it, without using \b.
Sample text: "phone0990-123-12345hello"
My regex: r'09([0-9]){2}([ ]|-){0,1}([0-9]){3}([ ]|-){0,1}([0-9]){4}(?:[^0-9])'
This regex should return nothing, but it returns 0990-123-12345 for me!
With (?:[^0-9]) I am telling it to match the number only if no further digit follows the target digits, and with ?: I am telling it not to include that non-digit in the match. I don't want the h character to show up in the match!
As you give the exact number of digits with {4}, you can just leave the part (?:[^0-9]) out.
Summing up, this regex works for me:
r'09([0-9]){2}([ ]|-){0,1}([0-9]){3}([ ]|-){0,1}([0-9]){4}'
You can check your regular expressions on:
https://pythex.org/
I tried my best to understand the question, but you could also try
09([0-9]+){2}([ -]+){0,1}([0-9]+){3}([ -]+){0,1}([0-9]){4}
Try
\d{4}[-.]\d{3}[-.]\d{4}
it will return 0990-123-1234
Try r'\d{4}\-\d{3}-\d{3,4}' it will return only 0990-123-1234
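If the goal is to reject the number whenever a digit sits directly before or after it, negative lookarounds express that without \b and without consuming the neighbouring character. The following is a sketch under that reading of the question. PHONE_RE is a simplified variant of the asker's pattern, not one of the answers above, and find_phone is a hypothetical helper.

```python
import re

# (?<!\d) : no digit directly before; (?!\d) : no digit directly after.
PHONE_RE = re.compile(r"(?<!\d)09\d{2}[ -]?\d{3}[ -]?\d{4}(?!\d)")

def find_phone(text):
    """Return the first number matching PHONE_RE in text, or None."""
    m = PHONE_RE.search(text)
    return m.group(0) if m else None

print(find_phone("phone0990-123-12345hello"))  # an extra digit follows the number
print(find_phone("phone0990-123-1234hello"))   # exactly the expected length
```

Since lookarounds are zero-width, the h after the number never becomes part of the match.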

What is the literal meaning of this regex that contains a lookahead? [duplicate]

This question already has answers here:
Reference - What does this regex mean?
(1 answer)
What does ?= mean in a regular expression?
(3 answers)
Closed 5 years ago.
I'm trying to understand the positive and negative lookahead, but I think I'm missing something.
q(?=u)
What I understand this regex to mean is: "match a q that is followed by a u", so I get a match with the string "quit", but the match contains only 'q'.
But with the regex q(?=u)i, I don't get any result with the string "quit". Why does this happen? This regex probably doesn't make sense, but I would like to know what it means in order to understand the lookahead.
The lookahead doesn't consume its text. qu is different from q(?=u) -- the latter matches just a q, but requires it to be followed by u (which is however not captured or consumed).
And that's why q(?=u)i cannot match -- the q needs to be followed by u and i at the same time, which is impossible. In other words, it would find and capture qi but only if the q was immediately followed by u which is obviously not true if it is followed by i.
If you want to match qui, the regex for that is qui.
Negative lookahead is indispensable if you want to match something not followed by something else. When explaining character classes, this tutorial explained why you cannot use a negated character class to match a q not followed by a u. Negative lookahead provides the solution: q(?!u). The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead, we have the trivial regex u.
Positive lookahead works just the same. q(?=u) matches a q that is followed by a u, without making the u part of the match. The positive lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.
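The behaviour described above can be reproduced with Python's re module. This is a short sketch, and the variable names are only for illustration.

```python
import re

# Positive lookahead: the u is required but is not part of the match.
pos = re.search(r"q(?=u)", "quit")
print(pos.group(0), pos.span())

# After q(?=u) the engine is still positioned before the u, so a literal i cannot follow.
impossible = re.search(r"q(?=u)i", "quit")
print(impossible)

# Negative lookahead: matches a q that is not followed by u, even at the end of the string.
neg_end = re.search(r"q(?!u)", "Iraq")
neg_quit = re.search(r"q(?!u)", "quit")
print(neg_end.group(0), neg_quit)

# A negated character class needs an actual character after the q, so it fails on "Iraq".
char_class = re.search(r"q[^u]", "Iraq")
print(char_class)
```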
Regex lookahead, lookbehind and atomic groups
