Regex to match word bounded instances with 'dot' inside? [duplicate] - python

This question already has answers here:
Regular expression for floating point numbers
(20 answers)
Closed 3 years ago.
Hope the question was understandable.
What I want to do is match anything that constitutes a number (int or float) in Python syntax. For instance, I want to match everything of the form (including the dot):
123
123.321
123.
My attempted solution was
"\b\d+\.?\d*\b"
...but this fails. The idea is to match any sequence that starts with one or more digits (\d+), followed by an optional dot (\.?), followed by an arbitrary number of digits (\d*), with word boundaries around it. This should match all three number forms specified above.
The word boundary is important because I do not want to match the numbers in
foo123
123foo
and want to match the numbers in
a=123.
foo_method(123., 789.1, 10)
However, the problem is that the last word boundary is recognised right before the optional dot. This prevents the regex from matching 123. and 123.321; instead it matches 123 and 321.
How can I do this if word boundaries are out of the question? Is it possible to make the program treat the dot as a word character?

The float spec is a little more complicated than what you've covered there.
This matches Python's float spec, though there are others as well.
r"[+-]?\d+\.?\d*([eE][+-]?\d+)?"
You can add positive lookaheads and lookbehinds to this if you are doing something relatively simple, but you may want to split everything you are parsing by word boundary before parsing for something more complex.
This would be the version ensuring word boundaries:
r"(?<=\b)[+-]?\d+\.?\d*([eE][+-]?\d+)?(?=\b)"
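One caveat worth checking: a trailing \b still cannot sit after a final dot when the next character is also a non-word character, so 123. followed by a comma is cut back to 123. A minimal sketch of one way around this, assuming it is acceptable to replace the word boundaries with negative lookarounds (the name NUMBER and the sample strings are only for illustration, and signs and exponents are left out to keep it short):

```python
import re

# Assumption: a number must not be preceded by a word character or a dot,
# and must not be followed by a word character. This replaces \b on both sides.
NUMBER = re.compile(r"(?<![\w.])\d+\.?\d*(?!\w)")

samples = ["a=123.", "foo_method(123., 789.1, 10)", "foo123", "123foo"]
for text in samples:
    # findall returns the full matches because the pattern has no capture groups.
    print(text, "->", NUMBER.findall(text))
```

The lookarounds are zero-width, so the trailing dot stays part of the match while numbers glued to identifiers such as foo123 and 123foo are still rejected.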

Related

Find all strings starting and ending with given substring in a string using regex in Python [duplicate]

This question already has an answer here:
Regex including overlapping matches with same start
(1 answer)
Closed 3 years ago.
I have been given a string
ATGCCAGGCTAGCTTATTTAA
and I have to find all substrings of it that start with ATG and end with any of TAA, TAG, TGA.
Here is what I am doing:
import re

seq = "ATGCCAGGCTAGCTTATTTAA"
pattern = re.compile(r"(ATG[ACGT]*(TAG|TAA|TGA))")
for match in re.finditer(pattern, seq):
    coding = match.group(1)
    print(coding)
This code is giving me output:
ATGCCAGGCTAGCTTATTTAA
But the expected output is:
ATGCCAGGCTAGCTTATTTAA, ATGCCAGGCTAG
What should I change in my code?
tl;dr: can't use regex for this
The problem isn't greedy/non-greedy.
The problem isn't overlapping matches either: there's a solution for that (How to find overlapping matches with a regexp?)
The real problem with OP's question is, REGEX isn't designed for matches with the same start. Regex performs a linear search and stops at the first match. That's one of the reasons why it's fast. However, this prevents REGEX from supporting multiple overlapping matches starting at the same character.
See "Regex including overlapping matches with same start" for more info.
Regex isn't the be-all-end-all of pattern matching. It's in the name: Regular expressions are all about single-interpretation symbol sequences, and DNA tends not to fit that paradigm.
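Since a single regex pass reports at most one match per start position, one workaround is to let regex locate only the start and stop codons and then pair them up in ordinary Python. This is a rough sketch, not the answerer's method: the function name find_coding_substrings is made up for illustration, it assumes the sequence contains only A, C, G and T, and it does not enforce a reading frame.

```python
import re

def find_coding_substrings(seq, start="ATG", stops=("TAG", "TAA", "TGA")):
    """Return every substring that begins with `start` and ends with a stop codon."""
    # Lookaheads are zero-width, so overlapping occurrences are all reported.
    start_positions = [m.start() for m in re.finditer(f"(?={start})", seq)]
    stop_pattern = "(?=(" + "|".join(stops) + "))"
    stop_matches = [(m.start(), m.group(1)) for m in re.finditer(stop_pattern, seq)]

    results = []
    for i in start_positions:
        for j, stop in stop_matches:
            # The stop codon must begin after the start codon has ended.
            if j >= i + len(start):
                results.append(seq[i:j + len(stop)])
    return results

print(find_coding_substrings("ATGCCAGGCTAGCTTATTTAA"))
# -> ['ATGCCAGGCTAG', 'ATGCCAGGCTAGCTTATTTAA']
```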
In r"(ATG[ACGT]*(TAG|TAA|TGA))", the * operator is "greedy". Use the non-greedy modifier, like r"(ATG[ACGT]*?(TAG|TAA|TGA))", to tell the regexp to take the shortest matching string, not the longest.

regular expression for just number with fixed length and without digit back or front occurred that and without use \b [duplicate]

This question already has answers here:
Regex matching 5-digit substrings not enclosed with digits
(2 answers)
Closed 3 years ago.
I want a regular expression that matches a number of fixed length only when no digit occurs immediately before or after it, without using \b.
sample text: "phone0990-123-12345hello"
my regex: r'09([0-9]){2}([ ]|-){0,1}([0-9]){3}([ ]|-){0,1}([0-9]){4}(?:[^0-9])'
This regex should return no match, but it returns 0990-123-12345 for me!
With (?:[^0-9]) I am telling it to match a number in the text that is not followed by another digit after the last target digit, and with ?: I am telling it not to show the non-digit in the match. I don't want the h character to show up in the match!
As you give the exact number of digits with {4}, you can just leave the part (?:[^0-9]) out.
Summing up, this regex works for me:
r'09([0-9]){2}([ ]|-){0,1}([0-9]){3}([ ]|-){0,1}([0-9]){4}'
You can check your regular expressions on:
https://pythex.org/
I tried my best to understand the question, but you could also try
09([0-9]+){2}([ -]+){0,1}([0-9]+){3}([ -]+){0,1}([0-9]){4}
try
\d{4}[-.]\d{3}[-.]\d{4}
it will return 0990-123-1234
Try r'\d{4}\-\d{3}-\d{3,4}' it will return only 0990-123-1234
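The requirement in the question (no digit directly before or after the number, and no \b) can also be expressed with negative lookarounds, which is the approach taken in the linked duplicate. A minimal sketch, assuming the number has the shape 09xx-xxx-xxxx with optional space or hyphen separators; the names PHONE and find_phone are only for illustration:

```python
import re

# (?<!\d) and (?!\d) assert "no digit here" without consuming any character,
# so the neighbouring letters never become part of the match.
PHONE = re.compile(r"(?<!\d)09\d{2}[ -]?\d{3}[ -]?\d{4}(?!\d)")

def find_phone(text):
    """Return the first matching number in text, or None if there is none."""
    m = PHONE.search(text)
    return m.group(0) if m else None

print(find_phone("phone0990-123-12345hello"))  # one digit too many at the end
print(find_phone("phone0990-123-1234hello"))   # exactly the expected length
```

Unlike (?:[^0-9]), a lookahead does not need a following character to exist, so a number at the very end of the text is still found.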

Regex not working to get string between 2 strings. Python 27 [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 3 years ago.
From this URL view-source:https://www.amazon.com/dp/073532753X?smid=A3P5ROKL5A1OLE
I want to get string between var iframeContent = and obj.onloadCallback = onloadCallback;
I have this regex iframeContent(.*?)obj.onloadCallback = onloadCallback;
But it does not work. I am not good at regex so please pardon my lack of knowledge.
I even tried iframeContent(.*?)obj.onloadCallback but it does not work.
It looks like you just want that giant encoded string. I believe yours is failing for two reasons. You're not running in DOTALL mode, which means your . won't match across multiple lines, and your regex is failing because of catastrophic backtracking, which can happen when you have a very long variable length match that matches the same characters as the ones following it.
This should get what you want
m = re.search(r'var iframeContent = \"([^"]+)\"', html_source)
print m.group(1)
The regex is just looking for any characters except double quotes [^"] in between two double quotes. Because the variable length match and the match immediately after it don't match any of the same characters, you don't run into the catastrophic backtracking issue.
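To illustrate the DOTALL point on its own, here is a small sketch using Python 3 syntax. The html_source string is a made-up stand-in for the real page, which is assumed to spread the value over several lines:

```python
import re

# Hypothetical, shortened stand-in for the page source (note the newlines).
html_source = (
    'var iframeContent = "first line\n'
    'second line";\n'
    'obj.onloadCallback = onloadCallback;'
)

pattern = r'var iframeContent = (.*?)obj\.onloadCallback = onloadCallback;'

# Without DOTALL, "." stops at each newline, so the lazy group cannot reach the end marker.
default_match = re.search(pattern, html_source)
# With DOTALL, "." also matches newlines and the group can span several lines.
dotall_match = re.search(pattern, html_source, re.DOTALL)

print(default_match)
print(repr(dotall_match.group(1)))
```

On a very large page the negated character class shown above is still likely to be the cheaper option, since it never has to backtrack across the whole document.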
I suspect that the input string spans multiple lines. Note that re.M only changes how ^ and $ behave; to let . match newlines, try adding re.S (re.DOTALL) to the search call (i.e. re.findall('someString', text_Holder, re.S)).
You could try this regex too
(?<=iframeContent =)(.*)(?=obj.onloadCallback = onloadCallback)
It is very important that you use DOTALL mode, which is also known as single-line mode.

How to make regex that matches a number with commas for every three digits?

I am a beginner in Python and in regular expressions, and I am now trying to solve an exercise that reads as follows:
How would you write a regex that matches a number with commas for
every three digits? It must match the following:
'42'
'1,234'
'6,368,745'
but not the following:
'12,34,567' (which has only two digits between the commas)
'1234' (which lacks commas)
I thought it would be easy, but I've already spent several hours and still don't have the right answer. Even the answer given in the book for this exercise doesn't work at all (the pattern in the book is ^\d{1,3}(,\d{3})*$).
Thank you in advance!
The answer in your book seems correct to me. It also works on the test cases you have given.
(^\d{1,3}(,\d{3})*$)
The '^' symbol anchors the match at the start of the line. \d{1,3} requires at least one digit but no more than three, so
1234,123
will not match.
(,\d{3})*$
This expression allows zero or more groups of one comma followed by exactly three digits, running up to the end of the line.
Maybe the answer you are looking for is this:
(^\d+(,\d{3})*$)
This matches a number with commas every three digits without limiting the leading group to three digits. Note that it therefore also matches '1234', which the exercise says should be rejected.
You can go with this (which is a slightly improved version of what the book specifies):
^\d{1,3}(?:,\d{3})*$
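A quick way to check the pattern against the cases from the exercise is a short script like the following sketch. The list names should_match and should_not_match are only for illustration:

```python
import re

pattern = re.compile(r'^\d{1,3}(?:,\d{3})*$')

should_match = ['42', '1,234', '6,368,745']
should_not_match = ['12,34,567', '1234']

for text in should_match + should_not_match:
    # search() returns a match object, or None when the whole string does not fit.
    verdict = 'matches' if pattern.search(text) else 'does not match'
    print(f'{text!r} {verdict}')
```

If the book's pattern appears not to work, it is worth checking that the pattern is written as a raw string and that the input has no surrounding quotes or whitespace, since the anchors require the entire string to be the number.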
I got it to work by putting the part between the caret and the dollar sign in parentheses, like so: re.compile(r'^(\d{1,3}(,\d{3})*)$')
but I find this regex pretty useless, because you can't use it to find these numbers in a document: the string has to begin and end with the exact phrase.
# This program validates the regular expression for this scenario.
# Any properly formatted number (with commas) will match.
# Parsing through a document for this regex is beyond my capability at this time.
import re

print('Type a number with commas')
sentence = input()
pattern = re.compile(r'\d{1,3}(,\d{3})*')
matches = pattern.match(sentence)
# match() returns None when nothing matches at the start, so check for that first.
if matches is None or matches.group(0) != sentence:
    # The input value does NOT match the pattern.
    print('Does Not Match the Regular Expression!')
else:
    # The whole input matched, so report verification.
    print(matches.group(0) + ' matches the pattern.')
The Simple answer is :
^\d{1,2}(,\d{3})*$
^\d{1,2} - should start with a number and matches 1 or 2 digits.
(,\d{3})*$ - once ',' is passed it requires 3 digits.
Works for all the scenarios in the book, although it rejects numbers whose leading group has three digits, such as '123,456'.
test your scenarios on https://pythex.org/
I also went down the rabbit hole trying to write a regex that solves the question in the book. The question in the book does not assume that each line is such a number; that is, there might be multiple such numbers on the same line, and there might be some kind of quotation marks around the number (as in the question text). On the other hand, the solution provided in the book makes those assumptions: (^\d{1,3}(,\d{3})*$)
I tried to use the question text as input and ended up with the following pattern, which is way too complicated:
r'''(
(?:(?<=\s)|(?<=[\'"])|(?<=^))
\d{1,3}
(?:,\d{3})*
(?:(?=\s)|(?=[\'"])|(?=$))
)'''
(?:(?<=\s)|(?<=[\'"])|(?<=^)) is a non-capturing group that allows
the number to start after \s characters, ', ", or the start of the text.
(?:,\d{3})* is a non-capturing group to avoid capturing, for example, 123 in 12,123.
(?:(?=\s)|(?=[\'"])|(?=$)) is a non-capturing group that allows
the number to end before \s characters, ', ", or the end of the text (no newline case).
Obviously you could extend the list of allowed characters around the number.
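A shorter alternative for finding such numbers inside running text is to forbid a digit or comma directly on either side instead of listing the characters that are allowed. This is only a sketch under that assumption; the name COMMA_NUMBER and the sample text are made up for illustration:

```python
import re

# Negative lookarounds: the number must not touch another digit or comma,
# which rules out pieces of '12,34,567' and of '1234'.
COMMA_NUMBER = re.compile(r"(?<![\d,])\d{1,3}(?:,\d{3})*(?![\d,])")

text = "It must match '42', '1,234' and '6,368,745' but not '12,34,567' or '1234'."
print(COMMA_NUMBER.findall(text))
# -> ['42', '1,234', '6,368,745']
```

The trade-off is that a number directly followed by a list comma, as in "1,234, 5", would be skipped, so the allowed-characters approach above is the stricter and more explicit of the two.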

searching for doubles with regular expressions [duplicate]

This question already has answers here:
How to extract a floating number from a string [duplicate]
(7 answers)
Closed 7 years ago.
Is there a regular expression which would match floating point numbers but not if this is a part of a construct like 15.01.2016?
re.match(rex, s) should be successful if s would be
1.0A
1.B
.1
and not successful for s like
1.0.0
1.0.1
20.20.20.30
12345.657.345
Edit:
The crucial part is the combination of the constraints: "[0-9]*\.[0-9]*" and not part of "[0-9]+\.[0-9]+(\.[0-9]+)+"
You can use this regex based on look arounds in python:
(?<![\d.])(?:\d*\.\d+|\d+\.\d*)(?![\d.])
(?![\d.]) is lookahead assertion to fail the match if next char is DOT or digit
(?<![\d.]) is lookbehind assertion to fail the match if previous char is DOT or digit
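The pattern can be checked against the strings from the question with a few lines of Python. This is a small sketch; the names FLOAT, accepted and rejected are only for illustration:

```python
import re

FLOAT = re.compile(r"(?<![\d.])(?:\d*\.\d+|\d+\.\d*)(?![\d.])")

accepted = ["1.0A", "1.B", ".1"]
rejected = ["1.0.0", "1.0.1", "20.20.20.30", "12345.657.345"]

for s in accepted + rejected:
    # re.match only tries at the start of the string, as in the question.
    m = FLOAT.match(s)
    print(f"{s!r}: {m.group(0) if m else None}")
```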
The following solution also uses lookahead and lookbehind, as anubhava mentioned. Additionally, it takes care of negative numbers, powers of ten (including negative ones) and integers (without a .):
(?<![\d.])-?(?:\d+\.?\d*|\d*\.\d+)([eE]-?\d+)?(?![.\d])
If you add some additional characters to the lookbehind (?<![\d.]), you can avoid matches inside a random bunch of characters or at the end of a word (e.g. if you want no match for "python3").
