Python regex search function weird behavior [duplicate] - python

This question already has answers here:
ip address validation in python using regex [duplicate]
(5 answers)
Closed 6 years ago.
I'm writing code to retrieve IP addresses from a text file, but I have an issue with the regex part:
import re
print(re.search(r"[0-255].[0-255].[0-255].[0-255]", "5.39.0.0"))
This returns None, but it should return a match object such as <_sre.SRE_Match object at 0x0000000001D7C510> (because "5.39.0.0" matches the expression). If I replace 39 with 0 it works.

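Your regular expression won't work for several reasons (see the comments). In particular, [0-255] is a character class that matches a single character (0, 1, 2 or 5), not a number from 0 to 255.
An unescaped dot matches any character; to match a literal dot you want \.
Try this regular expression: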
(?:(?:[01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])
For reference, check this site, which has similar examples:
Regular Expression Examples
P.S.: There are several sites on the web for testing regular expressions, and you should use them, for example: Regex101
Edit: The alternatives in the last group must be listed in the reverse order; otherwise a value of the form 2** is matched by the first alternative using only its first two characters, e.g. 255.255.255.250 would be matched as 255.255.255.25 (the last digit is lost). Also, non-capturing groups are recommended where individual groups (used for alternation or repetition) have no meaning of their own or are not needed.
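To make the two points above concrete, here is a minimal sketch (not the only way to do it) that shows why the original pattern misbehaves and then runs the suggested expression over some text. The sample text and the variable names are made up for illustration.

```python
import re

# [0-255] is a character class: the range 0-2 plus the digit 5 (listed twice).
# The unescaped dots match any character at all.
broken = re.compile(r"[0-255].[0-255].[0-255].[0-255]")
print(broken.search("5.39.0.0"))   # None: '3' is not one of 0, 1, 2, 5
print(broken.search("5.0.0.0"))    # a match object
print(broken.search("1x2y5z0"))    # also a match object, since . matches anything

# The expression suggested in the answer, copied as-is.
ip_pattern = re.compile(
    r"(?:(?:[01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"
)

# Hypothetical sample text standing in for the contents of the file.
text = "hosts: 5.39.0.0, 255.255.255.250 and 192.168.1.1"
found = ip_pattern.findall(text)
print(found)
```

Because every group in the pattern is non-capturing, re.findall returns the full matched addresses rather than tuples of groups.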

OK, I forgot some important things; here is the solution:
[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}

Related

Find all strings starting and ending with given substring in a string using regex in Python [duplicate]

This question already has an answer here:
Regex including overlapping matches with same start
(1 answer)
Closed 3 years ago.
I have been given the string
ATGCCAGGCTAGCTTATTTAA
and I have to find all substrings of it that start with ATG and end with any of TAA, TAG, TGA.
Here is what I am doing:
import re
seq = "ATGCCAGGCTAGCTTATTTAA"
pattern = re.compile(r"(ATG[ACGT]*(TAG|TAA|TGA))")
for match in re.finditer(pattern, seq):
    coding = match.group(1)
    print(coding)
This code gives me the output:
ATGCCAGGCTAGCTTATTTAA
But the actual output should be:
ATGCCAGGCTAGCTTATTTAA, ATGCCAGGCTAG
What should I change in my code?
tl;dr: can't use regex for this
The problem isn't greedy/non-greedy.
The problem isn't overlapping matches either: there's a solution for that (How to find overlapping matches with a regexp?)
The real problem with OP's question is that regex isn't designed for multiple matches with the same start. At each starting position the engine reports a single match and then moves on, which is one of the reasons it is fast. However, this prevents regex from returning several overlapping matches that start at the same character.
See "Regex including overlapping matches with same start" for more info.
Regex isn't the be-all-end-all of pattern matching. It's in the name: Regular expressions are all about single-interpretation symbol sequences, and DNA tends not to fit that paradigm.
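If regex is out, a plain loop does the job. The following is only a sketch under a few assumptions: the sequence contains nothing but A, C, G and T, reading frames are ignored (as in OP's pattern), and the function name find_coding_spans is made up for this example.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")

def find_coding_spans(seq):
    """Yield every substring of seq that starts with ATG and ends with a stop codon."""
    start = seq.find("ATG")
    while start != -1:
        # Try every possible end position after this start codon.
        for end in range(start + 3, len(seq) - 2):
            if seq[end:end + 3] in STOP_CODONS:
                yield seq[start:end + 3]
        start = seq.find("ATG", start + 1)

seq = "ATGCCAGGCTAGCTTATTTAA"
results = list(find_coding_spans(seq))
print(results)  # ['ATGCCAGGCTAG', 'ATGCCAGGCTAGCTTATTTAA']
```

This is quadratic in the worst case, which is the price of reporting every match that shares a start position.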
In r"(ATG[ACGT]*(TAG|TAA|TGA))", the * operator is "greedy". Use the non-greedy modifier, like r"(ATG[ACGT]*?(TAG|TAA|TGA))", to tell the regexp to take the shortest matching string, not the longest. Note that this returns only the shortest match (ATGCCAGGCTAG), not both.
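A quick side-by-side of the two quantifiers on OP's sequence, as a small sketch (the list names greedy and lazy are just for this example):

```python
import re

seq = "ATGCCAGGCTAGCTTATTTAA"

# Greedy: [ACGT]* grabs as much as it can, so the match runs to the last stop codon.
greedy = [m.group(1) for m in re.finditer(r"(ATG[ACGT]*(TAG|TAA|TGA))", seq)]

# Non-greedy: [ACGT]*? grabs as little as it can, so the match ends at the first stop codon.
lazy = [m.group(1) for m in re.finditer(r"(ATG[ACGT]*?(TAG|TAA|TGA))", seq)]

print(greedy)  # ['ATGCCAGGCTAGCTTATTTAA']
print(lazy)    # ['ATGCCAGGCTAG']
```

Either way, each pattern reports one match per starting position, so neither produces both strings at once.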

In python, how to use regex to conditionally pattern match [duplicate]

This question already has answers here:
How to make some parts of a regex pattern conditional?
(3 answers)
Closed 6 years ago.
I am trying to parse the following string with python's regex library:
recipe_a = 'run_list[sm_collectd::default#1.0.0]'
Using http://pythex.org/, I am trying out the following regexes:
\[(.*)::(.*)#(.*)\]
Which yields:
Match 1
1. sm_collectd
2. default
3. 1.0.0
Here is the problem:
recipe_a can be pattern matched with this regex, however, it fails when the #version is no longer specified in the string. The following example will fail to match the pattern:
recipe_b = 'run_list[sm_collectd::default]'
\[(.*)::(.*)#(.*)\] fails in this case because # was never matched. I could have Python logic that tries \[(.*)::(.*)#(.*)\] first and then falls back to \[(.*)::(.*)\], but that would be silly. It would be nice if I could accomplish this with one regex pattern.
I have tried solving this with a conditional regex statement. The general syntax I have tried is as follows:
(?(?=regex)then|else)
The first ? introduces a lookahead assertion, which matches without consuming. So we could have a conditional match on the # symbol:
if the # is matched then use \[(.*)::(.*)#(.*)\], else use \[(.*)::(.*)\].
PROGRAMMATIC SOLUTION
import re
kitchen_recipe = 'recipe[my_cookbook::default#0.1.0]'
# str.strip() removes a set of characters, not a prefix, so slice instead
recipe = kitchen_recipe[len('recipe['):-1]
if '#' in recipe:
    cookbook, recipe, cookbook_version = re.split('::|#', recipe)
else:
    cookbook, recipe = re.split('::', recipe)
    cookbook_version = None  # no version specified
REGEX SOLUTION
import re
kitchen_recipe = 'recipe[my_cookbook::default#0.1.0]'
run_list_pattern = r'\[(.*)::([^#]*)#?([0-9.]*)\]'
# cookbook_version is '' (not None) when no version is given
cookbook, recipe, cookbook_version = re.search(run_list_pattern,
                                               kitchen_recipe).groups()
A couple of small changes should work for your purposes; try \[(.*)::([^#]*)(?:#(.*))?\]
run_list[sm_collectd::default] yields
sm_collectd
default
None
run_list[sm_collectd::default#1.0.0] yields
sm_collectd
default
1.0.0
Explanation:
I changed the final #(.*) into (?:#(.*))?. The additional (?:) is a non-capturing group, and the following ? means "match 0 or 1 times".
I also changed the second (.*) into ([^#]*), which means match anything but a # as many times as possible, so it won't treat all of default#1.0.0 as the chef recipe name.
Edit: As @gregory points out, you can avoid the non-capturing group. You can use \[(.*)::([^#]*)#?(.*)\].
If you'd like to limit the version number to only digits and .s: \[(.*)::([^#]*)#?([0-9.]*)\].
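For completeness, here is a small sketch of the optional-group pattern applied to both inputs from the question. The function name parse_run_list is made up for this example.

```python
import re

# Pattern from the answer above: the "#version" part is an optional non-capturing group.
RUN_LIST_PATTERN = re.compile(r"\[(.*)::([^#]*)(?:#(.*))?\]")

def parse_run_list(entry):
    """Return (cookbook, recipe, version); version is None when absent."""
    match = RUN_LIST_PATTERN.search(entry)
    if match is None:
        raise ValueError("not a run_list entry: %r" % entry)
    return match.groups()

with_version = parse_run_list("run_list[sm_collectd::default#1.0.0]")
without_version = parse_run_list("run_list[sm_collectd::default]")
print(with_version)     # ('sm_collectd', 'default', '1.0.0')
print(without_version)  # ('sm_collectd', 'default', None)
```

With the #?(.*) variants, the third group would come back as an empty string instead of None when no version is present, so pick whichever is easier to test for in your code.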

How can I find multiple of the same format in Python? [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 6 years ago.
For a little idea of what the project is, I'm trying to make a markup language that compiles to HTML/CSS. I plan on formatting links like this: #(link mask)[(link url)], and I want to find all occurrences of this and get both the link mask and the link url.
I tried using this code for it:
re.search("#(.*)\[(.*)\]", string)
But it started at the beginning of the first instance, and ended at the end of the last instance of a link. Any ideas how I can have it find all of them, in a list or something?
The default behavior of a regular expression is "greedy matching". This means each .* will match as many characters as it can.
You want them to instead match the minimal possible number of characters. To do that, change each .* into a .*?. The final question mark will make the pattern match the minimal number of characters. Because you anchor your pattern to a ] character, it will still match/consume the whole link correctly.
* is greedy: it matches as many characters as it can, e.g. up to the last ] in your document. (After all, . means "any character", and ] is "any character" as much as any other character.)
You need the non-greedy version of *, which is *?. (Probably actually you should use +?, as I don't think zero-length matches would be very useful).
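Putting both answers together, a sketch with re.findall might look like this. It assumes the parentheses in #(link mask)[(link url)] are placeholders rather than literal characters, and the sample text is invented.

```python
import re

# Hypothetical markup with two links in the assumed #mask[url] form.
text = "See #Google[https://google.com] and #Python[https://python.org]"

# Greedy: a single match that swallows everything up to the last '[' and ']'.
greedy = re.search(r"#(.*)\[(.*)\]", text).groups()

# Non-greedy, with +? so empty masks and URLs are not accepted.
links = re.findall(r"#(.+?)\[(.+?)\]", text)

print(greedy)  # ('Google[https://google.com] and #Python', 'https://python.org')
print(links)   # [('Google', 'https://google.com'), ('Python', 'https://python.org')]
```

re.findall returns one (mask, url) tuple per link, which is the "list or something" the question asks for.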

Regex not working to get string between 2 strings. Python 27 [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 3 years ago.
From this URL view-source:https://www.amazon.com/dp/073532753X?smid=A3P5ROKL5A1OLE
I want to get string between var iframeContent = and obj.onloadCallback = onloadCallback;
I have this regex iframeContent(.*?)obj.onloadCallback = onloadCallback;
But it does not work. I am not good at regex so please pardon my lack of knowledge.
I even tried iframeContent(.*?)obj.onloadCallback but it does not work.
It looks like you just want that giant encoded string. I believe yours is failing for two reasons. You're not running in DOTALL mode, which means your . won't match across multiple lines, and your regex is failing because of catastrophic backtracking, which can happen when you have a very long variable length match that matches the same characters as the ones following it.
This should get what you want
import re
m = re.search(r'var iframeContent = "([^"]+)"', html_source)
if m:
    print(m.group(1))
The regex is just looking for any characters except double quotes [^"] in between two double quotes. Because the variable length match and the match immediately after it don't match any of the same characters, you don't run into the catastrophic backtracking issue.
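I suspect that the input string spans multiple lines. Try passing re.DOTALL (re.S) in the search call (i.e. re.findall('someString', text_Holder, re.DOTALL)). Note that re.M only changes how ^ and $ behave; it does not let . match newlines.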
You could try this regex too
(?<=iframeContent =)(.*)(?=obj.onloadCallback = onloadCallback)
You can check the test at this site.
It is very important that you use DOTALL mode (also called single-line mode), in which . also matches newline characters.
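The effect of the flag is easy to see on a tiny stand-in for the page source. This is only a sketch: html_source here is a made-up three-line string, not the real Amazon page.

```python
import re

# Hypothetical stand-in for the page source; the real page is much larger.
html_source = (
    'var iframeContent = "line one\n'
    'line two";\n'
    "obj.onloadCallback = onloadCallback;"
)

pattern = r"iframeContent = (.*?)obj\.onloadCallback"

without_flag = re.search(pattern, html_source)
with_multiline = re.search(pattern, html_source, re.M)    # only affects ^ and $
with_dotall = re.search(pattern, html_source, re.DOTALL)  # lets . match newlines

print(without_flag)          # None
print(with_multiline)        # None
print(with_dotall.group(1))
```

Only the re.DOTALL call finds anything, because the text between the two markers contains newlines.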

Using variable in regular expression in Python [duplicate]

This question already has answers here:
How to use a variable inside a regular expression?
(12 answers)
Closed 2 years ago.
I've looked at several posts and other forums to find an answer related to my question, but nothing has come up specific to what I need. As a heads up, I'm new to programming and don't possess the basic foundation that most would.
I know bash and a little Python, and I'm decent with RE.
I'm trying to create a python script, using RE's to parse through data and give me an output that I need/want.
My output will consist of 4 values, all originating from one line. The line being read in is thrown together with no defined delimiter. (hence the reason for my program)
In order to find one of the 4 values, I have to say look for 123- and give me everything after that but stop here df5. The 123- is not constant, but defined by a regular expression that works, same goes for df5. I assigned both RE's to a variable. How can I use those variables to find what I want between the two... Please let me know if this makes sense.
import re
start = '123-'
stop = 'df5'
regex = re.compile('{0}(.*?){1}'.format(re.escape(start), re.escape(stop)))
Note that the re.escape() calls aren't necessary for these example strings, but it is important if your delimiters can ever include characters with a special meaning in regex (., *, +, ? etc.).
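Here is a sketch of how that could be wrapped up and used. The helper name between, its literal parameter and the sample line are all made up for illustration. Since the question says the delimiters are themselves regular expressions, the sketch also shows that case, where re.escape() has to be skipped.

```python
import re

def between(start, stop, text, literal=True):
    """Return every shortest span found between start and stop in text."""
    if literal:
        # Treat the delimiters as plain strings, not as regex syntax.
        start, stop = re.escape(start), re.escape(stop)
    pattern = re.compile("{0}(.*?){1}".format(start, stop))
    return pattern.findall(text)

line = "abc123-hello.worlddf5xyz"
plain = between("123-", "df5", line)
as_regex = between(r"\d{3}-", r"df\d", line, literal=False)
print(plain)     # ['hello.world']
print(as_regex)  # ['hello.world']
```

If the delimiter patterns contain capturing groups of their own, findall returns tuples instead of strings, so non-capturing groups (?:...) are the safer choice inside them.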
How about a pattern "%s(.*?)%s" % (oneTwoThree, dF5)? Then you can do a re.search on that pattern and use the groups function on the result.
Something along the lines of:
pattern = "%s(.*?)%s" % (oneTwoThree, dF5)
matches = re.search(pattern, text)
if matches:
    print(matches.groups())
re.findall, if used instead of re.search, can save you the trouble of grouping the matches.
