Matching optional numbers in regex

This one is probably a simple one, but I could not find an example that's simple enough to understand (sorry, I'm new with RegEx).
I'm writing some Python code to search for any string that matches any of the following examples:
float[20]
float[7532]
float[]
So this is what I have so far:
import re
p = re.compile(r'float\[[0-9]+\]')
print(p.match("float[20]"))
print(p.match("float[7532]"))
print(p.match("float[]"))
The code works great for the first and second scenarios, but not the third (no numbers between brackets). What's the best way to add that condition?
Thanks a lot!

p = re.compile(r'float\[[0-9]*\]')
Putting a * after the character class means zero or more matches of the character class.

Try
float\[\d*\]
\d is a shortcut for [0-9].
The asterisk matches 0..n (any number) of characters of the character class.
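One caveat, as a small sketch assuming Python 3 and str (not bytes) patterns: there \d also matches non-ASCII decimal digits unless re.ASCII is passed, so it is slightly broader than [0-9]. The sample character below was picked only for illustration.

```python
import re

arabic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE, an illustrative non-ASCII digit

# \d on a str pattern accepts any Unicode decimal digit
unicode_digit_matches = re.fullmatch(r'\d', arabic_three) is not None
# the explicit class only accepts ASCII 0-9
ascii_class_matches = re.fullmatch(r'[0-9]', arabic_three) is not None
# re.ASCII restricts \d to ASCII 0-9 as well
ascii_flag_matches = re.fullmatch(r'\d', arabic_three, re.ASCII) is not None

print(unicode_digit_matches, ascii_class_matches, ascii_flag_matches)
```

For the float[...] strings in the question both spellings behave the same, since the inputs only contain ASCII digits.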

The + operator requires at least one instance of whatever it's applying to, which your third option doesn't have. You want the * operator which is 0 or more. So:
p = re.compile(r'float\[[0-9]*\]')

Try:
import re
p = re.compile(r'float\[[0-9]*\]')
print(p.match("float[20]"))
print(p.match("float[7532]"))
print(p.match("float[]"))
+ is for one or more elements and * is for zero or more elements.
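To see the difference side by side, here is a minimal sketch that runs both quantifiers over the three sample strings from the question. It uses fullmatch() rather than match() (my choice, not part of the answers) so that trailing text after the closing bracket is rejected too.

```python
import re

plus_pattern = re.compile(r'float\[[0-9]+\]')  # one or more digits required
star_pattern = re.compile(r'float\[[0-9]*\]')  # zero or more digits allowed

samples = ["float[20]", "float[7532]", "float[]"]

plus_results = [bool(plus_pattern.fullmatch(s)) for s in samples]
star_results = [bool(star_pattern.fullmatch(s)) for s in samples]

print(plus_results)  # [True, True, False]
print(star_results)  # [True, True, True]
```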

Related

Change part of a word (string) into in a different string if a sign occurs. Python

What is the most effective way to cut out the part of a string that starts after the marker '=#=' and ends at the next '=#='? For example:
From the large string
'321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319'
the Python code should return:
'I-LOVE-STACK-OVER-FLOW'
Any help will be appreciated.

Using split():
s = '321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319'
st = '=#='
ed = '=#='
print((s.split(st))[1].split(ed)[0])
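Using regex (the non-greedy .*? stops at the first closing delimiter; a greedy .* would run on to the last '=#=' in the string):
import re
s = '321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319'
st = '=#='
ed = '=#='
print(re.search('%s(.*?)%s' % (re.escape(st), re.escape(ed)), s).group(1))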
OUTPUT:
I-LOVE-STACK-OVER-FLOW

In addition to @DirtyBit's answer, if you also want to handle cases with more than two '=#='s, you can split the string and then join every other element:
s = '321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319=#=|I-ALSO-LOVE-SO=#=3123123'
parts = s.split('=#=')
print(''.join([parts[i] for i in range(1,len(parts),2)]))
Output
I-LOVE-STACK-OVER-FLOWxx$q|I-ALSO-LOVE-SO

The explanation is in the code.
import re
ori_str = '321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319'
ori_list = re.split("=#=", ori_str)
# the goal is to find the strings wrapped between "=#=" markers
# after the split, the even positions hold the parts outside the markers
# and the odd positions hold the parts you want
for i in range(len(ori_list)):
    if i % 2 == 1:  # odd position
        print(ori_list[i])
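The odd-position idea can also be written with a slice instead of an index test. This is only a sketch on the question's sample string; it assumes the markers come in balanced pairs, otherwise the last odd element is an unclosed fragment.

```python
s = '321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319'

parts = s.split('=#=')
# Elements at odd indexes sit between an opening and a closing marker.
wrapped = parts[1::2]

print(wrapped)  # ['I-LOVE-STACK-OVER-FLOW', 'xx$q']
```

Note that the sample string contains four markers, so there are two wrapped parts, not one.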

Printing substrings' patterns from a string in Python

The input to this problem is a string with a specific form. For example, the input s can be s='3(a)2(b)' or s='3(aa)2(bbb)' or s='4(aaaa)'. The output should be a string in which each substring inside the brackets is repeated as many times as the number that precedes it.
For example,
Input ='3(a)2(b)'
Output='aaabb'
Input='4(aaa)'
Output='aaaaaaaaaaaa'
and similarly for other inputs. The program should print an empty string for wrong or invalid inputs.
This is what I've tried so far
s='3(aa)2(b)'
p=''
q=''
for i in range(0,len(s)):
    #print(s[i],end='')
    if s[i]=='(':
        k=int(s[i-1])
        while(s[i+1]!=')'):
            p+=(s[i+1])
            i+=1
    if s[i]==')':
        q+=k*p
print(q)
Can anyone tell what's wrong with my code?

A oneliner would be:
''.join(int(y[0])*y[1] for y in (x.split('(') for x in Input.split(')')[:-1]))
It works like this. We take the input, and split on the close paren
In [1]: Input ='3(a)2(b)'
In [2]: a = Input.split(')')[:-1]
In [3]: a
Out[3]: ['3(a', '2(b']
This gives us the integer, character pairs we're looking for, but we need to get rid of the open paren, so for each x in a, we split on the open paren to get a two-element list where the first element is the int (still as a string) and the second is the character. You'll see this in b
In [4]: b = [x.split('(') for x in a]
In [5]: b
Out[5]: [['3', 'a'], ['2', 'b']]
So for each element in b, we need to cast the first element as an integer with int() and multiply by the character.
In [6]: c = [int(y[0])*y[1] for y in b]
In [7]: c
Out[7]: ['aaa', 'bb']
Now we join on the empty string to combine them into one string with
In [8]: ''.join(c)
Out[8]: 'aaabb'

Try this:
import re
s = '3(aa)2(b)'
a = re.findall(r'\d+', s)
b = re.findall(r'[a-zA-Z]+', s)
c = ''
for i, j in zip(a, b):
    c += int(i) * j
print(c)
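The question also asks for an empty string on invalid input, which the snippets so far do not check. A possible sketch follows, assuming "valid" means one or more count(letters) groups and nothing else; the function name expand is made up for this example.

```python
import re

def expand(template):
    """Expand '3(a)2(b)' into 'aaabb'; return '' for malformed input."""
    # Validate the whole string first: one or more count(letters) groups.
    if not re.fullmatch(r'(?:\d+\([a-zA-Z]+\))+', template):
        return ''
    pairs = re.findall(r'(\d+)\(([a-zA-Z]+)\)', template)
    return ''.join(int(count) * text for count, text in pairs)

print(expand('3(a)2(b)'))  # aaabb
print(expand('4(aaa)'))    # aaaaaaaaaaaa
print(expand('3(a)2b'))    # prints an empty line
```

Validating with fullmatch() before extracting keeps the digits and letters paired; two independent findall() calls would silently mispair them on input such as 'a3(b)'.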

Here is how you could do it:
Step 1: Simple case, getting the data out of a really simple template
Let's assume your template string is 3(a). That's the simplest case I could think of. We'll need to extract pieces of information from that string. The first one is the count of chars that will have to be rendered. The second is the char that has to be rendered.
This is a case where regular expressions are well suited (hence the use of the re module from Python's standard library).
I won't give a full course on regex; you'll have to do that on your own. However, I'll quickly explain the steps I used. count (the variable that holds the number of times we should render the string) is one or more digits, hence our first capturing group is (\d+). Then we have the text to extract, which is enclosed in parentheses, hence \((\w+)\) (this allows several chars to be rendered at once). Put together, we get (\d+)\((\w+)\).
Applied to our case, a straightforward use of the re module is:
import re
# Our template
template = '3(a)'
# Run the regex
match = re.search(r'(\d+)\((\w+)\)', template)
if match:
    # Get the count from the first capturing group
    count = int(match.group(1))
    # Get the string to render from the second capturing group
    string = match.group(2)
    # Print the string as many times as count specifies
    print(count * string)
Output:
aaa
Yeah!
Step 2: Full case, with several templates
Okay, we know how to do it for one template. How do we do the same for several, for instance 3(a)4(b)? Well... How would we do it "by hand"? We'd read the full template from left to right and apply each template one by one. That is what we'll do with Python!
Luckily for us, the re module has a function just for that: finditer. It does exactly what we described above.
So, we'll do something like:
import re
# Our template
template = '3(a)4(b)'
# Iterate through found templates
for match in re.finditer(r'(\d+)\((\w+)\)', template):
    # Get the count from the first capturing group
    count = int(match.group(1))
    # Get the string to render from the second capturing group
    string = match.group(2)
    print(count * string)
Output:
aaa
bbbb
Okay... All that remains is to combine the pieces. We can append each rendered piece to a list and then join the items of this list at the end, no?
Let's do it!
import re
template = '3(a)4(b)'
parts = []
for match in re.finditer(r'(\d+)\((\w+)\)', template):
    parts.append(int(match.group(1)) * match.group(2))
print(''.join(parts))
Output:
aaabbbb
Yeah!
Step 3: Final step, optimization
Because we can always do better, we won't stop. for loops are cool. But what I love (it's personal) about python is that there is so much stuff you can actually just write with one line! Is it the case here? Well yes :).
First we can remove the for loop and the append using a list comprehension:
parts = [int(match.group(1)) * match.group(2) for match in re.finditer(r'(\d+)\((\w+)\)', template)]
rendered = ''.join(parts)
Finally, let's merge the two lines that populate parts and join them into a single expression:
import re
template = '3(a)4(b)'
rendered = ''.join(
    int(match.group(1)) * match.group(2)
    for match in re.finditer(r'(\d+)\((\w+)\)', template))
print(rendered)
Output:
aaabbbb
Yeah! Still the same output :).
Hope it helped!
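Another way to get the same result in one call, offered as a sketch rather than as part of the answer above, is re.sub() with a function as the replacement. The name render is only illustrative.

```python
import re

def render(template):
    # Replace each count(text) group with the text repeated count times.
    return re.sub(r'(\d+)\((\w+)\)',
                  lambda m: int(m.group(1)) * m.group(2),
                  template)

print(render('3(a)4(b)'))  # aaabbbb
```

One difference from the finditer() version: text that does not match the pattern is left in place instead of being dropped, so render('x3(a)') gives 'xaaa'.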

The value of 'p' should be refreshed after each iteration.
s='1(aaa)2(bb)'
p=''
q=''
i=0
while i<len(s):
    if s[i]=='(':
        k=int(s[i-1])
        p=''
        while(s[i+1]!=')'):
            p+=(s[i+1])
            i+=1
    if s[i]==')':
        q+=k*p
    i+=1
print(q)
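Reading the count with s[i-1] only works for single-digit counts. A possible variant of the same index-based scan that accepts multi-digit counts and returns an empty string on malformed input is sketched below; the function name expand and the exact validity rules are my assumptions.

```python
def expand(s):
    """Expand e.g. '12(a)2(bb)'; return '' if the input is malformed."""
    result = ''
    i = 0
    while i < len(s):
        # Read the (possibly multi-digit) count.
        start = i
        while i < len(s) and s[i] in '0123456789':
            i += 1
        if i == start or i >= len(s) or s[i] != '(':
            return ''  # missing count or missing opening bracket
        count = int(s[start:i])
        close = s.find(')', i)
        if close == -1:
            return ''  # unclosed bracket
        result += count * s[i + 1:close]
        i = close + 1
    return result

print(expand('1(aaa)2(bb)'))  # aaabbbb
print(expand('12(a)'))        # aaaaaaaaaaaa
```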

The code is not behaving the way I want it to. The problem is the placement of 'p', the variable that accumulates the substring inside the ( )s. It was never reset, so each group kept the characters collected for the previous ones. Initializing 'p' inside the 'if' block does the job.
s='2(aa)2(bb)'
q=''
for i in range(0,len(s)):
    if s[i]=='(':
        k=int(s[i-1])
        p=''
        while(s[i+1]!=')'):
            #print(i,'first time')
            p+=s[i+1]
            i+=1
        q+=p*k
        #print(i,'second time')
print(q)

What you want is not really printing substrings. The real purpose is more likely to generate text from a regular expression or from commands.
You can parametrize a function to read it, or use something like this:
The python library rstr has the function xeger() to do what you need by using random strings and only returning ones that match:
Example
Install with pip install rstr
In [1]: from __future__ import print_function
In [2]: import rstr
In [3]: for dummy in range(10):
   ...:     print(rstr.xeger(r"(a|b)[cd]{2}\1"))
   ...:
acca
bddb
adda
bdcb
bccb
bcdb
adca
bccb
bccb
acda
Warning
For complex re patterns this might take a long time to generate any matches.

Issue when using array slicing

I am an intermediate Python programmer. In my experiment, I run a Linux command that outputs results like this:
OFPST_TABLE reply (xid=0x2):
table 0 ("classifier"):
active=1, lookup=41, matched=4
max_entries=1000000
matching:
in_port: exact match or wildcard
eth_src: exact match or wildcard
eth_dst: exact match or wildcard
eth_type: exact match or wildcard
vlan_vid: exact match or wildcard
vlan_pcp: exact match or wildcard
ip_src: exact match or wildcard
ip_dst: exact match or wildcard
nw_proto: exact match or wildcard
nw_tos: exact match or wildcard
tcp_src: exact match or wildcard
tcp_dst: exact match or wildcard
My goal is to collect the value of parameter active= which is variable from time to time (In this case it is just 1). I use the following slicing but it does not work:
string = sw.cmd('ovs-ofctl dump-tables ' + sw.name) # trigger the sh command
count = count + int(string[string.rfind("=") + 1:])
I think I am using slicing wrong here but I tried many ways but I still get nothing. Can someone help me to extract the value of active= parameter from this string?
Thank you very much :)

How about regex?
import re
count += int(re.search(r'active\s*=\s*(\d+)', string).group(1))

1) Use regular expressions:
import re
m = re.search(r'active=(\d+)', ' active=1, lookup=41, matched=4')
print(m.group(1))
2) str.rfind returns the highest index in the string where the substring is found, so it finds the rightmost = (that of matched=4 in this line), which is not what you want.
3) Simple slicing won't help because you would need to know the length of the active value; overall it is not the best tool for this task.
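Put together on the multi-line output from the question, this could look like the sketch below. The output string is a shortened copy of the question's sample, and read_active is a hypothetical helper name; it returns None instead of raising when no active= field is present.

```python
import re

# Shortened copy of the command output shown in the question.
output = (
    'OFPST_TABLE reply (xid=0x2):\n'
    '  table 0 ("classifier"):\n'
    '    active=1, lookup=41, matched=4\n'
    '    max_entries=1000000\n'
)

def read_active(text):
    """Return the first active= value as an int, or None if absent."""
    match = re.search(r'\bactive=(\d+)', text)
    return int(match.group(1)) if match else None

print(read_active(output))       # 1
print(read_active('no tables'))  # None
```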

Optional grouping in a simple python regex

All I want to do is search a string for instances of two consecutive digits. If such an instance is found I want to group it, otherwise return None for that particular group. I thought this would be trivial, but I can't understand where I'm going wrong. In the example below, removing the optional (?) character gets me the numbers, but in strings without numbers, r evaluates to None, so r.groups() throws an exception.
p = re.compile(r'(\d{2})?')
r = p.search('wqddsel78ffgr')
print r.groups()
>>>(None, ) # why not ('78', )?
# --- update/clarification --- #
Thanks for the answers, but the explanations given are leaving me none the wiser. Here's another go at pinpointing exactly what it is I don't understand.
pattern = re.compile(r'z.*(A)?')
_string = "aazaa90aabcdefA"
result = pattern.search(_string)
result.group()
>>> zaa90aabcdefA
result.groups()
>>> (None, )
I understand why result.group() produces the result it does, but why doesn't result.groups() produce ('A', )? I thought it worked like this: once the regex hits the z it then matches right to the end of the line using .*. In spite of .* matching everything, the regex engine is aware that it passed over an optional group, and since ? means it will try to match if it can, it should work backwards to try and match. Replacing ? with + does return ('A', ). This suggests that ? won't try and match if it doesn't have to, but this seems to contrast with much of what I've read on the subject (esp. J. Friedl's excellent book).

This works for me:
p = re.compile(r'\D*(\d{2})?')
r = p.search('wqddsel78ffgr')
print(r.groups()) # ('78',)
r = p.search('wqddselffgr')
print(r.groups()) # (None,)

Use regex pattern
(\d{2}|(?!.*\d{2}))
If you want to be sure there are exactly 2 consecutive digits and not 3 or more, go with
((?<!\d)\d{2}(?!\d)|(?!.*(?<!\d)\d{2}(?!\d)))

The ? makes your regex match the empty string. If you omit it, you could just check the result like this:
p = re.compile(r'(\d{2})')
r = p.search('wqddsel78ffgr')
print(r.groups() if r else ('',))

Remember that you can search for all matches of a RE in a string easily using findall():
re.findall(r'\d{2}', 'wqddsel78ffgr') # => ['78']
If you don't need the positions where the match occurs, this seems like a simpler way to accomplish what you're doing.

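? means 0 or 1 repetitions. The regex engine starts at the first character, where one repetition fails (w and q are not digits) but zero repetitions succeed, so search() returns that empty match at position 0 and never reaches 78 :)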
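Both examples from the question can be inspected directly. In this sketch, the last pattern (a lazy .*? plus an end anchor) is my own suggestion for the clarified example, not something taken from the answers.

```python
import re

# search() returns the leftmost match. At index 0 the optional group cannot
# match two digits, so it matches nothing and the search stops right there.
m = re.search(r'(\d{2})?', 'wqddsel78ffgr')
print(m.span(), m.groups())  # (0, 0) (None,)

# Greedy .* already consumes the trailing 'A'; the optional group is then
# satisfied by matching nothing, so no backtracking is forced.
greedy = re.search(r'z.*(A)?', 'aazaa90aabcdefA')
print(greedy.groups())  # (None,)

# A lazy .*? plus an end anchor leaves the 'A' for the group to take.
lazy = re.search(r'z.*?(A)?$', 'aazaa90aabcdefA')
print(lazy.groups())  # ('A',)
```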

Is it possible to use a back reference to specify the number of replications in a regular expression?

Is it possible to use a back reference to specify the number of replications in a regular expression?
foo= 'ADCKAL+2AG.+2AG.+2AG.+2AGGG+.G+3AGGa.'
The substrings that start with '+[0-9]' followed by '[A-z]{n}.' need to be replaced with simply '+' where the variable n is the digit from earlier in the substring. Can that n be back referenced? For example (doesn't work) '+([0-9])[A-z]{/1}.' is the pattern I want replaced with "+" (that last dot can be any character and represents a quality score) so that foo should come out to ADCKAL+++G.G+.
import re
foo = 'ADCKAL+2AG.+2AG.+2AG.+2AGGG+.+G+3AGGa.'
indelpatt = re.compile(r'\+([0-9])')
while indelpatt.search(foo):
    indelsize=int(indelpatt.search(foo).group(1))
    new_regex = r'\+%s[ACGTNacgtn]{%s}.' % (indelsize,indelsize)
    newpatt=re.compile(new_regex)
    foo = newpatt.sub("+", foo)
I'm probably missing an easier way to parse the string.

No, you cannot use back-references as quantifiers. A workaround is to construct a regular expression that can handle each of the cases in an alternation.
import re
foo = 'ADCKAL+2AG.+2AG.+2AG.+2AGGG^+.+G+3AGGa4.'
pattern = '|'.join(r'\+%s[ACGTNacgtn]{%s}.' % (i, i) for i in range(1, 10))
regex = re.compile(pattern)
foo = regex.sub("+", foo)
print(foo)
Result:
ADCKAL++++G^+.+G+4.
Note also that your code contains an error that causes it to enter an infinite loop on the input you gave.
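If enumerating every count in an alternation feels heavy, one alternative is a single left-to-right scan that reads the digit and then builds the quantifier for that one spot. This is a sketch with hypothetical names (INDEL_START, collapse_indels), not the approach from the answer above; on the answer's sample input it gives the same result.

```python
import re

INDEL_START = re.compile(r'\+([1-9])')

def collapse_indels(seq):
    out = []
    i = 0
    while i < len(seq):
        start = INDEL_START.match(seq, i)
        if start:
            n = int(start.group(1))
            # n bases, then one more character of any kind (the quality score)
            body = re.compile(r'[ACGTNacgtn]{%d}.' % n).match(seq, start.end())
            if body:
                out.append('+')
                i = body.end()
                continue
        # Not a complete indel at this position: copy the character as-is.
        out.append(seq[i])
        i += 1
    return ''.join(out)

foo = 'ADCKAL+2AG.+2AG.+2AG.+2AGGG^+.+G+3AGGa4.'
print(collapse_indels(foo))  # ADCKAL++++G^+.+G+4.
```

Because the scan always moves forward, it cannot loop forever on a '+digit' that is not followed by enough bases; such text is simply copied through.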
