Why doesn't this regular expression match in this string? - python

I want to be able to replace a string in a file using regular expressions. But my function isn't finding a match. So I've mocked up a test to replicate what's happening.
I have defined the string I want to replace as follows:
string = 'buf = O_strdup("ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&");'
I want to replace the "TYPE=PUZZLE&PREFIX=EXPRESS&" part with something else. (NB: the string won't always contain exactly "PUZZLE" and "PREFIX" in the original file, but it will be of that format.)
So first I tried testing that I got the correct match.
obj = re.search(r'TYPE=([\^&]*)\&PREFIX=([\^&]*)\&', string)
if obj:
    print obj.group()
else:
    print "No match!!"
I thought that ([\^&]*) would match any number of characters that are NOT an ampersand.
But I always get "No match!!".
However,
obj = re.search(r'TYPE=([\^&]*)', string)
returns me "TYPE="
Why doesn't my first one work?

Since the ^ sign is escaped with \ the following part: ([\^&]*) matches any sequence of these characters: ^, &.
Try replacing it with ([^&]*).
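To make the difference concrete, here is a minimal sketch that runs both character classes against the string from the question. The variable names escaped and negated are only illustrative.

```python
import re

s = 'buf = O_strdup("ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&");'

# Escaped caret: the class contains the literal characters '^' and '&'.
escaped = re.search(r'TYPE=([\^&]*)', s)
print(repr(escaped.group(1)))   # '' (zero characters matched after "TYPE=")

# Unescaped caret in first position: the class means "anything except '&'".
negated = re.search(r'TYPE=([^&]*)&PREFIX=([^&]*)&', s)
print(negated.groups())         # ('PUZZLE', 'EXPRESS')
```

The escaped version matches an empty run after "TYPE=" because "P" is neither ^ nor &, which is why the longer pattern then fails at the literal &.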

In my regex tester, this does work: 'TYPE=(.*)\&PREFIX=(.*)\&'

Try this instead
obj = re.search(r'TYPE=(?P<type>[^&]*?)&PREFIX=(?P<prefix>[^&]*?)&', string)
The ?P<some_name> is a named capture group and makes it a little bit easier to access the captured group, obj.group("type") -->> 'PUZZLE'
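Since the goal in the question is a replacement rather than a search, the same pattern can probably be passed to sub(). This is only a sketch: new_type and new_prefix are made-up replacement values, not something from the question.

```python
import re

s = 'buf = O_strdup("ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&");'
pattern = re.compile(r'TYPE=(?P<type>[^&]*)&PREFIX=(?P<prefix>[^&]*)&')

# new_type and new_prefix are hypothetical replacement values.
new_type, new_prefix = 'GAME', 'LOCAL'
replaced = pattern.sub('TYPE=%s&PREFIX=%s&' % (new_type, new_prefix), s)
print(replaced)
# buf = O_strdup("ONE=001&TYPE=GAME&PREFIX=LOCAL&");
```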

It might be better to use the functions urlparse.parse_qsl() and urllib.urlencode() instead of regular expressions. The code will be less error-prone:
from urlparse import parse_qsl
from urllib import urlencode
s = "ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&"
a = parse_qsl(s)
d = dict(TYPE="a", PREFIX="b")
print urlencode(list((key, d.get(key, val)) for key, val in a))
# ONE=001&TYPE=a&PREFIX=b
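On Python 3 both functions live in urllib.parse. A sketch of the same idea that should run on either version follows; the names pairs and overrides are just illustrative. Note that the trailing & of the input is not preserved.

```python
try:
    from urllib.parse import parse_qsl, urlencode   # Python 3
except ImportError:
    from urlparse import parse_qsl                  # Python 2
    from urllib import urlencode

s = "ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&"
pairs = parse_qsl(s)                       # list of (key, value) tuples
overrides = dict(TYPE="a", PREFIX="b")     # values to substitute
result = urlencode([(key, overrides.get(key, val)) for key, val in pairs])
print(result)
# ONE=001&TYPE=a&PREFIX=b
```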

Related

Python Regular Expression Extracting 'name= ....'

I'm using a Python script to read data from our corporate instance of JIRA. There is a value that is returned as a string and I need to figure out how to extract one bit of info from it. What I need is the 'name= ....' and I just need the numbers from that result.
<class 'list'>: ['com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,startDate=2016-05-26T08:50:57.273-07:00,endDate=2016-06-08T20:59:00.000-07:00,completeDate=2016-06-09T07:34:41.899-07:00,sequence=30943]']
I just need the 2016.2.4 portion of it. This number will not always be the same either.
Any thoughts as how to do this with RE? I'm new to regular expressions and would appreciate any help.
A simple regular expression can do the trick: name=([0-9.]+).
The primary part of the regex is ([0-9.]+) which will search for any digit (0-9) or period (.) in succession (+).
Now, to use this:
import re
pattern = re.compile('name=([0-9.]+)')
string = '''<class 'list'>: ['com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,startDate=2016-05-26T08:50:57.273-07:00,endDate=2016-06-08T20:59:00.000-07:00,completeDate=2016-06-09T07:34:41.899-07:00,sequence=30943]']'''
matches = pattern.search(string)
# Only assign the value if a match is found
name_value = '' if not matches else matches.group(1)
Use a capturing group to extract the version name:
>>> import re
>>> s = 'com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,startDate=2016-05-26T08:50:57.273-07:00,endDate=2016-06-08T20:59:00.000-07:00,completeDate=2016-06-09T07:34:41.899-07:00,sequence=30943]'
>>> re.search(r"name=([0-9.]+)", s).group(1)
'2016.2.4'
where ([0-9.]+) matches one or more digits or dots, and the parentheses define a capturing group.
A non-regex option would involve some splitting by ,, = and -:
>>> l = [item.split("=") for item in s.split(",")]
>>> next(value[1] for value in l if value[0] == "name").split(" - ")[0]
'2016.2.4'
This, of course, needs testing and error handling.
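As a rough sketch of what that error handling could look like for the regex version, the search can be wrapped in a small helper that returns None when there is no match. The helper name sprint_version and the shortened sample string are assumptions for illustration only.

```python
import re

def sprint_version(sprint_repr):
    """Return the dotted number after 'name=', or None if it is absent."""
    match = re.search(r'name=([0-9.]+)', sprint_repr)
    return match.group(1) if match else None

# Shortened, hypothetical variant of the JIRA string from the question.
s = 'Sprint#6f68eefa[id=30943,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,sequence=30943]'
print(sprint_version(s))            # 2016.2.4
print(sprint_version('no sprint'))  # None
```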

Python replace regex

I have a string in which there are some attributes that may be empty:
[attribute1=value1, attribute2=, attribute3=value3, attribute4=]
With Python I need to substitute the empty values with the value 'None'. I know I can use string.replace('=,','=None,').replace('=]','=None]') for the string, but I'm wondering if there is a way to do it using a regex, maybe with the ?P<name> option.
You can use
import re
s = '[attribute1=value1, attribute2=, attribute3=value3, attribute4=]'
re.sub(r'=(,|])', r'=None\1', s)
\1 is the match in parenthesis.
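One detail worth checking when using \1 in the replacement: it has to reach the re module as a backslash followed by 1, so the replacement should be a raw string (or use a doubled backslash). A small sketch of the difference, with raw and not_raw as illustrative names:

```python
import re

s = '[attribute1=value1, attribute2=, attribute3=value3, attribute4=]'

raw = re.sub(r'=([,\]])', r'=None\1', s)
print(raw)
# [attribute1=value1, attribute2=None, attribute3=value3, attribute4=None]

# Without the r prefix, '\1' is the control character chr(1),
# so the comma or bracket is replaced instead of being restored.
not_raw = re.sub(r'=([,\]])', '=None\1', s)
print(repr(not_raw))
```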
With python's re module, you can do something like this:
# import it first
import re
# your code
re.sub(r'=([,\]])', r'=None\1', your_string)
You can use
s = '[attribute1=value1, attribute2=, attribute3=value3, attribute4=]'
re.sub(r'=(?!\w)', r'=None', s)
This works because the negative lookahead (?!\w) checks if the = character is not followed by a 'word' character. The definition of "word character", in regular expressions, is usually something like "a to z, 0 to 9, plus underscore" (case insensitive).
From your example data it seems all attribute values match this. It will not work if the values may start with something like a comma (unlikely), may be quoted, or may start with anything else that is not a word character. If so, you need a more foolproof setup, such as parsing from the start: skipping the attribute name by locating the first = character.
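A quick sketch of that limitation, using made-up inputs (the second string, with a negative number and a quoted value, is hypothetical and not from the question):

```python
import re

pattern = r'=(?!\w)'

ok = re.sub(pattern, '=None', '[a=value1, b=, c=]')
print(ok)
# [a=value1, b=None, c=None]

# A value starting with a non-word character is wrongly treated as empty.
wrong = re.sub(pattern, '=None', '[a=-5, b="x"]')
print(wrong)
# [a=None-5, b=None"x"]
```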
Be specific and use a character class:
import re
string = "[attribute1=value1, attribute2=, attribute3=value3, attribute4=]"
rx = r'\w+=(?=[,\]])'
string = re.sub(rx, r'\g<0>None', string)
print string
# [attribute1=value1, attribute2=None, attribute3=value3, attribute4=None]

Regex Expression not matching correctly

I'm tackling a python challenge problem to find a block of text in the format xXXXxXXXx (lower vs upper case, not all X's) in a chunk like this:
jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn
I have tested the following RegEx and found it correctly matches what I am looking for from this site (http://www.regexr.com/):
'([a-z])([A-Z]){3}([a-z])([A-Z]){3}([a-z])'
However, when I try to match this expression to the block of text, it just returns the entire string:
In [1]: import re
In [2]: example = 'jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn'
In [3]: expression = re.compile(r'([a-z])([A-Z]){3}([a-z])([A-Z]){3}([a-z])')
In [4]: found = expression.search(example)
In [5]: print found.string
jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn
Any ideas? Is my expression incorrect? Also, if there is a simpler way to represent that expression, feel free to let me know. I'm fairly new to RegEx.
You need to return the match group instead of the string attribute.
>>> import re
>>> s = 'jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn'
>>> rgx = re.compile(r'[a-z][A-Z]{3}[a-z][A-Z]{3}[a-z]')
>>> found = rgx.search(s).group()
>>> print found
nJDKoJIWh
The string attribute always returns the string passed as input to the match. This is clearly documented:
string
The string passed to match() or search().
The problem has nothing to do with the matching, you're just grabbing the wrong thing from the match object. Use match.group(0) (or match.group()).
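For comparison, a short sketch printing the attributes side by side on the string from the question:

```python
import re

example = 'jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn'
found = re.search(r'[a-z][A-Z]{3}[a-z][A-Z]{3}[a-z]', example)

print(found.string)    # the whole input string, unchanged
print(found.group())   # only the matched text: nJDKoJIWh
print(found.span())    # (start, end) indexes of the match in the input
```

Slicing the input with the span, example[found.start():found.end()], gives the same text as found.group().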
Based on xXXXxXXXx, if you want runs of three uppercase letters with a single lowercase letter between them, this is what you want:
([a-z])(([A-Z]){3}([a-z]))+
You can also get the matched text from your search call with group():
print expression.search(example).group(0)

python regular expression substitute

I need to find the value of "taxid" in a large number of strings similar to the one given below. For this particular string, the 'taxid' value is '9606'. I need to discard everything else. The "taxid" may appear anywhere in the text, but will always be followed by a ":" and then a number.
score:0.86|taxid:9606(Human)|intact:EBI-999900
How do I write a regular expression for this in Python?
>>> import re
>>> s = 'score:0.86|taxid:9606(Human)|intact:EBI-999900'
>>> re.search(r'taxid:(\d+)', s).group(1)
'9606'
If there are multiple taxids, use re.findall, which returns a list of all matches:
>>> re.findall(r'taxid:(\d+)', s)
['9606']
for line in lines:
    match = re.match(r".*\|taxid:([^|]+)\|.*", line)
    if match:  # re.match returns None when the line has no |taxid:...| field
        print match.groups()
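When looping over many lines, some may have no taxid at all, so it is probably worth guarding against a missing match. A sketch under that assumption; the second line in the list is made up for illustration:

```python
import re

lines = [
    'score:0.86|taxid:9606(Human)|intact:EBI-999900',
    'score:0.10|intact:EBI-000000',   # hypothetical line without a taxid
]

taxid_re = re.compile(r'taxid:(\d+)')
taxids = []
for line in lines:
    match = taxid_re.search(line)
    # Store None instead of raising AttributeError on a missing match.
    taxids.append(match.group(1) if match else None)
print(taxids)   # ['9606', None]
```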

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression that returns all matches, not just a single match, via re.findall('c?[0-9]+', text).
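A sketch of both points, using the string from the question. The first pattern is an adapted variant of the one in the question (it tolerates the spaces after the commas) and is only there to show the last-capture behaviour.

```python
import re

text = "start: c12354, c3456, 34526; other stuff that I don't care about"

# A repeated group keeps only its final capture.
repeated = re.search(r'start: (?:(c?[0-9]+),? ?)+;', text)
print(repeated.groups())   # ('34526',)

# Step 1: validate the prefix and isolate the part between "start:" and ";".
outer = re.match(r'start:([^;]*);', text)
# Step 2: collect every code from that part.
codes = re.findall(r'c?[0-9]+', outer.group(1)) if outer else []
print(codes)               # ['c12354', 'c3456', '34526']
```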
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # True if the string starts with this prefix
s.endswith(";") # True if the string ends with this suffix
s[6:-1].strip().split(', ') # list of tokens separated by ", " (strip() drops the space after "start:")
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            # Each group is [prefix, digits, separator]; keep the digits.
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException as exc:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']
