Python RE question - proper state initial formatting

I have a string that I need to edit. It looks something like this:
string = "Idaho Ave N,,Crystal,Mn,55427-1463,US,,610839124763,Expedited"
Notice that the state abbreviation "Mn" is not properly capitalized. I'm trying to use a regular expression to fix this:
re.sub("[A-Z][a-z],", "[A-Z][A-Z],", string)
However, re.sub treats the second part as a literal and will change Mn, to [A-Z][A-Z],. How would I use re.sub (or something similar and simple) to properly change Mn, to MN, in this string?
Thank you in advance!

Your re.sub might also modify parts of the string that you do not want to change. Try processing the right element of the list explicitly:
input = "Idaho Ave N,,Crystal,Mn,55427-1463,US,,610839124763,Expedited"
elems = input.split(',')
elems[3] = elems[3].upper()
output = ','.join(elems)
returns
'Idaho Ave N,,Crystal,MN,55427-1463,US,,610839124763,Expedited'
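If a field could ever contain a quoted comma, a plain split(',') would cut it in the wrong place. A minimal sketch of the same idea using the csv module follows; the helper name upcase_field and the quoted sample line are made up for illustration, and the field index is assumed to be known in advance.

```python
import csv
import io

def upcase_field(line, index):
    """Uppercase one field of a single CSV line (hypothetical helper)."""
    # csv.reader handles quoted fields that contain commas.
    row = next(csv.reader([line]))
    row[index] = row[index].upper()
    out = io.StringIO()
    csv.writer(out).writerow(row)
    # The writer appends a line terminator; strip it again.
    return out.getvalue().rstrip("\r\n")

line = "Idaho Ave N,,Crystal,Mn,55427-1463,US,,610839124763,Expedited"
print(upcase_field(line, 3))
```

For the sample string this should give the same result as the split/join version above.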

You can pass a function as the replacement parameter to re.sub to generate the replacement string from the match object, e.g.:
import re
s = "Idaho Ave N,,Crystal,Mn,55427-1463,US,,610839124763,Expedited"
def upcase(match):
    # Return the matched text in upper case.
    return match.group().upper()
print(re.sub("[A-Z][a-z],", upcase, s))
(This is ignoring the concern of whether you're genuinely finding state initials with this method.)
The relevant documentation is the re.sub entry in the Python re module reference.
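On the concern about matching things that are not state initials: one way to narrow the pattern, sketched here under the assumption that the state is always a whole comma-delimited field of exactly two letters, is to require a comma on both sides with lookarounds. The name STATE_FIELD is chosen for this example only.

```python
import re

s = "Idaho Ave N,,Crystal,Mn,55427-1463,US,,610839124763,Expedited"

# Assumption: a state abbreviation is a whole field of exactly two letters.
# The lookarounds require a comma on each side without consuming it.
STATE_FIELD = re.compile(r"(?<=,)[A-Za-z]{2}(?=,)")

def upcase(match):
    # Return the matched field in upper case.
    return match.group().upper()

result = STATE_FIELD.sub(upcase, s)
print(result)
```

This also touches the country field ("US"), which is harmless here because it is already upper case. It would not catch a state that is the first or last field of the line.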

sub(pattern, repl, string, count=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the match object and must return
a replacement string to be used.
re.sub("[A-Z][a-z]", lambda m: m.group(0).upper(), myString)
I would avoid calling your variable string, since that is the name of a standard library module.

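You create a group by surrounding it in parentheses within your regex, then refer to it by its group number. A string replacement such as "\1,".upper() cannot change the case of the match, because upper() runs on the replacement template before any substitution happens, so pass a function that uppercases the group instead:
re.sub(r"([A-Z][a-z]),", lambda m: m.group(1).upper() + ",", string)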

Related

How to create a regular expression to replace a url?

I'm trying to create a regular expression using re.sub() that can replace a URL from a string for example.
tool(id='merge_tool', server='http://localhost:8080')
I created a regular expression that returns a string something like given below.
a = "https:192.168.1.1:8080"
re.sub(r'http\S+', a, "tool(id='merge_tool', server='http://localhost:8080')")
results:
"tool(id='merge_tool', server='https:192.168.1.1"
Or if I provide this URL:
b = 'https:facebook.com'
re.sub(r'http\S+', b, "tool(id='merge_tool', server='http://localhost:8080')")
Results:
"tool(id='merge_tool', server='https:facebook.com"
How to fix this so that it can return the entire string after replacing the URL?
You can use
re.sub(r"http[^\s']+", b.replace('\\', '\\\\'), "tool(id='merge_tool', server='http://localhost:8080')")
Note that
http[^\s']+ will match http and then any one or more chars other than whitespace and single quote
b.replace('\\', '\\\\') is needed when the replacement string is dynamic: re.sub processes backslash escapes in a string replacement, so every literal backslash in it must be doubled to come through unchanged.
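An alternative to doubling the backslashes is to pass a function as the replacement, since the return value of a function is inserted as-is without escape processing. A minimal sketch, where replace_url and template are names invented for this example:

```python
import re

template = "tool(id='merge_tool', server='http://localhost:8080')"

def replace_url(text, new_url):
    """Swap the URL in text for new_url (hypothetical helper)."""
    # The lambda ignores the match and returns new_url literally,
    # so backslashes in new_url need no escaping.
    return re.sub(r"http[^\s']+", lambda m: new_url, text)

print(replace_url(template, "https:facebook.com"))
```

The closing quote and parenthesis survive because the character class stops at the single quote.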

substitute substring with a new one calculated from the old one with python regular expression

Is it possible to use Python regular expression to substitute a substring with a new substring "calculated" using the original substring?
Let me give an example:
Suppose I have the substring userX40. I want to get the substring 40, cast it to int, multiply it by 2 and put it back in the original string, so the final result would be userX80.
I can do this in several steps, such as:
import re
in_str = "userX40"
original_num = int(re.search(r"\d+", in_str).group())
out_str = re.search(r"(.*)[^\d]", in_str).group() + str(original_num*2)
Is there a way to do it using re.sub in one command?
Yes, it's possible with re.sub. Provide a callable object that converts the match object into an int, multiplies it, and converts back to string.
>>> s = "userX40"
>>> re.sub(r"\d+", lambda x: str(2*int(x.group())), s)
'userX80'
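The same idea can be wrapped in a small function when the factor varies. This is only a sketch; scale_numbers is a hypothetical helper name, and it deliberately rewrites every run of digits in the string, not just the last one.

```python
import re

def scale_numbers(text, factor):
    """Multiply every run of digits in text by factor (hypothetical helper)."""
    def replace(match):
        # match.group() is the digit run as a string; convert, scale, convert back.
        return str(int(match.group()) * factor)
    return re.sub(r"\d+", replace, text)

print(scale_numbers("userX40", 2))
```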

Python replace regex

I have a string in which there are some attributes that may be empty:
[attribute1=value1, attribute2=, attribute3=value3, attribute4=]
With Python I need to substitute the empty values with the value 'None'. I know I can use string.replace('=,','=None,').replace('=]','=None]') for the string, but I'm wondering if there is a way to do it using a regex, maybe with the ?P<name> option.
You can use
import re
s = '[attribute1=value1, attribute2=, attribute3=value3, attribute4=]'
re.sub(r'=(,|])', r'=None\1', s)
\1 refers to the text matched by the group in parentheses.
With python's re module, you can do something like this:
# import it first
import re
# your code
re.sub(r'=([,\]])', r'=None\1', your_string)
You can use
s = '[attribute1=value1, attribute2=, attribute3=value3, attribute4=]'
re.sub(r'=(?!\w)', r'=None', s)
This works because the negative lookahead (?!\w) checks if the = character is not followed by a 'word' character. The definition of "word character", in regular expressions, is usually something like "a to z, 0 to 9, plus underscore" (case insensitive).
From your example data it seems all attribute values match this. It will not work if the values may start with something like a comma (unlikely), may be quoted, or may start with any other non-word character. If so, you need a more foolproof setup, such as parsing each attribute from the start and skipping the attribute name by locating the first = character.
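A rough sketch of that parse-from-the-start idea, without a regex: split the list into attributes and split each attribute at its first = character. It assumes values never contain a comma, and fill_empty is a name made up for this example.

```python
def fill_empty(s, default="None"):
    """Fill empty attribute values with default (hypothetical helper)."""
    inner = s.strip()[1:-1]          # drop the surrounding [ and ]
    pairs = []
    for item in inner.split(","):
        # partition splits at the first '=' only, so '=' may appear in values.
        name, _, value = item.strip().partition("=")
        pairs.append(f"{name}={value or default}")
    return "[" + ", ".join(pairs) + "]"

s = "[attribute1=value1, attribute2=, attribute3=value3, attribute4=]"
print(fill_empty(s))
```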
Be specific and use a character class:
import re
string = "[attribute1=value1, attribute2=, attribute3=value3, attribute4=]"
rx = r'\w+=(?=[,\]])'
string = re.sub(rx, r'\g<0>None', string)
print(string)
# [attribute1=value1, attribute2=None, attribute3=value3, attribute4=None]

Why doesn't this regular expression match in this string?

I want to be able to replace a string in a file using regular expressions. But my function isn't finding a match. So I've mocked up a test to replicate what's happening.
I have defined the string I want to replace as follows:
string = 'buf = O_strdup("ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&");'
I want to replace the "TYPE=PUZZLE&PREFIX=EXPRESS&" part with something else. (NB: the string won't always contain exactly "PUZZLE" and "EXPRESS" in the original file, but it will be of that format.)
So first I tried testing that I got the correct match.
obj = re.search(r'TYPE=([\^&]*)\&PREFIX=([\^&]*)\&', string)
if obj:
    print(obj.group())
else:
    print("No match!!")
I thought that ([\^&]*) would match any number of characters that are NOT an ampersand.
But I always get "No match!!".
However,
obj = re.search(r'TYPE=([\^&]*)', string)
returns me "TYPE="
Why doesn't my first one work?
Since the ^ sign is escaped with \ the following part: ([\^&]*) matches any sequence of these characters: ^, &.
Try replacing it with ([^&]*).
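A short comparison of the two character classes on the string from the question, followed by the substitution the question was aiming for. The replacement text 'TYPE=a&PREFIX=b&' is just a placeholder chosen for this sketch.

```python
import re

s = 'buf = O_strdup("ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&");'

# [\^&] is a set containing the two literal characters ^ and &.
escaped = re.search(r'TYPE=([\^&]*)', s)
# [^&] is a negated set: any character except &.
negated = re.search(r'TYPE=([^&]*)&PREFIX=([^&]*)&', s)

print(repr(escaped.group()))   # only 'TYPE=' plus zero set characters
print(negated.groups())

# Replace the whole TYPE/PREFIX part with placeholder text.
replaced = re.sub(r'TYPE=[^&]*&PREFIX=[^&]*&', 'TYPE=a&PREFIX=b&', s)
print(replaced)
```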
In my regex tester, this does work: 'TYPE=(.*)\&PREFIX=(.*)\&'
Try this instead
obj = re.search(r'TYPE=(?P<type>[^&]*?)&PREFIX=(?P<prefix>[^&]*?)&', string)
The ?P<some_name> is a named capture group and makes it a little bit easier to access the captured group, obj.group("type") -->> 'PUZZLE'
It might be better to use urllib.parse.parse_qsl() and urllib.parse.urlencode() (urlparse.parse_qsl() and urllib.urlencode() in Python 2) instead of regular expressions. The code will be less error-prone:
from urllib.parse import parse_qsl, urlencode
s = "ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&"
a = parse_qsl(s)
d = dict(TYPE="a", PREFIX="b")
print(urlencode([(key, d.get(key, val)) for key, val in a]))
# ONE=001&TYPE=a&PREFIX=b

Regex to Split 1st Colon

I have a time in ISO 8601 ( 2009-11-19T19:55:00 ) which is also paired with a name commence. I'm trying to parse this into two. I'm currently up to here:
import re
sColon = re.compile('[:]')
aString = sColon.split("commence:2009-11-19T19:55:00")
Obviously this returns:
>>> aString
['commence','2009-11-19T19','55','00']
What I'd like it to return is this:
>>> aString
['commence','2009-11-19T19:55:00']
How would I go about doing this in the original creation of sColon? Also, do you recommend any regular expression links or books that you have found useful, as I can see myself needing them in the future!
EDIT:
To clarify: I need a regular expression that splits only at the very first instance of :. Is this possible? And yes, the text (commence) before the colon can change.
>>> first, colon, rest = "commence:2009-11-19T19:55:00".partition(':')
>>> (first, colon, rest)
('commence', ':', '2009-11-19T19:55:00')
You could put maximum split parameter in split function
>>> "commence:2009-11-19T19:55:00".split(":",1)
['commence', '2009-11-19T19:55:00']
Official Docs
S.split([sep [,maxsplit]]) -> list of strings
Return a list of the words in the string S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.
Looks like you need .IndexOf(":"), then .Substring()?
@OP, don't do the unnecessary. A regex is not needed for what you are doing. Python has very good string manipulation methods that you can use. All you need is split() and slicing. Those are the very basics of Python.
>>> "commence:2009-11-19T19:55:00".split(":",1)
['commence', '2009-11-19T19:55:00']
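For completeness, if the compiled pattern from the question has to stay, its split method also accepts a maxsplit argument. A minimal sketch; the variable names are chosen for this example, and parsing the timestamp with datetime.fromisoformat is an extra step that assumes Python 3.7 or later.

```python
import re
from datetime import datetime

s_colon = re.compile(":")

# maxsplit=1 stops after the first colon, leaving the time intact.
name, timestamp = s_colon.split("commence:2009-11-19T19:55:00", maxsplit=1)

# Optional: turn the ISO 8601 text into a datetime object.
parsed = datetime.fromisoformat(timestamp)
print(name, timestamp, parsed.year)
```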
