regular expression re.sub for string formatting in python - python

Quite new to regular expressions and am trying to get a grasp on them
string = "regex_learning.test"
subbed = re.sub(r'(.*)_learning(.*), r'\1', string)
What I was hoping for is "regex.test" as an ouput when printing subbed, however I just get "regex"
Could someone explain why I am losing the .test?
Thanks in advance

Use this:
subbed = re.sub(r'(.*)_learning(.*)', r'\1' + r'\2', string)
You can also write it as:
subbed = re.sub(r'(.*)_learning(.*)', "%s%s" % (r'\1', r'\2'), string)

Related

Regex issue in python

I have a regex "value=4020a345-f646-4984-a848-3f7f5cb51f21"
if re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x ):
x = re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x )
m = x.group(1)
m only gives me 4020a345, not sure why it does not give me the entire "4020a345-f646-4984-a848-3f7f5cb51f21"
Can anyone tell me what i am doing wrong?
try out this regex, looks like you are trying to match a GUID
value=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
This should match what you want, if all the strings are of the form you've shown:
value=((\w*\d*\-?)*)
You can also use this website to validate your regular expressions:
http://regex101.com/
The below regex works as you expect.
value=([\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*]+)
You are trying to match on some hex numbers, that is why this regex is more correct than using [\w\d]
pattern = "value=([0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12})"
data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
res = re.search(pattern, data)
print(res.group(1))
If you dont care about the regex safety, aka checking that it is correct hex, there is no reason not to use simple string manipulation like shown below.
>>> data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
>>> print(data[7:])
020a345-f646-4984-a848-3f7f5cb51f21
>>> # or maybe
...
>>> print(data[7:].replace('-',''))
020a345f6464984a8483f7f5cb51f21
You can get the subparts of the value as a list
txt = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
parts = re.findall('\w+', txt)[1:]
parts is ['4020a345', 'f646', '4984', 'a848', '3f7f5cb51f21']
if you really want the entire string
full = "-".join(parts)
A simple way
full = re.findall("[\w-]+", txt)[-1]
full is 4020a345-f646-4984-a848-3f7f5cb51f21
value=([\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*)
Try this.Grab the capture.Your regex was not giving the whole as you had used | operator.So if regex on left side of | get satisfied it will not try the latter part.
See demo.
http://regex101.com/r/hQ1rP0/45

Python regular expression not matching

This is one of those things where I'm sure I'm missing something simple, but... In the sample program below, I'm trying to use Python's RE library to parse the string "line" to get the floating-point number just before the percent sign, i.e. "90.31". But the code always prints "no match".
I've tried a couple other regular expressions as well, all with the same result. What am I missing?
#!/usr/bin/python
import re
line = ' 0 repaired, 90.31% done'
pct_re = re.compile(' (\d+\.\d+)% done$')
#pct_re = re.compile(', (.+)% done$')
#pct_re = re.compile(' (\d+.*)% done$')
match = pct_re.match(line)
if match: print 'got match, pct=' + match.group(1)
else: print 'no match'
match only matches from the beginning of the string. Your code works fine if you do pct_re.search(line) instead.
You should use re.findall instead:
>>> line = ' 0 repaired, 90.31% done'
>>>
>>> pattern = re.compile("\d+[.]\d+(?=%)")
>>> re.findall(pattern, line)
['90.31']
re.match will match at the start of the string. So you would need to build the regex for complete string.
try this if you really want to use match:
re.match(r'.*(\d+\.\d+)% done$', line)
r'...' is a "raw" string ignoring some escape sequences, which is a good practice to use with regexp in python. – kratenko (see comment below)

Simple python regex, match after colon

I have a simple regex question that's driving me crazy.
I have a variable x = "field1: XXXX field2: YYYY".
I want to retrieve YYYY (note that this is an example value).
My approach was as follows:
values = re.match('field2:\s(.*)', x)
print values.groups()
It's not matching anything. Can I get some help with this? Thanks!
Your regex is good
field2:\s(.*)
Try this code
match = re.search(r"field2:\s(.*)", subject)
if match:
result = match.group(1)
else:
result = ""
re.match() only matches at the start of the string. You want to use re.search() instead.
Also, you should use a verbatim string:
>>> values = re.search(r'field2:\s(.*)', x)
>>> print values.groups()
('YYYY',)

regex in python 2.4

I have a string in python as below:
"\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
I want to get the string as
"B1xxA1xxMdl1zzInoAEROzzMofIN"
I think this can be done using regex but could not achieve it yet. Please give me an idea.
st = "\B1\B1xxA1xxMdl1zzInoAEROzzMofIN"
s = re.sub(r"\\","",st)
idx = s.rindex("B1")
print s[idx:]
output = 'B1xxA1xxMdl1zzInoAEROzzMofIN'
OR
st = "\B1\B1xxA1xxMdl1zzInoAEROzzMofIN"
idx = st.rindex("\\")
print st[idx+1:]
output = 'B1xxA1xxMdl1zzInoAEROzzMofIN'
Here is a try:
import re
s = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
s = re.sub(r"\\[^\\]+\\","", s)
print s
Tested on http://py-ide-online.appspot.com (couldn't find a way to share though)
[EDIT] For some explanation, have a look at the Python regex documentation page and the first comment of this SO question:
How to remove symbols from a string with Python?
because using brackets [] can be tricky (IMHO)
In this case, [^\\] means anything but two backslashes \\.
So [^\\]+ means one or more character that matches anything but two backslashes \\.
If the desired section of the string is always on the RHS of a \ char then you could use:
string = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
string.rpartition("\\")[2]
output = 'B1xxA1xxMdl1zzInoAEROzzMofIN'

Proper way to deal with string which looks like json object but it is wrapped with single quote

By definition the JSON string is wrapped with double quote.
In fact:
json.loads('{"v":1}') #works
json.loads("{'v':1}") #doesn't work
But how to deal with the second statements?
I'm looking for a solution different from eval or replace.
Thanks.
If you get a mailformed json why don't you just replace the double quotes with single quotes before
json.load
If you cannot fix the other side you will have to convert invalid JSON into valid JSON. I think the following treats escaped characters properly:
def fixEscapes(value):
# Replace \' by '
value = re.sub(r"[^\\]|\\.", lambda match: "'" if match.group(0) == "\\'" else match.group(0), value)
# Replace " by \"
value = re.sub(r"[^\\]|\\.", lambda match: '\\"' if match.group(0) == '"' else match.group(0), value)
return value
input = "{'vt\"e\\'st':1}"
input = re.sub(r"'(([^\\']|\\.)+)'", lambda match: '"%s"' % fixEscapes(match.group(1)), input)
print json.loads(input)
Not sure if I got your requirements right, but are you looking for something like this?
def fix_json(string_):
if string_[0] == string_[-1] == "'":
return '"' + string_[1:-1] +'"'
return string_
Example usage:
>>> fix_json("'{'key':'val\"'...cd'}'")
"{'key':'val"'...cd'}"
EDIT: it seems that the humour I tried to have in making the example above is not self-explanatory. So, here's another example:
>>> fix_json("'This string has - I'm sure - single quotes delimiters.'")
"This string has - I'm sure - single quotes delimiters."
This examples show how the "replacement" only happens at the extremities of the string, not within it.
you could also achieve the same with a regular expression, of course, but if you are just checking the starting and finishing char of a string, I find using regular string indexes more readable....
unfortunately you have to do this:
f = open('filename.json', 'rb')
json = eval(f.read())
done!
this works, but apparently people don't like the eval function. Let me know if you find a better approach. I used this on some twitter data...

Categories