Python string regular expression - python

I need to do a string compare to see if 2 strings are equal, like:
>>> x = 'a1h3c'
>>> x == 'a__c'
>>> True
independent of the 3 characters in middle of the string.

You need to use anchors.
>>> import re
>>> x = 'a1h3c'
>>> pattern = re.compile(r'^a.*c$')
>>> pattern.match(x) is not None
True
This checks that the first and last characters are a and c, and it doesn't care which characters appear in between.
If you want to check that exactly three characters are present in the middle, you could use this:
>>> pattern = re.compile(r'^a...c$')
>>> pattern.match(x) is not None
True
Note that the end-of-line anchor $ is important: without $, a...c would also match afoocbarbuz.
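As a hedged sketch of the same check on Python 3.4 or later: re.fullmatch anchors both ends implicitly, so the pattern needs neither ^ nor $. The helper name matches_a_three_c is made up for illustration and is not part of the answer above.

```python
import re

# Hypothetical helper: True when s is 'a', then exactly three characters, then 'c'.
def matches_a_three_c(s):
    # fullmatch succeeds only if the entire string fits the pattern
    return re.fullmatch(r'a...c', s) is not None

print(matches_a_three_c('a1h3c'))        # True
print(matches_a_three_c('afoocbarbuz'))  # False
```

Keep in mind that the dot does not match a newline unless re.DOTALL is passed.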

Your problem could be solved with string indexing, but if you want an intro to regex, here ya go.
import re
your_match_object = re.match(pattern, string)
The pattern in your case would be
pattern = re.compile("a...c") # the dot matches any character except a newline
From here, you can check whether your string fits this pattern with
print(pattern.match("a1h3c") is not None)
https://docs.python.org/2/howto/regex.html
https://docs.python.org/2/library/re.html#search-vs-match

if str1[0] == str2[0]:
    pass  # do something
You can repeat this statement as many times as you like.
This is indexing. We're getting the first character. To get the last character, use [-1].
I'll also mention that with indexing, the string can be of any length, as long as you know the relative position from the beginning or the end of the string.
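A minimal sketch of the indexing approach applied to the original question, assuming "equal" means same length, same first character, and same last character. The function name same_ends and the 'a___c' template are illustrative only.

```python
# Hypothetical helper: compare two strings by length and end characters only.
def same_ends(candidate, template):
    return (
        len(candidate) == len(template)
        and len(candidate) > 0                  # guard against indexing an empty string
        and candidate[0] == template[0]         # first character
        and candidate[-1] == template[-1]       # last character
    )

print(same_ends('a1h3c', 'a___c'))  # True
print(same_ends('a1h3d', 'a___c'))  # False
```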

Related

re.match is returning true on two different strings

I am using re.match function of python to compare two strings by ignoring few characters like this:
import re
url = "/ChessBoard_x16_y16.bmp/xyz"
if re.match(r'/ChessBoard_x.._y..\.bmp', url):
    print("true")
else:
    print("false")
Problem #1: The output is true, but I want false here because the url has something extra after .bmp.
Problem #2: I have used two dots here to ignore the value 16 (x16 and y16), but in fact this value can contain any number of digits, like x8, x16, x256, etc. So what should I do to ignore this complete value, whatever its number of digits?
Try the regex
'/ChessBoard_x[\d]+_y[\d]+\.bmp$'
A small demo (Also try on Regex101)
>>> import re
>>> pat = re.compile(r'/ChessBoard_x[\d]+_y[\d]+\.bmp$')
>>> url = "/ChessBoard_x162_y162.bmp"
>>> pat.match(url).group()
'/ChessBoard_x162_y162.bmp'
>>> url = "/ChessBoard_x16_y16.bmp/xyz"
>>> pat.match(url) is None # no match, so calling .group() here would raise AttributeError
True
Problem 1: You need to specify that the string must end where the regex ends. The $ anchor does that:
re.match(r"/ChessBoard_x.._y..\.bmp$", url)
Problem 2: What you want is one or more digits. The \d character class matches a digit, and + matches one or more of them. I therefore replace the two dots with \d+:
re.match(r"/ChessBoard_x\d+_y\d+\.bmp$", url)
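If the digits themselves are needed rather than ignored, a sketch along the following lines could capture them. It assumes Python 3.4 or later for fullmatch; the names CHESSBOARD and board_size are invented for this example.

```python
import re

# Parentheses capture the digit runs; fullmatch rejects anything after ".bmp".
CHESSBOARD = re.compile(r'/ChessBoard_x(\d+)_y(\d+)\.bmp')

def board_size(url):
    """Return (x, y) as integers, or None when the url does not fit the pattern."""
    m = CHESSBOARD.fullmatch(url)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

print(board_size('/ChessBoard_x16_y16.bmp'))      # (16, 16)
print(board_size('/ChessBoard_x8_y256.bmp'))      # (8, 256)
print(board_size('/ChessBoard_x16_y16.bmp/xyz'))  # None
```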

lstrip is removing a character I wouldn't expect it to

The following code:
s = "www.wired.com"
print s
s = s.lstrip('www.')
print s
outputs:
www.wired.com
ired.com
Note the missing w on the second line. I'm not sure I understand the behavior. I would expect:
www.wired.com
wired.com
EDIT:
Following the first two answers, I now understand the behavior. My question is now: how do I strip the leading www. without touching the rest?
The argument to string.lstrip is a set of characters, not a prefix:
>>> help(string.lstrip)
Help on function lstrip in module string:
lstrip(s, chars=None)
lstrip(s [,chars]) -> string
Return a copy of the string s with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
>>>
It removes ALL occurrences of those leading characters.
print s.lstrip('w.') # does the same!
[EDIT]:
If you wanted to strip the initial www., but only if the string starts with it, you could use a regular expression or something like:
s = s[4:] if s.startswith('www.') else s
According to the documentation:
The chars argument is a string specifying the set of characters to be removed...The chars argument is not a prefix; rather, all combinations of its values are stripped
You would achieve the same result by just saying:
'www.wired.com'.lstrip('w.')
If you wanted something more general, I would do something like this:
i = s.find('www.')
if i >= 0:
    s = s[:i] + s[i+4:]
To remove only a leading www., anchor a regular expression at the start:
>>> import re
>>> s = "www.wired.com"
>>> re.sub(r'^www\.', '', s)
'wired.com'
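To contrast the two behaviours side by side, here is a small sketch. The helper strip_prefix is a made-up name; on Python 3.9 or later the built-in str.removeprefix does the same job.

```python
# Hypothetical helper: drop `prefix` only when the string really starts with it.
def strip_prefix(s, prefix):
    if s.startswith(prefix):
        return s[len(prefix):]
    return s

# lstrip treats its argument as a set of characters, so the "w" of "wired" goes too.
print('www.wired.com'.lstrip('www.'))         # ired.com
print(strip_prefix('www.wired.com', 'www.'))  # wired.com
print(strip_prefix('wired.com', 'www.'))      # wired.com
```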

Python regex to find only second quotes of paired quotes

I'm wondering if there is some way to find only the second quote of each pair in a string that has paired quotes.
So if I have a string like '"aaaaa"' or just '""', I want to find only the last '"' in it. If I have '"aaaa""aaaaa"aaaa""', I want only the second, fourth and sixth '"'s. But if I have something like '"aaaaaaaa' or 'aaa"aaa', I don't want to find anything, since there are no paired quotes. If I have '"aaa"aaa"', I want to find only the second '"', since the third '"' has no pair.
I've tried to implement lookbehind, but it doesn't work with quantifiers, so my bad attempt was '(?<=\"a*)\"'.
You don't really need regex for this. You can do:
[i for i, c in enumerate(s) if c == '"'][1::2]
To get the index of every other '"'. Example usage:
>>> for s in ['"aaaaa"', '"aaaa""aaaaa"aaaa""', 'aaa"aaa', '"aaa"aaa"']:
...     print(s, [i for i, c in enumerate(s) if c == '"'][1::2])
...
"aaaaa" [6]
"aaaa""aaaaa"aaaa"" [5, 12, 18]
aaa"aaa []
"aaa"aaa" [4]
import re
reg = re.compile(r'(?:\").*?(\")')
then
for match in reg.findall('"this is", "my test"'):
    print(match)
gives
"
"
If you need to change or remove the second quote, you can also match the whole pair and put everything before the second quote into a capture group. Substituting the first capture group plus the replacement string then achieves this.
For example, this regex matches everything up to the second quote and captures it in a group:
(\"[^"]*)\"
If you replace the whole match (which includes the second quote) with only the value of the capture group (which does not include the second quote), you simply cut the second quote off.
See the online example
import re
p = re.compile(r'(\"[^"]*)\"')
test_str = "\"test1\"test2\"test3\""
subst = r"\1"
result = re.sub(p, subst, test_str)
print(result) # result -> "test1test2"test3
Please read my answer about why you don't want to use regular expressions for such a problem, even though you can do that kind of non-regular job with it.
Ok then you probably want one of the solutions I give in the linked answer, where you'll want to use a recursive regex to match all the matching pairs.
Edit: the following has been written before the update to the question, which was asking only for second double quotes.
Though if you want to find only second double quotes in a string, you do not need regexps:
>>> s1='aoeu"aoeu'
>>> s2='aoeu"aoeu"aoeu'
>>> s3='aoeu"aoeu"aoeu"aoeu'
>>> def find_second_quote(s):
...     pos_quote_1 = s.find('"')
...     if pos_quote_1 == -1:
...         return -1
...     pos_quote_2 = s[pos_quote_1+1:].find('"')
...     if pos_quote_2 == -1:
...         return -1
...     return pos_quote_1+1+pos_quote_2
...
>>> find_second_quote(s1)
-1
>>> find_second_quote(s2)
9
>>> find_second_quote(s3)
9
>>>
Here it returns either -1 if there's no second quote, or the position of the second quote if there is one.
A parser is probably better, but depending on what you want to get out of it, there are other ways. If you need the data between the quotes:
import re
re.findall(r'".*?"', '"aaaa""aaaaa"aaaa""')
['"aaaa"',
'"aaaaa"',
'""']
If you need the indices, you could do it as a generator or other equivalent like this:
def count_quotes(mystr):
    count = 0
    for i, x in enumerate(mystr):
        if x == '"':
            count += 1
            if count % 2 == 0:
                yield i
list(count_quotes('"aaaa""aaaaa"aaaa""'))
[5, 12, 18]
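For completeness, a regex-based sketch that yields the same indices: each match consumes one complete pair, and a capture group marks the closing quote. The function name closing_quote_indices is an assumption made for this example.

```python
import re

# Hypothetical helper: index of the closing quote of every complete pair.
def closing_quote_indices(s):
    # "[^"]*(")  ->  opening quote, any non-quote characters, captured closing quote
    return [m.start(1) for m in re.finditer(r'"[^"]*(")', s)]

print(closing_quote_indices('"aaaa""aaaaa"aaaa""'))  # [5, 12, 18]
print(closing_quote_indices('"aaa"aaa"'))            # [4]
print(closing_quote_indices('aaa"aaa'))              # []
```

Because finditer returns non-overlapping matches, an unpaired trailing quote is never reported.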

String may only contain A, U, G or C [duplicate]

Forgive the simplistic question, but I've read through the SO questions and the Python documentation and still haven't been able to figure this out.
How can I create a Python regex to test whether a string contains ANY but ONLY the A, U, G and C characters? The string can contain either one or all of those characters, but if it contains any other characters, I'd like the regex to fail.
I tried:
>>> re.match(r"[AUGC]", "AUGGAC")
<_sre.SRE_Match object at 0x104ca1850>
But adding an X on to the end of the string still works, which is not what I expected:
>>> re.match(r"[AUGC]", "AUGGACX")
<_sre.SRE_Match object at 0x104ca1850>
Thanks in advance.
You need the regex to consume the whole string (or fail, if it can't). re.match implicitly anchors the pattern at the start of the string, so you only need to add an anchor at the end:
re.match(r"[AUGC]+$", string_to_check)
Also note the +, which repeatedly matches your character set (since, again, the point is to consume the whole string).
If those characters must be the only characters in the string, you can do the following:
>>> r = re.compile(r'^[AUGC]+$')
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")
>>>
Then, if you want your regex to match the empty string as well, you can do:
>>> r = re.compile(r'^[AUGC]*$')
>>> r.match("")
<_sre.SRE_Match object at 0x10ee16718>
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")
Here's a description of what the first regexp does:
Walk through it
Use ^[AUCG]*$; this will match against the entire string.
Or, if there has to be at least one letter, ^[AUCG]+$ — ^ and $ stand for beginning of string and end of string respectively; * and + stand for zero or more and one or more respectively.
This is purely about regular expressions and not specific to Python really.
You are actually really close. What you have only tests for a single character that is A, U, G or C.
What you want is to match a string of one or more letters that are all A, U, G or C. You can accomplish this by adding the plus quantifier to your regular expression.
re.match(r"^[AUGC]+$", "AUGGAC")
Additionally, adding $ at the end marks the end of the string; you can optionally use ^ at the front to match the beginning of the string.
Just check to see if there is anything other than "AUGC" in there:
if re.search('[^AUGC]', string_to_check):
    pass  # fail
You can add a check to make sure the string is not empty in the same statement:
if not string_to_check or re.search('[^AUGC]', string_to_check):
    pass  # fail
No real need to use a regex:
>>> good = 'AUGGCUA'
>>> bad = 'AUGHACUA'
>>> all([c in 'AUGC' for c in good])
True
>>> all([c in 'AUGC' for c in bad])
False
I know you're asking about regular expressions, but I thought it was worth mentioning set. To establish whether your string only contains A, U, G or C, you could do this:
>>> input = "AUCGCUAGCGAU"
>>> s = set("AUGC")
>>> set(input) <= s
True
>>> bad = "ASNMSA"
>>> set(bad) <= s
False
edit: thanks to @roippi for spotting my mistake, <= should be used, not ==.
Instead of using <=, the method issubset can be used:
>>> set("AUGAUG").issubset(s)
True
If all characters in the string input are in the set s, then issubset will return True.
From: https://docs.python.org/2/library/re.html
Characters that are not within a range can be matched by complementing the set.
If the first character of the set is '^', all the characters that are not in the set will be matched.
For example, [^5] will match any character except '5', and [^^] will match any character except '^'.
^ has no special meaning if it’s not the first character in the set.
So you could do [^AUGC] and if it matches that then reject it, else keep it.
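One detail worth a sketch: without re.MULTILINE, $ also matches just before a single trailing newline, so a pattern ending in $ accepts "AUG\n". On Python 3.4 or later, re.fullmatch avoids that. The function name is_rna is made up for this illustration.

```python
import re

# Hypothetical helper: True only for non-empty strings made entirely of A, U, G, C.
def is_rna(s):
    return re.fullmatch(r'[AUGC]+', s) is not None

print(is_rna('AUGGAC'))   # True
print(is_rna('AUGGACX'))  # False
print(is_rna(''))         # False
print(is_rna('AUG\n'))    # False

# With match and $, the trailing newline slips through:
print(re.match(r'[AUGC]+$', 'AUG\n') is not None)  # True
```

If fullmatch is not available, ending the pattern with \Z instead of $ is another way to reject the trailing newline.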

Python re match last underscore in a string

I have some strings that look like this
S25m\S25m_16Q_-2dB.png
S25m\S25m_1_16Q_0dB.png
S25m\S25m_2_16Q_2dB.png
I want to get the string between slash and the last underscore, and also the string between last underscore and extension, so
Desired:
[S25m_16Q, S25m_1_16Q, S25m_2_16Q]
[-2dB, 0dB, 2dB]
I was able to get the whole thing between slash and extension by doing
foo = r"S25m\S25m_16Q_-2dB.png"
match = re.search(r'([a-zA-Z0-9_-]*)\.(\w+)', foo)
match.group(1)
But I don't know how to make a pattern so I could split it by the last underscore.
Capture the groups you want to get.
>>> re.search(r'([-\w]*)_([-\w]+)\.\w+', r"S25m\S25m_16Q_-2dB.png").groups()
('S25m_16Q', '-2dB')
>>> re.search(r'([-\w]*)_([-\w]+)\.\w+', r"S25m\S25m_1_16Q_0dB.png").groups()
('S25m_1_16Q', '0dB')
>>> re.search(r'([-\w]*)_([-\w]+)\.\w+', r"S25m\S25m_2_16Q_2dB.png").groups()
('S25m_2_16Q', '2dB')
* matches the previous character set greedily (consumes as many as possible); it continues to the last _ since \w includes letters, numbers, and underscore.
>>> list(zip(*[m.groups() for m in re.finditer(r'([-\w]*)_([-\w]+)\.\w+', r'''
... S25m\S25m_16Q_-2dB.png
... S25m\S25m_1_16Q_0dB.png
... S25m\S25m_2_16Q_2dB.png
... ''')]))
[('S25m_16Q', 'S25m_1_16Q', 'S25m_2_16Q'), ('-2dB', '0dB', '2dB')]
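To see why greediness is what finds the last underscore, this short sketch compares the greedy quantifier with its lazy counterpart (*?) on one of the sample names. The variable names are illustrative.

```python
import re

name = r'S25m\S25m_1_16Q_0dB.png'

# Greedy: the first group grabs as much as it can, then backtracks to the last "_".
greedy = re.search(r'([-\w]*)_([-\w]+)\.\w+', name).groups()
# Lazy: the first group grabs as little as it can, so it stops at the first "_".
lazy = re.search(r'([-\w]*?)_([-\w]+)\.\w+', name).groups()

print(greedy)  # ('S25m_1_16Q', '0dB')
print(lazy)    # ('S25m', '1_16Q_0dB')
```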
A non-regex solution (albeit rather messy):
>>> import os
>>> s = r"S25m\S25m_16Q_-2dB.png"
>>> first, _, last = s.partition("\\")[2].rpartition('_')
>>> print (first, os.path.splitext(last)[0])
('S25m_16Q', '-2dB')
I know it says using re, but why not just use split?
strings = r"""S25m\S25m_16Q_-2dB.png
S25m\S25m_1_16Q_0dB.png
S25m\S25m_2_16Q_2dB.png"""
strings = strings.split("\n")
parts = []
for string in strings:
    string = string.split(".png")[0] # Get rid of file extension
    string = string.split("\\")
    splitString = string[1].split("_")
    firstPart = "_".join(splitString[:-1]) # string between slash and last underscore
    parts.append([firstPart, splitString[-1]])
for line in parts:
    print(line)
['S25m_16Q', '-2dB']
['S25m_1_16Q', '0dB']
['S25m_2_16Q', '2dB']
Then just transpose the array,
for line in zip(*parts):
    print(line)
('S25m_16Q', 'S25m_1_16Q', 'S25m_2_16Q')
('-2dB', '0dB', '2dB')
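The split-based idea can be condensed with rsplit and rpartition, which both work from the right-hand end. This is only a sketch: the function name split_name is invented, and it assumes a backslash separator and a single extension as in the sample strings.

```python
# Hypothetical helper: (text before the last underscore, text after it) of the file stem.
def split_name(path):
    filename = path.rsplit('\\', 1)[-1]   # drop everything up to the last backslash
    stem = filename.rsplit('.', 1)[0]     # drop the extension
    head, _, tail = stem.rpartition('_')  # split at the last underscore
    return head, tail

paths = [
    r'S25m\S25m_16Q_-2dB.png',
    r'S25m\S25m_1_16Q_0dB.png',
    r'S25m\S25m_2_16Q_2dB.png',
]
firsts, lasts = zip(*[split_name(p) for p in paths])
print(firsts)  # ('S25m_16Q', 'S25m_1_16Q', 'S25m_2_16Q')
print(lasts)   # ('-2dB', '0dB', '2dB')
```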
