Python regex to find only second quotes of paired quotes - python

I am wondering whether there is a way to find only the second quote of each pair in a string that contains paired quotes.
So if I have a string like '"aaaaa"' or just '""', I want to find only the last '"' in it. If I have '"aaaa""aaaaa"aaaa""', I want only the second, fourth and sixth '"'. But if I have something like '"aaaaaaaa' or 'aaa"aaa', I don't want to find anything, since there are no paired quotes. If I have '"aaa"aaa"', I want to find only the second '"', since the third '"' has no pair.
I tried to use a lookbehind, but it doesn't work with quantifiers, so my failed attempt was '(?<=\"a*)\"'.

You don't really need regex for this. You can do:
[i for i, c in enumerate(s) if c == '"'][1::2]
This gives the index of every second '"'. Example usage:
>>> for s in ['"aaaaa"', '"aaaa""aaaaa"aaaa""', 'aaa"aaa', '"aaa"aaa"']:
...     print(s, [i for i, c in enumerate(s) if c == '"'][1::2])
...
"aaaaa" [6]
"aaaa""aaaaa"aaaa"" [5, 12, 18]
aaa"aaa []
"aaa"aaa" [4]

import re
reg = re.compile(r'(?:\").*?(\")')
Then
for match in reg.findall('"this is", "my test"'):
    print(match)
gives
"
"

If you need to change the second quote, you can also match the whole pair and put everything before the second quote into a capture group. Replacing the match with the first capture group plus the substitution string then achieves the goal.
For example, this regex matches everything up to the second quote and puts it into a group:
(\"[^"]*)\"
If you replace the whole match (which includes the second quote) with only the value of the capture group (which does not include the second quote), you simply cut the second quote off.
Example:
import re
p = re.compile(r'(\"[^"]*)\"')
test_str = "\"test1\"test2\"test3\""
subst = r"\1"
result = p.sub(subst, test_str)
print(result)  # result -> "test1test2"test3
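The same capture group also allows replacing the second quote instead of dropping it. A small sketch, where the replacement character ' is chosen purely for illustration:

```python
import re

test_str = '"test1"test2"test3"'
# \1 restores the opening quote and the text after it; the trailing ' stands
# in for whatever should replace the closing quote (an arbitrary choice here).
result = re.sub(r'("[^"]*)"', r"\1'", test_str)
print(result)  # "test1'test2"test3'
```

As in the example above, the third quote of an odd run is only touched when it is followed by a partner quote.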

Please read my answer about why you don't want to use regular expressions for such a problem, even though you can do that kind of non-regular job with it.
Ok then you probably want one of the solutions I give in the linked answer, where you'll want to use a recursive regex to match all the matching pairs.
Edit: the following was written before the update to the question, which was asking only for the second double quote.
If you only want to find the second double quote in a string, you do not need regexps:
>>> s1='aoeu"aoeu'
>>> s2='aoeu"aoeu"aoeu'
>>> s3='aoeu"aoeu"aoeu"aoeu'
>>> def find_second_quote(s):
...     pos_quote_1 = s.find('"')
...     if pos_quote_1 == -1:
...         return -1
...     pos_quote_2 = s[pos_quote_1+1:].find('"')
...     if pos_quote_2 == -1:
...         return -1
...     return pos_quote_1+1+pos_quote_2
...
>>> find_second_quote(s1)
-1
>>> find_second_quote(s2)
9
>>> find_second_quote(s3)
9
Here it returns either -1 if there is no second quote, or the position of the second quote if there is one.

A parser is probably better, but depending on what you want to get out of it, there are other ways. If you need the data between the quotes:
import re
re.findall(r'".*?"', '"aaaa""aaaaa"aaaa""')
['"aaaa"',
'"aaaaa"',
'""']
If you need the indices, you could use a generator like this:
def count_quotes(mystr):
    count = 0
    for i, x in enumerate(mystr):
        if x == '"':
            count += 1
            if count % 2 == 0:
                yield i
list(count_quotes('"aaaa""aaaaa"aaaa""'))
[5, 12, 18]
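For completeness, the same indices can also be obtained with a regex, which is what the question originally asked for. This is only a sketch: the helper name second_quote_positions is made up here, and the pattern assumes that quotes pair up strictly left to right.

```python
import re

# Hypothetical helper (not from the answers above): the pattern consumes an
# opening quote, skips non-quote characters, and captures the closing quote.
PAIR = re.compile(r'"[^"]*(")')

def second_quote_positions(s):
    # m.start(1) is the index of the captured closing quote
    return [m.start(1) for m in PAIR.finditer(s)]

print(second_quote_positions('"aaaa""aaaaa"aaaa""'))  # [5, 12, 18]
print(second_quote_positions('"aaa"aaa"'))            # [4]
print(second_quote_positions('aaa"aaa'))              # []
```

Because each match consumes a complete pair, an unpaired trailing quote is never reported.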

Related

Printing substrings' patterns from a string in Python

The input to this problem is a string with a specific form. For example, if s is a string, the inputs can be s='3(a)2(b)' or s='3(aa)2(bbb)' or s='4(aaaa)'. The output should be a string in which each substring inside the brackets is repeated as many times as the number that precedes it.
For example,
Input ='3(a)2(b)'
Output='aaabb'
Input='4(aaa)'
Output='aaaaaaaaaaaa'
and similarly for other inputs. The program should print an empty string for wrong or invalid inputs.
This is what I've tried so far
s='3(aa)2(b)'
p=''
q=''
for i in range(0,len(s)):
    #print(s[i],end='')
    if s[i]=='(':
        k=int(s[i-1])
        while(s[i+1]!=')'):
            p+=(s[i+1])
            i+=1
    if s[i]==')':
        q+=k*p
print(q)
Can anyone tell what's wrong with my code?
A oneliner would be:
''.join(int(y[0])*y[1] for y in (x.split('(') for x in Input.split(')')[:-1]))
It works like this. We take the input, and split on the close paren
In [1]: Input ='3(a)2(b)'
In [2]: a = Input.split(')')[:-1]
In [3]: a
Out[3]: ['3(a', '2(b']
This gives us the integer and character pairs we're looking for, but we need to get rid of the open paren. So for each x in a, we split on the open paren to get a two-element list where the first element is the int (still as a string) and the second is the character. You'll see this in b
In [4]: b = [x.split('(') for x in a]
In [5]: b
Out[5]: [['3', 'a'], ['2', 'b']]
So for each element in b, we need to cast the first element as an integer with int() and multiply by the character.
In [6]: c = [int(y[0])*y[1] for y in b]
In [7]: c
Out[7]: ['aaa', 'bb']
Now we join on the empty string to combine them into one string with
In [8]: ''.join(c)
Out[8]: 'aaabb'
Try this:
import re
# s is the input string from the question
a = re.findall(r'[\d]+', s)
b = re.findall(r'[a-zA-Z]+', s)
c = ''
for i, j in zip(a, b):
    c+=(int(i)*str(j))
print(c)
Here is how you could do it:
Step 1: Simple case, getting the data out of a really simple template
Let's assume your template string is 3(a). That's the simplest case I could think of. We need to extract two pieces of information from that string: the number of times the characters have to be rendered, and the characters to render.
This is a case where regular expressions are well suited (hence the use of the re module from Python's standard library).
I won't give a full course on regex; you'll have to do that on your own. However, I'll quickly explain the steps I used. The count (the number of times we should render the characters) is one or more digits, so our first capturing group is (\d+). Then we have the characters to extract, enclosed in parentheses, hence \((\w+)\) (this allows several characters to be rendered at once). Putting them together, we get (\d+)\((\w+)\). For testing you can check this out.
Applied to our case, a straightforward use of the re module is:
import re
# Our template
template = '3(a)'
# Run the regex
match = re.search(r'(\d+)\((\w+)\)', template)
if match:
    # Get the count from the first capturing group
    count = int(match.group(1))
    # Get the string to render from the second capturing group
    string = match.group(2)
    # Print the string as many times as count says
    print(count * string)
Output:
aaa
Yeah!
Step 2: Full case, with several templates
Okay, we know how to do it for one template. How do we do the same for several, for instance 3(a)4(b)? Well... How would we do it "by hand"? We'd read the full template from left to right and apply each template one by one. That is exactly what we'll do with Python!
Luckily for us, the re module has a function just for that: finditer. It does exactly what we described above.
So, we'll do something like:
import re
# Our template
template = '3(a)4(b)'
# Iterate through found templates
for match in re.finditer(r'(\d+)\((\w+)\)', template):
    # Get the count from the first capturing group
    count = int(match.group(1))
    # Get the string to render from the second capturing group
    string = match.group(2)
    print(count * string)
Output:
aaa
bbbb
Okay... All that remains is to combine those pieces. We can collect the result of each step in a list and then join its items at the end, right?
Let's do it!
import re
template = '3(a)4(b)'
parts = []
for match in re.finditer(r'(\d+)\((\w+)\)', template):
    parts.append(int(match.group(1)) * match.group(2))
print(''.join(parts))
Output:
aaabbbb
Yeah!
Step 3: Final step, optimization
Because we can always do better, we won't stop. for loops are cool. But what I love (it's personal) about python is that there is so much stuff you can actually just write with one line! Is it the case here? Well yes :).
First we can remove the for loop and the append using a list comprehension:
parts = [int(match.group(1)) * match.group(2) for match in re.finditer(r'(\d+)\((\w+)\)', template)]
rendered = ''.join(parts)
Finally, let's fold the two lines that build parts and join them into a single expression:
import re
template = '3(a)4(b)'
rendered = ''.join(
    int(match.group(1)) * match.group(2)
    for match in re.finditer(r'(\d+)\((\w+)\)', template))
print(rendered)
Output:
aaabbbb
Yeah! Still the same output :).
Hope it helped!
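The question also asks for an empty string on wrong or invalid input, which finditer alone does not give, because it silently skips anything that does not match. A minimal sketch of one way to add that check, assuming "valid" means the whole string is a sequence of count(text) groups; the function name expand is made up for this example:

```python
import re

def expand(template):
    # Assumption: a valid template consists only of <digits>(<word chars>) groups.
    if not re.fullmatch(r'(?:\d+\(\w+\))+', template):
        return ''
    # findall returns (count, text) tuples, one per group
    return ''.join(int(count) * text
                   for count, text in re.findall(r'(\d+)\((\w+)\)', template))

print(expand('3(a)4(b)'))   # aaabbbb
print(repr(expand('3(a')))  # ''
```

re.fullmatch validates the overall shape first, so the extraction step never sees partial input.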
The value of 'p' should be refreshed after each iteration.
s='1(aaa)2(bb)'
p=''
q=''
i=0
while i<len(s):
    if s[i]=='(':
        k=int(s[i-1])
        p=''
        while(s[i+1]!=')'):
            p+=(s[i+1])
            i+=1
    if s[i]==')':
        q+=k*p
    i+=1
print(q)
The code is not behaving the way I want it to behave. The problem here is the placement of 'p'. 'p' is the variable that accumulates the substring inside the ( )s. Because it is never reset, it keeps growing from one group to the next. Resetting 'p' inside the 'if' block does the job.
s='2(aa)2(bb)'
q=''
for i in range(0,len(s)):
    if s[i]=='(':
        k=int(s[i-1])
        p=''
        while(s[i+1]!=')'):
            #print(i,'first time')
            p+=s[i+1]
            i+=1
        q+=p*k
        #print(i,'second time')
print(q)
What you want is not really to print substrings. The real purpose is most likely to generate text based on regular expressions or commands.
You can parametrize a function to read it, or use something like this:
The Python library rstr has the function xeger() to do what you need by using random strings and only returning ones that match:
Example
Install with pip install rstr
In [1]: from __future__ import print_function
In [2]: import rstr
In [3]: for dummy in range(10):
   ...:     print(rstr.xeger(r"(a|b)[cd]{2}\1"))
   ...:
...:
acca
bddb
adda
bdcb
bccb
bcdb
adca
bccb
bccb
acda
Warning
For complex re patterns this might take a long time to generate any matches.

Python string regular expression

I need to compare two strings to see if they are equal, like:
>>> x = 'a1h3c'
>>> x == 'a__c'
True
independent of the 3 characters in the middle of the string.
You need to use anchors.
>>> import re
>>> x = 'a1h3c'
>>> pattern = re.compile(r'^a.*c$')
>>> pattern.match(x) != None
True
This checks that the first and last characters are a and c, and it does not care about the characters in the middle.
If you want to check for exactly three characters in the middle, you could use this:
>>> pattern = re.compile(r'^a...c$')
>>> pattern.match(x) != None
True
Note that the end-of-line anchor $ is important: without $, a...c would match afoocbarbuz.
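As an alternative to writing the anchors by hand, re.fullmatch (Python 3.4+) only succeeds when the whole string matches, and the standard fnmatch module offers shell-style wildcards where ? stands for exactly one character. A small sketch of both:

```python
import re
from fnmatch import fnmatchcase

x = 'a1h3c'
# fullmatch behaves as if the pattern were anchored at both ends
print(re.fullmatch(r'a...c', x) is not None)              # True
print(re.fullmatch(r'a...c', 'afoocbarbuz') is not None)  # False
# '?' matches exactly one character, case-sensitively
print(fnmatchcase(x, 'a???c'))                            # True
```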
Your problem could be solved with string indexing, but if you want an intro to regex, here ya go.
import re
your_match_object = re.match(pattern,string)
The pattern in your case would be
pattern = re.compile("a...c") # the dot denotes any char but a newline
From here, you can see if your string fits this pattern with
print(pattern.match("a1h3c") is not None)
https://docs.python.org/2/howto/regex.html
https://docs.python.org/2/library/re.html#search-vs-match
if str1[0] == str2[0]:
    # do something.
    pass
You can repeat this statement as many times as you like.
This is indexing. We're getting the first character. To get the last character, use [-1].
I'll also mention that with indexing, the string can be of any size, as long as you know the relative position from the beginning or the end of the string.
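Put together, the indexing idea could look like the sketch below. The helper name same_ends and its parameters are made up for illustration; startswith and endswith are used instead of [0] and [-1] so that an empty string does not raise IndexError.

```python
def same_ends(candidate, first, last, length):
    # Hypothetical helper: compare only the length and the two end characters
    return (len(candidate) == length
            and candidate.startswith(first)
            and candidate.endswith(last))

print(same_ends('a1h3c', 'a', 'c', 5))  # True
print(same_ends('a1h3d', 'a', 'c', 5))  # False
```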

Python RegEx search and replace with part of original expression

I'm new to Python and looking for a way to replace all occurrences of "[A-Z]0" with the [A-Z] portion of the string to get rid of certain numbers that are padded with a zero. I used this snippet to get rid of the whole occurrence from the field I'm processing:
import re
def strip_zeros(s):
    return re.sub("[A-Z]0", "", s)
test = strip_zeros(!S_fromManhole!)
How do I perform the same type of procedure but without removing the leading letter of the "[A-Z]0" expression?
Thanks in advance!
Use backreferences.
http://www.regular-expressions.info/refadv.html "\1 through \9 Substituted with the text matched between the 1st through 9th pair of capturing parentheses."
http://docs.python.org/2/library/re.html#re.sub "Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern."
Untested, but it would look like this:
return re.sub(r"([A-Z])0", r"\1", s)
This places the letter inside a capture group and references it with \1 in the replacement.
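A quick check of the backreference version, using made-up IDs as sample input (the real field values are not shown in the question):

```python
import re

def strip_zeros(s):
    # ([A-Z]) captures the letter; \1 puts it back, so only the 0 is dropped
    return re.sub(r"([A-Z])0", r"\1", s)

print(strip_zeros("MH01-B07"))  # MH1-B7
print(strip_zeros("A10"))       # A10 (the 0 does not follow a letter)
```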
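You can try something like this (written for Python 3, where translate takes a table built with str.maketrans):
In [47]: s = "ab0"
In [48]: s.translate(str.maketrans('', '', '0'))
Out[48]: 'ab'
In [49]: s = "ab0zy"
In [50]: s.translate(str.maketrans('', '', '0'))
Out[50]: 'abzy'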
I like Patashu's answer for this case, but for the sake of completeness, passing a function to re.sub instead of a replacement string may be cleaner in more complicated cases. The function should take a single match object and return a string.
>>> import re
>>> def strip_zeros(s):
...     def unpadded(m):
...         return m.group(1)
...     return re.sub("([A-Z])0", unpadded, s)
...
>>> strip_zeros("Q0")
'Q'

removing part of a string (up to but not including) in python

I'm trying to strip off part of a string.
e.g. Strip:-
a = xyz-abc
to leave:-
a = -abc
I would usually use lstrip e.g.
a.lstrip('xyz')
but in this case I don't know what xyz is going to be, so I need a way to just strip everything to the left of '-'.
Is it possible to set that option with lstrip or do I have to go about it a different way?
Thanks.
If there's only one - character, this will work:
'xyz-abc'.split('-')[1]
If you want the '-' in there, you have to reattach it:
>>> '-' + 'xyz-abc'.split('-')[1]
'-abc'
There's also a maxsplit parameter that lets you split only at the first - character.
>>> '-' + 'xyz-ab-c'.split('-', 1)[1]
'-ab-c'
partition is also potentially useful:
>>> 'xyz-abc'.partition('-')
('xyz', '-', 'abc')
It splits at the first occurrence of the separator:
>>> ''.join('xyz-ab-c'.partition('-')[1:])
'-ab-c'
>>> a = 'xyz-abc'
>>> a.find('-') # return the index of the first instance of '-'
3
>>> a[a.find('-'):] # return the string of everything past that index
'-abc'
You could use a combination of .find and slicing.
If there is no guarantee that the text to the left of - doesn't contain dashes of its own, the reversed version of find called rfind is even more useful:
>>> s = "xyv-er-hdgcfh-abc"
>>> print(s[s.rfind("-"):])
-abc
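One caveat with the find and rfind versions: both return -1 when there is no '-' at all, and s[-1:] then yields just the last character. A small sketch based on partition that avoids this; the helper name from_separator is made up, and returning the text unchanged when no separator is found is an assumption about the desired behaviour:

```python
def from_separator(text, sep='-'):
    # partition splits at the first occurrence and reports whether sep was found
    head, found, tail = text.partition(sep)
    # Assumption: with no separator present, leave the text as it is
    return found + tail if found else text

print(from_separator('xyz-abc'))   # -abc
print(from_separator('xyz-ab-c'))  # -ab-c
print(from_separator('xyz'))       # xyz
```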

replacing all regex matches in single line

I have a dynamic regexp, and I don't know in advance how many groups it has.
I would like to wrap all group matches in XML tags.
Example:
re.sub("(this).*(string)","this is my string",'<markup>\anygroup</markup>')
>> "<markup>this</markup> is my <markup>string</markup>"
Is that even possible in a single line?
For a constant regexp like in your example, do
re.sub("(this)(.*)(string)",
r'<markup>\1</markup>\2<markup>\3</markup>',
text)
Note that you need to enclose .* in parentheses as well if you don't want to lose it.
Now if you don't know what the regexp looks like, it's more difficult, but should be doable.
pattern = "(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 0
                         else s for n, s in enumerate(m.groups())),
       text)
If the first thing matched by your pattern doesn't necessarily have to be marked up, use this instead, with the first group optionally matching some prefix text that should be left alone:
pattern = "()(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 1
                         else s for n, s in enumerate(m.groups())),
       text)
You get the idea.
If your regexps are complicated and you're not sure you can make everything part of a group, where only every second group needs to be marked up, you might do something smarter with a more complicated function:
pattern = "(this).*(string)"
def replacement(m):
    s = m.group()
    offset = m.start()  # spans refer to the whole text, s is only the match
    n_groups = len(m.groups())
    # assume groups do not overlap and are listed left-to-right
    for i in range(n_groups, 0, -1):
        lo, hi = (pos - offset for pos in m.span(i))
        s = s[:lo] + '<markup>' + s[lo:hi] + '</markup>' + s[hi:]
    return s
re.sub(pattern, replacement, text)
If you need to handle overlapping groups, you're on your own, but it should be doable.
re.sub() will replace everything it can. If you pass it a function for repl then you can do even more.
Yes, this can be done in a single line.
>>> re.sub(r"\b(this|string)\b", r"<markup>\1</markup>", "this is my string")
'<markup>this</markup> is my <markup>string</markup>'
\b ensures that only complete words are matched.
So if you have a list of words that you need to mark up, you could do the following:
>>> mywords = ["this", "string", "words"]
>>> myre = r"\b(" + "|".join(mywords) + r")\b"
>>> re.sub(myre, r"<markup>\1</markup>", "this is my string with many words!")
'<markup>this</markup> is my <markup>string</markup> with many <markup>words</markup>!'
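If the word list can contain characters that are special in regular expressions (a dot, for instance), it is safer to escape each word before joining. A minimal sketch wrapping the idea in a function; the name mark_up is made up for this example:

```python
import re

def mark_up(words, text):
    # re.escape keeps characters such as '.' from being read as regex syntax
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = r"\b(" + alternatives + r")\b"
    return re.sub(pattern, r"<markup>\1</markup>", text)

print(mark_up(["this", "string", "words"],
              "this is my string with many words!"))
print(mark_up(["a.b"], "a.b axb"))  # only the literal a.b is wrapped
```

Note that \b only works as intended for words that begin and end with a word character.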
