python: remove the farthest left instance matching regex - python

I have a string like
xp = /dir/dir/dir[2]/dir/dir[5]/dir
I want
xp = /dir/dir/dir[2]/dir/dir/dir
xp.replace(r'\[([^]]*)\]', '') removes all the square brackets, I just want to remove the one on the far left.
It should also completely ignore square brackets that contain not(random_number_of_characters).
For example, /dir/dir/dir[2]/dir/dir[5]/dir[1][not(random_number_of_characters)]
should yield /dir/dir/dir[2]/dir/dir[5]/dir[not(random_number_of_characters)]
and /dir/dir/dir[2]/dir/dir[5]/dir[not(random_number_of_characters)]
should yield /dir/dir/dir[2]/dir/dir/dir[not(random_number_of_characters)]

Make it greedy and replace with captured groups.
(.*)\[[^]]*\](.*)
Greedy Group ------^^ ^^^^^^^^-------- Last bracket [ till ]
Replacement: \1\2 in Python (other regex flavors write it as $1$2)
Sample code:
import re
p = re.compile(r'(.*)\[[^]]*\](.*)')
test_str = "xp = /dir/dir/dir[2]/dir/dir[5]/dir"
subst = r"\1\2"  # Python backreference syntax; "$1$2" would be inserted literally
result = p.sub(subst, test_str)

This code removes the last pair of square brackets:
>>> import re
>>> xp = "/dir/dir/dir[2]/dir/dir[5]/dir"
>>> m = re.sub(r'\[[^\]]*\](?=[^\[\]]*$)', r'', xp)
>>> m
'/dir/dir/dir[2]/dir/dir/dir'
The lookahead (?=[^\[\]]*$) checks that the matched brackets are followed only by characters other than [ and ], zero or more times, up to the end of the line, so only the last [...] group can match. Replacing that match with an empty string then removes the last brackets completely.
UPDATE:
You could try the below regex also,
\[[^\]]*\](?=(?:[^\[\]]*\[not\(.*?\)\]$))
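A minimal sketch of how the updated regex behaves on the two examples from the question. The helper name strip_last_index and the short predicate text abc are made up for illustration, and the sketch assumes the [not(...)] predicate is always the last thing in the string.

```python
import re

# Pattern from the update above: a bracket group that is followed only by
# non-bracket characters and then a trailing [not(...)] predicate.
PREDICATE_AWARE = re.compile(r'\[[^\]]*\](?=(?:[^\[\]]*\[not\(.*?\)\]$))')

def strip_last_index(xp):
    """Remove the last [...] group that sits before a trailing [not(...)]."""
    return PREDICATE_AWARE.sub('', xp)

with_index = strip_last_index('/dir/dir/dir[2]/dir/dir[5]/dir[1][not(abc)]')
without_index = strip_last_index('/dir/dir/dir[2]/dir/dir[5]/dir[not(abc)]')
print(with_index)     # /dir/dir/dir[2]/dir/dir[5]/dir[not(abc)]
print(without_index)  # /dir/dir/dir[2]/dir/dir/dir[not(abc)]
```

Note that this pattern matches nothing when the string has no trailing [not(...)] predicate, so for such inputs the earlier lookahead pattern is still the one to use.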

Related

Python Regular Expression: re.findall doesn't see all matches

I would like to find all matches of this pattern: (one letter)(three digits)(two letters)(two or three digits).
So my Python regular expression is:
[А,В,Е,К,М,Н,О,Р,С,Т,У,Х]\d{3}[А,В,Е,К,М,Н,О,Р,С,Т,У,Х]{2}\d{2,3}
where
[А,В,Е,К,М,Н,О,Р,С,Т,У,Х] is the set of letters;
\d{num} is any digit repeated num times.
I wrote this code to solve my problem:
import re
pattern = r"[А,В,Е,К,М,Н,О,Р,С,Т,У,Х]\d{3}[А,В,Е,К,М,Н,О,Р,С,Т,У,Х]{2}\d{2,3}"
string = "A123AA11 А222АА123 A12AA123 A123CC1234 AA123A12"
re.findall(pattern, string)
I expected to see this list of strings: ['A123AA11', 'А222АА123']
But I got this one: ['А222АА123']
What is the problem? Where did I make a mistake?
I don't know how it got there, but the А in your regex is the Cyrillic letter А (U+0410, decimal 1040), not the Latin letter A (U+0041, decimal 65) from ASCII:
print(ord("А")) # 1040
print(ord("A")) # 65
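Also, a square-bracket character class matches any single one of the characters listed in it, so the commas are not separators: [А,В,Е,К,М,Н,О,Р,С,Т,У,Х] is a set of twelve Cyrillic letters plus the comma itself. With Latin letters and no commas, you only need [ABEKMHOPCTYX]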
Giving
string = "A123AA11 A222AA123 A12AA123 A123CC1234 A123A12"
pattern = r"[ABEKMHOPCTYX]\d{3}[ABEKMHOPCTYX]{2}\d{2,3}"
print(re.findall(pattern, string)) # ['A123AA11', 'A222AA123', 'A123CC123']
To match only words that fully match the pattern, use word boundaries \b
pattern = r"\b[ABEKMHOPCTYX]\d{3}[ABEKMHOPCTYX]{2}\d{2,3}\b"
print(re.findall(pattern, string)) # ['A123AA11', 'A222AA123']
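Look-alike letters such as the Cyrillic А discussed above are hard to spot by eye. The following is a small sketch, not part of the original answer, that lists every non-ASCII character in a string together with its Unicode name; the function name describe_non_ascii is hypothetical.

```python
import unicodedata

def describe_non_ascii(text):
    """List (character, code point, Unicode name) for every non-ASCII character."""
    return [(ch, ord(ch), unicodedata.name(ch, "UNKNOWN"))
            for ch in text if ord(ch) > 127]

# "\u0410" is the Cyrillic capital A; the remaining characters are plain ASCII.
suspects = describe_non_ascii("\u0410123AA11")
print(suspects)
```

Running the same check on the regex pattern itself would reveal whether its letters are Latin or Cyrillic before any matching is attempted.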

How to extract equation between brackets Python 2.7?

I'm trying to extract an equation between brackets, but I don't know how to do it in Python 2.7.
I tried re.findall, but I think the pattern is wrong.
child = {(x1<25)*2 +((x1>=25)&&(x2<200))*2+((x1>=25)&&(x2>=200))*1}
stringExtract = re.findall(r'\{(?:[^()]*|\([^()]*\))*\}', child)
It returns nothing instead of (x1<25)*2 +((x1>=25)&&(x2<200))*2+((x1>=25)&&(x2>=200))*1
It seems that you're only interested in everything between { and }, so your regex could be much simpler:
import re
child = "{(x1<25)*2 +((x1>=25)&&(x2<200))*2+((x1>=25)&&(x2>=200))*1}"
pattern = re.compile(r"""
\s* # every whitespace before leading bracket
{(.*)} # everything between '{' and '}'
\s* # every whitespace after ending bracket
""", re.VERBOSE)
re.findall(pattern, child)
And the output is this:
['(x1<25)*2 +((x1>=25)&&(x2<200))*2+((x1>=25)&&(x2>=200))*1']
To get the string out of the list (re.findall() returns a list), access index zero: re.findall(pattern, child)[0]. Other functions in re may also be useful to you, e.g. re.search() or re.match().
But if every string has a leading bracket and an ending bracket at first and last position, you can also simply do this:
child[1:-1]
which gives you
'(x1<25)*2 +((x1>=25)&&(x2<200))*2+((x1>=25)&&(x2>=200))*1'
You can use the regex {([^}]*)}. It matches the literal {, then [^}]* matches any run of characters other than }, and the final } matches the closing bracket.
>>> import re
>>> eq = "{(x1<25)*2 +((x1>=25)&&(x2<200))*2+((x1>=25)&&(x2>=200))*1}"
>>> m = re.search("{([^}]*)}", eq)
>>> m.group(1)
'(x1<25)*2 +((x1>=25)&&(x2<200))*2+((x1>=25)&&(x2>=200))*1'
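The two answers above give the same result on the question's input, but they differ once a string contains more than one braced group. A short sketch of that difference, using a made-up input string:

```python
import re

# Hypothetical input with two separate braced groups.
text = "{a+b} and {c*d}"

# Greedy .* runs from the first { to the last }, swallowing everything between.
greedy = re.findall(r"\{(.*)\}", text)

# The negated class [^}]* stops at the first closing brace of each group.
per_group = re.findall(r"\{([^}]*)\}", text)

print(greedy)     # ['a+b} and {c*d']
print(per_group)  # ['a+b', 'c*d']
```

Neither pattern handles nested braces; for the single outer pair in the question both are sufficient.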

Match everything except a pattern and replace matched with string

I want to use python in order to manipulate a string I have.
Basically, I want to prepend "\x" to every hex byte except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach to solve such case.
Regex: (?:\\x)?([0-9A-Z]{2}) Substitution: \\x$1 (in Python's re.sub the group reference is written \1, so \\x\1)
Details:
(?:) Non-capturing group
? Matches between zero and one time, match string \x if it exists.
() Capturing group
[] Match a single character present in the list 0-9 and A-Z (A-F would be enough for hex digits; lowercase digits are not matched)
{n} Matches exactly n times
\\x String \x
$1 Group 1.
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add it back before every 2 characters.
replaced = mystr.replace(r"\x", "")
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced)//2)])  # // keeps the count an integer on Python 3
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
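The strip-and-regroup approach above quietly assumes that an even number of hex digits remains after the "\x" markers are removed. A hedged sketch that makes the assumption explicit; the function name add_hex_prefix and the shortened sample input are hypothetical.

```python
def add_hex_prefix(s):
    """Strip existing \\x markers, then put one in front of every two hex digits."""
    digits = s.replace(r"\x", "")
    if len(digits) % 2:
        # Assumption: input is made of whole bytes; an odd count means bad data.
        raise ValueError("odd number of hex digits")
    return "".join(r"\x" + digits[i:i + 2] for i in range(0, len(digits), 2))

prefixed = add_hex_prefix(r"3033\x90\x0A6F")
print(prefixed)  # \x30\x33\x90\x0A\x6F
```

The sketch does not check that the characters really are hex digits, and it cannot detect an unprefixed run that is misaligned relative to the escaped bytes.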
You can get a list with your values to manipulate as you wish, with an even simpler re pattern
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
    print('\n\nNew string:')
    print('\\x' + '\\x'.join(match))
    #for elem in match: # match gives you a list of strings with the hex values
    #    print('\\x{}'.format(elem), end='')
    print('\n\nOriginal string:')
    print(mystr)
This can be done without replacing existing \x by using a combination of positive lookbehinds and negative lookaheads.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x\1"  # Python writes group 1 as \1; "$1" would be inserted literally
result = re.sub(regex, subst, test_str, flags=re.IGNORECASE)
if result:
    print(result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring that neither of the following matches.
(?<=\\x) Positive lookbehind ensuring what precedes is \x.
(?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexadecimal digit.
([a-f\d]{2}) Capture any two hexadecimal digits into capture group 1.
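As a quick cross-check, the optional-prefix regex and the lookaround regex from the answers above can be run side by side. This sketch uses a shortened, made-up input in the question's format and widens the first pattern's class to [0-9A-Fa-f], which is an adjustment and not what the original answer used.

```python
import re

sample = r"3033\x90\x0A6F"  # shortened, hypothetical input

# Approach 1: optionally consume an existing \x, then re-emit it for every byte.
consume = re.sub(r"(?:\\x)?([0-9A-Fa-f]{2})", r"\\x\1", sample)

# Approach 2: only touch byte pairs that are not already part of a \xNN escape.
skip = re.sub(r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})", r"\\x\1",
              sample, flags=re.IGNORECASE)

print(consume)  # \x30\x33\x90\x0A\x6F
print(skip)     # \x30\x33\x90\x0A\x6F
```

Both produce the same string here; the first rewrites every byte, while the second leaves already-escaped bytes untouched.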

How to find all substrings of a desired format in a str

I have text in this format:
s = '[aaa]foo[bbb]bar[ccc]foobar'
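Actually the text is a Chinese car review like this:
【最满意】整车都很满意,最满意就是性价比,...【空间】空间真的超乎想象,毫不夸张,...【内饰】内饰还可以吧,没有多少可以说的...
(In English: 【Most satisfied】I'm satisfied with the whole car; what I like most is the value for money,...【Space】The space really exceeds expectations, no exaggeration,...【Interior】The interior is okay, not much to say...)
Now I want to split it into these parts: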
[aaa]foo
[bbb]bar
[ccc]foobar
First I tried
>>> re.findall(r'\[.*?\].*?',s)
['[aaa]', '[bbb]', '[ccc]']
That only got the first half of each part.
Then I tried
>>> re.findall(r'(\[.*?\].*?)\[?',s)
['[aaa]', '[bbb]', '[ccc]']
That still only got the first half.
In the end I had to get the two parts separately and then zip them:
>>> re.findall(r'\[.*?\]',s)
['[aaa]', '[bbb]', '[ccc]']
>>> re.split(r'\[.*?\]',s)
['', 'foo', 'bar', 'foobar']
>>> for t in zip(re.findall(r'\[.*?\]',s),[e for e in re.split(r'\[.*?\]',s) if e]):
... print(''.join(t))
...
[aaa]foo
[bbb]bar
[ccc]foobar
So I want to know whether some regex exists that could split it directly into these parts.
One of the approaches:
import re
s = '[aaa]foo[bbb]bar[ccc]foobar'
result = re.findall(r'\[[^]]+\][^\[\]]+', s)
print(result)
The output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
\[ or \] - matches the bracket literally
[^]]+ - matches one or more characters except ]
[^\[\]]+ - matches any character(s) except brackets \[\]
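The same negated-class idea carries over to the 【】 delimiters of the real review text. A sketch under the assumption that 【 and 】 only ever appear as section markers; the sample string is a shortened version of the review quoted in the question.

```python
import re

# Shortened sample of the review text; 【...】 marks each section title.
review = '【最满意】整车都很满意,最满意就是性价比【空间】空间真的超乎想象【内饰】内饰还可以吧'

# One title in 【】 followed by everything up to the next 【 or 】.
sections = re.findall(r'【[^】]+】[^【】]+', review)

for section in sections:
    print(section)
```

Because the body is matched with a negated class and not with \w, punctuation such as commas inside a section is kept.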
I think this could work:
r'\[.+?\]\w+'
Here it is:
>>> re.findall(r"(\[\w*\]\w+)",s)
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Explanation:
The parentheses define the group to capture. The group:
should start with a bracket \[ followed by some word characters \w*,
then the matching closing bracket \] followed by more word characters \w+.
Notice that you have to escape the brackets with \.
I think if the input string format is "strict enough", it's possible to try something without regex. It may look like a micro-optimisation, but it could be interesting as a challenge.
result = map(lambda x: '[' + x, s[1:].split("["))
So I checked the performance over 1 million iterations, and here are my results (in seconds):
result = map(lambda x: '[' + x, s[1:].split("[")) # 0.89862203598
result = re.findall(r'\[[^]]+\][^\[\]]+', s) # 1.48306798935
result = re.findall(r'\[.+?\]\w+', s) # 1.47224497795
result = re.findall(r'(\[\w*\]\w+)', s) # 1.47370815277
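A hedged sketch of how such a comparison could be reproduced with timeit. On Python 3, map() is lazy, so the split version is written as a list comprehension here to make sure the work is actually performed; the function names and the iteration count are arbitrary choices, and timings will vary by machine.

```python
import re
import timeit

s = '[aaa]foo[bbb]bar[ccc]foobar'

def by_split(text):
    # Drop the leading '[', split on the remaining ones, and put each one back.
    return ['[' + part for part in text[1:].split('[')]

def by_regex(text):
    return re.findall(r'\[[^]]+\][^\[\]]+', text)

# Both approaches should agree before their speed is compared.
assert by_split(s) == by_regex(s)

if __name__ == "__main__":
    for fn in (by_split, by_regex):
        elapsed = timeit.timeit(lambda: fn(s), number=100_000)
        print(fn.__name__, round(elapsed, 3))
```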
\[.*?\][a-zA-Z]*
This regex should capture anything that starts with [something here] followed by any letters from a to z, upper or lower case.
You can experiment on regex101 to try out different patterns; it's easy to build your own regex there.
All you need is findall with a very simple pattern:
import re
print(re.findall(r'\[\w+\]\w+','[aaa]foo[bbb]bar[ccc]foobar'))
output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Detailed solution:
import re
string_1='[aaa]foo[bbb]bar[ccc]foobar'
pattern=r'\[\w+\]\w+'
print(re.findall(pattern,string_1))
explanation:
\[\w+\]\w+
\[ matches the character [ literally
\w+ matches one or more word characters ([a-zA-Z0-9_] in ASCII; for str patterns in Python 3 it also covers Unicode letters and digits)
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed

Regex pattern to extract substring

mystring = "q1)whatq2)whenq3)where"
I want something like ["q1)what", "q2)when", "q3)where"]
My approach is to find the q\d+\) pattern then move till I find this pattern again and stop. But I'm not able to stop.
I did req_list = re.compile("q\d+\)[*]\q\d+\)").split(mystring)
But this gives the whole string.
How can I do it?
You could try the below code which uses re.findall function,
>>> import re
>>> s = "q1)whatq2)whenq3)where"
>>> m = re.findall(r'q\d+\)(?:(?!q\d+).)*', s)
>>> m
['q1)what', 'q2)when', 'q3)where']
Explanation:
q\d+\) Matches a q followed by one or more digits and then a ) symbol.
(?:(?!q\d+).)* Matches any character, zero or more times, as long as that character does not start another q\d+; the negative lookahead is what makes the match stop before the next item.
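An alternative sketch, not from the original answer: split at every position where a new q<number>) marker begins, using a zero-width lookahead. This relies on re.split accepting empty matches, which assumes Python 3.7 or later.

```python
import re

s = "q1)whatq2)whenq3)where"

# (?=q\d+\)) matches the empty position right before each marker, so the
# string is cut there without consuming any characters.
pieces = [piece for piece in re.split(r'(?=q\d+\))', s) if piece]
print(pieces)  # ['q1)what', 'q2)when', 'q3)where']
```

The filter removes the empty string that the split produces before the first marker.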
