I am trying to build a string to be used as a regex.
In the following code:
_pattern is a pattern like abba, and I am trying to check whether _string follows _pattern (e.g. catdogdogcat).
rxp is the regular expression that I am trying to create to match against _string (e.g. for the example above it will be (.+)(.+)\\2\\1). It is generated successfully, but re.match() returns None.
I want to understand why it is not working and how to correct it.
import re
_pattern = "abba" #raw_input().strip()
_string = "catdogdogcat" #raw_input().strip()
hm = {}
rxp = ""
c = 1
for x in _pattern:
    if hm.has_key(x):
        rxp += hm[x]
        continue
    else:
        rxp += "(.+)"
        hm[x]="\\\\"+str(c)
        c+=1
print rxp
#print re.match(rxp,_string) -> (Tried) Not working
#print re.match(r'rxp', _string) -> (Tried) Not working
print re.match(r'%s' %rxp, _string) # (Tried) Not working
Output
(.+)(.+)\\2\\1
None
Expected Output
(.+)(.+)\\2\\1
<_sre.SRE_Match object at 0x000000000278FE88>
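The problem is that your regex string contains two backslash characters before each group number instead of one. To the regex engine, \\2 means a literal backslash followed by the character 2, not a backreference to group 2.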
You can use
rxp.replace("\\\\", "\\")
in .match like this:
>>> print re.match(rxp.replace("\\\\", "\\"), _string)
<_sre.SRE_Match object at 0x10bf87c68>
>>> print re.match(rxp.replace("\\\\", "\\"), _string).groups()
('cat', 'dog')
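To see why the doubled backslash breaks the match, it may help to compare the two strings side by side. This is only a minimal sketch written for Python 3; the names doubled, single, no_match and match are mine, not from the question.

```python
import re

# "\\\\" in source code is TWO backslash characters; "\\" is ONE.
doubled = "(.+)(.+)" + "\\\\" + "2" + "\\\\" + "1"  # regex text: (.+)(.+)\\2\\1
single = "(.+)(.+)" + "\\" + "2" + "\\" + "1"       # regex text: (.+)(.+)\2\1

# In the regex language, \\2 is a literal backslash followed by "2",
# so the doubled version can only match text that contains backslashes.
no_match = re.match(doubled, "catdogdogcat")
match = re.match(single, "catdogdogcat")

print(len(doubled), len(single))  # 14 12
print(no_match)                   # None
print(match.groups())             # ('cat', 'dog')
```

The raw-string prefix does not help here, because r'%s' % rxp only affects how the literal '%s' is read; the backslashes already stored in rxp are untouched.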
EDIT:
You can also avoid getting double \\ like this:
import re
_pattern = "abba" #raw_input().strip()
_string = "catdogdogcat" #raw_input().strip()
hm = {}
rxp = ""
c = 1
for x in _pattern:
    if x in hm:
        rxp += hm[x]
        continue
    else:
        rxp += "(.+)"
        hm[x]="\\" + str(c)
        c+=1
print rxp
print re.match(rxp,_string)
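You should use string formatting rather than hard-coding rxp into the string; r'rxp' is just the literal text rxp, not the contents of the variable. (Note that r'%s' % rxp is the same string as rxp, so this only matches once rxp itself contains single backslashes.)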
print re.match(r'%s'%rxp, _string)
Related
My problem is that I need to find multiple elements in one string.
For example, I have a string that looks like this:
line = if ((var.equals("INPUT")) || (var.equals("OUTPUT"))
and I have this code to find everything between ' (" ' and ' ") ':
char1 = '("'
char2 = '")'
add = line[line.find(char1)+2 : line.find(char2)]
list.append(add)
The current result is just:
['INPUT']
but I need the result to look like this:
['INPUT','OUTPUT', ...]
After it finds the first match it stops searching, but I need to find everything in that string that matches this search.
I also need to append every single match to the list.
The simplest:
>>> import re
>>> s = """line = if ((var.equals("INPUT")) || (var.equals("OUTPUT"))"""
>>> r = re.compile(r'\("(.*?)"\)')
>>> r.findall(s)
['INPUT', 'OUTPUT']
The trick is to use .*? which is a non-greedy *.
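To make the difference concrete, here is a small sketch comparing the greedy and non-greedy forms on the same string (the variable names greedy and lazy are only for illustration):

```python
import re

s = 'line = if ((var.equals("INPUT")) || (var.equals("OUTPUT"))'

# Greedy: .* runs to the LAST ") in the string, giving one long match.
greedy = re.findall(r'\("(.*)"\)', s)
# Non-greedy: .*? stops at the FIRST ") after each (".
lazy = re.findall(r'\("(.*?)"\)', s)

print(greedy)  # ['INPUT")) || (var.equals("OUTPUT']
print(lazy)    # ['INPUT', 'OUTPUT']
```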
You should look into regular expressions because that's a perfect fit for what you're trying to achieve.
Let's examine a regular expression that does what you want:
import re
regex = re.compile(r'\("([^"]+)"\)')
It matches the string (" then captures anything that isn't a quotation mark and then matches ") at the end.
By using it with findall you will get all the captured groups:
In [1]: import re
In [2]: regex = re.compile(r'\("([^"]+)"\)')
In [3]: line = 'if ((var.equals("INPUT")) || (var.equals("OUTPUT"))'
In [4]: regex.findall(line)
Out[4]: ['INPUT', 'OUTPUT']
If you don't want to use regex, this will help you.
line = 'if ((var.equals("INPUT")) || (var.equals("OUTPUT"))'
char1 = '("'
char2 = '")'
add = line[line.find(char1)+2 : line.find(char2)]
list.append(add)
line1=line[line.find(char2)+1:]
add = line1[line1.find(char1)+2 : line1.find(char2)]
list.append(add)
print(list)
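Just add those three lines (line1 = ..., add = ..., list.append(add)) to your code and you're done. This handles exactly two matches; every further match needs the same three lines repeated.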
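If the number of matches is not known in advance, the same idea can be put in a loop. This is a rough sketch rather than the answer's own code: find_between, open_marker and close_marker are names I made up, and it relies on the optional start argument of str.find.

```python
def find_between(line, open_marker='("', close_marker='")'):
    """Collect every substring found between open_marker and close_marker."""
    results = []
    pos = 0
    while True:
        start = line.find(open_marker, pos)
        if start == -1:
            break  # no more opening markers
        start += len(open_marker)
        end = line.find(close_marker, start)
        if end == -1:
            break  # opening marker without a closing one
        results.append(line[start:end])
        pos = end + len(close_marker)  # continue after this match
    return results

line = 'if ((var.equals("INPUT")) || (var.equals("OUTPUT"))'
print(find_between(line))  # ['INPUT', 'OUTPUT']
```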
If I understand you correctly, then something like this should help you:
line = 'line = if ((var.equals("INPUT")) || (var.equals("OUTPUT"))'
items = []
start = 0
end = 0
c = 0
# stop one character early so that line[c + 1] is always valid
while c < len(line) - 1:
    if line[c] == '(' and line[c + 1] == '"':
        start = c + 2
    if line[c] == '"' and line[c + 1] == ')':
        end = c
    if start and end:
        items.append(line[start:end])
        start = end = None
    c += 1
print(items) # ['INPUT', 'OUTPUT']
I want to add a space between Arabic/Farsi and English words in my text.
It should be done with a regular expression in Python.
For example:
input: "علیAli" output: "علی Ali"
input: "علیAliرضا" output: "علی Ali رضا"
input: "AliعلیRezaرضا" output: "Ali علی Reza رضا"
and anything similar.
You can do it using re.sub, like the following in Python 3:
import re
rx = r'[a-zA-Z]+'
# pad every run of Latin letters with spaces, then trim the ends
output = re.sub(rx, r' \g<0> ', input).strip()
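A variant that inserts a space only where the script actually changes, so that text which already contains spaces is not padded twice, could look like the sketch below. It assumes that "Arabic/Farsi" means the basic Arabic Unicode block (U+0600 to U+06FF) and that "English" means plain ASCII letters; ARABIC, LATIN, BOUNDARY and add_spaces are names I introduced for the example.

```python
import re

# Assumption: Arabic/Farsi letters come from the basic Arabic block,
# and English words consist of unaccented ASCII letters.
ARABIC = r'[\u0600-\u06FF]'
LATIN = r'[A-Za-z]'

# Zero-width match at every point where one script is followed by the other.
BOUNDARY = re.compile(
    r'(?<={a})(?={l})|(?<={l})(?={a})'.format(a=ARABIC, l=LATIN)
)

def add_spaces(text):
    """Insert one space at each Arabic/Latin boundary."""
    return BOUNDARY.sub(' ', text)

print(add_spaces("علیAli"))
print(add_spaces("AliعلیRezaرضا"))
```

Because the pattern uses only lookbehind and lookahead, nothing is consumed and no leading or trailing space is produced.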
Instead of a regular expression, I think this can be done by comparing Unicode code points. I tried to code it but didn't know how to split on \r\n again to get the required output. This code might be useful to someone.
import codecs,string
def detect_language(character):
    maxchar = max(character)
    if u'\u0041' <= maxchar <= u'\u007a':
        return 'eng'
with codecs.open('letters.txt', encoding='utf-8') as f:
    eng_list = []
    eng_var =0
    arab_list = []
    arab_var=0
    input = f.read()
    for i in input:
        isEng = detect_language(i)
        if isEng == "eng":
            eng_list.append(i)
            eng_var = eng_var + 1
        elif '\n' in i or '\r' in i:
            eng_list.append(i)
            arab_list.append(i)
        else:
            arab_list.append(i)
            arab_var =arab_var +1
temp = str(eng_list)
temp1 = temp.encode('ascii','ignore')
Today I had an interview at AMD and was asked a question that I didn't know how to solve without regex. Here is the question:
Find all occurrences of the word "Hello" in a text, allowing for at most ONE extra character between the letters of hello, e.g. search for all instances of "h.ello", "hell o", "he,llo", or "hel!lo".
Since you also tagged this question algorithm, I'm just going to show the general approach that I would take when looking at this question, without including any language tricks from python.
1) I would want to split the string into a list of words
2) Loop through each string in the resulting list, checking if the string matches 'hello' without the character at the current index (or if it simply matches 'hello')
3) If a match is found, return it.
Here is a simple approach in python:
s = "h.ello hello h!ello hell.o none of these"
words = s.split()
def drop_one(s, match):
    if s == match:
        return True # WARNING: Early Return
    # try removing each interior character in turn
    for i in range(1, len(s) - 1):
        if s[:i] + s[i+1:] == match:
            return True
    return False
matches = [x for x in words if drop_one(x, "hello")]
print(matches)
The output of this snippet:
['h.ello', 'hello', 'h!ello', 'hell.o']
This should work. I've tried to make it generic. You might have to make some minor adjustments. Let me know if you don't understand any part.
def checkValidity(tlist):
    tmpVar = ''
    for i in range(len(tlist)):
        if tlist[i] in set("hello"):
            tmpVar += tlist[i]
    return(tmpVar == 'hello')

mStr = "he.llo hehellbo hellox hell.o hello helloxy abhell.oyz"
mWord = "hello"
mlen = len(mStr)
wordLen = len(mWord)+1
i=0
print ("given str = ", mStr)
while i<mlen:
    tmpList = []
    if mStr[i] == 'h':
        # never read past the end of the string
        for j in range(min(wordLen, mlen - i)):
            tmpList.append(mStr[i+j])
        validFlag = checkValidity(tmpList)
        if validFlag:
            print("Match starting at index: ",i, ':', mStr[i:i+wordLen])
            i += wordLen
        else:
            i += 1
    else:
        i += 1
I'm trying to write a Python regular expression that can parse strings of the type "<name>(<number>)", where <number> is optional.
For example, if I pass 'sclkout', then there is no number at the end, so it should just match 'sclkout'. If the input is 'line7', then it should match 'line' and '7'. The name can also contain numbers inside it, so if I give it 'dx3f', then the output should be 'dx3f', but for 'dx3b0' it should match 'dx3b' and '0'.
This is what I first tried:
import re
def do_match(signal):
    match = re.match('(\w+)(\d+)?', signal)
    assert match
    print "Input = " + signal
    print "group1 = " + match.group(1)
    if match.lastindex == 2:
        print "group2 = " + match.group(2)
    print ""
# should match 'sclkout'
do_match("sclkout")
# should match 'line' and '7'
do_match("line7")
# should match 'dx4f'
do_match("dx4f")
# should match 'dx3b' and '0'
do_match("dx3b0")
This is of course wrong because of greedy matching in the (\w+) group, so I tried setting that to non-greedy:
match = re.match('(\w+?)(\d+)?', signal)
This however only matches the first letter of the string.
You don't need regex for this:
from itertools import takewhile
def do_match(s):
    num = ''.join(takewhile(str.isdigit, reversed(s)))[::-1]
    return s[:s.rindex(num)], num
...
>>> do_match('sclkout')
('sclkout', '')
>>> do_match('line7')
('line', '7')
>>> do_match('dx4f')
('dx4f', '')
>>> do_match('dx3b0')
('dx3b', '0')
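You can use a lazy (non-greedy) quantifier together with the anchors ^ and $, like this:
^(?<name>\w+?)(?<number>\d+)?$
Or ^(\w+?)(\d+)?$, if you don't want the named capture groups. (In Python, named groups are written (?P<name>...) rather than (?<name>...).)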
See live demo here: http://rubular.com/r/44Ntc4mLDY
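For Python specifically, the same pattern could be written as in the sketch below. The names SIGNAL and split_signal are mine; the only change to the regex itself is the (?P<name>...) spelling that Python's re module requires.

```python
import re

# Python spells named groups (?P<name>...). \w+? is lazy, and the
# anchors ^ and $ force it to grow until the rest of the pattern fits.
SIGNAL = re.compile(r'^(?P<name>\w+?)(?P<number>\d+)?$')

def split_signal(signal):
    """Return (name, number); number is None when there is no trailing number."""
    m = SIGNAL.match(signal)
    if m is None:
        return None
    return m.group('name'), m.group('number')

for s in ['sclkout', 'line7', 'dx4f', 'dx3b0']:
    print(s, '->', split_signal(s))
# sclkout -> ('sclkout', None)
# line7 -> ('line', '7')
# dx4f -> ('dx4f', None)
# dx3b0 -> ('dx3b', '0')
```

Without the $ anchor the lazy \w+? is satisfied after a single character, which is why the un-anchored attempt in the question only matched the first letter. One caveat: an input made up only of digits, such as '123', is split after its first digit.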
([a-zA-Z0-9]*[a-zA-Z]+)([0-9]*) is what you want.
import re
test = ["sclkout", "line7", "dx4f", "dx3b0"]
ans = [("sclkout", ""), ("line", "7"), ("dx4f", ""), ("dx3b", "0")]
for t, a in zip(test, ans):
    m = re.match(r'([a-zA-Z0-9]*[a-zA-Z]+)([0-9]*)', t)
    if m.groups() == a:
        print "OK"
    else:
        print "NG"
output:
OK
OK
OK
OK
Going back to this example,
Having trouble dealing with similar characters to print different things using regex in Python
I was wondering how I would reverse the regex substitution I did and just print out the original text.
That is, so if I have
text = "This is my first regex python example yahooa yahoouuee bbbiirdd"
as my original text, then its output would be:
re.sub text = "tookhookisook isook mookyook fookirooksooktook pookyooktookhookonook..."
And then I want that output to be converted back to the original text.
How do I do that?
Python strings are immutable. You haven't changed the original, only created a new string. Just keep a reference to the original.
Edit
By immutable, I mean that their actual value is frozen once created.
>>> s = "abc"
>>> s[0]
'a'
>>> s[1] = 'd'
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
s[1] = 'd'
TypeError: 'str' object does not support item assignment
>>>
In the example above I can have the variable s reference another object, but the string I assigned to it is constant. So when you do s.replace(), the result is a new string, and the original is unchanged.
>>> s.replace ('a', 'd')
'dbc'
>>> s
'abc'
>>>
It seems that this works:
import re
tu = ('This is my first regex python example '
      'yahooa yahoouuee bbbiirdd',
      'bbbiirdd',
      'fookirooksooktook',
      'crrsciencezxxxxxscienceokjjsciencq')
reg = re.compile(r'([bcdfghj-np-tv-z])(\1?)')
dereg = re.compile('science([^aeiou])|([^aeiou])ook')
def Frepl(ma):
    g1,g2 = ma.groups()
    if g2: return 'science' + g2
    else: return g1 + 'ook'
def Fderepl(ma):
    g = ma.group(2)
    if g: return g
    else: return 2*ma.group(1)
for strt in tu:
    resu = reg.sub(Frepl , strt)
    bakk = dereg.sub(Fderepl, resu)
    print ('----------------------------------\n'
           'strt = %s\n' 'resu == %s\n'
           'bakk == %s\n' 'bakk == start : %s'
           % (strt, resu, bakk, bakk==strt))
Edit
First, I updated the above code: I removed the re.I flag. It was capturing portions like 'dD' as a repeated letter, so 'dD' was transformed to 'scienceD', then back to 'DD'.
Secondly, I extended the code with a dictionary.
Instead of replacing a letter with letter+'ook', it uses a replacement that depends on the letter.
For example, I chose to replace 'b' with 'BAR', 'c' with 'CORE', and so on. I made the dictionary values uppercase to make the result easier to read; they could in fact be anything else.
The program takes letter case into account. I put only 'T', 'Y' and 'X' in the dictionary as uppercase keys, just as a test.
import re
d = {'b':'BAR','c':'CORE','d':'DEAD','f':'FAN',
     'g':'GO','h':'HHH','j':'JIU','k':'KOAN',
     'l':'LOW','m':'MY','n':'NERD','p':'PI',
     'q':'QIM','r':'ROAR','s':'SING','t':'TIP',
     'v':'VIEW','w':'WAVE','x':'XOR',
     'y':'YEAR','z':'ZOO',
     'T':'tears','Y':'yearling','X':'xylophone'}
ded = dict((v,k) for k,v in d.iteritems())
print ded
tu = ('This is my first regex python example '
      'Yahooa yahoouuee bbbiirdd',
      'bbbiirdd',
      'fookirooksooktook',
      'crrsciencezxxxxxXscienceokjjsciencq')
reg = re.compile(r'([bcdfghj-np-tv-zBCDFGHJ-NP-TV-Z])(\1?)')
othergr = '|'.join(ded.keys())
dereg = re.compile('science([^aeiouAEIOU])|(%s)' % othergr)
def Frepl(ma, d=d):
    g1,g2 = ma.groups()
    if g2: return 'science' + g2
    else: return d[g1]
def Fderepl(ma,ded=ded):
    g = ma.group(2)
    if g: return ded[g]
    else: return 2*ma.group(1)
for strt in tu:
    resu = reg.sub(Frepl , strt)
    bakk = dereg.sub(Fderepl, resu)
    print ('----------------------------------\n'
           'strt = %s\n' 'resu == %s\n'
           'bakk == %s\n' 'bakk == start : %s'
           % (strt, resu, bakk, bakk==strt))
result
----------------------------------
strt = This is my first regex python example Yahooa yahoouuee bbbiirdd
resu == tearsHHHiSING iSING MYYEAR FANiROARSINGTIP ROAReGOeXOR PIYEARTIPHHHoNERD eXORaMYPILOWe yearlingaHHHooa YEARaHHHoouuee sciencebBARiiROARscienced
bakk == This is my first regex python example Yahooa yahoouuee bbbiirdd
bakk == start : True
----------------------------------
strt = bbbiirdd
resu == sciencebBARiiROARscienced
bakk == bbbiirdd
bakk == start : True
----------------------------------
strt = fookirooksooktook
resu == FANooKOANiROARooKOANSINGooKOANTIPooKOAN
bakk == fookirooksooktook
bakk == start : True
----------------------------------
strt = crrsciencezxxxxxXscienceokjjsciencq
resu == COREsciencerSINGCOREieNERDCOREeZOOsciencexsciencexXORxylophoneSINGCOREieNERDCOREeoKOANsciencejSINGCOREieNERDCOREQIM
bakk == crrsciencezxxxxxXscienceokjjsciencq
bakk == start : True
You can't "convert" a regex substitution backwards in Python or any other regex implementation.
That is simply because the substitution is a one-way street that returns a new string, and there is no magical reverse function.
Here is an illustration using str.replace():
original_string = 'abc'
newstring = original_string.replace('a','b')  # 'bbc'
Converting newstring back into 'abc' isn't just a matter of substituting 'a' for 'b'. You can't create a "reverse" regex out of any given regex. If we replaced 'b' with 'a' in this example, the string would be 'aac', not 'abc'.
The regex functions work the same way as str.replace: they return a new string. They don't return an object that knows the exact state of every regex replacement.
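The loss of information can be shown directly. In this small sketch (encode is a hypothetical helper, not something from the question), two different inputs produce the same output, so no function could recover the input from the output alone:

```python
import re

def encode(text):
    # hypothetical one-way substitution: every 'a' becomes 'b'
    return re.sub('a', 'b', text)

first = encode('abc')
second = encode('bbc')

print(first, second)    # bbc bbc
print(first == second)  # True: the result no longer says which input it came from

# Keeping a reference to the original is the reliable way back.
original = 'abc'
encoded = encode(original)
print(original)         # abc (unchanged, because strings are immutable)
```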
You have two options to do whatever it is you want to do:
1- Create a custom class that represents a string and tracks an (infinite?) number of regular expression operations, allowing you to create a diff between each state.
2- Do what everyone else does, and what many people here suggest: simply stash the original string (or a reference to it) off to the side.
( this is an effort to simplify the answer from #StoryTeller )