Regular expression help

Regular expression help - python

I am trying to create a regex in Python 3 that matches 7 characters (eg. >AB0012) separated by an unknown number of characters then matching another 6 characters(eg. aaabbb or bbbaaa). My input string might look like this:
>AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
This is the regex that I have come up with:
matches = re.findall(r'(>.{7})(aaabbb|bbbaaa)', mystring)
print(matches)
The output I am trying to product would look like this:
[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb')]
I read through the Python documentation, but I couldn't find how to match an unknown distance between two portions of a regex. Is there some sort of wildcard character that would allow me to complete my regex? Thanks in advance for the help!
EDIT:
If I use *? in my code like this:
mystring = str(input("Paste promoters here: "))
matches = re.findall(r'(>.{7})*?(aaabbb|bbbaaa)', mystring)
print(matches)
My output looks like this:
[('>CD00192', 'aaabbb'), ('', 'bbbaaa'), ('', 'aaabbb')]
*The second and third items in the list are missing the >CD00192 and >ZP01990, respectively. How can I have the regex include these characters in the list?

Here's a non regular expression approach. Split on ">" (your data will start from 2nd element onwards), then since you don't care what those 7 characters are, so start checking from 8th character onwards till 14th character.
>>> string=""" AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"""
>>> for i in string.split(">")[1:]:
... if i[7:13] in ["aaabbb","bbbaaa"]:
... print ">" + i[:13]
...
>CD00192aaabbb

I have a code that gives also the positions.
Here's the simple version of this code:
import re
from collections import OrderedDict
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print ch,'\n'
regx = re.compile('((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')
dic = OrderedDict()
# Finding the result
for mat in regx.finditer(ch):
chunk,head = mat.groups()
headstart = mat.start()
dic[(headstart,head)] = [(headstart+six.start(),six.start(),six.group())
for six in rag.finditer(chunk)]
# Diplaying the result
for (headstart,head),li in dic.iteritems():
print '{:>10} {}'.format(headstart,head)
for x in li:
print '{0[0]:>10} {0[1]:>6} {0[2]}'.format(x)
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
24 CD00192
31 8 aaabbb
41 18 bbbaaa
52 29 bbbaaa
62 39 aaabbb
69 ZP01990
95 27 aaabbb
136 SE45789
148 13 aaabbb
172 37 bbbaaa
The same code, in a functional manner, using generators :
import re
from itertools import imap
from collections import OrderedDict
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print ch,'\n'
regx = re.compile('((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')
gen = ((mat.groups(),mat.start()) for mat in regx.finditer(ch))
dic = OrderedDict(((headstart,head),
[(headstart+six.start(),six.start(),six.group())
for six in rag.finditer(chunk)])
for (chunk,head),headstart in gen)
print '\n'.join('{:>10} {}'.format(headstart,head)+'\n'+\
'\n'.join(imap('{0[0]:>10} {0[1]:>6} {0[2]}'.format,li))
for (headstart,head),li in dic.iteritems())
.
EDIT
I measured the execution's times.
For each code I measured the creation of the dictionary and the displaying separately.
The code using generators (the second) is 7.4 times more rapid to display the result ( 0.020 seconds) than the other one (0.148 seconds)
But surprisingly for me, the code with generators takes 47 % more time (0.000718 seconds) than the other (0.000489 seconds) to compute the dictionary.
.
EDIT 2
Another way to do:
import re
from collections import OrderedDict
from itertools import imap
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print ch,'\n'
regx = re.compile('((?<=>).{7})|(aaabbb|bbbaaa)')
def collect(ch):
li = []
dic = OrderedDict()
gen = ( (x.start(),x.group(1),x.group(2)) for x in regx.finditer(ch))
for st,g1,g2 in gen:
if g1:
if li:
dic[(stprec,g1prec)] = li
li,stprec,g1prec = [],st,g1
elif g2:
li.append((st,g2))
if li:
dic[(stprec,g1prec)] = li
return dic
dic = collect(ch)
print '\n'.join('{:>10} {}'.format(headstart,head)+'\n'+\
'\n'.join(imap('{0[0]:>10} {0[1]}'.format,li))
for (headstart,head),li in dic.iteritems())
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
24 CD00192
31 aaabbb
41 bbbaaa
52 bbbaaa
62 aaabbb
69 ZP01990
95 aaabbb
136 SE45789
148 aaabbb
172 bbbaaa
This code compute dic in 0.00040 seconds and displays it in 0.0321 seconds
.
EDIT 3
To answer to your question, you have no other possibility than keeping each current value among 'CD00192','ZP01990','SE45789' etc under a name (I don't like to say "in a variable" in Python, because there are no variables in Python. But you can read "under a name" as if I had written "in a variable" )
And for that, you must use finditer()
Here's the code for this solution:
import re
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print ch,'\n'
regx = re.compile('(>.{7})|(aaabbb|bbbaaa)')
matches = []
for mat in regx.finditer(ch):
g1,g2= mat.groups()
if g1:
head = g1
else:
matches.append((head,g2))
print matches
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>SE45789', 'aaabbb'), ('>SE45789', 'bbbaaa')]
My preceding codes are more complicated because they catch the positions and gather the values 'aaabbb' and 'bbbaaa' of one header among 'CD00192','ZP01990','SE45789' etc in a list.

zero or more characters can be matched using *, so a* would match "", "a", "aa" etc. + matches one or more character.
You will perhaps want to make the quantifier (+ or *) lazy by using +? or *? as well.
See regular-expressions.info for more details.

Try this:
>>> r1 = re.findall(r'(>.{7})[^>]*?(aaabbb)', s)
>>> r2 = re.findall(r'(>.{7})[^>]*?(bbbaaa)', s)
>>> r1 + r2
[('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'bbbaaa')]

Related

python how to dynamically find a persons name in a string

im working on a project where i have to use speech to text as an input to determine who to call, however using the speech to text can give some unexpected results so i wanted to have a little dynamic matching of the strings, i'm starting small and try to match 1 single name, my name is Nick Vaes, and i try to match my name to the spoken text, but i also want it to match when for example some text would be Nik or something, idealy i would like to have something that would match everything if only 1 letter is wrong so
Nick
ick
nik
nic
nck
would all match my name, the current simple code i have is:
def user_to_call(s):
if "NICK" or "NIK" in s.upper(): redirect = "Nick"
if redirect: return redirect
for a 4 letter name its possible to put all possibilities in the filter, but for names with 12 letters it is a little bit of overkill since i'm pretty sure it can be done way more efficient.

You need to use Levenshtein_distance
A python implementation is nltk
import nltk
nltk.edit_distance("humpty", "dumpty")

What you basically need is fuzzy string matching, see:
https://en.wikipedia.org/wiki/Approximate_string_matching
https://www.datacamp.com/community/tutorials/fuzzy-string-python
Based on that you can check how similar is the input compared your dictionary:
from fuzzywuzzy import fuzz
name = "nick"
tomatch = ["Nick", "ick", "nik", "nic", "nck", "nickey", "njick", "nickk", "nickn"]
for str in tomatch:
ratio = fuzz.ratio(str.lower(), name.lower())
print(ratio)
This code will produce the following output:
100
86
86
86
86
80
89
89
89
You have to experiment with different ratios and check which will suit your requirements to miss only one letter

From what I understand, you are not looking at any fuzzy matching. (Because you did not upvote other responses).
If you are just trying to evaluate what you specified in your request, here is the code. I have put some additional conditions where I printed the appropriate message. Feel free to remove them.
def wordmatch(baseword, wordtoMatch, lengthOfMatch):
lis_of_baseword = list(baseword.lower())
lis_of_wordtoMatch = list(wordtoMatch.lower())
sum = 0
for index_i, i in enumerate(lis_of_wordtoMatch):
for index_j, j in enumerate(lis_of_baseword):
if i in lis_of_baseword:
if i == j and index_i <= index_j:
sum = sum + 1
break
else:
pass
else:
print("word to match has characters which are not in baseword")
return 0
if sum >= lengthOfMatch and len(wordtoMatch) <= len(baseword):
return 1
elif sum >= lengthOfMatch and len(wordtoMatch) > len(baseword):
print("word to match has no of characters more than that of baseword")
return 0
else:
return 0
base = "Nick"
tomatch = ["Nick", "ick", "nik", "nic", "nck", "nickey","njick","nickk","nickn"]
wordlength_match = 3 # this says how many words to match in the base word. In your case, its 3
for t_word in tomatch:
print(wordmatch(base,t_word,wordlength_match))
the output looks like this
1
1
1
1
1
word to match has characters which are not in baseword
0
word to match has characters which are not in baseword
0
word to match has no of characters more than that of baseword
0
word to match has no of characters more than that of baseword
0
Let me know if this served your purpose.

Regex to extract IBAN ignoring colon [duplicate]

I am trying to extract this text "NL dd ABNA ddddddddd" from collection of strings and I need to create expression that would match the third title:
IBAN NL 91ABNA0417463300
IBAN NL91ABNA0417164300
Iban: NL 69 ABNA 402032566
To date, I use this regex pattern for extraction:
NL\s?\d{2}\s?[A-Z]{4}0\s?\d{9}$
Which matches the first two examples, but not the third.
To reproduce this issue, see this example:
https://regex101.com/r/zGDXa2/1.
How can I treat it?

The problem in your regex101 demo is, there is an extra character in your regex after $ so remove that and change 0 to [0 ] and this fixes all and starts matching your third line too. The correct regex becomes,
NL\s?\d{2}\s?[A-Z]{4}[0 ]\s?\d{9}$
Check your updated demo

You can use the following regex:
(?i)(?:(?<=IBAN(?:[:\s]\s|\s[:\s]))NL\s?\d{2}\s?[A-Z]{4}[0 ]\s?\d{9,10})|(?:(?<=IBAN[:\s])NL\s?\d{2}\s?[A-Z]{4}[0 ]\s?\d{9,10})
demo:
https://regex101.com/r/zGDXa2/11
If you work in python you can remove the (?:i) and replace it by a flag re.I or re.IGNORECASE
Tested on:
Uw BTW nummer NL80
IBAN NL 11abna0317164300asdfasf234
iBAN NL21ABNA0417134300 22
Iban: NL 29 ABNA 401422366f sdf
IBAN :NL 39 ABNA 0822416395s
IBAN:NL 39 ABNA 0822416395s
Extracts:
NL 11abna0317164300
NL21ABNA0417134300
NL 29 ABNA 401422366
NL 39 ABNA 0822416395
NL 39 ABNA 0822416395

You can just remove all spaces and uppercase the rest, Like this:
iban = NL 91ABNA0417463300
iban.replace(" ", "")
iban.upper()
And then your regex would be:
NL\d{2}ABNA(\d{10}|\d{9})
It works in https://regex101.com/r/zGDXa2/1

It's not what you want, but works.
IBAN has a strict format, so it's better to normalize it, and next just cut part, because everything will match regexp, as an example:
CODE
#!/usr/bin/python3
# -*- coding: utf-8 -*-
# I'm not sure, that alphabet is correct, A-Z, 0-9
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
def normalize(string):
stage1 = "".join(IBAN.split()).upper()
stage2 = ''
for l in stage1:
if l in alphabet:
stage2 = stage2 + l
return stage2.split('IBAN')[1]
if __name__ == '__main__':
IBAN_LIST = ['IBAN NL 91ABNA0417463300', 'IBAN NL91ABNA0417164300', 'Iban: NL 69 ABNA 402032566']
for IBAN in IBAN_LIST:
IBAN_normalized = normalize(IBAN)
print(IBAN_normalized[2:4], IBAN_normalized[8:])
OUTPUT
91 0417463300
91 0417164300
69 402032566
It's not a regexp, but should work faster, but if you know how to normalize better, please, help with it.
You can see source code here.

Filter elements with a regex, only if they are in a certain block

In a string (in reality it's much bigger):
s = """
BeginA
Qwerty
Element 11 35
EndA
BeginB
Element 12 38
...
Element 198 38
EndB
BeginA
Element 81132 38
SomethingElse
EndA
BeginB
Element 12 39
Element 198 38
EndB
"""
how to replace every Element <anythinghere> 38 which is inside a BeginB...EndB block (and only those!) by Element ABC?
I was trying with:
s = re.sub(r'Element .* 38', 'Element ABC', s)
but this doesn't detect if it's in a BeginB...EndB block. How to do this?

Use two expressions:
block = re.compile(r'BeginB[\s\S]+?EndB')
element = re.compile(r'Element.*?\b38\b')
def repl(match):
return element.sub('Element ABC', match.group(0))
nstring = block.sub(repl, string)
print(nstring)
This yields
BeginA
Qwerty
Element 11 35
EndA
BeginB
Element ABC
...
Element ABC
EndB
BeginA
Element 81132 38
SomethingElse
EndA
BeginB
Element 12 39
Element ABC
EndB
See a demo on ideone.com.
Without re.compile (just to get the idea):
def repl(match):
return re.sub(r'Element.*?\b38\b', 'Element ABC', match.group(0))
print re.sub(r'BeginB[\s\S]+?EndB', repl, s)
The important idea here is the fact that re.sub's second parameter can be a function.
You could very well do it without a function but you'd need the newer regex module which supports \G and \K:
rx = re.compile(r'''
(?:\G(?!\A)|BeginB)
(?:(?!EndB)[\s\S])+?\K
Element.+?\b38\b''', re.VERBOSE)
string = rx.sub('Element ABC', string)
print(string)
See another demo for this one on regex101.com as well.

Try the following:
r'(?s)(?<=BeginB)\s+Element\s+(\d+)\s+\d+.*?(?=EndB)'
You can test it here.
For your example, I would echo #Jan's answer and use two separate regular expressions:
import re
restrict = re.compile(r'(?s)(?<=BeginB).*?(?=EndB)')
pattern = re.compile(r'Element\s+(\d+)\s+38')
def repl(block):
return pattern.sub('Element ABC', block.group(0))
out = restrict.sub(repl, s)

RegEx: Find all digits after certain string

I am trying to get all the digits from following string after the word classes (or its variations)
Accepted for all the goods and services in classes 16 and 41.
expected output:
16
41
I have multiple strings which follows this pattern and some others such as:
classes 5 et 30 # expected output 5, 30
class(es) 32,33 # expected output 32, 33
class 16 # expected output 5
Here is what I have tried so far: https://regex101.com/r/eU7dF6/3
(class[\(es\)]*)([and|et|,|\s]*(\d{1,}))+
But I am able to get only the last matched digit i.e. 41 in the above example.

I suggest grabbing all the substring with numbers after class or classes/class(es) and then get all the numbers from those:
import re
p = re.compile(r'\bclass(?:\(?es\)?)?(?:\s*(?:and|et|[,\s])?\s*\d+)+')
test_str = "Accepted for all the goods and services in classes 16 and 41."
results = [re.findall(r"\d+", x) for x in p.findall(test_str)]
print([x for l in results for x in l])
# => ['16', '41']
See IDEONE demo
As \G construct is not supported, nor can you access the captures stack using Python re module, you cannot use your approach.
However, you can do it the way you did with PyPi regex module.
>>> import regex
>>> test_str = "Accepted for all the goods and services in classes 16 and 41."
>>> rx = r'\bclass(?:\(?es\)?)?(?:\s*(?:and|et|[,\s])?\s*(?P<num>\d+))+'
>>> res = []
>>> for x in regex.finditer(rx, test_str):
res.extend(x.captures("num"))
>>> print res
['16', '41']

You can do it in 2 steps.Regex engine remebers only the last group in continous groups.
x="""Accepted for all the goods and services in classes 16 and 41."""
print re.findall(r"\d+",re.findall(r"class[\(es\)]*\s*(\d+(?:(?:and|et|,|\s)*\d+)*)",x)[0])
Output:['16', '41']
If you dont want string use
print map(ast.literal_eval,re.findall(r"\d+",re.findall(r"class[\(es\)]*\s*(\d+(?:(?:and|et|,|\s)*\d+)*)",x)[0]))
Output:[16, 41]
If you have to do it in one regex use regex module
import regex
x="""Accepted for all the goods and services in classes 16 and 41."""
print [ast.literal_eval(i) for i in regex.findall(r"class[\(es\)]*|\G(?:and|et|,|\s)*(\d+)",x,regex.VERSION1) if i]
Output:[16, 41]

replacing appointed characters in a string in txt file

Hello all…I want to pick up the texts ‘DesingerXXX’ from a text file which contains below contents:
C DesignerTEE edBore 1 1/42006
Cylinder SingleVerticalB DesignerHHJ e 1 1/8Cooling 1
EngineBore 11/16 DesignerTDT 8Length 3Width 3
EngineCy DesignerHEE Inline2008Bore 1
Height 4TheChallen DesignerTET e 1Stroke 1P 305
Height 8C 606Wall15ccG DesignerQBG ccGasEngineJ 142
Height DesignerEQE C 60150ccGas2007
Anidea is to use the ‘Designer’ as a key, to consider each line into 2 parts, before the key, and after the key.
file_object = open('C:\\file.txt')
lines = file_object.readlines()
for line in lines:
if 'Designer' in line:
where = line.find('Designer')
before = line[0:where]
after = line[where:len(line)]
file_object.close()
In the ‘before the key’ part, I need to find the LAST space (‘ ’), and replace to another symbol/character.
In the ‘after the key’ part, I need to find the FIRST space (‘ ’), and replace to another symbol/character.
Then, I can slice it and pick up the wanted according to the new symbols/characters.
is there a better way to pick up the wanted texts? Or not, how can I replace the appointed key spaces?
In the string replace function, I can limit the times of replacing but not exactly which I can replace. How can I do that?
thanks

Using regular expressions, its a trivial task:
>>> s = '''C DesignerTEE edBore 1 1/42006
... Cylinder SingleVerticalB DesignerHHJ e 1 1/8Cooling 1
... EngineBore 11/16 DesignerTDT 8Length 3Width 3
... EngineCy DesignerHEE Inline2008Bore 1
... Height 4TheChallen DesignerTET e 1Stroke 1P 305
... Height 8C 606Wall15ccG DesignerQBG ccGasEngineJ 142
... Height DesignerEQE C 60150ccGas2007'''
>>> import re
>>> exp = 'Designer[A-Z]{3}'
>>> re.findall(exp, s)
['DesignerTEE', 'DesignerHHJ', 'DesignerTDT', 'DesignerHEE', 'DesignerTET', 'DesignerQBG', 'DesignerEQE']
The regular expression is Designer[A-Z]{3} which means the letters Designer, followed by any letter from capital A to capital Z that appears 3 times, and only three times.
So, it won't match DesignerABCD (4 letters), it also wont match Desginer123 (123 is not valid letters).
It also won't match Designerabc (abc are small letters). To make it ignore the case, you can pass an optional flag re.I as a third argument; but this will also match designerabc (you have to be very specific with regular expressions).
So, to make it so that it matches Designer followed by exactly 3 upper or lower case letters, you'd have to change the expression to Designer[Aa-zZ]{3}.
If you want to search and replace, then you can use re.sub for substituting matches; so if I want to replace all matches with the word 'hello':
>>> x = re.sub(exp, 'hello', s)
>>> print(x)
C hello edBore 1 1/42006
Cylinder SingleVerticalB hello e 1 1/8Cooling 1
EngineBore 11/16 hello 8Length 3Width 3
EngineCy hello Inline2008Bore 1
Height 4TheChallen hello e 1Stroke 1P 305
Height 8C 606Wall15ccG hello ccGasEngineJ 142
Height hello C 60150ccGas2007
and what if both before and after 'Designer', there are characters,
and the length of character is not fixed. I tried
'[Aa-zZ]Designer[Aa-zZ]{0~9}', but it doesn't work..
For these things, there are special characters in regular expressions. Briefly summarized below:
When you want to say "1 or more, but at least 1", use +
When you want to say "0 or any number, but there maybe none", use *
When you want to say "none but if it exists, only repeats once" use ?
You use this after the expression you want to be modified with the "repetition" modifiers.
For more on this, have a read through the documentation.
Now your requirements is "there are characters but the length is not fixed", based on this, we have to use +.

Try with re.sub. The regular expression match with your keyword surrounded by spaces. The second parameter of sub, replace the surrounder spaces by your_special_char (in my script a hyphen)
>>> import re
>>> with open('file.txt') as file_object:
... your_special_char = '-'
... for line in file_object:
... formated_line = re.sub(r'(\s)(Designer[A-Z]{3})(\s)', r'%s\2%s' % (your_special_char,your_special_char), line)
... print formated_line
...
C -DesignerTEE-edBore 1 1/42006
Cylinder SingleVerticalB-DesignerHHJ-e 1 1/8Cooling 1
EngineBore 11/16-DesignerTDT-8Length 3Width 3
EngineCy-DesignerHEE-Inline2008Bore 1
Height 4TheChallen-DesignerTET-e 1Stroke 1P 305
Height 8C 606Wall15ccG-DesignerQBG-ccGasEngineJ 142
Height-DesignerEQE-C 60150ccGas2007

Maroun Maroun mentioned 'Why not simply split the string'. so guessing one of the working way is:
import re
file_object = open('C:\\file.txt')
lines = file_object.readlines()
b = []
for line in lines:
a = line.split()
for aa in a:
b.append(aa)
for bb in b:
if 'Designer' in bb:
print bb
file_object.close()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regular expression help - python

zero or more characters can be matched using , so a would match "", "a", "aa" etc. + matches one or more character. You will perhaps want to make the quantifier (+ or ) lazy by using +? or ? as well. See regular-expressions.info for more details.

Try this: >>> r1 = re.findall(r'(>.{7})[^>]?(aaabbb)', s) >>> r2 = re.findall(r'(>.{7})[^>]?(bbbaaa)', s) >>> r1 + r2 [('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'bbbaaa')]

Related

python how to dynamically find a persons name in a string

Regex to extract IBAN ignoring colon [duplicate]

Filter elements with a regex, only if they are in a certain block

RegEx: Find all digits after certain string

replacing appointed characters in a string in txt file

Categories

Resources

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regular expression help - python

zero or more characters can be matched using *, so a* would match "", "a", "aa" etc. + matches one or more character. You will perhaps want to make the quantifier (+ or *) lazy by using +? or *? as well. See regular-expressions.info for more details.

Try this: >>> r1 = re.findall(r'(>.{7})[^>]*?(aaabbb)', s) >>> r2 = re.findall(r'(>.{7})[^>]*?(bbbaaa)', s) >>> r1 + r2 [('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'bbbaaa')]

Related

python how to dynamically find a persons name in a string

Regex to extract IBAN ignoring colon [duplicate]

Filter elements with a regex, only if they are in a certain block

RegEx: Find all digits after certain string

replacing appointed characters in a string in txt file

Categories

Resources

zero or more characters can be matched using , so a would match "", "a", "aa" etc. + matches one or more character. You will perhaps want to make the quantifier (+ or ) lazy by using +? or ? as well. See regular-expressions.info for more details.

Try this: >>> r1 = re.findall(r'(>.{7})[^>]?(aaabbb)', s) >>> r2 = re.findall(r'(>.{7})[^>]?(bbbaaa)', s) >>> r1 + r2 [('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'bbbaaa')]