I am trying to extract text of the form "NL dd ABNA ddddddddd" from a collection of strings, and I need to create an expression that would also match the third line:
IBAN NL 91ABNA0417463300
IBAN NL91ABNA0417164300
Iban: NL 69 ABNA 402032566
So far, I have used this regex pattern for extraction:
NL\s?\d{2}\s?[A-Z]{4}0\s?\d{9}$
Which matches the first two examples, but not the third.
To reproduce this issue, see this example:
https://regex101.com/r/zGDXa2/1.
How can I fix it?
The problem in your regex101 demo is that there is an extra character in your regex after the $. Remove it, and change the 0 to [0 ]; this fixes everything and makes your third line match too. The correct regex becomes:
NL\s?\d{2}\s?[A-Z]{4}[0 ]\s?\d{9}$
Check your updated demo
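As a quick check, the corrected pattern can be exercised over the three sample lines from the question (a minimal sketch using Python's re module):

```python
import re

# The corrected pattern: [0 ] allows either a literal 0 or a space
# between the bank code and the account number.
pattern = re.compile(r'NL\s?\d{2}\s?[A-Z]{4}[0 ]\s?\d{9}$')

samples = [
    'IBAN NL 91ABNA0417463300',
    'IBAN NL91ABNA0417164300',
    'Iban: NL 69 ABNA 402032566',
]
for s in samples:
    m = pattern.search(s)
    print(m.group() if m else 'no match')
```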
You can use the following regex:
(?i)(?:(?<=IBAN(?:[:\s]\s|\s[:\s]))NL\s?\d{2}\s?[A-Z]{4}[0 ]\s?\d{9,10})|(?:(?<=IBAN[:\s])NL\s?\d{2}\s?[A-Z]{4}[0 ]\s?\d{9,10})
demo:
https://regex101.com/r/zGDXa2/11
If you work in Python, you can remove the (?i) and replace it with the flag re.I or re.IGNORECASE.
Tested on:
Uw BTW nummer NL80
IBAN NL 11abna0317164300asdfasf234
iBAN NL21ABNA0417134300 22
Iban: NL 29 ABNA 401422366f sdf
IBAN :NL 39 ABNA 0822416395s
IBAN:NL 39 ABNA 0822416395s
Extracts:
NL 11abna0317164300
NL21ABNA0417134300
NL 29 ABNA 401422366
NL 39 ABNA 0822416395
NL 39 ABNA 0822416395
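In Python the pattern above can be exercised like this (a minimal sketch; re.IGNORECASE replaces the inline (?i), and each lookbehind branch has a fixed width, which Python's re requires):

```python
import re

# Two lookbehind alternatives cover "IBAN : ", "IBAN: ", "IBAN " etc.
pat = re.compile(
    r'(?:(?<=IBAN(?:[:\s]\s|\s[:\s]))NL\s?\d{2}\s?[A-Z]{4}[0 ]\s?\d{9,10})'
    r'|(?:(?<=IBAN[:\s])NL\s?\d{2}\s?[A-Z]{4}[0 ]\s?\d{9,10})',
    re.IGNORECASE)

m = pat.search('Iban: NL 29 ABNA 401422366f sdf')
print(m.group())
```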
You can just remove all spaces and uppercase the rest, like this:
iban = "NL 91ABNA0417463300"
iban = iban.replace(" ", "")
iban = iban.upper()
And then your regex would be:
NL\d{2}ABNA(\d{10}|\d{9})
It works in https://regex101.com/r/zGDXa2/1
It's not what you want, but works.
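A quick sketch of that idea in one place (note that str.replace and str.upper return new strings, so the results must be reassigned or chained):

```python
import re

iban = 'Iban: NL 69 ABNA 402032566'
cleaned = iban.replace(' ', '').upper()   # 'IBAN:NL69ABNA402032566'

m = re.search(r'NL\d{2}ABNA(\d{10}|\d{9})', cleaned)
print(m.group())   # NL69ABNA402032566
```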
An IBAN has a strict format, so it's better to normalize the string and then just slice out the parts, rather than matching everything with a regexp. For example:
CODE
#!/usr/bin/python3
# -*- coding: utf-8 -*-

# I'm not sure that this alphabet is correct: A-Z, 0-9
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def normalize(string):
    stage1 = "".join(string.split()).upper()
    stage2 = ''
    for l in stage1:
        if l in alphabet:
            stage2 = stage2 + l
    return stage2.split('IBAN')[1]

if __name__ == '__main__':
    IBAN_LIST = ['IBAN NL 91ABNA0417463300', 'IBAN NL91ABNA0417164300', 'Iban: NL 69 ABNA 402032566']
    for IBAN in IBAN_LIST:
        IBAN_normalized = normalize(IBAN)
        print(IBAN_normalized[2:4], IBAN_normalized[8:])
OUTPUT
91 0417463300
91 0417164300
69 402032566
It's not a regexp, but it should run faster. If you know how to normalize better, please help improve it.
You can see source code here.
Related
I am trying to match only North American phone numbers present in a string; (123)456-7890 and 123-456-7890 are both acceptable presentation formats for North American phone numbers, meaning any other pattern should not match.
Note: Python 3.7 and the PyCharm editor are being used.
Here are phone numbers represented in a string:
123-456-7890
(123)456-7890
(123)-456-7890
(123-456-7890
1234567890
123 456 7890
I tried the regex (\()?\d{3}(?(1)\)|-)\d{3}-\d{4}, which uses backreference conditionals to match the desired phone numbers. The Python code is included below:
import regex
st = """
123-456-7890
(123)456-7890
(123)-456-7890
(123-456-7890
1234567890
123 456 7890
"""
pat = regex.compile(r'(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}', regex.I)
out = pat.findall(st)
print(out)
Output using findall method: ['', '(', '']
Output using search(st).group() method which returns just the first match: 123-456-7890
Matches should be :
123-456-7890
(123)456-7890
My question is: why does the findall method, which returns the matched patterns flawlessly on the regex101 website, return such puzzling results as ['', '(', ''] here?
I have tried the regex on the regex101 website and it works perfectly, but it does not work here.
Note: I am using the Sams Teach Yourself Regular Expressions book; page 134 suggests the best solution for this problem, and the above is its Python implementation.
Your regex is correct, but findall returns the captured groups (not the whole match) whenever the pattern contains capturing groups. It's better to use finditer and print .group() or .group(0):
>>> pat = regex.compile(r'^(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}$', regex.M)
>>> for m in pat.finditer(st):
... print (m.group())
...
123-456-7890
(123)456-7890
Use re.finditer:
print(list(pat.finditer(st)))
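To make the findall behavior concrete, here is a small self-contained sketch contrasting the two approaches on the question's data:

```python
import re  # the stdlib re also supports (?(1)...) conditionals

st = """
123-456-7890
(123)456-7890
(123)-456-7890
(123-456-7890
1234567890
123 456 7890
"""
pat = re.compile(r'^(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}$', re.M)

# findall returns the lone capturing group: '' or '(' per match.
print(pat.findall(st))

# finditer yields match objects, so group(0) gives the full match.
print([m.group(0) for m in pat.finditer(st)])
```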
In a string (in reality it's much bigger):
s = """
BeginA
Qwerty
Element 11 35
EndA
BeginB
Element 12 38
...
Element 198 38
EndB
BeginA
Element 81132 38
SomethingElse
EndA
BeginB
Element 12 39
Element 198 38
EndB
"""
how to replace every Element <anythinghere> 38 which is inside a BeginB...EndB block (and only those!) by Element ABC?
I was trying with:
s = re.sub(r'Element .* 38', 'Element ABC', s)
but this doesn't detect if it's in a BeginB...EndB block. How to do this?
Use two expressions:
import re

block = re.compile(r'BeginB[\s\S]+?EndB')
element = re.compile(r'Element.*?\b38\b')

def repl(match):
    return element.sub('Element ABC', match.group(0))

nstring = block.sub(repl, s)
print(nstring)
This yields
BeginA
Qwerty
Element 11 35
EndA
BeginB
Element ABC
...
Element ABC
EndB
BeginA
Element 81132 38
SomethingElse
EndA
BeginB
Element 12 39
Element ABC
EndB
See a demo on ideone.com.
Without re.compile (just to get the idea):
def repl(match):
    return re.sub(r'Element.*?\b38\b', 'Element ABC', match.group(0))

print(re.sub(r'BeginB[\s\S]+?EndB', repl, s))
The important idea here is the fact that re.sub's second parameter can be a function.
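A minimal, self-contained illustration of that fact (hypothetical data, not from the question): the replacement argument to re.sub may be a callable that receives each match object and returns the replacement string.

```python
import re

# Double every number in the string via a callable replacement.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1 b2 c3')
print(doubled)  # a2 b4 c6
```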
You could very well do it without a function, but you'd need the newer regex module, which supports \G and \K:

import regex

rx = regex.compile(r'''
    (?:\G(?!\A)|BeginB)
    (?:(?!EndB)[\s\S])+?\K
    Element.+?\b38\b''', regex.VERBOSE)

s = rx.sub('Element ABC', s)
print(s)
See another demo for this one on regex101.com as well.
Try the following:
r'(?s)(?<=BeginB)\s+Element\s+(\d+)\s+\d+.*?(?=EndB)'
You can test it here.
For your example, I would echo @Jan's answer and use two separate regular expressions:
import re
restrict = re.compile(r'(?s)(?<=BeginB).*?(?=EndB)')
pattern = re.compile(r'Element\s+(\d+)\s+38')
def repl(block):
    return pattern.sub('Element ABC', block.group(0))
out = restrict.sub(repl, s)
I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile(r'\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile(r'NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
    print i
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?
It's missing it because his state name contains two words: SOUTH CAROLINA
Change your second regex to the following; it should help:
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
which is an optional, non-capturing group matching a space followed by one or more word characters.
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly, after you remove the Nos and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the State in $3 And the name in $4
Each capturing group works as follows:
(#[\w]+?\s)
This matches a # sign followed by as few word characters as possible, up to a space.
((?:(?:[\w]+?)\s){1,2})
This matches and captures one or two words, which should be the state. It only works because the next piece MUST match exactly two words.
((?:[\w]+?\s){2})
This matches and captures exactly two words, each being as few word characters as possible followed by a space.
You can remove the NOs and dashes in one pass:
text = re.sub(' (NO|-+)(?= |$)', '', text)
And to capture everything:
re.findall('(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))',text)
Or all at once:
re.findall('(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))',text)
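For a single entry, the two-step version (strip, then capture) looks like this:

```python
import re

line = '#senatorleahy VERMONT - Patrick Leahy NO'

# Step 1: strip the NOs and dashes.
line = re.sub(' (NO|-+)(?= |$)', '', line)
print(line)  # #senatorleahy VERMONT Patrick Leahy

# Step 2: capture handle, state, and name.
print(re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))', line))
```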
I'm trying to extract data from a few large textfiles containing entries about people. The problem is, though, I cannot control the way the data comes to me.
It is usually in a format like this:
LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text hereJanuary, 25, 2012
Firstname Lastname 2001 Some text that I don't care about
Lastname, Firstname blah blah ... January 25, 2012 ...
Currently, I am using a huge regex that splits all kindaCamelcase words, all words that have a month name tacked onto the end, and a lot of special cases for names. Then I use more regex to extract a lot of combinations for the name and date.
This seems sub-optimal.
Are there any machine-learning libraries for Python that can parse malformed data that is somewhat structured?
I've tried NLTK, but it could not handle my dirty data. I'm tinkering with Orange right now and I like its OOP style, but I'm not sure whether I'm wasting my time.
Ideally, I'd like to do something like this to train a parser (with many input/output pairs):
training_data = (
'LASTNAME, Firstname Middlename (Maybe a Nickname)FooBarJanuary 25, 2012',
['LASTNAME', 'Firstname', 'Middlename', 'Maybe a Nickname', 'January 25, 2012']
)
Is something like this possible or am I overestimating machine learning? Any suggestions will be appreciated, as I'd like to learn more about this topic.
I ended up implementing a somewhat-complicated series of exhaustive regexes that encompassed every possible use case using text-based "filters" that were substituted with the appropriate regexes when the parser loaded.
If anyone's interested in the code, I'll edit it into this answer.
Here's basically what I used. To construct the regular expressions out of my "language", I had to make replacement classes:
class Replacer(object):
    def __call__(self, match):
        group = match.group(0)
        if group[1:].lower().endswith('_nm'):
            return '(?:' + Matcher(group).regex[1:]
        else:
            return '(?P<' + group[1:] + '>' + Matcher(group).regex[1:]
Then, I made a generic Matcher class, which constructed a regex for a particular pattern given the pattern name:
class Matcher(object):
    name_component = r"([A-Z][A-Za-z|'|\-]+|[A-Z][a-z]{2,})"
    name_component_upper = r"([A-Z][A-Z|'|\-]+|[A-Z]{2,})"
    year = r'(1[89][0-9]{2}|20[0-9]{2})'
    year_upper = year
    age = r'([1-9][0-9]|1[01][0-9])'
    age_upper = age
    ordinal = r'([1-9][0-9]|1[01][0-9])\s*(?:th|rd|nd|st|TH|RD|ND|ST)'
    ordinal_upper = ordinal
    # "months" and "months_short" are lists of month names defined elsewhere
    date = r'((?:{0})\.? [0-9]{{1,2}}(?:th|rd|nd|st|TH|RD|ND|ST)?,? \d{{2,4}}|[0-9]{{1,2}} (?:{0}),? \d{{2,4}}|[0-9]{{1,2}}[\-/\.][0-9]{{1,2}}[\-/\.][0-9]{{2,4}})'.format('|'.join(months + months_short) + '|' + '|'.join(months + months_short).upper())
    date_upper = date

    matchers = [
        'name_component',
        'year',
        'age',
        'ordinal',
        'date',
    ]

    def __init__(self, match=''):
        capitalized = '_upper' if match.isupper() else ''
        match = match.lower()[1:]
        if match.endswith('_instant'):
            match = match[:-8]
        if match in self.matchers:
            self.regex = getattr(self, match + capitalized)
        elif len(match) == 1:
            pass  # single-character case: body elided in the original
        elif 'year' in match:
            self.regex = getattr(self, 'year')
        else:
            self.regex = getattr(self, 'name_component' + capitalized)
Finally, there's the generic Pattern object:
class Pattern(object):
    def __init__(self, text='', escape=None):
        self.text = text
        self.matchers = []
        escape = not self.text.startswith('!') if escape is None else False
        if escape:
            self.regex = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', self.text)
        else:
            self.regex = self.text[1:]
        self.size = len(re.findall(r'(\$[A-Za-z0-9\-_]+)', self.regex))
        self.regex = re.sub(r'(\$[A-Za-z0-9\-_]+)', Replacer(), self.regex)
        self.regex = re.sub(r'\s+', r'\\s+', self.regex)

    def search(self, text):
        return re.search(self.regex, text)

    def findall(self, text, max_depth=1.0):
        results = []
        length = float(len(text))
        for result in re.finditer(self.regex, text):
            if result.start() / length < max_depth:
                results.extend(result.groups())
        return results

    def match(self, text):
        result = map(lambda x: (x.groupdict(), x.start()), re.finditer(self.regex, text))
        if result:
            return result
        else:
            return []
It got pretty complicated, but it worked. I'm not going to post all of the source code, but this should get someone started. In the end, it converted a file like this:
$LASTNAME, $FirstName $I. said on $date
Into a compiled regex with named capturing groups.
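For anyone unfamiliar with named capturing groups, here is a tiny standalone illustration (the pattern below is a simplified stand-in, not the author's generated regex):

```python
import re

# (?P<name>...) captures into a named group, retrievable via groupdict().
pat = re.compile(r'(?P<last>[A-Z]+), (?P<first>[A-Z][a-z]+)')
m = pat.search('SMITH, John said on January 25, 2012')
print(m.groupdict())  # {'last': 'SMITH', 'first': 'John'}
```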
I had a similar problem, mainly caused by exporting data from Microsoft Office 2010: the result is a join between two consecutive words at somewhat regular intervals. The domain is a morphological operation, like spell-checking. You can jump to a machine-learning solution or build a heuristic one, as I did.
The easy solution is to assume that the newly formed word is a combination of proper names (with the first character capitalized).
A second, additional solution is to have a dictionary of valid words and try a set of partition locations that generate two (or at least one) valid words. Another problem arises when one of them is a proper name, which by definition is out-of-vocabulary for that dictionary. Perhaps word-length statistics could be used to identify whether a word is a mistakenly formed word or a legitimate one.
In my case, this is part of a manual correction of a large corpus of text (human-in-the-loop verification), but the only part that can be automated is the selection of probably-malformed words and a recommended correction.
Regarding the concatenated words, you can split them using a tokenizer:
The OpenNLP Tokenizers segment an input character sequence into tokens. Tokens are usually words, punctuation, numbers, etc.
For example:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
is tokenized into:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
OpenNLP has a "learnable tokenizer" that you can train. If that doesn't work, you can try the answers to: Detect most likely words from text without spaces / combined words.
When splitting is done, you can eliminate the punctuation and pass it to a NER system such as CoreNLP:
Johnson John Doe Maybe a Nickname Why is this text here January 25 2012
which outputs:
Tokens
Id Word Lemma Char begin Char end POS NER Normalized NER
1 Johnson Johnson 0 7 NNP PERSON
2 John John 8 12 NNP PERSON
3 Doe Doe 13 16 NNP PERSON
4 Maybe maybe 17 22 RB O
5 a a 23 24 DT O
6 Nickname nickname 25 33 NN MISC
7 Why why 34 37 WRB MISC
8 is be 38 40 VBZ O
9 this this 41 45 DT O
10 text text 46 50 NN O
11 here here 51 55 RB O
12 January January 56 63 NNP DATE 2012-01-25
13 25 25 64 66 CD DATE 2012-01-25
14 2012 2012 67 71 CD DATE 2012-01-25
One part of your problem: "all words that have a month name tacked onto the end,"
If, as appears to be the case, you have a date in the format Monthname 1-or-2-digit-day-number, yyyy at the end of the string, you should use a regex to munch that off first. Then you have a much simpler job on the remainder of the input string.
Note: Otherwise you could run into problems with given names which are also month names e.g. April, May, June, August. Also March is a surname which could be used as a "middle name" e.g. SMITH, John March.
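A sketch of that first step, munching a trailing date off the string (the month list and the tolerant ',?' are my assumptions, chosen to also cover the "January, 25, 2012" variant in the sample):

```python
import re

MONTHS = ('January|February|March|April|May|June|July|August|'
          'September|October|November|December')
date_re = re.compile(r'(?:%s),?\s+\d{1,2},?\s+\d{4}\s*$' % MONTHS)

line = ('LASTNAME, Firstname Middlename (Maybe a Nickname)'
        'Why is this text hereJanuary, 25, 2012')
m = date_re.search(line)
date, rest = m.group(), line[:m.start()]
print(date)  # January, 25, 2012
print(rest)
```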
Your use of the "last/first/middle" terminology is "interesting". There are potential problems if your data includes non-Anglo names like these:
Mao Zedong aka Mao Ze Dong aka Mao Tse Tung
Sima Qian aka Ssu-ma Ch'ien
Saddam Hussein Abd al-Majid al-Tikriti
Noda Yoshihiko
Kossuth Lajos
José Luis Rodríguez Zapatero
Pedro Manuel Mamede Passos Coelho
Sukarno
A few pointers, to get you started:
for date parsing, you could start with a couple of regexes, and then you could use chronic or jChronic
for names, these OpenNlp models should work
As for training a machine learning model yourself, this is not so straightforward, especially regarding training data (work effort)...
I am trying to create a regex in Python 3 that matches 7 characters (e.g. >AB0012) separated from another 6 characters (e.g. aaabbb or bbbaaa) by an unknown number of characters. My input string might look like this:
>AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
This is the regex that I have come up with:
matches = re.findall(r'(>.{7})(aaabbb|bbbaaa)', mystring)
print(matches)
The output I am trying to produce would look like this:
[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb')]
I read through the Python documentation, but I couldn't find how to match an unknown distance between two portions of a regex. Is there some sort of wildcard character that would allow me to complete my regex? Thanks in advance for the help!
EDIT:
If I use *? in my code like this:
mystring = str(input("Paste promoters here: "))
matches = re.findall(r'(>.{7})*?(aaabbb|bbbaaa)', mystring)
print(matches)
My output looks like this:
[('>CD00192', 'aaabbb'), ('', 'bbbaaa'), ('', 'aaabbb')]
The second and third items in the list are missing the >CD00192 and >ZP01990, respectively. How can I have the regex include these characters in the list?
Here's a non-regular-expression approach: split on ">" (your data starts from the 2nd element onwards), and since you don't care what those 7 characters are, check characters 8 through 13.
>>> string = """>AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"""
>>> for i in string.split(">")[1:]:
...     if i[7:13] in ["aaabbb", "bbbaaa"]:
...         print(">" + i[:13])
...
>CD00192aaabbb
I have some code that also gives the positions.
Here's the simple version of this code:
import re
from collections import OrderedDict

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')

regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')
dic = OrderedDict()

# Finding the result
for mat in regx.finditer(ch):
    chunk, head = mat.groups()
    headstart = mat.start()
    dic[(headstart, head)] = [(headstart + six.start(), six.start(), six.group())
                              for six in rag.finditer(chunk)]

# Displaying the result
for (headstart, head), li in dic.items():
    print('{:>10} {}'.format(headstart, head))
    for x in li:
        print('{0[0]:>10} {0[1]:>6} {0[2]}'.format(x))
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
24 CD00192
31 8 aaabbb
41 18 bbbaaa
52 29 bbbaaa
62 39 aaabbb
69 ZP01990
95 27 aaabbb
136 SE45789
148 13 aaabbb
172 37 bbbaaa
The same code, in a functional manner, using generators:

import re
from collections import OrderedDict

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')

regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')

gen = ((mat.groups(), mat.start()) for mat in regx.finditer(ch))
dic = OrderedDict(((headstart, head),
                   [(headstart + six.start(), six.start(), six.group())
                    for six in rag.finditer(chunk)])
                  for (chunk, head), headstart in gen)

print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10} {0[1]:>6} {0[2]}'.format, li))
                for (headstart, head), li in dic.items()))
EDIT
I measured the execution's times.
For each code I measured the creation of the dictionary and the displaying separately.
The code using generators (the second one) is 7.4 times faster at displaying the result (0.020 seconds versus 0.148 seconds).
But, surprisingly to me, the code with generators takes 47% more time (0.000718 seconds versus 0.000489 seconds) to build the dictionary.
EDIT 2
Another way to do:
import re
from collections import OrderedDict

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')

regx = re.compile('((?<=>).{7})|(aaabbb|bbbaaa)')

def collect(ch):
    li = []
    dic = OrderedDict()
    gen = ((x.start(), x.group(1), x.group(2)) for x in regx.finditer(ch))
    for st, g1, g2 in gen:
        if g1:
            if li:
                dic[(stprec, g1prec)] = li
            li, stprec, g1prec = [], st, g1
        elif g2:
            li.append((st, g2))
    if li:
        dic[(stprec, g1prec)] = li
    return dic

dic = collect(ch)
print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10} {0[1]}'.format, li))
                for (headstart, head), li in dic.items()))
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
24 CD00192
31 aaabbb
41 bbbaaa
52 bbbaaa
62 aaabbb
69 ZP01990
95 aaabbb
136 SE45789
148 aaabbb
172 bbbaaa
This code computes dic in 0.00040 seconds and displays it in 0.0321 seconds.
EDIT 3
To answer your question: you have no other option than to keep each current value among 'CD00192', 'ZP01990', 'SE45789', etc. under a name. (I avoid saying "in a variable", because strictly speaking Python binds names to objects rather than having variables, but you can read "under a name" as "in a variable".)
And for that, you must use finditer()
Here's the code for this solution:
import re

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')

regx = re.compile('(>.{7})|(aaabbb|bbbaaa)')

matches = []
for mat in regx.finditer(ch):
    g1, g2 = mat.groups()
    if g1:
        head = g1
    else:
        matches.append((head, g2))

print(matches)
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>SE45789', 'aaabbb'), ('>SE45789', 'bbbaaa')]
My preceding code is more complicated because it also records the positions and gathers the 'aaabbb' and 'bbbaaa' values of each header ('CD00192', 'ZP01990', 'SE45789', etc.) into a list.
Zero or more characters can be matched using *, so a* would match "", "a", "aa", etc.; + matches one or more.
You will perhaps want to make the quantifier (+ or *) lazy by using +? or *? as well.
See regular-expressions.info for more details.
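A two-line illustration of greedy versus lazy matching:

```python
import re

print(re.search(r'<.+>', '<a><b>').group())   # <a><b>  (greedy: matches as much as possible)
print(re.search(r'<.+?>', '<a><b>').group())  # <a>     (lazy: stops at the first >)
```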
Try this:
>>> r1 = re.findall(r'(>.{7})[^>]*?(aaabbb)', s)
>>> r2 = re.findall(r'(>.{7})[^>]*?(bbbaaa)', s)
>>> r1 + r2
[('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'bbbaaa')]