replacing appointed characters in a string in txt file


Hello all. I want to extract the ‘DesignerXXX’ tokens from a text file with the following contents:
C DesignerTEE edBore 1 1/42006
Cylinder SingleVerticalB DesignerHHJ e 1 1/8Cooling 1
EngineBore 11/16 DesignerTDT 8Length 3Width 3
EngineCy DesignerHEE Inline2008Bore 1
Height 4TheChallen DesignerTET e 1Stroke 1P 305
Height 8C 606Wall15ccG DesignerQBG ccGasEngineJ 142
Height DesignerEQE C 60150ccGas2007
An idea is to use ‘Designer’ as a key and treat each line as two parts: before the key and after the key.
file_object = open('C:\\file.txt')
lines = file_object.readlines()
for line in lines:
    if 'Designer' in line:
        where = line.find('Designer')
        before = line[0:where]
        after = line[where:len(line)]
file_object.close()
In the ‘before the key’ part, I need to find the LAST space (‘ ’) and replace it with another symbol/character.
In the ‘after the key’ part, I need to find the FIRST space (‘ ’) and replace it with another symbol/character.
Then I can slice the line and pick out the wanted text according to the new symbols/characters.
Is there a better way to pick out the wanted text? If not, how can I replace just those specific spaces?
With the string replace function I can limit the number of replacements, but not choose exactly which occurrence gets replaced. How can I do that?
Thanks

Using regular expressions, it's a trivial task:
>>> s = '''C DesignerTEE edBore 1 1/42006
... Cylinder SingleVerticalB DesignerHHJ e 1 1/8Cooling 1
... EngineBore 11/16 DesignerTDT 8Length 3Width 3
... EngineCy DesignerHEE Inline2008Bore 1
... Height 4TheChallen DesignerTET e 1Stroke 1P 305
... Height 8C 606Wall15ccG DesignerQBG ccGasEngineJ 142
... Height DesignerEQE C 60150ccGas2007'''
>>> import re
>>> exp = 'Designer[A-Z]{3}'
>>> re.findall(exp, s)
['DesignerTEE', 'DesignerHHJ', 'DesignerTDT', 'DesignerHEE', 'DesignerTET', 'DesignerQBG', 'DesignerEQE']
The regular expression is Designer[A-Z]{3}, which means the letters Designer followed by exactly three characters, each a capital letter from A to Z.
It won't match Designer123 (digits are not letters). Note that for DesignerABCD (four letters) it still matches the first three, giving DesignerABC; append \b (a word boundary) to the pattern if longer runs should be rejected.
It also won't match Designerabc (abc are lowercase letters). To ignore case, you can pass the optional flag re.I as the third argument; but this will also match designerabc (you have to be very specific with regular expressions).
So, to match Designer followed by exactly three upper- or lowercase letters, change the expression to Designer[A-Za-z]{3}.
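As a rough check of how these character classes behave, the sketch below runs several variants of the pattern against a made-up sample string. The sample text and the variable names are illustrative assumptions, not data from the question. It also shows what a class written as [Aa-zZ] actually accepts.

```python
import re

# Hypothetical sample covering the cases discussed in this answer.
samples = "DesignerTEE Designerabc DesignerABCD Designer123 designerXYZ"

# Exactly three capitals; the trailing \b rejects a fourth letter.
strict = re.findall(r'Designer[A-Z]{3}\b', samples)
# Without \b, the first three letters of a longer run still match.
unbounded = re.findall(r'Designer[A-Z]{3}', samples)
# Three letters of either case; 'Designer' itself stays case-sensitive.
either_case = re.findall(r'Designer[A-Za-z]{3}\b', samples)
# re.I makes the whole pattern case-insensitive, including 'Designer'.
ignore_case = re.findall(r'Designer[A-Z]{3}\b', samples, re.I)
# [Aa-zZ] means 'A', 'a' through 'z', or 'Z', so most capitals are rejected.
odd_class = re.findall(r'Designer[Aa-zZ]{3}\b', samples)

print(strict, unbounded, either_case, ignore_case, odd_class, sep='\n')
```

The last pattern is the one to watch for: it looks like "any letter" but silently drops B through Y.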
If you want to search and replace, then you can use re.sub for substituting matches; so if I want to replace all matches with the word 'hello':
>>> x = re.sub(exp, 'hello', s)
>>> print(x)
C hello edBore 1 1/42006
Cylinder SingleVerticalB hello e 1 1/8Cooling 1
EngineBore 11/16 hello 8Length 3Width 3
EngineCy hello Inline2008Bore 1
Height 4TheChallen hello e 1Stroke 1P 305
Height 8C 606Wall15ccG hello ccGasEngineJ 142
Height hello C 60150ccGas2007
And what if there are characters both before and after 'Designer', and their number is not fixed? I tried
'[Aa-zZ]Designer[Aa-zZ]{0~9}', but it doesn't work.
For these things, regular expressions have special repetition characters (quantifiers). Briefly summarized:
When you want to say "one or more", use +
When you want to say "zero or more (there may be none)", use *
When you want to say "optional: at most once", use ?
You place the quantifier directly after the expression it should repeat. A bounded range is written {m,n}, so {0~9} is not valid syntax; it would be {0,9}.
For more on this, have a read through the documentation.
Your requirement is "there are characters, but their number is not fixed", so we have to use +.
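A minimal sketch of these quantifiers in use, assuming a hypothetical line in which 'Designer' is glued to letters on both sides (the line is invented for illustration and is not from the question's file):

```python
import re

# Hypothetical line: letters directly before and after the key.
line = "Height 4TheChallenDesignerTETe 1Stroke 1P 305"

# '+' requires at least one letter on each side of the key.
one_or_more = re.findall(r'[A-Za-z]+Designer[A-Za-z]+', line)

# {0,9} is the valid way to write "between 0 and 9 repetitions".
bounded = re.findall(r'[A-Za-z]{0,9}Designer[A-Za-z]{0,9}', line)

print(one_or_more)  # ['TheChallenDesignerTETe']
print(bounded)      # ['heChallenDesignerTETe'] (only 9 leading letters fit)
```

The bounded version drops the leading 'T' because ten letters precede the key and the quantifier allows at most nine.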

Try re.sub. The regular expression matches your keyword surrounded by whitespace. The replacement (the second argument of sub) swaps the surrounding whitespace for your_special_char (a hyphen in my script).
>>> import re
>>> with open('file.txt') as file_object:
...     your_special_char = '-'
...     for line in file_object:
...         formated_line = re.sub(r'(\s)(Designer[A-Z]{3})(\s)', r'%s\2%s' % (your_special_char, your_special_char), line)
...         print(formated_line)
...
C -DesignerTEE-edBore 1 1/42006
Cylinder SingleVerticalB-DesignerHHJ-e 1 1/8Cooling 1
EngineBore 11/16-DesignerTDT-8Length 3Width 3
EngineCy-DesignerHEE-Inline2008Bore 1
Height 4TheChallen-DesignerTET-e 1Stroke 1P 305
Height 8C 606Wall15ccG-DesignerQBG-ccGasEngineJ 142
Height-DesignerEQE-C 60150ccGas2007

Maroun Maroun asked 'Why not simply split the string?', so I guess one working way is:
file_object = open('C:\\file.txt')
lines = file_object.readlines()
b = []
for line in lines:
    a = line.split()
    for aa in a:
        b.append(aa)
for bb in b:
    if 'Designer' in bb:
        print(bb)
file_object.close()
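The same split-based idea can be sketched more compactly on an in-memory string. The three sample lines are taken from the question; using startswith instead of a substring test is an assumption about what counts as a wanted token.

```python
# A few lines from the question's file, held in memory for the example.
text = """C DesignerTEE edBore 1 1/42006
Cylinder SingleVerticalB DesignerHHJ e 1 1/8Cooling 1
Height DesignerEQE C 60150ccGas2007"""

# split() with no arguments breaks on any run of whitespace, newlines included.
designers = [token for token in text.split() if token.startswith('Designer')]

print(designers)  # ['DesignerTEE', 'DesignerHHJ', 'DesignerEQE']
```

This only works while the wanted token is separated from its neighbours by whitespace; otherwise a regular expression is the safer tool.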


python how to dynamically find a persons name in a string

I'm working on a project where I use speech-to-text input to determine who to call. Speech-to-text can give unexpected results, so I want some dynamic matching of the strings. I'm starting small and trying to match a single name. My name is Nick Vaes, and I want to match my name in the spoken text, including when the text contains something like Nik. Ideally I would like something that still matches if only one letter is wrong, so
Nick
ick
nik
nic
nck
would all match my name. The current simple code I have is:
def user_to_call(s):
    redirect = None
    if "NICK" in s.upper() or "NIK" in s.upper():
        redirect = "Nick"
    return redirect
For a 4-letter name it's possible to put all the possibilities in the filter, but for names with 12 letters that is overkill, and I'm pretty sure it can be done far more efficiently.

You need the Levenshtein distance (edit distance).
NLTK ships a Python implementation:
import nltk
nltk.edit_distance("humpty", "dumpty")
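If installing NLTK is not an option, the edit distance is short enough to write by hand. The sketch below is a plain dynamic-programming version; the function names and the max_distance parameter are illustrative assumptions, and it is not NLTK's implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def matches_name(spoken, name, max_distance=1):
    """True if spoken is within max_distance edits of name, ignoring case."""
    return edit_distance(spoken.lower(), name.lower()) <= max_distance


for word in ["Nick", "ick", "nik", "nic", "nck", "nickey"]:
    print(word, matches_name(word, "Nick"))
```

With max_distance=1 the first five spellings match and "nickey" (two extra letters) does not.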

What you basically need is fuzzy string matching, see:
https://en.wikipedia.org/wiki/Approximate_string_matching
https://www.datacamp.com/community/tutorials/fuzzy-string-python
Based on that, you can check how similar the input is to the entries in your dictionary:
from fuzzywuzzy import fuzz
name = "nick"
tomatch = ["Nick", "ick", "nik", "nic", "nck", "nickey", "njick", "nickk", "nickn"]
for candidate in tomatch:
    ratio = fuzz.ratio(candidate.lower(), name.lower())
    print(ratio)
This code will produce the following output:
100
86
86
86
86
80
89
89
89
You will have to experiment with different ratio thresholds to find one that tolerates exactly one wrong or missing letter.
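A comparable similarity score is available in the standard library through difflib.SequenceMatcher. The sketch below is an alternative under stated assumptions: the contact list, the 0.8 cutoff and the helper name best_match are all made up for illustration, and the scores are not guaranteed to equal fuzzywuzzy's.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(spoken, names, cutoff=0.8):
    """Return the most similar name, or None if nothing reaches the cutoff.

    Assumes names is a non-empty list.
    """
    score, name = max((similarity(spoken, n), n) for n in names)
    return name if score >= cutoff else None


contacts = ["Nick", "Anna", "Robert"]  # hypothetical contact list
print(best_match("nik", contacts))
print(best_match("zzz", contacts))
```

The cutoff plays the same role as the ratio threshold: raise it to be stricter, lower it to tolerate more errors.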

From what I understand, you are not looking at any fuzzy matching. (Because you did not upvote other responses).
If you are just trying to evaluate what you specified in your request, here is the code. I have put some additional conditions where I printed the appropriate message. Feel free to remove them.
def wordmatch(baseword, wordtoMatch, lengthOfMatch):
    lis_of_baseword = list(baseword.lower())
    lis_of_wordtoMatch = list(wordtoMatch.lower())
    sum = 0
    for index_i, i in enumerate(lis_of_wordtoMatch):
        for index_j, j in enumerate(lis_of_baseword):
            if i in lis_of_baseword:
                if i == j and index_i <= index_j:
                    sum = sum + 1
                    break
                else:
                    pass
            else:
                print("word to match has characters which are not in baseword")
                return 0
    if sum >= lengthOfMatch and len(wordtoMatch) <= len(baseword):
        return 1
    elif sum >= lengthOfMatch and len(wordtoMatch) > len(baseword):
        print("word to match has no of characters more than that of baseword")
        return 0
    else:
        return 0
base = "Nick"
tomatch = ["Nick", "ick", "nik", "nic", "nck", "nickey","njick","nickk","nickn"]
wordlength_match = 3 # how many letters of the base word must match; in your case, it's 3
for t_word in tomatch:
    print(wordmatch(base,t_word,wordlength_match))
the output looks like this
1
1
1
1
1
word to match has characters which are not in baseword
0
word to match has characters which are not in baseword
0
word to match has no of characters more than that of baseword
0
word to match has no of characters more than that of baseword
0
Let me know if this served your purpose.

How to extract set of substrings from a paragraph of string

Say I have a string:
output='[{ "id":"b678792277461" ,"Responses":{"SUCCESS":{"sh xyz":"sh xyz\\n Name Age Height Weight\\n Ana \\u003c15 \\u003e 163 47\\n 43\\n DEB \\u003c23 \\u003e 155 \\n Grey \\u003c53 \\u003e 143 54\\n 63\\n Sch#"},"FAILURE":{},"BLACKLISTED":{}}}]'
This is just an example; the real output, the response from an API call, is much longer.
I want to extract all the names (Ana, DEB, Grey) and put them in a separate list.
How can I do it?
json_data = json.loads(output)
json_data = [{'id': 'b678792277461', 'Responses': {'SUCCESS': {'sh xyz': 'sh xyz\n Name Age Height Weight\n Ana <15 > 163 47\n 43\n DEB <23 > 155 \n Grey <53 > 143 54\n 63\n Sch#'}, 'FAILURE': {}, 'BLACKLISTED': {}}}]
1) I have tried re.findall('\\n(.+)\\u',output)
but this didn't work because it says "incomplete sequence u"
2)
start = output.find('\\n')
end = output.find('\\u', start)
x=output[start:end]
But I couldn't figure out how to run this piece of code in loop to extract names
Thanks

\u is not a character you can match on its own. In the raw response text, \u003c is a JSON escape sequence, and in a regular expression \u must be followed by four hex digits, which is why re complains. After json.loads the escapes are decoded (\u003c becomes <), so it is easier to search the decoded string. The following regex works, but it is kind of quirky. It captures the first word of each line, except for the first line.
output = json_data[0]['Responses']['SUCCESS']['sh xyz']
pattern = r"\n\s*([a-z]+)\s+"
result = re.findall(pattern, output, re.M | re.I)
#['Name', 'Ana', 'DEB', 'Grey']
Explanation of the pattern:
start at a new line (\n)
skip all spaces, if any (\s*)
collect one or more letters ([a-z]+)
skip at least one space (\s+)
Unfortunately, "Name" is also recognized as a name. If you know that it is always present in the first line, slice the list of the results:
result[1:]
#['Ana', 'DEB', 'Grey']
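Put together as a self-contained sketch, using the response text from the question: decode the JSON first, then run the regex on the decoded table. The variable names table, first_words and names are illustrative.

```python
import json
import re

# The response text from the question; \\n and \\u003c are JSON escapes.
output = ('[{"id": "b678792277461", "Responses": {"SUCCESS": {"sh xyz": '
          '"sh xyz\\n Name Age Height Weight\\n Ana \\u003c15 \\u003e 163 47\\n 43\\n'
          ' DEB \\u003c23 \\u003e 155 \\n Grey \\u003c53 \\u003e 143 54\\n 63\\n Sch#"},'
          ' "FAILURE": {}, "BLACKLISTED": {}}}]')

# json.loads turns the escapes into real newlines and '<' / '>' characters.
table = json.loads(output)[0]['Responses']['SUCCESS']['sh xyz']

# First alphabetic word of every line after the first one.
first_words = re.findall(r'\n\s*([a-z]+)\s+', table, re.I)

# Drop the 'Name' column header, assuming it is always the first hit.
names = first_words[1:]
print(names)  # ['Ana', 'DEB', 'Grey']
```

Lines that start with a number (" 43", " 63") are skipped because the pattern requires a letter right after the leading whitespace.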

I use regexr.com to play around with a regular expression until I get it right, and then convert it into Python.
https://regexr.com/
I'm assuming the \n is the newline character here, and I'll bet your \u error is caused by a line break. To let a match span several lines in Python, compile with the re.DOTALL flag so that . also matches newlines (re.MULTILINE only changes how ^ and $ behave).
\n(.*)\n - with that flag this is greedy and grabs as much as possible (in the example it would grab everything from \nAna through 54\n)
[{ "id":"678792277461" ,"Responses": {Name Age Height Weight\n Ana \u00315 \u003163 47\n 43\n Deb \u00323 \u003155 60 \n Grey \u00353 \u003144 54\n }]
import re
a = re.compile(r"\n(.*)\n", re.DOTALL)
responses = a.search(source)  # search, not match: the text does not start with \n
if responses:
    match = responses.group(1).split("\n")
    # match[0] should be " Ana \u00315 \u003163 47"
    # the following entries hold the later lines (" 43", " Deb \u00323 \u003155 60 ", etc.)

Search in a string and obtain the 2 words before and after the match in Python

I'm using Python to search some words (also multi-token) in a description (string).
To do that I'm using a regex like this
result = re.search(word, description, re.IGNORECASE)
if result:
    print("Found: " + result.group())
But what I need is the two words before and the two words after the match. For example, if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the phrase that I am looking for. After matching it with my regex, I need the two words (if they exist) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and "horrible, this" are the words that I need.
ATTENTION
The description can be very long, and the pattern "here is" can appear multiple times.

How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']

Try this regex with re.findall and re.IGNORECASE set:
((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
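A minimal sketch of that regex applied to the example sentence from the question. The name contexts is illustrative, and splitting each captured group into words is an added step, not part of the regex itself.

```python
import re

pattern = re.compile(
    r'((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})',
    re.IGNORECASE,
)

line = 'Parking here is horrible, this shop sucks.'

# Each hit is a (before, after) pair of raw strings; split them into words.
contexts = [(before.split(), after.split())
            for before, after in pattern.findall(line)]

print(contexts)  # [(['Parking'], ['horrible,', 'this'])]
```

Because findall returns non-overlapping matches, two occurrences of "here is" that sit within two words of each other will not both be reported with full context.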

I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order

Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
import re
line = "Parking here is horrible, here is great here is mediocre here is here is "
print(line)
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
    while line:
        before, match, line = line.partition(pattern)
        if match:
            if not output:
                before = before.split()[-2:]
            else:
                before = ' '.join([pattern, before]).split()[-2:]
            after = line.split()[:2]
            output.append((before, after))
print(output)
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]

Regex findall command want to specify in most common name algorithm - Python 3

I am trying to write a Python program that searches an imported text file for occurrences of common words (names in this case) and then prints the three most common names in the file. Some names repeat, which makes their count higher or lower (more or less popular). The text file is simply a collection of names, each with an F or M on the same line to mark it as a female or male name. I have this code:
N=3
words = re.findall(r'\w+', data)
top_words_all = Counter(words).most_common(N)
for word, frequency in top_words_all:
    print("%s - %d" % (word, frequency))
(Note: 'data' is the contents of the text file.) It gives me a nice list of the three most frequent words, but the trouble is that the first and second most common are F and M, because each is counted as a separate word. How do I count each name together with the F or M a few spaces away? To give you an idea of what the text file looks like:
Drew M
Drew M
Drew M
Drew M
Steven M
Steven M
Sally F
Sally F
Not only that, but this code only prints a top-three list for all names (male or female). I would like two more lists: the most common male names and the most common female names. I am guessing that once the M/F is included in the word, I could just pick the words that contain "M" or "F" and display only those. Please help. I am new to coding, as you can tell, and need some assistance. Please explain your code if at all possible, so that I understand what the code or functions actually do.

You can use the regular expression \w+\s+(?:M|F). It matches a word, then one or more whitespace characters, then an M or F character. (?:...) is a non-capturing group.
Demo:
>>> import re
>>> from collections import Counter
>>> data = """
... Drew M
... Drew F
... Drew F
... Drew M
... Steven M
... Steven F
... Sally F
... Sally M
... """
>>> words = re.findall(r'\w+\s+(?:M|F)', data)
>>> top_words_all = Counter(words).most_common(3)
>>> for word, frequency in top_words_all:
...     print("%s - %d" % (word, frequency))
...
Drew M - 2
Drew F - 2
Steven F - 1
Hope that helps.
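For the separate male and female lists the question also asks about, one possible sketch is to capture the name and the gender letter as two groups and keep one Counter per gender. The sample data is the file excerpt from the question; the names pairs, by_gender, top_male, top_female and top_all are illustrative.

```python
import re
from collections import Counter

data = """Drew M
Drew M
Drew M
Drew M
Steven M
Steven M
Sally F
Sally F
"""

# One (name, gender) tuple per line; [ \t]+ keeps the match on a single line.
pairs = re.findall(r'^(\w+)[ \t]+([MF])$', data, re.MULTILINE)

by_gender = {'M': Counter(), 'F': Counter()}
for name, gender in pairs:
    by_gender[gender][name] += 1

top_male = by_gender['M'].most_common(3)
top_female = by_gender['F'].most_common(3)
top_all = Counter(pairs).most_common(3)  # counts each (name, gender) pair

print(top_male)    # [('Drew', 4), ('Steven', 2)]
print(top_female)  # [('Sally', 2)]
```

Counting (name, gender) tuples for the overall list keeps a name that is used for both genders as two separate entries.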

Regular expression help

I am trying to create a regex in Python 3 that matches 7 characters (eg. >AB0012) separated by an unknown number of characters then matching another 6 characters(eg. aaabbb or bbbaaa). My input string might look like this:
>AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
This is the regex that I have come up with:
matches = re.findall(r'(>.{7})(aaabbb|bbbaaa)', mystring)
print(matches)
The output I am trying to produce would look like this:
[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb')]
I read through the Python documentation, but I couldn't find how to match an unknown distance between two portions of a regex. Is there some sort of wildcard character that would allow me to complete my regex? Thanks in advance for the help!
EDIT:
If I use *? in my code like this:
mystring = str(input("Paste promoters here: "))
matches = re.findall(r'(>.{7})*?(aaabbb|bbbaaa)', mystring)
print(matches)
My output looks like this:
[('>CD00192', 'aaabbb'), ('', 'bbbaaa'), ('', 'aaabbb')]
The second and third items in the list are missing the >CD00192 and >ZP01990, respectively. How can I have the regex include these characters in the list?

Here's a non-regular-expression approach. Split on ">" (your data starts from the 2nd element onwards); then, since you don't care what the first 7 characters are, check the 8th through 13th characters.
>>> string=""" AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"""
>>> for i in string.split(">")[1:]:
...     if i[7:13] in ["aaabbb","bbbaaa"]:
...         print(">" + i[:13])
...
>CD00192aaabbb

I have code that also gives the positions.
Here's the simple version:
import re
from collections import OrderedDict
ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')
regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')
dic = OrderedDict()
# Finding the result
for mat in regx.finditer(ch):
    chunk, head = mat.groups()
    headstart = mat.start()
    dic[(headstart, head)] = [(headstart + six.start(), six.start(), six.group())
                              for six in rag.finditer(chunk)]
# Displaying the result
for (headstart, head), li in dic.items():
    print('{:>10} {}'.format(headstart, head))
    for x in li:
        print('{0[0]:>10} {0[1]:>6} {0[2]}'.format(x))
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
24 CD00192
31 8 aaabbb
41 18 bbbaaa
52 29 bbbaaa
62 39 aaabbb
69 ZP01990
95 27 aaabbb
136 SE45789
148 13 aaabbb
172 37 bbbaaa
The same code, in a functional manner, using generators:
import re
from collections import OrderedDict
ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')
regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')
gen = ((mat.groups(), mat.start()) for mat in regx.finditer(ch))
dic = OrderedDict(((headstart, head),
                   [(headstart + six.start(), six.start(), six.group())
                    for six in rag.finditer(chunk)])
                  for (chunk, head), headstart in gen)
print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10} {0[1]:>6} {0[2]}'.format, li))
                for (headstart, head), li in dic.items()))
EDIT
I measured the execution times.
For each version I measured the creation of the dictionary and the display separately.
The version using generators (the second) displays the result 7.4 times faster (0.020 seconds) than the other one (0.148 seconds).
But, surprisingly to me, the version with generators takes 47% more time (0.000718 seconds) than the other (0.000489 seconds) to compute the dictionary.
EDIT 2
Another way to do it:
import re
from collections import OrderedDict
ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')
regx = re.compile('((?<=>).{7})|(aaabbb|bbbaaa)')
def collect(ch):
    li = []
    dic = OrderedDict()
    gen = ((x.start(), x.group(1), x.group(2)) for x in regx.finditer(ch))
    for st, g1, g2 in gen:
        if g1:
            # A new header: store the matches collected for the previous one
            if li:
                dic[(stprec, g1prec)] = li
            li, stprec, g1prec = [], st, g1
        elif g2:
            li.append((st, g2))
    if li:
        dic[(stprec, g1prec)] = li
    return dic
dic = collect(ch)
print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10} {0[1]}'.format, li))
                for (headstart, head), li in dic.items()))
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
24 CD00192
31 aaabbb
41 bbbaaa
52 bbbaaa
62 aaabbb
69 ZP01990
95 aaabbb
136 SE45789
148 aaabbb
172 bbbaaa
This code computes dic in 0.00040 seconds and displays it in 0.0321 seconds.
EDIT 3
To answer your question: you have no option other than keeping the current header value ('CD00192', 'ZP01990', 'SE45789', etc.) under a name. (I don't like to say "in a variable" in Python, because there are no variables in Python, but you can read "under a name" as if I had written "in a variable".)
And for that, you must use finditer().
Here's the code for this solution:
import re
ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')
regx = re.compile('(>.{7})|(aaabbb|bbbaaa)')
matches = []
for mat in regx.finditer(ch):
    g1, g2 = mat.groups()
    if g1:
        head = g1
    else:
        matches.append((head, g2))
print(matches)
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>SE45789', 'aaabbb'), ('>SE45789', 'bbbaaa')]
My preceding snippets are more complicated because they also capture the positions and gather the 'aaabbb' and 'bbbaaa' values belonging to each header ('CD00192', 'ZP01990', 'SE45789', etc.) in a list.
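The finditer approach can be wrapped in a small function. This is a sketch: the function name, the guard for a motif that appears before any header, and the shortened sample string are assumptions added for illustration.

```python
import re


def pair_headers_with_motifs(text):
    """Pair each 'aaabbb'/'bbbaaa' motif with the most recent '>' header."""
    token = re.compile(r'(>.{7})|(aaabbb|bbbaaa)')
    pairs = []
    head = None  # no header seen yet
    for mat in token.finditer(text):
        header, motif = mat.groups()
        if header:
            head = header
        elif head is not None:  # ignore motifs that precede the first header
            pairs.append((head, motif))
    return pairs


# Shortened, hypothetical variant of the question's input.
sample = '>AB0012xxxxaaaa>CD00192aaabbbllyybbbaaayy>ZP0199000mmxxaaabbbaaaa'
print(pair_headers_with_motifs(sample))
# [('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb')]
```

A single alternation scanned left to right avoids the "unknown distance" problem entirely: the header is remembered in a name rather than captured by the same match as the motif.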

Zero or more repetitions of the preceding element are matched using *, so a* would match "", "a", "aa", etc., and .* matches any run of characters. + matches one or more repetitions.
You will perhaps want to make the quantifier (+ or *) lazy by using +? or *? as well.
See regular-expressions.info for more details.

Try this:
>>> r1 = re.findall(r'(>.{7})[^>]*?(aaabbb)', s)
>>> r2 = re.findall(r'(>.{7})[^>]*?(bbbaaa)', s)
>>> r1 + r2
[('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'bbbaaa')]
