parsing repeated lines of string based on initial characters [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Improve this question
I am working on lists and strings in python. I have following lines of string.
ID abcd
AC efg
RF hij
ID klmno
AC p
RF q
I want the output as :
abcd, efg, hij
klmno, p, q
This output is based on the the first two characters in the line. How can I achieve it in efficient way?
I'm looking to output the second part of the line for every entry between the ID tags.

I'm having a little trouble parsing the question, but according to my best guess, this should do what you're looking for:
all_data = " ".join([line for line in file]).split("ID")
return [", ".join([item.split(" ")[::2] for item in all_data])]
Basically what you're doing here is first just joining together all of your data (removing the newlines) then splitting on your keyphrase of "ID"
After that, if I'm correctly interpreting the question, you're looking to get the second value of each pair. These pairs are space delimited (as is everything in that item due to the " ".join in the first line), so we just step through that list grabbing every other item.
In general splits have a little more syntactic sugar than is usually used, and the full syntax is: [start:end:step], so [::2] just returns every other item.

You could use the following, which takes order into account so that transposing the dict's values makes more sense...
from collections import OrderedDict
items = OrderedDict()
with open('/home/jon/sample_data.txt') as fin:
lines = (line.strip().partition(' ')[::2] for line in fin)
for key, value in lines:
items.setdefault(key[0], []).append(value)
res = [', '.join(el) for el in zip(*items.values())]
# ['abcd, efg, hij', 'klmno, p, q']

Use a default dict:
from collections import defaultdict
result = defaultdict(list)
for line in lines:
split_line = line.split(' ')
result[split_line[0]].append(split_line[1])
This will give you a dictionary result that stores all the values that have the same key in an array. To get all the strings that were in a line that started with e.g. ID:
print result[ID]

Based on your answers in comments, this should work (if I understand what you're looking for):
data = None
for line in lines:
fields = line.split(2)
if fields[0] == "ID":
#New set of data
if data is not None:
#Output last set of data.
print ", ".join(data)
data = []
data.append(fields[1])
if data is not None:
#Output final data set
print ", ".join(data)
It's pretty straight forward, you're just collecting the second field in each line into data until you see that start of the next data set, at which point you output the previous data set.

I think using itertools.groupby is best for this kind of parsing (do something until next token X)
import itertools
class GroupbyHelper(object):
def __init__(self):
self.state = None
def __call__(self, row):
if self.state is None:
self.state = True
else:
if row[0] == 'ID':
self.state = not self.state
return self.state
# assuming you read data from 'stream'
for _, data in itertools.groupby((line.split() for line in stream), GroupbyHelper()):
print ','.join(c[1] for c in data)
output:
$ python groupby.py
abcd,efg,hij
klmno,p,q

It looks like you would like to sub group your data, when ever 'ID' is present as your key. Groupby solution could work wonder here, if you know how to group your data. Here is one such implementation that might work for you
>>> data=[e.split() for e in data.splitlines()]
>>> def new_key(key):
toggle = [0,1]
def helper(e):
if e[0] == key:
toggle[:] = toggle[::-1]
return toggle[0]
return helper
>>> from itertools import groupby
>>> for k,v in groupby(data, key = new_key('ID')):
for e in v:
print e[-1],
print
abcd efg hij
klmno p q

If lines is equal to
['ID abcd', 'AC efg', 'RF hij']
then
[line.split()[1] for line in lines]
Edit: Added everything below after down votes
I am not sure why this was down voted. I thought that code was the simplest way to get started with the information provided at the time. Perhaps this is a better explanation of what I thought/think the data was/is?
if input is a list of strings in repeated sequence, called alllines;
alllines = [ #a list of repeated lines of string based on initial characters
'ID abcd',
'AC efg',
'RF hij',
'ID klmno',
'AC p',
'RF q'
]
then code is;
[[line.split()[1] for line in lines] for lines in [[alllines.pop(0) \
for i in range(3)] for o in range(len(alllines)/3)]]
This basically says, create a sublist of three split [1] strings from the whole list of all strings for every three strings in the whole list.
and output is;
[[
'abcd', 'efg', 'hij'
], [
'klmno', 'p', 'q'
]]
Edit: 8-6-13 This is an even better one without pop();
zip(*[iter([line.split()[1] for line in alllines])]*3)
with a slightly different output
[(
'abcd', 'efg', 'hij'
), (
'klmno', 'p', 'q'
)]

Related

How to extract values from a csv splitting in the correct place (no imports)?

How can I read a csv file without using any external import (e.g. csv or pandas) and turn it into a list of lists? Here's the code I worked out so far:
m = []
for line in myfile:
m.append(line.split(','))
Using this for loop works pretty fine, but if in the csv I get a ',' is in one of the fields it breaks wrongly the line there.
So, for example, if one of the lines I have in the csv is:
12,"This is a single entry, even if there's a coma",0.23
The relative element of the list is the following:
['12', '"This is a single entry', 'even if there is a coma"','0.23\n']
While I would like to obtain:
['12', '"This is a single entry, even if there is a coma"','0.23']
I would avoid trying to use a regular expression, but you would need to process the text a character at a time to determine where the quote characters are. Also normally the quote characters are not included as part of a field.
A quick example approach would be the following:
def split_row(row, quote_char='"', delim=','):
in_quote = False
fields = []
field = []
for c in row:
if c == quote_char:
in_quote = not in_quote
elif c == delim:
if in_quote:
field.append(c)
else:
fields.append(''.join(field))
field = []
else:
field.append(c)
if field:
fields.append(''.join(field))
return fields
fields = split_row('''12,"This is a single entry, even if there's a coma",0.23''')
print(len(fields), fields)
Which would display:
3 ['12', "This is a single entry, even if there's a coma", '0.23']
The CSV library though does a far better job of this. This script does not handle any special cases above your test string.
Here is my go at it:
line ='12, "This is a single entry, more bits in here ,even if there is a coma",0.23 , 12, "This is a single entry, even if there is a coma", 0.23\n'
line_split = line.replace('\n', '').split(',')
quote_loc = [idx for idx, l in enumerate(line_split) if '"' in l]
quote_loc.reverse()
assert len(quote_loc) % 2 == 0, "value was odd, should be even"
for m, n in zip(quote_loc[::2], quote_loc[1::2]):
line_split[n] = ','.join(line_split[n:m+1])
del line_split[n+1:m+1]
print(line_split)

How do i print this list in a readable form?

I have written a short python script to search for urls with a http status code in a logfile. The script works as intended and counts how often an url is requested in combination with a certain http status code. The dictionary with the results is unsorted. Thats why i sorted the data afterwards using the values in the dictionary. This part of the script works as intended and i get a sorted list with the urls and the counter, The list looks like:
([('http://example1.com"', 1), ('http://example2.com"', 5), ('http://example3.com"', 10)])
I just want to make it better readable and print the list in rows.
http://example1.com 1
http://example2.com 5
http://example3.com 10
I started with python only two weeks ago and i cant find a solution. I tried several solutions i found here on stackoverflow but nothing works. My current solution prints all urls in seperate rows but does not show the count. I cant use comma as a seperator because i got some url with commas in my logfile. Im sorry for my bad english and the stupid question. Thank you in advance.
from operator import itemgetter
from collections import OrderedDict
d=dict()
with open("access.log", "r") as f:
for line in f:
line_split = line.split()
list = line_split[5], line_split[8]
url=line_split[8]
string='407'
if string in line_split[5]:
if url in d:
d[url]+=1
else:
d[url]=1
sorted_d = OrderedDict(sorted(d.items(), key=itemgetter(1)))
for element in sorted_d:
parts=element.split(') ')
print(parts)
for url, count in sorted_d.items():
print(f'{url} {count}')
Replace your last for loop with the above.
To explain: we unpack the url, count pairs in sorted_d in the for loop, and then use the an f-string to print the url and count separated by a space.
First if you're already importing from the collections library, why not import a Counter?
from collections import Counter
d=Counter()
with open("access.log", "r") as f:
for line in f:
line_split = line.split()
list = line_split[5], line_split[8]
url=line_split[8]
string='407'
if string in line_split[5]:
d[url] += 1
for key, value in d.most_common(): # or reversed(d.most_common())
print(f'{key} {value}')
There are many good tutorials on how to format strings in Python such as this
Here an example code how to print a dictionary. I set the width of the columns with the variables c1 and c2.
c1 = 34; c2 = 10
printstr = '\n|%s|%s|' % ('-'*c1, '-'*c2)
for key in sorted(d.keys()):
val_str = str(d[key])
printstr += '\n|%s|%s|' % (str(key).ljust(c1), val_str.rjust(c2))
printstr += '\n|%s|%s|\n\n' % ('-' * c1, '-' * c2)
print(printstr)
The string function ljust() creates a string of the length passed as an argument where the content of the string is left justified.

In Python, how to match a string to a dictionary item (like 'Bra*')

I'm a complete novice at Python so please excuse me for asking something stupid.
From a textfile a dictionary is made to be used as a pass/block filter.
The textfile contains addresses and either a block or allow like "002029568,allow" or "0011*,allow" (without the quotes).
The search-input is a string with a complete code like "001180000".
How can I evaluate if the search-item is in the dictionary and make it match the "0011*,allow" line?
Thank you very much for your efford!
The filter-dictionary is made with:
def loadFilterDict(filename):
global filterDict
try:
with open(filename, "r") as text_file:
lines = text_file.readlines()
for s in lines:
fields = s.split(',')
if len(fields) == 2:
filterDict[fields[0]] = fields[1].strip()
text_file.close()
except:
pass
Check if the code (ccode) is in the dictionary:
if ccode in filterDict:
if filterDict[ccode] in ['block']:
continue
else:
if filterstat in ['block']:
continue
The filters-file is like:
002029568,allow
000923993,allow
0011*, allow
If you can use re, you don't have to worry about the wildcard but let re.match do the hard work for you:
# Rules input (this could also be read from file)
lines = """002029568,allow
0011*,allow
001180001,block
"""
# Parse rules from string
rules = []
for line in lines.split("\n"):
line = line.strip()
if not line:
continue
identifier, ruling = line.split(",")
rules += [(identifier, ruling)]
# Get rulings for specific number
def rule(number):
from re import match
rulings = []
for identifier, ruling in rules:
# Replace wildcard with regex .*
identifier = identifier.replace("*", ".*")
if match(identifier, number):
rulings += [ruling]
return rulings
print(rule("001180000"))
print(rule("001180001"))
Which prints:
['allow']
['allow', 'block']
The function will return a list of rulings. Their order is the same order as they appear in your config lines. So you could easily just pick the last or first ruling whichever is the one you're interested in.
Or break the loop prematurely if you can assume that no two rulings will interfere.
Examples:
001180000 is matched by 0011*,allow only, so the only ruling which applies is allow.
001180001 is matched by 0011*,allow at first, so you'll get allow as before. However, it is also matched by 001180001,block, so a block will get added to the rulings, too.
If the wildcard entries in the file have a fixed length (for example, you only need to support lines like 0011*,allow and not 00110*,allow or 0*,allow or any other arbitrary number of digits followed by *) you can use a nested dictionary, where the outer keys are the known parts of the wildcarded entries.
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
Then when you parse the file and get to the line 0011*,allow you do not need to do any matching. All you have to do is check if '0011' is present. Crude example:
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
if prefix in d:
# there is a "match", then you can deal with all the entries that match,
# in this case the items in the inner dictionary
# {'001180000': 'value', '001180001': 'value'}
print('match')
else:
print('no match')
If you do need to support arbitrary lengths of wildcarded entries, you will have to resort to a loop iterating over the dictionary (and therefore beating the point of using a dictionary to begin with):
d = {'001180000': 'value', '001180001': 'value'}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
for k, v in d.items():
if k.startswith(prefix):
# found matching key-value pair
print(k, v)

Create a code in python to get the most frequent tag and value pair from a list

I have a .txt file with 3 columns: word position, word and tag (NN, VB, JJ, etc.).
Example of txt file:
1 i PRP
2 want VBP
3 to TO
4 go VB
I want to find the frequency of the word and tag as a pair in the list in order to find the most frequently assigned tag to a word.
Example of Results:
3 (food, NN), 2 (Brave, ADJ)
My idea is to start by opening the file from the folder, read the file line by line and split, set a counter using dictionary and print with the most common to uncommon in descending order.
My code is extremely rough (I'm almost embarrassed to post it):
file=open("/Users/Desktop/Folder1/trained.txt")
wordcount={}
for word in file.read().split():
from collections import Counter
c = Counter()
for d in dicts.values():
c += Counter(d)
print(c.most_common())
file.close()
Obviously, i'm getting no results. Anything will help. Thanks.
UPDATE:
so i got this code posted on here which worked, but my results are kinda funky. here's the code (the author removed it so i don't know who to credit):
file=open("/Users/Desktop/Folder1/trained.txt").read().split('\n')
d = {}
for i in file:
if i[1:] in d.keys():
d[i[1:]] += 1
else:
d[i[1:]] = 1
print (sorted(d.items(), key=lambda x: x[1], reverse=True))
here are my results:
[('', 15866), ('\t.\t.', 9479), ('\ti\tPRP', 7234), ('\tto\tTO', 4329), ('\tlike\tVB', 2533), ('\tabout\tIN', 2518), ('\tthe\tDT', 2389), ('\tfood\tNN', 2092), ('\ta\tDT', 2053), ('\tme\tPRP', 1870), ('\twant\tVBP', 1713), ('\twould\tMD', 1507), ('0\t.\t.', 1427), ('\teat\tVB', 1390), ('\trestaurant\tNN', 1371), ('\tuh\tUH', 1356), ('1\t.\t.', 1265), ('\ton\tIN', 1237), ("\t'd\tMD", 1221), ('\tyou\tPRP', 1145), ('\thave\tVB', 1127), ('\tis\tVBZ', 1098), ('\ttell\tVB', 1030), ('\tfor\tIN', 987), ('\tdollars\tNNS', 959), ('\tdo\tVBP', 956), ('\tgo\tVB', 931), ('2\t.\t.', 912), ('\trestaurants\tNNS', 899),
there seem to be a mix of good results with words and other results with space or random numbers, anyone know a way to remove what aren't real words? also, i know \t is supposed to signify a tab, is there a way to remove that as well? you guys really helped a lot
You need to have a separate collections.Counter for each word. This code uses defaultdict to create a dictionary of counters, without checking every word to see if it is known.
from collections import Counter, defaultdict
counts = defaultdict(Counter)
for row in file: # read one line into `row`
if not row.strip():
continue # ignore empty lines
pos, word, tag = row.split()
counts[word.lower()][tag] += 1
That's it, you can now check the most common tag of any word:
print(counts["food"].most_common(1))
# Prints [("NN", 3)] or whatever
If you don't mind using pandas which is a great library for tabular data I would do the following:
import pandas as pd
df = pd.read_csv("/Users/Desktop/Folder1/trained.txt", sep=" ", header=None, names=["position", "word", "tag"])
df["word_tag_counts"] = df.groupby(["word", "tag"]).transform("count")
Then if you only want the maximum one from each group you can do:
df.groupby(["word", "tag"]).max()["word_tag_counts"]
which should give you a table with the values you want

Trying to split a txt file into multiple variables

So I'm making a program where it reads a text file and I need to separate all the info into their own variables. It looks like this:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD
YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ
DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT
QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN
YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE
QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN
KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS
SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT
TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV
STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN
The code after the > is a title, the next bit that looks like this "A.41,52" are numbered positions in the sequence I need to save to use, and everything after that is an amino acid sequence. I know how to deal with the amino acid sequence, I just need to know how to separate the important numbers in the first line.
In the past when I just had a title and sequence I did something like this:
for line in nucfile:
if line.startswith(">"):
headerline=line.strip("\n")[1:]
else:
nucseq+=line.strip("\n")
Am I on the right track here? This is my first time, any advice would be fantastic and thanks for reading :)
I suggest you use the split() method.
split() allows you to specify the separator of your choice. Provided the sequence title (here 1EK9) is always separated from the rest of the sequence by a colon, you could first pass ":" as your separator. You could then split the remainder of the sequence to recover the numbered positions (e.g. A.41,52) using ";" as a separator.
I hope this helps!
I think what you are trying to do is extract certain parts of the sequence based on their identifiers given to you on the first line (the line starting with >).
This line contains your title, then a sequence name and the data range you need to extract.
Try this:
sequence_pairs = {}
with open('somefile.txt') as f:
header_line = next(f)
sequence = f.read()
title,components = header_line.split(':')
pairs = components.split(';')
for pair in pairs:
start,end = pair[2:-1].split(',')
sequence_pars[pair[:1]] = sequence[start:int(end)+1]
for sequence,data in sequence_pairs.iteritems():
print('{} - {}'.format(sequence, data))
As the other answer may be very good to tackle the assumed problem in it's entirety - but the OP has requested for pointers or an example of the tpyical split-unsplit transform which is often so successful I hereby provide some ideas and working code to show this (based on the example of the question).
So let us focus on the else branch below:
from __future__ import print_function
nuc_seq = [] # a list
title_token = '>'
with open('some_file_of_a_kind.txt', 'rt') as f:
for line in f.readlines():
s_line = line.strip() # this strips whitespace
if line.startswith(title_token):
headerline = line.strip("\n")[1:]
else:
nuc_seq.append(s_line) # build list
# now nuc_seq is a list of strings like:
# ['ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD',
# 'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ',
# ...
# ]
demo_nuc_str = ''.join(nuc_seq)
# now:
# demo_nuc_str == 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGADYTYSNGYR ...'
That is fast and widely deployed paradigm in Python programming (and programming with powerful datatypes in general).
If the split-unsplit ( a.k.a. join) method is still unclear, just ask or try to sear SO on excellent answers to related questions.
Also note, that there is no need to line.strip('\n') as \nis considered whitespace like ' ' (string with a space only) or a tabulator '\t', sample:
>>> a = ' \t \n '
>>> '+'.join(a.split())
''
So the "joining character" only appears, if there are at least two element sto join and in this case, strip removed all whits space and left us with the empty string.
Upate:
As requested a further analysis of the "coordinate part" in the line called headline of the question:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
If you want to retrieve the:
A.41,52; B.61,74; C.247,257; D.279,289
and assume you have (as above the complete line in headline string):
title, coordinate_string = headline.split(':')
# so now title is '1EK9' and
# coordinates == 'A.41,52; B.61,74; C.247,257; D.279,289'
Now split on the semi colons, trim the entries:
het_seq = [z.strip() for z in coordinates.split(';')]
# now het_seq == ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
If 'a', 'B', 'C', and 'D' are well known dimensions, than you can "lose" the ordering info from input file (as you could always reinforce what you already know ;-) and might map the coordinats as key: (ordered coordinate-pair):
>>> coord_map = dict(
(a, tuple(int(k) for k in bc.split(',')))
for a, bc in (abc.split('.') for abc in het_seq))
>>> coord_map
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
In context of a micro program:
#! /usr/bin/enc python
from __future__ import print_function
het_seq = ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
coord_map = dict(
(a, tuple(int(k) for k in bc.split(',')))
for a, bc in (abc.split('.') for abc in het_seq))
print(coord_map)
yields:
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
Here one might write this explicit a nested for loop but it is a late european evening so trick is to read it from right:
for all elements of het_seq
split on the dot and store left in a and right in b
than further split the bc into a sequence of k's, convert to integer and put into tuple (ordered pair of integer coordinates)
arrived on the left you build a tuple of the a ("The dimension like 'A' and the coordinate tuple from 3.
In the end call the dict() function that constructs a dictionary using here the form dict(key_1, value_1, hey_2, value_2, ...) which gives {key_1: value1, ...}
So all coordinates are integers, stored ordered pairs as tuples.
I'ld prefer tuples here, although split() generates lists, because
You will keep those two coordinates not extend or append that pair
In python mapping and remapping is often performed and there a hashable (that is immutable type) is ready to become a key in a dict.
One last variant (with no knoted comprehensions):
coord_map = {}
for abc in het_seq:
a, bc = abc.split('.')
coord_map[a] = tuple(int(k) for k in bc.split(','))
print(coord_map)
The first four lines produce the same as above minor obnoxious "one liner" (that already had been written on three lines kept together within parentheses).
HTH.
So I'm assuming you are trying to process a Fasta like file and so the way I would do it is to first get the header and separate the pieces with Regex. Following that you can store the A:42.52 B... in a list for easy access. The code is as follows.
import re
def processHeader(line):
positions = re.search(r':(.*)', line).group(1)
positions = positions.split('; ')
return positions
dnaSeq = ''
positions = []
with open('myFasta', 'r') as infile:
for line in infile:
if '>' in line:
positions = processHeader(line)
else:
dnaSeq += line.strip()
I am not sure I completely understand the goal (and I think this post is more suitable for a comment, but I do not have enough privileges) but I think that the key to you solution is using .split(). You can then join the elements of the resulting list just by using + similar to this:
>>> result = line.split(' ')
>>> result
['1EK9:A.41,52;', 'B.61,74;', 'C.247,257;', 'D.279,289', 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD', 'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN',
'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
>>> result[3]+result[4]
'D.279,289ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD'
>>>
etc. You can also use the usual following syntax to extract the elements of the list that you need:
>>> result[5:]
['YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN', 'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
and join them together:
>>> reduce(lambda x, y: x+y, result[5:])
'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQDVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTTQRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGNYYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLAREQIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQNKVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRSSFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDATTTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPVSTNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN'
remember that + on lists produces a list.
By the way I would not remove '\n' to start with as you may try to use it to extract the first line similar to the above with using space to extract "words".
UPDATE (starting from result):
#getting A indexes
letter_seq=result[5:]
ind=result[:4]
Aind=ind[0].split('.')[1].replace(';', '')
#getting one long letter seq
long_letter_seq=reduce(lambda x, y: x+y, letter_seq)
#extracting the final seq fromlong_letter_seq using Aind
output = long_letter_seq[int(Aind.split(',')[0]):int(Aind.split(',')[1])]
the last line is just a union of several operations that were also used earlier.
Same for B C D etc -- so a lot of manual work and calculations...
BE CAREFUL with indexes of A -- numbering in python starts from 0 which may not be the case in your numbering system.
The more elegant solution would be using re (https://docs.python.org/2/library/re.html) to find pettern using a mask, but this requires very well defined rules for how to look up the sequence needed.
UPDATE2: it is also not clear to me what is the role of spaces -- so far I removed them but they may matter when counting the letters in the original string.

Categories