I have a .tsv file which I have attached along with this post. It has rows (cells) in the order A1, A2, A3, ..., A12, B1, ..., B12, ..., H1, ..., H12. I need to re-arrange this into the order A1, B1, C1, D1, ..., H1, A2, B2, C2, ..., H2, ..., A12, B12, C12, ..., H12.
I need to do this using Python.
I have another .tsv file, called flipped.tsv, against which I can compare this file. The flipped.tsv file contains the correct well values for each cell. In other words, I must map the well values to their correct cell lines.
From what I understand, the cell lines of the metadata are incorrectly arranged in column-major order, whereas they have to be arranged in row-major order, as in the flipped.tsv file.
For example :
"A2 of flipped_metadata.tsv has the same well values as that of B1 of metadata.tsv."
What is the logic that I can carry out to perform this in Python?
First .tsv file
flipped.tsv file
You could do the following:
import csv

# Read the original file
with open("file.tsv", "r") as file:
    rows = list(csv.reader(file, delimiter="\t"))

# Key function for sorting
def key_func(row):
    """Transform a row into a sort key, e.g. ['A7', 1, 2] -> (7, 'A')."""
    return int(row[0][1:]), row[0][0]

# Write the `flipped` file
with open("file_flipped.tsv", "w", newline="") as file:
    csv.writer(file, delimiter="\t").writerows(
        row[:1] + flipped[1:]
        for row, flipped in zip(rows, sorted(rows, key=key_func))
    )
The flipping is done by sorting the original rows by
first the integer part of each row's first entry, int(row[0][1:]), and
then the character part of that entry, row[0][0].
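For instance, with a few toy rows (the values are made up), the sort groups by the number part first:
>>> rows = [['A1', 10], ['A2', 20], ['B1', 30], ['B2', 40]]
>>> sorted(rows, key=key_func)
[['A1', 10], ['B1', 30], ['A2', 20], ['B2', 40]]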
If the effect of the sorting isn't obvious, take a look at the result of the same operation, just without the relabelling of the first column:
with open("file_flipped.tsv", "w") as file:
csv.writer(file, delimiter="\t").writerows(
sorted(rows, key=key_func)
)
Output:
A1 26403 23273
B1 27792 8805
C1 5668 19510
...
F12 100 28583
G12 18707 14889
H12 13544 7447
The blocks are built based on the number part first, and within each block the lines run through the sorted characters.
This only works as long as the non-number part always has exactly one character.
If the non-number part always has exactly two characters, then the return of the key function has to be adjusted to int(row[0][2:]), row[0][:2], etc.
If there's more variability allowed, e.g. between 1 and 5 characters, then a regex approach would be more appropriate:
import re

re_key = re.compile(r"([a-zA-Z]+)(\d+)")

def key_func(row):
    """Transform a row into a sort key, e.g. ['Aa7', 10, 20] -> (7, 2, 'Aa')."""
    word, number = re_key.match(row[0]).group(1, 2)
    return int(number), len(word), word
And, depending on how the words have to be sorted, it might be necessary to include the length of the word in the sort key: Python sorts ['B', 'AA', 'A'] naturally into ['A', 'AA', 'B'] and not ['A', 'B', 'AA']. The addition of the length, as in the function above, achieves the latter.
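A quick REPL check illustrates the difference:
>>> sorted(['B', 'AA', 'A'])
['A', 'AA', 'B']
>>> sorted(['B', 'AA', 'A'], key=lambda w: (len(w), w))
['A', 'B', 'AA']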
Related
I have a large number of files that are named according to gradually more specific criteria.
Each part of the filename, separated by '_', relates to a drilled-down categorization of that file.
The naming convention looks like this:
TEAM_STRATEGY_ATTRIBUTION_TIMEFRAME_DATE_FILEVIEW
What I am trying to do is iterate through all these files and then pull out a list of how many different occurrences of each naming convention exist.
So essentially this is what I've done so far: I iterated through all the files and made a list of each name. I then separated each name by '_' and appended each part to its respective category list.
Now I'm trying to export them to a CSV file separated by columns, and this is where I'm running into problems:
L = [teams, strategies, attributions, time_frames, dates, file_types]
columns = zip(*L)
list(columns)

with open(_outputfolder_, 'w') as f:
    writer = csv.writer(f)
    for column in columns:
        print(column)
This is a rough estimation of the list I'm getting out:
[{'TEAM1'},
{'STRATEGY1', 'STRATEGY2', 'STRATEGY3', 'STRATEGY4', 'STRATEGY5', 'STRATEGY6', 'STRATEGY7', 'STRATEGY8', 'STRATEGY9', 'STRATEGY10','STRATEGY11', 'STRATEGY12', 'STRATEGY13', 'STRATEGY14', 'STRATEGY15'},
{'ATTRIBUTION1','ATTRIBUTION1','Attribution3','Attribution4','Attribution5', 'Attribution6', 'Attribution7', 'Attribution8', 'Attribution9', 'Attribution10'},
{'TIME_FRAME1', 'TIME_FRAME2', 'TIME_FRAME3', 'TIME_FRAME4', 'TIME_FRAME5', 'TIME_FRAME6', 'TIME_FRAME7'},
{'DATE1'},
{'FILE_TYPE1', 'FILE_TYPE2'}]
What I want the final result to look like is something like:
Team1 STRATEGY1 ATTRIBUTION1 TIME_FRAME1 DATE1 FILE_TYPE1
STRATEGY2 ATTRIBUTION2 TIME_FRAME2 FILE_TYPE2
... ... ...
etc. etc. etc.
But only the first line actually gets stored in the CSV file.
Can anyone help me understand how to iterate past just the first line? I'm sure this is happening because the Team type has only one option, but I don't want that to hinder it.
As the referenced answer says, you have to transpose the result and use it.
Refer to the post below:
Python - Transposing a list (rows with different length) using numpy fails.
I have used natural sorting to sort the integers and padded the lists with blanks to get the expected outcome.
Natural sorting is slower for larger lists;
you can also use third-party libraries, see:
Does Python have a built in function for string natural sort?
import csv
import re

def natural_sort(l):
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key=alphanum_key)

# `columns` is the list of category collections (L in the question).
# One result row per entry of the longest category:
res = [[] for _ in range(max(len(sl) for sl in columns))]
count = 0
for sl in columns:
    sorted_sl = natural_sort(sl)
    for x, res_sl in zip(sorted_sl, res):
        res_sl.append(x)
for result in res:
    if count > 0:
        result.insert(0, '')  # blank team cell for every row after the first
    count = count + 1

with open("test.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(res)
The columns should be converted into lists before printing to the CSV file.
The writerows method can be leveraged to write multiple rows at once.
https://docs.python.org/2/library/csv.html -- you can find more information here.
TEAM1,STRATEGY1,ATTRIBUTION1,TIME_FRAME1,DATE1,FILE_TYPE1
,STRATEGY2,Attribution3,TIME_FRAME2,FILE_TYPE2
,STRATEGY3,Attribution4,TIME_FRAME3
,STRATEGY4,Attribution5,TIME_FRAME4
,STRATEGY5,Attribution6,TIME_FRAME5
,STRATEGY6,Attribution7,TIME_FRAME6
,STRATEGY7,Attribution8,TIME_FRAME7
,STRATEGY8,Attribution9
,STRATEGY9,Attribution10
,STRATEGY10
,STRATEGY11
,STRATEGY12
,STRATEGY13
,STRATEGY14
,STRATEGY15
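As a side note, itertools.zip_longest (izip_longest in Python 2) does the transpose-and-pad in one step, if trailing blanks in the short columns are acceptable. A minimal sketch, assuming the list of category collections L from the question and the natural_sort function above:

import csv
from itertools import zip_longest

sorted_columns = [natural_sort(list(s)) for s in L]  # sort each category
rows = zip_longest(*sorted_columns, fillvalue='')    # transpose, pad with blanks

with open("test.csv", 'w', newline='') as f:
    csv.writer(f).writerows(rows)

Note that this pads every short column with '', so the second row would come out as ",STRATEGY2,Attribution3,TIME_FRAME2,,FILE_TYPE2" with an empty date cell rather than a shifted one.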
I have a file that looks like this
a:1
a:2
a:3
b:1
b:2
b:2
and I would like it to take the a and b portions of the file and put them in the first row, with their numbers below, like this:
a b
1 1
2 2
3 3
can this be done?
A CSV (Comma Separated Values) file should have commas in it, so the output should have commas instead of space separators.
I recommend writing your code in two parts: The first part should read the input; the second should write out the output.
If your input looks like this:
a:1
a:2
a:3
b:1
b:2
b:2
c:7
you can read in the input like this:
#!/usr/bin/env python3
# Usage: python3 script.py < input.txt > output.csv

import sys

# Loop through all the input lines and put the values in
# a list according to their category:
categoryList = {}  # key => category, value => list of values
for line in sys.stdin.readlines():
    line = line.rstrip('\n')
    category, value = line.split(':')
    if category not in categoryList:
        categoryList[category] = []
    categoryList[category].append(value)

# print(categoryList)  # debug line
# Debug line prints: {'a': ['1', '2', '3'], 'b': ['1', '2', '2']}
This will read in all your data into a categoryList dict. It's a dict that contains the categories (the letters) as keys, and contains lists (of numbers) as the values. Once you have all the data held in that dict, you can output it.
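As a side note, collections.defaultdict condenses the "create the list if missing" step; an equivalent sketch of the reading loop:

from collections import defaultdict
import sys

categoryList = defaultdict(list)  # missing keys start out as empty lists
for line in sys.stdin.readlines():
    category, value = line.rstrip('\n').split(':')
    categoryList[category].append(value)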
Outputting involves getting a list of categories (the letters, in your example case) so that they can be written out first as your header:
# Get the list of categories:
categories = sorted(categoryList.keys())
assert categories, 'No categories found!' # sanity check
From here, you can use Python's nice csv module to output the header and then the rest of the lines. When outputting the main data, we can use an outer loop to loop through the nth entries of each category, then we can use an inner loop to loop through every category:
import csv

csvWriter = csv.writer(sys.stdout)

# Output the categories as the CSV header:
csvWriter.writerow(categories)

# Now output the values we just gathered as
# Comma Separated Values:
i = 0  # the index into an individual category list
while True:
    values = []
    for category in categories:
        try:
            values.append(categoryList[category][i])
        except IndexError:
            values.append('')  # no value, so use an empty string
    if len(''.join(values)) == 0:
        break  # we've run out of categories that contain input
    csvWriter.writerow(values)
    i += 1  # increment index for the next time through the loop
If you don't want to use Python's csv module, you will still need to figure out how to group the entries in the category together. And if all you have is simple output (where none of the entries contain quotes, newlines, or commas), you can get away with manually writing out the output.
You could use something like this to output your values:
# Output the categories as the CSV header:
print(','.join(categories))

# Now output the values we just gathered as
# Comma Separated Values:
i = 0  # the index into an individual category list
while True:
    values = []
    for category in categories:
        try:
            values.append(categoryList[category][i])
        except IndexError:
            values.append('')  # no value, so use an empty string
    if len(''.join(values)) == 0:
        break  # we've run out of categories that contain input
    print(','.join(values))
    i += 1  # increment index for the next time through the loop
This will print out:
a,b,c
1,1,7
2,2,
3,2,
It does this by looping through all the list entries (the outer loop), and then looping through all the categories (the inner loop), and then printing out the values joined together by commas.
If you don't want the commas in your output, then you're technically not looking for CSV (Comma Separated Values) output. Still, in that case, it should be easy to modify the code to get what you want.
But if you have more complicated output (that is, values that have quotes, commas, and newlines in it) you should strongly consider using the csv module to output your data. Otherwise, you'll spend lots of time trying to fix obscure bugs with odd input that the csv module already handles.
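For instance, the csv module quotes awkward values automatically (a small demo with made-up values):

import csv, sys

csvWriter = csv.writer(sys.stdout)
csvWriter.writerow(['plain', 'has,comma', 'has"quote'])
# Prints: plain,"has,comma","has""quote"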
So I have a .txt file that has five columns: the first is a string and the next four are floats. What I want to do is be able to search the file for the string and find the row it is on, so I can use the floats associated with it.
From other threads I've managed to do this by putting the floats and strings in two separate files. Is it possible to have them in the same file? When I do this, I get an error about being unable to convert a string to a float or vice versa.
So for example, I have this in the text file:
blue1 0 1 2 3
blue2 4 5 6 7
red1 8 9 10 11
red2 12 13 14 15
The code I am using to do this is the same code I used when I had two separate files:
lookup = 'red1'
with open(file) as myFile:
for row, line in enumerate(myFile, 1):
if lookup in line:
print 'found in line:', row
data = np.loadtxt(path + file)
d = data[:row,]
The error I am getting says:
ValueError: could not convert string to float: blue1
What I'm trying to get is the row number "red1" is on, then use that number to figure out where I need to slice in order to get the numbers associated with it.
Regarding your code, you are trying to do the same thing twice. You are using open to open and read the file and also np.loadtxt to read it. The error is from the np.loadtxt.
For np.loadtxt you need to supply the field types if they aren't all the same.
There is an example in the docs:
np.loadtxt(d, dtype={'names': ('gender', 'age', 'weight'),
                     'formats': ('S1', 'i4', 'f4')})
For yours, it'd look like
data = np.loadtxt('text.txt', dtype={
    'names': ('color', 'a', 'b', 'c', 'd'),
    'formats': ('U50', 'f', 'f', 'f', 'f')
})
And then you can use numpy.where (https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html) to locate your string.
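Putting the two together, the lookup might go like this (a sketch; it assumes the text.txt layout shown in the question):

import numpy as np

data = np.loadtxt('text.txt', dtype={
    'names': ('color', 'a', 'b', 'c', 'd'),
    'formats': ('U50', 'f', 'f', 'f', 'f')
})
rows, = np.where(data['color'] == 'red1')  # indices of the matching record(s)
print(data[rows])  # the 'red1' row, floats included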
You can store everything in one file. You just need to read in the file in the correct manner.
Doing it the other way with open would look like:
with open('text.txt') as f:
    for line in f:
        my_arr = line.split()
        my_str = my_arr.pop(0)  # pop the string off the front of the list
        my_arr = list(map(float, my_arr))
        if my_str == "blue1":
            print(my_arr, type(my_arr[0]))
The floats are now in a list, so we can print all of them and show that their type is float.
Output: ([0.0, 1.0, 2.0, 3.0], <type 'float'>)
This is the first line of my txt.file
0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00
There should be 8 columns, sometimes separated with '-', sometimes with '.'. It's very confusing; I just have to work with the file, as I didn't generate it.
And second question: How can I work with the different columns? There is no header, so maybe:
df.iloc[:,0] .. ?
As stated in the comments, this is likely a list of numbers in scientific notation that aren't separated by anything but simply glued together.
It could be interpreted as:
0.112296E+02
-.121994E-010
.158164E-030
.158164E-030
.000000E+000
.340000E+030
.328301E-010
.000000E+00
or as
0.112296E+02
-.121994E-01
0.158164E-03
0.158164E-03
0.000000E+00
0.340000E+03
0.328301E-01
0.000000E+00
Assuming the second interpretation is better, the trick is to split evenly every 12 characters.
data = [line[i:i+12] for i in range(0, len(line), 12)]
If the first interpretation really is better, then I'd use a regex:
import re

line = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00'
pattern = r'[+-]?\d??\.\d+E[+-]\d+'
data = re.findall(pattern, line)
Edit
Obviously, you'd need to iterate over each line in the file and add it to your dataframe. This is a rather inefficient thing to do in Pandas. Therefore, if your preferred interpretation is the fixed-width one, I'd go with @Ev. Kounis' answer: df = pd.read_fwf(myfile, widths=[12]*8)
Otherwise, the inefficient way is:
df = pd.DataFrame(columns=range(8))
with open(myfile, 'r') as f_in:
    for i, line in enumerate(f_in):
        data = re.findall(pattern, line)
        df.loc[i] = [float(d) for d in data]
The two things to notice here are that the DataFrame must be initialized with column names (here range(8), i.e. 0 to 7, but perhaps you know of better identifiers), and that the regex gave us strings that must be cast to floats.
As I said in the comments, it is not a case of multiple separators; it is just a fixed-width format. Pandas has a method to read such files. Try this:
df = pd.read_fwf(myfile, widths=[12]*8)
print(df) # prints -> [0.112296E+02, -.121994E-01, 0.158164E-03, 0.158164E-03.1, 0.000000E+00, 0.340000E+03, 0.328301E-01, 0.000000E+00.1]
For the widths you have to provide the cell width, which looks like it's 12, and the number of columns, which as you say must be 8.
As you might notice, the results of the read are not perfect (note the .1 just before the comma in the 4th and last elements), but I am working on it.
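Most likely the .1 suffixes appear because read_fwf takes the first data line as a header row and de-duplicates the repeated column names. If so (an assumption worth checking against your file), passing header=None should avoid it:

df = pd.read_fwf(myfile, widths=[12]*8, header=None)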
Alternatively, you can do it "manually" like so:
myfile = r'C:\Users\user\Desktop\PythonScripts\a_file.csv'
width = 12

my_content = []
with open(myfile, 'r') as f_in:
    for lines in f_in:
        data = [float(lines[i * width:(i + 1) * width]) for i in range(len(lines) // width)]
        my_content.append(data)

print(my_content)  # prints -> [[11.2296, -0.0121994, 0.000158164, 0.000158164, 0.0, 340.0, 0.0328301, 0.0]]
and every row would be a nested list.
A possible solution is the following:
row = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00'
chunkLen = 12
for i in range(0, len(row), chunkLen):
    print(row[i:i + chunkLen])
You can easily extend the code to handle more general cases.
So I'm making a program that reads a text file, and I need to separate all the info into its own variables. It looks like this:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD
YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ
DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT
QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN
YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE
QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN
KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS
SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT
TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV
STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN
The code after the > is a title, the next bit that looks like this "A.41,52" are numbered positions in the sequence I need to save to use, and everything after that is an amino acid sequence. I know how to deal with the amino acid sequence, I just need to know how to separate the important numbers in the first line.
In the past when I just had a title and sequence I did something like this:
for line in nucfile:
    if line.startswith(">"):
        headerline = line.strip("\n")[1:]
    else:
        nucseq += line.strip("\n")
Am I on the right track here? This is my first time, any advice would be fantastic and thanks for reading :)
I suggest you use the split() method.
split() allows you to specify the separator of your choice. Provided the sequence title (here 1EK9) is always separated from the rest of the sequence by a colon, you could first pass ":" as your separator. You could then split the remainder of the sequence to recover the numbered positions (e.g. A.41,52) using ";" as a separator.
I hope this helps!
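A minimal sketch of that idea, assuming the header line from the question is already in a variable named line:

line = '>1EK9:A.41,52; B.61,74; C.247,257; D.279,289'
title, rest = line[1:].split(':')  # drop the '>' and split off the title
positions = [p.strip() for p in rest.split(';')]
print(title)      # 1EK9
print(positions)  # ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']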
I think what you are trying to do is extract certain parts of the sequence based on their identifiers given to you on the first line (the line starting with >).
This line contains your title, then a sequence name and the data range you need to extract.
Try this:
sequence_pairs = {}
with open('somefile.txt') as f:
    header_line = next(f)
    sequence = f.read().replace('\n', '')

title, components = header_line.split(':')
for pair in components.split(';'):
    name, coords = pair.strip().split('.')
    start, end = coords.split(',')
    sequence_pairs[name] = sequence[int(start):int(end) + 1]

for name, data in sequence_pairs.items():
    print('{} - {}'.format(name, data))
While the other answer may well tackle the assumed problem in its entirety, the OP has asked for pointers to, or an example of, the typical split-unsplit transform, which is often very successful. I hereby provide some ideas and working code to show this (based on the example in the question).
So let us focus on the else branch below:
from __future__ import print_function

nuc_seq = []  # a list
title_token = '>'
with open('some_file_of_a_kind.txt', 'rt') as f:
    for line in f.readlines():
        s_line = line.strip()  # this strips whitespace
        if line.startswith(title_token):
            headerline = line.strip("\n")[1:]
        else:
            nuc_seq.append(s_line)  # build list

# now nuc_seq is a list of strings like:
# ['ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD',
#  'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ',
#  ...
# ]

demo_nuc_str = ''.join(nuc_seq)
# now:
# demo_nuc_str == 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGADYTYSNGYR ...'
That is a fast and widely deployed paradigm in Python programming (and in programming with powerful data types in general).
If the split-unsplit (a.k.a. join) method is still unclear, just ask, or try searching SO for excellent answers to related questions.
Also note that there is no need to line.strip('\n'), as '\n' is considered whitespace just like ' ' (a string with a space only) or a tab '\t'; sample:
>>> a = ' \t \n '
>>> '+'.join(a.split())
''
So the "joining character" only appears, if there are at least two element sto join and in this case, strip removed all whits space and left us with the empty string.
Update:
As requested, a further analysis of the "coordinate part" of the header line from the question:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
If you want to retrieve the:
A.41,52; B.61,74; C.247,257; D.279,289
and assume you have (as above) the complete line in a headline string:
title, coordinates = headline.split(':')
# so now title == '1EK9' and
# coordinates == 'A.41,52; B.61,74; C.247,257; D.279,289'
Now split on the semicolons and trim the entries:
het_seq = [z.strip() for z in coordinates.split(';')]
# now het_seq == ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
If 'A', 'B', 'C', and 'D' are well-known dimensions, then you can "lose" the ordering info from the input file (as you can always reinforce what you already know ;-) and map the coordinates as key: (ordered coordinate pair):
>>> coord_map = dict(
...     (a, tuple(int(k) for k in bc.split(',')))
...     for a, bc in (abc.split('.') for abc in het_seq))
>>> coord_map
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
In context of a micro program:
#!/usr/bin/env python
from __future__ import print_function

het_seq = ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
coord_map = dict(
    (a, tuple(int(k) for k in bc.split(',')))
    for a, bc in (abc.split('.') for abc in het_seq))
print(coord_map)
yields:
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
One might write this explicitly as a nested for loop, but it is a late European evening, so the trick is to read it from the right:
for all elements of het_seq,
split on the dot and store the left part in a and the right part in bc;
then further split bc into a sequence of k's, convert each to an integer, and put them into a tuple (an ordered pair of integer coordinates);
arriving at the left, build a pair of a (the dimension, like 'A') and the coordinate tuple from the previous step.
In the end, call the dict() function, which constructs a dictionary from such an iterable of (key, value) pairs, giving {key_1: value_1, ...}.
So all coordinates are integers, stored as ordered pairs in tuples.
I'd prefer tuples here, although split() generates lists, because
you will keep those two coordinates, not extend or append to that pair, and
in Python, mapping and remapping is performed often, and a hashable (that is, immutable) type is ready to become a key in a dict.
One last variant (with no knotted comprehensions):
coord_map = {}
for abc in het_seq:
    a, bc = abc.split('.')
    coord_map[a] = tuple(int(k) for k in bc.split(','))
print(coord_map)
The first four lines produce the same result as the above mildly obnoxious "one-liner" (which had already been written on three lines, kept together within parentheses).
HTH.
So I'm assuming you are trying to process a FASTA-like file, and so the way I would do it is to first get the header and separate the pieces with a regex. Following that, you can store the A.41,52, B.61,74, ... pieces in a list for easy access. The code is as follows.
import re

def processHeader(line):
    positions = re.search(r':(.*)', line).group(1)
    positions = positions.split('; ')
    return positions

dnaSeq = ''
positions = []
with open('myFasta', 'r') as infile:
    for line in infile:
        if '>' in line:
            positions = processHeader(line)
        else:
            dnaSeq += line.strip()
I am not sure I completely understand the goal (and I think this post is more suitable for a comment, but I do not have enough privileges), but I think that the key to your solution is using .split(). You can then join the elements of the resulting list just by using +, similar to this:
>>> result = line.split(' ')
>>> result
['1EK9:A.41,52;', 'B.61,74;', 'C.247,257;', 'D.279,289', 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD', 'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN',
'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
>>> result[3]+result[4]
'D.279,289ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD'
>>>
etc. You can also use the usual slicing syntax to extract the elements of the list that you need:
>>> result[5:]
['YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN', 'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
and join them together:
>>> reduce(lambda x, y: x+y, result[5:])
'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQDVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTTQRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGNYYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLAREQIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQNKVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRSSFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDATTTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPVSTNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN'
remember that + on lists produces a list.
By the way, I would not remove '\n' to start with, as you may want to use it to extract the first line, similar to the way space is used above to extract "words".
UPDATE (starting from result):
# getting A's indexes
letter_seq = result[5:]
ind = result[:4]
Aind = ind[0].split('.')[1].replace(';', '')
# getting one long letter seq
long_letter_seq = reduce(lambda x, y: x + y, letter_seq)
# extracting the final seq from long_letter_seq using Aind
output = long_letter_seq[int(Aind.split(',')[0]):int(Aind.split(',')[1])]
The last line is just a combination of several operations that were also used earlier.
Same for B, C, D, etc. -- so a lot of manual work and calculations...
BE CAREFUL with the indexes of A -- numbering in Python starts from 0, which may not be the case in your numbering system.
The more elegant solution would be using re (https://docs.python.org/2/library/re.html) to find a pattern using a mask, but this requires very well-defined rules for how to look up the needed sequence.
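A hedged sketch of that regex idea, assuming the header always matches the example line:

import re

header = '>1EK9:A.41,52; B.61,74; C.247,257; D.279,289'
coords = {m.group(1): (int(m.group(2)), int(m.group(3)))
          for m in re.finditer(r'([A-Z])\.(\d+),(\d+)', header)}
print(coords)  # {'A': (41, 52), 'B': (61, 74), 'C': (247, 257), 'D': (279, 289)}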
UPDATE 2: It is also not clear to me what the role of the spaces is -- so far I removed them, but they may matter when counting the letters in the original string.