How to create a dataset from a sequence file in Python

I have a protein sequence file that looks like this:
>102L:A MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL -------------------------------------------------------------------------------------------------------------------------------------------------------------------XX
The first field is the name of the sequence, the second is the actual protein sequence, and the third is an indicator that shows whether any coordinates are missing. In this case, notice that there are two "X" characters at the end. That means the last two residues of the sequence, which are "NL" here, are missing coordinates.
Using Python, I would like to generate a table that contains:
1) the name of the sequence
2) the total number of missing coordinates (which is the number of X characters)
3) the range of these missing coordinates (which is the range of the positions of those X characters)
4) the length of the sequence
5) the actual sequence
So the final result should look like this:
>102L:A 2 163-164 164 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
And my code looks like this so far:
total_seq = []

with open('sample.txt') as lines:
    for l in lines:
        split_list = l.split()
        # Assign the list number
        header = split_list[0]  # 1
        seq = split_list[1]     # 5
        disorder = split_list[2]
        # count sequence length and total residues with missing coordinates
        sequence_length = len(seq)  # 4
        counts = 0  # initialize the counter before the loop so it is not reset on every character
        for x in disorder:
            if x == 'X':
                counts = counts + 1
        total_seq.append([header, seq, str(counts)])  # obviously I haven't finished coding 2 & 3

with open('new_sample.txt', 'a') as f:
    for lol in total_seq:
        f.write('\n'.join(lol))
I'm new to Python; could anyone help, please?

Here's your modified code. It now produces your desired output.
with open("sample.txt") as infile:
matrix = [line.split() for line in infile.readlines()]
header_list = [row[0] for row in matrix]
seq_list = [str(row[1]) for row in matrix]
disorder_list = [str(row[2]) for row in matrix]
f = open('new_sample.txt', 'a')
for i in range(len(header_list)):
header = header_list[i]
seq = seq_list[i]
disorder = disorder_list[i]
# count sequence length and total residue of missing coordinates
sequence_length = len(seq)
# get total number of missing coordinates
num_missing = disorder.count('X')
# get the range of these missing coordinates
first_X_pos = disorder.find('X')
last_X_pos = disorder.rfind('X')
range_missing = '-'.join([str(first_X_pos), str(last_X_pos)])
reformat_seq=" ".join([header, str(num_missing), range_missing, str(sequence_length), seq, '\n'])
f.write(reformat_seq)
f.close()
Some more tips:
Don't forget about Python's string functions. They will solve a lot of your problems automatically. The documentation is very good.
If you search for how to do just part 2 or just part 3 of your question, you will find answers elsewhere.
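For example, a minimal sketch of the string methods used above (the sample indicator string here is made up):
# A made-up indicator string with missing coordinates marked by X
disorder = "---XX--XXX"

print(disorder.count('X'))       # 5  -> total number of missing coordinates
print(disorder.find('X') + 1)    # 4  -> first missing position (1-based)
print(disorder.rfind('X') + 1)   # 10 -> last missing position (1-based)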

Related

Not Parsing Through

I am trying to parse through a text file and find the index of the character where the four characters before it are all different. Like this:
wxrgh
The h would be the marker, since it comes after four different characters, and its index would be 4. I find the index by converting the text into an array, and it works for my test but not for the actual input. Does anyone know what is wrong?
def Repeat(x):
    size = len(x)
    repeated = []
    for i in range(size):
        k = i + 1
        for j in range(k, size):
            if x[i] == x[j] and x[i] not in repeated:
                repeated.append(x[i])
    return repeated

with open("input4.txt") as f:
    text = f.read()

test_array = []
split_array = list(text)
woah = ""
for i in split_array:
    first = split_array[split_array.index(i)]
    second = split_array[split_array.index(i) + 1]
    third = split_array[split_array.index(i) + 2]
    fourth = split_array[split_array.index(i) + 3]
    test_array.append(first)
    test_array.append(second)
    test_array.append(third)
    test_array.append(fourth)
    print(test_array)
    if Repeat(test_array) != []:
        test_array = []
    else:
        woah = split_array.index(i)
        print(woah)
print(woah)
I tried a test document and unit tests, but it still does not work.
You can utilise a set to help you with this.
Read the entire file into a buffer. Iterate over the buffer starting at offset 4. Create a set of the 4 characters that precede the current position. If the length of the set is 4 (i.e., they're all different) and the character at the current position is not in the set, then you've found the index you're interested in.
W = 4

with open('input4.txt') as data:
    buffer = data.read()

for i in range(W, len(buffer)):
    if len(s := set(buffer[i-W:i])) == W and buffer[i] not in s:
        print(i)
Note:
If the input data are split over multiple lines you may want to remove newline characters.
You will need to be using Python 3.8+ to take advantage of the assignment expression (walrus operator)
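If you are stuck on an older Python version, a minimal equivalent without the walrus operator looks like this (same logic, just an explicit temporary variable):
W = 4

with open('input4.txt') as data:
    buffer = data.read()

for i in range(W, len(buffer)):
    window = set(buffer[i-W:i])   # the W characters before position i
    if len(window) == W and buffer[i] not in window:
        print(i)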

How to put a group of integers in a row in a text file into a list?

I have a text file composed mostly of numbers something like this:
3 011236547892X
9 02321489764 Q
4 031246547873B
I would like to extract the following (characters 5 to 14, counting from zero) from each line into a list:
1236547892
321489764
1246547873
(Please note: each "number" is 10 "characters" long - the second row has a space at the end.)
and then perform analysis on the contents of each list.
I have tried umpteen versions; however, I think I am closest with:
with open('k_d_m.txt') as f:
    for line in f:
        range = line.split()
        num_lst = [x for x in range(3, 10)]
        print(num_lst)
However, I get: TypeError: 'list' object is not callable
What is the best way forward?
What I want to do with num_lst is, amongst other things, as follows:
num_lst = list(map(int, str(num)))
print(num_lst)

nth = 2
odd_total = sum(num_lst[0::nth])
even_total = sum(num_lst[1::nth])
print(odd_total)
print(even_total)

if odd_total - even_total == 0 or odd_total - even_total == 11:
    print("The number is ok")
else:
    print("The number is not ok")
Use a simple slice:
with open('k_d_m.txt') as f:
    num_lst = [x[5:15] for x in f]
Response to comment:
with open('k_d_m.txt') as f:
    for line in f:
        num_lst = list(line[5:15])
        print(num_lst)
First of all, you shouldn't name your variable range, because that name is already taken by the range() function. You can easily get the 5th to 14th characters of a string using string[5:15]. Try this:
num_lst = []

with open('k_d_m.txt') as f:
    for line in f:
        num_lst.append(line[5:15])

print(num_lst)
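If you then want to run the odd/even digit-sum check from your question on each extracted field, a minimal sketch (assuming each slice contains only digits, apart from a possible trailing space, and using the rule from your own code) would be:
with open('k_d_m.txt') as f:
    fields = [line[5:15].strip() for line in f]

for num in fields:
    digits = list(map(int, num))       # e.g. '1236547892' -> [1, 2, 3, ...]
    odd_total = sum(digits[0::2])
    even_total = sum(digits[1::2])
    if odd_total - even_total in (0, 11):   # the rule from the question's code
        print(num, "is ok")
    else:
        print(num, "is not ok")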

How to convert data like '2.6840000e+01' to float in Python?

I got a "input.txt" file that contains lines like:
1 66.3548 1011100110110010 25
Then I apply some functions column by column:
the 1st column stays the same,
the 2nd column is rounded in a specific way,
the 3rd column is converted from binary to decimal,
the 4th column is converted from hexadecimal to binary.
And finally I get this:
[1.0000000e+00 6.6340000e+01 4.7538000e+04 1.0010100e+05]
Then I write this to "fall.txt".
All the operations work correctly, but I want to see the numbers like this:
1 66.34 47538 100101
I placed the columns of the relevant rows in list_for_1, then applied the functions to the indexes and put the results into another list, list_for_11. Finally, I put all the answers into a matrix and wrote the matrix to "fall.txt".
Here's what I did:
with open("input.txt", "r") as file:
#1. TİP SATIRLAR İÇİN GEREKLİ OBJELER
list_for_1 = list()
list_for_11 = list()
#list_final_1 = list()
for line in file:
#EĞER SATIR TİPİ 1 İSE
if line.startswith("1"):
line = line[:-1]
list_for_1 = line.split(" ") #tüm elemanları 1 listede toplama
#1. tip satır için elemanlara gerekli işlemlerin yapılması
list_for_11.append(list_for_1[0]) #ilk satır 1 kalacak
list_for_11.append(float_yuvarla(float(list_for_1[1]))) #float yuvarlama
list_for_11.append(binary_decimal(list_for_1[2])) #binary'den decimal'e
list_for_11.append(hexa_binary(list_for_1[3])) #hexa'dan binary'e
m = 0
n = 0
array1 = np.zeros((6,4))
for i in list_for_11: #listedeki elemanları matrise yerleştirme
if(m > 5):
break
if(isinstance(i, str)):
x = int(i, 2)
array1[m][n] = float(i)
n += 1
if(n == 4):
n = 0
m += 1
with open("fall.txt","w") as ff:
ff.write(str(array1))
ff.write("\n")
Here I actually send a float type to the matrix, but it's not working:
if isinstance(i, str):
    x = int(i, 2)
array1[m][n] = float(i)
I'm fairly new to Python, so I might be writing unnecessarily long and complex code. If there's a shorter way to do what I did, I would like opinions on that as well.
Here's a function to format your numbers the way you want them:
def formatNumber(num):
    if num % 1 == 0:
        return int(num)
    else:
        return num
Your list of numbers:
l = [1.0000000e+00, 6.6340000e+01, 4.7538000e+04, 1.0010100e+05]
Reformatting your list of numbers:
for x in l:
    print(formatNumber(x))
Output:
1
66.34
47538
100101
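If you also want "fall.txt" to contain the plain numbers rather than numpy's scientific-notation string, one way (a minimal sketch reusing formatNumber from above) is to build the output line yourself instead of writing str(array1):
l = [1.0000000e+00, 6.6340000e+01, 4.7538000e+04, 1.0010100e+05]

with open("fall.txt", "w") as ff:
    # join the reformatted numbers with spaces: "1 66.34 47538 100101"
    ff.write(" ".join(str(formatNumber(x)) for x in l) + "\n")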

How to reshape a numpy data table?

I'm pretty new to Python and I have a task to "reshape" some data in a .txt file. The simplified format of the original data looks like this:
A 1 x
A 2 y
A 3 z
B 1 q
B 2 w
B 3 e
...
What I need to get looks like this:
A B
1 x q
2 y w
3 z e
...
The thing is, there are multiple .txt files I have to reshape and there's no fixed number of 1-2-3s per A-B-C, meaning A could go from 1 to 50, while B could go from 1 to 10 or 75.
I'm looking for an algorithm for how to do this. I've figured out how to reach the data I need and discard the data I don't need, but I can't figure out how to "reduce" the dimension of the data.
What I've done so far is get the necessary data into arrays and put those arrays into a numpy array:
data = np.array([station, depth, temperature])
Now I'm trying to fill a new 2D data array, with the x and y axes being the number of different stations and depths: if the original data has AAAABBCCDDDD, then the new data array's x axis will contain ABCD (using Counter().keys()).
First you could parse everything, reading line by line, and store the values in a dictionary. Since each line looks something like A 1 x, the general case is as follows:
BIG_LETTER INDEX VALUE WHITESPACE
In the dictionary, you would have the BIG_LETTERs as keys and, as the values, another dictionary that stores the index and the value, something like {A : {1: 'q', 2: 'c'}}. This can trivially be achieved.
replace_with_your_file_name = "./text.txt"
data = {}

with open(replace_with_your_file_name, "r") as file:
    for line in file.readlines():
        line = line.strip().split(' ')  # remove the ending whitespace and split on spaces
        # Store in a dictionary the big letter and all its values,
        # something like {A : {1: 'q', 2: 'c'}}
        if not line[0] in data:
            data[line[0]] = {}
        # store the index as an int so it can be sorted and compared numerically below
        data[line[0]][int(line[1])] = line[2]  # data[big_letter][number] = char
Then, after that is finished, you could use another for loop to sort the keys in each nested dictionary, so if it was {'B' : {5: 'a', 2: 'c'}} it would become {'B' : {2: 'c', 5: 'a'}}. You can then also easily extract, for each big letter, the maximum number it has a value for, which solves the problem of non-fixed length. The highest maximum number is saved for later.
# Sort each nested dictionary by key
GLOBAL_MAX_NUMBER: int = 0  # the largest number among all big letters
for item in data:
    big_letter: dict = data[item]
    data[item] = dict(sorted(big_letter.items()))  # sort according to the keys
    local_max_number = list(data[item])[-1]  # the last key is the largest
    if local_max_number > GLOBAL_MAX_NUMBER:
        GLOBAL_MAX_NUMBER = local_max_number

iterations = GLOBAL_MAX_NUMBER  # improve readability
Now you can write the data in a new file in the format you wish
# Write them to a new file
with open("newfile.txt", "w") as file:
    # FORMAT:   A B C D ...   (the big letters in the first row)
    #         1 a b c d ...   (the index, then the value for each big letter)
    # Write all the big letters in a row
    WHITESPACE: str = " "
    file.write(WHITESPACE + " ".join(list(data)) + "\n")
    # iterate up to the `GLOBAL_MAX_NUMBER` we kept track of
    for i in range(iterations):
        current_number: int = i + 1  # current index
        file.write(f'{current_number} ')
        for big_letter in data:  # A, B, C ...
            if current_number not in data[big_letter]:
                file.write("0 ")  # in case this index does not exist, write 0
            else:
                file.write(f'{data[big_letter][current_number]} ')  # write the value
        file.write("\n")
All of the above, combined, would give the desired output:
A B
1 x q
2 y w
3 z e
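If you are allowed to use pandas, an alternative sketch is to let pivot do the reshaping (the file name and column names below are placeholders, assuming a whitespace-separated file with station, depth and value columns):
import pandas as pd

df = pd.read_csv("text.txt", sep=r"\s+", header=None,
                 names=["station", "depth", "value"])
table = df.pivot(index="depth", columns="station", values="value")
# missing (station, depth) combinations show up as NaN; use table.fillna(0)
# if you want zeros like in the loop above
print(table)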

Access the elements of a list around the current element?

I am trying to figure out if it is possible to access the elements of a list around the element you are currently at. I have a list that is large (20k+ lines) and I want to find every instance of the string 'Name'. Additionally, I also want to get +/- 5 elements around each 'Name' element. So 5 lines before and 5 lines after. The code I am using is below.
search_string = 'Name'

with open('test.txt', 'r') as infile, open('textOut.txt', 'w') as outfile:
    for line in infile:
        if search_string in line:
            outfile.writelines([line, next(infile), next(infile),
                                next(infile), next(infile), next(infile)])
Getting the lines after the occurrence of 'Name' is pretty straightforward, but figuring out how to access the elements before it has me stumped. Anyone have any ideas?
20k lines isn't that much. If it's OK to read all of them into a list, we can take slices around the index where a match is found, like this:
with open('test.txt', 'r') as infile, open('textOut.txt', 'w') as outfile:
    lines = [line.strip() for line in infile.readlines()]
    n = len(lines)
    for i in range(n):
        if search_string in lines[i]:
            start = max(0, i - 5)
            end = min(n, i + 6)
            outfile.writelines(lines[start:end])
You can use the function enumerate that allows you to iterate through both elements and indexes.
Example to access elements 5 indexes before and after your current element:
n = len(l)
for i, x in enumerate(l):
    print(l[max(i-5, 0)])    # use max() so we don't pick elements from the end via negative indexes
    print(x)
    print(l[min(i+5, n-1)])  # use min() to prevent going past the end of the list
You need to keep track of the index of where in the list you currently are
So something like:
# Read the file into list_of_lines
index = 0
while index < len(list_of_lines):
    if list_of_lines[index] == 'Name':
        print(list_of_lines[index - 1])  # This is the previous line
        print(list_of_lines[index + 1])  # This is the next line
        # And so on...
    index += 1
Let's say you have your lines stored in your list:
lines = ['line1', 'line2', 'line3', 'line4', 'line5', 'line6', 'line7', 'line8', 'line9']
You could define a method that returns groups of n consecutive elements, as a generator:
def each_cons(iterable, n=2):
    if n < 2: n = 1
    i, size = 0, len(iterable)
    while i < size - n + 1:
        yield iterable[i:i+n]
        i += 1
Then, just call the method. To show the content I'm calling list on it, but you can iterate over it:
lines_by_3_cons = each_cons(lines, 3) # or any number of lines, 5 in your case
print(list(lines_by_3_cons))
#=> [['line1', 'line2', 'line3'], ['line2', 'line3', 'line4'], ['line3', 'line4', 'line5'], ['line4', 'line5', 'line6'], ['line5', 'line6', 'line7'], ['line6', 'line7', 'line8'], ['line7', 'line8', 'line9']]
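Applied to your problem, a minimal sketch would use a window of 11 lines (5 before, the line itself, 5 after) and check the middle element. This assumes lines already holds the file's lines; note that matches within the first or last 5 lines never sit in the middle of a full window and are skipped:
search_string = 'Name'
window = 11   # 5 lines before + the matching line + 5 lines after

for group in each_cons(lines, window):
    if search_string in group[5]:   # the middle element of the window
        print('\n'.join(group))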
I personally loved this problem. The other answers here read the whole file into memory; I think I wrote a more memory-efficient version.
Here, check this out!
myfile = open('infile.txt')
stack_print_moments = []
expression = 'MYEXPRESSION'
neighbourhood_size = 5

def print_stack(stack):
    for line in stack:
        print(line.strip())
    print('-----')

current_stack = []
for index, line in enumerate(myfile):
    current_stack.append(line)
    if len(current_stack) > 2 * neighbourhood_size + 1:
        current_stack.pop(0)
    if expression in line:
        stack_print_moments.append(index + neighbourhood_size)
    if index in stack_print_moments:
        print_stack(current_stack)

last_index = index
for index in range(last_index, last_index + neighbourhood_size + 1):
    if index in stack_print_moments:
        print_stack(current_stack)
    current_stack.pop(0)
More advanced code is here: Github link
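A similar streaming approach can be written with collections.deque, which keeps the bounded window for you. Here is a minimal sketch (the file name and search string are placeholders) that prints 5 lines of context before and after each match:
from collections import deque

search_string = 'Name'
before = deque(maxlen=5)   # the last 5 lines seen before the current one
after_count = 0            # how many more lines to print after a match

with open('test.txt') as infile:
    for line in infile:
        if search_string in line:
            print(''.join(before), end='')   # context before the match
            print(line, end='')
            after_count = 5
        elif after_count > 0:
            print(line, end='')              # context after the match
            after_count -= 1
        before.append(line)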
