Sorting data using key in python

I have data in a format like:
ATOM 124 N GLU B 12
ATOM 125 O GLU B 12
ATOM 126 OE1 GLU B 12
ATOM 127 C GLU B 12
ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14
ATOM 133 C GLU B 15
ATOM 134 CA GLU B 15
ATOM 135 OE2 GLU B 15
ATOM 136 O GLU B 15
.....100+ lines
I want to filter this data based on col[5] and col[2] (counting columns from 0). For each value of col[5], if only one of OE1 and OE2 is present in col[2], or neither is, all lines with that value should be discarded. If both OE1 and OE2 are present, all lines with that value should be kept.
The desired data after filtering:
ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14
I have tried using search_string like:
for item in stored_list:
    search_str_a = 'OE1'+item[3]+item[4]+item[5]
    search_str_b = 'OE2'+item[3]+item[4]+item[5]
    target_str = item[2]+item[3]+item[4]+item[5]
This keeps the remaining columns aligned while searching for OE1 or OE2, but it does not help to filter out a group when one of them (or both) is missing.
Any ideas would be appreciated.

The code below needs pandas, which you can install by following http://pandas.pydata.org/pandas-docs/stable/install.html
import pandas as pd
file_read_path = "give here source file path"
df = pd.read_csv(file_read_path, sep=" ", names=["col0", "col1", "col2", "col3", "col4", "col5"])
# Join the atom names (col2) of each col5 value into one string per group
group_series = df.groupby("col5")["col2"].apply(lambda x: ", ".join(x))
filtered_list = []
for index in group_series.index:
    str_col2_group = group_series[index]
    # Keep only the groups that contain both OE1 and OE2
    if "OE1" in str_col2_group and "OE2" in str_col2_group:
        filtered_list.append(index)
df = df[df.col5.isin(filtered_list)]
output_file_path = "give here output file path"
df.to_csv(output_file_path, sep=" ", index=False, header=False)
The pandas tutorials may also be helpful: http://pandas.pydata.org/pandas-docs/stable/tutorials.html
Output result
ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14

Using csv, which comes with Python:
import csv
import operator
file_read_path = "give here source file path"
with open(file_read_path) as f_pdb:
    rdr = csv.DictReader(f_pdb, delimiter=' ', fieldnames=["col0","col1","col2","col3","col4","col5"])
    # Sort by col5 so that rows with the same value are adjacent
    sorted_bio = sorted(rdr, key=operator.itemgetter('col5'))
col5_tmp = None
tmp_list = []
perm_list = []
tmp_str = ""
for row in sorted_bio:
    col5_v = row["col5"]
    if col5_v != col5_tmp:
        # A new group starts: keep the previous one if it had both atoms
        if "OE1" in tmp_str and "OE2" in tmp_str:
            perm_list.extend(tmp_list)
        tmp_list = []
        tmp_str = ""
        col5_tmp = col5_v
    tmp_list.append(row)
    tmp_str = tmp_str + "," + row["col2"]
# Flush the last group; the loop only checks a group when the next one starts
if "OE1" in tmp_str and "OE2" in tmp_str:
    perm_list.extend(tmp_list)
with open("give here output file path", "w") as csv_file:
    dict_writer = csv.DictWriter(csv_file, delimiter=' ', fieldnames=["col0","col1","col2","col3","col4","col5"])
    for row in perm_list:
        dict_writer.writerow(row)
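For comparison, here is a minimal standard-library sketch of the same filter that avoids both sorting and substring matching by collecting the atom names of each col5 value in a set. The function name `keep_complete_residues` and the whitespace splitting of columns are assumptions made for illustration; they are not part of the answers above. Like the question, it groups on col5 only, so residues with the same number in different chains would be merged.

```python
from collections import defaultdict


def keep_complete_residues(lines, required=("OE1", "OE2")):
    """Return the lines whose col5 group contains every atom name in `required`."""
    # Split on whitespace and ignore lines that do not have six columns.
    rows = [row for row in (line.split() for line in lines) if len(row) >= 6]

    # Map each col5 value to the set of atom names (col2) seen for it.
    atoms_by_residue = defaultdict(set)
    for row in rows:
        atoms_by_residue[row[5]].add(row[2])

    wanted = set(required)
    # Keep the original row order; drop groups missing any required atom.
    return [" ".join(row) for row in rows if wanted <= atoms_by_residue[row[5]]]


sample = [
    "ATOM 124 N GLU B 12",
    "ATOM 126 OE1 GLU B 12",
    "ATOM 130 OE1 GLU B 14",
    "ATOM 131 OE2 GLU B 14",
    "ATOM 132 CA GLU B 14",
    "ATOM 135 OE2 GLU B 15",
]
for line in keep_complete_residues(sample):
    print(line)
```

Comparing sets rather than searching a joined string also means an atom name such as a hypothetical OE11 would not be mistaken for OE1.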

Related

Count the number of atoms of each element

I have to access a file and count the number of atoms of each element. That is, count the number of times of the last character.
For example, I have a file named 14ly.pdb with the following lines:
ATOM 211 N TYR A 27 4.697 8.290 -3.031 1.00 13.35 N
ATOM 212 CA TYR A 27 5.025 8.033 -1.616 0.51 11.29 C
ATOM 214 C TYR A 27 4.189 8.932 -0.730 1.00 10.87 C
ATOM 215 O TYR A 27 3.774 10.030 -1.101 1.00 12.90 O
I should get as a result: 'N':1, 'C':2, 'O':1, that is, 1 atom of type N, 2 of type C and 1 of type O.
I have the following incomplete code that I need to complete:
import os
def count_atoms(pdb_file_name):
    num_atoms = dict()
    with open(pdb_file_name) as file_content:
        ##Here should go the code I need##
    return num_atoms
result = count_atoms('14ly.pdb')
print(result)
number_of_atoms = dict()
with open("14ly.pdb", "r") as f:
    for line in f:
        line_words = line.split()
        if not line_words:
            continue  # skip blank lines
        last_char = line_words[-1]
        if last_char in number_of_atoms:
            number_of_atoms[last_char] += 1
        else:
            number_of_atoms[last_char] = 1
print(number_of_atoms)
I think this should be enough.
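The same counting can also be sketched with `collections.Counter`, which removes the explicit if/else. The helper name `count_atoms_from_lines` is an assumption for illustration; it takes any iterable of lines, so inside the asker's `count_atoms` it could be called on the open file object.

```python
from collections import Counter


def count_atoms_from_lines(lines):
    """Count the last whitespace-separated field of every ATOM line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines and records that are not ATOM entries.
        if fields and fields[0] == "ATOM":
            counts[fields[-1]] += 1
    return dict(counts)


sample = [
    "ATOM 211 N TYR A 27 4.697 8.290 -3.031 1.00 13.35 N",
    "ATOM 212 CA TYR A 27 5.025 8.033 -1.616 0.51 11.29 C",
    "ATOM 214 C TYR A 27 4.189 8.932 -0.730 1.00 10.87 C",
    "ATOM 215 O TYR A 27 3.774 10.030 -1.101 1.00 12.90 O",
]
print(count_atoms_from_lines(sample))
```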

Extract from a file the lines before and after a certain line that has a specific character in a certain column using python

I am trying to read each line of a text file. Each line has values separated by one or more spaces, but a line can also be empty. Whenever a line has a particular character in a given column (in my case "H" in column 2, which is column 1 in Python), I want to extract the line before it (which can be empty), the line of interest with "H", and the next line (which can also be empty).
input : (file.txt)
173 H B 120.24 8.76
174 Y B 125.13 8.88
175 E B 121.65 8.77
176 T B 122.94 9.22
177 L H 129.04 9.19
178 A B 117.33 7.62
179 R B 122.34 8.15
180 F H 124.32 8.81
181 E B 125.43 8.78
182 L C 124.83 8.13
183 S C 114.31 8.50
184 E C 120.65 8.36
185 H C 119.53 8.52
186 H C 119.67 8.62
1 M U **** ****
2 A C 127.24 8.61
3 H B 116.05 8.41
4 C B 124.62 9.23
output : (output.txt)
173 H B 120.24 8.76
174 Y B 125.13 8.88
184 E C 120.65 8.36
185 H C 119.53 8.52
186 H C 119.67 8.62
185 H C 119.53 8.52
186 H C 119.67 8.62
2 A C 127.24 8.61
3 H B 116.05 8.41
4 C B 124.62 9.23
Here is the code that I have but I do not manage to obtain what I want:
newopen2 = open('./output.txt', 'w')
with open("input.txt", "r") as f:
    for line in f.readlines():
        for x, y in enumerate(line):
            if (y[1]) == "H" in line:
                newopen2.write("".join(line[max(0, str(line) - 1):str(line) + 2]).replace('\r', ''))
                newopen2.write("\n")
            else:
                continue
newopen2.close()
f.close()
I would appreciate any help, thank you.
Teez
This solution creates a binary array where a 1 indicates the line is to be written to the file.
from typing import List
def propagate_ones(in_array: List) -> List:
    '''Returns a binary sequence which changes an in_array element to a 1
    if it is a 0 and neighbours a 1'''
    propagated_array = in_array.copy()
    for i in range(len(in_array)-1):
        if in_array[i+1] == 1:
            propagated_array[i] = 1
        if in_array[i] == 1:
            propagated_array[i+1] = 1
    return propagated_array
with open('./file.txt', 'r') as fin, open('./out.txt', 'w') as fout:
    in_lines = fin.readlines()
    h_flags = [1 if i != '\n' and i.split()[1]=='H' else 0 for i in in_lines]
    to_write_flags = propagate_ones(h_flags)
    to_write = zip(to_write_flags, in_lines)
    for line in to_write:
        if line[0] == 1:
            fout.write(line[1])
This writes the desired output to out.txt:
173 H B 120.24 8.76
174 Y B 125.13 8.88
184 E C 120.65 8.36
185 H C 119.53 8.52
186 H C 119.67 8.62
2 A C 127.24 8.61
3 H B 116.05 8.41
4 C B 124.62 9.23
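The flag approach above writes every selected line once. The output shown in the question appears to repeat lines when two "H" lines are adjacent (185 and 186 occur twice), so if that repetition is wanted, a sketch that emits one window per matching line could look like this. The function name `windows_around` and the small sample, including its empty line, are assumptions for illustration.

```python
def windows_around(lines, column=1, marker="H"):
    """Yield [previous, matching, next] for every line whose field `column` equals `marker`."""
    for i, line in enumerate(lines):
        fields = line.split()
        if len(fields) > column and fields[column] == marker:
            # Slicing clamps at the end of the list; max() guards the start.
            yield lines[max(0, i - 1):i + 2]


sample = [
    "183 S C 114.31 8.50",
    "184 E C 120.65 8.36",
    "185 H C 119.53 8.52",
    "186 H C 119.67 8.62",
    "",
    "1 M U **** ****",
]
for window in windows_around(sample):
    print("\n".join(window))
```

Each window is built from the untouched list of lines, so an empty neighbouring line is carried through as an empty string.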

file separation on the basis of matching character

ATOM 856 CE ALYS A 104 0.809 0.146 26.161 0.54 29.14 C
ATOM 857 CE BLYS A 104 0.984 -0.018 26.394 0.46 31.19 C
ATOM 858 NZ ALYS A 104 1.988 0.923 26.662 0.54 33.17 N
ATOM 859 NZ BLYS A 104 1.708 0.302 27.659 0.46 37.61 N
ATOM 860 OXT LYS A 104 -0.726 -6.025 27.180 1.00 26.53 O
ATOM 862 N LYS B 276 17.010 -16.138 9.618 1.00 41.00 N
ATOM 863 CA LYS B 276 16.764 -16.524 11.005 1.00 31.05 C
ATOM 864 C LYS B 276 16.428 -15.306 11.884 1.00 26.93 C
ATOM 865 O LYS B 276 16.258 -15.447 13.090 1.00 29.67 O
ATOM 866 CB LYS B 276 17.863 -17.347 11.617 1.00 33.62 C
I have the above text file and need to split it into two text files based on the character at position 21 of each line. I wrote a script that prints the required lines, but only when I already know that character. Suppose I do not know whether position 21 holds "A" and "B", "B" and "G", or any other combination: how can I separate the lines based on that character? This is the script I tried:
import sys
for fn in sys.argv[1:]:
    f=open(fn,'r')
    while 1:
        line=f.readline()
        if not line: break
        if line[21:22] == 'B':
            chns = line[0:80]
            print(chns)
Store the 21st character of the previous line and print a blank line whenever the current line's character differs from it (which means a new group starts). This prints the lines grouped by their 21st character.
Note that this only groups consecutive lines, in file order, so unsorted input will produce more than one separate group for the same 21st character.
Modified file to show this case:
ATOM 856 CE ALYS A 104 0.809 0.146 26.161 0.54 29.14 C
ATOM 857 CE BLYS A 104 0.984 -0.018 26.394 0.46 31.19 C
ATOM 862 N LYS B 276 17.010 -16.138 9.618 1.00 41.00 N
ATOM 863 CA LYS B 276 16.764 -16.524 11.005 1.00 31.05 C
ATOM 864 C LYS B 276 16.428 -15.306 11.884 1.00 26.93 C
ATOM 865 O LYS B 276 16.258 -15.447 13.090 1.00 29.67 O
ATOM 866 CB LYS B 276 17.863 -17.347 11.617 1.00 33.62 C
ATOM 858 NZ ALYS A 104 1.988 0.923 26.662 0.54 33.17 N
ATOM 859 NZ BLYS A 104 1.708 0.302 27.659 0.46 37.61 N
ATOM 860 OXT LYS A 104 -0.726 -6.025 27.180 1.00 26.53 O
Code producing this case (without sorting the lines):
import sys
for fn in sys.argv[1:]:
    with open(fn,'r') as file:
        prev = 0
        for line in file:
            line = line.strip()
            if line[21:22] != prev:
                # new line separator for each group
                print('')
            print(line)
            prev = line[21:22]
A sample output showing this case:
ATOM 856 CE ALYS A 104 0.809 0.146 26.161 0.54 29.14 C
ATOM 857 CE BLYS A 104 0.984 -0.018 26.394 0.46 31.19 C
ATOM 862 N LYS B 276 17.010 -16.138 9.618 1.00 41.00 N
ATOM 863 CA LYS B 276 16.764 -16.524 11.005 1.00 31.05 C
ATOM 864 C LYS B 276 16.428 -15.306 11.884 1.00 26.93 C
ATOM 865 O LYS B 276 16.258 -15.447 13.090 1.00 29.67 O
ATOM 866 CB LYS B 276 17.863 -17.347 11.617 1.00 33.62 C
ATOM 858 NZ ALYS A 104 1.988 0.923 26.662 0.54 33.17 N
ATOM 859 NZ BLYS A 104 1.708 0.302 27.659 0.46 37.61 N
ATOM 860 OXT LYS A 104 -0.726 -6.025 27.180 1.00 26.53 O
So, if you want only one group per distinct 21st character, put all the lines in a list and sort it with list.sort().
Code (sorting the lines first before grouping):
import sys
for fn in sys.argv[1:]:
    with open(fn,'r') as file:
        lines = file.readlines()
    # creates a list of pairs [21st char, line]
    lines = [ [line[21:22], line.strip() ] for line in lines ]
    # sorts lines based on key (21st char)
    lines.sort()
    # brings back list of lines to its original state,
    # but the order is not reverted since it is already sorted
    lines = [ line[1] for line in lines ]
    prev = 0
    for line in lines:
        if line[21:22] != prev:
            # new line separator for each group
            print('')
        print(line)
        prev = line[21:22]
Outputs to:
ATOM 856 CE ALYS A 104 0.809 0.146 26.161 0.54 29.14 C
ATOM 857 CE BLYS A 104 0.984 -0.018 26.394 0.46 31.19 C
ATOM 858 NZ ALYS A 104 1.988 0.923 26.662 0.54 33.17 N
ATOM 859 NZ BLYS A 104 1.708 0.302 27.659 0.46 37.61 N
ATOM 860 OXT LYS A 104 -0.726 -6.025 27.180 1.00 26.53 O
ATOM 862 N LYS B 276 17.010 -16.138 9.618 1.00 41.00 N
ATOM 863 CA LYS B 276 16.764 -16.524 11.005 1.00 31.05 C
ATOM 864 C LYS B 276 16.428 -15.306 11.884 1.00 26.93 C
ATOM 865 O LYS B 276 16.258 -15.447 13.090 1.00 29.67 O
ATOM 866 CB LYS B 276 17.863 -17.347 11.617 1.00 33.62 C
Edit:
Writing the grouped lines to different files does not actually require checking the previous line's value, because changing the filename based on the 21st character already separates the lines. Here, however, I used prev so that a previously existing file with the same name is overwritten rather than appended to, which could otherwise clutter the file or make its contents inconsistent.
import sys
for fn in sys.argv[1:]:
    with open(fn,'r') as file:
        lines = file.readlines()
    # creates a list of pairs [21st char, line]
    lines = [ [line[21:22], line ] for line in lines ]
    # sorts lines based on key (21st char)
    lines.sort()
    # brings back list of lines to its original state,
    # but the order is not reverted since it is already sorted
    lines = [ line[1] for line in lines ]
    filename = 'file'
    prev = 0
    for line in lines:
        if line[21:22] != prev:
            # creates a new file
            out = open(filename + line[21:22] + '.txt', 'w')
        else:
            # appends to the file
            out = open(filename + line[21:22] + '.txt', 'a')
        out.write(line)
        out.close()
        prev = line[21:22]
The file-writing part can be simplified if appending to previously created files is not a problem. However, this risks writing to a file with the same name that was not created by the script, or that was created during an earlier run.
filename = 'file'
for line in lines:
    with open(filename + line[21:22] + '.txt', 'a') as out:
        out.write(line)
Use str.split and compare the 5th element (i.e. the 21st character):
while 1:
    line = f.readline()
    if not line:
        break
    # get character in 5th column
    ch = line.split()[4]
    if ch == 'B':
        chns = line[0:80]
        print(chns)
    else: # not sure what the character is
        pass # do something
You can just initialize a value to None and check whether it changes:
import sys
for fn in sys.argv[1:]:
    old = None
    f=open(fn,'r')
    for line in f:
        if not line: break
        if (old is None) or (line[21] == old):
            old = line[21]
            chns = line[0:80]
            print(chns)
Not sure what you are trying to achieve, but the following code groups the lines from all files by their 21st character in the dictionary lines.
import sys
lines = dict()
for fn in sys.argv[1:]:
    f = open(fn,'r')
    for line in f:
        if not line:
            break
        key = line.split()[4]
        if key not in lines.keys():
            lines[key] = list()
        lines[key].append(line)
You can then get all 21st characters that occurred using lines.keys(), and get a list() with all corresponding lines from the dictionary.
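Putting the dictionary idea together with the file-per-group requirement, a minimal sketch might look like the following. The names `group_by_chain` and `write_groups` and the `chain_` filename prefix are hypothetical. It assumes fixed-width PDB lines, where the chain identifier sits at index 21, so no prior knowledge of the characters is needed. The sample lines are padded to that layout.

```python
from collections import defaultdict


def group_by_chain(lines, position=21):
    """Group lines by the character at `position` (0-based), keeping file order."""
    groups = defaultdict(list)
    for line in lines:
        # Lines too short to have a character at `position` are skipped.
        if len(line) > position:
            groups[line[position]].append(line)
    return dict(groups)


def write_groups(groups, prefix="chain_"):
    """Write each group to its own file, e.g. chain_A.txt (hypothetical naming)."""
    for key, members in groups.items():
        # Mode 'w' overwrites leftovers from earlier runs instead of appending.
        with open(prefix + key + ".txt", "w") as out:
            out.writelines(members)


sample = [
    "ATOM    856  CE ALYS A 104\n",
    "ATOM    862  N   LYS B 276\n",
    "ATOM    858  NZ ALYS A 104\n",
]
groups = group_by_chain(sample)
print(sorted(groups))  # the distinct characters found at index 21
```

Calling `write_groups(groups)` would then create one file per character. Opening each file once in 'w' mode sidesteps the stale-file problem that the prev check in the earlier answer works around.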

Can't identify syntax error? Also, need using raw_input with list

I'm trying to develop a simple Python program to calculate the formula mass of a compound. I'm facing 2 issues:
There's apparently a syntax error with 'b' but I don't know what it is. Here is what I've done so far:
def FormulaMass():
H = 1
He = 4
Li = 7
Be = 9
B = 11
C = 12
N = 14
O = 16
F = 19
Ne = 20
Na = 23
Mg = 24
Al = 27
Si = 28
P = 31
S = 32
Cl = 35.5
Ar = 40
K = 39
Ca = 40
Sc = 45
Ti = 48
V = 51
Cr = 52
Mn = 55
Fe = 56
Co = 59
Ni = 59
Cu = 63.5
Zn = 65
Ga = 70
Ge = 73
As = 75
Se = 79
Br = 80
Rb = 85.5
Sr = 88
Y = 89
Zr = 91
Nb = 93
Mo = 96
Tc = 98
Ru = 101
Rh = 103
Pd = 106.5
Ag = 108
Cd = 112.5
In = 115
Sn = 119
Sb = 122
Te = 128
I =127
Xe = 131
Cs = 133
Ba = 137
La = 139
Ce = 140
Pr = 141
Nd = 144
Pm = 145
Sm = 150
Eu = 152
Gd = 157
Tb = 159
Dy = 162.5
Ho = 165
Er = 167
Tm = 169
Yb = 173
Lu = 175
Hf = 178.5
Ta = 181
W = 184
Re = 186
Os = 190
Ir = 192
Pt = 195
Au = 197
Hg = 201
Tl = 204
Pb = 207
Bi = 209
Po = 209
At = 210
Rn = 222
Fr = 223
Ra = 226
Ac = 227
Th = 232
Pa = 231
U = 238
Np = 237
Pu = 244
Am = 243
Cm = 247
Bk = 247
Cf = 251
Es = 252
Fm = 257
Md = 258
No = 259
Rf = 261
Db = 262
Sg = 266
Bh = 264
Hs = 277
Mt = 268
Ds = 271
Rg = 272
Uub = 285
Uut = 284
Uuq = 289
Uup = 288
Uuh = 292
Uuo = 294
element = [H, He, Li, Be, B. C, N, O, F, Ne, Na, Mg, Al, Si, P, S, Cl, Ar, K, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Ga, Ge, As, Se, Br, Rb, Sr, Y, Zr, Nb, Mo, Tc, Ru, Rh, Pd, Ag, Cd, In, Sn, Sb, Te, I, Xe, Cs, Ba, La, Ce, Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu, Hf, Ta, W, Re, Os, Ir, Pt, Au, Hg, Tl, Pb, Bi, Po, At, Rn, Fr, Ra, Ac, Th, Pa, U, Np, Pu, Am, Cm, Bk, Cf, Es, Fm, Md, No, Rf, Db, Sg, Bh, Hs, Mt, Ds, Rg, Uub, Uut, Uuq, Uup, Uuh, Uuo]
a = raw_input('Which' + str(element) + '?')
b = float(raw_input('How many moles?'))
c = str(raw_input('Is that all [Y/N]?'))
while c == 'N':
print
'a' doesn't actually come up when running the code; it just immediately reports this syntax error in 'b'.
What I'm trying to do with 'a' is to allow the user to input a constant from the list 'element' so that the mass (depending on the number of moles) can be calculated. One potential problem is that I'm not sure how to let users enter different elements with different numbers of moles without creating endless variables (e.g. a, b, c...).
The aim is to add a*b at the end to find the mass but is there a way to make multiple a's and b's so in theory users could have a*b + a1*b1...
PS Sorry for not putting in my code properly it would take too long for me to put 4 indents after each line :/
In the element list, you've written Be, B. C, N. Notice that you used a period after the 'B' rather than a comma.
Python interprets this as B.C -- you have some object named B and are trying to get the attribute C on it. Hence the error message -- B, an integer, has no attribute named C.
There are also a few other potential issues with your code. As one commenter noted, your print statement is incomplete.
I'm also assuming that you want the user to enter a given element (such as 'Ne'), then find the mass of that atom (20), then do some manipulation there.
In that case, you probably want to restructure your code so that it uses a dict and looks something like this:
def FormulaMass():
    elements = {
        'H': 1,
        'He':4,
        # ... etc
        'Uuo': 294
    }
    element = raw_input('Which element? ')
    mass = elements[element]
    print mass
    # add math here
# Run function:
FormulaMass()
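As for the second part of the question (several elements, each with its own amount, without inventing a1, b1, ...), one possible sketch is to collect (symbol, amount) pairs in a list and sum over them. The function name `formula_mass` and the shortened `ATOMIC_MASS` table are assumptions for illustration; the mass values are the ones from the question.

```python
# Approximate masses taken from the question; only a few elements are listed here.
ATOMIC_MASS = {"H": 1, "C": 12, "N": 14, "O": 16, "Na": 23, "Cl": 35.5}


def formula_mass(components, masses=ATOMIC_MASS):
    """Sum mass * amount over (symbol, amount) pairs, e.g. [("H", 2), ("O", 1)]."""
    total = 0.0
    for symbol, amount in components:
        if symbol not in masses:
            raise KeyError("unknown element symbol: %r" % symbol)
        total += masses[symbol] * amount
    return total


print(formula_mass([("H", 2), ("O", 1)]))    # water
print(formula_mass([("Na", 1), ("Cl", 1)]))  # sodium chloride
```

In an interactive version, the while loop from the question would append one pair per round of input until the user answers 'Y', and the list would then be passed to the function.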

How do I get the number of subunits from a PDB file with awk , python or biopython?

I have PDB(text) files which are in a directory. I would like to print the number of subunits from each PDB file.
Read all lines in a pdb file that start with ATOM
The fifth column of the ATOM line contains A, B, C, D etc.
If it contains only A, the number of subunits is 1. If it contains A and B, the number of subunits is 2. If it contains A, B, and C, the number of subunits is 3.
1kg2.pdb file
ATOM 1363 N ASN A 258 82.149 -23.468 9.733 1.00 57.80 N
ATOM 1364 CA ASN A 258 82.494 -22.084 9.356 1.00 62.98 C
ATOM 1395 C MET B 196 34.816 -51.911 11.750 1.00 49.79 C
ATOM 1396 O MET B 196 35.611 -52.439 10.963 1.00 47.65 O
1uz3.pdb file
ATOM 1384 O ARG A 260 80.505 -20.450 15.420 1.00 22.10 O
ATOM 1385 CB ARG A 260 78.980 -18.077 15.207 1.00 36.88 C
ATOM 1399 SD MET B 196 34.003 -52.544 16.664 1.00 57.16 S
ATOM 1401 N ASP C 197 34.781 -50.611 12.007 1.00 44.30 N
2b69.pdb file
ATOM 1393 N MET B 196 33.300 -54.017 12.033 1.00 46.46 N
ATOM 1394 CA MET B 196 33.782 -52.714 12.566 1.00 49.99 C
desired output
pdb_id subunits
1kg2 2
1uz3 3
2b69 1
How can I do this with awk, python or Biopython?
You can use an array to record all seen values for the fifth column.
$ gawk '/^ATOM/ {seen[$5] = 1} END {print length(seen)}' 1kg2.pdb
2
Edit: Using gawk 4.x you can use ENDFILE to generate the required output:
BEGIN {
    print "pdb_id\t\tsubunits"
    print
}
/^ATOM/ {
    seen[$5] = 1
}
ENDFILE {
    print FILENAME, "\t", length(seen)
    delete seen
}
The result:
$ gawk -f pdb.awk 1kg2.pdb 1uz3.pdb 2b69.pdb
pdb_id subunits
1kg2.pdb 2
1uz3.pdb 3
2b69.pdb 1
A dictionary is one way to count unique occurrences. The following assigns a meaningless value (0) to each subunit, since all you care about is the number of unique subunits (dictionary keys).
import os
for fn in os.listdir():
    # only files whose name ends in .pdb
    if fn.endswith(".pdb"):
        sub = {}
        with open(fn, 'r') as f:
            for line in f:
                c = line.split()
                if len(c) > 5 and c[0] == "ATOM":
                    sub[c[4]] = 0
        print(fn, len(sub.keys()))
(A brand new user deserves an answer along with a pointer to http://whathaveyoutried.com/. Subsequent questions should include evidence that the user has actually tried to solve the problem.)
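Since only the distinct keys matter, the dictionary with placeholder values can be replaced by a set. Below is a minimal sketch with a hypothetical helper `count_chains` that works on any iterable of lines. It assumes, as the question does, that the fifth whitespace-separated field is the chain identifier; in strictly fixed-width PDB files adjacent fields can run together, in which case reading the character at index 21 would be the safer choice.

```python
def count_chains(lines):
    """Return the number of distinct fifth-field values among ATOM lines."""
    chains = set()
    for line in lines:
        fields = line.split()
        if len(fields) > 4 and fields[0] == "ATOM":
            chains.add(fields[4])
    return len(chains)


sample = [
    "ATOM 1384 O ARG A 260 80.505 -20.450 15.420 1.00 22.10 O",
    "ATOM 1385 CB ARG A 260 78.980 -18.077 15.207 1.00 36.88 C",
    "ATOM 1399 SD MET B 196 34.003 -52.544 16.664 1.00 57.16 S",
    "ATOM 1401 N ASP C 197 34.781 -50.611 12.007 1.00 44.30 N",
]
print("1uz3", count_chains(sample))
```

To get the pdb_id column without the extension, the file name could be passed through `os.path.splitext` before printing.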
