ATOM 856 CE ALYS A 104 0.809 0.146 26.161 0.54 29.14 C
ATOM 857 CE BLYS A 104 0.984 -0.018 26.394 0.46 31.19 C
ATOM 858 NZ ALYS A 104 1.988 0.923 26.662 0.54 33.17 N
ATOM 859 NZ BLYS A 104 1.708 0.302 27.659 0.46 37.61 N
ATOM 860 OXT LYS A 104 -0.726 -6.025 27.180 1.00 26.53 O
ATOM 862 N LYS B 276 17.010 -16.138 9.618 1.00 41.00 N
ATOM 863 CA LYS B 276 16.764 -16.524 11.005 1.00 31.05 C
ATOM 864 C LYS B 276 16.428 -15.306 11.884 1.00 26.93 C
ATOM 865 O LYS B 276 16.258 -15.447 13.090 1.00 29.67 O
ATOM 866 CB LYS B 276 17.863 -17.347 11.617 1.00 33.62 C
I have the above text file and need to split it into two files based on the character at position 21 of each line. I wrote a script that prints the required results, but it only works when I already know which character to look for. Suppose I don't know whether column 21 holds "A" and "B", or "B" and "G", or some other combination, and I still need to separate the lines on that character. How can I do this? This is the script I tried:
import sys

for fn in sys.argv[1:]:
    f = open(fn, 'r')
    while 1:
        line = f.readline()
        if not line:
            break
        if line[21:22] == 'B':
            chns = line[0:80]
            print(chns)
Storing the 21st character of the previous line and printing a blank line whenever the current line's character differs (which marks the start of another group) prints the lines grouped by their 21st character.
Note that this only groups lines whose matching 21st characters are adjacent in the file, so if the lines are not sorted, the same 21st character can end up in more than one separate group.
Modified file to show this case:
ATOM 856 CE ALYS A 104 0.809 0.146 26.161 0.54 29.14 C
ATOM 857 CE BLYS A 104 0.984 -0.018 26.394 0.46 31.19 C
ATOM 862 N LYS B 276 17.010 -16.138 9.618 1.00 41.00 N
ATOM 863 CA LYS B 276 16.764 -16.524 11.005 1.00 31.05 C
ATOM 864 C LYS B 276 16.428 -15.306 11.884 1.00 26.93 C
ATOM 865 O LYS B 276 16.258 -15.447 13.090 1.00 29.67 O
ATOM 866 CB LYS B 276 17.863 -17.347 11.617 1.00 33.62 C
ATOM 858 NZ ALYS A 104 1.988 0.923 26.662 0.54 33.17 N
ATOM 859 NZ BLYS A 104 1.708 0.302 27.659 0.46 37.61 N
ATOM 860 OXT LYS A 104 -0.726 -6.025 27.180 1.00 26.53 O
Code producing this case (without sorting the lines):
import sys

for fn in sys.argv[1:]:
    with open(fn, 'r') as file:
        prev = None
        for line in file:
            line = line.strip()
            if line[21:22] != prev:
                # blank-line separator before each new group
                print()
            print(line)
            prev = line[21:22]
A sample output showing this case:
ATOM 856 CE ALYS A 104 0.809 0.146 26.161 0.54 29.14 C
ATOM 857 CE BLYS A 104 0.984 -0.018 26.394 0.46 31.19 C

ATOM 862 N LYS B 276 17.010 -16.138 9.618 1.00 41.00 N
ATOM 863 CA LYS B 276 16.764 -16.524 11.005 1.00 31.05 C
ATOM 864 C LYS B 276 16.428 -15.306 11.884 1.00 26.93 C
ATOM 865 O LYS B 276 16.258 -15.447 13.090 1.00 29.67 O
ATOM 866 CB LYS B 276 17.863 -17.347 11.617 1.00 33.62 C

ATOM 858 NZ ALYS A 104 1.988 0.923 26.662 0.54 33.17 N
ATOM 859 NZ BLYS A 104 1.708 0.302 27.659 0.46 37.61 N
ATOM 860 OXT LYS A 104 -0.726 -6.025 27.180 1.00 26.53 O
So, if you want only one group per 21st character, put all the lines in a list and sort it with list.sort() first.
Code (sorting the lines first before grouping):
import sys

for fn in sys.argv[1:]:
    with open(fn, 'r') as file:
        lines = file.readlines()
    # build a list of (21st char, line) pairs
    lines = [[line[21:22], line.strip()] for line in lines]
    # sort the pairs on the key (the 21st char)
    lines.sort()
    # drop the keys again; the sorted order is kept
    lines = [pair[1] for pair in lines]
    prev = None
    for line in lines:
        if line[21:22] != prev:
            # blank-line separator before each new group
            print()
        print(line)
        prev = line[21:22]
Outputs to:
ATOM 856 CE ALYS A 104 0.809 0.146 26.161 0.54 29.14 C
ATOM 857 CE BLYS A 104 0.984 -0.018 26.394 0.46 31.19 C
ATOM 858 NZ ALYS A 104 1.988 0.923 26.662 0.54 33.17 N
ATOM 859 NZ BLYS A 104 1.708 0.302 27.659 0.46 37.61 N
ATOM 860 OXT LYS A 104 -0.726 -6.025 27.180 1.00 26.53 O

ATOM 862 N LYS B 276 17.010 -16.138 9.618 1.00 41.00 N
ATOM 863 CA LYS B 276 16.764 -16.524 11.005 1.00 31.05 C
ATOM 864 C LYS B 276 16.428 -15.306 11.884 1.00 26.93 C
ATOM 865 O LYS B 276 16.258 -15.447 13.090 1.00 29.67 O
ATOM 866 CB LYS B 276 17.863 -17.347 11.617 1.00 33.62 C
Edit:
Writing the grouped lines to different files doesn't actually require checking the previous line's value, because switching the filename on the 21st character already opens a new file, which separates the lines. I still track prev here so that a file of the same name left over from an earlier run is truncated rather than appended to, which could otherwise clutter the output or leave it inconsistent.
import sys

for fn in sys.argv[1:]:
    with open(fn, 'r') as file:
        lines = file.readlines()
    # build a list of (21st char, line) pairs
    lines = [[line[21:22], line] for line in lines]
    # sort the pairs on the key (the 21st char)
    lines.sort()
    # drop the keys again; the sorted order is kept
    lines = [pair[1] for pair in lines]
    filename = 'file'
    prev = None
    for line in lines:
        if line[21:22] != prev:
            # start a fresh file for the new group
            out = open(filename + line[21:22] + '.txt', 'w')
        else:
            # append to the group's existing file
            out = open(filename + line[21:22] + '.txt', 'a')
        out.write(line)
        out.close()
        prev = line[21:22]
The file-writing part can be simplified if appending to previously created files is not a problem. But then the script risks appending to a file of the same name that it did not create, or that it created during an earlier run.
filename = 'file'
for line in lines:
    out = open(filename + line[21:22] + '.txt', 'a')
    out.write(line)
    out.close()
Use str.split and compare the 5th element (i.e. the character at position 21):
while 1:
    line = f.readline()
    if not line:
        break
    # get the character in the 5th column
    ch = line.split()[4]
    if ch == 'B':
        chns = line[0:80]
        print(chns)
    else:  # not sure what the character is
        pass  # do something
You can just initialize a value to None and check whether it changes:
import sys

for fn in sys.argv[1:]:
    old = None
    f = open(fn, 'r')
    for line in f:
        if not line:
            break
        if (old is None) or (line[21] == old):
            old = line[21]
            chns = line[0:80]
            print(chns)
Not sure what you are trying to achieve, but the following code will group the lines from all files by their 21st character in the dictionary lines.
import sys

lines = dict()
for fn in sys.argv[1:]:
    f = open(fn, 'r')
    for line in f:
        if not line:
            break
        key = line.split()[4]
        if key not in lines:
            lines[key] = list()
        lines[key].append(line)
You can then get all 21st characters that occurred using lines.keys(), and get a list() with all corresponding lines from the dictionary.
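As a sketch of how that dictionary can then be used, each group can be written to its own file. The sample data and the 'group_' filename prefix here are made up for illustration; `lines` stands in for the dictionary built above.

```python
# Hypothetical stand-in for the dict built by the loop above.
lines = {
    'A': ['ATOM 858 NZ ALYS A 104 ...\n'],
    'B': ['ATOM 862 N LYS B 276 ...\n'],
}
# One output file per distinct 21st character (dictionary key).
for key, group in sorted(lines.items()):
    with open('group_' + key + '.txt', 'w') as out:
        out.writelines(group)
```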
Related
I have to read a file and count the number of atoms of each element, that is, count the occurrences of each line's last character.
For example, I have a file named 14ly.pdb with the following lines:
ATOM 211 N TYR A 27 4.697 8.290 -3.031 1.00 13.35 N
ATOM 212 CA TYR A 27 5.025 8.033 -1.616 0.51 11.29 C
ATOM 214 C TYR A 27 4.189 8.932 -0.730 1.00 10.87 C
ATOM 215 O TYR A 27 3.774 10.030 -1.101 1.00 12.90 O
I should get as a result: 'N':1, 'C':2, 'O':1, that is, 1 atom of type N, 2 of type C and 1 of type O.
I have the following incomplete code that I need to complete:
import os

def count_atoms(pdb_file_name):
    num_atoms = dict()
    with open(pdb_file_name) as file_content:
        ## Here should go the code I need ##
        pass
    return num_atoms
result = count_atoms('14ly.pdb')
print(result)
number_of_atoms = dict()
with open("14ly.pdb", "r") as f:
    for line in f:
        # split() without an argument splits on runs of whitespace
        # (split(" ") would yield empty strings between aligned columns)
        # and already strips the trailing newline
        last_char = line.split()[-1]
        if last_char in number_of_atoms:
            number_of_atoms[last_char] += 1
        else:
            number_of_atoms[last_char] = 1
print(number_of_atoms)
I think this should be enough
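The same tally can be had with collections.Counter. A minimal sketch over made-up lines (the `...` stands in for the coordinate columns):

```python
from collections import Counter

sample = [
    "ATOM 211 N  TYR A 27 ... 1.00 13.35 N",
    "ATOM 212 CA TYR A 27 ... 0.51 11.29 C",
    "ATOM 214 C  TYR A 27 ... 1.00 10.87 C",
    "ATOM 215 O  TYR A 27 ... 1.00 12.90 O",
]
# The element symbol is the last whitespace-separated field.
counts = Counter(line.split()[-1] for line in sample
                 if line.startswith("ATOM"))
print(counts)  # 2 C, 1 N, 1 O
```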
I am working with a file from the protein data bank which looks something like this.
SITE 2 AC1 15 ASN A 306 LEU A 309 ILE A 310 PHE A 313
SITE 3 AC1 15 ARG A 316 LEU A 326 ALA A 327 ILE A 345
SITE 4 AC1 15 CYS A 432 HIS A 435 HOH A 504
CRYST1 64.511 64.511 111.465 90.00 90.00 90.00 P 43 21 2 8
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX2 0.000000 1.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 0.015501 0.000000 0.000000 0.00000
SCALE2 0.000000 0.015501 0.000000 0.00000
SCALE3 0.000000 0.000000 0.008971 0.00000
ATOM 1 N ASP A 229 29.461 51.231 44.569 1.00 47.64 N
ATOM 2 CA ASP A 229 29.341 51.990 43.290 1.00 47.13 C
ATOM 3 C ASP A 229 30.455 51.566 42.330 1.00 45.62 C
ATOM 4 O ASP A 229 31.598 51.376 42.743 1.00 47.18 O
ATOM 5 CB ASP A 229 29.433 53.493 43.567 1.00 49.27 C
ATOM 6 CG ASP A 229 28.817 54.329 42.463 1.00 51.26 C
ATOM 7 OD1 ASP A 229 27.603 54.172 42.206 1.00 53.47 O
ATOM 8 OD2 ASP A 229 29.542 55.145 41.856 1.00 52.96 O
ATOM 9 N MET A 230 30.119 51.424 41.051 1.00 41.99 N
ATOM 10 CA MET A 230 31.092 51.004 40.043 1.00 36.38 C
First I needed to extract only the fourth column of the rows labeled ATOM, which is the amino acid that the specific atom is part of. I have done that here:
import gzip

class Manual_Seq:
    def parseSeq(self, path):
        # 'rt' gives text rather than bytes from the gzip stream
        with gzip.open(path, 'rt') as file_content:
            for line in file_content:
                newLine = line.split(' ')[0]
                if newLine == 'ATOM':
                    AA = line[17:20]
                    print(AA)
Which produces an output of this
ASP
ASP
ASP
.....
MET
But what I need now, is to output only the first ASP and the first MET and etc and concatenate them so it'll look like this.
ASPMET
I was thinking maybe I'd iterate one line ahead and compare until it differs from the first output, but I am unsure how to do this. If you have any other ideas or improvements to my code, please feel free to suggest them, thanks.
I also need to mention that there can in fact be two identical amino acids in one file so the output could be "ASP MET ASP"
Instead of printing them, make a list, so
print AA
Becomes
my_list.append(AA)
Just don't forget to initialize the list before the loop with my_list=[]
Now that you have all those values, you can loop through them and build a string out of the unique values. If the order doesn't matter to you, you can use set like this:
my_string = ''.join(set(my_list))
But if the order is important, you have to loop through that list:
my_string = ''
seen = []
for item in my_list:
    if item not in seen:
        seen.append(item)
        my_string += item
You could do it without the seen list by checking my_string directly, but a code could then be skipped because it happens to appear as a substring across two concatenated codes, so that would be risky.
Anyway, all of that loops twice over the same data, which isn't needed. Instead, you could initialize my_string = '' and seen = [] before your main loop and do the same work there in place of the print call. That would look like this:
def parseSeq(self, path):
    with gzip.open(path, 'rt') as file_content:
        my_string = ''
        seen = []
        for line in file_content:
            newLine = line.split(' ')[0]
            if newLine == 'ATOM':
                AA = line[17:20]
                if AA not in seen:
                    seen.append(AA)
                    my_string += AA
        return my_string  # or print(my_string)
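Note the asker's follow-up that the same amino acid can legitimately reappear later in the file. If only consecutive duplicates should collapse (the "ASP MET ASP" case), itertools.groupby does exactly that; a sketch on made-up residues:

```python
from itertools import groupby

residues = ['ASP', 'ASP', 'ASP', 'MET', 'MET', 'ASP']
# groupby batches equal *consecutive* items only, so a residue that
# reappears later in the file still shows up again in the result.
sequence = ''.join(key for key, _ in groupby(residues))
print(sequence)  # ASPMETASP
```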
I added a bit of code to your existing code:
import gzip

class Manual_Seq:
    def parseSeq(self, path):
        with gzip.open(path, 'rt') as file_content:
Here we define an empty list, called AAs, to hold your amino acids:
            AAs = []
            for line in file_content:
Next, I generalized your code a bit to split the line into fields, so that we can extract various fields as needed. Note the argument-less split(), which splits on runs of whitespace; split(' ') would produce empty strings between the aligned columns:
                fields = line.split()
                line_index = fields[0]
                if line_index == 'ATOM':
Here we check whether the amino acid is already in the list of amino acids; if not, we add it. This has the effect of deduplicating the amino acids:
                    if fields[3] not in AAs:
                        AAs.append(fields[3])
Lastly, we concatenate all the values into a single string using the empty string '' and the join() method:
        return ''.join(AAs)
Just wondering, have you considered using BioPandas?
https://rasbt.github.io/biopandas/tutorials/Working_with_PDB_Structures_in_DataFrames/
It should be easier to do what you want using pandas.
You just need to use:
df.column_name.unique()
and then concatenate the strings in the list using "".join(list_name)
https://docs.python.org/3/library/stdtypes.html#str.join
I got a data format like:
ATOM 124 N GLU B 12
ATOM 125 O GLU B 12
ATOM 126 OE1 GLU B 12
ATOM 127 C GLU B 12
ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14
ATOM 133 C GLU B 15
ATOM 134 CA GLU B 15
ATOM 135 OE2 GLU B 15
ATOM 136 O GLU B 15
.....100+ lines
From here, I want to filter this data based on col[5] (counting columns from 0) and col[2]. For a given value of col[5], if only one of OE1 or OE2 occurs, that residue's rows should be discarded; if both OE1 and OE2 are present, the rows should be kept.
The desired data after filtering:
ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14
I have tried building search strings like:
for item in stored_list:
    search_str_a = 'OE1' + item[3] + item[4] + item[5]
    search_str_b = 'OE2' + item[3] + item[4] + item[5]
    target_str = item[2] + item[3] + item[4] + item[5]
This helps keep the rest of the columns fixed while searching for OE1 or OE2, but it does not help me filter out a residue when one of them (or both) is missing.
Any ideas would be really nice here.
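One way to frame the task without any library: group the rows by col[5] first, then keep a group only when both OE1 and OE2 occur in it. A minimal sketch on an abbreviated copy of the sample data:

```python
from collections import defaultdict

data = """\
ATOM 126 OE1 GLU B 12
ATOM 127 C GLU B 12
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14
ATOM 135 OE2 GLU B 15
"""
rows = [line.split() for line in data.splitlines()]

# group rows by the residue number in col[5]
groups = defaultdict(list)
for row in rows:
    groups[row[5]].append(row)

# keep a residue's rows only if both OE1 and OE2 appear in its group
kept = [row
        for grp in groups.values()
        if {'OE1', 'OE2'} <= {r[2] for r in grp}
        for row in grp]
print(kept)  # only the residue-14 rows survive
```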
The code below needs pandas.
You can install it following http://pandas.pydata.org/pandas-docs/stable/install.html
import pandas as pd

file_read_path = "give here source file path"
# sep=r"\s+" handles the runs of spaces between the aligned columns
df = pd.read_csv(file_read_path, sep=r"\s+",
                 names=["col0", "col1", "col2", "col3", "col4", "col5"])
group_series = df.groupby("col5")["col2"].apply(', '.join)
filtered_list = []
for index in group_series.index:
    str_col2_group = group_series[index]
    if "OE1" in str_col2_group and "OE2" in str_col2_group:
        filtered_list.append(index)
df = df[df.col5.isin(filtered_list)]
output_file_path = "give here output file path"
df.to_csv(output_file_path, sep=" ", index=False, header=False)
This would also be helpful: http://pandas.pydata.org/pandas-docs/stable/tutorials.html
Output result
ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14
Using csv, which comes with Python:
import csv
import operator

file_read_path = "give here source file path"
with open(file_read_path) as f_pdb:
    # skipinitialspace tolerates the runs of spaces between columns
    rdr = csv.DictReader(f_pdb, delimiter=' ', skipinitialspace=True,
                         fieldnames=["col0", "col1", "col2", "col3", "col4", "col5"])
    sorted_bio = sorted(rdr, key=operator.itemgetter('col5'), reverse=False)

col5_tmp = None
tmp_list = []
perm_list = []
tmp_str = ""
for row in sorted_bio:
    col5_v = row["col5"]
    if col5_v != col5_tmp:
        # a new col5 group starts: flush the previous one if it qualifies
        if "OE1" in tmp_str and "OE2" in tmp_str:
            perm_list.extend(tmp_list)
        tmp_list = []
        tmp_str = ""
        col5_tmp = col5_v
    tmp_list.append(row)
    tmp_str = tmp_str + "," + row["col2"]
# flush the final group; the loop above only flushes when col5 changes
if "OE1" in tmp_str and "OE2" in tmp_str:
    perm_list.extend(tmp_list)

csv_file = open("give here output file path", "w")
dict_writer = csv.DictWriter(csv_file, delimiter=' ',
                             fieldnames=["col0", "col1", "col2", "col3", "col4", "col5"])
for row in perm_list:
    dict_writer.writerow(row)
csv_file.close()
I often parse formatted text files using Python (for biology research, but I'll try and ask my question in a way you won't need biology background.) I deal with a type of file -called a pdb file- that contains 3D structure of a protein in a formatted text. This is an example:
HEADER CHROMOSOMAL PROTEIN 02-JAN-87 1UBQ
TITLE STRUCTURE OF UBIQUITIN REFINED AT 1.8 ANGSTROMS RESOLUTION
REMARK 1
REMARK 1 REFERENCE 1
REMARK 1 AUTH S.VIJAY-KUMAR,C.E.BUGG,K.D.WILKINSON,R.D.VIERSTRA,
REMARK 1 AUTH 2 P.M.HATFIELD,W.J.COOK
REMARK 1 TITL COMPARISON OF THE THREE-DIMENSIONAL STRUCTURES OF HUMAN,
REMARK 1 TITL 2 YEAST, AND OAT UBIQUITIN
REMARK 1 REF J.BIOL.CHEM. V. 262 6396 1987
REMARK 1 REFN ISSN 0021-9258
ATOM 1 N MET A 1 27.340 24.430 2.614 1.00 9.67 N
ATOM 2 CA MET A 1 26.266 25.413 2.842 1.00 10.38 C
ATOM 3 C MET A 1 26.913 26.639 3.531 1.00 9.62 C
ATOM 4 O MET A 1 27.886 26.463 4.263 1.00 9.62 O
ATOM 5 CB MET A 1 25.112 24.880 3.649 1.00 13.77 C
ATOM 6 CG MET A 1 25.353 24.860 5.134 1.00 16.29 C
ATOM 7 SD MET A 1 23.930 23.959 5.904 1.00 17.17 S
ATOM 8 CE MET A 1 24.447 23.984 7.620 1.00 16.11 C
ATOM 9 N GLN A 2 26.335 27.770 3.258 1.00 9.27 N
ATOM 10 CA GLN A 2 26.850 29.021 3.898 1.00 9.07 C
ATOM 11 C GLN A 2 26.100 29.253 5.202 1.00 8.72 C
ATOM 12 O GLN A 2 24.865 29.024 5.330 1.00 8.22 O
ATOM 13 CB GLN A 2 26.733 30.148 2.905 1.00 14.46 C
ATOM 14 CG GLN A 2 26.882 31.546 3.409 1.00 17.01 C
ATOM 15 CD GLN A 2 26.786 32.562 2.270 1.00 20.10 C
ATOM 16 OE1 GLN A 2 27.783 33.160 1.870 1.00 21.89 O
ATOM 17 NE2 GLN A 2 25.562 32.733 1.806 1.00 19.49 N
ATOM 18 N ILE A 3 26.849 29.656 6.217 1.00 5.87 N
ATOM 19 CA ILE A 3 26.235 30.058 7.497 1.00 5.07 C
ATOM 20 C ILE A 3 26.882 31.428 7.862 1.00 4.01 C
ATOM 21 O ILE A 3 27.906 31.711 7.264 1.00 4.61 O
ATOM 22 CB ILE A 3 26.344 29.050 8.645 1.00 6.55 C
ATOM 23 CG1 ILE A 3 27.810 28.748 8.999 1.00 4.72 C
ATOM 24 CG2 ILE A 3 25.491 27.771 8.287 1.00 5.58 C
ATOM 25 CD1 ILE A 3 27.967 28.087 10.417 1.00 10.83 C
TER 26 ILE A 3
HETATM 604 O HOH A 77 45.747 30.081 19.708 1.00 12.43 O
HETATM 605 O HOH A 78 19.168 31.868 17.050 1.00 12.65 O
HETATM 606 O HOH A 79 32.010 38.387 19.636 1.00 12.83 O
HETATM 607 O HOH A 80 42.084 27.361 21.953 1.00 22.27 O
END
ATOM marks the beginning of a line that contains atomic coordinates, and TER marks the end of the coordinates. I want to take the whole block of text that contains the atomic coordinates, so I use:
import re

f = open('example.pdb', 'r+')
content = f.read()
coor = re.search('ATOM.*TER', content)  # take everything between ATOM and TER
But it matches nothing. There must be a way of taking a whole block of text by using regex. I also don't understand why this regex pattern doesn't work. Help is appreciated.
This should match (but I haven't actually tested it):
coor = re.search('ATOM.*TER', content, re.DOTALL)
If you read the documentation on DOTALL, you will understand why it wasn't working.
A still better way of writing the above is
coor = re.search(r'^ATOM.*^TER', content, re.MULTILINE | re.DOTALL)
where it is required that ATOM and TER come after newlines, and where raw string notation is being used, which is customary for regular expressions (though it won't make any difference in this case).
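A quick check of the difference DOTALL makes, on a minimal made-up stand-in for the file contents:

```python
import re

text = "HEADER x\nATOM 1\nATOM 2\nTER 3\nEND\n"
# Without DOTALL, '.' refuses to cross newlines, so this finds nothing:
# no single line contains both ATOM and TER.
plain = re.search('ATOM.*TER', text)
# With DOTALL, '.' matches newlines too and the whole block is captured.
dotall = re.search('ATOM.*TER', text, re.DOTALL)
print(plain)             # None
print(dotall.group(0))   # ATOM 1\nATOM 2\nTER
```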
You could also avoid regular expressions altogether:
start = content.index('\nATOM')
end = content.index('\nTER', start)
coor = content[start:end]
(This will actually not include the TER in the result, which may be better).
You need the (?s) modifier:
import re

f = open('example.pdb', 'r')   # note: 'w+' would truncate the file
content = f.read()
coor = re.search('(?s)ATOM.*TER', content)
print(coor.group())
This way .* matches everything, newlines included.
Note that if you only need anything in between (ATOM inclusive, TER exclusive), just change to a positive lookahead for TER:
'(?s)ATOM.*(?=TER)'
import re

pattern = re.compile(r"ATOM(.*?)TER", re.DOTALL)  # DOTALL lets . span newlines
print(pattern.findall(string))
This should do it.
I wouldn't use a regex here; itertools' dropwhile and takewhile are more efficient than loading the entire file into memory to run a regex over it (we ignore the start of the file until ATOM, and we stop reading the file after encountering TER).
from itertools import dropwhile, takewhile

with open('example.pdb') as fin:
    until_atom = dropwhile(lambda L: not L.startswith('ATOM'), fin)
    atoms = takewhile(lambda L: L.startswith('ATOM'), until_atom)
    for atom in atoms:
        print(atom, end='')
So this ignores lines while they don't start with ATOM, then keeps taking lines from that point while they still start with ATOM. You could change that condition to be lambda L: not L.startswith('TER') if you want.
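For instance, with the TER condition instead, on a minimal made-up list standing in for the file:

```python
from itertools import dropwhile, takewhile

lines = ["HEADER x\n", "ATOM 1\n", "ATOM 2\n", "TER 3\n", "END\n"]
until_atom = dropwhile(lambda L: not L.startswith('ATOM'), lines)
# keep everything up to (not including) the first TER line
atoms = takewhile(lambda L: not L.startswith('TER'), until_atom)
block = ''.join(atoms)
print(block)  # ATOM 1\nATOM 2\n
```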
Instead of printing, you could use:
all_atom_text = ''.join(atoms)
to get one large text block.
How about a non-regular-expression alternative? It can be achieved with a relatively simple loop and a little bit of state. Example:
# Gather all ATOM-TER sets in all_coors (if there are multiple per file).
all_coors = []
f = open('example.pdb', 'r')   # note: 'w+' would truncate the file
coor = None
in_atom = False
for line in f:
    if not in_atom and line.startswith('ATOM'):
        # Found first ATOM, start collecting results.
        in_atom = True
        coor = []
    elif in_atom and line.startswith('TER'):
        # Found TER, stop collecting results.
        in_atom = False
        # Save collected results.
        all_coors.append(''.join(coor))
        coor = None
    if in_atom:
        # Collect ATOM result.
        coor.append(line)
I have PDB (text) files in a directory, and I would like to print the number of subunits in each PDB file.
Read all lines in a pdb file that start with ATOM.
The fifth column of the ATOM line contains A, B, C, D, etc.
If it contains only A, the number of subunits is 1. If it contains A and B, the number of subunits is 2. If it contains A, B, and C, the number of subunits is 3.
1kg2.pdb file
ATOM 1363 N ASN A 258 82.149 -23.468 9.733 1.00 57.80 N
ATOM 1364 CA ASN A 258 82.494 -22.084 9.356 1.00 62.98 C
ATOM 1395 C MET B 196 34.816 -51.911 11.750 1.00 49.79 C
ATOM 1396 O MET B 196 35.611 -52.439 10.963 1.00 47.65 O
1uz3.pdb file
ATOM 1384 O ARG A 260 80.505 -20.450 15.420 1.00 22.10 O
ATOM 1385 CB ARG A 260 78.980 -18.077 15.207 1.00 36.88 C
ATOM 1399 SD MET B 196 34.003 -52.544 16.664 1.00 57.16 S
ATOM 1401 N ASP C 197 34.781 -50.611 12.007 1.00 44.30 N
2b69.pdb file
ATOM 1393 N MET B 196 33.300 -54.017 12.033 1.00 46.46 N
ATOM 1394 CA MET B 196 33.782 -52.714 12.566 1.00 49.99 C
desired output
pdb_id subunits
1kg2 2
1uz3 3
2b69 1
How can I do this with awk, python or Biopython?
You can use an array to record all seen values for the fifth column.
$ gawk '/^ATOM/ {seen[$5] = 1} END {print length(seen)}' 1kg2.pdb
2
Edit: Using gawk 4.x you can use ENDFILE to generate the required output:
BEGIN {
    print "pdb_id\t\tsubunits"
    print
}

/^ATOM/ {
    seen[$5] = 1
}

ENDFILE {
    print FILENAME, "\t", length(seen)
    delete seen
}
The result:
$ gawk -f pdb.awk 1kg2.pdb 1uz3.pdb 2b69.pdb
pdb_id subunits
1kg2.pdb 2
1uz3.pdb 3
2b69.pdb 1
A dictionary is one way to count unique occurrences. The following assigns a meaningless value (0) to each subunit, since all you care about is the number of unique subunits (the dictionary keys).
import os

for fn in os.listdir():
    if fn.endswith(".pdb"):
        sub = {}
        with open(fn, 'r') as f:
            for line in f:
                c = line.split()
                if len(c) > 5 and c[0] == "ATOM":
                    sub[c[4]] = 0
        print(fn, len(sub))