Comparing two words from different lines in a file using Python

I am working with a file from the protein data bank which looks something like this.
SITE 2 AC1 15 ASN A 306 LEU A 309 ILE A 310 PHE A 313
SITE 3 AC1 15 ARG A 316 LEU A 326 ALA A 327 ILE A 345
SITE 4 AC1 15 CYS A 432 HIS A 435 HOH A 504
CRYST1 64.511 64.511 111.465 90.00 90.00 90.00 P 43 21 2 8
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX2 0.000000 1.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 0.015501 0.000000 0.000000 0.00000
SCALE2 0.000000 0.015501 0.000000 0.00000
SCALE3 0.000000 0.000000 0.008971 0.00000
ATOM 1 N ASP A 229 29.461 51.231 44.569 1.00 47.64 N
ATOM 2 CA ASP A 229 29.341 51.990 43.290 1.00 47.13 C
ATOM 3 C ASP A 229 30.455 51.566 42.330 1.00 45.62 C
ATOM 4 O ASP A 229 31.598 51.376 42.743 1.00 47.18 O
ATOM 5 CB ASP A 229 29.433 53.493 43.567 1.00 49.27 C
ATOM 6 CG ASP A 229 28.817 54.329 42.463 1.00 51.26 C
ATOM 7 OD1 ASP A 229 27.603 54.172 42.206 1.00 53.47 O
ATOM 8 OD2 ASP A 229 29.542 55.145 41.856 1.00 52.96 O
ATOM 9 N MET A 230 30.119 51.424 41.051 1.00 41.99 N
ATOM 10 CA MET A 230 31.092 51.004 40.043 1.00 36.38 C
First I needed to extract only the fourth column of the rows labeled ATOM, which holds the amino acid residue that specific atom is a part of. I have done that here.
import gzip

class Manual_Seq:
    def parseSeq(self, path):
        with gzip.open(path, 'r') as file_content:
            for line in file_content:
                newLine = line.split(' ')[0]
                if newLine == 'ATOM':
                    AA = line[17] + line[18] + line[19]
                    print AA
Which produces output like this
ASP
ASP
ASP
.....
MET
But what I need now is to output only the first ASP, the first MET, and so on, and concatenate them so it'll look like this.
ASPMET
I was thinking maybe I'd iterate ahead one line and compare until it differs from the first output, but I am unsure how to do this. If you have any other ideas or any improvements to my code, please feel free to submit your suggestions. Thanks.
I also need to mention that there can in fact be two identical amino acids in one file, so the output could be "ASP MET ASP".

Instead of printing them, make a list, so

    print AA

becomes

    my_list.append(AA)

Just don't forget to initialize the list before the loop with my_list = []
Now that you have all those values, you can loop through them and make a string out of the unique values. If the order doesn't matter to you, then you can use set like this:
my_string = ''.join(set(my_list))
But if the order is important, you have to loop through that list:
my_string = ''
seen = []
for item in my_list:
    if item not in seen:
        seen.append(item)
        my_string += item
You could do it without the seen list (by checking my_string directly), but that would be risky: a residue name could appear as a substring spanning the boundary between two concatenated names.
Anyway, all that means you are looping twice over the same data, which is not needed. Instead, you could initialize my_string = '' and seen = [] before your main loop, and do what I did inside your loop instead of print AA. That would look like this:
def parseSeq(self, path):
    with gzip.open(path, 'r') as file_content:
        my_string = ''
        seen = []
        for line in file_content:
            newLine = line.split(' ')[0]
            if newLine == 'ATOM':
                AA = line[17] + line[18] + line[19]
                if AA not in seen:
                    seen.append(AA)
                    my_string += AA
        return my_string  # or print my_string
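Note that the seen-list approach keeps only the first occurrence of each residue globally, so a sequence like ASP, MET, ASP would come out as ASPMET. Since the question mentions that the same amino acid can legitimately reappear, collapsing only consecutive repeats with itertools.groupby may be closer to what is wanted. A sketch:

```python
from itertools import groupby

def residue_sequence(residues):
    # groupby() yields one (name, run) pair per run of equal adjacent
    # values, so joining the names collapses consecutive duplicates only
    return ''.join(name for name, _ in groupby(residues))

print(residue_sequence(['ASP', 'ASP', 'ASP', 'MET', 'MET', 'ASP']))  # ASPMETASP
```

This keeps the second ASP because it starts a new run, unlike the seen-list version.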

I added a bit of code to your existing code:

import gzip

class Manual_Seq:
    def parseSeq(self, path):
        with gzip.open(path, 'r') as file_content:
            # Here we define an empty list, called AAs, to hold your amino acids.
            AAs = []
            for line in file_content:
                # Next, I generalized your code a bit to split the line into
                # fields so that we can extract various fields, as needed.
                fields = line.split(' ')
                line_index = fields[0]
                if line_index == 'ATOM':
                    # Here we check whether the amino acid is already in the
                    # list of amino acids; if not, we add it. This has the
                    # effect of deduplicating the amino acids.
                    if fields[3] not in AAs:
                        AAs.append(fields[3])
            # Lastly, we concatenate all the values into a single string
            # using the empty string '' and the join() method.
            return ''.join(AAs)

Just wondering, did you consider using BioPandas?
https://rasbt.github.io/biopandas/tutorials/Working_with_PDB_Structures_in_DataFrames/
It should be easier to do what you want to do using pandas.
You just need to use:
df.column_name.unique()
and then concatenate the strings in the list using "".join(list_name)
https://docs.python.org/3/library/stdtypes.html#str.join
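The unique-then-join idea can be sketched with plain pandas; the toy DataFrame below stands in for the ATOM records BioPandas would load, and the residue_name column name follows the linked tutorial, so treat both as assumptions:

```python
import pandas as pd

# hypothetical stand-in for ppdb.df['ATOM'] from BioPandas
df = pd.DataFrame({'residue_name': ['ASP', 'ASP', 'MET', 'MET']})

# unique() preserves the order of first appearance
sequence = ''.join(df['residue_name'].unique())
print(sequence)  # ASPMET
```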

Related

Count the number of atoms of each element

I have to access a file and count the number of atoms of each element, that is, count the occurrences of each element symbol in the last column.
For example, I have a file named 14ly.pdb with the following lines:
ATOM 211 N TYR A 27 4.697 8.290 -3.031 1.00 13.35 N
ATOM 212 CA TYR A 27 5.025 8.033 -1.616 0.51 11.29 C
ATOM 214 C TYR A 27 4.189 8.932 -0.730 1.00 10.87 C
ATOM 215 O TYR A 27 3.774 10.030 -1.101 1.00 12.90 O
I should get as a result: 'N':1, 'C':2, 'O':1, that is, 1 atom of type N, 2 of type C and 1 of type O.
I have the following incomplete code that I need to complete:
import os

def count_atoms(pdb_file_name):
    num_atoms = dict()
    with open(pdb_file_name) as file_content:
        ##Here should go the code I need##
    return num_atoms

result = count_atoms('14ly.pdb')
print(result)
number_of_atoms = dict()
with open("14ly.pdb", "r") as f:
    for line in f:
        line_words = line.split(" ")
        last_char = line_words[-1].rstrip('\n')
        if last_char in number_of_atoms.keys():
            number_of_atoms[last_char] += 1
        else:
            number_of_atoms[last_char] = 1
print(number_of_atoms)
I think this should be enough
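The same tally can be done in one pass with collections.Counter. A sketch using the sample lines from the question:

```python
from collections import Counter

pdb_lines = [
    "ATOM 211 N TYR A 27 4.697 8.290 -3.031 1.00 13.35 N",
    "ATOM 212 CA TYR A 27 5.025 8.033 -1.616 0.51 11.29 C",
    "ATOM 214 C TYR A 27 4.189 8.932 -0.730 1.00 10.87 C",
    "ATOM 215 O TYR A 27 3.774 10.030 -1.101 1.00 12.90 O",
]

# split() without arguments handles runs of whitespace; the element
# symbol is the last field of each ATOM record
num_atoms = Counter(line.split()[-1]
                    for line in pdb_lines if line.startswith("ATOM"))
print(dict(num_atoms))  # {'N': 1, 'C': 2, 'O': 1}
```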

String substitution with regex or regular Python?

I have a list of strings like the following
orig = ["a1 2.3 ABC 4 DEFG 567 b890",
        "a2 3.0 HI 4 5 JKL 67 c65",
        "b1 1.2 MNOP 3 45 67 89 QR 987 d64 e112"]
Context here is that this is a CSV file and certain columns are omitted. I don't think that the pandas csv reader can handle these cases. The idea is now to inject na for the missing values, so the output becomes
corr = ["a1 2.3 ABC 4 na na na DEFG 567 b890",
        "a2 3.0 HI 4 5 na na JKL 67 c65",
        "b1 1.2 MNOP 3 45 67 89 QR 987 d64 e112"]
to align the second column with capitalised words later on, when imported in pandas.
The structure is the following: delimiters between columns are two or more whitespace characters, and there have to be four values between the two upper-case columns. In the original file there are always exactly two upper-case columns, with at least one and at most four numbers between them, and only number values between these upper-case words.
I can write a script in native Python without problem, so please no suggestions for that. But I thought this might be a case for regex. As a regex beginner, I only managed to extract the string between the two upper case columns with
for line in orig:
    a = re.findall(r"([A-Z]+[\s\d]+[A-Z]+)", line)
    print(a)
>>> 'ABC 4 DEFG' # etc pp
Is there now an easy way in regex to determine, how many numbers are between the upper case words and insert 'na' values to have always four values in between? Or should I do it in native Python?
Of course, if there is a way to do this with the pandas csv reader, that would be even better. But I studied pandas csv_reader docs and haven't found anything useful.
Based on a complete pandas approach, split and concat might help, i.e.
ndf = pd.Series(orig).str.split(expand=True)
#     0    1     2  3     4    5     6     7     8     9    10
# 0  a1  2.3   ABC  4  DEFG  567  b890  None  None  None  None
# 1  a2  3.0    HI  4     5  JKL    67   c65  None  None  None
# 2  b1  1.2  MNOP  3    45   67    89    QR   987   d64  e112

df = pd.concat([ndf.iloc[:, :4],
                ndf.iloc[:, 4:].apply(sorted, key=pd.notnull, axis=1)], 1)
df.astype(str).apply(' '.join, axis=1).tolist()

['a1 2.3 ABC 4 None None None None DEFG 567 b890',
 'a2 3.0 HI 4 None None None 5 JKL 67 c65',
 'b1 1.2 MNOP 3 45 67 89 QR 987 d64 e112']
Though the consensus seems to be that regex is not the best tool for such a dynamic string substitution, I found the re module quite comfortable to use in this context. The capturing pattern is based on a comment by Jon Clements.
import re

orig = ["a1 2.3 ABC 4 DEFG 567 b890",
        "a2 3.0 HI 4 5 JKL 67 c65",
        "b1 1.2 MNOP 3 45 67 89 QR 987 d64 e112"]

corr = []
for item in orig:
    # capture group starting with the first capitalised word and
    # stopping before the second
    col_betw = re.search(r"\s{2,}([A-Z]+.*)\s{2,}[A-Z]+\s{2,}", item).group(1)
    # determine how many elements we have in this segment
    nr_col_betw = len(re.split(r"\s{2,}", col_betw))
    # substitute, if not enough numbers
    if nr_col_betw <= 4:
        # fill with NA, which is interpreted by the pandas csv reader as NaN
        subst = col_betw + " NA" * (5 - nr_col_betw)
        item = item.replace(col_betw, subst, 1)
    corr.append(item)
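An alternative is to let re.sub do the padding through a replacement function; the pattern below captures both capitalised words plus the numbers between them. This is only a sketch, assuming the two-or-more-space delimiters described in the question:

```python
import re

orig = ["a1  2.3  ABC  4  DEFG  567  b890",
        "a2  3.0  HI  4  5  JKL  67  c65",
        "b1  1.2  MNOP  3  45  67  89  QR  987  d64  e112"]

# first capitalised word, the numbers in between, second capitalised word
pattern = re.compile(r"([A-Z]+)\s{2,}((?:[\d.]+\s{2,})*[\d.]+)\s{2,}([A-Z]+)")

def pad(match):
    # pad the numbers between the two capitalised words to four values
    nums = re.split(r"\s{2,}", match.group(2))
    nums += ["NA"] * (4 - len(nums))
    return match.group(1) + "  " + "  ".join(nums) + "  " + match.group(3)

corr = [pattern.sub(pad, line) for line in orig]
```

With a replacement function, re.sub rebuilds only the matched segment, so the rest of the line is untouched.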

file separation on the basis of matching character

ATOM 856 CE ALYS A 104 0.809 0.146 26.161 0.54 29.14 C
ATOM 857 CE BLYS A 104 0.984 -0.018 26.394 0.46 31.19 C
ATOM 858 NZ ALYS A 104 1.988 0.923 26.662 0.54 33.17 N
ATOM 859 NZ BLYS A 104 1.708 0.302 27.659 0.46 37.61 N
ATOM 860 OXT LYS A 104 -0.726 -6.025 27.180 1.00 26.53 O
ATOM 862 N LYS B 276 17.010 -16.138 9.618 1.00 41.00 N
ATOM 863 CA LYS B 276 16.764 -16.524 11.005 1.00 31.05 C
ATOM 864 C LYS B 276 16.428 -15.306 11.884 1.00 26.93 C
ATOM 865 O LYS B 276 16.258 -15.447 13.090 1.00 29.67 O
ATOM 866 CB LYS B 276 17.863 -17.347 11.617 1.00 33.62 C
I have the above text file and need to make two text files based on differences at position 21 in each line. I wrote a script which can print the required results, shown below. But how can I do this job if I do not know what character is at column 21? Suppose I do not know whether column 21 holds "A" and "B", or "B" and "G", or any other combination, and I need to separate the file on the basis of column 21. How can I do this?
import sys

for fn in sys.argv[1:]:
    f = open(fn, 'r')
    while 1:
        line = f.readline()
        if not line: break
        if line[21:22] == 'B':
            chns = line[0:80]
            print chns
Storing the 21st character of the previous line, then printing a blank line at every non-match (which marks the start of another group of same letters), prints the lines grouped by their 21st character.
Take note that it only groups lines with matching 21st character based on the line sequence in the file, which means non-sorted lines will have more than one separated groups of same 21st character.
Modified file to show this case:
ATOM 856 CE ALYS A 104 0.809 0.146 26.161 0.54 29.14 C
ATOM 857 CE BLYS A 104 0.984 -0.018 26.394 0.46 31.19 C
ATOM 862 N LYS B 276 17.010 -16.138 9.618 1.00 41.00 N
ATOM 863 CA LYS B 276 16.764 -16.524 11.005 1.00 31.05 C
ATOM 864 C LYS B 276 16.428 -15.306 11.884 1.00 26.93 C
ATOM 865 O LYS B 276 16.258 -15.447 13.090 1.00 29.67 O
ATOM 866 CB LYS B 276 17.863 -17.347 11.617 1.00 33.62 C
ATOM 858 NZ ALYS A 104 1.988 0.923 26.662 0.54 33.17 N
ATOM 859 NZ BLYS A 104 1.708 0.302 27.659 0.46 37.61 N
ATOM 860 OXT LYS A 104 -0.726 -6.025 27.180 1.00 26.53 O
Code producing this case (without sorting the lines):
import sys

for fn in sys.argv[1:]:
    with open(fn, 'r') as file:
        prev = 0
        for line in file:
            line = line.strip()
            if line[21:22] != prev:
                # new line separator for each group
                print ''
            print line
            prev = line[21:22]
A sample output showing this case:
ATOM 856 CE ALYS A 104 0.809 0.146 26.161 0.54 29.14 C
ATOM 857 CE BLYS A 104 0.984 -0.018 26.394 0.46 31.19 C
ATOM 862 N LYS B 276 17.010 -16.138 9.618 1.00 41.00 N
ATOM 863 CA LYS B 276 16.764 -16.524 11.005 1.00 31.05 C
ATOM 864 C LYS B 276 16.428 -15.306 11.884 1.00 26.93 C
ATOM 865 O LYS B 276 16.258 -15.447 13.090 1.00 29.67 O
ATOM 866 CB LYS B 276 17.863 -17.347 11.617 1.00 33.62 C
ATOM 858 NZ ALYS A 104 1.988 0.923 26.662 0.54 33.17 N
ATOM 859 NZ BLYS A 104 1.708 0.302 27.659 0.46 37.61 N
ATOM 860 OXT LYS A 104 -0.726 -6.025 27.180 1.00 26.53 O
So, if you want only one group for each same 21st character, putting all the lines in a list and sorting it using list.sort() will do.
Code (sorting the lines first before grouping):
import sys

for fn in sys.argv[1:]:
    with open(fn, 'r') as file:
        lines = file.readlines()
        # creates a list of pairs (21st char, line) within a list
        lines = [[line[21:22], line.strip()] for line in lines]
        # sorts lines based on key (21st char)
        lines.sort()
        # brings the list of lines back to its original state,
        # but the order is not reverted since it is already sorted
        lines = [line[1] for line in lines]
        prev = 0
        for line in lines:
            if line[21:22] != prev:
                # new line separator for each group
                print ''
            print line
            prev = line[21:22]
Outputs to:
ATOM 856 CE ALYS A 104 0.809 0.146 26.161 0.54 29.14 C
ATOM 857 CE BLYS A 104 0.984 -0.018 26.394 0.46 31.19 C
ATOM 858 NZ ALYS A 104 1.988 0.923 26.662 0.54 33.17 N
ATOM 859 NZ BLYS A 104 1.708 0.302 27.659 0.46 37.61 N
ATOM 860 OXT LYS A 104 -0.726 -6.025 27.180 1.00 26.53 O
ATOM 862 N LYS B 276 17.010 -16.138 9.618 1.00 41.00 N
ATOM 863 CA LYS B 276 16.764 -16.524 11.005 1.00 31.05 C
ATOM 864 C LYS B 276 16.428 -15.306 11.884 1.00 26.93 C
ATOM 865 O LYS B 276 16.258 -15.447 13.090 1.00 29.67 O
ATOM 866 CB LYS B 276 17.863 -17.347 11.617 1.00 33.62 C
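The sort-then-group pattern can also be written with itertools.groupby, which yields one iterator per run of equal keys. A sketch, assuming fixed-width lines where index 21 holds the grouping character:

```python
from itertools import groupby

def group_by_column(lines, index=21):
    # sort first so each key forms a single contiguous run, then group;
    # sorted() is stable, so the original order within a group is kept
    keyfunc = lambda line: line[index]
    return {key: list(group)
            for key, group in groupby(sorted(lines, key=keyfunc), key=keyfunc)}

# synthetic lines: 21 filler characters, then the grouping character
lines = [" " * 21 + "A first", " " * 21 + "B only", " " * 21 + "A second"]
groups = group_by_column(lines)
```

Without the sort, groupby would emit a separate group for every run, reproducing the multiple-groups behaviour shown above.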
Edit:
Writing grouped lines to different files does not actually require checking the previous line's value, because changing the filename based on the 21st character opens a new file, thus separating the lines. But here I used prev so that any previously created file with the same filename won't simply be appended to, which could cause clutter or inconsistency in the file's contents.
import sys

for fn in sys.argv[1:]:
    with open(fn, 'r') as file:
        lines = file.readlines()
        # creates a list of pairs (21st char, line) within a list
        lines = [[line[21:22], line] for line in lines]
        # sorts lines based on key (21st char)
        lines.sort()
        # brings the list of lines back to its original state,
        # but the order is not reverted since it is already sorted
        lines = [line[1] for line in lines]
        filename = 'file'
        prev = 0
        for line in lines:
            if line[21:22] != prev:
                # creates a new file
                file = open(filename + line[21:22] + '.txt', 'w')
            else:
                # appends to the file
                file = open(filename + line[21:22] + '.txt', 'a')
            file.write(line)
            prev = line[21:22]
The file-writing part can be simplified if appending to previously created files is not a problem. But it risks writing to a file with the same filename that was not created by the script, or was created by the script during earlier executions/sessions.

filename = 'file'
for line in lines:
    file = open(filename + line[21:22] + '.txt', 'a')
    file.write(line)
Use str.split and compare the 5th element (i.e. the 21st character):

while 1:
    line = f.readline()
    if not line:
        break
    # get character in 5th column
    ch = line.split()[4]
    if ch == 'B':
        chns = line[0:80]
        print chns
    else:  # not sure what the character is
        pass  # do something
You can just initialize a value to None and look if it changes:

import sys

for fn in sys.argv[1:]:
    old = None
    f = open(fn, 'r')
    for line in f:
        if not line: break
        if (old is None) or (line[21] == old):
            old = line[21]
            chns = line[0:80]
            print chns
Not sure what you are trying to achieve, but the following code will collect the lines from all files into the dictionary lines, keyed by the 21st character.

import sys

lines = dict()
for fn in sys.argv[1:]:
    f = open(fn, 'r')
    for line in f:
        if not line:
            break
        key = line.split()[4]
        if key not in lines.keys():
            lines[key] = list()
        lines[key].append(line)
You can then get all 21st characters that occurred using lines.keys(), and get a list() with all corresponding lines from the dictionary.
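The dictionary-of-lists bookkeeping can be shortened with collections.defaultdict, which creates the empty list on first access. A sketch on lines from the question:

```python
from collections import defaultdict

pdb_lines = [
    "ATOM 856 CE ALYS A 104 0.809 0.146 26.161 0.54 29.14 C",
    "ATOM 862 N LYS B 276 17.010 -16.138 9.618 1.00 41.00 N",
    "ATOM 858 NZ ALYS A 104 1.988 0.923 26.662 0.54 33.17 N",
]

lines_by_chain = defaultdict(list)
for line in pdb_lines:
    # the grouping character is the 5th whitespace-separated field
    lines_by_chain[line.split()[4]].append(line)

print(sorted(lines_by_chain))  # ['A', 'B']
```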

Find a block of text in formatted text file

I often parse formatted text files using Python (for biology research, but I'll try to ask my question in a way that doesn't require a biology background). I deal with a type of file, called a pdb file, that contains the 3D structure of a protein as formatted text. This is an example:
HEADER CHROMOSOMAL PROTEIN 02-JAN-87 1UBQ
TITLE STRUCTURE OF UBIQUITIN REFINED AT 1.8 ANGSTROMS RESOLUTION
REMARK 1
REMARK 1 REFERENCE 1
REMARK 1 AUTH S.VIJAY-KUMAR,C.E.BUGG,K.D.WILKINSON,R.D.VIERSTRA,
REMARK 1 AUTH 2 P.M.HATFIELD,W.J.COOK
REMARK 1 TITL COMPARISON OF THE THREE-DIMENSIONAL STRUCTURES OF HUMAN,
REMARK 1 TITL 2 YEAST, AND OAT UBIQUITIN
REMARK 1 REF J.BIOL.CHEM. V. 262 6396 1987
REMARK 1 REFN ISSN 0021-9258
ATOM 1 N MET A 1 27.340 24.430 2.614 1.00 9.67 N
ATOM 2 CA MET A 1 26.266 25.413 2.842 1.00 10.38 C
ATOM 3 C MET A 1 26.913 26.639 3.531 1.00 9.62 C
ATOM 4 O MET A 1 27.886 26.463 4.263 1.00 9.62 O
ATOM 5 CB MET A 1 25.112 24.880 3.649 1.00 13.77 C
ATOM 6 CG MET A 1 25.353 24.860 5.134 1.00 16.29 C
ATOM 7 SD MET A 1 23.930 23.959 5.904 1.00 17.17 S
ATOM 8 CE MET A 1 24.447 23.984 7.620 1.00 16.11 C
ATOM 9 N GLN A 2 26.335 27.770 3.258 1.00 9.27 N
ATOM 10 CA GLN A 2 26.850 29.021 3.898 1.00 9.07 C
ATOM 11 C GLN A 2 26.100 29.253 5.202 1.00 8.72 C
ATOM 12 O GLN A 2 24.865 29.024 5.330 1.00 8.22 O
ATOM 13 CB GLN A 2 26.733 30.148 2.905 1.00 14.46 C
ATOM 14 CG GLN A 2 26.882 31.546 3.409 1.00 17.01 C
ATOM 15 CD GLN A 2 26.786 32.562 2.270 1.00 20.10 C
ATOM 16 OE1 GLN A 2 27.783 33.160 1.870 1.00 21.89 O
ATOM 17 NE2 GLN A 2 25.562 32.733 1.806 1.00 19.49 N
ATOM 18 N ILE A 3 26.849 29.656 6.217 1.00 5.87 N
ATOM 19 CA ILE A 3 26.235 30.058 7.497 1.00 5.07 C
ATOM 20 C ILE A 3 26.882 31.428 7.862 1.00 4.01 C
ATOM 21 O ILE A 3 27.906 31.711 7.264 1.00 4.61 O
ATOM 22 CB ILE A 3 26.344 29.050 8.645 1.00 6.55 C
ATOM 23 CG1 ILE A 3 27.810 28.748 8.999 1.00 4.72 C
ATOM 24 CG2 ILE A 3 25.491 27.771 8.287 1.00 5.58 C
ATOM 25 CD1 ILE A 3 27.967 28.087 10.417 1.00 10.83 C
TER 26 ILE A 3
HETATM 604 O HOH A 77 45.747 30.081 19.708 1.00 12.43 O
HETATM 605 O HOH A 78 19.168 31.868 17.050 1.00 12.65 O
HETATM 606 O HOH A 79 32.010 38.387 19.636 1.00 12.83 O
HETATM 607 O HOH A 80 42.084 27.361 21.953 1.00 22.27 O
END
ATOM marks the beginning of a line that contains atomic coordinates; TER marks the end of the coordinates. I want to take the whole block of text that contains the atomic coordinates, so I use:
import re

f = open('example.pdb', 'r+')
content = f.read()
coor = re.search('ATOM.*TER', content)  # take everything between ATOM and TER
But it matches nothing. There must be a way of taking a whole block of text by using regex. I also don't understand why this regex pattern doesn't work. Help is appreciated.
This should match (but I haven't actually tested it):
coor = re.search('ATOM.*TER', content, re.DOTALL)
If you read the documentation on DOTALL, you will understand why it wasn't working.
A still better way of writing the above is
coor = re.search(r'^ATOM.*^TER', content, re.MULTILINE | re.DOTALL)
where it is required that ATOM and TER come after newlines, and where raw string notation is being used, which is customary for regular expressions (though it won't make any difference in this case).
You could also avoid regular expressions altogether:
start = content.index('\nATOM')
end = content.index('\nTER', start)
coor = content[start:end]
(This will actually not include the TER in the result, which may be better).
You need the (?s) modifier:

import re

f = open('example.pdb', 'r')
content = f.read()
coor = re.search('(?s)ATOM.*TER', content)
print coor

This will match everything, newlines included, with .*.
Note that if you only need anything in between (ATOM inclusive, TER exclusive), just change to a positive lookahead for TER:
'(?s)ATOM.*(?=TER)'
import re

pattern = re.compile(r"ATOM(.*?)TER", re.S)  # re.S so . matches newlines too
print pattern.findall(string)
This should do it.
I wouldn't use a regex; instead I'd use itertools' dropwhile and takewhile, which are more efficient than loading the entire file into memory to perform a regex operation (e.g., we just ignore the start of the file until ATOM, and we don't need to read from the file further after encountering TER).
from itertools import dropwhile, takewhile

with open('example.pdb') as fin:
    until_atom = dropwhile(lambda L: not L.startswith('ATOM'), fin)
    atoms = takewhile(lambda L: L.startswith('ATOM'), until_atom)
    for atom in atoms:
        print atom,
So this ignores lines while they don't start with ATOM, then keeps taking lines from that point while they still start with ATOM. You could change that condition to be lambda L: not L.startswith('TER') if you want.
Instead of printing, you could use:
all_atom_text = ''.join(atoms)
to get one large text block.
How about a non-regular-expression alternative? It can be achieved with a relatively simple loop and a little bit of state. Example:

# Gather all sets of ATOM-TER in all_coors (if there are multiple per file).
all_coors = []
f = open('example.pdb', 'r')
coor = None
in_atom = False
for line in f:
    if not in_atom and line.startswith('ATOM'):
        # Found first ATOM, start collecting results.
        in_atom = True
        coor = []
    elif in_atom and line.startswith('TER'):
        # Found TER, stop collecting results.
        in_atom = False
        # Save collected results.
        all_coors.append(''.join(coor))
        coor = None
    if in_atom:
        # Collect ATOM result.
        coor.append(line)
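If regular expressions are still preferred and a file can contain several ATOM…TER blocks, a non-greedy pattern combined with re.findall collects all of them. A sketch on a toy string:

```python
import re

content = ("HEADER x\n"
           "ATOM 1 N MET\n"
           "ATOM 2 CA MET\n"
           "TER 3\n"
           "HETATM 4\n"
           "ATOM 5 N GLN\n"
           "TER 6\n")

# (?ms): ^ matches at line starts and . matches newlines; .*? is
# non-greedy, so each match stops at the first following TER
blocks = re.findall(r"(?ms)^ATOM.*?^TER", content)
print(len(blocks))  # 2
```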

How do I get the number of subunits from a PDB file with awk , python or biopython?

I have PDB(text) files which are in a directory. I would like to print the number of subunits from each PDB file.
Read all lines in a pdb file that start with ATOM
The fifth column of the ATOM line contains A, B, C, D etc.
If it contains only A the number of subunit is 1. If it contains A and B, the number of subunits are 2. If it contains A, B, and C the number of subunits are 3.
1kg2.pdb file
ATOM 1363 N ASN A 258 82.149 -23.468 9.733 1.00 57.80 N
ATOM 1364 CA ASN A 258 82.494 -22.084 9.356 1.00 62.98 C
ATOM 1395 C MET B 196 34.816 -51.911 11.750 1.00 49.79 C
ATOM 1396 O MET B 196 35.611 -52.439 10.963 1.00 47.65 O
1uz3.pdb file
ATOM 1384 O ARG A 260 80.505 -20.450 15.420 1.00 22.10 O
ATOM 1385 CB ARG A 260 78.980 -18.077 15.207 1.00 36.88 C
ATOM 1399 SD MET B 196 34.003 -52.544 16.664 1.00 57.16 S
ATOM 1401 N ASP C 197 34.781 -50.611 12.007 1.00 44.30 N
2b69.pdb file
ATOM 1393 N MET B 196 33.300 -54.017 12.033 1.00 46.46 N
ATOM 1394 CA MET B 196 33.782 -52.714 12.566 1.00 49.99 C
desired output
pdb_id subunits
1kg2 2
1uz3 3
2b69 1
How can I do this with awk, python or Biopython?
You can use an array to record all seen values for the fifth column.
$ gawk '/^ATOM/ {seen[$5] = 1} END {print length(seen)}' 1kg2.pdb
2
Edit: Using gawk 4.x you can use ENDFILE to generate the required output:
BEGIN {
    print "pdb_id\t\tsubunits"
    print
}

/^ATOM/ {
    seen[$5] = 1
}

ENDFILE {
    print FILENAME, "\t", length(seen)
    delete seen
}
The result:
$ gawk -f pdb.awk 1kg2.pdb 1uz3.pdb 2b69.pdb
pdb_id subunits
1kg2.pdb 2
1uz3.pdb 3
2b69.pdb 1
A dictionary is one way to count unique occurrences. The following assigns a meaningless value (0) to each subunit, since all you care about is the number of unique subunits (dictionary keys).
import os

for fn in os.listdir():
    if ".pdb" in fn:
        sub = {}
        with open(fn, 'r') as f:
            for line in f:
                c = line.split()
                if len(c) > 5 and c[0] == "ATOM":
                    sub[c[4]] = 0
        print(fn, len(sub.keys()))
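A set expresses the same idea more directly than a dictionary of dummy values. A sketch using the 1uz3 sample lines from the question:

```python
def count_subunits(lines):
    # unique chain identifiers (5th whitespace-separated field)
    # of the ATOM records
    chains = {line.split()[4] for line in lines
              if line.startswith("ATOM") and len(line.split()) > 5}
    return len(chains)

pdb_1uz3 = [
    "ATOM 1384 O ARG A 260 80.505 -20.450 15.420 1.00 22.10 O",
    "ATOM 1385 CB ARG A 260 78.980 -18.077 15.207 1.00 36.88 C",
    "ATOM 1399 SD MET B 196 34.003 -52.544 16.664 1.00 57.16 S",
    "ATOM 1401 N ASP C 197 34.781 -50.611 12.007 1.00 44.30 N",
]
print(count_subunits(pdb_1uz3))  # 3
```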
(A brand new user deserves an answer along with a pointer to http://whathaveyoutried.com/. Subsequent questions should include evidence that the user has actually tried to solve the problem.)
