I am stuck trying to query the nearest neighbors of models from a PDB file using scipy's kd-tree. I have currently implemented a brute-force approach that compares the RMSD between each pair of models. I would like to speed up finding each model's nearest neighbors by using a kd-tree.
For reference, here is a sample of the PDB file I am working with; it has multiple models in a single file:
MODEL 5
HETATM 1 C1 SIN A 0 13.542 -2.290 0.745 1.00 0.00 C
HETATM 2 O1 SIN A 0 14.446 -2.652 0.010 1.00 0.00 O
HETATM 3 O2 SIN A 0 12.378 -2.189 0.395 1.00 0.00 O
...
TER 627 NH2 A 39
ENDMDL
MODEL 6
HETATM 1 C1 SIN A 0 11.762 2.281 -7.835 1.00 0.00 C
ATOM 26 C TRP A 2 11.341 6.316 -0.847 1.00 0.00 C
ATOM 27 O TRP A 2 11.074 6.179 0.330 1.00 0.00 O
ATOM 28 CB TRP A 2 13.182 7.844 -1.538 1.00 0.00 C
ATOM 29 CG TRP A 2 12.069 8.524 -2.259 1.00 0.00 C
...
HETATM 626 HN2 NH2 A 39 3.093 9.404 -6.782 1.00 0.00 H
TER 627 NH2 A 39
ENDMDL
MODEL 7
HETATM 1 C1 SIN A 0 -16.074 -1.515 -4.262 1.00 0.00 C
HETATM 2 O1 SIN A 0 -16.968 -1.910 -4.992 1.00 0.00 O
...
ATOM 18 OD1 ASP A 1 -12.877 3.426 -8.525 1.00 0.00 O
ATOM 19 OD2 ASP A 1 -13.484 1.785 -9.782 1.00 0.00 O
TER 627 NH2 A 39
ENDMDL
My initial attempt was to represent the data as a list of models, where each model is a list of atom coordinates and each 3D atom coordinate is itself a list:
print(model_coord)
[
[[1.4579, 0.0, 0.0],... ,[-5.5, 21.5529, 23.7390]],
[[16.5450, 3.3699, 10.1888], ... ,[-0.0963, 24.510883331298828, 20.2952]],
[[17.6256, 2.5858, 12.4808],... ,[-11.6052, 13.1031, 23.8958]]
]
I then received the following error when creating the kd-tree object:
kdtree = scipy.spatial.KDTree(model_coord)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 235, in __init__
self.n, self.m = np.shape(self.data)
ValueError: too many values to unpack
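I suspect the cause: KDTree expects a 2-D (n, m) array, but a list of models, each a list of [x, y, z] triples, is effectively 3-D, so np.shape returns three values and the unpacking fails. A quick check:
import numpy as np
# A 3-deep nested list has a 3-tuple shape: (models, atoms, 3);
# KDTree's "self.n, self.m = np.shape(self.data)" needs a 2-tuple.
print(np.shape(model_coord))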
However, converting model_coord into a pandas DataFrame allowed me to obtain the n-by-m shape required to create the kd-tree object, where each row represents a model and each column a 3D atom coordinate:
model_df = pd.DataFrame(model_coord)
print(model_df.to_string())
0 1 2 ...
0 [1.45799, 0.0, 0.0] [3.9140, 2.8670, 0.4530] [7.590, 3.7990, 0.1850] ...
1 [16.5450, 3.3699, 10.1888] [15.9148, 1.9402, 13.6552] [14.4702, 2.6485, 17.0995] ...
2 [17.6256, 2.5858, 12.4808] [16.4266, 2.2781, 16.0749] [12.6480, 2.6846, 16.0066] ...
Here is my attempt to query the nearest neighbors of a model within a radius, where epsilon is the radius:
kdtree = scipy.spatial.KDTree(model_df)
for index, model in model_df.iterrows():
    model_nn_dist, model_nn_ids = kdtree.query(model, distance_upper_bound=epsilon)
I received the following error because the coordinates are list objects:
model_nn_dist, model_nn_ids=kdtree.query(model,distance_upper_bound=epsilon)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 521, in query
hits = self.__query(x, k=k, eps=eps, p=p,distance_upper_bound=distance_upper_bound)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 320, in __query
side_distances = np.maximum(0,np.maximum(x-self.maxes,self.mins-x))
TypeError: unsupported operand type(s) for -: 'list' and 'list'
I attempted to resolve this by converting the atom coordinates into NumPy arrays; however, this is the error I receive:
model_nn_dist, model_nn_ids = kdtree.query(model,distance_upper_bound=epsilon)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 521, in query
hits = self.__query(x, k=k, eps=eps, p=p, distance_upper_bound=distance_upper_bound)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 320, in __query
side_distances = np.maximum(0,np.maximum(x-self.maxes,self.mins-x))
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I am wondering if there is a better approach or a more suitable data structure for querying nearest neighbors of models (sets of coordinates) using kd-trees.
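One direction that seems more promising is to flatten each model into a single 1-D vector, which gives the kd-tree the 2-D (n_models, 3*n_atoms) array it expects. A minimal sketch, assuming every model has the same number of atoms and the models are already superimposed (in that case the Euclidean distance between flattened vectors is exactly RMSD * sqrt(n_atoms), so a fixed RMSD radius can be rescaled):
import numpy as np
from scipy.spatial import cKDTree

coords = np.asarray(model_coord)              # shape (n_models, n_atoms, 3)
n_models, n_atoms, _ = coords.shape
flat = coords.reshape(n_models, n_atoms * 3)  # shape (n_models, 3 * n_atoms)

kdtree = cKDTree(flat)
epsilon = 2.0  # example RMSD radius in angstroms
neighbors = kdtree.query_ball_point(flat, r=epsilon * np.sqrt(n_atoms))
for i, ids in enumerate(neighbors):
    print(i, [j for j in ids if j != i])      # neighbors of model i, excluding itself
Note that this only reproduces RMSD for pre-aligned models; rotation-minimized RMSD is not a distance a kd-tree can index directly.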
Related
I'm running a Python script in which I import the module spyrmsd. I'm getting an error when I reach the line containing io.loadmol() for loading molecules.
from spyrmsd import io, rmsd
ref = io.loadmol("tempref.pdb")
I'm getting the following error:
Reference Molecule:lig_ref.pdb
PDBqt File:ligand_vina_out.pdbqt
Traceback (most recent call last):
File "rmsd.py", line 34, in <module>
ref = io.loadmol("tempref.pdb")
File "/home/aathiranair/.local/lib/python3.8/site-
packages/spyrmsd/io.py", line 66, in loadmol
mol = load(fname)
NameError: name 'load' is not defined
I tried uninstalling and reinstalling the spyrmsd module, but I still face the same issue.
I also tried creating a virtual environment and running the script but faced the same issue.
(ihub_proj) aathiranair@aathiranair-Inspiron-5406-2n1:~/Desktop/Ihub$ python3 rmsd.py lig_ref.pdb ligand_vina_out.pdbqt
Reference Molecule:lig_ref.pdb
PDBqt File:ligand_vina_out.pdbqt
Traceback (most recent call last):
File "rmsd.py", line 34, in <module>
ref = io.loadmol("tempref.pdb")
File "/home/aathiranair/Desktop/Ihub/ihub_proj/lib/python3.8/site-packages/spyrmsd/io.py", line 66, in loadmol
mol = load(fname)
NameError: name 'load' is not defined
The tempref.pdb file looks like this:
ATOM 1 O6 LIG 359 2.349 1.014 7.089 0.00 0.00
ATOM 9 H LIG 359 1.306 1.691 9.381 0.00 0.00
ATOM 2 C2 LIG 359 0.029 4.120 8.082 0.00 0.00
ATOM 3 O9 LIG 359 -1.106 2.491 9.345 0.00 0.00
ATOM 4 C1 LIG 359 -0.204 3.890 0.337 0.00 0.00
ATOM 5 S5 LIG 359 -0.355 4.108 4.075 0.00 0.00
ATOM 8 C4 LIG 359 -3.545 1.329 7.893 0.00 0.00
ATOM 7 C7 LIG 359 -1.133 5.150 9.406 0.00 0.00
ATOM 6 C3 LIG 359 -0.064 1.805 8.234 0.00 0.00
It seems that to use the io module, one of OpenBabel or RDKit is required.
Also, make sure NumPy is installed.
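The NameError suggests that neither backend could be imported in your environment, so load was never defined. A quick way to check (a sketch; the install commands in the comment are the usual ones, but check the spyrmsd documentation for your platform):
for backend in ("rdkit", "openbabel"):
    try:
        __import__(backend)
        print(backend, "is available")
    except ImportError:
        # Typical fixes: pip install rdkit, or conda install -c conda-forge openbabel
        print(backend, "is NOT available")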
I have a dataset with firms' financial information from 1987 to 1999 across different industries. Now I want to run an OLS regression for each industry. There are 137 industries in the dataframe, with 28,128 firm-years available. df['Industry'] is the code for each industry; the same code means the same industry. After getting the intercept and coefficients for each industry, I want to put them into a descriptive statistics table.
I am thinking of writing a for loop but have no idea how.
First I keep only industries that have at least 50 observations:
Dechow_1987 = Dechow_1987[Dechow_1987.groupby('Industry')['Industry'].transform('size') >= 50]
print( 'Firm-years available:' ,Dechow_1987.shape[0])
print( 'Number of Industries available:' , Dechow_1987.Industry.nunique())
And I want to get this table at the end:
Intercept b1 b2 b3 Adjusted R2
Mean 0.03 0.19 -0.51 0.15 0.34
(t-statistic) (16.09) (21.10) (-35.77) (15.33)
Lower quartile 0.01 0.11 -0.63 0.08 0.22
Median 0.03 0.18 -0.52 0.15 0.34
Upper quartile 0.04 0.26 -0.40 0.23 0.4
I have tried :
ind = Dechow_1987.Industry.unique()
op=pd.DataFrame()
for i in ind:
    Dechow_1987_i = Dechow_1987[Dechow_1987.Industry == i]
    X_CFOs = Dechow_1987_i[['CFOtm1', 'CFOt', 'CFOtp1']]
    X_CFOs = sm.add_constant(X_CFOs)
    Y_WC = Dechow_1987_i['wcch']
    reg = sm.OLS(Y_WC, X_CFOs).fit()
    #reg.score(Y_WC,X_CFOs)
    intercept = reg.params.const
    coef_CFOtm1 = reg.params.CFOtm1
    coef_CFOt = reg.params.CFOt
    coef_CFOtp1 = reg.params.CFOtp1
    ind=i
    array=np.append(coef_CFOtm1,coef_CFOt,coef_CFOtp1).dtype('int32')
    array=np.append(array,intercept)
    array=np.append(array,ind)
    array=array.reshape(3,len(array))
    df_append=pd.DataFrame(array)
    op=op.append(df_append)
op.columns=['A'+str(i) for i in range (3,len(op.columns)+1)]
op.rename(columns={op.columns[-1]:"Industry"},inplace=True)
op.rename(columns={op.columns[-2]:"Intercept"},inplace=True)
op=op.reset_index().drop('index',axis=1)
op=op.drop_duplicates()
It comes out with:
TypeError Traceback (most recent call last)
<ipython-input-114-60ce6bb71209> in <module>
18
19 ind=i
---> 20 array=np.append(coef_CFOtm1,coef_CFOt,coef_CFOtp1).dtype('int32')
21 array=np.append(array,intercept)
22 array=np.append(array,ind)
<__array_function__ internals> in append(*args, **kwargs)
D:\anaconda3\lib\site-packages\numpy\lib\function_base.py in append(arr, values, axis)
4743 values = ravel(values)
4744 axis = arr.ndim-1
-> 4745 return concatenate((arr, values), axis=axis)
4746
4747
<__array_function__ internals> in concatenate(*args, **kwargs)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
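For reference, a minimal rewrite that sidesteps the error: np.append interprets a third positional argument as the axis, and .dtype is an attribute rather than a conversion call, which is where the TypeError comes from. A sketch assuming the column names above, collecting each industry's results in a list of dicts and building the frame once:
import numpy as np
import pandas as pd
import statsmodels.api as sm

rows = []
for i in Dechow_1987.Industry.unique():
    d = Dechow_1987[Dechow_1987.Industry == i]
    X = sm.add_constant(d[['CFOtm1', 'CFOt', 'CFOtp1']])
    reg = sm.OLS(d['wcch'], X).fit()
    rows.append({'Industry': i,
                 'Intercept': reg.params['const'],
                 'b1': reg.params['CFOtm1'],
                 'b2': reg.params['CFOt'],
                 'b3': reg.params['CFOtp1'],
                 'Adj_R2': reg.rsquared_adj})
op = pd.DataFrame(rows)

# Descriptive statistics across industries, matching the rows of the target table.
cols = ['Intercept', 'b1', 'b2', 'b3', 'Adj_R2']
print(op[cols].describe().loc[['mean', '25%', '50%', '75%']])
# Cross-industry t-statistics (mean / standard error of the mean):
print(op[cols].mean() / (op[cols].std() / np.sqrt(len(op))))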
I have a PDB file which contain two molecules (receptor and ligand).
Each molecule will have its own header. All in ONE PDB file.
The header of the receptor section looks like this (lines 1-6 of the PDB file):
HEADER rec.pdb
REMARK original generated coordinate pdb file
ATOM 1 N GLY A 1 -51.221 -13.970 37.091 1.00 0.00 RA0 N
ATOM 2 H GLY A 1 -50.383 -13.584 37.482 1.00 0.00 RA0 H
ATOM 3 CA GLY A 1 -50.902 -15.071 36.197 1.00 0.00 RA0 C
ATOM 4 C GLY A 1 -49.525 -15.659 36.443 1.00 0.00 RA0 C
and this is the ligand section (lines 11435 to 11440 of the PDB file):
HEADER lig.000.00.pdb
ATOM 1 N MET A 1 27.318 -26.957 12.663 1.00 0.00 LA0 N
ATOM 2 H MET A 1 27.313 -27.570 11.870 1.00 0.00 LA0 H
ATOM 3 CA MET A 1 28.374 -27.102 13.668 1.00 0.00 LA0 C
ATOM 4 CB MET A 1 28.531 -28.564 14.090 1.00 0.00 LA0 C
ATOM 5 CG MET A 1 27.224 -29.154 14.628 1.00 0.00 LA0 C
Note that the receptor and ligand sections also contain the strings RA0 and LA0 in the 11th column of the PDB file.
What I want to do is rename the chain of the receptor as chain A and the ligand as chain B.
In order to do that, I intended to extract both parts into two different objects first, then rename the chains, and finally put them together again.
I tried this code with Bio3D, but it doesn't work:
library(bio3d)
pdb_infile <- "myfile.pdb"
pdb <- read.pdb(pdb_infile)
receptor_segment.sele <- atom.select(pdb, segid = "RA0", verbose = TRUE)
receptor_pdb <- trim.pdb(pdb, receptor_segment.sele)
ligand_segment.sele <- atom.select(pdb, segid = "LA0", verbose = TRUE)
ligand_pdb <- trim.pdb(pdb, ligand_segment.sele) # showed no entry
What's the way to do it?
I'm open to a solution in R or Python.
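Since a Python solution is acceptable: one plain-text sketch is to rewrite the chain identifier directly, using the fixed-width PDB layout (chain ID at column 22, i.e. string index 21; segment ID at columns 73-76, i.e. indices 72:76). The file names are placeholders, and the column offsets should be verified against your actual file:
with open('myfile.pdb') as fin, open('myfile_renamed.pdb', 'w') as fout:
    for line in fin:
        if line.startswith(('ATOM', 'HETATM', 'TER')):
            segid = line[72:76].strip()
            if segid == 'RA0':
                line = line[:21] + 'A' + line[22:]   # receptor -> chain A
            elif segid == 'LA0':
                line = line[:21] + 'B' + line[22:]   # ligand -> chain B
        fout.write(line)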
I have a dataframe (data_neighbours) that contains the top similar products for each of 100 products, and a 100x100 dataframe of similarity scores. I have another dataframe with data at the user and product level (1000x100). For each user and each product, I want to get the top 10 similar products from data_neighbours and their corresponding similarity scores, and compute the function getScore below:
def getScore(history, similarities):
    return sum(history*similarities)/sum(similarities)

for i in range(0, len(data_sims.index)):
    for j in range(1, len(data_sims.columns)):
        user = data_sims.index[i]
        product = data_sims.columns[j]
        if data.ix[i][j] == 1:
            data_sims.ix[i][j] = 0
        else:
            product_top_names = data_neighbours.ix[product][1:10]
            product_top_sims = data_ibs.ix[product].order(ascending=False)[1:10]
            user_purchases = data_germany.ix[user, product_top_names]
            data_sims.ix[i][j] = getScore(user_purchases, product_top_sims)
How can I optimize this loop for faster processing? The example is taken from here: http://www.salemmarafi.com/code/collaborative-filtering-with-python/
Sample data:
Data (1000x101), where user is the 101st column:
Index user song1 song2.....
0 1 0 0
1 33 0 1
2 42 1 0
3 51 0 0
data_ibs (similarity scores) -- (100x100):
song1 song2 song3 song4
song1 1.00 0.00 0.02 0.05
song2 0.00 1.00 0.05 0.03
song3 0.02 0.05 1.00 0.11
song4 0.05 0.03 0.11 1.00
data_neighbours (top 10 similar songs for each song, based on sorted scores from data_ibs) -- (100x10):
1 2 3......... 10
song1 song5 song10 song4
song2 song8 song11 song5
song3 song9 song12 song10
data_germany (user-level data with each song as a column, except userid) -- (1000x100):
index song1 song2 song3
1 0 0 0
2 1 0 0
3 0 0 1
Expected dataset (data_sims) -- (1000x101):
user song1 song2 song3
1 0.00 0.00 0.22
33 0.09 0.00 0.11
42 0.00 0.10 0.00
51 0.09 0.09 0.00
If the value in data for a song is 1, its score is simply set to 0. Otherwise, the top 10 songs are fetched from data_neighbours and their corresponding scores from data_ibs. It is then checked whether those songs are already present for the user (1/0) in the user_purchases dataset. Finally, the similarity score for position i,j is computed by multiplying user_purchases (1/0 values for each of the top 10 songs) by the similarity scores from data_ibs and dividing by the sum of the top 10 similarity scores. The same is repeated for every user x song combination.
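One way to speed this up is to hoist the per-product work out of the user loop and score all users for a product at once. A sketch, assuming data, data_germany, and data_sims share the same row order and that user is the only non-song column in data_sims (the deprecated .ix and .order calls are replaced with their modern equivalents):
import numpy as np

# Precompute each product's top similar products and their scores once,
# instead of re-sorting data_ibs for every (user, product) pair.
top = {}
for product in data_ibs.columns:
    sims = data_ibs[product].sort_values(ascending=False)[1:10]
    top[product] = (sims.index, sims.values)

for product in data_sims.columns.drop('user'):
    names, sims = top[product]
    purchases = data_germany[names].values     # (n_users, 9) matrix of 0/1
    scores = purchases.dot(sims) / sims.sum()  # getScore for every user at once
    owned = data[product].values == 1          # user already has this song
    data_sims[product] = np.where(owned, 0.0, scores)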
I often parse formatted text files using Python (for biology research, but I'll try to ask my question in a way that doesn't require a biology background). I deal with a type of file, called a PDB file, that contains the 3D structure of a protein as formatted text. This is an example:
HEADER CHROMOSOMAL PROTEIN 02-JAN-87 1UBQ
TITLE STRUCTURE OF UBIQUITIN REFINED AT 1.8 ANGSTROMS RESOLUTION
REMARK 1
REMARK 1 REFERENCE 1
REMARK 1 AUTH S.VIJAY-KUMAR,C.E.BUGG,K.D.WILKINSON,R.D.VIERSTRA,
REMARK 1 AUTH 2 P.M.HATFIELD,W.J.COOK
REMARK 1 TITL COMPARISON OF THE THREE-DIMENSIONAL STRUCTURES OF HUMAN,
REMARK 1 TITL 2 YEAST, AND OAT UBIQUITIN
REMARK 1 REF J.BIOL.CHEM. V. 262 6396 1987
REMARK 1 REFN ISSN 0021-9258
ATOM 1 N MET A 1 27.340 24.430 2.614 1.00 9.67 N
ATOM 2 CA MET A 1 26.266 25.413 2.842 1.00 10.38 C
ATOM 3 C MET A 1 26.913 26.639 3.531 1.00 9.62 C
ATOM 4 O MET A 1 27.886 26.463 4.263 1.00 9.62 O
ATOM 5 CB MET A 1 25.112 24.880 3.649 1.00 13.77 C
ATOM 6 CG MET A 1 25.353 24.860 5.134 1.00 16.29 C
ATOM 7 SD MET A 1 23.930 23.959 5.904 1.00 17.17 S
ATOM 8 CE MET A 1 24.447 23.984 7.620 1.00 16.11 C
ATOM 9 N GLN A 2 26.335 27.770 3.258 1.00 9.27 N
ATOM 10 CA GLN A 2 26.850 29.021 3.898 1.00 9.07 C
ATOM 11 C GLN A 2 26.100 29.253 5.202 1.00 8.72 C
ATOM 12 O GLN A 2 24.865 29.024 5.330 1.00 8.22 O
ATOM 13 CB GLN A 2 26.733 30.148 2.905 1.00 14.46 C
ATOM 14 CG GLN A 2 26.882 31.546 3.409 1.00 17.01 C
ATOM 15 CD GLN A 2 26.786 32.562 2.270 1.00 20.10 C
ATOM 16 OE1 GLN A 2 27.783 33.160 1.870 1.00 21.89 O
ATOM 17 NE2 GLN A 2 25.562 32.733 1.806 1.00 19.49 N
ATOM 18 N ILE A 3 26.849 29.656 6.217 1.00 5.87 N
ATOM 19 CA ILE A 3 26.235 30.058 7.497 1.00 5.07 C
ATOM 20 C ILE A 3 26.882 31.428 7.862 1.00 4.01 C
ATOM 21 O ILE A 3 27.906 31.711 7.264 1.00 4.61 O
ATOM 22 CB ILE A 3 26.344 29.050 8.645 1.00 6.55 C
ATOM 23 CG1 ILE A 3 27.810 28.748 8.999 1.00 4.72 C
ATOM 24 CG2 ILE A 3 25.491 27.771 8.287 1.00 5.58 C
ATOM 25 CD1 ILE A 3 27.967 28.087 10.417 1.00 10.83 C
TER 26 ILE A 3
HETATM 604 O HOH A 77 45.747 30.081 19.708 1.00 12.43 O
HETATM 605 O HOH A 78 19.168 31.868 17.050 1.00 12.65 O
HETATM 606 O HOH A 79 32.010 38.387 19.636 1.00 12.83 O
HETATM 607 O HOH A 80 42.084 27.361 21.953 1.00 22.27 O
END
ATOM marks the beginning of a line that contains atomic coordinates. TER marks the end of the coordinates. I want to take the whole block of text that contains the atomic coordinates, so I use:
import re
f = open('example.pdb', 'r+')
content = f.read()
coor = re.search('ATOM.*TER', content) # take everything between ATOM and TER
But it matches nothing. There must be a way of taking a whole block of text by using regex. I also don't understand why this regex pattern doesn't work. Help is appreciated.
This should match (but I haven't actually tested it):
coor = re.search('ATOM.*TER', content, re.DOTALL)
If you read the documentation on DOTALL, you will understand why it wasn't working: by default, . matches any character except a newline, so the pattern cannot span multiple lines.
A still better way of writing the above is
coor = re.search(r'^ATOM.*^TER', content, re.MULTILINE | re.DOTALL)
where it is required that ATOM and TER come after newlines, and where raw string notation is being used, which is customary for regular expressions (though it won't make any difference in this case).
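A quick illustration of the difference (a made-up two-line input):
import re
text = 'ATOM 1\nATOM 2\nTER'
print(re.search('ATOM.*TER', text))             # None: . stops at each newline
print(re.search('ATOM.*TER', text, re.DOTALL))  # matches across the newlines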
You could also avoid regular expressions altogether:
start = content.index('\nATOM')
end = content.index('\nTER', start)
coor = content[start:end]
(This will actually not include the TER in the result, which may be better).
You need the (?s) modifier:
import re
f = open('example.pdb', 'r')
content = f.read()
coor = re.search('(?s)ATOM.*TER', content)
print coor.group()
This will match everything - newline included - with .*.
Note that if you only need anything in between (ATOM inclusive, TER exclusive), just change to a positive lookahead for TER:
'(?s)ATOM.*(?=TER)'
import re
pattern = re.compile(r"ATOM(.*?)TER", re.DOTALL)
print pattern.findall(content)
This should do it.
I wouldn't use a regex; instead, itertools' dropwhile and takewhile are more efficient than loading the entire file into memory for a regex operation (e.g., we just ignore the start of the file until ATOM, and we don't need to read further after encountering TER).
from itertools import dropwhile, takewhile

with open('example.pdb') as fin:
    until_atom = dropwhile(lambda L: not L.startswith('ATOM'), fin)
    atoms = takewhile(lambda L: L.startswith('ATOM'), until_atom)
    for atom in atoms:
        print atom,
So this ignores lines while they don't start with ATOM, then keeps taking lines from that point while they still start with ATOM. You could change that condition to be lambda L: not L.startswith('TER') if you want.
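If the coordinate block can contain lines that do not start with ATOM (interleaved HETATM records, for example), that variant would be:
atoms = takewhile(lambda L: not L.startswith('TER'), until_atom)
which keeps every line from the first ATOM up to, but not including, the first TER.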
Instead of printing, you could use:
all_atom_text = ''.join(atoms)
to get one large text block.
How about a non-regular-expression alternative? It can be achieved with a relatively simple loop and a little bit of state. Example:
# Gather all sets of ATOM-TER in all_coors (if there are multiple per file).
all_coors = []
f = open('example.pdb', 'r')
coor = None
in_atom = False
for line in f:
    if not in_atom and line.startswith('ATOM'):
        # Found first ATOM, start collecting results.
        in_atom = True
        coor = []
    elif in_atom and line.startswith('TER'):
        # Found TER, stop collecting results.
        in_atom = False
        # Save collected results.
        all_coors.append(''.join(coor))
        coor = None
    if in_atom:
        # Collect ATOM result.
        coor.append(line)