"TMalign..." is an executable file that I used to get data. How could I store the output into a variable so that I could extract target values from the output. The executable file is compiled from a long .cpp, so I do not think I could call the variable names from there.
import sys,os
os.system("./TMalign 3w4u.pdb 6bb5.pdb -u 139") #some command I have
The output is like, and I need to extract the TM-score values:
*********************************************************************
* TM-align (Version 20190822): protein structure alignment *
* References: Y Zhang, J Skolnick. Nucl Acids Res 33, 2302-9 (2005) *
* Please email comments and suggestions to yangzhanglab@umich.edu *
*********************************************************************
Name of Chain_1: 3w4u.pdb (to be superimposed onto Chain_2)
Name of Chain_2: 6bb5.pdb
Length of Chain_1: 141 residues
Length of Chain_2: 139 residues
Aligned length= 139, RMSD= 1.07, Seq_ID=n_identical/n_aligned= 0.590
TM-score= 0.94726 (if normalized by length of Chain_1, i.e., LN=141, d0=4.42)
TM-score= 0.96044 (if normalized by length of Chain_2, i.e., LN=139, d0=4.38)
TM-score= 0.96044 (if normalized by user-specified LN=139.00 and d0=4.38)
(You should use TM-score normalized by length of the reference structure)
(":" denotes residue pairs of d < 5.0 Angstrom, "." denotes other aligned residues)
SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::. .
-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK-Y
Total CPU time is 0.03 seconds
Thanks for help!
You should look toward the following approach:
import re
from subprocess import check_output
ret = check_output(['./TMalign', '3w4u.pdb', '6bb5.pdb', '-u', '139'])
tm_scores = []
for line in ret.decode().split('\n'):
    if re.match(r'^TM-score=', line):
        score = line.split()[1:2]  # extract the value that follows "TM-score="
        tm_scores.extend(score)    # keep only the values
# tm_scores now contains: ['0.94726', '0.96044', '0.96044']
While somewhat elaborate, this is a flexible and tunable solution. Note that if it will be used alongside other code, it would be better to wrap it in a function, for example as sketched below.
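A minimal sketch of such a wrapper (the name get_tm_scores and its arguments are purely illustrative, not part of TMalign itself):
import re
from subprocess import check_output

def get_tm_scores(pdb1, pdb2, ln):
    # Run TMalign on two PDB files and return the TM-score values as floats
    out = check_output(['./TMalign', pdb1, pdb2, '-u', str(ln)])
    scores = []
    for line in out.decode().splitlines():
        if re.match(r'^TM-score=', line):
            scores.append(float(line.split()[1]))  # value after "TM-score="
    return scores

# e.g. get_tm_scores('3w4u.pdb', '6bb5.pdb', 139) -> [0.94726, 0.96044, 0.96044]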
My approach wasn't that smart: I let the output be written to a file and did the follow-up parsing on that file instead.
import os
cmd = './TMalign 3w4u.pdb 6bb5.pdb -u 139'
os.system(cmd + " >> 1.txt")
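What that follow-up parsing could look like, as a rough sketch (it assumes 1.txt contains only the TMalign output shown above):
tm_scores = []
with open('1.txt') as fh:
    for line in fh:
        if line.startswith('TM-score='):
            tm_scores.append(line.split()[1])  # keep just the numeric value
print(tm_scores)  # ['0.94726', '0.96044', '0.96044']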
I have a binary file that was created using a Python code. This code mainly scripts a bunch of tasks to pre-process a set of data files. I would now like to read this binary file in Fortran. The content of the binary file is coordinates of points in a simple format e.g.: number of points, x0, y0, z0, x1, y1, z1, ....
These binary files were created using the 'tofile' function in numpy. I have the following code in Fortran so far:
integer :: intValue
double precision :: dblValue
integer :: counter
integer :: check

open(unit=10, file='file.bin', form='unformatted', status='old', access='stream')

counter = 1
do
  if ( counter == 1 ) then
    read(unit=10, iostat=check) intValue
    if ( check < 0 ) then
      print *, "End Of File"
      stop
    else if ( check > 0 ) then
      print *, "Error Detected"
      stop
    else if ( check == 0 ) then
      counter = counter + 1
      print *, intValue
    end if
  else if ( counter > 1 ) then
    read(unit=10, iostat=check) dblValue
    if ( check < 0 ) then
      print *, "End Of File"
      stop
    else if ( check > 0 ) then
      print *, "Error Detected"
      stop
    else if ( check == 0 ) then
      counter = counter + 1
      print *, dblValue
    end if
  end if
end do

close(unit=10)
This unfortunately does not work, and I get garbage numbers (e.g. 6.4731191026611484E+212, 2.2844499004808491E-279, etc.). Could someone give me some pointers on how to do this correctly?
Also, what would be a good way of writing and reading binary files interchangeably between Python and Fortran? It seems like that is going to be one of the requirements of my application.
Thanks
Here's a trivial example of how to get data generated with numpy into Fortran the binary way.
I calculated 360 values of sin on [0,2π),
#!/usr/bin/env python3
import numpy as np

with open('sin.dat', 'wb') as outfile:
    np.sin(np.arange(0., 2*np.pi, np.pi/180.,
                     dtype=np.float32)).tofile(outfile)
exported that with tofile to the binary file 'sin.dat' (1440 bytes, i.e. 360 * sizeof(float32)), and read that file with the following Fortran 95 program (compiled with gfortran -O3 -Wall -pedantic), which outputs 1. - (val**2 + cos(x)**2) for x in [0,2π):
program numpy_import
   integer, parameter :: REAL_KIND = 4
   integer, parameter :: UNIT = 10
   integer, parameter :: SAMPLE_LENGTH = 360
   real(REAL_KIND), parameter :: PI = acos(-1.)
   real(REAL_KIND), parameter :: DPHI = PI/180.

   real(REAL_KIND), dimension(0:SAMPLE_LENGTH-1) :: arr
   real(REAL_KIND) :: r
   integer :: i

   open(UNIT, file="sin.dat", form='unformatted', &
        access='direct', recl=4)
   do i = 0, ubound(arr, 1)
      read(UNIT, rec=i+1, err=100) arr(i)
   end do

   do i = 0, ubound(arr, 1)
      r = 1. - (arr(i)**2. + cos(real(i*DPHI, REAL_KIND))**2)
      write(*, '(F6.4, " ")', advance='no') &
           real(int(r*1E6+1)/1E6, REAL_KIND)
   end do

100 close(UNIT)
   write(*,*)
end program numpy_import
thus if val == sin(x), the numeric result must in good approximation vanish for float32 types.
And indeed:
output:
360 x 0.0000
So thanks to this great community, from all the advice I got and a little bit of tinkering around, I think I figured out a stable solution to this problem, and I wanted to share this answer with you all. I will provide a minimal example here, where I want to write a variable-size array from Python into a binary file and read it using Fortran. I am assuming that the number of rows numRows and number of columns numCols are also written along with the full array dataArray. The following Python script writeBin.py writes the file:
import numpy as np

# Read in the numRows and numCols values
# Read in the array values
numRowArr = np.array([numRows], dtype=np.float32)
numColArr = np.array([numCols], dtype=np.float32)

fileObj = open('pybin.bin', 'wb')
numRowArr.tofile(fileObj)
numColArr.tofile(fileObj)
for i in range(numRows):
    lineArr = dataArray[i,:]
    lineArr.tofile(fileObj)
fileObj.close()
Following this, the Fortran code to read the array from the file can be written as follows:
program readBin
   use iso_fortran_env
   implicit none

   integer :: nR, nC, i
   real(kind=real32) :: numRowVal, numColVal
   real(kind=real32), dimension(:), allocatable :: rowData
   real(kind=real32), dimension(:,:), allocatable :: fullData

   open(unit=10, file='pybin.bin', form='unformatted', status='old', access='stream')

   read(unit=10) numRowVal
   nR = int(numRowVal)
   read(unit=10) numColVal
   nC = int(numColVal)

   allocate(rowData(nC))
   allocate(fullData(nR,nC))

   do i = 1, nR
      read(unit=10) rowData
      fullData(i,:) = rowData(:)
   end do

   close(unit=10)
end program readBin
The main point I gathered from the discussion on this thread is to match the read and the write as closely as possible, with precise specification of the data types being read and of the way they are written. As you may note, this is a made-up example, so there may be some things here and there that are not perfect. However, I have since used this approach in a finite element program, where the mesh data was read and written this way, and it worked very well.
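As a quick way to check that the read matches the write, the same file can also be read back in Python with np.fromfile, using exactly the dtypes it was written with (a minimal sketch based on the layout produced by writeBin.py above, and assuming dataArray itself is float32, as the Fortran reader expects):
import numpy as np

with open('pybin.bin', 'rb') as fileObj:
    # header: numRows and numCols, each stored as a single float32
    nR = int(np.fromfile(fileObj, dtype=np.float32, count=1)[0])
    nC = int(np.fromfile(fileObj, dtype=np.float32, count=1)[0])
    # payload: nR*nC float32 values, written row by row
    fullData = np.fromfile(fileObj, dtype=np.float32, count=nR*nC).reshape(nR, nC)

print(fullData.shape)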
P.S.: In case you find a typo, please let me know and I will edit it right away.
Thanks a lot.
I am struggling to get to grips with this.
I create a netcdf4 file with the following dimensions and variables (note in particular the unlimited point dimension):
dimensions:
    point = UNLIMITED ; // (275935 currently)
    realization = 24 ;
variables:
    short mod_hs(realization, point) ;
        mod_hs:scale_factor = 0.01 ;
    short mod_ws(realization, point) ;
        mod_ws:scale_factor = 0.01 ;
    short obs_hs(point) ;
        obs_hs:scale_factor = 0.01 ;
    short obs_ws(point) ;
        obs_ws:scale_factor = 0.01 ;
    short fchr(point) ;
    float obs_lat(point) ;
    float obs_lon(point) ;
    double obs_datetime(point) ;
}
I have a Python program that populated this file with data in a loop (hence the unlimited record dimension - I don't know a priori how big the file will be).
After populating the file, it is 103MB in size.
My issue is that reading data from this file is quite slow. I guessed that this is something to do with chunking and the unlimited point dimension?
I ran ncks --fix_rec_dmn on the file and (after a lot of churning) it produced a new netCDF file that is only 32MB in size (which is about the right size for the data it contains).
This is a massive difference in size - why is the original file so bloated? Also - accessing the data in this file is orders of magnitude quicker. For example, in Python, to read in the contents of the hs variable takes 2 seconds on the original file and 40 milliseconds on the fixed record dimension file.
The problem I have is that some of my files contain a lot of points and seem to be too big to run ncks on (my machine runs out of memory and I only have 8 GB), so I can't convert all the data to a fixed record dimension.
Can anyone explain why the file sizes are so different and how I can make the original files smaller and more efficient to read?
By the way - I am not using zlib compression (I have opted for scaling floating point values to an integer short).
Chris
EDIT
My Python code is essentially building up one single timeseries file of collocated model and observation data from multiple individual model forecast files over 3 months. My forecast model runs 4 times a day, and I am aggregating 3 months of data, so that is ~120 files.
The program extracts a subset of the forecast period from each file (e.g. T+24h -> T+48h), so it is not a simple matter of concatenating the files.
This is a rough approximation of what my code is doing (it actually reads/writes more variables, but I am just showing 2 here for clarity):
import datetime as dt
import netCDF4 as nc
import numpy as np

# Create output file:
dout = nc.Dataset(fn, mode='w', clobber=True, format="NETCDF4")
dout.createDimension('point', size=None)
dout.createDimension('realization', size=24)

for varname in ['mod_hs', 'mod_ws']:
    v = dout.createVariable(varname, np.short,
                            dimensions=('point', 'realization'), zlib=False)
    v.scale_factor = 0.01

# Cycle over dates
date = <some start date>
end_date = <some end date>

# Keep track of record dimension ('point') size:
n = 0
while date < end_date:
    din = nc.Dataset("<path to input file>", mode='r')
    fchr = din.variables['fchr'][:]

    # get mask for specific forecast hour range
    m = np.logical_and(fchr >= 24, fchr < 48)
    sz = np.count_nonzero(m)
    if sz == 0:
        # nothing in range in this file; move on to the next one
        din.close()
        date += dt.timedelta(hours=6)
        continue

    dout.variables['mod_hs'][n:n+sz, :] = din.variables['mod_hs'][:][m, :]
    dout.variables['mod_ws'][n:n+sz, :] = din.variables['mod_wspd'][:][m, :]

    # Increment record dimension count:
    n += sz
    din.close()

    # Go to next file
    date += dt.timedelta(hours=6)

dout.close()
Interestingly, if I make the output file format NETCDF3_CLASSIC rather than NETCDF4, the output is the size I would expect. The NETCDF4 output seems to be bloated.
My experience has been that the default chunksize for record dimensions depends on the version of the netCDF library underneath. For 4.3.3.1, it is 524288. 275935 records is about half a record-chunk. ncks automatically chooses (without telling you) more sensible chunksizes than netCDF defaults, so the output is better optimized. I think this is what is happening. See http://nco.sf.net/nco.html#cnk
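If you want to see what chunk sizes a given file actually ended up with, you can query them from Python; a small sketch (the file name here is just a placeholder):
import netCDF4 as nc

# chunking() returns 'contiguous' for unchunked variables,
# or a list with the chunk size along each dimension
with nc.Dataset('timeseries.nc') as ds:
    for name, var in ds.variables.items():
        print(name, var.chunking())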
Please try to provide code that works without modification if possible; I had to edit yours to get it working, but it wasn't too difficult.
import netCDF4 as nc
import numpy as np

dout = nc.Dataset('testdset.nc4', mode='w', clobber=True, format="NETCDF4")
dout.createDimension('point', size=None)
dout.createDimension('realization', size=24)

for varname in ['mod_hs', 'mod_ws']:
    v = dout.createVariable(varname, np.short,
                            dimensions=('point', 'realization'), zlib=False,
                            chunksizes=[1000, 24])
    v.scale_factor = 0.01

date = 1
end_date = 5000
n = 0
while date < end_date:
    sz = 100
    dout.variables['mod_hs'][n:n+sz, :] = np.ones((sz, 24))
    dout.variables['mod_ws'][n:n+sz, :] = np.ones((sz, 24))
    n += sz
    date += 1

dout.close()
The main difference is in the createVariable command: without providing chunksizes when creating the variable, I also got a file twice as large compared to when I added it. So for file size that should do the trick.
For reading variables from the file, I did not actually notice any difference; maybe I should add more variables?
Anyway, it should be clear how to set the chunk size now. You will probably need to experiment a bit to find a good configuration for your problem. Feel free to ask more if it still does not work for you, and if you want to understand more about chunking, read the HDF5 docs.
I think your problem is that the default chunk size for unlimited dimensions is 1, which creates a huge number of internal HDF5 structures. By setting the chunksize explicitly (obviously ok for unlimited dimensions), the second example does much better in space and time.
Unlimited dimensions require chunking in HDF5/netCDF4, so if you want unlimited dimensions you have to think about chunking performance, as you have discovered.
More here:
https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_perf_chunking.html
I want to get fingerprints from the SMILES of compounds. I managed to do it, but the problem is that I want the fingerprints with more bits and in a list format so I can calculate the length of the lists; as it is, I just get a class instance. Any solution in Python using pybel? I did this, but when I write len(fps[0]) I get an error:
import pybel
smiles = ['CCCC', 'CCCN']
mols = [pybel.readstring("smi", x) for x in smiles]
fps = [x.calcfp() for x in mols]
print fps[0]
You can use the fp attribute of the fingerprint object. By default the FP2 fingerprint is calculated; its length is 32.
Here is code which outputs the length and bit 0:
import pybel
smiles = ['CCCC', 'CCCN']
mols = [pybel.readstring("smi", x) for x in smiles]
fps = [x.calcfp() for x in mols]
print len(fps[0].fp)
print fps[0].fp[0]
Result:
32
0
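If what you actually want is a list whose length you can take, the pybel Fingerprint object also has a bits attribute, which is the list of bit positions that are set; a short sketch along the same lines:
import pybel

smiles = ['CCCC', 'CCCN']
mols = [pybel.readstring("smi", x) for x in smiles]
fps = [x.calcfp() for x in mols]

# .bits is a plain Python list of the set bit positions, so len() works on it
print(fps[0].bits)
print(len(fps[0].bits))
calcfp() also accepts a fingerprint type name if you need something other than the default FP2 (e.g. "FP4" or "MACCS" in recent Open Babel versions).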
I am reading a file containing single-precision data with 512**3 data points. Based on a threshold, I assign each point a flag of 1 or 0. I wrote two programs doing the same thing, one in Fortran, the other in Python, but the Fortran one takes about 0.1 s while the Python one takes minutes. Is that normal? Or can you please point out the problem with my Python program:
fortran.f
      program vorticity_tracking
      implicit none
      integer, parameter :: length = 512**3
      real, parameter :: threshold = 1320.0
      character(255) :: filen
      real, dimension(length) :: stored_data
      integer, dimension(length) :: flag
      integer index

      filen = "vor.dat"
      print *, "Reading the file ", trim(filen)
      open(10, file=trim(filen), form="unformatted",
     &     access="direct", recl=length*4)
      read (10, rec=1) stored_data
      close(10)

      do index = 1, length
         if (stored_data(index) .ge. threshold) then
            flag(index) = 1
         else
            flag(index) = 0
         end if
      end do

      stop
      end program
Python file:
#!/usr/bin/env python
import struct
import numpy as np

f_type = 'float32'
length = 512**3
threshold = 1320.0
file = 'vor_00000_455.float'

f = open(file, 'rb')
data = np.fromfile(f, dtype=f_type, count=-1)
f.close()

flag = []
for index in range(length):
    if (data[index] >= threshold):
        flag.append(1)
    else:
        flag.append(0)
********* Edit ******
Thanks for your comments. I am not sure then how to do this the Fortran way in Python. I tried the following, but it is still just as slow.
flag = np.ndarray(length, dtype=np.bool)
for index in range(length):
    if (data[index] >= threshold):
        flag[index] = 1
    else:
        flag[index] = 0
Can anyone please show me?
Your two programs are totally different. Your Python code repeatedly changes the size of a structure. Your Fortran code does not. You're not comparing two languages, you're comparing two algorithms and one of them is obviously inferior.
In general Python is an interpreted language while Fortran is a compiled one. Therefore you have some overhead in Python. But it shouldn't take that long.
One thing that can be improved in the Python version is to replace the for loop with an indexing operation.
#create flag filled with zeros with same shape as data
flag=numpy.zeros(data.shape)
#get bool array stating where data>=threshold
barray=data>=threshold
#everywhere where barray==True put a 1 in flag
flag[barray]=1
shorter version:
#create flag filled with zeros with same shape as data
flag=numpy.zeros(data.shape)
#combine the two operations without temporary barray
flag[data>=threshold]=1
Try this for Python:
flag = data > threshold
It will give you an array of flags as you want.
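If you need actual 0/1 integers rather than booleans, casting the comparison result is still a single vectorized operation; a small sketch that reuses the file name and dtype from the question:
import numpy as np

data = np.fromfile('vor_00000_455.float', dtype=np.float32)
threshold = 1320.0

# boolean mask, cast to a compact integer dtype for 0/1 flags
flag = (data >= threshold).astype(np.int8)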
I've got about 3,500 files that consist of single-line character strings. The files vary in size (from about 200 B to 1 MB). I'm trying to compare each file with each other file and find a common subsequence of length 20 characters between two files. Note that the subsequence only needs to be common between the two files in each comparison, not common among all files.
I've struggled with this problem a bit, and since I'm not an expert, I've ended up with a bit of an ad-hoc solution. I use itertools.combinations to build a list in Python that ends up with around 6,239,278 combinations. I then pass the files two at a time to a Perl script that acts as a wrapper for a suffix tree library written in C called libstree. I've tried to avoid this type of solution, but the only comparable C suffix tree wrapper in Python suffers from a memory leak.
So here's my problem. I've timed it, and on my machine the solution processes about 500 comparisons in 25 seconds. That means it'll take around 3 days of continuous processing to complete the task. And then I have to do it all again to look at, say, 25 characters instead of 20. Please note that I'm way out of my comfort zone and not a very good programmer, so I'm sure there is a much more elegant way to do this. I thought I'd ask here and post my code to see if anyone has any suggestions as to how I could complete this task faster.
Python code:
from itertools import combinations
import glob, subprocess

glist = glob.glob("Data/*.g")

i = 0
for a, b in combinations(glist, 2):
    i += 1
    p = subprocess.Popen(["perl", "suffix_tree.pl", a, b, "20"],
                         shell=False, stdout=subprocess.PIPE)
    p = p.stdout.read()
    a = a.split("/")
    b = b.split("/")
    a = a[1].split(".")
    b = b[1].split(".")
    print str(i) + ":" + str(a[0]) + " --- " + str(b[0])
    if p != "" and len(p) == 20:
        with open("tmp.list", "a") as openf:
            openf.write(a[0] + " " + b[0] + "\n")
Perl code:
use strict;
use Tree::Suffix;

open FILE, "<$ARGV[0]";
my $a = do { local $/; <FILE> };
open FILE, "<$ARGV[1]";
my $b = do { local $/; <FILE> };

my @g = ($a, $b);
my $st = Tree::Suffix->new(@g);
my ($c) = $st->lcs($ARGV[2], -1);
print "$c";
Rather than writing Python to call Perl to call C, I'm sure you would be better off dropping the Python code and writing it all in Perl.
If your files are certain to contain exactly one line then you can read the pairs more simply by writing just
my @g = <>;
I believe the program below performs the same function as your Python and Perl code combined, but I cannot test it as I am unable to install libstree at present.
But as ikegami has pointed out, it would be far better to calculate and store the longest common subsequence for each pair of files and put them into categories afterwards. I won't go on to code this as I don't know what information you need - whether it is just subsequence length or if you need the characters and/or the position of the subsequences as well.
use strict;
use warnings;

use Math::Combinatorics;
use Tree::Suffix;

my @glist = glob "Data/*.g";
my $iterator = Math::Combinatorics->new(count => 2, data => \@glist);

open my $fh, '>', 'tmp.list' or die $!;

my $n = 0;
while (my @pair = $iterator->next_combination) {
    $n++;
    @ARGV = @pair;
    my @g = <>;
    my $tree = Tree::Suffix->new(@g);
    my $lcs = $tree->lcs;
    @pair = map m|/(.+?)\.|, @pair;
    print "$n: $pair[0] --- $pair[1]\n";
    print $fh "@pair\n" if $lcs and length $lcs >= 20;
}