Fingerprinting using the Python module pybel - python

I want to compute fingerprints from the SMILES strings of compounds. I managed to do that, but I want the fingerprints as full-length bit vectors in list format so that I can calculate the length of the lists. As it is, I just get Fingerprint objects. Is there a solution in Python using pybel? I did this, but when I write len(fps[0]) I get an error:
import pybel
smiles = ['CCCC', 'CCCN']
mols = [pybel.readstring("smi", x) for x in smiles]
fps = [x.calcfp() for x in mols]
print fps[0]

You can use the fp attribute of the Fingerprint object. By default the FP2 fingerprint is calculated; its fp vector has length 32 (32 unsigned 32-bit words holding the 1024-bit fingerprint).
Here is code that outputs the length and element 0:
import pybel
smiles = ['CCCC', 'CCCN']
mols = [pybel.readstring("smi", x) for x in smiles]
fps = [x.calcfp() for x in mols]
print len(fps[0].fp)
print fps[0].fp[0]
Result:
32
0
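If you want the fingerprint as a flat Python list of 0/1 values (so len() behaves the way you expect), you can unpack the 32 words yourself. A minimal sketch, assuming the default FP2 fingerprint (32 unsigned 32-bit words, i.e. 1024 bits); the to_bitlist helper is just an illustration, and the Fingerprint object also has a bits attribute listing the indices of the set bits if that is all you need:
import pybel
smiles = ['CCCC', 'CCCN']
mols = [pybel.readstring("smi", x) for x in smiles]
fps = [x.calcfp() for x in mols]
def to_bitlist(fp, word_size=32):
    # fp.fp is a vector of unsigned integers; expand each word into its bits
    bits = []
    for word in fp.fp:
        bits.extend((word >> i) & 1 for i in range(word_size))
    return bits
bitlist = to_bitlist(fps[0])
print(len(bitlist))   # 1024
print(sum(bitlist))   # number of set bits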

Related

Python store output as a variable after running an executable command

"TMalign..." is an executable file that I used to get data. How could I store the output into a variable so that I could extract target values from the output. The executable file is compiled from a long .cpp, so I do not think I could call the variable names from there.
import sys,os
os.system("./TMalign 3w4u.pdb 6bb5.pdb -u 139") #some command I have
The output looks like this, and I need to extract the TM-score values:
*********************************************************************
* TM-align (Version 20190822): protein structure alignment *
* References: Y Zhang, J Skolnick. Nucl Acids Res 33, 2302-9 (2005) *
* Please email comments and suggestions to yangzhanglab@umich.edu *
*********************************************************************
Name of Chain_1: 3w4u.pdb (to be superimposed onto Chain_2)
Name of Chain_2: 6bb5.pdb
Length of Chain_1: 141 residues
Length of Chain_2: 139 residues
Aligned length= 139, RMSD= 1.07, Seq_ID=n_identical/n_aligned= 0.590
TM-score= 0.94726 (if normalized by length of Chain_1, i.e., LN=141, d0=4.42)
TM-score= 0.96044 (if normalized by length of Chain_2, i.e., LN=139, d0=4.38)
TM-score= 0.96044 (if normalized by user-specified LN=139.00 and d0=4.38)
(You should use TM-score normalized by length of the reference structure)
(":" denotes residue pairs of d < 5.0 Angstrom, "." denotes other aligned residues)
SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::. .
-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK-Y
Total CPU time is 0.03 seconds
Thanks for the help!
You should look at the following approach:
import re
from subprocess import check_output

ret = check_output(['./TMalign', '3w4u.pdb', '6bb5.pdb', '-u', '139'])
tm_scores = []
for line in ret.decode().splitlines():      # decode the bytes and split into lines
    if re.match(r'^TM-score=', line):
        tm_scores.append(line.split()[1])   # keep only the value after "TM-score="
# tm_scores now contains: ['0.94726', '0.96044', '0.96044']
While somewhat elaborate, this is a flexible and tunable solution. Note that if it is going to be used alongside other code, it would be better to wrap it in a function, as sketched below.
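A minimal sketch of such a wrapper (the tm_scores name and the float conversion are my own choices, not part of TMalign):
import re
from subprocess import check_output

def tm_scores(pdb1, pdb2, norm_length):
    """Run TMalign and return the TM-score values it reports, as floats."""
    out = check_output(['./TMalign', pdb1, pdb2, '-u', str(norm_length)])
    return [float(line.split()[1])
            for line in out.decode().splitlines()
            if re.match(r'^TM-score=', line)]

# scores = tm_scores('3w4u.pdb', '6bb5.pdb', 139)
# scores -> [0.94726, 0.96044, 0.96044]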
My approach wasn't that smart: I let the output be written to a file and worked from that:
import os
cmd = './TMalign 3w4u.pdb 6bb5.pdb -u 139'
os.system(cmd + " >> 1.txt")   # note the leading space, so the shell doesn't parse "139>>" as a file-descriptor redirection
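A small follow-up sketch for this approach: read 1.txt back and pull out the TM-score values (same parsing idea as in the answer above):
tm_scores = []
with open('1.txt') as fh:
    for line in fh:
        if line.startswith('TM-score='):
            tm_scores.append(line.split()[1])
# tm_scores -> ['0.94726', '0.96044', '0.96044']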

Print scientific variable on txt file using f.write()

For the last couple of hours I have been trying to print a simple time vector to a txt file using Python.
import numpy as np
Tp = 2000 * 10**(-9)
dt = Tp / (90000)
t = np.linspace(0,Tp,dt)
timing = open("time.txt","w")
for ii in range(len(t)):
    timing.write(str(t[ii]))
    timing.write("\n")
timing.close()
But I still get an empty file and I don't understand why at all.
Maybe I have to be more specific in the function about the precision I want.
Since I have a lot of small numbers (4e-10, ...) to process, I would like to understand a general method to write variables (not the entire vector at once) to a txt file in exponential notation (in Matlab it's more or less automatic, I think).
Thanks
You have an error using linspace: its third argument is the number of samples, not the step size, so dt = Tp / 90000 (about 2.2e-11) gets truncated to zero samples and the loop never writes anything. Please check https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html
Try this:
import numpy as np
Tp = 2000 * 10**(-9)
# dt = Tp / 90000.0
dt = 90000                     # number of samples for linspace, not the step size
t = np.linspace(0, Tp, dt)
timing = open("time.txt", "w")
for ii in range(len(t)):
    timing.write(str(t[ii]))
    timing.write("\n")
timing.close()
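For the exponential-notation part of the question, you don't have to rely on str(): format each value explicitly, or let numpy write the whole vector. A small sketch (the %.6e precision is an arbitrary choice):
import numpy as np

Tp = 2000 * 10**(-9)
t = np.linspace(0, Tp, 90000)

with open("time.txt", "w") as timing:
    for value in t:
        timing.write("%.6e\n" % value)   # one value per line, exponential notation

# or write the whole vector in one call
np.savetxt("time_vec.txt", t, fmt="%.6e")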

What is the fastest way to sort strings in Python if locale is a non-concern?

I was trying to find a fast way to sort strings in Python where locale is a non-concern, i.e. I just want to sort the array lexically according to the underlying bytes. This is a perfect fit for something like radix sort. Here is my MWE:
import numpy as np
import timeit
# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]
N=int(1e7)
K=100
id3 = randChar("id%010d", N//K, N) # small groups (char)
timeit.Timer("id3.sort()" ,"from __main__ import id3").timeit(1) # 6.8 seconds
As you can see it took 6.8 seconds which is almost 10x slower than R's radix sort below.
N = 1e7
K = 100
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE)
system.time(sort(id3,method="radix"))
I understand that Python's .sort() doesn't use radix sort; is there an implementation somewhere that lets me sort strings as performantly as R does?
AFAIK both R and Python "intern" strings so any optimisations in R can also be done in Python.
The top Google result for "radix sort strings python" is this gist, which produced an error when sorting my test array.
It is true that R interns all strings, meaning it has a "global character cache" which serves as a central dictionary of all strings ever used by your program. This has its advantages: the data takes less memory, and certain algorithms (such as radix sort) can take advantage of this structure to achieve higher speed. This is particularly true for scenarios such as the one in your example, where the number of unique strings is small relative to the size of the vector. On the other hand it has its drawbacks too: the global character cache prevents multi-threaded write access to character data.
In Python, afaik, only string literals are interned. For example:
>>> 'abc' is 'abc'
True
>>> x = 'ab'
>>> (x + 'c') is 'abc'
False
In practice it means that, unless you've embedded data directly into the text of the program, nothing will be interned.
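A quick CPython illustration of this (sys.intern in Python 3; the builtin intern in Python 2):
import sys

a = 'abc'                      # identifier-like literal: interned at compile time
b = ''.join(['ab', 'c'])       # built at run time: equal, but a different object
print(b == a, b is a)          # True False
c = sys.intern(b)              # explicit interning
print(c is sys.intern(a))      # True: equal interned strings share one object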
Now, for your original question: "what is the fastest way to sort strings in Python?" You can achieve very good speeds, comparable with R, with the Python datatable package. Here's a benchmark that sorts N = 10⁸ strings, randomly selected from a set of 1024:
import datatable as dt
import pandas as pd
import random
from time import time
n = 10**8
src = ["%x" % random.getrandbits(10) for _ in range(n)]
f0 = dt.Frame(src)
p0 = pd.DataFrame(src)
f0.to_csv("test1e8.csv")
t0 = time(); f1 = f0.sort(0); print("datatable: %.3fs" % (time()-t0))
t0 = time(); src.sort(); print("list.sort: %.3fs" % (time()-t0))
t0 = time(); p1 = p0.sort_values(0); print("pandas: %.3fs" % (time()-t0))
Which produces:
datatable: 1.465s / 1.462s / 1.460s (multiple runs)
list.sort: 44.352s
pandas: 395.083s
The same dataset in R (v3.4.2):
> require(data.table)
> DT = fread("test1e8.csv")
> system.time(sort(DT$C1, method="radix"))
user system elapsed
6.238 0.585 6.832
> system.time(DT[order(C1)])
user system elapsed
4.275 0.457 4.738
> system.time(setkey(DT, C1)) # sort in-place
user system elapsed
3.020 0.577 3.600
Jeremy Mets posted in the comments of this blog post that Numpy can sort strings fairly quickly by converting the list to an np.array. This indeed improves performance, however it is still slower than Julia's implementation.
import numpy as np
import timeit
# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]
N=int(1e7)
K=100
id3 = np.array(randChar("id%010d", N//K, N)) # small groups (char)
timeit.Timer("id3.sort()" ,"from __main__ import id3").timeit(1) # 6.8 seconds

Generating LaTeX tables from R summary with RPy and xtable

I am running a few linear model fits in python (using R as a backend via RPy) and I would like to export some LaTeX tables with my R "summary" data.
This thread explains quite well how to do it in R (with the xtable function), but I cannot figure out how to implement this in RPy.
The only relevant thing that searches such as "Chunk RPy" or "xtable RPy" returned was this, which seems to load the package in Python but not to use it :-/
Here's an example of how I use RPy and what happens.
This is the error I get, without even bothering to load any data:
from rpy2.robjects.packages import importr
xtable = importr('xtable')
latex = xtable('')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-131-8b38f31b5bb9> in <module>()
----> 1 latex = xtable(res_sum)
2 print latex
TypeError: 'SignatureTranslatedPackage' object is not callable
I have tried using the stargazer package instead of xtable and I get the same error.
Ok, I solved it, and I'm a bit ashamed to say that it was a total no-brainer.
You just have to call the functions as xtable.xtable() or stargazer.stargazer().
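For reference, a minimal sketch of the working pattern (the toy data and model here are made up, and I'm assuming a reasonably recent rpy2):
from rpy2.robjects.packages import importr
import rpy2.robjects as ro

base = importr('base')
xtable = importr('xtable')

# a toy linear model fitted entirely on the R side
ro.r('d <- data.frame(x = 1:10, y = (1:10) + rnorm(10))')
fit = ro.r('lm(y ~ x, data = d)')

tab = xtable.xtable(base.summary(fit))   # call xtable.xtable(), not xtable()
ro.r['print'](tab)                       # prints the LaTeX code for the table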
To easily generate TeX data from Python, I wrote the following function:
import re
def tformat(txt, v):
"""Replace the variables between [] in raw text with the contents
of the named variables. Between the [] there should be a variable name,
a colon and a formatting specification. E.g. [smin:.2f] would give the
value of the smin variable printed as a float with two decimal digits.
:txt: The text to search for replacements
:v: Dictionary to use for variables.
:returns: The txt string with variables substituted by their formatted
values.
"""
rx = re.compile(r'\[(\w+)(\[\d+\])?:([^\]]+)\]')
matches = rx.finditer(txt)
for m in matches:
nm, idx, fmt = m.groups()
try:
if idx:
idx = int(idx[1:-1])
r = format(v[nm][idx], fmt)
else:
r = format(v[nm], fmt)
txt = txt.replace(m.group(0), r)
except KeyError:
raise ValueError('Variable "{}" not found'.format(nm))
return txt
You can use any variable name from the dictionary in the text that you pass to this function and have it replaced by the formatted value of that variable.
What I tend to do is to do my calculations in Python, and then pass the output of the globals() function as the second parameter of tformat:
smin = 235.0
smax = 580.0
lst = [0, 1, 2, 3, 4]
t = r'''The strength of the steel lies between \SI{[smin:.0f]}{MPa} and \SI{[smax:.0f]}{MPa}. lst[2] = [lst[2]:d].'''
print tformat(t, globals())
Feel free to use this. I put it in the public domain.
Edit: I'm not sure what you mean by "linear model fits", but might numpy.polyfit do what you want in Python?
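In case that suggestion helps, a tiny numpy.polyfit sketch for a degree-1 (linear) fit with made-up data:
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])
slope, intercept = np.polyfit(x, y, 1)   # least-squares fit of y ~ slope*x + intercept
print(slope, intercept)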
To resolve your problem, please update stargazer to version 4.5.3, now available on CRAN. Your example should then work perfectly.

Python libraries to calculate human readable filesize from bytes?

I find hurry.filesize very useful, but it doesn't give the output in decimals.
For example:
print size(4026, system=alternative) gives 3 KB.
But later, when I add up all the values, I don't get the exact sum. For example, if the output of hurry.filesize is stored in 4 variables and each value is 3, when I add them all I get 15 as the output.
I am looking for an alternative to hurry.filesize that gives the output in decimals too.
This isn't really hard to implement yourself:
suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
def humansize(nbytes):
    i = 0
    while nbytes >= 1024 and i < len(suffixes)-1:
        nbytes /= 1024.
        i += 1
    f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
    return '%s %s' % (f, suffixes[i])
Examples:
>>> humansize(131)
'131 B'
>>> humansize(1049)
'1.02 KB'
>>> humansize(58812)
'57.43 KB'
>>> humansize(68819826)
'65.63 MB'
>>> humansize(39756861649)
'37.03 GB'
>>> humansize(18754875155724)
'17.06 TB'
Disclaimer: I wrote the package I'm about to describe
The module bitmath supports the functionality you've described. It also addresses the comment made by @filmore that semantically we should be using NIST unit prefixes (not SI), that is to say, MiB instead of MB. Rounding is now supported as well.
You originally asked about:
print size(4026, system=alternative)
In bitmath the default prefix-unit system is NIST (1024-based), so, assuming you were referring to 4026 bytes, the equivalent solution in bitmath would look like any of the following:
In [1]: import bitmath
In [2]: print bitmath.Byte(bytes=4026).best_prefix()
3.931640625KiB
In [3]: human_prefix = bitmath.Byte(bytes=4026).best_prefix()
In [4]: print human_prefix.format("{value:.2f} {unit}")
3.93 KiB
I currently have an open task to allow the user to select a preferred prefix-unit system when using the best_prefix method.
Update: 2014-07-16 The latest package has been uploaded to PyPi, and it includes several new features (full feature list is on the GitHub page)
This is not necessarily faster than @nneonneo's solution, it's just a bit cooler, if I can say that :)
import math
suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
def human_size(nbytes):
    human = nbytes
    rank = 0
    if nbytes != 0:
        rank = int((math.log10(nbytes)) / 3)
        rank = min(rank, len(suffixes) - 1)
        human = nbytes / (1024.0 ** rank)
    f = ('%.2f' % human).rstrip('0').rstrip('.')
    return '%s %s' % (f, suffixes[rank])
This works based on the fact that the integer part of the base-10 logarithm of any positive integer is one less than its number of digits. The rest is pretty much straightforward.
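A quick check of that digit-count fact:
import math

for n in (9, 10, 999, 1000, 123456):
    print(n, int(math.log10(n)) + 1, len(str(n)))   # the last two columns always match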
I used to reinvent the wheel every time I wrote a little script or ipynb or whatever. It got tedious, so I wrote the datasize Python module. I'm posting this here because I just updated it, and wow have the Python versions moved up!
It provides a DataSize class, which subclasses int, so arithmetic just works; however, arithmetic returns plain int, because I use it with Pandas and some numpy and I didn't want to slow things down where there is Python<-->C++ translation for matrix math libraries.
You can construct a DataSize object from a string with either SI or NIST suffixes, in either bits or bytes, and even weird word lengths if you need to work with data for embedded tech that uses those. The DataSize object has an intuitive format() code syntax for human-readable representation. Internally the value is just an integer count of 8-bit bytes.
e.g.
>>> from datasize import DataSize
>>> 'My new {:GB} SSD really only stores {:.2GiB} of data.'.format(DataSize('750GB'),DataSize(DataSize('750GB') * 0.8))
'My new 750GB SSD really only stores 558.79GiB of data.'
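And a quick sketch of the arithmetic behaviour described above; the exact values assume the '1KiB' constructor string is accepted and that a KiB is 1024 bytes internally:
>>> from datasize import DataSize
>>> size = DataSize('1KiB')
>>> int(size)
1024
>>> type(size + size)      # arithmetic drops back to plain int
<class 'int'>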
