I would like get some basic understanding how to use biopython for clustering genes.
Lets say i have genes that i would like to group. How to feed them to the algorithm, and how to give a cutoff point under which size and amount of cluster would depend?
I've tried straightforward approach:
from Bio.Cluster import kcluster
list1 = [
'ADHAMKCAIROSURBANDJVUGLOBALIZATIONANDURBANFANTASIESPLA',
'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB'
]
list2 = [Seq(gen, IUPAC.extended_protein) for gen in list1]
clusterid, error, nfound = kcluster(list2)
but it just brought me an error:
Traceback (most recent call last):
File "./test.py", line 9, in <module>
clusterid, error, nfound = kcluster(list2)
TypeError: data cannot be converted to needed array.
The kcluster function takes a data matrix as input and not Seq instances.
You need to convert your sequences to a matrix and provide that to the kcluster function.
One way of converting the data to a matrix containing numerical elements only is by using the numpy.fromstring function. It basically translates each letter in a sequence to it's ASCII counterpart.
This creates a 2D array of encoded sequences that the kcluster function recognized and uses to cluster your sequences.
>>> from Bio.Cluster import kcluster
>>> import numpy as np
>>> sequences = [
... 'ADHAMKCAIROSURBANDJVUGLOBALIZATIONANDURBANFANTASIESPLA',
... 'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
... 'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB',
... 'AGGESTAMKTHEARABSTATEANDNEOLIBERALGLOBALIZATIONTHEARAB'
... ]
>>> matrix = np.asarray([np.fromstring(s, dtype=np.uint8) for s in sequences])
>>> clusterid, error, nfound = kcluster(matrix)
>>> print(clusterid)
[1, 0, 0, 0]
Related
I am trying to write a code which reads this file and gives the inverse of square root of each term in a matrix. This is the file I am using:
1.659999999999999963e-04
3.970000000000000005e-04
-8.014499999999999402e-02
-2.274299999999999933e-02
-7.559999999999999880e-03
-3.156229999999999869e-01
5.650100000000000261e-02
2.350100000000000106e-02
-4.383999999999999876e-03
-4.878299999999999997e-02
1.207599999999999993e-02
-5.254199999999999843e-02
1.123500000000000019e-02
1.614240000000000119e-01
1.954900000000000040e-02
-2.614100000000000104e-02
1.534899999999999980e-02
5.446000000000000320e-03
-6.210299999999999848e-02
-9.615000000000000283e-03
1.687800000000000064e-02
6.460999999999999729e-03
-9.490999999999999437e-03
1.676700000000000065e-02
-2.308000000000000156e-03
-1.412399999999999940e-02
8.978899999999999382e-02
1.848960000000000048e-01
5.956000000000000356e-03
-5.592300000000000049e-02
1.114599999999999966e-02
-5.689600000000000213e-02
-6.731000000000000004e-03
2.572999999999999940e-02
1.512000000000000106e-03
-3.237999999999999993e-03
-4.068999999999999700e-03
-1.234000000000000071e-03
2.378109999999999946e-01
-1.128000000000000096e-03
-3.534999999999999948e-03
-4.550000000000000008e-04
1.479999999999999925e-04
5.220000000000000031e-04
3.718099999999999877e-02
1.104580000000000006e-01
1.965000000000000167e-03
4.266999999999999960e-03
-5.140999999999999737e-03
1.648640000000000105e-01
1.776220000000000021e-01
1.922000000000000097e-03
3.250600000000000017e-02
4.402899999999999869e-02
-8.430999999999999259e-03
4.409999999999999858e-04
1.389999999999999905e-04
1.374209999999999876e-01
-2.431860000000000133e-01
-1.727000000000000019e-03
-2.280000000000000126e-04
8.100000000000000375e-05
-7.480999999999999803e-03
8.000000000000000654e-05
-3.939999999999999817e-04
1.441000000000000007e-03
-7.290000000000000473e-04
-3.663000000000000284e-02
-1.657999999999999969e-03
-8.369999999999999619e-04
-6.904999999999999680e-03
1.593100000000000072e-02
-3.393000000000000183e-03
1.495999999999999934e-03
-7.368999999999999682e-03
1.436199999999999977e-02
-1.319700000000000040e-02
-4.557000000000000287e-03
8.123700000000000365e-02
2.447399999999999923e-02
-1.295199999999999997e-02
-8.722100000000000686e-02
-5.232999999999999804e-03
-1.255940000000000112e-01
1.291999999999999963e-03
-1.382999999999999898e-03
4.989999999999999644e-03
1.508000000000000009e-03
2.304399999999999851e-02
2.819400000000000031e-02
3.119999999999999944e-04
-8.781999999999999876e-03
6.794500000000000539e-02
6.198999999999999649e-03
-2.058879999999999877e-01
9.219999999999999680e-04
-1.618800000000000100e-02
-3.415860000000000007e-01
-1.660999999999999933e-03
-4.889999999999999599e-04
1.759999999999999954e-04
-3.763999999999999985e-03
-6.566600000000000215e-02
-7.680000000000000195e-04
-1.231799999999999978e-01
7.047999999999999578e-03
1.425000000000000051e-02
-2.900799999999999906e-02
1.187499999999999944e-01
-1.449199999999999933e-01
-1.106999999999999911e-03
-1.557999999999999923e-03
-2.236999999999999839e-03
-7.270699999999999386e-02
-5.140000000000000254e-04
2.246999999999999865e-03
-1.778949999999999976e-01
1.669599999999999904e-02
-1.277799999999999943e-02
-2.379040000000000044e-01
-2.207999999999999893e-03
1.925000000000000062e-03
7.750100000000000044e-02
-5.004100000000000215e-02
1.704999999999999918e-03
3.272400000000000309e-02
1.957499999999999865e-02
-1.514620000000000133e-01
-3.288999999999999823e-03
-3.605699999999999877e-02
3.648999999999999900e-03
3.459799999999999681e-02
-1.859999999999999945e-04
3.300000000000000253e-05
3.000000000000000076e-06
9.999999999999999547e-07
4.800000000000000122e-05
1.361999999999999929e-03
-6.057300000000000184e-02
5.689999999999999529e-04
-5.000000000000000409e-06
2.984699999999999853e-02
6.999999999999999387e-05
4.600000000000000004e-05
-1.294499999999999991e-02
-2.318000000000000182e-03
-0.000000000000000000e+00
1.858200000000000129e-01
-5.969959999999999711e-01
6.000000000000000152e-06
The code I have tried to write is this:
from ast import Num
from cmath import sqrt
import math
import numpy as np
from numpy import append
f1= open('diagonal.txt', 'r')
k=f1.readlines()
#def mkstr(s):
# str1=" "
# return (str1.join(s))
#print(float(mkstr(k)))
#for j in f1:
#inverse_root(j)
#def mkstr2():
# f1=open("diagonal.txt", "r")
# k=f1.readlines()
# fl=[float(item) for item in k]
# inverse_root(fl)
#for j in mkstr2():
#inverse_root(j)
f=[float(x) for x in k]
sa=np.array(f)
ca=sa.astype(np.float)
def inverse_root(j):
return [1/(sqrt(j))]
j=[inverse_root(ca)]
I am getting the output like this:
DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
ca=sa.astype(np.float)
Traceback (most recent call last):
File "/home/gian-2018/prateek/python/read_file/27/5/rmwpd.py", line 34, in <module>
j=[inverse_root(ca)]
File "/home/gian-2018/prateek/python/read_file/27/5/rmwpd.py", line 33, in inverse_root
return [1/(sqrt(j))]
TypeError: only length-1 arrays can be converted to Python scalars
So to get inverse of sqrt for each term from the matrix file: I tried to convert the list of these strings into float list, but even after I get error for trying to get list of these numbers into the formula of inverse sqrt instead of using a real number.
Now I am trying to create the inverse formula suitable for an array which means creating a new array which converts each term in the previous matrix into their inverse sqrt and saves the value into the new array.
By any way possible please either tell me how to convert this list into real numbers or convert each term of this matrix into its inverse sqrt and save it into new matrix.
Or just tell me how to implement this task in any way possible if you have understood it.
I'm trying to compute the natural log of the exponential plus the median of an array but precision of the floating point numbers has to be at 2000 or the answer will always be 0.
Here is what I have so far:
import bigfloat
x = np.array([-15349.79, -15266.66, -15242.86])
answer = np.median(x) - bigfloat.log(np.mean(bigfloat.exp(x, precision(2000)) + np.median(x)))
This code returns the following error because BigFloat.exp doesn't work on list types.
__main__:1: RuntimeWarning: invalid value encountered in log
array([ nan, nan, nan])
>>> np.median - bigfloat.log(np.mean(bigfloat.exp(x, precision(2000)) + np.median(x)))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/bigfloat/core.py", line 1446, in exp
(BigFloat._implicit_convert(x),),
File "/usr/local/lib/python2.7/dist-packages/bigfloat/core.py", line 800, in _implicit_convert
"to BigFloat" % (arg, type(arg)))
TypeError: Unable to convert argument [-15349.79 -15266.66 -15242.86] of type <type 'numpy.ndarray'> to BigFloat
I then tried list comprehension to build exponentials with precision=2000 but the output loses precision.
>>>[bigfloat.exp(num, precision(2000)) + BigFloat(np.median(x), context=precision(2000)) for num in x]
[BigFloat.exact('-15266.660000000000', precision=53), BigFloat.exact('-15266.660000000000', precision=53), BigFloat.exact('-15266.660000000000', precision=53)]
I can get exponential values from x with precision using list comprehension
like this:
>>> y = np.array([bigfloat.exp(num, precision(2000)) for num in x])
array([ BigFloat.exact('4.687104391719151845495683439738733373338442839115295629901427982078731632624437676931490944129698360820181432182432247030174990996132679553420878801399467284014932435857939862423137788809352433385343338286781879066456873472082636971652587852379587642184738259777920214145750081457830886095059600271300801524086458081526778407096875577469007859748418787409261916565428551066536598870494431653463691369520685259389041691328858076222360659135651318185915247243253437425313718584237418827954902711646850362440592578566737383074556346759332830833101196990845567845361685702120838400980658468192376607029699380e-6667', precision=2000),
BigFloat.exact('5.940252515613784074556309936460240719247289162721506036367173706446251980105141516508485353816420203092949808160441872102510858074752258842471232457791645630942356791030979819957875784926363292788234347364203226596687259399429746917920741954706671428927952815041593801046941161417184730521922391569252658359604561007344640221209247257690437938341582528542446104212627600044421405229891732857592357725820529732658234033631318425476205945557122313924830225909708184718714804450362854280687393958953216361511561704836206669636791590655443171684811683402330053964681285971765503804198242627954802184910024902e-6631', precision=2000),
BigFloat.exact('1.288289823457680769007768910906908351089465066737359292851136781233384628626343352992636825492031571595103435334370758196286825224816434202324210124540257467624499933926757723584914581286627071375803686797647455160335628799775858142489656885909172399293492295645703203222218182084249732700966821340431452575815746282651823925677898549539269454644804048695031225101971491985645951419388454738446335860070391719831039721150295437452237572831818257568221336263814212095143032293694959570063677473370798577031762607527221992020661837810811705007892502559603984057933079724597470162879821292957122400421187875e-6620', precision=2000)], dtype=object)
So, how do I add np.median(x) to each item in the exponential array and how do I get the log of the elements in the final array? Is there an easier way to compute the equation given by variable answer in the first code snippet above?
I'm basically trying to convert this R code to Python:
llMed <- stats::median(x)
metric <- as.double(
llMed - log( Rmpfr::mean( exp( -Rmpfr::mpfr(x, prec=2000L) + llMed )))
)
I think I was able to transpose the R code above, using gmpy2 instead of BigFloat but probably could have used either:
import gmpy2 as gp
import numpy as np
with gp.local_context(gp.context(), precision=2000) as ctx:
x = np.array([gp.mpfr(-15349.79), gp.mpfr(-15266.66), gp.mpfr(-15242.86)])
answer = np.median(x) - gp.log(np.mean([gp.exp(gp.add(np.median(x), -gp.mpfr(item))) for item in x]))
Output returns:
mpfr('-15348.69138771133276342351845677418587273364155071266454924792376077085597654860712492594599688058199377288639137666410888137048513058096745776113277020531535761588878042016412651412489608285582998650481415959482636259292554205272401992800167437149224493900214813760731711328834885525224938584036572786202764947113186177391441168267495103157033223800214492991771147327262071249020907408433551462034872024117419365924381891832161483692689444158313503155805562703358104122358438183824459422779020196496507161021965435781332319270106503099589290788685588200038739438059696970511568363022088057740627040541967',2000)
I am writing a program that will append a list with a single element pulled from a 2 dimensional numpy array. So far, I have:
# For loop to get correlation data of selected (x,y) pixel for all bands
zdata = []
for n in d.bands:
cor_xy = np.array(d.bands[n])
zdata.append(cor_xy[y,x])
Every time I run my program, I get the following error:
Traceback (most recent call last):
File "/home/sdelgadi/scr/plot_pixel_data.py", line 36, in <module>
cor_xy = np.array(d.bands[n])
TypeError: only integer arrays with one element can be converted to an index
My method works when I try it from the python interpreter without using a loop, i.e.
>>> zdata = []
>>> a = np.array(d.bands[0])
>>> zdata.append(a[y,x])
>>> a = np.array(d.bands[1])
>>> zdata.append(a[y,x])
>>> print(zdata)
[0.59056658, 0.58640128]
What is different about creating a for loop and doing this manually, and how can I get my loop to stop causing errors?
You're treating n as if it's an index into d.bands when it's an element of d.bands
zdata = []
for n in d.bands:
cor_xy = np.array(n)
zdata.append(cor_xy[y,x])
You say a = np.array(d.bands[0]) works. The first n should be exactly the same thing as d.bands[0]. If so then np.array(n) is all you need.
I've been having some problems with this code, trying to end up with an inner product of two 1-D arrays. The code of interest looks like this:
def find_percents(i):
percents=[]
median=1.5/(6+2*int(i/12))
b=2*median
m=b/(7+2*int(i/12))
for j in xrange (1,6+2*int(i/12)):
percents.append(float((b-m*j)))
percentlist=numpy.asarray(percents, dtype=float)
#print percentlist
total=sum(percentlist)
return total, percentlist
def playerlister(i):
players=[]
for i in xrange(i+1,i+6+2*int(i/12)):
position=sheet.cell(i,2)
points=sheet.cell(i,24)
if re.findall('RB', str(position.value)):
vbd=points.value-rbs[24]
players.append(vbd)
else:
pass
playerlist=numpy.asarray(players, dtype=float)
return playerlist
def others(i,percentlist,playerlist,total):
alternatives=[]
playerlist=playerlister(i)
percentlist=find_percents(i)
players=numpy.dot(playerlist,percentlist)
I am receiving the following error in response to the very last line of this attached code:
ValueError: setting an array element with a sequence.
In most other examples of this error, I have found the error to be because of incorrect data types in the arrays percentlist and playerlist, but mine should be float type. If it helps at all, I call these functions a little later in the program, like so:
for i in xrange(1,30):
total, percentlist= find_percents(i)
playerlist= playerlister(i)
print type(playerlist[i])
draft_score= others(i,percentlist,playerlist,total)
Can anyone help me figure out why I am setting an array element with a sequence? Please let me know if any more information might be helpful! Also for clarity, the playerlister is making use of the xlrd module to extract data from a spreadsheet, but the data are numerical and testing has shown that that both lists have a type of numpy.float64.
The shape and contents of each of these for one iteration of i is
<type 'numpy.float64'>
(5,)
[ 73.7 -94.4 140.9 44.8 130.9]
(5,)
[ 0.42857143 0.35714286 0.28571429 0.21428571 0.14285714]
Your function find_percents returns a two-element tuple.
When you call it in others, you are binding that tuple to the variable named percentlist, which you then try to use in a dot-product.
My guess is that by writing this in others it is fixed:
def others(i,percentlist,playerlist,total):
playerlist = playerlister(i)
_, percentlist = find_percents(i)
players = numpy.dot(playerlist,percentlist)
provided of course playerlist and percentlist always have the same number of elements (which we can't check because of the missing spreadsheet).
To verify, the following gives you the exact error message and the minimum of code needed to reproduce it:
>>> import numpy as np
>>> a = np.arange(5)
>>> np.dot(a, (2, a))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.
I'm a beginner in python. I wrote a function as follows:
import numpy as np
def crossover(v1,v2):
N=2
v1n=np.zeros(shape=(1,N+1))
v2n=np.zeros(shape=(1,N+1))
beta=np.random.rand(1)
v1n[0,0]=(1-beta)*v1[0]+beta*v2[0]
v1n[0][1]=v1[1]
v2n[0][0]=(1-beta)*v2[0]+beta*v1[0]
v2n[0][1]=v2[1]
return (v1n,v2n)
when I want to see crossover([3,4],[7,8]), the following error....:
Traceback (most recent call last):
File "<pyshell#82>", line 1, in <module>
crossover([4,5],[5,4])
File "C:\Python27\crossover.py", line 11, in crossover
v1n[0,0]=(1-beta)*v1[0]+beta*v2[0]
TypeError: 'int' object has no attribute '__getitem__'
Your code is running fine on python 2.7.8 (on my computer). But i suggest your output is bad.
if you run your code you get output:
(array([[ 4.91965332, 5. , 0. ]]), array([[ 4.08034668, 4. , 0. ]]))
This is actually a tuple with two arrays containing each a list.
you are using numpy for small 'arrays' and that is actually slower than normal list.
numpy is used for hundreds, even thousands of data.
check this link for more info about numpy speed
i would suggest you just use list instead.
let me give you an example on how i would have done it without numpy :D
import random
v1=[5,4]
v2=[4,5]
# basicly random number from 0 to 1
beta=random.random()
# let's initialize v1n and v2n (:
v1n = [0,0,0]
v2n = [0,0,0]
v1n[0] = (1-beta)*v1[0]+beta*v2[0]
v1n[1] = v1[1]
v2n[0] = (1-beta)*v2[0]+beta*v1[0]
v2n[1]=v2[1]
print("first 3d array:")
print(v1n)
print("second 3d array:")
print(v2n)
print("note that this really is 2d arrays because the 3rd dimension is always zero")