I need to write code that tests a numpy array of cutoff values for a classification problem. The values to test are stored in the cutoff_list variable. I then want to place the list of resulting confusion matrices in a dictionary. However, the code below gives me only the first dictionary entry (confusion matrix for the first test value):
cutoff_list = [np.arange(0,1,0.01)] # list of test values
dictionary = {}
for i, v in enumerate(cutoff_list):
actual = (df.observed)
predicted = np.where(df.indicator > i, 1, 0)
df_confusion = confusion_matrix(actual, predicted) / len(df.indicator)
dictionary[i] = df_confusion
print(dictionary)
Libraries that I am using:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
Is this a problem with the loop or the dictionary update step? I'm new to Python, have more experience with R and still struggling here. Any help appreciated.
Your for loop runs only once, because of the way you set it up.
You are enclosing the return of the np.arrange(0.1.0.01) in a list which breaks the way you want your for loop to run. You are getting just one value because the for loop runs just once as the outer list has just one item.
>>> cutoff_list = [np.arange(0,1,0.01)]
>>> cutoff_list
[array([0. , 0.01, ... 0.98, 0.99])]
>>> type(cutoff_list)
<class 'list'>
You want to get the actual numpy array:
>>> cutoff_list = np.arange(0,1,0.01)
>>> cutoff_list
array([0. , 0.01, ... 0.98, 0.99])
>>> type(cutoff_list)
<class 'numpy.ndarray'>
Change the line cutoff_list = [np.arange(0,1,0.01)] to cutoff_list = np.arange(0,1,0.01) and see whether that resolves your problem.
I would also imagine that you want to use v instead of i in this line:
predicted = np.where(df.indicator > i, 1, 0)
as the i will hold just the enumerating value that you use as key for your dict, whereas the v will hold the values from cutoff_list.
Related
Assume I have the following dataframe. How can I create a new column "new_col" containing the centroids? I can only create the column with the labs, not with the centroids.
Here is my code.
from sklearn import preprocessing
from sklearn.cluster import KMeans
numbers = pd.DataFrame(list(range(1,1000)), columns = ['num'])
kmean_model = KMeans(n_clusters=5)
kmean_model.fit(numbers[['num']])
kmean_model.cluster_centers_
array([[699. ],
[297. ],
[497.5],
[899.5],
[ 99. ]])
numbers['new_col'] = kmean_model.predict(numbers[['num']])
It is simple. Just use .labels_ as follows.
numbers['new_col'] = kmean_model.labels_
Edit. Sorry my mistake.
Make dictionary whose key is label and value is centers, and replace the new_col using the dictionary. See the following.
label_center_dict = {k:v for k, v in zip(kmean_model.labels_, kmean_model.cluster_centers_)}
numbers['new_col'] = kmean_model.labels_
numbers['new_col'].replace(label_center_dict, inplace = True)
I´ve a Pandas dataframe that I read from csv and contains X and Y coordinates and a value that I need to put in a matrix and save it to a text file. So, I created a numpy array with max(X) and max(Y) extension.
I´ve this file:
fid,x,y,agblongo_tch_alive
2368458,1,1,45.0126083457747
2368459,1,2,44.8996854102889
2368460,2,2,45.8565022933761
2358154,3,1,22.6352522929758
2358155,3,3,23.1935887499899
And I need this one:
45.01 44.89 -9999.00
-9999.00 45.85 -9999.00
22.63 -9999.00 23.19
To do that, I´m using a loop like this:
for row in data.iterrows():
p[int(row[1][2]),int(row[1][1])] = row[1][3]
and then I save it to disk using np.array2string. It works.
As the original csv has 68 M lines, it´s taking a lot of time to process, so I wonder if there´s another more pythonic and fast way to do that.
Assuming the columns of your df are 'x', 'y', 'value', you can use advanced indexing
>>> x, y, value = data['x'].values, data['y'].values, data['value'].values
>>> result = np.zeros((y.max()+1, x.max()+1), value.dtype)
>>> result[y, x] = value
This will, however, not work properly if coordiantes are not unique.
In that case it is safer (but slower) to use add.at:
>>> result = np.zeros((y.max()+1, x.max()+1), value.dtype)
>>> np.add.at(result, (y, x), value)
Alternatively, you can create a sparse matrix since your data happen to be in sparse coo format. Using the '.A' property you can then convert that to a normal (dense) array as needed:
>>> from scipy import sparse
>>> spM = sparse.coo_matrix((value, (y, x)), (y.max()+1, x.max()+1))
>>> (spM.A == result).all()
True
Update: if the fillvalue is not zero the above must be modified.
Method 1: replace second line with (remember this should only be used if coordinates are unique):
>>> result = np.full((y.max()+1, x.max()+1), fillvalue, value.dtype)
Method 2: does not work
Method 3: after creating spM do
>>> spM.sum_duplicates()
>>> assert spM.has_canonical_format
>>> spM.data -= fillvalue
>>> result2 = spM.A + fillvalue
I am trying to solve a min value problem, I could obtain the min values from two loops but, what I really need is also the exact values that correspended to output min.
from __future__ import division
from numpy import*
b1=0.9917949
b2=0.01911
b3=0.000840
b4=0.10175
b5=0.000763
mu=1.66057*10**(-24) #gram
c=3.0*10**8
Mler=open("olasiM.txt","w+")
data=zeros(0,'float')
for A in range(1,25):
M2=zeros(0,'float')
print 'A=',A
for Z in range(1,A+1):
SEMF=mu*c**2*(b1*A+b2*A**(2./3.)-b3*Z+b4*A*((1./2.)-(Z/A))**2+(b5*Z**2)/(A**(1./3.)))
SEMF=array(SEMF)
M2=hstack((M2,SEMF))
minm2=min(M2)
data=hstack((data,minm2))
data=hstack((data,A))
datalist = data.tolist()
for i in range (len(datalist)):
Mler.write(str(datalist[i])+'\n')
Mler.close()
Here, what I want is to see the min value of the SEMF and, corresponding A,Z values, For example, it has to be A=1, Z=1 and SEMF= some#
I also don't know how to write these, A and Z values to the document
The big advantage of numpy over using python lists is vectorized operations. Unfortunately your code fails completely in using them. For example the whole inner loop that has Z as index can easily be vectorized. You instead are computing the single elements using python floats and then stacking them one by one in the numpy array M2.
So I'd refactor that part of the code by:
import numpy as np
# ...
Zs = np.arange(1, A+1, dtype=float)
SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Zs + b4*A*((1./2.) - (Zs/A))**2 + (b5*Zs**2)/(A**(1./3.)))
Here the SEMF array should be exactly what you'd obtain as the final M2 array. Now you can find the minimum and stack that value into your data array:
min_val = SEMF.min()
data = hstack((data,minm2))
data = hstack((data,A))
If you also what to keep track for which value of Z you got the minimum you can use the argmin method:
min_val, min_pos = SEMF.min(), SEMF.argmin()
data = hstack((data,np.array([min_val, min_pos, A])))
The final code should look like:
from __future__ import division
import numpy as np
b1 = 0.9917949
b2 = 0.01911
b3 = 0.000840
b4 = 0.10175
b5 = 0.000763
mu = 1.66057*10**(-24) #gram
c = 3.0*10**8
data=zeros(0,'float')
for A in range(1,25):
Zs = np.arange(1, A+1, dtype=float)
SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Zs + b4*A*((1./2.) - (Zs/A))**2 + (b5*Zs**2)/(A**(1./3.)))
min_val, min_pos = SEMF.min(), SEMF.argmin()
data = hstack((data,np.array([min_val, min_pos, A])))
datalist = data.tolist()
with open("olasiM.txt","w+") as mler:
for i in range (len(datalist)):
mler.write(str(datalist[i])+'\n')
Note that numpy provides some functions to save/load array to/from files, like savetxt so I suggest that instead of manually saving the values there to use these functions.
Probably some numpy expert could vectorize also the operations for the As. Unfortunately my numpy knowledge isn't that advanced and I don't know how the handle the fact that we'd have a variable number of dimensions due to the range(1, A+1) thing...
I am using the scipy stats module to calculate the linear regression. ie
slope, intercept, r_value, p_value, std_err
= stats.linregress(data['cov_0.0075']['num'],data['cov_0.0075']['com'])
where data is a dictionary containing several 'cov_x' keys corresponding to a dataframe with columns 'num' and 'com'
I want to be able to loop through this dictionary and do linear regression on each 'cov_x'. I am not sure how to do this. I tried:
for i in data:
slope_+str(i), intercept+str(i), r_value+str(i),p_value+str(i),std_err+str(i)= stats.linregress(data[i]['num'],data[i]['com'])
Essentially I want len(x) slope_x values.
You could use a list comprehension to collect all the stats.linregress return values:
result = [stats.linregress(df['num'],df['com']) for key, df in data.items()]
result is a list of 5-tuples. To collect all the first, second, third, etc... elements from each 5-tuple into separate lists, use zip(*[...]):
slopes, intercepts, r_values, p_values, stderrs = zip(*result)
You should be able to do what you're trying to, but there are a couple of things you should watch out for.
First, you can't add a string to a variable name and store it that way. No plus signs on the left of the equals sign. Ever.
You should be able to accomplish what you're trying to do, however. Just make sure that you use the dict data type if you want string indexing.
import scipy.stats as stats
import pandas as pd
import numpy as np
data = {}
l = ['cov_0.0075','cov_0.005']
for i in l:
x = np.random.random(100)
y = np.random.random(100)+15
d = {'num':x,'com':y}
df = pd.DataFrame(data=d)
data[i] = df
slope = {}
intercept = {}
r_value = {}
p_value = {}
std_error = {}
for i in data:
slope[str(i)], \
intercept[str(i)], \
r_value[str(i)],\
p_value[str(i)], std_error[str(i)]= stats.linregress(data[i]['num'],data[i]['com'])
print(slope,intercept,r_value,p_value,std_error)
should work just fine. Otherwise, you can store individual values and put them in a list later.
I want to know how I should index / access some data programmatically in python.
I have columnar data: depth, temperature, gradient, gamma, for a set of boreholes. There are n boreholes. I have a header, which lists the borehole name and numeric ID. Example:
Bore_name,Bore_ID,,,Bore_name,Bore_ID,,,, ...
<a row of headers>
depth,temp,gradient,gamma,depth,temp,gradient,gamma ...
I don't know how to index the data, apart from rude iteration:
with open(filename,'rU') as f:
bores = f.readline().rstrip().split(',')
headers = f.readline().rstrip().split(',')
# load from CSV file, missing values are empty 'cells'
tdata = numpy.genfromtxt(filename, skip_header=2, delimiter=',', missing_values='', filling_values=numpy.nan)
for column in range(0,numpy.shape(tdata)[1],4):
# plots temperature on x, depth on y
pl.plot(tdata[:,column+1],tdata[:,column], label=bores[column])
# get index at max depth
depth = numpy.nanargmin(tdata[:,column])
# plot text label at max depth (y) and temp at that depth (x)
pl.text(tdata[depth,column+1],tdata[depth,column],bores[column])
It seems easy enough this way, but I've been using R recently and have got a bit used to their way of referencing data objects via classes and subclasses interpreted from headers.
Well if you like R's data.table, there have been a few (at least) attempts to re-create that functionality in NumPy--through additional classes in NumPy Core and through external Python libraries. The effort i find most promising is the datarray library by Fernando Perez. Here's how it works.
>>> # create a NumPy array for use as our data set
>>> import numpy as NP
>>> D = NP.random.randint(0, 10, 40).reshape(8, 5)
>>> # create some generic row and column names to pass to the constructor
>>> row_ids = [ "row{0}".format(c) for c in range(D1.shape[0]) ]
>>> rows = 'rows_id', row_ids
>>> variables = [ "col{0}".format(c) for c in range(D1.shape[1]) ]
>>> cols = 'variable', variables
Instantiate the DataArray instance, by calling the constructor and passing in an ordinary NumPy array and a list of tuples--one tuple for each axis, and since ndim = 2 here, there are two tuples in the list each tuple is comprised of axis label (str) and a sequence of labels for that axes (list).
>>> from datarray.datarray import DataArray as DA
>>> D1 = DA(D, [rows, cols])
>>> D1.axes
(Axis(name='rows', index=0, labels=['row0', 'row1', 'row2', 'row3',
'row4', 'row5', 'row6', 'row7']), Axis(name='cols', index=1,
labels=['col0', 'col1', 'col2', 'col3', 'col4']))
>>> # now you can use R-like syntax to reference a NumPy data array by column:
>>> D1[:,'col1']
DataArray([8, 5, 0, 7, 8, 9, 9, 4])
('rows',)
You could put your data into a dict for each borehole, keyed by the borehole id, and values as dicts with headers as keys. Roughly like this:
data = {boreid1:{"temp":temparray, ...}, boreid2:{"temp":temparray}}
Probably reading from files will be a little bit more cumbersome with these approach, but for plotting you could do something like
pl.plot(data[boreid]["temperature"], data[boreid]["depth"])
Here are idioms for naming rows and columns:
row0, row1 = np.ones((2,5))
for col in range(0, tdata.shape[1], 4):
depth,temp,gradient,gamma = tdata[:, col:col+4] .T
pl.plot( temp, depth )
See also namedtuple:
from collections import namedtuple
Rec = namedtuple( "Rec", "depth temp gradient gamma" )
r = Rec( *tdata[:, col:col+4].T )
print r.temp, r.depth
datarray (thanks Doug) is certainly more general.