Regression coefficient calculation in Python

I have a DataFrame (produced via pandas) and an input text file of activity values. I want to find the regression coefficient of each term using the following formula:
Y = C1a*X1a + C1b*X1b + ... + C2a*X2a + C2b*X2b + ... + C0,
where Y is the activity, Cna is the regression coefficient for residue choice a at position n, Xna is the dummy variable (1 or 0) coding the presence or absence of residue choice a at position n, and C0 is the mean value of the activity.
My dataframe looks like this:
2u 2s 4r 4n 4m 7h 7v
0 1 1 0 0 0 1
0 1 0 1 0 0 1
1 0 0 1 0 1 0
1 0 0 0 1 1 0
1 0 1 0 0 1 0
Here 1 and 0 represent the presence and absence of a residue, respectively.
Using MLR (multiple linear regression), how can I find the regression coefficient of each residue, i.e. 2u, 2s, 4r, 4n, 4m, 7h, 7v?
C1a represents the regression coefficient of residue a at the 1st position (here 1a is 2u, 1b is 2s, 2a is 4r, ...), and X1a represents the dummy value, i.e. 0 or 1, corresponding to 1a.
The activity file contains the following data:
6.5
5.9
5.7
6.4
5.2
So the first equation will look like:
6.5=C1a*0+C1b*1+C2a*1+C2b*0+C2c*0+C3a*0+C3b*1+C0
…
Can I get the regression coefficients using numpy? All suggestions will be appreciated.

Let A be your dataframe as a plain numpy array (read it in with np.loadtxt if it's CSV), let y be your activity values (again a numpy array), and use np.linalg.lstsq:
DF = """0 1 1 0 0 0 1
0 1 0 1 0 0 1
1 0 0 1 0 1 0
1 0 0 0 1 1 0
1 0 1 0 0 1 0"""
res = """6.5, 5.9, 5.7, 6.4, 5.2"""
A = np.fromstring ( DF, sep=" " ).reshape((5,7))
y = np.fromstring(res, sep=" ")
(x, res, rango, svals ) = np.linalg.lstsq(A, y )
print x
# 2.115625, 2.490625, 1.24375 , 1.19375 , 2.16875 , 2.115625, 2.490625
print np.sum(A.dot(x)**2) # Sum of squared residuals:
# 177.24750000000003
print A.dot(x) # Print predicition
# 6.225, 6.175, 5.425, 6.4 , 5.475
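Note that your model also includes the constant term C0, and lstsq does not add an intercept on its own. A minimal sketch of one way to get it (assuming you want C0 fitted jointly rather than pinned to the mean activity) is to append a column of ones to A:
import numpy as np

# A and y as defined above; the extra column of ones makes the last
# coefficient play the role of the constant term C0
A1 = np.hstack([A, np.ones((A.shape[0], 1))])
coeffs, residuals, rank, svals = np.linalg.lstsq(A1, y, rcond=None)
print(coeffs[:-1])  # one coefficient per residue column
print(coeffs[-1])   # C0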

Related

Classification based on categorical data

I have a dataset
Inp1 Inp2 Output
A,B,C AI,UI,JI Animals
L,M,N LI,DO,LI Noun
X,Y AI,UI Extras
For these values, I need to apply an ML algorithm. Which algorithm would be best suited to find relations between these groups in order to assign an output class to them?
Assuming each cell is a list (since you have multiple strings stored in each) and that you are not looking for a specific encoding, the following should work. It can also be adjusted to suit different encodings.
import pandas as pd

A = [["Inp1", "Inp2", "Inp3", "Output"],
     [["A","B","C"], ["AI","UI","JI"], ["Apple","Bat","Dog"], ["Animals"]],
     [["L","M","N"], ["LI","DO","LI"], ["Lawn", "Moon", "Noon"], ["Noun"]]]
dataframe = pd.DataFrame(A[1:], columns=A[0])

def my_encoding(row):
    encoded_row = []
    for ls in row:
        encoded_ls = []
        for s in ls:
            sbytes = s.encode('utf-8')
            sint = int.from_bytes(sbytes, 'little')
            encoded_ls.append(sint)
        encoded_row.append(encoded_ls)
    return encoded_row

print(dataframe.apply(my_encoding))
output:
Inp1 ... Output
0 [65, 66, 67] ... [32488788024979009]
1 [76, 77, 78] ... [1853189966]
If my assumptions are incorrect or this is not what you're looking for, let me know.
Since you mention that you are going to apply an ML algorithm (say classification), I think One-Hot Encoding (OHE) is what you are looking for.
Requested format:
Inp1 Inp2 Inp3 Output
7,44,87 4,65,2 47,36,20 45
This format can't help you train your model, since it still keeps multiple labels in a single cell; you would have to pre-process it again, e.g. with OHE.
Suggesting format:
A B C L M N X Y AI DO JI LI UI Apple Bat Dog Lawn Moon Noon Yemen Zombie
1 1 1 0 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0
0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 1
After that you can label-encode / one-hot-encode the output field as your model requires.
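For instance, a minimal sketch of this one-hot encoding in pandas (the data here is made up from the question's example, assuming each cell holds a list of strings):
import pandas as pd

df = pd.DataFrame({
    "Inp1": [["A", "B", "C"], ["L", "M", "N"], ["X", "Y"]],
    "Inp2": [["AI", "UI", "JI"], ["LI", "DO", "LI"], ["AI", "UI"]],
})

# join each list into one string, then expand it into 0/1 indicator columns
encoded = pd.concat(
    [df[col].str.join("|").str.get_dummies() for col in df.columns],
    axis=1,
)
print(encoded)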
Happy learning !
BCE is for multi-label classification, whereas categorical CE is for multi-class classification, where each example belongs to a single class. For your task you need to decide whether a single example can end up in only one class (CE) or in multiple classes (BCE). The second is probably true here, since an animal can also be a noun. ;)
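As a toy illustration in plain numpy (made-up probabilities; in a real multi-label setup each probability would come from its own sigmoid output), the two losses differ mainly in the shape of the target:
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # predicted class probabilities for one example

# categorical CE: the target is one-hot, exactly one class is correct
t_onehot = np.array([1, 0, 0])
ce = -np.sum(t_onehot * np.log(p))

# binary CE: several classes may be "on" at once, each treated as a yes/no
t_multi = np.array([1, 1, 0])
bce = -np.mean(t_multi * np.log(p) + (1 - t_multi) * np.log(1 - p))

print(ce, bce)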

Efficient way to find coordinates of connected blobs in binary image

I am looking for the coordinates of connected blobs in a binary image (2d numpy array of 0 or 1).
The skimage library provides a very fast way to label blobs within the array (which I found from similar SO posts). However, I want a list of the coordinates of each blob, not a labelled array. I have a solution which extracts the coordinates from the labelled image, but it is very slow. Far slower than the initial labelling.
Minimal Reproducible example:
import timeit
from skimage import measure
import numpy as np
binary_image = np.array([
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,1,0,1,1,1,0,1,1,1,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,0,1,0,0,0,0,0,0,0,0,0],
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
])
print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)
labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)
def extract_blobs_from_labelled_array(labelled_array):
    # The goal is to obtain lists of the coordinates
    # of each distinct blob.
    blobs = []
    label = 1
    while True:
        indices_of_label = np.where(labelled_array == label)
        if not indices_of_label[0].size > 0:
            break
        else:
            blob = list(zip(*indices_of_label))
            label += 1
            blobs.append(blob)
    return blobs

if __name__ == "__main__":
    print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
    print("Time taken:")
    print(
        timeit.timeit(
            'extract_blobs_from_labelled_array(labels)',
            globals=globals(),
            number=1
        )
    )
    print("\n\n")
Output:
2d array of type: <class 'numpy.ndarray'>:
[[0 1 0 0 1 1 0 1 1 0 0 1]
[0 1 0 1 1 1 0 1 1 1 0 1]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 1 1 1 1 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0]
[0 1 0 0 1 1 0 1 1 0 0 1]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 1 1 1 1 0 0 0 0 1 0 0]]
2d array with connected blobs labelled of type <class 'numpy.ndarray'>:
[[ 0 1 0 0 2 2 0 3 3 0 0 4]
[ 0 1 0 2 2 2 0 3 3 3 0 4]
[ 0 0 0 0 0 0 0 3 3 3 0 0]
[ 0 5 5 5 5 0 0 0 0 3 0 0]
[ 0 0 0 0 0 0 0 3 3 3 0 0]
[ 0 0 6 0 0 0 0 0 0 0 0 0]
[ 0 6 0 0 7 7 0 8 8 0 0 9]
[ 0 0 0 0 0 0 0 8 8 8 0 0]
[ 0 10 10 10 10 0 0 0 0 8 0 0]]
Beginning extract_blobs_from_labelled_array timing
Time taken:
9.346099977847189e-05
9e-05 is small but so is this image for the example. In reality I am working with very high resolution images for which the function takes approximately 10 minutes.
Is there a faster way to do this?
Side note: I'm only using list(zip()) to try get the numpy coordinates into something I'm used to (I don't use numpy much just Python). Should I be skipping this and just using the coordinates to index as-is? Will that speed it up?
The part of the code that is slow is here:
while True:
    indices_of_label = np.where(labelled_array == label)
    if not indices_of_label[0].size > 0:
        break
    else:
        blob = list(zip(*indices_of_label))
        label += 1
        blobs.append(blob)
First, a complete aside: you should avoid using while True when you know the number of elements you will be iterating over. It's a recipe for hard-to-find infinite-loop bugs.
Instead, you should use:
for label in range(1, np.max(labels) + 1):
and then you can ignore the if ...: break.
A second issue is indeed that you are using list(zip(*...)), which is slow compared to NumPy functions. Here you could get approximately the same result with np.transpose(indices_of_label), which will get you a 2D array of shape (n_coords, n_dim), i.e. (n_coords, 2).
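For instance, a quick sketch of the equivalence (toy array; names follow the question):
import numpy as np

labelled_array = np.array([[0, 1],
                           [1, 1]])
indices_of_label = np.where(labelled_array == 1)

coords_zip = list(zip(*indices_of_label))     # slow: list of (row, col) tuples
coords_arr = np.transpose(indices_of_label)   # fast: (n_coords, 2) array

print(coords_zip)  # [(0, 1), (1, 0), (1, 1)]
print(coords_arr)  # [[0 1]
                   #  [1 0]
                   #  [1 1]]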
But the Big Issue is the expression labelled_array == label. This will examine every pixel of the image once for every label. (Twice, actually, because then you run np.where(), which takes another pass.) This is a lot of unnecessary work, as the coordinates can be found in one pass.
The scikit-image function skimage.measure.regionprops can do this for you. regionprops goes over the image once and returns a list containing one RegionProps object per label. The object has a .coords attribute containing the coordinates of each pixel in the blob. So, here's your code, modified to use that function:
import timeit
from skimage import measure
import numpy as np
binary_image = np.array([
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,1,0,1,1,1,0,1,1,1,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,0,1,0,0,0,0,0,0,0,0,0],
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
])
print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)
labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)
def extract_blobs_from_labelled_array(labelled_array):
    """Return a list containing the coordinates of the pixels in each blob."""
    props = measure.regionprops(labelled_array)
    blobs = [p.coords for p in props]
    return blobs

if __name__ == "__main__":
    print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
    print("Time taken:")
    print(
        timeit.timeit(
            'extract_blobs_from_labelled_array(labels)',
            globals=globals(),
            number=1
        )
    )
    print("\n\n")

Pandas set value if most columns are equal in a dataframe

Starting from another question of mine from yesterday, Pandas set value if all columns are equal in a dataframe, and from @anky_91's solution there, I'm working on something similar.
Instead of putting 1 or -1 if all columns are equal, I want something more flexible: 1 if (for example) at least 70% of the columns are 1, -1 for the same but inverse condition, and 0 otherwise.
So this is what I've written:
# Instead of using .all, I use .sum to count the occurrences of 1 and 0 in each row
m1 = local_df.eq(1).sum(axis=1)
m2 = local_df.eq(0).sum(axis=1)
# Debug print; it works
print(m1)
print(m2)
But I don't know how to change this part:
local_df['enseamble'] = np.select([m1, m2], [1, -1], 0)
m = local_df.drop(local_df.columns.difference(['enseamble']), axis=1)
Here is pseudocode for what I want:
tot = m1 + m2
if m1 > m2:
    if (m1 * 100) / tot > 0.7:  # simple percentage calculation
        df['enseamble'] = 1
elif m2 > m1:
    if (m2 * 100) / tot > 0.7:  # simple percentage calculation
        df['enseamble'] = -1
else:
    df['enseamble'] = 0
Thanks
Edit 1
This is an example of expected output:
NET_0 NET_1 NET_2 NET_3 NET_4 NET_5 NET_6
date
2009-08-02 0 1 1 1 0 1
2009-08-03 1 0 0 0 1 0
2009-08-04 1 1 1 0 0 0
date enseamble
2009-08-02 1 # because 1 is more than 70%
2009-08-03 -1 # because 0 is more than 70%
2009-08-04 0 # because 0 and 1 are 50-50
You could obtain the specified output from the following conditions:
thr = 0.7
c1 = (df.eq(1).sum(1)/df.shape[1]).gt(thr)
c2 = (df.eq(0).sum(1)/df.shape[1]).gt(thr)
c2.astype(int).mul(-1).add(c1)
Output
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
dtype: int64
Or using np.select:
pd.DataFrame(np.select([c1,c2], [1,-1], 0), index=df.index, columns=['result'])
result
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
Try with the following (m1, m2 and tot are the same as what you have):
cond1=(m1>m2)&((m1 * 100/tot).gt(0.7))
cond2=(m2>m1)&((m2 * 100/tot).gt(0.7))
df['enseamble'] =np.select([cond1,cond2],[1,-1],0)
m =df.drop(df.columns.difference(['enseamble']), axis=1)
print(m)
enseamble
date
2009-08-02 1
2009-08-03 -1
2009-08-04 0
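To try this end-to-end, here is a small sketch with made-up 0/1 data (column names follow the question's example):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "NET_0": [1, 0, 1],
    "NET_1": [1, 0, 1],
    "NET_2": [1, 0, 0],
    "NET_3": [1, 1, 0],
}, index=pd.to_datetime(["2009-08-02", "2009-08-03", "2009-08-04"]))

thr = 0.7
c1 = (df.eq(1).sum(1) / df.shape[1]).gt(thr)   # rows that are > 70% ones
c2 = (df.eq(0).sum(1) / df.shape[1]).gt(thr)   # rows that are > 70% zeros
df["enseamble"] = np.select([c1, c2], [1, -1], 0)
print(df["enseamble"])
# 2009-08-02    1
# 2009-08-03   -1
# 2009-08-04    0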

Inefficient Regularized Logistic Regression with Numpy

I am a machine learning noob attempting to implement regularized logistic regression via Newton's method.
The data have two features, which are supposed to be expanded to 28 by taking all monomial terms of (u, v) up to degree 6.
My code converges to the correct solution of norm(theta) = 0.9384, but only after around 500 iterations, when it should take around 15 for lambda = 10 (though the exercise is based on Matlab instead of Python). Each parameter update is also very slow, and I am not sure exactly why. If anyone could explain why my code takes so many iterations to converge and why each iteration is painfully slow, I would be very grateful!
The data are taken from Andrew Ng's open course, exercise 5. The problem description and data can be found here: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex5/ex5.html
although I have also posted the data and my code below.
X data with two features
0.051267,0.69956
-0.092742,0.68494
-0.21371,0.69225
-0.375,0.50219
-0.51325,0.46564
-0.52477,0.2098
-0.39804,0.034357
-0.30588,-0.19225
0.016705,-0.40424
0.13191,-0.51389
0.38537,-0.56506
0.52938,-0.5212
0.63882,-0.24342
0.73675,-0.18494
0.54666,0.48757
0.322,0.5826
0.16647,0.53874
-0.046659,0.81652
-0.17339,0.69956
-0.47869,0.63377
-0.60541,0.59722
-0.62846,0.33406
-0.59389,0.005117
-0.42108,-0.27266
-0.11578,-0.39693
0.20104,-0.60161
0.46601,-0.53582
0.67339,-0.53582
-0.13882,0.54605
-0.29435,0.77997
-0.26555,0.96272
-0.16187,0.8019
-0.17339,0.64839
-0.28283,0.47295
-0.36348,0.31213
-0.30012,0.027047
-0.23675,-0.21418
-0.06394,-0.18494
0.062788,-0.16301
0.22984,-0.41155
0.2932,-0.2288
0.48329,-0.18494
0.64459,-0.14108
0.46025,0.012427
0.6273,0.15863
0.57546,0.26827
0.72523,0.44371
0.22408,0.52412
0.44297,0.67032
0.322,0.69225
0.13767,0.57529
-0.0063364,0.39985
-0.092742,0.55336
-0.20795,0.35599
-0.20795,0.17325
-0.43836,0.21711
-0.21947,-0.016813
-0.13882,-0.27266
0.18376,0.93348
0.22408,0.77997
0.29896,0.61915
0.50634,0.75804
0.61578,0.7288
0.60426,0.59722
0.76555,0.50219
0.92684,0.3633
0.82316,0.27558
0.96141,0.085526
0.93836,0.012427
0.86348,-0.082602
0.89804,-0.20687
0.85196,-0.36769
0.82892,-0.5212
0.79435,-0.55775
0.59274,-0.7405
0.51786,-0.5943
0.46601,-0.41886
0.35081,-0.57968
0.28744,-0.76974
0.085829,-0.75512
0.14919,-0.57968
-0.13306,-0.4481
-0.40956,-0.41155
-0.39228,-0.25804
-0.74366,-0.25804
-0.69758,0.041667
-0.75518,0.2902
-0.69758,0.68494
-0.4038,0.70687
-0.38076,0.91886
-0.50749,0.90424
-0.54781,0.70687
0.10311,0.77997
0.057028,0.91886
-0.10426,0.99196
-0.081221,1.1089
0.28744,1.087
0.39689,0.82383
0.63882,0.88962
0.82316,0.66301
0.67339,0.64108
1.0709,0.10015
-0.046659,-0.57968
-0.23675,-0.63816
-0.15035,-0.36769
-0.49021,-0.3019
-0.46717,-0.13377
-0.28859,-0.060673
-0.61118,-0.067982
-0.66302,-0.21418
-0.59965,-0.41886
-0.72638,-0.082602
-0.83007,0.31213
-0.72062,0.53874
-0.59389,0.49488
-0.48445,0.99927
-0.0063364,0.99927
Y data
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
My code is below:
import pandas as pd
import numpy as np
import math

def sigmoid(theta, x):
    return 1/(1 + math.exp(-1*theta.T.dot(x)))

def cost_function(X, y, theta):
    s = 0
    for i in range(m):
        loss = -y[i]*np.log(sigmoid(theta, X[i])) - (1-y[i])*np.log(1-sigmoid(theta, X[i]))
        s += loss
    s /= m
    s += (lamb/(2*m))*sum(theta[j]**2 for j in range(1, 28))
    return s

def gradient(theta, X, y):
    # add regularization terms
    add_column = theta * (lamb/m)
    add_column[0] = 0
    a = sum((sigmoid(theta, X[i]) - y[i])*X[i] + add_column for i in range(m))/m
    return a

def hessian(theta, X, reg_matrix):
    matrix = []
    for i in range(28):
        row = []
        for j in range(28):
            cell = sum(sigmoid(theta, X[k])*(1-sigmoid(theta, X[k]))*X[k][i]*X[k][j] for k in range(m))
            row.append(cell)
        matrix.append(row)
    H = np.array(matrix)
    H = np.add(H, reg_matrix)
    return H

def newtons_method(theta, iterations):
    for i in range(iterations):
        g = gradient(theta, X, y)
        H = hessian(theta, X, reg_matrix)
        theta = theta - np.linalg.inv(H).dot(g)
        cost = cost_function(X, y, theta)
        print(cost)
    return theta

def map_feature(u, v):  # expand features according to the problem instructions
    new_row = []
    new_row.append(1)
    new_row.append(u)
    new_row.append(v)
    new_row.append(u**2)
    new_row.append(u*v)
    new_row.append(v**2)
    new_row.append(u**3)
    new_row.append(u**2*v)
    new_row.append(u*v**2)
    new_row.append(v**3)
    new_row.append(u**4)
    new_row.append(u**3*v)
    new_row.append(u*v**3)
    new_row.append(v**4)
    new_row.append(u**2*v**2)
    new_row.append(u**5)
    new_row.append(u**4*v)
    new_row.append(u*v**4)
    new_row.append(v**5)
    new_row.append(u**2*v**3)
    new_row.append(u**3*v**2)
    new_row.append(u**6)
    new_row.append(u**5*v)
    new_row.append(u*v**5)
    new_row.append(v**6)
    new_row.append(u**4*v**2)
    new_row.append(u**2*v**4)
    new_row.append(u**3*v**3)
    return np.array(new_row)

with open('ex5Logx.dat', 'r') as f:
    array = []
    for line in f.readlines():
        array.append(line.strip().split(','))
for a in array:
    a[0], a[1] = float(a[0]), float(a[1].strip())
xdata = np.array(array)

with open('ex5Logy.dat', 'r') as f:
    array = []
    for line in f.readlines():
        array.append(line.strip())
for i in range(len(array)):
    array[i] = float(array[i])
ydata = np.array(array)

X_df = pd.DataFrame(xdata, columns=['score1', 'score2'])
y_df = pd.DataFrame(ydata, columns=['acceptence'])
m = len(y_df)
iterations = 15
ones = np.ones((m, 1))  # intercept term in first column
X = np.array(X_df)
X = np.append(ones, X, axis=1)
y = np.array(y_df).flatten()

new_X = []  # prepare new array for expanded features
for i in range(m):
    new_row = map_feature(X[i][1], X[i][2])
    new_X.append(new_row)
X = np.array(new_X)

theta = np.array([0 for i in range(28)])  # initialize parameters to 0
lamb = 10  # lambda constant for regularization
reg_matrix = np.zeros((28, 28), dtype=int)  # (n+1) x (n+1) regularization matrix
np.fill_diagonal(reg_matrix, 1)
reg_matrix[0] = 0
reg_matrix = (lamb/m)*reg_matrix

theta = newtons_method(theta, iterations)
print(np.linalg.norm(theta))
I am not 100% sure, but I went through one tutorial on logistic regression using Newton's method (http://thelaziestprogrammer.com/sharrington/math-of-machine-learning/solving-logreg-newtons-method), and its implementation of Newton's method is a little different from yours. Actually, there is one major difference: it adds the product of the inverse of the Hessian and the gradient to theta, whereas you are subtracting. (I know logistic regression the normal way, not using Newton's method.) Apart from that, you are using Python loops in your cost function and Hessian, which I think can each be done with a single vectorized numpy statement instead of looping.
I would suggest referring to the attached link, as it does the whole implementation in numpy with no loops. The loops you have created are what is hurting performance.
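As an illustration of the vectorization, here is a minimal sketch of one Newton update (my own rewrite under the question's conventions, not the linked tutorial's code; X, y, theta and lamb are assumed to be as in the question). One inconsistency worth checking in the original code: hessian() sums over the m examples without dividing by m, while gradient() does divide by m (and both use lamb/m for regularization); that mismatch alone scales each Newton step down by roughly a factor of m, which could explain needing hundreds of iterations instead of about 15.
import numpy as np

def sigmoid(z):
    # elementwise sigmoid of the whole vector X.dot(theta) at once
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(theta, X, y, lamb):
    m, n = X.shape
    h = sigmoid(X.dot(theta))                    # shape (m,)
    reg = (lamb / m) * theta
    reg[0] = 0.0                                 # don't regularize the intercept
    grad = X.T.dot(h - y) / m + reg              # shape (n,)
    R = (lamb / m) * np.eye(n)
    R[0, 0] = 0.0
    # X^T diag(h*(1-h)) X, computed without explicit loops
    H = (X.T * (h * (1 - h))).dot(X) / m + R     # shape (n, n)
    return theta - np.linalg.solve(H, grad)      # solve() instead of inv()

# usage sketch: theta = np.zeros(X.shape[1]), then iterate newton_step ~15 times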

Python numpy zeros array being assigned 1 for every value when only one index is updated

The following is my code:
import numpy as np
from sklearn.svm import SVC
from sklearn import metrics

# X, X_train, X_test, y_train, y_test are defined earlier
amount_features = X.shape[1]
best_features = np.zeros((amount_features,), dtype=int)
best_accuracy = 0
best_accuracy_index = 0

def find_best_features(best_features, best_accuracy):
    for i in range(amount_features):
        trial_features = best_features
        trial_features[i] = 1
        svc = SVC(C=10, gamma=.1)
        svc.fit(X_train[:, trial_features==1], y_train)
        y_pred = svc.predict(X_test[:, trial_features==1])
        accuracy = metrics.accuracy_score(y_test, y_pred)
        if (accuracy > best_accuracy):
            best_accuracy = accuracy
            best_accuracy_index = i
    print(best_accuracy_index)
    best_features[best_accuracy_index] = 1
    return best_features, best_accuracy

bf, ba = find_best_features(best_features, best_accuracy)
print(bf, ba)
print(bf, ba)
And this is my output:
25
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] 0.865853658537
And my expected output:
25
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0] 0.865853658537
I am trying to update the zeros array at the index that gives the highest accuracy. As you can see, it should be index 25, and I follow that by assigning index 25 of my array to 1. However, when I print the array, it shows every index has been updated to 1.
Not sure what the mishap is. Thanks for spending your limited time on Earth to help me.
Change trial_features = best_features to trial_features = numpy.copy(best_features). The reasoning behind the change was already given by @Michael Butscher: plain assignment does not copy a numpy array, it only binds a second name to the same object, so trial_features[i] = 1 mutates best_features as well.
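To see the aliasing directly, here is a tiny standalone demonstration:
import numpy as np

a = np.zeros(5, dtype=int)
b = a              # b is just another name for the same array
b[2] = 1
print(a)           # [0 0 1 0 0] -- a changed too

c = np.zeros(5, dtype=int)
d = np.copy(c)     # independent copy
d[2] = 1
print(c)           # [0 0 0 0 0] -- c is untouched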
