Nested Array
I want to turn the above into the below. This happened by accident while I was running a linear regression: each output was already wrapped in a 1x1 array. It looks like my betas variable is the source of the nesting; let me know if you would like to see more of my code.
Normal Array
Generally speaking, I am just trying to get the output from
[[ array([x]), array([x]), array([x]), array([x]), array([x])]]
to
[[x, x, x, x, x ]]
def si_model():
    dj_data = pd.read_csv("/data.tsv", sep = "\t")
    dj_data = dj_data.pct_change().dropna()
    ann_dj_data = dj_data * 252
    dj_index = ann_dj_data['^DJI']
    ann_dj_data = ann_dj_data.drop('^DJI', axis='columns')
    # Function to linearly regress each stock onto DJ
    def model_regress(stock):
        # Fit DJ to index data
        DJ = np.array(dj_index).reshape(len(stock), 1)
        # Regression of each stock onto DJ
        lm = LinearRegression().fit(DJ, y=stock.to_numpy())
        resids = stock.to_numpy() - lm.predict(DJ)
        return lm.coef_, lm.intercept_, resids.std()
    # Run model regression on each stock
    lm_all = ann_dj_data.apply(lambda stock: model_regress(stock)).T
    # Table of the coefficients
    lm_all = lm_all.rename(columns={0: 'Beta ', 1: 'Intercept', 2: 'Rsd Std'})
    # Variance of the index's returns
    dj_index_var = dj_index.std() ** 2
    betas = lm_all['Beta '].to_numpy()
    resid_vars = lm_all['Rsd Std'].to_numpy() ** 2
    # Single index approximation of covariance matrix using identity matrix (np.eye)
    Qsi = dj_index_var * betas * betas.reshape(-1, 1) + np.eye(len(betas)) * resid_vars
    return Qsi
# Printing first five rows of approximation
Qsi = si_model()
print("Covariance Matrix")
print(Qsi[:5, :5])
You can use squeeze().
Here is a small example similar to yours:
import numpy as np
a = np.array([17.1500691])
b = np.array([5.47690856])
c = np.array([5.47690856])
d = np.array([11.7700696])
e = list([[a,b],[c,d]])
print(e)
f = np.squeeze(np.array(e), axis=2)
print(f)
Output:
[[array([17.1500691]), array([5.47690856])], [array([5.47690856]), array([11.7700696])]]
[[17.1500691 5.47690856]
[ 5.47690856 11.7700696 ]]
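Applied to the betas from the question, the same idea might look like the sketch below. It assumes that lm_all['Beta '].to_numpy() yields an object array whose elements are one-element arrays; the names coefs and betas_nested and the numeric values are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for lm_all['Beta '].to_numpy(): an object array
# whose elements are one-element arrays (made-up values).
coefs = [np.array([1.2]), np.array([0.8]), np.array([1.5])]
betas_nested = np.empty(len(coefs), dtype=object)
for i, coef in enumerate(coefs):
    betas_nested[i] = coef

# Stack into a regular (3, 1) float array, then drop the length-1 axis.
betas = np.stack(list(betas_nested)).squeeze(axis=1)
print(betas.shape)
```

Another option, if the regression has a single feature as it does here, would be to return lm.coef_[0] from model_regress so the nesting never appears in the first place.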
I wrote a piece of code to build a simple linear regression model in Python. However, I am having trouble getting the correct cost function and, most importantly, the correct theta parameters. The model is implemented from scratch, without the scikit-learn module. I used Andrew Ng's notes from his ML Coursera course to create the model. The correct values of theta are [[-3.630291] [1.166362]].
Would be really grateful if someone could offer their expertise, and point out what I'm doing wrong.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Load The Dataset
dataset = pd.read_csv("Population vs Profit.txt", names=["Population", "Profit"])
print (dataset.head())
col = len(dataset.columns)
x = dataset.iloc[:,:col-1].values
y = dataset.iloc[:,col-1].values
#Visualizing The Dataset
plt.scatter(x, y, color="red", marker="x", label="Profit")
plt.title("Population vs Profit")
plt.xlabel("Population")
plt.ylabel("Profit")
plt.legend()
plt.show()
#Preprocessing Data
dataset.insert(0,"x0",1)
col = len(dataset.columns)
x = dataset.iloc[:,:col-1].values
b = np.zeros(col-1)
m = len(y)
costlist = []
alpha = 0.001
iteration = 10000
#Defining Functions
def hypothesis(x,b,y):
    h = x.dot(b.T) - y
    return h
def cost(x,b,y,m):
    j = np.sum(hypothesis(x,b,y)**2)
    j = j/(2*m)
    return j
print (cost(x,b,y,m))
def gradient_descent(x,b,y,m,alpha):
    for i in range (iteration):
        h = hypothesis(x,b,y)
        product = np.sum(h.dot(x))
        b = b - ((alpha/m)*product)
        costlist.append(cost(x,b,y,m))
    return b,cost(x,b,y,m)
b , mincost = gradient_descent(x,b,y,m,alpha)
print (b , mincost)
print (cost(x,b,y,m))
plt.plot(b,color="green")
plt.show()
The dataset I'm using is the following text.
6.1101,17.592
5.5277,9.1302
8.5186,13.662
7.0032,11.854
5.8598,6.8233
8.3829,11.886
7.4764,4.3483
8.5781,12
6.4862,6.5987
5.0546,3.8166
5.7107,3.2522
14.164,15.505
5.734,3.1551
8.4084,7.2258
5.6407,0.71618
5.3794,3.5129
6.3654,5.3048
5.1301,0.56077
6.4296,3.6518
7.0708,5.3893
6.1891,3.1386
20.27,21.767
5.4901,4.263
6.3261,5.1875
5.5649,3.0825
18.945,22.638
12.828,13.501
10.957,7.0467
13.176,14.692
22.203,24.147
5.2524,-1.22
6.5894,5.9966
9.2482,12.134
5.8918,1.8495
8.2111,6.5426
7.9334,4.5623
8.0959,4.1164
5.6063,3.3928
12.836,10.117
6.3534,5.4974
5.4069,0.55657
6.8825,3.9115
11.708,5.3854
5.7737,2.4406
7.8247,6.7318
7.0931,1.0463
5.0702,5.1337
5.8014,1.844
11.7,8.0043
5.5416,1.0179
7.5402,6.7504
5.3077,1.8396
7.4239,4.2885
7.6031,4.9981
6.3328,1.4233
6.3589,-1.4211
6.2742,2.4756
5.6397,4.6042
9.3102,3.9624
9.4536,5.4141
8.8254,5.1694
5.1793,-0.74279
21.279,17.929
14.908,12.054
18.959,17.054
7.2182,4.8852
8.2951,5.7442
10.236,7.7754
5.4994,1.0173
20.341,20.992
10.136,6.6799
7.3345,4.0259
6.0062,1.2784
7.2259,3.3411
5.0269,-2.6807
6.5479,0.29678
7.5386,3.8845
5.0365,5.7014
10.274,6.7526
5.1077,2.0576
5.7292,0.47953
5.1884,0.20421
6.3557,0.67861
9.7687,7.5435
6.5159,5.3436
8.5172,4.2415
9.1802,6.7981
6.002,0.92695
5.5204,0.152
5.0594,2.8214
5.7077,1.8451
7.6366,4.2959
5.8707,7.2029
5.3054,1.9869
8.2934,0.14454
13.394,9.0551
5.4369,0.61705
One issue is with your "product": it is currently a single number when it should be a vector with one entry per parameter. I was able to get the values [-3.24044334 1.12719788] by rewriting your for-loop as follows:
def gradient_descent(x,b,y,m,alpha):
    for i in range (iteration):
        h = hypothesis(x,b,y)
        #product = np.sum(h.dot(x))
        xvalue = x[:,1]
        product = h.dot(xvalue)
        hsum = np.sum(h)
        b = b - ((alpha/m)* np.array([hsum , product]) )
        costlist.append(cost(x,b,y,m))
    return b,cost(x,b,y,m)
There may be another issue besides this, since it does not converge to your expected answer. Also make sure you are using the same alpha as the reference solution.
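The per-parameter update can also be written in one vectorized step, with the gradient computed as x.T.dot(residual) / m. The sketch below is not the asker's script: it uses noiseless synthetic data with known parameters (2.0 and 0.5), and the function signature, alpha and iteration count are assumptions chosen so that the loop converges on this toy problem.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=5000):
    """Batch gradient descent for least squares; x includes a column of ones."""
    m, n = x.shape
    b = np.zeros(n)
    for _ in range(iterations):
        residual = x.dot(b) - y            # shape (m,)
        gradient = x.T.dot(residual) / m   # shape (n,), one entry per parameter
        b = b - alpha * gradient
    return b

# Synthetic, noiseless data: y = 2.0 + 0.5 * feature (illustrative values only)
rng = np.random.default_rng(0)
feature = rng.uniform(0, 2, size=200)
x_demo = np.column_stack([np.ones_like(feature), feature])
y_demo = 2.0 + 0.5 * feature

b_demo = gradient_descent(x_demo, y_demo)
print(b_demo)
```

Because the gradient is a vector of length n, the same code handles any number of features without building the update array by hand.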
I am currently using scikit-learn for text classification on the 20ng dataset. I want to calculate the information gain for a vectorized dataset. It has been suggested to me that this can be accomplished using mutual_info_classif from sklearn. However, this method is really slow, so I tried to implement information gain myself based on this post.
I came up with the following solution:
from scipy.stats import entropy
import numpy as np
def information_gain(X, y):
    def _entropy(labels):
        counts = np.bincount(labels)
        return entropy(counts, base=None)
    def _ig(x, y):
        # indices where x is set/not set
        x_set = np.nonzero(x)[1]
        x_not_set = np.delete(np.arange(x.shape[1]), x_set)
        h_x_set = _entropy(y[x_set])
        h_x_not_set = _entropy(y[x_not_set])
        return entropy_full - (((len(x_set) / f_size) * h_x_set)
                               + ((len(x_not_set) / f_size) * h_x_not_set))
    entropy_full = _entropy(y)
    f_size = float(X.shape[0])
    scores = np.array([_ig(x, y) for x in X.T])
    return scores
Using a very small dataset, most scores from sklearn and my implementation are equal. However, sklearn seems to take frequencies into account, which my algorithm clearly doesn't. For example:
from time import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
X, y = newsgroups_train.data, newsgroups_train.target
cv = CountVectorizer(max_df=0.95, min_df=2,
                     max_features=100,
                     stop_words='english')
X_vec = cv.fit_transform(X)
t0 = time()
res_sk = mutual_info_classif(X_vec, y, discrete_features=True)
print("Time passed for sklearn method: %3f" % (time()-t0))
t0 = time()
res_ig = information_gain(X_vec, y)
print("Time passed for ig: %3f" % (time()-t0))
for name, res_mi, res_ig in zip(cv.get_feature_names(), res_sk, res_ig):
    print("%s: mi=%f, ig=%f" % (name, res_mi, res_ig))
sample output:
center: mi=0.011824, ig=0.003548
christian: mi=0.128629, ig=0.127122
color: mi=0.028413, ig=0.026397
com: mi=0.041184, ig=0.030458
computer: mi=0.020590, ig=0.012327
cs: mi=0.007291, ig=0.001574
data: mi=0.020734, ig=0.008986
did: mi=0.035613, ig=0.024604
different: mi=0.011432, ig=0.005492
distribution: mi=0.007175, ig=0.004675
does: mi=0.019564, ig=0.006162
don: mi=0.024000, ig=0.017605
earth: mi=0.039409, ig=0.032981
edu: mi=0.023659, ig=0.008442
file: mi=0.048056, ig=0.045746
files: mi=0.041367, ig=0.037860
ftp: mi=0.031302, ig=0.026949
gif: mi=0.028128, ig=0.023744
god: mi=0.122525, ig=0.113637
good: mi=0.016181, ig=0.008511
gov: mi=0.053547, ig=0.048207
So I was wondering whether my implementation is wrong, or whether it is correct but a different variant of the mutual information algorithm from the one scikit-learn uses.
A little late with my answer but you should look at Orange's implementation. Within their app it is used as a behind-the-scenes processor to help inform the dynamic model parameter building process.
The implementation itself looks fairly straightforward and could most likely be ported out. The entropy calculation comes first, in the section starting at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L233
def _entropy(dist):
    """Entropy of class-distribution matrix"""
    p = dist / np.sum(dist, axis=0)
    pc = np.clip(p, 1e-15, 1)
    return np.sum(np.sum(- p * np.log2(pc), axis=0) * np.sum(dist, axis=0) / np.sum(dist))
Then the second portion.
https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L305
class GainRatio(ClassificationScorer):
    """
    Information gain ratio is the ratio between information gain and
    the entropy of the feature's
    value distribution. The score was introduced in [Quinlan1986]_
    to alleviate overestimation for multi-valued features. See `Wikipedia entry on gain ratio
    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.
    .. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986.
    """
    def from_contingency(self, cont, nan_adjustment):
        h_class = _entropy(np.sum(cont, axis=1))
        h_residual = _entropy(np.compress(np.sum(cont, axis=0), cont, axis=1))
        h_attribute = _entropy(np.sum(cont, axis=0))
        if h_attribute == 0:
            h_attribute = 1
        return nan_adjustment * (h_class - h_residual) / h_attribute
The actual scoring process happens at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L218
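To make the h_class minus h_residual idea concrete, here is a minimal NumPy sketch of information gain. It is not Orange's or scikit-learn's code; entropy_bits, info_gain and the toy arrays are hypothetical. It also illustrates one possible reason for the mismatch in the question: if a scorer treats every distinct count as its own category (an assumption about mutual_info_classif with discrete_features=True, not verified here), it can score a feature higher than a presence/absence split does.

```python
import numpy as np

def entropy_bits(counts):
    """Shannon entropy (in bits) of a vector of non-negative counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def info_gain(feature, labels):
    """H(labels) - H(labels | feature); each distinct feature value is a category."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    h_class = entropy_bits(np.bincount(labels))
    h_residual = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # Weight each subset's entropy by the fraction of samples it covers.
        h_residual += mask.mean() * entropy_bits(np.bincount(labels[mask]))
    return h_class - h_residual

# Toy data: the raw counts separate all three classes, presence/absence does not.
labels = np.array([0, 0, 1, 1, 2, 2])
counts = np.array([0, 0, 1, 1, 3, 3])

ig_counts = info_gain(counts, labels)      # feature kept as raw counts
ig_binary = info_gain(counts > 0, labels)  # feature reduced to set / not set
print(ig_counts, ig_binary)
```

On this toy input the count-based score is larger than the binary one, which is the same direction as the mi versus ig gap in the sample output above.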
So I have a tensor h_in of shape (50, ?, 1, 100) that I should now like to turn into shape (50, 1, 1, 100) by taking the max over the axis 1.
How do I do that?
I tried
h_out = max_pool(h_in)
with
def max_pool(h,ksize=[1,-1,1,1],strides=[1,1,1,1],padding='VALID'):
    return tf.nn.max_pool(h,ksize=ksize,strides=strides,padding=padding)
but that doesn't seem to reduce the size.
runnable example:
import tensorflow as tf
import numpy as np
import numpy.random as nprand
def _weight_variable(shape,name):
    initial = tf.truncated_normal(shape,stddev=0.1)
    v = tf.Variable(initial,name=name)
    return v
def _bias_variable(shape,name):
    initial = tf.constant(0.1,shape=shape)
    v = tf.Variable(initial,name=name)
    return v
def _embedding_variable(shape,name):
    initial = tf.truncated_normal(shape)
    v = tf.Variable(initial,name=name)
    return v
def conv2d(x,W,strides=[1,1,1,1],padding='VALID'):
    return tf.nn.conv2d(x,W,strides=strides,padding=padding)
def max_pool(h,ksize=[1,-1,1,1],strides=[1,1,1,1],padding='VALID'):
    return tf.nn.max_pool(h,ksize=ksize,strides=strides,padding=padding)
nof_embeddings= 55000
dim_embeddings = 300
batch_size = 50
filter_size = 100
x_input = tf.placeholder(tf.int32, shape=[batch_size, None])
def _model():
    embeddings = _embedding_variable([nof_embeddings,dim_embeddings],'embeddings')
    h_lookup = tf.nn.embedding_lookup(embeddings,x_input)
    h_embed = tf.reshape(h_lookup,[batch_size,-1,dim_embeddings,1])
    f = 3
    W_conv1f = _weight_variable([f,dim_embeddings,1,filter_size],f'W_conv1_{f}')
    b_conv1f = _bias_variable([filter_size],f'b_conv1_{f}')
    h_conv1f = tf.nn.relu(conv2d(h_embed,W_conv1f) + b_conv1f)
    h_pool1f = max_pool(h_conv1f)
    print("h_embed:",h_embed.get_shape())
    print()
    print(f'h_conv1_{f}:',h_conv1f.get_shape())
    print(f'h_pool1_{f}:',h_pool1f.get_shape())
    print()
    return tf.shape(h_pool1f)
if __name__ == '__main__':
    tensor_length = 35
    model = _model()
    with tf.Session() as sess:
        tf.global_variables_initializer().run()
        batch = nprand.randint(0,nof_embeddings,size=[batch_size,tensor_length])
        shape = sess.run(model,
                         feed_dict ={
                             x_input : batch
                         })
        print('result:',shape)
which outputs
h_embed: (50, ?, 300, 1)
h_conv1_3: (50, ?, 1, 100)
h_pool1_3: (50, ?, 1, 100)
result: [ 50 35 1 100]
Let's say I instead hardcode the size that I want:
h_pool1f = max_pool(h_conv1f,ksize=[1,35-f+1,1,1])
That works.
But now I'm in trouble as soon as I change the tensor_length (which is determined at runtime, so no, I cannot hardcode it).
One "solution" would be to blow the input up to a fixed maximum length by padding, or something, but then again, that introduces unnecessary computations and an artificial cap, both of which I should very much like to avoid.
So, is there
a way to make tensorflow "correctly" recognise the -1 in ksize?
or another way to compute the max?
I think tf.reduce_max is what you are looking for:
https://www.tensorflow.org/api_docs/python/tf/reduce_max
Usage:
# tens: some tensorflow.Tensor
# ax: an axis index (0, 1, ...), or -1, or None
red_m = tf.reduce_max(tens, axis=ax)
If tens has shape [shape_0, shape_1, shape_2], the resulting tensor red_m will have shape [shape_1, shape_2] if ax=0, shape [shape_0, shape_2] if ax=1, and so on. If ax=-1, the last axis is reduced, while if ax=None, the reduction happens along all axes. To keep the reduced axis with size 1, as in the target shape (50, 1, 1, 100), pass keepdims=True (spelled keep_dims in older TensorFlow 1.x releases).
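The shape behaviour can be checked without building a graph. The sketch below uses NumPy's max as a stand-in, on the assumption that its axis and keepdims arguments behave like those of tf.reduce_max; the array h_in and the length 33 are made up to mirror the question's (50, ?, 1, 100) tensor.

```python
import numpy as np

# Stand-in for h_conv1f: the second axis has a length only known at runtime.
runtime_length = 33
h_in = np.random.rand(50, runtime_length, 1, 100)

dropped = np.max(h_in, axis=1)                # axis 1 is removed
kept = np.max(h_in, axis=1, keepdims=True)    # axis 1 is kept with size 1

print(dropped.shape)
print(kept.shape)
```

Neither call needs to know the length of axis 1 in advance, which is the property the question is after.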
I am trying to build the simplest possible regression with PyBrain, but somehow I'm failing.
The Neural Network should learn the function Y=3*X
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import SupervisedDataSet
from pybrain.structure import FullConnection, FeedForwardNetwork, TanhLayer, LinearLayer, BiasUnit
import matplotlib.pyplot as plt
from numpy import *
n = FeedForwardNetwork()
n.addInputModule(LinearLayer(1, name = 'in'))
n.addInputModule(BiasUnit(name = 'bias'))
n.addModule(TanhLayer(1,name = 'tan'))
n.addOutputModule(LinearLayer(1, name = 'out'))
n.addConnection(FullConnection(n['bias'], n['tan']))
n.addConnection(FullConnection(n['in'], n['tan']))
n.addConnection(FullConnection(n['tan'], n['out']))
n.sortModules()
# initialize the backprop trainer and train
t = BackpropTrainer(n, learningrate = 0.1, momentum = 0.0, verbose = True)
#DATASET
DS = SupervisedDataSet( 1, 1 )
X = random.rand(100,1)*100
Y = X*3+random.rand(100,1)*5
for r in xrange(X.shape[0]):
    DS.appendLinked((X[r]),(Y[r]))
t.trainOnDataset(DS, 200)
plt.plot(X,Y,'.b')
X=[[i] for i in arange(0,100,0.1)]
Y=map(n.activate,X)
plt.plot(X,Y,'-g')
It doesn't learn anything. I have tried to remove the hidden layer (because in this example we don't even need that) and the network started to predict NaNs.
What's going on?
EDIT: This is the code that solved my problem:
#DATASET
DS = SupervisedDataSet( 1, 1 )
X = random.rand(100,1)*100
Y = X*3+random.rand(100,1)*5
maxy = float(max(Y))
maxx = 100.0
for r in xrange(X.shape[0]):
    DS.appendLinked((X[r]/maxx),(Y[r]/maxy))
t.trainOnDataset(DS, 200)
plt.plot(X,Y,'.b')
X=[[i] for i in arange(0,100,0.1)]
Y=map(lambda x: n.activate(array(x)/maxx)*maxy,X)
plt.plot(X,Y,'-g')
The basic pybrain neurons are going to output something between 0 and 1. Divide your Y by 300 (the maximum possible value), and you'll get better results.
More generally, find the maximum Y for your dataset, and scale everything by that.
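The scale, train, unscale pattern can be sketched independently of PyBrain. In the example below a least-squares line from np.polyfit stands in for the trained network, purely as an assumption to keep the sketch short; max_x, max_y and predict are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 1)) * 100
Y = X * 3 + rng.random((100, 1)) * 5

# Scale inputs and targets into [0, 1] by their maxima before training.
max_x = float(X.max())
max_y = float(Y.max())
X_scaled = X / max_x
Y_scaled = Y / max_y

# Stand-in for the trained network: a line fitted on the scaled data.
slope, intercept = np.polyfit(X_scaled.ravel(), Y_scaled.ravel(), deg=1)

def predict(x_raw):
    """Scale the raw input, apply the model, then undo the target scaling."""
    x_scaled = np.asarray(x_raw, dtype=float) / max_x
    return (slope * x_scaled + intercept) * max_y

pred_50 = float(predict(50.0))
print(pred_50)
```

The important detail is that the same max_x and max_y used for training are reused at prediction time, as in the EDIT above.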