The input X to my network has the shape (10, 1, 5, 4). I am interested in boxplotting the distribution of input features (four), for each class. So, for example:
X = np.random.randn(10, 1, 5, 4)
a = np.zeros(5, dtype=int)
b = np.ones(5, dtype=int)
y = np.hstack((a,b))
print(X.shape)
print(y.shape)
(10, 1, 5, 4)
(10,)
Then I separate the input X into the respective classes, like this:
class0, class1 = [], []
for i in range(len(y)):
    if y[i] == 0:
        class0.append(X[i])
    else:
        class1.append(X[i])
class0 = np.array(class0)
class1 = np.array(class1)
Taking class0 as an example, I can rearrange it so that the four features are laid out one per column (col1, col2, col3, col4), like this:
def transformer(myclass):
    # swap the last two axes so the feature axis comes before the row axis
    k = myclass.transpose((0, 1, 3, 2))
    # access each feature of the first observation
    s = k[0][:, 0].reshape(-1, 1)
    a = k[0][:, 1].reshape(-1, 1)
    j = k[0][:, 2].reshape(-1, 1)
    b = k[0][:, 3].reshape(-1, 1)
    rslt = [s, a, j, b]
    return rslt
Then plot the features:
sns.boxplot(data=transformer(class0))
This is the general idea of my workflow. Note that the function transformer is hardcoded to access only the first observation (element) of the class it takes as input.
Question: How do I modify my function so that it accesses all observations of the class rather than a single example, so that col1 holds the first-column feature values of every example in the class?
I wrote the following:
def mytransformer(myclass):
    # first, transpose class
    k = myclass.transpose((0, 1, 3, 2))
    # speed
    for i in range(k):
        s = k[i][:, 0].reshape(-1, 1)
    return s
Which gives the error:
mytransformer(class0)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-15-5451e55f03d9> in <module>()
----> 1 mytransformer(class0)
<ipython-input-14-d1a2c8098caf> in mytransformer(myclass)
3 myclass = myclass.transpose((0,1,3,2))
4 #speed
----> 5 for i in range(myclass):
6 s = k[i][:,0].reshape(-1,1)
7 return s
TypeError: only integer scalar arrays can be converted to a scalar index
Is there a way to add legend to the boxplot so that I can give name to each feature?
For your first question: range() expects an integer, but you are passing it the NumPy array k, which is what raises the TypeError.
Iterate over the number of observations instead:
for i in range(len(k)):
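The loop can also be avoided altogether. The following is a minimal sketch, assuming the feature axis is the last axis of X (as in the shapes shown in the question); the helper name features_by_column and the list feature_names are made up for illustration and are not part of the original code.

```python
import numpy as np

def features_by_column(myclass):
    """Stack every row of every observation; one column per feature."""
    myclass = np.asarray(myclass)
    n_features = myclass.shape[-1]  # assumption: features live on the last axis
    return myclass.reshape(-1, n_features)

# Same toy data as in the question.
X = np.random.randn(10, 1, 5, 4)
y = np.hstack((np.zeros(5, dtype=int), np.ones(5, dtype=int)))

class0 = X[y == 0]                    # boolean mask replaces the append loop
cols0 = features_by_column(class0)    # shape (25, 4): 5 observations x 5 rows
feature_names = ["col1", "col2", "col3", "col4"]  # hypothetical labels
```

For naming the boxes, one common option (rather than a legend) is to wrap the result in a pandas DataFrame, e.g. pd.DataFrame(cols0, columns=feature_names), and pass that to sns.boxplot(data=...); seaborn should then use the column names as tick labels.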
What is a better way to do the following codeblock? I want to create a 1d array for each scene containing features a-e to eventually have the shape: m x n if m is the number of scenes and n is the combined length of all the features.
The shape of features a-d is unknown and can be different from each other. For example feature a could have shape 100 x 3 x 3 x 5 while feature b could have shape 30 x 4. Feature e is simply a boolean.
inputs = []
for scene in scenes:
    inp = np.concatenate((
        scene['a'].flatten(),
        scene['b'].flatten(),
        scene['c'].flatten(),
        scene['d'].flatten(),
        [scene['e'] == True]))
    inputs.append(inp)
inputs = torch.FloatTensor(inputs)
Let's say we know that ['a', 'b', 'c', 'd', 'e'] are the only keys in each scene (so scene.values() yields exactly these features, in insertion order). Then the following code works:
# np.vstack and np.hstack expect a sequence; recent NumPy versions reject generators and map objects
output = np.vstack([
    np.hstack([np.ravel(v) for v in s.values()]) for s in scenes
])
inputs = torch.FloatTensor(output)
To test this, I wrote a scene generator function that creates dictionaries similar to the ones you described, and used it to create 10 synthetic scenes:
import numpy as np

def scense_generator():
    scene = dict()
    scene['a'] = np.random.random((100, 3, 3, 5))
    scene['b'] = np.random.random((30, 4))
    scene['c'] = np.random.random((15, 15, 2))
    scene['d'] = np.random.random((2, 2, 2))
    scene['e'] = True
    return scene

scenes = [scense_generator() for _ in range(10)]
# lists instead of generators, so that np.vstack/np.hstack accept them
output = np.vstack([
    np.hstack([np.ravel(v) for v in s.values()]) for s in scenes
])
print(output.shape)
# (10, 5079)
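The approach above relies on every scene dictionary listing its keys in the same order. A minimal sketch that makes the order explicit is shown below; FEATURE_KEYS and flatten_scene are hypothetical names introduced here, and the tiny scenes are made-up test data, not the shapes from the question.

```python
import numpy as np

# Assumption: these are the only features, and this is the desired column order.
FEATURE_KEYS = ("a", "b", "c", "d", "e")

def flatten_scene(scene, keys=FEATURE_KEYS):
    """Flatten each feature and concatenate them in a fixed key order."""
    return np.concatenate(
        [np.ravel(scene[k]).astype(np.float32) for k in keys]
    )

# Two toy scenes whose dictionaries were built in different key orders.
scenes = [
    {"a": np.ones((2, 3)), "b": np.zeros(4), "c": np.ones((1, 2, 2)),
     "d": np.zeros((2, 2)), "e": True},
    {"e": False, "d": np.ones((2, 2)), "c": np.zeros((1, 2, 2)),
     "b": np.ones(4), "a": np.zeros((2, 3))},
]

inputs = np.stack([flatten_scene(s) for s in scenes])  # shape (2, 19)
```

The resulting array can be handed to torch in the same way as output above. Note that np.stack only works if every scene flattens to the same total length.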
import pandas as pd
import numpy as np

class CLF:
    Weights = 0

    def fit(DF_input, DF_output, eta=0.1, drop=1000):
        X, y = DF_input.to_numpy(copy=True), DF_output.to_numpy(copy=True)
        N,d = X.shape
        m = len(np.unique(y))
        self.Weights = np.random.normal(0,1, size=(d,m))

INPUT = pd.read_csv(path_input)
OUTPUT = pd.read_csv(path_output)
clf = CLF()
clf.fit(INPUT, OUTPUT)
I defined a method .fit() for the class I wrote. The first step is to convert the two DataFrames into NumPy arrays. However, I got the following error when I called the method, although INPUT.to_numpy(copy=True) and OUTPUT.to_numpy(copy=True) both work fine on their own. Can somebody help me out here? Why was to_numpy treated as an attribute rather than a method of the DataFrames?
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-22-a3d455104534> in <module>
1 clf = CLF()
----> 2 clf.fit(INPUT, OUTPUT)
<ipython-input-16-57babd738b2d> in fit(DF_input, DF_output, eta, drop)
4
5 def fit(DF_input, DF_output, eta=0.1,drop=1000):
----> 6 X, y = DF_input.to_numpy(copy=True), DF_output.to_numpy(copy=True)
7 N,d = X.shape
8 m = len(np.unique(y)) # number of classes
AttributeError: 'CLF' object has no attribute 'to_numpy'
Your problem is that the first parameter of an instance method is always bound to the instance itself and is conventionally named self. Without it, clf is passed as DF_input, which is why the error mentions a 'CLF' object. The correct syntax is:
class CLF:
    Weights = 0

    # notice the `self`
    def fit(self, DF_input, DF_output, eta=0.1, drop=1000):
        X, y = DF_input.to_numpy(copy=True), DF_output.to_numpy(copy=True)
        N,d = X.shape
        m = len(np.unique(y))
        self.Weights = np.random.normal(0,1, size=(d,m))

INPUT = pd.read_csv(path_input)
OUTPUT = pd.read_csv(path_output)
clf = CLF()
clf.fit(INPUT, OUTPUT)
An instance method is a kind of attribute, so the failed lookup is reported with the general attribute error triggered by the . (dot) operator; Python does not look ahead to the left parenthesis to see that you meant a call.
The problem is that you defined an instance method fit but named its instance parameter DF_input, so clf ends up in DF_input and INPUT in DF_output. I think you simply forgot the usual self name for the implicit instance parameter.
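The binding can be seen without pandas at all. This is a small illustrative sketch; the class Demo and its method names are invented for the example and only mirror the shape of the CLF class from the question.

```python
class Demo:
    # Missing `self`: the instance is silently bound to the first parameter.
    def fit_missing_self(first, second=None):
        return type(first).__name__, second

    # Conventional signature: `self` receives the instance, `data` the argument.
    def fit(self, data):
        return type(data).__name__

obj = Demo()
wrong = obj.fit_missing_self([1, 2, 3])  # `first` is obj, `second` is the list
right = obj.fit([1, 2, 3])               # `data` is the list
```

In the first call the list ends up in the second parameter, just as INPUT ended up in DF_output in the question.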
I have large binary 3D data and want to rearrange it into a flat sequence of values, ordered as if the original data were traversed in sub-arrays (blocks) of size 4x4x4.
For example, if the data were 2D, I would rearrange it by 2x2 sub-arrays:
example image
I used simple loops for this, but iterating over them takes far too long. I am trying to use NumPy functions instead, but I am new to SciPy.
My code looks like this:
x,y,z = 1200,800,400
data = np.fromfile(file_name, dtype=np.float32)
data.shape = (z,y,x)
new_data = np.empty(shape=x*y*z, dtype = np.float32)
index = 0
for zz in range(0,z,4):
    for yy in range(0,y,4):
        for xx in range(0,x,4):
            for zShift in range(4):
                for yShift in range(4):
                    for xShift in range(4):
                        new_data[index] = data[zz+zShift][yy+yShift][xx+xShift]
                        index+=1
new_data.tofile(output)
However, this takes a lot of time. Are there any better implementation ideas?
As I said, the code works as intended, but I need a smarter, more Pythonic way to produce the output.
Thank you!
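x,y,z = 1200,800,400
data = np.empty((z, y, x), dtype=np.float32)
# a plain reshape(-1, 4, 4, 4) only regroups consecutive values; it does not form spatial blocks
# split each axis into (block index, offset), then move the block indices to the front
out = data.reshape(z // 4, 4, y // 4, 4, x // 4, 4).transpose(0, 2, 4, 1, 3, 5)
# numpy calculates the shape of -1
out = out.reshape(-1, 4, 4, 4)
out.shape
>>> (6000000, 4, 4, 4)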
Perform the following test, for smaller data and block size:
x, y, z = 4, 4, 4 # Dimensions
stp = 2 # Block size (in each dimension)
# Create the test array
arr = np.arange(x * y * z).reshape((x, y, z))
And to create a list of "blocks", run:
new_data = []
for xx in range(0, x, stp):
    for yy in range(0, y, stp):
        for zz in range(0, z, stp):
            print('Index:', xx, yy, zz)
            obj = arr[xx:xx+stp, yy:yy+stp, zz:zz+stp].copy()
            print(obj)
            new_data.append(obj)
In the target version of your code:
restore original values of x, y and z,
read the array from your source,
change stp back to 4,
drop test printouts.
Note also that your code copies individual elements into new_data
while iterating block by block,
whereas you wrote that you want a sequence of smaller arrays
(i.e. slices) of size 4 * 4 * 4, which is what my code produces.
So if you need a list of slices (smaller arrays) rather than a single
4-D array, use my code.
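If the flat element ordering produced by the loops in the question is what is needed, a reshape followed by a transpose may be a faster route. The sketch below checks that idea against the nested loops on a tiny array; blockwise_flatten and blockwise_flatten_loops are hypothetical helper names, and the sketch assumes every dimension is divisible by the block size.

```python
import numpy as np

def blockwise_flatten(data, b):
    """Flatten a (z, y, x) array block by block, each block being b x b x b."""
    z, y, x = data.shape
    # Split every axis into (block index, offset inside the block).
    blocks = data.reshape(z // b, b, y // b, b, x // b, b)
    # Put the three block indices first, then the three offsets.
    return blocks.transpose(0, 2, 4, 1, 3, 5).ravel()

def blockwise_flatten_loops(data, b):
    """Reference version mirroring the nested loops from the question."""
    z, y, x = data.shape
    out = np.empty(data.size, dtype=data.dtype)
    index = 0
    for zz in range(0, z, b):
        for yy in range(0, y, b):
            for xx in range(0, x, b):
                for dz in range(b):
                    for dy in range(b):
                        for dx in range(b):
                            out[index] = data[zz + dz, yy + dy, xx + dx]
                            index += 1
    return out

small = np.arange(4 * 6 * 8, dtype=np.float32).reshape(4, 6, 8)
fast = blockwise_flatten(small, 2)
slow = blockwise_flatten_loops(small, 2)
```

On the real data the same call would be blockwise_flatten(data, 4). The transpose forces a copy when flattening, so peak memory is roughly twice the size of the array.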
I get a ValueError: Found input variables with inconsistent numbers of samples: [20000, 1] when I run the following even though the row values of x and y are correct. I load in the RCV1 dataset, get indices of the categories with the top x documents, create list of tuples with equal number of randomly-selected positives and negatives for each category, and then finally attempt to run a logistic regression on one of the categories.
import numpy as np
import sklearn.datasets
from sklearn import model_selection, preprocessing
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt
from scipy import sparse

rcv1 = sklearn.datasets.fetch_rcv1()

def get_top_cat_indices(target_matrix, num_cats):
    cat_counts = target_matrix.sum(axis=0)
    #cat_counts = cat_counts.reshape((1,103)).tolist()[0]
    cat_counts = cat_counts.reshape((103,))
    #b = sorted(cat_counts, reverse=True)
    ind_temp = np.argsort(cat_counts)[::-1].tolist()[0]
    ind = [ind_temp[i] for i in range(5)]
    return ind

def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []
    for i in top_cat_indices:
        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]
        # all docs with labeled category
        cat_present = x.tocsr()[np.where(temp.sum(axis=1)>0)[0],:]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[np.where(temp.sum(axis=1)==0)[0],:]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size/2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size/2))
        # concatenate the ids
        sampled_x_pos = cat_present.tocsr()[idx_cat,:]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat,:]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))
        sampled_y_pos = temp.tocsr()[idx_cat,:]
        sampled_y_neg = temp.tocsr()[idx_nocat,:]
        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))
        res_lst.append((sampled_x, sampled_y))
    return res_lst

ind = get_top_cat_indices(rcv1.target, 5)
test_res = prepare_data(train_x, train_y, ind, 20000)
x, y = test_res[0]
print(x.shape)
print(y.shape)
LogisticRegression().fit(x, y)
Could it be an issue with the sparse matrices, or a problem with dimensionality (there are 20K samples and 47K features)?
When I run your code, I get the following error:
AttributeError: 'bool' object has no attribute 'any'
That's because y for LogisticRegression needs to be a 1-D NumPy array, not a sparse matrix. So I changed the last line to:
LogisticRegression().fit(x, y.toarray().ravel())
Then I get the following error:
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0
This is because your sampling code has a bug: idx_cat and idx_nocat are positions within cat_present and cat_notpresent, but you apply them to the full label column temp. You need to subset the label column to the rows of each group before applying the sampling indices. See the code below:
def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []
    for i in top_cat_indices:
        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]
        # all docs with labeled category
        c1 = np.where(temp.sum(axis=1)>0)[0]
        c2 = np.where(temp.sum(axis=1)==0)[0]
        cat_present = x.tocsr()[c1,:]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[c2,:]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size/2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size/2))
        # concatenate the ids
        sampled_x_pos = cat_present.tocsr()[idx_cat,:]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat,:]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))
        sampled_y_pos = temp.tocsr()[c1][idx_cat,:]
        print(sampled_y_pos.nnz)
        sampled_y_neg = temp.tocsr()[c2][idx_nocat,:]
        print(sampled_y_neg.nnz)
        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))
        res_lst.append((sampled_x, sampled_y))
    return res_lst
Now everything works like a charm.
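The indexing issue is easier to see on a plain label vector. The following is a simplified sketch of the same idea using dense NumPy arrays only; balanced_sample is a hypothetical helper, the toy labels are made up, and it samples with replacement as the original code does.

```python
import numpy as np

def balanced_sample(labels, sample_size, seed=0):
    """Return (row_indices, y) with sample_size // 2 positives and negatives."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).ravel()
    pos_rows = np.flatnonzero(labels > 0)   # rows that carry the category
    neg_rows = np.flatnonzero(labels == 0)  # rows that do not
    half = sample_size // 2
    # Positions *inside* each subset, drawn with replacement.
    pos_pick = rng.integers(len(pos_rows), size=half)
    neg_pick = rng.integers(len(neg_rows), size=half)
    # Map subset positions back to row numbers of the full matrix.
    rows = np.concatenate((pos_rows[pos_pick], neg_rows[neg_pick]))
    return rows, labels[rows]

labels = np.array([0, 1, 0, 0, 1, 0, 1, 0])
rows, y = balanced_sample(labels, sample_size=6)
```

Because rows refers to the full matrix, the same index array can select both the features (x[rows]) and the labels, which avoids the mismatch between subset positions and full-matrix rows.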
I tried to implement parameter initialization and got the error message:
import numpy as np

def initialize_with_zeros(dim):
    w = np.zeros(dim, 1)
    b = 0
    return w, b

dim = 2
initialize_with_zeros(dim)
Here is the error:
TypeError Traceback (most recent call last) in ()
5
6 dim = 2
----> 7 initialize_with_zeros(dim)
in initialize_with_zeros(dim)
1 def initialize_with_zeros(dim):
----> 2 w = np.zeros(dim, 1)
3 b = 0
4 return w, b
5
TypeError: data type not understood
np.zeros takes the shape as its first argument, either as a tuple or as a single integer (for 1-D arrays). The second positional argument is dtype, so in np.zeros(dim, 1) the 1 is interpreted as a data type, which is what triggers the TypeError. If you just need a one-dimensional array, pass a single integer. If you need a 2-D array, pass the shape as a tuple (dim, 1). Hence, depending on what you want, either use
w = np.zeros(dim)
which will give you a one-dimensional array of zeros,
or use
w = np.zeros((dim, 1))
which will give you a two-dimensional array with dim rows and 1 column.
From the official docs:
numpy.zeros(shape, dtype=float, order='C')
Parameters:
shape : int or tuple of ints
Shape of the new array, e.g., (2, 3) or 2.
Initializing parameters with zeros:
# GRADED FUNCTION: initialize_with_zeros
import numpy as np

def initialize_with_zeros(dim):
    w = np.zeros([dim, 1])
    b = 0
    return w, b

dim = 2
w, b = initialize_with_zeros(dim)
print("w = " + str(w))
print("b = " + str(b))
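As a quick check of the three call forms discussed above, the short sketch below compares the resulting shapes and confirms that passing the dimensions as two separate arguments fails; the variable names are chosen only for this illustration.

```python
import numpy as np

w_1d = np.zeros(3)       # single integer: 1-D array of length 3
w_2d = np.zeros((3, 1))  # tuple: 2-D array with 3 rows and 1 column

try:
    np.zeros(3, 1)       # the 1 lands in the dtype slot
    error_type = None
except TypeError as exc:
    error_type = type(exc).__name__
```

The exact wording of the TypeError message may differ between NumPy versions, so the sketch records only the exception type.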