Best way to represent a training set to split with - python

A training set is made of a set of samples and a set of labels, one for each sample. In my case a sample is a vector while a label is a scalar. To deal with this I use NumPy. Consider this example:
samples = np.array([[1,0],[0.2,0.5], [0.3,0.8]])
labels = np.array([1,0,0])
Now I have to split the training set into two partitions, shuffling the elements. This raises a problem: I lose the correspondence with the labels. How can I solve this?
As performance is critical in my project I prefer not to construct a permutation vector; I am looking for a way to bind the labels to the samples. For now my solution is to use the last column of the samples array as the label, like:
samples_and_labels = np.array([[1,0,0],[0.2,0.5,0], [0.3,0.8,1]])
Is this the fastest solution for my case, or is there a better one? For instance, creating pairs?

Mixing integer labels with float datatypes makes me uneasy. When you say split the training set, is the split completely random? If so, I would go with the random permutation vector - I don't think your solution is any faster (even setting aside my data type reservations) because you're still allocating memory when creating your samples_and_labels array.
You could do something like (assuming len(samples) is even for simplicity of illustration):
n = len(samples) // 2  # assumes len(samples) is even
ind = np.hstack((np.ones(n, dtype=bool), np.zeros(n, dtype=bool)))
# modifies in-place, no memory allocation
np.random.shuffle(ind)
and then you can do
samples_left, samples_right = samples[ind], samples[~ind]
labels_left, labels_right = labels[ind], labels[~ind]
and call
np.random.shuffle(ind)
whenever you need a new split
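If you prefer a single shared permutation instead of a boolean mask, a minimal sketch (reusing the sample data from the question) could look like this:

import numpy as np

samples = np.array([[1, 0], [0.2, 0.5], [0.3, 0.8]])
labels = np.array([1, 0, 0])

# one shared permutation keeps samples and labels aligned
perm = np.random.permutation(len(samples))
split = len(samples) // 2
train_idx, test_idx = perm[:split], perm[split:]

samples_train, labels_train = samples[train_idx], labels[train_idx]
samples_test, labels_test = samples[test_idx], labels[test_idx]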

Without NumPy it may not be as fast. You can try importing "_random" instead of just "random" for better shuffling performance.
import random
samples = [[1,0],[0.2,0.5], [0.3,0.8]]
labels = [1,0,0]
print(samples, '\n', labels)
z = list(zip(samples, labels))
random.shuffle(z)
samples, labels = zip(*z)
print(samples, '\n', labels)

Related

Index elements from 2D array with a 2D array and change these elements according to a 1D array [Python]

To quantify the behaviour of a classifier, I want to modify the input data and track the changes in classification. I am classifying 1D signals and use algorithms that deliver explanations; that is, another algorithm creates, for each 1D signal, an array of the points that are most important for the classification decision. I then want to use these most-important-point indices to index the 1D signals in the 2D array of signals and modify the values at those points. The values for the modification have to come from a 1D array of random values, so that every signal gets changed with the same randomness. I will try to visualize it:
array_of_1D_Signals = [[8,1,2,8,3,4,8,1,3,8],[4,1,8,8,3,8,6,1,8,4],[...],[...]]
# exemplary 4 most important points for every signal (let's arbitrarily say the 8's are important)
# they are ordered from most important to least important
list_of_indices_for_every_signal = [[7,3,0,9],[8,2,5,3],[...],[...]]
values_for_modification = [4,1,6,3]
# the array I need to create (the 8's get replaced with the values)
modified_array_of_1D_Signals = [[6,1,2,1,3,4,4,1,3,3],[4,1,1,3,3,6,6,1,4,4],[...],[...]]
I have solved this with for loops, but I do this over millions of samples and it takes ages.
Is there a smart NumPy way of doing this? I have a little example version with fancy indexing:
array_of_1D_Signals = np.full((100, 100), 1, dtype='float')
indices = np.random.randint(100, size=(100, 100))
values = np.random.uniform(low=0.0, high=1.0, size=(100,))
rows = np.arange(start=0, stop=array_of_1D_Signals.shape[0], step=1)
rows = np.repeat(rows, 4)
columns = indices[:, :4].flatten()
array_of_1D_Signals[rows, columns] = np.tile(values[:4], 100)
But that doesn't feel like the smartest way, with the repeat of the rows and the tiling of the values, and I imagine it scales rather badly, because in my real analysis all dimensions get big (millions of samples with thousands of points to change).
Maybe someone has an idea?
Thank you for your time
OK, something I tried earlier and got wrong actually works when done right.
The solution is
np.put_along_axis(array_of_1D_Signals, indices[:,:4], values[:4], axis=1)
My first proposed solution takes about 2.3 times as long as this single line of code.
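For reference, a quick sanity check (with hypothetical data; each row of indices is a full permutation, so the four chosen columns are distinct) that the one-liner matches the fancy-indexing version from the question:

import numpy as np

rng = np.random.default_rng(0)
signals_a = np.full((100, 100), 1.0)
signals_b = signals_a.copy()
indices = np.argsort(rng.random((100, 100)), axis=1)  # unique column indices per row
values = rng.uniform(0.0, 1.0, size=100)

# fancy-indexing version from the question
rows = np.repeat(np.arange(signals_a.shape[0]), 4)
columns = indices[:, :4].ravel()
signals_a[rows, columns] = np.tile(values[:4], signals_a.shape[0])

# put_along_axis version
np.put_along_axis(signals_b, indices[:, :4], values[:4], axis=1)
print(np.array_equal(signals_a, signals_b))  # True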
Hope this helps someone later.

Exclude/Ignore data region in polynomial fit (zfit)

I wanted to know if there's a way to exclude one or more data regions in a polynomial fit. Currently this doesn't seem to work as I would expect. Here is a small example:
import numpy as np
import pandas as pd
import zfit
# Create test data
left_data = np.random.uniform(0, 3, size=1000).tolist()
mid_data = np.random.uniform(3, 6, size=5000).tolist()
right_data = np.random.uniform(6, 9, size=1000).tolist()
testsample = pd.DataFrame(left_data + mid_data + right_data, columns=["x"])
# Define fit parameter
coeff1 = zfit.Parameter('coeff1', 0.1, -3, 3)
coeff2 = zfit.Parameter('coeff2', 0.1, -3, 3)
# Define Space for the fit
obs_all = zfit.Space("x", limits=(0, 9))
# Perform the fit
bkg_fit = zfit.pdf.Chebyshev(obs=obs_all, coeffs=[coeff1, coeff2], coeff0=1)
new_testsample = zfit.Data.from_pandas(obs=obs_all, df=testsample.query("x<3 or x>6"), weights=None)
nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=new_testsample)
minimizer = zfit.minimize.Minuit()
result = minimizer.minimize(nll)
[Figure: TestSample.png]
Here I've created a small test sample from three uniform distributions. I only want to use the data in x < 3 OR x > 6 and ignore the 'peak' in between. Because the two outer regions have equal shape and height, I'd expect coeff1 and coeff2 to be (nearly) zero and the fitted curve to be a straight, horizontal line. Obviously this doesn't happen, because zfit assumes that there are simply no entries between 3 and 6.
I also tried using MultiSpaces to ignore that region via
limit1 = zfit.Space("x", limits=(0, 3))
limit2 = zfit.Space("x", limits=(6, 9))
obs_data = limit1 + limit2
But this leads to a
ValueError: obs need to be a Space with exactly one limit if rescaling is requested.
Does anyone have an idea how to solve this?
Thanks in advance ^^
Indeed, this is a bit of a tricky problem, but it may just need a small update in zfit.
What you are doing is correct: simply use only the data in the desired region. However, this is not the whole story, because there is a "normalization range": probabilistically speaking, it's like conditioning on a certain region, since we know the data can only lie there. Hence the normalization of the PDF should only integrate over the included (low and high) regions.
This can normally be done in two ways:
Using multispace
Using the multispace property as you do should work (though it is most probably not the way to go in the future), except for a quirk in the polynomial function: the polynomials are defined from -1 to 1. Currently, the data is therefore simply rescaled to lie within -1 and 1 (and for that it uses the "space" property of the PDF). This currently requires a simple space (which could, in principle, also be allowed for multiple limits by using the minimum and maximum of the limits).
Simultaneous fit
As mentioned in the comments by @jtlz2, you can do a simultaneous fit. That is nothing to worry about; it simply splits the likelihood into two parts. Since the likelihood is a product of probabilities, we can conceptually split it into two products and multiply them (or add their logs).
So you can have the pdf fit the lower region and the upper at the same time. However, this does not solve the problem of the normalization: what should the PDF be normalized to? We will run into the same problem.
Solution 1: different space and norm
The space and the normalization range are, however, not the same thing. By default, the space (usually called 'obs') is also used as the normalization range, but that is not required. So you could use one space going from the lowest to the highest point as the obs and then set the norm range with your multispace (set_norm should do it, or set_norm_range if you're not using the newest version). This, I think, should do the trick.
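A rough sketch of Solution 1, reusing the names from the question; the exact method name (set_norm vs. set_norm_range) depends on your zfit version, as noted above, so treat this as an outline rather than tested code:

obs_all = zfit.Space("x", limits=(0, 9))  # simple space used as obs, so the rescaling works
norm_range = zfit.Space("x", limits=(0, 3)) + zfit.Space("x", limits=(6, 9))

bkg_fit = zfit.pdf.Chebyshev(obs=obs_all, coeffs=[coeff1, coeff2], coeff0=1)
bkg_fit.set_norm_range(norm_range)  # or bkg_fit.set_norm(norm_range) on newer zfit

nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=new_testsample)
result = zfit.minimize.Minuit().minimize(nll)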
Solution 2: manual re-scaling
The actual problem is the complaint about the rescaling to -1 and 1, which can't be done with a multispace. Every polynomial that does this rescaling can also be told not to, by using the apply_scaling=False argument. With that, you are responsible for scaling the data to within -1 and 1 (as the polynomials are not defined outside that range), and there should not be any error.

How to use sklearn's IncrementalPCA partial_fit

I've got a rather large dataset that I would like to decompose, but it is too big to load into memory. Researching my options, it seems that sklearn's IncrementalPCA is a good choice, but I can't quite figure out how to make it work.
I can load in the data just fine:
import h5py

f = h5py.File('my_big_data.h5', 'r')
features = f['data']
And from this example, it seems I need to decide what size chunks I want to read from it:
num_rows = features.shape[0]  # total number of rows in data
chunk_size = 10  # how many rows at a time to feed ipca
Then I can create my IncrementalPCA, stream the data chunk-by-chunk, and partially fit it (also from the example above):
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=2)
for i in range(0, num_rows//chunk_size):
    ipca.partial_fit(features[i*chunk_size : (i+1)*chunk_size])
This all goes without error, but I'm not sure what to do next. How do I actually do the dimension reduction and get a new numpy array I can manipulate further and save?
EDIT
The code above was for testing on a smaller subset of my data – as @ImanolLuengo correctly points out, it would be way better to use a larger number of dimensions and chunk size in the final code.
As you well guessed the fitting is done properly, although I would suggest increasing the chunk_size to 100 or 1000 (or even higher, depending on the shape of your data).
What you have to do now is actually transform it:
out = my_new_features_dataset  # shape N x 2
for i in range(0, num_rows//chunk_size):
    out[i*chunk_size : (i+1)*chunk_size] = ipca.transform(features[i*chunk_size : (i+1)*chunk_size])
And that should give you your new transformed features. If you still have too many samples to fit in memory, I would suggest using out as another HDF5 dataset.
Also, I would argue that reducing a huge dataset to 2 components is probably not a very good idea, but it is hard to say without knowing the shape of your features. I would suggest reducing them to sqrt(features.shape[1]), as it is a decent heuristic, or, pro tip: use ipca.explained_variance_ratio_ to determine the best number of features for your affordable information-loss threshold.
Edit: as for the explained_variance_ratio_, it returns a vector of dimension n_components (the n_components that you pass as a parameter to IPCA) where each value i indicates the fraction of the variance of your original data explained by the i-th new component.
You can follow the procedure in this answer to extract how much information is preserved by the first n components:
>>> print(ipca.explained_variance_ratio_.cumsum())
[ 0.32047581 0.59549787 0.80178824 0.932976 1. ]
Note: the numbers are fictitious, taken from the answer above, assuming that you have reduced IPCA to 5 components. The i-th number indicates how much of the original variance is explained by the first i+1 components, as it is the cumulative sum of the explained variance ratio.
Thus, what is usually done is to fit your PCA with the same number of components as your original data:
ipca = IncrementalPCA(n_components=features.shape[1])
Then, after training on your whole data (with iteration + partial_fit), you can plot explained_variance_ratio_.cumsum() and choose how much variance you are willing to lose. Or do it automatically:
k = np.argmax(ipca.explained_variance_ratio_.cumsum() > 0.9) + 1
argmax returns the first index of the cumsum array where the value is > 0.9; adding 1 turns that zero-based index into the number of PCA components that preserve at least 90% of the original variance.
Then you can tweak the transformation to reflect it:
cs = chunk_size
out = my_new_features_dataset  # shape N x k
for i in range(0, num_rows//chunk_size):
    out[i*cs : (i+1)*cs] = ipca.transform(features[i*cs : (i+1)*cs])[:, :k]
Note the slicing to :k, which selects only the first k components and ignores the rest.
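Putting the pieces together, a minimal end-to-end sketch under the assumptions above (the file name and the 'data'/'data_pca' dataset names are placeholders, and each chunk must contain at least n_components rows for partial_fit to work):

import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

chunk_size = 1000
with h5py.File('my_big_data.h5', 'a') as f:
    features = f['data']
    num_rows, num_cols = features.shape

    # 1) fit with all components to inspect the explained variance
    ipca = IncrementalPCA(n_components=num_cols)
    for i in range(0, num_rows, chunk_size):
        ipca.partial_fit(features[i:i + chunk_size])

    # 2) pick k so that at least 90% of the variance is preserved
    k = int(np.argmax(ipca.explained_variance_ratio_.cumsum() > 0.9)) + 1

    # 3) transform chunk by chunk into a new on-disk dataset
    out = f.create_dataset('data_pca', shape=(num_rows, k), dtype='float64')
    for i in range(0, num_rows, chunk_size):
        out[i:i + chunk_size] = ipca.transform(features[i:i + chunk_size])[:, :k]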

Cross-validation: finding row indices for a test set that aren't part of a training set

What I need to do is randomly pick (with replacement) 50 rows from a numpy matrix for the purposes of training a linear separator.
Then, I need to test the linear separator using the rows which I did not pick.
For the first part, where A is my full data matrix, I do:
A_train = A[np.random.randint(A.shape[0],size=50),:]
But I currently have no effective way to find:
A_test = ...
where A_test contains none of the rows that appear in A_train. How would I do this?
The key to this problem is that A is an n x m matrix, not a 1-dimensional array...
You can use np.setdiff1d to find row indices that are not included in your training set:
import numpy as np
gen = np.random.RandomState(0)
n_total = 1000
n_train = 800
train_idx = gen.choice(n_total, size=n_train)
test_idx = np.setdiff1d(np.arange(n_total), train_idx)
One consequence of sampling with replacement is that the number of examples eligible for inclusion in the test set will vary according to the number of repeated examples in the training set:
print(test_idx.size)
# 439
If you want to ensure that the size of the test set is consistent, you could resample with replacement from the set of indices that aren't in the training set:
n_test = 200
test_idx2 = gen.choice(test_idx, size=n_test)
If you don't actually care about sampling with replacement then a simpler option would be to generate a random permutation of all the indices, then take the first N as training examples and the rest as test examples:
idx = gen.permutation(n_total)
train_idx, test_idx = idx[:n_train], idx[n_train:]
Or you could just shuffle the rows of your array in place using np.random.shuffle.
I should also point out that scikit-learn has various convenience methods for partitioning data into training and test sets for the purposes of cross-validation.
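For example, a minimal sketch with scikit-learn's train_test_split (note that, unlike the np.random.randint call above, it samples without replacement):

import numpy as np
from sklearn.model_selection import train_test_split

A = np.random.rand(100, 5)  # hypothetical n x m data matrix
A_train, A_test = train_test_split(A, train_size=50, random_state=0)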

get the best features from matrix n X m

I have a matrix X with 1000 features (columns) and 100 rows of float elements, and a target vector y with two classes, 0 and 1; the dimension of y is (100, 1). I want to find the 10 features in this matrix that best discriminate the 2 classes. I tried to use the chi-square test defined in scikit-learn, but X is of float elements.
Can you help me and tell me which function I can use?
Thank you.
I am not sure what you mean by "X is of float elements". Chi2 works for non-negative histogram-like data (e.g. l1-normalized counts). If your data doesn't satisfy this, you have to use another method.
There is a whole module of feature selection algorithms in scikit-learn. Have you read the docs? The simplest one would be using SelectKBest.
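For instance, a minimal sketch (with hypothetical random data) using SelectKBest with f_classif, which, unlike chi2, accepts negative float features:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

X = np.random.randn(100, 1000)         # 100 samples, 1000 float features
y = np.random.randint(0, 2, size=100)  # binary target

selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X, y)   # shape (100, 10)
best_feature_idx = selector.get_support(indices=True)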
Recursive Feature Elimination (RFE) has been really effective for me. This method initially fits a model that assigns weights to all the features, then removes the feature(s) with the smallest weights. This step is applied repeatedly until the desired number of features (in your case 10) remains.
http://scikit-learn.org/stable/modules/feature_selection.html#recursive-feature-elimination
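A minimal RFE sketch (again with hypothetical data), eliminating 50 features per step with a linear SVM until 10 remain:

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X = np.random.randn(100, 1000)
y = np.random.randint(0, 2, size=100)

rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=50)
X_new = rfe.fit_transform(X, y)  # shape (100, 10)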
As far as I know, if your data is correlated, L1-penalty selection might not be the best idea. Correct me if I'm wrong.
