How to use sklearn's IncrementalPCA partial_fit - python

I've got a rather large dataset that I would like to decompose but is too big to load into memory. Researching my options, it seems that sklearn's IncrementalPCA is a good choice, but I can't quite figure out how to make it work.
I can load in the data just fine:
f = h5py.File('my_big_data.h5')
features = f['data']
And from this example, it seems I need to decide what size chunks I want to read from it:
num_rows = features.shape[0] # total number of rows in the data
chunk_size = 10 # how many rows at a time to feed ipca
Then I can create my IncrementalPCA, stream the data chunk-by-chunk, and partially fit it (also from the example above):
ipca = IncrementalPCA(n_components=2)
for i in range(0, num_rows // chunk_size):
    ipca.partial_fit(features[i*chunk_size : (i+1)*chunk_size])
This all goes without error, but I'm not sure what to do next. How do I actually do the dimension reduction and get a new numpy array I can manipulate further and save?
EDIT
The code above was for testing on a smaller subset of my data – as @ImanolLuengo correctly points out, it would be much better to use a larger number of dimensions and a larger chunk size in the final code.

As you correctly guessed, the fitting is done properly, although I would suggest increasing the chunk_size to 100 or 1000 (or even higher, depending on the shape of your data).
What you have to do now is actually transform the data, chunk by chunk, the same way you fitted it:
out = my_new_features_dataset  # shape N x 2
for i in range(0, num_rows // chunk_size):
    out[i*chunk_size : (i+1)*chunk_size] = ipca.transform(features[i*chunk_size : (i+1)*chunk_size])
And that should give you your new, transformed features. If you still have too many samples to fit in memory, I would suggest using out as another HDF5 dataset.
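For example, here is a minimal sketch of how out could live on disk with h5py (the output dataset name 'data_reduced' and the float32 dtype are placeholders of my choosing):
import h5py
from sklearn.decomposition import IncrementalPCA

with h5py.File('my_big_data.h5', 'r+') as f:
    features = f['data']
    num_rows = features.shape[0]
    chunk_size = 1000

    ipca = IncrementalPCA(n_components=2)
    for i in range(0, num_rows // chunk_size):
        ipca.partial_fit(features[i*chunk_size : (i+1)*chunk_size])

    # write the reduced features straight into a new on-disk dataset
    out = f.create_dataset('data_reduced', shape=(num_rows, 2), dtype='float32')
    for i in range(0, num_rows // chunk_size):
        out[i*chunk_size : (i+1)*chunk_size] = ipca.transform(features[i*chunk_size : (i+1)*chunk_size])
As in the question's loop, a trailing chunk smaller than chunk_size is skipped here, so in practice you may want one extra pass for the remainder.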
Also, I would argue that reducing a huge dataset to 2 components is probably not a very good idea, but it is hard to say without knowing the shape of your features. I would suggest reducing them to sqrt(features.shape[1]), as it is a decent heuristic, or, pro tip: use ipca.explained_variance_ratio_ to determine the best number of components for your affordable information-loss threshold.
Edit: as for explained_variance_ratio_, it returns a vector of dimension n_components (the n_components that you pass as a parameter to IPCA) where each value i indicates the percentage of the variance of your original data explained by the i-th new component.
You can follow the procedure in this answer to extract how much information is preserved by the first n components:
>>> print(ipca.explained_variance_ratio_.cumsum())
[ 0.32047581 0.59549787 0.80178824 0.932976 1. ]
Note: the numbers are fictitious, taken from the answer above, and assume that you have reduced the IPCA to 5 components. The i-th number indicates how much of the original data is explained by components 0 through i, since it is the cumulative sum of the explained variance ratio.
Thus, what is usually done is to fit your PCA with the same number of components as your original data:
ipca = IncrementalPCA(n_components=features.shape[1])
Then, after training on your whole data (with iteration + partial_fit), you can plot explained_variance_ratio_.cumsum() and choose how much data you want to lose. Or do it automatically:
k = np.argmax(ipca.explained_variance_ratio_.cumsum() > 0.9)
The above will return the first index of the cumsum array where the value is > 0.9, that is, the number of PCA components that preserve at least 90% of the original data.
Then you can tweak the transformation to reflect it:
cs = chunk_size
out = my_new_features_dataset  # shape N x k
for i in range(0, num_rows // chunk_size):
    out[i*cs : (i+1)*cs] = ipca.transform(features[i*cs : (i+1)*cs])[:, :k]
NOTE the slicing to :k to select only the first k components, ignoring the rest.


Index elements from 2D array with a 2D array and change these elements according to a 1D array [Python]

To quantify the behaviour of a classifier, I want to modify the input data and track the changes in classification. I am classifying 1D signals and use algorithms that deliver explanations, meaning another algorithm creates an array of the points in the 1D signal that are most important for the classification decision. I then want to use these most-important-point indices to index the 1D signal in the 2D array of signals and modify the values at these points. The values for the modification have to come from a 1D array of random values, so that every signal gets changed with the same randomness. I will try to visualize it:
array_of_1D_Signals = [[8,1,2,8,3,4,8,1,3,8],[4,1,8,8,3,8,6,1,8,4],[...],[...]]
# exemplary 4 most important points for every signal (let's say the 8s are important)
# they are ordered from most important to least important
list_of_indices_for_every_signal = [[7,3,0,9],[8,2,5,3],[...],[...]]
values_for_modification = [4,1,6,3]
# the array I need to create (the 8s get exchanged with the values)
modified_array_of_1D_Signals = [[6,1,2,1,3,4,4,1,3,3],[4,1,1,3,3,6,6,1,4,4],[...],[...]]
I have solved this with for loops, but I do this over millions of samples and it takes ages.
Is there a smart numpy way of doing this? I have a little example version with fancy indexing.
array_of_1D_Signals = np.full((100, 100), 1, dtype='float')
indices = np.random.randint(100, size=(100, 100))
values = np.random.uniform(low=0.0, high=1.0, size=(100,))
rows = np.arange(start=0, stop=array_of_1D_Signals.shape[0], step=1)
rows = np.repeat(rows, 4)
columns = indices[:, :4].flatten()
array_of_1D_Signals[rows, columns] = np.tile(values[:4], 100)
But that doesn't feel like the smartest way, with the repeat of the rows and the tiling of the values, and I imagine it scales rather badly, because in my real analysis all dimensions get big (millions of samples with thousands of points to change).
Maybe someone has an idea?
Thank you for your time
OK, something I tried earlier and got wrong actually works when done right.
The solution is:
np.put_along_axis(array_of_1D_Signals, indices[:,:4], values[:4], axis=1)
My first proposed solution also takes 2.3 times as long as this line of code.
Hope this helps someone later.
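For context, a minimal, self-contained version of that call plugged into the question's small test setup (same made-up shapes as above):
import numpy as np

array_of_1D_Signals = np.full((100, 100), 1, dtype='float')
indices = np.random.randint(100, size=(100, 100))
values = np.random.uniform(low=0.0, high=1.0, size=(100,))

# for every row, write values[0:4] at the columns given by indices[row, 0:4]
np.put_along_axis(array_of_1D_Signals, indices[:, :4], values[:4], axis=1)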

Exclude/Ignore data region in polynomial fit (zfit)

I wanted to know if there's a way to exclude one or more data regions from a polynomial fit. Currently this doesn't seem to work as I would expect. Here is a small example:
import numpy as np
import pandas as pd
import zfit
# Create test data
left_data = np.random.uniform(0, 3, size=1000).tolist()
mid_data = np.random.uniform(3, 6, size=5000).tolist()
right_data = np.random.uniform(6, 9, size=1000).tolist()
testsample = pd.DataFrame(left_data + mid_data + right_data, columns=["x"])
# Define fit parameter
coeff1 = zfit.Parameter('coeff1', 0.1, -3, 3)
coeff2 = zfit.Parameter('coeff2', 0.1, -3, 3)
# Define Space for the fit
obs_all = zfit.Space("x", limits=(0, 9))
# Perform the fit
bkg_fit = zfit.pdf.Chebyshev(obs=obs_all, coeffs=[coeff1, coeff2], coeff0=1)
new_testsample = zfit.Data.from_pandas(obs=obs_all, df=testsample.query("x<3 or x>6"), weights=None)
nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=new_testsample)
minimizer = zfit.minimize.Minuit()
result = minimizer.minimize(nll)
(image: TestSample.png)
Here I've created a small test sample from three uniform distributions. I only want to use the data at x < 3 OR x > 6 and ignore the 'peak' in between. Because the two outer regions have equal shape and height, I'd expect coeff1 and coeff2 to end up at (nearly) zero and the fitted curve to be a straight, horizontal line. Obviously this doesn't happen, because zfit assumes that there are simply no entries between 3 and 6.
I also tried using MultiSpaces to ignore that region via
limit1 = zfit.Space("x", limits=(0, 3))
limit2 = zfit.Space("x", limits=(6, 9))
obs_data = limit1 + limit2
But this leads to a
ValueError: obs need to be a Space with exactly one limit if rescaling is requested.
Anyone has an idea how to solve this?
Thanks in advance ^^
Indeed, this is a bit of a tricky problem, but it may just need a small update in zfit.
What you are doing is correct: simply use only the data in the desired region. However, this is not the whole story, because there is a "normalization range": probabilistically speaking, it's like conditioning on a certain region, as we know the data can only be in that region. Hence the normalization of the PDF should only integrate over the included (low and high) regions.
This can normally be done in two ways:
Using multispace
Using the multispace property as you do. This should work (though it is most probably not the way to go in the future), except for a quirk in the polynomial function: the polynomials are defined from -1 to 1. Currently, the data is therefore simply rescaled to lie within -1 and 1 (and for that it uses the "space" property of the PDF). This currently requires the space to be a simple space (which could in principle also be allowed for multiple limits, using the minimum and maximum of the limits).
Simultaneous fit
As mentioned in the comments by @jtlz2, you can do a simultaneous fit. That is nothing to worry about; it simply splits the likelihood into two parts. As it is a product of probabilities, we can conceptually split it into two products and multiply them (or add their logs).
So you can have the pdf fit the lower region and the upper at the same time. However, this does not solve the problem of the normalization: what should the PDF be normalized to? We will run into the same problem.
Solution 1: different space and norm
Space and the normalization range are, however, not the same thing. By default, the space (usually called 'obs') is also used as the normalization range, but this is not required. So you could use one space going from the lowest to the highest point as the obs and then set the norm range to your multispace (set_norm should do it, or set_norm_range if you're not using the newest version). This, I think, should do the trick.
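A rough sketch of how that could look on top of the question's code; I am assuming set_norm_range here, and as noted the exact method name depends on your zfit version:
# full range as obs, sidebands only as normalization range
obs_all = zfit.Space("x", limits=(0, 9))
norm_range = zfit.Space("x", limits=(0, 3)) + zfit.Space("x", limits=(6, 9))

bkg_fit = zfit.pdf.Chebyshev(obs=obs_all, coeffs=[coeff1, coeff2], coeff0=1)
bkg_fit.set_norm_range(norm_range)  # or set_norm, depending on the version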
Solution 2: manual re-scaling
The actual problem is that it complains about the rescaling to -1 and 1, which can't be done for a multispace. Every polynomial that does this rescaling can also be told not to, using the apply_scaling=False argument. With that, you are responsible for scaling the data to within -1 and 1 (as the polynomials are not defined outside that range), and there should not be any error.
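A minimal sketch of that idea, again building on the question's code. The linear map from [0, 9] to [-1, 1] and the scaled limits are my own choices, and I am assuming the Chebyshev PDF accepts the multispace obs once apply_scaling=False removes the rescaling step:
# rescale x from [0, 9] to [-1, 1] by hand: x -> x/4.5 - 1,
# so the sidebands (0, 3) and (6, 9) become (-1, -1/3) and (1/3, 1)
testsample["x_scaled"] = testsample["x"] / 4.5 - 1

obs_scaled = (zfit.Space("x_scaled", limits=(-1, -1/3))
              + zfit.Space("x_scaled", limits=(1/3, 1)))
bkg_fit = zfit.pdf.Chebyshev(obs=obs_scaled, coeffs=[coeff1, coeff2],
                             coeff0=1, apply_scaling=False)
new_testsample = zfit.Data.from_pandas(obs=obs_scaled,
                                       df=testsample.query("x < 3 or x > 6"))
nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=new_testsample)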

Generating large simulations and inserting the same array multiple times into another array at different locations

I am working on generating a Monte Carlo simulation for oil wells. The end goal is to have all the wells with a smoothed probabilistic production curve. I have optimized what I can, but each of the 3 apply statements I am listing takes a lot of time (hours) when I use my full dataset and the number of simulations I want. The code I included runs 10 iterations; if you crank it up to 10,000, which is the goal, it really starts to drag.
I have generated a pandas DataFrame that has all the future wells I want to model, with a probability of each well being chosen next to be drilled.
I then created a DataFrame where I grouped everything into the categories I want to use to figure out the order in which the model will choose the wells. So my "timing" DataFrame contains my categories, an array of the indices of all the wells in each category, and an array of those wells' probabilities.
This all is done in a few seconds. The next part works, but gets very slow.
Next I use a NumPy generator's choice with probabilities to randomly generate the order of the wells for i simulations. As other posts have noted, @njit does not work with the probability array. In the result, one dimension of the array is the order in which the wells will be chosen within each category, and the other dimension is the simulation. There are about 150 categories, with tens of thousands of wells in each category. I am hoping to run 10,000 simulations.
a is an array of indices of wells that can be chosen
size is the length of that array
p is the probability that each well will get chosen
Next I link my timing DataFrame to the DataFrame with all of the wells in it. This attaches the previous array to the wells array. Then I search this array for the well's index to figure out, for each simulation, when that specific well is going to get run. This generates a 1D array with the order in which that well is going to be drilled in each simulation.
This function gets called on hundreds of thousands of wells, and as I increase the number of simulations it really slows down.
order is an array of the order each well is drilled per simulation
index is the index of that well
The final difficulty I am having is averaging out the production curve for the wells. I have how much oil will be produced by each well per month. I need to insert that curve into the array at the point when the well is drilled in each simulation, then average all of those values together to get the average production of the well given all the simulations.
I have also tried creating an np.zeros array and then using the np.insert function, but I could not figure out how to insert an array multiple times without a loop, and generating the initial array of 0's took longer than my current method. (I worked around inserting the array multiple times by converting everything to a string, inserting the type curve as a string, then converting back to an array of numbers, but this did not seem efficient.) I need the number of leading 0's to match the month in which the well is drilled.
order is the time in months that each well will get drilled
curve is the production curve passed as a list
m is the highest value of the months that the well is drilled in all simulations
import numpy as np
from numba import njit
import datetime
import math

def TimingGenerator(a, size, p):
    i = 10
    g = np.random.Generator(np.random.PCG64())
    order = np.concatenate([g.choice(a=a, size=size, replace=False, p=p) for z in range(i)]).reshape(i, size)
    return order

@njit
def OrderGenerator(order, index):
    result = np.where(order == index)[1]
    return result

def CurveAverager(order, curve, m):
    matrix = np.array([[0] * math.ceil(i) + curve + [0] * int(m - math.ceil(i)) for i in order])
    result = np.mean(matrix, axis=0)
    return result
begin_time = datetime.datetime.now()

size = 8000
g = np.random.Generator(np.random.PCG64())
a = g.choice(20_000, size=size, replace=False)
p = np.random.randint(1, 100, size=size)
p = p / np.sum(p)

for i in range(150):
    q = TimingGenerator(a, size, p)
print(datetime.datetime.now() - begin_time)

index = np.amin(q)
for i in range(100000):
    order = OrderGenerator(q, index)
print(datetime.datetime.now() - begin_time)

order = order / 15
curve = list(range(600, 0, -1))
for i in range(20000):
    avgcurve = CurveAverager(order, curve, size)
print(datetime.datetime.now() - begin_time)
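For reference, here is a minimal sketch of the "preallocate zeros and place the curve without a loop" idea mentioned above, with tiny made-up numbers; the index arithmetic is mine and not from the post:
import numpy as np

order = np.array([2, 0, 5])            # drill month per simulation (made up)
curve = np.arange(600, 0, -1)          # monthly production curve
m = order.max()

matrix = np.zeros((order.size, m + curve.size))
cols = order[:, None] + np.arange(curve.size)          # destination column of each curve point
matrix[np.arange(order.size)[:, None], cols] = curve   # place the curve in every row at its offset
avg_curve = matrix.mean(axis=0)        # average production across all simulations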
Thanks for any help you can offer. I am willing to greatly alter my code if you can think of anything that helps speed it up. I am not sure if there is a better way to apply the probabilities and smooth out the production curve, which is really the end goal.
Cheers.

Implementing a weighted average on the fly in Python

I have a stream of incoming data and I want to implement a moving average on the fly. If all the elements in the moving average have the same weight, it is fairly easy to implement using a 'Queue', but I want the most recent elements to have higher weights, and the distribution of these weights should be linear (not exponential).
For example, if the moving average is of length 5, the current value should have weight '1', the previous one should have weight '0.8' and so on until the fifth element in the queue which should have weight '0.2'; so the weight vector is: [0.2, 0.4, 0.6, 0.8, 1.0].
I was wondering if anybody knows how to implement this in Python. If there is a faster way to do this, please recommend it; efficiency is important for my specific job.
If you want to keep a weighting vector such as the one described (linearly decreasing weights), you will need to keep all the information about your past stream. I quickly tried to sketch a mathematical function that would avoid keeping your past scalars in memory, without success. This is where exponential weighting has a powerful advantage:
average_(t) = x_(t) + a * average_(t-1)
You only need to keep two variables in your memory.
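As a tiny illustration of that recursion (my sketch; a is the decay factor and 0.8 is an arbitrary choice):
a = 0.8        # decay factor, assumed 0 < a < 1
average = 0.0  # the only state kept between updates

for x in (20, 40, 60):            # incoming stream
    average = x + a * average     # the recursion from above
    print(average)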
Anyway, if memory is not an efficiency constraint, your problem comes down to a vector multiplication. Therefore, I would suggest using the numpy library [1] [2]. See a solution example below (perhaps you may find a more efficient one):
import numpy as np

stream = np.array((20, 40))
n = len(stream)
# n represents the length of the stream;
# I assumed it is more efficient to track n than to call len() every time
# (may raise safety issues)

latest_scalar = 60
stream = np.append(stream, latest_scalar)
n += 1

weights = np.arange(1, n+1)
# [1, 2, 3]

average = np.dot(stream, weights) / (n * (n+1) / 2)
# (n*(n+1)/2): total of the weights
# output: 46.666... ok!
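Since the question asks for a fixed-length window of 5 with weights [0.2, 0.4, 0.6, 0.8, 1.0], here is a minimal sketch of that variant built on the same dot-product idea (my own addition, not part of the answer above):
from collections import deque

import numpy as np

window = deque(maxlen=5)               # keeps only the 5 most recent values
weights = np.arange(1, 6) / 5.0        # [0.2, 0.4, 0.6, 0.8, 1.0], newest last

def weighted_moving_average(x):
    window.append(x)
    w = weights[-len(window):]         # truncate the weights while the window fills up
    return np.dot(window, w) / w.sum()

for value in (10, 12, 11, 13, 14, 15):
    print(weighted_moving_average(value))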

Best way to represent a training set to split with

A training set is made of a set of samples and a set of labels, one for each sample. In my case a sample is a vector while a label is a scalar. To deal with this I use NumPy. Consider this example:
samples = np.array([[1,0],[0.2,0.5], [0.3,0.8]])
labels = np.array([1,0,0])
Now I have to split the training set into two partitions, shuffling the elements. This raises a problem: I lose the correspondence with the labels. How can I solve this?
As performance is critical in my project, I would prefer not to construct a permutation vector; I am looking for a way to bind the labels to the samples. For now my solution is to use the last column of the samples array as the label, like:
samples_and_labels = np.array([[1,0,1],[0.2,0.5,0], [0.3,0.8,0]])
Is this the fastest solution for my case, or is there a better one? For instance, creating pairs?
The mixing of labels with float datatypes makes me uneasy. When you say "split the training set", is this completely random? If so, I would go with the random permutation vector - I don't think your solution is any faster (even setting aside my data type reservations), because you're still allocating memory when creating your samples_and_labels array.
You could do something like (assuming len(samples) is even for simplicity of illustration):
# set n to len(samples) // 2
ind = np.hstack((np.ones(n, dtype=bool), np.zeros(n, dtype=bool)))
# modifies in-place, no memory allocation
np.random.shuffle(ind)
and then you can do
samples_left, samples_right = samples[ind], samples[~ind]
labels_left, labels_right = labels[ind], labels[~ind]
and call
np.random.shuffle(ind)
whenever you need a new split
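For comparison, a minimal sketch of the permutation-vector approach mentioned at the top of this answer (my sketch, not from the post; it builds on the numpy arrays from the question):
perm = np.random.permutation(len(samples))  # one shuffle shared by samples and labels
half = len(samples) // 2
samples_left, samples_right = samples[perm[:half]], samples[perm[half:]]
labels_left, labels_right = labels[perm[:half]], labels[perm[half:]]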
Without numpy it's maybe not as fast. You can try importing "_random" instead of just "random" for better shuffling performance.
import random
samples = [[1,0],[0.2,0.5], [0.3,0.8]]
labels = [1,0,0]
print(samples, '\n', labels)
z = list(zip(samples, labels))
random.shuffle(z)
samples, labels = zip(*z)
print(samples, '\n', labels)
