How to convert numpy array into libsvm format - python

I have a numpy array for an image and am trying to dump it into the libsvm format of LABEL I0:V0 I1:V1 I2:V2..IN:VN. I see that scikit-learn has a dump_svmlight_file and would like to use that if possible since it's optimized and stable.
It takes parameters of X, y, and file output name. The values I'm thinking about would be:
X - numpy array
y - ????
file output name - self-explanatory
Would this be a correct assumption for X? I'm very confused about what I should do for y, though.
It appears it needs to be a feature set of some kind, but I don't know how I would go about obtaining that. Thanks in advance for the help!

The svmlight format is tailored to classification/regression problems. Accordingly, X is a matrix with as many rows as there are data points in your set and as many columns as there are features, and y is the vector of instance labels.
For example, suppose you have 1000 objects (say, images of bicycles and bananas), featurized in 400 dimensions. X would be 1000x400, and y would be a 1000-vector with a 1 entry wherever the object is a bicycle and a -1 entry wherever it is a banana.
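A minimal sketch of that call (the shapes, labels, and file name are placeholders, and the random data just stands in for your flattened image features):

import numpy as np
from sklearn.datasets import dump_svmlight_file

X = np.random.rand(1000, 400)                   # 1000 samples, each with 400 feature values
y = np.where(np.arange(1000) < 500, 1, -1)      # +1 = bicycle, -1 = banana (placeholder labels)
dump_svmlight_file(X, y, "images.svmlight")     # writes one "LABEL index:value ..." line per sample

Note that dump_svmlight_file expects one row per sample, so a single image would first be flattened (e.g. with image.reshape(1, -1)) and paired with a one-element y.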

Related

Python - How to resample a 2D shape?

I am writing a Python script for some geometrical data manipulation (calculating motion trajectories for a multi-drive industrial machine). Generally, the idea is that there is a given shape (let's say an ellipse, but in the general case it can be any convex shape defined by a series of 2D points), which is rotated, and its uppermost tangent point must be followed. I don't have a problem with the latter part, but I need a little hint with the 2D shape preparation.
Let's say that the ellipse was defined with too few points, for example 25. (As I said, ultimately this can be any shape, for example a rounded hexagon.) To maintain the necessary precision I need far more points (let's say 1000), preferably distributed evenly over the whole shape, or with a higher density of points near corners, sharp curves, etc.
I have a few things ringing in my head; I guess that the DFT (FFT) would be a good starting point for this resampling. Looking at scipy.signal.resample(), I have found that there are far more functions in the scipy.signal package that sound promising to me...
What I'm asking for is a suggestion of which way to go and which tool to try for this job, whichever may be the most suitable. Maybe there is a tool meant exactly for what I'm looking for, or maybe I'm overthinking this and one of the FFT-based implementations like resample() will work just fine (of course, after some adjustments at the starting and ending point of the shape to make sure it closes without issues)?
scipy.signal sounds promising; however, as far as I understand, it is meant for time-series data rather than geometrical data. I guess this may cause some problems, as my data isn't a function (in the mathematical sense).
Thanks and best regards!
As far as I understand, what you want is an interpolated version of your original data.
The DFT (or FFT) will not achieve this purpose, since it performs a Fourier transform (which is not what you want).
Theoretically speaking, what you need in order to interpolate your data is a function that computes values at the new data points.
So, let's say your data contains 5 points, each storing a 1D number (to simplify) representing your data, and you want a denser array filled with a linear interpolation of your original data.
Using numpy.interp:
import numpy as np

original_data = [2, 0, 3, 5, 1]  # define your data in 1D
new_data_resolution = 0.5        # new sampling distance (i.e. your x-axis resolution)

interp_data = np.interp(
    x=np.arange(0, len(original_data) - 1 + new_data_resolution, new_data_resolution),  # new sampling points (new axis)
    xp=np.arange(len(original_data)),  # original sampling points
    fp=original_data,                  # original values
)
# now interp_data contains (5 - 1) / 0.5 + 1 = 9 points
After this, you will have an array of length (5 - 1) / new_data_resolution + 1 = 9 (more points than the original 5, since new_data_resolution < 1), whose values are (in this case) a linear interpolation of your original data.
Once you have understood this example, you can dive into the scipy.interpolate module to get a better understanding of the interpolation functions (my example uses a linear function to fill in the missing points).
Applying this to n-dimensional arrays is straightforward: iterate over each dimension of your data.
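For the closed 2D shape in your question, one way to apply this idea is a parametric spline: interpolate x and y against a shared parameter and then resample that parameter densely. A minimal sketch with scipy.interpolate.splprep/splev, using 25 points on an ellipse purely as placeholder input (your real point list goes there instead):

import numpy as np
from scipy.interpolate import splprep, splev

theta = np.linspace(0.0, 2.0 * np.pi, 25)   # 25 coarse samples; first and last points coincide, closing the curve
x = 3.0 * np.cos(theta)                     # placeholder ellipse, semi-axes 3 and 1.5
y = 1.5 * np.sin(theta)

tck, u = splprep([x, y], s=0, per=True)     # fit a periodic (closed) parametric spline through the points

u_new = np.linspace(0.0, 1.0, 1000, endpoint=False)  # 1000 new parameter values around the curve
x_new, y_new = splev(u_new, tck)                     # resampled shape, 1000 points

Note that equal spacing in the spline parameter u is only approximately equal spacing along the curve; if you need arc-length spacing, or denser sampling near sharp corners, you can redistribute u based on the cumulative distance between the evaluated points before calling splev again.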

How do I make sense of this diagram from Google's deep learning course?

I have been enjoying Google's deep learning course, while finding it relatively difficult. I am still in the first section, which goes through multinomial logistic classification, specifically classifying pictures of letters according to which letter they contain.
While I kind of get it, I am struggling quite a bit to understand this diagram.
https://postimg.org/image/lk4369cl5/
I suppose what confuses me the most is how the input X, which I assume is a big matrix of pixel values, gets converted into a matrix with 1 column and 3 rows, titled y in the diagram. I don't get how applying the formula wx + b to the input matrix X would give the 3x1 matrix y shown in the diagram. It really seems from this diagram that, regardless of the size of X, you get an output matrix y with 1 row for each possible classification (in this case A, B or C). But that doesn't make sense given that the model is wx + b: shouldn't the size of y match the size of X? And yet that doesn't make sense either, because then you would have a giant matrix of y values which I don't know how to classify.
Furthermore, I don't really understand what y is. Does y just have one number for each of the possible classifications x can be, with a higher number suggesting it is that classification?
https://classroom.udacity.com/courses/ud730/lessons/6370362152/concepts/63798118260923#
I think I get how softmax is calculated from y.
Thanks
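For what it's worth, here is a minimal sketch of how the shapes can work out in this kind of setup (this is the generic linear-classifier-plus-softmax arrangement, not the course's exact code; the image size and class count are placeholders). The key point is that w has one row per class, so wx + b collapses a flattened image into one score per class, and y is just that 3-vector of scores:

import numpy as np

n_pixels = 28 * 28                          # assumed size of the flattened image
n_classes = 3                               # A, B, C

x = np.random.rand(n_pixels)                # input image flattened to a vector, shape (784,)
w = np.random.randn(n_classes, n_pixels)    # weights, shape (3, 784)
b = np.zeros(n_classes)                     # biases, shape (3,)

y = w @ x + b                               # scores ("logits"), shape (3,): one number per class
softmax = np.exp(y) / np.sum(np.exp(y))     # rescales the scores into probabilities that sum to 1
predicted = np.argmax(softmax)              # 0 = A, 1 = B, 2 = C

So, in this picture, a higher entry in y does mean the model leans toward that class, and softmax only rescales those scores into probabilities.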

Scikit SVM error: X.shape[1] = 1 should be equal to 2

I am trying to use Scikit to train on 2 features called x1 and x2. Both of these arrays have shape (490,1). In order to pass one X argument into clf.fit(X,y), I used np.concatenate to produce an array of shape (490,2). The label array is composed of 1's and 0's and has shape (490,). The code is shown below:
x1 = int_x # previously defined array shape (490,1)
x2 = int_x2 # previously defined array shape (490,1)
y=np.ravel(close) # where close is composed of 1's and 0's shape (490,1)
X,y = np.concatenate((x1[:-1],x2[:-1]),axis=1), y[:-1] #train on all datapoints except last
clf = SVC()
clf.fit(X,y)
The following error is shown:
X.shape[1] = 1 should be equal to 2, the number of features at training time
What I don't understand is why this message appears even though, when I check the shape of X, its second dimension is indeed 2 and not 1. I originally tried this with only one feature and clf.fit(X,y) worked well, so I am inclined to think that np.concatenate produced something that was not suitable. Any suggestions would be great.
It's difficult to say without having the concrete values of int_x, int_x2 and close. Indeed, if I try with int_x, int_x2 and close randomly constructed as
import numpy as np
from sklearn.svm import SVC
int_x = np.random.normal(size=(490,1))
int_x2 = np.random.normal(size=(490,1))
close = np.random.randint(2, size=(490,))
which conforms to your specs, then your code works. Thus the error may be in the way you constructed int_x, int_x2 and close.
If you believe the problem is not there, could you please share a minimal reproducible example with specific values of int_x, int_x2 and close?
I think I understand what was wrong with my code.
First, I should have created another variable, say x, defined as the concatenation of int_x and int_x2, with shape (490,2), i.e. one feature row for every entry of close. This came in handy later.
Next, clf.fit(X,y) was not incorrect in itself. However, I did not formulate my prediction code correctly. For instance, I wrote clf.predict([close[-1]]) in the hope of getting the binary target output (either 0 or 1). The argument passed into this method was wrong: it should have been clf.predict([x[-1]]), because the algorithm predicts a label from a feature vector, not the other way around. Since x now has one row per entry of close, clf.predict([x[-1]]) produces the predicted value of close[-1].
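A minimal sketch of that corrected flow, with random placeholder data in the shapes from the question (your real int_x, int_x2 and close go here instead):

import numpy as np
from sklearn.svm import SVC

int_x = np.random.normal(size=(490, 1))       # placeholder for the real first feature
int_x2 = np.random.normal(size=(490, 1))      # placeholder for the real second feature
close = np.random.randint(2, size=(490, 1))   # placeholder binary labels

x = np.concatenate((int_x, int_x2), axis=1)   # features, shape (490, 2)
y = np.ravel(close)                           # labels, shape (490,)

clf = SVC()
clf.fit(x[:-1], y[:-1])                       # train on all rows except the last
print(clf.predict(x[-1:]))                    # predict for the last row; x[-1:] keeps the 2-D shape predict expects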

python lmfit: Given an array of discrete values that define a model, how to fit a same length array of data with specified uncertainty values?

I have a dataset and its uncertainties, both 1-D arrays of length 100.
I have a "model" array, also of length 100.
Goal: optimize a single parameter (a scaling of the amplitude) of this model array so that it better fits the data, given its uncertainty.
So far I've tried:
from lmfit import Parameters, minimize

def residual(params, x, data, eps_data):
    amp = params['amp'].value
    model = amp * x
    return (data - model) / eps_data

params = Parameters()
params.add('amp', value=100)

out = minimize(residual, params, args=(mod_array, data_array, unc_array))
Then, I multiply the best fit value amplitude with the original model array:
fit = params['amp'].value*mod_array
Then, I plot the fit over the original dataset and it looks absolutely terrible, I don't even see the model anywhere close to the data. What's wrong in the code/algorithm?
That looks like it should work, but you have not given enough information to be sure. What are the data types of the arrays (they should be numpy ndarrays of dtype numpy.float64 or numpy.float32), and what output do you get? How much has the value of 'amp' changed in the fit?
Note that if you're using a very recent devel version of lmfit, you would (and, in the future, will) need to use out.params['amp'] for the best-fit value, i.e. read it from the result object returned by minimize rather than from the params you passed in.
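As a hedged illustration of that point (assuming out and mod_array from the code above), you can read the best-fit amplitude from the result object and print lmfit's fit summary:

from lmfit import fit_report

print(fit_report(out))                         # fit statistics plus the best-fit 'amp' and its uncertainty
fit = out.params['amp'].value * mod_array      # scale the model with the amplitude stored on the result object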

get the best features from matrix n X m

I have a matrix X with 1000 features (columns) and 100 rows of float elements, and a target vector y with two classes, 0 and 1; the dimension of y is (100,1). I want to find the 10 features in this matrix that best discriminate the 2 classes. I tried to use the chi-square test from scikit-learn, but X is made of float elements.
Can you help me and tell me a function that I can use?
Thank you.
I am not sure what you mean by "X is of float elements". Chi2 works for non-negative histogram data (i.e. l1-normalized). If your data doesn't satisfy this, you have to use another method.
There is a whole module of feature selection algorithms in scikit-learn. Have you read the docs? The simplest one would be using SelectKBest.
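For reference, a minimal sketch of SelectKBest on data shaped like yours (the random arrays are placeholders; f_classif is used here instead of chi2 because it accepts signed floats):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

X = np.random.randn(100, 1000)                       # placeholder: 100 samples, 1000 float features
y = np.random.randint(2, size=100)                   # placeholder binary labels

selector = SelectKBest(score_func=f_classif, k=10)   # ANOVA F-test works with negative floats, unlike chi2
X_top10 = selector.fit_transform(X, y)               # shape (100, 10)
top_idx = selector.get_support(indices=True)         # column indices of the 10 selected features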
Recursive Feature Elimination (RFE) has been really effective for me. This method initially assigns weights to all the features and then removes the feature with the smallest weight. This step is applied repeatedly until we reach the desired number of features (in your case, 10).
http://scikit-learn.org/stable/modules/feature_selection.html#recursive-feature-elimination
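A hedged sketch of RFE for the same setup (the linear-kernel SVC is just one choice; RFE needs an estimator that exposes coef_ or feature_importances_):

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X = np.random.randn(100, 1000)               # placeholder data with the question's shapes
y = np.random.randint(2, size=100)

rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=1)
rfe.fit(X, y)
top_idx = np.where(rfe.support_)[0]          # indices of the 10 surviving features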
As far as I know, if your data is correlated, L1-penalty-based selection might not be the best idea. Correct me if I'm wrong.
