How to interpolate cumulative histogram data? - python

I have got a set of histograms from numpy.histogram:
probas, years = zip(*[np.histogram(r, bins= bin_values) for r in results])
results is an array of shape (9, 10000). The bin values are the years from 2029 to 2066. The probas array has a shape of (9, 37) and the years array (9, 38), so years[:,:-1] has a shape of (9, 37).
I can obtain the cumulative histogram data using:
probas = np.cumsum(probas, axis=1)
I can then normalize it to [0,1]:
probas = np.asarray(probas)
probas = probas/np.max(probas, axis = 0)
I then try and interpolate that cumulative distribution using scipy:
inverse_pdfs = [scipy.interpolate.interp1d(probas[i], years[i,:-1]) for i in range(probas.shape[0])]
I then plot the third cumulative histogram directly, together with the curve given by the corresponding entry of inverse_pdfs, using:
i = 2
plt.plot(years[i,:-1], probas[i], color="orange")
probability_range = np.arange(0.,1.01,0.01)
plt.plot([inverse_pdfs[i](p) for p in probability_range], probability_range, color="blue")
I obtain:
As you can see the match is pretty good for most of the years after 2042, but before that it is very bad.
Any suggestion on how to improve that match, or where the problem comes from, would be very welcome.
For information, the data used to train the interpolator on the third histogram are:
years[2,:-1]: [2029. 2030. 2031. 2032. 2033. 2034. 2035. 2036. 2037. 2038. 2039. 2040.
2041. 2042. 2043. 2044. 2045. 2046. 2047. 2048. 2049. 2050. 2051. 2052.
2053. 2054. 2055. 2056. 2057. 2058. 2059. 2060. 2061. 2062. 2063. 2064.
2065.]
probas[2]:[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.0916 0.2968 0.4888 0.6666 0.8335 0.9683 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. ]
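One plausible source of the mismatch is that the cumulative values passed to interp1d as x contain long runs of repeated 0s and 1s, so the inverse is not uniquely defined on those plateaus. The following is a minimal sketch, not a definitive fix: it keeps only the last point of the leading plateau and the first point of the trailing plateau, drops repeated values, and interpolates with np.interp (swapped in here for scipy's interp1d). The helper name make_inverse_cdf is hypothetical, and the sketch assumes the cumulative values are non-decreasing and not constant.

```python
import numpy as np

# Data for the third histogram, copied from the question
years = np.arange(2029.0, 2066.0)  # 2029 .. 2065, 37 values
cdf = np.concatenate([
    np.zeros(13),
    [0.0916, 0.2968, 0.4888, 0.6666, 0.8335, 0.9683],
    np.ones(18),
])

def make_inverse_cdf(years, cdf):
    """Return a function mapping a probability to a year (hypothetical helper)."""
    lo = np.flatnonzero(cdf == cdf[0])[-1]   # last point of the leading plateau
    hi = np.flatnonzero(cdf == cdf[-1])[0]   # first point of the trailing plateau
    # np.unique drops any repeated values left between the two plateaus
    p, first = np.unique(cdf[lo:hi + 1], return_index=True)
    y = years[lo:hi + 1][first]
    return lambda q: np.interp(q, p, y)

inverse_cdf = make_inverse_cdf(years, cdf)
probability_range = np.arange(0.0, 1.01, 0.01)
interpolated_years = inverse_cdf(probability_range)
print(interpolated_years[:5], interpolated_years[-5:])
```

With this choice the inverse curve starts at the last year with zero cumulative probability and ends at the first year where it reaches 1, rather than at whichever of the repeated points the interpolator happens to pick.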

Related

SKLearn PCA explained_variance_ratio_ cumsum gives array of 1

I have a problem with PCA. I read that PCA needs clean numeric values. I started my analysis with a dataset called trainDf with shape (1460, 79).
I did my data cleaning and processing by removing empty values, imputing and dropping columns and I got a dataframe transformedData with shape (1458, 69).
Data cleaning steps are:
LotFrontage imputing with mean value
MasVnrArea imputing with 0s (less than 10 cols)
Ordinal encoding for categorical columns
Electrical imputing with most frequent value
I found outliers with IQR and got withoutOutliers with shape (1223, 69).
After this, I looked at histograms and decided to apply PowerTransformer on some features and StandardScaler on others and I got normalizedData.
Now I tried doing PCA and I got this:
pca = PCA().fit(transformedData)
print(pca.explained_variance_ratio_.cumsum())
plt.plot(pca.explained_variance_ratio_.cumsum())
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
the output of this PCA is the following:
[0.67454179 0.8541084 0.98180307 0.99979932 0.99986346 0.9999237
0.99997091 0.99997985 0.99998547 0.99999044 0.99999463 0.99999719
0.99999791 0.99999854 0.99999909 0.99999961 0.99999977 0.99999988
0.99999994 0.99999998 0.99999999 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. ]
Then I tried:
pca = PCA().fit(withoutOutliers)
print(pca.explained_variance_ratio_.cumsum())
plt.plot(pca.explained_variance_ratio_.cumsum())
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
out:
[0.68447278 0.86982875 0.99806386 0.99983727 0.99989606 0.99994353
0.99997769 0.99998454 0.99998928 0.99999299 0.9999958 0.99999775
0.99999842 0.99999894 0.99999932 0.99999963 0.9999998 0.9999999
0.99999994 0.99999998 0.99999999 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. ]
Finally:
pca = PCA().fit(normalizedData)
print(pca.explained_variance_ratio_.cumsum())
plt.plot(pca.explained_variance_ratio_.cumsum())
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
Out:
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
How is it possible that the last execution gives such an output?
Here are data distributions
transformedData
withoutOutliers
normalizedData
I'll add any further data if necessary, thanks in advance to any who can help!
In short, all data should be scaled before applying PCA (for example using a StandardScaler).
I got the answer on Data science stackexchange.
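To illustrate why scaling matters, here is a small sketch on synthetic data (the data, the scale factors and the variable names are assumptions made for this example, not the dataset from the question). When one feature has a much larger variance than the others, the first component absorbs almost all of the explained variance unless the features are standardised first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: three independent features on very different scales
X = rng.normal(size=(500, 3)) * np.array([1000.0, 1.0, 0.01])

# PCA on the raw data: the large-scale feature dominates
raw_ratio = PCA().fit(X).explained_variance_ratio_

# PCA after standardising every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print(raw_ratio.cumsum())
print(scaled_ratio.cumsum())
```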

Question on discrete convolution with python

I am struggling to understand why the np.convolve method returns an N+M-1 set. I would appreciate your help.
Suppose I have two discrete probability distributions with values of [1,2] and [10,12] and probabilities of [.5,0.2] and [.5,0.4] respectively.
Using numpy's convolve function I get:
>>In[]: np.convolve([.5,0.2],[.5,0.4])
>>Out[]: array([0.25, 0.3 , 0.08])
However I don't understand why the resulting probability distribution only has 3 datapoints. To my understanding the sum of my input variables can have the following values: [11,12,13,14] so I would expect 4 datapoints to reflect the probabilities of each of these occurrences.
What am I missing?
I have managed to find the answer to my own question after understanding convolution a bit better. Posting it here for anyone wondering:
The convolution of the two "signals" (probability functions) in my example above is set up incorrectly: nothing in the call reflects that the events [1,2] of the first distribution and [10,12] of the second do not coincide.
Simply taking np.convolve([.5,0.2],[.5,0.4]) assumes that both probability vectors refer to the same consecutive events (e.g. [1,2] and [1,2]).
The correct approach is to align the two series on a common x axis, here x \in [1,12], as below:
>>In[]: vector1 = [.5,0.2, 0,0,0,0,0,0,0,0,0,0]
>>In[]: vector2 = [0,0,0,0,0,0,0,0,0,.5, 0,0.4]
>>In[]: np.convolve(vector1, vector2)
>>Out[]: array([0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.25, 0.1 ,
0.2 , 0.08, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. ])
which gives the correct values for 11,12,13,14
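The same alignment can be done without padding both vectors to a common axis by hand. The sketch below assumes integer-valued events; the helper name convolve_pmf is hypothetical. It places each distribution on its own dense integer grid, convolves, and then labels the result starting at the sum of the two smallest values.

```python
import numpy as np

def convolve_pmf(values_a, probs_a, values_b, probs_b):
    """Distribution of A + B for independent integer-valued A and B."""
    lo_a, lo_b = min(values_a), min(values_b)
    # Dense probability vectors covering every integer between min and max
    dense_a = np.zeros(max(values_a) - lo_a + 1)
    dense_b = np.zeros(max(values_b) - lo_b + 1)
    for v, p in zip(values_a, probs_a):
        dense_a[v - lo_a] += p
    for v, p in zip(values_b, probs_b):
        dense_b[v - lo_b] += p
    probs = np.convolve(dense_a, dense_b)
    # The first entry of the result corresponds to the sum of the two minima
    values = np.arange(lo_a + lo_b, lo_a + lo_b + len(probs))
    return values, probs

values, probs = convolve_pmf([1, 2], [0.5, 0.2], [10, 12], [0.5, 0.4])
print(values)  # [11 12 13 14]
print(probs)
```

The gap at 11 in the second distribution shows up as a zero in its dense vector, which is what the plain two-element call was missing.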

How do I change column type in Python from int to object for sklearn?

I am really new to Python and scikit-learn (sklearn) and I am trying to load a dataset which consists of 7 columns of attributes and 1 column of the data classification (class/data target). One attribute contains the values [1,2,3,4,5], which actually mark a stage of something, making it nominal, not numeric. Python of course recognizes it as numerical data (int64), when in fact I want it to be treated as nominal data (object). How do I change the column type to nominal?
I have done the following.
print(data.dtypes)
data["col_name"]=data["col_name"].astype(object)
print(data.dtypes)
In the first print, it still recognizes my data["col_name"] as int64, but after the astype line, it has changed to object. But it doesn't make any difference to the data, since when I try to use matplotlib and create a histogram, it still treats both the X and Y as numbers instead of objects.
Also I have read about the One Hot Encoding and Label Encoding on the documentation, but I figured they are not what I need in my case. I wonder if I have misunderstood something or maybe there's another solution.
Thanks
Read through the documentation for sklearn; this package is thoroughly documented. See in particular the Preprocessing section on encoding categorical features.
Regarding keeping categorical features represented as an array of integers, i.e. [1,2,3,4,5], it says this:
Such integer representation can not be used directly with scikit-learn
estimators, as these expect continuous input, and would interpret the
categories as being ordered, which is often not desired (i.e. the set
of browsers was ordered arbitrarily). One possibility to convert
categorical features to features that can be used with scikit-learn
estimators is to use a one-of-K or one-hot encoding, which is
implemented in OneHotEncoder. This estimator transforms each
categorical feature with m possible values into m binary features,
with only one active.
So what you can do is convert your array into 5 new columns (in this case, since you have 5 possible values) using one-hot encoding.
Here is some working code. The input is a column of categorical parameters [1,2,3,4,5], the output is a matrix with 5 columns, 1 for each of the 5 possible choices:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
# fit on the five possible category values, one per row
enc.fit([[1],[2],[3],[4],[5]])
# transform returns a sparse matrix; toarray() makes it dense for printing
print(enc.transform([[1],[2],[3],[4],[5]]).toarray())
Output:
[[ 1. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1.]]
Say your categorical parameters were in this order: [1,3,2,5,4,3,2,1,3,4,2]. You would get this output:
[[ 1. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 1.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 1. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 1. 0. 0. 0.]]
So this 1 column will convert into 5 columns.
Alternatively, convert the column to strings:
print(data.dtypes)
data["col_name"]=data["col_name"].astype(str)
print(data.dtypes)
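If the data lives in a pandas DataFrame, another option is the pandas "category" dtype together with pd.get_dummies. This is an alternative that the answers above do not mention, shown here only as a sketch; the column name "stage" and the sample values are placeholders.

```python
import pandas as pd

# Hypothetical column holding the stage codes 1..5
data = pd.DataFrame({"stage": [1, 3, 2, 5, 4, 3, 2, 1, 3, 4, 2]})

# Mark the column as categorical (nominal) instead of int64
data["stage"] = data["stage"].astype("category")
print(data.dtypes)

# One indicator column per category, comparable to one-hot encoding
dummies = pd.get_dummies(data["stage"], prefix="stage")
print(dummies.head())
```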

Sklearn digits dataset

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm
digits = datasets.load_digits()
print(digits.data)
classifier = svm.SVC(gamma=0.4, C=100)
x, y = digits.data[:-1], digits.target[:-1]
x = x.reshape(1,-1)
y = y.reshape(-1,1)
print((x))
classifier.fit(x, y)
###
print('Prediction:', classifier.predict(digits.data[-3]))
###
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
I have reshaped x and y as well. Still, I'm getting an error saying:
Found input variables with inconsistent numbers of samples: [1, 1796]
y is a 1-d array with 1796 elements whereas x has many. Why does it report 1 for x?
Actually scrap what I suggested below:
This link describes the general dataset API. The attribute data is a 2d array of each image, already flattened:
import sklearn.datasets
digits = sklearn.datasets.load_digits()
digits.data.shape
#: (1797, 64)
This is all you need to provide, no reshaping required. Similarly, the attribute target is a 1d array of each label:
digits.target.shape
#: (1797,)
No reshaping necessary. Just split into training and testing and run with it.
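A minimal sketch of that workflow, assuming train_test_split from sklearn.model_selection for the split and gamma=0.001 for the classifier (both are choices made for this example and differ from the question's code):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# digits.data is already (n_samples, 64) and digits.target is (n_samples,)
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# gamma=0.001 is an assumption for this sketch, not the question's value
classifier = svm.SVC(gamma=0.001, C=100)
classifier.fit(x_train, y_train)

accuracy = classifier.score(x_test, y_test)
print("Test accuracy:", accuracy)
```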
Try printing x.shape and y.shape. I feel that you're going to find something like: (1, 1796, ...) and (1796, ...) respectively. When calling fit on a classifier in scikit-learn, it expects the two arrays to have the same number of samples (the same first dimension).
The clue is that the arguments to reshape are the opposite way around for x and y:
x = x.reshape(1, -1)
y = y.reshape(-1, 1)
Maybe try:
x = x.reshape(-1, 1)
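Printing the shapes makes the mismatch concrete. The sketch below uses placeholder arrays of zeros with the same shapes as digits.data[:-1] and digits.target[:-1] (an assumption, so that it runs without loading the dataset). It shows that neither reshape of x keeps 1796 rows, while the unmodified arrays already agree on the number of samples.

```python
import numpy as np

# Placeholders with the shapes of digits.data[:-1] and digits.target[:-1]
x = np.zeros((1796, 64))
y = np.zeros(1796, dtype=int)

shape_row = x.reshape(1, -1).shape     # one sample with 1796 * 64 features
shape_column = x.reshape(-1, 1).shape  # 1796 * 64 samples with one feature
shape_y = y.reshape(-1, 1).shape       # 1796 samples in a column

print(shape_row, shape_column, shape_y)
print(x.shape[0] == y.shape[0])  # the original arrays already match
```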
Completely unrelated to your question, but you're predicting on digits.data[-3] when the only element left out of the training set is digits.data[-1]. Not sure if that was intentional.
Regardless, it could be good to check your classifier over more results using the scikit metrics package. This page has an example of using it over the digits dataset.
The reshaping will transform your 8x8 matrix to a 1-dimensional vector, which can be used as a feature. You need to reshape the entire X array, not only the training data, since the samples you will use for prediction need to have the same format.
The following code shows how:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm
digits = datasets.load_digits()
classifier = svm.SVC(gamma=0.4, C=100)
x, y = digits.images, digits.target
# only reshape x, since each image is an 8x8 matrix and needs to be flattened
n_samples = len(digits.images)
x = x.reshape((n_samples, -1))
print("before reshape:" + str(digits.images[0]))
print("After reshape" + str(x[0]))
# train on everything except the last two images
classifier.fit(x[:-2], y[:-2])
###
# predict expects a 2-d array, so pass the single sample as one row
print('Prediction:', classifier.predict(x[-2].reshape(1, -1)))
###
plt.imshow(digits.images[-2], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
###
print('Prediction:', classifier.predict(x[-1].reshape(1, -1)))
###
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
It will output:
before reshape:[[ 0. 0. 5. 13. 9. 1. 0. 0.]
[ 0. 0. 13. 15. 10. 15. 5. 0.]
[ 0. 3. 15. 2. 0. 11. 8. 0.]
[ 0. 4. 12. 0. 0. 8. 8. 0.]
[ 0. 5. 8. 0. 0. 9. 8. 0.]
[ 0. 4. 11. 0. 1. 12. 7. 0.]
[ 0. 2. 14. 5. 10. 12. 0. 0.]
[ 0. 0. 6. 13. 10. 0. 0. 0.]]
After reshape[ 0. 0. 5. 13. 9. 1. 0. 0. 0. 0. 13. 15. 10. 15. 5.
0. 0. 3. 15. 2. 0. 11. 8. 0. 0. 4. 12. 0. 0. 8.
8. 0. 0. 5. 8. 0. 0. 9. 8. 0. 0. 4. 11. 0. 1.
12. 7. 0. 0. 2. 14. 5. 10. 12. 0. 0. 0. 0. 6. 13.
10. 0. 0. 0.]
And a correct prediction for the last 2 images, which weren't used for training - you can decide however to make a bigger split between testing and training set.

How to use OneHotEncoder output in ordinary least squares regression plot

I have been trying to perform Ordinary Least Squares regression using the scikit-learn library but have hit another rock.
I have used OneHotEncoder to binarize my (independent) dummy/categorical features and I have an array like so:
x = [[ 1. 0. 0. ..., 0. 0. 0.]
[ 1. 0. 0. ..., 0. 0. 0.]
[ 0. 1. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 1. ..., 0. 0. 0.]
[ 1. 0. 0. ..., 0. 0. 0.]]
The dependent variables (Y) are stored in a one dimensional array. Everything is wonderful, except now when I come to plot these values I get an error:
# Plot outputs
pl.scatter(x_test, y_test, color='black')
ValueError: x and y must be the same size
When I use numpy.size on X and Y respectively it is clear that's a reasonable error:
>>> print(np.size(x))
5096
>>> print(np.size(y))
98
Interestingly, the two sets of data are accepted by the fit method.
My question is how can I transform the output of OneHotEncoder to use in my regression?
If I understand you correctly, you have an input matrix X of dimension [m x n] and an output Y of dimension [n x 1], where m = number of features and n = number of data points.
Firstly, the linear regression fitting function will not care that X is of dimension [m x n] and Y of [n x 1] as it will simply use a parameter of dimension [1 x m], i.e.,
Y = theta * X
Unfortunately, as noted by eickenberg, you cannot plot all of the X features against the Y values with a single call to matplotlib's scatter as you have done. That is why you get the error message about incompatible sizes: it wants to plot n x n, not (n x m) x n.
To fix your problem, try looking at a single feature at a time:
pl.scatter(x_test[:,0], y_test, color='black')
Assuming you have standardised your data (subtracted the mean and divided by the standard deviation), a quick and dirty way to see the trends would be to plot all of them on a single set of axes:
fig = plt.figure(0)
ax = fig.add_subplot(111)
n, m = x_test.shape  # n data points, m features
for i in range(m):
    ax.scatter(x_test[:, i], y_test)  # one scatter per feature column
plt.show()
To visualise all at once on independent figures (depending on the number of features) then look at, e.g., subplot2grid routines or another python module like pandas.
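The size mismatch from the question can be reproduced with placeholder arrays. The sketch assumes 98 samples and 52 one-hot columns, which is one shape consistent with the sizes printed in the question (5096 = 98 x 52). np.size counts every element, whereas fit only needs the number of rows to match, and a single column has the same size as y.

```python
import numpy as np

# Placeholder arrays; the (98, 52) shape is an assumption
x = np.zeros((98, 52))
y = np.zeros(98)

total_x = np.size(x)          # counts all elements: 98 * 52
total_y = np.size(y)          # 98
single_column = x[:, 0]       # one feature across all samples

print(total_x, total_y)
print(x.shape[0] == y.shape[0])        # what fit checks
print(single_column.size == y.size)    # what scatter needs
```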
