Cumulative Explained Variance for PCA in Python

I have a simple R script that runs FactoMineR's PCA on a tiny dataframe to find the cumulative percentage of variance explained by each component:
library(FactoMineR)
a <- c(1, 2, 3, 4, 5)
b <- c(4, 2, 9, 23, 3)
c <- c(9, 8, 7, 6, 6)
d <- c(45, 36, 74, 35, 29)
df <- data.frame(a, b, c, d)
df_pca <- PCA(df, ncp = 4, graph=F)
print(df_pca$eig$`cumulative percentage of variance`)
Which returns:
> print(df_pca$eig$`cumulative percentage of variance`)
[1] 58.55305 84.44577 99.86661 100.00000
I'm trying to do the same in Python using scikit-learn's decomposition package as follows:
import pandas as pd
from sklearn import decomposition
a = [1, 2, 3, 4, 5]
b = [4, 2, 9, 23, 3]
c = [9, 8, 7, 6, 6]
d = [45, 36, 74, 35, 29]
df = pd.DataFrame({'a': a,
                   'b': b,
                   'c': c,
                   'd': d})
pca = decomposition.PCA(n_components=4)
pca.fit(df)
transformed_pca = pca.transform(df)
# accumulate the explained variance ratio of each component
cum_explained_var = []
for i in range(len(pca.explained_variance_ratio_)):
    if i == 0:
        cum_explained_var.append(pca.explained_variance_ratio_[i])
    else:
        cum_explained_var.append(pca.explained_variance_ratio_[i] +
                                 cum_explained_var[i - 1])
print(cum_explained_var)
But this results in:
[0.79987089715487936, 0.99224337624509307, 0.99997254568237226, 1.0]
As you can see, both correctly add up to 100%, but the contribution of each component differs between the R and Python versions. Does anyone know where these differences come from, or how to correctly replicate the R results in Python?
EDIT: Thanks to Vlo, I now know that the differences stem from the FactoMineR PCA function scaling the data by default. By using the sklearn preprocessing package (pca_data = preprocessing.scale(df)) to scale my data before running PCA, my results match the R output.

Thanks to Vlo, I learned that the difference between the FactoMineR PCA function and the sklearn PCA function is that FactoMineR scales the data by default. By simply adding a scaling step to my Python code, I was able to reproduce the results.
import pandas as pd
from sklearn import decomposition, preprocessing
a = [1, 2, 3, 4, 5]
b = [4, 2, 9, 23, 3]
c = [9, 8, 7, 6, 6]
d = [45, 36, 74, 35, 29]
df = pd.DataFrame({'a': a,
                   'b': b,
                   'c': c,
                   'd': d})
# standardize the columns so the result matches FactoMineR's default scaling
pca_data = preprocessing.scale(df)
pca = decomposition.PCA(n_components=4)
pca.fit(pca_data)
transformed_pca = pca.transform(pca_data)
cum_explained_var = []
for i in range(len(pca.explained_variance_ratio_)):
    if i == 0:
        cum_explained_var.append(pca.explained_variance_ratio_[i])
    else:
        cum_explained_var.append(pca.explained_variance_ratio_[i] +
                                 cum_explained_var[i - 1])
print(cum_explained_var)
Output:
[0.58553054049052267, 0.8444577483783724, 0.9986661265687754, 0.99999999999999978]
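The running total can also be computed without the manual loop; a minimal equivalent, assuming the fitted pca object from the snippet above:
import numpy as np
print(np.cumsum(pca.explained_variance_ratio_))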

Related

Detect and fix outliers in a pandas series

I have a pandas series with some outlier values. Here's some mock data:
df = pd.DataFrame({'col1': [1200, 400, 50, 75, 8, 9, 8, 7, 6, 5, 4, 6, 6, 8, 3, 6, 6, 7, 6]})
I'd like to substitute outliers, i.e. values that are >= 3 standard deviations from the mean, with the mean value.
Let's do:
thrs = df['col1'].mean() + 3 * df['col1'].std()
df.loc[df['col1'] >= thrs, 'col1'] = df['col1'].mean()
Alternatively, with numpy:
import numpy as np
std_dev = df["col1"].std()
mean = df["col1"].mean()
df["col1"] = np.where(df.col1 >= (mean + 3*std_dev), mean, df.col1)

Hierarchical clustering of different length time series data using scipy and DTW

I have a set of time series data having different lengths and I am trying to cluster them using Dynamic Time Warping (DTW).
For completeness, this is the simple DTW implementation I am using:
from math import sqrt

def DTWDistance(s1, s2):
    DTW = {}
    for i in range(len(s1)):
        DTW[(i, -1)] = float('inf')
    for i in range(len(s2)):
        DTW[(-1, i)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i-1, j)], DTW[(i, j-1)], DTW[(i-1, j-1)])
    return sqrt(DTW[len(s1)-1, len(s2)-1])
and I have my time series data of the form
timeseries = [[0, 1, 2, 3, 4, 5, 6, 7, 8],
              [0, 0, 1, 2, 3, 4],
              [6, 7, 8, 9, 10, 11, 12, 13, 14],
              [15, 14, 13, 14, 15, 16, 17, 18, 19]]
When I try
import scipy.cluster.hierarchy as hac
Z = hac.linkage(timeseries, method='complete', metric=DTWDistance)
I get ValueError: setting an array element with a sequence., which is understandable because scipy.cluster.hierarchy.linkage documentation says
...a collection of m observation vectors in n dimensions may be passed
as an m by n array. All elements of the condensed distance matrix must
be finite, i.e. no NaNs or infs.
And clearly my input doesn't fulfil this specification. What would be the correct approach to cluster time series of different lengths?
EDIT 1
A simple workaround would be to pad the shorter time series with zeros so that I obtain the required m x n matrix, but I am not sure whether this would alter the semantics of the time series.
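One alternative is to precompute the pairwise DTW distances and hand linkage a condensed distance matrix, which sidesteps the equal-length requirement entirely. A minimal sketch, assuming the DTWDistance function and timeseries list defined above:
import numpy as np
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import squareform

# pairwise DTW distances between all series, regardless of their lengths
n = len(timeseries)
dists = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dists[i, j] = dists[j, i] = DTWDistance(timeseries[i], timeseries[j])

# linkage accepts a condensed (1-D) distance matrix directly, no metric needed
Z = hac.linkage(squareform(dists), method='complete')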

Multiprocessing Pooling Fails at Dask Functions

I am trying to take two arrays, "day 1" ranging from 0 to 11 (incremented by +1) and "day 2" ranging from 11 to 0 (incremented by -1), and sum them elementwise. However, I wish to use multiprocessing and dask arrays to speed up the process (I will be moving to bigger numbers later). I want to split day 1 and day 2 into four equal parts (day 1: [0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11] and day 2: [11, 10, 9], [8, 7, 6], [5, 4, 3], [2, 1, 0]) and have four processes each add one pair of chunks (e.g., day 1's [0, 1, 2] with day 2's [11, 10, 9] to get [11, 11, 11]). After all four processes are done, I hope to combine everything back into one big list of [11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11]. However, at the marked line inside the function below, the code fails to finish and appears to be stuck in an infinite loop or some endless computation.
Code:
import numpy as np
import dask.array as da
import multiprocessing as mp

NUM_WORKERS = 4
# create list from 0 to 11
day1 = list(range(12))
# create list from 11 to 0
day2 = day1[::-1]
# chunk length per worker (assumed from the 4-way split described above)
length = len(day1) // NUM_WORKERS

def get_sum(i, base):
    z = []
    x = day1[i * length: i * length + length]
    y = day2[i * length: i * length + length]
    z.append(x)
    z.append(y)
    converted = da.from_array(z, chunks=NUM_WORKERS)
    summed = da.sum(converted, axis=0).compute()   # <-- this is the step that never finishes
    list_concatenate = np.concatenate((base, summed), axis=0)
    all_sum = sum(list_concatenate)

process_list = []
for i in range(NUM_WORKERS):
    process_list = mp.Process(target=get_sum, args=(i, process_list))
    process_list.start()
    process_list.join()
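For comparison, dask can do the chunked elementwise sum on its own, without spawning processes manually. A minimal sketch of my own, under the assumption that the end goal is simply the summed array:
import numpy as np
import dask.array as da

day1 = da.from_array(np.arange(12), chunks=3)          # [0, 1, ..., 11] in 4 chunks
day2 = da.from_array(np.arange(11, -1, -1), chunks=3)  # [11, 10, ..., 0] in 4 chunks
total = (day1 + day2).compute()                        # array of twelve 11s
print(total)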

Python with numpy: How to delete an element from each row of a 2-D array according to a specific index

Say I have a 2-D numpy array A of size 20 x 10.
I also have an array of length 20, del_ind.
I want to delete an element from each row of A according to del_ind, to get a resultant array of size 20 x 9.
How can I do this?
I looked into np.delete with a specified axis = 1, but this only deletes the element at the same position in every row.
Thanks for the help
You will probably have to build a new array.
Fortunately you can avoid python loops for this task, using fancy indexing:
import numpy as np

h, w = 20, 10
A = np.arange(h*w).reshape(h, w)
del_ind = np.random.randint(0, w, size=h)
mask = np.ones((h, w), dtype=bool)
mask[range(h), del_ind] = False
A_ = A[mask].reshape(h, w-1)
Demo with a smaller dataset:
>>> h, w = 5, 4
>>> %paste
A = np.arange(h*w).reshape(h, w)
del_ind = np.random.randint(0, w, size=h)
mask = np.ones((h,w), dtype=bool)
mask[range(h), del_ind] = False
A_ = A[mask].reshape(h, w-1)
## -- End pasted text --
>>> A
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
>>> del_ind
array([2, 2, 1, 1, 0])
>>> A_
array([[ 0,  1,  3],
       [ 4,  5,  7],
       [ 8, 10, 11],
       [12, 14, 15],
       [17, 18, 19]])
NumPy isn't designed for in-place edits that change an array's shape; it's mainly intended for statically sized arrays. For that reason, I'd recommend doing this by copying the intended elements to a new array.
Assuming that it's sufficient to delete one element from every row:
import numpy as np

def remove_indices(arr, indices):
    result = np.empty((arr.shape[0], arr.shape[1] - 1))
    for i, (delete_index, row) in enumerate(zip(indices, arr)):
        result[i] = np.delete(row, delete_index)
    return result
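A quick usage check on the same small demo data (my example, not from the original answer):
A = np.arange(20).reshape(5, 4)
del_ind = np.array([2, 2, 1, 1, 0])
print(remove_indices(A, del_ind))   # same rows as A_ in the demo above, just as floats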

calculate histogram peaks in python

In Python, how do I calculate the peaks of a histogram?
I tried this:
import numpy as np
from scipy.signal import argrelextrema

data = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 1, 2, 3, 4,
        5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9,
        12,
        15, 16, 17, 18, 19, 15, 16, 17, 18,
        19, 20, 21, 22, 23, 24]
h = np.histogram(data, bins=[0, 5, 10, 15, 20, 25])
hData = h[0]
peaks = argrelextrema(hData, np.greater)
But the result was:
(array([3]),)
I'd expect it to find the peaks in bin 0 and bin 3.
Note that a peak can span more than one bin; I don't want bins that belong to the same broad peak to be counted as additional peaks.
I'm open to another way to get the peaks.
Note:
>>> h[0]
array([19, 15, 1, 10, 5])
>>>
In computational topology, the formalism of persistent homology provides a definition of "peak" that seems to address your need; in the 1-dimensional case the peaks correspond to the blue bars in the figure shown in the article linked below.
A description of the algorithm is given in this Stack Overflow answer to a peak-detection question.
The nice thing is that this method not only identifies the peaks but also quantifies their "significance" in a natural way.
A simple and efficient implementation (as fast as sorting the numbers), together with the source material for the above answer, is given in this blog article:
https://www.sthu.org/blog/13-perstopology-peakdetection/index.html
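To get a quick approximation of that idea with standard tooling, here is a small sketch of my own (not from the answer above) using scipy.signal.find_peaks, whose prominence value plays a role similar to persistence-based significance; the counts are zero-padded so that edge bins can qualify as peaks:
import numpy as np
from scipy.signal import find_peaks

hData = np.array([19, 15, 1, 10, 5])
padded = np.concatenate(([0], hData, [0]))   # allow peaks at the histogram edges
peaks, props = find_peaks(padded, prominence=1)
print(peaks - 1)                 # [0 3] -> bins 0 and 3 of the original histogram
print(props['prominences'])      # larger prominence = more "significant" peak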
Try the findpeaks library.
pip install findpeaks
# Your input data:
data = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 12, 15, 16, 17, 18, 19, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,]
# import library
from findpeaks import findpeaks
# Find some peaks using the smoothing parameter.
fp = findpeaks(lookahead=1, interpolate=10)
# fit
results = fp.fit(data)
# Make plot
fp.plot()
# Results with respect to original input data.
results['df']
# Results based on interpolated smoothed data.
results['df_interp']
I wrote a simple function:
import numpy as np

def find_peaks(a):
    x = np.array(a)
    max_value = np.max(x)
    length = len(a)
    ret = []
    for i in range(length):
        ispeak = True
        if i - 1 >= 0:
            ispeak &= (x[i] > 1.8 * x[i - 1])
        if i + 1 < length:
            ispeak &= (x[i] > 1.8 * x[i + 1])
        ispeak &= (x[i] > 0.05 * max_value)
        if ispeak:
            ret.append(i)
    return ret
I defined a peak as a value bigger than 180% of its neighbors and bigger than 5% of the maximum value. Of course you can tune these thresholds to find the best setup for your problem.
