Selecting features in python

I am trying to implement the algorithm described in this paper: http://venom.cs.utsa.edu/dmz/techrep/2007/CS-TR-2007-011.pdf
import pandas as pd
import pathlib
import gaitrec
from tsfresh import extract_features
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances


class PFA(object):
    def __init__(self, n_features, q=None):
        self.q = q
        self.n_features = n_features

    def fit(self, X):
        if not self.q:
            self.q = X.shape[1]
        pca = PCA(n_components=self.q).fit(X)
        A_q = pca.components_.T
        kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_
        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances(A_q[i, :].reshape(1,-1), cluster_centers[c, :].reshape(1,-1))[0][0]
            dists[c].append((i, dist))
        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]

p = pathlib.Path(gaitrec.__file__).parent
dataset_file = p / 'DatasetC' / 'subj_001' / 'walk0' / 'subj_0010.csv'
read_csv = pd.read_csv(dataset_file, sep=';', decimal='.', names=['time','x','y', 'z', 'id'])
read_csv['id'] = 0
if __name__ == '__main__':
    print(read_csv)
    extracted_features = extract_features(read_csv, column_id="id", column_sort="time")
    features_withno_nanvalues = extracted_features.dropna(how='all', axis=1)
    print(features_withno_nanvalues)
    X = features_withno_nanvalues.to_numpy()
    pfa = PFA(n_features=2274, q=1)
    pfa.fit(X)
    Y = pfa.features_
    print(Y) #feature extracted
    column_indices = pfa.indices_ #index of the features
    print(column_indices)
C:\Users\Thund\AppData\Local\Programs\Python\Python37\python.exe C:/Users/Thund/Desktop/RepoBitbucket/Gaitrec/gaitrec/extraction.py
time x y z id
0 0 -0.833333 0.416667 -0.041667 0
1 1 -0.833333 0.416667 -0.041667 0
2 2 -0.833333 0.416667 -0.041667 0
3 3 -0.833333 0.416667 -0.041667 0
4 4 -0.833333 0.416667 -0.041667 0
... ... ... ... ... ..
1337 1337 -0.833333 0.416667 0.083333 0
1338 1338 -0.833333 0.416667 0.083333 0
1339 1339 -0.916667 0.416667 0.083333 0
1340 1340 -0.958333 0.416667 0.083333 0
1341 1341 -0.958333 0.416667 0.083333 0
[1342 rows x 5 columns]
Feature Extraction: 100%|██████████| 3/3 [00:04<00:00, 1.46s/it]
C:\Users\Thund\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\decomposition\_pca.py:461: RuntimeWarning: invalid value encountered in true_divide
explained_variance_ = (S ** 2) / (n_samples - 1)
variable x__abs_energy ... z__variation_coefficient
id ...
0 1430.496338 ... 5.521904
[1 rows x 2274 columns]
C:/Users/Thund/Desktop/RepoBitbucket/Gaitrec/gaitrec/extraction.py:21: ConvergenceWarning: Number of distinct clusters (2) found smaller than n_clusters (2274). Possibly due to duplicate points in X.
kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
[[1430.49633789 66.95824 ]]
[0, 1]
Process finished with exit code 0
I don't understand the warnings, or why only the first 2 of the 2k+ features are selected. This is what I did:
Produce the covariance matrix from the original data.
Compute the eigenvectors and eigenvalues of the covariance matrix using the SVD method.
Those two steps combined are what you call PCA.
The principal components are the eigenvectors of the covariance matrix of the original data; I then apply the K-means algorithm to them.
My questions are:
How can I fix the warnings?
Only 2 features are selected out of 2k+, so is something wrong?

As mentioned in the comments, the features after the fit are coming from the indices of the A_q matrix, which has a reduced number of features from PCA. You're getting two features instead of q features (1 in this case) because of the reshape. self.features_ should probably come from A_q instead of X.
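Something else worth checking, going by the log in the question: the extracted feature table is reported as [1 rows x 2274 columns], so PCA is fitted on a single sample, and the RuntimeWarning points at the line that divides by (n_samples - 1). The sketch below is only an illustration in plain Python, not scikit-learn's actual code. It takes three values from the printed output and shows that centering a one-row matrix leaves nothing but zeros and that the denominator is 0.

```python
# Values copied from the printed output in the question; treating them as a
# one-row feature matrix is the assumption being illustrated here.
rows = [[1430.496338, 66.95824, 5.521904]]

n_samples = len(rows)
# The column means of a one-row matrix are just the row itself
means = [sum(column) / n_samples for column in zip(*rows)]
# Centering, the first step of PCA, therefore leaves only zeros
centered = [[value - mean for value, mean in zip(row, means)] for row in rows]
# PCA divides the squared singular values by this number
denominator = n_samples - 1

print(centered)      # every entry is 0.0
print(denominator)   # 0, so the variance estimate becomes 0 / 0
```

If that is what happens here, the components carry no information about the features, which would also fit the ConvergenceWarning about only 2 distinct points. Extracting features for several ids (several rows) before calling fit would be the thing to try.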

I think the problem in your code is in the following statement:
pfa = PFA(n_features=2274, q=1)
I haven't read the paper, but you should look at how PCA behaves on your data. If the authors set q to 1, you should be able to see why q is 1.
For instance:
from matplotlib.pyplot import plot
from matplotlib.pyplot import xlabel
from matplotlib.pyplot import ylabel
from matplotlib.pyplot import figure
pca_obj = PCA().fit(X=X)
figure(1, figsize=(6,3), dpi=300)
plot(pca_obj.explained_variance_, linewidth=2)
xlabel('Components')
ylabel('Explained Variances')
Note: If you are not running this in a Jupyter notebook, call show() at the end so that the plot is displayed:
from matplotlib.pyplot import plot
from matplotlib.pyplot import xlabel
from matplotlib.pyplot import ylabel
from matplotlib.pyplot import figure
from matplotlib.pyplot import show
pca_obj = PCA().fit(X=X)
figure(1, figsize=(6,3), dpi=300)
plot(pca_obj.explained_variance_, linewidth=2)
xlabel('Components')
ylabel('Explained Variances')
show()
For my dataset, the result is:
Now I can say: "My q is 100, since PCA performs better starting from 100 components."
Can you say the same? How do you know q is 1?
Now find the q that performs best on your data and see if it solves your problem.
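If reading q off the plot feels too subjective, one common alternative is to take the smallest number of components whose cumulative share of the variance passes a threshold. The sketch below is an assumption on my side, not something from the paper: choose_q is a hypothetical helper, the 0.85 threshold is arbitrary, and the variances are made-up numbers standing in for pca_obj.explained_variance_.

```python
def choose_q(explained_variance, threshold=0.95):
    """Return the smallest number of leading components whose cumulative
    share of the total variance reaches `threshold`."""
    total = sum(explained_variance)
    if total <= 0:
        raise ValueError("explained variance must have a positive total")
    cumulative = 0.0
    for q, variance in enumerate(explained_variance, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return q
    return len(explained_variance)


# Made-up variances in decreasing order (hypothetical, for illustration only)
example_variances = [5.0, 3.0, 1.0, 0.5, 0.5]
q = choose_q(example_variances, threshold=0.85)
print(q)
```

With real data you would pass list(pca_obj.explained_variance_) instead of the made-up list.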

Related

Extract More Than Two Dimensions via Python: sklearn.cross_decomposition import CCA & transform

I am very interested in using Python to extract 3-4 Dimensions via Canonical Correlation Analyses. I am pasting my very basic code below, and it appears to always default to only extracting two Dimensions even though each of my input arrays are 10,000+ X 3. Even if I have 4 columns for my X & Y matrices it always gives just two Dimensions - was hoping for three and eventually four as I add many more raw Features to my X and Y arrays. Trying to keep simple for now. Could part of my problem also be that some of my Field Names have spaces in them too?
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
data = r"G:\Shared drives\Data Intelligence\ZF\Segmentation/Data.csv"
df = pd.read_csv(data)
df.head()
print(df.columns)
X = df[['Altcurr Ext Stk Sales Cc',\
'Altcurr Ext Dss Sales Cc',\
'LBM Sales']]
X.head()
X_mc = (X-X.mean())/(X.std())
X_mc.head()
Y = df[['Primary_Supplier_0_org1',\
'Primary_Supplier_1_org2',\
'Primary_Supplier_2_TV']]
Y.head()
Y_mc = (Y-Y.mean())/(Y.std())
Y_mc.head()
from sklearn.cross_decomposition import CCA
ca = CCA()
ca.fit(X_mc, Y_mc)
X_c, Y_c = ca.transform(X_mc, Y_mc)
By default, CCA() sets n_components=2; you can check this in the documentation:
Parameters:
n_components int, default=2
Number of components to keep. Should be in [1, min(n_samples, n_features, n_targets)].
For your dataset, X and Y both have 3 columns, so you can go up to n_components=3. Using an example dataset:
from sklearn.datasets import make_blobs
from sklearn.cross_decomposition import CCA
X, _ = make_blobs(n_samples=10000, centers=3, n_features=6,random_state=0)
y = X[:,3:]
X = X[:,:3]
ca = CCA(n_components = 3)
ca.fit(X, y)
X_c, Y_c = ca.transform(X, y)
print(X_c.shape)
(10000, 3)
print(Y_c.shape)
(10000, 3)

How is the X axis of a LinearRegression formatted and processed?

I am trying to build a regression line based on the date and closing price of a stock.
I know the regression line cannot be calculated on dates directly, so I transform the dates into numerical values.
I have been able to format the data as required.
Here is my sample code:
import datetime as dt
import csv
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
source = 'C:\\path'
#gets file
df = pd.read_csv(source+'\\ABBN.SW.csv')
#change string to datetime
df['Date'] = pd.to_datetime(df['Date'])
#change datetime to numerical value
df['Date'] = df['Date'].map(dt.datetime.toordinal)
#build X and Y axis
x = np.array(df['Date']).reshape(-1, 1)
y = np.array(df['Close'])
model = LinearRegression()
model.fit(x,y)
print(model.intercept_)
print(model.coef_)
print(x)
[[734623]
[734625]
[734626]
...
[738272]
[738273]
[738274]]
print(y)
[16.54000092 16.61000061 16.5 28.82999992 28.88999939 ... 29.60000038]
intercept : -1824.9528261991056 #complete off the charts, it should be around 18-20
coef : [0.00250826]
The question here is: what am I missing on the X axis (date) to produce a correct intercept?
The coef looks right, though.
See the example in Excel (old data)
References used :
https://realpython.com/linear-regression-in-python/
https://medium.com/python-data-analysis/linear-regression-on-time-series-data-like-stock-price-514a42d5ac8a
https://www.alpharithms.com/predicting-stock-prices-with-linear-regression-214618/
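I would suggest applying min-max normalisation to your ordinal dates. The intercept is the fitted value at x = 0, and ordinal 0 lies about two thousand years before your data, so a large intercept is expected with raw ordinals. With your numbers, -1824.95 + 0.00250826 * 734623 is about 17.7, which is the level you expected. After normalisation the first date sits at x = 0, so you get the desired "small" intercept out of the linear regression.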
import datetime as dt
import csv
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
df = pd.read_csv("data.csv")
df['Date'] = pd.to_datetime(df["Date"])
df['Date_ordinal'] = df["Date"].map(dt.datetime.toordinal)
df["Date_normalized"] = df["Date"].apply(lambda x: len(df["Date"]) * (x - df["Date"].min()) / (df["Date"].max() - df["Date"].min()))
print(df)
def apply_linear(df,label_dates):
    x = np.array(df[label_dates]).reshape(-1, 1)
    y = np.array(df['Close'])
    model = LinearRegression()
    model.fit(x,y)
    print("intercep = ",model.intercept_)
    print("coef = ",model.coef_[0])
print("Without normalization")
apply_linear(df,"Date_ordinal")
print("With normalization")
apply_linear(df,"Date_normalized")
Here are the results of my run, using an invented data set that is representative for your purpose:
PS C:\Users\ruben\PycharmProjects\stackOverFlowQnA> python .\main.py
Date Close Date_ordinal Date_normalized
0 2022-04-01 111 738246 0.000000
1 2022-04-02 112 738247 0.818182
2 2022-04-03 120 738248 1.636364
3 2022-04-04 115 738249 2.454545
4 2022-04-05 105 738250 3.272727
5 2022-04-09 95 738254 6.545455
6 2022-04-10 100 738255 7.363636
7 2022-04-11 105 738256 8.181818
8 2022-04-12 112 738257 9.000000
Without normalization
intercep = 743632.8904761908
coef = -1.0071428571428576
With normalization
intercep = 113.70476190476191
coef = -1.2309523809523817
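To see where the two intercepts come from, here is a small sketch in plain Python that fits the same invented data twice: once on the raw ordinals and once on days counted from the first date. Note that this is a simpler shift than the scaled normalisation in the code above, and fit_line is a hypothetical helper standing in for LinearRegression, so take it as an illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Ordinal dates and closing prices from the invented data set above
ordinals = [738246, 738247, 738248, 738249, 738250, 738254, 738255, 738256, 738257]
closes = [111, 112, 120, 115, 105, 95, 100, 105, 112]

intercept_raw, slope_raw = fit_line(ordinals, closes)

# Count days from the first date instead of from ordinal 0
first_day = min(ordinals)
days_since_start = [x - first_day for x in ordinals]
intercept_shifted, slope_shifted = fit_line(days_since_start, closes)

print(intercept_raw, slope_raw)          # huge intercept: fitted value at ordinal 0
print(intercept_shifted, slope_shifted)  # same slope, intercept near the first prices
```

The slope is the same in both fits; only the point where x = 0 moves, and with it the intercept. The scaled version in the answer additionally stretches the x axis, which is why its coef differs.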

Run Different Scikit-learn Clustering Algorithms on Dataset

I have a dataframe like below. The shape is (24,7)
Name x1 x2 x3 x4 x5 x6
Harry 102 204 0.43 0.21 1.02 0.39
James 242 500 0.31 0.11 0.03 0.73
.
.
.
Mike 3555 4002 0.12 0.03 0.52 0.11
Henry 532 643 0.01 0.02 0.33 0.10
I want to run scikit-learn's clustering algorithm comparison script on the above dataframe. However, the input data in the example looks quite confusing, and I am not sure how to pass in my own dataframe:
https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py
There are two main differences between your scenario and the scikit-learn example you link to:
You only have one dataset, not several different ones to compare.
You have six features, not just two.
Point one allows you to simplify the example code by deleting the loops over the different datasets and related calculations. Point two implies that you cannot easily plot your results. Instead, you could just add the predicted class labels found by each algorithm to your dataset.
So you could modify the example code like this:
import time
import warnings
import numpy as np
import pandas as pd
from sklearn import cluster, datasets, mixture
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
from itertools import cycle, islice
np.random.seed(0)
# ============
# Introduce your dataset
# ============
my_df = ...  # Insert your data here, as a pandas dataframe.
features = [f'x{i}' for i in range(1, 7)]
X = my_df[features].values
# ============
# Set up cluster parameters
# ============
params = {
    "quantile": 0.3,
    "eps": 0.3,
    "damping": 0.9,
    "preference": -200,
    "n_neighbors": 3,
    "n_clusters": 3,
    "min_samples": 7,
    "xi": 0.05,
    "min_cluster_size": 0.1,
}
# normalize dataset for easier parameter selection
X = StandardScaler().fit_transform(X)
# estimate bandwidth for mean shift
bandwidth = max(cluster.estimate_bandwidth(X, quantile=params["quantile"]),
                0.001)  # arbitrary correction to avoid 0
# connectivity matrix for structured Ward
connectivity = kneighbors_graph(
    X, n_neighbors=params["n_neighbors"], include_self=False
)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
# ============
# Create cluster objects
# ============
ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
two_means = cluster.MiniBatchKMeans(n_clusters=params["n_clusters"])
ward = cluster.AgglomerativeClustering(
    n_clusters=params["n_clusters"], linkage="ward", connectivity=connectivity
)
spectral = cluster.SpectralClustering(
    n_clusters=params["n_clusters"],
    eigen_solver="arpack",
    affinity="nearest_neighbors",
)
dbscan = cluster.DBSCAN(eps=params["eps"])
optics = cluster.OPTICS(
    min_samples=params["min_samples"],
    xi=params["xi"],
    min_cluster_size=params["min_cluster_size"],
)
affinity_propagation = cluster.AffinityPropagation(
    damping=params["damping"], preference=params["preference"], random_state=0
)
average_linkage = cluster.AgglomerativeClustering(
    linkage="average",
    affinity="cityblock",  # newer scikit-learn versions name this parameter "metric"
    n_clusters=params["n_clusters"],
    connectivity=connectivity,
)
birch = cluster.Birch(n_clusters=params["n_clusters"])
gmm = mixture.GaussianMixture(
    n_components=params["n_clusters"], covariance_type="full"
)
clustering_algorithms = (
    ("MiniBatch\nKMeans", two_means),
    ("Affinity\nPropagation", affinity_propagation),
    ("MeanShift", ms),
    ("Spectral\nClustering", spectral),
    ("Ward", ward),
    ("Agglomerative\nClustering", average_linkage),
    ("DBSCAN", dbscan),
    ("OPTICS", optics),
    ("BIRCH", birch),
    ("Gaussian\nMixture", gmm),
)
for name, algorithm in clustering_algorithms:
    t0 = time.time()
    # catch warnings related to kneighbors_graph
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="the number of connected components of the "
            + "connectivity matrix is [0-9]{1,2}"
            + " > 1. Completing it to avoid stopping the tree early.",
            category=UserWarning,
        )
        warnings.filterwarnings(
            "ignore",
            message="Graph is not fully connected, spectral embedding"
            + " may not work as expected.",
            category=UserWarning,
        )
        algorithm.fit(X)
    t1 = time.time()
    if hasattr(algorithm, "labels_"):
        y_pred = algorithm.labels_.astype(int)
    else:
        y_pred = algorithm.predict(X)
    # Add cluster labels to the dataset
    my_df[name] = y_pred
PS: please replace data = X_data.iloc[:20000] with your own X.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn import cluster, decomposition, metrics, mixture, preprocessing
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler
comp_model = pd.DataFrame(columns=['Model', 'Score_Silhouette',
                                   'num_clusters', 'size_clusters',
                                   'parameters'])
K-Means :
def k_means(X_data, nb_clusters, model_comp):
    ks = nb_clusters
    inertias = []
    data = X_data.iloc[:20000]
    X = data.values
    X_scaled = preprocessing.StandardScaler().fit_transform(X)
    for num_clusters in ks:
        # Create a KMeans instance with k clusters: model
        model = KMeans(n_clusters=num_clusters, n_init=1)
        # Fit model to samples
        model.fit(X_scaled)
        # Append the inertia to the list of inertias
        inertias.append(model.inertia_)
        silh = metrics.silhouette_score(X_scaled, model.labels_)
        # Counting the amount of data in each cluster
        taille_clusters = Counter(model.labels_)
        data = [{'Model': 'kMeans',
                 'Score_Silhouette': silh,
                 'num_clusters': num_clusters,
                 'size_clusters': taille_clusters,
                 'parameters': 'nb_clusters :'+str(num_clusters)}]
        # DataFrame.append was removed in pandas 2.0, so concatenate instead
        model_comp = pd.concat([model_comp, pd.DataFrame(data)],
                               ignore_index=True, sort=False)
    # Plot ks vs inertias
    plt.plot(ks, inertias, '-o')
    plt.xlabel('number of clusters, k')
    plt.ylabel('inertia')
    plt.xticks(ks)
    plt.show()
    return model_comp
# pd.np was removed from pandas, so use numpy directly
comp_model = k_means(X_data=df,
                     nb_clusters=np.arange(2, 11, 1),
                     model_comp=comp_model)
DBscan :
def dbscan_grid_search(X_data, model_comp, eps_space=0.5,
                       min_samples_space=5, min_clust=0, max_clust=10):
    data = X_data.iloc[:20000]
    X = data.values
    X_scaled = preprocessing.StandardScaler().fit_transform(X)
    # Starting a tally of total iterations
    n_iterations = 0
    # Looping over each combination of hyperparameters
    for eps_val in eps_space:
        for samples_val in min_samples_space:
            dbscan_grid = DBSCAN(eps=eps_val,
                                 min_samples=samples_val)
            # fit_predict
            clusters = dbscan_grid.fit_predict(X=X_scaled)
            # Counting the amount of data in each cluster
            cluster_count = Counter(clusters)
            # Number of clusters, not counting the noise label -1
            n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
            # Increasing the iteration tally with each run of the loop
            n_iterations += 1
            # Appending a row each time the n_clusters criteria is reached
            if n_clusters >= min_clust and n_clusters <= max_clust:
                silh = metrics.silhouette_score(X_scaled, clusters)
                data = [{'Model': 'Dbscan',
                         'Score_Silhouette': silh,
                         'num_clusters': n_clusters,
                         'size_clusters': cluster_count,
                         'parameters': 'eps :'+str(eps_val)+'+ samples_val :'+str(samples_val)}]
                # DataFrame.append was removed in pandas 2.0, so concatenate instead
                model_comp = pd.concat([model_comp, pd.DataFrame(data)],
                                       ignore_index=True, sort=False)
    return model_comp
# pd.np was removed from pandas, so use numpy directly
comp_model = dbscan_grid_search(X_data=df,
                                model_comp=comp_model,
                                eps_space=np.arange(0.1, 5, 0.6),
                                min_samples_space=np.arange(1, 30, 3),
                                min_clust=2,
                                max_clust=10)
GMM :
def gmm(X_data, nb_clusters, model_comp):
    ks = nb_clusters
    data = X_data.iloc[:20000]
    X = data.values
    X_scaled = preprocessing.StandardScaler().fit_transform(X)
    for num_clusters in ks:
        # Create a GaussianMixture with k components and fit it to the samples
        gmm = mixture.GaussianMixture(n_components=num_clusters).fit(X_scaled)
        pred = gmm.predict(X_scaled)
        cluster_count = Counter(pred)
        silh = metrics.silhouette_score(X_scaled, pred)
        data = [{'Model': 'GMM',
                 'Score_Silhouette': silh,
                 'num_clusters': num_clusters,
                 'size_clusters': cluster_count,
                 'parameters': 'nb_clusters :'+str(num_clusters)}]
        # DataFrame.append was removed in pandas 2.0, so concatenate instead
        model_comp = pd.concat([model_comp, pd.DataFrame(data)],
                               ignore_index=True, sort=False)
    return model_comp
# pd.np was removed from pandas, so use numpy directly
comp_model = gmm(X_data=df,
                 nb_clusters=np.arange(2, 11, 1),
                 model_comp=comp_model
                 )
At the end, comp_model will contain the results of all the algorithms. Here I am using three algorithms; pick the one that fits your data best, based on the silhouette score and the number of clusters.
You should then check how the samples are distributed across the clusters:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py
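In case it helps to see what the silhouette score actually measures, here is a rough plain-Python sketch of its definition. It is only meant as an illustration (the silhouette helper and the four toy points are made up for this post); in practice you would keep using metrics.silhouette_score as in the code above.

```python
from math import dist  # available from Python 3.8


def silhouette(points, labels):
    """Mean silhouette coefficient, computed directly from its definition."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-point clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point itself contributes a distance of 0 to the sum)
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        largest = max(a, b)
        scores.append((b - a) / largest if largest > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight pairs of points that are far apart
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the two pairs
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the pairs
print(good, bad)
```

A score close to 1 means the points sit much closer to their own cluster than to the nearest other one, while a negative score suggests many points would fit better in another cluster.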

Tensorboard smoothing

I downloaded the CSV files from TensorBoard in order to plot the losses myself, as I want them smoothed.
This is currently my code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('C:\\Users\\ali97\\Desktop\\Project\\Database\\Comparing Outlier Fractions\\10 Percent (MAE)\\MSE Validation.csv',usecols=['Step','Value'],low_memory=True)
df2 = pd.read_csv('C:\\Users\\ali97\\Desktop\\Project\\Database\\Comparing Outlier Fractions\\15 Percent (MAE)\\MSE Validation.csv',usecols=['Step','Value'],low_memory=True)
df3 = pd.read_csv('C:\\Users\\ali97\\Desktop\\Project\\Database\\Comparing Outlier Fractions\\20 Percent (MAE)\\MSE Validation.csv',usecols=['Step','Value'],low_memory=True)
plt.plot(df['Step'],df['Value'] , 'r',label='10% Outlier Frac.' )
plt.plot(df2['Step'],df2['Value'] , 'g',label='15% Outlier Frac.' )
plt.plot(df3['Step'],df3['Value'] , 'b',label='20% Outlier Frac.' )
plt.xlabel('Epochs')
plt.ylabel('Validation score')
plt.show()
I was reading about how to smooth the graph and found that another member here posted the code TensorBoard actually uses to smooth graphs, but I really don't know how to apply it in my code.
from typing import List
def smooth(scalars: List[float], weight: float) -> List[float]:  # Weight between 0 and 1
    last = scalars[0]  # First value in the plot (first timestep)
    smoothed = list()
    for point in scalars:
        smoothed_val = last * weight + (1 - weight) * point  # Calculate smoothed value
        smoothed.append(smoothed_val)  # Save it
        last = smoothed_val  # Anchor the last smoothed value
    return smoothed
Thank you.
If you are working with pandas, you can use the ewm function (pandas EWM) and adjust the alpha factor to get a good approximation of TensorBoard's smoothing function.
df.ewm(alpha=(1 - ts_factor)).mean()
CSV file mse_data.csv
step value
0 0.000000 9.716303
1 0.200401 9.753981
2 0.400802 9.724551
3 0.601202 7.926591
4 0.801603 10.181700
.. ... ...
495 99.198400 0.298243
496 99.398800 0.314511
497 99.599200 -1.119387
498 99.799600 -0.374202
499 100.000000 1.150465
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("mse_data.csv")
print(df)
TSBOARD_SMOOTHING = [0.5, 0.85, 0.99]
smooth = []
for ts_factor in TSBOARD_SMOOTHING:
    smooth.append(df.ewm(alpha=(1 - ts_factor)).mean())
for ptx in range(3):
    plt.subplot(1,3,ptx+1)
    plt.plot(df["value"], alpha=0.4)
    plt.plot(smooth[ptx]["value"])
    plt.title("Tensorboard Smoothing = {}".format(TSBOARD_SMOOTHING[ptx]))
    plt.grid(alpha=0.3)
plt.show()
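If you would rather reuse the smooth function quoted in the question as it is, a minimal sketch could look like the following. The sample values are the first five rows of mse_data.csv shown above, and the weight of 0.5 is an arbitrary choice for the example. As far as I can tell from the pandas documentation, ewm(alpha=1 - weight, adjust=False) follows the same recurrence, but treat that as something to verify on your own data.

```python
from typing import List


def smooth(scalars: List[float], weight: float) -> List[float]:
    """Exponential smoothing as quoted in the question (weight between 0 and 1)."""
    last = scalars[0]  # start from the first value
    smoothed = []
    for point in scalars:
        smoothed_val = last * weight + (1 - weight) * point
        smoothed.append(smoothed_val)
        last = smoothed_val  # carry the smoothed value forward
    return smoothed


# First five values of the example CSV shown above
values = [9.716303, 9.753981, 9.724551, 7.926591, 10.181700]
smoothed = smooth(values, weight=0.5)
print(smoothed)
```

In the question's script this would become something like plt.plot(df['Step'], smooth(df['Value'].tolist(), 0.85), 'r', label='10% Outlier Frac.'), and likewise for df2 and df3.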

PyMC3: PositiveDefiniteError when sampling a Categorical variable

I am trying to sample a simple model of a categorical distribution with a Dirichlet prior. Here is my code:
import numpy as np
from scipy import optimize
from pymc3 import *
k = 6
alpha = 0.1 * np.ones(k)
with Model() as model:
    p = Dirichlet('p', a=alpha, shape=k)
    categ = Categorical('categ', p=p, shape=1)
    tr = sample(10000)
And I get this error:
PositiveDefiniteError: Scaling is not positive definite. Simple check failed. Diagonal contains negatives. Check indexes [0 1 2 3 4]
The problem is that NUTS is failing to initialize properly. One solution is to use another sampler like this:
import pymc3 as pm
# alpha is the array defined in the question
with pm.Model() as model:
    p = pm.Dirichlet('p', a=alpha)
    categ = pm.Categorical('categ', p=p)
    step = pm.Metropolis(vars=p)
    tr = pm.sample(1000, step=step)
Here I am manually assigning p to Metropolis, and letting PyMC3 assign categ to a proper sampler.
