Cluster feature median values using Python - python

While working on a dataset, I used k-means clustering and I want to explore the median values of the features/variables.
import pandas as pd
from sklearn.cluster import KMeans

data = pd.DataFrame({'Monetary': rfm_m_log, 'Recency': rfm_r_log, 'Frequency': rfm_f_log})
matrix = data.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0
kmeans = KMeans(init='k-means++', n_clusters=2, n_init=30)
kmeans.fit(matrix)
clusters_customers = kmeans.predict(matrix)
How can I print the median values of Monetary, Recency and Frequency in each cluster (Cluster 1 and Cluster 2)?

It can be done by slicing the DataFrame according to the predicted cluster labels:
import numpy as np

# cluster 0 median of the Monetary column
data.iloc[np.argwhere(clusters_customers == 0).ravel()]['Monetary'].median()
# cluster 1 median of the Monetary column
data.iloc[np.argwhere(clusters_customers == 1).ravel()]['Monetary'].median()
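If you want the medians of all three features for every cluster at once, a shorter groupby-based sketch (using data and clusters_customers from the question) would be:
# median of every feature per cluster, returned as one DataFrame indexed by cluster label
data.groupby(clusters_customers).median()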

Related

Numerical vs. categorical vars: Why 100% correlation for categorical variable with high cardinality?

I am new to data science and trying to get a grip on exploratory data analysis. My goal is to get a correlation matrix between all the variables. For numerical variables I use Pearson's R, for categorical variables I use the corrected Cramer's V. The issue now is to get a meaningful correlation between categorical and numerical variables. For that I use the correlation ratio, as outlined here. The issue with that is that categorical variables with high cardinality show a high correlation no matter what:
[Figure: correlation-ratio heatmap, categorical vs. numerical variables]
This seems nonsensical, since it effectively shows the cardinality of the categorical variable instead of its correlation with the numerical variable. The question is: how can I deal with this issue in order to get a meaningful correlation?
The Python code below shows how I implemented the correlation ratio:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.DataFrame({
    'id': [0,1,2,3,4,5,6,7,8,9,10,11], 'num3': [6,3,3,9,6,9,9,3,6,3,6,9],
    'cat2': [0,1,0,1,0,1,0,1,0,1,0,1], 'cat3': [0,1,2,0,1,2,0,1,2,0,1,2],
    'cat6': [0,4,8,2,6,10,0,4,8,2,6,10], 'cat12': [0,7,2,9,4,11,6,1,8,3,10,5],
})
cat_cols, num_cols = ['cat2','cat3','cat6','cat12'], ['id','num3']

def corr_ratio(cats, nums):
    avgtotal = nums.mean()
    elements_avg, elements_count = np.zeros(len(cats.index)), np.zeros(len(cats.index))
    cu = cats.unique()
    for i in range(cu.size):
        cn = cu[i]
        filt = cats == cn
        elements_count[i] = filt.sum()
        elements_avg[i] = nums[filt].mean(axis=0)
    numerator = np.sum(np.multiply(elements_count, np.power(np.subtract(elements_avg, avgtotal), 2)))
    denominator = np.sum(np.power(np.subtract(nums, avgtotal), 2))  # total variance
    return 0.0 if numerator == 0 else np.sqrt(numerator / denominator)

rows = []
for cat in cat_cols:
    col = []
    for num in num_cols:
        col.append(round(corr_ratio(train[cat], train[num]), 2))
    rows.append(col)

df = pd.DataFrame(np.array(rows), columns=num_cols, index=cat_cols)
sns.heatmap(df)
plt.tight_layout()
plt.show()
It could be because you are visualising something closer to the chi-squared statistic in your seaborn plot. Cramér's V is derived from chi-squared but is not equivalent to it, so a specific cell can show a high raw value while Cramér's V gives a more meaningful one. I'm also not sure it makes sense to compare raw modality values, because they can be on completely different orders of magnitude.
The chi-squared statistic and Cramér's V:
$$ \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
$$ V = \sqrt{\frac{\chi^2 / n}{\min(k - 1, r - 1)}} $$
where $O_{ij}$ and $E_{ij}$ are the observed and expected frequencies, $n$ is the sample size, and $k$ and $r$ are the numbers of columns and rows of the contingency table.
If I am not mistaken, there is another method called Theil's U. How about trying that out to see whether the same problem occurs?
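Theil's U (the uncertainty coefficient) measures the association between two categorical variables. For reference, a minimal hand-rolled sketch (not from the original posts), assuming train from the question is in scope:
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a sequence of category labels
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def theils_u(x, y):
    # U(x|y): how much knowing y reduces uncertainty about x, in [0, 1]
    h_x = entropy(x)
    if h_x == 0:
        return 1.0
    counts_y = Counter(y)
    total = len(y)
    # conditional entropy H(x|y)
    h_x_given_y = sum(
        (cnt / total) * entropy([xi for xi, yi in zip(x, y) if yi == val])
        for val, cnt in counts_y.items()
    )
    return (h_x - h_x_given_y) / h_x

print(theils_u(train['cat12'].tolist(), train['cat2'].tolist()))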
You can use this (the associations helper comes from the dython package):
from dython.nominal import associations

num_cols = your_df.select_dtypes(include=['number']).columns.to_list()
cat_target_cols = your_df.select_dtypes(include=['object']).columns.to_list()
corr_df = pd.DataFrame(associations(dataset=your_df, numerical_columns=num_cols, nom_nom_assoc='theil', figsize=(20, 20), nominal_columns=cat_target_cols).get('corr'))

Can I use statsmodel ARIMA to interpolate a time series?

In Python 3.7, I have a time series represented by a pandas DataFrame in which the index is a DateTimeIndex and the single value column is the stock price:
[Plot: price series with gaps at the missing values]
The gaps correspond to NaN "price" values; there are 126 non-NaN values and 20 NaN values. What I'm trying to do is use the non-NaN values to predict the values that are NaN. I tried several interpolation methods (linear, cubic spline), but they are not sufficiently accurate. Looking at the plot above, there appears to be a significant upward trend and some traces of weekly periodicity, so I decided to use statsmodels ARIMA. Here is my code:
def fill_in_dataframe_ARIMA(df):
    price_is_not_NaN = df['price'].notnull()
    price_is_NaN = np.logical_not(price_is_not_NaN)
    # Convert the datetimes of the index into milliseconds:
    datetime_ms = df.index.map(to_ms)
    # Train the ARIMA model:
    train_datetime_ms = datetime_ms[price_is_not_NaN]
    train_price = df.price[price_is_not_NaN]
    arima_model = ARIMA(train_price, (5, 1, 2), train_datetime_ms).fit()
    # Use model to predict the missing prices:
    missing_datetime_ms = datetime_ms[price_is_NaN]
    missing_price = arima_model.predict(exog=missing_datetime_ms)
    return missing_price
What I'm expecting is that missing_price ends up being an array-like object with twenty entries, like missing_datetime_ms. Instead, missing_price has 125 entries, one fewer than the number of samples in train_datetime_ms / train_price.
Clearly I am not understanding what's meant by endogenous and exogenous (not to mention interpolate vs. extrapolate). Can someone please explain how I can get the intended result of 20 predicted entries?
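For context (this is not an answer from the original thread): one way to obtain in-sample predictions at the missing timestamps is to fit a state-space model on the full series, NaNs included, because the Kalman filter behind statsmodels' SARIMAX treats NaN observations as missing. A minimal sketch, assuming df has a DateTimeIndex and a 'price' column as in the question, and reusing the (5, 1, 2) order:
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fill_in_with_sarimax(df, order=(5, 1, 2)):
    # Fit on the whole series; NaN prices are treated as missing observations
    result = SARIMAX(df['price'], order=order).fit(disp=False)
    # One-step-ahead in-sample predictions for every time step
    fitted = np.asarray(result.predict(start=0, end=len(df) - 1))
    # Keep observed prices and fill only the gaps with the predictions
    filled = df['price'].copy()
    mask = filled.isna().to_numpy()
    filled[mask] = fitted[mask]
    return filled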

How to calculate probability from probability density function in the Naive Bayes Classifier?

I am implementing Gaussian Naive Bayes Algorithm:
# importing modules
import pandas as pd
import numpy as np
# create an empty dataframe
data = pd.DataFrame()
# create our target variable
data["gender"] = ["male","male","male","male",
"female","female","female","female"]
# create our feature variables
data["height"] = [6,5.92,5.58,5.92,5,5.5,5.42,5.75]
data["weight"] = [180,190,170,165,100,150,130,150]
data["foot_size"] = [12,11,12,10,6,8,7,9]
# view the data
print(data)
# create an empty dataframe
person = pd.DataFrame()
# create some feature values for this single row
person["height"] = [6]
person["weight"] = [130]
person["foot_size"] = [8]
# view the data
print(person)
# Priors can be calculated either as constants or as probability distributions.
# In our example, this is simply the probability of being a gender.
# calculating prior now
# number of males
n_male = data["gender"][data["gender"] == "male"].count()
# number of females
n_female = data["gender"][data["gender"] == "female"].count()
# total people
total_ppl = data["gender"].count()
print ("Male count =",n_male,"and Female count =",n_female)
print ("Total number of persons =",total_ppl)
# number of males divided by the total rows
p_male = n_male / total_ppl
# number of females divided by the total rows
p_female = n_female / total_ppl
print ("Probability of MALE =",p_male,"and FEMALE =",p_female)
# group the data by gender and calculate the means of each feature
data_means = data.groupby("gender").mean()
# view the values
data_means
# group the data by gender and calculate the variance of each feature
data_variance = data.groupby("gender").var()
# view the values
data_variance
data_variance["foot_size"][data_variance.index == "male"].values[0]
# means for male
male_height_mean=data_means["height"][data_means.index=="male"].values[0]
male_weight_mean=data_means["weight"][data_means.index=="male"].values[0]
male_footsize_mean=data_means["foot_size"][data_means.index=="male"].values[0]
print (male_height_mean,male_weight_mean,male_footsize_mean)
# means for female
female_height_mean=data_means["height"][data_means.index=="female"].values[0]
female_weight_mean=data_means["weight"][data_means.index=="female"].values[0]
female_footsize_mean=data_means["foot_size"][data_means.index=="female"].values[0]
print (female_height_mean,female_weight_mean,female_footsize_mean)
# variance for male
male_height_var=data_variance["height"][data_variance.index=="male"].values[0]
male_weight_var=data_variance["weight"][data_variance.index=="male"].values[0]
male_footsize_var=data_variance["foot_size"][data_variance.index=="male"].values[0]
print (male_height_var,male_weight_var,male_footsize_var)
# variance for female
female_height_var=data_variance["height"][data_variance.index=="female"].values[0]
female_weight_var=data_variance["weight"][data_variance.index=="female"].values[0]
female_footsize_var=data_variance["foot_size"][data_variance.index=="female"].values[0]
print (female_height_var,female_weight_var,female_footsize_var)
# create a function that calculates p(x | y):
def p_x_given_y(x, mean_y, variance_y):
    # input the arguments into a probability density function
    p = 1 / (np.sqrt(2 * np.pi * variance_y)) * \
        np.exp((-(x - mean_y) ** 2) / (2 * variance_y))
    # return p
    return p
# numerator of the posterior if the unclassified observation is a male
posterior_numerator_male = p_male * \
    p_x_given_y(person["height"][0], male_height_mean, male_height_var) * \
    p_x_given_y(person["weight"][0], male_weight_mean, male_weight_var) * \
    p_x_given_y(person["foot_size"][0], male_footsize_mean, male_footsize_var)
# numerator of the posterior if the unclassified observation is a female
posterior_numerator_female = p_female * \
    p_x_given_y(person["height"][0], female_height_mean, female_height_var) * \
    p_x_given_y(person["weight"][0], female_weight_mean, female_weight_var) * \
    p_x_given_y(person["foot_size"][0], female_footsize_mean, female_footsize_var)
print ("Numerator of Posterior MALE =",posterior_numerator_male)
print ("Numerator of Posterior FEMALE =",posterior_numerator_female)
if (posterior_numerator_male >= posterior_numerator_female):
    print("Predicted gender is MALE")
else:
    print("Predicted gender is FEMALE")
When we are calculating the probability, we are calculating it using the Gaussian PDF:
$$ P(x) = \frac{1}{\sqrt {2 \pi {\sigma}^2}} e^{\frac{-(x- \mu)^2}{2 {\sigma}^2}} $$
My question: the above equation is that of a PDF. To calculate a probability, we have to integrate it over an interval:
$$ \int_{x_0}^{x_1} P(x)\,dx $$
But in the above program, we are plugging in the value of x and calling the result a probability. Is that correct? Why? I have seen most articles calculate the probability in the same manner.
If this is the wrong way to calculate the probability in the Naive Bayes Classifier, then what is the correct method?
The method is correct. The pdf function is a probability density, i.e., a function that measures the probability of being in a neighborhood of a value divided by the "size" of such a neighborhood, where the "size" is the length in dimension 1, the area in 2, the volume in 3, etc.
In continuous probability the probability of getting precisely any given outcome is 0, which is why densities are used instead. Therefore, we don't deal with expressions such as P(X=x) but with P(|X-x| < Δ(x)), which stands for the probability of X being close to x.
Let me simplify the notation and write P(X~x) for P(|X-x| < Δ(x)).
If you apply the Bayes rule here, you will get
P(X~x|W~w) = P(W~w|X~x)*P(X~x)/P(W~w)
because we are dealing with probabilities. If we now introduce densities:
pdf(x|w)*Δ(x) = pdf(w|x)*Δ(w)*pdf(x)*Δ(x)/(pdf(w)*Δ(w))
because probability = density*neighborhood_size. And since all Δ(·) cancel out in the expression above, we get
pdf(x|w) = pdf(w|x)*pdf(x)/pdf(w)
which is the Bayes rule for densities.
The conclusion is that, given that the Bayes rule also holds for densities, it is legitimate to use the same methods replacing probabilities with densities when dealing with continuous random variables.
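As a small illustration (not part of the original answer): because the Δ(·) factors cancel, the density-based numerators from the question's code can be normalized directly into posterior probabilities. A minimal sketch, assuming posterior_numerator_male and posterior_numerator_female are already computed:
# Normalize the two numerators so the posteriors sum to 1
evidence = posterior_numerator_male + posterior_numerator_female
p_male_given_x = posterior_numerator_male / evidence
p_female_given_x = posterior_numerator_female / evidence
print("P(male | x) =", p_male_given_x, "P(female | x) =", p_female_given_x)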

What is the meaning of normalization in machine learning language? Does it correspond to one sample?

I am dealing with a classification problem: I want to classify data into 2 classes. I generate 1000 samples at different temperatures ranging from 1 to 5 and load the data using the function load_data below. Here "data" is a 2-dimensional array of shape (1000, 16): the rows correspond to the samples in "1.0.npy" (and similarly for the other temperature files), and 16 is the number of features. I picked the max and min values from each sample by applying a for loop. But I'm afraid my normalization is not correct, because I'm not sure what the normalization strategy in machine learning should be. Should I take np.amax of each sample, or np.amax over all 1000 samples contained in the "1.0.npy" file? My goal is to normalize the data between 0 and 1.
def load_data():
    path = "./directory"
    files = sorted(os.listdir(path))  # {1.0.npy, 2.0.npy, ..., 5.0.npy}
    dictData = {}
    for df in sorted(files):
        print(df)
        data = np.load(os.path.join(path, df))
        a = data
        lis = []
        for i in range(len(data)):
            old_range = np.amax(a[i]) - np.amin(a[i])
            new_range = 1 - 0
            f = ((a[i] - np.amin(a[i])) / old_range) * new_range + 0
            lis.append(f)
After normalization I get the following result, such that the first value of every sample is 0 and the last value is 1:
[0, ...., 1] #first sample
[0,.....,1] #second sample
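For comparison (not part of the original question): the more common strategy is to normalize each feature (column) across all samples rather than each sample (row). A minimal sketch using scikit-learn's MinMaxScaler, assuming data is one (1000, 16) array loaded from a .npy file:
from sklearn.preprocessing import MinMaxScaler

# Scale each of the 16 features to [0, 1] using its min/max over all 1000 samples
scaler = MinMaxScaler(feature_range=(0, 1))
data_scaled = scaler.fit_transform(data)
print(data_scaled.min(axis=0), data_scaled.max(axis=0))  # per-feature mins ~0, maxes ~1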

Get the cluster size in sklearn in python

I am using sklearn DBSCAN to cluster my data as follows.
#Apply DBSCAN (sims == my data as list of lists)
db1 = DBSCAN(min_samples=1, metric='precomputed').fit(sims)
db1_labels = db1.labels_
db1n_clusters_ = len(set(db1_labels)) - (1 if -1 in db1_labels else 0)
#Returns the number of clusters (E.g., 10 clusters)
print('Estimated number of clusters: %d' % db1n_clusters_)
Now I want to get the top 3 clusters sorted by size (number of data points in each cluster). How can I obtain the cluster sizes in sklearn?
Another option would be to use numpy.unique:
db1_labels = db1.labels_
labels, counts = np.unique(db1_labels[db1_labels>=0], return_counts=True)
print(labels[np.argsort(-counts)[:3]])
Well, you can use the bincount function in NumPy to get the frequencies of the labels. For example, we will use the DBSCAN example from scikit-learn:
#Store the labels
labels = db.labels_
#Then get the frequency count of the non-negative labels
counts = np.bincount(labels[labels>=0])
print(counts)
#Output : [243 244 245]
Then, to get the top values, use argsort in NumPy. In our example, since there are only 3 clusters, I will extract the top 2 values:
top_labels = np.argsort(-counts)[:2]
print(top_labels)
#Output : [2 1]
#To get their respective frequencies
print(counts[top_labels])
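For completeness, a pandas-based sketch (not from the original answers), assuming db1 from the question:
import pandas as pd

# Sizes of the three largest clusters, ignoring noise points labelled -1
label_series = pd.Series(db1.labels_)
print(label_series[label_series >= 0].value_counts().head(3))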
