Visualize multidimensional datasets with MDS - python

I am trying to visualize the 3 features of my dataframe by using MDS to scale them down to 2 dimensions.
I performed MDS in 2 dimensions and now want to plot the new data, giving each point a different color according to the target variable, 'Type'.
In: df
Sales  hours  month  Type
243    13     5      A
111    4      3      B
250    7      7      C
101    12     1      A
X = df
X = pd.get_dummies(X)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Apply the MDS
mds = MDS(2,random_state=0)
X_2d = mds.fit_transform(X_scaled)
# Plot the new dataset.
colors = ['red','green','blue']
plt.rcParams['figure.figsize'] = [7, 7]
plt.rc('font', size=14)
for i in np.unique(df.Type):
    subset = X_2d[df.Type == i]
    x = [row[0] for row in subset]
    y = [row[1] for row in subset]
plt.scatter(x,y,c=colors[i],label= df.target_names[i])
plt.legend()
plt.show()
Applying the MDS works well and the new dataset is generated.
My problem is in the plotting, which fails with this error:
TypeError: list indices must be integers or slices, not str
----> plt.scatter(x,y,c=colors[i],label=all_outliers_type.target_names[i])

It seems your indentation is off: you are calling colors[i] outside of your for loop. In addition, i is one of "A", "B", "C", and a list cannot be indexed with a string, which is what raises the TypeError.
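One possible way to avoid the string index is to map each type label to a color with a dict and keep the per-type work inside the loop. This is only a minimal sketch: the small `df` and the `X_2d` array below are made-up stand-ins for the question's frame and for the output of `mds.fit_transform`, and `color_by_type` and `groups` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumptions): `df` mirrors the question's frame and
# `X_2d` plays the role of the array returned by mds.fit_transform.
df = pd.DataFrame({"Type": ["A", "B", "C", "A"]})
X_2d = np.array([[0.1, 0.2], [0.5, -0.3], [-0.4, 0.6], [0.0, 0.1]])

colors = ["red", "green", "blue"]
# Map each label to a color instead of indexing the list with a string.
color_by_type = dict(zip(sorted(df["Type"].unique()), colors))

groups = []
for label, color in color_by_type.items():
    # Boolean mask selects the 2D points that belong to this type.
    subset = X_2d[(df["Type"] == label).to_numpy()]
    groups.append((label, color, subset[:, 0], subset[:, 1]))
```

Each tuple in `groups` holds what one scatter call needs, so calling `plt.scatter(x, y, c=color, label=label)` per tuple and `plt.legend()` once afterwards should give one colored group per type.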

Related

Applying Featuretools output to another dataframe

I have a dataframe which contains the features and the target, and which looks like this:
x x1 y
1 2 3
2 3 4
Now I use featuretools to automatically do feature engineering, using this code:
es = ft.EntitySet(id = 'x')
es.entity_from_dataframe(entity_id = 'y', dataframe = df, index = 'x')
feature_matrix, feature_names = ft.dfs(entityset=es,
                                       target_entity = 'y',
                                       max_depth = 2,
                                       verbose = 1,
                                       n_jobs = 3)
I would like to take the features generated, and then apply them to a dataset which lacks the labels, something which looks like this:
x x1
1 2
How would I take the generated features (e.g. the mean of x + x1) and then map their creation process ((df['x']+df['x1']).mean()) onto the dataframe lacking the label?
The section on saving features in the deployment guide answered my question:
https://featuretools.alteryx.com/en/stable/guides/deployment.html
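The idea behind saving features is to keep the feature definitions separate from the data, so the same definitions can be replayed on a frame without labels. The following is a pandas-only analogy of that idea, not the featuretools API: `feature_defs` and `apply_features` are hypothetical names, and the two features are made up for illustration.

```python
import pandas as pd

# Hypothetical feature definitions: name -> function of a DataFrame.
feature_defs = {
    "x_plus_x1": lambda d: d["x"] + d["x1"],
    "x_times_x1": lambda d: d["x"] * d["x1"],
}


def apply_features(frame, defs):
    """Return a copy of `frame` with one new column per definition."""
    out = frame.copy()
    for name, func in defs.items():
        out[name] = func(frame)
    return out


train = pd.DataFrame({"x": [1, 2], "x1": [2, 3], "y": [3, 4]})
unlabeled = pd.DataFrame({"x": [1], "x1": [2]})

# The same definitions are applied to both frames; the label is never needed.
train_features = apply_features(train, feature_defs)
unlabeled_features = apply_features(unlabeled, feature_defs)
```

Because the definitions only read the input columns, they work on any frame that has x and x1, with or without y.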

2D bin (x,y) and calculate mean of values (c) of 10 deepest data points (z)

For a data set consisting of:
coordinates x, y
depth z
a certain value c
I would like to do the following more efficiently:
bin the data set in 2D bins based on the coordinates (x, y)
take the 10 deepest data points (z) per bin
calculate the mean value of c of these 10 data points per bin
Finally, show a 2D heatmap with the calculated mean values.
I have found a working solution, but this takes too long for small bins and/or large data sets.
Is there a more efficient way of achieving the same result?
Current working example
Example dataframe:
import numpy as np
from numpy.random import rand
import pandas as pd
import math
import matplotlib.pyplot as plt
n = 10000
df = pd.DataFrame({'x':rand(n), 'y':rand(n), 'z':rand(n), 'c':rand(n)})
Bin the data set:
cell_size = 0.01
nx = math.ceil((max(df['x']) - min(df['x'])) / cell_size)
ny = math.ceil((max(df['y']) - min(df['y'])) / cell_size)
x_range = np.arange(0, nx)
y_range = np.arange(0, ny)
df['xbin'], x_edges = pd.cut(x=df['x'], bins=nx, labels=x_range, retbins=True)
df['ybin'], y_edges = pd.cut(x=df['y'], bins=ny, labels=y_range, retbins=True)
Code that currently takes too long:
df = df.groupby(['xbin', 'ybin']).apply(
    lambda d: d.sort_values('z').head(10).mean())
Update an empty DataFrame for the bins without data and show result:
index = pd.MultiIndex.from_product([x_range, y_range],
                                   names=['xbin', 'ybin'])
tot_df = pd.DataFrame(index=index, columns=['z', 'c'])
tot_df.update(df)
zval = tot_df['c'].astype('float').values
zval = zval.reshape((nx, ny))
zval = zval.T
zval = np.flipud(zval)
extent = [min(x_edges), max(x_edges), min(y_edges), max(y_edges)]
plt.matshow(zval, aspect='auto', extent=extent)
plt.show()
You can use np.searchsorted to bin the rows by x and y, and then use groupby to take the 10 deepest values and calculate the means. Since groupby maintains the order of the rows within each group, you can sort the values by z before applying the bins. groupby performs better without apply.
df = pd.DataFrame({'x':rand(n), 'y':rand(n), 'z':rand(n), 'c':rand(n)})
df = df.sort_values("z", ascending=False)
bins = np.linspace(0, 1, 11)
df["bin_x"] = np.searchsorted(bins, df['x'].values) - 1
df["bin_y"] = np.searchsorted(bins, df['y'].values) - 1
result = df.groupby(["bin_x", "bin_y"]).head(10)
result.groupby(["bin_x", "bin_y"])["c"].mean()
Result
bin_x  bin_y
0      0        0.369531
       1        0.601803
       2        0.554452
       3        0.575464
       4        0.455198
                  ...
9      5        0.469838
       6        0.420772
       7        0.367549
       8        0.379200
       9        0.523083
Name: c, Length: 100, dtype: float64
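To get from these per-bin means to the 2D array needed for a heatmap, the sort, head and mean steps can be wrapped in one function and the result unstacked into a grid. This is a rough sketch under assumptions: x and y are taken to lie in [0, 1], "deepest" is taken to mean the largest z (flip `ascending` if depth is stored the other way), and `deepest_mean_grid`, `n_bins` and `k` are hypothetical names.

```python
import numpy as np
import pandas as pd


def deepest_mean_grid(df, n_bins=10, k=10):
    """Mean of c over the k largest-z rows of each (x, y) bin, as a 2D grid."""
    # Assumption: x and y lie in [0, 1]; exactly 1.0 is put into the last bin.
    bin_x = np.minimum((df["x"].to_numpy() * n_bins).astype(int), n_bins - 1)
    bin_y = np.minimum((df["y"].to_numpy() * n_bins).astype(int), n_bins - 1)
    ranked = df.assign(bin_x=bin_x, bin_y=bin_y).sort_values("z", ascending=False)
    # head(k) keeps the first k rows of every group, i.e. the k deepest.
    top = ranked.groupby(["bin_x", "bin_y"]).head(k)
    means = top.groupby(["bin_x", "bin_y"])["c"].mean()
    # Rows are y bins, columns are x bins; bins without data become NaN.
    grid = means.unstack("bin_x")
    return grid.reindex(index=range(n_bins), columns=range(n_bins))


# Tiny example: three points in one bin, keep the two deepest.
small = pd.DataFrame({
    "x": [0.05, 0.05, 0.05],
    "y": [0.15, 0.15, 0.15],
    "z": [0.1, 0.9, 0.5],
    "c": [1.0, 3.0, 5.0],
})
grid = deepest_mean_grid(small, n_bins=10, k=2)
```

`grid.to_numpy()` can then be handed to plt.matshow or plt.imshow; empty bins stay NaN, so no separate empty DataFrame has to be updated.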

TypeError with Python : str and int

I received this error when running my code. I extracted data from an xlsx file, created a dataframe, replaced null values with 0 and converted all the values to strings to be able to draw a scatter plot. When I tried to compute the linear regression, I received this error:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
This is the code I have written so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def predict(x):
    return slope * x + intercept
from scipy import stats
xlsxfile = pd.ExcelFile("C:\\Users\\AchourAh\\Desktop\\PL14_IPC_03_09_2018_SP_Level.xlsx")
data = xlsxfile.parse('Sheet1', index_col = None, header = None)
data1 = data.fillna(0) #Replace null values of the whole dataset with 0
data1 = data1.astype(str)
print(data1)
X = data1.iloc[0:len(data1),1]
print(X)
Y = data1.iloc[0:len(data1),2]
print(Y)
axes = plt.axes()
axes.grid()
plt.scatter(X,Y)
slope, intercept, r_value, p_value, std_err = stats.linregress(X, Y)
Note that I am a beginner with this. The last line is causing the error.
These are the first columns of the dataframe; COP COR and PAUS are the ones I am trying to apply linear regression to:
0  PP      SP000045856 COP COR  SP000045856 PAUS
1  201723  0                    2000
2  201724  12560                40060
3  201725  -17760               15040
4  201726  -5840                16960
5  201727  10600                4480
6  201728  0                    14700
7  201729  4760                 46820
... till line 27
Your Excel file has header information in the first row. Because you set header=None, pandas treats that row as data, which puts string values into your columns instead of using them as column names.
If you delete the header kwarg:
xlsxfile = pd.ExcelFile("C:\\Users\\AchourAh\\Desktop\\PL14_IPC_03_09_2018_SP_Level.xlsx")
data = xlsxfile.parse('Sheet1', index_col = None)
everything should work and you should get a dataframe like this:
data
   0  PP      SP000045856 COP COR  SP000045856 PAUS
0  1  201723  0                    2000
1  2  201724  12560                40060
2  3  201725  -17760               15040
3  4  201726  -5840                16960
4  5  201727  10600                4480
5  6  201728  0                    14700
6  7  201729  4760                 46820
You could also do the same thing a little more concisely by using the read_excel function of pandas directly:
data = pd.read_excel('C:\\Users\\AchourAh\\Desktop\\PL14_IPC_03_09_2018_SP_Level.xlsx', 'Sheet1')
The scatter plot can then be drawn, for example, with:
data.plot('SP000045856 COP COR', 'SP000045856 PAUS', 'scatter')
or, equivalently but perhaps more readably:
data.plot.scatter('SP000045856 COP COR', 'SP000045856 PAUS')
The linear regression can then be computed with:
slope, intercept, r_value, p_value, std_err = stats.linregress(data['SP000045856 COP COR'], data['SP000045856 PAUS'])
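The effect of the header argument can be checked on a small sample without the Excel file. In this sketch, pd.read_csv on an in-memory string stands in for read_excel (both accept the same header parameter), and np.polyfit with degree 1 stands in for stats.linregress, since both compute the same least-squares slope and intercept. The CSV excerpt is copied from the rows shown in the question; `raw` and `csv_text` are hypothetical names.

```python
import io

import numpy as np
import pandas as pd

# Assumption: a small CSV excerpt stands in for the Excel sheet.
csv_text = """PP,SP000045856 COP COR,SP000045856 PAUS
201723,0,2000
201724,12560,40060
201725,-17760,15040
201726,-5840,16960
201727,10600,4480
"""

# header=None keeps the header row as data, so the columns are not numeric.
raw = pd.read_csv(io.StringIO(csv_text), header=None)

# With the default header handling, the first row becomes the column names.
data = pd.read_csv(io.StringIO(csv_text))

x = data["SP000045856 COP COR"]
y = data["SP000045856 PAUS"]

# Degree-1 least-squares fit; returns (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)
```

Printing `raw.dtypes` and `data.dtypes` side by side shows the difference: only the second frame has numeric columns that a regression can use.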

Splitting coef into arrays applicable for multi class

I use this function to plot the best and worst features (coef) for each label.
def plot_coefficients(classifier, feature_names, top_features=20):
    coef = classifier.coef_.ravel()
    for i in np.split(coef,6):
        top_positive_coefficients = np.argsort(i)[-top_features:]
        top_negative_coefficients = np.argsort(i)[:top_features]
        top_coefficients = np.hstack([top_negative_coefficients, top_positive_coefficients])
    # create plot
    plt.figure(figsize=(15, 5))
    colors = ["red" if c < 0 else "blue" for c in i[top_coefficients]]
    plt.bar(np.arange(2 * top_features), i[top_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(1, 1 + 2 * top_features), feature_names[top_coefficients], rotation=60, ha="right")
    plt.show()
Applying it to sklearn.LinearSVC:
if (name == "LinearSVC"):
    print(clf.coef_)
    print(clf.intercept_)
    plot_coefficients(clf, cv.get_feature_names())
The output of the CountVectorizer used has a shape of (15258, 26728).
It's a multi-class decision problem with 6 labels. Using .ravel returns a flat array with a length of 6*26728=160368, which means that all indices higher than 26728 are out of bounds for axis 1. Here are the top and bottom indices for one label:
i[ 0. 0. 0.07465654 ... -0.02112607 0. -0.13656274]
Top [39336 35593 29445 29715 36418 28631 28332 40843 34760 35887 48455 27753
33291 54136 36067 33961 34644 38816 36407 35781]
i[ 0. 0. 0.07465654 ... -0.02112607 0. -0.13656274]
Bot [39397 40215 34521 39392 34586 32206 36526 42766 48373 31783 35404 30296
33165 29964 50325 53620 34805 32596 34807 40895]
The first entry in the "top" list has the index 39336. This corresponds to the entry 39336-26728=12608 in the vocabulary. What would I need to change in the code to make this applicable?
EDIT:
X_train = sparse.hstack([training_sentences,entities1train,predictionstraining_entity1,entities2train,predictionstraining_entity2,graphpath_training,graphpathlength_training])
y_train = DFTrain["R"]
X_test = sparse.hstack([testing_sentences,entities1test,predictionstest_entity1,entities2test,predictionstest_entity2,graphpath_testing,graphpathlength_testing])
y_test = DFTest["R"]
Dimensions:
(15258, 26728)
(15258, 26728)
(0, 0) 1
...
(15257, 0) 1
(15258, 26728)
(0, 0) 1
...
(15257, 0) 1
(15258, 26728)
(15258L, 1L)
File "TwoFeat.py", line 708, in plot_coefficients
colors = ["red" if c < 0 else "blue" for c in i[top_coefficients]]
MemoryError
First, is it really necessary to use ravel()?
LinearSVC (or in fact any other classifier that has coef_) returns coef_ with this shape:
coef_ : array, shape = [n_features] if n_classes == 2 else [n_classes, n_features]
Weights assigned to the features (coefficients in the primal problem).
So coef_ has as many rows as there are classes and as many columns as there are features. For each class, you just need to access the right row. The order of the classes is available from the classifier.classes_ attribute.
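The difference between sorting the flattened array and sorting one row at a time can be seen on a small array. This is a minimal sketch: the `coef` array below is a made-up stand-in for classifier.coef_ with 6 classes and 8 features, and `flat_top` and `row_top` are hypothetical names.

```python
import numpy as np

n_classes, n_features = 6, 8
# Stand-in for classifier.coef_ (assumption): one row per class.
coef = np.arange(n_classes * n_features, dtype=float).reshape(n_classes, n_features)

# Sorting the flattened array yields positions in 0 .. n_classes * n_features - 1,
# which are not valid feature indices.
flat_top = np.argsort(coef.ravel())[-3:]

# Sorting one row at a time yields valid feature indices for that class.
row_top = [np.argsort(class_coef)[-3:] for class_coef in coef]

# A flat position can still be decoded into (class row, feature column).
rows, cols = np.unravel_index(flat_top, coef.shape)
```

Here `flat_top` contains 45, 46 and 47, which lie beyond the 8 features, while every entry of `row_top` stays within 0 to 7 and can index the feature names directly.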
Secondly, the indentation of your code is wrong. The plotting code should be inside the for loop so that a plot is drawn for each class. Currently it is outside the scope of the for loop, so it only plots the last class.
Correcting these two things, here is a reproducible sample that plots the top and bottom features for each class.
def plot_coefficients(classifier, feature_names, top_features=20):
    # Access the coefficients from classifier
    coef = classifier.coef_
    # Access the classes
    classes = classifier.classes_
    # Convert once so the names can be indexed with an integer array
    feature_names = np.array(feature_names)
    # Iterate the loop for number of classes
    for i in range(len(classes)):
        print(classes[i])
        # Access the row containing the coefficients for this class
        class_coef = coef[i]
        # Below this, I have just replaced 'i' in your code with 'class_coef'
        # Pass this to get top and bottom features
        top_positive_coefficients = np.argsort(class_coef)[-top_features:]
        top_negative_coefficients = np.argsort(class_coef)[:top_features]
        # Concatenate the above two
        top_coefficients = np.hstack([top_negative_coefficients,
                                      top_positive_coefficients])
        # create plot
        plt.figure(figsize=(10, 3))
        colors = ["red" if c < 0 else "blue" for c in class_coef[top_coefficients]]
        plt.bar(np.arange(2 * top_features), class_coef[top_coefficients], color=colors)
        # Here I corrected the start to 0 (Your code has 1, which shifted the labels);
        # the ticks must match the 2 * top_features bars one to one
        plt.xticks(np.arange(2 * top_features),
                   feature_names[top_coefficients], rotation=60, ha="right")
        plt.show()
Now just use this method as you like:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space']
dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)
vectorizer = CountVectorizer()
# Just to replace classes from integers to their actual labels,
# you can use anything as you like in y
y = []
mapping_dict = dict(enumerate(dataset.target_names))
for i in dataset.target:
    y.append(mapping_dict[i])
# Learn the words from data
X = vectorizer.fit_transform(dataset.data)
clf = LinearSVC(random_state=42)
clf.fit(X, y)
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement
plot_coefficients(clf, vectorizer.get_feature_names_out())
Output from above code:
'alt.atheism'
'comp.graphics'
'sci.space'
'talk.religion.misc'

How do I modify this function to accept multiple Dataframes?

I wrote this function and I would like it to accept more than one DF, so that the final plot has multiple plotted lines for the predictions and the coef_df gets completed with the rest of the coefficients.
The function extracts the needed feature and target from a much larger dataset to make predictions using linear regression. It then builds the model, plots the line over the dataset and returns a df with all the coefficients.
(This is just an exercise.)
def prep_model_and_predict(feature, target, dataset, degree):
    # part 1: make a df with the relevant format and features
    # degree >= 1
    poly_df = pd.DataFrame()
    poly_df[str(target)] = dataset[str(target)]
    poly_df['power_1'] = dataset[str(feature)]
    # check if degree > 1
    if degree > 1:
        for power in range(2, degree+1):  # loop over remaining degrees
            name = 'power_'+str(power)
            poly_df[name] = poly_df['power_1'].apply(lambda x: x**power)
    # part 2: make model and predictions
    features = list(poly_df.columns[1:])
    X = poly_df[features]
    y = poly_df[str(target)]
    model = LinearRegression().fit(X, y)
    predictions = model.predict(X)
    # part 3: put weights in a nice df
    # (rows are collected in a list because DataFrame.append was removed in pandas 2.0)
    rows = [{'Name': 'Intercept', 'Value': model.intercept_},
            {'Name': 'Power_1', 'Value': model.coef_[0]}]
    for power in range(2, degree+1):
        rows.append({'Name': 'Power_' + str(power),
                     'Value': '{:.3e}'.format(model.coef_[power-1])})
    coef_df = pd.DataFrame(rows)
    # part 4: plot it
    fig, ax = plt.subplots()
    ax.plot(poly_df['power_1'], poly_df[str(target)], '.',
            poly_df['power_1'], predictions, '-')
    ax.set_xlabel('Square footage, living area')
    ax.set_ylabel('Price per Sqft')
    ax.ticklabel_format(axis='y', style='sci', scilimits=(-2,2))
    return coef_df, ax
and this is the result:
Name Value
0 Intercept 506738
1 Power_1 2.71336e-77
2 Power_2 7.335e-39
3 Power_3 -1.850e-44
4 Power_4 8.437e-50
5 Power_5 0.000e+00
6 Power_6 0.000e+00
7 Power_7 3.645e-55
8 Power_8 1.504e-51
9 Power_9 5.760e-48
10 Power_10 1.958e-44
11 Power_11 5.394e-41
12 Power_12 9.404e-38
13 Power_13 -3.635e-41
14 Power_14 4.655e-45
15 Power_15 -1.972e-49
much appreciated!
I am not sure exactly what you are asking for. Next time, I would suggest asking a question with an example that other people here on SO can easily reproduce and run.
I have tried to answer your questions. Correct me if I have misunderstood them.
Pass an arbitrary number of DataFrames to your function and plot them:
I have created three random dataframes to use:
df1 = pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=list('AB'))
df2 = pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=list('AB'))
df3 = pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=list('AB'))
The function that plots them:
def plot_me(*dataframes):
    plt.figure(figsize=(13,9))
    # label each line with the position of its DataFrame in the argument list
    for lab_ind, frame in enumerate(dataframes):
        plt.plot(frame['A'], frame['B'], label=lab_ind)
    plt.legend()
    plt.show()
The result plot you get:
Put the results of your model into a DataFrame
Regarding your second question, I am not going to focus on your exact details, such as the names of the columns of your dataframe.
For this particular example I have generated two random arrays:
X = np.random.randint(0,50 ,size=(50, 2))
y = np.random.randint(0,2 ,size=(50, 1))
Then fit a LinearRegression model on this data.
model=LinearRegression().fit(X,y)
predictions=model.predict(X)
And then add it to a DataFrame:
res_df = pd.DataFrame(predictions,columns = ['Value'])
If you print res_df, you get:
Value
0 0.420395
1 0.459389
2 0.369648
3 0.416058
4 0.644088
5 0.362072
6 0.363157
7 0.468943
. .
. .
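Combining the two parts, a function can take any number of DataFrames, fit each one and collect the coefficients in a single table. This is only a sketch under assumptions: np.polyfit is used here in place of the question's LinearRegression to keep the example dependency-free, and `fit_polynomials`, the column names `sqft` and `price`, and the two small frames are made up for illustration.

```python
import numpy as np
import pandas as pd


def fit_polynomials(feature, target, degree, *datasets):
    """Fit one polynomial per DataFrame and return all coefficients in one table."""
    rows = []
    for idx, data in enumerate(datasets):
        # np.polyfit returns coefficients from the highest power down to the intercept.
        coefs = np.polyfit(data[feature], data[target], degree)
        row = {"dataset": idx, "Intercept": coefs[-1]}
        for power in range(1, degree + 1):
            row[f"Power_{power}"] = coefs[-1 - power]
        rows.append(row)
    return pd.DataFrame(rows).set_index("dataset")


# Two made-up frames with exactly linear targets.
df_a = pd.DataFrame({"sqft": [0.0, 1.0, 2.0, 3.0]})
df_a["price"] = 2.0 * df_a["sqft"] + 1.0
df_b = pd.DataFrame({"sqft": [0.0, 1.0, 2.0, 3.0]})
df_b["price"] = -1.0 * df_b["sqft"] + 3.0

coef_table = fit_polynomials("sqft", "price", 1, df_a, df_b)
```

The table has one row per DataFrame, so plotting each prediction line inside the same loop (on one shared Axes) would produce the multi-line plot the question asks for.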
