How to loop over a dataframe and create list

How to loop over a dataframe and create list - python

So, i have the following data below and i want to loop through the dataframe and perform some functions and at the end save the results from the function in a list. I am have trouble creating a list. i only get a single value in the list and not the two means which i intend to get. Anybody with a more effective way to solve this problem please share.
dict = {'PassengerId' : [0.0, 0.001, 0.002, 0.003, 0.004, 0.006, 0.007, 0.008, 0.009, 0.01],
'Survived' : [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
'Pclass' : [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.5],
'Age' : [0.271, 0.472, 0.321, 0.435, 0.435, np.nan, 0.673, 0.02, 0.334, 0.171],
'SibSp' : [0.125, 0.125, 0.0, 0.125, 0.0, 0.0, 0.0, 0.375, 0.0, 0.125],
'Parch' : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.167, 0.333, 0.0],
'Fare' : [0.014, 0.139, 0.015, 0.104, 0.016, 0.017, 0.101, 0.041, 0.022, 0.059]}
import pandas as pd
dicts = pd.DataFrame(dicts, columns = dicts.keys())
def Mean(self):
list_mean = []
list_all = []
for i, row in dicts.iterrows():
if (row['Age'] > 0.2) & (row['Fare'] < 0.1):
list_all.append(row['PassengerId'])
elif (row['Age'] > 0.2) & (row['Fare'] > 0.1):
list_all.clear()
list_all.append(row['PassengerId'])
return list_mean.append(np.mean(list_all))
Mean()
Help Please!!

Some of changes you have to made in you solution to resolve this issue. And for vectorized answer checkout my Code section.
1.
Return statement return list_mean should placed in function block not in if-block
Change:
. . .
if (row['Age'] > self.age) & (row['Fare'] < self.fare):
list_mean.append(row['PassengerId'])
return list_mean
. . .
To:
. . .
list_mean = []
for i, row in dicts.iterrows():
if (row['Age'] > self.age) & (row['Fare'] < self.fare):
list_mean.append(row['PassengerId'])
return list_mean
. . .
CODE :(Vectorized-Version-Solution) No need of defining explicit class to perform this action
import numpy as np
dict_ = {
'PassengerId':
[0.0, 0.001, 0.002, 0.003, 0.004, 0.006, 0.007, 0.008, 0.009, 0.01],
'Survived': [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
'Pclass': [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.5],
'Age':
[0.271, 0.472, 0.321, 0.435, 0.435, np.nan, 0.673, 0.02, 0.334, 0.171],
'SibSp': [0.125, 0.125, 0.0, 0.125, 0.0, 0.0, 0.0, 0.375, 0.0, 0.125],
'Parch': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.167, 0.333, 0.0],
'Fare':
[0.014, 0.139, 0.015, 0.104, 0.016, 0.017, 0.101, 0.041, 0.022, 0.059]
}
import pandas as pd
dicts = pd.DataFrame(dict_, columns=dict_.keys())
l1 = dicts['PassengerId'][np.logical_and(dicts['Age'] > 0.2, dicts['Fare'] < 0.1)]
l2 = dicts['PassengerId'][np.logical_and(dicts['Age'] > 0.2, dicts['Fare'] > 0.1)]
print( (sum(list(l1))/len(l1), sum(list(l2))/len(l2)) )
OUTPUT :
(0.00375, 0.0036666666666666666)

import pandas as pd
import numpy as np
dict = {'PassengerId' : [0.0, 0.001, 0.002, 0.003, 0.004, 0.006, 0.007, 0.008, 0.009, 0.01],
'Survived' : [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
'Pclass' : [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.5],
'Age' : [0.271, 0.472, 0.321, 0.435, 0.435, np.nan, 0.673, 0.02, 0.334, 0.171],
'SibSp' : [0.125, 0.125, 0.0, 0.125, 0.0, 0.0, 0.0, 0.375, 0.0, 0.125],
'Parch' : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.167, 0.333, 0.0],
'Fare' : [0.014, 0.139, 0.015, 0.104, 0.016, 0.017, 0.101, 0.041, 0.022, 0.059]}
df = pd.DataFrame(dict, columns = dict.keys())
def calculate_mean():
l1, l2 = [], []
for i, row in df.iterrows():
if row['Age'] > 0.2 and row['Fare'] < 0.1:
l1.append(row['PassengerId'])
elif row['Age'] > 0.2 and row['Fare'] > 0.1:
l2.append(row['PassengerId'])
return np.mean(l1), np.mean(l2)
print(calculate_mean()) # (0.00375, 0.0036666666666666666)

Related

Is it possible to pass a dataframe to TF/Keras that has a numpy array for each row?

I'm doing a regression that is working but to improve results I wanted to add a numpy array (it represents user attributes that I preprocessed outside the application).
Here's a example of my data:
MPG Cylinders Displacement Horsepower Weight Acceleration Model Year Origin NumpyColumn
0 18.0 8 307.0 130.0 3504.0 12.0 70 1 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 15.0 8 350.0 165.0 3693.0 11.5 70 1 [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 18.0 8 318.0 150.0 3436.0 11.0 70 1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 16.0 8 304.0 150.0 3433.0 12.0 70 1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 17.0 8 302.0 140.0 3449.0 10.5 70 1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
... ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86.0 2790.0 15.6 82 1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
394 44.0 4 97.0 52.0 2130.0 24.6 82 2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
395 32.0 4 135.0 84.0 2295.0 11.6 82 1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
396 28.0 4 120.0 79.0 2625.0 18.6 82 1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
397 31.0 4 119.0 82.0 2720.0 19.4 82 1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Here's how to generate it:
import numpy as np
import pandas as pd
import scipy.sparse as sparse
#download data
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
'Acceleration', 'Model Year', 'Origin']
df = pd.read_csv(url, names=column_names,
na_values='?', comment='\t',
sep=' ', skipinitialspace=True)
lenOfDF = (len(df))
#add numpy array
arr = sparse.coo_matrix(([1,1,1], ([0,1,2], [1,2,0])), shape=(lenOfDF,lenOfDF))
df['NumpyColumn'] = arr.toarray().tolist()
Then my model is similar to this:
g_input = Input(shape=[Xtrain.shape[1]])
H1 = Dense(512)(g_input)
H1r = Activation('relu')(H1)
H2 = Dense(256)(H1r)
H2r = Activation('relu')(H2)
H3 = Dense(256)(H2r)
H3r = Activation('relu')(H3)
H4 = Dense(128)(H3r)
H4r = Activation('relu')(H4)
H5 = Dense(128)(H4r)
H5r = Activation('relu')(H5)
H6 = Dense(64)(H5r)
H6r = Activation('relu')(H6)
H7 = Dense(32)(H6r)
Hr = Activation('relu')(H7)
g_V = Dense(1)(Hr)
generator = Model(g_input,g_V)
generator.compile(loss='binary_crossentropy', optimizer=opt)
When I call it using the dataset with the NumpyColumn(x_batch is just a split and scaled dataset of above dataframe with the numpy array passed through so it remains unchanged). I get the following error:
# generated = generator.predict(x_batch) #making prediction from the generator
generated = generator.predict(tf.convert_to_tensor(x_batch)) #making prediction from the generator
Error:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).
What am I doing wrong here? My thought is that having a array would provide the model information to make better prediction so I'm trying to test it. Is it possible to add a numpy array to a dataframe to train? Or is there an alternative approach I should be doing?
Edit 1
Above is a sample to quickly help you understand the problem. In my case after encoding/scaling the dataframe, I have a numpy array that looks like this (it's numeric representing the catergorical encodings + two numpy arrays at the end):
array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,
0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 9921.0,
20.0, 0.40457918757980704, 0.11369258150627903, 0.868421052631579,
0.47368421052631576, 0.894736842105263, 0.06688034531010473,
0.16160188713280013, 0.7368421052631579, 0.1673332894736842,
0.2099143206854345, 0.3690644464300929, 0.07097828135799109,
0.8157894736842104, 0.9210526315789473, 0.23091420289239645,
0.08623506024464939, 0.5789473684210527, 0.763157894736842, 0.0,
0.18421052631578946, 0.07949239000059796, 0.18763907099960708,
0.7368421052631579, 0.2668740256483197, 0.6842105263157894,
0.13699219747488295, 0.868421052631579, 0.868421052631579,
0.052631349139178094, 0.6842105263157894, 0.5526315789473684,
0.6842105263157894, 0.6842105263157894, 0.6842105263157894,
0.7105263157894737, 0.7105263157894737, 0.7105263157894737,
0.23684210526315788, 0.0, 0.7105263157894737, 0.5789473684210527,
0.763157894736842, 0.5263157894736842, 0.6578947368421052,
0.6842105263157894, 0.7105263157894737, 0.0, 0.5789473684210527,
0.2631578947368421, 0.6842105263157894, 0.6578947368421052,
0.42105263157894735, 0.5789473684210527, 0.42105263157894735,
0.7368421052631579, 0.7368421052631579, 0.15207999030227856,
0.8445892232119124, 0.2683721567016762, 0.3142850329243405,
0.18421052631578946, 0.19132292433056333, 0.20615136344079915,
0.14475710664724623, 0.1624920232728424, 0.6989826700898587,
0.18421052631578946, 0.21052631578947367, 0.4793448772543646,
0.7894736842105263, 0.682967263567459, 0.37139592674256894,
0.21123755190149363, 0.18421052631578946, 0.6578947368421052,
0.39473684210526316, 0.631578947368421, 0.7894736842105263,
0.36842105263157887, 0.1863353145721346, 0.7368421052631579,
0.26809396092240706, 0.22492185003691062, 0.1460488284639197,
0.631578947368421, 0.15347526114630458, 0.763157894736842,
0.2097323620058104, 0.3684210526315789, 0.631578947368421,
0.631578947368421, 0.631578947368421, 0.6842105263157894,
0.36842105263157887, 0.10507952765043811, 0.22418515695024185,
0.23755698619020282, 0.22226500126902, 0.530004040377794,
0.3421052631578947, 0.19018711711349692, 0.19629244102133708,
0.5789473684210527, 0.10526315789473684, 0.49999999999999994,
0.5263157894736842, 0.5263157894736842, 0.49999999999999994,
0.1052631578947368, 0.10526315789473678, 0.5263157894736842,
0.4736842105263157, 2013.0,
array([0. , 0. , 0. , 0.62235785, 0. ,
0.27049118, 0. , 0.31094068, 0. , 0. ,
0. , 0. , 0. , 0.4330532 , 0. ,
0. , 0.2515796 , 0. , 0. , 0. ,
0.40683705, 0.01569915, 0. , 0. , 0. ,
0.13090582, 0. , 0.49955425, 0.06970194, 0.29155406,
0. , 0. , 0.27342197, 0. , 0. ,
0. , 0.04415211, 0. , 0.03908829, 0. ,
0.07673171, 0.33199945, 0. , 0.51759815, 0. ,
0.4719149 , 0.4538082 , 0.13475986, 0. , 0. ,
0. , 0. , 0. , 0. , 0.08000553,
0. , 0.02991109, 0. , 0.5051543 , 0. ,
0.24663273, 0. , 0.50839704, 0. , 0. ,
0.05281948, 0.44884402, 0. , 0.44542992, 0.15376966,
0. , 0. , 0. , 0.39128256, 0.49497205,
0. , 0. ], dtype=float32),
array([0. , 0. , 0. , 0.62235785, 0. ,
0.27049118, 0. , 0.31094068, 0. , 0. ,
0. , 0. , 0. , 0.4330532 , 0. ,
0. , 0.25157961, 0. , 0. , 0. ,
0.40683705, 0.01569915, 0. , 0. , 0. ,
0.13090582, 0. , 0.49955425, 0.06970194, 0.29155406,
0. , 0. , 0.27342197, 0. , 0. ,
0. , 0.04415211, 0. , 0.03908829, 0. ,
0.07673171, 0.33199945, 0. , 0.51759815, 0. ,
0.47191489, 0.45380819, 0.13475986, 0. , 0. ,
0. , 0. , 0. , 0. , 0.08000553,
0. , 0.02991109, 0. , 0.50515431, 0. ,
0.24663273, 0. , 0.50839704, 0. , 0. ,
0.05281948, 0.44884402, 0. , 0.44542992, 0.15376966,
0. , 0. , 0. , 0.39128256, 0.49497205,
0. , 0. ])], dtype=object)

Problem:
You are trying to pass nested list/array objects as a feature to convert to tensor. That's the reason for the error. You can handle it at pandas level by simply converting the n length lists/arrays to n columns (check solution 2). However, usually, when working with such columns, you ideally want to process them differently in the network (such as pass this column into an LSTM for example). Therefore an ideal way is to have a multi-input model, which is usually how we work with these features in the industry (check solution 1).
Solution 1: Solving this via Multi-inputs
This is a fairly common problem especially when we are working with multiple sequences of data or multiple encodings.
One straightforward method of solving this is by creating separate inputs for each encoding.
(Assuming X_train has 9 columns) Pass 8 columns out of the 9 to the first input, and the encoding (column with list/array) as a separate input.
Concatenate these to create an 8+398 length tensor which now passes through the computation graph.
The single series with lists can be converted to a tensor/np.array by np.array(df.column.tolist()). This will convert the (398,) length series with lists to a (398, 398) shaped NumPy array.
Now, you can handle the features and the encodings separately as well before concatenating them and passing them through Dense layers. E.g passing the second input via LSTM layers.
from tensorflow.keras import layers, Model, utils, activations
g_input = layers.Input(shape=(8,)) #<--------
np_input = layers.Input(shape=(398,)) #<--------
x = layers.concatenate([g_input, np_input])
x = layers.Dense(512, activation='relu')(x)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dense(128, activation='relu')(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.Dense(32, activation='relu')(x)
g_V = layers.Dense(1, activation='sigmoid')(x)
generator = Model([g_input,np_input],g_V)
generator.compile(loss='binary_crossentropy', optimizer='adam')
utils.plot_model(generator, show_layer_names=False, show_shapes=True)
print('')
print('RESHAPING DATA TO - (398,8) and (398,398)')
generator.predict([df.drop('NumpyColumn',1).to_numpy(),
np.array(df['NumpyColumn'].tolist())]).shape
RESHAPING DATA TO - (398,8) and (398,398)
(398, 1)
Solution 2: Solving this via Pandas
If however, you don't want the encoding to be separate and are simply going to use it as a flattened feature along with other features, then you can simply just flatten the data frame over axis=1 to create 8+398 columns and then convert it to tensor.
import tensorflow as tf
from tensorflow.keras import layers, Model, utils, activations
g_input = layers.Input(shape=(406,)) #<---------
x = layers.Dense(512, activation='relu')(g_input)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dense(128, activation='relu')(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.Dense(32, activation='relu')(x)
g_V = layers.Dense(1, activation='sigmoid')(x)
generator = Model(g_input,g_V)
generator.compile(loss='binary_crossentropy', optimizer='adam')
utils.plot_model(generator, show_layer_names=False, show_shapes=True)
print('')
print('RESHAPING DATA TO - (398, 406)')
ddf = pd.concat([df.iloc[:,:-1], df.NumpyColumn.apply(pd.Series)], axis=1) #<-----
generator.predict(tf.convert_to_tensor(ddf)).shape #<-----
RESHAPING DATA TO - (398, 406)
(398, 1)

Parameterizing the construction of a Python list

By using this Python code (I'm working with Python 3.6):
length = 4
overall = [["row" + str(length + 1)] +
[1.0] + [0.0] * (length - 1)]
for i in range(1, length):
overall += [["row" + str(i + length + 1)] +
[0.0] * i + [1.0] + [0.0] * (length - (i + 1))]
I obtain the following list of lists:
OUTPUT 1:
overall = [['row5', 1.0, 0.0, 0.0, 0.0],
['row6', 0.0, 1.0, 0.0, 0.0],
['row7', 0.0, 0.0, 1.0, 0.0],
['row8', 0.0, 0.0, 0.0, 1.0]]
Now, I'd like to parametrize the piece of code above.
Given a parameter, for example, n_repetitions = 3, I'd like to obtain:
OUTPUT 2:
overall = [['row5', 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
['row6', 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
['row7', 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
['row8', 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]]
where, in each "row", the initial "group" made of 4 numerical one-element lists has been repeated n_repetitions times (3, in this example).
Which is a good way to do that in an automatic way (e.g.: by using a for loop, a list comprehension, ...)?

Yes you can use list comprehension + list addition/multiplication like so:
overall = [['row5', 1.0, 0.0, 0.0, 0.0],
['row6', 0.0, 1.0, 0.0, 0.0],
['row7', 0.0, 0.0, 1.0, 0.0],
['row8', 0.0, 0.0, 0.0, 1.0]]
overall = [[row[0]] + row[1:]*repeat for row in overall]

I want to confirm.
I follow your code
Output:
overall = [['row5', 1.0, 0.0, 0.0, 0.0],
['row6', 0.0, 1.0, 0.0, 0.0],
['row7', 0.0, 0.0, 1.0, 0.0],
['row8', 0.0, 0.0, 0.0, 1.0]]
What do you hope?
overall = [["row5"], [1], [0], [0], [0],
["row6"], [0], [1], [0], [0],
["row7"], [0], [0], [1], [0],
["row8"], [0], [0], [0], [1]]
this ?
length = 4
n_repetitions = 3
arr = [1.0] + [0.0] * (length - 1)
overall = [["row" + str(length + 1)] + arr * n_repetitions]
for i in range(1, length):
_ = [0.0] * i + [1.0] + [0.0] * (length - (i + 1))
overall += [["row" + str(i + length + 1)] + _ * n_repetitions]

overall is a list of lists
type(overall)
list
in matrix terms, this is your ID matrix without the first column:
id =[l[1:] for l in overall]
and this is your label columns:
labels = [[l[0]] for l in overall]
you can then isolate the first element of each list and repeat the rest:
n_repetitions = 3
result = [[l[0]] + l[1:]*n_repetitions for l in overall]
result
[['row5', 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
['row6', 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
['row7', 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
['row8', 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]]

How to plot clusters and centers from a multi-feature kmeans model, with Matplotlib?

I used kmeans algorithm to determine number of clusters in my dataset. In the following code, you can see that I have multiple features, some are categorical some are not. I encoded, and scaled them, and I get my optimal number of clusters.
You can download data from here:
https://www.sendspace.com/file/1cnbji
import sklearn.metrics as sm
from sklearn.preprocessing import scale
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans, SpectralClustering, MiniBatchKMeans
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('dataset.csv')
print(df.columns)
features = df[['parcela', 'bruto', 'neto',
'osnova', 'sipovi', 'nadzemno',
'podzemno', 'tavanica', 'fasada']]
trans = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['tavanica', 'fasada']),
('StandardScaler', Normalizer(), ['parcela', 'bruto', 'neto', 'osnova', 'nadzemno', 'podzemno', 'sipovi'])],
remainder='passthrough') # Default is to drop untransformed columns
features = trans.fit_transform(features)
Sum_of_squared_distances = []
for i in range(1,19):
kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 0)
kmeans.fit(features)
Sum_of_squared_distances.append(kmeans.inertia_)
plt.plot(range(1,19), Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
On the graph, the elbow method shows my optimal number of clusters as 7.
How can I plot the 7 clusters?
I want to see centroids on the graph, and scatter plot with 7 different colors of the clusters.

Given Plot: kmeans clustering centroid, where centers is one dimension. The centers array has a (3, 2) shape, with x as (3, 1) and y as (3, 1).
The method demonstrated for this one dimension of centers, has been adapted to produce a solution for the seven dimensions of centers, produced by the model for this question.
The centers returned for the model in this question has seven dimensions, with a shape of (7, 14) where 14 is 7 sets of x and y values.
This solution answers the question, How to plot the clusters & centers?
It does not offer comments or interpretation on the results of the model, that would need to be a different question asked in SE: Cross Validated or SE: Data Science.
# uses the imports as shown in the question
from matplotlib.patches import Rectangle, Patch # for creating a legend
from matplotlib.lines import Line2D
# beginning with
features = trans.fit_transform(features)
# create the model and fit it to features
kmeans_model2 = KMeans(n_clusters=7, init='k-means++', random_state=0).fit(features)
# find the centers; there are 7
centers = np.array(kmeans_model2.cluster_centers_)
# unique markers for the labels
markers = ['o', 'v', 's', '*', 'p', 'd', 'h']
# get the model labels
labels = kmeans_model2.labels_
labels_unique = set(labels)
# unique colors for each label
colors = sns.color_palette('husl', n_colors=len(labels_unique))
# color map with labels and colors
cmap = dict(zip(labels_unique, colors))
# plot
# iterate through each group of 2 centers
for j in range(0, len(centers)*2, 2):
plt.figure(figsize=(6, 6))
x_features = features[:, j]
y_features = features[:, j+1]
x_centers = centers[:, j]
y_centers = centers[:, j+1]
# add the data for each label to the plot
for i, l in enumerate(labels):
# print(f'Label: {l}') # uncomment as needed
# print(f'feature x coordinates for label:\n{x_features[i]}') # uncomment as needed
# print(f'feature y coordinates for label:\n{y_features[i]}') # uncomment as needed
plt.plot(x_features[i], y_features[i], color=colors[l], marker=markers[l], alpha=0.5)
# print values for given plot, rounded for easier interpretation; all 4 can be commented out
print(f'feature labels:\n{list(labels)}')
print(f'x_features:\n{list(map(lambda x: round(x, 3), x_features))}')
print(f'y_features:\n{list(map(lambda x: round(x, 3), y_features))}')
print(f'x_centers:\n{list(map(lambda x: round(x, 3), x_centers))}')
print(f'y_centers:\n{list(map(lambda x: round(x, 3), y_centers))}')
# add the centers
# this loop is to color the center marker to correspond to the color of the corresponding label.
for k in range(len(centers)):
plt.scatter(x_centers[k], y_centers[k], marker="X", color=colors[k])
# title
plt.title(f'Features: Dimension {int(j/2)}')
# create the rectangles for the legend
patches = [Patch(color=v, label=k) for k, v in cmap.items()]
# create centers marker for the legend
black_x = Line2D([], [], color='k', marker='X', linestyle='None', label='centers', markersize=10)
# add the legend
plt.legend(title='Labels', handles=patches + [black_x], bbox_to_anchor=(1.04, 0.5), loc='center left', borderaxespad=0, fontsize=15)
plt.show()
Output of plotting
Many of the plotted features have overlapping values and centers.
The x and y values for features and centers have been printed to more easily see the overlap, and to confirm the plotted values.
The responsible print lines can be commented out or removed, when no longer needed.
Feature 0
feature labels:
[6, 1, 1, 1, 5, 5, 3, 4, 1, 0, 1, 5, 5, 1, 1, 1, 1, 1, 4, 1, 2, 0, 1, 3, 3, 4, 2, 2, 4, 3, 3, 2, 6, 3, 1, 2, 4, 6, 1, 4, 4, 1, 4, 5, 3, 1, 1, 1, 1, 1, 0, 1, 5, 5, 1, 1, 3, 3, 3, 1, 3, 1, 3, 3, 0, 1, 2, 2, 2, 6]
x_features:
[0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
y_features:
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
x_centers:
[1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0]
y_centers:
[0.0, 0.0, 1.0, 0.0, -0.0, -0.0, 1.0]
Feature 1
feature labels:
[6, 1, 1, 1, 5, 5, 3, 4, 1, 0, 1, 5, 5, 1, 1, 1, 1, 1, 4, 1, 2, 0, 1, 3, 3, 4, 2, 2, 4, 3, 3, 2, 6, 3, 1, 2, 4, 6, 1, 4, 4, 1, 4, 5, 3, 1, 1, 1, 1, 1, 0, 1, 5, 5, 1, 1, 3, 3, 3, 1, 3, 1, 3, 3, 0, 1, 2, 2, 2, 6]
x_features:
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
y_features:
[1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
x_centers:
[1.0, -0.0, -0.0, -0.0, -0.0, 0.0, 0.0]
y_centers:
[0.0, 1.0, 0.0, -0.0, 0.0, 0.0, 1.0]
Feature 2
feature labels:
[6, 1, 1, 1, 5, 5, 3, 4, 1, 0, 1, 5, 5, 1, 1, 1, 1, 1, 4, 1, 2, 0, 1, 3, 3, 4, 2, 2, 4, 3, 3, 2, 6, 3, 1, 2, 4, 6, 1, 4, 4, 1, 4, 5, 3, 1, 1, 1, 1, 1, 0, 1, 5, 5, 1, 1, 3, 3, 3, 1, 3, 1, 3, 3, 0, 1, 2, 2, 2, 6]
x_features:
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
y_features:
[0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
x_centers:
[0.0, -0.0, 0.125, 1.0, 0.0, 0.0, 0.0]
y_centers:
[0.0, -0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
Feature 3
feature labels:
[6, 1, 1, 1, 5, 5, 3, 4, 1, 0, 1, 5, 5, 1, 1, 1, 1, 1, 4, 1, 2, 0, 1, 3, 3, 4, 2, 2, 4, 3, 3, 2, 6, 3, 1, 2, 4, 6, 1, 4, 4, 1, 4, 5, 3, 1, 1, 1, 1, 1, 0, 1, 5, 5, 1, 1, 3, 3, 3, 1, 3, 1, 3, 3, 0, 1, 2, 2, 2, 6]
x_features:
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
y_features:
[0.298, 0.193, 0.18, 0.336, 0.181, 0.174, 0.197, 0.23, 0.175, 0.212, 0.196, 0.186, 0.2, 0.15, 0.141, 0.304, 0.108, 0.101, 0.304, 0.105, 0.459, 0.18, 0.16, 0.224, 0.216, 0.246, 0.139, 0.111, 0.227, 0.177, 0.159, 0.25, 0.298, 0.223, 0.335, 0.431, 0.17, 0.381, 0.255, 0.222, 0.296, 0.156, 0.202, 0.145, 0.195, 0.15, 0.141, 0.18, 0.336, 0.175, 0.212, 0.196, 0.186, 0.2, 0.15, 0.141, 0.177, 0.177, 0.177, 0.177, 0.177, 0.177, 0.224, 0.224, 0.18, 0.16, 0.222, 0.202, 0.18, 0.336]
x_centers:
[0.0, -0.0, 0.875, -0.0, 1.0, 0.0, 0.0]
y_centers:
[0.196, 0.188, 0.249, 0.196, 0.237, 0.182, 0.328]
Feature 4
feature labels:
[6, 1, 1, 1, 5, 5, 3, 4, 1, 0, 1, 5, 5, 1, 1, 1, 1, 1, 4, 1, 2, 0, 1, 3, 3, 4, 2, 2, 4, 3, 3, 2, 6, 3, 1, 2, 4, 6, 1, 4, 4, 1, 4, 5, 3, 1, 1, 1, 1, 1, 0, 1, 5, 5, 1, 1, 3, 3, 3, 1, 3, 1, 3, 3, 0, 1, 2, 2, 2, 6]
x_features:
[0.712, 0.741, 0.763, 0.704, 0.749, 0.741, 0.754, 0.735, 0.744, 0.738, 0.743, 0.747, 0.758, 0.759, 0.749, 0.714, 0.766, 0.748, 0.728, 0.755, 0.681, 0.752, 0.762, 0.734, 0.721, 0.747, 0.749, 0.756, 0.737, 0.748, 0.742, 0.724, 0.712, 0.733, 0.73, 0.688, 0.722, 0.705, 0.777, 0.749, 0.733, 0.744, 0.733, 0.764, 0.739, 0.76, 0.749, 0.763, 0.704, 0.744, 0.738, 0.743, 0.747, 0.758, 0.759, 0.749, 0.748, 0.748, 0.748, 0.748, 0.748, 0.748, 0.734, 0.734, 0.752, 0.762, 0.749, 0.733, 0.763, 0.704]
y_features:
[0.614, 0.636, 0.612, 0.601, 0.631, 0.64, 0.62, 0.624, 0.636, 0.633, 0.632, 0.63, 0.61, 0.629, 0.641, 0.616, 0.629, 0.65, 0.601, 0.644, 0.539, 0.628, 0.623, 0.627, 0.65, 0.603, 0.641, 0.641, 0.616, 0.632, 0.648, 0.631, 0.614, 0.624, 0.58, 0.562, 0.666, 0.587, 0.565, 0.616, 0.591, 0.646, 0.642, 0.625, 0.631, 0.629, 0.641, 0.612, 0.601, 0.636, 0.633, 0.632, 0.63, 0.61, 0.629, 0.641, 0.632, 0.632, 0.632, 0.632, 0.632, 0.632, 0.627, 0.627, 0.628, 0.623, 0.616, 0.642, 0.612, 0.601]
x_centers:
[0.745, 0.747, 0.73, 0.741, 0.735, 0.752, 0.708]
y_centers:
[0.63, 0.625, 0.611, 0.632, 0.62, 0.625, 0.604]
Feature 5
feature labels:
[6, 1, 1, 1, 5, 5, 3, 4, 1, 0, 1, 5, 5, 1, 1, 1, 1, 1, 4, 1, 2, 0, 1, 3, 3, 4, 2, 2, 4, 3, 3, 2, 6, 3, 1, 2, 4, 6, 1, 4, 4, 1, 4, 5, 3, 1, 1, 1, 1, 1, 0, 1, 5, 5, 1, 1, 3, 3, 3, 1, 3, 1, 3, 3, 0, 1, 2, 2, 2, 6]
x_features:
[0.164, 0.096, 0.103, 0.171, 0.091, 0.106, 0.094, 0.132, 0.105, 0.098, 0.102, 0.101, 0.115, 0.079, 0.095, 0.135, 0.075, 0.088, 0.126, 0.063, 0.186, 0.088, 0.075, 0.134, 0.107, 0.134, 0.09, 0.072, 0.16, 0.097, 0.073, 0.123, 0.165, 0.154, 0.133, 0.158, 0.084, 0.11, 0.105, 0.1, 0.164, 0.075, 0.1, 0.075, 0.135, 0.069, 0.095, 0.103, 0.171, 0.105, 0.098, 0.102, 0.101, 0.115, 0.079, 0.095, 0.097, 0.097, 0.097, 0.097, 0.097, 0.097, 0.134, 0.134, 0.088, 0.075, 0.1, 0.1, 0.103, 0.171]
y_features:
[0.001, 0.002, 0.001, 0.001, 0.001, 0.002, 0.002, 0.001, 0.001, 0.001, 0.001, 0.005, 0.002, 0.001, 0.002, 0.001, 0.002, 0.001, 0.001, 0.002, 0.0, 0.001, 0.001, 0.002, 0.0, 0.001, 0.001, 0.002, 0.002, 0.002, 0.0, 0.001, 0.001, 0.001, 0.004, 0.004, 0.001, 0.002, 0.001, 0.001, 0.002, 0.0, 0.001, 0.001, 0.001, 0.001, 0.0, 0.001, 0.001, 0.001, 0.0, 0.0, 0.003, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.0, 0.002, 0.001, 0.001, 0.0, 0.001, 0.001, 0.002, 0.002, 0.002, 0.001]
x_centers:
[0.093, 0.1, 0.116, 0.112, 0.125, 0.101, 0.152]
y_centers:
[0.001, 0.001, 0.002, 0.001, 0.001, 0.002, 0.001]
Feature 6
feature labels:
[6, 1, 1, 1, 5, 5, 3, 4, 1, 0, 1, 5, 5, 1, 1, 1, 1, 1, 4, 1, 2, 0, 1, 3, 3, 4, 2, 2, 4, 3, 3, 2, 6, 3, 1, 2, 4, 6, 1, 4, 4, 1, 4, 5, 3, 1, 1, 1, 1, 1, 0, 1, 5, 5, 1, 1, 3, 3, 3, 1, 3, 1, 3, 3, 0, 1, 2, 2, 2, 6]
x_features:
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.002, 0.0, 0.0, 0.001, 0.0, 0.001, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.001, 0.001, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.001, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.001, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
y_features:
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.001, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
x_centers:
[0.0, 0.0, 0.0, 0.0, 0.0, 0.001, 0.0]
y_centers:
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Update with all dimensions on one plot
As requested by the OP
# plot
plt.figure(figsize=(16, 8))
for j in range(0, len(centers)*2, 2):
x_features = features[:, j]
y_features = features[:, j+1]
x_centers = centers[:, j]
y_centers = centers[:, j+1]
# add the data for each label to the plot
for i, l in enumerate(labels):
plt.plot(x_features[i], y_features[i], marker=markers[int(j/2)], color=colors[int(j/2)], alpha=0.5)
# add the centers
for k in range(len(centers)):
plt.scatter(x_centers[k], y_centers[k], marker="X", color=colors[int(j/2)])
# create the rectangles for the legend
patches = [Patch(color=v, label=k) for k, v in cmap.items()]
# create centers marker for the legend
black_x = Line2D([], [], color='k', marker='X', linestyle='None', label='centers', markersize=10)
# add the legend
plt.legend(title='Labels', handles=patches + [black_x], bbox_to_anchor=(1.04, 0.5), loc='center left', borderaxespad=0, fontsize=15)
plt.show()
As noted with the individual plots, there's a lot of overlap.

y-axis labels don't get displayed properly with Plotly

I've a problem with displaying the y-axis labels properly with plotly.
This is my index:
index = ['2015-11','2015-12','2016-01','2016-02','2016-03','2016-04','2016-05',
'2016-06','2016-07','2016-08','2016-09','2016-10','2016-11']
the data
data = [[0.115, 0.077, 0.0, 0.038, 0.0, 0.038, 0.038, 0.077, 0.0, 0.077, 0.077, 0.038],
[0.073, 0.055, 0.083, 0.055, 0.018, 0.055, 0.073, 0.037, 0.028, 0.037, 0.009, 0.0],
[0.099, 0.027, 0.036, 0.045, 0.063, 0.153, 0.027, 0.045, 0.063, 0.027, 0.0, 0.0],
[0.076, 0.038, 0.053, 0.061, 0.098, 0.068, 0.038, 0.061, 0.023, 0.0, 0.0, 0.0],
[0.142, 0.062, 0.027, 0.08, 0.097, 0.044, 0.071, 0.027, 0.0, 0.0, 0.0, 0.0],
[0.169, 0.026, 0.026, 0.026, 0.013, 0.013, 0.091, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.138, 0.121, 0.052, 0.017, 0.034, 0.017, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.297, 0.081, 0.054, 0.054, 0.054, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.095, 0.016, 0.024, 0.04, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.102, 0.023, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.054, 0.027, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.087, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
I create a heatmap with following code:
import plotly.figure_factory as ff
from plotly.offline import iplot
import re
cols = range(12)
index = index
df = pd.DataFrame(data, columns = cols)
df.index = index
x = df.columns.tolist()
y = df.index.tolist()
z = df.values
annotation_text = np.char.mod('%.0f%%', df*100).tolist()
annotation_text = [[re.sub('^0%$','', x) for x in l] for l in annotation_text]
colorscale=[[0.0, 'rgb(248, 248, 255)'],
[0.04, 'rgb(224, 228, 236)'],
[0.08, 'rgb(196, 210, 226)'],
[0.12, 'rgb(158, 178, 226)'],
[0.16, 'rgb(134, 158, 227)'],
[0.2, 'rgb(122, 146, 227)'],
[1.0, 'rgb(65, 105, 225)'],
]
fig = ff.create_annotated_heatmap(z, x=x, y=y, colorscale= colorscale,
annotation_text = annotation_text)
fig.layout.yaxis.autorange = 'reversed'
offline.iplot(fig, filename='annotated_heatmap_color.html')
Which produces the correct heatmap but with the y-axis labels missing
When I change the index to shorter values like '5-11' with
index = [x[3:] for x in index]
the labels show up.
I don't understand the logic behind that and would like to know how to fix it.

Plotly.py uses plotly.js under the hood, which is transforming your date strings to a numerical date format and misplacing them on your non numerical axis.
To explicit a categorical axis you just have to add:
fig.layout.yaxis.type = 'category'

Scipy optimize.curve_fit sometimes won't converge

I'm trying to use numpy.optimize.curve_fit to estimate the frequency and phase of an on/off sequence.
This is the code I'm using:
from numpy import *
from scipy import optimize
row = array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0,])
def fit_func(x, a, b, c, d):
return c * sin (a * x + b) + d
p0 = [(pi/10.0), 5.0, row.std(), row.mean()]
result = optimize.curve_fit(fit_func, arange(len(row)), row, p0)
print result
This works. But on some rows, even though they seem perfectly ok, it fails.
Example of failing row:
row = array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,])
The error is:
RuntimeError: Optimal parameters not found: Both actual and predicted relative reductions in the sum of squares are at most 0.000000 and the relative error between two consecutive iterates is at most 0.000000
Which tells me very little about what's happened.
A quick test shows that varying the parameters in p0 will cause that row to succeed... and others to fail. Why is that?

I tried both rows of data that you provided and both worked for me just fine. I'm using Scipy 0.8.0rc3. What version are you using? Another thing that might help is to set c and d to fixed values since they really should be the same every time. I set c to 0.6311786 and d to .5. You could also use an fft with zero padding and quadratic fitting around the peak to find the frequency if you want another method. Really, any pitch estimation method is applicable since you are looking for the fundamental frequency.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to loop over a dataframe and create list - python

Related

Is it possible to pass a dataframe to TF/Keras that has a numpy array for each row?

Parameterizing the construction of a Python list

How to plot clusters and centers from a multi-feature kmeans model, with Matplotlib?

y-axis labels don't get displayed properly with Plotly

Scipy optimize.curve_fit sometimes won't converge

Categories

Resources