I'm trying to evaluate the accuracy and performance of several KNN Classifiers.
DataTest["ConfM_K30_ST"] = confusion_matrix(
    DataTest["ST_Class"],
    DataTest["KNN_K30_ST"]
)
aux = DataTest["ST_Class"]
aux1 = DataTest["KNN_K30_ST"]
When trying to compare the Predicted Result with the Originals I receive the following error:
ValueError: Length of values does not match length of index
DataTest is my DataFrame containing 20% of the Data. The labeled data is, for this example, "ST_Class" and the predicted data is "KNN_K30_ST".
To verify what was going on, I assigned these two columns to aux and aux1. They are both of type Series with shape (3224,).
The only problem I could see is that the indexes are not contiguous: they neither start at 0 nor end at 3223. To make this easier to follow, see the image below.
Link: https://i.imgur.com/Splhr62.png
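The only error I can see is that you are trying to store the confusion matrix as a column in the DataFrame. This isn't possible because of the size mismatch: the confusion matrix has one row and one column per class, while a DataFrame column needs one value per row (3224 here).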
Here's a small sample
df1
   a
0  1
2  1
4  1

df2
   a
1  0
3  1
5  0

# Output from the confusion matrix
confusion_matrix(df1, df2)
array([[0, 0],
       [2, 1]])
As suggested, I was mistakenly trying to store a confusion matrix in a DataFrame.
My solution was to set it in a Dictionary.
Thank you all for the quick replies!
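The dictionary approach mentioned above could look roughly like the sketch below. The small DataTest frame and the name conf_matrices are made up for illustration; only the column names come from the question.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical stand-in for DataTest, with a non-contiguous index like the original
DataTest = pd.DataFrame(
    {"ST_Class": [1, 0, 1, 1, 0], "KNN_K30_ST": [1, 0, 0, 1, 1]},
    index=[3, 7, 8, 12, 20],
)

# Keep each confusion matrix in a dict instead of a DataFrame column:
# the matrix is (n_classes, n_classes), not one value per row.
conf_matrices = {}
conf_matrices["K30_ST"] = confusion_matrix(
    DataTest["ST_Class"],
    DataTest["KNN_K30_ST"],
)

print(conf_matrices["K30_ST"].shape)  # (2, 2) for two classes
```

Note that the non-contiguous index is harmless here: confusion_matrix compares the two columns position by position.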
I am learning pandas and NumPy in Python. I was trying to apply conditional statements to my DataFrame and I encountered a ValueError due to a shape mismatch. Please help me understand why, thank you!
Here is a look of my simple dataset:
I was trying to filter the DataFrame if the following conditions are met:
area > 8 and area < 10
Here is the result that I have received:
The results are fine if I print each condition individually, and I can't understand why the two results can't be combined into a single DataFrame.
The problem is here: brics[brics['area'] > 8] and brics[brics['area'] < 10].
The inner expression in both cases produces a 5-element Boolean vector, and both have the same shape. The first has 4 trues and 1 false; the second has 3 trues and 2 falses. But brics[xxx] selects a subset of rows: where xxx has 4 trues it produces a 4-row DataFrame, and where xxx has 3 trues it produces a 3-row DataFrame. Those two have different shapes, so you can't combine them.
The KEY is that you want to combine these BEFORE you use them as indexes:
x = brics[ np.logical_and( brics['area'] > 8, brics['area'] < 10 ) ]
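As a minimal sketch of the same idea, the two masks can also be combined with the & operator before indexing. The brics frame below is hypothetical (the original was posted as an image), and only its area column matters.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the brics DataFrame from the question
brics = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D", "E"],
        "area": [8.5, 17.1, 3.3, 9.6, 1.2],
    }
)

# Combine the two Boolean masks first; the parentheses are required
# because & binds more tightly than the comparisons.
mask = (brics["area"] > 8) & (brics["area"] < 10)
selected = brics[mask]

# np.logical_and builds the same mask
same = brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]

print(selected)
```

Both forms index the DataFrame once with a single full-length mask, so no shape mismatch can occur.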
And by the way, you made this much harder for us than it should have been because you posted an image instead of code we could cut and paste.
I had a DataFrame called "segments" that looks like the below:
   ORIGIN_AIRPORT_ID  DEST_AIRPORT_ID  FL_COUNT  ORIGIN_INDEX  DEST_INDEX  OUTDEGREE    WEIGHT
0              10135            10397        77           119         373          3  0.333333
1              10135            11433        85           119        1375          3  0.333333
Using this, I created two Boolean Series objects: One in which I'm storing all the IDs for which the WEIGHT column is not 0 and one in which they are:
Zeroes = (segments['WEIGHT'] == 0).groupby(segments['ORIGIN_INDEX']).all()
Non_zeroes = (segments['WEIGHT'] != 0).groupby(segments['ORIGIN_INDEX']).all()
I want to do two things (because I'm not sure which one this task needs):
1. Create a NumPy vector where all "True" values in the Non_zeroes Series are set to the result of 1/4191 (about 0.024%) and all "True" values in the Zeroes Series are set to 0 (or the same logic using the True and False values of one Series), keeping the IDs (e.g. ORIGIN_INDEX 119 0.024%, etc.)
2. Create a NumPy vector that is just a list of the percentages and zeroes, without the IDs
EDIT to add extra detail requested!
I tried using a condition as a variable, then using .loc to apply it:
cond_array = copied.WEIGHT is not 0
df.loc[cond_array, ID] = 1/4191
I tried using from_coo(), toarray(), and DataFrame to convert:
pd.Series.sparse.from_coo(P, dense_index=True)
P.toarray()
pd.DataFrame(P)
Finally, I tried applying logic to the DF instead of the COO Matrix. I THINK this gets close, but it is still failing. I believe it fails because it is not including the 0s (copied is just a DF that's a copy of segments):
copied['WEIGHT'] = copied.loc[copied['WEIGHT'] != 0, 'WEIGHT'] = float((1/len(copied))) #0.00023860653
The last code passes the first two tests (testing if it's an array and that it sums to 1.0), but fails the last
assert np.isclose(x0.max(), 1.0/n_actual, atol=10*n*np.finfo(float).eps), "x0` values seem off..."
EDIT 2:
Had the wrong count. It was supposed to be 1/300, not 1/4191. All fixed now, thanks all who took a look :)
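For reference, one way to build both vectors from the Boolean Series is np.where. This is only a sketch under my reading of the task: the four-row segments frame is invented, n = 300 is the count from the second edit, and with_ids and values_only are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the segments DataFrame
segments = pd.DataFrame(
    {
        "ORIGIN_INDEX": [119, 119, 200, 373],
        "WEIGHT": [0.333333, 0.333333, 0.0, 1.0],
    }
)

# True for every ORIGIN_INDEX whose weights are all non-zero
non_zeroes = (segments["WEIGHT"] != 0).groupby(segments["ORIGIN_INDEX"]).all()

n = 300  # assumed count, taken from the question's second edit

# Variant 1: values keyed by ORIGIN_INDEX
with_ids = pd.Series(np.where(non_zeroes, 1.0 / n, 0.0), index=non_zeroes.index)

# Variant 2: plain NumPy vector without the IDs
values_only = with_ids.to_numpy()

print(with_ids)
```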
I'm doing a project on hierarchical clustering, and I'm writing some code that runs AgglomerativeClustering with every possible combination of 'affinity' and 'linkage', two parameters you can set. The problem arises when I try to fit the data. The dataset has shape (1300, 8) and was read using 'index_col=0' to get rid of the useless first column (8 columns remain after dropping it).
The for loop over linkage works fine if run separately; the problem is with the affinity loop.
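from sklearn.cluster import AgglomerativeClustering

dataset = ...  # DataFrame read from the csv file with index_col=0
aff = ["l1", "l2", "manhattan", "cosine", "precomputed", "euclidean"]
link = ["complete", "average", "single"]
# Try every affinity/linkage combination
for a in aff:
    for l in link:
        ds = dataset
        ac_tune = AgglomerativeClustering(n_clusters=5, affinity=a, linkage=l)
        ac_tune.fit(ds)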
The error is the following:
IndexError: index 8 is out of bounds for axis 1 with size 8
It fails when the loop reaches the "precomputed" affinity. For this option, the input to fit needs to be a square distance matrix instead of the raw data.
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
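A rough sketch of what the "precomputed" case expects is shown below. The random X is a stand-in for the real dataset, and the choice of Euclidean distances is an assumption. As far as I know, the affinity keyword was renamed to metric in newer scikit-learn releases, so the sketch picks whichever one the installed version accepts.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # hypothetical stand-in for the raw dataset

# "precomputed" expects a square (n_samples, n_samples) distance matrix
D = pairwise_distances(X, metric="euclidean")

# Use `metric` if this scikit-learn version has it, otherwise `affinity`
params = inspect.signature(AgglomerativeClustering).parameters
key = "metric" if "metric" in params else "affinity"

model = AgglomerativeClustering(
    n_clusters=5, linkage="average", **{key: "precomputed"}
)
labels = model.fit_predict(D)

print(D.shape, labels.shape)
```

In the original loop, this would mean passing D instead of ds whenever a is "precomputed", or simply leaving that option out of aff.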
My data looks like this. They are floats and they are in a big numpy array [700000,3]. There are no empty fields.
Label | Values1 | Values2
1. | 0.01 | 0.01
1. | ... | ...
1. |
2. |
2. |
3. |
...
The idea is to feed in the set of values1 and values2 and have it identify the label using classification.
But I don't want to feed the data row by row, but input all values1/2 that belong to label 1 as a set (e.g. inputting the first 3 rows is supposed to return [1,0,...], inputting the next 2 rows as a set [0,1,...])
Is there a non-complex way of feeding the data in this way? (i.e. feed batch where column label equals 1)
I am currently sorting the data and thinking about using pointers to the start and having loops which check if the next row is equal to the current to find a pointer to the end of the set and get the number of rows of that batch. But this more or less prevents randomizing input order.
Since you have your data in a numpy array (let's call it data), you can use
single_digit = data[(data[:,0] == 1.)][: , 1:]
which will compare the zeroth element of each row with the digit (1. in this case) and select only the rows having the label 1.. From these rows, it takes the elements at index 1 and 2, i.e. Values1 and Values2. A working example is below. You can use a for loop to iterate over all labels contained in the data set and construct a numpy array for each label with
single_digit = data[(data[:,0] == label_of_this_iteration)][: , 1:]
and then feed these arrays to the network. Within TensorFlow you can easily feed batches of different length, if you do not specify the first dimension of the corresponding placeholders.
import numpy as np
# Generate some data with three columns (label, Values1, Values2)
n = 20
ints = np.random.randint(1,6,(n, 1))
dous = np.random.uniform(size=(n,2))
data = np.hstack((ints, dous))
print(data)
# Extract the second and third columns of all rows having the label 1.0
ones = data[(data[:,0] == 1.)][: , 1:]
print(ones)
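The for loop over all labels described above could be sketched as follows. The names batches and labels are illustrative, and shuffling the label order is one possible way to randomize which set is fed first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same layout as above: column 0 is the label, columns 1 and 2 are the values
n = 20
label_col = rng.integers(1, 6, size=(n, 1)).astype(float)
values = rng.uniform(size=(n, 2))
data = np.hstack((label_col, values))

# Distinct labels, visited in random order
labels = np.unique(data[:, 0])
rng.shuffle(labels)

# One array of (Values1, Values2) rows per label; the sets may differ in length
batches = {}
for label in labels:
    batches[label] = data[data[:, 0] == label][:, 1:]

for label, batch in batches.items():
    print(label, batch.shape)
```

No sorting or start/end pointers are needed, because the Boolean mask picks out the rows of a label wherever they sit in the array.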
Ideally use TFRecords format.
This approach makes it easier to mix and match data sets and network architectures.
For details on what this JSON-like structure looks like, see example.proto.
If I have a large csr_matrix A, I want to sum over its columns, simply
A.sum(axis=0)
does this for me, right? Are the corresponding axis values: 1->rows, 0->columns?
I am stuck when I want to sum over columns with weights that are specified in a list, e.g. [1 2 3 4 5 4 3 ... 4 2 5], with the same length as the number of rows in the csr_matrix A. To be more clear, I want the inner product of each column vector with this weight vector. How can I achieve this with Python?
This is a part of my code:
uniFeature = csr_matrix(uniFeature)
[I, J] = uniFeature.shape
sumfreq = uniFeature.sum(axis=0)
sumratings = []
for j in range(J):
    column = uniFeature.getcol(j)
    column = column.toarray()
    sumtemp = np.dot(ratings, column)
    sumratings.append(sumtemp)
sumfreq = sumfreq.toarray()
average = np.true_divide(sumratings, sumfreq)
(Numpy is imported as np) There is a weight vector "ratings", the program is supposed to output the average rating for each column of the matrix "uniFeature".
I tried taking the dot product of column = uniFeature.getcol(j) directly with ratings (which is a list), but got an error saying the formats do not agree. It works after column.toarray(). But doesn't converting each column back to dense form defeat the purpose of using a sparse matrix, and isn't it very slow? I ran the code above and it is too slow to produce results. I guess there should be a way to take the dot product of the vector "ratings" with each column of the sparse matrix efficiently.
Thanks in advance!
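One possible vectorized approach, sketched on a tiny made-up matrix: multiplying the transposed sparse matrix by the weight vector gives the inner product of every column with the weights in a single call, with no per-column densification. The 3 x 3 values and the ratings below are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 x 3 example; rows correspond to ratings, columns to features
uniFeature = csr_matrix(
    np.array(
        [
            [1.0, 0.0, 2.0],
            [0.0, 3.0, 0.0],
            [4.0, 0.0, 5.0],
        ]
    )
)
ratings = np.array([1.0, 2.0, 3.0])  # one weight per row

# Plain column sums (axis=0 collapses the rows), flattened to a 1-D array
sumfreq = np.asarray(uniFeature.sum(axis=0)).ravel()

# Inner product of each column with the weight vector, all columns at once
sumratings = uniFeature.T.dot(ratings)

# Weighted average per column; assumes no column sums to zero
average = sumratings / sumfreq

print(average)
```

This also answers the axis question: sum(axis=0) adds up the rows and returns one value per column.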