Shuffle with cross_validation - python

I have a dataset like this:
[ 5. , 2. , 15. , 0.25535303],
[ 5. , 3. , 15. , 6.72465845],
[ 5. , 4. , 15. , 5.62719504],
[ 5. , 5. , 15. , 5.61760597],
[ 5. , 6. , 15. , 4.9561533 ],
[ 6. , 2. , 15. , 0.2709665 ],
[ 6. , 3. , 15. , 6.07004364],
[ 6. , 4. , 15. , 5.62719504],
[ 6. , 5. , 15. , 5.54684885],
[ 6. , 6. , 15. , 5.32846201],
[ 2. , 2. , 20. , 3.79257349],
[ 2. , 3. , 20. , 4.00440964],
[ 2. , 4. , 20. , 4.37965706],
[ 2. , 5. , 20. , 3.92216922],
[ 2. , 6. , 20. , 3.41378368],
[ 3. , 2. , 20. , 0.13500398],
[ 3. , 3. , 20. , 4.38384781],
[ 3. , 4. , 20. , 5.17229688],
[ 3. , 5. , 20. , 5.00464056],
The third column takes values from 15 to 35. I want to apply cross-validation, but I suspect that, because the rows are ordered by the third column, each K-Fold block would contain only a single value of that column, which would negatively affect my model.
Therefore, my solution is:
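import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.utils import shuffle

dataset_shuffle = shuffle(dataset)
X = dataset_shuffle[["A", "B", "C"]]  # a list of column names selects several columns
y = dataset_shuffle["D"]
result = cross_validate(estimator, X, y, scoring=scoretypes, cv=5, return_train_score=False)
r2 = result['test_r2'].mean()
mselist = -result['test_neg_mean_squared_error']
rmse = np.sqrt(mselist).mean()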
Do you consider this a proper solution to my problem?
Is my solution equivalent to the following?
from sklearn.model_selection import KFold

X = dataset[["A", "B", "C"]]
y = dataset["D"]
cv = KFold(n_splits=5, shuffle=True)
result = cross_validate(estimator,X,y,scoring=scoretypes,cv=cv,return_train_score=False)
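A minimal sketch comparing the two approaches, assuming dataset is a pandas DataFrame with columns A to D. The synthetic data, the LinearRegression estimator, and the random_state values are illustrative assumptions and do not come from the question. For a regressor, an integer cv makes cross_validate use KFold without shuffling, so shuffling the rows beforehand and passing KFold(shuffle=True) both randomize fold membership. The individual splits differ, so the scores should be similar but not identical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.utils import shuffle

# Synthetic stand-in for the dataset (assumption): rows ordered by column C
rng = np.random.default_rng(0)
n = 100
dataset = pd.DataFrame({
    "A": rng.integers(2, 7, n).astype(float),
    "B": rng.integers(2, 7, n).astype(float),
    "C": np.sort(rng.choice([15.0, 20.0, 25.0, 30.0, 35.0], n)),
})
dataset["D"] = 0.5 * dataset["A"] + 0.1 * dataset["C"] + rng.normal(0, 1, n)

scoretypes = ["r2", "neg_mean_squared_error"]
estimator = LinearRegression()  # placeholder estimator (assumption)

# Approach 1: shuffle the rows first, then use plain 5-fold CV
shuffled = shuffle(dataset, random_state=0)
res_pre = cross_validate(estimator, shuffled[["A", "B", "C"]], shuffled["D"],
                         scoring=scoretypes, cv=5)

# Approach 2: keep the row order and let KFold shuffle the indices
cv = KFold(n_splits=5, shuffle=True, random_state=0)
res_kfold = cross_validate(estimator, dataset[["A", "B", "C"]], dataset["D"],
                           scoring=scoretypes, cv=cv)

for name, res in [("pre-shuffle", res_pre), ("KFold(shuffle=True)", res_kfold)]:
    rmse = np.sqrt(-res["test_neg_mean_squared_error"]).mean()
    print(name, "r2:", res["test_r2"].mean(), "rmse:", rmse)
```

Passing a KFold object keeps the shuffling inside the cross-validation step and leaves the original row order of the dataset untouched, which can make later debugging easier.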

Related

replacing specific columns values in 2d array numpy

How do I replace the values in columns 4 and 5 (zero-based) of utl with the new_values array and keep the remaining columns as they are?
utl = np.array([[ 3. , 134.4 , 17. , 135.05 , 22. , 135.25 , 0.04 ],
[ 12. , 134.3 , 17. , 135.05 , 22. , 135.8 , 0.15 ]])
new_values=np.array([[ 27., 135.45],
[ 27., 136.55]])
I tried this, but it does not work:
# utl[:,[4,5]] = new_values
# utl[:,4] = new_values[:,0]
The output must be:
#values changed
[[ 3. , 134.4 , 17. , 135.05 , | 27. , 135.45 |, 0.04 ],
[ 12. , 134.3 , 17. , 135.05 , | 27. , 136.55 |, 0.15 ]])
This works fine, as expected:
utl[:, [4,5]] = new_values
output:
array([[ 3. , 134.4 , 17. , 135.05, 27. , 135.45, 0.04],
[ 12. , 134.3 , 17. , 135.05, 27. , 136.55, 0.15]])
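For reference, a self-contained sketch of the assignment using the arrays from the question. The variable utl_slice is a name introduced here for illustration; it shows that a plain slice does the same job when the target columns are adjacent.

```python
import numpy as np

utl = np.array([[3.0, 134.4, 17.0, 135.05, 22.0, 135.25, 0.04],
                [12.0, 134.3, 17.0, 135.05, 22.0, 135.8, 0.15]])
new_values = np.array([[27.0, 135.45],
                       [27.0, 136.55]])

# Slice form on a copy: columns 4 and 5 are adjacent, so 4:6 selects both
utl_slice = utl.copy()
utl_slice[:, 4:6] = new_values

# Index-list form: both sides have shape (2, 2), so the assignment is element-wise
utl[:, [4, 5]] = new_values

print(utl)
```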

How can I sort an array based on the mean of each column in Python?

Input:
array([[ 1. , 5. , 1. ],
[ 10. , 7. , 1.5],
[ 6.9, 5. , 1. ],
[ 19. , 9. , 100. ],
[ 11. , 11. , 11. ]])
Expected Output:
array([[ 19. , 9. , 100. ],
[ 11. , 11. , 11. ],
[ 10. , 7. , 1.5],
[ 6.9, 5. , 1. ],
[ 1. , 5. , 1. ]])
I tried the following:
for i in M:
    ls = i.mean()
    x = np.append(i, ls)
    print(x) #found the mean
After this, I am unable to arrange the rows based on the mean value of each row. All I can do is sort the values within each row in descending order, but that is not what I want.
You can do this:
In [405]: row_idxs = np.argsort(np.mean(a * -1, axis=1))
In [406]: a[row_idxs, :]
Out[406]:
array([[ 19. , 9. , 100. ],
[ 11. , 11. , 11. ],
[ 10. , 7. , 1.5],
[ 6.9, 5. , 1. ],
[ 1. , 5. , 1. ]])
argsort returns the indices that would sort the row means in ascending order. Multiplying by -1 before sorting reverses that, which gives descending order.
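The same idea as a self-contained sketch, using the input from the question. The names row_means, order, and sorted_a are introduced here for illustration; negating the means instead of the whole array is an equivalent variation.

```python
import numpy as np

a = np.array([[1.0, 5.0, 1.0],
              [10.0, 7.0, 1.5],
              [6.9, 5.0, 1.0],
              [19.0, 9.0, 100.0],
              [11.0, 11.0, 11.0]])

row_means = a.mean(axis=1)        # one mean per row
order = np.argsort(-row_means)    # indices of rows, largest mean first
sorted_a = a[order]               # reorder whole rows

print(sorted_a)
```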

Numpy sorting a matrix by column by cycle

I have this numpy array:
array(
[[ 1. , 9. , 565.98653513],
[ 1. , 1. , 973.18466261],
[ 1. , 25. , 803.17747373],
[ 2. , 9. , 82.56336897],
[ 2. , 1. , 104.69517373],
[ 2. , 25. , 627.01127514],
[ 3. , 21. , 334.07622382],
[ 3. , 34. , 921.37623107],
[ 3. , 20. , 342.08177942],
... ... ... ...
[ 10. , 7. , 424.29338026],
[ 10. , 0. , 232.71475407],
[ 10. , 1. , 330.44846202]])
But I want to sort the matrix by the first column in a cycle: 1, 2, 3, ...,10. It should look like this:
array(
[[ 1. , 9. , 565.98653513],
[ 2. , 9. , 82.56336897],
[ 3. , 21. , 334.07622382],
... ... ... ...
[ 10. , 7. , 424.29338026],
[ 1. , 1. , 973.18466261],
[ 2. , 1. , 104.69517373],
[ 3. , 34. , 921.37623107],
... ... ... ...
[ 10. , 0. , 232.71475407],
[ 1. , 25. , 803.17747373],
[ 2. , 25. , 627.01127514],
[ 3. , 20. , 342.08177942],
... ... ... ...
[ 10. , 1. , 330.44846202]])
How can I do this?
I was thinking of converting it to a dataframe (i.e. pandas) for more sorting options, then converting it back to an array, but I don't see a straightforward function to do this.
I appreciate any help or ideas.
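Assuming your array is named x and every value in the first column has the same number of rows:
n_groups = 10                 # distinct values in the first column
step = len(x) // n_groups     # rows per value
y = np.concatenate([x[i::step] for i in range(step)])
print(y)
The x[i:j:k] notation means x from i to j with step k, so x[i::step] means x from i to the end, taking every step-th row. Because each value occupies step consecutive rows, x[i::step] picks the i-th row of every group.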
See more here.
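An alternative sketch of the same reordering using reshape, assuming the rows are stored group by group and every group has the same number of rows. The helper name interleave_groups and the small demo array are hypothetical and only serve to illustrate the idea.

```python
import numpy as np

def interleave_groups(x, n_groups):
    """Reorder rows stored group by group into cycles over the groups.

    Assumes x has n_groups contiguous blocks of equal length.
    """
    rows_per_group = len(x) // n_groups
    # View as (group, row-in-group, column), swap the first two axes,
    # then flatten back to 2D so the row-in-group index varies slowest.
    return (x.reshape(n_groups, rows_per_group, x.shape[1])
             .swapaxes(0, 1)
             .reshape(-1, x.shape[1]))

# Small demo: 3 groups (first column 1, 2, 3) with 2 rows each
demo = np.array([[1, 10], [1, 11],
                 [2, 20], [2, 21],
                 [3, 30], [3, 31]])
cycled = interleave_groups(demo, n_groups=3)
print(cycled)
```

This avoids a Python-level loop, at the cost of requiring that the number of rows divides evenly by the number of groups.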

Sort rows of a 2D array, based on the first column

I want to sort the rows of a 2D array based on the elements of the first column, in Python 3. For example, if
x = array([[ 5. , 9. , 2. , 6. ],
[ 7. , 12. , 3.5, 8. ],
[ 2. , 6. , 7. , 9. ]])
then I need the sorted array to be
x = array([[ 2. , 6. , 7. , 9. ],
[ 5. , 9. , 2. , 6. ],
[ 7. , 12. , 3.5, 8. ]])
How can I do that? A similar question was asked and answered here, but it does not work for me.
The following should work:
import numpy as np
x = np.array([[ 5. , 9. , 2. , 6. ],
[ 7. , 12. , 3.5, 8. ],
[ 2. , 6. , 7. , 9. ]])
x[x[:, 0].argsort()]
Out[2]:
array([[ 2. , 6. , 7. , 9. ],
[ 5. , 9. , 2. , 6. ],
[ 7. , 12. , 3.5, 8. ]])
Documentation : numpy.argsort
#using sorted
x = ([[5.,9.,2.,6. ], [7.,12.,3.5,8.], [2.,6.,7.,9.]])
x = sorted(x, key=lambda i: i[0]) #1st col
print(x)

Sort 2D NumPy array by one of the columns

I thought this would be super easy, but I am struggling a little. I have a data structure as follows:
array([[ 5. , 3.40166205],
[ 10. , 2.72778882],
[ 15. , 2.31881804],
[ 20. , 2.50643777],
[ 1. , 3.94076063],
[ 2. , 3.80598599],
[ 3. , 3.67121134],
[ 6. , 3.2668874 ],
[ 7. , 3.13211276],
[ 8. , 2.99733811],
[ 9. , 2.86256347],
[ 11. , 2.64599467],
[ 12. , 2.56420051],
[ 13. , 2.48240635],
[ 14. , 2.4006122 ],
[ 16. , 1.8280531 ],
[ 17. , 1.74625894],
[ 18. , 1.66446479],
[ 19. , 1.58267063],
[ 20. , 1.50087647]])
And I want to sort it ONLY on the first column ... so it is ordered as follows:
array([[1. , 3.9],
[2. , 3.8],
... ,
[20. , 1.5]])
np.sort doesn't seem to work, as it turns the array into a flat structure. I've also tried itemgetter:
from operator import itemgetter
sorted(data, key=itemgetter(1))
But this doesn't give me the output I'm looking for.
Help appreciated!
This is a common numpy idiom. You can use argsort (on the first column) + numpy indexing here -
x[x[:, 0].argsort()]
array([[ 1. , 3.94076063],
[ 2. , 3.80598599],
[ 3. , 3.67121134],
[ 5. , 3.40166205],
[ 6. , 3.2668874 ],
[ 7. , 3.13211276],
[ 8. , 2.99733811],
[ 9. , 2.86256347],
[ 10. , 2.72778882],
[ 11. , 2.64599467],
[ 12. , 2.56420051],
[ 13. , 2.48240635],
[ 14. , 2.4006122 ],
[ 15. , 2.31881804],
[ 16. , 1.8280531 ],
[ 17. , 1.74625894],
[ 18. , 1.66446479],
[ 19. , 1.58267063],
[ 20. , 2.50643777],
[ 20. , 1.50087647]])
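Note that the value 20 appears twice in the first column. A short sketch of two ways to make the order of such ties explicit, using a small made-up array: argsort with kind="stable" keeps tied rows in their original order, and np.lexsort breaks ties by a second column. The array values and variable names below are illustrative assumptions.

```python
import numpy as np

x = np.array([[20.0, 2.5],
              [5.0, 3.4],
              [20.0, 1.5],
              [1.0, 3.9]])

# Stable sort on column 0: tied rows keep their original relative order
stable = x[x[:, 0].argsort(kind="stable")]

# lexsort uses the LAST key as the primary key: sort by column 0, then column 1
by_both = x[np.lexsort((x[:, 1], x[:, 0]))]

print(stable)
print(by_both)
```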
