I still can't figure out the best way to do this with the least code.
I have an ndarray called X:
array([0.5 , 2 , 3.2 , 0.16 , 3.3 , 10 , 12 , 2.5 , 10 , 1.2 ])
I want to get the 5 smallest values together with their positions in X.
That is, I want (0.5, 0.16, 1.2, 2, 2.5) and to know that they are the 1st, 4th, 10th, 2nd, and 8th entries of X (they are actually the values of one row of a matrix, and I want to know the positions of the 5 smallest).
thank you!
You can use ndarray.argpartition; passing range(n) instead of a single pivot guarantees the first n indices come back in sorted order:
import numpy as np

X = np.array([0.5, 2, 3.2, 0.16, 3.3, 10, 12, 2.5, 10, 1.2])
n = 5
arg = X.argpartition(range(n))[:n]
print(arg)
# [3 0 9 1 7]
print(X[arg])
# [0.16 0.5 1.2 2. 2.5 ]
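For a small array like this one, a plain argsort gives the same result; argpartition mainly pays off on large arrays, where it avoids a full sort:

arg = np.argsort(X)[:n]
print(arg)
# [3 0 9 1 7]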
I am trying to use numpy to dynamically create a set of zeros based on the size of a separate numpy array.
This is a small portion of a much larger project; I have posted everything relevant in this question. I have a function k_means which takes in a dataset (posted below) and a k value (which is 3 for this example).
I create a variable centroids which is supposed to look something like
[[4.9 3.1 1.5 0.1]
[7.2 3. 5.8 1.6]
[7.2 3.6 6.1 2.5]]
From there, I need to create a numpy array of "labels", one row of zeros per row in the dataset, each row the same width as a centroid row. Meaning, for a dataset with 5 rows, it would look like:
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
This is what I am trying to achieve, albeit on a dynamic scale (i.e. where the number of rows and columns in the dataset is unknown).
The following (hard-coded, non-numpy) code satisfies that, assuming there are 150 lines in the dataset:
def k_means(dataset, k):
    centroids = [[5,3,2,4.5],[5,3,2,5],[2,2,2,2]]
    cluster_labels = []
    for i in range(0,150):
        cluster_labels.append([0,0,0,0])
    print(cluster_labels)
I am trying to do this dynamically with the following:
def k_means(dataset, k):
    centroids = dataset[numpy.random.choice(dataset.shape[0], k, replace=False), :]
    print(centroids)
    cluster_labels = []
    cluster_labels = numpy.asarray(cluster_labels)
    for index in range(len(dataset)):
        # temp_array = numpy.zeros_like(centroids)
        # print(temp_array)
        cluster_labels = cluster_labels.append(cluster_labels, numpy.zeros_like(centroids))
The current result is: AttributeError: 'numpy.ndarray' object has no attribute 'append'
Or, if I comment out the cluster_labels line and uncomment the temp, I get:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
I will ultimately get 150 sets of that.
Sample of Iris Dataset:
5.1 3.5 1.4 0.2
4.9 3 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5 3.6 1.4 0.2
5.4 3.9 1.7 0.4
4.6 3.4 1.4 0.3
5 3.4 1.5 0.2
4.4 2.9 1.4 0.2
4.9 3.1 1.5 0.1
5.4 3.7 1.5 0.2
4.8 3.4 1.6 0.2
4.8 3 1.4 0.1
4.3 3 1.1 0.1
5.8 4 1.2 0.2
5.7 4.4 1.5 0.4
5.4 3.9 1.3 0.4
5.1 3.5 1.4 0.3
5.7 3.8 1.7 0.3
5.1 3.8 1.5 0.3
5.4 3.4 1.7 0.2
5.1 3.7 1.5 0.4
4.6 3.6 1 0.2
5.1 3.3 1.7 0.5
4.8 3.4 1.9 0.2
5 3 1.6 0.2
5 3.4 1.6 0.4
5.2 3.5 1.5 0.2
5.2 3.4 1.4 0.2
4.7 3.2 1.6 0.2
4.8 3.1 1.6 0.2
5.4 3.4 1.5 0.4
5.2 4.1 1.5 0.1
5.5 4.2 1.4 0.2
Can anybody help me dynamically use numpy to achieve what I am aiming for?
Thanks.
The shape of a numpy array is the size of the array. For a 2D array, shape is (number of rows, number of columns), so shape[0] is the number of rows and shape[1] is the number of columns. You can use numpy.zeros((dataset.shape[0], centroids.shape[1])) to create a numpy array with your desired dimensions. Here is an example with a modified version of your k_means function.
import numpy

def k_means(dataset, k):
    centroids = dataset[numpy.random.choice(dataset.shape[0], k, replace=False), :]
    print(centroids)
    cluster_labels = numpy.zeros((dataset.shape[0], centroids.shape[1]))
    print(cluster_labels)

dataset = numpy.array([[1,2,3,4,5,6,7,8,9,0],
                       [3,4,5,6,4,3,2,2,6,7],
                       [4,4,5,6,7,7,8,9,9,0],
                       [5,6,7,8,5,3,3,2,2,1],
                       [6,3,3,2,2,4,5,6,6,8]])

k_means(dataset, 2)
Output:
[[1 2 3 4 5 6 7 8 9 0]
[5 6 7 8 5 3 3 2 2 1]]
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
I used numpy.zeros((dataset.shape[0], centroids.shape[1])) to keep it close to your code. Actually, numpy.zeros(dataset.shape) would do the same thing, because centroids.shape[1] and dataset.shape[1] are equal: you choose your centroids from the dataset, so both have the same number of columns. So the final version would be:
def k_means(dataset, k):
    centroids = dataset[numpy.random.choice(dataset.shape[0], k, replace=False), :]
    cluster_labels = numpy.zeros(dataset.shape)
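A quick way to sanity-check the shapes (a sketch reusing the toy dataset defined above):

centroids = dataset[numpy.random.choice(dataset.shape[0], 2, replace=False), :]
cluster_labels = numpy.zeros(dataset.shape)
print(centroids.shape)       # (2, 10)
print(cluster_labels.shape)  # (5, 10)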
I have a 2D numpy array with repeated values in the first column.
The repeated values can have any corresponding value in the second column.
It's easy to find a cumulative sum using numpy, but here I need the second-column values summed over all repetitions of each first-column value.
How can we do this efficiently using numpy or pandas?
Below, I have solved the problem using an inefficient for-loop.
I was wondering if there is a more elegant solution.
Question
How can we get the same result in a more efficient fashion?
Help will be appreciated.
#!python
# -*- coding: utf-8 -*-#
#
# Imports
import pandas as pd
import numpy as np
np.random.seed(42) # make results reproducible
aa = np.random.randint(1, 20, size=10).astype(float)
bb = np.arange(10)*0.1
unq = np.unique(aa)
ans = np.zeros(len(unq))
print(aa)
print(bb)
print(unq)
for i, u in enumerate(unq):
    for j, a in enumerate(aa):
        if a == u:
            print(a, u)
            ans[i] += bb[j]
print(ans)
"""
# given data
idx col0 col1
0 7. 0.0
1 15. 0.1
2 11. 0.2
3 8. 0.3
4 7. 0.4
5 19. 0.5
6 11. 0.6
7 11. 0.7
8 4. 0.8
9 8. 0.9
# sorted data
4. 0.8
7. 0.0
7. 0.4
8. 0.9
8. 0.3
11. 0.6
11. 0.7
11. 0.2
15. 0.1
19. 0.5
# cumulative sum for repeated serial
4. 0.8
7. 0.0 + 0.4
8. 0.9 + 0.3
11. 0.6 + 0.7 + 0.2
15. 0.1
19. 0.5
# Required answer
4. 0.8
7. 0.4
8. 1.2
11. 1.5
15. 0.1
19. 0.5
"""
You can group by col0 and take the .sum() of col1:
df.groupby('col0')['col1'].sum()
Output:
col0
4.0 0.8
7.0 0.4
8.0 1.2
11.0 1.5
15.0 0.1
19.0 0.5
Name: col1, dtype: float64
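Note that this assumes the data is already in a DataFrame df; a minimal sketch (built from the question's aa and bb arrays) to get there:

import pandas as pd

# assemble the question's arrays into the expected columns
df = pd.DataFrame({'col0': aa, 'col1': bb})
print(df.groupby('col0')['col1'].sum())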
I think a pandas method such as the one offered by @HarvIpan is best for readability and functionality, but since you asked for a numpy method as well, here is a way to do it in numpy using a list comprehension, which is more succinct than your original loop:
np.array([[i,np.sum(bb[np.where(aa==i)])] for i in np.unique(aa)])
which returns:
array([[ 4. , 0.8],
[ 7. , 0.4],
[ 8. , 1.2],
[ 11. , 1.5],
[ 15. , 0.1],
[ 19. , 0.5]])
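For larger arrays, here is a fully vectorized sketch that avoids the Python-level comprehension, using np.unique with return_inverse plus np.bincount:

uniq, inv = np.unique(aa, return_inverse=True)  # unique keys, plus each element's group index
sums = np.bincount(inv, weights=bb)             # weighted counts = per-group sums of bb
result = np.column_stack((uniq, sums))          # same 2-column layout as above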
I've got a numpy array that looks like this:
1 0 0 0 200 0 0 0 1
6 0 0 0 2 0 0 0 4.3
5 0 0 0 1 0 0 0 7.1
The expected output would be:
1 100 100 100 200 100 100 100 1
6 4 4 4 2 3.15 3.15 3.15 4.3
5 3 3 3 1 4.05 4.05 4.05 7.1
I would like to replace all the 0 values with an average of their neighbours. Any hints welcome! Many thanks!
If the structure in the sample array is preserved throughout your array, then this code will work:
In [159]: def avg_func(r):
     ...:     lavg = (r[0] + r[4])/2.0   # average of the values flanking the left block of zeros
     ...:     ravg = (r[4] + r[-1])/2.0  # average of the values flanking the right block of zeros
     ...:     r[1:4] = lavg
     ...:     r[5:-1] = ravg
     ...:     return r
In [160]: np.apply_along_axis(avg_func, 1, arr)
Out[160]:
array([[ 1. , 100.5 , 100.5 , 100.5 , 200. , 100.5 , 100.5 ,
100.5 , 1. ],
[ 6. , 4. , 4. , 4. , 2. , 3.15, 3.15,
3.15, 4.3 ],
[ 5. , 3. , 3. , 3. , 1. , 4.05, 4.05,
4.05, 7.1 ]])
But, as you can see, this is kinda messy with the hard-coded indexes; you will have to get creative when defining avg_func if your structure varies. Also note that this implementation modifies the input array in place.
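If you need the original array untouched, a one-line sketch: run the same function over a copy, since the writes inside avg_func happen in place:

result = np.apply_along_axis(avg_func, 1, arr.copy())  # arr itself stays unchanged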
I am very new to Python programming, so this might look very easy to most of the pros out there. I have a text file in the following format, and I want to import only the numbers into a matrix. That is, I do not want the spaces (there is also a space at the start of each row) or the data label.
1 1 1 1 1 1 1 data_1
1 1 1 1 1 1 2 data_2
1 1 1 1 1 2 1 data_3
1 1 1 1 1 2 2 data_4
1 1 1 1 1 3 1 data_5
1 1 1 1 1 3 2 data_6
Use numpy.loadtxt, which assumes the data are delimited by whitespace by default and takes an argument usecols specifying which fields to use in building the array:
In [1]: import numpy as np
In [2]: matrix = np.loadtxt('matrix.txt', usecols=range(7))
In [3]: print(matrix)
[[ 1. 1. 1. 1. 1. 1. 1.]
[ 1. 1. 1. 1. 1. 1. 2.]
[ 1. 1. 1. 1. 1. 2. 1.]
[ 1. 1. 1. 1. 1. 2. 2.]
[ 1. 1. 1. 1. 1. 3. 1.]
[ 1. 1. 1. 1. 1. 3. 2.]]
If you want your matrix elements to be integers, pass dtype=int to loadtxt as well.
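For example, a minimal sketch of the integer variant (same matrix.txt as above):

matrix = np.loadtxt('matrix.txt', usecols=range(7), dtype=int)  # elements stored as integers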