I'm running into some trouble converting values in columns into a NumPy 2D array. What I have as output from my code is something like the following:
38617.0 0 0
40728.0 0 1
40538.0 0 2
40500.5 0 3
40214.0 0 4
40545.0 0 5
40352.5 0 6
40222.5 0 7
40008.0 0 8
40017.0 0 9
40126.0 0 10
40029.0 0 11
39681.5 0 12
39973.0 0 13
39903.0 0 14
39766.5 0 15
39784.0 0 16
39528.5 0 17
39513.5 0 18
And this continues for ~300,000 lines. The coordinates of the data are arranged as (z, x, y), and I want to convert them into a 2D array with dimensions 765x510 (x, y), so that each z value sits at its respective (x, y) position and I can write the array to an image file.
Any ideas? I've been looking around and I haven't found anything on the matter.
EDIT:
These are the while-loops that create the columns of data above (there are actually two; a function is called within another while-loop):
def make_median_image(x,y):
    while y < 509:
        y = y + 1 # Makes the first value (x,0), b/c Python is indexed at 0
        median_first_row0 = sc.median([a11[y,x],a22[y,x],a33[y,x],a44[y,x],a55[y,x],a66[y,x],a77[y,x],a88[y,x],a99[y,x],a1010[y,x]])
        print median_first_row0,x,y
        list1 = [median_first_row0,x,y]
        list = list1.append(

while x < 764:
    x = x + 1
    make_median_image(x,y)
import numpy as np
l = [[1,2,3], [4,5,6], [7,8,9], [0,0,0]]
You can directly pass a Python 2D list to np.array:
>>> np.array(l)
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9],
       [0, 0, 0]])
If you only want the latter two columns (which are your x, y values):
>>> np.array([[i[1],i[2]] for i in l])
array([[2, 3],
       [5, 6],
       [8, 9],
       [0, 0]])
Would it be possible to create an array from the data below like so?
Data:
38617.0 0 0
40728.0 0 1
40538.0 0 2
40500.5 0 3
40214.0 0 4
40545.0 0 5
40352.5 0 6
40222.5 0 7
40008.0 0 8
40017.0 0 9
40126.0 0 10
40029.0 0 11
39681.5 0 12
39973.0 0 13
39903.0 0 14
39766.5 0 15
39784.0 0 16
39528.5 0 17
39513.5 0 18
... (continues for ~100,000 lines, so you can guess why I'm adamant to find an answer)
What I would like:
numpy_ndarray = [[38617.0, 40728.0, 40538.0, 40500.5, 40214.0, 40545.0, 40352.5, ... (continues until the last column value in the data above is 764) ], [begin next line, when x = 1, ... (until last y-value is 764)], ... [ ... (some last pixel value)]]
So it basically builds a matrix/image grid out of the pixel values in the first column of the data, each placed at the (x, y) coordinate given in the second and third columns.
Let's say your array is l:
import numpy as np
l = [[1,2,3], [4,5,6], [7,8,9], [0,0,0]]
>>> np.array(l)
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9],
       [0, 0, 0]])
This should work:
>>> np.array([[i[1] for i in l],[i[2] for i in l]])
array([[2, 5, 8, 0],
       [3, 6, 9, 0]])
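To build the full 765x510 image grid described above, a minimal sketch, assuming the three columns have been written to a whitespace-separated text file (here hypothetically called columns.txt), could look like this:

import numpy as np

data = np.loadtxt('columns.txt')       # columns are z, x, y
z = data[:, 0]
x = data[:, 1].astype(int)             # x runs 0..764
y = data[:, 2].astype(int)             # y runs 0..509

image = np.zeros((765, 510))           # (x, y) grid
image[x, y] = z                        # place each z at its (x, y) coordinate

If the rows are already ordered by x and then y (as in the sample), data[:, 0].reshape(765, 510) should give the same grid without the explicit indexing.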
Related
I'm trying to figure out a way to slice non-contiguous and non-equal-length row ranges of a pandas / NumPy matrix so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(12).reshape(3,4))
#x is the matrix we want to index into
"""
x before:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
"""
y = pd.DataFrame([[0,3],[2,2],[1,2],[0,0]])
#y is a matrix where each row contains a start idx and end idx per column of x
"""
   0  1
0  0  3
1  2  2
2  1  2
3  0  0
"""
What I'm looking for is a way to effectively select different-length slices of x based on the rows of y, conceptually something like:
x[y] = 0
"""
x afterwards:
array([[ 0,  1,  2,  0],
       [ 0,  5,  0,  7],
       [ 0,  0,  0, 11]])
"""
Masking can still be useful, because even if a loop cannot be entirely avoided, the main dataframe x would not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
    mask[y.iloc[i, 0]:(y.iloc[i, 1] + 1), i] = True
x[mask] = 0
x
   0  1  2   3
0  0  1  2   0
1  0  5  0   7
2  0  0  0  11
As a further improvement, consider defining y as a NumPy array if possible.
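For instance, a rough sketch of that variant (treating the end indices in y as inclusive, as above) might be:

y_arr = y.values                        # plain NumPy array, avoids .iloc lookups in the loop
mask = np.zeros(x.shape, dtype=bool)
for i, (start, end) in enumerate(y_arr):
    mask[start:end + 1, i] = True       # end index treated as inclusive
x[mask] = 0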
I customized this answer to your problem:
y_t = y.values.transpose()
r = np.arange(x.shape[0])
# the end indices in y are inclusive, so compare with '>= r'
mask = ((y_t[0,:,None] <= r) & (y_t[1,:,None] >= r)).transpose()
res = x.where(~mask, 0)
res
# 0 1 2 3
# 0 0 1 2 0
# 1 0 5 0 7
# 2 0 0 0 11
Suppose I have a 2-dimensional array with a very large number of rows, and a list of pairs of indices into that array. I want to create a new 2-dimensional array whose rows are concatenations of rows of the original array, chosen according to the list of index pairs. For example:
a =
1 2 3
4 5 6
7 8 9
0 0 0
indexes = [[0,0], [0,1], [2,3]]
the returned array should be:
1 2 3 1 2 3
1 2 3 4 5 6
7 8 9 0 0 0
Obviously I can iterate the list of indexes, but my question is whether there is a more efficient way of doing this. I should say that the list of indexes is also very large.
First convert indexes to a Numpy array:
ind = np.array(indexes)
Then generate your result as:
result = np.concatenate([a[ind[:,0]], a[ind[:,1]]], axis=1)
The result is:
array([[1, 2, 3, 1, 2, 3],
       [1, 2, 3, 4, 5, 6],
       [7, 8, 9, 0, 0, 0]])
Another possible formula (with the same result):
result = np.concatenate([ a[ind[:,i]] for i in range(ind.shape[1]) ], axis=1)
You can do this in one line using NumPy as:
a = np.arange(12).reshape(4, 3)
print(a)
b = [[0, 0], [1, 1], [2, 3]]
b = np.array(b)
print(b)
c = a[b.reshape(-1)].reshape(-1, a.shape[1]*b.shape[1])
print(c)
'''
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
[[0 0]
 [1 1]
 [2 3]]
[[ 0  1  2  0  1  2]
 [ 3  4  5  3  4  5]
 [ 6  7  8  9 10 11]]
'''
You can use horizontal stacking np.hstack:
c = np.array(indexes)
np.hstack((a[c[:,0]],a[c[:,1]]))
output:
[[1 2 3 1 2 3]
 [1 2 3 4 5 6]
 [7 8 9 0 0 0]]
I already asked a variation of this question, but I still have a problem regarding the runtime of my code.
Given a NumPy array consisting of 15,000 rows and 44 columns, my goal is to find out which rows are equal and collect their indices in lists, like this:
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 0
1 2 3 4 5
Result:
equal_rows1 = [1,2,3]
equal_rows2 = [0,4]
What I have done up till now is use the following code:
import numpy as np
input_data = np.load('IN.npy')
equal_inputs1 = []
equal_inputs2 = []
for i in range(len(input_data)):
    for j in range(i+1, len(input_data)):
        if np.array_equal(input_data[i], input_data[j]):
            equal_inputs1.append(i)
            equal_inputs2.append(j)
The problem is that it takes a lot of time to return the desired lists, and that this approach only allows for two different "equal row" lists although there can be more. Is there any better solution for this, especially regarding the runtime?
This is pretty simple with pandas groupby:
df
   A  B  C  D  E
0  1  0  0  0  0
1  0  0  0  0  0
2  0  0  0  0  0
3  0  0  0  0  0
4  1  0  0  0  0
5  1  2  3  4  5
[g.index.tolist() for _, g in df.groupby(df.columns.tolist()) if len(g.index) > 1]
# [[1, 2, 3], [0, 4]]
If you are dealing with many rows and many unique groups, this might get a bit slow. The performance depends on your data. Perhaps there is a faster NumPy alternative, but this is certainly the easiest to understand.
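One possible NumPy route, sketched here under the assumption that np.unique with axis=0 is available (NumPy >= 1.13), is to label each row by the unique row it matches and then collect the indices per label:

import numpy as np

# inverse[i] is the id of the unique row that input_data[i] equals
_, inverse = np.unique(input_data, axis=0, return_inverse=True)
groups = [np.flatnonzero(inverse == g).tolist() for g in np.unique(inverse)]
# keep only the groups that actually contain duplicates
duplicate_groups = [g for g in groups if len(g) > 1]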
You can use collections.defaultdict, which retains the row values as keys:
from collections import defaultdict
dd = defaultdict(list)
for idx, row in enumerate(df.values):
    dd[tuple(row)].append(idx)
print(list(dd.values()))
# [[0, 4], [1, 2, 3], [5]]
print(dd)
# defaultdict(<class 'list'>, {(1, 0, 0, 0, 0): [0, 4],
# (0, 0, 0, 0, 0): [1, 2, 3],
# (1, 2, 3, 4, 5): [5]})
You can, if you wish, filter out unique rows via a dictionary comprehension.
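For example, a comprehension along these lines keeps only the rows that occur more than once:

duplicates = {row: idxs for row, idxs in dd.items() if len(idxs) > 1}
# {(1, 0, 0, 0, 0): [0, 4], (0, 0, 0, 0, 0): [1, 2, 3]}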
I'm looking for an efficient way to do the following with Numpy:
Given an array counts of non-negative integers containing, for instance:
[3, 1, 0, 6, 3, 2]
I would like to generate another array containing the indices of the first one, where the index i is repeated counts[i] times:
[0 0 0 1 3 3 3 3 3 3 4 4 4 5 5]
My problem is that this array is potentially very large and I'm looking for a vectorized (or fast) way to do this.
You can do it with numpy.repeat:
import numpy as np
arr = np.array([3, 1, 0, 6, 3, 2])
repix = np.repeat(np.arange(arr.size), arr)
print(repix)
Output:
[0 0 0 1 3 3 3 3 3 3 4 4 4 5 5]
Since this is a complicated problem (at least for me), I will try to keep this as brief as possible.
My data is of the form
import pandas as pd
import numpy as np
# edit: a1 and a2 are linked as they are part of the same object
a1 = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]])
a2 = np.array([[5, 6, 5], [2, 3], [3, 4, 8, 1]])
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])
df = pd.DataFrame(dict(a1=a1.tolist(),a2=a2.tolist(), b=b, y=y))
              a1            a2   b  y
0      [1, 2, 3]     [5, 6, 5]   6  0
1         [4, 5]        [2, 3]  15  1
2  [7, 8, 9, 10]  [3, 4, 8, 1]  24  1
which I would like to use in sklearn for classification, e.g.
from sklearn import tree
X = df[['a1', 'a2', 'b']]
Y = df['y']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print(clf.predict([[2., 2.]]))
However, while pandas can handle lists as entries, sklearn, by design, cannot. In this example, clf.fit will raise ValueError: setting an array element with a sequence., for which you can find plenty of answers.
But how do you deal with such data?
I tried to split the data up into multiple columns (i.e. a1[0] ... a1[3] - the code for that is a bit lengthy), but a1[3] will be empty (NaN, 0 or whatever invalid value you can think of) for some rows. Imputation does not make sense here, since no value is supposed to be there.
Of course, such a procedure has an impact on the result of the classification as the algorithm might pick up the "zero" value as something meaningful.
If the dataset is large enough, so I thought, it might be worth splitting it up by equal lengths of a1. But this procedure can reduce the power of the classification algorithm, since the length of a1 might help to distinguish between classes.
I also thought of using warm start for algorithms that support it (e.g. Perceptron) and fitting them to data split by the length of a1. But this would surely fail, would it not? The datasets would have different numbers of features, so I assume something would go wrong.
Solutions to this problem surely must exist and I've simply not found the right place in the documentation.
Let's assume for a second that those numbers are numerical categories.
What you can do is transform column 'a' into a set of binary columns, each of which corresponds to a possible value of 'a'.
Taking your example code, we would:
import pandas as pd
import numpy as np
a = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]])
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])
df = pd.DataFrame(dict(a=a.tolist(),b=b,y=y))
from sklearn.preprocessing import MultiLabelBinarizer
MLB = MultiLabelBinarizer()
df_2 = pd.DataFrame(MLB.fit_transform(df['a']), columns=MLB.classes_)
df_2
   1  2  3  4  5  7  8  9  10
0  1  1  1  0  0  0  0  0   0
1  0  0  0  1  1  0  0  0   0
2  0  0  0  0  0  1  1  1   1
Then, we can just concat the old and new data:
new_df = pd.concat([df_2, df.drop('a', axis=1)], axis=1)
   1  2  3  4  5  7  8  9  10   b  y
0  1  1  1  0  0  0  0  0   0   6  0
1  0  0  0  1  1  0  0  0   0  15  1
2  0  0  0  0  0  1  1  1   1  24  1
Please do note that if you have a training and a test set, it would be wise to first concat them, do the transform, and then separate them again. That's because one of the datasets can contain terms that do not appear in the other.
Hope that helps
Edit:
If you are worried that this might make your df too big, it's perfectly okay to apply PCA to the binarized variables. It will reduce dimensionality while retaining an arbitrary amount of the variance/correlation.
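A rough sketch of that idea, with the variance threshold chosen arbitrarily for illustration:

from sklearn.decomposition import PCA

# keep enough components to explain ~95% of the variance (the threshold is arbitrary)
pca = PCA(n_components=0.95)
a_reduced = pca.fit_transform(df_2)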
Sklearn likes the data as a 2D array, i.e. with shape (batch_size, features).
The simplest solution is to prepare one feature vector per sample by concatenating the arrays using numpy.concatenate, then pass these feature vectors to sklearn. Since the length of each column is fixed, this should work.
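A minimal sketch of that approach, assuming every row's a1 and a2 lists really do have the same lengths across rows (which the example data above does not guarantee):

import numpy as np
from sklearn import tree

# build one flat feature vector per row: a1 values, then a2 values, then b
X = np.array([np.concatenate([row['a1'], row['a2'], [row['b']]])
              for _, row in df.iterrows()])
Y = df['y']
clf = tree.DecisionTreeClassifier().fit(X, Y)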