I have a matrix tempsyntheticGroup2 with 6 columns. I want to change the value of columns (0,1,2,3,5) from float to int. This is my code:
tempsyntheticGroup2=tempsyntheticGroup2[:,[0,1,2,3,5]].astype(int)
but it doesn't work properly and I lose the other columns.
I don't think you can have a numpy array with some elements that are ints and some that are floats (there is only one possible dtype per array). But if you just want to round down to the nearest integer (while keeping all elements as floats) you can do this:
import numpy as np

# define a dummy example matrix
t = np.random.rand(3,4) + np.arange(12).reshape((3,4))
array([[ 0.68266426, 1.4115732 , 2.3014562 , 3.5173022 ],
[ 4.52399807, 5.35321628, 6.95888015, 7.17438118],
[ 8.97272076, 9.51710983, 10.94962065, 11.00586511]])
# round some columns to lower int
t[:,[0,2]] = np.floor(t[:,[0,2]])
# or
t[:,[0,2]] = t[:,[0,2]].astype(int)
array([[ 0. , 1.4115732 , 2. , 3.5173022 ],
[ 4. , 5.35321628, 6. , 7.17438118],
[ 8. , 9.51710983, 10. , 11.00586511]])
Otherwise you probably need to split your original array into 2 different arrays, with one containing the columns that stay floats and the other containing the columns that become ints.
t_int = t[:,[0,2]].astype(int)
array([[ 0, 2],
[ 4, 6],
[ 8, 10]])
t_float = t[:,[1,3]]
array([[ 1.4115732 , 3.5173022 ],
[ 5.35321628, 7.17438118],
[ 9.51710983, 11.00586511]])
Note that you'll have to change your indexing accordingly to access your elements...
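If you really do need per-column dtypes in a single object, a structured array is another option (a minimal sketch of my own, not from the original answer; note that fields are accessed by name rather than by column index):
import numpy as np

# each field carries its own dtype
dt = np.dtype([('a', int), ('b', float)])
s = np.array([(0, 1.41), (4, 5.35), (8, 9.51)], dtype=dt)
s['a']   # array([0, 4, 8])            -> integer column
s['b']   # array([1.41, 5.35, 9.51])   -> float column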
I think you are using the wrong syntax to get the column data. Read this article:
How do you extract a column from a multi-dimensional array?
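Applied to the question's variable, the in-place version that keeps all six columns would be (a sketch, assuming tempsyntheticGroup2 stays a float array, so the selected values are merely truncated):
tempsyntheticGroup2[:,[0,1,2,3,5]] = tempsyntheticGroup2[:,[0,1,2,3,5]].astype(int)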
Related
When I initialize two random matrices (here of shape (6,2)) and re-order the rows, np.sum() gives me different results. I understand this is likely due to numerical error, as the differences are small, but how exactly is np.sum() summing the elements? How can I replicate the result for each ordering of the matrix?
x1 = np.array([[-0.31381854, -0.05944943],
[ 0.3848904 , -0.36534384],
[ 1.1122322 , 1.2480698 ],
[-1.4493011 , 0.5094067 ],
[ 0.00905334, 0.77591574],
[ 0.25694364, -2.108599 ]], dtype=np.float32)
x2 = np.array([[-0.31381854, -0.05944943],
[ 1.1122322 , 1.2480698 ],
[ 0.00905334, 0.77591574],
[ 0.3848904 , -0.36534384],
[-1.4493011 , 0.5094067 ],
[ 0.25694364, -2.108599 ]], dtype=np.float32)
print(np.sum(x1))
print(np.sum(x2))
0.0
-2.3841858e-07
Although they have the exact same elements per column and row, the sums are different.
If I sum all elements using the sum() function, the results do not disagree:
print(sum(sum(x1)))
print(sum(sum(x2)))
-5.960464477539063e-08
-5.960464477539063e-08
When I sum the columns individually, using the python sum() function, I get the same resulting sum:
print(sum(x1[:,0]))
print(sum(x2[:,0]))
-6.705522537231445e-08
-6.705522537231445e-08
print(sum(x1[:,1]))
print(sum(x2[:,1]))
-2.9802322387695312e-08
-2.9802322387695312e-08
But again, if I first sum along axis 1 with np.sum(x, 1) and then total the row sums, the results are different:
print(np.sum(np.sum(x1, 1)))
print(np.sum(np.sum(x2, 1)))
-2.3841858e-07
0.0
This is an issue for large matrices with thousands of elements, where the numerical inaccuracies can accumulate into significant differences. I just don't understand how np.sum() is operating, such that a simple sum gives such different results, while the canonical python sum() function does not!
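One way to sanity-check the order-dependence (my own addition, not part of the original question) is to compare against math.fsum, which tracks the partial sums exactly and is therefore insensitive to ordering. NumPy's sum, by contrast, uses pairwise summation, so the intermediate rounding depends on how the elements are grouped:
import math

# fsum is exact, so both row orderings must agree
print(math.fsum(x1.flatten()))
print(math.fsum(x2.flatten()))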
Why does a[:,[x]] create a column vector from an array? What do the brackets [ ] represent?
Could anyone explain the principle to me?
import numpy as np
import torch

a = np.random.randn(5,6)
a = a.astype(np.float32)
print(a)
c = torch.from_numpy(a[:,[1]])
print(c)
[[-1.6919796 0.3160475 0.7606999 0.16881375 1.325092 0.71536326]
[ 1.217861 0.35804042 0.0285245 0.7097111 -2.1760604 0.992101 ]
[-1.6351479 0.6607222 0.9375339 0.5308735 -1.9699149 -2.002803 ]
[-1.1895325 1.1744579 -0.5980689 -0.8906375 -0.00494479 0.51751447]
[-1.7642071 0.4681248 1.3938268 -0.7519176 0.5987852 -0.5138923 ]]
###########################################
tensor([[0.3160],
[0.3580],
[0.6607],
[1.1745],
[0.4681]])
The [ ] means you are adding an extra dimension. Try the numpy shape attribute to see the difference. For an array a with 10 rows:
a[:,1].shape
Output:
(10,)
With [ ]:
a[:,[1]].shape
Output:
(10,1)
That syntax is numpy array slicing, where arrays are indexed as a[rows, columns, pages, ... (higher dimensions)]
Selecting for a specific row/column/page is done by giving a specific number or range of numbers. So when you use a[1,2], numpy gets the element from row 1, column 2.
You can select several specific indices by giving a dimension multiple values. So a[[1,3],1] gets you both elements (1,1) and (3,1).
The : tells numpy to get everything from that specific array dimension. So when you use a[:,1], numpy gets every row in column 1. Alternatively, a[1,:] gets every column in row 1.
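A quick check of the behaviour described above (my own minimal example):
import numpy as np

a = np.arange(20).reshape(4, 5)
a[1, 2]       # 7: row 1, column 2
a[[1, 3], 1]  # array([ 6, 16]): elements (1,1) and (3,1)
a[:, 1]       # array([ 1,  6, 11, 16]): every row of column 1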
Hey guys, I want to write a function that performs a z-score transformation on a single column of a 2D array and then returns an array where the specified column is transformed and the other columns remain the same. The way I went about this: first I deleted the column I want to transform using np.delete(), then performed the transformation, and finally recombined the remaining columns with the transformed column using np.insert(). However, all the elements in the transformed column are 0. What can I do?
x1 = np.array([[4,3,12],[1,5,20],[1,2,3],[10,20,40],[7,2,44]])

def myfunc(array, scalar):
    total_result = np.delete(array, scalar, axis=1)
    z_score = ((array - array.mean())/array.std())[:,1]
    answer = np.insert(total_result, scalar, z_score, axis=1)
    return answer

myfunc(x1, 1)
Your array is of integer dtype and your z-scores are floats. When you insert a float into an integer array, it gets cast to integer, hence all the zeros. You need to convert your array to float first. Also, deleting and re-inserting is not the right way to do this; simply assign the new values to the desired column. No need for delete/insert. Here is how to do it:
def myfunc(array, scalar):
    z_score = ((array - array.mean())/array.std())[:,scalar]
    array[:,scalar] = z_score
    return array

x1 = x1.astype(np.float64, copy=False)
myfunc(x1, 1)
output:
[[ 4. -0.64344154 12. ]
[ 1. -0.49380397 20. ]
[ 1. -0.71826033 3. ]
[10. 0.62847778 40. ]
[ 7. -0.71826033 44. ]]
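To see the silent cast in isolation (a quick check of my own, not from the original answer):
arr = np.zeros(3, dtype=int)
arr[0] = 0.64   # assigning a float into an int array truncates it to 0
print(arr)      # [0 0 0]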
I have an array that contains 2D arrays.
For each 2D array I want to sum each row (adding up its columns), and the result must be in column form.
I have a piece of code to do this, but I feel like I am not utilising numpy optimally. What is the fastest way to do this?
My current code:
temp = [np.sum(l_i,axis=1).reshape(-1,1) for l_i in self.layer_inputs]
Sample Array:
array([
array([[ 0.48517904, -11.10809746],
[ 13.64104864, 5.77576326]]),
array([[16.74109924, -3.28535518],
[-4.00977275, -3.39593759],
[ 5.9048581 , -1.65258805],
[13.40762143, -1.61158724],
[ 9.8634849 , 8.02993728]]),
array([[-7.61920427, -3.2314264 ],
[-3.79142779, -2.44719713],
[32.42085005, 4.79376209],
[13.97676962, -1.19746096],
[45.60100807, -3.01680368]])
], dtype=object)
Sample Expected Result:
[array([[-10.62291842],
[ 19.41681191]]),
array([[13.45574406],
[-7.40571034],
[ 4.25227005],
[11.7960342 ],
[17.89342218]]),
array([[-10.85063067],
[ -6.23862492],
[ 37.21461214],
[ 12.77930867],
[ 42.58420439]]) ]
New answer
Given your stringent requirement for a list of arrays, there is no more computationally efficient solution.
Original answer
To leverage NumPy, don't work with a list of arrays: dtype=object is the hint you won't be able to use vectorised operations.
Instead, combine into one array, e.g. via np.vstack, and store split indices. If you need a list of arrays, use np.split as a final step. But this constant flipping between lists and a single array is expensive. Really, you should attempt to just store the splits and a single array, i.e. idx and data below.
# A is the object array of 2-D blocks from the question
idx = np.array(list(map(len, A))).cumsum()[:-1]  # [2, 7]
data = np.vstack(A).sum(1)
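If you do need the list of column vectors back at the end, a sketch of that final np.split step (my own addition, using idx and data from above):
temp = [chunk.reshape(-1, 1) for chunk in np.split(data, idx)]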
I am using Python, numpy and scikit-learn. I have data of keys and values that are stored in an SQL table. I retrieve this as a list of tuples returned as [(id, value), ...]. Each id appears only once in the list and the tuples appear sorted in order of ascending id. This process is completed a few times, so that I have multiple lists of key: value pairs, such that:
dataset = []
for sample in samples:
listOfTuplePairs = getDataFromSQL(sample) # get a [(id, value),...] list
dataset.append(listOfTuplePairs)
Keys may be duplicated across different samples, and each row may be of a different length. An example dataset might be:
dataset = [[(1, 0.13), (2, 2.05)],
[(2, 0.23), (4, 7.35), (5, 5.60)],
[(2, 0.61), (3, 4.45)]]
It can be seen that each row is a sample, and that some ids (in this case 2) appear in multiple samples.
Problem: I wish to construct a single (possibly sparse) numpy array suitable for processing with scikit-learn. The values relating to a specific key (id) for each sample should be aligned in the same 'column' (if that is the correct terminology) such that the matrix of the above example would look as follows:
ids = 1 2 3 4 5
------------------------------
dataset = [(0.13, 2.05, null, null, null),
(null, 0.23, null, 7.35, 5.60),
(null, 0.61, 4.45, null, null)]
As you can see, I also wish to strip the ids from the matrix (though I will need to retain a list of them so I know what the values in the matrix relate to). Each initial list of key: value pairs may contain several thousand rows, and there may be several thousand samples, so the resulting matrix may be very large. Please provide answers that consider speed (within the limits of Python), memory efficiency and code clarity.
Many, many thanks in advance for any help.
Here's a NumPy-based approach that creates a sparse coo_matrix with memory efficiency in focus -
import numpy as np
from scipy.sparse import coo_matrix
# Construct row IDs
lens = np.array([len(item) for item in dataset])
shifts_arr = np.zeros(lens.sum(),dtype=int)
shifts_arr[lens[:-1].cumsum()] = 1
row = shifts_arr.cumsum()
# Extract values from dataset into a NumPy array
arr = np.concatenate(dataset)
# Get the unique column IDs to be used for col-indexing into output array
col = np.unique(arr[:,0],return_inverse=True)[1]
# Determine the output shape
out_shp = (row.max()+1,col.max()+1)
# Finally create a sparse matrix from the row, col indices and column 2 of arr (the values)
sp_out = coo_matrix((arr[:,1],(row,col)), shape=out_shp)
Please note that if the IDs are supposed to be column numbers in the output array, you could replace the use of np.unique that gives us such unique IDs with something like this -
col = (arr[:,0]-1).astype(int)
This should give us a good performance boost!
Sample run -
In [264]: dataset = [[(1, 0.13), (2, 2.05)],
...: [(2, 0.23), (4, 7.35), (5, 5.60)],
...: [(2, 0.61), (3, 4.45)]]
In [265]: sp_out.todense() # Using .todense() to show output
Out[265]:
matrix([[ 0.13, 2.05, 0. , 0. , 0. ],
[ 0. , 0.23, 0. , 7.35, 5.6 ],
[ 0. , 0.61, 4.45, 0. , 0. ]])
You can convert each element in the dataset to a dictionary and then use a pandas DataFrame, which returns a result close to the desired output; note that missing keys become NaN here, whereas the sparse approach stores them as zeros. If a 2D numpy array is desired, use the to_numpy() method (as_matrix() in older pandas versions) to convert the data frame to a numpy array:
import pandas as pd
pd.DataFrame(dict(x) for x in dataset).to_numpy()
# array([[ 0.13, 2.05, nan, nan, nan],
# [ nan, 0.23, nan, 7.35, 5.6 ],
# [ nan, 0.61, 4.45, nan, nan]])