Creating many state vectors and saving them in a file

Creating many state vectors and saving them in a file - python

I want to create m number of matrices, each of which is an n x 1 numpy arrays. Moreover those matrices should have only two nonzero entries in the two rows, all other rows should have 0 as their entries, meaning that matrix number m=1 should have entries m[0,:]=m[1,:]=1, rest elements are 0. And similarly the last matrix m=m should have entries like m[n-1,:]=m[n,:]=1, where rest of the elements in other rows are 0. So for consecutive two matrices, the nonzero elements shift by two rows. And finally, I would like them to be stored into a dictionary or in a file.
What would be a neat way to do this?

Is this what you're looking for?
In [2]: num_rows = 10 # should be divisible by 2
In [3]: np.repeat(np.eye(num_rows // 2), 2, axis=0)
Out[3]:
array([[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1.]])
In terms of storage in a file, you can use np.save and np.load.
Note that the default data type for np.eye will be float64. If you expect your values to be small when you begin integrating or whatever you're planning on doing with your state vectors, I'd recommend setting the data type appropriately (like np.uint8 for positive integers < 256 for example).

Related

Python - 2D Array: Finding Coordinates

I have a python based battleship game that allows players to pass in a .csv containing a 10x10 grid of 0s and 1s, where 1s correspond to locations of ships. For example,
In [1]: board = np.genfromtext(filename, delimiter=',')
Out[2]: array([
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 1., 1., 1., 0., 0., 0.],
[1., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
(5) unit ship located at board[8,5:10]
(4) unit ship located at board[7,3:7]
(3) unit ship located at board[3:6,2]
(3) unit ship located at board[3:6,6]
(2) unit ship located at board[8:10,0]
The game restricts players to 5 ships with sizes of (5, 4, 3, 3, 2). Given this information how can I come up with a method for determining the coordinates for each ship?
I'm currently applying a for loop that finds indexes where 1s are present, see below.
for ri, ci in zip(range(board.shape[0]), range(board.shape[1])):
# get row indexes of ships
rlocs = len(np.where(board[ri,:]==1)[0])
# get col indexes of ships
clocs = len(np.where(board[:,ci]==1)[0])
# skip empty row and col
if (rlocs == 0) and (clocs == 0): continue
# check if consecutives <int>
rcons = is_consecutive(rlocs)
ccons = is_consecutive(clocs)
# if more than one consecutive is found then assume more than 1 ship in row/col
if (rcons > 1) or (ccons > 1):
# .... ?
At this point I'm unsure what the next step would be... any help or advice is welcomed!
FYI: is_consecutive returns an int of the number of values in a list that are in a consecutive order. For example, [0,1,2,9,10] would return 2 (i.e., 0-2 and 9-10).
The output that I'm looking for is a dictionary that looks similar to this:
{'ship_05_01': [(x0,y0), (x1,y1)], 'ship_03_01': [(x0,y0), (x1,y1)], 'ship_03_02': [(x0,y0), (x1,y1)], ...}
where ship_xx_nn --> xx = number of spaces; nn = index

minimize runtime for numpy array manipulation

I have an 2 dimensional array with np.shape(input)=(a,b) and that looks like
input=array[array_1[0,0,0,1,0,1,2,0,3,3,2,...,entry_b],...array_a[1,0,0,1,2,2,0,3,1,3,3,...,entry_b]]
Now I want to create an array np.shape(output)=(a,b,b) in which every entry that had the same value in the input get the value 1 and 0 otherwise
for example:
input=[[1,0,0,0,1,2]]
output=[array([[1., 0., 0., 0., 1., 0.],
[0., 1., 1., 1., 0., 0.],
[0., 1., 1., 1., 0., 0.],
[0., 1., 1., 1., 0., 0.],
[1., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 1.]])]
My code so far is looking like:
def get_matrix(svdata,padding_size):
List=[]
for k in svdata:
matrix=np.zeros((padding_size,padding_size))
for l in range(padding_size):
for m in range(padding_size):
if k[l]==k[m]:
matrix[l][m]=1
List.append(matrix)
return List
But it takes 2:30 min for an input array of shape (2000,256). How can I become more effiecient by using built in numpy solutions?

res = input[:,:,None]==input[:,None,:]
Should give boolean (a,b,b) array
res = res.astype(int)
to get a 0/1 array

You're trying to create the array y where y[i,j,k] is 1 if input[i,j] == input[i, k]. At least that's what I think you're trying to do.
So y = input[:,:,None] == input[:,None,:] will give you a boolean array. You can then convert that to np.dtype('float64') using astype(...) if you want.

Understanding Variance Treshold

I am working on a text classification problem in python, where I build a traing array based on {0,1} if the word is inside the text or not.
array([[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.],
...,
[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.]])
as I want to run SVM on it, I want to reduce my features. In scikit learn I found this: https://scikit-learn.org/stable/modules/feature_selection.html
with the Variance Threshold set to:
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
x_train_red = sel.fit_transform(x_train)
from the reduction I am reducing my shape from:
(7808, 2000)
(7808, 97)
does it only reduce the futre where every line has 1 or where every line has a 0 or how does it work?

From the documentation you can see the variance is calculated by p(1-p), the default threeshold or limit 0.8 means that any column with a probability of having 0 variance above 0.8 will be eliminated. So it deletes the columns with rare occurrences, those words are not in your text a lot, so their variance will be close to 0 and the feature selection will eliminate it.

Removing NaN rows from a three dimensional array

How can I remove the NaN rows from the array below using indices (since I will need to remove the same rows from a different array.
array([[[nan, 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]],
[[ 0., 0., 0., 0.],
[ 0., nan, 0., 0.],
[ 0., 0., 0., 0.]]])
I get the indices of the rows to be removed by using the command
a[np.isnan(a).any(axis=2)]
But using what I would normally use on a 2D array does not produce the desired result, losing the array structure.
a[~np.isnan(a).any(axis=2)]
array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
How can I remove the rows I want using the indices obtained from my first command?

You need to reshape:
a[~np.isnan(a).any(axis=2)].reshape(a.shape[0], -1, a.shape[2])
But be aware that the number of NaN-rows at each 2D subarray should be the same to get a new 3D array.

Vectorize Sequences explanation

Studying Deep Learning with Python, I can't comprehend the following simple batch of code which encodes the integer sequences into a binary matrix.
def vectorize_sequences(sequences, dimension=10000):
# Create an all-zero matrix of shape (len(sequences), dimension)
results = np.zeros((len(sequences), dimension))
for i, sequence in enumerate(sequences):
results[i, sequence] = 1. # set specific indices of results[i] to 1s
return results
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
x_train = vectorize_sequences(train_data)
And the output of x_train is something like
x_train[0]
array([ 0., 1.,1., ...,0.,0.,0.])
Can someone put some light of the 0.'s existance in x_train array while only 1.'s are appending in each next i iteration?
I mean shouldn't be all 1's?

The script transforms you dataset into a binary vector space model. Let's disect things one by one.
First, if we examine the x_train content we see that each review is represented as a sequence of word ids. Each word id corresponds to one specific word:
print(train_data[0]) # print the first review
[1, 14, 22, 16, 43, 530, 973, ..., 5345, 19, 178, 32]
Now, this would be very difficult to feed the network. The lengths of reviews varies, fractional values between any integers have no meaning (e.g. what if on the output we get 43.5, what does it mean?)
So what we can do, is create a single looong vector, the size of the entire dictionary, dictionary=10000 in your example. We will then associate each element/index of this vector with one word/word_id. So word represented by word id 14 will now be represented by 14-th element of this vector.
Each element will either be 0 (word is not present in the review) or 1 (word is present in the review). And we can treat this as a probability, so we even have meaning for values in between 0 and 1. Furthermore, every review will now be represented by this very long (sparse) vector which has a constant length for every review.
So on a smaller scale if:
word word_id
I -> 0
you -> 1
he -> 2
be -> 3
eat -> 4
happy -> 5
sad -> 6
banana -> 7
a -> 8
the sentences would then be processed in a following way.
I be happy -> [0,3,5] -> [1,0,0,1,0,1,0,0,0]
I eat a banana. -> [0,4,8,7] -> [1,0,0,0,1,0,0,1,1]
Now I highlighted the word sparse. That means, there will have A LOT MORE zeros in comparison with ones. We can take advantage of that. Instead of checking every word, whether it is contained in a review or not; we will check a substantially smaller list of only those words that DO appear in our review.
Therefore, we can make things easy for us and create reviews × vocabulary matrix of zeros right away by np.zeros((len(sequences), dimension)). And then just go through words in each review and flip the indicator to 1.0 at position corresponding to that word:
result[review_id][word_id] = 1.0
So instead of doing 25000 x 10000 = 250 000 000 operations, we only did number of words = 5 967 841. That's just ~2.5% of original amount of operations.

The for loop here is not processing all the matrix. As you can see, it enumerates elements of the sequence, so it's looping only on one dimension.
Let's take a simple example :
t = np.array([1,2,3,4,5,6,7,8,9])
r = np.zeros((len(t), 10))
Output
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
then we modify elements with the same way you have :
for i, s in enumerate(t):
r[i,s] = 1.
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
you can see that the for loop modified only a set of elements (len(t)) which has index [i,s] (in this case ; (0, 1), (1, 2), (2, 3), an so on))

import numpy as np
def vectorize_sequences(sequences, dimension=10000):
results = np.zeros((len(sequences), dimension))
for i, sequence in enumerate(sequences):
results[i, sequence] = 1.
return results

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Creating many state vectors and saving them in a file - python

Related

Python - 2D Array: Finding Coordinates

minimize runtime for numpy array manipulation

Understanding Variance Treshold

Removing NaN rows from a three dimensional array

Vectorize Sequences explanation

Categories

Resources