Speeding up the generation of data sequences into an array using Python - python

I have the following code for building sequences from a dataframe, which holds CSV data of rain ratios.
import pandas as pd
import numpy as np
import sklearn
import sklearn.preprocessing

seq_len = 1100

def load_data(df_, seq_len):
    data_raw = df_.values # convert to numpy array
    data = []
    data = np.array([data_raw[index: index + seq_len] for index in range(len(data_raw) - (seq_len+1))])
    print(data.shape)

df = pd.read_csv("data.csv", index_col = 0)
temp = df.copy()
temp = normalize_data(temp) # normalize_data is defined elsewhere in the full code

load_data(temp, seq_len)
When I run load_data(temp, seq_len), I have to wait a long time. I am not sure whether the problem is the size of seq_len.
Here is the attached dataset: data.csv
Please help me make it faster. In the future I may have bigger data, but if this becomes faster I will not have to worry about that.
**EDITED:** As per @ParitoshSingh's comment, here is part of the dataset. But do not treat this as the full data; it is just a portion of a bigger dataset:
,rains_ratio_2013,rains_ratio_2014
0,1.12148,1.1216
1,1.12141,1.12162
2,1.12142,1.12163
3,1.12148,1.1216
4,1.12143,1.12165
5,1.12141,1.12161
6,1.1213799999999998,1.12161
7,1.1214,1.12158
8,1.1214,1.12158
9,1.12141,1.12158
10,1.12141,1.12161
11,1.12144,1.1215899999999999
12,1.12141,1.12162
13,1.12141,1.12161
14,1.12143,1.12161
15,1.12143,1.1216899999999999
16,1.12143,1.12173
17,1.12143,1.12178
18,1.1214600000000001,1.12179
19,1.12148,1.12174
20,1.12148,1.1217
21,1.12148,1.12174
22,1.12148,1.1217
23,1.12145,1.1217
24,1.12145,1.1217
25,1.12148,1.1217
26,1.1214899999999999,1.1217
27,1.1214899999999999,1.1216899999999999
28,1.12143,1.1216899999999999
29,1.12143,1.1216899999999999
30,1.12144,1.1216899999999999

This is essentially a sliding window problem.
One approach is to use vectorization to take the sliding windows over the data faster. Note that if you do not have enough memory to hold the final output, this may cause issues as well.
import numpy as np
import pandas as pd
Creating a dummy dataframe for ease of use. You should test on your original dataframe.
seq_len = 5
df = pd.DataFrame(np.arange(300).reshape(-1, 3))
print(df.head())
#Output:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
Now, we can create an array for all indexes that we need to use, and use indexing to access all our values in the desired format.
def load_data(df_, seq_len):
    data_raw = df_.values # convert to numpy array
    #find the total number of windows
    nrows = len(data_raw) - seq_len + 1 #Your code had -(seq_len + 1) for some reason. I am assuming that was just a mistake. If not, correct this accordingly.
    #Now, create an index matrix from the total number of windows.
    data = data_raw[np.arange(nrows)[:,None] + np.arange(seq_len)]
    print("shape is", data.shape)
    return data
out = load_data(df, seq_len)
#Output: shape is (96, 5, 3)
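To see what the indexing trick does, here is the index matrix it builds for a small, hypothetical case: row i holds the seq_len consecutive indices starting at i, so data_raw[idx] gathers every window in a single vectorized lookup.
import numpy as np

nrows, seq_len = 4, 3 # toy values, just for illustration
idx = np.arange(nrows)[:, None] + np.arange(seq_len) # broadcasting: (4, 1) + (3,) -> (4, 3)
print(idx)
#Output:
#[[0 1 2]
# [1 2 3]
# [2 3 4]
# [3 4 5]]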
EDIT: If you run into memory errors, you can always modify the function to use a generator instead. That way you take a middle ground between iterating one window at a time and consuming too much memory at once.
def load_data_gen(df_, seq_len, chunksize=10):
    data_raw = df_.values # convert to numpy array
    nrows = len(data_raw) - seq_len + 1
    for i in range(0, nrows, chunksize):
        data = data_raw[np.arange(i, min(i+chunksize, nrows))[:,None] + np.arange(seq_len)]
        print("shape is", data.shape)
        yield data
out = load_data_gen(df, seq_len, 15)
test = list(out)
#Output:
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (6, 5, 3)
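As an aside: if your NumPy version is 1.20 or newer, numpy.lib.stride_tricks.sliding_window_view produces the same windows as a zero-copy read-only view, so the index matrix never needs to be materialized. A minimal sketch, assuming windows are taken along axis 0 of the same dummy data:
import numpy as np

data_raw = np.arange(300).reshape(-1, 3) # same dummy data as above
seq_len = 5

windows = np.lib.stride_tricks.sliding_window_view(data_raw, seq_len, axis=0)
# sliding_window_view appends the window axis, giving (96, 3, 5);
# transpose to match the (windows, seq_len, columns) layout used above
windows = windows.transpose(0, 2, 1)
print("shape is", windows.shape)
#Output: shape is (96, 5, 3)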

Related

How to divide each row of data into 3 matrices in Python?

I have data with 1034 columns, and I want to divide each row of it into 3 matrices of 49*7. That leaves 5 columns over; delete them. How can I do this in Python?
First, I removed the last 5 columns from the data.
rawData = pd.read_csv('../input/smartgrid/data/data.csv')#import the data
#remove the last 5 columns
rawData.pop('2016/9/9')
rawData.pop('2016/9/8')
rawData.pop('2016/9/7')
rawData.pop('2016/9/6')
rawData.pop('2016/9/5')
Then the data goes through preprocessing. After that, it is fed to this function, which is supposed to divide each row into three matrices week1, week2 and week3.
def CNN2D(X_train, X_test, y_train, y_test):
    print('2D - Convolutional Neural Network:')
    #Transforming every row of the train set into a 2D array
    n_array_X_train = X_train.to_numpy()
    #divide n_array_X_train into 3 matrices in order to apply it in a convolution layer like RGB color
    week1 = [] # the first matrix
    week2 = [] # the second matrix
    week3 = [] # the third matrix
Here's a way to do what you're asking:
import pandas as pd
import numpy as np
#rawData = pd.read_csv('../input/smartgrid/data/data.csv')#import the data
rawData = pd.DataFrame([[x * 5 + i for x in range(1034)] for i in range(2)], columns=range(1034))
numRowsPerMatrix = len(rawData.columns) // 7 // 3
numColsNeeded = 3 * 7 * numRowsPerMatrix
rawData = rawData.T.iloc[:numColsNeeded].T
for i in range(len(rawData.index)):
    n_array_X_train = rawData.iloc[i].to_numpy()
    week1 = np.reshape(n_array_X_train[:49 * 7], (49, 7)) # the first matrix
    week2 = np.reshape(n_array_X_train[49 * 7: 2 * 49 * 7], (49, 7)) # the second matrix
    week3 = np.reshape(n_array_X_train[2 * 49 * 7:], (49, 7)) # the third matrix
The line rawData = rawData.T.iloc[:numColsNeeded].T transposes the dataframe, slices only the required rows (which were columns in the original df, i.e. all but the last 5), then transposes it back.
The assignments to week1, week2 and week3 slice successive thirds of the 1D numpy array in the current row of rawData and reshape each into a 49-row by 7-column matrix.
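If the three weekly matrices are always contiguous thirds of the row, the three slice-and-reshape steps can also be collapsed into a single reshape. A small sketch under that assumption, using dummy data in place of a real row:
import numpy as np

row = np.arange(3 * 49 * 7) # stands in for one already-trimmed row of 1029 values
week1, week2, week3 = row.reshape(3, 49, 7) # unpacking iterates over the first axis
print(week1.shape, week2.shape, week3.shape)
#Output: (49, 7) (49, 7) (49, 7)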

Adding data into new dimension of numpy array

I have a numpy array of size 55 x 10 x 10, which represents 55 10 x 10 grayscale images. I'm trying to make them RGB by duplicating the 10 x 10 images 3 times.
From what I've understood, I first need to add a new dimension to house the duplicated data. I've done this using:
array_4d = np.expand_dims(array_3d, 1),
so I now have a 55 x 1 x 10 x 10 array. How do I now duplicate the 10 x 10 images and add them back into this array?
Quick edit: In the end I want a 55 x 3 x 10 x 10 array
Let us first create a 3d array of size 55x10x10
from matplotlib import pyplot as plt
import numpy as np
original_array = np.random.randint(10,255, (55,10,10))
print(original_array.shape)
>>>(55, 10, 10)
Visual of first image in array:
first_img = original_array[0,:,:]
print(first_img.shape)
plt.imshow(first_img, cmap='gray')
>>>(10, 10)
Now you can get your desired array in just one single step.
stacked_img = np.stack(3*(original_array,), axis=1)
print(stacked_img.shape)
>>>(55, 3, 10, 10)
Use axis=-1 if you want channels last.
Now let us verify that the values are correct by extracting the first image from this array and taking the average of the 3 channels:
new_img = stacked_img[0,:,:,:]
print(new_img.shape)
>>> (3, 10, 10)
new_img_mean = new_img.mean(axis=0)
print(new_img_mean.shape)
>>> (10, 10)
np.allclose(new_img_mean, first_img) # If this is True then the two arrays are the same
>>> True
For visual verification, you'll have to move the channel axis to the end, because that is what matplotlib expects. This is a 3-channel image, so we are not using cmap='gray' here:
print(np.moveaxis(new_img, 0, -1).shape)
plt.imshow(np.moveaxis(new_img, 0, -1))
>>> (10, 10, 3)
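For comparison, np.repeat and np.broadcast_to are two other ways to get the same (55, 3, 10, 10) result; np.broadcast_to even avoids the copy by returning a read-only view. A brief sketch:
import numpy as np

original_array = np.random.randint(10, 255, (55, 10, 10))

# repeat the singleton channel axis 3 times (makes a copy)
repeated = np.repeat(original_array[:, None, :, :], 3, axis=1)
print(repeated.shape) # (55, 3, 10, 10)

# broadcast the singleton axis instead (read-only view, no copy)
view = np.broadcast_to(original_array[:, None, :, :], (55, 3, 10, 10))
print(view.shape) # (55, 3, 10, 10)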

Vectorizing a for loop using slicing in NumPy

I have this for loop:
import numpy as np

blockSize = 5
ds = np.arange(20)
ds = np.reshape(ds, (1, len(ds)))
counts = np.zeros((1, len(ds[0]) // blockSize))
for i in range(len(counts[0])):
    counts[0, i] = np.floor(np.sum(ds[0, i*blockSize:i*blockSize+blockSize]))
I am trying to vectorize it, doing something like this:
countIndices = np.arange(len(counts[0]))
counts[0, countIndices] = np.floor(np.sum(ds[0, countIndices*blockSize:countIndices*blockSize + blockSize]))
However, this does not work and gives this error:
counts[0, countIndices] = np.floor(np.sum(ds[0, countIndices*blockSize:countIndices*blockSize + blockSize]))
TypeError: only integer scalar arrays can be converted to a scalar index
I know that something like this works:
counts[0, countIndices] = np.floor(ds[0, countIndices*blockSize]
+ ds[0, countIndices*blockSize + 2] +
... ds[0, countIndices*blockSize + blockSize])
The issue is that for large values of blockSize (and blockSize is very large in my actual code), writing this out is not feasible. I am confused about how to accomplish what I want. Any help is greatly appreciated.
You don't need to do floor if you store the result in an integer array. You can also create a fake new axis of size block_size to fully vectorize your operation.
block_size = 5
ds = np.arange(80.0).reshape(4, -1) # Shape (4, 20)
counts = np.empty((ds.shape[0], ds.shape[1] // block_size), dtype=int)
To introduce the fake dimension and sum:
ds.reshape(ds.shape[0], -1, block_size).sum(axis=-1, out=counts)
Reshaping does not copy the data, so the operation ds.reshape(ds.shape[0], -1, block_size) is extremely cheap in both time and space.
You can use -1 for one of the reshape dimensions to avoid computing/writing out long division expressions.
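A quick sanity check against the question's loop (using its 1 x 20 example) confirms that the reshape trick computes the same block sums; this snippet is only for verification:
import numpy as np

block_size = 5
ds = np.arange(20).reshape(1, -1)

# loop version from the question
loop_counts = np.zeros((1, ds.shape[1] // block_size))
for i in range(loop_counts.shape[1]):
    loop_counts[0, i] = np.floor(np.sum(ds[0, i*block_size:(i+1)*block_size]))

# vectorized version via the extra axis
vec_counts = ds.reshape(ds.shape[0], -1, block_size).sum(axis=-1)

print(np.array_equal(loop_counts, vec_counts)) # True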

Numpy split array without copying

I have a very large array of images (multiple GBs) and want to split it using numpy. This is my code:
images = ... # this is the very large array which contains a lot of images.
images.shape => (50000, 256, 256)
indices = ... # array containing ranges, that group the images array like [(0, 300), (301, 580), (581, 860), ...]
train_indices, test_indices = ... # both arrays contain indices like [1, 6, 8, 19] which determine which groups are in the train and which are in the test group
images_train, images_test = np.empty([0, images.shape[1], images.shape[2]]), np.empty([0, images.shape[1], images.shape[2]])

# assign the image groups to either train or test set
for (i, rng) in enumerate(indices):
    group_range = range(rng[0], rng[1]+1)
    if i in train_indices:
        images_train = np.concatenate((images_train, images[group_range]))
    else:
        images_test = np.concatenate((images_test, images[group_range]))
The problem with this code is that images_train and images_test are new arrays, and the individual images are always copied into them. This doubles the memory needed to run the program.
Is there a way to split my images array into images_train and images_test without having to copy the images, but rather reuse them?
My intention with the indices is to group the images into roughly 150 groups, where the images of one group should end up entirely in either the train or the test set.
Without running code it's difficult to understand the details, but I can try to give some ideas. If you have images_train and images_test, then you will probably use them to train and to test with commands something like
.fit(images_train);
.score(images_test)
An approach might be that you do not build images_train and images_test at all, but use parts of images directly:
.fit(images[...]);
.score(images[...])
Now the question is: what should go inside the [...] brackets? Or is there a numpy operation that extracts the right images[...]? First we have to think about what we should avoid:
a Python for loop is slow
iteratively filling an array, as in A = np.concatenate((A, B[j])), is slow because every step copies the whole array
"fancy indexing" is slow because it copies, as in group_range = range(rng[0], rng[1]+1); images[group_range]
Some ideas:
use slices instead of "fancy indexing" (slices give views rather than copies):
images[rng[0] : rng[1]+1], or
group_range = slice(rng[0] , rng[1]+1); images[group_range]
Is images_train = images[train_indices, :, :] and images_test = images[test_indices, :, :]?
images.shape => (50000, 256, 256) is 3-dimensional?
try whether numpy.where can give some assistance
Below are the methods I've mentioned:
import numpy as np
A = np.arange(20); print("A =",A)
B = A[5:16:2]; print("B =",B) # view of A only, faster
j = slice(5, 16, 2); C = A[j]; print("C =",C) # view of A only, faster
k = [2, 4, 8, 12]; D = A[k]; print("D =",D) # generates internal copies
A = [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
B = [ 5 7 9 11 13 15]
C = [ 5 7 9 11 13 15]
D = [ 2 4 8 12]
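If you want to check directly whether an indexing operation returned a view or a copy, np.shares_memory can tell you. A short illustration of the point made above:
import numpy as np

A = np.arange(20)
print(np.shares_memory(A, A[5:16:2])) # True: the slice is a view of A
print(np.shares_memory(A, A[[2, 4, 8, 12]])) # False: fancy indexing made a copy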

Numpy: compare 2 array shapes, and if different, append 0 to match shape

I am comparing 2 numpy arrays and want to add them together. But before doing so, I need to make sure they are the same size. If the sizes are not the same, I want to take the smaller one and fill its last rows with zeros to match the shape.
Both arrays have 16 columns and N rows. I am assuming it should be pretty straightforward, but I can't get my head around it. So far I am able to compare the 2 array shapes.
import csv
import numpy as np
import sys

data = np.genfromtxt('./test1.csv', dtype=float, delimiter=',')
data_sys = np.genfromtxt('./test2.csv', dtype=float, delimiter=',')

print(data.shape)
print(data_sys.shape)

if data.shape != data_sys.shape:
    print("we have an error")
This is the output I got:
=============New file.csv============
(603, 16)
(604, 16)
we have an error
I want to fill the last rows of the "data" array with 0 so that I can add the 2 arrays.
Thanks for your help.
You can use np.vstack((array1, array2)) from numpy, which stacks arrays vertically. For example:
A = np.random.randint(2, size = (2, 16))
B = np.random.randint(2, size = (5, 16))

print(A.shape)
print(B.shape)

if A.shape[0] < B.shape[0]:
    A = np.vstack((A, np.zeros((B.shape[0] - A.shape[0], 16))))
elif A.shape[0] > B.shape[0]:
    B = np.vstack((B, np.zeros((A.shape[0] - B.shape[0], 16))))

print(A.shape)
print(A)
In your case:
if data.shape[0] < data_sys.shape[0]:
    data = np.vstack((data, np.zeros((data_sys.shape[0] - data.shape[0], 16))))
elif data.shape[0] > data_sys.shape[0]:
    data_sys = np.vstack((data_sys, np.zeros((data.shape[0] - data_sys.shape[0], 16))))
I assume that your matrices always have the same number of columns; if not, you can similarly use hstack to stack them horizontally.
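For completeness, a sketch of the hstack case with hypothetical shapes, padding zero columns onto the right of the narrower array:
import numpy as np

A = np.random.randint(2, size = (4, 10))
B = np.random.randint(2, size = (4, 16))

if A.shape[1] < B.shape[1]:
    A = np.hstack((A, np.zeros((A.shape[0], B.shape[1] - A.shape[1]))))

print(A.shape) # (4, 16)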
If you have only two files, and their shapes differ in just the 0th dimension, a simple check and copy is probably easiest, though it lacks generality:
import numpy as np

data = np.genfromtxt('./test1.csv', dtype=float, delimiter=',')
data_sys = np.genfromtxt('./test2.csv', dtype=float, delimiter=',')

fill_value = 0 # could be np.nan or something else instead

if data.shape[0] > data_sys.shape[0]:
    temp = data_sys
    data_sys = np.ones(data.shape)*fill_value
    data_sys[:temp.shape[0], :] = temp
elif data.shape[0] < data_sys.shape[0]:
    temp = data
    data = np.ones(data_sys.shape)*fill_value
    data[:temp.shape[0], :] = temp

print('Using conditional:')
print(data.shape)
print(data_sys.shape)

if data.shape != data_sys.shape:
    print("we have an error")
A much more general solution is a custom class: overkill for your two files, but much easier if you have lots of files to handle. The basic idea is that static class variables sx and sy keep track of the largest width and height seen so far, and are used when get_data is called to output an array of that standard shape. The array is pre-filled with your desired fill value, and the actual data from the corresponding file are copied into its upper left corner:
import numpy as np

class IsomorphicArray:
    sy = 0 # static class variable
    sx = 0 # static class variable
    fill_value = 0.0
    def __init__(self, csv_filename):
        self.data = np.genfromtxt(csv_filename, dtype=float, delimiter=',')
        self.instance_sy, self.instance_sx = self.data.shape
        if self.instance_sy > IsomorphicArray.sy:
            IsomorphicArray.sy = self.instance_sy
        if self.instance_sx > IsomorphicArray.sx:
            IsomorphicArray.sx = self.instance_sx
    def get_data(self):
        out = np.ones((IsomorphicArray.sy, IsomorphicArray.sx))*self.fill_value
        out[:self.instance_sy, :self.instance_sx] = self.data
        return out

isomorphic_array_list = []
for filename in ['./test1.csv', './test2.csv']:
    isomorphic_array_list.append(IsomorphicArray(filename))

numpy_array_list = []
for isomorphic_array in isomorphic_array_list:
    numpy_array_list.append(isomorphic_array.get_data())

print('Using custom class:')
for numpy_array in numpy_array_list:
    print(numpy_array.shape)
Assuming both arrays have 16 columns
len1 = len(data)
len2 = len(data_sys)

if len1 < len2:
    data = np.append(data, np.zeros((len2-len1, 16)), axis=0)
elif len2 < len1:
    data_sys = np.append(data_sys, np.zeros((len1-len2, 16)), axis=0)

print(data.shape)
print(data_sys.shape)

if data.shape != data_sys.shape:
    print("we have an error")
else:
    print("we r good")
Numpy provides an append function to add values to an array; see the numpy.append documentation for details. For multi-dimensional arrays you can define, via the axis argument, how the values should be added. As you already know which of your arrays is the smaller one, just create a zero-filled array of the desired number of rows with numpy.zeros and then append it to your target array.
It might be necessary to flatten your array first and then to reshape it.
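For instance, a minimal sketch of padding along axis 0 with np.append, using dummy shapes:
import numpy as np

data = np.ones((3, 16))
pad = np.zeros((2, 16))

padded = np.append(data, pad, axis=0) # with axis=0 this behaves like vstack
print(padded.shape) # (5, 16)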
I had a similar situation: two arrays of sizes mask_in: (n1, m1) and mask_ot: (n2, m2) that were generated through a mask of a 2D image of size (N, M), where mask_ot is larger than mask_in and both share a common center (X0, Y0). I followed the approach suggested by @AniaG using vstack and hstack: I obtained the shapes of both arrays and their size difference, and finally accounted for the number of missing elements at both ends.
Here is what I got:
import numpy as np

mask_in = np.random.randint(2, size = (2, 8))
mask_ot = np.random.randint(2, size = (6, 16))

mask_in_amp = mask_in
dif_row = mask_ot.shape[0] - mask_in_amp.shape[0]
dif_col = mask_ot.shape[1] - mask_in_amp.shape[1]
complete_row = dif_row // 2 # integer division, so the result can be used as a shape
complete_col = dif_col // 2

mask_in_amp = np.vstack((mask_in_amp, np.zeros((complete_row, mask_in_amp.shape[1]))))
mask_in_amp = np.vstack((np.zeros((complete_row, mask_in_amp.shape[1])), mask_in_amp))
mask_in_amp = np.hstack((mask_in_amp, np.zeros((mask_in_amp.shape[0], complete_col))))
mask_in_amp = np.hstack((np.zeros((mask_in_amp.shape[0], complete_col)), mask_in_amp))
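The same centered zero-padding can also be written with np.pad, which takes (before, after) pad widths per axis. A compact sketch using the shapes from the example above:
import numpy as np

mask_in = np.random.randint(2, size = (2, 8))
mask_ot = np.random.randint(2, size = (6, 16))

dif_row = mask_ot.shape[0] - mask_in.shape[0]
dif_col = mask_ot.shape[1] - mask_in.shape[1]

# split each difference into before/after so odd differences are handled too
mask_in_amp = np.pad(mask_in, ((dif_row // 2, dif_row - dif_row // 2),
                               (dif_col // 2, dif_col - dif_col // 2)))
print(mask_in_amp.shape) # (6, 16)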
If you don't care about the exact shapes of the two arrays, you can also do the following:
if data.size == data_sys.size:
    print('arrays have the same number of elements, and possibly the same shape')
else:
    print('arrays definitely do not have the same shape')
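Note that equal size does not imply equal shape: a (2, 8) array and a (4, 4) array both contain 16 elements. A tiny illustration:
import numpy as np

a = np.zeros((2, 8))
b = np.zeros((4, 4))
print(a.size == b.size) # True
print(a.shape == b.shape) # False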
