I have data with 1034 columns, and I want to divide each row of it into 3 matrices of 49*7. That leaves 5 columns, which I want to delete. How can I do this in Python?
First, I removed the last 5 columns from the data.
rawData = pd.read_csv('../input/smartgrid/data/data.csv')#import the data
#remove the last 5 columns
rawData.pop('2016/9/9')
rawData.pop('2016/9/8')
rawData.pop('2016/9/7')
rawData.pop('2016/9/6')
rawData.pop('2016/9/5')
Then the data is preprocessed. After that, it is fed to this function, which is supposed to divide each row into three matrices: week1, week2 and week3.
def CNN2D(X_train, X_test, y_train, y_test):
    print('2D - Convolutional Neural Network:')
    #Transforming every row of the train set into a 2D array
    n_array_X_train = X_train.to_numpy()
    #divide n_array_X_train into 3 matrices so it can be fed to a convolution layer like the RGB channels of an image
    week1= [] # the first matrix
    week2= [] # the second matrix
    week3= [] # the third matrix
Here's a way to do what you're asking:
import pandas as pd
import numpy as np
#rawData = pd.read_csv('../input/smartgrid/data/data.csv')#import the data
rawData = pd.DataFrame([[x * 5 + i for x in range(1034)] for i in range(2)], columns=range(1034))
numRowsPerMatrix = len(rawData.columns) // 7 // 3
numColsNeeded = 3 * 7 * numRowsPerMatrix
rawData = rawData.T.iloc[:numColsNeeded].T
for i in range(len(rawData.index)):
    n_array_X_train = rawData.iloc[i].to_numpy()
    week1= np.reshape(n_array_X_train[:49 * 7], (49, 7)) # the first matrix
    week2= np.reshape(n_array_X_train[49 * 7: 2 * 49 * 7], (49, 7)) # the second matrix
    week3= np.reshape(n_array_X_train[2 * 49 * 7:], (49, 7)) # the third matrix
The line rawData = rawData.T.iloc[:numColsNeeded].T transposes the array, slices only the required rows (which were columns in the original df, all but last 5), then transposes it back.
The assignments to week1, week2 and week3 slice successive thirds of the 1D numpy array in the current row of rawData and reshape each into a 49 row by 7 column matrix.
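The same split can also be done for all rows at once, without a Python loop. The following is only a sketch under the assumptions of the toy frame above (1034 columns, three 49 by 7 matrices per row); the names `raw`, `weeks` and `n_keep` are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

# Stand-in for the real data: 2 rows, 1034 columns (same toy values as above).
raw = pd.DataFrame([[x * 5 + i for x in range(1034)] for i in range(2)])

n_weeks, rows_per_week, cols_per_week = 3, 49, 7
n_keep = n_weeks * rows_per_week * cols_per_week  # 1029, so the last 5 columns are dropped

# Keep the first 1029 columns, then view every row as 3 stacked 49x7 matrices.
weeks = raw.iloc[:, :n_keep].to_numpy().reshape(-1, n_weeks, rows_per_week, cols_per_week)

# weeks[i, k] is the k-th 49x7 matrix of row i.
week1, week2, week3 = weeks[:, 0], weeks[:, 1], weeks[:, 2]
print(weeks.shape)  # (2, 3, 49, 7)
```

The resulting array has the layout (samples, channels, height, width); if a framework expects the channel axis last, `np.moveaxis(weeks, 1, -1)` would give (samples, 49, 7, 3).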
I have a numpy array of size 55 x 10 x 10, which represents 55 10 x 10 grayscale images. I'm trying to make them RGB by duplicating the 10 x 10 images 3 times.
From what I've understood, I first need to add a new dimension to house the duplicated data. I've done this using:
array_4d = np.expand_dims(array_3d, 1),
so I now have a 55 x 1 x 10 x 10 array. How do I now duplicate the 10 x 10 images and add them back into this array?
Quick edit: In the end I want a 55 x 3 x 10 x 10 array
Let us first create a 3d array of size 55x10x10
from matplotlib import pyplot as plt
import numpy as np
original_array = np.random.randint(10,255, (55,10,10))
print(original_array.shape)
>>>(55, 10, 10)
Visual of first image in array:
first_img = original_array[0,:,:]
print(first_img.shape)
plt.imshow(first_img, cmap='gray')
>>>(10, 10)
Now you can get your desired array in just one single step.
stacked_img = np.stack(3*(original_array,), axis=1)
print(stacked_img.shape)
>>>(55, 3, 10, 10)
Use axis=-1 if you want channel last
Now let us verify that the values are correct by extracting the first image from this array and taking the average of the 3 channels:
new_img = stacked_img[0,:,:,:]
print(new_img.shape)
>>> (3, 10, 10)
new_img_mean = new_img.mean(axis=0)
print(new_img_mean.shape)
>>> (10, 10)
np.allclose(new_img_mean, first_img) # If this is True then the two arrays are same
>>> True
For visual verification, you'll have to move the channel to last because that is what matplotlib needs. This is a 3 channel image, so we are not using cmap='gray' here
print(np.moveaxis(new_img, 0, -1).shape)
plt.imshow(np.moveaxis(new_img, 0, -1))
>>> (10, 10, 3)
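If you would rather continue from the `np.expand_dims` step in the question, `np.repeat` along the new axis gives the same result as `np.stack`. This is a small sketch with random stand-in data; the names `rgb_first` and `rgb_last` are made up for illustration.

```python
import numpy as np

# Stand-in for the 55 grayscale images of size 10 x 10.
array_3d = np.random.randint(10, 255, (55, 10, 10))

# Add the channel axis as in the question, then repeat it 3 times along that axis.
array_4d = np.expand_dims(array_3d, 1)        # (55, 1, 10, 10)
rgb_first = np.repeat(array_4d, 3, axis=1)    # (55, 3, 10, 10)

# Channel-last variant: add the new axis at the end and repeat there.
rgb_last = np.repeat(array_3d[..., np.newaxis], 3, axis=-1)  # (55, 10, 10, 3)

print(rgb_first.shape, rgb_last.shape)
```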
I have big binary 3D data and I want to re-arrange it into a sequence of values, in the order obtained by parsing the original data as sub-arrays of size (4x4x4).
For example, if the data is 2D and I want to re-arrange the data from 2x2 sub-arrays
example image
I used simple loops for this, but just iterating over the loops took far too long. I am trying to use NumPy functions instead, but I am new to SciPy.
My code looks like this
x,y,z = 1200,800,400
data = np.fromfile(file_name, dtype=np.float32)
data.shape = (z,y,x)
new_data = np.empty(shape=x*y*z, dtype = np.float32)
index = 0
for zz in range(0,z,4):
    for yy in range(0,y,4):
        for xx in range(0,x,4):
            for zShift in range(4):
                for yShift in range(4):
                    for xShift in range(4):
                        new_data[index] = data[zz+zShift][yy+yShift][xx+xShift]
                        index+=1
new_data.tofile(output)
However, this takes a lot of time, any better implementation ideas?
As I said, the code works as intended, however, I need a smarter, pythonic way to achieve my output
Thank you!
x,y,z = 1200,800,400
data = np.empty([x,y,z])
# numpy calculates the shape of -1
out = data.reshape(-1, 4, 4, 4)
out.shape
>>> (6000000, 4, 4, 4)
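Note that a plain `reshape` keeps the elements in their original order, so each `(4, 4, 4)` slab it returns is a run of 64 consecutive values rather than a spatial block. To reproduce the order of the nested loops in the question, the axes also have to be permuted. The sketch below shows one way to do that; `to_blocks` is a hypothetical helper name, and it assumes every dimension is divisible by the block size.

```python
import numpy as np

def to_blocks(data, b):
    """Reorder a (z, y, x) array so that each b*b*b block is stored contiguously."""
    z, y, x = data.shape
    # Split every axis into (block index, offset inside the block) ...
    split = data.reshape(z // b, b, y // b, b, x // b, b)
    # ... then move the three block indices in front of the three offsets.
    return split.transpose(0, 2, 4, 1, 3, 5).reshape(-1, b, b, b)

# Small demonstration array in (z, y, x) layout.
demo = np.arange(4 * 8 * 12, dtype=np.float32).reshape(4, 8, 12)
blocks = to_blocks(demo, 4)
print(blocks.shape)  # (6, 4, 4, 4)
```

With the real data, `to_blocks(data, 4).ravel().tofile(output)` would write the values in block order. The final `reshape` after the transpose has to copy the data, so this needs roughly twice the memory of the input array.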
Perform the following test, for smaller data and block size:
x, y, z = 4, 4, 4 # Dimensions
stp = 2 # Block size (in each dimension)
# Create the test array
arr = np.arange(x * y * z).reshape((x, y, z))
And to create a list of "blocks", run:
new_data = []
for xx in range(0, x, stp):
    for yy in range(0, y, stp):
        for zz in range(0, z, stp):
            print('Index:', xx, yy, zz)
            obj = arr[xx:xx+stp, yy:yy+stp, zz:zz+stp].copy()
            print(obj)
            new_data.append(obj)
In the target version of your code:
restore original values of x, y and z,
read the array from your source,
change stp back to 4,
drop test printouts.
Note also that your code adds individual elements to new_data, only iterating over blocks of size 4 * 4 * 4, whereas you wrote that you want a sequence of smaller arrays (i.e. slices) of size 4 * 4 * 4, which is what my code does. So if you need a list of slices (smaller arrays), not a single 4-D array, use my code.
I want to build a linear equation with a variable number of inputs, for example
y = θ0*x0 + θ1*x1
or
y = θ0*x0 + θ1*x1 + θ2*x2 + θ3*x3 + θ4*x4
For that I have
a dictionary for x0, x1, x2, ..., xn
and an array for θ0, θ1, θ2, ..., θn.
I'm new to Python, so I tried this function, but I'm stuck.
So my question is: how can I write a function that takes x_values and theta_values as parameters and returns y_values as output?
X = pd.DataFrame({'x0': np.ones(6), 'x1': np.linspace(0, 5, 6)})
θ = np.matrix('0 1')
def line_func(features, parameters):
    result = []
    for feat, param in zip(features.iteritems(), parameters):
        for i in feat:
            result.append(i*param)
    return result
line_func(X,θ)
If you want to multiply your thetas with a list of features, then you are technically multiplying a matrix (the features) by a vector (theta).
You can do this as follows:
import numpy as np
x_array= x.values
theta= np.array([theta_0, theta_1])
x_array.dot(theta)
Just order your theta vector the way the columns are ordered in x. Note that this gives the row-wise sum of the products theta_i*x_i over all i. If you don't want the products summed row-wise, write x_array * theta instead.
If you want to stay in pandas for the multiplication as well (which I wouldn't recommend) and get a dataframe with the products of each column value and the corresponding theta, you could do it as follows:
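Applied to the `line_func` from the question, the dot-product approach could look like the sketch below. It assumes the columns of the DataFrame are in the same order as the entries of theta; the variable name `theta` (instead of `θ`) and the conversion through `np.asarray(...).ravel()` are choices made for this example, so that a plain list, an array or the `np.matrix` from the question would all be accepted.

```python
import numpy as np
import pandas as pd

def line_func(features, parameters):
    """Return y = theta0*x0 + theta1*x1 + ... for every row of `features`."""
    theta = np.asarray(parameters, dtype=float).ravel()  # flatten to a 1-D vector
    return features.to_numpy() @ theta                   # row-wise sum of products

X = pd.DataFrame({'x0': np.ones(6), 'x1': np.linspace(0, 5, 6)})
theta = np.array([0, 1])

y = line_func(X, theta)
print(y)  # [0. 1. 2. 3. 4. 5.]
```

Because the matrix product handles any number of columns, the same function works for two parameters or for five without changes.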
# define the theta-x mapping (theta-value per column name in x)
thetas={'x1': 1, 'x2': 3}
# create an empty result dataframe with the index of x
df_result= pd.DataFrame(index=x.index)
# assign the calculated columns in a loop (items() replaces iteritems(), which was removed in pandas 2.0)
for col_name, col_series in x.items():
    df_result[col_name]= col_series*thetas[col_name]
df_result
This results in:
x1 x2
0 1 6
1 -1 3
I have the following code for making the sequences of the dataframe, which has loaded the csv data of rains ratios.
import pandas as pd
import numpy as np
import sklearn
import sklearn.preprocessing
seq_len = 1100
def load_data(df_, seq_len):
    data_raw = df_.values # convert to numpy array
    data = []
    data = np.array([data_raw[index: index + seq_len] for index in range(len(data_raw) - (seq_len+1))])
    print(data.shape)
df = pd.read_csv("data.csv",index_col = 0)
temp = df.copy()
temp = normalize_data(temp)
load_data(temp, seq_len)
When I run load_data(temp, seq_len), I have to wait a long time. I don't understand whether seq_len is the cause.
Here is the attached dataset: data.csv
Please help me make it faster. I may have bigger data in the future, but if this version becomes faster I will not have to worry about that.
**EDITED:** As per #ParitoshSingh's comment, here is part of the dataset. Do not treat it as the whole data; it is just a part of a bigger dataset:
,rains_ratio_2013,rains_ratio_2014
0,1.12148,1.1216
1,1.12141,1.12162
2,1.12142,1.12163
3,1.12148,1.1216
4,1.12143,1.12165
5,1.12141,1.12161
6,1.1213799999999998,1.12161
7,1.1214,1.12158
8,1.1214,1.12158
9,1.12141,1.12158
10,1.12141,1.12161
11,1.12144,1.1215899999999999
12,1.12141,1.12162
13,1.12141,1.12161
14,1.12143,1.12161
15,1.12143,1.1216899999999999
16,1.12143,1.12173
17,1.12143,1.12178
18,1.1214600000000001,1.12179
19,1.12148,1.12174
20,1.12148,1.1217
21,1.12148,1.12174
22,1.12148,1.1217
23,1.12145,1.1217
24,1.12145,1.1217
25,1.12148,1.1217
26,1.1214899999999999,1.1217
27,1.1214899999999999,1.1216899999999999
28,1.12143,1.1216899999999999
29,1.12143,1.1216899999999999
30,1.12144,1.1216899999999999
This is essentially a sliding window problem.
One approach is to use vectorization to take the sliding windows over the data faster. Note that if you do not have enough memory to hold the final output, this may cause problems as well.
import numpy as np
import pandas as pd
Creating some dummy dataframe for ease of use. You should test on your original dataframe.
seq_len = 5
df = pd.DataFrame(np.arange(300).reshape(-1, 3))
print(df.head())
#Output:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
Now, we can create an array for all indexes that we need to use, and use indexing to access all our values in the desired format.
def load_data(df_, seq_len):
    data_raw = df_.values # convert to numpy array
    #find total number of rows
    nrows = len(data_raw) - seq_len + 1 # Your code had -(seq_len + 1); I am assuming that was a mistake. If not, correct this accordingly.
    #Now, create an index matrix from the total number of rows.
    data = data_raw[np.arange(nrows)[:,None] + np.arange(seq_len)]
    print("shape is", data.shape)
    return data
out = load_data(df, seq_len)
#Output: shape is (96, 5, 3)
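On NumPy 1.20 or newer, the same windows can also be taken with `numpy.lib.stride_tricks.sliding_window_view`, which returns a read-only view instead of copying every window. This is a sketch of that alternative, not the method used above; `load_data_view` is a name invented for this example.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def load_data_view(df_, seq_len):
    data_raw = df_.to_numpy()
    # Windows along axis 0; the window axis is appended last: (nrows, ncols, seq_len)
    windows = sliding_window_view(data_raw, seq_len, axis=0)
    # Move the window axis next to the row axis: (nrows, seq_len, ncols)
    return np.moveaxis(windows, -1, 1)

df = pd.DataFrame(np.arange(300).reshape(-1, 3))
out = load_data_view(df, 5)
print(out.shape)  # (96, 5, 3)
```

Since the result is a view, it uses almost no extra memory until something downstream copies it (for example `np.array(out)` or a normalisation step).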
EDIT: If you run into memory errors, you can always modify the function to use a generator instead. This way, you take a middle ground between the two scenarios of iterating one by one or consuming too much memory.
def load_data_gen(df_, seq_len, chunksize=10):
    data_raw = df_.values # convert to numpy array
    nrows = len(data_raw) - seq_len + 1
    for i in range(0, nrows, chunksize):
        data = data_raw[np.arange(i, min(i+chunksize, nrows))[:,None] + np.arange(seq_len)]
        print("shape is", data.shape)
        yield data
out = load_data_gen(df, seq_len, 15)
test = list(out)
#Output:
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (6, 5, 3)
I have a 2D numarray, of size WIDTHxHEIGHT. I would like to bin the array by finding the median of each bin so that the resultant array is WIDTH/binsize x HEIGHT/binsize. Assume that both WIDTH and HEIGHT are divisible by binsize.
Edit: An example is given in the attached image.
I have found solutions where the binned array values are the sum or average of the individual elements in each bin:
How to bin a 2D array in numpy?
However, if I want to do a median combine of elements in each bin, I haven't been able to figure out a solution. Your help would be much appreciated!
Edit: image added
An example of the initial array and desired resultant median binned array
So you are looking for median over strided reshape:
import numpy as np
a = np.arange(24).reshape(4,6)
def median_binner(a,bin_x,bin_y):
    m,n = np.shape(a)
    # integer division (//) keeps the strides integral under Python 3
    strided_reshape = np.lib.stride_tricks.as_strided(a,shape=(bin_x,bin_y,m//bin_x,n//bin_y),strides = a.itemsize*np.array([(m // bin_x) * n, (n // bin_y), n, 1]))
    return np.array([np.median(col) for row in strided_reshape for col in row]).reshape(bin_x,bin_y)
print("Original Matrix:")
print(a)
print("\n")
bin_tester1 = median_binner(a,2,3)
print("2x3 median bin :")
print(bin_tester1)
print("\n")
bin_tester2 = median_binner(a,2,2)
print("2x2 median bin :")
print(bin_tester2)
result:
Original Matrix:
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]]
2x3 median bin :
[[ 3.5 5.5 7.5]
[ 15.5 17.5 19.5]]
2x2 median bin :
[[ 4. 7.]
[ 16. 19.]]
Read this in order to completely understand the following line in the code:
strided_reshape = np.lib.stride_tricks.as_strided(a,shape=(bin_x,bin_y,m//bin_x,n//bin_y),strides = a.itemsize*np.array([(m // bin_x) * n, (n // bin_y), n, 1]))
I was dealing with the same issue. I found the answer of Kennet Celeste very useful, but there are some caveats. The strided reshape is fast, but the loop that follows is slow. The trick is to bring all the data that each median is computed from to the same location in memory and then use a vectorized numpy operation.
If you don't want to fiddle with the strided reshape, you can use the np.swapaxes function. Let's say I have an array X of size xdim x ydim and want to bin it with a window of bin_x x bin_y:
import numpy as np
#Some sample values
xdim= 5039
ydim = 6637
bin_x = 5
bin_y = 7
X = np.random.rand(ydim, xdim)
#now compute reduced dimensions so that bin_x divides xdim_red
xdim_red = xdim - xdim % bin_x
ydim_red = ydim - ydim % bin_y
#and dimensions after binning
xdim_bin = xdim_red // bin_x
ydim_bin = ydim_red // bin_y
#crop X to the end of the indices
X = X[0:ydim_red, 0:xdim_red]
#Here alternative to stride reshape
X.shape = (ydim_bin, bin_y, xdim_bin, bin_x)
X_reshaped = X.swapaxes(1, 2)
#The following can be done on stride_reshape array as well and finally joins the chunks of the memory we need to get together
X_reshaped = X_reshaped.reshape((ydim_bin, xdim_bin, bin_x*bin_y))
#There could be faster implementation but this at least use batc
g = np.median(X_reshaped, axis=-1)
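To check the reshape and swapaxes approach on something small, the steps above can be wrapped in a function and run on the 4 x 6 example from the earlier answer. This is a sketch; `median_bin`, `bin_h` and `bin_w` are names chosen for this example, with `bin_h` and `bin_w` being the block height and width (not the number of bins).

```python
import numpy as np

def median_bin(a, bin_h, bin_w):
    """Median over non-overlapping bin_h x bin_w blocks of a 2D array."""
    h, w = a.shape
    # Crop so that the blocks fit exactly.
    h_red, w_red = h - h % bin_h, w - w % bin_w
    blocks = a[:h_red, :w_red].reshape(h_red // bin_h, bin_h, w_red // bin_w, bin_w)
    # Put the two in-block axes next to each other, flatten them,
    # and take a single vectorized median over the flattened axis.
    blocks = blocks.swapaxes(1, 2).reshape(h_red // bin_h, w_red // bin_w, bin_h * bin_w)
    return np.median(blocks, axis=-1)

a = np.arange(24).reshape(4, 6)
result = median_bin(a, 2, 2)
print(result)  # rows: 3.5 5.5 7.5 and 15.5 17.5 19.5
```

Blocks of 2 x 2 on a 4 x 6 array give a 2 x 3 result, which matches the "2x3 median bin" output shown earlier.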