Python - Reversing data generated with particular loop order - python

Question: How could I peform the following task more efficiently?
My problem is as follows. I have a (large) 3D data set of points in real physical space (x,y,z). It has been generated by a nested for loop that looks like this:
# Generate given dat with its ordering
x_samples = 2
y_samples = 3
z_samples = 4
given_dat = np.zeros(((x_samples*y_samples*z_samples),3))
row_ind = 0
for z in range(z_samples):
for y in range(y_samples):
for x in range(x_samples):
row = [x+.1,y+.2,z+.3]
given_dat[row_ind,:] = row
row_ind += 1
for row in given_dat:
print(row)`
For the sake of comparing it to another set of data, I want to reorder the given data into my desired order as follows (unorthodox, I know):
# Generate data with desired ordering
x_samples = 2
y_samples = 3
z_samples = 4
desired_dat = np.zeros(((x_samples*y_samples*z_samples),3))
row_ind = 0
for z in range(z_samples):
for x in range(x_samples):
for y in range(y_samples):
row = [x+.1,y+.2,z+.3]
desired_dat[row_ind,:] = row
row_ind += 1
for row in desired_dat:
print(row)
I have written a function that does what I want, but it is horribly slow and inefficient:
def bad_method(x_samp,y_samp,z_samp,data):
zs = np.unique(data[:,2])
xs = np.unique(data[:,0])
rowlist = []
for z in zs:
for x in xs:
for row in data:
if row[0] == x and row[2] == z:
rowlist.append(row)
new_data = np.vstack(rowlist)
return new_data
# Shows that my function does with I want
fix = bad_method(x_samples,y_samples,z_samples,given_dat)
print('Unreversed data')
print(given_dat)
print('Reversed Data')
print(fix)
# If it didn't work this will throw an exception
assert(np.array_equal(desired_dat,fix))
How could I improve my function so it is faster? My data sets usually have roughly 2 million rows. It must be possible to do this with some clever slicing/indexing which I'm sure will be faster but I'm having a hard time figuring out how. Thanks for any help!

You could reshape your array, swap the axes as necessary and reshape back again:
# (No need to copy if you don't want to keep the given_dat ordering)
data = np.copy(given_dat).reshape(( z_samples, y_samples, x_samples, 3))
# swap the "y" and "x" axes
data = np.swapaxes(data, 1,2)
# back to 2-D array
data = data.reshape((x_samples*y_samples*z_samples,3))
assert(np.array_equal(desired_dat,data))

Related

Rectangular array with holes

I'm trying to create a rectangular grid with numbers in some cells (but not all of them), in a way such that it's easy to select a given row or column.
What I did so far is to create the list of the positions of the numbers in the grid and the list of the numbers contained in the grid, so that I can select the number at position (i,j) with numbers[positions.index([i,j]), but this is not very handy, especially if I need, for example, to find the minimum of the values in a given column.
Is there a way to create the grid so that, for example, I can select elements with grid[i][j] and columns with grid[:][j] or something similar? The programming language is Python.
You can use numpy for this. It lets you create an array, which can index a single value with array[i,j] or a full column with array[:,j].
I'm not entirely sure what you mean by holes, but numpy will require you to have a value in every spot in the array. The best thing I believe you can set it to a preset "empty" value.
Store your grid as a 2D array (a matrix) and use list comprehensions.
first_column = [row[0] for row in grid]
second_column = [row[1] for row in grid]
If you're going to have a large proportion of the "cells" that are unused, you could try using a dictionary with the coordinates as key in a tuple.
matrix = dict()
matrix[1,3] = 13
matrix[1,5] = 15
matrix[2,3] = 23
matrix[2,7] = 27
matrix[3,7] = 37
valuesInRow2 = [v for (r,c),v in matrix.items() if r==2]
# [23,27]
By creating a subclass of dict to manage indexing and overriding operators, you could get it to behave exactly the way you want:
class Sparse(dict):
def __init__(self,rows=0,cols=0):
super().__init__()
self.rows = rows
self.cols = cols
def __indexToRanges(self,rowIndex,colIndex):
scalar = isinstance(rowIndex,int) and isinstance(colIndex,int)
if isinstance(rowIndex,slice):
rowRange = range(*rowIndex.indices(self.rows))
else:
rowRange = range(rowIndex,rowIndex+1)
if isinstance(colIndex,slice):
colRange = range(*colIndex.indices(self.cols))
else:
colRange = range(colIndex,colIndex+1)
return rowRange,colRange,scalar
def __getitem__(self,indexes):
row,col = indexes
rowRange,colRange,scalar = self.__indexToRanges(row,col)
if scalar: return super()._getitem((row,col))
return [v for (r,c),v in self.items() if r in rowRange and c in colRange]
def __setitem__(self,index,value):
row,col=index
rowRange,colRange,scalar = self.__indexToRanges(row,col)
if scalar:
self.rows = max(self.rows,row+1)
self.cols = max(self.cols,col+1)
return super().__setitem__((row,col),value)
usage:
matrix = Sparse()
matrix[1,3] = 13
matrix[1,5] = 15
matrix[2,3] = 23
matrix[2,7] = 27
matrix[3,7] = 37
print("sum of column 3:", sum(matrix[:,3]) ) # 36
print("sum of row 2:", sum(matrix[2,:]) ) # 50
print("top left 4x4 values:", matrix[:4,:4] ) # [13, 23]

Creating multiple plots in Python for loop

I want to create a plot of (T/Tmax vs R/R0) for different values of Pa in a single plot like below. I have written this code that appends values to a list but all values of (T/Tmax vs R/R0) are appended in single list which does not give a good plot. What can I do to have such a plot? Also how can I make an excel sheet from the data from the loop where column 1 is T/Tmax list and column 2,3,4...are corresponding R/R0 values for different pa?
KLMDAT1 = []
KLMDAT2 = []
for j in range(z):
pa[j] = 120000-10000*j
i = 0
R = R0
q = 0
T = 0
while (T<Tmax):
k1 = KLM_RKM(i*dT,R,q,pa[j])
k2 = KLM_RKM((i+0.5)*dT,R,q+0.5*dT*k1,pa[j])
k3 = KLM_RKM((i+0.5)*dT,R,q+0.5*dT*k2,pa[j])
k4 = KLM_RKM((i+1)*dT,R,q+dT*k3,pa[j])
q = q +1/6.0*dT*(k1+2*k2+2*k3+k4)
R = R+dT*q
if(R>0):
KLMDAT1.append(T / Tmax)
KLMDAT2.append(R / R0)
if(R>Rmax):
Rmax = R
if (abs(q)>c or R < 0):
break
T=T+dT
i = i+1
wb.save('KLM.xlsx')
np.savetxt('KLM.csv',[KLMDAT1, KLMDAT2])
plt.plot(KLMDAT1, KLMDAT2)
plt.show()
You are plotting it wrong. Your first variable needs to be T/Tmax. So initialize an empty T list, append T values to it, divide it by Tmax, and then plot twice: first KLMDAT1 and then KLMDAT2. Following pseudocode explains it
KLMDAT1 = []
KLMDAT2 = []
T_list = [] # <--- Initialize T list here
for j in range(z):
...
while (T<Tmax):
...
T=T+dT
T_list.append(T) # <--- Append T here
i = i+1
# ... rest of the code
plt.plot(np.array(T_list)/Tmax, KLMDAT1) # <--- Changed here
plt.plot(np.array(T_list)/Tmax, KLMDAT2) # <--- Changed here
plt.show()

Append a 2D array while looping through it

I want to store certain values in a 2D array. In the below code. I want sT to be total. When the inner loop runs the values to be stored in rows and then next column when the outer loop increment happens.
class pricing_lookback:
def __init__(self,spot,rate,sigma,time,sims,steps):
self.spot = spot
self.rate = rate
self.sigma = sigma
self.time = time
self.sims = sims
self.steps = steps
self.dt = self.time/self.steps
def call_floatingstrike(self):
simulationS = np.array([])
simulationSt = np.array([])
call2 = np.array([])
total = np.empty(shape=[self.steps, self.sims])
for j in range(self.sims):
sT = self.spot
pathwiseminS = np.array([])
for i in range(self.steps):
phi= np.random.normal()
sT *= np.exp((self.rate-0.5*self.sigma*self.sigma)*self.dt + self.sigma*phi*np.sqrt(self.dt))
pathwiseminS = np.append(pathwiseminS, sT)
np.append(total,[[j,sT]])###This should store values in rows of j column
#print (pathwiseminS)
#tst1 = np.append(tst1, pathwiseminS[1])
call2 = np.append(call2, max(pathwiseminS[self.steps-1]-self.spot,0))
#print (pathwiseminS[self.steps-1])
#print(call2)
simulationSt = np.append(simulationSt,pathwiseminS[self.steps-1])
simulationS = np.append(simulationS,min(pathwiseminS))
call = max(np.average(simulationSt) - np.average(simulationS),0)
return call, total#,call2,
Here is a simple example of what I think you are trying to do:
for i in range(5):
row = np.random.rand(5,)
if i == 0:
my_array = row
else:
my_array = np.vstack((my_array, row))
print(row)
However, this is not very efficient with memory, especially if you are dealing with large arrays, as this has to allocate new memory on every loop. It would be much better to preallocate an empty array and then populate it if possible.
To answer the question of how to append a column, it would be something like this:
import numpy as np
x = np.random.rand(5, 4)
column_to_append = np.random.rand(5,)
np.insert(x, x.shape[1], column_to_append, axis=1)
Again, this is not memory efficient and should be avoided whenever possible. Preallocation is much better.

Transforming a 3 Column Matrix into an N x N Matrix in Numpy

I have a 2D numpy array with 3 columns. Columns 1 and 2 are a list of connections between ID's. Column 3 is a the strength of that connection. I would like to transform this 3 column matrix into a weighted adjacency matrix (an N x N matrix where cells represent the strength of connection between each ID).
I have already done this in my code below. matrix is the 3 column 2D array and t1 is the weighted adjacency matrix. My problem is this code is very slow because I am using nested for loops. I am familiar with the pandas function melt which does this, but I am not able to use pandas. Is there a faster implementation not using pandas?
import numpy as np
a = np.arange(2000)
np.random.shuffle(a)
b = np.arange(2000)
np.random.shuffle(b)
c = np.random.rand(2000,1)
matrix = np.column_stack((a,b,c))
#get unique value list of nm
flds = list(np.unique(matrix[:,0]))
flds.extend(list(np.unique(matrix[:,1])))
flds = np.asarray(flds)
flds = np.unique(flds)
#make lookup dict
lookup = dict(zip(np.arange(0,len(flds)), flds))
lookup_rev = dict(zip(flds, np.arange(0,len(flds))))
#make empty n by n matrix with unique lists
t1 = np.zeros([len(flds) , len(flds)])
#map values into the n by n matrix and make the rest 0
'''this takes a long time to run'''
#iterate through rows
for i in np.arange(0,len(lookup)):
#iterate through columns
for k in np.arange(0,len(lookup)):
val = matrix[(matrix[:,0] == lookup[i]) & (matrix[:,1] == lookup[k])][:,2]
if val:
t1[i,k] = sum(val)
Assuming that I understood the question correctly and that val is a scalar, you could use a vectorized approach that involves initializing with zeros and then indexing, like so -
out = np.zeros((len(flds),len(flds)))
out[matrix[:,0].astype(int),matrix[:,1].astype(int)] = matrix[:,2]
Please note that by my observation it looks like you can avoid using lookup.
You need to iterate your matrix only once:
import numpy as np
size = 2000
a = np.arange(size)
np.random.shuffle(a)
b = np.arange(size)
np.random.shuffle(b)
c = np.random.rand(size,1)
matrix = np.column_stack((a,b,c))
#get unique value list of nm
fields = np.unique(matrix[:,:2])
n = len(fields)
#make reverse lookup dict
lookup = dict(zip(fields, range(n)))
#make empty n by n matrix
t1 = np.zeros([n, n])
for src, dest, val in matrix:
i = lookup[src]
j = lookup[dest]
t1[i, j] += val
The main acceleration you can get is by not iterating through each element of the NxN matrix but instead iterate trough your connection list, which is much smaller.
I tried to simplify your code a bit. It use the list.index method, which can be slow, but it should still be faster that what you had.
import numpy as np
a = np.arange(2000)
np.random.shuffle(a)
b = np.arange(2000)
np.random.shuffle(b)
c = np.random.rand(2000,1)
matrix = np.column_stack((a,b,c))
lookup = np.unique(matrix[:,:2]).tolist() # You can call unique only once
t1 = np.zeros((len(lookup),len(lookup)))
for i,j,val in matrix:
t1[lookup.index(i),lookup.index(j)] = val # Fill the matrix

Getting data arrays from CSV with loops

I have a CSV that looks like this:
0.500187550,CPU1,7.93
0.500187550,CPU2,1.62
0.500187550,CPU3,7.93
0.500187550,CPU4,1.62
1.000445359,CPU1,9.96
1.000445359,CPU2,1.61
1.000445359,CPU3,9.96
1.000445359,CPU4,1.61
1.500674877,CPU1,9.94
1.500674877,CPU2,1.61
1.500674877,CPU3,9.94
1.500674877,CPU4,1.61
The first column is time, the second the CPU used and the third is energy.
As a final result I would like to have these arrays:
Time:
[0.500187550, 1.000445359, 1.500674877]
Energy (per CPU): e.g. CPU1
[7.93, 9.96, 9.94]
For parsing the CSV I'm using:
query = csv.reader(csvfile, delimiter=',', skipinitialspace=True)
#Arrays global time and power:
for row in query:
x = row[0]
x = float(x)
x_array.append(x) #column 0 to array
y = row[2]
y = float(y)
y_array.append(y) #column 2 to array
print x_array
print y_array
These way I get all the data from time and energy into two arrays: x_array and y_array.
Then I order the arrays:
energy_core_ord_array = []
time_ord_array = []
#Dividing array into energy and time per core:
for i in range(number_cores[0]):
e = 0 + i
for j in range(len(x_array)/(int(number_cores[0]))):
time_ord = x_array[e]
time_ord_array.append(time_ord)
energy_core_ord = y_array[e]
energy_core_ord_array.append(energy_core_ord)
e = e + int(number_cores[0])
And lastly, I cut the time array into the lenghts it should have:
final_time_ord_array = []
for i in range(len(x_array)/(int(number_cores[0]))):
final_time_ord = time_ord_array[i]
final_time_ord_array.append(final_time_ord)
Till here, although the code is not elegant, it works.
The problem comes when I try to get the array for each core.
I get it for the first core, but when I try to iterate for the next one, I donĀ“t know how to do it, and how can I store each array in a variable with a single name for example.
final_energy_core_ord_array = []
#Trunk energy core array:
for i in range(len(x_array)/(int(number_cores[0]))):
final_energy_core_ord = energy_core_ord_array[i]
final_energy_core_ord_array.append(final_energy_core_ord)
So using Pandas (library to handle dataframes in Python) you can do something like this, which is much quicker than trying to process the CSV manually like you're doing:
import pandas as pd
csvfile = "C:/Users/Simon/Desktop/test.csv"
data = pd.read_csv(csvfile, header=None, names=['time','cpu','energy'])
times = list(pd.unique(data.time.ravel()))
print times
cpuList = data.groupby(['cpu'])
cpuEnergy = {}
for i in range(len(cpuList)):
curCPU = 'CPU' + str(i+1)
cpuEnergy[curCPU] = list(cpuList.get_group('CPU' + str(i+1))['energy'])
for k, v in cpuEnergy.items():
print k, v
that will give the following as output:
[0.50018755000000004, 1.000445359, 1.5006748769999998]
CPU4 [1.6200000000000001, 1.6100000000000001, 1.6100000000000001]
CPU2 [1.6200000000000001, 1.6100000000000001, 1.6100000000000001]
CPU3 [7.9299999999999997, 9.9600000000000009, 9.9399999999999995]
CPU1 [7.9299999999999997, 9.9600000000000009, 9.9399999999999995]
Finally I got the answer, using globals.... not a great idea, but works, leave it here if someone find it useful.
final_energy_core_ord_array = []
#Trunk energy core array:
a = 0
for j in range(number_cores[0]):
for i in range(len(x_array)/(int(number_cores[0]))):
final_energy_core_ord = energy_core_ord_array[a + i]
final_energy_core_ord_array.append(final_energy_core_ord)
globals()['core%s' % j] = final_energy_core_ord_array
final_energy_core_ord_array = []
a = a + 12
print 'Final time and cores:'
print final_time_ord_array
for j in range(number_cores[0]):
print globals()['core%s' % j]

Categories