So, I was working on implementing my own version of the Statistical Test of Homogeneity in Python, where the user submits a list of lists and the function computes the corresponding chi value.
One issue I found was that my function was dropping decimals when performing division, resulting in a somewhat inaccurate chi value for small sample sizes.
Here is the code:
import numpy as np
import scipy.stats as stats

def test_of_homo(list1):
    a = np.array(list1)
    #n = a.size
    num_rows = a.shape[0]
    num_cols = a.shape[1]
    dof = (num_cols-1)*(num_rows-1)
    column_totals = np.sum(a, axis=0)
    row_totals = np.sum(a, axis=1)
    n = sum(row_totals)
    b = np.array(list1)
    c = 0
    for x in range(num_rows):
        for y in range(num_cols):
            print("X is " + str(x))
            print("Y is " + str(y))
            print("a[x][y] is " + str(a[x][y]))
            print("row_totals[x] is " + str(row_totals[x]))
            print("column_total[y] is " + str(column_totals[y]))
            b[x][y] = (float(row_totals[x])*float(column_totals[y]))/float(n)
            print("b[x][y] is " + str(b[x][y]))
            numerator = ((a[x][y]) - b[x][y])**2
            chi = float(numerator)/float(b[x][y])
            c = float(c) + float(chi)
    print(b)
    print(c)
    print(stats.chi2.cdf(c, df=dof))
    print(1-(stats.chi2.cdf(c, df=dof)))

listc = [(21, 36, 30), (48, 26, 19)]
test_of_homo(listc)
When the results were printed I saw that the b[x][y] values were [[33 29 23] [35 32 25]] instead of something like 33.35, 29.97, 23.68, etc. This caused my resulting chi value to be 15.58 (with a p of 0.0004) instead of the expected 14.5.
I tried to convert everything to float, but that didn't seem to work, and using decimal.Decimal(b[x][y]) resulted in a TypeError. Any help?
I think the problem could be due to the numbers you are providing to the function in the list. Note that if you convert a list to a Numpy array without specifying the data type it will try to guess based on the values:
>>> listc = [(21, 36, 30), (48, 26, 19)]
>>> a = np.array(listc)
>>> a.dtype
dtype('int64')
Here is how you force conversion to a desired data type:
>>> a = np.array(listc, dtype=float)
>>> a.dtype
dtype('float64')
Try that in the two lines of your function that create arrays from list1 (a = np.array(list1) and b = np.array(list1)) and see if it solves the problem. If you do this you shouldn't need to use float() all the time.
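For illustration, here is a minimal sketch of the same computation with the fix applied. This is my condensed rewrite, not the questioner's code: it builds the whole expected-count table in one step with np.outer instead of the nested loops, but the math is identical.

import numpy as np
import scipy.stats as stats

def test_of_homo(list1):
    a = np.array(list1, dtype=float)   # force float64 so division keeps decimals
    num_rows, num_cols = a.shape
    dof = (num_rows - 1) * (num_cols - 1)
    row_totals = a.sum(axis=1)
    column_totals = a.sum(axis=0)
    n = a.sum()
    # expected[x][y] = row_totals[x] * column_totals[y] / n, for every cell at once
    expected = np.outer(row_totals, column_totals) / n
    chi = ((a - expected) ** 2 / expected).sum()
    return chi, 1 - stats.chi2.cdf(chi, df=dof)

With the question's listc this returns a chi value of about 14.46, in line with the expected ~14.5. As a cross-check, scipy.stats.chi2_contingency(np.array(listc)) computes the same statistic, p-value, degrees of freedom, and expected table in one call.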
I am doing a project on encrypting data using the RSA algorithm. I have taken a .wav file as input, read it using wavfile, and I can apply the encryption key (3, 25777), but when I apply the decryption key (16971, 25777) it gives the wrong output, like this:
The output I'm getting:
[[ 0 -25777]
[ 0 -25777]
[ 0 -25777]
...
[-25777 -25777]
[-15837 -15837]
[ -8621 1]]
The output I want:
[[ 0 -1]
[ 2 -1]
[ 2 -3]
...
[-9 -5]
[-2 -2]
[-4 1]]
This was happening only with the decryption part of the array, so I decided to convert the 2D array to a 2D list. After that it gives me the desired output, but it takes a lot of time to apply the keys to all the elements of the list (16 minutes, versus 2 seconds for the array). I don't understand why this is happening, and is there any other solution to this problem?
here is the encryption and decryption part of the program:
#encryption
for i in range(0, tup[0]):          #tup[0] is the no of rows
    for j in range(0, tup[1]):      #tup[1] is the no of cols
        x = data[i][j]
        x = (pow(x, 3)) % 25777     #applying the keys
        data[i][j] = x              #storing back the updated value

#decryption
data = data.tolist()                #2d array to list of lists
for i1 in range(len(data)):
    for j1 in range(len(data[i1])):
        x1 = data[i1][j1]
        x1 = pow(x1, 16971) % 25777 #applying the keys
        data[i1][j1] = x1
Looking forward to suggestions. Thank you.
The occurrence of something like pow(x1, 16971) should give you pause. For almost any integer x1 this yields a result that a 64-bit int cannot hold, which is why numpy gives the wrong result: numpy uses 64-bit or 32-bit integers on the most common platforms. It is also why plain Python is slow: Python can handle arbitrarily large integers, but doing so is costly.
A way around this is to apply the modulus in between multiplications, that way numbers remain small and can be readily handled by 64 bit arithmetic.
Here is a simple implementation:
def powmod(b, e, m):
    b2 = b
    res = 1
    while e:
        if e & 1:
            res = (res * b2) % m
        b2 = (b2 * b2) % m
        e >>= 1
    return res
For example:
>>> powmod(2000, 16971, 25777)
10087
>>> (2000**16971)%25777
10087
>>> timeit(lambda: powmod(2000, 16971, 25777), number=100)
0.00031936285085976124
>>> timeit(lambda: (2000**16971)%25777, number=100)
0.255017823074013
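As a side note (my addition, not part of the original answer): Python's built-in pow already accepts a modulus as an optional third argument and applies the same square-and-multiply idea internally, so the decryption loop from the question could also be written without a helper:

#decryption, using built-in three-argument pow
#(data here is the list of lists from the question, after data.tolist())
for i1 in range(len(data)):
    for j1 in range(len(data[i1])):
        data[i1][j1] = pow(data[i1][j1], 16971, 25777)

Either way, the key point is that intermediate values never grow beyond m**2, so the arithmetic stays fast.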
Imagine I have a function like below:
f = (s**2 + 2*s + 5) + 1
where s is:
s = [1, 2, 3]
How can I pass s to my function?
I know that I can define a function like below:
def model(s):
    model = 1 + (s**2 + 2*s + 5)
    return model

fitted_2_dis = [model(value) for value in s]
print("fitted_2_dis =", fitted_2_dis)
To get :
fitted_2_dis = [9, 14, 21]
I would prefer not to use this method, because my actual function is big, with a lot of expressions. So, instead of bringing all the expressions into my code, I defined my function like below:
sum_f = sum(f)
sum_f in my code is the summation of a bunch of expressions.
Is there any other way to evaluate my function (sum_f) when the input is an array?
Thanks
The list comprehension method is a great one. Alternatively, you can use map:
fitted_2_dis = list(map(model, s))
If you're a numpy fan you can use np.vectorize:
np.vectorize(model)(s)
Finally, if you convert your list to numpy's ndarray you can pass it in directly:
import numpy as np
s = np.array(s)
model(s)
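For example (a quick check with the question's values):

>>> model(np.array([1, 2, 3]))
array([ 9, 14, 21])

This works because numpy applies the arithmetic inside model elementwise across the array.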
The map function will also fulfill the task quite nicely; on Python 3, wrap it in list() since map returns an iterator there:
>>> list(map(model, s))
[9, 14, 21]
You can try this:
import numpy as np
def sum_array(f):
    np_s = np.array(f)
    return (np_s**2 + 2*np_s + 5) + 1
s = [1, 2, 3]
sum_f = sum_array(s)
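Note that sum_array returns an array of per-element values (array([ 9, 14, 21]) for the question's s). If what you want in sum_f is their total, as sum_f = sum(f) suggests, you can add .sum() (my addition, assuming a total is the goal):

sum_f = sum_array(s).sum()   # 9 + 14 + 21 = 44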
I'm trying to write code in which the first two numbers of each tuple are multiplied, and the products are then totaled across tuples. Here's my very wonky code:
numbers = [(68.9, 2, 24.8),
           (12.4, 28, 21.12),
           (38.0, 15, 90.86),
           (23.1, 45, 15.3),
           (45.12, 90, 12.66)]

def function(numbers):
    first_decimal = [element[1] for element in numbers]
    integer = [element[2] for element in numbers]
    string_1 = ''.join(str(x) for x in first_decimal)
    string_2 = ''.join(str(x) for x in integer)
    # It says 'TypeError: float() argument must be a string or a number',
    # but doesn't this convert it to a string??
    tot = 1
    for element in first_decimal:
        tot = float(first_decimal) * int(integer)
    return tot

function(numbers)
Forgot about the output. Basically what is needed is the total of:
total_add = 68.9 * 2, 12.4 * 28, 23.1 * 45, 45.12 * 90
i.e. the product of the first two numbers of every tuple in the list, summed. Apologies.
If you literally want to add up the product of the first two elements in each tuple, then you can use the sum() function with a generator:
>>> sum(t[0] * t[1] for t in numbers)
6155.299999999999
which we can check is correct through the following
>>> 68.9 * 2 + 12.4 * 28 + 38.0 * 15 + 23.1 * 45 + 45.12 * 90
6155.299999999999
My preference is to use a vectorised approach via numpy:
import numpy as np
numbers = [(68.9, 2, 24.8),
           (12.4, 28, 21.12),
           (38.0, 15, 90.86),
           (23.1, 45, 15.3),
           (45.12, 90, 12.66)]
a = np.array(numbers)
res = np.dot(a[:, 0], a[:, 1])
# 6155.3
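Equivalently, if you prefer explicit elementwise arithmetic over a dot product, the same result comes from multiplying the two columns and summing (a stylistic alternative, same math):

res = (a[:, 0] * a[:, 1]).sum()   # also 6155.3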
First off, element[1] will give you the second entry of the tuple; indexing for a tuple or list always starts at 0. Apart from that, you're giving yourself a hard time in your function by converting variables back and forth. It's not clear what you are trying to do with this part anyway:
string_1 = ''.join(str(x) for x in first_decimal)
string_2 = ''.join(str(x) for x in integer)
It seems pretty unnecessary. Now, to give you a solution that is similar to your approach: we iterate through every tuple of the list, multiply the first two entries, and add the product to the total amount:
numbers = [(68.9, 2, 24.8),
           (12.4, 28, 21.12),
           (38.0, 15, 90.86),
           (23.1, 45, 15.3),
           (45.12, 90, 12.66)]

def function(numbers_list):
    total_add = 0
    # Because numbers is a list of tuples, you can just
    # use `len()` to find the number of tuples
    for tuple in range(len(numbers_list)):
        total_add += numbers_list[tuple][0] * numbers_list[tuple][1]
    return total_add

function(numbers)
or simply:
def function(numbers_list):
    total_add = 0
    for tuple in numbers_list:
        total_add += tuple[0] * tuple[1]
    return total_add
which can be further shortened to Joe Iddon's answer:
total_add = sum(t[0] * t[1] for t in numbers)
I'm trying to square all the elements in a numpy array, but the results are not what I'm expecting (i.e. some are negative numbers, and none are the actual square values). Can anyone please explain what I'm doing wrong and/or what's going on?
import numpy as np
import math
f = 'file.bin'
frameNum = 25600
channelNum = 2640
data = np.fromfile(f,dtype=np.int16)
total = frameNum*channelNum*2
rs = data[:total].reshape(channelNum,-1) #reshaping the data a little. Omitting added values at the end.
I = rs[:,::2] # pull out every other column
print "Shape :", I.shape
print "I : ", I[1,:10]
print "I**2 : ", I[1,:10]**2
print "I*I : ",I[1,:10]* I[1,:10]
print "np.square : ",np.square(I[1,:10])
exit()
Output:
Shape : (2640L, 25600L)
I : [-5302 -5500 -5873 -5398 -5536 -6708 -6860 -6506 -6065 -6363]
I**2 : [ -3740 -27632 20193 -25116 -23552 -25968 4752 -8220 18529 -13479]
I*I : [ -3740 -27632 20193 -25116 -23552 -25968 4752 -8220 18529 -13479]
np.square : [ -3740 -27632 20193 -25116 -23552 -25968 4752 -8220 18529 -13479]
Any suggestions?
It is because of the dtype=np.int16. You are allowing only 16 bits to represent the numbers, and (-5302)**2 is larger than the maximum value (32767) that a signed 16-bit integer can take. So you're seeing only the lowest 16 bits of the result, the first of which is interpreted (or, from your point of view, misinterpreted) as a sign bit. You can check this: 5302**2 = 28111204, and 28111204 % 65536 = 61796, which reinterpreted as a signed 16-bit value is 61796 - 65536 = -3740, exactly the first entry of your output.
Convert your array to a different dtype, for example
I = np.array(I, dtype=np.int32)
or
I = np.array(I, dtype=np.float64)
before performing numerical operations that might go out of range.
With dtype=np.int16, the highest-magnitude integers you can square are +181 and -181. The square of 182 is larger than 32767 and so it overflows. Even with dtype=np.int32 representation, the highest-magnitude integers you can square are +46340 and -46340: the square of 46341 overflows.
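You can query these limits directly rather than memorising them (standard numpy, shown just for reference):

>>> import numpy as np
>>> np.iinfo(np.int16).max
32767
>>> np.iinfo(np.int32).max
2147483647
>>> int(np.iinfo(np.int16).max ** 0.5)   # largest safely squarable magnitude
181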
This is the reason:
>>> a = np.array([-5302, -5500], dtype=np.int16)
>>> a * a
array([ -3740, -27632], dtype=int16)
This is the solution:
>>> b = np.array([-5302, -5500], dtype=np.int32)
>>> b * b
array([28111204, 30250000], dtype=int32)
Change:
data = np.fromfile(f, dtype=np.int16)
into:
data = np.fromfile(f, dtype=np.int16).astype(np.int32)
I am working on performing image processing using Numpy, specifically a running standard deviation stretch. This reads in X number of columns, finds the standard deviation, and performs a percentage linear stretch. It then iterates to the next "group" of columns and performs the same operations. The input image is a 1 GB, 32-bit, single-band raster which is taking quite a long time to process (hours). Below is the code.
I realize that I have 3 nested for loops, which is presumably where the bottleneck is occurring. If I process the image in "boxes", that is to say loading an array that is [500,500] and iterating through the image, processing time is quite short. Unfortunately, camera error requires that I iterate in extremely long strips (52,000 x 4) (y,x) to avoid banding.
Any suggestions on speeding this up would be appreciated:
def box(dataset, outdataset, sampleSize, n):
    quiet = 0
    sample = sampleSize
    #iterate over all of the bands
    for j in xrange(1, dataset.RasterCount + 1): #1 based counter
        band = dataset.GetRasterBand(j)
        NDV = band.GetNoDataValue()
        print "Processing band: " + str(j)

        #define the interval at which blocks are created
        intervalY = int(band.YSize/1)
        intervalX = int(band.XSize/2000) #to be changed to sampleSize when working

        #iterate through the rows
        scanBlockCounter = 0
        for i in xrange(0, band.YSize, intervalY):
            #If the next i is going to fail due to the edge of the image/array
            if i + (intervalY*2) < band.YSize:
                numberRows = intervalY
            else:
                numberRows = band.YSize - i

            for h in xrange(0, band.XSize, intervalX):
                if h + (intervalX*2) < band.XSize:
                    numberColumns = intervalX
                else:
                    numberColumns = band.XSize - h

                scanBlock = band.ReadAsArray(h, i, numberColumns, numberRows).astype(numpy.float)

                standardDeviation = numpy.std(scanBlock)
                mean = numpy.mean(scanBlock)
                newMin = mean - (standardDeviation * n)
                newMax = mean + (standardDeviation * n)
                outputBlock = ((scanBlock - newMin)/(newMax-newMin))*255
                outRaster = outdataset.GetRasterBand(j).WriteArray(outputBlock, h, i) #array, xOffset, yOffset

                scanBlockCounter = scanBlockCounter + 1
                #print str(scanBlockCounter) + ": " + str(scanBlock.shape) + str(h) + ", " + str(intervalX)
                if numberColumns == band.XSize - h:
                    break

            #update progress line
            if not quiet:
                gdal.TermProgress_nocb( (float(h+1) / band.YSize) )
Here is an update:
Without using the profile module (I did not want to start wrapping small sections of the code into functions), I used a mix of print and exit statements to get a really rough idea about which lines were taking the most time. Luckily (and I do understand how lucky I was), one line was dragging everything down.
outRaster = outdataset.GetRasterBand(j).WriteArray(outputBlock,h,i)#array, xOffset, yOffset
It appears that GDAL is quite inefficient when opening the output file and writing out the array. With this in mind I decided to append my modified arrays ("outputBlock") to a Python list, then write out chunks. Here is the segment that I changed:
The outputBlock was just modified ...
#Add the array to a list
outputArrayList.append(outputBlock)

#Check the interval counter and if it is "time" write out the array
if len(outputArrayList) >= (intervalX * writeSize) or finisher == 1:
    #Convert the list to a numpy array. Here we horizontally stack the list of arrays.
    stacked = numpy.hstack(outputArrayList)
    #Write out the array
    outRaster = outdataset.GetRasterBand(j).WriteArray(stacked, xOffset, i) #array, xOffset, yOffset
    xOffset = xOffset + (intervalX*(intervalX * writeSize))

    #Cleanup to conserve memory
    outputArrayList = list()
    stacked = None
    finisher = 0
finisher is simply a flag that handles the edges. It took a bit of time to figure out how to build an array from the list: using numpy.array was creating a 3-d array (anyone care to explain why?) and WriteArray requires a 2-d array. Total processing time now varies from just under 2 minutes to 5 minutes. Any idea why that range of times might exist?
Many thanks to everyone who posted! The next step is to really get into Numpy and learn about vectorization for additional optimization.
One way to speed up operations over numpy data is to use vectorize. Essentially, vectorize takes a function f and creates a new function g that maps f over an array a. g is then called like so: g(a).
>>> sqrt_vec = numpy.vectorize(lambda x: x ** 0.5)
>>> sqrt_vec(numpy.arange(10))
array([ 0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
Without having the data you're working with available, I can't say for certain whether this will help, but perhaps you can rewrite the above as a set of functions that can be vectorized. Perhaps in this case you could vectorize over an array of indices into ReadAsArray(h,i,numberColumns, numberRows). Here's an example of the potential benefit:
>>> print setup1
import numpy
sqrt_vec = numpy.vectorize(lambda x: x ** 0.5)
>>> print setup2
import numpy
def sqrt_vec(a):
    r = numpy.zeros(len(a))
    for i in xrange(len(a)):
        r[i] = a[i] ** 0.5
    return r
>>> timeit.timeit(stmt='a = sqrt_vec(numpy.arange(1000000))', setup=setup1, number=1)
0.30318188667297363
>>> timeit.timeit(stmt='a = sqrt_vec(numpy.arange(1000000))', setup=setup2, number=1)
4.5400981903076172
A 15x speedup! Note also that numpy slicing handles the edges of ndarrays elegantly:
>>> a = numpy.arange(25).reshape((5, 5))
>>> a[3:7, 3:7]
array([[18, 19],
[23, 24]])
So if you could get your ReadAsArray data into an ndarray you wouldn't have to do any edge-checking shenanigans.
Regarding your question about reshaping: reshaping doesn't fundamentally alter the data at all. It just changes the "strides" by which numpy indexes the data. When you call the reshape method, the value returned is a new view into the data; the data isn't copied or altered at all, and the old view with the old stride information still works.
>>> a = numpy.arange(25)
>>> b = a.reshape((5, 5))
>>> a
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24])
>>> b
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
>>> a[5]
5
>>> b[1][0]
5
>>> a[5] = 4792
>>> b[1][0]
4792
>>> a.strides
(8,)
>>> b.strides
(40, 8)
Answered as requested.
If you are IO bound, you should chunk your reads/writes. Try dumping ~500 MB of data to an ndarray, process it all, write it out and then grab the next ~500 MB. Make sure to reuse the ndarray.
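As a rough sketch of that chunked pattern (my illustration, not the question's code; band, outband, and the chunk height are assumed names/values, and the actual processing step is elided):

# assumes band and outband are open GDAL raster bands, as in the question
rows_per_chunk = 5000                                    # tune so one chunk is ~500 MB
for y in xrange(0, band.YSize, rows_per_chunk):
    nrows = min(rows_per_chunk, band.YSize - y)
    block = band.ReadAsArray(0, y, band.XSize, nrows)    # read a full-width strip
    block = block.astype(numpy.float32)
    # ... process block with whole-array numpy operations ...
    outband.WriteArray(block, 0, y)                      # write it back at the same offset

This keeps the number of read/write calls small while still bounding memory use.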
Without trying to completely understand exactly what you are doing, I notice that you aren't using any numpy slices or array broadcasting, both of which may speed up your code, or, at the very least, make it more readable. My apologies if these aren't germane to your problem.