def a4():
    p = []
    for i in range(10):
        p.append(random.sample(x, 100))
    r = []
    for i in p:
        for j in i:
            r.append(j)
    return r
OUTPUT:
[0.5202486543583558, 0.5202486543583558, 0.5202486543583558, 0.5202486543583558, 0.5202486543583558]
a1000 = []
for i in range(5):
    a4()
    a1000.append(statistics.mean(a4()))
print(a1000)
I tried to call the function defined above in the loop shown, but all the loop results are basically the same, as if the function only ran once. I want the function to produce a fresh result each time through the loop. Could someone tell me why the function appears to run only once?
As was pointed out in the comments, the sublists in p in the definition of a4 have exactly the same elements, exactly the same number of times; only the order of these elements changes.
Therefore the same goes for every new result of a4: these are the same lists up to a permutation of elements. But the order of elements is irrelevant for the computation of the mean (the sum of permuted elements is always the same). Hence you always get the same mean as a result.
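A quick demonstration of that permutation invariance (a sketch assuming, as the comments suggest, that x has exactly 100 elements):

import random
import statistics

x = list(range(100))          # stand-in for the question's x
s1 = random.sample(x, 100)    # all of x, in one random order
s2 = random.sample(x, 100)    # all of x, in another random order

print(s1 == s2)                                    # almost surely False
print(statistics.mean(s1) == statistics.mean(s2))  # always True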
However, what you might have wanted to implement is some kind of bootstrapping mechanism. In that case you would want to sample with replacement, and that in turn would yield a different result every time. If this is what you want, then replace
p.append(random.sample(x, 100))
with
p.append(random.choices(x, k=100))
Also, I would consider using numpy for these things. Read about the numpy array methods concatenate and flatten, and about numpy.random.choice (note that numpy.random.sample, despite its name, draws uniform floats rather than sampling from your data).
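A hedged numpy sketch of that bootstrap idea (x here is a stand-in for the question's data):

import numpy as np

x = np.random.normal(0, 1, 1000)   # stand-in for the question's x

# One bootstrap sample of 100 values per row, drawn with replacement.
p = np.random.choice(x, size=(10, 100), replace=True)

r = p.flatten()                    # replaces the nested append loops
print(r.mean())                    # now varies from run to run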
Related
I am working with the IRIS dataset. I have two sets of data: (1) a training set and (2) a test set. Now I want to calculate the Euclidean distance between every test-set row and every train-set row. However, I only want to include the first 4 values of each row.
A working example would be:
dist = np.linalg.norm(inner1test[0][0:4] - inner1train[0][0:4])
print(dist)
# output: 3.034243
The problem is that I have 120 training-set points and 30 test-set points, so I would have to do 3,600 operations manually; thus I thought about iterating through with a for loop. Unfortunately, every one of my attempts fails.
This would be my best attempt, which produces the error message below:
for i in inner1test:
    for number in inner1train:
        dist = np.linalg.norm(inner1test[i][0:4] - inner1train[number][0:4])
        print(dist)
(IndexError: arrays used as indices must be of integer (or boolean) type)
What would be the best solution to iterate through this array?
PS: I will also provide a screenshot for better visualization.
From what I see, inner1test is a tuple of lists, so the i value will not be an index but the actual list.
You should use enumerate, which yields two values at a time: the index and the actual item.
for i, value in enumerate(inner1test):
    for j, number in enumerate(inner1train):
        dist = np.linalg.norm(inner1test[i][0:4] - inner1train[j][0:4])
        print(dist)
Also, if your lists get bigger, consider using a generator, which will execute your calculations one iteration at a time and yield only one value at a time, avoiding returning a big chunk of results that would occupy a lot of memory.
e.g.:
def my_calculation(inner1test, inner1train):
    for i, value in enumerate(inner1test):
        for j, number in enumerate(inner1train):
            dist = np.linalg.norm(inner1test[i][0:4] - inner1train[j][0:4])
            yield dist

for dist in my_calculation(inner1test, inner1train):
    print(dist)
You might also want to investigate Python list comprehensions, which are sometimes a more elegant way to handle for loops over lists.
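For example, the double loop above could be condensed into one comprehension (a sketch, assuming the rows are numeric numpy arrays):

distances = [np.linalg.norm(test_row[0:4] - train_row[0:4])
             for test_row in inner1test
             for train_row in inner1train]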
[EDIT]
Here's a probably easier solution anyway, without the need for indexes, which won't fail when iterating over a numpy object:
for testvalue in inner1test:
    for testtrain in inner1train:
        dist = np.linalg.norm(testvalue[0:4] - testtrain[0:4])
[/EDIT]
This was the final solution with the correct output for me:
distanceslist = list()
for testvalue in inner1test:
    for testtrain in inner1train:
        dist = np.linalg.norm(testvalue[0:4] - testtrain[0:4])
        distances = (dist, testtrain[0:4])
        distanceslist.append(distances)
distanceslist
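As a vectorized alternative (not part of the answers above; it assumes both datasets convert cleanly to numeric numpy arrays), scipy can compute every pairwise distance in one call:

import numpy as np
from scipy.spatial.distance import cdist

test = np.asarray(inner1test)[:, 0:4]
train = np.asarray(inner1train)[:, 0:4]

# all_dists[i, j] is the Euclidean distance between test row i and train row j.
all_dists = cdist(test, train)   # shape: (len(test), len(train))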
I'm looking for a better, faster way to center a couple of lists. Right now I have the following:
import random

m = range(2000)
sm = sorted(random.sample(range(100000), 16000))
si = random.sample(range(16005), 16000)
# Centered array.
smm = []
print(sm)
print(si)
for i in m:
    if i in sm:
        smm.append(si[sm.index(i)])
    else:
        smm.append(None)
print(m)
print(smm)
This in effect creates a list (m) containing the range of positions to center against, a sorted list (sm) of random numbers that m is matched against, and a list of values (si) to append.
This sample runs fairly quickly, but when I run a larger task with many more variables, performance slows to a standstill.
Your main loop contains this infamous line:
if i in sm:
It seems innocuous, but since sm is the result of sorted it is a list, hence an O(n) lookup, which explains why it's slow with a big dataset.
Moreover, you're using the even more infamous si[sm.index(i)], which makes your algorithm O(n**2).
Since you need the indexes, using a set is not so easy, and there's a better option:
since sm is sorted, you can use bisect to find the index in O(log(n)), like this:
import bisect

for i in m:
    j = bisect.bisect_left(sm, i)
    smm.append(si[j] if (j < len(sm) and sm[j] == i) else None)
A small explanation: bisect gives you the insertion point of i in sm. That doesn't mean the value is actually in the list, so we have to check (by checking that the returned index is within the existing list range, and that the value at that index is the searched value); if so, append si[j], else append None.
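An alternative sketch, swapping in a different technique: since the values in sm are unique here (random.sample never repeats a value), a dict mapping each value to its index gives O(1) lookups at the cost of O(n) extra memory:

# Build the value -> index map once, then every lookup is O(1).
pos = {value: idx for idx, value in enumerate(sm)}
smm = [si[pos[i]] if i in pos else None for i in m]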
So I'm trying to run a simple test of an idea. Basically, I have some function that I define, which depends on a variable and a constant parameter, and I have an array of parameter values. For the first parameter value I have one set of start and end points of integration; for the second, a different set. I've got the code working thanks to this link: Integrating functions that return an array in Python
And it's basically exactly in that form.
My question is, if I have a definition of a function with a for loop in it, and that function looks something like:
def F(a):
    F = []
    for i in range(len(a)):
        F.append(scipy.integrate.quad(g, 0, 1, args=(a[i],)))
    return F
(where g is some function I've defined previously in the code), then when I call this function (plugging in an array for a), will the integrals for all the elements in the array run consecutively, or will the integral for each element run at the same time?
Or in other words, in the link I attached at the beginning, when the function that is defined with a for loop is called, do all calculations in the function run consecutively (like a for loop running through indices) or concurrently since all elements are already defined?
The loop can better be expressed as:
def F(a):
    results = []
    for element in a:
        results.append(scipy.integrate.quad(g, 0, 1, args=(element,)))
    return results
or as a one-liner using a list comprehension:
def F(a):
    return [scipy.integrate.quad(g, 0, 1, args=(element,)) for element in a]
And in both of these cases, the integrations will be done one at a time (consecutively).
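If concurrent execution is actually wanted, one possible sketch uses multiprocessing (an assumption, not part of the answer above: g must be defined at module level so worker processes can pickle and import it):

from multiprocessing import Pool

import scipy.integrate

def integrate_one(element):
    # quad returns a (value, estimated_error) tuple for one parameter value.
    return scipy.integrate.quad(g, 0, 1, args=(element,))

def F_parallel(a):
    # Each integral runs in a separate worker process; output order matches a.
    with Pool() as pool:
        return pool.map(integrate_one, a)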
I have a function that works something like this:
def Function(x):
    a = random.random()
    b = random.random()
    c = OtherFunctionThatReturnsAThreeColumnArray()
    results = np.zeros((1, 5))
    results[0, 0] = a
    results[0, 1] = b
    results[0, 2] = c[-1, 0]
    results[0, 3] = c[-1, 1]
    results[0, 4] = c[-1, 2]
    return results
What I'm trying to do is run this function many, many times, appending the returned one-row, five-column results to a running data set. But as I understand it, the append function and a for loop are both ruinously inefficient; I'm trying to improve my code, and the number of runs is going to be large enough that that kind of inefficiency isn't doing me any favors.
What's the best way to do the following such that it incurs the least overhead:
1. Create a new numpy array to hold the results.
2. Insert the results of N calls of that function into the array from step 1?
You're correct in thinking that numpy.append or numpy.concatenate are going to be expensive if repeated many times (this is because numpy allocates a new array and copies both previous arrays into it).
The best suggestion (if you know how much space you're going to need in total) would be to declare that before you run your routine, and then just put the results in place as they become available.
If you're going to run this nrows times, then
results = np.zeros([nrows, 5])
and then add your results
def function(x, i, results):
    <.. snip ..>
    results[i, 0] = a
    results[i, 1] = b
    results[i, 2] = c[-1, 0]
    results[i, 3] = c[-1, 1]
    results[i, 4] = c[-1, 2]
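A usage sketch under those assumptions (nrows known in advance; x stands in for whatever argument function actually needs):

nrows = 100000
results = np.zeros([nrows, 5])

for i in range(nrows):
    function(x, i, results)   # each call fills row i of results in place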
Of course, if you don't know how many times you're going to be running function, this won't work. In that case, I'd suggest a less elegant approach:
1. Declare a possibly large results array and fill results[i, :] as above (keeping track of i and of the size of results).
2. When i reaches the size of results, do a numpy.append (or concatenate) with a new array. This is far less bad than appending repetitively and shouldn't destroy performance, but you will have to write some wrapper code; a sketch follows.
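A minimal sketch of such wrapper code (the class name and chunk size are illustrative assumptions):

import numpy as np

class GrowingResults:
    """Grow a results array in chunks so rows can be added
    without reallocating on every append."""

    def __init__(self, chunk=4096, ncols=5):
        self.chunk = chunk
        self.data = np.zeros((chunk, ncols))
        self.n = 0                                # rows filled so far

    def add(self, row):
        if self.n >= self.data.shape[0]:
            # Out of space: concatenate one more zero chunk (rare, so
            # much cheaper than appending row by row).
            self.data = np.concatenate(
                [self.data, np.zeros((self.chunk, self.data.shape[1]))])
        self.data[self.n] = row
        self.n += 1

    def finish(self):
        return self.data[:self.n]                 # trim the unused tail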
There are other ideas you could pursue. Off the top of my head, you could:
- Write the results to disk; depending on the speed of OtherFunctionThatReturnsAThreeColumnArray and the size of your data, this may not be too daft an idea.
- Save your results in a list comprehension (forgetting numpy until after the run). If function returned (a, b, c) rather than results:
results = [function(x) for x in my_data]
and now do some shuffling to get results into the form you need, for example as sketched below.
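For instance, if function were changed to return the flat tuple (a, b, c[-1,0], c[-1,1], c[-1,2]) (an assumption, not the code above), the shuffling reduces to one conversion:

results = np.array([function(x) for x in my_data])   # shape (N, 5)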
I'm wondering what is the most efficient way to replace elements in an array with other random elements from the array given some criteria. More specifically, I need to replace each element which doesn't meet a given criterion with another random value from the same row. For example, I want to replace each outlier in a row of data with a random cell from that row whose value is between -0.8 and 0.8. My inefficient solution looks something like this:
import random as r

import numpy as np

data = np.random.normal(0, 1, (10, 100))

for index, row in enumerate(data):
    row_copy = np.copy(row)
    outliers = np.logical_or(row > .8, row < -.8)
    for prob in np.where(outliers == 1)[0]:
        fixed = 0
        while fixed == 0:
            random_other_value = r.randint(0, 99)
            if random_other_value in np.where(outliers == 1)[0]:
                fixed = 0
            else:
                row_copy[prob] = row[random_other_value]
                fixed = 1
Obviously, this is not efficient.
I think it would be faster to pull out all the good values, then use random.choice() to pick one whenever you need it. Something like this:
import numpy as np
import random
from itertools import izip

data = np.random.normal(0, 1, (10, 100))

for row in data:
    good_ones = np.logical_and(row >= -0.8, row <= 0.8)
    good = row[good_ones]
    row_copy = np.array([x if f else random.choice(good) for f, x in izip(good_ones, row)])
High-level Python code that you write is slower than the C internals of Python. If you can push work down into the C internals it is usually faster. In other words, try to let Python do the heavy lifting for you rather than writing a lot of code. It's zen... write less code to get faster code.
I added a loop to run your code 1000 times, and to run my code 1000 times, and measured how long they took to execute. According to my test, my code is ten times faster.
Additional explanation of what this code is doing:
row_copy is being set by building a new list, and then calling np.array() on the new list to convert it to a NumPy array object. The new list is being built by a list comprehension.
The new list is made according to the rule: if the number is good, keep it; else, take a random choice from among the good values.
A list comprehension walks over a sequence of values, but to apply this rule we need two values: the number, and the flag saying whether that number is good or not. The easiest and fastest way to make a list comprehension walk along two sequences at once is to use izip() to "zip" the two sequences together. izip() will yield up tuples, one at a time, where the tuple is (f, x); f in this case is the flag saying good or not, and x is the number. (Python has a built-in feature called zip() which does pretty much the same thing, but actually builds a list of tuples; izip() just makes an iterator that yields up tuple values. But you can play with zip() at a Python prompt to learn more about how it works.)
In Python we can unpack a tuple into variable names like so:
a, b = (2, 3)
In this example, we set a to 2 and b to 3. In the list comprehension we unpack the tuples from izip() into variables f and x.
Then the heart of the list comprehension is a "ternary if" statement like so:
a if flag else b
The above will return the value a if the flag value is true, and otherwise return b. The one in this list comprehension is:
x if f else random.choice(good)
This implements our rule.