I know that Python loops themselves are relatively slow compared to other languages, but when the right functions are used they become much faster.
I have a pandas dataframe called "acoustics" which contains over 10 million rows:
print(acoustics)
timestamp c0 rowIndex
0 2016-01-01T00:00:12.000Z 13931.500000 8158791
1 2016-01-01T00:00:30.000Z 14084.099609 8158792
2 2016-01-01T00:00:48.000Z 13603.400391 8158793
3 2016-01-01T00:01:06.000Z 13977.299805 8158794
4 2016-01-01T00:01:24.000Z 13611.000000 8158795
5 2016-01-01T00:02:18.000Z 13695.000000 8158796
6 2016-01-01T00:02:36.000Z 13809.400391 8158797
7 2016-01-01T00:02:54.000Z 13756.000000 8158798
and here is the code I wrote:
import pandas as pd
import numpy as np

acoustics = pd.read_csv("AccousticSandDetector.csv", skiprows=[1])
weights = [1/9, 1/18, 1/27, 1/36, 1/54]
sumWeights = np.sum(weights)
deltaAc = []
for i in range(5, len(acoustics)):
    time = acoustics.iloc[i]['timestamp']
    sum = 0
    for c in range(5):
        sum += (weights[c]/sumWeights)*(acoustics.iloc[i]['c0']-acoustics.iloc[i-c]['c0'])
    print("Row " + str(i) + " of " + str(len(acoustics)) + " is iterated")
    deltaAc.append([time, sum])
deltaAc = pd.DataFrame(deltaAc)
It takes a huge amount of time, how can I make it faster?
You can use diff from pandas to create all the differences for each row in an array, then multiply by your weights and finally sum over axis 1, such as:
deltaAc = pd.DataFrame({'timestamp': acoustics.loc[5:, 'timestamp'],
                        'summation': (np.array([acoustics.c0.diff(i) for i in range(5)]).T[5:]
                                      * np.array(weights)).sum(1) / sumWeights})
and you get the same values as with your code:
print (deltaAc)
timestamp summation
5 2016-01-01T00:02:18.000Z -41.799986
6 2016-01-01T00:02:36.000Z 51.418728
7 2016-01-01T00:02:54.000Z -3.111184
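For reference, a small illustration (with made-up numbers, my addition) of what Series.diff(i) produces for each lag used in the list comprehension above:

import pandas as pd

s = pd.Series([10.0, 12.0, 15.0, 11.0])
print(s.diff(1).tolist())  # [nan, 2.0, 3.0, -4.0]  -> s[i] - s[i-1]
print(s.diff(2).tolist())  # [nan, nan, 5.0, -1.0]  -> s[i] - s[i-2]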
First optimization: weights[c]/sumWeights can be computed outside the loop.
weights_array = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
sumWeights = np.sum(weights_array)
tmp = weights_array / sumWeights
...
sum += tmp[c]*...
I'm not familiar with pandas, but if you can extract your columns as 1D NumPy arrays, it will help a lot. It might look something like this:
# next lines to be tested, or find the correct way of extracting the columns
c0_column = acoustics['c0'].values           # 1D NumPy array of the c0 column
time_column = acoustics['timestamp'].values  # 1D NumPy array of the timestamps
...
total = numpy.zeros(shape=(len(acoustics)-5,))  # named total to avoid shadowing the built-in sum
delta_ac = []
for c in range(5):
    total += tmp[c]*(c0_column[5:]-c0_column[5-c:len(acoustics)-c])
for i in range(len(acoustics)-5):
    delta_ac.append([time_column[5+i], total[i]])
DataFrames have a great method, rolling, for constructing and applying windowing transformations, so you don't need loops at all:
# df is your data frame
import numpy as np

window_size = 5
weights = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
weights /= weights.sum()
# reverse the weights so that weights[0] pairs with the current (last) value of each
# window, matching the original loop; raw=True passes a plain NumPy array to the lambda
df.loc[:, 'deltaAc'] = df.loc[:, 'c0'].rolling(window_size).apply(
    lambda x: ((x[-1] - x) * weights[::-1]).sum(), raw=True)
I have a very big dataframe with this structure:
Timestamp Val1
Here you can see a real sample:
Timestamp Temp
0 1622471518.92911 36.443
1 1622471525.034114 36.445
2 1622471531.148139 37.447
3 1622471537.284337 36.449
4 1622471543.622588 43.345
5 1622471549.734765 36.451
6 1622471556.2518 36.454
7 1622471562.361368 41.461
8 1622471568.472718 42.468
9 1622471574.826475 36.470
What I want to do is compare the Temp column with itself; if one value is higher than another by "X" (for example 4) and the time between them is less than "Y" (for example 180 min), then I save some data about that pair.
Now I'm using two for loops, one inside the other, but this takes too much time, and pandas usually has an option to avoid this.
This is my code:
import datetime as dt

cap_time, maxim = 180, 4
cap_time = cap_time * 60
temps = df['Temperature'].values
times = df['Timestamp'].values
results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        print(i, j, len(temps))
        if float(temps[j]) > float(temps[i])*maxim:
            timeIn = dt.datetime.fromtimestamp(float(times[i]))
            timeOut = dt.datetime.fromtimestamp(float(times[j]))
            diff = timeOut - timeIn
            tdiff = diff.total_seconds()
            if tdiff > cap_time:
                break
            else:
                res = [temps[i], temps[j], times[i], times[j], tdiff/60, cap_time/60, maxim]
                results.append(res)
                break
# Then I save it in a dataframe and do other actions
Can pandas help me achieve my goal and reduce the execution time? I found DataFrame.diff(), but I'm not sure it is what I want (or I don't know how to use it).
Thank you very much.
Short of avoiding the nested for loops, you can already speed things up by avoiding all unnecessary calculations and conversions within the loops. In particular, you can use NumPy broadcasting to define a Boolean array beforehand, in which you can look up whether the condition is met:
import numpy as np

temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)
results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        if condition[i, j]:
            results.append([temps[i], temps[j],
                            times[i], times[j],
                            times_diff[i, j]])
results
[[36.443, 43.345, 1622471518.92911, 1622471543.622588, 24.693477869033813],
...
[36.454, 42.468, 1622471556.2518, 1622471568.472718, 12.22091794013977]]
To avoid the loops altogether, you could define a 3-dimensional full results array and then use the condition array as a Boolean mask to filter out the results you want:
import numpy as np

n = len(temps)
temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)
results_full = np.stack([np.repeat(temps[:, None], n, axis=1),
                         np.tile(temps, (n, 1)),
                         np.repeat(times[:, None], n, axis=1),
                         np.tile(times, (n, 1)),
                         times_diff])
results = results_full[np.stack(results_full.shape[0] * [condition])]
results.reshape((5, -1)).T
array([[ 3.64430000e+01, 4.33450000e+01, 1.62247152e+09,
1.62247154e+09, 2.46934779e+01],
...
[ 3.64540000e+01, 4.24680000e+01, 1.62247156e+09,
1.62247157e+09, 1.22209179e+01],
...
])
As you can see, the resulting numbers are the same as above, although this time the results array will contain more rows, because we didn't use the shortcut of starting the inner loop at i+1.
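If you do want to mimic that shortcut, one possible refinement (my addition, not part of the answer above) is to mask out the lower triangle of the condition array so that only pairs with j > i survive:

# keep only pairs with j > i, mirroring the inner loop that starts at i + 1
condition_upper = np.triu(condition, k=1)
results = results_full[np.stack(results_full.shape[0] * [condition_upper])]
results.reshape((5, -1)).T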
I have a time series s stored as a pandas.Series and I need to find when the value tracked by the time series changes by at least x.
In pseudocode:
print s(0)
s* = s(0)
for all t in ]t, t_max]:
if |s(t)-s*| > x:
s* = s(t)
print s*
Naively, this can be coded in Python as follows:
import pandas as pd

def find_changes(s, x):
    changes = []
    s_last = None
    for index, value in s.iteritems():
        if s_last is None:
            s_last = value
        if value-s_last > x or s_last-value > x:
            changes += [index, value]
            s_last = value
    return changes
My data set is large, so I can't just use the method above. Moreover, I cannot use Cython or Numba due to limitations of the framework I will run this on. I can (and plan to) use pandas and NumPy.
I'm looking for some guidance on what NumPy vectorized/optimized methods to use and how.
Thanks!
EDIT: Changed code to match pseudocode.
I don't know if I am understanding you correctly, but here is how I interpreted the problem:
import pandas as pd
import numpy as np
# Our series of data.
data = pd.DataFrame(np.random.rand(10), columns = ['value'])
# The threshold.
threshold = .33
# For each point t, grab t - 1.
data['value_shifted'] = data['value'].shift(1)
# Absolute difference of t and t - 1.
data['abs_change'] = abs(data['value'] - data['value_shifted'])
# Test against the threshold.
data['change_exceeds_threshold'] = np.where(data['abs_change'] > threshold, 1, 0)
print(data)
Giving:
value value_shifted abs_change change_exceeds_threshold
0 0.005382 NaN NaN 0
1 0.060954 0.005382 0.055573 0
2 0.090456 0.060954 0.029502 0
3 0.603118 0.090456 0.512661 1
4 0.178681 0.603118 0.424436 1
5 0.597814 0.178681 0.419133 1
6 0.976092 0.597814 0.378278 1
7 0.660010 0.976092 0.316082 0
8 0.805768 0.660010 0.145758 0
9 0.698369 0.805768 0.107400 0
I don't think the pseudo code can be vectorized because the next state of s* is dependent on the last state. There's a pure python solution (1 iteration):
import random
import pandas as pd

s = [random.randint(0, 100) for _ in range(100)]
res = []  # record changes
thres = 20
ss = s[0]
for i in range(len(s)):
    if abs(s[i] - ss) > thres:
        ss = s[i]
        res.append([i, s[i]])
df = pd.DataFrame(res, columns=['index', 'value'])  # two columns: position and new value
I think there's no way to run faster than O(N) in this case.
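If the input really is a pandas Series, a further speed-up along the same O(N) lines (a rough sketch, not benchmarked) is to scan the underlying NumPy array instead of using Series.iteritems():

def find_changes_fast(s, x):
    # s is a pandas Series; scanning the raw array avoids per-element pandas overhead
    values = s.values
    index = s.index
    changes = []
    s_last = values[0]
    for i in range(len(values)):
        if abs(values[i] - s_last) > x:
            s_last = values[i]
            changes.append((index[i], values[i]))
    return changes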
I have a ranking function that I apply to a large number of columns of several million rows, and it takes minutes to run. By removing all of the logic preparing the data for application of the .rank() method, i.e., by doing this:
ranked = df[['period_id', 'sector_name'] + to_rank].groupby(['period_id', 'sector_name']).transform(lambda x: (x.rank(ascending = True) - 1)*100/len(x))
I managed to get this down to seconds. However, I need to retain my logic, and am struggling to restructure my code: ultimately, the largest bottleneck is my double use of lambda x:, but clearly other aspects are slowing things down (see below). I have provided a sample data frame, together with my ranking functions below, i.e. an MCVE. Broadly, I think that my questions boil down to:
(i) How can one replace the .apply(lambda x ...) usage in the code with a fast, vectorized equivalent? (ii) How can one loop over multi-indexed, grouped data frames and apply a function? In my case, to each unique combination of the date_id and category columns.
(iii) What else can I do to speed up my ranking logic? The main overhead seems to be in .value_counts(). This overlaps with (i) above; perhaps one can do most of this logic on df, perhaps via construction of temporary columns, before sending for ranking. Similarly, can one rank the sub-dataframe in one call?
(iv) Why use pd.qcut() rather than df.rank()? The latter is cythonized and seems to have more flexible handling of ties, but I cannot see a comparison between the two, and pd.qcut() seems most widely used.
Sample input data is as follows:
import pandas as pd
import numpy as np
import random
to_rank = ['var_1', 'var_2', 'var_3']
df = pd.DataFrame({'var_1' : np.random.randn(1000), 'var_2' : np.random.randn(1000), 'var_3' : np.random.randn(1000)})
df['date_id'] = np.random.choice(range(2001, 2012), df.shape[0])
df['category'] = ','.join(chr(random.randrange(97, 97 + 4 + 1)).upper() for x in range(1,df.shape[0]+1)).split(',')
The two ranking functions are:
def rank_fun(df, to_rank):  # calls ranking function f(x) to rank each category at each date
    # extra data tidying logic here beyond scope of question - can remove
    ranked = df[to_rank].apply(lambda x: f(x))
    return ranked

def f(x):
    nans = x[np.isnan(x)]  # Remove nans as these will be ranked with 50
    sub_df = x.dropna()
    nans_ranked = nans.replace(np.nan, 50)  # give nans rank of 50
    if len(sub_df.index) == 0:  # check not all nan. If no non-nan data, then return with rank 50
        return nans_ranked
    if len(sub_df.unique()) == 1:  # if all data has same value, return rank 50
        sub_df[:] = 50
        return sub_df
    # Check that we don't have too many clustered values, such that we can't bin due to overlap of ties,
    # and reduce bin size provided we can at least quintile rank.
    max_cluster = sub_df.value_counts().iloc[0]  # value_counts sorts by counts, so first element will contain the max
    max_bins = len(sub_df) / max_cluster
    if max_bins > 100:  # if largest cluster <1% of available data, then we can percentile_rank
        max_bins = 100
    if max_bins < 5:  # if we don't have the resolution to quintile rank then assume no data
        sub_df[:] = 50
        return sub_df
    bins = int(max_bins)  # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
    sub_df_ranked = pd.qcut(sub_df, bins, labels=False)  # currently using pd.qcut; .rank() seems to have extra functionality, but overheads similar in practice
    sub_df_ranked *= (100 / bins)  # Since we bin using the resolution specified in bins, to convert back to a percentile rank we multiply by 100/bins. E.g. with quintiles we have scores 1 - 5, so we multiply by 100 / 5 = 20 to convert to a percentile ranking
    ranked_df = pd.concat([sub_df_ranked, nans_ranked])
    return ranked_df
And the code to call my ranking function and recombine with df is:
# ensure don't get duplicate columns if ranking already executed
ranked_cols = [col + '_ranked' for col in to_rank]
ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
df = df.join(ranked[ranked_cols])
I am trying to get this ranking logic as fast as I can by removing both lambda x calls; I can remove the logic in rank_fun so that only f(x)'s logic is applicable, but I also don't know how to process multi-index dataframes in a vectorized fashion. An additional question would be on the differences between pd.qcut() and df.rank(): it seems that both have different ways of dealing with ties, but the overheads seem similar, despite the fact that .rank() is cythonized; perhaps this is misleading, given the main overheads are due to my usage of lambda x.
I ran %lprun on f(x), which gave me the following results, although the main overhead is the use of .apply(lambda x) rather than a vectorized approach:
Line # Hits Time Per Hit % Time Line Contents
2 def tst_fun(df, field):
3 1 685 685.0 0.2 x = df[field]
4 1 20726 20726.0 5.8 nans = x[np.isnan(x)]
5 1 28448 28448.0 8.0 sub_df = x.dropna()
6 1 387 387.0 0.1 nans_ranked = nans.replace(np.nan, 50)
7 1 5 5.0 0.0 if len(sub_df.index) == 0:
8 pass #check not empty. May be empty due to nans for first 5 years e.g. no revenue/operating margin data pre 1990
9 return nans_ranked
10
11 1 65559 65559.0 18.4 if len(sub_df.unique()) == 1:
12 sub_df[:] = 50 #e.g. for subranks where all factors had nan so ranked as 50 e.g. in 1990
13 return sub_df
14
15 #Finally, check that we don't have too many clustered values, such that we can't bin, and reduce bin size provided we can at least quintile rank.
16 1 74610 74610.0 20.9 max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
17 # print(counts)
18 1 9 9.0 0.0 max_bins = len(sub_df) / max_cluster #
19
20 1 3 3.0 0.0 if max_bins > 100:
21 1 0 0.0 0.0 max_bins = 100 #if largest cluster <1% of available data, then we can percentile_rank
22
23
24 1 0 0.0 0.0 if max_bins < 5:
25 sub_df[:] = 50 #if we don't have the resolution to quintile rank then assume no data.
26
27 # return sub_df
28
29 1 1 1.0 0.0 bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
30
31 #should track bin resolution for all data. To add.
32
33 #if get here, then neither nans_ranked, nor sub_df are empty
34 # sub_df_ranked = pd.qcut(sub_df, bins, labels=False)
35 1 160530 160530.0 45.0 sub_df_ranked = (sub_df.rank(ascending = True) - 1)*100/len(x)
36
37 1 5777 5777.0 1.6 ranked_df = pd.concat([sub_df_ranked, nans_ranked])
38
39 1 1 1.0 0.0 return ranked_df
I'd build a function using numpy
I plan on using this within each group defined within a pandas groupby
def rnk(df):
    a = df.values.argsort(0)
    n, m = a.shape
    r = np.arange(a.shape[1])
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    return pd.DataFrame(b / n, df.index, df.columns)
gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')
var_1_ranked var_2_ranked var_3_ranked
0 0.333333 0.809524 0.428571
1 0.160000 0.360000 0.240000
2 0.153846 0.384615 0.461538
3 0.000000 0.315789 0.105263
4 0.560000 0.200000 0.160000
...
How It Works
Because I know that ranking is related to sorting, I want to use some clever sorting to do this quicker.
numpy's argsort will produce a permutation that can be used to slice the array into a sorted array.
a = np.array([25, 300, 7])
b = a.argsort()
print(b)
[2 0 1]
print(a[b])
[ 7 25 300]
So, instead, I'm going to use the argsort to tell me where the first, second, and third ranked elements are.
# create an empty array that is the same size as b or a
# but these will be ranks, so I want them to be integers
# so I use empty_like(b) because b is the result of
# argsort and is already integers.
u = np.empty_like(b)
# now just like when I sliced a above with a[b]
# I slice u the same way but instead I assign to
# those positions, the ranks I want.
# In this case, I defined the ranks as np.arange(b.size) + 1
u[b] = np.arange(b.size) + 1
print(u)
[2 3 1]
And that was exactly correct. The 7 was in the last position but was our first rank. 300 was in the second position and was our third rank. 25 was in the first position and was our second rank.
Finally, I divide the ranks by the number of elements to get the percentiles. It so happens that because I used zero based ranking np.arange(n), as opposed to one based np.arange(1, n+1) or np.arange(n) + 1 as in our example, I can do the simple division to get the percentiles.
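For instance, continuing the small example above with zero-based ranks (my own illustration, reusing a, b and np):

u0 = np.empty_like(b)
u0[b] = np.arange(b.size)        # zero-based ranks: [1 2 0]
print(u0 / float(b.size))        # percentiles: [0.33333333 0.66666667 0.        ]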
What's left to do is apply this logic to each group. We can do this in pandas with groupby
Some of the missing details include how I use argsort(0) to get independent sorts per column and that I do some fancy slicing to rearrange each column independently.
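As a tiny illustration (made-up numbers, my addition) of the first point, argsort(0) argsorts each column independently:

m = np.array([[3, 10],
              [1, 30],
              [2, 20]])
print(m.argsort(0))
# [[1 0]
#  [2 2]
#  [0 1]]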
Can we avoid the groupby and have numpy do the whole thing?
I'll also take advantage of numba's just in time compiling to speed up some things with njit
from numba import njit

@njit
def count_factor(f):
    c = np.arange(f.max() + 2) * 0
    for i in f:
        c[i + 1] += 1
    return c

@njit
def factor_fun(f):
    c = count_factor(f)
    cc = c[:-1].cumsum()
    return c[1:][f], cc[f]

def lexsort(a, f):
    n, m = a.shape
    f = f * (a.max() - a.min() + 1)
    return (f.reshape(-1, 1) + a).argsort(0)

def rnk_numba(df, gcols, rcols):
    tups = list(zip(*[df[c].values.tolist() for c in gcols]))
    f = pd.Series(tups).factorize()[0]
    a = lexsort(np.column_stack([df[c].values for c in rcols]), f)
    c, cc = factor_fun(f)
    c = c[:, None]
    cc = cc[:, None]
    n, m = a.shape
    r = np.arange(a.shape[1])
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    return pd.DataFrame((b - cc) / c, df.index, rcols).add_suffix('_ranked')
How it works
Honestly, this is difficult to process mentally. I'll stick with expanding on what I explained above.
I want to use argsort again to drop rankings into the correct positions. However, I have to contend with the grouping columns. So what I do is compile a list of tuples and factorize them as was addressed in this question here
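For example (a small made-up illustration, my addition), factorizing a list of grouping tuples yields one integer label per unique group:

import pandas as pd

tups = [(2001, 'A'), (2001, 'B'), (2001, 'A'), (2002, 'B')]
print(pd.Series(tups).factorize()[0])   # [0 1 0 2]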
Now that I have a factorized set of tuples I can perform a modified lexsort that sorts within my factorized tuple groups. This question addresses the lexsort.
A tricky bit remains to be addressed where I must offset the newfound ranks by the size of each group so that I get fresh ranks for every group. This is taken care of in the tiny snippet b - cc in the code above, but calculating cc is a necessary component.
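A toy illustration (made-up numbers, my addition) of that offset, with two groups of sizes 2 and 3 whose rows are already lex-sorted:

b  = np.array([0, 1, 2, 3, 4])   # global positions after the lexsort
cc = np.array([0, 0, 2, 2, 2])   # cumulative start offset of each row's group
c  = np.array([2, 2, 3, 3, 3])   # size of each row's group
print((b - cc) / c)              # group-relative ranks: 0, 0.5 | 0, 1/3, 2/3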
So that's some of the high level philosophy. What about @njit?
Note that when I factorize, I am mapping to the integers 0 to n - 1 where n is the number of unique grouping tuples. I can use an array of length n as a convenient way to track the counts.
In order to accomplish the groupby offset, I needed to track the counts and cumulative counts in the positions of those groups as they are represented in the list of tuples or the factorized version of those tuples. I decided to do a linear scan through the factorized array f and count the observations in a numba loop. While I had this information, I'd also produce the necessary information to produce the cumulative offsets I also needed.
numba provides an interface to produce highly efficient compiled functions. It is finicky and you have to acquire some experience to know what is possible and what isn't. I decided to numbafy the two functions that are preceded with the numba decorator @njit. This code works just as well without those decorators, but is sped up with them.
Timing
%%timeit
ranked_cols = [col + '_ranked' for col in to_rank]
ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
1 loop, best of 3: 481 ms per loop
gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
%timeit df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')
100 loops, best of 3: 16.4 ms per loop
%timeit rnk_numba(df, gcols, rcols).head()
1000 loops, best of 3: 1.03 ms per loop
I suggest you try this code. It's 3 times faster than yours, and clearer.
rank function:
def rank(x):
    counts = x.value_counts()
    bins = int(0 if len(counts) == 0 else x.count() / counts.iloc[0])
    bins = 100 if bins > 100 else bins
    if bins < 5:
        return x.apply(lambda x: 50)
    else:
        return (pd.qcut(x, bins, labels=False) * (100 / bins)).fillna(50).astype(int)
single thread apply:
for col in to_rank:
    df[col + '_ranked'] = df.groupby(['date_id', 'category'])[col].apply(rank)
multiple thread apply:
import sys
from multiprocessing import Pool

def tfunc(col):
    return df.groupby(['date_id', 'category'])[col].apply(rank)

pool = Pool(len(to_rank))
result = pool.map_async(tfunc, to_rank).get(sys.maxint)
for (col, val) in zip(to_rank, result):
    df[col + '_ranked'] = val
I am trying to find the standard deviation for a sequence of numbers that were extracted from combinations of dice (30 of them) that sum up to 120. I am very new to Python, and this code makes the console freeze because the number of combinations is enormous and I am not sure how to fit them all into a smaller, more efficient function. What I did is:
found all possible combinations of 30 dice;
filtered combinations that sum up to 120;
multiplied all items in the list within result list;
tried extracting standard deviation.
Here is the code:
import itertools
import numpy

dice = [1, 2, 3, 4, 5, 6]
subset = itertools.product(dice, repeat=30)
result = []
for x in subset:
    if sum(x) == 120:
        result.append(x)
my_result = numpy.product(result, axis=1).tolist()
std = numpy.std(my_result)
print(std)
Note that Var(X) = E(X^2) - E(X)^2, so you can solve this problem analytically with the following recurrences, where f[i][N] is the sum of the products over all sequences of i dice summing to N, g[i][N] is the sum of the squared products, and h[i][N] is the number of such sequences.
f[i][N] = sum(k*f[i-1][N-k]) (1<=k<=6)
g[i][N] = sum(k^2*g[i-1][N-k])
h[i][N] = sum(h[i-1][N-k])
f[1][k] = k ( 1<=k<=6)
g[1][k] = k^2 ( 1<=k<=6)
h[1][k] = 1 ( 1<=k<=6)
Sample implementation:
import numpy as np

Nmax = 120
nmax = 30
min_value = 1
max_value = 6

f = np.zeros((nmax+1, Nmax+1), dtype='object')
g = np.zeros((nmax+1, Nmax+1), dtype='object')  # the intermediate results will be really huge; to keep them accurate we have to utilize Python big-ints
h = np.zeros((nmax+1, Nmax+1), dtype='object')

for i in range(min_value, max_value+1):
    f[1][i] = i
    g[1][i] = i**2
    h[1][i] = 1

for i in range(2, nmax+1):
    for N in range(1, Nmax+1):
        f[i][N] = 0
        g[i][N] = 0
        h[i][N] = 0
        for k in range(min_value, max_value+1):
            f[i][N] += k*f[i-1][N-k]
            g[i][N] += (k**2)*g[i-1][N-k]
            h[i][N] += h[i-1][N-k]
result = np.sqrt(float(g[nmax][Nmax]) / h[nmax][Nmax] - (float(f[nmax][Nmax]) / h[nmax][Nmax]) ** 2)
# result = 32128174994365296.0
You ask for an unfiltered result of length 6^30 ≈ 2·10^23, which is impossible to handle as such.
There are two possibilities that can be combined:
Include more thinking to pre-treat the problem, e.g. on how to sample only those combinations with sum 120.
Do a Monte Carlo simulation instead, i.e. don't sample all combinations, but only a random selection of a few thousand, to obtain a representative sample that determines the std sufficiently accurately.
Now, I only apply (2), giving the brute force code:
import random

N = 30      # number of dice
M = 100000  # number of samples
S = 120     # required sum
result = [[random.randint(1, 6) for _ in xrange(N)] for _ in xrange(M)]
result = [s for s in result if sum(s) == S]
Now, that result should be comparable to your result before using numpy.product ... that part I couldn't follow, though...
Ok, if you are after the standard deviation of the product of the 30 dice, that is what your code does. Then I need 1 000 000 samples to get roughly reproducible values for std (1 digit) - takes my PC about 20 seconds, still considerably less than 1 million years :-D.
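For completeness, a minimal sketch (my addition) of that remaining step, mirroring the numpy.product call from the question; casting to float64 avoids overflowing int64, since each product can be as large as 6^30:

import numpy

# products of each surviving sample; float64 avoids int64 overflow (6**30 is about 2.2e23)
my_result = numpy.product(numpy.asarray(result, dtype='float64'), axis=1)
print(numpy.std(my_result))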
Is a number like 3.22·10^16 what you are looking for?
Edit after comments:
Well, sampling the frequency of numbers instead gives only 6 independent variables - even 4 actually, by substituting in the constraints (sum = 120, total number = 30). My current code looks like this:
import itertools
import numpy

def p2(b, s):
    return 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]

hits = range(31)
subset = itertools.product(hits, repeat=4)  # only 3,4,5,6 frequencies
product = []
permutations = []
for s in subset:
    b = 90 - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3])  # 2 frequency
    a = 30 - (b + sum(s))                         # 1 frequency
    if 0 <= b <= 30 and 0 <= a <= 30:
        product.append(p2(b, s))
        permutations.append(1)  # TODO: Replace 1 with possible permutations
print numpy.std(product)  # TODO: calculate std manually, considering permutations
This computes in about 1 second, but the confusing part is that I get as a result 1.28737023733e+17. Either my previous approaches or this one has a bug - or both.
Sorry - not that easy: The sampling is not of the same probability - that is the problem here. Each sample has a different number of possible combinations, giving its weight, which has to be considered before taking the std-deviation. I have drafted that in the code above.
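One possible way (a sketch of mine, not from the original post) to fill in those TODOs: record each sample's frequency vector in a hypothetical parallel list freqs (e.g. freqs.append((a, b) + s) inside the loop above), weight each product by its multinomial coefficient, and take a weighted standard deviation:

from math import factorial

def multinomial(counts):
    # number of orderings of 30 dice with the given face frequencies
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total

# freqs is assumed to hold one (n1, n2, n3, n4, n5, n6) tuple per entry of product
weights = [multinomial(fr) for fr in freqs]
w = float(sum(weights))
mean = sum(wi * p for wi, p in zip(weights, product)) / w
mean_sq = sum(wi * p * p for wi, p in zip(weights, product)) / w
print((mean_sq - mean ** 2) ** 0.5)  # weighted standard deviation over all combinations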
I have a homework assignment I just finished, but it looks pretty horrendous, knowing that there's a much simpler and more efficient way to get the correct output; I just can't seem to figure it out.
Here's the objective of the assignment.
Write a program that stores the following values in a 2D list (these will be hardcoded):
2.42 11.42 13.86 72.32
56.59 88.52 4.33 87.70
73.72 50.50 7.97 84.47
The program should determine the maximum and average of each column
Output looks like
2.42 11.42 13.86 72.32
56.59 88.52 4.33 87.70
73.72 50.50 7.97 84.47
============================
73.72 88.52 13.86 87.70 column max
44.24 50.15 8.72 81.50 column average
The printing of the 2D list is done below; my problem is calculating the max and the averages.
data = [ [  2.42, 11.42, 13.86, 72.32],
         [ 56.59, 88.52,  4.33, 87.70],
         [ 73.72, 50.50,  7.97, 84.47] ]

emptylist = []
r = 0
while r < 3:
    c = 0
    while c < 4:
        print "%5.2f" % data[r][c],
        c = c + 1
    r = r + 1
    print

print "=" * 25
This prints the top half, but the code I wrote to calculate the max and average is bad. For the max I basically compared all the entries in each column to each other with if/elif statements, and for the average I added each column's entries together, averaged, then printed. Is there any way to calculate the bottom part with some sort of loop? Maybe something like the following,
for numbers in data:
    r = 0  # row index
    c = 0  # column index
    emptylist = []
    while c < 4:
        while r < 3:
            sum = data[r][c]
            totalsum = totalsum + sum
            avg = totalsum / float(rows)
            emptylist.append(avg)  # not sure if this would work? here im just trying to
            r = r + 1              # dump averages into an emptylist to print the values
        c = c + 1                  # in it later?
or something like that where I'm not manually adding each index number in each column and row. The max one I have no clue how to do in a loop. Also, NO LIST METHODS can be used; only append and len() can be used. Any help?
Here is what you're looking for:
num_rows = len(data)
num_cols = len(data[0])
max_values = [0]*num_cols  # Assuming the numbers in the array are all positive
avg_values = [0]*num_cols
for row_data in data:
    for col_idx, col_data in enumerate(row_data):
        max_values[col_idx] = max(max_values[col_idx], col_data)  # Max of two values
        avg_values[col_idx] += col_data
for i in range(num_cols):
    avg_values[i] /= num_rows
Then the max_values will contain the maximum for each column, while avg_values will contain the average for each column. Then you can print it like usual:
for num in max_values:
    print num,
print
for num in avg_values:
    print num,
or simply (if allowed):
print ' '.join(map(str, max_values))  # join needs strings, so convert the floats first
print ' '.join(map(str, avg_values))
I would suggest making two new lists, each the same size as one of your rows, and keeping a running sum in one and a running max in the other:
maxes = [0] * 4  # equivalent to [0, 0, 0, 0]
avgs = [0] * 4
for row in data:  # this gives one row at a time
    for c in range(4):  # equivalent to: for c in [0, 1, 2, 3]:
        # first, check if the max is big enough:
        if row[c] > maxes[c]:
            maxes[c] = row[c]
        # next, add that value's share of the average:
        avgs[c] += row[c] / 3.  # divide by the number of rows (3), not the number of columns
You can print them like so:
for m in maxes:
    print "%5.2f" % m,
print
for a in avgs:
    print "%5.2f" % a,
If you are allowed to use the enumerate function, this can be done a little more nicely:
for i, val in enumerate(row):
    print i, val
0 2.42
1 11.42
2 13.86
3 72.32
So it gives us the values and the index, so we can use it like this:
maxes = [0] * 4
sums = [0] * 4
for row in data:
    for c, val in enumerate(row):
        # first, check if the max is big enough:
        if val > maxes[c]:
            maxes[c] = val
        # next, add that value to the sum:
        sums[c] += val
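To finish this second version (my addition, sticking to append and len() as the only list operations), divide each column sum by the number of rows and print both result rows in the same format as the table above:

avgs = []
for c in range(4):
    avgs.append(sums[c] / len(data))  # the data values are floats, so this is a true average

for m in maxes:
    print "%5.2f" % m,
print
for a in avgs:
    print "%5.2f" % a,
print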