Script for working out exponential limit within a set range - python

In the image above, column B holds multiples of A1 and column C is a running total, C = C + B, working down the rows.
I worked out that for C to be 50 in 20 rows, A1 has to be 0.2631579, but I'd like to simplify that into a function that returns a list: list = exp(50, 20).
I'm not sure of the terminology for such a script, so researching beforehand didn't really bring anything up, sorry.

Well, based on your problem statement, we know that:
Bn = (n-1)×a, and Cn = B1 + B2 + ... + Bn = (0 + 1 + ... + (n-1))×a = n×(n-1)×a/2 (here a is the value of A1).
So we only have to solve for a with C20 = 50, or more generically with Cn = m. This is simply: a = 2×m/(n×(n-1)).
So the function is simply:
def find_threshold(m, n):
    return 2.0 * m / (n * (n - 1))
For your sample input, the lower bound is:
>>> find_threshold(50,20)
0.2631578947368421
If you plug this value into the Excel sheet, you will obtain 50 (although there can be small rounding errors). Assuming arithmetic on the numbers is done in constant time, this function runs in constant time as well (O(1)), so it is quite fast even if the row count were huge.
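To double-check the value without the spreadsheet, you can rebuild columns B and C in a few lines (a small sketch, not part of the original answer; it only assumes the find_threshold function above):
a = find_threshold(50, 20)
C = 0.0
for row in range(1, 21):       # rows 1..20
    B = (row - 1) * a          # column B: multiples of A1
    C = C + B                  # column C: running total
print(C)                       # ~50, up to floating-point rounding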

Related

Python similarity on sets of strings via Pandas crashes memory. How can I make it work?

I'm struggling to get my python code to run, as I always run out of memory. So, I have the following data frame:
I have a column with a key and a column with features; each features entry is a set of at most 10 strings, none of which contain spaces. In this example I have about 70k rows.
key features
0 String A {'Thisisastring', 'Thisisanothersentence', ... 'Maximumof10Strings'}
1 String B {'Hellothere', 'Woop', ... 'Maxiningoutat10Strings'}
2 String C {'Yessir', 'Stackovervlowisawesome', ... 'Maximumof10Strings'}
...
70000 String XY {'Aintnostring', 'Maybeitis', ... 'pleasehelpme'}
...
Now what I want to do is compare each feature set with all the other feature sets and get their similarity. The similarity score itself is simple: if 5 of the 10 strings are the same, I want a similarity score of 0.5, and so on:
def similarity_score(a, b):
    c = a.intersection(b)
    return 2 * float(len(c)) / (len(a) + len(b))
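For instance, two 10-element sets that share 5 members score 0.5 (a quick illustrative check, not from the original post):
a = {'f%d' % i for i in range(10)}
b = {'f%d' % i for i in range(5, 15)}      # shares f5..f9 with a
print(similarity_score(a, b))              # 0.5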
This is the current code, at the end I want to have a matrix, so that I can easily cluster them together based upon a similarity score threshold:
import numpy as np
import pandas as pd

base_pd = original_pd['features']
i = base_pd.apply(frozenset, 1).to_frame()
j = i.assign(foo=1)
k = j.merge(j, on='foo').drop('foo', 1)   # cross join: every row paired with every row
k.columns = ['A', 'B']
fnc = np.vectorize(similarity_score)
y = fnc(k['A'], k['B']).reshape(len(base_pd), len(base_pd))
keys = original_pd['key'].to_list()
df = pd.DataFrame(data=y, index=keys, columns=keys)
The issue is, though, that this wipes out my memory and uses more than 25 GB quite early on. Obviously, part of it is the huge amount of data (70k rows), but I may need to handle even more rows, so I need a solution.
I've already tried with NumPy, to get around it a bit, but I'm not getting anywhere.
How could I make this more efficient? I need to use strings originally, obviously could change them to hashes or so, but even then I am a bit lost.
Best and thanks in advance,
Lukas
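One observation that helps frame the problem (not part of the original post): a dense 70,000 × 70,000 float64 matrix is roughly 39 GB by itself, which already exceeds the 25 GB mentioned, so the full cross join plus the final matrix cannot fit in memory. A common workaround, sketched below under assumptions (SciPy available; block_size and threshold are illustrative names), is to encode the sets as a sparse 0/1 matrix and compute the scores block by block, keeping only the pairs above a chosen similarity threshold:
import numpy as np
from scipy import sparse

feature_sets = original_pd['features'].tolist()
vocab = {f: j for j, f in enumerate(set().union(*feature_sets))}

rows, cols = [], []
for r, s in enumerate(feature_sets):
    for f in s:
        rows.append(r)
        cols.append(vocab[f])
M = sparse.csr_matrix((np.ones(len(rows), dtype=np.float32), (rows, cols)),
                      shape=(len(feature_sets), len(vocab)))
sizes = np.asarray(M.sum(axis=1)).ravel()              # |a| for every row

block_size, threshold = 2000, 0.5
for start in range(0, M.shape[0], block_size):
    stop = min(start + block_size, M.shape[0])
    inter = (M[start:stop] @ M.T).toarray()            # pairwise intersection sizes for this block
    sim = 2.0 * inter / (sizes[start:stop, None] + sizes[None, :])
    keep_i, keep_j = np.nonzero(sim >= threshold)
    # store (start + keep_i, keep_j, sim[keep_i, keep_j]) however is convenient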

Speed Up Program Below

I have written the for loop below, where I go through an array element by element and do some math on those elements. Once the math is calculated, the result gets stored in another array.
for i in range(0, 1024):
    x[i] = a * data[i] + b * x[i - 1] + c * x[i - 2]
So in my program a, b, and c are just scalar numbers, and data and x are arrays. data has size 1024 and is filled with numbers; x is also size 1024 but is initially filled with zeros. To calculate a new element of x I use the previous two elements of x; for the first iterations these are 0 and 0, since the negative indices wrap around to the end of the all-zero x array. I multiply the current element of data by a, the previous element of x by b, and the element before that by c, then add everything up and save it to the current element of x. I do the same thing for every element of data and x.
This loop works, but I was wondering if there is a faster way to do it, maybe using a combination of NumPy functions like cumsum or a dot product? Can someone help me make the program faster? Thank you!
The best you can do while keeping the recursive method:
x = a * data
x[1] += b * x[0]    # the i = 1 step of the original loop
coef = np.array([c, b])
for i in range(2, 1024):
    x[i] += np.dot(coef, x[i-2:i])
But even better, you can solve this recurrence relation into a closed-form solution and apply it directly without a loop (it is a basic second-order linear recurrence).
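As a practical middle ground (not part of the original answer, just a sketch assuming a, b, c and data are defined as in the question), the loop is exactly a second-order linear (IIR) filter, so SciPy can evaluate the whole recurrence in compiled code:
from scipy import signal

x = signal.lfilter([a], [1.0, -b, -c], data)   # x[i] = a*data[i] + b*x[i-1] + c*x[i-2], zero initial state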
In general, if you want a program that is fast, Python is not the best option. Python is great for prototyping since it is easy and has a lot of tools, but it is not very computationally efficient in its raw form compared to, for example, C. What I usually do is use Cython, a module for Python that lets you compile your script to machine code (as you do with C), which greatly increases the speed of the application.
It lets you type the variables, for example:
cdef double a, b, c
When you use a variable in Python, its type has to be checked every single time to determine what kind of variable it is (int, double, string, etc.). In C that is not an issue, since you decide from the start what type the variable is, which cuts down the cost of each operation.
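A minimal Cython version of the loop might look like this (a sketch with a hypothetical module name recurrence.pyx, built with cythonize; typed variables and memoryviews let the loop run at C speed):
# recurrence.pyx (hypothetical module name; build with cythonize)
def run(double a, double b, double c, double[:] data, double[:] x):
    # data and x are expected to be float64 NumPy arrays
    cdef Py_ssize_t i
    cdef Py_ssize_t n = data.shape[0]
    x[0] = a * data[0]
    x[1] = a * data[1] + b * x[0]
    for i in range(2, n):
        x[i] = a * data[i] + b * x[i - 1] + c * x[i - 2]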
I would try to transform the for loop into a list comprehension, which has much faster processing time in Python.

Best way to create a loop for multiplying a matrix by every one of its elements, then summing the results

I'm very new to Python, so apologies for the lack of vocabulary/knowledge. I would like to know if there is a better way to achieve what the code below provides. Using the loop I have made, I generate and append all of the matrices/arrays formed from multiplying matrix A by each and every element within A. The last line of code then sums all of the elements in this array of arrays and prints out the result I want.
The problem is, when I get to about d = 600, I get SIGKILL errors, due to a lack of memory on my computer.
I have considered the mathematics behind it, which included breaking the summation into parts that dealt with different values of indices, but nothing seems to speed it up significantly.
This may be purely a memory-based issue, but I thought I would ask in case there are any Python/code based tips that could help. The code is as follows:
import numpy

A = numpy.random.randint(0, 4, size=(d, d))
All = []
for n in range(0, d):
    for m in range(0, d):
        All.append(A * A[n, m])
print(numpy.sum(All))
So overall, I achieve the correct result, but due to the large size of the matrices and the number of multiplications, I cannot achieve the required d = 2000 I am looking for without a memory error. Thanks in advance.
You don't need to loop and build a new list here if all you want is the total sum. Each appended matrix A*A[n,m] sums to A.sum()*A[n,m], so summing that over every n and m gives A.sum() times A.sum(). Mathematically, what you're doing comes down to:
total = A.sum() ** 2
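A quick sanity check on a small matrix (just an illustration of the identity, not from the original answer):
import numpy
d = 5
A = numpy.random.randint(0, 4, size=(d, d))
All = [A * A[n, m] for n in range(d) for m in range(d)]
print(numpy.sum(All))     # loop-based total
print(A.sum() ** 2)       # closed form, same value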

How to fix negative values in log?

So, I am getting the data from a txt file and I want to get specific data within the whole set. In the code, I am trying to grab it by specifying which indexes and which frequencies are being used for those indexes. But my log is showing a negative value and I don't know how to fix that. Code is below, thanks!
import numpy as np

indexes = [9, 10, 11, 12, 13]
frequenciesmh = [151, 610, 1400, 4860, 18000]
frequenciesgh = [i * 10**-3 for i in frequenciesmh]
bigclusterallfluxes = bigcluster[indexes]
bigclusterlogflux151mhandredshift = [i[indexes] for i in bigcluster]
shiftedlogflux151mh = [np.interp(np.log10((151*10**-3)*i[0]), np.log10(frequenciesgh), i[1:])
                       for i in bigclusterlogflux151mhandredshift]
shiftflux151mh = [10**i for i in shiftedlogflux151mh]
bigclusterflux151mhandredshift = np.array(list(zip(shiftflux151mh, np.transpose(bigcluster)[9])))
I don't know what you are trying to fix exactly, but I would definitely NOT change the negative values: forcing them positive maps every power to a positive one, so 1/16 ==> 16 while 16 ==> 16 as well, and you lose information.
What you probably want, since you are working with frequencies (which, once normalized by dividing each one by the sum of all of them, are always between 0 and 1, so their logarithm is always less than or equal to 0), is to make them all positive by taking the minus log of the value. That is quite a common quantity: 1 corresponds to 1/10, 2 to 1/100, and so on (in genetics these are called Phred values, I believe).
Summarizing: always take the minus log, not the log:
import math
-math.log(0.0001)    # 9.21..., the positive counterpart of math.log(0.0001)
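The same idea in base 10, matching the 1 == 1/10, 2 == 1/100 examples above (a small illustration, not part of the original answer):
import math
print(-math.log10(0.01))   # 2.0, i.e. a value of 2 corresponds to a probability of 1/100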
The abs() function is what you are looking for.

fill missing values in python array

Using: Python 2.7.1 on Windows
Hello, I fear this question has a very simple answer, but I just can't seem to find an appropriate and efficient solution (I have limited Python experience). I am writing an application that downloads historic weather data from a third-party API (Wunderground). The thing is, sometimes there's no value for a given hour (e.g., we have 20 degrees at 5 AM, no value for 6 AM, and 21 degrees at 7 AM). I need exactly one temperature value for every hour, so I figured I could fit the data I do have and evaluate the points I'm missing (using SciPy's polyfit). That's all fine; however, I am having problems getting my program to detect whether the list has missing hours and, if so, to insert the missing hour and calculate a temperature value. I hope that makes sense.
My attempt at handling the hours and temperatures list is the following:
from scipy import polyfit

# Evaluate simple quadratic function
def tempcal(array, x):
    return array[0]*x**2 + array[1]*x + array[2]

# Sample data, note it has missing hours.
# My final hrs list should look like range(25), with matching temperatures at every point
hrs = [1,2,3,6,9,11,13,14,15,18,19,20]
temps = [14.0,14.5,14.5,15.4,17.8,21.3,23.5,24.5,25.5,23.4,21.3,19.8]

# Fit coefficients
coefs = polyfit(hrs, temps, 2)

# Cycle control
i = 0
done = False
while not done:
    # It has a missing hour, insert it and calculate a temperature
    if hrs[i] != i:
        hrs.insert(i, i)
        temps.insert(i, tempcal(coefs, i))
    # We are done, leave now
    if i == 24:
        done = True
    i += 1
I can see why this isn't working; the program will eventually try to access indexes out of range for the hrs list. I am also aware that modifying a list's length inside a loop has to be done carefully. Surely enough, I am either not being careful enough or overlooking a simpler solution altogether.
In my googling attempts to help myself I came across pandas (the library) but I feel like I can solve this problem without it, (and I would rather do so).
Any input is greatly appreciated. Thanks a lot.
When i equals 21, hrs[i] refers to the twenty-second value in the list, but there are only 21 values at that point, hence the index error.
In the future I recommend using PyCharm with breakpoints to debug, or a try-except construction.
I'm not sure I would recommend this way of interpolating values; I would have used the closest points surrounding each missing value instead of a fit to the whole dataset (see the linear-interpolation sketch after the code below). But using NumPy, your proposed approach is fairly straightforward:
import numpy as np

hrs = np.array(hrs)
temps = np.array(temps)

newTemps = np.empty(25)
newTemps.fill(-300)   # just fill it with some invalid data; temperatures don't go this low, so it should be safe

# fill in the original values
newTemps[hrs - 1] = temps

# get the indices of the missing values
missing = np.nonzero(newTemps == -300)[0]

# calculate and insert the missing values
newTemps[missing] = tempcal(coefs, missing + 1)
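If you would rather interpolate from the surrounding points, as suggested at the top of this answer, np.interp does that directly (a sketch, assuming the hrs and temps lists from the question; hours outside the 1-20 range are clamped to the nearest known value):
import numpy as np

all_hrs = np.arange(25)                       # hours 0..24, matching range(25)
all_temps = np.interp(all_hrs, hrs, temps)    # linear interpolation at the missing hours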
