Testing with manual calculation or programmed calculation?

I'm using Django and I need to write a test that requires calculation.
Is it best practice to calculate the expected value manually, or is it OK to do this using the sum function (see below)?
This example is easier for me because I don't have to calculate anything manually:
def test_balance(self):
    amounts = [120.82, 90.23, 89.32, 193.92]
    for amount in amounts:
        self.mockedTransaction(amount=amount)
    total = Balance.total()
    self.assertEqual(total, sum(amounts))
Or in this example I have to calculate the expected value manually:
def test_balance(self):
    self.mockedTransaction(amount=120.82)
    self.mockedTransaction(amount=90.23)
    self.mockedTransaction(amount=89.32)
    self.mockedTransaction(amount=193.92)
    total = Balance.total()
    self.assertEqual(total, 494.29)

Does your total function just use sum to get the sum of a list of numbers? If so, your test is performing the same steps as your code under test. In that case, the first test can't fail. It would be better to use a manually generated value. (If total does just wrap sum, then I wouldn't spend a lot of time worrying about it, though. That function has been thoroughly tested.)
If Balance.total() is getting its value by some other method (like a SQL query run on the database), then it's fine to compute the expected value in your test method, particularly in simple cases like summing a list of values. If the computation is very complex, then you probably want to go back to manually calculated values. Otherwise your test code might be just as difficult to debug as your code under test.
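As a side note, the amounts here are floats, so an exact assertEqual against a hand-computed total can fail due to rounding. A minimal sketch of the manual-value test using assertAlmostEqual (mockedTransaction and Balance are the question's own names):

def test_balance(self):
    for amount in (120.82, 90.23, 89.32, 193.92):
        self.mockedTransaction(amount=amount)
    # 120.82 + 90.23 + 89.32 + 193.92 = 494.29, computed by hand
    self.assertAlmostEqual(Balance.total(), 494.29, places=2)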


Need help finding GCD (noob approach)

I am currently going through the book Math Adventures with Python by Peter Farrell. I am simply trying to improve my math skills while learning Python in a fun way. So we made a factors function as seen below:
def factors(num):
    factorList = []
    for i in range(1, num + 1):
        if num % i == 0:
            factorList.append(i)
    return factorList
Exercise 3-1 asks to make a GCF (Greatest Common Factor) function. All the answers I found use built-in Python modules, recursion, or Euclid's algorithm. I have no clue what any of these things mean, let alone how to apply them to this assignment. I came up with the following solution using the above function:
def gcFactor(num1, num2):
    fnum1 = factors(num1)
    fnum2 = factors(num2)
    gcf = list(set(fnum1).intersection(fnum2))
    return max(gcf)

print(gcFactor(28, 21))
Is this the best way of doing it? Using the .intersection() function seems a little cheaty to me.
What I wanted to do instead is use a loop to go through the values in fnum1 and fnum2, compare them, and return the value that appears in both (a common factor) and is greatest (which would be the GCF).
The idea behind your algorithm is sound, but there are a few problems:
In your original version, you used gcf[-1] to get the greatest factor, but that will not always work, since converting a set to list does not guarantee that the elements will be in sorted order, even if they were sorted before converting to set. Better use max (you already changed that).
Using set.intersection is definitely not "cheating" but just making good use of what the language provides. It might be considered cheating to just use math.gcd, but not basic set or list functions.
Your algorithm is rather inefficient. I don't know the book, but I don't think you should actually use the factors function to calculate the gcf, but that was just an exercise to teach you stuff like loops and modulo. Consider two very different numbers as inputs, say 23764372 and 6. You'd calculate all the factors of 23764372 first, before testing the very few values that could actually be common factors. Instead of using factors directly, try to rewrite your gcFactor function to test which values up to the min of the two numbers are factors of both numbers.
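A minimal sketch of that rewrite (keeping the question's gcFactor name):

def gcFactor(num1, num2):
    gcf = 1
    # only values up to min(num1, num2) can divide both numbers
    for i in range(1, min(num1, num2) + 1):
        if num1 % i == 0 and num2 % i == 0:
            gcf = i
    return gcf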
Even then, your algorithm will not be very efficient. I would suggest reading up on Euclid's Algorithm and trying to implement that next. If unsure if you did it right, you can use your first function as a reference for testing, and to see the difference in performance.
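For reference, Euclid's algorithm itself is only a few lines; a sketch you can check against your factor-based version:

def gcd(a, b):
    # repeatedly replace the pair by the smaller number and the remainder
    while b:
        a, b = b, a % b
    return a

print(gcd(28, 21))  # 7, same as gcFactor(28, 21)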
About your factors function itself: note that there is a symmetry: if i is a factor, so is num//i. If you use this, you do not have to test all the values up to num but just up to sqrt(num), which reduces the running time from O(n) to O(√n).
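A sketch of that symmetric version (math.isqrt is available since Python 3.8):

import math

def factors(num):
    small, large = [], []
    for i in range(1, math.isqrt(num) + 1):
        if num % i == 0:
            small.append(i)
            if i != num // i:  # don't add a perfect-square root twice
                large.append(num // i)
    return small + large[::-1]  # keep the result sorted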

pytest - best design for fuzzing with limited parameters

I was wondering about fuzzing in pytest and what is the best way to do that.
In the past I used the hypothesis library to fuzz values, but it works best only when running each test many times.
Because my system is slow I want to be able to split the tests into 2 categories: "daily_run" and "regression":
daily_run will run each test 1 time
regression will run each test X times
Each run I want to be able to use random values. The problem is that the test parameters have a "valid" range I want to stay in when fuzzing. For example:
@pytest.mark.parametrize("month_number", [4])
def test_foo(month_number):
    # Test with that value
So in that example I get a fixed value for the month number: 4. I'll give another example before explaining what I have tried:
@pytest.mark.parametrize("month_number", [40])
def test_invalid_foo(month_number):
    # Test with that invalid value
So in the second example I test with an invalid value.
The range for the month number is obviously 1-12. I guess I could write some logic to get a random value between 1 and 12 for a valid month, and a random value outside that range (below 1 or above 12) for an invalid one. But that is a very verbose way to do it for month_number alone. In reality I have dozens of parameters, and I don't want to have to write that functionality for each one.
Of course I can write some generic logic to do so, and use that generic logic on every parameter, but I still wonder if there is a better way.
Also don't forget the 2 categories - daily_run & regression.
What is the best practice to write fuzzed test with parameter limits?
For Python fuzzing, take a look at the new atheris fuzzer. Like libFuzzer, it gives you a FuzzedDataProvider to feed inputs to your target, which by default feeds bytes, but can also be configured with methods such as:
def ConsumeIntList(count: int, bytes: int)
def ConsumeIntListInRange(count: int, min: int, max: int)
Your categories of targets are confusing at best. Usually you fuzz a harness/target for a certain amount of time and potentially reuse the folder of test cases produced.
Or you have a set of harnesses/targets that you fuzz often and some not so often (but I doubt that would be a good idea).
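A minimal harness sketch along those lines (the FuzzedDataProvider calls are atheris's documented entry points; treat the harness body itself as an assumption to adapt to your code under test):

import sys
import atheris

def test_one_input(data: bytes):
    fdp = atheris.FuzzedDataProvider(data)
    month_number = fdp.ConsumeIntInRange(1, 12)  # stay inside the valid range
    # ... exercise the code under test with month_number ...

atheris.Setup(sys.argv, test_one_input)
atheris.Fuzz()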

speed up function based on list comprehension

I'm trying to get the 15 most relevant items for each user, but every function I tried took an eternity (more than 6 hours; I shut it down after that ...).
I have 418 unique users and 3718 unique items.
The U2tfifd dict also has 418 entries, and there are 32645 words in tfidf_feature_names.
The shape of my interactions_full_df is (40733, 3).
I tried:
def index_tfidf_users(user_id):
    return [users for users in U2tfifd[user_id].flatten().tolist()]

def get_relevant_items(user_id):
    return sorted(zip(tfidf_feature_names, index_tfidf_users(user_id)), key=lambda x: -x[1])[:15]

def get_tfidf_token(user_id):
    return [words for words, values in get_relevant_items(user_id)]

and then:
interactions_full_df["tags"] = interactions_full_df["user_id"].apply(lambda x: get_tfidf_token(x))
or
def get_tfidf_token(user_id):
    tags = []
    v = sorted(zip(tfidf_feature_names, U2tfifd[user_id].flatten().tolist()), key=lambda x: -x[1])[:15]
    for words, values in v:
        tags.append(words)
    return tags
or
def get_tfidf_token(user_id):
    v = sorted(zip(tfidf_feature_names, U2tfifd[user_id].flatten().tolist()), key=lambda x: -x[1])[:15]
    tags = [words for words in v]
    return tags
U2tfifd is a dict with keys = user_id, values = an array
There are several things going on which could cause poor performance in your code. The impact of each of these will depend on things like your Python version (2.x or 3.x), your RAM speed, and whatnot. You'll need to experiment and benchmark the various potential improvements yourself.
1. TFIDF Sparsity (~10x speedup depending on sparsity)
One glaring potential problem is that TFIDF naturally returns sparse data (e.g. a paragraph doesn't use anywhere near as many unique words as an entire book), and working with dense structures like numpy arrays is a strange choice when the data is probably zero almost everywhere.
If you'll be doing this same analysis in the future, it might be helpful to make/use a version of TFIDF with sparse array outputs so that when you extract your tokens you can skip over the zero values. This would likely have the secondary benefit of the entire sparse array for each user fitting in the cache and preventing costly RAM access in your sorts and other operations.
It might be worth sparsifying your data anyway. On my potato, a quick benchmark on data which should be similar to yours indicates that the process can be done in ~30s. The process replaces much of the work you're doing with a highly optimized routine coded in C and wrapped for use in Python. The only real cost is the second pass through the non-zero entries, but unless that pass is pretty efficient to begin with you should be better off working with sparse data.
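As a rough sketch of what sparsifying could look like (assuming each U2tfifd value is a 1-D numpy array; scipy is an extra dependency):

import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([0., 0., 0.7, 0., 0.2, 0.])  # stand-in for U2tfifd[user_id]
sparse = csr_matrix(dense)                    # stores only the non-zero entries
nonzero_indices = sparse.indices              # column indices of the non-zero values
nonzero_values = sparse.data                  # the non-zero values themselves
# sort/argpartition over nonzero_values instead of the full dense row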
2. Duplicated Efforts and Memoization (~100x speedup)
If U2tfifd has 418 entries and interactions_full_df has 40733 rows then at least 40315 (or 99.0%) of your calls to get_tfidf_token() are wasted since you've already computed the answer. There are tons of memoization decorators out there, but you don't need anything very complicated for your use case.
def memoize(f):
    _cache = {}
    def _f(arg):
        if arg not in _cache:
            _cache[arg] = f(arg)
        return _cache[arg]
    return _f

@memoize
def get_tfidf_token(user_id):
    ...
Breaking this down, the function memoize() returns another function. The behavior of that function is to check a local cache for the expected return value before computing it and storing it if necessary.
The syntax @memoize... is short for something like the following.
def uncached_get_tfidf_token(user_id):
    ...

get_tfidf_token = memoize(uncached_get_tfidf_token)
The @ symbol is used to signify that we want the modified, or decorated, version of get_tfidf_token() instead of the original. Depending on your application, it might be beneficial to chain decorators together.
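For a case like this, the standard library's functools.lru_cache would also do the job without a hand-rolled decorator; a quick sketch:

from functools import lru_cache

@lru_cache(maxsize=None)  # cache the result for every distinct user_id seen
def get_tfidf_token(user_id):
    ...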
3. Vectorized Operations (varying speedup, benchmarking necessary)
Python doesn't really have a notion of primitive types like other languages, and even integers take 24 bytes in memory on my machine. Lists aren't usually packed, so you can incur costly cache misses as you're plowing through them. No matter how little work the CPU is doing for sorting and whatnot, clobbering a whole new chunk of memory to turn your array into a list, and only using that brand new, expensive memory once, is going to incur a performance hit.
Many of the things you are trying to do have fast (SIMD vectorized, parallelized, memory-efficient, packed memory, and other fun optimizations) numpy equivalents AND avoid unnecessary array copies and type conversions. It seems you're already using numpy anyway, so you won't have any extra imports or dependencies.
As one example, zip() creates another list in memory in Python 2.x and still does unnecessary work in Python 3.x when you really only care about the indices of tfidf_feature_names. To compute those indices, you can use something like the following, which avoids an unnecessary list creation and uses an optimized routine with slightly better asymptotic complexity as an added bonus.
def get_tfidf_token(user_id):
    temp = U2tfifd[user_id].flatten()
    ind = np.argpartition(temp, len(temp) - 15)[-15:]
    return tfidf_feature_names[ind]  # works if tfidf_feature_names is a numpy array
    # return [tfidf_feature_names[i] for i in ind]  # always works
Depending on the shape of U2tfifd[user_id], you could avoid the costly .flatten() computation by passing an axis argument to np.argsort() and flattening the 15 obtained indices instead.
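A sketch of that idea (using np.argpartition as in the snippet above, and assuming U2tfifd[user_id] has shape (1, n_words)):

import numpy as np

def get_tfidf_token(user_id):
    arr = U2tfifd[user_id]  # shape (1, n_words); no .flatten() needed
    # flatten just the 15 obtained indices, not the whole array
    ind = np.argpartition(arr, arr.shape[-1] - 15, axis=-1)[..., -15:].ravel()
    return [tfidf_feature_names[i] for i in ind]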
4. Bonus
The sorted() function supports a reverse argument so that you can avoid extra computations like throwing a negative on every value. Simply use
sorted(..., reverse=True)
Even better, since you really don't care about the sort itself but just the 15 largest values you can get away with
sorted(...)[-15:]
to index the largest 15 instead of reversing the sort and taking the smallest 15. That doesn't really matter if you're using a better function for the application like np.argpartition(), but it could be helpful in the future.
You can also avoid some function calls by replacing .apply(lambda x : get_tfidf_token(x)) with .apply(get_tfidf_token) since get_tfidf_token is already a function which has the intended behavior. You don't really need the extra lambda.
As far as I can see though, most additional gains are fairly nitpicky and system-dependent. You can make most things faster with Cython or straight C with enough time for example, but you already have reasonably fast routines which do what you want out of the box. The extra engineering effort probably isn't worth any potential gains.

How to use a random seed value in order to unittest a PRNG in Python?

I'm still pretty new to programming and just learning how to unittest. I need to test a function that returns a random value. I've so far found answers suggesting the use of a specific seed value so that the 'random' sequence is constant and can be compared. This is what I've got so far:
This is the function I want to test:
import random

def roll():
    '''Returns a random number in the range 1 to 6, inclusive.'''
    return random.randint(1, 6)
And this is my unittest:
class Tests(unittest.TestCase):
    def test_random_roll(self):
        random.seed(900)
        seq = random.randint(1, 6)
        self.assertEqual(roll(), seq)
How do I set the corresponding seed value for the PRNG in the function so that it can be tested without writing it into the function itself? Or is this completely the wrong way to go about testing a random number generator?
Thanks
The other answers are correct as far as they go. Here I'm answering the deeper question of how to test a random number generator:
Your provided function is not really a random number generator, as its entire implementation depends on a provided random number generator. In other words, you are trusting that Python provides you with a sensible random generator. For most purposes, this is a good thing to do. If you are writing cryptographic primitives, you might want to do something else, and at that point you would want some really robust test strategies (but they will never be enough).
Testing that a function returns a specific sequence of numbers tells you virtually nothing about its correctness in terms of "producing random numbers". A predefined sequence of numbers is the opposite of a random sequence.
So, what do you actually want to test? For the roll function, I think you'd like to test:
1. That, given 'enough' rolls, it produces all the numbers between 1 and 6, preferably in 'approximately' equal proportions.
2. That it doesn't produce anything else.
The problem with 1 is that your function is defined to produce a random sequence, so there is always a non-zero chance that any hard limits you put in to define 'enough' or 'approximately equal' will occasionally fail. You could do some calculations to pick limits that would make your test unlikely to fail more than, say, 1 in a billion times, or you could slap in a random.seed() call, which means the test will never fail if it passes once (unless the underlying implementation in Python changes).
Item 2 could be 'tested' more easily: generate some large number N of rolls and check that every outcome is within the expected set.
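A minimal sketch of such a test (the choice of N = 10000 is arbitrary):

import unittest

class Tests(unittest.TestCase):
    def test_roll_stays_in_range(self):
        for _ in range(10000):
            self.assertIn(roll(), {1, 2, 3, 4, 5, 6})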
For all of this, however, I'd ask what value the unit tests actually are. You literally cannot write a test to check whether something is 'random' or not. To see whether the function has a reasonable source of randomness and uses it correctly, tests are useless - you have to inspect the code. Once you have done that, it's clear that your function is correct (providing Python provides a decent random number generator).
In short, this is one of those cases where unit tests provide extremely little value. I would probably just write one test (item 2 above), and leave it at that.
By seeding the PRNG with a known seed, you know which sequence it will produce, so you can test for that sequence:
class Tests(unittest.TestCase):
    def test_random_roll(self):
        random.seed(900)
        self.assertEqual(roll(), 6)
        self.assertEqual(roll(), 2)
        self.assertEqual(roll(), 5)
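If you'd rather not depend on the exact sequence a particular Python version produces, another common approach is to stub out the randomness with unittest.mock; a sketch (this works because roll() looks up random.randint at call time):

from unittest import mock
import unittest

class Tests(unittest.TestCase):
    @mock.patch("random.randint", return_value=4)
    def test_roll_uses_randint(self, mock_randint):
        self.assertEqual(roll(), 4)
        mock_randint.assert_called_once_with(1, 6)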

How should I use @pm.stochastic in PyMC?

Fairly simple question: how should I use @pm.stochastic? I have read some blog posts that claim @pm.stochastic expects a negative log value:
@pm.stochastic(observed=True)
def loglike(value=data):
    # some calculations that generate a numeric result
    return -np.log(result)
I tried this recently but got really bad results. Since I also noticed that some people used np.log instead of -np.log, I gave it a try and it worked much better. What is @pm.stochastic really expecting? I'm guessing the confusion about the required sign comes from a very popular example that uses something like np.log(1/(1+t_1-t_0)), which was written as -np.log(1+t_1-t_0).
Another question: what is this decorator doing with the value argument? As I understand it, we start with some proposed value for the priors that need to enter the likelihood, and the idea of @pm.stochastic is basically to produce some number to compare this likelihood against the number generated in the previous iteration of the sampling process. The likelihood should receive the value argument and some values for the priors, but I'm not sure this is all value does, because it is the only required argument and yet I can write:
@pm.stochastic(observed=True)
def loglike(value=[1]):
    data = [3, 5, 1]  # some data
    # some calculations that generate a numeric result
    return np.log(result)
And as far as I can tell, that produces the same result as before. Maybe it works this way because I added observed=True to the decorator. If I had tried this on a stochastic variable with the default observed=False, value would be changed in each iteration in an attempt to obtain a better likelihood.
@pm.stochastic is a decorator, so it is expecting a function. The simplest way to use it is to give it a function that includes value as one of its arguments and returns a log-likelihood.
You should use the @pm.stochastic decorator to define a custom prior for a parameter in your model. You should use the @pm.observed decorator to define a custom likelihood for data. Both of these decorators will create a pm.Stochastic object, which takes its name from the function it decorates and has all the familiar methods and attributes (here is a nice article on Python decorators).
Examples:
A parameter a that has a triangular distribution a priori:
@pm.stochastic
def a(value=.5):
    if 0 <= value < 1:
        return np.log(1. - value)
    else:
        return -np.inf
Here value=.5 is used as the initial value of the parameter, and changing it to value=1 raises an exception, because it is outside of the support of the distribution.
A likelihood b that has a normal distribution centered at a, with a fixed precision:
@pm.observed
def b(value=[.2, .3], mu=a):
    return pm.normal_like(value, mu, 100.)
Here value=[.2,.3] is used to represent the observed data.
I've put this together in a notebook that shows it all in action here.
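To actually fit a model built from these, PyMC2's usual pattern is roughly the following (a sketch; the sample counts are arbitrary and worth adjusting):

import pymc as pm

m = pm.MCMC([a, b])
m.sample(iter=10000, burn=5000)
print(a.stats()['mean'])  # posterior summary for the parameter a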
Yes, the confusion is easy, since the function decorated with @pm.stochastic returns a log-likelihood, which is essentially the opposite of an error. So you take the negative log of your custom error function and return THAT as your log-likelihood.
