How to re classify 20 newsgroups data set from 20 to 6 - python

I have downloaded the popular 20 newsgroups data set which has 20 classes, but I want to re-classify the whole documents into six classes since some classes are very related.
So for example, all computer related docs should have a new class say 1. As it is now, the docs are assigned from 1-20 reflecting the classes. The computer related classes are 2,3,4,5,and 6.
I want say, 1 to be the class of all the computer related(2,3,4,5,6). I tested it by using 20_newsgroups.target[0], and it gave me 7. Meaning the class of the doc at 0 is 7.
I re-assigned it to a new class using 20_newsgroups.target[0]='1' and when I try 20_newsgroups.target[0], it shows 1 which is OK.
But how can I do this for all the documents that currently have (2,3,4,5,6) as their class? I can easily extend it to other classes if I understand that one.I also try for d in 20_newsgroups:
if 20_newsgroups.target in [2,3,4,5,6], 20_newsgroups.target='1'.
But this is showing an error that "the truth value of an array with more than one element is unambiguous, use a.any() or a.all".

I'm not sure if I understand your question, but you seem to want to join categories into supercategories. This should not be hard to do, but it's less than optimal to do this at a late stage of the experiment. If you want to reduce the number of categories, do this by joining some of the categories as the very first step of your process. That way, similar samples from different (original) categories will not cause confusion in the training phase (provided, of course, that they now belong to the same new category), thereby producing a better overall result.

You could do something like this. The code is based on the retrieval of the 20newsgroup data set with scikit learn: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
topic_1 = [0,15,19]
topic_2 = [1,2,3,4,5]
topic_3 = [6]
topic_4 = [7,8,9,10]
topic_5 = [11,12,13,14]
topic_6 = [16,17,18]
topics = [topic_1, topic_2, topic_3, topic_4, topic_5, topic_6]
The topic distribution is based on the table provided by http://qwone.com/~jason/20Newsgroups/ (but can be adjusted). The following code reduces the amount of categories of the data set.
twenty_train_reduced = twenty_train.target.copy
for index, target in enumerate(twenty_train.target):
for topic_i, topic in enumerate(topics):
if(target in topic):
twenty_train_reduced[index] = topic_i

Related

Does name and order of Features matter for prediction algorithm

Do the names/order of the columns of my X_test dataframe have to be the same as the X_train I use for fitting?
Below is an example
I am training my model with:
model.fit(X_train,y)
where X_train=data['var1','var2']
But then during prediction, when I use:
model.predict(X_test)
X_test is defined as: X_test=data['var1','var3']
where var3 could be a completely different variable than var2.
Does predict assume that var3 is the same as var2 because it is the second column in X_test?
What if:
X_live was defined as: X_live=data['var2','var1']
Would predict know to re-order X to line them up correctly?
The names of your columns don't matter but the order does. You need to make sure that the order is consistent from your training and test data. If you pass in two columns in your training data, your model will assume that any future inputs are those features in that order.
Just a really simple thought experiment. Imagine you train a model that subtracts two numbers. The features are (n_1, n_2), and your output is going to be n_1 - n_2.
Your model doesn't process the names of your columns (since only numbers are passed in), and so it learns the relationship between the first column, the second column, and the output - namely output = col_1 - col_2.
Regardless of what you pass in, you'll get the result of the first thing you passed in minus the second thing you pass in. You can name the first thing you pass in and the second thing you pass in to whatever you want, but at the end of the day you'll still get the result of the subtraction.
To get a little more technical, what's going on inside your model is mostly a series of matrix multiplications. You pass in the input matrix, the multiplications happen, and you get what comes out. Training the model just "tunes" the values in the matrices that your inputs get multiplied by with the intention of maximizing how close the output of these multiplications is to your label. If you pass in an input matrix that isn't like the ones it was trained on, the multiplications still happen, but you'll almost certainly get a terribly wrong output. There's no intelligent feature rearranging going on underneath.
Firstly answer your question "Does predict assume that var3 is the same as var2 because it is the second column in X_test?"
No; any machine Learning model does not have any such assumption on
the data that you are passing into the fit function or the predict
function. What the model simply sees is an array of numbers, let it
be a multidimensional array of higher order. It is completely on the
user to concern about the features.
Let's take a simple classification problem, where you have 2 groups:
First one is a group of kids, with short height, and thereby lesser weight,
Second group is of mature adults, with higher age, height and weight.
Now you want to classify the below individual into any one of the classes.
Age
Height
Weight
10
120
34
Any well trained classifier can easily classify this data point to the group of kids, since the age and weight are small. The vector which the model will now consider is [ 10, 120, 34 ].
But now let us reorder the feature columns, in the following way - [ 120, 10, 34 ]. But you know that the number 120, you want to refer to the height if the individual and not age! But it is pretty sure that the model won't understand what you know or expect, and it is bound to classify the point to the group of adults.
Hope that answers both your questions.

Using similarities.cosine (with dataset) of SurPRISE package python

Briefing:
I'm working over Movielens 100k Dataset for recommendation of movies. So far I've done foll.
Sorting of values
df_sorted_values = df.sort_values(['UserID', 'MovieID'])
print type(df_sorted_values)
Printing Matrix with NaN values
df_matrix = df.pivot_table(values='Rating', index='UserID', columns='MovieID')
Performed 5 Fold CV on it
reader = Reader(line_format="user item rating", sep='\t', rating_scale=(1,5))
df = Dataset.load_from_file('ml-100k/u.data', reader=reader)
df.split(n_folds=5)
I've evaluated the dataset using SVD
perf = evaluate(SVD(),df,measures=['RMSE','MAE'])
print_perf(perf)
HERE I NEED THE USE SIMILARITY ALGORITHM provided by same package (Surprise) which is written as surprise.cosine to Predict the missing values. This shows that it needs (*args,**kwargs) arguments but I'm clueless as what is actually to be passed.
ONCE THE SIMILARITIES ARE GENERATED I NEED TO PRINT THE MATRIX WITH REPLACED NaN values WHICH ARE NOW PREDICTED, later will be used for recommendation
P.S. I'm open to different solutions from CRAB, RECSYS, PANDAS and GRAPHLAB provided they can be worked out on steps 1 to 4 as well
My past references have been:
This Manual, but doesn't show on how the arguments have passed
nor the example
This Which doesn't have much difference than
first
While computing the cosine similarity between 2 vectors is very easy (how about 1-np.dot(a,b)/(np.linalg.norm(a)*np.linalg.norm(b))
I would recommend you to Work with Scipy if you don't want to implement it yourself:
from scipy.spatial.distance import cosine
Those similarity functions are used like these docs: Using prediction algorithms , FAQ, The algorithm base class - compute_similarities for KNN-based algos. They are not supposed to be used like what you want to.
You may want to use the predict function if you choose to use SVD algorithm The algorithm base class - predict like:
# Build an algorithm, and train it.
algo = SVD()
algo.train(trainset)
uid = str(196) # raw user id
iid = str(302) # raw item id
# get a prediction for specific users and items.
pred = algo.predict(uid, iid)

Genetic Algorithm / AI; basically, am I on the right track?

I know Python isn't the best idea to be writing any kind of software of this nature. My reasoning is to use this type of algorithm for a Raspberry Pi 3 in it's decision making (still unsure how that will go), and the libraries and APIs that I'll be using (Adafruit motor HATs, Google services, OpenCV, various sensors, etc) all play nicely for importing in Python, not to mention I'm just more comfortable in this environment for the rPi specifically. Already I've cursed it as object oriented such as Java or C++ just makes more sense to me, but Id rather deal with its inefficiencies and focus on the bigger picture of integration for the rPi.
I won't explain the code here, as it's pretty well documented in the comment sections throughout the script. My questions is as stated above; can this be considered basically a genetic algorithm? If not, what must it have to be a basic AI or genetic code? Am I on the right track for this type of problem solving? I know usually there are weighted variables and functions to promote "survival of the fittest", but that can be popped in as needed, I think.
I've read up quite a bit of forums and articles about this topic. I didn't want to copy someone else's code that I barely understand and start using it as a base for a larger project of mine; I want to know exactly how it works so I'm not confused as to why something isn't working out along the way. So, I just tried to comprehend the basic idea of how it works, and write how I interpreted it. Please remember I'd like to stay in Python for this. I know rPi's have multiple environments for C++, Java, etc, but as stated before, most hardware components I'm using have only Python APIs for implementation. if I'm wrong, explain at the algorithmic level, not just with a block of code (again, I really want to understand the process). Also, please don't nitpick code conventions unless it's pertinent to my problem, everyone has a style and this is just a sketch up for now. Here it is, and thanks for reading!
# Created by X3r0, 7/3/2016
# Basic genetic algorithm utilizing a two dimensional array system.
# the 'DNA' is the larger array, and the 'gene' is a smaller array as an element
# of the DNA. There exists no weighted algorithms, or statistical tracking to
# make the program more efficient yet; it is straightforwardly random and solves
# its problem randomly. At this stage, only the base element is iterated over.
# Basic Idea:
# 1) User inputs constraints onto array
# 2) Gene population is created at random given user constraints
# 3) DNA is created with randomized genes ( will never randomize after )
# a) Target DNA is created with loop control variable as data (basically just for some target structure)
# 4) CheckDNA() starts with base gene from DNA, and will recurse until gene matches the target gene
# a) Randomly select two genes from DNA
# b) Create a candidate gene by splicing both parent genes together
# c) Check candidate gene against the target gene
# d) If there exists a match in gene elements, a child gene is created and inserted into DNA
# e) If the child gene in DNA is not equal to target gene, recurse until it is
import random
DNAsize = 32
geneSize = 5
geneDiversity = 9
geneSplit = 4
numRecursions = 0
DNA = []
targetDNA = []
def init():
global DNAsize, geneSize, geneDiversity, geneSplit, DNA
print("This is a very basic form of genetic software. Input variable constraints below. "
"Good starting points are: DNA strand size (array size): 32, gene size (sub array size: 5, gene diversity (randomized 0 - x): 5"
"gene split (where to split gene array for splicing): 2")
DNAsize = int(input('Enter DNA strand size: '))
geneSize = int(input('Enter gene size: '))
geneDiversity = int(input('Enter gene diversity: '))
geneSplit = int(input('Enter gene split: '))
# initializes the gene population, and kicks off
# checkDNA recursion
initPop()
checkDNA(DNA[0])
def initPop():
# builds an array of smaller arrays
# given DNAsize
for x in range(DNAsize):
buildDNA()
# builds the goal array with a recurring
# numerical pattern, in this case just the loop
# control variable
buildTargetDNA(x)
def buildDNA():
newGene = []
# builds a smaller array (gene) using a given geneSize
# and randomized with vaules 0 - [given geneDiversity]
for x in range(geneSize):
newGene.append(random.randint(0,geneDiversity))
# append the built array to the larger array
DNA.append(newGene)
def buildTargetDNA(x):
# builds the target array, iterating with x as a loop
# control from the call in init()
newGene = []
for y in range(geneSize):
newGene.append(x)
targetDNA.append(newGene)
def checkDNA(childGene):
global numRecursions
numRecursions = numRecursions+1
gene = DNA[0]
targetGene = targetDNA[0]
parentGeneA = DNA[random.randint(0,DNAsize-1)] # randomly selects an array (gene) from larger array (DNA)
parentGeneB = DNA[random.randint(0,DNAsize-1)]
pos = random.randint(geneSplit-1,geneSplit+1) # randomly selects a position to split gene for splicing
candidateGene = parentGeneA[:pos] + parentGeneB[pos:] # spliced gene given split from parentA and parentB
print("DNA Splice Position: " + str(pos))
print("Element A: " + str(parentGeneA))
print("Element B: " + str(parentGeneB))
print("Candidate Element: " + str(candidateGene))
print("Target DNA: " + str(targetDNA))
print("Old DNA: " + str(DNA))
# iterates over the candidate gene and compares each element to the target gene
# if the candidate gene element hits a target gene element, the resulting child
# gene is created
for x in range(geneSize):
#if candidateGene[x] != targetGene[x]:
#print("false ")
if candidateGene[x] == targetGene[x]:
#print("true ")
childGene.pop(x)
childGene.insert(x, candidateGene[x])
# if the child gene isn't quite equal to the target, and recursion hasn't reached
# a max (apparently 900), the child gene is inserted into the DNA. Recursion occurs
# until the child gene equals the target gene, or max recursuion depth is exceeded
if childGene != targetGene and numRecursions < 900:
DNA.pop(0)
DNA.insert(0, childGene)
print("New DNA: " + str(DNA))
print(numRecursions)
checkDNA(childGene)
init()
print("Final DNA: " + str(DNA))
print("Number of generations (recursions): " + str(numRecursions))
I'm working with evolutionary computation right now so I hope my answer will be helpful for you, personally, I work with java, mostly because is one of my main languages, and for the portability, because I tested in linux, windows and mac. In my case I work with permutation encoding, but if you are still learning how GA works, I strongly recommend binary encoding. This is what I called my InitialPopulation. I try to describe my program's workflow:
1-. Set my main variables
This are PopulationSize, IndividualSize, MutationRate, CrossoverRate. Also you need to create an objective function and decide the crossover method you use. For this example lets say that my PopulationSize is equals to 50, the IndividualSize is 4, MutationRate is 0.04%, CrossoverRate is 90% and the crossover method will be roulette wheel.
My objective function only what to check if my Individuals are capable to represent the number 15 in binary, so the best individual must be 1111.
2-. Initialize my Population
For this I create 50 individuals (50 is given by my PopulationSize) with random genes.
3-. Loop starts
For each Individuals in Population you need to:
Evaluate fitness according to the objective function. If an Individual is represented by the next characters: 00100 this mean that his fitness is 1. As you can see this is a simple fitness function. You can create your own while you are learning, like fitness = 1/numberOfOnes. Also you need to assign the sum of all the fitness to a variable called populationFitness, this will be useful in the next step.
Select the best individuals. For this task there's a lot of methods you can use, but we will use the roulette wheel method as we say before. In this method, You assign a value to every individual inside your population. This value is given by the next formula: (fitness/populationFitness) * 100. So, if your population fitness is 10, and a certain individual fitness is 3, this mean that this individual has a 30% chance to be selected to make a crossover with another individual. Also, if another individual have a 4 in his fitness, his value will be 40%.
Apply crossover. Once you have the "best" individuals of your population, you need to create a new population. This new population is formed by others individuals of the previous population. For each individual you create a random number from 0 to 1. If this numbers is in the range of 0.9 (since our crossoverRate = 90%), this individual can reproduce, so you select another individual. Each new individual has this 2 parents who inherit his genes. For example:
Lets say that parentA = 1001 and parentB = 0111. We need to create a new individual with this genes. There's a lot of methods to do this, uniform crossover, single point crossover, two point crossover, etc. We will use the single point crossover. In this method we choose a random point between the first gene and the last gene. Then, we create a new individual according to the first genes of parentA and the last genes of parentB. In a visual form:
parentA = 1001
parentB = 0111
crossoverPoint = 2
newIndividual = 1011
As you can see, the new individual share his parents genes.
Once you have a new population with new individuals, you apply the mutation. In this case, for each individual in the new population generate a random number between 0 and 1. If this number is in the range of 0.04 (since our mutationRate = 0.04), you apply the mutation in a random gene. In binary encoding the mutation is just change the 1's for 0's or viceversa. In a visual form:
individual = 1011
randomPoint = 3
mutatedIndividual = 1010
Get the best individual
If this individual has reached the solution stop. Else, repeat the loop
End
As you can see, my english is not very good, but I hope you understand the basic idea of a genetic algorithm. If you are truly interested in learning this, you can check the following links:
http://www.obitko.com/tutorials/genetic-algorithms/
This link explains in a clearer way the basics of a genetic algorithm
http://natureofcode.com/book/chapter-9-the-evolution-of-code/
This book also explain what a GA is, but also provide some code in Processing, basically java. But I think you can understand.
Also I would recommend the following books:
An Introduction to Genetic Algorithms - Melanie Mitchell
Evolutionary algorithms in theory and practice - Thomas Bäck
Introduction to genetic algorithms - S. N. Sivanandam
If you have no money, you can easily find all this books in PDF.
Also, you can always search for articles in scholar.google.com
Almost all are free to download.
Just to add a bit to Alberto's great answer, you need to watch out for two issues as your solution evolves.
The first one is Over-fitting. This basically means that your solution is complex enough to "learn" all samples, but it is not applicable outside the training set. To avoid this, your need to make sure that the "amount" of information in your training set is a lot larger than the amount of information that can fit in your solution.
The second problem is Plateaus. There are cases where you would arrive at certain mediocre solutions that are nonetheless, good enough to "outcompete" any emerging solution, so your progress stalls (one way to see this is, if you see your fitness "stuck" at a certain, less than optimal number). One method for dealing with this is Extinctions: You could track the rate of improvement of your optimal solution, and if the improvement has been 0 for the last N generations, you just Nuke your population. (That is, delete your population and the list of optimal individuals and start over). Randomness will make it so that eventually the Solutions will surpass the Plateau.
Another thing to keep in mind, is that the default Random class is really bad at Randomness. I have had solutions improve dramatically by simply using something like the Mesernne Twister Random generator or a Hardware Entropy Generator.
I hope this helps. Good luck.

Perceptron Learning Algorithm taking a lot of iterations to converge?

I am solving the homework-1 of Caltech Machine Learning Course (http://work.caltech.edu/homework/hw1.pdf) . To solve ques 7-10 we need to implement a PLA. This is my implementation in python:
import sys,math,random
w=[] # stores the weights
data=[] # stores the vector X(x1,x2,...)
output=[] # stores the output(y)
# returns 1 if dot product is more than 0
def sign_dot_product(x):
global w
dot=sum([w[i]*x[i] for i in xrange(len(w))])
if(dot>0):
return 1
else :
return -1
# checks if a point is misclassified
def is_misclassified(rand_p):
return (True if sign_dot_product(data[rand_p])!=output[rand_p] else False)
# loads data in the following format:
# x1 x2 ... y
# In the present case for d=2
# x1 x2 y
def load_data():
f=open("data.dat","r")
global w
for line in f:
data_tmp=([1]+[float(x) for x in line.split(" ")])
data.append(data_tmp[0:-1])
output.append(data_tmp[-1])
def train():
global w
w=[ random.uniform(-1,1) for i in xrange(len(data[0]))] # initializes w with random weights
iter=1
while True:
rand_p=random.randint(0,len(output)-1) # randomly picks a point
check=[0]*len(output) # check is a list. The ith location is 1 if the ith point is correctly classified
while not is_misclassified(rand_p):
check[rand_p]=1
rand_p=random.randint(0,len(output)-1)
if sum(check)==len(output):
print "All points successfully satisfied in ",iter-1," iterations"
print iter-1,w,data[rand_p]
return iter-1
sign=output[rand_p]
w=[w[i]+sign*data[rand_p][i] for i in xrange(len(w))] # changing weights
if iter>1000000:
print "greater than 1000"
print w
return 10000000
iter+=1
load_data()
def simulate():
#tot_iter=train()
tot_iter=sum([train() for x in xrange(100)])
print float(tot_iter)/100
simulate()
The problem according to the answer of question 7 it should take around 15 iterations for perceptron to converge when size of training set but the my implementation takes a average of 50000 iteration . The training data is to be randomly generated but I am generating data for simple lines such as x=4,y=2,..etc. Is this the reason why I am getting wrong answer or there is something else wrong. Sample of my training data(separable using y=2):
1 2.1 1
231 100 1
-232 1.9 -1
23 232 1
12 -23 -1
10000 1.9 -1
-1000 2.4 1
100 -100 -1
45 73 1
-34 1.5 -1
It is in the format x1 x2 output(y)
It is clear that you are doing a great job learning both Python and classification algorithms with your effort.
However, because of some of the stylistic inefficiencies with your code, it makes it difficult to help you and it creates a chance that part of the problem could be a miscommunication between you and the professor.
For example, does the professor wish for you to use the Perceptron in "online mode" or "offline mode"? In "online mode" you should move sequentially through the data point and you should not revisit any points. From the assignment's conjecture that it should require only 15 iterations to converge, I am curious if this implies the first 15 data points, in sequential order, would result in a classifier that linearly separates your data set.
By instead sampling randomly with replacement, you might be causing yourself to take much longer (although, depending on the distribution and size of the data sample, this is admittedly unlikely since you'd expect roughly that any 15 points would do about as well as the first 15).
The other issue is that after you detect a correctly classified point (cases when not is_misclassified evaluates to True) if you then witness a new random point that is misclassified, then your code will kick down into the larger section of the outer while loop, and then go back to the top where it will overwrite the check vector with all 0s.
This means that the only way your code will detect that it has correctly classified all the points is if the particular random sequence that it evaluates them (in the inner while loop) happens to be a string of all 1's except for the miraculous ability that on any particular 0, on that pass through the array, it classifies correctly.
I can't quite formalize why I think that will make the program take much longer, but it seems like your code is requiring a much stricter form of convergence, where it sort of has to learn everything all at once on one monolithic pass way late in the training stage after having been updated a bunch already.
One easy way to check if my intuition about this is crappy would be to move the line check=[0]*len(output) outside of the while loop all together and only initialize it one time.
Some general advice to make the code easier to manage:
Don't use global variables. Instead, let your function to load and prep the data return things.
There are a few places where you say, for example,
return (True if sign_dot_product(data[rand_p])!=output[rand_p] else False)
This kind of thing can be simplified to
return sign_dot_product(data[rand_p]) != output[rand_p]
which is easier to read and conveys what criteria you're trying to check for in a more direct manner.
I doubt efficiency plays an important role since this seems to be a pedagogical exercise, but there are a number of ways to refactor your use of list comprehensions that might be beneficial. And if possible, just use NumPy which has native array types. Witnessing how some of these operations have to be expressed with list operations is lamentable. Even if your professor doesn't want you to implement with NumPy because she or he is trying to teach you pure fundamentals, I say just ignore them and go learn NumPy. It will help you with jobs, internships, and practical skill with these kinds of manipulations in Python vastly more than fighting with the native data types to do something they were not designed for (array computing).

Running AB tests on Revenue in Python

I'm trying to run an AB test - comparing revenue amongst variants on websites.
Our standard approach (using t-tests) didn't seem like it would work because revenue can't be modelled binomially. However, I read about bootstrapping and came up with the following code:
import numpy as np
import scipy.stats as stats
import random
def resampler(original_array, number_of_samples):
sample_array = np.zeros(number_of_samples)
choice = random.choice
for i in range(number_of_samples):
sample_array[i] = sum([choice(original_array) for _ in range(len(original_array))])
y = stats.normaltest(sample_array)
if y[1] > 0.001:
print y
new_y = resampler(original_array, number_of_samples * 2)
y = new_y
return sample_array
Basically, randomly sample from the 'revenue vector' (a sparsely populated vector - a zero for all non-converting visitors) and sum the resulting vectors until you've got a normal distribution.
I can perform this for both test groups at which point I've got two normally distributed quantities for t-testing. Using scipy.stats.ttest_ind I was able to get results that looked someway reasonable.
However, I wondered what the effect of running this procedure on cookie split would be (expected each group to see 50% of the cookies). Here, I saw something fairly unexpected - given the following code:
x = [272898,389076,61091,65251,10060,1468815,216014,25863,42421,476379,73761]
y = [274253,387941,61333,65020,10056,1466908,214679,25682,42873,474692,73837]
print stats.ttest_ind(x,y)
I get the output: (0.0021911476165975929, 0.99827342714956546)
Not at all significant (I think I'm interpreting that correctly?)
However, when I run this code:
for i in range(1000, 100000, 5000):
one_array = resampler(x,i)
two_array = resampler(y,i)
t_value, p_value = stats.ttest_ind(one_array, two_array)
t_value_array.append(t_value)
p_value_array.append(p_value)
print np.mean(t_value_array)
print np.mean(p_value_array)
I get:
0.642213492773
0.490587258892
I'm not really sure how to interpret these numbers - as far as I'm aware, I've repeatedly generated normal distributions from the actual cookie splits (each number in the array represents a different site). In each of these cases, I've used a t-test on the two distributions and gotten a t-statistic and a p-value.
Is this a legitimate thing to do? I only ran these tests multiple times because I was seeing so much variation in the p-value and t-statistic when not doing this.
Am I missing an obvious way to run this kind of test?
Cheers,
Matt
p.s
The data we have:
Website 1 : test group 1: unique cookies: revenue
Website 1 : test group 2: unique cookies: revenue
Website 2 : test group 1: unique cookies: revenue
Website 2 : test group 2: unique cookies: revenue
e.t.c.
What we'd like:
Test group x is beating test group y with z% certainty
(null hypothesis of test group 1 = test group 2)
Bonus:
The same as above but at a per site, as well as overall, basis
Firstly, using a t-test to test binomial response variables isn't correct. You need to use a logistic regression model.
On to your question. It's very hard to read that code and understand what you think you're testing---what's your H_0 (null hypothesis)? If I'm being honest (and I hope you don't take offense) it looks pretty confused.
I'm going to have to guess what the data look like---you have a bunch of samples like this:
Website Method Revenue
------- ------ -------
w1 A 12
w2 B 0
w3 A 6
w4 B 0
etc etc. Does this look correct? Do you have repeated measures (i.e. do you have a revenue measurement for each website for each method? Or did you randomly assign websites to methods?)? I'm guessing that what you're passing to your method is an array of all revenues for one of the methods in turn, but do they pair up across methods in any way?
I can imagine testing various hypotheses with this data. For example, is method A more likely to generate non-zero revenue than method B (use logistic regression, response is binary)? Of the cases where a method generates revenue at all, does method A generate more than method B (t-test on non-zero revenues)? Does method A generate more revenue than method B across all instances (probably a sign test, due to problems with the assumption of normality when you include the zeros). I assume this troubling assumption is why you run the procedure of repeatedly subsampling until your data look normal, but you can't do this and test anything meaningful: just because some subset of your data is normally distributed doesn't mean you can look at only this part of it! In fact, I wouldn't be surprised to see that what this essentially does is excludes either most of the zero entries or most of the non-zero entries.
If you elaborate with what some of the actual data look like, and what questions you want to answer, I'm happy to make more specific suggestions.

Categories