Create a model to classify a sentence as logical or not - python

As the title mentions, how could I train a model to classify whether the following sentences are logical or illogical?
“He has two legs”–logical
“He has six legs”–illogical
Solutions I tried:
1: Train a CNN classifier
I have done this before; it works very well if you have enough data. The problem is that I do not have a huge data set with “logical”/“illogical” labels for this case.
2: Use a language model
Train a language model provided by gluonnlp on a data set like wiki, then use it to estimate the probability of each sentence. If the probability of a sentence is high, mark it as logical, and vice versa. The problem is that the results are not good.
The way I estimate the probability:
def __predict(self):
    lines = self.__text_edit_input.toPlainText().split("\n")
    result = ""
    for line in lines:
        result += str(self.__sentence_prob(line, 10)) + "\n"
    self.__text_edit_output.setPlainText(result)

def __prepare_sentence(self, text, max_len):
    result = mx.nd.zeros([max_len, 1], dtype='float32')
    max_len = min(len(text), max_len)
    i = max(max_len - len(text), 0)
    j = 0
    for index in range(i, max_len):
        result[index][0] = self.__vocab[text[j]]
        j = j + 1
    return result

def __sentence_prob(self, text, max_len):
    hiddens = self.__model.begin_state(1, func=mx.nd.zeros, ctx=self.__context)
    tokens = self.__tokenizer(text)
    data = self.__prepare_sentence(tokens, max_len)
    output, _ = self.__model(data, hiddens)
    prob = 0
    for i in range(max_len):
        total_prob = mx.nd.softmax(output[i][0])
        prob += total_prob[self.__vocab[i]].asscalar()
    return prob / max_len
Possible issues with the language model approach:
1. I am not splitting the sentences the correct way (I am using jieba to split the Chinese sentences).
2. The vocabulary size is too small/big (tested 10000, 15000 and 30000).
3. The loss is too high (perplexity around 190) after 50 epochs?
4. The sentence length should be larger/smaller (tried 10, 20, 35).
5. The data I use does not meet my requirements (not every sentence is logical).
6. A language model is not appropriate for this task?
Any suggestions? Thanks

Issue 6 (“A language model is not appropriate for this task?”) is the main problem. Language models are built to make sense of input text with respect to language usage (syntax, semantics, etc.), not to draw logical conclusions. So you may not get good results even with a large amount of data or very deep models.
The problem you're trying to solve is extremely difficult. Something you may want to look at is Symbolic AI. There's a lot of ongoing research in this area.
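To make that contrast concrete, here is a toy sketch (entirely made-up facts and function names, not a real system) of how a symbolic approach checks a statement against explicit world knowledge instead of judging how fluent it sounds:

knowledge = {
    ("human", "legs"): 2,
    ("dog", "legs"): 4,
    ("ant", "legs"): 6,
}  # toy knowledge base: (subject, attribute) -> expected value

def is_logical(subject, attribute, value):
    # A claim is accepted if it matches the knowledge base, or if nothing is known about it
    expected = knowledge.get((subject, attribute))
    return expected is None or expected == value

print(is_logical("human", "legs", 2))  # True  -> "He has two legs"
print(is_logical("human", "legs", 6))  # False -> "He has six legs"

Real symbolic systems replace the hand-written dictionary with large knowledge bases and inference rules, but the principle of checking statements against explicit facts is the same.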

Related

Solving memory issues when using Gensim LDA Multicore

For my project I am trying to use unsupervised learning to identify different topics from application descriptions, but I am running into a strange problem. Firstly, I have 3 different datasets: one with 15k documents, another with 50k documents and a last one with 2m documents. I am trying to test models with different numbers of topics (k), ranging from 5 to 100 with a step size of 5, in order to check which k results in the best model, assessed initially by the highest coherence score. For each k, I also build 3 different models, with chunksizes 10, 100 and 1000.
So now moving on to the problem I am having. Obviously my own machine is too slow and does not have enough cores for this kind of computation, hence I am using my university's server. The problem here is that my program seems to be consuming too much memory and I am unsure of the reason. I already made some adjustments so that the corpus is not loaded entirely into memory (or at least I think I did). The dataset with 50k entries, already at iteration k=50 (so halfway), seems to have consumed the allotted 100GB of memory, which seems enormous.
I would appreciate any help pointing me in the right direction, and thanks for taking the time to look at this. Below is the code from my topic_modelling.py file. Comments in the file are a bit outdated, sorry about that.
class MyCorpus:
    texts: list
    dictionary: dict

    def __init__(self, descriptions, dictionary):
        self.texts = descriptions
        self.dictionary = dictionary

    def __iter__(self):
        for line in self.texts:
            try:
                # assume there's one document per line, tokens separated by whitespace
                yield self.dictionary.doc2bow(line)
            except StopIteration:
                pass
# Function that, given a dataframe, creates a dictionary and corpus.
# These are used to create an LDA model. Here we automatically use the Description column
# from each dataframe
def create_dict_and_corpus(df):
    text_descriptions = remove_characters_and_create_list(df, 'Description')
    # print(text_descriptions)
    dictionary = gensim.corpora.Dictionary(text_descriptions)
    corpus = MyCorpus(text_descriptions, dictionary)
    return text_descriptions, dictionary, corpus
# Given a dataframe and a column name in the data frame, extract all words and return a list.
# Also remove all characters that are not alphanumeric or spaces
def remove_characters_and_create_list(df, column_name, split=True):
    df[column_name] = df[column_name].astype(str)
    texts = []
    for x in range(df[column_name].size):
        current_string = df[column_name][x]
        filtered_string = re.sub(r'[^A-Za-z0-9 ]+', '', current_string)
        if split:
            texts.append(filtered_string.split())
        else:
            texts.append(filtered_string)
    return texts
# This function, given the parameters, creates an LDA model for each number between
# the start limit and the end limit. After this the coherence and perplexity are calculated
# for each of those models and saved in a csv file to analyze later.
def test_lda_models(text, corpus, dictionary, start_limit, end_limit, path):
    results = []
    print("============Starting topic modelling============")
    for k in range(start_limit, end_limit+1, 5):
        for p in range(1, 4):
            chunk = pow(10, p)
            t0 = time.time()
            lda_model = gensim.models.ldamulticore.LdaMulticore(corpus,
                                                                num_topics=k,
                                                                id2word=dictionary,
                                                                passes=p,
                                                                chunksize=chunk)
            # To calculate the goodness of the model
            perplexity = lda_model.bound(corpus)
            coherence_model_lda = CoherenceModel(model=lda_model, texts=text, dictionary=dictionary, coherence='c_v')
            coherence_lda = coherence_model_lda.get_coherence()
            t1 = time.time()
            print(f"=====Done K={k} model with passes={p} and chunksize={chunk}, took {t1-t0} seconds=====")
            results.append((k, chunk, coherence_lda, perplexity))
    # Storing the results in a csv file, except the actual lda model (this would not make sense)
    path = make_dir_if_not_exists(path)
    list_tuples_to_csv(results, ['#OfTopics', 'ChunkSize', 'CoherenceScore', 'Perplexity'], f"{path}/K={start_limit}to{end_limit}.csv")
    return results
# Function to plot the visualization of an LDA model. This visualization is then
# saved as an html file inside the given path
def single_lda_model_visualization(k, c, corpus, dictionary, lda_model, path):
    vis = gensimvis.prepare(lda_model, corpus, dictionary)
    pyLDAvis.save_html(vis, f"{path}/visualization.html")

# Given the results produced by test_lda_models, loop through the models and save the
# topic words of each model and the visualization of the topics in the given path
def save_lda_result(k, c, lda_model, corpus, dictionary, path):
    list_tuples_to_csv(lda_model.print_topics(num_topics=k), ['Topic#', 'Associated Words'], f"{path}/associated_words.csv")
    single_lda_model_visualization(k, c, corpus, dictionary, lda_model, path)
# This is the entire pipeline that needs to be performed for a single dataset,
# which includes computing the LDA models from start to end limit and calculating
# and saving the topic words and visual graphs for the top n topics with the highest
# coherence score.
def perform_topic_modelling_single_df(df, start_limit, end_limit, path):
    # Extracting the necessary data required for LDA model computation
    text_descriptions, dictionary, corpus = create_dict_and_corpus(df)
    results_lda = test_lda_models(text_descriptions, corpus, dictionary, start_limit, end_limit, path)
    # Sorting the results based on the 2nd tuple value returned, which is 'coherence'
    results_lda.sort(key=lambda x: x[2], reverse=True)
    # Getting the top 5 results to pass to the save_lda_result function
    results = results_lda[:5]
    corpus_for_saving = [dictionary.doc2bow(text) for text in text_descriptions]
    texts = remove_characters_and_create_list(df, 'Description', split=False)
    # Perform application-to-topic modelling for the best lda model based on the
    # coherence score (TODO maybe test with other lda models?)
    print("getting descriptions for csv")
    for k, c, _, _ in results:
        dir_path = make_dir_if_not_exists(f"{path}/k={k}_chunk={c}")
        p = int(math.log10(c))
        lda_model = gensim.models.ldamulticore.LdaMulticore(corpus,
                                                            num_topics=k,
                                                            id2word=dictionary,
                                                            passes=p,
                                                            chunksize=c)
        print(f"=====REDOING K={k} model with passes={p} and chunksize={c}=====")
        save_lda_result(k, c, lda_model, corpus_for_saving, dictionary, dir_path)
        application_to_topic_modelling(df, k, c, lda_model, corpus_for_saving, texts, dir_path)
# Performs the whole topic modelling pipeline, taking different genre data sets
# and the entire dataset as a whole
def perform_topic_modelling_pipeline(path_ex):
    # entire_df = pd.read_csv("../data/preprocessed_data/preprocessed_10000_trial.csv")
    entire_df = pd.read_csv(os.path.join(ROOT_DIR, f"data/preprocessed_data/preprocessedData_{path_ex}.csv"))
    print("size of df")
    print(entire_df.shape)
    # For entire df go from start limit to nGenres to find the best LDA model
    nGenres = row_counter(os.path.join(ROOT_DIR, f"data/genre_wise_data/data{path_ex}/genre_frequency.csv"))
    nGenres_rounded = math.ceil(nGenres / 5) * 5
    print(f"Original number of genres should be {nGenres}, but we are rounding to {nGenres_rounded}")
    path = make_dir_if_not_exists(os.path.join(ROOT_DIR, f"results/data{path_ex}/aall_data"))
    perform_topic_modelling_single_df(entire_df, 5, 100, path)
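For reference, a corpus that genuinely streams from disk on every pass (a generic gensim pattern with an assumed file layout of one whitespace-tokenized description per line, not the poster's code) looks something like this:

class StreamingCorpus:
    # Re-reads the file on every iteration instead of keeping all documents in a list
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield self.dictionary.doc2bow(line.split())

# The dictionary can be built from the same stream, e.g.
# dictionary = gensim.corpora.Dictionary(line.split() for line in open('descriptions.txt'))

This pattern keeps only one document in memory at a time per pass.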

How to get the words of clusters

How can I get the words of each cluster?
I divided the documents into groups as follows:
LabeledSentence1 = gensim.models.doc2vec.TaggedDocument
all_content_train = []
j = 0
for em in train['KARMA'].values:
    all_content_train.append(LabeledSentence1(em, [j]))
    j += 1
print('Number of texts processed: ', j)

d2v_model = Doc2Vec(all_content_train, vector_size=100, window=10, min_count=500, workers=7, dm=1, alpha=0.025, min_alpha=0.001)
d2v_model.train(all_content_train, total_examples=d2v_model.corpus_count, epochs=10, start_alpha=0.002, end_alpha=-0.016)

kmeans_model = KMeans(n_clusters=10, init='k-means++', max_iter=100)
X = kmeans_model.fit(d2v_model.docvecs.doctag_syn0)
labels = kmeans_model.labels_.tolist()
l = kmeans_model.fit_predict(d2v_model.docvecs.doctag_syn0)
pca = PCA(n_components=2).fit(d2v_model.docvecs.doctag_syn0)
datapoint = pca.transform(d2v_model.docvecs.doctag_syn0)
I can get the text and its cluster, but how can I learn which words mainly created those groups?
It's not an inherent feature of Doc2Vec to list words most-related to any document or doc-vector. (Other algorithms, such as LDA, will offer that.)
So, you could potentially write your own code, once you've split your documents into clusters, to report the words that are "most over-represented" in each cluster.
For example, calculate every word's frequency in the entire corpus, then each word's frequency in each cluster. For each cluster, report the N words whose in-cluster-frequency is the largest multiple of the full-corpus-frequency. Would this give helpful results on your data, for your needs? You'd have to try it.
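A rough sketch of that calculation (hypothetical variable names: docs is the list of tokenized documents, labels holds the cluster id assigned to each document):

from collections import Counter

def top_words_per_cluster(docs, labels, n=10, min_count=5):
    corpus_counts = Counter(w for doc in docs for w in doc)
    corpus_total = sum(corpus_counts.values())
    clusters = {}
    for doc, label in zip(docs, labels):
        clusters.setdefault(label, Counter()).update(doc)
    report = {}
    for label, counts in clusters.items():
        cluster_total = sum(counts.values())
        scored = []
        for word, count in counts.items():
            if corpus_counts[word] < min_count:
                continue  # very rare words give noisy ratios
            ratio = (count / cluster_total) / (corpus_counts[word] / corpus_total)
            scored.append((ratio, word))
        report[label] = [w for _, w in sorted(scored, reverse=True)[:n]]
    return report

Whether the min_count cutoff and the plain frequency ratio give useful words for your data is something you'd have to evaluate by eye.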
Separately, regarding your use of Doc2Vec:
there's no good reason to alias the existing class TaggedDocument to a strange class name like LabeledSentence1. Just use TaggedDocument directly.
if you supply your corpus, all_content_train, to the object-initialization – as your code does – then you don't need to also call train(). Training will have already happened automatically. If you do want more than the default amount of training (epochs=5), just supply a larger epochs value to the initialization.
the learning-rate values you've supplied to train() – start_alpha=0.002, end_alpha=-0.016 – are nonsensical & destructive. Few users should need to tinker with these alpha values at all, but especially, they should never increase from the beginning to end of a training cycle, as these values do.
If you were running with logging enabled at the INFO level, and/or watching the output closely, you would likely see readouts and warnings indicating that excessive training was happening, or problematic values used.
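For reference, INFO-level logging in gensim is usually enabled with the standard library logging module:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)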

Implementing the TD-Gammon algorithm

I am attempting to implement the algorithm from the TD-Gammon article by Gerald Tesauro. The core of the learning algorithm is described in the following paragraph:
I have decided to have a single hidden layer (if that was enough to play world-class backgammon in the early 1990s, then it's enough for me). I am pretty certain that everything except the train() function is correct (the other functions are easier to test), but I have no idea whether I have implemented this final algorithm correctly.
import numpy as np

class TD_network:
    """
    Neural network with a single hidden layer and a temporal-difference training algorithm
    taken from G. Tesauro's 1995 TD-Gammon article.
    """
    def __init__(self, num_input, num_hidden, num_output, hnorm, dhnorm, onorm, donorm):
        self.w21 = 2*np.random.rand(num_hidden, num_input) - 1
        self.w32 = 2*np.random.rand(num_output, num_hidden) - 1
        self.b2 = 2*np.random.rand(num_hidden) - 1
        self.b3 = 2*np.random.rand(num_output) - 1
        self.hnorm = hnorm
        self.dhnorm = dhnorm
        self.onorm = onorm
        self.donorm = donorm

    def value(self, input):
        """Evaluates the NN output"""
        assert(input.shape == self.w21[1,:].shape)
        h = self.w21.dot(input) + self.b2
        hn = self.hnorm(h)
        o = self.w32.dot(hn) + self.b3
        return(self.onorm(o))

    def gradient(self, input):
        """
        Calculates the gradient of the NN at the given input. Outputs a list of dictionaries
        where each dict corresponds to the gradient of an output node, and each element in
        a given dict gives the gradient for a subset of the weights.
        """
        assert(input.shape == self.w21[1,:].shape)
        J = []
        h = self.w21.dot(input) + self.b2
        hn = self.hnorm(h)
        o = self.w32.dot(hn) + self.b3
        for i in range(len(self.b3)):
            db3 = np.zeros(self.b3.shape)
            db3[i] = self.donorm(o[i])
            dw32 = np.zeros(self.w32.shape)
            dw32[i, :] = self.donorm(o[i])*hn
            db2 = np.multiply(self.dhnorm(h), self.w32[i,:])*self.donorm(o[i])
            dw21 = np.transpose(np.outer(input, db2))
            J.append(dict(db3 = db3, dw32 = dw32, db2 = db2, dw21 = dw21))
        return(J)

    def train(self, input_states, end_result, a = 0.1, l = 0.7):
        """
        Trains the network using a single series of input states representing a game from beginning
        to end, and a final (supervised / desired) output for the end state
        """
        outputs = [self.value(input_state) for input_state in input_states]
        outputs.append(end_result)
        for t in range(len(input_states)):
            delta = dict(
                db3 = np.zeros(self.b3.shape),
                dw32 = np.zeros(self.w32.shape),
                db2 = np.zeros(self.b2.shape),
                dw21 = np.zeros(self.w21.shape))
            grad = self.gradient(input_states[t])
            for i in range(len(self.b3)):
                for key in delta.keys():
                    td_sum = sum([l**(t-k)*grad[i][key] for k in range(t + 1)])
                    delta[key] += a*(outputs[t + 1][i] - outputs[t][i])*td_sum
            self.w21 += delta["dw21"]
            self.w32 += delta["dw32"]
            self.b2 += delta["db2"]
            self.b3 += delta["db3"]
The way I use this is I play through a whole game (or rather, the neural net plays against itself), and then I send the states of that game, from start to finish, into train(), along with the final result. It then takes this game log, and applies the above formula to alter weights using the first game state, then the first and second game states, and so on until the final time, when it uses the entire list of game states. Then I repeat that many times and hope that the network learns.
To be clear, I am not after feedback on my code writing. This was never meant to be more than a quick and dirty implementation to see that I have all the nuts and bolts in the right spots.
However, I have no idea whether it is correct, as I have thus far been unable to make it capable of playing tic-tac-toe at any reasonable level. There could be many reasons for that. Maybe I'm not giving it enough hidden nodes (I have used 10 to 12). Maybe it needs more games to train (I have used 200 000). Maybe it would do better with different normalisation functions (I've tried sigmoid and ReLU, leaky and non-leaky, in different variations). Maybe the learning parameters are not tuned right. Maybe tic-tac-toe and its deterministic gameplay means it "locks in" on certain paths in the game tree. Or maybe the training implementation is just wrong. Which is why I'm here.
Have I misunderstood Tesauro's algorithm?
I can't say that I entirely understand your implementation, but this line jumps out to me:
td_sum = sum([l**(t-k)*grad[i][key] for k in range(t + 1)])
Comparing with the formula you reference:
I see at least two differences:
Your implementation sums over t+1 elements compared to t elements in the formula
The gradient should be indexed with the same k as used in l**(t-k), but in your implementation it is indexed with i and key, without any reference to k
Perhaps if you fix these discrepancies your solution will behave more as expected.
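A small, self-contained illustration of the second point (made-up shapes and values, not the original network): if the gradient of every earlier state k is cached, the term inside the sum changes with k instead of being constant:

import numpy as np

l = 0.7  # lambda, as in the original code
# pretend these were cached as grads[k] = self.gradient(input_states[k]) for k = 0..t;
# here each entry is a one-output-node gradient dict with made-up values
grads = [[dict(db3=np.full(2, 0.1 * (k + 1)))] for k in range(3)]
t, i, key = 2, 0, "db3"

td_sum = sum(l ** (t - k) * grads[k][i][key] for k in range(t + 1))
print(td_sum)  # each term uses the gradient from time step k, weighted by l**(t-k)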

Doc2Vec & classification - very poor results

I have a dataset of 6000 observations; a sample of it is the following:
job_id job_title job_sector
30018141 Secondary Teaching Assistant Education
30006499 Legal Sales Assistant / Executive Sales
28661197 Private Client Practitioner Legal
28585608 Senior hydropower mechanical project manager Engineering
28583146 Warehouse Stock Checker - Temp / Immediate Start Transport & Logistics
28542478 Security Architect Contract IT & Telecoms
The goal is to predict the job sector of each row based on the job title.
Firstly, I apply some preprocessing on the job_title column:
def preprocess(document):
    lemmatizer = WordNetLemmatizer()
    stemmer_1 = PorterStemmer()
    stemmer_2 = LancasterStemmer()
    stemmer_3 = SnowballStemmer(language='english')

    # Remove all the special characters
    document = re.sub(r'\W', ' ', document)
    # remove all single characters
    document = re.sub(r'\b[a-zA-Z]\b', ' ', document)
    # Substituting multiple spaces with single space
    document = re.sub(r' +', ' ', document, flags=re.I)
    # Converting to lowercase
    document = document.lower()
    # Tokenisation
    document = document.split()
    # Stemming
    document = [stemmer_3.stem(word) for word in document]
    document = ' '.join(document)
    return document

df_first = pd.read_csv('../data.csv', keep_default_na=True)

for index, row in df_first.iterrows():
    df_first.loc[index, 'job_title'] = preprocess(row['job_title'])
Then I do the following with Gensim and Doc2Vec:
X = df_first.loc[:, 'job_title'].values
y = df_first.loc[:, 'job_sector'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
tagged_train = TaggedDocument(words=X_train.tolist(), tags=y_train.tolist())
tagged_train = list(tagged_train)
tagged_test = TaggedDocument(words=X_test.tolist(), tags=y_test.tolist())
tagged_test = list(tagged_test)
model = Doc2Vec(vector_size=5, min_count=2, epochs=30)
training_set = [TaggedDocument(sentence, tag) for sentence, tag in zip(X_train.tolist(), y_train.tolist())]
model.build_vocab(training_set)
model.train(training_set, total_examples=model.corpus_count, epochs=model.epochs)
test_set = [TaggedDocument(sentence, tag) for sentence, tag in zip(X_test.tolist(), y_test.tolist())]
predictors_train = []
for sentence in X_train.tolist():
    sentence = sentence.split()
    predictor = model.infer_vector(doc_words=sentence, steps=20, alpha=0.01)
    predictors_train.append(predictor.tolist())

predictors_test = []
for sentence in X_test.tolist():
    sentence = sentence.split()
    predictor = model.infer_vector(doc_words=sentence, steps=20, alpha=0.025)
    predictors_test.append(predictor.tolist())
sv_classifier = SVC(kernel='linear', class_weight='balanced', decision_function_shape='ovr', random_state=0)
sv_classifier.fit(predictors_train, y_train)
score = sv_classifier.score(predictors_test, y_test)
print('accuracy: {}%'.format(round(score*100, 1)))
However, the result which I am getting is 22% accuracy.
This makes me very suspicious, especially because using the TfidfVectorizer instead of Doc2Vec (both with the same classifier) gives me 88% accuracy (!).
Therefore, I guess that I must be doing something wrong in how I apply the Doc2Vec of Gensim.
What is it and how can I fix it?
Or is it simply that my dataset is relatively small, while more advanced methods such as word embeddings etc. require far more data?
You don't mention the size of your dataset - in rows, total words, unique words, or unique classes. Doc2Vec works best with lots of data. Most published work trains on tens-of-thousands to millions of documents, of dozens to thousands of words each. (Your data appears to only have 3-5 words per document.)
Also, published work tends to train on data where every document has a unique-ID. It can sometimes make sense to use known-labels as tags instead of, or in addition to, unique-IDs. But it isn't necessarily a better approach. By using known-labels as the only tags, you're effectively only training one doc-vector per label. (It's essentially similar to concatenating all rows with the same tag into one document.)
You're inexplicably using fewer steps in inference than epochs in training – when in fact these are analogous values. In recent versions of gensim, inference will by default use the same number of inference epochs as the model was configured to use for training. And, it's more common to use more epochs during inference than training. (Also, you're inexplicably using different starting alpha values for inference between classifier-training and classifier-testing.)
But the main problem is likely your choice of a tiny size=5 for the doc vectors. Instead of the TfidfVectorizer, which summarizes each row as a vector of width equal to the unique-word count – perhaps hundreds or thousands of dimensions – your Doc2Vec model summarizes each document as just 5 values. You've essentially lobotomized Doc2Vec. Usual values here are 100-1000 – though if the dataset is tiny, smaller sizes may be required.
Finally, the lemmatization/stemming may not be strictly necessary and may even be destructive. Lots of Word2Vec/Doc2Vec work doesn't bother to lemmatize/stem - often because there's plentiful data, with many appearances of all word forms.
These steps are most likely to help with smaller data, by making sure rarer word forms are combined with related longer forms to still get value from words that would otherwise be too rare to be retained (or get useful vectors).
But I can see many ways they might hurt for your domain. Manager and Management won't have exactly the same implications in this context, but could both be stemmed to manag. Similar for Security and Securities both becoming secur, and other words. I'd only perform these steps if you can prove through evaluation that they're helping. (Are the words passed to the TfidfVectorizer being lemmatized/stemmed?)
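To make those points concrete, here is a minimal sketch of a more conventional setup (reusing the question's X_train/X_test variables; the parameter values are illustrative, and the epochs keyword of infer_vector assumes a recent gensim version):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one unique ID tag per document, rather than one tag per class label
train_docs = [TaggedDocument(words=title.split(), tags=[str(i)])
              for i, title in enumerate(X_train.tolist())]

model = Doc2Vec(vector_size=100, min_count=2, epochs=40)  # larger vectors, more epochs
model.build_vocab(train_docs)
model.train(train_docs, total_examples=model.corpus_count, epochs=model.epochs)

# at inference time, use at least as many epochs as were used for training
predictors_train = [model.infer_vector(title.split(), epochs=model.epochs).tolist()
                    for title in X_train.tolist()]
predictors_test = [model.infer_vector(title.split(), epochs=model.epochs).tolist()
                   for title in X_test.tolist()]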
Usually, training doc2vec/word2vec requires lots of generalised data (word2vec was trained on 3 million Wikipedia articles). Since it's performing poorly with doc2vec, consider experimenting with a pre-trained doc2vec model.
Or you can try using word2vec and averaging the word vectors over the entire document, since word2vec gives a vector for each word.
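A rough sketch of that averaging idea (assumed variable names, with X_train holding the preprocessed title strings from the question and illustrative hyperparameters):

import numpy as np
from gensim.models import Word2Vec

tokenized = [title.split() for title in X_train.tolist()]
w2v = Word2Vec(tokenized, vector_size=100, min_count=2, epochs=30)

def doc_vector(tokens, model):
    # average the vectors of the words the model knows; zero vector if none are known
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_train_vecs = [doc_vector(t, w2v) for t in tokenized]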
Let me know how this helps.
The tools you are using are not suitable for classification. I'd suggest you look into something like a char-rnn.
https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
This tutorial works on a similar problem, where it classifies names.

Why does my association model find subgroups in a dataset when there shouldn't be any?

I give a lot of information on the methods that I used to write my code. If you just want to read my question, skip to the quotes at the end.
I'm working on a project that has a goal of detecting sub populations in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining as I'm currently taking a class on the subject.
There are 42 variables in total. Of those, 20 are continuous and had to be discretized. For each variable, I used the Freedman-Diaconis rule to determine how many categories to divide a group into.
def Freedman_Diaconis(column_values):
    # sort the list first
    column_values[1].sort()
    first_quartile = int(len(column_values[1]) * .25)
    third_quartile = int(len(column_values[1]) * .75)
    fq_value = column_values[1][first_quartile]
    tq_value = column_values[1][third_quartile]
    iqr = tq_value - fq_value
    n_to_pow = len(column_values[1])**(-1/3)
    h = 2 * iqr * n_to_pow
    retval = (column_values[1][-1] - column_values[1][1])/h
    test = int(retval+1)
    return test
From there I used min-max normalization
def min_max_transform(column_of_data, num_bins):
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    data_min_max_ints = take_int(data_min_max)
    return data_min_max_ints
to transform my data, and then I simply took the integer portion to get the final categorization.
def take_int(list_of_float):
    ints = []
    for flt in list_of_float:
        asint = int(flt)
        ints.append(asint)
    return ints
I then also wrote a function that I used to combine this value with the variable name.
def string_transform(prefix, column, index):
    transformed_list = []
    transformed = ""
    if index < 4:
        for entry in column[1]:
            transformed = prefix + str(entry)
            transformed_list.append(transformed)
    else:
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed = str(prefix_num[1]) + 'x' + str(entry)
            transformed_list.append(transformed)
    return transformed_list
This was done to differentiate variables that have the same value, but appear in different columns. For example, having a value of 1 for variable x14 means something different from getting a value of 1 in variable x20. The string transform function would create 14x1 and 20x1 for the previously mentioned examples.
After this, I wrote everything to a file in basket format
def create_basket(list_of_lists, headers):
    # for filename in os.listdir("."):
    #     if filename.e
    if not os.path.exists('baskets'):
        os.makedirs('baskets')
    down_length = len(list_of_lists[0])
    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)
        for i in range(0, down_length):
            basket_writer.writerow({"trt": list_of_lists[0][i], "y": list_of_lists[1][i], "x1": list_of_lists[2][i],
                                    "x2": list_of_lists[3][i], "x3": list_of_lists[4][i], "x4": list_of_lists[5][i],
                                    "x5": list_of_lists[6][i], "x6": list_of_lists[7][i], "x7": list_of_lists[8][i],
                                    "x8": list_of_lists[9][i], "x9": list_of_lists[10][i], "x10": list_of_lists[11][i],
                                    "x11": list_of_lists[12][i], "x12": list_of_lists[13][i], "x13": list_of_lists[14][i],
                                    "x14": list_of_lists[15][i], "x15": list_of_lists[16][i], "x16": list_of_lists[17][i],
                                    "x17": list_of_lists[18][i], "x18": list_of_lists[19][i], "x19": list_of_lists[20][i],
                                    "x20": list_of_lists[21][i], "x21": list_of_lists[22][i], "x22": list_of_lists[23][i],
                                    "x23": list_of_lists[24][i], "x24": list_of_lists[25][i], "x25": list_of_lists[26][i],
                                    "x26": list_of_lists[27][i], "x27": list_of_lists[28][i], "x28": list_of_lists[29][i],
                                    "x29": list_of_lists[30][i], "x30": list_of_lists[31][i], "x31": list_of_lists[32][i],
                                    "x32": list_of_lists[33][i], "x33": list_of_lists[34][i], "x34": list_of_lists[35][i],
                                    "x35": list_of_lists[36][i], "x36": list_of_lists[37][i], "x37": list_of_lists[38][i],
                                    "x38": list_of_lists[39][i], "x39": list_of_lists[40][i], "x40": list_of_lists[41][i]})
and I used the apriori package in Orange to see if there were any association rules.
rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)
print "%4s %4s %s" % ("Supp", "Conf", "Rule")
for r in rules:
    my_rule = str(r)
    split_rule = my_rule.split("->")
    if 'trt' in split_rule[1]:
        print 'treatment rule'
        print "%4.1f %4.1f %s" % (r.support, r.confidence, r)
Using this technique, I found quite a few association rules in my testing data.
THIS IS WHERE I HAVE A PROBLEM
When I read the notes for the training data, there is this note
...That is, the only
reason for the differences among observed responses to the same treatment across patients is
random noise. Hence, there is NO meaningful subgroup for this dataset...
My question is,
why do I get multiple association rules that would imply that there are subgroups, when according to the notes I shouldn't see anything?
I'm getting lift numbers that are above 2 as opposed to the 1 that you should expect if everything was random like the notes state.
Supp Conf Rule
0.3 0.7 6x0 -> trt1
Even though my code runs, I'm not getting results anywhere close to what should be expected. This leads me to believe that I messed something up, but I'm not sure what it is.
After some research, I realized that my sample size is too small for the number of variables that I have. I would need a way larger sample size in order to really use the method that I was using. In fact, the method that I tried to use was developed with the assumption that it would be run on databases with hundreds of thousands or millions of rows.
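One way to put that in perspective is a quick simulation (a sketch with assumed dimensions, not the actual dataset): generate purely random categorical data of roughly the same shape and see how large the lift of variable-value pairs can get by chance alone.

import numpy as np

rng = np.random.default_rng(0)
n_rows, n_vars, n_cats = 100, 42, 3                 # assumed sample size and discretization
data = rng.integers(0, n_cats, size=(n_rows, n_vars))

lifts = []
for a in range(n_vars):
    for b in range(a + 1, n_vars):
        for va in range(n_cats):
            for vb in range(n_cats):
                supp_a = np.mean(data[:, a] == va)
                supp_b = np.mean(data[:, b] == vb)
                supp_ab = np.mean((data[:, a] == va) & (data[:, b] == vb))
                if supp_a > 0 and supp_b > 0:
                    lifts.append(supp_ab / (supp_a * supp_b))

lifts = np.array(lifts)
print("max lift on random data:", lifts.max())
print("pairs with lift >= 2:", int((lifts >= 2).sum()))

Comparing these numbers (and how they change as n_rows grows) with the rules mined from the real data gives a feel for how much lift can come from noise alone at this sample size.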
