On the Python DecisionTree module homepage (DecisionTree-1.6.1), they give a piece of example code. Here it is:
dt = DecisionTree( training_datafile = "training.dat", debug1 = 1 )
dt.get_training_data()
dt.show_training_data()
root_node = dt.construct_decision_tree_classifier()
root_node.display_decision_tree(" ")
test_sample = ['exercising=>never', 'smoking=>heavy',
               'fatIntake=>heavy', 'videoAddiction=>heavy']
classification = dt.classify(root_node, test_sample)
print "Classification: ", classification
My question is: How can I specify sample data (test_sample here) from variables? On the project homepage, it says: "You classify new data by first constructing a new data vector:" I have searched around but have been unable to find out what a data vector is or the answer to my question.
Any help would be appreciated!
The example says it all: the data vector is a list of strings, each a feature and a value separated by '=>'. In the example above, one feature is 'exercising' and its value is 'never'.
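If the goal is to build that list from values held in variables, here is a minimal sketch (the variable names below are just illustrative; only the 'feature=>value' string format and dt.classify come from the example above):

# Feature values held in ordinary Python variables (hypothetical names).
exercising = 'never'
smoking = 'heavy'
fat_intake = 'heavy'
video_addiction = 'heavy'

# The "data vector" is just a list of 'feature=>value' strings.
test_sample = ['exercising=>' + exercising,
               'smoking=>' + smoking,
               'fatIntake=>' + fat_intake,
               'videoAddiction=>' + video_addiction]

classification = dt.classify(root_node, test_sample)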
Hi, I've run into another roadblock in the TensorFlow crash course, at the representation programming exercises on this page.
https://developers.google.com/…/repres…/programming-exercise
I'm at Task 2: Make Better Use of Latitude
I seem to have narrowed the issue down to the point where I convert the raw latitude data into "buckets" or ranges, which are represented as 1 or 0 in my features. The actual code and the issue I have are in the pastebin. Any advice would be great, thanks!
https://pastebin.com/xvV2A9Ac
This converts the raw latitude data in my pandas DataFrame into "buckets", or ranges as Google calls them.
LATITUDE_RANGES = zip(xrange(32, 44), xrange(33, 45))
I changed the above code, replacing xrange with plain range, since xrange no longer exists in Python 3.
Could this be the problem, using range instead of xrange? See below for my conundrum.
def select_and_transform_features(source_df):
    selected_examples = pd.DataFrame()
    selected_examples["median_income"] = source_df["median_income"]
    for r in LATITUDE_RANGES:
        selected_examples["latitude_%d_to_%d" % r] = source_df["latitude"].apply(
            lambda l: 1.0 if l >= r[0] and l < r[1] else 0.0)
    return selected_examples
The next two lines run the above function and convert my existing training and validation data sets into ranges or buckets for latitude:
selected_training_examples = select_and_transform_features(training_examples)
selected_validation_examples = select_and_transform_features(validation_examples)
This is the training model:
_ = train_model(
    learning_rate=0.01,
    steps=500,
    batch_size=5,
    training_examples=selected_training_examples,
    training_targets=training_targets,
    validation_examples=selected_validation_examples,
    validation_targets=validation_targets)
THE PROBLEM:
OK, so here is how I understand the problem. When I run the training model, it throws this error:
ValueError: Feature latitude_32_to_33 is not in features dictionary.
So I inspected selected_training_examples and selected_validation_examples, and here's what I found. If I run
selected_training_examples = select_and_transform_features(training_examples)
then I get the proper data set when I inspect selected_training_examples, which yields all the feature "buckets", including the feature latitude_32_to_33.
But when I then run the second call,
selected_validation_examples = select_and_transform_features(validation_examples)
it yields no buckets or ranges, resulting in the
`ValueError: Feature latitude_32_to_33 is not in features dictionary.`
So I next tried disabling the first call,
selected_training_examples = select_and_transform_features(training_examples)
and just running the second call
selected_validation_examples = select_and_transform_features(validation_examples)
If I do this, I then get the desired dataset for selected_validation_examples.
The problem now is that running the first call no longer gives me the "buckets", and I'm back to where I began. I guess my question is: how are the two calls affecting each other, and preventing the other from giving me the datasets I need, when I run them together?
Thanks in advance!
A Python developer gave me the solution, so I just wanted to share. In Python 3, LATITUDE_RANGES = zip(range(32, 44), range(33, 45)) can only be consumed once the way it was written, so I placed it inside the select_and_transform_features(source_df) function, which solved the issue. Thanks again, everyone.
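For reference, here is a minimal sketch of that fix (assuming Python 3, where zip() returns a one-shot iterator, so the ranges have to be recreated on every call):

import pandas as pd

def select_and_transform_features(source_df):
    # Recreate the ranges on every call; a module-level zip() object would be
    # exhausted after the first pass over it in Python 3.
    latitude_ranges = zip(range(32, 44), range(33, 45))
    selected_examples = pd.DataFrame()
    selected_examples["median_income"] = source_df["median_income"]
    for r in latitude_ranges:
        selected_examples["latitude_%d_to_%d" % r] = source_df["latitude"].apply(
            lambda l: 1.0 if l >= r[0] and l < r[1] else 0.0)
    return selected_examples

An equivalent fix is to materialize the ranges once with LATITUDE_RANGES = list(zip(range(32, 44), range(33, 45))), since a list can be iterated any number of times.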
I'm playing around a bit with TensorFlow, but I'm a bit confused about the input pipeline. The data I'm working on is in a large CSV file with 307 columns, of which the first is a string representing a date and the rest are floats.
I'm running into some problems with preprocessing my data. I want to add a couple of features that replace, but are based on, the date string (specifically, a sine and a cosine representing the time of day). I also want to group the next 120 values in the CSV row together as one feature, the 96 values after that as another feature, and base my label on the remaining values in the CSV.
This is my code for generating the datasets for now:
import tensorflow as tf

defaults = []
defaults.append([""])
for i in range(0, 306):
    defaults.append([1.0])

def dataset(train_fraction=0.8):
    path = "training_examples_shuffled.csv"

    # Define how the lines of the file should be parsed
    def decode_line(line):
        items = tf.decode_csv(line, record_defaults=defaults)
        datetimeString = items[0]
        minuteFeatures = items[1:121]
        halfHourFeatures = items[121:217]
        labelFeatures = items[217:]

        ## Do something to convert datetimeString to timeSine and timeCosine

        features_dict = {
            'timeSine': timeSine,
            'timeCosine': timeCosine,
            'minuteFeatures': minuteFeatures,
            'halfHourFeatures': halfHourFeatures
        }

        # placeholder. I seem to need some python logic here, but I'm
        # not sure how to apply that to data in tensor format.
        label = [1]
        return features_dict, label

    def in_training_set(line):
        """Returns a boolean tensor, true if the line is in the training set."""
        num_buckets = 1000000
        bucket_id = tf.string_to_hash_bucket_fast(line, num_buckets)
        # Use the hash bucket id as a random number that's deterministic per example
        return bucket_id < int(train_fraction * num_buckets)

    def in_test_set(line):
        """Returns a boolean tensor, true if the line is in the test set."""
        return ~in_training_set(line)

    base_dataset = (tf.data
                    # Get the lines from the file.
                    .TextLineDataset(path))

    train = (base_dataset
             # Take only the training-set lines.
             .filter(in_training_set)
             # Decode each line into a (features_dict, label) pair.
             .map(decode_line))

    # Do the same for the test-set.
    test = (base_dataset.filter(in_test_set).map(decode_line))

    return train, test
My question now is: how can I access the string in the datetimeString Tensor to convert it to a datetime object? Or is this the wrong place to be doing this? I'd like to use the time and the day of the week as input features.
And secondly: pretty much the same question for the label based on the remaining values of the CSV. Can I just use standard Python code for this in some way, or should I be using basic TensorFlow ops to achieve what I want, if that's possible?
Finally, any comments on whether this is a decent way of handling my inputs? TensorFlow is a bit confusing, with old tutorials spread around the internet using deprecated ways of handling inputs.
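Not a full answer, but one way to handle the datetime question in this TF 1.x-style pipeline is to drop back into ordinary Python with tf.py_func inside decode_line. A rough sketch, assuming a '%Y-%m-%d %H:%M' date format (adjust to the real one) and using tf.reduce_mean purely as a placeholder for whatever the label logic should be:

import datetime
import numpy as np
import tensorflow as tf

def _time_features(date_bytes):
    # Plain Python runs here, so datetime parsing is straightforward;
    # dt.weekday() would also give the day of the week if needed.
    dt = datetime.datetime.strptime(date_bytes.decode('utf-8'), '%Y-%m-%d %H:%M')
    minute_of_day = dt.hour * 60 + dt.minute
    angle = 2.0 * np.pi * minute_of_day / (24 * 60)
    return np.float32(np.sin(angle)), np.float32(np.cos(angle))

# Inside decode_line, after items = tf.decode_csv(...):
timeSine, timeCosine = tf.py_func(
    _time_features, [datetimeString], [tf.float32, tf.float32])
timeSine.set_shape([])      # py_func loses shape information, so restore it
timeCosine.set_shape([])

# Placeholder label: stack the remaining columns and reduce them with a TF op.
label = tf.reduce_mean(tf.stack(labelFeatures))

The usual caveat is that py_func runs outside the graph, so it is not serialized with the model and can become a bottleneck; for the label itself, sticking to plain TensorFlow ops on the stacked columns is probably the simpler route.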
I am really a beginner in programming, and I have run into a problem. I am making a comparative analysis between fake news and real news. I have a text corpus with approx. 3000 real news stories and 3000 fake ones. I need to figure out whether fake or real news evokes more high-arousal emotions. I want to do that by using the Warriner et al. word list: http://crr.ugent.be/archives/1003
I have imported the word list to my script:
warriner = pd.read_csv('warriner.csv', sep = '\t', encoding = 'utf-8')
print warriner.head()
I (think I) want to find the Arousal Mean Sum, which in the word list is called A.Mean.Sum. But I can't make it work; Spyder just says: 'DataFrame' object has no attribute 'A'.
Can anyone help? I have already calculated the sentiment scores by using LabMT, as seen below, but I can't make the Warriner et al. list work.
text_scored = []
for text in df['text']:
    sent_score = tm.labMT_sent(text)
    text_scored.append(sent_score)
df['abs_sent'] = text_scored  # adding the scored text to the df

# relative sentiment score
text_scored = []
for text in df['text']:
    sent_score = tm.labMT_sent(text, rel=True)
    text_scored.append(sent_score)
df['rel_sent'] = text_scored  # adding the scored text to the df

# overall mean
df['abs_sent'].mean()
df['abs_sent'].loc[df['label'] == 'FAKE'].mean()  # 'fake' mean = -22,1
df['abs_sent'].loc[df['label'] == 'REAL'].mean()  # 'real' mean = -41,95

# relative score mean calculations
df['rel_sent'].mean()  # overall mean
df['rel_sent'].loc[df['label'] == 'FAKE'].mean()  # 'fake' mean = -0,02
df['rel_sent'].loc[df['label'] == 'REAL'].mean()  # 'real' mean = -0,05
The example code you provided is hard for me to read. You're reporting the problem as having to do with A.Mean.Sum, but there's no code relating to that. There are also references to Spyder and DataFrame without explanation, code, or tags. Finally, the title should tell the potential answerer something about the problem itself, not the general field the code works in. The current one expects the reader to work out what's being asked from the body of the report.
I'll readily admit I'm a novice here, but I suggest reading the intro How-to-ask and clarifying your question with it.
I'm also guessing this is a pandas related question, so its docs page might help you.
I hope this is of some help!
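On the off chance it helps with the specific error quoted above: if the column really is named A.Mean.Sum, attribute access stops at the first dot (warriner.A.Mean.Sum first looks for an attribute called A, hence the message), so the usual workaround is bracket indexing. A minimal sketch, assuming that column name:

arousal = warriner['A.Mean.Sum']   # bracket indexing handles dots in column names
print(arousal.mean())

It may also be worth checking that sep='\t' matches the actual delimiter of the downloaded file; printing warriner.columns will show whether the columns were split correctly.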
I'm trying to follow along with this TensorFlow tutorial, which uses a load_csv function. TUTORIAL_LINK
One of the two lines in question is:
IRIS_TEST = "iris_test.csv"
test_set = tf.contrib.learn.datasets.base.load_csv(
    filename=IRIS_TEST,
    target_dtype=np.int
)
Where "iris_test.csv" looks like:
30,4,setosa,versicolor,virginica
5.9,3.0,4.2,1.5,1
6.9,3.1,5.4,2.1,2
5.1,3.3,1.7,0.5,0
6.0,3.4,4.5,1.6,1
5.5,2.5,4.0,1.3,1
6.2,2.9,4.3,1.3,1
5.5,4.2,1.4,0.2,0
6.3,2.8,5.1,1.5,2
5.6,3.0,4.1,1.3,1
6.7,2.5,5.8,1.8,2
7.1,3.0,5.9,2.1,2
4.3,3.0,1.1,0.1,0
I'm pretty sure the target of the machine learning exercise is the virginica column, but I've no idea how it's specified as such.
Is it implied as the last column?
From the code:
def load_csv(filename, target_dtype, target_column=-1, has_header=True):
    """Load dataset from CSV file."""
The default for target_column is -1. So, the last column; good to know.
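Based on that signature, the call in the tutorial is equivalent to spelling the target column out explicitly:

test_set = tf.contrib.learn.datasets.base.load_csv(
    filename=IRIS_TEST,
    target_dtype=np.int,
    target_column=-1  # -1 = last column of each data row, i.e. the class label
)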
I give a lot of information on the methods that I used to write my code. If you just want to read my question, skip to the quotes at the end.
I'm working on a project whose goal is to detect subpopulations in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining, as I'm currently taking a class on the subject.
There are 42 variables in total. Of those, 20 are continuous and had to be discretized. For each variable, I used the Freedman-Diaconis rule to determine how many categories to divide it into.
def Freedman_Diaconis(column_values):
    #sort the list first
    column_values[1].sort()
    first_quartile = int(len(column_values[1]) * .25)
    third_quartile = int(len(column_values[1]) * .75)
    fq_value = column_values[1][first_quartile]
    tq_value = column_values[1][third_quartile]
    iqr = tq_value - fq_value
    n_to_pow = len(column_values[1])**(-1/3)
    h = 2 * iqr * n_to_pow
    retval = (column_values[1][-1] - column_values[1][1])/h
    test = int(retval+1)
    return test
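(For what it's worth, NumPy 1.15+ implements the same Freedman-Diaconis rule directly, which can serve as a sanity check on the function above; a quick sketch, reusing the column_values[1] list from that function:

import numpy as np

values = np.asarray(column_values[1], dtype=float)
edges = np.histogram_bin_edges(values, bins='fd')  # Freedman-Diaconis bin edges
num_bins = len(edges) - 1

If the two bin counts disagree badly, that would point to an issue in the hand-rolled version.)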
From there I used min-max normalization
def min_max_transform(column_of_data, num_bins):
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    data_min_max_ints = take_int(data_min_max)
    return data_min_max_ints
to transform my data, and then I simply took the integer portion to get the final categorization.
def take_int(list_of_float):
    ints = []
    for flt in list_of_float:
        asint = int(flt)
        ints.append(asint)
    return ints
I then also wrote a function that I used to combine this value with the variable name.
def string_transform(prefix, column, index):
    transformed_list = []
    transformed = ""
    if index < 4:
        for entry in column[1]:
            transformed = prefix+str(entry)
            transformed_list.append(transformed)
    else:
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed = str(prefix_num[1])+'x'+str(entry)
            transformed_list.append(transformed)
    return transformed_list
This was done to differentiate variables that have the same value, but appear in different columns. For example, having a value of 1 for variable x14 means something different from getting a value of 1 in variable x20. The string transform function would create 14x1 and 20x1 for the previously mentioned examples.
After this, I wrote everything to a file in basket format
def create_basket(list_of_lists, headers):
    #for filename in os.listdir("."):
    #    if filename.e
    if not os.path.exists('baskets'):
        os.makedirs('baskets')
    down_length = len(list_of_lists[0])
    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)
        for i in range(0, down_length):
            basket_writer.writerow({"trt": list_of_lists[0][i], "y": list_of_lists[1][i], "x1": list_of_lists[2][i],
                                    "x2": list_of_lists[3][i], "x3": list_of_lists[4][i], "x4": list_of_lists[5][i],
                                    "x5": list_of_lists[6][i], "x6": list_of_lists[7][i], "x7": list_of_lists[8][i],
                                    "x8": list_of_lists[9][i], "x9": list_of_lists[10][i], "x10": list_of_lists[11][i],
                                    "x11": list_of_lists[12][i], "x12": list_of_lists[13][i], "x13": list_of_lists[14][i],
                                    "x14": list_of_lists[15][i], "x15": list_of_lists[16][i], "x16": list_of_lists[17][i],
                                    "x17": list_of_lists[18][i], "x18": list_of_lists[19][i], "x19": list_of_lists[20][i],
                                    "x20": list_of_lists[21][i], "x21": list_of_lists[22][i], "x22": list_of_lists[23][i],
                                    "x23": list_of_lists[24][i], "x24": list_of_lists[25][i], "x25": list_of_lists[26][i],
                                    "x26": list_of_lists[27][i], "x27": list_of_lists[28][i], "x28": list_of_lists[29][i],
                                    "x29": list_of_lists[30][i], "x30": list_of_lists[31][i], "x31": list_of_lists[32][i],
                                    "x32": list_of_lists[33][i], "x33": list_of_lists[34][i], "x34": list_of_lists[35][i],
                                    "x35": list_of_lists[36][i], "x36": list_of_lists[37][i], "x37": list_of_lists[38][i],
                                    "x38": list_of_lists[39][i], "x39": list_of_lists[40][i], "x40": list_of_lists[41][i]})
and I used the apriori package in Orange to see if there were any association rules.
rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)
print "%4s %4s %s" % ("Supp", "Conf", "Rule")
for r in rules:
    my_rule = str(r)
    split_rule = my_rule.split("->")
    if 'trt' in split_rule[1]:
        print 'treatment rule'
        print "%4.1f %4.1f %s" % (r.support, r.confidence, r)
Using this technique, I found quite a few association rules with my testing data.
THIS IS WHERE I HAVE A PROBLEM
When I read the notes for the training data, there is this note
...That is, the only
reason for the differences among observed responses to the same treatment across patients is
random noise. Hence, there is NO meaningful subgroup for this dataset...
My question is,
why do I get multiple association rules that would imply that there are subgroups, when according to the notes I shouldn't see anything?
I'm getting lift numbers above 2, as opposed to the 1 you would expect if everything were random, as the notes state.
Supp Conf Rule
0.3 0.7 6x0 -> trt1
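(For reference, lift is just the rule's confidence divided by how often the consequent occurs on its own, so with made-up numbers shaped like the rule above; the consequent support here is assumed for illustration, not taken from my output:

supp_consequent = 0.35  # assumed fraction of baskets containing trt1 (illustrative)
confidence = 0.7        # confidence of the rule 6x0 -> trt1 above
lift = confidence / supp_consequent
print(lift)             # 2.0; anything well above 1 suggests a non-random association

That is the sense in which lift values above 2 would point to real structure rather than noise.)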
Even though my code runs, I'm not getting results anywhere close to what should be expected. This leads me to believe that I messed something up, but I'm not sure what it is.
After some research, I realized that my sample size is too small for the number of variables I have. I would need a much larger sample size to really use the method I was using. In fact, the method I tried to use was developed with the assumption that it would be run on databases with hundreds of thousands or millions of rows.