issue in executing scikit-learn linear regression model - python

I have a dataset the sample structure of which looks like this:
SV,Arizona,618,264,63,923
SV,Arizona,367,268,94,138
SV,Arizona,421,268,121,178
SV,Arizona,467,268,171,250
SV,Arizona,298,270,62,924
SV,Arizona,251,272,93,138
SV,Arizona,215,276,120,178
SV,Arizona,222,279,169,250
SV,Arizona,246,279,64,94
SV,Arizona,181,281,97,141
SV,Arizona,197,286,125.01,182
SV,Arizona,178,288,175.94,256
SV,California,492,208,63,923
SV,California,333,210,94,138
SV,California,361,213,121,178
SV,California,435,217,171,250
SV,California,222,215,62,92
SV,California,177,218,93,138
SV,California,177,222,120,178
SV,California,156,228,169,250
SV,California,239,225,64,94
SV,California,139,229,97,141
SV,California,198,234,125,182
The records are in order of company_id,state,profit,feature1,feature2,feature3.
Now I wrote the code below, which breaks the whole dataset into chunks of 12 records (for each company, and for each state within that company, there are 12 records) and then passes each chunk to the process_chunk() function. Inside process_chunk() the records in the chunk are split into a test set and a training set, with records 10 and 11 going into the test set and the rest into the training set. I also store the company_id and state of the test-set records in global lists so the predicted values can be displayed later, and I append the predicted values to a global list final_prediction.
The issue I am facing is that company_list, state_list and the test-set lists all have the same size (about 200 records), but final_prediction is only half that size (100 records). If the test-set list has 200 entries, shouldn't final_prediction also have 200? My current code is:
from sklearn import linear_model
import numpy as np
import csv

final_prediction = []
company_list = []
state_list = []

def process_chunk(chuk):
    training_set_feature_list = []
    training_set_label_list = []
    test_set_feature_list = []
    test_set_label_list = []
    np.set_printoptions(suppress=True)
    prediction_list = []
    # to divide into training & test, I am putting line 10th and 11th in test set
    count = 0
    for line in chuk:
        # Converting strings to numpy arrays
        if count == 9:
            test_set_feature_list.append(np.array(line[3:4], dtype=np.float))
            test_set_label_list.append(np.array(line[2], dtype=np.float))
            company_list.append(line[0])
            state_list.append(line[1])
        elif count == 10:
            test_set_feature_list.append(np.array(line[3:4], dtype=np.float))
            test_set_label_list.append(np.array(line[2], dtype=np.float))
            company_list.append(line[0])
            state_list.append(line[1])
        else:
            training_set_feature_list.append(np.array(line[3:4], dtype=np.float))
            training_set_label_list.append(np.array(line[2], dtype=np.float))
        count += 1
    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(training_set_feature_list, training_set_label_list)
    prediction_list.append(regr.predict(test_set_feature_list))
    np.set_printoptions(formatter={'float_kind': '{:f}'.format})
    for items in prediction_list:
        final_prediction.append(items)

# Load and parse the data
file_read = open('data.csv', 'r')
reader = csv.reader(file_read)
chunk, chunksize = [], 12
for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        process_chunk(chunk)
        del chunk[:]
    chunk.append(line)
# process the remainder
#process_chunk(chunk)

print len(company_list)
print len(test_set_feature_list)
print len(final_prediction)
Why does this difference in size occur, and what mistake am I making in my code that I can rectify (maybe something I am doing very naively that could be done in a better way)?

Here:
prediction_list.append(regr.predict(test_set_feature_list))
np.set_printoptions(formatter={'float_kind':'{:f}'.format})
for items in prediction_list:
    final_prediction.append(items)
prediction_list will be a list of arrays (since predict returns an array).
So you'll be appending arrays to your final_prediction, which is probably what messes up your count: len(final_prediction) will probably be equal to the number of chunks.
At this point, the lengths are ok if prediction_list has the same length as test_set_feature_list.
You probably want to use extend like this:
final_prediction.extend(regr.predict(test_set_feature_list))
Which is also easier to read.
Then the length of final_prediction should be fine, and it should be a single list, rather than a list of lists.
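For illustration, a minimal sketch of the difference between append and extend (toy arrays, not the asker's data):
import numpy as np

preds = np.array([1.0, 2.0])   # stand-in for what predict() returns for one chunk's two test rows

appended, extended = [], []
appended.append(preds)    # one entry per chunk      -> length grows by 1
extended.extend(preds)    # one entry per prediction -> length grows by 2

print(len(appended), len(extended))   # 1 2
With append, after 100 chunks final_prediction holds 100 arrays; with extend it holds 200 individual predictions.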

Related

Request status update Twitter stream data

I retrieved Twitter data via the streaming API in Python; however, I am also interested in how the public metrics evolve over time. As a result, I would like to request the metrics on a daily basis.
Unfortunately, the status update API can only handle 100 IDs per request. I have a list of all the IDs; how is it possible to automatically split them so that all of them are requested, always in batches of 100?
Thanks a lot in advance!
Keep it as a list of IDs instead of a single string.
Then you can use range() with slices [n:n+100], like this:
# example data
all_ids = list(range(500))

SIZE = 100
#SIZE = 10 # test on smaller size

for n in range(0, len(all_ids), SIZE):
    print(all_ids[n:n+SIZE])
You can even use yield to create a dedicated generator function for this:
def split(data, size):
    for n in range(0, len(data), size):
        yield data[n:n+size]

# example data
all_ids = list(range(500))

SIZE = 100
SIZE = 10

for part in split(all_ids, SIZE):
    print(part)
Eventually you can take the first elements with [:100] and slice them off with [100:], but this destroys the list, so you have to do it on a copy of the list:
# example data
all_ids = list(range(500))

SIZE = 100
#SIZE = 10 # test on smaller size

all_ids_copy = all_ids.copy()
while all_ids_copy:
    print(all_ids_copy[:SIZE])
    all_ids_copy = all_ids_copy[SIZE:]
You can also use an external module for this, e.g. toolz:
from toolz import partition

# example data
all_ids = list(range(500))

SIZE = 100
#SIZE = 10 # test on smaller size

for part in partition(SIZE, all_ids):
    print(part)
If you have a list of strings, then you can convert a batch back to a single string using join():
print( ",".join(part) )
For a list of integers you first need to convert the integers to strings:
print( ",".join(str(x) for x in part) )
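Putting it together, a rough sketch of the daily job (request_metrics() here is a hypothetical stand-in for whatever client call you use to look up the tweets, not a real API):
def split(data, size):
    for n in range(0, len(data), size):
        yield data[n:n + size]

all_ids = [str(i) for i in range(350)]   # placeholder string IDs

for batch in split(all_ids, 100):
    ids_param = ",".join(batch)          # comma-separated IDs for one request
    # request_metrics(ids_param)         # hypothetical API call
    print(len(batch), ids_param[:30], "...")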

Python: problem with random seed in random sample

I have a dataset in which the first column is a text, the second is called author and the third is called title. I want to split my dataset into 3 subsamples based on title. Note that there are many different texts with the same title.
# Find the unique titles
random.seed(42)
mylist = list(set(list(dt_chunks['title'])))
print(len(mylist))
# Random sample of titles and match all of these titles with the respectively texts
random.seed(42)
trainlist = random.sample(mylist, k = int(len(mylist)*0.7))
pattern = '|'.join(trainlist)
train_idx = dt_chunks['title'].str.contains(pattern)
train_df = dt_chunks[train_idx]
# New list which contains the elements that the previous list doesn't contain
random.seed(42)
extralist = list(set(mylist)^set(trainlist))
# same logic
random.seed(42)
validlist = random.sample(extralist, k = int(len(extralist)*0.5))
pattern = '|'.join(validlist)
valid_idx = dt_chunks['title'].str.contains(pattern)
valid_df = dt_chunks[valid_idx]
# same logic
random.seed(42)
testlist = list(set(validlist)^set(extralist))
pattern = '|'.join(testlist)
test_idx = dt_chunks['title'].str.contains(pattern)
test_df = dt_chunks[test_idx]
The problem here is that I am using a random seed, but if I restart Google Colab the output isn't the same. I would be grateful if you could help me.
Possibly because dt_chunks['title'] is not the same every time. If this is the case, then len(mylist) also changes, and random.sample(mylist, k=int(len(mylist)*0.7)) will lead to the sampling function being called internally a different number of times on different runs.
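Another thing worth ruling out (an assumption, not necessarily your case): the order of list(set(...)) is not guaranteed to be the same across interpreter sessions, so even with identical titles random.sample can pick different elements despite the fixed seed. Sorting before sampling removes that source of variation; a minimal sketch:
import random

titles = ["a", "b", "c", "d", "e", "f", "g", "h"]   # toy data

random.seed(42)
mylist = sorted(set(titles))                  # fixed, reproducible order
trainlist = random.sample(mylist, k=int(len(mylist) * 0.7))
print(trainlist)                              # same result on every run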

Improve performance of dataframe-like structure

I'm facing a data-structure challenge regarding a process in my code in which I need to count the frequency of strings in positive and negative examples.
It's a large bottleneck and I don't seem to be able to find a better solution.
I have to go through every long string in the dataset and extract substrings, whose frequencies I need to count. In a perfect world, a pandas DataFrame of the following shape would be ideal:
string | frequency positive | frequency negative
________________________________________________
str1 | 5 | 7
str2 | 2 | 4
...
However, for obvious performance reasons, this is not acceptable.
My solution is to use a dictionary to track the rows, and an Nx2 numpy matrix to track the frequencies. This is also because, after this step, I need the frequencies in an Nx2 numpy matrix anyway.
Currently, my solution is something like this:
str_freq = np.zeros((N, 2), dtype=np.uint32)
str_dict = {}
str_dict_counter = 0

for i, string in enumerate(dataset):
    substrings = extract(string)  # substrings is a List[str]
    for substring in substrings:
        row = str_dict.get(substring, None)
        if row is None:
            str_dict[substring] = str_dict_counter
            row = str_dict_counter
            str_dict_counter += 1
        str_freq[row, target[i]] += 1  # target[i] is equal to 1 or 0
However, it is really the bottleneck of my code, and I'd like to speed it up.
Some things about this code are incompressible, for instance the extract(string) call, so that loop has to remain. However, there is no problem with using parallel processing if that helps.
What I'm wondering especially is whether there is a way to improve the inner loop. Python is known to be slow with loops, and this one seems a bit wasteful; however, since we can't (to my knowledge) do multiple gets and sets on dictionaries the way we can with numpy arrays, I don't know how I could improve it.
What do you suggest doing? Is the only solution to re-write in some lower level language?
I also thought about using SQLite, but I don't know if it's worth it.
For the record, this has to process about 10 MB of data; it currently takes about 45 seconds, but needs to be done repeatedly with new data each time.
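For illustration, the lookup-and-insert in the inner loop can at least be collapsed into a single dict.setdefault call (a sketch of the same logic, not a fix for the overall bottleneck):
for i, string in enumerate(dataset):
    substrings = extract(string)
    for substring in substrings:
        # setdefault returns the existing row index, or inserts str_dict_counter and returns it
        row = str_dict.setdefault(substring, str_dict_counter)
        if row == str_dict_counter:   # the substring was new
            str_dict_counter += 1
        str_freq[row, target[i]] += 1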
EDIT: Added example to test yourself
import random
import string
import re
import numpy as np
import pandas as pd

def get_random_alphaNumeric_string(stringLength=8):
    return bytes(bytearray(np.random.randint(0, 256, stringLength, dtype=np.uint8)))

def generate_dataset(n=10000):
    d = []
    for i in range(n):
        rnd_text = get_random_alphaNumeric_string(stringLength=1000)
        d.append(rnd_text)
    return d

def test_dict(dataset):
    pattern = re.compile(b"(q.{3})")
    target = np.random.randint(0, 2, len(dataset))
    str_freq = np.zeros((len(dataset)*len(dataset[0]), 2), dtype=np.uint32)
    str_dict = {}
    str_dict_counter = 0
    for i, string in enumerate(dataset):
        substrings = pattern.findall(string)  # substrings is a List[str]
        for substring in substrings:
            row = str_dict.get(substring, None)
            if row is None:
                str_dict[substring] = str_dict_counter
                row = str_dict_counter
                str_dict_counter += 1
            str_freq[row, target[i]] += 1  # target[i] is equal to 1 or 0
    return str_dict, str_freq[:str_dict_counter, :]

def test_df(dataset):
    pattern = re.compile(b"(q.{3})")
    target = np.random.randint(0, 2, len(dataset))
    df = pd.DataFrame(columns=["str", "pos", "neg"])
    df.astype(dtype={"str": bytes, "pos": int, "neg": int}, copy=False)
    df = df.set_index("str")
    for i, string in enumerate(dataset):
        substrings = pattern.findall(string)  # substrings is a List[str]
        for substring in substrings:
            check = substring in df.index
            if not check:
                row = [0, 0]
                row[target[i]] = 1
                df.loc[substring] = row
            else:
                df.loc[substring][target[i]] += 1
    return df

dataset = generate_dataset(1000000)

d, f = test_dict(dataset)  # takes ~10 seconds on my laptop
# to get the value of some key, say b'q123'
f[d[b'q123'], :]

d = test_df(dataset)  # takes several minutes (hasn't finished yet)
# the same but with a dataframe
d.loc[b'q123']

NaiveBayes model training with separate training set and data using pyspark

So, I am trying to train a naive Bayes classifier. I went through a lot of trouble preprocessing the data, and I have now produced two RDDs:
Training set: composed of a set of sparse vectors;
Labels: a corresponding list of labels (0, 1) for every vector.
I need to run something like this:
# Train a naive Bayes model.
model = NaiveBayes.train(training, 1.0)
but "training" is a dataset derived from running:
def parseLine(line):
    parts = line.split(',')
    label = float(parts[0])
    features = Vectors.dense([float(x) for x in parts[1].split(' ')])
    return LabeledPoint(label, features)

data = sc.textFile('data/mllib/sample_naive_bayes_data.txt').map(parseLine)
based on the Python documentation here. My question is: given that I don't want to load the data from a txt file, and that I have already created the training set in the form of records mapped to sparse vectors (an RDD) plus a corresponding list of labels, how can I run naive Bayes?
Here is part of my code:
# Function
def featurize(tokens_kv, dictionary):
    """
    :param tokens_kv: list of tuples of the form (word, tf-idf score)
    :param dictionary: list of n words
    :return: sparse_vector of size n
    """
    # MUST sort tokens_kv by key
    tokens_kv = collections.OrderedDict(sorted(tokens_kv.items()))
    vector_size = len(dictionary)
    non_zero_indexes = []
    index_tfidf_values = []
    for key, value in tokens_kv.iteritems():
        index = 0
        for word in dictionary:
            if key == word:
                non_zero_indexes.append(index)
                index_tfidf_values.append(value)
            index += 1
    print non_zero_indexes
    print index_tfidf_values
    return SparseVector(vector_size, non_zero_indexes, index_tfidf_values)

# Feature Extraction
Training_Set_Vectors = (TFsIDFs_Vector_Weights_RDDs
                        .map(lambda (tokens): featurize(tokens, Dictionary_BV.value))
                        .cache())
... and labels is just a list of 1s and 0s. I understand that I may need to use LabeledPoint somehow, but I am confused as to how... The RDD is not a list, while labels is a list. I am hoping for something as simple as a way to create LabeledPoint objects[i] combining sparse_vectors[i] and corresponding_labels[i]... any ideas?
I was able to solve this by first collecting the SparseVector RDD, effectively converting it to a list. Then I ran a function that constructed a list of LabeledPoint objects:
def final_form_4_training(SVs, labels):
    """
    :param SVs: List of Sparse vectors.
    :param labels: List of labels
    :return: list of labeledpoint objects
    """
    to_train = []
    for i in range(len(labels)):
        to_train.append(LabeledPoint(labels[i], SVs[i]))
    return to_train

# Feature Extraction
Training_Set_Vectors = (TFsIDFs_Vector_Weights_RDDs
                        .map(lambda (tokens): featurize(tokens, Dictionary_BV.value))
                        .collect())

raw_input("Generate the LabeledPoint parameter... ")
labelled_training_set = sc.parallelize(final_form_4_training(Training_Set_Vectors, training_labels))

raw_input("Train the model... ")
model = NaiveBayes.train(labelled_training_set, 1.0)
However, this assumes that the RDDs maintain their order (which I am not messing with) throughout the process pipeline. I also hate the part where I had to collect everything on the master. Any better ideas?
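One way to avoid collecting to the driver (a sketch, assuming the labels can be made into an RDD that lines up element-for-element with the vectors; RDD.zip requires the two RDDs to have the same number of partitions and the same number of elements per partition):
# keep everything distributed: zip the label RDD with the vector RDD element-wise
labels_rdd = sc.parallelize(training_labels)            # or however the labels were produced
labelled_training_set = (labels_rdd
                         .zip(Training_Set_Vectors)      # the cached .map(...) RDD, not collected
                         .map(lambda pair: LabeledPoint(pair[0], pair[1])))

model = NaiveBayes.train(labelled_training_set, 1.0)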

How to correct my Naive Bayes method returning extremely small conditional probabilities?

I'm trying to calculate the probability that an email is spam with Naive Bayes. I have a document class to create the documents (fed in from a website), and another class to train and classify documents. My train function calculates all the unique terms in all the documents, all documents in the spam class, all documents in the non-spam class, computes prior probabilities (one for spam, another for ham). Then I use the following formula to store conditional probabilities for each term into a dict
condprob[term][class] = (Tct + 1) / (Tct' + B')
where
Tct = the number of occurrences of a term in a given class
Tct' = the total number of terms in a given class
B' = the number of unique terms across all documents
classes = either spam or ham (spam = spam, ham = not spam)
The issue is that when I use this formula in my code it gives me extremely small conditional probability scores, such as 2.461114392596968e-05. I'm quite sure this is because the values of Tct are very small (like 5 or 8) compared to the denominator values of Tct' (which is 64878 for ham and 308930 for spam) and B' (which is 16386); for example, (5 + 1) / (308930 + 16386) ≈ 1.8e-05. I can't figure out how to get the condprob scores to something like .00034155, as I can only assume my condprob scores aren't supposed to be as exponentially small as they are. Am I doing something wrong with my calculations? Are the values actually supposed to be this small?
If it helps, my goal is to score a test set of documents and get results like 327.82, 758.80, or 138.66 using this formula; however, using my small condprob values I only get negative numbers.
Code
-Create Document
class Document(object):
    """
    The instance variables are:
    filename....The path of the file for this document.
    label.......The true class label ('spam' or 'ham'), determined by whether the filename contains the string 'spmsg'
    tokens......A list of token strings.
    """

    def __init__(self, filename=None, label=None, tokens=None):
        """ Initialize a document either from a file, in which case the label
        comes from the file name, or from specified label and tokens, but not
        both.
        """
        if label:  # specify from label/tokens, for testing.
            self.label = label
            self.tokens = tokens
        else:  # specify from file.
            self.filename = filename
            self.label = 'spam' if 'spmsg' in filename else 'ham'
            self.tokenize()

    def tokenize(self):
        self.tokens = ' '.join(open(self.filename).readlines()).split()
-NaiveBayes
class NaiveBayes(object):
    def train(self, documents):
        """
        Given a list of labeled Document objects, compute the class priors and
        word conditional probabilities, following Figure 13.2 of your
        book. Store these as instance variables, to be used by the classify
        method subsequently.
        Params:
          documents...A list of training Documents.
        Returns:
          Nothing.
        """
        ###TODO
        unique = []
        proxy = []
        proxy2 = []
        proxy3 = []
        condprob = [{}, {}]
        Tct = defaultdict()
        Tc_t = defaultdict()
        prior = {}
        count = 0
        oldterms = []
        old_terms = []
        for a in range(len(documents)):
            done = False
            for item in documents[a].tokens:
                if item not in unique:
                    unique.append(item)
                if documents[a].label == "ham":
                    proxy2.append(item)
                    if done == False:
                        count += 1
                elif documents[a].label == "spam":
                    proxy3.append(item)
                done = True
        V = unique
        N = len(documents)
        print("N:", N)
        LB = len(unique)
        print("THIS IS LB:", LB)
        self.V = V
        print("THIS IS COUNT/NC", count)
        Nc = count
        prior["ham"] = Nc / N
        self.prior = prior
        Nc = len(documents) - count
        print("THIS IS SPAM COUNT/NC", Nc)
        prior["spam"] = Nc / N
        self.prior = prior
        text2 = proxy2
        text3 = proxy3
        TctTotal = len(text2)
        Tc_tTotal = len(text3)
        print("THIS IS TCTOTAL", TctTotal)
        print("THIS IS TC_TTOTAL", Tc_tTotal)
        for term in text2:
            if term not in oldterms:
                Tct[term] = text2.count(term)
                oldterms.append(term)
        for term in text3:
            if term not in old_terms:
                Tc_t[term] = text3.count(term)
                old_terms.append(term)
        for term in V:
            if term in text2:
                condprob[0].update({term: (Tct[term] + 1) / (TctTotal + LB)})
            if term in text3:
                condprob[1].update({term: (Tc_t[term] + 1) / (Tc_tTotal + LB)})
        print("This is condprob", condprob)
        self.condprob = condprob

    def classify(self, documents):
        """ Return a list of strings, either 'spam' or 'ham', for each document.
        Params:
          documents....A list of Document objects to be classified.
        Returns:
          A list of label strings corresponding to the predictions for each document.
        """
        ###TODO
        #return list["string1", "string2", "stringn"]
        # docs2 = ham, condprob[0] is ham
        # docs3 = spam, condprob[1] is spam
        unique = []
        ans = []
        hscore = 0
        sscore = 0
        for a in range(len(documents)):
            for item in documents[a].tokens:
                if item not in unique:
                    unique.append(item)
            W = unique
            hscore = math.log(float(self.prior['ham']))
            sscore = math.log(float(self.prior['spam']))
            for t in W:
                try:
                    hscore += math.log(self.condprob[0][t])
                except KeyError:
                    continue
                try:
                    sscore += math.log(self.condprob[1][t])
                except KeyError:
                    continue
            print("THIS IS SSCORE", sscore)
            print("THIS IS HSCORE", hscore)
            unique = []
            if hscore > sscore:
                str = "Spam"
            elif sscore > hscore:
                str = "Ham"
            ans.append(str)
        return ans
-Test
if not os.path.exists('train'):  # download data
    from urllib.request import urlretrieve
    import tarfile
    urlretrieve('http://cs.iit.edu/~culotta/cs429/lingspam.tgz', 'lingspam.tgz')
    tar = tarfile.open('lingspam.tgz')
    tar.extractall()
    tar.close()

train_docs = [Document(filename=f) for f in glob.glob("train/*.txt")]
test_docs = [Document(filename=f) for f in glob.glob("test/*.txt")]
test = train_docs

nb = NaiveBayes()
nb.train(train_docs[1500:])
#uncomment when testing classify()
#predictions = nb.classify(test_docs[:200])
#print("PREDICTIONS", predictions)
The eventual goal is to be able to classify documents as spam or ham, but I want to work on the conditional probability issue first.
The Issue
Are the conditional probability values supposed to be this small? If so, why am I getting strange scores via classify? If not, how do I fix my code to give me the proper condprob values?
Values
The current condprob values that I am getting are along the lines of this:
'tradition': 2.461114392596968e-05, 'fillmore': 2.461114392596968e-05, '796': 2.461114392596968e-05, 'zann': 2.461114392596968e-05
condprob is a list containing two dictionaries, the first for ham and the second for spam. Each dictionary maps a term to its conditional probability. I want to have "normal" small values such as .00031235, not 3.1235e-05.
The reason for this is that when I run the condprob values through the classify method with some test documents I get scores like
THIS IS HSCORE -2634.5292392650663, THIS IS SSCORE -1707.983339196181
when they should look like
THIS IS HSCORE 327.82, THIS IS SSCORE 758.80
Running Time
~1 min, 30 sec
(You seem to be working with log probabilities, which is very sensible, but I am going to write most of the following for the raw probabilities, which you could get by taking the exponential of the log probabilities, because it makes the algebra easier even if it does in practice mean that you would probably get numerical underflow if you didn't use logs)
As far as I can tell from your code you start with prior probabilities p(Ham) and p(Spam) and then use probabilities estimated from previous data to work out p(Ham) * p(Observed data | Ham) and p(Spam) * p(Observed data | Spam).
Bayes Theorem rearranges p(Obs|Spam) = p(Obs & Spam) / p(Spam) = p(Obs) p(Spam|Obs) / p(Spam) to give you P(Spam|Obs) = p(Spam) p(Obs|Spam)/p(Obs) and you seem to have calculated p(Spam) p(Obs|Spam) = p(Obs & Spam) but not divided by p(Obs). Since there are only two possibilities, Ham and Spam, the easiest thing to do is probably to note that p(Obs) = p(Obs & Spam) + p(Obs & Ham) and so just divide each of your two calculated values by their sum, essentially scaling the values so that they do indeed sum to 1.0.
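For example (toy numbers, just to illustrate the renormalisation):
p_obs_and_spam = 3e-5              # p(Spam) * p(Obs | Spam)
p_obs_and_ham = 1e-5               # p(Ham)  * p(Obs | Ham)

p_spam_given_obs = p_obs_and_spam / (p_obs_and_spam + p_obs_and_ham)   # 0.75
p_ham_given_obs = p_obs_and_ham / (p_obs_and_spam + p_obs_and_ham)     # 0.25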
This scaling is trickier if you start off with log probabilities lA and lB. To scale these I would first bring them into range by subtracting off the larger of the two:
m = max(lA, lB)
lA = lA - m
lB = lB - m
Now at least the larger of the two won't overflow. The smaller still might, but I'd rather deal with underflow than overflow. Now turn them into not-quite-scaled probabilities:
pA = exp(lA)
pB = exp(lB)
and scale properly so they add to one:
truePA = pA / (pA + pB)
truePB = pB / (pA + pB)
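A minimal sketch of that whole procedure, assuming hscore and sscore are the two log scores produced in classify():
import math

def normalize_log_scores(l_ham, l_spam):
    """Turn two log scores into probabilities that sum to 1.0."""
    m = max(l_ham, l_spam)           # shift so the larger value becomes 0, avoiding overflow
    p_ham = math.exp(l_ham - m)
    p_spam = math.exp(l_spam - m)
    total = p_ham + p_spam
    return p_ham / total, p_spam / total

# e.g. with the scores quoted in the question
print(normalize_log_scores(-2634.53, -1707.98))   # ~ (0.0, 1.0): overwhelmingly spam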
