Python: problem with random seed in random sample

I have a dataset in which the first column is a text, the second is called author and the third is called title. I want to split my dataset into 3 subsamples based on title. Note that there are many different texts with the same title.
# Find the unique titles
random.seed(42)
mylist = list(set(list(dt_chunks['title'])))
print(len(mylist))
# Random sample of titles; match these titles with their respective texts
random.seed(42)
trainlist = random.sample(mylist, k = int(len(mylist)*0.7))
pattern = '|'.join(trainlist)
train_idx = dt_chunks['title'].str.contains(pattern)
train_df = dt_chunks[train_idx]
# New list containing the elements that the previous list doesn't contain
random.seed(42)
extralist = list(set(mylist)^set(trainlist))
# same logic
random.seed(42)
validlist = random.sample(extralist, k = int(len(extralist)*0.5))
pattern = '|'.join(validlist)
valid_idx = dt_chunks['title'].str.contains(pattern)
valid_df = dt_chunks[valid_idx]
# same logic
random.seed(42)
testlist = list(set(validlist)^set(extralist))
pattern = '|'.join(testlist)
test_idx = dt_chunks['title'].str.contains(pattern)
test_df = dt_chunks[test_idx]
The problem here is that I am using a random seed, but if I restart Google Colab, the output isn't the same. I would be grateful if you could help me.

Possibly because dt_chunks['title'] is not the same every time. If that is the case, then len(mylist) also changes, and random.sample(mylist, k = int(len(mylist)*0.7)) will end up calling the underlying sampling function a different number of times on different runs.
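If the titles themselves really are stable across runs, one additional safeguard is to sort the unique titles before sampling, so the result does not depend on set iteration order either; a minimal sketch of that idea:
import random

titles = sorted(set(dt_chunks['title']))   # deterministic order instead of list(set(...))
rng = random.Random(42)                    # local, seeded RNG
trainlist = rng.sample(titles, k=int(len(titles) * 0.7))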


Request status update Twitter stream data

I retrieved Twitter data via the streaming API in Python; however, I am also interested in how the public metrics evolve over time. As a result, I would like to request the metrics on a daily basis.
Unfortunately, the API for the status update can only handle 100 requests at a time. I have a list of all IDs; how is it possible to automatically split the string of IDs so that all of them are requested, always in batches of 100?
Thank you a lot in advance!
Keep it as a list of IDs instead of a single string.
Then you can use range() over the length with a step, and slice [n:n+100], like:
# example data
all_ids = list(range(500))

SIZE = 100
#SIZE = 10   # test on smaller size

for n in range(0, len(all_ids), SIZE):
    print(all_ids[n:n+SIZE])
You can even use yield to create a dedicated generator function for this:
def split(data, size):
    for n in range(0, len(data), size):
        yield data[n:n+size]

# example data
all_ids = list(range(500))

SIZE = 100
#SIZE = 10   # test on smaller size

for part in split(all_ids, SIZE):
    print(part)
Alternatively, you can take [:100] and keep slicing off [100:], but this consumes the list, so you have to do it on a copy:
# example data
all_ids = list(range(500))

SIZE = 100
#SIZE = 10   # test on smaller size

all_ids_copy = all_ids.copy()
while all_ids_copy:
    print(all_ids_copy[:SIZE])
    all_ids_copy = all_ids_copy[SIZE:]
You can also use an external module such as toolz for this (note that partition drops a trailing incomplete chunk; partition_all keeps it).
from toolz import partition

# example data
all_ids = list(range(500))

SIZE = 100
#SIZE = 10   # test on smaller size

for part in partition(SIZE, all_ids):
    print(part)
If you have a list of strings, you can convert a batch back to a single string using join():
print( ",".join(part) )
For a list of integers you need to convert the integers to strings first:
print( ",".join(str(x) for x in part) )

Redistribute a list of merchants_id so each user receives different set of merchants but equal in number - Python

Update: this cannot be solved 100%, since the number of merchants each user must receive differs. So some users might end up getting the same merchants as before. However, is it possible to let them get the same merchants only if no other, different merchants are available?
I have the following excel file:
What I would like to do is redistribute the merchants (Mer_id) so that each user (Origin_pool) gets the same number of merchants as before, but a different set of merchants. For example, after the redistribution, Nick will receive 3 Mer_ids, but not 30303, 101020, 220340. Anna will receive 4 merchants, but not 23401230, 310231, 2030230, 2310505, and so on. Of course, one merchant cannot be assigned to more than one person.
What I have done so far is find the total number of merchants each user must receive and randomly give them a Mer_id that was not previously assigned to them. After I find a different Mer_id, I remove it from the list so that the other users won't receive the same merchant:
import pandas as pd
import numpy as np
import random

df=pd.read_excel('dup_check_origin.xlsx')
dfcounts=df.groupby(['Origin_pool']).size().reset_index(name='counts')
Origin_pool=list(dfcounts['Origin_pool'])
counts=list(dfcounts['counts'])
dict_counts = dict(zip(Origin_pool, counts))

dest_name=[]
dest_mer=[]
for pool in Origin_pool:
    pername=0
    #for j in range(df.shape[0]):
    while pername<=dict_counts[pool]:
        rn=random.randint(0,df.shape[0]-1)
        rid=df['Mer_id'].iloc[rn]
        if (pool!=df['Origin_pool'].iloc[rn]):
            #new_dict[pool]=rid
            pername+=1
            dest_name.append(pool)
            dest_mer.append(rid)
            df=df.drop(df.loc[df['Mer_id']==rid].index[0])
But it is not efficient at all, given the fact that in the future I might have more data than 18 rows.
Is there any library that does this or a way to make it more efficient?
Several days after your question, but I think this is bullet-proof code.
You could wrap the entire code in a function or class.
I only created one function, a recursive one, to handle the leftovers.
There are 3 lists, initialized at the beginning of the code:
pairs -> your final pool list
reshuffle -> the randomly generated pairs that already appear as pairs in the Excel file
still -> handles the repeated pairs inside the pullpush function
The pullpush function comes first, because it is called in several different situations.
The first part of the program is a random algorithm that makes pairs from Mer_id (merchants) and Origin_pool (poolers).
If a pair is not in the Excel file it goes to the pairs list, otherwise it goes to the reshuffle list.
Depending on the characteristics of reshuffle, either another random pass is made or it is processed by the pullpush function.
If you execute the code once, as it is, and print(pairs), you may find a list with 15, 14 or some other number of pairs, fewer than 18.
Then, if you print(reshuffle), you will see the rest of the pairs needed to make 18.
To get the full 18 matchings in the pairs variable you must run pullpush(reshuffle).
The output below was obtained by running the code followed by pullpush(reshuffle).
If you want to ensure that Mer_id and Origin_pool do not repeat for 3 rounds, you can load 2 more Excel files and split them into oldpair2 and oldpair3 (a sketch of that follows the code below).
[[8348201, 'Anna'], [53256236, 'Anna'], [9295, 'Anna'], [54240, 'Anna'], [30303, 'Marios'], [101020, 'Marios'], [959295, 'Marios'], [2030230, 'George'], [310231, 'George'], [23401230, 'George'], [2341134, 'Nick'], [178345, 'Marios'], [220340, 'Marios'], [737635, 'George'], [[2030230, 'George'], [928958, 'Nick']], [[5560503, 'George'], [34646, 'Nick']]]
The code:
import pandas as pd
import random

df = pd.read_excel('dup_check_origin.xlsx')

oldpair = df.values.tolist()                 # previous pooling pairs
merchants = df['Mer_id'].values.tolist()     # convert Mer_id to a list
poolers = df['Origin_pool'].values.tolist()  # convert Origin_pool to a list

random.shuffle(merchants)  # 1st step shuffle

pairs = []      # final pairs list
reshuffle = []  # try again
still = []      # same as reshuffle, for pullpush

def pullpush(repetition):
    replacement = repetition  # reshuffle transfer
    for re in range(len(replacement)):
        replace = next(r for r in pairs if r not in replacement)
        repair = [[replace[0], replacement[re][1]],
                  [replacement[re][0], replace[1]]]
        if repair not in oldpair:
            iReplace = pairs.index(replace)  # get index of pair
            pairs.append(repair)
            del pairs[iReplace]              # remove from pairs
        else:
            still.append(repair)
    if still:
        pullpush(still)  # recursive call

for p in range(len(poolers)):  # avoid more merchants than poolers
    pair = [merchants[p], poolers[p]]
    if pair not in oldpair:
        pairs.append(pair)
    else:
        reshuffle.append(pair)

if reshuffle:
    merchants_bis = [x[0] for x in reshuffle]
    poolers_bis = [x[1] for x in reshuffle]
    if len(reshuffle) > 2:  # shuffle needs 3 or more elements
        random.shuffle(merchants_bis)
        reshuffle = []  # clean before the loop
        for n in range(len(poolers_bis)):
            new_pair = [merchants_bis[n], poolers_bis[n]]
            if new_pair not in oldpair:
                pairs.append(new_pair)
            else:
                reshuffle.append(new_pair)
        if len(reshuffle) == len(poolers_bis):  # infinite loop
            pullpush(reshuffle)
    # double pairs and different poolers
    elif (len(reshuffle) == 2 and not [i for i in reshuffle[0] if i in reshuffle[1]]):
        merchants_bis = [merchants_bis[1], merchants_bis[0]]
        new_pair = [[merchants_bis[1], poolers_bis[0]],
                    [merchants_bis[0], poolers_bis[1]]]
        if new_pair not in oldpair:
            pairs.append(new_pair)
        else:
            reshuffle.append(new_pair)
            pullpush(reshuffle)
    else:  # one left or same poolers
        pullpush(reshuffle)
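For the 3-round idea mentioned above, a minimal sketch of the check (the extra file names are hypothetical):
import pandas as pd

oldpair = pd.read_excel('dup_check_origin.xlsx').values.tolist()
oldpair2 = pd.read_excel('dup_check_round2.xlsx').values.tolist()   # hypothetical 2nd round
oldpair3 = pd.read_excel('dup_check_round3.xlsx').values.tolist()   # hypothetical 3rd round

def seen_before(pair):
    # reject a pair if it appeared in any previous round
    return pair in oldpair or pair in oldpair2 or pair in oldpair3
Wherever the code above tests "if pair not in oldpair", you would test "if not seen_before(pair)" instead.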
My solution uses dictionaries and lists; I print the result, but you can create a new dataframe from it.
from random import shuffle
import pandas as pd

df = pd.read_excel('dup_check_origin.xlsx')

dpool = {}
mers = list(df.Mer_id.unique())
shuffle(mers)

# merchants currently assigned to each pool
for pool in df.Origin_pool.unique():
    dpool[pool] = list(df.Mer_id[df.Origin_pool == pool])

for key in dpool.keys():
    inmers = dpool[key]                               # merchants this pool had before
    cnt = len(inmers)                                 # how many merchants it must receive
    new = [x for x in mers if x not in inmers][:cnt]  # pick that many different merchants
    mers = [x for x in mers if x not in new]          # remove them from the available pool
    print(key, new)

How can I split a list up into two groups? How can I print those groups?

I am making a randomizer to put in names and then print out those names in 2 even groups. I am trying to figure out how to split up the list into groups.
If I get this right, you can use numpy and random.sample for this:
import numpy as np
import random
# here goes your list
your_list = np.array(["name1", "name2","name3", "name4","name5", "name6","name7", "name8"])
# get its length
n_names = len(your_list)
# generate a random list of indexes
groups = random.sample(range(n_names), n_names)
# split these indexes into two even groups
g1, g2 = groups[:n_names//2], groups[n_names//2:]
# put elements into two groups
group1, group2 = your_list[g1], your_list[g2]
# print them
print("Group1: ",group1)
print("Group2: ",group2)
If losing the original ordering of the names is not a problem, you could try this simpler solution:
import random
import numpy as np

# here goes your list
your_list = np.array(["name1", "name2","name3", "name4","name5", "name6","name7", "name8"])
# get its length
n_names = len(your_list)
# shuffle it
random.shuffle(your_list)
group1, group2 = your_list[:n_names//2], your_list[n_names//2:]
# print them
print("Group1: ",group1)
print("Group2: ",group2)

Sentiment Classification with NLTK Naive Bayesian classifier

I am implementing a Naive Bayesian classifier with NLTK. But when I train the classifier with the extracted features it gives the error "too many values to unpack". I am just a beginner in Python. Here is the code. The program reads text from files and extracts features from them.
import nltk.classify.util,os,sys;
from nltk.classify import NaiveBayesClassifier;
from nltk.corpus import stopwords;
from nltk.tokenize import word_tokenize,RegexpTokenizer;
import re;

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def word_feats(words):
    return dict([(word,True) for word in words])

def feature_extractor(sentiment):
    path = "train/"+sentiment+"/"
    files = os.listdir(path);
    feats = {};
    i = 0;
    for file in files:
        f = open(path+file,"r", encoding='utf-8');
        review = f.read();
        review = remove_tags(review);
        stopWords = (stopwords.words("english"))
        tokenizer = RegexpTokenizer(r"\w+");
        tokens = tokenizer.tokenize(review);
        features = word_feats(tokens);
        feats.update(features)
    return feats;

posative_feat = feature_extractor("pos");
p = open("posFeat.txt","w", encoding='utf-8');
p.write(str(posative_feat));

negative_feat = feature_extractor("neg");
n = open("negFeat.txt","w", encoding='utf-8');
n.write(str(negative_feat));

plength = int(len(posative_feat)*3/4);
nlength = int(len(negative_feat)*3/4)
totalLength = plength+nlength;

trainFeatList = {}
testFeatList = {}

i = 0
for items in posative_feat.items():
    i +=1;
    value = {items[0]:items[1]}
    if(i<plength):
        trainFeatList.update(value);
    else:
        testFeatList.update(value);

j = 0
for items in negative_feat.items():
    j +=1;
    value = {items[0]:items[1]}
    if(j<plength):
        trainFeatList.update(value);
    else:
        testFeatList.update(value);

classifier = NaiveBayesClassifier.train(trainFeatList)
print(nltk.classify.util.accuracy(classifier,testFeatList));
classifier.show_most_informative_features();
Looking at the NLTK book page http://www.nltk.org/book/ch06.html, it seems the data given to the NaiveBayesClassifier is of the type list(tuple(dict,str)), whereas the data you are passing to the classifier is of the type list(dict).
If you represent the data in a similar manner, you will get different results. Basically, it is a list of (feature dict, label) tuples.
There are multiple errors in your code:
Python does not use a semicolon as a line ending
The True boolean does not seem to serve a purpose on line 12
trainFeatList and testFeatList should be lists
each value in your feature items list should be a tuple(dict, str)
assign labels to features in the list (in (4))
take NaiveBayesClassifier, and any use of classifier out of the negative features loop
If you fix the previous errors, the classifier will work, but unless I know what you are trying to achieve it is confusing and does not predict well.
The main line you need to pay attention to is where you assign something to your variable value.
For example:
value = {items[0]:items[1]}
should be something like:
value = ({feature_name:feature}, label)
Then afterwards you would call .append() on your lists to add each value instead of .update().
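For illustration, a minimal sketch of that shape (the two review lists here are made up; they only stand in for the files read in the question):
from nltk.classify import NaiveBayesClassifier

# hypothetical tokenized reviews
pos_reviews = [["great", "movie"], ["loved", "it"]]
neg_reviews = [["boring", "plot"], ["waste", "of", "time"]]

def word_feats(words):
    # same bag-of-words features as in the question
    return dict((word, True) for word in words)

# a list of (feature dict, label) tuples
train_feats = [(word_feats(r), "pos") for r in pos_reviews] + \
              [(word_feats(r), "neg") for r in neg_reviews]

classifier = NaiveBayesClassifier.train(train_feats)
print(classifier.classify(word_feats(["great", "plot"])))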
You can look at an example of your updated code in a buggy working state at http://pastebin.com/91Zu59Cm but I would suggest thinking about the following:
How is the data supposed to be represented for the NaiveBayesClassifier class?
What features are you trying to capture?
What labels are associated with those features?

issue in executing scikit-learn linear regression model

I have a dataset the sample structure of which looks like this:
SV,Arizona,618,264,63,923
SV,Arizona,367,268,94,138
SV,Arizona,421,268,121,178
SV,Arizona,467,268,171,250
SV,Arizona,298,270,62,924
SV,Arizona,251,272,93,138
SV,Arizona,215,276,120,178
SV,Arizona,222,279,169,250
SV,Arizona,246,279,64,94
SV,Arizona,181,281,97,141
SV,Arizona,197,286,125.01,182
SV,Arizona,178,288,175.94,256
SV,California,492,208,63,923
SV,California,333,210,94,138
SV,California,361,213,121,178
SV,California,435,217,171,250
SV,California,222,215,62,92
SV,California,177,218,93,138
SV,California,177,222,120,178
SV,California,156,228,169,250
SV,California,239,225,64,94
SV,California,139,229,97,141
SV,California,198,234,125,182
The records are in the order company_id, state, profit, feature1, feature2, feature3.
Now I wrote this code, which breaks the whole dataset into chunks of 12 records (for each company, and for each state in that company, there are 12 records) and then passes each chunk to the process_chunk() function. Inside process_chunk() the records in the chunk are split into a test set and a training set, with records number 10 and 11 going into the test set while the rest go into the training set. I also store the company_id and state of the test-set records in global lists for future display of the predicted values, and I append the predicted values to a global list final_prediction.
Now the issue I am facing is that the company_list, state_list and test_set lists have the same size (about 200 records), but final_prediction has half the size of the other lists (100 records). If the test_set list has a size of 200, shouldn't final_prediction also have a size of 200? My current code is:
from sklearn import linear_model
import numpy as np
import csv

final_prediction = []
company_list = []
state_list = []

def process_chunk(chuk):
    training_set_feature_list = []
    training_set_label_list = []
    test_set_feature_list = []
    test_set_label_list = []
    np.set_printoptions(suppress=True)
    prediction_list = []
    # to divide into training & test, I am putting line 10th and 11th in test set
    count = 0
    for line in chuk:
        # Converting strings to numpy arrays
        if count == 9:
            test_set_feature_list.append(np.array(line[3:4],dtype = np.float))
            test_set_label_list.append(np.array(line[2],dtype = np.float))
            company_list.append(line[0])
            state_list.append(line[1])
        elif count == 10:
            test_set_feature_list.append(np.array(line[3:4],dtype = np.float))
            test_set_label_list.append(np.array(line[2],dtype = np.float))
            company_list.append(line[0])
            state_list.append(line[1])
        else:
            training_set_feature_list.append(np.array(line[3:4],dtype = np.float))
            training_set_label_list.append(np.array(line[2],dtype = np.float))
        count += 1

    # Create linear regression object
    regr = linear_model.LinearRegression()

    # Train the model using the training sets
    regr.fit(training_set_feature_list, training_set_label_list)

    prediction_list.append(regr.predict(test_set_feature_list))
    np.set_printoptions(formatter={'float_kind':'{:f}'.format})

    for items in prediction_list:
        final_prediction.append(items)

# Load and parse the data
file_read = open('data.csv', 'r')
reader = csv.reader(file_read)

chunk, chunksize = [], 12
for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        process_chunk(chunk)
        del chunk[:]
    chunk.append(line)

# process the remainder
#process_chunk(chunk)

print len(company_list)
print len(test_set_feature_list)
print len(final_prediction)
Why is this difference in size occurring, and what mistake am I making in my code that I can rectify (maybe something I am doing very naively that can be done in a better way)?
Here:
prediction_list.append(regr.predict(test_set_feature_list))
np.set_printoptions(formatter={'float_kind':'{:f}'.format})
for items in prediction_list:
    final_prediction.append(items)
prediction_list will be a list of arrays (since predict returns an array).
So you'll be appending arrays to your final_prediction, which is probably what messes up your count: len(final_prediction) will probably be equal to the number of chunks.
At this point, the lengths are ok if prediction_list has the same length as test_set_feature_list.
You probably want to use extend like this:
final_prediction.extend(regr.predict(test_set_feature_list))
Which is also easier to read.
Then the length of final_prediction should be fine, and it should be a single list, rather than a list of lists.
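To see the difference on a toy example (outside the full pipeline, just append vs extend):
import numpy as np

preds = np.array([1.0, 2.0])   # what regr.predict() returns for a two-row test set

by_append = []
by_append.append(preds)        # one list element per chunk
print(len(by_append))          # 1

by_extend = []
by_extend.extend(preds)        # one list element per predicted value
print(len(by_extend))          # 2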
