NLP BERT multiprocessing text summaries - Python

I have a dataframe of over 1M articles that I have cleaned in order to train a BERT text-summarization model, which in turn gets fed into a QA pipeline to ultimately train a T5 closed-book QA model. I used multithreading before to vastly improve scraping times. Is there an equivalent of multiprocessing for BERT?
The current flow looks like this:
import torch
import pandas as pd
from transformers import BartTokenizer, BartForConditionalGeneration

def longBatch(context):
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    # tokenize the whole article without truncation, then split it into model-sized chunks
    inputs_no_trunc = tokenizer(context, max_length=None, return_tensors='pt', truncation=False)
    chunk_start = 0
    chunk_end = tokenizer.model_max_length  # == 1024 for BART
    inputs_batch_lst = []
    while chunk_start <= len(inputs_no_trunc['input_ids'][0]):
        inputs_batch = inputs_no_trunc['input_ids'][0][chunk_start:chunk_end]
        inputs_batch = torch.unsqueeze(inputs_batch, 0)
        inputs_batch_lst.append(inputs_batch)
        chunk_start += tokenizer.model_max_length
        chunk_end += tokenizer.model_max_length
    # generate a summary for each chunk
    summary_ids_lst = [model.generate(inputs, num_beams=4, max_length=100, early_stopping=True) for inputs in inputs_batch_lst]
    summary_batch_lst = []
    for summary_id in summary_ids_lst:
        summary_batch = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_id]
        summary_batch_lst.append(summary_batch[0])
    # join the chunk summaries and hand them back to the caller
    summary_all = '\n'.join(summary_batch_lst)
    return summary_all
listarray = []
for row in x["0"]:
    try:
        # checkpoint the summaries collected so far
        pd.DataFrame(listarray).to_csv("/drive/somepath/sum.csv")
        print(len(listarray))
        listarray.append(longBatch(row))
        print("Done with line")
    except Exception:
        print("error: " + row)
Now this works fine as is, with each article in its own row being processed one at a time, but it is fairly slow. Multithreading took my scrape times into the realm of roughly 500,000 articles a day, but this summarization flow peaks at roughly 5,500 a day. I've tried defining the processes and running them in a pool, but the time actually increased by 6 seconds per summarization. The system isn't strained feeding one article at a time, so I'd imagine it could handle quite a bit more. How would I go about parallelizing the process?
For reference, I am testing each section of the script in Colab Pro+, which has a Tesla V100 GPU and an 8-core CPU, before moving to my local machine.
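One approach worth sketching here: on a single V100 the cheapest speedup is usually not CPU multiprocessing but batching, i.e. loading the model onto the GPU once and passing several articles to one model.generate call. A minimal sketch under assumptions not in the question (articles are truncated to 1024 tokens instead of chunked, and batch_size is a guess to tune against GPU memory):

import torch
from transformers import BartTokenizer, BartForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)

def summarize_batch(articles, batch_size=8):
    # summarize a list of article strings, several per forward pass
    summaries = []
    for start in range(0, len(articles), batch_size):
        batch = articles[start:start + batch_size]
        # pad/truncate so the batch forms one rectangular tensor
        inputs = tokenizer(batch, max_length=1024, truncation=True, padding=True, return_tensors="pt").to(device)
        with torch.no_grad():
            ids = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"],
                                 num_beams=4, max_length=100, early_stopping=True)
        summaries.extend(tokenizer.batch_decode(ids, skip_special_tokens=True))
    return summaries

If the per-chunk summaries of very long articles matter, the same idea applies one level down: collect the 1024-token chunks from several articles into one batch before calling generate.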

Related

Solving memory issues when using Gensim LDA Multicore

For my project I am trying to use unsupervised learning to identify different topics from application descriptions, but I am running into a strange problem. I have 3 different datasets: one with 15k documents, another with 50k documents, and a last one with 2M documents. I am testing models with different numbers of topics (k) ranging from 5 to 100 with a step size of 5, in order to check which k yields the best model, assessed initially by the highest coherence score. For each k, I also build 3 different models with chunksizes of 10, 100, and 1000.
So now moving on to the problem. My own machine is obviously too slow and does not have enough cores for this kind of computation, so I am using my university's server. The problem is that my program seems to consume far too much memory, and I am unsure of the reason. I have already made adjustments so that the corpus is not loaded entirely into memory (or at least I think I have). With the 50k-document dataset, by iteration k=50 (so halfway) the program has already consumed the allotted 100 GB of memory, which seems enormous.
I would appreciate any help in the right direction, and thanks for taking the time to look at this. Below is the code from my topic_modelling.py file. The comments in the file are a bit outdated, sorry about that.
class MyCorpus:
    texts: list
    dictionary: dict

    def __init__(self, descriptions, dictionary):
        self.texts = descriptions
        self.dictionary = dictionary

    def __iter__(self):
        for line in self.texts:
            try:
                # assume there's one document per line, tokens separated by whitespace
                yield self.dictionary.doc2bow(line)
            except StopIteration:
                pass
# Function that, given a dataframe, creates a dictionary and corpus.
# These are used to create an LDA model. Here we automatically use the Description column
# from each dataframe.
def create_dict_and_corpus(df):
    text_descriptions = remove_characters_and_create_list(df, 'Description')
    # print(text_descriptions)
    dictionary = gensim.corpora.Dictionary(text_descriptions)
    corpus = MyCorpus(text_descriptions, dictionary)
    return text_descriptions, dictionary, corpus
# Given a dataframe and a column name in the dataframe, extract all words and return a list.
# Also removes all characters that are not alphanumeric or spaces.
def remove_characters_and_create_list(df, column_name, split=True):
    df[column_name] = df[column_name].astype(str)
    texts = []
    for x in range(df[column_name].size):
        current_string = df[column_name][x]
        filtered_string = re.sub(r'[^A-Za-z0-9 ]+', '', current_string)
        if split:
            texts.append(filtered_string.split())
        else:
            texts.append(filtered_string)
    return texts
# This function, given the parameters, creates an LDA model for each number of topics between
# the start limit and the end limit. After this the coherence and perplexity are calculated
# for each of those models and saved in a csv file to analyze later.
def test_lda_models(text, corpus, dictionary, start_limit, end_limit, path):
    results = []
    print("============Starting topic modelling============")
    for k in range(start_limit, end_limit+1, 5):
        for p in range(1, 4):
            chunk = pow(10, p)
            t0 = time.time()
            lda_model = gensim.models.ldamulticore.LdaMulticore(corpus,
                                                                num_topics=k,
                                                                id2word=dictionary,
                                                                passes=p,
                                                                chunksize=chunk)
            # To calculate the goodness of the model
            perplexity = lda_model.bound(corpus)
            coherence_model_lda = CoherenceModel(model=lda_model, texts=text, dictionary=dictionary, coherence='c_v')
            coherence_lda = coherence_model_lda.get_coherence()
            t1 = time.time()
            print(f"=====Done K={k} model with passes={p} and chunksize={chunk}, took {t1-t0} seconds=====")
            results.append((k, chunk, coherence_lda, perplexity))
    # Storing the results in a csv file, except the actual lda model (this would not make sense)
    path = make_dir_if_not_exists(path)
    list_tuples_to_csv(results, ['#OfTopics', 'ChunkSize', 'CoherenceScore', 'Perplexity'], f"{path}/K={start_limit}to{end_limit}.csv")
    return results
# Function to plot the visualization of an LDA model. This visualization is then
# saved as an html file inside the given path.
def single_lda_model_visualization(k, c, corpus, dictionary, lda_model, path):
    vis = gensimvis.prepare(lda_model, corpus, dictionary)
    pyLDAvis.save_html(vis, f"{path}/visualization.html")

# Given the results produced by test_lda_models, loop through the models and save the
# topic words of each model and the visualization of the topics in the given path.
def save_lda_result(k, c, lda_model, corpus, dictionary, path):
    list_tuples_to_csv(lda_model.print_topics(num_topics=k), ['Topic#', 'Associated Words'], f"{path}/associated_words.csv")
    single_lda_model_visualization(k, c, corpus, dictionary, lda_model, path)
# This is the entire pipeline that needs to be performed for a single dataset,
# which includes computing the LDA models from start to end limit and calculating
# and saving the topic words and visual graphs for the top n topics with the highest
# coherence score.
def perform_topic_modelling_single_df(df, start_limit, end_limit, path):
    # Extracting the necessary data required for LDA model computation
    text_descriptions, dictionary, corpus = create_dict_and_corpus(df)
    results_lda = test_lda_models(text_descriptions, corpus, dictionary, start_limit, end_limit, path)
    # Sorting the results based on the coherence value (index 2 of each tuple)
    results_lda.sort(key=lambda x: x[2], reverse=True)
    # Getting the top 5 results to pass to the save_lda_result function
    results = results_lda[:5]
    corpus_for_saving = [dictionary.doc2bow(text) for text in text_descriptions]
    texts = remove_characters_and_create_list(df, 'Description', split=False)
    # Perform application-to-topic modelling for the best lda model based on the
    # coherence score (TODO maybe test with other lda models?)
    print("getting descriptions for csv")
    for k, c, _, _ in results:
        dir_path = make_dir_if_not_exists(f"{path}/k={k}_chunk={c}")
        p = int(math.log10(c))
        lda_model = gensim.models.ldamulticore.LdaMulticore(corpus,
                                                            num_topics=k,
                                                            id2word=dictionary,
                                                            passes=p,
                                                            chunksize=c)
        print(f"=====REDOING K={k} model with passes={p} and chunksize={c}=====")
        save_lda_result(k, c, lda_model, corpus_for_saving, dictionary, dir_path)
        application_to_topic_modelling(df, k, c, lda_model, corpus_for_saving, texts, dir_path)
# Performs the whole topic modelling pipeline, taking different genre data sets
# and the entire dataset as a whole.
def perform_topic_modelling_pipeline(path_ex):
    # entire_df = pd.read_csv("../data/preprocessed_data/preprocessed_10000_trial.csv")
    entire_df = pd.read_csv(os.path.join(ROOT_DIR, f"data/preprocessed_data/preprocessedData_{path_ex}.csv"))
    print("size of df")
    print(entire_df.shape)
    # For the entire df, go from the start limit to nGenres to find the best LDA model
    nGenres = row_counter(os.path.join(ROOT_DIR, f"data/genre_wise_data/data{path_ex}/genre_frequency.csv"))
    nGenres_rounded = math.ceil(nGenres / 5) * 5
    print(f"Original number of genres should be {nGenres}, but we are rounding to {nGenres_rounded}")
    path = make_dir_if_not_exists(os.path.join(ROOT_DIR, f"results/data{path_ex}/aall_data"))
    perform_topic_modelling_single_df(entire_df, 5, 100, path)
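One way to keep the bag-of-words corpus itself out of RAM is to serialize it once with gensim's MmCorpus and let every LdaMulticore run stream it back from disk; note this does not touch the separate cost of the c_v CoherenceModel, which still needs the tokenized texts in memory. A minimal sketch, with the file path being an assumption:

import gensim

def build_streamed_corpus(text_descriptions, mm_path="corpus.mm"):
    dictionary = gensim.corpora.Dictionary(text_descriptions)
    # write each doc2bow vector straight to disk as it is produced
    gensim.corpora.MmCorpus.serialize(mm_path, (dictionary.doc2bow(doc) for doc in text_descriptions))
    # MmCorpus is a lazy, iterable view over the file, so repeated passes re-read from disk
    return dictionary, gensim.corpora.MmCorpus(mm_path)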

6 GB RAM Fails in Vectorizing text using Word2Vec

I'm trying to do a basic tweet sentiment analysis using word2vec and tf-idf scores on a dataset of 1.6M tweets, but my machine with a 6 GB Nvidia GeForce card fails to do so. Since this is my first practice project in machine learning, I'm wondering what I'm doing wrong: the dataset is all text, so it shouldn't take this much RAM, yet my laptop freezes in the tweet2vec function or raises a MemoryError in the scaling part. Below is the part of my code where everything collapses.
The last thing is that I've tried it with up to 1M rows of data and it worked, so I'm curious what causes the problem.
# --------------- calculating word weight for using later in word2vec model & bringing words together ---------------
def word_weight(data):
    vectorizer = TfidfVectorizer(sublinear_tf=True, use_idf=True)
    d = dict()
    for index in tqdm(data, total=len(data), desc='Assigning weight to words'):
        # --------- try/except catches the empty indexes ----------
        try:
            matrix = vectorizer.fit_transform([w for w in index])
            tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
            d.update(tfidf)
        except ValueError:
            continue
    print("every word has weight now\n"
          "--------------------------------------")
    return d
# ------------------- bringing tokens with weight to recreate tweets ----------------
def tweet2vec(tokens, size, tfidf):
    count = 0
    for index in tqdm(tokens, total=len(tokens), desc='creating sentence vectors'):
        # ---------- size is the dimension of word2vec model (200) ---------------
        vec = np.zeros(size)
        for word in index:
            try:
                vec += model[word] * tfidf[word]
            except KeyError:
                continue
        tokens[count] = vec.tolist()
        count += 1
    print("tweet vectors are ready for scaling for ML algorithm\n"
          "-------------------------------------------------")
    return tokens
dataset = read_dataset('training.csv', ['target', 't_id', 'created_at', 'query', 'user', 'text'])
dataset = delete_unwanted_col(dataset, ['t_id', 'created_at', 'query', 'user'])
dataset_token = [pre_process(t) for t in tqdm(map(lambda t: t, dataset['text']),
                                              desc='cleaning text', total=len(dataset['text']))]
print('pre_process completed, list of tweet tokens is returned\n'
      '--------------------------------------------------------')
X = np.array(tweet2vec(dataset_token, 200, word_weight(dataset_token)))
print('scaling vectors ...')
X_scaled = scale(X)
print('features scaled!')
The data given to the word_weight function is a (1599999, 200)-shaped list, where each index consists of pre-processed tweet tokens.
I appreciate your time and answers in advance, and of course I'm glad to hear about better approaches for handling big datasets.
If I understood correctly, it works with 1M tweets, but fails with 1.6M tweets? So you know the code is correct.
If the GPU is running out of memory when you think it shouldn't, the memory may still be held by a previous process. Use nvidia-smi to check which processes are using the GPU and how much memory. If (before you run your code) you spot Python processes in there holding a big chunk, it could be a crashed process, a Jupyter window still open, etc.
I find it useful to watch nvidia-smi (not sure if there is a Windows equivalent) to see how GPU memory changes as training progresses. Normally a chunk is reserved at the start and then it stays fairly constant. If you see it rising linearly, something could be wrong with the code (are you re-loading the model on each iteration, something like that?).
My problem was solved when I changed the code (the tweet2vec function) to this
(w is the word-weight dictionary):
def tweet2vec(tokens, size, tfidf):
    # ------------- size is the dimension of word2vec model (200) ---------------
    vec = np.zeros(size).reshape(1, size)
    count = 0
    for word in tokens:
        try:
            vec += model[word] * tfidf[word]
            count += 1
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec

X = np.concatenate([tweet2vec(token, 200, w) for token in tqdm(map(lambda token: token, dataset_token),
                                                               desc='creating tweet vectors',
                                                               total=len(dataset_token))]
                   )
I have no idea why!!!!

Is there a faster way to preprocess huge amount of text data in Python?

I'm building a sentiment analysis algorithm to predict the score of IMDb reviews. I wanted to do it from scratch, so I scraped half a million reviews and created my own dataset.
I'm sending small review packages (each consisting of 50 reviews) to review_cleaner with a pool. That helped me reduce the run time from 40 minutes to 11 minutes for 1,000 reviews. But I have half a million reviews, so I need a faster way to process them. I was thinking: is it possible to run it on my GPU (GTX 1060 6GB)? I installed CUDA, but I couldn't find out how to run a specific function (review_cleaner) on GPU cores.
Basically, what I need is a solution to run the preprocessing faster. I have searched and tried many different things but couldn't manage it. Is there any way to run it faster?
def filling_the_database(review_data):
    try:
        c.executemany("""INSERT INTO preprocessed_reviews(review, review_score) VALUES (?, ?)""", review_data)
        conn.commit()
    except Error as e:
        print(e)

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def review_cleaner(review):
    lemmatizer = WordNetLemmatizer()
    new_review_data = ()
    bulk_data = []
    for each in review:
        review_temp = ''.join([i for i in each[0] if not i.isdigit()])
        review_temp = REPLACE_NO_SPACE.sub(" ", review_temp.lower())
        review_temp = nltk.word_tokenize(review_temp)
        review_temp = (lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in review_temp)
        review_temp = ' '.join([word for word in review_temp if word not in stopwords.words('english')])
        new_review_data = (review_temp, each[1])
        bulk_data.append(new_review_data)
    filling_the_database(bulk_data)

if __name__ == "__main__":
    review_data = ()
    bulk_data = []
    amount_of_reviews = 0
    previous_amount = 0
    conn = create_connection('2020-04-11')
    c = conn.cursor()
    c.execute("""CREATE TABLE IF NOT EXISTS preprocessed_reviews(review TEXT, review_score INTEGER, ID PRIMARY KEY)""")
    conn.commit()
    total_number_of_reviews = c.execute("""SELECT COUNT(*) FROM movie_reviews""")
    for each in total_number_of_reviews:
        total_number_of_reviews = each[0]
    while amount_of_reviews < total_number_of_reviews:
        review_package = []
        amount_of_reviews += 50
        data = c.execute("""SELECT * FROM movie_reviews WHERE ID BETWEEN (?) AND (?)""", (previous_amount, amount_of_reviews-1))
        conn.commit()
        previous_amount = amount_of_reviews
        for each in data:
            review_data = (each[0], each[1])
            review_package.append(review_data)
            del review_data
        bulk_data.append(review_package)
        del review_package
        print(amount_of_reviews)
    p = Pool(4)
    p.map(review_cleaner, bulk_data)
    p.close()
    print('---- %s seconds ----' % (time.time() - start_time))
I'm storing around half a million (400k) reviews in an SQLite database: one column for the review and one column for the score of the review. In another table, I'm inserting the preprocessed reviews the same way, one column for the review and one column for the score. I have 16 GB of RAM, an Intel i7-6700HQ, an SSD, and a GTX 1060 6GB.
A few thoughts crossed my mind.
Reading from and writing to SQLite may have quite a lot of overhead, when in reality you could fit 500k reviews in your 16 GB of RAM. You could do this by dumping your data to a tabulated CSV file and then reading it in with pandas to do the preprocessing. You could also use pandarallel to parallelise the work instead of using a pool, to make your life easier.
If SQLite is not the bottleneck, then it's likely a computational bottleneck, in which case I would look at running the process overnight, or hiring a cloud-compute instance with good CPU resources. A 16-core machine wouldn't be too expensive to rent on AWS for a short amount of time, and that would give you a theoretical 4x speedup.
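As a rough sketch of the CSV-plus-pandarallel route (the file name, column names, and the toy cleaner below are assumptions standing in for the existing review_cleaner logic):

import pandas as pd
from pandarallel import pandarallel

# assumed: the reviews were dumped from SQLite to a CSV with 'review' and 'review_score' columns
df = pd.read_csv("reviews.csv")

def clean_one_review(text):
    # placeholder for the per-review cleaning done in review_cleaner
    return text.lower()

pandarallel.initialize(progress_bar=True)  # one worker per CPU core by default
df["clean_review"] = df["review"].parallel_apply(clean_one_review)
df.to_csv("preprocessed_reviews.csv", index=False)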

How to do multidimensional time-series prediction in PyBrain?

This is a repost from the PyBrain Google group: https://groups.google.com/forum/#!topic/pybrain/J9qv0nHuxVY.
I've been tinkering with OpenNN and FANN and have yet to find an ANN library that does what I need.
I'll break this up into short-term and medium-term goals, but first a little background...
Background
I want to use an ANN to do time-series prediction of a 1000-2000 item vector varying over time. Each element is a Boolean value that represents the presence of a cluster of visual properties in a computer vision system at a particular moment in time.
The idea is to feed the network the vector at time t-1 with the vector at time t as a target value.
The output of the network will then be a prediction of what is expected to happen in the current time (t) based on the previous state (t-1) of the vector.
Short Term
I would like to train an ANN so that it can learn to predict these vectors over time. That is, I would like to feed it an arbitrary vector and have it return the vector that it has been trained to expect next. It's fine to use a finite dataset for now, and I expect to start with normal epoch learning. I'm starting with a normal supervised data set where the inputs and targets are offset by one time unit. Thus far I have not been getting results as good as I was able to get in FANN in terms of MSE (no significant decrease of error after the first epoch), as described here: https://groups.google.com/forum/#!topic/pybrain/QSfVHsFRXz0.
In FANN I was just using a simple MLP with 1026 inputs, 103 hidden, and 1026 outputs. Boolean inputs were scaled to -1 to 1 and weights were initialized to random values between -1 and 1. (This was done because apparently learning is faster with negative values than just 0-1). The network was reproducing input patterns quite well and ended up with a small MSE.
In PyBrain this is the current version of the code:
#!/usr/bin/python
# First try at using pyBrain for building an ANN. We'll start with an MLP for obvious reasons.
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.tools.validation import ModuleValidator
import pickle, time

# Load the dataset (parse data)
print "Loading and parsing dataset..."
inputFile = open("../MLP/data/113584_backgroundState-filter-1.fann", 'r')
rawData = inputFile.readlines()
inputFile.close()
processedData = list()
for inputLine in rawData[1:]:  # all but the first line
    data = inputLine.split()   # splitting also strips off the final newline
    scaleData = list()
    for item in data:
        scaleData.append(int(item))
    processedData.append(scaleData)

# Create dataset from parsed data
inputData = SupervisedDataSet(1027, 1027)
for index in xrange(0, len(processedData)-1, 2):  # every second line
    inputData.addSample(processedData[index], processedData[index+1])
del processedData  # no longer needed

# Build the same network as in FANN
net = buildNetwork(1027, 103, 1027, bias=True)

# Train the network
print "Training network..."
trainer = BackpropTrainer(net, inputData, verbose=True)
for i in xrange(5):
    startTime = time.time()
    error = trainer.train()
    print "ERROR " + str(i) + " " + str(error)  # test network (calculate error)
    print "ProcessingTime " + str(i) + " " + str(time.time() - startTime)

# Save results
print "Saving network..."
fileObject = open('network.pybrain', 'w')
pickle.dump(net, fileObject)

# For each input pattern, what is the output?
# Compatible with FANN output
print "Testing network and generating results..."
i = 0
for inputPattern in inputData['input']:
    outputPattern = net.activate(inputPattern)
    for j in xrange(len(inputPattern)):
        print "RESULT " + str(i) + " " + str(j) + " " + str(inputPattern[j]) + " " + str(outputPattern[j])
    i += 1
print "Done."
Any recommendations for how to best set this up for learning? (I expect I'll need recurrency, but wanted to compare PyBrain to my previous FANN results with a plain MLP.)
I actually tried building this network with recurrent=True, but in all my tests Python ended up using all the available RAM and crashed (there is 8 GB of RAM on this machine). I'm not sure how I can enable recurrency without a huge increase in memory footprint.
Medium Term
Eventually, the system will have to run online, with inputs fed on the fly and constantly changing. This means epoch training will not be possible, so I need to be able to run a single iteration of the learning algorithm at a time. I realize it will be hard for the ANN to learn, but the good news is there will be no shortage of data points (at least hundreds of thousands). Because there is no fixed set of data, there is no need for convergence; I expect error will rise and fall as new or stable patterns are presented.
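A minimal sketch of what a single online update could look like, using only the PyBrain calls already shown above (the network is the one built earlier; previous_vector and current_vector are assumed to be the t-1 and t observations):

from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

def online_step(net, previous_vector, current_vector):
    step_data = SupervisedDataSet(1027, 1027)
    step_data.addSample(previous_vector, current_vector)  # a dataset holding just this one pair
    trainer = BackpropTrainer(net, step_data)
    error = trainer.train()                    # one backprop pass over the single sample
    prediction = net.activate(current_vector)  # predict the next vector from the new state
    return error, prediction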
Thanks for any comments and suggestions.

Numpy very slow when looping

I am developing an agent-based labor market model in python/numpy. The model focuses on the process of matching workers and firms, which are characterized by l-dimensional bit strings. Workers and firms with closely matching bit strings are matched together.
At this point, the model runs properly and produces the correct output. However, it is extremely slow: it takes around 77 seconds for 20 iterations. (I am running the model on a MacBook Pro with an i5 processor and 8GB of RAM.) By comparison, I originally wrote the model in R, where 20 iterations take roughly 0.5 seconds. This seems really strange, as from everything I have read Python should be faster than R for loops and other programming constructs.
I have spent a good deal of time trying to optimize the code and looking into known problems with numpy. Additionally, I tried running the model in Sage, but didn't notice any difference.
I am attaching key segments of the code below. Please let me know if there are problems with the code or if there are other problems with numpy I may have missed.
Thanks,
Daniel Scheer
Code:
from __future__ import division
from numpy import *
import numpy as np
import time
import math as math

NUM_WORKERS = 1000
NUM_FIRMS = 65
ITERATIONS = 20
HIRING_THRESHOLD = 0.4
INTERVIEW_THRESHOLD = 0.2
RANDOM_SEED = 1
SKILLSET_LENGTH = 50
CONS_RETURN = 1
INC_RETURN = 1
RETURN_COEFF = 1.8
PRODUCTIVITY_FACTOR = 0.001

# "corr" function computes closeness between worker i and firm j
def corr(x, y):
    return 1-(np.sum(np.abs(x-y))/SKILLSET_LENGTH)

# "skill_evolve" function randomly changes a segment of the firm's skill demand bit string
def skill_evolve(start, end, start1, q, j, firms):
    random.seed(q*j)
    return around(random.uniform(0, 1, (end-start1)))

# "production" function computes firm output
def production(prod):
    return (CONS_RETURN*prod)+math.pow(INC_RETURN*prod, RETURN_COEFF)
#"hire_unemp" function loops though unemployed workers and matches them with firms
def hire_unemp(j):
for i in xrange(NUM_WORKERS):
correlation = corr(workers[(applicants[i,0]-1),9:(9+SKILLSET_LENGTH+1)],firms[j,4:(4+SKILLSET_LENGTH+1)])
if (workers[(applicants[i,0]-1),3] == 0 and correlation > HIRING_THRESHOLD and production(correlation*PRODUCTIVITY_FACTOR) >= (production((firms[j,2]+(correlation*PRODUCTIVITY_FACTOR))/(firms[j,1]+1)))):
worker_row = (applicants[i,0]-1)
workers[worker_row,3] = firms[j,0]
workers[worker_row,4] = correlation
workers[worker_row,5] = (workers[worker_row,4]+workers[worker_row,1])*PRODUCTIVITY_FACTOR
firms[j,1] = firms[j,1]+1
firms[j,2] = firms[j,2]+workers[worker_row,5]
firms[j,3] = production(firms[j,2])
workers[worker_row,7] = firms[j,3]/firms[j,1]
#print "iteration",q,"loop unemp","worker",workers[worker_row,0]
break
#"hire_unemp" function loops though employed workers and matches them with firms
def hire_emp(j):
for i in xrange(NUM_WORKERS):
correlation = corr(workers[(applicants[i,0]-1),9:(9+SKILLSET_LENGTH+1)],firms[j,4:(4+SKILLSET_LENGTH+1)])
if (workers[(applicants[i,0]-1),3] > 0 and correlation > HIRING_THRESHOLD and (production((firms[j,2]+(correlation*PRODUCTIVITY_FACTOR))/(firms[j,1]+1) > workers[(applicants[i,0]-1),7]))):
worker_row = (applicants[i,0]-1)
otherfirm_row = (workers[worker_row,3]-1)
#print q,firms[otherfirm_row,0],firms[otherfirm_row,1],"before"
firms[otherfirm_row,1] = firms[otherfirm_row,1]-1
#print q,firms[otherfirm_row,0],firms[otherfirm_row,1],"after"
firms[otherfirm_row,2] = array([max(array([0], float),firms[otherfirm_row,2]-workers[worker_row,5])],float)
firms[otherfirm_row,3] = production(firms[otherfirm_row,2])
workers[worker_row,3] = firms[j,0]
workers[worker_row,4] = correlation
workers[worker_row,5] = (workers[worker_row,4]+workers[worker_row,1])*PRODUCTIVITY_FACTOR
firms[j,1] = firms[j,1]+1
firms[j,2] = firms[j,2]+workers[worker_row,5]
firms[j,3] = CONS_RETURN*firms[j,2]+math.pow(INC_RETURN*firms[j,2],RETURN_COEFF)
workers[worker_row,7] = firms[j,3]/firms[j,1]
#print "iteration",q,"loop emp","worker",workers[worker_row,0]
break
workers = zeros((NUM_WORKERS,9+SKILLSET_LENGTH))
workers[:,0] = arange(1,NUM_WORKERS+1)
random.seed(RANDOM_SEED*1)
workers[:,1] = random.uniform(0,1,NUM_WORKERS)
workers[:,2] = 5
workers[:,3] = 0
workers[:,4] = 0
random.seed(RANDOM_SEED*2)
workers[:, 9:(9+SKILLSET_LENGTH)] = around(random.uniform(0,1,(NUM_WORKERS,SKILLSET_LENGTH)))
random.seed(RANDOM_SEED*3)
firms = zeros((NUM_FIRMS, 4))
firms[:,0] = arange(1,NUM_FIRMS+1)
firms = hstack((firms,around(random.uniform(0,1,(NUM_FIRMS,SKILLSET_LENGTH)))))
start_full = time.time()
for q in arange(ITERATIONS):
    random.seed(q)
    ordering = random.uniform(0,1,NUM_WORKERS).reshape(-1,1)
    applicants = hstack((workers, ordering))
    applicants = applicants[applicants[:,(size(applicants,axis=1)-1)].argsort(),]
    # Hire workers from unemployment
    start_time = time.time()
    map(hire_unemp, xrange(NUM_FIRMS))
    print "Iteration unemp %2d: %2.5f seconds" % (q, time.time() - start_time)
    # Hire workers from employment
    start_time = time.time()
    map(hire_emp, xrange(NUM_FIRMS))
    print "Iteration emp %2d: %2.5f seconds" % (q, time.time() - start_time)
