In order to achieve Byzantine Consensus, these three rules must hold: during every round each process must send a bit (with value either 0 or 1) to the other n-1 processes; before the start of a new round each process must have received a bit from every other process; and in each round every process must have sent its bit to all the other processes.
I need to implement the following Rabin-based, Monte Carlo Byzantine Consensus randomized protocol for 661 non-traitor processes out of 991 total processes:
Variables:
b(i) = value to send to all other processes, initially = input value
p = global random number, which can be either 0 or 1, picked afresh for every round
maj(i) = majority value
tally(i) = number of times the majority value appears among the received values
Implementation:
loop = TRUE
while (loop)
1. send b(i) to the other n−1 processes
2. receive the values sent by the other n−1 processes
3. set maj(i) = majority value among the received values
4. set tally(i) = number of times the majority value appears
5. if tally(i) ≥ 2t + 1
then b(i) = maj(i)
else if (p = 1) then b(i) = 1
else b(i) = 0
My question is: how can I implement the data structure that stores, for every process, the bits it has sent and received, and how can I implement the sending mechanism itself? I was thinking of giving each process an array A.i[j], where j ranges over all process ids. I have heard it is possible to make processes read the values of the other n-1 processes from such a table instead of implementing a sending mechanism; how could I implement this?
I solved this problem by creating a Process class with a type attribute, a vote attribute, and setVote()/getVote() methods:
Type: reliable or unreliable
Vote: the vote (v0 at the beginning; it converges to v by following the protocol)
setVote(): sets the vote.
getVote(): if the type is reliable, return the bit; otherwise return a random bit (0 or 1).
To simulate a distributed algorithm, you can run the protocol in separate threads, each one owning a Process instance, and have them communicate through a shared array that you update every round.
You can also simulate everything starting from an array of these objects, without the need for separate threads.
I believe that should be enough to handle the model.
something like this:
export enum ProcessType { reliable, unreliable }

export class Process {
  constructor(public pid: number, public vote: number, public type: ProcessType) { }

  // Store the bit, filtered through send(), so an unreliable process may corrupt its own state too.
  set(bit: number) {
    this.vote = this.send(bit);
  }

  // A reliable process reports the real bit; an unreliable one reports a random bit.
  send(b: number) {
    if (this.type === ProcessType.unreliable) {
      return Math.round(Math.random());
    }
    return b;
  }

  get() {
    return { pid: this.pid, vote: this.vote, type: this.type };
  }

  getVote() {
    return this.send(this.vote);
  }
}
Good luck!
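To make the shared-table idea from the question concrete, here is a minimal Python sketch of one way it could look (my own assumptions: n = 991 processes, t = 330 traitors, so the threshold 2t + 1 = 661; the class and helper names are illustrative only). Every process writes its bit into a table instead of sending messages, and each round every process simply reads its own row.

import random

N, T = 991, 330   # assumed: 991 processes total, 330 traitors, 661 loyal

class Process:
    def __init__(self, pid, vote, reliable):
        self.pid = pid
        self.vote = vote
        self.reliable = reliable

    def send(self):
        # A loyal process reports its real bit; a traitor reports an arbitrary bit.
        return self.vote if self.reliable else random.randint(0, 1)

def run_round(processes, p):
    # Shared table: table[i][j] is the bit process j "sent" to process i this round,
    # so each process reads its own row instead of using a real messaging mechanism.
    table = [[proc.send() for proc in processes] for _ in processes]
    for i, proc in enumerate(processes):
        if not proc.reliable:
            continue                 # traitors ignore the protocol
        received = table[i]
        maj = 1 if 2 * sum(received) > len(received) else 0
        tally = received.count(maj)
        proc.vote = maj if tally >= 2 * T + 1 else p

# Run rounds with a fresh global coin p until all loyal processes hold the same bit.
processes = [Process(i, random.randint(0, 1), reliable=(i < N - T)) for i in range(N)]
while len({q.vote for q in processes if q.reliable}) > 1:
    run_round(processes, p=random.randint(0, 1))
print("agreed on", next(q.vote for q in processes if q.reliable))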
I would like to join the output of two different MapReduce jobs. I want to be able to do something like I have below, but I cannot figure out how to reuse results from previous jobs and join them. How could I do this?
Job1:
Andrea Vanzo, c288f70f-f417-4a96-8528-25c61372cae7, 125
Job2:
c288f70f-f417-4a96-8528-25c61372cae7, 071e1103-1b06-4671-8324-a9beb3e90d18, 25
Result:
Andrea Vanzo, c288f70f-f417-4a96-8528-25c61372cae7, 25
You can use JobControl to set up the workflow between your MapReduce jobs. Alternatively, reading job1's and job2's output in a single job (using MultipleInputs) could also solve your problem.
Use different processing logic depending on the path the data comes from, and tag the output accordingly.
mapper
record from job1's path => split, write key = data[1], value = data[0] + "tagjob1"
record from job2's path => split, write key = data[0], value = data[2] + "tagjob2"
reducer
each key now has its set of values;
put the values into two lists, grouped by their tag;
write the key together with every pair in the Cartesian product of the two lists.
Hope this helps.
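In case it helps to see the tag-and-join idea end to end, here is a small plain-Python sketch of the reduce-side join described above (no Hadoop; the field positions and tag names are assumptions based on the sample rows in the question):

from collections import defaultdict
from itertools import product

# Sample output rows of the two jobs, taken from the question.
job1_out = [("Andrea Vanzo", "c288f70f-f417-4a96-8528-25c61372cae7", "125")]
job2_out = [("c288f70f-f417-4a96-8528-25c61372cae7",
             "071e1103-1b06-4671-8324-a9beb3e90d18", "25")]

# "Map" phase: key every record on the shared id and tag it with its source job.
grouped = defaultdict(lambda: {"tagjob1": [], "tagjob2": []})
for name, shared_id, total in job1_out:
    grouped[shared_id]["tagjob1"].append(name)
for shared_id, other_id, score in job2_out:
    grouped[shared_id]["tagjob2"].append(score)

# "Reduce" phase: per key, emit the Cartesian product of the two tagged lists.
for shared_id, tagged in grouped.items():
    for name, score in product(tagged["tagjob1"], tagged["tagjob2"]):
        print(name, shared_id, score, sep=", ")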
I have this dataframe :
order_id product_id user_id
2 33120 u202279
2 28985 u202279
2 9327 u202279
4 39758 u178520
4 21351 u178520
5 6348 u156122
5 40878 u156122
Type user_id : String
Type product_id : Integer
I would like to use this dataframe to create a Doc2vec corpus. So I need to use LabeledSentence to create entries of the form:
{tags: user_id, words: all product ids ordered by that user_id}
But the dataframe shape is (32434489, 3), so I should avoid using a plain loop to create my LabeledSentence objects.
I tried to run the function below with multiprocessing, but it is too slow.
Do you have any idea how to transform my dataframe into the right format for a Doc2vec corpus, where the tag is the user_id and the words are the list of products ordered by that user_id?
def append_to_sequences(i):
    user_id = liste_user_id.pop(0)
    liste_produit_userID = data.ix[data["user_id"] == user_id, "product_id"].astype(str).tolist()
    return doc2vec.LabeledSentence(words=liste_produit_userID, tags=user_id)

pool = multiprocessing.Pool(processes=3)
result = pool.map_async(append_to_sequences, np.arange(len_liste_unique_user))
pool.close()
pool.join()
sentences = result.get()
Using multiprocessing is likely overkill. The forking of processes can wind up duplicating all existing memory, and involve excess communication marshalling results back into the master process.
Using a loop should be OK. Roughly 32 million rows (and far fewer unique user_ids) isn't that much, depending on your RAM.
Note that in recent versions of gensim TaggedDocument is the preferred class for Doc2Vec examples.
If we were to assume you have a list of all unique user_ids in liste_user_id, and a (new, not shown) function that gets the list-of-words for a user_id called words_for_user(), creating the documents for Doc2Vec in memory could be as simple as:
documents = [TaggedDocument(words=words_for_user(uid), tags=[uid])
for uid in liste_user_id]
Note that tags should be a list of tags, not a single tag – even though in many common cases each document only has a single tag. (If you provide a single string tag, it will see tags as a list-of-characters, which is not what you want.)
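If it helps, here is one way words_for_user could be realized with a single pandas groupby instead of a per-user filter (a sketch; df and the column names follow the sample dataframe in the question):

import pandas as pd
from gensim.models.doc2vec import TaggedDocument

df = pd.DataFrame({
    "order_id":   [2, 2, 2, 4, 4, 5, 5],
    "product_id": [33120, 28985, 9327, 39758, 21351, 6348, 40878],
    "user_id":    ["u202279"] * 3 + ["u178520"] * 2 + ["u156122"] * 2,
})

# One pass over the frame: the list of product ids (as strings) per user.
words_by_user = df["product_id"].astype(str).groupby(df["user_id"]).apply(list)

documents = [TaggedDocument(words=words, tags=[uid])
             for uid, words in words_by_user.items()]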
Hi, I am making a statistical application in C++ with Qt.
I need to make many calculations over a table holding the output of a multivariate cluster analysis:
Var1, Var2, Var3, ..., VarN, k2, k3, k4, ..., kn
where Var1 to VarN are the variables under study,
and k2 to kn are the cluster classifications.
Table Example:
Var1,Var2,Var3,Var4,k2,k3,k4,k5,k6
3464.57,2992.33,2688.33,504.79,2,3,2,3,2
2895.32,3365.35,2824.35,504.86,1,2,3,2,6
2249.32,3300.19,2382.19,504.92,2,1,4,3,4
3417.81,3311.04,2426.04,504.97,1,2,2,5,2
3329.66,3497.14,2467.14,505.03,2,2,1,4,2
3087.85,3653.53,2296.53,505.09,2,1,2,3,4
The C++ storage will be defined like:
struct Record
{
    QList<double> vars;
    QList<int> cluster;
};

QList<Record> table;
I need to calculate the total, the within-group, and the between-group sums of squares.
https://en.wikipedia.org/wiki/F-test
So, for example, to calculate the WSS for Var1 and k2 I need to, in pseudo code:
get the size of every group:
count(*) group by (k2)
calculate the mean of every group:
sum(Var1) group by (k2), then divide each sum by the corresponding count
compute the squared difference:
pow((xgroup1 - xmeangroup1), 2)
and many other operations...
Which alternative would give the easiest and most powerful implementation:
1) Create a MySQL table on the fly and run SQL operations on it.
2) Use LINQ, but I don't know whether Qt has a QtLinq class.
3) Go through Python equivalents of the LINQ methods
(how is the interaction between Qt and Python? I see that QGIS has many plugins written in Python).
Also, my app needs to make many other calculations.
I hope to be clear.
Greetings
After some time I am answering my own question:
the solution was implemented in Python with pandas.
This link is very useful:
Iterating through groups: http://pandas.pydata.org/pandas-docs/stable/groupby.html
Also the book "Python for Data Analysis" by Wes McKinney, page 255.
This video shows how to make the calculation:
ANOVA 2: Calculating SSW and SSB (total sum of squares within and between) | Khan Academy
https://www.youtube.com/watch?v=j9ZPMlVHJVs
[code]
import numpy as np
import pandas as pd

def getDFrameFixed2D():
    y = np.array([3, 2, 1, 5, 3, 4, 5, 6, 7])
    k = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
    clusters = pd.DataFrame([[a, b] for a, b in zip(y, k)], columns=['Var1', 'K2'])
    # print(clusters.head()); print("shape(0):", clusters.shape[0])
    return clusters

X2D = getDFrameFixed2D()

MainMean = X2D['Var1'].mean(0)
print("Main mean:", MainMean)

grouped = X2D['Var1'].groupby(X2D['K2'])

print("-----Iterating Over Groups-------------")
Wss = 0
Bss = 0
for name, group in grouped:
    # print(type(name)); print(type(group))
    print("Group key:", name)
    groupmean = group.mean(0)
    groupss = sum((group - groupmean) ** 2)
    print(" groupmean:", groupmean)
    print(" groupss:", groupss)
    Wss += groupss
    Bss += ((groupmean - MainMean) ** 2) * len(group)

print("----------------------------------")
print("Wss:", Wss)
print("Bss:", Bss)
print("T=B+W:", Bss + Wss)
Tss = np.sum((X2D['Var1'] - MainMean) ** 2)
print("Tss:", Tss)
print("----------------------------------")
[/code]
I am sure this could also be done with aggregates (lambda functions) or apply, but I can't quite figure out how; a rough sketch of the idea is below
(if somebody knows a cleaner way, please post here).
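A possible groupby/aggregate sketch of the same sums (my own guess, not benchmarked; it reuses the X2D frame from the code above):

[code]
grouped = X2D.groupby('K2')['Var1']
counts = grouped.size()
means = grouped.mean()

Wss = grouped.agg(lambda g: ((g - g.mean()) ** 2).sum()).sum()
Bss = (((means - X2D['Var1'].mean()) ** 2) * counts).sum()
Tss = ((X2D['Var1'] - X2D['Var1'].mean()) ** 2).sum()
print("Wss:", Wss, "Bss:", Bss, "Tss:", Tss)   # Tss should equal Wss + Bss
[/code]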
Greetings
I am now writing a job-to-job list. For a JobID, I would like to output its similar jobs (ranked by a self-defined score in descending order). For example, the structure should be:
"_id":"abcd","job":1234,"jobList":[{"job":1,"score":0.9},{"job":2,"score":0.8},{"job":3,"score":0.7}]}
Here JobID is 1234 and job 1,2,3 are its similar jobs listed by name-score pairs.
And my Python code is:
import csv
import json

def sortSparseMatrix(m, rev=True, only_indices=True):
    f = open("/root/workspace/PythonOutPut/test.json", 'wb')
    w = csv.writer(f, dialect='excel')
    col_list = [None] * (m.shape[0])
    j = 0
    for i in xrange(m.shape[0]):
        d = m.getrow(i)
        if len(d.indices) != 0:
            s = zip(d.indices, d.data)
            s.sort(key=lambda v: v[1], reverse=True)
            if only_indices:
                col_list[j] = [[element[0], element[1]] for element in s]
                col_list[j] = col_list[j][0:4]
                h1 = u'Job' + ":" + str(col_list[j][0][0]) + ","
                json.dump(h1, f)
                h2 = []
                h3 = u'JobList' + ":"
                json.dump(h3, f)
                for subrow in col_list[j][1:]:
                    h2.append(u'{Job' + ":" + str(subrow[0]) + "," + u'score' + ":" + str(subrow[1]) + "}")
                json.dump(h2, f)
                del col_list[j][:]
                j = j + 1
Here d contains the unsorted name-score pairs for the JobID col_list[j][0][0] (after sorting, the most similar job, i.e. the one with the highest score for that JobID, is the JobID itself). d.data holds the scores and [element[0], element[1]] is a name-score pair. I would like to keep the three most similar jobs for each JobID, dump h1 first (to show the JobID), and then output the list of similar jobs in h2.
I typed 'mongoimport --db test_database --collection TestOfJSON --type csv --file /as above/ --fields JobList'. It does import the result into MongoDB, but it ends up as a single JobID with many fields, whereas what I want is each JobID associated only with its similar jobs' name-score pairs. How should I do this? Thanks.
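One way to get that shape would be to write one complete JSON document per JobID (one per line) and import the file with mongoimport in its default JSON mode instead of --type csv. A minimal sketch of that document-building step (the helper name and the hard-coded pair list are only illustrations, reusing the example values from the question):

import json

def dump_job_documents(rows, path):
    # rows: iterable of sorted (job, score) pairs per JobID, best match (the job itself) first.
    with open(path, "w") as f:
        for pairs in rows:
            doc = {
                "job": int(pairs[0][0]),                        # the JobID itself
                "jobList": [{"job": int(j), "score": float(s)}  # its top three similar jobs
                            for j, s in pairs[1:4]],
            }
            f.write(json.dumps(doc) + "\n")   # one document per line for mongoimport

# Usage with the example values from the question:
dump_job_documents([[(1234, 1.0), (1, 0.9), (2, 0.8), (3, 0.7)]],
                   "/root/workspace/PythonOutPut/test.json")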