MapReduce: simple score aggregation for common queries - Python

One of my mappers produces logs distributed across files like part-0, part-1, part-2, etc. Each of these contains some queries and an associated score for each query:
part-0
q score
1 ben 10 4.01
horse shoe 5.96
...
part-1
1 ben 10 3.23
horse shoe 2.98
....
and so on for part-2,3 etc.
Now the same query q, i.e. "1 ben 10" above, appears in several of the part files.
I have to write a MapReduce phase wherein I collect the occurrences of the same query and aggregate (add up) their scores.
My mapper function can be the identity, and the reducer will accomplish this task.
Output would be:
q aggScore
1 ben 10 7.24
horse shoe 8.94
...
It seems to be a simple task, but I am not able to work out how to proceed (I have read a lot but still cannot get going). I can think of it as a generic algorithm problem wherein I first collect the common queries and then add up their scores.
Any hints toward a Pythonic solution or a MapReduce algorithm would really be appreciated.

Here is the MapReduce solution:
Map input: each input file (part-0, part-1, part-2, ...) can be the input to an individual (separate) map task.
For each input line in the file, the mapper emits <q, score>. If there are multiple scores for a query within a single file, the map can sum them up first; otherwise, if we know each query appears in each file only once, the map can be an identity function emitting <q, score> for each input line as is.
The reducer input is of the form <q, list<score1, score2, ...>>. The reducer operation is similar to the well-known MapReduce word-count example. If you are using Hadoop, you can use the following method for the reducer:
public void reduce(Text q, Iterable<DoubleWritable> aggScores, Context context)
        throws IOException, InterruptedException {
    // The scores (e.g. 4.01) are fractional, so DoubleWritable fits better than IntWritable.
    double sum = 0;
    for (DoubleWritable val : aggScores) {
        sum += val.get();
    }
    context.write(q, new DoubleWritable(sum));
}
This method will sum all the scores for a particular q and give you the desired output. The Python code for the reducer should look something like this (here q is the key and the list of scores is the values):
def reduce(self, key, values, output, reporter):
    total = 0.0
    for value in values:
        total += value
    output.collect(key, total)
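
If you are not tied to Hadoop, a minimal pure-Python sketch of the same map/shuffle/reduce flow may help to see the whole pipeline. The tab-separated line format and the part file names are assumptions here; adjust the parsing to your real layout.

from collections import defaultdict

# Identity-style map: parse one "<query>\t<score>" line into a (key, value) pair.
def mapper(line):
    query, score = line.rsplit("\t", 1)
    return query, float(score)

# Reduce: add up all scores collected for one query.
def reducer(query, scores):
    return query, sum(scores)

# "Shuffle": group the mapped values by key across all part files.
grouped = defaultdict(list)
for path in ["part-0", "part-1", "part-2"]:
    with open(path) as f:
        for line in f:
            q, s = mapper(line.rstrip("\n"))
            grouped[q].append(s)

for q in sorted(grouped):
    key, agg = reducer(q, grouped[q])
    print("%s\t%.2f" % (key, agg))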

Related

Byzantine Consensus Randomized - Monte Carlo Implementation with matrix for value sending

In order to achieve Byzantine consensus, these 3 rules must apply: during every round each process must send a bit (with value either 0 or 1) to the other n-1 processes; before the start of a new round each process must have received a bit from every other process; and in each round every process must have sent its bit to all the other processes.
I need to implement the following Rabin-based, Monte Carlo Byzantine Consensus randomized protocol for 661 non-traitor processes out of 991 total processes:
Variables:
b(i) = value to send to all other processes, initially = input value
p = global random number, which can be either 1 or 0, picked each round
maj(i) = majority value
tally(i) = number of times majority appears in b(i)
Implementation:
loop = TRUE
while (loop)
1. send b(i) to the other n−1 processes
2. receive the values sent by the other n−1 processes
3. set maj(i) = majority value in b(i)
4. set tally(i) = number of times the majority value appears
5. if tally(i) ≥ 2t + 1
then b(i) = maj(i)
else if (p=1) then b(i) = 1
else b(i) = 0
My question would be: how can I implement the data structure which stores, for every process, the bits it has sent and received, not to mention the sending mechanism itself? I was thinking about implementing an array A.i[j] for each process, where j ranges over all process ids. I have heard it is possible to make processes read the values of the other n-1 processes from such a table instead of implementing a sending mechanism; how could I implement this?
I solved this problem by creating a class Process with a type and a vote attribute, plus setVote() and getVote() methods:
Type: reliable, unreliable
Vote: the vote (v0 at the beginning; it will converge to v by following the protocol)
setVote(): sets the vote.
getVote(): if the type is reliable, return the bit; otherwise return a random bit (0 or 1).
To simulate a distributed algorithm, you can run the protocol in separate threads, each one holding a Process instance and communicating through a shared array that you update each round.
An example here
You can also simulate everything starting from an array of these objects, without the need of separate threads.
I believe that should be enough to handle the model.
something like this:
export enum ProcessType { reliable, unreliable }

export class Process {
  constructor(public pid: number, public vote: number, public type: ProcessType) { }

  // Update the stored vote, routing the bit through send() so an unreliable
  // process may corrupt its own state as well.
  set(bit: number) {
    this.vote = this.send(bit);
  }

  // A reliable process sends the bit unchanged; an unreliable (traitorous)
  // process sends a random bit instead.
  send(b: number) {
    if (this.type === ProcessType.unreliable) {
      return Math.round(Math.random());
    }
    return b;
  }

  get() {
    return { pid: this.pid, vote: this.vote, type: this.type };
  }

  getVote() {
    return this.send(this.vote);
  }
}
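
If you prefer to prototype the shared-array simulation in Python, a minimal single-threaded sketch could look like the following. The Process and run_round names, the toy sizes N and T, and the use of a single shared list for everyone are illustrative assumptions (the actual problem has n = 991 and t = 330):

import random

N, T = 9, 2  # toy sizes; the real setting is n = 991, t = 330

class Process:
    def __init__(self, pid, vote, reliable=True):
        self.pid = pid
        self.vote = vote
        self.reliable = reliable

    def send(self):
        # A traitorous process may send an arbitrary bit.
        return self.vote if self.reliable else random.randint(0, 1)

def run_round(processes):
    # "Sending": every process publishes its bit into the shared array,
    # and everyone reads from that array instead of exchanging messages.
    shared = [p.send() for p in processes]
    coin = random.randint(0, 1)  # the global random bit p
    ones = sum(shared)
    maj = 1 if ones >= len(shared) - ones else 0
    tally = max(ones, len(shared) - ones)
    for p in processes:
        p.vote = maj if tally >= 2 * T + 1 else coin

processes = [Process(i, random.randint(0, 1), reliable=(i >= T)) for i in range(N)]
for _ in range(5):
    run_round(processes)
print([p.vote for p in processes])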
Good luck!

Reuse output of two MapReduce jobs and join the results together

I would like to join the output of two different MapReduce jobs. I want to be able to do something like I have below, but I cannot figure out how to reuse results from previous jobs and join them. How could I do this?
Job1:
Andrea Vanzo, c288f70f-f417-4a96-8528-25c61372cae7, 125
Job2:
c288f70f-f417-4a96-8528-25c61372cae7, 071e1103-1b06-4671-8324-a9beb3e90d18, 25
Result:
Andrea Vanzo, c288f70f-f417-4a96-8528-25c61372cae7, 25
You can use JobControl to set up the workflow in MapReduce. Alternatively, reading job1's and job2's output together (using MultipleInputs) could also solve your problem.
Use different processing in the mapper and tag the data according to the path it comes from.
mapper
if the record comes from job1's path: split it and emit key = data[1] (the shared id), value = data[0] + "tagjob1"
if the record comes from job2's path: split it and emit key = data[0] (the shared id), value = data[2] + "tagjob2"
reducer
each key now has its set of values;
put the values into two lists, grouped by their tag;
write the key with each pair in the Cartesian product of the two lists.
Hope this helps.
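
For illustration only, here is a small in-memory Python sketch of that tagged reduce-side join; the sample rows and tag strings are assumptions based on the data shown above:

from collections import defaultdict

# Outputs of the two previous jobs (sample rows from the question).
job1 = [("Andrea Vanzo", "c288f70f-f417-4a96-8528-25c61372cae7", "125")]
job2 = [("c288f70f-f417-4a96-8528-25c61372cae7", "071e1103-1b06-4671-8324-a9beb3e90d18", "25")]

# "Map": key every record on the shared id and tag it with its source job.
grouped = defaultdict(list)
for name, uid, _ in job1:
    grouped[uid].append(("tagjob1", name))
for uid, _, score in job2:
    grouped[uid].append(("tagjob2", score))

# "Reduce": per key, cross join the job1 values with the job2 values.
for uid, values in grouped.items():
    lefts = [v for tag, v in values if tag == "tagjob1"]
    rights = [v for tag, v in values if tag == "tagjob2"]
    for name in lefts:
        for score in rights:
            print(name, uid, score, sep=", ")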

Pandas dataframe to doc2vec.LabeledSentence

I have this dataframe :
order_id product_id user_id
2 33120 u202279
2 28985 u202279
2 9327 u202279
4 39758 u178520
4 21351 u178520
5 6348 u156122
5 40878 u156122
Type user_id : String
Type product_id : Integer
I would like to use this dataframe to create a Doc2vec corpus. So, I need to use LabeledSentence to create entries of the form:
{tags: user_id, words: all product_ids ordered by that user_id}
But the dataframe shape is (32434489, 3), so I should avoid using a loop to create my LabeledSentences.
I tried to run the function below with multiprocessing, but it is too slow.
Do you have any idea how to transform my dataframe into the right format for a Doc2vec corpus, where the tag is the user_id and the words are the list of products ordered by that user_id?
def append_to_sequences(i):
    user_id = liste_user_id.pop(0)
    liste_produit_userID = data.ix[data["user_id"] == user_id, "product_id"].astype(str).tolist()
    return doc2vec.LabeledSentence(words=liste_produit_userID, tags=user_id)

pool = multiprocessing.Pool(processes=3)
result = pool.map_async(append_to_sequences, np.arange(len_liste_unique_user))
pool.close()
pool.join()
sentences = result.get()
Using multiprocessing is likely overkill. The forking of processes can wind up duplicating all existing memory, and involve excess communication marshalling results back into the master process.
Using a loop should be OK. About 32 million rows (and far fewer unique user_ids) isn't that much, depending on your RAM.
Note that in recent versions of gensim TaggedDocument is the preferred class for Doc2Vec examples.
If we were to assume you have a list of all unique user_ids in liste_user_id, and a (new, not shown) function that gets the list-of-words for a user_id called words_for_user(), creating the documents for Doc2Vec in memory could be as simple as:
documents = [TaggedDocument(words=words_for_user(uid), tags=[uid])
for uid in liste_user_id]
Note that tags should be a list of tags, not a single tag – even though in many common cases each document only has a single tag. (If you provide a single string tag, it will see tags as a list-of-characters, which is not what you want.)
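
For instance, here is a sketch of building the per-user word lists with a single pandas groupby; it assumes the question's dataframe is named data, as shown there:

from gensim.models.doc2vec import TaggedDocument

# One TaggedDocument per user, with that user's product_ids (as strings) as words.
words_by_user = data.groupby("user_id")["product_id"].apply(
    lambda s: s.astype(str).tolist()
)
documents = [TaggedDocument(words=words, tags=[uid])
             for uid, words in words_by_user.items()]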

Linq, Python or Sql, need advice for TSS WSS BSS calculation

Hi, I am making a statistical application in C++ with Qt.
I need to perform many calculations over a table holding the output of a multivariate cluster analysis:
Var1, Var2, Var3, ..., VarN, k2, k3, k4, ..., kn
where Var1 to VarN are the variables under study, and k2 to kn are the cluster classifications.
Table Example:
Var1,Var2,Var3,Var4,k2,k3,k4,k5,k6
3464.57,2992.33,2688.33,504.79,2,3,2,3,2
2895.32,3365.35,2824.35,504.86,1,2,3,2,6
2249.32,3300.19,2382.19,504.92,2,1,4,3,4
3417.81,3311.04,2426.04,504.97,1,2,2,5,2
3329.66,3497.14,2467.14,505.03,2,2,1,4,2
3087.85,3653.53,2296.53,505.09,2,1,2,3,4
The C++ storage will be defined like:
struct record
{
    QList<double> vars;
    QList<int> cluster;
};
QList<record> table;
I need to calculate the total, the within-group, and the between-group sums of squares.
https://en.wikipedia.org/wiki/F-test
So, for example, to calculate WSS for Var1 and k2 I need to, in pseudo code:
get the size of every group:
count(*) group by(k2),
calculate the mean of every group:
sum(Var1) group by(k2), then divide each sum by the corresponding count,
compute the difference:
pow((xgroup1 - xmeangroup1), 2)
and many other operations...
Which alternative would make this easier and more powerful to code:
1) Create a MySQL table on the fly and run SQL operations.
2) Use LINQ, but I do not know whether Qt has a QtLinq class.
3) Work through Python equivalents of the LINQ methods
(how is the interaction between Qt and Python? I see that QGIS has many plugins written in Python).
My app also needs to perform many other calculations.
I hope this is clear.
Greetings
After some time I am answering my own question:
the solution was done in Python with pandas.
This link is very useful:
"Iterating through groups" at http://pandas.pydata.org/pandas-docs/stable/groupby.html
Also the book "Python for Data Analysis" by Wes McKinney, p. 255.
This video shows how to do the calculation:
ANOVA 2: Calculating SSW and SSB (total sum of squares within and between) | Khan Academy
https://www.youtube.com/watch?v=j9ZPMlVHJVs
[code]
import numpy as np
import pandas as pd

def getDFrameFixed2D():
    y = np.array([3, 2, 1, 5, 3, 4, 5, 6, 7])
    k = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
    clusters = pd.DataFrame([[a, b] for a, b in zip(y, k)], columns=['Var1', 'K2'])
    # print(clusters.head()); print("shape(0):", clusters.shape[0])
    return clusters

X2D = getDFrameFixed2D()
MainMean = X2D['Var1'].mean(0)
print("Main mean:", MainMean)

grouped = X2D['Var1'].groupby(X2D['K2'])
print("-----Iterating Over Groups-------------")
Wss = 0
Bss = 0
for name, group in grouped:
    print("Group key:", name)
    groupmean = group.mean(0)
    groupss = sum((group - groupmean) ** 2)
    print(" groupmean:", groupmean)
    print(" groupss:", groupss)
    Wss += groupss
    Bss += ((groupmean - MainMean) ** 2) * len(group)
print("----------------------------------")
print("Wss:", Wss)
print("Bss:", Bss)
print("T=B+W:", Bss + Wss)
Tss = np.sum((X2D['Var1'] - MainMean) ** 2)
print("Tss:", Tss)
print("----------------------------------")
[/code]
I am sure this could be done with aggregates (lambda functions) or apply,
but I can't figure out how
(if somebody knows, please post it here).
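One possible aggregate-based version, an unverified sketch that reuses the X2D frame and MainMean from the code above, might be:

# Unverified sketch: per-group mean, size and within-group sum of squares via agg().
stats = X2D.groupby('K2')['Var1'].agg(['mean', 'count'])
wss_per_group = X2D.groupby('K2')['Var1'].agg(lambda g: ((g - g.mean()) ** 2).sum())
Wss = wss_per_group.sum()
Bss = (((stats['mean'] - MainMean) ** 2) * stats['count']).sum()
print("Wss:", Wss, "Bss:", Bss)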
Greetings

How to write the result (from python) to mongodb (in json)?

I am now writing a job-to-job list. For a given JobID, I would like to output its similar jobs (ranked by a self-defined score in descending order). For example, the structure should be:
{"_id":"abcd","job":1234,"jobList":[{"job":1,"score":0.9},{"job":2,"score":0.8},{"job":3,"score":0.7}]}
Here the JobID is 1234, and jobs 1, 2, 3 are its similar jobs, listed as name-score pairs.
And my Python code is:
import csv
import json

def sortSparseMatrix(m, rev=True, only_indices=True):
    f = open("/root/workspace/PythonOutPut/test.json", 'wb')
    w = csv.writer(f, dialect='excel')
    col_list = [None] * (m.shape[0])
    j = 0
    for i in xrange(m.shape[0]):
        d = m.getrow(i)
        if len(d.indices) != 0:
            s = zip(d.indices, d.data)
            s.sort(key=lambda v: v[1], reverse=True)
            if only_indices:
                col_list[j] = [[element[0], element[1]] for element in s]
                col_list[j] = col_list[j][0:4]
                h1 = u'Job' + ":" + str(col_list[j][0][0]) + ","
                json.dump(h1, f)
                h2 = []
                h3 = u'JobList' + ":"
                json.dump(h3, f)
                for subrow in col_list[j][1:]:
                    h2.append(u'{Job' + ":" + str(subrow[0]) + "," + u'score' + ":" + str(subrow[1]) + "}")
                json.dump(h2, f)
                del col_list[j][:]
                j = j + 1
Here d contains the unsorted name-score pairs for a JobID (col_list[j][0][0]); after sorting, the most similar job (highest score) to the JobID is the JobID itself. d.data holds the scores, and [element[0], element[1]] is a name-score pair. I would like to keep the three most similar jobs for each JobID, dump h1 first (to show the JobID), and then output the list of similar jobs in h2.
I typed 'mongoimport --db test_database --collection TestOfJSON --type csv --file /as above/ --fields JobList'. It can import the result into MongoDB, but it ends up as one JobID with many fields, whereas what I want is each JobID associated with only its similar jobs' name-score pairs. How should I do this? Thanks
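
One way around the intermediate file and mongoimport altogether is to build each document as a Python dict and insert it directly with pymongo. Here is a minimal sketch, where the database/collection names and the sorted_pairs variable are illustrative assumptions (sorted_pairs being the sorted [job, score] pairs for one row, with the JobID itself first):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client.test_database.TestOfJSON

def save_job_list(sorted_pairs):
    # sorted_pairs = [[job_id, top_score], [job1, s1], [job2, s2], ...], best first.
    job_id = sorted_pairs[0][0]  # the query job itself has the highest score
    doc = {
        "job": int(job_id),
        "jobList": [{"job": int(j), "score": float(s)} for j, s in sorted_pairs[1:4]],
    }
    collection.insert_one(doc)  # MongoDB generates the _id field automatically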
