Reuse output of two MapReduce jobs and join the results together - python

I would like to join the output of two different MapReduce jobs. I want to be able to do something like I have below, but I cannot figure out how to reuse results from previous jobs and join them. How could I do this?
Job1:
Andrea Vanzo, c288f70f-f417-4a96-8528-25c61372cae7, 125
Job2:
c288f70f-f417-4a96-8528-25c61372cae7, 071e1103-1b06-4671-8324-a9beb3e90d18, 25
Result:
Andrea Vanzo, c288f70f-f417-4a96-8528-25c61372cae7, 25

You can use JobControl to chain the two jobs into a workflow in MapReduce. Alternatively, reading job1's and job2's output together in a third join job (with MultipleInputs) would also solve your problem.
In that join job, choose the processing logic according to the path the data comes from:
mapper
record from job1's path => split, write key = data[1] (the shared id), value = data[0] + "tagjob1"
record from job2's path => split, write key = data[0] (the shared id), value = data[2] + "tagjob2"
reducer
each key arrives with its set of values
split the values into two lists, grouped by their "tag"
write the key and each pair in the Cartesian product of the two lists
Hope this helps.
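Since the question is tagged python, here is one way the tagged-join idea could look with Hadoop Streaming. It is only a rough sketch under a few assumptions: the fields are comma-separated exactly as in the samples, job1's output path contains the string "job1", and the streaming runtime exposes the input path in the map_input_file / mapreduce_map_input_file environment variable (the name differs between Hadoop versions).
mapper.py:
import os
import sys

# Which job's output is this mapper reading? (assumption: the path tells us)
input_file = os.environ.get("mapreduce_map_input_file",
                            os.environ.get("map_input_file", ""))

for line in sys.stdin:
    data = [field.strip() for field in line.strip().split(",")]
    if "job1" in input_file:
        # key = shared id, value = name, tagged with the source job
        print("%s\t%s\tjob1" % (data[1], data[0]))
    else:
        # key = shared id, value = count, tagged with the source job
        print("%s\t%s\tjob2" % (data[0], data[2]))
reducer.py:
import sys

def flush(key, names, counts):
    # emit the Cartesian product of the two tagged lists for this key
    for name in names:
        for count in counts:
            print("%s, %s, %s" % (name, key, count))

current_key, from_job1, from_job2 = None, [], []

for line in sys.stdin:
    key, value, tag = line.strip().split("\t")
    if key != current_key and current_key is not None:
        flush(current_key, from_job1, from_job2)
        from_job1, from_job2 = [], []
    current_key = key
    (from_job1 if tag == "job1" else from_job2).append(value)

if current_key is not None:
    flush(current_key, from_job1, from_job2)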

Related

Design MapReduce function in Hadoop Streaming

I am using Hadoop Streaming (version 2.9.0) with Python. The input of my MapReduce job is a text file. My <key,value> pairs are as follows, where the keys are numeric and the values are sets of numbers:
k1,v11
k2,v21
k1,v12
k2,v22
k1,v13
k2,v23
I want to output the following (where "v11 union v12" denotes the union of the sets v11 and v12):
k1, v11 union v12, v11 union v13, v12 union v13
k2, v21 union v22, v21 union v23, v22 union v23
......
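For concreteness, the aggregation I want per key is just the union of every pair of that key's value sets; a tiny illustration with made-up values (parsed into Python sets, not the real data):
from itertools import combinations

values = [{1, 2}, {2, 3}, {4}]   # e.g. the sets v11, v12, v13 collected for key k1
pairwise_unions = [a | b for a, b in combinations(values, 2)]
# -> [{1, 2, 3}, {1, 2, 4}, {2, 3, 4}]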
So, I implement Map.py:
import sys

for line in sys.stdin:
    key, values = line.strip().split(",")
    # emit the pair unchanged, comma-separated, so the reducer can split on ","
    print(key + "," + values)
And my Reduce.py is
import sys

current_key = None
values_list = []

for line in sys.stdin:
    key, values = line.strip().split(",")
    if key == current_key:
        # save values to a temporary list
        values_list.append(values)
        print(key, values)
    # when moving to another attribute
    else:
        if current_key is not None:
            for l in values_list:
                # we process the union between elements of the list here,
                # saving it to the aggre variable
                aggre = ...
            print(current_key, aggre)
        # clear the list
        values_list[:] = []
        # set the new key as current_key
        current_key = key
        # save values to this list
        values_list.append(values)
The output of my Map.py is the same as its input. In Reduce.py, since Hadoop Streaming reads line by line, I wait until I have read all lines with the same key before starting the calculation, i.e., I compute right after the key changes.
However, I feel that this implementation amounts to a sequential method, i.e., nothing is parallel here. Please correct me if I am wrong.
If it really is sequential, is there any way to design the MapReduce job so that it is truly parallel? For example, I would like Reduce.py to look like this:
import sys

values_list = []

for line in sys.stdin:
    key, values = line.strip().split(",")
    # save values to a temporary list
    values_list.append(values)

for l in values_list:
    # we process the union between elements of the list here,
    # saving it to the aggre variable
    aggre = ...
print(key, aggre)
When I use this code together with mapred.reduce.tasks=n, the result is split into n files, each containing only one <key,value> pair, as I expected.
However, I get errors when I set n too high.
My keys k1, k2, ... represent attributes in a database. I want to compute the union of the values for each attribute, and the computation for each attribute k1, k2, ... should run in parallel. If there are 5 attributes, there should be 5 reducers, and so on.
However, I get errors whenever I set mapred.reduce.tasks to the number of attributes. For example, when the number of attributes is 30 and I set mapred.reduce.tasks=30, there are Java exceptions.
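As far as I can tell, the reduce task for a key is chosen by hashing the key, roughly like the Python illustration below (my own sketch of the idea, not Hadoop's actual code); with n smaller than the number of attributes, several attributes would then share a reducer:
# Illustration only: the default hash-partitioning idea, key -> reducer index.
def partition(key, num_reduce_tasks):
    return hash(key) % num_reduce_tasks

for k in ["k1", "k2", "k1"]:
    # both "k1" records go to the same reducer; "k1" and "k2" may or may not share one
    print(k, "->", partition(k, 5))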
So, my question is:
How can I ensure that each key (each attribute) is processed by only one reducer, even when I set n < the number of attributes?
Thank you very much.

Pandas dataframe to doc2vec.LabeledSentence

I have this dataframe:
order_id product_id user_id
2 33120 u202279
2 28985 u202279
2 9327 u202279
4 39758 u178520
4 21351 u178520
5 6348 u156122
5 40878 u156122
Type user_id : String
Type product_id : Integer
I would like to use this dataframe to create a Doc2vec corpus. So, I need to use the LabeledSentence class to create objects of the form:
{tags: user_id, words: all product ids ordered by that user_id}
But the dataframe's shape is (32434489, 3), so I should avoid using a loop to create my LabeledSentence objects.
I tried to run the function below with multiprocessing, but it is too slow.
Do you have any idea how to transform my dataframe into the right format for a Doc2vec corpus, where the tag is the user_id and the words are the list of products ordered by that user_id?
import multiprocessing
import numpy as np
from gensim.models import doc2vec

def append_to_sequences(i):
    user_id = liste_user_id.pop(0)
    liste_produit_userID = data.ix[data["user_id"] == user_id, "product_id"].astype(str).tolist()
    return doc2vec.LabeledSentence(words=liste_produit_userID, tags=user_id)

pool = multiprocessing.Pool(processes=3)
result = pool.map_async(append_to_sequences, np.arange(len_liste_unique_user))
pool.close()
pool.join()
sentences = result.get()
Using multiprocessing is likely overkill. The forking of processes can wind up duplicating all existing memory, and involve excess communication marshalling results back into the master process.
Using a loop should be OK. 34 million rows (and far fewer unique user_ids) isn't that much, depending on your RAM.
Note that in recent versions of gensim TaggedDocument is the preferred class for Doc2Vec examples.
If we were to assume you have a list of all unique user_ids in liste_user_id, and a (new, not shown) function that gets the list-of-words for a user_id called words_for_user(), creating the documents for Doc2Vec in memory could be as simple as:
documents = [TaggedDocument(words=words_for_user(uid), tags=[uid])
for uid in liste_user_id]
Note that tags should be a list of tags, not a single tag – even though in many common cases each document only has a single tag. (If you provide a single string tag, it will see tags as a list-of-characters, which is not what you want.)
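One possible way to build those per-user word lists without filtering the dataframe once per user is a single pandas groupby; this is just a sketch of that idea (it assumes the question's dataframe is named data), not the only way to write words_for_user():
from gensim.models.doc2vec import TaggedDocument

# Group once: map each user_id to the list of its product_ids as strings.
products_by_user = data.groupby("user_id")["product_id"].apply(
    lambda s: s.astype(str).tolist())

documents = [TaggedDocument(words=words, tags=[uid])
             for uid, words in products_by_user.items()]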

How to write the result (from python) to mongodb (in json)?

I am now writing a job-to-job list. For a given JobID, I would like to output its similar jobs (by a self-defined score, in descending order). For example, the structure should be:
{"_id":"abcd","job":1234,"jobList":[{"job":1,"score":0.9},{"job":2,"score":0.8},{"job":3,"score":0.7}]}
Here the JobID is 1234, and jobs 1, 2, 3 are its similar jobs, listed as name-score pairs.
And my Python code is:
import csv
import json

def sortSparseMatrix(m, rev=True, only_indices=True):
    f = open("/root/workspace/PythonOutPut/test.json", 'wb')
    w = csv.writer(f, dialect='excel')
    col_list = [None] * (m.shape[0])
    j = 0
    for i in xrange(m.shape[0]):
        d = m.getrow(i)
        if len(d.indices) != 0:
            s = zip(d.indices, d.data)
            s.sort(key=lambda v: v[1], reverse=True)
            if only_indices:
                col_list[j] = [[element[0], element[1]] for element in s]
                col_list[j] = col_list[j][0:4]
                h1 = u'Job' + ":" + str(col_list[j][0][0]) + ","
                json.dump(h1, f)
                h2 = []
                h3 = u'JobList' + ":"
                json.dump(h3, f)
                for subrow in col_list[j][1:]:
                    h2.append(u'{Job' + ":" + str(subrow[0]) + "," + u'score' + ":" + str(subrow[1]) + "}")
                json.dump(h2, f)
                del col_list[j][:]
                j = j + 1
Here d contains the unsorted name-score pairs with respect to the JobID col_list[j][0][0] (after sorting, the most similar job, i.e. the one with the highest score, for JobID col_list[j][0][0] is itself). d.data is the score, and [element[0], element[1]] is the name-score pair. I would like to keep the three most similar jobs for each JobID, dump h1 first (to show the JobID), and then output the list of similar jobs in h2.
I ran 'mongoimport --db test_database --collection TestOfJSON --type csv --file /as above/ --fields JobList'. It does import the result into MongoDB, but as a single JobID with many fields. What I want instead is each JobID associated only with its similar jobs' name-score pairs. How should I do this? Thanks.
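In other words, what I want to end up with is one nested document per JobID, equivalent to building and inserting something like this sketch, using the example values above (the helper name, the top-three cut-off and the pymongo calls here are only to illustrate the target shape):
import json
from pymongo import MongoClient

def job_document(job_id, similar):
    # similar: (job, score) pairs already sorted by score, descending;
    # keep only the three most similar jobs
    return {"job": job_id,
            "jobList": [{"job": j, "score": s} for j, s in similar[:3]]}

doc = job_document(1234, [(1, 0.9), (2, 0.8), (3, 0.7)])

# Option 1: insert directly with pymongo
client = MongoClient()
client["test_database"]["TestOfJSON"].insert_one(doc)

# Option 2: write one JSON object per line and load it with
#   mongoimport --db test_database --collection TestOfJSON --type json --file test.json
with open("test.json", "w") as f:
    f.write(json.dumps(doc) + "\n")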

Simple example of retrieving 500 items from dynamodb using Python

Looking for a simple example of retrieving 500 items from dynamodb minimizing the number of queries. I know there's a "multiget" function that would let me break this up into chunks of 50 queries, but not sure how to do this.
I'm starting with a list of 500 keys. I'm then thinking of writing a function that takes this list of keys, breaks it up into "chunks," retrieves the values, stitches them back together, and returns a dict of 500 key-value pairs.
Or is there a better way to do this?
As a corollary, how would I "sort" the items afterwards?
Depending on your schema, there are two ways of efficiently retrieving your 500 items.
1. Items are under the same hash_key, using a range_key
Use the query method with the hash_key
You may ask to sort the range_keys A-Z or Z-A
2. Items are on "random" keys
You said it: use the BatchGetItem method
Good news: the limit is actually 100 items per request or 1 MB max
You will have to sort the results on the Python side
On the practical side, since you use Python, I highly recommend the Boto library for low-level access or the dynamodb-mapper library for higher-level access (disclaimer: I am one of the core devs of dynamodb-mapper).
Sadly, neither of these libraries provides an easy way to wrap the batch_get operation. On the contrary, there is a generator for scan and for query which 'pretends' you get everything in a single query.
In order to get optimal results with the batch query, I recommend this workflow:
submit a batch with all of your 500 items.
store the results in your dicts
re-submit with the UnprocessedKeys as many times as needed
sort the results on the Python side
Quick example
I assume you have created a table "MyTable" with a single hash_key
import boto

# Helper function. This is more or less the code
# I added to the develop branch
def resubmit(batch, prev):
    # Empty (re-use) the batch
    del batch[:]
    # The batch answer contains the list of
    # unprocessed keys grouped by tables
    if 'UnprocessedKeys' in prev:
        unprocessed = prev['UnprocessedKeys']
    else:
        return None
    # Load the unprocessed keys
    for table_name, table_req in unprocessed.iteritems():
        table_keys = table_req['Keys']
        table = batch.layer2.get_table(table_name)
        keys = []
        for key in table_keys:
            h = key['HashKeyElement']
            r = None
            if 'RangeKeyElement' in key:
                r = key['RangeKeyElement']
            keys.append((h, r))
        attributes_to_get = None
        if 'AttributesToGet' in table_req:
            attributes_to_get = table_req['AttributesToGet']
        batch.add_batch(table, keys, attributes_to_get=attributes_to_get)
    return batch.submit()

# Main
db = boto.connect_dynamodb()
table = db.get_table('MyTable')

batch = db.new_batch_list()
keys = range(100)  # Get items from 0 to 99
batch.add_batch(table, keys)
res = batch.submit()
while res:
    print res  # Do some useful work here
    res = resubmit(batch, res)
# The END
EDIT:
I've added a resubmit() function to BatchList in the Boto develop branch. It greatly simplifies the workflow:
add all of your requested keys to BatchList
submit()
resubmit() as long as it does not return None.
This should be available in the next release.
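With that change, the main part of the example above could shrink to something like this. This is only a sketch against the development branch described here; it assumes resubmit() takes no arguments and returns None once every key has been processed:
import boto

db = boto.connect_dynamodb()
table = db.get_table('MyTable')

batch = db.new_batch_list()
batch.add_batch(table, range(100))   # all of the requested keys at once
res = batch.submit()
while res:
    # do some useful work with res here
    res = batch.resubmit()           # assumption: no arguments, None when done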

Loop on Select_Analysis tool (Python and ArcGIS 9.3)

First, I'm new to Python and I work with ArcGIS 9.3.
I'd like to run a loop over the "Select_Analysis" tool. I have a layer "stations" containing all the bus stations of a city.
The layer has a field "rte_id" that indicates which line a station is located on.
I'd like to save all the stations with "rte_id" = 1 in a distinct layer, the stations with "rte_id" = 2 in another, and so on. Hence the use of the Select_Analysis tool.
So, I decided to write a loop (I have 70 different "rte_id" values ... so 70 different layers to create!). But it does not work and I'm totally lost!
Here is my code:
import arcgisscripting, os, sys, string

gp = arcgisscripting.create(9.3)
gp.AddToolbox("C:/Program Files (x86)/ArcGIS/ArcToolbox/Toolboxes/Data Management Tools.tbx")
stations = "d:/Travaux/NantesMetropole/Traitements/SIG/stations.shp"
field = "rte_id"

for i in field:
    gp.Select_Analysis (stations, "d:/Travaux/NantesMetropole/Traitements/SIG/stations_" + i + ".shp", field + "=" + i)
    i = i+1
print "ok"
And here is the error message:
gp.Select_Analysis (stations, "d:/Travaux/NantesMetropole/Traitements/SIG/stations_" + i + ".shp", field + "=" + i)
TypeError: can only concatenate list (not "str") to list
Have you got any ideas to solve my problem?
Thanks in advance!
Julien
The main problem here is the line
for i in field:
You are trying to iterate over a string - the field name ("rte_id").
This is not correct.
You need to iterate over all possible values of the field "rte_id".
Easiest solution:
If you know that the field "rte_id" has values 1 - 70 (for example), then you can try
for i in range(1, 71):
    shp_name = "d:/Travaux/NantesMetropole/Traitements/SIG/stations_" + str(i) + ".shp"
    expression = '{0} = {1}'.format(field, i)
    gp.Select_Analysis(stations, shp_name, expression)
print "ok"
More sophisticated solution:
You need to get a list of all unique values of the field "rte_id" - in SQL terms, to perform a GROUP BY.
I don't think it is actually possible to perform a GROUP BY operation on SHP files with a single tool.
You can use a SearchCursor, iterate through all features and build a list of the unique values of your field, but this is a more complex task.
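Something like this untested sketch (I collect the unique values first, then reuse the loop from the easiest solution; the 9.3 geoprocessor cursor calls are written from memory, so please double-check them):
# Collect the unique rte_id values with a SearchCursor.
unique_ids = set()
rows = gp.SearchCursor(stations)
row = rows.Next()
while row:
    unique_ids.add(row.GetValue(field))
    row = rows.Next()

# Then export one shapefile per unique value.
for value in unique_ids:
    shp_name = "d:/Travaux/NantesMetropole/Traitements/SIG/stations_" + str(value) + ".shp"
    gp.Select_Analysis(stations, shp_name, '{0} = {1}'.format(field, value))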
Another way is to use the Summarize option on the shapefile table in ArcMap (open the table, right-click on the column header). You will get a dbf table with the unique values, which you can read from your script.
I hope it will help you to start!
I don't have ArcGIS at hand right now, so I can't write and test a script.
You will need to make substantial changes to this code in order to get it to do what you want. You may just want to download the Split Layer By Attribute code from ArcGIS Online, which does exactly the same thing.
