Design MapReduce function in Hadoop Streaming - python

I am using Hadoop Streaming (ver 2.9.0) with Python. The input of my MapReduce is a text file. My <key,value> pairs look like the following (keys are numeric, values are sets of numerics):
k1,v11
k2,v21
k1,v12
k2,v22
k1,v13
k2,v23
I want to output (where "v11 union v12" denotes the union of sets v11 and v12):
k1, v11 union v12, v11 union v13, v12 union v13
k2, v21 union v22, v21 union v23, v22 union v23
......
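For concreteness, the per-key computation described above can be sketched in plain Python (how a value string is parsed into a set is an assumption, since the exact value format isn't shown):

from itertools import combinations

def pairwise_unions(value_sets):
    """Union of every pair of value sets for one key."""
    return [a | b for a, b in combinations(value_sets, 2)]

# e.g. for k1 with v11={1,2}, v12={2,3}, v13={4}:
# pairwise_unions([{1, 2}, {2, 3}, {4}]) -> [{1, 2, 3}, {1, 2, 4}, {2, 3, 4}]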
So I implemented Map.py:
import sys

for line in sys.stdin:
    key, values = line.strip().split(",", 1)   # split off the key, keep the rest of the line as the value
    print(key, values, sep=",")                # emit "key,value" so the reducer can parse it the same way
And my Reduce.py is
import sys

current_key = None
values_list = []   # temporary buffer for all values of the current key

for line in sys.stdin:
    key, values = line.strip().split(",", 1)
    if key == current_key:
        # save values to the temporary list
        values_list.append(values)
    # when moving to another attribute (key change)
    else:
        if current_key is not None:
            for l in values_list:
                # process the union between elements of the list here,
                # saving it into the aggre variable
                aggre = ...
            print(current_key, aggre)
        # clear the list
        values_list = []
        # set the new key as current_key
        current_key = key
        # save values of the new key to the list
        values_list.append(values)

# emit the aggregate for the last key as well
if current_key is not None:
    for l in values_list:
        aggre = ...
    print(current_key, aggre)
The output of my Map.py is the same as its input. In Reduce.py, since Hadoop Streaming feeds the reducer line by line, I wait until I have read all lines with the same key before starting the calculation; the calculation then runs as soon as the key changes.
However, I feel that this implementation is essentially sequential, i.e., nothing is parallel here. Please correct me if I am wrong.
If it really is sequential, is there any way to design the MapReduce job so that it is truly parallel? For example, I would like Reduce.py to look like this:
import sys

values_list = []

for line in sys.stdin:
    key, values = line.strip().split(",", 1)
    # save values to a temporary list
    values_list.append(values)

for l in values_list:
    # process the union between elements of the list here,
    # saving it into the aggre variable
    aggre = ...

print(key, aggre)
When I use this code together with mapred.reduce.tasks=n, the result is split into n files, each containing only one <key,value> pair, as I expected.
However, I get errors when n is set too high.
My keys k1, k2, ... represent attributes in a database. I want to compute the union of values for each attribute, and the computation for each attribute k1, k2, ... should run in parallel. If there are 5 attributes, there should be 5 reducers, and so on.
However, I get errors whenever I set mapred.reduce.tasks = number of attributes. For example, when the number of attributes is 30 and I set mapred.reduce.tasks=30, I get Java exception errors.
So, my question is:
How can I ensure that each key (i.e., each attribute) is processed by only one reducer, even when I set n < the number of attributes?
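For reference, Hadoop's default partitioner already routes every occurrence of a given key to exactly one reduce task by hashing the key modulo the number of reducers. Roughly, in Python terms (an illustration only, not Hadoop's actual code):

def reducer_for(key, num_reducers):
    # All records sharing a key land on the same reducer,
    # even when num_reducers < number of distinct keys.
    return hash(key) % num_reducers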
Thank you very much.

Related

Reuse output of two MapReduce jobs and join the results together

I would like to join the output of two different MapReduce jobs. I want to be able to do something like the example below, but I cannot figure out how to reuse the results from the previous jobs and join them. How could I do this?
Job1:
Andrea Vanzo, c288f70f-f417-4a96-8528-25c61372cae7, 125
Job2:
c288f70f-f417-4a96-8528-25c61372cae7, 071e1103-1b06-4671-8324-a9beb3e90d18, 25
Result:
Andrea Vanzo, c288f70f-f417-4a96-8528-25c61372cae7, 25
You can use JobControl to set up a workflow of your MapReduce jobs. Alternatively, reading both job1's and job2's output in a single job (using MultipleInputs) would also solve your problem.
Use different processing logic for each record, and tag the data according to which input path it came from.
mapper
job1 data (from job1's path): split and write key = data[1], value = data[0] + "tagjob1"
job2 data (from job2's path): split and write key = data[0], value = data[2] + "tagjob2"
reducer
each key now has its set of values;
put the values into two lists, grouped by your "tag";
write the key together with each element of the Cartesian product of the two lists (see the sketch below).
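For illustration, here is a minimal streaming-style sketch of that tag-and-join idea in Python (the original answer targets the Java MapReduce API; field positions follow the sample records above, and all names are illustrative):

from itertools import product

def map_job1(line):
    # "Andrea Vanzo, c288f70f-..., 125"  ->  key = uuid, value tagged "tagjob1"
    name, uuid, _count = [f.strip() for f in line.split(",")]
    print("%s\t%s\t%s" % (uuid, "tagjob1", name))

def map_job2(line):
    # "c288f70f-..., 071e1103-..., 25"   ->  key = first uuid, value tagged "tagjob2"
    uuid, _other, score = [f.strip() for f in line.split(",")]
    print("%s\t%s\t%s" % (uuid, "tagjob2", score))

def reduce_group(key, tagged_values):
    # tagged_values: all (tag, value) pairs for one key, collected after the shuffle
    job1_side = [v for tag, v in tagged_values if tag == "tagjob1"]
    job2_side = [v for tag, v in tagged_values if tag == "tagjob2"]
    for name, score in product(job1_side, job2_side):
        print("%s, %s, %s" % (name, key, score))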
Hope this helps.

Algorithmic / coding help for a PySpark markov model

I need some help getting my brain around designing an (efficient) Markov chain in Spark (via Python). I've written it as best I could, but the code I came up with doesn't scale. Basically, for the various map stages I wrote custom functions, and they work fine for sequences of a couple of thousand, but when we get to 20,000+ (and I've got some up to 800k) things slow to a crawl.
For those of you not familiar with Markov models, this is the gist of it.
This is my data.. I've got the actual data (no header) in an RDD at this point.
ID, SEQ
500, HNL, LNH, MLH, HML
We look at sequences in tuples, so
(HNL, LNH), (LNH,MLH), etc..
And I need to get to this point, where I return a dictionary (for each row of data) that I then serialize and store in an in-memory database:
{500:
{HNLLNH : 0.333},
{LNHMLH : 0.333},
{MLHHML : 0.333},
{LNHHNL : 0.000},
etc..
}
So in essence, each sequence element is combined with the next (HNL, LNH become 'HNLLNH'); then, for all possible transitions (combinations of sequence elements), we count their occurrences and divide by the total number of transitions (3 in this case) to get their frequency of occurrence.
There were 3 transitions above, and one of them was HNLLNH, so for HNLLNH, 1/3 = 0.333.
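That counting step, in isolation, looks roughly like this (plain Python, just to pin down the arithmetic on the sample row):

from collections import Counter

seq = ["HNL", "LNH", "MLH", "HML"]
transitions = [a + b for a, b in zip(seq, seq[1:])]   # ['HNLLNH', 'LNHMLH', 'MLHHML']
counts = Counter(transitions)
freqs = {t: c / float(len(transitions)) for t, c in counts.items()}
# {'HNLLNH': 0.333..., 'LNHMLH': 0.333..., 'MLHHML': 0.333...}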
As a side note, and I'm not sure if it's relevant, the values for each position in a sequence are limited: 1st position (H/M/L), 2nd position (M/L), 3rd position (H/M/L).
What my code previously did was collect() the RDD and then map over it a couple of times using functions I wrote. Those functions first turned the string into a list, then merged list[1] with list[2], then list[2] with list[3], then list[3] with list[4], etc., so I ended up with something like this:
[HNLLNH],[LNHMLH],[MHLHML], etc..
Then the next function created a dictionary out of that list, using each list item as a key, counted the total occurrences of that key in the full list, and divided by len(list) to get the frequency. I then wrapped that dictionary in another dictionary, along with its ID number (resulting in the 2nd code block above).
Like I said, this worked well for small-ish sequences, but not so well for lists with a length of 100k+.
Also, keep in mind this is just one row of data. I have to perform this operation on anywhere from 10-20k rows of data, with rows varying in length from 500 to 800,000 sequences per row.
Any suggestions on how I can write pyspark code (using the API map/reduce/agg/etc.. functions) to do this efficiently?
EDIT
Code follows; it probably makes sense to start at the bottom. Please keep in mind I'm learning this (Python and Spark) as I go, and I don't do this for a living, so my coding standards are not great.
def f(x):
    # Custom RDD map function
    # Combines two separate transactions
    # into a single transition state
    cust_id = x[0]
    trans = ','.join(x[1])
    y = trans.split(",")
    s = ''
    for i in range(len(y) - 1):
        s = s + str(y[i] + str(y[i + 1])) + ","
    return str(cust_id + ',' + s[:-1])

def g(x):
    # Custom RDD map function
    # Calculates the transition state probabilities
    # by adding up state-transition occurrences
    # and dividing by total transitions
    cust_id = str(x.split(",")[0])
    trans = x.split(",")[1:]
    temp_list = []
    middle = int((len(trans[0]) + 1) / 2)
    for i in trans:
        temp_list.append((''.join(i)[:middle], ''.join(i)[middle:]))
    state_trans = {}
    for i in temp_list:
        state_trans[i] = temp_list.count(i) / (len(temp_list))
    my_dict = {}
    my_dict[cust_id] = state_trans
    return my_dict

def gen_tsm_dict_spark(lines):
    # Takes RDD/string input with format CUST_ID(or)PROFILE_ID,SEQ,SEQ,SEQ....
    # Returns RDD of dict with CUST_ID and tsm per customer
    # i.e. {cust_id : { ('NLN', 'LNN') : 0.33, ('HPN', 'NPN') : 0.66}
    # creates a tuple ([cust/profile_id], [SEQ,SEQ,SEQ])
    cust_trans = lines.map(lambda s: (s.split(",")[0], s.split(",")[1:]))
    with_seq = cust_trans.map(f)
    full_tsm_dict = with_seq.map(g)
    return full_tsm_dict

def main():
    result = gen_tsm_dict_spark(my_rdd)
    # Insert into DB
    for x in result.collect():
        for k, v in x.iteritems():
            db_insert(k, v)
You can try something like the code below. It depends heavily on toolz, but if you prefer to avoid external dependencies you can easily replace it with some standard Python libraries.
from __future__ import division
from collections import Counter
from itertools import product
from toolz.curried import sliding_window, map, pipe, concat
from toolz.dicttoolz import merge

# Generate all possible transitions
defaults = sc.broadcast(dict(map(
    lambda x: ("".join(concat(x)), 0.0),
    product(product("HNL", "NL", "HNL"), repeat=2))))

rdd = sc.parallelize(["500, HNL, LNH, NLH, HNL", "600, HNN, NNN, NNN, HNN, LNH"])

def process(line):
    """
    >>> process("000, HHH, LLL, NNN")
    ('000', {'LLLNNN': 0.5, 'HHHLLL': 0.5})
    """
    bits = line.split(", ")
    transactions = bits[1:]
    n = len(transactions) - 1
    frequencies = pipe(
        sliding_window(2, transactions),  # Get all transitions
        map(lambda p: "".join(p)),        # Joins strings
        Counter,                          # Count
        lambda cnt: {k: v / n for (k, v) in cnt.items()}  # Get frequencies
    )
    return bits[0], frequencies

def store_partition(iter):
    for (k, v) in iter:
        db_insert(k, merge([defaults.value, v]))

rdd.map(process).foreachPartition(store_partition)
Since you know all possible transitions, I would recommend using a sparse representation and ignoring zeros. Moreover, you can replace dictionaries with sparse vectors to reduce the memory footprint.
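As a sketch of that idea (assuming pyspark.mllib's SparseVector and a fixed index over all possible transitions; the helper names are made up):

from itertools import product
from pyspark.mllib.linalg import SparseVector

# Fixed ordering of every possible transition string, e.g. "HNLLNH"
states = list(product("HNL", "NL", "HNL"))
all_transitions = ["".join(a + b) for a in states for b in states]
index = {t: i for i, t in enumerate(all_transitions)}

def to_sparse(frequencies):
    # frequencies: dict such as {"HNLLNH": 0.333, ...}, zero entries omitted
    entries = sorted((index[t], f) for t, f in frequencies.items())
    return SparseVector(len(all_transitions),
                        [i for i, _ in entries],
                        [f for _, f in entries])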
You can achieve this result using pure PySpark; I did it with PySpark.
To create the frequencies, let's say you have already reached this point, and these are your input RDDs:
ID, SEQ
500, [HNL, LNH, MLH, HML ...]
and to get frequencies like (HNL, LNH), (LNH, MLH), ...:
inputRDD.map(lambda (k, states): get_frequencies(states)).flatMap(lambda x: x) \
    .reduceByKey(lambda v1, v2: v1 + v2)

def get_frequencies(states_list):
    """
    :param states_list: a list of customer states.
    :return: state frequencies list.
    """
    rest = []
    tuples_list = []
    for idx in range(0, len(states_list)):
        if idx + 1 < len(states_list):
            tuples_list.append((states_list[idx], states_list[idx + 1]))
    unique = set(tuples_list)
    for value in unique:
        rest.append((value, tuples_list.count(value)))
    return rest
and you will get results like:
((HNL, LNH), 98),((LNH, MLH), 458),() ......
After this you can convert the result RDDs into DataFrames, or you can insert directly into the DB from the RDDs using mapPartitions.

How to write the result (from python) to mongodb (in json)?

I am writing a job-to-job list. For a given JobID, I would like to output its similar jobs (ranked by a self-defined score in descending order). For example, the structure should be:
"_id":"abcd","job":1234,"jobList":[{"job":1,"score":0.9},{"job":2,"score":0.8},{"job":3,"score":0.7}]}
Here the JobID is 1234, and jobs 1, 2, and 3 are its similar jobs, listed as name-score pairs.
And my Python code is:
import csv
import json

def sortSparseMatrix(m, rev=True, only_indices=True):
    f = open("/root/workspace/PythonOutPut/test.json", 'wb')
    w = csv.writer(f, dialect='excel')
    col_list = [None] * (m.shape[0])
    j = 0
    for i in xrange(m.shape[0]):
        d = m.getrow(i)
        if len(d.indices) != 0:
            s = zip(d.indices, d.data)
            s.sort(key=lambda v: v[1], reverse=True)
            if only_indices:
                col_list[j] = [[element[0], element[1]] for element in s]
                col_list[j] = col_list[j][0:4]
                h1 = u'Job' + ":" + str(col_list[j][0][0]) + ","
                json.dump(h1, f)
                h2 = []
                h3 = u'JobList' + ":"
                json.dump(h3, f)
                for subrow in col_list[j][1:]:
                    h2.append(u'{Job' + ":" + str(subrow[0]) + "," + u'score' + ":" + str(subrow[1]) + "}")
                json.dump(h2, f)
                del col_list[j][:]
                j = j + 1
Here d contains the unsorted name-score pairs for the JobID col_list[j][0][0] (after sorting, the job most similar to JobID col_list[j][0][0], i.e. the one with the highest score, is the job itself). d.data holds the scores, and [element[0], element[1]] is a name-score pair. I would like to keep the three most similar jobs for each JobID, dump h1 first (to show the JobID), and then output the list of similar jobs in h2.
I ran mongoimport --db test_database --collection TestOfJSON --type csv --file /as above/ --fields JobList. It imports the result into MongoDB, but as a single JobID with many fields. What I want is each JobID associated only with its similar jobs' name-score pairs. How should I do this? Thanks.
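One alternative worth sketching: build each record as a Python dict and insert it directly with the pymongo driver, skipping the CSV/mongoimport round trip (connection details and field names here are illustrative assumptions):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")     # hypothetical local server
collection = client["test_database"]["TestOfJSON"]

def save_job(job_id, similar):
    # similar: list of (job, score) pairs, already sorted by score descending
    doc = {
        "job": job_id,
        "jobList": [{"job": j, "score": s} for j, s in similar[:3]],
    }
    collection.insert_one(doc)   # MongoDB generates the _id automatically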

File to File Comparison of Primary Key to Select Specific Records

I am currently trying to put together a Python script to compare two text files (tab-separated values). The smaller file consists of one field per record, holding key values (much like a database primary key), whereas the larger file has a first-field key and up to thousands of fields per record, with tens of thousands of records.
I am trying to select (from the larger file) only the records which match their corresponding key in the smaller file, and output these to a new text file. The keys occur in the first field of each record.
I have hit a wall. Admittedly, I have been trying for loops, and thus far have had minimal success. I got it to display the key values of each file--a small victory!
I may be a glutton for punishment, as I am bent on using python (2.7) to solve this, rather than import it into something SQL based; I will never learn otherwise!
UPDATE: I have the following code thus far. Is the use of forward-slash correct for the write statement?
# Defining some counters, and setting them to zero.
counter_one = 0
counter_two = 0
counter_three = 0
counter_four = 0

# Defining a couple arrays for sorting purposes.
array_one = []
array_two = []

# This module opens the list of records to be selected.
with open("c:\lines_to_parse.txt") as f0:
    LTPlines = f0.readlines()

for i, line in enumerate(LTPlines):
    returned_line = line.split()
    array_one.append(returned_line)

for line in array_one:
    counter_one = counter_one + 1

# This module opens the file to be trimmed as an array.
with open('c:\target_data.txt') as f1:
    targetlines = f1.readlines()

for i, line in enumerate(targetlines):
    array_two.append(line.split())

for line in array_two:
    counter_two = counter_two + 1

# The last module performs a logical check
# of the data and writes to a tertiary file.
with open("c:/research/results", 'w') as f2:
    while counter_three <= 3:  # ****Arbitrarily set, to test if the program will work.
        if array_one[counter_three][0] == array_two[counter_four][0]:
            f2.write(str(array_two[counter_four]))
            counter_three = (counter_three + 1)
            counter_four = (counter_four + 1)
        else:
            counter_four = (counter_four + 1)
You could create a dictionary from the keys in the small file: use each key from the small file as a dictionary key, with True as the value (the value itself is not important). Keep this dict in memory.
Then open the file you will write to (the output file) and the larger file. For each line in the larger file, check whether its key exists in the dictionary, and if it does, write the line to the output file.
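A minimal sketch of that approach (paths are the ones from the question, written with forward slashes; a set would work just as well as a dict):

# Build a lookup of the keys from the smaller file.
wanted = {}
with open("c:/lines_to_parse.txt") as small:
    for line in small:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            wanted[fields[0]] = True

# Stream the larger file and keep only the records whose key matches.
with open("c:/target_data.txt") as big, open("c:/research/results", "w") as out:
    for line in big:
        key = line.split("\t", 1)[0]
        if key in wanted:
            out.write(line)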
I am not sure if this is clear enough, or if that was your actual problem.

Simple example of retrieving 500 items from dynamodb using Python

Looking for a simple example of retrieving 500 items from DynamoDB while minimizing the number of queries. I know there's a "multiget" function that would let me break this up into chunks of 50 queries, but I'm not sure how to do it.
I'm starting with a list of 500 keys. I'm then thinking of writing a function that takes this list of keys, breaks it up into "chunks," retrieves the values, stitches them back together, and returns a dict of 500 key-value pairs.
Or is there a better way to do this?
As a corollary, how would I "sort" the items afterwards?
Depending on your schema, there are two ways of efficiently retrieving your 500 items.
1. Items are under the same hash_key, using a range_key:
use the query method with the hash_key;
you may ask to sort the range_keys A-Z or Z-A.
2. Items are on "random" keys:
you said it: use the BatchGetItem method;
good news: the limit is actually 100 items per request or 1 MB max;
you will have to sort the results on the Python side.
On the practical side, since you use Python, I highly recommend the Boto library for low-level access, or the dynamodb-mapper library for higher-level access (disclaimer: I am one of the core developers of dynamodb-mapper).
Sadly, neither of these libraries provides an easy way to wrap the batch_get operation. On the contrary, there is a generator for scan and for query which 'pretends' you get everything in a single query.
In order to get optimal results with the batch query, I recommend this workflow:
submit a batch with all of your 500 items.
store the results in your dicts
re-submit with the UnprocessedKeys as many times as needed
sort the results on the python side
Quick example
I assume you have created a table "MyTable" with a single hash_key
import boto

# Helper function. This is more or less the code
# I added to the develop branch
def resubmit(batch, prev):
    # Empty (re-use) the batch
    del batch[:]

    # The batch answer contains the list of
    # unprocessed keys grouped by tables
    if 'UnprocessedKeys' in prev:
        unprocessed = prev['UnprocessedKeys']
    else:
        return None

    # Load the unprocessed keys
    for table_name, table_req in unprocessed.iteritems():
        table_keys = table_req['Keys']
        table = batch.layer2.get_table(table_name)

        keys = []
        for key in table_keys:
            h = key['HashKeyElement']
            r = None
            if 'RangeKeyElement' in key:
                r = key['RangeKeyElement']
            keys.append((h, r))

        attributes_to_get = None
        if 'AttributesToGet' in table_req:
            attributes_to_get = table_req['AttributesToGet']

        batch.add_batch(table, keys, attributes_to_get=attributes_to_get)

    return batch.submit()

# Main
db = boto.connect_dynamodb()
table = db.get_table('MyTable')
batch = db.new_batch_list()

keys = range(100)  # Get items from 0 to 99
batch.add_batch(table, keys)

res = batch.submit()
while res:
    print res  # Do some useful work here
    res = resubmit(batch, res)

# The END
EDIT:
I've added a resubmit() function to BatchList in the Boto develop branch. It greatly simplifies the workflow:
add all of your requested keys to the BatchList
submit()
resubmit() as long as it does not return None
This should be available in the next release.
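Under that workflow, the loop would reduce to something like this (a sketch only, assuming the develop-branch resubmit() behaves as described and returns None once there are no UnprocessedKeys left):

import boto

db = boto.connect_dynamodb()
table = db.get_table('MyTable')

batch = db.new_batch_list()
batch.add_batch(table, range(500))   # all of the requested keys

res = batch.submit()
while res is not None:
    # collect/merge the returned items here
    res = batch.resubmit()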
