Create glue workflows using for_each, multiple lists - python

I need to create around 25 Glue workflows, and we are using Terraform for this. I am trying to use multiple lists together with Terraform's for_each so that the script stays short and is easier to maintain. Below is my Terraform script; when I run it I get a "Duplicate object key" error. I think setproduct generates the full cross product of the three lists (27 combinations), while I need exactly 3, one for each index of the lists. I did refer to this SO question: for loop with multiple lists
locals {
  wf_name       = [var.cdl_workflow_apld1, var.cdl_workflow_apld2, var.cdl_workflow_apld3]
  glue_job_name = [var.cdl_apld_glue_job1, var.cdl_apld_glue_job2, var.cdl_apld_glue_job3]
  trigger_type  = ["ON_DEMAND", "CONDITIONAL", "CONDITIONAL"]

  wf_props = { for val in setproduct(local.wf_name, local.glue_job_name, local.trigger_type) :
    "${val[0]}-${val[1]}-${val[2]}" => val }
}

resource "aws_glue_workflow" "cdl_workflow_apld" {
  name = var.cdl_workflow_apld
}

resource "aws_glue_trigger" "cdl_workflow_apld" {
  for_each      = local.wf_props
  name          = each.value[0]
  type          = each.value[2]
  workflow_name = aws_glue_workflow.cdl_workflow_apld.name

  actions {
    job_name = each.value[1]
  }
}
I am expecting that, for each trigger, the value at the corresponding index of wf_name and glue_job_name is applied.
Also, for trigger_type I would like to set ON_DEMAND only for the first index (the first workflow) and CONDITIONAL for all the remaining ones. Any suggestions/help with code, please?
Expected output
workflow1:
workflow_name ="myworkflow_name"
name: var.cdl_workflow_apld1 (from variable.tf & tfvars)
job_name = var.cdl_apld_glue_job1 (from variable.tf & tfvars)
type = ON_DEMAND
workflow2:
workflow_name ="myworkflow_name"
name: var.cdl_workflow_apld2 (from variable.tf & tfvars)
job_name = var.cdl_apld_glue_job2 (from variable.tf & tfvars)
type = CONDITIONAL
The error message:
Error: Duplicate object key

  on glue_wf_apld_test.tf line 7, in locals:
   6:   wf_props = {for val in setproduct(local.wf_name, local.glue_job_name, local.trigger_type):
   7:     "${val[0]}-${val[1]}-${val[2]}" => val}

  val[0] is "tr_apld_landing_to_raw"
  val[1] is "dev_load_landing_to_raw_jb"
  val[2] is "CONDITIONAL"

Two different items produced the key "tr_apld_landing_to_raw-dev_load_landing_to_raw_jb-CONDITIONAL" in this 'for' expression. If duplicates are expected, use the ellipsis (...) after the value expression to enable grouping by key.
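For intuition, here is the difference between the Cartesian product that setproduct builds and the index-wise pairing wanted here, sketched in Python with placeholder values (not the real variable contents):

from itertools import product

# Placeholder values standing in for the three Terraform lists.
wf_name = ["wf1", "wf2", "wf3"]
glue_job_name = ["job1", "job2", "job3"]
trigger_type = ["ON_DEMAND", "CONDITIONAL", "CONDITIONAL"]

# setproduct-style cross product: 3 * 3 * 3 = 27 tuples, and because
# "CONDITIONAL" appears twice, some generated keys collide -- the same
# collision Terraform reports as "Duplicate object key".
combos = list(product(wf_name, glue_job_name, trigger_type))
print(len(combos), len({f"{a}-{b}-{c}" for a, b, c in combos}))  # 27 18

# Index-wise pairing: exactly one entry per position, with the trigger
# type derived from the index instead of a third list.
wf_props = {
    name: {"job_name": job, "type": "ON_DEMAND" if i == 0 else "CONDITIONAL"}
    for i, (name, job) in enumerate(zip(wf_name, glue_job_name))
}
print(wf_props)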

Related

python - how to create a more compact group for dictionary

Hi, this is part of my code for a biology project:
import pandas as pd
from scipy import stats

# choosing and loading the file:
df = pd.read_csv('Dafniyot_Data.csv', delimiter=',')
# grouping data by C/I groups:
CII = df[df['group'].str.contains('CII')]
CCI = df[df['group'].str.contains('CCI')]
CCC = df[df['group'].str.contains('CCC')]
III = df[df['group'].str.contains('III')]
CIC = df[df['group'].str.contains('CIC')]
ICC = df[df['group'].str.contains('ICC')]
IIC = df[df['group'].str.contains('IIC')]
ICI = df[df['group'].str.contains('ICI')]
# creating a dictionary of the groups:
dict = {'CII':CII, 'CCI':CCI, 'CCC':CCC, 'III':III, 'CIC':CIC, 'ICC':ICC, 'IIC':IIC, 'ICI':ICI}
#T test
#FERTUNITY
#using ttest for checking FERTUNITY - grandmaternal(F0)
t_F0a = stats.ttest_ind(CCC['N_offspring'],ICC['N_offspring'],nan_policy='omit')
t_F0b = stats.ttest_ind(CCI['N_offspring'],ICI['N_offspring'],nan_policy='omit')
t_F0c = stats.ttest_ind(IIC['N_offspring'],CIC['N_offspring'],nan_policy='omit')
t_F0d = stats.ttest_ind(CCI['N_offspring'],III['N_offspring'],nan_policy='omit')
t_F0 = {'FERTUNITY - grandmaternal(F0)':[t_F0a,t_F0b,t_F0c,t_F0d]}
I need to repeat the ttest part 6 more times, either changing the groups (CCC, etc.) or the column of the df ('N_offspring', 'survival'), which takes a lot of lines in the project.
I'm trying to find a way to still get the dictionary of each group in the end:
t_F0 = {'FERTUNITY - grandmaternal(F0)':[t_F0a,t_F0b,t_F0c,t_F0d]}
Because it's very useful for me later, but in a less repetitive way and with fewer lines.
Use itertools.product to generate all the keys, and a dict comprehension to generate the values:
from itertools import product
keys = [''.join(items) for items in product("CI", repeat=3)]
the_dict = { key: df[df['group'].str.contains(key)] for key in keys }
Similarly, you can generate the latter part of your test keys:
half_keys = [''.join(items) for items in product("CI", repeat=2)]
t_F0 = {
    'FERTUNITY - grandmaternal(F0)': [
        stats.ttest_ind(
            the_dict[f"C{half_key}"]['N_offspring'],
            the_dict[f"I{half_key}"]['N_offspring'],
            nan_policy='omit'
        ) for half_key in half_keys
    ],
}
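If the other repetitions only change the measured column (e.g. 'N_offspring' vs. 'survival'), a nested comprehension can cover those too; a minimal sketch, where the list of measure columns is an assumption to adapt to your data:

measures = ['N_offspring', 'survival']   # columns to test; adjust as needed
all_tests = {
    f'{measure} - grandmaternal(F0)': [
        stats.ttest_ind(
            the_dict[f"C{half_key}"][measure],
            the_dict[f"I{half_key}"][measure],
            nan_policy='omit'
        )
        for half_key in half_keys
    ]
    for measure in measures
}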
As an aside, you should not use dict as a variable name: it already has a meaning (the type of dict objects).
As a second aside, this deals with the literal question of how to DRY up creating a dictionary. However, do consider what Chris said in comments; this may be an XY problem.

How to change date format and have concatenated string matches in mongodb query filter?

I'm matching two collections residing in 2 different databases on a criterion and creating a new collection for the records that match it.
The code below works with a simple criterion, but I need a different one.
Definitions
function insertBatch(collection, documents) {
  var bulkInsert = collection.initializeUnorderedBulkOp();
  var insertedIds = [];
  var id;
  documents.forEach(function(doc) {
    id = doc._id;
    // Insert without raising an error for duplicates
    bulkInsert.find({_id: id}).upsert().replaceOne(doc);
    insertedIds.push(id);
  });
  bulkInsert.execute();
  return insertedIds;
}

function moveDocuments(sourceCollection, targetCollection, filter, batchSize) {
  print("Moving " + sourceCollection.find(filter).count() + " documents from " + sourceCollection + " to " + targetCollection);
  var count;
  while ((count = sourceCollection.find(filter).count()) > 0) {
    print(count + " documents remaining");
    sourceDocs = sourceCollection.find(filter).limit(batchSize);
    idsOfCopiedDocs = insertBatch(targetCollection, sourceDocs);
    targetDocs = targetCollection.find({_id: {$in: idsOfCopiedDocs}});
  }
  print("Done!")
}
Call
var db2 = new Mongo("<URI_1>").getDB("analy")
var db = new Mongo("<URI_2>").getDB("clone")
var readDocs= db2.coll1
var writeDocs= db.temp_coll
var Urls = new Mongo("<URI_2>").getDB("clone").myCollection.distinct("Url" ,{})
var filter= {"Url": {$in: Urls }}
moveDocuments(readDocs, writeDocs, filter, 10932)
In a nutshell, my criterion is the distinct "Url" string. Instead, I want the Url + Date string to be my criterion. There are 2 problems:
In one collection the date is in the format ISODate("2016-03-14T13:42:00.000+0000"), and in the other collection the date format is "2018-10-22T14:34:40Z". How can I make them uniform so that they match each other?
Assuming we get a solution to 1., and we create a new array of concatenated strings UrlsAndDate instead of Urls, how would we create a similar concatenated field on the fly and match it in the other collection?
For example: (non-functional code!)
var UrlsAndDate = new Mongo("<URI_2>").getDB("clone").myCollection.distinct("Url"+"formated_Date" ,{})
var filter= {"Url"+"formated_Date": {$in: Urls }}
readDocs.find(filter)
...and do the same stuff as above!
Any suggestions?
I've got a brute-force solution, but it isn't feasible!
Problem:
I want to merge 2 collections, mycoll & coll1. Both have fields named Url and Date. mycoll has 35,000 docs and coll1 has 4.7M docs (16+ GB), which can't be loaded into memory.
The algorithm, written using the pymongo client:
iterate over mycoll
    create a src string "url+common_date_format"
    # try to find a match in coll1; since coll1 is big I can't load it into
    # memory and treat it as a dictionary, so I iterate over each doc in
    # this collection again and again
    iterate over coll1
        create a destination string "url+common_date_format"
        if src_string == dest_string
            insert this doc into a new collection called temp_coll
This is a terrible algorithm, O(35000 * 4.7M); it would take ages to complete. If I could load the 4.7M docs into memory, the run time would reduce to O(35000), and that's doable.
Any suggestions for another algorithm?
The first thing I would do is create a compound index on {url: 1, date: 1} on both collections, if they don't already exist. Say collection A has 35k docs and collection B has 4.7M docs. We can't load the whole 4.7M-doc collection into memory. You are iterating over a cursor object of B in the inner loop; I assume that once that cursor is exhausted, you query the collection again.
Some observations on why we are iterating over 4.7M docs each time: instead of fetching all 4.7M docs and then matching, we could just fetch the docs that match url and date for each doc in A. Converting a_doc's date to b_doc's format and then querying is better than converting both to a common format, which would force us to iterate over all 4.7M docs. Read the pseudocode below.
a_docs = a_collection.find()
c_docs = []
for doc in a_docs:
    url = doc.url
    date = doc.date
    date = convert_to_b_collection_date_format(date)
    query = {'url': url, 'date': date}
    b_doc = b_collection.find(query)
    c_docs.append(b_doc)
c_docs = convert_c_docs_to_required_format(c_docs)
c_collection.insert_many(c_docs)
Above we loop over the 35k docs and run a filtered query for each one. Given that the indexes are already created, each lookup takes logarithmic time, which seems reasonable.
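A more concrete pymongo version of that pseudocode (a sketch only: the URI/database mapping, the field names, and which collection holds which date format are assumptions taken from the question and should be adapted):

from datetime import datetime
from pymongo import MongoClient, ASCENDING

# Assumed mapping, based on the question's "Call" section -- adjust as needed.
a_coll = MongoClient("<URI_2>")["clone"]["mycoll"]      # ~35k docs
b_coll = MongoClient("<URI_1>")["analy"]["coll1"]       # ~4.7M docs
c_coll = MongoClient("<URI_2>")["clone"]["temp_coll"]   # output

# Compound index so every per-document lookup is an index seek, not a scan.
b_coll.create_index([("Url", ASCENDING), ("Date", ASCENDING)])

def to_b_format(date_value):
    # Assumption: the source stores "2018-10-22T14:34:40Z" strings and the
    # target stores BSON dates (ISODate); swap the conversion if it is the
    # other way round in your data.
    return datetime.strptime(date_value, "%Y-%m-%dT%H:%M:%SZ")

batch = []
for a_doc in a_coll.find({}, {"Url": 1, "Date": 1}):
    match = b_coll.find_one({"Url": a_doc["Url"], "Date": to_b_format(a_doc["Date"])})
    if match is not None:
        batch.append(match)
    if len(batch) >= 1000:   # insert in batches to keep memory bounded
        c_coll.insert_many(batch)
        batch = []
if batch:
    c_coll.insert_many(batch)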

how to find the pathing flow and rank them using pig or hive?

Below is the example for my use case.
You can reference this question where an OP was asking something similar. If I am understanding your problem correctly, you want to remove duplicates from the path, but only when they occur next to each other. So 1 -> 1 -> 2 -> 1 would become 1 -> 2 -> 1. If this is correct, then you can't just group and distinct (as I'm sure you have noticed) because it will remove all duplicates. An easy solution is to write a UDF to remove those duplicates while preserving the distinct path of the user.
UDF:
package something;
import java.util.ArrayList;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class RemoveSequentialDuplicatesUDF extends UDF {
    public ArrayList<Text> evaluate(ArrayList<Text> arr) {
        ArrayList<Text> newList = new ArrayList<Text>();
        newList.add(arr.get(0));
        for (int i = 1; i < arr.size(); i++) {
            String front = arr.get(i).toString();
            String back = arr.get(i-1).toString();
            if (!back.equals(front)) {
                newList.add(arr.get(i));
            }
        }
        return newList;
    }
}
To build this jar you will need a hive-core.jar and hadoop-core.jar; you can find these here in the Maven Repository. Make sure you get the versions of Hive and Hadoop that you are using in your environment. Also, if you plan to run this in a production environment, I'd suggest adding some exception handling to the UDF. After the jar is built, import it and run this query:
Query:
add jar /path/to/jars/brickhouse-0.7.1.jar;
add jar /path/to/jars/hive_common-SNAPSHOT.jar;
create temporary function collect as "brickhouse.udf.collect.CollectUDAF";
create temporary function remove_dups as "something.RemoveSequentialDuplicatesUDF";
select screen_flow, count
     , dense_rank() over (order by count desc) rank
from (
    select screen_flow
         , count(*) count
    from (
        select session_id
             , concat_ws("->", remove_dups(screen_array)) screen_flow
        from (
            select session_id
                 , collect(screen_name) screen_array
            from (
                select *
                from database.table
                order by screen_launch_time ) a
            group by session_id ) b
        ) c
    group by screen_flow ) d
Output:
s1->s2->s3 2 1
s1->s2 1 2
s1->s2->s3->s1 1 2
Hope this helps.
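For quick experimentation outside Hive, the adjacent-duplicate removal that the UDF above performs can be sketched in a few lines of Python with itertools.groupby (an illustration of the logic only, not part of the Hive setup):

from itertools import groupby

def remove_sequential_duplicates(screens):
    # groupby collapses only *consecutive* equal items, mirroring the UDF.
    return [screen for screen, _ in groupby(screens)]

print(remove_sequential_duplicates(["S1", "S1", "S2", "S1"]))  # ['S1', 'S2', 'S1']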
Input
990004916946605-1404157897784,S1,1404157898275
990004916946605-1404157897784,S1,1404157898286
990004916946605-1404157897784,S2,1404157898337
990004947764274-1435162269418,S1,1435162274044
990004947764274-1435162269418,S2,1435162274057
990004947764274-1435162269418,S3,1435162274081
990004947764274-1435162287965,S2,1435162690002
990004947764274-1435162287965,S1,1435162690001
990004947764274-1435162287965,S3,1435162690003
990004947764274-1435162287965,S1,1435162690004
990004947764274-1435162212345,S1,1435168768574
990004947764274-1435162212345,S2,1435168768585
990004947764274-1435162212345,S3,1435168768593
register /home/cloudera/jar/ScreenFilter.jar;
screen_records = LOAD '/user/cloudera/inputfiles/screen.txt' USING PigStorage(',') AS(session_id:chararray,screen_name:chararray,launch_time:long);
screen_rec_order = ORDER screen_records by launch_time ASC;
session_grped = GROUP screen_rec_order BY session_id;
eached = FOREACH session_grped {
    ordered = ORDER screen_rec_order by launch_time;
    GENERATE group as session_id, REPLACE(BagToString(ordered.screen_name),'_','-->') as screen_str;
};
screen_each = FOREACH eached GENERATE session_id, GetOrderedScreen(screen_str) as screen_pattern;
screen_grp = GROUP screen_each by screen_pattern;
screen_final_each = FOREACH screen_grp GENERATE group as screen_pattern, COUNT(screen_each) as pattern_cnt;
ranker = RANK screen_final_each BY pattern_cnt DESC DENSE;
output_data = FOREACH ranker GENERATE screen_pattern, pattern_cnt, $0 as rank_value;
dump output_data;
I was not able to find a way to use a Pig built-in function to remove adjacent screens for the same session_id, hence I used a Java UDF to remove the adjacent screen names.
I created a Java UDF called GetOrderedScreen, converted it into a jar named ScreenFilter.jar, and registered that jar in this Pig script.
Below is the code for that GetOrderedScreen Java UDF:
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class GetOrderedScreen extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        String incoming_screen_str = (String) input.get(0);
        String outgoing_screen_str = "";
        String screen_array[] = incoming_screen_str.split("-->");
        String full_screen = screen_array[0];
        for (int i = 0; i < screen_array.length; i++) {
            String prefix_screen = screen_array[i];
            String suffix_screen = "";
            int j = i + 1;
            if (j < screen_array.length) {
                suffix_screen = screen_array[j];
            }
            if (!prefix_screen.equalsIgnoreCase(suffix_screen)) {
                full_screen = full_screen + "-->" + suffix_screen;
            }
        }
        outgoing_screen_str = full_screen.substring(0, full_screen.lastIndexOf("-->"));
        return outgoing_screen_str;
    }
}
Output
(S1-->S2-->S3,2,1)
(S1-->S2,1,2)
(S1-->S2-->S3-->S1,1,2)
Hope this helps you! Also, wait a little longer; someone who sees this question may answer more effectively (without a Java UDF).

Does spark utilize the sorted order of hbase keys, when using hbase as data source

I store time-series data in HBase. The rowkey is composed from user_id and timestamp, like this:
{
  "userid1-1428364800": {
    "columnFamily1": {
      "val": "1"
    }
  },
  "userid1-1428364803": {
    "columnFamily1": {
      "val": "2"
    }
  },
  "userid2-1428364812": {
    "columnFamily1": {
      "val": "abc"
    }
  }
}
Now I need to perform per-user analysis. Here is the initialization of hbase_rdd (from here)
sc = SparkContext(appName="HBaseInputFormat")
conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)
The natural mapreduce-like way to process would be:
hbase_rdd
.map(lambda row: (row[0].split('-')[0], (row[0].split('-')[1], row[1]))) # shift timestamp from key to value
.groupByKey()
.map(processUserData) # process user's data
While executing the first map (shifting the timestamp from key to value), it is crucial to know when the time-series data of the current user is finished, so that the groupByKey transformation could be started. That way we would not need to map over the whole table and store all the temporary data. This should be possible because HBase stores row keys in sorted order.
With Hadoop streaming it could be done this way:
import sys

current_user_data = []
last_userid = None

for line in sys.stdin:
    k, v = line.split('\t')
    userid, timestamp = k.split('-')
    if userid != last_userid and current_user_data:
        print processUserData(last_userid, current_user_data)
        last_userid = userid
        current_user_data = [(timestamp, v)]
    else:
        current_user_data.append((timestamp, v))
The question is: how to utilize the sorted order of hbase keys within Spark?
I'm not super familiar with the guarantees you get with the way you're pulling data from HBase, but if I understand correctly, I can answer with just plain old Spark.
You've got some RDD[X]. As far as Spark knows, the Xs in that RDD are completely unordered. But you have some outside knowledge, and you can guarantee that the data is in fact grouped by some field of X (and perhaps even sorted by another field).
In that case, you can use mapPartitions to do virtually the same thing you did with hadoop streaming. That lets you iterate over all the records in one partition, so you can look for blocks of records w/ the same key.
val myRDD: RDD[X] = ...
val groupedData: RDD[Seq[X]] = myRdd.mapPartitions { itr =>
  var currentUserData = new scala.collection.mutable.ArrayBuffer[X]()
  var currentUser: X = null
  // itr is an iterator over *all* the records in one partition
  itr.flatMap { x =>
    if (currentUser != null && x.userId == currentUser.userId) {
      // same user as before -- add the data to our list
      currentUserData += x
      None
    } else {
      // its a new user -- return all the data for the old user, and make
      // another buffer for the new user
      val userDataGrouped = currentUserData
      currentUserData = new scala.collection.mutable.ArrayBuffer[X]()
      currentUserData += x
      currentUser = x
      Some(userDataGrouped)
    }
  }
}
// now groupedRDD has all the data for one user grouped together, and we didn't
// need to do an expensive shuffle. Also, the above transformation is lazy, so
// we don't necessarily even store all that data in memory -- we could still
// do more filtering on the fly, eg:
val usersWithLotsOfData = groupedRDD.filter{ userData => userData.size > 10 }
I realize you wanted to use Python -- sorry, I figure I'm more likely to get the example correct if I write it in Scala, and I think the type annotations make the meaning clearer, but that is probably a Scala bias ... :). In any case, hopefully you can understand what is going on and translate it. (Don't worry too much about flatMap, Some & None; they're probably unimportant if you understand the idea ...)
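A rough Python translation of the same idea, for anyone who wants to stay in PySpark: a sketch that assumes the (rowkey, value) pairs produced by the converters in the question, with rowkey of the form "userid-timestamp".

from itertools import groupby

def group_consecutive_users(partition):
    # groupby merges only *consecutive* records with the same key, which is
    # enough here because HBase returns row keys in sorted order; it also
    # emits the final group when the partition iterator is exhausted.
    for userid, rows in groupby(partition, key=lambda kv: kv[0].split('-')[0]):
        yield userid, [(k.split('-')[1], v) for k, v in rows]

grouped_rdd = hbase_rdd.mapPartitions(group_consecutive_users)
results = grouped_rdd.map(lambda pair: processUserData(pair[0], pair[1]))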

What is the exact way to send an Erlang module and Erlang function to the MapReduce phase in the python-riak client?

Can anybody tell me, with an example, the correct way of sending the Erlang module and Erlang function to
query.map()
in the python riak client? In the documents it is described like this:
function (string, list) – Either a named Javascript function (ie: ‘Riak.mapValues’), or an anonymous javascript function (ie: ‘function(...) ... ‘ or an array [‘erlang_module’, ‘function’].
options (dict) – phase options, containing ‘language’, ‘keep’ flag, and/or ‘arg’.
but there is no clear information about what I have to send. Actually, I've been calling the query.map() phase as
query.map(['maps','fun']) # maps is the maps.erl and fun is the function in the maps.erl file
I have set the beam files path in app.config as mentioned in the documents, to point at the compiled beam files. I have done all of that, but I am getting an error after running these commands:
query.map(['maps','funs'])
>>> query.run()
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/lib/python2.6/site-packages/riak-1.5.2-py2.6.egg/riak/mapreduce.py", line 234, in run
result = t.mapred(self._inputs, query, timeout)
File "/usr/lib/python2.6/site-packages/riak-1.5.2-py2.6.egg/riak/transports/http.py", line 322, in mapred
(repr(response[0]), repr(response[1])))
Exception: Error running MapReduce operation. Headers: {'date': 'Mon, 26 May 2014 11:24:04 GMT', 'content-length': '1121', 'content-type': 'application/json'
, 'http_code': 500, 'server': 'MochiWeb/1.1 WebMachine/1.10.0 (never breaks eye contact)'} Body: '{"phase":0,"error":"undef","input":"{ok,{r_object,<<\\"tst\
\">>,<<\\"test5\\">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<\\"X-Riak-VTag\\
">>,50,53,75,69,55,80,113,109,65,69,117,106,109,109,99,65,72,101,75,82,115,86]],[[<<\\"index\\">>]],[],[[<<\\"X-Riak-Last-Modified\\">>|{1400,340359,663135}]
],[],[]}}},<<\\"6\\">>}],[{<<197,82,177,11,83,115,139,10>>,{1,63567559559}}], {dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],
[],[],[],[],[],[],[],[],[],...}}},...},...}","type":"error","stack":"[{maps,funs, [{r_object,<<\\"tst\\">>,<<\\"test5\\">>,[{r_content,{dict,3,16,16,8,80,48,{
[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[], [[<<\\"X-Riak-VTag\\">>,50,53,75,69,55,80,113,109,65,69,117,106,109,109,99,6
5,72,101,75,82,115,86]],[[<<\\"index\\">>]],[],[[<<\\"X-Riak-Last-Modified\\">>|{1400,340359,663135}]],[],[]}}},<<\\"6\\">>}],[{<<197,82,177,11,83,115,139,10
>>,{1,63567559559}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],...}}},...},...],...},...]"}'
What have I done wrong, or what have I missed? Please suggest.
There are 3 parts to using an Erlang map function with the Python client:
writing and compiling the Erlang module
preparing the Riak cluster
invoking the function from the Python client
The Erlang module should be fairly straightforward, for this example I will have the map function return the number of values(siblings) for each key:
-module(custom_mr).
-export([mapcount/3]).

mapcount(Obj,_Keydata,_Arg) ->
    [length(riak_object:get_values(Obj))].
Versions of Erlang vary in subtle ways, so it will be safer to use Riak's bundled Erlang, or the same one you used to compile it if you built from source. The resultant .beam file will need to be placed in a directory that is readable by the user that Riak is running as - this defaults to riak if you used a package install. You will need to deploy the .beam file and modify the app.config at each node in the cluster.
# /usr/lib/riak/erts-5.9.1/bin/erlc custom_mr.erl
# mkdir /var/lib/riak/custom_code
# mv custom_mr.beam /var/lib/riak/custom_code
# chown -R riak:riak /var/lib/riak/custom_code
Then edit app.config and add {add_paths,["/var/lib/riak/custom_code"]} to the riak_kv section, and restart the node.
Test from riak attach to make sure the new module has been loaded - in this example, nodes 1-4 have loaded the module, but node5 is down:
# riak attach
1> riak_core_util:rpc_every_member_ann(code,which,[custom_mr]).
{[{'riak#node1.lab.local',"/var/lib/riak/custom_code/custom_mr.beam"},
{'riak#node2.lab.local',"/var/lib/riak/custom_code/custom_mr.beam"},
{'riak#node3.lab.local',"/var/lib/riak/custom_code/custom_mr.beam"},
{'riak#node4.lab.local',"/var/lib/riak/custom_code/custom_mr.beam"}],
['riak#node5.lab.local']}
2> custom_mr:mapcount(riak_object:new(<<"test">>,<<"test">>,<<"test">>),keydata,arg).
[1]
(detach from the riak console with ctrl-d if you are running a pre-1.4 version, otherwise ctrl-c a)
Lastly, the Python code (I used the filename test.py):
import riak
client = riak.RiakClient()
test_bucket = client.bucket('test')
data1 = test_bucket.new('key1',data={'field1':'1data1','field2':'1data2','field3':1})
data1.store()
data2 = test_bucket.new('key2',data={'field1':'2data1','field2':'2data2','field3':2})
data2.store()
data3 = test_bucket.new('key3',data={'field1':'3data1','field2':'3data2','field3':3})
data3.store()
query = riak.RiakMapReduce(client).add('test')
query.map(['custom_mr','mapcount'])
for result in query.run():
    print "%s" % (result)
Running this code returns a 1 for each key in the bucket:
#python test.py
1
1
1
NOTE I did not do a get before putting the new values, so if your default bucket properties include allow_mult:true, running this a second time will create a sibling for each value, and you will get '2's instead of '1's
Adding further examples
New module, compile and install as above
-module(custom_mr).
-export([mapcount/3,
         mapvalue/3,
         mapfield/3,
         mapfieldwithid/3,
         reducecount/2,
         reducepropsort/2,
         reducedoublesort/2,
         reducesort/2]).

mapcount(Obj,_Kd,_Arg) ->
    [length(riak_object:get_values(Obj))].

mapvalue(Obj,_Kd,_Arg) ->
    [hd(riak_object:get_values(Obj))].

mapfield(Obj,_Kd,Arg) ->
    Val = case catch mochijson2:decode(hd(riak_object:get_values(Obj))) of
        {struct, Data} ->
            case Arg =:= null of
                true -> Data;
                false -> [{Arg,proplists:get_value(Arg,Data)}]
            end;
        _ ->
            [{Arg,{error,notjson}}]
    end,
    [list_to_binary(mochijson2:encode(Val))].

mapfieldwithid(Obj,_Kd,Arg) ->
    Val = case catch mochijson2:decode(hd(riak_object:get_values(Obj))) of
        {struct, Data} ->
            case Arg =:= null of
                true -> Data;
                false -> [{Arg,proplists:get_value(Arg,Data)}]
            end;
        _ ->
            [{Arg,{error,notjson}}]
    end,
    V = [{bucket,riak_object:bucket(Obj)},{key,riak_object:key(Obj)}|Val],
    [list_to_binary(mochijson2:encode(V))].

reducecount(L,_Arg) ->
    [lists:sum([ N || N <- L, is_integer(N) ])].

sortfun(F) ->
    fun(A,B) ->
        proplists:get_value(F,A,<<"zzzz">>) =< proplists:get_value(F,B,<<"zzzz">>)
    end.

reducepropsort(L,Arg) ->
    Decoded = [ I || {struct,I} <- [ mochijson2:decode(E) || E <- L], is_list(I)],
    Sorted = lists:sort(sortfun(Arg), Decoded),
    [ list_to_binary(mochijson2:encode(I)) || I <- Sorted ].

reducesort(L,_Arg) ->
    lists:sort(L).

reducedoublesort(L,Arg) ->
    Decoded = [ lists:sort(I) || {struct,I} <- [ mochijson2:decode(E) || E <- L], is_list(I)],
    Sorted = lists:sort(sortfun(Arg), Decoded),
    [ list_to_binary(mochijson2:encode(I)) || I <- Sorted ].
Python code
import riak
client = riak.RiakClient(pb_port=8087, host="172.31.0.1", protocol='pbc')
test_bucket = client.bucket('test_bucket')
data1 = test_bucket.new('key1',data={'field1':'1data1','field2':'1data2','field3':1, 'zone':'D'})
data1.store()
data2 = test_bucket.new('key2',data={'field1':'2data1','field2':'2data2','field3':2, 'zone':'A'})
data2.store()
data3 = test_bucket.new('key3',data={'field1':'3data1','field2':'3data2','field3':3, 'zone':'C'})
data3.store()
def printresult(q):
    for result in q.run():
        print "%s" % (result)
print "\nCount the number of values in the bucket"
query = riak.RiakMapReduce(client).add('test_bucket')
query.map(['custom_mr','mapcount'])
query.reduce(['custom_mr','reducecount'])
printresult(query)
print "\nList all values in natual sort order"
query = riak.RiakMapReduce(client).add('test_bucket')
query.map(['custom_mr','mapvalue'])
query.reduce(['custom_mr','reducesort'])
printresult(query)
print "\nList all values sorted by 'zone'"
query = riak.RiakMapReduce(client).add('test_bucket')
query.map(['custom_mr','mapfield'])
query.reduce(['custom_mr','reducepropsort'],{'arg':'zone'})
printresult(query)
print "\nList all values sorted by 'zone', also sort the fields in each object"
query = riak.RiakMapReduce(client).add('test_bucket')
query.map(['custom_mr','mapfield'])
query.reduce(['custom_mr','reducedoublesort'],{'arg':'zone'})
printresult(query)
print "\nList just field3, sorted"
query = riak.RiakMapReduce(client).add('test_bucket')
query.map(['custom_mr','mapfield'],{'arg':'field3'})
query.reduce(['custom_mr','reducepropsort'],{'arg':'field3'})
printresult(query)
print "\nList just bucket,key,field3, sorted by field3"
query = riak.RiakMapReduce(client).add('test_bucket')
query.map(['custom_mr','mapfieldwithid'],{'arg':'field3'})
query.reduce(['custom_mr','reducepropsort'],{'arg':'field3'})
printresult(query)
print "\nReturn just the zone for key2"
query = riak.RiakMapReduce(client).add('test_bucket','key2')
query.map(['custom_mr','mapfield'],{'arg':'zone'})
printresult(query)
print "\nReturn the bucket,key,zone for key1 and key3"
query = riak.RiakMapReduce(client).add('test_bucket',['key1','key3'])
query.map(['custom_mr','mapfieldwithid'],{'arg':'zone'})
query.reduce(['custom_mr','reducepropsort'],{'arg':'zone'})
printresult(query)
Please Note: Many of these examples use full-bucket MapReduce which will be very heavy and will likely affect performance if used on a non-trivial amount of data. The last 2 examples show how to select a specific key or a list of keys as input. A Secondary Index or Riak Search could also be used as input if the cluster is setup with those, see riak-python-client query inputs in the docs.
And the output:
# python ~/test.py
Count the number of values in the bucket
3
List all values in natural sort order
{"field2": "1data2", "field3": 1, "field1": "1data1", "zone": "D"}
{"field2": "2data2", "field3": 2, "field1": "2data1", "zone": "A"}
{"field2": "3data2", "field3": 3, "field1": "3data1", "zone": "C"}
List all values sorted by 'zone'
{"field2":"2data2","field3":2,"field1":"2data1","zone":"A"}
{"field2":"3data2","field3":3,"field1":"3data1","zone":"C"}
{"field2":"1data2","field3":1,"field1":"1data1","zone":"D"}
List all values sorted by 'zone', also sort the fields in each object
{"field1":"2data1","field2":"2data2","field3":2,"zone":"A"}
{"field1":"3data1","field2":"3data2","field3":3,"zone":"C"}
{"field1":"1data1","field2":"1data2","field3":1,"zone":"D"}
List just field3, sorted
{"field3":1}
{"field3":2}
{"field3":3}
List just bucket,key,field3, sorted by field3
{"bucket":"test_bucket","key":"key1","field3":1}
{"bucket":"test_bucket","key":"key2","field3":2}
{"bucket":"test_bucket","key":"key3","field3":3}
Return just the zone for key2
{"zone":"A"}
Return the bucket,key,zone for key1 and key3
{"bucket":"test_bucket","key":"key3","zone":"C"}
{"bucket":"test_bucket","key":"key1","zone":"D"}
