Below is an example for my use case.
You can reference this question, where an OP was asking something similar. If I understand your problem correctly, you want to remove duplicates from the path, but only when they occur next to each other: 1 -> 1 -> 2 -> 1 would become 1 -> 2 -> 1. If that's correct, you can't just group and distinct (as I'm sure you have noticed) because that removes all duplicates. An easy solution is to write a UDF that removes sequential duplicates while preserving the user's distinct path.
UDF:
package something;

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class RemoveSequentialDuplicatesUDF extends UDF {
    public ArrayList<Text> evaluate(ArrayList<Text> arr) {
        ArrayList<Text> newList = new ArrayList<Text>();
        newList.add(arr.get(0));
        for (int i = 1; i < arr.size(); i++) {
            String front = arr.get(i).toString();
            String back = arr.get(i - 1).toString();
            // Keep this element only if it differs from the one before it
            if (!back.equals(front)) {
                newList.add(arr.get(i));
            }
        }
        return newList;
    }
}
To build this jar you will need hive-core.jar and hadoop-core.jar; you can find these in the Maven Repository. Make sure you get the versions of Hive and Hadoop that you are using in your environment. Also, if you plan to run this in a production environment, I'd suggest adding some exception handling to the UDF.
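For instance, a minimal sketch of what that exception handling might look like, as a guarded version of the evaluate method above (the guard is my own addition, not part of the original UDF):

public ArrayList<Text> evaluate(ArrayList<Text> arr) {
    ArrayList<Text> newList = new ArrayList<Text>();
    // Guard against a null or empty input array, which would otherwise
    // throw an IndexOutOfBoundsException on arr.get(0)
    if (arr == null || arr.isEmpty()) {
        return newList;
    }
    newList.add(arr.get(0));
    for (int i = 1; i < arr.size(); i++) {
        if (!arr.get(i).toString().equals(arr.get(i - 1).toString())) {
            newList.add(arr.get(i));
        }
    }
    return newList;
}

After the jar is built, add it and run this query: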
Query:
add jar /path/to/jars/brickhouse-0.7.1.jar;
add jar /path/to/jars/hive_common-SNAPSHOT.jar;
create temporary function collect as "brickhouse.udf.collect.CollectUDAF";
create temporary function remove_dups as "something.RemoveSequentialDuplicatesUDF";

select screen_flow, count
     , dense_rank() over (order by count desc) rank
from (
    select screen_flow
         , count(*) count
    from (
        select session_id
             , concat_ws("->", remove_dups(screen_array)) screen_flow
        from (
            select session_id
                 , collect(screen_name) screen_array
            from (
                select *
                from database.table
                order by screen_launch_time ) a
            group by session_id ) b
        ) c
    group by screen_flow ) d
Output:
s1->s2->s3      2    1
s1->s2          1    2
s1->s2->s3->s1  1    2
Hope this helps.
Input
990004916946605-1404157897784,S1,1404157898275
990004916946605-1404157897784,S1,1404157898286
990004916946605-1404157897784,S2,1404157898337
990004947764274-1435162269418,S1,1435162274044
990004947764274-1435162269418,S2,1435162274057
990004947764274-1435162269418,S3,1435162274081
990004947764274-1435162287965,S2,1435162690002
990004947764274-1435162287965,S1,1435162690001
990004947764274-1435162287965,S3,1435162690003
990004947764274-1435162287965,S1,1435162690004
990004947764274-1435162212345,S1,1435168768574
990004947764274-1435162212345,S2,1435168768585
990004947764274-1435162212345,S3,1435168768593
register /home/cloudera/jar/ScreenFilter.jar;
screen_records = LOAD '/user/cloudera/inputfiles/screen.txt' USING PigStorage(',') AS(session_id:chararray,screen_name:chararray,launch_time:long);
screen_rec_order = ORDER screen_records by launch_time ASC;
session_grped = GROUP screen_rec_order BY session_id;
eached = FOREACH session_grped
{
ordered = ORDER screen_rec_order by launch_time;
GENERATE group as session_id, REPLACE(BagToString(ordered.screen_name),'_','-->') as screen_str;
};
screen_each = FOREACH eached GENERATE session_id, GetOrderedScreen(screen_str) as screen_pattern;
screen_grp = GROUP screen_each by screen_pattern;
screen_final_each = FOREACH screen_grp GENERATE group as screen_pattern, COUNT(screen_each) as pattern_cnt;
ranker = RANK screen_final_each BY pattern_cnt DESC DENSE;
output_data = FOREACH ranker GENERATE screen_pattern, pattern_cnt, $0 as rank_value;
dump output_data;
I was not able to find a way to use a Pig built-in function to remove adjacent screens for the same session_id, so I used a Java UDF to remove the adjacent screen names.
I created a Java UDF called GetOrderedScreen, converted it into a jar named ScreenFilter.jar, and registered that jar in this Pig script.
Below is the code for the GetOrderedScreen Java UDF:
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class GetOrderedScreen extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        String incoming_screen_str = (String) input.get(0);
        String[] screen_array = incoming_screen_str.split("-->");
        String full_screen = screen_array[0];
        for (int i = 0; i < screen_array.length; i++) {
            String prefix_screen = screen_array[i];
            String suffix_screen = "";
            int j = i + 1;
            if (j < screen_array.length) {
                suffix_screen = screen_array[j];
            }
            // Append the next screen only when it differs from the current one.
            // The final iteration compares against an empty suffix and appends
            // a trailing "-->", which is trimmed off below.
            if (!prefix_screen.equalsIgnoreCase(suffix_screen)) {
                full_screen = full_screen + "-->" + suffix_screen;
            }
        }
        return full_screen.substring(0, full_screen.lastIndexOf("-->"));
    }
}
Output
(S1-->S2-->S3,2,1)
(S1-->S2,1,2)
(S1-->S2-->S3-->S1,1,2)
Hope this helps! Also, give it some time; someone who sees this question may answer more effectively (without a Java UDF).
Can anybody show, with an example, the correct way of sending an Erlang module and Erlang function to query.map() in the Python Riak client? The documentation says:
function (string, list) – Either a named Javascript function (ie: ‘Riak.mapValues’), or an anonymous javascript function (ie: ‘function(...) ... ‘ or an array [‘erlang_module’, ‘function’].
options (dict) – phase options, containing ‘language’, ‘keep’ flag, and/or ‘arg’.
but there is no clear information about what I have to send. I have been calling the query.map() phase as
query.map(['maps','fun']) # maps is the maps.erl and fun is the function in the maps.erl file
I have set the beam file path in app.config as mentioned in the documentation, to point at the compiled beam files. I have done all of that, but I am getting an error after running these commands:
query.map(['maps','funs'])
>>> query.run()
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/lib/python2.6/site-packages/riak-1.5.2-py2.6.egg/riak/mapreduce.py", line 234, in run
result = t.mapred(self._inputs, query, timeout)
File "/usr/lib/python2.6/site-packages/riak-1.5.2-py2.6.egg/riak/transports/http.py", line 322, in mapred
(repr(response[0]), repr(response[1])))
Exception: Error running MapReduce operation. Headers: {'date': 'Mon, 26 May 2014 11:24:04 GMT', 'content-length': '1121', 'content-type': 'application/json'
, 'http_code': 500, 'server': 'MochiWeb/1.1 WebMachine/1.10.0 (never breaks eye contact)'} Body: '{"phase":0,"error":"undef","input":"{ok,{r_object,<<\\"tst\
\">>,<<\\"test5\\">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<\\"X-Riak-VTag\\
">>,50,53,75,69,55,80,113,109,65,69,117,106,109,109,99,65,72,101,75,82,115,86]],[[<<\\"index\\">>]],[],[[<<\\"X-Riak-Last-Modified\\">>|{1400,340359,663135}]
],[],[]}}},<<\\"6\\">>}],[{<<197,82,177,11,83,115,139,10>>,{1,63567559559}}], {dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],
[],[],[],[],[],[],[],[],[],...}}},...},...}","type":"error","stack":"[{maps,funs, [{r_object,<<\\"tst\\">>,<<\\"test5\\">>,[{r_content,{dict,3,16,16,8,80,48,{
[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[], [[<<\\"X-Riak-VTag\\">>,50,53,75,69,55,80,113,109,65,69,117,106,109,109,99,6
5,72,101,75,82,115,86]],[[<<\\"index\\">>]],[],[[<<\\"X-Riak-Last-Modified\\">>|{1400,340359,663135}]],[],[]}}},<<\\"6\\">>}],[{<<197,82,177,11,83,115,139,10
>>,{1,63567559559}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],...}}},...},...],...},...]"}'
What is wrong? What have I missed? Please suggest.
There are 3 parts to using an Erlang map function with the Python client:
writing and compiling the Erlang module
preparing the Riak cluster
invoking the function from the Python client
The Erlang module should be fairly straightforward; for this example I will have the map function return the number of values (siblings) for each key:
-module(custom_mr).
-export([mapcount/3]).
mapcount(Obj, _Keydata, _Arg) ->
    [length(riak_object:get_values(Obj))].
Versions of Erlang vary in subtle ways, so it is safest to compile with Riak's bundled Erlang, or with the same Erlang you used to compile Riak if you built from source. The resulting .beam file will need to be placed in a directory that is readable by the user Riak runs as; this defaults to riak if you used a package install. You will need to deploy the .beam file and modify app.config on each node in the cluster.
# /usr/lib/riak/erts-5.9.1/bin/erlc custom_mr.erl
# mkdir /var/lib/riak/custom_code
# mv custom_mr.beam /var/lib/riak/custom_code
# chown -R riak:riak /var/lib/riak/custom_code
Then edit app.config, add {add_paths,["/var/lib/riak/custom_code"]} to the riak_kv section, and restart the node. The relevant part of the file would look something like the sketch below.
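An abridged sketch of that riak_kv section, assuming the directory created above; only the add_paths tuple is new, and the surrounding settings are whatever your file already contains:

{riak_kv, [
    %% ... existing riak_kv settings stay as they are ...
    {add_paths, ["/var/lib/riak/custom_code"]}
]},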
Test from riak attach to make sure the new module has been loaded - in this example, nodes 1-4 have loaded the module, but node5 is down:
# riak attach
1> riak_core_util:rpc_every_member_ann(code,which,[custom_mr]).
{[{'riak@node1.lab.local',"/var/lib/riak/custom_code/custom_mr.beam"},
  {'riak@node2.lab.local',"/var/lib/riak/custom_code/custom_mr.beam"},
  {'riak@node3.lab.local',"/var/lib/riak/custom_code/custom_mr.beam"},
  {'riak@node4.lab.local',"/var/lib/riak/custom_code/custom_mr.beam"}],
 ['riak@node5.lab.local']}
2> custom_mr:mapcount(riak_object:new(<<"test">>,<<"test">>,<<"test">>),keydata,arg).
[1]
(Detach from the riak console with Ctrl-D if you are running a pre-1.4 version; otherwise use Ctrl-C followed by a.)
Lastly, the Python code (I used the filename test.py):
import riak
client = riak.RiakClient()
test_bucket = client.bucket('test')
data1 = test_bucket.new('key1',data={'field1':'1data1','field2':'1data2','field3':1})
data1.store()
data2 = test_bucket.new('key2',data={'field1':'2data1','field2':'2data2','field3':2})
data2.store()
data3 = test_bucket.new('key3',data={'field1':'3data1','field2':'3data2','field3':3})
data3.store()
query = riak.RiakMapReduce(client).add('test')
query.map(['custom_mr','mapcount'])
for result in query.run():
    print "%s" % (result)
Running this code returns a 1 for each key in the bucket:
# python test.py
1
1
1
NOTE: I did not do a get before putting the new values, so if your default bucket properties include allow_mult: true, running this a second time will create a sibling for each value and you will get 2s instead of 1s. A fetch-then-store pattern avoids this, as sketched below.
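A minimal sketch of that fetch-then-store pattern, assuming the same 1.5.x client API used above (set_data() is the old-style accessor on RiakObject):

import riak

client = riak.RiakClient()
test_bucket = client.bucket('test')

# Fetch the existing object first so the store carries its vector clock
# and replaces the value in place instead of creating a sibling
obj = test_bucket.get('key1')
obj.set_data({'field1': '1data1', 'field2': '1data2', 'field3': 1})
obj.store()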
Adding further examples
New module (compile and install as above):
-module(custom_mr).
-export([mapcount/3,
         mapvalue/3,
         mapfield/3,
         mapfieldwithid/3,
         reducecount/2,
         reducepropsort/2,
         reducedoublesort/2,
         reducesort/2]).

mapcount(Obj, _Kd, _Arg) ->
    [length(riak_object:get_values(Obj))].

mapvalue(Obj, _Kd, _Arg) ->
    [hd(riak_object:get_values(Obj))].

mapfield(Obj, _Kd, Arg) ->
    Val = case catch mochijson2:decode(hd(riak_object:get_values(Obj))) of
              {struct, Data} ->
                  case Arg =:= null of
                      true -> Data;
                      false -> [{Arg, proplists:get_value(Arg, Data)}]
                  end;
              _ ->
                  [{Arg, {error, notjson}}]
          end,
    [list_to_binary(mochijson2:encode(Val))].

mapfieldwithid(Obj, _Kd, Arg) ->
    Val = case catch mochijson2:decode(hd(riak_object:get_values(Obj))) of
              {struct, Data} ->
                  case Arg =:= null of
                      true -> Data;
                      false -> [{Arg, proplists:get_value(Arg, Data)}]
                  end;
              _ ->
                  [{Arg, {error, notjson}}]
          end,
    V = [{bucket, riak_object:bucket(Obj)}, {key, riak_object:key(Obj)} | Val],
    [list_to_binary(mochijson2:encode(V))].

reducecount(L, _Arg) ->
    [lists:sum([ N || N <- L, is_integer(N) ])].

sortfun(F) ->
    fun(A, B) ->
        proplists:get_value(F, A, <<"zzzz">>) =< proplists:get_value(F, B, <<"zzzz">>)
    end.

reducepropsort(L, Arg) ->
    Decoded = [ I || {struct, I} <- [ mochijson2:decode(E) || E <- L ], is_list(I) ],
    Sorted = lists:sort(sortfun(Arg), Decoded),
    [ list_to_binary(mochijson2:encode(I)) || I <- Sorted ].

reducesort(L, _Arg) ->
    lists:sort(L).

reducedoublesort(L, Arg) ->
    Decoded = [ lists:sort(I) || {struct, I} <- [ mochijson2:decode(E) || E <- L ], is_list(I) ],
    Sorted = lists:sort(sortfun(Arg), Decoded),
    [ list_to_binary(mochijson2:encode(I)) || I <- Sorted ].
Python code
import riak
client = riak.RiakClient(pb_port=8087, host="172.31.0.1", protocol='pbc')
test_bucket = client.bucket('test_bucket')
data1 = test_bucket.new('key1',data={'field1':'1data1','field2':'1data2','field3':1, 'zone':'D'})
data1.store()
data2 = test_bucket.new('key2',data={'field1':'2data1','field2':'2data2','field3':2, 'zone':'A'})
data2.store()
data3 = test_bucket.new('key3',data={'field1':'3data1','field2':'3data2','field3':3, 'zone':'C'})
data3.store()
def printresult(q):
    for result in q.run():
        print "%s" % (result)
print "\nCount the number of values in the bucket"
query = riak.RiakMapReduce(client).add('test_bucket')
query.map(['custom_mr','mapcount'])
query.reduce(['custom_mr','reducecount'])
printresult(query)
print "\nList all values in natual sort order"
query = riak.RiakMapReduce(client).add('test_bucket')
query.map(['custom_mr','mapvalue'])
query.reduce(['custom_mr','reducesort'])
printresult(query)
print "\nList all values sorted by 'zone'"
query = riak.RiakMapReduce(client).add('test_bucket')
query.map(['custom_mr','mapfield'])
query.reduce(['custom_mr','reducepropsort'],{'arg':'zone'})
printresult(query)
print "\nList all values sorted by 'zone', also sort the fields in each object"
query = riak.RiakMapReduce(client).add('test_bucket')
query.map(['custom_mr','mapfield'])
query.reduce(['custom_mr','reducedoublesort'],{'arg':'zone'})
printresult(query)
print "\nList just field3, sorted"
query = riak.RiakMapReduce(client).add('test_bucket')
query.map(['custom_mr','mapfield'],{'arg':'field3'})
query.reduce(['custom_mr','reducepropsort'],{'arg':'field3'})
printresult(query)
print "\nList just bucket,key,field3, sorted by field3"
query = riak.RiakMapReduce(client).add('test_bucket')
query.map(['custom_mr','mapfieldwithid'],{'arg':'field3'})
query.reduce(['custom_mr','reducepropsort'],{'arg':'field3'})
printresult(query)
print "\nReturn just the zone for key2"
query = riak.RiakMapReduce(client).add('test_bucket','key2')
query.map(['custom_mr','mapfield'],{'arg':'zone'})
printresult(query)
print "\nReturn the bucket,key,zone for key1 and key3"
query = riak.RiakMapReduce(client).add('test_bucket',['key1','key3'])
query.map(['custom_mr','mapfieldwithid'],{'arg':'zone'})
query.reduce(['custom_mr','reducepropsort'],{'arg':'zone'})
printresult(query)
Please note: many of these examples use full-bucket MapReduce, which is very heavy and will likely affect performance if used on a non-trivial amount of data. The last two examples show how to select a specific key or a list of keys as input. A secondary index or Riak Search could also be used as input if the cluster is set up with those; see riak-python-client query inputs in the docs.
And the output:
# python ~/test.py
Count the number of values in the bucket
3
List all values in natural sort order
{"field2": "1data2", "field3": 1, "field1": "1data1", "zone": "D"}
{"field2": "2data2", "field3": 2, "field1": "2data1", "zone": "A"}
{"field2": "3data2", "field3": 3, "field1": "3data1", "zone": "C"}
List all values sorted by 'zone'
{"field2":"2data2","field3":2,"field1":"2data1","zone":"A"}
{"field2":"3data2","field3":3,"field1":"3data1","zone":"C"}
{"field2":"1data2","field3":1,"field1":"1data1","zone":"D"}
List all values sorted by 'zone', also sort the fields in each object
{"field1":"2data1","field2":"2data2","field3":2,"zone":"A"}
{"field1":"3data1","field2":"3data2","field3":3,"zone":"C"}
{"field1":"1data1","field2":"1data2","field3":1,"zone":"D"}
List just field3, sorted
{"field3":1}
{"field3":2}
{"field3":3}
List just bucket,key,field3, sorted by field3
{"bucket":"test_bucket","key":"key1","field3":1}
{"bucket":"test_bucket","key":"key2","field3":2}
{"bucket":"test_bucket","key":"key3","field3":3}
Return just the zone for key2
{"zone":"A"}
Return the bucket,key,zone for key1 and key3
{"bucket":"test_bucket","key":"key3","zone":"C"}
{"bucket":"test_bucket","key":"key1","zone":"D"}