We have a lot of historical data that we need to migrate into HBase. Our HBase setup is such that (timestamp) versioning is relevant, and using some domain knowledge we know at which time the different columns became available. The amount of data is vast, so I was wondering what a good way of doing this bulk load would be. Scala or Python is fine, preferably with Spark.
I've published a gist that gets you most of the way there. I'll reproduce the most relevant method here:
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.spark.sql.Dataset
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

// HBase.getHBase and HBaseData are defined elsewhere in the gist
def write[TK, TF, TQ, TV](
  tableName: String,
  ds: Dataset[(TK, Map[TF, Map[TQ, TV]])],
  batch: Int = 1000
)(implicit
  fk: TK => HBaseData,
  ff: TF => HBaseData,
  fq: TQ => HBaseData,
  fv: TV => HBaseData
): Unit = {
  ds.foreachPartition(p => {
    val hbase = HBase.getHBase
    val table = hbase.getTable(TableName.valueOf(tableName))
    val puts = ArrayBuffer[Put]()

    p.foreach(r => {
      val put = new Put(r._1)
      r._2.foreach(f => {
        f._2.foreach(q => {
          put.addColumn(f._1, q._1, q._2)
        })
      })
      puts += put

      // flush a full batch of puts
      if (puts.length >= batch) {
        table.put(puts.asJava)
        puts.clear()
      }
    })

    // flush whatever is left over
    if (puts.nonEmpty) {
      table.put(puts.asJava)
      puts.clear()
    }

    table.close()
  })
}
The caveat is that this method only uses the default HBase timestamp, so it will have to be extended to let you provide your own timestamps. Essentially, just turn the TV type into a Map[Long, TV] and add the appropriate additional nested loop.
The HBaseData type is a case class with several implicit methods that convert the most common types to an Array[Byte] for efficient HBase storage.
The getHBase method ensures only one connection to HBase per partition, to avoid connecting and disconnecting for every record.
Hopefully this is all sensible, as I implemented this as a beginner in generics.
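Since the question says Python is also fine, here is a rough PySpark sketch of the same pattern with explicit per-cell timestamps. It is not part of the gist above; it assumes the happybase client plus a hypothetical host and table name, and that keys and values are already encoded as bytes:

import happybase

def write_partition(rows):
    # one connection per partition, analogous to HBase.getHBase above
    connection = happybase.Connection('hbase-host')  # hypothetical host
    table = connection.table('my_table')             # hypothetical table name
    for row_key, families in rows:
        for family, qualifiers in families.items():
            for qualifier, (ts, value) in qualifiers.items():
                # happybase accepts an explicit timestamp for each put
                table.put(row_key, {family + b':' + qualifier: value}, timestamp=ts)
    connection.close()

# rdd holds (row_key, {family: {qualifier: (timestamp, value)}}) tuples
rdd.foreachPartition(write_partition)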
I need to create around 25 Glue workflows, and we are using Terraform for this. I am trying to use multiple lists together with Terraform's for_each so that the script stays short and is easier to maintain. Following is my Terraform script; when I try to run it I am getting a "duplicate object key" error. I think Terraform generates the full cross product of the lists, while I need exactly 3 workflows, one for each index of the lists. I did refer to this SO question: for loop with multiple lists
locals {
  wf_name       = [var.cdl_workflow_apld1, var.cdl_workflow_apld2, var.cdl_workflow_apld3]
  glue_job_name = [var.cdl_apld_glue_job1, var.cdl_apld_glue_job2, var.cdl_apld_glue_job3]
  trigger_type  = ["ON_DEMAND", "CONDITIONAL", "CONDITIONAL"]
  wf_props = { for val in setproduct(local.wf_name, local.glue_job_name, local.trigger_type) :
    "${val[0]}-${val[1]}-${val[2]}" => val }
}
resource "aws_glue_workflow" "cdl_workflow_apld" {
name = var.cdl_workflow_apld
}
resource "aws_glue_trigger" "cdl_workflow_apld" {
for_each = local.wf_props
name = each.value[0]
type = each.value[2]
workflow_name = aws_glue_workflow.cdl_workflow_apld.name
actions {
job_name = each.value[1]
}
}
I am expecting that for each workflow the entries at the same index of wf_name and glue_job_name are applied. Also, for trigger_type I would like ON_DEMAND only for the first index (the first workflow), and CONDITIONAL for all the rest. Any suggestions or help with the code, please?
Expected output
workflow1:
  workflow_name = "myworkflow_name"
  name          = var.cdl_workflow_apld1 (from variable.tf & tfvars)
  job_name      = var.cdl_apld_glue_job1 (from variable.tf & tfvars)
  type          = ON_DEMAND
workflow2:
  workflow_name = "myworkflow_name"
  name          = var.cdl_workflow_apld2 (from variable.tf & tfvars)
  job_name      = var.cdl_apld_glue_job2 (from variable.tf & tfvars)
  type          = CONDITIONAL
The error message:
Error: Duplicate object key
on glue_wf_apld_test.tf line 7, in locals:
6: wf_props = {for val in setproduct(local.wf_name, local.glue_job_name, local.trigger_type):
7: "${val[0]}-${val[1]}-${val[2]}" => val}
val[0] is "tr_apld_landing_to_raw"
val[1] is "dev_load_landing_to_raw_jb"
val[2] is "CONDITIONAL"
Two different items produced the key "tr_apld_landing_to_raw-dev_load_landing_to_raw_jb-CONDITIONAL" in this 'for' expression. If duplicates are expected, use the ellipsis (...) after the value expression to enable grouping by key.
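To see why this happens: setproduct builds every combination of the three lists (3 x 3 x 3 = 27 entries here), and because trigger_type repeats "CONDITIONAL", some of the generated keys collide. What I actually want is an index-wise pairing of the lists (3 entries). A small Python illustration of the difference, using stand-in values for the variables:

from itertools import product

wf_name  = ['wf1', 'wf2', 'wf3']
glue_job = ['job1', 'job2', 'job3']
trigger  = ['ON_DEMAND', 'CONDITIONAL', 'CONDITIONAL']

# the cross product, as setproduct builds it: 27 combinations, and the
# repeated 'CONDITIONAL' value makes some "wf-job-trigger" keys identical
print(len(list(product(wf_name, glue_job, trigger))))   # 27

# the intended result: items paired by index, 3 entries
print(list(zip(wf_name, glue_job, trigger)))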
Please check the code below. If I change the last line to $sheet_write->write($row, $col, $cell->value), i.e. remove the $format argument, then it works fine.
I am declaring the format on the excel_2 spreadsheet, which is written by the Excel::Writer::XLSX package, and I am copying the cell format ($format = $cell->get_format();) from the reader parser, Spreadsheet::ParseXLSX. Any help would be appreciated. I suspect there is some mismatch between these two different modules.
#!/home/utils/perl-5.24/5.24.2-058/bin/perl -w
use strict;
use warnings;
use Excel::Writer::XLSX;
use Spreadsheet::ParseXLSX;

my $parser   = Spreadsheet::ParseXLSX->new();
my $workbook = $parser->parse('abc.xlsx');
my $excel_2  = Excel::Writer::XLSX->new('abc_copied.xlsx');
my $format   = $excel_2->add_format();

if ( !defined $workbook ) {
    die $parser->error(), ".\n";
}

for my $worksheet ( $workbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();
    printf( "Sheet: %s\n", $worksheet->{Name} );

    my $sheet_write = $excel_2->add_worksheet( $worksheet->{Name} );

    for my $row ( $row_min .. $row_max ) {
        for my $col ( $col_min .. $col_max ) {
            my $cell = $worksheet->get_cell( $row, $col );
            next unless $cell;

            print "Row, Col = ($row, $col)\n";
            #print "Value       = ", $cell->value(),       "\n";
            #print "Unformatted = ", $cell->unformatted(), "\n";
            #print "\n";

            $format = $cell->get_format();
            $sheet_write->write( $row, $col, $cell->value(), $format );
        }
    }
}
Spreadsheet::ParseXLSX's $cell->get_format yields a different kind of object than what Excel::Writer::XLSX expects as the format argument of its write method. Even though they both deal with spreadsheets, they are different modules and are very unlikely to share a class in this manner.
The properties of Excel::Writer::XLSX's format object are well documented:
https://metacpan.org/pod/Excel::Writer::XLSX#CELL-FORMATTING
I see there is a clone format module:
https://metacpan.org/pod/Excel::CloneXLSX::Format
I have no personal experience with it, but it looks promising... Otherwise you will probably have to dump the format object you get from the Parse module, figure out which properties matter most to you, and then do the translation yourself. But seriously, try the module first.
Alternatively, if this is on a Windows machine, Win32::OLE may be better to handle this type of task for you if you are tethered to Perl (which needless to say would not be my first choice if your sole focus is Excel spreadsheet operations).
I'm matching two collections residing in two different databases against a criterion and creating a new collection for the records that match it. The code below works with a simple criterion, but I need a different one.
Definitions
function insertBatch(collection, documents) {
  var bulkInsert = collection.initializeUnorderedBulkOp();
  var insertedIds = [];
  var id;
  documents.forEach(function (doc) {
    id = doc._id;
    // Insert without raising an error for duplicates
    bulkInsert.find({ _id: id }).upsert().replaceOne(doc);
    insertedIds.push(id);
  });
  bulkInsert.execute();
  return insertedIds;
}

function moveDocuments(sourceCollection, targetCollection, filter, batchSize) {
  print("Moving " + sourceCollection.find(filter).count() + " documents from " + sourceCollection + " to " + targetCollection);
  var count;
  while ((count = sourceCollection.find(filter).count()) > 0) {
    print(count + " documents remaining");
    sourceDocs = sourceCollection.find(filter).limit(batchSize);
    idsOfCopiedDocs = insertBatch(targetCollection, sourceDocs);
    targetDocs = targetCollection.find({ _id: { $in: idsOfCopiedDocs } });
  }
  print("Done!");
}
Call
var db2 = new Mongo("<URI_1>").getDB("analy")
var db = new Mongo("<URI_2>").getDB("clone")
var readDocs= db2.coll1
var writeDocs= db.temp_coll
var Urls = new Mongo("<URI_2>").getDB("clone").myCollection.distinct("Url" ,{})
var filter= {"Url": {$in: Urls }}
moveDocuments(readDocs, writeDocs, filter, 10932)
In a nutshell, my criterion is the distinct "Url" string. Instead, I want the Url + Date string to be my criterion. There are two problems:
1. In one collection the date is stored as ISODate("2016-03-14T13:42:00.000+0000"), while in the other collection it is the string "2018-10-22T14:34:40Z". How can I make them uniform so that they match each other?
2. Assuming we get a solution to 1., and we create a new array of concatenated strings UrlsAndDate instead of Urls: how would we create a similar concatenated field on the fly and match against it in the other collection?
For example: (non-functional code!)
var UrlsAndDate = new Mongo("<URI_2>").getDB("clone").myCollection.distinct("Url"+"formated_Date" ,{})
var filter= {"Url"+"formated_Date": {$in: Urls }}
readDocs.find(filter)
...and do the same stuff as above!
Any suggestions?
I've got a brute-force solution, but it isn't feasible!
Problem:
I want to merge 2 collections, mycoll and coll1. Both have fields named Url and Date. mycoll has 35,000 docs and coll1 has 4.7M docs (16+ GB), which can't be loaded into memory.
The algorithm, written using the pymongo client:
iterate over mycoll
    create a src string "url+common_date_format"
    try to find a match in coll1; since coll1 is too big to load into memory and treat as a dictionary, iterate over each doc in it again and again:
    iterate over coll1
        create a dest string "url+common_date_format"
        if src_string == dest_string
            insert this doc into a new collection called temp_coll
This is a terrible algorithm: it is O(35,000 * 4.7M) and would take ages to complete. If I could load the 4.7M docs into memory, the run time would reduce to O(35,000), which is doable.
Any suggestions for another algorithm?
The first thing I would do is create a compound index on {url: 1, date: 1} on the collections, if it doesn't already exist. Say collection A has 35k docs and collection B has 4.7M docs. We can't load all 4.7M docs into memory. You are iterating over a cursor object of B in the inner loop; I assume that once the cursor is exhausted you query the collection again.
The observation to make here is: why are we iterating over 4.7M docs each time? Instead of fetching all 4.7M docs and then matching, we could just fetch the docs that match the url and date of each doc in A. Converting the a_doc date to the b_doc format and then querying is better than converting both to a common format, which would force us to iterate over all 4.7M docs. See the pseudo code below.
a_docs = a_collection.find()
c_docs = []

for doc in a_docs:
    url = doc['url']
    date = doc['date']
    date = convert_to_b_collection_date_format(date)
    query = {'url': url, 'date': date}
    b_doc = b_collection.find_one(query)   # indexed lookup of the matching doc
    if b_doc is not None:
        c_docs.append(b_doc)

c_docs = convert_c_docs_to_required_format(c_docs)
c_collection.insert_many(c_docs)
Above, we loop over the 35k docs and run one filtered query per doc. Given that the indexes are already created, each lookup takes logarithmic time, which seems reasonable.
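As a concrete sketch of the two pieces the pseudo code leaves open (creating the index and converting the date), assuming pymongo, the placeholder URI from the question, and that collection B stores dates as strings like "2018-10-22T14:34:40Z" while collection A returns them as datetime objects:

from pymongo import MongoClient, ASCENDING

client = MongoClient('<URI_2>')          # placeholder URI from the question
b_collection = client['clone']['coll1']

# compound index so that each {Url, Date} lookup is an index seek, not a scan
b_collection.create_index([('Url', ASCENDING), ('Date', ASCENDING)])

def convert_to_b_collection_date_format(date):
    # A's ISODate fields arrive as datetime objects; B stores strings like
    # "2018-10-22T14:34:40Z", so render the datetime in that format
    return date.strftime('%Y-%m-%dT%H:%M:%SZ')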
Solution
This solved all the issues with my Perl code (plus some extra implementation code... :-) ). In conclusion, both Perl and Python are equally awesome.
use WWW::Curl::Easy;
Thanks to ALL who responded, very much appreciated.
Edit
It appears that the Perl code I am using is spending the majority of its time performing the HTTP GET, for example:
my $start_time = gettimeofday;
$request = HTTP::Request->new('GET', 'http://localhost:8080/data.json');
$response = $ua->request($request);
$page = $response->content;
my $end_time = gettimeofday;
print "Time taken #{[ $end_time - $start_time ]} seconds.\n";
The result is:
Time taken 74.2324419021606 seconds.
My python code in comparison:
start = time.time()
r = requests.get('http://localhost:8080/data.json', timeout=120, stream=False)
maxsize = 100000000
content = ''
for chunk in r.iter_content(2048):
    content += chunk
    if len(content) > maxsize:
        r.close()
        raise ValueError('Response too large')
end = time.time()
timetaken = end - start
print timetaken
The result is:
20.3471381664
In both cases the sort times are sub second. So first of all I apologise for the misleading question, and it is another lesson for me to never ever make assumptions.... :-)
I'm not sure what the best thing to do with this question is now. Perhaps someone can propose a better way of performing the request in Perl?
End of edit
This is just a quick question regarding sort performance differences between Perl and Python. It is not a question about which language is better or faster; for the record, I first wrote this in Perl, noticed the time the sort was taking, and then wrote the same thing in Python to see how fast it would be. I simply want to know: how can I make the Perl code perform as fast as the Python code?
Let's say we have the following JSON:
["3434343424335": {
"key1": 2322,
"key2": 88232,
"key3": 83844,
"key4": 444454,
"key5": 34343543,
"key6": 2323232
},
"78237236343434": {
"key1": 23676722,
"key2": 856568232,
"key3": 838723244,
"key4": 4434544454,
"key5": 3432323543,
"key6": 2323232
}
]
Let's say we have around 30k-40k such records, which we want to sort by one of the sub-keys. We then want to build a new array of records ordered by that sub-key.
Perl - Takes around 27 seconds
my @list;
$decoded = decode_json($page);
foreach my $id (sort { $decoded->{$b}->{key5} <=> $decoded->{$a}->{key5} } keys %{$decoded}) {
    push(@list, { "key" => $id, "key1" => $decoded->{$id}{key1}, ...etc });
}
Python - Takes around 6 seconds
list = []
data = json.loads(content)
data2 = sorted(data, key=lambda x: data[x]['key5'], reverse=True)
for key in data2:
    tmp = {'id': key, 'key1': data[key]['key1'], etc.....}
    list.append(tmp)
For the perl code, I have tried using the following tweaks:
use sort '_quicksort'; # use a quicksort algorithm
use sort '_mergesort'; # use a mergesort algorithm
Your benchmark is flawed: you're benchmarking multiple variables, not just one. It is not just sorting data; it is also doing JSON decoding, creating strings, and appending to an array. You can't know how much time is spent sorting and how much is spent doing everything else.
The matter is made worse by the fact that there are several different JSON implementations in Perl, each with its own performance characteristics. Change the underlying JSON library and the benchmark will change again.
If you want to benchmark the sort, you'll have to change your benchmark code to exclude the cost of loading your test data, JSON or not.
Perl and Python have their own internal benchmarking libraries that can benchmark individual functions, but their instrumentation can make them perform far less well than they would in the real world. The performance drag from each benchmarking implementation will be different and might introduce a false bias. These benchmarking libraries are more useful for comparing two functions in the same program. For comparing between languages, keep it simple.
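For example, Python's timeit module can compare two candidate sort approaches over the same decoded data within one program; a minimal sketch, assuming content has been fetched as in the question's code:

import timeit
import json

data = json.loads(content)   # decoded data, as in the question's code

def by_sorted_builtin():
    return sorted(data, key=lambda k: data[k]['key5'], reverse=True)

def by_items_sort():
    return [k for k, v in sorted(data.items(), key=lambda kv: kv[1]['key5'], reverse=True)]

# run each candidate 100 times over the same data and print the totals
print(timeit.timeit(by_sorted_builtin, number=100))
print(timeit.timeit(by_items_sort, number=100))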
The simplest thing to do to get an accurate benchmark is to time the code within the program using the wall clock.
# The current time to the microsecond.
use Time::HiRes qw(gettimeofday);

my @list;
my $decoded = decode_json($page);

my $start_time = gettimeofday;
foreach my $id (sort { $decoded->{$b}->{key5} <=> $decoded->{$a}->{key5} } keys %{$decoded}) {
    push(@list, { "key" => $id, "key1" => $decoded->{$id}{key1}, ...etc });
}
my $end_time = gettimeofday;

print "sort and append took @{[ $end_time - $start_time ]} seconds\n";
(I leave the Python version as an exercise)
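A sketch of that exercise, for reference, with the JSON decode kept outside the timed region:

import time
import json

data = json.loads(content)   # decoding stays outside the timed region

start = time.time()
ordered = []
for key in sorted(data, key=lambda k: data[k]['key5'], reverse=True):
    row = dict(data[key])    # copy the record's sub-keys
    row['id'] = key
    ordered.append(row)
end = time.time()
print('sort and append took %s seconds' % (end - start))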
From here you can improve your technique. You can use CPU seconds instead of wall-clock time. The array append and the cost of creating the strings are still included in the benchmark; they can be eliminated so that you're benchmarking just the sort. And so on.
Additionally, you can use a profiler to find out where your programs are spending their time. Profilers have the same raw-performance caveats as benchmarking libraries, and the results are only useful for finding out what percentage of its time a program spends where, but that will prove useful for quickly checking whether your benchmark has unexpected drag.
The important thing is to benchmark what you think you're benchmarking.
Something else is at play here; I can run your sort in half a second. Improving that is not going to depend on the sorting algorithm so much as on reducing the amount of code run per comparison; a Schwartzian Transform gets it to a third of a second, and a Guttman-Rosler Transform gets it down to a quarter of a second:
#!/usr/bin/perl
use 5.014;
use warnings;

my $decoded = { map( (int rand 1e9, { map( ("key$_", int rand 1e9), 1..6 ) } ), 1..40000 ) };

use Benchmark 'timethese';

timethese( -5, {
    'original' => sub {
        my @list;
        foreach my $id (sort { $decoded->{$b}->{key5} <=> $decoded->{$a}->{key5} } keys %{$decoded}) {
            push(@list, { "key" => $id, %{ $decoded->{$id} } });
        }
    },
    'st' => sub {
        my @list;
        foreach my $id (
            map $_->[1],
            sort { $b->[0] <=> $a->[0] }
            map [ $decoded->{$_}{key5}, $_ ],
            keys %{$decoded}
        ) {
            push(@list, { "key" => $id, %{ $decoded->{$id} } });
        }
    },
    'grt' => sub {
        my $maxkeylen = 15;
        my @list;
        foreach my $id (
            map substr($_, $maxkeylen),
            sort { $b cmp $a }
            map sprintf('%0*s', $maxkeylen, $decoded->{$_}{key5}) . $_,
            keys %{$decoded}
        ) {
            push(@list, { "key" => $id, %{ $decoded->{$id} } });
        }
    },
});
Don't create a new hash for each record. Just add the key to the existing one.
$decoded->{$_}{key} = $_
    for keys(%$decoded);

my @list = sort { $b->{key5} <=> $a->{key5} } values(%$decoded);
Using Sort::Key will make it even faster.
use Sort::Key qw( rukeysort );

$decoded->{$_}{key} = $_
    for keys(%$decoded);

my @list = rukeysort { $_->{key5} } values(%$decoded);
I store time-series data in HBase. The rowkey is composed of user_id and timestamp, like this:
{
    "userid1-1428364800" : {
        "columnFamily1" : {
            "val" : "1"
        }
    },
    "userid1-1428364803" : {
        "columnFamily1" : {
            "val" : "2"
        }
    },
    "userid2-1428364812" : {
        "columnFamily1" : {
            "val" : "abc"
        }
    }
}
Now I need to perform per-user analysis. Here is the initialization of hbase_rdd (from here)
sc = SparkContext(appName="HBaseInputFormat")
conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)
The natural mapreduce-like way to process would be:
(hbase_rdd
    .map(lambda row: (row[0].split('-')[0], (row[0].split('-')[1], row[1])))  # shift timestamp from key to value
    .groupByKey()
    .map(processUserData))  # process each user's data
While executing the first map (shifting the timestamp from the key to the value), it is crucial to know when the time-series data of the current user is finished, so that the groupByKey-style processing can start. Then we would not need to map over the whole table and store all the temporary data. This should be possible because HBase stores row keys in sorted order.
With Hadoop Streaming it could be done this way:
import sys

current_user_data = []
last_userid = None

for line in sys.stdin:
    k, v = line.split('\t')
    userid, timestamp = k.split('-')
    if userid != last_userid and current_user_data:
        # the sorted input guarantees the previous user's data is complete
        print processUserData(last_userid, current_user_data)
        current_user_data = [(timestamp, v)]
    else:
        current_user_data.append((timestamp, v))
    last_userid = userid

# flush the final user's data
if current_user_data:
    print processUserData(last_userid, current_user_data)
The question is: how to utilize the sorted order of hbase keys within Spark?
I'm not super familiar with the guarantees you get with the way you're pulling data from HBase, but if I understand correctly, I can answer with just plain old Spark.
You've got some RDD[X]. As far as Spark knows, the Xs in that RDD are completely unordered. But you have some outside knowledge, and you can guarantee that the data is in fact grouped by some field of X (and perhaps even sorted by another field).
In that case, you can use mapPartitions to do virtually the same thing you did with Hadoop Streaming. That lets you iterate over all the records in one partition, so you can look for blocks of records with the same key.
val myRDD: RDD[X] = ...
val groupedRDD: RDD[Seq[X]] = myRDD.mapPartitions { itr =>
  var currentUserData = new scala.collection.mutable.ArrayBuffer[X]()
  var currentUser: X = null
  // itr is an iterator over *all* the records in one partition
  itr.flatMap { x =>
    if (currentUser != null && x.userId == currentUser.userId) {
      // same user as before -- add the data to our list
      currentUserData += x
      None
    } else {
      // it's a new user -- return all the data for the old user, and make
      // another buffer for the new user
      val userDataGrouped = currentUserData
      currentUserData = new scala.collection.mutable.ArrayBuffer[X]()
      currentUserData += x
      currentUser = x
      Some(userDataGrouped)
    }
  }
}

// now groupedRDD has all the data for one user grouped together, and we didn't
// need to do an expensive shuffle. Also, the above transformation is lazy, so
// we don't necessarily even store all that data in memory -- we could still
// do more filtering on the fly, eg:
val usersWithLotsOfData = groupedRDD.filter { userData => userData.size > 10 }
I realize you wanted to use Python -- sorry, but I figure I'm more likely to get the example correct if I write it in Scala, and I think the type annotations make the meaning clearer, though that is probably a Scala bias ... :). In any case, hopefully you can understand what is going on and translate it. (Don't worry too much about flatMap, Some, and None; they're probably unimportant if you understand the idea ...)
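For what it's worth, a rough Python translation of the same idea might look like this; it assumes each element arrives as a (rowkey, value) pair, as produced by the converters in the question:

def group_by_user(records):
    # records is an iterator over every (userid, (timestamp, value)) pair
    # in one partition; row keys arrive sorted, so a user's rows are adjacent
    current_user = None
    current_data = []
    for userid, rest in records:
        if current_user is None or userid == current_user:
            current_data.append(rest)
        else:
            # a new user starts, so the previous user's block is complete
            yield (current_user, current_data)
            current_data = [rest]
        current_user = userid
    if current_data:
        yield (current_user, current_data)   # emit the final user's block

grouped_rdd = (hbase_rdd
    .map(lambda row: (row[0].split('-')[0], (row[0].split('-')[1], row[1])))
    .mapPartitions(group_by_user))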