I wrote an MR job in Python that runs through the Hadoop streaming jar, and I want to know how to use bulk loading to put the data into HBase.
I know there are two ways to get data into HBase by bulk loading:
1. Generate the HFiles in the MR job, then use CompleteBulkLoad to load them into HBase.
2. Use the ImportTsv tool, then use CompleteBulkLoad to load the data.
I don't know how to generate HFiles that HBase accepts from Python, so I tried the ImportTsv utility instead, but it failed. I followed the instructions in this [example](http://hbase.apache.org/book.html#importtsv), but got this exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/Filter...
Now I want to ask three questions:
1. Can Python be used to generate HFiles through the streaming jar?
2. How do I use ImportTsv?
3. Can bulk loading be used to update an existing table in HBase? I get a file larger than 10 GB every day; could bulk loading push that file into HBase?
The hadoop version is: Hadoop 2.8.0
The hbase version is: HBase 1.2.6
Both are running in standalone mode.
Thanks for any answer.
--- update ---
ImportTsv works correctly.
But I still want to know how to generate the HFiles in an MR job with the streaming jar, in Python.
You could try happybase.
import happybase

connection = happybase.Connection('localhost')  # address of the Thrift server
table = connection.table('mytable')
with table.batch(batch_size=1000) as b:
    for i in range(1200):
        b.put(b'row-%04d' % i, {  # bytes use %-formatting; they have no .format()
            b'cf1:col1': b'v1',
            b'cf1:col2': b'v2',
        })
As you may have imagined already, a Batch keeps all mutations in memory until the batch is sent, either by calling Batch.send() explicitly, or when the with block ends. This doesn’t work for applications that need to store huge amounts of data, since it may result in batches that are too big to send in one round-trip, or in batches that use too much memory. For these cases, the batch_size argument can be specified. The batch_size acts as a threshold: a Batch instance automatically sends all pending mutations when there are more than batch_size pending operations.
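For completeness, a minimal sketch of the explicit variant described in that paragraph (no batch_size, flushing manually); the row keys here are placeholders:

b = table.batch()
b.put(b'row-0001', {b'cf1:col1': b'v1'})
b.put(b'row-0002', {b'cf1:col1': b'v2'})
b.send()  # flushes all pending mutations to the Thrift server in one go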
This needs a Thrift server running in front of HBase. Just a suggestion.
I have created a FileDataset using the Azure ML Python API. The data in question is a bunch of parquet files (~10K files, each about 330 KB) residing in Azure Data Lake Gen 2, spread across multiple partitions. I then tried to mount the dataset on an AML compute instance. During the mount, I observed that each parquet file was downloaded twice under the /tmp directory of the compute instance, with the following message printed in the console logs:
Downloaded path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<blob_path>/20211203.parquet is different from target path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<container_name>/<blob_path>/20211203.parquet
This log message gets printed for each parquet file that is part of the dataset.
Also, mounting the dataset is very slow: 44 minutes for ~10K parquet files of about 330 KB each.
The %%time magic in JupyterLab shows that most of the time was spent on I/O:
CPU times: user 4min 22s, sys: 51.5 s, total: 5min 13s
Wall time: 44min 15s
Note: Both the Data Lake Gen 2 and Azure ML compute instance are under the same virtual network.
Here are my questions:
1. How can I avoid downloading the parquet files twice?
2. How can I make the mounting process faster?
I have gone through this thread, but the discussion there didn't reach a conclusion.
The Python code I have used is as follows:
import pandas as pd
from azureml.core import Dataset

data = Dataset.File.from_files(path=list_of_blobs, validate=True)
dataset = data.register(workspace=ws, name=dataset_name, create_new_version=create_new_version)

mount_context = None
try:
    mount_context = dataset.mount(path_to_mount)
    # Mount the file stream
    mount_context.start()
except Exception as ex:
    raise
df = pd.read_parquet(path_to_mount)
The robust option is to download directly from AzureBlobDatastore. You need to know the datastore and relative path, which you get by printing the dataset description. Namely
import tempfile
import pandas as pd
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
dstore = ws.datastores.get(dstore_name)
target = (dstore, dstore_path)

with tempfile.TemporaryDirectory() as tmpdir:
    ds = Dataset.File.from_files(target)
    ds.download(tmpdir)
    df = pd.read_parquet(tmpdir)
The convenient option is to stream tabular datasets. Note that you don't control how the file is read (Microsoft converters may occasionally not work as you expect). Here is the template:
ds = Dataset.Tabular.from_parquet_files(target)
df = ds.to_pandas_dataframe()
I have executed a bunch of tests to compare the performance of FileDataset.mount() and FileDataset.download(). In my environment, download() is much faster than mount().
download() works well when the disk size of the compute is large enough to fit all the files. However, in a multi-node environment, the same data (in my case parquet files) gets downloaded to each of the nodes (multiple copies). As per the documentation:
If your script processes all the files in your dataset and the disk on your compute resource is large enough for the dataset, the download access mode is the better choice. The download access mode will avoid the overhead of streaming the data at runtime. If your script accesses a subset of the dataset or it's too large for your compute, use the mount access mode.
Downloading data in a multi-node environment could trigger performance issues (link). In such a case, mount() might be preferred.
I have tried TabularDataset as well. As Maciej S has mentioned, with TabularDataset the user doesn't need to decide how data is read from the datastore (i.e. doesn't need to choose mount or download as the access mode). But with the current implementation (azureml-core 1.38.0) of TabularDataset, the compute needs more memory (RAM) than FileDataset.download() does for an identical set of parquet files. It looks like the current implementation first reads each individual parquet file into a pandas DataFrame (held in memory/RAM), then appends them into a single DataFrame (the one the API user gets). The higher memory requirement seems to come from this "eager" behaviour of the API.
I have a similar question to this one, but the solution there did not apply to my problem. I can connect and send commands to my Keysight B1500 mainframe via pyvisa/GPIB. The B1500 is connected via Keysight's IO tool, Connection Expert:
import pyvisa

rman = pyvisa.ResourceManager()
keyS = rman.open_resource('GPIB0::18::INSTR')
keyS.timeout = 20000       # time in ms
keyS.chunk_size = 8204800  # read chunk size in bytes (~8 MB)
keyS.write('*rst; status:preset; *cls')
print('variable keyS is being assigned to ', keyS.query('*IDN?'))
Using this pyvisa object I can query without issues (*IDN? above provides the expected output), and I have also run and extracted data from a different type of IV curve on the same tool.
However, when I try to run a pulsed voltage sweep (change voltage of pulses as function of time and measure current) I do not get the measured data out from the tool. I can hook the output lead from the B1500 to an oscilloscope and can see that my setup has worked and the tool is behaving as expected, right up until I try to extract sweep data.
Again, I can run a standard non-pulsed sweep on the tool and the data extraction works fine using [pyvisaobject].read_raw() - so something is different with the way I'm pulsing the voltage.
What I'm looking for is a way to interrogate the connection in cases where the data transfer is unsuccessful.
Here, in no particular order, are the ways I've tried to extract data. These methods are suggested in this link:
keyS.query_ascii_values('CURV?')
or
keyS.read_ascii_values()
or
keyS.query_binary_values('CURV?')
or
keyS.read_binary_values()
This link from the vendor does cover the extraction of data, but also doesn't yield data in the read statement in the case of pulsed voltage sweep:
myFieldFox.write("TRACE:DATA?")
ff_SA_Trace_Data = myFieldFox.read()
Also tried (based on tab autocompletion on iPython):
read_raw() # This is the one that works with non-pulsed sweep
and
read_bytes(nbytes)
The suggestion from @Paul-Cornelius is a good one; I had to include an *OPC? query to get the previous data transfer to work as well. So right before I attempt the data transfer, I send these lines:
rep = keyS.query('NUB?')
keyS.query('*OPC?')
print(rep,'AAAAAAAAAAAAAAAAAAAAA') # this line prints!
mretholder = keyS.read_raw() # system hangs here!
In all the cases the end result is the same - I get a timeout error:
pyvisa.errors.VisaIOError: VI_ERROR_TMO (-1073807339): Timeout expired before operation completed.
The tracebacks for all of these show that they are all using the same basic framework from:
chunk, status = self.visalib.read(self.session, size)
Hoping someone has seen this before, or at least has some ideas on how to troubleshoot. Thanks in advance!
I don't have access to that instrument, but to read a waveform from a Tektronix oscilloscope I had to do the following (pyvisa module):
On obtaining the Resource, I do this:
resource.timeout = 10000
In some cases, I have to wait for a command to complete before proceeding like this:
resource.query("*opc?")
To transfer a waveform from the scope to the PC I do this:
ascii_waveform = resource.query("wavf?")
The "wavf?" query is specifically for this instrument, but the "*opc?" query is generic ("wait for operation complete"), and so is the method for setting the timeout parameter.
There is a lot of information in the user's guide (on the same site you have linked to) about reading data from various devices. I have used the visa library a few times on different devices, and it always requires some fiddling to get it to work.
It looks like you don't have any time delays in your program. In my experience they are almost always necessary, somehow, either by using the resource.timeout feature, the *opc? query, or both.
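To tie those pieces together, here is a generic sketch of the pattern (not B1500-specific; the GPIB address is the one from the question, the timeout value is arbitrary, and the sweep-setup commands are omitted):

import pyvisa

rman = pyvisa.ResourceManager()
inst = rman.open_resource('GPIB0::18::INSTR')
inst.timeout = 60000    # ms; a pulsed sweep can take a while

# ... write the sweep setup and trigger commands here ...

inst.query('*OPC?')     # returns '1' only once pending operations have finished
data = inst.read_raw()  # then pull whatever is waiting in the output buffer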
I have two python programs.
One is kind of data gathering program.
The other is for analysis and prediction using Tensorflow.
(on Windows, Python 3.5, local)
The data-gathering program requires a 32-bit environment because of an API it is using.
And as you know, the other program requires a 64-bit environment because of TensorFlow.
So:
Q: I just need to send a dict to the TensorFlow program, and it sends one integer back as a return.
What is the simplest way to send data back and forth?
Thanks for your time.
The simplest way would be to have one program save the data into a file, and then have the other program read the file. The recommended way to do this is to use JSON, via the json module.
import json

# Write
with open('file.txt', 'w') as file:
    file.write(json.dumps(myDict))

# Read
with open('file.txt') as file:
    myDict = json.load(file)
However, depending on your use case, it might not be the best way. Sockets are a common solution. Managers are also very robust, but are overkill in my opinion.
For more information, I recommend checking out the list that the Python team maintains, of mechanisms that you can use for communication between processes.
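If you go the socket route, here is a minimal sketch of the idea (the localhost address, port number, and the placeholder prediction are all arbitrary choices; JSON carries the dict one way, the integer comes back as plain text, and a single recv() is a simplification that assumes the payload fits in one read):

import json
import socket

# 64-bit side (TensorFlow program): a tiny blocking server.
def serve(host='127.0.0.1', port=5555):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((host, port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            features = json.loads(conn.recv(65536).decode('utf-8'))
            result = 42  # placeholder for the real TensorFlow prediction
            conn.sendall(str(result).encode('utf-8'))

# 32-bit side (data-gathering program): send the dict, read the integer reply.
def ask(data, host='127.0.0.1', port=5555):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((host, port))
        cli.sendall(json.dumps(data).encode('utf-8'))
        return int(cli.recv(1024).decode('utf-8'))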
If you want to connect both programs over the network, may I suggest you take a look at Pyro4? Basically what that does for you is enable you to do normal Python method calls, but over the network, to code running on another computer or in another Python process. You (almost) don't have to worry about low-level network details with it.
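A minimal Pyro4 sketch of that idea; the class name, method name, object id, and port are placeholders, and the two parts run as separate programs:

import Pyro4

# 64-bit program (TensorFlow side): expose a predict() method over the network.
@Pyro4.expose
class Predictor(object):
    def predict(self, features):
        # run the TensorFlow model here and return a plain int
        return 42

if __name__ == '__main__':
    daemon = Pyro4.Daemon(host='127.0.0.1', port=9090)
    uri = daemon.register(Predictor, objectId='predictor')
    print('serving at', uri)  # PYRO:predictor@127.0.0.1:9090
    daemon.requestLoop()

# 32-bit program (data-gathering side): call the remote method as if it were local.
#
#   import Pyro4
#   predictor = Pyro4.Proxy('PYRO:predictor@127.0.0.1:9090')
#   result = predictor.predict({'feature_a': 1, 'feature_b': 2})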
I'm working on a Python project that requires some file transferring. One side of the connection is highly available (RHEL 6) and always online, but the other side (Windows 7) goes on and off, and the connection period is not guaranteed. Files are transferred in both directions, and sizes range from 10 MB to 2 GB.
Is it possible to resume a file transfer with paramiko instead of transferring the entire file from the beginning?
I would like to use rsync, but one side is Windows and I would like to avoid cwRsync and DeltaCopy.
Paramiko doesn't offer an out-of-the-box 'resume' function. However, Syncrify, DeltaCopy's big successor, has retry built in, and if the backup goes down the server waits up to six hours for a reconnect. Pretty trusty, easy to use, and it diffs data by default.
paramiko.sftp_client.SFTPClient has an open function, which behaves much like Python's built-in open function.
You can use this to open both a local and remote file, and manually transfer data from one to the other, all the while recording how much data has been transferred. When the connection is interrupted, you should be able to pick up right where you left off (assuming that neither file has been changed by a 3rd party) by using the seek method.
Keep in mind that a naive implementation of this is likely to be slower than paramiko's get and put functions.
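A rough sketch of that approach for the download direction (host, credentials, and paths are placeholders; it seeks past whatever is already on disk and appends the rest):

import os
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect('rhel-host', username='user', password='secret')
sftp = ssh.open_sftp()

local_path = 'bigfile.bin'
remote_path = '/data/bigfile.bin'

# Resume point: how much of the file do we already have locally?
offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
remote_size = sftp.stat(remote_path).st_size

if offset < remote_size:
    with sftp.open(remote_path, 'rb') as rf, open(local_path, 'ab') as lf:
        rf.seek(offset)  # skip the bytes we already transferred
        while True:
            chunk = rf.read(32768)
            if not chunk:
                break
            lf.write(chunk)

sftp.close()
ssh.close()

The upload direction works the same way, with the local file being read from the offset and the remote file opened for appending.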
I have a simple 2 node cluster (master on one, workers on both). I tried using:
python disco/util/distrfiles.py bigtxt /etc/nodes > bigtxt.chunks
to distribute the files (which worked OK).
I expected this to mean that the processes would spawn and operate only on local data, but it seems that they sometimes try to access data on the other machine.
Instead, I completely copied the data directory. Everything worked fine until the reduce portion, when I received this error:
CommError: Unable to access resource (http://host:8989/host/8b/sup#4f6:d2f6:34b3b/map-index.txt):
It seems like the item is expected to be accessed directly over HTTP, but I don't think this is happening correctly. Are files supposed to be passed back and forth over HTTP? Must I have a distributed FS for multi-node MapReduce?