How to use HBase as a source for Hadoop streaming jobs - Python

Is there any way to use an HBase table as a source for a Hadoop streaming job? Specifically, I want to run a Hadoop streaming job written in Python. This works well when the input is specified as a folder on HDFS, but I haven't been able to find any documentation about reading data from an HBase table.
Is this supported? Or will I have to go through the ordeal of writing Java code to get the data from HBase to HDFS first and then run the streaming job?
I'm using HBase 0.94 from Cloudera.
(There is a similar question already present here, but it points to a third-party solution that is not actively maintained. I was hoping that this would be supported in HBase.)

I would use Pig to load the data and then feed it into a streaming Python application.
See here:
http://pig.apache.org/docs/r0.12.0/func.html#HBaseStorage
http://pig.apache.org/docs/r0.12.0/basic.html#stream
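To make that concrete, here is a rough sketch, with placeholder table, column and script names, of the Python script that Pig's STREAM operator would pipe the HBase rows through; the Pig side is only outlined in the comments and should be checked against the HBaseStorage docs linked above.

    #!/usr/bin/env python
    # A hedged sketch of the Python side of a Pig + HBaseStorage pipeline.
    # Pig streams one tab-delimited tuple per HBase row on stdin and reads the
    # transformed tuple back from stdout. Table, column and script names are placeholders.
    #
    # The Pig side would look roughly like:
    #   DEFINE filter_rows `stream_filter.py` SHIP('stream_filter.py');
    #   rows = LOAD 'hbase://my_table'
    #          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:col1 cf:col2', '-loadKey true')
    #          AS (rowkey:chararray, col1:chararray, col2:chararray);
    #   out  = STREAM rows THROUGH filter_rows;
    #   STORE out INTO '/user/me/output';
    import sys

    def main():
        for line in sys.stdin:
            fields = line.rstrip('\n').split('\t')  # one HBase row per line
            # ... whatever per-row logic the streaming job needs goes here ...
            print('\t'.join(fields))

    if __name__ == '__main__':
        main()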

Is there a Kinesis Connector for PyFlink?

I'm starting to work on a streaming application and trying to figure out if PyFlink would fit the requirements I have. I need to be able to read from a Kinesis stream. I saw in the docs that there is a Kinesis Stream Connector, but I can't figure out whether that's available for the Python version as well and, if it is, how to configure it.
Update:
I've found this other doc page, which explains how to use connectors other than the default ones in Python. I then downloaded the Kinesis jar from here. The version I downloaded is flink-connector-kinesis_2.11-1.11.2, which matches the one referenced here.
Then, I changed this line from the script in the documentation: t_env.get_config().get_configuration().set_string("pipeline.jars", "file://<absolute_path_to_jar>/connector.jar").
When trying to execute the script, however, I'm getting this Java error: Caused by: org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'kinesis' that implements 'org.apache.flink.table.factories.DynamicTableSourceFactory' in the classpath..
I've also tried removing that config line from the script, and then running it as ./bin/flink run -py <my_script>.py -j ./<path_to_jar>/connector.jar, but that got me the same error.
My interpretation is that the jar I added has not been properly recognized by Flink. Am I doing something wrong here?
It may be relevant to clarify that PyFlink is currently (Flink 1.11) a wrapper around Flink's Table API/SQL. The connector you're trying to use is a DataStream API connector.
In Flink 1.12, coming out in the next few weeks, there will be a Kinesis connector for the Table API/SQL too, so you should be able to use it then. For an overview of the currently supported connectors, this is the documentation page you should refer to.
Note: As Xingbo mentioned, PyFlink will wrap the DataStream API starting from Flink 1.12, so if you need a lower-level abstraction for more complex implementations you'll also be able to consume from Kinesis there.
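To illustrate what that will look like once the Table API/SQL connector ships, here is a rough sketch assuming the Flink 1.12 Kinesis SQL connector; the jar path, stream name, region, schema and connector options below are placeholders and should be checked against the 1.12 connector documentation.

    # A hedged sketch assuming the Flink 1.12 Kinesis Table/SQL connector.
    # Jar path, stream name, region and schema are placeholders.
    from pyflink.table import EnvironmentSettings, StreamTableEnvironment

    env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
    t_env = StreamTableEnvironment.create(environment_settings=env_settings)

    # The SQL connector jar still has to be on the pipeline classpath
    # (exact artifact name per the 1.12 connector docs).
    t_env.get_config().get_configuration().set_string(
        "pipeline.jars",
        "file://<absolute_path_to_jar>/<kinesis-sql-connector>.jar")

    # Declare the Kinesis-backed table; it can then be queried with SQL or the Table API.
    t_env.execute_sql("""
        CREATE TABLE kinesis_source (
            user_id STRING,
            event_time TIMESTAMP(3)
        ) WITH (
            'connector' = 'kinesis',
            'stream' = 'my-stream',
            'aws.region' = 'us-east-1',
            'scan.stream.initpos' = 'LATEST',
            'format' = 'json'
        )
    """)

    table = t_env.from_path("kinesis_source")  # e.g. table.select(...) or further SQL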
Because there are many connectors to support, we need to contribute them back to the community one by one. We have developed the Kinesis connector locally, and since users have demand for a Kinesis connector, we will contribute it to PyFlink. The relevant documentation for the PyFlink DataStream API is still being improved; you can take a look at Jira first to see the supported features.

How to set up GCP infrastructure to perform search quickly over massive set of json data?

I have about 100 million json files (10 TB), each with a particular field containing a bunch of text, for which I would like to perform a simple substring search and return the filenames of all the relevant json files. They're all currently stored on Google Cloud Storage. Normally for a smaller number of files I might just spin up a VM with many CPUs and run multiprocessing via Python, but alas this is a bit too much.
I want to avoid spending too much time setting up infrastructure like a Hadoop server, or loading all of that into some MongoDB database. My question is: what would be a quick and dirty way to perform this task? My original thoughts were to set up something on Kubernetes with some parallel processing running Python scripts, but I'm open to suggestions and don't really have a clue how to go about this.
The easiest option would be to load the GCS data into BigQuery and run your query from there (a sketch follows below).
Send your data to AWS S3 and use Amazon Athena.
The Kubernetes option would be to set up a cluster in GKE, install Presto on it with a lot of workers, use a Hive metastore with GCS, and query from there. (Presto doesn't have a direct GCS connector yet, as far as I know.) This option seems more elaborate.
Hope it helps!
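To make the BigQuery option a bit more concrete, here is a rough sketch using the google-cloud-bigquery Python client; the project, bucket, dataset, table and field names are placeholders, and it assumes the files are (or can be exported as) newline-delimited JSON. The _FILE_NAME pseudo-column of GCS-backed external tables is what returns the matching filenames.

    # A hedged sketch: define an external table over the JSON files in GCS,
    # then run a substring filter and return the matching filenames.
    # Project, bucket, dataset, table and field names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    # External table pointing at the files (assumes newline-delimited JSON).
    external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
    external_config.source_uris = ["gs://my-bucket/json-files/*"]
    external_config.autodetect = True

    table = bigquery.Table("my-project.my_dataset.json_files_ext")
    table.external_data_configuration = external_config
    client.create_table(table, exists_ok=True)

    # _FILE_NAME is a pseudo-column available on GCS-backed external tables.
    query = """
        SELECT DISTINCT _FILE_NAME AS file_name
        FROM `my-project.my_dataset.json_files_ext`
        WHERE STRPOS(text_field, @needle) > 0
    """
    job = client.query(
        query,
        job_config=bigquery.QueryJobConfig(
            query_parameters=[bigquery.ScalarQueryParameter("needle", "STRING", "some substring")]
        ),
    )
    for row in job.result():
        print(row.file_name)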

Data ingestion using Kafka from crawlers

I am trying to work with Kafka for data ingestion, but being new to this, I am pretty confused. I have multiple crawlers that extract data for me from web platforms. Now, the issue is that I want to ingest that extracted data into Hadoop using Kafka, without any middle scripts/service files. Is it possible?
without any middle scripts/service files. Is it possible?
Unfortunately, no.
You need some service that's writing into Kafka (your scraper). Whether you produce the HTTP links into Kafka (and then write an intermediate consumer/producer that generates the scraped results), or only produce the final scraped results, is up to you.
You also need a second service consuming from those topic(s) that writes to HDFS. This could be Kafka Connect (via Confluent's HDFS Connector library), or PySpark (code you'd have to write yourself), or other options that include "middle scripts/services".
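To make the producer half concrete, here is a minimal sketch of what the crawler could embed, using the kafka-python client; the broker address, topic name and record shape are placeholders. The HDFS half would then be a separately configured consumer such as Kafka Connect.

    # A hedged sketch of the producer side, using the kafka-python client.
    # Broker address, topic name and record shape are placeholders.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="broker:9092",
        value_serializer=lambda record: json.dumps(record).encode("utf-8"),
    )

    def publish_scraped(record):
        """Send one scraped record to the ingestion topic."""
        producer.send("scraped-pages", value=record)

    # Called from the crawler loop for every page it extracts:
    publish_scraped({"url": "https://example.com", "title": "Example", "body": "..."})
    producer.flush()  # ensure buffered records are delivered before the crawler exits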
If you'd like to combine both options, I would suggest taking a look at Apache NiFi or StreamSets, which can perform HTTP lookups, (X)HTML parsing, and Kafka+HDFS connectors, all configured via a centralized GUI. Note: I believe any Python code would have to be rewritten in a JVM language to support major custom parsing logic in this pipeline.

BigQuery to Bigtable data transfer using Google Dataflow in Python

We are using a Dataflow pipeline written in Java for transferring data from BigQuery to Bigtable.
e.g. https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/blob/master/java/dataflow-connector-examples/src/main/java/com/google/cloud/bigtable/dataflow/example/BigQueryBigtableTransfer.java
I am trying to write the same code in Python, but I am not able to find a Bigtable Dataflow connector for Python. Any clue how it can be done?
Just so as not to leave the question unanswered, as Graham Polley commented: "The Python SDK doesn't have support for Bigtable yet." I see that an engineer from Bigtable is already involved, but if you want, you can also create a feature request in the Public Issue Tracker.

Error streaming from Pub/Sub into BigQuery (Python)

I am having trouble creating a DataflowRunner job that connects a Pub/Sub source to a BigQuery sink by plugging these two:
apache_beam.io.gcp.pubsub.PubSubSource
apache_beam.io.gcp.bigquery.BigQuerySink
into lines 59 and 74 respectively of the beam/sdks/python/apache_beam/examples/streaming_wordcount.py (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/streaming_wordcount.py) example on GitHub. After removing lines 61-70 and specifying the correct Pub/Sub and BigQuery arguments, the script runs without errors, but without actually building the pipeline.
Side note: the script mentions that streaming pipeline support isn't available for use in Python. However, the Beam docs mention that apache_beam.io.gcp.pubsub.PubSubSource is only available for streaming
(1st sentence underneath the "apache_beam.io.gcp.pubsub module" heading: https://beam.apache.org/documentation/sdks/pydoc/2.0.0/apache_beam.io.gcp.html#module-apache_beam.io.gcp.pubsub)
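For context, the wiring described above would look roughly like the sketch below under the Beam 2.0-era Python API; the project, topic, table and schema names are placeholders, and (per the answer that follows) it would not actually stream on Dataflow at the time of writing.

    # A hedged sketch of the intended wiring, assuming the Beam 2.0.x Python API.
    # Project, topic, table and schema names are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | 'read' >> beam.io.Read(
             beam.io.gcp.pubsub.PubSubSource('projects/my-project/topics/my-topic'))
         | 'to_row' >> beam.Map(lambda message: {'word': message, 'count': 1})
         | 'write' >> beam.io.Write(
             beam.io.BigQuerySink('my-project:my_dataset.my_table',
                                  schema='word:STRING,count:INTEGER')))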
You can't stream on Python Dataflow - for now.
Monitor this changelog to find out the day it does:
https://cloud.google.com/dataflow/release-notes/release-notes-python
(soon!)
