Error streaming from Pub/Sub into BigQuery (Python)

I am having trouble creating a DataflowRunner job that connects a Pub/Sub source to a BigQuery sink by plugging these two:
apache_beam.io.gcp.pubsub.PubSubSource
apache_beam.io.gcp.bigquery.BigQuerySink
into lines 59 and 74 respectively of the beam/sdks/python/apache_beam/examples/streaming_wordcount.py (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/streaming_wordcount.py) example on GitHub. After removing lines 61-70 and specifying the correct Pub/Sub and BigQuery arguments, the script runs without errors, but without ever building the pipeline.
Side note: the script mentions that streaming pipeline support isn't available for use in Python. However, the Beam docs state that apache_beam.io.gcp.pubsub.PubSubSource is only available for streaming
(first sentence under the "apache_beam.io.gcp.pubsub module" heading: https://beam.apache.org/documentation/sdks/pydoc/2.0.0/apache_beam.io.gcp.html#module-apache_beam.io.gcp.pubsub).

You can't stream on Python Dataflow, for now.
Monitor this changelog to find out the day it becomes available:
https://cloud.google.com/dataflow/release-notes/release-notes-python
(soon!)
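For reference, once Python streaming support lands, a Pub/Sub-to-BigQuery pipeline can be sketched roughly as below. This is a minimal, hedged sketch rather than the exact classes from the question: it assumes the newer ReadFromPubSub/WriteToBigQuery transforms, and the project, topic, dataset, and table names are placeholders.

# Minimal sketch (assumes a Beam SDK with Python streaming support;
# project, topic, and table names are placeholders).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
           topic='projects/my-project/topics/my-topic')
     | 'Decode' >> beam.Map(lambda msg: {'message': msg.decode('utf-8')})
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.my_table',
           schema='message:STRING',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))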

Related

Is there a Kinesis Connector for PyFlink?

I'm starting to work on a streaming application and trying to figure out if PyFlink would fit the requirements I have. I need to be able to read from a Kinesis Stream. I saw on the docs that there is a Kinesis Stream Connector, but I can't figure out if that's available for the Python version as well, and, if it is, how to configure it.
Update:
I've found this other doc page, which explains how to use connectors other than the default ones in Python. I've then downloaded the Kinesis jar from here. The version I've downloaded is flink-connector-kinesis_2.11-1.11.2, which matches the one being referenced here.
I then changed this line in the script from the documentation: t_env.get_config().get_configuration().set_string("pipeline.jars", "file://<absolute_path_to_jar>/connector.jar").
When trying to execute the script, however, I'm getting this Java error: Caused by: org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'kinesis' that implements 'org.apache.flink.table.factories.DynamicTableSourceFactory' in the classpath..
I've also tried removing that config line from the script, and then running it as ./bin/flink run -py <my_script>.py -j ./<path_to_jar>/connector.jar, but that got me the same error.
What I interpret from that is that the Jar that I added has not been properly recognized by Flink. Am I doing something wrong here?
It may be relevant to clarify that PyFlink is currently (Flink 1.11) a wrapper around Flink's Table API/SQL. The connector you're trying to use is a DataStream API connector.
In Flink 1.12, coming out in the next few weeks, there will be a Kinesis connector for the Table API/SQL too, so you should be able to use it then. For an overview of the currently supported connectors, this is the documentation page you should refer to.
Note: As Xingbo mentioned, PyFlink will wrap the DataStream API starting from Flink 1.12, so if you need a lower-level abstraction for more complex implementations you'll also be able to consume from Kinesis there.
Because there are many connectors to support, we need to contribute them back to the community one after another. We have developed the Kinesis connector locally, and since users have demand for a Kinesis connector, we will contribute it to PyFlink. The relevant documentation for the PyFlink DataStream API is still improving; you can take a look at Jira first to see the supported features.
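Once the Table API/SQL Kinesis connector is available (Flink 1.12+), the setup could look roughly like the sketch below. This is a hedged sketch, not verified against your environment: it assumes you have downloaded a Table API Kinesis connector jar, and the jar path, stream name, region, and schema are placeholders.

from pyflink.table import EnvironmentSettings, StreamTableEnvironment

env_settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
t_env = StreamTableEnvironment.create(environment_settings=env_settings)

# Register the connector jar with the Table environment (path is a placeholder).
t_env.get_config().get_configuration().set_string(
    "pipeline.jars",
    "file:///absolute/path/to/flink-sql-connector-kinesis.jar")

# Declare the Kinesis stream as a table source via SQL DDL.
t_env.execute_sql("""
    CREATE TABLE kinesis_source (
        event_id STRING,
        payload  STRING
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'my-stream',
        'aws.region' = 'us-east-1',
        'scan.stream.initpos' = 'LATEST',
        'format' = 'json'
    )
""")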

What is the best approach to getting data into S3 for Elasticsearch and RabbitMQ?

At my company we developed a few games; for some of them the events are sent to Elasticsearch, and for others to RabbitMQ. We have a local CLI which grabs the data from both and compiles the messages into compressed (gzip) JSON files, after which another CLI converts them to SQL statements and loads them into a local SQL Server. We now want to scale up, but the current setup is painful and nowhere near real-time for analysis.
I've recently built an application in Python which I was planning to publish to a Docker container in AWS. The script grabs data from Elasticsearch, compiles it into small compressed JSONs, and publishes them to an S3 bucket. From there the data is ingested into Snowflake for analysis. So far I've been able to get the data in quite quickly, and it looks promising as an alternative.
I was planning to do something similar with RabbitMQ, but I wanted to find an even better alternative that would allow this ingestion process to happen seamlessly and help me avoid having to implement all sorts of exception handling in the Python code.
I've researched a bit and found there might be a way to link RabbitMQ to Amazon Kinesis Firehose. My question would be: How would I send the stream from RabbitMQ to Kinesis?
For Elasticsearch, what is the best way to achieve this? I've read about the logstash plugin for S3 (https://www.elastic.co/guide/en/logstash/current/plugins-outputs-s3.html) and about logstash plugin for kinesis (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kinesis.html). Which approach would be ideal for real-time ingestion?
My answer will be very theoretical; it needs to be tested in the real world and adapted to your use case.
For near real-time behaviour, I would use Logstash with:
an Elasticsearch input and a short schedule (this post can help: https://serverfault.com/questions/946237/logstashs-elasticsearch-input-plugin-should-be-used-to-output-to-elasticsearch)
an S3 output (which supports gzip)
maybe a JDBC output to your DB
a RabbitMQ output plugin
You can create a more scalable architecture by outputting to RabbitMQ and using other pipelines to listen to the queue and execute other tasks, for example:
From Logstash: ES -> RabbitMQ
From Logstash: RabbitMQ -> SQL
From Logstash: RabbitMQ -> Kinesis (see the Python sketch after this list for a non-Logstash alternative)
From Logstash: RabbitMQ -> AWS
etc.
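If you'd rather not run Logstash for the RabbitMQ -> Kinesis leg, a small Python bridge is one alternative. A hedged sketch, assuming the pika and boto3 clients; the queue name, Firehose delivery stream name, region, and host are placeholders:

import boto3
import pika

firehose = boto3.client("firehose", region_name="us-east-1")

def on_message(channel, method, properties, body):
    # Forward each RabbitMQ message to the Firehose delivery stream.
    firehose.put_record(
        DeliveryStreamName="game-events",   # hypothetical delivery stream
        Record={"Data": body + b"\n"})
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="game-events", durable=True)
channel.basic_consume(queue="game-events", on_message_callback=on_message)
channel.start_consuming()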

Data ingestion using Kafka from crawlers

I am trying to work with Kafka for data ingestion, but being new to this I am pretty confused. I have multiple crawlers that extract data for me from a web platform. Now, the issue is that I want to ingest that extracted data into Hadoop using Kafka without any middle scripts/service files. Is it possible?
without any middle scripts/service files. Is it possible?
Unfortunately, no.
You need some service that writes into Kafka (your scraper). Whether you produce HTTP links into Kafka (and then write an intermediate consumer/producer that generates the scraped results), or only produce the final scraped results, is up to you.
You also need a second service consuming those topic(s) that writes to HDFS. This could be Kafka Connect (via Confluent's HDFS Connector library), or PySpark (code you'd have to write yourself), or other options that include "middle scripts/services".
If you'd like to combine both options, I would suggest taking a look at Apache NiFi or StreamSets, which can perform HTTP lookups, (X)HTML parsing, and Kafka + HDFS connectors, all configured via a centralized GUI. Note: I believe any Python code would have to be rewritten in a JVM language to support any major custom parsing logic in this pipeline.
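For reference, the service writing into Kafka can be very small. A hedged sketch, assuming the kafka-python client and a local broker; the topic name and record fields are placeholders:

import json
from kafka import KafkaProducer

# Producer shared by the crawler process; serializes records as JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"))

def publish_scraped_result(result: dict) -> None:
    # Called by the crawler for each scraped page.
    producer.send("scraped-results", value=result)

publish_scraped_result({"url": "https://example.com", "title": "Example"})
producer.flush()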

AWS Batch analog in GCP?

I was using AWS and am new to GCP. One feature I used heavily was AWS Batch, which automatically creates a VM when the job is submitted and deletes the VM when the job is done. Is there a GCP counterpart? Based on my research, the closest is GCP Dataflow. The GCP Dataflow documentation led me to Apache Beam. But when I walk through the examples here (link), it feels totally different from AWS Batch.
Any suggestions on submitting jobs for batch processing in GCP? My requirement is to simply retrieve data from Google Cloud Storage, analyze the data using a Python script, and then put the result back to Google Cloud Storage. The process can run overnight, and I don't want the VM to sit idle after the job finishes while I'm still sleeping.
You can do this using AI Platform Jobs, which is now able to run arbitrary Docker images:
gcloud ai-platform jobs submit training $JOB_NAME \
--scale-tier BASIC \
--region $REGION \
--master-image-uri gcr.io/$PROJECT_ID/some-image
You can define the master instance type and even additional worker instances if desired. They should consider creating a sibling product without the AI buzzword so people can find this functionality more easily.
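The image you point --master-image-uri at just needs to run your script. A hedged sketch of what that script might look like, assuming the google-cloud-storage client; the bucket name, object names, and analysis step are placeholders:

from google.cloud import storage

def run_job(bucket_name: str, input_blob: str, output_blob: str) -> None:
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # Retrieve the input data from Cloud Storage.
    data = bucket.blob(input_blob).download_as_bytes()

    # Placeholder for the actual analysis step.
    result = analyze(data)

    # Put the result back to Cloud Storage.
    bucket.blob(output_blob).upload_from_string(result)

def analyze(data: bytes) -> bytes:
    # Hypothetical analysis; replace with your own logic.
    return data.upper()

if __name__ == "__main__":
    run_job("my-bucket", "input/data.txt", "output/result.txt")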
I recommend checking out dsub. It's an open-source tool initially developed by the Google Genomics teams for doing batch processing on Google Cloud.
UPDATE: I have now used this service and I think it's awesome.
As of July 13, 2022, GCP now has its own new fully managed batch processing service (GCP Batch), which seems very akin to AWS Batch.
See the GCP Blog post announcing it at: https://cloud.google.com/blog/products/compute/new-batch-service-processes-batch-jobs-on-google-cloud (with links to docs as well)
Officially, according to the "Map AWS services to Google Cloud Platform products" page, there is no direct equivalent, but you can put a few things together that might get you close.
I wasn't sure whether you have the option to run your Python code in Docker. If so, the Kubernetes controls might do the trick. From the GCP docs:
Note: Beginning with Kubernetes version 1.7, you can specify a minimum size of zero for your node pool. This allows your node pool to scale down completely if the instances within aren't required to run your workloads. However, while a node pool can scale to a zero size, the overall cluster size does not scale down to zero nodes (as at least one node is always required to run system Pods).
So, if you are running other managed instances anyway, you can scale the node pool up and down to and from 0, but at least one Kubernetes node stays active to run the system Pods.
I'm guessing you are already using something like "Creating API Requests and Handling Responses" to get an ID with which you can verify that the process has started, the instance has been created, and the payload is processing. You can use that same process to report when the job completes as well. That takes care of the instance creation and the launch of the Python script.
You could use Cloud Pub/Sub to help keep track of that state: can you modify your Python script to notify when the task completes? When you create the task and launch the instance, you can also have it report that the Python job is complete and then kick off an instance tear-down process (see the small notification sketch below).
Another thing you can do to drop costs is to use Preemptible VM Instances so that the instances run at 1/2 cost and will run a maximum of 1 day anyway.
Hope that helps.
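Here is the hedged notification sketch mentioned above, assuming the google-cloud-pubsub client; the project ID, topic name, and job ID are placeholders:

from google.cloud import pubsub_v1

def notify_completion(job_id: str) -> None:
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "job-status")
    # Publish a small status message; a subscriber can then tear down the VM.
    future = publisher.publish(topic_path, b"done", job_id=job_id)
    future.result()  # block until the message has been accepted

notify_completion("nightly-analysis-job")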
The product that best suits your use case in GCP is Cloud Tasks. We are using it for a similar use case where we retrieve files from another HTTP server and, after some processing, store them in Google Cloud Storage.
This GCP documentation describes in full detail the steps to create tasks and use them.
You schedule your tasks programmatically in Cloud Tasks, and you have to create task handlers (worker services) in App Engine (a sketch of task creation follows after this list). Some limitations for worker services running in App Engine:
In the standard environment:
Automatic scaling: task processing must finish in 10 minutes.
Manual and basic scaling: requests can run up to 24 hours.
In the flex environment: all types have a 60-minute timeout.
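Creating a task programmatically looks roughly like this. A hedged sketch, assuming the google-cloud-tasks client; the project, location, queue name, handler URI, and payload are placeholders:

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")

# Task targeting an App Engine worker service at a hypothetical /process handler.
task = {
    "app_engine_http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "relative_uri": "/process",
        "body": b'{"object": "input/data.txt"}',
    }
}

response = client.create_task(request={"parent": parent, "task": task})
print(f"Created task {response.name}")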
I think a cron job can help you in this regard, and you can implement it with the help of App Engine, Pub/Sub, and Compute Engine. From the "Reliable Task Scheduling on Google Compute Engine" solution guide: in distributed systems, such as a network of Google Compute Engine instances, it is challenging to reliably schedule tasks because any individual instance may become unavailable due to autoscaling or network partitioning.
Google App Engine provides a Cron service. Using this service for scheduling and Google Cloud Pub/Sub for distributed messaging, you can build an application to reliably schedule tasks across a fleet of Compute Engine instances.
For a detailed look you can check it here: https://cloud.google.com/solutions/reliable-task-scheduling-compute-engine

How to use HBase as a source for Hadoop streaming jobs

Is there any way to use an HBase table as a source for a Hadoop streaming job? Specifically, I want to run a Hadoop streaming job written in Python. This works well when the input is specified as a folder on HDFS, but I've not been able to find any documentation about reading data from an HBase table.
Is this supported? Or will I have to go through the ordeal of writing Java code to get the data from HBase to HDFS first and then run the streaming job?
I'm using HBase 0.94 from Cloudera.
(There is a similar question here already, but it points to a third-party solution that is not actively maintained. I was hoping this would be supported in HBase.)
I would use Pig to load the data and then feed it into a streaming Python application.
See here:
http://pig.apache.org/docs/r0.12.0/func.html#HBaseStorage
http://pig.apache.org/docs/r0.12.0/basic.html#stream
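The streaming side of that Pig pipeline is just a script that reads rows from stdin. A hedged sketch of what the Python end might look like, assuming HBaseStorage emits tab-separated fields; the transformation itself is a placeholder:

#!/usr/bin/env python
# Reads tab-separated rows piped in by Pig's STREAM operator (or by Hadoop
# streaming) and writes transformed rows back to stdout.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    row_key, columns = fields[0], fields[1:]
    # Placeholder transformation: emit the row key and a column count.
    print("%s\t%d" % (row_key, len(columns)))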
