I'm starting to work on a streaming application and trying to figure out if PyFlink would fit the requirements I have. I need to be able to read from a Kinesis Stream. I saw in the docs that there is a Kinesis Stream Connector, but I can't figure out whether it's available for the Python version as well and, if it is, how to configure it.
Update:
I've found this other doc page, which explains how to use connectors other than the default ones in Python. I then downloaded the Kinesis JAR from here. The version I downloaded is flink-connector-kinesis_2.11-1.11.2, which matches the one referenced here.
Then I changed this line in the script from the documentation: t_env.get_config().get_configuration().set_string("pipeline.jars", "file://<absolute_path_to_jar>/connector.jar").
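For context, the relevant part of my setup looks roughly like this (the JAR path is a placeholder):

from pyflink.table import EnvironmentSettings, StreamTableEnvironment

# Streaming Table environment, as in the documentation example
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = StreamTableEnvironment.create(environment_settings=env_settings)

# Register the downloaded connector JAR with the pipeline (placeholder path)
t_env.get_config().get_configuration().set_string(
    "pipeline.jars",
    "file:///home/me/jars/flink-connector-kinesis_2.11-1.11.2.jar")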
When trying to execute the script, however, I'm getting this Java error: Caused by: org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'kinesis' that implements 'org.apache.flink.table.factories.DynamicTableSourceFactory' in the classpath.
I've also tried removing that config line from the script, and then running it as ./bin/flink run -py <my_script>.py -j ./<path_to_jar>/connector.jar, but that got me the same error.
What I take from that is that the JAR I added has not been properly picked up by Flink. Am I doing something wrong here?
It may be relevant to clarify that PyFlink is currently (Flink 1.11) a wrapper around Flink's Table API/SQL. The connector you're trying to use is a DataStream API connector.
In Flink 1.12, coming out in the next few weeks, there will be a Kinesis connector for the Table API/SQL too, so you should be able to use it then. For an overview of the currently supported connectors, this is the documentation page you should refer to.
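Once 1.12 is out, using it from PyFlink should look roughly like this (the option names are my guess based on the in-progress 1.12 Kinesis table connector and may still change; the stream name, region, and schema are placeholders):

# Hypothetical sketch for the Flink 1.12 Kinesis table connector; not available in 1.11
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = StreamTableEnvironment.create(environment_settings=env_settings)

t_env.execute_sql("""
    CREATE TABLE kinesis_source (
        user_id STRING,
        event_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'my-input-stream',
        'aws.region' = 'us-east-1',
        'scan.stream.initpos' = 'LATEST',
        'format' = 'json'
    )
""")

# From here the table can be queried like any other, e.g.:
t_env.from_path("kinesis_source").execute().print()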
Note: As Xingbo mentioned, PyFlink will wrap the DataStream API starting from Flink 1.12, so if you need a lower-level abstraction for more complex implementations you'll also be able to consume from Kinesis there.
Because there are many connectors to support, we need to contribute them back to the community one after another. We have already developed the Kinesis connector locally, and since users have a demand for it, we will contribute it to PyFlink. The documentation for the PyFlink DataStream API is still being improved; you can take a look at Jira first to see the supported features.
Related
I am trying to work with Kafka for data ingestion, but being new to this, I am pretty confused. I have multiple crawlers that extract data for me from a web platform. Now, the issue is that I want to ingest that extracted data into Hadoop using Kafka, without any middle scripts/service files. Is it possible?
without any middle scripts/service files. Is it possible?
Unfortunately, no.
You need some service that's writing into Kafka (your scraper). Whether you produce the HTTP links into Kafka (and then write an intermediate consumer/producer that generates the scraped results) or only produce the final scraped results is up to you.
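For the producing side, a minimal sketch with the kafka-python client could look like this (the broker address, topic name, and record shape are made up for illustration):

# Minimal producer sketch using kafka-python; broker, topic, and payload shape are illustrative
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"))

# Wherever your crawler finishes a page, hand the result to Kafka
scraped = {"url": "https://example.com", "title": "Example", "body": "..."}
producer.send("scraped-pages", value=scraped)
producer.flush()  # make sure the record is actually sent before the script exits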
You also need a second service consuming those topic(s) that writes to HDFS. This could be Kafka Connect (via Confluent's HDFS Connector library), or PySpark (code you'd have to write yourself), or other options that include "middle scripts/services".
If you'd like to combine both options, I would suggest taking a look at Apache NiFi or StreamSets, which can perform HTTP lookups, (X)HTML parsing, and Kafka+HDFS connectors, all configured via a centralized GUI. Note: I believe any Python code would have to be rewritten in a JVM language to support major custom parsing logic in this pipeline.
I am having trouble creating a DataflowRunner job that connects a Pub/Sub source to a BigQuery sink by plugging these two:
apache_beam.io.gcp.pubsub.PubSubSource
apache_beam.io.gcp.bigquery.BigQuerySink
into lines 59 and 74, respectively, of the beam/sdks/python/apache_beam/examples/streaming_wordcount.py example on GitHub (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/streaming_wordcount.py). After removing lines 61-70 and specifying the correct Pub/Sub and BigQuery arguments, the script runs without errors but without building the pipeline.
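For reference, this is roughly what my modified version looks like (project, topic, and table names are placeholders, I've simplified the transforms, and the exact constructor arguments may differ by SDK version):

# Rough sketch of my modified streaming_wordcount.py; names are placeholders
import apache_beam as beam
from apache_beam.io.gcp.pubsub import PubSubSource
from apache_beam.io.gcp.bigquery import BigQuerySink
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # runner/project/staging args passed on the command line
p = beam.Pipeline(options=options)

(p
 | 'read' >> beam.io.Read(PubSubSource('projects/my-project/topics/my-topic'))
 | 'format' >> beam.Map(lambda message: {'message': message})
 | 'write' >> beam.io.Write(BigQuerySink('my-project:my_dataset.my_table',
                                         schema='message:STRING')))

p.run()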
Side note: the script mentions that streaming pipeline support isn't available for use in Python. However, the Beam docs mention that apache_beam.io.gcp.pubsub.PubSubSource is only available for streaming
(1st sentence underneath the "apache_beam.io.gcp.pubsub module" heading: https://beam.apache.org/documentation/sdks/pydoc/2.0.0/apache_beam.io.gcp.html#module-apache_beam.io.gcp.pubsub)
You can't stream on Python Dataflow - for now.
Monitor this changelog to find out the day it does:
https://cloud.google.com/dataflow/release-notes/release-notes-python
(soon!)
I have a mixed (C#, Python) system communicating asynchronously through Azure Service Bus queues. Everything was working fine but now I'm getting strange error messages in my Python consumer (which is basically a copy and paste from: https://azure.microsoft.com/en-gb/documentation/articles/service-bus-python-how-to-use-queues/). In particular, the line
msg = bus_service.receive_queue_message('myqueue', peek_lock=False)
always results in a could not convert string to float: max-age=31536000 error. The queue is accessed, though (in fact, I can see in Azure that the message actually gets taken off the queue), and I have already tried different types of payload (the JSON-based one I was originally using, and now a simple string). Strangest of all, it was working fine before. Has anybody had a similar experience?
Just answering my own question in case somebody stumbles into the same problem. My requirements.txt file was not up to date with the latest Python Azure module (of course, I checked the wrong Python env and so I was "sure" it wasn't that :-)). Once I updated the dependencies, things started working again.
I have an existing website deployed on Google App Engine for Python. Now I have set up the local development server on my system, but I don't know how to get the updated database from the live server. There is no Export option in Google's developer console.
Also, I don't want to read the data from the production Datastore on each request; I want to set it up locally just once. The Google manual says that the local datastore is stored in an SQLite file.
Any hint would be appreciated.
First, make sure your app.yaml enables the "remote" built-in, with a stanza such as:
builtins:
- remote_api: on
This app.yaml of course must be the one deployed to your appspot.com (or whatever) "production" GAE app.
Then, it's a job for /usr/local/google_appengine/bulkloader.py or wherever you may have installed the bulkloader component. Run it with -h to get a list of the many, many options you can pass.
You may need to generate an application-specific password for this use on your google accounts page. Then, the general use will be something like:
/usr/local/google_appengine/bulkloader.py --dump --url=http://your_app.appspot.com/_ah/remote_api --filename=allkinds.sq3
You may not (yet) be able to use this "all kinds" query -- the server only generates the needed statistics for the all-kinds query "periodically", so you may get an error message including info such as:
[ERROR ] Unable to download kind stats for all-kinds download.
[ERROR ] Kind stats are generated periodically by the appserver
[ERROR ] Kind stats are not available on dev_appserver.
If that's the case, then you can still get things "one kind at a time" by adding the option --kind=EntityKind and running the bulkloader repeatedly (with separate sqlite3 result files) for each kind of entity.
Once you've dumped (kind by kind if you have to, all at once if you can) the production datastore, you can use the bulkloader again, this time with --restore and addressing your localhost dev_appserver instance, to rebuild the latter's datastore.
It should be possible to explicitly list kinds in the --kind flag (by separating them with commas and putting them all in parentheses) but unfortunately I think I've found a bug stopping that from working -- I'll try to get it fixed but don't hold your breath. In any case, this feature is not documented (I just found it by studying the open-source release of bulkloader.py) so it may be best not to rely on it!-)
More info about the then-new bulkloader can be found in a blog post by Nick Johnson at http://blog.notdot.net/2010/04/Using-the-new-bulkloader (though it doesn't cover newer functionalities such as the sqlite3 format of results in the "zero configuration" approach I outlined above). There's also a demo, with plenty of links, at http://bulkloadersample.appspot.com/ (also a bit outdated, alas).
Check out the remote API. This will tunnel your database calls over HTTP to the production database.
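A rough sketch of what that looks like from a local Python 2 script, once the remote_api builtin above is deployed (the app id, credential prompt, and model class are placeholders):

# Sketch of using remote_api from a local script; app id and model are placeholders (Python 2, as used by the App Engine SDK)
import getpass

from google.appengine.ext.remote_api import remote_api_stub
from google.appengine.ext import db


def auth_func():
    # Prompted once; an application-specific password may be required
    return raw_input('Email: '), getpass.getpass('Password: ')


remote_api_stub.ConfigureRemoteApi(None, '/_ah/remote_api', auth_func,
                                   'your_app.appspot.com')

# From here, normal datastore calls are tunneled over HTTP to the production app
class Greeting(db.Model):
    content = db.StringProperty()

print Greeting.all().count()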
Is there any way to use an HBase table as a source for a Hadoop streaming job? Specifically, I want to run a Hadoop streaming job written in Python. This works well when the input is specified as a folder on HDFS, but I haven't been able to find any documentation about reading data from an HBase table.
Is this supported? Or will I have to go through the ordeal of writing Java code to get the data from HBase to HDFS first and then run the streaming job?
I'm using HBase 0.94 from Cloudera.
(There is a similar question already here, but it points to a third-party solution that is not actively maintained. I was hoping this would be supported in HBase.)
I would use Pig to load the data and then feed it into a streaming Python application.
See here:
http://pig.apache.org/docs/r0.12.0/func.html#HBaseStorage
http://pig.apache.org/docs/r0.12.0/basic.html#stream
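The Python script on the STREAM side just reads the tuples Pig sends as tab-separated lines on stdin and writes tab-separated lines back to stdout. A minimal sketch (the field layout depends on what you load with HBaseStorage):

#!/usr/bin/env python
# Rough sketch of the streamed Python script; Pig's STREAM sends each tuple
# as a tab-separated line on stdin and reads tab-separated lines back from stdout
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')  # e.g. row key followed by the loaded columns
    row_key = fields[0]
    # ... whatever processing you need ...
    print('\t'.join([row_key] + fields[1:]))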