Kinesis python client drops any message with "\x" escape? - python

I'm using boto3 (version 1.4.4) to talk to Amazon's Kinesis API:
import boto3
kinesis = boto3.client('kinesis')
# write a record with data '\x08' to the test stream
response = kinesis.put_record(StreamName='test', Data=b'\x08', PartitionKey='foobar')
print(response['ResponseMetadata']['HTTPStatusCode']) # 200
# now read from the test stream
shard_it = kinesis.get_shard_iterator(StreamName="test", ShardId='shardId-000000000000', ShardIteratorType="LATEST")["ShardIterator"]
response = kinesis.get_records(ShardIterator=shard_it, Limit=10)
print(response['ResponseMetadata']['HTTPStatusCode']) # 200
print(response['Records']) # []
When I test with data that contains no \x escapes, I get the record back as expected. Amazon's boto3 documentation says that "The data blob can be any type of data; for example, a segment from a log file, geographic/location data, website clickstream data, and so on." So why is a message with \x-escaped characters dropped? Am I expected to call '\x08'.encode('string_escape') before sending the data to Kinesis?
If you are interested, I have characters like \x08 in the message data because I'm trying to write a serialized protocol buffer message to a Kinesis stream.

Okay, so I finally figured it out. It wasn't working because my botocore was on version 1.4.62. I only realized this because another script that ran fine on my colleague's machine was throwing exceptions on mine; we had the same boto3 version but different botocore versions. After I ran pip install botocore==1.5.26, both the other script and my Kinesis put_record started working.
tldr: botocore 1.4.62 is horribly broken in many ways, so upgrade NOW. I can't believe how much of my life has been wasted by outdated, broken libraries. I wonder if the Amazon devs can unpublish broken versions of the client?
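For anyone who wants to sanity-check the fix, here is a minimal round-trip sketch. It assumes a stream named test with a single shard, and uses a TRIM_HORIZON iterator so a record written just before the iterator is created is still returned (with LATEST, as in the question, only records written after the iterator is obtained show up).
import boto3

kinesis = boto3.client('kinesis')

# Write a record whose payload contains a raw \x08 byte.
kinesis.put_record(StreamName='test', Data=b'\x08', PartitionKey='foobar')

# TRIM_HORIZON starts at the oldest untrimmed record, so the record written
# above is visible even though it predates the iterator.
shard_it = kinesis.get_shard_iterator(
    StreamName='test',
    ShardId='shardId-000000000000',
    ShardIteratorType='TRIM_HORIZON',
)['ShardIterator']

records = kinesis.get_records(ShardIterator=shard_it, Limit=10)['Records']
print([r['Data'] for r in records])  # expect b'\x08' among the returned blobs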

Related

Is there a Kinesis Connector for PyFlink?

I'm starting to work on a streaming application and trying to figure out if PyFlink would fit the requirements I have. I need to be able to read from a Kinesis Stream. I saw in the docs that there is a Kinesis Stream Connector, but I can't figure out whether it's available for the Python version as well and, if it is, how to configure it.
Update:
I've found this other doc page, which explains how to use connectors other than the default ones in Python. I've then downloaded the Kinesis jar from here. The version I've downloaded is flink-connector-kinesis_2.11-1.11.2, which matches the one being referenced here.
Then, I changed this line from the script in the documentation: t_env.get_config().get_configuration().set_string("pipeline.jars", "file://<absolute_path_to_jar>/connector.jar").
When trying to execute the script, however, I'm getting this Java error: Caused by: org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'kinesis' that implements 'org.apache.flink.table.factories.DynamicTableSourceFactory' in the classpath..
I've also tried removing that config line from the script, and then running it as ./bin/flink run -py <my_script>.py -j ./<path_to_jar>/connector.jar, but that got me the same error.
What I interpret from that is that the Jar that I added has not been properly recognized by Flink. Am I doing something wrong here?
It may be relevant to clarify that PyFlink is currently (Flink 1.11) a wrapper around Flink's Table API/SQL. The connector you're trying to use is a DataStream API connector.
In Flink 1.12, coming out in the next few weeks, there will be a Kinesis connector for the Table API/SQL too, so you should be able to use it then. For an overview of the currently supported connectors, this is the documentation page you should refer to.
Note: As Xingbo mentioned, PyFlink will wrap the DataStream API starting from Flink 1.12, so if you need a lower-level abstraction for more complex implementations you'll also be able to consume from Kinesis there.
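To give a concrete idea of what the Table API route looks like once Flink 1.12 lands, here is a rough sketch of registering the connector jar and declaring a Kinesis source via SQL DDL. It assumes Flink 1.12+ with the SQL Kinesis connector jar available locally; the stream name, region, schema and jar path are placeholders.
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

env_settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
t_env = StreamTableEnvironment.create(environment_settings=env_settings)

# Same mechanism as in the question: point the pipeline at the connector jar.
t_env.get_config().get_configuration().set_string(
    "pipeline.jars", "file:///<absolute_path_to_jar>/flink-sql-connector-kinesis.jar")

# Declare the Kinesis stream as a table; the options follow the Flink 1.12 docs.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        url STRING
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'my-stream',
        'aws.region' = 'us-east-1',
        'scan.stream.initpos' = 'LATEST',
        'format' = 'json'
    )
""")

t_env.from_path("clicks").execute().print()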
Because there are many connectors to support, we need to contribute them back to the community one after another. We have developed the Kinesis connector locally, and since users have a demand for it, we will contribute it to PyFlink. The relevant PyFlink DataStream documentation is still being improved; you can take a look at Jira first to see the supported features.

IBM Watson SpeechToTextV1 error - Python

I have been trying out the IBM Watson Speech to Text API. It works with short audio files, but not with audio files that are around 5 minutes long. It gives me the error below:
"watson {'code_description': 'Bad Request', 'code': 400, 'error': 'No speech detected for 30s.'}"
I am using Watson's trial account. Is there a limitation on trial accounts, or is there a bug in the code below?
Python code:
from watson_developer_cloud import SpeechToTextV1
speech_to_text = SpeechToTextV1(
    username='XXX',
    password='XXX',
    x_watson_learning_opt_out=False
)
with open('trial.flac', 'rb') as audio_file:
    print(speech_to_text.recognize(audio_file, content_type='audio/flac', model='en-US_NarrowbandModel', timestamps=False, word_confidence=False, continuous=True))
Appreciate any help!
Please see the implementation notes from the Speech to Text API Explorer for the recognize API you are attempting to use:
Implementation Notes
Sends audio and returns transcription results for a sessionless recognition request. Returns only the final results; to enable interim results, use session-based requests or the WebSocket API. The service imposes a data size limit of 100 MB. It automatically detects the endianness of the incoming audio and, for audio that includes multiple channels, downmixes the audio to one-channel mono during transcoding.
Streaming mode
For requests to transcribe live audio as it becomes available or to transcribe multiple audio files with multipart requests, you must set the Transfer-Encoding header to chunked to use streaming mode. In streaming mode, the server closes the connection (status code 408) if the service receives no data chunk for 30 seconds and the service has no audio to transcribe for 30 seconds. The server also closes the connection (status code 400) if no speech is detected for inactivity_timeout seconds of audio (not processing time); use the inactivity_timeout parameter to change the default of 30 seconds.
There are two factors here. First, there is a data size limit of 100 MB, so I would make sure you do not send files larger than that to the Speech to Text service. Secondly, you can see the server will close the connection and return a 400 error if there is no speech detected for the number of seconds defined by inactivity_timeout. The default value is 30 seconds, so this matches the error you are seeing above.
I would suggest you make sure there is valid speech in the first 30 seconds of your file and/or make the inactivity_timeout parameter larger to see if the problem still exists. To make things easier, you can test the failing file and other sound files by using the API Explorer in a browser:
Speech to Text API Explorer
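If you just want to relax the timeout from the question's snippet, a minimal sketch looks like this (assuming your SDK version passes inactivity_timeout through to recognize; per the service docs, -1 disables the check entirely):
from watson_developer_cloud import SpeechToTextV1

speech_to_text = SpeechToTextV1(username='XXX', password='XXX')

with open('trial.flac', 'rb') as audio_file:
    result = speech_to_text.recognize(
        audio_file,
        content_type='audio/flac',
        model='en-US_NarrowbandModel',
        inactivity_timeout=-1,  # disable the 30-second no-speech cutoff
    )
print(result)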
In the API documentation there is this Python snippet; it keeps the client from bailing out when the default 30-second timeout is reached, and it works for other errors too. It's essentially a "try and except" with the extra step of defining the function as a method of a callback class:
def on_error(self, error):
    print('Error received: {}'.format(error))
Here is the link:
https://cloud.ibm.com/apidocs/speech-to-text?code=python
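For reference, the snippet above is meant to live on a callback class used by the WebSocket interface. A minimal sketch, assuming the newer ibm_watson SDK that the linked docs describe (class and method names may differ between SDK versions):
from ibm_watson.websocket import RecognizeCallback

class MyRecognizeCallback(RecognizeCallback):
    def on_error(self, error):
        # Called instead of raising when the service reports an error.
        print('Error received: {}'.format(error))

    def on_inactivity_timeout(self, error):
        # Called when no speech is detected for inactivity_timeout seconds.
        print('Inactivity timeout: {}'.format(error))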

boto3 s3 put_object times out

I've got a Python script that Works On My Machine (OSX, Python 2.7.13, boto3 1.4.4) but won't work for my colleague (Windows 7, otherwise the same).
The authentication seems to work, and we can both call s3's list_objects_v2 and get_object. However, when he tries to upload with put_object, it times out. Here is a full log; the upload starts at line 45.
I've tried using his credentials and it works. He's tried uploading a tiny file, and it works when the payload is in the bytes range, but even a few kilobytes is too big. We've even tried it on another Windows machine on another internet connection, with no luck.
My upload code is pretty simple:
with open("tmp_build.zip", "r") as zip_to_upload:
upload_response = s3.put_object(Bucket=target_bucket, Body=zip_to_upload, Key=build_type+".zip")
The Key resolves to test.zip in our runs, and the file is about 15 MB.
Why is it failing on Windows? What more debug info can I give you?
Taking inspiration from this issue, https://github.com/boto/boto3/issues/870, I added .read() to my Body parameter, and lo, it works.
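For completeness, the working version of the upload looks roughly like this (same names as in the question; I also switched to 'rb', since a zip is binary):
with open("tmp_build.zip", "rb") as zip_to_upload:
    upload_response = s3.put_object(
        Bucket=target_bucket,
        Body=zip_to_upload.read(),  # pass bytes instead of the file object
        Key=build_type + ".zip",
    )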
Might be network issues. Are you on the same network? Are you able to upload it using the AWS CLI?
Try the following:
aws s3 cp my-file.txt s3://my-s3-bucket/data/ --debug
I would also consider adding a few retries to the upload; that might give you more information on the error at hand. Most of the time these are sporadic, network-related issues.
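If you want to try the retry suggestion, boto3 lets you raise the retry count through the client config; a sketch (the rest of the upload code from the question stays the same):
import boto3
from botocore.config import Config

# Give the client a few more attempts before it gives up on a flaky network.
s3 = boto3.client('s3', config=Config(retries={'max_attempts': 5}))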

BigQuery Upload Encoding Error (Python 3)

I am trying to upload some data into my BigQuery tables using HTTP POST (httplib2). My existing upload Python code had always worked until I moved to Python 3. Now I get encoding errors on the body of the request for the same data I was uploading successfully with Python 2.7.x.
The error happens when the HTTP client library tries to encode the body using the default HTTP encoding of ISO-8859-1. It fails, stating that it couldn't encode characters at position xxxx-yyyy.
I have heard of a Python 3 Unicode bug but don't know whether it's related. What should I do to get my upload working with Python 3 and the BigQuery API (v2)?
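One way around this is to encode the body to UTF-8 bytes yourself before handing it to httplib2, so the library never applies its ISO-8859-1 default. A sketch, where insert_all_url and rows_payload are placeholders for your BigQuery tabledata.insertAll endpoint and payload:
import json
import httplib2

http = httplib2.Http()
# Serialize and encode the body up front; httplib2 sends bytes unchanged.
body = json.dumps(rows_payload, ensure_ascii=False).encode('utf-8')
resp, content = http.request(
    insert_all_url,
    method='POST',
    body=body,
    headers={'Content-Type': 'application/json; charset=UTF-8'},
)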

Persistent HTTPS Connections in Python

I want to make an HTTPS request to a real-time stream and keep the connection open so that I can keep reading content from it and processing it.
I want to write the script in Python, but I am unsure how to keep the connection open. I have tested the endpoint with curl, which keeps the connection open successfully, but how do I do it in Python? Currently, I have the following code:
c = httplib.HTTPSConnection('userstream.twitter.com')
c.request("GET", "/2/user.json?" + req.to_postdata())
response = c.getresponse()
Where do I go from here?
Thanks!
It looks like your real-time stream is delivered as one endless HTTP GET response, yes? If so, you could just use Python's built-in urllib2.urlopen(). It returns a file-like object from which you can read as much as you want until the server hangs up on you.
f = urllib2.urlopen('https://encrypted.google.com/')
while True:
    data = f.read(100)
    print(data)
Keep in mind that although urllib2 speaks https, it doesn't validate server certificates, so you might want to try an add-on package like pycurl or urlgrabber for better security. (I'm not sure if urlgrabber supports https.)
Connection keep-alive features are not available in any of the Python standard libraries for https. The most mature option is probably urllib3.
httplib2 supports this. (I'd have thought this the most mature option, didn't know urllib3 yet, so TokenMacGuy may still be right)
EDIT: while httplib2 does support persistent connections, I don't think you can really consume streams with it (i.e. one long response vs. multiple requests over the same connection), which I now realise you may need.
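If you go the urllib3 route suggested above, a long-lived streaming read might look like this (a sketch, reusing the placeholder URL from the urllib2 example; recent urllib3 verifies certificates when certifi is installed):
import urllib3

http = urllib3.PoolManager()
# preload_content=False keeps the body unread so we can stream it ourselves.
resp = http.request('GET', 'https://encrypted.google.com/', preload_content=False)
for chunk in resp.stream(100):  # read the response in small pieces as it arrives
    print(chunk)
resp.release_conn()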
