I am trying to upload some data into my BigQuery tables using HTTP POST (httplib2). My existing upload code always worked until I moved to Python 3; now I get encoding errors on the body of the request for the same data I was uploading successfully with Python 2.7.x.
The error happens when the HTTP client library tries to encode the body using the default HTTP encoding of ISO-8859-1. It fails, stating that it couldn't encode the characters at positions xxxx-yyyy.
I have heard of a Python 3 Unicode bug, but I don't know whether it's related. What should I do to get my upload working with Python 3 and the BigQuery API (v2)?
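For reference, here's a stripped-down sketch of the kind of request I'm making (the project/dataset/table identifiers are placeholders and auth is omitted). Encoding the body to UTF-8 bytes myself seems like the obvious workaround, but I'm not sure it's the right fix:

import json
import httplib2

# Payload with non-Latin-1 characters that trigger the failure.
payload = {'rows': [{'json': {'name': u'caf\u00e9 \u4e2d\u6587'}}]}

# Encode explicitly so httplib2 on Python 3 never applies its
# ISO-8859-1 default to a str body.
body = json.dumps(payload).encode('utf-8')

http = httplib2.Http()  # auth omitted for brevity
url = ('https://www.googleapis.com/bigquery/v2/projects/PROJECT/'
       'datasets/DATASET/tables/TABLE/insertAll')
resp, content = http.request(
    url, method='POST', body=body,
    headers={'Content-Type': 'application/json; charset=UTF-8'})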
Recently my code has been facing intermittent issues with an S3 bucket; it basically downloads, parses, and re-uploads files. I'm using Python and boto (not boto3), version 2.36.0, for the S3 actions. Do you have any idea why S3 sometimes returns this kind of error?
Based on their documentation:
https://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadComplete.html
Sample Response with Error Specified in Body
The following response indicates that an error occurred after the HTTP response header was sent. Note that while the HTTP status code is 200 OK, the request actually failed as described in the Error element.
But that still isn't a good explanation of what's actually going on or why it only happens sometimes.
I've tried some manual uploads to my bucket using the same version, but I haven't noticed anything wrong while doing it manually.
<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>InternalError</Code>
  <Message>We encountered an internal error. Please try again.</Message>
  <RequestId>A127747D40AB1AC3</RequestId>
  <HostId>Clz3f+rO2K1KfD0ZwSkpa9WnPvUh/mngdi99eDiSbdR0uzOP5a7RcYUem6ILYbtQdIJ02aUw2M4=</HostId>
</Error>
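Since the docs at least imply these InternalErrors are transient, for now I've wrapped the re-upload in a simple retry. A rough sketch of what I mean, using boto 2's API (bucket and key names are placeholders):

import time
import boto
from boto.exception import S3ResponseError

conn = boto.connect_s3()
bucket = conn.get_bucket('my-bucket')  # placeholder bucket name

def upload_with_retry(key_name, filename, attempts=3):
    # Retry transient InternalError responses before giving up.
    for attempt in range(attempts):
        try:
            key = bucket.new_key(key_name)
            key.set_contents_from_filename(filename)
            return
        except S3ResponseError as e:
            if e.error_code != 'InternalError' or attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff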
I'm using boto3 (version 1.4.4) to talk to Amazon's Kinesis API:
import boto3
kinesis = boto3.client('kinesis')
# write a record with data '\x08' to the test stream
response = kinesis.put_record(StreamName='test', Data=b'\x08', PartitionKey='foobar')
print(response['ResponseMetadata']['HTTPStatusCode']) # 200
# now read from the test stream
shard_it = kinesis.get_shard_iterator(StreamName="test", ShardId='shardId-000000000000', ShardIteratorType="LATEST")["ShardIterator"]
response = kinesis.get_records(ShardIterator=shard_it, Limit=10)
print(response['ResponseMetadata']['HTTPStatusCode']) # 200
print(response['Records']) # []
When I test it with any data that has no \x escapes, I get the record back as expected. boto3's documentation says "The data blob can be any type of data; for example, a segment from a log file, geographic/location data, website clickstream data, and so on." So why is the message with \x-escaped characters dropped? Am I expected to call '\x08'.encode('string_escape') before sending the data to Kinesis?
If you are interested, I have characters like \x08 in the message data because I'm trying to write a serialized protocol buffer message to a Kinesis stream.
Okay, so I finally figured it out. The reason it wasn't working was that my botocore was on version 1.4.62. I only realized it because another script that ran fine on my colleague's machine was throwing exceptions on mine. We had the same boto3 version but different botocore versions. After I ran pip install botocore==1.5.26, both the other script and my Kinesis put_record started working.
tl;dr: botocore 1.4.62 is horribly broken in many ways, so upgrade NOW. I can't believe how much of my life has been wasted by outdated, broken libraries. I wonder if the Amazon devs can unpublish broken versions of the client?
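If you hit something similar, the quickest sanity check is to print the versions the two libraries actually resolve to at runtime, since a stale botocore can hide behind a current boto3:

import boto3
import botocore

# Confirm which versions are actually being imported; a mismatched
# boto3/botocore pair was the source of my problem.
print(boto3.__version__)
print(botocore.__version__)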
I am trying to upload a blob to azure blob storage with python sdk. I want to pass the MD5 hash for validation on the server side after upload.
Here's the code:
blob_service.put_block_blob_from_path(
    container_name='container_name',
    blob_name='upload_dir/' + object_name,
    file_path=object_name,
    content_md5=object_md5Hash
)
But I get this error:
AzureHttpError: The MD5 value specified in the request did not match with the MD5 value calculated by the server.
The file is ~200 MB and the error is thrown instantly; the file is never uploaded. So I suspect that it may be comparing the supplied hash with the hash of the first chunk or something.
Any ideas?
This is sort of an SDK bug, in that we should throw a better error message rather than hitting the service, but validating the content of a large upload that has to be chunked simply doesn't work. x_ms_blob_content_md5 will store the MD5, but the service will not validate it; that is something you could do on download, though. content_md5 is validated by the server against the body of a particular request, but since a chunked blob is sent across more than one request, it can never match.
So, if the blob is small enough (below BLOB_MAX_DATA_SIZE) to be put in a single request, content_md5 will work fine. Otherwise I'd simply recommend using HTTPS and storing the MD5 in x_ms_blob_content_md5 if you think you might want to download over HTTP and validate it on download. HTTPS already protects against things like bit flips on the wire, so using it for upload/download gets you a lot. If you can't use HTTPS for one reason or another, you can consider chunking the blob yourself using the put block and put block list APIs, as sketched below.
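The idea is roughly the following (a sketch against the legacy BlobService API, reusing the container and path from your question; exact parameter handling may differ between SDK versions). Each put_block request carries its own content_md5, which the service does validate per request:

import base64
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per block

block_ids = []
with open(object_name, 'rb') as f:
    index = 0
    while True:
        data = f.read(CHUNK_SIZE)
        if not data:
            break
        # Block IDs must be Base64 strings of equal length within a blob.
        block_id = base64.b64encode('block-%08d' % index)
        # Per-request MD5 (Base64 of the raw digest), validated on arrival.
        block_md5 = base64.b64encode(hashlib.md5(data).digest())
        blob_service.put_block('container_name', 'upload_dir/' + object_name,
                               data, block_id, content_md5=block_md5)
        block_ids.append(block_id)
        index += 1

# Commit the blocks, in order, to form the final blob.
blob_service.put_block_list('container_name', 'upload_dir/' + object_name, block_ids)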
FYI: In future versions we do intend to add automatic MD5 calculation for both single put and chunked operations in the library itself which will fully solve this. For the next version, we will add an improved error message if content_md5 is specified for a chunked download.
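In the meantime, if you store the hash in x_ms_blob_content_md5 as suggested above, you can validate it yourself after a download. A rough sketch, again against the legacy BlobService API (the key under which the SDK exposes the stored hash is an assumption and may vary by version):

import base64
import hashlib

blob_name = 'upload_dir/' + object_name
data = blob_service.get_blob('container_name', blob_name)
props = blob_service.get_blob_properties('container_name', blob_name)

stored_md5 = props.get('content-md5')  # assumed property key for this SDK version
actual_md5 = base64.b64encode(hashlib.md5(data).digest())
if stored_md5 and stored_md5 != actual_md5:
    raise ValueError('MD5 mismatch for %s: blob corrupted in transit' % blob_name)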
I reviewed the source code of the put_block_blob_from_path function of the Azure Blob Storage SDK. The case is explained in the function's docstring; see the excerpt below, and refer to https://github.com/Azure/azure-storage-python/blob/master/azure/storage/blob/blobservice.py.
content_md5:
    Optional. An MD5 hash of the blob content. This hash is used to
    verify the integrity of the blob during transport. When this header
    is specified, the storage service checks the hash that has arrived
    with the one that was sent. If the two hashes do not match, the
    operation will fail with error code 400 (Bad Request).
I think there are two things going on here.
Bug in SDK - I believe you have discovered a bug in the SDK. I looked at the source code for this function on GitHub, and what I found is that when a large blob is uploaded in chunks, the SDK first tries to create an empty block blob. With block blobs this is not required. When it creates the empty block blob, it does not send any data, but because you're setting content_md5, the service compares the hash you sent with the hash of that empty content; they don't match, so you get an error.
To fix the issue in the interim, please modify the source code in blobservice.py and comment out the following lines of code:
self.put_blob(
    container_name,
    blob_name,
    None,
    'BlockBlob',
    content_encoding,
    content_language,
    content_md5,
    cache_control,
    x_ms_blob_content_type,
    x_ms_blob_content_encoding,
    x_ms_blob_content_language,
    x_ms_blob_content_md5,
    x_ms_blob_cache_control,
    x_ms_meta_name_values,
    x_ms_lease_id,
)
I have created a new issue on Github for this: https://github.com/Azure/azure-storage-python/issues/99.
Incorrect Usage - I noticed that you're passing the MD5 hash of the file in the content_md5 parameter. This will not work for you. You should actually pass the MD5 hash in the x_ms_blob_content_md5 parameter. So your call should be:
blob_service.put_block_blob_from_path(
    container_name='container_name',
    blob_name='upload_dir/' + object_name,
    file_path=object_name,
    x_ms_blob_content_md5=object_md5Hash
)
I just checked my webspace and its signature says: Apache/2.2.9 (Debian) mod_python/3.3.1 Python/2.5.2 mod_ssl/2.2.9 OpenSSL/0.9.8g
This gives me hope that Python is somehow supported. But why is Python listed twice: mod_python/3.3.1 AND Python/2.5.2?
There is a cgi-bin folder on my webspace.
What I want to do: I need to make a cross-site call to get some text data from a server. The text data is not JSON, but I guess I should convert it to JSON (or is there a way to do a cross-site request without JSON?).
The Python script gets a request for some JSONP. Depending on the request (I guess I should somehow parse the URL), the script should load the requested text-data file from the webserver, wrap it in some JSON, and return it.
Can somebody tell me how to do these three steps with Python on my webspace?
First off, the signature isn't listing Python twice. It's listing, first, the version of mod_python, which is an Apache web server plugin, and then the version of the Python interpreter on the system.
python cgi module - This is really an inefficient approach to writing Python server code, but here it is. Ultimately you should consider one of the many amazing Python web frameworks out there. But, using the cgi module, your response would always start with this:
print 'Content-Type: application/json\n\n'
Your Python script would be run on the server for each HTTP request. In that script you would check the request and determine the data you want to serve from either the URL path or the query string.
At the very least you would just wrap your return value in a basic JSON data structure. The text data itself can just be a string:
import json
text_data = "FOO"
json_data = json.dumps({'text': text_data})
print json_data
# {"text": "FOO"}
For the JSONP aspect, you would usually check the query string to see if the request contains a specific name for the callback function the client wants, or just default to 'callback':
print "callback(%s);" % json_data
# callback({"text": "FOO"});
Returning that would be a JSONP-type response, because when the client receives it, the callback is executed on the client side.
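Putting those pieces together, a complete CGI script might look roughly like this (the 'file' query parameter and the data-file layout are assumptions for illustration; in real code you must sanitize the requested filename before opening it):

#!/usr/bin/env python
import cgi
import json

form = cgi.FieldStorage()
# Hypothetical parameters: 'file' names the text file to wrap,
# 'callback' names the client's JSONP callback function.
name = form.getfirst('file', 'default.txt')
callback = form.getfirst('callback', 'callback')

# WARNING: check 'name' against a whitelist in real code so you
# don't serve arbitrary files from the server.
with open(name) as f:
    text_data = f.read()

print 'Content-Type: application/javascript\n'
print '%s(%s);' % (callback, json.dumps({'text': text_data}))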
And to conclude, let me add that you should be aware that Python CGI scripts need to start a brand-new Python interpreter process for every single request (even repeat requests from the same client). This can easily overwhelm a server under increased load. For this reason, people usually go the WSGI route (mod_wsgi in Apache). WSGI allows a persistent application to keep running and handle ongoing requests.
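For comparison, here is the same JSONP endpoint as a minimal WSGI application (a sketch; under mod_wsgi the application callable below stays loaded between requests instead of re-spawning an interpreter):

import json
from urlparse import parse_qs  # Python 2 stdlib, matching the host's interpreter

def application(environ, start_response):
    # Pull the JSONP callback name from the query string, defaulting as before.
    qs = parse_qs(environ.get('QUERY_STRING', ''))
    callback = qs.get('callback', ['callback'])[0]
    body = '%s(%s);' % (callback, json.dumps({'text': 'FOO'}))
    start_response('200 OK', [('Content-Type', 'application/javascript')])
    return [body]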
I am trying to read (and subsequently save) blobs from the blobstore using the remote API. I get the error 'No api proxy found for service "blobstore"' when I execute the read.
Here is the stub code:
for b in bs.BlobInfo.all().fetch(100):
    blob_reader = bs.BlobReader(str(b.key()))
    file = blob_reader.read()
The error occurs on the line file = blob_reader.read().
I am reading the file from my personal appspot via terminal with:
python tools/remote_api/blobstore_download.py --servername=myinstance.appspot.com --appid=myinstance
So, is reading from the blobstore possible via the remote API, or is my code bad? Any suggestions?
We recently added blobstore support to remote_api. Make sure you're using the latest version of the SDK, and your error should go away.
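Once you're on a current SDK, the blobstore calls work as soon as the remote API stub is configured. A minimal sketch (the app id, server name, and credential prompt are placeholders):

import getpass

from google.appengine.ext import blobstore as bs
from google.appengine.ext.remote_api import remote_api_stub

def auth_func():
    # Placeholder credential prompt for the remote API handshake.
    return raw_input('Email: '), getpass.getpass('Password: ')

# Registers local proxies for the remote services, including blobstore.
remote_api_stub.ConfigureRemoteApi('myinstance', '/_ah/remote_api',
                                   auth_func, 'myinstance.appspot.com')

for b in bs.BlobInfo.all().fetch(100):
    data = bs.BlobReader(str(b.key())).read()
    # ... save 'data' locally as needed ...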