Upload file to Databricks DBFS with Python API - python

I'm following the Databricks example for uploading a file to DBFS (in my case .csv):
import json
import requests
import base64
DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'
BASE_URL = 'https://%s/api/2.0/dbfs/' % (DOMAIN)
def dbfs_rpc(action, body):
""" A helper function to make the DBFS API request, request/response is encoded/decoded as JSON """
response = requests.post(
BASE_URL + action,
headers={'Authorization': 'Bearer %s' % TOKEN },
json=body
)
return response.json()
# Create a handle that will be used to add blocks
handle = dbfs_rpc("create", {"path": "/temp/upload_large_file", "overwrite": "true"})['handle']
with open('/a/local/file') as f:
while True:
# A block can be at most 1MB
block = f.read(1 << 20)
if not block:
break
data = base64.standard_b64encode(block)
dbfs_rpc("add-block", {"handle": handle, "data": data})
# close the handle to finish uploading
dbfs_rpc("close", {"handle": handle})
When using the tutorial as is, I get an error:
Traceback (most recent call last):
File "db_api.py", line 65, in <module>
data = base64.standard_b64encode(block)
File "C:\Miniconda3\envs\dash_p36\lib\base64.py", line 95, in standard_b64encode
return b64encode(s)
File "C:\Miniconda3\envs\dash_p36\lib\base64.py", line 58, in b64encode
encoded = binascii.b2a_base64(s, newline=False)
TypeError: a bytes-like object is required, not 'str'
I tried doing with open('./sample.csv', 'rb') as f: before passing the blocks to base64.standard_b64encode but then getting another error:
TypeError: Object of type 'bytes' is not JSON serializable
This happens when the encoded block data is being sent into the API call.
I tried skipping encoding entirely and just passing the blocks into the post call. In this case the file gets created in the DBFS but has 0 bytes size.
At this point I'm trying to make sense of it all. It doesn't want a string but it doesn't want bytes either. What am I doing wrong? Appreciate any help.

In Python we have strings and bytes, which are two different entities note that there is no implicit conversion between them, so you need to know when to use which and how to convert when necessary. This answer provides nice explanation.
With the code snippet I see two issues:
This you already got - open by default reads the file as text. So your block is a string, while standard_b64encode expects bytes and returns bytes. To read bytes from file it needs to be opened in binary mode:
with open('/a/local/file', 'rb') as f:
Only strings can be encoded as JSON. There's no source code available for dbfs_rpc (or I can't find it), but apparently it expects a string, which it internally encodes. Since your data is bytes, you need to convert it to string explicitly and that's done using decode:
dbfs_rpc("add-block", {"handle": handle, "data": data.decode('utf8')})

Related

TrasnslationAPI a bytes-like object is required, not 'Repeated'

I am trying to translate a pdf document from english to french using google translation api and python, however I get a type error.
Traceback (most recent call last):
File "C:\Users\troberts034\Documents\translate_test\translate.py", line 42, in <module>
translate_document()
File "C:\Users\troberts034\Documents\translate_test\translate.py", line 33, in translate_document
f.write(response.document_translation.byte_stream_outputs)
TypeError: a bytes-like object is required, not 'Repeated'
I have a feeling that it has something to do with writing to the file as binary, but I open it as binary too so I am unsure what the issue is. I want it to take a pdf file that has english text and edit the text and translate it to french using the api. Any ideas whats wrong?
from google.cloud import translate_v3beta1 as translate
def translate_document():
client = translate.TranslationServiceClient()
location = "global"
project_id = "translatedocument"
parent = f"projects/{project_id}/locations/{location}"
# Supported file types: https://cloud.google.com/translate/docs/supported-formats
with open("C:/Users/###/Documents/translate_test/test.pdf", "rb") as document:
document_content = document.read()
document_input_config = {
"content": document_content,
"mime_type": "application/pdf",
}
response = client.translate_document(
request={
"parent": parent,
"target_language_code": "fr-FR",
"document_input_config": document_input_config,
}
)
# To output the translated document, uncomment the code below.
f = open('test.pdf', 'wb')
f.write(response.document_translation.byte_stream_outputs)
f.close()
# If not provided in the TranslationRequest, the translated file will only be returned through a byte-stream
# and its output mime type will be the same as the input file's mime type
print("Response: Detected Language Code - {}".format(
response.document_translation.detected_language_code))
translate_document()
I think there is a bug on the sample code (I'm assuming you got the sample from the Cloud Translate API documentation).
To fix your code, you do need to use response.document_translation.byte_stream_outputs[0]. So basically changing this line:
f.write(response.document_translation.byte_stream_outputs)
by:
f.write(response.document_translation.byte_stream_outputs[0])
then your code will work.

Django StreamingHttpResponse format error

I have simple Django view, for downloading file from Amazon s3.
Test by saving file locally was alright:
def some_view(request):
res = s3.get_object(...)
try:
s3_file_content = res['Body'].read()
with open("/Users/yanik/ololo.jpg", 'wb') as f:
f.write(s3_file_content)
# file saved and I can view it
except:
pass
When switch to StreamingHttpResponse I got incorrect file format (can't open) and even wrong size (If original is 317kb image the output wood be around 620kb)
def some_view(request):
res = s3.get_object(...)
response = StreamingHttpResponse(res['Body'].read(), content_type=res['ContentType'])
response['Content-Disposition'] = 'attachment;filename=' + 'ololo.jpg'
response['ContentLength'] = res['ContentLength']
return response
Tried many different setting, but so far nothing worked for me. The output file is broken.
UPDATE
I managed to get more debuting information. If I change my file writing method in first sample from 'wb' to 'w' mode I'll have same output as with StreamingHttpResponse (first view will generate same broken file).
So it looks like I must tell http header that my output is on binary format
UPDATE (get to the core of problem)
Now I'm understand the problem. But still don't have the solution.
The res['Body'].read() returns bytes type and StreamingHttpResponse iterate through these bytes, which returns byte codes. So my pretty incoming bytes '...\x05cgr\xb8=:\xd0\xc3\x97U\xf4\xf3\xdc\xf0*\xd4#\xff\xd9' force converted to array like: [ ... , 195, 151, 85, 244, 243, 220, 240, 42, 212, 64, 255, 217] and then downloaded like concatenated strings. Screenshot: http://take.ms/JQztk
As you see, the list elements in the end.
StreamingHttpResponse.make_bytes
"""Turn a value into a bytestring encoded in the output charset."""
Still don't sure what is going on. But FileWrapper from https://stackoverflow.com/a/8601118/2576817 works fine for boto3 StreamingBody response type.
Welcome if someone have courage to explain this fuzzy behavior.
https://docs.djangoproject.com/en/1.9/ref/request-response/#streaminghttpresponse-objects
StreamingHttpResponse needs an iterator. I think if your file is binary (image), then StreamingHttpResponse is not the best solution, or you should create chunks of that file.
Bytearray is an iterator but perhaps you want to go on lines not on bytes/characters.
I'm not sure if your file is line based text data, but if it is, you could create a generator in order to iterate over the file like object:
def line_generator(file_like_obj):
for line in file_like_obj:
yield line
and feed that generator to the StreamingHttpResponse:
some_view(request):
res = s3.get_object(...)
response = StreamingHttpResponse(line_generator(res['Body']), ...)
return response

Python3 and hmac . How to handle string not being binary

I had a script in Python2 that was working great.
def _generate_signature(data):
return hmac.new('key', data, hashlib.sha256).hexdigest()
Where data was the output of json.dumps.
Now, if I try to run the same kind of code in Python 3, I get the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/hmac.py", line 144, in new
return HMAC(key, msg, digestmod)
File "/usr/lib/python3.4/hmac.py", line 42, in __init__
raise TypeError("key: expected bytes or bytearray, but got %r" %type(key).__name__)
TypeError: key: expected bytes or bytearray, but got 'str'
If I try something like transforming the key to bytes like so:
bytes('key')
I get
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding
I'm still struggling to understand the encodings in Python 3.
You can use bytes literal: b'key'
def _generate_signature(data):
return hmac.new(b'key', data, hashlib.sha256).hexdigest()
In addition to that, make sure data is also bytes. For example, if it is read from file, you need to use binary mode (rb) when opening the file.
Not to resurrect an old question but I did want to add something I feel is missing from this answer, to which I had trouble finding an appropriate explanation/example of anywhere else:
Aquiles Carattino was pretty close with his attempt at converting the string to bytes, but was missing the second argument, the encoding of the string to be converted to bytes.
If someone would like to convert a string to bytes through some other means than static assignment (such as reading from a config file or a DB), the following should work:
(Python 3+ only, not compatible with Python 2)
import hmac, hashlib
def _generate_signature(data):
key = 'key' # Defined as a simple string.
key_bytes= bytes(key , 'latin-1') # Commonly 'latin-1' or 'ascii'
data_bytes = bytes(data, 'latin-1') # Assumes `data` is also an ascii string.
return hmac.new(key_bytes, data_bytes , hashlib.sha256).hexdigest()
print(
_generate_signature('this is my string of data')
)
try
codecs.encode()
which can be used both in python2.7.12 and 3.5.2
import hashlib
import codecs
import hmac
a = "aaaaaaa"
b = "bbbbbbb"
hmac.new(codecs.encode(a), msg=codecs.encode(b), digestmod=hashlib.sha256).hexdigest()
for python3 this is how i solved it.
import codecs
import hmac
def _generate_signature(data):
return hmac.new(codecs.encode(key), codecs.encode(data), codecs.encode(hashlib.sha256)).hexdigest()

read xml files online

i'm new to programing and I'm trying to accese the webservice provided in http://indicadoreseconomicos.bccr.fi.cr/indicadoreseconomicos/WebServices/wsindicadoreseconomicos.asmx?op=ObtenerIndicadoresEconomicosXML, i've added the parameters I need to acces it but when I try to read the file in python I get
TypeError: 'HTTPResponse' object cannot be interpreted as an integer
this is my code
import urllib
import http.client
import time
HEADERS={"Content-type":"application/x-www-form-urlencoded","Accept":"text/plain"}
HOST = "indicadoreseconomicos.bccr.fi.cr"
POST = "/indicadoreseconomicos/WebServices/wsIndicadoresEconomicos.asmx/ObtenerIndicadoresEconomicos"
data = urllib.parse.urlencode({'tcIndicador': 317,
'tcFechaInicio':str(time.strftime("%d/%m/%Y")),
'tcFechaFinal':str(time.strftime("%d/%m/%Y")),
'tcNombre' : 'TI1400',
'tnSubNiveles' : 'N'})
conn=http.client.HTTPConnection(HOST)
conn.request("POST",POST,data,headers=HEADERS)
response= conn.getresponse()
responseSTR= response.read(response)
print (response)
Any suggestions are apreciated
response.read() takes an optional argument that is the number of bytes to read from the response; an integer, whole number. Now you passed the response object instead.
As you want to read the entire response, you should omit the argument altogether, thus:
response_str = response.read()
print(response_str)

posting pickle dumps string with urllib

I need to post data to a Django server. I'd like to use pickle. (There're no security requirements -> small intranet app.)
First, I pickle the data on the client and sending it with urllib2
def dispatch(self, func_name, *args, **kwargs):
dispatch_url = urljoin(self.base_url, "api/dispatch")
pickled_args = cPickle.dumps(args, 2)
pickled_kwargs = cPickle.dumps(kwargs, 2)
data = urllib.urlencode({'func_name' : func_name,
'args' : pickled_args,
'kwargs': pickled_kwargs})
resp = self.opener.open(dispatch_url, data)
Recieving the data at the server works, too:
def dispatch(request):
func_name = request.POST["func_name"]
pickled_args = request.POST["args"]
pickled_kwargs = request.POST["kwargs"]
But unpickling raises an error:
cPickle.loads(pickled_args)
Traceback (most recent call last):
File "<string>", line 1, in <fragment>
TypeError: must be string, not unicode
Obviously the urllib.urlencode has created a unicode string. But how can I convert it back to be able to unpickling (laods) again?
By the way, using pickle format 0 (ascii) works. I can convert to string before unpickling, but I'd rather use format 2.
Also, recommendations about how to get binary data to a Django view are highly appreciated.
Obviously the urllib.urlencode has created a unicode string.
urllib.urlencode doesn't return Unicode string.
It might be a feature of the web framework you use that request.POST contains Unicode strings.
Don't use pickle to communicate between services it is not secure, brittle, not portable, hard to debug and it makes your components too coupled.

Categories