Translation API: a bytes-like object is required, not 'Repeated' - Python

I am trying to translate a PDF document from English to French using the Google Translation API and Python; however, I get a TypeError.
Traceback (most recent call last):
  File "C:\Users\troberts034\Documents\translate_test\translate.py", line 42, in <module>
    translate_document()
  File "C:\Users\troberts034\Documents\translate_test\translate.py", line 33, in translate_document
    f.write(response.document_translation.byte_stream_outputs)
TypeError: a bytes-like object is required, not 'Repeated'
I have a feeling that it has something to do with writing to the file as binary, but I open it as binary too, so I am unsure what the issue is. I want it to take a PDF file that has English text, edit the text, and translate it to French using the API. Any ideas what's wrong?
from google.cloud import translate_v3beta1 as translate

def translate_document():
    client = translate.TranslationServiceClient()
    location = "global"
    project_id = "translatedocument"
    parent = f"projects/{project_id}/locations/{location}"
    # Supported file types: https://cloud.google.com/translate/docs/supported-formats
    with open("C:/Users/###/Documents/translate_test/test.pdf", "rb") as document:
        document_content = document.read()
    document_input_config = {
        "content": document_content,
        "mime_type": "application/pdf",
    }
    response = client.translate_document(
        request={
            "parent": parent,
            "target_language_code": "fr-FR",
            "document_input_config": document_input_config,
        }
    )
    # To output the translated document, uncomment the code below.
    f = open('test.pdf', 'wb')
    f.write(response.document_translation.byte_stream_outputs)
    f.close()
    # If not provided in the TranslationRequest, the translated file will only be returned through a byte-stream
    # and its output mime type will be the same as the input file's mime type
    print("Response: Detected Language Code - {}".format(
        response.document_translation.detected_language_code))

translate_document()

I think there is a bug in the sample code (I'm assuming you got the sample from the Cloud Translate API documentation).
byte_stream_outputs is a repeated field, so you need to index into it: response.document_translation.byte_stream_outputs[0]. So basically, change this line:
f.write(response.document_translation.byte_stream_outputs)
to:
f.write(response.document_translation.byte_stream_outputs[0])
and your code will work.
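If the service ever returned the translated document in more than one chunk, indexing [0] would keep only the first. A slightly more defensive sketch, assuming each element of the repeated byte_stream_outputs field is a bytes chunk:
with open('test.pdf', 'wb') as f:
    # Concatenate every chunk of the repeated field; identical to [0]
    # when the response contains a single chunk.
    f.write(b"".join(response.document_translation.byte_stream_outputs))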

Upload file to Databricks DBFS with Python API

I'm following the Databricks example for uploading a file to DBFS (in my case .csv):
import json
import requests
import base64

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'
BASE_URL = 'https://%s/api/2.0/dbfs/' % (DOMAIN)

def dbfs_rpc(action, body):
    """ A helper function to make the DBFS API request, request/response is encoded/decoded as JSON """
    response = requests.post(
        BASE_URL + action,
        headers={'Authorization': 'Bearer %s' % TOKEN},
        json=body
    )
    return response.json()

# Create a handle that will be used to add blocks
handle = dbfs_rpc("create", {"path": "/temp/upload_large_file", "overwrite": "true"})['handle']
with open('/a/local/file') as f:
    while True:
        # A block can be at most 1MB
        block = f.read(1 << 20)
        if not block:
            break
        data = base64.standard_b64encode(block)
        dbfs_rpc("add-block", {"handle": handle, "data": data})
# close the handle to finish uploading
dbfs_rpc("close", {"handle": handle})
When using the tutorial as is, I get an error:
Traceback (most recent call last):
  File "db_api.py", line 65, in <module>
    data = base64.standard_b64encode(block)
  File "C:\Miniconda3\envs\dash_p36\lib\base64.py", line 95, in standard_b64encode
    return b64encode(s)
  File "C:\Miniconda3\envs\dash_p36\lib\base64.py", line 58, in b64encode
    encoded = binascii.b2a_base64(s, newline=False)
TypeError: a bytes-like object is required, not 'str'
I tried doing with open('./sample.csv', 'rb') as f: before passing the blocks to base64.standard_b64encode, but then I get another error:
TypeError: Object of type 'bytes' is not JSON serializable
This happens when the encoded block data is being sent into the API call.
I tried skipping encoding entirely and just passing the blocks into the post call. In this case the file gets created in DBFS but has a size of 0 bytes.
At this point I'm trying to make sense of it all. It doesn't want a string, but it doesn't want bytes either. What am I doing wrong? I'd appreciate any help.
In Python, strings and bytes are two different entities; note that there is no implicit conversion between them, so you need to know when to use which and how to convert when necessary. This answer provides a nice explanation.
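As a quick illustration of that boundary:
# str holds text, bytes holds raw data; conversion is always explicit
s = 'héllo'
b = s.encode('utf-8')            # str -> bytes
assert b.decode('utf-8') == s    # bytes -> str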
With the code snippet I see two issues:
This you already figured out: open by default reads the file as text, so your block is a string, while standard_b64encode expects bytes and returns bytes. To read bytes from a file it needs to be opened in binary mode:
with open('/a/local/file', 'rb') as f:
Only JSON-compatible types can be serialized, and your dbfs_rpc passes the body through requests' json= parameter, which cannot serialize bytes. Since your data is bytes, you need to convert it to a string explicitly, and that's done using decode:
dbfs_rpc("add-block", {"handle": handle, "data": data.decode('utf8')})

Python protobuf decode base64 string

I am trying to get JSON data from a base64-encoded string. I have created my proto file like below:
syntax = "proto2";
message ArtifactList {
repeated Artifact artifacts = 1;
}
message Artifact {
required string id = 1;
required uint64 type_id = 2;
required string uri = 3;
}
After that, I generated the Python files using the protoc command. I am trying to decode the base64 string like below:
import message_pb2
import base64
data = base64.b64decode("AAAAAA8KDQgTEBUgBCjln62lxS6AAAAAD2dycGMtc3RhdHVzOjANCg==")
s = str(data)
message_pb2.ArtifactList.ParseFromString(s)
But I am getting the below error.
Traceback (most recent call last):
  File "app.py", line 7, in <module>
    message_pb2.ArtifactList.ParseFromString(s)
TypeError: descriptor 'ParseFromString' requires a 'google.protobuf.pyext._message.CMessage' object but received a 'str'
I am a newbie with protobuf and couldn't find a solution to fix this issue. Could anyone help?
Thanks in advance.
There are two issues:
1. ParseFromString is a method of an ArtifactList instance, so it must be called on one.
2. ParseFromString takes a bytes-like object, not a str, as its parameter.
>>> import message_pb2
>>> import base64
>>> data = base64.b64decode("AAAAAA8KDQgTEBUgBCjln62lxS6AAAAAD2dycGMtc3RhdHVzOjANCg==")
>>> m = message_pb2.ArtifactList()
>>> m.ParseFromString(data)
>>> m.artifacts
<google.protobuf.pyext._message.RepeatedCompositeContainer object at 0x7fd09a937d68>
ParseFromString is a method on a protobuf Message instance.
Try:
message = message_pb2.ArtifactList()
message.ParseFromString(data)  # pass the decoded bytes, not str(data)
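Since the original goal was JSON data, once the message parses you can convert it with the protobuf json_format helpers; a minimal sketch, assuming the message instance from above:
from google.protobuf.json_format import MessageToJson

# Serialize the parsed protobuf message to a JSON string
print(MessageToJson(message))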

How can a GRIB file be opened with pygrib without first downloading the file?

The documentation for pygrib shows a function called fromstring which creates a gribmessage instance from a Python bytes object representing a binary GRIB message. I might be misunderstanding the purpose of this function, but it leads me to believe I can use it in place of downloading a GRIB file and using the open function on it. Unfortunately, my attempts to open a multi-message GRIB file from NLDAS2 have failed. Does anyone else know how to use pygrib on GRIB data without first saving the file? My code below shows how I would like it to work. Instead, it gives the error TypeError: expected bytes, int found on the line for grib in gribs:
from urllib import request
import pygrib

url = "<remote address of desired file>"
username = "<username>"
password = "<password>"

redirectHandler = request.HTTPRedirectHandler()
cookieProcessor = request.HTTPCookieProcessor()
passwordManager = request.HTTPPasswordMgrWithDefaultRealm()
passwordManager.add_password(None, "https://urls.earthdata.nasa.gov", username, password)
authHandler = request.HTTPBasicAuthHandler(passwordManager)
opener = request.build_opener(redirectHandler, cookieProcessor, authHandler)
request.install_opener(opener)

with request.urlopen(url) as response:
    data = response.read()
gribs = pygrib.fromstring(data)
for grib in gribs:
    print(grib)
Edit to add the entire error output:
Traceback (most recent call last):
  File ".\example.py", line 19, in <module>
    for grb in grbs:
  File "pygrib.pyx", line 1194, in pygrib.gribmessage.__getitem__
TypeError: expected bytes, int found
Edit: This interface does not support multi-message GRIB files, but the authors are open to a pull request if anyone wants to write up the code. Unfortunately, my research focus has shifted and I don't have time to contribute myself.
As stated by jasonharper, you can use pygrib.fromstring(). I just tried it myself and it works.
Here is the link to the documentation.
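Note that fromstring() returns a single gribmessage instance rather than an iterable collection of messages, which is why looping over its result fails; a minimal sketch for a single-message payload:
grb = pygrib.fromstring(data)  # one gribmessage, not a collection
print(grb)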
Starting with pygrib v2.1.4, the changelog says that pygrib.open() accepts an io.BufferedReader object as an input argument.
See the pygrib changelog here.
That would theoretically allow you to read a GRIB2 file from memory without writing it to disk.
I think the usage is supposed to be the following:
binary_io = io.BytesIO(bytes_data)
buffer_io = io.BufferedReader(binary_io)
grib_file = pygrib.open(buffer_io)
But I was not able to make it work on my side!
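Putting that together with the download code from the question, the end-to-end flow would look something like this (a sketch, assuming pygrib >= 2.1.4 behaves as the changelog describes):
import io
from urllib import request

import pygrib

with request.urlopen(url) as response:  # url and opener as set up in the question
    buffer_io = io.BufferedReader(io.BytesIO(response.read()))

grbs = pygrib.open(buffer_io)           # requires pygrib >= 2.1.4
for grb in grbs:
    print(grb)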

pyArango bulkImport_json is complaining about improper indices

I'm testing the ability to store PyTest results, generated by the json plugin for that test harness, into ArangoDB. I am attempting to import as follows
import pyArango.connection as adbConn
dbConn = adbConn.Connection(...)
db = dbConn['mydb']
collection = db.collections['PyTestResults']
collection.bulkImport_json('/path/to/results.json')
This fails with the following traceback:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/pyArango/collection.py", line 777, in bulkImport_json
    errorMessage = "At least: %d errors. The first one is: '%s'\n\n more in <this_exception>.data" % (len(data), data[0]["errorMessage"])
TypeError: string indices must be integers
What isn't making sense is that the JSON file is properly formed. In fact, using the regular Python JSON module, it works just fine:
import json

with open('/path/to/results.json') as fd:
    data = json.load(fd)
print(data)
This works. The beginning of the file is:
{"report":
    {"environment":
        {
            "Python": "3.6.9", "Platform": "Linux-4.4.0-17763-Microsoft-x86_64-with-Ubuntu-18.04-bionic"
        },
It seems that the library, pyArango, wants the keys to be integers. I tried changing "report" to 0; however, that invalidated the JSON structure.
How is one to use the pyArango library to import JSON? The overall structure of this JSON file doesn't look much different from any of the examples on this page. Any pointers are greatly appreciated.
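One hedged reading of the traceback: the TypeError is raised inside pyArango's own error handler while it tries to report a server-side import failure, and ArangoDB's bulk import expects a list of documents rather than a single nested object. A sketch of a workaround under that assumption (the results_list.json filename is illustrative):
import json

# Assumption: the bulk import endpoint wants an array of documents,
# so wrap the single report object in a one-element list first.
with open('/path/to/results.json') as fd:
    data = json.load(fd)
with open('/path/to/results_list.json', 'w') as fd:
    json.dump([data], fd)

collection.bulkImport_json('/path/to/results_list.json')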

Annotating image file to text using googleapiclient

I am trying to annotate a local image file using Google Cloud services. I followed the instructions given here https://cloud.google.com/natural-language/docs/reference/libraries and set up the Google API. The given test examples on the page executed without any problem. However, when I try to actually annotate a file I get an error. Here is the code I am using:
import base64
import logging

import googleapiclient.discovery
import googleapiclient.errors

files = []
files.append("/opt/lampp/htdocs/test.jpg")

def get_text_from_files(fileNames):
    texts = detect_text(fileNames)

def detect_text(fileNames):
    max_results = 6
    num_retries = 3
    service = googleapiclient.discovery.build('language', 'v1')
    batch_request = []
    for filename in fileNames:
        request = {
            'image': {},
            'features': [{
                'type': 'TEXT_DETECTION',
                'maxResults': max_results,
            }]
        }
        with open(filename, 'rb') as image_file:
            request['image']['content'] = base64.b64encode(image_file.read()).decode('UTF-8')
        batch_request.append(request)
    request = service.images().annotate(body={'requests': batch_request})
    try:
        responses = request.execute(num_retries=num_retries)
        if 'responses' not in responses:
            return {}
        text_response = {}
        for filename, response in zip(fileNames, responses['responses']):
            if 'error' in response:
                logging.error('API Error for {}: {}'.format(
                    filename,
                    response['error'].get('message', '')))
                continue
            text_response[filename] = response.get('textAnnotations', [])
        return text_response
    except googleapiclient.errors.HttpError as e:
        print('Http Error for {}: {}', e)
    except KeyError as e2:
        print('Key error: {}', e2)

get_text_from_files(files)
But I am getting an error; I have given the stack trace below:
Traceback (most recent call last):
  File "test.py", line 68, in <module>
    get_text_from_files(pdf);
  File "test.py", line 21, in get_text_from_files
    texts = detect_text(fileNames);
  File "test.py", line 41, in detect_text
    request = service.images().annotate(body={'requests': batch_request});
AttributeError: 'Resource' object has no attribute 'images'
Thanks in advance.
Note that you are using the wrong Google API Client Python Library: you are building the Natural Language API client, while the one you want is the Vision API. The error message AttributeError: 'Resource' object has no attribute 'images' indicates that the resource associated with the Language API does not have any images attribute. To solve this issue, it should be enough to make the following change:
# Wrong API being used
service = googleapiclient.discovery.build('language', 'v1');
# Correct API being used
service = googleapiclient.discovery.build('vision', 'v1');
In this Google API Client Libraries page you will find the whole list of available APIs with their names and versions. And here is the complete documentation for the Vision API legacy API Client Library.
Finally, let me recommend the idiomatic Client Libraries instead of the legacy API Client Libraries. They are much more intuitive to use, and there are some good documentation references on their GitHub page.
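For reference, a minimal sketch of the same text detection using the idiomatic google-cloud-vision client library (assuming a recent version installed via pip install google-cloud-vision and the same credentials setup):
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("/opt/lampp/htdocs/test.jpg", "rb") as image_file:
    image = vision.Image(content=image_file.read())

# text_detection is a convenience wrapper that requests TEXT_DETECTION
response = client.text_detection(image=image)
for annotation in response.text_annotations:
    print(annotation.description)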
