Exporting text/plain from Google Drive - python

I'm trying to export a Google Doc as plain text. I've tried two approaches and neither is working.
When I export the contents, I get a bytes object that I haven't been able to convert to a plain string:
req = service.files().export_media(fileId=file_id,mimeType='text/plain')
fh = io.BytesIO()
download = MediaIoBaseDownload(fh, req)
done = False
while done is False:
    status, done = download.next_chunk()
return fh.getvalue()
I then get variants of codec errors:
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
or these:
TypeError: write() argument must be str, not bytes
So that's a type conversion problem, I guess, and if I can solve that it'd work.
BUT, I'd much rather just use the exportLinks property to download the file as text/plain. Problem is, that's simply missing:
file = service.files().get(fileId=id).execute()
pprint.pprint(file)
{'id': 'xxxxxxxxxx',
'kind': 'drive#file',
'mimeType': 'application/vnd.google-apps.document',
'name': 'export'}
file['exportLinks'], unsurprisingly, gives a KeyError:
KeyError: 'exportLinks'
I've relaxed the scope so it's now 'https://www.googleapis.com/auth/drive', so that shouldn't be the problem.
What am I missing?

The Drive platform allows developers to open, import, and export native Google Docs types such as Google Spreadsheets, Presentations, Documents, and Drawings. For instance, if your application is configured to open PDF files, then because Google Documents are exportable to PDF, users will be able to use your application to open those documents.
The app can download the converted file content with the files.export method.
file_id = '1ZdR3L3qP4Bkq8noWLJHSr_iBau0DNT4Kli4SxNc2YEo'
request = drive_service.files().export_media(fileId=file_id,
                                             mimeType='application/pdf')
fh = io.BytesIO()
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
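The snippet above exports a PDF. For the text/plain case in the question, the value returned by fh.getvalue() is a bytes object and has to be decoded before it can be treated as a string. The '\ufeff' in the error message is a byte order mark, so the sketch below assumes the export is UTF-8 with a leading BOM. The helper name decode_exported_text is hypothetical, and the BytesIO buffer is only a stand-in for the one filled by MediaIoBaseDownload. On the missing exportLinks key: as far as I know, Drive API v3 returns only a small default set of fields from files().get() unless a fields parameter is passed, so that is worth checking against the API reference.

```python
import io


def decode_exported_text(raw: bytes) -> str:
    """Decode bytes from a text/plain export, dropping a leading BOM if present."""
    # 'utf-8-sig' decodes like UTF-8 but strips a leading U+FEFF byte order mark.
    return raw.decode("utf-8-sig")


# Stand-in for the buffer that MediaIoBaseDownload fills in the question.
fh = io.BytesIO(b"\xef\xbb\xbfHello, Drive")
text = decode_exported_text(fh.getvalue())
print(text)
```

When writing the decoded text to disk, opening the file with an explicit encoding such as open(path, "w", encoding="utf-8") avoids depending on the platform default codec, which is where the 'charmap' error comes from.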

Related

Decode zip/base64 file as a PDF file in response in Python (Django), why my PDF is blank?

I have a zip/base64 data string. First, I base64-decode it, then use zipfile in Python to unzip it.
Then I put the contents in the response as a PDF file. I can see that the downloaded PDF has 7 pages and a length of about 75000, but all pages are blank. I wonder whether something is wrong with the decoding.
The .decode(errors='ignore') call came from another Stack Overflow post. Without it, I cannot decode the whole thing; it raises an error like "'utf-8' codec can't decode byte 0xfe in position 28".
Here is my code:
decoded = base64.b64decode(data)  # data is "zip/base64" type
with zipfile.ZipFile(io.BytesIO(decoded)) as zf:
    for name in zf.namelist():
        with zf.open(name) as f:
            contents = f.read().decode(errors='ignore')
response = HttpResponse(
    contents, content_type="application/pdf"
)
response["Content-Disposition"] = 'attachment; filename="{}"'.format(
    report_name + ".pdf"
)
return response
You shouldn't try to decode the file contents, since it's not text. f.read() returns a bytestring, which HttpResponse will accept perfectly fine for the page content.
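A minimal sketch of that advice, using only the standard library: the zip members are read as raw bytes and never decoded. The helper name extract_zip_members and the sample archive are made up for illustration; in the Django view, the resulting bytes would be passed to HttpResponse unchanged.

```python
import base64
import io
import zipfile


def extract_zip_members(b64_data):
    """Return {name: raw bytes} for every member of a base64-encoded zip."""
    decoded = base64.b64decode(b64_data)
    with zipfile.ZipFile(io.BytesIO(decoded)) as zf:
        # zf.read() returns bytes; no .decode() is applied to binary content.
        return {name: zf.read(name) for name in zf.namelist()}


# Build a small base64-encoded zip holding non-UTF-8 bytes, as a PDF would.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("report.pdf", b"%PDF-1.4\n\xfe\xff binary payload")
sample = base64.b64encode(buf.getvalue())

members = extract_zip_members(sample)
pdf_bytes = members["report.pdf"]
```

Decoding with errors='ignore' silently drops every byte sequence that is not valid UTF-8, which is consistent with a PDF that keeps its page count but renders blank.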

Uploading a h5py file to aws and downloading it again

I have not been able to find a clear solution so hopefully this question can be of use to others in the future.
Basically I am creating a h5py object as follows:
keys_d = [matrix_l2T.data, matrix_l2T.indices, matrix_l2T.indptr]
keys_n = ['data', 'indices', 'indptr' ]
file_to_save = h5py.File(FILENAME)
for i,j in zip(keys_n, keys_d):
    file_to_save.create_dataset(i, data = j)
Uploading to AWS as follows:
s3 = boto3.client('s3')
s3.upload_file(Filename = FILENAME, Key = KEY, Bucket = BUCKET)
Downloading as follows:
s3 = boto3.resource('s3')
try:
    s3.Bucket(BUCKET_NAME).download_file(KEY, FILENAME)
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise
Then I tried to open this file in a variety of ways, but every attempt fails with an error.
I have tried:
h5py.File(FILENAME,'r')
And I got:
OSError: Unable to open file (bad object header version number)
I tried also:
with open(FILENAME, 'rb') as f:
    a = f.read()
I need the 'rb' mode because without it I get "'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte". The last piece of code returns binary data, and I'm not sure what to do next.
I resolved the issue. I wasn't closing the h5py file. One needs to add
file_to_save.close()
before uploading to AWS.

Python throwing error in reading JSON file

I am writing a function in a Python Script which will read the json file and print it.
The script reads as:
def main(conn):
    global link, link_ID
    with open('ad_link.json', 'r') as statusFile:
        status = json.loads(statusFile.read())
        statusFile.close()
    print(status)
    link_data = json.load[status]
    link = link_data["link"]
    link_ID = link_data["link_id"]
    print(link)
    print(link_ID)
I am getting this error:
link_data = json.load[status]
TypeError: 'function' object is not subscriptable
What is the issue?
The content of ad_link.json (the file I am receiving is saved in this manner):
"{\"link\": \"https://res.cloudinary.com/dnq9phiao/video/upload/v1534157695/Adidas-Break-Free_nfadrz.mp4\", \"link_id\": \"ad_Bprise_ID_Adidas_0000\"}"
The function to receive and write JSON file
def on_message2(client, userdata, message):
    print("New MQTT message received. File %s line %d" % (filename, cf.f_lineno))
    print("message received?/'/'/' ", str(message.payload.decode("utf-8")), \
          "topic", message.topic, "retained ", message.retain)
    global links
    links = str(message.payload.decode("utf-8"))
    logging.debug("Got new mqtt message as %s" % message.payload.decode("utf-8"))
    status_data = str(message.payload.decode("utf-8"))
    print(status_data)
    print("in function on_message2")
    with open("ad_link.json", "w") as outFile:
        json.dump(status_data, outFile)
    time.sleep(3)
The output of this function
New MQTT message received. File C:/Users/arunav.sahay/PycharmProjects/MediaPlayer/venv/Include/mediaplayer_db_mqtt.py line 358
message received?/'/'/' {"link": "https://res.cloudinary.com/dnq9phiao/video/upload/v1534157695/Adidas-Break-Free_nfadrz.mp4", "link_id": "ad_Bprise_ID_Adidas_0000"} topic ios_push retained 1
{"link": "https://res.cloudinary.com/dnq9phiao/video/upload/v1534157695/Adidas-Break-Free_nfadrz.mp4", "link_id": "ad_Bprise_ID_Adidas_0000"}
EDIT
I found out the error is in the JSON format: I am receiving the JSON data in the wrong format. How can I correct that?
There are two major errors here:
1. You are trying to use the json.load function as a sequence or dictionary mapping. It's a function, you can only call it; you'd use json.load(file_object). Since status is actually a string, you'd have to use json.loads(status) to actually decode a JSON document stored in a string.
2. In on_message2, you encoded JSON data to JSON again. Now you have to decode it twice. That's an unfortunate waste of computer resources.
In the on_message2 function, the message.payload object is a bytes-value containing a UTF-8 encoded JSON document, if you want to write that to a file, don't decode to text, and don't encode the text to JSON again. Just write those bytes directly to a file:
def on_message2(client, userdata, message):
    logging.debug("Got new mqtt message as %s" % message.payload.decode("utf-8"))
    with open("ad_link.json", "wb") as out:
        out.write(message.payload)
Note the 'wb' mode; that opens a file in binary mode for writing, at which point you can write the bytes object to that file.
When you open a file without a b in the mode, you open a file in text mode, and when you write a text string to that file object, Python encodes that text to bytes for you. The default encoding depends on your OS settings, so without an explicit encoding argument to open() you can't even be certain that you end up with UTF-8 JSON bytes again! Since you already have a bytes value, there is no need to manually decode then have Python encode again, so use a binary file object and avoid that decode / encode dance here too.
You can now load the file contents with json.load() without having to decode again:
def main(conn):
    with open('ad_link.json', 'rb') as status_file:
        status = json.load(status_file)
    link = status["link"]
    link_id = status["link_id"]
Note that I opened the file as binary again. As of Python 3.6, the json.load() function can work both with binary files and text files, and for binary files it can auto-detect if the JSON data was encoded as UTF-8, UTF-16 or UTF-32.
If you are using Python 3.5 or earlier, open the file as text, but do explicitly set the encoding to UTF-8:
def main(conn):
    with open('ad_link.json', 'r', encoding='utf-8') as status_file:
        status = json.load(status_file)
    link = status["link"]
    link_id = status["link_id"]
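To make the double-encoding point concrete, here is a small standard-library sketch. The payload is a made-up stand-in for message.payload. Dumping the already-JSON text produces a JSON string literal (the escaped, quoted form seen in ad_link.json), which then needs two decodes, while the raw bytes need only one.

```python
import json

# Hypothetical stand-in for message.payload: UTF-8 bytes holding a JSON document.
payload = b'{"link": "https://example.com/video.mp4", "link_id": "ad_0000"}'

# What the original on_message2 effectively did: decode, then JSON-encode again.
text = payload.decode("utf-8")
double_encoded = json.dumps(text)   # a JSON string literal wrapping the document

once = json.loads(double_encoded)   # first decode gives back a str, not a dict
twice = json.loads(once)            # second decode finally gives the dict

# Decoding the payload directly needs one step (bytes are accepted on Python 3.6+).
direct = json.loads(payload)

print(type(once).__name__, type(twice).__name__)
```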
def main(conn):
    global link, link_ID
    with open('ad_link.json', 'r') as statusFile:
        link_data = json.loads(statusFile.read())
    link = link_data["link"]
    link_ID = link_data["link_id"]
    print(link)
    print(link_ID)
Replace loads with load when dealing with a file object, which supports read-like operations:
def main(conn):
    global link, link_ID
    with open('ad_link.json', 'r') as statusFile:
        status = json.load(statusFile)
    status=json.loads(status)
    link = status["link"]
    link_ID = status["link_id"]
    print(link)
    print(link_ID)

How to overwrite a file uploading to SharePoint with Microsoft Graph

I have a Python script that will upload a file to Sharepoint using Microsoft Graph but it gives me a 500 status code error when I try to upload the same file twice.
Here is the code for the function that uploads the file:
def upload_file(session,filename,driveid,folder):
    """Upload a file to Sharepoint.
    """
    fname_only = os.path.basename(filename)
    # create the Graph endpoint to be used
    endpoint = f'drives/{driveid}/root:/{folder}/{fname_only}:/createUploadSession'
    start_response = session.put(api_endpoint(endpoint))
    json_response = start_response.json()
    upload_url = json_response["uploadUrl"]
    # upload in chunks
    filesize = os.path.getsize(filename)
    with open(filename, 'rb') as fhandle:
        start_byte = 0
        while True:
            file_content = fhandle.read(10*1024*1024)
            data_length = len(file_content)
            if data_length <= 0:
                break
            end_byte = start_byte + data_length - 1
            crange = "bytes "+str(start_byte)+"-"+str(end_byte)+"/"+str(filesize)
            print(crange)
            chunk_response = session.put(upload_url,
                                         headers={"Content-Length": str(data_length),"Content-Range": crange},
                                         data=file_content)
            if not chunk_response.ok:
                print(f'<Response [{chunk_response.status_code}]>')
                pprint.pprint(chunk_response.json()) # show error message
                break
            start_byte = end_byte + 1
    return chunk_response
Here is the output for the first run:
bytes 0-10485759/102815295
bytes 10485760-20971519/102815295
bytes 20971520-31457279/102815295
bytes 31457280-41943039/102815295
bytes 41943040-52428799/102815295
bytes 52428800-62914559/102815295
bytes 62914560-73400319/102815295
bytes 73400320-83886079/102815295
bytes 83886080-94371839/102815295
bytes 94371840-102815294/102815295
Here is the output for the second run:
bytes 0-10485759/102815295
bytes 10485760-20971519/102815295
bytes 20971520-31457279/102815295
bytes 31457280-41943039/102815295
bytes 41943040-52428799/102815295
bytes 52428800-62914559/102815295
bytes 62914560-73400319/102815295
bytes 73400320-83886079/102815295
bytes 83886080-94371839/102815295
bytes 94371840-102815294/102815295
<Response [500]>
{'error': {'code': 'generalException',
'message': 'An unspecified error has occurred.'}}
I guess I could figure out how to delete the file before I overwrite it but it would be nice to preserve history since Sharepoint keeps versions.
Thanks for any help on this.
Bobby
p.s. I have been hacking the code in https://github.com/microsoftgraph/python-sample-console-app to get it to upload a file to SharePoint so some of the code in the function is from Microsoft's sample application.
For anyone ending up here whilst looking into file name conflict issues, according to the Microsoft article below, if there is a file name collision and you have not correctly specified that it should be replaced, the final byte range upload will fail in the way OP is describing. Hopefully this helps someone.
Handle upload errors
When the last byte range of a file is uploaded, it is possible for an error to occur. This can be due to a name conflict or quota limitation being exceeded. The upload session will be preserved until the expiration time, which allows your app to recover the upload by explicitly committing the upload session.
From: https://learn.microsoft.com/en-us/onedrive/developer/rest-api/api/driveitem_createuploadsession?view=odsp-graph-online#create-an-upload-session
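As a hedged sketch of "correctly specifying that it should be replaced": the linked article describes an optional request body for createUploadSession whose item object carries an @microsoft.graph.conflictBehavior property. The helper below only builds that body; its name is hypothetical, and how it is sent depends on the session object in the question.

```python
import json

# Conflict behaviors listed in the Microsoft documentation for this property.
ALLOWED_BEHAVIORS = ("fail", "replace", "rename")


def build_upload_session_body(conflict_behavior="replace"):
    """Build the JSON body for a createUploadSession request."""
    if conflict_behavior not in ALLOWED_BEHAVIORS:
        raise ValueError(f"unsupported conflict behavior: {conflict_behavior!r}")
    return {"item": {"@microsoft.graph.conflictBehavior": conflict_behavior}}


body = build_upload_session_body("replace")
print(json.dumps(body))
```

Assuming session behaves like a requests session, the body could be passed as json=body when the upload session is created. Note that the Microsoft documentation shows that request as a POST, whereas the function in the question uses put.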

How do I write a file to temporary storage that I receive through django rest framework?

I have a view that accepts a file (.doc, docx, or pdf):
from rest_framework.decorators import api_view, parser_classes
from rest_framework.parsers import FileUploadParser
@api_view(['POST'])
@parser_classes((FileUploadParser,))
def parse_document(request, format=None):
    file_obj = request.data['file']
I need to parse these documents and return json. I'm using Textract to convert the documents to text but in order for that to happen I need to pass a filepath to Textract, hence the reason I need to write the file temporarily to the file system.
I've tried reading the contents of the file and writing them to a temp file, but I just end up with jumbled text like \x00\x14. I've also tried decoding the file with 'utf-8', but I receive an error saying
'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte
I can read a .txt file just fine; this error only happens with other file types. I feel like there has to be a way to just copy the file to temp storage without needing to read the contents.
@api_view(['POST'])
@parser_classes((FileUploadParser,))
def parse_resume(request, format=None):
    file_obj = request.data['file']
    tempf, tempfn = tempfile.mkstemp()
    try:
        for chunk in file_obj.chunks():
            os.write(tempf, chunk)
    except:
        raise Exception("Problem with the input file %s" % file_obj.name)
    finally:
        text = textract.process(tempfn).decode('utf-8') # This is where the error described above is thrown
        os.close(tempf)
    return Response({"text": None})
Handling uploads by rest-framework should not be any different than doing it by Django itself (except that you can use request.data instead of request.FILES).
Try to break the problem into smaller pieces and see where things start to fail. For example:
1. Write a plain Django view and store the file at a hardcoded file path. Make sure that works. Here are the docs: https://docs.djangoproject.com/en/2.0/topics/http/file-uploads/
2. Replace the Django view with a rest-framework view.
3. Replace the hardcoded file path with a temporary file.
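The last step can be tried in isolation with only the standard library. The sketch below copies byte chunks into a temporary file opened in binary mode, without decoding anything, and closes the file before returning its path. The helper name save_chunks_to_tempfile is hypothetical, and the chunks list is a stand-in for what an uploaded file's chunks() method would yield.

```python
import os
import tempfile


def save_chunks_to_tempfile(chunks, suffix=""):
    """Write byte chunks to a temp file and return its path (caller deletes it)."""
    # delete=False keeps the file on disk after the handle is closed, so a
    # path-based tool can open it afterwards.
    with tempfile.NamedTemporaryFile(mode="wb", suffix=suffix, delete=False) as tmp:
        for chunk in chunks:
            tmp.write(chunk)
        return tmp.name


# Stand-in for file_obj.chunks(): arbitrary binary data that is not valid UTF-8.
chunks = [b"\x00\x14binary", b"\xe9 more bytes"]
path = save_chunks_to_tempfile(chunks, suffix=".doc")

with open(path, "rb") as f:
    saved = f.read()
os.remove(path)
```

Keeping the original extension in the suffix may matter if the downstream parser picks its handler from the file name; that is an assumption to verify for the library in use.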
