Upload .txt file to Dropbox without saving locally first - python

I have a .txt file that I need to upload into a Dropbox folder. On my PC it works great as it is; however, I need to put the code into a Google Cloud Function, and because the GCP file system is read-only, this method is failing.
Can anyone recommend an alternative way of doing this that doesn't require me to save the data locally before pushing it up into Dropbox?
Here is my current working code for my local version:
import pathlib
import dropbox
api_key = 'XXXXXXXXXX'
# Build String And Save Locally To File
string = ["Item_A","Item_B","Item_C","Item_D"]
string = str(string)
with open('Item_List.txt', 'w') as f:
    f.write(string)
# Define Local File Path
localfolder = pathlib.Path(".")
localpath = localfolder / 'Item_List.txt'
# Define Dropbox Target Location
targetfile = '/Data/' + 'Item_List.txt'
# Initialize Dropbox
d = dropbox.Dropbox(api_key)
# Upload File To Dropbox
with localpath.open("rb") as f:
    d.files_upload(f.read(), targetfile, mode=dropbox.files.WriteMode("overwrite"))

If you need to simply use byte data, you can use the built-in bytes function to convert a string to byte data (you need to also specify encoding):
data = ["Item_A", "Item_B", "Item_C", "Item_D"]
string_data = str(data)
byte_data = bytes(string_data, encoding='utf-8')
And then later just use the byte data as the argument:
d.files_upload(byte_data, targetfile, mode=dropbox.files.WriteMode("overwrite"))
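As a minimal sketch of that idea, the helpers below build the upload payload entirely in memory, so nothing touches the file system. The function names are made up for illustration; only the str-to-bytes conversion comes from the answer above. Note that str(data) uploads the Python list representation, so the second helper (an assumption about what the file should contain) writes one item per line instead.

```python
def list_repr_payload(items):
    """Return the same bytes the original script wrote: the list's repr."""
    return bytes(str(items), encoding="utf-8")


def one_per_line_payload(items):
    """Alternative layout (an assumption): one item per line."""
    return "\n".join(items).encode("utf-8")


items = ["Item_A", "Item_B", "Item_C", "Item_D"]
payload = list_repr_payload(items)
# payload can be passed to files_upload in place of f.read():
# d.files_upload(payload, targetfile, mode=dropbox.files.WriteMode("overwrite"))
```

Either helper returns a bytes object, which is what files_upload receives from f.read() in the local version.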

Related

Gzip a file in Python before uploading to Cloud Storage

I have the following Python function to write the given content to a bucket in Cloud Storage:
import gzip
from google.cloud import storage
def upload_to_cloud_storage(json):
    """Write to Cloud Storage."""
    # The contents to upload as a JSON string.
    contents = json
    storage_client = storage.Client()
    # Path and name of the file to upload (file doesn't yet exist).
    destination = "path/to/name.json.gz"
    # Gzip the contents before uploading
    with gzip.open(destination, "wb") as f:
        f.write(contents.encode("utf-8"))
    # Bucket
    my_bucket = storage_client.bucket('my_bucket')
    # Blob (content)
    blob = my_bucket.blob(destination)
    blob.content_encoding = 'gzip'
    # Write to storage
    blob.upload_from_string(contents, content_type='application/json')
However, I receive an error when running the function:
FileNotFoundError: [Errno 2] No such file or directory: 'path/to/name.json.gz'
Highlighting this line as the cause:
with gzip.open(destination, "wb") as f:
I can confirm that the bucket and path both exist although the file itself is new and to be written.
I can also confirm that removing the Gzipping part sees the file successfully written to Cloud Storage.
How can I gzip a new file and upload to Cloud Storage?
Other answers I've used for reference:
https://stackoverflow.com/a/54769937
https://stackoverflow.com/a/67995040
Although @David's answer wasn't complete at the time of solving my problem, it got me on the right track. Here's what I ended up using, along with explanations of what I found out along the way.
import gzip
from google.cloud import storage
from google.cloud.storage import fileio
def upload_to_cloud_storage(json_string):
    """Gzip and write to Cloud Storage."""
    storage_client = storage.Client()
    bucket = storage_client.bucket('my_bucket')
    # Filename (include path)
    blob = bucket.blob('path/to/file.json')
    # Set blob metadata for decompressive transcoding
    blob.content_encoding = 'gzip'
    blob.content_type = 'application/json'
    writer = fileio.BlobWriter(blob)
    # Must write as bytes
    gz = gzip.GzipFile(fileobj=writer, mode="wb")
    # When writing as bytes we must encode our JSON string.
    gz.write(json_string.encode('utf-8'))
    # Close connections
    gz.close()
    writer.close()
We use the GzipFile() class instead of the convenience function (compress) so that we can pass in the mode. When trying to write using w or wt you will receive the error:
TypeError: memoryview: a bytes-like object is required, not 'str'
So we must write in binary mode (wb). This will also enable the .write() method. When doing so however we need to encode our JSON string. This can be done using str.encode() and setting it as utf-8. Failing to do this will also result in the same error.
Finally, I wanted to enable decompressive transcoding, where the requester (a browser in my case) receives the uncompressed version of the file when requested. To enable this, google.cloud.storage.blob lets you set some metadata, including content_type and content_encoding, so we can follow best practices.
This sees the JSON object in memory written to your chosen destination in Cloud Storage in a compressed format and decompressed on the fly (without needing to download a gzip archive).
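The bytes requirement can be checked locally without touching Cloud Storage. The sketch below is an illustration only: io.BytesIO stands in for fileio.BlobWriter(blob), and the helper name gzip_to_fileobj is hypothetical. It shows that GzipFile writes into any binary file-like object, and that passing a str rather than bytes raises a TypeError.

```python
import gzip
import io


def gzip_to_fileobj(json_string, fileobj):
    """Gzip json_string into any writable binary file-like object."""
    # GzipFile only accepts bytes, so the str is encoded first.
    with gzip.GzipFile(fileobj=fileobj, mode="wb") as gz:
        gz.write(json_string.encode("utf-8"))


# io.BytesIO stands in for fileio.BlobWriter(blob) in this sketch.
buffer = io.BytesIO()
gzip_to_fileobj('{"name": "example"}', buffer)
compressed = buffer.getvalue()

# Writing a str instead of bytes fails with TypeError.
try:
    with gzip.GzipFile(fileobj=io.BytesIO(), mode="wb") as gz:
        gz.write("not bytes")
    wrote_str = True
except TypeError:
    wrote_str = False
```

Closing the GzipFile does not close the underlying file object, which is why the code above still closes the writer separately.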
Thanks also to @JohnHanley for the troubleshooting advice.
The best solution is not to write the gzip to a file at all, and directly compress and stream to GCS.
import gzip
from google.cloud import storage
from google.cloud.storage import fileio
storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('my_object')
writer = fileio.BlobWriter(blob)
gz = gzip.GzipFile(fileobj=writer, mode="w") # use "wb" if bytes
gz.write(contents)
gz.close()
writer.close()
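For a payload that already fits in memory, another option (not taken from the answers above, so treat it as a sketch) is to compress with gzip.compress and upload the resulting bytes in one call. The helper name gzip_json_bytes is hypothetical, and the upload lines are left as comments because they need a real bucket.

```python
import gzip


def gzip_json_bytes(json_string):
    """Compress a JSON string in memory and return the gzip bytes."""
    return gzip.compress(json_string.encode("utf-8"))


compressed = gzip_json_bytes('{"name": "example"}')
# The bytes could then be uploaded in one call, for example:
# blob.content_encoding = "gzip"
# blob.upload_from_string(compressed, content_type="application/json")
```

Streaming through BlobWriter is likely the better fit for large payloads, since this variant holds the whole compressed object in memory.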

Why doesn't the password open my zip file in s3 when passed as a bytes object in python?

I have a small but mysterious and unsolvable problem using python to open a password protected file in an AWS S3 bucket.
The password I have been given is definitely correct and I can download the zip to Windows and extract it to reveal the csv data I need.
However I need to code up a process to load this data into a database regularly.
The password has a pattern like this (includes mixed case letters, numbers and a single "#"):-
ABCD#Efghi12324567890
The code below works with other zip files I place in the location with the same password:-
import boto3
import pyzipper
from io import BytesIO
s3_resource = boto3.resource('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
zip_obj = s3_resource.Object(bucket_name=my_bucket, key=my_folder + my_zip)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = pyzipper.ZipFile(buffer)
my_newfile=z.namelist()[0]
s3_resource.meta.client.upload_fileobj(
    z.open(my_newfile, pwd=b"ABCD#Efghi12324567890"),  # HERE IS THE OPEN COMMAND
    Bucket=my_bucket,
    Key=my_folder + my_newfile)
I am told the password is incorrect:-
RuntimeError: Bad password for file 'ThisIsTheFileName.csv'
I resorted to using pyzipper rather than zipfile, since zipfile didn't support the compression method of the file in question:-
That compression method is not supported
In 7-zip I can see the following for the zip file:-
Method: AES-256 Deflate
Characteristics: WzAES: Encrypt
Host OS: FAT
So to confirm:-
-The password is definitely correct (can open it manually)
-The code seems ok - it opens my zip files with the same password
What is the issue here please and how do I fix it?
You would have my sincere thanks!
Phil
With some help from a colleague and a useful article, I now have this working.
Firstly, given the compression type, I found it necessary to use pyzipper's AESZipFile() class (although it also seemed to work on other compression types).
Secondly, AESZipFile() apparently accepts a BytesIO object as well as a file path, presumably because this is what it sees when it opens the file.
Therefore the zip file can be extracted in situ without having to download it first.
This creates the pyzipper object, which you can then read by specifying the file name and the password.
The final code looks like this:-
import pyzipper
import boto3
from io import BytesIO
my_bucket = ''
my_folder = ''
my_zip = ''
my_password = b''
aws_access_key_id=''
aws_secret_access_key=''
s3 = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
s3_file = s3.get_object(Bucket=my_bucket, Key=my_folder + my_zip)
s3_iodata = BytesIO(s3_file['Body'].read())
f = pyzipper.AESZipFile(s3_iodata)
my_file = f.namelist()[0]
file_content = f.read(my_file, pwd = my_password)
response = s3.put_object(Body=file_content, Bucket=my_bucket, Key=my_folder + my_file)
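The in-memory pattern itself can be tried without S3 or pyzipper. The sketch below uses the standard library's zipfile as a stand-in (it cannot decrypt AES archives, so it is not a replacement for AESZipFile); the helper name read_first_member and the sample archive are made up for illustration.

```python
import zipfile
from io import BytesIO


def read_first_member(zip_bytes, password=None):
    """Return (name, content) of the first file in an in-memory zip archive."""
    with zipfile.ZipFile(BytesIO(zip_bytes)) as archive:
        name = archive.namelist()[0]
        return name, archive.read(name, pwd=password)


# Build a small unencrypted archive in memory to stand in for the S3 object body.
source = BytesIO()
with zipfile.ZipFile(source, "w") as archive:
    archive.writestr("data.csv", "id,value\n1,a\n")

name, content = read_first_member(source.getvalue())
```

In the real code, zip_bytes would be s3_file['Body'].read() and the archive class would be pyzipper.AESZipFile.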
Here is an article that was useful:-
https://www.linkedin.com/pulse/extract-files-from-zip-archives-in-situ-aws-s3-using-python-tom-reid
I hope this is helpful to someone,
Phil

Convert .pdf to .docx on Adobe pdf services API (using Python)

I'm trying to write a Python program that converts ".pdf" files to ".docx" files, using the Adobe PDF Services API (free trial).
I've found documentation showing how to transform any ".pdf" file into a ".zip" file containing ".txt" files (holding the text data) and ".excel" files (holding the tabular data).
import logging
import os.path
from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation
logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))
try:
    # get base path.
    base_path = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath("C:/..link.../extractpdf/extract_txt_from_pdf.ipynb"))))
    # Initial setup, create credentials instance.
    credentials = Credentials.service_account_credentials_builder() \
        .from_file(base_path + "\\pdfservices-api-credentials.json") \
        .build()
    # Create an ExecutionContext using credentials and create a new operation instance.
    execution_context = ExecutionContext.create(credentials)
    extract_pdf_operation = ExtractPDFOperation.create_new()
    # Set operation input from a source file.
    source = FileRef.create_from_local_file(base_path + "/resources/trs_pdf_file.pdf")
    extract_pdf_operation.set_input(source)
    # Build ExtractPDF options and set them into the operation
    extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
        .with_element_to_extract(ExtractElementType.TEXT) \
        .with_element_to_extract(ExtractElementType.TABLES) \
        .build()
    extract_pdf_operation.set_options(extract_pdf_options)
    # Execute the operation.
    result: FileRef = extract_pdf_operation.execute(execution_context)
    # Save the result to the specified location.
    result.save_as(base_path + "/output/Extract_TextTableau_From_trs_pdf_file.zip")
except (ServiceApiException, ServiceUsageException, SdkException):
    logging.exception("Exception encountered while executing operation")
But I can't yet get the conversion to a ".docx" file to work, even after changing the name of the extracted file to name.docx.
I read the documentation for adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options.ExtractPDFOptions() but didn't find a way to tune the extraction and change the output from ".zip" to ".docx". What things can I try next?
Unfortunately, right now the Python SDK is only supporting the Extract portion of our PDF services. You could use the services via the REST APIs (https://documentcloud.adobe.com/document-services/index.html#how-to-get-started-) as an alternative.

Use seek, write and readline methods on a CSV file stored on Google Cloud Storage (bucket)

I have multiple methods in my Python script that work with a CSV file. They work on my local machine, but not when the same CSV file is stored inside a Google Cloud Storage bucket. I need to keep track of my current position in the file, which is why I am using seek() and tell(). I tried the pandas library, but it has no such methods. Does anyone have a basic example of a Python script that reads a CSV stored in a GCP bucket with those methods?
def read_line_from_csv(position):
    #df = pandas.read_csv('gs://trends_service_v1/your_path.csv')
    with open('keywords.csv') as f:
        f.seek(position)
        keyword = f.readline()
        position = f.tell()
    # The with block closes the file, so no explicit f.close() is needed
    return position, keyword
def save_new_position(current_position):
    f = open("position.csv", "w")
    f.write(str(current_position))
    f.close()
    update_csv_bucket("position.csv")
def get_position_reader():
    try:
        with open('position.csv') as f:
            return int(f.readline())
    except OSError as e:
        print(e)
I don't think the official library has such capabilities.
You can download the file first, then open it and work normally.
Apart from the official library, you can use gcsfs, which implements the missing functionality:
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-google-project')
with fs.open('my-bucket/my-file.txt', 'rb') as f:
    print(f.seek(location))
Another way, other than @emil-gi's suggestions, would be to use the method mentioned here
#Download the contents of this blob as a bytes object
blob.download_as_string()
Where blob is the object associated with your CSV in your GCS bucket.
If you need to create the connection to the blob first (I don't know what you do in other parts of the code), use the docs
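Once the content is available as bytes, seek(), readline() and tell() work on an in-memory stream. The following is a minimal sketch under that assumption: the helper name read_line_at is hypothetical, and a bytes literal stands in for the value returned by blob.download_as_string(). Unlike the function in the question, it strips the trailing newline from the returned line.

```python
import io


def read_line_at(data, position):
    """Read one line from data (bytes) starting at byte offset position.

    Returns (next_position, line_text).
    """
    stream = io.BytesIO(data)
    stream.seek(position)
    line = stream.readline()
    return stream.tell(), line.decode("utf-8").rstrip("\r\n")


# data would come from blob.download_as_string(); a literal is used here.
data = b"alpha\nbeta\ngamma\n"
position, keyword = read_line_at(data, 0)
position, second = read_line_at(data, position)
```

The returned position is a byte offset, so it can be saved and passed back in on the next call, much like the position.csv approach in the question.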
You can use Google Cloud Storage fileio.
For instance:
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(file_path) #folder/filename.csv
#Instantiate a BlobReader
blobReader=storage.fileio.BlobReader(blob)
#Get current position in your file
print(blobReader.tell())
#Read line by line
print(blobReader.readline().decode('utf-8')) #read and print row 1
print(blobReader.readline().decode('utf-8')) #read and print row 2
#Read chunk of X bytes
print(blobReader.read(1000).decode('utf-8')) #read next 1000 bytes
#To seek a specific position.
blobReader.seek(position)

GCS - Read a text file from Google Cloud Storage directly into python

I feel kind of stupid right now. I have been reading numerous documentations and stackoverflow questions but I can't get it right.
I have a file on Google Cloud Storage. It is in a bucket 'test_bucket'. Inside this bucket there is a folder, 'temp_files_folder', which contains two files, one .txt file named 'test.txt' and one .csv file named 'test.csv'. The two files are simply because I try using both but the result is the same either way.
The content in the files is
hej
san
and I am hoping to read it into python the same way I would do on a local with
textfile = open("/file_path/test.txt", 'r')
times = textfile.read().splitlines()
textfile.close()
print(times)
which gives
['hej', 'san']
I have tried using
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('test_bucket')
blob = bucket.get_blob('temp_files_folder/test.txt')
print(blob.download_as_string)
but it gives the output
<bound method Blob.download_as_string of <Blob: test_bucket, temp_files_folder/test.txt>>
How can I get the actual string(s) in the file?
download_as_string is a method, you need to call it.
print(blob.download_as_string())
More likely, you want to assign it to a variable so that you download it once and can then print it and do whatever else you want with it:
downloaded_blob = blob.download_as_string()
print(downloaded_blob)
do_something_else(downloaded_blob)
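Because download_as_string() returns bytes rather than str, getting the list the question asks for takes a decode and a split. A minimal sketch, with a bytes literal standing in for the downloaded content and a hypothetical helper name:

```python
def bytes_to_lines(raw, encoding="utf-8"):
    """Decode downloaded bytes and split them into a list of lines."""
    return raw.decode(encoding).splitlines()


# raw stands in for the return value of blob.download_as_string().
raw = b"hej\nsan\n"
times = bytes_to_lines(raw)
print(times)  # ['hej', 'san']
```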
The method 'download_as_string()' returns the content as bytes.
Find below an example to process a .csv file.
import csv
from io import StringIO
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket(YOUR_BUCKET_NAME)
blob = bucket.blob(YOUR_FILE_NAME)
blob = blob.download_as_string()
blob = blob.decode('utf-8')  # transform bytes to str here
blob = StringIO(blob)  # wrap the str in a file-like object
names = csv.reader(blob)  # then use the csv library to read the content
for name in names:
    print(f"First Name: {name[0]}")
According to the documentation (https://googleapis.dev/python/storage/latest/blobs.html), as of the time of writing (2021/08), the download_as_string method is a deprecated alias for the download_as_bytes method which, as suggested by the name, returns a bytes object.
You can instead use the download_as_text method to return a str object.
For instance, to download the file MYFILE from bucket MYBUCKET and store it as a str decoded from UTF-8:
from google.cloud.storage import Client
client = Client()
bucket = client.get_bucket(MYBUCKET)
blob = bucket.get_blob(MYFILE)
downloaded_file = blob.download_as_text(encoding="utf-8")
You can then also use this to read different file formats. For JSON, replace the last line with
import json
downloaded_json_file = json.loads(blob.download_as_text(encoding="utf-8"))
For YAML files, replace the last line with:
import yaml
downloaded_yaml_file = yaml.safe_load(blob.download_as_text(encoding="utf-8"))
DON'T USE: blob.download_as_string()
USE: blob.download_as_text()
blob.download_as_text() does indeed return a string.
blob.download_as_string() is deprecated and returns a bytes object instead of a string object.
Works out when reading a docx / text file
from google.cloud import storage
# create storage client
storage_client = storage.Client.from_service_account_json('**PATH OF JSON FILE**')
bucket = storage_client.get_bucket('**BUCKET NAME**')
# get bucket data as blob
blob = bucket.blob('**SPECIFYING THE DOCX FILENAME**')
downloaded_blob = blob.download_as_string()
downloaded_blob = downloaded_blob.decode("utf-8")
print(downloaded_blob)
