Reading a pandas pickle file in Tensorflow in CloudML - python

I'm getting an error trying to read a pandas pickle e.g. df.to_pickle() method, which is stored in Google Cloud storage. I'm trying to do the following:
path_to_gcs_file = 'gs://xxxxx'
f = file_io.FileIO(path_to_gcs_file, mode='r').read()
train_df = pd.read_pickle(f)
f.close()
I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Alternatively I tried:
f = BytesIO(file_io.read_file_to_string(path_to_gcs_file, binary_mode=True))
train_df = pd.read_pickle(f)
Which works locally but not on CloudML!
f = file_io.read_file_to_string(path_to_gcs_file, binary_mode=True)
train_df = pd.read_pickle(f)
Gives me an error:
AttributeError: 'bytes' object has no attribute 'seek'

You should be able to get away with using a context manager, but I think you're pulling the end of the certificate using this way, so you should instead download the file through the api
pip install --upgrade google-cloud-storage
Then
# Initialise a client
storage_client = storage.Client("[Your project name here]")
# Create a bucket object for our bucket
bucket = storage_client.get_bucket(bucket_name)
# Create a blob object from the filepath
blob = bucket.blob("folder_one/foldertwo/filename.extension")
# Download the file to a destination
blob.download_to_filename(path_to_gcs_file)
with open(path_to_gcs_file, "rb" as f:
train_df = = pickle.load(f)
Much was taken from this answer:
Downloading a file from google cloud storage inside a folder

pandas.read_pickle accepts a path as the first argument; you are passing a File object (file.FileIO) and a bytes object (read_to_string).
So far I have not found a way to read a pickle object directly from GCS using pandas, so you will have to copy it to the machine. You could use file_io.copy for that:
file_io.copy('gs://xxxx', '/tmp/x.pkl')
train_df = pd.read_pickle('/tmp/x.pkl')

Related

Gzip a file in Python before uploading to Cloud Storage

I have the following Python function to write the given content to a bucket in Cloud Storage:
import gzip
from google.cloud import storage
def upload_to_cloud_storage(json):
"""Write to Cloud Storage."""
# The contents to upload as a JSON string.
contents = json
storage_client = storage.Client()
# Path and name of the file to upload (file doesn't yet exist).
destination = "path/to/name.json.gz"
# Gzip the contents before uploading
with gzip.open(destination, "wb") as f:
f.write(contents.encode("utf-8"))
# Bucket
my_bucket = storage_client.bucket('my_bucket')
# Blob (content)
blob = my_bucket.blob(destination)
blob.content_encoding = 'gzip'
# Write to storage
blob.upload_from_string(contents, content_type='application/json')
However, I receive an error when running the function:
FileNotFoundError: [Errno 2] No such file or directory: 'path/to/name.json.gz'
Highlighting this line as the cause:
with gzip.open(destination, "wb") as f:
I can confirm that the bucket and path both exist although the file itself is new and to be written.
I can also confirm that removing the Gzipping part sees the file successfully written to Cloud Storage.
How can I gzip a new file and upload to Cloud Storage?
Other answers I've used for reference:
https://stackoverflow.com/a/54769937
https://stackoverflow.com/a/67995040
Although #David's answer wasn't complete at the time of solving my problem, it got me on the right track. Here's what I ended up using along with explanations I found out along the way.
import gzip
from google.cloud import storage
from google.cloud.storage import fileio
def upload_to_cloud_storage(json_string):
"""Gzip and write to Cloud Storage."""
storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
# Filename (include path)
blob = bucket.blob('path/to/file.json')
# Set blog meta data for decompressive transcoding
blob.content_encoding = 'gzip'
blob.content_type = 'application/json'
writer = fileio.BlobWriter(blob)
# Must write as bytes
gz = gzip.GzipFile(fileobj=writer, mode="wb")
# When writing as bytes we must encode our JSON string.
gz.write(json_string.encode('utf-8'))
# Close connections
gz.close()
writer.close()
We use the GzipFile() class instead of convenience method (compress) to enable us to pass in the mode. When trying to write using w or wt you will receive the error:
TypeError: memoryview: a bytes-like object is required, not 'str'
So we must write in binary mode (wb). This will also enable the .write() method. When doing so however we need to encode our JSON string. This can be done using str.encode() and setting it as utf-8. Failing to do this will also result in the same error.
Finally, I wanted to be able to enable decompressive transcoding where the requester (browser in my case) will receive the uncompressed version of the file when requested. To enable this google.cloud.storage.blob allows you to set some meta data including content_type and content_encoding so we can can follow best practices.
This sees the JSON object in memory written to your chosen destination in Cloud Storage in a compressed format and decompressed on the fly (without needing to download a gzip archive).
Thanks also to #JohnHanley for the troubleshooting advice.
The best solution is not to write the gzip to a file at all, and directly compress and stream to GCS.
from google.cloud import storage
from google.cloud.storage import fileio
storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('my_object')
writer = fileio.BlobWriter(blob)
gz = gzip.GzipFile(fileobj=writer, mode="w") # use "wb" if bytes
gz.write(contents)
gz.close()
writer.close()

Upload .txt file to Dropbox without saving locally first

I have a .txt file that I need to upload into a Dropbox folder. On my PC it works great as it is however I need to put the code into a Google Cloud Function and as the GCP file system is read-only - this method if failing.
Can anyone recommend an alternative way of doing this that doesn't require me to save the data locally before pushing it up into Dropbox?
Here is my current working code for my local version:
import pathlib
import dropbox
api_key = 'XXXXXXXXXX'
# Build String And Save Locally To File
string = ["Item_A","Item_B","Item_C","Item_D"]
string = str(string)
with open('Item_List.txt', 'w') as f:
f.write(string)
# Define Local File Path
localfolder = pathlib.Path(".")
localpath = localfolder / 'Item_List.txt'
# Define Dropbox Target Location
targetfile = '/Data/' + 'Item_List.txt'
# Initilize Dropbox
d = dropbox.Dropbox(api_key)
# Upload File To Dropbox
with localpath.open("rb") as f:
d.files_upload(f.read(), targetfile, mode=dropbox.files.WriteMode("overwrite"))
If you need to simply use byte data, you can use the built-in bytes function to convert a string to byte data (you need to also specify encoding):
data = ["Item_A", "Item_B", "Item_C", "Item_D"]
string_data = str(data)
byte_data = bytes(string_data, encoding='utf-8')
And then later just use the byte data as the argument:
d.files_upload(byte_data, targetfile, mode=dropbox.files.WriteMode("overwrite"))

Use seek, write and readline methods on a CSV file stored on Google Cloud Storage (bucket)

I have multiple methods on my Python script to work with a csv file. It's working on my local machine but it does not when I am working with the same csv file stored inside a Google Cloud Storage bucket. I need to keep track of my current_position in the file so this is why I am using seek() and tell(). I tried to use the pandas library but there is no such methods. Does anyone has a basic example of a Python script to read a csv stored in a GCP bucket with those methods?
def read_line_from_csv(position):
#df = pandas.read_csv('gs://trends_service_v1/your_path.csv')
with open('keywords.csv') as f:
f.seek(position)
keyword = f.readline()
position = f.tell()
f.close()
return position, keyword
def save_new_position(current_positon):
f = open("position.csv", "w")
f.write(str(current_positon))
f.close()
update_csv_bucket("position.csv")
def get_position_reader():
try:
with open('position.csv') as f:
return int(f.readline())
except OSError as e:
print(e)
Official library do not have such capabilities I think.
You can download file first than open it and work normally.
Apart from official one you can use gcsfs which implements missing functionality
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-google-project')
with fs.open('my-bucket/my-file.txt', 'rb') as f:
print(f.seek(location))
Another way other than #emil-gi's suggestions would be to use the method mentioned here
#Download the contents of this blob as a bytes object
blob.download_as_string()
Where blob is the object associated with your CSV in your GCS bucket.
If you need to create the connection to the blob first (I don't know what you do in other parts of the code), use the docs
You can use Google Cloud Storage fileio.
For instance:
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(file_path) #folder/filename.csv
#Instantiate a BlobReader
blobReader=storage.fileio.BlobReader(blob)
#Get current position in your file
print(blobReader.tell())
#Read line by line
print(blobReader.readline().decode('utf-8')) #read and print row 1
print(blobReader.readline().decode('utf-8')) #read and print row 2
#Read chunk of X bytes
print(blobReader.read(1000).decode('utf-8')) #read next 1000 bytes
#To seek a specific position.
blobReader.seek(position)

Writing a file to S3 using Lambda in Python with AWS

In AWS, I'm trying to save a file to S3 in Python using a Lambda function. While this works on my local computer, I am unable to get it to work in Lambda. I've been working on this problem for most of the day and would appreciate help. Thank you.
def pdfToTable(PDFfilename, apiKey, fileExt, bucket, key):
# parsing a PDF using an API
fileData = (PDFfilename, open(PDFfilename, "rb"))
files = {"f": fileData}
postUrl = "https://pdftables.com/api?key={0}&format={1}".format(apiKey, fileExt)
response = requests.post(postUrl, files=files)
response.raise_for_status()
# this code is probably the problem!
s3 = boto3.resource('s3')
bucket = s3.Bucket('transportation.manifests.parsed')
with open('/tmp/output2.csv', 'rb') as data:
data.write(response.content)
key = 'csv/' + key
bucket.upload_fileobj(data, key)
# FYI, on my own computer, this saves the file
with open('output.csv', "wb") as f:
f.write(response.content)
In S3, there is a bucket transportation.manifests.parsed containing the folder csv where the file should be saved.
The type of response.content is bytes.
From AWS, the error from the current set-up above is [Errno 2] No such file or directory: '/tmp/output2.csv': FileNotFoundError. In fact, my goal is to save the file to the csv folder under a unique name, so tmp/output2.csv might not be the best approach. Any guidance?
In addition, I've tried to use wb and w instead of rb also to no avail. The error with wb is Input <_io.BufferedWriter name='/tmp/output2.csv'> of type: <class '_io.BufferedWriter'> is not supported. The documentation suggests that using 'rb' is the recommended usage, but I do not understand why that would be the case.
Also, I've tried s3_client.put_object(Key=key, Body=response.content, Bucket=bucket) but receive An error occurred (404) when calling the HeadObject operation: Not Found.
Assuming Python 3.6. The way I usually do this is to wrap the bytes content in a BytesIO wrapper to create a file like object. And, per the boto3 docs you can use the-transfer-manager for a managed transfer:
from io import BytesIO
import boto3
s3 = boto3.client('s3')
fileobj = BytesIO(response.content)
s3.upload_fileobj(fileobj, 'mybucket', 'mykey')
If that doesn't work I'd double check all IAM permissions are correct.
You have a writable stream that you're asking boto3 to use as a readable stream which won't work.
Write the file, and then simply use bucket.upload_file() afterwards, like so:
s3 = boto3.resource('s3')
bucket = s3.Bucket('transportation.manifests.parsed')
with open('/tmp/output2.csv', 'w') as data:
data.write(response.content)
key = 'csv/' + key
bucket.upload_file('/tmp/output2.csv', key)

GCS - Read a text file from Google Cloud Storage directly into python

I feel kind of stupid right now. I have been reading numerous documentations and stackoverflow questions but I can't get it right.
I have a file on Google Cloud Storage. It is in a bucket 'test_bucket'. Inside this bucket there is a folder, 'temp_files_folder', which contains two files, one .txt file named 'test.txt' and one .csv file named 'test.csv'. The two files are simply because I try using both but the result is the same either way.
The content in the files is
hej
san
and I am hoping to read it into python the same way I would do on a local with
textfile = open("/file_path/test.txt", 'r')
times = textfile.read().splitlines()
textfile.close()
print(times)
which gives
['hej', 'san']
I have tried using
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('test_bucket')
blob = bucket.get_blob('temp_files_folder/test.txt')
print(blob.download_as_string)
but it gives the output
<bound method Blob.download_as_string of <Blob: test_bucket, temp_files_folder/test.txt>>
How can I get the actual string(s) in the file?
download_as_string is a method, you need to call it.
print(blob.download_as_string())
More likely, you want to assign it to a variable so that you download it once and can then print it and do whatever else you want with it:
downloaded_blob = blob.download_as_string()
print(downloaded_blob)
do_something_else(downloaded_blob)
The method 'download_as_string()' will read in the content as byte.
Find below an example to process a .csv file.
import csv
from io import StringIO
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket(YOUR_BUCKET_NAME)
blob = bucket.blob(YOUR_FILE_NAME)
blob = blob.download_as_string()
blob = blob.decode('utf-8')
blob = StringIO(blob) #tranform bytes to string here
names = csv.reader(blob) #then use csv library to read the content
for name in names:
print(f"First Name: {name[0]}")
According to the documentation (https://googleapis.dev/python/storage/latest/blobs.html), As of the time of writing (2021/08), the download_as_string method is a depreciated alias for the download_as_byte method which - as suggested by the name - returns a byte object.
You can instead use the download_as_text method to return a str object.
For instances, to download the file MYFILE from bucket MYBUCKET and store it as an utf-8 encoded string:
from google.cloud.storage import Client
client = Client()
bucket = client.get_bucket(MYBUCKET)
blob = bucket.get_blob(MYFILE)
downloaded_file = blob.download_as_text(encoding="utf-8")
You can then also use this in order to read different file formats. For json, replace the last line to
import json
downloaded_json_file = json.loads(blob.download_as_text(encoding="utf-8"))
For yaml files, replace the last line to :
import yaml
downloaded_yaml_file = yaml.safe_load(blob.download_as_text(encoding="utf-8"))
DON'T USE: blob.download_as_string()
USE: blob.download_as_text()
blob.download_as_text() does indeed return a string.
blob.download_as_string() is deprecated and returns a bytes object instead of a string object.
Works out when reading a docx / text file
from google.cloud import storage
# create storage client
storage_client = storage.Client.from_service_account_json('**PATH OF JSON FILE**')
bucket = storage_client.get_bucket('**BUCKET NAME**')
# get bucket data as blob
blob = bucket.blob('**SPECIFYING THE DOXC FILENAME**')
downloaded_blob = blob.download_as_string()
downloaded_blob = downloaded_blob.decode("utf-8")
print(downloaded_blob)

Categories