I have a dataset on Google Drive that's about 20 GB.
I use a generator to feed the dataset to my Keras/TF models, and the overhead of loading the files from Drive (for every batch) is enormous.
I want to prefetch the content in one operation and then simply read it from the local VM disk.
I tried this:
from google.colab import drive
drive.mount('/content/drive')
!mkdir -p $RAW_NOTEBOOKS_DIR
!cp $RAW_NOTEBOOKS_DIR $LOCAL_NOTEBOOKS_DIR
However, this snippet finishes executing instantly, so it obviously didn't download the data, which was the intent of the cp command (copying from Drive to local).
Is this at all possible?
RAW_NOTEBOOKS_DIR = "/content/drive/My\ Drive/Colab\ Notebooks"
There's a good example on Google Codelabs for doing this; they write the dataset to local TFRecords:
https://codelabs.developers.google.com/codelabs/keras-flowers-tpu/#0
You can find more info here:
https://keras.io/examples/keras_recipes/tfrecord/
So instead of reading the data from Google Drive every time, you read it once, write it to local disk as TFRecords, and then pass those to the model for training.
If you follow the guides it's pretty straightforward.
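As a rough illustration (not the codelab's exact code), a minimal sketch of the idea might look like this; the local path, feature names and the `samples` index are placeholders you'd adapt to your own dataset:
import tensorflow as tf

# Hypothetical local path: Drive is already mounted, TFRecords go to the local VM disk.
local_tfrecord = '/content/data/train.tfrecord'

def serialize_example(image_bytes, label):
    # Pack one sample into a tf.train.Example protobuf.
    feature = {
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# One-time pass: read each file from Drive and write it into the local TFRecord.
with tf.io.TFRecordWriter(local_tfrecord) as writer:
    for path, label in samples:  # `samples` is whatever (path, label) index your generator used
        writer.write(serialize_example(tf.io.read_file(path).numpy(), label))

# Training then reads only from the local disk.
def parse(record):
    spec = {'image': tf.io.FixedLenFeature([], tf.string),
            'label': tf.io.FixedLenFeature([], tf.int64)}
    return tf.io.parse_single_example(record, spec)

dataset = tf.data.TFRecordDataset(local_tfrecord).map(parse).batch(32).prefetch(tf.data.AUTOTUNE)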
I am new to Google Cloud Storage and VMs. My problem is that I have a big dataset of images stored in a bucket on Google Cloud, and I want to do two things:
First, I want to load this dataset into my Jupyter instance to run some deep learning models, but I haven't found any clear method to do so.
Second, I want to save some output images, and also some other files, to a different bucket.
Thank you in advance :)
You can use the Cloud Storage FUSE tool to mount a Cloud Storage bucket to your instance. The mounted bucket behaves similarly to a persistent disk even though Cloud Storage buckets are object storage. Once you have the buckets mounted as disks, you can read the input images for the deep learning model from the first bucket and save the output to the second bucket.
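If you'd rather not mount anything, a minimal sketch using the google-cloud-storage Python client from the Jupyter instance could also cover both steps (bucket names, prefixes and local paths below are placeholders):
import os
from google.cloud import storage

client = storage.Client()
input_bucket = client.bucket('my-input-images')     # placeholder bucket name
output_bucket = client.bucket('my-output-results')  # placeholder bucket name

# Download every image under a prefix to the local disk of the instance.
os.makedirs('/home/jupyter/data', exist_ok=True)
for blob in client.list_blobs(input_bucket, prefix='dataset/'):
    blob.download_to_filename(os.path.join('/home/jupyter/data', os.path.basename(blob.name)))

# ... run the model on the local copies ...

# Upload a result image to the second bucket.
output_bucket.blob('results/prediction_0.png').upload_from_filename('/home/jupyter/out/prediction_0.png')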
I've been at this for 2 days and cannot figure this out. This is my first foray into the world of Google Cloud Platform. I would like to do the following two things the easiest way possible:
Allow me to provide some context before I start: I would like to pull data from:
NYC Taxi and Limousine Commission or https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
I am interested in pulling all the months from 2019 (12 CSV files, 10+ GB). I would like to bring these 12 CSV files into Google Cloud Platform, specifically into Cloud Storage, for later processing in BigQuery / Dataprep. Storing these files locally is not an option due to their size.
First thing: is there a way to bring in these files using wget, but for multiple CSV files? From my research, wget is better suited to pulling in a single zip file or a single (or a few) individual files. The data on the taxi website is not zipped but is available by month in CSV format.
Second thing: if wget is not a scalable option, I created a Python file containing a script to pull / scrape data from the website I mentioned previously for those 12 months (2019). The script I am using is as follows:
import requests
import shutil
filenames = [ # probably could loop through these .._YYYY-MM
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-05.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-06.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-07.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-08.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-09.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-10.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-11.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-12.csv',
]
for x in filenames:
    r = requests.get(x, stream=True)
    if r.status_code == 200:
        with open(x.split('_')[-1], 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)
How do I use this Python script inside Google Cloud Platform to 'import' these 12 csv files into Google Storage for use in BigQuery and Dataprep? I tried using the SSH command line and activated conda but still couldn't figure this out (even after reviewing Google Cloud Platform user / reference guides!).
Perhaps someone may be able to assist me. Thank you for any help you can provide! :)
For a one-time loading of these files into Cloud Storage, I'd recommend creating a Compute Engine VM and then downloading and then uploading with gsutil. The Compute Engine VMs have high-bandwidth connections and should be able to download/upload 10GB without trouble.
Create a Compute Engine VM and SSH into it. You could also use Cloud Shell, but it limits you to 5GB storage, so you'd need to chunk the download / uploads.
Download the files. Below is an example using curl, but you could also do it with wget or in Python using the requests library.
curl "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-[01-12].csv" -o "tripdata_2019-#1.csv"
Upload with gsutil
gsutil cp tripdata_2019-{01..12}.csv gs://your-bucket
Another option would be Storage Transfer Service. I haven't personally used that service yet, but it seems like a fit for this use case.
I'm training a model with Tensorflow using Amazon Sagemaker, and I'd like to be able to monitor training progress while the job is running. During training however, no Tensorboard files are output to S3, only once the training job is completed are the files uploaded to S3. After training has completed, I can download the files and see that Tensorboard has been logging values correctly throughout training, despite only being updated in S3 once after training completes.
I'd like to know why Sagemaker isn't uploading the Tensorboard information to S3 throughout the training process?
Here is the code from my notebook on Sagemaker that kicks off the training job
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig, TensorBoardOutputConfig
import time
bucket = 'my-bucket'
output_prefix = 'training-jobs'
model_name = 'my-model'
dataset_name = 'my-dataset'
dataset_path = f's3://{bucket}/datasets/{dataset_name}'
output_path = f's3://{bucket}/{output_prefix}'
job_name = f'{model_name}-{dataset_name}-training-{time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())}'
s3_checkpoint_path = f"{output_path}/{job_name}/checkpoints" # Checkpoints are updated live as expected
s3_tensorboard_path = f"{output_path}/{job_name}/tensorboard" # Tensorboard data isn't appearing here until the training job has completed
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=s3_tensorboard_path,
    container_local_output_path='/opt/ml/output/tensorboard'  # I have confirmed this is the unaltered path being provided to tf.summary.create_file_writer()
)
role = sagemaker.get_execution_role()
estimator = TensorFlow(entry_point='main.py', source_dir='./', role=role, max_run=60*60*24*5,
                       output_path=output_path,
                       checkpoint_s3_uri=s3_checkpoint_path,
                       tensorboard_output_config=tensorboard_output_config,
                       instance_count=1, instance_type='ml.g4dn.xlarge',
                       framework_version='2.3.1', py_version='py37', script_mode=True)
estimator.fit({'train': dataset_path}, wait=True, job_name=job_name)
There is an issue on the TensorFlow GitHub related to the S3 client in version 2.3.1, which is the one you are using. Check in the CloudWatch logs whether you have an error like
OP_REQUIRES failed at whole_file_read_ops.cc:116 : Failed precondition: AWS Credentials have not been set properly. Unable to access the specified S3 location
The solution provided there is to add the GetObjectVersion permission to the bucket. Alternatively, to confirm that it is a TensorFlow issue, you can try a different version.
First, some speculation without any facts: SageMaker may work like other systems that sync files between a local drive and S3. They might check that a file hasn't been accessed recently before syncing it, so that they don't copy it while someone is writing to it. The log files are written to constantly until shutdown, so they might never be copied.
I have used SageMaker Docker containers with the same problem. I've tried two ways to circumvent it, and both seemed to work.
The first is to periodically create a new log file, e.g. every 30 minutes call tf.summary.create_file_writer(...) again to switch to a new log file. The old file is synced to S3 once it's no longer in use.
The second is to write logs directly to S3: tf.summary.create_file_writer('s3://bucket/dir/'). This is a more immediate way of getting the info into S3.
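A rough sketch of the first workaround (rotating the writer periodically) inside the training script could look like this; the 30-minute interval is an assumption, and num_steps / train_step stand in for your own training loop:
import time
import tensorflow as tf

LOG_ROOT = '/opt/ml/output/tensorboard'  # same local path SageMaker syncs to S3
ROTATE_EVERY_S = 30 * 60                 # assumed rotation interval

writer = tf.summary.create_file_writer(f'{LOG_ROOT}/part-0')
last_rotation, part = time.time(), 0

for step in range(num_steps):            # num_steps / train_step come from your own loop
    loss = train_step()
    with writer.as_default():
        tf.summary.scalar('loss', loss, step=step)
    if time.time() - last_rotation > ROTATE_EVERY_S:
        writer.close()                   # a closed file stops changing, so it gets synced
        part += 1
        writer = tf.summary.create_file_writer(f'{LOG_ROOT}/part-{part}')
        last_rotation = time.time()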
I have to download .parquet files from Azure Storage and analyze the data.
So I started using Python with the azure-storage-blob library, Pandas to store DataFrames, and Flask to create an API for the response.
But now the file size in Azure Storage is growing to about 30 MB per daily file, and in testing it takes 6 minutes to download a 20 MB file to my local machine, which I can't afford because I have to analyse the files for every day in a month, at about 20 MB each. That's almost 30 × 20 MB in total.
Any suggestion?
Did you find your answer? Without a code snippet it would be hard to diagnose your problem. Azure Blob Storage supports reading in parallel. If you have a BatchServiceClient service that would be straightforward. Otherwise, try the multiprocessing module.
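For reference, a minimal sketch of a parallel download with azure-storage-blob v12 (connection string, container and blob names are placeholders) might look like this; max_concurrency splits the blob into chunks and downloads them in parallel:
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str='<your-connection-string>',  # placeholder
    container_name='daily-files',         # placeholder
    blob_name='2021/06/22/data.parquet')  # placeholder

with open('data.parquet', 'wb') as f:
    blob.download_blob(max_concurrency=8).readinto(f)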
One CSV file is uploaded to Cloud Storage every day around 02:00, but sometimes, due to a job failure or system crash, the upload happens very late. So I want to create a Cloud Function that can trigger my Python BigQuery load script whenever the file is uploaded to the bucket.
file_name : seller_data_{date}
bucket name : sale_bucket/
The question lacks a full description of the desired use case and any issues the OP has faced. However, here are a few possible approaches you might choose from depending on the use case.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
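A minimal sketch of such a function (the dataset/table name and the seller_data_ filename filter are assumptions based on the question) might look like this, with the function deployed on a google.storage.object.finalize trigger for the bucket:
from google.cloud import bigquery

def load_to_bq(event, context):
    """Background Cloud Function triggered by a new object in the bucket."""
    if not event['name'].startswith('seller_data_'):
        return  # ignore unrelated files
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True)
    client.load_table_from_uri(uri, 'my_dataset.seller_data', job_config=job_config).result()
With this in place the function fires whenever the seller file lands, no matter how late the upload happens.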
The hard way: App Engine with a few tricks.
Have a basic Flask application hosted on GAE (Standard or Flex), with an endpoint specifically to handle this check that the files exist, download the object, manipulate it and then do something with it.
This route can act as a custom HTTP-triggered function; the request could come from a simple curl call, a visit from the browser, a Pub/Sub event, or even another Cloud Function.
Once it receives a GET (or POST) request, it downloads the object into the /tmp dir, processes it and then does something with it.
The small benefit of GAE over CF is that you can set a minimum of one instance to always stay alive, which means you avoid cold starts and the risk of the request timing out before the job is done.
The brutal/overkill way: Cloud Run.
A similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run scales down to zero when there's no usage, and handle other minor things that apply to building any application on Cloud Run.
########################################
For all the above approaches, some additional things you might want to achieve are the same:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as that is where both GAE and CF store temporary files. Cloud Run is a bit different here, but let's not go deep into it, as it's overkill by itself.
However, keep in mind that if your file is large you might run into high memory usage.
And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open ..., which makes sure files are not left open.
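As a sketch, downloading to /tmp and cleaning up afterwards (bucket and object names are placeholders) could look like:
import os
from google.cloud import storage

def process_object(bucket_name, object_name):
    local_path = os.path.join('/tmp', os.path.basename(object_name))
    storage.Client().bucket(bucket_name).blob(object_name).download_to_filename(local_path)
    try:
        with open(local_path) as f:
            pass  # do the actual processing here
    finally:
        os.remove(local_path)  # always clean /tmp; it counts against memory in CF/GAE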
b) Downloading the latest object in the bucket:
This is a bit tricky and needs some extra custom code. There are many ways to achieve it, but the one I use (while always paying close attention to memory usage) is: when creating the object I upload to the bucket, I take the current time and use a regex to turn it into a name like results_22_6.
What happens then is that once I list the objects from my other script, they are already listed in ascending order, so the last element in the list is the latest object.
So basically what I do is check whether the filename I have in /tmp is the same as the name of the last object in the bucket listing. If yes, do nothing; if not, delete the old one and download the latest one from the bucket.
This might not be optimal, but for me it's kinda preferable.
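For what it's worth, a simpler variant of the same idea is to let the client library report the newest object via its creation time instead of parsing names; a sketch (bucket name is a placeholder) could be:
from google.cloud import storage

client = storage.Client()
blobs = list(client.list_blobs('sale_bucket'))
latest = max(blobs, key=lambda b: b.time_created)  # newest object in the bucket
print(latest.name, latest.time_created)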