Reading a netCDF file from a Google Cloud bucket directly into Python

I have a bucket on Google Cloud Storage that contains multiple netCDF files. Normally, when the files are stored locally, I would do:
import netCDF4
nc = netCDF4.Dataset('path/to/netcdf.nc')
Is it possible to do this in Python straight from Google Cloud Storage, without having to first download the file from the bucket?

This function works for loading NetCDF files from a Google Cloud storage bucket:
import xarray as xr
import fsspec
def load_dataset(filename, engine="h5netcdf", *args, **kwargs) -> xr.Dataset:
    """Load a NetCDF dataset from local file system or cloud bucket."""
    with fsspec.open(filename, mode="rb") as file:
        dataset = xr.load_dataset(file, engine=engine, *args, **kwargs)
    return dataset
dataset = load_dataset("gs://bucket-name/path/to/file.nc")
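Note that resolving a gs:// URL this way relies on the gcsfs package being installed alongside fsspec, and the h5netcdf engine requires the h5netcdf package as well.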

I'm not sure how to work with Google object store, but here's how you can open a netCDF file from an in-memory buffer containing all the bytes from the file:
from netCDF4 import Dataset
fobj = open('path/to/netcdf.nc', 'rb')
data = fobj.read()
nc = Dataset('memory', memory=data)
So the path forward would be to read all the data from the object store and then use that call to open it. That has drawbacks for large netCDF files, because you're putting all those bytes into your system memory.
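For the Google Cloud Storage part, a minimal sketch of that path using the google-cloud-storage client (the bucket and object names are placeholders, and authentication is assumed to already be set up in the environment):
from google.cloud import storage
from netCDF4 import Dataset

client = storage.Client()
bucket = client.bucket("bucket-name")        # placeholder bucket name
blob = bucket.blob("path/to/netcdf.nc")      # placeholder object path
data = blob.download_as_bytes()              # pulls the whole object into memory
nc = Dataset("in-memory.nc", memory=data)    # netCDF4 opens from the byte buffer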

Related

How to compress a JSON Lines file and upload it to an Azure container?

I am working in Databricks and have a PySpark dataframe that I am converting to pandas, and then to a JSON Lines file, which I want to upload to an Azure container (ADLS Gen2). The file is large, and I want to compress it prior to uploading.
I first convert the PySpark dataframe to pandas:
pandas_df = df.select("*").toPandas()
Then I convert it to newline-delimited JSON:
json_lines_data = pandas_df.to_json(orient='records', lines=True)
Then I write it to blob storage with the following function:
def upload_blob(json_lines_data, connection_string, container_name, blob_name):
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    try:
        blob_client.get_blob_properties()
        blob_client.delete_blob()
    # except if no delete necessary
    except:
        pass
    blob_client.upload_blob(json_lines_data)
This is working fine, but the data is around 3 GB per file and takes a long time to download, so I would rather compress the files. Can anyone here help with how to compress the JSON Lines file and upload it to the Azure container? I have tried a lot of different things, and nothing is working.
If there is a better way to do this in Databricks, I can change it. I did not write the file out with Databricks itself because I need to output a single file and control the filename.
There is a way to compress the JSON file before uploading it to blob storage.
Here is code that converts the data to JSON, encodes it as bytes (UTF-8), and finally compresses it with gzip.
I would suggest adding this step before your upload function.
import json
import gzip

def compress_data(data):
    # Convert to JSON
    json_data = json.dumps(data, indent=2)
    # Convert to bytes
    encoded = json_data.encode('utf-8')
    # Compress
    compressed = gzip.compress(encoded)
    return compressed
Reference: https://gist.github.com/LouisAmon/4bd79b8ab80d3851601f3f9016300ac4#file-json_to_gzip-py
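Tying that into the upload function from the question, a minimal sketch (the function name, connection details, and blob name are illustrative placeholders; the JSON Lines string already produced by to_json() is compressed directly rather than re-serialized with json.dumps):
import gzip
from azure.storage.blob import BlobServiceClient

def upload_compressed_blob(json_lines_data, connection_string, container_name, blob_name):
    # gzip-compress the UTF-8 encoded JSON Lines string
    compressed = gzip.compress(json_lines_data.encode('utf-8'))
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    # e.g. blob_name="output.jsonl.gz" so the compression is obvious downstream
    blob_client.upload_blob(compressed, overwrite=True)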

How to read large gzipped CSV files from Azure Storage in memory in AWS Lambda?

I'm trying to transform large gzipped CSV files (over 3 GB) from an Azure storage blob by loading them into a pandas dataframe in an AWS Lambda function.
My local machine with 16 GB of RAM can process the files, but in AWS Lambda I get the memory error "errorType": "MemoryError", since Lambda's maximum memory is 3 GB.
Is there any way of reading and processing this much data in memory in AWS Lambda? Just to mention, I tried a streaming approach as well, but with no luck.
Below is my sample code:
from io import StringIO
from azure.storage.blob import *
import pandas as pd
import gzip

blob = BlobClient(account_url="https://" + SOURCE_ACCOUNT + ".blob.core.windows.net",
                  container_name=SOURCE_CONTAINER,
                  blob_name=blob_name,
                  credential=SOURCE_ACCT_KEY)
data = blob.download_blob()
data = data.readall()
unzipped_bytes = gzip.decompress(data)
unzipped_str = unzipped_bytes.decode("utf-8")
df = pd.read_csv(StringIO(unzipped_str), usecols=req_columns, low_memory=False)
Try using chunks and reading the data n rows at a time:
for chunk in pd.read_csv("large.csv", chunksize=10_000):
print(chunk)
I'm not sure about compressed files, though. In the worst case you can decompress the data first and then read it in chunks.
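A minimal sketch of that idea against the code from the question (azure-storage-blob v12 is assumed, and process() is a placeholder for whatever per-chunk work is needed): only the compressed bytes are held in memory, and the decompressed CSV is read one chunk at a time.
import gzip
import io
import pandas as pd
from azure.storage.blob import BlobClient

blob = BlobClient(account_url="https://" + SOURCE_ACCOUNT + ".blob.core.windows.net",
                  container_name=SOURCE_CONTAINER,
                  blob_name=blob_name,
                  credential=SOURCE_ACCT_KEY)

compressed = io.BytesIO()
blob.download_blob().readinto(compressed)   # stream the blob into an in-memory buffer
compressed.seek(0)

with gzip.GzipFile(fileobj=compressed) as gz:
    for chunk in pd.read_csv(gz, usecols=req_columns, chunksize=100_000):
        process(chunk)                      # placeholder for the per-chunk processing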

How to load pickled dataframes from GCS into App Engine

I'm trying to load a pickled pandas dataframe from Google Cloud Storage into App Engine.
I have been using blob.download_to_file() to read the bytestream into pandas, but I encounter the following error:
UnpicklingError: invalid load key, m
I have tried seeking to the beginning to no avail and am pretty sure something fundamental is missing from my understanding.
When attempting to pass an open file object and read from there, I get an
UnsupportedOperation: write
error
from io import BytesIO
from google.cloud import storage
from google.oauth2 import service_account
import pandas as pd

def get_byte_fileobj(project, bucket, path) -> BytesIO:
    blob = _get_blob(bucket, path, project)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def _get_blob(bucket_name, path, project):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob

fileobj = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')
pd.read_pickle(fileobj)
Ideally pandas would just read from pickle since all of my GCS backups are in that format, but I'm open to suggestions.
The pandas.read_pickle() method takes as argument a file path string, not a file handler/object:
pandas.read_pickle(path, compression='infer')
Load pickled pandas object (or any object) from file.
path : str
File path where the pickled object will be loaded.
If you're in the 2nd generation standard or the flexible environment you could try to use a real /tmp file instead of BytesIO.
Otherwise you'd have to figure out another method of loading the data into pandas, one that supports a file object/descriptor. The general approach is described in How to restore Tensorflow model from Google bucket without writing to filesystem? (the context is different, but the general idea is the same).
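A minimal sketch of both workarounds, reusing get_byte_fileobj() from the question (the project id, bucket, and path are the question's placeholders): pickle.load() accepts a file-like object directly, and a real /tmp file keeps pd.read_pickle() happy.
import pickle
import pandas as pd

byte_stream = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')

# Option 1: unpickle the in-memory stream directly
df = pickle.load(byte_stream)

# Option 2: spool to a /tmp file and let pandas read it as usual
tmp_path = '/tmp/Matches.pickle'
with open(tmp_path, 'wb') as f:
    f.write(byte_stream.getbuffer())
df = pd.read_pickle(tmp_path)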

Renaming spark output csv in azure blob storage

I have a Databricks notebook set up that works as follows:
pyspark connection details to Blob storage account
Read file through spark dataframe
convert to pandas Df
data modelling on pandas Df
convert to spark Df
write to blob storage in single file
My problem is that you cannot name the output file, and I need a static CSV filename.
Is there a way to rename this in PySpark?
## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""
## File location and File type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"
## Connection string to connect to blob storage
spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key)
This is followed by writing the file out after the data transformation:
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
The file is then written as "part-00000-tid-336943946930983.....csv",
whereas the goal is to have "Output.csv".
Another approach I looked at was just recreating this in Python, but I have not yet come across anything in the documentation on how to output the file back to blob storage.
I know the method to retrieve from blob storage is .get_blob_to_path, via microsoft.docs.
Any help here is greatly appreciated.
Hadoop/Spark writes the computed result of each partition to its own file in parallel, so you will see many part-<number>-... files in the output path you named, such as Output/.
If you want all results of a computation in one file, you can either merge them with the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or reduce the number of output partitions to 1, for example with the coalesce(1) function.
So in your scenario, make sure coalesce(1) is called on the dataframe before the write/save call, as below.
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
coalesce and repartition do not help with saving the dataframe as a single, normally named file.
I ended up just renaming the single CSV file and deleting the folder with the log files:
import os

def save_csv(df, location, filename):
    outputPath = os.path.join(location, filename + '_temp.csv')
    df.repartition(1).write.format("com.databricks.spark.csv").mode("overwrite") \
        .options(header="true", inferSchema="true").option("delimiter", "\t").save(outputPath)
    csv_files = os.listdir(os.path.join('/dbfs', outputPath))
    # move the part-* temp CSV file to the normally named target file
    for file in csv_files:
        if file[-4:] == '.csv':
            dbutils.fs.mv(os.path.join(outputPath, file), os.path.join(location, filename))
    dbutils.fs.rm(outputPath, True)
# using save_csv
save_csv_location = 'mnt/.....'
save_csv(df, save_csv_location, 'name.csv')

How to read parquet file into pandas from Azure blob store

I need to read and write parquet files from an Azure blob store within the context of a Jupyter notebook running Python 3 kernel.
I see code for working strictly with parquet files and Python, and other code for grabbing/writing to an Azure blob store, but nothing yet that puts it all together.
Here is some sample code I'm playing with:
from azure.storage.blob import BlockBlobService
block_blob_service = BlockBlobService(account_name='testdata', account_key='key-here')
block_blob_service.get_blob_to_text(container_name='mycontainer', blob_name='testdata.parquet')
This last line will throw an encoding-related error.
I've played with storefact but coming up short there.
Thanks for any help
To access the file, you first need to connect to the Azure blob storage account.
storage_account_name = "your storage account name"
storage_account_access_key = "your storage account access key"
Read the path of the parquet file into a variable:
commonLOB_mst_source = "Parquet file path"
file_type = "parquet"
Connect to blob storage:
spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key)
Read the Parquet file into a dataframe:
df = spark.read.format(file_type).option("inferSchema", "true").load(commonLOB_mst_source)
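If the goal is a pandas dataframe rather than a Spark one, a minimal sketch along the lines of the question's own BlockBlobService code (pyarrow is assumed as the parquet engine; get_blob_to_text fails because parquet is a binary format, so the bytes variant is used instead):
import io
import pandas as pd
from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='testdata', account_key='key-here')
blob = block_blob_service.get_blob_to_bytes(container_name='mycontainer',
                                            blob_name='testdata.parquet')
df = pd.read_parquet(io.BytesIO(blob.content), engine='pyarrow')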
