Unable to read CSV file uploaded to Google Cloud Storage bucket - python

Goal - To read a CSV file uploaded to a Google Cloud Storage bucket.
Environment - Jupyter notebook running over SSH on the master node. Using Python in the notebook, I am trying to access a simple CSV file uploaded to a Google Cloud Storage bucket.
Approaches -
1st approach - Write a simple Python program. I wrote the following program:
import csv
f = open('gs://python_test_hm/train.csv', 'rb')
csv_f = csv.reader(f)
for row in csv_f:
    print row
Results - Error message "No such file or directory"
2nd Approach - Tried to access the train.csv file using the gcloud package. The sample code is shown below; it is not my actual code. In my version of the code, the file on Google Cloud Storage was referred to as "gs:///Filename.csv".
Results - Error message "No such file or directory"
Load data from CSV
import csv
from gcloud import bigquery
from gcloud.bigquery import SchemaField

client = bigquery.Client()
dataset = client.dataset('dataset_name')
dataset.create()  # API request
SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]
table = dataset.table('table_name', SCHEMA)
table.create()
with open('csv_file', 'rb') as readable:
    table.upload_from_file(
        readable, source_format='CSV', skip_leading_rows=1)
3rd Approach -
import csv
import urllib

url = 'https://storage.cloud.google.com/<bucket>/train.csv'
response = urllib.urlopen(url)
cr = csv.reader(response)
print cr
for row in cr:
    print row
Results - The above code doesn't raise any error, but it displays the HTML content of the Google sign-in page as shown below. I am interested in viewing the data of the train.csv file.
['<!DOCTYPE html>']
['<html lang="en">']
[' <head>']
[' <meta charset="utf-8">']
[' <meta content="width=300', ' initial-scale=1" name="viewport">']
[' <meta name="google-site-verification" content="LrdTUW9psUAMbh4Ia074- BPEVmcpBxF6Gwf0MSgQXZs">']
[' <title>Sign in - Google Accounts</title>']
Can someone throw some light on what could possibly be going wrong here and how I can achieve my goal? Your help is highly appreciated.
Thanks so much for your help!

I assume you are using a Jupyter notebook running on a machine in Google Cloud Platform (GCP)?
If that's the case, you will already have the Google Cloud SDK running on that machine (by default).
With this setup you have 2 easy options to work with Google Cloud Storage (GCS):
Use the gcloud/gsutil commands in Jupyter
Writing to GCS: gsutil cp train.csv gs://python_test_hm/train.csv
Reading from GCS:
gsutil cp gs://python_test_hm/train.csv train.csv
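In a Jupyter cell these can be run as shell commands with the ! prefix, for example (using the bucket and file names from the question):
# copy the object from GCS to the notebook's local filesystem
!gsutil cp gs://python_test_hm/train.csv train.csv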
Use google-cloud python library
Writing to GCS:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('python_test_hm')
blob = bucket.blob('train.csv')
blob.upload_from_string('this is test content!')
Reading from GCS:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('python_test_hm')
blob = storage.Blob('train.csv', bucket)
content = blob.download_as_string()
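If the goal is to iterate over the CSV rows directly (as in the original 1st approach), the downloaded bytes can be fed to the csv module in memory. A minimal sketch, assuming Python 3 and the same bucket and file names:
import csv
import io
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('python_test_hm')
blob = bucket.blob('train.csv')
content = blob.download_as_string()  # returns bytes
reader = csv.reader(io.StringIO(content.decode('utf-8')))
for row in reader:
    print(row)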

The sign-in page your app fetches isn't actually the object - it's an auth redirect page that, if interacted with, would go on to serve the object. You should check out the Cloud Storage documentation to see how auth works, and look up the auth details for whichever library or means you use to access the bucket/object.

Related

Writing a simple text file with no key-value pair to Cloud Storage

My requirement is to export data from BQ to GCS in a particular sorted order, which I am not able to get using the automatic export, so I am trying to write a manual export for this.
File format is like below:
HDR001||5378473972abc||20101|182082||
DTL001||436282798101|
DTL002||QS
DTL005||3733|8
DTL002||QA
DTL005||3733|8
DTL002||QP
DTL005||3733|8
DTL001||436282798111|
DTL002||QS
DTL005||3133|2
DTL002||QA
DTL005||3133|8
DTL002||QP
DTL005||3133|0
I am very new to this and am able to write the file to local disk, but I am not sure how I can write this file to GCS. I tried to use write_to_file but I seem to be missing something.
import pandas as pd
import pickle as pkl
import tempfile
from google.colab import auth
from google.cloud import bigquery, storage
#import cloudstorage as gcs

auth.authenticate_user()

df = pd.DataFrame(data=job)
sc = storage.Client(project='temp-project')
with tempfile.NamedTemporaryFile(mode='w+b', buffering=-1, prefix='test', suffix='temp') as fh:
    with open(fh.name, 'w+', newline='') as f:
        dfAsString = df.to_string(header=" ", index=False)
        fh.name = fh.write(dfAsString)
        fh.close()

bucket = sc.get_bucket('my-bucket')
target_fn = 'test.csv'
source_fn = fh.name
destination_blob_name = bucket.blob('test.csv')
bucket.blob(destination_blob_name).upload_from_file(source_fn)
Can someone please help?
Thank You.
To upload an object to a Cloud Storage bucket from a local file path, you need to use upload_from_filename instead of upload_from_file. Your code should look like this:
bucket.blob(destination_blob_name).upload_from_filename(source_fn)
Here are the links to the documentation on how to upload an object to a Cloud Storage bucket and to the client library docs.
EDIT:
The reason you're getting that error is that somewhere in your code you're passing a Blob object rather than a string. Currently your destination variable is a Blob object; change it to a string instead:
destination_blob_name = bucket.blob('test.csv')
to
destination_blob_name = 'test.csv'
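Putting both fixes together, a minimal sketch of the corrected upload (assuming df and the bucket name from the question; the temp-file handling is simplified for illustration):
import tempfile
import pandas as pd
from google.cloud import storage

sc = storage.Client(project='temp-project')
bucket = sc.get_bucket('my-bucket')

# write the dataframe to a local temp file first; keep it after the with block
with tempfile.NamedTemporaryFile(mode='w', prefix='test', suffix='.csv', delete=False) as fh:
    fh.write(df.to_string(header=False, index=False))
    source_fn = fh.name

destination_blob_name = 'test.csv'  # a plain string, not a Blob object
bucket.blob(destination_blob_name).upload_from_filename(source_fn)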

Cannot write xlsx to GCS from pandas

I have a strange issue.
I trigger a K8S job from Airflow as a data pipeline. At the end I need to write the dataframe to Google Cloud Storage as .parquet and .xlsx files.
[...]
export_app.to_parquet(f"{output_path}.parquet")
export_app.to_excel(f"{output_path}.xlsx")
Everything is ok for the parquet file but I got an error for the xlsx.
severity: "INFO"
textPayload: "[Errno 2] No such file or directory: 'gs://my_bucket/incidents/prediction/2020-04-29_incidents_result.xlsx'
I tried writing the file as a CSV as well:
export_app.to_parquet(f"{output_path}.parquet")
export_app.to_csv(f"{output_path}.csv")
export_app.to_excel(f"{output_path}.xlsx")
I get the same message every time, and I find the other files as expected.
Is there any limitation on writing an xlsx file?
I have the package openpyxl installed in my env.
As requested, I am posting some code showing how I created a new xlsx file directly using the GCS Python 3 API. I used this tutorial and this API reference:
# Imports the Google Cloud client library
from google.cloud import storage
# Instantiates a client
storage_client = storage.Client()
# Create the bucket object
bucket = storage_client.get_bucket("my-new-bucket")
#Confirm bucket connected
print("Bucket {} connected.".format(bucket.name))
#Create file in the bucket
blob = bucket.blob('test.xlsx')
with open("/home/vitooh/test.xlsx", "rb") as my_file:
blob.upload_from_file(my_file)
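If the intent is to avoid a local file entirely, one possible alternative is to write the workbook into an in-memory buffer and upload that. A sketch, assuming pandas with openpyxl and the google-cloud-storage client; the bucket and object names are illustrative, and export_app stands in for the dataframe from the question:
import io
import pandas as pd
from google.cloud import storage

export_app = pd.DataFrame({"a": [1, 2]})  # placeholder for the dataframe from the question

buffer = io.BytesIO()
export_app.to_excel(buffer, index=False)  # openpyxl writes the workbook into the buffer
buffer.seek(0)

client = storage.Client()
bucket = client.get_bucket("my_bucket")   # illustrative bucket name
blob = bucket.blob("incidents/prediction/result.xlsx")
blob.upload_from_file(buffer)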
I hope it will help!

Loading CSV into Neo4j on Azure

I have Neo4j operational on Azure. I can load data using python and a series of create statements:
create (n:Person) return n
I can query successfully using python.
Using LOAD CSV requires a file in the Neo4j import directory. I've located that directory, but moving a file into it is blocked. I've also tried putting the file in an accessible directory, but then cannot figure out how to address the path in the LOAD CSV statement.
This LOAD gives an error because the file cannot get into the Neo4j import directory:
USING PERIODIC COMMIT 10000 LOAD CSV WITH HEADERS FROM 'file:///FTDNATree.csv' AS line FIELDTERMINATOR '|' merge (s:SNPNode{SNP:toString(line.Parent)})
This statement does not find the file and gives an error: EXTERNAL file not found
USING PERIODIC COMMIT 10000 LOAD CSV WITH HEADERS FROM 'file:///{my directory path/}FTDNATree.csv' AS line FIELDTERMINATOR '|' merge (s:SNPNode{SNP:toString(line.Parent)})
Even though Python and Neo4j are in the same resource group, they are on different VMs. The problem seems to be the interoperability between the two VMs?
If you have access to neo4j.conf, then you can modify the value of dbms.directories.import to point to an accessible directory
See https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_dbms.directories.import
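For example, the setting in neo4j.conf could look like this (the directory path is illustrative):
dbms.directories.import=/home/azureuser/neo4j-import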
The solution was NOT well documented in one place, but here is what evolved by trial and error and which works:
1. I created a storage account within the resource group.
2. Created a directory accessible from code in which the upload file was placed.
3. Added a container, called it neo4j-import.
4. I could then transfer the file to the container as a blob (i.e., a *.csv file).
5. You then need to make the file accessible. This involves creating an SAS token and attaching it to a URL pointing to the container and the file (see the Python code to do this below).
6. You can test this URL in your local browser. It should retrieve the file, which is not accessible without the SAS token.
7. This URL is used in the LOAD CSV statement and successfully loads the Neo4j database.
The code for step 4; pardon indent issues upon pasting here.
from azure.storage.blob import (
    BlobServiceClient, BlobClient, ContainerClient,
    generate_account_sas, ResourceTypes, AccountSasPermissions
)

def UploadFileToDataStorage(FileName,
                            UploadFileSourceDirecory=ImportDirectory,
                            BlobConnStr=AzureBlobConnectionString,
                            Container="neo4j-import"):
    # uploads file as blob to data storage
    # https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python#upload-blobs-to-a-container
    blob_service_client = BlobServiceClient.from_connection_string(BlobConnStr)
    blob_client = blob_service_client.get_blob_client(container=Container, blob=FileName)
    with open(UploadFileSourceDirecory + FileName, "rb") as data:
        blob_client.upload_blob(data)
The key Python code (step 5 above):
from datetime import datetime, timedelta  # needed for the expiry timestamp

def GetBlobURLwithSAS(FileName, Container="neo4j-import"):
    # https://pypi.org/project/azure-storage-blob/
    # https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobserviceclient?view=azure-python
    # generates an SAS token for the blob so it can be consumed by another process
    sas_token = generate_account_sas(
        account_name="{storage account name}",
        account_key="{storage acct key}",
        resource_types=ResourceTypes(service=False, container=False, object=True),
        permission=AccountSasPermissions(read=True),
        expiry=datetime.utcnow() + timedelta(hours=1))
    return ("https://{storage account name}.blob.core.windows.net/"
            + Container + "/" + FileName + "?" + sas_token)
The LOAD statement looks like this and does not use the file:/// prefix:
LOAD CSV WITH HEADERS FROM '" + {URL from above} + "' AS line FIELDTERMINATOR '|'{your cypher query for loading csv}
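For the FTDNATree.csv file from the question, the filled-in statement might look roughly like this (the storage account name and SAS token are placeholders):
USING PERIODIC COMMIT 10000 LOAD CSV WITH HEADERS FROM 'https://{storage account name}.blob.core.windows.net/neo4j-import/FTDNATree.csv?{sas token}' AS line FIELDTERMINATOR '|' merge (s:SNPNode{SNP:toString(line.Parent)})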
I hope this helps other to navigate this scenario!

Load file from Azure Files to Azure Databricks

Looking for a way, using the Azure Files SDK, to upload files to my Azure Databricks blob storage.
I tried many things using functions from this page, but nothing worked. I don't understand why.
example:
file_service = FileService(account_name='MYSECRETNAME', account_key='mySECRETkey')
generator = file_service.list_directories_and_files('MYSECRETNAME/test') #listing file in folder /test, working well
for file_or_dir in generator:
    print(file_or_dir.name)
file_service.get_file_to_path('MYSECRETNAME','test/tables/input/referentials/','test.xlsx','/dbfs/FileStore/test6.xlsx')
where test.xlsx = the name of the file in my Azure file share
/dbfs/FileStore/test6.xlsx => the path where to upload the file in my DBFS system
I have the error message:
Exception=The specified resource name contains invalid characters
I tried changing the name but it doesn't seem to work.
Edit: I'm not even sure the function is doing what I want. What is the best way to load a file from Azure Files?
Per my experience, I think the best way to load a file from Azure Files is to read it directly via its URL with an SAS token.
For example, as the figures below show, there is a file named test.xlsx in my test file share; I viewed it using Azure Storage Explorer and then generated its URL with an SAS token.
Fig 1. Right-click the file and then click Get Shared Access Signature...
Fig 2. Make sure to select the Read permission option so the file content can be read directly.
Fig 3. Copy the URL with the SAS token.
Here is my sample code; you can run it with the SAS token URL of your file in your Azure Databricks.
import pandas as pd
url_sas_token = 'https://<my account name>.file.core.windows.net/test/test.xlsx?st=2020-01-27T10%3A16%3A12Z&se=2020-01-28T10%3A16%3A12Z&sp=rl&sv=2018-03-28&sr=f&sig=XXXXXXXXXXXXXXXXX'
# Directly read the file content from its url with sas token to get a pandas dataframe
pdf = pd.read_excel(url_sas_token)
# Then, to convert the pandas dataframe to a PySpark dataframe in Azure Databricks
df = spark.createDataFrame(pdf)
Alternatively, to use the Azure File Storage SDK to generate the URL with the SAS token for your file, or to get the bytes of your file for reading, please refer to the official document Develop for Azure Files with Python and my sample code below.
# Create a client of Azure File Service as same as yours
from azure.storage.file import FileService
account_name = '<your account name>'
account_key = '<your account key>'
share_name = 'test'
directory_name = None
file_name = 'test.xlsx'
file_service = FileService(account_name=account_name, account_key=account_key)
To generate the SAS token URL of a file:
from azure.storage.file import FilePermissions
from datetime import datetime, timedelta
sas_token = file_service.generate_file_shared_access_signature(share_name, directory_name, file_name, permission=FilePermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))
url_sas_token = f"https://{account_name}.file.core.windows.net/{share_name}/{file_name}?{sas_token}"
import pandas as pd
pdf = pd.read_excel(url_sas_token)
df = spark.createDataFrame(pdf)
Or use the get_file_to_stream function to read the file content:
from io import BytesIO
import pandas as pd
stream = BytesIO()
file_service.get_file_to_stream(share_name, directory_name, file_name, stream)
pdf = pd.read_excel(stream)
df = spark.createDataFrame(pdf)
Just as an addition to @Peter Pan's answer, here is an alternative approach without using pandas, using the Python azure-storage-file-share library.
Very detailed documentation: https://pypi.org/project/azure-storage-file-share/#downloading-a-file
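A minimal sketch of that approach, assuming the azure-storage-file-share package and a connection string; the share name and file path mirror the question and are otherwise illustrative:
from azure.storage.fileshare import ShareFileClient

# connection string, share name and file path are assumptions based on the question
file_client = ShareFileClient.from_connection_string(
    conn_str="<your connection string>",
    share_name="MYSECRETNAME",
    file_path="test/tables/input/referentials/test.xlsx")

# download the file from Azure Files and write it into DBFS
with open("/dbfs/FileStore/test6.xlsx", "wb") as target:
    downloader = file_client.download_file()
    downloader.readinto(target)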

How can one load an AppEngine cloud storage backup to a local development server?

I'm experimenting with the Google cloud storage backup feature for an application.
After downloading the backup files using gsutil, how can they be loaded into a local development server?
Is there a parser available for these formats (e.g., protocol buffers)?
Greg Bayer wrote some Python code showing how to do this in a blog post:
# Make sure App Engine SDK is available
import sys
sys.path.append('/usr/local/google_appengine')

from google.appengine.api.files import records
from google.appengine.datastore import entity_pb
from google.appengine.api import datastore

raw = open('path_to_datastore_export_file', 'r')
reader = records.RecordsReader(raw)
for record in reader:
    entity_proto = entity_pb.EntityProto(contents=record)
    entity = datastore.Entity.FromPb(entity_proto)
    # Entity is available as a dictionary!
Backups are stored in LevelDB record format; you should be able to read them using:
google.api.files.records.RecordsReader
ext.mapreduce.input_readers.RecordReader
For those using Windows, change the open line to:
raw = open('path_to_datastore_export_file', 'rb')
The file must be opened in binary mode!
