I have a strange issue.
I trigger a K8s job from Airflow as a data pipeline. At the end I need to write the dataframe to Google Cloud Storage as .parquet and .xlsx files.
[...]
export_app.to_parquet(f"{output_path}.parquet")
export_app.to_excel(f"{output_path}.xlsx")
Everything is fine for the parquet file, but I get an error for the xlsx:
severity: "INFO"
textPayload: "[Errno 2] No such file or directory: 'gs://my_bucket/incidents/prediction/2020-04-29_incidents_result.xlsx'
To narrow it down, I also tried writing the file as a CSV:
export_app.to_parquet(f"{output_path}.parquet")
export_app.to_csv(f"{output_path}.csv")
export_app.to_excel(f"{output_path}.xlsx")
I get the same error message every time, and the other files are written as expected.
Is there any limitation on writing an xlsx file?
I have the package openpyxl installed in my env.
As requested, here is the code I used to create a new xlsx file directly with the GCS Python client library. I followed this tutorial and this API reference:
# Imports the Google Cloud client library
from google.cloud import storage
# Instantiates a client
storage_client = storage.Client()
# Create the bucket object
bucket = storage_client.get_bucket("my-new-bucket")
#Confirm bucket connected
print("Bucket {} connected.".format(bucket.name))
#Create file in the bucket
blob = bucket.blob('test.xlsx')
with open("/home/vitooh/test.xlsx", "rb") as my_file:
blob.upload_from_file(my_file)
I hope it will help!
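A likely explanation for the original error: to_parquet and to_csv hand gs:// URLs to gcsfs/fsspec, while to_excel in the pandas version in use appears to write only to local paths, so the gs:// string is treated as a non-existent local file. If you prefer not to stage a local file at all, here is a minimal sketch (the bucket and object path are taken from the question, and the dataframe is a placeholder): write the workbook to an in-memory buffer and upload the buffer with the client library.

import io

import pandas as pd
from google.cloud import storage

# Minimal sketch: "export_app" stands in for the real dataframe from the pipeline,
# and the bucket/object names are taken from the question.
export_app = pd.DataFrame({"col": [1, 2, 3]})

# Write the workbook to an in-memory buffer (recent pandas accepts a file-like
# object here; openpyxl is used as the xlsx engine).
buffer = io.BytesIO()
export_app.to_excel(buffer, index=False, engine="openpyxl")
buffer.seek(0)

# Upload the buffer as a blob with the GCS client library.
storage_client = storage.Client()
bucket = storage_client.get_bucket("my_bucket")
blob = bucket.blob("incidents/prediction/2020-04-29_incidents_result.xlsx")
blob.upload_from_file(
    buffer,
    content_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
)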
Related
I am trying to upload multiple files named 'data' to Azure Blob Storage. I upload them to different folders in my function, creating a new folder for each new 'data' file, but I still get the error BlobAlreadyExists: The specified blob already exists. Any ideas?
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = blob_service_client.get_blob_client(container=container_name+'\\'+id+'\\'+uploadnr, blob=filename)
with open(pathF+filename,"rb") as data:
    blob_client.upload_blob(data)
print(f"Uploaded {filename}.")
I tried this in my environment and initially reproduced the same error:
BlobAlreadyExists: The specified blob already exists.
Code:
from azure.storage.blob import BlobServiceClient

connection_string = "<Connect_string>"
container = "test\data"
path = "<local_path>"
filename = "sample1.pdf"

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = blob_service_client.get_blob_client(container, filename)
with open(path + filename, "rb") as data:
    blob_client.upload_blob(data)
print(f"Uploaded {filename}.")
Console: the run reproduced the error. After I made changes to the code, it executed successfully.
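The changed code isn't shown above, so here is a hedged sketch of one possible fix for the original question: keep the container name free of backslashes, move the folder structure into the blob name, and pass overwrite=True if re-uploading the same blob name should not raise BlobAlreadyExists. The variable values below are hypothetical stand-ins for the question's variables.

from azure.storage.blob import BlobServiceClient

# Hypothetical stand-ins for the question's variables.
connection_string = "<connection_string>"
container_name = "test"   # container names cannot contain backslashes
id = "folder1"
uploadnr = "1"
filename = "data"
pathF = "/tmp/"

blob_service_client = BlobServiceClient.from_connection_string(connection_string)

# Put the per-upload folder path into the blob name, not the container name.
blob_client = blob_service_client.get_blob_client(
    container=container_name,
    blob=f"{id}/{uploadnr}/{filename}",
)

with open(pathF + filename, "rb") as data:
    # overwrite=True replaces an existing blob instead of raising BlobAlreadyExists.
    blob_client.upload_blob(data, overwrite=True)
print(f"Uploaded {filename}.")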
I have the following Python function to write the given content to a bucket in Cloud Storage:
import gzip
from google.cloud import storage

def upload_to_cloud_storage(json):
    """Write to Cloud Storage."""
    # The contents to upload as a JSON string.
    contents = json

    storage_client = storage.Client()

    # Path and name of the file to upload (file doesn't yet exist).
    destination = "path/to/name.json.gz"

    # Gzip the contents before uploading
    with gzip.open(destination, "wb") as f:
        f.write(contents.encode("utf-8"))

    # Bucket
    my_bucket = storage_client.bucket('my_bucket')

    # Blob (content)
    blob = my_bucket.blob(destination)
    blob.content_encoding = 'gzip'

    # Write to storage
    blob.upload_from_string(contents, content_type='application/json')
However, I receive an error when running the function:
FileNotFoundError: [Errno 2] No such file or directory: 'path/to/name.json.gz'
Highlighting this line as the cause:
with gzip.open(destination, "wb") as f:
I can confirm that the bucket and path both exist although the file itself is new and to be written.
I can also confirm that removing the Gzipping part sees the file successfully written to Cloud Storage.
How can I gzip a new file and upload to Cloud Storage?
Other answers I've used for reference:
https://stackoverflow.com/a/54769937
https://stackoverflow.com/a/67995040
Although @David's answer wasn't complete at the time I solved my problem, it got me on the right track. Here's what I ended up using, along with explanations I picked up along the way.
import gzip
from google.cloud import storage
from google.cloud.storage import fileio

def upload_to_cloud_storage(json_string):
    """Gzip and write to Cloud Storage."""
    storage_client = storage.Client()
    bucket = storage_client.bucket('my_bucket')

    # Filename (include path)
    blob = bucket.blob('path/to/file.json')

    # Set blob metadata for decompressive transcoding
    blob.content_encoding = 'gzip'
    blob.content_type = 'application/json'

    writer = fileio.BlobWriter(blob)

    # Must write as bytes
    gz = gzip.GzipFile(fileobj=writer, mode="wb")

    # When writing as bytes we must encode our JSON string.
    gz.write(json_string.encode('utf-8'))

    # Close connections
    gz.close()
    writer.close()
We use the GzipFile() class instead of the convenience method (gzip.compress()) so that we can pass in the mode. When trying to write using w or wt, you will receive the error:
TypeError: memoryview: a bytes-like object is required, not 'str'
So we must write in binary mode (wb). This also enables the .write() method. When doing so, however, we need to encode our JSON string, which can be done with str.encode() using utf-8. Failing to do this will also result in the same error.
Finally, I wanted to enable decompressive transcoding, where the requester (a browser in my case) receives the uncompressed version of the file when requested. To enable this, google.cloud.storage.blob allows you to set some metadata, including content_type and content_encoding, so we can follow best practices.
This sees the JSON object in memory written to your chosen destination in Cloud Storage in a compressed format and decompressed on the fly (without needing to download a gzip archive).
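For completeness, a minimal usage sketch of the function above (the payload is hypothetical):

import json

# upload_to_cloud_storage() is the function defined above.
payload = json.dumps({"status": "ok", "count": 3})
upload_to_cloud_storage(payload)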
Thanks also to @JohnHanley for the troubleshooting advice.
The best solution is not to write the gzip to a local file at all, but to compress and stream directly to GCS.
import gzip
from google.cloud import storage
from google.cloud.storage import fileio
storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('my_object')
writer = fileio.BlobWriter(blob)
gz = gzip.GzipFile(fileobj=writer, mode="w") # use "wb" if bytes
gz.write(contents)
gz.close()
writer.close()
My requirement is to export data from BigQuery to GCS in a particular sorted order, which I am not able to get with the automatic export, so I am trying to write a manual export.
The file format looks like this:
HDR001||5378473972abc||20101|182082||
DTL001||436282798101|
DTL002||QS
DTL005||3733|8
DTL002||QA
DTL005||3733|8
DTL002||QP
DTL005||3733|8
DTL001||436282798111|
DTL002||QS
DTL005||3133|2
DTL002||QA
DTL005||3133|8
DTL002||QP
DTL005||3133|0
I am very new to this. I am able to write the file to local disk, but I am not sure how to write it to GCS. I tried to use write_to_file, but I seem to be missing something.
import pandas as pd
import pickle as pkl
import tempfile
from google.colab import auth
from google.cloud import bigquery, storage
#import cloudstorage as gcs
auth.authenticate_user()
df = pd.DataFrame(data=job)
sc = storage.Client(project='temp-project')
with tempfile.NamedTemporaryFile(mode='w+b', buffering=-1, prefix='test', suffix='temp') as fh:
    with open(fh.name, 'w+', newline='') as f:
        dfAsString = df.to_string(header=" ", index=False)
        fh.name = fh.write(dfAsString)
        fh.close()

bucket = sc.get_bucket('my-bucket')
target_fn = 'test.csv'
source_fn = fh.name
destination_blob_name = bucket.blob('test.csv')
bucket.blob(destination_blob_name).upload_from_file(source_fn)
Can someone please help?
Thank You.
I would suggest uploading the object to the Cloud Storage bucket with upload_from_filename instead of upload_from_file. Your code should look like this:
bucket.blob(destination_blob_name).upload_from_filename(source_fn)
Here are the links to the documentation on how to upload an object to a Cloud Storage bucket, and to the client library docs.
EDIT:
The reason you're getting that error is that somewhere in your code you're passing a Blob object rather than a string. Currently your destination variable is a Blob object, so change
destination_blob_name = bucket.blob('test.csv')
to
destination_blob_name = 'test.csv'
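Putting both fixes together, here is a hedged sketch of the corrected upload section, keeping the question's temp-file approach (the project, bucket name, and dataframe come from the question's code):

import tempfile

from google.cloud import storage

sc = storage.Client(project='temp-project')

# Write the dataframe (df from the question) to a temporary file.
# delete=False keeps the file on disk so it can be uploaded by name afterwards.
with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as fh:
    fh.write(df.to_string(header=" ", index=False))
    source_fn = fh.name

bucket = sc.get_bucket('my-bucket')
destination_blob_name = 'test.csv'   # a plain string, not a Blob object
bucket.blob(destination_blob_name).upload_from_filename(source_fn)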
I have Neo4j operational on Azure. I can load data using python and a series of create statements:
create (n:Person) return n
I can query successfully using python.
Using LOAD CSV requires a file in the Neo4j import directory. I've located that directory, but moving a file into it is blocked. I've also tried putting the file in an accessible directory, but then I cannot figure out how to address the path in the LOAD CSV statement.
This LOAD gives an error because the file cannot get into the Neo4j import directory:
USING PERIODIC COMMIT 10000 LOAD CSV WITH HEADERS FROM 'file:///FTDNATree.csv' AS line FIELDTERMINATOR '|' merge (s:SNPNode{SNP:toString(line.Parent)})
This statement does not find the file and gives an error: EXTERNAL file not found
USING PERIODIC COMMIT 10000 LOAD CSV WITH HEADERS FROM 'file:///{my directory path/}FTDNATree.csv' AS line FIELDTERMINATOR '|' merge (s:SNPNode{SNP:toString(line.Parent)})
Even though Python and Neo4j are in the same resource group, they are on different VMs. Is the problem the interoperability between the two VMs?
If you have access to neo4j.conf, then you can modify the value of dbms.directories.import to point to an accessible directory
See https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_dbms.directories.import
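For example, the relevant line in neo4j.conf might look like this (the directory path is a hypothetical example; restart Neo4j after changing the setting):

dbms.directories.import=/home/azureuser/import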
The solution was NOT well documented in one place, but here is what evolved by trial and error and works:
1. I created a storage account within the resource group.
2. Created a directory, accessible from code, in which the upload file was placed.
3. Added a container, called neo4j-import.
4. I could then transfer the file to the container as a blob (i.e., the *.csv file).
5. You then need to make the file accessible. This involves creating a SAS token and attaching it to a URL pointing to the container and the file (see the Python code below).
6. You can test this URL in your local browser. It should retrieve the file, which is not accessible without the SAS token.
7. This URL is used in the LOAD CSV statement and successfully loads the Neo4j database.
The code for step 4:
from azure.storage.blob import (BlobServiceClient, BlobClient, ContainerClient,
                                generate_account_sas, ResourceTypes, AccountSasPermissions)

def UploadFileToDataStorage(FileName,
                            UploadFileSourceDirecory=ImportDirectory,
                            BlobConnStr=AzureBlobConnectionString,
                            Container="neo4j-import"):
    # Uploads the file as a blob to data storage
    # https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python#upload-blobs-to-a-container
    blob_service_client = BlobServiceClient.from_connection_string(BlobConnStr)
    blob_client = blob_service_client.get_blob_client(container=Container, blob=FileName)
    with open(UploadFileSourceDirecory + FileName, "rb") as data:
        blob_client.upload_blob(data)
The key Python code (step 5 above):
from datetime import datetime, timedelta

def GetBlobURLwithSAS(FileName, Container="neo4j-import"):
    # https://pypi.org/project/azure-storage-blob/
    # https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobserviceclient?view=azure-python
    # Generates a SAS token for the blob so it can be consumed by another process
    sas_token = generate_account_sas(
        account_name="{storage account name}",
        account_key="{storage acct key}",
        resource_types=ResourceTypes(service=False, container=False, object=True),
        permission=AccountSasPermissions(read=True),
        expiry=datetime.utcnow() + timedelta(hours=1))
    return ("https://{storage account name}.blob.core.windows.net/"
            + Container + "/" + FileName + "?" + sas_token)
The LOAD statement looks like this and does not use the file:/// prefix:
LOAD CSV WITH HEADERS FROM '" + {URL from above} + "' AS line FIELDTERMINATOR '|'{your cypher query for loading csv}
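As a concrete illustration of step 7, here is a hedged sketch that builds the URL with GetBlobURLwithSAS() from step 5 and runs the LOAD CSV through the official neo4j Python driver. The connection details are hypothetical, and the Cypher after FIELDTERMINATOR reuses the merge from the question.

from neo4j import GraphDatabase

# Hypothetical connection details for the Neo4j server.
driver = GraphDatabase.driver("bolt://<neo4j-host>:7687", auth=("neo4j", "<password>"))

csv_url = GetBlobURLwithSAS("FTDNATree.csv")  # step 5 above

query = (
    "USING PERIODIC COMMIT 10000 "
    "LOAD CSV WITH HEADERS FROM '" + csv_url + "' AS line FIELDTERMINATOR '|' "
    "MERGE (s:SNPNode {SNP: toString(line.Parent)})"
)

# session.run() uses an auto-commit transaction, which USING PERIODIC COMMIT requires.
with driver.session() as session:
    session.run(query)
driver.close()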
I hope this helps others navigate this scenario!
Goal - To read a CSV file uploaded to a Google Cloud Storage bucket.
Environment - Running a Jupyter notebook over SSH on the master node of an instance. Using Python in the Jupyter notebook, I am trying to access a simple CSV file uploaded to a Google Cloud Storage bucket.
Approaches -
1st approach - Write a simple Python program
I wrote the following program:
import csv

f = open('gs://python_test_hm/train.csv', 'rb')
csv_f = csv.reader(f)
for row in csv_f:
    print row
Results - Error message "No such file or directory"
2nd Approach - Tried to access the train.csv file using the gcloud package. The sample code is shown below; it is not my actual code. In my version of the code, the file on Google Cloud Storage was referred to as "gs:///Filename.csv"
Results - Error message "No such file or directory"
Load data from CSV
import csv
from gcloud import bigquery
from gcloud.bigquery import SchemaField

client = bigquery.Client()
dataset = client.dataset('dataset_name')
dataset.create()  # API request

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]
table = dataset.table('table_name', SCHEMA)
table.create()

with open('csv_file', 'rb') as readable:
    table.upload_from_file(
        readable, source_format='CSV', skip_leading_rows=1)
3rd Approach -
import csv
import urllib

url = 'https://storage.cloud.google.com/<bucket>/train.csv'
response = urllib.urlopen(url)
cr = csv.reader(response)
print cr

for row in cr:
    print row
Results - The above code doesn't raise any error, but it displays the HTML of a Google sign-in page, as shown below. I am interested in viewing the data of the train.csv file.
['<!DOCTYPE html>']
['<html lang="en">']
[' <head>']
[' <meta charset="utf-8">']
[' <meta content="width=300', ' initial-scale=1" name="viewport">']
[' <meta name="google-site-verification" content="LrdTUW9psUAMbh4Ia074- BPEVmcpBxF6Gwf0MSgQXZs">']
[' <title>Sign in - Google Accounts</title>']
Can someone shed some light on what could possibly be wrong here and how I can achieve my goal? Your help is highly appreciated.
Thanks so much for your help!
I assume you are using Jupyter notebook running on a machine in Google Cloud Platform (GCP)?
If that's the case, you will already have the Google Cloud SDK running on that machine (by default).
With this setup you have 2 easy options to work with Google Cloud Storage (GCS):
Use the gcloud/gsutil commands in Jupyter
Writing to GCS:
gsutil cp train.csv gs://python_test_hm/train.csv
Reading from GCS:
gsutil cp gs://python_test_hm/train.csv train.csv
Use google-cloud python library
Writing to GCS:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('python_test_hm')
blob = bucket.blob('train.csv')
blob.upload_from_string('this is test content!')
Reading from GCS:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('python_test_hm')
blob = storage.Blob('train.csv', bucket)
content = blob.download_as_string()
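To get back to the original goal of iterating over the rows of train.csv, a small sketch building on the download above (Python 3 syntax):

import csv
import io

# "content" is the bytes object returned by blob.download_as_string() above.
rows = csv.reader(io.StringIO(content.decode('utf-8')))
for row in rows:
    print(row)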
The sign-in page your app fetches isn't actually the object - it's an auth redirect page that, if interacted with, would proceed to serve the object. You should check out the Cloud Storage documentation to see how auth works, and look up the auth details for whichever library or means you use to access the bucket/object.