Given a pandas DataFrame which contains some data, what is the best way to store this data to Firebase?
Should I convert the DataFrame to a local file (e.g. .csv, .txt) and then upload it to Firebase Storage, or is it also possible to store the pandas DataFrame directly without conversion? Or are there better best practices?
Update 01/03 - So far I've come up with this solution, which requires writing a csv file locally, reading it back in, uploading it, and then deleting the local file. I doubt, however, that this is the most efficient method, so I would like to know whether it can be done better and more quickly.
import os
import firebase_admin
from firebase_admin import db, storage
cred = firebase_admin.credentials.Certificate(cert_json)
app = firebase_admin.initialize_app(cred, config)
bucket = storage.bucket(app=app)
def upload_df(df, data_id):
    """
    Upload a DataFrame as a csv to Firebase Storage
    :return: storage_ref
    """
    # Storage location + extension
    storage_ref = data_id + ".csv"
    # Store locally
    df.to_csv(data_id)
    # Upload to Firebase Storage
    blob = bucket.blob(storage_ref)
    with open(data_id, 'rb') as local_file:
        blob.upload_from_file(local_file)
    # Delete locally
    os.remove(data_id)
    return storage_ref
With python-firebase and to_dict:
postdata = my_df.to_dict()
# Assumes any auth/headers you need are already taken care of.
result = firebase.post('/my_endpoint', postdata, {'print': 'pretty'})
print(result)
# Snapshot info
You can get the data back using the snapshot info and endpoint, and re-establish the df with from_dict(). You could adapt this approach to SQL or JSON, which pandas also supports.
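For example, a rough read-back sketch (assuming the same firebase connection object, and that the post above returned the generated key in result['name']):

import pandas as pd

# Hypothetical read-back: fetch the snapshot that was posted above
snapshot = firebase.get('/my_endpoint', result['name'])
# Rebuild the DataFrame from the returned dict
restored_df = pd.DataFrame.from_dict(snapshot)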
Alternatively, depending on where your script executes from, you might consider treating Firebase as a database and using the db API from firebase_admin (check this out).
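A minimal sketch of that route, assuming the app was initialized with a databaseURL in config and using an illustrative path of /dataframes/<data_id>:

from firebase_admin import db
import pandas as pd

# Hypothetical example: write the DataFrame as a dict to the Realtime Database
ref = db.reference('/dataframes/' + data_id)
ref.set(df.to_dict())
# Read it back and rebuild the DataFrame
restored_df = pd.DataFrame.from_dict(ref.get())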
As for whether it's according to best practice, it's difficult to say without knowing anything about your use case.
If you just want to reduce code length and skip the steps of creating and deleting files, you can use upload_from_string:
import firebase_admin
from firebase_admin import db, storage
cred = firebase_admin.credentials.Certificate(cert_json)
app = firebase_admin.initialize_app(cred, config)
bucket = storage.bucket(app=app)
def upload_df(df, data_id):
    """
    Upload a DataFrame as a csv to Firebase Storage
    :return: storage_ref
    """
    storage_ref = data_id + '.csv'
    blob = bucket.blob(storage_ref)
    blob.upload_from_string(df.to_csv())
    return storage_ref
https://googleapis.github.io/google-cloud-python/latest/storage/blobs.html#google.cloud.storage.blob.Blob.upload_from_string
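The reverse direction can be sketched the same way, without touching the local disk (a sketch assuming the same bucket object; download_as_text is available on recent versions of the underlying google-cloud-storage Blob):

import io
import pandas as pd

def download_df(data_id):
    # Sketch: pull the CSV back from Firebase Storage into a DataFrame
    blob = bucket.blob(data_id + '.csv')
    csv_text = blob.download_as_text()  # returns a str
    return pd.read_csv(io.StringIO(csv_text), index_col=0)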
After hours of figuring things out, the following solution works for me. You need to convert your csv data to bytes and then upload it.
import pyrebase
import pandas as pd

firebaseConfig = {
    "apiKey": "xxxxx",
    "authDomain": "xxxxx",
    "projectId": "xxxxx",
    "storageBucket": "xxxxx",
    "messagingSenderId": "xxxxx",
    "appId": "xxxxx",
    "databaseURL": "xxxxx"
}

firebase = pyrebase.initialize_app(firebaseConfig)
storage = firebase.storage()

df = pd.read_csv("/content/Future Prices.csv")

# here is the magic: convert your csv data to bytes and then upload it
df_string = df.to_csv(index=False)
db_bytes = bytes(df_string, 'utf8')

fileName = "Future Prices.csv"
storage.child("predictions/" + fileName).put(db_bytes)
That's all. Happy coding!
I found that starting from a very modest dataframe size (below 100KB!), and certainly for bigger ones, it pays off to compress the data before storing. It does not have to be a dataframe; it can be any object (e.g. a dictionary). I used pickle below. Your object can be seen in the usual Firebase Storage this way, and you gain memory and speed, both when writing and when reading, compared to uncompressed storage. For big objects it's also worth adding a timeout to avoid a ConnectionError after the default timeout of 60 seconds.
import firebase_admin
from firebase_admin import credentials, storage
import pickle

cred = credentials.Certificate(json_cert_file)
firebase_admin.initialize_app(cred, {'storageBucket': 'YOUR_storageBucket (without gs://)'})
bucket = storage.bucket()

file_name = data_id + ".pkl"
blob = bucket.blob(file_name)

# write df to storage (the timeout belongs to the upload call, not to pickle.dumps)
blob.upload_from_string(pickle.dumps(df), timeout=300)

# read df from storage
df = pickle.loads(blob.download_as_string(timeout=300))
Related
My requirement is to export data from BigQuery to GCS in a particular sorted order, which I am not able to get using the automatic export, so I am trying to write a manual export for this.
The file format looks like this:
HDR001||5378473972abc||20101|182082||
DTL001||436282798101|
DTL002||QS
DTL005||3733|8
DTL002||QA
DTL005||3733|8
DTL002||QP
DTL005||3733|8
DTL001||436282798111|
DTL002||QS
DTL005||3133|2
DTL002||QA
DTL005||3133|8
DTL002||QP
DTL005||3133|0
I am very new to this and am able to write the file to local disk, but I am not sure how I can write this file to GCS. I tried to use write_to_file but I seem to be missing something.
import pandas as pd
import pickle as pkl
import tempfile
from google.colab import auth
from google.cloud import bigquery, storage
#import cloudstorage as gcs

auth.authenticate_user()

df = pd.DataFrame(data=job)

sc = storage.Client(project='temp-project')
with tempfile.NamedTemporaryFile(mode='w+b', buffering=-1, prefix='test', suffix='temp') as fh:
    with open(fh.name, 'w+', newline='') as f:
        dfAsString = df.to_string(header=" ", index=False)
        fh.name = fh.write(dfAsString)
        fh.close()

bucket = sc.get_bucket('my-bucket')
target_fn = 'test.csv'
source_fn = fh.name
destination_blob_name = bucket.blob('test.csv')
bucket.blob(destination_blob_name).upload_from_file(source_fn)
Can someone please help?
Thank You.
I would suggest uploading the object to a Cloud Storage bucket. Instead of upload_from_file, you need to use upload_from_filename. Your code should look like this:
bucket.blob(destination_blob_name).upload_from_filename(source_fn)
Here are links to the documentation on how to upload an object to a Cloud Storage bucket and to the client library docs.
EDIT:
The reason you're getting that error is that somewhere in your code you're passing a Blob object rather than a string. Currently your destination variable is a Blob object; change it to a string instead:
destination_blob_name = bucket.blob('test.csv')
to
destination_blob_name = 'test.csv'
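Putting both corrections together, a minimal sketch of the upload step (reusing the names from the question) would be:

# Corrected sketch: the destination name is a plain string, and
# upload_from_filename() takes the local path of the file to upload
bucket = sc.get_bucket('my-bucket')
destination_blob_name = 'test.csv'
bucket.blob(destination_blob_name).upload_from_filename(source_fn)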
I have a big text file (with millions of records) in bz2 format in a Minio bucket.
Now I am processing it following the procedure below:
Call the file from the Minio bucket;
Partition the files per day based on the 'timestamp' column;
Remove some of the empty/blank partitioned files using 'cull_empty_partitions()';
Save the partitioned files in a local directory as .csv;
Save them back to the Minio bucket;
Remove the files from the local workspace.
In the current procedure, I have to store the files in the local workspace, which I don't want.
All I want is to read the .txt or .bz2 files from my bucket without using the local workspace for the partitions.
Then, name each partition based on the first date in the 'timestamp' column of the Dask dataframe and store them back directly into the Minio bucket using the Dask framework.
Here is my code:
import dask.dataframe as dd
from datetime import date, timedelta
path='/a/desktop/workspace/project/log/'
bucket= config['data_bucket']['abc']
folder_prefix = config["folder_prefix"]["root"]
folder_store = config["folder_prefix"]["store"]
col_names = [
    "id", "pro_id", "tr_id", "bo_id", "se", "lo", "timestamp", "ch"
]

data = dd.read_csv(
    folder_prefix + 'abc1/2001-01-01-logging_*.txt',
    sep='\t', names=col_names, parse_dates=['timestamp'],
    low_memory=False
)
data['timestamp'] = dd.to_datetime(
    data['timestamp'], format='%Y-%m-%d %H:%M:%S',
    errors='ignore'
)
ddf = data.set_index(data['timestamp']).repartition(freq='1d').dropna()

# Remove the dask dataframe partitions which are split as empty
ddf = cull_empty_partitions(ddf)

# Storing the partitioned dask files to the local workspace as csv
o = ddf.to_csv("out_csv/log_user_*.csv", index=False)

# Storing the files in the minio bucket
for each in o:
    if len(each) > 0:
        print(each.split("/")[-1])
        minioClient.fput_object(bucket, folder_store + each.split("/")[-1], each)
        # Removing partitioned csv files from the local workspace
        os.remove(each)
I can use the code below to connect to the S3 buckets and get access to see the list of buckets:
import boto3
import botocore, os
from botocore.client import Config
from botocore.session import Session

s3 = boto3.resource('s3',
                    endpoint_url='https://blabalbla.com',
                    aws_access_key_id="abcd",
                    aws_secret_access_key="sfsdfdfdcdfdfedfsdfsdf",
                    config=Config(signature_version='s3v4'),
                    region_name='us-east-1')

os.environ['S3_USE_SIGV4'] = 'True'

for bucket in s3.buckets.all():
    print(bucket.name)
When I try to read the objects of the buckets with the code below, it does not respond.
df = dd.read_csv('s3://bucket/myfiles.*.csv')
Any update in this regard will be highly appreciated. Thank you in advance!
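For reference, with s3fs (the backend Dask uses for s3:// paths) the Minio endpoint and credentials are normally passed via storage_options; a rough sketch, not verified against this particular setup:

import dask.dataframe as dd

# Hypothetical sketch: point s3fs at the Minio endpoint and credentials
storage_options = {
    'key': 'abcd',
    'secret': 'sfsdfdfdcdfdfedfsdfsdf',
    'client_kwargs': {'endpoint_url': 'https://blabalbla.com'},
}
df = dd.read_csv('s3://bucket/myfiles.*.csv', storage_options=storage_options)
# The same storage_options can be passed to to_csv() to write the partitions
# back to the bucket without a local copy
df.to_csv('s3://bucket/output/log_user_*.csv', index=False,
          storage_options=storage_options)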
I am looking for a way, using the Azure Files SDK, to upload files to my Azure Databricks blob storage.
I tried many things using functions from this page.
But nothing worked. I don't understand why.
example:
file_service = FileService(account_name='MYSECRETNAME', account_key='mySECRETkey')

generator = file_service.list_directories_and_files('MYSECRETNAME/test')  # listing files in folder /test, working well
for file_or_dir in generator:
    print(file_or_dir.name)

file_service.get_file_to_path('MYSECRETNAME', 'test/tables/input/referentials/', 'test.xlsx', '/dbfs/FileStore/test6.xlsx')
where test.xlsx is the name of the file in my Azure file share, and
/dbfs/FileStore/test6.xlsx is the path where the file should be uploaded in my DBFS system.
I have the error message:
Exception=The specified resource name contains invalid characters
I tried to change the name, but that doesn't seem to work.
Edit: I'm not even sure the function is doing what I want. What is the best way to load a file from Azure Files?
Per my experience, I think the best way to load a file from Azure Files is to read it directly via its URL with a SAS token.
For example, as in the figures below, there is a file named test.xlsx in my test file share; I viewed it using Azure Storage Explorer and then generated its URL with a SAS token.
Fig 1. Right click the file and then click the Get Shared Access Signature...
Fig 2. The Read permission option must be selected to read the file content directly.
Fig 3. Copy the url with sas token
Here is my sample code; you can run it in your Azure Databricks with the SAS token URL of your file.
import pandas as pd
url_sas_token = 'https://<my account name>.file.core.windows.net/test/test.xlsx?st=2020-01-27T10%3A16%3A12Z&se=2020-01-28T10%3A16%3A12Z&sp=rl&sv=2018-03-28&sr=f&sig=XXXXXXXXXXXXXXXXX'
# Directly read the file content from its url with sas token to get a pandas dataframe
pdf = pd.read_excel(url_sas_token)
# Then, to convert the pandas dataframe to a PySpark dataframe in Azure Databricks
df = spark.createDataFrame(pdf)
Alternatively, to use the Azure File Storage SDK to generate the URL with a SAS token for your file, or to get the bytes of your file for reading, please refer to the official document Develop for Azure Files with Python and my sample code below.
# Create a client of Azure File Service as same as yours
from azure.storage.file import FileService
account_name = '<your account name>'
account_key = '<your account key>'
share_name = 'test'
directory_name = None
file_name = 'test.xlsx'
file_service = FileService(account_name=account_name, account_key=account_key)
To generate the SAS token URL of a file:
from azure.storage.file import FilePermissions
from datetime import datetime, timedelta
sas_token = file_service.generate_file_shared_access_signature(share_name, directory_name, file_name, permission=FilePermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))
url_sas_token = f"https://{account_name}.file.core.windows.net/{share_name}/{file_name}?{sas_token}"
import pandas as pd
pdf = pd.read_excel(url_sas_token)
df = spark.createDataFrame(pdf)
Or use the get_file_to_stream function to read the file content:
from io import BytesIO
import pandas as pd
stream = BytesIO()
file_service.get_file_to_stream(share_name, directory_name, file_name, stream)
pdf = pd.read_excel(stream)
df = spark.createDataFrame(pdf)
Just as an addition to @Peter Pan's answer: an alternative approach without using pandas, using the Python azure-storage-file-share library.
Very detailed documentation: https://pypi.org/project/azure-storage-file-share/#downloading-a-file
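A short sketch of that approach, based on the linked documentation (share name, file path, DBFS target and connection string are placeholders):

from azure.storage.fileshare import ShareFileClient

# Sketch: stream the file from the Azure file share straight to a DBFS path
file_client = ShareFileClient.from_connection_string(
    conn_str="<your connection string>",
    share_name="test",
    file_path="test.xlsx",
)
with open("/dbfs/FileStore/test6.xlsx", "wb") as file_handle:
    data = file_client.download_file()
    data.readinto(file_handle)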
I read the minio docs and I see two methods to upload data:
put_object(): this needs an io stream
fput_object(): this reads a file on disk
I want to test minio and upload some data I just created with numpy.random.bytes().
How do I upload data which is stored in a variable in the Python interpreter?
Take a look at io.BytesIO. It lets you wrap byte arrays up in a stream which you can give to Minio.
For example:
import io
from minio import Minio
value = "Some text I want to upload"
value_as_bytes = value.encode('utf-8')
value_as_a_stream = io.BytesIO(value_as_bytes)
client = Minio("my-url-here", ...) # Edit this bit to connect to your Minio server
client.put_object("my_bucket", "my_key", value_as_a_stream , length=len(value_as_bytes))
I was in a similar situation: trying to store a pandas DataFrame as a feather file into minio.
I needed to store bytes directly using the Minio client. In the end the code looked like that:
from io import BytesIO
import pandas as pd
import numpy
import minio

# Create the client
client = minio.Minio(
    endpoint="localhost:9000",
    access_key="access_key",
    secret_key="secret_key",
    secure=False
)

# Create sample dataset
df = pd.DataFrame({
    "a": numpy.random.random(size=1000),
})

# Create a BytesIO instance that will behave like a file opened in binary mode
feather_output = BytesIO()
# Write feather file
df.to_feather(feather_output)
# Get number of bytes
nb_bytes = feather_output.tell()
# Go back to the start of the opened buffer
feather_output.seek(0)
# Put the object into minio
client.put_object(
    bucket_name="datasets",
    object_name="demo.feather",
    length=nb_bytes,
    data=feather_output
)
I had to use .seek(0) in order for Minio to be able to insert the correct number of bytes.
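To read the feather file back from Minio, the reverse direction can be sketched with get_object (a sketch assuming the same client and bucket as above; the response handling follows the Minio client docs):

from io import BytesIO
import pandas as pd

# Sketch: download the object and rebuild the DataFrame from its bytes
response = client.get_object(bucket_name="datasets", object_name="demo.feather")
try:
    restored_df = pd.read_feather(BytesIO(response.read()))
finally:
    response.close()
    response.release_conn()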
@gcharbon: this solution does not work for me. client.put_object() only accepts bytes-like objects.
Here is my solution:
from minio import Minio
import pandas as pd
import io

# Can use a string with csv data here as well
csv_bytes = df.to_csv().encode('utf-8')
csv_buffer = io.BytesIO(csv_bytes)

# Create the client
client = Minio(
    endpoint="localhost:9000",
    access_key="access_key",
    secret_key="secret_key",
    secure=False
)

client.put_object("bucketname",
                  "objectname",
                  data=csv_buffer,
                  length=len(csv_bytes),
                  content_type='application/csv')
I feel kind of stupid right now. I have been reading numerous documentation pages and Stack Overflow questions but I can't get it right.
I have a file on Google Cloud Storage. It is in a bucket 'test_bucket'. Inside this bucket there is a folder, 'temp_files_folder', which contains two files, one .txt file named 'test.txt' and one .csv file named 'test.csv'. There are two files simply because I tried using both, but the result is the same either way.
The content in the files is
hej
san
and I am hoping to read it into Python the same way I would with a local file:
textfile = open("/file_path/test.txt", 'r')
times = textfile.read().splitlines()
textfile.close()
print(times)
which gives
['hej', 'san']
I have tried using
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('test_bucket')
blob = bucket.get_blob('temp_files_folder/test.txt')
print(blob.download_as_string)
but it gives the output
<bound method Blob.download_as_string of <Blob: test_bucket, temp_files_folder/test.txt>>
How can I get the actual string(s) in the file?
download_as_string is a method; you need to call it:
print(blob.download_as_string())
More likely, you want to assign it to a variable so that you download it once and can then print it and do whatever else you want with it:
downloaded_blob = blob.download_as_string()
print(downloaded_blob)
do_something_else(downloaded_blob)
The method download_as_string() will read in the content as bytes.
Below is an example of processing a .csv file.
import csv
from io import StringIO
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket(YOUR_BUCKET_NAME)
blob = bucket.blob(YOUR_FILE_NAME)

blob = blob.download_as_string()
blob = blob.decode('utf-8')
blob = StringIO(blob)  # transform bytes to string here

names = csv.reader(blob)  # then use the csv library to read the content
for name in names:
    print(f"First Name: {name[0]}")
According to the documentation (https://googleapis.dev/python/storage/latest/blobs.html), as of the time of writing (2021/08), the download_as_string method is a deprecated alias for the download_as_bytes method which, as suggested by the name, returns a bytes object.
You can instead use the download_as_text method to return a str object.
For instance, to download the file MYFILE from bucket MYBUCKET and store it as a utf-8 encoded string:
from google.cloud.storage import Client
client = Client()
bucket = client.get_bucket(MYBUCKET)
blob = bucket.get_blob(MYFILE)
downloaded_file = blob.download_as_text(encoding="utf-8")
You can then also use this to read different file formats. For JSON, replace the last line with
import json
downloaded_json_file = json.loads(blob.download_as_text(encoding="utf-8"))
For YAML files, replace the last line with:
import yaml
downloaded_yaml_file = yaml.safe_load(blob.download_as_text(encoding="utf-8"))
DON'T USE: blob.download_as_string()
USE: blob.download_as_text()
blob.download_as_text() does indeed return a string.
blob.download_as_string() is deprecated and returns a bytes object instead of a string object.
This works when reading a docx / text file:
from google.cloud import storage
# create storage client
storage_client = storage.Client.from_service_account_json('**PATH OF JSON FILE**')
bucket = storage_client.get_bucket('**BUCKET NAME**')
# get bucket data as blob
blob = bucket.blob('**SPECIFYING THE DOCX FILENAME**')
downloaded_blob = blob.download_as_string()
downloaded_blob = downloaded_blob.decode("utf-8")
print(downloaded_blob)