I have a Databricks notebook setup that works as follows:
PySpark connection details to the Blob storage account
Read the file into a Spark DataFrame
Convert to a pandas DataFrame
Data modelling on the pandas DataFrame
Convert back to a Spark DataFrame
Write to Blob storage as a single file
My problem is that you cannot name the output file, and I need a static CSV filename.
Is there a way to rename this in PySpark?
## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""
## File location and File type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"
## Connection string to connect to blob storage
spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key)
Followed by writing the file out after the data transformation:
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
The file is then written as "part-00000-tid-336943946930983.....csv",
whereas the goal is to have "Output.csv".
Another approach I looked at was just recreating this in plain Python, but I have not yet come across anything in the documentation on how to write the file back out to Blob storage.
I know the method to retrieve a file from Blob storage is .get_blob_to_path, via the Microsoft docs.
Any help here is greatly appreciated.
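One direction I am considering: the same legacy azure-storage SDK that exposes get_blob_to_path also has create_blob_from_path, so a rough sketch of pushing the renamed CSV back up might look like this (the container name and local path are placeholders, not real values from my setup):

from azure.storage.blob import BlockBlobService

# legacy (pre-v12) SDK client; reuses the account values from the Spark config above
block_blob_service = BlockBlobService(account_name=storage_account_name,
                                      account_key=storage_account_access_key)

# upload a locally written CSV back into a container; names are illustrative only
block_blob_service.create_blob_from_path(container_name="databricks-files",
                                         blob_name="Output.csv",
                                         file_path="/dbfs/tmp/Output.csv")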
Hadoop/Spark writes the computed result of each partition in parallel to its own file, so you will see many part-<number>-.... files in an HDFS output path such as the Output/ folder you named.
If you want all results of a computation in a single file, you can either merge them with the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or reduce the number of partitions to 1, for example with the coalesce(1) function.
So in your scenario you only need to make sure coalesce(1) is applied to the DataFrame before the write/save call, as below.
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
coalesce and repartition do not help with saving the DataFrame into one normally named file.
I ended up just renaming the single CSV file and deleting the folder with the log files:
import os

def save_csv(df, location, filename):
    outputPath = os.path.join(location, filename + '_temp.csv')
    df.repartition(1).write.format("com.databricks.spark.csv") \
        .mode("overwrite").options(header="true", inferSchema="true") \
        .option("delimiter", "\t").save(outputPath)
    csv_files = os.listdir(os.path.join('/dbfs', outputPath))
    # move the part-xxxxx temp csv file to the normally named one
    for file in csv_files:
        if file[-4:] == '.csv':
            dbutils.fs.mv(os.path.join(outputPath, file), os.path.join(location, filename))
    dbutils.fs.rm(outputPath, True)

# using save_csv
save_csv_location = 'mnt/.....'
save_csv(df, save_csv_location, 'name.csv')
Related
I have a few parquet files stored in my storage account, which I am trying to read using the code below. However, it fails with an "incorrect syntax" error. Can someone suggest the correct way to read parquet files using Azure Databricks?
val data = spark.read.parquet("abfss://containername@storagename.dfs.core.windows.net/TestFolder/XYZ/part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet")
display(data)
abfss://containername@storagename.dfs.core.windows.net/TestFolder/XYZ/part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet
As per the above abfss URL, you can use either delta or parquet format in the storage account.
Note: if you created a delta table, part files like part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet are created automatically. With the code above it is not possible to read such a delta part file as plain parquet.
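If the folder actually holds a Delta table, a minimal sketch of reading it would point at the table folder rather than at an individual part file (the path below is a placeholder):

df = spark.read.format("delta").load("abfss://<container>@<storage_account>.dfs.core.windows.net/TestFolder/XYZ")
display(df)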
I wrote the DataFrame df1 and overwrote it into a storage account in parquet format.
df1.coalesce(1).write.format('parquet').mode("overwrite").save("abfss://<container>@<storage_account>.dfs.core.windows.net/<folder>/<sub_folder>")
Scala
val df11 = spark.read.format("parquet").load("abfss://<container>@<storage_account>.dfs.core.windows.net/demo/d121/part-00000-tid-2397072542034942773-def47888-c000.snappy.parquet")
display(df11)
Python
df11 = spark.read.format("parquet").load("abfss://<container>@<storage_account>.dfs.core.windows.net/demo/d121/part-00000-tid-2397072542034942773-def47888-c000.snappy.parquet")
display(df11)
I am trying to read a parquet file that is stored on DBFS with PySpark, but I get the following error:
org.apache.spark.SparkException: java.io.Serializable is not annotated
with SQLUserDefinedType nor registered with UDTRegistration.}
The parquet file is extracted from a zip file and stored on DBFS. This is done by the following function:
from zipfile import ZipFile

def loading_zip(file_name, dest_file):
    temp_folder_write = f"/dbfs/mnt/.../{dest_file}"
    temp_folder_read = f"dbfs:/mnt/.../{dest_file}"
    with ZipFile(file_name, "r") as zipObj:
        zipObj.extractall(dest_file)
    df = spark.read.parquet(temp_folder_read)
    return df
You don't need to extract the zipped parquet files before reading. Just spark.read.parquet(f"dbfs:/mnt/.../{dest_file}") would be enough
I have found the issue. I had to define a schema like:
example_schema = StructType([StructField(..),..])
df_spark = spark.read.schema(example_schema).parquet("<path>")
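A more concrete, self-contained sketch of the same fix (the column names and types here are hypothetical; use whatever matches the actual parquet file):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# hypothetical columns for illustration only
example_schema = StructType([
    StructField("id", StringType(), True),
    StructField("value", DoubleType(), True),
])

df_spark = spark.read.schema(example_schema).parquet("<path>")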
I'm trying to develop a Python script that reads an .xlsx file from a blob storage container called "source", converts it to .csv, and stores it in a new container (I'm testing the script locally; if it works I will include it in an ADF pipeline). So far I have managed to access the blob storage, but I'm having problems reading the file content.
from azure.storage.blob import BlobServiceClient, ContainerClient, BlobClient
import pandas as pd
conn_str = "DefaultEndpointsProtocol=https;AccountName=XXXXXX;AccountKey=XXXXXX;EndpointSuffix=core.windows.net"
container = "source"
blob_name = "prova.xlsx"
container_client = ContainerClient.from_connection_string(
    conn_str=conn_str,
    container_name=container
)
# Download blob as StorageStreamDownloader object (stored in memory)
downloaded_blob = container_client.download_blob(blob_name)
df = pd.read_excel(downloaded_blob)
print(df)
I get the following error:
ValueError: Invalid file path or buffer object type: <class 'azure.storage.blob._download.StorageStreamDownloader'>
I tried with a .csv file as input, writing the parsing code as follows:
df = pd.read_csv(StringIO(downloaded_blob.content_as_text()))
and it works.
Any suggestion on how to modify the code so that the excel file becomes readable?
I summarize the solution below.
When we use the method pd.read_excel() from the pandas library, we need to provide bytes as input. But when we use download_blob to download the Excel file from Azure Blob storage, we just get an azure.storage.blob.StorageStreamDownloader. So we need to use the method readall() or content_as_bytes() to convert it to bytes. For more details, please refer to the pandas and azure-storage-blob documentation.
Change
df = pd.read_excel(downloaded_blob)
to
df = pd.read_excel(downloaded_blob.content_as_bytes())
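Putting it together with the upload back to a new container, a minimal sketch could look like this (the destination container name and output blob name are assumptions, not from the question; reading .xlsx also requires an Excel engine such as openpyxl to be installed):

from io import BytesIO

# read the .xlsx blob into pandas via bytes
downloaded_blob = container_client.download_blob(blob_name)
df = pd.read_excel(BytesIO(downloaded_blob.content_as_bytes()))

# convert to CSV in memory and upload it to a hypothetical "destination" container
dest_client = ContainerClient.from_connection_string(conn_str=conn_str,
                                                     container_name="destination")
dest_client.upload_blob(name="prova.csv", data=df.to_csv(index=False), overwrite=True)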
I have around 500 .txt files on my local system and would like to merge them into a DataFrame in Google Colab. I have already uploaded them via the Upload option, where I uploaded the zipped folder containing the .txt files and later unzipped them in Google Colab. Each .txt file has one row of data, e.g. 0 12 34.3 423.
I tried the following code to upload directly from my local system, but it did not work.
Colab cannot access your local files through the typical built-ins as far as I know. You have to use Colab-specific modules. The guide is here.
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
This will prompt you to select the files to upload.
EDIT: As you need the file names, you can just use the loop above and then concatenate as you mentioned correctly.
import pandas as pd

# create a list of file names
file_names = []
for fn in uploaded.keys():
    file_names.append(fn)

# create a list of dataframes
frames = []
for file_name in file_names:
    frames.append(pd.read_csv(file_name))

# concat all of your frames at once
df = pd.concat(frames)
Alternatively, depending on the size of your files, you could also merge the two loops: load one file at a time and concatenate it directly onto the existing result, so that less data has to be held in memory at once. A sketch of this is shown below.
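A rough sketch of that combined loop (it assumes the same default comma-separated parsing as above):

import pandas as pd

# read and append one file at a time so only the running result
# plus a single new frame is held in memory
df = pd.DataFrame()
for fn in uploaded.keys():
    df = pd.concat([df, pd.read_csv(fn)], ignore_index=True)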
I am using PySpark to save a DataFrame as a parquet file or as a CSV file with this:
def write_df_as_parquet_file(df, path, mode="overwrite"):
    df = df.repartition(1)  # join partitions to produce 1 parquet file
    dfw = df.write.format("parquet").mode(mode)
    dfw.save(path)

def write_df_as_csv_file(df, path, mode="overwrite", header=True):
    df = df.repartition(1)  # join partitions to produce 1 csv file
    header = "true" if header else "false"
    dfw = df.write.format("csv").option("header", header).mode(mode)
    dfw.save(path)
But this saves the parquet/csv file inside a folder called path, together with a few other files that we don't need, in this way:
Image: https://ibb.co/9c1D8RL
Basically, I would like to create some function that saves the file to a location using the above methods, and then moves the CSV or PARQUET file to a new location. Like:
def write_df_as_parquet_file(df, path, mode="overwrite"):
    # save df in one file inside tmp_folder
    df = df.repartition(1)  # join partitions to produce 1 parquet file
    dfw = df.write.format("parquet").mode(mode)
    tmp_folder = path + "TEMP"
    dfw.save(tmp_folder)

    # move parquet file from tmp_folder to path
    copy_file(tmp_folder + "*.parquet", path)
    remove_folder(tmp_folder)
How can I do that? How do I implement copy_file or remove_folder? I have seen a few solutions in Scala that use the Hadoop API for this, but I have not been able to make that work in Python. I think I need to use sparkContext, but I am still learning Hadoop and have not found the way to do it.
You can use one of Python's HDFS libraries to connect to your HDFS instance and then carry out whatever operations required.
From the hdfs3 docs (https://hdfs3.readthedocs.io/en/latest/quickstart.html):
from hdfs3 import HDFileSystem
hdfs = HDFileSystem(host=<host>, port=<port>)
hdfs.mv(tmp_folder + "*.parquet", path)
Wrap the above in a function and you're good to go.
Note: I've just used hdfs3 as an example. You could also use hdfsCLI.
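Since the question mentions sparkContext: an alternative sketch that reaches the Hadoop FileSystem API through Spark's JVM gateway, so no extra library is needed (untested; it assumes spark is the active SparkSession and that a single part file matches the glob):

# access the Hadoop FileSystem that Spark itself is configured with
hadoop = spark.sparkContext._jvm.org.apache.hadoop
conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = hadoop.fs.FileSystem.get(conf)

def copy_file(src_glob, dst_path):
    # move each part file matching the glob to the target path
    for status in fs.globStatus(hadoop.fs.Path(src_glob)):
        fs.rename(status.getPath(), hadoop.fs.Path(dst_path))

def remove_folder(path):
    fs.delete(hadoop.fs.Path(path), True)  # True = recursive delete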