How to read a parquet file in Azure Databricks? - python

I have a few parquet files stored in my storage account, which I am trying to read using the code below. However, it fails with an "incorrect syntax" error. Can someone suggest the correct way to read parquet files using Azure Databricks?
val data = spark.read.parquet("abfss://containername#storagename.dfs.core.windows.net/TestFolder/XYZ/part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet")
display(data)

abfss://containername#storagename.dfs.core.windows.net/TestFolder/XYZ/part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet
Based on the abfss URL above, the data in your storage account could be in either Delta or Parquet format.
Note: when you create a Delta table, part files such as part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet are generated automatically, and with the code above you cannot read such a part file as plain Parquet if it belongs to a Delta table. Also note that val is Scala syntax; in a Python cell, drop the val keyword (both the Scala and Python versions are shown below).
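If the part file does belong to a Delta table, a minimal sketch (the folder path is a placeholder, not from the original answer) would read the table folder with the delta format instead of pointing at an individual part file:
# Sketch only: read the Delta table folder itself, not a single part file
df = spark.read.format("delta").load("abfss://<container>@<storage_account>.dfs.core.windows.net/TestFolder/XYZ")
display(df)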
I wrote the DataFrame df1 to a storage account in Parquet format with overwrite mode:
df1.coalesce(1).write.format('parquet').mode("overwrite").save("abfss://<container>@<storage_account>.dfs.core.windows.net/<folder>/<sub_folder>")
Scala:
val df11 = spark.read.format("parquet").load("abfss://<container>@<storage_account>.dfs.core.windows.net/demo/d121/part-00000-tid-2397072542034942773-def47888-c000.snappy.parquet")
display(df11)
Python:
df11 = spark.read.format("parquet").load("abfss://<container>@<storage_account>.dfs.core.windows.net/demo/d121/part-00000-tid-2397072542034942773-def47888-c000.snappy.parquet")
display(df11)
Output:
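As a side note (not part of the original answer): if the folder contains several part files, Spark can read the whole directory in one call; a minimal sketch with the same placeholder path:
# read every parquet part file under the folder at once
df_all = spark.read.parquet("abfss://<container>@<storage_account>.dfs.core.windows.net/demo/d121/")
display(df_all)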

Related

How to read partitioned parquet data using Pandas

I have data in Azure blob storage that was written using PySpark. When it was written, a WeekStartDate field (a Monday) was used for partitionBy, i.e. df.write.partitionBy("WeekStartDate"). This gives rise to a folder/file structure in Azure blob storage like this:
container/MyTable/WeekStartDate=2020-09-28/
hash0.gz.parquet
...
hash(n-1).gz.parquet
container/MyTable/WeekStartDate=2020-10-05/
hash0.gz.parquet
...
hash(n-1).gz.parquet
container/MyTable/WeekStartDate=2020-10-12/
hash0.gz.parquet
...
hash(n-1).gz.parquet
container/MyTable/WeekStartDate=2020-10-19/
etc
Now, say I want to read data for just WeekStartDates 2020-09-28 and 2020-10-05.
Is there any way to do this in pandas without having to build up the two paths and then stitch them together like
df1 = pd.read_parquet("MyTable/WeekStartDate=2020-09-28", storage_options=xxx)
df2 = pd.read_parquet("MyTable/WeekStartDate=2020-10-05", storage_options=xxx)
df = pd.concat((df1, df2))
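One possibility, sketched here rather than taken from the original post (storage_options is assumed to be the same credentials dict as above): the pyarrow engine can prune hive-style partitions through the filters argument, so the table root can be read once and filtered on the partition column:
import pandas as pd

# read the partitioned dataset root; pyarrow keeps only the two requested partitions
df = pd.read_parquet(
    "MyTable",
    engine="pyarrow",
    filters=[("WeekStartDate", "in", ["2020-09-28", "2020-10-05"])],
    storage_options=xxx,
)
Depending on how pyarrow infers the WeekStartDate partition type, the filter values may need to be date objects rather than strings.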

Writing dataframe to excel in Azure function environment to store xlsx file in blob storage

In my Azure Function, I'm currently writing the data frame to CSV using df.to_csv() and passing that object to the append blob function.
output_data = df.to_csv(index=False, encoding="utf-8")
blob_client.append_block(output_data)
Now I want to store the data in .xlsx format and then add an auto filter to the Excel file using xlsxwriter.
This is what I tried, but I was unable to understand what is wrong here:
writer = io.BytesIO()
df.to_excel(writer, index=False)
writer.seek(0)
blob_client.upload_blob(writer.getvalue())
I have already tried the following solutions, but they didn't work for me: either the file is created but empty, or the file is not readable in Excel apps.
Azure Function - Pandas Dataframe to Excel is Empty
Writing pandas dataframe as xlsx file to an azure blob storage without creating a local file
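A hedged sketch of one way to do this (df and blob_client are assumed to exist as in the question; the sheet name and filter range are illustrative): let pandas drive xlsxwriter through an ExcelWriter opened on an in-memory buffer, add the auto filter, close the writer so the workbook is finalized, and only then upload the bytes.
import io
import pandas as pd

buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="xlsxwriter") as writer:
    df.to_excel(writer, index=False, sheet_name="Sheet1")
    worksheet = writer.sheets["Sheet1"]
    # auto filter over the header row plus all data rows and columns
    worksheet.autofilter(0, 0, len(df), len(df.columns) - 1)
buffer.seek(0)
# overwrite=True replaces any existing blob of the same name
blob_client.upload_blob(buffer.getvalue(), overwrite=True)
The "created but empty" symptom typically comes from uploading the buffer before the ExcelWriter is closed, since xlsxwriter only writes the workbook out on close.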

Combine CSV files into XLS or XLSX on AWS S3

I have code running as an AWS Lambda that queries an internal database and generates files in different formats. The files are generated in parts and uploaded to S3 using multipart upload:
self.mpu = self.s3_client.create_multipart_upload(
    Bucket=self.bucket_name,
    ContentType=self.get_content_type(),
    Expires=self.expire_daytime,
    Key=self.filename,
)
and
response = self.s3_client.upload_part(
    Bucket=self.bucket_name,
    Key=self.filename,
    PartNumber=self.part_number,
    UploadId=self.upload_id,
    Body=data
)
self.current_part_info.update({
    'PartNumber': self.part_number,
    'ETag': response['ETag']
})
One of the formats I need to support is XLS or XLSX. It's fairly easy to create multiple CSV files on S3. But is it possible to combine them directly on S3 into XLS/XLSX without downloading them?
My current code generates an XLSX file in memory, creates a local file, and then uploads it to S3:
import xlsxwriter

self.workbook = xlsxwriter.Workbook(self.filename)
# download CSV files...
for sheet_name, sheet_info in sheets.items():
    sheet = self.workbook.add_worksheet(name=sheet_name)
    # code that does formatting
    for ...:  # loop through rows
        for ...:  # loop through columns
            sheet.write(row, col, col_str)
self.workbook.close()
This works fine for small queries, but the users will want to use it for a large amount of data.
When I run it with large queries, it runs out of memory. AWS Lambda has limited memory and limited disk space, and I'm hitting those limits.
Is it possible to combine CSV files into XLS or XLSX somehow without holding the entire file in local space (both memory and disk space are a problem)?
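One direction worth sketching (not from the original post; the bucket name, keys, and sheet mapping are placeholders): xlsxwriter's constant_memory mode flushes each row to a temporary file as it is written, so only one row at a time is held in memory while the CSVs are streamed from S3. The finished .xlsx still has to fit in Lambda's /tmp before upload, so this relieves the memory limit but not the disk limit.
import csv
import boto3
import xlsxwriter

s3 = boto3.client("s3")
# constant_memory streams rows to a temp file instead of keeping them in RAM
workbook = xlsxwriter.Workbook("/tmp/output.xlsx", {"constant_memory": True})
for sheet_name, csv_key in {"Sheet1": "exports/part1.csv"}.items():
    sheet = workbook.add_worksheet(name=sheet_name)
    body = s3.get_object(Bucket="my-bucket", Key=csv_key)["Body"]
    lines = (line.decode("utf-8") for line in body.iter_lines())
    for row_idx, row in enumerate(csv.reader(lines)):
        for col_idx, value in enumerate(row):
            sheet.write(row_idx, col_idx, value)
workbook.close()
s3.upload_file("/tmp/output.xlsx", "my-bucket", "exports/output.xlsx")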

Spark: How to write bytes string to hdfs hadoop in pyspark for spark-xml transformation?

In Python, a bytes string can simply be saved to a single XML file:
with open('/home/user/file.xml', 'wb') as f:
    f.write(b'<Value>1</Value>')
Current output: /home/user/file.xml (a file saved on the local filesystem)
Question: How do I save the string to an XML file on HDFS in PySpark?
Expected output: 'hdfs://hostname:9000/file.xml'
Background: large numbers of XML files are provided by 3rd-party web APIs. I am building an ETL pipeline in PySpark into Delta Lake. The data is extracted asynchronously with aiohttp; next I want to use spark-xml for the transformation before saving the Spark data frame to Delta Lake (which requires PySpark). I'm looking for the most efficient way to build the pipeline.
A similar question was asked of the spark-xml developers on GitHub:
https://github.com/databricks/spark-xml/issues/515
Latest research:
spark-xml takes as its input either XML files stored as text on disk or a Spark dataframe.
So I'm limited to one of the 2 options below (a sketch of option (a) follows this list):
a) some HDFS client (pyarrow, hdfs, aiohdfs) to save the files to HDFS (a text file on HDFS is not a very efficient format)
b) load the data into a Spark dataframe for the spark-xml transformation (the native format for Delta Lake)
If you have other ideas, please let me know.
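For option (a), a minimal sketch (not from the original post; the HDFS path is a placeholder): the Hadoop FileSystem API that ships with Spark can be reached through the session's JVM gateway, so a small bytes payload can be written to HDFS without installing a separate client library:
# Sketch only: write a bytes string to HDFS via the JVM Hadoop FileSystem API
hadoop = spark._jvm.org.apache.hadoop
conf = spark.sparkContext._jsc.hadoopConfiguration()
path = hadoop.fs.Path("hdfs://hostname:9000/file.xml")
out = path.getFileSystem(conf).create(path)
out.write(bytearray(b'<Value>1</Value>'))
out.close()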
Don't be misled by the Databricks spark-xml docs, which lead you to use an uncompressed XML file as the input. That is very inefficient; it is much faster to download the XMLs directly into a Spark dataframe. The PySpark side of spark-xml doesn't include this, but there is a workaround:
from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.types import _parse_datatype_json_string

def ext_from_xml(xml_column, schema, options={}):
    java_column = _to_java_column(xml_column.cast('string'))
    java_schema = spark._jsparkSession.parseDataType(schema.json())
    scala_map = spark._jvm.org.apache.spark.api.python.PythonUtils.toScalaMap(options)
    jc = spark._jvm.com.databricks.spark.xml.functions.from_xml(
        java_column, java_schema, scala_map)
    return Column(jc)

def ext_schema_of_xml_df(df, options={}):
    assert len(df.columns) == 1
    scala_options = spark._jvm.PythonUtils.toScalaMap(options)
    java_xml_module = getattr(getattr(
        spark._jvm.com.databricks.spark.xml, "package$"), "MODULE$")
    java_schema = java_xml_module.schema_of_xml_df(df._jdf, scala_options)
    return _parse_datatype_json_string(java_schema.json())
The XMLs downloaded to a list:
xml = [('url',"""<Level_0 Id0="Id0_value_file1">
  <Level_1 Id1_1 ="Id3_value" Id_2="Id2_value">
    <Level_2_A>A</Level_2_A>
    <Level_2>
      <Level_3>
        <Level_4>
          <Date>2021-01-01</Date>
          <Value>4_1</Value>
        </Level_4>
        <Level_4>
          <Date>2021-01-02</Date>
          <Value>4_2</Value>
        </Level_4>
      </Level_3>
    </Level_2>
  </Level_1>
</Level_0>"""),
('url',"""<Level_0 Id0="Id0_value_file2">
  <Level_1 Id1_1 ="Id3_value" Id_2="Id2_value">
    <Level_2_A>A</Level_2_A>
    <Level_2>
      <Level_3>
        <Level_4>
          <Date>2021-01-01</Date>
          <Value>4_1</Value>
        </Level_4>
        <Level_4>
          <Date>2021-01-02</Date>
          <Value>4_2</Value>
        </Level_4>
      </Level_3>
    </Level_2>
  </Level_1>
</Level_0>""")]
Spark dataframe transformation of the XML strings:
from pyspark.sql import functions as F  # needed for explode_outer below

# create df with the XML strings
rdd = sc.parallelize(xml)
df = spark.createDataFrame(rdd, "url string, content string")

# infer the XML schema
payloadSchema = ext_schema_of_xml_df(df.select("content"))

# parse the xml
parsed = df.withColumn("parsed", ext_from_xml(df.content, payloadSchema, {"rowTag": "Level_0"}))

# select the required data
df2 = parsed.select(
    'parsed._Id0',
    F.explode_outer('parsed.Level_1.Level_2.Level_3.Level_4').alias('Level_4')
).select(
    '`parsed._Id0`',
    'Level_4.*'
)
To decode bytes: b'string'.decode('utf-8')
See #mck's answer for more info about the XMLs:
How to transform to spark Data Frame data from multiple nested XML files with attributes
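Since the end goal in the question is Delta Lake, a one-line sketch of the final step (the output path is a placeholder, not from the original answer):
# write the parsed frame to a Delta table; the path is illustrative only
df2.write.format("delta").mode("append").save("/mnt/delta/xml_values")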

Renaming spark output csv in azure blob storage

I have a Databricks notebook setup that works as follows:
pyspark connection details to Blob storage account
Read file through spark dataframe
convert to pandas Df
data modelling on pandas Df
convert to spark Df
write to blob storage in single file
My problem is that you cannot name the output file, and I need a static csv filename.
Is there a way to rename this in pyspark?
## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""
## File location and File type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"
## Connection string to connect to blob storage
spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key)
Followed by outputting the file after the data transformation:
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
The file is then written as "part-00000-tid-336943946930983.....csv",
whereas the goal is to have "Output.csv".
Another approach I looked at was just recreating this in Python, but I have not yet come across anything in the documentation on how to output the file back to blob storage.
I know the method to retrieve from Blob storage is .get_blob_to_path via microsoft.docs.
Any help here is greatly appreciated.
Hadoop/Spark writes the computed result in parallel, one file per partition, so you will see many part-<number>-... files in an HDFS output path such as the Output/ folder you named.
If you want all results of a computation in one file, you can either merge the part files with the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or reduce the output to a single partition with the coalesce(1) function.
So in your scenario you only need to call coalesce(1) on the dataframe before .write, so that a single part file is produced, as below.
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
The coalesce and repartition approaches do not help with saving the dataframe into one normally named file.
I ended up just renaming the single csv file and deleting the folder with the log files:
import os

def save_csv(df, location, filename):
    outputPath = os.path.join(location, filename + '_temp.csv')
    df.repartition(1).write.format("com.databricks.spark.csv").mode("overwrite").options(header="true", inferSchema="true").option("delimiter", "\t").save(outputPath)
    csv_files = os.listdir(os.path.join('/dbfs', outputPath))
    # move the part-style temp csv file to the normally named one
    for file in csv_files:
        if file[-4:] == '.csv':
            dbutils.fs.mv(os.path.join(outputPath, file), os.path.join(location, filename))
    dbutils.fs.rm(outputPath, True)

# using save_csv
save_csv_location = 'mnt/.....'
save_csv(df, save_csv_location, 'name.csv')
