I am using Azure Databricks to analyze some data. I have the following folder structure in blob storage:
folder_1
    n1 csv files
folder_2
    n2 csv files
..
folder_k
    nk csv files
I want to read these files, run some algorithm (relatively simple) and write out some log files and image files for each of the csv files in a similar folder structure at another blob storage location. Right now I have a simple loop structure to do this:
for folder in folders:
    # set up some stuff
    for file in files:
        # do the work and write out results
The database contains 150k files. Is there a way to parallelize this?
The best way I found to parallelize such embarrassingly parallel tasks in Databricks is using a pandas UDF (https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html?_ga=2.143957493.1972283838.1643225636-354359200.1607978015)
I created a Spark DataFrame listing the files and folders to loop through, split it into a specified number of partitions (essentially the number of cores to parallelize over), and passed it to a pandas UDF. This leverages the available cores on a Databricks cluster. There are a few restrictions on what you can call from a pandas UDF (for example, you cannot call 'dbutils' directly), but it worked like a charm for my application.
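A minimal sketch of that approach, under some assumptions: it uses mapInPandas, one of the pandas function APIs that shipped alongside the new pandas UDFs in Spark 3.0, rather than a decorated pandas UDF, and the names process_one, process_batches and run_on_spark are hypothetical. The body of process_one is a placeholder for the real per-file algorithm.

```python
def process_one(folder, file):
    """Placeholder for the per-file work (hypothetical); returns a status string.

    Real code would read the CSV and write the log and image files to the
    output location. 'dbutils' cannot be called from here.
    """
    return f"processed {folder}/{file}"


def process_batches(batches):
    """mapInPandas callback: receives an iterator of pandas DataFrames."""
    import pandas as pd  # imported here so it is resolved on the workers

    for pdf in batches:
        statuses = [process_one(fo, fi) for fo, fi in zip(pdf["folder"], pdf["file"])]
        yield pd.DataFrame(
            {"folder": pdf["folder"], "file": pdf["file"], "status": statuses}
        )


def run_on_spark(spark, tasks, num_partitions):
    """Spread (folder, file) tuples over num_partitions and process them.

    `spark` is assumed to be an active SparkSession.
    """
    sdf = spark.createDataFrame(tasks, ["folder", "file"]).repartition(num_partitions)
    result = sdf.mapInPandas(
        process_batches, schema="folder string, file string, status string"
    )
    return result.collect()
```

In a notebook this could be called as run_on_spark(spark, [("folder_1", "a.csv"), ("folder_1", "b.csv")], num_partitions=8), with the partition count chosen to match the cores available on the cluster.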
I am trying to export a huge table (2,000,000,000 rows, roughly 600 GB in size) from BigQuery into a Google Cloud Storage bucket as a single file. All the tools suggested in Google's documentation are limited in export size and will create multiple files.
Is there a Pythonic way to do it without needing to hold the entire table in memory?
While there may be other ways to script this, the recommended solution is to merge the files using the Cloud Storage compose action.
What you have to do is:
export in CSV format
this produces many files
run the compose action in batches of up to 32 files, repeating until everything is merged into the final big file
All of this can be combined in a Cloud Workflow; there is a tutorial here.
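The batching in the last step can be sketched as follows. This is only an illustration: plan_compose_rounds, run_plan and the "tmp/round..." object names are hypothetical, and run_plan assumes `bucket` is a google-cloud-storage Bucket object (it uses Bucket.blob and Blob.compose). The planner itself is plain Python and only decides which objects to merge in which round.

```python
def plan_compose_rounds(names, final_name, max_sources=32):
    """Plan repeated compose rounds until a single object remains.

    Returns a list of rounds; each round is a list of (target, sources)
    tuples with at most max_sources sources per compose call.
    """
    rounds = []
    current = list(names)
    round_no = 0
    while len(current) > 1:
        chunks = [current[i:i + max_sources] for i in range(0, len(current), max_sources)]
        steps = []
        for part_no, sources in enumerate(chunks):
            if len(chunks) == 1:
                target = final_name  # last round writes the final file
            else:
                target = f"tmp/round{round_no}-part{part_no}"  # hypothetical naming
            steps.append((target, sources))
        rounds.append(steps)
        current = [target for target, _ in steps]
        round_no += 1
    return rounds


def run_plan(bucket, rounds):
    """Execute the plan; `bucket` is assumed to be a google.cloud.storage Bucket."""
    for steps in rounds:
        for target, sources in steps:
            bucket.blob(target).compose([bucket.blob(name) for name in sources])
```

For 100 exported shards this plans two rounds: four intermediate objects first, then one compose call that produces the final file. Intermediate objects are not cleaned up in this sketch, and if the export includes a header row, each shard's header will reappear inside the merged file.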
When writing a CSV file, Spark automatically creates a folder and then writes a CSV file with a cryptic name inside it. How can I create this CSV with a specific name, without the folder, in PySpark (not pandas)?
That's just how Spark's parallelizing mechanism works. A Spark application is meant to have one or more workers that read your data and write it to a location. When you write a CSV file, producing a directory with multiple files is what allows multiple workers to write at the same time.
If you're using HDFS, you can consider writing another bash script to move or reorganize files the way you want
If you're using Databricks, you can use dbutils.fs (for example dbutils.fs.ls and dbutils.fs.mv) to interact with DBFS files in the same way.
This is how Spark is designed: it writes out multiple files in parallel, and writing many files at the same time is faster for big datasets. You can still end up with a single part file by using coalesce(1,true).saveAsTextFile() from the RDD API.
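Combining the two answers, one option is to let Spark write a single part file and then move it to the name you want. The sketch below covers only the move step, and assumes the output directory is reachable through the local file system (on Databricks, for instance, through the /dbfs mount); promote_single_part is a hypothetical helper name. On HDFS the same move would be done with the Hadoop file system tools instead.

```python
import glob
import os
import shutil


def promote_single_part(spark_output_dir, final_path):
    """Move the lone part-*.csv out of a Spark output directory.

    The directory is assumed to come from something like
    df.coalesce(1).write.csv(spark_output_dir, header=True).
    final_path must be outside spark_output_dir, which is deleted afterwards.
    """
    parts = glob.glob(os.path.join(spark_output_dir, "part-*.csv"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.move(parts[0], final_path)
    shutil.rmtree(spark_output_dir)  # removes _SUCCESS and any checksum files
    return final_path
```

The check on the number of part files is deliberate: if the DataFrame was not coalesced to one partition, silently picking one file would drop data.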
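In PySpark, the following code let me write the data directly into a single CSV file (toPandas() collects the whole DataFrame onto the driver, so this only suits data that fits in driver memory):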
df.toPandas().to_csv('FileName.csv')
I'm trying to write all rows of a Spark DataFrame to a file in Databricks, but only some of the rows end up in the file. For example, if the DataFrame count is 100, the file contains only 50 rows, so data is being skipped. How can I write the complete DataFrame to a file without losing rows? I have created a UDF that opens the file and appends data to it, and I call that UDF on the DataFrame in Spark SQL.
Can someone help me on this issue?
I would advise against using a udf the way you are for a few reasons:
UDFs run on the worker nodes, so you would have multiple udfs, each writing a portion of your data to a local file.
Even if you have your UDF appending to a file in a shared location (like the DBFS), you still have multiple nodes writing to a file concurrently, which could lead to errors.
Spark already has a way to do this out of the box that you should take advantage of
To write a spark dataframe to a file in databricks:
Use the DataFrame.write attribute (Databricks docs).
There are plenty of options, so you should be able to do whatever you need (Spark docs (this one is for CSVs))
Note on partitions: Spark writes each partition of the DataFrame to its own file, so use the coalesce function to reduce the number of partitions if you want a single file (warning: coalesce(1) is very slow with extremely large DataFrames, since all of the data has to pass through a single task on one executor)
Note on file locations: On Databricks, Spark writers such as df.write resolve paths against the Databricks File System (DBFS) by default, so give them a "dbfs:/..." path. The "/dbfs/..." prefix is the same storage mounted onto the nodes' local file systems, and it is what local file APIs such as Python's open() need; without it, a local path ends up on the driver node only. Files saved on DBFS are accessible from any cluster in your Databricks instance. (They are also available to download using the Databricks CLI.)
Full Example:
# "col_a" and "col_b" are placeholder names; list the columns you want
df_to_write = my_df.select("col_a", "col_b")
df_to_write.coalesce(1).write.csv("dbfs:/myFileDownloads/dataframeDownload.csv")
I am converting txt files into XML format using pure Python. I have a list of txt files ranging from 1 KB to 2.5 GB. When converted, the size grows about 5x.
The issue is that when processing the larger 2.5 GB files, the first file works but subsequent processing hangs and gets stuck at "Running command...". Smaller files seem to work with no issue.
I've edited the code to make sure it's using generators and not keeping large lists in memory.
I'm processing from dbfs so connection should not be an issue.
Doing memory checks show that it's consistently using only ~200Mb of memory and the size does not grow.
Large files take about 10 mins to process.
No GC warnings or other errors in the logs
Azure Databricks, Pure Python
Cluster is large enough and using only Python so that shouldn't be the issue.
Restarting cluster is the only thing that gets things working again.
Stuck command also causes other notebooks on the cluster to not work.
Basic code outline with redaction for simplicity.
# list of files to convert that are in Azure Blob Storage
text_files = ['file1.txt','file2.txt','file3.txt']
# loop over files and convert them to xml
for file in text_files:
    xml_filename = file.replace('.txt','.xml')
    # copy files from blob storage to dbfs
    dbutils.fs.cp(f'dbfs:/mnt/storage_account/projects/xml_converter/input/{file}',f'dbfs:/tmp/temporary/{file}')
    # open files and convert to xml
    with open(f'/dbfs/tmp/temporary/{file}','r') as infile, open(f'/dbfs/tmp/temporary/{xml_filename}','a', encoding="utf-8") as outfile:
        # list of strings to join at write time
        to_write = []
        for line in infile:
            # convert to xml
            # code redacted for simplicity
            to_write.append(new_xml)
            # batch the write operations to avoid huge lists
            if len(to_write) > 10_000:
                outfile.write(''.join(to_write))
                to_write = [] # reset the batch
        # do a final write of anything that is in the list
        outfile.write(''.join(to_write))
    # move completed files from dbfs to blob storage
    dbutils.fs.cp(f'dbfs:/tmp/temporary/{xml_filename}',f"/mnt/storage_account/projects/xml_converter/output/{xml_filename}")
I would expect this code to run with no issues. Memory doesn't seem to be the problem. The data is in dbfs so it's not a blob issue. It's using generators so not much is in memory. I'm at a loss. Any suggestions would be appreciated. Thanks for looking!
Have you tried copying the files from Azure Storage to the local Databricks /tmp/ folder instead of using DBFS? I had a similar issue when unpacking large .zip files, and that fixed the problem. Have a look here: https://docs.databricks.com/data/databricks-file-system.html
Side note: Since you are using pure Python the workers are not used for processing the files. You can switch to a single node setup.
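A rough sketch of the local-disk suggestion, assuming the conversion can be expressed as a function applied to each line: the input is copied to local scratch space in one pass, converted there, and only the finished file is copied back. convert_via_local_disk and convert_line are hypothetical names, and the local disk must have room for both the input and the roughly 5x larger output. On Databricks the copies could also be done with dbutils.fs.cp using a "file:/tmp/..." destination.

```python
import os
import shutil
import tempfile


def convert_via_local_disk(src_path, dst_path, convert_line):
    """Copy src to local scratch, convert line by line, copy the result to dst.

    src_path and dst_path may point at mounted storage (for example a /dbfs
    path); all line-by-line reads and writes happen on the local disk.
    """
    with tempfile.TemporaryDirectory() as scratch:
        local_in = os.path.join(scratch, "input.txt")
        local_out = os.path.join(scratch, "output.xml")
        shutil.copyfile(src_path, local_in)
        with open(local_in, "r", encoding="utf-8") as infile, \
                open(local_out, "w", encoding="utf-8") as outfile:
            for line in infile:
                outfile.write(convert_line(line))
        shutil.copyfile(local_out, dst_path)
    return dst_path
```

The output is opened in "w" mode rather than "a", so a rerun after a failed attempt starts from an empty file instead of appending to a partial one.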
This behavior comes from the environment. If the script is pure Python, it runs only on the driver node of the Databricks cluster, which makes a multi-node cluster an expensive way to do single-node processing. Plain Python will generally perform better than PySpark on smaller data sets, but you will see the difference once you are dealing with larger data sets.
I have 1000 csv files that are to be processed in parallel using map function available in spark. I have two desktops connected in a cluster and I'm using the pyspark shell for computation. I am passing the name of csv files into the map function and the function accesses the files based on name. However, I need to copy files to the slave for the process to function properly. This means there has to be a copy of all the csv files on the other system. Kindly suggest an alternative storage while avoiding data transfer latency.
I also tried storing these files into a 3-d array and generating an RDD by using parallelize command. But that gives out of memory error.
You can use spark-csv to load the files (on Spark 2.x and later this CSV support is built in):
https://github.com/databricks/spark-csv
Then you can use DataFrames to pre-process the files.
Since there are 1000 CSV files, if there is some link among them, use Spark SQL to run operations on them and then extract your output for the final computation.
If that doesn't work, you can try loading the data into HBase or Hive and then use Spark to compute. I tested this with 100 GB of CSV content on my single-node cluster.
It may help