How to write to a CSV file without creating a folder in PySpark? - python

When writing a CSV file, a folder is automatically created and a CSV file with a cryptic name is placed inside it. How can I write the CSV with a specific name, without the enclosing folder, in PySpark (not in pandas)?

That's just the way Spark works with its parallelization mechanism. A Spark application is meant to have one or more workers read your data and write it to a location. When you write a CSV file, a directory with multiple files is what lets multiple workers write at the same time.
If you're using HDFS, you can consider writing a small shell script afterwards to move or reorganize the files the way you want.
If you're using Databricks, you can use dbutils.fs (e.g. dbutils.fs.ls, dbutils.fs.mv) to interact with DBFS files in the same way.
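For example, on Databricks one could write with a single partition and then rename the lone part file with dbutils.fs. This is only a sketch, assuming a DataFrame named df; the paths are placeholders:
out_dir = "dbfs:/tmp/report_tmp"
final_path = "dbfs:/tmp/report.csv"
df.coalesce(1).write.mode("overwrite").csv(out_dir, header=True)
# find the single part-*.csv file Spark produced inside the folder
part_file = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, final_path)   # rename to the desired file name
dbutils.fs.rm(out_dir, True)           # clean up the temporary folder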

This is the way Spark is designed: it writes out multiple files in parallel, which is faster for big datasets. But you can still get a single output file by collapsing to one partition, e.g. coalesce(1, shuffle=True).saveAsTextFile(). You can refer here.
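A minimal sketch of that approach, assuming an existing DataFrame df; the output paths are placeholders, and note that each call still creates a folder containing a single part file:
# RDD route: naive comma-joined lines, collapsed to one partition
(df.rdd
   .map(lambda row: ",".join(str(v) for v in row))
   .coalesce(1, shuffle=True)
   .saveAsTextFile("/tmp/single_part_text"))
# DataFrame route: one part file inside the output folder
df.coalesce(1).write.csv("/tmp/single_part_csv", header=True)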

In PySpark, the following code helped me write data directly into a single CSV file (note that it collects everything to the driver via pandas first):
df.toPandas().to_csv('FileName.csv')

Related

Azure Databricks: Python parallel for loop

I am using Azure Databricks to analyze some data. I have the following folder structure in blob storage:
folder_1: n1 csv files
folder_2: n2 csv files
...
folder_k: nk csv files
I want to read these files, run some algorithm (relatively simple) and write out some log files and image files for each of the csv files in a similar folder structure at another blob storage location. Right now I have a simple loop structure to do this:
for folder in folders:
    # set up some stuff
    for file in files:
        # do the work and write out results
The database contains 150k files. Is there a way to parallelize this?
The best way I found to parallelize such embarrassingly parallel tasks in Databricks is using a pandas UDF (https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html).
I created a Spark dataframe with the list of files and folders to loop through and passed it to a pandas UDF with a specified number of partitions (essentially the cores to parallelize over). This can leverage the available cores on a Databricks cluster. There are a few restrictions on what you can call from a pandas UDF (for example, you cannot use dbutils calls directly), but it worked like a charm for my application.
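A rough sketch of this pattern, assuming Spark 3.x, a Python list file_paths holding the CSV paths, and a placeholder run_algorithm for the per-file work; the bucket count and output path are illustrative only:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
paths_df = spark.createDataFrame([(p,) for p in file_paths], ["path"])

def process_files(pdf: pd.DataFrame) -> pd.DataFrame:
    # runs once per bucket on an executor; loop over the files in this bucket
    rows = []
    for path in pdf["path"]:
        # run_algorithm(path) is a placeholder: read the csv, compute,
        # and write the log/image files to the output location
        rows.append({"path": path, "status": "ok"})
    return pd.DataFrame(rows)

result = (paths_df
    .withColumn("bucket", F.abs(F.hash("path")) % 200)   # ~200 parallel buckets
    .groupBy("bucket")
    .applyInPandas(process_files, schema="path string, status string"))
result.write.mode("overwrite").csv("/mnt/output/run_log")   # placeholder path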

PySpark deleted written CSV after job finished

I've literally run the code below, which generates a bunch of CSVs for a dataframe:
df.write.csv(path, mode="overwrite", header=True, quoteAll=True)
I can see the files being generated and updated while the script is running. After some hours of the script running it converts the prefixed attempt- folders to task- and then, all of a sudden, they disappear! I really need to find these files; any help would be much appreciated!
I've never used PySpark -- only ever written to CSV using Pandas. Perhaps try using
df.toPandas().to_csv('yourfilehere.csv')
to regenerate your CSV files, if you can, as a workaround for now? See the documentation for toPandas and to_csv.

Reading a CSV file written by DataFrameWriter in PySpark

I had a dataframe which I wrote to a CSV using the code below:
df.write.format("csv").save(base_path+"avg.csv")
As I am running Spark in client mode, the snippet above created a folder named avg.csv on my worker node, containing part-*.csv files (or a nested folder and then the part-*.csv files).
Now when I try to read avg.csv I get a "path doesn't exist" error.
spark.read.format("com.databricks.spark.csv").load(base_path+"avg.csv")
Can anybody tell me where I am going wrong?
The part-00** files are the output of distributed computation (as in MapReduce or Spark). So a folder with part files will always be created when you store the result, since this is the output of a distributed job; that is something to keep in mind.
So, try using:
spark.read.format("com.databricks.spark.csv").load(base_path+"avg.csv/*")
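As a side note, with Spark 2.x and later the built-in CSV source can point at the folder itself and will pick up all of the part files inside it. A minimal sketch, assuming a SparkSession named spark:
df = spark.read.csv(base_path + "avg.csv")   # reads every part file in the folder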

How to save a file on the cluster

I'm connected to the cluster using ssh and I send the program to the cluster using
spark-submit --master yarn myProgram.py
I want to save the result in a text file and I tried using the following lines:
counts.write.json("hdfs://home/myDir/text_file.txt")
counts.write.csv("hdfs://home/myDir/text_file.csv")
However, neither of them works. The program finishes and I cannot find the text file in myDir. Do you have any idea how I can do this?
Also, is there a way to write directly to my local machine?
EDIT: I found out that the home directory doesn't exist, so now I save the result as:
counts.write.json("hdfs:///user/username/text_file.txt")
But this creates a directory named text_file.txt containing a lot of files with partial results. I want one file with the final result inside. Any ideas how I can do this?
Spark will save the results in multiple files since the computation is distributed. Therefore writing:
counts.write.csv("hdfs://home/myDir/text_file.csv")
means to save the data on each partition as a separate file in the folder text_file.csv. If you want the data saved as a single file, use coalesce(1) first:
counts.coalesce(1).write.csv("hdfs://home/myDir/text_file.csv")
This will put all the data into a single partition and the number of saved files will thus be 1. However, this could be a bad idea if you have a lot of data. If the data is very small then using collect() is an alternative. This will put all data onto the driver machine as an array, which can then be saved as a single file.
You can concatenate your results into one file from the command line:
hadoop fs -cat hdfs:///user/username/text_file.txt/* > path/to/local/file.txt
This should be faster than using coalesce - in my experience all collect()-type operations are slow because all of the data is funneled through the master node. Furthermore, you can run into trouble with collect() if your data exceeds the memory of your master node.
However, a potential pitfall with this approach is that you will have to explicitly remove the files from a previous run (since the current run may not produce exactly the same number of files). There may be a flag to do this with each run, but I am not sure.
To remove:
hadoop fs -rm -r hdfs:///user/username/text_file.txt/*
Do you get any error? Maybe you can check whether you have the correct permissions to read/write in that folder.
Also keep in mind that, by default, Spark will create a folder called text_file.txt with some files inside, depending on the number of partitions you have.
If you want to write to your local machine, you can specify the path with file:///home/myDir/text_file.txt. If you use a path like /user/hdfs/..., it is written to HDFS by default.
To get a single file (though not named the way you want) you need to apply .repartition(1) (look here) to your RDD or DataFrame before writing.
I suppose that your HDFS path is wrong. In Spark, HDFS is the default filesystem for text files, and in Hadoop there is (by default) no home directory under the root directory unless you have created it beforehand.
If you want a csv/txt file with exactly that name and extension, the only way to write it is without the RDD or DataFrame write functions: collect your RDD into a matrix on the driver with .collect() and write it out with Python's usual csv and io libraries (the dataset must not be huge).
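A minimal sketch of that collect-and-write approach, assuming counts is a DataFrame small enough to fit in driver memory; the output path is a placeholder:
import csv
rows = counts.collect()   # pull the (small) result onto the driver
with open("/home/myDir/text_file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(counts.columns)   # header
    for row in rows:
        writer.writerow(list(row))    # a Row behaves like a tuple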
If you want to write directly to your local filesystem (and not to HDFS), use:
counts.write.csv("file:///home/myDir/text_file.csv")
But this won't write a single file with a .csv extension either. It will create a folder containing the part-m-0000n files for the n partitions of your dataset.

PySpark: spit out single file when writing instead of multiple part files

Is there a way to prevent PySpark from creating several small files when writing a DataFrame to JSON file?
If I run:
df.write.format('json').save('myfile.json')
or
df1.write.json('myfile.json')
it creates the folder named myfile and within it I find several small files named part-***, the HDFS way. Is it by any means possible to have it spit out a single file instead?
Well, the answer to your exact question is the coalesce function. But as already mentioned, it is not efficient at all, since it forces one worker to fetch all the data and write it sequentially.
df.coalesce(1).write.format('json').save('myfile.json')
P.S. By the way, the result file is not a valid JSON document; it is a file with one JSON object per line (JSON Lines).
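To illustrate: each line of the part file can be parsed on its own. A small sketch with a placeholder part-file name (actual part file names include a longer suffix):
import json
with open("myfile.json/part-00000") as f:
    records = [json.loads(line) for line in f if line.strip()]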
This was a better solution for me:
rdd.map(json.dumps).saveAsTextFile(json_lines_file_name)
df1.rdd.repartition(1).write.json('myfile.json')
Would be nice, but isn't available (an RDD has no write attribute). Check this related question: https://stackoverflow.com/a/33311467/2843520
