Reading CSV file written by DataFrameWriter PySpark - python

I had a dataframe which I wrote to a CSV using the code below:
df.write.format("csv").save(base_path+"avg.csv")
As I am running Spark in client mode, the snippet above created a folder named avg.csv, and that folder contains part-*.csv files on my worker node (or a nested folder containing the part-*.csv files).
Now when I try to read avg.csv, I get a "path doesn't exist" error.
df.read.format("com.databricks.spark.csv").load(base_path+"avg.csv")
Can anybody tell me where I am going wrong?

part-00* files are the output of a distributed computation (like MapReduce or Spark). So a folder containing part files will always be created when you save, because the output comes from distributed processing; this is something to keep in mind.
So, try using:
df.read.format("com.databricks.spark.csv").load(base_path+"avg.csv/*")

Related

How to write a CSV file without creating a folder in PySpark?

While writing a CSV file, a folder is automatically created and then a CSV file with a cryptic name is created inside it. How can I create this CSV with a specific name, without creating a folder, in PySpark (not in pandas)?
That's just the way Spark works with its parallelizing mechanism. A Spark application is meant to have one or more workers read your data and write it to a location. When you write a CSV file, having a directory with multiple files is how multiple workers can write at the same time.
If you're using HDFS, you can consider writing a bash script to move or reorganize the files the way you want.
If you're using Databricks, you can use dbutils.ls to interact with DBFS files in the same way.
This is the way Spark is designed: it writes out multiple files in parallel, which is faster for big datasets. You can still get a single file by using coalesce(1, true).saveAsTextFile(). You can refer here.
In PySpark, the following code helped me write the data directly into a CSV file:
df.toPandas().to_csv('FileName.csv')
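If the data is too large for toPandas(), a rough sketch combining the coalesce(1) idea above with a rename step; the paths are hypothetical and assume a filesystem the driver can see:
import glob
import shutil

tmp_dir = "output/tmp_csv"  # hypothetical temporary folder Spark will create
df.coalesce(1).write.option("header", "true").mode("overwrite").csv(tmp_dir)

# Spark still writes a folder; grab the single part file and copy it to the name you want.
part_file = glob.glob(tmp_dir + "/part-*.csv")[0]
shutil.copy(part_file, "output/FileName.csv")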

How do I create a folder inside Datalake gen 1 while saving a pandas dataframe as csv?

I'm new to Databricks and basically I'm trying to save a pandas dataframe to data lake storage. The data lake is mounted, so when I save the file to a folder that already exists it works perfectly fine. However, when I try to save the CSV file to a folder that hasn't been created yet, it throws an error that the folder does not exist. I was under the assumption that if I gave a path that doesn't exist, the folder would be created by itself.
For example, folders exist up to snapshot, so the code below works perfectly fine:
df.to_csv("/dbfs/mnt/test/snapshot/test.csv", index=False)
but when I try saving inside a folder that hasn't been created yet, it throws an error:
df.to_csv("/dbfs/mnt/test/snapshot/2020/08/27/test.csv", index=False)
Is there a way to achieve this via code instead of manually creating the folders?
Thank you in advance
You can create the folder beforehand using dbutils.fs.mkdirs():
dbutils.fs.mkdirs("/mnt/test/snapshot/2020/08/27")
df.to_csv("/dbfs/mnt/test/snapshot/2020/08/27/test.csv", index=False)

How to create a new dataframe from CSV files in a folder with subfolders in PySpark in S3

Hi, I'm very new to PySpark and S3. I have a problem at hand: I have a folder that consists of subfolders and files (all CSVs), including the files inside the subfolders. I need to create a new dataframe or CSV file that combines the contents of all these files into a single one, which later needs to be read into a table in Postgres.
Can anyone please help me? I have code in Python, but I'm not sure how to go about this with PySpark and S3.
Try with this option.
recursiveFileLookup – recursively scan a directory for files. Using this option disables partition discovery.
df = spark.read.option("header","true").option("recursiveFileLookup","true").csv("s3://path/to/root/")
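If the combined contents also need to land as a single file before loading them into Postgres, a rough follow-up sketch reusing df from above; the output path is hypothetical and this assumes the data fits comfortably in one partition:
# coalesce(1) forces a single part file inside the output folder;
# fine for modest data, slow for very large data.
(df.coalesce(1)
   .write
   .option("header", "true")
   .mode("overwrite")
   .csv("s3://my-bucket/combined/"))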

pyspark creates output file as folder

PySpark creates a folder instead of a file. For the command below, it creates an empty folder named proto.parquet in the directory.
df.write.parquet("output/proto.parquet")
I tried with CSV and other formats, but the result is still the same.
The fact that Spark creates a folder instead of a file is the expected behavior. The reason is that Spark is a distributed system: data is processed in partitions, and each worker node writes its data out to a part file.
So what you are seeing is the way it should work. It works the same way with MapReduce.
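To confirm the folder output is usable, you can read the whole directory back in one call; a small sketch, assuming the path from the question and a SparkSession named spark:
# Spark treats the proto.parquet folder as a single dataset.
df_back = spark.read.parquet("output/proto.parquet")
df_back.show(5)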

How to save a file on the cluster

I'm connected to the cluster using ssh and I send the program to the cluster using
spark-submit --master yarn myProgram.py
I want to save the result in a text file and I tried using the following lines:
counts.write.json("hdfs://home/myDir/text_file.txt")
counts.write.csv("hdfs://home/myDir/text_file.csv")
However, neither of them works. The program finishes and I cannot find the text file in myDir. Do you have any idea how I can do this?
Also, is there a way to write directly to my local machine?
EDIT: I found out that the home directory doesn't exist, so now I save the result as:
counts.write.json("hdfs:///user/username/text_file.txt")
But this creates a directory named text_file.txt, and inside it there are a lot of files with partial results. I want one file with the final result inside. Any ideas how I can do this?
Spark will save the results in multiple files since the computation is distributed. Therefore writing:
counts.write.csv("hdfs://home/myDir/text_file.csv")
means to save the data on each partition as a separate file in the folder text_file.csv. If you want the data saved as a single file, use coalesce(1) first:
counts.coalesce(1).write.csv("hdfs://home/myDir/text_file.csv")
This will put all the data into a single partition and the number of saved files will thus be 1. However, this could be a bad idea if you have a lot of data. If the data is very small then using collect() is an alternative. This will put all data onto the driver machine as an array, which can then be saved as a single file.
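A minimal sketch of the collect() route, assuming counts is small enough to fit on the driver and that you want a plain local file (the path is hypothetical):
import csv

rows = counts.collect()  # pulls every row onto the driver
with open("/home/myDir/text_file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(counts.columns)  # header row
    writer.writerows(rows)           # each Row behaves like a tuple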
You can concatenate your results into one file from the command line:
hadoop fs -cat hdfs:///user/username/text_file.txt/* > path/to/local/file.txt
This should be faster than using coalesce - in my experience all collect()-type operations are slow because all of the data is funneled through the master node. Furthermore, you can run into trouble with collect() if your data exceeds the memory on your master node.
However, a potential pitfall with this approach is that you will have to explicitly remove the files from a previous run (since the current run may not produce exactly the same number of files). There may be a flag to do this with each run, but I am not sure.
To remove:
hadoop fs -rm -r hdfs:///user/username/text_file.txt/*
Do you get any error? Maybe you can check if you have the correct permissions to write/read from that folder.
Also note that Spark by default will create a folder called text_file.txt with some files inside, depending on the number of partitions you have.
If you want to write to your local machine, you can specify the path with file:///home/myDir/text_file.txt. If you use a path like /user/hdfs/..., it is written to HDFS by default.
To get a single file (though not named the way you want), you need to apply .repartition(1), look here, to your RDD.
I suppose that your HDFS path is wrong. In Spark, HDFS is the default filesystem for text files, and in Hadoop (by default) there is no home directory in the root directory unless you have created it beforehand.
If you want a csv/txt file (with that extension), the only way to write it is without RDD or DataFrame functions: use the usual Python csv and io libraries, after you have collected your RDD into a matrix with .collect() (the dataset must not be huge).
If you want to write directly to your local filesystem (and not to HDFS), use
counts.write.csv("file:///home/myDir/text_file.csv")
But this won't write a single file with a .csv extension. It will create a folder containing the part-m-0000n files for the n partitions of your dataset.
