Copy large spark Dataframe on disk [duplicate] - python

I am using https://github.com/databricks/spark-csv and trying to write a single CSV file, but I am not able to: Spark creates a folder instead.
I need a Scala function that takes parameters such as a path and a file name and writes that CSV file.

It is creating a folder with multiple files, because each partition is saved individually. If you need a single output file (still in a folder) you can repartition (preferred if upstream data is large, but requires a shuffle):
df
.repartition(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("mydata.csv")
or coalesce:
df
.coalesce(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("mydata.csv")
Either operation is applied to the data frame before saving.
All data will be written to mydata.csv/part-00000. Before you use this option, be sure you understand what is going on and what the cost of transferring all data to a single worker is. If you use a distributed file system with replication, data will be transferred multiple times: first fetched to a single worker and subsequently distributed over storage nodes.
Alternatively you can leave your code as it is and use general purpose tools like cat or HDFS getmerge to simply merge all the parts afterwards.
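As a rough illustration of the "merge the parts afterwards" idea, here is a minimal Python sketch that works on a local (or locally mounted) output folder. The function name merge_part_files and the has_header flag are made up for this example; it assumes the part files are plain text, that their sorted names give the right order, and that any header occupies exactly one line.

```python
import shutil
import tempfile
from pathlib import Path


def merge_part_files(src_dir, dst_file, has_header=False):
    """Concatenate part-* files from src_dir into dst_file, in name order."""
    parts = sorted(p for p in Path(src_dir).iterdir() if p.name.startswith("part-"))
    with open(dst_file, "wb") as out:
        for index, part in enumerate(parts):
            with open(part, "rb") as src:
                if has_header and index > 0:
                    src.readline()  # drop the repeated header line
                shutil.copyfileobj(src, out)
    return len(parts)


# Tiny demo with two fake part files that each carry a header.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "part-00000").write_text("id,name\n1,a\n")
(demo_dir / "part-00001").write_text("id,name\n2,b\n")
(demo_dir / "_SUCCESS").write_text("")
merged_path = demo_dir.parent / (demo_dir.name + "_merged.csv")
merged_count = merge_part_files(demo_dir, merged_path, has_header=True)
merged_text = merged_path.read_text()
```

Files such as _SUCCESS are skipped because only names starting with part- are picked up. A plain cat or getmerge does not drop repeated headers, which is why writing with the header disabled is the simpler choice when merging that way.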

If you are running Spark with HDFS, I've been solving the problem by writing csv files normally and leveraging HDFS to do the merging. I'm doing that in Spark (1.6) directly:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
def merge(srcPath: String, dstPath: String): Unit = {
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
// the "true" setting deletes the source files once they are merged into the new output
}
val newData = << create your dataframe >>
val outputfile = "/user/feeds/project/outputs/subject"
var filename = "myinsights"
var outputFileName = outputfile + "/temp_" + filename
var mergedFileName = outputfile + "/merged_" + filename
var mergeFindGlob = outputFileName
newData.write
.format("com.databricks.spark.csv")
.option("header", "false")
.mode("overwrite")
.save(outputFileName)
merge(mergeFindGlob, mergedFileName )
newData.unpersist()
Can't remember where I learned this trick, but it might work for you.

I might be a little late to the game here, but while coalesce(1) or repartition(1) may work for small datasets, large datasets would all be thrown into one partition on one node. This is likely to throw OOM errors or, at best, to process slowly.
I would highly suggest that you use the FileUtil.copyMerge() function from the Hadoop API. This will merge the outputs into a single file.
EDIT - This effectively brings the data to the driver rather than an executor node. Coalesce() would be fine if a single executor has more RAM for use than the driver.
EDIT 2: copyMerge() is being removed in Hadoop 3.0. See the following Stack Overflow question for more information on how to work with the newest version: How to do CopyMerge in Hadoop 3.0?

If you are using Databricks and can fit all the data into RAM on one worker (and thus can use .coalesce(1)), you can use dbfs to find and move the resulting CSV file:
val fileprefix= "/mnt/aws/path/file-prefix"
dataset
.coalesce(1)
.write
//.mode("overwrite") // I usually don't use this, but you may want to.
.option("header", "true")
.option("delimiter","\t")
.csv(fileprefix+".tmp")
val partition_path = dbutils.fs.ls(fileprefix+".tmp/")
.filter(file=>file.name.endsWith(".csv"))(0).path
dbutils.fs.cp(partition_path,fileprefix+".tab")
dbutils.fs.rm(fileprefix+".tmp",recurse=true)
If your file does not fit into RAM on the worker, you may want to consider chaotic3quilibrium's suggestion to use FileUtil.copyMerge(). I have not done this, and don't yet know whether it is possible, e.g., on S3.
This answer is built on previous answers to this question as well as my own tests of the provided code snippet. I originally posted it to Databricks and am republishing it here.
The best documentation for dbfs's rm's recursive option I have found is on a Databricks forum.

Spark's df.write API creates multiple part files inside the given path. To force Spark to write only a single part file, use df.coalesce(1).write.csv(...) instead of df.repartition(1).write.csv(...), because coalesce is a narrow transformation whereas repartition is a wide transformation; see Spark - repartition() vs coalesce()
df.coalesce(1).write.csv(filepath,header=True)
This will create a folder at the given filepath containing one part-0001-...-c000.csv file.
Use
cat filepath/part-0001-...-c000.csv > filename_you_want.csv
to get a user-friendly filename.

This answer expands on the accepted answer, gives more context, and provides code snippets you can run in the Spark Shell on your machine.
More context on accepted answer
The accepted answer might give you the impression the sample code outputs a single mydata.csv file and that's not the case. Let's demonstrate:
val df = Seq("one", "two", "three").toDF("num")
df
.repartition(1)
.write.csv(sys.env("HOME")+ "/Documents/tmp/mydata.csv")
Here's what's outputted:
Documents/
  tmp/
    mydata.csv/
      _SUCCESS
      part-00000-b3700504-e58b-4552-880b-e7b52c60157e-c000.csv
N.B. mydata.csv is a folder in the accepted answer - it's not a file!
How to output a single file with a specific name
We can use spark-daria to write out a single mydata.csv file.
import com.github.mrpowers.spark.daria.sql.DariaWriters
DariaWriters.writeSingleFile(
df = df,
format = "csv",
sc = spark.sparkContext,
tmpFolder = sys.env("HOME") + "/Documents/better/staging",
filename = sys.env("HOME") + "/Documents/better/mydata.csv"
)
This'll output the file as follows:
Documents/
  better/
    mydata.csv
S3 paths
You'll need to pass s3a paths to DariaWriters.writeSingleFile to use this method in S3:
DariaWriters.writeSingleFile(
df = df,
format = "csv",
sc = spark.sparkContext,
tmpFolder = "s3a://bucket/data/src",
filename = "s3a://bucket/data/dest/my_cool_file.csv"
)
See here for more info.
Avoiding copyMerge
copyMerge was removed from Hadoop 3. The DariaWriters.writeSingleFile implementation uses fs.rename, as described here. Spark 3 still used Hadoop 2, so copyMerge implementations will work in 2020. I'm not sure when Spark will upgrade to Hadoop 3, but better to avoid any copyMerge approach that'll cause your code to break when Spark upgrades Hadoop.
Source code
Look for the DariaWriters object in the spark-daria source code if you'd like to inspect the implementation.
PySpark implementation
It's easier to write out a single file with PySpark because you can convert the DataFrame to a Pandas DataFrame that gets written out as a single file by default.
from pathlib import Path
home = str(Path.home())
data = [
("jellyfish", "JALYF"),
("li", "L"),
("luisa", "LAS"),
(None, None)
]
df = spark.createDataFrame(data, ["word", "expected"])
df.toPandas().to_csv(home + "/Documents/tmp/mydata-from-pyspark.csv", sep=',', header=True, index=False)
Limitations
The DariaWriters.writeSingleFile Scala approach and the df.toPandas() Python approach only work for small datasets; df.toPandas() in particular collects every row into the driver's memory. Huge datasets cannot be written out as single files with these approaches. Writing out data as a single file isn't optimal from a performance perspective because the data can't be written in parallel.
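For data that is too big for toPandas() but still acceptable as one local file, a possible middle ground is to stream rows to the file one at a time. The sketch below is only an illustration: write_rows_to_csv and fake_rows are made-up names, and fake_rows stands in for an iterator such as the one returned by PySpark's df.toLocalIterator(), which yields rows without holding the whole DataFrame on the driver at once.

```python
import csv
import tempfile
from pathlib import Path


def write_rows_to_csv(rows, path, header):
    """Write an iterable of row tuples to one CSV file, one row at a time."""
    count = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


def fake_rows():
    # Hypothetical stand-in for df.toLocalIterator(); yields tuples lazily.
    yield ("jellyfish", "JALYF")
    yield ("li", "L")
    yield ("luisa", "LAS")


stream_path = Path(tempfile.mkdtemp()) / "mydata-streamed.csv"
rows_written = write_rows_to_csv(fake_rows(), stream_path, ["word", "expected"])
stream_lines = stream_path.read_text().splitlines()
```

This is still a single-threaded write on the driver, so it trades speed for a bounded memory footprint; it does not remove the performance limitation described above.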

I'm using this in Python to get a single file:
df.toPandas().to_csv("/tmp/my.csv", sep=',', header=True, index=False)

A solution that works for S3 modified from Minkymorgan.
Simply pass the temporary partitioned directory path (with a different name than the final path) as srcPath and the single final csv/txt file as dstPath. Also specify deleteSource if you want to remove the original directory.
/**
 * Merges multiple partitions of Spark text file output into a single file.
 * Relies on a SparkSession named spark being in scope.
 * @param srcPath source directory of partitioned files
 * @param dstPath output path of the single merged file
 * @param deleteSource whether or not to delete source directory after merging
 */
def mergeTextFiles(srcPath: String, dstPath: String, deleteSource: Boolean): Unit = {
  import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
  import java.net.URI
  val config = spark.sparkContext.hadoopConfiguration
  val fs: FileSystem = FileSystem.get(new URI(srcPath), config)
  FileUtil.copyMerge(
    fs, new Path(srcPath), fs, new Path(dstPath), deleteSource, config, null
  )
}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql.{DataFrame,SaveMode,SparkSession}
import org.apache.spark.sql.functions._
I solved this using the approach below (renaming the file in HDFS):
Step 1:- (Create the DataFrame and write it to HDFS)
df.coalesce(1).write.format("csv").option("header", "false").mode(SaveMode.Overwrite).save("/hdfsfolder/blah/")
Step 2:- (Create the Hadoop config)
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
Step 3:- (Get the HDFS folder path)
val pathFiles = new Path("/hdfsfolder/blah/")
Step 4:- (Get the Spark file names from the HDFS folder)
val fileNames = hdfs.listFiles(pathFiles, false)
println(fileNames)
Step 5:- (Create a Scala mutable list and add all the file names to it)
var fileNamesList = scala.collection.mutable.MutableList[String]()
while (fileNames.hasNext) {
  fileNamesList += fileNames.next().getPath.getName
}
println(fileNamesList)
Step 6:- (Filter the _SUCCESS file out of the list of file names)
// keep only the file names that are not _SUCCESS
val partFileName = fileNamesList.filterNot(filenames => filenames == "_SUCCESS")
Step 7:- (Convert the list to a string, build the source and target paths, and then apply rename)
// the source folder must be the folder written in Step 1
val partFileSourcePath = new Path("/hdfsfolder/blah/" + partFileName.mkString(""))
val desiredCsvTargetPath = new Path("/yourhdfsfolder/" + "op_" + ".csv")
hdfs.rename(partFileSourcePath, desiredCsvTargetPath)

spark.sql("select * from df").coalesce(1).write.mode("append").option("header","true").csv("/your/hdfs/path/")
spark.sql("select * from df") --> this is the DataFrame
coalesce(1) or repartition(1) --> this reduces the output to a single part file
write --> writes the data
mode("append") --> appends data to the existing directory (the save mode is set with mode(), not with option())
option("header","true") --> enables the header
csv("<hdfs dir>") --> writes CSV and sets the output location in HDFS

repartition/coalesce to 1 partition before you save (you'd still get a folder but it would have one part file in it)

You can use rdd.coalesce(1, true).saveAsTextFile(path)
It will store the data as a single file in path/part-00000

Here is a helper function with which you can get a single result-file without the part-0000 and without a subdirectory on S3 and AWS EMR:
def renameSinglePartToParentFolder(directoryUrl: String): Unit = {
import sys.process._
val lsResult = s"aws s3 ls ${directoryUrl}/" !!
val partFilename = lsResult.split("\n").map(_.split(" ").last).filter(_.contains("part-0000")).last
s"aws s3 rm ${directoryUrl}/_SUCCESS" !
s"aws s3 mv ${directoryUrl}/${partFilename} ${directoryUrl}" !
}
val targetPath = "s3://my-bucket/my-folder/my-file.csv"
df.coalesce(1).write.csv(targetPath)
renameSinglePartToParentFolder(targetPath)
Write to a single part-0000... file.
Use AWS CLI to list all files and rename the single file accordingly.
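The listing step above boils down to picking the part file name out of the text that aws s3 ls prints. A small Python sketch of just that parsing step follows; find_part_key and the sample listing are made up for illustration, and it assumes the usual "date time size name" layout with no spaces in the object name.

```python
def find_part_key(ls_output):
    """Return the name of the first listed object that starts with 'part-'."""
    for line in ls_output.splitlines():
        fields = line.split()
        if fields and fields[-1].startswith("part-"):
            return fields[-1]
    return None


# Made-up listing in the usual "date time size name" layout.
sample_listing = (
    "2020-01-01 10:00:00          0 _SUCCESS\n"
    "2020-01-01 10:00:00       1234 part-00000-abc-c000.csv\n"
)
part_key = find_part_key(sample_listing)
missing_key = find_part_key("2020-01-01 10:00:00          0 _SUCCESS\n")
```

Returning None when nothing matches lets the caller fail loudly instead of building a move command with an empty file name. With coalesce(1) there should be exactly one match, so taking the first or the last one gives the same result.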

By using a ListBuffer we can save the data into a single file:
import java.io.FileWriter
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ListBuffer
val text = spark.read.textFile("filepath")
var data = ListBuffer[String]()
for(line:String <- text.collect()){
data += line
}
val writer = new FileWriter("filepath")
data.foreach(line => writer.write(line.toString+"\n"))
writer.close()

def export_csv(
  fileName: String,
  filePath: String
) = {
  val filePathDestTemp = filePath + ".dir/"
  val merstageout_df = spark.sql(merstageout)
  merstageout_df
    .coalesce(1)
    .write
    .option("header", "true")
    .mode("overwrite")
    .csv(filePathDestTemp)
  val listFiles = dbutils.fs.ls(filePathDestTemp)
  for (subFiles <- listFiles) {
    val subFiles_name: String = subFiles.name
    if (subFiles_name.slice(subFiles_name.length() - 4, subFiles_name.length()) == ".csv") {
      dbutils.fs.cp(filePathDestTemp + subFiles_name, filePath + fileName + ".csv")
      dbutils.fs.rm(filePathDestTemp, recurse = true)
    }
  }
}

There is one more way, using plain Java I/O:
import java.io._
def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit): Unit = {
  val p = new java.io.PrintWriter(f)
  try { op(p) }
  finally { p.close() }
}
// collect() brings every row to the driver, so this only suits small DataFrames
printToFile(new File("C:/TEMP/df.csv")) { p => df.collect().foreach(p.println) }

Related

Read many parquet files from S3 to pandas dataframe

I've been researching this topic for a few days now and have yet to come up with a working solution. Apologies if this question is repetitive (although I have checked for similar questions and have not quite found the right one).
I have an s3 bucket with about 150 parquet files in it. I have been searching for a dynamic way to bring in all of these files to one dataframe (can be multiple, if more computationally efficient). If all of these parquets were appended to one dataframe, it would be a very large amount of data, so if the solution to this is simply that I require more computing power, please do let me know. I have ultimately stumbled across the awswrangler, and am using the below code, which has been running as expected:
df = wr.s3.read_parquet(path="s3://my-s3-data/folder1/subfolder1/subfolder2/", dataset=True, columns = df_cols, chunked=True)
This code has been returning a generator object, which I am not sure how to get into a dataframe. I have tried solutions from the linked pages (below) and have returned various errors such as invalid filepath and length mismatch.
https://newbedev.com/create-a-pandas-dataframe-from-generator
https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html
Create a pandas DataFrame from generator?
Another solution I tried was from https://www.py4u.net/discuss/140245 :
import s3fs
import pyarrow.parquet as pq
fs = s3fs.S3FileSystem()
bucket = "cortex-grm-pdm-auto-design-data"
path = "s3://my-bucket/folder1/subfolder1/subfolder2/"
# Python 3.6 or later
p_dataset = pq.ParquetDataset(
f"s3://my-bucket/folder1/subfolder1/subfolder2/",
filesystem=fs
)
df = p_dataset.read().to_pandas()
which resulted in an error "'AioClientCreator' object has no attribute '_register_lazy_block_unknown_fips_pseudo_regions'"
Lastly, I also tried the many-parquet solution from https://newbedev.com/how-to-read-a-list-of-parquet-files-from-s3-as-a-pandas-dataframe-using-pyarrow :
import boto3
import pandas as pd
# Read multiple parquets from a folder on S3 generated by spark
def pd_read_s3_multiple_parquets(filepath, bucket, s3=None,
                                 s3_client=None, verbose=False, **args):
    if not filepath.endswith('/'):
        filepath = filepath + '/'  # Add '/' to the end
    if s3_client is None:
        s3_client = boto3.client('s3')
    if s3 is None:
        s3 = boto3.resource('s3')
    s3_keys = [item.key for item in s3.Bucket(bucket).objects.filter(Prefix=filepath)
               if item.key.endswith('.parquet')]
    if not s3_keys:
        print('No parquet found in', bucket, filepath)
    elif verbose:
        print('Load parquets:')
        for p in s3_keys:
            print(p)
    # pd_read_s3_parquet is the single-file helper defined on the linked page
    dfs = [pd_read_s3_parquet(key, bucket=bucket, s3_client=s3_client, **args)
           for key in s3_keys]
    return pd.concat(dfs, ignore_index=True)
df = pd_read_s3_multiple_parquets('path/to/folder', 'my_bucket')
This one returned no parquet found in the path (which I am certain is false, the parquets are all there when I visit the actual s3), as well as the error "no objects to concatenate"
Any guidance you can provide is greatly appreciated! Again, apologies for any repetitiveness in my question. Thank you in advance.
AWS Data Wrangler works seamlessly; I have used it.
Install via pip or conda.
Reading multiple parquet files is a one-liner: see example below.
Creds are automatically read from your environment variables.
# this is running on my laptop
import numpy as np
import pandas as pd
import awswrangler as wr
# assume multiple parquet files in 's3://mybucket/etc/etc/'
s3_bucket_uri = 's3://mybucket/etc/etc/'
df = wr.s3.read_parquet(path=s3_bucket_uri)
# df is a pandas DataFrame
The AWS docs, with examples that include your use case, are here:
https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html
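On the generator from the question: with chunked=True, read_parquet hands back an iterator of pandas DataFrames rather than one DataFrame, and pandas can stitch such an iterator together with pd.concat. A minimal sketch, where fake_chunks is a made-up stand-in for the object returned by wr.s3.read_parquet(..., chunked=True):

```python
import pandas as pd


def fake_chunks():
    # Hypothetical stand-in for the generator returned with chunked=True.
    yield pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
    yield pd.DataFrame({"id": [3], "value": ["c"]})


# pd.concat accepts any iterable of DataFrames, including a generator.
combined = pd.concat(fake_chunks(), ignore_index=True)
```

Concatenating everything gives up the memory benefit of chunking, so if the full data does not fit in RAM it is probably better to loop over the chunks and aggregate or write each one as it arrives.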

How to copy a file in pyspark / hadoop from python

I am using pyspark to save a data frame as a parquet file or as a csv file with this:
def write_df_as_parquet_file(df, path, mode="overwrite"):
    df = df.repartition(1)  # join partitions to produce 1 parquet file
    dfw = df.write.format("parquet").mode(mode)
    dfw.save(path)
def write_df_as_csv_file(df, path, mode="overwrite", header=True):
    df = df.repartition(1)  # join partitions to produce 1 csv file
    header = "true" if header else "false"
    dfw = df.write.format("csv").option("header", header).mode(mode)
    dfw.save(path)
But this saves the parquet/csv file inside a folder called path, where it saves a few other files that we don't need, in this way:
Image: https://ibb.co/9c1D8RL
Basically, I would like to create some function that saves the file to a location using the above methods, and then moves the CSV or PARQUET file to a new location. Like:
def write_df_as_parquet_file(df, path, mode="overwrite"):
    # save df in one file inside tmp_folder
    df = df.repartition(1)  # join partitions to produce 1 parquet file
    dfw = df.write.format("parquet").mode(mode)
    tmp_folder = path + "TEMP"
    dfw.save(tmp_folder)
    # move parquet file from tmp_folder to path
    copy_file(tmp_folder + "*.parquet", path)
    remove_folder(tmp_folder)
How can I do that? How do I implement copy_file or remove_folder? I have seen a few solutions in scala, that use the Hadoop api for this, but I have not been able to make this work in python. I think I need to use sparkContext, but I am still learning Hadoop and have not found the way to do it.
You can use one of Python's HDFS libraries to connect to your HDFS instance and then carry out whatever operations required.
From the hdfs3 docs (https://hdfs3.readthedocs.io/en/latest/quickstart.html):
from hdfs3 import HDFileSystem
hdfs = HDFileSystem(host=<host>, port=<port>)
hdfs.mv(tmp_folder + "*.parquet", path)
Wrap the above in a function and you're good to go.
Note: I've just used hdfs3 as an example. You could also use HdfsCLI.
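If the output folder is reachable as an ordinary path (a local disk or a mounted file system), the two helpers from the question can be sketched with the standard library alone. This is only an assumption-laden illustration: copy_file and remove_folder are the names used in the question, the demo folder and file names are made up, and none of this talks to HDFS directly.

```python
import glob
import shutil
import tempfile
from pathlib import Path


def copy_file(pattern, dst):
    """Copy the single file matching a glob pattern to dst."""
    matches = glob.glob(pattern)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match for {pattern!r}, got {len(matches)}")
    shutil.copyfile(matches[0], dst)
    return dst


def remove_folder(path):
    """Delete a folder and everything inside it."""
    shutil.rmtree(path)


# Demo: fake Spark output folder with one part file plus a marker file.
work_dir = Path(tempfile.mkdtemp())
tmp_folder = work_dir / "outTEMP"
tmp_folder.mkdir()
(tmp_folder / "part-00000-x.parquet").write_bytes(b"fake parquet bytes")
(tmp_folder / "_SUCCESS").write_text("")
final_path = work_dir / "out.parquet"
copy_file(str(tmp_folder / "*.parquet"), final_path)
remove_folder(tmp_folder)
```

Note the path separator between the folder and the pattern: the glob has to be built as folder/*.parquet, otherwise nothing matches. Raising on zero or several matches guards against a job that silently wrote more than one part file.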

Renaming spark output csv in azure blob storage

I have a Databricks notebook set up that works as follows:
pyspark connection details to Blob storage account
Read file through spark dataframe
convert to pandas Df
data modelling on pandas Df
convert to spark Df
write to blob storage in single file
My problem is that I cannot choose the name of the output file, and I need a static CSV filename.
Is there a way to rename it in PySpark?
## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""
## File location and File type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"
## Connection string to connect to blob storage
spark.conf.set(
"fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
storage_account_access_key)
Followed by outputting file after data transformation
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
The file is then written as "part-00000-tid-336943946930983.....csv",
whereas the goal is to have "Output.csv".
Another approach I looked at was recreating this in plain Python, but I have not yet found in the documentation how to write the file back to Blob storage.
I know the method to retrieve a file from Blob storage is .get_blob_to_path, per the Microsoft docs.
Any help here is greatly appreciated.
Hadoop/Spark writes the result of each partition in parallel to its own file, so you will see many part-<number>-.... files in an HDFS output path such as the Output/ directory you named.
If you want all results of a computation in one file, you can merge them with the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or reduce the number of output partitions to 1, for example with the coalesce(1) function.
So in your scenario, coalesce(1) has to be called on the DataFrame before .write, because the DataFrameWriter returned by .write has no coalesce method. The output is still a directory that contains a single part file:
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
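For an output directory that is reachable through the local filesystem, the effect of getmerge can be approximated in plain Python. The following is a minimal sketch, not a replacement for the Hadoop command: merge_part_files is a hypothetical helper, and it assumes that every part file starts with the same one-line header when has_header is set. Plain concatenation would otherwise repeat the header once per part.

```python
import glob
import os
import shutil


def merge_part_files(src_dir, dst_file, has_header=True):
    """Concatenate the part-* files in src_dir into dst_file.

    When has_header is True, the header line is kept from the first
    part and skipped in all later parts. Returns the number of parts.
    """
    # "part-*" skips _SUCCESS and hidden .crc files
    parts = sorted(glob.glob(os.path.join(src_dir, "part-*")))
    with open(dst_file, "w", newline="") as out:
        for index, part in enumerate(parts):
            with open(part, "r", newline="") as src:
                if has_header and index > 0:
                    src.readline()  # drop the repeated header line
                shutil.copyfileobj(src, out)
    return len(parts)
```

Sorting the part names keeps the rows in partition order, which matches the order in which getmerge concatenates them.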
The coalesce and repartition do not help with saving the dataframe into 1 normally named file.
I ended up renaming the single CSV file and deleting the temporary folder with the log files:
import os
def save_csv(df, location, filename):
    outputPath = os.path.join(location, filename + '_temp.csv')
    df.repartition(1).write.format("com.databricks.spark.csv").mode("overwrite").options(header="true", inferSchema="true").option("delimiter", "\t").save(outputPath)
    # list the temp folder through the local /dbfs mount
    csv_files = os.listdir(os.path.join('/dbfs', outputPath))
    # move the single part-* csv file to its final, normally named location
    for file in csv_files:
        if file[-4:] == '.csv':
            dbutils.fs.mv(os.path.join(outputPath, file), os.path.join(location, filename))
    # remove the temp folder together with the remaining log files
    dbutils.fs.rm(outputPath, True)
# using save_csv
save_csv_location = 'mnt/.....'
save_csv(df, save_csv_location, 'name.csv')
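The rename-and-clean-up step can also be tried without Spark or dbutils on any path that is visible through the local filesystem (the answer's os.listdir call relies on a /dbfs mount in the same way). This is a sketch under that assumption; promote_single_csv is a hypothetical name, and it expects exactly one part-*.csv file in the output directory.

```python
import glob
import os
import shutil


def promote_single_csv(output_dir, target_path):
    """Move the only part-*.csv in output_dir to target_path, then delete output_dir."""
    parts = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.move(parts[0], target_path)
    # removes _SUCCESS, .crc and any other leftover files
    shutil.rmtree(output_dir)
    return target_path
```

Failing loudly when there is more than one part file avoids silently keeping only the last one, which is what a move inside a loop would do.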

writing a csv with column names and reading a csv file which is being generated from a sparksql dataframe in Pyspark

I started the shell with the Databricks CSV package:
#../spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.3.0
Then I read a CSV file, did some groupBy operations, and dumped the result to a CSV.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('path.csv') ####it has columns and df.columns works fine
type(df) #<class 'pyspark.sql.dataframe.DataFrame'>
#now trying to dump a csv
df.write.format('com.databricks.spark.csv').save('path+my.csv')
#it creates a directory my.csv with 2 partitions
### To create single file i followed below line of code
#df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("path+file_satya.csv") ## this creates one partition in directory of csv name
#but in both cases no columns information(How to add column names to that csv file???)
# again i am trying to read that csv by
df_new = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("the file i just created.csv")
#i am not getting any columns in that..1st row becomes column names
Please don't answer with "add a schema to the dataframe after read_csv" or "specify the column names while reading".
Question 1: when dumping to CSV, is there any way to include the column names?
Question 2: is there a way to create a single CSV file (not a directory again) that can be opened with MS Office or Notepad++?
Note: I am currently not using a cluster, as it is too complex for a Spark beginner like me. If anyone can provide a link on how to handle to_csv into a single file in a clustered environment, that would be a great help.
Try
df.coalesce(1).write.format('com.databricks.spark.csv').save('path+my.csv',header = 'true')
Note that this may not be an issue on your current setup, but on extremely large datasets, you can run into memory problems on the driver. This will also take longer (in a cluster scenario) as everything has to push back to a single location.
Just in case: on Spark 2.1 you can create a single CSV file with the following lines
dataframe.coalesce(1) //So just a single part- file will be created
.write.mode(SaveMode.Overwrite)
.option("mapreduce.fileoutputcommitter.marksuccessfuljobs","false") //Avoid creating the _SUCCESS marker file
.option("header","true") //Write the header
.csv("csvFullPath")
With Spark >= 2.0, we can do something like
df = spark.read.csv('path+filename.csv', sep = 'ifany',header='true')
df.write.csv('path_filename of csv',header=True) ###yes still in partitions
df.toPandas().to_csv('path_filename of csv',index=False) ###single csv(Pandas Style)
The following should do the trick:
df \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Alternatively, if you want the results to be in a single partition, you can use coalesce(1):
df \
.coalesce(1) \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Note however that this is an expensive operation and might not be feasible with extremely large datasets.
I got the answer to the 1st question: it was a matter of passing one extra parameter, header = 'true', to the save call
df.write.format('com.databricks.spark.csv').save('path+my.csv',header = 'true')
#Alternative for 2nd question
Using toPandas().to_csv works, but again I don't want to use pandas here, so please suggest another way if there is one.
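One pandas-free possibility, when the result is small enough to pass through a single machine, is to write the rows with Python's standard csv module, which also quotes values that contain commas (the ",".join approach does not). This is only a sketch under that assumption; write_csv_with_header is a hypothetical helper that accepts any iterable of row tuples.

```python
import csv


def write_csv_with_header(path, columns, rows):
    """Write columns as the header row, then every row, with proper CSV quoting.

    Returns the number of data rows written.
    """
    count = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

With a DataFrame, the call could look like write_csv_with_header("my.csv", df.columns, df.toLocalIterator()). This streams the rows through the driver one partition at a time and produces a single ordinary file rather than a directory.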
