I'm connected to the cluster using ssh and I send the program to the cluster using
spark-submit --master yarn myProgram.py
I want to save the result in a text file and I tried using the following lines:
counts.write.json("hdfs://home/myDir/text_file.txt")
counts.write.csv("hdfs://home/myDir/text_file.csv")
However, neither of them works. The program finishes and I cannot find the text file in myDir. Do you have any idea how I can do this?
Also, is there a way to write directly to my local machine?
EDIT: I found out that home directory doesn't exist so now I save the result as:
counts.write.json("hdfs:///user/username/text_file.txt")
But this creates a directory named text_file.txt, and inside it there are many files with partial results. I want a single file with the final result. Any ideas on how I can do this?
Spark will save the results in multiple files since the computation is distributed. Therefore writing:
counts.write.csv("hdfs://home/myDir/text_file.csv")
means to save the data on each partition as a separate file in the folder text_file.csv. If you want the data saved as a single file, use coalesce(1) first:
counts.coalesce(1).write.csv("hdfs://home/myDir/text_file.csv")
This will put all the data into a single partition and the number of saved files will thus be 1. However, this could be a bad idea if you have a lot of data. If the data is very small then using collect() is an alternative. This will put all data onto the driver machine as an array, which can then be saved as a single file.
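For very small results, a minimal sketch of the collect() route might look like this (the local output path and the JSON-lines format are assumptions for illustration, not part of the original question):
import json

# Collect all rows to the driver -- only safe when the result is small
rows = counts.collect()

# Write one JSON object per line, roughly mirroring what write.json() produces
with open("/home/myDir/text_file.json", "w") as f:  # hypothetical local path
    for row in rows:
        f.write(json.dumps(row.asDict()) + "\n")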
You can concatenate your results into one file from the command line:
hadoop fs -cat hdfs:///user/username/text_file.txt/* > path/to/local/file.txt
This should be faster than using coalesce - in my experience all collect() type operations are slow because all of the data is funneled through the master node. Furthermore, you can run into troubles with collect() if your data exceeds the memory on your master node.
However, a potential pitfall with this approach is that you will have to explicitly remove the files from a previous run (since the current run may not produce exactly the same number of files). There may be a flag to do this with each run, but I am not sure.
To remove:
hadoop fs -rm -r hdfs:///user/username/text_file.txt/*
Do you get any error? Maybe you can check if you have the correct permissions to write/read from that folder.
Also, keep in mind that Spark by default will create a folder called text_file.txt with several files inside, depending on the number of partitions that you have.
If you want to write to your local machine, you can specify the path with file:///home/myDir/text_file.txt. If you use a path like /user/hdfs/..., it is written to HDFS by default.
To get a single file (though not named as you want), you need to chain .repartition(1) onto your RDD before writing; look here for details.
I suppose that your HDFS path is wrong. In Spark, HDFS is the default filesystem for text files, and in Hadoop (by default) there is no home directory under the root directory, unless you have created it beforehand.
If you want a csv/txt file (with that extension), the only way to write it is without the RDD or DataFrame write functions: use the usual Python csv and io libraries after you have collected your RDD into a matrix with .collect() (the dataset must not be huge).
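A minimal sketch of that approach, assuming counts is a DataFrame small enough to collect and a made-up local output path:
import csv

# Collect the (small) result to the driver and write it with Python's csv module
rows = counts.collect()
with open("/home/myDir/text_file.csv", "w", newline="") as f:  # hypothetical path
    writer = csv.writer(f)
    writer.writerow(counts.columns)                 # header row
    writer.writerows(tuple(row) for row in rows)    # data rows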
If you want to write directly on your filesystem (and not on HDFS) use
counts.write.csv("file:///home/myDir/text_file.csv")
But this won't write a single file with a csv extension. It will create a folder containing one part-0000n file per partition of your dataset.
I'm new to Azure and Python and was creating a notebook in Databricks to output the results of a piece of SQL. The code below produces the expected output, but with a default filename that's about 100 characters long. I'd like to be able to give the output a sensible name and add a date/time to make it unique, something like testfile20191001142340.csv. I've searched high and low and can't find anything that helps; hoping somebody in the community can point me in the right direction.
%python
try:
    dfsql = spark.sql("select * from dbsmets1mig02_technical_build.tbl_Temp_Output_CS_Firmware_Final order by record1")  # Replace with your SQL
except:
    print("Exception occurred")
if dfsql.count() == 0:
    print("No data rows")
else:
    dfsql.coalesce(1).write.format("com.databricks.spark.csv") \
        .option("header", "false").option("delimiter", "|") \
        .mode("overwrite").option("quote", "\u0000") \
        .save("/mnt/publisheddatasmets1mig/metering/smets1mig/cs/system_data_build/firmware/outbound/")
The issue with naming a single file is that it pretty much goes against the philosophy of Spark. To enable quick processing, Spark has to be able to parallelise writes. For parquet files or other outputs that naturally support parallelism this is not a problem. In the case of .csv files we are used to working with single files, hence the confusion.
Long story short, if you did not use .coalesce(1), Spark would write your data to multiple .csv files in one folder. Since there is only one partition, there will be only one file - but with a generated name. So you have two options here:
rename/move the file afterwards using Databricks utils or regular Python libraries (a sketch follows this list)
.collect the result and save it using other libraries (the default choice would be the csv package)
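A minimal sketch of the rename/move option, assuming the Databricks dbutils helper is available, the mount path from the question, and a made-up target file name:
from datetime import datetime

out_dir = "/mnt/publisheddatasmets1mig/metering/smets1mig/cs/system_data_build/firmware/outbound/"

# coalesce(1) leaves exactly one part-* file inside out_dir
dfsql.coalesce(1).write.mode("overwrite").option("delimiter", "|").csv(out_dir)

# Find that single part file and rename it to something sensible with a timestamp
part_file = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]
target = out_dir + "testfile" + datetime.now().strftime("%Y%m%d%H%M%S") + ".csv"
dbutils.fs.mv(part_file, target)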
The obvious question you may have is why it is so hard to do something as simple as saving to a single file - and the answer is, because it's a problem for Spark. The issue with your approach of saving a single partition is that if you have more data than can fit in your driver / executor memory, repartitioning to 1 partition or collecting the data to an executor is simply going to fail and explode with an exception.
To safely save to a single .csv file, you can use the toLocalIterator method, which loads only one partition into memory at a time; within its iterator you can save your results to a single file using the csv package.
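A minimal sketch of the toLocalIterator route (the output path is made up for illustration):
import csv

# toLocalIterator() streams rows partition by partition,
# so only one partition is held in driver memory at a time
with open("/tmp/single_output.csv", "w", newline="") as f:  # hypothetical path
    writer = csv.writer(f)
    writer.writerow(dfsql.columns)  # header
    for row in dfsql.toLocalIterator():
        writer.writerow(tuple(row))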
PySpark creates a folder instead of a file. For the command below, it creates an empty folder named proto.parquet in the directory.
df.write.parquet("output/proto.parquet")
I tried with csv and other formats, but the result is still the same.
The fact that Spark creates a folder instead of a file is the expected behavior. The reason being that Spark is a distributed system, hence data is processed in partitions and each worker node will write out its data to a part file.
So what you are seeing is the way it should work. It works the same way with mapreduce.
I had a hard time last week getting data out of Spark; in the end I had to simply go with
df.toPandas().to_csv('mycsv.csv')
out of this answer.
I had tested the more native
df.write.csv('mycsv.csv')
for Spark 2.0+ but as per the comment underneath, it drops a set of csv files instead of one which need to be concatenated, whatever that means in this context. It also dropped an empty file into the directory called something like 'success'. The directory name was /mycsv/ but the csv itself had an unintelligible name out of a long string of characters.
This was the first I had heard of such a thing. Well, Excel has multiple tabs which must somehow be reflected in an .xls file, and NumPy arrays can be multidimensional, but I thought a csv file was just a header, values separated into columns by commas in rows.
Another answer suggested:
query.repartition(1).write.csv("cc_out.csv", sep='|')
So this drops just one file plus the blank 'success' file; still, the file does not have the name you want, only the directory does.
Does anyone know why Spark is doing this, why will it not simply output a csv, how does it name the csv, what is that success file supposed to contain, and if concatenating csv files means here joining them vertically, head to tail.
There are a few reasons why Spark outputs multiple CSVs:
- Spark runs on a distributed cluster. For large datasets, all the data may not be able to fit on a single machine, but it can fit across a cluster of machines. To write one CSV, all the data would presumably have to be on one machine and written by one machine, which one machine may not be able to do.
- Spark is designed for speed. If data lives on 5 partitions across 5 executors, it makes sense to write 5 CSVs in parallel rather than move all data to a single executor and have one executor write the entire dataset.
If you need one CSV, my presumption is that your dataset is not super large. My recommendation is to download all the CSV files into a directory, and run cat *.csv > output.csv in the relevant directory. This will join your CSV files head-to-tail. You may need to do more work to strip headers from each part file if you're writing with headers.
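If the part files were written with headers, a small Python sketch along these lines could merge them while keeping only the first header (the directory and file names are assumptions):
import glob

part_files = sorted(glob.glob("downloaded_parts/*.csv"))  # hypothetical local directory

with open("output.csv", "w") as out:
    for i, path in enumerate(part_files):
        with open(path) as f:
            lines = f.readlines()
        # keep the header line only from the first part file
        out.writelines(lines if i == 0 else lines[1:])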
Does anyone know why Spark is doing this, why will it not simply output a csv,
Because it is designed for distributed computing where each chunk of data (a.k.a. partition) is written independently of others.
how does it name the csv
Name depends on the partition number.
what is that success file supposed to contain
Nothing. It just indicates success.
This basically happens because Spark writes files based on the number of partitions the data is divided into. So, each partition simply writes its own file separately. You can use the coalesce option to save them to a single file. Check this link for more info.
However, this method has the disadvantage that it needs to collect all the data on the master node, so the master node must have enough memory. A workaround for this can be seen in this answer.
This link also sheds some more information about this behavior of Spark:
Spark is like Hadoop - uses Hadoop, in fact - for performing actions like outputting data to HDFS. You'll know what I mean the first time you try to save "all-the-data.csv" and are surprised to find a directory named all-the-data.csv/ containing a 0 byte _SUCCESS file and then several part-0000n files for each partition that took part in the job.
I use pyspark
And I use MLUtils.saveAsLibSVMFile to save an RDD of LabeledPoints.
It works, but it keeps the files on all the worker nodes under /_temporary/ as many part files.
No error is thrown. I would like to save the files in the proper folder, and preferably save all the output to one libsvm file that will be located on the nodes or on the master.
Is that possible?
EDIT:
No matter what I do, I can't use MLUtils.loadLibSVMFile() to load the libsvm data from the same path I used to save it. Maybe something is wrong with how the file is written?
This is normal behavior for Spark. All writing and reading activities are performed in parallel directly from the worker nodes, and data is not passed to or from the driver node.
This is why reading and writing should be performed using storage which can be accessed from every machine, like a distributed file system, object store, or database. Using Spark with the local file system has very limited applications.
For testing you can use a network file system (it is quite easy to deploy), but it won't work well in production.
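A minimal sketch of saving to and reloading from storage that every node can see (here an HDFS path, which is an assumption, as are the labeled_points_rdd and sc names):
from pyspark.mllib.util import MLUtils

# Save the RDD of LabeledPoint to shared storage, so every worker writes to the same place
MLUtils.saveAsLibSVMFile(labeled_points_rdd, "hdfs:///user/username/libsvm_data")

# Reload from the same shared path; this fails if each node only wrote to its own local disk
loaded = MLUtils.loadLibSVMFile(sc, "hdfs:///user/username/libsvm_data")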
I have 1000 csv files that are to be processed in parallel using map function available in spark. I have two desktops connected in a cluster and I'm using the pyspark shell for computation. I am passing the name of csv files into the map function and the function accesses the files based on name. However, I need to copy files to the slave for the process to function properly. This means there has to be a copy of all the csv files on the other system. Kindly suggest an alternative storage while avoiding data transfer latency.
I also tried storing these files in a 3-D array and generating an RDD using the parallelize command. But that gives an out-of-memory error.
You can use spark-csv to load the files:
https://github.com/databricks/spark-csv
Then you can use the DataFrame API to pre-process the files.
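A minimal sketch with the spark-csv package (the HDFS path, the header option, and the sqlContext/spark names are assumptions):
# Spark 1.x style, using the spark-csv package
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load("hdfs:///user/username/csv_dir/*.csv")

# On Spark 2.0+ the built-in reader does the same thing:
# df = spark.read.option("header", "true").csv("hdfs:///user/username/csv_dir/*.csv")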
Since there are 1000 csv files, and if there is some link among them, use Spark SQL to run operations on them and then extract your output for the final computation.
If that doesn't work, you can try to load the same data into HBase or Hive and then use Spark to compute; I checked this with 100 GB of csv contents on my single-node cluster.
It may help