PySpark: spit out single file when writing instead of multiple part files

Is there a way to prevent PySpark from creating several small files when writing a DataFrame to JSON file?
If I run:
df.write.format('json').save('myfile.json')
or
df1.write.json('myfile.json')
it creates the folder named myfile and within it I find several small files named part-***, the HDFS way. Is it by any means possible to have it spit out a single file instead?

Well, the answer to your exact question is the coalesce function. But as already mentioned, it is not efficient at all, because it forces one worker to fetch all the data and write it sequentially:
df.coalesce(1).write.format('json').save('myfile.json')
P.S. Note that the resulting file is not a single valid JSON document; it is a file with one JSON object per line (JSON Lines).

This was a better solution for me:
rdd.map(json.dumps).saveAsTextFile(json_lines_file_name)
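If you are starting from a DataFrame rather than an RDD of dicts, a similar pattern should work as well; this is only a sketch (the output path is illustrative), and it still funnels everything through a single worker:
# Convert each row to a JSON string, then write everything as one part file.
# The output is still a directory, containing a single part-00000 file.
df.toJSON().coalesce(1).saveAsTextFile('myfile_json_lines')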

df1.rdd.repartition(1).write.json('myfile.json')
Would be nice, but isn't available. Check this related question. https://stackoverflow.com/a/33311467/2843520
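If you really need a single file with an exact name, a common workaround is to write with coalesce(1) and then move the lone part file afterwards. A minimal sketch, assuming the output went to the local filesystem (for HDFS you would move the file with hdfs dfs -mv or the FileSystem API instead):
import glob
import shutil

df.coalesce(1).write.format('json').save('myfile_out')  # writes a directory
part_file = glob.glob('myfile_out/part-*')[0]  # the single part file inside it
shutil.move(part_file, 'myfile.json')  # give it the name you want
shutil.rmtree('myfile_out')  # drop the leftover directory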

Related

How to write in CSV file without creating folder in pyspark?

While writing a CSV file, a folder is automatically created, and inside it a CSV file with a cryptic name is created. How can I write this CSV with a specific name, without creating a folder, in PySpark (not in pandas)?
That's just the way Spark works because of its parallelization mechanism. A Spark application is meant to have one or more workers read your data and write it to a location. When you write a CSV file, having a directory with multiple files is what lets multiple workers write at the same time.
If you're using HDFS, you can consider writing a small bash script to move or reorganize the files the way you want.
If you're using Databricks, you can use the dbutils.fs utilities (e.g. dbutils.fs.ls and dbutils.fs.mv) to interact with DBFS files in the same way.
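For example, on Databricks something like the following sketch can rename the single part file after a coalesce(1) write (the paths are illustrative, and this assumes dbutils is available, i.e. you are in a notebook):
df.coalesce(1).write.csv('/tmp/out_dir', header=True)  # writes a directory with one part file
part = [f.path for f in dbutils.fs.ls('/tmp/out_dir') if f.name.startswith('part-')][0]
dbutils.fs.mv(part, '/tmp/final_output.csv')  # move/rename the part file
dbutils.fs.rm('/tmp/out_dir', True)  # remove the leftover directory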
This is the way Spark is designed: it writes out multiple files in parallel, and writing out many files at the same time is faster for big datasets. Still, you can get a single file by using coalesce(1, true).saveAsTextFile(). You can refer here.
In PySpark, the following code helped me write data directly into a CSV file:
df.toPandas().to_csv('FileName.csv')

Reading and storing data into a Python data structure from a CSV

I am trying to figure out how to read my survey data but I keep getting this error. I feel like I am missing some things when I type out how to specifically find the file I need... and would a dictionary be the best data structure to use for this? The ultimate goal is to use each voting method to determine a winner from the data that was collected. I plan on using if/else if statements...
Did you check your directory properly? The error says No such file or directory. Provide the full, correct path for your csv file and it should work fine.
In addition to providing a full file path, you can also put the csv file in the same folder as the Python script and try again.
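As a rough sketch of both suggestions (the filename survey.csv is just an example), build the path relative to the script and read the rows into a list of dictionaries, which is usually a convenient structure for tallying votes:
import csv
from pathlib import Path

csv_path = Path(__file__).parent / 'survey.csv'  # same folder as the script
with open(csv_path, newline='') as f:
    rows = list(csv.DictReader(f))  # one dict per survey response

print(rows[0])  # inspect the first response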

What is the best way to open multiple files in Python when I have the names of the files stored in a list?

I am dealing with a large set of data where each record is classified and written to one of many files. I am trying to open all the files at once so I can write to them as I am going through the data (I am working with Python 3.7).
I could do multiple with open(...) as ... statements, but I was wondering if there is a way to do this without having to write out the open statements for each file.
I was thinking about using a for loop to open the files but heard this is not exception safe and is bad practice.
So what do you think is the best way to open multiple files where the filenames are stored in a list?
I usually use glob and a dict to do so. This assumes your data is in .csv format, but that shouldn't really matter to the idea:
You use glob to create a variable with all your files. Say they are in a folder called Data inside your main folder:
import glob
import pandas as pd

data = glob.glob('Data/*.csv')  # put every .csv file into a list
# you can change .csv to whatever extension you need
dict_data = {}  # create an empty dictionary
for n, i in enumerate(sorted(data)):
    dict_data['file_' + str(n + 1)] = pd.read_csv(i)
Here you can replace the pd.read_csv call with your own with ... open(...) statement. In the end you'll get a dict with keys file_1, ..., file_n that holds your data. I find it the best way to work with lots of data. Might need to do some tinkering if you're working with more than one type of data, though.
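If you actually need the open file handles themselves (rather than DataFrames), one exception-safe option, a sketch assuming Python 3.3+ and that filenames holds your list of paths, is contextlib.ExitStack from the standard library:
from contextlib import ExitStack

filenames = ['out_a.txt', 'out_b.txt', 'out_c.txt']  # your list of file names

with ExitStack() as stack:
    # every file is closed automatically, even if an exception is raised
    files = {name: stack.enter_context(open(name, 'w')) for name in filenames}
    files['out_a.txt'].write('some data\n')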
Hope it helps

Why does Spark output a set of csv's instead of just one?

I had a hard time last week getting data out of Spark; in the end I had to simply go with
df.toPandas().to_csv('mycsv.csv')
out of this answer.
I had tested the more native
df.write.csv('mycsv.csv')
for Spark 2.0+ but, as per the comment underneath, it drops a set of csv files instead of one, which need to be concatenated, whatever that means in this context. It also dropped an empty file into the directory called something like 'success'. The directory name was /mycsv/ but the csv itself had an unintelligible name made up of a long string of characters.
This was the first I had heard of such a thing. Well, Excel has multiple tabs which must somehow be reflected in an .xls file, and NumPy arrays can be multidimensional, but I thought a csv file was just a header, values separated into columns by commas in rows.
Another answer suggested:
query.repartition(1).write.csv("cc_out.csv", sep='|')
So this drops just one file plus the blank 'success' file, but the file still does not have the name you want; the directory does.
Does anyone know why Spark is doing this, why it will not simply output a csv, how it names the csv, what that success file is supposed to contain, and whether concatenating csv files here means joining them vertically, head to tail?
There are a few reasons why Spark outputs multiple CSVs:
- Spark runs on a distributed cluster. For large datasets, all the data may not be able to fit on a single machine, but it can fit across a cluster of machines. To write one CSV, all the data would presumably have to be on one machine and written by one machine, which one machine may not be able to do.
- Spark is designed for speed. If data lives on 5 partitions across 5 executors, it makes sense to write 5 CSVs in parallel rather than move all data to a single executor and have one executor write the entire dataset.
If you need one CSV, my presumption is that your dataset is not super large. My recommendation is to download all the CSV files into a directory, and run cat *.csv > output.csv in the relevant directory. This will join your CSV files head-to-tail. You may need to do more work to strip headers from each part file if you're writing with headers.
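If the part files do have headers, a small script along these lines (a sketch; the directory and file names are illustrative) keeps the header from the first file only:
import glob

part_files = sorted(glob.glob('mycsv_parts/*.csv'))
with open('output.csv', 'w') as out:
    for i, path in enumerate(part_files):
        with open(path) as f:
            lines = f.readlines()
        out.writelines(lines if i == 0 else lines[1:])  # keep only the first header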
Does anyone know why Spark is doing this, why will it not simply output a csv,
Because it is designed for distributed computing where each chunk of data (a.k.a. partition) is written independently of others.
how does it name the csv
The name depends on the partition number.
what is that success file supposed to contain
Nothing. It just indicates success.
This basically happens because Spark writes out files based on the number of partitions the data is divided into, so each partition simply dumps its own file separately. You can use the coalesce option to save them as a single file. Check this link for more info.
However, this method has the disadvantage that it needs to collect all the data on a single executor, which therefore must have enough memory to hold it. A workaround for this can be seen in this answer.
This link also sheds some more light on this behavior of Spark:
Spark is like Hadoop - uses Hadoop, in fact - for performing actions like outputting data to HDFS. You'll know what I mean the first time you try to save "all-the-data.csv" and are surprised to find a directory named all-the-data.csv/ containing a 0 byte _SUCCESS file and then several part-0000n files for each partition that took part in the job.

How to load directory of JSON files into Apache Spark in Python

I'm relatively new to Apache Spark, and I want to create a single RDD in Python from lists of dictionaries that are saved in multiple JSON files (each is gzipped and contains a list of dictionaries). The resulting RDD would then, roughly speaking, contain all of the lists of dictionaries combined into a single list of dictionaries. I haven't been able to find this in the documentation (https://spark.apache.org/docs/1.2.0/api/python/pyspark.html), but if I missed it please let me know.
So far I tried reading the JSON files and creating the combined list in Python, then using sc.parallelize(), however the entire dataset is too large to fit in memory so this is not a practical solution. It seems like Spark would have a smart way of handling this use case, but I'm not aware of it.
How can I create a single RDD in Python comprising the lists in all of the JSON files?
I should also mention that I do not want to use Spark SQL. I'd like to use functions like map, filter, etc., if that's possible.
Following what tgpfeiffer mentioned in their answer and comment, here's what I did.
First, as they mentioned, the JSON files had to be formatted so they had one dictionary per line rather than a single list of dictionaries. Then, it was as simple as:
import json

my_RDD_strings = sc.textFile(path_to_dir_with_JSON_files)
my_RDD_dictionaries = my_RDD_strings.map(json.loads)
If there's a better or more efficient way to do this, please let me know, but this seems to work.
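Note in passing that sc.textFile accepts wildcards and reads gzip-compressed files transparently (standard Spark/Hadoop behavior), so the gzipped inputs should not need to be unpacked first. A sketch, with an illustrative path:
my_RDD_strings = sc.textFile('hdfs:///data/json_dir/*.json.gz')  # one line per JSON object
my_RDD_dictionaries = my_RDD_strings.map(json.loads)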
You can use sqlContext.jsonFile() to get a SchemaRDD (which is an RDD[Row] plus a schema) that can then be used with Spark SQL. Or see Loading JSON dataset into Spark, then use filter, map, etc for a non-SQL processing pipeline. I think you may have to unzip the files, and also Spark can only work with files where each line is a single JSON document (i.e., no multiline objects possible).
You can load a directory of files into a single RDD using textFile and it also supports wildcards. That wouldn't give you file names, but you don't seem to need them.
You can use Spark SQL together with basic transformations like map, filter, etc.; a SchemaRDD is also an RDD (in Python as well as Scala).
To load a list of JSON objects from a file as an RDD:
# each file's content is a JSON array; return its elements one by one
def flat_map_json(x): return [each for each in json.loads(x[1])]
rdd = sc.wholeTextFiles('example.json').flatMap(flat_map_json)
