I have 1000 CSV files that need to be processed in parallel using the map function available in Spark. I have two desktops connected as a cluster and I'm using the pyspark shell for the computation. I pass the names of the CSV files into the map function, and the function accesses each file by name. However, I need to copy the files to the worker node for the process to work properly, which means there has to be a copy of all the CSV files on the other machine. Could you suggest an alternative storage approach that avoids data transfer latency?
I also tried storing these files in a 3-D array and generating an RDD with the parallelize command, but that gives an out-of-memory error.
You can use spark-csv to load the files:
https://github.com/databricks/spark-csv
Then you can use the DataFrame API to pre-process the files.
Since there are 1000 CSV files, if there is some relationship among them, use Spark SQL to run operations across them, and then extract your output for the final computation.
If that doesn't work, you can try loading the data into HBase or Hive and then use Spark to compute; I checked this with 100 GB of CSV content on my single-node cluster.
It may help.
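For illustration, a minimal sketch of reading all the CSV files in one distributed pass and querying them with Spark SQL (the path pattern and view name are made up; on Spark 1.x the spark-csv package provides the same reader via format("com.databricks.spark.csv")):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read every CSV under a shared location in a single distributed job
# (use storage that both machines can reach, e.g. HDFS).
df = spark.read.csv("hdfs:///data/csv_files/*.csv", header=True, inferSchema=True)

# Register a temporary view and run Spark SQL across all 1000 files at once.
df.createOrReplaceTempView("all_files")
spark.sql("SELECT COUNT(*) AS total_rows FROM all_files").show()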
Reading a Parquet file from disk, I can choose to read only a few columns (I assume it scans the header/footer, then decides). Is it possible to do this remotely (for example via Google Cloud Storage)?
We have 100 MB parquet files with about 400 columns and we have a use-case where we want to read 3 of them, and show them to the user. The user can choose which columns.
Currently we download the entire file and then filter it, but this takes time.
Long term we will be putting the data into Google BigQuery and the problem will be solved.
More specifically we use Python with either pandas or PyArrow and ideally would like to use those (either with a GCS backend or manually getting the specific data we need via a wrapper). This runs in Cloud Run so we would prefer to not use Fuse, although that is certainly possible.
I intend to use Python and pandas/pyarrow as the backend for this, running in Cloud Run (hence why data size matters: a 100 MB download to disk actually means 100 MB downloaded to RAM).
We use pyarrow.parquet.read_table with to_pandas(), or pandas.read_parquet.
The pandas.read_parquet function has a columns argument to read a subset of columns.
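A minimal sketch, assuming gcsfs is installed and using made-up bucket, file, and column names:

import pandas as pd

# Reads only the requested columns; with gcsfs/pyarrow underneath, only the
# footer and the selected column chunks need to be fetched from GCS.
df = pd.read_parquet(
    "gs://my-bucket/data/part-000.parquet",
    columns=["col_a", "col_b", "col_c"],
)

The same can be done with pyarrow directly, which gives a bit more control:

import pyarrow.parquet as pq
import gcsfs

fs = gcsfs.GCSFileSystem()
table = pq.read_table(
    "my-bucket/data/part-000.parquet",
    columns=["col_a", "col_b", "col_c"],
    filesystem=fs,
)
df = table.to_pandas()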
I am using Azure Databricks to analyze some data. I have the following folder structure in blob storage:
folder_1: n1 csv files
folder_2: n2 csv files
...
folder_k: nk csv files
I want to read these files, run some algorithm (relatively simple) and write out some log files and image files for each of the csv files in a similar folder structure at another blob storage location. Right now I have a simple loop structure to do this:
for folder in folders:
    # set up some stuff
    for file in files:
        # do the work and write out results
The database contains 150k files. Is there a way to parallelize this?
The best way I found to parallelize such embarrassingly parallel tasks in Databricks is using pandas UDFs (https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html?_ga=2.143957493.1972283838.1643225636-354359200.1607978015).
I created a Spark DataFrame with the list of files and folders to loop through, and passed it to a pandas UDF with a specified number of partitions (essentially the number of cores to parallelize over). This leverages the available cores on a Databricks cluster. There are a few restrictions as to what you can call from a pandas UDF (for example, you cannot use 'dbutils' calls directly), but it worked like a charm for my application.
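A minimal sketch of that pattern, with a placeholder processing function and an illustrative path list (spark here is the SparkSession that Databricks notebooks provide):

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def process_one_file(path: str) -> str:
    # Placeholder for the real algorithm: read the CSV, write logs/images, etc.
    return "ok"

# Series-to-series pandas UDF: each batch of paths is processed on a worker core.
@F.pandas_udf(StringType())
def process_files(paths: pd.Series) -> pd.Series:
    return paths.apply(process_one_file)

# Illustrative; in practice this would be the full list of 150k file paths.
all_paths = ["/mnt/data/folder_1/file_0001.csv"]
files_df = spark.createDataFrame([(p,) for p in all_paths], ["path"])

# Repartition so the files are spread over the cluster's cores, then trigger the work.
results = files_df.repartition(64).withColumn("status", process_files(F.col("path")))
results.write.mode("overwrite").csv("/mnt/output/run_status")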
When writing a CSV file, a folder is automatically created and a CSV file with a cryptic name is created inside it. How can I write the CSV with a specific name, without creating a folder, in PySpark (not in pandas)?
That's just the way Spark's parallelism works. A Spark application is meant to have one or more workers reading your data and writing it out to a location. When you write a CSV file, a directory with multiple files is what allows multiple workers to write at the same time.
If you're using HDFS, you can consider writing a separate bash script to move or reorganize the files the way you want.
If you're using Databricks, you can use the dbutils.fs utilities to interact with DBFS files in the same way.
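For example, a hedged sketch of the rename step on Databricks (the paths are illustrative, and dbutils is only available there):

# Write a single partition so the output directory contains exactly one part file.
out_dir = "dbfs:/tmp/output/my_result"          # directory Spark will create
final_path = "dbfs:/tmp/output/my_result.csv"   # the single, nicely named file

df.coalesce(1).write.mode("overwrite").option("header", "true").csv(out_dir)

# Locate the part file inside the directory, move it to the final name, clean up.
part_file = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, final_path)
dbutils.fs.rm(out_dir, True)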
This is the way Spark is designed: it writes out multiple files in parallel, which is faster for big datasets. Still, you can get a single file by using coalesce(1, true).saveAsTextFile() on the RDD. You can refer here.
In PySpark, the following code helped me to write data directly into a single CSV file:
df.toPandas().to_csv('FileName.csv')
I'm trying to write all rows of a Spark DataFrame to a file in Databricks, but only part of the data makes it into the file. For example, if the DataFrame count is 100, the file ends up with only 50 rows, so data is being skipped. How can I write the complete data from the DataFrame to a file without skipping anything? I created a UDF that opens the file and appends the data to it, and I call that UDF on the Spark SQL DataFrame.
Can someone help me with this issue?
I would advise against using a UDF the way you are, for a few reasons:
UDFs run on the worker nodes, so you would have multiple UDF instances, each writing a portion of your data to a local file.
Even if you have your UDF appending to a file in a shared location (like DBFS), you still have multiple nodes writing to the same file concurrently, which could lead to errors.
Spark already has a way to do this out of the box that you should take advantage of.
To write a Spark DataFrame to a file in Databricks:
Use the DataFrame.write attribute (Databricks docs).
There are plenty of options, so you should be able to do whatever you need (Spark docs; this one is for CSVs).
Note on partitions: Spark writes each partition of the DataFrame to its own file, so you should use the coalesce function (warning: this can be very slow with extremely large DataFrames, since Spark has to pull the whole DataFrame into a single partition on one node).
Note on file locations: The file path you give will be on the driver node, so unless you plan on reading it back with another script, you should start your path with "/dbfs", which is mounted onto all of the nodes' file systems. This way, it is saved on the Databricks File System, which is accessible from any cluster in your Databricks instance. (It's also available to download using the Databricks CLI.)
Full Example:
df_to_write = my_df.select(<columns you want>)
df_to_write.coalesce(1).write.csv("/dbfs/myFileDownloads/dataframeDownload.csv")
I'm connected to the cluster using ssh and I send the program to the cluster using
spark-submit --master yarn myProgram.py
I want to save the result in a text file and I tried using the following lines:
counts.write.json("hdfs://home/myDir/text_file.txt")
counts.write.csv("hdfs://home/myDir/text_file.csv")
However, none of them work. The program finishes and I cannot find the text file in myDir. Do you have any idea how I can do this?
Also, is there a way to write directly to my local machine?
EDIT: I found out that the home directory doesn't exist, so now I save the result as:
counts.write.json("hdfs:///user/username/text_file.txt")
But this creates a directory named text_file.txt that contains a lot of files with partial results inside. I want one file with the final result. Any ideas how I can do this?
Spark will save the results in multiple files since the computation is distributed. Therefore writing:
counts.write.csv("hdfs://home/myDir/text_file.csv")
means to save the data on each partition as a separate file in the folder text_file.csv. If you want the data saved as a single file, use coalesce(1) first:
counts.coalesce(1).write.csv("hdfs://home/myDir/text_file.csv")
This will put all the data into a single partition and the number of saved files will thus be 1. However, this could be a bad idea if you have a lot of data. If the data is very small then using collect() is an alternative. This will put all data onto the driver machine as an array, which can then be saved as a single file.
You can concatenate your results into one file from the command line:
hadoop fs -cat hdfs:///user/username/text_file.txt/* > path/to/local/file.txt
This should be faster than using coalesce - in my experience all collect() type operations are slow because all of the data is funneled through the master node. Furthermore, you can run into troubles with collect() if your data exceeds the memory on your master node.
However, a potential pitfall with this approach is that you will have to explicitly remove the files from a previous run (since the current run may not produce exactly the same number of files). There may be a flag to do this with each run, but I am not sure.
To remove:
hadoop fs -rm -r hdfs:///user/username/text_file.txt/*
Do you get any error? Maybe you can check if you have the correct permissions to write/read from that folder.
Also note that Spark by default will create a folder called text_file.txt with several files inside, depending on the number of partitions that you have.
If you want to write to your local machine, you can specify the path with file:///home/myDir/text_file.txt. If you use a path like /user/hdfs/..., it is written to HDFS by default.
To get a single file (though not named as you want) you need to apply .repartition(1) to your RDD (see here).
I suppose that your HDFS path is wrong. In Spark, HDFS is the default filesystem for text files, and in Hadoop there is no home directory under the root directory by default, unless you have created it beforehand.
If you want a csv/txt file (with that extension), the only way to write it is without the RDD or DataFrame functions, using the usual Python csv and io libraries, after you have collected your RDD into a matrix with .collect() (the dataset must not be huge).
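A minimal sketch of that approach (the output path is illustrative, and it only makes sense when the collected data fits in the driver's memory):

import csv

rows = counts.collect()  # pulls every row onto the driver as a list of Row objects

with open("/home/myDir/text_file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(counts.columns)   # header row
    for row in rows:
        writer.writerow(list(row))    # each Row converts to a plain list of values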
If you want to write directly to your local filesystem (and not to HDFS), use:
counts.write.csv("file:///home/myDir/text_file.csv")
But this won't write a single file with a .csv extension. It will create a folder containing the part files for each of the n partitions of your dataset.