File path/name under Databricks file system - python

I use the glob function to grab directory/file names under regular Python.
For example:
glob.glob("/dbfs/mnt/.../*/A*.txt")
However, I just realized that under DBFS the full path name starts with /mnt. Is there a way under PySpark, similar to using glob, to get the list of file directories/names?
Thanks,

If you only want to get the directory/name list, you can only do that in plain Python.
PySpark can process such a file pattern, e.g. sc.textFile("dbfs:/mnt/.../*/A*.txt"), but it does not return the matching names.
PySpark is a processing engine, not a framework for filesystem tasks.
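For what it's worth, a minimal sketch of how the two usually fit together on Databricks (the mount name below is a placeholder, and sc is assumed to be the notebook's SparkContext): build the list in plain Python against the local /dbfs FUSE mount, then hand the matching paths to Spark.

import glob

# List matching files through the /dbfs FUSE mount (placeholder mount name).
local_paths = glob.glob("/dbfs/mnt/mydata/*/A*.txt")

# Spark reads the same files via the dbfs:/ scheme, so strip the /dbfs prefix.
spark_paths = [p.replace("/dbfs/", "dbfs:/", 1) for p in local_paths]

# sc.textFile accepts a comma-separated list of paths.
rdd = sc.textFile(",".join(spark_paths))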

Related

Save a CSV in same directory as python file, using 'to_csv' and 'os.path'?

I want this line to save the csv in my current directory alongside my python file:
df.to_csv("./test.csv")
My python file is in "C:\Users\Micheal\Desktop\VisualStudioCodes\Q1"
Unfortunately it saves it in "C:\Users\Micheal" instead.
I have tried importing os.path to use os.curdir, but I get nothing but errors with that.
Is there even a way to save the csv alongside the python file using os.curdir?
Or is there a simpler way to just do this in python without importing anything?
import os

# Absolute path of the directory that contains this script
directory_of_python_script = os.path.dirname(os.path.abspath(__file__))
df.to_csv(os.path.join(directory_of_python_script, "test.csv"))
And if you want to read the same .csv file later:
pandas.read_csv(os.path.join(directory_of_python_script, "test.csv"))
Here, __file__ gives the (relative) path of the python script being run. We get the absolute path with os.path.abspath() and then take its parent directory with os.path.dirname().
os.path.join() joins two paths together using the operating system's default path separator: '\' for Windows and '/' for Linux, for example.
This kind of approach should work. I haven't tried it, so let me know if it does not.
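A rough equivalent using pathlib, in case that reads more naturally (untested sketch; the DataFrame below is just a placeholder for illustration):

from pathlib import Path

import pandas as pd

# Directory that contains this script, independent of the current working directory.
script_dir = Path(__file__).resolve().parent

df = pd.DataFrame({"value": [1, 2, 3]})  # placeholder DataFrame
df.to_csv(script_dir / "test.csv", index=False)

# Reading it back later works the same way.
df_again = pd.read_csv(script_dir / "test.csv")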

PySpark: how to resolve path of a resource file present inside the dependency zip file

I have a mapPartitions on an RDD, and within each partition a resource file has to be opened. The module that contains the method invoked by mapPartitions, together with the resource file, is passed to each executor as a zip file using the --py-files argument.
To make it clear:
import os
import json

rdd = rdd.mapPartitions(work_doing_method)

def work_doing_method(rows):
    for row in rows:
        resource_file_path = os.path.join(os.path.dirname(__file__), "resource.json")
        with open(resource_file_path) as f:
            resource = json.loads(f.read())
        ...
When I do this after passing the zip file (which includes all of the above) via the --py-files parameter to the spark-submit command,
I get IOError: [Errno 20] Not a directory: /full/path/to/the/file/within/zip/file
I do not understand how Spark uses the zip file to read the dependencies. The os.path.dirname utility returns the full path including the zip file, e.g. /spark/dir/my_dependency_file.zip/path/to/the/resource/file. I believe this is the problem. I tried many combinations to resolve the path of the file. Any help is appreciated.
Thanks!
I think when you add a file to a Spark job, it will be copied to the working directory of each executor. I've used the SparkFiles API to get absolute paths to files on the executors.
You can also use the --archives flag to pass in arbitrary data archives such as zip files. See: What's the difference between --archives, --files, py-files in pyspark job arguments
When we look for the absolute path, we get the path to a resource file within an egg/zip file (inside the executor working dir). I ended up using the zipfile module in Python to open the resource from inside the archive directly.
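For reference, a rough sketch of that zipfile approach (untested here; it assumes the module and resource.json sit side by side inside the archive shipped via --py-files):

import json
import posixpath
import zipfile

def load_resource():
    # On the executor, __file__ looks like
    # /spark/dir/my_dependency_file.zip/path/to/module.py, so split the path
    # at the .zip archive and read the member directly with zipfile.
    zip_path, _, inner = __file__.partition(".zip")
    zip_path += ".zip"
    member = posixpath.join(posixpath.dirname(inner.lstrip("/")), "resource.json")
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as f:
            return json.loads(f.read().decode("utf-8"))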

Analyze Multiple Non-Text Files With Pyspark

I have several .mat files (MATLAB) that I want to process with PySpark, but I'm not sure how to do it in parallel. Here's the basic single-threaded setup that I wish to parallelize. The code will generate a list of lists, where each inner list has arbitrary length:
filenames = ['1.mat', '2.mat', ...]
output_lists = [None] * len(filenames)  # will be a list of lists
for i, filename in enumerate(filenames):
    output_lists[i] = analyze(filename)  # analyze is some function that returns a list
Any individual output_lists[i] can fit in memory, but the entire output_lists object cannot. I would like output_lists to be an rdd.
Any ideas? I am also open to using a combination of pyspark and the multiprocessing module. Thanks!
Put the files in a POSIX-compliant file system which can be accessed from every worker (NFS, MapR filesystem, Databricks filesystem, Ceph).
Convert the paths so they reflect the files' location in that file system.
Parallelize the names:
rdd = sc.parallelize(filenames)
Map with the processing function:
result = rdd.map(analyze)
Do whatever you want to do with the results; a minimal end-to-end sketch follows below.
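A minimal sketch under those assumptions (the paths and the MATLAB variable name are placeholders, and scipy is assumed to be installed on every worker):

from scipy.io import loadmat  # assumes scipy is available on the workers

def analyze(path):
    # Load one MATLAB file and return a plain Python list for it.
    mat = loadmat(path)
    return mat["values"].ravel().tolist()  # "values" is a hypothetical variable name

# Placeholder paths on a file system visible to every worker.
filenames = ["/mnt/data/1.mat", "/mnt/data/2.mat"]

rdd = sc.parallelize(filenames)   # one element per file name
result = rdd.map(analyze)         # each worker opens and analyzes its own files
result.saveAsTextFile("/mnt/data/output")  # or collect(), count(), etc.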
The other answer looks elegant, but I didn't want to install a new file system. I opted for analyzing the files in parallel with the joblib module, writing the results to .txt files, and opening the .txt files with Spark.
from joblib import Parallel, delayed

def analyze(filename):
    # write results to a text file named filename + '.txt'
    return

filenames = ['1.mat', '2.mat', ...]
Parallel(n_jobs=8)(delayed(analyze)(filename) for filename in filenames)
Then I use PySpark to read all the .txt files into one rdd:
data = sc.textFile('path/*.txt')

How to read a file whose name includes '/' in python?

I have a file named Land/SeaMask and I want to open it, but the program does not recognize it as a file name; it treats it as a directory. How can I open it?
First of all, I recommend finding out how the Python interpreter displays your file name. You can do this simply using the built-in os module:
import os
os.listdir('path/to/directory')
You'll get a list of the directories and files inside the directory you passed to listdir. In this list you may find something like Land:SeaMask, since on macOS a '/' shown in a file name is stored as ':' at the filesystem level. After recognizing this, open('path/to/Land:SeaMask') will work for you.
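If you prefer to locate the file programmatically, a small sketch along those lines (the directory and the name fragment are placeholders):

import os

directory = "path/to/directory"  # placeholder

# The display name "Land/SeaMask" is usually stored as "Land:SeaMask" on macOS,
# so search the listing instead of typing the name by hand.
for name in os.listdir(directory):
    if "SeaMask" in name:
        with open(os.path.join(directory, name)) as f:
            data = f.read()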

Grabbing all files in a query using Python

I'm processing a large number of files using Python. The files are related to each other through their file names.
If I were to do this with a CMD command (on Windows), it would look something like:
DIR filePrefix_??.txt
And this would return all the file names I would need for that group.
Is there a similar function that I can use in Python?
Have a look at the glob module.
glob.glob("filePrefix_??.txt")
returns a list of matching file names.
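For example, mirroring the CMD pattern above (the directory is a placeholder; in glob, each '?' matches exactly one character):

import glob

# Matches filePrefix_01.txt, filePrefix_ab.txt, etc. in the given directory.
for path in sorted(glob.glob(r"C:\data\filePrefix_??.txt")):
    print(path)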
