Analyze Multiple Non-Text Files With Pyspark - python

I have several .mat files (matlab) that I want to process with PySpark. But I'm not sure how to do it in parallel. Here's the basic single-threaded setup that I wish to parallelize. The code will generate a list of lists, where each inner list has arbitrary length:
filenames = ['1.mat','2.mat',...]
output_lists = [None]*len(filenames) # will be a list of lists
for i,filename in enumerate(filenames):
output_lists[i] = analyze(filename) # analyze is some function that returns a list
Any individual output_lists[i] can fit in memory, but the entire output_lists object cannot. I would like output_lists to be an rdd.
Any ideas? I am also open to using a combination of pyspark and the multiprocessing module. Thanks!

Put files in a POSIX compliant file system which can be accessed from every worker (NFS, MapR filesystem, Databricks filesystem, Ceph)
Convert path so they reflect path in the file system.
parallelize names:
rdd = sc.parallelize(filenames)
map with process:
result = rdd.map(analyze)
Do whatever you want to do with results.

The other answer looks elegant, but I didn't want to install a new file system. I opted for analyzing the files in parallel with the joblib module, writing the results to .txt files, and opening the .txt files with Spark.
from joblib import Parallel, delayed
def analyze(filename):
# write results to text file with name= filename+'.txt'
return
filenames = ['1.mat', '2.mat', ...]
Parallel(n_jobs=8)(delayed(analyze)(filename) for filename in filenames)
Then I use Pyspark to read all the .txt files to one rdd:
data = sc.textFile('path/*.txt')

Related

Multiple instances of the same function asynchronously in python

I have a little script that does a few simple tasks. Running Python 3.7.
One of the tasks has to merge some files together which can be a little time consuming.
It loops through multiple directories, then each directory gets passed to the function. The function just loops through the files and merges them.
Instead of waiting for it to finish one directory, then onto the next one, then wait, then onto the next one, etc...
I'd like to utilize the horsepower/cores/threads to have the script merging the PDF's in multiple directories at once, together, which should shave time.
I've got something like this:
if multi_directories:
if os.path.isdir('merged'):
pass
else:
os.makedirs('merged')
for directory in multi_directories:
merge_pdfs(directory)
My merge PDF function looks like this:
def merge_pdfs(directory):
root_dir = os.path.dirname(os.path.abspath(__file__))
merged_dir_location = os.path.join(root_dir, 'merged')
dir_title = directory.rsplit('/', 1)[-1]
file_list = [file for file in os.listdir(directory)]
merger = PdfFileMerger()
for pdf in file_list:
file_to_open = os.path.join(directory, pdf)
merger.append(open(file_to_open, 'rb'))
file_to_save = os.path.join(
merged_dir_location,
dir_title+"-merged.pdf"
)
with open(file_to_save, "wb") as fout:
merger.write(fout)
return True
This works great - but merge_pdfs runs slow in some instances where there are a high number of PDF's in the directory.
Essentially - I want to be a be able to loop through multi_directories and create a new thread or process for each directory and merge the PDF's at the same time.
I've looked at asyncio, multithreading and a wealth of little snippets here and there but can't seem to get it to work.
You can do something like:
from multiprocessing import Pool
n_processes = 2
...
if multi_directories:
if os.path.isdir('merged'):
pass
else:
os.makedirs('merged')
pool = Pool(n_processes)
pool.map(merge_pdfs, multi_directories)
It should help if the bottleneck is CPU usage. But it may make things even worse if the bottleneck is HDD, cause reading several files in parallel from one physical HDD is usually slower then reading them consecutively. Try it with different values of n_processes.
BTW, to make list from iterable use list(): file_list = list(os.listdir(directory)). And since listdir() returns List, you can just write file_list = os.listdir(directory)

How can i read all files from a directory and do operations parallelly?

Suppose I have some files in directory and i want to read each file and extract the file name and first row from the file (i.e header) for some validation. How can we do this in spark (using python).
input_file = sc.textFile(sourceFileDir)
By sc.textFile() we can read all files parallelly but using map we can apply any rules or function to each element in the rdd. I am not understanding how can i fetch only file name and one row of all files using sc.textFile()
Currently, I am doing these requirement (mentioned above) using a for loop.
files = os.listdir(sourceFileDir)
for x in files:
**operations**
How can i do the same in parallel manner to all files that will save some times as there are lots of files in the directory.
Thanks in advance ..
textFile is not what you are looking for. You should use wholeTextFile. It creates a rdd with key as FileName and value as content. Then you apply a map to get only first line:
sc.wholeTextFiles(sourceFileDir).map(lambda x : (x[0], x[1].split('\n')[0]))
By doing that, the output of your map is the fileName and the 1st line.

How to work with CSV files inside a zipped folder?

I'm working with zipped files in python for the first time, and I'm stumped.
I read the documentation for zipfile, but I'm not sure what would be the best way to do what I'm trying to do. I have a zipped folder with CSV files inside, and I'd like to be able to open the zip file, and retrieve certain values from the csv files inside.
Do I use zipfile.extract(file name here) to bring it to the current working directory? And if I do that, do I just use the file name to work with the file, or does this index or list them differently?
Currently, I manually extract all files in the zipped folder to the current working directory for my project, and then use the csv module to read them. All I'm really trying to do is remove that step.
Any and all help would be greatly appreciated!
You are looking to avoid extracting to disk, in the zip docs for python there is ZipFile.open() which gives you a file-like object. That is an object that mostly behaves like a regular file on disk, but it is in memory. It gives a bytes array when read, at least in py3.
Something like this...
from zipfile import ZipFile
import csv
with ZipFile('abc.zip') as myzip:
print(myzip.filelist)
for mf in myzip.filelist:
with myzip.open(mf.filename) as myfile:
mc = myfile.read()
c = csv.StringIO(mc.decode())
for row in c:
print(row)
The documentation of Python is actually quite good once one has learned how to find things as well as some of the basic programming terms/descriptions used in the documentation.
For some reason csv.BytesIO is not implemented, hence the extra step via csv.StringIO.

Data structures suitable for searching

I have common problem. I have some data and I want search in them. My issue is, that I dont know a proper data structures and algorhitm suitable for this situation.
There are two kind of objects - Process and Package. Both have some properties, but they are only data structures (dont have any methods). Next, there are PackageManager and something what can be called ProcessManager, which both have function returning list of files that belongs to some Package or files that is used by some Process.
So semantically, we can imagine these data as
Packages:
Package_1
file_1
_ file_2
file_3
Package_2
file_4
file_5
file_6
Actually file that belongs to Package_k can not belong to Package_l for k != l :-)
Processes:
Process_1
file_2
file_3
Process_2
file_1
Files used by processes corresponds to files owned by packages. Also, there the rule doesn't applies on this as for packages - that means, n processes can use one same file at the same time.
Now what is the task. I need to find some match between processes and packages - for given list of packages, I need to find list of processes which uses any of files owned by packages.
My temporary solution was making list of [package_name, package_files] and list of [process_name, process_files] and for every file from every package I was searching through every file of every process searching for match, but of course it could be only temporary solution vzhledem to horrible time complexity (even when I sort the files and use binary search to it).
What you can recommend me for this kind of searching please?
(I am coding it in python)
Doing the matching with sets should be faster:
watched_packages = [Package_1, Package_3] # Packages to consider
watched_files = { # set comprehension
file_
for package in watched_packages
for file_ in package.list_of_files
}
watched_processes = [
process
for process in all_processes
if any(
file_ in watched_files
for file_ in process.list_of_files
)
]
Based on my understanding of what you are trying to do - given a file name, you want to find a list of all the processes that use that file, this snippet of code should help:
from collections import defaultdict
# First make a dictionary that contains a file, and all processes it is a member of.
file_process_map=defaultdict(list)
[file_process_map[fn].append(p) for p in processes for fn in p.file_list]
Basically, we're converting your existing structure (where a process has one or more files) into a structure where we have a filename, and a list of processes that use it.
Now when you have a file you need to search for (in the processes) just look it up in the "file_process_map" dictionary and you'll have a list of all the processes that use the given file.
It is assumed here that "processes" is a list of objects, and each object has a file_list attribute that contains a list of associated files. Obviously, depending on your data structure you might need to alter the code..

Grabbing all files in a query using Python

I'm processing a large amount of files using Python. Data related to each other through their file names.
If I was to perform CMD command to perform this (in windoes) it would look something like:
DIR filePrefix_??.txt
And this would return all the file names I would need for that group.
Is there a similar function that I can use in Python?
Have a look at the glob module.
glob.glob("filePrefix_??.txt")
returns a list of matching file names.

Categories