I have a large data set (~2 GB) to analyse and I'd like to use multiprocessing to reduce the run time of the code. I've imported the data set into a list, over which I then want to run numerous passes. On each pass I'll set up a pool with one worker per available core, and each worker will then only assess a certain block of the data set (note: each worker still needs access to the complete data set).
Each line of the input file takes the format "a,b,c,d,e,f,g,h" and all are numbers.
I'm struggling to separate out the parameters in the Calc1stPass pool function; I'm getting a "tuple index out of range" error. Can anyone help me out with this error please?
def Calc1stPass(DataSet,Params):
    print("DataSet =", DataSet)
    print("Params =", Params)
    Pass, (PoolNumber, ArrayCount, CoreCount) = Params
    StartRow = int((ArrayCount / CoreCount) * PoolNumber)
    EndRow = int(((ArrayCount / CoreCount) * (PoolNumber+1))-1)
    for Row in range(StartRow,EndRow):
        Rand = randrange(ArrayCount)
        Value1 = Decimal(DataSet[Row][0]) + Decimal(DataSet[Row][1])
        Value2 = Decimal(DataSet[Rand][0]) + Decimal(DataSet[Rand][1])
        Value3 = Value1 - Value2
        NewValue = Decimal(DataSet[Row][7]) + Value3
        DataSet[Row][7] = str(NewValue)

def main():
    #Importing the file
    print("Importing File ", FileToImport)
    OriginalDataSet = []
    f = open(FileToImport)
    for line in f:
        StrippedLine = line.rstrip()
        OriginalDataSet.append(StrippedLine.split(",",))
    ArrayCount = len(OriginalDataSet)
    #Running passes on dataset
    for Pass in range(NumberofPasses):
        print("Running Pass : ", Pass + 1, " of ", NumberofPasses)
        CoreCount = mp.cpu_count()
        WorkPool=mp.Pool(CoreCount)
        for PoolNumber in range(CoreCount):
            Params = [Pass,PoolNumber,ArrayCount,CoreCount]
            RevisedDataSet = WorkPool.starmap(Calc1stPass, product(OriginalDataSet, zip(range(1),Params)))
            print(RevisedDataSet)

if __name__ == "__main__":
    freeze_support()
    main()
Okay, here we go with what I came up with after some discussion plus trial and error. I hope I've kept it somewhat comprehensible. However, it seems you are very new to a lot of this, so you probably have a lot of reading to do regarding how certain libraries and data types work.
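A quick word on the error itself before moving on. As far as I can tell, zip(range(1), Params) yields a single pair, and product then combines every row of the data set with that pair, so Calc1stPass receives one row plus something like (0, Pass) and the nested unpacking cannot work. The EndRow computation also subtracts 1 although range already excludes its end, which would skip the last row of each block. The sketch below reproduces the argument shapes with small stand-in values (rows, params and block_bounds are names I made up for illustration) and shows one way to build one argument tuple per worker instead.

```python
from itertools import product

params = [2, 1, 100, 4]          # [Pass, PoolNumber, ArrayCount, CoreCount]
rows = [["1", "2"], ["3", "4"]]  # tiny stand-in for OriginalDataSet

# zip stops at the shorter input, so only one pair survives
print(list(zip(range(1), params)))                 # [(0, 2)]
# product then pairs every *row* with that single tuple
print(list(product(rows, zip(range(1), params))))


def block_bounds(pool_number, array_count, core_count):
    """Half-open row range [start, end) handled by one worker."""
    start = array_count * pool_number // core_count
    end = array_count * (pool_number + 1) // core_count
    return start, end


# One argument tuple per worker, in the shape Pool.starmap expects
tasks = [(pool_number, 100, 4) for pool_number in range(4)]
print([block_bounds(*task) for task in tasks])
# [(0, 25), (25, 50), (50, 75), (75, 100)]
```

With argument tuples like these, each worker call gets exactly the values it needs to unpack, and the blocks cover all rows without gaps.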
Analyzing the algorithm
Let's start with taking a closer look at your computation:
for Pass in range(Passes):
    for Row in range(StartRow,EndRow):
        Rand = randrange(ArrayCount)
        Value1 = Decimal(DataSet[Row][0]) + Decimal(DataSet[Row][1])
        Value2 = Decimal(DataSet[Rand][0]) + Decimal(DataSet[Rand][1])
        Value3 = Value1 - Value2
        NewValue = Decimal(DataSet[Row][7]) + Value3
        DataSet[Row][7] = str(NewValue)
So basically, we update a single row through a computation involving another random row.
Assumptions that I make:
the real algorithm does a bit more stuff, otherwise it is hard to see what you want to achieve
the access pattern of the real algorithm stays the same
Following our discussion, there are no functional reasons for the following aspects:
Computation in Decimal is unnecessary. float will do just fine
The values don't need to be stored as string. We can use an array of float
At this point it is clear that we can save tremendous amounts of runtime by using a numpy array instead of a list of string.
There is an additional hazard here for parallelization: We use random numbers. When we use multiple processes, the random number generators need to be set up for parallel generation. We'll cross that bridge when we get there.
Notably, the output column is no input for the next pass. The inputs per pass stay constant.
Input / Output
The input file format seems to be a simple CSV mostly filled with floating point numbers (using only one decimal place) and one column not being a floating point number. The text based format coupled with your information that there are gigabytes of data means that a significant amount of time will be spent just parsing the input file or formatting the output. I'll try to be efficient in both but keep things simple enough that extensions in both are possible.
Optimizing the sequential algorithm
It is always advisable to first optimize the sequential case before parallelizing. So we start here. We begin with parsing the input file into a numpy array.
import numpy as np


def ReadInputs(Filename):
    """Read a CSV file containing 10 columns

    The 7th column is skipped because it doesn't contain floating point values

    Return value:
    2D numpy array of floats
    """
    UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
    return np.loadtxt(Filename, delimiter=',', usecols=UsedColumns)
Since we are using numpy, we switch over to its random number generators. This is the setup routine. It allows us to force deterministic values for easier debugging.
def MakeRandomGenerator(Deterministic=False):
    """Initializes the random number generator avoiding birthday paradox

    Arguments:
    Deterministic -- if True, the same random numbers are being used

    Return value:
    numpy random number generator
    """
    SeedInt = 0 if Deterministic else None
    Seed = np.random.SeedSequence(SeedInt)
    return np.random.default_rng(Seed)
And now the main computation. Numpy makes this very straight-forward.
def ComputePass(DataSets, RandomGenerator):
    """The main computation

    Arguments:
    DataSets -- 2D numpy array. Changed in place
    RandomGenerator -- numpy random number generator
    """
    Count = len(DataSets)
    RandomIndices = RandomGenerator.integers(
        low=0, high=Count, size=Count)
    RandomRows = DataSets[RandomIndices]
    # All rows: first column + second column
    Value1 = DataSets[:, 0] + DataSets[:, 1]
    Value2 = RandomRows[:, 0] + RandomRows[:, 1]
    Value3 = Value1 - Value2
    # This change is in-place of the whole DataSets array
    DataSets[:, 7] += Value3
I've kept the structure the same. That means there are a few optimizations that we can still do:
We never use most columns. Columns that are unnecessary should be removed from the array (skipped in input parsing) to reduce memory consumption and improve locality of data. If necessary for output, it is better to merge in the output stage, maybe by re-reading the input file to gather the remaining columns
Since the sum of the first two columns never changes for any row, we could pre-compute that sum once for all rows; Value3 in each pass is then just the difference of two lookups. Again, if we don't need the first two columns in memory, better to remove them
If we transpose the array (or store in Fortran order), we improve vectorization. This will make the use of MPI harder, but not impossible
I've not done any of this because I do not want to stray too far from the original algorithm.
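For illustration only, here is a minimal sketch of the pre-computation idea from the list above. It assumes a toy array with two input columns and one output column; compute_pass_presummed, row_sums and output are names I made up, not part of the code in this answer.

```python
import numpy as np


def compute_pass_presummed(row_sums, output, rng):
    """One pass that reuses pre-computed sums of the first two columns."""
    count = len(row_sums)
    random_indices = rng.integers(low=0, high=count, size=count)
    # Value1 - Value2 becomes a difference of two lookups
    output += row_sums - row_sums[random_indices]


# Toy data: columns 0 and 1 are inputs, column 2 is the output column
data = np.array([[1.0, 2.0, 0.0],
                 [3.0, 4.0, 0.0],
                 [5.0, 6.0, 0.0]])
row_sums = data[:, 0] + data[:, 1]  # computed once, reused in every pass
output = data[:, 2].copy()
rng = np.random.default_rng(0)
for _ in range(3):
    compute_pass_presummed(row_sums, output, rng)
print(output)
```

After the pre-computation, only row_sums and the output column have to stay in memory for the passes.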
The last step is the output. Here I go with a pure Python route to keep things simple and replicate the input file format:
def WriteOutputs(Filename, DataSets):
    LineFormat = "{:.1f}, " * 6 + "+" + ", {:.1f}" * 3 + "\n"
    with open(Filename, 'w') as OutFile:
        for Row in DataSets:
            OutFile.write(LineFormat.format(*Row))
Now the entire operation is rather simple:
def main():
    InFilename = "indata.csv"
    OutFilename = "outdata.csv"
    Passes = 20
    RandomGenerator = MakeRandomGenerator()
    DataSets = ReadInputs(InFilename)
    for _ in range(Passes):
        ComputePass(DataSets, RandomGenerator)
    WriteOutputs(OutFilename, DataSets)


if __name__ == '__main__':
    main()
Parallelization framework
There are two main concerns for parallelization:
For every row, we need access to the entire input data set to pick a random entry
The amount of calculation per row is very low
So we need to find a way that keeps overhead per row small and shares the input data set efficiently.
The first choice is multiprocessing since, you know, standard library and all that. However, I think that the normal usage patterns have too much overhead. It's certainly possible but I would like to use MPI for this to give us as much performance as possible. Also, your first attempt at parallelization used a pattern that matches MPI's preferred pattern. So it is a good fit.
A word on the concept of MPI: multiprocessing.Pool works with a main process that distributes work items among a set of worker processes. MPI starts N processes that all execute the same code. There is no main process. The only distinguishing feature is the process "rank", which is a number in [0, N). If you need a main process, the one with rank 0 is usually chosen. Other than that, the idea is that all processes execute the same code, only picking different indices or offsets based on their rank. If processes need to communicate, there are a couple of "collective" communication patterns such as broadcasting, scattering, and gathering.
Option 1: Pure MPI
Let's rewrite the code. The main idea is this: We distribute rows in the data set among all processes. Then each process calculates all passes for its own set of rows. Input and output take considerable time, so we try to do as much as possible parallelized, too.
We start by defining a helper function that defines how we distribute rows among all processes. This is very similar to what you had in your original version.
import numpy as np
from mpi4py import MPI


def MakeDistribution(NumberOfRows):
    """Computes how the data set should be distributed across processes

    Arguments:
    NumberOfRows -- size of the whole dataset

    Return value:
    (Offsets, Counts) numpy integer arrays. One entry per process
    """
    Comm = MPI.COMM_WORLD
    WorldSize = Comm.Get_size()
    SameSize, Tail = divmod(NumberOfRows, WorldSize)
    Counts = np.full(WorldSize, SameSize, dtype=int)
    Counts[:Tail] += 1
    # Start offset per process: exclusive prefix sum of the counts
    Offsets = np.cumsum(Counts) - Counts
    return Offsets, Counts
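If you want to check a distribution like this without starting MPI, a pure-Python sketch is enough. split_rows is a hypothetical stand-in that takes the number of processes as an argument; the property to look for is that every block starts exactly where the previous one ends and that the counts add up to the number of rows.

```python
def split_rows(number_of_rows, world_size):
    """Return (offsets, counts) with one entry per process."""
    same_size, tail = divmod(number_of_rows, world_size)
    # the first `tail` processes take one extra row each
    counts = [same_size + 1 if rank < tail else same_size
              for rank in range(world_size)]
    offsets = []
    start = 0
    for count in counts:
        offsets.append(start)
        start += count
    return offsets, counts


offsets, counts = split_rows(8, 3)
print(offsets, counts)  # [0, 3, 6] [3, 3, 2]
```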
A second helper function is used to distribute the data sets among all processes. MPI's allgather function is used to collect results of a computation among all processes into one array. The normal function gather collects the whole array on one process. Allgather collects it in all processes. Since all processes need access to all data sets for their random access, we use allgather. Allgatherv is a generalized version that allows different number of entries per process. We need this because we cannot guarantee that all processes have the same number of rows in their local data set.
This function uses the "buffer" interface of mpi4py. This is the more efficient version but also very error-prone. If we mess up an index or the size of a data type, we risk data corruption.
def DistributeDataSets(DataSets, Offsets, Counts):
    """Shares the datasets with all other processes

    Arguments:
    DataSets -- numpy array of floats. Changed in place
    Offsets, Counts -- See MakeDistribution

    Return value:
    DataSets. Most likely a reference to the original.
    Might be an updated copy
    """
    # Sanitize the input. Better safe than sorry and shouldn't cost anything
    DataSets = np.ascontiguousarray(DataSets, dtype='f8')
    assert len(DataSets) == np.sum(Counts)
    # MPI works best if we pretend to have 1-dimensional data
    InnerSize = np.prod(DataSets.shape[1:], dtype=int)
    # I really wish mpi4py had a helper for this
    BufferDescr = (DataSets,
                   Counts * InnerSize,
                   Offsets * InnerSize,
                   MPI.DOUBLE)
    MPI.COMM_WORLD.Allgatherv(MPI.IN_PLACE, BufferDescr)
    return DataSets
We split reading the input data into two parts. First we read all lines in a single process. This is relatively cheap and we need to know the total number of rows before we can distribute the datasets. Then we scatter the lines among all processes and let each process parse its own set of rows. After that, we use the DistributeDataSets function to let each process know all the results.
Scattering the lines uses mpi4py's pickle interface that can transfer arbitrary objects among processes. It's slower but more convenient. For stuff like lists of strings it's very good.
def ParseLines(TotalLines, Offset, OwnLines):
    """Allocates a data set and parses the own segment of it

    Arguments:
    TotalLines -- number of rows in the whole data set across all processes
    Offset -- starting offset of the set of rows parsed by this process
    OwnLines -- list of lines to be parsed by the local process

    Return value:
    a 2D numpy array. The rows [Offset:Offset+len(OwnLines)] are initialized
    with the parsed values
    """
    UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
    DataSet = np.empty((TotalLines, len(UsedColumns)), dtype='f8')
    OwnEnd = Offset + len(OwnLines)
    for Row, Line in zip(DataSet[Offset:OwnEnd], OwnLines):
        Columns = Line.split(',')
        # overwrite in-place with new values
        Row[:] = [float(Columns[Column]) for Column in UsedColumns]
    return DataSet


def DistributeInputs(Filename):
    """Read input from the file and distribute it among processes

    Arguments:
    Filename -- path to the CSV file to parse

    Return value:
    (DataSets, Offsets, Counts) with
    DataSets -- 2D array containing all values in the CSV file
    Offsets -- Row indices (one per rank) where each process starts its own
               processing
    Counts -- number of rows per process
    """
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    Lines = None
    LineCount = None
    if not Rank:
        # Read the data. We do as little work as possible here so that other
        # processes can help with the parsing
        with open(Filename) as InFile:
            Lines = InFile.readlines()
        LineCount = len(Lines)
    # broadcast so that all processes know the number of datasets
    LineCount = Comm.bcast(LineCount, root=0)
    Offsets, Counts = MakeDistribution(LineCount)
    # reshape into one list per process
    if not Rank:
        Lines = [Lines[Offset:Offset+Count]
                 for Offset, Count
                 in zip(Offsets, Counts)]
    # distribute strings for parsing
    Lines = Comm.scatter(Lines, root=0)
    # parse into a float array
    DataSets = ParseLines(LineCount, Offsets[Rank], Lines)
    del Lines  # release strings because this is a huge array
    # Share the parsed result
    DataSets = DistributeDataSets(DataSets, Offsets, Counts)
    return DataSets, Offsets, Counts
Now we need to update the way the random number generator is initialized. What we need to prevent is that each process has the same state and generates the same random numbers. Thankfully, numpy gives us a convenient way of doing this.
def MakeRandomGenerator(Deterministic=False):
    """Initializes the random number generator avoiding birthday paradox

    Arguments:
    Deterministic -- if True, the same number of processes should always result
                     in the same random numbers being used

    Return value:
    numpy random number generator
    """
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    AllSeeds = None
    if not Rank:
        # the root process (rank=0) generates a seed sequence for everyone else
        WorldSize = Comm.Get_size()
        SeedInt = 0 if Deterministic else None
        OwnSeed = np.random.SeedSequence(SeedInt)
        AllSeeds = OwnSeed.spawn(WorldSize)
    # mpi4py can scatter Python objects. This is the simplest way
    OwnSeed = Comm.scatter(AllSeeds, root=0)
    return np.random.default_rng(OwnSeed)
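To see what the spawning buys us without running MPI, here is a small single-process sketch. It pretends to have four processes and builds one generator per child seed; the fixed seed 0 and the number 4 are arbitrary choices for the illustration.

```python
import numpy as np

root_seed = np.random.SeedSequence(0)  # fixed seed, so the run is reproducible
child_seeds = root_seed.spawn(4)       # one child seed per (pretend) process
generators = [np.random.default_rng(seed) for seed in child_seeds]

# Each generator produces its own stream of random row indices
draws = [gen.integers(low=0, high=1_000_000, size=5) for gen in generators]
for rank, values in enumerate(draws):
    print(rank, values)
```

Every rank gets a different stream, and rebuilding the seed sequence from the same root seed gives the same streams again, which is what the Deterministic flag relies on.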
The computation itself is almost unchanged. We just need to limit it to the rows for which the individual process is responsible.
def ComputePass(DataSets, Offset, Count, RandomGenerator):
    """The main computation

    Arguments:
    DataSets -- 2D numpy array. Changed in place
    Offset, Count -- rows that should be updated by this process
    RandomGenerator -- numpy random number generator
    """
    RandomIndices = RandomGenerator.integers(
        low=0, high=len(DataSets), size=Count)
    RandomRows = DataSets[RandomIndices]
    # Creates a "view" into the whole dataset for the given slice
    OwnDataSets = DataSets[Offset:Offset + Count]
    # All rows: first column + second column
    Value1 = OwnDataSets[:, 0] + OwnDataSets[:, 1]
    Value2 = RandomRows[:, 0] + RandomRows[:, 1]
    Value3 = Value1 - Value2
    # This change is in-place of the whole DataSets array
    OwnDataSets[:, 7] += Value3
Now we come to writing the output. The most expensive part is formatting the floating point numbers into strings. So we let each process format its own data. MPI has a file IO interface that allows all processes to write a single file together. Unfortunately, for text files, we need to calculate the offsets before writing the data. So we format all rows into one huge string per process, then write the file.
import io


def WriteOutputs(Filename, DataSets, Offset, Count):
    """Writes all DataSets to a CSV file

    We parse all rows to a string (one per process), then write it
    collectively using MPI

    Arguments:
    Filename -- output path
    DataSets -- all values among all processes
    Offset, Count -- the rows for which the local process is responsible
    """
    StringBuf = io.StringIO()
    LineFormat = "{:.6f}, " * 6 + "+" + ", {:.6f}" * 3 + "\n"
    for Row in DataSets[Offset:Offset+Count]:
        StringBuf.write(LineFormat.format(*Row))
    StringBuf = StringBuf.getvalue()  # to string
    StringBuf = StringBuf.encode()  # to bytes
    Comm = MPI.COMM_WORLD
    BytesPerProcess = Comm.allgather(len(StringBuf))
    Rank = Comm.Get_rank()
    OwnOffset = sum(BytesPerProcess[:Rank])
    FileLength = sum(BytesPerProcess)
    AccessMode = MPI.MODE_WRONLY | MPI.MODE_CREATE
    OutFile = MPI.File.Open(Comm, Filename, AccessMode)
    OutFile.Set_size(FileLength)
    OutFile.Write_ordered(StringBuf)
    OutFile.Close()
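The offset bookkeeping is easier to follow in isolation. The following sketch pretends that three processes each formatted their own rows and then places the chunks into one buffer; blocks, format_rows and the three-column format are made up for the illustration, and a bytearray stands in for the shared output file.

```python
from itertools import accumulate


def format_rows(rows):
    """Format rows of three floats into CSV text, encoded to bytes."""
    line_format = "{:.6f}, " * 2 + "{:.6f}\n"
    return "".join(line_format.format(*row) for row in rows).encode()


# Pretend three processes own these rows
blocks = [
    [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
    [(7.0, 8.0, 9.0)],
    [(10.0, 11.0, 12.0)],
]
chunks = [format_rows(block) for block in blocks]
bytes_per_process = [len(chunk) for chunk in chunks]
# Exclusive prefix sum: the byte position where each process starts writing
offsets = [0] + list(accumulate(bytes_per_process))[:-1]

output = bytearray(sum(bytes_per_process))
for offset, chunk in zip(offsets, chunks):
    output[offset:offset + len(chunk)] = chunk
print(output.decode())
```

Each process only needs the chunk lengths of the ranks before it, which is why the lengths have to be exchanged before anything is written.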
The main process is almost unchanged.
def main():
    InFilename = "indata.csv"
    OutFilename = "outdata.csv"
    Passes = 20
    RandomGenerator = MakeRandomGenerator()
    DataSets, Offsets, Counts = DistributeInputs(InFilename)
    Rank = MPI.COMM_WORLD.Get_rank()
    Offset = Offsets[Rank]
    Count = Counts[Rank]
    for _ in range(Passes):
        ComputePass(DataSets, Offset, Count, RandomGenerator)
    WriteOutputs(OutFilename, DataSets, Offset, Count)


if __name__ == '__main__':
    main()
You need to call this script with mpirun or mpiexec. E.g. mpiexec python3 script_name.py
Using shared memory
The MPI pattern has one significant drawback: Each process needs its own copy of the whole data set. Given its size, this is very inconvenient. We might run out of memory before we run out of CPU cores to parallelize over. As a different idea, we can use shared memory. Shared memory allows multiple processes to access the same physical memory without any extra cost. This has some drawbacks:
We need a fairly recent Python version: multiprocessing.shared_memory was added in Python 3.8
Python's implementation may behave differently on various operating systems. I could only test it on Linux. There is a chance that it will not work on any different system
IMHO python's implementation is not great. You will notice that the final version will print some warnings which I think are harmless. Maybe I'm using it wrong but I don't see a more correct way of using it
It limits you to a single PC. MPI itself is perfectly capable (and indeed designed to) operate across multiple systems on a network. Shared memory works only locally.
The major benefit is that the memory consumption does not increase with the number of processes.
We start by allocating such a data set.
From here on, we put in "barriers" at various points where processes may have to wait for one another. For example because all processes need to access the same shared memory segment, they all have to open it before we can unlink it.
from multiprocessing import shared_memory


def AllocateSharedDataSets(NumberOfRows, NumberOfCols=9):
    """Creates a numpy array in shared memory

    Arguments:
    NumberOfRows, NumberOfCols -- basic shape

    Return value:
    (DataSets, Buf) with
    DataSets -- numpy array shaped (NumberOfRows, NumberOfCols).
                Datatype float
    Buf -- multiprocessing.shared_memory.SharedMemory that backs the array.
           Close it when no longer needed
    """
    length = NumberOfRows * NumberOfCols * np.float64().itemsize
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    Buf = None
    BufName = None
    if not Rank:
        Buf = shared_memory.SharedMemory(create=True, size=length)
        BufName = Buf.name
    BufName = Comm.bcast(BufName)
    if Rank:
        Buf = shared_memory.SharedMemory(name=BufName, size=length)
    DataSets = np.ndarray((NumberOfRows, NumberOfCols), dtype='f8',
                          buffer=Buf.buf)
    Comm.barrier()
    if not Rank:
        Buf.unlink()  # this may differ among operating systems
    return DataSets, Buf
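If the shared memory API is new to you, this single-process sketch may help. It opens the same segment twice, the second time by name the way another process would, and shows that a write through one numpy view is visible through the other. The shape (4, 3) and the variable names are arbitrary; I have only tried this pattern on Linux.

```python
import numpy as np
from multiprocessing import shared_memory

shape = (4, 3)
size = int(np.prod(shape)) * np.dtype('f8').itemsize

owner = shared_memory.SharedMemory(create=True, size=size)
try:
    writer = np.ndarray(shape, dtype='f8', buffer=owner.buf)
    writer[:] = np.arange(12, dtype='f8').reshape(shape)

    # A second handle, opened by name as another process would do it
    other = shared_memory.SharedMemory(name=owner.name)
    try:
        reader = np.ndarray(shape, dtype='f8', buffer=other.buf)
        reader[0, 0] = 42.0  # visible through both views
        seen_by_writer = float(writer[0, 0])
        total = float(reader.sum())
        del reader  # drop the numpy view before closing the handle
    finally:
        other.close()
    del writer
finally:
    owner.close()
    owner.unlink()  # remove the segment once nobody needs it

print(seen_by_writer, total)
```

The numpy views have to be dropped before close() is called, otherwise closing fails because the buffer is still exported.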
The input parsing also changes a little because we have to put the data into the previously allocated array.
def ParseLines(DataSets, Offset, OwnLines):
    """Reads lines into a preallocated array

    Arguments:
    DataSets -- [Rows, Cols] numpy array. Will be changed in-place
    Offset -- starting offset of the set of rows parsed by this process
    OwnLines -- list of lines to be parsed by the local process
    """
    UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
    OwnEnd = Offset + len(OwnLines)
    OwnDataSets = DataSets[Offset:OwnEnd]
    for Row, Line in zip(OwnDataSets, OwnLines):
        Columns = Line.split(',')
        Row[:] = [float(Columns[Column]) for Column in UsedColumns]


def DistributeInputs(Filename):
    """Read input from the file and stores it in shared memory

    Arguments:
    Filename -- path to the CSV file to parse

    Return value:
    (DataSets, Offsets, Counts, Buf) with
    DataSets -- [Rows, 9] array containing all values in the CSV file
    Offsets -- Row indices (one per rank) where each process starts its own
               processing
    Counts -- number of rows per process
    Buf -- multiprocessing.shared_memory.SharedMemory object backing the
           DataSets object
    """
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    Lines = None
    LineCount = None
    if not Rank:
        # Read the data. We do as little work as possible here so that other
        # processes can help with the parsing
        with open(Filename) as InFile:
            Lines = InFile.readlines()
        LineCount = len(Lines)
    # broadcast so that all processes know the number of datasets
    LineCount = Comm.bcast(LineCount, root=0)
    Offsets, Counts = MakeDistribution(LineCount)
    # reshape into one list per process
    if not Rank:
        Lines = [Lines[Offset:Offset+Count]
                 for Offset, Count
                 in zip(Offsets, Counts)]
    # distribute strings for parsing
    Lines = Comm.scatter(Lines, root=0)
    # parse into a float array
    DataSets, Buf = AllocateSharedDataSets(LineCount)
    try:
        ParseLines(DataSets, Offsets[Rank], Lines)
        Comm.barrier()
        return DataSets, Offsets, Counts, Buf
    except:
        Buf.close()
        raise
Output writing is exactly the same. The main process changes slightly because now we have to manage the lifetime of the shared memory.
import contextlib


def main():
    InFilename = "indata.csv"
    OutFilename = "outdata.csv"
    Passes = 20
    RandomGenerator = MakeRandomGenerator()
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    DataSets, Offsets, Counts, Buf = DistributeInputs(InFilename)
    with contextlib.closing(Buf):
        Offset = Offsets[Rank]
        Count = Counts[Rank]
        for _ in range(Passes):
            ComputePass(DataSets, Offset, Count, RandomGenerator)
        WriteOutputs(OutFilename, DataSets, Offset, Count)


if __name__ == '__main__':
    main()
Results
I've not benchmarked the original version. The sequential version requires 2 GiB memory and 3:20 minutes for 12500000 lines and 20 passes.
The pure MPI version requires 6 GiB and 42 seconds with 6 cores.
The shared memory version requires a bit over 2 GiB of memory and 38 seconds with 6 cores.
I am new to using parallel processing for data analysis. I have a fairly large array and I want to apply a function to each index of said array.
Here is the code I have so far:
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import multiprocessing
from functools import partial

def fit_model(data,q):
    #data is a 1-D array holding precipitation values
    years = np.arange(1895,2018,1)
    res = QuantReg(exog=sm.add_constant(years),endog=data).fit(q=q)
    pointEstimate = res.params[1] #output slope of quantile q
    return pointEstimate

#precipAll is an array of shape (1405*621,123,12) (longitudes*latitudes,years,months)
#find all indices where there is data
nonNaN = np.where(~np.isnan(precipAll[:,0,0]))[0] #481631 indices
month = 4

#holder array for results
asyncResults = np.zeros((precipAll.shape[0])) * np.nan

def saveResult(result,pos):
    asyncResults[pos] = result

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=20) #my server has 24 CPUs
    for i in nonNaN:
        #use partial so I can also pass the index i so the result is
        #stored in the expected position
        new_callback_function = partial(saveResult, pos=i)
        pool.apply_async(fit_model, args=(precipAll[i,:,month],0.9),callback=new_callback_function)
    pool.close()
    pool.join()
When I ran this, I stopped it once it had taken longer than the run without multiprocessing. The function fit_model takes on the order of 0.02 seconds per call, so could the overhead associated with apply_async be causing the slowdown? I need to maintain the order of the results because I am plotting this data onto a map after the processing is done. Any thoughts on where I can improve this are greatly appreciated!
If you need to use the multiprocessing module, you'll probably want to batch more rows together into each task that you give to the worker pool. However, for what you're doing, I'd suggest trying out Ray due to its efficient handling of large numerical data.
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import ray


@ray.remote
def fit_model(precip_all, i, month, q):
    data = precip_all[i,:,month]
    years = np.arange(1895, 2018, 1)
    res = QuantReg(exog=sm.add_constant(years), endog=data).fit(q=q)
    pointEstimate = res.params[1]
    return pointEstimate


if __name__ == '__main__':
    ray.init()
    # Create an array and place it in shared memory so that the workers can
    # access it (in a read-only fashion) without creating copies.
    precip_all = np.zeros((100, 123, 12))
    precip_all_id = ray.put(precip_all)
    result_ids = []
    for i in range(precip_all.shape[0]):
        result_ids.append(fit_model.remote(precip_all_id, i, 4, 0.9))
    results = np.array(ray.get(result_ids))
Some Notes
The example above runs out of the box, but note that I simplified the logic a bit. In particular, I removed the handling of NaNs.
On my laptop with 4 physical cores, this takes about 4 seconds. If you use 20 cores instead and make the data 9000 times bigger, I'd expect it to take about 7200 seconds, which is quite a long time. One possible approach to speeding this up is to use more machines or to process multiple rows in each call to fit_model in order to amortize some of the overhead.
The above example actually passes the entire precip_all matrix into each task. This is fine because each fit_model task only has read access to a copy of the matrix stored in shared memory and so doesn't need to create its own local copy. The call to ray.put(precip_all) places the array in shared memory once up front.
For more about the differences between Ray and Python multiprocessing, see the Ray documentation. Note that I'm helping develop Ray.
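The batching idea mentioned above (several rows per task, to amortize the per-task overhead) can be sketched independently of the framework. In this sketch process_batch is a stand-in for calling fit_model on every row of a batch, and run_batched and split_into_batches are hypothetical helper names. By default it runs serially with the builtin map; passing an order-preserving parallel map such as the imap method of a multiprocessing.Pool as map_fn should distribute the batches while keeping the results in input order.

```python
def split_into_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size entries."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_batch(batch):
    """Stand-in for fitting the model on every row of one batch."""
    return [value * value for value in batch]


def run_batched(items, batch_size, map_fn=map):
    """Apply process_batch to each batch and flatten the results in order.

    map_fn is the builtin map (serial) by default; an order-preserving
    parallel map can be passed instead.
    """
    results = []
    batches = split_into_batches(items, batch_size)
    for batch_result in map_fn(process_batch, batches):
        results.extend(batch_result)
    return results


print(run_batched(list(range(10)), batch_size=4))
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Because the results come back in input order, they can be written straight into the result array at the positions of the non-NaN indices.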
I'm trying to measure four similarities (cosine_similarity, jaccard, SequenceMatcher similarity, jaccard_variants similarity) over 800K pairs of documents.
Every document file is in txt format and about 100 KB to 300 KB (about 1500000 characters).
I have two questions regarding how to make my python scripts faster:
MY PYTHON SCRIPTS:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from difflib import SequenceMatcher

def get_tf_vectors(doc1, doc2):
    text = [doc1, doc2]
    vectorizer = CountVectorizer(text)
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

def measure_sim(doc1, doc2):
    a, b = doc1.split(), doc2.split()
    c, d = set(a), set(b)
    vectors = [t for t in get_tf_vectors(doc1, doc2)]
    return cosine_similarity(vectors)[1][0], float(len(c&d) / len(c|d)), \
        1 - (sum(abs(vectors[0] - vectors[1])) / sum(vectors[0] + vectors[1])), \
        SequenceMatcher(None, a, b).ratio()

#items in doc_pair list are like ('ID', 'doc1_directory', 'doc2_directory')
def data_analysis(doc_pair_list):
    result = {}
    for item in doc_pair_list:
        f1 = open(item[1], 'rb')
        doc1 = f1.read()
        f1.close()
        f2 = open(item[2], 'rb')
        doc2 = f2.read()
        f2.close()
        result[item[0]] = measure_sim(doc1, doc2)
However, this code uses only 10% of my CPU and the task would take almost 20 days to finish. So I want to ask if there is any way to make this code more efficient.
Q1. Since the documents are stored on an HDD, I thought loading the text data should take some time. Hence, I suspect that loading only two documents each time the computer computes the similarities might not be efficient. So I am going to try loading 50 pairs of documents at once and then computing the similarities for each pair. Would that be helpful?
Q2. Most of the posts about "how to make your code run faster" say that I should use Python modules based on C code. However, since I'm using the sklearn module, which is known to be quite efficient, I wonder whether there is any better way.
Is there any way to help this Python script use more computer resources and become faster?
There may be better solutions, but you could try something like this if computing the similarities is the bottleneck:
1) A separate process reads all the files one by one and puts them into a multiprocessing.Queue.
2) A pool of multiple worker processes computes the similarities and puts the results into another multiprocessing.Queue.
3) The main process then simply takes the results from results_queue and saves them to a dictionary as you have it now.
I don't know your hardware limitations (number and speed of CPU cores, RAM size, disk read speed) and I don't have any samples to test it on.
EDIT: Below is provided the described code. Please try and check if it is faster and let me know. If the main blocker is loading of files, we can create more loader processes (e.g. 2 processes and each loads half of the files). If the blocker is calculating similarities, then you can create more worker processes (just change worker_count). Finally 'results' is the dictionary with all the results.
import multiprocessing
import os
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
def get_tf_vectors(doc1, doc2):
text = [doc1, doc2]
vectorizer = CountVectorizer(text)
vectorizer.fit(text)
return vectorizer.transform(text).toarray()
def calculate_similarities(doc_pairs_queue, results_queue):
""" Pick docs from doc_pairs_queue and calculate their similarities, save the result to results_queue. Repeat infinitely (until process is terminated). """
while True:
pair = doc_pairs_queue.get()
pair_id = pair[0]
doc1 = pair[1]
doc2 = pair[2]
a, b = doc1.split(), doc2.split()
c, d = set(a), set(b)
vectors = [t for t in get_tf_vectors(doc1, doc2)]
results_queue.put((pair_id, cosine_similarity(vectors)[1][0], float(len(c&d) / len(c|d)),
1 - (sum(abs(vectors[0] - vectors[1])) / sum(vectors[0] + vectors[1])),
SequenceMatcher(None, a, b).ratio()))
def load_files(doc_pair_list, loaded_queue):
"""
Pre-load files and put them to a queue, so working processes can get them.
:param doc_pair_list: list of files to be loaded (ID, doc1_path, doc2_path)
:param loaded_queue: multiprocessing.Queue that will hold pre-loaded data
"""
print("Started loading files...")
for item in doc_pair_list:
with open(item[1], 'rb') as f1:
with open(item[2], 'rb') as f2:
loaded_queue.put((item[0], f1.read(), f2.read())) # if queue is full, this automatically waits until there is space
print("Finished loading files.")
def data_analysis(doc_pair_list):
    # create a loader process that will pre-load files (it does no calculations, so it loads much faster)
    # loader puts loaded files to a queue; 1 pair ~ 500 KB, 1000 pairs ~ 500 MB max size of queue (RAM memory)
    loaded_pairs_queue = multiprocessing.Queue(maxsize=1000)
    loader = multiprocessing.Process(target=load_files, args=(doc_pair_list, loaded_pairs_queue))
    loader.start()
    # create worker processes - these will do all calculations
    results_queue = multiprocessing.Queue(maxsize=1000)  # workers put results to this queue
    worker_count = os.cpu_count() if os.cpu_count() else 2  # number of worker processes
    workers = []  # create list of workers, so we can terminate them later
    for i in range(worker_count):
        worker = multiprocessing.Process(target=calculate_similarities, args=(loaded_pairs_queue, results_queue))
        worker.start()
        workers.append(worker)
    # main process just picks the results from queue and saves them to the dictionary
    results = {}
    i = 0  # results counter
    pairs_count = len(doc_pair_list)
    while i < pairs_count:
        res = results_queue.get(timeout=600)  # timeout is just in case something unexpected happened (results are calculated much quicker)
        # Queue.get() is blocking - if queue is empty, get() waits until something is put into queue and then gets it
        results[res[0]] = res[1:]  # save to dictionary by ID (first item in the result)
        i += 1  # count the received result, otherwise the loop never ends
    # clean up the processes (so there aren't any zombies left)
    loader.terminate()
    loader.join()
    for worker in workers:
        worker.terminate()
        worker.join()
    return results
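As an alternative to calling terminate() on the workers, here is a minimal sketch of the same queue pattern where each worker stops by itself once it reads a "no more work" marker. The names SENTINEL, worker and run_pipeline are hypothetical, and the calculation is only a placeholder (it returns the length of the data); it is not the similarity code from above.

```python
import multiprocessing

SENTINEL = None  # hypothetical "no more work" marker, one is sent per worker


def worker(task_queue, result_queue):
    """Process tasks until the sentinel arrives, then return."""
    while True:
        task = task_queue.get()  # blocks until a task is available
        if task is SENTINEL:
            break
        item_id, data = task
        result_queue.put((item_id, len(data)))  # placeholder calculation


def run_pipeline(tasks, worker_count=2):
    """Feed tasks to worker processes and collect results by ID."""
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    workers = [
        multiprocessing.Process(target=worker, args=(task_queue, result_queue))
        for _ in range(worker_count)
    ]
    for process in workers:
        process.start()
    for task in tasks:
        task_queue.put(task)
    for _ in workers:
        task_queue.put(SENTINEL)  # tell every worker to stop
    results = {}
    for _ in range(len(tasks)):
        item_id, value = result_queue.get(timeout=60)
        results[item_id] = value
    for process in workers:
        process.join()  # join only after the results have been drained
    return results


if __name__ == "__main__":
    print(run_pipeline([(1, b"abc"), (2, b"hello")]))
```

The results are read before join() is called, because a process that still has items buffered for a queue may not exit until they are consumed.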
Please let me know about the results; I am quite interested in them and will assist you further if needed ;)
The first thing to do is to find the real bottleneck; I think using cProfile might confirm your suspicion or shed some more light on your problem.
You should be able to run your code unmodified using cProfile like this:
python -m cProfile -o profiling-results python-file-to-test.py
After that you can analyze the results using pstats like this:
import pstats
stats = pstats.Stats("profiling-results")
stats.sort_stats("tottime")
stats.print_stats(10)
More on profiling your code can be found in Marco Bonzanini's blog article "My Python Code is Slow? Tips for Profiling".
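If you only want to profile one function instead of the whole script, the same two modules can be used from inside the program. This is a minimal sketch; slow_function and profile_call are hypothetical names used only for illustration.

```python
import cProfile
import io
import pstats


def slow_function(n):
    """Hypothetical stand-in for the code you suspect is slow."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(10)  # ten most expensive entries
    return result, buffer.getvalue()


if __name__ == "__main__":
    result, report = profile_call(slow_function, 100_000)
    print(report)
```

Sorting by "tottime" shows the time spent inside each function itself, which usually points more directly at the bottleneck than the cumulative time.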
I do some computationally expensive tasks in Python and found the threading module for parallelization. I have a function which does the computation and returns an ndarray as its result. Now I want to know how I can parallelize my function and get back the calculated arrays from each thread.
The following example is strongly simplified, with light functions and calculations.
import threading as t
import numpy as np

def calculate_result(input):
    a = np.linspace(1.0, 1000.0, num=10000)  # just an example
    result = input * a
    return result

input = [1, 2, 3, 4]
for i in range(len(input)):
    thread = t.Thread(target=calculate_result, args=(input[i],))  # args must be a tuple
    thread.start()
    # Here I want to receive the return value from the thread
I am looking for a way to get the return value from the function for each thread, because in my task each thread calculates different values.
I found another question (how to get the return value from a thread in python?) where someone asks about a similar problem (without ndarrays), which is handled with ThreadPool and async...
-------------------------------------------------------------------------------
Thanks for your answers !
Thanks to your help, I am now looking for a way to solve my problem with the multiprocessing module. To give you a better understanding of what I do, see the following explanation.
Explanation:
My 'input_data' is an ndarray with 282240 elements of type uint32.
In 'calculation_function()' I use a for loop to calculate a result from
every 12 bits and put it into 'output_data'.
Because this is very slow, I split my input_data into e.g. 4 or 8
parts and calculate each part in calculation_function().
Now I am looking for a way to parallelize the 4 or 8 function
calls.
The order of the data is essential, because the data is an image and
each pixel has to be at the correct position. So function call no. 1
calculates the first pixel and the last function call the last pixel of the
image.
The calculations work fine and the image can be completely rebuilt
by my algorithm, but I need the parallelization to speed things up for
time-critical aspects.
Summary:
One input ndarray is divided into 4 or 8 parts. Each part contains 70560 or 35280 uint32 values. From every 12 bits I calculate one pixel, using 4 or 8 function calls. Each function call returns one ndarray with 188160 or 94080 pixels. All return values are put together in order and reshaped into an image.
What already works:
The calculations already work and I can reconstruct my image.
Problem:
The function calls are executed serially, one after another, so each image reconstruction is very slow.
Main Goal:
Speed up the function calls by running them in parallel.
Code:
def decompress(payload, WIDTH, HEIGHT):
    # INPUTS / OUTPUTS
    n_threads = 4
    img_input = np.frombuffer(payload, dtype=np.uint32)  # np.fromstring is deprecated for binary data
    img_output = np.zeros((WIDTH * HEIGHT), dtype=np.uint32)
    n_elements_part = int(len(img_input) / n_threads)  # np.int was removed in NumPy 1.24, use int
    input_part = np.zeros((n_threads, n_elements_part)).astype(np.uint32)
    output_part = np.zeros((n_threads, int(n_elements_part / 3 * 8))).astype(np.uint32)  # 3 uint32 words hold 8 pixels of 12 bits
    # DEFINE PARTS (here 4 different ones)
    start = np.zeros(n_threads).astype(int)
    end = np.zeros(n_threads).astype(int)
    for i in range(0, n_threads):
        start[i] = i * n_elements_part
        end[i] = (i + 1) * n_elements_part - 1
    # COPY IMAGE DATA
    for idx in range(0, n_threads):
        input_part[idx, :] = img_input[start[idx]:end[idx] + 1]
    for idx in range(0, n_threads):  # following line is the function call that should be parallelized
        output_part[idx, :] = decompress_part2(input_part[idx], output_part[idx])
    # COPY PARTS INTO THE IMAGE
    img_output[0:188160] = output_part[0, :]
    img_output[188160:376320] = output_part[1, :]
    img_output[376320:564480] = output_part[2, :]
    img_output[564480:752640] = output_part[3, :]
    # RESHAPE IMAGE
    img_output = np.reshape(img_output, (HEIGHT, WIDTH))
    return img_output
Please don't mind my beginner programming style :)
I am just looking for a solution to parallelize the function calls with the multiprocessing module and get back the returned ndarrays.
Thank you so much for your help !
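You can use a pool from the multiprocessing module (multiprocessing.dummy offers the same Pool API, backed by threads):
def test(a):
    return a

from multiprocessing.dummy import Pool
p = Pool(3)
a = p.starmap(test, zip([1, 2, 3]))  # zip yields 1-tuples, starmap unpacks them
print(a)
p.close()
p.join()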
kar's answer works; however, keep in mind that it uses the .dummy module, which runs threads and is therefore limited by the GIL for CPU-bound work. Here's more info on it:
multiprocessing.dummy in Python is not utilising 100% cpu
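For comparison, here is a minimal sketch of the process-based variant with multiprocessing.Pool, which is not held back by the GIL. The names decode_part, split_into_parts and decode_parallel are hypothetical, the per-part calculation is only a placeholder, and plain lists are used to keep the example free of dependencies; ndarrays can be passed to and returned from the workers in the same way, since they are pickled between processes.

```python
from multiprocessing import Pool


def decode_part(part):
    """Hypothetical stand-in for the real calculation on one part."""
    return [value * 2 for value in part]


def split_into_parts(data, n_parts):
    """Split data into n_parts equal slices (assumes len(data) is divisible)."""
    size = len(data) // n_parts
    return [data[i * size:(i + 1) * size] for i in range(n_parts)]


def decode_parallel(data, n_parts=4):
    """Run decode_part on every part in its own process and rejoin the results."""
    parts = split_into_parts(data, n_parts)
    with Pool(processes=n_parts) as pool:
        results = pool.map(decode_part, parts)  # results come back in input order
    return [value for chunk in results for value in chunk]


if __name__ == "__main__":
    output = decode_parallel(list(range(16)))
    print(output)
```

Because Pool.map returns the results in the order of its inputs, the parts can simply be joined one after another and every pixel ends up at the correct position. The worker function has to be defined at module level so that it can be pickled, and the pool should be created under the `if __name__ == "__main__":` guard.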