I have 12 million records from an e-shop and I would like to compute association rules using the efficient_apriori package. The problem is that 12 million observations are too many, so the computation takes too much time. Is there a way to speed up the algorithm? I am thinking about parallel processing or compiling the Python code to C. I tried PyPy, but PyPy does not support the pandas package. Thank you for any help or ideas.
If you want to see my code:
import pandas as pd
from efficient_apriori import apriori

orders = pd.read_csv("orders.csv", sep=";")

# one transaction (a tuple of item names) per customer
customer = orders.groupby("id_customer")["name"].agg(tuple).tolist()

itemsets, rules = apriori(
    customer, min_support=100 / len(customer), min_confidence=0
)
Can I use this approach to run the task in parallel?
from multiprocessing import Pool

# same thresholds as in the single-process run
MIN_SUPPORT = 100 / len(customer)
MIN_CONFIDENCE = 0

length_of_input_file = len(customer)
total_offset_count = 4  # number of parallel processes to run
offset = length_of_input_file // total_offset_count

# split the transactions into four roughly equal chunks
dataNew1 = customer[0:offset]
dataNew2 = customer[offset:2 * offset]
dataNew3 = customer[2 * offset:3 * offset]
dataNew4 = customer[3 * offset:]  # last chunk also takes any remainder

def calculate_frequent_itemset(fractional_data):
    """Compute the frequent itemsets and rules for one chunk of the data."""
    itemsets, rules = apriori(fractional_data, min_support=MIN_SUPPORT,
                              min_confidence=MIN_CONFIDENCE)
    return itemsets, rules

# note: on Windows/macOS this part needs an `if __name__ == "__main__":` guard
p = Pool()
frequent_itemsets = p.map(calculate_frequent_itemset,
                          (dataNew1, dataNew2, dataNew3, dataNew4))
p.close()
p.join()

itemsets1, rules1 = frequent_itemsets[0]
itemsets2, rules2 = frequent_itemsets[1]
itemsets3, rules3 = frequent_itemsets[2]
itemsets4, rules4 = frequent_itemsets[3]
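One thing to keep in mind with this split-and-compute approach: each chunk only sees a quarter of the transactions, so the supports and rules it reports are chunk-local rather than global, and an itemset that is frequent overall can be missed inside an individual chunk. A minimal sketch of combining the per-chunk results, assuming efficient_apriori returns itemsets as a dict of the form {itemset_length: {itemset: count}} (which I believe recent versions do), could look like this:

from collections import defaultdict

# Merge the per-chunk itemset counts into approximate global counts.
# Caveat: an itemset that fell below min_support inside a chunk is simply
# missing from that chunk's result, so the merged counts are a lower bound,
# and rules would need to be re-derived from these global counts.
global_counts = defaultdict(int)
for chunk_itemsets, _ in frequent_itemsets:
    for length, counts in chunk_itemsets.items():
        for itemset, count in counts.items():
            global_counts[itemset] += count

# keep only itemsets that still reach the global minimum count of 100
approx_frequent = {k: v for k, v in global_counts.items() if v >= 100}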
Related
I am writing a genetic optimization algorithm based on the deap package in Python 2.7 (the goal is to migrate to Python 3 soon). As it is a pretty heavy process, some parts of the optimisation are run with the multiprocessing package. Here is a summary outline of my program:
Configurations are read in and saved in a config object
Some additional pre-computations are made and saved as well in the config object
The optimisation starts (the population is initialized randomly, and mutation and crossover are applied to find a better solution) and some parts of it (the evaluation function) are executed with multiprocessing
The results are saved
For the evaluation function, we need access to some parts of the config object (which stays constant after phase 2). Therefore we make it accessible to the worker processes using a global (constant) variable:
from deap import base
import multiprocessing

toolbox = base.Toolbox()

def evaluate(ind):
    # compute evaluation using the config object
    return obj1, obj2

toolbox.register('evaluate', evaluate)

def init_pool_global_vars(self, _config):
    global config
    config = _config

...
# setting up multiprocessing
pool = multiprocessing.Pool(processes=72, initializer=self.init_pool_global_vars,
                            initargs=[config])
toolbox.register('map', pool.map_async)
...
while tic < max_time:
    # creating new individuals
    # computing the objective function on the different individuals
    jobs = toolbox.map(toolbox.evaluate, ind)
    fits = jobs.get()
    # keeping the best individuals
We basically run iterations (a big for loop) until a maximum time is reached. I have noticed that if I make the config object bigger (i.e. add big attributes to it, like a big numpy array), even though the code is otherwise the same, it runs much slower (fewer iterations for the same timespan). So I thought I would make a specific config_multiprocessing object that contains only the attributes needed in the multiprocessing part and pass that as a global variable, but when I run it on 3 cores it is slower than with the big config object, and on 72 cores it is slightly faster, but not by much.
What should I do in order to make sure my loops don't suffer in speed from the config object or from any other data manipulations I make before launching the multiprocessing loops?
I am running this in a Linux Docker image on a Linux VM in the cloud.
The joblib package is designed to handle cases where you have large numpy arrays to distribute to workers with shared memory. This is especially useful if you are treating the data in shared memory as "read-only" like what you describe in your scenario. You can also create writable shared memory as described in the docs.
Your code might look something like:
import os
import numpy as np
from joblib import Parallel, delayed
from joblib import dump, load

folder = './joblib_memmap'
try:
    os.mkdir(folder)
except FileExistsError:
    pass

def evaluate(ind, data):
    # compute evaluation using the shared-memory data
    return obj1, obj2

# just used to initialize the memory-mapped data
def init_memmap_data(original_data):
    data_filename_memmap = os.path.join(folder, 'data_memmap')
    dump(original_data, data_filename_memmap)
    shared_data = load(data_filename_memmap, mmap_mode='r')
    return shared_data

...
# however you set up indices needs to be changed here
indexes = range(10)
# however you load your numpy data needs to be done here
shared_data = init_memmap_data(numpy_array_to_share)
# change n_jobs as appropriate
results = Parallel(n_jobs=2)(delayed(evaluate)(ind, shared_data) for ind in indexes)
# get the index of the maximum as the "best" individual
best_fit_individual = indexes[int(np.argmax(results))]
Additionally, joblib supports a threading backend that may be faster than the process-based one. It is easy to test both with joblib; for example:
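A minimal sketch of trying both (my addition; it reuses the evaluate, shared_data and indexes names from the snippet above, and prefer="threads" is a standard joblib option):

# default process-based backend ("loky"): workers get their own interpreter,
# so the GIL is not an issue, at the cost of pickling / process start-up
results_proc = Parallel(n_jobs=2)(
    delayed(evaluate)(ind, shared_data) for ind in indexes)

# thread-based backend: no extra processes; works well when evaluate() spends
# most of its time in NumPy/BLAS calls, which release the GIL
results_thread = Parallel(n_jobs=2, prefer="threads")(
    delayed(evaluate)(ind, shared_data) for ind in indexes)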
I have the following situation with my PySpark:
In my driver program (driver.py), I call a function from another file (prod.py):
latest_prods = prod.featurize_prods()
Driver code:
from pyspark import SparkContext

from Featurize import Featurize
from LatestProd import LatestProd
from Oldprod import Oldprod

sc = SparkContext()

if __name__ == '__main__':
    print 'Into main'
    featurize_latest = Featurize('param1', 'param2', sc)
    latest_prod = LatestProd(featurize_latest)
    latest_prods = latest_prod.featurize_prods()

    featurize_old = Featurize('param3', 'param3', sc)
    old_prod = Oldprod(featurize_old)
    old_prods = old_prod.featurize_oldprods()

    total_prods = sc.union([latest_prods, old_prods])
Then I do some reduceByKey code here... that generates total_prods_processed.
Finally I call:
total_prods_processed.saveAsTextFile(...)
I would like to generate latest_prods and old_prods in parallel. Both are created in the same SparkContext. Is it possible to do that? If not, how can I achieve that functionality?
Is this something that Spark does automatically? I am not seeing this behavior when I run the code, so please let me know if it is a configuration option.
After searching on the internet, I think your problem can be addressed with threads. It is as simple as creating two threads for your old_prods and latest_prods work.
Check this post for a simplified example. Since Spark is thread-safe, you gain the parallel efficiency without sacrificing anything.
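As a rough sketch (my addition, not part of the original answer), reusing the object and method names from the question, the two featurize calls could be submitted from a small thread pool so that their Spark jobs run concurrently:

from multiprocessing.pool import ThreadPool

# launch both featurize jobs from separate driver threads; each thread submits
# its own Spark jobs, and the driver waits for both before taking the union
pool = ThreadPool(2)
latest_async = pool.apply_async(latest_prod.featurize_prods)
old_async = pool.apply_async(old_prod.featurize_oldprods)

latest_prods = latest_async.get()
old_prods = old_async.get()
pool.close()
pool.join()

total_prods = sc.union([latest_prods, old_prods])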
The short answer is no, you can't schedule operations on two distinct RDDs at the same time in the same SparkContext. However, there are some workarounds: you could process them in two distinct SparkContexts on the same cluster and call saveAsTextFile, then read both in another job to perform the union (this is not recommended by the documentation).
If you want to try this method, it is discussed here using spark-jobserver, since Spark doesn't support multiple contexts by default: https://github.com/spark-jobserver/spark-jobserver/issues/147
However, given the operations you perform, there would be no reason to process both at the same time: since you need the full results to perform the union, Spark will split those operations into two different stages that will be executed one after the other.
Given two large arrays of 3D points (I'll call the first "source" and the second "destination"), I needed a function that returns, for each element of "source", the index of its closest element in "destination", with this limitation: I can only use numpy... so no scipy, pandas, numexpr, cython...
To do this I wrote a function based on the "brute force" answer to this question. I iterate over the elements of source, find the closest element in destination, and return its index. Due to performance concerns, and again because I can only use numpy, I tried multithreading to speed it up. Here are both the threaded and unthreaded functions and how they compare in speed on an 8-core machine.
import timeit
import numpy as np
from numpy.core.umath_tests import inner1d
from multiprocessing.pool import ThreadPool

def threaded(sources, destinations):
    # Define worker function
    def worker(point):
        dlt = (destinations - point)  # delta between destinations and given point
        d = inner1d(dlt, dlt)         # get squared distances
        return np.argmin(d)           # return closest index
    # Multithread!
    p = ThreadPool()
    return p.map(worker, sources)

def unthreaded(sources, destinations):
    results = []
    for i in range(len(sources)):
        dlt = (destinations - sources[i])  # delta between destinations and given point
        d = inner1d(dlt, dlt)              # get squared distances
        results.append(np.argmin(d))       # append closest index
    return results

# Set up the data
n_destinations = 10000  # 10k random destinations
n_sources = 10000       # 10k random sources
destinations = np.random.rand(n_destinations, 3) * 100
sources = np.random.rand(n_sources, 3) * 100

# Compare!
print 'threaded: %s' % timeit.Timer(lambda: threaded(sources, destinations)).repeat(1, 1)[0]
print 'unthreaded: %s' % timeit.Timer(lambda: unthreaded(sources, destinations)).repeat(1, 1)[0]
Results:
threaded: 0.894030461056
unthreaded: 1.97295164054
Multithreading seems beneficial, but I was hoping for more than a 2x speed-up, given that the real-life datasets I deal with are much larger.
All recommendations to improve performance (within the limitations described above) will be greatly appreciated!
OK, I've been reading the Maya documentation on Python, and I came to these conclusions/guesses:
They're probably using CPython inside (there are several references to that documentation and not to any other).
They're not fond of threads (lots of non-thread-safe methods).
Given the above, I'd say it's better to avoid threads. Because of the GIL, this is a common limitation, and there are several ways to work around it.
Try to build a C/C++ extension. Once that is done, use threads in C/C++. Personally, I'd only try to get SIP to work, and then move on.
Use multiprocessing. Even if your custom Python distribution doesn't include it, you can get to a working version since it's all pure Python code. multiprocessing is not affected by the GIL since it spawns separate processes (see the sketch after this list).
The above should've worked out for you. If not, try another parallel tool (after some serious praying).
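As a minimal sketch of the multiprocessing route (my addition, not part of the original answer), reusing the brute-force search from the question; the number of chunks and the way destinations is shipped to each worker are assumptions:

import numpy as np
from numpy.core.umath_tests import inner1d
from multiprocessing import Pool

def closest_indices(args):
    # for every point in the chunk, find the index of its nearest destination
    chunk, destinations = args
    out = []
    for point in chunk:
        dlt = destinations - point                     # deltas to every destination
        out.append(int(np.argmin(inner1d(dlt, dlt))))  # smallest squared distance
    return out

if __name__ == '__main__':
    destinations = np.random.rand(10000, 3) * 100
    sources = np.random.rand(10000, 3) * 100

    # one chunk of sources per worker keeps pickling overhead manageable,
    # although each task still ships a full copy of destinations to its child
    chunks = np.array_split(sources, 8)
    p = Pool(8)
    nested = p.map(closest_indices, [(c, destinations) for c in chunks])
    p.close()
    p.join()
    results = [i for part in nested for i in part]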
On a side note, if you're using outside modules, be mindful of matching Maya's Python version. This may have been the reason you couldn't build scipy. Of course, scipy has a huge codebase, and the Windows platform is not the easiest to build things on.
I have an application that uses a number of classes inheriting from HasTraits. Some of these classes manage access to data and others provide functions for analyzing that data. This works wonderfully for a GUI -- I can check that the data and analysis code are doing what they should. However, I've noticed that when I use these classes for GUI-less computations, all the CPUs on the system end up getting used.
Here is a small example that shows the cpu usage:
from traits.api import HasTraits, List, Int, Enum, Instance
import numpy as np
import psutil
from itertools import combinations

"""
Small example of high CPU usage by traited classes
"""

class DataStorage(HasTraits):
    nsamples = Int(2000)
    samples = List

    def _samples_default(self):
        return np.random.randn(self.nsamples, 2000).tolist()

    def sample_samples(self, indices):
        """ return a 2D array of data at indices """
        return np.array(
            [self.samples[i] for i in indices])

class DataAccessor(HasTraits):
    """ Class that grabs data and computes something """
    measure = Enum("correlation", "covariance")
    data_source = Instance(DataStorage, ())

    def compute_measure(self, indices):
        """ example of some computation """
        samples = self.data_source.sample_samples(indices)
        percentage = psutil.cpu_percent(interval=0, percpu=True)
        if self.measure == "correlation":
            result = np.corrcoef(samples)
        elif self.measure == "covariance":
            result = np.cov(samples)
        return percentage

# Run a simulation to see cpu usage
analyzer = DataAccessor()
usage = []
n_iterations = 0
max_iterations = 500
for combo in combinations(np.arange(2000), 500):
    # evaluate the measurement on a subset of the data
    usage.append(analyzer.compute_measure(combo))
    n_iterations += 1
    if n_iterations > max_iterations:
        break

print n_iterations
use_percents = np.array(usage).T
When I run this on an 8-CPU machine running CentOS, top reports the Python process at roughly 600%.
>>> use_percents.mean(1)
shows
array([ 67.05548902, 67.06906188, 66.89041916, 67.28942116,
66.69421158, 67.61437126, 99.8007984 , 67.31996008])
Question:
My computation is embarrassingly parallel, so it would be great to have the other CPUs available to split up the job. Does anyone know what's happening here? A plain Python version of this uses 100% of a single CPU.
Is there a way to keep everything local to a single CPU without rewriting all my classes to avoid Traits?
Traits is not causing the CPU usage. It's easy to rewrite this bit of code without Traits, and you will see that you get the same pattern of CPU usage (at least, I do).
Instead, what you are probably seeing is the CPU usage of the BLAS library that your build of numpy is linked against. numpy.corrcoef() calls numpy.cov(), and much of the computation of numpy.cov() is taken up by a numpy.dot() call, which does a matrix-matrix multiplication using BLAS. If it is an optimized BLAS library, it will usually use non-Python threads internally to split up these computations among your CPUs. You will have to consult the documentation of your optimized BLAS library to find out how to change this.
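For most common BLAS implementations the thread count can be capped with an environment variable set before numpy is imported. Exactly which variable applies depends on the library your numpy build links against, so the names below are the usual candidates rather than a guarantee:

import os

# pin the BLAS thread pool to a single thread; only the variable matching
# your BLAS implementation has any effect, the others are simply ignored
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-based BLAS builds
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"        # Intel MKL

import numpy as np  # import numpy only after the variables are set

Equivalently, export the variable in the shell before launching Python (e.g. OMP_NUM_THREADS=1 python script.py).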
Sorry for this question, because there are several examples on Stack Overflow; I am writing to clarify some of my doubts, because I am quite new to the Python language.
I wrote a function:
def clipmyfile(inFile, poly, outFile):
    ...  # doing something with inFile and poly and return outFile
Normally I do this:
clipmyfile(inFile="File1.txt",poly="poly1.shp",outFile="res1.txt")
clipmyfile(inFile="File2.txt",poly="poly2.shp",outFile="res2.txt")
clipmyfile(inFile="File3.txt",poly="poly3.shp",outFile="res3.txt")
......
clipmyfile(inFile="File21.txt",poly="poly21.shp",outFile="res21.txt")
I read in this example (Run several python programs at the same time) that I can use the following (but I am probably wrong):
from multiprocessing import Pool
p = Pool(21) # like in your example, running 21 separate processes
to run the function several times in parallel and speed up my analysis.
To be honest, I didn't understand the next step.
Thanks in advance for any help and suggestions.
Gianni
The map that is used in the example you provided only works for functions that receive one argument. You can see a solution to this here: Python multiprocessing pool.map for multiple arguments
In your case, what you would do is something like this (assuming you have three lists: files, polis, and outis):
from multiprocessing import Pool

def expand_args(f_p_o):
    clipmyfile(*f_p_o)

files = ["File1.txt", "File2.txt"]
polis = ["poly1.shp", "poly2.shp"]
outis = ["res1.txt", "res2.txt"]
len_f = len(files)

p = Pool()
p.map(expand_args, [(files[i], polis[i], outis[i]) for i in xrange(len_f)])
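As a side note (my addition, not part of the original answer): on Python 3 the wrapper function is not needed, because Pool.starmap unpacks the argument tuples itself:

from multiprocessing import Pool

files = ["File1.txt", "File2.txt"]
polis = ["poly1.shp", "poly2.shp"]
outis = ["res1.txt", "res2.txt"]

if __name__ == "__main__":
    with Pool() as p:
        # calls clipmyfile(files[i], polis[i], outis[i]) for each tuple
        p.starmap(clipmyfile, zip(files, polis, outis))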