Display progress of lambda function [duplicate] - python

I regularly perform pandas operations on data frames in excess of 15 million or so rows and I'd love to have access to a progress indicator for particular operations.
Does a text based progress indicator for pandas split-apply-combine operations exist?
For example, in something like:
df_users.groupby(['userID', 'requestDate']).apply(feature_rollup)
where feature_rollup is a somewhat involved function that takes many DF columns and creates new user columns through various methods. These operations can take a while for large data frames, so I'd like to know if it is possible to have text-based output in an IPython notebook that updates me on the progress.
So far, I've tried canonical loop progress indicators for Python but they don't interact with pandas in any meaningful way.
I'm hoping there's something I've overlooked in the pandas library/documentation that allows one to know the progress of a split-apply-combine. A simple implementation would maybe look at the total number of data frame subsets upon which the apply function is working and report progress as the completed fraction of those subsets.
Is this perhaps something that needs to be added to the library?
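For concreteness, that idea can be sketched by hand: wrap the applied function in a counter and divide by the number of groups. This is only a rough illustration reusing df_users and feature_rollup from above; report_progress is a hypothetical helper, and older pandas may call the function on the first group twice to choose a code path, so the count can be slightly off.
import sys

def report_progress(func, n_groups):
    # rough sketch: count completed groups and print the fraction done
    state = {'done': 0}
    def wrapper(group):
        state['done'] += 1
        sys.stdout.write('\rapply progress: {:3.0f}%'.format(100.0 * state['done'] / n_groups))
        sys.stdout.flush()
        return func(group)
    return wrapper

g = df_users.groupby(['userID', 'requestDate'])
result = g.apply(report_progress(feature_rollup, g.ngroups))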

Due to popular demand, I've added pandas support in tqdm (pip install "tqdm>=4.9.0"). Unlike the other answers, this will not noticeably slow pandas down -- here's an example for DataFrameGroupBy.progress_apply:
import pandas as pd
import numpy as np
from tqdm import tqdm
# from tqdm.auto import tqdm # for notebooks
# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()
df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))
# Now you can use `progress_apply` instead of `apply`
df.groupby(0).progress_apply(lambda x: x**2)
In case you're interested in how this works (and how to modify it for your own callbacks), see the examples on GitHub, the full documentation on PyPI, or import the module and run help(tqdm). Other supported functions include map, applymap, aggregate, and transform.
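For example, after calling tqdm.pandas(), a plain Series map or an element-wise DataFrame operation can be tracked the same way (a small sketch using the df from above; the exact set of patched methods depends on your tqdm version):
# Series.map with a progress bar
df[0].progress_map(lambda x: x + 1)

# element-wise DataFrame operation with a progress bar
df.progress_applymap(lambda x: x + 1)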
EDIT
To directly answer the original question, replace:
df_users.groupby(['userID', 'requestDate']).apply(feature_rollup)
with:
from tqdm import tqdm
tqdm.pandas()
df_users.groupby(['userID', 'requestDate']).progress_apply(feature_rollup)
Note: tqdm <= v4.8:
For tqdm versions 4.8 and older, instead of tqdm.pandas() you had to do:
from tqdm import tqdm, tqdm_pandas
tqdm_pandas(tqdm())

If you need support for how to use this in a Jupyter/IPython notebook, as I did, here's a helpful guide and the relevant article:
from tqdm._tqdm_notebook import tqdm_notebook
import pandas as pd
import numpy as np

tqdm_notebook.pandas()
df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))
df.groupby(0).progress_apply(lambda x: x**2)
Note the underscore in the import statement for _tqdm_notebook. As the referenced article mentions, development is in a late beta stage.
UPDATE as of 11/12/2021
I'm currently using pandas==1.3.4 and tqdm==4.62.3, and I'm not sure in which version the tqdm authors implemented this change, but the above import statement is deprecated. Use instead:
from tqdm.notebook import tqdm_notebook
UPDATE as of 02/01/2022
It's now possible to simplify import statements for .py and .ipynb files alike:
from tqdm.auto import tqdm
tqdm.pandas()
That should work as expected for both types of development environments, and should work on pandas dataframes or other tqdm-worthy iterables.
UPDATE as of 05/27/2022
If you're using a jupyter notebook on SageMaker, this combo works:
from tqdm import tqdm
from tqdm.gui import tqdm as tqdm_gui
tqdm.pandas(ncols=50)

To tweak Jeff's answer (and have this as a reusable function):
def logged_apply(g, func, *args, **kwargs):
    step_percentage = 100. / len(g)
    import sys
    sys.stdout.write('apply progress: 0%')
    sys.stdout.flush()

    def logging_decorator(func):
        def wrapper(*args, **kwargs):
            progress = wrapper.count * step_percentage
            sys.stdout.write('\033[D \033[D' * 4 + format(progress, '3.0f') + '%')
            sys.stdout.flush()
            wrapper.count += 1
            return func(*args, **kwargs)
        wrapper.count = 0
        return wrapper

    logged_func = logging_decorator(func)
    res = g.apply(logged_func, *args, **kwargs)
    sys.stdout.write('\033[D \033[D' * 4 + format(100., '3.0f') + '%' + '\n')
    sys.stdout.flush()
    return res
Note: the apply progress percentage updates inline. If your function writes to stdout, this won't work.
In [11]: g = df_users.groupby(['userID', 'requestDate'])
In [12]: f = feature_rollup
In [13]: logged_apply(g, f)
apply progress: 100%
Out[13]:
...
As usual you can add this to your groupby objects as a method:
from pandas.core.groupby import DataFrameGroupBy
DataFrameGroupBy.logged_apply = logged_apply
In [21]: g.logged_apply(f)
apply progress: 100%
Out[21]:
...
As mentioned in the comments, this isn't a feature that core pandas would be interested in implementing. But python allows you to create these for many pandas objects/methods (doing so would be quite a bit of work... although you should be able to generalise this approach).
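A minimal sketch of such a generalisation, attaching the same wrapper to the Series flavour of groupby as well (this monkey-patches pandas internals, so it may break between versions):
from pandas.core.groupby import DataFrameGroupBy, SeriesGroupBy

# attach the logged apply to both groupby flavours
for klass in (DataFrameGroupBy, SeriesGroupBy):
    klass.logged_apply = logged_apply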

For anyone who's looking to apply tqdm on their custom parallel pandas-apply code.
(I tried some of the libraries for parallelization over the years, but I never found a 100% parallelization solution, mainly for the apply function, and I always had to come back to my "manual" code.)
df_multi_core - this is the one you call. It accepts:
Your df object
The function name you'd like to call
The subset of columns the function can be performed upon (helps reduce time/memory)
The number of jobs to run in parallel (-1 or omit for all cores)
Any other kwargs the df's function accepts (like "axis")
_df_split - this is an internal helper function that has to be positioned globally in the running module (Pool.map is "placement dependent"), otherwise I'd define it internally.
Here's the code from my gist (I'll add more pandas function tests there):
import pandas as pd
import numpy as np
import multiprocessing
from functools import partial

def _df_split(tup_arg, **kwargs):
    split_ind, df_split, df_f_name = tup_arg
    return (split_ind, getattr(df_split, df_f_name)(**kwargs))

def df_multi_core(df, df_f_name, subset=None, njobs=-1, **kwargs):
    if njobs == -1:
        njobs = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=njobs)

    try:
        splits = np.array_split(df[subset], njobs)
    except ValueError:
        splits = np.array_split(df, njobs)

    pool_data = [(split_ind, df_split, df_f_name) for split_ind, df_split in enumerate(splits)]
    results = pool.map(partial(_df_split, **kwargs), pool_data)
    pool.close()
    pool.join()
    results = sorted(results, key=lambda x: x[0])
    results = pd.concat([split[1] for split in results])
    return results
Below is test code for a parallelized apply with tqdm's progress_apply.
from time import time
from tqdm import tqdm
tqdm.pandas()

if __name__ == '__main__':
    sep = '-' * 50

    # tqdm progress_apply test
    def apply_f(row):
        return row['c1'] + 0.1

    N = 1000000
    np.random.seed(0)
    df = pd.DataFrame({'c1': np.arange(N), 'c2': np.arange(N)})
    print('testing pandas apply on {}\n{}'.format(df.shape, sep))

    t1 = time()
    res = df.progress_apply(apply_f, axis=1)
    t2 = time()
    print('result random sample\n{}'.format(res.sample(n=3, random_state=0)))
    print('time for native implementation {}\n{}'.format(round(t2 - t1, 2), sep))

    t3 = time()
    # res = df_multi_core(df=df, df_f_name='apply', subset=['c1'], njobs=-1, func=apply_f, axis=1)
    res = df_multi_core(df=df, df_f_name='progress_apply', subset=['c1'], njobs=-1, func=apply_f, axis=1)
    t4 = time()
    print('result random sample\n{}'.format(res.sample(n=3, random_state=0)))
    print('time for multi core implementation {}\n{}'.format(round(t4 - t3, 2), sep))
In the output you can see one progress bar for the run without parallelization, and per-core progress bars when running with parallelization.
There is a slight hiccup and sometimes the rest of the cores appear at once, but even then I think it's useful, since you get the progress stats per core (it/sec and total records, for example).
Thank you @abcdaa for this great library!

Every answer here used pandas.DataFrame.groupby. If you want a progress bar on pandas.Series.apply without a groupby, here's how you can do it inside a Jupyter notebook:
from tqdm.notebook import tqdm
tqdm.pandas()
df['<applied-col-name>'] = df['<col-name>'].progress_apply(<your-manipulation-function>)

You can easily do this with a decorator:
from functools import wraps

def logging_decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.count += 1
        print("The function I modify has been called {0} time(s).".format(wrapper.count))
        return func(*args, **kwargs)
    wrapper.count = 0
    return wrapper
modified_function = logging_decorator(feature_rollup)
Then just use modified_function in place of feature_rollup (and change when you want it to print); a usage sketch follows.
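Applied to the question's example, that would look roughly like:
# use the wrapped function exactly where you'd use feature_rollup
df_users.groupby(['userID', 'requestDate']).apply(modified_function)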

I've changed Jeff's answer to include a total, so that you can track progress, and a variable to print only every X iterations (this actually improves performance a lot if print_at is reasonably high):
import sys

def count_wrapper(func, total, print_at):
    def wrapper(*args):
        wrapper.count += 1
        if wrapper.count % wrapper.print_at == 0:
            clear_output()
            sys.stdout.write("%d / %d" % (wrapper.count, wrapper.total))
            sys.stdout.flush()
        return func(*args)
    wrapper.count = 0
    wrapper.total = total
    wrapper.print_at = print_at
    return wrapper
The clear_output() function comes from:
from IPython.core.display import clear_output
If you're not on IPython, Andy Hayden's answer does this without it.
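A usage sketch on the question's example (print_at=100 is an arbitrary choice; the total is the number of groups):
g = df_users.groupby(['userID', 'requestDate'])
wrapped = count_wrapper(feature_rollup, total=g.ngroups, print_at=100)
result = g.apply(wrapped)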

For operations like merge, concat, and join, a progress bar can be shown by using Dask.
You can convert the pandas DataFrames to Dask DataFrames and then show the Dask progress bar.
The code below shows a simple example:
Create and convert Pandas DataFrames
import pandas as pd
import numpy as np
from tqdm import tqdm
import dask.dataframe as dd
n = 450000
maxa = 700
df1 = pd.DataFrame({'lkey': np.random.randint(0, maxa, n),'lvalue': np.random.randint(0,int(1e8),n)})
df2 = pd.DataFrame({'rkey': np.random.randint(0, maxa, n),'rvalue': np.random.randint(0, int(1e8),n)})
sd1 = dd.from_pandas(df1, npartitions=3)
sd2 = dd.from_pandas(df2, npartitions=3)
Merge with progress bar
from tqdm.dask import TqdmCallback
from dask.diagnostics import ProgressBar
ProgressBar().register()

with TqdmCallback(desc="compute"):
    sd1.merge(sd2, left_on='lkey', right_on='rkey').compute()
Dask is faster and requires fewer resources than pandas for the same operation:
Pandas 74.7 ms
Dask 20.2 ms
For more details:
Progress Bar for Merge Or Concat Operation With tqdm in Pandas
Test Notebook
Note 1: I've tested this solution: https://stackoverflow.com/a/56257514/3921758 but it doesn't work for me; it doesn't measure the merge operation.
Note 2: I've checked "open request" for tqdm for Pandas like:
https://github.com/tqdm/tqdm/issues/1144
https://github.com/noamraph/tqdm/issues/28

For concat operations:
df = pd.concat(
    [
        get_data(f)
        for f in tqdm(files, total=len(files))
    ]
)
tqdm just returns an iterable.

Related

multiprocessing in Python dataframe

I have a dataframe and want to add a column by taking the first 3 digits of a base column, using multiprocessing.
Please see the Python code below:
import multiprocessing as mp
import pandas as pd
import numpy as np

data = pd.DataFrame({'employee': ['Donald', 'Douglas', 'Jennifer', 'Michael', 'Pat', 'Susan', 'Hermann', 'Shelley', 'William',
                                  'Steven', 'Neena', 'Lex', 'Alexander', 'Bruce', 'David', 'Valli', 'Diana', 'Nancy', 'Daniel', 'John'],
                     'PHONE_NUMBER': ['650.507.9833', '650.507.9844', '515.123.4444', '515.123.5555', '603.123.6666',
                                      '515.123.7777', '515.123.8888', '515.123.8080', '515.123.8181', '515.123.4567', '515.123.4568',
                                      '515.123.4569', '590.423.4567', '590.423.4568', '590.423.4569', '590.423.4560', '590.423.5567',
                                      '515.124.4569', '515.124.4169', '515.124.4269']})

# Part 3 - Multiprocessing
def strip_digits(x):
    return str(x)[:3]

def city_code(x):
    x['start_digits'] = x['PHONE_NUMBER'].apply(strip_digits)
    return x

def parallelize(df, func):
    df_split = np.array_split(df, partitions)
    pool = mp.Pool(cores)
    df_retun = pd.concat(pool.map(func, df_split), ignore_index=True)
    pool.close
    return df_retun

if __name__ == '__main__':
    mp.set_start_method('spawn')
    cores = mp.cpu_count()
    partitions = cores
    df = parallelize(data, city_code)
    group_data = df.groupby(['start_digits'])
    group_size = group_data.size()
    print(group_data.get_group('515'))
I am getting different AttributeErrors. Please help me identify the error in the code. This is a sample dataframe; I want to do a similar task for a large dataframe using multiprocessing.
Thanks in advance.

Pass multiple columns from dataframe into function with multiprocessing or concurrent.futures

Question: How can I pass the dataframe's columns into a function for each row using multiprocessing or concurrent.futures?
Details:
For each row in df, I want to pass its columns leader and years into the function print_sentences(). I want to use the function in a parallel way where each row is printed asynchronously. For example, I want to make use of concurrent.futures.Executor.map.
It needs to be in Python 3.6.
Reprex: My actual problem is computationally demanding, so here is a simplified reprex:
import pandas as pd
import numpy as np
import concurrent.futures

df = pd.DataFrame(np.array([["Larry", 3, "Germany"], ["Jerry", 5, "Sweden"], ["George", 12, "UK"]]),
                  columns=['leader', 'years', 'score'])

def print_sentences(df):
    print(df["leader"] + " has been leader for " + df["years"] + " years")

print_sentences(df)
Background:
Other questions related to this issue seem to deal with object types other than a dataframe.
My specific issue begins when I read in a .csv as a dataframe. I want to pass this dataframe's columns, for each of its rows, into some function. My actual function (dramatically simplified for the reprex) is computationally demanding: it scrapes data and saves it to a .json. Each row therefore acts as a different query (inputting a different leader's name and score, for example).
To optimise this, I want the rows to map into the function in a parallel way.
I have simplified my problem with the reprex above.
Thanks for your help in advance.
Try this,
#Edits made to reflect your use case.
import pandas as pd
import numpy as np
from multiprocessing import cpu_count, Pool

cores = cpu_count()  # Number of CPU cores on your system
partitions = cores   # Define as many partitions as you want

def print_sentences(row):
    # row is one dataframe row (a Series indexed by column name)
    print(row["leader"] + " has been leader for " + row["years"] + " years")

def print_sentences_chunk(chunk):
    # pool.map hands each worker a whole chunk, so apply the row-level
    # function within the chunk
    chunk.apply(print_sentences, axis=1)
    return chunk

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

df = pd.DataFrame(np.array([["Larry", 3, "Germany"], ["Jerry", 5, "Sweden"],
                            ["George", 12, "UK"]]),
                  columns=['leader', 'years', 'score'])

data = df.copy()
data = parallelize(data, print_sentences_chunk)

# or, without multiprocessing:
data.apply(print_sentences, axis=1)
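Since the question specifically mentions concurrent.futures.Executor.map, here is a minimal sketch of the same idea with ProcessPoolExecutor (assuming the row-level print_sentences above, run from a .py script):
import concurrent.futures

rows = [row for _, row in df.iterrows()]
with concurrent.futures.ProcessPoolExecutor() as executor:
    # results come back in row order; print_sentences itself does the printing
    list(executor.map(print_sentences, rows))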

Pandas: memory usage when working with very many columns using Groupby

I have a dataframe with over 1000 columns and I would like to know whether it makes a difference in memory usage and/or speed to run a groupby directly on the dataframe or to first create a smaller column-wise subset of the dataframe.
df[['xnew','ynew','znew']] = df.groupby(['a','b'])['x','y','z'].transform(lambda f: f.rolling(3).mean().shift())
or,
df2=df[['a','b','x','y','z']]
df2[['xnew','ynew','znew']] = df2.groupby(['a','b'])['x','y','z'].transform(lambda f: f.rolling(3).mean().shift())
df=pd.concat([df,df2[['xnew','ynew','znew']]],axis=1)
I would like to test this myself but I am unfamiliar with how to do it. Advice on how to test this would be much appreciated.
The short answer is no, it doesn't matter on either dimension. From a Colab notebook:
%load_ext memory_profiler
import pandas as pd
import numpy as np
d = {'a': [1]*100 + [2]*100, 'b': [3]*50 + [4]*50 + [5]*50 + [6]*50}
for i in range(1000):
    d[i] = np.random.random(200)
for c in 'xyz':
    d[c] = np.random.random(200)
df = pd.DataFrame(d)
%time %memit df[['xnew','ynew','znew']] = df.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
%%time
%%memit
df2=df[['a','b','x','y','z']]
df2[['xnew','ynew','znew']] = df2.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
df=pd.concat([df,df2[['xnew','ynew','znew']]],axis=1)
A simple way to do this is to record the start time and subtract it from the time at the end of the process to get the elapsed time:
import time
start = time.time()
# Write down the process.
process_time = time.time() - start
print(process_time)

Get CSV from Tensorflow summaries

I have some very large tensorflow summaries. If these are plotted using tensorboard, I can download CSV files from them.
However, plotting these using tensorboard would take a very long time. I found in the docs that there is a method for reading the summary directly in Python. This method is summary_iterator and can be used as follows:
import tensorflow as tf
for e in tf.train.summary_iterator(path_to_events_file):
    print(e)
Can I use this method to create CSV files directly? If so, how can I do this? This would save a lot of time.
One possible way of doing it would be like this:
from tensorboard.backend.event_processing import event_accumulator
import numpy as np
import pandas as pd
import sys

def create_csv(inpath, outpath):
    sg = {event_accumulator.COMPRESSED_HISTOGRAMS: 1,
          event_accumulator.IMAGES: 1,
          event_accumulator.AUDIO: 1,
          event_accumulator.SCALARS: 0,
          event_accumulator.HISTOGRAMS: 1}

    ea = event_accumulator.EventAccumulator(inpath, size_guidance=sg)
    ea.Reload()
    scalar_tags = ea.Tags()['scalars']

    df = pd.DataFrame(columns=scalar_tags)
    for tag in scalar_tags:
        events = ea.Scalars(tag)
        # list comprehension instead of bare map() so this works on Python 3
        scalars = np.array([event.value for event in events])
        df.loc[:, tag] = scalars

    df.to_csv(outpath)

if __name__ == '__main__':
    args = sys.argv
    inpath = args[1]
    outpath = args[2]
    create_csv(inpath, outpath)
Please note, this code will load the entire event file into memory, so best to run this on a cluster. For information about the sg argument of the EventAccumulator, see this SO question.
An additional improvement might be to not only store the value of each scalar, but also the step.
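A rough sketch of that improvement, as a hypothetical variation on the loop above (one step-indexed Series per tag, so tags logged at different steps still line up):
series_per_tag = {}
for tag in scalar_tags:
    events = ea.Scalars(tag)
    steps = [event.step for event in events]
    values = [event.value for event in events]
    series_per_tag[tag] = pd.Series(values, index=steps)

# one column per tag, indexed by step; missing steps become NaN
df = pd.DataFrame(series_per_tag)
df.index.name = 'step'
df.to_csv(outpath)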
Note The code snippet was updated for recent versions of TF. For TF < 1.1 use the following import instead:
from tensorflow.tensorboard.backend.event_processing import event_accumulator as eva

pandas multiprocessing apply

I'm trying to use multiprocessing with a pandas dataframe, that is, split the dataframe into 8 parts and apply some function to each part using apply (with each part processed in a different process).
EDIT:
Here's the solution I finally found:
import numpy as np
import pandas as pd
import multiprocessing as mp
import pandas.util.testing as pdt

def process_apply(x):
    # do some stuff to data here
    return x

def process(df):
    res = df.apply(process_apply, axis=1)
    return res

if __name__ == '__main__':
    # big_df is your original (large) dataframe
    p = mp.Pool(processes=8)
    split_dfs = np.array_split(big_df, 8)
    pool_results = p.map(process, split_dfs)
    p.close()
    p.join()

    # merging parts processed by different processes
    parts = pd.concat(pool_results, axis=0)

    # merging newly calculated parts to big_df
    big_df = pd.concat([big_df, parts], axis=1)

    # checking if the dfs were merged correctly
    pdt.assert_series_equal(parts['id'], big_df['id'])
You can use https://github.com/nalepae/pandarallel, as in the following example:
from pandarallel import pandarallel
from math import sin

pandarallel.initialize()

def func(x):
    return sin(x**2)

df.parallel_apply(func, axis=1)
A more generic version, based on the author's solution, that allows you to run it on any function and dataframe:
from multiprocessing import Pool
from functools import partial
import numpy as np
import pandas as pd

def parallelize(data, func, num_of_processes=8):
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

def run_on_subset(func, data_subset):
    return data_subset.apply(func, axis=1)

def parallelize_on_rows(data, func, num_of_processes=8):
    return parallelize(data, partial(run_on_subset, func), num_of_processes)
So the following line:
df.apply(some_func, axis=1)
Will become:
parallelize_on_rows(df, some_func)
This is some code that I found useful. It automatically splits the dataframe into as many chunks as you have CPU cores.
import pandas as pd
import numpy as np
import multiprocessing as mp

def parallelize_dataframe(df, func):
    num_processes = mp.cpu_count()
    df_split = np.array_split(df, num_processes)
    with mp.Pool(num_processes) as p:
        df = pd.concat(p.map(func, df_split))
    return df

def parallelize_function(df):
    # column_input / column_output are placeholders for your own column names
    df[column_output] = df[column_input].apply(example_function)
    return df

def example_function(x):
    x = x * 2
    return x
To run:
df_output = parallelize_dataframe(df, parallelize_function)
This worked well for me:
rows_iter = (row for _, row in df.iterrows())
with multiprocessing.Pool() as pool:
    df['new_column'] = pool.map(process_apply, rows_iter)
Since I don't have much of your data script, this is a guess, but I'd suggest using p.map instead of apply_async with the callback.
p = mp.Pool(8)
pool_results = p.map(process, np.array_split(big_df, 8))
p.close()
p.join()

results = []
for result in pool_results:
    results.extend(result)
To use all (physical or logical) cores, you could try mapply as an alternative to swifter and pandarallel.
You can set the number of cores (and the chunking behaviour) upon init:
import pandas as pd
import mapply

mapply.init(n_workers=-1)

def process_apply(x):
    # do some stuff to data here
    return x

def process(df):
    # spawns a pathos.multiprocessing.ProcessPool if sensible
    res = df.mapply(process_apply, axis=1)
    return res
By default (n_workers=-1), the package uses all physical CPUs available on the system. If your system uses hyper-threading (usually twice the amount of physical CPUs would show up), mapply will spawn one extra worker to prioritise the multiprocessing pool over other processes on the system.
You could also use all logical cores instead (beware that this way the CPU-bound processes will be fighting for physical CPUs, which might slow down your operation):
import multiprocessing
n_workers = multiprocessing.cpu_count()
# or more explicit
import psutil
n_workers = psutil.cpu_count(logical=True)
I also ran into the same problem when I used multiprocessing.Pool.map() to apply a function to different chunks of a large dataframe.
I just want to add a couple of points in case other people run into the same problem:
remember to add if __name__ == '__main__':
execute the code in a .py file; if you use an IPython/Jupyter notebook, you may not be able to run multiprocessing (this was true in my case, though I have no idea why). A minimal skeleton is sketched below.
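A minimal skeleton of that pattern (my_chunk_func is a hypothetical placeholder; the point is only that the Pool lives under the __main__ guard in a plain .py script):
import multiprocessing as mp
import numpy as np
import pandas as pd

def my_chunk_func(chunk):
    # placeholder per-chunk work
    return chunk

if __name__ == '__main__':
    df = pd.DataFrame({'a': range(100)})
    chunks = np.array_split(df, 4)
    with mp.Pool(4) as pool:
        result = pd.concat(pool.map(my_chunk_func, chunks))
    print(result.shape)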
Install pyxtension, which simplifies using parallel map, and use it like this:
from pyxtension.streams import stream
big_df = pd.concat(stream(np.array_split(df, multiprocessing.cpu_count())).mpmap(process))
I ended up using concurrent.futures.ProcessPoolExecutor.map in place of multiprocessing.Pool.map which took 316 microseconds for some code that took 12 seconds in serial.
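A minimal sketch of that swap (some_func is a hypothetical placeholder; the map interface mirrors multiprocessing.Pool.map):
import concurrent.futures
import numpy as np
import pandas as pd

def some_func(chunk):
    # placeholder per-chunk work
    return chunk.apply(lambda row: row.sum(), axis=1)

if __name__ == '__main__':
    df = pd.DataFrame(np.random.rand(1000, 4))
    chunks = np.array_split(df, 8)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        result = pd.concat(executor.map(some_func, chunks))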
Python's pool.starmap() method can be used to succinctly introduce parallelism also to apply use cases where column values are passed as arguments, i.e. to cases like:
df.apply(lambda row: my_func(row["col_1"], row["col_2"], ...), axis=1)
Full example and benchmarking:
import time
from multiprocessing import Pool

import numpy as np
import pandas as pd

def mul(a, b, c):
    # For illustration, could obviously be vectorized
    return a * b * c

df = pd.DataFrame(np.random.randint(0, 100, size=(10_000_000, 3)), columns=list('ABC'))

# Standard apply
start = time.time()
df["mul"] = df.apply(lambda row: mul(row["A"], row["B"], row["C"]), axis=1)
print(f"Standard apply took {time.time() - start:.0f} seconds.")

# Starmap apply
start = time.time()
with Pool(10) as pool:
    df["mul_pool"] = pool.starmap(mul, zip(df["A"], df["B"], df["C"]))
print(f"Starmap apply took {time.time() - start:.0f} seconds.")

pd.testing.assert_series_equal(df["mul"], df["mul_pool"], check_names=False)
>>> Standard apply took 72 seconds.
>>> Starmap apply took 5 seconds.
This has the benefit of not relying on external libraries, plus being very readable.
Tom Raz's answer https://stackoverflow.com/a/53135031/11847090 misses an edge case where there are fewer rows in the dataframe than processes.
Use this parallelize method instead:
from multiprocessing import Pool
import numpy as np
import pandas as pd

def parallelize(data, func, num_of_processes=8):
    # check if the number of rows is less than the number of processes
    # to avoid the following error:
    # ValueError: Expected a 1D array, got an array with shape
    num_rows = len(data)
    if num_rows == 0:
        return None
    elif num_rows < num_of_processes:
        num_of_processes = num_rows

    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data
I also used dask bag to multithread this instead of this custom code; a rough sketch follows.
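A rough sketch of the dask bag variant (some_func is a hypothetical per-record function; the bag partitions the rows and maps the function over them):
import dask.bag as db
import pandas as pd

def some_func(record):
    # placeholder per-row work on a dict of column -> value
    record['total'] = record['a'] + record['b']
    return record

df = pd.DataFrame({'a': range(10), 'b': range(10)})

bag = db.from_sequence(df.to_dict(orient='records'), npartitions=4)
result = pd.DataFrame(bag.map(some_func).compute())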
