I was using pandas eval within a where that sits inside a function in order to create a column in a data frame. While it was working in the past, now it doesn't. There was a recent move to Python 3 within our Dataiku software. Could that be the reason for it?
Below is the code that is now in place:
import pandas as pd, numpy as np
from numpy import where, nan
d = {'ASSET': ['X','X','A','X','B'], 'PRODUCT': ['Z','Y','Z','C','Y']}
MAIN_df = pd.DataFrame(data=d)
def val_per(ASSET, PRODUCT):
    return (
        where(pd.eval("ASSET == 'X' & PRODUCT == 'Z'"), 0.04,
              where(pd.eval("PRODUCT == 'Y'"), 0.08, 1.5)
              )
    )

MAIN_2_df = (MAIN_df.eval("PCT = @val_per(ASSET, PRODUCT)"))
The error received now is <class 'TypeError'>: unhashable type: 'numpy.ndarray'
You can replace the last line with:
MAIN_2_df = MAIN_df.copy()
MAIN_2_df['PCT'] = val_per(MAIN_2_df.ASSET, MAIN_2_df.PRODUCT)
This approach is vectorized, so it will also run faster on large dataframes.
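If you want to avoid nested where calls altogether, here is a minimal sketch using numpy.select instead (same MAIN_df as above; np.select is a substitution I'm suggesting, not part of the original code):
import pandas as pd
import numpy as np

d = {'ASSET': ['X', 'X', 'A', 'X', 'B'], 'PRODUCT': ['Z', 'Y', 'Z', 'C', 'Y']}
MAIN_df = pd.DataFrame(data=d)

# Each condition is a boolean Series; np.select returns the value of the first
# matching condition and falls back to the default (1.5) when none match.
conditions = [
    (MAIN_df['ASSET'] == 'X') & (MAIN_df['PRODUCT'] == 'Z'),
    MAIN_df['PRODUCT'] == 'Y',
]
choices = [0.04, 0.08]

MAIN_2_df = MAIN_df.copy()
MAIN_2_df['PCT'] = np.select(conditions, choices, default=1.5)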
Python 3.6, PyCharm
import prettytable as pt
import numpy as np
import pandas as pd
a=np.random.randn(30,2)
b=a.round(2)
df=pd.DataFrame(b)
df.columns=['data1','data2']
tb = pt.PrettyTable()
def func1(columns):
    def func2(column):
        return tb.add_column(column, df[column])
    return map(func2, columns)
column1=['data1','data2']
print(column1)
print(func1(column1))
I want the results to be:
tb.add_column('data1',df['data1'])
tb.add_column('data2',df['data2'])
As a matter of fact, the result is:
<map object at 0x000001E527357828>
I have been trying to find the answer on Stack Overflow for a long time; some answers tell me I can use list(func1(column1)), but the result is [None, None].
Based on the tutorial at https://ptable.readthedocs.io/en/latest/tutorial.html, PrettyTable.add_column modifies the PrettyTable in-place. Such functions generally return None, not the modified object.
You're also overcomplicating the problem by trying to use map and a fancy wrapper function. The code below is much simpler and produces the desired result.
import prettytable as pt
import numpy as np
import pandas as pd
column_names = ['data1', 'data2']
a = np.random.randn(30, 2)
b = a.round(2)
df = pd.DataFrame(b)
df.columns = column_names
tb = pt.PrettyTable()
for col in column_names:
    tb.add_column(col, df[col])
print(tb)
If you're still interested in learning about the thing that map returns, I suggest reading about iterables and iterators. map returns an iterator over the results of calling the function, and does not actually do any work until you iterate over it.
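As a small hypothetical illustration of that laziness (not part of the original code):
def shout(name):
    print('adding', name)
    return name

m = map(shout, ['data1', 'data2'])
print(m)        # <map object ...> -- shout() has not been called yet
print(list(m))  # only now is shout() actually called for each name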
I am a Python beginner; the situation is:
In test.py:
import numpy as np
import pandas as pd
from numpy import *
def model(file):
    import numpy as np
    import pandas as pd
    data0 = pd.ExcelFile(file)
    data = data0.parse('For Stata')
    data1 = data.values
    varnames = list(data)
    for i in range(np.shape(data)[1]):
        var = varnames[i]
        exec(var + '=np.reshape(data1[:,i],(2217,1))')
    return air
air is one of the 'varnames'
Now I run the following in a Jupyter notebook:
file0 = 'BLPreadydata.xlsx'
from test import model
model(file0)
the error that I get is:
NameError: name 'air' is not defined
EDIT: I tried to pin down the error; it actually came from
exec(var+'=np.reshape(data1[:,i],(2217,1))')
somehow this is not working when I call the function, but it does work when I run it outside the function.
NOTE:
Someone has done this in MATLAB:
vals = [1 2 3 4]
vars = {'a', 'b', 'c', 'd'}
for i = vals
    eval([vars{i} '= vals(i)'])
end
You could add one more for loop in the function to iterate over varnames and find 'air'; if found, store it in another variable and return that variable.
Try this.
for j in varnames:
    if j == 'air':
        c = j
Then return c.
return c
I found an answer after reading the exec(.) doc and guessing...
air is actually saved as a local variable after exec(.)...
hence, instead of
return air
put
return locals()['air']
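A minimal sketch of that behaviour, assuming CPython before 3.13 (the function and data here are hypothetical, not the original spreadsheet):
import numpy as np

def demo():
    data1 = np.arange(6).reshape(3, 2)
    var = 'air'
    # exec() inside a function cannot create a real local variable,
    # but the assigned name does land in the dict returned by locals().
    exec(var + ' = np.reshape(data1[:, 0], (3, 1))')
    return locals()['air']

print(demo())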
Thanks for all the help.
I am using an FCC API to convert lat/long coordinates into block group codes:
import pandas as pd
import numpy as np
import urllib
import time
import json
# getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='
getup1 = '&longitude='
getup2 = '&showall=false'
lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
'33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
'39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
'32.7554883','42.331427','31.7775757','35.1495343']
long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
'-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
'-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
'-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']
#make lat and long in to a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']
new_list = []
def block(x):
    for index, row in x.iterrows():
        #request url and read the output
        a = urllib.request.urlopen(getup + row['lat'] + getup1 + row['long'] + getup2).read()
        #load json output in to a form python can understand
        a1 = json.loads(a)
        #append output to an empty list.
        new_list.append(a1['Block']['FIPS'])
#call the function with latlong as the argument.
block(latlong)
#print the list, note: it is important that function appends to the list
print(new_list)
gives this output:
['360610031001021', '060372074001033', '170318391001104', '482011000003087',
'421010005001010', '040131141001032', '480291101002041', '060730053003011',
'481130204003064', '060855010004004', '484530011001092', '180973910003057',
'120310010001023', '060750201001001', '390490040001005', '371190001005000',
'484391233002071', '261635172001069', '481410029001001', '471570042001018']
The problem with this script is that the API is called one row at a time. It takes about 5 minutes per thousand rows, which is not acceptable for the 1,000,000+ entries I plan to run this script on.
I want to use multiprocessing to parallelize this function and decrease the time it takes to run. I have tried looking at the multiprocessing handbook, but have not been able to figure out how to run the function and append the output to an empty list in parallel.
Just for reference: I am using python 3.6
Any guidance would be great!
You do not have to implement the parallelism yourself; there are libraries better suited than urllib, e.g. requests [0] and some spin-offs [1], which use either threads or futures. You will need to check for yourself which one is the fastest.
Because of its small number of dependencies I like requests-futures best; here is my implementation of your code using ten threads. The library even supports processes if you believe or find out that they work better in your case:
import pandas as pd
import numpy as np
import urllib
import time
import json
from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession
#getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='
getup1 = '&longitude='
getup2 = '&showall=false'
lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
'33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
'39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
'32.7554883','42.331427','31.7775757','35.1495343']
long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
'-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
'-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
'-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']
#make lat and long in to a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']
def block(x):
    requests = []
    session = FuturesSession(executor=ThreadPoolExecutor(max_workers=10))
    for index, row in x.iterrows():
        #build the url and queue the request asynchronously
        url = getup + row['lat'] + getup1 + row['long'] + getup2
        requests.append(session.get(url))
    new_list = []
    for request in requests:
        #load json output in to a form python can understand
        a1 = json.loads(request.result().content)
        #append output to the list.
        new_list.append(a1['Block']['FIPS'])
    return new_list
#call the function with latlong as the argument.
new_list = block(latlong)
#print the list returned by the function
print(new_list)
[0] http://docs.python-requests.org/en/master/
[1] https://github.com/kennethreitz/grequests
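If you prefer to stay within the standard library, here is a minimal sketch using concurrent.futures with urllib (it assumes the same getup, getup1, getup2, and latlong defined above; fetch_fips is a hypothetical helper name):
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch_fips(row):
    #one blocking request per row; the executor runs many of these concurrently
    url = getup + row['lat'] + getup1 + row['long'] + getup2
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())['Block']['FIPS']

with ThreadPoolExecutor(max_workers=10) as executor:
    #executor.map preserves the input order of the rows
    new_list = list(executor.map(fetch_fips, (row for _, row in latlong.iterrows())))

print(new_list)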
I regularly perform pandas operations on data frames in excess of 15 million or so rows and I'd love to have access to a progress indicator for particular operations.
Does a text based progress indicator for pandas split-apply-combine operations exist?
For example, in something like:
df_users.groupby(['userID', 'requestDate']).apply(feature_rollup)
where feature_rollup is a somewhat involved function that takes many DF columns and creates new user columns through various methods. These operations can take a while for large data frames so I'd like to know if it is possible to have text based output in an IPython notebook that updates me on the progress.
So far, I've tried canonical loop progress indicators for Python but they don't interact with pandas in any meaningful way.
I'm hoping there's something I've overlooked in the pandas library/documentation that allows one to know the progress of a split-apply-combine. A simple implementation would maybe look at the total number of data frame subsets upon which the apply function is working and report progress as the completed fraction of those subsets.
Is this perhaps something that needs to be added to the library?
Due to popular demand, I've added pandas support in tqdm (pip install "tqdm>=4.9.0"). Unlike the other answers, this will not noticeably slow pandas down -- here's an example for DataFrameGroupBy.progress_apply:
import pandas as pd
import numpy as np
from tqdm import tqdm
# from tqdm.auto import tqdm # for notebooks
# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()
df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))
# Now you can use `progress_apply` instead of `apply`
df.groupby(0).progress_apply(lambda x: x**2)
In case you're interested in how this works (and how to modify it for your own callbacks), see the examples on GitHub, the full documentation on PyPI, or import the module and run help(tqdm). Other supported functions include map, applymap, aggregate, and transform.
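For example, the same pattern on a plain Series (assuming tqdm.pandas() has already been called and using the pd/np imports from above) would be:
s = pd.Series(np.random.randint(0, 100, 10000))
s.progress_map(lambda x: x ** 2)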
EDIT
To directly answer the original question, replace:
df_users.groupby(['userID', 'requestDate']).apply(feature_rollup)
with:
from tqdm import tqdm
tqdm.pandas()
df_users.groupby(['userID', 'requestDate']).progress_apply(feature_rollup)
Note: tqdm <= v4.8:
For versions of tqdm below 4.8, instead of tqdm.pandas() you had to do:
from tqdm import tqdm, tqdm_pandas
tqdm_pandas(tqdm())
In case you need help using this in a Jupyter/IPython notebook, as I did, here's a helpful guide based on the relevant article:
from tqdm._tqdm_notebook import tqdm_notebook
import pandas as pd
import numpy as np
tqdm_notebook.pandas()
df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))
df.groupby(0).progress_apply(lambda x: x**2)
Note the underscore in the import statement for _tqdm_notebook. As the referenced article mentions, development is in a late beta stage.
UPDATE as of 11/12/2021
I'm currently using pandas==1.3.4 and tqdm==4.62.3, and I'm not sure in which version the tqdm authors implemented this change, but the above import statement is deprecated. Instead use:
from tqdm.notebook import tqdm_notebook
UPDATE as of 02/01/2022
It's now possible to simplify import statements for .py and .ipynb files alike:
from tqdm.auto import tqdm
tqdm.pandas()
That should work as expected for both types of development environments, and should work on pandas dataframes or other tqdm-worthy iterables.
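For instance, a tiny illustration on a plain iterable (not from the original answer):
from tqdm.auto import tqdm

total = 0
for i in tqdm(range(100000)):
    total += i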
UPDATE as of 05/27/2022
If you're using a jupyter notebook on SageMaker, this combo works:
from tqdm import tqdm
from tqdm.gui import tqdm as tqdm_gui
tqdm.pandas(ncols=50)
To tweak Jeff's answer (and have this as a reusable function):
def logged_apply(g, func, *args, **kwargs):
    step_percentage = 100. / len(g)
    import sys
    sys.stdout.write('apply progress: 0%')
    sys.stdout.flush()

    def logging_decorator(func):
        def wrapper(*args, **kwargs):
            progress = wrapper.count * step_percentage
            sys.stdout.write('\033[D \033[D' * 4 + format(progress, '3.0f') + '%')
            sys.stdout.flush()
            wrapper.count += 1
            return func(*args, **kwargs)
        wrapper.count = 0
        return wrapper

    logged_func = logging_decorator(func)
    res = g.apply(logged_func, *args, **kwargs)
    sys.stdout.write('\033[D \033[D' * 4 + format(100., '3.0f') + '%' + '\n')
    sys.stdout.flush()
    return res
Note: the apply progress percentage updates inline. If your function writes to stdout then this won't work.
In [11]: g = df_users.groupby(['userID', 'requestDate'])
In [12]: f = feature_rollup
In [13]: logged_apply(g, f)
apply progress: 100%
Out[13]:
...
As usual you can add this to your groupby objects as a method:
from pandas.core.groupby import DataFrameGroupBy
DataFrameGroupBy.logged_apply = logged_apply
In [21]: g.logged_apply(f)
apply progress: 100%
Out[21]:
...
As mentioned in the comments, this isn't a feature that core pandas would be interested in implementing. But python allows you to create these for many pandas objects/methods (doing so would be quite a bit of work... although you should be able to generalise this approach).
For anyone who's looking to apply tqdm on their custom parallel pandas-apply code.
(I tried some of the libraries for parallelization over the years, but I never found a 100% parallelization solution, mainly for the apply function, and I always had to come back to my "manual" code.)
df_multi_core - this is the one you call. It accepts:
Your df object
The function name you'd like to call
The subset of columns the function can be performed upon (helps reduce time / memory)
The number of jobs to run in parallel (-1 or omit for all cores)
Any other kwargs the df's function accepts (like "axis")
_df_split - this is an internal helper function that has to be positioned globally in the running module (Pool.map is "placement dependent"), otherwise I'd define it internally.
here's the code from my gist (I'll add more pandas function tests there):
import pandas as pd
import numpy as np
import multiprocessing
from functools import partial
def _df_split(tup_arg, **kwargs):
    split_ind, df_split, df_f_name = tup_arg
    return (split_ind, getattr(df_split, df_f_name)(**kwargs))

def df_multi_core(df, df_f_name, subset=None, njobs=-1, **kwargs):
    if njobs == -1:
        njobs = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=njobs)

    try:
        splits = np.array_split(df[subset], njobs)
    except ValueError:
        splits = np.array_split(df, njobs)

    pool_data = [(split_ind, df_split, df_f_name) for split_ind, df_split in enumerate(splits)]
    results = pool.map(partial(_df_split, **kwargs), pool_data)
    pool.close()
    pool.join()

    results = sorted(results, key=lambda x: x[0])
    results = pd.concat([split[1] for split in results])
    return results
Below is test code for a parallelized apply with tqdm's progress_apply.
from time import time
from tqdm import tqdm
tqdm.pandas()
if __name__ == '__main__':
    sep = '-' * 50

    # tqdm progress_apply test
    def apply_f(row):
        return row['c1'] + 0.1

    N = 1000000
    np.random.seed(0)
    df = pd.DataFrame({'c1': np.arange(N), 'c2': np.arange(N)})

    print('testing pandas apply on {}\n{}'.format(df.shape, sep))
    t1 = time()
    res = df.progress_apply(apply_f, axis=1)
    t2 = time()
    print('result random sample\n{}'.format(res.sample(n=3, random_state=0)))
    print('time for native implementation {}\n{}'.format(round(t2 - t1, 2), sep))

    t3 = time()
    # res = df_multi_core(df=df, df_f_name='apply', subset=['c1'], njobs=-1, func=apply_f, axis=1)
    res = df_multi_core(df=df, df_f_name='progress_apply', subset=['c1'], njobs=-1, func=apply_f, axis=1)
    t4 = time()
    print('result random sample\n{}'.format(res.sample(n=3, random_state=0)))
    print('time for multi core implementation {}\n{}'.format(round(t4 - t3, 2), sep))
In the output you can see 1 progress bar for running without parallelization, and per-core progress bars when running with parallelization.
There is a slight hiccup and sometimes the rest of the cores appear at once, but even then I think it's useful since you get the progress stats per core (it/sec and total records, for example).
Thank you @abcdaa for this great library!
Every answer here used pandas.DataFrame.groupby. If you want a progress bar on pandas.Series.apply without a groupby, here's how you can do it inside a jupyter-notebook:
from tqdm.notebook import tqdm
tqdm.pandas()
df['<applied-col-name>'] = df['<col-name>'].progress_apply(<your-manipulation-function>)
You can easily do this with a decorator
from functools import wraps

def logging_decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.count += 1
        print("The function I modify has been called {0} time(s).".format(
            wrapper.count))
        return func(*args, **kwargs)
    wrapper.count = 0
    return wrapper
modified_function = logging_decorator(feature_rollup)
Then just use modified_function in your apply (and change when you want it to print).
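For example, reusing the groupby from the question:
df_users.groupby(['userID', 'requestDate']).apply(modified_function)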
I've changed Jeff's answer to include a total, so that you can track progress, and a variable to print only every X iterations (this actually improves performance by a lot, if "print_at" is reasonably high).
import sys

def count_wrapper(func, total, print_at):
    def wrapper(*args):
        wrapper.count += 1
        if wrapper.count % wrapper.print_at == 0:
            clear_output()
            sys.stdout.write("%d / %d" % (wrapper.count, wrapper.total))
            sys.stdout.flush()
        return func(*args)
    wrapper.count = 0
    wrapper.total = total
    wrapper.print_at = print_at
    return wrapper
the clear_output() function is from
from IPython.core.display import clear_output
If you're not on IPython, Andy Hayden's answer does this without it.
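A hypothetical usage, with the groupby from the original question:
g = df_users.groupby(['userID', 'requestDate'])
wrapped = count_wrapper(feature_rollup, total=len(g), print_at=100)
g.apply(wrapped)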
For operations like merge, concat, and join, a progress bar can be shown by using Dask.
You can convert the Pandas DataFrames to Dask DataFrames and then show the Dask progress bar.
The code below shows a simple example:
Create and convert Pandas DataFrames
import pandas as pd
import numpy as np
from tqdm import tqdm
import dask.dataframe as dd
n = 450000
maxa = 700
df1 = pd.DataFrame({'lkey': np.random.randint(0, maxa, n),'lvalue': np.random.randint(0,int(1e8),n)})
df2 = pd.DataFrame({'rkey': np.random.randint(0, maxa, n),'rvalue': np.random.randint(0, int(1e8),n)})
sd1 = dd.from_pandas(df1, npartitions=3)
sd2 = dd.from_pandas(df2, npartitions=3)
Merge with progress bar
from tqdm.dask import TqdmCallback
from dask.diagnostics import ProgressBar
ProgressBar().register()
with TqdmCallback(desc="compute"):
    sd1.merge(sd2, left_on='lkey', right_on='rkey').compute()
Dask is faster and requires fewer resources than Pandas for the same operation:
Pandas 74.7 ms
Dask 20.2 ms
For more details:
Progress Bar for Merge Or Concat Operation With tqdm in Pandas
Test Notebook
Note 1: I've tested this solution (https://stackoverflow.com/a/56257514/3921758) but it doesn't work for me: it doesn't measure the merge operation.
Note 2: I've checked the open requests for tqdm support in Pandas, like:
https://github.com/tqdm/tqdm/issues/1144
https://github.com/noamraph/tqdm/issues/28
For concat operations:
df = pd.concat(
    [
        get_data(f)
        for f in tqdm(files, total=len(files))
    ]
)
tqdm just returns an iterable.