How to write long Pandas aggregations well? - python

TL;DR
How do you write long aggregations involving many operations like groupby(), unstack() or apply() well?
Example
Say you have a DataFrame() with n_sales = 1000 ticket sales for n_events = 10 different events, like
import pandas as pd
import numpy as np
sales = pd.DataFrame({
'Event': np.random.choice(range(n_events), n_sales),
'Time': np.random.rand(n_sales)})
and you want to plot in how many events at least n = [50, 100] tickets were sold over the evening:
Then I would do
accumulation_of_sales = sales.groupby(['Time', 'Event']).size().unstack().fillna(0).cumsum()
events_with_n_sales = accumulation_of_sales.apply(lambda x: x.value_counts(), axis=1).fillna(0)
events_with_geq_n_sales = events_with_n_sales[events_with_n_sales.columns[::-1]].cumsum(axis=1)
events_with_geq_n_sales[n].plot()
which seems hard to read to me and the lines are in principle too long (see PEP). So,
how are this specific and similar operations done best?
are there some tutorials/style guides/... for beginners? Maybe not particularly for Pandas, but similar languages?

One way to write multiline pandas queries is to use :
accumulation_of_sales = sales.groupby(['Time', 'Event'])\
.size()\
.unstack()\
.fillna(0)\
.cumsum()
I sometimes prefer to wrap these in parenthesis instead.
However, if you are doing several things here often there is a simpler way. For example, whenever you see "groupby + unstack" you should think "pivot_table":
sales.pivot_table(columns='Event', index='Time', aggfunc=len, fill_value=0).cumsum()
(Which is equivalent, more efficient and more readable.)

In addition to #Andy's method. I prefer using parenthesis ( ) for writing multiline statements
accumulation_of_sales = (sales.groupby(['Time', 'Event'])
.size()
.unstack()
.fillna(0)
.cumsum())

Related

Named aggregations with pandas group by agg are super slow. Why?

I have a df with many groups.
N = int(1E6)
df = pd.DataFrame({'A':np.random.randint(300_000, size=N),
'B': np.random.rand(N)})
df.loc[::2, ['B']] = np.nan
I want to calculate the sum of each group, given that the group has at least one non-Nan value.
I encounter that the following is very slow:
df.groupby('A').agg(**{
'newname' : ('B', lambda x: x.sum(min_count=1))
})
(22 seconds)
while the following is fast:
df.groupby('A').sum(min_count=1)
(0.11 seconds).
However I would like to use named aggregations.
Am I doing something wrong in the named_aggregation approach, hereby reducing performance?
I tried functools.partial as well (instead of the lambda function), but this yields the same performance.
Once you pass in lambda, the operation is no longer vectorized across the groups even though it can be vectorized within each group. For example:
df.groupby('A').agg(**{'newname' : ('B', 'sum')}) is comparable to df.groupby('A')['B'].sum() and is largely faster than lambda x: x.sum().
That said, I read somewhere that named agg can be a bit slower than straight up applying the built-in functions. For example, this would be a bit faster than .agg:
d = df.groupby('A')
pd.DataFrame({'new_name': d['B'].sum(min_count=1),
'other_name': d['B'].size()
})
but then, you code base looks not as clean as .agg.
One reason the second solution is faster could be because internally, it's using Cython, which is Python converted to C, and is known to be much faster for algorithms.
GroupBy.sum() calls GroupBy._agg_general(), which in turn calls GroupBy._cython_agg_general()...

Using pd.DataFrame.sample on dask dataframe with groupby

I have a very large dataframe that I am resampling a large number of times, so I'd like to use dask to speed up the process. However, I'm running into challenges with the groupby apply. An example data frame would be
import numpy as np
import pandas as pd
import random
test_df = pd.DataFrame({'sample_id':np.array(['a', 'b', 'c', 'd']).repeat(100),
'param1':random.sample(range(1, 1000), 400)})
test_df.set_index('sample_id', inplace=True)
which I can normally groupby and resample using
N = 5;i=1
test = test_df\
.groupby(['sample_id'])\
.apply(pd.DataFrame.sample, n=N, replace=False)\
.reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
Which I wrap into a method that iterates over an N gradient i times. The actual dataframe is very large with a number of columns, and before anyone suggests, this method is a little bit faster than an np.random.choice approach on the index-- it's all in the groupby. I've run the overall procedure through a multiprocessing method, but I wanted to see if I could get a bit more speed out of a dask version of the same. The problem is the documentation suggests that if you index and partition then you get complete groups per partition-- which is not proving true.
import dask.dataframe as dd
df1 = dd.from_pandas(test_df, npartitions=8)
df1=df1.persist()
df1.divisions
creates
('a', 'b', 'c', 'd', 'd')
which unsurprisingly results in a failure
N = 5;i=1
test = df1\
.groupby(['sample_id'])\
.apply(pd.DataFrame.sample, n=N, replace=False)\
.reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
ValueError: Metadata inference failed in groupby.apply(sample).
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
ValueError("Cannot take a larger sample than population when 'replace=False'")
I have dug all around the documentation on keywords, dask dataframes & partitions, and groupby aggregations and simply am simply missing the solution if it's there in the documents. Any advice on how to create a smarter set of partitions and/or get the groupby with sample playing nice with dask would be deeply appreciated.
It's not quite clear to me what you are trying to achieve and why you need to add replace=False (which is default) but the following code work for me. I just need to add meta.
import dask.dataframe as dd
df1 = dd.from_pandas(test_df.reset_index(), npartitions=8)
N = 5
i = 1
test = df1\
.groupby(['sample_id'])\
.apply(lambda x: x.sample(n=N),
meta={"sample_id": "object",
"param1": "f8"})\
.reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
If you then want to drop sample_id you just need to add
df = df.drop("sample_id", axis=1)

How to iterate over very big dataframes in python?

I have a code and my dataframe contains almost 800k rows and therefore it is impossible to iterate over it by using standard methods. I searched a little bit and see a method of iterrows() but i couldn't understand how to use. Basicly this is my code and can you help me how to update it for iterrows()?
**
for i in range(len(x["Value"])):
if x.loc[i ,"PP_Name"] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay'] :
x.loc[i,"Santral_Type"] = "HES"
elif x.loc[i ,"PP_Name"] in ['BND','BND2','TFB','TFB3','TFB4','KNT']:
x.loc[i,"Santral_Type"] = "TERMIK"
elif x.loc[i ,"PP_Name"] in ['BRS','ÇKL','DPZ']:
x.loc[i,"Santral_Type"] = "RES"
else : x.loc[i,"Santral_Type"] = "SOLAR"
**
How to iterate over very big dataframes -- In general, you don't. You should use some sort of vectorize operation to the column as a whole. For example, your case can be map and fillna:
map_dict = {
'HES' : ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay'],
'TERMIK' : ['BND','BND2','TFB','TFB3','TFB4','KNT'],
'RES' : ['BRS','ÇKL','DPZ']
}
inv_map_dict = {x:k for k,v in map_dict.items() for x in v}
df['Santral_Type'] = df['PP_Name'].map(inv_map_dict).fillna('SOLAR')
It is not advised to iterate through DataFrames for these things. Here is one possible way of doing it, applied to all rows of the DataFrame x at once:
# Default value
x["Santral_Type"] = "SOLAR"
x.loc[x.PP_Name.isin(['BRS','ÇKL','DPZ']), 'Santral_Type'] = "RES"
x.loc[x.PP_Name.isin(['BND','BND2','TFB','TFB3','TFB4','KNT']), 'Santral_Type'] = "TERMIK"
hes_list = ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
x.loc[x.PP_Name.isin(hes_list), 'Santral_Type'] = "HES"
Note that 800k can not be considered a large table when using standard pandas methods.
I would advise strongly against using iterrows and for loops when you have vectorised solutions available which take advantage of the pandas api.
this is your code adapted with numpy which should run much faster than your current method.
import numpy as np
col = 'PP_Name'
conditions = [
x[col].isin(
['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
),
x[col].isin(["BND", "BND2", "TFB", "TFB3", "TFB4", "KNT"]),
x[col].isin(["BRS", "ÇKL", "DPZ"])]
outcomes = ["HES", "TERMIK", "RES"]
x["Santral_Type"] = np.select(conditions, outcomes, default='SOLAR')
df.iterrows() according to documentation returns a tuple (index, Series).
You can use it like this:
for row in df.iterrows():
if row[1]['PP_Name'] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']:
df['Santral_Type] = "HES"
# and so on
By the way, I must say, using iterrows is going to be very slow, and looking at your sample code it's clear you can use simple pandas selection techniques to do this without explicit loops.
Better to do it as #mcsoini suggested
the simplest method could be .values, example:
def f(x0,...xn):
return('hello or some complicated operation')
df['newColumn']=[f(r[0],r[1],...,r[n]) for r in df.values]
the drawbacks of this method as far as i know is that you cannot refer to the column values by name but just by position and there is no info about the index of the df.
Advantage is faster than iterrows, itertuples and apply methods.
hope it helps

count num of occurrences by value in a pandas series [duplicate]

In the following, male_trips is a big pandas data frame and stations is a small pandas data frame. For each station id I'd like to know how many male trips took place. The following does the job, but takes a long time:
mc = [ sum( male_trips['start_station_id'] == id ) for id in stations['id'] ]
how should I go about this instead?
Update! So there were two main approaches: groupby() followed by size(), and the simpler .value_counts(). I did a quick timeit, and the groupby approach wins by quite a large margin! Here is the code:
from timeit import Timer
setup = "import pandas; male_trips=pandas.load('maletrips')"
a = "male_trips.start_station_id.value_counts()"
b = "male_trips.groupby('start_station_id').size()"
Timer(a,setup).timeit(100)
Timer(b,setup).timeit(100)
and here is the result:
In [4]: Timer(a,setup).timeit(100) # <- this is value_counts
Out[4]: 9.709594964981079
In [5]: Timer(b,setup).timeit(100) # <- this is groupby / size
Out[5]: 1.5574288368225098
Note that, at this speed, for exploring data typing value_counts is marginally quicker and less remembering!
I'd do like Vishal but instead of using sum() using size() to get a count of the number of rows allocated to each group of 'start_station_id'. So:
df = male_trips.groupby('start_station_id').size()
My answer below works in Pandas 0.7.3. Not sure about the new releases.
This is what the pandas.Series.value_counts method is for:
count_series = male_trips.start_station_id.value_counts()
It should be straight-forward to then inspect count_series based on the values in stations['id']. However, if you insist on only considering those values, you could do the following:
count_series = (
male_trips[male_trips.start_station_id.isin(stations.id.values)]
.start_station_id
.value_counts()
)
and this will only give counts for station IDs actually found in stations.id.
male_trips.count()
doesnt work?
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html
how long would this take:
df = male_trips.groupby('start_station_id').sum()
edit: after seeing in the answer above that isin and value_counts exist (and value_counts even comes with its own entry in pandas.core.algorithm and also isin isn't simply np.in1d) I updated the three methods below
male_trips.start_station_id[male_trips.start_station_id.isin(station.id)].value_counts()
You could also do an inner join on stations.id:
pd.merge(male_trips, station, left_on='start_station_id', right_on='id') followed by value_counts.
Or:
male_trips.set_index('start_station_id, inplace=True)
station.set_index('id, inplace=True)
male_trips.ix[male_trips.index.intersection(station.index)].reset_index().start_station_id.value_counts()
If you have the time I'd be interested how this performs differently with a huge DataFrame.

pandas: count things

In the following, male_trips is a big pandas data frame and stations is a small pandas data frame. For each station id I'd like to know how many male trips took place. The following does the job, but takes a long time:
mc = [ sum( male_trips['start_station_id'] == id ) for id in stations['id'] ]
how should I go about this instead?
Update! So there were two main approaches: groupby() followed by size(), and the simpler .value_counts(). I did a quick timeit, and the groupby approach wins by quite a large margin! Here is the code:
from timeit import Timer
setup = "import pandas; male_trips=pandas.load('maletrips')"
a = "male_trips.start_station_id.value_counts()"
b = "male_trips.groupby('start_station_id').size()"
Timer(a,setup).timeit(100)
Timer(b,setup).timeit(100)
and here is the result:
In [4]: Timer(a,setup).timeit(100) # <- this is value_counts
Out[4]: 9.709594964981079
In [5]: Timer(b,setup).timeit(100) # <- this is groupby / size
Out[5]: 1.5574288368225098
Note that, at this speed, for exploring data typing value_counts is marginally quicker and less remembering!
I'd do like Vishal but instead of using sum() using size() to get a count of the number of rows allocated to each group of 'start_station_id'. So:
df = male_trips.groupby('start_station_id').size()
My answer below works in Pandas 0.7.3. Not sure about the new releases.
This is what the pandas.Series.value_counts method is for:
count_series = male_trips.start_station_id.value_counts()
It should be straight-forward to then inspect count_series based on the values in stations['id']. However, if you insist on only considering those values, you could do the following:
count_series = (
male_trips[male_trips.start_station_id.isin(stations.id.values)]
.start_station_id
.value_counts()
)
and this will only give counts for station IDs actually found in stations.id.
male_trips.count()
doesnt work?
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html
how long would this take:
df = male_trips.groupby('start_station_id').sum()
edit: after seeing in the answer above that isin and value_counts exist (and value_counts even comes with its own entry in pandas.core.algorithm and also isin isn't simply np.in1d) I updated the three methods below
male_trips.start_station_id[male_trips.start_station_id.isin(station.id)].value_counts()
You could also do an inner join on stations.id:
pd.merge(male_trips, station, left_on='start_station_id', right_on='id') followed by value_counts.
Or:
male_trips.set_index('start_station_id, inplace=True)
station.set_index('id, inplace=True)
male_trips.ix[male_trips.index.intersection(station.index)].reset_index().start_station_id.value_counts()
If you have the time I'd be interested how this performs differently with a huge DataFrame.

Categories