Improve performance (vectorize?) of pandas.groupby.aggregate - python

I'm trying to improve the performance of a pandas.groupby.aggregate operation that uses a custom aggregating function. I noticed that - correct me if I'm wrong - pandas calls the aggregating function on each group in sequence (I suspect it is a simple for-loop).
Since pandas is heavily based on numpy, is there a way to speed up the calculation using numpy's vectorization features?
My code
In my code I need to aggregate wind data by averaging samples together. While averaging wind speeds is trivial, averaging wind directions requires more ad hoc code (e.g. the average of 1deg and 359deg is 0deg, not 180deg).
What my aggregating function does is:
remove NaNs
return NaN if no other value is present
check if a special flag indicating variable wind direction is present. If it is, return the flag
average the wind directions with a vector-averaging algorithm
The function is:
def meandir(x):
    '''
    Parameters
    ----------
    x : pandas.Series
        pandas series to be averaged

    Returns
    -------
    float
        averaged wind direction
    '''
    # Remove the NaNs from the recording
    x = x.dropna()
    # If the record is empty, return NaN
    if len(x) == 0:
        return np.nan
    # If the record contains variable samples (990) return variable (990)
    elif np.any(x == 990):
        return 990
    # Otherwise sum the unit vectors and return the angle
    else:
        angle = np.rad2deg(
            np.arctan2(
                np.sum(np.sin(np.deg2rad(x))),
                np.sum(np.cos(np.deg2rad(x)))
            )
        )
        # Wrap angles from (-180, 180] to [0, 360)
        return (angle + 360) % 360
You can test it with:
from timeit import repeat
import pandas as pd
import numpy as np
N_samples = int(1e4)
N_nan = N_var = int(0.02 * N_samples)
# Generate random data
data = np.random.rand(N_samples,2) * [30, 360]
data[np.random.choice(N_samples, N_nan), 1] = np.nan
data[np.random.choice(N_samples, N_var), 1] = 990
# Create dataset
df = pd.DataFrame(data, columns=['WindSpeed', 'WindDir'])
df.index = pd.date_range(start='2000-01-01 00:00', periods=N_samples, freq='10min')
# Run groupby + aggregate
grouped = df.groupby(pd.Grouper(freq='H'))  # Samples from 15:00 to 15:59 are aggregated into the 15:00 bin
aggfuns1 = {'WindSpeed': np.mean, 'WindDir':meandir}
aggfuns2 = {'WindSpeed': np.mean, 'WindDir':np.mean}
res = repeat(stmt='grouped.agg(aggfuns1)', globals=globals(), number=1, repeat=10)
print(f'With custom aggregating function {min(res)*1000:.2f} ms')
res = repeat(stmt='grouped.agg(aggfuns2)', globals=globals(), number=1, repeat=10)
print(f'Without custom aggregating function {min(res)*1000:.2f} ms')
which on my PC for N_samples=1e4 outputs:
With custom aggregating function 1500.79 ms
Without custom aggregating function 2.08 ms
with the custom aggregating function being 750 times slower
and with N_samples=1e6 outputs:
With custom aggregating function 142967.17 ms
Without custom aggregating function 21.92 ms
with the custom aggregating function being 6500 times slower!
Is there a way to speed up this line of code?

The key is to vectorize everything you can on the whole df, and let groupby use only built-in methods.
Here is a way to do that. The trick is to convert the angles to complex numbers, which numpy will happily sum (and groupby will too, although groupby will refuse to mean()). So we convert the angles to complex, sum, and then convert back to angles. The same circular ("vector") mean of angles is used as in your code and described on the Wikipedia page you reference.
As for the handling of the special value (990), it can be vectorized too: comparing s.groupby(...).count() with s.replace(val, nan).groupby(...).count() finds all the groups where at least one such value is present.
Anyway, here goes:
def to_complex(s):
    return np.exp(np.deg2rad(s) * 1j)

def to_angle(s):
    return np.angle(s, deg=True) % 360

def mask_val(s, grouper, val=990):
    return s.groupby(grouper).count() != s.replace(val, np.nan).groupby(grouper).count()

def myagg(df, grouper, val=990, winddir='WindDir'):
    s = df[winddir]
    mask = mask_val(s, grouper, val)
    gb = to_complex(s).groupby(grouper)
    s = gb.sum()
    cnt = gb.count()
    s = to_angle(s) * (cnt / cnt)  # put NaN where all NaNs
    s[mask] = val
    # other columns
    agg = df.groupby(grouper).mean()
    agg[winddir] = s
    return agg
Application:
For convenience, I put your example generation into a function gen_example(N_samples).
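A minimal sketch of that helper, assuming it simply wraps the data generation code from the question (gen_example is mentioned but not shown in the original answer):
def gen_example(N_samples):
    # Hypothetical helper: wraps the random data generation from the question.
    N_samples = int(N_samples)
    N_nan = N_var = int(0.02 * N_samples)
    data = np.random.rand(N_samples, 2) * [30, 360]
    data[np.random.choice(N_samples, N_nan), 1] = np.nan
    data[np.random.choice(N_samples, N_var), 1] = 990
    df = pd.DataFrame(data, columns=['WindSpeed', 'WindDir'])
    df.index = pd.date_range(start='2000-01-01 00:00', periods=N_samples, freq='10min')
    return df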
df = gen_example(50)
myagg(df, pd.Grouper(freq='H'))
Out[ ]:
WindSpeed WindDir
2000-01-01 00:00:00 12.991717 354.120464
2000-01-01 01:00:00 15.743056 60.813629
2000-01-01 02:00:00 14.593927 245.487383
2000-01-01 03:00:00 17.836368 131.493675
2000-01-01 04:00:00 18.987296 27.150359
2000-01-01 05:00:00 16.415725 194.923399
2000-01-01 06:00:00 20.881816 990.000000
2000-01-01 07:00:00 15.033480 44.626018
2000-01-01 08:00:00 16.276834 29.252459
Speed:
df = gen_example(10_000)
%timeit myagg(df, pd.Grouper(freq='H'))
Out[ ]:
6.76 ms ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df = gen_example(1e6)
%timeit myagg(df, pd.Grouper(freq='H'))
Out[ ]:
189 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Testing:
idx = [0] * 4
grouper = pd.Grouper(level=0)
myagg(pd.DataFrame({'WindDir': [170, 170, 178, 182]}, index=idx), grouper)
WindDir
0 174.998473
myagg(pd.DataFrame({'WindDir': [330, 359, 1, 40]}, index=idx), grouper)
WindDir
0 2.251499
myagg(pd.DataFrame({'WindDir': [330, 359, 1, np.nan]}, index=idx), grouper)
WindDir
0 350.102878
myagg(pd.DataFrame({'WindDir': [np.nan, np.nan, np.nan, np.nan]}, index=idx), grouper)
WindDir
0 NaN
myagg(pd.DataFrame({'WindDir': [330, 990, 1, np.nan]}, index=idx), grouper)
WindDir
0 990.0

Related

Compute weekly averages of columns in time series data

I have time series data with an id column and a continuous value column. I want to calculate the moving weekly averages of this value for each person, in a new column. Code to generate a sample dataset:
import pandas as pd
import numpy as np

df = pd.DataFrame(index=pd.date_range(freq='60T', start='2020-01-01', periods=24*14))
df['col'] = np.random.random_integers(0, 250, size=df.shape[0])
df['uid'] = 1
df2 = pd.DataFrame(index=pd.date_range(freq='60T', start='2020-01-01', periods=24*14))
df2['col'] = np.random.random_integers(0, 150, size=df2.shape[0])
df2['uid'] = 2
df3 = pd.concat([df, df2]).reset_index()
df3
This sample has 2 weeks of data per person, so there should be 2 average values per person. The first one will be the average of the first week; the second will be the average of the two weeks, i.e. (week1 average + week2 average)/2. Then, fill all the rows in the column with the average value of that week.
The real dataset is large, so I am looking for a solution that scales. How can this be achieved?
The desired outcome should look like this:
index uid col week_average
2020-01-01 00:00:00 1 104 week1_uid1_mean
2020-01-01 01:00:00 1 150 week1_uid1_mean
2020-01-01 02:00:00 1 243 week1_uid1_mean
....
2020-01-08 00:00:00 1 174 (week1_uid1_mean+week2_uid1_mean)/2
2020-01-08 01:00:00 1 24 (week1_uid1_mean+week2_uid1_mean)/2
...
First, compute the week index for each row:
df3["week"] = (
    df3["index"] - df3.groupby("uid")["index"].transform("min")
) // pd.Timedelta(7, unit="day")
Or, if the values in the index column are identical for all persons (uid), directly:
df3["week"] = (df3["index"] - df3["index"].min()) // pd.Timedelta(7, unit="day")
Then, compute the week average for each distinct (uid, week) pair:
week_averages = (
    df3.groupby(["uid", "week"])["col"]
    .mean()
    .groupby("uid")
    .apply(lambda x: x.rolling(len(x), min_periods=1).mean())
)
Finally, fill each row of your dataframe with the corresponding week average:
df3["week_average"] = df3.apply(
    lambda x: week_averages.loc[(x["uid"], x["week"])], axis=1
)
On your data, when using %timeit, I get
39.5 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Obtaining timeframe for ID groups based on state change

First off, my apologies, I'm a complete novice when it comes to Python. I use it extremely infrequently but require it for this problem.
I have a set of data which looks like the below:
id   state  dt
101  0      2022-15
101  1      2022-22
101  0      2022-26
102  0      2022-01
102  1      2022-41
103  1      2022-03
103  0      2022-12
I need to provide an output which displays the amount of time each ID was in state = "1". E.g. for ID 101: state1_start_dt = "2022-22", state1_end_dt = "2022-25".
The data is in .CSV format. I've attempted to bring this in via Pandas, utilise groupby on the df and then loop over this - however this seems extremely slow.
I've come across Finite State Machines which seem to link to my requirements, however I'm in way over my head attempting to create a Finite State Machine in Python which accepts .CSV inputs, provides output per each ID group as well as incorporates logic to account for scenarios where the last entry for an ID is state = "1" - therefore we'd assume the time frame was until the end of 2022.
If anyone can provide some sources or sample code which I can break down to get a better understanding - that would be great.
EDIT
Some examples to be clearer on what I'd like to achieve:
- For IDs that have no ending 0 in the state sequence, the state1_end_dt should be entered as '2022-52' (the final week in 2022).
- For IDs which have alternating states, we can incorporate a second, third, fourth, etc. set of columns (e.g. state1_start_dt_2, state1_end_dt_2). This will allow each window to be accounted for. For any entries that only have one window, these extra columns can be NULL.
- For IDs which have no "1" present in the state column, these can be skipped.
- For IDs which do not have any 0 states present, the minimum dt value should be taken as the state1_start_dt and '2022-52' can be entered for state1_end_dt.
IIUC, here are some functions to perform the aggregation you are looking for.
First, we convert the strings '%Y-%W' (e.g. '2022-15') into a DateTime (the Monday of that week), e.g. '2022-04-11', since it is easier to deal with actual dates than with these strings. This also makes the solution generic: it can handle arbitrary dates, not just a single year.
Second, we augment the df with a "sentinel": a row for each id that is on the first week of the next year (next year being max year of all dates, plus 1) with state = 0. That allows us to not worry whether a sequence ends with 0 or not.
Then, we essentially group by id and apply the following logic: keep only transitions, so, e.g., [1,1,1,0,0,1,0] becomes [1,.,.,0,.,1,0] (where '.' indicates dropped values). That gives us the spans we are looking for (after subtracting one week for the 0 states).
Edit: speedup: instead of applying the masking logic to each group, we detect transitions globally (on the sentinel-augmented df, sorted by ['id', 'dt', 'state']). Since each id sequence in the augmented df ends with the sentinel (0), we are guaranteed to catch the first 1 of the next id.
Putting it all together, including a postproc() to convert dates back into strings of year-week:
def preproc(df):
    df = df.assign(dt=pd.to_datetime(df['dt'] + '-Mon', format='%Y-%W-%a'))
    max_year = df['dt'].max().year
    # first week next year:
    tmax = pd.Timestamp(f'{max_year}-12-31') + pd.offsets.Week(1)
    sentinel = pd.DataFrame(
        pd.unique(df['id']),
        columns=['id']).assign(state=0, dt=tmax)
    df = pd.concat([df, sentinel])
    df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
    return df

# speed up
def proc(df):
    mask = df['state'] != df['state'].shift(fill_value=0)
    df = df[mask]
    z = df.assign(c=df.groupby('id').cumcount()).set_index(['c', 'id'])['dt'].unstack('c')
    z[z.columns[1::2]] -= pd.offsets.Week(1)
    cols = [
        f'{x}_{i}'
        for i in range(len(z.columns) // 2)
        for x in ['start', 'end']
    ]
    return z.set_axis(cols, axis=1)

def asweeks_str(t, nat='--'):
    return f'{t:%Y-%W}' if t and t == t else nat

def postproc(df):
    # convert dates back into strings '%Y-%W'
    return df.applymap(asweeks_str)
Examples
First, let's use the example from the original question. Note that it doesn't exercise some of the corner cases we are able to handle (more on that in a minute).
df = pd.DataFrame({
    'id': [101, 101, 101, 102, 102, 103, 103],
    'state': [0, 1, 0, 0, 1, 1, 0],
    'dt': ['2022-15', '2022-22', '2022-26', '2022-01', '2022-41', '2022-03', '2022-12'],
})
>>> postproc(proc(preproc(df)))
start_0 end_0
id
101 2022-22 2022-25
102 2022-41 2022-52
103 2022-03 2022-11
But let's generate some random data, to observe some corner cases:
def gen(n, nids=2):
    wk = np.random.randint(1, 53, n*nids)
    st = np.random.choice([0, 1], n*nids)
    ids = np.repeat(np.arange(nids) + 101, n)
    df = pd.DataFrame({
        'id': ids,
        'state': st,
        'dt': [f'2022-{w:02d}' for w in wk],
    })
    df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
    return df
Now:
np.random.seed(0) # reproducible example
df = gen(6, 3)
>>> df
id state dt
0 101 0 2022-01
1 101 0 2022-04
2 101 1 2022-04
3 101 1 2022-40
4 101 1 2022-45
5 101 1 2022-48
6 102 1 2022-10
7 102 1 2022-20
8 102 0 2022-22
9 102 1 2022-24
10 102 0 2022-37
11 102 1 2022-51
12 103 1 2022-02
13 103 0 2022-07
14 103 0 2022-13
15 103 1 2022-25
16 103 1 2022-25
17 103 1 2022-39
There are several interesting things here. First, 101 starts with a 0 state, whereas 102 and 103 both start with 1. Then, there are repeated 1 states for all ids. There are also repeated weeks: '2022-04' for 101 and '2022-25' for 103.
Nevertheless, the aggregation works just fine and produces:
>>> postproc(proc(preproc(df)))
start_0 end_0 start_1 end_1 start_2 end_2
id
101 2022-04 2022-52 -- -- -- --
102 2022-10 2022-21 2022-24 2022-36 2022-51 2022-52
103 2022-02 2022-06 2022-25 2022-52 -- --
Speed
np.random.seed(0)
n = 10
k = 100_000
df = gen(n, k)
%timeit preproc(df)
483 ms ± 4.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The processing itself takes less than 200ms for 1 million rows:
a = preproc(df)
%timeit proc(a)
185 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As for the post-processing (converting dates back to year-week strings), it is the slowest thing of all:
b = proc(a)
%timeit postproc(b)
1.63 s ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For a speed-up of that post-processing, we can rely on the fact that there are only a small number of distinct dates that are week-starts (52 per year, plus NaT for the blank cells):
def postproc2(df, nat='--'):
    dct = {
        t: f'{t:%Y-%W}' if t and t == t else nat
        for t in df.stack().reset_index(drop=True).drop_duplicates()
    }
    return df.applymap(dct.get)
%timeit postproc2(b)
542 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
We could of course do something similar for preproc().
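A sketch of that idea applied to preproc(), under the assumption that the distinct '%Y-%W' strings are few (preproc_cached is my naming, not part of the original answer; the sentinel and sort steps are unchanged):
def preproc_cached(df):
    # Hypothetical variant of preproc(): parse each distinct '%Y-%W' string only once.
    dct = {s: pd.to_datetime(s + '-Mon', format='%Y-%W-%a') for s in df['dt'].unique()}
    df = df.assign(dt=df['dt'].map(dct))
    max_year = df['dt'].max().year
    # first week next year (sentinel), exactly as in preproc():
    tmax = pd.Timestamp(f'{max_year}-12-31') + pd.offsets.Week(1)
    sentinel = pd.DataFrame(pd.unique(df['id']), columns=['id']).assign(state=0, dt=tmax)
    df = pd.concat([df, sentinel])
    return df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)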
If the CSV file, called one_zero.csv, is this:
id,state,dt
100,0,2022-15
100,1,2022-22
100,0,2022-26
101,0,2022-01
101,1,2022-41
102,1,2022-03
102,0,2022-12
102,1,2022-33
(I've added one additional item to the end.)
Then this code gives you what you want.
import pandas as pd

df = pd.read_csv("one_zero.csv")
result = {}
for id_, sub_df in df.groupby('id'):
    sub_df = sub_df.sort_values("dt")
    intervals = []
    start_dt = None
    for state, dt in zip(sub_df["state"], sub_df["dt"]):
        if state == 1:
            start_dt = dt
        if state == 0 and start_dt is not None:
            week = int(dt.split("-", maxsplit=1)[1])
            intervals.append((start_dt, f"2022-{week-1:02d}"))
            start_dt = None
    if start_dt is not None:
        intervals.append((start_dt, "2022-52"))
    result[id_] = intervals
At the end the result dictionary will contain this:
{
100: [('2022-22', '2022-25')],
101: [('2022-41', '2022-52')],
102: [('2022-03', '2022-11'), ('2022-33', '2022-52')]
}
With this groupby and sort_values it works even if you shuffle the lines in the CSV file. I've used a formatted string to fix up the week number: 02d means the week is always printed with two digits, zero-padded for the first 9 weeks.
You might use less memory if you iterate over the rows like this, but for me the zip version is more familiar.
for _, row in sub_df.iterrows():
    state = row["state"]
    dt = row["dt"]
Another alternative:
res = (
    df.drop(columns="dt")
    .assign(week=df["dt"].str.split("-").str[1].astype("int"))
    .sort_values(["id", "week"])
    .assign(group=lambda df:
        df.groupby("id")["state"].diff().fillna(1).ne(0).cumsum()
    )
    .drop_duplicates(subset="group", keep="first")
    .loc[lambda df: df["state"].eq(1) | df["id"].eq(df["id"].shift())]
    .assign(no=lambda df: df.groupby("id")["state"].cumsum())
    .pivot(index=["id", "no"], columns="state", values="week")
    .rename(columns={0: "end", 1: "start"}).fillna("52").astype("int")
)[["start", "end"]]
First, add a new column week and sort by id and week. (The sorting might not be necessary if the data already comes sorted.)
Then look, id-group-wise, for blocks of consecutive 0s or 1s and, based on the result (stored in the new column group), drop the duplicates within each block while keeping the first (the others aren't relevant according to the logic you've laid out).
Afterwards also remove the 0-states at the start of an id-group.
On the result, identify, id-group-wise, the connected start-end pairs (stored in the new column no).
Then .pivot the thing: pull id and no into the index and state into the columns.
Afterwards fill the NaN with 52 and do some casting, renaming, and sorting to get the result into better shape.
If you really want to move the various start-end combinations into columns, then replace everything from the .pivot line on as follows:
res = (
    ...
    .pivot(index=["id", "no"], columns="state", values="week")
    .rename(columns={0: 1, 1: 0}).fillna("52").astype("int")
    .unstack().sort_index(level=1, axis=1)
)
res.columns = [f"{'start' if s == 0 else 'end'}_{n}" for s, n in res.columns]
Results with the dataframe from @Pierre's answer:
state start end
id no
101 1 4 52
102 1 10 22
2 24 37
3 51 52
103 1 2 7
2 25 52
or
start_1 end_1 start_2 end_2 start_3 end_3
id
101 4.0 52.0 NaN NaN NaN NaN
102 10.0 22.0 24.0 37.0 51.0 52.0
103 2.0 7.0 25.0 52.0 NaN NaN

Is there a way to apply a formula to only rows with matching Ids in Python? [duplicate]

This question is linked to Speedup of pandas groupby. It is about speeding up a groupby cumulative-product calculation. The DataFrame is 2D and has a MultiIndex consisting of 3 integers.
The HDF5 file for the dataframe can be found here: http://filebin.ca/2Csy0E2QuF2w/phi.h5
The actual calculation that I'm performing is similar to this:
>>> phi = pd.read_hdf('phi.h5', 'phi')
>>> %timeit phi.groupby(level='atomic_number').cumprod()
100 loops, best of 3: 5.45 ms per loop
Another speedup might be possible because I do this calculation about 100 times using the same index structure but with different numbers, so I wonder if the index handling can somehow be cached.
Any help will be appreciated.
Numba appears to work pretty well here. In fact, these results seem almost too good to be true with the numba function below being about 4,000x faster than the original method and 5x faster than plain cumprod without a groupby. Hopefully these are correct, let me know if there is an error.
np.random.seed(1234)
df = pd.DataFrame({'x': np.repeat(range(200), 4), 'y': np.random.randn(800)})
df = df.sort_values('x')
df['cp_groupby'] = df.groupby('x')['y'].cumprod()
from numba import jit

@jit
def group_cumprod(x, y):
    # cumulative product of y, restarting whenever x changes value
    z = np.ones(len(x))
    for i in range(len(x)):
        if x[i] == x[i-1]:
            z[i] = y[i] * z[i-1]
        else:
            z[i] = y[i]
    return z

df['cp_numba'] = group_cumprod(df.x.values, df.y.values)
df['dif'] = df.cp_groupby - df.cp_numba
Test that both ways give the same answer:
all(df.cp_groupby==df.cp_numba)
Out[1447]: True
Timings:
%timeit df.groupby('x').cumprod()
10 loops, best of 3: 102 ms per loop
%timeit df['y'].cumprod()
10000 loops, best of 3: 133 µs per loop
%timeit group_cumprod(df.x.values,df.y.values)
10000 loops, best of 3: 24.4 µs per loop
A pure numpy solution, assuming the data is sorted by the index, though with no handling of NaN (wrapped as np_cumprod so it can be timed below):
def np_cumprod(phi):
    # grouped cumulative product, writing directly into a preallocated array
    res = np.empty_like(phi.values)
    l = 0
    r = phi.index.levels[0]
    for i in r:
        phi.values[l:l+i, :].cumprod(axis=0, out=res[l:l+i])
        l += i
    return res
About 40 times faster on the MultiIndex data from the question.
One problem is that this relies on how pandas stores the data in its backing array, so it may stop working when pandas changes.
>>> phi = pd.read_hdf('phi.h5', 'phi')
>>> %timeit phi.groupby(level='atomic_number').cumprod()
100 loops, best of 3: 4.33 ms per loop
>>> %timeit np_cumprod(phi)
10000 loops, best of 3: 111 µs per loop
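If that backend dependence is a concern, here is a hedged sketch of the same idea that works on an explicit copy and derives the group sizes from the index rather than writing through phi.values (np_cumprod_copy and the use of group sizes are my own additions, not part of the original answer):
def np_cumprod_copy(phi):
    # Hypothetical variant: operate on a copy and slice by explicit group sizes.
    # Assumes rows are sorted so each group is contiguous, as in the snippet above.
    arr = phi.to_numpy().copy()
    sizes = phi.groupby(level='atomic_number').size().to_numpy()
    l = 0
    for n in sizes:
        np.cumprod(arr[l:l+n, :], axis=0, out=arr[l:l+n, :])
        l += n
    return pd.DataFrame(arr, index=phi.index, columns=phi.columns)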
If you want a fast but not very pretty workaround, you could do something like the following. Here's some sample data and your default approach.
df = pd.DataFrame({'x': np.repeat(range(200), 4), 'y': np.random.randn(800)})
df = df.sort_values('x')
df['cp_group'] = df.groupby('x')['y'].cumprod()
And here's the workaround. It looks rather long (it is) but each individual step is simple and fast. (The timings are at the bottom.) The key is simply to avoid using groupby at all here, replacing it with shift and friends, but because of that you also need to make sure your data is sorted by the groupby column.
df['cp_nogroup'] = df.y.cumprod()
df['last'] = np.where( df.x == df.x.shift(-1), 0, df.y.cumprod() )
df['last'] = np.where( df['last'] == 0., np.nan, df['last'] )
df['last'] = df['last'].shift().ffill().fillna(1)
df['cp_fast'] = df['cp_nogroup'] / df['last']
df['dif'] = df.cp_group - df.cp_fast
Here's what it looks like. 'cp_group' is your default and 'cp_fast' is the above workaround. If you look at the 'dif' column you'll see that several of these are off by very small amounts. This is just a precision issue and not anything to worry about.
x y cp_group cp_nogroup last cp_fast dif
0 0 1.364826 1.364826 1.364826 1.000000 1.364826 0.000000e+00
1 0 0.410126 0.559751 0.559751 1.000000 0.559751 0.000000e+00
2 0 0.894037 0.500438 0.500438 1.000000 0.500438 0.000000e+00
3 0 0.092296 0.046189 0.046189 1.000000 0.046189 0.000000e+00
4 1 1.262172 1.262172 0.058298 0.046189 1.262172 0.000000e+00
5 1 0.832328 1.050541 0.048523 0.046189 1.050541 2.220446e-16
6 1 -0.337245 -0.354289 -0.016364 0.046189 -0.354289 -5.551115e-17
7 1 0.758163 -0.268609 -0.012407 0.046189 -0.268609 -5.551115e-17
8 2 -1.025820 -1.025820 0.012727 -0.012407 -1.025820 0.000000e+00
9 2 1.175903 -1.206265 0.014966 -0.012407 -1.206265 0.000000e+00
Timings
Default method:
In [86]: %timeit df.groupby('x').cumprod()
10 loops, best of 3: 100 ms per loop
Standard cumprod but without the groupby. This should be a good approximation of the maximum possible speed you could achieve.
In [87]: %timeit df.cumprod()
1000 loops, best of 3: 536 µs per loop
And here's the workaround:
In [88]: %%timeit
...: df['cp_nogroup'] = df.y.cumprod()
...: df['last'] = np.where( df.x == df.x.shift(-1), 0, df.y.cumprod() )
...: df['last'] = np.where( df['last'] == 0., np.nan, df['last'] )
...: df['last'] = df['last'].shift().ffill().fillna(1)
...: df['cp_fast'] = df['cp_nogroup'] / df['last']
...: df['dif'] = df.cp_group - df.cp_fast
100 loops, best of 3: 2.3 ms per loop
So the workaround is about 40x faster for this sample dataframe but the speedup will depend on the dataframe (in particular on the number of groups).

Faster `pandas.DataFrame.groupby()` when you have *lots* of groups

I would like to create lists from a column of a DataFrame grouping by an index. To perform this task I:
Group by an index
Aggregate all the items in each group into a list
However, once the number of groups gets large, this operation becomes very slow. Here is an illustrative example.
First, the data:
import random
import uuid
import numpy as np
import pandas as pd
np.random.seed(42)
random.seed(42)
def make_df(nr_level_one: int = 1000, max_nr_paths: int = 10):
    level_one_values = np.arange(nr_level_one)
    count_paths = np.random.randint(1, max_nr_paths+1, size=nr_level_one)
    idx_one = np.repeat(level_one_values, count_paths)
    nr_obs = np.sum(count_paths)
    idx_two = np.arange(nr_obs)
    idx = pd.MultiIndex.from_tuples(
        zip(idx_one, idx_two), names=['one', 'two']
    )
    paths = [str(uuid.UUID(int=random.getrandbits(128))).replace('-', '/')
             for _ in range(nr_obs)]
    return pd.DataFrame(paths, index=idx, columns=['path'])
df = make_df()
df
path
one two
0 0 bdd640fb/0667/1ad1/1c80/317fa3b1799d
1 23b8c1e9/3924/56de/3eb1/3b9046685257
2 bd9c66b3/ad3c/2d6d/1a3d/1fa7bc8960a9
3 972a8469/1641/9f82/8b9d/2434e465e150
4 17fc695a/07a0/ca6e/0822/e8f36c031199
... ... ...
999 5443 fe66c4fa/35ed/ff38/9197/107c89b702ed
5444 a560c775/58cf/d966/6f11/0436b3c28ec5
5445 49e785c7/cbd8/715e/ae98/b722cf97b016
5446 f7eefd84/b31c/8349/5799/2f42351b3e63
5447 be3de265/d471/8d86/8d36/645980f6c26c
5448 rows × 1 columns
I want to aggregate all paths at level one in the index into a list. This is the function which achieves what I want:
df.groupby(level='one').agg(list)
path
one
0 [bdd640fb/0667/1ad1/1c80/317fa3b1799d, 23b8c1e...
1 [6b65a6a4/8b81/48f6/b38a/088ca65ed389, 4737819...
2 [371ecd7b/27cd/8130/4722/9389571aa876, 1a2a73e...
3 [3139d32c/93cd/59bf/5c94/1cf0dc98d2c1, a9488d9...
4 [29a3b2e9/5d65/a441/d588/42dea2bc372f, ab9099a...
... ...
995 [5dc2fd9b/f1bd/b57b/b8dd/dfc2963ba31c, aa1c5dc...
996 [1e228ade/59c6/7a52/8f80/d1ef4615575c, d60b04a...
997 [f151ff15/a46e/e99e/ae4e/d89fd659d69f, bf5628b...
998 [17a85108/43b9/3b02/7089/8400b2932bfb, 5d15c12...
999 [85e620fe/a44e/b3e1/c5ba/136f594ed61a, afe1d84...
1000 rows × 1 columns
But as you scale up the number of groups, this gets very, very slow (though it does seem to scale linearly with the number of groups):
%%timeit
make_df(100000).groupby(level='one').agg(list)
15 s ± 230 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is there a faster way of achieving the same task with pandas?
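One direction that is often much faster in this situation is to bypass groupby.agg(list) and build the lists by slicing the underlying numpy arrays at the group boundaries. A minimal sketch, assuming the frame is sorted by the 'one' level (as make_df() produces); agg_list_fast is a hypothetical name, not an established pandas API:
import numpy as np
import pandas as pd

def agg_list_fast(df):
    # Hypothetical sketch: build per-group lists by slicing the flat arrays.
    keys = df.index.get_level_values('one').to_numpy()
    values = df['path'].to_numpy()
    # positions where a new group starts (assumes groups are contiguous)
    starts = np.flatnonzero(np.r_[True, keys[1:] != keys[:-1]])
    lists = [list(chunk) for chunk in np.split(values, starts[1:])]
    return pd.DataFrame({'path': lists}, index=pd.Index(keys[starts], name='one'))

# usage: agg_list_fast(make_df(100000))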

Filter pandas row where 1st letter in a column is/is-not a certain value

How do I filter out rows of data (in a pandas DataFrame) where I do not want the 1st letter of a column to be 'Z', or any other given character?
I have the following pandas DataFrame, df (of which there are > 25,000 rows).
TIME_STAMP Activity Action Quantity EPIC Price Sub-activity Venue
0 2017-08-30 08:00:05.000 Allocation BUY 50 RRS 77.6 CPTY 066
1 2017-08-30 08:00:05.000 Allocation BUY 50 RRS 77.6 CPTY 066
3 2017-08-30 08:00:09.000 Allocation BUY 91 BATS 47.875 CPTY PXINLN
4 2017-08-30 08:00:10.000 Allocation BUY 43 PNN 8.07 CPTY WCAPD
5 2017-08-30 08:00:10.000 Allocation BUY 270 SGE 6.93 CPTY PROBDMAD
I am trying to remove all the rows where the 1st letter of the Venue is 'Z'.
For example, my usual filter code would be something like this (filtering out all rows where the Venue = '066'):
df = df[df.Venue != '066']
I can see that this line filters out what I need as an array, but I am not sure how to express it as a filter on the DataFrame.
[k for k in df.Venue if 'Z' not in k]
Use str[0] to select the first character, or use startswith or contains with the regex anchor ^ for the start of the string. The boolean mask is inverted with ~:
df1 = df[df.Venue.str[0] != 'Z']
df1 = df[~df.Venue.str.startswith('Z')]
df1 = df[~df.Venue.str.contains('^Z')]
If there are no NaN values, a list comprehension is faster:
df1 = df[[x[0] != 'Z' for x in df.Venue]]
df1 = df[[not x.startswith('Z') for x in df.Venue]]
For the case where you do not have NaN values, you can convert the NumPy representation of a series to type '<U1' and test equality:
df1 = df[df['A'].values.astype('<U1') != 'Z']
Performance benchmarking
from string import ascii_uppercase
from random import choice
L = [''.join(choice(ascii_uppercase) for _ in range(10)) for i in range(100000)]
df = pd.DataFrame({'A': L})
%timeit df['A'].values.astype('<U1') != 'Z' # 4.05 ms per loop
%timeit [x[0] != 'Z' for x in df['A']] # 11.9 ms per loop
%timeit [not x.startswith('Z') for x in df['A']] # 23.7 ms per loop
%timeit ~df['A'].str.startswith('Z') # 53.6 ms per loop
%timeit df['A'].str[0] != 'Z' # 53.7 ms per loop
%timeit ~df['A'].str.contains('^Z') # 127 ms per loop
