I have time series data with an id column and a continuous value column. I want to calculate the moving weekly averages of this value for each person, in a new column. Code to generate a sample dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(freq='60min', start='2020-01-01', periods=24*14))
df['col'] = np.random.randint(0, 251, size=df.shape[0])  # upper bound is exclusive
df['uid'] = 1
df2 = pd.DataFrame(index=pd.date_range(freq='60min', start='2020-01-01', periods=24*14))
df2['col'] = np.random.randint(0, 151, size=df2.shape[0])
df2['uid'] = 2
df3=pd.concat([df, df2]).reset_index()
df3
This sample has 2 weeks of data per person, so there should be 2 average values per person. The first is the average of the first week; the second is the mean of the two weekly averages, (week1 average + week2 average) / 2. Every row in the new column should then be filled with the value for its week.
The real dataset is large, so I am looking for a solution that scales. How can this be achieved?
The desired outcome should look like this:
index uid col week_average
2020-01-01 00:00:00 1 104 week1_uid1_mean
2020-01-01 01:00:00 1 150 week1_uid1_mean
2020-01-01 02:00:00 1 243 week1_uid1_mean
....
2020-01-08 00:00:00 1 174 (week1_uid1_mean+week2_uid1_mean)/2
2020-01-08 01:00:00 1 24 (week1_uid1_mean+week2_uid1_mean)/2
...
First, compute the week index for each row
df3["week"] = (
df3["index"] - df3.groupby("uid")["index"].transform("min")
) // pd.Timedelta(7, unit="day")
Or if the values in the index column are identical for all persons (uid), directly
df3["week"] = (df3["index"] - df3["index"].min()) // pd.Timedelta(
7, unit="day"
)
Then, compute the running average of the weekly means for each distinct (uid, week) pair
week_averages = (
    df3.groupby(["uid", "week"])["col"]
    .mean()
    # group_keys=False keeps the (uid, week) index, so the lookup below works
    .groupby("uid", group_keys=False)
    .apply(lambda x: x.rolling(len(x), min_periods=1).mean())
)
Finally, fill each row of your dataframe with the corresponding week average
df3["week_average"] = df3.apply(
lambda x: week_averages.loc[(x["uid"], x["week"])], axis=1
)
On your data, when using %timeit, I get
39.5 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
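Since the real dataset is large, the row-wise apply in the last step is likely to dominate the run time. A minimal sketch of the same computation without it, assuming the column names from the question (index, uid, col): the running mean of the weekly means is computed with transform and broadcast back with a join on (uid, week). The sample frame is rebuilt here only to keep the snippet self-contained.

```python
import numpy as np
import pandas as pd

# Small reproducible frame: two users, two weeks of hourly data each.
idx = pd.date_range("2020-01-01", periods=24 * 14, freq="60min")
rng = np.random.default_rng(0)
frames = [
    pd.DataFrame({"index": idx, "uid": uid, "col": rng.integers(0, 251, size=len(idx))})
    for uid in (1, 2)
]
df3 = pd.concat(frames, ignore_index=True)

# Week number relative to each user's first timestamp.
df3["week"] = (
    df3["index"] - df3.groupby("uid")["index"].transform("min")
) // pd.Timedelta(days=7)

# Mean per (uid, week), then the running mean of those weekly means per uid.
weekly = df3.groupby(["uid", "week"])["col"].mean()
running = weekly.groupby(level="uid").transform(lambda s: s.expanding().mean())

# Broadcast back to every row with a join instead of a row-wise apply.
df3 = df3.join(running.rename("week_average"), on=["uid", "week"])
```

The join only looks up one value per (uid, week) pair, so its cost should grow with the number of rows rather than with the number of Python-level function calls.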
First off, my apologies, I'm a complete novice when it comes to Python. I use it extremely infrequently but require it for this problem.
I have a set of data which looks like the below:
id   state  dt
101  0      2022-15
101  1      2022-22
101  0      2022-26
102  0      2022-01
102  1      2022-41
103  1      2022-03
103  0      2022-12
I need to produce an output that shows the period during which each ID was in state 1. E.g., for ID 101: state1_start_dt = "2022-22", state1_end_dt = "2022-25".
The data is in .CSV format. I've attempted to bring this in via Pandas, utilise groupby on the df and then loop over this - however this seems extremely slow.
I've come across Finite State Machines which seem to link to my requirements, however I'm in way over my head attempting to create a Finite State Machine in Python which accepts .CSV inputs, provides output per each ID group as well as incorporates logic to account for scenarios where the last entry for an ID is state = "1" - therefore we'd assume the time frame was until the end of 2022.
If anyone can provide some sources or sample code which I can break down to get a better understanding - that would be great.
EDIT
Some examples to be clearer on what I'd like to achieve:
- For IDs that have no ending 0 in the state sequence, the state1_end_dt should be entered as '2022-52' (the final week in 2022).
- For IDs which have alternating states, we can incorporate a second, third, fourth, etc. set of columns (e.g. state1_start_dt_2, state1_end_dt_2). This will allow each window to be accounted for. For any entries that only have one window, these extra columns can be NULL.
- For IDs which have no "1" present in the state column, these can be skipped.
- For IDs which do not have any 0 states present, the minimum dt value should be taken as the state1_start_dt and '2022-52' can be entered for state1_end_dt.
IIUC, here are some functions to perform the aggregation you are looking for.
First, we convert the strings '%Y-%W' (e.g. '2022-15') into a DateTime (the Monday of that week), e.g. '2022-04-11', as it is easier to deal with actual dates than these strings. This makes this solution generic in that it can have arbitrary dates in it, not just for a single year.
Second, we augment the df with a "sentinel": a row for each id that is on the first week of the next year (next year being max year of all dates, plus 1) with state = 0. That allows us to not worry whether a sequence ends with 0 or not.
Then, we essentially group by id and apply the following logic: keep only transitions, so, e.g., [1,1,1,0,0,1,0] becomes [1,.,.,0,.,1,0] (where '.' indicates dropped values). That gives us the spans we are looking for (after subtracting one week for the 0 states).
Edit: speedup: instead of applying the masking logic to each group, we detect transitions globally (on the sentinel-augmented df, sorted by ['id', 'dt', 'state']). Since each id sequence in the augmented df ends with the sentinel (0), we are guaranteed to catch the first 1 of the next id.
Putting it all together, including a postproc() to convert dates back into strings of year-week:
def preproc(df):
    df = df.assign(dt=pd.to_datetime(df['dt'] + '-Mon', format='%Y-%W-%a'))
    max_year = df['dt'].max().year
    # first week next year:
    tmax = pd.Timestamp(f'{max_year}-12-31') + pd.offsets.Week(1)
    sentinel = pd.DataFrame(
        pd.unique(df['id']),
        columns=['id']).assign(state=0, dt=tmax)
    df = pd.concat([df, sentinel])
    df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
    return df

# speed up
def proc(df):
    mask = df['state'] != df['state'].shift(fill_value=0)
    df = df[mask]
    z = df.assign(c=df.groupby('id').cumcount()).set_index(['c', 'id'])['dt'].unstack('c')
    z[z.columns[1::2]] -= pd.offsets.Week(1)
    cols = [
        f'{x}_{i}'
        for i in range(len(z.columns) // 2)
        for x in ['start', 'end']
    ]
    return z.set_axis(cols, axis=1)

def asweeks_str(t, nat='--'):
    return f'{t:%Y-%W}' if t and t == t else nat

def postproc(df):
    # convert dates into strings '%Y-%W'
    return df.applymap(asweeks_str)
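To see what the transition mask in proc() does in isolation, here is the example sequence [1,1,1,0,0,1,0] from the explanation above run through the same comparison. This is only an illustration on a bare Series; the names s, mask and kept are not part of the functions above.

```python
import pandas as pd

s = pd.Series([1, 1, 1, 0, 0, 1, 0])

# Compare each value with its predecessor; the first row is compared with 0,
# so a leading 1 counts as a transition and a leading 0 does not.
mask = s != s.shift(fill_value=0)

# Only the rows where the state changes survive.
kept = s[mask]
```

The surviving rows alternate 1, 0, 1, 0, which is why proc() can treat the odd-numbered positions as ends and the even-numbered ones as starts.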
Examples
First, let's use the example that is in the original question. Note that this doesn't exemplify some of the corner cases we are able to handle (more on that in a minute).
df = pd.DataFrame({
'id': [101, 101, 101, 102, 102, 103, 103],
'state': [0, 1, 0, 0, 1, 1, 0],
'dt': ['2022-15', '2022-22', '2022-26', '2022-01', '2022-41', '2022-03', '2022-12'],
})
>>> postproc(proc(preproc(df)))
start_0 end_0
id
101 2022-22 2022-25
102 2022-41 2022-52
103 2022-03 2022-11
But let's generate some random data, to observe some corner cases:
def gen(n, nids=2):
    wk = np.random.randint(1, 53, n*nids)
    st = np.random.choice([0, 1], n*nids)
    ids = np.repeat(np.arange(nids) + 101, n)
    df = pd.DataFrame({
        'id': ids,
        'state': st,
        'dt': [f'2022-{w:02d}' for w in wk],
    })
    df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
    return df
Now:
np.random.seed(0) # reproducible example
df = gen(6, 3)
>>> df
id state dt
0 101 0 2022-01
1 101 0 2022-04
2 101 1 2022-04
3 101 1 2022-40
4 101 1 2022-45
5 101 1 2022-48
6 102 1 2022-10
7 102 1 2022-20
8 102 0 2022-22
9 102 1 2022-24
10 102 0 2022-37
11 102 1 2022-51
12 103 1 2022-02
13 103 0 2022-07
14 103 0 2022-13
15 103 1 2022-25
16 103 1 2022-25
17 103 1 2022-39
There are several interesting things here. First, 101 starts with a 0 state, whereas 102 and 103 both start with 1. Then, there are repeated ones for all ids. There are also repeated weeks: '2022-04' for 101 and '2022-25' for 103.
Nevertheless, the aggregation works just fine and produces:
>>> postproc(proc(preproc(df)))
start_0 end_0 start_1 end_1 start_2 end_2
id
101 2022-04 2022-52 -- -- -- --
102 2022-10 2022-21 2022-24 2022-36 2022-51 2022-52
103 2022-02 2022-06 2022-25 2022-52 -- --
Speed
np.random.seed(0)
n = 10
k = 100_000
df = gen(n, k)
%timeit preproc(df)
483 ms ± 4.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The processing itself takes less than 200ms for 1 million rows:
a = preproc(df)
%timeit proc(a)
185 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As for the post-processing (converting dates back to year-week strings), it is the slowest thing of all:
b = proc(a)
%timeit postproc(b)
1.63 s ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For a speed-up of that post-processing, we can rely on the fact that there are only a small number of distinct dates that are week-starts (52 per year, plus NaT for the blank cells):
def postproc2(df, nat='--'):
    dct = {
        t: f'{t:%Y-%W}' if t and t == t else nat
        for t in df.stack().reset_index(drop=True).drop_duplicates()
    }
    return df.applymap(dct.get)
%timeit postproc2(b)
542 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
We could of course do something similar for preproc().
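A sketch of what that could look like for the parsing step of preproc(), assuming the dt column holds only a small number of distinct 'YYYY-WW' strings. parse_weeks is a hypothetical helper name, not part of the code above, and it has not been timed here.

```python
import pandas as pd

def parse_weeks(dt):
    """Parse 'YYYY-WW' strings to the Monday of that week, once per distinct value."""
    uniq = pd.Series(dt.unique())
    parsed = pd.to_datetime(uniq + "-Mon", format="%Y-%W-%a")
    # Map every row through the small lookup table instead of parsing each row.
    lookup = dict(zip(uniq, parsed))
    return dt.map(lookup)

weeks = pd.Series(["2022-15", "2022-22", "2022-15"])
dates = parse_weeks(weeks)
```

Inside preproc(), the first line could then become df = df.assign(dt=parse_weeks(df['dt'])), if the lookup turns out to be faster on your data.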
If the csv file called one_zero.csv is this
id,state,dt
100,0,2022-15
100,1,2022-22
100,0,2022-26
101,0,2022-01
101,1,2022-41
102,1,2022-03
102,0,2022-12
102,1,2022-33
(I've added one additional item to the end.)
Then this code gives you what you want.
import pandas as pd
df = pd.read_csv("one_zero.csv")
result = {}
for id_, sub_df in df.groupby('id'):
    sub_df = sub_df.sort_values("dt")
    intervals = []
    start_dt = None
    for state, dt in zip(sub_df["state"], sub_df["dt"]):
        # keep the first 1 of a run, so repeated 1s do not move the start
        if state == 1 and start_dt is None:
            start_dt = dt
        if state == 0 and start_dt is not None:
            week = int(dt.split("-", maxsplit=1)[1])
            intervals.append((start_dt, f"2022-{week-1:02d}"))
            start_dt = None
    if start_dt is not None:
        intervals.append((start_dt, "2022-52"))
    result[id_] = intervals
At the end the result dictionary will contain this:
{
100: [('2022-22', '2022-25')],
101: [('2022-41', '2022-52')],
102: [('2022-03', '2022-11'), ('2022-33', '2022-52')]
}
With this groupby and sort_values it works even if you shuffle the lines in the csv file. I've used a formatted string to fix the week number: 02d means that the week will always be two digits, with a leading 0 for the first 9 weeks.
I guess you need less memory if you iterate on the rows like this, but for me the zip version is more familiar.
for _, row in sub_df.iterrows():
    state = row["state"]
    dt = row["dt"]
Another alternative:
res = (
df.drop(columns="dt")
.assign(week=df["dt"].str.split("-").str[1].astype("int"))
.sort_values(["id", "week"])
.assign(group=lambda df:
df.groupby("id")["state"].diff().fillna(1).ne(0).cumsum()
)
.drop_duplicates(subset="group", keep="first")
.loc[lambda df: df["state"].eq(1) | df["id"].eq(df["id"].shift())]
.assign(no=lambda df: df.groupby("id")["state"].cumsum())
.pivot(index=["id", "no"], columns="state", values="week")
.rename(columns={0: "end", 1: "start"}).fillna("52").astype("int")
)[["start", "end"]]
First add a new column week and sort by id and week. (The sorting might not be necessary if the data already comes sorted.)
Then, within each id group, look for blocks of consecutive 0s or 1s and, based on the result (stored in the new column group), drop the duplicates within each block, keeping only the first row (the others aren't relevant according to the logic you've laid out).
Afterwards also remove the 0-states at the start of an id group.
On the result, identify within each id group the connected start-end pairs (stored in the new column no).
Then .pivot the thing: pull id and no into the index and state into the columns.
Afterwards fill the NaN with 52 and do some casting, renaming, and sorting to get the result in better shape.
If you really want to move the various start-end combinations into columns, then replace the part after the pivot line as follows:
res = (
...
.pivot(index=["id", "no"], columns="state", values="week")
.rename(columns={0: 1, 1: 0}).fillna("52").astype("int")
.unstack().sort_index(level=1, axis=1)
)
res.columns = [f"{'start' if s == 0 else 'end'}_{n}" for s, n in res.columns]
Results with the dataframe from @Pierre's answer:
state start end
id no
101 1 4 52
102 1 10 22
2 24 37
3 51 52
103 1 2 7
2 25 52
or
start_1 end_1 start_2 end_2 start_3 end_3
id
101 4.0 52.0 NaN NaN NaN NaN
102 10.0 22.0 24.0 37.0 51.0 52.0
103 2.0 7.0 25.0 52.0 NaN NaN
I'm trying to improve the performance of a pandas.groupby.aggregate operation using a custom aggregating function. I noticed that (correct me if I'm wrong) pandas calls the aggregating function on each block in sequence (I suspect it to be a simple for-loop).
Since pandas is heavily based on numpy, is there a way to speed up the calculation using numpy's vectorization features?
My code
In my code I need to aggregate wind data by averaging samples together. While averaging wind speeds is trivial, averaging wind directions requires more ad hoc code (e.g. the average of 1deg and 359deg is 0deg, not 180deg).
What my aggregating function does is:
- remove NaNs
- return NaN if no other value is present
- check if a special flag indicating variable wind direction is present. If it is, return the flag
- average the wind directions with a vector-averaging algorithm
The function is:
def meandir(x):
    '''
    Parameters
    ----------
    x : pandas.Series
        pandas series to be averaged
    Returns
    -------
    float
        averaged wind direction
    '''
    # Removes the NaN from the recording
    x = x.dropna()
    # If the record is empty, return NaN
    if len(x) == 0:
        return np.nan
    # If the record contains variable samples (990) return variable (990)
    elif np.any(x == 990):
        return 990
    # Otherwise sum the vectors and return the angle
    else:
        angle = np.rad2deg(
            np.arctan2(
                np.sum(np.sin(np.deg2rad(x))),
                np.sum(np.cos(np.deg2rad(x)))
            )
        )
        # Wrap angles from (-180, 180] degrees to [0, 360)
        return (angle + 360) % 360
You can test it with:
from timeit import repeat
import pandas as pd
import numpy as np
N_samples = int(1e4)
N_nan = N_var = int(0.02 * N_samples)
# Generate random data
data = np.random.rand(N_samples,2) * [30, 360]
data[np.random.choice(N_samples, N_nan), 1] = np.nan
data[np.random.choice(N_samples, N_var), 1] = 990
# Create dataset
df = pd.DataFrame(data, columns=['WindSpeed', 'WindDir'])
df.index = pd.date_range(start='2000-01-01 00:00', periods=N_samples, freq='10min')
# Run groupby + aggregate
grouped = df.groupby(pd.Grouper(freq='H')) # Data from 14.30 to 15.29 are rounded to 15.00
aggfuns1 = {'WindSpeed': np.mean, 'WindDir':meandir}
aggfuns2 = {'WindSpeed': np.mean, 'WindDir':np.mean}
res = repeat(stmt='grouped.agg(aggfuns1)', globals=globals(), number=1, repeat=10)
print(f'With custom aggregating function {min(res)*1000:.2f} ms')
res = repeat(stmt='grouped.agg(aggfuns2)', globals=globals(), number=1, repeat=10)
print(f'Without custom aggregating function {min(res)*1000:.2f} ms')
which on my PC for N_samples=1e4 outputs:
With custom aggregating function 1500.79 ms
Without custom aggregating function 2.08 ms
with the custom aggregating function being roughly 720 times slower
and with N_samples=1e6 outputs:
With custom aggregating function 142967.17 ms
Without custom aggregating function 21.92 ms
with the custom aggregating function being 6500 times slower!
Is there a way to speed up this line of code?
The key is to try to vectorize everything you can on the whole df, and let groupby use only builtin methods.
Here is a way to do that. The trick is to convert the angles to complex numbers, which numpy will happily sum (and groupby too, but groupby will refuse to mean()). So, we convert the angles to complex, sum, then convert back to angles. The same "funny mean" of angles is used as in your code and described on the Wikipedia page you reference.
As for the handling of the special value (990), it can be vectorized too: comparing s.groupby(...).count() with .replace(val, nan).groupby(...).count() finds all the groups where there is at least one of those.
Anyway, here goes:
def to_complex(s):
    return np.exp(np.deg2rad(s) * 1j)

def to_angle(s):
    return np.angle(s, deg=True) % 360

def mask_val(s, grouper, val=990):
    return s.groupby(grouper).count() != s.replace(val, np.nan).groupby(grouper).count()

def myagg(df, grouper, val=990, winddir='WindDir'):
    s = df[winddir]
    mask = mask_val(s, grouper, val)
    gb = to_complex(s).groupby(grouper)
    s = gb.sum()
    cnt = gb.count()
    s = to_angle(s) * (cnt / cnt)  # put NaN where all NaNs
    s[mask] = val
    # other columns
    agg = df.groupby(grouper).mean()
    agg[winddir] = s
    return agg
Application:
For convenience, I put your example generation into a function gen_example(N_samples).
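The body of gen_example is not shown. A minimal sketch of what it might look like, assuming it simply wraps the data generation from the question; the nan_frac, var_frac and seed parameters are hypothetical additions for illustration, so the numbers it produces will not match the output below.

```python
import numpy as np
import pandas as pd

def gen_example(n_samples, nan_frac=0.02, var_frac=0.02, seed=None):
    """Random wind data on a 10-minute index, with some NaN and 990 directions."""
    n_samples = int(n_samples)  # allows calls such as gen_example(1e6)
    rng = np.random.default_rng(seed)
    n_nan = int(nan_frac * n_samples)
    n_var = int(var_frac * n_samples)
    # Column 0: speed in [0, 30); column 1: direction in [0, 360).
    data = rng.random((n_samples, 2)) * [30, 360]
    data[rng.choice(n_samples, n_nan), 1] = np.nan
    data[rng.choice(n_samples, n_var), 1] = 990
    index = pd.date_range("2000-01-01 00:00", periods=n_samples, freq="10min")
    return pd.DataFrame(data, columns=["WindSpeed", "WindDir"], index=index)

example = gen_example(50, seed=0)
```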
df = gen_example(50)
myagg(df, pd.Grouper(freq='H'))
Out[ ]:
WindSpeed WindDir
2000-01-01 00:00:00 12.991717 354.120464
2000-01-01 01:00:00 15.743056 60.813629
2000-01-01 02:00:00 14.593927 245.487383
2000-01-01 03:00:00 17.836368 131.493675
2000-01-01 04:00:00 18.987296 27.150359
2000-01-01 05:00:00 16.415725 194.923399
2000-01-01 06:00:00 20.881816 990.000000
2000-01-01 07:00:00 15.033480 44.626018
2000-01-01 08:00:00 16.276834 29.252459
Speed:
df = gen_example(10_000)
%timeit myagg(df, pd.Grouper(freq='H'))
Out[ ]:
6.76 ms ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df = gen_example(1e6)
%timeit myagg(df, pd.Grouper(freq='H'))
Out[ ]:
189 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Testing:
idx = [0] * 4
grouper = pd.Grouper(level=0)
myagg(pd.DataFrame({'WindDir': [170, 170, 178, 182]}, index=idx), grouper)
WindDir
0 174.998473
myagg(pd.DataFrame({'WindDir': [330, 359, 1, 40]}, index=idx), grouper)
WindDir
0 2.251499
myagg(pd.DataFrame({'WindDir': [330, 359, 1, np.nan]}, index=idx), grouper)
WindDir
0 350.102878
myagg(pd.DataFrame({'WindDir': [np.nan, np.nan, np.nan, np.nan]}, index=idx), grouper)
WindDir
0 NaN
myagg(pd.DataFrame({'WindDir': [330, 990, 1, np.nan]}, index=idx), grouper)
WindDir
0 990.0
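As a sanity check on the complex-number trick, the sketch below compares it with the arctan2 formulation from the question on one of the test inputs above. The function names circ_mean_arctan and circ_mean_complex are illustrative only; both operate on a plain NumPy array rather than on groups.

```python
import numpy as np

def circ_mean_arctan(deg):
    """Vector average of angles via summed sines and cosines."""
    rad = np.deg2rad(deg)
    return np.rad2deg(np.arctan2(np.sin(rad).sum(), np.cos(rad).sum())) % 360

def circ_mean_complex(deg):
    """Same average: sum unit complex numbers, then take the angle of the sum."""
    return np.angle(np.exp(1j * np.deg2rad(deg)).sum(), deg=True) % 360

samples = np.array([330.0, 359.0, 1.0, 40.0])
via_arctan = circ_mean_arctan(samples)
via_complex = circ_mean_complex(samples)
```

Both give the same direction because the real and imaginary parts of the complex sum are exactly the summed cosines and sines.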
I would like to create lists from a column of a DataFrame grouping by an index. To perform this task I:
- Group by an index
- Aggregate all the items in each group into a list
However, once the number of groups gets large, this operation becomes very slow. Here is an illustrative example.
First, the data:
import random
import uuid
import numpy as np
import pandas as pd
np.random.seed(42)
random.seed(42)
def make_df(nr_level_one: int = 1000, max_nr_paths: int = 10):
    level_one_values = np.arange(nr_level_one)
    count_paths = np.random.randint(1, max_nr_paths+1, size=nr_level_one)
    idx_one = np.repeat(level_one_values, count_paths)
    nr_obs = np.sum(count_paths)
    idx_two = np.arange(nr_obs)
    idx = pd.MultiIndex.from_tuples(
        zip(idx_one, idx_two), names=['one', 'two']
    )
    paths = [str(uuid.UUID(int=random.getrandbits(128))).replace('-', '/')
             for _ in range(nr_obs)]
    return pd.DataFrame(paths, index=idx, columns=['path'])
df = make_df()
df
path
one two
0 0 bdd640fb/0667/1ad1/1c80/317fa3b1799d
1 23b8c1e9/3924/56de/3eb1/3b9046685257
2 bd9c66b3/ad3c/2d6d/1a3d/1fa7bc8960a9
3 972a8469/1641/9f82/8b9d/2434e465e150
4 17fc695a/07a0/ca6e/0822/e8f36c031199
... ... ...
999 5443 fe66c4fa/35ed/ff38/9197/107c89b702ed
5444 a560c775/58cf/d966/6f11/0436b3c28ec5
5445 49e785c7/cbd8/715e/ae98/b722cf97b016
5446 f7eefd84/b31c/8349/5799/2f42351b3e63
5447 be3de265/d471/8d86/8d36/645980f6c26c
5448 rows × 1 columns
I want to aggregate all paths at level one in the index into a list. This is the function which achieves what I want:
df.groupby(level='one').agg(list)
path
one
0 [bdd640fb/0667/1ad1/1c80/317fa3b1799d, 23b8c1e...
1 [6b65a6a4/8b81/48f6/b38a/088ca65ed389, 4737819...
2 [371ecd7b/27cd/8130/4722/9389571aa876, 1a2a73e...
3 [3139d32c/93cd/59bf/5c94/1cf0dc98d2c1, a9488d9...
4 [29a3b2e9/5d65/a441/d588/42dea2bc372f, ab9099a...
... ...
995 [5dc2fd9b/f1bd/b57b/b8dd/dfc2963ba31c, aa1c5dc...
996 [1e228ade/59c6/7a52/8f80/d1ef4615575c, d60b04a...
997 [f151ff15/a46e/e99e/ae4e/d89fd659d69f, bf5628b...
998 [17a85108/43b9/3b02/7089/8400b2932bfb, 5d15c12...
999 [85e620fe/a44e/b3e1/c5ba/136f594ed61a, afe1d84...
1000 rows × 1 columns
But as you scale up the number of groups, this gets very very slow (though it does seem to scale linearly with the number of groups)
%%timeit
make_df(100000).groupby(level='one').agg(list)
15 s ± 230 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is there a faster way of achieving the same task with pandas?
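One direction that may be worth measuring, assuming the rows are already sorted by level one (as they are in make_df): find the group boundaries with NumPy and split the underlying array, so that the list construction does not go through agg once per group. This is a sketch on a toy frame, not a benchmarked solution, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"path": ["a", "b", "c", "d", "e"]},
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (1, 2), (2, 3), (2, 4)], names=["one", "two"]
    ),
)

keys = df.index.get_level_values("one").to_numpy()
values = df["path"].to_numpy()

# Positions where the level-one key changes (requires rows sorted by 'one').
boundaries = np.flatnonzero(keys[1:] != keys[:-1]) + 1
starts = np.r_[0, boundaries]

# One list per group, indexed by the key found at each group's first row.
result = pd.Series(
    [chunk.tolist() for chunk in np.split(values, boundaries)],
    index=pd.Index(keys[starts], name="one"),
    name="path",
)
```

If the frame is not sorted by level one, it would need a df.sort_index(level='one') first, which adds its own cost.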
I'm wondering about efficient and concise ways of passing nuisance columns through to the result of a pandas.DataFrame.groupby. I often have columns which I do not want to apply the groupby operation to, but I do want the values to propagate through to the result. An example of what I am trying to do is shown below
import pandas as pd
import numpy as np
import random
import string
np.random.seed(43)
random.seed(43)
dates = pd.date_range("2015-01-01", "2017-01-02")
types = "AAABBCCCDDDDEEFFFFGG"
rtypes = list(types * len(dates))
rdates = dates.tolist() * len(types)
data = np.random.randn(len(rtypes))
info1 = [''.join(random.choice(string.ascii_uppercase) for _ in range(5))
for i in range(len(rtypes))]
info2 = [random.randint(100,1000) for i in range(len(rtypes))]
df = pd.DataFrame({"date": rdates, "category": rtypes, "vals": data,
"info1":info1, "info2": info2})
df = df.sort_values(["date", "category"]).reset_index(drop=True)
df.head()
category date info1 info2 vals
0 A 2015-01-01 BJWYE 990 0.257400
1 A 2015-01-01 ISQES 475 -0.867570
2 A 2015-01-01 KDEKE 214 1.683595
3 B 2015-01-01 TFOXR 203 0.575879
4 B 2015-01-01 HKTNF 992 -0.399677
Here I would like to group by the category and date and apply some function to vals but have the info1 and info2 columns passed through.
Possible Solutions
These are the possible solutions I have found, but both seem somewhat clunky and have quite different performance which made me wonder if there is possibly a more efficient or more concise solution. I'm applying the rank function in this example but am interested more broadly in functions that could return 1 value per group, all values per group or some values per group.
Option 1
Stash all desired pass through columns in the index
%%timeit
(df.set_index(["date", "category", "info1", "info2"])
.groupby(axis=0, level=[0, 1]).rank().reset_index())
2.64 s ± 47.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
with result
sol1 = (df.set_index(["date", "category", "info1", "info2"])
.groupby(axis=0, level=[0, 1]).rank().reset_index())
sol1.sort_values(["date", "category"]).head()
date category info1 info2 vals
0 2015-01-01 A BJWYE 990 2.0
1 2015-01-01 A ISQES 475 1.0
2 2015-01-01 A KDEKE 214 3.0
3 2015-01-01 B TFOXR 203 2.0
4 2015-01-01 B HKTNF 992 1.0
Option 2
Drop the columns and join them back later
%%timeit
pd.merge(
df.groupby(by=["date", "category"])[["vals"]].rank(),
df.drop("vals", axis=1),
how="left",
left_index=True,
right_index=True,
)
1.73 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think you're overcomplicating things. You can just groupby and apply rank to the vals column. This returns a pandas.Series of the same length as your original df, so you can just assign it to the column.
df['vals'] = df.groupby(['date', 'category']).vals.rank()
category date info1 info2 vals
0 A 2015-01-01 BJWYE 990 2.0
1 A 2015-01-01 ISQES 475 1.0
2 A 2015-01-01 KDEKE 214 3.0
3 B 2015-01-01 TFOXR 203 2.0
4 B 2015-01-01 HKTNF 992 1.0
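For the other case mentioned in the question (a function that returns one value per group), the same pattern should work with transform, which broadcasts the group result back onto the original rows so the other columns are left untouched. A small sketch on a hand-made frame; the values and the rank and group_mean column names are illustrative, not the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01"] * 3 + ["2015-01-02"] * 2),
    "category": ["A", "A", "B", "A", "A"],
    "vals": [0.3, -0.9, 1.7, 0.6, -0.4],
    "info1": ["BJWYE", "ISQES", "KDEKE", "TFOXR", "HKTNF"],
    "info2": [990, 475, 214, 203, 992],
})

grouped = df.groupby(["date", "category"])["vals"]

# All values per group: rank already returns one value per original row.
df["rank"] = grouped.rank()

# One value per group: transform repeats it for every row of the group.
df["group_mean"] = grouped.transform("mean")
```

Functions that return only some rows per group do not align with the original index this way and would still need a merge or join.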
I have a Pandas data frame with hundreds of millions of rows that looks like this:
Date Attribute A Attribute B Value
01/01/16 A 1 50
01/05/16 A 1 60
01/02/16 B 1 59
01/04/16 B 1 90
01/10/16 B 1 84
For each unique combination (call it b) of Attribute A x Attribute B, I need to fill in empty dates starting from the oldest date for that unique group b to the maximum date in the entire dataframe df. That is, so it looks like this:
Date Attribute A Attribute B Value
01/01/16 A 1 50
01/02/16 A 1 0
01/03/16 A 1 0
01/04/16 A 1 0
01/05/16 A 1 60
01/02/16 B 1 59
01/03/16 B 1 0
01/04/16 B 1 90
01/05/16 B 1 0
01/06/16 B 1 0
01/07/16 B 1 0
01/08/16 B 1 84
and then calculate the coefficient of variation (standard deviation/mean) for each unique combination's values (after inserting 0s). My code is this:
final = pd.DataFrame()
max_date = df['Date'].max()
for name, group in df.groupby(['Attribute_A','Attribute_B']):
    idx = pd.date_range(group['Date'].min(),
                        max_date)
    temp = group.set_index('Date').reindex(idx, fill_value=0)
    coeff_var = temp['Value'].std()/temp['Value'].mean()
    final = pd.concat([final, pd.DataFrame({'Attribute_A':[name[0]], 'Attribute_B':[name[1]],'Coeff_Var':[coeff_var]})])
This runs insanely slow, and I'm looking for a way to speed it up.
Suggestions?
I don't have a ready solution, however this is how I suggest you approach the problem:
- Understand what makes this slow
- Find ways to make the critical parts faster
- Or, alternatively, find a new approach
Here's the analysis of your code using line profiler:
Timer unit: 1e-06 s
Total time: 0.028074 s
File: <ipython-input-54-ad49822d490b>
Function: foo at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def foo():
2 1 875 875.0 3.1 final = pd.DataFrame()
3 1 302 302.0 1.1 max_date = df['Date'].max()
4 3 3343 1114.3 11.9 for name, group in df.groupby(['Attribute_A','Attribute_B']):
5 2 836 418.0 3.0 idx = pd.date_range(group['Date'].min(),
6 2 3601 1800.5 12.8 max_date)
7
8 2 6713 3356.5 23.9 temp = group.set_index('Date').reindex(idx, fill_value=0)
9 2 1961 980.5 7.0 coeff_var = temp['Value'].std()/temp['Value'].mean()
10 2 10443 5221.5 37.2 final = pd.concat([final, pd.DataFrame({'Attribute_A':[name[0]], 'Attribute_B':[name[1]],'Coeff_Var':[coeff_var]})])
In conclusion, the .reindex and concat statements take 60% of the time.
A first approach that saves 42% of time in my measurement is to collect the data for the final data frame as a list of rows, and create the dataframe as the very last step. Like so:
newdata = []
max_date = df['Date'].max()
for name, group in df.groupby(['Attribute_A','Attribute_B']):
    idx = pd.date_range(group['Date'].min(),
                        max_date)
    temp = group.set_index('Date').reindex(idx, fill_value=0)
    coeff_var = temp['Value'].std()/temp['Value'].mean()
    newdata.append({'Attribute_A': name[0], 'Attribute_B': name[1],'Coeff_Var':coeff_var})
final = pd.DataFrame.from_records(newdata)
Using timeit to measure best execution times I get
your solution: 100 loops, best of 3: 11.5 ms per loop
improved concat: 100 loops, best of 3: 6.67 ms per loop
For details, see this IPython notebook.
Note: Your mileage may vary - I used the sample data provided in the original post. You should run the line profiler on a subset of your real data - the dominating factor in regards to time use may well be something else then.
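As one possible "new approach": because every inserted row is 0, the reindexing may not be needed at all. The zeros change only the number of observations n (days from the group's first date to max_date), not the sum or the sum of squares, so the sample variance can be written as (sum of squares - n * mean^2) / (n - 1). A sketch under assumptions: at most one row per date per group, underscore column names as in the question's code, and the sample standard deviation (ddof=1) that pandas uses by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2016-01-01", "2016-01-05", "2016-01-02", "2016-01-04", "2016-01-10"]
    ),
    "Attribute_A": ["A", "A", "B", "B", "B"],
    "Attribute_B": [1, 1, 1, 1, 1],
    "Value": [50, 60, 59, 90, 84],
})

max_date = df["Date"].max()
stats = (
    df.assign(sq=df["Value"].astype(float) ** 2)
    .groupby(["Attribute_A", "Attribute_B"])
    .agg(start=("Date", "min"), total=("Value", "sum"), total_sq=("sq", "sum"))
)

# Number of daily observations once the missing dates are filled with 0.
n = (max_date - stats["start"]).dt.days + 1
mean = stats["total"] / n
# Sample variance from sums; clip guards against tiny negative rounding errors.
var = ((stats["total_sq"] - n * mean**2) / (n - 1)).clip(lower=0)

final = stats.assign(Coeff_Var=np.sqrt(var) / mean)[["Coeff_Var"]].reset_index()
```

This replaces the Python loop with a single groupby aggregation, which is where most of the gain should come from on hundreds of millions of rows.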
I am not sure if my way is faster than the way that you set up, but here goes:
df = pd.DataFrame({'Date': ['1/1/2016', '1/5/2016', '1/2/2016', '1/4/2016', '1/10/2016'],
'Attribute A': ['A', 'A', 'B', 'B', 'B'],
'Attribute B': [1, 1, 1, 1, 1],
'Value': [50, 60, 59, 90, 84]})
unique_attributes = df['Attribute A'].unique()
groups = []
for i in unique_attributes:
    # copy so the in-place changes below do not touch a view of df
    subset = df[df['Attribute A'] == i].copy()
    dates = subset['Date'].tolist()
    Dates = pd.date_range(dates[0], dates[-1])
    subset.set_index('Date', inplace=True)
    subset.index = pd.DatetimeIndex(subset.index)
    subset = subset.reindex(Dates)
    subset['Attribute A'] = subset['Attribute A'].ffill()
    subset['Attribute B'] = subset['Attribute B'].ffill()
    subset['Value'] = subset['Value'].fillna(0)
    groups.append(subset)
result = pd.concat(groups)