Efficient and concise GroupBy nuisance column pass through - python

I'm wondering about efficient and concise ways of passing nuisance columns through to the result of a pandas.DataFrame.groupby. I often have columns which I do not want to apply the groupby operation to, but I do want the values to propagate through to the result. An example of what I am trying to do is shown below
import pandas as pd
import numpy as np
import random
import string
np.random.seed(43)
random.seed(43)
dates = pd.date_range("2015-01-01", "2017-01-02")
types = "AAABBCCCDDDDEEFFFFGG"
rtypes = list(types * len(dates))
rdates = dates.tolist() * len(types)
data = np.random.randn(len(rtypes))
info1 = [''.join(random.choice(string.ascii_uppercase) for _ in range(5))
         for i in range(len(rtypes))]
info2 = [random.randint(100, 1000) for i in range(len(rtypes))]
df = pd.DataFrame({"date": rdates, "category": rtypes, "vals": data,
                   "info1": info1, "info2": info2})
df = df.sort_values(["date", "category"]).reset_index(drop=True)
df.head()
category date info1 info2 vals
0 A 2015-01-01 BJWYE 990 0.257400
1 A 2015-01-01 ISQES 475 -0.867570
2 A 2015-01-01 KDEKE 214 1.683595
3 B 2015-01-01 TFOXR 203 0.575879
4 B 2015-01-01 HKTNF 992 -0.399677
Here I would like to group by the category and date and apply some function to vals but have the info1 and info2 columns passed through.
Possible Solutions
These are the possible solutions I have found, but both seem somewhat clunky and have quite different performance which made me wonder if there is possibly a more efficient or more concise solution. I'm applying the rank function in this example but am interested more broadly in functions that could return 1 value per group, all values per group or some values per group.
Option 1
Stash all desired pass through columns in the index
%%timeit
(df.set_index(["date", "category", "info1", "info2"])
   .groupby(axis=0, level=[0, 1]).rank().reset_index())
2.64 s ± 47.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
with result
sol1 = (df.set_index(["date", "category", "info1", "info2"])
          .groupby(axis=0, level=[0, 1]).rank().reset_index())
sol1.sort_values(["date", "category"]).head()
date category info1 info2 vals
0 2015-01-01 A BJWYE 990 2.0
1 2015-01-01 A ISQES 475 1.0
2 2015-01-01 A KDEKE 214 3.0
3 2015-01-01 B TFOXR 203 2.0
4 2015-01-01 B HKTNF 992 1.0
Option 2
Drop the columns and join them back later
%%timeit
pd.merge(
    df.groupby(by=["date", "category"])[["vals"]].rank(),
    df.drop("vals", axis=1),
    how="left",
    left_index=True,
    right_index=True,
)
1.73 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I think you're overly complicating things. You can just groupby and rank the vals column. This returns a pandas.Series of the same length as your original df, so you can simply assign it back to the column.
df['vals'] = df.groupby(['date', 'category']).vals.rank()
category date info1 info2 vals
0 A 2015-01-01 BJWYE 990 2.0
1 A 2015-01-01 ISQES 475 1.0
2 A 2015-01-01 KDEKE 214 3.0
3 B 2015-01-01 TFOXR 203 2.0
4 B 2015-01-01 HKTNF 992 1.0
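The same pattern extends to functions that return one value per group: groupby(...).transform(...) broadcasts the per-group scalar back to every row, so the nuisance columns never move. A minimal sketch on a toy frame (the column name group_mean is mine, chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2015-01-01"] * 5,
    "category": ["A", "A", "A", "B", "B"],
    "info1": ["BJWYE", "ISQES", "KDEKE", "TFOXR", "HKTNF"],
    "vals": [0.25, -0.86, 1.68, 0.57, -0.39],
})

# transform broadcasts the per-group result back to the original rows,
# so info1 (and any other nuisance column) passes through untouched
df["group_mean"] = df.groupby(["date", "category"])["vals"].transform("mean")
```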

Related

Compute weekly averages of columns in time series data

I have time series data with an id column and a continuous value column. I want to calculate the moving weekly average of this value per person, in a new column. Code to generate a sample dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(freq='60T', start='2020-01-01', periods=24 * 14))
# np.random.random_integers is deprecated; np.random.randint has an exclusive upper bound
df['col'] = np.random.randint(0, 251, size=df.shape[0])
df['uid'] = 1
df2 = pd.DataFrame(index=pd.date_range(freq='60T', start='2020-01-01', periods=24 * 14))
df2['col'] = np.random.randint(0, 151, size=df2.shape[0])
df2['uid'] = 2
df3 = pd.concat([df, df2]).reset_index()
df3
This sample has 2 weeks of data per person, so there should be 2 average values per person. The first one will be the average of the first week, the second average will be the average of the two weeks (week1 average + week2 average)/2. Then, fill all the rows in the column with the average value of that week.
Real dataset is large, so I am looking for a solution that can be scaled. How can this be achieved?
desired outcome should look like this:
index uid col week_average
2020-01-01 00:00:00 1 104 week1_uid1_mean
2020-01-01 01:00:00 1 150 week1_uid1_mean
2020-01-01 02:00:00 1 243 week1_uid1_mean
....
2020-01-08 00:00:00 1 174 (week1_uid1_mean+week2_uid1_mean)/2
2020-01-08 01:00:00 1 24 (week1_uid1_mean+week2_uid1_mean)/2
...
First, compute the week index for each row
df3["week"] = (
    df3["index"] - df3.groupby("uid")["index"].transform("min")
) // pd.Timedelta(7, unit="day")
Or if the values in the index column are identical for all persons (uid), directly
df3["week"] = (df3["index"] - df3["index"].min()) // pd.Timedelta(
    7, unit="day"
)
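To see what the floor division produces, here is a tiny self-contained check (illustrative timestamps, not the question's data):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-06",
                               "2020-01-08", "2020-01-15"]))
# elapsed time since the first timestamp, floor-divided into 7-day buckets
week = (ts - ts.min()) // pd.Timedelta(7, unit="day")
# days 0, 5, 7, 14 fall into weeks 0, 0, 1, 2
```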
Then, compute the week average for each distinct couple of (uid, week)
week_averages = (
    df3.groupby(["uid", "week"])["col"]
    .mean()
    .groupby("uid")
    .apply(lambda x: x.rolling(len(x), min_periods=1).mean())
)
Finally, fill each row of your dataframe with the corresponding week average
df3["week_average"] = df3.apply(
    lambda x: week_averages.loc[(x["uid"], x["week"])], axis=1
)
On your data, when using %timeit, I get
39.5 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
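The row-wise apply in the last step does one Python-level .loc lookup per row; on a large frame a single merge on (uid, week) is usually much faster. A sketch under the same assumptions (week_averages is a Series indexed by uid and week; the toy numbers stand in for the computed averages):

```python
import pandas as pd

df3 = pd.DataFrame({"uid": [1, 1, 2], "week": [0, 1, 0], "col": [10, 20, 30]})
week_averages = pd.Series(
    [10.0, 15.0, 30.0],
    index=pd.MultiIndex.from_tuples([(1, 0), (1, 1), (2, 0)],
                                    names=["uid", "week"]),
    name="week_average",
)
# one vectorized join instead of a per-row lookup
df3 = df3.merge(week_averages, left_on=["uid", "week"],
                right_index=True, how="left")
```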

Obtaining timeframe for ID groups based on state change

First off, my apologies, I'm a complete novice when it comes to Python. I use it extremely infrequently but require it for this problem.
I have a set of data which looks like the below:
id   state  dt
101  0      2022-15
101  1      2022-22
101  0      2022-26
102  0      2022-01
102  1      2022-41
103  1      2022-03
103  0      2022-12
I need to provide an output which displays the amount of time each ID was in state = "1", e.g. for ID 101: state1_start_dt = "2022-22", state1_end_dt = "2022-25".
The data is in .CSV format. I've attempted to bring this in via Pandas, utilise groupby on the df and then loop over this - however this seems extremely slow.
I've come across Finite State Machines which seem to link to my requirements, however I'm in way over my head attempting to create a Finite State Machine in Python which accepts .CSV inputs, provides output per each ID group as well as incorporates logic to account for scenarios where the last entry for an ID is state = "1" - therefore we'd assume the time frame was until the end of 2022.
If anyone can provide some sources or sample code which I can break down to get a better understanding - that would be great.
EDIT
Some examples to be clearer on what I'd like to achieve:
-For IDs that have no ending 0 in the state sequence, the state1_end_dt should be entered as '2022-52' (the final week in 2022)
-For IDs which have alternating states, we can incorporate a second, third, fourth etc. set of columns (e.g. state1_start_dt_2, state1_end_dt_2). This will allow each window to be accounted for. For any entries that only have one window, these extra columns can be NULL.
-For IDs which have no "1" present in the state column, these can be skipped.
-For IDs which do not have any 0 states present, the minimum dt value should be taken as the state1_start_dt and '2022-52' can be entered for state1_end_dt
IIUC, here are some functions to perform the aggregation you are looking for.
First, we convert the strings '%Y-%W' (e.g. '2022-15') into a DateTime (the Monday of that week), e.g. '2022-04-11', as it is easier to deal with actual dates than these strings. This makes this solution generic in that it can have arbitrary dates in it, not just for a single year.
Second, we augment the df with a "sentinel": a row for each id that is on the first week of the next year (next year being max year of all dates, plus 1) with state = 0. That allows us to not worry whether a sequence ends with 0 or not.
Then, we essentially group by id and apply the following logic: keep only transitions, so, e.g., [1,1,1,0,0,1,0] becomes [1,.,.,0,.,1,0] (where '.' indicates dropped values). That gives us the spans we are looking for (after subtracting one week for the 0 states).
Edit: speedup: instead of applying the masking logic to each group, we detect transitions globally (on the sentinel-augmented df, sorted by ['id', 'dt', 'state']). Since each id sequence in the augmented df ends with the sentinel (0), we are guaranteed to catch the first 1 of the next id.
Putting it all together, including a postproc() to convert dates back into strings of year-week:
def preproc(df):
    df = df.assign(dt=pd.to_datetime(df['dt'] + '-Mon', format='%Y-%W-%a'))
    max_year = df['dt'].max().year
    # first week next year:
    tmax = pd.Timestamp(f'{max_year}-12-31') + pd.offsets.Week(1)
    sentinel = pd.DataFrame(
        pd.unique(df['id']),
        columns=['id']).assign(state=0, dt=tmax)
    df = pd.concat([df, sentinel])
    df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
    return df

# speed up
def proc(df):
    mask = df['state'] != df['state'].shift(fill_value=0)
    df = df[mask]
    z = df.assign(c=df.groupby('id').cumcount()).set_index(['c', 'id'])['dt'].unstack('c')
    z[z.columns[1::2]] -= pd.offsets.Week(1)
    cols = [
        f'{x}_{i}'
        for i in range(len(z.columns) // 2)
        for x in ['start', 'end']
    ]
    return z.set_axis(cols, axis=1)

def asweeks_str(t, nat='--'):
    return f'{t:%Y-%W}' if t and t == t else nat

def postproc(df):
    # convert dates into strings '%Y-%W'
    return df.applymap(asweeks_str)
Examples
First, let's use the example that is in the original question. Note that it doesn't exemplify some of the corner cases we are able to handle (more on that in a minute).
df = pd.DataFrame({
    'id': [101, 101, 101, 102, 102, 103, 103],
    'state': [0, 1, 0, 0, 1, 1, 0],
    'dt': ['2022-15', '2022-22', '2022-26', '2022-01', '2022-41', '2022-03', '2022-12'],
})
>>> postproc(proc(preproc(df)))
start_0 end_0
id
101 2022-22 2022-25
102 2022-41 2022-52
103 2022-03 2022-11
But let's generate some random data, to observe some corner cases:
def gen(n, nids=2):
    wk = np.random.randint(1, 53, n*nids)
    st = np.random.choice([0, 1], n*nids)
    ids = np.repeat(np.arange(nids) + 101, n)
    df = pd.DataFrame({
        'id': ids,
        'state': st,
        'dt': [f'2022-{w:02d}' for w in wk],
    })
    df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
    return df
Now:
np.random.seed(0) # reproducible example
df = gen(6, 3)
>>> df
id state dt
0 101 0 2022-01
1 101 0 2022-04
2 101 1 2022-04
3 101 1 2022-40
4 101 1 2022-45
5 101 1 2022-48
6 102 1 2022-10
7 102 1 2022-20
8 102 0 2022-22
9 102 1 2022-24
10 102 0 2022-37
11 102 1 2022-51
12 103 1 2022-02
13 103 0 2022-07
14 103 0 2022-13
15 103 1 2022-25
16 103 1 2022-25
17 103 1 2022-39
There are several interesting things here. First, 101 starts with a 0 state, whereas 102 and 103 both start with 1. Then, there are repeated ones for all ids. There are also repeated weeks: '2022-04' for 101 and '2022-25' for 103.
Nevertheless, the aggregation works just fine and produces:
>>> postproc(proc(preproc(df)))
start_0 end_0 start_1 end_1 start_2 end_2
id
101 2022-04 2022-52 -- -- -- --
102 2022-10 2022-21 2022-24 2022-36 2022-51 2022-52
103 2022-02 2022-06 2022-25 2022-52 -- --
Speed
np.random.seed(0)
n = 10
k = 100_000
df = gen(n, k)
%timeit preproc(df)
483 ms ± 4.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The processing itself takes less than 200ms for 1 million rows:
a = preproc(df)
%timeit proc(a)
185 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As for the post-processing (converting dates back to year-week strings), it is the slowest thing of all:
b = proc(a)
%timeit postproc(b)
1.63 s ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For a speed-up of that post-processing, we can rely on the fact that there are only a small number of distinct dates that are week-starts (52 per year, plus NaT for the blank cells):
def postproc2(df, nat='--'):
    dct = {
        t: f'{t:%Y-%W}' if t and t == t else nat
        for t in df.stack().reset_index(drop=True).drop_duplicates()
    }
    return df.applymap(dct.get)
%timeit postproc2(b)
542 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
We could of course do something similar for preproc().
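For instance, preproc() could convert each distinct week string once and map the results back, since there are only ~52 distinct values per year. A sketch (the helper name to_monday_cached is mine, not from the answer):

```python
import pandas as pd

def to_monday_cached(s: pd.Series) -> pd.Series:
    # convert each distinct '%Y-%W' string once, then map the results back
    dct = {v: pd.to_datetime(v + '-Mon', format='%Y-%W-%a')
           for v in s.unique()}
    return s.map(dct)

mondays = to_monday_cached(pd.Series(['2022-15', '2022-22', '2022-15']))
```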
If the csv file called one_zero.csv is this
id,state,dt
100,0,2022-15
100,1,2022-22
100,0,2022-26
101,0,2022-01
101,1,2022-41
102,1,2022-03
102,0,2022-12
102,1,2022-33
(I've added one additional item to the end.)
Then this code gives you what you want.
import pandas as pd
df = pd.read_csv("one_zero.csv")
result = {}
for id_, sub_df in df.groupby('id'):
    sub_df = sub_df.sort_values("dt")
    intervals = []
    start_dt = None
    for state, dt in zip(sub_df["state"], sub_df["dt"]):
        if state == 1:
            start_dt = dt
        if state == 0 and start_dt is not None:
            week = int(dt.split("-", maxsplit=1)[1])
            intervals.append((start_dt, f"2022-{week-1:02d}"))
            start_dt = None
    if start_dt is not None:
        intervals.append((start_dt, "2022-52"))
    result[id_] = intervals
At the end the result dictionary will contain this:
{
    100: [('2022-22', '2022-25')],
    101: [('2022-41', '2022-52')],
    102: [('2022-03', '2022-11'), ('2022-33', '2022-52')]
}
With this groupby and sort_values it works even if you shuffle the lines in the csv file. I've used an f-string to format the week number: the 02d there means the week will always be two digits, zero-padded for the first 9 weeks.
I guess you need less memory if you iterate on the rows like this, but for me the zip version is more familiar.
for _, row in sub_df.iterrows():
    state = row["state"]
    dt = row["dt"]
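If you do iterate, itertuples is usually noticeably faster than iterrows while staying just as readable, because it avoids building a Series per row. A sketch on a toy frame:

```python
import pandas as pd

sub_df = pd.DataFrame({"state": [1, 0], "dt": ["2022-22", "2022-26"]})
# itertuples yields namedtuples with attribute access by column name
pairs = [(row.state, row.dt) for row in sub_df.itertuples(index=False)]
```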
Another alternative:
res = (
    df.drop(columns="dt")
    .assign(week=df["dt"].str.split("-").str[1].astype("int"))
    .sort_values(["id", "week"])
    .assign(group=lambda df:
        df.groupby("id")["state"].diff().fillna(1).ne(0).cumsum()
    )
    .drop_duplicates(subset="group", keep="first")
    .loc[lambda df: df["state"].eq(1) | df["id"].eq(df["id"].shift())]
    .assign(no=lambda df: df.groupby("id")["state"].cumsum())
    .pivot(index=["id", "no"], columns="state", values="week")
    .rename(columns={0: "end", 1: "start"}).fillna("52").astype("int")
)[["start", "end"]]
First add new column week and sort along id and week. (The sorting might not be necessary if the data already come sorted.)
Then look id-group-wise for blocks of consecutive 0s or 1s and, based on the result (stored in the new column group), drop all the respective duplicates while keeping the first of each (the others aren't relevant according to the logic you've laid out).
Afterwards also remove the 0-states at the start of an id-group.
On the result identify id-group-wise the connected start-end groups (store in new group no).
Then .pivot the thing: pull id and no in the index and state into the columns.
Afterwards fill the NaN with 52 and do some casting, renaming, and sorting to get the result in better shape.
If you really want to move the various start-end-combinations into columns then replace below the pivot line as follows:
res = (
    ...
    .pivot(index=["id", "no"], columns="state", values="week")
    .rename(columns={0: 1, 1: 0}).fillna("52").astype("int")
    .unstack().sort_index(level=1, axis=1)
)
res.columns = [f"{'start' if s == 0 else 'end'}_{n}" for s, n in res.columns]
Results with the dataframe from #Pierre's answer:
state start end
id no
101 1 4 52
102 1 10 22
2 24 37
3 51 52
103 1 2 7
2 25 52
or
start_1 end_1 start_2 end_2 start_3 end_3
id
101 4.0 52.0 NaN NaN NaN NaN
102 10.0 22.0 24.0 37.0 51.0 52.0
103 2.0 7.0 25.0 52.0 NaN NaN

Faster `pandas.DataFrame.groupby()` when you have *lots* of groups

I would like to create lists from a column of a DataFrame grouping by an index. To perform this task I:
Group by an index
Aggregate all the items in each group into a list
However, once the number of groups gets large, this operation becomes very slow. Here is an illustrative example.
First, the data:
import random
import uuid
import numpy as np
import pandas as pd
np.random.seed(42)
random.seed(42)
def make_df(nr_level_one: int = 1000, max_nr_paths: int = 10):
    level_one_values = np.arange(nr_level_one)
    count_paths = np.random.randint(1, max_nr_paths+1, size=nr_level_one)
    idx_one = np.repeat(level_one_values, count_paths)
    nr_obs = np.sum(count_paths)
    idx_two = np.arange(nr_obs)
    idx = pd.MultiIndex.from_tuples(
        zip(idx_one, idx_two), names=['one', 'two']
    )
    paths = [str(uuid.UUID(int=random.getrandbits(128))).replace('-', '/')
             for _ in range(nr_obs)]
    return pd.DataFrame(paths, index=idx, columns=['path'])
df = make_df()
df
path
one two
0 0 bdd640fb/0667/1ad1/1c80/317fa3b1799d
1 23b8c1e9/3924/56de/3eb1/3b9046685257
2 bd9c66b3/ad3c/2d6d/1a3d/1fa7bc8960a9
3 972a8469/1641/9f82/8b9d/2434e465e150
4 17fc695a/07a0/ca6e/0822/e8f36c031199
... ... ...
999 5443 fe66c4fa/35ed/ff38/9197/107c89b702ed
5444 a560c775/58cf/d966/6f11/0436b3c28ec5
5445 49e785c7/cbd8/715e/ae98/b722cf97b016
5446 f7eefd84/b31c/8349/5799/2f42351b3e63
5447 be3de265/d471/8d86/8d36/645980f6c26c
5448 rows × 1 columns
I want to aggregate all paths at level one in the index into a list. This is the function which achieves what I want:
df.groupby(level='one').agg(list)
path
one
0 [bdd640fb/0667/1ad1/1c80/317fa3b1799d, 23b8c1e...
1 [6b65a6a4/8b81/48f6/b38a/088ca65ed389, 4737819...
2 [371ecd7b/27cd/8130/4722/9389571aa876, 1a2a73e...
3 [3139d32c/93cd/59bf/5c94/1cf0dc98d2c1, a9488d9...
4 [29a3b2e9/5d65/a441/d588/42dea2bc372f, ab9099a...
... ...
995 [5dc2fd9b/f1bd/b57b/b8dd/dfc2963ba31c, aa1c5dc...
996 [1e228ade/59c6/7a52/8f80/d1ef4615575c, d60b04a...
997 [f151ff15/a46e/e99e/ae4e/d89fd659d69f, bf5628b...
998 [17a85108/43b9/3b02/7089/8400b2932bfb, 5d15c12...
999 [85e620fe/a44e/b3e1/c5ba/136f594ed61a, afe1d84...
1000 rows × 1 columns
But as you scale up the number of groups, this gets very very slow (though it does seem to scale linearly with the number of groups)
%%timeit
make_df(100000).groupby(level='one').agg(list)
15 s ± 230 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is there a faster way of achieving the same task with pandas?
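One direction worth trying (a sketch of mine, not from the post): when groups are numerous and small, the per-group overhead of agg dominates, so building the lists with a plain dict over the raw arrays can be substantially faster:

```python
import pandas as pd

df = pd.DataFrame(
    {"path": ["a", "b", "c", "d"]},
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (1, 2), (1, 3)], names=["one", "two"]
    ),
)

# group values into a plain dict of lists, bypassing groupby.agg's
# per-group Series construction
out = {}
for key, val in zip(df.index.get_level_values("one"), df["path"].to_numpy()):
    out.setdefault(key, []).append(val)
result = pd.Series(out, name="path").rename_axis("one")
```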

How to fix index getting lost when using append within a nested for loop

I want to filter out rows from a dataframe which are below a threshold (the 5th percentile) in another dataframe.
I have tried doing a nested for loop and appending the output, but the index is lost and the runtime is really long (over two minutes).
I have a dataframe called fiveperc which is in the format (366,1):
tmin
1 11.32
2 11.0
3 11.41
4 11.885
5 12.155
....
366 13.08
and another dataframe called df2 in the format of (18910,1)
date tmin
1966-01-01 13.9
1966-01-02 17.1
1966-01-03 17.1
1966-01-04 16.2
.....
2018-12-31 17
Using:
anomaly = []
for yearday, perc in fiveperc.iterrows():
    for date, temp in df2.iterrows():
        if yearday == date.dayofyear:
            anomaly.append(temp - perc)
anomaly = pd.DataFrame(anomaly)
The first block of code above produces an output dataframe of shape (18910, 1):
index tmin
0 2.58
1 3.27
2 4.27
3 2.08
4 -3.52
....
18909 5.579
The problem here is that the datetime index from df2 is lost, resulting in a different arrangement, and the nested for loop takes over two minutes to run.
Extra code to run once the code above works:
anomaly[anomaly>0]=np.nan
anomaly[anomaly<0]= 1
anomaly.replace(0, np.nan, inplace=True)
Frequency = pd.DataFrame(final.groupby(lambda x: x.dayofyear)['anomaly'].agg(sum))
Is there a much better way to do this?
You can look up the day of the year on a column with the dt accessor:
In [11]: df
Out[11]:
date tmin
0 1966-01-01 13.9
1 1966-01-02 17.1
2 1966-01-03 17.1
3 1966-01-04 16.2
In [12]: df1
Out[12]:
tmin
1 11.320
2 11.000
3 11.410
4 11.885
5 12.155
In [13]: df1.loc[df.date.dt.dayofyear, "tmin"]
Out[13]:
1 11.320
2 11.000
3 11.410
4 11.885
Name: tmin, dtype: float64
In [14]: df["tmin"] - df1.loc[df.date.dt.dayofyear, "tmin"].values
Out[14]:
0 2.580
1 6.100
2 5.690
3 4.315
Name: tmin, dtype: float64
You can also do this with a groupby transform, but my suspicion is this will be slightly slower:
In [21]: df.groupby(df.date.dt.dayofyear)["tmin"].transform(lambda x: x - df1.loc[x.name, "tmin"])
Out[21]:
0 2.580
1 6.100
2 5.690
3 4.315
Name: tmin, dtype: float64
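A variant of the same idea that keeps df's own index aligned automatically, so the .values trick is not needed: map the day-of-year onto df1["tmin"] (same toy frames as in the answer):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["1966-01-01", "1966-01-02", "1966-01-03"]),
    "tmin": [13.9, 17.1, 17.1],
})
df1 = pd.DataFrame({"tmin": [11.32, 11.00, 11.41]}, index=[1, 2, 3])

# map looks up each day-of-year in df1["tmin"] while preserving df's index
anomaly = df["tmin"] - df["date"].dt.dayofyear.map(df1["tmin"])
```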

Fill NaN values from another DataFrame (with different shape)

I'm looking for a faster approach to improve the performance of my solution for the following problem: a certain DataFrame has two columns with a few NaN values in them. The challenge is to replace these NaNs with values from a secondary DataFrame.
Below I'll share the data and code used to implement my approach. Let me explain the scenario: merged_df is the original DataFrame with a few columns and some of them have rows with NaN values:
As you can see from the image above, columns day_of_week and holiday_flg are of particular interest. I would like to fill the NaN values of these columns by looking into a second DataFrame called date_info_df, which looks like this:
By using the values from column visit_date in merged_df it is possible to search the second DataFrame on calendar_date and find equivalent matches. This method allows to get the values for day_of_week and holiday_flg from the second DataFrame.
The end result for this exercise is a DataFrame that looks like this:
You'll notice the approach I'm using relies on apply() to execute a custom function on every row of merged_df:
For every row, search for NaN values in day_of_week and holiday_flg;
When a NaN is found on any or both of these columns, use the date available in that row's visit_date to find an equivalent match in the second DataFrame, specifically in the date_info_df['calendar_date'] column;
After a successful match, the value from date_info_df['day_of_week'] must be copied into merged_df['day_of_week'], and the value from date_info_df['holiday_flg'] must be copied into merged_df['holiday_flg'].
Here is a working source code:
import math
import pandas as pd
import numpy as np
from IPython.display import display
### Data for df
data = { 'air_store_id': [ 'air_a1', 'air_a2', 'air_a3', 'air_a4' ],
         'area_name': [ 'Tokyo', np.nan, np.nan, np.nan ],
         'genre_name': [ 'Japanese', np.nan, np.nan, np.nan ],
         'hpg_store_id': [ 'hpg_h1', np.nan, np.nan, np.nan ],
         'latitude': [ 1234, np.nan, np.nan, np.nan ],
         'longitude': [ 5678, np.nan, np.nan, np.nan ],
         'reserve_datetime': [ '2017-04-22 11:00:00', np.nan, np.nan, np.nan ],
         'reserve_visitors': [ 25, 35, 45, np.nan ],
         'visit_datetime': [ '2017-05-23 12:00:00', np.nan, np.nan, np.nan ],
         'visit_date': [ '2017-05-23', '2017-05-24', '2017-05-25', '2017-05-27' ],
         'day_of_week': [ 'Tuesday', 'Wednesday', np.nan, np.nan ],
         'holiday_flg': [ 0, np.nan, np.nan, np.nan ]
}
merged_df = pd.DataFrame(data)
display(merged_df)
### Data for date_info_df
data = { 'calendar_date': [ '2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26', '2017-05-27', '2017-05-28' ],
         'day_of_week': [ 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday' ],
         'holiday_flg': [ 0, 0, 0, 0, 1, 1 ]
}
date_info_df = pd.DataFrame(data)
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date'])
display(date_info_df)
# Fix the NaN values in day_of_week and holiday_flg by inspecting data from another dataframe (date_info_df)
def fix_weekday_and_holiday(row):
    weekday = row['day_of_week']
    holiday = row['holiday_flg']
    # search dataframe date_info_df for the appropriate value when weekday is NaN
    if (type(weekday) == float and math.isnan(weekday)):
        search_date = row['visit_date']
        #print(' --> weekday search_date=', search_date, 'type=', type(search_date))
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]
        weekday = date_info_df.at[idx, 'day_of_week']
        #print(' --> weekday search_date=', search_date, 'is', weekday)
        row['day_of_week'] = weekday
    # search dataframe date_info_df for the appropriate value when holiday is NaN
    if (type(holiday) == float and math.isnan(holiday)):
        search_date = row['visit_date']
        #print(' --> holiday search_date=', search_date, 'type=', type(search_date))
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]
        holiday = date_info_df.at[idx, 'holiday_flg']
        #print(' --> holiday search_date=', search_date, 'is', holiday)
        row['holiday_flg'] = int(holiday)
    return row
# send every row to fix_day_of_week
merged_df = merged_df.apply(fix_weekday_and_holiday, axis=1)
# Convert data from float to int (to remove decimal places)
merged_df['holiday_flg'] = merged_df['holiday_flg'].astype(int)
display(merged_df)
I did a few measurements so you can understand the struggle:
On a DataFrame with 6 rows, apply() takes 3.01 ms;
On a DataFrame with ~250000 rows, apply() takes 2min 51s.
On a DataFrame with ~1215000 rows, apply() takes 4min 2s.
How do I improve the performance of this task?
You can use an index to speed up the lookup, and combine_first() to fill the NaN values:
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
merged_df[cols] = merged_df[cols].combine_first(
    date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
print(merged_df[cols])
the result:
day_of_week holiday_flg
0 Tuesday 0.0
1 Wednesday 0.0
2 Thursday 0.0
3 Saturday 1.0
This is one solution. It should be efficient as there is no explicit merge or apply.
merged_df['visit_date'] = pd.to_datetime(merged_df['visit_date'])
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date'])
s = date_info_df.set_index('calendar_date')['day_of_week']
t = date_info_df.set_index('day_of_week')['holiday_flg']
merged_df['day_of_week'] = merged_df['day_of_week'].fillna(merged_df['visit_date'].map(s))
merged_df['holiday_flg'] = merged_df['holiday_flg'].fillna(merged_df['day_of_week'].map(t))
Result
air_store_id area_name day_of_week genre_name holiday_flg hpg_store_id \
0 air_a1 Tokyo Tuesday Japanese 0.0 hpg_h1
1 air_a2 NaN Wednesday NaN 0.0 NaN
2 air_a3 NaN Thursday NaN 0.0 NaN
3 air_a4 NaN Saturday NaN 1.0 NaN
latitude longitude reserve_datetime reserve_visitors visit_date \
0 1234.0 5678.0 2017-04-22 11:00:00 25.0 2017-05-23
1 NaN NaN NaN 35.0 2017-05-24
2 NaN NaN NaN 45.0 2017-05-25
3 NaN NaN NaN NaN 2017-05-27
visit_datetime
0 2017-05-23 12:00:00
1 NaN
2 NaN
3 NaN
Explanation
s is a pd.Series mapping calendar_date to day_of_week from date_info_df.
Use pd.Series.map, which takes pd.Series as an input, to update missing values, where possible.
Edit: one can also use merge to solve the problem; it is 10 times faster than the old approach. (You need to make sure "visit_date" and "calendar_date" are of the same format.)
# don't need to `set_index` for date_info_df but select columns needed.
merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]],
                left_on="visit_date",
                right_on="calendar_date",
                how="left")  # outer should also work
The desired result will be in the "day_of_week_y" and "holiday_flg_y" columns. In this approach and the map approach, we don't use the old "day_of_week" and "holiday_flg" at all; we just map the results from date_info_df onto merged_df.
merge can also do the job because date_info_df's entries are unique, so no duplicates will be created.
You can also try using pandas.Series.map. What it does is
Map values of Series using input correspondence (which can be a dict, Series, or function)
# set "calendar_date" as the index such that
# mapping["day_of_week"] and mapping["holiday_flg"] will be two series
# with date_info_df["calendar_date"] as their index.
mapping = date_info_df.set_index("calendar_date")
# this line is optional (depending on the layout of data.)
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
# do replacement here.
merged_df["day_of_week"] = merged_df.visit_date.map(mapping["day_of_week"])
merged_df["holiday_flg"] = merged_df.visit_date.map(mapping["holiday_flg"])
Note merged_df.visit_date originally was of string type. Thus, we use
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
to make it datetime.
Timings date_info_df dataset and merged_df provided by karlphillip.
date_info_df = pd.read_csv("full_date_info_data.csv")
merged_df = pd.read_csv("full_data.csv")
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
date_info_df.calendar_date = pd.to_datetime(date_info_df.calendar_date)
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
# merge method I proprose on the top.
%timeit merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]], left_on="visit_date", right_on="calendar_date", how="left")
511 ms ± 34.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method without assigning it back
%timeit merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
772 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method with assigning it back
%timeit merged_df[cols] = merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
258 ms ± 69.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One can see that HYRY's method runs 3 times faster if the result is assigned back to merged_df. This is why I thought HYRY's method was faster than mine at first glance. I suspect that is because of the nature of combine_first: its speed depends on how sparse merged_df is. While assigning the results back, the columns become full; therefore, when rerunning it, it is faster.
The performances of the merge and combine_first methods are nearly equivalent. Perhaps there can be circumstances that one is faster than another. It should be left to each user to do some tests on their datasets.
Another thing to note between the two methods is that the merge method assumes every date in merged_df is contained in date_info_df. If some dates are contained in merged_df but not date_info_df, it will return NaN, and NaN can override parts of merged_df that originally contained values! This is when the combine_first method should be preferred. See the discussion by MaxU in Pandas replace, multi column criteria.
