groupby in pandas with function that must keep state - python

I have the following dataframe
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1, 2, 4, 1, 2, 3, 4]})
I want a function that would output the following dataframe definition:
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1, 2, 4, 1, 2, 3, 4],
                   'c': ['A_0', 'A_0', 'A_1', 'B_0', 'B_0', 'B_0', 'B_0']})
The logic is that, for each value of 'a' (each group), I create a value 'c' that labels "continuous" runs of 'b' values: a new label starts whenever 'b' jumps by more than 1.
So far, my code is the following:
def detection(dataset):
    def detect(series, avalue):
        _id = 0
        start = True
        visits = []
        prev_ = None
        for h in series:
            if start:
                start = False
                prev_ = h
            else:
                if h - prev_ > 1:
                    _id += 1
                prev_ = h
            visits.append(f"{avalue}_{_id}")
        return visits

    res = []
    gb = dataset.groupby("a")
    for avalue in gb.groups:
        dd = gb.get_group(avalue)
        dd["VISIT_ID"] = detect(dd["b"], avalue)
        res.append(dd)
    return pd.concat(res, axis=0)
The good news: it works perfectly!
The bad news: it is extremely slow on a large dataset (7 million entries, 250k distinct 'a' values).
Is there something better to do?

You can compute the numeric part of column c using groupby, then concatenate the values:
df['c'] = df.groupby('a').b.apply(lambda x: (x.diff() > 1).cumsum())
df['c'] = df['a'] + '_' + df['c'].astype(str)
a b c
0 A 1 A_0
1 A 2 A_0
2 A 4 A_1
3 B 1 B_0
4 B 2 B_0
5 B 3 B_0
6 B 4 B_0
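If apply is still too slow at 7 million rows and 250k groups, here is a fully vectorized sketch of the same logic (no Python-level call per group); the names breaks and run_id are my own:
breaks = df.groupby('a')['b'].diff().gt(1)             # diff() is NaN at group starts, and NaN > 1 is False
run_id = breaks.astype(int).groupby(df['a']).cumsum()  # number the runs within each group
df['c'] = df['a'] + '_' + run_id.astype(str)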

Related

Python - Time series split indexes automatically for several categories

I am doing a time-series split and I want to create a list of train and test indexes for each split, by category.
df = pd.DataFrame({'date': [1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
                   'prod': ['a', 'b', 'a', 'b', 'c', 'a', 'b', 'c', 'b', 'd'],
                   'cat': ['c1', 'c2', 'c1', 'c2', 'c2', 'c1', 'c2', 'c2', 'c2', 'c1']})
df
date prod cat
1 a c1
1 b c2
2 a c1
2 b c2
2 c c2
3 a c1
3 b c2
3 c c2
4 b c2
4 d c1
time_split = [2, 3, 4, 5]
I was able to create a list of lists with the indexes for each train-test split, but I also want to create test indexes automatically for each category. Manually the code is the following:
train_index = []
test_index = []
cat1_index = []
cat2_index = []
for i in range(3):
    idx = df.loc[df['date'] < time_split[i]].index
    idx2 = df.loc[(df['date'] >= time_split[i]) & (df['date'] < time_split[i + 1])].index
    idx3 = df.loc[(df['date'] >= time_split[i]) & (df['date'] < time_split[i + 1])].query('cat == "c1"').index
    idx4 = df.loc[(df['date'] >= time_split[i]) & (df['date'] < time_split[i + 1])].query('cat == "c2"').index
    train_index.append(idx)
    test_index.append(idx2)
    cat1_index.append(idx3)
    cat2_index.append(idx4)
The problem is that I have several categories, and as the code is parametrized for several markets, the number of categories in each run is different.
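One way to generalize this (a minimal sketch of my own, assuming the dict layout suits your pipeline) is to collect the per-category test indexes in a dict keyed by whatever categories are actually present, so nothing is hard-coded:
train_index = []
test_index = []
cat_index = {cat: [] for cat in df['cat'].unique()}  # adapts to any number of categories
for i in range(3):
    train_mask = df['date'] < time_split[i]
    test_mask = (df['date'] >= time_split[i]) & (df['date'] < time_split[i + 1])
    train_index.append(df.index[train_mask])
    test_index.append(df.index[test_mask])
    for cat in cat_index:
        cat_index[cat].append(df.index[test_mask & (df['cat'] == cat)])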

Pandas group in series

Given
df = pd.DataFrame({'group': [1, 1, 2, 1, 1], 'value':['a','b','c','d','e']})
I need to treat a and b as one group, c as a second group, and d and e as a third group. How do I get the first element from every group?
pd.DataFrame({'group': [1, 2, 1], 'value': ['a', 'c', 'd']})
Try this:
df1 = df[df['group'].ne(df['group'].shift())]
Check this answer for more details
You haven't specified whether the group column determines which values belong to the same group. So I'm assuming it has no connection, and that you specify your groups in a groups list:
import numpy as np

groups = [['a', 'b'], ['c'], ['d', 'e']]
condlist = [df['value'].isin(group) for group in groups]
choicelist = list(range(len(groups)))
group_idx = np.select(condlist, choicelist)
df.groupby(group_idx).first()
Result:
group value
0 1 a
1 2 c
2 1 d
You can create your groups and map them to a reduced output:
df = pd.DataFrame({'group': [1, 1, 2, 1, 1], 'value':['a','b','c','d','e']})
groups = [['a', 'b'], ['c'], ['d', 'e']]
mappings = {k: i for i, gr in enumerate(groups) for k in gr}
print(
    df.groupby(df['value'].map(mappings)).first()
)
group value
value
0 1 a
1 2 c
2 1 d
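One practical difference between the two approaches (my observation, using a hypothetical df2 for illustration): with map, values absent from the mapping become NaN keys and are silently dropped by groupby, whereas np.select falls through to its default of 0:
df2 = pd.DataFrame({'group': [1], 'value': ['z']})
print(df2['value'].map(mappings))  # 'z' is not in mappings, so the key is NaN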

Cumcount with reset in Python

I have the following problem. I need to compute a cumcount, but I would like to reset the counter whenever the series is interrupted.
See example:
data = { 'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3]}
df = pd.DataFrame.from_dict(data)
I tried this, but it gives me the wrong output:
df["seq"] = df.groupby(["col_1"]).cumcount()
What I want is:
data = { 'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3], 'seq': [0, 1, 0, 1, 0]}
How can I do it, please?
Try:
df["seq"] = df.groupby((df["col_1"] != df["col_1"].shift()).cumsum())["col_1"].cumcount()
print(df)
Output
col_1 col_2 seq
0 a 3 0
1 a 2 1
2 b 1 0
3 b 0 1
4 a -3 0
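To see why this works, the inner expression builds a run id that increments every time col_1 changes, so cumcount restarts on each run; a quick look at the intermediate series:
run_id = (df["col_1"] != df["col_1"].shift()).cumsum()
print(run_id.tolist())  # [1, 1, 2, 2, 3] -- one id per run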
Note that since you are interested in runs (as in run-length encoding), itertools.groupby might be better suited for this task; consider the following example:
import itertools

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'a']})
df['seq'] = [i for k, g in itertools.groupby(df['col1']) for i in range(len(list(g)))]
print(df)
output
col1 seq
0 a 0
1 a 1
2 b 0
3 b 1
4 a 0

Pandas sample by filter criteria

I have a data frame like the one below
d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)
df
var1 var2 class
0 1 5 a
1 2 6 a
2 3 7 c
3 4 8 b
I would like to be able to change the proportion of the class column. For example, I would like to randomly down-sample the a class by 50% but keep the number of rows for the other classes the same. The result would be:
df
var1 var2 class
0 1 5 a
1 3 7 c
2 4 8 b
How would this be done?
My approach was to first split the DataFrame into df_selection and df_remaining.
I then reduced df_selection by REMOVE_PERCENTAGE and concatenated the resulting DataFrame with df_remaining again.
import numpy as np
import pandas as pd
d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)
REMOVE_PERCENTAGE = 0.5 # between 0 and 1
df = df.set_index(['class'])
df_selection = df.loc['a'] \
    .reset_index()
df_remaining = df.drop('a') \
    .reset_index()
rows_to_remove = int(REMOVE_PERCENTAGE * len(df_selection.index))
drop_indices = np.random.choice(df_selection.index, rows_to_remove, replace=False)
df_selection_reduced = df_selection.drop(drop_indices)
df_result = pd.concat([df_selection_reduced, df_remaining]) \
    .reset_index(drop=True)
print(df_result)
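As a shorter alternative (my own sketch, not part of the answer above), DataFrame.sample can down-sample the target class directly on the original df; frac is the fraction to keep:
mask = df['class'] == 'a'  # assumes df still has its original index
df_result = pd.concat([df[mask].sample(frac=0.5), df[~mask]]).sort_index()
print(df_result)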

Replicating rows in a pandas dataframe by column value and adding a new column with repetition index

My question is similar to one asked here. I have a dataframe and I want to repeat each row of the dataframe k number of times. Along with it, I also want to create a column with values 0 to k-1. So
import pandas as pd
df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n' : [1, 2, 3],
    'v' : [10, 13, 8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'n' : [1, 2, 2, 3, 3, 3],
    'v' : [10, 13, 13, 8, 8, 8],
    'repeat_id': [0, 0, 1, 0, 1, 2]
})
The command below does half of the job. I am looking for a pandas way of adding the repeat_id column.
df.loc[df.index.repeat(df.n)]
Use GroupBy.cumcount, and copy to avoid SettingWithCopyWarning:
If you modify values in df1 later, you will find that the modifications do not propagate back to the original data (df), and that pandas warns you about it.
df1 = df.loc[df.index.repeat(df.n)].copy()
df1['repeat_id'] = df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print(df1)
id n v repeat_id
0 A 1 10 0
1 B 2 13 0
2 B 2 13 1
3 C 3 8 0
4 C 3 8 1
5 C 3 8 2
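A NumPy-based alternative for the counter (an assumption of mine, not from the answer above) builds repeat_id directly, without the groupby:
import numpy as np

df1 = df.loc[df.index.repeat(df.n)].reset_index(drop=True)
df1['repeat_id'] = np.concatenate([np.arange(n) for n in df['n']])  # 0..n-1 for each original row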
