Cumcount with reset in Python

I have the following problem: I need to compute a cumcount, but I would like to reset the counter whenever the series is interrupted.
See example:
import pandas as pd
data = {'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3]}
df = pd.DataFrame.from_dict(data)
I tried this, but it gives me the wrong output:
df["seq"] = df.groupby(["col_1"]).cumcount()
What I want is:
data = { 'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3], 'seq': [0, 1, 0, 1, 0]}
How can I do it, please?

Try:
df["seq"] = df.groupby((df["col_1"] != df["col_1"].shift()).cumsum())["col_1"].cumcount()
print(df)
Output
col_1 col_2 seq
0 a 3 0
1 a 2 1
2 b 1 0
3 b 0 1
4 a -3 0
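To see why this works, here is the intermediate run ID that the groupby key produces (a quick sketch against the example data):
import pandas as pd
data = {'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3]}
df = pd.DataFrame.from_dict(data)
# Comparing each value with its predecessor is True at every change point;
# the cumulative sum of those flags gives one ID per uninterrupted run
run_ids = (df["col_1"] != df["col_1"].shift()).cumsum()
print(run_ids.tolist())  # [1, 1, 2, 2, 3]
Grouping by these run IDs instead of col_1 itself is what makes cumcount restart at every interruption.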

Note that since you are interested in runs (as in run-length encoding), itertools.groupby might be better suited for this task; consider the following example:
import itertools
import pandas as pd
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'a']})
# for each run of equal values, emit 0 .. len(run)-1
df['seq'] = [i for k, g in itertools.groupby(df['col1']) for i in range(len(list(g)))]
print(df)
Output
col1 seq
0 a 0
1 a 1
2 b 0
3 b 1
4 a 0

Related

Select mode of each group and assign back to the original dataframe on a condition

I have a dataframe like this:
df = pd.DataFrame({
    'id_vector': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E', 'E', 'E'],
    'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
})
And I want to create a new column in which each value represents the most frequent value in the svm_prediction column within that id_vector group. For example, if for id_vector 'A' the most frequent value in svm_prediction is 1, the new column will have all values 1 for that group, and likewise for 0. If both values are equally frequent, put 2 in each row of the group.
The result will be something like this, where column new_label is added to df.
result = pd.DataFrame({
    'id_vector': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E', 'E', 'E'],
    'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    'new_label': [0, 0, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2]
})
I tried:
df.groupby('id_vector')['svm_prediction'].mode()
but an error is raised because the groupby object has no attribute mode.
Also:
df['average'] = df[['svm_prediction']].mode(axis=0)
gives NaN values for every row except the first one.
Since the output you want has the same length as the original dataframe, groupby.transform seems more appropriate. You can pass a lambda into it that uses a conditional expression to assign the mode if it is unique, and the number of modes otherwise (which is 2 when the two values tie, matching the question's convention).
mode_f = lambda x: m[0] if len(m:=x.mode()) == 1 else len(m)
df['new_label'] = df.groupby('id_vector')['svm_prediction'].transform(mode_f)
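The walrus operator requires Python 3.8+; a named-function equivalent (same logic, just more portable) could look like this:
def mode_or_count(x):
    m = x.mode()
    # unique mode -> the mode itself; tie -> number of tied modes (2 here)
    return m[0] if len(m) == 1 else len(m)

df['new_label'] = df.groupby('id_vector')['svm_prediction'].transform(mode_or_count)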
We can use np.where together with np.mean: the group mean is exactly 0.5 when zeros and ones are equally frequent, above 0.5 when ones dominate, and below 0.5 when zeros dominate.
import numpy as np
df['new_label'] = df.groupby('id_vector').svm_prediction.transform(np.mean)
df['new_label'] = np.where(df.new_label > 0.5, 1, np.where(df.new_label < 0.5, 0, 2))
Output
id_vector svm_prediction new_label
0 A 0 0
1 A 0 0
2 A 1 0
3 A 1 0
4 A 0 0
5 B 0 2
6 B 1 2
7 B 0 2
8 B 1 2
9 C 1 1
10 C 0 1
11 C 1 1
12 C 1 1
13 D 1 0
14 D 0 0
15 D 0 0
16 D 0 0
17 E 0 2
18 E 1 2
19 E 1 2
20 E 0 2
Well, I think you are looking for this (note that it returns one aggregated value per group, not a column aligned with df):
result_final = df.groupby('id_vector')['svm_prediction'].agg(pd.Series.mode)
print(result_final)
'''
id_vector
A 0
B [0, 1] # both values occur equally often
C 1
D 0
E [0, 1]
Name: svm_prediction, dtype: object
'''
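To turn that per-group aggregate into the aligned new_label column the question asks for, one option (a sketch, with the tie-to-2 encoding taken from the question) is to collapse ties and map back:
import numpy as np
import pandas as pd
modes = df.groupby('id_vector')['svm_prediction'].agg(pd.Series.mode)
# pd.Series.mode yields a scalar for a unique mode and an array for ties;
# encode ties as 2, per the question's convention
labels = modes.apply(lambda m: 2 if isinstance(m, np.ndarray) else m)
df['new_label'] = df['id_vector'].map(labels)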

Can we create a matrix with neighbouring columns?

Suppose that I have the matrix:
df=
a b c d e
z 1 1 1 0 1
y 0 0 0 1 0
x 1 0 1 0 0
w 0 1 0 0 1
v 0 0 0 1 0
Now we want to find all neighbouring columns and put that in a matrix. For my example this gives the result:
result =
a b c d e
a 1 1 1 0 1
b 1 1 1 0 1
c 1 1 1 0 1
d 0 0 0 0 0
e 1 1 1 0 0
The logic is as follows: since columns a and b both have a 1 in row z, there should be a 1 in cells (a, b) and (b, a). One mutual row is sufficient.
Now I am doing it with a for-loop but it is too slow.
df = pd.DataFrame([[1, 1, 1, 0, 1],
                   [0, 0, 0, 1, 0],
                   [1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1],
                   [0, 0, 0, 1, 0]],
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['a', 'b', 'c', 'd', 'e'],
                  )
result = pd.DataFrame(False, index=df.columns, columns=df.columns)
for row in result.index:
    # astype(bool) so the 0/1 column acts as a boolean mask, not positions
    my_rows = df.index[df.loc[:, row].astype(bool)]
    my_neighbours = df.columns[df.loc[my_rows, :].any(axis=0)]
    result.loc[row, my_neighbours] = True
Does anyone have an alternative?
Perhaps you want something like this:
import pandas as pd
df = pd.DataFrame([[1, 1, 1, 0, 1],
                   [0, 0, 0, 1, 0],
                   [1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1],
                   [0, 0, 0, 1, 0]],
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['a', 'b', 'c', 'd', 'e'],
                  )
res = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e'])
for col in df.columns:
    res[col] = df.apply(lambda row: int(row[col] == 1 and row[row.name] == 1), axis=1)
But as mentioned in comments, notice that you have an error in your example output.
Also I had to guess what you want in some default cases.
res
Out[176]:
a b c d e
a 1 1 1 0 1
b 0 0 0 0 0
c 1 0 1 0 0
d 0 0 0 0 0
e 0 0 0 0 0
It is also possible to achieve the result with the one-liner below:
res2 = df.apply( lambda row: pd.Series([int(row[col]==1 and row[row.name]==1) for col in row.index], index=row.index), axis=1, result_type='expand' )
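A fully vectorized alternative (my own sketch, not taken from the answers above) is a matrix product: (df.T @ df) counts, for each pair of columns, the rows where both have a 1, so clipping to 0/1 gives the adjacency matrix. Note the diagonal comes out as all 1s, unlike the question's (admittedly inconsistent) expected output.
import pandas as pd
df = pd.DataFrame([[1, 1, 1, 0, 1],
                   [0, 0, 0, 1, 0],
                   [1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1],
                   [0, 0, 0, 1, 0]],
                  index=['z', 'y', 'x', 'w', 'v'],
                  columns=['a', 'b', 'c', 'd', 'e'])
# (df.T @ df)[i, j] = number of rows where columns i and j are both 1
adj = (df.T @ df > 0).astype(int)
print(adj)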

Pandas sample by filter criteria

I have a data frame like the one below
d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)
df
var1 var2 class
0 1 5 a
1 2 6 a
2 3 7 c
3 4 8 b
I would like to be able to change the proportion of the class column. For example, I would like to down-sample the a class at random by 50% but keep the number of rows for the other classes the same. The result would be:
df
var1 var2 class
0 1 5 a
1 3 7 c
2 4 8 b
How would this be done?
I used the approach of splitting the DataFrame into df_selection and df_remaining first. I then reduced df_selection by REMOVE_PERCENTAGE and merged the resulting DataFrame with df_remaining again.
import numpy as np
import pandas as pd
d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)
REMOVE_PERCENTAGE = 0.5 # between 0 and 1
df = df.set_index(['class'])
df_selection = df.loc['a'].reset_index()
df_remaining = df.drop('a').reset_index()
rows_to_remove = int(REMOVE_PERCENTAGE * len(df_selection.index))
drop_indices = np.random.choice(df_selection.index, rows_to_remove, replace=False)
df_selection_reduced = df_selection.drop(drop_indices)
df_result = pd.concat([df_selection_reduced, df_remaining]).reset_index(drop=True)
print(df_result)
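A shorter route (a sketch of an alternative, using DataFrame.sample on a boolean mask rather than the split/merge above):
import pandas as pd
d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)
mask = df['class'] == 'a'
# keep a random 50% of the 'a' rows and all rows of the other classes
df_result = pd.concat([df[mask].sample(frac=0.5), df[~mask]]).sort_index()
print(df_result)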

groupby in pandas with function that must keep state

I have the following dataframe
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1, 2, 4, 1, 2, 3, 4]})
I want a function that would output the following dataframe definition:
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1, 2, 4, 1, 2, 3, 4],
                   'c': ['A_0', 'A_0', 'A_1', 'B_0', 'B_0', 'B_0', 'B_0']})
The logic is that, for each value of 'a' (each group), I create a value 'c' that labels each "continuous" run of 'b' values: consecutive values belong to the same run, and a gap starts a new one.
So far, my code is the following:
def detection(dataset):
    def detect(series, avalue):
        _id = 0
        start = True
        visits = []
        prev_ = None
        for h in series:
            if start:
                start = False
                prev_ = h
            else:
                if h - prev_ > 1:
                    _id += 1
                prev_ = h
            visits.append(f"{avalue}_{_id}")
        return visits
    res = []
    gb = dataset.groupby("a")
    for avalue in gb.groups:
        dd = gb.get_group(avalue)
        dd["VISIT_ID"] = detect(dd["b"], avalue)
        res.append(dd)
    return pd.concat(res, axis=0)
The good: it works perfectly!
The bad: it is extremely slow on a large dataset (7 million entries, 250k distinct 'a' values).
Is there something better to do?
You can build the numeric part of column c using groupby, then concatenate the group key with it:
df['c'] = df.groupby('a').b.apply(lambda x: (x.diff() > 1).cumsum())
df['c'] = df['a'] + '_' + df['c'].astype(str)
a b c
0 A 1 A_0
1 A 2 A_0
2 A 4 A_1
3 B 1 B_0
4 B 2 B_0
5 B 3 B_0
6 B 4 B_0
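To see the mechanics, the inner expression flags gaps within each group before cumsum turns them into run counters (a quick check against the example data):
import pandas as pd
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1, 2, 4, 1, 2, 3, 4]})
# within each 'a' group, a jump in 'b' greater than 1 marks a new run;
# cumsum over these flags yields 0, 0, 1, 0, 0, 0, 0
gaps = df.groupby('a')['b'].diff() > 1
print(gaps.tolist())  # [False, False, True, False, False, False, False]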

Replicating rows in pandas dataframe by column value and add a new column with repetition index

My question is similar to one asked here. I have a dataframe and I want to repeat each row of the dataframe k times, with k given by the n column. Along with it, I also want to create a column with values 0 to k-1. So
import pandas as pd
df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n' : [ 1, 2, 3],
    'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'n' : [ 1, 2, 2, 3, 3, 3],
    'v' : [ 10, 13, 13, 8, 8, 8],
    'repeat_id': [0, 0, 1, 0, 1, 2]
})
The command below does half of the job. I am looking for a pandas way of adding the repeat_id column.
df.loc[df.index.repeat(df.n)]
Use GroupBy.cumcount, and copy to avoid SettingWithCopyWarning:
Without the copy, if you modify values in df1 later, the modifications will not propagate back to the original data (df), and pandas will warn you.
df1 = df.loc[df.index.repeat(df.n)].copy()
df1['repeat_id'] = df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print (df1)
id n v repeat_id
0 A 1 10 0
1 B 2 13 0
2 B 2 13 1
3 C 3 8 0
4 C 3 8 1
5 C 3 8 2
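An equivalent way to build repeat_id without the extra groupby pass (a sketch, assuming the n column as given in the question) is to concatenate explicit 0..n-1 ranges:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'id': ['A', 'B', 'C'],
                        'n': [1, 2, 3],
                        'v': [10, 13, 8]})
df1 = df.loc[df.index.repeat(df.n)].reset_index(drop=True)
# one 0..n-1 counter per original row, concatenated in order
df1['repeat_id'] = np.concatenate([np.arange(n) for n in df.n])
print(df1)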
