Pandas group in series - python

Given
df = pd.DataFrame({'group': [1, 1, 2, 1, 1], 'value':['a','b','c','d','e']})
I need to treat a and b as one group, c as a second group, and d and e as a third group. How do I get the first element from every group? The expected result:
pd.DataFrame({'group': [1, 2, 1], 'value': ['a', 'c', 'd']})

Try this:
df1 = df[df['group'].ne(df['group'].shift())]
This keeps only the rows where group differs from the value in the previous row, i.e. the first row of each consecutive run.
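To illustrate on the example frame, here is the boolean mask from the expression above, printed for inspection, and the resulting frame:
print(df['group'].ne(df['group'].shift()))
0     True
1    False
2     True
3     True
4    False
Name: group, dtype: bool
print(df1)
   group value
0      1     a
2      2     c
3      1     d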

You haven't specified whether the group column determines which values belong together. So I'm assuming it has no connection, and that you specify your groups in the groups list:
import numpy as np

groups = [['a', 'b'], ['c'], ['d', 'e']]
# One boolean mask per group, and the group's index as the corresponding choice
condlist = [df['value'].isin(group) for group in groups]
choicelist = list(range(len(groups)))
group_idx = np.select(condlist, choicelist)
df.groupby(group_idx).first()
Result:
  group value
0     1     a
1     2     c
2     1     d
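For reference, the intermediate grouper produced by np.select maps each row to the index of its group:
print(group_idx)
[0 0 1 2 2]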

You can create your groups, map each value to its group index, and reduce:
df = pd.DataFrame({'group': [1, 1, 2, 1, 1], 'value':['a','b','c','d','e']})
groups = [['a', 'b'], ['c'], ['d', 'e']]
mappings = {k: i for i, gr in enumerate(groups) for k in gr}
print(
    df.groupby(df['value'].map(mappings)).first()
)
      group value
value
0         1     a
1         2     c
2         1     d
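For reference, the mapping built by the dict comprehension sends each value to its group index:
print(mappings)
{'a': 0, 'b': 0, 'c': 1, 'd': 2, 'e': 2}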

Related

How to delete consecutive duplicate elements from a list, and reflect that in another list in the same pandas data frame?

I have a data frame like this:
A   B              C
32  [1,2,2,3,4]    [a,b,c,d,e]
35  [5,5,7,7,7,8]  [q,w,e,r,t,y]
There are repeating elements in the B column; the second, third, and later occurrences need to be removed, and the elements at the same positions in the C column must be deleted as well.
The final result should be:
A   B          C
32  [1,2,3,4]  [a,b,d,e]
35  [5,7,8]    [q,e,y]
Everything is in a pandas data frame.
Use DataFrame.explode on both columns, then convert df.index to a column df['index'], remove duplicates per index and B with DataFrame.drop_duplicates, and finally aggregate by index and A into lists:
import pandas as pd

data = {'A': [32, 35],
        'B': [[1, 2, 2, 3, 4], [5, 5, 7, 7, 7, 8]],
        'C': [['a', 'b', 'c', 'd', 'e'], ['q', 'w', 'e', 'r', 't', 'y']]}
df = pd.DataFrame(data)

df1 = (df.explode(['B','C'])
         .reset_index()
         .drop_duplicates(['index','B'])
         .groupby(['index', 'A'])
         .agg(list)
         .reset_index(level=1)
         .rename_axis(None))
print (df1)
    A             B             C
0  32  [1, 2, 3, 4]  [a, b, d, e]
1  35     [5, 7, 8]     [q, e, y]
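To see why the drop_duplicates step works, here is the intermediate exploded frame (illustrative output):
print(df.explode(['B','C']).reset_index())
    index   A  B  C
0       0  32  1  a
1       0  32  2  b
2       0  32  2  c
3       0  32  3  d
4       0  32  4  e
5       1  35  5  q
6       1  35  5  w
7       1  35  7  e
8       1  35  7  r
9       1  35  7  t
10      1  35  8  y
Dropping duplicates on ['index', 'B'] then removes rows 2, 6, 8 and 9, which are exactly the later occurrences within each original row.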
If values in the A column are always unique:
df2 = (df.explode(['B','C'])
         .drop_duplicates(['A','B'])
         .groupby(['A'])
         .agg(list)
         .reset_index())
print (df2)
EDIT: Solution for older pandas versions (DataFrame.explode accepts a list of columns only in pandas 1.3+):
df1 = (df[['A']].join(df[['B','C']].apply(lambda x: x.explode()))
                .reset_index()
                .drop_duplicates(['index','B'])
                .groupby(['index', 'A'])
                .agg(list)
                .reset_index(level=1)
                .rename_axis(None))
print (df1)
Not very pandas-idiomatic, but trying to optimize the memory used:
import numpy as np

# For each row, compute the indices of the first occurrence of each value in column B.
# np.unique returns the indices ordered by value, so sort them to preserve the
# original order even when B is not already sorted.
df['_indices_to_keep'] = df.B.map(
    lambda x: np.sort(np.unique(x, return_index=True)[1]) if x else None,
)
# Keep only the values in columns B and C at the previously found positions
for c in ('B', 'C'):
    df[c] = df.apply(lambda r: [r[c][i] for i in r['_indices_to_keep']] if r[c] else None, axis=1)
# Drop the temporary column with the indices
df.drop(columns=['_indices_to_keep'], inplace=True)
The result is:
    A             B             C
0  32  [1, 2, 3, 4]  [a, b, d, e]
1  35     [5, 7, 8]     [q, e, y]
Short solution based on the np.unique and np.take routines:
def drop_dups(r):
    # np.unique returns the sorted unique values and the index of each first occurrence;
    # note this reorders by value if B is not already sorted (it is in the example data)
    vals, idx_start = np.unique(r['B'], return_index=True)
    r['B'], r['C'] = vals, np.take(r['C'], idx_start)
    return r

df = pd.DataFrame(data)
df[['B', 'C']] = df[['B', 'C']].apply(drop_dups, axis=1)
print(df)
The output:
    A             B             C
0  32  [1, 2, 3, 4]  [a, b, d, e]
1  35     [5, 7, 8]     [q, e, y]

Cumcount with reset in Python

I have the following problem: I need to compute a cumcount, but I would like to reset the counter whenever the series is interrupted.
See example:
data = { 'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3]}
df = pd.DataFrame.from_dict(data)
I tried this, but it gives me the wrong output:
df["seq"] = df.groupby(["col_1"]).cumcount()
What I want is:
data = { 'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3], 'seq': [0, 1, 0, 1, 0]}
How can I do it, please?
Try:
df["seq"] = df.groupby((df["col_1"] != df["col_1"].shift()).cumsum())["col_1"].cumcount()
print(df)
Output:
  col_1  col_2  seq
0     a      3    0
1     a      2    1
2     b      1    0
3     b      0    1
4     a     -3    0
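The key is the grouper: (df["col_1"] != df["col_1"].shift()).cumsum() assigns a new id every time the run of equal values changes. Printed for illustration:
print((df["col_1"] != df["col_1"].shift()).cumsum())
0    1
1    1
2    2
3    2
4    3
Name: col_1, dtype: int64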
Note that as you are interested in runs (like in run-length encoding), itertools.groupby might be better suited for this task. Consider the following example:
import itertools

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'a']})
# Group consecutive equal values, then number the items within each run
df['seq'] = [i for k, g in itertools.groupby(df['col1']) for i in range(len(list(g)))]
print(df)
Output:
  col1  seq
0    a    0
1    a    1
2    b    0
3    b    1
4    a    0

groupby in pandas with function that must keep state

I have the following dataframe
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1, 2, 4, 1, 2, 3, 4]})
I want a function that would output the following dataframe definition:
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1, 2, 4, 1, 2, 3, 4],
                   'c': ['A_0', 'A_0', 'A_1', 'B_0', 'B_0', 'B_0', 'B_0']})
The logic is that, for each value of 'a' (each group), I create a value 'c' identifying "continuous" runs of 'b' values, i.e. runs where consecutive 'b' values differ by at most 1.
So far, my code is the following:
def detection(dataset):
    def detect(series, avalue):
        _id = 0
        start = True
        visits = []
        prev_ = None
        for h in series:
            if start:
                start = False
                prev_ = h
            else:
                if h - prev_ > 1:
                    _id += 1
                prev_ = h
            visits.append(f"{avalue}_{_id}")
        return visits

    res = []
    gb = dataset.groupby("a")
    for avalue in gb.groups:
        dd = gb.get_group(avalue)
        dd["VISIT_ID"] = detect(dd["b"], avalue)
        res.append(dd)
    return pd.concat(res, axis=0)
The good: it works perfectly! The bad: it is extremely slow on a large dataset (7 million entries, 250k distinct 'a' values).
Is there something better to do?
You can build the numeric part of column c with groupby, then concatenate it with the values of a:
df['c'] = df.groupby('a').b.apply(lambda x: (x.diff() > 1).cumsum())
df['c'] = df['a'] + '_' + df['c'].astype(str)
   a  b    c
0  A  1  A_0
1  A  2  A_0
2  A  4  A_1
3  B  1  B_0
4  B  2  B_0
5  B  3  B_0
6  B  4  B_0
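If groupby.apply is still too slow at this scale, here is a sketch of an apply-free variant (my own rewrite of the same idea, not from the answer above):
# Mark jumps > 1 within each group, then cumulative-sum the marks per group
jumps = df.groupby('a')['b'].diff().gt(1).astype(int)
df['c'] = df['a'] + '_' + jumps.groupby(df['a']).cumsum().astype(str)
This computes the same run ids without Python-level loops over the groups.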

Pandas: renaming columns that have the same name

I have a dataframe with duplicated column names a, b and b. I would like to rename the second b to c.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "b1": [7, 8, 9]})
df.rename(index=str, columns={'b1' : 'b'})
Trying this with no success..
df.rename(index=str, columns={2 : "c"})
Try:
>>> df.columns = ['a', 'b', 'c']
>>> df
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9
You can always just manually rename all the columns.
df.columns = ['a', 'b', 'c']
If your columns are ordered and you want lettered columns, don't type names out manually. This is prone to error.
You can use string.ascii_lowercase, assuming you have a maximum of 26 columns:
from string import ascii_lowercase
df = pd.DataFrame(columns=['a', 'b', 'b1'])
df.columns = list(ascii_lowercase[:len(df.columns)])
print(df.columns)
Index(['a', 'b', 'c'], dtype='object')
These solutions don't take into account the problem of having many columns. Here is a solution that, independent of the number of columns, renames columns sharing the same name (here 'name') to unique names:
df.columns = ['name'+str(col[0]) if col[1] == 'name' else col[1] for col in enumerate(df.columns)]
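A more generic sketch that suffixes every repeated name, whatever it is (dedup_columns is my own helper, not from the answers above):
from collections import Counter

def dedup_columns(cols):
    # Append a running suffix to the second and later occurrences of each name
    seen = Counter()
    out = []
    for name in cols:
        out.append(name if seen[name] == 0 else f"{name}_{seen[name]}")
        seen[name] += 1
    return out

df.columns = dedup_columns(df.columns)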

How to add rows for all missing values of one multi-index's level?

Suppose that I have the following dataframe df, indexed by a 3-level multi-index:
In [52]: df
Out[52]:
          C
L0 L1 L2
0  w  P   1
   y  P   2
      R   3
1  x  Q   4
      R   5
   z  S   6
Code to create the DataFrame:
idx = pd.MultiIndex(levels=[[0, 1], ['w', 'x', 'y', 'z'], ['P', 'Q', 'R', 'S']],
                    codes=[[0, 0, 0, 1, 1, 1], [0, 2, 2, 1, 1, 3], [0, 0, 2, 1, 2, 3]],
                    names=['L0', 'L1', 'L2'])  # 'codes' was called 'labels' before pandas 0.24
df = pd.DataFrame({'C': [1, 2, 3, 4, 5, 6]}, index=idx)
The possible values for the L2 level are 'P', 'Q', 'R', and 'S', but some of these values are missing for particular combinations of values for the remaining levels. For example, the combination (L0=0, L1='w', L2='Q') is not present in df.
I would like to add enough rows to df so that, for each combination of values for the levels other than L2, there is exactly one row for each of the L2 level's possible values. For the added rows, the value of the C column should be 0.
In other words, I want to expand df so that it looks like this:
          C
L0 L1 L2
0  w  P   1
      Q   0
      R   0
      S   0
   y  P   2
      Q   0
      R   3
      S   0
1  x  P   0
      Q   4
      R   5
      S   0
   z  P   0
      Q   0
      R   0
      S   6
REQUIREMENTS:
the operation should leave the types of the columns unchanged;
the operation should add the smallest number of rows needed to complete only the specified level (L2)
Is there a simple way to perform this expansion?
Assuming the L2 level initially contains all the possible values you need, you can use the unstack/stack trick:
df.unstack('L2', fill_value=0).stack(level=1)
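For illustration, on the example frame this fills the missing combinations with 0 and keeps C an integer column (fill_value=0 means no NaNs are introduced, so no cast to float happens):
res = df.unstack('L2', fill_value=0).stack(level=1)
print(res.head(8))
          C
L0 L1 L2
0  w  P   1
      Q   0
      R   0
      S   0
   y  P   2
      Q   0
      R   3
      S   0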
