import pandas as pd

df = pd.DataFrame({"C1": ['USA','USA','USA','USA','USA','JAPAN','JAPAN','JAPAN','USA','USA'],
                   'C2': ['A','B','A','A','A','A','A','A','B','A']})
C1 C2
0 USA A
1 USA B
2 USA A
3 USA A
4 USA A
5 JAPAN A
6 JAPAN A
7 JAPAN A
8 USA B
9 USA A
This is a watered-down version of my problem, kept simple on purpose. My objective is to iterate over each sub-group of the dataframe where C2 contains a B. If a B is in C2, I look at C1 and need the entire group. So in this example, I see USA, and the group starts at index 0 and finishes at index 4. Another one spans indexes 8 to 9.
So my desired result would be the indexes such that:
[[0,4],[8,9]]
I tried to use groupby, but it wouldn't work because it groups all the USA rows together.
my_index = list(df[df['C2']=='B'].index)
my_index
would give [1, 8], but how do I get the start/finish of each group?
Here is one approach: first mask the dataframe on the groups that contain at least one B, then grab the matching indexes and split them into consecutive runs, taking the first and last index of each run:
import numpy as np

# label consecutive runs of identical C1 values
s = df['C1'].ne(df['C1'].shift()).cumsum()
# keep only the indexes belonging to runs that contain a B
i = df.index[s.isin(s[df['C2'].eq("B")])]
# split the kept indexes wherever they stop being consecutive
p = np.where(np.diff(i) > 1)[0] + 1
split_ = np.split(i, p)
out = [[grp[0], grp[-1]] for grp in split_]
print(out)
[[0, 4], [8, 9]]
Solution
b = df['C1'].ne(df['C1'].shift()).cumsum()
m = b.isin(b[df['C2'].eq('B')])
i = m.index[m].to_series().groupby(b).agg(['first', 'last']).values.squeeze()
Explanations
Shift column C1 and compare the shifted column with the non-shifted one to create a boolean mask, then take a cumulative sum of this mask to identify the blocks of rows where the value in column C1 stays the same:
>>> b
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
Name: C1, dtype: int64
Create a boolean mask m to identify the blocks of rows that contain at least one B:
>>> m
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 False
8 True
9 True
Name: C1, dtype: bool
Filter the index by boolean masking with mask m, then group the filtered index by the block labels b and aggregate with first and last to get the boundary indices:
>>> i
array([[0, 4],
[8, 9]])
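Putting the pieces together, here is a minimal runnable sketch of the solution above (the data is rebuilt inline from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "C1": ['USA'] * 5 + ['JAPAN'] * 3 + ['USA'] * 2,
    "C2": ['A', 'B', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'A'],
})

# label consecutive runs of identical C1 values
b = df['C1'].ne(df['C1'].shift()).cumsum()
# mark rows belonging to runs that contain at least one B
m = b.isin(b[df['C2'].eq('B')])
# first and last index of each marked run
out = m.index[m].to_series().groupby(b).agg(['first', 'last']).values.tolist()
print(out)  # [[0, 4], [8, 9]]
```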
Another approach using more_itertools.
# Keep all the indexes needed
temp = df['C1'].ne(df['C1'].shift()).cumsum()
stored_index = df.index[temp.isin(temp[df['C2'].eq("B")])]
# Group the list based on consecutive numbers
import more_itertools as mit
out = [list(i) for i in mit.consecutive_groups(stored_index)]
# Get first and last elements from the nested (grouped) lists
final = [a[:1] + a[-1:] for a in out]
>>> print(final)
[[0, 4], [8, 9]]
Another version:
import numpy as np

x = (
df.groupby((df.C1 != df.C1.shift(1)).cumsum())["C2"]
.apply(lambda x: [x.index[0], x.index[-1]] if ("B" in x.values) else np.nan)
.dropna()
.to_list()
)
print(x)
Prints:
[[0, 4], [8, 9]]
I have a dataframe with two columns: name and version.
I want to add a boolean in an extra column: True if the row holds the highest version, otherwise False.
import pandas as pd
data = [['a', 1], ['b', 2], ['a', 2], ['a', 2], ['b', 4]]
df = pd.DataFrame(data, columns = ['name', 'version'])
df
Is it best to use groupby for this? I have tried something like the following, but I do not know how to add the extra boolean column.
df.groupby(['name']).max()
Compare each value with the maximal value of its group: GroupBy.transform with 'max' generates a Series of per-group maxima aligned to the original rows, so it can be compared directly with the original column:
df['bool_col'] = df['version'] == df.groupby('name')['version'].transform('max')
print(df)
name version bool_col
0 a 1 False
1 b 2 False
2 a 2 True
3 a 2 True
4 b 4 True
Detail:
print(df.groupby('name')['version'].transform('max'))
0 2
1 4
2 2
3 2
4 4
Name: version, dtype: int64
You can assign your column directly - note that this compares against the overall maximum of the column, not the per-name maximum:
df['bool_col'] = df['version'] == max(df['version'])
Output:
name version bool_col
0 a 1 False
1 b 2 False
2 a 2 False
3 a 2 False
4 b 4 True
Is this what you were looking for?
In a pandas DataFrame, I want to detect the beginning and end position of each block of consecutive False values in a column. If a block contains just one False, I would like to get that single position.
Example:
df = pd.DataFrame({"a": [True, True, True,False,False,False,True,False,True],})
In[110]: df
Out[111]:
a
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 False
8 True
In this example, I would like to get the positions
`3`, `5`
and
`7`, `7`.
Use:
a = (df.a.cumsum()[~df.a]
.reset_index()
.groupby('a')['index']
.agg(['first','last'])
.values
.tolist())
print(a)
[[3, 5], [7, 7]]
Explanation:
First get the cumulative sum with cumsum - this assigns each block of consecutive False values a unique group number:
print (df.a.cumsum())
0 1
1 2
2 3
3 3
4 3
5 3
6 4
7 4
8 5
Name: a, dtype: int32
Filter only the False rows by boolean indexing with the inverted boolean column:
print (df.a.cumsum()[~df.a])
3 3
4 3
5 3
7 4
Name: a, dtype: int32
Create a column from the index with reset_index:
print (df.a.cumsum()[~df.a].reset_index())
index a
0 3 3
1 4 3
2 5 3
3 7 4
For each group, aggregate with the functions first and last:
print (df.a.cumsum()[~df.a].reset_index().groupby('a')['index'].agg(['first','last']))
first last
a
3 3 5
4 7 7
Last, convert to a nested list:
print (df.a.cumsum()[~df.a].reset_index().groupby('a')['index'].agg(['first','last']).values.tolist())
[[3, 5], [7, 7]]
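The same runs can also be found with plain numpy, splitting the False positions wherever they stop being consecutive - a minimal sketch along the lines of the first question's answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [True, True, True, False, False, False, True, False, True]})

# positions of the False values
idx = np.flatnonzero(~df.a.to_numpy())  # array([3, 4, 5, 7])
# split wherever consecutive positions differ by more than 1
runs = np.split(idx, np.where(np.diff(idx) > 1)[0] + 1)
out = [[int(r[0]), int(r[-1])] for r in runs]
print(out)  # [[3, 5], [7, 7]]
```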
I'm wondering what the Pythonic way of achieving the following is. Given a list of lists:
l = [[1, 2],[3, 4],[5, 6],[7, 8]]
I would like to create a list of pandas data frames, where the first data frame is a row-bind of the first two elements of l and the second a row-bind of the last two elements:
>>> df1 = pd.DataFrame(np.asarray(l[:2]))
>>> df1
0 1
0 1 2
1 3 4
and
>>> df2 = pd.DataFrame(np.asarray(l[2:]))
>>> df2
0 1
0 5 6
1 7 8
In my real problem I have a very long list and I know the grouping, i.e. the first k elements of the list l should be row-bound to form the first df. How can this be achieved in a Python-friendly way?
You could store them in a dict, like:
In [586]: s = pd.Series(l)
In [587]: k = 2
In [588]: df = {k:pd.DataFrame(g.values.tolist()) for k, g in s.groupby(s.index//k)}
In [589]: df[0]
Out[589]:
0 1
0 1 2
1 3 4
In [590]: df[1]
Out[590]:
0 1
0 5 6
1 7 8
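If the chunk size k is known up front, a plain list comprehension over slices does the same without the groupby - a minimal sketch, assuming k divides the list length evenly:

```python
import pandas as pd

l = [[1, 2], [3, 4], [5, 6], [7, 8]]
k = 2

# one DataFrame per consecutive chunk of k sublists
dfs = [pd.DataFrame(l[i:i + k]) for i in range(0, len(l), k)]
print(dfs[0].values.tolist())  # [[1, 2], [3, 4]]
print(dfs[1].values.tolist())  # [[5, 6], [7, 8]]
```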
I am using the pandas package to deal with my data, and I have a dataframe that looks like below.
data = pd.read_csv('people.csv')
id, A, B
John, 1, 3
Mary, 2, 5
John, 4, 6
John, 3, 7
Mary, 5, 2
I'd like to produce unique ids for the duplicates while keeping their original order.
id, A, B
John, 1, 3
Mary, 2, 5
John.1, 4, 6
John.2, 3, 7 # John shows up three times.
Mary.1, 5, 2 # Mary shows up twice.
I tried things like set_index, pd.factorize() and index_col, but they do not work.
To obtain the per-group occurrence counts you may use GroupBy.cumcount:
>>> idx = df.groupby('id').cumcount()
>>> idx
0 0
1 0
2 1
3 2
4 1
dtype: int64
The non-zero counts may then be appended to the id:
>>> mask = idx != 0
>>> df.loc[mask, 'id'] += '.' + idx[mask].astype('str')
>>> df
id A B
0 John 1 3
1 Mary 2 5
2 John.1 4 6
3 John.2 3 7
4 Mary.1 5 2
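As a self-contained sketch of the two steps above (constructing the frame inline instead of reading people.csv):

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['John', 'Mary', 'John', 'John', 'Mary'],
    'A': [1, 2, 4, 3, 5],
    'B': [3, 5, 6, 7, 2],
})

# occurrence number of each id within its group
idx = df.groupby('id').cumcount()
# append ".n" only from the second occurrence onwards
mask = idx != 0
df.loc[mask, 'id'] += '.' + idx[mask].astype(str)
print(df['id'].tolist())  # ['John', 'Mary', 'John.1', 'John.2', 'Mary.1']
```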
Suppose I have a dataframe as follows:
df = pd.DataFrame(range(4), index=range(4))
df = pd.concat([df, df])  # df.append was removed in pandas 2.0
the resultant df is:
0 0
1 1
2 2
3 3
0 0
1 1
2 2
3 3
I want to combine the values of the same index into a list. The desired result is:
0 [0,0]
1 [1,1]
2 [2,2]
3 [3,3]
For a more realistic scenario, my index will be dates, and I want to aggregate multiple obs into a list based on the date. In this way, I can perform some functions on the obs for each date.
For a more realistic scenario, my index will be dates, and I want to
aggregate multiple obs into a list based on the date. In this way, I
can perform some functions on the obs for each date.
If that's your goal, then I don't think you actually want to materialize a list. What you want is to use groupby and then act on the groups. For example:
>>> df.groupby(level=0)
<pandas.core.groupby.DataFrameGroupBy object at 0xa861f6c>
>>> df.groupby(level=0)[0]
<pandas.core.groupby.SeriesGroupBy object at 0xa86630c>
>>> df.groupby(level=0)[0].sum()
0 0
1 2
2 4
3 6
Name: 0, dtype: int64
You could extract a list too:
>>> df.groupby(level=0)[0].apply(list)
0 [0, 0]
1 [1, 1]
2 [2, 2]
3 [3, 3]
Name: 0, dtype: object
but it's usually better to act on the groups themselves. Series and DataFrames aren't really meant for storing lists of objects.
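For instance, acting on the groups directly keeps everything vectorized - a minimal sketch (pd.concat replaces df.append, which was removed in pandas 2.0):

```python
import pandas as pd

df = pd.DataFrame(range(4), index=range(4))
df = pd.concat([df, df])

# aggregate per index label instead of materializing lists
stats = df.groupby(level=0)[0].agg(['sum', 'mean'])
print(stats['sum'].tolist())   # [0, 2, 4, 6]
print(stats['mean'].tolist())  # [0.0, 1.0, 2.0, 3.0]
```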
In [374]:
import pandas as pd
df = pd.DataFrame({'a':range(4)})
df = pd.concat([df, df])  # df.append was removed in pandas 2.0
df
Out[374]:
a
0 0
1 1
2 2
3 3
0 0
1 1
2 2
3 3
[8 rows x 1 columns]
In [379]:
import numpy as np
# loop over the index values and flatten them using numpy.ravel and cast to a list
for index in df.index.values:
# use loc to select the values at that index
print(index, list((np.ravel(df.loc[index].values))))
    # stop after the last unique index value, otherwise the duplicated
    # index labels would make us print everything twice
if index == max(df.index.values):
break
0 [0, 0]
1 [1, 1]
2 [2, 2]
3 [3, 3]