Creating a derived column using pandas operations - python

I'm trying to create a column that contains a cumulative count of entries per tid, where the rows are grouped by unique values of (rid, tid). The cumulative sum should increase by the number of entries in each group, as shown in the df3 dataframe below, rather than by one row at a time.
import pandas as pd
df1 = pd.DataFrame({
'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3]})
rid tid
0 1 1
1 1 2
2 1 2
3 2 1
4 2 1
5 2 3
6 3 1
7 3 4
8 4 5
9 5 1
10 5 1
11 5 1
12 5 3
Giving after the required operation:
df3 = pd.DataFrame({
'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3],
'groupentries': [1, 2, 2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 1],
'cumulativeentries': [1, 2, 2, 3, 3, 1, 4, 1, 1, 7, 7, 7, 2]})
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
The derived column that I'm after is the cumulativeentries column although I've only figured out how to generate the intermediate column groupentries using pandas:
df1.groupby(["rid", "tid"]).size()

Values in cumulativeentries are actually a kind of running count.
The task is to count occurrences of the current tid in "source area" of
tid column:
from the beginning of the DataFrame,
up to (including) the end of the current group.
To compute both required values for each group, I defined
the following function:
def fn(grp):
    lastRow = grp.iloc[-1]   # last row of the current group
    lastId = lastRow.name    # index of this row
    tids = df1.truncate(after=lastId).tid
    return [grp.index.size, tids[tids == lastRow.tid].size]
To get the "source area" mentioned above I used truncate function.
In my opinion it is a very intuitive solution, based on the notion of the
"source area".
The function returns a list containing both required values:
the size of the current group,
how many tids equal to the current tid are in the
truncated tid column.
To apply this function, run:
df2 = df1.groupby(['rid', 'tid']).apply(fn).apply(pd.Series)\
.rename(columns={0: 'groupentries', 1: 'cumulativeentries'})
Details:
apply(fn) generates a Series containing 2-element lists.
apply(pd.Series) converts it to a DataFrame (with default column names).
rename sets the target column names.
And the last thing to do is to join this table to df1:
df1.join(df2, on=['rid', 'tid'])

For the first column, use GroupBy.transform with DataFrameGroupBy.size. For the second, use a custom function that takes all values of the tid column up to the last index of the group, compares them with the group's last value, and counts the matches with sum:
f = lambda x: (df1['tid'].iloc[:x.index[-1]+1] == x.iat[-1]).sum()
df1['groupentries'] = df1.groupby(["rid", "tid"])['rid'].transform('size')
df1['cumulativeentries'] = df1.groupby(["rid", "tid"])['tid'].transform(f)
print (df1)
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
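As a possible alternative to the per-group lambda, the same two columns can be sketched with built-in groupby operations only. This is an illustration rather than part of either answer: it assumes the rows are already in the order in which the running count should be taken, and the names `out` and `running` are made up here.

```python
import pandas as pd

df1 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3]})

out = df1.copy()  # hypothetical name; leaves df1 untouched

# Number of rows in each (rid, tid) group, broadcast back to every row.
out['groupentries'] = out.groupby(['rid', 'tid'])['tid'].transform('size')

# Running count of each tid from the top of the frame (1-based).
running = out.groupby('tid').cumcount() + 1

# Within a (rid, tid) group, the largest running count equals the number of
# rows with that tid from the first row up to the last row of the group.
out['cumulativeentries'] = (
    running.groupby([out['rid'], out['tid']]).transform('max')
)

print(out)
```

Because nothing here calls a Python function per group, this may scale better on large frames, though that has not been measured here.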

Related

dataframe new column based on groupby operations

import pandas
import numpy
df = pandas.DataFrame({'id_1' : [1,2,1,1,1,1,1,2,2,2,2],
'id_2' : [1,1,1,1,1,2,2,2,2,2,2],
'v_1' : [2,1,1,3,2,1,2,4,1,1,2],
'v_2' : [1,1,1,1,2,2,2,1,1,2,2],
'v_3' : [3,3,3,3,4,4,4,3,3,3,3]})
In [4]: df
Out[4]:
id_1 id_2 v_1 v_2 v_3
0 1 1 2 1 3
1 2 1 1 1 3
2 1 1 1 1 3
3 1 1 3 1 3
4 1 1 2 2 4
5 1 2 1 2 4
6 1 2 2 2 4
7 2 2 4 1 3
8 2 2 1 1 3
9 2 2 1 2 3
10 2 2 2 2 3
sub = df[(df['id_1'] == 1) & (df['id_2'] == 1)].copy()
sub['v_4'] = numpy.where(sub['v_1'] == sub['v_2'].shift(), 'A', \
numpy.where(sub['v_1'] == sub['v_3'].shift(), 'B', 'C'))
In [6]: sub
Out[6]:
id_1 id_2 v_1 v_2 v_3 v_4
0 1 1 2 1 3 C
2 1 1 1 1 3 A
3 1 1 3 1 3 B
4 1 1 2 2 4 C
I have a dataframe as defined above. I would like to perform an operation that categorizes whether v_1 equals the previous v_2 or v_3 within each group of (id_1, id_2).
I have done the operation on a sub-dataframe, and I would like a one-liner that combines the following groupby with the operation I applied to the sub-dataframe.
gbdf = df.groupby(by=['id_1', 'id_2'])
I have tried something like
gbdf['v_4'] = numpy.where(gbdf['v_1'] == gbdf['v_2'].shift(), 'A', \
numpy.where(gbdf['v_1'] == gbdf['v_3'].shift(), 'B', 'C'))
and the error was
'DataFrameGroupBy' object does not support item assignment
I also tried
df['v_4'] = numpy.where(gbdf['v_1'] == gbdf['v_2'].shift(), 'A', \
numpy.where(gbdf['v_1'] == gbdf['v_3'].shift(), 'B', 'C'))
but I believe the result was wrong: it does not align the groupby result with the original ordering.
I am wondering whether there is an elegant way to achieve this.
This gets you a list of dataframes that match the content of the dataframe sub, but for all results of the .groupby():
import numpy
import pandas
source = pandas.DataFrame(
{'id_1': [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2],
'id_2': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
'v_1': [2, 1, 1, 3, 2, 1, 2, 4, 1, 1, 2],
'v_2': [1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
'v_3': [3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3]})
def add_v4(df):
    df['v_4'] = numpy.where(df['v_1'] == df['v_2'].shift(), 'A', numpy.where(df['v_1'] == df['v_3'].shift(), 'B', 'C'))
    return df
dfs = [add_v4(pandas.DataFrame(slice)) for _, slice in source.groupby(by=['id_1', 'id_2'])]
print(dfs)
About this line:
dfs = [add_v4(pandas.DataFrame(slice)) for _, slice in source.groupby(by=['id_1', 'id_2'])]
It's a list comprehension that gets all the slices from the groupby and turns them into actual new dataframes before passing them to add_v4, which returns the modified dataframe to be added to the list.
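If a single dataframe in the original row order is wanted rather than a list of dataframes, one option is to shift within the groups and compare against the unshifted columns. A minimal sketch, assuming the same `source` data as above; `grouped`, `prev_v2` and `prev_v3` are names introduced here for illustration:

```python
import numpy
import pandas

source = pandas.DataFrame(
    {'id_1': [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2],
     'id_2': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
     'v_1': [2, 1, 1, 3, 2, 1, 2, 4, 1, 1, 2],
     'v_2': [1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
     'v_3': [3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3]})

grouped = source.groupby(['id_1', 'id_2'])

# Shifting on the groupby object moves values down by one row inside each
# group only, and the result keeps the index of the original dataframe.
prev_v2 = grouped['v_2'].shift()
prev_v3 = grouped['v_3'].shift()

# The first row of every group has no predecessor (NaN), so it falls to 'C'.
source['v_4'] = numpy.where(source['v_1'] == prev_v2, 'A',
                            numpy.where(source['v_1'] == prev_v3, 'B', 'C'))
print(source)
```

The rows for id_1 == 1 and id_2 == 1 should then carry the same v_4 values as the sub dataframe in the question.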

Extract data from a dataframe

I have a list based upon which I want to retrieve data from a dataset.
Here is the list:
packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]
and here is the dataset
There are two items that appear more than once, namely 3 and 7.
I want to extract the rows that are not accounted for by the packed list. In this case it's two times 7 (the remaining 3 are in the list already).
How can I do that? I tried this but this doesn't work
new_df= data[~data["Pid"].isin(packed)].reset_index(drop=True)
Build a helper DataFrame from the list and number the repeated values in both frames with GroupBy.cumcount, then merge with a left join and indicator=True, and finally filter the unmatched rows with boolean indexing:
packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]
df1 = pd.DataFrame({'Pid':packed})
df1['g'] = df1.groupby('Pid').cumcount()
print (df1)
Pid g
0 1 0
1 5 0
2 8 0
3 2 0
4 3 0
5 3 1
6 7 0
7 3 2
8 7 1
9 7 2
10 4 0
11 6 0
12 3 3
data['g'] = data.groupby('Pid').cumcount()
new_df = data[data.merge(df1, indicator=True, how='left')['_merge'].eq('left_only')]
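Since the dataset itself is not shown, here is a self-contained sketch of the same idea on made-up data. The contents of `data` are an assumption chosen to match the description (five 7s, of which three are in the packed list), and `merged` and `mask` are illustrative names.

```python
import pandas as pd

packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]

# Hypothetical stand-in for the dataset, which is not shown in the question:
# four 3s (all packed) and five 7s (only three of them packed).
data = pd.DataFrame({'Pid': [1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 7, 7, 7, 7, 8]})

df1 = pd.DataFrame({'Pid': packed})
df1['g'] = df1.groupby('Pid').cumcount()    # n-th occurrence within packed
data['g'] = data.groupby('Pid').cumcount()  # n-th occurrence within data

# Left join on the shared columns (Pid, g); '_merge' flags rows of data
# that found no partner in the packed list.
merged = data.merge(df1, how='left', indicator=True)
mask = merged['_merge'].eq('left_only').to_numpy()

new_df = data[mask].drop(columns='g').reset_index(drop=True)
print(new_df)
```

Converting the mask to a NumPy array makes the filter positional, so it does not depend on `data` having a default integer index.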

Pandas create a unique id for each row based on a condition

I have a dataset where one of the columns is as below. I'd like to create a new column based on the condition below.
For values in column_name, if 1 is present, create a new id. If 0 is present, also create a new id. But if 1 is repeated in more than 1 continuous rows, then id should be same for all rows. The sample output result can be seen below.
column_name
1
0
0
1
1
1
1
0
0
1
column_name -- ID
1 -- 1
0 -- 2
0 -- 3
1 -- 4
1 -- 4
1 -- 4
1 -- 4
0 -- 5
0 -- 6
1 -- 7
Say your Series is
s = pd.Series([1, 0, 0, 1, 1, 1, 1, 0, 0, 1])
Then you can use:
>>> ((s != 1) | (s.shift(1) != 1)).cumsum()
0 1
1 2
2 3
3 4
4 4
5 4
6 4
7 5
8 6
9 7
dtype: int64
This checks that either the current entry is not 1, or that the previous entry is not 1, and then performs a cumulative sum on the result.
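To make the intermediate step visible, the expression can be split in two. This is only a sketch; the DataFrame `df` and the name `starts_new_id` are assumptions introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({'column_name': [1, 0, 0, 1, 1, 1, 1, 0, 0, 1]})
s = df['column_name']

# True wherever a new ID must start: the row is not 1, or the row before
# it is not 1 (the first row has no predecessor, and NaN != 1 is True).
starts_new_id = (s != 1) | (s.shift(1) != 1)

# Each True bumps the running total by one; each False repeats it.
df['ID'] = starts_new_id.cumsum()
print(df)
```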
This essentially leverages the fact that a 1 preceded by another 1 should be treated as part of the same group, while every 0 calls for an increment. One of four things will happen:
1) 0 with a preceding 0 : Increment by 1
2) 0 with a preceding 1 : Increment by 1
3) 1 with a preceding 1 : Increment by 0
4) 1 with a preceding 0: Increment by 1
(
    (df['column_name'] + df['column_name'].shift(1))  # Series with values 0, 1, or 2 (first entry is NaN)
    .fillna(0)       # fill the first entry with 0
    .isin([0, 1])    # True for cases 1, 2, and 4 described above, else False (case 3)
    .astype('int')   # convert the booleans to integers
    .cumsum()
)
Output:
0 1
1 2
2 3
3 4
4 4
5 4
6 4
7 5
8 6
9 7
At this stage I would just use a regular python for loop
column_name = pd.Series([1, 0, 0, 1, 1, 1, 1, 0, 0, 1])
ID = [1]
for i in range(1, len(column_name)):
    ID.append(ID[-1] + ((column_name[i] + column_name[i-1]) < 2))
print(ID)
>>> [1, 2, 3, 4, 4, 4, 4, 5, 6, 7]
And then you can assign ID as a column in your dataframe

Pandas: group columns of duplicate rows into column of lists

I have a Pandas dataframe that looks something like this:
>>> df
m event
0 3 1
1 1 1
2 1 2
3 1 2
4 2 1
5 2 0
6 3 1
7 2 2
8 3 2
9 3 1
I want to group the values of the event column into lists based on the m column so that I would get this:
>>> df
m events
0 3 [1, 1, 2, 1]
1 1 [1, 2, 2]
2 2 [1, 0, 2]
There should be one row per unique value of m with a corresponding list of all events that belongs to m.
I tried this:
>>> list(df.groupby('m').event)
[(3, m_id
0 1
6 1
8 2
9 1
Name: event, dtype: int64), (1, m_id
1 1
2 2
3 2
Name: event, dtype: int64), (2, m_id
4 1
5 0
7 2
Name: event, dtype: int64)]
It sort of does what I want in that it groups the events by m. I could massage this back into the dataframe that I wanted with some loops, but I feel that I have started on an ugly and unnecessarily complex path. And slow, if there are thousands of unique values for m.
Can I perform the conversion I wanted in an elegant manner using Pandas methods?
Bonus if the events column can contain (numpy) arrays so that I can do math directly on the events rows, like df[df.m==1].events + 100, but regular lists are also ok.
In [320]: r = df.groupby('m')['event'].apply(np.array).reset_index(name='event')
In [321]: r
Out[321]:
m event
0 1 [1, 2, 2]
1 2 [1, 0, 2]
2 3 [1, 1, 2, 1]
Bonus:
In [322]: r.loc[r.m==1, 'event'] + 1
Out[322]:
0 [2, 3, 3]
Name: event, dtype: object
You could use apply(list):
In [1163]: df.groupby('m')['event'].apply(list).reset_index(name='events')
Out[1163]:
m events
0 1 [1, 2, 2]
1 2 [1, 0, 2]
2 3 [1, 1, 2, 1]
If you don't want sorted m
In [1164]: df.groupby('m', sort=False).event.apply(list).reset_index(name='events')
Out[1164]:
m events
0 3 [1, 1, 2, 1]
1 1 [1, 2, 2]
2 2 [1, 0, 2]

Divide a dataframe based on the last occurrence of a condition

Let's say I have this data ordered by id:
id | count
1 1
2 2
3 0
4 4
5 3
6 2
7 0
8 10
9 1
10 2
I always want to obtain the last rows, that is, the ones that come after the last zero, if there is one. Based on the data above, I would want to get:
id | count
8 10
9 1
10 2
Does anyone know how to do this?
pandas
df.loc[df['count'].ne(0).iloc[::-1].cumprod().astype(bool)]
id count
7 8 10
8 9 1
9 10 2
numpy
df[(df['count'].values[::-1] != 0).cumprod()[::-1].astype(bool)]
id count
7 8 10
8 9 1
9 10 2
with other conditions
df[(df['count'].values[::-1] < 3).cumprod()[::-1].astype(bool)]
# df.loc[df['count'].lt(3).iloc[::-1].cumprod().astype(bool)]
id count
8 9 1
9 10 2
debugging
You should be able to copy and paste this and reproduce my results. If you can't then there is something else wrong. Try resetting your kernel.
import pandas as pd
df = pd.DataFrame({
'count': [1, 2, 0, 4, 3, 2, 0, 10, 1, 2],
'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
df[(df['count'].values[::-1] < 3).cumprod()[::-1].astype(bool)]
Should produce
count id
8 1 9
9 2 10
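The one-liners above are compact, so here is a step-by-step sketch of the non-zero case. It assumes the sample data from the question; `nonzero`, `keep_reversed` and `tail` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'count': [1, 2, 0, 4, 3, 2, 0, 10, 1, 2]})

# One flag per row: True where count is not zero.
nonzero = df['count'].to_numpy() != 0

# Walk the flags from the last row backwards: the cumulative product stays 1
# until the first zero count is met and is 0 from there on.
keep_reversed = nonzero[::-1].cumprod()

# Flip back to the original row order and use the result as a boolean mask.
tail = df[keep_reversed[::-1].astype(bool)]
print(tail)
```

If the count column contains no zero at all, the mask is True everywhere and the whole dataframe is returned.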
