Python - Pandas: select first observation per group - python

I want to adapt my former SAS code to Python using the dataframe framework.
In SAS I often use this type of code (assume the columns are sorted by group_id where group_id takes values 1 to 10 where there are multiple observations for each group_id):
data want;set have;
by group_id;
if first.group_id then c=1; else c=0;
run;
so what goes on here is that I select the first observations for each id and I create a new variable c that takes value 1 and 0 for the others. The dataset looks like this:
group_id c
1 1
1 0
1 0
2 1
2 0
2 0
3 1
3 0
3 0
How can I do this in Python using dataframe? Assume that I start with the group_id vector only.

If you're using 0.13+ you can use cumcount groupby method:
In [11]: df
Out[11]:
group_id
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 3
In [12]: df.groupby('group_id').cumcount() == 0
Out[12]:
0 True
1 False
2 False
3 True
4 False
5 False
6 True
7 False
8 False
dtype: bool
You can force the dtype to be int rather than bool:
In [13]: df['c'] = (df.groupby('group_id').cumcount() == 0).astype(int)

Related

how to extract 'day' of each week from dataframe according to specific condition [duplicate]

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()
df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If possible multiple minimal values per groups and want all min rows use boolean indexing with transform for minimal values per groups:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answer worked great if there is / you want one min. In my case there could be multiple mins and I wanted all rows equal to min which .idxmin() doesn't give you. This worked
def filter_group(dfg, col):
return dfg[dfg[col] == dfg[col].min()]
df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g',group_keys=False).apply(lambda x: filter_group(x,'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='item', inplace=True, ignore_index=True)
df.drop_duplicates(subset='diff', inplace=True, ignore_index=True)
df.sort_values(by=['diff'], inplace=True, ignore_index=True)
For a little more explanation:
Sort items by the minimum value you want
Drop the duplicates of the column you want to sort with
Resort the data because the data is still sorted by the minimum values
If you know that all of your "items" have more than one record you can sort, then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')

"Is seen before" column for another column

Consider following data frame:
a
0 1
1 1
2 2
3 4
4 5
5 6
6 4
Is there a convenient way (without iterating rows) to create a column that represent "is seen before" for every value of column a.
For example desired output for the example is (0 represent not seen before, 1 represent seen before):
0
1
0
0
0
0
1
If this is possible, is there a way to enhance it with counts of previous occurrences and not just binary indicator?
Should just be .duplicated() (see documentation). Then if you want to cast it to an integer for 0's and 1's instead of False and True you can use .astype(int) on the output:
From pd.DataFrame:
df.duplicated(subset="a").astype(int)
0 0
1 1
2 0
3 0
4 0
5 0
6 1
dtype: int32
From pd.Series:
df["a"].duplicated().astype(int)
0 0
1 1
2 0
3 0
4 0
5 0
6 1
Name: a, dtype: int32
This will mark the first time a value is "seen" as False, and all subsequent values that have already been "seen" as True. Coercing it to an int datatype via astype will change False -> 0 and True -> 1
Use assign and duplicated:
df.assign(seenbefore = lambda x: x.a.duplicated().astype(int))

Get rows from Pandas DataFrame from index until condition

Let's say I have a Pandas DataFrame:
x = pd.DataFrame(data=[5,4,3,2,1,0,1,2,3,4,5],columns=['value'])
x
Out[9]:
value
0 5
1 4
2 3
3 2
4 1
5 0
6 1
7 2
8 3
9 4
10 5
Now, I want to, given an index, find rows in x until a condition is met.
For example, if index = 2:
x.loc[2]
Out[14]:
value 3
Name: 2, dtype: int64
Now I want to, from that index, find the next n rows where the value is greater than some threshold. For example, if the threshold is 0, the results should be:
x
Out[9]:
value
2 3
3 2
4 1
5 0
How can I do this?
I have tried:
x.loc[2:x['value']>0,:]
But of course this will not work because x['value']>0 returns a boolean array of:
Out[20]:
0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 True
10 True
Name: value, dtype: bool
Using idxmin and slicing
x.loc[2:x['value'].gt(0).idxmin(),:]
2 3
3 2
4 1
5 0
Name: value
Edit:
For a general formula, use
index = 7
threshold = 2
x.loc[index:x.loc[index:,'value'].gt(threshold).idxmin(),:]
From your description in comments, seemed like you want to begin from index+1 and not index. So, if that is the case, just use
x.loc[index+1:x.loc[index+1:,'value'].gt(threshold).idxmin(),:]
You want to filter for index greater than your index=2, and for x['value']>=threshold, and then select the first n of these rows, which can be accomplished with .head(n).
Say:
idx = 2
threshold = 0
n = 4
x[(x.index>=idx) & (x['value']>=threshold)].head(n)
Out:
# value
# 2 3
# 3 2
# 4 1
# 5 0
Edit: changed to >=, and updated example to match OP's example.
Edit 2 due to clarification from OP: since n is unknown:
idx = 2
threshold = 0
x.loc[idx:(x['value']<=threshold).loc[x.index>=idx].idxmax()]
This is selecting from the starting idx, in this case idx=2, up to and including the first row where the condition is not met (in this case index 5).

pandas: Grouping or filtering based on values in list, instead of dataframe

I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light' : pd.Series(['b','b','c','a','a','a','a'], index=[1,2,3,4,5,6,9]),'injury' : pd.Series([1,5,5,5,2,2,4], index=[1,2,3,4,5,6,9])}
testdf = pd.DataFrame(d)
injury light
1 1 b
2 5 b
3 5 c
4 5 a
5 2 a
6 2 a
9 4 a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
d 0 0 0 0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a','b','c','d']):
idx2 = (df['light'].isin([v]))
df2 = df[idx2]
print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?
Try this:
df = pd.crosstab(df.light, df.injury,margins=True)
df
injury 1 2 4 5 All
light
a 0 2 1 1 4
b 1 0 0 1 2
c 0 0 0 1 1
All 1 2 1 3 7
df["All"]
light
a 4
b 2
c 1
All 7

Pandas conditional subset for dataframe with bool values and ints

I have a dataframe with three series. Column A contains a group_id. Column B contains True or False. Column C contains a 1-n ranking (where n is the number of rows per group_id).
I'd like to store a subset of this dataframe for each row that:
1) Column C == 1
OR
2) Column B == True
The following logic copies my old dataframe row for row into the new dataframe:
new_df = df[df.column_b | df.column_c == 1]
IIUC, starting from a sample dataframe like:
A,B,C
01,True,1
01,False,2
02,False,1
02,True,2
03,True,1
you can:
df = df[(df['C']==1) | (df['B']==True)]
which returns:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
You've couple of methods for filtering, and performance varies based on size of your data
In [722]: df[(df['C']==1) | df['B']]
Out[722]:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
In [723]: df.query('C==1 or B==True')
Out[723]:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
In [724]: df[df.eval('C==1 or B==True')]
Out[724]:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1

Categories