How do I count the number of multicolumn (thing, cond=1) event occurrences prior to every (thing, cond=any) event?
(These could be winning games of poker by player, episodes of depression by patient, or so on.) For example, row index == 3, below, contains the pair (thing, cond) = (c,2), and shows the number of prior (c,1) occurrences, which is correctly (but manually) shown in the priors column as 0. I'm interested in producing a synthetic column with the count of prior (thing, 1) events for every (thing, event) pair in my data. My data are monotonically increasing in time. The natural index in the silly DataFrame can be taken as logical ticks, if it helps. (<Narrator>: It really doesn't.)
For convenience, below is the code for my test DataFrame and the manually generated priors column, which I cannot get pandas to usefully generate, no matter which combinations of groupby, cumsum, shift, where, & etc. I try. I have googled and wracked my brain for days. No SO answers seem to fit the bill. The key to reading the priors column is that its entries say things like, "Before this (a,1) or (a,2) event, there have been 2 (a,1) events."
[In]:
import pandas as pd
silly = pd.DataFrame({'thing': ['a','b','a','c','b','c','c','a','a','b','c','a'], "cond": [1,2,1,2,1,2,1,2,1,2,1,2]})
silly['priors'] = pd.Series([0,0,1,0,0,0,0,2,2,1,1,3])
silly
[Out]:
silly
thing cond priors
0 a 1 0
1 b 2 0
2 a 1 1
3 c 2 0
4 b 1 0
5 c 2 0
6 c 1 0
7 a 2 2
8 a 1 2
9 b 2 1
10 c 1 1
11 a 2 3
The closest I've come is:
silly
[In]:
silly['priors_inc'] = silly[['thing', 'cond']].where(silly['cond'] == 1).groupby('thing').cumsum() - 1
[Out]:
silly
thing cond priors priors_inc
0 a 1 0 0.0
1 b 2 0 NaN
2 a 1 1 1.0
3 c 2 0 NaN
4 b 1 0 0.0
5 c 2 0 NaN
6 c 1 0 0.0
7 a 2 2 NaN
8 a 1 2 2.0
9 b 2 1 NaN
10 c 1 1 1.0
11 a 2 3 NaN
Note that the values that are present in the incomplete priors column are correct, but not all of the desired data are there.
Please, if at all possible, withhold any "Pythonic" answers. While my real data are small compared to most ML problems, I want to learn pandas the right way, not the toy data way with Python loops or itertools chicanery that I've seen too much of already. Thank you in advance! (And I apologize for the wall of text!)
You need to
Cumulatively count where each "cond" is 1
Do this for each "thing"
Make sure the counts are shifted by 1.
You can do this using groupby, cumsum and shift:
(df.cond.eq(1)
.groupby(df.thing)
.apply(lambda x: x.cumsum().shift())
.fillna(0, downcast='infer'))
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 2
8 2
9 1
10 1
11 3
Name: cond, dtype: int64
Another option to avoid the apply is to chain two groupby calls—one does the shifting, the other performs the cumsum.
(df.cond.eq(1)
.groupby(df.thing)
.cumsum()
.groupby(df.thing)
.shift()
.fillna(0, downcast='infer'))
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 2
8 2
9 1
10 1
11 3
Name: cond, dtype: int64
Related
When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R. For example, if I have
>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
a b
0 1 1
1 1 1
2 1 2
3 2 1
4 2 1
5 2 2
How can I get a DataFrame like
a b idx
0 1 1 1
1 1 1 1
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 4
(the order of the idx indexes doesn't matter)
Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )
a b index
0 1 1 0
1 1 1 0
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 5
The identifier in this case goes 0,2,3,5 (just a residual of original index) but this could be easily changed to 0,1,2,3 with an additional reset_index(drop=True).
Update: Newer versions of pandas (0.20.2) offer a simpler way to do this with the ngroup method as noted in a comment to the question above by #Constantino and a subsequent answer by #CalumYou. I'll leave this here as an alternate approach but ngroup seems like the better way to do this in most cases.
A simple way to do that would be to concatenate your grouping columns (so that each combination of their values represents a uniquely distinct element), then convert it to a pandas Categorical and keep only its labels:
df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Edit: changed labels properties to codes as the former seem to be deprecated
Edit2: Added a separator as suggested by Authman Apatira
Definetely not the most straightforward solution, but here is what I would do (comments in the code):
df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
#create a dummy grouper id by just joining desired rows
df["idx"] = df[["a","b"]].astype(str).apply(lambda x: "".join(x),axis=1)
print df
That would generate an unique idx for each combination of a and b.
a b idx
0 1 1 11
1 1 1 11
2 1 2 12
3 2 1 21
4 2 1 21
5 2 2 22
But this is still a rather silly index (think about some more complex values in columns a and b. So let's clear the index:
# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))
# switch keys and values, so you can use dict in .replace method
dict_idx = {y:x for x,y in dict_idx.iteritems()}
#replace values with the generated dict
df["idx"].replace(dict_idx,inplace=True)
print df
That would produce the desired output:
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):
def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
df.sort_values(grouping_cols, inplace=True)
# You could do the following three lines in one, I just thought
# this would be clearer as an explanation of what's going on:
duplicated = df.duplicated(subset=grouping_cols, keep='first')
new_group = ~duplicated
return new_group.cumsum()
Timing results:
a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})
In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop
In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts the grouping columns and then checks whether each row is different than the previous row and if so accumulates by 1. Check further below for an answer with string data.
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(1).cumsum().add(1)
Output
0 1
1 1
2 2
3 3
4 3
5 4
dtype: int64
So breaking this up into steps, lets see the output of df.sort_values(['a', 'b']).diff().fillna(0) which checks if each row is different than the previous row. Any non-zero entry indicates a new group.
a b
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 1.0 -1.0
4 0.0 0.0
5 0.0 1.0
A new group only need to have a single column different so this is what .ne(0).any(1) checks - not equal to 0 for any of the columns. And then just a cumulative sum to keep track of the groups.
Answer for columns as strings
#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])
output of df1
a b
0 a a
1 a a
4 a a
3 b a
2 b b
5 c c
6 c d
8 c d
7 d d
Take similar approach by checking if group has changed
df1.ne(df1.shift().bfill()).any(1).cumsum().add(1)
0 1
1 1
4 1
3 2
2 3
5 4
6 5
8 5
7 6
I've a question about a specific problem in pandas:
I have in a df a column with the following values:
5
4
3
2
1
0
0
0
0
1
2
3
4
5
I want to select all the rows from the first 5 to the last 0:
5
4
3
2
1
0
0
0
0
I tried with drop duplicates, but i loose the last three zeroes.
I'm thinking aboutusing a for cycle and stop when the i-th value of the column is greater than the i-1 value, but i don't know how to make such a cycle for a dataframe in pandas.
Can someone help me?
Thank you in advance, I hope I've explained the problem clearly.
You could use DataFrame.shift to compare with the next row, and keep only those that are less or equal than the previous. Here I use np.r_ to include the first value too:
import numpy as np
df[np.r_[True, df.col.le(df.col.shift()).to_numpy()[1:]]]
col
0 5
1 4
2 3
3 2
4 1
5 0
6 0
7 0
8 0
Let us try cummax:
s = df['col']
df.loc[s.eq(5).cummax() & s[::-1].eq(0).cummax()]
col
0 5
1 4
2 3
3 2
4 1
5 0
6 0
7 0
8 0
Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there more effective/elegant approach to do this? And also is there more elegant approach to number records within each group (like SQL window function row_number()).
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
Sometimes sorting the whole data ahead is very time consuming.
We can groupby first and doing topk for each group:
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk,['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here sort values ascending false gives similar to nlargest and True gives similar to nsmallest.
The value inside the head is the same as the value we give inside nlargest to get the number of values to display for each group.
reset_index is optional and not necessary.
This works for duplicated values
If you have duplicated values in top-n values, and want only unique values, you can do like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the audit department we get top 3 salaries as 110k,100k and 100k.
If we want to have not-duplicated salaries per each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than groupby().apply() and groupby().nlargest() calls as suggested in the other answers on here(1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed that it was 24-150 times faster than those solutions.
Also, instead of slicing, you can also pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R. For example, if I have
>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
a b
0 1 1
1 1 1
2 1 2
3 2 1
4 2 1
5 2 2
How can I get a DataFrame like
a b idx
0 1 1 1
1 1 1 1
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 4
(the order of the idx indexes doesn't matter)
Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )
a b index
0 1 1 0
1 1 1 0
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 5
The identifier in this case goes 0,2,3,5 (just a residual of original index) but this could be easily changed to 0,1,2,3 with an additional reset_index(drop=True).
Update: Newer versions of pandas (0.20.2) offer a simpler way to do this with the ngroup method as noted in a comment to the question above by #Constantino and a subsequent answer by #CalumYou. I'll leave this here as an alternate approach but ngroup seems like the better way to do this in most cases.
A simple way to do that would be to concatenate your grouping columns (so that each combination of their values represents a uniquely distinct element), then convert it to a pandas Categorical and keep only its labels:
df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Edit: changed labels properties to codes as the former seem to be deprecated
Edit2: Added a separator as suggested by Authman Apatira
Definetely not the most straightforward solution, but here is what I would do (comments in the code):
df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
#create a dummy grouper id by just joining desired rows
df["idx"] = df[["a","b"]].astype(str).apply(lambda x: "".join(x),axis=1)
print df
That would generate an unique idx for each combination of a and b.
a b idx
0 1 1 11
1 1 1 11
2 1 2 12
3 2 1 21
4 2 1 21
5 2 2 22
But this is still a rather silly index (think about some more complex values in columns a and b. So let's clear the index:
# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))
# switch keys and values, so you can use dict in .replace method
dict_idx = {y:x for x,y in dict_idx.iteritems()}
#replace values with the generated dict
df["idx"].replace(dict_idx,inplace=True)
print df
That would produce the desired output:
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):
def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
df.sort_values(grouping_cols, inplace=True)
# You could do the following three lines in one, I just thought
# this would be clearer as an explanation of what's going on:
duplicated = df.duplicated(subset=grouping_cols, keep='first')
new_group = ~duplicated
return new_group.cumsum()
Timing results:
a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})
In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop
In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts the grouping columns and then checks whether each row is different than the previous row and if so accumulates by 1. Check further below for an answer with string data.
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(1).cumsum().add(1)
Output
0 1
1 1
2 2
3 3
4 3
5 4
dtype: int64
So breaking this up into steps, lets see the output of df.sort_values(['a', 'b']).diff().fillna(0) which checks if each row is different than the previous row. Any non-zero entry indicates a new group.
a b
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 1.0 -1.0
4 0.0 0.0
5 0.0 1.0
A new group only need to have a single column different so this is what .ne(0).any(1) checks - not equal to 0 for any of the columns. And then just a cumulative sum to keep track of the groups.
Answer for columns as strings
#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])
output of df1
a b
0 a a
1 a a
4 a a
3 b a
2 b b
5 c c
6 c d
8 c d
7 d d
Take similar approach by checking if group has changed
df1.ne(df1.shift().bfill()).any(1).cumsum().add(1)
0 1
1 1
4 1
3 2
2 3
5 4
6 5
8 5
7 6
I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light' : pd.Series(['b','b','c','a','a','a','a'], index=[1,2,3,4,5,6,9]),'injury' : pd.Series([1,5,5,5,2,2,4], index=[1,2,3,4,5,6,9])}
testdf = pd.DataFrame(d)
injury light
1 1 b
2 5 b
3 5 c
4 5 a
5 2 a
6 2 a
9 4 a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
d 0 0 0 0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a','b','c','d']):
idx2 = (df['light'].isin([v]))
df2 = df[idx2]
print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?
Try this:
df = pd.crosstab(df.light, df.injury,margins=True)
df
injury 1 2 4 5 All
light
a 0 2 1 1 4
b 1 0 0 1 2
c 0 0 0 1 1
All 1 2 1 3 7
df["All"]
light
a 4
b 2
c 1
All 7