Python pandas: groupby then count rows satisfying condition

I am trying to do a groupby on the id column so that I can show the number of rows in col1 that are equal to 1.
df:
id col1 col2 col3
a 1 1 1
a 0 1 1
a 1 1 1
b 1 0 1
My code:
df.groupby(['id'])['col1'].count()[1]
The output I got was 2. It didn't show me the values for the other ids, like b.
I want:
id col1
a 2
b 1
If possible, can the total rows per id also be displayed as a new column?
Example:
id col1 total
a 2 3
b 1 1

Assuming you have only 1 and 0 in col1, you can use agg to get both the sum of the ones and the total row count per group. (Note that this dict-based column-renaming form of agg was deprecated in pandas 0.20 and removed in pandas 1.0; on older versions it works as shown.)
df.groupby('id', as_index=False)['col1'].agg({'col1': 'sum', 'total': 'count'})
# id total col1
#0 a 3 2
#1 b 1 1
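In pandas 0.25 and later, the same result can be written with named aggregation, the supported replacement for the dict form. A sketch assuming the DataFrame from the question:

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "id": ["a", "a", "a", "b"],
    "col1": [1, 0, 1, 1],
    "col2": [1, 1, 1, 0],
    "col3": [1, 1, 1, 1],
})

# Named aggregation: 'sum' counts the 1s, 'count' gives total rows per id
out = df.groupby("id")["col1"].agg(col1="sum", total="count").reset_index()
print(out)
#   id  col1  total
# 0  a     2      3
# 1  b     1      1
```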

Your original code doesn't work because count() counts every non-null row in each group, regardless of its value, so it never filters on col1 == 1; indexing the result afterwards then picks out a single group's count instead of showing all ids.
Yes, you can add the total to your output: include a counting aggregation (count or size) alongside the sum in the same groupby, as shown in the code above.

If you want to generalize the solution to handle values in col1 other than 0 and 1, you can do the following. This also orders the columns correctly.
df.set_index('id')['col1'].eq(1).groupby(level=0).agg([('col1', 'sum'), ('total', 'count')]).reset_index()
id col1 total
0 a 2.0 3
1 b 1.0 1
Using a tuple in the agg method, where the first value is the column name and the second the aggregating function, is new to me. I was just experimenting and it seemed to work; I don't remember seeing it in the documentation, so use with caution.
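The same boolean-mask idea can be written without set_index, using named aggregation (pandas >= 0.25) instead of the tuple form. A sketch assuming the question's data, counting rows where col1 equals 1:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "a", "a", "b"],
    "col1": [1, 0, 1, 1],
})

# Build a boolean mask, then aggregate it per id:
# summing booleans counts the True values; 'count' gives the group size
mask = df["col1"].eq(1)
out = mask.groupby(df["id"]).agg(col1="sum", total="count").reset_index()
print(out)
#   id  col1  total
# 0  a     2      3
# 1  b     1      1
```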

Related

How to search and select column names based on values?

Suppose I have a pandas DataFrame like this
name col1 col2 col3
0 AAA 1 0 2
1 BBB 2 1 2
2 CCC 0 0 2
I want (a) the names of any columns that contain a value of 2 anywhere in the column (i.e., col1, col3), and (b) the names of any columns that contain only values of 2 (i.e., col3).
I understand how to use DataFrame.any() and DataFrame.all() to select rows in a DataFrame where a value appears in any or all columns, but I'm trying to find COLUMNS where a value appears in (a) any or (b) all rows.
You can do what you described with columns:
df.columns[df.eq(2).any()]
# Index(['col1', 'col3'], dtype='object')
df.columns[df.eq(2).all()]
# Index(['col3'], dtype='object')
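End to end, with the frame from the question (the string name column compares unequal to 2 everywhere, so it drops out on its own), a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["AAA", "BBB", "CCC"],
    "col1": [1, 2, 0],
    "col2": [0, 1, 0],
    "col3": [2, 2, 2],
})

# (a) columns where 2 appears in any row
any_cols = df.columns[df.eq(2).any()]
# (b) columns where every row is 2
all_cols = df.columns[df.eq(2).all()]
print(list(any_cols))  # ['col1', 'col3']
print(list(all_cols))  # ['col3']
```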
You can also loop over the columns for (a):
for column in df.columns:
    if (df[column] == 2).any():
        print(column + " contains 2")
This prints the name of each column containing one or more 2s.
And for (b):
for column in df.columns:
    if (df[column] == 2).all():
        print(column + " contains only 2")

Splitting column values into multiple rows and column in pandas

I am facing a problem with pandas. The input data is a single column :
MixedColumn
-------------
20_5, 20_5**1
20_7**9
20_4, 40_4, 15_4**2
And what I want to split and transform it into something like this :
Col1 Col2
--------------
20_5 1
20_5 1
20_7 9
20_4 2
40_4 2
15_4 2
The logic: split each row item (e.g. 20_5, 20_5) on the comma (if present) and place the pieces in consecutive rows of the same column (Col1), and split off the part after ** (e.g. **1), associating it with each of the individual values in a separate column (Col2).
Sorry if this is a noob question. Any hints will surely help me out. Thanks, and I wish you all a happy holiday.
First split on ** to get Col2 with Series.str.split and expand=True.
Then we use DataFrame.explode to make a new row for each element to create Col1:
note: this requires pandas >= 0.25.0
df[['Col1', 'Col2']] = df['MixedColumn'].str.split(r'\*\*', expand=True)
df = df.assign(Col1=df['Col1'].str.split(', ')).explode('Col1').drop(columns='MixedColumn')
Col1 Col2
0 20_5 1
0 20_5 1
1 20_7 9
2 20_4 2
2 40_4 2
2 15_4 2
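Put together as a self-contained run, building the example frame from the question first:

```python
import pandas as pd

df = pd.DataFrame({"MixedColumn": ["20_5, 20_5**1", "20_7**9", "20_4, 40_4, 15_4**2"]})

# Split on the literal '**' (escaped, since str.split treats the pattern as a regex)
df[["Col1", "Col2"]] = df["MixedColumn"].str.split(r"\*\*", expand=True)
# Split Col1 on ', ' into lists, then explode one list element per row
df = (df.assign(Col1=df["Col1"].str.split(", "))
        .explode("Col1")
        .drop(columns="MixedColumn"))
print(df)
#    Col1 Col2
# 0  20_5    1
# 0  20_5    1
# 1  20_7    9
# 2  20_4    2
# 2  40_4    2
# 2  15_4    2
```

Note that Col2 holds strings after the split; cast with astype(int) if you need numbers.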
Starting with
df = pd.DataFrame({"mixed_column": ["20_5, 20_5**1", "20_7**9", "20_4, 40_4, 15_4**2"]})
df_split = df.mixed_column.str.rsplit("**").apply(pd.Series)
df_split[0] = df_split[0].str.split(", ")
df_split = df_split.explode(0)
which gives you
0 1
0 20_5 1
0 20_5 1
1 20_7 9
2 20_4 2
2 40_4 2
2 15_4 2

How to count the number of occurrences that a certain value occurs in a DataFrame according to another column?

I have a Pandas DataFrame that has two columns as such:
item1 label
0 a 0
1 a 1
2 b 0
3 c 0
4 a 1
5 a 0
6 b 0
In sum, there are a total of three kinds of items in the column item1, namely a, b, and c. The entries of the label column are either 0 or 1.
What I want to do is receive a DataFrame where I have a count of how many entries in item1 have label value 1. Using the toy example above, the desired DataFrame would be something like:
item1 label
0 a 2
1 b 0
2 c 0
How might I achieve something like that?
I've tried using the following line of code:
df[['item1', 'label']].groupby('item1').sum()['label']
but the result is a Pandas Series and also displays some behaviors and properties that aren't desired.
IIUC, you can use pd.crosstab:
count_1=pd.crosstab(df['item1'],df['label'])[1]
print(count_1)
item1
a 2
b 0
c 0
Name: 1, dtype: int64
To get a DataFrame:
count_1=pd.crosstab(df['item1'],df['label'])[1].rename('label').reset_index()
print(count_1)
item1 label
0 a 2
1 b 0
2 c 0
The good thing about this method is that it also gives you the count of 0s easily, which you don't get with the sum approach.
Filtering the columns before the groupby is not necessary; you can specify the column after the groupby for the sum aggregation. For a 2-column DataFrame, add the as_index=False parameter:
df = df.groupby('item1', as_index=False)['label'].sum()
Alternative is use Series.reset_index:
df = df.groupby('item1')['label'].sum().reset_index()
print (df)
item1 label
0 a 2
1 b 0
2 c 0
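Both approaches side by side, assuming the toy frame from the question:

```python
import pandas as pd

df = pd.DataFrame({"item1": list("aabcaab"), "label": [0, 1, 0, 0, 1, 0, 0]})

# groupby + sum: counts the 1s per item (works because labels are 0/1)
by_sum = df.groupby("item1", as_index=False)["label"].sum()
print(by_sum)
#   item1  label
# 0     a      2
# 1     b      0
# 2     c      0

# crosstab: gives the counts of 0s and 1s at once
ct = pd.crosstab(df["item1"], df["label"])
print(ct)
# label  0  1
# item1
# a      2  2
# b      2  0
# c      1  0
```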

Pandas - select rows with best values

I have this dataframe
col1 col2 col3
0 2 A 1
1 1 A 100
2 3 B 12
3 4 B 2
I want to select the highest col1 value from all with A, then the one from all with B, etc, i.e. this is the desired output
col1 col2 col3
0 2 A 1
3 4 B 2
I know I need some kind of groupby('col2'), but I don't know what to use after that.
Is this what you want?
In [16]: df.groupby('col2').max().reset_index()
Out[16]:
  col2  col1
0    A     2
1    B     4
Note that max() is computed per column independently, so values taken from other columns need not come from the same row within a group.
Use groupby('col2'), then idxmax to get the index of the max value within each group. Finally, use these index values to slice the original dataframe.
df.loc[df.groupby('col2').col1.idxmax()]
Notice that the index values of the original dataframe are preserved.
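A quick check of the idxmax approach against the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [2, 1, 3, 4],
    "col2": ["A", "A", "B", "B"],
    "col3": [1, 100, 12, 2],
})

# Index label of the row with the max col1 in each col2 group, then slice
out = df.loc[df.groupby("col2")["col1"].idxmax()]
print(out)
#    col1 col2  col3
# 0     2    A     1
# 3     4    B     2
```

Because whole rows are selected, col3 stays aligned with its winning row, unlike a plain per-column max.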

Order columns of DataFrame according to values

I have the following input:
col1 col2 col3
1 4 0
0 12 2
2 12 4
3 2 1
I want to sort the DataFrame according to the values in the columns, e.g. sorting it primarily for df[df==0].count() and secondarily for df.sum() would produce the output:
col2 col3 col1
4 0 1
12 2 0
12 4 2
2 1 3
pd.DataFrame.sort() takes a column as argument, which does not apply here, so how can I achieve this?
Firstly, I think your zero count is increasing from right to left whereas your sum is decreasing, so you may need to clarify which takes priority. You can get the number of zero rows simply with (df == 0).sum().
To sort by a single aggregate, you can do something like the following (sort_values replaced the old sort method in pandas 0.20):
col_order = (df == 0).sum().sort_values().index
df[col_order]
This sorts the series of aggregates by its values and the resulting index is the columns of df in the order you want.
To sort on two sets of values would be more awkward/tricky but you could do something like
aggs = pd.DataFrame({'zero_count': (df == 0).sum(), 'sum': df.sum()})
col_order = aggs.sort_values(['zero_count', 'sum']).index
df[col_order]
Note that the sort_values method takes an ascending parameter which accepts either a Boolean or a list of Booleans of length equal to the number of columns you are sorting on, e.g.
df.sort_values(['a', 'b'], ascending=[True, False])
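A runnable version of the two-key sort on the question's data, using zero count ascending and then sum descending, which reproduces the desired column order:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 0, 2, 3],
    "col2": [4, 12, 12, 2],
    "col3": [0, 2, 4, 1],
})

# One aggregate per column: how many zeros, and the column total
aggs = pd.DataFrame({"zero_count": (df == 0).sum(), "sum": df.sum()})
# Order columns by zero count (ascending), breaking ties by sum (descending)
col_order = aggs.sort_values(["zero_count", "sum"], ascending=[True, False]).index
out = df[col_order]
print(list(out.columns))  # ['col2', 'col3', 'col1']
```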
