I assume this is a simple task for pandas, but I can't figure it out.
I have data like this:
Group Val
0 A 0
1 A 1
2 A <NA>
3 A 3
4 B 4
5 B <NA>
6 B 6
7 B <NA>
And I want to know the frequency of valid and NA values in Val per Group. This is the expected result:
A B Total
Valid 3 2 5
NA 1 2 3
Here is the code to generate that sample data:
#!/usr/bin/env python3
import pandas as pd

df = pd.DataFrame({
    'Group': list('AAAABBBB'),
    # nullable integer dtype so the column can hold pd.NA
    'Val': pd.array(range(8), dtype='Int64'),
})

# set some values to NA
for idx in [2, 5, 7]:
    df.iloc[idx, 1] = pd.NA

print(df)
What I tried is something with grouping:
>>> df.groupby('Group').agg(lambda x: x.isna())
Val
Group
A [False, False, True, False]
B [False, True, False, True]
>>> df.groupby('Group').apply(lambda x: x.isna())
Group Val
0 False False
1 False False
2 False True
3 False False
4 False False
5 False True
6 False False
7 False True
You are close with groupby and isna:
new = (df.groupby(['Group', df['Val'].isna().replace({True: 'NA', False: 'Valid'})])['Group']
         .count()
         .unstack(level=0))
new['Total'] = new.sum(axis=1)
print(new)
Group A B Total
Val
NA 1 2 3
Valid 3 2 5
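If you also want the rows in the same order as the expected output (Valid first, then NA), a small reindex on top of that should do it (just a sketch, assuming both labels are present):
new = new.reindex(['Valid', 'NA'])
print(new)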
Here is one way to do it:
# crosstab to summarize
# convert Val to 'NA' or 'Valid' depending on whether the value is missing
df2 = (pd.crosstab(df['Val'].isna().map({True: 'NA', False: 'Valid'}),
                   df['Group'])
         .reset_index()
         .rename_axis(columns=None))
df2['Total'] = df2.sum(axis=1, numeric_only=True)  # add Total column
out = df2.set_index('Val')  # set index to match the expected output
out
A B Total
Val
NA 1 2 3
Valid 3 2 5
If you need both the row and column totals, it's even simpler with crosstab:
df2 = pd.crosstab(df['Val'].isna().map({True: 'NA', False: 'Valid'}),
                  df['Group'],
                  margins=True, margins_name='Total')
df2
Group A B Total
Val
NA 1 2 3
Valid 3 2 5
Total 4 4 8
Another possible solution, based on pandas.pivot_table and on the following ideas:
Add a new column, status, which contains NA or Valid if the corresponding value is or is not NaN, respectively.
Create a pivot table, using len as aggregation function.
Add the Total column, by summing by rows.
import numpy as np

(df.assign(status=np.where(df['Val'].isna(), 'NA', 'Valid'))
   .pivot_table(index='status', columns='Group', values='Val',
                aggfunc=lambda x: len(x))
   .reset_index()
   .rename_axis(None, axis=1)
   .assign(Total=lambda x: x.sum(axis=1, numeric_only=True)))
Output:
status A B Total
0 NA 1 2 3
1 Valid 3 2 5
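A small variation on the same idea (a sketch, assuming a reasonably recent pandas): passing aggfunc='size' and dropping values avoids the lambda, because size counts rows per cell including those where Val is NaN:
import numpy as np

# count rows per (status, Group) cell, NaN values included
out = (df.assign(status=np.where(df['Val'].isna(), 'NA', 'Valid'))
         .pivot_table(index='status', columns='Group', aggfunc='size'))
out['Total'] = out.sum(axis=1)
print(out)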
Related
I have a dataframe with two columns: name and version.
I want to add a boolean in an extra column: True if the row has the highest version for its name, otherwise False.
import pandas as pd
data = [['a', 1], ['b', 2], ['a', 2], ['a', 2], ['b', 4]]
df = pd.DataFrame(data, columns = ['name', 'version'])
df
Is it best to use groupby for this?
I have tried something like this, but I do not know how to add the extra boolean column.
df.groupby(['name']).max()
Compare the original column against the maximal value per group, generated by GroupBy.transform with 'max', which returns a Series of per-group maxima aligned to the original rows:
df['bool_col'] = df['version'] == df.groupby('name')['version'].transform('max')
print(df)
name version bool_col
0 a 1 False
1 b 2 False
2 a 2 True
3 a 2 True
4 b 4 True
Detail:
print(df.groupby('name')['version'].transform('max'))
0 2
1 4
2 2
3 2
4 4
Name: version, dtype: int64
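Note that this marks every row that ties with its group's maximum (both rows with version 2 in group a above). If you only want the first occurrence per group marked, one possible variation (a sketch) uses idxmax instead:
# mark only the first row holding each group's maximum version
df['bool_col'] = df.index.isin(df.groupby('name')['version'].idxmax())
print(df)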
You can assign your column directly:
df['bool_col'] = df['version'] == max(df['version'])
Output:
name version bool_col
0 a 1 False
1 b 2 False
2 a 2 False
3 a 2 False
4 b 4 True
Is this what you were looking for?
I have a dataframe
df_input = pd.DataFrame(
{
"col_cate": ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
"target_bool": [True, False, True, False, True, False, True, False]
}
)
And I want to count the rows for each unique combination of col_cate and target_bool. So I am expecting the output to be like this:
col_cate, target_bool, cnt
'A' , True , 2
'A' , False , 2
'B' , True , 2
'B' , False , 2
But df_input.groupby(["col_cate", "target_bool"]).count() gives
Empty DataFrame
Columns: []
Index: [(A, False), (A, True), (B, False), (B, True)]
But adding a dummy to the df_input works, like df_input["dummy"] = 1.
How do I get the group by count table without adding a dummy?
df_input.groupby('col_cate')['target_bool'].value_counts()
col_cate target_bool
A False 2
True 2
B False 2
True 2
then you can reset_index()
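For example (a sketch; naming the count column explicitly avoids the column-name clashes some pandas versions hit on reset_index):
out = (df_input.groupby('col_cate')['target_bool']
               .value_counts()
               .reset_index(name='cnt'))
print(out)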
Because GroupBy.count counts values while excluding missing ones, it is necessary to specify a column after the groupby if both columns are used in the by parameter:
df = (df_input.groupby(by=["col_cate", "target_bool"])['col_cate']
              .count()
              .reset_index(name='cnt'))
print (df)
col_cate target_bool cnt
0 A False 2
1 A True 2
2 B False 2
3 B True 2
If you want to count all columns (here both, though the output is always the same), specify both columns:
df1 = (df_input.groupby(["col_cate", "target_bool"])[['col_cate', 'target_bool']]
               .count()
               .add_suffix('_count')
               .reset_index())
print (df1)
col_cate target_bool col_cate_count target_bool_count
0 A False 2 2
1 A True 2 2
2 B False 2 2
3 B True 2 2
Or use GroupBy.size, which works a bit differently - it counts all values without excluding missing ones, so no column needs to be specified:
df = df_input.groupby(["col_cate", "target_bool"]).size().reset_index(name='cnt')
print (df)
col_cate target_bool cnt
0 A False 2
1 A True 2
2 B False 2
3 B True 2
Like this also:
In [54]: df_input.groupby(df_input.columns.tolist()).size().reset_index().\
...: rename(columns={0:'cnt'})
Out[54]:
col_cate target_bool cnt
0 A False 2
1 A True 2
2 B False 2
3 B True 2
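As an aside, on pandas 1.1 or newer, DataFrame.value_counts counts unique rows directly, so no groupby is needed at all (a sketch):
out = df_input.value_counts(sort=False).reset_index(name='cnt')
print(out)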
In a pandas DataFrame, I want to detect the start and end positions of each block of consecutive False values in a column. If a block contains just one False, I would like to get that single position.
Example:
df = pd.DataFrame({"a": [True, True, True,False,False,False,True,False,True],})
In[110]: df
Out[111]:
a
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 False
8 True
In this example, I would like to get the positions
`3`, `5`
and
`7`, `7`.
Use:
a = (df.a.cumsum()[~df.a]
       .reset_index()
       .groupby('a')['index']
       .agg(['first', 'last'])
       .values
       .tolist())
print(a)
[[3, 5], [7, 7]]
Explanation:
First get the cumulative sum with cumsum - each block of consecutive False values gets its own unique group number:
print (df.a.cumsum())
0 1
1 2
2 3
3 3
4 3
5 3
6 4
7 4
8 5
Name: a, dtype: int32
Filter only the False rows by boolean indexing with the inverted boolean column:
print (df.a.cumsum()[~df.a])
3 3
4 3
5 3
7 4
Name: a, dtype: int32
Create a column from the index with reset_index:
print (df.a.cumsum()[~df.a].reset_index())
index a
0 3 3
1 4 3
2 5 3
3 7 4
For each group, aggregate with the first and last functions:
print (df.a.cumsum()[~df.a].reset_index().groupby('a')['index'].agg(['first','last']))
first last
a
3 3 5
4 7 7
Finally, convert to a nested list:
print (df.a.cumsum()[~df.a].reset_index().groupby('a')['index'].agg(['first','last']).values.tolist())
[[3, 5], [7, 7]]
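A variant of the same idea (just a sketch) labels the runs with shift instead - a common pattern for grouping consecutive equal values - and gives the same pairs:
# label consecutive runs of equal values, then keep only the False runs
runs = (df.a != df.a.shift()).cumsum()
blocks = (df.index.to_series()[~df.a]
            .groupby(runs[~df.a])
            .agg(['first', 'last'])
            .values
            .tolist())
print(blocks)
[[3, 5], [7, 7]]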
I have a data frame with categories and values. I need to find the value in each category closest to a value. I think I'm close but I can't really get the right output when applying the results of argsort to the original dataframe.
For example, if the input is defined as in the code below, the output should have only (a, 1, True), (b, 2, True), (c, 2, True) marked, and all other isClosest values should be False.
If multiple values are equally close, the first value listed should be marked.
Here is the code I have which works but I can't get it to reapply to the dataframe correctly. I would love some pointers.
import pandas as pd

df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['values'] = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
df['isClosest'] = False

uniqueCategories = df['category'].unique()
for c in uniqueCategories:
    filteredCategories = df[df['category'] == c]
    sortargs = (filteredCategories['values'] - 2.0).abs().argsort()
    # how to use sortargs to set isClosest=True in df for the value closest to 2.0 in each category?
You can create a column of absolute differences:
df['dif'] = (df['values'] - 2).abs()
df
Out:
category values dif
0 a 1 1
1 b 2 0
2 b 3 1
3 b 4 2
4 c 5 3
5 a 4 2
6 b 3 1
7 c 2 0
8 c 1 1
9 a 0 2
And then use groupby.transform to check whether the minimum value of each group is equal to the difference you calculated:
df['is_closest'] = df.groupby('category')['dif'].transform('min') == df['dif']
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
df.groupby('category')['dif'].idxmin() would also give you the indices of the closest values for each category. You can use that for mapping too.
For selection:
df.loc[df.groupby('category')['dif'].idxmin()]
Out:
category values dif
0 a 1 1
1 b 2 0
7 c 2 0
For assignment:
df['is_closest'] = False
df.loc[df.groupby('category')['dif'].idxmin(), 'is_closest'] = True
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
The difference between these approaches is that if you check equality against the difference, you would get True for all rows in case of ties. However, with idxmin it will return True for the first occurrence (only one for each group).
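To see that tie behaviour concretely, here is a tiny hypothetical example (made-up data) where two rows of the same category are equally close to 2:
import pandas as pd

tie = pd.DataFrame({'category': ['a', 'a'], 'values': [1, 3]})
tie['dif'] = (tie['values'] - 2).abs()  # both rows are 1 away from 2

# equality against the group minimum marks both tied rows
print(tie.groupby('category')['dif'].transform('min') == tie['dif'])

# idxmin marks only the first occurrence
print(tie.index.isin(tie.groupby('category')['dif'].idxmin()))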
Solution with DataFrameGroupBy.idxmin - get the indexes of the minimal values per group, then assign a boolean mask built with Index.isin to the isClosest column:
idx = (df['values'] - 2).abs().groupby([df['category']]).idxmin()
print (idx)
category
a 0
b 1
c 7
Name: values, dtype: int64
df['isClosest'] = df.index.isin(idx)
print (df)
category values isClosest
0 a 1 True
1 b 2 True
2 b 3 False
3 b 4 False
4 c 5 False
5 a 4 False
6 b 3 False
7 c 2 True
8 c 1 False
9 a 0 False
I want to eliminate all rows that are equal to certain values (or fall within a certain range) in a dataframe with a large number of columns. For example, if I had the following dataframe:
a b
0 1 0
1 2 1
2 3 2
3 0 3
and wanted to remove all rows containing 0, I could use:
a_df[(a_df['a'] != 0) & (a_df['b'] !=0)]
but this becomes a pain when you're dealing with a large number of columns. It could be done as:
for i in a_df.columns.values:
    a_df = a_df[a_df[i] != 0]
but this seems inefficient. Is there a better way to do this?
Just do it for the whole df and call dropna:
In [45]:
df[df != 0].dropna()
Out[45]:
a b
1 2 1
2 3 2
The condition df != 0 produces a boolean mask:
In [47]:
df != 0
Out[47]:
a b
0 True False
1 True True
2 True True
3 False True
When this is combined with the df it produces NaN values where the condition is not met:
In [48]:
df[df != 0]
Out[48]:
a b
0 1 NaN
1 2 1
2 3 2
3 NaN 3
Calling dropna drops any row with a NaN value.
Here's a variant of EdChum's approach. You could do df != 0 and then use all to make your selector:
>>> (df != 0).all(axis=1)
0 False
1 True
2 True
3 False
dtype: bool
and then use this to select:
>>> df.loc[(df != 0).all(axis=1)]
a b
1 2 1
2 3 2
The advantage of this is that it keeps NaNs if you want, e.g.
>>> df
a b
0 1 0
1 2 NaN
2 3 2
3 0 3
>>> df.loc[(df != 0).all(axis=1)]
a b
1 2 NaN
2 3 2
>>> df[(df != 0)].dropna()
a b
2 3 2
As you mentioned in your question, you may also need to drop rows that have a value in a certain set or range. You can do that as follows; suppose the values are 0, 10 and 20:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
mask = frame.applymap(lambda x: x not in [0, 10, 20])
frame[mask.all(axis=1)]
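If the values to exclude form a fixed set like this, a vectorized isin mask does the same thing without applymap; for an actual numeric range, you can build the mask from comparisons instead (both just sketches):
# same result as the applymap version: drop rows containing 0, 10 or 20
frame[~frame.isin([0, 10, 20]).any(axis=1)]

# drop rows with any value in the range 0..20 inclusive
frame[~((frame >= 0) & (frame <= 20)).any(axis=1)]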