I am learning pandas. I'm not sure when to use the .count() function and when to use .value_counts().
count() counts the number of non-NA/null observations along the given axis. It works with non-floating-point data as well (it detects both None and NaN).
As an example, create a DataFrame df:
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [10, 8, 12, None, 5, 3],
                   "B": [-1, None, 6, 4, None, 3],
                   "C": ["Shreyas", "Aman", "Apoorv", np.nan, "Kunal", "Ayush"]})
Find the count of non-NA values for each column (counting down the rows, axis=0):
df.count(axis = 0)
Output:
A 5
B 4
C 5
dtype: int64
Find the count of non-NA/null values for each row (counting across the columns, axis=1):
df.count(axis = 1)
Output:
0 3
1 2
2 3
3 1
4 2
5 3
dtype: int64
value_counts() returns a Series containing counts of unique values. The resulting object is in descending order, so the first element is the most frequently occurring value. It excludes NA values by default.
So for the example shown below
s = pd.Series([3, 1, 2, 3, 4, np.nan])
s.value_counts()
The output would be:
3.0 2
4.0 1
2.0 1
1.0 1
dtype: int64
value_counts() aggregates the data and counts each unique value. You can achieve the same with groupby, which is a broader tool for aggregating data in pandas.
count() simply returns the number of non-NaN/null values in the column (Series) you apply it to.
df = pd.DataFrame({'Id':['A', 'B', 'B', 'C', 'D', 'E', 'F', 'F'],
'Value':[10, 20, 15, 5, 35, 20, 10, 25]})
print(df)
Id Value
0 A 10
1 B 20
2 B 15
3 C 5
4 D 35
5 E 20
6 F 10
7 F 25
# Value counts
df['Id'].value_counts()
F 2
B 2
C 1
A 1
D 1
E 1
Name: Id, dtype: int64
# Same operation but with groupby
df.groupby('Id')['Id'].count()
Id
A 1
B 2
C 1
D 1
E 1
F 2
Name: Id, dtype: int64
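A closely related point (my addition, not from the original answer): GroupBy.size counts every row in a group, while GroupBy.count counts only non-null values, so the two differ once the aggregated column contains NaN:
# size() vs count() on the same grouping (illustrative sketch)
df.groupby('Id')['Value'].size()   # number of rows per Id, NaN values included
df.groupby('Id')['Value'].count()  # number of non-null Values per Id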
# Count()
df['Id'].count()
8
Example with NaN values and count:
print(df)
Id Value
0 A 10
1 B 20
2 B 15
3 NaN 5
4 D 35
5 E 20
6 F 10
7 F 25
df['Id'].count()
7
count() returns the total number of non-null values in the series.
value_counts() returns a series of the number of times each unique non-null value appears, sorted from most to least frequent.
As usual, an example is the best way to convey this:
ser = pd.Series(list('aaaabbbccdef'))
ser
>
0 a
1 a
2 a
3 a
4 b
5 b
6 b
7 c
8 c
9 d
10 e
11 f
dtype: object
ser.count()
>
12
ser.value_counts()
>
a 4
b 3
c 2
f 1
d 1
e 1
dtype: int64
Note that a DataFrame has the count() method, which returns a Series with the count() (scalar) value for each column in the df. Historically a DataFrame had no value_counts() method; newer pandas versions (1.1+) do add one, but it counts unique rows rather than the values within a single column.
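A minimal sketch of that distinction (my own example; the row-wise DataFrame.value_counts assumes pandas 1.1 or newer):
import pandas as pd

df = pd.DataFrame({"x": ["a", "a", "b"], "y": [1, 1, 2]})
print(df.count())         # per-column counts of non-null values: x -> 3, y -> 3
print(df.value_counts())  # counts of unique rows (pandas >= 1.1): (a, 1) -> 2, (b, 2) -> 1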
I'm trying to drop rows from a df where certain conditions are met. Using the code below, I'm grouping by column C. For each unique group, I want to drop ALL of its rows if any single row has A less than 1 AND B greater than 100 (both conditions on the same row). If I use .any() or .all(), it doesn't return what I want.
df = pd.DataFrame({
'A' : [1,0,1,0,1,0,0,1,0,1],
'B' : [101, 2, 3, 1, 5, 101, 2, 3, 4, 5],
'C' : ['d', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
})
df.groupby(['C']).filter(lambda g: g['A'].lt(1) & g['B'].gt(100))
initial df:
A B C
0 1 101 d # A is not lt 1 so keep all d's
1 0 2 d
2 1 3 d
3 0 1 d
4 1 5 e
5 0 101 e # A is lt 1 and B is gt 100 so drop all e's
6 0 2 e
7 1 3 f
8 0 4 f
9 1 5 f
intended out:
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
For better performance, get all C values that match the condition and then filter the original column C with Series.isin, using boolean indexing with an inverted mask:
df1 = df[~df['C'].isin(df.loc[df['A'].lt(1) & df['B'].gt(100), 'C'])]
Another idea is to use GroupBy.transform with 'any' to test whether at least one row in each group matches:
df1 = df[~(df['A'].lt(1) & df['B'].gt(100)).groupby(df['C']).transform('any')]
Your solution is possible if you reduce the mask to a scalar with any() and negate it, or equivalently (by De Morgan's law) keep groups where every row satisfies the inverse condition; for a large DataFrame it will be slow:
df1 = df.groupby(['C']).filter(lambda g: not (g['A'].lt(1) & g['B'].gt(100)).any())
df1 = df.groupby(['C']).filter(lambda g: (g['A'].ge(1) | g['B'].le(100)).all())
print (df1)
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
Using pandas, I want to filter out all groups that contain only zero values
So in pseudo-code something like this
df.groupby('my_group')['values'].filter(all(iszero))
An example input dataframe could be something like this:
import random
import pandas as pd

df = pd.DataFrame({'my_group': ['A', 'B', 'C', 'D']*3,
                   'values': [0 if (x % 4 == 0 or x == 11) else random.random() for x in range(12)]})
my_group values
0 A 0.000000
1 B 0.286104
2 C 0.359804
3 D 0.596152
4 A 0.000000
5 B 0.560742
6 C 0.534575
7 D 0.251302
8 A 0.000000
9 B 0.445010
10 C 0.750434
11 D 0.000000
Here, group A contains all zero values, so it should be filtered out. Group D also has a zero value in row 11, but in other rows it has non-zero values, so it shouldn't be filtered out
Here are possible solutions, from best to worst performance:
# filter group labels by != 0, then filter the original column by those labels
df1 = df[df['my_group'].isin(df.loc[df['values'].ne(0), 'my_group'])]
# create a mask with GroupBy.transform
df1 = df[df['values'].ne(0).groupby(df['my_group']).transform('any')]
# filter with a lambda function (slow for large data)
df1 = df.groupby('my_group').filter(lambda x: x['values'].ne(0).any())
print (df1)
my_group values
1 B 0.286104
2 C 0.359804
3 D 0.596152
5 B 0.560742
6 C 0.534575
7 D 0.251302
9 B 0.445010
10 C 0.750434
11 D 0.000000
IIUC, use a condition to keep the rows: if any value in the group is not equal (ne) to zero, then keep the group:
df2 = df.groupby('my_group').filter(lambda g: g['values'].ne(0).any())
output:
my_group values
1 B 0.286104
2 C 0.359804
3 D 0.596152
5 B 0.560742
6 C 0.534575
7 D 0.251302
9 B 0.445010
10 C 0.750434
11 D 0.000000
Or to get only the indices:
idx = df.groupby('my_group')['values'].filter(lambda s: s.ne(0).any()).index
output: Int64Index([1, 2, 3, 5, 6, 7, 9, 10, 11], dtype='int64')
You can use:
>>> df[df.groupby('my_group')['values'].transform('any')]
my_group values
1 B 0.507089
2 C 0.846842
3 D 0.953003
5 B 0.085316
6 C 0.482732
7 D 0.764508
9 B 0.879005
10 C 0.717571
11 D 0.000000
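A brief note on why this works (my addition, not part of the original answer): 'any' coerces the float values to booleans, so a group becomes True as soon as it contains one nonzero value, and transform broadcasts that group result back to every row:
# illustrative sketch of the intermediate mask
mask = df.groupby('my_group')['values'].transform('any')  # False for all rows of group A, True elsewhere
out = df[mask]                                            # same result as the one-liner above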
I have a dataset given below:
a,b,c
1,1,1
1,1,1
1,1,2
2,1,2
2,1,1
2,2,1
I created a crosstab with pandas:
cross_tab = pd.crosstab(index=a, columns=[b, c], rownames=['a'], colnames=['b', 'c'])
The resulting crosstab output is:
b 1 2
c 1 2 1
a
1 2 1 0
2 1 1 1
I want to iterate over this crosstab for given a, b and c values. How can I get values such as cross_tab[a=1][b=1, c=1]? Thank you.
You can use slicers:
a,b,c = 1,1,1
idx = pd.IndexSlice
print (cross_tab.loc[a, idx[b,c]])
2
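IndexSlice also makes it easy to take a whole slice of the column MultiIndex; a small extra illustration (my own, assuming the columns keep the sorted order pd.crosstab produces):
print (cross_tab.loc[:, idx[1, :]])  # every column whose b level equals 1, for all rows of a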
You can also reshape the DataFrame with DataFrame.unstack and reorder_levels, and then use loc:
a = cross_tab.unstack().reorder_levels(('a','b','c'))
print (a)
a b c
1 1 1 2
2 1 1 1
1 1 2 1
2 1 2 1
1 2 1 0
2 2 1 1
dtype: int64
print (a.loc[1,1,1])
2
You are looking for Index.get_level_values:
In [777]: cross_tab.loc[cross_tab.index.get_level_values('a') == 1,\
(cross_tab.columns.get_level_values('b') == 1)\
& (cross_tab.columns.get_level_values('c') == 1)]
Out[777]:
b 1
c 1
a
1 2
Another way to consider, albeit at the loss of a little readability, might be to simply use .loc to navigate the hierarchical index generated by pandas.crosstab. The following example illustrates it:
import pandas as pd
import numpy as np
np.random.seed(1234)
df = pd.DataFrame(
{
"a": np.random.choice([1, 2], 5, replace=True),
"b": np.random.choice([11, 12, 13], 5, replace=True),
"c": np.random.choice([21, 22, 23], 5, replace=True),
}
)
df
Output
a b c
0 2 11 23
1 2 11 23
2 1 12 23
3 2 12 21
4 1 12 21
The crosstab output is:
cross_tab = pd.crosstab(
index=df.a, columns=[df.b, df.c], rownames=["a"], colnames=["b", "c"]
)
cross_tab
b 11 12
c 23 21 23
a
1 0 1 1
2 2 1 0
Now let's say you want to access the value where a==2, b==11 and c==23; then simply do
cross_tab.loc[2].loc[11].loc[23]
2
Why does this work? .loc allows one to select by index labels. In the dataframe output by crosstab, our erstwhile column values now become index labels. Thus, with every .loc selection we do, it gives the slice of the dataframe corresponding to that index label. Let's navigate cross_tab.loc[2].loc[11].loc[23] step by step:
cross_tab.loc[2]
yields:
b c
11 23 2
12 21 1
23 0
Name: 2, dtype: int64
Next one:
cross_tab.loc[2].loc[11]
Yields:
c
23 2
Name: 2, dtype: int64
And finally we have
cross_tab.loc[2].loc[11].loc[23]
which yields:
2
Why do I say that this reduces readability a bit? Because to understand this selection you have to be aware of how the crosstab was created, i.e. rows are a and columns are in the order [b, c]. You have to know that to interpret what cross_tab.loc[2].loc[11].loc[23] would do. But I have found that to often be a good tradeoff.
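As a side note (my addition, not part of the answer above), the same cell can be reached in a single .loc call by passing the column levels as a tuple, which avoids the chained lookups:
cross_tab.loc[2, (11, 23)]  # row label a=2, column (b=11, c=23) -> 2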
I'm trying to get a new Series from a DataFrame. For each row of the DataFrame, this Series should contain the name of the leftmost column whose value is at or above some threshold, like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(5, 6)), columns=list('ABCDEF'))
>>> df
A B C D E F
0 2 4 6 8 8 4
1 2 0 9 7 7 1
2 1 7 7 7 3 0
3 5 4 4 0 1 7
4 9 6 1 5 1 5
min = 3
Expected Output:
0 B
1 C
2 B
3 A
4 A
dtype: object
Here the output's row 0 is "B" because, in DataFrame row index 0, column "B" is the leftmost column with a value equal to or bigger than min = 3.
I know that I can use df.idxmin(axis=1) to get the column name of the minimum for each row, but I have no clue at all how to tackle this more complex problem.
Thanks for help or hints!
UPDATE - index of the first element in each row satisfying the condition:
A more elegant and more efficient version from @DSM: idxmax returns the label of the first occurrence of the maximum, and the maximum of a boolean mask is True, so it yields the leftmost column satisfying the condition.
In [156]: (df>=3).idxmax(1)
Out[156]:
0 B
1 C
2 B
3 A
4 A
dtype: object
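One caveat worth adding here (my note, not from the original answer): if a row has no value meeting the condition, idxmax still returns the first column label, because every entry of that row in the boolean mask is False. A hedged sketch that returns NaN for such rows instead:
mask = df >= 3
mask.idxmax(axis=1).where(mask.any(axis=1))  # NaN where no column satisfies the condition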
my version:
In [149]: df[df>=3].apply(lambda x: x.first_valid_index(), axis=1)
Out[149]:
0 B
1 C
2 B
3 A
4 A
dtype: object
Old answer - index of the minimum element for each row:
In [27]: df[df>=3].idxmin(1)
Out[27]:
0 E
1 A
2 C
3 C
4 F
dtype: object
How do I add an order number column to an existing DataFrame?
This is my DataFrame:
import pandas as pd
import math
frame = pd.DataFrame([[1, 4, 2], [8, 9, 2], [10, 2, 1]], columns=['a', 'b', 'c'])
def add_stats(row):
row['sum'] = sum([row['a'], row['b'], row['c']])
row['sum_sq'] = sum(math.pow(v, 2) for v in [row['a'], row['b'], row['c']])
row['max'] = max(row['a'], row['b'], row['c'])
return row
frame = frame.apply(add_stats, axis=1)
print(frame.head())
The resulting data is:
a b c sum sum_sq max
0 1 4 2 7 21 4
1 8 9 2 19 149 9
2 10 2 1 13 105 10
First, I would like to add 3 extra columns with order numbers, sorting on sum, sum_sq and max, respectively. Next, these 3 columns should be combined into one column - the mean of the order numbers - but I do know how to do that part (with apply and axis=1).
I think you're looking for rank, since you mention sorting. Given your example, add:
frame['sum_order'] = frame['sum'].rank()
frame['sum_sq_order'] = frame['sum_sq'].rank()
frame['max_order'] = frame['max'].rank()
frame['mean_order'] = frame[['sum_order', 'sum_sq_order', 'max_order']].mean(axis=1)
To get:
a b c sum sum_sq max sum_order sum_sq_order max_order mean_order
0 1 4 2 7 21 4 1 1 1 1.000000
1 8 9 2 19 149 9 3 3 2 2.666667
2 10 2 1 13 105 10 2 2 3 2.333333
The rank method has some options as well, for example to specify the behavior in the case of identical or NA values.
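For instance, a short sketch of those options (my own illustration, not from the original answer):
frame['sum'].rank(method='min')        # tied values all receive the lowest rank of the tie group
frame['sum'].rank(method='dense')      # like 'min', but ranks increase by 1 between groups
frame['sum'].rank(na_option='bottom')  # assign NA values the largest rank instead of NaN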