pandas count over multiple columns - python

I have a dataframe that looks like this:
Measure1 Measure2 Measure3 ...
0 1 3
1 3 2
3 0
I'd like to count the occurrences of the values over the columns to produce:
Measure Count Percentage
0 2 0.25
1 2 0.25
2 1 0.125
3 3 0.375
With
outcome_measure_count = cdss_data.groupby(key_columns=['Measure1'],operations={'count': agg.COUNT()}).sort('count', ascending=True)
I only get the first column (I'm actually using the graphlab package, but I'd prefer pandas).
Could someone help me?

You can generate the counts by flattening the df with ravel and value_counts; from those you can construct the final df:
In [230]:
import io
import pandas as pd
t="""Measure1 Measure2 Measure3
0 1 3
1 3 2
3 0 0"""
df = pd.read_csv(io.StringIO(t), sep='\s+')
df
Out[230]:
Measure1 Measure2 Measure3
0 0 1 3
1 1 3 2
2 3 0 0
In [240]:
count = pd.Series(df.squeeze().values.ravel()).value_counts()
pd.DataFrame({'Measure': count.index, 'Count':count.values, 'Percentage':(count/count.sum()).values})
Out[240]:
Count Measure Percentage
0 3 3 0.333333
1 3 0 0.333333
2 2 1 0.222222
3 1 2 0.111111
I inserted a 0 just to make the df shape rectangular, but you should get the point.
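If the real data is ragged (the last row of the example has only two values), a minimal sketch along these lines avoids the padding: let read_csv fill the short row with NaN and use stack, which drops the NaN before counting. It reuses the column names from the question; exact output formatting may vary with the pandas version.
import io
import pandas as pd

t = """Measure1 Measure2 Measure3
0 1 3
1 3 2
3 0"""

# the short last row is padded with NaN by read_csv instead of a fake 0
df = pd.read_csv(io.StringIO(t), sep='\s+')

# stack() drops the NaN, so only real observations are counted
counts = df.stack().value_counts()
result = pd.DataFrame({'Count': counts,
                       'Percentage': counts / counts.sum()}).sort_index()
result.index.name = 'Measure'
print(result)
         Count  Percentage
Measure
0.0          2       0.250
1.0          2       0.250
2.0          1       0.125
3.0          3       0.375
The float Measure values come from the NaN upcast, as in the answer below.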

In [67]: from pandas import DataFrame, Series; import numpy as np
In [68]: df=DataFrame({'m1':[0,1,3], 'm2':[1,3,0], 'm3':[3,2, np.nan]})
In [69]: df
Out[69]:
m1 m2 m3
0 0 1 3.0
1 1 3 2.0
2 3 0 NaN
In [70]: df=df.apply(Series.value_counts).sum(1).to_frame(name='Count')
In [71]: df
Out[71]:
Count
0.0 2.0
1.0 2.0
2.0 1.0
3.0 3.0
In [72]: df.index.name='Measure'
In [73]: df
Out[73]:
Count
Measure
0.0 2.0
1.0 2.0
2.0 1.0
3.0 3.0
In [74]: df['Percentage']=df.Count.div(df.Count.sum())
In [75]: df
Out[75]:
Count Percentage
Measure
0.0 2.0 0.250
1.0 2.0 0.250
2.0 1.0 0.125
3.0 3.0 0.375

Related

Filling missing data in pandas dataframe on the basis of a value in another column [duplicate]

I have a dataframe with 4 columns (A, B, C, D). D has some NaN entries. I want to fill the NaN values with the average value of D over the rows having the same values of A, B, C.
For example, if the values of A, B, C, D are x, y, z and NaN respectively, then I want the NaN to be replaced by the average of D over the rows where A, B, C are x, y, z respectively.
df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean')) would be faster than apply
In [2400]: df
Out[2400]:
A B C D
0 1 1 1 1.0
1 1 1 1 NaN
2 1 1 1 3.0
3 3 3 3 5.0
In [2401]: df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
Out[2401]:
0 1.0
1 2.0
2 3.0
3 5.0
Name: D, dtype: float64
In [2402]: df['D'] = df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
In [2403]: df
Out[2403]:
A B C D
0 1 1 1 1.0
1 1 1 1 2.0
2 1 1 1 3.0
3 3 3 3 5.0
Details
In [2396]: df.shape
Out[2396]: (10000, 4)
In [2398]: %timeit df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
100 loops, best of 3: 3.44 ms per loop
In [2397]: %timeit df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
100 loops, best of 3: 5.34 ms per loop
I think you need:
df.D = df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
Sample:
df = pd.DataFrame({'A':[1,1,1,3],
'B':[1,1,1,3],
'C':[1,1,1,3],
'D':[1,np.nan,3,5]})
print (df)
A B C D
0 1 1 1 1.0
1 1 1 1 NaN
2 1 1 1 3.0
3 3 3 3 5.0
df.D = df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
print (df)
A B C D
0 1 1 1 1.0
1 1 1 1 2.0
2 1 1 1 3.0
3 3 3 3 5.0
Link to duplicate of this question for further information:
Pandas Dataframe: Replacing NaN with row average
Another suggested way of doing it mentioned in the link is using a simple fillna on the transpose:
df.T.fillna(df.mean(axis=1)).T
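For completeness, a rough sketch of that transpose idiom on a tiny frame; note it fills each NaN with the mean of its own row, which is a different statistic from the grouped mean used above, so it only applies when a row average is actually what you want.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 4.0], 'B': [np.nan, 6.0], 'C': [3.0, np.nan]})
# transpose, let fillna broadcast the row means across the columns, transpose back
filled = df.T.fillna(df.mean(axis=1)).T
print(filled)
     A    B    C
0  1.0  2.0  3.0
1  4.0  6.0  5.0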

Forward fill column with an index-based limit

I want to forward fill a column and I want to specify a limit, but I want the limit to be based on the index, not a simple number of rows as the limit argument allows.
For example, say I have the dataframe given by:
df = pd.DataFrame({
'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
})
which looks like
In [27]: df
Out[27]:
data group
0 0.0 0
1 1.0 0
2 NaN 0
3 3.0 1
4 NaN 1
5 5.0 0
6 NaN 0
7 NaN 0
8 NaN 1
9 NaN 1
If I group by the group column and forward fill in that group with limit=2, then my resulting data frame will be
In [35]: df.groupby('group').ffill(limit=2)
Out[35]:
group data
0 0 0.0
1 0 1.0
2 0 1.0
3 1 3.0
4 1 3.0
5 0 5.0
6 0 5.0
7 0 5.0
8 1 3.0
9 1 NaN
What I actually want to do here, however, is only forward fill onto rows whose indexes are within say 2 of the index of the last non-NaN value in each group, as opposed to the next 2 rows of each group. For example, if we just look at the groups on the dataframe:
In [36]: for i, group in df.groupby('group'):
...: print(group)
...:
data group
0 0.0 0
1 1.0 0
2 NaN 0
5 5.0 0
6 NaN 0
7 NaN 0
data group
3 3.0 1
4 NaN 1
8 NaN 1
9 NaN 1
I would want the second group here to only be forward filled to index 4, not 8 and 9. The first group's NaN values are all within 2 indexes from the last non-NaN values, so they would be filled completely. The resulting dataframe would look like:
group data
0 0 0.0
1 0 1.0
2 0 1.0
3 1 3.0
4 1 3.0
5 0 5.0
6 0 5.0
7 0 5.0
8 1 NaN
9 1 NaN
FWIW, in my actual use case my index is a DatetimeIndex (and it is sorted).
I currently have a solution which sort of works, requiring looping through the dataframe filtered on the group indexes, creating a time range for every single event with a non-NaN value based on the index, and then combining those. But this is far too slow to be practical.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'data': [0.0, 1.0, 1, 3.0, np.nan, 22, np.nan, 5, np.nan, np.nan],
'group': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]})
df = df.reset_index()
df['stop_index'] = df['index'] + 2
df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
print(df)
# index data group stop_index mask
# 0 0 0.0 0 2.0 True
# 1 1 1.0 0 3.0 True
# 2 2 1.0 1 4.0 True
# 3 3 3.0 0 5.0 True
# 4 4 1.0 1 4.0 True
# 5 5 22.0 0 7.0 True
# 6 6 NaN 1 4.0 False
# 7 7 5.0 0 9.0 True
# 8 8 NaN 1 4.0 False
# 9 9 NaN 1 4.0 False
# clean up df
df = df[['data', 'group']]
print(df)
yields
data group
0 0.0 0
1 1.0 0
2 1.0 1
3 3.0 0
4 1.0 1
5 22.0 0
6 NaN 1
7 5.0 0
8 NaN 1
9 NaN 1
This copies the index into a column, then
makes a second stop_index column which is the index augmented by the size of
the (time) window.
df = df.reset_index()
df['stop_index'] = df['index'] + 2
Then it makes null rows in stop_index to match null rows in data:
df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
Then it forward-fills stop_index on a per-group basis:
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
Now (at last) we can define the desired mask -- the places where we actually want to forward-fill data:
df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
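The question mentions that the real index is a sorted DatetimeIndex; the same stop_index idea should carry over by adding a Timedelta instead of an integer offset. A rough sketch, assuming a 2-day window and the data/group columns from the question (the dates themselves are made up for the example):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'data':  [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]},
    index=pd.date_range('2017-01-01', periods=10, freq='D'))

df = df.reset_index()                                   # the dates become the 'index' column
df['stop_index'] = df['index'] + pd.Timedelta(days=2)   # a time window instead of +2 rows
df['stop_index'] = df['stop_index'].where(df['data'].notnull())
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
mask = df['index'] <= df['stop_index']
df.loc[mask, 'data'] = df.groupby('group')['data'].ffill()
df = df.set_index('index')[['data', 'group']]
print(df)
            data  group
index
2017-01-01   0.0      0
2017-01-02   1.0      0
2017-01-03   1.0      0
2017-01-04   3.0      1
2017-01-05   3.0      1
2017-01-06   5.0      0
2017-01-07   5.0      0
2017-01-08   5.0      0
2017-01-09   NaN      1
2017-01-10   NaN      1
Rows more than 2 days past the last valid observation in their group stay NaN, which matches the behaviour asked for in the question.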
IIUC
l = []
for i, group in df.groupby('group'):
    idx = group.index
    l.append(group.reindex(df.index).ffill(limit=2).loc[idx])
pd.concat(l).sort_index()
data group
0 0.0 0.0
1 1.0 0.0
2 1.0 0.0
3 3.0 1.0
4 3.0 1.0
5 5.0 0.0
6 5.0 0.0
7 5.0 0.0
8 NaN 1.0
9 NaN 1.0
Testing data
data group
0 0.0 0
1 1.0 0
2 1.0 1
3 3.0 0
4 NaN 1
5 22 0
6 NaN 1
7 5.0 0
8 NaN 1
9 NaN 1
My method on the testing data
data group
0 0.0 0.0
1 1.0 0.0
2 1.0 1.0
3 3.0 0.0
4 1.0 1.0
5 22.0 0.0
6 NaN 1.0 # unchanged here, since the previous two index positions have no valid value for group 1
7 5.0 0.0
8 NaN 1.0
9 NaN 1.0
Output with unutbu's method
data group
0 0.0 0
1 1.0 0
2 1.0 1
3 3.0 0
4 1.0 1
5 22.0 0
6 1.0 1 # mismatch here
7 5.0 0
8 NaN 1
9 NaN 1

Pandas agg function with operations on multiple columns

I am interested in whether we can use the pandas.core.groupby.DataFrameGroupBy.agg function to perform arithmetic operations on multiple columns. For example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(15).reshape(5, 3))
df['C'] = [0, 0, 2, 2, 5]
print(df.groupby('C').mean()[0] - df.groupby('C').mean()[1])
print(df.groupby('C').agg({0: 'mean', 1: 'sum', 2: 'nunique', 'C': 'mean0-mean1'}))
Is it somehow possible to get a result like in this example: the difference between the means of column 0 and column 1, grouped by column 'C'?
df
0 1 2 C
0 0 1 2 0
1 3 4 5 0
2 6 7 8 2
3 9 10 11 2
4 12 13 14 5
Grouped difference
C
0 -1.0
2 -1.0
5 -1.0
dtype: float64
I am not interested in solutions that do not use the agg method. I am curious only whether the agg method can take multiple columns as an argument and then perform some operation on them to return one column after the job is done.
IIUC:
In [12]: df.groupby('C').mean().diff(axis=1)
Out[12]:
0 1 2
C
0 NaN 1.0 1.0
2 NaN 1.0 1.0
5 NaN 1.0 1.0
or
In [13]: df.groupby('C').mean().diff(-1, axis=1)
Out[13]:
0 1 2
C
0 -1.0 -1.0 NaN
2 -1.0 -1.0 NaN
5 -1.0 -1.0 NaN
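For what it's worth, agg is essentially applied column by column, which is why the question's 'mean0-mean1' string cannot work as a single agg entry. A minimal sketch of the usual workaround, shown only for contrast since it uses apply rather than agg and the asker explicitly ruled out non-agg solutions:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3))
df['C'] = [0, 0, 2, 2, 5]

# apply receives each group as a whole frame, so cross-column arithmetic is direct
diff_of_means = df.groupby('C').apply(lambda g: g[0].mean() - g[1].mean())
print(diff_of_means)
C
0   -1.0
2   -1.0
5   -1.0
dtype: float64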

Python: Pandas use slice with .describe() versions greater than 0.20

Using this because it's convenient.
http://nbviewer.jupyter.org/gist/aflaxman/436cde71f85b93638959
df = pd.DataFrame({'A': [0,0,0,0,1,1],
'B': [1,2,3,4,5,6],
'C': [8,9,10,11,12,13]})
This used to work!
Now:
>>> pandas.__version__
u'0.20.3'
df.groupby('A').describe().unstack()\
.loc[:,(slice(None),['count','mean']),]
Gives:
TypeError: '['count', 'mean']' is an invalid key
For selecting by columns, remove the unstack, because the formatting of groupby describe changed in version 0.20.0:
df = df.groupby('A').describe().loc[:,(slice(None),['count','mean'])]
print (df)
B C
count mean count mean
A
0 4.0 2.5 4.0 9.5
1 2.0 5.5 2.0 12.5
If you keep the unstack, the MultiIndex ends up in the index, so the leading : for rows is dropped because all index values are selected.
A slice(None) is also added, because the MultiIndex now has 3 levels:
df = df.groupby('A').describe().unstack()\
.loc[(slice(None),['count','mean'],slice(None))]
print (df)
A
B count 0 4.0
1 2.0
mean 0 2.5
1 5.5
C count 0 4.0
1 2.0
mean 0 9.5
1 12.5
dtype: float64
Alternative solutions:
idx = pd.IndexSlice
df = df.groupby('A').describe().unstack()\
.loc[idx[:,['count','mean'],:]]
print (df)
A
B count 0 4.0
1 2.0
mean 0 2.5
1 5.5
C count 0 4.0
1 2.0
mean 0 9.5
1 12.5
dtype: float64
df = df.groupby('A').describe().unstack()\
.loc(axis=0)[:,['count','mean'],:]
print (df)
A
B count 0 4.0
1 2.0
mean 0 2.5
1 5.5
C count 0 4.0
1 2.0
mean 0 9.5
1 12.5
dtype: float64
More information is in the pandas documentation on using slicers.
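The same column selection can also be written with pd.IndexSlice directly on the columns; this is just the first solution above rewritten, which some find easier to read than the nested slice(None) tuple:
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 1, 1],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [8, 9, 10, 11, 12, 13]})

idx = pd.IndexSlice
out = df.groupby('A').describe().loc[:, idx[:, ['count', 'mean']]]
print(out)
      B          C
  count mean count mean
A
0   4.0  2.5   4.0  9.5
1   2.0  5.5   2.0 12.5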

Implement a counter which resets in a Python pandas dataframe

Hi, I would like to implement a counter which counts the number of successive zero observations in a dataframe (across multiple columns), and resets whenever a non-zero observation is found. I have used a for loop, but it is incredibly slow; I am sure there must be far more efficient ways. This is my code:
Here is a snapshot of df
df.head()
ACL ACT ADH ADR AFE AFH AFT
2013-02-05 NaN NaN NaN NaN NaN NaN NaN
2013-02-12 -0.136861 -0.020406 0.046150 0.000000 -0.005321 NaN 0.058195
2013-02-19 -0.006632 0.041665 0.007365 0.012738 0.040930 NaN -0.037818
2013-02-26 -0.023848 -0.023999 -0.030677 -0.003144 0.050604 NaN -0.047604
2013-03-05 0.009771 -0.024589 -0.021073 -0.039432 0.047315 NaN 0.068727
I first initialise an empty dataframe which has the same index and columns as df above
df1 = pd.DataFrame(index=df.index, columns=df.columns)
df1 = df1.fillna(0)
Then I create my function, which iterates over the rows but only deals with one column at a time:
def zero_obs(x=df, y=df1):
    for i in range(len(x)):
        if x[i] == 0:
            y[i] = y[i-1] + 1
        else:
            y[i] = 0
    return y

for col in df.columns:
    df1[col] = zero_obs(x=df[col], y=df1[col])
Really appreciate any help!!
The output I expect is as follows:
df1.tail()
BRN AXL TTO AGL ACL
2017-01-03 3 125 0 0 0
2017-01-10 0 126 0 0 0
2017-01-17 1 127 0 0 0
2017-01-24 0 128 0 0 0
2017-01-31 0 129 1 0 0
setup
Consider the dataframe df
df = pd.DataFrame(
np.zeros((10, 2), dtype=int),
columns=list('AB')
)
df.loc[[0, 4, 8], 'A'] = 1
df.loc[6, 'B'] = 1
print(df)
A B
0 1 0
1 0 0
2 0 0
3 0 0
4 1 0
5 0 0
6 0 1
7 0 0
8 1 0
9 0 0
Option 1
pandas apply
def zero_obs(x):
    """`x` is assumed to be a `pd.Series`"""
    csum = x.eq(0).cumsum()
    cpos = csum.where(x.ne(0)).ffill().fillna(0)
    return csum.sub(cpos)
print(df.apply(zero_obs))
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
Option 2
don't use apply
This function works just as well on df
zero_obs(df)
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
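To see why the cumsum trick works, here is a small sketch printing the intermediates for column A of the setup frame above (the names mirror the function): csum counts the zeros seen so far, cpos freezes that count at the last non-zero row, and their difference is the length of the current zero run.
x = df['A']
csum = x.eq(0).cumsum()                       # running count of zeros seen so far
cpos = csum.where(x.ne(0)).ffill().fillna(0)  # that count as of the last non-zero row
print(pd.concat([x, csum, cpos, csum.sub(cpos)], axis=1,
                keys=['A', 'csum', 'cpos', 'run']))
   A  csum  cpos  run
0  1     0   0.0  0.0
1  0     1   0.0  1.0
2  0     2   0.0  2.0
3  0     3   0.0  3.0
4  1     3   3.0  0.0
5  0     4   3.0  1.0
6  0     5   3.0  2.0
7  0     6   3.0  3.0
8  1     6   6.0  0.0
9  0     7   6.0  1.0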
