Python: Pandas use slice with .describe() in versions greater than 0.20

Using this because it's convenient:
http://nbviewer.jupyter.org/gist/aflaxman/436cde71f85b93638959
df = pd.DataFrame({'A': [0,0,0,0,1,1],
                   'B': [1,2,3,4,5,6],
                   'C': [8,9,10,11,12,13]})
This used to work!
Now:
>>> pandas.__version__
u'0.20.3'
df.groupby('A').describe().unstack()\
  .loc[:,(slice(None),['count','mean']),]
Gives:
TypeError: '['count', 'mean']' is an invalid key

For columns, remove unstack, because the groupby describe output format changed in version 0.20.0:
df1 = df.groupby('A').describe().loc[:,(slice(None),['count','mean'])]
print (df1)
      B          C
  count mean count  mean
A
0   4.0  2.5   4.0   9.5
1   2.0  5.5   2.0  12.5
If unstack is kept, the MultiIndex ends up in the index (the result is a Series), so the leading : is removed because there are no columns left to select.
Another slice(None) is also added, because the MultiIndex now has 3 levels:
df1 = df.groupby('A').describe().unstack()\
        .loc[(slice(None),['count','mean'],slice(None))]
print (df1)
         A
B  count 0     4.0
         1     2.0
   mean  0     2.5
         1     5.5
C  count 0     4.0
         1     2.0
   mean  0     9.5
         1    12.5
dtype: float64
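A quick way to confirm why three slicer levels are needed (an illustrative check, using the original example frame from the top of the question):
s = df.groupby('A').describe().unstack()
print (type(s))          # a Series, since unstack moved everything into the index
print (s.index.nlevels)  # 3 levels: (original column, statistic, 'A')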
Alternative solutions:
idx = pd.IndexSlice
df1 = df.groupby('A').describe().unstack()\
        .loc[idx[:,['count','mean'],:]]
print (df1)
         A
B  count 0     4.0
         1     2.0
   mean  0     2.5
         1     5.5
C  count 0     4.0
         1     2.0
   mean  0     9.5
         1    12.5
dtype: float64
df1 = df.groupby('A').describe().unstack()\
        .loc(axis=0)[:,['count','mean'],:]
print (df1)
         A
B  count 0     4.0
         1     2.0
   mean  0     2.5
         1     5.5
C  count 0     4.0
         1     2.0
   mean  0     9.5
         1    12.5
dtype: float64
More information is in the pandas documentation on using slicers.

Related

pandas - ranking with tolerance?

Is there a way to rank values in a dataframe while considering a tolerance?
Say I have the following values:
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
and if I ran rank:
ex.rank(method='average')
0 2.0
1 3.0
2 1.0
3 6.0
4 5.0
5 4.0
dtype: float64
But what I'd like as a result would be (with a tolerance of 0.01):
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
Any way to define this tolerance?
Thanks
This function may work:
def rank_with_tolerance(sr, tolerance=0.01+1e-10, method='average'):
    vals = pd.Series(sr.unique()).sort_values()
    vals.index = vals
    vals = vals.mask(vals - vals.shift(1) <= tolerance, vals.shift(1))
    return sr.map(vals).fillna(sr).rank(method=method)
It works for your given input:
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
dtype: float64
And with more complex sets it seems to work too:
ex = pd.Series([16.52,19.95,19.96, 19.95, 19.97, 19.97, 19.98])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0 1.0
1 3.0
2 3.0
3 3.0
4 5.5
5 5.5
6 7.0
dtype: float64
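To see how the function works, inspect the intermediate mapping it builds (an illustrative peek, re-using the first example from the question):
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
tolerance = 0.01+1e-10
vals = pd.Series(ex.unique()).sort_values()
vals.index = vals
# 19.96 is within the tolerance of 19.95, so it maps back to 19.95;
# ranking the mapped values then gives both positions the average rank 3.5
print (vals.mask(vals - vals.shift(1) <= tolerance, vals.shift(1)))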
You could do some sort of min-max scaling, i.e.:
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
# You scale the values to be between 0 and 1
ex_scaled = (ex - min(ex)) / (max(ex) - min(ex))
# You put them on a scale from 1 to the length of your series
result = ex_scaled * len(ex) + 1
# result
0 1.335347
1 4.444109
2 1.000000
3 7.000000
4 4.969789
5 4.453172
That way you are still ranking, but values closer to each other have ranks close to each other.
You can sort the values, merge the close ones and rank on that:
s = ex.drop_duplicates().sort_values()
mapper = (s.groupby(s.diff().abs().gt(0.011).cumsum(), sort=False)
           .transform('mean')
           .reindex_like(ex)
          )
out = mapper.rank(method='average')
N.B. I used 0.011 as the threshold because floating point arithmetic does not always provide enough precision to detect a value close to the threshold.
output:
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
dtype: float64
intermediate mapper:
0 16.520
1 19.955
2 16.150
3 22.770
4 20.530
5 19.955
dtype: float64
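The grouping step can also be inspected (an illustrative peek, re-using ex from the question): consecutive sorted values whose difference stays within the threshold share a group label.
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
s = ex.drop_duplicates().sort_values()
# 19.95 and 19.96 (original indices 1 and 5) end up in the same group
print (s.diff().abs().gt(0.011).cumsum())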

Python: how to apply a function to the same ids in a pandas dataframe without a loop?

I have two dataframes with the same column id, and for each id I need to apply the following function:
def findConstant(df1,df2):
    c = df1.iloc[[0], df1.eq(df1.iloc[0]).all().to_numpy()].squeeze()
    return pd.concat([df1, df2]).assign(**c).reset_index(drop=True)
What I am doing is the following:
df3 = pd.DataFrame()
for idx in df1['id'].unique():
    tmp1 = df1[df1['id']==idx]
    tmp2 = df2[df2['id']==idx]
    tmp3 = findConstant(tmp1,tmp2)
    df3 = pd.concat([df3,tmp3], ignore_index=True)
I would like to know how to avoid a loop like that.
Use:
print (df1)
A B C id val
0 ar 2 8 1 3.2
1 ar 3 7 1 5.6
3 ar1 0 3 2 7.8
4 ar1 4 3 2 9.2
5 ar1 5 3 2 3.4
print (df2)
id val
0 1 3.3
1 2 6.4
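For reference, the two example frames can be reconstructed from the printed output above, so the snippet below runs standalone (a sketch, not the asker's real data):
import pandas as pd

df1 = pd.DataFrame({'A': ['ar','ar','ar1','ar1','ar1'],
                    'B': [2,3,0,4,5],
                    'C': [8,7,3,3,3],
                    'id': [1,1,2,2,2],
                    'val': [3.2,5.6,7.8,9.2,3.4]},
                   index=[0,1,3,4,5])
df2 = pd.DataFrame({'id': [1,2], 'val': [3.3,6.4]})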
#get the number of unique values and the first value per id
df3 = df1.groupby('id').agg(['nunique','first'])
#columns that are constant within an id group have nunique equal to 1
m = df3.xs('nunique', axis=1, level=1).eq(1)
#keep the constant first values and take val from the original df2 where not constant
df = df3.xs('first', axis=1, level=1).where(m).combine_first(df2.set_index('id'))
print (df)
A B C val
id
1 ar NaN NaN 3.3
2 ar1 NaN 3.0 6.4
#join together
df = pd.concat([df1, df.reset_index()], ignore_index=True)
print (df)
A B C id val
0 ar 2.0 8.0 1 3.2
1 ar 3.0 7.0 1 5.6
2 ar1 0.0 3.0 2 7.8
3 ar1 4.0 3.0 2 9.2
4 ar1 5.0 3.0 2 3.4
5 ar NaN NaN 1 3.3
6 ar1 NaN 3.0 2 6.4

Python, element-wise sorting of a DataFrame

I am trying to sort each row of a DataFrame element-wise.
Input:
A B C
0 10 5 6
1 3 6 5
2 1 2 3
Output:
A B C
0 10 6 5
1 6 5 3
2 3 2 1
It feels like this should be easy, but I've been failing for a while... Very much a beginner in Python.
Use np.sort and reverse the order by indexing:
df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=1)[:, ::-1],
                   index=df.index,
                   columns=df.columns)
print (df1)
A B C
0 10 6 5
1 6 5 3
2 3 2 1
A pandas solution, slower, is to apply sorting to each row separately, convert to an array and then back to a Series:
f = lambda x: pd.Series(x.sort_values(ascending=False).to_numpy(), index=df.columns)
df1 = df.apply(f, axis=1)
print (df1)
A B C
0 10 6 5
1 6 5 3
2 3 2 1
If missing values are possible, this works for me:
print (df)
A B C
0 10.0 6.0 5.0
1 5.0 3.0 NaN
2 2.0 1.0 NaN
df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=1)[:, ::-1],
                   index=df.index,
                   columns=df.columns)
print (df1)
A B C
0 10.0 6.0 5.0
1 NaN 5.0 3.0
2 NaN 2.0 1.0
In pandas it is possible to use the na_position parameter to specify where the missing values go:
f = lambda x: pd.Series(x.sort_values(ascending=False, na_position='first').to_numpy(),
                        index=df.columns)
df1 = df.apply(f, axis=1)
print (df1)
A B C
0 10.0 6.0 5.0
1 NaN 5.0 3.0
2 NaN 2.0 1.0
f = lambda x: pd.Series(x.sort_values(ascending=False, na_position='last').to_numpy(),
                        index=df.columns)
df1 = df.apply(f, axis=1)
print (df1)
A B C
0 10.0 6.0 5.0
1 5.0 3.0 NaN
2 2.0 1.0 NaN
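If ascending order per row is wanted instead, simply drop the reversing slice, since np.sort already sorts ascending along the chosen axis (a small variation on the solution above; df_asc is just an illustrative name):
df_asc = pd.DataFrame(np.sort(df.to_numpy(), axis=1),
                      index=df.index,
                      columns=df.columns)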

Filling missing data in pandas dataframe on the basis of a value in another column [duplicate]

I have a dataframe with 4 columns (A, B, C, D). D has some NaN entries. I want to fill the NaN values with the average value of D over the rows that have the same values of A, B, C.
For example, if the values of A, B, C, D are x, y, z and NaN respectively, then I want the NaN value to be replaced by the average of D over the rows where the values of A, B, C are x, y, z respectively.
df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean')) would be faster than apply
In [2400]: df
Out[2400]:
A B C D
0 1 1 1 1.0
1 1 1 1 NaN
2 1 1 1 3.0
3 3 3 3 5.0
In [2401]: df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
Out[2401]:
0 1.0
1 2.0
2 3.0
3 5.0
Name: D, dtype: float64
In [2402]: df['D'] = df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
In [2403]: df
Out[2403]:
A B C D
0 1 1 1 1.0
1 1 1 1 2.0
2 1 1 1 3.0
3 3 3 3 5.0
Details
In [2396]: df.shape
Out[2396]: (10000, 4)
In [2398]: %timeit df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
100 loops, best of 3: 3.44 ms per loop
In [2397]: %timeit df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
100 loops, best of 3: 5.34 ms per loop
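The 10000-row frame used for these timings is not shown; a frame of similar shape can be generated like this (a sketch with made-up data, not the original benchmark input):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(10000, 3)), columns=['A','B','C'])
df['D'] = np.random.rand(10000)
df.loc[df.sample(frac=0.1).index, 'D'] = np.nan   # sprinkle in some NaNs to fill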
I think you need:
df.D = df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
Sample:
df = pd.DataFrame({'A':[1,1,1,3],
                   'B':[1,1,1,3],
                   'C':[1,1,1,3],
                   'D':[1,np.nan,3,5]})
print (df)
A B C D
0 1 1 1 1.0
1 1 1 1 NaN
2 1 1 1 3.0
3 3 3 3 5.0
df.D = df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
print (df)
A B C D
0 1 1 1 1.0
1 1 1 1 2.0
2 1 1 1 3.0
3 3 3 3 5.0
Link to duplicate of this question for further information:
Pandas Dataframe: Replacing NaN with row average
Another suggested way of doing it mentioned in the link is using a simple fillna on the transpose:
df.T.fillna(df.mean(axis=1)).T
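Note that the transpose trick fills each NaN with its row mean across columns, which matches the linked duplicate rather than the per-(A, B, C) group mean used above. A small sketch of what it does (hypothetical frame):
tmp = pd.DataFrame({'X': [1.0, 2.0], 'Y': [np.nan, 4.0], 'Z': [3.0, np.nan]})
print (tmp.T.fillna(tmp.mean(axis=1)).T)
# NaNs are replaced by each row's mean: 2.0 in row 0, 3.0 in row 1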

pandas count over multiple columns

I have a dataframe looking like this
Measure1 Measure2 Measure3 ...
0 1 3
1 3 2
3 0
I'd like to count the occurrences of the values over the columns to produce:
Measure Count Percentage
0 2 0.25
1 2 0.25
2 1 0.125
3 3 0.375
With
outcome_measure_count = cdss_data.groupby(key_columns=['Measure1'],operations={'count': agg.COUNT()}).sort('count', ascending=True)
I only get the first column (I'm actually using the graphlab package, but I'd prefer pandas).
Could someone help me?
You can generate the counts by flattening the df using ravel and value_counts; from this you can construct the final df:
In [230]:
import io
import pandas as pd
t="""Measure1 Measure2 Measure3
0 1 3
1 3 2
3 0 0"""
df = pd.read_csv(io.StringIO(t), sep='\s+')
df
Out[230]:
Measure1 Measure2 Measure3
0 0 1 3
1 1 3 2
2 3 0 0
In [240]:
count = pd.Series(df.squeeze().values.ravel()).value_counts()
pd.DataFrame({'Measure': count.index, 'Count':count.values, 'Percentage':(count/count.sum()).values})
Out[240]:
Count Measure Percentage
0 3 3 0.333333
1 3 0 0.333333
2 2 1 0.222222
3 1 2 0.111111
I inserted a 0 just to make the df shape correct, but you should get the point.
In [68]: df=pd.DataFrame({'m1':[0,1,3], 'm2':[1,3,0], 'm3':[3,2, np.nan]})
In [69]: df
Out[69]:
m1 m2 m3
0 0 1 3.0
1 1 3 2.0
2 3 0 NaN
In [70]: df=df.apply(pd.Series.value_counts).sum(1).to_frame(name='Count')
In [71]: df
Out[71]:
Count
0.0 2.0
1.0 2.0
2.0 1.0
3.0 3.0
In [72]: df.index.name='Measure'
In [73]: df
Out[73]:
Count
Measure
0.0 2.0
1.0 2.0
2.0 1.0
3.0 3.0
In [74]: df['Percentage']=df.Count.div(df.Count.sum())
In [75]: df
Out[75]:
Count Percentage
Measure
0.0 2.0 0.250
1.0 2.0 0.250
2.0 1.0 0.125
3.0 3.0 0.375
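A more compact pandas variant (a sketch, not from either answer above): stack the measure columns into one long Series, then count and normalize. The small frame is re-created so the snippet is self-contained.
import numpy as np
import pandas as pd

df = pd.DataFrame({'m1':[0,1,3], 'm2':[1,3,0], 'm3':[3,2, np.nan]})
s = df.stack()                        # one long Series of all measure values; NaN is dropped
res = s.value_counts().sort_index().to_frame(name='Count')
res.index.name = 'Measure'
res['Percentage'] = res['Count'] / res['Count'].sum()
print (res)                           # counts 2, 2, 1, 3 with percentages 0.25, 0.25, 0.125, 0.375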
