Having a dataframe in python:
CASE TYPE
1 A
1 A
1 A
2 A
2 B
3 B
3 B
3 B
how can I create a result dataframe which would yield all cases and either an "A" if the case had only "A's" assigned, "B" if it was only "B's" or "MIXED" if the case had both A and B?
Result would be then:
Case Type
1 A
2 MIXED
3 B
Here is one option: first collect the unique TYPE values for each CASE group, then check how many there are. If there is more than one, return MIXED; otherwise return the single TYPE itself:
import pandas as pd
import numpy as np
# wrap the chain in parentheses so it can span two lines
groups = (df.groupby('CASE').agg(lambda g: [g.TYPE.unique()])
            .apply(lambda row: np.where(len(row.TYPE) > 1, 'MIXED', row.TYPE[0]), axis=1))
groups
# CASE
# 1 A
# 2 MIXED
# 3 B
# dtype: object
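The agg call above passes a lambda that expects the whole group, which may not run on recent pandas versions. The same idea can be sketched by aggregating the TYPE column directly. This is only an illustration on the sample data from the question; the helper name label_case is made up for this example:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({'CASE': [1, 1, 1, 2, 2, 3, 3, 3],
                   'TYPE': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']})

def label_case(types):
    # hypothetical helper: one distinct value -> that value, otherwise 'MIXED'
    uniq = types.unique()
    return uniq[0] if len(uniq) == 1 else 'MIXED'

# one row per CASE, labelled by the helper
result = df.groupby('CASE')['TYPE'].agg(label_case).reset_index()
print(result)
```

The result has one row per CASE, so it matches the layout asked for in the question.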
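# count the distinct TYPE values within each CASE
df['NTYPES'] = df.groupby('CASE')['TYPE'].transform('nunique')
df.loc[df.NTYPES > 1, 'TYPE'] = 'MIXED'
# group by CASE (not TYPE) so that every case keeps its own row
df.groupby('CASE', as_index=False).first().drop(columns='NTYPES')
CASE TYPE
0 1 A
1 2 MIXED
2 3 B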
Here is an (admittedly over-engineered) solution that avoids looping over groups and DataFrame.apply. Both are slow, so avoiding them may become important if your dataset gets sufficiently large.
import pandas as pd
df = pd.DataFrame({'CASE': [1]*3 + [2]*2 + [3]*3,
'TYPE': ['A']*4 + ['B']*4})
We group by CASE and compute the relative frequencies of TYPE being A or B:
grouped = df.groupby('CASE')
vc = (grouped['TYPE'].value_counts(normalize=True)
.unstack(level=0)
.fillna(0))
Here's what vc looks like:
CASE 1 2 3
TYPE
A 1.0 0.5 0.0
B 0.0 0.5 1.0
Notice that all the information is contained in the first row. Cutting said row into bins with pd.cut gives the desired result:
tolerance = 1e-10
bins = [-tolerance, tolerance, 1-tolerance, 1+tolerance]
types = pd.cut(vc.loc['A'], bins=bins, labels=['B', 'MIXED', 'A'])
We get:
CASE
1 A
2 MIXED
3 B
Name: A, dtype: category
Categories (3, object): [B < MIXED < A]
For good measure, we can rename the types series:
types.name = 'TYPE'
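To see how the bins map a share of A to a label, here is a small standalone sketch. The series share_of_a and its values are made up for illustration; the bins and labels are the ones used above:

```python
import pandas as pd

# hypothetical shares of TYPE == 'A' for four cases
share_of_a = pd.Series([1.0, 0.5, 0.0, 0.25], index=[1, 2, 3, 4])

tolerance = 1e-10
# intervals are right-closed: (-tol, tol], (tol, 1 - tol], (1 - tol, 1 + tol]
bins = [-tolerance, tolerance, 1 - tolerance, 1 + tolerance]
labels = pd.cut(share_of_a, bins=bins, labels=['B', 'MIXED', 'A'])
print(labels.tolist())
```

A share of exactly 0 falls in the first bin (only B), exactly 1 in the last bin (only A), and anything strictly in between is MIXED.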
Here is a slightly ugly, but not that slow, solution:
In [154]: df
Out[154]:
CASE TYPE
0 1 A
1 1 A
2 1 A
3 2 A
4 2 B
5 3 B
6 3 B
7 3 B
8 4 C
9 4 C
10 4 B
In [155]: %paste
(df.groupby('CASE')['TYPE']
.apply(lambda x: x.head(1) if x.nunique() == 1 else pd.Series(['MIX']))
.reset_index()
.drop(columns='level_1')
)
## -- End pasted text --
Out[155]:
CASE TYPE
0 1 A
1 2 MIX
2 3 B
3 4 MIX
Timing against an 800K-row DF:
In [191]: df = pd.concat([df] * 10**5, ignore_index=True)
In [192]: df.shape
Out[192]: (800000, 3)
In [193]: %timeit Psidom(df)
1 loop, best of 3: 235 ms per loop
In [194]: %timeit capitalistpug(df)
1 loop, best of 3: 419 ms per loop
In [195]: %timeit Alberto_Garcia_Raboso(df)
10 loops, best of 3: 112 ms per loop
In [196]: %timeit MaxU(df)
10 loops, best of 3: 80.4 ms per loop
In pandas, I regularly use the following to filter a dataframe by number of occurrences
df = df.groupby('A').filter(lambda x: len(x) >= THRESHOLD)
Assume df has another column 'B', and this time I want to filter the dataframe by the count of unique values in that column. I would expect something like:
df = df.groupby('A').filter(lambda x: len(np.unique(x['B'])) >= THRESHOLD2)
But that doesn't seem to work. What would be the right approach?
It should work nicely with nunique:
df = pd.DataFrame({'B':list('abccee'),
'E':[5,3,6,9,2,4],
'A':list('aabbcc')})
print (df)
A B E
0 a a 5
1 a b 3
2 b c 6
3 b c 9
4 c e 2
5 c e 4
THRESHOLD2 = 2
df1 = df.groupby('A').filter(lambda x: x['B'].nunique() >= THRESHOLD2)
print (df1)
A B E
0 a a 5
1 a b 3
But if you need a faster solution, use transform and filter by boolean indexing:
df2 = df[df.groupby('A')['B'].transform('nunique') >= THRESHOLD2]
print (df2)
A B E
0 a a 5
1 a b 3
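To make the transform step more concrete, the sketch below splits it into its intermediate pieces on the same sample data. The names n_unique, mask and filtered are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': list('aabbcc'),
                   'B': list('abccee'),
                   'E': [5, 3, 6, 9, 2, 4]})
THRESHOLD2 = 2

# transform broadcasts the per-group result back to every row of the group
n_unique = df.groupby('A')['B'].transform('nunique')
# boolean mask aligned with the original index
mask = n_unique >= THRESHOLD2
filtered = df[mask]
print(filtered)
```

Because the mask has the same index as df, no Python-level function has to be called per group, which is presumably where the speed-up comes from.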
Timings:
np.random.seed(123)
N = 1000000
L = list('abcde')
df = pd.DataFrame({'B': np.random.choice(L, N, p=(0.75,0.0001,0.0005,0.0005,0.2489)),
'A':np.random.randint(10000,size=N)})
df = df.sort_values(['A','B']).reset_index(drop=True)
print (df)
THRESHOLD2 = 3
In [403]: %timeit df.groupby('A').filter(lambda x: x['B'].nunique() >= THRESHOLD2)
1 loop, best of 3: 3.05 s per loop
In [404]: %timeit df[df.groupby('A')['B'].transform('nunique')>= THRESHOLD2]
1 loop, best of 3: 558 ms per loop
Caveat
These timings do not account for the number of groups, which strongly affects the performance of some of these solutions.
import numpy as np
import pandas as pd
ind = [0, 1, 2]
cols = ['A','B','C']
df = pd.DataFrame(np.arange(9).reshape((3,3)),columns=cols)
Say you have a pandas dataframe df looking like:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
If you want to capture a single element from each column in cols at a specific index ind, the output should look like a series:
A 0
B 4
C 8
What I've tried so far was:
df.loc[ind,cols]
which gives the undesired output:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
Any suggestions?
context:
The next step would be mapping the output of a df.idxmax() call on one dataframe onto another dataframe with the same column names and indexes, but I can likely figure that out if I know how to do the transformation described above.
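You can use DataFrame.lookup() (note that it was deprecated in pandas 1.2 and removed in 2.0, so this only works on older versions):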
In [6]: pd.Series(df.lookup(df.index, df.columns), index=df.columns)
Out[6]:
A 0
B 4
C 8
dtype: int32
or:
In [14]: pd.Series(df.lookup(ind, cols), index=df.columns)
Out[14]:
A 0
B 4
C 8
dtype: int32
Explanation:
In [12]: df.lookup(df.index, df.columns)
Out[12]: array([0, 4, 8])
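DataFrame.lookup is not available in pandas 2.0 and later. One possible label-based replacement, sketched here under the assumption that ind and cols contain labels that exist in the frame, translates the labels to positions and then indexes the underlying NumPy array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=['A', 'B', 'C'])
ind = [0, 1, 2]
cols = ['A', 'B', 'C']

# translate row and column labels into integer positions
row_pos = df.index.get_indexer(ind)
col_pos = df.columns.get_indexer(cols)

# pick one element per (row, column) pair with NumPy advanced indexing
values = df.to_numpy()[row_pos, col_pos]
result = pd.Series(values, index=cols)
print(result)
```

Note that get_indexer returns -1 for labels it cannot find, so a real implementation would probably want to check for that before indexing.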
Here's a vectorized one with NumPy's advanced-indexing to select one element per column, given the row indices ind per col -
pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
Sample run -
In [107]: ind = [0, 2, 1] # different one than sample for variety
...: cols = ['A','B','C']
...: df = pd.DataFrame(np.arange(9).reshape((3,3)),columns=cols)
...:
In [109]: df
Out[109]:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
In [110]: pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
Out[110]:
A 0
B 7
C 5
dtype: int64
Runtime test
Let's compare the proposed approach against the pandas built-in vectorized lookup method from @MaxU's solution. Since we are testing how well the vectorized versions scale, let's use a larger number of columns -
In [111]: ncols = 10000
...: df = pd.DataFrame(np.random.randint(0,9,(100,ncols)))
...: ind = np.random.randint(0,100,(ncols)).tolist()
...:
# @MaxU's solution
In [112]: %timeit pd.Series(df.lookup(ind, df.columns), index=df.columns)
1000 loops, best of 3: 718 µs per loop
# Proposed in this post
In [113]: %timeit pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
1000 loops, best of 3: 410 µs per loop
In [114]: ncols = 100000
...: df = pd.DataFrame(np.random.randint(0,9,(100,ncols)))
...: ind = np.random.randint(0,100,(ncols)).tolist()
...:
# @MaxU's solution
In [115]: %timeit pd.Series(df.lookup(ind, df.columns), index=df.columns)
100 loops, best of 3: 8.83 ms per loop
# Proposed in this post
In [116]: %timeit pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
100 loops, best of 3: 5.76 ms per loop
There is another way using a MultiIndex, if you like using .loc:
df1=df.reset_index().melt('index').set_index(['index','variable'])
df1.loc[list(zip(df.index,df.columns))]
Out[118]:
value
index variable
0 A 0
1 B 4
2 C 8
There should be a more direct way, but this is what I could think of:
val = [df.iloc[i,i] for i in df.index]
pd.Series(val, index = df.columns)
A 0
B 4
C 8
dtype: int64
You could zip the column and index values you would like to retrieve the values for and then create a series from that:
pd.Series([df.loc[id_, col] for id_, col in zip(ind, cols)], df.columns)
A 0
B 4
C 8
Or if you always just need the diagonal value:
pd.Series(np.diag(df), df.columns)
This will be much faster.
When I am using Pandas, I have a problem. My task is like this:
df=pd.DataFrame([(1,2,3,4,5,6),(1,2,3,4,5,6),(1,2,3,4,5,6)],columns=['a','b','c','d','e','f'])
Out:
a b c d e f
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
What I want is an output dataframe that looks like this:
Out:
s1 s2 s3
0 3 7 11
1 3 7 11
2 3 7 11
That is to say, sum the column pairs (a,b), (c,d), (e,f) separately and name the resulting columns (s1,s2,s3). Could anyone help solve this problem in Pandas? Thank you so much.
1) Perform a groupby over the columns by supplying axis=1. Per @Boud's comment, a minor tweak to the grouping array gives exactly what you want:
df.groupby((np.arange(len(df.columns)) // 2) + 1, axis=1).sum().add_prefix('s')
Grouping gets performed according to this condition:
np.arange(len(df.columns)) // 2
# array([0, 0, 1, 1, 2, 2], dtype=int32)
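As far as I know, recent pandas releases (2.1 and later) deprecate the axis=1 argument of groupby. A possible workaround, shown here only as a sketch on the question's data, is to transpose, group the rows, and transpose back. The name pair_ids is made up for this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([(1, 2, 3, 4, 5, 6)] * 3, columns=list('abcdef'))

# same grouping array as above, shifted to start at 1: [1, 1, 2, 2, 3, 3]
pair_ids = np.arange(len(df.columns)) // 2 + 1

# group the transposed frame by rows instead of grouping columns directly
result = df.T.groupby(pair_ids).sum().T.add_prefix('s')
print(result)
```

Transposing copies the data, so this may be slower than the original axis=1 call on wide or very long frames.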
2) Use np.add.reduceat which is a faster alternative:
df = pd.DataFrame(np.add.reduceat(df.values, np.arange(len(df.columns))[::2], axis=1))
df.columns = df.columns + 1
df.add_prefix('s')
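For readers unfamiliar with reduceat, this small sketch shows what it does to a single row: it sums each slice that starts at one of the given offsets and ends just before the next one. The variable names are only illustrative:

```python
import numpy as np

row = np.array([[1, 2, 3, 4, 5, 6]])

# start offsets of each pair of columns: [0, 2, 4]
starts = np.arange(row.shape[1])[::2]

# sums row[:, 0:2], row[:, 2:4] and row[:, 4:] along the column axis
sums = np.add.reduceat(row, starts, axis=1)
print(sums)
```

This works on the raw NumPy array, so column labels are lost and have to be rebuilt afterwards, as the answer does with add_prefix.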
Timing Constraints:
For a DF of 1 million rows spanned over 20 columns:
from string import ascii_lowercase
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 10, (10**6,20)), columns=list(ascii_lowercase[:20]))
df.shape
(1000000, 20)
def with_groupby(df):
    return df.groupby((np.arange(len(df.columns)) // 2) + 1, axis=1).sum().add_prefix('s')
def with_reduceat(df):
    df = pd.DataFrame(np.add.reduceat(df.values, np.arange(len(df.columns))[::2], axis=1))
    df.columns = df.columns + 1
    return df.add_prefix('s')
# test whether they give the same o/p
with_groupby(df).equals(with_reduceat(df))
True
%timeit with_groupby(df.copy())
1 loop, best of 3: 1.11 s per loop
%timeit with_reduceat(df.copy()) # <--- (>3X faster)
1 loop, best of 3: 345 ms per loop
Is there an efficient way to delete columns that have at least 20% missing values?
Suppose my dataframe is like:
     A    B    C   D
0   sg   hh    1   7
1   gf             9
2   hh            10
3   dd             8
4                  6
5    y             8
After removing the columns, the dataframe becomes like this:
     A   D
0   sg   7
1   gf   9
2   hh  10
3   dd   8
4        6
5    y   8
You can use boolean indexing on the columns where the count of notnull values is larger than 80% of the rows:
df.loc[:, pd.notnull(df).sum()>len(df)*.8]
The same pattern is useful in many other cases, e.g., keeping only the columns where more than 80% of the values are larger than 1:
df.loc[:, (df > 1).sum() > len(df) * .8]
Alternatively, for the .dropna() case, you can also specify the thresh keyword of .dropna() as illustrated by @EdChum:
df.dropna(thresh=0.8*len(df), axis=1)
The latter will be slightly faster:
df = pd.DataFrame(np.random.random((100, 5)), columns=list('ABCDE'))
for col in df:
    df.loc[np.random.choice(list(range(100)), np.random.randint(10, 30)), col] = np.nan
%timeit df.loc[:, pd.notnull(df).sum()>len(df)*.8]
1000 loops, best of 3: 716 µs per loop
%timeit df.dropna(thresh=0.8*len(df), axis=1)
1000 loops, best of 3: 537 µs per loop
You can call dropna and pass a thresh value to drop the columns that don't meet your threshold criteria:
In [10]:
frac = len(df) * 0.8
df.dropna(thresh=frac, axis=1)
Out[10]:
A D
0 sg 7
1 gf 9
2 hh 10
3 dd 8
4 NaN 6
5 y 8
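The question is phrased in terms of the share of missing values, which can also be computed directly. The sketch below rebuilds the sample frame (the exact placement of the missing cells is an assumption based on the expected output) and keeps the columns with less than 20% missing values. The names missing_share and kept are illustrative:

```python
import numpy as np
import pandas as pd

# sample frame reconstructed from the question; NaN positions are assumed
df = pd.DataFrame({'A': ['sg', 'gf', 'hh', 'dd', np.nan, 'y'],
                   'B': ['hh', np.nan, np.nan, np.nan, np.nan, np.nan],
                   'C': [1, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'D': [7, 9, 10, 8, 6, 8]})

# fraction of missing values per column
missing_share = df.isna().mean()

# drop every column with at least 20% missing values
kept = df.loc[:, missing_share < 0.2]
print(kept)
```

Column A has one missing value out of six (about 17%), so it is kept, while B and C are dropped.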
I work with large datasets, where pandas groupby operations take a long time or use too much memory. I have heard some people say groupby can be slow, but I am having trouble finding a better solution.
If my dataframe has 2 columns similar to:
df = pd.DataFrame({'a':[1,2,2,4], 'b':[1,1,1,1]})
a b
1 1
2 1
2 1
4 1
I wish to return a list of values that match to a value in another column:
a b list_of_b
1 1 [1]
2 1 [1,1]
2 1 [1,1]
4 1 [1]
I currently use:
df_group = df.groupby('a')
df['list_of_b'] = df.apply(lambda row: df_group.get_group(row['a'])['b'].tolist(), axis=1)
The code above works for small inputs, but not on large dataframes (more than 1,000,000 rows). Does anyone have a faster way to do this?
Shortest solution I can think of:
df = pd.DataFrame({'a':[1,2,2,4], 'b':[1,1,1,1]})
df.join(pd.Series(df.groupby(by='a').apply(lambda x: list(x.b)), name="list_of_b"), on='a')
a b list_of_b
0 1 1 [1]
1 2 1 [1, 1]
2 2 1 [1, 1]
3 4 1 [1]
On a 4K row df I get the following:
In [29]:
df_group = df.groupby('a')
%timeit df.apply(lambda row: df_group.get_group(row['a'])['b'].tolist(), axis=1)
%timeit df['a'].map(df.groupby('a')['b'].apply(list))
1 loops, best of 3: 4.37 s per loop
100 loops, best of 3: 4.21 ms per loop
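The second timed expression above, which maps column a onto the grouped lists, may be easier to follow when split into two steps. This is a minimal sketch on the question's sample data; the name lists_by_a is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 4], 'b': [1, 1, 1, 1]})

# build each list once per distinct value of 'a'
lists_by_a = df.groupby('a')['b'].apply(list)

# look up the prebuilt list for every row instead of regrouping per row
df['list_of_b'] = df['a'].map(lists_by_a)
print(df)
```

Rows that share the same value of a end up referencing the same list object, which is worth keeping in mind if the lists are modified in place later.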
Just doing the grouping and then joining back to the original dataframe seems to be quite a bit faster:
def make_lists(df):
    g = df.groupby('a')
    def list_of_b(x):
        return x.b.tolist()
    return df.set_index('a').join(
        pd.DataFrame(g.apply(list_of_b),
                     columns=['list_of_b']),
        rsuffix='_').reset_index()
This gives me 192ms per loop with 1M rows generated like this:
df1 = pd.DataFrame({'a':[1,2,2,4], 'b':[1,1,1,1]})
low = 1
high = 10
size = 1000000
df2 = pd.DataFrame({'a':np.random.randint(low,high,size),
'b':np.random.randint(low,high,size)})
make_lists(df1)
Out[155]:
a b list_of_b
0 1 1 [1]
1 2 1 [1, 1]
2 2 1 [1, 1]
3 4 1 [1]
In [156]:
%%timeit
make_lists(df2)
10 loops, best of 3: 192 ms per loop