Let's say we have the following pandas dataframe:
df = pd.DataFrame({'a': {0: 3.0, 1: 2.0, 2: None}, 'b': {0: 10.0, 1: None, 2: 8.0}, 'c': {0: 4.0, 1: 2.0, 2: 6.0}})
a b c
0 3.0 10.0 4.0
1 2.0 NaN 2.0
2 NaN 8.0 6.0
I need to get a dataframe with, for each row, the column names of all non-NaN values.
I know I can do the following, which produces the expected output:
df2 = df.apply(lambda x: pd.Series(x.dropna().index), axis=1)
0 1 2
0 a b c
1 a c NaN
2 b c NaN
Unfortunately, this is quite slow with large datasets. Is there a faster way?
Getting the row indices of non-Null values of each column could work too, as I would just need to transpose the input dataframe. Thanks.
Use numpy:
m = df.notna()
a = m.mul(df.columns).where(m).to_numpy()
out = pd.DataFrame(a[np.arange(len(a))[:,None], np.argsort(~m, axis=1)],
index=df.index)
Output:
0 1 2
0 a b c
1 a c NaN
2 b c NaN
timings
On 30k rows x 3 columns:
# numpy approach
6.82 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# pandas apply
7.32 s ± 553 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
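For reference, here is a self-contained sketch of the same idea working only on NumPy arrays. The variable names (mask, order, names) are mine, and kind='stable' is an addition so that the surviving column names keep their left-to-right order, which the default sort kind does not guarantee.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [3.0, 2.0, None],
                   'b': [10.0, None, 8.0],
                   'c': [4.0, 2.0, 6.0]})

# True where the cell holds a value.
mask = df.notna().to_numpy()

# Sorting the inverted mask moves non-NaN positions to the front of each row;
# a stable sort keeps them in their original column order.
order = np.argsort(~mask, axis=1, kind='stable')

# Look up the column name for every sorted position.
names = df.columns.to_numpy(dtype=object)[order]

# Positions that came from NaN cells get NaN instead of a column name.
names[~np.take_along_axis(mask, order, axis=1)] = np.nan

out = pd.DataFrame(names, index=df.index)
print(out)
```

Everything stays in array operations, so there is no Python-level function call per row.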
I have a data frame like this.
mydf = pd.DataFrame({'a':[1,1,3,3],'b':[np.nan,2,3,6],'c':[1,3,3,9]})
a b c
0 1 NaN 1
1 1 2.0 3
2 3 3.0 3
3 3 6.0 9
I would like to have a resulting dataframe like this.
myResults = pd.concat([mydf.groupby('a').apply(lambda x: (x.b/x.c).max()), mydf.groupby('a').apply(lambda x: (x.c/x.b).max())], axis =1)
myResults.columns = ['b_c','c_b']
b_c c_b
a
1 0.666667 1.5
3 1.000000 1.5
Basically, I would like the maximum of the ratios b/c and c/b for each group (grouped by column a).
Is it possible to achieve this with agg?
I tried mydf.groupby('a').agg([lambda x: (x.b/x.c).max(), lambda x: (x.c/x.b).max()]). It does not work; it seems the column names b and c are not recognized.
Is there a better way to achieve this (preferably in one line) through agg or another function? In summary, I would like to apply a custom function to a grouped DataFrame, where the function needs to read multiple columns (possibly more than the b and c columns mentioned above) from the original DataFrame.
One way of doing it
def func(x):
C= (x['b']/x['c']).max()
D= (x['c']/x['b']).max()
return pd.Series([C, D], index=['b_c','c_b'])
mydf.groupby('a').apply(func).reset_index()
Output
a b_c c_b
0 1 0.666667 1.5
1 3 1.000000 1.5
Add new temporary columns to the dataframe via assign, then do your groupby and max. This avoids calling a Python function once per group, which should give a significant performance benefit.
>>> (mydf
.assign(b_c=mydf['b'].div(mydf['c']), c_b=mydf['c'].div(mydf['b']))
.groupby('a')[['b_c', 'c_b']]
.max()
)
b_c c_b
a
1 0.666667 1.5
3 1.000000 1.5
Timings
# Sample data.
n = 1000 # Sample data number of rows = 4 * n.
data = {
'a': list(range(n)) * 4,
'b': [np.nan, 2, 3, 6] * n,
'c': [1, 3, 3, 9] * n
}
df = pd.DataFrame(data)
# Solution 1.
%timeit df.assign(b_c=df['b'].div(df['c']), c_b=df['c'].div(df['b'])).groupby('a')[['b_c', 'c_b']].max()
# 3.96 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Solution 2.
def func(x):
C= (x['b']/x['c']).max()
D= (x['c']/x['b']).max()
return pd.Series([C, D], index=['b_c','c_b'])
%timeit df.groupby('a').apply(func)
# 1.09 s ± 56.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Both solutions give the same result.
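If you specifically want agg, one option (a sketch, not benchmarked here) is to precompute the ratio columns and then use named aggregation, which is available in pandas 0.25 and later. The frame is the mydf from the question, and the output column names are the ones used above.

```python
import numpy as np
import pandas as pd

mydf = pd.DataFrame({'a': [1, 1, 3, 3],
                     'b': [np.nan, 2, 3, 6],
                     'c': [1, 3, 3, 9]})

out = (
    mydf
    # Row-wise ratios are computed once, in vectorized form.
    .assign(b_c=mydf['b'] / mydf['c'], c_b=mydf['c'] / mydf['b'])
    .groupby('a')
    # Named aggregation: output_name=(input_column, aggregation).
    .agg(b_c=('b_c', 'max'), c_b=('c_b', 'max'))
)
print(out)
```

The same pattern extends to more columns: add one assign entry per derived quantity and one agg entry per statistic you need.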
I am trying to convert values within the current dataframe as the "Index" and the dataframe's Index as the "Labels". For Example:
Value1 Value2
0 0 1
1 2 4
2 NaN 3
This would result in
Labels
0 0
1 0
2 1
3 2
4 1
Currently I managed to do this using a loop to check and apply the necessary labels/values but with millions of labels to mark this process becomes extremely time consuming. Is there a way to do this in a smarter and quicker way? Thanks in advance.
Use stack with DataFrame constructor:
s = df.stack()
df = pd.DataFrame(s.index.get_level_values(0).values,
columns=['Labels'],
index=s.values.astype(int)).sort_index()
print (df)
Labels
0 0
1 0
2 1
3 2
4 1
Detail:
print (df.stack())
0 Value1 0.0
Value2 1.0
1 Value1 2.0
Value2 4.0
2 Value2 3.0
dtype: float64
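One caveat: the approach relies on the NaN entry being dropped by stack. Calling dropna() explicitly makes that step visible rather than relying on stack's default behaviour. A minimal sketch with the sample data from the question (the frame construction is my assumption about how the data was built, and the name labels is mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value1': [0, 2, np.nan], 'Value2': [1, 4, 3]})

# Long format: one entry per (row label, column) pair, missing values removed.
s = df.stack().dropna()

labels = pd.DataFrame(
    {'Labels': s.index.get_level_values(0).to_numpy()},  # original row labels
    index=s.to_numpy().astype(int),                      # cell values become the index
).sort_index()
print(labels)
```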
Came up with a really good one (thanks to the collective effort of the pandas community). This one should be fast.
It uses the power and flexibility of repeat and ravel to flatten your data.
s = pd.Series(df.index.repeat(2), index=df.values.ravel())
s[s.index.notnull()].sort_index()
0.0 0
1.0 0
2.0 1
3.0 2
4.0 1
dtype: int64
A subsequent conversion results in an integer index:
s = s[s.index.notnull()].sort_index()
s.index = s.index.astype(int)
A similar (slightly faster depending on your data) solution which also results in an integer index is performing the filtering before converting to Series -
v = df.index.repeat(df.shape[1])
i = df.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
s
0 0
1 0
2 1
3 2
4 1
dtype: int64
Performance
df2 = pd.concat([df] * 10000, ignore_index=True)
# jezrael's solution
%%timeit
s = df2.stack()
pd.DataFrame(s.index.get_level_values(0).values,
columns=['Labels'],
index=s.values.astype(int)).sort_index()
4.57 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(df2.index.repeat(2), index=df2.values.ravel())
s[s.index.notnull()].sort_index()
3.12 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
v = df2.index.repeat(df2.shape[1])
i = df2.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
3.1 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I want to add two dataframes which I can achieve by add function.
Now I want to divide each value of the resulting dataframe by the number of initial dataframes (df1, df2) in which the respective column was present. For example:
df1 = pd.DataFrame([[1,2],[3,4]], index =['A','B'], columns = ['C','D'])
df2 = pd.DataFrame([[11,12], [13,14]], index = ['A','B'], columns = ['D','E'])
df3 = df1.add(df2, fill_value=0)
This would result in a df like
C D E
A 1.0 13 12.0
B 3.0 17 14.0
I require a df like:
C D E
A 1.0 6.5 12.0
B 3.0 8.5 14.0
because D column is found in both dataframes, I divide those values by 2.
Can anyone please provide a generic solution, assuming I need to add more than 2 dataframes (so the division factor also changes) and have more than 100 columns in each dataframe.
We can concatenate all DFs horizontally in one step:
In [13]: df = pd.concat([df1,df2], axis=1).fillna(0)
this yields:
In [15]: df
Out[15]:
C D D E
A 1 2 11 12
B 3 4 13 14
now we can group by columns, calculating average (mean):
In [14]: df.groupby(df.columns, axis=1).mean()
Out[14]:
C D E
A 1.0 6.5 12.0
B 3.0 8.5 14.0
or we can do it in one step (thanks @jezrael):
In [60]: pd.concat([df1,df2], axis=1).fillna(0).groupby(level=0, axis=1).mean()
Out[60]:
C D E
A 1.0 6.5 12.0
B 3.0 8.5 14.0
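Note that groupby(..., axis=1) is deprecated in recent pandas releases (2.1 and later, as far as I know). A sketch of an equivalent that transposes, groups on the index, and transposes back, using the frames from the question (the names wide and avg are mine):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], index=['A', 'B'], columns=['C', 'D'])
df2 = pd.DataFrame([[11, 12], [13, 14]], index=['A', 'B'], columns=['D', 'E'])

# Side-by-side concat keeps both 'D' columns.
wide = pd.concat([df1, df2], axis=1)

# After transposing, duplicate column names are duplicate index labels,
# so a plain groupby on the index averages them. mean() skips NaN.
avg = wide.T.groupby(level=0).mean().T
print(avg)
```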
Timing:
In [38]: df1 = pd.concat([df1] * 10**5, ignore_index=True)
In [39]: df2 = pd.concat([df2] * 10**5, ignore_index=True)
In [40]: %%timeit
...: df = pd.concat([df1,df2], axis=1).fillna(0)
...: df.groupby(df.columns, axis=1).mean()
...:
63.4 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [41]: %%timeit
...: s = pd.Series(np.concatenate([df1.columns, df2.columns])).value_counts()
...: df1.add(df2, fill_value=0).div(s)
...:
28.7 ms ± 712 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [42]: %%timeit
...: pd.concat([df1,df2]).mean(level = 0)
...:
65.5 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [43]: df1.shape
Out[43]: (200000, 2)
In [44]: df2.shape
Out[44]: (200000, 2)
Current winner: @jezrael (28.7 ms ± 712 µs) - congratulations!
It looks like you are trying to compute a mean. Don't do too many operations with the dataframe methods and individual columns if you can help it, as it's slow.
df = pd.concat([df1,df2]) # concatenate all your dataframes together
df.mean(level = 0)
The second line computes the mean along the vertical axis (axis = 0 by default), and level = 0 tells pandas to get the mean of each unique index.
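On pandas 2.0 and later the level argument of mean is no longer available; the same computation is spelled with groupby(level=0). A short sketch with the frames from the question (the name avg is mine):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], index=['A', 'B'], columns=['C', 'D'])
df2 = pd.DataFrame([[11, 12], [13, 14]], index=['A', 'B'], columns=['D', 'E'])

# Stack the frames vertically; columns missing from a frame become NaN.
stacked = pd.concat([df1, df2])

# Average rows that share an index label; NaN cells are skipped by mean().
avg = stacked.groupby(level=0).mean()
print(avg)
```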
A faster solution is to divide by the number of frames in which each column appears:
s = pd.Series(np.concatenate([df1.columns, df2.columns])).value_counts()
print (s)
C 1
D 2
E 1
dtype: int64
df3 = df1.add(df2, fill_value=0).div(s)
print (df3)
C D E
A 1.0 6.5 12.0
B 3.0 8.5 14.0
Timings (with 100 columns like OP mentioned):
np.random.seed(123)
N = 100000
df1 = pd.DataFrame(np.random.randint(10, size=(N, 100)))
df1.columns = 'col' + df1.columns.astype(str)
df2 = df1.mul(10)
#MaxU solution
In [127]: %timeit (pd.concat([df1,df2], axis=1).fillna(0).groupby(level=0, axis=1).mean())
1 loop, best of 3: 952 ms per loop
#Ken Wei solution
In [128]: %timeit (pd.concat([df1,df2]).mean(level = 0))
1 loop, best of 3: 895 ms per loop
#jez solution
In [129]: %timeit (df1.add(df2, fill_value=0).div(pd.Series(np.concatenate([df1.columns, df2.columns])).value_counts()))
10 loops, best of 3: 161 ms per loop
More general solution:
If you have a list of DataFrames, chaining is possible:
df = df1.add(df2, fill_value=0).add(df3, fill_value=0)
but it is better to use reduce:
from functools import reduce
dfs = [df1,df2, df3]
s = pd.Series(np.concatenate([x.columns for x in dfs])).value_counts()
df5 = reduce(lambda x, y: x.add(y, fill_value=0), dfs).div(s)
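A worked example of the reduce version. The third frame df3 here is hypothetical (made up for illustration), and the column counts are built with a list comprehension instead of np.concatenate, which is just a variation on the same idea. It assumes no frame contains NaN inside its own columns, since the divisor counts frames, not non-missing cells.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], index=['A', 'B'], columns=['C', 'D'])
df2 = pd.DataFrame([[11, 12], [13, 14]], index=['A', 'B'], columns=['D', 'E'])
# Hypothetical third frame, not part of the question.
df3 = pd.DataFrame([[100, 200], [300, 400]], index=['A', 'B'], columns=['C', 'E'])

dfs = [df1, df2, df3]

# How many frames contain each column name.
counts = pd.Series([col for d in dfs for col in d.columns]).value_counts()

# Sum all frames (missing columns count as 0), then divide column-wise.
total = reduce(lambda x, y: x.add(y, fill_value=0), dfs)
result = total.div(counts)
print(result)
```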
I have a problem. numpy is installed in my Python environment, but I still get this error:
'DataFrame' object has no attribute 'sort'
Can anyone give me some ideas?
This is my code:
final.loc[-1] =['', 'P','Actual']
final.index = final.index + 1 # shifting index
final = final.sort()
final.columns=[final.columns,final.iloc[0]]
final = final.iloc[1:].reset_index(drop=True)
final.columns.names = (None, None)
sort() was deprecated for DataFrames in favor of either:
sort_values() to sort by column(s)
sort_index() to sort by the index
sort() was deprecated (but still available) in Pandas with release 0.17 (2015-10-09) with the introduction of sort_values() and sort_index(). It was removed from Pandas with release 0.20 (2017-05-05).
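Applied to the code in the question, the sort() call is ordering rows by their index labels after the shift, so sort_index() looks like the intended replacement. A sketch with a made-up three-column frame standing in for final (its contents are my assumption, the question does not show them):

```python
import pandas as pd

# Hypothetical stand-in for the question's `final` frame.
final = pd.DataFrame({'x': ['1', '2'], 'y': ['3', '4'], 'z': ['5', '6']})

final.loc[-1] = ['', 'P', 'Actual']  # new row with label -1
final.index = final.index + 1        # shift labels: the new row becomes 0
final = final.sort_index()           # replaces the removed final.sort()
print(final)
```

If the goal were to order by the contents of a column instead, sort_values(by=...) would be the call to use.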
Pandas Sorting 101
sort has been replaced in v0.20 by DataFrame.sort_values and DataFrame.sort_index. Aside from this, we also have argsort.
Here are some common use cases in sorting, and how to solve them using the sorting functions in the current API. First, the setup.
# Setup
np.random.seed(0)
df = pd.DataFrame({'A': list('accab'), 'B': np.random.choice(10, 5)})
df
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
Sort by Single Column
For example, to sort df by column "A", use sort_values with a single column name:
df.sort_values(by='A')
A B
0 a 7
3 a 5
4 b 2
1 c 9
2 c 3
If you need a fresh RangeIndex, use DataFrame.reset_index.
Sort by Multiple Columns
For example, to sort by both col "A" and "B" in df, you can pass a list to sort_values:
df.sort_values(by=['A', 'B'])
A B
3 a 5
0 a 7
4 b 2
2 c 3
1 c 9
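If the columns need different directions, ascending also accepts a list with one flag per column. A small sketch, with the frame written out by hand to match the values displayed above (the names mixed and fresh are mine):

```python
import pandas as pd

# Same values as the frame displayed above, written out explicitly.
df = pd.DataFrame({'A': list('accab'), 'B': [7, 9, 3, 5, 2]})

# 'A' ascending, and within equal 'A', 'B' descending.
mixed = df.sort_values(by=['A', 'B'], ascending=[True, False])
print(mixed)

# Optional: discard the old labels and get a fresh RangeIndex.
fresh = mixed.reset_index(drop=True)
```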
Sort By DataFrame Index
df2 = df.sample(frac=1)
df2
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2
You can do this using sort_index:
df2.sort_index()
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
df.equals(df2)
# False
df.equals(df2.sort_index())
# True
Here are some comparable methods with their performance:
%timeit df2.sort_index()
%timeit df2.iloc[df2.index.argsort()]
%timeit df2.reindex(np.sort(df2.index))
605 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
610 µs ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
581 µs ± 7.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sort by List of Indices
For example,
idx = df2.index.argsort()
idx
# array([1, 0, 2, 3, 4])
This "sorting" problem is actually a simple indexing problem. Just passing integer positions to iloc will do.
df.iloc[idx]
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2
I have two columns with strings. I would like to combine them and ignore nan values. Such that:
ColA, Colb, ColA+ColB
str str strstr
str nan str
nan str str
I tried df['ColA+ColB'] = df['ColA'] + df['ColB'] but that creates a nan value if either column is nan. I've also thought about using concat.
I suppose I could just go with that, and then use some df.ColA+ColB[df[ColA] = nan] = df[ColA] but that seems like quite the workaround.
Call fillna and pass an empty str as the fill value and then sum with param axis=1:
In [3]:
df = pd.DataFrame({'a':['asd',np.nan,'asdsa'], 'b':['asdas','asdas',np.nan]})
df
Out[3]:
a b
0 asd asdas
1 NaN asdas
2 asdsa NaN
In [7]:
df['a+b'] = df.fillna('').sum(axis=1)
df
Out[7]:
a b a+b
0 asd asdas asdasdas
1 NaN asdas asdas
2 asdsa NaN asdsa
You could fill the NaN with an empty string:
df['ColA+ColB'] = df['ColA'].fillna('') + df['ColB'].fillna('')
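A runnable version of that one-liner, with made-up sample values in the ColA/ColB layout from the question:

```python
import numpy as np
import pandas as pd

# Illustrative data only; the column names follow the question.
df = pd.DataFrame({'ColA': ['x', np.nan, 'z'],
                   'ColB': ['1', '2', np.nan]})

# NaN + str is NaN, so replace NaN with '' before concatenating.
df['ColA+ColB'] = df['ColA'].fillna('') + df['ColB'].fillna('')
print(df)
```

The original columns keep their NaN values because fillna returns a copy.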
Using apply and str.cat you can
In [723]: df
Out[723]:
a b
0 asd asdas
1 NaN asdas
2 asdsa NaN
In [724]: df['a+b'] = df.apply(lambda x: x.str.cat(sep=''), axis=1)
In [725]: df
Out[725]:
a b a+b
0 asd asdas asdasdas
1 NaN asdas asdas
2 asdsa NaN asdsa
In my case, I wanted to join more than 2 columns together with a separator (a+b+c)
In [3]:
df = pd.DataFrame({'a':['asd',np.nan,'asdsa'], 'b':['asdas','asdas',np.nan], 'c':['as',np.nan ,'ds']})
In [4]: df
Out[4]:
a b c
0 asd asdas as
1 NaN asdas NaN
2 asdsa NaN ds
The following syntax worked for me:
In [5]: df['d'] = df[['a', 'b', 'c']].fillna('').agg('|'.join, axis=1)
In [6]: df
Out[6]:
a b c d
0 asd asdas as asd|asdas|as
1 NaN asdas NaN |asdas|
2 asdsa NaN ds asdsa||ds
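Note the stray separators in rows 1 and 2 ("|asdas|", "asdsa||ds"): fillna('') keeps the empty cells in the join. If that is not wanted, one possible variation is to drop the missing values per row before joining. This is only a sketch; it goes through apply, so expect it to be slower on large frames.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['asd', np.nan, 'asdsa'],
                   'b': ['asdas', 'asdas', np.nan],
                   'c': ['as', np.nan, 'ds']})

# Join only the values that are present in each row.
df['d'] = df[['a', 'b', 'c']].apply(lambda row: '|'.join(row.dropna()), axis=1)
print(df)
```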
Prefer adding the columns directly over using the apply method, because it is faster.
Just add the two columns (if you know they are strings)
%timeit df.bio + df.procedure_codes
21.2 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Use apply
%timeit df[eventcol].apply(lambda x: ''.join(x), axis=1)
13.6 s ± 343 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use Pandas string methods and cat:
%timeit df[eventcol[0]].str.cat(cols, sep=',')
264 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using sum (which concatenates strings)
%timeit df[eventcol].sum(axis=1)
509 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)