I have the DataFrame
df = pd.DataFrame({
    'colA': ['?', 2, 3, 4, '?'],
    'colB': [1, 2, '?', 3, 4],
    'colC': ['?', 2, 3, 4, 5]
})
I would like to count the number of '?' in each column and return the following output:
colA - 2
colB - 1
colC - 1
Is there a way to get this output in one go? Right now the only way I know how to do it is to write a for loop over the columns.
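For reference, the loop I have in mind is roughly this (a sketch; counts is just a placeholder name):
# count '?' column by column with an explicit loop over the df above
counts = {}
for col in df.columns:
    counts[col] = int((df[col] == '?').sum())
print(counts)  # {'colA': 2, 'colB': 1, 'colC': 1}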
It looks like the simple way is
df[df == '?'].count()
The result is:
colA 2
colB 1
colC 1
dtype: int64
where df[df == '?'] gives us a DataFrame with '?' and NaN:
  colA colB colC
0    ?  NaN    ?
1  NaN  NaN  NaN
2  NaN    ?  NaN
3  NaN  NaN  NaN
4    ?  NaN  NaN
and .count() counts the non-NA cells in each column.
Please also look at the other answers: they are more readable and faster.
You can use numpy.count_nonzero here.
pd.Series(np.count_nonzero(df.to_numpy()=='?', axis=0), index=df.columns)
# pd.Series((df.values == '?').sum(0), index=df.columns)
colA 2
colB 1
colC 1
dtype: int64
Timeit results:
Benchmarking with df of shape (1_000_000, 3)
big_df = pd.DataFrame(df.to_numpy().repeat(200_000,axis=0))
big_df.shape
(1000000, 3)
In [186]: %timeit pd.Series(np.count_nonzero(big_df.to_numpy()=='?', axis=0), index=big_df.columns)
53.1 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [187]: %timeit big_df.eq('?').sum()
171 ms ± 7.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [188]: %timeit big_df[big_df == '?'].count()
314 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [189]: %timeit pd.Series(np.apply_along_axis(lambda x: Counter(x)['?'], 0, big_df.values), index=big_df.columns)
174 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can do eq + sum:
df.eq('?').sum()
Out[182]:
colA 2
colB 1
colC 1
dtype: int64
@Bear Brown's answer is probably the most elegant; a faster option is to use numpy:
from collections import Counter
%%timeit
df[df == '?'].count()
5.2 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
pd.Series(np.apply_along_axis(lambda x: Counter(x)['?'], 0, df.values), index=df.columns)
218 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
A variation on BENY's answer:
(df=='?').sum()
Out[182]:
colA 2
colB 1
colC 1
dtype: int64
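If you have more than one placeholder token to count, the same idea extends with DataFrame.isin (a sketch; 'N/A' is just an example token, not something from the question):
# count any of several placeholder strings per column
df.isin(['?', 'N/A']).sum()
# colA    2
# colB    1
# colC    1
# dtype: int64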
Related
I have a Dataframe of stock prices...
I wish to have a boolean column that indicates if the price had reached a certain threshold in the previous rows or not.
My output should be something like this (let's say my threshold is 100):
index  price  bool
0      98     False
1      99     False
2      100.5  True
3      101    True
4      99     True
5      98     True
I've managed to do this with the following code but it's not efficient and takes a lot of time:
(df.loc[:, 'price'] > threshold).cumsum().fillna(0).gt(0)
Please, any suggestions?
Use a comparison and cummax:
threshold = 100
df['bool'] = df['price'].ge(threshold).cummax()
Note that it would work the other way around (although maybe less efficiently*):
threshold = 100
df['bool'] = df['price'].cummax().ge(threshold)
Output:
index price bool
0 0 98.0 False
1 1 99.0 False
2 2 100.5 True
3 3 101.0 True
4 4 99.0 True
5 5 98.0 True
* indeed on a large array:
%%timeit
df['price'].ge(threshold).cummax()
# 193 µs ± 4.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df['price'].cummax().ge(threshold)
# 309 µs ± 4.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
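The reason the comparison-first version works is that cummax on a boolean Series stays True once the first True has been seen; a minimal illustration:
import pandas as pd
# once a True appears, the cumulative maximum never drops back to False
s = pd.Series([False, False, True, False, False])
s.cummax().tolist()
# [False, False, True, True, True]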
Timing:
# setting up a dummy example with 10M rows
np.random.seed(0)
df = pd.DataFrame({'price': np.random.choice([0,1], p=[0.999,0.001], size=10_000_000)})
threshold = 0.5
## comparison
%%timeit
df['bool'] = (df.loc[:, 'price'] > threshold).cumsum().fillna(0).gt(0)
# 271 ms ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['bool'] = df['price'].ge(threshold).cummax()
# 109 ms ± 5.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df['bool'] = np.maximum.accumulate(df['price'].to_numpy()>threshold)
# 75.8 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A dataframe similar to my real one can be created like this:
import pandas as pd
df = pd.DataFrame()
df["nodes"] = list(range(1, 11))
df["x"] = [1,4,9,12,27,87,99,121,156,234]
df["y"] = [3,5,6,1,8,9,2,1,0,-1]
df["z"] = [2,3,4,2,1,5,9,99,78,1]
df.set_index("nodes", inplace=True)
So the dataframe looks like this:
x y z
nodes
1 1 3 2
2 4 5 3
3 9 6 4
4 12 1 2
5 27 8 1
6 87 9 5
7 99 2 9
8 121 1 99
9 156 0 78
10 234 -1 1
My first try for searching e.g. all nodes containing number 1 is:
>>> df[(df == 1).any(axis=1)].index.values
[1 4 5 8 10]
As I have to do this for many numbers and my real dataframe is much bigger than this one, I'm searching for a very fast way to do this.
I just tried something that may be enlightening.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10000, 4)), columns=list('ABCD'))
df.set_index("A", inplace=True)
df_no_index = df.reset_index()
This sets up a dataframe with ints all the way through. It is not the same as yours, but it will suffice.
Then I ran four tests:
%timeit df[(df == 1).any(axis=1)].index.values
%timeit df[(df['B'] == 1) | (df['C']==1)| (df['D']==1)].index.values
%timeit df_no_index[(df_no_index == 1).any(axis=1)].A.values
%timeit df_no_index[(df_no_index['B'] == 1) | (df_no_index['C']==1)| (df_no_index['D']==1)].A.values
The results I got were:
940 µs ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.47 ms ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.08 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.55 ms ± 51.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
These show that the initial method you took, with the index, seems to be the fastest of these approaches. Removing the index does not improve the speed on a moderately sized dataframe.
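If you want to push it further, a numpy-level variant of the same test is another option (a sketch, not benchmarked here; it assumes the searched columns are numeric, as in the examples above):
import numpy as np
# rows where any column equals the target value, then the matching index labels
mask = (df.to_numpy() == 1).any(axis=1)
nodes = df.index.values[mask]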
I'm trying to determine and flag duplicate 'Sample' values in a dataframe using groupby with lambda:
rdtRows["DuplicateSample"] = False
rdtRowsSampleGrouped = rdtRows.groupby( ['Sample']).filter(lambda x: len(x) > 1)
rdtRowsSampleGrouped["DuplicateSample"] = True
# How to get flag changes made on rdtRowsSampleGrouped to apply to rdtRows??
How do I apply the "DuplicateSample" changes made on rdtRowsSampleGrouped back to the source rdtRows data? I'm stumped :(
Use GroupBy.transform with GroupBy.size:
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
Or use Series.duplicated with keep=False if you need a faster solution:
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
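keep=False marks every occurrence of a repeated value as True, not just the second and later ones; a quick illustration:
import pandas as pd
s = pd.Series([1, 2, 2, 3])
s.duplicated(keep=False).tolist()
# [False, True, True, False]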
Performance on some sample data (real results will differ, depending on the number of rows and the number of duplicated values):
np.random.seed(2020)
N = 100000
df = pd.DataFrame({'Sample': np.random.randint(100000, size=N)})
In [51]: %timeit df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
17 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df['DuplicateSample1'] = df['Sample'].duplicated(keep=False)
3.73 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@Stef's solution is unfortunately about 2734 times slower than the duplicated solution:
In [53]: %timeit df['DuplicateSample2'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
10.2 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use transform:
import pandas as pd
df = pd.DataFrame({'Sample': [1,2,2,3,4,4]})
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
Result:
Sample DuplicateSample
0 1 False
1 2 True
2 2 True
3 3 False
4 4 True
5 4 True
How would I add multiple DataFrames together? I have about 10 DataFrames to sum up. I tried to use
df_add = df1.add(df2, df3 fill_value=0)
And it doesn't work.
This is the code to create the DataFrames:
df1 = pd.DataFrame([(1,2),(3,4),(5,6)], columns=['a','b'])
a b
0 1 2
1 3 4
2 5 6
df2 = pd.DataFrame([(100,200),(300,400),(500,600)], columns=['a','b'])
df3 = pd.DataFrame([(100,200),(300,400),(500,600)], columns=['a','b'])
Now how would I add these so that the result is:
a b
0 201 402
1 603 804
2 1005 1206
It looks like the simple
sum((df1,df2,df3))
works and gives:
a b
0 201 402
1 603 804
2 1005 1206
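If you also want the fill_value=0 behaviour from your attempt (so frames with non-matching labels don't turn into NaN), one option is to fold DataFrame.add over a list with functools.reduce; a sketch, assuming your ~10 frames are collected in a list:
from functools import reduce
dfs = [df1, df2, df3]  # or your ~10 DataFrames
df_add = reduce(lambda left, right: left.add(right, fill_value=0), dfs)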
Just as an addendum,
df1 + df2 + df3
works just fine. If performance is important and index matching is not important, then consider vectorized np.sum to avoid pandas overhead:
%timeit np.sum([df1.values,df2.values,df3.values],axis=0)
27.2 µs ± 3.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df1+df2+df3
1.21 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sum((df1,df2,df3))
2.04 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
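Note that np.sum over the .values arrays returns a plain numpy array rather than a DataFrame; if you need a DataFrame back, you can rewrap the result (a sketch, assuming all frames share the same columns and index):
import numpy as np
import pandas as pd
arr = np.sum([df1.values, df2.values, df3.values], axis=0)
df_add = pd.DataFrame(arr, columns=df1.columns, index=df1.index)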
I have the following DataFrame with named columns and index:
   'a'  'a*'  'b'  'b*'
1    5   NaN    9   NaN
2  NaN     3    3   NaN
3    4   NaN    1   NaN
4  NaN     9  NaN     7
The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.
I want to copy any values (which are not null) from a* and b* columns to a and b, respectively.
Is there an efficient way to do such an operation?
Use np.where:
df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])
Output:
a a* b b*
0 5.0 NaN 9.0 NaN
1 3.0 3.0 3.0 NaN
2 4.0 NaN 1.0 NaN
3 9.0 9.0 7.0 7.0
Using fillna() is a lot slower than np.where but has the advantage of being pandas-only. If you want a faster method while keeping it pure pandas, you can use combine_first(), which according to the documentation is used to:
Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes
Translation: this is a method designed to do exactly what is asked in the question.
How do I use it?
df['a'].combine_first(df['a*'])
Performance:
df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
    return df['A'].fillna(df['A*'])
def using_combine_first(df):
    return df['A'].combine_first(df['A*'])
def using_np_where(df):
    return np.where(df['A'].isnull(), df['A*'], df['A'])
def using_np_where_numpy(df):
    return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_np_where_numpy(df)
1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For better performance it is possible to use numpy.isnan and convert the Series to numpy arrays via .values:
df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['b'].values)
Another, more general solution, if the DataFrame contains only pairs of columns with and without '*' and the '*' columns should be removed:
First create a MultiIndex in the columns by appending '*val' to each name and splitting on the first '*', so 'a' becomes ('a', 'val') and 'a*' becomes ('a', '*val'):
df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)
Then select each level with DataFrame.xs to get two aligned DataFrames, so DataFrame.fillna works nicely:
df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
print (df)
a b
1 5.0 9.0
2 3.0 3.0
3 4.0 1.0
4 9.0 7.0
Performance (depends on the number of missing values and the length of the DataFrame):
df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000,
'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
    df['A'] = df['A'].fillna(df['A*'])
    return df
def using_np_where(df):
    df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
    return df
def using_np_where_numpy(df):
    df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
    return df
def using_combine_first(df):
    df['D'] = df['A'].combine_first(df['A*'])
    return df
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where_numpy(df)
1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)