I have the following DataFrame with named columns and index:
'a' 'a*' 'b' 'b*'
1 5 NaN 9 NaN
2 NaN 3 3 NaN
3 4 NaN 1 NaN
4 NaN 9 NaN 7
The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.
I want to copy any values (which are not null) from a* and b* columns to a and b, respectively.
Is there an efficient way to do such an operation?
Use np.where
df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])
Output:
a a* b b*
0 5.0 NaN 9.0 NaN
1 3.0 3.0 3.0 NaN
2 4.0 NaN 1.0 NaN
3 9.0 9.0 7.0 7.0
Using fillna() is a lot slower than np.where but has the advantage of being pandas only. If you want a faster method and keep it pandas pure, you can use combine_first() which according to the documentation is used to:
Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes
Translation: this is a method designed to do exactly what is asked in the question.
How do I use it?
df['a'].combine_first(df['a*'])
Performance:
df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
return df['A'].fillna(df['A*'])
def using_combine_first(df):
return df['A'].combine_first(df['A*'])
def using_np_where(df):
return np.where(df['A'].isnull(), df['A*'], df['A'])
def using_np_where_numpy(df):
return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_np_where_numpy(df)
1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For better performance is possible use numpy.isnan and convert Series to numpy arrays by values:
df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)
Another general solution if exist only pairs with/without * in columns of DataFrame and is necessary remove * columns:
First create MultiIndex by split with append *val:
df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)
And then select by DataFrame.xs for DataFrames, so DataFrame.fillna working very nice:
df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
print (df)
a b
1 5.0 9.0
2 3.0 3.0
3 4.0 1.0
4 9.0 7.0
Performance: (depends of number of missing values and length of DataFrame)
df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000,
'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
df['A'] = df['A'].fillna(df['A*'])
return df
def using_np_where(df):
df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
return df
def using_np_where_numpy(df):
df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
return df
def using_combine_first(df):
df['D'] = df['A'].combine_first(df['A*'])
return df
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where_numpy(df)
1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Related
I have a Dataframe of stock prices...
I wish to have a boolean column that indicates if the price had reached a certain threshold in the previous rows or not.
My output should be something like this (let's say my threshold is 100):
index
price
bool
0
98
False
1
99
False
2
100.5
True
3
101
True
4
99
True
5
98
True
I've managed to do this with the following code but it's not efficient and takes a lot of time:
(df.loc[:, 'price'] > threshold).cumsum().fillna(0).gt(0)
Please, any suggestions?
Use a comparison and cummax:
threshold = 100
df['bool'] = df['price'].ge(threshold).cummax()
Note that it would work the other way around (although maybe less efficiently*):
threshold = 100
df['bool'] = df['price'].cummax().ge(threshold)
Output:
index price bool
0 0 98.0 False
1 1 99.0 False
2 2 100.5 True
3 3 101.0 True
4 4 99.0 True
5 5 98.0 True
* indeed on a large array:
%%timeit
df['price'].ge(threshold).cummax()
# 193 µs ± 4.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df['price'].cummax().ge(threshold)
# 309 µs ± 4.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timing
# setting up a dummy example with 10M rows
np.random.seed(0)
df = pd.DataFrame({'price': np.random.choice([0,1], p=[0.999,0.001], size=10_000_000)})
threshold = 0.5
## comparison
%%timeit
df['bool'] = (df.loc[:, 'price'] > threshold).cumsum().fillna(0).gt(0)
# 271 ms ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['bool'] = df['price'].ge(threshold).cummax()
# 109 ms ± 5.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each
%%timeit
df['bool'] = np.maximum.accumulate(df['price'].to_numpy()>threshold)
# 75.8 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A similar dataframe can be created:
import pandas as pd
df = pd.DataFrame()
df["nodes"] = list(range(1, 11))
df["x"] = [1,4,9,12,27,87,99,121,156,234]
df["y"] = [3,5,6,1,8,9,2,1,0,-1]
df["z"] = [2,3,4,2,1,5,9,99,78,1]
df.set_index("nodes", inplace=True)
So the dataframe looks like this:
x y z
nodes
1 1 3 2
2 4 5 3
3 9 6 4
4 12 1 2
5 27 8 1
6 87 9 5
7 99 2 9
8 121 1 99
9 156 0 78
10 234 -1 1
My first try for searching e.g. all nodes containing number 1 is:
>>> df[(df == 1).any(axis=1)].index.values
[1 4 5 8 10]
As i have to do this for many numbers and my real dataframe is much bigger than this one, i'm searching for a very fast way to do this.
Just tried something that may be enlightening
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10000, 4)), columns=list('ABCD'))
df.set_index("A", inplace=True)
df_no_index = df.reset_index()
So set up a dataframe with ints right the way through. This is not the same as yours but it will suffice.
Then I ran four tests
%timeit df[(df == 1).any(axis=1)].index.values
%timeit df[(df['B'] == 1) | (df['C']==1)| (df['D']==1)].index.values
%timeit df_no_index[(df_no_index == 1).any(axis=1)].A.values
%timeit df_no_index[(df_no_index['B'] == 1) | (df_no_index['C']==1)| (df_no_index['D']==1)].A.values
The results I got were,
940 µs ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.47 ms ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.08 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.55 ms ± 51.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Which showed that the initial method that you took, with index seems to be the fastest of these approaches. Removing the index does not improve the speed with a moderately sized dataframe
I have the DataFrame
df = pd.DataFrame({
'colA':['?',2,3,4,'?'],
'colB':[1,2,'?',3,4],
'colC':['?',2,3,4,5]
})
I would like to get the count the the number of '?' in each column and return the following output -
colA - 2
colB - 1
colC - 1
Is there a way to return this output at once. Right now the only way I know how to do it is write a for loop for each column.
looks like the simple way is
df[df == '?'].count()
the result is
colA 2
colB 1
colC 1
dtype: int64
where df[df == '?'] give us DataFrame with ? and Nan
colA colB colC
0 ? NaN ?
1 NaN NaN NaN
2 NaN ? NaN
3 NaN NaN NaN
4 ? NaN NaN
and the count non-NA cells for each column.
Please, look on the other solutions: good readable and the most faster
You can use numpy.count_nonzero here.
pd.Series(np.count_nonzero(df.to_numpy()=='?', axis=0), index=df.columns)
# pd.Series((df.values == '?').sum(0), index=df.columns)
colA 2
colB 1
colC 1
dtype: int64
Timeit results:
Benchmarking with df of shape (1_000_000, 3)
big_df = pd.DataFrame(df.to_numpy().repeat(200_000,axis=0))
big_df.shape
(1000000, 3)
In [186]: %timeit pd.Series(np.count_nonzero(big_df.to_numpy()=='?', axis=0), index=big_df.columns)
53.1 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [187]: %timeit big_df.eq('?').sum()
171 ms ± 7.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [188]: %timeit big_df[big_df == '?'].count()
314 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [189]: %timeit pd.Series(np.apply_along_axis(lambda x: Counter(x)['?'], 0, big_df.values), index=big_df.columns)
174 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can do sum
df.eq('?').sum()
Out[182]:
colA 2
colB 1
colC 1
dtype: int64
#Bear Brown Answer is probably the most elegant, a faster option is to use numpy:
from collections import Counter
%%timeit
df[df == '?'].count()
5.2 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
pd.Series(np.apply_along_axis(lambda x: Counter(x)['?'], 0, df.values), index=df.columns)
218 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Variation to BENY's answer:
(df=='?').sum()
Out[182]:
colA 2
colB 1
colC 1
dtype: int64
Very much a beginner question, sorry: is there a way to avoid repeating the dataframe name when operating on pandas columns?
In R, data.table allows to operate on a column without repeating the dataframe name like this
very_long_dt_name = data.table::data.table(col1=c(1,2,3),col2=c(3,3,1))
# operate on the columns without repeating the dt name:
very_long_dt_name[,ratio:=round(col1/col2,2)]
I couldn't figure out how to do it with pandas in Python so I keep repeating the df name:
data = {'col1': [1,2,3], 'col2': [3, 3, 1]}
very_long_df_name = pd.DataFrame(data)
# operate on the columns requires repeating the df name
very_long_df_name['ratio'] = np.round(very_long_df_name['col1']/very_long_df_name['col2'],2)
I'm sure there's a way to avoid it but I can't find anything on Google. Any hint please? Thank you.
Try assign:
very_long_df_name.assign(ratio=lambda x: np.round(x.col1/x.col2,2))
Output:
col1 col2 ratio
0 1 3 0.33
1 2 3 0.67
2 3 1 3.00
Edit: to reflect comments, tests on 1 million rows:
%%timeit
very_long_df_name.assign(ratio = lambda x:x.col1/x.col2)
# 18.6 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
and
%%timeit
very_long_df_name['ratio'] = very_long_df_name['col1']/very_long_df_name['col2']
# 13.3 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And with np.round, assign
%%timeit
very_long_df_name.assign(ratio = lambda x: np.round(x.col1/x.col2,2))
# 64.8 ms ± 958 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
and not-assign:
%%timeit
very_long_df_name['ratio'] = np.round(very_long_df_name['col1']/very_long_df_name['col2'],2)
# 55.8 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
SO it appears that assign is vectorized, just not as well tuned.
I have a dataframe like this
data
0 1.5
1 1.3
2 1.3
3 1.8
4 1.3
5 1.8
6 1.5
And I have a list of lists like this:
indices = [[0, 3, 4], [0, 3], [2, 6, 4], [1, 3, 4, 5]]
I want to produce sums of each of the groups in my dataframe using the list of lists, so
group1 = df[0] + df[1] + df[2]
group2 = df[1] + df[2] + df[3]
group3 = df[2] + df[3] + df[4]
group4 = df[3] + df[4] + df[5]
so I am looking for something like df.groupby(indices).sum
I know this can be done iteratively using a for loop and applying the sum to each of the df.iloc[sublist], but I am looking for a faster way.
Use list comprehension:
a = [df.loc[x, 'data'].sum() for x in indices]
print (a)
[4.6, 3.3, 4.1, 6.2]
arr = df['data'].values
a = [arr[x].sum() for x in indices]
print (a)
[4.6, 3.3, 4.1, 6.2]
Solution with groupby + sum is possible, but not sure if better performance:
df1 = pd.DataFrame({
'd' : df['data'].values[np.concatenate(indices)],
'g' : np.arange(len(indices)).repeat([len(x) for x in indices])
})
print (df1)
d g
0 1.5 0
1 1.8 0
2 1.3 0
3 1.5 1
4 1.8 1
5 1.3 2
6 1.5 2
7 1.3 2
8 1.3 3
9 1.8 3
10 1.3 3
11 1.8 3
print(df1.groupby('g')['d'].sum())
g
0 4.6
1 3.3
2 4.1
3 6.2
Name: d, dtype: float64
Performance tested in small sample data - in real data should be different:
In [150]: %timeit [df.loc[x, 'data'].sum() for x in indices]
4.84 ms ± 80.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [151]: %%timeit
...: df['data'].values
...: [arr[x].sum() for x in indices]
...:
...:
20.9 µs ± 99.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [152]: %timeit pd.DataFrame({'d' : df['data'].values[np.concatenate(indices)],'g' : np.arange(len(indices)).repeat([len(x) for x in indices])}).groupby('g')['d'].sum()
1.46 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
On real data
In [37]: %timeit [df.iloc[x, 0].sum() for x in indices]
158 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [38]: arr = df['data'].values
...: %timeit \
...: [arr[x].sum() for x in indices]
5.99 ms ± 18 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In[49]: %timeit pd.DataFrame({'d' : df['last'].values[np.concatenate(sample_indices['train'])],'g' : np.arange(len(sample_indices['train'])).repeat([len(x) for x in sample_indices['train']])}).groupby('g')['d'].sum()
...:
5.97 ms ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
interesting.. both of the bottom answers are fast.