How to count non-NaN values across columns in a pandas dataframe? - python

My data looks like this:
Close a b c d e Time
2015-12-03 2051.25 5 4 3 1 1 05:00:00
2015-12-04 2088.25 5 4 3 1 NaN 06:00:00
2015-12-07 2081.50 5 4 3 NaN NaN 07:00:00
2015-12-08 2058.25 5 4 NaN NaN NaN 08:00:00
2015-12-09 2042.25 5 NaN NaN NaN NaN 09:00:00
I need to count 'horizontally' the values in the columns ['a'] to ['e'] that are not NaN. So the outcome would be this:
df['Count'] = .....
df
Close a b c d e Time Count
2015-12-03 2051.25 5 4 3 1 1 05:00:00 5
2015-12-04 2088.25 5 4 3 1 NaN 06:00:00 4
2015-12-07 2081.50 5 4 3 NaN NaN 07:00:00 3
2015-12-08 2058.25 5 4 NaN NaN NaN 08:00:00 2
2015-12-09 2042.25 5 NaN NaN NaN NaN 09:00:00 1
Thanks

You can subselect from your df and call count passing axis=1:
In [24]:
df['count'] = df[list('abcde')].count(axis=1)
df
Out[24]:
Close a b c d e Time count
2015-12-03 2051.25 5 4 3 1 1 05:00:00 5
2015-12-04 2088.25 5 4 3 1 NaN 06:00:00 4
2015-12-07 2081.50 5 4 3 NaN NaN 07:00:00 3
2015-12-08 2058.25 5 4 NaN NaN NaN 08:00:00 2
2015-12-09 2042.25 5 NaN NaN NaN NaN 09:00:00 1
TIMINGS
In [25]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
100 loops, best of 3: 3.28 ms per loop
100 loops, best of 3: 2.76 ms per loop
100 loops, best of 3: 2.98 ms per loop
apply is the slowest, which is not a surprise; the drop version is marginally faster, but semantically I prefer just passing the list of columns of interest and calling count, for readability.
Hmm I keep getting varying timings now:
In [27]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
100 loops, best of 3: 3.33 ms per loop
100 loops, best of 3: 2.7 ms per loop
100 loops, best of 3: 2.7 ms per loop
100 loops, best of 3: 2.57 ms per loop
MORE TIMINGS
In [160]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
%timeit df[list('abcde')].notnull().sum(axis=1)
1000 loops, best of 3: 1.4 ms per loop
1000 loops, best of 3: 1.14 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.05 ms per loop
It seems that testing for notnull and summing (as notnull produces a boolean mask) is quicker on this dataset.
On a 50k row df the last method is slightly quicker:
In [172]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
%timeit df[list('abcde')].notnull().sum(axis=1)
1 loops, best of 3: 5.83 s per loop
100 loops, best of 3: 6.15 ms per loop
100 loops, best of 3: 6.49 ms per loop
100 loops, best of 3: 6.04 ms per loop
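For reference, a minimal, self-contained sketch of the two fastest approaches above, rebuilt on the sample data from the question (assigning straight to a 'Count' column):
import numpy as np
import pandas as pd
df = pd.DataFrame({'Close': [2051.25, 2088.25, 2081.50, 2058.25, 2042.25],
                   'a': [5, 5, 5, 5, 5],
                   'b': [4, 4, 4, 4, np.nan],
                   'c': [3, 3, 3, np.nan, np.nan],
                   'd': [1, 1, np.nan, np.nan, np.nan],
                   'e': [1, np.nan, np.nan, np.nan, np.nan],
                   'Time': ['05:00:00', '06:00:00', '07:00:00', '08:00:00', '09:00:00']},
                  index=pd.to_datetime(['2015-12-03', '2015-12-04', '2015-12-07',
                                        '2015-12-08', '2015-12-09']))
# count non-NaN values row-wise across columns a..e
df['Count'] = df[list('abcde')].count(axis=1)
# equivalent boolean-mask version, marginally faster on larger frames
df['Count'] = df[list('abcde')].notnull().sum(axis=1)
print(df['Count'].tolist())   # [5, 4, 3, 2, 1]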

df['Count'] = df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
In [1254]: df
Out[1254]:
Close a b c d e Time Count
2015-12-03 2051.25 5 4 3 1 1 05:00:00 5
2015-12-04 2088.25 5 4 3 1 NaN 06:00:00 4
2015-12-07 2081.50 5 4 3 NaN NaN 07:00:00 3
2015-12-08 2058.25 5 4 NaN NaN NaN 08:00:00 2
2015-12-09 2042.25 5 NaN NaN NaN NaN 09:00:00 1

Either pass the list of desired columns, or just drop the two columns you want to exclude from the count, then count along axis=1 (see docs):
df['Count'] = df.drop(['Close', 'Time'], axis=1).count(axis=1)
Close a b c d e Time Count
0 2051.25 5 4 3 1 1 05:00:00 5
1 2088.25 5 4 3 1 NaN 06:00:00 4
2 2081.50 5 4 3 NaN NaN 07:00:00 3
3 2058.25 5 4 NaN NaN NaN 08:00:00 2
4 2042.25 5 NaN NaN NaN NaN 09:00:00 1

Related

Get max value in previous rows for matching rows [duplicate]

This question already has answers here: pandas rolling max with groupby (2 answers). Closed 4 months ago.
Say I have a dataframe that records temperature measurements for various sensors:
import pandas as pd
df = pd.DataFrame({'sensor': ['A', 'C', 'A', 'C', 'B', 'B', 'C', 'A', 'A', 'A'],
                   'temperature': [4.8, 12.5, 25.1, 16.9, 20.4, 15.7, 7.7, 5.5, 27.4, 17.7]})
I would like to add a column max_prev_temp that will show the previous maximum temperature for the corresponding sensor. So this works:
df["max_prev_temp"] = df.apply(
lambda row: df[df["sensor"] == row["sensor"]].loc[: row.name, "temperature"].max(),
axis=1,
)
It returns:
sensor temperature max_prev_temp
0 A 4.8 4.8
1 C 12.5 12.5
2 A 25.1 25.1
3 C 16.9 16.9
4 B 20.4 20.4
5 B 15.7 20.4
6 C 7.7 16.9
7 A 5.5 25.1
8 A 27.4 27.4
9 A 17.7 27.4
Problem is: my actual data set contains over 2 million rows, so this is excruciatingly slow (it would probably take about 2 hours). I understand that rolling is a better method, but I don't see how to use it for this specific case.
Any hint would be appreciated.
Use Series.expanding per group and remove the first index level with Series.droplevel:
df["max_prev_temp"] = df.groupby('sensor')["temperature"].expanding().max().droplevel(0)
print (df)
sensor temperature max_prev_temp
0 A 4.8 4.8
1 C 12.5 12.5
2 A 25.1 25.1
3 C 16.9 16.9
4 B 20.4 20.4
5 B 15.7 20.4
6 C 7.7 16.9
7 A 5.5 25.1
8 A 27.4 27.4
9 A 17.7 27.4
Use groupby.cummax:
df['max_prev_temp'] = df.groupby('sensor')['temperature'].cummax()
output:
sensor temperature max_prev_temp
0 A 4.8 4.8
1 C 12.5 12.5
2 A 25.1 25.1
3 C 16.9 16.9
4 B 20.4 20.4
5 B 15.7 20.4
6 C 7.7 16.9
7 A 5.5 25.1
8 A 27.4 27.4
9 A 17.7 27.4
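For completeness, a quick sanity check (a sketch only) that the vectorized groupby.cummax result matches the row-wise apply version from the question on the sample data:
import pandas as pd
df = pd.DataFrame({'sensor': ['A', 'C', 'A', 'C', 'B', 'B', 'C', 'A', 'A', 'A'],
                   'temperature': [4.8, 12.5, 25.1, 16.9, 20.4, 15.7, 7.7, 5.5, 27.4, 17.7]})
# vectorized running maximum per sensor
vectorized = df.groupby('sensor')['temperature'].cummax()
# original row-wise version from the question, kept here only for comparison
row_wise = df.apply(
    lambda row: df[df['sensor'] == row['sensor']].loc[: row.name, 'temperature'].max(),
    axis=1,
)
assert (vectorized.to_numpy() == row_wise.to_numpy()).all()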

Fetching Standard Meteorological Week from pandas dataframe date column

I have a pandas dataframe with long-term data:
point_id issue_date latitude longitude rainfall
0 1.0 2020-01-01 6.5 66.50 NaN
1 2.0 2020-01-02 6.5 66.75 NaN
... ... ... ... ... ... ... ...
6373888 17414.0 2020-12-30 38.5 99.75 NaN
6373889 17415.0 2020-12-31 38.5 100.00 NaN
6373890 rows × 5 columns
I want to extract the Standard Meteorological Week from its issue_date column, as
given in this figure.
I have tried two ways.
1st
lulc_gdf['smw'] = lulc_gdf['issue_date'].astype('datetime64[ns]').dt.strftime('%V')
2nd
lulc_gdf['iso'] = lulc_gdf['issue_date'].astype('datetime64[ns]').dt.isocalendar().week
The output in both cases is the same:
point_id issue_date latitude longitude rainfall smw iso
0 1.0 2020-01-01 6.5 66.50 NaN 01 1
1 2.0 2020-01-02 6.5 66.75 NaN 01 1
... ... ... ... ... ... ... ...
6373888 17414.0 2020-12-30 38.5 99.75 NaN 53 53
6373889 17415.0 2020-12-31 38.5 100.00 NaN 53 53
6373890 rows × 7 columns
The issue is that in both cases the week is counted with Monday (or Sunday) as the starting day of the week, irrespective of the year.
For example, in 2020 the 1st of January falls on a Wednesday (not Monday),
so the first week has only 5 days (Wed, Thu, Fri, Sat and Sun).
year week day issue_date
0 2020 1 3 2020-01-01
1 2020 1 4 2020-01-02
2 2020 1 5 2020-01-03
3 2020 1 6 2020-01-04
... ... ... ...
6373889 2020 53 4 2020-12-31
But in the case of Standard Meteorological Weeks,
I want the output, for every year, to be:
1st week - always from 1st January to 7th January
2nd week - from 8th January to 14th January
3rd week - from 15th January to 21st January
------------------------------- and so on,
irrespective of the starting day of the year (Sunday, Monday etc.).
How to do so?
Use:
df = pd.DataFrame({'issue_date': pd.date_range('2000-01-01','2000-12-31')})
# inspired by https://stackoverflow.com/a/61592907/2901002
normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
days = df['issue_date'].dt.dayofyear
df['smw'] = np.where(df['issue_date'].dt.is_leap_year,
                     leap_year[days - 1],
                     normal_year[days - 1])
print (df[df['smw'] == 9])
issue_date smw
56 2000-02-26 9
57 2000-02-27 9
58 2000-02-28 9
59 2000-02-29 9
60 2000-03-01 9
61 2000-03-02 9
62 2000-03-03 9
63 2000-03-04 9
Performance:
#11323 rows
df = pd.DataFrame({'issue_date': pd.date_range('2000-01-01','2030-12-31')})
In [6]: %%timeit
...: normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
...: leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
...: days = df['issue_date'].dt.dayofyear
...:
...: df['smw'] = np.where(df['issue_date'].dt.is_leap_year, leap_year[days - 1], normal_year[days - 1])
...:
3.51 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %%timeit
...: df['smw1'] = get_smw(df['issue_date'])
...:
17.2 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#51500 rows
df = pd.DataFrame({'issue_date': pd.date_range('1900-01-01','2040-12-31')})
In [9]: %%timeit
...: normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
...: leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
...: days = df['issue_date'].dt.dayofyear
...:
...: df['smw'] = np.where(df['issue_date'].dt.is_leap_year, leap_year[days - 1], normal_year[days - 1])
...:
...:
11.9 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %%timeit
...: df['smw1'] = get_smw(df['issue_date'])
...:
...:
64.3 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
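If this needs to be reused across dataframes, the lookup tables can be wrapped into a small helper, e.g. (a sketch; the name smw_lookup is mine, and it assumes the column is already datetime64):
import numpy as np
import pandas as pd
def smw_lookup(dates):
    # day-of-year -> week lookup tables: week 52 absorbs the extra day(s)
    # at year end, and week 9 absorbs 29 February in leap years
    normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
    leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
    days = dates.dt.dayofyear
    smw = np.where(dates.dt.is_leap_year, leap_year[days - 1], normal_year[days - 1])
    return pd.Series(smw, index=dates.index, name='smw')
df = pd.DataFrame({'issue_date': pd.date_range('2000-01-01', '2000-12-31')})
df['smw'] = smw_lookup(df['issue_date'])
print(df['smw'].max())   # 52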
You can write a custom function to calculate the Standard Meteorological Weeks.
The basic calculation takes the difference in days from 1st January of the same year, divides by 7 and adds 1,
with a special adjustment so that Week No. 9 has 8 days in leap years, and another so that the last week of the year has 8 days:
import numpy as np
# convert to datetime format if not already in datetime
df['issue_date'] = pd.to_datetime(df['issue_date'])
def get_smw(date_s):
    # get day-of-the-year minus 1, in the range [0..364/365], for division by 7
    days_diff = date_s.dt.dayofyear - 1
    # adjust for leap years so that Week No. 9 has 8 days
    # (subtract one day for 29 Feb onwards in the same year)
    leap_adj = date_s.dt.is_leap_year & (date_s > pd.to_datetime(date_s.dt.year.astype(str) + '-02-28'))
    days_diff = np.where(leap_adj, days_diff - 1, days_diff)
    # adjust for the last week of the year to have 8 days:
    # cap the value for 31 Dec at 363 instead of 364 to keep it in the same week as 24 Dec
    days_diff = np.clip(days_diff, 0, 363)
    smw = days_diff // 7 + 1
    return smw
df['smw'] = get_smw(df['issue_date'])
Result:
print(df)
point_id issue_date latitude longitude rainfall smw
0 1.0 2020-01-01 6.5 66.50 NaN 1
1 2.0 2020-01-02 6.5 66.75 NaN 1
2 3.0 2020-01-03 6.5 66.75 NaN 1
3 4.0 2020-01-04 6.5 66.75 NaN 1
4 5.0 2020-01-05 6.5 66.75 NaN 1
5 6.0 2020-01-06 6.5 66.75 NaN 1
6 7.0 2020-01-07 6.5 66.75 NaN 1
7 8.0 2020-01-08 6.5 66.75 NaN 2
8 9.0 2020-01-09 6.5 66.75 NaN 2
40 40.0 2020-02-26 6.5 66.75 NaN 9
41 41.0 2020-03-04 6.5 66.75 NaN 9
42 42.0 2020-03-05 6.5 66.75 NaN 10
43 43.0 2020-03-12 6.5 66.75 NaN 11
6373880 17414.0 2020-12-23 38.5 99.75 NaN 51
6373881 17414.0 2020-12-24 38.5 99.75 NaN 52
6373888 17414.0 2020-12-30 38.5 99.75 NaN 52
6373889 17415.0 2020-12-31 38.5 100.00 NaN 52
7000040 40.0 2021-02-26 6.5 66.75 NaN 9
7000041 41.0 2021-03-04 6.5 66.75 NaN 9
7000042 42.0 2021-03-05 6.5 66.75 NaN 10
7000042 43.0 2021-03-12 6.5 66.75 NaN 11
7373880 17414.0 2021-12-23 38.5 99.75 NaN 51
7373881 17414.0 2021-12-24 38.5 99.75 NaN 52
7373888 17414.0 2021-12-30 38.5 99.75 NaN 52
7373889 17415.0 2021-12-31 38.5 100.00 NaN 52

Fill NaN values from previous column with data

I have a dataframe in pandas, and I am trying to take data from the same row and different columns and fill NaN values in my data. How would I do this in pandas?
For example,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
83 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0 NaN 28.0 18.0
The goal is for the data to look like this:
1 2 3 4 5 6 7 ... 10 11 12 13 14 15 16
83 NaN NaN NaN 27.0 29.0 29.0 30.0 ... 15.0 16.0 17.0 28.0 30.0 28.0 18.0
The goal is to be able to take the mean of the last five columns that have data. If there are fewer than five data-filled cells, take the average of however many there are.
Use the justify function to improve performance, filtering all columns except the first with DataFrame.iloc:
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0
14 15 16
80 NaN 28.0 18.0
df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob NaN NaN NaN NaN NaN 27.0 29.0 29.0 30.0 15.0 16.0 17.0 28.0
14 15 16
80 30.0 28.0 18.0
Function:
#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
Performance:
#100 rows
df = pd.concat([df] * 100, ignore_index=True)
#41 times slower
In [39]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
145 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [41]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
3.54 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#1000 rows
df = pd.concat([df] * 1000, ignore_index=True)
#198 times slower
In [43]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
1.13 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [45]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
5.7 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
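Since the stated goal is the mean of the last five data-filled cells per row, that falls out of the right-justified array directly (a sketch; the column name last5_mean is just illustrative, and it assumes every row holds at least one value so np.nanmean never sees an all-NaN slice):
justified = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
# after right-justification the last five columns hold the last five
# non-NaN values of each row (padded with NaN when fewer exist)
df['last5_mean'] = np.nanmean(justified[:, -5:], axis=1)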
Assuming you need to move all NaN to the first columns, I would define a function that places all the NaN values first and leaves the rest as it is:
def fun(row):
    index_order = row.index[row.isnull()].append(row.index[~row.isnull()])
    row.iloc[:] = row[index_order].values
    return row
df_fix = df.loc[:,df.columns[1:]].apply(fun, axis=1)
If you need to overwrite the results in the same dataframe then:
df.loc[:,df.columns[1:]] = df_fix.copy()

Split a pandas dataframe into multiple dataframes if all rows are nan

I have the following dataframe.
a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999
5 NaN NaN NaN NaN
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001
11 NaN NaN NaN NaN
12 4.43 29.724285 126.620003 24.764999
13 NaN NaN NaN NaN
14 4.29 29.010000 120.309998 24.730000
15 4.11 29.420000 119.480003 25.035000
I want to split this df into multiple dfs whenever there is a row with all NaN values.
I explored the following links but could not figure out how to apply them to my problem.
Split pandas dataframe in two if it has more than 10 rows
Splitting dataframe into multiple dataframes
In my example, I would have 4 dataframes with 5, 5, 1 and 2 rows as the output.
Please suggest the way forward.
Using isna, all, cumsum and groupby:
first we check whether all the values in a row are NaN, then use cumsum to create a group indicator, and finally collect the resulting dataframes in a list with groupby:
grps = df.isna().all(axis=1).cumsum()
dfs = [df.dropna() for _, df in df.groupby(grps)]
for df in dfs:
    print(df)
a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999
a b c d
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001
a b c d
12 4.43 29.724285 126.620003 24.764999
a b c d
14 4.29 29.01 120.309998 24.730
15 4.11 29.42 119.480003 25.035
Something like this should do the trick:
import pandas as pd
import numpy as np
data_frame = pd.DataFrame({"a": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "b": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "c": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "d": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "e": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "f": [1, np.nan, 3, np.nan, 4, np.nan, 5]})
all_nan = data_frame.index[data_frame.isnull().all(1)]
df_list = []
prev = 0
for i in all_nan:
    df_list.append(data_frame[prev:i])
    prev = i + 1
for i in df_list:
    print(i)
Just another flavor of doing the same thing:
nan_indices = df.index[df.isna().all(axis=1)]
df_list = [df.dropna() for df in np.split(df, nan_indices)]
df_list
[ a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999,
a b c d
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001,
a b c d
12 4.43 29.724285 126.620003 24.764999,
a b c d
14 4.29 29.01 120.309998 24.730
15 4.11 29.42 119.480003 25.035]
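The isna/cumsum idea can also be packaged as a small reusable helper (a sketch; the function name split_on_blank_rows is mine):
import pandas as pd
def split_on_blank_rows(frame):
    # the group id increases each time an all-NaN separator row is seen;
    # how='all' drops only those separator rows from each piece
    groups = frame.isna().all(axis=1).cumsum()
    return [g.dropna(how='all') for _, g in frame.groupby(groups)]
dfs = split_on_blank_rows(df)
print(len(dfs))   # 4 for the sample data in the question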
