If I've got a DataFrame in pandas which looks something like:
A B C
0 1 NaN 2
1 NaN 3 NaN
2 NaN 4 5
3 NaN NaN NaN
How can I get the first non-null value from each row? E.g. for the above, I'd like to get: [1, 3, 4, None] (or equivalent Series).
Backfill the NaNs along each row with fillna, so each row's first valid value propagates leftward into the first column, then take the leftmost column:
df.fillna(method='bfill', axis=1).iloc[:, 0]
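For reference, a quick sketch on the sample frame; note that in recent pandas versions fillna(method='bfill') is deprecated in favour of df.bfill:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan],
                   'C': [2, np.nan, 5, np.nan]})

# backfill along each row so the first valid value lands in column 0
print(df.bfill(axis=1).iloc[:, 0].tolist())  # [1.0, 3.0, 4.0, nan]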
This is a rather messy way to do it: first use first_valid_index to get the first valid column for each row, convert the returned Series to a DataFrame so we can call apply row-wise, and then use that result to index back into the original df:
In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func, axis=1)
Out[160]:
0 1
1 3
2 4
3 NaN
dtype: float64
EDIT
A slightly cleaner way:
In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)
Out[12]:
0 1
1 3
2 4
3 NaN
dtype: float64
I'm going to weigh in here, as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values, but the lookup is very quick:
import numpy as np

def get_first_non_null(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]
EDIT:
Here's a fully vectorized solution, which can be a good deal faster again depending on the shape of the input. Updated benchmarking below.
def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]
If a row is completely null then the corresponding value will be null also.
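As a quick sanity check on the question's sample frame (a sketch, assuming numpy and pandas are imported as np and pd):
df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan],
                   'C': [2, np.nan, 5, np.nan]})
print(get_first_non_null_vec(df))  # [ 1.  3.  4. nan]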
Here's some benchmarking against unutbu's solution:
df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop
In [109]:
df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop
df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop
Here is another way to do it:
In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]:
0 1
1 3
2 4
3 NaN
dtype: float64
The idea here is to use stack to move the columns into a row index level:
In [184]: df.stack()
Out[184]:
0 A 1
C 2
1 B 3
2 B 4
C 5
dtype: float64
Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:
In [185]: df.stack().groupby(level=0).first()
Out[185]:
0 1
1 3
2 4
dtype: float64
All we need to do is reindex the result (using the original index) so as to
include rows that are completely NaN:
df.stack().groupby(level=0).first().reindex(df.index)
This is nothing new, but it's a combination of the best bits of @yangie's approach (the list comprehension) and @EdChum's df.apply approach, in a form that I think is easiest to understand.
First, which columns do we want to pick our values from?
In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)
In [96]: pick_cols
Out[96]:
0 A
1 B
2 B
3 None
dtype: object
Now how do we pick the values?
In [100]: [df.loc[k, v] if v is not None else None
    ....:  for k, v in pick_cols.items()]
Out[100]: [1.0, 3.0, 4.0, None]
This is ok, but we really want the index to match that of the original DataFrame:
In [98]: pd.Series({k: df.loc[k, v] if v is not None else None
    ....:  for k, v in pick_cols.items()})
Out[98]:
0 1
1 3
2 4
3 NaN
dtype: float64
groupby with axis=1
If we pass a callable that returns the same value for every column, we group all the columns together. This allows us to use groupby's first method, which makes this easy:
df.groupby(lambda x: 'Z', axis=1).first()
Z
0 1.0
1 3.0
2 4.0
3 NaN
This returns a DataFrame whose single column is named after the value returned by the callable ('Z').
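If you'd rather have a Series than a one-column DataFrame, just select that column afterwards (a minimal sketch; note that groupby with axis=1 is deprecated in recent pandas versions):
s = df.groupby(lambda x: 'Z', axis=1).first()['Z']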
lookup, notna, and idxmax
df.lookup(df.index, df.notna().idxmax(1))
array([ 1., 3., 4., nan])
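Note that DataFrame.lookup was removed in pandas 2.0; an equivalent under the same idea would be (a sketch, assuming numpy as np):
idx = df.notna().idxmax(1)
df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(idx)]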
argmin and slicing
v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]
array([ 1., 3., 4., nan])
Here is a one-line solution (note the explicit is not None check, so that falsy column labels such as 0 are handled correctly):
[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]
Edit:
This solution iterates over the rows of df. row.first_valid_index() returns the label of the first non-NA/null value, which is then used as an index to get the first non-null item in each row.
If there is no non-null value in the row, row.first_valid_index() returns None, which cannot be used as an index, hence the if-else.
I packed everything into a list comprehension for brevity.
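Spelled out as an explicit loop, the same logic reads (an equivalent sketch):
result = []
for _, row in df.iterrows():
    idx = row.first_valid_index()
    result.append(row[idx] if idx is not None else None)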
JoeCondron's answer (EDIT: before his last edit!) is cool, but there is room for significant improvement by avoiding the non-vectorized enumeration:
def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return a[np.arange(a.shape[0]), col_index]
The improvement is small if the DataFrame is relatively flat:
In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop
In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop
... but can be relevant on slim DataFrames:
In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))
In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop
In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop
Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).
import numpy
import pandas

df = pandas.DataFrame({'A': [1, numpy.nan, numpy.nan, numpy.nan], 'B': [numpy.nan, 3, 4, numpy.nan], 'C': [2, numpy.nan, 5, numpy.nan]})
df
A B C
0 1.0 NaN 2.0
1 NaN 3.0 NaN
2 NaN 4.0 5.0
3 NaN NaN NaN
df.apply(lambda x: numpy.nan if all(x.isnull()) else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]
This question is similar to this one posted earlier. I want to concatenate three columns instead of two:
Here is how to combine two columns:
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['combined']=df.apply(lambda x:'%s_%s' % (x['foo'],x['bar']),axis=1)
df
bar foo new combined
0 1 a apple a_1
1 2 b banana b_2
2 3 c pear c_3
I want to combine three columns with this command, but it is not working. Any idea?
df['combined']=df.apply(lambda x:'%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
Another solution using DataFrame.apply(), with slightly less typing and more scalable when you want to join more columns:
cols = ['foo', 'bar', 'new']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
You can use string concatenation to combine columns, with or without delimiters. You do have to convert the type on non-string columns.
In[17]: df['combined'] = df['bar'].astype(str) + '_' + df['foo'] + '_' + df['new']
In [18]: df
Out[18]:
bar foo new combined
0 1 a apple 1_a_apple
1 2 b banana 2_b_banana
2 3 c pear 3_c_pear
If you have even more columns you want to combine, using the Series method str.cat might be handy:
df["combined"] = df["foo"].str.cat(df[["bar", "new"]].astype(str), sep="_")
Basically, you select the first column (if it is not already of type str, you need to append .astype(str)), to which you append the other columns (separated by an optional separator character).
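For example, on the question's sample frame this gives (a quick sketch):
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'], 'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})
df['combined'] = df['foo'].str.cat(df[['bar', 'new']].astype(str), sep='_')
print(df['combined'].tolist())  # ['a_1_apple', 'b_2_banana', 'c_3_pear']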
Just wanted to make a time comparison for both solutions (for a 30K-row DataFrame):
In [1]: df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
In [2]: big = pd.concat([df] * 10**4, ignore_index=True)
In [3]: big.shape
Out[3]: (30000, 3)
In [4]: %timeit big.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
1 loop, best of 3: 881 ms per loop
In [5]: %timeit big['bar'].astype(str)+'_'+big['foo']+'_'+big['new']
10 loops, best of 3: 44.2 ms per loop
A few more options:
In [6]: %timeit big.iloc[:, :-1].astype(str).add('_').sum(axis=1).str.cat(big.new)
10 loops, best of 3: 72.2 ms per loop
In [11]: %timeit big.astype(str).add('_').sum(axis=1).str[:-1]
10 loops, best of 3: 82.3 ms per loop
Possibly the fastest solution is to operate in plain Python:
from pandas import Series

Series(
    map(
        '_'.join,
        df.values.tolist()
        # when non-string columns are present:
        # df.values.astype(str).tolist()
    ),
    index=df.index
)
Comparison against @MaxU's answer (using the big data frame, which has both numeric and string columns):
%timeit big['bar'].astype(str) + '_' + big['foo'] + '_' + big['new']
# 29.4 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit Series(map('_'.join, big.values.astype(str).tolist()), index=big.index)
# 27.4 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Comparison against @derchambers' answer (using their df data frame, where all columns are strings):
from functools import reduce

def reduce_join(df, columns):
    slist = [df[x] for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])

def list_map(df, columns):
    return Series(
        map(
            '_'.join,
            df[columns].values.tolist()
        ),
        index=df.index
    )
%timeit df1 = reduce_join(df, list('1234'))
# 602 ms ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = list_map(df, list('1234'))
# 351 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The answer given by @allen is reasonably generic, but can lack performance for larger dataframes:
reduce does a lot better:
from functools import reduce
import pandas as pd

# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'

def reduce_join(df, columns):
    assert len(columns) > 1
    slist = [df[x].astype(str) for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])

def apply_join(df, columns):
    assert len(columns) > 1
    return df[columns].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
# ensure outputs are equal
df1 = reduce_join(df, list('1234'))
df2 = apply_join(df, list('1234'))
assert df1.equals(df2)
# profile
%timeit df1 = reduce_join(df, list('1234')) # 733 ms
%timeit df2 = apply_join(df, list('1234')) # 8.84 s
I think you are missing one %s
df['combined']=df.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
First convert the columns to str, then use .T.agg('_'.join) to concatenate them.
# Initialize columns
cols_concat = ['first_name', 'second_name']
# Convert them to type str
df[cols_concat] = df[cols_concat].astype('str')
# Then concatenate them as follows
df['new_col'] = df[cols_concat].T.agg('_'.join)
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['combined'] = df['foo'].astype(str)+'_'+df['bar'].astype(str)
If you concatenate with a string separator such as '_', first convert the columns you want to combine to strings; after that you can concatenate them.
df['New_column_name'] = df['Column1'].map(str) + 'X' + df['Steps']
Here 'X' is any delimiter (e.g. a space) by which you want to separate the two merged columns.
If you have a list of columns you want to concatenate, and maybe you'd like to use a separator, here's what you can do:
def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
    df[new_col_name] = df[cols_to_concat[0]]
    for col in cols_to_concat[1:]:
        df[new_col_name] = df[new_col_name].astype(str) + sep + df[col].astype(str)
This should be faster than apply and takes an arbitrary number of columns to concatenate.
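A quick usage sketch (note that the function mutates df in place; pandas import assumed):
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'], 'bar': [1, 2, 3]})
concat_columns(df, ['foo', 'bar'], 'combined', sep='_')
print(df['combined'].tolist())  # ['a_1', 'b_2', 'c_3']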
@derchambers I found one more solution:
import pandas as pd

# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'

def eval_join(df, columns):
    sum_elements = [f"df['{col}']" for col in columns]
    to_eval = "+ '_' + ".join(sum_elements)
    return eval(to_eval)
#profile
%timeit df3 = eval_join(df, list('1234')) # 504 ms
You could create a function which would make the implementation neater (esp. if you're using this functionality multiple times throughout an implementation):
def concat_cols(df, cols_to_concat, new_col_name, separator):
    df[new_col_name] = ''
    for i, col in enumerate(cols_to_concat):
        df[new_col_name] += ('' if i == 0 else separator) + df[col].astype(str)
    return df
Sample usage:
test = pd.DataFrame(data=[[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'])
test = concat_cols(test, ['a', 'b', 'c'], 'concat_col', '_')
Following up on @Allen's response:
If you need to chain such operation with other dataframe transformation, use assign:
cols = ['foo', 'bar', 'new']
df.assign(
    combined=lambda x: x[cols].apply(
        lambda row: '_'.join(row.values.astype(str)), axis=1
    )
)
Considering that one is combining three columns, one needs three format specifiers, '%s_%s_%s', not just two ('%s_%s'). The following will do the job:
df['combined'] = df.apply(lambda x: '%s_%s_%s' % (x['foo'], x['bar'], x['new']), axis=1)
[Out]:
foo bar new combined
0 a 1 apple a_1_apple
1 b 2 banana b_2_banana
2 c 3 pear c_3_pear
Alternatively, if one wants to keep a separate list of the columns to combine, the following will do the job:
columns = ['foo', 'bar', 'new']
df['combined'] = df.apply(lambda x: '_'.join([str(x[i]) for i in columns]), axis=1)
[Out]:
foo bar new combined
0 a 1 apple a_1_apple
1 b 2 banana b_2_banana
2 c 3 pear c_3_pear
This last one is more convenient, as one can simply change or add the column names in the list - it will require fewer changes.
I currently have a dataframe consisting of columns with 1s and 0s as values. I would like to iterate through the columns and delete the ones that are made up of only 0s. Here's what I have tried so far:
ones = []
zeros = []
for year in years:
    for i in range(0, 599):
        if year[str(i)].values.any() == 1:
            ones.append(i)
        if year[str(i)].values.all() == 0:
            zeros.append(i)
    for j in ones:
        if j in zeros:
            zeros.remove(j)
    for q in zeros:
        del year[str(q)]
Here years is a list of dataframes for the various years I am analyzing, ones collects the columns that contain a one, and zeros is a list of columns containing all zeros. Is there a better way to delete a column based on a condition? For some reason I have to check whether the ones columns are also in the zeros list and remove them from it to obtain the list of all-zero columns.
df.loc[:, (df != 0).any(axis=0)]
Here is a break-down of how it works:
In [74]: import pandas as pd
In [75]: df = pd.DataFrame([[1,0,0,0], [0,0,1,0]])
In [76]: df
Out[76]:
0 1 2 3
0 1 0 0 0
1 0 0 1 0
[2 rows x 4 columns]
df != 0 creates a boolean DataFrame which is True where df is nonzero:
In [77]: df != 0
Out[77]:
0 1 2 3
0 True False False False
1 False False True False
[2 rows x 4 columns]
(df != 0).any(axis=0) returns a boolean Series indicating which columns have nonzero entries. (The any operation aggregates values along the 0-axis -- i.e. along the rows -- into a single boolean value. Hence the result is one boolean value for each column.)
In [78]: (df != 0).any(axis=0)
Out[78]:
0 True
1 False
2 True
3 False
dtype: bool
And df.loc can be used to select those columns:
In [79]: df.loc[:, (df != 0).any(axis=0)]
Out[79]:
0 2
0 1 0
1 0 1
[2 rows x 2 columns]
To "delete" the zero-columns, reassign df:
df = df.loc[:, (df != 0).any(axis=0)]
Here is an alternative way to do it:
df.replace(0, np.nan).dropna(axis=1, how="all")
Compared with the solution of unutbu, this way is obviously slower:
%timeit df.loc[:, (df != 0).any(axis=0)]
652 µs ± 5.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.replace(0,np.nan).dropna(axis=1,how="all")
1.75 ms ± 9.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In case you'd like a more expressive way of getting the zero-column names so you can print / log them, and drop them, in-place, by their names:
zero_cols = [ col for col, is_zero in ((df == 0).sum() == df.shape[0]).items() if is_zero ]
df.drop(zero_cols, axis=1, inplace=True)
A breakdown of how it works:
# a pandas Series with {col: is_zero} items
# is_zero is True when the number of zero items in that column == num_all_rows
(df == 0).sum() == df.shape[0]
# a list comprehension of zero_col_names is built from the_series
[ col for col, is_zero in the_series.items() if is_zero ]
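Putting the pieces together on a small frame (a sketch):
import pandas as pd

df = pd.DataFrame({'a': [1, 0], 'b': [0, 0]})
the_series = (df == 0).sum() == df.shape[0]
zero_cols = [col for col, is_zero in the_series.items() if is_zero]
df.drop(zero_cols, axis=1, inplace=True)
print(df.columns.tolist())  # ['a']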
In case there are some NaN values in your columns and you want to remove columns that contain only 0s and NaNs, use:
df.loc[:, (df**2).sum() != 0]
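A quick sketch of why this works: squaring leaves NaN as NaN, and sum() skips NaNs by default, so a column containing only zeros and NaNs sums to exactly 0:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 0, np.nan], 'b': [0, 0, np.nan], 'c': [0, 0, 0]})
print(df.loc[:, (df**2).sum() != 0])  # only column 'a' survives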
Is there any way to specify the index that I want for a new row, when appending the row to a dataframe?
The original documentation provides the following example:
In [1301]: df = DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In [1302]: df
Out[1302]:
A B C D
0 -1.137707 -0.891060 -0.693921 1.613616
1 0.464000 0.227371 -0.496922 0.306389
2 -2.290613 -1.134623 -1.561819 -0.260838
3 0.281957 1.523962 -0.902937 0.068159
4 -0.057873 -0.368204 -1.144073 0.861209
5 0.800193 0.782098 -1.069094 -1.099248
6 0.255269 0.009750 0.661084 0.379319
7 -0.008434 1.952541 -1.056652 0.533946
In [1303]: s = df.xs(3)
In [1304]: df.append(s, ignore_index=True)
Out[1304]:
A B C D
0 -1.137707 -0.891060 -0.693921 1.613616
1 0.464000 0.227371 -0.496922 0.306389
2 -2.290613 -1.134623 -1.561819 -0.260838
3 0.281957 1.523962 -0.902937 0.068159
4 -0.057873 -0.368204 -1.144073 0.861209
5 0.800193 0.782098 -1.069094 -1.099248
6 0.255269 0.009750 0.661084 0.379319
7 -0.008434 1.952541 -1.056652 0.533946
8 0.281957 1.523962 -0.902937 0.068159
where the new row gets the index label automatically. Is there any way to control the new label?
The name of the Series becomes the index of the row in the DataFrame:
In [99]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In [100]: s = df.xs(3)
In [101]: s.name = 10
In [102]: df.append(s)
Out[102]:
A B C D
0 -2.083321 -0.153749 0.174436 1.081056
1 -1.026692 1.495850 -0.025245 -0.171046
2 0.072272 1.218376 1.433281 0.747815
3 -0.940552 0.853073 -0.134842 -0.277135
4 0.478302 -0.599752 -0.080577 0.468618
5 2.609004 -1.679299 -1.593016 1.172298
6 -0.201605 0.406925 1.983177 0.012030
7 1.158530 -2.240124 0.851323 -0.240378
10 -0.940552 0.853073 -0.134842 -0.277135
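Note that DataFrame.append was deprecated and removed in pandas 2.0; an equivalent with pd.concat that keeps the named index would be (a sketch):
# s carries its name (10) into the row index, just as append did
df = pd.concat([df, s.to_frame().T])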
df.loc will do the job:
>>> df = pd.DataFrame(np.random.randn(3, 2), columns=['A','B'])
>>> df
A B
0 -0.269036 0.534991
1 0.069915 -1.173594
2 -1.177792 0.018381
>>> df.loc[13] = df.loc[1]
>>> df
A B
0 -0.269036 0.534991
1 0.069915 -1.173594
2 -1.177792 0.018381
13 0.069915 -1.173594
I shall refer to the same sample of data as posted in the question:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
print('The original data frame is: \n{}'.format(df))
Running this code will give you
The original data frame is:
A B C D
0 0.494824 -0.328480 0.818117 0.100290
1 0.239037 0.954912 -0.186825 -0.651935
2 -1.818285 -0.158856 0.359811 -0.345560
3 -0.070814 -0.394711 0.081697 -1.178845
4 -1.638063 1.498027 -0.609325 0.882594
5 -0.510217 0.500475 1.039466 0.187076
6 1.116529 0.912380 0.869323 0.119459
7 -1.046507 0.507299 -0.373432 -1.024795
Now you wish to append a new row to this data frame, which doesn't need to be a copy of any other row in the data frame. @Alon suggested an interesting approach of using df.loc to append a new row with a different index. The issue with this approach, however, is that if there is already a row present at that index, it will be overwritten by the new values. This is typically the case for datasets whose row index is not unique, like store IDs in transaction datasets.
So a more general solution to your question is to create the row, transform the new row data into a pandas Series, name it with the index you want, and then append it to the data frame. Don't forget to overwrite the original data frame with the one with the appended row, since df.append returns a new DataFrame and does not modify the original in place. Here is the code:
row = pd.Series({'A':10,'B':20,'C':30,'D':40},name=3)
df = df.append(row)
print('The new data frame is: \n{}'.format(df))
Following would be the new output:
The new data frame is:
A B C D
0 0.494824 -0.328480 0.818117 0.100290
1 0.239037 0.954912 -0.186825 -0.651935
2 -1.818285 -0.158856 0.359811 -0.345560
3 -0.070814 -0.394711 0.081697 -1.178845
4 -1.638063 1.498027 -0.609325 0.882594
5 -0.510217 0.500475 1.039466 0.187076
6 1.116529 0.912380 0.869323 0.119459
7 -1.046507 0.507299 -0.373432 -1.024795
3 10.000000 20.000000 30.000000 40.000000
There is another solution. The following code fails (although I think pandas should support this):
import pandas as pd
# empty dataframe
a = pd.DataFrame()
a.loc[0] = {'first': 111, 'second': 222}
But the next code runs fine:
import pandas as pd
# empty dataframe
a = pd.DataFrame()
a = a.append(pd.Series({'first': 111, 'second': 222}, name=0))
Maybe my case is a different scenario, but it looks similar. I would define my own question as: 'How to insert a row with a new index at some (given) position?'
Let's create test dataframe:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=['x', 'y'])
Result:
A B
x 1 2
y 3 4
Then, let's say, we want to place a new row with index z at position 1 (second row).
pos = 1
index_name = 'z'
# create new indexes where index is at the specified position
new_indexes = df.index.insert(pos, index_name)
# create new dataframe with new row
# specify new index in name argument
new_line = pd.Series({'A': 5, 'B': 6}, name=index_name)
df_new_row = pd.DataFrame([new_line], columns=df.columns)
# append new line to dataframe
df = pd.concat([df, df_new_row])
Now it is in the end:
A B
x 1 2
y 3 4
z 5 6
Now let's reorder the rows by reindexing with the new index order:
df = df.reindex(new_indexes)
Result:
A B
x 1 2
z 5 6
y 3 4
You should consider using df.loc[row_name] = row_value.
df.append(pd.Series({column: row_value}, name=row_name)) will lead to:
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df.loc[row_name] = row_value is faster than pd.concat
Here is an example:
p = pd.DataFrame(data=np.random.rand(100), columns=['price'], index=np.arange(100))
def func1(p):
    for i in range(100):
        p.loc[i] = 0

def func2(p):
    for i in range(100):
        p.append(pd.Series({'BTC': 0}, name=i))

def func3(p):
    for i in range(100):
        p = pd.concat([p, pd.Series({i: 0}, name='price')])
%timeit func1(p)
1.87 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit func2(p)
1.56 s ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func3(p)
24.8 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)