I would like to work with a pandas data frame to get a strange yet desired output dataframe. For each row, I'd like any values of 0.0 to be replaced with an empty string (''), and all values of 1.0 to be replaced with the value of the index. Any given value on a row can only be 1.0 or 0.0.
Here's some example data:
# starting df
df = pd.DataFrame.from_dict({'A':[1.0,0.0,0.0],'B':[1.0,1.0,0.0],'C':[0.0,1.0,1.0]})
df.index=['x','y','z']
print(df)
What the input df looks like:
A B C
x 1.0 1.0 0.0
y 0.0 1.0 1.0
z 0.0 0.0 1.0
What I would like the output df to look like:
A B C
x x x
y y y
z z
So far I've got this pretty inefficient but seemingly working code:
for idx in df.index:
    df.loc[idx] = df.loc[idx].map(str).replace('1.0', str(idx))
    df.loc[idx] = df.loc[idx].map(str).replace('0.0', '')
Could anyone please suggest an efficient way to do this?
The real data frame I'll be working with has a shape of (4548, 2044) and the values will always be floats (1.0 or 0.0), like in the example. I'm manipulating the usher_barcodes.csv data from "raw.githubusercontent.com/andersen-lab/Freyja/main/freyja/data/…" into a format required by another pipeline, where the column headers are lineage names and the values are mutations (taken from the index). The column headers and index values will likely be different each time I need to run this code because the lineage assignments are constantly changing.
Thanks!
Use numpy.where, broadcasting the index converted to a NumPy array:
df = pd.DataFrame(np.where(df.eq(1),
                           df.index.to_numpy()[:, None],
                           ''),
                  index=df.index,
                  columns=df.columns)
print(df)
A B C
x x x
y y y
z z
Performance with data of size (4548, 2044):
np.random.seed(2023)
df = pd.DataFrame(np.random.choice([0.0,1.0], size=(4548,2044))).add_prefix('c')
df.index = df.index.astype(str) + 'r'
# print (df)
In [87]: %timeit df.eq(1).mul(df.index, axis=0)
684 ms ± 36.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [90]: %timeit pd.DataFrame(np.where(df.eq(1),df.index.to_numpy()[:, None],''),index = df.index, columns = df.columns)
449 ms ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can simply do:
for idx, row in df.iterrows():
    df.loc[idx] = ['' if val == 0 else idx for val in row]
which gives:
A B C
x x x
y y y
z z
Take advantage of the fact that 1*'x' -> 'x' and 0*'x' -> '':
out = df.eq(1).mul(df.index, axis=0)
NB: eq(1) converts the floats to booleans, since True is equivalent to 1. You could also use astype(int) if you only ever have 0.0/1.0 values.
Output:
A B C
x x x
y y y
z z
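As noted above, astype(int) should behave the same when the data is strictly 0.0/1.0; a quick sketch of that variant (not benchmarked here, and assuming pandas falls back to element-wise multiplication with the object-dtype index, as it does for the boolean version):
# sketch: integer 0/1 times a string gives '' or the string itself
out = df.astype(int).mul(df.index, axis=0)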
If I've got a DataFrame in pandas which looks something like:
A B C
0 1 NaN 2
1 NaN 3 NaN
2 NaN 4 5
3 NaN NaN NaN
How can I get the first non-null value from each row? E.g. for the above, I'd like to get: [1, 3, 4, None] (or equivalent Series).
Fill the nans from the left with fillna, then get the leftmost column:
df.fillna(method='bfill', axis=1).iloc[:, 0]
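A side note: in recent pandas versions the method= argument of fillna is deprecated, so the same idea would use the dedicated bfill method (a sketch, not re-benchmarked here):
# back-fill across columns, then take the leftmost column
df.bfill(axis=1).iloc[:, 0]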
This is a really messy way to do it: first use first_valid_index to get the first valid column for each row, convert the returned Series to a DataFrame so we can call apply row-wise, and use that to index back into the original df:
In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func,axis=1)
Out[160]:
0 1
1 3
2 4
3 NaN
dtype: float64
EDIT
A slightly cleaner way:
In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)
Out[12]:
0 1
1 3
2 4
3 NaN
dtype: float64
I'm going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values but the look up is very quick:
def get_first_non_null(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]
EDIT:
Here's a fully vectorized solution which can be a good deal faster again, depending on the shape of the input. Updated benchmarking below.
def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]
If a row is completely null then the corresponding value will be null also.
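A quick sanity check on the example frame from the question (usage sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan],
                   'C': [2, np.nan, 5, np.nan]})
get_first_non_null_vec(df)   # array([ 1.,  3.,  4., nan])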
Here's some benchmarking against unutbu's solution:
df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop
In [109]:
df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop
df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop
Here is another way to do it:
In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]:
0 1
1 3
2 4
3 NaN
dtype: float64
The idea here is to use stack to move the columns into a row index level:
In [184]: df.stack()
Out[184]:
0 A 1
C 2
1 B 3
2 B 4
C 5
dtype: float64
Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:
In [185]: df.stack().groupby(level=0).first()
Out[185]:
0 1
1 3
2 4
dtype: float64
All we need to do is reindex the result (using the original index) so as to
include rows that are completely NaN:
df.stack().groupby(level=0).first().reindex(df.index)
This is nothing new, but it's a combination of the best bits of #yangie's approach with a list comprehension, and #EdChum's df.apply approach that I think is easiest to understand.
First, which columns do we want to pick our values from?
In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)
In [96]: pick_cols
Out[96]:
0 A
1 B
2 B
3 None
dtype: object
Now how do we pick the values?
In [100]: [df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.iteritems()]
Out[100]: [1.0, 3.0, 4.0, None]
This is ok, but we really want the index to match that of the original DataFrame:
In [98]: pd.Series({k:df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.iteritems()})
Out[98]:
0 1
1 3
2 4
3 NaN
dtype: float64
groupby on axis=1
If we pass a callable that returns the same value for every column, we group all columns together. That gives us access to groupby's first aggregation, which makes this easy:
df.groupby(lambda x: 'Z', 1).first()
Z
0 1.0
1 3.0
2 4.0
3 NaN
This returns a DataFrame whose single column is named after the value returned by the callable.
lookup, notna, and idxmax
df.lookup(df.index, df.notna().idxmax(1))
array([ 1., 3., 4., nan])
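Note that DataFrame.lookup has since been deprecated and removed in newer pandas, so a rough equivalent built from notna/idxmax plus plain NumPy indexing might look like this (a sketch, assuming the example df above):
import numpy as np

first_cols = df.notna().idxmax(axis=1)          # first non-null column label per row
col_pos = df.columns.get_indexer(first_cols)    # labels -> positional column indices
df.to_numpy()[np.arange(len(df)), col_pos]      # array([ 1.,  3.,  4., nan])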
argmin and slicing
v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]
array([ 1., 3., 4., nan])
Here is a one line solution:
[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]
Edit:
This solution iterates over the rows of df. row.first_valid_index() returns the label of the first non-NA/null value, which is then used as an index to get the first non-null item in each row.
If there is no non-null value in the row, row.first_valid_index() returns None, which cannot be used as an index, so I need an if-else statement (checking is not None explicitly also avoids mistaking a falsy column label such as 0 for a missing value).
I packed everything into a list comprehension for brevity.
JoeCondron's answer (EDIT: before his last edit!) is cool but there is margin for significant improvement by avoiding the non-vectorized enumeration:
def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return a[np.arange(a.shape[0]), col_index]
The improvement is small if the DataFrame is relatively flat:
In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop
In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop
... but can be relevant on slim DataFrames:
In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))
In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop
In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop
Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).
import numpy
import pandas

df = pandas.DataFrame({'A': [1, numpy.nan, numpy.nan, numpy.nan],
                       'B': [numpy.nan, 3, 4, numpy.nan],
                       'C': [2, numpy.nan, 5, numpy.nan]})
df
A B C
0 1.0 NaN 2.0
1 NaN 3.0 NaN
2 NaN 4.0 5.0
3 NaN NaN NaN
df.apply(lambda x: numpy.nan if all(x.isnull()) else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]
This question is the same as this one posted earlier. I want to concatenate three columns instead of two:
Here is how to combine two columns:
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['combined']=df.apply(lambda x:'%s_%s' % (x['foo'],x['bar']),axis=1)
df
bar foo new combined
0 1 a apple a_1
1 2 b banana b_2
2 3 c pear c_3
I want to combine three columns with this command, but it is not working. Any idea?
df['combined']=df.apply(lambda x:'%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
Another solution using DataFrame.apply(), with slightly less typing and more scalable when you want to join more columns:
cols = ['foo', 'bar', 'new']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
You can use string concatenation to combine columns, with or without delimiters. You do have to convert the type on non-string columns.
In[17]: df['combined'] = df['bar'].astype(str) + '_' + df['foo'] + '_' + df['new']
In[17]:df
Out[18]:
bar foo new combined
0 1 a apple 1_a_apple
1 2 b banana 2_b_banana
2 3 c pear 3_c_pear
If you have even more columns you want to combine, using the Series method str.cat might be handy:
df["combined"] = df["foo"].str.cat(df[["bar", "new"]].astype(str), sep="_")
Basically, you select the first column (if it is not already of type str, you need to append .astype(str)), to which you append the other columns (separated by an optional separator character).
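For instance, with the example frame from the question this would look roughly like the following sketch:
df = pd.DataFrame({'foo': ['a', 'b', 'c'], 'bar': [1, 2, 3], 'new': ['apple', 'banana', 'pear']})
df['combined'] = df['foo'].str.cat(df[['bar', 'new']].astype(str), sep='_')
# 0     a_1_apple
# 1    b_2_banana
# 2      c_3_pear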
Just wanted to make a time comparison for both solutions (for 30K rows DF):
In [1]: df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
In [2]: big = pd.concat([df] * 10**4, ignore_index=True)
In [3]: big.shape
Out[3]: (30000, 3)
In [4]: %timeit big.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
1 loop, best of 3: 881 ms per loop
In [5]: %timeit big['bar'].astype(str)+'_'+big['foo']+'_'+big['new']
10 loops, best of 3: 44.2 ms per loop
a few more options:
In [6]: %timeit big.ix[:, :-1].astype(str).add('_').sum(axis=1).str.cat(big.new)
10 loops, best of 3: 72.2 ms per loop
In [11]: %timeit big.astype(str).add('_').sum(axis=1).str[:-1]
10 loops, best of 3: 82.3 ms per loop
Possibly the fastest solution is to operate in plain Python:
Series(
    map(
        '_'.join,
        df.values.tolist()
        # when non-string columns are present:
        # df.values.astype(str).tolist()
    ),
    index=df.index
)
Comparison against #MaxU answer (using the big data frame which has both numeric and string columns):
%timeit big['bar'].astype(str) + '_' + big['foo'] + '_' + big['new']
# 29.4 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit Series(map('_'.join, big.values.astype(str).tolist()), index=big.index)
# 27.4 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Comparison against #derchambers answer (using their df data frame where all columns are strings):
from functools import reduce

def reduce_join(df, columns):
    slist = [df[x] for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])

def list_map(df, columns):
    return Series(
        map(
            '_'.join,
            df[columns].values.tolist()
        ),
        index=df.index
    )
%timeit df1 = reduce_join(df, list('1234'))
# 602 ms ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = list_map(df, list('1234'))
# 351 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The answer given by #allen is reasonably generic but can lack performance on larger dataframes:
Reduce does a lot better:
from functools import reduce
import pandas as pd
# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'
def reduce_join(df, columns):
    assert len(columns) > 1
    slist = [df[x].astype(str) for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])

def apply_join(df, columns):
    assert len(columns) > 1
    return df[columns].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
# ensure outputs are equal
df1 = reduce_join(df, list('1234'))
df2 = apply_join(df, list('1234'))
assert df1.equals(df2)
# profile
%timeit df1 = reduce_join(df, list('1234')) # 733 ms
%timeit df2 = apply_join(df, list('1234')) # 8.84 s
I think you are missing one %s
df['combined']=df.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
First convert the columns to str, then use .T.agg('_'.join) to concatenate them. More info can be found in the pandas documentation for DataFrame.agg.
# Initialize columns
cols_concat = ['first_name', 'second_name']
# Convert them to type str
df[cols_concat] = df[cols_concat].astype('str')
# Then concatenate them as follows
df['new_col'] = df[cols_concat].T.agg('_'.join)
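Applied to the example frame from the question, this would look something like the following sketch:
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'], 'bar': [1, 2, 3], 'new': ['apple', 'banana', 'pear']})
cols_concat = ['foo', 'bar', 'new']
df[cols_concat] = df[cols_concat].astype('str')
# transpose so each original row becomes a column, then join down each column
df['combined'] = df[cols_concat].T.agg('_'.join)
# combined: a_1_apple, b_2_banana, c_3_pear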
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['combined'] = df['foo'].astype(str)+'_'+df['bar'].astype(str)
If you concatenate with a string ('_'), first convert the columns you want to join to strings, and then you can concatenate them:
df['New_column_name'] = df['Column1'].map(str) + 'X' + df['Steps']
Here 'X' is any delimiter (e.g. a space) by which you want to separate the two merged columns.
If you have a list of columns you want to concatenate and maybe you'd like to use some separator, here's what you can do
def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
    df[new_col_name] = df[cols_to_concat[0]]
    for col in cols_to_concat[1:]:
        df[new_col_name] = df[new_col_name].astype(str) + sep + df[col].astype(str)
This should be faster than apply and takes an arbitrary number of columns to concatenate.
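A usage sketch with the example frame from the question (note the function modifies df in place):
df = pd.DataFrame({'foo': ['a', 'b', 'c'], 'bar': [1, 2, 3], 'new': ['apple', 'banana', 'pear']})
concat_columns(df, ['foo', 'bar', 'new'], 'combined', sep='_')
print(df['combined'])    # a_1_apple, b_2_banana, c_3_pear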
#derchambers I found one more solution:
import pandas as pd
# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'
def eval_join(df, columns):
    sum_elements = [f"df['{col}']" for col in columns]
    to_eval = "+ '_' + ".join(sum_elements)
    return eval(to_eval)
#profile
%timeit df3 = eval_join(df, list('1234')) # 504 ms
You could create a function which would make the implementation neater (esp. if you're using this functionality multiple times throughout an implementation):
def concat_cols(df, cols_to_concat, new_col_name, separator):
    df[new_col_name] = ''
    for i, col in enumerate(cols_to_concat):
        df[new_col_name] += ('' if i == 0 else separator) + df[col].astype(str)
    return df
Sample usage:
test = pd.DataFrame(data=[[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'])
test = concat_cols(test, ['a', 'b', 'c'], 'concat_col', '_')
Following up on #Allen's response:
If you need to chain such operation with other dataframe transformation, use assign:
df.assign(
    combined=lambda x: x[cols].apply(
        lambda row: "_".join(row.values.astype(str)), axis=1
    )
)
Considering that one is combining three columns, one would need three format specifiers, '%s_%s_%s', not just two '%s_%s'. The following will do the work
df['combined'] = df.apply(lambda x: '%s_%s_%s' % (x['foo'], x['bar'], x['new']), axis=1)
[Out]:
foo bar new combined
0 a 1 apple a_1_apple
1 b 2 banana b_2_banana
2 c 3 pear c_3_pear
Alternatively, if one wants to create a separate list to store the columns that one wants to combine, the following will do the work.
columns = ['foo', 'bar', 'new']
df['combined'] = df.apply(lambda x: '_'.join([str(x[i]) for i in columns]), axis=1)
[Out]:
foo bar new combined
0 a 1 apple a_1_apple
1 b 2 banana b_2_banana
2 c 3 pear c_3_pear
This last one is more convenient, as one can simply change or add the column names in the list - it will require fewer changes.
Is there a faster way to drop columns that only contain one distinct value than the code below?
cols = df.columns.tolist()
for col in cols:
    if len(set(df[col].tolist())) < 2:
        df = df.drop(col, axis=1)
This is really quite slow for large dataframes. Logically, this collects every distinct value in each column, when in fact it could stop counting after finding 2 different values.
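For illustration, a non-vectorized sketch of that early-exit idea (ignoring NaN subtleties, where x == x comparisons would need extra care):
def is_constant(col):
    vals = col.tolist()
    # all() short-circuits on the first differing value, so scanning stops early
    return all(v == vals[0] for v in vals[1:]) if vals else True

df = df[[c for c in df.columns if not is_constant(df[c])]]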
You can use the Series.unique() method to find all the unique elements in a column, and drop the columns for which .unique() returns only 1 element. Example -
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col, inplace=True, axis=1)
A method that does not do inplace dropping -
res = df
for col in df.columns:
    if len(df[col].unique()) == 1:
        res = res.drop(col, axis=1)
Demo -
In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])
In [155]: for col in df.columns:
   .....:     if len(df[col].unique()) == 1:
   .....:         df.drop(col, inplace=True, axis=1)
   .....:
In [156]: df
Out[156]:
1
0 2
1 3
2 2
Timing results -
In [166]: %paste
def func1(df):
    res = df
    for col in df.columns:
        if len(df[col].unique()) == 1:
            res = res.drop(col, axis=1)
    return res
## -- End pasted text --
In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop
In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop
In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop
The fastest method still seems to be the method using unique and looping through the columns.
One step:
df = df[[c for c in list(df) if len(df[c].unique()) > 1]]
Two steps:
Create a list of column names that have more than 1 distinct value.
keep = [c for c in list(df) if len(df[c].unique()) > 1]
Drop the columns that are not in 'keep'
df = df[keep]
Note: this step can also be done using a list of columns to drop:
drop_cols = [c for c in list(df) if df[c].nunique() <= 1]
df = df.drop(columns=drop_cols)
df.loc[:,df.apply(pd.Series.nunique) != 1]
For example
In:
df = pd.DataFrame({'A': [10, 20, np.nan, 30], 'B': [10, np.nan, 10, 10]})
df.loc[:,df.apply(pd.Series.nunique) != 1]
Out:
A
0 10
1 20
2 NaN
3 30
Two simple one-liners for either returning a view (shorter version of jz0410's answer)
df.loc[:,df.nunique()!=1]
or dropping inplace (via drop())
df.drop(columns=df.columns[df.nunique()==1], inplace=True)
You can create a mask of your df by calling apply with pd.Series.value_counts; for a single-valued column this produces only one non-NaN row. You can then call dropna column-wise and pass thresh=2, so that a column is kept only if it has 2 or more non-NaN counts:
In [329]:
df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
df
Out[329]:
a b c
0 1 0 0
1 1 1 0
2 1 2 2
3 1 3 2
4 1 4 2
In [342]:
df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
Out[342]:
b c
0 0 0
1 1 0
2 2 2
3 3 2
4 4 2
Output from the intermediate steps:
In [344]:
df.apply(pd.Series.value_counts)
Out[344]:
a b c
0 NaN 1 2
1 5 1 NaN
2 NaN 1 3
3 NaN 1 NaN
4 NaN 1 NaN
In [345]:
df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
Out[345]:
b c
0 1 2
1 1 NaN
2 1 3
3 1 NaN
4 1 NaN
Many examples in this thread and the related thread did not work for my df. These did:
# from: https://stackoverflow.com/questions/33144813/quickly-drop-dataframe-columns-with-only-one-distinct-value
# from: https://stackoverflow.com/questions/20209600/pandas-dataframe-remove-constant-column
import pandas as pd
import numpy as np
data = {'var1': [1,2,3,4,5,np.nan,7,8,9],
        'var2': ['Order',np.nan,'Inv','Order','Order','Shp','Order','Order','Inv'],
        'var3': [101,101,101,102,102,102,103,103,np.nan],
        'var4': [np.nan,1,1,1,1,1,1,1,1],
        'var5': [1,1,1,1,1,1,1,1,1],
        'var6': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
        'var7': ["a","a","a","a","a","a","a","a","a"],
        'var8': [1,2,3,4,5,6,7,8,9]}
df = pd.DataFrame(data)
df_original = df.copy()
#-------------------------------------------------------------------------------------------------
df2 = df[[c for c in list(df) if len(df[c].unique()) > 1]]
#-------------------------------------------------------------------------------------------------
keep = [c for c in list(df) if len(df[c].unique()) > 1]
df3 = df[keep]
#-------------------------------------------------------------------------------------------------
keep_columns = [col for col in df.columns if len(df[col].unique()) > 1]
df5 = df[keep_columns].copy()
#-------------------------------------------------------------------------------------------------
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col, inplace=True, axis=1)
I would like to throw in (pandas 1.0.3):
ids = df.nunique().values>1
df.loc[:,ids]
not that slow:
2.81 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df = df.loc[:, df.nunique() != Numberofvalues]  # for this question, Numberofvalues would be 1
None of the solutions worked in my use case because I got this error (my dataframe contains list items):
TypeError: unhashable type: 'list'
The solution that worked for me is this:
ndf = df.describe(include="all").T
new_cols = set(df.columns) - set(ndf[ndf.unique == 1].index)
df = df[list(new_cols)]
One line
df=df[[i for i in df if len(set(df[i]))>1]]
One of the solutions with pipe (convenient if used often):
def drop_unique_value_col(df):
    return df.loc[:, df.apply(pd.Series.nunique) != 1]

df.pipe(drop_unique_value_col)
This will drop all the columns with only one distinct value.
for col in Dataframe.columns:
    if len(Dataframe[col].value_counts()) == 1:
        Dataframe.drop([col], axis=1, inplace=True)
Most 'pythonic' way of doing it I could find:
df = df.loc[:, (df != df.iloc[0]).any()]
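A quick check with the demo frame used earlier in the thread:
df = pd.DataFrame({'a': 1, 'b': np.arange(5), 'c': [0, 0, 2, 2, 2]})
df.loc[:, (df != df.iloc[0]).any()]   # keeps 'b' and 'c', drops the constant 'a'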
Is there any way to specify the index that I want for a new row, when appending the row to a dataframe?
The original documentation provides the following example:
In [1301]: df = DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In [1302]: df
Out[1302]:
A B C D
0 -1.137707 -0.891060 -0.693921 1.613616
1 0.464000 0.227371 -0.496922 0.306389
2 -2.290613 -1.134623 -1.561819 -0.260838
3 0.281957 1.523962 -0.902937 0.068159
4 -0.057873 -0.368204 -1.144073 0.861209
5 0.800193 0.782098 -1.069094 -1.099248
6 0.255269 0.009750 0.661084 0.379319
7 -0.008434 1.952541 -1.056652 0.533946
In [1303]: s = df.xs(3)
In [1304]: df.append(s, ignore_index=True)
Out[1304]:
A B C D
0 -1.137707 -0.891060 -0.693921 1.613616
1 0.464000 0.227371 -0.496922 0.306389
2 -2.290613 -1.134623 -1.561819 -0.260838
3 0.281957 1.523962 -0.902937 0.068159
4 -0.057873 -0.368204 -1.144073 0.861209
5 0.800193 0.782098 -1.069094 -1.099248
6 0.255269 0.009750 0.661084 0.379319
7 -0.008434 1.952541 -1.056652 0.533946
8 0.281957 1.523962 -0.902937 0.068159
where the new row gets the index label automatically. Is there any way to control the new label?
The name of the Series becomes the index of the row in the DataFrame:
In [99]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In [100]: s = df.xs(3)
In [101]: s.name = 10
In [102]: df.append(s)
Out[102]:
A B C D
0 -2.083321 -0.153749 0.174436 1.081056
1 -1.026692 1.495850 -0.025245 -0.171046
2 0.072272 1.218376 1.433281 0.747815
3 -0.940552 0.853073 -0.134842 -0.277135
4 0.478302 -0.599752 -0.080577 0.468618
5 2.609004 -1.679299 -1.593016 1.172298
6 -0.201605 0.406925 1.983177 0.012030
7 1.158530 -2.240124 0.851323 -0.240378
10 -0.940552 0.853073 -0.134842 -0.277135
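In pandas 2.x DataFrame.append has been removed, but the same idea (the Series name becoming the new row label) still works through concat; a sketch:
s = df.xs(3)
s.name = 10
df = pd.concat([df, s.to_frame().T])   # new row appended with index label 10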
df.loc will do the job:
>>> df = pd.DataFrame(np.random.randn(3, 2), columns=['A','B'])
>>> df
A B
0 -0.269036 0.534991
1 0.069915 -1.173594
2 -1.177792 0.018381
>>> df.loc[13] = df.loc[1]
>>> df
A B
0 -0.269036 0.534991
1 0.069915 -1.173594
2 -1.177792 0.018381
13 0.069915 -1.173594
I shall refer to the same sample of data as posted in the question:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
print('The original data frame is: \n{}'.format(df))
Running this code will give you
The original data frame is:
A B C D
0 0.494824 -0.328480 0.818117 0.100290
1 0.239037 0.954912 -0.186825 -0.651935
2 -1.818285 -0.158856 0.359811 -0.345560
3 -0.070814 -0.394711 0.081697 -1.178845
4 -1.638063 1.498027 -0.609325 0.882594
5 -0.510217 0.500475 1.039466 0.187076
6 1.116529 0.912380 0.869323 0.119459
7 -1.046507 0.507299 -0.373432 -1.024795
Now you wish to append a new row to this data frame, which doesn't need to be a copy of any other row in the data frame. #Alon suggested an interesting approach: use df.loc to append a new row with a different index. The issue with this approach, however, is that if a row is already present at that index, it will be overwritten by the new values. This is typically the case for datasets where the row index is not unique, like store IDs in transaction datasets. So a more general solution to your question is to create the row, turn the new row data into a pandas Series, name it with the index you want, and then append it to the data frame. Don't forget to overwrite the original data frame with the one with the appended row, because df.append returns a new DataFrame and does not modify the original in place. Following is the code:
row = pd.Series({'A':10,'B':20,'C':30,'D':40},name=3)
df = df.append(row)
print('The new data frame is: \n{}'.format(df))
Following would be the new output:
The new data frame is:
A B C D
0 0.494824 -0.328480 0.818117 0.100290
1 0.239037 0.954912 -0.186825 -0.651935
2 -1.818285 -0.158856 0.359811 -0.345560
3 -0.070814 -0.394711 0.081697 -1.178845
4 -1.638063 1.498027 -0.609325 0.882594
5 -0.510217 0.500475 1.039466 0.187076
6 1.116529 0.912380 0.869323 0.119459
7 -1.046507 0.507299 -0.373432 -1.024795
3 10.000000 20.000000 30.000000 40.000000
There is another solution. The following code fails (although I think pandas ought to support it):
import pandas as pd
# empty dataframe
a = pd.DataFrame()
a.loc[0] = {'first': 111, 'second': 222}
But the next code runs fine:
import pandas as pd
# empty dataframe
a = pd.DataFrame()
a = a.append(pd.Series({'first': 111, 'second': 222}, name=0))
Maybe my case is a different scenario, but it looks similar. I would define my own question as: 'How to insert a row with a new index at some (given) position?'
Let's create test dataframe:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=['x', 'y'])
Result:
A B
x 1 2
y 3 4
Then, let's say, we want to place a new row with index z at position 1 (second row).
pos = 1
index_name = 'z'
# create new indexes where index is at the specified position
new_indexes = df.index.insert(pos, index_name)
# create new dataframe with new row
# specify new index in name argument
new_line = pd.Series({'A': 5, 'B': 6}, name=index_name)
df_new_row = pd.DataFrame([new_line], columns=df.columns)
# append new line to dataframe
df = pd.concat([df, df_new_row])
For now it sits at the end:
A B
x 1 2
y 3 4
z 5 6
Now let's reorder it so that the new index lands at the desired position:
df = df.reindex(new_indexes)
Result:
A B
x 1 2
z 5 6
y 3 4
You should consider using df.loc[row_name] = row_value.
df.append(pd.Series({column: value}, name=row_name)) will lead to
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df.loc[row_name] = row_value is faster than pd.concat
Here is an example:
p = pd.DataFrame(data=np.random.rand(100), columns=['price'], index=np.arange(100))
def func1(p):
    for i in range(100):
        p.loc[i] = 0

def func2(p):
    for i in range(100):
        p.append(pd.Series({'BTC': 0}, name=i))

def func3(p):
    for i in range(100):
        p = pd.concat([p, pd.Series({i: 0}, name='price')])
%timeit func1(p)
1.87 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit func2(p)
1.56 s ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func3(p)
24.8 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)