Pandas: Appending a row to a dataframe and specify its index label

Is there any way to specify the index that I want for a new row, when appending the row to a dataframe?
The original documentation provides the following example:
In [1301]: df = DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In [1302]: df
Out[1302]:
A B C D
0 -1.137707 -0.891060 -0.693921 1.613616
1 0.464000 0.227371 -0.496922 0.306389
2 -2.290613 -1.134623 -1.561819 -0.260838
3 0.281957 1.523962 -0.902937 0.068159
4 -0.057873 -0.368204 -1.144073 0.861209
5 0.800193 0.782098 -1.069094 -1.099248
6 0.255269 0.009750 0.661084 0.379319
7 -0.008434 1.952541 -1.056652 0.533946
In [1303]: s = df.xs(3)
In [1304]: df.append(s, ignore_index=True)
Out[1304]:
A B C D
0 -1.137707 -0.891060 -0.693921 1.613616
1 0.464000 0.227371 -0.496922 0.306389
2 -2.290613 -1.134623 -1.561819 -0.260838
3 0.281957 1.523962 -0.902937 0.068159
4 -0.057873 -0.368204 -1.144073 0.861209
5 0.800193 0.782098 -1.069094 -1.099248
6 0.255269 0.009750 0.661084 0.379319
7 -0.008434 1.952541 -1.056652 0.533946
8 0.281957 1.523962 -0.902937 0.068159
where the new row gets the index label automatically. Is there any way to control the new label?

The name of the Series becomes the index of the row in the DataFrame:
In [99]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In [100]: s = df.xs(3)
In [101]: s.name = 10
In [102]: df.append(s)
Out[102]:
A B C D
0 -2.083321 -0.153749 0.174436 1.081056
1 -1.026692 1.495850 -0.025245 -0.171046
2 0.072272 1.218376 1.433281 0.747815
3 -0.940552 0.853073 -0.134842 -0.277135
4 0.478302 -0.599752 -0.080577 0.468618
5 2.609004 -1.679299 -1.593016 1.172298
6 -0.201605 0.406925 1.983177 0.012030
7 1.158530 -2.240124 0.851323 -0.240378
10 -0.940552 0.853073 -0.134842 -0.277135

df.loc will do the job :
>>> df = pd.DataFrame(np.random.randn(3, 2), columns=['A','B'])
>>> df
A B
0 -0.269036 0.534991
1 0.069915 -1.173594
2 -1.177792 0.018381
>>> df.loc[13] = df.loc[1]
>>> df
A B
0 -0.269036 0.534991
1 0.069915 -1.173594
2 -1.177792 0.018381
13 0.069915 -1.173594
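One thing to watch with this approach: if the label already exists, df.loc overwrites that row instead of adding a new one. A minimal sketch of a guard, assuming a hypothetical helper called add_row (it is not part of pandas) and a small made-up frame:

```python
import pandas as pd


def add_row(df, label, values):
    """Add a row under `label`, refusing to overwrite an existing row.

    `add_row` is a hypothetical helper for illustration, not a pandas API.
    """
    if label in df.index:
        raise KeyError(f"index label {label!r} already exists")
    df.loc[label] = values  # setting with enlargement adds the new row in place
    return df


df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]})
add_row(df, 13, [7.0, 8.0])
print(df.index.tolist())  # [0, 1, 2, 13]
```

Calling add_row(df, 1, ...) on this frame raises KeyError instead of silently replacing row 1.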

I shall refer to the same sample of data as posted in the question:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
print('The original data frame is: \n{}'.format(df))
Running this code will give you
The original data frame is:
A B C D
0 0.494824 -0.328480 0.818117 0.100290
1 0.239037 0.954912 -0.186825 -0.651935
2 -1.818285 -0.158856 0.359811 -0.345560
3 -0.070814 -0.394711 0.081697 -1.178845
4 -1.638063 1.498027 -0.609325 0.882594
5 -0.510217 0.500475 1.039466 0.187076
6 1.116529 0.912380 0.869323 0.119459
7 -1.046507 0.507299 -0.373432 -1.024795
Now suppose you want to append a new row to this data frame, and the row need not be a copy of any existing row. @Alon suggested an interesting approach: use df.loc to add a new row under a different index label. The issue with that approach is that if a row already exists at that label, it is overwritten with the new values. This typically matters for datasets whose row index is not unique, such as a store ID in transaction data. A more general solution is to create the row as a pandas Series, set its name to the index label you want, and then append it to the data frame. Don't forget to assign the result back to the original variable: df.append returns a new DataFrame and does not modify the original in place. The code is as follows:
row = pd.Series({'A':10,'B':20,'C':30,'D':40},name=3)
df = df.append(row)
print('The new data frame is: \n{}'.format(df))
Following would be the new output:
The new data frame is:
A B C D
0 0.494824 -0.328480 0.818117 0.100290
1 0.239037 0.954912 -0.186825 -0.651935
2 -1.818285 -0.158856 0.359811 -0.345560
3 -0.070814 -0.394711 0.081697 -1.178845
4 -1.638063 1.498027 -0.609325 0.882594
5 -0.510217 0.500475 1.039466 0.187076
6 1.116529 0.912380 0.869323 0.119459
7 -1.046507 0.507299 -0.373432 -1.024795
3 10.000000 20.000000 30.000000 40.000000
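DataFrame.append was removed in pandas 2.0, so on newer versions the same idea can be sketched with pd.concat. This is an adaptation rather than the code from the answer above, and the small frame is made up for illustration. The Series name still becomes the index label once the Series is turned into a one-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})

# The Series name (1) becomes the row label; it may duplicate an existing label.
row = pd.Series({'A': 10.0, 'B': 20.0}, name=1)

# to_frame().T turns the Series into a one-row DataFrame labelled by its name.
new_df = pd.concat([df, row.to_frame().T])
print(new_df.index.tolist())  # [0, 1, 1]
```

As with append, concat returns a new object, so the result has to be assigned to a variable.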

There is another solution. The following code does not work (although I think pandas should support it):
import pandas as pd
# empty dataframe
a = pd.DataFrame()
a.loc[0] = {'first': 111, 'second': 222}
But the next code runs fine:
import pandas as pd
# empty dataframe
a = pd.DataFrame()
a = a.append(pd.Series({'first': 111, 'second': 222}, name=0))

Maybe my case is a different scenario but looks similar. I would define my own question as: 'How to insert a row with new index at some (given) position?'
Let's create test dataframe:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=['x', 'y'])
Result:
A B
x 1 2
y 3 4
Then, let's say, we want to place a new row with index z at position 1 (second row).
pos = 1
index_name = 'z'
# create new indexes where index is at the specified position
new_indexes = df.index.insert(pos, index_name)
# create new dataframe with new row
# specify new index in name argument
new_line = pd.Series({'A': 5, 'B': 6}, name=index_name)
df_new_row = pd.DataFrame([new_line], columns=df.columns)
# append new line to dataframe
df = pd.concat([df, df_new_row])
Now it is in the end:
A B
x 1 2
y 3 4
z 5 6
Now let's sort it specifying new index' position:
df = df.reindex(new_indexes)
Result:
A B
x 1 2
z 5 6
y 3 4

You should consider using df.loc[row_name] = row_value.
df.append(pd.Series({column: row_value}, name=row_name)) will lead to
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df.loc[row_name] = row_value is also faster than pd.concat.
Here is an example:
p = pd.DataFrame(data=np.random.rand(100), columns=['price'], index=np.arange(100))
def func1(p):
    for i in range(100):
        p.loc[i] = 0
def func2(p):
    for i in range(100):
        p.append(pd.Series({'BTC': 0}, name=i))
def func3(p):
    for i in range(100):
        p = pd.concat([p, pd.Series({i: 0}, name='price')])
%timeit func1(p)
1.87 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit func2(p)
1.56 s ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func3(p)
24.8 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Related

Iteratively combine text in first column with existing text in other columns

I am in the process of creating a python script that extracts data from a poorly designed output file (which I can't change) from a piece of equipment within our research lab. I would like to include a way to iteratively combine the text in the first column of a dataframe (example below) with each other column in the dataframe.
A simple example of the dataframe:
Filename  1         2         3         4         5
a         Sheet(1)  Sheet(2)  Sheet(3)  Sheet(4)  ....
b         Sheet(1)  Sheet(2)  --------  --------  ....
c         Sheet(1)  Sheet(2)  Sheet(3)  Sheet(4)  ....
d         Sheet(1)  Sheet(2)  Sheet(3)  --------  ....
e         Sheet(1)  Sheet(2)  Sheet(3)  Sheet(4)  ....
f         Sheet(1)  --------  --------  --------  ....
What I am looking to produce:
Filename  1           2           3           4           5
a         a_Sheet(1)  a_Sheet(2)  a_Sheet(3)  a_Sheet(4)  ....
b         b_Sheet(1)  b_Sheet(2)  --------    --------    ....
c         c_Sheet(1)  c_Sheet(2)  c_Sheet(3)  c_Sheet(4)  ....
d         d_Sheet(1)  d_Sheet(2)  d_Sheet(3)  --------    ....
e         e_Sheet(1)  e_Sheet(2)  e_Sheet(3)  e_Sheet(4)  ....
f         f_Sheet(1)  --------    --------    --------    ....

Use .apply to prepend the 'Filename' string to the other columns.
Of the current answers, the solution from Mykola Zotko is the fastest, tested against a 3-column dataframe with 100k rows.
If your dataframe has undesired strings (e.g. '--------'), then use something like df.replace('--------', pd.NA, inplace=True) before combining the column strings.
If the final result must have '--------', then use df.fillna('--------', inplace=True) at the end. This is better than trying to handle them one by one inside the loop.
import pandas as pd
import numpy as np
# test dataframe
df = pd.DataFrame({'Filename': ['a', 'b', 'c'], 'c1': ['s1'] * 3, 'c2': ['s2', np.nan, 's2']})
# display(df)
Filename c1 c2
0 a s1 s2
1 b s1 NaN
2 c s1 s2
# prepend the filename strings to the other columns
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: df.Filename + '_' + x)
# display(df)
Filename c1 c2
0 a a_s1 a_s2
1 b b_s1 NaN
2 c c_s1 c_s2
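The placeholder handling described above (replace '--------' with a missing value, combine, then restore it) can be sketched end to end. The two-row frame and its column names are made up to resemble the question's data:

```python
import pandas as pd

# Made-up sample resembling the question's data; '--------' marks an empty cell.
df = pd.DataFrame({
    'Filename': ['a', 'b'],
    '1': ['Sheet(1)', 'Sheet(1)'],
    '2': ['Sheet(2)', '--------'],
})

# 1. Turn the placeholder into a missing value so it is skipped when combining.
df = df.replace('--------', pd.NA)

# 2. Prepend the filename to every column except 'Filename'; missing cells stay missing.
cols = df.columns[1:]
df[cols] = df[cols].apply(lambda s: df['Filename'] + '_' + s)

# 3. Restore the placeholder in the final result.
df = df.fillna('--------')
print(df)
```

String concatenation with a missing value yields a missing value, which is why no per-cell check is needed in step 2.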
%%timeit test against other answers
# test data with 100k rows
df = pd.concat([pd.DataFrame({'Filename': ['a', 'b', 'c'], 'c1': ['s1'] * 3, 'c2': ['s2'] * 3})] * 33333).reset_index(drop=True)
# Solution from Trenton
%%timeit
df.iloc[:, 1:].apply(lambda x: df.Filename + '_' + x)
[out]:
33.6 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Solution from Mykola
%%timeit
df['Filename'].to_numpy().reshape(-1, 1) + '_' + df.loc[:, 'c1':]
[out]:
29.6 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Solution from Alex
%%timeit
df.loc[:, cols].apply(lambda s: df["Filename"].str.cat(s, sep="_"))
[out]:
45.3 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# iterating the columns in a for-loop
def test(d):
    for cols in d.columns[1:]:
        d[cols]=d['Filename'] + '_' + d[cols]
    return d
%%timeit
test(df)
[out]:
53.8 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

For example, if you have the following data frame:
col1 col2 col3 col4
0 a x y z
1 b x y z
2 c x y NaN
You can use broadcasting:
df.loc[:, 'col2':] = df['col1'].to_numpy().reshape(-1, 1) + '_' + df.loc[:, 'col2':]
Result:
col1 col2 col3 col4
0 a a_x a_y a_z
1 b b_x b_y b_z
2 c c_x c_y NaN

Try:
for cols in df.loc[:,'1':]:
    df[cols]=df['Filename']+'_'+df[cols]

I've represented the -------- as np.nan. You should be able to label these as NaN when you load the file; see the na_values parameter (e.g. of pd.read_csv).
This is the dict for the DataFrame:
import numpy as np
import pandas as pd
nan = np.nan  # shorthand used in the dict below
d = {
    1: [nan, "Sheet(1)", nan],
    2: [nan, "Sheet(2)", nan],
    3: ["Sheet(3)", nan, "Sheet(3)"],
    4: ["Sheet(4)", nan, nan],
    "Filename": ["a", "b", "c"],
}
df = pd.DataFrame(d)
Then we can:
Make a mask of the columns we want to change, everything but Filename
cols = df.columns != "Filename"
# array([ True, True, True, True, False])
Apply a function, which uses Series.str.cat:
df.loc[:, cols] = df.loc[:, cols].apply(lambda s: df["Filename"].str.cat(s, sep="_"))
This function takes each column selected by cols and concatenates it with the Filename column.
Which produces:
1 2 3 4 Filename
0 NaN NaN a_Sheet(3) a_Sheet(4) a
1 b_Sheet(1) b_Sheet(2) NaN NaN b
2 NaN NaN c_Sheet(3) NaN c

Different groupers for each column with pandas GroupBy

How could I use a multidimensional Grouper, in this case another dataframe, as a Grouper for another dataframe? Can it be done in one step?
My question is essentially regarding how to perform an actual grouping under these circumstances, but to make it more specific, say I want to then transform and take the sum.
Consider for example:
df1 = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]})
print(df1)
a b
0 1 5
1 2 6
2 3 7
3 4 8
df2 = pd.DataFrame({'a':['A','B','A','B'], 'b':['A','A','B','B']})
print(df2)
a b
0 A A
1 B A
2 A B
3 B B
Then, the expected output would be:
a b
0 4 11
1 6 11
2 4 15
3 6 15
Where columns a and b in df1 have been grouped by columns a and b from df2 respectively.

You will have to group each column individually since each column uses a different grouping scheme.
If you want a cleaner version, I would recommend a list comprehension over the column names, and call pd.concat on the resultant series:
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)
a b
0 4 11
1 6 11
2 4 15
3 6 15
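To see what each element of that list comprehension computes, here is the single-column step on the question's data, written out as a small sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = pd.DataFrame({'a': ['A', 'B', 'A', 'B'], 'b': ['A', 'A', 'B', 'B']})

# Group the values of df1['a'] by the labels in df2['a'] (matched on the row index).
# Label 'A' covers rows 0 and 2 (1 + 3 = 4); label 'B' covers rows 1 and 3 (2 + 4 = 6).
col_a = df1['a'].groupby(df2['a']).transform('sum')
print(col_a.tolist())  # [4, 6, 4, 6]

# The same step for column 'b', grouped by df2['b'].
col_b = df1['b'].groupby(df2['b']).transform('sum')
print(col_b.tolist())  # [11, 11, 15, 15]
```

transform returns a result with the same index as the input, which is why the per-column pieces can be concatenated side by side.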
Not to say there's anything wrong with using apply as in the other answer, just that I don't like apply, so this is my suggestion :-)
Here are some timeits for your perusal. Just for your sample data, you will notice the difference in timings is obvious.
%%timeit
(df1.stack()
.groupby([df2.stack().index.get_level_values(level=1), df2.stack()])
.transform('sum').unstack())
%%timeit
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))
%%timeit
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)
8.99 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.35 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.13 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Not to say apply is slow, but explicit iteration in this case is faster. Additionally, you will notice the second and third timed solution will scale better with larger length v/s breadth since the number of iterations depends on the number of columns.

Try using apply to apply a lambda function to each column of your dataframe, then use the name of that pd.Series to group by the second dataframe:
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))
Output:
a b
0 4 11
1 6 11
2 4 15
3 6 15

Using stack and unstack
df1.stack().groupby([df2.stack().index.get_level_values(level=1),df2.stack()]).transform('sum').unstack()
Out[291]:
a b
0 4 11
1 6 11
2 4 15
3 6 15

I'm going to propose a (mostly) numpythonic solution that uses a scipy.sparse matrix to perform a vectorized groupby on the entire DataFrame at once, rather than column by column.
The key to performing this operation efficiently is finding a performant way to factorize the entire DataFrame, while avoiding duplicates in any columns. Since your groups are represented by strings, you can simply concatenate the column name on the end of each value (since columns should be unique), and then factorize the result, like so [*]
>>> df2 + df2.columns
a b
0 Aa Ab
1 Ba Ab
2 Aa Bb
3 Ba Bb
>>> pd.factorize((df2 + df2.columns).values.ravel())
(array([0, 1, 2, 1, 0, 3, 2, 3], dtype=int64),
array(['Aa', 'Ab', 'Ba', 'Bb'], dtype=object))
Once we have a unique grouping, we can utilize our scipy.sparse matrix, to perform a groupby in a single pass on the flattened arrays, and use advanced indexing and a reshaping operation to convert the result back to the original shape.
from scipy import sparse
a = df1.values.ravel()
b, _ = pd.factorize((df2 + df2.columns).values.ravel())
o = sparse.csr_matrix(
    (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
).sum(0).A1
res = o[b].reshape(df1.shape)
array([[ 4, 11],
[ 6, 11],
[ 4, 15],
[ 6, 15]], dtype=int64)
Performance
Functions
def gp_chris(f1, f2):
    a = f1.values.ravel()
    b, _ = pd.factorize((f2 + f2.columns).values.ravel())
    o = sparse.csr_matrix(
        (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
    ).sum(0).A1
    # use the argument's columns rather than the global df1
    return pd.DataFrame(o[b].reshape(f1.shape), columns=f1.columns)
def gp_cs(f1, f2):
    return pd.concat([f1[c].groupby(f2[c]).transform('sum') for c in f1.columns], axis=1)
def gp_scott(f1, f2):
    return f1.apply(lambda x: x.groupby(f2[x.name]).transform('sum'))
def gp_wen(f1, f2):
    return f1.stack().groupby([f2.stack().index.get_level_values(level=1), f2.stack()]).transform('sum').unstack()
Setup
import numpy as np
from scipy import sparse
import pandas as pd
import string
from timeit import timeit
import matplotlib.pyplot as plt
res = pd.DataFrame(
    index=[f'gp_{f}' for f in ('chris', 'cs', 'scott', 'wen')],
    columns=[10, 50, 100, 200, 400],
    dtype=float
)
for f in res.index:
    for c in res.columns:
        df1 = pd.DataFrame(np.random.rand(c, c))
        df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (c, c)))
        df1.columns = df1.columns.astype(str)
        df2.columns = df2.columns.astype(str)
        stmt = '{}(df1, df2)'.format(f)
        setp = 'from __main__ import df1, df2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Results
Validation
df1 = pd.DataFrame(np.random.rand(10, 10))
df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (10, 10)))
df1.columns = df1.columns.astype(str)
df2.columns = df2.columns.astype(str)
v = np.stack([gp_chris(df1, df2), gp_cs(df1, df2), gp_scott(df1, df2), gp_wen(df1, df2)])
print(np.all(v[:-1] == v[1:]))
True
Either we're all wrong or we're all correct :)
[*] There is a possibility that you could get a duplicate value here if one item is the concatenation of a column and another item before concatenation occurs. However if this is the case, you shouldn't need to adjust much to fix it.
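To make the footnote concrete, here is a small sketch of such a collision and one possible adjustment. The data and the '|' separator are assumptions chosen for illustration; the separator only helps if it never occurs in the values themselves:

```python
import pandas as pd

# Hypothetical data chosen to trigger the collision: 'Ab' + 'a' == 'A' + 'ba'.
df2 = pd.DataFrame({'a': ['Ab'], 'ba': ['A']})

# Appending the bare column name maps two different groups to the same key.
naive = df2.apply(lambda s: s + s.name)
print(naive.values.ravel().tolist())  # ['Aba', 'Aba']

# Inserting a separator (assumed absent from the data) keeps the keys distinct.
safe = df2.apply(lambda s: s + '|' + s.name)
print(safe.values.ravel().tolist())  # ['Ab|a', 'A|ba']
```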

You could do something like the following:
res = df1.assign(a_sum=lambda df: df['a'].groupby(df2['a']).transform('sum'))\
         .assign(b_sum=lambda df: df['b'].groupby(df2['b']).transform('sum'))
Results (the sums are added as new columns next to the originals):
   a  b  a_sum  b_sum
0  1  5      4     11
1  2  6      6     11
2  3  7      4     15
3  4  8      6     15

Pandas: expanding DataFrame by number of observations in column

Stata has the function expand which adds rows to a database corresponding to values in a particular column. For example:
I have:
df = pd.DataFrame({"A":[1, 2, 3],
"B":[3,4,5]})
A B
0 1 3
1 2 4
2 3 5
What I need:
df2 = pd.DataFrame({"A":[1, 2, 3, 2, 3, 3],
"B":[3,4,5, 4, 5, 5]})
A B
0 1 3
1 2 4
2 3 5
3 2 4
4 3 5
5 3 5
The value in df.loc[0,'A'] is 1, so no additional row is added to the end of the DataFrame, since B=3 is only supposed to occur once.
The value in df.loc[1,'A'] is 2, so one observation was added to the end of the DataFrame, bringing the total occurrences of B=4 to 2.
The value in df.loc[2,'A'] is 3, so two observations were added to the end of the DataFrame, bringing the total occurrences of B=5 to 3.
I've scoured prior questions for something to get me started, but no luck. Any help is appreciated.

There are a number of possibilities, all built around np.repeat:
def using_reindex(df):
    return df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)
def using_dictcomp(df):
    return pd.DataFrame({col:np.repeat(df[col].values, df['A'], axis=0)
                         for col in df})
def using_df_values(df):
    return pd.DataFrame(np.repeat(df.values, df['A'], axis=0), columns=df.columns)
def using_loc(df):
    return df.loc[np.repeat(df.index.values, df['A'])].reset_index(drop=True)
For example,
In [219]: df = pd.DataFrame({"A":[1, 2, 3], "B":[3,4,5]})
In [220]: df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)
Out[220]:
A B
0 1 3
1 2 4
2 2 4
3 3 5
4 3 5
5 3 5
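The intermediate step is worth looking at on its own. A small sketch on the question's data shows what np.repeat produces before the rows are looked up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})

# Repeat each index label as many times as the value in column A for that row.
labels = np.repeat(df.index.to_numpy(), df['A'].to_numpy())
print(labels.tolist())  # [0, 1, 1, 2, 2, 2]

# Looking up the repeated labels duplicates the corresponding rows.
expanded = df.loc[labels].reset_index(drop=True)
print(expanded['B'].tolist())  # [3, 4, 4, 5, 5, 5]
```

Note that the repeated rows end up grouped next to their originals rather than appended at the end as in the question's example, so the rows are the same but their order differs.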
Here is a benchmark on a 1000-row DataFrame; the result being a roughly 500K-row DataFrame:
In [208]: df = make_dataframe(1000)
In [210]: %timeit using_dictcomp(df)
10 loops, best of 3: 23.6 ms per loop
In [218]: %timeit using_reindex(df)
10 loops, best of 3: 35.8 ms per loop
In [211]: %timeit using_df_values(df)
10 loops, best of 3: 31.3 ms per loop
In [212]: %timeit using_loc(df)
1 loop, best of 3: 275 ms per loop
This is the code I used to generate df:
import numpy as np
import pandas as pd
def make_dataframe(nrows=100):
    df = pd.DataFrame(
        {'A': np.arange(nrows),
         'float': np.random.randn(nrows),
         'str': np.random.choice('Lorem ipsum dolor sit'.split(), size=nrows),
         'datetime64': pd.date_range('20000101', periods=nrows)},
        index=pd.date_range('20000101', periods=nrows))
    return df
df = make_dataframe(1000)
If there are only a few columns, using_dictcomp is the fastest. But note that using_dictcomp assumes df has unique column names. The dict comprehension in using_dictcomp won't repeat duplicated column names. The other alternatives will work with repeated column names, however.
Both using_reindex and using_loc assume df has a unique index.
using_reindex came from cᴏʟᴅsᴘᴇᴇᴅ's using_loc, in an (unfortunately) now deleted post. cᴏʟᴅsᴘᴇᴇᴅ showed it wasn't necessary to manually repeat all the values -- you only need to repeat the index and then let df.loc (or df.reindex) repeat all the rows for you. It also avoids accessing df.values which could generate an intermediate NumPy array of object dtype if df contains columns of multiple dtypes.

Pandas: for each row in a DataFrame, count the number of rows matching a condition

I have a DataFrame for which I want to calculate, for each row, how many other rows match a given condition (e.g. number of rows that have value in column C less than the value for this row). Iterating through each row is too slow (I have ~1B rows), especially when the columns dtype is a datetime, but this is the way it could be run on a DataFrame df with a column labeled C:
df['newcol'] = 0
for row in df.itertuples():
    df.loc[row.Index, 'newcol'] = len(df[df.C < row.C])
Is there a way to vectorize this?
Thanks!

Preparation:
import numpy as np
import pandas as pd
count = 5000
np.random.seed(100)
data = np.random.randint(100, size=count)
df = pd.DataFrame({'Col': list('ABCDE') * (count // 5),  # integer division: a list cannot be repeated a float number of times
                   'Val': data})
Suggestion:
u, c = np.unique(data, return_counts=True)
values = np.cumsum(c)
dictionary = dict(zip(u[1:], values[:-1]))
dictionary[u[0]] = 0
df['newcol'] = [dictionary[x] for x in data]
It does exactly the same as your example.
If it does not help, please write a more detailed question.
Recommendations:
Pandas vectorization and JIT compilation are available with numba.
If you work with 1-D arrays, use numpy. In many situations it is faster. Just compare:
Pandas
%timeit df['newcol2'] = df.apply(lambda x: sum(df['Val'] < x.Val), axis=1)
1 loop, best of 3: 51.1 s per loop
204.34800005
Numpy
%timeit df['newcol3'] = [np.sum(data<x) for x in data]
10 loops, best of 3: 61.3 ms per loop
2.5490000248
Use numpy.sum instead of sum!
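For the specific "strictly less than" condition in the question, another vectorized option is to sort once and look up each value's insertion point. This is a different technique from the dictionary approach above, offered as a sketch on a made-up column C:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'C': [5, 1, 3, 3, 9]})

values = df['C'].to_numpy()
sorted_values = np.sort(values)

# With side='left', the insertion point of x in the sorted array equals
# the number of elements strictly less than x.
df['newcol'] = np.searchsorted(sorted_values, values, side='left')
print(df['newcol'].tolist())  # [3, 0, 1, 1, 4]
```

The cost is one sort plus one binary search per row, instead of comparing every row against every other row.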

Consider pandas.DataFrame.apply with a lambda expression to count the rows that match your condition. Admittedly, apply is a loop, and running it across ~1 billion rows may take a long time.
import numpy as np
import pandas as pd
np.random.seed(161)
df = pd.DataFrame({'Col': list('ABCDE') * 3,
'Val': np.random.randint(100, size=15)})
df['newcol'] = df.apply(lambda x: sum(df['Val'] < x.Val), axis=1)
# Col Val Count
# 0 A 78 13
# 1 B 11 2
# 2 C 51 8
# 3 D 31 5
# 4 E 29 4
# 5 A 99 14
# 6 B 65 10
# 7 C 16 3
# 8 D 43 7
# 9 E 10 1
# 10 A 67 11
# 11 B 36 6
# 12 C 1 0
# 13 D 73 12
# 14 E 64 9

quickly drop dataframe columns with only one distinct value

Is there a faster way to drop columns that only contain one distinct value than the code below?
cols=df.columns.tolist()
for col in cols:
    if len(set(df[col].tolist()))<2:
        df=df.drop(col, axis=1)
This is really quite slow for large dataframes. Logically, this counts the number of values in each column when in fact it could just stop counting after reaching 2 different values.

You can use Series.unique() method to find out all the unique elements in a column, and for columns whose .unique() returns only 1 element, you can drop that. Example -
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)
A method that does not do inplace dropping -
res = df
for col in df.columns:
    if len(df[col].unique()) == 1:
        res = res.drop(col,axis=1)
Demo -
In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])
In [155]: for col in df.columns:
   .....:     if len(df[col].unique()) == 1:
   .....:         df.drop(col,inplace=True,axis=1)
   .....:
In [156]: df
Out[156]:
1
0 2
1 3
2 2
Timing results -
In [166]: %paste
def func1(df):
    res = df
    for col in df.columns:
        if len(df[col].unique()) == 1:
            res = res.drop(col,axis=1)
    return res
## -- End pasted text --
In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop
In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop
In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop
The fastest method still seems to be the method using unique and looping through the columns.

One step:
df = df[[c for c
in list(df)
if len(df[c].unique()) > 1]]
Two steps:
Create a list of column names that have more than 1 distinct value.
keep = [c for c
in list(df)
if len(df[c].unique()) > 1]
Drop the columns that are not in 'keep'
df = df[keep]
Note: this step can also be done using a list of columns to drop:
drop_cols = [c for c
in list(df)
if df[c].nunique() <= 1]
df = df.drop(columns=drop_cols)

df.loc[:,df.apply(pd.Series.nunique) != 1]
For example
In:
df = pd.DataFrame({'A': [10, 20, np.nan, 30], 'B': [10, np.nan, 10, 10]})
df.loc[:,df.apply(pd.Series.nunique) != 1]
Out:
A
0 10
1 20
2 NaN
3 30

Two simple one-liners for either returning a new, filtered DataFrame (shorter version of jz0410's answer)
df.loc[:,df.nunique()!=1]
or dropping inplace (via drop())
df.drop(columns=df.columns[df.nunique()==1], inplace=True)
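Keep in mind that nunique ignores NaN by default, so a column holding one value plus some NaN counts as constant and is dropped. A short sketch on a made-up frame shows the difference when dropna=False is passed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, np.nan, 30], 'B': [10, np.nan, 10, 10]})

print(df.nunique().tolist())              # [3, 1]  NaN is not counted
print(df.nunique(dropna=False).tolist())  # [4, 2]  NaN counts as a value

# Default behaviour: B (10 plus NaN) is treated as constant and removed.
kept_default = df.loc[:, df.nunique() != 1]

# Counting NaN as a distinct value keeps B.
kept_with_nan = df.loc[:, df.nunique(dropna=False) != 1]
print(kept_default.columns.tolist(), kept_with_nan.columns.tolist())
```

Which behaviour is right depends on whether missing values should count as a second distinct value in your data.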

You can create a mask of your df by calling apply with value_counts. For a column with only one distinct value this produces NaN in all rows except one; you can then call dropna column-wise and pass thresh=2 so that a column is kept only if it has 2 or more non-NaN values:
In [329]:
df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
df
Out[329]:
a b c
0 1 0 0
1 1 1 0
2 1 2 2
3 1 3 2
4 1 4 2
In [342]:
df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
Out[342]:
b c
0 0 0
1 1 0
2 2 2
3 3 2
4 4 2
Output from the boolean conditions:
In [344]:
df.apply(pd.Series.value_counts)
Out[344]:
a b c
0 NaN 1 2
1 5 1 NaN
2 NaN 1 3
3 NaN 1 NaN
4 NaN 1 NaN
In [345]:
df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
Out[345]:
b c
0 1 2
1 1 NaN
2 1 3
3 1 NaN
4 1 NaN

Many examples in this thread and in the related thread did not work for my df. These worked:
# from: https://stackoverflow.com/questions/33144813/quickly-drop-dataframe-columns-with-only-one-distinct-value
# from: https://stackoverflow.com/questions/20209600/pandas-dataframe-remove-constant-column
import pandas as pd
import numpy as np
data = {'var1': [1,2,3,4,5,np.nan,7,8,9],
'var2':['Order',np.nan,'Inv','Order','Order','Shp','Order', 'Order','Inv'],
'var3':[101,101,101,102,102,102,103,103,np.nan],
'var4':[np.nan,1,1,1,1,1,1,1,1],
'var5':[1,1,1,1,1,1,1,1,1],
'var6':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'var7':["a","a","a","a","a","a","a","a","a"],
'var8': [1,2,3,4,5,6,7,8,9]}
df = pd.DataFrame(data)
df_original = df.copy()
#-------------------------------------------------------------------------------------------------
df2 = df[[c for c
in list(df)
if len(df[c].unique()) > 1]]
#-------------------------------------------------------------------------------------------------
keep = [c for c
in list(df)
if len(df[c].unique()) > 1]
df3 = df[keep]
#-------------------------------------------------------------------------------------------------
keep_columns = [col for col in df.columns if len(df[col].unique()) > 1]
df5 = df[keep_columns].copy()
#-------------------------------------------------------------------------------------------------
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)

I would like to throw in:
pandas 1.0.3
ids = df.nunique().values>1
df.loc[:,ids]
not that slow:
2.81 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

df=df.loc[:,df.nunique()!=Numberofvalues]

None of the solutions worked in my use-case because I got this error: (my dataframe contains list item).
TypeError: unhashable type: 'list'
The solution that worked for me is this:
ndf = df.describe(include="all").T
new_cols = set(df.columns) - set(ndf[ndf.unique == 1].index)
df = df[list(new_cols)]

One line
df=df[[i for i in df if len(set(df[i]))>1]]

One of the solutions with pipe (convenient if used often):
def drop_unique_value_col(df):
    return df.loc[:,df.apply(pd.Series.nunique) != 1]
df.pipe(drop_unique_value_col)

This will drop all the columns with only one distinct value.
for col in Dataframe.columns:
    if len(Dataframe[col].value_counts()) == 1:
        Dataframe.drop([col], axis=1, inplace=True)

Most 'pythonic' way of doing it I could find:
df = df.loc[:, (df != df.iloc[0]).any()]
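This works by comparing every row with the first row and keeping the columns where at least one value differs. A small sketch on a made-up frame also shows a caveat: because NaN != NaN evaluates to True, a column that is entirely NaN is kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'const': [1, 1, 1],
    'vary': [1, 2, 1],
    'all_nan': [np.nan] * 3,
})

# True for each column in which some row differs from the first row.
differs = (df != df.iloc[0]).any()
print(differs.tolist())  # [False, True, True]

result = df.loc[:, differs]
print(result.columns.tolist())  # ['vary', 'all_nan']
```

If all-NaN columns should be removed as well, they need a separate check, for example with df.dropna(axis=1, how='all').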
