process columns in pandas dataframe - python

I have a dataframe df:
Col1 Col2 Col3
0 a1 NaN NaN
1 a2 b1 NaN
2 a3 b3 c1
3 a4 NaN c2
I have tried :
new_df = '[' + df + ']'
new_df['Col4']=new_df[new_df.columns[0:]].apply(lambda x:','.join(x.dropna().astype(str)),axis =1)
df_final = pd.concat([df, new_df['col4']], axis =1)
I am getting this:
I was looking for a robust solution to get to something that must look like this:
I know there is no direct way to do this; the dataframe will eventually be at least 20k rows, hence this question to fellow stack-people.
Thanks.
let me know if you have any more questions and I can edit the question to add points.

I'm not sure what your use case is, but here you go:
df['Col4'] = df.apply(lambda row:", ".join([(val if val[0]=='a' else "["+val+"]") for val in row if not pd.isna(val)]), axis=1)
It joins each row's values with ", ".join, keeping only those that are not pd.isna, and wraps every value that does not begin with a in brackets.
Whatever you want to do with it, there probably is a better solution though
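For reference, a self-contained run of this one-liner on the sample frame from the question (the values below are reconstructed from the table above):
import numpy as np
import pandas as pd

# Sample frame reconstructed from the question
df = pd.DataFrame({"Col1": ["a1", "a2", "a3", "a4"],
                   "Col2": [np.nan, "b1", "b3", np.nan],
                   "Col3": [np.nan, np.nan, "c1", "c2"]})

df['Col4'] = df.apply(lambda row: ", ".join(
    [(val if val[0] == 'a' else "[" + val + "]") for val in row if not pd.isna(val)]), axis=1)

print(df['Col4'].tolist())
# ['a1', 'a2, [b1]', 'a3, [b3], [c1]', 'a4, [c2]']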

You can wrap every value except the first non-missing one in [], using the helper counter i from enumerate:
def f(x):
    gen = (y for y in x if pd.notna(y))
    return ','.join(y if i == 0 else '['+y+']' for i, y in enumerate(gen))

#f = lambda x: ','.join(y if i == 0 else '['+y+']' for i, y in enumerate(x.dropna()))

df['col4'] = df.apply(f, axis=1)
print (df)
Col1 Col2 Col3 Col4 col4
0 a1 NaN d8 NaN a1,[d8]
1 a2 b1 d3 NaN a2,[b1],[d3]
2 NaN b3 c1 NaN b3,[c1]
3 a4 NaN c2 NaN a4,[c2]
4 NaN NaN c6 d5 c6,[d5]
Performance test:
#test for 25k rows
df = pd.concat([df] * 5000, ignore_index=True)
f1 = lambda x: ','.join(y if i == 0 else '['+y+']' for i, y in enumerate(x.dropna()))
%timeit df.apply(f1, axis=1)
3.62 s ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.apply(f, axis =1)
475 ms ± 3.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

new_col = []
for idx, row in df.iterrows():
    val1 = row["Col1"]
    val2 = row["Col2"]
    val3 = row["Col3"]
    new_val2 = f",[{val2}]" if pd.notna(val2) else ""
    new_val3 = f",[{val3}]" if pd.notna(val3) else ""
    val4 = f"{val1}{new_val2}{new_val3}"
    new_col.append(val4)
df["Col4"] = new_col
Maybe my answer is not the most "computationally efficient", but if your dataset is 20k rows, it will be fast enough!
I think my answer is very easy to read, and it is also easy to adapt it to different scenarios!
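As a side note, a hedged sketch of the same bracket/comma logic without iterrows (assuming, as the loop above does, that Col1 is never missing):
import pandas as pd

# Sketch only: build the bracketed pieces column-wise, then concatenate.
wrap = lambda s: s.map(lambda v: f",[{v}]" if pd.notna(v) else "")
df["Col4"] = df["Col1"].astype(str) + wrap(df["Col2"]) + wrap(df["Col3"])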

Related

Efficiently replacing values in each row of pandas dataframe based on condition

I would like to work with a pandas data frame to get a strange yet desired output dataframe. For each row, I'd like any values of 0.0 to be replaced with an empty string (''), and all values of 1.0 to be replaced with the value of the index. Any given value on a row can only be 1.0 or 0.0.
Here's some example data:
# starting df
df = pd.DataFrame.from_dict({'A':[1.0,0.0,0.0],'B':[1.0,1.0,0.0],'C':[0.0,1.0,1.0]})
df.index=['x','y','z']
print(df)
What the input df looks like:
A B C
x 1.0 1.0 0.0
y 0.0 1.0 1.0
z 0.0 0.0 1.0
What I would like the output df to look like:
A B C
x x x
y y y
z z
So far I've got this pretty inefficient but seemingly working code:
for idx in df.index:
    df.loc[idx] = df.loc[idx].map(str).replace('1.0',str(idx))
    df.loc[idx] = df.loc[idx].map(str).replace('0.0','')
Could anyone please suggest an efficient way to do this?
The real data frame I'll be working with has a shape of (4548, 2044) and the values will always be floats (1.0 or 0.0), like in the example. I'm manipulating the usher_barcodes.csv data from "raw.githubusercontent.com/andersen-lab/Freyja/main/freyja/data/…" into a format required by another pipeline, where the column headers are lineage names and the values are mutations (taken from the index). The column headers and index values will likely be different each time I need to run this code because the lineage assignments are constantly changing.
Thanks!
Use numpy.where, broadcasting the index converted to a numpy array:
df = pd.DataFrame(np.where(df.eq(1),
                           df.index.to_numpy()[:, None],
                           ''),
                  index = df.index,
                  columns = df.columns)
print(df)
A B C
x x x
y y y
z z
Performance with data by size (4548,2044):
np.random.seed(2023)
df = pd.DataFrame(np.random.choice([0.0,1.0], size=(4548,2044))).add_prefix('c')
df.index = df.index.astype(str) + 'r'
# print (df)
In [87]: %timeit df.eq(1).mul(df.index, axis=0)
684 ms ± 36.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [90]: %timeit pd.DataFrame(np.where(df.eq(1),df.index.to_numpy()[:, None],''),index = df.index, columns = df.columns)
449 ms ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can simply do:
for idx, row in df.iterrows():
    df.loc[idx] = ['' if val == 0 else idx for val in row]
which gives:
A B C
x x x
y y y
z z
Take advantage of the fact that 1*'x' -> 'x' and 0*'x' -> '':
out = df.eq(1).mul(df.index, axis=0)
NB: eq(1) converts the floats to booleans, since True is equivalent to 1. You could also use astype(int) if you only have 0.0/1.0 values.
Output:
A B C
x x x
y y y
z z
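For completeness, a self-contained sketch (reusing the example frame from the question) that checks the np.where construction and the boolean-multiplication trick above agree:
import numpy as np
import pandas as pd

# Example frame from the question
df = pd.DataFrame.from_dict({'A': [1.0, 0.0, 0.0], 'B': [1.0, 1.0, 0.0], 'C': [0.0, 1.0, 1.0]})
df.index = ['x', 'y', 'z']

via_where = pd.DataFrame(np.where(df.eq(1), df.index.to_numpy()[:, None], ''),
                         index=df.index, columns=df.columns)
via_mul = df.eq(1).mul(df.index, axis=0)

print(via_where.equals(via_mul))  # expected: True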

Iteratively combine text in first column with existing text in other columns

I am in the process of creating a python script that extracts data from a poorly designed output file (which I can't change) from a piece of equipment within our research lab. I would like to include a way to iteratively combine the text in the first column of a dataframe (example below) with each other column in the dataframe.
A simple example of the dataframe:
Filename  1         2         3         4         5
a         Sheet(1)  Sheet(2)  Sheet(3)  Sheet(4)  ....
b         Sheet(1)  Sheet(2)  --------  --------  ....
c         Sheet(1)  Sheet(2)  Sheet(3)  Sheet(4)  ....
d         Sheet(1)  Sheet(2)  Sheet(3)  --------  ....
e         Sheet(1)  Sheet(2)  Sheet(3)  Sheet(4)  ....
f         Sheet(1)  --------  --------  --------  ....
What I am looking to produce:
Filename  1           2           3           4           5
a         a_Sheet(1)  a_Sheet(2)  a_Sheet(3)  a_Sheet(4)  ....
b         b_Sheet(1)  b_Sheet(2)  --------    --------    ....
c         c_Sheet(1)  c_Sheet(2)  c_Sheet(3)  c_Sheet(4)  ....
d         d_Sheet(1)  d_Sheet(2)  d_Sheet(3)  --------    ....
e         e_Sheet(1)  e_Sheet(2)  e_Sheet(3)  e_Sheet(4)  ....
f         f_Sheet(1)  --------    --------    --------    ....
Use .apply to prepend the 'Filename' string to the other columns.
Of the current answers, the solution from Mykola Zotko is the fastest, tested against a 3-column dataframe with 100k rows.
If your dataframe has undesired strings (e.g. '--------'), use something like df.replace('--------', pd.NA, inplace=True) before combining the column strings.
If the final result must contain '--------', use df.fillna('--------', inplace=True) at the end. This is better than trying to deal with them iteratively (a short round-trip sketch appears after the timings below).
import pandas as pd
import numpy as np
# test dataframe
df = pd.DataFrame({'Filename': ['a', 'b', 'c'], 'c1': ['s1'] * 3, 'c2': ['s2', np.nan, 's2']})
# display(df)
Filename c1 c2
0 a s1 s2
1 b s1 NaN
2 c s1 s2
# prepend the filename strings to the other columns
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: df.Filename + '_' + x)
# display(df)
Filename c1 c2
0 a a_s1 a_s2
1 b b_s1 NaN
2 c c_s1 c_s2
%%timeit test against other answers
# test data with 100k rows
df = pd.concat([pd.DataFrame({'Filename': ['a', 'b', 'c'], 'c1': ['s1'] * 3, 'c2': ['s2'] * 3})] * 33333).reset_index(drop=True)
# Solution from Trenton
%%timeit
df.iloc[:, 1:].apply(lambda x: df.Filename + '_' + x)
[out]:
33.6 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Solution from Mykola
%%timeit
df['Filename'].to_numpy().reshape(-1, 1) + '_' + df.loc[:, 'c1':]
[out]:
29.6 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Solution from Alex
%%timeit
df.loc[:, cols].apply(lambda s: df["Filename"].str.cat(s, sep="_"))
[out]:
45.3 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# iterating the columns in a for-loop
def test(d):
    for cols in d.columns[1:]:
        d[cols]=d['Filename'] + '_' + d[cols]
    return d
%%timeit
test(df)
[out]:
53.8 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
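To make the '--------' handling mentioned above concrete, here is a small round-trip sketch (my own illustration; it assumes the placeholder string from the question):
import numpy as np
import pandas as pd

# Two-row sketch with the question's placeholder string
df = pd.DataFrame({'Filename': ['a', 'b'],
                   '1': ['Sheet(1)', 'Sheet(1)'],
                   '2': ['Sheet(2)', '--------']})

df = df.replace('--------', np.nan)                              # placeholders -> NaN
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: df.Filename + '_' + x)
df = df.fillna('--------')                                       # restore placeholders
print(df)  # row 'b' keeps '--------' in column '2'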
For example, if you have the following data frame:
col1 col2 col3 col4
0 a x y z
1 b x y z
2 c x y NaN
You can use broadcasting:
df.loc[:, 'col2':] = df['col1'].to_numpy().reshape(-1, 1) + '_' + df.loc[:, 'col2':]
Result:
col1 col2 col3 col4
0 a a_x a_y a_z
1 b b_x b_y b_z
2 c c_x c_y NaN
Try:
for cols in df.loc[:,'1':]:
    df[cols]=df['Filename']+'_'+df[cols]
I've represented the -------- as np.nan. You should be able to label these as NaN when you load the file; see the na_values parameter of pd.read_csv.
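For example, a minimal sketch of flagging the placeholder at load time (the filename here is hypothetical):
import pandas as pd

# '--------' becomes NaN on load; 'data.csv' is a placeholder filename.
df = pd.read_csv('data.csv', na_values=['--------'])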
This is the dict for the DataFrame:
from numpy import nan

d = {
    1: [nan, "Sheet(1)", nan],
    2: [nan, "Sheet(2)", nan],
    3: ["Sheet(3)", nan, "Sheet(3)"],
    4: ["Sheet(4)", nan, nan],
    "Filename": ["a", "b", "c"],
}
df = pd.DataFrame(d)
Then we can:
Make a mask of the columns we want to change, everything but Filename
cols = df.columns != "Filename"
# array([ True, True, True, True, False])
Apply a function, which uses Series.str.cat:
df.loc[:, cols] = df.loc[:, cols].apply(lambda s: df["Filename"].str.cat(s, sep="_"))
this function takes each column specified in cols and concatenates it with the Filename column.
Which produces:
1 2 3 4 Filename
0 NaN NaN a_Sheet(3) a_Sheet(4) a
1 b_Sheet(1) b_Sheet(2) NaN NaN b
2 NaN NaN c_Sheet(3) NaN c

Different groupers for each column with pandas GroupBy

How could I use a multidimensional Grouper, in this case another dataframe, as a Grouper for another dataframe? Can it be done in one step?
My question is essentially regarding how to perform an actual grouping under these circumstances, but to make it more specific, say I want to then transform and take the sum.
Consider for example:
df1 = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]})
print(df1)
a b
0 1 5
1 2 6
2 3 7
3 4 8
df2 = pd.DataFrame({'a':['A','B','A','B'], 'b':['A','A','B','B']})
print(df2)
a b
0 A A
1 B A
2 A B
3 B B
Then, the expected output would be:
a b
0 4 11
1 6 11
2 4 15
3 6 15
Where columns a and b in df1 have been grouped by columns a and b from df2 respectively.
You will have to group each column individually since each column uses a different grouping scheme.
If you want a cleaner version, I would recommend a list comprehension over the column names, and call pd.concat on the resultant series:
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)
a b
0 4 11
1 6 11
2 4 15
3 6 15
Not to say there's anything wrong with using apply as in the other answer, just that I don't like apply, so this is my suggestion :-)
Here are some timeits for your perusal. Just for your sample data, you will notice the difference in timings is obvious.
%%timeit
(df1.stack()
.groupby([df2.stack().index.get_level_values(level=1), df2.stack()])
.transform('sum').unstack())
%%timeit
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))
%%timeit
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)
8.99 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.35 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.13 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Not to say apply is slow, but explicit iteration in this case is faster. Additionally, you will notice that the second and third timed solutions scale better with length versus breadth, since the number of iterations depends on the number of columns.
Try using apply to apply a lambda function to each column of your dataframe, then use the name of that pd.Series to group by the second dataframe:
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))
Output:
a b
0 4 11
1 6 11
2 4 15
3 6 15
Using stack and unstack
df1.stack().groupby([df2.stack().index.get_level_values(level=1),df2.stack()]).transform('sum').unstack()
Out[291]:
a b
0 4 11
1 6 11
2 4 15
3 6 15
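To see what this groups by, here is a short sketch of the intermediate stacked objects (df1/df2 as defined in the question):
stacked_vals = df1.stack()   # MultiIndex (row, column) -> value
stacked_keys = df2.stack()   # MultiIndex (row, column) -> group label
print(stacked_vals[:4].tolist())                             # [1, 5, 2, 6]
print(stacked_keys.index.get_level_values(1)[:4].tolist())   # ['a', 'b', 'a', 'b']
print(stacked_keys[:4].tolist())                             # ['A', 'A', 'B', 'A']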
I'm going to propose a (mostly) numpythonic solution that uses a scipy.sparse_matrix to perform a vectorized groupby on the entire DataFrame at once, rather than column by column.
The key to performing this operation efficiently is finding a performant way to factorize the entire DataFrame, while avoiding duplicates in any columns. Since your groups are represented by strings, you can simply concatenate the column
name on the end of each value (since columns should be unique), and then factorize the result, like so [*]
>>> df2 + df2.columns
a b
0 Aa Ab
1 Ba Ab
2 Aa Bb
3 Ba Bb
>>> pd.factorize((df2 + df2.columns).values.ravel())
(array([0, 1, 2, 1, 0, 3, 2, 3], dtype=int64),
array(['Aa', 'Ab', 'Ba', 'Bb'], dtype=object))
Once we have a unique grouping, we can utilize our scipy.sparse matrix, to perform a groupby in a single pass on the flattened arrays, and use advanced indexing and a reshaping operation to convert the result back to the original shape.
from scipy import sparse
a = df1.values.ravel()
b, _ = pd.factorize((df2 + df2.columns).values.ravel())
o = sparse.csr_matrix(
    (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
).sum(0).A1
res = o[b].reshape(df1.shape)
array([[ 4, 11],
       [ 6, 11],
       [ 4, 15],
       [ 6, 15]], dtype=int64)
Performance
Functions
def gp_chris(f1, f2):
    a = f1.values.ravel()
    b, _ = pd.factorize((f2 + f2.columns).values.ravel())
    o = sparse.csr_matrix(
        (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
    ).sum(0).A1
    return pd.DataFrame(o[b].reshape(f1.shape), columns=df1.columns)

def gp_cs(f1, f2):
    return pd.concat([f1[c].groupby(f2[c]).transform('sum') for c in f1.columns], axis=1)

def gp_scott(f1, f2):
    return f1.apply(lambda x: x.groupby(f2[x.name]).transform('sum'))

def gp_wen(f1, f2):
    return f1.stack().groupby([f2.stack().index.get_level_values(level=1), f2.stack()]).transform('sum').unstack()
Setup
import numpy as np
from scipy import sparse
import pandas as pd
import string
from timeit import timeit
import matplotlib.pyplot as plt
res = pd.DataFrame(
    index=[f'gp_{f}' for f in ('chris', 'cs', 'scott', 'wen')],
    columns=[10, 50, 100, 200, 400],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        df1 = pd.DataFrame(np.random.rand(c, c))
        df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (c, c)))
        df1.columns = df1.columns.astype(str)
        df2.columns = df2.columns.astype(str)
        stmt = '{}(df1, df2)'.format(f)
        setp = 'from __main__ import df1, df2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Results
Validation
df1 = pd.DataFrame(np.random.rand(10, 10))
df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (10, 10)))
df1.columns = df1.columns.astype(str)
df2.columns = df2.columns.astype(str)
v = np.stack([gp_chris(df1, df2), gp_cs(df1, df2), gp_scott(df1, df2), gp_wen(df1, df2)])
print(np.all(v[:-1] == v[1:]))
True
Either we're all wrong or we're all correct :)
[*] There is a possibility that you could get a duplicate value here if one item is the concatenation of a column and another item before concatenation occurs. However if this is the case, you shouldn't need to adjust much to fix it.
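If that edge case matters, one hedged workaround (my own sketch, not part of the answer) is to factorize column by column and offset the codes so labels from different columns can never collide; the resulting b is a drop-in replacement for the pd.factorize line above:
import numpy as np
import pandas as pd

# Row-major layout so the codes stay aligned with a = df1.values.ravel()
codes = np.empty(df2.shape, dtype=np.int64)
offset = 0
for j, col in enumerate(df2.columns):
    f, uniques = pd.factorize(df2[col])
    codes[:, j] = f + offset          # shift this column's codes past earlier columns
    offset += len(uniques)
b = codes.ravel()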
You could do something like the following:
res = df1.assign(a_sum=lambda df: df['a'].groupby(df2['a']).transform('sum'))\
         .assign(b_sum=lambda df: df['b'].groupby(df2['b']).transform('sum'))
Results:
a b
0 4 11
1 6 11
2 4 15
3 6 15
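Note that assign as written keeps the original a and b and adds a_sum and b_sum next to them; to get exactly the two-column frame shown as the expected output, a hedged variant (same grouping logic, my own tweak) overwrites a and b instead:
res = df1.assign(a=lambda df: df['a'].groupby(df2['a']).transform('sum'),
                 b=lambda df: df['b'].groupby(df2['b']).transform('sum'))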

How to identify text related to a particular dynamic value in Pandas/Python

I have the following 2 columns in my dataframe:
COL1 COL2
12 :402:agshhhjd:45:hghghgruru:12:fghg,hgh:22:hhhh
57 :42:ags,hhhjd:57:hghg,hgruru:120:fghgh,gh:12:hhhhhh
I need to create another column COL3 which should be like below:
COL1 COL2 COL3
12 :402:agshhhjd:45:hghghgruru,:12:fghg,hgh:22:hhhh fghg,hg
57 :42:agshhhjd:57:hghg,hgruru:120:fghghgh:12:hhhhhh hghg,hg
The new column COL3 needs to be created in such a way that it searches for the value of COL1 in COL2 on the same row and then prints the 7 characters after the ":". I tried doing this using slice, but it's not working. Can someone kindly help?
You can just use the replace method, but first you have to change the datatype of COL1 to string. We need to replace everything in COL2 except the characters after the number from COL1, i.e.
.*12:(\w{7}).*. So we capture the seven characters and refer to them with the back reference \1. The same pattern is built for the second row. This is easy to do since replace is vectorized, although it will be slow.
df['COL3'] = df.COL2.replace(regex=r'.*'+ df.COL1.astype('str') +':(\\w{7}).*',value="\\1")
df
COL1 COL2 COL3
0 12 :402:agshhhjd,:45:hghghgruru,:12:fghghgh,:22:hhhh fghghgh
1 57 :42:agshhhjd,:57:hghghgruru,:120:fghghgh,:12:h... hghghgr
You can also do:
import re
[re.sub(".*"+str(i)+":(\\w{7}).*","\\1",j) for i,j in zip(df.COL1,df.COL2)]
EDIT
with your update, you could do:
df.assign(COL3 = df.COL2.replace(regex=r'.*'+ df.COL1.astype('str')+':(.{7}).*',value="\\1"))
Out[102]:
COL1 COL2 COL3
0 12 :402:agshhhjd,:45:hghghgruru,:12:fghg,hgh,:22:... fghg,hg
1 57 :42:ags,hhhjd,:57:hghg,hgruru,:120:fghgh,gh,:1... hghg,hg
Using a list comprehension and re.findall:
import re
df['COL3'] = [
    re.findall('{}\:([a-z]{{7}})'.format(i), j) for i, j in zip(df.COL1, df.COL2)
]
COL1 COL2 COL3
0 12 :402:agshhhjd,:45:hghghgruru,:12:fghghgh,:22:hhhh [fghghgh]
1 57 :42:agshhhjd,:57:hghghgruru,:120:fghghgh,:12:h... [hghghgr]
You could also use a list comprehension and split, although this will throw an error if the first value isn't found in COL2:
[j.split('{}:'.format(i))[1][:7] for i, j in zip(df.COL1, df.COL2)]
# ['fghghgh', 'hghghgr']
If you can guarantee that the value will be found in COL2, then using split is faster:
df = pd.concat([df]*10000)
%timeit [re.findall('{}\:([a-z]{{7}})'.format(i), j) for i, j in zip(df.COL1, df.COL2)]
28.3 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [j.split('{}:'.format(i))[1][:7] for i, j in zip(df.COL1, df.COL2)]
12 ms ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
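If the COL1 key can be missing from COL2, a hedged, more defensive variant of the split approach (my own sketch; first_match is a hypothetical helper) falls back to None instead of raising:
def first_match(key, text, width=7):
    # Return the `width` characters after '<key>:' in `text`, or None if the key is absent.
    parts = text.split('{}:'.format(key))
    return parts[1][:width] if len(parts) > 1 else None

df['COL3'] = [first_match(i, j) for i, j in zip(df.COL1, df.COL2)]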
Try this:
test = pd.DataFrame({'Col1': [12, 57], 'Col2': [':402:agshhhjd,:45:hghghgruru,:12:fghghgh,:22:hhhh', ':42:agshhhjd,:57:hghghgruru,:120:fghghgh,:12:hhhhhh']})
test
Col1 Col2
0 12 :402:agshhhjd,:45:hghghgruru,:12:fghghgh,:22:hhhh
1 57 :42:agshhhjd,:57:hghghgruru,:120:fghghgh,:12:h...
def my_val(col1num, col2text):
    # Split columns by ':'
    col2_ls = col2text.split(':')[1:]
    # Create an empty dict to store key-value pairs
    my_dict = {}
    # Create your key-value pairs and update dict
    for i, j in zip(range(0, len(col2_ls), 2), range(1, len(col2_ls)+1, 2)):
        my_dict[col2_ls[i]] = col2_ls[j]
    # If the key exists return the value
    if str(col1num) in my_dict.keys():
        val = my_dict[str(col1num)]
        return val
    else:
        return 'Unavailable'
test['Col3'] = test.apply(lambda x: my_val(col1num=x['Col1'], col2text=x['Col2']), axis=1)
test
Col1 Col2 Col3
0 12 :402:agshhhjd,:45:hghghgruru,:12:fghghgh,:22:hhhh fghghgh,
1 57 :42:agshhhjd,:57:hghghgruru,:120:fghghgh,:12:h... hghghgruru,
Hope this helps

quickly drop dataframe columns with only one distinct value

Is there a faster way to drop columns that only contain one distinct value than the code below?
cols=df.columns.tolist()
for col in cols:
    if len(set(df[col].tolist()))<2:
        df=df.drop(col, axis=1)
This is really quite slow for large dataframes. Logically, this counts the number of values in each column when in fact it could just stop counting after reaching 2 different values.
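For illustration, the 'stop counting after two distinct values' idea could be sketched like this (my own sketch, not from the answers below; a plain Python scan is often still slower than the vectorized approaches despite the early exit):
def has_second_value(s):
    # Early-exit scan: return True as soon as a second distinct value is seen.
    # Caveat: NaN != NaN, so an all-NaN column counts as having a second value here.
    it = iter(s)
    try:
        first = next(it)
    except StopIteration:
        return False
    return any(x != first for x in it)

df = df[[c for c in df.columns if has_second_value(df[c])]]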
You can use the Series.unique() method to find all the unique elements in a column, and drop any column whose .unique() returns only 1 element. Example -
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)
A method that does not do inplace dropping -
res = df
for col in df.columns:
    if len(df[col].unique()) == 1:
        res = res.drop(col,axis=1)
Demo -
In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])
In [155]: for col in df.columns:
   .....:     if len(df[col].unique()) == 1:
   .....:         df.drop(col,inplace=True,axis=1)
   .....:
In [156]: df
Out[156]:
1
0 2
1 3
2 2
Timing results -
In [166]: %paste
def func1(df):
    res = df
    for col in df.columns:
        if len(df[col].unique()) == 1:
            res = res.drop(col,axis=1)
    return res
## -- End pasted text --
In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop
In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop
In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop
The fastest method still seems to be the method using unique and looping through the columns.
One step:
df = df[[c for c
in list(df)
if len(df[c].unique()) > 1]]
Two steps:
Create a list of column names that have more than 1 distinct value.
keep = [c for c
in list(df)
if len(df[c].unique()) > 1]
Drop the columns that are not in 'keep'
df = df[keep]
Note: this step can also be done using a list of columns to drop:
drop_cols = [c for c
in list(df)
if df[c].nunique() <= 1]
df = df.drop(columns=drop_cols)
df.loc[:,df.apply(pd.Series.nunique) != 1]
For example
In:
df = pd.DataFrame({'A': [10, 20, np.nan, 30], 'B': [10, np.nan, 10, 10]})
df.loc[:,df.apply(pd.Series.nunique) != 1]
Out:
A
0 10
1 20
2 NaN
3 30
Two simple one-liners for either returning a view (shorter version of jz0410's answer)
df.loc[:,df.nunique()!=1]
or dropping inplace (via drop())
df.drop(columns=df.columns[df.nunique()==1], inplace=True)
You can create a mask of your df by calling apply and call value_counts, this will produce NaN for all rows except one, you can then call dropna column-wise and pass param thresh=2 so that there must be 2 or more non-NaN values:
In [329]:
df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
df
Out[329]:
a b c
0 1 0 0
1 1 1 0
2 1 2 2
3 1 3 2
4 1 4 2
In [342]:
df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
Out[342]:
b c
0 0 0
1 1 0
2 2 2
3 3 2
4 4 2
Output from the boolean conditions:
In [344]:
df.apply(pd.Series.value_counts)
Out[344]:
a b c
0 NaN 1 2
1 5 1 NaN
2 NaN 1 3
3 NaN 1 NaN
4 NaN 1 NaN
In [345]:
df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
Out[345]:
b c
0 1 2
1 1 NaN
2 1 3
3 1 NaN
4 1 NaN
Many examples in this thread and the linked thread did not work for my df. These worked:
# from: https://stackoverflow.com/questions/33144813/quickly-drop-dataframe-columns-with-only-one-distinct-value
# from: https://stackoverflow.com/questions/20209600/pandas-dataframe-remove-constant-column
import pandas as pd
import numpy as np
data = {'var1': [1,2,3,4,5,np.nan,7,8,9],
        'var2': ['Order',np.nan,'Inv','Order','Order','Shp','Order','Order','Inv'],
        'var3': [101,101,101,102,102,102,103,103,np.nan],
        'var4': [np.nan,1,1,1,1,1,1,1,1],
        'var5': [1,1,1,1,1,1,1,1,1],
        'var6': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
        'var7': ["a","a","a","a","a","a","a","a","a"],
        'var8': [1,2,3,4,5,6,7,8,9]}
df = pd.DataFrame(data)
df_original = df.copy()
#-------------------------------------------------------------------------------------------------
df2 = df[[c for c
in list(df)
if len(df[c].unique()) > 1]]
#-------------------------------------------------------------------------------------------------
keep = [c for c
in list(df)
if len(df[c].unique()) > 1]
df3 = df[keep]
#-------------------------------------------------------------------------------------------------
keep_columns = [col for col in df.columns if len(df[col].unique()) > 1]
df5 = df[keep_columns].copy()
#-------------------------------------------------------------------------------------------------
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)
I would like to throw in:
pandas 1.0.3
ids = df.nunique().values>1
df.loc[:,ids]
not that slow:
2.81 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df=df.loc[:,df.nunique()!=Numberofvalues]
None of these solutions worked in my use case because my dataframe contains list items, which raises this error:
TypeError: unhashable type: 'list'
The solution that worked for me is this:
ndf = df.describe(include="all").T
new_cols = set(df.columns) - set(ndf[ndf.unique == 1].index)
df = df[list(new_cols)]
One line
df=df[[i for i in df if len(set(df[i]))>1]]
One of the solutions with pipe (convenient if used often):
def drop_unique_value_col(df):
return df.loc[:,df.apply(pd.Series.nunique) != 1]
df.pipe(drop_unique_value_col)
This will drop all the columns with only one distinct value.
for col in Dataframe.columns:
    if len(Dataframe[col].value_counts()) == 1:
        Dataframe.drop([col], axis=1, inplace=True)
Most 'pythonic' way of doing it I could find:
df = df.loc[:, (df != df.iloc[0]).any()]
