for loop using iterrows in pandas - python

I have 2 dataframes as follows:
data1 looks like this:
id address
1 11123451
2 78947591
data2 looks like the following:
lowerbound_address upperbound_address place
78392888 89000000 X
10000000 20000000 Y
I want to create another column in data1 called "place" which contains the place the id is from.
For example, in the above case,
for id 1, I want the place column to contain Y and for id 2, I want the place column to contain X.
There will be many ids coming from the same place. And some ids don't have a match.
I am trying to do it using the following piece of code.
places = []
for index, row in data1.iterrows():
    for idx, r in data2.iterrows():
        if r['lowerbound_address'] <= row['address'] <= r['upperbound_address']:
            places.append(r['place'])
The addresses here are float values.
It's taking forever to run this piece of code. It makes me wonder if my code is correct or if there's a faster way of executing the same.
Any help will be much appreciated.
Thank you!

You can first do a cross join with merge, then filter the values by boolean indexing, and finally remove the unnecessary columns with drop:
data1['tmp'] = 1
data2['tmp'] = 1
df = pd.merge(data1, data2, on='tmp', how='outer')
df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
print (df)
id address place
1 1 11123451 Y
2 2 78947591 X
Another solution uses itertuples and then builds the result with DataFrame.from_records:
places = []
for row1 in data1.itertuples():
    for row2 in data2.itertuples():
        #print (row1.address)
        if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
            places.append((row1.id, row1.address, row2.place))
print (places)
[(1, 11123451, 'Y'), (2, 78947591, 'X')]
df = pd.DataFrame.from_records(places)
df.columns=['id','address','place']
print (df)
id address place
0 1 11123451 Y
1 2 78947591 X
Another solution with apply:
def f(x):
    for row2 in data2.itertuples():
        if (row2.lowerbound_address <= x <= row2.upperbound_address):
            return pd.Series([x, row2.place], index=['address','place'])
df = data1.set_index('id')['address'].apply(f).reset_index()
print (df)
id address place
0 1 11123451 Y
1 2 78947591 X
EDIT:
Timings:
N = 1000:
If some values are not in any range, they are omitted in solutions b and c. Check the last row of data1.
In [73]: %timeit (data1.set_index('id')['address'].apply(f).reset_index())
1 loop, best of 3: 2.06 s per loop
In [74]: %timeit (a(df1a, df2a))
1 loop, best of 3: 82.2 ms per loop
In [75]: %timeit (b(df1b, df2b))
1 loop, best of 3: 3.17 s per loop
In [76]: %timeit (c(df1c, df2c))
100 loops, best of 3: 2.71 ms per loop
Code for timings:
np.random.seed(123)
N = 1000

data1 = pd.DataFrame({'id':np.arange(1,N+1),
                      'address': np.random.randint(N*10, size=N)}, columns=['id','address'])
#add last row with value out of range
data1.loc[data1.index[-1]+1, ['id','address']] = [data1.index[-1]+1, -1]
data1 = data1.astype(int)
print (data1.tail())

data2 = pd.DataFrame({'lowerbound_address':np.arange(1, N*10,10),
                      'upperbound_address':np.arange(10,N*10+10, 10),
                      'place': np.random.randint(40, size=N)})
print (data2.tail())

df1a, df1b, df1c = data1.copy(),data1.copy(),data1.copy()
df2a, df2b ,df2c = data2.copy(),data2.copy(),data2.copy()

def a(data1, data2):
    data1['tmp'] = 1
    data2['tmp'] = 1
    df = pd.merge(data1, data2, on='tmp', how='outer')
    df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
    df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
    return (df)

def b(data1, data2):
    places = []
    for row1 in data1.itertuples():
        for row2 in data2.itertuples():
            #print (row1.address)
            if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
                places.append((row1.id, row1.address, row2.place))

    df = pd.DataFrame.from_records(places)
    df.columns=['id','address','place']
    return (df)

def f(x):
    #use for ... else to add NaN for values out of range
    #http://stackoverflow.com/q/9979970/2901002
    for row2 in data2.itertuples():
        if (row2.lowerbound_address <= x <= row2.upperbound_address):
            return pd.Series([x, row2.place], index=['address','place'])
    else:
        return pd.Series([x, np.nan], index=['address','place'])

def c(data1,data2):
    data1 = data1.sort_values('address')
    data2 = data2.sort_values('lowerbound_address')
    df = pd.merge_asof(data1, data2, left_on='address', right_on='lowerbound_address')
    df = df.drop(['lowerbound_address','upperbound_address'], axis=1)
    return df.sort_values('id')

print (data1.set_index('id')['address'].apply(f).reset_index())
print (a(df1a, df2a))
print (b(df1b, df2b))
print (c(df1c, df2c))
Only solution c with merge_asof performs very well on a large DataFrame:
N=1M:
In [84]: %timeit (c(df1c, df2c))
1 loop, best of 3: 525 ms per loop
More about merge_asof in the docs.
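For completeness, here is a minimal sketch of merge_asof applied directly to the small sample from the question (assuming data1 and data2 as shown there). Because merge_asof only matches on the lower bound, addresses past the matched upper bound still need to be masked out afterwards:
import numpy as np
import pandas as pd

data1 = pd.DataFrame({'id': [1, 2], 'address': [11123451, 78947591]})
data2 = pd.DataFrame({'lowerbound_address': [78392888, 10000000],
                      'upperbound_address': [89000000, 20000000],
                      'place': ['X', 'Y']})

# both join keys must be sorted for merge_asof
df = pd.merge_asof(data1.sort_values('address'),
                   data2.sort_values('lowerbound_address'),
                   left_on='address', right_on='lowerbound_address')
# clear the place again where the address is above the matched upper bound
df['place'] = df['place'].where(df['address'] <= df['upperbound_address'])
df = df.drop(['lowerbound_address', 'upperbound_address'], axis=1).sort_values('id')
print (df)
   id   address place
0   1  11123451     Y
1   2  78947591     X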

Related

Python pandas how to scan string contains by row?

How do you scan if a pandas dataframe row contains a certain substring?
For example, I have a dataframe with 11 columns, and all the columns contain names:
ID name1 name2 name3 ... name10
-------------------------------------------------------
AA AA_balls AA_cakee1 AA_lavender ... AA_purple
AD AD_cakee AD_cats AD_webss ... AD_ballss
CS CS_cakee CS_cats CS_webss ... CS_purble
.
.
.
I would like to get the rows which contain, say, "ball" anywhere in the dataframe and get the ID,
so the result would be ID 'AA' and ID 'AD', since AA_balls and AD_ballss are in those rows.
I have searched on Google but there seems to be no specific result for this.
People usually ask about searching for a substring in a specific column, not across all columns (a single row):
df[df["col_name"].str.contains("ball")]
The methods I have thought of are as follows (you can skip this if you have little time):
(1) Loop through the columns:
for col_name in col_names:
    df.append(df[df[col_name].str.contains('ball')])
and then drop duplicate rows which have the same ID values, but this method would be very slow.
(2) Reshape the dataframe into a 2-column dataframe by appending the name2-name10 columns into one column, then use df[df["concat_col"].str.contains("ball")]["ID"] to get the IDs and drop duplicates:
ID concat_col
AA AA_balls
AA AA_cakeee
AA AA_lavender
AA AA_purple
.
.
.
CS CS_purble
(3) Use the dataframe from (2) to make a dictionary where
dict[df["concat_col"].value] = df["ID"]
then get the IDs with
[value for key, value in programs.items() if 'ball' in key]
but in this method I need to loop through the dictionary, which is also slow.
If there is a faster method that avoids these steps, I would prefer that.
If anyone knows about this, I would appreciate it a lot if you kindly let me know :)
Thanks!
One idea is to use melt:
df = df.melt('ID')
a = df.loc[df['value'].str.contains('ball'), 'ID']
print (a)
0 AA
10 AD
Name: ID, dtype: object
Another:
df = df.set_index('ID')
a = df.index[df.applymap(lambda x: 'ball' in x).any(axis=1)]
Or:
mask = np.logical_or.reduce([df[x].str.contains('ball', regex=False) for x in df.columns])
a = df.loc[mask, 'ID']
Timings:
np.random.seed(145)
L = list('abcdefgh')
df = pd.DataFrame(np.random.choice(L, size=(4000, 10)))
df.insert(0, 'ID', np.arange(4000).astype(str))
a = np.random.randint(4000, size=15)
b = np.random.randint(1, 10, size=15)
for i, j in zip(a,b):
    df.iloc[i, j] = 'AB_ball_DE'
#print (df)
In [85]: %%timeit
...: df1 = df.melt('ID')
...: a = df1.loc[df1['value'].str.contains('ball'), 'ID']
...:
10 loops, best of 3: 24.3 ms per loop
In [86]: %%timeit
...: df.loc[np.logical_or.reduce([df[x].str.contains('ball', regex=False) for x in df.columns]), 'ID']
...:
100 loops, best of 3: 12.8 ms per loop
In [87]: %%timeit
...: df1 = df.set_index('ID')
...: df1.index[df1.applymap(lambda x: 'ball' in x).any(axis=1)]
...:
100 loops, best of 3: 11.1 ms per loop
Maybe this might work?
mask = df.apply(lambda row: row.map(str).str.contains('word').any(), axis=1)
df.loc[mask]
Disclaimer: I haven't tested this. Perhaps the .map(str) isn't necessary.
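A quick, hedged check of that idea on a small frame shaped like the question (made-up column names name1-name3), using 'ball' instead of 'word':
import pandas as pd

df = pd.DataFrame({'ID': ['AA', 'AD', 'CS'],
                   'name1': ['AA_balls', 'AD_cakee', 'CS_cakee'],
                   'name2': ['AA_cakee1', 'AD_cats', 'CS_cats'],
                   'name3': ['AA_purple', 'AD_ballss', 'CS_purble']})

# row.map(str) turns every cell into a string before .str.contains
mask = df.apply(lambda row: row.map(str).str.contains('ball').any(), axis=1)
print (df.loc[mask, 'ID'])
0    AA
1    AD
Name: ID, dtype: object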

Python Data Frame: Create New Column Based on Values in a String Column and a Float Column

I have the Python data frame shown below. The "Flag" field is the column I want to create with code.
I want to do the following:
If "Allocation Type" is Predicted AND "Activities_Counter" is greater than 10, I want to create a new column called "Flag" and label the row with 'Flag'.
Otherwise, leave the Flag value blank.
I use the following code to identify / flag where "Activities_Counter" is greater than 10... BUT I don't know how to incorporate the "Allocation Type" criteria into my code.
Flag = []
for row in df_HA_noHA_act['Activities_Counter']:
    if row >= 10:
        Flag.append('Flag')
    else:
        Flag.append('')
df_HA_noHA_act['Flag'] = Flag
Any help is greatly appreciated!
You need to add the new condition with &. It is also faster to use numpy.where:
mask = ((df_HA_noHA_act["Allocation Type"] == 'Predicted') &
        (df_HA_noHA_act['Activities_Counter'] >= 10))
df_HA_noHA_act['Flag'] = np.where(mask, 'Flag', '')
df_HA_noHA_act = pd.DataFrame({'Activities_Counter':[10,2,6,15,11,18],
'Allocation Type':['Historical','Historical','Predicted',
'Predicted','Predicted','Historical']})
print (df_HA_noHA_act)
Activities_Counter Allocation Type
0 10 Historical
1 2 Historical
2 6 Predicted
3 15 Predicted
4 11 Predicted
5 18 Historical
mask = ((df_HA_noHA_act["Allocation Type"] == 'Predicted') &
        (df_HA_noHA_act['Activities_Counter'] >= 10))
df_HA_noHA_act['Flag'] = np.where(mask, 'Flag', '')
print (df_HA_noHA_act)
Activities_Counter Allocation Type Flag
0 10 Historical
1 2 Historical
2 6 Predicted
3 15 Predicted Flag
4 11 Predicted Flag
5 18 Historical
A slower loop solution:
Flag = []
for i, row in df_HA_noHA_act.iterrows():
    if (row['Activities_Counter'] >= 10) and (row["Allocation Type"] == 'Predicted'):
        Flag.append('Flag')
    else:
        Flag.append('')

df_HA_noHA_act['Flag'] = Flag
print (df_HA_noHA_act)
Activities_Counter Allocation Type Flag
0 10 Historical
1 2 Historical
2 6 Predicted
3 15 Predicted Flag
4 11 Predicted Flag
5 18 Historical
Timings:
df_HA_noHA_act = pd.DataFrame({'Activities_Counter':[10,2,6,15,11,18],
'Allocation Type':['Historical','Historical','Predicted',
'Predicted','Predicted','Historical']})
df_HA_noHA_act = pd.concat([df_HA_noHA_act]*1000).reset_index(drop=True)
print (df_HA_noHA_act)
#[6000 rows x 2 columns]
In [187]: %%timeit
...: df_HA_noHA_act['Flag1'] = np.where((df_HA_noHA_act["Allocation Type"] == 'Predicted') & (df_HA_noHA_act['Activities_Counter'] >= 10), 'Flag', '')
...:
100 loops, best of 3: 1.89 ms per loop
In [188]: %%timeit
...: Flag = []
...: for i, row in df_HA_noHA_act.iterrows():
...:     if (row['Activities_Counter'] >= 10) and (row["Allocation Type"] == 'Predicted'):
...:         Flag.append('Flag')
...:     else:
...:         Flag.append('')
...: df_HA_noHA_act['Flag'] = Flag
...:
1 loop, best of 3: 381 ms per loop

Create df or other array that counts entries from another df meeting specific criteria

I have a current df containing entries like this:
date tags ease
0 'date1' 'tag1' 1
1 'date1' 'tag1' 2
2 'date1' 'tag1' 1
3 'date1' 'tag2' 2
4 'date1' 'tag2' 2
5 'date2' 'tag1' 3
6 'date2' 'tag1' 1
7 'date2' 'tag2' 1
8 'date2' 'tag3' 1
I'd like to create a df (or some other type of array, if there is a better way to go about this; I'm green to Python and welcome suggestions) that counts the number of times a specific tag has a specific ease for each date in the df. For example, if I wanted to count the number of times each tag has an ease of 1, it would look something like this:
date1 date2
tag1 2 1
tag2 1 2
tag3 0 1
I can think of ways to do this using a loop, but the final outputs are going to be about 700 x 800 and I need to make one for each "ease." I feel like there must be an efficient way to do this using indexing, hence why I looked first to pandas. As I mentioned, I'm very new to Python, so if there are alternate approaches or packages I should consider using, I'm open to it.
I think you need boolean indexing with crosstab:
df1 = df[df['ease'] == 1]
df = pd.crosstab(df1['tags'], df1['date'])
print (df)
date 'date1' 'date2'
tags
'tag1' 2 1
'tag2' 0 1
'tag3' 0 1
Another solution uses groupby with size instead of crosstab, and unstack for reshaping:
df = df[df['ease'] == 1].groupby(["date", "tags"]).size().unstack(level=0, fill_value=0)
print (df)
date 'date1' 'date2'
tags
'tag1' 2 1
'tag2' 0 1
'tag3' 0 1
EDIT:
After testing the solution, I realized it is necessary to add reindex and sort_index, because filtering out the non-1 values can drop tags and dates from the final DataFrame.
print (df[df['ease'] == 1].groupby(["date", "tags"])
.size()
.unstack(level=0, fill_value=0)
.reindex(index=df.tags.unique(), columns=df.date.unique(), fill_value=0)
.sort_index()
.sort_index(axis=1))
And also the second solution:
df1 = df[df['ease'] == 1]
df2 = (pd.crosstab(df1['tags'], df1['date'])
         .reindex(index=df.tags.unique(), columns=df.date.unique(), fill_value=0)
         .sort_index()
         .sort_index(axis=1))
Timings:
(Psidom's second solution is wrong for a general df, so I omit it from the timings)
np.random.seed(123)
N = 10000
dates = pd.date_range('2017-01-01', periods=100)
tags = ['tag' + str(i) for i in range(100)]
ease = range(10)
df = pd.DataFrame({'date':np.random.choice(dates, N),
'tags': np.random.choice(tags, N),
'ease': np.random.choice(ease, N)})
df = df.reindex_axis(['date','tags','ease'], axis=1)
#[10000 rows x 3 columns]
#print (df)
print (df.groupby(["date", "tags"]).agg({"ease": lambda x: (x == 1).sum()}).ease.unstack(level=0).fillna(0))
print (df[df['ease'] == 1].groupby(["date", "tags"]).size().unstack(level=0, fill_value=0).reindex(index=df.tags.unique(), columns=df.date.unique(), fill_value=0).sort_index().sort_index(axis=1))
def jez(df):
    df1 = df[df['ease'] == 1]
    return pd.crosstab(df1['tags'], df1['date']).reindex(index=df.tags.unique(), columns=df.date.unique(), fill_value=0).sort_index().sort_index(axis=1)
print (jez(df))
#Psidom solution
In [56]: %timeit (df.groupby(["date", "tags"]).agg({"ease": lambda x: (x == 1).sum()}).ease.unstack(level=0).fillna(0))
1 loop, best of 3: 1.94 s per loop
In [57]: %timeit (df[df['ease'] == 1].groupby(["date", "tags"]).size().unstack(level=0, fill_value=0).reindex(index=df.tags.unique(), columns=df.date.unique(), fill_value=0).sort_index().sort_index(axis=1))
100 loops, best of 3: 5.74 ms per loop
In [58]: %timeit (jez(df))
10 loops, best of 3: 54.5 ms per loop
Here is one option: use groupby.agg to calculate the count, and then unstack the result to wide format:
(df.groupby(["date", "tags"])
.agg({"ease": lambda x: (x == 1).sum()})
.ease.unstack(level=0).fillna(0))
Or if you like to use crosstab:
pd.crosstab(df.tags, df.date, df.ease == 1, aggfunc="sum").fillna(0)
# date 'date1' 'date2'
# tags
#'tag1' 2.0 1.0
#'tag2' 0.0 1.0
#'tag3' 0.0 1.0
You could look at using the pivot_table method on the DataFrame with a function of your own that only counts when the condition you want is true. This should then also populate tags and dates where there is no data with a 0. Something like:
def calc(column):
    total = 0
    for e in column:
        if e == 1:
            total += 1
    return total

check_res = df.pivot_table(index='tags',columns='date', values='ease', aggfunc=calc, fill_value=0)
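As a small follow-up (a sketch, not tested against the exact data above), the counting function can also be passed inline as a lambda, which avoids defining calc separately:
# same pivot_table idea, with the ease == 1 count written as a lambda
check_res = df.pivot_table(index='tags', columns='date', values='ease',
                           aggfunc=lambda x: (x == 1).sum(), fill_value=0)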

How can I create many columns after a list in pandas?

I have a dataframe and I want to create a lot of new columns from a list, filled with 0. How can I do it?
For example:
df = pd.DataFrame({"a":["computer", "printer"]})
print(df)
>>> a
>>>0 computer
>>>1 printer
I have a list
myList=["b","c","d"]
I want my new dataframe looks like:
>>> a b c d
>>>0 computer 0 0 0
>>>1 printer 0 0 0
How can I do it?
Use the fastest solution:
for col in myList:
    df[col] = 0
print(df)
a b c d
0 computer 0 0 0
1 printer 0 0 0
Another solution is to use concat with the DataFrame constructor:
pd.concat([df3,pd.DataFrame(columns=myList, index=df.index, data=0)], axis=1)
Timings:
[20000 rows x 300 columns]:
In [286]: %timeit pd.concat([df,pd.DataFrame(columns=myList)], axis=1).fillna(0)
1 loop, best of 3: 1.17 s per loop
In [287]: %timeit pd.concat([df3,pd.DataFrame(columns=myList, index=df.index,data=0)],axis=1)
10 loops, best of 3: 81.7 ms per loop
In [288]: %timeit (orig(df4))
10 loops, best of 3: 59.2 ms per loop
Code for timings:
myList=["b","c","d"] * 100
df = pd.DataFrame({"a":["computer", "printer"]})
print(df)
df = pd.concat([df]*10000).reset_index(drop=True)
df3 = df.copy()
df4 = df.copy()
df1= pd.concat([df,pd.DataFrame(columns=myList)], axis=1).fillna(0)
df2 = pd.concat([df3,pd.DataFrame(columns=myList, index=df.index, data=0)], axis=1)
print(df1)
print(df2)
def orig(df):
    for col in range(300):
        df[col] = 0
    return df
print (orig(df4))
It'll be more performant to concat an empty df for large dfs rather than incrementally adding new columns, as adding columns one at a time grows the df incrementally instead of making a single allocation with the final df dimensions:
In [116]:
myList=["b","c","d"]
df = pd.concat([df,pd.DataFrame(columns=myList)], axis=1).fillna(0)
df
Out[116]:
a b c d
0 computer 0 0 0
1 printer 0 0 0
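Another way to get a single allocation of the new columns (a sketch, assuming the same df and myList as above) is reindex with fill_value:
# new columns from myList are created in one shot and filled with 0
df = df.reindex(columns=df.columns.tolist() + myList, fill_value=0)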

quickly drop dataframe columns with only one distinct value

Is there a faster way to drop columns that only contain one distinct value than the code below?
cols=df.columns.tolist()
for col in cols:
    if len(set(df[col].tolist()))<2:
        df=df.drop(col, axis=1)
This is really quite slow for large dataframes. Logically, this counts every distinct value in each column, when in fact it could just stop counting after reaching 2 different values.
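For reference, a minimal sketch of that early-exit idea (plain Python over each column, stopping as soon as a second distinct value shows up; it assumes the cell values are hashable, like the set-based check above):
def is_constant(series):
    # stop scanning the column as soon as a second distinct value appears
    seen = set()
    for value in series:
        seen.add(value)
        if len(seen) > 1:
            return False
    return True

df = df[[col for col in df.columns if not is_constant(df[col])]]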
You can use the Series.unique() method to find all the unique elements in a column, and drop the columns whose .unique() returns only 1 element. Example -
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)
A method that does not do inplace dropping -
res = df
for col in df.columns:
    if len(df[col].unique()) == 1:
        res = res.drop(col,axis=1)
Demo -
In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])
In [155]: for col in df.columns:
.....:     if len(df[col].unique()) == 1:
.....:         df.drop(col,inplace=True,axis=1)
.....:
In [156]: df
Out[156]:
1
0 2
1 3
2 2
Timing results -
In [166]: %paste
def func1(df):
    res = df
    for col in df.columns:
        if len(df[col].unique()) == 1:
            res = res.drop(col,axis=1)
    return res
## -- End pasted text --
In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop
In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop
In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop
The fastest method still seems to be the method using unique and looping through the columns.
One step:
df = df[[c for c in list(df) if len(df[c].unique()) > 1]]
Two steps:
Create a list of column names that have more than 1 distinct value.
keep = [c for c in list(df) if len(df[c].unique()) > 1]
Drop the columns that are not in 'keep'
df = df[keep]
Note: this step can also be done using a list of columns to drop:
drop_cols = [c for c in list(df) if df[c].nunique() <= 1]
df = df.drop(columns=drop_cols)
df.loc[:,df.apply(pd.Series.nunique) != 1]
For example
In:
df = pd.DataFrame({'A': [10, 20, np.nan, 30], 'B': [10, np.nan, 10, 10]})
df.loc[:,df.apply(pd.Series.nunique) != 1]
Out:
A
0 10
1 20
2 NaN
3 30
Two simple one-liners for either returning a view (shorter version of jz0410's answer)
df.loc[:,df.nunique()!=1]
or dropping inplace (via drop())
df.drop(columns=df.columns[df.nunique()==1], inplace=True)
You can create a mask of your df by calling apply with value_counts; for a constant column this produces NaN in every row except one. You can then call dropna column-wise and pass param thresh=2 so that there must be 2 or more non-NaN values:
In [329]:
df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
df
Out[329]:
a b c
0 1 0 0
1 1 1 0
2 1 2 2
3 1 3 2
4 1 4 2
In [342]:
df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
Out[342]:
b c
0 0 0
1 1 0
2 2 2
3 3 2
4 4 2
Intermediate outputs from the value_counts and dropna steps:
In [344]:
df.apply(pd.Series.value_counts)
Out[344]:
a b c
0 NaN 1 2
1 5 1 NaN
2 NaN 1 3
3 NaN 1 NaN
4 NaN 1 NaN
In [345]:
df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
Out[345]:
b c
0 1 2
1 1 NaN
2 1 3
3 1 NaN
4 1 NaN
Many examples in this thread and in this other thread did not work for my df. These worked:
# from: https://stackoverflow.com/questions/33144813/quickly-drop-dataframe-columns-with-only-one-distinct-value
# from: https://stackoverflow.com/questions/20209600/pandas-dataframe-remove-constant-column
import pandas as pd
import numpy as np
data = {'var1': [1,2,3,4,5,np.nan,7,8,9],
'var2':['Order',np.nan,'Inv','Order','Order','Shp','Order', 'Order','Inv'],
'var3':[101,101,101,102,102,102,103,103,np.nan],
'var4':[np.nan,1,1,1,1,1,1,1,1],
'var5':[1,1,1,1,1,1,1,1,1],
'var6':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'var7':["a","a","a","a","a","a","a","a","a"],
'var8': [1,2,3,4,5,6,7,8,9]}
df = pd.DataFrame(data)
df_original = df.copy()
#-------------------------------------------------------------------------------------------------
df2 = df[[c for c in list(df) if len(df[c].unique()) > 1]]
#-------------------------------------------------------------------------------------------------
keep = [c for c in list(df) if len(df[c].unique()) > 1]
df3 = df[keep]
#-------------------------------------------------------------------------------------------------
keep_columns = [col for col in df.columns if len(df[col].unique()) > 1]
df5 = df[keep_columns].copy()
#-------------------------------------------------------------------------------------------------
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)
I would like to throw in:
pandas 1.0.3
ids = df.nunique().values>1
df.loc[:,ids]
not that slow:
2.81 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df=df.loc[:,df.nunique()!=Numberofvalues]
None of the solutions worked in my use case because I got this error (my dataframe contains list items):
TypeError: unhashable type: 'list'
The solution that worked for me is this:
ndf = df.describe(include="all").T
new_cols = set(df.columns) - set(ndf[ndf.unique == 1].index)
df = df[list(new_cols)]
One line
df=df[[i for i in df if len(set(df[i]))>1]]
One of the solutions with pipe (convenient if used often):
def drop_unique_value_col(df):
    return df.loc[:,df.apply(pd.Series.nunique) != 1]

df.pipe(drop_unique_value_col)
This will drop all the columns with only one distinct value.
for col in Dataframe.columns:
    if len(Dataframe[col].value_counts()) == 1:
        Dataframe.drop([col], axis=1, inplace=True)
Most 'pythonic' way of doing it I could find:
df = df.loc[:, (df != df.iloc[0]).any()]
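A quick check of that one-liner on the small a/b/c frame used earlier in the thread (a sketch; it keeps any column where some value differs from the first row):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': 1, 'b': np.arange(5), 'c': [0, 0, 2, 2, 2]})
# column a is constant and gets dropped; b and c survive
print (df.loc[:, (df != df.iloc[0]).any()])
   b  c
0  0  0
1  1  0
2  2  2
3  3  2
4  4  2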
