Removing columns selectively from multilevel index dataframe - python

Say we have a dataframe like this and want to remove columns when certain conditions met.
df = pd.DataFrame(
np.arange(2, 14).reshape(-1, 4),
index=list('ABC'),
columns=pd.MultiIndex.from_arrays([
['data1', 'data2','data1','data2'],
['F', 'K','R','X'],
['C', 'D','E','E']
], names=['meter', 'Sleeper','sweeper'])
)
df
then lets say we want to remove cols only when meter == data1 and sweeper == E
so I tried
df = df.drop(('data1','E'),axis = 1)
KeyError: 'E'
second try
df.drop(('data1','E'), axis = 1, level = 2)
KeyError: "labels [('data1', 'E')] not found in level"
Pandas: drop a level from a multi-level column index?

Seems drop doesn't support selection over split levels ([0,2] here). We can create a mask with the conditions instead using get_level_values:
# keep where not ((level0 is 'data1') and (level2 is 'E'))
col_mask = ~((df.columns.get_level_values(0) == 'data1')
& (df.columns.get_level_values(2) == 'E'))
df = df.loc[:, col_mask]
We can also do this by integer location by excluding the locs that are in a particular index slice, however, this is overall less clear and less flexible:
idx = pd.IndexSlice['data1', :, 'E']
cols = [i for i in range(len(df.columns))
if i not in df.columns.get_locs(idx)]
df = df.iloc[:, cols]
Either approach produces df:
meter data1 data2
Sleeper F K X
sweeper C D E
A 2 3 5
B 6 7 9
C 10 11 13

You have to do them individually, since they are on different levels:
df.drop('data1', axis=1, level='meter').drop('E', axis = 1, level='sweeper')
Out[833]:
meter data2
Sleeper K
sweeper D
A 3
B 7
C 11

Related

python how to drop columns from dataframe [duplicate]

To delete a column in a DataFrame, I can successfully use:
del df['column_name']
But why can't I use the following?
del df.column_name
Since it is possible to access the Series via df.column_name, I expected this to work.
The best way to do this in Pandas is to use drop:
df = df.drop('column_name', axis=1)
where 1 is the axis number (0 for rows and 1 for columns.)
Or, the drop() method accepts index/columns keywords as an alternative to specifying the axis. So we can now just do:
df = df.drop(columns=['column_nameA', 'column_nameB'])
This was introduced in v0.21.0 (October 27, 2017)
To delete the column without having to reassign df you can do:
df.drop('column_name', axis=1, inplace=True)
Finally, to drop by column number instead of by column label, try this to delete, e.g. the 1st, 2nd and 4th columns:
df = df.drop(df.columns[[0, 1, 3]], axis=1) # df.columns is zero-based pd.Index
Also working with "text" syntax for the columns:
df.drop(['column_nameA', 'column_nameB'], axis=1, inplace=True)
As you've guessed, the right syntax is
del df['column_name']
It's difficult to make del df.column_name work simply as the result of syntactic limitations in Python. del df[name] gets translated to df.__delitem__(name) under the covers by Python.
Use:
columns = ['Col1', 'Col2', ...]
df.drop(columns, inplace=True, axis=1)
This will delete one or more columns in-place. Note that inplace=True was added in pandas v0.13 and won't work on older versions. You'd have to assign the result back in that case:
df = df.drop(columns, axis=1)
Drop by index
Delete first, second and fourth columns:
df.drop(df.columns[[0,1,3]], axis=1, inplace=True)
Delete first column:
df.drop(df.columns[[0]], axis=1, inplace=True)
There is an optional parameter inplace so that the original
data can be modified without creating a copy.
Popped
Column selection, addition, deletion
Delete column column-name:
df.pop('column-name')
Examples:
df = DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6]), ('C', [7,8, 9])], orient='index', columns=['one', 'two', 'three'])
print df:
one two three
A 1 2 3
B 4 5 6
C 7 8 9
df.drop(df.columns[[0]], axis=1, inplace=True)
print df:
two three
A 2 3
B 5 6
C 8 9
three = df.pop('three')
print df:
two
A 2
B 5
C 8
The actual question posed, missed by most answers here is:
Why can't I use del df.column_name?
At first we need to understand the problem, which requires us to dive into Python magic methods.
As Wes points out in his answer, del df['column'] maps to the Python magic method df.__delitem__('column') which is implemented in Pandas to drop the column.
However, as pointed out in the link above about Python magic methods:
In fact, __del__ should almost never be used because of the precarious circumstances under which it is called; use it with caution!
You could argue that del df['column_name'] should not be used or encouraged, and thereby del df.column_name should not even be considered.
However, in theory, del df.column_name could be implemented to work in Pandas using the magic method __delattr__. This does however introduce certain problems, problems which the del df['column_name'] implementation already has, but to a lesser degree.
Example Problem
What if I define a column in a dataframe called "dtypes" or "columns"?
Then assume I want to delete these columns.
del df.dtypes would make the __delattr__ method confused as if it should delete the "dtypes" attribute or the "dtypes" column.
Architectural questions behind this problem
Is a dataframe a collection of columns?
Is a dataframe a collection of rows?
Is a column an attribute of a dataframe?
Pandas answers:
Yes, in all ways
No, but if you want it to be, you can use the .ix, .loc or .iloc methods.
Maybe, do you want to read data? Then yes, unless the name of the attribute is already taken by another attribute belonging to the dataframe. Do you want to modify data? Then no.
TLDR;
You cannot do del df.column_name, because Pandas has a quite wildly grown architecture that needs to be reconsidered in order for this kind of cognitive dissonance not to occur to its users.
Pro tip:
Don't use df.column_name. It may be pretty, but it causes cognitive dissonance.
Zen of Python quotes that fits in here:
There are multiple ways of deleting a column.
There should be one-- and preferably only one --obvious way to do it.
Columns are sometimes attributes but sometimes not.
Special cases aren't special enough to break the rules.
Does del df.dtypes delete the dtypes attribute or the dtypes column?
In the face of ambiguity, refuse the temptation to guess.
A nice addition is the ability to drop columns only if they exist. This way you can cover more use cases, and it will only drop the existing columns from the labels passed to it:
Simply add errors='ignore', for example.:
df.drop(['col_name_1', 'col_name_2', ..., 'col_name_N'], inplace=True, axis=1, errors='ignore')
This is new from pandas 0.16.1 onward. Documentation is here.
From version 0.16.1, you can do
df.drop(['column_name'], axis = 1, inplace = True, errors = 'ignore')
It's good practice to always use the [] notation. One reason is that attribute notation (df.column_name) does not work for numbered indices:
In [1]: df = DataFrame([[1, 2, 3], [4, 5, 6]])
In [2]: df[1]
Out[2]:
0 2
1 5
Name: 1
In [3]: df.1
File "<ipython-input-3-e4803c0d1066>", line 1
df.1
^
SyntaxError: invalid syntax
Pandas 0.21+ answer
Pandas version 0.21 has changed the drop method slightly to include both the index and columns parameters to match the signature of the rename and reindex methods.
df.drop(columns=['column_a', 'column_c'])
Personally, I prefer using the axis parameter to denote columns or index because it is the predominant keyword parameter used in nearly all pandas methods. But, now you have some added choices in version 0.21.
In Pandas 0.16.1+, you can drop columns only if they exist per the solution posted by eiTan LaVi. Prior to that version, you can achieve the same result via a conditional list comprehension:
df.drop([col for col in ['col_name_1','col_name_2',...,'col_name_N'] if col in df],
axis=1, inplace=True)
Use:
df.drop('columnname', axis =1, inplace = True)
Or else you can go with
del df['colname']
To delete multiple columns based on column numbers
df.drop(df.iloc[:,1:3], axis = 1, inplace = True)
To delete multiple columns based on columns names
df.drop(['col1','col2',..'coln'], axis = 1, inplace = True)
TL;DR
A lot of effort to find a marginally more efficient solution. Difficult to justify the added complexity while sacrificing the simplicity of df.drop(dlst, 1, errors='ignore')
df.reindex_axis(np.setdiff1d(df.columns.values, dlst), 1)
Preamble
Deleting a column is semantically the same as selecting the other columns. I'll show a few additional methods to consider.
I'll also focus on the general solution of deleting multiple columns at once and allowing for the attempt to delete columns not present.
Using these solutions are general and will work for the simple case as well.
Setup
Consider the pd.DataFrame df and list to delete dlst
df = pd.DataFrame(dict(zip('ABCDEFGHIJ', range(1, 11))), range(3))
dlst = list('HIJKLM')
df
A B C D E F G H I J
0 1 2 3 4 5 6 7 8 9 10
1 1 2 3 4 5 6 7 8 9 10
2 1 2 3 4 5 6 7 8 9 10
dlst
['H', 'I', 'J', 'K', 'L', 'M']
The result should look like:
df.drop(dlst, 1, errors='ignore')
A B C D E F G
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
Since I'm equating deleting a column to selecting the other columns, I'll break it into two types:
Label selection
Boolean selection
Label Selection
We start by manufacturing the list/array of labels that represent the columns we want to keep and without the columns we want to delete.
df.columns.difference(dlst)
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
np.setdiff1d(df.columns.values, dlst)
array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)
df.columns.drop(dlst, errors='ignore')
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
list(set(df.columns.values.tolist()).difference(dlst))
# does not preserve order
['E', 'D', 'B', 'F', 'G', 'A', 'C']
[x for x in df.columns.values.tolist() if x not in dlst]
['A', 'B', 'C', 'D', 'E', 'F', 'G']
Columns from Labels
For the sake of comparing the selection process, assume:
cols = [x for x in df.columns.values.tolist() if x not in dlst]
Then we can evaluate
df.loc[:, cols]
df[cols]
df.reindex(columns=cols)
df.reindex_axis(cols, 1)
Which all evaluate to:
A B C D E F G
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
Boolean Slice
We can construct an array/list of booleans for slicing
~df.columns.isin(dlst)
~np.in1d(df.columns.values, dlst)
[x not in dlst for x in df.columns.values.tolist()]
(df.columns.values[:, None] != dlst).all(1)
Columns from Boolean
For the sake of comparison
bools = [x not in dlst for x in df.columns.values.tolist()]
df.loc[: bools]
Which all evaluate to:
A B C D E F G
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
Robust Timing
Functions
setdiff1d = lambda df, dlst: np.setdiff1d(df.columns.values, dlst)
difference = lambda df, dlst: df.columns.difference(dlst)
columndrop = lambda df, dlst: df.columns.drop(dlst, errors='ignore')
setdifflst = lambda df, dlst: list(set(df.columns.values.tolist()).difference(dlst))
comprehension = lambda df, dlst: [x for x in df.columns.values.tolist() if x not in dlst]
loc = lambda df, cols: df.loc[:, cols]
slc = lambda df, cols: df[cols]
ridx = lambda df, cols: df.reindex(columns=cols)
ridxa = lambda df, cols: df.reindex_axis(cols, 1)
isin = lambda df, dlst: ~df.columns.isin(dlst)
in1d = lambda df, dlst: ~np.in1d(df.columns.values, dlst)
comp = lambda df, dlst: [x not in dlst for x in df.columns.values.tolist()]
brod = lambda df, dlst: (df.columns.values[:, None] != dlst).all(1)
Testing
res1 = pd.DataFrame(
index=pd.MultiIndex.from_product([
'loc slc ridx ridxa'.split(),
'setdiff1d difference columndrop setdifflst comprehension'.split(),
], names=['Select', 'Label']),
columns=[10, 30, 100, 300, 1000],
dtype=float
)
res2 = pd.DataFrame(
index=pd.MultiIndex.from_product([
'loc'.split(),
'isin in1d comp brod'.split(),
], names=['Select', 'Label']),
columns=[10, 30, 100, 300, 1000],
dtype=float
)
res = res1.append(res2).sort_index()
dres = pd.Series(index=res.columns, name='drop')
for j in res.columns:
dlst = list(range(j))
cols = list(range(j // 2, j + j // 2))
d = pd.DataFrame(1, range(10), cols)
dres.at[j] = timeit('d.drop(dlst, 1, errors="ignore")', 'from __main__ import d, dlst', number=100)
for s, l in res.index:
stmt = '{}(d, {}(d, dlst))'.format(s, l)
setp = 'from __main__ import d, dlst, {}, {}'.format(s, l)
res.at[(s, l), j] = timeit(stmt, setp, number=100)
rs = res / dres
rs
10 30 100 300 1000
Select Label
loc brod 0.747373 0.861979 0.891144 1.284235 3.872157
columndrop 1.193983 1.292843 1.396841 1.484429 1.335733
comp 0.802036 0.732326 1.149397 3.473283 25.565922
comprehension 1.463503 1.568395 1.866441 4.421639 26.552276
difference 1.413010 1.460863 1.587594 1.568571 1.569735
in1d 0.818502 0.844374 0.994093 1.042360 1.076255
isin 1.008874 0.879706 1.021712 1.001119 0.964327
setdiff1d 1.352828 1.274061 1.483380 1.459986 1.466575
setdifflst 1.233332 1.444521 1.714199 1.797241 1.876425
ridx columndrop 0.903013 0.832814 0.949234 0.976366 0.982888
comprehension 0.777445 0.827151 1.108028 3.473164 25.528879
difference 1.086859 1.081396 1.293132 1.173044 1.237613
setdiff1d 0.946009 0.873169 0.900185 0.908194 1.036124
setdifflst 0.732964 0.823218 0.819748 0.990315 1.050910
ridxa columndrop 0.835254 0.774701 0.907105 0.908006 0.932754
comprehension 0.697749 0.762556 1.215225 3.510226 25.041832
difference 1.055099 1.010208 1.122005 1.119575 1.383065
setdiff1d 0.760716 0.725386 0.849949 0.879425 0.946460
setdifflst 0.710008 0.668108 0.778060 0.871766 0.939537
slc columndrop 1.268191 1.521264 2.646687 1.919423 1.981091
comprehension 0.856893 0.870365 1.290730 3.564219 26.208937
difference 1.470095 1.747211 2.886581 2.254690 2.050536
setdiff1d 1.098427 1.133476 1.466029 2.045965 3.123452
setdifflst 0.833700 0.846652 1.013061 1.110352 1.287831
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True)
for i, (n, g) in enumerate([(n, g.xs(n)) for n, g in rs.groupby('Select')]):
ax = axes[i // 2, i % 2]
g.plot.bar(ax=ax, title=n)
ax.legend_.remove()
fig.tight_layout()
This is relative to the time it takes to run df.drop(dlst, 1, errors='ignore'). It seems like after all that effort, we only improve performance modestly.
If fact the best solutions use reindex or reindex_axis on the hack list(set(df.columns.values.tolist()).difference(dlst)). A close second and still very marginally better than drop is np.setdiff1d.
rs.idxmin().pipe(
lambda x: pd.DataFrame(
dict(idx=x.values, val=rs.lookup(x.values, x.index)),
x.index
)
)
idx val
10 (ridx, setdifflst) 0.653431
30 (ridxa, setdifflst) 0.746143
100 (ridxa, setdifflst) 0.816207
300 (ridx, setdifflst) 0.780157
1000 (ridxa, setdifflst) 0.861622
We can remove or delete a specified column or specified columns by the drop() method.
Suppose df is a dataframe.
Column to be removed = column0
Code:
df = df.drop(column0, axis=1)
To remove multiple columns col1, col2, . . . , coln, we have to insert all the columns that needed to be removed in a list. Then remove them by the drop() method.
Code:
df = df.drop([col1, col2, . . . , coln], axis=1)
If your original dataframe df is not too big, you have no memory constraints, and you only need to keep a few columns, or, if you don't know beforehand the names of all the extra columns that you do not need, then you might as well create a new dataframe with only the columns you need:
new_df = df[['spam', 'sausage']]
Deleting a column using the iloc function of dataframe and slicing, when we have a typical column name with unwanted values:
df = df.iloc[:,1:] # Removing an unnamed index column
Here 0 is the default row and 1 is the first column, hence :,1: is our parameter for deleting the first column.
The dot syntax works in JavaScript, but not in Python.
Python: del df['column_name']
JavaScript: del df['column_name'] or del df.column_name
Another way of deleting a column in a Pandas DataFrame
If you're not looking for in-place deletion then you can create a new DataFrame by specifying the columns using DataFrame(...) function as:
my_dict = { 'name' : ['a','b','c','d'], 'age' : [10,20,25,22], 'designation' : ['CEO', 'VP', 'MD', 'CEO']}
df = pd.DataFrame(my_dict)
Create a new DataFrame as
newdf = pd.DataFrame(df, columns=['name', 'age'])
You get a result as good as what you get with del / drop.
Taking advantage by using Autocomplete or "IntelliSense" over string literals:
del df[df.column1.name]
# or
df.drop(df.column1.name, axis=1, inplace=True)
It works fine with current Pandas versions.
To remove columns before and after specific columns you can use the method truncate. For example:
A B C D E
0 1 10 100 1000 10000
1 2 20 200 2000 20000
df.truncate(before='B', after='D', axis=1)
Output:
B C D
0 10 100 1000
1 20 200 2000
Viewed from a general Python standpoint, del obj.column_name makes sense if the attribute column_name can be deleted. It needs to be a regular attribute - or a property with a defined deleter.
The reasons why this doesn't translate to Pandas, and does not make sense for Pandas Dataframes are:
Consider df.column_name to be a “virtual attribute”, it is not a thing in its own right, it is not the “seat” of that column, it's just a way to access the column. Much like a property with no deleter.

Pandas - sort_values by combining multiple columns [duplicate]

To delete a column in a DataFrame, I can successfully use:
del df['column_name']
But why can't I use the following?
del df.column_name
Since it is possible to access the Series via df.column_name, I expected this to work.
The best way to do this in Pandas is to use drop:
df = df.drop('column_name', axis=1)
where 1 is the axis number (0 for rows and 1 for columns.)
Or, the drop() method accepts index/columns keywords as an alternative to specifying the axis. So we can now just do:
df = df.drop(columns=['column_nameA', 'column_nameB'])
This was introduced in v0.21.0 (October 27, 2017)
To delete the column without having to reassign df you can do:
df.drop('column_name', axis=1, inplace=True)
Finally, to drop by column number instead of by column label, try this to delete, e.g. the 1st, 2nd and 4th columns:
df = df.drop(df.columns[[0, 1, 3]], axis=1) # df.columns is zero-based pd.Index
Also working with "text" syntax for the columns:
df.drop(['column_nameA', 'column_nameB'], axis=1, inplace=True)
As you've guessed, the right syntax is
del df['column_name']
It's difficult to make del df.column_name work simply as the result of syntactic limitations in Python. del df[name] gets translated to df.__delitem__(name) under the covers by Python.
Use:
columns = ['Col1', 'Col2', ...]
df.drop(columns, inplace=True, axis=1)
This will delete one or more columns in-place. Note that inplace=True was added in pandas v0.13 and won't work on older versions. You'd have to assign the result back in that case:
df = df.drop(columns, axis=1)
Drop by index
Delete first, second and fourth columns:
df.drop(df.columns[[0,1,3]], axis=1, inplace=True)
Delete first column:
df.drop(df.columns[[0]], axis=1, inplace=True)
There is an optional parameter inplace so that the original
data can be modified without creating a copy.
Popped
Column selection, addition, deletion
Delete column column-name:
df.pop('column-name')
Examples:
df = DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6]), ('C', [7,8, 9])], orient='index', columns=['one', 'two', 'three'])
print df:
one two three
A 1 2 3
B 4 5 6
C 7 8 9
df.drop(df.columns[[0]], axis=1, inplace=True)
print df:
two three
A 2 3
B 5 6
C 8 9
three = df.pop('three')
print df:
two
A 2
B 5
C 8
The actual question posed, missed by most answers here is:
Why can't I use del df.column_name?
At first we need to understand the problem, which requires us to dive into Python magic methods.
As Wes points out in his answer, del df['column'] maps to the Python magic method df.__delitem__('column') which is implemented in Pandas to drop the column.
However, as pointed out in the link above about Python magic methods:
In fact, __del__ should almost never be used because of the precarious circumstances under which it is called; use it with caution!
You could argue that del df['column_name'] should not be used or encouraged, and thereby del df.column_name should not even be considered.
However, in theory, del df.column_name could be implemented to work in Pandas using the magic method __delattr__. This does however introduce certain problems, problems which the del df['column_name'] implementation already has, but to a lesser degree.
Example Problem
What if I define a column in a dataframe called "dtypes" or "columns"?
Then assume I want to delete these columns.
del df.dtypes would make the __delattr__ method confused as if it should delete the "dtypes" attribute or the "dtypes" column.
Architectural questions behind this problem
Is a dataframe a collection of columns?
Is a dataframe a collection of rows?
Is a column an attribute of a dataframe?
Pandas answers:
Yes, in all ways
No, but if you want it to be, you can use the .ix, .loc or .iloc methods.
Maybe, do you want to read data? Then yes, unless the name of the attribute is already taken by another attribute belonging to the dataframe. Do you want to modify data? Then no.
TLDR;
You cannot do del df.column_name, because Pandas has a quite wildly grown architecture that needs to be reconsidered in order for this kind of cognitive dissonance not to occur to its users.
Pro tip:
Don't use df.column_name. It may be pretty, but it causes cognitive dissonance.
Zen of Python quotes that fits in here:
There are multiple ways of deleting a column.
There should be one-- and preferably only one --obvious way to do it.
Columns are sometimes attributes but sometimes not.
Special cases aren't special enough to break the rules.
Does del df.dtypes delete the dtypes attribute or the dtypes column?
In the face of ambiguity, refuse the temptation to guess.
A nice addition is the ability to drop columns only if they exist. This way you can cover more use cases, and it will only drop the existing columns from the labels passed to it:
Simply add errors='ignore', for example.:
df.drop(['col_name_1', 'col_name_2', ..., 'col_name_N'], inplace=True, axis=1, errors='ignore')
This is new from pandas 0.16.1 onward. Documentation is here.
From version 0.16.1, you can do
df.drop(['column_name'], axis = 1, inplace = True, errors = 'ignore')
It's good practice to always use the [] notation. One reason is that attribute notation (df.column_name) does not work for numbered indices:
In [1]: df = DataFrame([[1, 2, 3], [4, 5, 6]])
In [2]: df[1]
Out[2]:
0 2
1 5
Name: 1
In [3]: df.1
File "<ipython-input-3-e4803c0d1066>", line 1
df.1
^
SyntaxError: invalid syntax
Pandas 0.21+ answer
Pandas version 0.21 has changed the drop method slightly to include both the index and columns parameters to match the signature of the rename and reindex methods.
df.drop(columns=['column_a', 'column_c'])
Personally, I prefer using the axis parameter to denote columns or index because it is the predominant keyword parameter used in nearly all pandas methods. But, now you have some added choices in version 0.21.
In Pandas 0.16.1+, you can drop columns only if they exist per the solution posted by eiTan LaVi. Prior to that version, you can achieve the same result via a conditional list comprehension:
df.drop([col for col in ['col_name_1','col_name_2',...,'col_name_N'] if col in df],
axis=1, inplace=True)
Use:
df.drop('columnname', axis =1, inplace = True)
Or else you can go with
del df['colname']
To delete multiple columns based on column numbers
df.drop(df.iloc[:,1:3], axis = 1, inplace = True)
To delete multiple columns based on columns names
df.drop(['col1','col2',..'coln'], axis = 1, inplace = True)
TL;DR
A lot of effort to find a marginally more efficient solution. Difficult to justify the added complexity while sacrificing the simplicity of df.drop(dlst, 1, errors='ignore')
df.reindex_axis(np.setdiff1d(df.columns.values, dlst), 1)
Preamble
Deleting a column is semantically the same as selecting the other columns. I'll show a few additional methods to consider.
I'll also focus on the general solution of deleting multiple columns at once and allowing for the attempt to delete columns not present.
Using these solutions are general and will work for the simple case as well.
Setup
Consider the pd.DataFrame df and list to delete dlst
df = pd.DataFrame(dict(zip('ABCDEFGHIJ', range(1, 11))), range(3))
dlst = list('HIJKLM')
df
A B C D E F G H I J
0 1 2 3 4 5 6 7 8 9 10
1 1 2 3 4 5 6 7 8 9 10
2 1 2 3 4 5 6 7 8 9 10
dlst
['H', 'I', 'J', 'K', 'L', 'M']
The result should look like:
df.drop(dlst, 1, errors='ignore')
A B C D E F G
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
Since I'm equating deleting a column to selecting the other columns, I'll break it into two types:
Label selection
Boolean selection
Label Selection
We start by manufacturing the list/array of labels that represent the columns we want to keep and without the columns we want to delete.
df.columns.difference(dlst)
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
np.setdiff1d(df.columns.values, dlst)
array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)
df.columns.drop(dlst, errors='ignore')
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
list(set(df.columns.values.tolist()).difference(dlst))
# does not preserve order
['E', 'D', 'B', 'F', 'G', 'A', 'C']
[x for x in df.columns.values.tolist() if x not in dlst]
['A', 'B', 'C', 'D', 'E', 'F', 'G']
Columns from Labels
For the sake of comparing the selection process, assume:
cols = [x for x in df.columns.values.tolist() if x not in dlst]
Then we can evaluate
df.loc[:, cols]
df[cols]
df.reindex(columns=cols)
df.reindex_axis(cols, 1)
Which all evaluate to:
A B C D E F G
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
Boolean Slice
We can construct an array/list of booleans for slicing
~df.columns.isin(dlst)
~np.in1d(df.columns.values, dlst)
[x not in dlst for x in df.columns.values.tolist()]
(df.columns.values[:, None] != dlst).all(1)
Columns from Boolean
For the sake of comparison
bools = [x not in dlst for x in df.columns.values.tolist()]
df.loc[: bools]
Which all evaluate to:
A B C D E F G
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
Robust Timing
Functions
setdiff1d = lambda df, dlst: np.setdiff1d(df.columns.values, dlst)
difference = lambda df, dlst: df.columns.difference(dlst)
columndrop = lambda df, dlst: df.columns.drop(dlst, errors='ignore')
setdifflst = lambda df, dlst: list(set(df.columns.values.tolist()).difference(dlst))
comprehension = lambda df, dlst: [x for x in df.columns.values.tolist() if x not in dlst]
loc = lambda df, cols: df.loc[:, cols]
slc = lambda df, cols: df[cols]
ridx = lambda df, cols: df.reindex(columns=cols)
ridxa = lambda df, cols: df.reindex_axis(cols, 1)
isin = lambda df, dlst: ~df.columns.isin(dlst)
in1d = lambda df, dlst: ~np.in1d(df.columns.values, dlst)
comp = lambda df, dlst: [x not in dlst for x in df.columns.values.tolist()]
brod = lambda df, dlst: (df.columns.values[:, None] != dlst).all(1)
Testing
res1 = pd.DataFrame(
index=pd.MultiIndex.from_product([
'loc slc ridx ridxa'.split(),
'setdiff1d difference columndrop setdifflst comprehension'.split(),
], names=['Select', 'Label']),
columns=[10, 30, 100, 300, 1000],
dtype=float
)
res2 = pd.DataFrame(
index=pd.MultiIndex.from_product([
'loc'.split(),
'isin in1d comp brod'.split(),
], names=['Select', 'Label']),
columns=[10, 30, 100, 300, 1000],
dtype=float
)
res = res1.append(res2).sort_index()
dres = pd.Series(index=res.columns, name='drop')
for j in res.columns:
dlst = list(range(j))
cols = list(range(j // 2, j + j // 2))
d = pd.DataFrame(1, range(10), cols)
dres.at[j] = timeit('d.drop(dlst, 1, errors="ignore")', 'from __main__ import d, dlst', number=100)
for s, l in res.index:
stmt = '{}(d, {}(d, dlst))'.format(s, l)
setp = 'from __main__ import d, dlst, {}, {}'.format(s, l)
res.at[(s, l), j] = timeit(stmt, setp, number=100)
rs = res / dres
rs
10 30 100 300 1000
Select Label
loc brod 0.747373 0.861979 0.891144 1.284235 3.872157
columndrop 1.193983 1.292843 1.396841 1.484429 1.335733
comp 0.802036 0.732326 1.149397 3.473283 25.565922
comprehension 1.463503 1.568395 1.866441 4.421639 26.552276
difference 1.413010 1.460863 1.587594 1.568571 1.569735
in1d 0.818502 0.844374 0.994093 1.042360 1.076255
isin 1.008874 0.879706 1.021712 1.001119 0.964327
setdiff1d 1.352828 1.274061 1.483380 1.459986 1.466575
setdifflst 1.233332 1.444521 1.714199 1.797241 1.876425
ridx columndrop 0.903013 0.832814 0.949234 0.976366 0.982888
comprehension 0.777445 0.827151 1.108028 3.473164 25.528879
difference 1.086859 1.081396 1.293132 1.173044 1.237613
setdiff1d 0.946009 0.873169 0.900185 0.908194 1.036124
setdifflst 0.732964 0.823218 0.819748 0.990315 1.050910
ridxa columndrop 0.835254 0.774701 0.907105 0.908006 0.932754
comprehension 0.697749 0.762556 1.215225 3.510226 25.041832
difference 1.055099 1.010208 1.122005 1.119575 1.383065
setdiff1d 0.760716 0.725386 0.849949 0.879425 0.946460
setdifflst 0.710008 0.668108 0.778060 0.871766 0.939537
slc columndrop 1.268191 1.521264 2.646687 1.919423 1.981091
comprehension 0.856893 0.870365 1.290730 3.564219 26.208937
difference 1.470095 1.747211 2.886581 2.254690 2.050536
setdiff1d 1.098427 1.133476 1.466029 2.045965 3.123452
setdifflst 0.833700 0.846652 1.013061 1.110352 1.287831
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True)
for i, (n, g) in enumerate([(n, g.xs(n)) for n, g in rs.groupby('Select')]):
ax = axes[i // 2, i % 2]
g.plot.bar(ax=ax, title=n)
ax.legend_.remove()
fig.tight_layout()
This is relative to the time it takes to run df.drop(dlst, 1, errors='ignore'). It seems like after all that effort, we only improve performance modestly.
If fact the best solutions use reindex or reindex_axis on the hack list(set(df.columns.values.tolist()).difference(dlst)). A close second and still very marginally better than drop is np.setdiff1d.
rs.idxmin().pipe(
lambda x: pd.DataFrame(
dict(idx=x.values, val=rs.lookup(x.values, x.index)),
x.index
)
)
idx val
10 (ridx, setdifflst) 0.653431
30 (ridxa, setdifflst) 0.746143
100 (ridxa, setdifflst) 0.816207
300 (ridx, setdifflst) 0.780157
1000 (ridxa, setdifflst) 0.861622
We can remove or delete a specified column or specified columns by the drop() method.
Suppose df is a dataframe.
Column to be removed = column0
Code:
df = df.drop(column0, axis=1)
To remove multiple columns col1, col2, . . . , coln, we have to insert all the columns that needed to be removed in a list. Then remove them by the drop() method.
Code:
df = df.drop([col1, col2, . . . , coln], axis=1)
If your original dataframe df is not too big, you have no memory constraints, and you only need to keep a few columns, or, if you don't know beforehand the names of all the extra columns that you do not need, then you might as well create a new dataframe with only the columns you need:
new_df = df[['spam', 'sausage']]
Deleting a column using the iloc function of dataframe and slicing, when we have a typical column name with unwanted values:
df = df.iloc[:,1:] # Removing an unnamed index column
Here 0 is the default row and 1 is the first column, hence :,1: is our parameter for deleting the first column.
The dot syntax works in JavaScript, but not in Python.
Python: del df['column_name']
JavaScript: del df['column_name'] or del df.column_name
Another way of deleting a column in a Pandas DataFrame
If you're not looking for in-place deletion then you can create a new DataFrame by specifying the columns using DataFrame(...) function as:
my_dict = { 'name' : ['a','b','c','d'], 'age' : [10,20,25,22], 'designation' : ['CEO', 'VP', 'MD', 'CEO']}
df = pd.DataFrame(my_dict)
Create a new DataFrame as
newdf = pd.DataFrame(df, columns=['name', 'age'])
You get a result as good as what you get with del / drop.
Taking advantage by using Autocomplete or "IntelliSense" over string literals:
del df[df.column1.name]
# or
df.drop(df.column1.name, axis=1, inplace=True)
It works fine with current Pandas versions.
To remove columns before and after specific columns you can use the method truncate. For example:
A B C D E
0 1 10 100 1000 10000
1 2 20 200 2000 20000
df.truncate(before='B', after='D', axis=1)
Output:
B C D
0 10 100 1000
1 20 200 2000
Viewed from a general Python standpoint, del obj.column_name makes sense if the attribute column_name can be deleted. It needs to be a regular attribute - or a property with a defined deleter.
The reasons why this doesn't translate to Pandas, and does not make sense for Pandas Dataframes are:
Consider df.column_name to be a “virtual attribute”, it is not a thing in its own right, it is not the “seat” of that column, it's just a way to access the column. Much like a property with no deleter.

Create a dataframe from a dataframe by not extracting certain columns [duplicate]

To delete a column in a DataFrame, I can successfully use:
del df['column_name']
But why can't I use the following?
del df.column_name
Since it is possible to access the Series via df.column_name, I expected this to work.
The best way to do this in Pandas is to use drop:
df = df.drop('column_name', axis=1)
where 1 is the axis number (0 for rows and 1 for columns.)
Or, the drop() method accepts index/columns keywords as an alternative to specifying the axis. So we can now just do:
df = df.drop(columns=['column_nameA', 'column_nameB'])
This was introduced in v0.21.0 (October 27, 2017)
To delete the column without having to reassign df you can do:
df.drop('column_name', axis=1, inplace=True)
Finally, to drop by column number instead of by column label, try this to delete, e.g. the 1st, 2nd and 4th columns:
df = df.drop(df.columns[[0, 1, 3]], axis=1) # df.columns is zero-based pd.Index
Also working with "text" syntax for the columns:
df.drop(['column_nameA', 'column_nameB'], axis=1, inplace=True)
As you've guessed, the right syntax is
del df['column_name']
It's difficult to make del df.column_name work simply as the result of syntactic limitations in Python. del df[name] gets translated to df.__delitem__(name) under the covers by Python.
Use:
columns = ['Col1', 'Col2', ...]
df.drop(columns, inplace=True, axis=1)
This will delete one or more columns in-place. Note that inplace=True was added in pandas v0.13 and won't work on older versions. You'd have to assign the result back in that case:
df = df.drop(columns, axis=1)
Drop by index
Delete first, second and fourth columns:
df.drop(df.columns[[0,1,3]], axis=1, inplace=True)
Delete first column:
df.drop(df.columns[[0]], axis=1, inplace=True)
There is an optional parameter inplace so that the original
data can be modified without creating a copy.
Popped
Column selection, addition, deletion
Delete column column-name:
df.pop('column-name')
Examples:
df = DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6]), ('C', [7,8, 9])], orient='index', columns=['one', 'two', 'three'])
print df:
one two three
A 1 2 3
B 4 5 6
C 7 8 9
df.drop(df.columns[[0]], axis=1, inplace=True)
print df:
two three
A 2 3
B 5 6
C 8 9
three = df.pop('three')
print df:
two
A 2
B 5
C 8
The actual question posed, missed by most answers here is:
Why can't I use del df.column_name?
At first we need to understand the problem, which requires us to dive into Python magic methods.
As Wes points out in his answer, del df['column'] maps to the Python magic method df.__delitem__('column') which is implemented in Pandas to drop the column.
However, as pointed out in the link above about Python magic methods:
In fact, __del__ should almost never be used because of the precarious circumstances under which it is called; use it with caution!
You could argue that del df['column_name'] should not be used or encouraged, and thereby del df.column_name should not even be considered.
However, in theory, del df.column_name could be implemented to work in Pandas using the magic method __delattr__. This does however introduce certain problems, problems which the del df['column_name'] implementation already has, but to a lesser degree.
Example Problem
What if I define a column in a dataframe called "dtypes" or "columns"?
Then assume I want to delete these columns.
del df.dtypes would make the __delattr__ method confused as if it should delete the "dtypes" attribute or the "dtypes" column.
Architectural questions behind this problem
Is a dataframe a collection of columns?
Is a dataframe a collection of rows?
Is a column an attribute of a dataframe?
Pandas answers:
Yes, in all ways
No, but if you want it to be, you can use the .ix, .loc or .iloc methods.
Maybe, do you want to read data? Then yes, unless the name of the attribute is already taken by another attribute belonging to the dataframe. Do you want to modify data? Then no.
TLDR;
You cannot do del df.column_name, because Pandas has a quite wildly grown architecture that needs to be reconsidered in order for this kind of cognitive dissonance not to occur to its users.
Pro tip:
Don't use df.column_name. It may be pretty, but it causes cognitive dissonance.
Zen of Python quotes that fits in here:
There are multiple ways of deleting a column.
There should be one-- and preferably only one --obvious way to do it.
Columns are sometimes attributes but sometimes not.
Special cases aren't special enough to break the rules.
Does del df.dtypes delete the dtypes attribute or the dtypes column?
In the face of ambiguity, refuse the temptation to guess.
A nice addition is the ability to drop columns only if they exist. This way you can cover more use cases, and it will only drop the existing columns from the labels passed to it:
Simply add errors='ignore', for example.:
df.drop(['col_name_1', 'col_name_2', ..., 'col_name_N'], inplace=True, axis=1, errors='ignore')
This is new from pandas 0.16.1 onward. Documentation is here.
From version 0.16.1, you can do
df.drop(['column_name'], axis = 1, inplace = True, errors = 'ignore')
It's good practice to always use the [] notation. One reason is that attribute notation (df.column_name) does not work for numbered indices:
In [1]: df = DataFrame([[1, 2, 3], [4, 5, 6]])
In [2]: df[1]
Out[2]:
0 2
1 5
Name: 1
In [3]: df.1
File "<ipython-input-3-e4803c0d1066>", line 1
df.1
^
SyntaxError: invalid syntax
Pandas 0.21+ answer
Pandas version 0.21 has changed the drop method slightly to include both the index and columns parameters to match the signature of the rename and reindex methods.
df.drop(columns=['column_a', 'column_c'])
Personally, I prefer using the axis parameter to denote columns or index because it is the predominant keyword parameter used in nearly all pandas methods. But, now you have some added choices in version 0.21.
In Pandas 0.16.1+, you can drop columns only if they exist per the solution posted by eiTan LaVi. Prior to that version, you can achieve the same result via a conditional list comprehension:
df.drop([col for col in ['col_name_1','col_name_2',...,'col_name_N'] if col in df],
axis=1, inplace=True)
Use:
df.drop('columnname', axis =1, inplace = True)
Or else you can go with
del df['colname']
To delete multiple columns based on column numbers
df.drop(df.iloc[:,1:3], axis = 1, inplace = True)
To delete multiple columns based on columns names
df.drop(['col1','col2',..'coln'], axis = 1, inplace = True)
TL;DR
A lot of effort to find a marginally more efficient solution. Difficult to justify the added complexity while sacrificing the simplicity of df.drop(dlst, 1, errors='ignore')
df.reindex_axis(np.setdiff1d(df.columns.values, dlst), 1)
Preamble
Deleting a column is semantically the same as selecting the other columns. I'll show a few additional methods to consider.
I'll also focus on the general solution of deleting multiple columns at once and allowing for the attempt to delete columns not present.
Using these solutions are general and will work for the simple case as well.
Setup
Consider the pd.DataFrame df and list to delete dlst
df = pd.DataFrame(dict(zip('ABCDEFGHIJ', range(1, 11))), range(3))
dlst = list('HIJKLM')
df
A B C D E F G H I J
0 1 2 3 4 5 6 7 8 9 10
1 1 2 3 4 5 6 7 8 9 10
2 1 2 3 4 5 6 7 8 9 10
dlst
['H', 'I', 'J', 'K', 'L', 'M']
The result should look like:
df.drop(dlst, 1, errors='ignore')
A B C D E F G
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
Since I'm equating deleting a column to selecting the other columns, I'll break it into two types:
Label selection
Boolean selection
Label Selection
We start by manufacturing the list/array of labels that represent the columns we want to keep and without the columns we want to delete.
df.columns.difference(dlst)
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
np.setdiff1d(df.columns.values, dlst)
array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)
df.columns.drop(dlst, errors='ignore')
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
list(set(df.columns.values.tolist()).difference(dlst))
# does not preserve order
['E', 'D', 'B', 'F', 'G', 'A', 'C']
[x for x in df.columns.values.tolist() if x not in dlst]
['A', 'B', 'C', 'D', 'E', 'F', 'G']
Columns from Labels
For the sake of comparing the selection process, assume:
cols = [x for x in df.columns.values.tolist() if x not in dlst]
Then we can evaluate
df.loc[:, cols]
df[cols]
df.reindex(columns=cols)
df.reindex_axis(cols, 1)
Which all evaluate to:
A B C D E F G
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
Boolean Slice
We can construct an array/list of booleans for slicing
~df.columns.isin(dlst)
~np.in1d(df.columns.values, dlst)
[x not in dlst for x in df.columns.values.tolist()]
(df.columns.values[:, None] != dlst).all(1)
Columns from Boolean
For the sake of comparison
bools = [x not in dlst for x in df.columns.values.tolist()]
df.loc[: bools]
Which all evaluate to:
A B C D E F G
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
Robust Timing
Functions
setdiff1d = lambda df, dlst: np.setdiff1d(df.columns.values, dlst)
difference = lambda df, dlst: df.columns.difference(dlst)
columndrop = lambda df, dlst: df.columns.drop(dlst, errors='ignore')
setdifflst = lambda df, dlst: list(set(df.columns.values.tolist()).difference(dlst))
comprehension = lambda df, dlst: [x for x in df.columns.values.tolist() if x not in dlst]
loc = lambda df, cols: df.loc[:, cols]
slc = lambda df, cols: df[cols]
ridx = lambda df, cols: df.reindex(columns=cols)
ridxa = lambda df, cols: df.reindex_axis(cols, 1)
isin = lambda df, dlst: ~df.columns.isin(dlst)
in1d = lambda df, dlst: ~np.in1d(df.columns.values, dlst)
comp = lambda df, dlst: [x not in dlst for x in df.columns.values.tolist()]
brod = lambda df, dlst: (df.columns.values[:, None] != dlst).all(1)
Testing
res1 = pd.DataFrame(
index=pd.MultiIndex.from_product([
'loc slc ridx ridxa'.split(),
'setdiff1d difference columndrop setdifflst comprehension'.split(),
], names=['Select', 'Label']),
columns=[10, 30, 100, 300, 1000],
dtype=float
)
res2 = pd.DataFrame(
index=pd.MultiIndex.from_product([
'loc'.split(),
'isin in1d comp brod'.split(),
], names=['Select', 'Label']),
columns=[10, 30, 100, 300, 1000],
dtype=float
)
res = res1.append(res2).sort_index()
dres = pd.Series(index=res.columns, name='drop')
for j in res.columns:
dlst = list(range(j))
cols = list(range(j // 2, j + j // 2))
d = pd.DataFrame(1, range(10), cols)
dres.at[j] = timeit('d.drop(dlst, 1, errors="ignore")', 'from __main__ import d, dlst', number=100)
for s, l in res.index:
stmt = '{}(d, {}(d, dlst))'.format(s, l)
setp = 'from __main__ import d, dlst, {}, {}'.format(s, l)
res.at[(s, l), j] = timeit(stmt, setp, number=100)
rs = res / dres
rs
10 30 100 300 1000
Select Label
loc brod 0.747373 0.861979 0.891144 1.284235 3.872157
columndrop 1.193983 1.292843 1.396841 1.484429 1.335733
comp 0.802036 0.732326 1.149397 3.473283 25.565922
comprehension 1.463503 1.568395 1.866441 4.421639 26.552276
difference 1.413010 1.460863 1.587594 1.568571 1.569735
in1d 0.818502 0.844374 0.994093 1.042360 1.076255
isin 1.008874 0.879706 1.021712 1.001119 0.964327
setdiff1d 1.352828 1.274061 1.483380 1.459986 1.466575
setdifflst 1.233332 1.444521 1.714199 1.797241 1.876425
ridx columndrop 0.903013 0.832814 0.949234 0.976366 0.982888
comprehension 0.777445 0.827151 1.108028 3.473164 25.528879
difference 1.086859 1.081396 1.293132 1.173044 1.237613
setdiff1d 0.946009 0.873169 0.900185 0.908194 1.036124
setdifflst 0.732964 0.823218 0.819748 0.990315 1.050910
ridxa columndrop 0.835254 0.774701 0.907105 0.908006 0.932754
comprehension 0.697749 0.762556 1.215225 3.510226 25.041832
difference 1.055099 1.010208 1.122005 1.119575 1.383065
setdiff1d 0.760716 0.725386 0.849949 0.879425 0.946460
setdifflst 0.710008 0.668108 0.778060 0.871766 0.939537
slc columndrop 1.268191 1.521264 2.646687 1.919423 1.981091
comprehension 0.856893 0.870365 1.290730 3.564219 26.208937
difference 1.470095 1.747211 2.886581 2.254690 2.050536
setdiff1d 1.098427 1.133476 1.466029 2.045965 3.123452
setdifflst 0.833700 0.846652 1.013061 1.110352 1.287831
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True)
for i, (n, g) in enumerate([(n, g.xs(n)) for n, g in rs.groupby('Select')]):
ax = axes[i // 2, i % 2]
g.plot.bar(ax=ax, title=n)
ax.legend_.remove()
fig.tight_layout()
This is relative to the time it takes to run df.drop(dlst, 1, errors='ignore'). It seems like after all that effort, we only improve performance modestly.
If fact the best solutions use reindex or reindex_axis on the hack list(set(df.columns.values.tolist()).difference(dlst)). A close second and still very marginally better than drop is np.setdiff1d.
rs.idxmin().pipe(
lambda x: pd.DataFrame(
dict(idx=x.values, val=rs.lookup(x.values, x.index)),
x.index
)
)
idx val
10 (ridx, setdifflst) 0.653431
30 (ridxa, setdifflst) 0.746143
100 (ridxa, setdifflst) 0.816207
300 (ridx, setdifflst) 0.780157
1000 (ridxa, setdifflst) 0.861622
We can remove or delete a specified column or specified columns by the drop() method.
Suppose df is a dataframe.
Column to be removed = column0
Code:
df = df.drop(column0, axis=1)
To remove multiple columns col1, col2, . . . , coln, we have to insert all the columns that needed to be removed in a list. Then remove them by the drop() method.
Code:
df = df.drop([col1, col2, . . . , coln], axis=1)
If your original dataframe df is not too big, you have no memory constraints, and you only need to keep a few columns, or, if you don't know beforehand the names of all the extra columns that you do not need, then you might as well create a new dataframe with only the columns you need:
new_df = df[['spam', 'sausage']]
Deleting a column using the iloc function of dataframe and slicing, when we have a typical column name with unwanted values:
df = df.iloc[:,1:] # Removing an unnamed index column
Here 0 is the default row and 1 is the first column, hence :,1: is our parameter for deleting the first column.
The dot syntax works in JavaScript, but not in Python.
Python: del df['column_name']
JavaScript: del df['column_name'] or del df.column_name
Another way of deleting a column in a Pandas DataFrame
If you're not looking for in-place deletion then you can create a new DataFrame by specifying the columns using DataFrame(...) function as:
my_dict = { 'name' : ['a','b','c','d'], 'age' : [10,20,25,22], 'designation' : ['CEO', 'VP', 'MD', 'CEO']}
df = pd.DataFrame(my_dict)
Create a new DataFrame as
newdf = pd.DataFrame(df, columns=['name', 'age'])
You get a result as good as what you get with del / drop.
Taking advantage by using Autocomplete or "IntelliSense" over string literals:
del df[df.column1.name]
# or
df.drop(df.column1.name, axis=1, inplace=True)
It works fine with current Pandas versions.
To remove columns before and after specific columns you can use the method truncate. For example:
A B C D E
0 1 10 100 1000 10000
1 2 20 200 2000 20000
df.truncate(before='B', after='D', axis=1)
Output:
B C D
0 10 100 1000
1 20 200 2000
Viewed from a general Python standpoint, del obj.column_name makes sense if the attribute column_name can be deleted. It needs to be a regular attribute - or a property with a defined deleter.
The reasons why this doesn't translate to Pandas, and does not make sense for Pandas Dataframes are:
Consider df.column_name to be a “virtual attribute”, it is not a thing in its own right, it is not the “seat” of that column, it's just a way to access the column. Much like a property with no deleter.

move column in pandas dataframe

I have the following dataframe:
a b x y
0 1 2 3 -1
1 2 4 6 -2
2 3 6 9 -3
3 4 8 12 -4
How can I move columns b and x such that they are the last 2 columns in the dataframe? I would like to specify b and x by name, but not the other columns.
You can rearrange columns directly by specifying their order:
df = df[['a', 'y', 'b', 'x']]
In the case of larger dataframes where the column titles are dynamic, you can use a list comprehension to select every column not in your target set and then append the target set to the end.
>>> df[[c for c in df if c not in ['b', 'x']]
+ ['b', 'x']]
a y b x
0 1 -1 2 3
1 2 -2 4 6
2 3 -3 6 9
3 4 -4 8 12
To make it more bullet proof, you can ensure that your target columns are indeed in the dataframe:
cols_at_end = ['b', 'x']
df = df[[c for c in df if c not in cols_at_end]
+ [c for c in cols_at_end if c in df]]
cols = list(df.columns.values) #Make a list of all of the columns in the df
cols.pop(cols.index('b')) #Remove b from list
cols.pop(cols.index('x')) #Remove x from list
df = df[cols+['b','x']] #Create new dataframe with columns in the order you want
For example, to move column "name" to be the first column in df you can use insert:
column_to_move = df.pop("name")
# insert column with insert(location, column_name, column_value)
df.insert(0, "name", column_to_move)
similarly, if you want this column to be e.g. third column from the beginning:
df.insert(2, "name", column_to_move )
You can use to way below. It's very simple, but similar to the good answer given by Charlie Haley.
df1 = df.pop('b') # remove column b and store it in df1
df2 = df.pop('x') # remove column x and store it in df2
df['b']=df1 # add b series as a 'new' column.
df['x']=df2 # add b series as a 'new' column.
Now you have your dataframe with the columns 'b' and 'x' in the end. You can see this video from OSPY : https://youtu.be/RlbO27N3Xg4
similar to ROBBAT1's answer above, but hopefully a bit more robust:
df.insert(len(df.columns)-1, 'b', df.pop('b'))
df.insert(len(df.columns)-1, 'x', df.pop('x'))
This function will reorder your columns without losing data. Any omitted columns remain in the center of the data set:
def reorder_columns(columns, first_cols=[], last_cols=[], drop_cols=[]):
columns = list(set(columns) - set(first_cols))
columns = list(set(columns) - set(drop_cols))
columns = list(set(columns) - set(last_cols))
new_order = first_cols + columns + last_cols
return new_order
Example usage:
my_list = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']
reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
# Output:
['fourth', 'third', 'first', 'sixth', 'second']
To assign to your dataframe, use:
my_list = df.columns.tolist()
reordered_cols = reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
df = df[reordered_cols]
Simple solution:
old_cols = df.columns.values
new_cols= ['a', 'y', 'b', 'x']
df = df.reindex(columns=new_cols)
An alternative, more generic method;
from pandas import DataFrame
def move_columns(df: DataFrame, cols_to_move: list, new_index: int) -> DataFrame:
"""
This method re-arranges the columns in a dataframe to place the desired columns at the desired index.
ex Usage: df = move_columns(df, ['Rev'], 2)
:param df:
:param cols_to_move: The names of the columns to move. They must be a list
:param new_index: The 0-based location to place the columns.
:return: Return a dataframe with the columns re-arranged
"""
other = [c for c in df if c not in cols_to_move]
start = other[0:new_index]
end = other[new_index:]
return df[start + cols_to_move + end]
You can use pd.Index.difference with np.hstack, then reindex or use label-based indexing. In general, it's a good idea to avoid list comprehensions or other explicit loops with NumPy / Pandas objects.
cols_to_move = ['b', 'x']
new_cols = np.hstack((df.columns.difference(cols_to_move), cols_to_move))
# OPTION 1: reindex
df = df.reindex(columns=new_cols)
# OPTION 2: direct label-based indexing
df = df[new_cols]
# OPTION 3: loc label-based indexing
df = df.loc[:, new_cols]
print(df)
# a y b x
# 0 1 -1 2 3
# 1 2 -2 4 6
# 2 3 -3 6 9
# 3 4 -4 8 12
You can use movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'b')
mc.MoveToLast(df,'x')
Hope that helps.
P.S : The package can be found here. https://pypi.org/project/movecolumn/
You can also do this as a one-liner:
df.drop(columns=['b', 'x']).assign(b=df['b'], x=df['x'])
This will move any column to the last column :
Move any column to the last column of dataframe :
df= df[ [ col for col in df.columns if col != 'col_name_to_moved' ] + ['col_name_to_moved']]
Move any column to the first column of dataframe:
df= df[ ['col_name_to_moved'] + [ col for col in df.columns if col != 'col_name_to_moved' ]]
where col_name_to_moved is the column that you want to move.
I use Pokémon database as an example, the columns for my data base are
['Name', '#', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']
Here is the code:
import pandas as pd
df = pd.read_html('https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6')[0]
cols = df.columns.to_list()
cos_end= ["Name", "Total", "HP", "Defense"]
for i, j in enumerate(cos_end, start=(len(cols)-len(cos_end))):
cols.insert(i, cols.pop(cols.index(j)))
print(cols)
df = df.reindex(columns=cols)
print(df)

Pandas read multiindexed csv with blanks

I'm struggling with properly loading a csv that has a multi lines header with blanks. The CSV looks like this:
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
What I would like to get is:
When I try to load with pd.read_csv(file, header=[0,1], sep=','), I end up with the following:
Is there a way to get the desired result?
Note: alternatively, I would accept this as a result:
Versions used:
Python: 2.7.8
Pandas 0.16.0
Here is an automated way to fix the column index. First,
pull the column level values into a DataFrame:
columns = pd.DataFrame(df.columns.tolist())
then rename the Unnamed: columns to NaN:
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
and then forward-fill the NaNs:
columns[0] = columns[0].fillna(method='ffill')
so that columns now looks like
In [314]: columns
Out[314]:
0 1
0 NaN A
1 NaN B
2 C X
3 C Y
4 C Z
5 D X
6 D Y
7 D Z
Now we can find the remaining NaNs and fill them with empty strings:
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
To make the first two columns, A and B, indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
Now you can build a new MultiIndex and assign it to df.columns:
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
Putting it all together, if data is
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
then
import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].fillna(method='ffill')
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
print(df)
yields
A B C D
X Y Z X Y Z
0 1 2 3 4 5 6 7 8
1 3 4 5 6 7 8 9 0
There is no magical way of making pandas aware of how you want your index to look, the closest way you can do this is by specifying a lot yourself, like this:
names = ['A', 'B',
('C','X'), ('C', 'Y'), ('C', 'Z'),
('D','X'), ('D','Y'), ('D', 'Z')]
pd.read_csv(file, mangle_dupe_cols=True,
header=1, names=names, index_col=[0, 1])
Gives:
C D
X Y Z X Y Z
A B
1 2 3 4 5 6 7 8
To do this in a dynamic fashion, you could read the first two lines of the CSV as they are and loop through the columns you get to generate the names variable dynamically before loading the full dataset.
pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])
Then access the columns and loop to create your header.
Again, not a very clean solution, but should work.
you can read using :
df = pd.read_csv('file.csv', header=[0, 1], skipinitialspace=True, tupleize_cols=True)
and then
df.columns = pd.MultiIndex.from_tuples(df.columns)
Load the dataframe, with multiindex:
df = pd.read_csv(filelist,header=[0,1], sep=',')
Write a function to replace the index:
def replace_index(df):
arr = df.columns.values
l = [list(x) for x in arr]
for i in range(len(l)):
if l[i][0][:7] == 'Unnamed':
if l[i-1][0][:7] != 'Unnamed':
l[i][0] = l[i-1][0]
for i in range(len(l)):
if l[i][0][:7] == 'Unnamed':
l[i][0] = l[i][1]
l[i][1] = ''
index = pd.MultiIndex.from_tuples(l)
df.columns = index
return df
Return the new dataframe properly indexed:
replace_index(df)
I used a technique to flatten from the multi-index columns and make one column. It works well for me.
your_df.columns = ['_'.join(col).strip() for col in your_df.columns.values]
Import your csv file providing the header row indexes:
df = pd.read_csv('file.csv', header=[0, 1, 2])
Then, you can iterate over each column header, clean it up, assign it to a tuple, the re-assign the dataframe columns using pd.MultiIndex.from_tuples(list_of_tuples)
df.columns = pd.MultiIndex.from_tuples(
[tuple(['' if y.find('Unnamed')==0 else y for y in x]) for x in df.columns]
)
this is the quick one liner I was looking for when trying to figure this out.

Categories