Drop all columns after a particular column (by name) in pandas?

I found a bunch of answers about how to drop columns using index numbers.
I am stumped about how to drop all columns after a particular column by name.
df = pd.DataFrame(columns=['A','B','C','D'])
df.drop(['B':], inplace=True)  # invalid syntax
I expect the new df to have only the A and B columns.

Dropping all columns after is the same as keeping all columns up to and including. So:
In [321]: df = pd.DataFrame(columns=['A','B','C','D'])
In [322]: df = df.loc[:, :'B']
In [323]: df
Out[323]:
Empty DataFrame
Columns: [A, B]
Index: []
(Using inplace is typically not worth it.)

get_loc and iloc
Dropping some is the same as selecting the others.
df.iloc[:, :df.columns.get_loc('B') + 1]
Empty DataFrame
Columns: [A, B]
Index: []

df.drop(df.columns[list(df.columns).index("B")+1:], axis=1, inplace=True)

df = df[df.columns[:list(df.columns).index('B')+1]]
should work.
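A quick sanity check (a minimal sketch with sample data, using the question's column names) showing that the label-slice and get_loc approaches agree:

```python
import pandas as pd

# Small frame with data, using the question's column names
df = pd.DataFrame([[1, 2, 3, 4]], columns=['A', 'B', 'C', 'D'])

# Keep everything up to and including 'B' (label slices are inclusive)
kept_loc = df.loc[:, :'B']

# Same selection positionally (+1 because iloc's end is exclusive)
kept_iloc = df.iloc[:, :df.columns.get_loc('B') + 1]

print(list(kept_loc.columns))      # ['A', 'B']
print(kept_loc.equals(kept_iloc))  # True
```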

Related

How to add columns to an empty pandas dataframe?

I have an empty dataframe.
df=pd.DataFrame(columns=['a'])
for some reason I want to generate df2, another empty dataframe, with two columns 'a' and 'b'.
If I do
df.columns=df.columns+'b'
it does not work (I get the columns renamed to 'ab')
and neither does the following
df.columns=df.columns.tolist()+['b']
How can I add a separate column 'b' to df while df.empty stays True?
Using .loc is also not possible
df.loc[:,'b']=None
as it returns
Cannot set dataframe with no defined index and a scalar
Here are a few ways to add an empty column to an empty dataframe:
df=pd.DataFrame(columns=['a'])
df['b'] = None
df = df.assign(c=None)
df = df.assign(d=df['a'])
df['e'] = pd.Series(index=df.index)
df = pd.concat([df,pd.DataFrame(columns=list('f'))])
print(df)
Output:
Empty DataFrame
Columns: [a, b, c, d, e, f]
Index: []
I hope it helps.
If you just do df['b'] = None then df.empty is still True and df is:
Empty DataFrame
Columns: [a, b]
Index: []
EDIT:
To create an empty df2 from the columns of df and adding new columns, you can do:
df2 = pd.DataFrame(columns = df.columns.tolist() + ['b', 'c', 'd'])
If you want to add multiple columns at the same time you can also reindex.
new_cols = ['c', 'd', 'e', 'f', 'g']
df2 = df.reindex(df.columns.union(new_cols), axis=1)
#Empty DataFrame
#Columns: [a, c, d, e, f, g]
#Index: []
This is one way:
df2 = df.join(pd.DataFrame(columns=['b']))
The advantage of this method is you can add an arbitrary number of columns without explicit loops.
In addition, this satisfies your requirement of df.empty evaluating to True if no data exists.
You can use concat:
df=pd.DataFrame(columns=['a'])
df
Out[568]:
Empty DataFrame
Columns: [a]
Index: []
df2=pd.DataFrame(columns=['b', 'c', 'd'])
pd.concat([df,df2])
Out[571]:
Empty DataFrame
Columns: [a, b, c, d]
Index: []
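A minimal check (column names assumed) that these approaches all add columns without making the frame non-empty:

```python
import pandas as pd

df = pd.DataFrame(columns=['a'])
df['b'] = None                              # plain assignment
df = df.assign(c=None)                      # assign
df = df.join(pd.DataFrame(columns=['d']))   # join on the (empty) index

print(list(df.columns))  # ['a', 'b', 'c', 'd']
print(df.empty)          # True
```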

How do I filter an empty DataFrame and still keep the columns of that DataFrame?

Here is an example of why pandas is a terribly designed hacked together library:
import pandas as pd
df = pd.DataFrame()
df['A'] = [1,2,3]
df['B'] = [4,5,6]
print(df)
df1 = df[df.A.apply(lambda x:x == 4)]
df2 = df1[df1.B.apply(lambda x:x == 1)]
print(df2)
This will print
df
A B
0 1 4
1 2 5
2 3 6
df2
Empty DataFrame
Columns: []
Index: []
Note how it shows Columns: [], which means any further filtering/selecting on df2 will fail. This is a huge issue, because it means I now have to always check whether a table is empty before attempting to select from it, which is garbage behaviour.
For clarity, the sensible, thoughtful, reasonable, not totally broken behaviour would be to preserve the columns.
Anyone care to offer some hack I can apply on top of the collection of hacks which is the dataframe API?
Pandas covers almost all the situations we need, especially these simple cases.
PS: There is nothing wrong with pandas here; use .loc:
df1 = df.loc[df.A.apply(lambda x:x == 4)]
df2 = df1.loc[df1.B.apply(lambda x:x == 1)]
df1
Out[53]:
Empty DataFrame
Columns: [A, B]
Index: []
df2
Out[54]:
Empty DataFrame
Columns: [A, B]
Index: []
df2 = df1[df1.B.apply(lambda x:x == 1).astype(bool)]
All other answers are missing the point (except for Wen's, which is an ok alternative)
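Why the columns vanish in the first place (a sketch; the explanation is my reading of pandas' indexing rules): on an empty frame, apply returns an empty Series that is not boolean-typed, and indexing a DataFrame with a non-boolean Series is treated as label-based column selection rather than row filtering, so you get back no columns at all. Casting the mask to bool restores boolean row filtering:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

df1 = df[df.A.apply(lambda x: x == 4)]   # empty: no A equals 4
mask = df1.B.apply(lambda x: x == 1)     # empty Series, not bool dtype

df2 = df1[mask.astype(bool)]             # force a boolean row mask
print(list(df2.columns))  # ['A', 'B']
print(df2.empty)          # True
```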

How do you select the indexes of a pandas dataframe as a series?

I have tried resetting the index and then selecting that column then setting the index again like so:
df.reset_index(inplace=True,drop=False)
country_names = df['Country'] #the Series I want to select
df.set_index('Country',drop=True,inplace=True)
But it seems like there should be a better way to do this.
To get the index of a dataframe as a pd.Series object you can use the to_series method, for example:
df = pd.DataFrame([1, 3], index=['a', 'b'])
df.index.to_series()
a a
b b
dtype: object
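For instance (a sketch with an assumed 'Country' index, mirroring the question), to_series gives you the index values directly, without the reset/set round trip:

```python
import pandas as pd

df = pd.DataFrame({'pop': [67, 83]},
                  index=pd.Index(['France', 'Germany'], name='Country'))

# A Series indexed by, and containing, the index values;
# the index name carries over as the Series name.
country_names = df.index.to_series()
print(list(country_names))   # ['France', 'Germany']
print(country_names.name)    # Country
```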

Pandas selecting discontinuous columns from a dataframe

I am using the following to select specific columns from the dataframe comb, which I would like to bring into a new dataframe. The individual selects work fine (e.g. comb.ix[:,0:1]), but when I attempt to combine them using +, I get a bad result: the first selection ([:,0:1]) gets stuck on the end of the dataframe, and the values contained in the original col 1 are wiped out while appearing at the end of the row. What is the right way to get just the columns I want? (I'd include sample data, but as you may see, there are too many columns, which is why I'm trying to do it this way.)
comb.ix[:,0:1]+comb.ix[:,17:342]
If you want to concatenate a sub selection of your df columns then use pd.concat:
pd.concat([comb.ix[:,0:1],comb.ix[:,17:342]], axis=1)
So long as the indices match then this will align correctly.
Thanks to @iHightower, you can also sub-select by passing the labels:
pd.concat([df.ix[:,'Col1':'Col5'], df.ix[:,'Col9':'Col15']], axis=1)
Note that .ix is deprecated (and was removed in pandas 1.0), so the following should work:
In [115]:
df = pd.DataFrame(columns=['col' + str(x) for x in range(10)])
df
Out[115]:
Empty DataFrame
Columns: [col0, col1, col2, col3, col4, col5, col6, col7, col8, col9]
Index: []
In [118]:
pd.concat([df.loc[:, 'col2':'col4'], df.loc[:, 'col7':'col8']], axis=1)
​
Out[118]:
Empty DataFrame
Columns: [col2, col3, col4, col7, col8]
Index: []
Or using iloc:
In [127]:
pd.concat([df.iloc[:, df.columns.get_loc('col2'):df.columns.get_loc('col4')], df.iloc[:, df.columns.get_loc('col7'):df.columns.get_loc('col8')]], axis=1)
Out[127]:
Empty DataFrame
Columns: [col2, col3, col7]
Index: []
Note that iloc slicing is open/closed so the end range is not included so you'd have to find the column after the column of interest if you want to include it:
In [128]:
pd.concat([df.iloc[:, df.columns.get_loc('col2'):df.columns.get_loc('col4')+1], df.iloc[:, df.columns.get_loc('col7'):df.columns.get_loc('col8')+1]], axis=1)
Out[128]:
Empty DataFrame
Columns: [col2, col3, col4, col7, col8]
Index: []
NumPy has a handy indexer named np.r_, which lets you solve this with the modern DataFrame selection interface, iloc:
df.iloc[:, np.r_[0:1, 17:342]]
I believe this is a more elegant solution.
It even supports more complex selections:
df.iloc[:, np.r_[0:1, 5, 16, 17:342:2, -5:]]
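A runnable sketch of the np.r_ approach (a small frame assumed for the demo): np.r_ concatenates slice ranges and scalars into one integer index array for iloc.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([range(10)], columns=[f'col{i}' for i in range(10)])

# np.r_[0:2, 5, 7:9] -> array([0, 1, 5, 7, 8])
sel = df.iloc[:, np.r_[0:2, 5, 7:9]]
print(list(sel.columns))  # ['col0', 'col1', 'col5', 'col7', 'col8']
```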
I recently solved it by just concatenating ranges:
r1 = pd.Series(range(5))
r2 = pd.Series([10, 15, 20])
final_range = pd.concat([r1, r2])  # Series.append was removed in pandas 2.0
df.iloc[:, final_range]
Then you will get columns 0-4 plus 10, 15 and 20.

pandas join DataFrame force suffix?

How can I force a suffix on a merge or join? I understand it's possible to provide one if there is a collision, but in my case I'm merging df1 with df2, which doesn't cause any collision, and then merging again on df2, which does use the suffixes. I would prefer for each merge to have a suffix, because it gets confusing if I do different combinations, as you can imagine.
You could force a suffix on the actual DataFrame:
In [11]: df_a = pd.DataFrame([[1], [2]], columns=['A'])
In [12]: df_b = pd.DataFrame([[3], [4]], columns=['B'])
In [13]: df_a.join(df_b)
Out[13]:
A B
0 1 3
1 2 4
By appending to its column's names:
In [14]: df_a.columns = df_a.columns.map(lambda x: str(x) + '_a')
In [15]: df_a
Out[15]:
A_a
0 1
1 2
Now joins won't need the suffix correction, whether they collide or not:
In [16]: df_b.columns = df_b.columns.map(lambda x: str(x) + '_b')
In [17]: df_a.join(df_b)
Out[17]:
A_a B_b
0 1 3
1 2 4
As of pandas version 0.24.2 you can add a suffix to column names on a DataFrame using the add_suffix method.
This makes a one-liner merge with forced suffixes more bearable, for example (merging on the index, since all of df2's columns are renamed and there are no longer any common column names to merge on):
df_merged = df1.merge(df2.add_suffix('_2'), left_index=True, right_index=True)
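A small sketch of the add_suffix approach (frame and column names assumed), keeping the join key explicit with left_on/right_on since the suffixed key no longer matches by name:

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2], 'val': [10, 20]})
df2 = pd.DataFrame({'key': [1, 2], 'val': [30, 40]})

# Suffix the right frame up front so the result is unambiguous
merged = df1.merge(df2.add_suffix('_2'), left_on='key', right_on='key_2')
print(list(merged.columns))  # ['key', 'val', 'key_2', 'val_2']
```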
Pandas merge will only give the new columns a suffix when there is already a column with the same name. When I need to force the new columns to have a suffix, I create an empty column with the name of the column that I want to join:
df["colName"] = ""  # create empty column
df.merge(right=df1, suffixes=("_a", "_b"))
You can later drop the empty column.
You could do the same for more than one column, or for every column in df.columns.values.
This is what I've been using to pandas.merge two DataFrames and force suffixing:
def merge_force_suffix(left, right, **kwargs):
    on_col = kwargs['on']
    suffix_tuple = kwargs['suffixes']

    def suffix_col(col, suffix):
        if col != on_col:
            return str(col) + suffix
        return col

    left_suffixed = left.rename(columns=lambda x: suffix_col(x, suffix_tuple[0]))
    right_suffixed = right.rename(columns=lambda x: suffix_col(x, suffix_tuple[1]))
    del kwargs['suffixes']
    return pd.merge(left_suffixed, right_suffixed, **kwargs)
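A self-contained usage sketch of that helper (the function repeated verbatim so the snippet runs on its own; frame contents assumed):

```python
import pandas as pd

def merge_force_suffix(left, right, **kwargs):
    on_col = kwargs['on']
    suffix_tuple = kwargs['suffixes']

    def suffix_col(col, suffix):
        # Suffix every column except the join key
        return str(col) + suffix if col != on_col else col

    left_suffixed = left.rename(columns=lambda x: suffix_col(x, suffix_tuple[0]))
    right_suffixed = right.rename(columns=lambda x: suffix_col(x, suffix_tuple[1]))
    del kwargs['suffixes']
    return pd.merge(left_suffixed, right_suffixed, **kwargs)

df1 = pd.DataFrame({'key': [1, 2], 'val': [10, 20]})
df2 = pd.DataFrame({'key': [1, 2], 'val': [30, 40]})

out = merge_force_suffix(df1, df2, on='key', suffixes=('_x', '_y'))
print(list(out.columns))  # ['key', 'val_x', 'val_y']
```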
