How to store formulas, instead of values, in pandas DataFrame - python

Is it possible to work with pandas DataFrame as with an Excel spreadsheet: say, by entering a formula in a column so that when variables in other columns change, the values in this column change automatically? Something like:
a b c
2 3 =a+b
And so when I update 2 or 3, the column c also updates automatically.
PS: It's clearly possible to write a function to return a+b, but is there any built-in functionality in pandas or in other Python libraries to work with matrices this way?

This will work in 0.13 (still in development)
In [19]: df = DataFrame(randn(10,2),columns=list('ab'))
In [20]: df
Out[20]:
a b
0 0.958465 0.679193
1 -0.769077 0.497436
2 0.598059 0.457555
3 0.290926 -1.617927
4 -0.248910 -0.947835
5 -1.352096 -0.568631
6 0.009125 0.711511
7 -0.993082 -1.440405
8 -0.593704 0.352468
9 0.523332 -1.544849
This will be possible as 'a + b' (soon)
In [21]: formulas = { 'c' : 'df.a + df.b' }
In [22]: def update(df,formulas):
   ....:     for k, v in formulas.items():
   ....:         df[k] = pd.eval(v)
In [23]: update(df,formulas)
In [24]: df
Out[24]:
a b c
0 0.958465 0.679193 1.637658
1 -0.769077 0.497436 -0.271642
2 0.598059 0.457555 1.055614
3 0.290926 -1.617927 -1.327001
4 -0.248910 -0.947835 -1.196745
5 -1.352096 -0.568631 -1.920726
6 0.009125 0.711511 0.720636
7 -0.993082 -1.440405 -2.433487
8 -0.593704 0.352468 -0.241236
9 0.523332 -1.544849 -1.021517
You could implement a hook into __setitem__ on the DataFrame to have this type of function called automatically, but that's pretty tricky. You didn't specify how the frame is updated in the first place; it would probably be easiest to simply call the update function after you change the values.
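As a rough illustration of that hook idea (this is not a built-in pandas feature; FormulaFrame is an invented subclass and subclassing DataFrame has its own caveats, so treat it only as a sketch):
import pandas as pd

class FormulaFrame(pd.DataFrame):
    # DataFrame subclass that re-evaluates stored formulas after each column assignment
    _metadata = ['formulas']

    @property
    def _constructor(self):
        return FormulaFrame

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        # Recompute every formula column whenever any column changes
        for col, expr in getattr(self, 'formulas', {}).items():
            super().__setitem__(col, self.eval(expr))

df = FormulaFrame({'a': [2], 'b': [3]})
df.formulas = {'c': 'a + b'}
df['a'] = [10]   # this assignment triggers recomputation of 'c'
print(df)
#     a  b   c
# 0  10  3  13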

I don't know if it is what you want, but I accidentally discovered that you can store xlwt.Formula objects in DataFrame cells, and then, using the DataFrame.to_excel method, export the DataFrame to Excel and have your formulas in it:
import pandas
import xlwt
formulae=[]
formulae.append(xlwt.Formula('SUM(F1:F5)'))
formulae.append(xlwt.Formula('SUM(G1:G5)'))
formulae.append(xlwt.Formula('SUM(H1:I5)'))
formulae.append(xlwt.Formula('SUM(I1:I5)'))
df=pandas.DataFrame(formulae)
df.to_excel('FormulaTest.xls')
Try it...

There's currently no way to do this exactly in the way that you describe.
In pandas 0.13 there will be a new DataFrame.eval method that will allow you to evaluate an expression in the "context" of a DataFrame. For example, you'll be able to write df['c'] = df.eval('a + b').
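For reference, a minimal sketch of how this looks in later pandas releases, where eval can also assign within the expression itself:
import pandas as pd

df = pd.DataFrame({'a': [2], 'b': [3]})

# Evaluate an expression in the context of the DataFrame's columns
df['c'] = df.eval('a + b')

# Later versions also allow assignment inside the expression
df.eval('c = a + b', inplace=True)
print(df)
#    a  b  c
# 0  2  3  5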

Related

Pandas to modify values in csv file based on function

I have a CSV file that looks like below, this is same like my last question but this is by using Pandas.
Group Sam Dan Bori Son John Mave
A 0.00258844 0.983322 1.61479 1.2785 1.96963 10.6945
B 0.0026034 0.983305 1.61198 1.26239 1.9742 10.6838
C 0.0026174 0.983294 1.60913 1.24543 1.97877 10.6729
D 0.00263062 0.983289 1.60624 1.22758 1.98334 10.6618
E 0.00264304 0.98329 1.60332 1.20885 1.98791 10.6505
I have a function like below
def getnewno(value):
    value = value + 30
    if value > 40:
        value = value - 20
    else:
        value = value
    return value
I want to send all these values to the getnewno function, get a new value, and update the CSV file. How can this be accomplished in Pandas?
Expected output:
Group Sam Dan Bori Son John Mave
A 30.00258844 30.983322 31.61479 31.2785 31.96963 20.6945
B 30.0026034 30.983305 31.61198 31.26239 31.9742 20.6838
C 30.0026174 30.983294 31.60913 31.24543 31.97877 20.6729
D 30.00263062 30.983289 31.60624 31.22758 31.98334 20.6618
E 30.00264304 30.98329 31.60332 31.20885 31.98791 20.6505
The following should give you what you desire.
Applying a function
Your function can be simplified and here expressed as a lambda function.
It's then a matter of applying your function to all of the columns. There are a number of ways to do so. The first idea that comes to mind is to loop over df.columns. However, we can do better than this by using the applymap or transform methods:
import pandas as pd
# Read in the data from file
df = pd.read_csv('data.csv',
                 sep='\s+',
                 index_col=0)
# Simplified function with which to transform data
getnewno = lambda value: value + 10 if value > 10 else value + 30
# Looping over columns
#for col in df.columns:
# df[col] = df[col].apply(getnewno)
# Apply to all columns without loop
df = df.applymap(getnewno)
# Write out updated data
df.to_csv('data_updated.csv')
Using broadcasting
You can achieve your result using broadcasting and a little boolean logic. This avoids looping over any columns, and should ultimately prove faster and less memory intensive (although if your dataset is small any speed-up would be negligible):
import pandas as pd
df = pd.read_csv('data.csv',
                 sep='\s+',
                 index_col=0)
df += 30
make_smaller = df > 40
df[make_smaller] -= 20
First of all, your getnewno function looks too complicated... it can be simplified to e.g.:
def getnewno(value):
    if value + 30 > 40:
        return value + 10
    else:
        return value + 30
you can even change value + 30 > 40 to value > 10.
Or even a one-liner if you want:
getnewno = lambda value: value + 10 if value > 10 else value + 30
Having the function, you can apply it to specific values/columns. For example, if you want to create a column Mark_updated based on the Mark column, it would look like this (I assume your pandas DataFrame is called df):
df['Mark_updated'] = df['Mark'].apply(getnewno)
Use the mask function to do an if-else solution, before writing the data to csv
res = (df
       .select_dtypes('number')
       .add(30)
       # the if-else comes in here:
       # if any entry in the dataframe is greater than 40, subtract 20 from it,
       # else leave it as is
       .mask(lambda x: x > 40, lambda x: x.sub(20))
       )
#insert the group column back
res.insert(0,'Group',df.Group.array)
Write to csv:
res.to_csv(filename)
Group Sam Dan Bori Son John Mave
0 A 30.002588 30.983322 31.61479 31.27850 31.96963 20.6945
1 B 30.002603 30.983305 31.61198 31.26239 31.97420 20.6838
2 C 30.002617 30.983294 31.60913 31.24543 31.97877 20.6729
3 D 30.002631 30.983289 31.60624 31.22758 31.98334 20.6618
4 E 30.002643 30.983290 31.60332 31.20885 31.98791 20.6505

Solution for SpecificationError: nested renamer is not supported while using agg() along with groupby()

def stack_plot(data, xtick, col2='project_is_approved', col3='total'):
    ind = np.arange(data.shape[0])

    plt.figure(figsize=(20,5))
    p1 = plt.bar(ind, data[col3].values)
    p2 = plt.bar(ind, data[col2].values)

    plt.ylabel('Projects')
    plt.title('Number of projects aproved vs rejected')
    plt.xticks(ind, list(data[xtick].values))
    plt.legend((p1[0], p2[0]), ('total', 'accepted'))
    plt.show()

def univariate_barplots(data, col1, col2='project_is_approved', top=False):
    # Count number of zeros in dataframe python: https://stackoverflow.com/a/51540521/4084039
    temp = pd.DataFrame(project_data.groupby(col1)[col2].agg(lambda x: x.eq(1).sum())).reset_index()

    # Pandas dataframe grouby count: https://stackoverflow.com/a/19385591/4084039
    temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
    temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'Avg':'mean'})).reset_index()['Avg']

    temp.sort_values(by=['total'], inplace=True, ascending=False)

    if top:
        temp = temp[0:top]

    stack_plot(temp, xtick=col1, col2=col2, col3='total')
    print(temp.head(5))
    print("="*50)
    print(temp.tail(5))

univariate_barplots(project_data, 'school_state', 'project_is_approved', False)
Error:
SpecificationError Traceback (most recent call last)
<ipython-input-21-2cace8f16608> in <module>()
----> 1 univariate_barplots(project_data, 'school_state', 'project_is_approved', False)
<ipython-input-20-856fcc83737b> in univariate_barplots(data, col1, col2, top)
4
5 # Pandas dataframe grouby count: https://stackoverflow.com/a/19385591/4084039
----> 6 temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
7 print (temp['total'].head(2))
8 temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'Avg':'mean'})).reset_index()['Avg']
~\AppData\Roaming\Python\Python36\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, *args, **kwargs)
251 # but not the class list / tuple itself.
252 func = _maybe_mangle_lambdas(func)
--> 253 ret = self._aggregate_multiple_funcs(func)
254 if relabeling:
255 ret.columns = columns
~\AppData\Roaming\Python\Python36\site-packages\pandas\core\groupby\generic.py in _aggregate_multiple_funcs(self, arg)
292 # GH 15931
293 if isinstance(self._selected_obj, Series):
--> 294 raise SpecificationError("nested renamer is not supported")
295
296 columns = list(arg.keys())
SpecificationError: nested renamer is not supported
change
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'Avg':'mean'})).reset_index()['Avg']
to
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg(total='count')).reset_index()['total']
temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg(Avg='mean')).reset_index()['Avg']
Reason: in newer pandas versions, named aggregation is the recommended replacement for the deprecated dict-of-dicts approach to naming the output of column-specific aggregations ("Deprecate groupby.agg() with a dictionary when renaming").
Source: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html
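For reference, a minimal self-contained sketch of the named-aggregation syntax on toy data (the values are invented; the column names follow the question):
import pandas as pd

toy = pd.DataFrame({
    'school_state': ['CA', 'CA', 'NY', 'NY', 'NY'],
    'project_is_approved': [1, 0, 1, 1, 0],
})

# Named aggregation: output_name=(column, aggfunc)
summary = toy.groupby('school_state').agg(
    total=('project_is_approved', 'count'),
    Avg=('project_is_approved', 'mean'),
).reset_index()
print(summary)
#   school_state  total       Avg
# 0           CA      2  0.500000
# 1           NY      3  0.666667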
This error also happens if a column specified in the aggregation function dict does not exist in the dataframe:
In [190]: group = pd.DataFrame([[1, 2]], columns=['A', 'B']).groupby('A')
In [195]: group.agg({'B': 'mean'})
Out[195]:
B
A
1 2
In [196]: group.agg({'B': 'mean', 'non-existing-column': 'mean'})
...
SpecificationError: nested renamer is not supported
I found a way: instead of going like
g2 = df.groupby(["Description","CustomerID"],as_index=False).agg({'Quantity':{"maxQ":np.max,"minQ":np.min,"meanQ":np.mean}})
g2.columns = ["Description","CustomerID","maxQ","minQ",'meanQ']
Do as follows:
g2 = df.groupby(["Description","CustomerID"],as_index=False).agg({'Quantity':{np.max,np.min,np.mean}})
g2.columns = ["Description","CustomerID","maxQ","minQ",'meanQ']
I had the same error and this is how I resolved it!
Do you get the same error if you change
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
to
temp['total'] = project_data.groupby(col1)[col2].agg(total=('total','count')).reset_index()['total']
Instead of using .agg({'total':'count'}), you can pass the name together with the function as a list of tuples, like .agg([('total', 'count')]), and use the same for Avg as well. Hope it works.
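A small sketch of that list-of-tuples form on invented toy data:
import pandas as pd

toy = pd.DataFrame({'state': ['CA', 'CA', 'NY'], 'approved': [1, 0, 1]})

# A list of (name, aggfunc) tuples names the output columns
# without the deprecated dict-based renaming
out = toy.groupby('state')['approved'].agg([('total', 'count'), ('Avg', 'mean')]).reset_index()
print(out)
#   state  total  Avg
# 0    CA      2  0.5
# 1    NY      1  1.0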
I got a similar issue to #akshay jindal, but I checked the documentation as suggested by #artikay Khanna and the problem was solved: some functions have been adjusted, and the old ones are deprecated. Here is the warning the code produced on the last execution.
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version. Use named aggregation instead.
>>> grouper.agg(name_1=func_1, name_2=func_2)
"""Entry point for launching an IPython kernel.
Therefore, I suggest trying
grouper.agg(name_1=func_1, name_2=func_2)
Hope this helps.
Not a very elegant solution, but it works. Renaming the columns the way you are doing it is deprecated, but there is a workaround. Create a temporary variable 'approved' and store col2 in it, because when you apply the agg function the original column's values are replaced along with the column name. You could preserve the column name, but then the values in that column would change. So, in order to preserve the original DataFrame and still get two new columns with the desired names, you can use the following code.
approved = temp[col2]
temp = pd.DataFrame(project_data.groupby(col1)[col2].agg([('Avg','mean'),('total','count')]).reset_index())
temp[col2] = approved
P.S : Seems like an assignment of AAIC, I am working on same :)
Sometimes it's convenient to keep an aggregation dict describing how each column should be transformed under aggregation, one that works with different column sets and different groupby columns. You can do this with the new syntax fairly easily by unpacking the dict with **. Here's a minimal working example for simple data.
dfx=pd.DataFrame(columns=["A","B","C"],data=np.random.randint(0,5,size=(10,3)))
#dfx
#
# A B C
#0 4 4 1
#1 2 4 4
#2 1 3 3
#3 2 4 3
#4 1 2 1
#5 0 4 2
#6 2 3 4
#7 1 0 2
#8 2 1 4
#9 3 0 3
Maybe when you agg you want the first "A", the last "B", the mean "C" and sometimes your pipeline has a "D" (but not this time) that you also want the mean of.
aggdict = {"A": lambda x: x.iloc[0], "B": lambda x: x.iloc[-1], "C": "mean", "D": "mean"}
You can build a simple dict like the old days and then unpack it with ** filtering on the relevant keys:
gb_col="C"
gbc = dfx.groupby(gb_col).agg(**{k:(k,v) for k,v in aggdict.items() if k in dfx.columns and k != gb_col})
# A B
#C
#1 4 2
#2 0 0
#3 1 4
#4 2 3
And then you can slice and dice how you want with the same syntax:
mygb = lambda gb_col: dfx.groupby(gb_col).agg(**{k:(k,v) for k,v in aggdict.items() if k in dfx.columns and k != gb_col})
allgb = [mygb(c) for c in dfx.columns]
I have tried all the solutions, and it turned out to be an error with the column name. If your column name contains built-in keywords such as "in", "is", etc., it throws this error. In my case, my column name was "Points in Polygon" and I resolved the issue by renaming the column to "Points".
#Rishi's solution worked for me. The original name of the column in my dataframe was net_value_budgeted_rate, which was essentially dollar value of the sale. I changed it to dollars and it worked.
Info = pd.DataFrame(df.groupby("school_state").agg(Approved=("project_is_approved",lambda x: x.eq(1).sum()),Total=("project_is_approved","count"),Avg=("project_is_approved","mean"))).reset_index().sort_values(by=["Total"],ascending=False).head()
You can break this into individual commands for better readability.
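A hedged sketch of that split into steps (it assumes df holds the project data with the same columns as above):
import pandas as pd

# df is assumed to be the project data frame used in the one-liner above
grouped = df.groupby("school_state")
info = grouped.agg(
    Approved=("project_is_approved", lambda x: x.eq(1).sum()),
    Total=("project_is_approved", "count"),
    Avg=("project_is_approved", "mean"),
)
info = info.reset_index()
info = info.sort_values(by=["Total"], ascending=False)
print(info.head())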

Storing as Pandas DataFrames and Updating as Pytables

Can you store data in a pandas HDFStore and then open it / perform I/O using pytables? This question comes up because I am currently storing data as
store = pd.HDFStore('Filename', mode='a')
store.append(data)
However, as I understand it, pandas doesn't support updating records very well. I have a use case where I have to update 5% of the data daily. Would pd.io.pytables work? If so, I found no documentation on this. PyTables has a lot of documentation, but I am not sure whether I can open / update the file with pytables when I didn't use pytables to save the file initially.
Here is a demonstration of #flyingmeatball's answer:
Let's generate a test DF:
In [56]: df = pd.DataFrame(np.random.rand(15, 3), columns=list('abc'))
In [57]: df
Out[57]:
a b c
0 0.022079 0.901965 0.282529
1 0.596452 0.096204 0.197186
2 0.034127 0.992500 0.523114
3 0.659184 0.447355 0.246932
4 0.441517 0.853434 0.119602
5 0.779707 0.429574 0.744452
6 0.105255 0.934440 0.545421
7 0.216278 0.217386 0.282171
8 0.690729 0.052097 0.146705
9 0.828667 0.439608 0.091007
10 0.988435 0.326589 0.536904
11 0.687250 0.661912 0.318209
12 0.829129 0.758737 0.519068
13 0.500462 0.723528 0.026962
14 0.464162 0.364536 0.843899
and save it to HDFStore (NOTE: don't forget to use data_columns=True (or data_columns=[list_of_columns_to_index]) in order to index all columns that we want to use in the where clause):
In [58]: store = pd.HDFStore(r'd:/temp/test_removal.h5')
In [59]: store.append('test', df, format='t', data_columns=True)
In [60]: store.close()
Solution:
In [61]: store = pd.HDFStore(r'd:/temp/test_removal.h5')
The .remove() method should return the number of removed rows:
In [62]: store.remove('test', where="a > 0.5")
Out[62]: 9
Let's append the changed (multiplied by 100) rows:
In [63]: store.append('test', df.loc[df.a > 0.5] * 100, format='t', data_columns=True)
Test:
In [64]: store.select('test')
Out[64]:
a b c
0 0.022079 0.901965 0.282529
2 0.034127 0.992500 0.523114
4 0.441517 0.853434 0.119602
6 0.105255 0.934440 0.545421
7 0.216278 0.217386 0.282171
14 0.464162 0.364536 0.843899
1 59.645151 9.620415 19.718557
3 65.918421 44.735482 24.693160
5 77.970749 42.957446 74.445185
8 69.072948 5.209725 14.670545
9 82.866731 43.960848 9.100682
10 98.843540 32.658931 53.690360
11 68.725002 66.191215 31.820942
12 82.912937 75.873689 51.906795
13 50.046189 72.352794 2.696243
finalize:
In [65]: store.close()
Here are the docs I think you're after:
http://pandas.pydata.org/pandas-docs/version/0.19.0/api.html?highlight=pytables
See this thread as well:
Update pandas DataFrame stored in a Pytable with another pandas DataFrame
It looks like you can load the 5% of records into memory, remove them from the store, then append the updated ones back to replace them:
store.remove(key, where=...)
store.append(...)
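A compact sketch of that remove-then-append workflow wrapped in a helper (update_rows is an invented name; it assumes the table was stored in table format with data_columns, as in the demonstration above):
import pandas as pd

def update_rows(path, key, new_rows, where):
    # Replace the rows matching `where` with `new_rows` in a table-format HDFStore
    with pd.HDFStore(path, mode='a') as store:
        store.remove(key, where=where)
        store.append(key, new_rows, format='t', data_columns=True)

# e.g. update_rows('test_removal.h5', 'test', df.loc[df.a > 0.5] * 100, where='a > 0.5')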
You can also do this outside of pandas; see the PyTables tutorial here on removal:
http://www.pytables.org/usersguide/tutorials.html

Operating on pandas dataframes that may or may not be multiIndex

I have a few functions that make new columns in a pandas dataframe as a function of existing columns in the dataframe. I have two different scenarios that occur here: (1) the dataframe is NOT MultiIndex and has a set of columns, say [a,b], and (2) the dataframe is MultiIndex and now has the same set of column headers repeated N times, say [(a,1),(b,1),(a,2),(b,2)....(a,N),(b,N)].
I've been making the aforementioned functions in the style shown below:
def f(df):
    if multiindex(df):
        for s in df['a'].columns:
            df['c', s] = someFunction(df['a', s], df['b', s])
    else:
        df['c'] = someFunction(df['a'], df['b'])
Is there another way to do this, without having these if-multi-index/else statement everywhere and duplicating the someFunction code? I'd prefer NOT to split the multi indexed frame into N smaller dataframes (I often need to filter data or do things and keep the rows consistent across all the 1,2,...N frames, and keeping them together in one frame seems the to be the best way to do that).
You may still have to test whether the columns are a MultiIndex, but this should be cleaner and more efficient. Caveat: this will not work if your function uses summary statistics on the column, for example if someFunction divides by the average of column 'a' (see the sketch below).
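To make the caveat concrete, here is a hypothetical function of that kind; on the stacked frame, a.mean() pools all of the original ('a', *) sub-columns into one Series, so it differs from each sub-column's own mean:
def scale_by_mean_of_a(a, b):
    # Hypothetical example of a function that relies on a summary statistic:
    # after stacking, `a` contains every ('a', *) sub-column pooled together,
    # so a.mean() is no longer a per-sub-column mean.
    return b / a.mean()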
Solution
def someFunction(a, b):
    return a + b

def f(df):
    df = df.copy()
    ismi = isinstance(df.columns, pd.MultiIndex)
    if ismi:
        df = df.stack()
    df['c'] = someFunction(df['a'], df['a'])
    if ismi:
        df = df.unstack()
    return df
Setup
import pandas as pd
import numpy as np
setup_tuples = []
for c in ['a', 'b']:
    for i in ['one', 'two', 'three']:
        setup_tuples.append((c, i))
columns = pd.MultiIndex.from_tuples(setup_tuples)
rand_array = np.random.rand(10, len(setup_tuples))
df = pd.DataFrame(rand_array, columns=columns)
df looks like this
a b
one two three one two three
0 0.282834 0.490313 0.201300 0.140157 0.467710 0.352555
1 0.838527 0.707131 0.763369 0.265170 0.452397 0.968125
2 0.822786 0.785226 0.434637 0.146397 0.056220 0.003197
3 0.314795 0.414096 0.230474 0.595133 0.060608 0.900934
4 0.334733 0.118689 0.054299 0.237786 0.658538 0.057256
5 0.993753 0.552942 0.665615 0.336948 0.788817 0.320329
6 0.310809 0.199921 0.158675 0.059406 0.801491 0.134779
7 0.971043 0.183953 0.723950 0.909778 0.103679 0.695661
8 0.755384 0.728327 0.029720 0.408389 0.808295 0.677195
9 0.276158 0.978232 0.623972 0.897015 0.253178 0.093772
I constructed df to have MultiIndex columns. What I'd do is use the .stack() method to push the second level of the column index to be the second level of the row index.
df.stack() looks like this
a b
0 one 0.282834 0.140157
three 0.201300 0.352555
two 0.490313 0.467710
1 one 0.838527 0.265170
three 0.763369 0.968125
two 0.707131 0.452397
2 one 0.822786 0.146397
three 0.434637 0.003197
two 0.785226 0.056220
3 one 0.314795 0.595133
three 0.230474 0.900934
two 0.414096 0.060608
4 one 0.334733 0.237786
three 0.054299 0.057256
two 0.118689 0.658538
5 one 0.993753 0.336948
three 0.665615 0.320329
two 0.552942 0.788817
6 one 0.310809 0.059406
three 0.158675 0.134779
two 0.199921 0.801491
7 one 0.971043 0.909778
three 0.723950 0.695661
two 0.183953 0.103679
8 one 0.755384 0.408389
three 0.029720 0.677195
two 0.728327 0.808295
9 one 0.276158 0.897015
three 0.623972 0.093772
two 0.978232 0.253178
Now you can operate on df.stack() as if the columns were not a MultiIndex
Demonstration
print f(df)
will give you what you want
a b c \
one three two one three two one
0 0.282834 0.201300 0.490313 0.140157 0.352555 0.467710 0.565667
1 0.838527 0.763369 0.707131 0.265170 0.968125 0.452397 1.677055
2 0.822786 0.434637 0.785226 0.146397 0.003197 0.056220 1.645572
3 0.314795 0.230474 0.414096 0.595133 0.900934 0.060608 0.629591
4 0.334733 0.054299 0.118689 0.237786 0.057256 0.658538 0.669465
5 0.993753 0.665615 0.552942 0.336948 0.320329 0.788817 1.987507
6 0.310809 0.158675 0.199921 0.059406 0.134779 0.801491 0.621618
7 0.971043 0.723950 0.183953 0.909778 0.695661 0.103679 1.942086
8 0.755384 0.029720 0.728327 0.408389 0.677195 0.808295 1.510767
9 0.276158 0.623972 0.978232 0.897015 0.093772 0.253178 0.552317
three two
0 0.402600 0.980626
1 1.526739 1.414262
2 0.869273 1.570453
3 0.460948 0.828193
4 0.108599 0.237377
5 1.331230 1.105884
6 0.317349 0.399843
7 1.447900 0.367907
8 0.059439 1.456654
9 1.247944 1.956464

Select multiple columns by labels in pandas

I've been looking around for ways to select columns through the Python documentation and the forums, but every example on indexing columns is too simplistic.
Suppose I have a 10 x 10 dataframe
df = DataFrame(randn(10, 10), index=range(0,10), columns=['A', 'B', 'C', 'D','E','F','G','H','I','J'])
So far, all the documentations gives is just a simple example of indexing like
subset = df.loc[:,'A':'C']
or
subset = df.loc[:,'C':]
But I get an error when I try to index multiple, non-sequential columns, like this
subset = df.loc[:,('A':'C', 'E')]
How would I index in Pandas if I wanted to select columns A to C, E, and G to I? It appears that this logic will not work
subset = df.loc[:,('A':'C', 'E', 'G':'I')]
I feel that the solution is pretty simple, but I can't get around this error. Thanks!
Name- or Label-Based (using regular expression syntax)
df.filter(regex='[A-CEG-I]') # does NOT depend on the column order
Note that any regular expression is allowed here, so this approach can be very general. E.g. if you wanted all columns starting with a capital or lowercase "A" you could use: df.filter(regex='^[Aa]')
Location-Based (depends on column order)
df[ list(df.loc[:,'A':'C']) + ['E'] + list(df.loc[:,'G':'I']) ]
Note that unlike the label-based method, this only works if your columns are alphabetically sorted. This is not necessarily a problem, however. For example, if your columns go ['A','C','B'], then you could replace 'A':'C' above with 'A':'B'.
The Long Way
And for completeness, you always have the option shown by #Magdalena of simply listing each column individually, although it could be much more verbose as the number of columns increases:
df[['A','B','C','E','G','H','I']] # does NOT depend on the column order
Results for any of the above methods
A B C E G H I
0 -0.814688 -1.060864 -0.008088 2.697203 -0.763874 1.793213 -0.019520
1 0.549824 0.269340 0.405570 -0.406695 -0.536304 -1.231051 0.058018
2 0.879230 -0.666814 1.305835 0.167621 -1.100355 0.391133 0.317467
Just pick the columns you want directly....
df[['A','E','I','C']]
How do I select multiple columns by labels in pandas?
Multiple label-based range slicing is not easily supported with pandas, but position-based slicing is, so let's try that instead:
import numpy as np

loc = df.columns.get_loc
df.iloc[:, np.r_[loc('A'):loc('C')+1, loc('E'), loc('G'):loc('I')+1]]
A B C E G H I
0 -1.666330 0.321260 -1.768185 -0.034774 0.023294 0.533451 -0.241990
1 0.911498 3.408758 0.419618 -0.462590 0.739092 1.103940 0.116119
2 1.243001 -0.867370 1.058194 0.314196 0.887469 0.471137 -1.361059
3 -0.525165 0.676371 0.325831 -1.152202 0.606079 1.002880 2.032663
4 0.706609 -0.424726 0.308808 1.994626 0.626522 -0.033057 1.725315
5 0.879802 -1.961398 0.131694 -0.931951 -0.242822 -1.056038 0.550346
6 0.199072 0.969283 0.347008 -2.611489 0.282920 -0.334618 0.243583
7 1.234059 1.000687 0.863572 0.412544 0.569687 -0.684413 -0.357968
8 -0.299185 0.566009 -0.859453 -0.564557 -0.562524 0.233489 -0.039145
9 0.937637 -2.171174 -1.940916 -1.553634 0.619965 -0.664284 -0.151388
Note that the +1 is added because when using iloc the rightmost index is exclusive.
Comments on Other Solutions
filter is a nice and simple method for OP's headers, but this might not generalise well to arbitrary column names.
The "location-based" solution with loc is a little closer to the ideal, but you cannot avoid creating intermediate DataFrames (that are eventually thrown out and garbage collected) to compute the final result range -- something that we would ideally like to avoid.
Lastly, "pick your columns directly" is good advice as long as you have a manageably small number of columns to pick. It will, however, not be applicable in some cases where ranges span dozens (or possibly hundreds) of columns.
One option for selecting multiple slices is with select_columns from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
from numpy import random
random.seed(3)
df = pd.DataFrame(
random.randn(10, 10),
index=range(0,10),
columns=['A', 'B', 'C', 'D','E','F','G','H','I','J']
)
df.select_columns(slice('A', 'C'), 'E', slice('G', 'I'))
A B C E G H I
0 1.788628 0.436510 0.096497 -0.277388 -0.082741 -0.627001 -0.043818
1 -1.313865 0.884622 0.881318 0.050034 -0.545360 -1.546477 0.982367
2 -1.185047 -0.205650 1.486148 -1.023785 0.625245 -0.160513 -0.768836
3 0.745056 1.976111 -1.244123 -0.803766 -0.923792 -1.023876 1.123978
4 -1.623285 0.646675 -0.356271 -0.596650 -0.873882 0.029714 -2.248258
5 1.013183 0.852798 1.108187 1.487543 0.845833 -1.860890 -0.602885
6 1.048148 1.333738 -0.197415 -0.674728 0.152946 -1.064195 0.437947
7 -1.024931 0.899338 -0.154507 0.483788 0.643163 0.249087 -1.395764
8 -1.370669 0.238563 0.614077 0.145063 -0.024104 -0.888657 -2.915738
9 -0.591079 -0.516417 -0.959996 -0.574708 0.679072 -0.855437 -0.300206
The caveat here is that you have to explicitly use python's builtin slice.
Just like the excellent chosen answer, you can use regular expressions; again, the usage is explicit (Python's re):
import re
df.select_columns(re.compile('[A-CEG-I]'))
A B C E G H I
0 1.788628 0.436510 0.096497 -0.277388 -0.082741 -0.627001 -0.043818
1 -1.313865 0.884622 0.881318 0.050034 -0.545360 -1.546477 0.982367
2 -1.185047 -0.205650 1.486148 -1.023785 0.625245 -0.160513 -0.768836
3 0.745056 1.976111 -1.244123 -0.803766 -0.923792 -1.023876 1.123978
4 -1.623285 0.646675 -0.356271 -0.596650 -0.873882 0.029714 -2.248258
5 1.013183 0.852798 1.108187 1.487543 0.845833 -1.860890 -0.602885
6 1.048148 1.333738 -0.197415 -0.674728 0.152946 -1.064195 0.437947
7 -1.024931 0.899338 -0.154507 0.483788 0.643163 0.249087 -1.395764
8 -1.370669 0.238563 0.614077 0.145063 -0.024104 -0.888657 -2.915738
9 -0.591079 -0.516417 -0.959996 -0.574708 0.679072 -0.855437 -0.300206
You can go crazy and combine different selection options within the select_columns method.
