Select multiple columns by labels in pandas

Select multiple columns by labels in pandas - python

I've been looking around for ways to select columns through the python documentation and the forums but every example on indexing columns are too simplistic.
Suppose I have a 10 x 10 dataframe
df = DataFrame(randn(10, 10), index=range(0,10), columns=['A', 'B', 'C', 'D','E','F','G','H','I','J'])
So far, all the documentations gives is just a simple example of indexing like
subset = df.loc[:,'A':'C']
or
subset = df.loc[:,'C':]
But I get an error when I try index multiple, non-sequential columns, like this
subset = df.loc[:,('A':'C', 'E')]
How would I index in Pandas if I wanted to select column A to C, E, and G to I? It appears that this logic will not work
subset = df.loc[:,('A':'C', 'E', 'G':'I')]
I feel that the solution is pretty simple, but I can't get around this error. Thanks!

Name- or Label-Based (using regular expression syntax)
df.filter(regex='[A-CEG-I]') # does NOT depend on the column order
Note that any regular expression is allowed here, so this approach can be very general. E.g. if you wanted all columns starting with a capital or lowercase "A" you could use: df.filter(regex='^[Aa]')
Location-Based (depends on column order)
df[ list(df.loc[:,'A':'C']) + ['E'] + list(df.loc[:,'G':'I']) ]
Note that unlike the label-based method, this only works if your columns are alphabetically sorted. This is not necessarily a problem, however. For example, if your columns go ['A','C','B'], then you could replace 'A':'C' above with 'A':'B'.
The Long Way
And for completeness, you always have the option shown by #Magdalena of simply listing each column individually, although it could be much more verbose as the number of columns increases:
df[['A','B','C','E','G','H','I']] # does NOT depend on the column order
Results for any of the above methods
A B C E G H I
0 -0.814688 -1.060864 -0.008088 2.697203 -0.763874 1.793213 -0.019520
1 0.549824 0.269340 0.405570 -0.406695 -0.536304 -1.231051 0.058018
2 0.879230 -0.666814 1.305835 0.167621 -1.100355 0.391133 0.317467

Just pick the columns you want directly....
df[['A','E','I','C']]

How do I select multiple columns by labels in pandas?
Multiple label-based range slicing is not easily supported with pandas, but position-based slicing is, so let's try that instead:
loc = df.columns.get_loc
df.iloc[:, np.r_[loc('A'):loc('C')+1, loc('E'), loc('G'):loc('I')+1]]
A B C E G H I
0 -1.666330 0.321260 -1.768185 -0.034774 0.023294 0.533451 -0.241990
1 0.911498 3.408758 0.419618 -0.462590 0.739092 1.103940 0.116119
2 1.243001 -0.867370 1.058194 0.314196 0.887469 0.471137 -1.361059
3 -0.525165 0.676371 0.325831 -1.152202 0.606079 1.002880 2.032663
4 0.706609 -0.424726 0.308808 1.994626 0.626522 -0.033057 1.725315
5 0.879802 -1.961398 0.131694 -0.931951 -0.242822 -1.056038 0.550346
6 0.199072 0.969283 0.347008 -2.611489 0.282920 -0.334618 0.243583
7 1.234059 1.000687 0.863572 0.412544 0.569687 -0.684413 -0.357968
8 -0.299185 0.566009 -0.859453 -0.564557 -0.562524 0.233489 -0.039145
9 0.937637 -2.171174 -1.940916 -1.553634 0.619965 -0.664284 -0.151388
Note that the +1 is added because when using iloc the rightmost index is exclusive.
Comments on Other Solutions
filter is a nice and simple method for OP's headers, but this might not generalise well to arbitrary column names.
The "location-based" solution with loc is a little closer to the ideal, but you cannot avoid creating intermediate DataFrames (that are eventually thrown out and garbage collected) to compute the final result range -- something that we would ideally like to avoid.
Lastly, "pick your columns directly" is good advice as long as you have a manageably small number of columns to pick. It will, however not be applicable in some cases where ranges span dozens (or possibly hundreds) of columns.

One option for selecting multiple slices is with select_columns from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
from numpy import random
random.seed(3)
df = pd.DataFrame(
random.randn(10, 10),
index=range(0,10),
columns=['A', 'B', 'C', 'D','E','F','G','H','I','J']
)
df.select_columns(slice('A', 'C'), 'E', slice('G', 'I'))
A B C E G H I
0 1.788628 0.436510 0.096497 -0.277388 -0.082741 -0.627001 -0.043818
1 -1.313865 0.884622 0.881318 0.050034 -0.545360 -1.546477 0.982367
2 -1.185047 -0.205650 1.486148 -1.023785 0.625245 -0.160513 -0.768836
3 0.745056 1.976111 -1.244123 -0.803766 -0.923792 -1.023876 1.123978
4 -1.623285 0.646675 -0.356271 -0.596650 -0.873882 0.029714 -2.248258
5 1.013183 0.852798 1.108187 1.487543 0.845833 -1.860890 -0.602885
6 1.048148 1.333738 -0.197415 -0.674728 0.152946 -1.064195 0.437947
7 -1.024931 0.899338 -0.154507 0.483788 0.643163 0.249087 -1.395764
8 -1.370669 0.238563 0.614077 0.145063 -0.024104 -0.888657 -2.915738
9 -0.591079 -0.516417 -0.959996 -0.574708 0.679072 -0.855437 -0.300206
The caveat here is that you have to explicitly use python's builtin slice.
Just like the excellent chosen answer, you can use regular expressions, again, it is explicit use (python's re):
import re
df.select_columns(re.compile('[A-CEG-I]'))
A B C E G H I
0 1.788628 0.436510 0.096497 -0.277388 -0.082741 -0.627001 -0.043818
1 -1.313865 0.884622 0.881318 0.050034 -0.545360 -1.546477 0.982367
2 -1.185047 -0.205650 1.486148 -1.023785 0.625245 -0.160513 -0.768836
3 0.745056 1.976111 -1.244123 -0.803766 -0.923792 -1.023876 1.123978
4 -1.623285 0.646675 -0.356271 -0.596650 -0.873882 0.029714 -2.248258
5 1.013183 0.852798 1.108187 1.487543 0.845833 -1.860890 -0.602885
6 1.048148 1.333738 -0.197415 -0.674728 0.152946 -1.064195 0.437947
7 -1.024931 0.899338 -0.154507 0.483788 0.643163 0.249087 -1.395764
8 -1.370669 0.238563 0.614077 0.145063 -0.024104 -0.888657 -2.915738
9 -0.591079 -0.516417 -0.959996 -0.574708 0.679072 -0.855437 -0.300206
You can go crazy and combine different selection options within the select_columns method.

Related

How to use one dataframe's index to reindex another one in pandas

I am so sorry that I truly don't know what title I should use. But here is my question
Stocks_Open
d-1 d-2 d-3 d-4
000001.HR 1817.670960 1808.937405 1796.928768 1804.570628
000002.ZH 4867.910878 4652.713598 4652.713598 4634.904168
000004.HD 92.046474 92.209029 89.526880 96.435445
000005.SS 28.822245 28.636893 28.358865 28.729569
000006.SH 192.362963 189.174626 185.986290 187.403328
000007.SH 79.190528 80.515892 81.509916 78.693516
Stocks_Volume
d-1 d-2 d-3 d-4
000001.HR 324234 345345 657546 234234
000002.ZH 4867343 465234 4652598 4634168
000004.HD 9246474 929029 826880 965445
000005.SS 2822245 2836893 2858865 2829569
000006.SH 19262963 1897466 1886290 183328
000007.SH 7190528 803892 809916 7693516
Above are the data I parsed from a database, what I exactly want to do is to obtain the correlation of open price and volume in 4 days for each stock (The first column consists of codes of different stocks). In other words, I am trying to calculate the correlation of corresponding rows of each DataFrame. (This is only simplified example, the real data should be extended to more than 1000 different stocks.)
My attempt is to create a dataframe and to run a loop, assigning the results to that dataframe. But here is a problem, which is, the index pf the created dataframe is not exactly what I want. When I tried to append the correlation column, the bug occurred. (Please ignore the value of correlation, which is I concocted here, just to give an example)
r = pd.DataFrame(index = range(6),columns = ['c']
for i in range(6):
r.iloc[i-1,:] = Stocks_Open.iloc[i-1].corr(Stocks_Volume.iloc[i-1])
Correlation_in_4days = pd.concat([Stocks_Open,Stocks_Volume], axis = 1)
Correlation_in_4days['corr'] = r['c']
for i in range(6):
Correlation_in_4days.iloc[i-1,8] = r.iloc[i-1,:]
r c
1 0.654
2 -0.454
3 0.3321
4 0.2166
5 -0.8772
6 0.3256
The bug occurred.
"ValueError: Incompatible indexer with Series"
I realized that my correlation dataframe's index is integer but not the stock code, but I don't know how to fix it, is there any help?
My ideal result is:
corr
000001.HR 0.654
000002.ZH -0.454
000004.HD 0.3321
000005.SS 0.2166
000006.SH -0.8772
000007.SH 0.3256

Try assign the index back
r.index = Stocks_Open.index

How do I create a labeling column for strings based on another DataFrame?

I have the following data frames
import pandas as pd
df_occurencies = pd.DataFrame({'day':[1,2,3,4,5],
'occ':[['frog','wasp','bee'],
['frog','whale','barley','orchid'],
['orchid','barley','frog'],
['orchid','whale','frog'],
['orchid','barley','tulip']]})
df_kingdoms = pd.DataFrame({'item':['frog','wasp','bee',
'whale','barley','orchid',
'tulip'],
'kingdom':['animalia','animalia','animalia',
'animalia','plantae','plantae',
'plantae']})
I need to set up another column, classifying the observations in column occ based on the values of the df_kingdoms values.
The values are all heterogeneous, so the desired outcome would be like this:
day occ desired_result
0 1 [frog, wasp, bee] "animals"
1 2 [frog, whale, barley, orchid] "animals and plants"
2 3 [orchid, barley, frog] "mostly plants"
3 4 [orchid, whale, frog] "mostly animals"
4 5 [orchid, barley, tulip] "plants"
I know that there are many ways to solve this, I've tried unsuccessfully a defined function with lots of .locs that I think is not even worth posting. And I need to perform this on large datasets, so faster is better.

This should do:
dic_kd={i:j for i,j in zip(df_kingdoms.item,df_kingdoms.kingdom)}
desired_output=[]
for I in df_occurencies.occ:
list_aux=[dic_kd[i] for i in I]
if (list_aux.count('animalia')!=0) and (list_aux.count('plantae')==0) :
desired_output.append('animals')
elif (list_aux.count('animalia')==0) and (list_aux.count('plantae')!=0) :
desired_output.append('plants')
elif list_aux.count('animalia')>list_aux.count('plantae'):
desired_output.append('mostly animals')
elif list_aux.count('animalia')<list_aux.count('plantae'):
desired_output.append('mostly plants')
else:
desired_output.append('animals and plants')
df_occurencies['desired output']=desired_output
Tell me if you don't understand anything and I'll help you

Python: numpy, pandas, and performing operations on the previous array value (smoothed averages): any way to not use FOR loop? EWMA?

Tbh, I'm not really sure how to ask this question. I've got an array of values, and I'm looking to take the smoothed average of these values moving forward. In Excel, the calculation process is:
average_val_1 = mean average of values through window_size
average_val_2 = (value at location window_size+1 * window_size-1 + average_val_1) / window_size
average_val_3 = (value at location window_size+2 * window_size-1 + average_val_2) / window_size
etc., etc.
In pandas and numpy, my code for this is the following
df = pd.DataFrame({'av':np.nan, 'values':np.random.rand(10)})
df = df[['values','av']]
window = 5
df['av'].iloc[5] = np.mean(df['values'][:5])
for i in range(window+1,len(df.index)):
df['av'].iloc[i] = (df['values'].iloc[i] * (window-1) + df['av'].iloc[i-1])/window
Which returns:
values av
0 0.418498 NaN
1 0.570326 NaN
2 0.296878 NaN
3 0.308445 NaN
4 0.127376 NaN
5 0.381160 0.344305
6 0.239725 0.260641
7 0.928491 0.794921
8 0.711632 0.728290
9 0.319791 0.401491
These are the values I am looking for, but there has to be a better way than using for loops. I think the answer has something to do with using exponentially weighted moving averages, but I'll be damned if I can figure out the syntax to make any sense of that.
Any suggestions?

you can use ewm such as:
window = 5
df['av'] = np.nan
df['av'].iloc[window] = np.mean(df['values'][:window])
df.loc[window:,'av'] = (df.loc[window:,'av'].fillna(df['values'])
.ewm(adjust=False, alpha=(window-1.)/window).mean())
and you get the same result than with your loop for. To be sure it works, column 'av' must be nan otherwise the fillna with column 'values' will not happen and the value calculted in 'av' will be wrong. The parameter alpha in ewm is what helps you to weigth the row you are calculating.
Note: while this code does as yours, I would recommend to have a look at this line in your code:
df['av'].iloc[5] = np.mean(df['values'][:5])
Because of the exclusion of the uppper bound when doing slicing [:5], df['values'][:5] is:
0 0.418498
1 0.570326
2 0.296878
3 0.308445
4 0.127376
Name: values, dtype: float64
so I think that what you should do is df['av'].iloc[4] = np.mean(df['values'][:5]). If you agree, then my above must be slightly changed
df['av'].iloc[window-1] = np.mean(df['values'][:window])
df.loc[window-1:,'av'] = (df.loc[window-1:,'av'].fillna(df['values'])
.ewm(adjust=False, alpha=(window-1.)/window).mean())

Operating on pandas dataframes that may or may not be multiIndex

I have a few functions that make new columns in a pandas dataframe, as a function of existing columns in the dataframe. I have two different scenarios that occur here: (1) the dataframe is NOT multiIndex and has a set of columns, say [a,b] and (2) the dataframe is multiIndex and now has the same set of columns headers repeated N times, say [(a,1),(b,1),(a,2),(b,2)....(a,N),(n,N)].
I've been making the aforementioned functions in the style shown below:
def f(df):
if multiindex(df):
for s df[a].columns:
df[c,s] = someFunction(df[a,s], df[b,s])
else:
df[c] = someFunction(df[a], df[b])
Is there another way to do this, without having these if-multi-index/else statement everywhere and duplicating the someFunction code? I'd prefer NOT to split the multi indexed frame into N smaller dataframes (I often need to filter data or do things and keep the rows consistent across all the 1,2,...N frames, and keeping them together in one frame seems the to be the best way to do that).

you may still have to test if columns is a MultiIndex but this should be cleaner and more efficient. Caveat, this will not work if your function utilizes summary statistics on the column. For example, if someFunction divides by the the average of column 'a'.
Solution
def someFunction(a, b):
return a + b
def f(df):
df = df.copy()
ismi = isinstance(df.columns, pd.MultiIndex)
if ismi:
df = df.stack()
df['c'] = someFunction(df['a'], df['a'])
if ismi:
df = df.unstack()
return df
Setup
import pandas as pd
import numpy as np
setup_tuples = []
for c in ['a', 'b']:
for i in ['one', 'two', 'three']:
setup_tuples.append((c, i))
columns = pd.MultiIndex.from_tuples(setup_tuples)
rand_array = np.random.rand(10, len(setup_tuples))
df = pd.DataFrame(rand_array, columns=columns)
df looks like this
a b
one two three one two three
0 0.282834 0.490313 0.201300 0.140157 0.467710 0.352555
1 0.838527 0.707131 0.763369 0.265170 0.452397 0.968125
2 0.822786 0.785226 0.434637 0.146397 0.056220 0.003197
3 0.314795 0.414096 0.230474 0.595133 0.060608 0.900934
4 0.334733 0.118689 0.054299 0.237786 0.658538 0.057256
5 0.993753 0.552942 0.665615 0.336948 0.788817 0.320329
6 0.310809 0.199921 0.158675 0.059406 0.801491 0.134779
7 0.971043 0.183953 0.723950 0.909778 0.103679 0.695661
8 0.755384 0.728327 0.029720 0.408389 0.808295 0.677195
9 0.276158 0.978232 0.623972 0.897015 0.253178 0.093772
I constructed df to have MultiIndex columns. What I'd do is use the .stack() method to push the second level of the column index to be the second level of the row index.
df.stack() looks like this
a b
0 one 0.282834 0.140157
three 0.201300 0.352555
two 0.490313 0.467710
1 one 0.838527 0.265170
three 0.763369 0.968125
two 0.707131 0.452397
2 one 0.822786 0.146397
three 0.434637 0.003197
two 0.785226 0.056220
3 one 0.314795 0.595133
three 0.230474 0.900934
two 0.414096 0.060608
4 one 0.334733 0.237786
three 0.054299 0.057256
two 0.118689 0.658538
5 one 0.993753 0.336948
three 0.665615 0.320329
two 0.552942 0.788817
6 one 0.310809 0.059406
three 0.158675 0.134779
two 0.199921 0.801491
7 one 0.971043 0.909778
three 0.723950 0.695661
two 0.183953 0.103679
8 one 0.755384 0.408389
three 0.029720 0.677195
two 0.728327 0.808295
9 one 0.276158 0.897015
three 0.623972 0.093772
two 0.978232 0.253178
Now you can operate on df.stack() as if the columns were not a MultiIndex
Demonstration
print f(df)
will give you what you want
a b c \
one three two one three two one
0 0.282834 0.201300 0.490313 0.140157 0.352555 0.467710 0.565667
1 0.838527 0.763369 0.707131 0.265170 0.968125 0.452397 1.677055
2 0.822786 0.434637 0.785226 0.146397 0.003197 0.056220 1.645572
3 0.314795 0.230474 0.414096 0.595133 0.900934 0.060608 0.629591
4 0.334733 0.054299 0.118689 0.237786 0.057256 0.658538 0.669465
5 0.993753 0.665615 0.552942 0.336948 0.320329 0.788817 1.987507
6 0.310809 0.158675 0.199921 0.059406 0.134779 0.801491 0.621618
7 0.971043 0.723950 0.183953 0.909778 0.695661 0.103679 1.942086
8 0.755384 0.029720 0.728327 0.408389 0.677195 0.808295 1.510767
9 0.276158 0.623972 0.978232 0.897015 0.093772 0.253178 0.552317
three two
0 0.402600 0.980626
1 1.526739 1.414262
2 0.869273 1.570453
3 0.460948 0.828193
4 0.108599 0.237377
5 1.331230 1.105884
6 0.317349 0.399843
7 1.447900 0.367907
8 0.059439 1.456654
9 1.247944 1.956464

How to store formulas, instead of values, in pandas DataFrame

Is it possible to work with pandas DataFrame as with an Excel spreadsheet: say, by entering a formula in a column so that when variables in other columns change, the values in this column change automatically? Something like:
a b c
2 3 =a+b
And so when I update 2 or 3, the column c also updates automatically.
PS: It's clearly possible to write a function to return a+b, but is there any built-in functionality in pandas or in other Python libraries to work with matrices this way?

This will work in 0.13 (still in development)
In [19]: df = DataFrame(randn(10,2),columns=list('ab'))
In [20]: df
Out[20]:
a b
0 0.958465 0.679193
1 -0.769077 0.497436
2 0.598059 0.457555
3 0.290926 -1.617927
4 -0.248910 -0.947835
5 -1.352096 -0.568631
6 0.009125 0.711511
7 -0.993082 -1.440405
8 -0.593704 0.352468
9 0.523332 -1.544849
This will be possible as 'a + b' (soon)
In [21]: formulas = { 'c' : 'df.a + df.b' }
In [22]: def update(df,formulas):
for k, v in formulas.items():
df[k] = pd.eval(v)
In [23]: update(df,formulas)
In [24]: df
Out[24]:
a b c
0 0.958465 0.679193 1.637658
1 -0.769077 0.497436 -0.271642
2 0.598059 0.457555 1.055614
3 0.290926 -1.617927 -1.327001
4 -0.248910 -0.947835 -1.196745
5 -1.352096 -0.568631 -1.920726
6 0.009125 0.711511 0.720636
7 -0.993082 -1.440405 -2.433487
8 -0.593704 0.352468 -0.241236
9 0.523332 -1.544849 -1.021517
You could implement a hook into setitem on the data frame to have this type of function called automatically. But pretty tricky. You didn't specify how the frame is updated in the first place. Would probably be easiest to simply call the update function after you change the values

I don't know it it is what you want, but I accidentally discovered that you can store xlwt.Formula objects in the DataFrame cells, and then, using DataFrame.to_excel method, export the DataFrame to excel and have your formulas in it:
import pandas
import xlwt
formulae=[]
formulae.append(xlwt.Formula('SUM(F1:F5)'))
formulae.append(xlwt.Formula('SUM(G1:G5)'))
formulae.append(xlwt.Formula('SUM(H1:I5)'))
formulae.append(xlwt.Formula('SUM(I1:I5)'))
df=pandas.DataFrame(formula)
df.to_excel('FormulaTest.xls')
Try it...

There's currently no way to do this exactly in the way that you describe.
In pandas 0.13 there will be a new DataFrame.eval method that will allow you to evaluate an expression in the "context" of a DataFrame. For example, you'll be able to df['c'] = df.eval('a + b').

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Select multiple columns by labels in pandas - python

Just pick the columns you want directly.... df[['A','E','I','C']]

Related

How to use one dataframe's index to reindex another one in pandas

How do I create a labeling column for strings based on another DataFrame?

Python: numpy, pandas, and performing operations on the previous array value (smoothed averages): any way to not use FOR loop? EWMA?

Operating on pandas dataframes that may or may not be multiIndex

How to store formulas, instead of values, in pandas DataFrame

Categories

Resources