Say I group a Pandas DataFrame by some column(s):
df.groupby(cols)
Are the groups sorted according to any criterion?
One way to retrieve a group is:
ix = 0
grouped.get_group(list(grouped.groups.keys())[ix])  # list() needed on Python 3
but it is a bit verbose, and it's not clear that keys() above will give the groups in order.
Another way:
df = df.set_index(col)
idx = pd.IndexSlice
df.loc[idx[df.index.levels[0][0], :], :]
but again, that's really verbose.
Is there another way to get a group by its integer index?
groupby has a sort parameter which is True by default, so the groups are sorted by key. As for getting the nth group, it looks like you'd have to define a function and use an internal API:
In [123]: df = DataFrame({'a': [1,1,1,1,2,2,3,3,3,3], 'b': randn(10)})
In [124]: df
Out[124]:
a b
0 1 1.5665
1 1 -0.2558
2 1 0.0756
3 1 -0.2821
4 2 0.8670
5 2 -2.0043
6 3 -1.3393
7 3 0.3898
8 3 -0.3392
9 3 1.2198
[10 rows x 2 columns]
In [125]: gb = df.groupby('a')
In [126]: def nth_group(gb, n):
.....: keys = gb.grouper._get_group_keys()
.....: return gb.get_group(keys[n])
.....:
In [127]: nth_group(gb, 0)
Out[127]:
a b
0 1 1.5665
1 1 -0.2558
2 1 0.0756
3 1 -0.2821
[4 rows x 2 columns]
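Side note: since the keys are sorted, the questioner's own approach also works and can be wrapped up without the internal API (a minimal sketch, assuming grouped.groups preserves the group order, which it does in current pandas):
def nth_group(gb, n):
    # gb.groups maps each group key to its row labels, in group order
    return gb.get_group(list(gb.groups)[n])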
How about:
key, df2 = next(iter(grouped))
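That only yields the first (key, sub-DataFrame) pair. A sketch generalizing it to the nth group with itertools.islice (nth_group_iter is a hypothetical helper name):
from itertools import islice

def nth_group_iter(gb, n):
    # skip the first n pairs; iteration follows the (sorted, by default) key order
    return next(islice(iter(gb), n, None))

key, df2 = nth_group_iter(grouped, 1)  # second group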
Assume the following dataframe
>>> import pandas as pd
>>> L = [(1,'A',9,9), (1,'C',8,8), (1,'D',4,5),(2,'H',7,7),(2,'L',5,5)]
>>> df = pd.DataFrame.from_records(L).set_index([0,1])
>>> df
2 3
0 1
1 A 9 9
C 8 8
D 4 5
2 H 7 7
L 5 5
I want to keep the rows in the nth position of level 1 within each level-0 group of the MultiIndex, e.g. keeping the first:
2 3
0 1
1 A 9 9
2 H 7 7
or keeping the third:
2 3
0 1
1 D 4 5
How can I achieve this?
You can filter rows with GroupBy.nth after grouping on the first level of the MultiIndex. Since n is 0-based, pass 0 for the first row and 2 for the third:
1) To select the first row grouped per level=0:
df.groupby(level=0, as_index=False).nth(0)
2) To select the third row grouped per level=0:
df.groupby(level=0, as_index=False).nth(2)
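A minimal end-to-end sketch using the question's data (nth's exact index handling has varied a little across pandas versions, so treat this as illustrative):
import pandas as pd

L = [(1,'A',9,9), (1,'C',8,8), (1,'D',4,5), (2,'H',7,7), (2,'L',5,5)]
df = pd.DataFrame.from_records(L).set_index([0, 1])

df.groupby(level=0, as_index=False).nth(0)  # first row of each level-0 group
df.groupby(level=0, as_index=False).nth(2)  # third row; group 2 has no third row, so only (1, 'D') remains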
I have multiple DataFrames that I want to do the same thing to.
First I create a list of the DataFrames. All of them have the same column called 'result'.
df_list = [df1,df2,df3]
I want to keep only the rows in all the DataFrames with value 'passed' so I use a for loop on my list:
for df in df_list:
    df = df[df['result'] == 'passed']
...this does not work, the values are not filtered out of each DataFrame.
If I filter each one separately then it does work.
df1 = df1[df1['result'] == 'passed']
df2 = df2[df2['result'] == 'passed']
df3 = df3[df3['result'] == 'passed']
This is because every time you do a subset like df[<whatever>] you are returning a new dataframe and assigning it to the df loop variable, which is rebound on each iteration (although you do keep the last one). This is similar to slicing lists:
>>> list1 = [1,2,3,4]
>>> list2 = [11,12,13,14]
>>> for lyst in list1,list2:
... lyst = lyst[1:-1]
...
>>> list1, list2
([1, 2, 3, 4], [11, 12, 13, 14])
>>> lyst
[12, 13]
Usually, you need to use a mutator method if you want to actually modify the lists in-place. Equivalently, with a dataframe, you could use assignment on an indexer, e.g. .loc or .iloc (the older .ix is deprecated), in combination with the .dropna method, being careful to pass the inplace=True argument. Suppose I have three dataframes and I want to keep only the rows where my second column is positive:
Warning: this way is not ideal; see the edit below for a better way.
In [11]: df1
Out[11]:
0 1 2 3
0 0.957288 -0.170286 0.406841 -3.058443
1 1.762343 -1.837631 -0.867520 1.666193
2 0.618665 0.660312 -1.319740 -0.024854
3 -2.008017 -0.445997 -0.028739 -0.227665
4 0.638419 -0.271300 -0.918894 1.524009
5 0.957006 1.181246 0.513298 0.370174
6 0.613378 -0.852546 -1.778761 -1.386848
7 -1.891993 -0.304533 -1.427700 0.099904
In [12]: df2
Out[12]:
0 1 2 3
0 -0.521018 0.407258 -1.167445 -0.363503
1 -0.879489 0.008560 0.224466 -0.165863
2 0.550845 -0.102224 -0.575909 -0.404770
3 -1.171828 -0.912451 -1.197273 0.719489
4 -0.887862 1.073306 0.351835 0.313953
5 -0.517824 -0.096929 -0.300282 0.716020
6 -1.121527 0.183219 0.938509 0.842882
7 0.003498 -2.241854 -1.146984 -0.751192
In [13]: df3
Out[13]:
0 1 2 3
0 0.240411 0.795132 -0.305770 -0.332253
1 -1.162097 0.055346 0.094363 -1.254859
2 -0.493466 -0.717872 1.090417 -0.591872
3 1.021246 -0.060453 -0.013952 0.304933
4 -0.859882 -0.947950 0.562609 1.313632
5 0.917199 1.186865 0.354839 -1.771787
6 -0.694799 -0.695505 -1.077890 -0.880563
7 1.088068 -0.893466 -0.188419 -0.451623
In [14]: for df in df1, df2, df3:
....: df.loc[:,:] = df.loc[df[1] > 0,:]
....: df.dropna(inplace = True,axis =0)
....:
In [15]: df1
Out[15]:
0 1 2 3
2 0.618665 0.660312 -1.319740 -0.024854
5 0.957006 1.181246 0.513298 0.370174
In [16]: df2
Out[16]:
0 1 2 3
0 -0.521018 0.407258 -1.167445 -0.363503
1 -0.879489 0.008560 0.224466 -0.165863
4 -0.887862 1.073306 0.351835 0.313953
6 -1.121527 0.183219 0.938509 0.842882
In [17]: df3
Out[17]:
0 1 2 3
0 0.240411 0.795132 -0.305770 -0.332253
1 -1.162097 0.055346 0.094363 -1.254859
5 0.917199 1.186865 0.354839 -1.771787
Edited to Add:
I think I found a better way just using the .drop method.
In [21]: df1
Out[21]:
0 1 2 3
0 -0.804913 -0.481498 0.076843 1.136567
1 -0.457197 -0.903681 -0.474828 1.289443
2 -0.820710 1.610072 0.175455 0.712052
3 0.715610 -0.178728 -0.664992 1.261465
4 -0.297114 -0.591935 0.487698 0.760450
5 1.035231 -0.108825 -1.058996 0.056320
6 1.579931 0.958331 -0.653261 -0.171245
7 0.685427 1.447411 0.001002 0.241999
In [22]: df2
Out[22]:
0 1 2 3
0 1.660864 0.110002 0.366881 1.765541
1 -0.627716 1.341457 -0.552313 0.578854
2 0.277738 0.128419 -0.279720 -1.197483
3 -1.294724 1.396698 0.108767 1.353454
4 -0.379995 0.215192 1.446584 0.530020
5 0.557042 0.339192 -0.105808 -0.693267
6 1.293941 0.203973 -3.051011 1.638143
7 -0.909982 1.998656 -0.057350 2.279443
In [23]: df3
Out[23]:
0 1 2 3
0 -0.002327 -2.054557 -1.752107 -0.911178
1 -0.998328 -1.119856 1.468124 -0.961131
2 -0.048568 0.373192 -0.666330 0.867719
3 0.533597 -1.222963 0.119789 -0.037949
4 1.203075 -0.773511 0.475809 1.352943
5 -0.984069 -0.352267 -0.313516 0.138259
6 0.114596 0.354404 2.119963 -0.452462
7 -1.033029 -0.787237 0.479321 -0.818260
In [25]: for df in df1,df2,df3:
....: df.drop(df.index[df[1] < 0],axis=0,inplace=True)
....:
In [26]: df1
Out[26]:
0 1 2 3
2 -0.820710 1.610072 0.175455 0.712052
6 1.579931 0.958331 -0.653261 -0.171245
7 0.685427 1.447411 0.001002 0.241999
In [27]: df2
Out[27]:
0 1 2 3
0 1.660864 0.110002 0.366881 1.765541
1 -0.627716 1.341457 -0.552313 0.578854
2 0.277738 0.128419 -0.279720 -1.197483
3 -1.294724 1.396698 0.108767 1.353454
4 -0.379995 0.215192 1.446584 0.530020
5 0.557042 0.339192 -0.105808 -0.693267
6 1.293941 0.203973 -3.051011 1.638143
7 -0.909982 1.998656 -0.057350 2.279443
In [28]: df3
Out[28]:
0 1 2 3
2 -0.048568 0.373192 -0.666330 0.867719
6 0.114596 0.354404 2.119963 -0.452462
Certainly faster:
In [8]: timeit.Timer(stmt="df.loc[:,:] = df.loc[df[1] > 0, :];df.dropna(inplace = True,axis =0)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[8]: 23.69621358400036
In [9]: timeit.Timer(stmt="df.drop(df.index[df[1] < 0],axis=0,inplace=True)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[9]: 11.476448250003159
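If in-place mutation isn't actually required, a simpler alternative (a sketch, not part of the original answer) is to rebuild the list, rebinding each element to its filtered copy:
df_list = [df[df['result'] == 'passed'] for df in df_list]
This sidesteps the rebinding pitfall entirely, at the cost of leaving the original df1, df2 and df3 untouched.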
If multiple columns contain the pass condition, e.g. a score of 40 or more is 'passed', otherwise 'failed':
import numpy as np

col_list_contains_passed = ['result1', 'result2']
df_list = [df1, df2, df3]
for d in df_list:
    for c in col_list_contains_passed:
        d[c] = np.where(d[c] >= 40, 'passed', 'failed')  # adapt the condition as needed
(Assigning to a column with d[c] = ... mutates the frame in place, so this loop does work, unlike rebinding d itself.) Then you can filter each DataFrame individually on the condition:
df1 = df1[df1['col_name'] == 'xyz']
How do I find the most frequent value of each row of a dataframe?
For example:
In [14]: df
Out[14]:
a b c
0 2 3 3
1 1 1 2
2 7 7 8
should return:
[3,1,7]
Try the .mode() method:
In [88]: df
Out[88]:
a b c
0 2 3 3
1 1 1 2
2 7 7 8
In [89]: df.mode(axis=1)
Out[89]:
0
0 3
1 1
2 7
From docs:
Gets the mode(s) of each element along the axis selected. Adds a row
for each mode per label, fills in gaps with nan.
Note that there could be multiple values returned for the selected
axis (when more than one item share the maximum frequency), which is
the reason why a dataframe is returned. If you want to impute missing
values with the mode in a dataframe df, you can just do this:
df.fillna(df.mode().iloc[0])
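A small self-contained sketch of that imputation idiom (the frame here is made up purely for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, np.nan, 2], 'b': ['x', np.nan, 'x', 'y']})
# df.mode() returns one row per mode; .iloc[0] takes the first mode of each
# column, and fillna aligns that Series on column names to fill the gaps
df = df.fillna(df.mode().iloc[0])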
It's a generic question about filtering a pandas dataframe using a list. The problem is the following:
I have a pandas dataframe df with a column field
I have a list of banned fields, for example ban_field=['field1','field2','field3']
All elements of ban_field appear in df.field
For the moment, to retrieve the dataframe without the banned fields, I proceed as follows:
for f in ban_field:
    df = df[df.field != f]
Is there a more pythonic way to proceed (in one line?)?
Method #1: use isin and a boolean array selector:
In [47]: df = pd.DataFrame({"a": [2]*10, "field": range(10)})
In [48]: ban_field = [3,4,6,7,8]
In [49]: df[~df.field.isin(ban_field)]
Out[49]:
a field
0 2 0
1 2 1
2 2 2
5 2 5
9 2 9
[5 rows x 2 columns]
Method #2: use query:
In [51]: df.query("field not in @ban_field")
Out[51]:
a field
0 2 0
1 2 1
2 2 2
5 2 5
9 2 9
[5 rows x 2 columns]
You can remove them by using the isin method and the negation (~) operator:
df[~df.field.isin(ban_field)]
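Note that this returns a new DataFrame rather than modifying df in place; assign the result back if you want to keep it:
df = df[~df.field.isin(ban_field)]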