I have multiple DataFrames that I want to do the same thing to.
First I create a list of the DataFrames. All of them have the same column called 'result'.
df_list = [df1, df2, df3]
I want to keep only the rows where 'result' is 'passed', so I use a for loop over my list:
for df in df_list:
    df = df[df['result'] == 'passed']
...this does not work; the rows are not filtered out of each DataFrame.
If I filter each one separately then it does work.
df1 = df1[df1['result'] == 'passed']
df2 = df2[df2['result'] == 'passed']
df3 = df3[df3['result'] == 'passed']
This is because a subset like df[<whatever>] returns a new DataFrame, which is then assigned to the df loop variable; that binding is simply overwritten on the next iteration (although you do keep the last one), and the original DataFrames are never modified. This is similar to slicing lists:
>>> list1 = [1,2,3,4]
>>> list2 = [11,12,13,14]
>>> for lyst in list1,list2:
... lyst = lyst[1:-1]
...
>>> list1, list2
([1, 2, 3, 4], [11, 12, 13, 14])
>>> lyst
[12, 13]
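If you just want new, filtered DataFrames rather than mutating the originals, a minimal sketch of the fix is to assign the result back into the list by position:

# Rebind each filtered frame into the list instead of into the loop variable.
for i, df in enumerate(df_list):
    df_list[i] = df[df['result'] == 'passed']

# Note: df1, df2 and df3 still point at the unfiltered originals;
# only the list's contents were replaced.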
Usually, you need to use a mutator method if you want to actually modify the lists in-place. Equivalently, with a DataFrame, you can assign through an indexer such as .loc or .iloc, in combination with the .dropna method, being careful to pass the inplace=True argument. Suppose I have three DataFrames and I want to keep only the rows where the second column is positive:
Warning: this approach is not ideal; see the edit below for a better way.
In [11]: df1
Out[11]:
          0         1         2         3
0  0.957288 -0.170286  0.406841 -3.058443
1  1.762343 -1.837631 -0.867520  1.666193
2  0.618665  0.660312 -1.319740 -0.024854
3 -2.008017 -0.445997 -0.028739 -0.227665
4  0.638419 -0.271300 -0.918894  1.524009
5  0.957006  1.181246  0.513298  0.370174
6  0.613378 -0.852546 -1.778761 -1.386848
7 -1.891993 -0.304533 -1.427700  0.099904
In [12]: df2
Out[12]:
          0         1         2         3
0 -0.521018  0.407258 -1.167445 -0.363503
1 -0.879489  0.008560  0.224466 -0.165863
2  0.550845 -0.102224 -0.575909 -0.404770
3 -1.171828 -0.912451 -1.197273  0.719489
4 -0.887862  1.073306  0.351835  0.313953
5 -0.517824 -0.096929 -0.300282  0.716020
6 -1.121527  0.183219  0.938509  0.842882
7  0.003498 -2.241854 -1.146984 -0.751192
In [13]: df3
Out[13]:
          0         1         2         3
0  0.240411  0.795132 -0.305770 -0.332253
1 -1.162097  0.055346  0.094363 -1.254859
2 -0.493466 -0.717872  1.090417 -0.591872
3  1.021246 -0.060453 -0.013952  0.304933
4 -0.859882 -0.947950  0.562609  1.313632
5  0.917199  1.186865  0.354839 -1.771787
6 -0.694799 -0.695505 -1.077890 -0.880563
7  1.088068 -0.893466 -0.188419 -0.451623
In [14]: for df in df1, df2, df3:
   ....:     df.loc[:, :] = df.loc[df[1] > 0, :]
   ....:     df.dropna(inplace=True, axis=0)
   ....:
In [15]: df1
Out[15]:
          0         1         2         3
2  0.618665  0.660312 -1.319740 -0.024854
5  0.957006  1.181246  0.513298  0.370174
In [16]: df2
Out[16]:
          0         1         2         3
0 -0.521018  0.407258 -1.167445 -0.363503
1 -0.879489  0.008560  0.224466 -0.165863
4 -0.887862  1.073306  0.351835  0.313953
6 -1.121527  0.183219  0.938509  0.842882
In [17]: df3
Out[17]:
          0         1         2         3
0  0.240411  0.795132 -0.305770 -0.332253
1 -1.162097  0.055346  0.094363 -1.254859
5  0.917199  1.186865  0.354839 -1.771787
Edited to Add:
I think I found a better way just using the .drop method.
In [21]: df1
Out[21]:
          0         1         2         3
0 -0.804913 -0.481498  0.076843  1.136567
1 -0.457197 -0.903681 -0.474828  1.289443
2 -0.820710  1.610072  0.175455  0.712052
3  0.715610 -0.178728 -0.664992  1.261465
4 -0.297114 -0.591935  0.487698  0.760450
5  1.035231 -0.108825 -1.058996  0.056320
6  1.579931  0.958331 -0.653261 -0.171245
7  0.685427  1.447411  0.001002  0.241999
In [22]: df2
Out[22]:
          0         1         2         3
0  1.660864  0.110002  0.366881  1.765541
1 -0.627716  1.341457 -0.552313  0.578854
2  0.277738  0.128419 -0.279720 -1.197483
3 -1.294724  1.396698  0.108767  1.353454
4 -0.379995  0.215192  1.446584  0.530020
5  0.557042  0.339192 -0.105808 -0.693267
6  1.293941  0.203973 -3.051011  1.638143
7 -0.909982  1.998656 -0.057350  2.279443
In [23]: df3
Out[23]:
          0         1         2         3
0 -0.002327 -2.054557 -1.752107 -0.911178
1 -0.998328 -1.119856  1.468124 -0.961131
2 -0.048568  0.373192 -0.666330  0.867719
3  0.533597 -1.222963  0.119789 -0.037949
4  1.203075 -0.773511  0.475809  1.352943
5 -0.984069 -0.352267 -0.313516  0.138259
6  0.114596  0.354404  2.119963 -0.452462
7 -1.033029 -0.787237  0.479321 -0.818260
In [25]: for df in df1, df2, df3:
   ....:     df.drop(df.index[df[1] < 0], axis=0, inplace=True)
   ....:
In [26]: df1
Out[26]:
          0         1         2         3
2 -0.820710  1.610072  0.175455  0.712052
6  1.579931  0.958331 -0.653261 -0.171245
7  0.685427  1.447411  0.001002  0.241999
In [27]: df2
Out[27]:
          0         1         2         3
0  1.660864  0.110002  0.366881  1.765541
1 -0.627716  1.341457 -0.552313  0.578854
2  0.277738  0.128419 -0.279720 -1.197483
3 -1.294724  1.396698  0.108767  1.353454
4 -0.379995  0.215192  1.446584  0.530020
5  0.557042  0.339192 -0.105808 -0.693267
6  1.293941  0.203973 -3.051011  1.638143
7 -0.909982  1.998656 -0.057350  2.279443
In [28]: df3
Out[28]:
          0         1         2         3
2 -0.048568  0.373192 -0.666330  0.867719
6  0.114596  0.354404  2.119963 -0.452462
Certainly faster:
In [8]: timeit.Timer(stmt="df.loc[:,:] = df.loc[df[1] > 0, :];df.dropna(inplace = True,axis =0)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[8]: 23.69621358400036
In [9]: timeit.Timer(stmt="df.drop(df.index[df[1] < 0],axis=0,inplace=True)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[9]: 11.476448250003159
If multiple columns encode a pass/fail condition (say, a score of 40 or more counts as passed, otherwise failed), you can loop over both the DataFrames and the columns:
import numpy as np

col_list_contains_passed = ['result1', 'result2']
df_list = [df1, df2, df3]
for d in df_list:
    for c in col_list_contains_passed:
        d[c] = np.where(d[c] >= 40, 'passed', 'failed')  # put your own condition here
This works because column assignment mutates each DataFrame in place. Then you can filter each individual DataFrame by condition:
df1 = df1[df1['col_name'] == 'xyz']
I have a pandas.DataFrame as follows:
df1 =
   a  b
0  1  2
1  3  4
I'd like to repeat each row three times, to get:
df2 =
   a  b
0  1  2
0  1  2
0  1  2
1  3  4
1  3  4
1  3  4
I currently build df2 with a loop, but that is not efficient. How can I get df2 from df1 in a vectorized way that is faster?
Build a one-dimensional indexer to slice both the values array and the index. You must take care of the index as well to get your desired results.
use np.repeat on an np.arange to get the indexer
construct a new dataframe using this indexer on both values and the index
r = np.arange(len(df)).repeat(3)
pd.DataFrame(df.values[r], df.index[r], df.columns)
   a  b
0  1  2
0  1  2
0  1  2
1  3  4
1  3  4
1  3  4
You can use np.repeat
df = pd.DataFrame(np.repeat(df.values, [3, 3], axis=0), columns=df.columns)
You get
   a  b
0  1  2
1  1  2
2  1  2
3  3  4
4  3  4
5  3  4
Time testing:
%timeit pd.DataFrame(np.repeat(df.values,[3,3], axis = 0))
1000 loops, best of 3: 235 µs per loop
%timeit pd.concat([df] * 3).sort_index()
best of 3: 1.26 ms per loop
NumPy is definitely faster in most cases, so no surprises there.
EDIT: I am not sure if you are looking for repeated indices, but in case you are:
pd.DataFrame(np.repeat(df.values, 3, axis=0), index=np.repeat(df.index, 3), columns=df.columns)
I do not know if it is more efficient than your loop, but it is easy enough to construct:
Code:
pd.concat([df] * 3).sort_index()
Test Code:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))
print(pd.concat([df] * 3).sort_index())
Results:
   a  b
0  1  2
0  1  2
0  1  2
1  3  4
1  3  4
1  3  4
You can use numpy.repeat with the scalar 3 as its repeats parameter, and add the columns parameter to the DataFrame constructor:
df = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
print (df)
   a  b
0  1  2
1  1  2
2  1  2
3  3  4
4  3  4
5  3  4
If you really want the duplicated index, note that it can complicate some pandas functions; reindex, for example, fails on it (see the sketch below the output):
r = np.repeat(np.arange(len(df.index)), 3)
df = pd.DataFrame(df.values[r], df.index[r], df.columns)
print (df)
   a  b
0  1  2
0  1  2
0  1  2
1  3  4
1  3  4
1  3  4
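To see that failure concretely, here is a minimal sketch; the exact error message varies by pandas version:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))
r = np.repeat(np.arange(len(df.index)), 3)
df = pd.DataFrame(df.values[r], df.index[r], df.columns)

# Label-based reindexing is ambiguous with duplicate labels, so it raises
# ValueError, e.g. "cannot reindex from a duplicate axis".
try:
    df.reindex([0, 1])
except ValueError as e:
    print(e)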
Not the fastest (not the slowest either), but the shortest solution so far.
# Build an index array and extract the rows to build the desired new df.
# This handles index and data all at once.
In [270]: df.iloc[np.repeat(df.index, 3)]
Out[270]:
   a  b
0  1  2
0  1  2
0  1  2
1  3  4
1  3  4
1  3  4
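One caveat worth noting: np.repeat(df.index, 3) repeats the index labels, and .iloc expects integer positions, so this only works because the example's index already equals 0..n-1. A sketch that works for an arbitrary index repeats positions instead:

import numpy as np

# Repeat positions rather than labels, so any index works.
df.iloc[np.repeat(np.arange(len(df)), 3)]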
Is it possible to put percentile cuts on all columns of a dataframe without using a loop? This is how I am doing it now:
df = pd.DataFrame(np.random.randn(10,5))
df_q = pd.DataFrame()
for i in list(range(len(df.columns))):
    df_q[i] = pd.qcut(df[i], 5, labels=list(range(5)))
I am hoping there is a slick pandas solution for this to avoid the use of a loop.
Thanks!
pd.qcut accepts a 1D array or Series as its argument. Applying pd.qcut to every column therefore requires multiple calls to pd.qcut. So no matter how you dress it up, there will be a loop, either explicit or implicit.
You could, for example, use apply to call pd.qcut on each column:
In [46]: df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
Out[46]:
   0  1  2  3  4
0  4  0  3  0  3
1  0  0  2  3  0
2  3  4  1  2  3
3  4  1  1  1  4
4  3  2  2  4  1
5  2  4  3  0  1
6  2  3  0  4  4
7  1  3  4  2  2
8  0  1  4  3  0
9  1  2  0  1  2
but under the hood, df.apply uses a for-loop, so it really isn't very different from your for-loop:
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
In [47]: %timeit df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
100 loops, best of 3: 2.9 ms per loop
In [48]: %%timeit
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
100 loops, best of 3: 2.95 ms per loop
Note that
for i in list(range(len(df.columns))):
will only work if the columns of df happen to be sequential integers starting at 0.
It is more robust to use
for col in df:
to iterate over the columns of the DataFrame.
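A small sketch of the difference, with hypothetical string column labels: df[0] would raise a KeyError here, while iterating the DataFrame yields each column label regardless of its type:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2), columns=['foo', 'bar'])

df_q = pd.DataFrame()
for col in df:  # yields 'foo', then 'bar'
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))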
From one of my scripts I end up with a big DataFrame in pandas, and one of its columns looks like this:
13 [1705916]
14 [116242799]
15 [17865718]
...
9551 [74736013]
9553 []
9620 [92090990]
9666 [113455]
9667 [327478610]
9733 [52782791]
9838 []
9951 [229462842]
9952 []
10070 []
When I do type(df.column_of_interest)
I get back <class 'pandas.core.series.Series'>.
So my question is: is it possible to extract the data from these lists in a DataFrame column while keeping the rows whose lists are empty?
expected output:
13 1705916
14 116242799
15 17865718
...
9551 74736013
9553
9620 92090990
9666 113455
9667 327478610
9733 52782791
9838
9951 229462842
9952
10070
If the column currently holds lists of integers, you can use Series.apply along with str.join() to get what you want. Example -
In [42]: df = pd.DataFrame([[1,[2]],[2,[3]],[3,[]],[4,[5,6]]], columns=['A','B'])
In [43]: df
Out[43]:
   A       B
0  1     [2]
1  2     [3]
2  3      []
3  4  [5, 6]
In [44]: df['B'] = df['B'].apply(lambda x: ','.join([str(i) for i in x]))
In [45]: df
Out[45]:
   A    B
0  1    2
1  2    3
2  3
3  4  5,6
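As a variant on the original list column (before the join above): if every list holds at most one value and you would rather end up with a numeric column where empty lists become missing values, a sketch of the same apply pattern:

import numpy as np

# Assumes lists of length 0 or 1: take the single element, or NaN when empty.
df['B'] = df['B'].apply(lambda x: x[0] if x else np.nan)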
I'd like to have my dataframe start with index 1 instead of 0. But somehow I am not getting it:
In[1]: df = pd.DataFrame([[4,7],[10,11],[7,2]],columns=['one', 'two'])
In[2]: df
Out[2]:
   one  two
0    4    7
1   10   11
2    7    2
In[3]: df.reindex(range(1,len(df)+1))
Out[3]:
   one  two
1   10   11
2    7    2
3  NaN  NaN
Where did my first row go? What am I getting wrong about reindex()?
To increase your index by 1 you can simply modify the index like this: df.index += 1.
Full example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[4,7],[10,11],[7,2]],columns=['one', 'two'])
In [3]: df
Out[3]:
   one  two
0    4    7
1   10   11
2    7    2
In [4]: df.index += 1
In [5]: df
Out[5]:
   one  two
1    4    7
2   10   11
3    7    2
reindex conforms the DataFrame to the new index rather than relabelling the existing rows in order; to relabel while preserving the order, you can assign the index directly:
In [25]:
df.index = range(1,len(df)+1)
df
Out[25]:
   one  two
1    4    7
2   10   11
3    7    2
The docs show that reindex conforms your data to the new index, which introduces NaN values for labels that did not exist before; that is why you lost a row, and why reindex has a fill_value parameter.
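A minimal sketch of that behaviour, using fill_value to make the conforming explicit:

import pandas as pd

df = pd.DataFrame([[4, 7], [10, 11], [7, 2]], columns=['one', 'two'])

# Label 0 is dropped (it is not in the new index) and label 3 is introduced;
# fill_value controls what the new row holds instead of NaN.
print(df.reindex(range(1, len(df) + 1), fill_value=0))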
Say I group a Pandas dataframe around some column
df.groupby(cols)
Are groups sorted according to any criteria?
One way to retrieve a group is:
ix = 0
grouped.get_group(list(grouped.groups.keys())[ix])
but it is a bit verbose, and it's not clear that keys() above will give the groups in order.
Another way:
idx = pd.IndexSlice
df = df.set_index(cols)
df.loc[idx[df.index.levels[0][0], :], :]
but again, that's really verbose.
Is there another way to get a group by its integer index?
groupby has a sort parameter which is True by default, thus the groups are sorted. As for getting the nth group, it looks like you'd have to define a function, and use an internal API:
In [123]: df = DataFrame({'a': [1,1,1,1,2,2,3,3,3,3], 'b': randn(10)})
In [124]: df
Out[124]:
   a       b
0  1  1.5665
1  1 -0.2558
2  1  0.0756
3  1 -0.2821
4  2  0.8670
5  2 -2.0043
6  3 -1.3393
7  3  0.3898
8  3 -0.3392
9  3  1.2198
[10 rows x 2 columns]
In [125]: gb = df.groupby('a')
In [126]: def nth_group(gb, n):
   .....:     keys = gb.grouper._get_group_keys()
   .....:     return gb.get_group(keys[n])
   .....:
In [127]: nth_group(gb, 0)
Out[127]:
   a       b
0  1  1.5665
1  1 -0.2558
2  1  0.0756
3  1 -0.2821
[4 rows x 2 columns]
How about:
key, df2 = next(iter(grouped))
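And if you want the nth pair rather than the first, a sketch using itertools.islice on the same iterator (groups come out in sorted key order when sort=True, the default):

from itertools import islice

def nth_group(grouped, n):
    # Iterating a GroupBy yields (key, subframe) pairs in group order.
    key, subframe = next(islice(iter(grouped), n, None))
    return key, subframe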