Get corresponding index of median - python

I have a pandas dataframe with one column and I would like to know the index of the median. That is, I determine the median this way:
df.median()
This gives me the median value, but I would like to know the index of that row. Is it possible to determine this? For a list of odd length I could search for the row whose value equals the median, but for even lengths this is not going to work, since the median is then the mean of the two middle values. Can someone help?
This question was asked in another post, where the answer was basically to search for rows which have the same value as the median. But like I said, that will not work for a list of even length.
Below is a minimal example (I have included the suggestion by Wen below):
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 1), columns=list('A'))
df.median()
df.loc[df['A'] == df['A'].median()]
Out[120]:
Empty DataFrame
Columns: [A]
Index: []

You can use Wen's answer for dataframes of odd length.
For dataframes of even length, the question does not really make sense. As you have pointed out, the median value does not exist in the dataframe. However, you can sort the dataframe by your column of interest and then find the indices of the two "median" values.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 1), columns=list('A'))
df.median()
df.loc[df['A'] == df['A'].median()]  # empty for even-length dataframes

df.sort_values(by='A', inplace=True)
df[df['A'] > df['A'].median()].iloc[0]   # row just above the median
df[df['A'] < df['A'].median()].iloc[-1]  # row just below the median
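If it is the index labels you are after rather than the rows themselves, the .name attribute of a single-row selection holds its label (sort_values does not reset the index, so these are the original labels). A small sketch building on the code above:
# .name on a single-row selection gives that row's original index label
upper_idx = df[df['A'] > df['A'].median()].iloc[0].name
lower_idx = df[df['A'] < df['A'].median()].iloc[-1].name
print(lower_idx, upper_idx)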

Another way is to use the quantile function (which conveniently defaults to 0.5, i.e. the median) and set the interpolation argument so that it doesn't interpolate between the two middle values on a DataFrame of even length.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(6, 1), columns=['A'])
# row nearest to midpoint
df[df['A']==df['A'].quantile(interpolation='nearest')]
# just below the midpoint
df[df['A']==df['A'].quantile(interpolation='lower')]
# just above the midpoint
df[df['A']==df['A'].quantile(interpolation='higher')]
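From there, the index itself is just an .index lookup on the filtered frame (assuming the column values are unique, the filter returns exactly one row):
# index label of the row nearest the midpoint
df[df['A'] == df['A'].quantile(interpolation='nearest')].index[0]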

Related

Applying a function that inverts column values using pandas

I'm hoping to get someone's advice on a problem I'm running into trying to apply a function over columns in a dataframe I have, a function that inverts the values in the columns.
For example, if the observation is 0 and the max of the column is 7, I take the absolute difference between the observation and the max: abs(0 - 7) = 7, so the smallest value becomes the largest.
All of the columns essentially have a similar range to the above example. The shape of the sliced df is (16984, 512).
The code I have written creates a bunch of empty columns, which are then replaced with the max values of those columns. The new shape becomes (16984, 1029), including the 5 columns that I sliced off before. Then I use a lambda to apply the function over the columns in question:
# create max cols
col = df.iloc[:, 5:]
col_names = col.columns
maximum = '_max'
for col in df[col_names]:
    max_value = df[col].max()
    df[col + maximum] = np.zeros((16984,))
    df[col + maximum].replace(to_replace=0, value=max_value)
# for each row and column, invert value of row
def invert_col(x, col):
    """Invert values of a column"""
    return abs(x[col] - x[col + "_max"])

for col in col_names:
    new_df = df.apply(lambda x: invert_col(x, col), axis=1)
I've tried this both with axis = 1 and without it, and the behaviour is quite different. I am fairly new to Python, so I'm finding it difficult to troubleshoot why this is happening.
When I remove axis = 1, the error I get is a key error: KeyError: 'TV_TIME_LIVE'
TV_TIME_LIVE is the first column in col_names, so it's as if it's not finding it.
When I include axis = 1, I don't get an error, but all the columns in the df get flattened into a Series, with length equal to the original df.
What I'm expecting is a new_df with the same shape (16984,1029) where the values of the 5th to the 517th column have the inverse function applied to them.
I would really appreciate any guidance as to what's going on here and how we might get to the desired output.
Many thanks
apply is slow. It is better to use a vectorized approach, as below.
axis=1 means that your function is applied to each row (so it can look things up by column name); if you do not specify it, the default axis=0 applies the function to each column instead. That is why you get the KeyError: pandas searches for 'TV_TIME_LIVE' inside a single column and cannot find it. If you really must use apply, work through a few examples of how exactly it behaves first.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 7, size=(100, 4)), columns=list('ABCD'))
col_list = df.columns.copy()
for col in col_list:
    df[col + "inversed"] = abs(df[col] - df[col].max())

Creating a new dataframe by filtering matches from columns of two existing dataframes with error tolerance

I am pretty new to Python and pandas, and I want to sort through two existing dataframes by certain columns and create a third dataframe that contains only the value matches within a tolerance. In other words, I have df1 and df2, and I want df3 to contain the rows and columns of df2 that are within the tolerance of values in df1:
Two dataframes:
df1 = pd.DataFrame([[0.221, 2.233, 7.84554, 10.222],
                    [0.222, 2.000, 7.8666, 10.000],
                    [0.220, 2.230, 7.8500, 10.005]],
                   columns=('rt', 'mz', 'mz2', 'abundance'))
df2 = pd.DataFrame([[0.219, 2.233, 7.84500, 10.221],
                    [0.220, 2.002, 7.8669, 10.003],
                    [0.229, 2.238, 7.8508, 10.009]],
                   columns=('rt', 'mz', 'mz2', 'abundance'))
Expected output:
df3 = pd.DataFrame([[0.219, 2.233, 7.84500, 10.221],
                    [0.220, 2.002, 7.8669, 10.003]],
                   columns=('rt', 'mz', 'mz2', 'abundance'))
I have tried for loops and filters, but as I am a newbie nothing is really working for me. But here is what I'm trying now:
import pandas as pd
import numpy as np

p = []
d = np.array(p)
# print(d.dtype)

def count(df2, l, r):
    l = [(df1['rt'] - 0.001)]
    r = [(df1['rt'] + 0.001)]
    for x in df2['rt']:
        # condition check
        if x >= l and x <= r:
            print(x)
            d.append(x)
where p and d are the corresponding list and the array (if it is even necessary to make an array?) that will be populated. I bet the problem lies somewhere in the fact that the function shouldn't contain the for loop.
Ideally, this would scale to sorting ~13,000 rows of one dataframe against the 180 column values of another dataframe.
Thank you in advance!
Is this what you're looking for?
lo = df1.rt.min() - 0.001
hi = df1.rt.max() + 0.001
df3 = df2[(df2.rt >= lo) & (df2.rt <= hi)]
>>> df3
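If you need a per-row tolerance match (each df2 row kept only when its rt is within ±0.001 of some rt value in df1) rather than one global range, a broadcasting sketch along these lines should handle ~13,000 × 180 comparisons comfortably:
import numpy as np

tol = 0.001
# pairwise |df2.rt - df1.rt| as a (len(df2), len(df1)) array
diffs = np.abs(df2['rt'].to_numpy()[:, None] - df1['rt'].to_numpy()[None, :])
df3 = df2[(diffs <= tol).any(axis=1)]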

Python Pandas - filter pandas dataframe to get rows with minimum values in one column for each unique value in another column

Here is a dummy example of the DF I'm working with ('ETC' represents several columns):
df = pd.DataFrame(data={'PlotCode': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                        'INVYR': [2000, 2000, 2000, 2005, 1990, 2000, 1990, 2005, 2001],
                        'ETC': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']})
And here is what I want to end up with:
df1 = pd.DataFrame(data={'PlotCode': ['A', 'A', 'A', 'B', 'B', 'C'],
                         'INVYR': [2000, 2000, 2000, 1990, 1990, 2001],
                         'ETC': ['a', 'b', 'c', 'e', 'g', 'i']})
NOTE: I want ALL rows with the minimum 'INVYR' value for each 'PlotCode', not just one; otherwise I'm assuming I could do something easier with drop_duplicates and sort.
So far, following the answer here Appending pandas dataframes generated in a for loop I've tried this with the following code:
df1 = []
for i in df['PlotCode'].unique():
    j = df[df['PlotCode'] == i]
    k = j[j['INVYR'] == j['INVYR'].min()]
    df1.append(k)
df1 = pd.concat(df1)
This code works but is very slow, my actual data contains some 40,000 different PlotCodes so this isn't a feasible solution. Does anyone know some smooth filtering way of doing this? I feel like I'm missing something very simple.
Thank you in advance!
Try not to use for loops when using pandas; they are extremely slow in comparison to the vectorized operations that pandas offers.
Solution 1:
Determine the minimum INVYR for every plotcode, using .groupby():
min_invyr_per_plotcode = df.groupby('PlotCode', as_index=False)['INVYR'].min()
And use pd.merge() to do an inner join between your original df and this minimum you just found:
result_df = pd.merge(
    df,
    min_invyr_per_plotcode,
    how='inner',
    on=['PlotCode', 'INVYR'],
)
Solution 2:
Again, determine the minimum per group, but now add it as a column to your dataframe. This minimum per group gets added to every row by using .groupby().transform()
df['min_per_group'] = (df
    .groupby('PlotCode')['INVYR']
    .transform('min')
)
Now filter your dataframe where INVYR in a row is equal to the minimum of that group:
df[df['INVYR'] == df['min_per_group']]
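For what it's worth, the helper column isn't strictly necessary; the transform can be used inline:
# same filter without materializing the helper column
result_df = df[df['INVYR'] == df.groupby('PlotCode')['INVYR'].transform('min')]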

Python Pandas: create rank columns, move original column max rank

I need to be able to:
1. calculate ranks for each column in all rows,
2. then find the max ranked column label of each row,
3. and then, for each row, pull the value of that max ranked column from the original df.
It is trivial to do when working only with the data in the original df. But if different ranking calls are needed, it seems difficult to accomplish.
Below is my Python Pandas code to accomplish this. But it does not work. It does not seem to interpret my statement, df1['maxV'] = df1[df1['maxR']] as I expect. Suggestions to achieve will be appreciated.
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
rankV = df1.pct_change(3)           # calculate ranking values
df1['maxR'] = rankV.idxmax(axis=1)  # add max ranked column label of rankV
df1['maxV'] = df1[df1['maxR']]      # move max ranked column value to maxV
Iterate the rows and accumulate the values in an array:
maxVals = [np.nan] * 3
for index, row in df1[pd.notna(df1['maxR'])].iterrows():
    maxVals.append(df1.loc[index, row['maxR']])
df1['maxV'] = maxVals
Alternative: a less intuitive way might be to index df1 using the index and values of maxR, which will return a wider DataFrame (# of columns equal to # of rows) that has the maxes on the diagonal:
maxVals = [np.nan]*3
newDf = df1.loc[df1['maxR'][3:].index, df1['maxR'][3:].values]
maxVals.extend(np.diag(newDf))
df1['maxV'] = maxVals
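A vectorized route is also possible: translate the maxR labels into column positions and index the underlying array once. A sketch (sidestepping the deprecated DataFrame.lookup), assuming only the first 3 rows are NaN from pct_change(3):
import numpy as np

valid = df1['maxR'].notna()
rows = np.flatnonzero(valid)                            # positional row indices
cols = df1.columns.get_indexer(df1.loc[valid, 'maxR'])  # positional column indices
df1['maxV'] = np.nan
df1.loc[valid, 'maxV'] = df1.to_numpy()[rows, cols]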

set median of a column to zero pandas Dataframe

I have a DataFrame with two columns and I want to set each column's median value to zero. How can I do this without changing the standard deviation? Or rather, is this the right way to do it?
suppose I have:
df = pd.DataFrame(np.random.randn(100, 2))
# first column
df0 = df[0]
# set median to zero
test = abs(df0 - df.median())
But when I then check
test.median()
it prints not zero but some other value. Do I have a mistake in my thinking?
IIUC, you want
test = df0 - df[0].median()
>>> test.median()
0.0
If you just take the absolute values of the series, you'll change the median, because the median of course depends on the ordering of the elements.
There are mainly 2 things you need to do here:
Iterate over the columns.
For each column, calculate its median and subtract it from all values in that column.
And don't use the absolute value, as it'll ruin the median = 0 you want.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100, 2))
for col in df.columns:
    df[col] = df[col] - np.median(df[col])
Testing:
for col in df.columns:
    print(np.median(df[col]))
0.0
0.0
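For what it's worth, pandas broadcasting does the same thing without an explicit loop; subtracting a per-column constant leaves each column's standard deviation untouched:
# subtract each column's median in one step; df.median() broadcasts per column
df_centered = df - df.median()
print(df_centered.median())  # 0.0 for every column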
