I need to be able to
1. calculate ranks for each column in all rows,
2. then find the column label with the max rank in each row,
3. and then, for each row, copy the value of that max-ranked column from the original df.
It is trivial to do when working only with the data in the original df, but if a separate ranking calculation is needed, it seems difficult to accomplish.
Below is my Python pandas code to accomplish this, but it does not work. It does not seem to interpret my statement df1['maxV'] = df1[df1['maxR']] as I expect. Suggestions on how to achieve this would be appreciated.
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
rankV = df1.pct_change(3)           # calculate ranking values
df1['maxR'] = rankV.idxmax(axis=1)  # add max ranked column label of rankV
df1['maxV'] = df1[df1['maxR']]      # move max ranked column value to maxV
Iterate the rows and accumulate the values in a list:
maxVals = [np.nan] * 3
for index, row in df1[pd.notna(df1['maxR'])].iterrows():
    maxVals.append(df1.loc[index, row['maxR']])
df1['maxV'] = maxVals
Alternative: a less intuitive way is to index df1 using the index and values of maxR, which returns a wider DataFrame (# of columns equal to # of rows) with the maxes on the diagonal:
maxVals = [np.nan]*3
newDf = df1.loc[df1['maxR'][3:].index, df1['maxR'][3:].values]
maxVals.extend(np.diag(newDf))
df1['maxV'] = maxVals
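A fully vectorized alternative (a sketch, assuming the setup above; it uses plain NumPy positional indexing rather than DataFrame.lookup, which is deprecated in recent pandas):
data_cols = list('ABC')
valid = df1['maxR'].notna()  # the first rows have no pct_change, hence no rank
col_pos = pd.Index(data_cols).get_indexer(df1.loc[valid, 'maxR'])
df1.loc[valid, 'maxV'] = df1[data_cols].to_numpy()[np.flatnonzero(valid), col_pos]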
Related
I'm hoping to get someone's advice on a problem I'm running into while trying to apply a function over the columns of a dataframe, a function that inverses the values in the columns.
For example, if the observation is 0 and the max of the column is 7, I take the absolute value of the observation minus the max: abs(0 - 7) = 7, so the smallest value becomes the largest.
All of the columns essentially have a similar range to the above example. The shape of the sliced df is (16984, 512).
The code I have written creates a bunch of empty columns, which are then replaced with the max values of those columns. The new shape becomes (16984, 1029), including the 5 columns that I sliced off before. Then I use a lambda to apply the function over the columns in question:
#create max cols
col = df.iloc[:, 5:]
col_names = col.columns
maximum = '_max'
for col in df[col_names]:
    max_value = df[col].max()
    df[col + maximum] = np.zeros((16984,))
    df[col + maximum].replace(to_replace=0, value=max_value)
#for each row and column, inverse value of row
def invert_col(x, col):
    """Invert values of a column"""
    return abs(x[col] - x[col + "_max"])

for col in col_names:
    new_df = df.apply(lambda x: invert_col(x, col), axis=1)
I've tried this with axis=1 included and with it removed, and the behaviour is quite different. I am fairly new to Python, so I'm finding it difficult to troubleshoot why this is happening.
When I remove axis=1, I get a KeyError: 'TV_TIME_LIVE'
TV_TIME_LIVE is the first column in col_names, so it's as if it's not finding it.
When I include axis = 1, I don't get an error, but all the columns in the df get flattened into a Series, with length equal to the original df.
What I'm expecting is a new_df with the same shape (16984,1029) where the values of the 5th to the 517th column have the inverse function applied to them.
I would really appreciate any guidance as to what's going on here and how we might get to the desired output.
Many thanks
apply is slow. It is better to use a vectorized approach, as below.
With axis=1 your function is applied to each row (x is one row at a time); with the default axis=0 it is applied to each column (x is a whole column). That default is why you get the KeyError: x[col] then looks for 'TV_TIME_LIVE' among the row labels of a column Series and cannot find it. If you really must use apply, look at a few examples of how exactly it works first.
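A tiny demonstration of the axis difference (a sketch with a made-up frame):
demo = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
demo.apply(lambda x: x.name, axis=0).tolist()  # ['a', 'b'] -- x is a column
demo.apply(lambda x: x.name, axis=1).tolist()  # [0, 1]     -- x is a row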
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 7, size=(100, 4)), columns=list('ABCD'))
col_list = df.columns.copy()
for col in col_list:
    df[col + "inversed"] = abs(df[col] - df[col].max())
Here is a dummy example of the DF I'm working with ('ETC' represents several columns):
df = pd.DataFrame(data={'PlotCode':['A','A','A','A','B','B','B','C','C'],
'INVYR':[2000,2000,2000,2005,1990,2000,1990,2005,2001],
'ETC':['a','b','c','d','e','f','g','h','i']})
And here is what I want to end up with:
df1 = pd.DataFrame(data={'PlotCode':['A','A','A','B','B','C'],
'INVYR':[2000,2000,2000,1990,1990,2001],
'ETC':['a','b','c','e','g','i']})
NOTE: I want ALL rows with minimum 'INVYR' values for each 'PlotCode', not just one or else I'm assuming I could do something easier with drop_duplicates and sort.
So far, following the answer here (Appending pandas dataframes generated in a for loop), I've tried this with the following code:
df1 = []
for i in df['PlotCode'].unique():
    j = df[df['PlotCode'] == i]
    k = j[j['INVYR'] == j['INVYR'].min()]
    df1.append(k)
df1 = pd.concat(df1)
This code works but is very slow, my actual data contains some 40,000 different PlotCodes so this isn't a feasible solution. Does anyone know some smooth filtering way of doing this? I feel like I'm missing something very simple.
Thank you in advance!
Try not to use for loops when using pandas; they are extremely slow compared to pandas' vectorized operations.
Solution 1:
Determine the minimum INVYR for every plotcode, using .groupby():
min_invyr_per_plotcode = df.groupby('PlotCode', as_index=False)['INVYR'].min()
And use pd.merge() to do an inner join between your original df and this minimum you just found:
result_df = pd.merge(
df,
min_invyr_per_plotcode,
how='inner',
on=['PlotCode', 'INVYR'],
)
Solution 2:
Again, determine the minimum per group, but now add it as a column to your dataframe. This minimum per group gets added to every row by using .groupby().transform():
df['min_per_group'] = (df
    .groupby('PlotCode')['INVYR']
    .transform('min')
)
Now filter your dataframe where INVYR in a row is equal to the minimum of that group:
df[df['INVYR'] == df['min_per_group']]
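The same filter can also be written in one line, without the helper column (a sketch):
result_df = df[df['INVYR'] == df.groupby('PlotCode')['INVYR'].transform('min')]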
I have a multiindex DataFrame and I'm trying to select data in it based on certain criteria; so far so good. The problem is that once I have selected my data using .loc and pd.IndexSlice, the resulting DataFrame, which should logically have fewer rows and fewer elements in the first level of the multiindex, keeps exactly the same MultiIndex, with some keys in it referring to empty dataframes.
I've tried creating a completely new DataFrame with a new index, but the structure of my data set is complicated and there is not always the same number of elements in a given level, so it is not easy to create a DataFrame with the right shape into which I can put the data.
import numpy as np
import pandas as pd

np.random.seed(3)  # so my example is reproducible
idx = pd.IndexSlice

iterables = [['A','B','C'], [0,1,2], ['some','rdm','data']]
my_index = pd.MultiIndex.from_product(iterables, names=['first','second','third'])
my_columns = ['col1','col2','col3']
df1 = pd.DataFrame(data=np.random.randint(10, size=(len(my_index), len(my_columns))),
                   index=my_index,
                   columns=my_columns)

# Ok, so let's say I want to keep only the elements in the first level of my index
# (["A","B","C"]) for which the total sum in column 3 is less than 35, for some reason
boolean_mask = (df1.groupby(level="first").col3.sum() < 35).tolist()
first_level_to_keep = df1.index.levels[0][boolean_mask].tolist()

# let's select the wanted data and put it in df2
df2 = df1.loc[idx[first_level_to_keep, :, :], :]
So far, everything is as expected
The problem is when I want to access the df2 index. I expected the following:
df2.index.levels[0].tolist() == ['B','C']
to be true. But this is what gives a True statement:
df2.index.levels[0].tolist() == ['A','B','C']
So my question is the following: is there a way to select data and get in return a DataFrame with a MultiIndex reflecting what is actually in it? I find it weird to be able to select non-existing data in my df2.
Thank you for your time!
Even if you delete the rows corresponding to a particular value in an index level, that value still exists. You can reset the index and then set those columns back as an index in order to generate a MultiIndex with new level values.
df2 = df2.reset_index().set_index(['first','second','third'])
print(df2.index.levels[0].tolist() == ['B','C'])
True
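Alternatively, pandas (0.20+) provides MultiIndex.remove_unused_levels(), which drops the stale level values without the reset/set round trip:
df2.index = df2.index.remove_unused_levels()
print(df2.index.levels[0].tolist() == ['B','C'])
True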
I have a DataFrame with two columns and I want to set each column's median value to zero. How can I do this without changing the standard deviation? Or rather, is this the right way to do that?
suppose I have:
df = pd.DataFrame(np.random.randn(100, 2))
#first column
df0 = df[0]
#set median to zero
test = abs(df0 - df.median())
But when I then check
test.median()
it prints not zero but a different value. Do I have a mistake in my thinking?
IIUC, you want
test= df0 - df[0].median()
>>> test.median()
0.0
If you just take the absolute values of the series, you'll change the median value because, of course, it depends on the ordering of the elements.
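A quick demonstration of that effect (a sketch with made-up numbers):
s = pd.Series([-2.0, -1.0, 1.0, 3.0, 5.0])
print(s.median())                 # 1.0
print((s - s.median()).median())  # 0.0 -- shifting moves the median to zero
print(s.abs().median())           # 2.0 -- abs reorders the values instead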
There are mainly 2 things you need to do here:
1. Iterate over the columns.
2. For each column, calculate its median and subtract it from all values of that column.
And don't use the absolute value, as it'll ruin the median = 0 you want.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 2))
for col in df.columns:
    df[col] = df[col] - np.median(df[col])
Testing:
for col in df.columns:
    print(np.median(df[col]))
0.0
0.0
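The same thing works without the loop, since subtracting a Series from a DataFrame aligns on columns (a sketch):
df = df - df.median()  # df.median() is one median per column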
I have a pandas dataframe with one column and I would like to know the index of the median. That is, I determine the median this way:
df.median()
This gives me the median value, but I would like to know the index of that row. Is it possible to determine this? For a column of odd length I could search for the index of the row with that value, but for even lengths this is not going to work. Can someone help?
This question was asked in another post, where the answer was basically to search for rows which have the same value as the median. But like I said, that will not work for a column of even length.
Below is a minimal example (I have included the suggestion by Wen below):
df = pd.DataFrame(np.random.randn(6, 1), columns=list('A'))
df.median()
df.loc[df['A'] == df['A'].median()]
Out[120]:
Empty DataFrame
Columns: [A]
Index: []
You can use Wen's answer for dataframes of odd length.
For dataframes of even length, the question does not really make sense. As you have pointed out, the median does not exist in the dataframe. However, you can sort the dataframe by your column of interest and then find the indices of the two "median" values.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 1), columns=list('A'))
df.median()
df.loc[df['A'] == df['A'].median()]       # Wen's approach, works for odd lengths

df.sort_values(by='A', inplace=True)
df[df['A'] > df['A'].median()].iloc[0]    # "median" row just above the midpoint
df[df['A'] < df['A'].median()].iloc[-1]   # "median" row just below the midpoint
Another way is to use the quantile function (which conveniently defaults to 0.5, i.e. the median) and set the interpolation argument so that it doesn't interpolate between the two middle values of a DataFrame of even length.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 1), columns=['A'])
# row nearest to midpoint
df[df['A'] == df['A'].quantile(interpolation='nearest')]
# just below the midpoint
df[df['A'] == df['A'].quantile(interpolation='lower')]
# just above the midpoint
df[df['A'] == df['A'].quantile(interpolation='higher')]
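If you only need the row label of the value closest to the median, regardless of frame length, another compact option (a sketch) is:
(df['A'] - df['A'].median()).abs().idxmin()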