Set median of a column to zero in a pandas DataFrame - python

I have a DataFrame with two columns and I want to set each column's median value to zero. How can I do this without changing the standard deviation? Or better, is this the right way to do that?
suppose I have:
df = pd.DataFrame(np.random.randn(100, 2))
#first column
df0 = df[0]
#set median to zero
test = abs(df0 - df.median())
But when I then check
test.median()
it prints not zero but some other value. Do I have a mistake in my thinking?

IIUC, you want
test = df0 - df[0].median()
>>> test.median()
0.0
If you just take the absolute values of the series, you'll change the median, because the median of course depends on the ordering of the elements.
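As a quick sanity check (variable names taken from the snippet above), shifting by a constant leaves the standard deviation untouched:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 2))
df0 = df[0]
test = df0 - df0.median()

print(np.isclose(test.median(), 0.0))     # True
print(np.isclose(test.std(), df0.std()))  # True: subtracting a constant does not change the std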

There are mainly two things you need to do here:
Iterate over the columns.
For each column, calculate its median and subtract it from all values in that column.
And don't use the absolute value, as it will ruin the median = 0 you want.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 2))
for col in df.columns:
    df[col] = df[col] - np.median(df[col])
Testing:
for col in df.columns:
    print(np.median(df[col]))
0.0
0.0
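For what it's worth, the loop can also be collapsed into a single vectorized expression, since subtracting a Series from a DataFrame broadcasts column-wise (a sketch of the same idea):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 2))
df = df - df.median()   # df.median() is a per-column Series; the subtraction aligns on columns
print(df.median())      # ~0.0 for every column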

Related

Applying a function that inverts column values using pandas

I'm hoping to get someone's advice on a problem I'm running into trying to apply a function over the columns of a dataframe I have, one that inverts the values in the columns.
For example, if the observation is 0 and the max of the column is 7, I take the absolute value of the observation minus the max: abs(0 - 7) = 7, so the smallest value becomes the largest.
All of the columns essentially have a similar range to the above example. The shape of the sliced df is (16984, 512).
The code I have written creates a bunch of empty columns that are then replaced with the max values of those columns. The new shape becomes (16984, 1029), including the 5 columns that I sliced off before. Then I use a lambda to apply the function over the columns in question:
#create max cols
col = df.iloc[:, 5:]
col_names = col.columns
maximum = '_max'
for col in df[col_names]:
    max_value = df[col].max()
    df[col+maximum] = np.zeros((16984,))
    df[col+maximum].replace(to_replace = 0, value = max_value)
#for each row and column inverse value of row
def invert_col(x, col):
    """Invert values of a column"""
    return abs(x[col] - x[col+"_max"])

for col in col_names:
    new_df = df.apply(lambda x: invert_col(x, col), axis = 1)
I've tried this both with axis = 1 included and with it removed, and the behaviour is quite different. I am fairly new to Python, so I'm finding it difficult to troubleshoot why this is happening.
When I remove axis = 1, I get KeyError: 'TV_TIME_LIVE'.
TV_TIME_LIVE is the first column in col_names, so it's as if it's not finding it.
When I include axis = 1, I don't get an error, but all the columns in the df get flattened into a single Series, with length equal to that of the original df.
What I'm expecting is a new_df with the same shape (16984, 1029), where the values of the 5th to the 517th columns have the inverse function applied to them.
I would really appreciate any guidance as to what's going on here and how we might get to the desired output.
Many thanks
apply is slow. It is better to use vectorized approaches, as below.
axis=1 means that your function is applied to each row (it receives one row at a time); if you do not specify it, the default axis=0 applies the function to each column instead. When you get a KeyError it means pandas is searching for a column name and cannot find it. If you really must use apply, try searching for a few examples of how exactly it works.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 7, size=(100, 4)), columns=list('ABCD'))
col_list = df.columns.copy()
for col in col_list:
    df[col+"inversed"] = abs(df[col] - df[col].max())
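The per-column loop can itself be collapsed into one broadcasted expression (a sketch; the "inversed" suffix is kept from the snippet above):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 7, size=(100, 4)), columns=list('ABCD'))
# df.max() is a per-column Series; subtracting it broadcasts over every row at once
inverted = (df.max() - df).abs().add_suffix('inversed')
df = pd.concat([df, inverted], axis=1)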

How to perform functions on unique values in dataframe columns in python

I have data of about 5 million records, like in the image below.
I need to get the max and average value for each ID in a new data frame, so that each ID ends up with just one row.
I am pretty new to Python and programming and this group has been helpful, but I don't seem to find a related answer to this particular question. Thanks.
This should do it:
import numpy as np
import pandas as pd
# create dummy data
ids = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]
values = [13, 21, 34, 22, 34, 2, 3, 34, 12, 45, 45, 23, 67, 76, 32, 23, 80]
df = pd.DataFrame({'ID': ids, 'Values': values})
df = df.groupby('ID').agg({'Values': [min, max, np.mean]})  # group on ID and compute min, max, mean of the Values column
df.columns = df.columns.droplevel(0)  # get rid of the multilevel columns due to the grouping
df = df.reset_index()
EDIT: with thanks to ALollz for pointing out the following shortcut (avoiding the multilevel index):
df = df.groupby('ID')['Values'].agg([min, max, np.mean])  # same result, without the multilevel columns
df = df.reset_index()
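On pandas 0.25 or later, named aggregation is one more way to get flat column names directly (a sketch; the output column names here are my own choice):
import pandas as pd

ids = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]
values = [13, 21, 34, 22, 34, 2, 3, 34, 12, 45, 45, 23, 67, 76, 32, 23, 80]
df = pd.DataFrame({'ID': ids, 'Values': values})

out = df.groupby('ID')['Values'].agg(v_min='min', v_max='max', v_mean='mean').reset_index()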
Let me know if any of the steps requires elaboration.

Python Pandas: create rank columns, move original column max rank

I need to be able to
1. calculate ranks for each column in all rows,
2. then find the max-ranked column label of each row,
3. and then, for each row, pull that max-ranked column's value from the original df.
It is trivial to do when working only with the data in the original df. But if different ranking calls are needed, it seems difficult to accomplish.
Below is my Python Pandas code to accomplish this. But it does not work: it does not interpret my statement df1['maxV'] = df1[df1['maxR']] as I expect. Suggestions on how to achieve this will be appreciated.
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
rankV = df1.pct_change(3)            # calculate ranking values
df1['maxR'] = rankV.idxmax(axis=1)   # add max ranked column label of rankV
df1['maxV'] = df1[df1['maxR']]       # move max ranked column value to maxV
Iterate the rows and accumulate the values in an array:
maxVals = [np.nan]*3
for index, row in df1[pd.notna(df1['maxR'])].iterrows():
    maxVals.append(df1.loc[index, row['maxR']])
df1['maxV'] = maxVals
Alternative: a less intuitive way is to index df1 with the index and values of maxR together, which returns a wider DataFrame (as many columns as rows) that has the maxes on the diagonal:
maxVals = [np.nan]*3
newDf = df1.loc[df1['maxR'][3:].index, df1['maxR'][3:].values]
maxVals.extend(np.diag(newDf))
df1['maxV'] = maxVals
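A fully vectorized alternative (a sketch under the same setup as the question) looks up each row's max-ranked value with NumPy fancy indexing instead of iterating:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
rankV = df1.pct_change(3)
df1['maxR'] = rankV.dropna(how='all').idxmax(axis=1)  # first 3 rows stay NaN after alignment

valid = df1['maxR'].notna()
rows = df1.loc[valid, list('ABC')].to_numpy()
cols = pd.Index(list('ABC')).get_indexer(df1.loc[valid, 'maxR'])
df1.loc[valid, 'maxV'] = rows[np.arange(len(rows)), cols]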

Possible optimization of going through pandas dataframe

I'm trying to find a way to optimize looping through a pandas dataframe. The dataset contains ~450k rows with ~20 columns. The dataframe has 3 locational variables as a multiindex, and I want to drop a group's rows when any column is all NaN within the group, and otherwise fill NaNs with the mean of the group.
LOC = ['market_id', 'midmarket_id', 'submarket_id']
# Assign -1000 to multiindex nan values
df = df.fillna({c:-1000 for c in LOC})
df = df.set_index(LOC).sort_index(level=[i for i in range(len(LOC))])
# Looping through subset with same (market, midmarket, submarket)
for k, v in df.copy().groupby(level=[i for i in range(len(LOC))]):
    # If there is any column with all NaN values, drop the group from df
    if v.isnull().all().any():
        df.drop(v.index.values)
    # If there is at least one non-NaN value, fillna with the group mean
    else:
        df.loc[v.index.values] = df.loc[v.index.values].fillna(v.mean())
So if there is a dataframe like in the "before" image, it should be converted like in the "after" image, removing the rows of groups that have all-NaN columns.
I apologize if this is redundant or not in accordance with the Stack Overflow question guidelines. But if anyone has a better solution for this, I would greatly appreciate it.
Thanks in advance.
There's no need to copy your entire dataframe. Nor is there a need to iterate GroupBy elements manually. Here's an alternative solution:
LOC = ['market_id', 'midmarket_id', 'submarket_id']
# Assign -1000 to NaN values in the multiindex columns only
df[LOC] = df[LOC].fillna(-1000)
# Include only columns containing non-nulls
non_nulls = np.where(df.notnull().any())[0]
df = df.iloc[:, non_nulls]
# Fill columns with respective groupwise means
g = df.groupby(LOC)
for col in df.columns.difference(LOC):
    df[col] = df[col].fillna(g[col].transform('mean'))
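For the original requirement (drop a whole group when one of its columns is all NaN within that group, otherwise fill with the group mean), here is a sketch built on groupby.filter and transform, with LOC kept as regular columns and toy data of my own:
import numpy as np
import pandas as pd

LOC = ['market_id', 'midmarket_id', 'submarket_id']
df = pd.DataFrame({'market_id':    [1, 1, 1, 2, 2, 2],
                   'midmarket_id': [1, 1, 1, 1, 1, 1],
                   'submarket_id': [1, 1, 1, 1, 1, 1],
                   'x': [np.nan, np.nan, np.nan, 1.0, np.nan, 3.0],
                   'y': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Drop every group in which some column is entirely NaN
df = df.groupby(LOC).filter(lambda v: not v.isnull().all().any())
# Fill the remaining NaNs with their group means
df = df.fillna(df.groupby(LOC).transform('mean'))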

Get corresponding index of median

I have a pandas dataframe with one column and I would like to know the index of the median. That is, I determine the median this way:
df.median()
This gives me the median value, but I would like to know the index of that row. Is it possible to determine this? For a column of odd length I could search for the index of the row with that value, but for even lengths this is not going to work. Can someone help?
This question was asked in another post, where the answer was basically to search for rows that have the same value as the median. But like I said, that will not work for a column of even length.
Below is a minimal example (it includes the suggestion by Wen):
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 1), columns=list('A'))
df.median()
df.loc[df['A'] == df['A'].median()]
Out[120]:
Empty DataFrame
Columns: [A]
Index: []
You can use Wen's answer for dataframes of odd length.
For dataframes of even length, the question does not really make sense. As you have pointed out the median does not exist in the dataframe. However, you can sort the dataframe by your column of interest and then find the indices for the two "median" values.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 1), columns=list('A'))
df.median()
df.loc[df['A'] == df['A'].median()]      # Wen's answer: works for odd lengths

df.sort_values(by='A', inplace=True)
df[df['A'] > df['A'].median()].iloc[0]   # the "median" row just above the midpoint
df[df['A'] < df['A'].median()].iloc[-1]  # the "median" row just below the midpoint
Another way is to use the quantile function (which conveniently defaults to 0.5, i.e. the median) and set the interpolation argument so that it doesn't try to average the two midpoints on a DataFrame of even length.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(6, 1), columns=['A'])
# row nearest to the midpoint
df[df['A'] == df['A'].quantile(interpolation='nearest')]
# row just below the midpoint
df[df['A'] == df['A'].quantile(interpolation='lower')]
# row just above the midpoint
df[df['A'] == df['A'].quantile(interpolation='higher')]
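If you want the index label itself rather than the row, one more option (a sketch) is to take the position whose value is closest to the median:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 1), columns=['A'])
idx = (df['A'] - df['A'].median()).abs().idxmin()  # index of the value nearest the median
print(idx, df.loc[idx, 'A'])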
