How to find non-zero median/mean of multiple columns in pandas? - python

I have a long list of columns for which I want to calculate the non-zero median, mean & std in one go. I cannot just delete rows with a 0 in one column, because the value for another column in the same row may not be 0.
Below is the code I currently have, which calculates the median, mean, etc. including zeros.
agg_list_oper = {'ABC1': [max, np.std, np.mean, np.median],
                 'ABC2': [max, np.std, np.mean, np.median],
                 'ABC3': [max, np.std, np.mean, np.median],
                 'ABC4': [max, np.std, np.mean, np.median],
                 .....
                 .....
                 .....
                 }
df=df_tmp.groupby(['id']).agg(agg_list_oper).reset_index()
I know I can write long code with loops to process one column at a time.
Is there a way to do this elegantly with pandas groupby.agg() or some other function?

You can temporarily replace the 0's with NaNs. Pandas will then ignore the NaNs when calculating the median, mean, and std.
df_tmp.replace(0, np.nan).groupby(['id']).agg(agg_list_oper).reset_index()
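For example, a minimal sketch with made-up data (the ABC column names just follow the question), showing that the zeros no longer affect the per-group statistics:

import numpy as np
import pandas as pd

# hypothetical example data
df_tmp = pd.DataFrame({'id':   [1, 1, 1, 2, 2],
                       'ABC1': [0, 2, 4, 0, 6],
                       'ABC2': [5, 0, 1, 3, 0]})
agg_list_oper = {'ABC1': [max, np.std, np.mean, np.median],
                 'ABC2': [max, np.std, np.mean, np.median]}

# zeros become NaN, and NaN is skipped by max/std/mean/median
out = df_tmp.replace(0, np.nan).groupby(['id']).agg(agg_list_oper).reset_index()
print(out)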

Related

Pandas conditional formula based on comparison of two cells

When calculating a new column called "duration_minutes", some of the results are negative because the values were put in the original columns backwards.
time.started_at=pd.to_datetime(time.started_at)
time.ended_at=pd.to_datetime(time.ended_at)
time["duration_minutes"]=(time.ended_at-time.started_at).dt.total_seconds()/60
time.head()
A quick check for negatives, time[time.duration_minutes < 0], shows many rows with negative values in the "duration_minutes" column because the start and stop times are in the wrong columns.
Is there a way to create and calculate the "duration_minutes" column to deal with this situation?
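A minimal hedged sketch of one possible approach (assuming a negative duration simply means the two timestamps were swapped, so the earlier timestamp can be treated as the start and the later one as the end):

import pandas as pd

# hypothetical sample with one swapped row
time = pd.DataFrame({'started_at': ['2023-01-01 10:30', '2023-01-01 12:00'],
                     'ended_at':   ['2023-01-01 10:00', '2023-01-01 12:45']})
time.started_at = pd.to_datetime(time.started_at)
time.ended_at = pd.to_datetime(time.ended_at)

# take the earlier timestamp as the start and the later one as the end
start = time[['started_at', 'ended_at']].min(axis=1)
end = time[['started_at', 'ended_at']].max(axis=1)
time["duration_minutes"] = (end - start).dt.total_seconds() / 60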

averaging columns on a very big matrix of data (approximately 150,000*80) in a normal runtime

I am having problems analyzing very big matrices.
I have 4 big matrices of numbers (between 0 and 1); each matrix contains over 100,000 rows, with 70-80 columns.
I need to average each column and add them to lists according to the column title.
I tried to use the built-in mean() method of pandas (the input is a pandas DataFrame), but the runtime is crazy (it takes numerous hours to calculate the mean of a single column) and I cannot afford that runtime.
Any suggestions on how I can do it with a normal runtime?
I am adding my code here-
def filtering(data):
    healthy = []
    nmibc_hg = []
    nmibc_lg = []
    mibc_hg = []
    for column in data:
        if 'HEALTHY' in column:
            healthy.append(data[column].mean())
        elif 'NMIBC_HG' in column:
            nmibc_hg.append(data[column].mean())
        elif 'NMIBC_LG' in column:
            nmibc_lg.append(data[column].mean())
        elif 'MIBC_HG' in column and 'NMIBC_HG' not in column:
            mibc_hg.append(data[column].mean())
Use df.mean() and it will generate a mean for every numeric column in the DataFrame df. The column names and summary results go into a pandas Series. I tried it on a DataFrame with 190,000 rows and 12 numeric columns and it only took 4 seconds. It should only take a few seconds on your dataset too.
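A hedged sketch of what that could look like for the loop above, computing all column means in one vectorized call and then splitting them by column-name substring (the group names are taken from the question):

import pandas as pd

def filtering(data):
    # one pass over all columns instead of a separate .mean() call per column
    col_means = data.mean()
    healthy  = col_means[col_means.index.str.contains('HEALTHY')].tolist()
    nmibc_hg = col_means[col_means.index.str.contains('NMIBC_HG')].tolist()
    nmibc_lg = col_means[col_means.index.str.contains('NMIBC_LG')].tolist()
    # 'MIBC_HG' is also a substring of 'NMIBC_HG', so exclude those columns
    mibc_hg  = col_means[col_means.index.str.contains('MIBC_HG')
                         & ~col_means.index.str.contains('NMIBC_HG')].tolist()
    return healthy, nmibc_hg, nmibc_lg, mibc_hg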

pandas DataFrame: Calculate Sum based on boolean values in another column

I am fairly new to Python and I am trying to simulate the following logic in pandas.
I am currently looping through the rows and want to sum the values in the AMOUNT column over the prior rows, but only back to the last seen 'TRUE' value. This seems inefficient with the actual data (I have a dataframe of about 5 million rows). What would be an efficient way of handling this logic in Python?
Logic:
The logic is that if FLAG is TRUE, I want to sum the values in the AMOUNT column over the prior rows, but only back to the last seen 'TRUE' value. Basically, sum the values in 'AMOUNT' between the rows where FLAG is TRUE.
Check with cumsum and transform('sum'):
df['SUM']=df.groupby(df['FLAG'].cumsum()).Amount.transform('sum').where(df.FLAG)
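A small hedged demonstration of what the cumsum grouping does, on toy data (the column names follow the answer's code):

import pandas as pd

# toy data: FLAG marks the rows where a new block starts
df = pd.DataFrame({'FLAG':   [True, False, False, True, False, True],
                   'Amount': [10, 20, 30, 40, 50, 60]})

# FLAG.cumsum() labels each block of rows starting at a TRUE row,
# transform('sum') broadcasts that block's total back to every row,
# and .where(df.FLAG) keeps the total only on the TRUE rows
df['SUM'] = df.groupby(df['FLAG'].cumsum()).Amount.transform('sum').where(df.FLAG)
print(df)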
Maybe try something along the following lines:
import pandas as pd
df = pd.read_csv('name of file.csv')
df['AMOUNT'].sum()

Pandas: Calculating column-wise mean yields nulls

I have a pandas DataFrame, df, and I'd like to get the mean for columns 180 through the end (not including the last column), only using the first 100K rows.
If I use the whole DataFrame:
df.mean().isnull().any()
I get False
If I use only the first 100K rows:
train_means = df.iloc[:100000, 180:-1].mean()
train_means.isnull().any()
I get: True
I'm not sure how this is possible, since the second approach is only getting the column means for a subset of the full DataFrame. So if no column in the full DataFrame has a mean of NaN, I don't see how a column in a subset of the full DataFrame can.
For what it's worth, I ran:
df.columns[df.isna().all()].tolist()
and I get: []. So I don't think I have any columns where every entry is NaN (which would cause a NaN in my train_means calculation).
Any idea what I'm doing incorrectly?
Thanks!
Try looking at
(df.iloc[:100000, 180:-1].isnull().sum()==100000).any()
If this returns True, it means one of the columns is all NaN in the first 100,000 rows.
Now, to explain why you get no nulls when taking the mean of the whole DataFrame: mean has skipna set to True by default, so it drops NaNs before computing the mean.
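A hedged toy illustration of both points (hypothetical data): a column that is all NaN only in the first rows produces a NaN mean on that slice, while on the full DataFrame skipna=True still yields a number for it:

import numpy as np
import pandas as pd

# hypothetical: column 'a' is NaN in the first 3 rows but not afterwards
df = pd.DataFrame({'a': [np.nan, np.nan, np.nan, 1.0, 2.0],
                   'b': [1.0, 2.0, 3.0, 4.0, 5.0]})

print(df.mean().isnull().any())            # False: skipna=True drops the NaNs
print(df.iloc[:3].mean().isnull().any())   # True: 'a' is all NaN in that slice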

Remove duplicate columns only by their values

I just got an assignment in which I have a lot of features (as columns) and records (as rows) in a CSV file.
Cleaning the data using Python (including pandas):
A,B,C
1,1,1
0,0,0
1,0,1
I would like to delete all the duplicate columns with the same values so that only one of them remains. A and B will be the only columns to remain.
I would also like to combine the columns that have a high Pearson correlation with the target value. How can I do it?
Thanks.
I would like to delete all the duplicate columns with the same values so that only one of them remains. A will be the only column to remain.
You mean that A is the only one kept among A and C, right? (B doesn't duplicate anything.)
You can use DataFrame.drop_duplicates
df = df.T.drop_duplicates().T
It works on rows, not columns, so I transpose before/after calling it.
I would also like to combine the columns that have a high Pearson correlation with the target value. How can I do it?
You can loop over pairs of columns and compute their correlation with DataFrame.corr or with numpy.corrcoef.
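A hedged sketch of both steps together on the small example above (the 'target' column is an assumption, added here only for illustration):

import pandas as pd

# hypothetical data: C duplicates A; 'target' is an assumed label column
df = pd.DataFrame({'A': [1, 0, 1], 'B': [1, 0, 0], 'C': [1, 0, 1],
                   'target': [1, 0, 2]})

# 1) drop columns whose values duplicate an earlier column (C is removed)
df = df.T.drop_duplicates().T

# 2) Pearson correlation of each remaining column with the target
print(df.corrwith(df['target']))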
