How to sum rows containing specific targets in pandas?

Here is my code. For each target substring, I would like to sum the FPKM and count values of the rows whose Ko_class contains that target, and collect the targets and their sums in a new DataFrame.
# coding=utf-8
import pandas as pd
import numpy as np

classes = [('Carbon;Pyruvate;vitamins', 16.7, 1),
           ('Pyruvate;Carbohydrate;Pentose and glucuronate', 30, 7),
           ('Lipid;Carbon;Galactose', 40.5, 9),
           ('Galactose;Pyruvate;Fatty acid', 57, 10),
           ('Fatty acid;Lipid', 22, 4)]
labels = ['Ko_class', 'FPKM', 'count']
alls = pd.DataFrame.from_records(classes, columns=labels)

target = [['Carbon'], ['Pyruvate'], ['Galactose']]
targetsum = pd.DataFrame.from_records(target, columns=['target'])
#######
targets = '|'.join(sum(target, []))
targetsum['total_FPKM'] = (alls['FPKM']
                           .groupby(alls['Ko_class'].str.contains(targets))
                           .sum())
targetsum['count'] = (alls['count']
                      .groupby(alls['Ko_class'].str.contains(targets))
                      .sum())
targetsum
Its result:
      target  total_FPKM  count
0     Carbon         NaN    NaN
1   Pyruvate         NaN    NaN
2  Galactose         NaN    NaN
What I want is:
      target  total_FPKM  count
0     Carbon        57.2     10
1   Pyruvate       103.7     18
2  Galactose        97.5     19
Hope I have described my question clearly :(

Your groupby uses the boolean result of str.contains as the group key, so the group labels are True/False rather than the target names and do not align with targetsum's 0..2 index; that is why every value comes out NaN. Try this:
def aggregation(dataframe, target):
    targetsum = pd.DataFrame(columns=['target', 'sum', 'count'])
    for val in target:
        df_tempo = dataframe.loc[dataframe['Ko_class'].str.contains(val), :].copy()
        new_row = {'target': val, 'sum': df_tempo['FPKM'].sum(), 'count': df_tempo['count'].sum()}
        targetsum = targetsum.append(new_row, ignore_index=True)
    return targetsum

df_result = aggregation(alls, ['Carbon', 'Pyruvate', 'Galactose'])
Result:
      target    sum  count
0     Carbon   57.2     10
1   Pyruvate  103.7     18
2  Galactose   97.5     19
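Note that DataFrame.append was removed in pandas 2.0. A version of the same loop that collects the rows in a plain list and builds the frame once (a sketch under that assumption):
def aggregation(dataframe, target):
    rows = []
    for val in target:
        # rows whose Ko_class mentions this target
        df_tempo = dataframe.loc[dataframe['Ko_class'].str.contains(val), :]
        rows.append({'target': val,
                     'sum': df_tempo['FPKM'].sum(),
                     'count': df_tempo['count'].sum()})
    # build the result in one shot instead of appending row by row
    return pd.DataFrame(rows, columns=['target', 'sum', 'count'])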

You can use str.findall to find the targets that appear in your 'Ko_class' column and assign the result back to a new column. Exploding this list-valued column into separate rows with explode lets you groupby on the targets and perform your aggregation:
target_list = ['Carbon', 'Pyruvate', 'Galactose']
target_substances = '|'.join(target_list)
alls.assign(
    Ko_class_contains_target=alls['Ko_class'].str.findall(target_substances)
).explode('Ko_class_contains_target').groupby('Ko_class_contains_target').agg('sum')
prints back:
                           FPKM  count
Ko_class_contains_target
Carbon                     57.2     10
Galactose                  97.5     19
Pyruvate                  103.7     18
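To reshape that into the asker's desired frame, one could rename and reindex (a small sketch building on the variables above; hits is an arbitrary helper column name, total_FPKM and target are the asker's):
result = (alls.assign(hits=alls['Ko_class'].str.findall(target_substances))
              .explode('hits')
              .groupby('hits')[['FPKM', 'count']].sum()
              .rename_axis('target')
              .rename(columns={'FPKM': 'total_FPKM'})
              .reindex(target_list)   # restore the original target order
              .reset_index())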

Related

How to add interpolated values in multiple rows of a pandas dataframe?

I have a csv file that looks as shown in the picture. There are multiple rows like this whose values are zero in between. In these rows, I want an interpolated value from the rows above and below. I used df.interpolate(method='linear', limit_direction='forward') to interpolate. However, the zero values are not treated as NaN values, so it didn't work for me.
First replace all the zeros with np.nan, and then the interpolation will work correctly:
import pandas as pd
import numpy as np

data = [
    [7260, -4.458639405975710, -4., 7.E-08, 0.1393070275997700, 0., -0.11144176562682400],
    [8030, -4.452569075111660, -4., 4.E-08, 0.1347428577024860, -0.1001462206643270, -0.04915374942019220],
    [498, -4.450785570790800, -4.437233532812810, 1.E-07, 0.1577349354100960, -0.1628636478696300, -0.05505793797144350],
    [1500, -4.450303023388150, -4.429207978066990, 1.E-07, 0.1219543073754720, -0.1886731968341070, -0.14408112469719300],
    [6600, -4.462030024237730, -4.4286701710604900, 4.E-08, 0.100803412848051, -0.1840333872203410, -0.18430271378600200],
    [8860, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [530, -4.453994378096950, -4.0037494206318200, -9.E-08, 0.0594973737919224, 1.0356594366090900, -0.03173366589936420],
    [6904, -4.449221525263950, -3.1840342819501800, -2.E-07, 0.0918042463623589, 1.5125956674286500, -0.01150704151230230],
    [7700, -4.454965896625150, -3.041102261967650, -1.E-07, 0.1211292098853800, 1.837772463779190, 0.0680406376006960],
    [6463, -4.4524324374160600, -3.1096025723730000, -4.E-08, 0.1920291560629040, 2.062490856824510, 0.10665282217392200],
]
df = pd.DataFrame(data, columns=range(98, 105)) \
    .replace(0, np.nan) \
    .interpolate(method='linear', limit_direction='forward')
print(df)
Giving:
     98        99        100           101       102       103       104
0  7260 -4.458639 -4.000000  7.000000e-08  0.139307       NaN -0.111442
1  8030 -4.452569 -4.000000  4.000000e-08  0.134743 -0.100146 -0.049154
2   498 -4.450786 -4.437234  1.000000e-07  0.157735 -0.162864 -0.055058
3  1500 -4.450303 -4.429208  1.000000e-07  0.121954 -0.188673 -0.144081
4  6600 -4.462030 -4.428670  4.000000e-08  0.100803 -0.184033 -0.184303
5  8860 -4.458012 -4.216210 -2.500000e-08  0.080150  0.425813 -0.108018
6   530 -4.453994 -4.003749 -9.000000e-08  0.059497  1.035659 -0.031734
7  6904 -4.449222 -3.184034 -2.000000e-07  0.091804  1.512596 -0.011507
8  7700 -4.454966 -3.041102 -1.000000e-07  0.121129  1.837772  0.068041
9  6463 -4.452432 -3.109603 -4.000000e-08  0.192029  2.062491  0.106653
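One caveat, visible in row 0 above: with limit_direction='forward', a NaN in the very first row (column 103 here) has no earlier value to interpolate from, so it stays NaN. If that cell should be filled too, passing limit_direction='both' also allows filling in the backward direction (a sketch reusing data from above):
# fill interior NaNs by interpolation and also extend into leading/trailing NaNs
df = pd.DataFrame(data, columns=range(98, 105)) \
    .replace(0, np.nan) \
    .interpolate(method='linear', limit_direction='both')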

How can we select columns from a pandas dataframe based on a certain condition?

I have a pandas dataframe and I want to create a list of columns based on a condition: if a block's P_BUYER column has one entry greater than or equal to 97 (and the others less), then the value that sits in parallel to T for that block should be saved in a list. For example, below, the list should contain TENRCT and ADV_INC (the values in parallel to T in the example below are TENRCT, ADVNTG_MARITAL, NEWLSGOLFIN, ADV_INC).
Input:
T TENRCT P_NONBUY(%) P_BUYER(%) INDEX PBIN NEWBIN
N (1,2,3) = Renter N (1,2,3) = Renter 35.88 0.1 33 8 2
Q <0> = Unknown Q <0> = Unknown 3.26 0.1 36 8 2
Q1 <4> = Owner Q <4> = Owner 60.86 99.8 143 5 1
E2
T ADVNTG_MARITAL P_NONBUY(%) P_BUYER(%) INDEX PBIN NEWBIN
Q2<1> = 1+Marrd Q<1> = 1+Marrd 52.91 78.98 149 5 2
Q<2> = 1+Sngl Q<2> = 1+Sngl 45.23 17.6 39 8 3
Q1<3> = Mrrd_Sngl Q<3> = Mrrd_Sngl 1.87 3.42 183 4 1
E3
T ADV_INC P_NONBUY(%) P_BUYER(%) INDEX PBIN NEWBIN
N1('1','Y') = Yes N('1','Y') = Yes 3.26 1.2 182 4 1
N('0','-1')= No N('0','-1')= No 96.74 98.8 97 7 2
E2
Output:
Finallist = ['TENRCT', 'ADV_INC']
You can do it like this:
# In your code you have 3 dataframes E1, E2, E3; iterate over them
output = []
for df in [E1, E2, E3]:
    # Filter your dataframe
    df = df[df['P_BUYER(%)'] >= 97]
    if not df.empty:
        cols = df.columns.values.tolist()
        # Find the index of the 'T' column
        t_index = cols.index('T')
        # Your desired parallel column will be at t_index + 1
        output.append(cols[t_index + 1])
print(output)
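A minimal, self-contained check of that logic on one toy block (the E1 frame and its values below are made up for illustration):
import pandas as pd

# toy version of one parsed block: 'T' holds the row labels and the
# column right after it carries the block's variable name
E1 = pd.DataFrame({
    'T': ['N (1,2,3) = Renter', 'Q <0> = Unknown', 'Q1 <4> = Owner'],
    'TENRCT': ['N (1,2,3) = Renter', 'Q <0> = Unknown', 'Q <4> = Owner'],
    'P_BUYER(%)': [0.1, 0.1, 99.8],
})

output = []
for df in [E1]:
    if not df[df['P_BUYER(%)'] >= 97].empty:
        cols = df.columns.tolist()
        output.append(cols[cols.index('T') + 1])
print(output)  # ['TENRCT']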

How to compute a customized weighted average, including handling of NaN values, in pandas?

I have a data frame df_ss_g as
ent_id,WA,WB,WC,WD
123,0.045251836,0.614582906,0.225930615,0.559766482
124,0.722324239,0.057781167,,0.123603561
125,,0.361074325,0.768542766,0.080434134
126,0.085781742,0.698045853,0.763116684,0.029084545
127,0.909758657,,0.760993759,0.998406211
128,,0.32961283,,0.90038336
129,0.714585519,,0.671905291,
130,0.151888772,0.279261613,0.641133263,0.188231227
Now I have to compute a weighted average (AVG_WEIGHTAGE) based on the weightage (WA*0.5 + WB*1 + WC*0.5 + WD*1) / (0.5 + 1 + 0.5 + 1).
But when I compute it using the method below:
df_ss_g['AVG_WEIGHTAGE'] = df_ss_g.apply(lambda x: ((x['WA']*0.5) + (x['WB']*1) + (x['WC']*0.5) + (x['WD']*1)) / (0.5+1+0.5+1), axis=1)
it outputs NaN as AVG_WEIGHTAGE for every row that has a NaN value, which is wrong.
All I want is for nulls to be excluded from both the numerator and the denominator, e.g.:
ent_id,WA,WB,WC,WD,AVG_WEIGHTAGE
128,,0.32961283,,0.90038336,0.614998095    i.e. (WB*1 + WD*1) / (1 + 1)
129,0.714585519,,0.671905291,,0.693245405  i.e. (WA*0.5 + WC*0.5) / (0.5 + 0.5)
IIUC:
import numpy as np

weights = np.array([0.5, 1, 0.5, 1])
values = df.drop('ent_id', axis=1)
df['AVG_WEIGHTAGE'] = np.dot(values.fillna(0).to_numpy(), weights) / np.dot(values.notna().to_numpy(), weights)
df['AVG_WEIGHTAGE']
0    0.436647
1    0.217019
2    0.330312
3    0.383860
4    0.916891
5    0.614998
6    0.693245
7    0.288001
Try this method using dot products:
def av(t):
    # define the weights for WA, WB, WC, WD
    wt = np.array([0.5, 1, 0.5, 1])
    # create a vector with 0 for null and 1 for non-null
    nulls = np.array([int(i) for i in ~t.isna()])
    # weight the non-null values and sum them; t.fillna(0) zeroes out the nulls
    t_new = np.dot(wt * nulls, t.fillna(0))
    # sum only the weights belonging to non-null values
    wt_new = np.dot(nulls, wt)
    # return the division
    return np.divide(t_new, wt_new)

# the frame is assumed to be indexed by ent_id, so each row holds only WA..WD
df['WEIGHTED AVG'] = df.apply(av, axis=1)
df = df.reset_index()
print(df)
   ent_id        WA        WB        WC        WD  WEIGHTED AVG
0     123  0.045252  0.614583  0.225931  0.559766      0.436647
1     124  0.722324  0.057781       NaN  0.123604      0.217019
2     125       NaN  0.361074  0.768543  0.080434      0.330312
3     126  0.085782  0.698046  0.763117  0.029085      0.383860
4     127  0.909759       NaN  0.760994  0.998406      0.916891
5     128       NaN  0.329613       NaN  0.900383      0.614998
6     129  0.714586       NaN  0.671905       NaN      0.693245
7     130  0.151889  0.279262  0.641133  0.188231      0.288001
It boils down to masking the NaN values with 0 so they contribute to neither the weighted sum nor the weight total:
# the weights
weights = np.array([0.5, 1, 0.5, 1])
# the columns of interest
s = df.iloc[:, 1:]
# where the valid values are
mask = s.notnull()
# use `fillna` and then `@` for matrix multiplication
df['AVG_WEIGHTAGE'] = (s.fillna(0) @ weights) / (mask @ weights)
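For intuition, mask @ weights is each row's effective denominator: the sum of the weights over the columns that are actually present. A quick check, reusing s, mask and weights from above (the printed values follow from the NaN pattern in the sample data):
# e.g. ent_id 128 has only WB and WD present, so its denominator is 1 + 1 = 2.0
print((mask @ weights).tolist())
# [3.0, 2.5, 2.5, 3.0, 2.0, 2.0, 1.0, 3.0]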

Sum df columns for weighted average

Backstory: I have a pandas dataframe scaledData that is just a standard df of information as follows:
               COL NAME0  COL NAME1  ...  COL NAME3  COL NAME4
0                Alabama   4.099099  ...   2.042345   1.392755
1                 Alaska   1.396396  ...   1.000000   1.000000
2                Arizona   4.189189  ...   2.003257   1.537777
3               Arkansas   2.927928  ...   2.208723   1.007370
4             California   3.378378  ...   1.754930   2.012395
5               Colorado   3.378378  ...   3.282196   2.843435
6            Connecticut   5.000000  ...   1.452587   4.277286
7               Delaware   4.409692  ...   2.134501   1.970434
8   District of Columbia   5.000000  ...   1.000000   1.000000
9                Florida   4.628118  ...   1.806412   2.213038
10               Georgia   4.628118  ...   1.513896   2.748559
11                Hawaii   3.902494  ...   2.891694   3.872309
12                 Idaho   1.090703  ...   2.978469   4.127419
13              Illinois   4.537415  ...   1.242970   1.888353
14               Indiana   4.537415  ...   2.368881   2.307914
15                  Iowa   2.088435  ...   3.298368   3.421122
16                Kansas   2.723356  ...   2.791375   2.160330
17              Kentucky   3.902494  ...   1.692890   4.133744
18             Louisiana   2.451247  ...   1.000000   1.000000
19                 Maine   3.448980  ...   2.535328   5.000000
20              Maryland   5.000000  ...   1.632194   1.046567
I want to create another column, Total, in this df that is the result of adding all of the column values for each state (COL NAME0), divided by the sum of the weights in a dictionary. Additionally, a column E should perform the same total but only over the columns with the 'E' tag. The weights dictionary's keys are the column names of the df, and the values are tuples containing the weight for the column (used previously but irrelevant to this problem) and the category the column belongs to. Here is my current implementation:
weights = {'COL NAME1': (2.14, 'E'), 'COL NAME2': (5.14, 'E'), 'COL NAME3': (10, 'G'), 'COL NAME4': (5, 'E')}
eWeights = {key: value for key, value in weights.items() if value[1] == 'E'}
gWeights = {key: value for key, value in weights.items() if value[1] == 'G'}

# Total should be the result of adding each of the columns per COL NAME0 row
# and dividing by the sum of the weight values.
scaledData['Total'] = scaledData.sum(axis=1, skipna=True) / sum(list(weights.values())[0])

# Same calculation on only columns marked 'E'
for key in eWeights:
    scaledData['E'] = scaledData['E'] + scaledData[key]
scaledData['E'] = scaledData['E'] / sum(list(eWeights.values())[0])
Unfortunately, the above code results in the following error (caused by the line creating the Total column in scaledData):
TypeError: unsupported operand type(s) for +: 'float' and 'str'
I've simplified the scaledData and weights but any solution or suggestions will help me with my actual df with many more rows and columns. Appreciate the help and let me know if more information is needed.
The TypeError comes from the weight sums, not from the DataFrame itself: each value in weights is a (weight, tag) tuple, so sum(list(weights.values())[0]) evaluates sum((2.14, 'E')), which tries to add a float to a string. Sum only the numeric part of each tuple (the same fix applies to the Total line):
scaledData['E'] = 0.0  # initialize the accumulator column
for key in eWeights:
    scaledData['E'] = scaledData['E'] + scaledData[key]
# take the first element (the weight) from every (weight, tag) tuple
scaledData['E'] = scaledData['E'] / sum(v[0] for v in eWeights.values())
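A loop-free version of the same idea (a sketch, assuming the weight columns hold numeric data; it reuses the weights and eWeights dictionaries from the question):
# sums of the numeric weights per group of columns
total_w = sum(v[0] for v in weights.values())
e_w = sum(v[0] for v in eWeights.values())

# row-wise sums over the relevant columns, divided by the weight sums
scaledData['Total'] = scaledData[list(weights)].sum(axis=1) / total_w
scaledData['E'] = scaledData[list(eWeights)].sum(axis=1) / e_w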

Grouping column values together

I have a dataframe like so:
Class  price  demand
    1     22       8
    1     60       7
    3     32      14
    2     72       9
    4     45      20
    5     42      25
What I'd like to do is group classes 1-3 in one category and classes 4-5 in one category. Then I'd like to get the sum of price for each category and the sum of demand for each category. I'd like to also get the mean. The result should look something like this:
Class  TotalPrice  TotalDemand  AveragePrice  AverageDemand
    P         186           38          46.5            9.5
    E          87           45          43.5           22.5
Where P is classes 1-3 and E is classes 4-5. How can I group by categories in pandas? Is there a way to do this?
In [8]: df.groupby(np.where(df['Class'].isin([1, 2, 3]), 'P', 'E'))[['price', 'demand']].agg(['sum', 'mean'])
Out[8]:
  price        demand
    sum  mean     sum  mean
E    87  43.5     45  22.5
P   186  46.5     38   9.5
You can create a dictionary that defines your groups.
mapping = {**dict.fromkeys([1, 2, 3], 'P'), **dict.fromkeys([4, 5], 'E')}
Then, if you pass a dictionary or callable to a groupby, it automatically gets mapped onto the index. So let's set the index to Class:
d = df.set_index('Class').groupby(mapping).agg(['sum', 'mean']).sort_index(axis=1, level=1)
Finally, we do some tweaking to get the column names the way you specified.
rename_dict = {'sum': 'Total', 'mean': 'Average'}
d.columns = d.columns.map(lambda c: f"{rename_dict[c[1]]}{c[0].title()}")
d.rename_axis('Class').reset_index()

  Class  TotalPrice  TotalDemand  AveragePrice  AverageDemand
0     E          87           45          43.5           22.5
1     P         186           38          46.5            9.5
In general, you can form arbitrary bins to group your data using pd.cut, specifying the right bin edges:
import pandas as pd

pd.cut(df.Class, bins=[0, 3, 5], labels=['P', 'E'])
#0    P
#1    P
#2    P
#3    P
#4    E
#5    E

df2 = (df.groupby(pd.cut(df.Class, bins=[0, 3, 5], labels=['P', 'E']))[['demand', 'price']]
         .agg(['sum', 'mean']).reset_index())
# Get rid of the multi-level columns
df2.columns = [f'{i}_{j}' if j != '' else f'{i}' for i, j in df2.columns]
Output:
  Class  demand_sum  demand_mean  price_sum  price_mean
0     P          38          9.5        186        46.5
1     E          45         22.5         87        43.5
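As a follow-up, named aggregation (available since pandas 0.25) can produce the asker's exact column names directly, with no multi-level columns to flatten (a sketch):
out = (df.groupby(pd.cut(df.Class, bins=[0, 3, 5], labels=['P', 'E']))
         .agg(TotalPrice=('price', 'sum'),
              TotalDemand=('demand', 'sum'),
              AveragePrice=('price', 'mean'),
              AverageDemand=('demand', 'mean'))
         .reset_index())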
