I have a dataframe that looks like this:
                 ENSG  3dir_S2_S23_L004_R1_001  7dir_S2_S25_L004_R1_001  i3dir_S2_S29_L004_R1_001  i7dir_S2_S31_L004_R1_001
0  ENSG00000000003.15                    349.0                    183.0                     199.0                     165.0
1  ENSG00000000419.13                    133.0                     82.0                     190.0                     168.0
2  ENSG00000000457.14                     62.0                     56.0                      95.0                     111.0
3  ENSG00000000460.17                    191.0                    122.0                     300.0                     285.0
4  ENSG00000001036.14                    507.0                    286.0                     326.0                     317.0
5  ENSG00000001084.13                    205.0                    192.0                     310.0                     320.0
6  ENSG00000001167.14                    406.0                    324.0                     379.0                     309.0
7  ENSG00000001460.18                     93.0                     78.0                     146.0                     120.0
I'm attempting to perform a calculation on every value in each column, excluding the column ENSG.
Something like this, where I divide each row value by the sum of the entire column:
df = df.transform(lambda x: x / x.sum())
How can I exclude the column ENSG from this calculation? Could I use iloc?
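For reference, here is a reproducible construction of that frame (a sketch; the values are copied from the table above):

import pandas as pd

df = pd.DataFrame({
    'ENSG': ['ENSG00000000003.15', 'ENSG00000000419.13', 'ENSG00000000457.14',
             'ENSG00000000460.17', 'ENSG00000001036.14', 'ENSG00000001084.13',
             'ENSG00000001167.14', 'ENSG00000001460.18'],
    '3dir_S2_S23_L004_R1_001': [349.0, 133.0, 62.0, 191.0, 507.0, 205.0, 406.0, 93.0],
    '7dir_S2_S25_L004_R1_001': [183.0, 82.0, 56.0, 122.0, 286.0, 192.0, 324.0, 78.0],
    'i3dir_S2_S29_L004_R1_001': [199.0, 190.0, 95.0, 300.0, 326.0, 310.0, 379.0, 146.0],
    'i7dir_S2_S31_L004_R1_001': [165.0, 168.0, 111.0, 285.0, 317.0, 320.0, 309.0, 120.0],
})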
Use set_index to exclude ENSG from the columns, then transform, and reset_index afterwards:
out = df.set_index('ENSG').transform(lambda x: x / x.sum()).reset_index()
print(out)
# Output:
ENSG 3dir_S2_S23_L004_R1_001 7dir_S2_S25_L004_R1_001 i3dir_S2_S29_L004_R1_001 i7dir_S2_S31_L004_R1_001
0 ENSG00000000003.15 0.179342 0.138322 0.102314 0.091922
1 ENSG00000000419.13 0.068345 0.061980 0.097686 0.093593
2 ENSG00000000457.14 0.031860 0.042328 0.048843 0.061838
3 ENSG00000000460.17 0.098150 0.092215 0.154242 0.158774
4 ENSG00000001036.14 0.260534 0.216175 0.167609 0.176602
5 ENSG00000001084.13 0.105344 0.145125 0.159383 0.178273
6 ENSG00000001167.14 0.208633 0.244898 0.194859 0.172145
7 ENSG00000001460.18 0.047790 0.058957 0.075064 0.066852
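Since the lambda only divides each column by its sum, you can get the same result without transform; a small equivalent sketch:

tmp = df.set_index('ENSG')
out = (tmp / tmp.sum()).reset_index()

Dividing a DataFrame by a Series aligns on the column labels, so each column is divided by its own sum.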
Assuming ENSG is the first column, yes, you can use iloc:
import numpy as np

df.iloc[:, 1:] = df.iloc[:, 1:] / np.sum(df.iloc[:, 1:], axis=0)
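If ENSG isn't guaranteed to be the first column, a label-based variant of the same idea (a sketch):

mask = df.columns != 'ENSG'                          # boolean mask over the columns
df.loc[:, mask] = df.loc[:, mask] / df.loc[:, mask].sum()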
I have a dataframe like the one shown below:
import pandas as pd

df = pd.DataFrame(
    {'stud_name': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF'],
     'qty': [123, 31, 490, 518, 70, 900],
     'trans_date': ['13/11/2020', '10/1/2018', '11/11/2017',
                    '27/03/2016', '13/05/2010', '14/07/2008']})
I would like to do the below:
a) for each stud_name, look at their past data (full past data) and compute the min, max and mean of the qty column
Please note that the 1st record/row for every unique stud_name will be NaN, because there is no past data (history) from which to compute the aggregate statistics.
I tried something like the below, but the output is incorrect:
df['trans_date'] = pd.to_datetime(df['trans_date'])
df.sort_values(by=['stud_name','trans_date'],inplace=True)
df['past_transactions'] = df.groupby('stud_name').cumcount()
df['past_max_qty'] = df.groupby('stud_name')['qty'].expanding().max().values
df['past_min_qty'] = df.groupby('stud_name')['qty'].expanding().min().values
df['past_avg_qty'] = df.groupby('stud_name')['qty'].expanding().mean().values
I expect my output to have the past min/max/mean columns alongside each row, with NaN for the first row of each stud_name.
We can use a custom function to calculate the past statistics per student:
def past_stats(q):
    return (
        q.expanding()
         .agg(['max', 'min', 'mean'])
         .shift()
         .add_prefix('past_')
    )

df.join(df.groupby('stud_name')['qty'].apply(past_stats))
stud_name qty trans_date past_max past_min past_mean
2 ABC 490 2017-11-11 NaN NaN NaN
1 ABC 31 2018-10-01 490.0 490.0 490.0
0 ABC 123 2020-11-13 490.0 31.0 260.5
5 DEF 900 2008-07-14 NaN NaN NaN
4 DEF 70 2010-05-13 900.0 900.0 900.0
3 DEF 518 2016-03-27 900.0 70.0 485.0
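The key step is .shift(): expanding().agg(...) includes the current row, so shifting by one turns "up to and including this row" into "strictly before this row". A minimal sketch of that effect on a single group:

import pandas as pd

s = pd.Series([490, 31, 123])        # qty for one student, already sorted by date
print(s.expanding().max())           # 490.0, 490.0, 490.0 -- includes the current row
print(s.expanding().max().shift())   # NaN, 490.0, 490.0   -- past rows only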
I have a DataFrame with data as strings. These need to be evaluated and converted to numeric.
Let my df be:
var_pct var_num
-76*2 14*1000000
-76*2 12*1000000
111*2 29*1000000
47*2 33*1000000
nan 60*1000000
for column in df:
    df[column] = df.eval(df[column], inplace=True)
I faced a problem with the column containing 'nan', where the result of eval has a length less than the original. How do I make sure the 'nan' becomes '' after eval?
You should avoid eval. In this case, I recommend you split into numeric series first:
df = df.fillna('NaN*NaN')
for col in df.columns:
    df = df.join(df.pop(col).str.split('*', expand=True)
                   .apply(pd.to_numeric, errors='coerce')
                   .add_prefix(f'{col}_'))
print(df)
var_pct_0 var_pct_1 var_num_0 var_num_1
0 -76.0 2.0 14 1000000
1 -76.0 2.0 12 1000000
2 111.0 2.0 29 1000000
3 47.0 2.0 33 1000000
4 NaN NaN 60 1000000
Then perform your calculations using vectorised operations:
for col in ['var_pct', 'var_num']:
    df[col] = df[f'{col}_0'] * df[f'{col}_1']
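If you want to end up with just the two original columns, a possible final step is to keep only the recombined ones:

df = df[['var_pct', 'var_num']]   # drops the helper *_0 / *_1 columns

which reproduces the pd.eval result shown further below, but with vectorised operations.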
For academic purposes, the approach you are attempting is possible via the top-level function pd.eval together with applymap. But beware: this is just an inefficient Python-level loop.
import numpy as np

nan = np.nan  # pd.eval resolves the name 'nan' from the surrounding scope
df = df.fillna('nan*nan')
df = df.applymap(lambda x: pd.eval(x))
print(df)
var_pct var_num
0 -152.0 14000000
1 -152.0 12000000
2 222.0 29000000
3 94.0 33000000
4 NaN 60000000
Assuming that you can live with a copied dataframe:
def ff(val):
    if 'nan' not in val:
        return eval(val)
    return np.nan  # returning a real NaN reproduces the output below

df4 = df3.applymap(ff)  # df3 is the original dataframe with all values as strings
print(df4)
var_pct var_num
0 -152.0 14000000
1 -152.0 12000000
2 222.0 29000000
3 94.0 33000000
4 NaN 60000000
Of course, ff can be expressed as a lambda too:
lambda val: eval(val) if 'nan' not in val else np.nan
I have a dataframe with two numeric columns. I want to add a third column to calculate the difference. But the condition is that if the values in the first column are blank or NaN, the difference should be the value in the second column...
Can anyone help me with this problem?
Any suggestions and clues will be appreciated!
Thank you.
You should use vectorised operations where possible. Here you can use numpy.where:
import numpy as np

df['Difference'] = np.where(df['July Sales'].isnull(), df['August Sales'],
                            df['August Sales'] - df['July Sales'])
However, note that this is precisely the same as treating NaN values in df['July Sales'] as zero. So you can use pd.Series.fillna:
df['Difference'] = df['August Sales'] - df['July Sales'].fillna(0)
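A quick check of the equivalence on two of the rows from the sample used further down:

import numpy as np
import pandas as pd

july = pd.Series([459.0, np.nan])
august = pd.Series([477.0, 563.0])
print(august - july.fillna(0))                          # 18.0, 563.0
print(np.where(july.isnull(), august, august - july))   # [ 18. 563.]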
This isn't really a situation that needs conditions; it is just a math operation. Suppose you have a df like the one below; you can use the .sub() method with fill_value:
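(The original sample frame wasn't shown, so here is a sketch reconstructing it from the output below:)

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'July Sales': [459, 422, 348, 397, np.nan, 191, 435, np.nan, 475, 284],
    'August Sales': [477, 125, 483, 271, 563, 325, 463, 479, 473, 496],
})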
df['Diff'] = df['August Sales'].sub(df['July Sales'], fill_value=0)
which returns:
July Sales August Sales Diff
0 459.0 477 18.0
1 422.0 125 -297.0
2 348.0 483 135.0
3 397.0 271 -126.0
4 NaN 563 563.0
5 191.0 325 134.0
6 435.0 463 28.0
7 NaN 479 479.0
8 475.0 473 -2.0
9 284.0 496 212.0
I used a sample dataframe, but it shouldn't be hard to adapt:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 3], 'B': [10, 20, 30, 40]})

def diff(row):
    return row['B'] if pd.isnull(row['A']) else row['B'] - row['A']

df['C'] = df.apply(diff, axis=1)
ORIGINAL DATAFRAME:
A B
0 1.0 10
1 2.0 20
2 NaN 30
3 3.0 40
AFTER apply:
A B C
0 1.0 10 9.0
1 2.0 20 18.0
2 NaN 30 30.0
3 3.0 40 37.0
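Note that apply(..., axis=1) runs a Python-level loop over the rows; for large frames the vectorised np.where, fillna and sub answers above will be considerably faster.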
try this:
def diff(row):
    if pd.isnull(row['col1']):  # 'not row['col1']' would miss NaN, which is truthy
        return row['col2']
    else:
        return row['col1'] - row['col2']

df['col3'] = df.apply(diff, axis=1)
I am trying to create a new column 'New' such that:
1) if the 'y' value is different from zero, it gives me 'y'
2) if 'y' equals zero, it gives me the 'yhat' value
ds y ds yhat
0 1999-03-05 45.0 1999-03-05 37.168417
1 1999-03-06 45.0 1999-03-06 37.109215
2 1999-03-07 45.0 1999-03-07 37.049726
3 1999-03-08 45.0 1999-03-08 36.987036
4 1999-03-09 45.0 1999-03-09 36.926852
5 1999-03-10 45.0 1999-03-10 36.864715
6 1999-03-11 45.0 1999-03-11 36.771622
7 1999-03-12 45.0 1999-03-12 36.712117
8 1999-03-13 45.0 1999-03-13 36.646144
9 1999-03-14 45.0 1999-03-14 36.578244
... ... ... ... ...
7356 NaT 0 2019-04-25 8.321119
7357 NaT 0 2019-04-26 8.315796
In order to do that, I am using this function:
df['New'] = np.where(df['y']!=0, df['y'], df['yhat'])
But I get an error saying:
KeyError: 'y'
Solved, using this (reset_index returns a new DataFrame rather than modifying df in place, so assign the result):
df = df.reset_index()
I want to merge the values of two different columns of a pandas dataframe into one column of a new dataframe.
pandas df1 =
        hapX
0  pos   0.0
1  721   0.2
2  735   0.5
3  739   1.0
pandas df2 =
        hapY
0  pos   0.1
1  721   0.0
2  735   0.6
3  739   1.5
I want to generate a new dataframe like:
df_joined['hapX|Y'] = df1.astype(str).add('|').add(df2.astype(str))
with expected output:
        hapX|Y
0  pos  0.0|0.1
1  721  0.2|0.0
2  735  0.5|0.6
3  739  1.0|1.5
But this outputs a bunch of NaNs:
        hapX  hapY
0  pos   NaN   NaN
1  721   NaN   NaN
2  735   NaN   NaN
3  739   NaN   NaN
Is the problem the values being floats (I don't think so)? What is the problem with my approach?
Also, is there a way to automate the process if the column names are like hapX1 hapX2 hapX3 in one dataframe and hapY1 hapY2 hapY3 in another dataframe?
Thanks,
You can merge the two dataframes and then concatenate hapX and hapY.
Say your first column name is no.
df_joined = df1.merge(df2, on='no')
df_joined['hapX|Y'] = df_joined['hapX'].astype(str) + '|' + df_joined['hapY'].astype(str)
df_joined = df_joined.drop(['hapX', 'hapY'], axis=1)  # drop returns a copy, so assign it
This gives you
    no   hapX|Y
0  pos  0.0|0.1
1  721  0.2|0.0
2  735  0.5|0.6
3  739  1.0|1.5
Just to add onto the previous answer, for the general case of N DataFrames.
Suppose you have a number of DataFrames as follows:
import random
import pandas as pd

dfs = [pd.DataFrame({'hapY' + str(j): [random.random() for i in range(10)]}) for j in range(5)]
such that
>>> dfs[0]
hapY0
0 0.175683
1 0.353729
2 0.949848
3 0.346088
4 0.435292
5 0.837879
6 0.277274
7 0.623121
8 0.325119
9 0.709252
Then (note that in Python 3 map is lazy, so wrap it in list to materialise the result):
>>> list(map(lambda m: '|'.join(m), zip(*[dfs[j]['hapY' + str(j)].astype(str) for j in range(5)])))
['0.0845464936138|0.193336164837|0.551717121013|0.113566029656|0.479590342798',
'0.275851474238|0.694161791339|0.151607726092|0.615367668451|0.498997567849',
'0.116891472119|0.258406028668|0.315137581816|0.819992354178|0.864412473301',
'0.729581942312|0.614902776003|0.443986436146|0.227782256619|0.0149481683863',
'0.745583477173|0.441456815889|0.428691631831|0.307480112319|0.136790112739',
'0.981337451224|0.0117895017035|0.415140979617|0.650957722911|0.968082350568',
'0.725618728314|0.0546057041356|0.715910454674|0.0828229441557|0.220878025678',
'0.704047455894|0.303403129266|0.0499082759635|0.49727194707|0.251623048104',
'0.453595354131|0.146042134766|0.346665276655|0.911092176243|0.291405609407',
'0.140523603089|0.117930249858|0.902071673051|0.0804933425857|0.876006332635']
which you can later put into a DataFrame.
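For example, keeping dfs from above ('hap_joined' is just an arbitrary column name for this sketch):

joined = list(map('|'.join, zip(*[dfs[j]['hapY' + str(j)].astype(str) for j in range(5)])))
out = pd.DataFrame({'hap_joined': joined})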
I think the simplest approach is to rename the columns with a dict (which can be created by a dict comprehension), and add_suffix at the end:
print (df1)
hapX1 hapX2 hapX3 hapX4
pos
23 1.0 0.0 1.0 1.0
24 1.0 1.0 1.5 1.0
28 1.0 0.0 0.5 0.0
print (df2)
hapY1 hapY2 hapY3 hapY4
pos
23 0.0 1.0 0.5 0.0
24 1.0 1.0 1.5 1.0
28 0.0 1.0 1.0 1.0
d = {'hapY' + str(x):'hapX' + str(x) for x in range(1,5)}
print (d)
{'hapY1': 'hapX1', 'hapY3': 'hapX3', 'hapY2': 'hapX2', 'hapY4': 'hapX4'}
df_joined = df1.astype(str).add('|').add(df2.rename(columns=d).astype(str)).add_suffix('|Y')
print (df_joined)
hapX1|Y hapX2|Y hapX3|Y hapX4|Y
pos
23 1.0|0.0 0.0|1.0 1.0|0.5 1.0|0.0
24 1.0|1.0 1.0|1.0 1.5|1.5 1.0|1.0
28 1.0|0.0 0.0|1.0 0.5|1.0 0.0|1.0
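To cover the automation part of the question without hard-coding the range, the rename map can also be built from df2's own columns (a sketch, assuming paired column names differ only in the X/Y letter):

# Derive the hapY -> hapX mapping directly from df2, whatever N is
d = {c: c.replace('hapY', 'hapX') for c in df2.columns}
df_joined = df1.astype(str).add('|').add(df2.rename(columns=d).astype(str)).add_suffix('|Y')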