Creating a new column with an if-else condition (python)

I am trying to create a new column 'New' that:
1) if 'y' is different from zero, gives me 'y'
2) if 'y' is equal to zero, gives me the 'yhat' value
ds y ds yhat
0 1999-03-05 45.0 1999-03-05 37.168417
1 1999-03-06 45.0 1999-03-06 37.109215
2 1999-03-07 45.0 1999-03-07 37.049726
3 1999-03-08 45.0 1999-03-08 36.987036
4 1999-03-09 45.0 1999-03-09 36.926852
5 1999-03-10 45.0 1999-03-10 36.864715
6 1999-03-11 45.0 1999-03-11 36.771622
7 1999-03-12 45.0 1999-03-12 36.712117
8 1999-03-13 45.0 1999-03-13 36.646144
9 1999-03-14 45.0 1999-03-14 36.578244
... ... ... ... ...
7356 NaT 0 2019-04-25 8.321119
7357 NaT 0 2019-04-26 8.315796
In order to do that, I am using this function:
df['New'] = np.where(df['y']!=0, df['y'], df['yhat'])
But I get an error saying:
KeyError: 'y'

Solved
The KeyError suggests 'y' was sitting in the index rather than in the columns. Resetting the index turns it back into a regular column; note that reset_index returns a new DataFrame, so the result must be assigned back:
df = df.reset_index()
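A minimal sketch of the full fix, assuming 'y' becomes a regular column once the index is reset:
import numpy as np
import pandas as pd

df = df.reset_index()  # 'y' is a column again, so df['y'] no longer raises KeyError
df['New'] = np.where(df['y'] != 0, df['y'], df['yhat'])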

Related

Python Dataframe perform algebraic operation between different columns

I have a data frame with three columns. I want to check whether they follow a logical sequence.
code:
df = pd.DataFrame({'low': [10, 15, np.nan], 'medium': [12, 18, 29], 'high': [16, 19, np.nan]})
df =
low medium high
0 10.0 12 16.0
1 15.0 18 19.0
2 NaN 29 NaN
# check if low<medium<high
df['check'] = (df['low']<df['medium'])&(df['medium']<df['high'])
print("Condition failed: %s"%(df['check'].all()))
Present output:
df['check']=
True #correct
True # correct
False # wrong output here; this row should not be evaluated
Basically, I want to avoid comparing NaN values and producing a false result. How can I do that?
You can mask it. Also, instead of a chained condition, you can use between:
df['check'] = (df['medium'].between(df['low'], df['high'], inclusive='neither')
                           .mask(df[['low', 'high']].isna().any(axis=1)))
Output:
low medium high check
0 10.0 12 16.0 True
1 15.0 18 19.0 True
2 NaN 29 NaN NaN
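If you prefer to keep the chained comparison from the question, the same masking idea applies; a minimal sketch:
cond = (df['low'] < df['medium']) & (df['medium'] < df['high'])
# blank out rows where either bound is missing, so NaN never produces False
df['check'] = cond.mask(df[['low', 'high']].isna().any(axis=1))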

Find cumcount and agg func based on past records of each group

I have a dataframe like the one shown below:
df = pd.DataFrame(
    {'stud_name': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF'],
     'qty': [123, 31, 490, 518, 70, 900],
     'trans_date': ['13/11/2020', '10/1/2018', '11/11/2017',
                    '27/03/2016', '13/05/2010', '14/07/2008']})
I would like to do the following:
a) for each stud_name, look at their full past data and compute the min, max and mean of the qty column.
Please note that the first row for every unique stud_name will be NA, because there is no past history to aggregate.
I tried something like the below, but the output is incorrect:
df['trans_date'] = pd.to_datetime(df['trans_date'])
df.sort_values(by=['stud_name','trans_date'],inplace=True)
df['past_transactions'] = df.groupby('stud_name').cumcount()
df['past_max_qty'] = df.groupby('stud_name')['qty'].expanding().max().values
df['past_min_qty'] = df.groupby('stud_name')['qty'].expanding().min().values
df['past_avg_qty'] = df.groupby('stud_name')['qty'].expanding().mean().values
I expect my output to look like the one shown below.
We can use a custom function to calculate the past statistics per student:
def past_stats(q):
    return (
        q.expanding()
         .agg(['max', 'min', 'mean'])
         .shift()
         .add_prefix('past_')
    )

df.join(df.groupby('stud_name')['qty'].apply(past_stats))
stud_name qty trans_date past_max past_min past_mean
2 ABC 490 2017-11-11 NaN NaN NaN
1 ABC 31 2018-10-01 490.0 490.0 490.0
0 ABC 123 2020-11-13 490.0 31.0 260.5
5 DEF 900 2008-07-14 NaN NaN NaN
4 DEF 70 2010-05-13 900.0 900.0 900.0
3 DEF 518 2016-03-27 900.0 70.0 485.0
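The past_transactions counter from the attempt in the question can still be added alongside the joined statistics; a sketch, assuming df is already sorted by stud_name and trans_date as in that attempt:
out = df.join(df.groupby('stud_name')['qty'].apply(past_stats))
# cumcount() numbers each student's rows 0, 1, 2, ... = number of past records
out['past_transactions'] = out.groupby('stud_name').cumcount()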

Exclude column in pandas

I have a dataframe that looks like this:
ENSG                3dir_S2_S23_L004_R1_001  7dir_S2_S25_L004_R1_001  i3dir_S2_S29_L004_R1_001  i7dir_S2_S31_L004_R1_001
ENSG00000000003.15  349.0                    183.0                    199.0                     165.0
ENSG00000000419.13  133.0                    82.0                     190.0                     168.0
ENSG00000000457.14  62.0                     56.0                     95.0                      111.0
ENSG00000000460.17  191.0                    122.0                    300.0                     285.0
ENSG00000001036.14  507.0                    286.0                    326.0                     317.0
ENSG00000001084.13  205.0                    192.0                    310.0                     320.0
ENSG00000001167.14  406.0                    324.0                    379.0                     309.0
ENSG00000001460.18  93.0                     78.0                     146.0                     120.0
I'm attempting to perform a calculation on each row of each column, excluding the column ENSG.
Something like this, where I divide each row value by the sum of the entire column:
df = df.transform(lambda x: x / x.sum())
How can I exclude the column ENSG from this calculation? Could I use iloc?
Use set_index to exclude ENSG from the columns, then transform and reset_index afterwards:
out = df.set_index('ENSG').transform(lambda x: x / x.sum()).reset_index()
print(out)
# Output:
ENSG 3dir_S2_S23_L004_R1_001 7dir_S2_S25_L004_R1_001 i3dir_S2_S29_L004_R1_001 i7dir_S2_S31_L004_R1_001
0 ENSG00000000003.15 0.179342 0.138322 0.102314 0.091922
1 ENSG00000000419.13 0.068345 0.061980 0.097686 0.093593
2 ENSG00000000457.14 0.031860 0.042328 0.048843 0.061838
3 ENSG00000000460.17 0.098150 0.092215 0.154242 0.158774
4 ENSG00000001036.14 0.260534 0.216175 0.167609 0.176602
5 ENSG00000001084.13 0.105344 0.145125 0.159383 0.178273
6 ENSG00000001167.14 0.208633 0.244898 0.194859 0.172145
7 ENSG00000001460.18 0.047790 0.058957 0.075064 0.066852
Assuming ENSG is the first column, yes, you can use iloc:
df.iloc[:, 1:] = df.iloc[:, 1:] / np.sum(df.iloc[:, 1:], axis=0)
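If the frame might contain other non-numeric columns besides ENSG, a more defensive sketch is to normalise only the numeric columns:
num = df.select_dtypes('number')
df[num.columns] = num / num.sum()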

pd.wide_to_long gives an empty data frame when trying to convert to a long data frame

I have the following dataframe called new:
policy_rate2008-01-01 policy_rate2008-02-21 ... ID x
ID ...
1 10.0 11.0 ... 48150000996 2.0
2 10.0 11.0 ... 60001024367 5.0
3 10.0 11.0 ... 58001020206 1.0
4 10.0 11.0 ... 57001015191 13.0
5 10.0 11.0 ... 51001004844 15.0
I want to convert it to long format with the following command:
pd.wide_to_long(new,['policy_rate','REER','inflation'],i="ID", j="year")
But the output of this command is an empty dataframe with only column names.
The full list of columns in dataframe new is:
Index(['policy_rate2008-01-01', 'policy_rate2008-02-21',
'policy_rate2008-03-20', 'policy_rate2008-04-17',
'policy_rate2008-05-15', 'policy_rate2008-06-26',
'policy_rate2008-07-24', 'policy_rate2008-08-21',
'policy_rate2008-09-11', 'policy_rate2008-10-16',
...
'inflation2020-05-29', 'inflation2020-06-24', 'inflation2020-07-29',
'inflation2020-08-05', 'inflation2020-09-16', 'inflation2020-10-28',
'inflation2020-11-29', 'inflation2020-12-09', 'ID', 'x'],
dtype='object', length=470)
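One likely cause: wide_to_long's suffix parameter defaults to '\d+', which only matches purely numeric suffixes, so date suffixes such as '2008-01-01' never match and every stub is dropped, leaving an empty frame. A sketch of a possible fix, assuming every stub column ends in a YYYY-MM-DD date:
pd.wide_to_long(new, ['policy_rate', 'REER', 'inflation'],
                i='ID', j='year', suffix=r'\d{4}-\d{2}-\d{2}')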

How to add conditions when calculating using Python?

I have a dataframe with two numeric columns. I want to add a third column to calculate the difference. But the condition is: if the value in the first column is blank or NaN, the difference should be the value in the second column...
Can anyone help me with this problem?
Any suggestions and clues will be appreciated!
Thank you.
You should use vectorised operations where possible. Here you can use numpy.where:
df['Difference'] = np.where(df['July Sales'].isnull(), df['August Sales'],
                            df['August Sales'] - df['July Sales'])
However, note that this is precisely the same as treating NaN values in df['July Sales'] as zero, so you can use pd.Series.fillna:
df['Difference'] = df['August Sales'] - df['July Sales'].fillna(0)
This isn't really a situation that needs conditions; it is just a math operation. Consider using the .sub() method, which takes a fill_value:
df['Diff'] = df['August Sales'].sub(df['July Sales'], fill_value=0)
This returns:
July Sales August Sales Diff
0 459.0 477 18.0
1 422.0 125 -297.0
2 348.0 483 135.0
3 397.0 271 -126.0
4 NaN 563 563.0
5 191.0 325 134.0
6 435.0 463 28.0
7 NaN 479 479.0
8 475.0 473 -2.0
9 284.0 496 212.0
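For reference, a minimal reconstruction of the sample frame, with values taken from the output table above:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'July Sales': [459.0, 422.0, 348.0, 397.0, np.nan, 191.0, 435.0, np.nan, 475.0, 284.0],
    'August Sales': [477, 125, 483, 271, 563, 325, 463, 479, 473, 496],
})
df['Diff'] = df['August Sales'].sub(df['July Sales'], fill_value=0)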
I used a sample dataframe, but it shouldn't be hard to adapt:
df = pd.DataFrame({'A': [1, 2, np.nan, 3], 'B': [10, 20, 30, 40]})
def diff(row):
    return row['B'] if pd.isnull(row['A']) else row['B'] - row['A']

df['C'] = df.apply(diff, axis=1)
ORIGINAL DATAFRAME:
A B
0 1.0 10
1 2.0 20
2 NaN 30
3 3.0 40
AFTER apply:
A B C
0 1.0 10 9.0
1 2.0 20 18.0
2 NaN 30 30.0
3 3.0 40 37.0
Or try a row-wise function, checking for NaN explicitly (a plain not row['col1'] test would miss NaN, since NaN is truthy):
def diff(row):
    if pd.isnull(row['col1']):
        return row['col2']
    return row['col2'] - row['col1']

df['col3'] = df.apply(diff, axis=1)
