How to add conditions when calculating using Python?

I have a dataframe with two numeric columns. I want to add a third column containing their difference. The condition is: if the value in the first column is blank or NaN, the difference should simply be the value in the second column...
Can anyone help me with this problem?
Any suggestions and clues will be appreciated!
Thank you.

You should use vectorised operations where possible. Here you can use numpy.where:
df['Difference'] = np.where(df['July Sales'].isnull(), df['August Sales'],
                            df['August Sales'] - df['July Sales'])
However, note that this is precisely the same as treating NaN values in df['July Sales'] as zero. So you can instead use pd.Series.fillna:
df['Difference'] = df['August Sales'] - df['July Sales'].fillna(0)
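For instance, a quick check on a small hypothetical frame (the values are made up; the column names follow the snippet above) shows the two approaches agree:
import numpy as np
import pandas as pd

df = pd.DataFrame({'July Sales': [100.0, np.nan, 250.0],
                   'August Sales': [120.0, 80.0, 200.0]})

via_where = np.where(df['July Sales'].isnull(), df['August Sales'],
                     df['August Sales'] - df['July Sales'])
via_fillna = df['August Sales'] - df['July Sales'].fillna(0)

print((via_fillna == via_where).all())  # True: both give [20.0, 80.0, -50.0]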

This isn't really a situation that needs conditions; it is just an arithmetic operation. Given your df, you can use the .sub() method with fill_value=0:
df['Diff'] = df['August Sales'].sub(df['July Sales'], fill_value=0)
which returns:
   July Sales  August Sales   Diff
0       459.0           477   18.0
1       422.0           125 -297.0
2       348.0           483  135.0
3       397.0           271 -126.0
4         NaN           563  563.0
5       191.0           325  134.0
6       435.0           463   28.0
7         NaN           479  479.0
8       475.0           473   -2.0
9       284.0           496  212.0

I used a sample dataframe, but it should be easy to adapt:
df = pd.DataFrame({'A': [1, 2, np.nan, 3], 'B': [10, 20, 30, 40]})

def diff(row):
    return row['B'] if pd.isnull(row['A']) else row['B'] - row['A']

df['C'] = df.apply(diff, axis=1)
ORIGINAL DATAFRAME:
     A   B
0  1.0  10
1  2.0  20
2  NaN  30
3  3.0  40
AFTER apply:
     A   B     C
0  1.0  10   9.0
1  2.0  20  18.0
2  NaN  30  30.0
3  3.0  40  37.0

Try this (checking for missing values with pd.isnull so that NaN rows are caught):
def diff(row):
    if pd.isnull(row['col1']):
        return row['col2']
    else:
        return row['col1'] - row['col2']

df['col3'] = df.apply(diff, axis=1)

Related

Find cumcount and agg func based on past records of each group

I have a dataframe as shown below:
df = pd.DataFrame(
    {'stud_name': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF'],
     'qty': [123, 31, 490, 518, 70, 900],
     'trans_date': ['13/11/2020', '10/1/2018', '11/11/2017',
                    '27/03/2016', '13/05/2010', '14/07/2008']})
I would like to do the below
a) for each stud_name, look at their past data (full past data) and compute the min, max and mean of qty column
Please note that the 1st record/row for every unique stud_name will be NA because there is no past data (history) to look at and compute the aggregate statistics
I tried something like below but the output is incorrect
df['trans_date'] = pd.to_datetime(df['trans_date'])
df.sort_values(by=['stud_name','trans_date'],inplace=True)
df['past_transactions'] = df.groupby('stud_name').cumcount()
df['past_max_qty'] = df.groupby('stud_name')['qty'].expanding().max().values
df['past_min_qty'] = df.groupby('stud_name')['qty'].expanding().min().values
df['past_avg_qty'] = df.groupby('stud_name')['qty'].expanding().mean().values
I expect my output to be as shown below.
We can use a custom function to calculate the past statistics per student:
def past_stats(q):
    return (
        q.expanding()
         .agg(['max', 'min', 'mean'])
         .shift()
         .add_prefix('past_')
    )
df.join(df.groupby('stud_name')['qty'].apply(past_stats))
  stud_name  qty trans_date  past_max  past_min  past_mean
2       ABC  490 2017-11-11       NaN       NaN        NaN
1       ABC   31 2018-10-01     490.0     490.0      490.0
0       ABC  123 2020-11-13     490.0      31.0      260.5
5       DEF  900 2008-07-14       NaN       NaN        NaN
4       DEF   70 2010-05-13     900.0     900.0      900.0
3       DEF  518 2016-03-27     900.0      70.0      485.0
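The reason the attempt in the question looked wrong is that expanding() includes the current row in each window; shifting by one row excludes it. A minimal sketch of the same idea applied column by column (assuming df is already sorted by stud_name and trans_date as in the question):
g = df.groupby('stud_name')['qty']
df['past_max_qty'] = g.transform(lambda s: s.expanding().max().shift())
df['past_min_qty'] = g.transform(lambda s: s.expanding().min().shift())
df['past_avg_qty'] = g.transform(lambda s: s.expanding().mean().shift())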

Exclude column in pandas

I have a dataframe that looks like this:
                 ENSG  3dir_S2_S23_L004_R1_001  7dir_S2_S25_L004_R1_001  i3dir_S2_S29_L004_R1_001  i7dir_S2_S31_L004_R1_001
0  ENSG00000000003.15                    349.0                    183.0                     199.0                     165.0
1  ENSG00000000419.13                    133.0                     82.0                     190.0                     168.0
2  ENSG00000000457.14                     62.0                     56.0                      95.0                     111.0
3  ENSG00000000460.17                    191.0                    122.0                     300.0                     285.0
4  ENSG00000001036.14                    507.0                    286.0                     326.0                     317.0
5  ENSG00000001084.13                    205.0                    192.0                     310.0                     320.0
6  ENSG00000001167.14                    406.0                    324.0                     379.0                     309.0
7  ENSG00000001460.18                     93.0                     78.0                     146.0                     120.0
I'm attempting to perform a calculation on each row of each column, excluding the column ENSG.
Something like this, where I divide each row value by the sum of the entire column:
df = df.transform(lambda x: x / x.sum())
How can I exclude the column ENSG from this calculation? Could I use iloc?
Use set_index to exclude ENSG from the columns, then transform, and reset_index afterwards:
out = df.set_index('ENSG').transform(lambda x: x / x.sum()).reset_index()
print(out)
# Output:
ENSG 3dir_S2_S23_L004_R1_001 7dir_S2_S25_L004_R1_001 i3dir_S2_S29_L004_R1_001 i7dir_S2_S31_L004_R1_001
0 ENSG00000000003.15 0.179342 0.138322 0.102314 0.091922
1 ENSG00000000419.13 0.068345 0.061980 0.097686 0.093593
2 ENSG00000000457.14 0.031860 0.042328 0.048843 0.061838
3 ENSG00000000460.17 0.098150 0.092215 0.154242 0.158774
4 ENSG00000001036.14 0.260534 0.216175 0.167609 0.176602
5 ENSG00000001084.13 0.105344 0.145125 0.159383 0.178273
6 ENSG00000001167.14 0.208633 0.244898 0.194859 0.172145
7 ENSG00000001460.18 0.047790 0.058957 0.075064 0.066852
Assuming ENSG is the first column, yes, you can use iloc:
df.iloc[:, 1:] = df.iloc[:, 1:] / np.sum(df.iloc[:, 1:], axis=0)
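If you'd rather not rely on column position, the same idea works by excluding the column by name (a sketch, not from the original answers):
# boolean mask over the columns: everything except ENSG
cols = df.columns != 'ENSG'
df.loc[:, cols] = df.loc[:, cols] / df.loc[:, cols].sum()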

Python groupby nested dictionary is ambiguous in aggregation

I am currently working on my thesis and facing some problems with a groupby operation. I am trying to find out someone's total purchase amount, average purchase amount, purchase count, how many products were bought in total, and the average value per product.
The data looks like this:
id purchase_amount price_products #_products
0 123 30 20.00 2
2 123 NaN 10.00 NaN
3 124 50.00 25.00 3
4 124 NaN 15.00 NaN
5 124 NaN 10.00 NaN
My code looks like this:
df.groupby('id')[['purchase_amount', 'price_products', '#_products']].agg(
    total_purchase_amount=('purchase_amount', 'sum'),
    average_purchase_amount=('purchase_amount', 'mean'),
    times_purchased=('#_products', 'count'),
    total_amount_products_purchased=('price_products', 'count'),
    average_value_products=('price_products', 'mean'))
But I get the following error:
SpecificationError: nested dictionary is ambiguous in aggregation
I cannot seem to find what I am doing wrong, hopefully someone can help me!
Do it like this for each calculation:
df.groupby('id')['purchase_amount'].agg({'total_purchase_amount': 'sum'})
(Note: dict-based renaming in agg was deprecated in pandas 0.20 and removed in 1.0; in newer versions use named aggregation instead.)
Since you have several variables to aggregate, I would suggest using the following form of aggregation:
df.groupby('id')[<variables-list>].agg([<statistics-list>])
For example:
df_agg = df.groupby('id')[['purchase_amount','price_products','#_products']].agg(["count", "mean", "sum"])
This will create a column-wise multi-level output data frame df_agg that looks like:
purchase_amount price_products #_products
count mean sum count mean sum count mean sum
id
123 1 30.0 30.0 2 15 30 1 2.0 2.0
124 1 50.0 50.0 3 17 51 1 3.0 3.0
You can then refer to a particular entry in the output data frame using the multi-index as follows:
df_agg['purchase_amount']['mean']
id
123 30.0
124 50.0
Name: mean, dtype: float64
or if you want e.g. all the means, use the cross-sectional method xs():
df_agg.xs('mean', axis=1, level=1)
purchase_amount price_products #_products
id
123 30.0 15 2.0
124 50.0 17 3.0
Note: this will make pandas compute more statistics than you actually need, as in your example. That may not matter in many contexts, and it has the advantage that the code is shorter and generalises to any number of numeric variables to aggregate.
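If you only want specific statistics, named aggregation (available from pandas 0.25) lets you spell out exactly which ones to compute and what to call them. A sketch using the column names from the question:
df.groupby('id').agg(
    total_purchase_amount=('purchase_amount', 'sum'),
    average_purchase_amount=('purchase_amount', 'mean'),
    times_purchased=('#_products', 'count'),
    total_amount_products_purchased=('price_products', 'count'),
    average_value_products=('price_products', 'mean'),
)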
You can do this in an organized way using a dictionary for your aggregation.
df = pd.DataFrame([[123, 30, 20, 2],
                   [123, np.nan, 10, np.nan],
                   [124, 50, 25, 3],
                   [124, np.nan, 15, np.nan],
                   [124, np.nan, 10, np.nan]],
                  columns=['id', 'purchase_amount', 'price_products', 'num_products'])
agg_dict = {
    'purchase_amount': [np.sum, np.mean],
    'num_products': [np.count_nonzero],
    'price_products': [np.count_nonzero, np.mean],
}
print(df.groupby('id').agg(agg_dict))
output:
purchase_amount num_products price_products
sum mean count_nonzero count_nonzero mean
id
123 30.0 30.0 2.0 2 15.000000
124 50.0 50.0 3.0 3 16.666667

Pandas 'eval' with NaN

I have a DataFrame with data as strings. These need to be evaluated and converted to numeric.
Let my df be:
var_pct var_num
-76*2 14*1000000
-76*2 12*1000000
111*2 29*1000000
47*2 33*1000000
nan 60*1000000
for column in df:
    df[column] = df.eval(df[column], inplace=True)
I faced a problem with the column containing 'nan': the result of eval has fewer rows than the original. How do I make sure the 'nan' will be a '' after eval?
You should avoid eval. In this case, I recommend you split into numeric series first:
df = df.fillna('NaN*NaN')
for col in df.columns:
    df = df.join(df.pop(col).str.split('*', expand=True)
                            .apply(pd.to_numeric, errors='coerce')
                            .add_prefix(f'{col}_'))
print(df)
var_pct_0 var_pct_1 var_num_0 var_num_1
0 -76.0 2.0 14 1000000
1 -76.0 2.0 12 1000000
2 111.0 2.0 29 1000000
3 47.0 2.0 33 1000000
4 NaN NaN 60 1000000
Then perform your calculations using vectorised operations:
for col in ['var_pct', 'var_num']:
    df[col] = df[f'{col}_0'] * df[f'{col}_1']
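Optionally, the helper columns can then be dropped (a small follow-up, not part of the original answer):
df = df.drop(columns=['var_pct_0', 'var_pct_1', 'var_num_0', 'var_num_1'])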
For academic purposes, the approach you are attempting is possible via the top level function pd.eval together with applymap. But beware, this is just an inefficient Python-level loop.
nan = np.nan
df = df.fillna('nan*nan')
df = df.applymap(lambda x: pd.eval(x))
print(df)
var_pct var_num
0 -152.0 14000000
1 -152.0 12000000
2 222.0 29000000
3 94.0 33000000
4 NaN 60000000
Assuming that you can live with a copied dataframe (here df3 is the string frame with NaN already replaced by 'nan*nan', as above):
def ff(val):
    if 'nan' not in val:
        return eval(val)
    return np.nan

df4 = df3.applymap(ff)
print(df4)
var_pct var_num
0 -152.0 14000000
1 -152.0 12000000
2 222.0 29000000
3 94.0 33000000
4 NaN 60000000
Of course, ff can be expressed as a lambda too:
lambda val: eval(val) if 'nan' not in val else np.nan

How to merge the two columns from two dataframe into one column of a new dataframe (pandas)?

I want to merge the values of two columns from two different pandas dataframes into one column of a new dataframe.
pandas df1 =
hapX
pos 0.0
1 721 0.2
2 735 0.5
3 739 1.0
pandas df2 =
hapY
pos 0.1
1 721 0.0
2 735 0.6
3 739 1.5
I want to generate a new dataframe like:
df_joined['hapX|Y'] = df1.astype(str).add('|').add(df2.astype(str))
with expected output:
hapX|Y
pos 0.0|0.1
1 721 0.2|0.0
2 735 0.5|0.6
3 739 1.0|1.5
But this outputs a bunch of NaN:
hapX hapY
pos NaN NaN
1 721 NaN NaN
2 735 NaN NaN
3 739 NaN NaN
Is the problem that the values are floats? (I don't think so.) What is the problem with my approach?
Also, is there a way to automate the process if the column names are like hapX1 hapX2 hapX3 in one dataframe and hapY1 hapY2 hapY3 in another dataframe?
Thanks,
You can merge the two dataframes and then concatenate hapX and hapY.
Say your first column is named no.
df_joined = df1.merge(df2, on='no')
df_joined['hapX|Y'] = df_joined['hapX'].astype(str) + '|' + df_joined['hapY'].astype(str)
df_joined = df_joined.drop(['hapX', 'hapY'], axis=1)
This gives you
no hapX|Y
0 pos 0.0|0.1
1 721 0.2|0.0
2 735 0.5|0.6
3 739 1.0|1.5
To add onto the previous answer, for the general case of N DataFrames, suppose you have a number of DataFrames as follows:
import random
dfs = [pd.DataFrame({'hapY' + str(j): [random.random() for i in range(10)]}) for j in range(5)]
such that
>>> dfs[0]
hapY0
0 0.175683
1 0.353729
2 0.949848
3 0.346088
4 0.435292
5 0.837879
6 0.277274
7 0.623121
8 0.325119
9 0.709252
Then,
>>> list(map(lambda m: '|'.join(m), zip(*[dfs[j]['hapY' + str(j)].astype(str) for j in range(5)])))
['0.0845464936138|0.193336164837|0.551717121013|0.113566029656|0.479590342798',
'0.275851474238|0.694161791339|0.151607726092|0.615367668451|0.498997567849',
'0.116891472119|0.258406028668|0.315137581816|0.819992354178|0.864412473301',
'0.729581942312|0.614902776003|0.443986436146|0.227782256619|0.0149481683863',
'0.745583477173|0.441456815889|0.428691631831|0.307480112319|0.136790112739',
'0.981337451224|0.0117895017035|0.415140979617|0.650957722911|0.968082350568',
'0.725618728314|0.0546057041356|0.715910454674|0.0828229441557|0.220878025678',
'0.704047455894|0.303403129266|0.0499082759635|0.49727194707|0.251623048104',
'0.453595354131|0.146042134766|0.346665276655|0.911092176243|0.291405609407',
'0.140523603089|0.117930249858|0.902071673051|0.0804933425857|0.876006332635']
which you can later put into a DataFrame.
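For example, a minimal sketch of that last step (the variable and column names here are made up for illustration):
joined = ['|'.join(m) for m in zip(*[dfs[j]['hapY' + str(j)].astype(str) for j in range(5)])]
result = pd.DataFrame({'hapY_joined': joined})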
I think the simplest way is to rename the columns of df2 with a dict (built by a dict comprehension) so they match df1, add the frames, and finally use add_suffix:
print (df1)
hapX1 hapX2 hapX3 hapX4
pos
23 1.0 0.0 1.0 1.0
24 1.0 1.0 1.5 1.0
28 1.0 0.0 0.5 0.0
print (df2)
hapY1 hapY2 hapY3 hapY4
pos
23 0.0 1.0 0.5 0.0
24 1.0 1.0 1.5 1.0
28 0.0 1.0 1.0 1.0
d = {'hapY' + str(x):'hapX' + str(x) for x in range(1,5)}
print (d)
{'hapY1': 'hapX1', 'hapY3': 'hapX3', 'hapY2': 'hapX2', 'hapY4': 'hapX4'}
df_joined = df1.astype(str).add('|').add(df2.rename(columns=d).astype(str)).add_suffix('|Y')
print (df_joined)
hapX1|Y hapX2|Y hapX3|Y hapX4|Y
pos
23 1.0|0.0 0.0|1.0 1.0|0.5 1.0|0.0
24 1.0|1.0 1.0|1.0 1.5|1.5 1.0|1.0
28 1.0|0.0 0.0|1.0 0.5|1.0 0.0|1.0
