Pandas dataframe column subtraction, handling NaN - python

I have a data frame, for example:
df = pd.DataFrame([(np.nan, .32), (.01, np.nan), (np.nan, np.nan), (.21, .18)],
columns=['A', 'B'])
A B
0 NaN 0.32
1 0.01 NaN
2 NaN NaN
3 0.21 0.18
And I want to subtract column B from A
df['diff'] = df['A'] - df['B']
A B diff
0 NaN 0.32 NaN
1 0.01 NaN NaN
2 NaN NaN NaN
3 0.21 0.18 0.03
Difference returns NaN if one of the columns is NaN. To overcome this I use fillna
df['diff'] = df['A'].fillna(0) - df['B'].fillna(0)
A B diff
0 NaN 0.32 -0.32
1 0.01 NaN 0.01
2 NaN NaN 0.00
3 0.21 0.18 0.03
This removes the NaN from the diff column when only one of the columns is NaN, but for index 2 the result is 0, whereas I want the difference to be NaN because both A and B are NaN.
Is there a way to explicitly tell pandas to output NaN if both columns are NaN?

Use Series.sub with fill_value=0 parameter:
df['diff'] = df['A'].sub(df['B'], fill_value=0)
print (df)
A B diff
0 NaN 0.32 -0.32
1 0.01 NaN 0.01
2 NaN NaN NaN
3 0.21 0.18 0.03
If you need to replace the remaining NaNs with 0, add Series.fillna:
df['diff'] = df['A'].sub(df['B'], fill_value=0).fillna(0)
print (df)
A B diff
0 NaN 0.32 -0.32
1 0.01 NaN 0.01
2 NaN NaN 0.00
3 0.21 0.18 0.03
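Why row 2 stays NaN in the first variant: as far as the pandas documentation describes it, fill_value substitutes for a missing value on one side only, and the result stays missing when both sides are missing. A minimal sketch reusing the frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [(np.nan, 0.32), (0.01, np.nan), (np.nan, np.nan), (0.21, 0.18)],
    columns=["A", "B"],
)

# fill_value=0 replaces a NaN only when the other operand is present;
# where both operands are NaN the result is left as NaN.
df["diff"] = df["A"].sub(df["B"], fill_value=0)
print(df)
```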

Related

count values of each month, fill NaN if under certain limit

I am working with a dataframe where every column represents a company. The index is a datetime index with daily frequency. My problem is the following: for each company, I would like to fill a month with NaN if there are fewer than 20 values in that month. In the example below, this would mean that Company_1's entry 0.91 on 2012-08-31 would be changed to NaN, while Company_2 and Company_3 would be unchanged.
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
... ... ... ...
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 0.91 0.51 -0.33
Total Values: 1 22 21
I am struggling to find an efficient way to count the number of values for each month of each stock. I could theoretically write a function which creates a new dataframe, which reports the number of values for each month (and for each stock), to then use that dataframe for the original company information, but I am sure that there has to be an easier way. Any help is highly appreciated. Thanks in advance.
Group the dataframe by month with groupby and count the non-NaN values per column with transform('count'), then use DataFrame.lt to build a boolean mask and pass it to DataFrame.mask to replace the flagged values with NaN:
df1 = df.mask(df.groupby(pd.Grouper(freq='M')).transform('count').lt(20))
print(df1)
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
....
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 NaN 0.51 -0.33
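A self-contained sketch of the same idea on made-up data (the values and the business-day index are assumptions, not the asker's real data). It groups by df.index.to_period("M") instead of pd.Grouper(freq='M'), a swapped-in variant chosen because recent pandas releases deprecate the 'M' offset alias in favour of 'ME':

```python
import numpy as np
import pandas as pd

# Hypothetical data: the 23 business days of August 2012.
idx = pd.date_range("2012-08-01", "2012-08-31", freq="B")
df = pd.DataFrame(
    {
        "Company_1": np.nan,
        "Company_2": np.arange(len(idx), dtype=float),
        "Company_3": np.arange(len(idx), dtype=float),
    },
    index=idx,
)
# Company_1 has a single value in the whole month.
df.loc[pd.Timestamp("2012-08-31"), "Company_1"] = 0.91

# Count non-NaN values per calendar month and column, broadcast back to every row.
counts = df.groupby(df.index.to_period("M")).transform("count")

# Replace each (month, company) block that has fewer than 20 values with NaN.
df1 = df.mask(counts.lt(20))
print(df1.tail(3))
```

Because transform returns a frame with the same shape as df, the mask lines up cell by cell and no intermediate lookup table is needed.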
IIUC:
# Note: this counts non-null values over the whole column, not per month.
df.loc[:, df.apply(lambda d: d.notnull().sum()<20)] = np.nan
print (df)
Company 1 Company 2 Company 3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 NaN 0.51 -0.33

Pandas combine two columns into one and exclude NaN values

I have a 5k x 2 column dataframe called "both".
I want to create a new 5k x 1 DataFrame or column (doesn't matter) by replacing any NaN value in one column with the value of the adjacent column.
ex:
Gains Loss
0 NaN NaN
1 NaN -0.17
2 NaN -0.13
3 NaN -0.75
4 NaN -0.17
5 NaN -0.99
6 1.06 NaN
7 NaN -1.29
8 NaN -0.42
9 0.14 NaN
so for example, I need to swap the NaNs in the first column in rows 1 through 5 with the values in the same rows, in second column to get a new df of the following form:
Change
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
How do I tell Python to do this?
You may fill the NaN values with zeroes and then simply add your columns:
both["Change"] = both["Gains"].fillna(0) + both["Loss"].fillna(0)
Then — if you need it — you may return the resulting zeroes back to NaNs:
both["Change"] = both["Change"].replace(0, np.nan)
The result:
Gains Loss Change
0 NaN NaN NaN
1 NaN -0.17 -0.17
2 NaN -0.13 -0.13
3 NaN -0.75 -0.75
4 NaN -0.17 -0.17
5 NaN -0.99 -0.99
6 1.06 NaN 1.06
7 NaN -1.29 -1.29
8 NaN -0.42 -0.42
9 0.14 NaN 0.14
Finally, if you want to get rid of your original columns, you may drop them:
both.drop(columns=["Gains", "Loss"], inplace=True)
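One caveat with the fill-with-zero approach is that a genuine 0 in the data would also be turned into NaN. A possible alternative, sketched here on a small made-up frame (the values, including the 0.0 row, are assumptions for illustration), is to fill the gaps in one column directly from the other with Series.fillna:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the last row holds a genuine zero gain.
both = pd.DataFrame(
    {
        "Gains": [np.nan, np.nan, 1.06, 0.0],
        "Loss": [np.nan, -0.17, np.nan, np.nan],
    }
)

# Take Gains where it is present, otherwise fall back to Loss.
# Rows where both are NaN stay NaN, and a real 0.0 is preserved.
both["Change"] = both["Gains"].fillna(both["Loss"])
print(both)
```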
There are many ways to achieve this. One is using the loc property:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Price1': [np.nan,np.nan,np.nan,np.nan,
np.nan,np.nan,1.06,np.nan,np.nan],
'Price2': [np.nan,-0.17,-0.13,-0.75,-0.17,
-0.99,np.nan,-1.29,-0.42]})
df.loc[df['Price1'].isnull(), 'Price1'] = df['Price2']
df = df.loc[:6,'Price1']
print(df)
Output:
Price1
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
You can see more complex recipes in the Cookbook
IIUC, we can filter for null values and just sum the columns to make your new dataframe.
cols = ['Gains','Loss']
s = df.isnull().cumsum(axis=1).eq(len(df.columns)).any(axis=1)
# add df[cols].isnull() if you only want to measure the price columns for nulls.
df['prices'] = df[cols].loc[~s].sum(axis=1)
df = df.drop(cols,axis=1)
print(df)
prices
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
7 -1.29
8 -0.42
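If the goal is just "sum the two columns, but give NaN when both are NaN", the min_count parameter of DataFrame.sum may be a shorter route. This is a sketch on a reduced version of the question's data, not the answerer's method:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Gains": [np.nan, np.nan, np.nan, 1.06],
        "Loss": [np.nan, -0.17, -0.13, np.nan],
    }
)
cols = ["Gains", "Loss"]

# min_count=1 requires at least one non-NaN value per row;
# rows with none yield NaN instead of the default 0.
df["prices"] = df[cols].sum(axis=1, min_count=1)
print(df)
```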

Calculate mean only when the number of values in each row is higher than a certain number in Python pandas

I have a daily time series dataframe with nine columns. Each column represents the measurement from a different method. I want to calculate the daily mean only when there are more than two measurements; otherwise I want to assign NaN. How can I do that with a pandas dataframe?
suppose my df looks like:
0 1 2 3 4 5 6 7 8
2000-02-25 NaN 0.22 0.54 NaN NaN NaN NaN NaN NaN
2000-02-26 0.57 NaN 0.91 0.21 NaN 0.22 NaN 0.51 NaN
2000-02-27 0.10 0.14 0.09 NaN 0.17 NaN 0.05 NaN NaN
2000-02-28 NaN NaN NaN NaN NaN NaN NaN NaN 0.14
2000-02-29 0.82 NaN 0.75 NaN NaN NaN 0.14 NaN NaN
and I'm expecting mean values like:
0
2000-02-25 NaN
2000-02-26 0.48
2000-02-27 0.11
2000-02-28 NaN
2000-02-29 0.57
Use DataFrame.where to set rows to NaN based on a condition: count the non-NaN values per row with DataFrame.count and compare with Series.gt (>):
s = df.where(df.count(axis=1).gt(2)).mean(axis=1)
# alternative solution with the order of operations swapped
#s = df.mean(axis=1).where(df.count(axis=1).gt(2))
print (s)
2000-02-25 NaN
2000-02-26 0.484
2000-02-27 0.110
2000-02-28 NaN
2000-02-29 0.570
dtype: float64
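For a runnable check, here is the second (commented) variant applied to the frame from the question, rebuilt by hand. The alias n for np.nan and the name min_needed are only shorthand for this sketch:

```python
import numpy as np
import pandas as pd

n = np.nan  # shorthand used only in this sketch
df = pd.DataFrame(
    [
        [n, 0.22, 0.54, n, n, n, n, n, n],
        [0.57, n, 0.91, 0.21, n, 0.22, n, 0.51, n],
        [0.10, 0.14, 0.09, n, 0.17, n, 0.05, n, n],
        [n, n, n, n, n, n, n, n, 0.14],
        [0.82, n, 0.75, n, n, n, 0.14, n, n],
    ],
    index=pd.date_range("2000-02-25", periods=5, freq="D"),
)

min_needed = 2  # keep the mean only when a row has more than this many values

# Row means skip NaN by default; where() then blanks rows with too few values.
s = df.mean(axis=1).where(df.count(axis=1).gt(min_needed))
print(s)
```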

How to merge pandas DataFrame with same column name?

I want to merge two DataFrames whose index is a timestamp and which share the same column name, and also to replace NaN with a value. It does not seem to be working.
sample:
import pandas as pd
import numpy as np
times = pd.to_datetime(pd.Series(['2014-07-4',
'2014-07-15','2014-08-24','2014-08-25','2014-09-10','2014-09-17']))
valuea = [0.01, 0.02, -0.03, 0.4 ,0.5,np.nan]
times2 = pd.to_datetime(pd.Series(['2014-07-6',
'2014-07-16','2014-08-27','2014-09-5','2014-09-11','2014-09-17']))
valuea2 = [1, 2, 3, 4,5,-6]
df1 = pd.DataFrame({'value A': valuea}, index=times)
df2 = pd.DataFrame({'value A': valuea2}, index=times2)
df3=pd.merge(df1,df2, left_index=True, right_index=True)
df3.head()
Assuming you need outer join
pd.concat([df1,df2],axis=1)
Out[321]:
value A value A
2014-07-04 0.01 NaN
2014-07-06 NaN 1.0
2014-07-15 0.02 NaN
2014-07-16 NaN 2.0
2014-08-24 -0.03 NaN
2014-08-25 0.40 NaN
2014-08-27 NaN 3.0
2014-09-05 NaN 4.0
2014-09-10 0.50 NaN
2014-09-11 NaN 5.0
2014-09-17 NaN -6.0
Update
df1.combine_first(df2)
Out[324]:
value A
2014-07-04 0.01
2014-07-06 1.00
2014-07-15 0.02
2014-07-16 2.00
2014-08-24 -0.03
2014-08-25 0.40
2014-08-27 3.00
2014-09-05 4.00
2014-09-10 0.50
2014-09-11 5.00
2014-09-17 -6.00
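A self-contained version of the combine_first call, using the sample from the question (the name merged is just for this sketch). combine_first takes the union of both indexes and keeps the caller's value wherever it is not null, falling back to the other frame otherwise, which is why the NaN on 2014-09-17 in df1 is replaced by -6 from df2:

```python
import numpy as np
import pandas as pd

times = pd.to_datetime(["2014-07-04", "2014-07-15", "2014-08-24",
                        "2014-08-25", "2014-09-10", "2014-09-17"])
times2 = pd.to_datetime(["2014-07-06", "2014-07-16", "2014-08-27",
                         "2014-09-05", "2014-09-11", "2014-09-17"])

df1 = pd.DataFrame({"value A": [0.01, 0.02, -0.03, 0.4, 0.5, np.nan]}, index=times)
df2 = pd.DataFrame({"value A": [1, 2, 3, 4, 5, -6]}, index=times2)

# Prefer df1's values; fill its gaps (and missing dates) from df2.
merged = df1.combine_first(df2)
print(merged)
```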

How to insert value in column if condition is true using Pandas (Python)

I have the following dataset and I am trying to create a condition: if the value in the Percent cell is positive, I want the Match cell to show the next row's Percent value (i.e. row i+1). How can I perform this operation without using a loop? For example, in row 0, Match would display the value -0.34.
User Percent Match
0 A 0.87 NaN
1 A -0.34 NaN
2 A 0.71 NaN
3 A -0.58 NaN
4 B -1.67 NaN
5 B -0.44 NaN
6 B -0.72 NaN
7 C 0.19 NaN
8 C 0.39 NaN
9 C -0.28 NaN
10 C 0.53 NaN
Additionally, how would I be able to get a summation of the two subsequent values following a positive number in the Percent cell? I have the following code, but I am making an error when indexing the row location.
df1.ix[df1.Percent >=0, ['Match']] = df1.iloc[:1]['Match']; df1
For the first part you can use loc with a boolean condition and shift:
In [5]:
df.loc[df['Percent']>0,'Match'] = df['Percent'].shift(-1)
df
Out[5]:
User Percent Match
0 A 0.87 -0.34
1 A -0.34 NaN
2 A 0.71 -0.58
3 A -0.58 NaN
4 B -1.67 NaN
5 B -0.44 NaN
6 B -0.72 NaN
7 C 0.19 0.39
8 C 0.39 -0.28
9 C -0.28 NaN
10 C 0.53 NaN
For the summation you can do the following:
In [15]:
def func(x):
    return df['Percent'].iloc[x.name-2:x.name].sum()
df['sum'] = df[df['Percent']>0][['Percent']].apply(lambda x: func(x), axis=1)
df
Out[15]:
User Percent Match sum
0 A 0.87 -0.34 0.00
1 A -0.34 NaN NaN
2 A 0.71 -0.58 0.53
3 A -0.58 NaN NaN
4 B -1.67 NaN NaN
5 B -0.44 NaN NaN
6 B -0.72 NaN NaN
7 C 0.19 0.39 -1.16
8 C 0.39 -0.28 -0.53
9 C -0.28 NaN NaN
10 C 0.53 NaN 0.11
This uses a slight trick to mask the df and return the col of interest but force to a df (by using double square brackets [[]]) so we can call apply and use axis=1 to iterate row-wise. This allows us to access the row index via the .name attribute. We can then use this to slice the df and return the sum.
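The apply-based sum relies on the default integer index, since .name is used as a position. As a hedged alternative, both columns can be built from shifted copies of Percent without apply. The sketch below reproduces the tables above; like the code above, it sums the two rows before each positive row, and the names positive and prev_two are only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "User": list("AAAABBBCCCC"),
        "Percent": [0.87, -0.34, 0.71, -0.58, -1.67, -0.44,
                    -0.72, 0.19, 0.39, -0.28, 0.53],
    }
)

positive = df["Percent"] > 0

# Next row's Percent, kept only where the current Percent is positive.
df["Match"] = df["Percent"].shift(-1).where(positive)

# Sum of the two preceding rows (rows that do not exist count as 0),
# again kept only on rows with a positive Percent.
prev_two = df["Percent"].shift(1, fill_value=0) + df["Percent"].shift(2, fill_value=0)
df["sum"] = prev_two.where(positive)
print(df)
```

Note that neither version resets at User boundaries, so row 7 (user C) sums values from user B.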
