Distance matrix for rows in pandas dataframe - python

I have a pandas dataframe that looks as follows:
In [23]: dataframe.head()
Out[23]:
column_id 1 10 11 12 13 14 15 16 17 18 ... 46 47 48 49 5 50 \
row_id ...
1 NaN NaN 1 1 1 1 1 1 1 1 ... 1 1 NaN 1 NaN NaN
10 1 1 1 1 1 1 1 1 1 NaN ... 1 1 1 NaN 1 NaN
100 1 1 NaN 1 1 1 1 1 NaN 1 ... NaN NaN 1 1 1 NaN
11 NaN 1 1 1 1 1 1 1 1 NaN ... NaN 1 1 1 1 1
12 1 1 1 NaN 1 1 1 1 NaN 1 ... 1 NaN 1 1 NaN 1
The thing is that I'm currently using the Pearson correlation to calculate similarity between rows, and, given the nature of the data, the standard deviation is sometimes zero (all values are 1 or NaN), so the Pearson correlation returns this:
In [24]: dataframe.transpose().corr().head()
Out[24]:
row_id 1 10 100 11 12 13 14 15 16 17 ... 90 91 92 93 94 95 \
row_id ...
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
Is there any other way of computing correlations that avoids this? Maybe an easy way to calculate the Euclidean distance between rows with a single method call, just as Pearson correlation has?
Thanks!
A.

The key question here is what distance metric to use.
Let's say this is your data.
>>> import pandas as pd
>>> import numpy as np
>>> data = pd.DataFrame(np.random.rand(100, 50))
>>> data[data > 0.2] = 1
>>> data[data <= 0.2] = np.nan
>>> data.head()
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 \
0 1 1 1 NaN 1 NaN NaN 1 1 1 ... 1 1 NaN 1 NaN 1 1 1
1 1 1 1 NaN 1 1 1 1 1 1 ... NaN 1 1 NaN NaN 1 1 1
2 1 1 1 1 1 1 1 1 1 1 ... 1 NaN 1 1 1 1 1 NaN
3 1 NaN 1 NaN 1 NaN 1 NaN 1 1 ... 1 1 1 1 NaN 1 1 1
4 1 1 1 1 1 1 1 1 NaN 1 ... NaN 1 1 1 1 1 1 1
What is the % difference?
You can compute a distance metric as the percentage of values that differ between each pair of columns. The result shows the % difference between any two columns.
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: (column1 - column2).abs().sum() / len(column1)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 7 8 9 ... 40 \
0 0.00 0.36 0.33 0.37 0.32 0.41 0.35 0.33 0.39 0.33 ... 0.37
1 0.36 0.00 0.37 0.29 0.30 0.37 0.33 0.37 0.33 0.31 ... 0.35
2 0.33 0.37 0.00 0.36 0.29 0.38 0.40 0.34 0.30 0.28 ... 0.28
3 0.37 0.29 0.36 0.00 0.29 0.30 0.34 0.26 0.32 0.36 ... 0.36
4 0.32 0.30 0.29 0.29 0.00 0.31 0.35 0.29 0.29 0.25 ... 0.27
What is the correlation coefficient?
Here, we use the Pearson correlation coefficient. This is a perfectly valid metric. Specifically, it translates to the phi coefficient in the case of binary data.
>>> import scipy.stats
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: scipy.stats.pearsonr(column1, column2)[0]
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 \
0 1.000000 0.013158 0.026262 -0.059786 -0.024293 -0.078056 0.054074
1 0.013158 1.000000 -0.093109 0.170159 0.043187 0.027425 0.108148
2 0.026262 -0.093109 1.000000 -0.124540 -0.048485 -0.064881 -0.161887
3 -0.059786 0.170159 -0.124540 1.000000 0.004245 0.184153 0.042524
4 -0.024293 0.043187 -0.048485 0.004245 1.000000 0.079196 -0.099834
Incidentally, this is the same result that you would get with the Spearman R coefficient as well.
What is the Euclidean distance?
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: np.linalg.norm(column1 - column2)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 \
0 0.000000 6.000000 5.744563 6.082763 5.656854 6.403124 5.916080
1 6.000000 0.000000 6.082763 5.385165 5.477226 6.082763 5.744563
2 5.744563 6.082763 0.000000 6.000000 5.385165 6.164414 6.324555
3 6.082763 5.385165 6.000000 0.000000 5.385165 5.477226 5.830952
4 5.656854 5.477226 5.385165 5.385165 0.000000 5.567764 5.916080
By now, you'd have a sense of the pattern. Create a distance method. Then apply it pairwise to every column using
data.apply(lambda col1: data.apply(lambda col2: method(col1, col2)))
If your distance method relies on the presence of zeroes instead of NaNs, convert to zeroes using .fillna(0).
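For larger frames, the same pairwise matrix can be computed without the nested apply. A minimal sketch, assuming scipy is installed: the cityblock (L1) distance between the 0-filled columns, divided by the number of rows, reproduces the %-difference metric above.
from scipy.spatial.distance import pdist, squareform

zero_data = data.fillna(0)
# pdist works on rows, so transpose to compare columns pairwise
pct_diff = pd.DataFrame(
    squareform(pdist(zero_data.T.values, metric='cityblock')) / len(zero_data),
    index=zero_data.columns, columns=zero_data.columns,
)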

A proposal to improve the excellent answer from @s-anand for Euclidean distance:
instead of
zero_data = data.fillna(0)
distance = lambda column1, column2: np.linalg.norm(column1 - column2)
we can apply fillna to fill only the missing data, thus:
distance = lambda column1, column2: np.linalg.norm((column1 - column2).fillna(0))
This way, the distance on missing dimensions will not be counted.
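A small illustration of the difference, as a sketch: with the masked version, dimensions where either value is missing are ignored instead of being treated as 0.
import numpy as np
import pandas as pd

c1 = pd.Series([1.0, np.nan, 3.0])
c2 = pd.Series([1.0, 2.0, np.nan])

np.linalg.norm((c1 - c2).fillna(0))          # 0.0   -> missing dimensions ignored
np.linalg.norm(c1.fillna(0) - c2.fillna(0))  # ~3.61 -> NaNs counted as zeroes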

This is my numpy-only version of @S Anand's fantastic answer, which I put together in order to help myself understand his explanation better.
Happy to share it with a short, reproducible example:
# Preliminaries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Get iris dataset into a DataFrame
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
Let's try scipy.stats.pearsonr first.
Executing:
from scipy.stats import pearsonr

distance = lambda column1, column2: pearsonr(column1, column2)[0]
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: distance(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt
returns the 5x5 Pearson correlation matrix as a DataFrame, with the same values as the numpy result below; and the numpy-only version:
rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: pearsonr(col1, col2)[0],
                                                               axis=0, arr=iris_df),
                              axis=0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np
returns:
array([[1.00, -0.12, 0.87, 0.82, 0.78],
[-0.12, 1.00, -0.43, -0.37, -0.43],
[0.87, -0.43, 1.00, 0.96, 0.95],
[0.82, -0.37, 0.96, 1.00, 0.96],
[0.78, -0.43, 0.95, 0.96, 1.00]])
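For reference, pandas can produce the same pairwise Pearson matrix directly with the built-in DataFrame.corr method, which avoids the nested apply:
iris_df.corr()  # same values as rslt / rslt_np above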
As a second example let's try the distance correlation from the dcor library.
Executing:
import dcor
dist_corr = lambda column1, column2: dcor.distance_correlation(column1, column2)
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: dist_corr(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt
returns the pairwise distance correlation matrix as a DataFrame, with the same values as the numpy result below, while the numpy-only version:
rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: dcor.distance_correlation(col1, col2),
                                                               axis=0, arr=iris_df),
                              axis=0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np
returns:
array([[1.00, 0.31, 0.86, 0.83, 0.78],
[0.31, 1.00, 0.54, 0.51, 0.51],
[0.86, 0.54, 1.00, 0.97, 0.95],
[0.83, 0.51, 0.97, 1.00, 0.95],
[0.78, 0.51, 0.95, 0.95, 1.00]])

I compared three variants from the other answers here for speed, using a trial 1000x25 matrix (leading to a resulting 1000x1000 distance matrix).
dcor library
Time: 0.03s
https://dcor.readthedocs.io/en/latest/functions/dcor.distances.pairwise_distances.html
import dcor
result = dcor.distances.pairwise_distances(data)
scipy.spatial.distance_matrix
Time: 0.05s
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html
from scipy.spatial import distance_matrix
result = distance_matrix(data, data)
Using a lambda function with numpy or pandas
Time: 180s / 90s
import numpy as np
import pandas as pd

distance = lambda x, y: np.sqrt(np.sum((x - y) ** 2))  # variant A (180s)
distance = lambda x, y: np.linalg.norm(x - y)           # variant B (90s)
result = data.apply(lambda x: data.apply(lambda y: distance(x, y), axis=1), axis=1)
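Another fast option, as a sketch: scipy's pdist computes only the condensed upper triangle of the pairwise Euclidean distances, and squareform expands it to the full square matrix; this is usually far faster and lighter on memory than the nested apply.
from scipy.spatial.distance import pdist, squareform

result = squareform(pdist(data, metric='euclidean'))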

Related

How to create column in dataframe with list of headings affected by condition, apply a cap and then exclude not respecting condition headings

I'm struggling to solve this issue. Help would be very much appreciated.
Note: bold in the text refers to the columns I need to create.
I have a data set in which I count the values in each row that are not NaN; the count is shown in column [count]. In column [incl_count] I would like to have lists that identify the headings of the columns contributing to the count.
Next, I would like a limit column [lim] in which I cannot have more than 3 counts; there is a cap of a maximum of 3. This means that the last columns to arrive at the count cannot be considered and are therefore excluded, with the exclusions saved in column [excl].
[index] [A] [B] [C] [D] [E] [F] [count] [incl_count] [lim] [excl]
...
...
...
2020-01-01 nan nan nan nan nan nan 0 [] 0 []
2020-01-02 -0.01 nan nan nan nan nan 1 [A] 1 []
2020-01-03 0.02 nan nan nan nan nan 1 [A] 1 []
2020-01-04 -0.01 0.01 nan nan nan nan 2 [A,B] 2 []
2020-01-05 -0.02 -0.04 0.02 nan nan nan 3 [A,B,C] 3 []
2020-01-06 nan 0.02 0.03 0.02 0.01 nan 4 [B,C,D,E] 3 [E]
2020-01-07 nan -0.02 0.01 -0.01 0.03 0.01 5 [B,C,D,E,F] 3 [E,F]
2020-01-08 nan nan -0.02 0.05 -0.05 0.02 4 [C,D,E,F] 2 [E,F]
2020-01-09 nan nan nan 0.02 0.02 0.05 3 [D,E,F] 1 [E,F]
2020-01-10 nan nan nan nan nan 0.01 1 [F] 0 [F]
...
...
...
This should work:
import pandas as pd
import numpy as np
non_value_columns = ["index", "incl_count", "excl", "lim", "count"]
max_lim = 3
entries = []
df = pd.read_excel('your.xlsx')
for entry in df:
    if entry not in non_value_columns:
        print(entry)
        entries.append(entry)

indexes = df['index'].tolist()
i = 0
cur_excludes = []
for index in indexes:
    c = 0
    incl = []
    excl = []
    for entry in entries:
        if not np.isnan(df[entry].tolist()[i]):
            incl.append(entry)
            c += 1
            if max_lim < c or entry in cur_excludes:
                c -= 1
                excl.append(entry)
                cur_excludes.append(entry)
    df.loc[i, 'lim'] = str(c)
    df.loc[i, 'incl_count'] = str(incl)
    df.loc[i, 'excl'] = str(excl)
    i += 1
df.to_excel('output.xlsx')
Edit: Changed the code so it loops through all the different columns. I added an array where you can state which columns are non-value columns; make sure you extend it if you add columns that you do not want checked (it is name-based, so just add the name of the column). I also added a variable where you can set your limit. I hope this works; tell me if anything goes wrong!
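If only the [count] and [incl_count] columns are needed, they can also be built without the explicit loop. A minimal sketch, assuming df is already loaded and value_cols names the data columns; the stateful [lim]/[excl] logic, which permanently excludes a column once the cap has been hit, still needs the loop above:
import pandas as pd

value_cols = ["A", "B", "C", "D", "E", "F"]  # assumption: the value columns
df["count"] = df[value_cols].notna().sum(axis=1)
df["incl_count"] = df[value_cols].apply(
    lambda row: [c for c in value_cols if pd.notna(row[c])], axis=1
)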

Filling missing data using a custom condition in a Pandas time series dataframe

Below is a portion of mydataframe which has many missing values.
A B
S a b c d e a b c d e
date
2020-10-15 1.0 2.0 NaN NaN NaN 10.0 11.0 NaN NaN NaN
2020-10-16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-17 NaN NaN NaN 4.0 NaN NaN NaN NaN 13.0 NaN
2020-10-18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-20 4.0 6.0 4.0 1.0 9.0 10.0 2.0 13.0 4.0 13.0
I would like to replace the NaNs in each column using a specific backward-fill condition.
For example, in column (A,a) missing values appear for the dates 16th, 17th, 18th and 19th. The next value is 4, against the 20th. I want this value (the next non-missing value in the column) to be distributed among all these dates, including the 20th, at a progressively increasing value of 10%. That is, column (A,a) gets values of approximately .655, .720, .793, .872 and .96 for the dates 16th, 17th, 18th, 19th and 20th. This shall be the approach for all columns and all missing values across rows.
I tried using the bfill() function but was unable to fathom how to incorporate the required formula as an option.
I have checked the link Pandas: filling missing values in time series forward using a formula and a few other links on stackoverflow. This is somewhat similar, but in my case the number of NaNs in a given column is variable in nature and spans multiple rows. Compare column (A,a) with column (A,d) or column (B,d). Given this, I am finding it difficult to adapt the solution to my problem.
Appreciate any inputs.
Here is a completely vectorized way to do this. It is very efficient and fast: 130 ms on a 1000 x 1000 matrix. This is a good opportunity to expose some interesting techniques using numpy.
First, let's dig a bit into the requirements, specifically what exactly the value for each cell needs to be.
The example given is [nan, nan, nan, nan, 4.0] --> [.66, .72, .79, .87, .96], which is explained to be a "progressively increasing value of 10%" (in such a way that the total is the "value to spread": 4.0).
This is a geometric series with rate r = 1 + 0.1: [r^1, r^2, r^3, ...] and then normalized to sum to 1. For example:
r = 1.1
a = 4.0
n = 5
q = np.cumprod(np.repeat(r, n))
a * q / q.sum()
# array([0.65518992, 0.72070892, 0.79277981, 0.87205779, 0.95926357])
We'd like to do a direct calculation (to avoid calling Python functions and explicit loops, which would be much slower), so we need to express that normalizing factor q.sum() in closed form. It is a well-established quantity, the sum of a geometric series: q.sum() = r + r**2 + ... + r**n = r * (r**n - 1) / (r - 1), which is why each normalized weight simplifies to r**i * (r - 1) / (r**n - 1) for i = 0 .. n-1.
To generalize, we need 3 quantities to calculate the value of each cell:
a: value to distribute
i: index of run (0 .. n-1)
n: run length
then, the value is v = a * r**i * (r - 1) / (r**n - 1).
To illustrate with the first column in the OP's example, where the input is: [1, nan, nan, nan, nan, 4], we would like:
a = [1, 4, 4, 4, 4, 4]
i = [0, 0, 1, 2, 3, 4]
n = [1, 5, 5, 5, 5, 5]
then, the value v would be (rounded at 2 decimals): [1. , 0.66, 0.72, 0.79, 0.87, 0.96].
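A quick numpy check of that formula against the three arrays just listed (a sketch):
import numpy as np

r = 1.1
a = np.array([1, 4, 4, 4, 4, 4], dtype=float)
i = np.array([0, 0, 1, 2, 3, 4])
n = np.array([1, 5, 5, 5, 5, 5])

v = a * r**i * (r - 1) / (r**n - 1)
print(v.round(2))  # [1.   0.66 0.72 0.79 0.87 0.96]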
Now comes the part where we go about getting these three quantities as numpy arrays.
a is the easiest and is simply df.bfill().values. But for i and n, we do have to do a little bit of work, starting by assigning the values to a numpy array:
z = df.values
nrows, ncols = z.shape
For i, we start with the cumulative count of NaNs, with reset when values are not NaN. This is strongly inspired by this SO answer for "Cumulative counts in NumPy without iteration". But we do it for a 2D array, and we also want to add a first row of 0, and discard the last row to satisfy exactly our needs:
def rcount(z):
    na = np.isnan(z)
    without_reset = na.cumsum(axis=0)
    reset_at = ~na
    overcount = np.maximum.accumulate(without_reset * reset_at)
    result = without_reset - overcount
    return result
i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
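A small check of rcount on a single column, as a sketch (the count restarts after every non-NaN value):
z_demo = np.array([[1.0], [np.nan], [np.nan], [4.0], [np.nan]])
rcount(z_demo).ravel()  # array([0, 1, 2, 0, 1])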
For n, we need to do some dancing on our own, using first principles of numpy (I'll break down the steps if I have time):
runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
So, putting it all together:
def spread_bfill(df, r=1.1):
    z = df.values
    nrows, ncols = z.shape
    a = df.bfill().values
    i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
    runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
    n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
    v = a * r**i * (r - 1) / (r**n - 1)
    return pd.DataFrame(v, columns=df.columns, index=df.index)
On your example data, we get:
>>> spread_bfill(df).round(2) # round(2) for printing purposes
A B
a b c d e a b c d e
S
2020-10-15 1.00 2.00 0.52 1.21 1.17 10.00 11.00 1.68 3.93 1.68
2020-10-16 0.66 0.98 0.57 1.33 1.28 1.64 0.33 1.85 4.32 1.85
2020-10-17 0.72 1.08 0.63 1.46 1.41 1.80 0.36 2.04 4.75 2.04
2020-10-18 0.79 1.19 0.69 0.30 1.55 1.98 0.40 2.24 1.21 2.24
2020-10-19 0.87 1.31 0.76 0.33 1.71 2.18 0.44 2.47 1.33 2.47
2020-10-20 0.96 1.44 0.83 0.37 1.88 2.40 0.48 2.71 1.46 2.71
For inspection, let's look at each of the 3 quantities in that example:
>>> a
[[ 1 2 4 4 9 10 11 13 13 13]
[ 4 6 4 4 9 10 2 13 13 13]
[ 4 6 4 4 9 10 2 13 13 13]
[ 4 6 4 1 9 10 2 13 4 13]
[ 4 6 4 1 9 10 2 13 4 13]
[ 4 6 4 1 9 10 2 13 4 13]]
>>> i
[[0 0 0 0 0 0 0 0 0 0]
[0 0 1 1 1 0 0 1 1 1]
[1 1 2 2 2 1 1 2 2 2]
[2 2 3 0 3 2 2 3 0 3]
[3 3 4 1 4 3 3 4 1 4]
[4 4 5 2 5 4 4 5 2 5]]
>>> n
[[1 1 6 3 6 1 1 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]]
And here is a final example, to illustrate what happens if a column ends with one or several NaNs (they remain NaN):
np.random.seed(10)
a = np.random.randint(0, 10, (6, 6)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)
>>> df
0 1 2 3 4 5
0 NaN NaN NaN NaN NaN 0.0
1 NaN NaN 9.0 NaN 8.0 NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN 8.0 4.0 NaN NaN NaN
4 NaN NaN NaN 6.0 9.0 NaN
5 NaN NaN 2.0 NaN 7.0 8.0
Then:
>>> spread_bfill(df).round(2) # round(2) for printing
0 1 2 3 4 5
0 NaN 1.72 4.29 0.98 3.81 0.00
1 NaN 1.90 4.71 1.08 4.19 1.31
2 NaN 2.09 1.90 1.19 2.72 1.44
3 NaN 2.29 2.10 1.31 2.99 1.59
4 NaN NaN 0.95 1.44 3.29 1.74
5 NaN NaN 1.05 NaN 7.00 1.92
Speed
a = np.random.randint(0, 10, (1000, 1000)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)
%timeit spread_bfill(df)
# 130 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Initial data:
>>> df
A B
a b c d e a b c d e
date
2020-10-15 1.0 2.0 NaN NaN NaN 10.0 11.0 NaN NaN NaN
2020-10-16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-17 NaN NaN NaN 4.0 NaN NaN NaN NaN 13.0 NaN
2020-10-18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-20 4.0 6.0 4.0 1.0 9.0 10.0 2.0 13.0 4.0 13.0
Define your geometric sequence:
def geomseq(seq):
    q = 1.1
    n = len(seq)
    S = seq.max()
    Uo = S * (1 - q) / (1 - q**n)
    Un = [Uo * q**i for i in range(0, n)]
    return Un
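A quick check of geomseq on the (A, a) run from the example, as a sketch (Series.max skips the NaNs, so the value to spread is 4.0):
import numpy as np
import pandas as pd

geomseq(pd.Series([np.nan, np.nan, np.nan, np.nan, 4.0]))
# [0.655..., 0.720..., 0.792..., 0.872..., 0.959...]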
TL;DR
>>> df.unstack().groupby(df.unstack().sort_index(ascending=False).notna().cumsum().sort_index()).transform(geomseq).unstack(level=[0, 1])
A B
a b c d e a b c d e
date
2020-10-15 1.000000 2.000000 0.518430 1.208459 1.166466 10.000000 11.000000 1.684896 3.927492 1.684896
2020-10-16 0.655190 0.982785 0.570272 1.329305 1.283113 1.637975 0.327595 1.853386 4.320242 1.853386
2020-10-17 0.720709 1.081063 0.627300 1.462236 1.411424 1.801772 0.360354 2.038724 4.752266 2.038724
2020-10-18 0.792780 1.189170 0.690030 0.302115 1.552567 1.981950 0.396390 2.242597 1.208459 2.242597
2020-10-19 0.872058 1.308087 0.759033 0.332326 1.707823 2.180144 0.436029 2.466856 1.329305 2.466856
2020-10-20 0.959264 1.438895 0.834936 0.365559 1.878606 2.398159 0.479632 2.713542 1.462236 2.713542
Details
Convert your dataframe to series:
>>> sr = df.unstack()
>>> sr.head(10)
date
A a 2020-10-15 1.0
2020-10-16 NaN # <= group X (final value: .655)
2020-10-17 NaN # <= group X (final value: .720)
2020-10-18 NaN # <= group X (final value: .793)
2020-10-19 NaN # <= group X (final value: .872)
2020-10-20 4.0 # <= group X (final value: .960)
b 2020-10-15 2.0
2020-10-16 NaN
2020-10-17 NaN
2020-10-18 NaN
dtype: float64
Now you can build groups:
>>> groups = sr.sort_index(ascending=False).notna().cumsum().sort_index()
>>> groups.head(10)
date
A a 2020-10-15 16
2020-10-16 15 # <= group X15
2020-10-17 15 # <= group X15
2020-10-18 15 # <= group X15
2020-10-19 15 # <= group X15
2020-10-20 15 # <= group X15
b 2020-10-15 14
2020-10-16 13
2020-10-17 13
2020-10-18 13
dtype: int64
Apply your geometric progression:
>>> sr = sr.groupby(groups).transform(geomseq)
>>> sr.head(10)
date
A a 2020-10-15 1.000000
2020-10-16 0.655190 # <= group X15
2020-10-17 0.720709 # <= group X15
2020-10-18 0.792780 # <= group X15
2020-10-19 0.872058 # <= group X15
2020-10-20 0.959264 # <= group X15
b 2020-10-15 2.000000
2020-10-16 0.982785
2020-10-17 1.081063
2020-10-18 1.189170
dtype: float64
And finally, reshape series according to your initial dataframe:
>>> df = sr.unstack(level=[0, 1])
>>> df
A B
a b c d e a b c d e
date
2020-10-15 1.000000 2.000000 0.518430 1.208459 1.166466 10.000000 11.000000 1.684896 3.927492 1.684896
2020-10-16 0.655190 0.982785 0.570272 1.329305 1.283113 1.637975 0.327595 1.853386 4.320242 1.853386
2020-10-17 0.720709 1.081063 0.627300 1.462236 1.411424 1.801772 0.360354 2.038724 4.752266 2.038724
2020-10-18 0.792780 1.189170 0.690030 0.302115 1.552567 1.981950 0.396390 2.242597 1.208459 2.242597
2020-10-19 0.872058 1.308087 0.759033 0.332326 1.707823 2.180144 0.436029 2.466856 1.329305 2.466856
2020-10-20 0.959264 1.438895 0.834936 0.365559 1.878606 2.398159 0.479632 2.713542 1.462236 2.713542

Pandas dataframe column subtraction, handling NaN

I have a data frame for example
df = pd.DataFrame([(np.nan, .32), (.01, np.nan), (np.nan, np.nan), (.21, .18)],
                  columns=['A', 'B'])
A B
0 NaN 0.32
1 0.01 NaN
2 NaN NaN
3 0.21 0.18
And I want to subtract column B from A
df['diff'] = df['A'] - df['B']
A B diff
0 NaN 0.32 NaN
1 0.01 NaN NaN
2 NaN NaN NaN
3 0.21 0.18 0.03
The difference returns NaN if one of the columns is NaN. To overcome this I use fillna:
df['diff'] = df['A'].fillna(0) - df['B'].fillna(0)
A B diff
0 NaN 0.32 -0.32
1 0.01 NaN 0.01
2 NaN NaN 0.00
3 0.21 0.18 0.03
This stops NaN from appearing in the diff column, but for index 2 the result comes out as 0, while I want the difference to be NaN since both columns A and B are NaN.
Is there a way to explicitly tell pandas to output NaN if both columns are NaN?
Use Series.sub with the fill_value=0 parameter:
df['diff'] = df['A'].sub(df['B'], fill_value=0)
print (df)
A B diff
0 NaN 0.32 -0.32
1 0.01 NaN 0.01
2 NaN NaN NaN
3 0.21 0.18 0.03
If you need to replace the NaNs with 0, add Series.fillna:
df['diff'] = df['A'].sub(df['B'], fill_value=0).fillna(0)
print (df)
A B diff
0 NaN 0.32 -0.32
1 0.01 NaN 0.01
2 NaN NaN 0.00
3 0.21 0.18 0.03

pandas.DataFrame.append is adding new columns to dataframe in python

First, I created a csv file from data0 as shown:
data0 = ["car", 0.82, 0.0026, 0.914, 0.59]
test_df = pd.DataFrame([data0])
test_df.to_csv("testfile1.csv")
The output of "testfile1.csv" looks like this:
0 1 2 3 4
0 car 0.82 0.0026 0.914 0.59
I want to append new data (data1 = ["bus", 0.9, 0.123, 12.907, 42], data2 = ["van", 0.23, 0.41, .031, 0.894, 6.16, 4.104]) to the old csv file so that it appears in new rows exactly under the previous rows, as shown below:
0 1 2 3 4 5 6
0 car 0.82 0.0026 0.914 0.590 NaN NaN
1 bus 0.90 0.1230 12.907 42.000 NaN NaN
2 van 0.23 0.4100 0.031 0.894 6.16 4.104
I tried the program below and other similar methods using .append() or .to_csv() with mode="a":
test_df = pd.read_csv("testfile1.csv", index_col=0)
test_df = test_df.append([data1], ignore_index=True)
test_df = test_df.append([data2], ignore_index=True)
test_df.to_csv("testfile1.csv")
However, every time the new data is appended as new columns and not under the previous columns:
0 1 2 3 4 ... 0 1 2 3 4
0 NaN NaN NaN NaN NaN ... car 0.82 0.0026 0.914 0.59
1 bus 0.90 0.123 12.907 42.000 ... NaN NaN NaN NaN NaN
2 van 0.23 0.410 0.031 0.894 ... NaN NaN NaN NaN NaN
My project requires reading an existing CSV file, appending to it, and saving back to the same file. I even tried type casting to pandas.Series.
You probably want to turn the lists into pandas Series first:
data1 = ['bus', 0.90, 0.1230, 12.907, 42.000]
row1 = pd.Series(data1)
data2 = ["van", 0.23, 0.41, .031, 0.894, 6.16, 4.104]
row2 = pd.Series(data2)
Then append it:
test_df = test_df.append(row1, ignore_index=True)
test_df = test_df.append(row2, ignore_index=True)
Output:
0 1 2 3 4 5 6
0 car 0.82 0.0026 0.914 0.590 NaN NaN
1 bus 0.90 0.1230 12.907 42.000 NaN NaN
2 van 0.23 0.4100 0.031 0.894 6.16 4.104
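Note that DataFrame.append was deprecated and then removed in pandas 2.0; a hedged equivalent with pd.concat might look like the sketch below (same data1/data2 as above; the column cast is an assumption, to line the new rows up with the string column labels produced by the CSV round trip):
import pandas as pd

test_df = pd.read_csv("testfile1.csv", index_col=0)
new_rows = pd.DataFrame([data1, data2])
new_rows.columns = new_rows.columns.astype(str)  # match the "0", "1", ... labels from read_csv
test_df = pd.concat([test_df, new_rows], ignore_index=True)
test_df.to_csv("testfile1.csv")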

Pandas combine two columns into one and exclude NaN values

I have a 5k x 2 column dataframe called "both".
I want to create a new 5k x 1 DataFrame or column (doesn't matter) by replacing any NaN value in one column with the value of the adjacent column.
ex:
Gains Loss
0 NaN NaN
1 NaN -0.17
2 NaN -0.13
3 NaN -0.75
4 NaN -0.17
5 NaN -0.99
6 1.06 NaN
7 NaN -1.29
8 NaN -0.42
9 0.14 NaN
So, for example, I need to swap the NaNs in the first column in rows 1 through 5 with the values in the same rows in the second column, to get a new df of the following form:
Change
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
How do I tell Python to do this?
You may fill the NaN values with zeroes and then simply add your columns:
both["Change"] = both["Gains"].fillna(0) + both["Loss"].fillna(0)
Then, if you need it, you may turn the resulting zeroes back into NaNs:
both["Change"].replace(0, np.nan, inplace=True)
The result:
Gains Loss Change
0 NaN NaN NaN
1 NaN -0.17 -0.17
2 NaN -0.13 -0.13
3 NaN -0.75 -0.75
4 NaN -0.17 -0.17
5 NaN -0.99 -0.99
6 1.06 NaN 1.06
7 NaN -1.29 -1.29
8 NaN -0.42 -0.42
9 0.14 NaN 0.14
Finally, if you want to get rid of your original columns, you may drop them:
both.drop(columns=["Gains", "Loss"], inplace=True)
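A more direct alternative, as a sketch: combine_first keeps the first non-missing value from either column, so a row that is NaN in both stays NaN (and a legitimate 0.0 is not accidentally turned into NaN by the replace step above):
both["Change"] = both["Gains"].combine_first(both["Loss"])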
There are many ways to achieve this. One is using the loc property:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Price1': [np.nan, np.nan, np.nan, np.nan,
                              np.nan, np.nan, 1.06, np.nan, np.nan],
                   'Price2': [np.nan, -0.17, -0.13, -0.75, -0.17,
                              -0.99, np.nan, -1.29, -0.42]})
df.loc[df['Price1'].isnull(), 'Price1'] = df['Price2']
df = df.loc[:6,'Price1']
print(df)
Output:
Price1
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
You can see more complex recipes in the Cookbook.
IIUC, we can filter for null values and just sum the columns to make your new dataframe.
cols = ['Gains','Loss']
s = df.isnull().cumsum(axis=1).eq(len(df.columns)).any(axis=1)
# add df[cols].isnull() if you only want to measure the price columns for nulls.
df['prices'] = df[cols].loc[~s].sum(axis=1)
df = df.drop(cols,axis=1)
print(df)
prices
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
7 -1.29
8 -0.42
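A shorter variant, as a sketch: the min_count parameter of sum returns NaN for rows with fewer than min_count non-missing values, which covers the all-NaN row directly:
df['prices'] = df[['Gains', 'Loss']].sum(axis=1, min_count=1)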
