Subtract two consecutive rows and save in first row using pandas - python

I have a pandas DataFrame with 226 columns in the following format:
**W X Y Z.....**
a b c d.....
e f g h.....
I want to subtract from each column the column to its right (X - Y, Y - Z, and so on), in the following way:
**W X Y Z.....**
a (b-c) (c-d) (d-nextvalue).....
e (f-g) (g-h) (h-nextvalue).....
How do I go about doing this? I am a rookie in Python; thanks in advance.

Use DataFrame.diff and if necessary convert first column to index by DataFrame.set_index:
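import pandas as pd

df = pd.DataFrame({
    'W': list('abc'),
    'X': [10, 5, 4],
    'Y': [7, 8, 9],
    'Z': [1, 1, 0],
    'E': [5, 3, 6],
})
# periods=-1 with axis=1 subtracts the next column from each column
df = df.set_index('W').diff(-1, axis=1)
print(df)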
X Y Z E
W
a 3.0 6.0 -4.0 NaN
b -3.0 7.0 -2.0 NaN
c -5.0 9.0 -6.0 NaN
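As a rough cross-check of what diff(-1, axis=1) computes, the sketch below rebuilds the same numbers by subtracting a copy of the frame shifted one column to the left. It reuses the sample data from the answer; the names result and manual are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'W': list('abc'),
    'X': [10, 5, 4],
    'Y': [7, 8, 9],
    'Z': [1, 1, 0],
    'E': [5, 3, 6],
}).set_index('W')

# diff(-1, axis=1) subtracts the column to the right from each column
result = df.diff(-1, axis=1)

# The same idea spelled out: shift every column one step to the left, then subtract
manual = df - df.shift(-1, axis=1)

# Compare the values only (dtypes may differ), treating NaN as equal to NaN
same = np.allclose(result.to_numpy(dtype=float),
                   manual.to_numpy(dtype=float),
                   equal_nan=True)
print(same)
```

The last column has no right-hand neighbour, so it comes out as NaN in both versions.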

To make 'W' the index, you can do:
df.set_index('W', inplace=True)
Then you can subtract the columns in a loop:
for i in range(len(df.columns) - 1):
    # the last column has no right-hand neighbour and is left unchanged
    df.iloc[:, i] = df.iloc[:, i] - df.iloc[:, i+1]

Related

Pandas - aggregate multiple columns with pivot_table

I have a dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ind0": list("QQQWWWW"), "ind1": list("RRRRSSS"), "vals": range(7), "cols": list("XXYXXYY")})
print(df)
Output:
ind0 ind1 vals cols
0 Q R 0 X
1 Q R 1 X
2 Q R 2 Y
3 W R 3 X
4 W S 4 X
5 W S 5 Y
6 W S 6 Y
I want to aggregate the values while creating columns from cols, so I thought of using pivot_table:
df_res = df.pivot_table(index=["ind0", "ind1"], columns="cols", values="vals", aggfunc="sum").fillna(0)
print(df_res)
This gives me:
cols         X     Y
ind0 ind1
Q    R     1.0   2.0
W    R     3.0   0.0
     S     4.0  11.0
However, I would rather get the sum independent of ind1 categories while keeping the information in this column. So, the desired output would be:
cols         X     Y
ind0 ind1
Q    R     1.0   2.0
W    R,S   7.0  11.0
Is there a way to use pivot_table or pivot to this end or do I have to aggregate for ind1 in a second step? If the latter, how?
You could call reset_index on df_res, group by "ind0", and use agg with a different function per column: join the unique values of "ind1" and sum "X" and "Y".
out = df_res.reset_index().groupby('ind0').agg({'ind1': lambda x: ', '.join(x.unique()), 'X':'sum', 'Y':'sum'})
Or, if you have multiple columns that need the same function applied, you could build the mapping with a dict comprehension:
funcs = {'ind1': lambda x: ', '.join(x.unique()), **{k:'sum' for k in ('X','Y')}}
out = df_res.reset_index().groupby('ind0').agg(funcs)
Output:
cols ind1 X Y
ind0
Q R 1.0 2.0
W R, S 7.0 11.0
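If the intermediate df_res is not needed, a possible alternative is to aggregate the values and the ind1 labels separately and then join the two results. This is only a sketch on the question's sample data; sums, labels and out are illustrative names, and the "," separator is a choice, not something pandas prescribes.

```python
import pandas as pd

df = pd.DataFrame({
    "ind0": list("QQQWWWW"),
    "ind1": list("RRRRSSS"),
    "vals": range(7),
    "cols": list("XXYXXYY"),
})

# Sum vals per ind0 and cols, ignoring ind1 entirely
sums = df.pivot_table(index="ind0", columns="cols", values="vals",
                      aggfunc="sum", fill_value=0)

# Collect the distinct ind1 labels seen for each ind0
labels = df.groupby("ind0")["ind1"].agg(lambda s: ",".join(s.unique()))

# Put the labels next to the sums (both are indexed by ind0)
out = sums.join(labels)
print(out)
```

This avoids the second aggregation pass over df_res, at the cost of two separate groupings of the original frame.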

Pairwise Euclidean distance with pandas ignoring NaNs

I start with a dictionary, which is the way my data was already formatted:
import pandas as pd
dict2 = {'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},
'C':{'b':1.0,'c':2.0, 'd':4.0}}
I then convert it to a pandas dataframe:
df = pd.DataFrame(dict2)
print(df)
A B C
a 1.0 2.0 NaN
b 2.0 NaN 1.0
c NaN 2.0 2.0
d 4.0 5.0 4.0
Of course, I can get the difference one at a time by doing this:
df['A'] - df['B']
Out[643]:
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
I figured out how to loop through and calculate A-A, A-B, A-C:
for column in df:
print(df['A'] - df[column])
a 0.0
b 0.0
c NaN
d 0.0
Name: A, dtype: float64
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
a NaN
b 1.0
c NaN
d 0.0
dtype: float64
What I would like to do is iterate through the columns so as to calculate |A-B|, |A-C|, and |B-C| and store the results in another dictionary.
I want to do this so as to calculate the Euclidean distance between all combinations of columns later on. If there is an easier way to do this I would like to see it as well. Thank you.
You can use NumPy broadcasting to compute the Euclidean distance (L2 norm) for all column pairs at once, ignoring NaNs with np.nansum.
import numpy as np
i = df.values.T  # one row per original column
j = np.nansum((i - i[:, None]) ** 2, axis=2) ** .5
If you want a DataFrame representing a distance matrix, here's what that would look like:
df = pd.DataFrame(j, index=df.columns, columns=df.columns)
df
A B C
A 0.000000 1.414214 1.0
B 1.414214 0.000000 1.0
C 1.000000 1.000000 0.0
In the resulting DataFrame, the entry at row i and column j (for example df.loc['A', 'B']) is the distance between columns i and j of the original DataFrame.
The code below iterates through columns to calculate the difference.
# Import libraries
import pandas as pd
import numpy as np
# Create dataframe
df = pd.DataFrame({'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},'C':{'b':1.0,'c':2.0, 'd':4.0}})
df2 = pd.DataFrame()
# Calculate the absolute difference for each unordered pair of columns
clist = list(df.columns)
for i in range(len(clist) - 1):
    for j in range(i + 1, len(clist)):  # start at i + 1 so each pair appears only once
        var = clist[i] + '-' + clist[j]
        df[var] = abs(df[clist[i]] - df[clist[j]])   # optional
        df2[var] = abs(df[clist[i]] - df[clist[j]])  # optional
Output in same dataframe
df.head()
Output in a new dataframe
df2.head()
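Since the question asks for the results in a dictionary, one option is itertools.combinations, which yields each unordered column pair exactly once. The sketch below is an assumption about what the asker wants: it stores the absolute differences and a NaN-ignoring Euclidean distance per pair, with abs_diffs and distances as made-up names.

```python
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
    'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
    'C': {'b': 1.0, 'c': 2.0, 'd': 4.0},
})

abs_diffs = {}
distances = {}
for left, right in combinations(df.columns, 2):
    diff = (df[left] - df[right]).abs()  # NaN wherever either value is missing
    abs_diffs[(left, right)] = diff
    # nansum skips the NaN entries, so only rows present in both columns count
    distances[(left, right)] = float(np.sqrt(np.nansum(diff ** 2)))

print(distances)
```

Because rows with a missing value are skipped, pairs that share few rows are compared on fewer coordinates, which is worth keeping in mind when comparing distances across pairs.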

How to create a matrix that is the sum of multiple matrices using pandas dataframe?

I have multiple data frames that I saved in a concatenated list like below. Each df represents a matrix.
my_df = pd.concat([df1, df2, df3, .....])
How do I sum all these dfs (matrices) into one df (matrix)?
I found a discussion here, but it only answers how to add two data frames, by using code like below.
df_x.add(df_y, fill_value=0)
Should I use the code above in a loop, or is there a more concise way?
I tried to do print(my_df.sum()) but got a very confusing result (it's suddenly turned into a one row instead of two-dimensional matrix).
Thank you.
I believe you need functools.reduce, provided each DataFrame in the list has the same index and column values:
import numpy as np
import pandas as pd
np.random.seed(2018)
df1 = pd.DataFrame(np.random.choice([1,np.nan,2], size=(3,3)), columns=list('abc'))
df2 = pd.DataFrame(np.random.choice([1,np.nan,3], size=(3,3)), columns=list('abc'))
df3 = pd.DataFrame(np.random.choice([1,np.nan,4], size=(3,3)), columns=list('abc'))
print (df1)
a b c
0 2.0 2.0 2.0
1 NaN NaN 1.0
2 1.0 2.0 NaN
print (df2)
a b c
0 NaN NaN 1.0
1 3.0 3.0 3.0
2 NaN 1.0 3.0
print (df3)
a b c
0 4.0 NaN NaN
1 4.0 1.0 1.0
2 4.0 NaN 1.0
from functools import reduce
my_df = [df1,df2, df3]
df = reduce(lambda x, y: x.add(y, fill_value=0), my_df)
print (df)
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
I believe the idiomatic solution to this is to preserve the information about different DataFrames with the help of the keys parameter and then use sum on the innermost level:
dfs = [df1, df2, df3]
my_df = pd.concat(dfs, keys=['df{}'.format(i+1) for i in range(len(dfs))])
my_df.groupby(level=1).sum()  # sum(level=1) was removed in pandas 2.0
which yields
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
with jezrael's sample DataFrames.
One method is to use sum with a list of arrays. The output here will be an array rather than a dataframe.
This assumes you need to replace np.nan with 0:
res = sum([x.fillna(0).values for x in [df1, df2, df3]])
Alternatively, you can use numpy directly in a couple of different ways:
res_np1 = np.add.reduce([x.fillna(0).values for x in [df1, df2, df3]])
res_np2 = np.nansum([x.values for x in [df1, df2, df3]], axis=0)
numpy.nansum assumes np.nan equals zero for summing purposes.
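The approaches above do not agree when a cell is NaN in every input: add with fill_value=0 leaves that cell NaN, while groupby(...).sum() and np.nansum return 0. The following sketch shows the difference on two small made-up frames (m1 and m2 are not the sample data used above).

```python
from functools import reduce

import numpy as np
import pandas as pd

# Row 1 of column "a" is NaN in both frames
m1 = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})
m2 = pd.DataFrame({"a": [3.0, np.nan], "b": [4.0, 5.0]})
frames = [m1, m2]

# add(fill_value=0) keeps NaN where both operands are missing
via_reduce = reduce(lambda x, y: x.add(y, fill_value=0), frames)

# groupby(...).sum() skips NaN, so an all-NaN cell sums to 0
via_groupby = pd.concat(frames, keys=range(len(frames))).groupby(level=1).sum()

# np.nansum also treats NaN as zero
via_nansum = pd.DataFrame(
    np.nansum([f.to_numpy() for f in frames], axis=0),
    index=m1.index, columns=m1.columns,
)

print(via_reduce)
print(via_groupby)
print(via_nansum)
```

Which behaviour is right depends on whether an entirely missing cell should count as zero or stay missing in the summed matrix.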

Calculate difference between first valid value and last valid value in DataFrame per row?

I'm trying to find the difference between the first valid value and the last valid value in a DataFrame per row.
I have working code that uses a for loop, and I am looking for something faster.
Here's an example of what I'm doing currently:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    np.arange(16).astype(float).reshape(4, 4),  # np.float was removed in NumPy 1.24
    columns=['a', 'b', 'c', 'd'])
# Fill some NaN
df.loc[0, ['a', 'd']] = np.nan
df.loc[1, ['c', 'd']] = np.nan
df.loc[2, 'b'] = np.nan
df.loc[3, :] = np.nan
print(df)
# a b c d
# 0 NaN 1.0 2.0 NaN
# 1 4.0 5.0 NaN NaN
# 2 8.0 NaN 10.0 11.0
# 3 NaN NaN NaN NaN
diffs = pd.Series(index=df.index, dtype=float)
for i in df.index:
    row = df.loc[i]
    min_i = row.first_valid_index()
    max_i = row.last_valid_index()
    if min_i is None or min_i == max_i:  # 0 or 1 valid values
        continue
    diffs[i] = df.loc[i, max_i] - df.loc[i, min_i]
df['diff'] = diffs
print(df)
# a b c d diff
# 0 NaN 1.0 2.0 NaN 1.0
# 1 4.0 5.0 NaN NaN 1.0
# 2 8.0 NaN 10.0 11.0 3.0
# 3 NaN NaN NaN NaN NaN
One way would be to forward-fill and back-fill the missing values along each row, and then just compare the first and last columns.
df2 = df.ffill(axis=1).bfill(axis=1)
df['diff'] = df2.iloc[:, -1] - df2.iloc[:, 0]
If you want to do it in one line, without creating a new dataframe:
df['diff'] = df.ffill(axis=1).bfill(axis=1).apply(lambda r: r.d - r.a, axis=1)
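A variant of the same idea, sketched below on the question's sample data, fills in only one direction per lookup: back-filling along each row puts the first valid value in the first column, and forward-filling puts the last valid value in the last column. Note one difference from the loop in the question: a row with exactly one valid value gets 0.0 here rather than NaN. The names first_valid, last_valid and diff are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16, dtype=float).reshape(4, 4),
                  columns=['a', 'b', 'c', 'd'])
df.loc[0, ['a', 'd']] = np.nan
df.loc[1, ['c', 'd']] = np.nan
df.loc[2, 'b'] = np.nan
df.loc[3, :] = np.nan

# After back-filling along the row, column 0 holds the first valid value
first_valid = df.bfill(axis=1).iloc[:, 0]
# After forward-filling along the row, the last column holds the last valid value
last_valid = df.ffill(axis=1).iloc[:, -1]

diff = last_valid - first_valid  # all-NaN rows stay NaN
print(diff)
```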
Pandas makes this easy with first_valid_index(). Note that you'll have to drop any rows that contain only NaN values first (there is no point in keeping them anyway):
For first valid values:
a = [df.loc[x, i] for x, i in df.apply(lambda row: row.first_valid_index(), axis=1).items()]
For last valid values:
b = [df.loc[x, i] for x, i in df.apply(lambda row: row[::-1].first_valid_index(), axis=1).items()]
Subtract to get the final result (last minus first). Plain lists do not support subtraction, so convert them to Series first:
pd.Series(b, index=df.index) - pd.Series(a, index=df.index)

Return list of indices/index where a min/max value occurs in a pandas dataframe

I'd like to search a pandas DataFrame for minimum values. I need the min in the entire dataframe (across all values), analogous to df.min().min(). However, I also need to know the index of the location(s) where this value occurs.
I've tried a number of different approaches:
df.where(df == (df.min().min())),
df.where(df == df.min().min()).notnull() and
val_mask = df == df.min().min(); df[val_mask].
These return a dataframe of NaNs on non-min/boolean values but I can't figure out a way to get the (row, col) of these locations.
Is there a more elegant way of searching a dataframe for a min/max and returning a list containing all of the locations of the occurrence(s)?
import pandas as pd
keys = ['x', 'y', 'z']
vals = [[1,2,-1], [3,5,1], [4,2,3]]
data = dict(zip(keys,vals))
df = pd.DataFrame(data)
list_of_lowest = []
for column_name, column in df.items():  # iteritems() was removed in pandas 2.0
    if len(df[column == df.min().min()]) > 0:
        print(column_name, column.where(column == df.min().min()).dropna())
        list_of_lowest.append([column_name, column.where(column == df.min().min()).dropna()])
list_of_lowest
output: [['x', 2 -1.0
Name: x, dtype: float64]]
Based on your revised update:
In [209]:
keys = ['x', 'y', 'z']
vals = [[1,2,-1], [3,5,-1], [4,2,3]]
data = dict(zip(keys,vals))
df = pd.DataFrame(data)
df
Out[209]:
x y z
0 1 3 4
1 2 5 2
2 -1 -1 3
Then the following would work:
In [211]:
df[df==df.min().min()].dropna(axis=1, thresh=1).dropna()
Out[211]:
x y
2 -1.0 -1.0
So this uses the boolean mask on the df:
In [212]:
df[df==df.min().min()]
Out[212]:
x y z
0 NaN NaN NaN
1 NaN NaN NaN
2 -1.0 -1.0 NaN
We then call dropna with thresh=1, which drops the columns that don't have at least one non-NaN value:
In [213]:
df[df==df.min().min()].dropna(axis=1, thresh=1)
Out[213]:
x y
0 NaN NaN
1 NaN NaN
2 -1.0 -1.0
It is probably safer to make the second dropna call with thresh=1 as well, so that a row is kept if it has at least one non-NaN value:
In [214]:
df[df==df.min().min()].dropna(axis=1, thresh=1).dropna(thresh=1)
Out[214]:
x y
2 -1.0 -1.0
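To get the (row, column) locations as a plain list, which is what the question asks for, one possible sketch uses np.where on the underlying array and maps the positions back to labels. The variable names below are illustrative, and the data is the revised sample from this answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, -1], 'y': [3, 5, -1], 'z': [4, 2, 3]})

values = df.to_numpy()
# Positions of every cell equal to the global minimum, in row-major order
rows, cols = np.where(values == values.min())
# Translate integer positions back into index and column labels
locations = list(zip(df.index[rows], df.columns[cols]))
print(locations)
```

Swapping values.min() for values.max() gives the locations of the maximum in the same way.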
