Subtracting multiple columns and appending results in pandas DataFrame - python

I have a table of sensor data, for which some columns are measurements and some columns are sensor bias. For example, something like this:
df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0],
                   'dx': [0.25, 0.25, 0.25], 'dy': [0.5, 0.5, 0.5]})
dx dy x y
0 0.25 0.5 1.0 4.0
1 0.25 0.5 2.0 5.0
2 0.25 0.5 3.0 6.0
I can add a column to the table by subtracting the bias from the measurement like this:
df['newX'] = df['x'] - df['dx']
dx dy x y newX
0 0.25 0.5 1.0 4.0 0.75
1 0.25 0.5 2.0 5.0 1.75
2 0.25 0.5 3.0 6.0 2.75
But I'd like to do that for many columns at once. This doesn't work:
df[['newX','newY']] = df[['x','y']] - df[['dx','dy']]
for two reasons, it seems:
1. When subtracting DataFrames, the column labels are used to align the subtraction, so I wind up with a 4-column result ['x', 'y', 'dx', 'dy'].
2. It seems I can insert a single column into the DataFrame using indexing, but not more than one.
Obviously I can iterate over the columns and do each one individually, but is there a more compact way to accomplish what I'm trying to do that is more analogous to the one column solution?

DataFrames generally align arithmetic operations on both the column and row indices. Since df[['x','y']] and df[['dx','dy']] have different column names, the dx column is not subtracted from the x column, and similarly for the y columns.
In contrast, if you subtract a NumPy array from a DataFrame, the operation is done elementwise, since the NumPy array has no pandas-style indices to align on.
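For instance, here is a quick sketch (using the df from the question) of what the label-aligned subtraction produces: the two column sets do not overlap, so the result is the union of all four labels, filled with NaN.
print(df[['x', 'y']] - df[['dx', 'dy']])
#    dx  dy   x   y
# 0 NaN NaN NaN NaN
# 1 NaN NaN NaN NaN
# 2 NaN NaN NaN NaN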
Hence, if you use df[['dx','dy']].values to extract a NumPy array consisting of the values in df[['dx','dy']], then your assignment can be done as desired:
import pandas as pd
df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0],
                   'dx': [0.25, 0.25, 0.25], 'dy': [0.5, 0.5, 0.5]})
df[['newx','newy']] = df[['x','y']] - df[['dx','dy']].values
print(df)
yields
dx dy x y newx newy
0 0.25 0.5 1.0 4.0 0.75 3.5
1 0.25 0.5 2.0 5.0 1.75 4.5
2 0.25 0.5 3.0 6.0 2.75 5.5
Beware that if you try to assign a NumPy array (on the right-hand side) to a DataFrame (on the left-hand side), the column names specified on the left must already exist.
In contrast, when assigning a DataFrame on the right-hand side to a DataFrame on the left, new columns can be used, since in this case pandas zips the keys (new column names) on the left with the columns on the right and assigns values in column order instead of aligning by column label:
for k1, k2 in zip(key, value.columns):
    self[k1] = value[k2]
Thus, using a DataFrame on the right
df[['newx','newy']] = df[['x','y']] - df[['dx','dy']].values
works, but using a NumPy array on the right
df[['newx','newy']] = df[['x','y']].values - df[['dx','dy']].values
does not.
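An alternative sketch that stays entirely in pandas (no .values) is to rename the bias columns so the labels line up before subtracting; the column names here follow the question:
bias = df[['dx', 'dy']].rename(columns={'dx': 'x', 'dy': 'y'})
df[['newx', 'newy']] = df[['x', 'y']] - bias  # labels now match, so alignment subtracts the right columns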

Related

Doubts about how the pandas axis argument works; my code may be off

My issue is the following: I'm creating a pandas data frame from a dictionary that ends up with shape [70k, 300]. I'm trying to normalise each cell, first by columns and then by rows, and also the other way around (rows then columns).
I had asked a similar question before, but that was with a [70k, 70k] data frame (so square), and there it worked just by doing this:
dfNegInfoClearRev = (df - df.mean(axis=1)) / df.std(axis=1).replace(0, 1)
dfNegInfoClearRev = (dfNegInfoClearRev - dfNegInfoClearRev.mean(axis=0)) / dfNegInfoClearRev.std(axis=0).replace(0, 1)
print(dfNegInfoClearRev)
This did what I needed for the [70k, 70k] case. A problem came up when I tried the same principle with a [70k, 300] frame. If I do this:
dfRINegInfo = (dfRI - dfRI.mean(axis=0)) / dfRI.std(axis=0).replace(0, 1)
dfRINegInfoRows = (dfRINegInfo - dfRINegInfo.mean(axis=1)) / dfRINegInfo.std(axis=1).replace(0, 1)
I somehow end up with a [70k, 70k+300] full of NaNs with the same names.
I ended up doing this:
dfRIInter = dfRINegInfo.sub(dfRINegInfo.mean(axis=1), axis=0)
dfRINegInfoRows = dfRIInter.div(dfRIInter.std(axis=1), axis=0).fillna(1).replace(0, 1)
print(dfRINegInfoRows)
But I'm not sure this is what I was trying to do. I don't really understand why, after the column normalisation (which works and stays [70k, 300]), the row normalisation gives me a [70k, 70k+300], and I'm not sure whether what it computes is what I intended. Any help?
I think your new code is doing what you want.
If we look at a 3x3 toy example:
df = pd.DataFrame([
    [1, 2, 3],
    [2, 4, 6],
    [3, 6, 9],
])
The axis=1 mean is:
df.mean(axis=1)
# 0 2.0
# 1 4.0
# 2 6.0
# dtype: float64
And the subtraction applies to each row (i.e., [1,2,3] - [2,4,6] element-wise, [2,4,6] - [2,4,6], and [3,6,9] - [2,4,6]):
df - df.mean(axis=1)
# 0 1 2
# 0 -1.0 -2.0 -3.0
# 1 0.0 0.0 0.0
# 2 1.0 2.0 3.0
So if we have df2 shaped 3x2:
df2 = pd.DataFrame([
    [1, 2],
    [3, 6],
    [5, 10],
])
The axis=1 mean is still length 3:
df2.mean(axis=1)
# 0 1.5
# 1 4.5
# 2 7.5
# dtype: float64
And subtraction will result in the 3rd column being nan (i.e., [1,2,nan] - [1.5,4.5,7.5] element-wise, [3,6,nan] - [1.5,4.5,7.5], and [5,10,nan] - [1.5,4.5,7.5]):
df2 - df2.mean(axis=1)
# 0 1 2
# 0 -0.5 -2.5 NaN
# 1 1.5 1.5 NaN
# 2 3.5 5.5 NaN
If you make the subtraction itself along axis=0 then it works as expected:
df2.sub(df2.mean(axis=1), axis=0)
# 0 1
# 0 -0.5 0.5
# 1 -1.5 1.5
# 2 -2.5 2.5
So when you use the default subtraction between a (70000, 300) frame and its length-70000 row-mean Series, the Series index is aligned with the column labels, and you end up with tens of thousands of extra columns that are entirely NaN.
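Putting it together, here is a minimal sketch of the column-then-row standardisation with explicit alignment; a small toy frame stands in for the (70000, 300) data:
import pandas as pd

dfRI = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 6.0, 8.0], 'c': [5.0, 1.0, 9.0]})

# Column-wise z-score: the column means/stds align with the columns by default.
dfNorm = (dfRI - dfRI.mean(axis=0)) / dfRI.std(axis=0).replace(0, 1)

# Row-wise z-score: the row means/stds are indexed like the rows, so force
# alignment along the index with axis=0 instead of the default column alignment.
dfCentered = dfNorm.sub(dfNorm.mean(axis=1), axis=0)
dfNormRows = dfCentered.div(dfCentered.std(axis=1).replace(0, 1), axis=0)
print(dfNormRows)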

How to group near-duplicate values in a pandas dataframe?

If there are duplicate values in a DataFrame, pandas already provides functions to replace or drop them. In many experimental datasets, on the other hand, one might have 'near' duplicates.
How can one replace these near-duplicate values with, e.g., their mean?
The example data looks as follows:
df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})
I tried to hack together something to bin near duplicates, but it uses for loops and feels like working against pandas:
def cluster_near_values(df, colname_to_cluster, bin_size=0.1):
    used_x = []  # list of values already grouped
    group_index = 0
    for search_value in df[colname_to_cluster]:
        if search_value in used_x:
            # value is already in a group, skip to next
            continue
        g_ix = df[abs(df[colname_to_cluster] - search_value) < bin_size].index
        used_x.extend(df.loc[g_ix, colname_to_cluster])
        df.loc[g_ix, 'cluster_group'] = group_index
        group_index += 1
    return df.groupby('cluster_group').mean()
Which does the grouping and averaging:
print(cluster_near_values(df, 'x', 0.1))
x y
cluster_group
0.0 1.000000 1.00
1.0 2.005000 2.10
2.0 3.000000 3.00
3.0 4.016667 4.17
4.0 5.000000 5.50
Is there a better way to achieve this?
Here's an example where the items are grouped to one digit of precision. You can modify this as needed; the same idea also works for binning values with a threshold greater than 1.
df.groupby(np.ceil(df['x'] * 10) // 10).mean()
x y
x
1.0 1.000000 1.00
2.0 2.005000 2.10
3.0 3.000000 3.00
4.0 4.016667 4.17
5.0 5.000000 5.50
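Another sketch, assuming the groups are separated by gaps larger than the bin size: sort the values and start a new group whenever the jump to the next value exceeds the threshold (the variable names here are mine).
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})

bin_size = 0.1
sorted_x = df['x'].sort_values()
group_ids = (sorted_x.diff() > bin_size).cumsum()  # new group id whenever the sorted gap exceeds bin_size
print(df.groupby(group_ids).mean())                # group labels align with df via the index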

How can I impute the NA in a dataframe with values randomly selected from a specified normal distribution

How can I impute the NAs in a dataframe with values randomly selected from a specified normal distribution?
The dataframe df is defined as follows:
A B C D
1 3 NA 4 NA
2 3.4 2.3 4.1 NA
3 2.3 0.1 0.2 6.3
4 3.1 4.5 2.1 0.2
5 4.1 2.5 NA 2.4
I want to fill the NAs with values randomly selected from a generated normal distribution, and the filled values should all be different.
The mean of the normal distribution is the 1% quantile of the values in the given dataframe. The standard deviation is the median of the row-wise standard deviations of the dataframe.
My code is as follows:
import pandas as pd
import numpy as np
df = pd.read_csv('try.txt',sep="\t")
df.index = df['type']
del df['type']
sigma = median(df.std(axis=1))
mu = df.quantile(0.01)
# mean and standard deviation
df = df.fillna(np.random.normal(mu, sigma, 1))
The mean is incorrect, and the df cannot be filled with the simulated array.
How can I complete this? Thank you.
There are a few problems with your code
df.index = df['type']
del df['type']
can better be expressed as df = df.set_index('type')
median(df.std(axis=1)) should be df.std(axis=1).median()
df.quantile() returns a series. If you want the quantile of all the values, you should do df.stack().quantile(0.01)
sigma = df.std(axis=1).median()
mu = df.stack().quantile(0.01)
print((sigma, mu))
(0.9539392014169454, 0.115)
First you have to find the empty fields. The easiest way is with .stack and pd.isnull:
df2 = df.stack(dropna=False)
s = df2[pd.isnull(df2)]
Now you can impute the random values in 2 ways
ran = np.random.normal(mu, sigma, len(s))
df3 = df.stack(dropna=False)
df3.loc[s.index] = ran
df3.unstack()
A B C D
1 3.0 0.38531116198179066 4.0 0.7070154252582993
2 3.4 2.3 4.1 -0.8651789931843614
3 2.3 0.1 0.2 6.3
4 3.1 4.5 2.1 0.2
5 4.1 2.5 -1.3176599584973157 2.4
Or via a loop, overwriting the empty fields in the original DataFrame
for (row, column), value in zip(s.index.tolist(), np.random.normal(mu, sigma, len(s))):
    df.loc[row, column] = value
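A mask-based sketch that avoids stacking altogether (df, mu and sigma as defined above): draw a full random matrix and keep the draws only where df has NaN.
rand = pd.DataFrame(np.random.normal(mu, sigma, df.shape),
                    index=df.index, columns=df.columns)
df = df.where(df.notna(), rand)  # keep original values; fill NaN cells from the random draws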

X Y Z array data to heatmap

I couldn't quite find a consensus answer for this question, or one that fits my needs. I have data in three columns of a text file: X, Y, and Z. The columns are tab-delimited. I would like to make a heatmap representation of these data with Python, where X and Y positions are shaded by the value in Z, which ranges from 0 to 1 (a discrete probability of X and Y). I was trying seaborn's heatmap package and matplotlib's pcolormesh, but unfortunately these need 2D data arrays.
My data runs through X from 1 to 37 for a constant Y, then Y increments by 0.1. The maximum Y fluctuates based on the data set, but the minimum Y is always 0.
[X Y Z] row1[1...37 0.0000 Zvalue], row2[1...37 0.1000 Zvalue] etc.
import numpy as np
from numpy import *
import pandas as pd
import seaborn as sns
sns.set()
df = np.loadtxt(open("file.txt", "rb"), delimiter="\t").astype("float")
Any tips for next steps?
If I understand you correctly, you have three columns, with X and Y denoting the position of a value Z.
Consider the following example. There are three columns: X and Y contain positional information (categories in this case) and Z contains the values for shading the heatmap.
x = np.array(['a','b','c','a','b','c','a','b','c'])
y = np.array(['a','a','a','b','b','b','c','c','c'])
z = np.array([0.3,-0.3,1,0.5,-0.25,-1,0.25,-0.23,0.25])
Then we create a dataframe from these arrays and transpose it (so x, y and z actually become columns), give it column names, and make sure Z_value is numeric.
df = pd.DataFrame.from_dict(np.array([x,y,z]).T)
df.columns = ['X_value','Y_value','Z_value']
df['Z_value'] = pd.to_numeric(df['Z_value'])
resulting in this dataframe.
X_value Y_value Z_value
0 a a 0.30
1 b a -0.30
2 c a 1.00
3 a b 0.50
4 b b -0.25
5 c b -1.00
6 a c 0.25
7 b c -0.23
8 c c 0.25
From this you cannot create a heatmap directly; however, by calling df.pivot(index='Y_value', columns='X_value', values='Z_value') you pivot the dataframe into a form that can be used for a heatmap.
pivotted = df.pivot(index='Y_value', columns='X_value', values='Z_value')
The resulting dataframe looks like this.
X_value a b c
Y_value
a 0.30 -0.30 1.00
b 0.50 -0.25 -1.00
c 0.25 -0.23 0.25
You can then feed pivotted to sns.heatmap to create your heatmap.
sns.heatmap(pivotted, cmap='RdBu')
Resulting in this heatmap.
You may need to make some adjustments to the code for your precise needs, but since I had no example data to go from, I made my own example.
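For the numeric X/Y data in the original question, the same pivot idea applies with pivot_table; a sketch, where the column names and the assumption that file.txt has no header row are mine:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("file.txt", sep="\t", names=["X", "Y", "Z"])  # assumed layout: three tab-separated columns, no header
grid = data.pivot_table(index="Y", columns="X", values="Z")      # one row per Y, one column per X, shaded by Z
sns.heatmap(grid, cmap="viridis")
plt.show()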

Getting mean, max, min from pandas dataframe

I have the following dataframe which is the result of performing a standard pandas correlation:
df.corr()
abc xyz jkl
abc 1 0.2 -0.01
xyz -0.34 1 0.23
jkl 0.5 0.4 1
I have a few things that need to be done with these correlations; however, these calculations need to exclude all the cells where the value is 1. Those are the cells where an item has a perfect correlation with itself, so I am not interested in them:
1. Determine the maximum correlation pair. The result is 'jkl' and 'abc', which have a correlation of 0.5.
2. Determine the minimum correlation pair. The result is 'abc' and 'xyz', which have a correlation of -0.34.
3. Determine the average/mean for the whole dataframe (again excluding all the values which are 1). The result would be (0.2 + -0.01 + -0.34 + 0.23 + 0.5 + 0.4) / 6 = 0.163333333.
Check this:
import pandas as pd
from numpy import unravel_index, fill_diagonal, nanargmax, nanargmin
from bottleneck import nanmean

a = pd.DataFrame(columns=['abc', 'xyz', 'jkl'])
a.loc['abc'] = [1, 0.2, -0.01]
a.loc['xyz'] = [-0.34, 1, 0.23]
a.loc['jkl'] = [0.5, 0.4, 1]

b = a.values.astype(float)  # astype returns a new float array, so `a` keeps its diagonal
fill_diagonal(b, None)      # set the self-correlations to NaN

imax = unravel_index(nanargmax(b), b.shape)
imin = unravel_index(nanargmin(b), b.shape)
print(a.index[imax[0]], a.columns[imax[1]])
print(a.index[imin[0]], a.columns[imin[1]])
print(nanmean(b))
Please don't forget to work on a copy of your data; otherwise np.fill_diagonal will erase the diagonal values of the original.
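For comparison, a pandas-only sketch of the same idea (a is the correlation frame built above): mask the diagonal, stack the remaining values, then use idxmax, idxmin and mean.
import numpy as np

c = a.astype(float).mask(np.eye(len(a), dtype=bool))  # set the self-correlations to NaN
pairs = c.stack()                                      # NaNs on the diagonal are ignored below
print(pairs.idxmax(), pairs.max())
print(pairs.idxmin(), pairs.min())
print(pairs.mean())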
