Largest elementwise difference between all rows in dataframe - python

Given is the following dataframe:
      c1  c2  c3  c4
code
x      1   2   1   1
y      3   2   2   1
z      2   0   4   1
For each row of this dataframe I want to calculate the largest elementwise absolute difference between that row and every other row, and put the results into a new dataframe:
      x  y  z
code
x     0  2  3
y     2  0  2
z     3  2  0
(The result is, of course, a symmetric matrix with zeros on the main diagonal, so it would be sufficient to compute just the upper or the lower triangle.)
So, for instance, the maximum elementwise difference between rows x and y is 2 (from column c1: abs(3 - 1) = 2).
What I got so far:
import pandas as pd

df = pd.DataFrame(data={'code': ['x', 'y', 'z'], 'c1': [1, 3, 2],
                        'c2': [2, 2, 0], 'c3': [1, 2, 4], 'c4': [1, 1, 1]})
df.set_index('code', inplace=True)

df1 = pd.DataFrame()
for row in df.iterrows():
    df1.append((df - row[1]).abs().max(1), ignore_index=True)
When run interactively, this already looks close to what I need, but the new df1 is still empty afterwards:
>>> for row in df.iterrows(): df1.append((df - row[1]).abs().max(1), ignore_index=True)
...
     x    y    z
0  0.0  2.0  3.0
     x    y    z
0  2.0  0.0  2.0
     x    y    z
0  3.0  2.0  0.0
>>> df1
Empty DataFrame
Columns: []
Index: []
Questions:
How do I get the results into the new dataframe df1 (with the correct index x, y, ...)?
This is only an MCVE. In reality, df has about 700 rows, and I am not sure iterrows scales that well. I have a feeling that the apply method would come in handy here, but I couldn't figure it out. Is there a more idiomatic / pandas-like way to do this without explicitly iterating over the rows?

You can use NumPy broadcasting and feed the resulting array to the pd.DataFrame constructor. For a few hundred rows, as in your data, this should be efficient, though note that it builds an intermediate array of shape (n, n, 4).
import numpy as np

A = df.values
# broadcasting (n, 4) against (n, 1, 4) yields all pairwise differences
# with shape (n, n, 4); take abs and reduce over the column axis
res = pd.DataFrame(np.abs(A - A[:, None]).max(2),
                   index=df.index, columns=df.index.values)
print(res)
      x  y  z
code
x     0  2  3
y     2  0  2
z     3  2  0
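
Incidentally, the quantity you are computing is the Chebyshev (L-infinity) distance between rows, so if SciPy is available (an assumption beyond the question itself), scipy.spatial.distance.pdist computes exactly the condensed upper triangle you mentioned, and squareform expands it to the full symmetric matrix. A minimal sketch:
from scipy.spatial.distance import pdist, squareform

# pdist returns only the condensed upper triangle; squareform expands it
# to the full symmetric n x n matrix with zeros on the diagonal
res = pd.DataFrame(squareform(pdist(df.values, metric='chebyshev')),
                   index=df.index, columns=df.index)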

If you want your original code to produce the correct output, note that DataFrame.append does not modify df1 in place; it returns a new dataframe, so you have to assign the computed value back to df1:
for row in df.iterrows():
    df1 = df1.append((df - row[1]).abs().max(1), ignore_index=True)

df1.index = df.index
print(df1)
     x    y    z
x  0.0  2.0  3.0
y  2.0  0.0  2.0
z  3.0  2.0  0.0
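Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On current pandas the same idea can be expressed by collecting the per-row Series in a list and building the frame in one go, which is also faster than appending inside the loop. A sketch:
# each (df - row).abs().max(1) is a Series; a list of Series becomes rows
df1 = pd.DataFrame([(df - row).abs().max(1) for _, row in df.iterrows()],
                   index=df.index)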

Related

python - pandas groupby to flat DataFrame

I would like to convert a groupby result to a flat DataFrame.
import pandas

df1 = pandas.DataFrame({
    "x": ["A", "B", "C", "A", "B", "B"],
    "y": [1, 2, 3, 4, 5, 6]})

g1 = df1.groupby(["x"]).max().reset_index()
print(g1)
The expected output DataFrame looks like this:
   x  y1  y2  y3
0  A   1   4   0
1  B   2   5   6
2  C   3   0   0
If a value does not exist, use 0 by default.
Try groupby.agg with list, expand with apply(pd.Series) (apply rather than agg, since Series.agg falling back to elementwise application is deprecated), then add_prefix, fillna, and reset_index, like the following:
g1 = df1.groupby('x')['y'].agg(list).apply(pd.Series).add_prefix('y').fillna(0).reset_index()
print(g1)
Or, if you care about the column names matching y1, y2, y3, shift the positional labels up by one with rename and the slick 1 .__add__ (the space keeps 1. from being parsed as a float literal):
g1 = df1.groupby('x')['y'].agg(list).apply(pd.Series).rename(1 .__add__, axis=1).add_prefix('y').fillna(0).reset_index()
Output:
   x   y1   y2   y3
0  A  1.0  4.0  0.0
1  B  2.0  5.0  6.0
2  C  3.0  0.0  0.0
We can use pivot_table with the 'x' column as the index. groupby('x').cumcount() + 1 enumerates the rows within each group, which gives the positional y values as the new columns 1, 2, 3, and fill_value=0 sets the default for missing entries (the benefit of fill_value over fillna is that no NaN is introduced, so the dtype does not change to float).
Lastly, add_prefix the columns and reset_index to match the desired output:
out = (df1.pivot_table(index='x',
                       columns=df1.groupby('x').cumcount() + 1,
                       values='y',
                       fill_value=0)
          .add_prefix('y')
          .reset_index())
print(out)
   x  y1  y2  y3
0  A   1   4   0
1  B   2   5   6
2  C   3   0   0
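
A closely related idiom, sketched here as an alternative using the same cumcount key, is set_index followed by unstack, which also accepts fill_value and therefore likewise keeps the integer dtype:
out = (df1.set_index(['x', df1.groupby('x').cumcount() + 1])['y']
          .unstack(fill_value=0)
          .add_prefix('y')
          .reset_index())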

Restore index and append zeros after subtracting dataframe values

I am calculating the difference of a dataframe's values at different lags.
The following dataframe is my input:
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=list('AB'))
To compute the difference between the last three rows and the first three rows, I am doing the following:
df2 = df.iloc[1:, :]
df3 = df.iloc[:-1, :]
df_out = pd.DataFrame(df2.values - df3.values, index=df2.index)
The calculation is as expected, but I want to retain the index 0 with zeros in that row:
df_expected_out = pd.DataFrame([[0, 0], [2, 2], [2, 2], [2, 2]], columns=list('AB'))
Please suggest the way forward. Thanks for your time.
I think you need to reindex by the original index:
df_out = pd.DataFrame(df2.values - df3.values, index=df2.index).reindex(df.index, fill_value=0)
print(df_out)
   0  1
0  0  0
1  2  2
2  2  2
3  2  2
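One detail worth noting: constructing the frame from .values drops the column labels, which is why the columns above are 0 and 1 rather than A and B. Passing columns=df.columns restores them so the result matches df_expected_out exactly; a sketch:
# keep the original column labels A, B when rebuilding from raw values
df_out = (pd.DataFrame(df2.values - df3.values,
                       index=df2.index, columns=df.columns)
            .reindex(df.index, fill_value=0))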
Another solution:
df_out = df.diff().fillna(0).astype(int)
Or prepend a row of zeros to both arrays before subtracting:
a1 = np.zeros((1, len(df.columns)), dtype=int)
arr = np.append(a1, df2.values, axis=0) - np.append(a1, df3.values, axis=0)
df_out = pd.DataFrame(arr, index=df.index)
print(df_out)
   0  1
0  0  0
1  2  2
2  2  2
3  2  2
You can use the shift function (equivalent to df.diff()):
(df - df.shift()).fillna(0)
Out[9]:
     A    B
0  0.0  0.0
1  2.0  2.0
2  2.0  2.0
3  2.0  2.0

Pandas add rows for each unique element of a column

I've got a dataframe, like so:
ID  A
 0  z
 2  z
 2  y
 5  x
To which I want to add a row for each unique value of the ID column:
ID  A
 0  z
 2  z
 2  y
 5  x
 0  b
 2  b
 5  b
I'm currently doing so in a very naïve way, which is quite inefficient/slow:
IDs = df["ID"].unique()
for ID in IDs:
    df = df.append(pd.DataFrame([[ID, "b"]], columns=df.columns), ignore_index=True)
How would I accomplish the same without the explicit for loop, using only pandas function calls?
Use drop_duplicates, rewrite the A column with assign, and append or concat the result to the original DataFrame:
df = df.append(df.drop_duplicates("ID").assign(A='b'), ignore_index=True)
# alternative:
# df = pd.concat([df, df.drop_duplicates("ID").assign(A='b')], ignore_index=True)
print(df)
  ID  A
0  0  z
1  2  z
2  2  y
3  5  x
4  0  b
5  2  b
6  5  b
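As another sketch of the same idea, you can build the new rows directly from the unique IDs and concatenate once; pd.concat is also the only option on pandas >= 2.0, where DataFrame.append has been removed:
# one new row per unique ID, all with A set to 'b'
new_rows = pd.DataFrame({'ID': df['ID'].unique(), 'A': 'b'})
df = pd.concat([df, new_rows], ignore_index=True)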

Pandas - combine two columns

I have 2 columns, which we'll call x and y. I want to create a new column called xy:
x  y  xy
1       1
2       2
   4    4
   8    8
There shouldn't be any conflicting values, but if there are, y takes precedence. If it makes the solution easier, you can assume that x will always be NaN where y has a value.
It could be quite simple if your example is accurate:
df = df.fillna(0)  # if the blanks are NaN you will need this line first
df['xy'] = df['x'] + df['y']
If your columns are strings rather than numeric, coerce them first:
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df['xy'] = df.sum(1)
More, joining the two columns as strings (note the result dtype is object):
df['xy'] = df[['x', 'y']].astype(str).apply(''.join, 1)
df[['x', 'y']].astype(str).apply(''.join, 1)
Out[655]:
0    1.0
1    2.0
2
3    4.0
4    8.0
dtype: object
You can also use NumPy:
import pandas as pd, numpy as np

df = pd.DataFrame({'x': [1, 2, np.nan, np.nan],
                   'y': [np.nan, np.nan, 4, 8]})

# each row has exactly one non-NaN value; the boolean mask flattens
# row by row, so this picks that value for every row in order
arr = df.values
df['xy'] = arr[~np.isnan(arr)].astype(int)
print(df)
     x    y  xy
0  1.0  NaN   1
1  2.0  NaN   2
2  NaN  4.0   4
3  NaN  8.0   8
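
For completeness, a sketch of the pandas idiom built for exactly this pattern: Series.combine_first fills the NaN of one column from another, and listing y first makes it take precedence if both columns ever hold a value:
# y wins on conflicts; drop .astype(int) if a row may have neither value
df['xy'] = df['y'].combine_first(df['x']).astype(int)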

Stop Pandas from rotating results from groupby-apply when there is one group

I have some code that first selects data based on a certain criterion and then does a groupby-apply on a Pandas dataframe. Occasionally, the data has only one group that matches the criterion. In this case, Pandas returns a row vector rather than a column vector. Example below:
In [50]: x = pd.DataFrame([(round(i/2, 0), i, i) for i in range(0, 10)],
    ...:                  columns=['a', 'b', 'c'])
In [51]: x
Out[51]:
     a  b  c
0  0.0  0  0
1  0.0  1  1
2  1.0  2  2
3  2.0  3  3
4  2.0  4  4
5  2.0  5  5
6  3.0  6  6
7  4.0  7  7
8  4.0  8  8
9  4.0  9  9
In [52]: y = x.loc[x.a == 0.0].groupby('a').apply(lambda x: x.b / x.c)

In [53]: y
Out[53]:
      0    1
a
0.0 NaN  1.0
y in the above example is a row vector (a one-row pandas.DataFrame). If the .loc selection contains two or more groups, the same code produces a column vector instead:
In [54]: y = x.loc[x.a <= 1.0].groupby('a').apply(lambda x: x.b / x.c)

In [55]: y
Out[55]:
a
0.0  0    NaN
     1    1.0
1.0  2    1.0
dtype: float64
Any idea how I can make the two behaviours consistent? Ultimately, the column vector is what I want.
Thanks
There's no way to do this in one step, unfortunately. You can, however, do this in two steps, by querying ngroups and reshaping your result accordingly.
g = x.loc[...].groupby('a')
y = g.apply(lambda x: x.b / x.c)

if g.ngroups == 1:
    y = y.T
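Alternatively, a sketch that keys off the result's type instead of ngroups: stack the one-group DataFrame back into a Series, which reproduces the (group, row label) MultiIndex of the multi-group case:
y = g.apply(lambda x: x.b / x.c)
if isinstance(y, pd.DataFrame):
    # dropna=False keeps the NaN entries, matching the multi-group output
    y = y.stack(dropna=False)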
