Baffled by dataframe groupby.diff() - python

I have just read this question:
In a Pandas dataframe, how can I extract the difference between the values on separate rows within the same column, conditional on a second column?
and I am completely baffled by the answer. How does this work???
I mean, when I groupby('user') shouldn't the result be, well, grouped by user?
Whatever the function I use (mean, sum etc) I would expect a result like this:
aa=pd.DataFrame([{'user':'F','time':0},
{'user':'T','time':0},
{'user':'T','time':0},
{'user':'T','time':1},
{'user':'B','time':1},
{'user':'K','time':2},
{'user':'J','time':2},
{'user':'T','time':3},
{'user':'J','time':4},
{'user':'B','time':4}])
aa2=aa.groupby('user')['time'].sum()
print(aa2)
user
B 5
F 0
J 6
K 2
T 4
Name: time, dtype: int64
How does diff() instead return a diff of each row with the previous, within each group?
aa['diff']=aa.groupby('user')['time'].diff()
print(aa)
time user diff
0 0 F NaN
1 0 T NaN
2 0 T 0.0
3 1 T 1.0
4 1 B NaN
5 2 K NaN
6 2 J NaN
7 3 T 2.0
8 4 J 2.0
9 4 B 3.0
And more important, how is the result not a unique list of 'user' values?
I found many answers that use groupby.diff() but none of them explain it in detail. It would be extremely useful to me, and hopefully to others, to understand the mechanics behind it. Thanks.
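One way to look at the mechanics, offered as a sketch rather than a description of pandas internals: sum() is an aggregation, so it returns one value per group, while diff() is a transformation, so it returns one value per original row, aligned to the original index. The per-group diff can be rebuilt by hand with groupby().shift(). The variable names below (grouped, totals, diffs, manual) are only for illustration.

```python
import pandas as pd

aa = pd.DataFrame({
    'user': ['F', 'T', 'T', 'T', 'B', 'K', 'J', 'T', 'J', 'B'],
    'time': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4],
})

grouped = aa.groupby('user')['time']

# Aggregation: one value per group, indexed by the group keys.
totals = grouped.sum()

# Transformation: one value per original row, indexed like aa.
diffs = grouped.diff()

# The same per-row result built by hand: each row minus the previous
# row of the same user (NaN where the user has no earlier row).
manual = aa['time'] - grouped.shift(1)

print(totals.index.tolist())          # the unique users
print(diffs.index.equals(aa.index))   # same index as the original frame
print(diffs.equals(manual))
```

Because the result of diff() shares the index of aa, it can be assigned straight back as a new column, which is what aa['diff'] = ... does.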

Related

Pandas: Use selected amount of previous rows in apply function

Let's say I have the dataframe below:
index value
1 1
2 2
3 3
4 4
I want to apply a function to each row using the previous two rows, using an "apply" statement. For example, let's say I want to multiply the current row and the previous 2 rows, if they exist. (This could be any function.)
Result:
index value result
1 1 nan
2 2 nan
3 3 6
4 4 24
Thank you.
You can try rolling with prod:
df['result'] = df['value'].rolling(3).apply(lambda x: x.prod())
Output:
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0
Use assign function:
df = df.assign(result = lambda x: x['value'].cumprod().tail(len(df)-2))
I presume you have more than four rows. If so, try grouping every four rows, taking the cumulative product, choosing the last 2 values of each group, and joining them to the original dataframe.
df['result']=df.index.map(df.assign(result=df['value'].cumprod(0)).groupby(df.index//4).result.tail(2).to_dict())
If there are just four rows, this should do:
Let's try combining .cumprod() and .tail():
df['result']=df['value'].cumprod(0).tail(2)
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0
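Note that a cumulative product and a rolling product agree on the four-row sample only because the third and fourth rows happen to have nothing before the window except 1. A small sketch with an assumed fifth row (value 5, not in the original data) shows where they diverge:

```python
import numpy as np
import pandas as pd

# Assumed sample: the original four values plus one extra row.
df = pd.DataFrame({'value': [1, 2, 3, 4, 5]})

# Window of 3 = current row plus the previous two rows.
# raw=True passes each window as a NumPy array, so np.prod applies directly.
df['rolling_prod'] = df['value'].rolling(3).apply(np.prod, raw=True)

# Cumulative product multiplies every row seen so far, not just the last three.
df['cum_prod'] = df['value'].cumprod()

print(df)
```

On the last row the rolling product is 3*4*5, while the cumulative product is 1*2*3*4*5.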

Df.mean returns imaginary numbers

I have a dataframe with around 50 columns and around 3000 rows. Most cells are empty but not all of them. I am trying to add a new row at the end of the dataframe, with the mean value of each column and I need it to ignore the empty cells.
I am using df.mean(axis=0), which somehow turns all values of the dataframe into imaginary numbers. All values stay the same, but a +0j is added. I have no idea why.
Turbine.loc['Mean_Values'] = Turbine.mean(axis=0)
I couldn't find a solution for this. Is it because of the empty cells?
Based on the documentation, df.mean() automatically skips NaN/null values because skipna=True by default. Example:
import numpy as np
import pandas as pd
df=pd.DataFrame({'value':[1,2,3,np.nan,5,6,np.nan]})
# DataFrame.append was removed in pandas 2.0; df.loc[len(df)] adds the new row instead
df.loc[len(df)] = df.mean(numeric_only=True)
print(df)
Output:
value
0 1.0
1 2.0
2 3.0
3 NaN
4 5.0
5 6.0
6 NaN
7 3.4
But if any cell holds a complex number, the whole column, and therefore the result of df.mean(), is cast to complex. Example:
df=pd.DataFrame({'value':[1,2,3,np.nan,5,6,np.nan, complex(1,0)]})
print(df)
print('\n')
# DataFrame.append was removed in pandas 2.0; df.loc[len(df)] adds the new row instead
df.loc[len(df)] = df.mean(numeric_only=True)
print(df)
Output with a complex value in a cell:
value
0 (1+0j)
1 (2+0j)
2 (3+0j)
3 NaN
4 (5+0j)
5 (6+0j)
6 NaN
7 (1+0j)
value
0 (1+0j)
1 (2+0j)
2 (3+0j)
3 NaN
4 (5+0j)
5 (6+0j)
6 NaN
7 (1+0j)
8 (3+0j)
Hope this can help you :)
Some cells contained direction information (north, west, ...), which was interpreted as imaginary numbers.
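As a rough way to track down this kind of problem, one could check which columns ended up with a complex dtype before taking the mean. A minimal sketch, with made-up column names (power, heading) standing in for the real Turbine columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'heading' holds complex values, so that column is complex.
df = pd.DataFrame({
    'power': [1.0, 2.0, np.nan],
    'heading': [1 + 0j, np.nan, 3 + 0j],
})

# dtype.kind == 'c' marks complex columns.
complex_cols = [col for col in df.columns if df[col].dtype.kind == 'c']
print(complex_cols)
```

Any column listed here is worth inspecting for cells that were parsed as complex values.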

Compute difference between the rows before and after a specific row (pandas)

I want to find the difference between the rows before and after a specific row. Specifically, I have the following dataset:
Number of rows A
1 4
2 2
3 2
4 3
5 2
I should get the following data:
Number of rows A B
1 4 NaN (since there is no row before this row)
2 2 2 (4-2)
3 2 -1 (2-3)
4 3 0 (2-2)
5 2 NaN (since there is no row after this row)
As you can see, each row in column B equals the difference between the previous and the following rows in column A. For example, the second row in column B equals the value in the first row of column A minus the value in the third row of column A. IMPORTANT POINT: I do not need only the immediately previous and following rows. I need the difference between the row 2 before and the row 2 after. That is, the value in row number 23 of column B should equal the value in row number 21 of column A minus the value in row number 25 of column A. I used the immediately previous and following rows above for simplicity.
I hope I could explain it.
Seems like you need a centered rolling window. You can specify that with the arg center=True
>>> df.A.rolling(3, center=True).apply(lambda s: s[0]-s[-1], raw=True)  # raw=True passes a NumPy array, so s[0] and s[-1] are positional
0 NaN
1 2.0
2 -1.0
3 0.0
4 NaN
Name: A, dtype: float64
This approach works for any window. Notice that this is a centered window, so the size of the window has to be N+N+1 (where N is the number of lookback and lookforward rows, and you add 1 to account for the value in the middle). Thus, the general formula is
window = 2*N + 1
If you need 2 rows before and 2 after, then N = 2. If you need 5 and 5, then N = 5 (and window = 11), and so on. The apply lambda stays the same.
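The window formula above can be sketched for N = 2 as follows. The seven-value series is an assumed sample (the original five values plus two extra), chosen so that more than one row has both neighbours.

```python
import pandas as pd

s = pd.Series([4, 2, 2, 3, 2, 5, 1])
N = 2                 # rows to look back and to look forward
window = 2 * N + 1    # centered window size

# raw=True hands each window to the lambda as a NumPy array,
# so w[0] and w[-1] are the first and last positions of the window.
b = s.rolling(window, center=True).apply(lambda w: w[0] - w[-1], raw=True)
print(b)
```

The first N and last N rows come out as NaN because the centered window is incomplete there.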
Let the series (i.e. DataFrame column) be s.
You want:
s.shift(1) - s.shift(-1)
You need to use .shift on the column (series) where you want to run your calculation.
With shift(1) you get the previous row, with shift(-1) you get the next row.
From there, calculate previous - next:
>>> s = pd.Series([4,2,2,3,2])
>>> s
0 4
1 2
2 2
3 3
4 2
dtype: int64
# previous
>>> s.shift(1)
0 NaN
1 4.0
2 2.0
3 2.0
4 3.0
dtype: float64
# next
>>> s.shift(-1)
0 2.0
1 2.0
2 3.0
3 2.0
4 NaN
dtype: float64
# previous - next
>>> s.shift(1)-s.shift(-1)
0 NaN
1 2.0
2 -1.0
3 0.0
4 NaN
dtype: float64
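The same idea extends to a gap of more than one row by shifting N positions in each direction. A minimal sketch for N = 2, using an assumed seven-value series (the original five values plus two extra):

```python
import pandas as pd

s = pd.Series([4, 2, 2, 3, 2, 5, 1])
N = 2  # distance to the rows before and after

# shift(N) brings the value from N rows earlier to the current row,
# shift(-N) brings the value from N rows later.
b = s.shift(N) - s.shift(-N)
print(b)
```

Rows without a value N positions before or after stay NaN.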

In a Pandas dataframe, how can I extract the difference between the values on separate rows within the same column, conditional on a second column? [duplicate]

This is part of a larger project, but I've broken my problem down into steps, so here's the first step. Take a Pandas dataframe, like this:
index | user time
---------------------
0 F 0
1 T 0
2 T 0
3 T 1
4 B 1
5 K 2
6 J 2
7 T 3
8 J 4
9 B 4
For each unique user, can I extract the difference between the values in column "time," but with some conditions?
So, for example, there are two instances of user J, and the "time" difference between these two instances is 2. Can I extract the difference, 2, between these two rows? Then if that user appears again, extract the difference between that row and the previous appearance of that user in the dataframe?
I believe you need DataFrameGroupBy.diff:
df['new'] = df.groupby('user')['time'].diff()
print (df)
user time new
0 F 0 NaN
1 T 0 NaN
2 T 0 0.0
3 T 1 1.0
4 B 1 NaN
5 K 2 NaN
6 J 2 NaN
7 T 3 2.0
8 J 4 2.0
9 B 4 3.0
I think np.where and pandas shift do this.
This subtracts two consecutive time values, but only if the users on the two adjacent rows are the same:
df1 = np.where(df['user'] == df['user'].shift(-1), df['time'] - df['time'].shift(-1), np.nan)
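To see how the adjacent-row approach differs from a per-user diff, here is a small sketch on the question's data. It uses shift(1) (previous row) rather than shift(-1) so that both columns look backwards; the column names adjacent and per_user are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': ['F', 'T', 'T', 'T', 'B', 'K', 'J', 'T', 'J', 'B'],
    'time': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4],
})

# Adjacent-row comparison: only fires when the row directly above has the same user.
same_as_prev = df['user'] == df['user'].shift(1)
df['adjacent'] = np.where(same_as_prev, df['time'] - df['time'].shift(1), np.nan)

# Per-user comparison: previous appearance of the same user, wherever it is.
df['per_user'] = df.groupby('user')['time'].diff()
print(df)
```

The adjacent version fills only rows 2 and 3 (the consecutive T rows), while the per-user version also fills rows 7, 8 and 9, where the earlier appearance of the user is not the row directly above.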

Total Number of Missing Attributes

Using Python and pandas: for a given data set, how does one find the total number of missing attributes? I have found the number for each column, but I need to sum the columns to find the total. Below is the code I have used so far.
def num_missing(x):
    return sum(x.isnull())
print("Missing Values per Column:")
print(data_file1.apply(num_missing))
Consider df -
df
A B C
0 1.0 4 NaN
1 2.0 5 1.0
2 NaN 6 6.0
3 NaN 7 3.0
Column-wise NaN count -
df.isnull().sum(0)
A 2
B 0
C 1
dtype: int64
Row-wise NaN count -
df.isnull().sum(1)
0 1
1 0
2 1
3 1
dtype: int64
df-wide NaN count -
df.isnull().values.sum()
3
Option 1: call .sum() twice, where the second call finds the sum of the intermediate Series.
df = pd.DataFrame(np.ones((5,5)))
df.iloc[2:4, 1:3] = np.nan
df.isnull().sum().sum()
# 4
Option 2: use underlying NumPy array.
np.isnan(df.values).sum()
# 4
Option 2 should be significantly faster (8.5 us vs. 249 us on this sample data).
As noted by @root, np.isnan() works only on numeric data, not object dtypes. pandas.DataFrame.isnull() doesn't have this problem.
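A small sketch of that caveat, using an assumed two-column frame where one column holds strings:

```python
import numpy as np
import pandas as pd

# Column 'B' holds strings, so df.values is an object array.
df = pd.DataFrame({'A': [1.0, np.nan], 'B': ['x', None]})

# isnull() handles object columns and counts None as missing.
total_missing = int(df.isnull().sum().sum())
print(total_missing)

# np.isnan() is not defined for object arrays and raises TypeError.
try:
    np.isnan(df.values)
    isnan_failed = False
except TypeError:
    isnan_failed = True
print(isnan_failed)
```

So the faster NumPy route is only safe when every column is numeric.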
