.agg Sum Converting NaN to 0 - python

I am trying to bin a Pandas DataFrame into three-day windows. I have two columns, A and B, which I want to sum within each window. The code I wrote for the task
df = df.groupby(df.index // 3).agg({'A': 'sum', 'B': 'sum'})
converts NaN values to zero when summing, but I would like them to remain NaN, since my data contains genuine non-NaN zero values.
For example if I had this df:
df = pd.DataFrame([
[np.nan, np.nan],
[np.nan, 3],
[np.nan, np.nan],
[2, 0],
[4, 0],
[0, 0]
], columns=['A', 'B'])
Index A B
0 NaN NaN
1 NaN 3
2 NaN NaN
3 2 0
4 4 0
5 0 0
I would like the new df to be:
Index A B
0 NaN 3
1 6 0
But my current code outputs:
Index A B
0 0 3
1 6 0

df.groupby(df.index // 3)[['A', 'B']].sum(min_count=1)
The min_count=1 argument makes sum return NaN for any group that contains no non-NaN values, which gives the expected output above. (Note the double brackets: selecting columns with a bare tuple, ['A', 'B'], is no longer supported in recent pandas versions.)
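A runnable sketch of the min_count=1 approach on the sample data (with B equal to 3 in row 1, matching the table above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [np.nan, np.nan],
    [np.nan, 3],
    [np.nan, np.nan],
    [2, 0],
    [4, 0],
    [0, 0],
], columns=['A', 'B'])

# min_count=1 requires at least one non-NaN value per group, so the
# all-NaN group in column A stays NaN instead of summing to 0.
out = df.groupby(df.index // 3)[['A', 'B']].sum(min_count=1)
print(out)
```

The all-NaN group in A comes back as NaN, while B still sums to 3 because its group has one non-NaN value.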
Another option:
df.groupby(df.index // 3).agg({'A': lambda x: x.sum(skipna=False),
                               'B': lambda x: x.sum(skipna=True)})

Try with this code:
df.groupby(df.index // 3).agg({'A': lambda x: x.sum(skipna=False),
                               'B': lambda x: x.sum(skipna=False)})
Out[282]:
A B
0 NaN NaN
1 6.0 0.0

Related

Subtract with value in previous row to create a new column by subject

Using Python and this data set https://raw.githubusercontent.com/yadatree/AL/main/AK4.csv I would like to create a new column for each subject that starts with 0 (in the first row) and then subtracts the previous row's SCALE value from the current one: row 2 minus row 1, row 3 minus row 2, row 4 minus row 3, etc.
However, if this produces a negative value, the output should be 0.
Edit: Thank you for the response. That worked perfectly. The only remaining issue is that I'd like to start again with each subject (SUBJECT column). The number of values per subject is not fixed, so something that checks the SUBJECT column and then starts again from 0 would be ideal.
You can use .shift(1) to create a new column holding the values moved down from the previous row; then you have both values in the same row and can subtract the columns.
Later you can select all negative results and assign zero.
import pandas as pd
data = {
'A': [1, 3, 2, 5, 1],
}
df = pd.DataFrame(data)
df['previous'] = df['A'].shift(1)
df['result'] = df['A'] - df['previous']
print(df)
#df['result'] = df['A'] - df['A'].shift(1)
#print(df)
df.loc[ df['result'] < 0 , 'result'] = 0
print(df)
Result:
A previous result
0 1 NaN NaN
1 3 1.0 2.0
2 2 3.0 -1.0
3 5 2.0 3.0
4 1 5.0 -4.0
A previous result
0 1 NaN NaN
1 3 1.0 2.0
2 2 3.0 0.0
3 5 2.0 3.0
4 1 5.0 0.0
EDIT:
If you use df['result'] = df['A'] - df['A'].shift(1) then you get column result without creating column previous.
And if you use .shift(1, fill_value=0) then it will put 0 instead of NaN in first row.
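A minimal sketch of that two-liner, using the same toy data as above:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 2, 5, 1]})

# fill_value=0 puts 0 (not NaN) in the first row of the shifted series,
# so the first difference is computed against 0 instead of being NaN.
df['result'] = df['A'] - df['A'].shift(1, fill_value=0)
df.loc[df['result'] < 0, 'result'] = 0
print(df)
```

Because no NaN is introduced, the result column keeps its integer dtype.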
EDIT:
You can use groupby("SUBJECT") to group by subject, and then put 0 in the first row of every group.
import pandas as pd
data = {
'S': ['A', 'A', 'A', 'B', 'B', 'B'],
'A': [1, 3, 2, 1, 5, 1],
}
df = pd.DataFrame(data)
df['result'] = df['A'] - df['A'].shift(1, fill_value=0)
print(df)
df.loc[ df['result'] < 0 , 'result'] = 0
print(df)
all_groups = df.groupby('S')
first_index = all_groups.apply(lambda grp: grp.index[0])
df.loc[first_index, 'result'] = 0
print(df)
Results:
S A result
0 A 1 1
1 A 3 2
2 A 2 -1
3 B 1 -1
4 B 5 4
5 B 1 -4
S A result
0 A 1 1
1 A 3 2
2 A 2 0
3 B 1 0
4 B 5 4
5 B 1 0
S A result
0 A 1 0
1 A 3 2
2 A 2 0
3 B 1 0
4 B 5 4
5 B 1 0
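For what it's worth, the same per-subject logic can also be collapsed into a single pass with groupby().diff(), an alternative technique rather than part of the answer above: diff() is computed within each group and returns NaN for each group's first row, which stands in for the requested starting 0, and clip(lower=0) zeroes the negative differences.

```python
import pandas as pd

df = pd.DataFrame({
    'S': ['A', 'A', 'A', 'B', 'B', 'B'],
    'A': [1, 3, 2, 1, 5, 1],
})

# diff() restarts at NaN for the first row of each subject;
# fillna(0) supplies the starting zero, clip removes negatives.
df['result'] = df.groupby('S')['A'].diff().fillna(0).clip(lower=0)
print(df)
```

This avoids the separate first-row fix-up step, at the cost of a float result column (diff introduces NaN).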

python - pandas groupby to flat DataFrame

I would like to convert groupby result to a flat DataFrame.
import pandas
df1 = pandas.DataFrame( {
"x" : ["A", "B", "C", "A", "B" ,"B"] ,
"y" : [ 1, 2, 3, 4, 5, 6]} )
g1 = df1.groupby(["x"]).max().reset_index()
print(g1)
The expected output DataFrame like below:
x y1 y2 y3
0 A 1 4 0
1 B 2 5 6
2 C 3 0 0
If a value does not exist, use 0 by default.
Try groupby.agg with list, expand with apply(pd.Series), then add_prefix, fillna and reset_index.
Like the following:
g1 = df1.groupby('x')['y'].agg(list).apply(pd.Series).add_prefix('y').fillna(0).reset_index()
print(g1)
Or, if you care about 1-based column names, use rename with a slick trick, 1 .__add__, before adding the prefix:
g1 = df1.groupby('x')['y'].agg(list).apply(pd.Series).rename(1 .__add__, axis=1).add_prefix('y').fillna(0).reset_index()
Output:
x y1 y2 y3
0 A 1.0 4.0 0.0
1 B 2.0 5.0 6.0
2 C 3.0 0.0 0.0
We can use pivot_table with the 'x' column as the index. A groupby cumcount on x enumerates the rows within each group, giving the positional y values as the new columns [1, 2, 3], and fill_value=0 sets the default for missing entries (the benefit of fill_value over fillna is that no NaN is introduced, so the dtype does not change to float).
Lastly, add_prefix the columns and reset_index to match the desired output:
out = (
df1.pivot_table(index='x',
columns=df1.groupby('x').cumcount() + 1,
values='y',
fill_value=0)
.add_prefix('y')
.reset_index()
)
out:
x y1 y2 y3
0 A 1 4 0
1 B 2 5 6
2 C 3 0 0

Apply function rowwise to pandas dataframe while referencing a column

I have a pandas dataframe like this:
df = pd.DataFrame({'A': [2, 3], 'B': [1, 2], 'C': [0, 1], 'D': [1, 0], 'total': [4, 6]})
A B C D total
0 2 1 0 1 4
1 3 2 1 0 6
I'm trying to perform a rowwise calculation and create a new column with the result. The calculation is to divide each column ABCD by the total, square it, and sum it up rowwise. This should be the result (0 if total is 0):
A B C D total result
0 2 1 0 1 4 0.375
1 3 2 1 0 6 0.389
This is what I've tried so far, but it always returns 0:
df['result'] = df[['A', 'B', 'C', 'D']].apply(lambda x: ((x/df['total'])**2).sum(), axis=1)
I guess the problem is df['total'] in the lambda function, because if I replace this by a number it works fine. I don't know how to work around this though. Appreciate any suggestions.
A combination of div, pow and sum can solve this:
df["result"] = df.filter(regex="[^total]").div(df.total, axis=0).pow(2).sum(1)
df
(Note the regex is a character class, not the word "total": it keeps every column whose name contains at least one character other than t, o, a, l, which happens to exclude only 'total' here. df.drop(columns='total') is a more explicit alternative.)
A B C D total result
0 2 1 0 1 4 0.375000
1 3 2 1 0 6 0.388889
you could do
df['result'] = (df.loc[:, "A": 'D'].divide(df.total, axis=0) ** 2).sum(axis=1)
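The question also asks for 0 when total is 0, which neither snippet above handles (dividing by 0 produces inf or NaN terms). A hedged sketch of one way to patch that, using an extra all-zero row for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 0], 'B': [1, 2, 0], 'C': [0, 1, 0],
                   'D': [1, 0, 0], 'total': [4, 6, 0]})

result = (df.loc[:, 'A':'D'].div(df['total'], axis=0) ** 2).sum(axis=1)
# Rows with total == 0 produce inf/NaN terms; force them to 0 as requested.
df['result'] = result.where(df['total'] != 0, 0)
print(df)
```

where() keeps the computed value wherever total is non-zero and substitutes 0 elsewhere.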

Pandas- merging two dataframe by sum the values of columns and index

I want to merge two datasets by their indexes and columns, summing the overlapping cells across the entire dataset.
df1 = pd.DataFrame([[1, 0, 0], [0, 2, 0], [0, 0, 3]],columns=[1, 2, 3])
df1
1 2 3
0 1 0 0
1 0 2 0
2 0 0 3
df2 = pd.DataFrame([[0, 0, 1], [0, 2, 0], [3, 0, 0]],columns=[1, 2, 3])
df2
1 2 3
0 0 0 1
1 0 2 0
2 3 0 0
I have tried this code but I got this error. I can't understand why it complains about the size of the axis.
df_sum = pd.concat([df1, df2])\
.groupby(df2.index)[df2.columns]\
.sum().reset_index()
ValueError: Grouper and axis must be same length
This is what I expected for the output of df_sum:
df_sum
1 2 3
0 1 0 1
1 0 4 0
2 3 0 3
You can use df1.add(df2, fill_value=0). It adds df2 to df1 element-wise, and fill_value=0 treats a value that is missing in one of the frames as 0 (if both are NaN, the result stays NaN).
>>> import numpy as np
>>> import pandas as pd
>>> df2 = pd.DataFrame([(10,9),(8,4),(7,np.nan)], columns=['a','b'])
>>> df1 = pd.DataFrame([(1,2),(3,4),(5,6)], columns=['a','b'])
>>> df1.add(df2, fill_value=0)
a b
0 11 11.0
1 11 8.0
2 12 6.0
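If you want to keep the concat approach from the question, the fix is to group on the index level rather than passing df2.index: the concatenated frame has 6 rows, while df2.index supplies only 3 labels, and that length mismatch is what raises the Grouper error. A sketch:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 0, 0], [0, 2, 0], [0, 0, 3]], columns=[1, 2, 3])
df2 = pd.DataFrame([[0, 0, 1], [0, 2, 0], [3, 0, 0]], columns=[1, 2, 3])

# groupby(level=0) groups the 6 concatenated rows by their shared
# index labels 0, 1, 2, so matching rows are summed pairwise.
df_sum = pd.concat([df1, df2]).groupby(level=0).sum()
print(df_sum)
```

This produces the same result as df1.add(df2, fill_value=0) when the frames share the same index and columns.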

KeyError when Assigning value using For Loop by Pandas

I have a long list of data, with meaningful values sandwiched between runs of 0. Here is how it looks:
0
0
1
0
0
2
3
1
0
0
0
0
1
0
The lengths of the 0 runs and of the meaningful sequences are variable. I want to extract each meaningful sequence into its own row of a dataframe. For example, the above data can be extracted to this:
1
2 3 1
1
I used this code to 'slice' the meaningful data:
import pandas as pd
import numpy as np
raw = pd.read_csv('data.csv')
df = pd.DataFrame(index=np.arange(0, 10000),columns = ['DT01', 'DT02', 'DT03', 'DT04', 'DT05', 'DT06', 'DT07', 'DT08', 'DT02', 'DT09', 'DT10', 'DT11', 'DT12', 'DT13', 'DT14', 'DT15', 'DT16', 'DT17', 'DT18', 'DT19', 'DT20',])
a = 0
b = 0
for n in range(0, 999999):
    if raw.iloc[n].values > 0:
        df.iloc[a, b] = raw.iloc[n].values
        a = a + 1
        if raw[n+1] == 0:
            b = b + 1
            a = 0
but I keep getting KeyError: n, where n is the row after the first row that holds a non-zero value.
Where is the problem with my code? And is there any way to improve it, in terms of speed and memory cost?
Thank you very much
You can use:
df['Group'] = df['col'].eq(0).cumsum()
df = df.loc[ df['col'] != 0]
df = df.groupby('Group')['col'].apply(list)
print (df)
Group
2 [1]
4 [2, 3, 1]
8 [1]
Name: col, dtype: object
df = pd.DataFrame(df.groupby('Group')['col'].apply(list).values.tolist())
print (df)
0 1 2
0 1 NaN NaN
1 2 3.0 1.0
2 1 NaN NaN
Let's try this; it outputs a dataframe:
df.groupby(df[0].eq(0).cumsum().mask(df[0].eq(0)),as_index=False)[0]\
.apply(lambda x: x.reset_index(drop=True)).unstack(1)
Output:
0 1 2
0 1.0 NaN NaN
1 2.0 3.0 1.0
2 1.0 NaN NaN
Or a string:
df.groupby(df[0].eq(0).cumsum().mask(df[0].eq(0)),as_index=False)[0]\
.apply(lambda x: ' '.join(x.astype(str)))
Output:
0 1
1 2 3 1
2 1
dtype: object
Or as a list:
df.groupby(df[0].eq(0).cumsum().mask(df[0].eq(0)),as_index=False)[0]\
.apply(list)
Output:
0 [1]
1 [2, 3, 1]
2 [1]
dtype: object
Try this; I break down the steps:
df.LIST=df.LIST.replace({0:np.nan})
df['Group']=df.LIST.isnull().cumsum()
df=df.dropna()
df.groupby('Group').LIST.apply(list)
Out[384]:
Group
2 [1]
4 [2, 3, 1]
8 [1]
Name: LIST, dtype: object
Data Input
df = pd.DataFrame({'LIST' : [0,0,1,0,0,2,3,1,0,0,0,0,1,0]})
Let's start with packing your original data into a pandas dataframe (in real life, you will probably use pd.read_csv() to generate this dataframe):
raw = pd.DataFrame({'0' : [0,0,1,0,0,2,3,1,0,0,0,0,1,0]})
The default index will help you locate zero spans:
s1 = raw.reset_index()
s1['index'] = np.where(s1['0'] != 0, np.nan, s1['index'])
s1['index'] = s1['index'].fillna(method='ffill').fillna(0).astype(int)
s1[s1['0'] != 0].groupby('index')['0'].apply(list).tolist()
#[[1], [2, 3, 1], [1]]
