Iterating over rows subtracting by 1 in dataframe pandas - python

I have a pandas dataframe in which, for each row, I would like to start from the last non-null value and then subtract 1 from that value for each following column.
z = pd.DataFrame({'l': range(10),
                  'r': [4] + [np.nan] * 9,
                  'gh': [np.nan] * 4 + [15] + [np.nan] * 5,
                  'gfh': [np.nan] * 9 + [2]})
df = z.transpose().copy()
df.reset_index(inplace=True)
df.drop(['index'],axis=1, inplace=True)
df.columns = ['a','b','c','d','e','f','g','h','i','j']
In [8]: df
Out[8]:
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 NaN NaN NaN NaN NaN
2 0 1 2 3 4 5 6 7 8 9
3 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
I have the above dataframe that I would like to reduce by 1 for every column until the last column. For example, in row 1 the value is 15, so I want 14, 13, 12, 11, 10 to follow. Nothing will follow the 2 in row 0 since there are no columns left. And the 4 in the last row would be followed by 3, 2, 1, 0, 0, 0, etc.
I reached my desired output by doing the following.
for index, row in df.iterrows():
    df.iloc[index, df.columns.get_loc(df.iloc[index].last_valid_index()) + 1:] = \
        [(df.iloc[index, df.columns.get_loc(df.iloc[index].last_valid_index()):][0] - (x + 1)).astype(int)
         for x in range((df.shape[1] - 1) - df.columns.get_loc(df.iloc[index].last_valid_index()))]
df[df < 0] = 0
This gives me the desired output
In [13]: df
Out[13]:
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 14 13 12 11 10
2 0 1 2 3 4 5 6 7 8 9
3 4 3 2 1 0 0 0 0 0 0
BUT: in my real-world data I have 50K-plus columns, and the above code takes far too long.
Can anyone please suggest how I can make this run faster?
I believe the solution would be to somehow tell the code that once the subtraction reaches zero it should move on to the next row, but I don't know how to do that, since even if I use max(0, subtraction formula) the code still wastes time subtracting.
Thank you.

I don't know how fast it will be, but you could experiment with ffill, cumsum, and a mask. For example:
>>> df
     a   b   c   d   e   f   g   h   i   j
0  NaN NaN NaN NaN NaN NaN NaN NaN NaN   2
1  NaN NaN NaN NaN  15 NaN NaN NaN NaN NaN
2    0   1   2   3   4   5   6   7   8   9
3    4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
>>> mask = df.ffill(axis=1).notnull() & df.isnull()
>>> df.where(~mask, (df.ffill(axis=1) - mask.cumsum(axis=1)).clip(lower=0))
     a   b   c   d   e   f   g   h   i   j
0  NaN NaN NaN NaN NaN NaN NaN NaN NaN   2
1  NaN NaN NaN NaN  15  14  13  12  11  10
2    0   1   2   3   4   5   6   7   8   9
3    4   3   2   1   0   0   0   0   0   0
This is a little tricky. First we figure out which cells we need to fill: forward-fill each row and mark the positions where the forward-filled value is non-null but the original is null (there might be a faster way using last_valid_index tests, but this is the first thing that occurred to me):
>>> mask = df.ffill(axis=1).notnull() & df.isnull()
>>> mask
       a      b      c      d      e      f      g      h      i      j
0  False  False  False  False  False  False  False  False  False  False
1  False  False  False  False  False   True   True   True   True   True
2  False  False  False  False  False  False  False  False  False  False
3  False   True   True   True   True   True   True   True   True   True
Forward-filling also gives us the value we want to count down from, propagated across each row:
>>> df.ffill(axis=1)
     a   b   c   d   e   f   g   h   i   j
0  NaN NaN NaN NaN NaN NaN NaN NaN NaN   2
1  NaN NaN NaN NaN  15  15  15  15  15  15
2    0   1   2   3   4   5   6   7   8   9
3    4   4   4   4   4   4   4   4   4   4
and a cumulative sum of the mask counts how many steps each cell sits past the last valid value:
>>> mask.cumsum(axis=1)
   a  b  c  d  e  f  g  h  i  j
0  0  0  0  0  0  0  0  0  0  0
1  0  0  0  0  0  1  2  3  4  5
2  0  0  0  0  0  0  0  0  0  0
3  0  1  2  3  4  5  6  7  8  9
Subtracting the counter from the filled values produces the countdown, and clipping at 0 stops it from going negative:
>>> (df.ffill(axis=1) - mask.cumsum(axis=1)).clip(lower=0)
     a   b   c   d   e   f   g   h   i   j
0  NaN NaN NaN NaN NaN NaN NaN NaN NaN   2
1  NaN NaN NaN NaN  15  14  13  12  11  10
2    0   1   2   3   4   5   6   7   8   9
3    4   3   2   1   0   0   0   0   0   0
and finally we use the original values where mask is False, and the new values where mask is True:
>>> df.where(~mask, (df.ffill(axis=1) - mask.cumsum(axis=1)).clip(lower=0))
     a   b   c   d   e   f   g   h   i   j
0  NaN NaN NaN NaN NaN NaN NaN NaN NaN   2
1  NaN NaN NaN NaN  15  14  13  12  11  10
2    0   1   2   3   4   5   6   7   8   9
3    4   3   2   1   0   0   0   0   0   0
(Note: this assumes the rows we need to fill look like the ones in your example, i.e. a single run of NaNs after the last valid value. If they're messier we'd have to do a little more work, but the same techniques will apply.)
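If 50K-plus columns is still too slow in pandas, the same fill can be done in raw NumPy with a vectorised forward-fill of column indices. This is only a sketch under the same assumption (a single run of NaNs after the last valid value), not code from the thread:

```python
import numpy as np
import pandas as pd

# the example frame from the question, rows shaped like the original
df = pd.DataFrame([
    [np.nan] * 9 + [2],
    [np.nan] * 4 + [15] + [np.nan] * 5,
    list(range(10)),
    [4] + [np.nan] * 9,
], columns=list('abcdefghij'), dtype=float)

vals = df.to_numpy()
isnan = np.isnan(vals)

# index of the most recent non-NaN column seen so far (a NumPy forward-fill)
idx = np.where(~isnan, np.arange(vals.shape[1]), 0)
idx = np.maximum.accumulate(idx, axis=1)
filled = vals[np.arange(vals.shape[0])[:, None], idx]

# how many columns past the last valid value each cell sits
dist = np.arange(vals.shape[1]) - idx

# count down from the last valid value, never below zero,
# leaving leading NaNs (no valid value before them) untouched
out = np.where(isnan & ~np.isnan(filled),
               np.clip(filled - dist, 0, None),
               vals)
result = pd.DataFrame(out, index=df.index, columns=df.columns)
```

Everything here is a single pass of vectorised NumPy ops, so it should scale much better across very wide frames than row-wise iteration.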

Related

count sets of consecutive true values in a column

Let's say that I have a dataframe as follow:
df = pd.DataFrame({'A':[1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]})
Then, I convert it into a boolean form:
df.eq(1)
Out[213]:
A
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 True
8 True
9 False
10 True
11 True
12 True
13 True
14 True
15 False
16 False
17 False
18 False
19 False
20 True
21 True
What I want is to count consecutive sets of True values in the column. In this example, the output would be:
df
Out[215]:
A count
0 1 5.0
1 1 2.0
2 1 5.0
3 1 2.0
4 1 NaN
5 0 NaN
6 0 NaN
7 1 NaN
8 1 NaN
9 0 NaN
10 1 NaN
11 1 NaN
12 1 NaN
13 1 NaN
14 1 NaN
15 0 NaN
16 0 NaN
17 0 NaN
18 0 NaN
19 0 NaN
20 1 NaN
21 1 NaN
I have been trying tools such as groupby and cumsum, but honestly I cannot figure out how to solve it. Thanks in advance.
You can use df['A'].diff().ne(0).cumsum() to generate a grouper that will group each consecutive group of zeros/ones:
# A side-by-side comparison:
>>> pd.concat([df['A'], df['A'].diff().ne(0).cumsum()], axis=1)
A A
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 0 2
6 0 2
7 1 3
8 1 3
9 0 4
10 1 5
11 1 5
12 1 5
13 1 5
14 1 5
15 0 6
16 0 6
17 0 6
18 0 6
19 0 6
20 1 7
21 1 7
Thus, group by that grouper, compute each group's sum, replace zeros with NaN, drop them, and reset the index:
df['count'] = df.groupby(df['A'].diff().ne(0).cumsum())['A'].sum().replace(0, np.nan).dropna().reset_index(drop=True)
Output:
>>> df
A count
0 1 5.0
1 1 2.0
2 1 5.0
3 1 2.0
4 1 NaN
5 0 NaN
6 0 NaN
7 1 NaN
8 1 NaN
9 0 NaN
10 1 NaN
11 1 NaN
12 1 NaN
13 1 NaN
14 1 NaN
15 0 NaN
16 0 NaN
17 0 NaN
18 0 NaN
19 0 NaN
20 1 NaN
21 1 NaN
I propose an alternative way that makes use of string splitting.
Let's transform the Series df.A into a single string and then split it where the zeros are.
df = pd.DataFrame({'A':[1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]})
ll = ''.join(df.A.astype('str').tolist()).split('0')
The list ll looks like
print(ll)
['11111', '', '11', '11111', '', '', '', '', '11']
Now we count the length of every piece, keeping only the non-empty ones:
[len(item) for item in ll if len(item) > 0]
which gives [5, 2, 5, 2].
This is doable if the Series is not too long.
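For comparison, here is the groupby/cumsum route condensed into a runnable snippet that returns the run lengths as a plain list (variable names are mine):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 0, 0, 1, 1, 0,
                         1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]})

# label each consecutive run of equal values, then measure each run
grouper = df['A'].diff().ne(0).cumsum()
runs = df.groupby(grouper)['A'].agg(['first', 'size'])

# keep only the runs of ones
counts = runs.loc[runs['first'] == 1, 'size'].tolist()
```

Unlike the string approach, this stays vectorised and does not depend on the Series length.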

Turn columns' values to headers of columns with values 1 and 0 (accordingly) [python]

I got a column of the form :
0 q4
1 4
2 3
3 1
4 2
5 1
6 5
7 1
8 3
The column represents the answers of users to a question of 5 choices (1-5).
I want to turn this into a matrix of 5 columns where the indexes are the 5 possible answers and the values are 1 or 0 according to the user's given answer.
Visually, I want a matrix of the form:
0 q4_1 q4_2 q4_3 q4_4 q4_5
1 Nan Nan Nan 1 Nan
2 Nan Nan 1 Nan Nan
3 1 Nan Nan Nan Nan
4 Nan 1 Nan Nan Nan
5 1 Nan Nan Nan Nan
for i in range(1, 6):
    df['q4_' + str(i)] = np.where(df.q4 == i, 1, 0)
del df['q4']
Output:
>>> print(df)
q4_1 q4_2 q4_3 q4_4 q4_5
0 0 0 0 1 0
1 0 0 1 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 0 0 0 0
5 0 0 0 0 1
6 1 0 0 0 0
7 0 0 1 0 0
I think pivot is the way to go. You'd have to prepopulate the df with the info you want in the new table.
Also, I don't understand why you want only 5 rows, but I added that as well via iloc. If you remove it, you will get this data for your entire index (up to 8).
import pandas as pd
df = pd.DataFrame({'q4': [4, 3, 1, 2, 1, 5, 1, 3]})
df.index += 1
df['values'] = 1
df = df.reset_index().pivot(index='q4', columns='index', values='values').T.iloc[:5]
prints
q4 1 2 3 4 5
index
1 NaN NaN NaN 1.0 NaN
2 NaN NaN 1.0 NaN NaN
3 1.0 NaN NaN NaN NaN
4 NaN 1.0 NaN NaN NaN
5 1.0 NaN NaN NaN NaN
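Worth noting as an alternative: pandas ships this transformation as a one-liner. pd.get_dummies builds an indicator matrix directly (with 0/1 rather than NaN); a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'q4': [4, 3, 1, 2, 1, 5, 1, 3]})

# one indicator column per answer, named q4_1 .. q4_5;
# astype(int) forces 0/1 on pandas versions that return booleans
dummies = pd.get_dummies(df['q4'], prefix='q4').astype(int)
```

Each row has exactly one 1, marking the answer that was given.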

Subtract same column multi index for each level

This question seems very basic, but I couldn't find any answer.
I have a multi index dataframe which look like this
A B
a b c a b c
x y z x y z x y z x y z x y z x y z
1 : : :
2 : :
3 :
4
5
6
7
And all I would like to do is create another dataframe which shows x-z and y-z.
I tried to subtract the slices, but it gives me NaN (despite them having the same dimensions):
test.loc[:,idx[:,:,'x']].sub(test.loc[:,idx[:,:,'z']])
Do you know a trick to perform this task?
Pandas operations (such as subtraction) always aligns NDFrames based on the row and column indexes. Since df.loc[:,idx[:,:,'x']] and df.loc[:,idx[:,:,'z']] have different column indexes, subtraction yields NaNs:
x = df.loc[:,idx[:,:,'x']]
z = df.loc[:,idx[:,:,'z']]
x.sub(z)
# A B
# a b c a b c
# x z x z x z x z x z x z
# 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
To make Pandas perform an operation element-wise (ignoring the index), remove the index by making z a NumPy array:
x = df.loc[:,idx[:,:,'x']]
z = df.loc[:,idx[:,:,'z']].values
x.sub(z)
A B
a b c a b c
x x x x x x
0 1 -1 -1 -1 -5 -1
1 0 1 0 4 -2 1
2 -5 -9 4 7 -5 -2
3 -5 -2 -6 -6 0 0
4 -3 -4 4 -5 -6 8
5 1 4 -7 7 -4 8
6 4 0 -2 1 3 -6
For example,
import pandas as pd
import numpy as np
np.random.seed(2016)
columns = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b', 'c'], ['x', 'y', 'z']])
df = pd.DataFrame(np.random.randint(10, size=(7, 18)), columns=columns)
idx = pd.IndexSlice
x = df.loc[:,idx[:,:,'x']]
y = df.loc[:,idx[:,:,'y']]
z = df.loc[:,idx[:,:,'z']].values
result = pd.concat([x-z, y-z], axis=1)
result = result.rename(columns={'x':'x-z', 'y':'y-z'})
yields
A B A B
a b c a b c a b c a b c
x-z x-z x-z x-z x-z x-z y-z y-z y-z y-z y-z y-z
0 1 -1 -1 -1 -5 -1 5 4 -2 3 -8 0
1 0 1 0 4 -2 1 1 5 1 4 6 0
2 -5 -9 4 7 -5 -2 -9 -5 4 8 4 4
3 -5 -2 -6 -6 0 0 -5 -7 0 -4 3 -2
4 -3 -4 4 -5 -6 8 -2 -1 2 -8 1 1
5 1 4 -7 7 -4 8 0 2 -8 3 2 5
6 4 0 -2 1 3 -6 2 5 5 6 -2 -2
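Another option that stays in pandas: DataFrame.xs drops the level it selects on, so the two slices end up with identical column indexes and subtract cleanly. A sketch against the same seeded example data:

```python
import numpy as np
import pandas as pd

np.random.seed(2016)
columns = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b', 'c'], ['x', 'y', 'z']])
df = pd.DataFrame(np.random.randint(10, size=(7, 18)), columns=columns)

# xs removes the selected level, so both operands share the same
# two-level (A/B, a/b/c) column index and alignment just works
x = df.xs('x', axis=1, level=2)
z = df.xs('z', axis=1, level=2)
diff = x - z  # no NaNs: column indexes now match
```

This avoids stripping the index with .values, at the cost of losing the x/y/z labels in the result.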

Pandas: Find empty/missing values and add them to DataFrame

I have a dataframe where column 1 should have all the values from 1 to 169. If a value doesn't exist, I'd like to add a new row to my dataframe which contains said value (and some zeros).
I can't get the following code to work, even though there are no errors:
for i in range(1, 170):
    if i in df.col1 is False:
        df.loc[len(df) + 1] = [i, 0, 0]
    else:
        continue
Any advice?
It would be better to do something like:
In [37]:
# create our test df, we have vales 1 to 9 in steps of 2
df = pd.DataFrame({'a':np.arange(1,10,2)})
df['b'] = np.NaN
df['c'] = np.NaN
df
Out[37]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
In [38]:
# now set the index to a, this allows us to reindex the values with optional fill value, then reset the index
df = df.set_index('a').reindex(index = np.arange(1,10), fill_value=0).reset_index()
df
Out[38]:
a b c
0 1 NaN NaN
1 2 0 0
2 3 NaN NaN
3 4 0 0
4 5 NaN NaN
5 6 0 0
6 7 NaN NaN
7 8 0 0
8 9 NaN NaN
So just to explain the above:
In [40]:
# set the index to 'a', this allows us to reindex and fill missing values
df = df.set_index('a')
df
Out[40]:
b c
a
1 NaN NaN
3 NaN NaN
5 NaN NaN
7 NaN NaN
9 NaN NaN
In [41]:
# now reindex and pass fill_value for the extra rows we want
df = df.reindex(index = np.arange(1,10), fill_value=0)
df
Out[41]:
b c
a
1 NaN NaN
2 0 0
3 NaN NaN
4 0 0
5 NaN NaN
6 0 0
7 NaN NaN
8 0 0
9 NaN NaN
In [42]:
# now reset the index
df = df.reset_index()
df
Out[42]:
a b c
0 1 NaN NaN
1 2 0 0
2 3 NaN NaN
3 4 0 0
4 5 NaN NaN
5 6 0 0
6 7 NaN NaN
7 8 0 0
8 9 NaN NaN
If you modified your loop to the following then it would work:
In [63]:
for i in range(1, 10):
    if any(df.a.isin([i])) == False:
        df.loc[len(df) + 1] = [i, 0, 0]
    else:
        continue
df
Out[63]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
6 2 0 0
7 4 0 0
8 6 0 0
9 8 0 0
EDIT
If you wanted the missing rows to appear at the end of the df, you could create a temporary df with the full range of values (and the other columns set to zero), filter it down to the values missing from the original df, and concatenate the two:
In [70]:
df_missing = pd.DataFrame({'a':np.arange(10),'b':0,'c':0})
df_missing
Out[70]:
a b c
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
6 6 0 0
7 7 0 0
8 8 0 0
9 9 0 0
In [73]:
df = pd.concat([df,df_missing[~df_missing.a.isin(df.a)]], ignore_index=True)
df
Out[73]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
5 0 0 0
6 2 0 0
7 4 0 0
8 6 0 0
9 8 0 0
The expression if i in df.col1 is False never does what you want: Python chains comparisons, so it is evaluated as (i in df.col1) and (df.col1 is False), which is always False. Also, "in" on a Series checks the index, not the values. Finally, in modern versions of pandas you should gather the new rows and use pandas.concat instead of assigning to df.loc[].
I would recommend gathering all missing values in a list then concatenating them to the dataframe at the end. For instance
>>> df = pd.DataFrame({'col1': list(range(5)) + [i + 6 for i in range(5)], 'col2': range(10)})
>>> print(df)
col1 col2
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 6 5
6 7 6
7 8 7
8 9 8
9 10 9
>>> to_add = []
>>> for i in range(11):
... if i not in df.col1.values:
... to_add.append([i, 0])
... else:
... continue
...
>>> pd.concat([df, pd.DataFrame(to_add, columns=['col1', 'col2'])])
col1 col2
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 6 5
6 7 6
7 8 7
8 9 8
9 10 9
0 5 0
I assume you don't care about the index values of the rows you add.
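Putting those fixes together (membership tested against .values, rows collected once, then a single concat), a minimal runnable sketch; the small range here stands in for the question's 1 to 169:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 5], 'col2': [10, 30, 50]})

# the question's real range is 1..169; a small range keeps the sketch short
full_range = range(1, 7)
missing = [i for i in full_range if i not in df['col1'].values]

# one concat at the end instead of growing the frame row by row
df = pd.concat([df, pd.DataFrame({'col1': missing, 'col2': 0})],
               ignore_index=True)
```

Appending all missing rows in one concat is much cheaper than repeated df.loc[] assignments, which copy the frame each time.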

Select rows around a value in Pandas

I have a DataFrame with a continuous measure, marked by occasional events:
TimeIndex Event Value
0 NaN 4.099969
1 NaN 3.833528
2 NaN -1.335025
3 A 4.420085
4 NaN 4.508899
5 NaN 4.557383
6 B -3.377152
7 NaN 4.508899
8 NaN -1.919803
9 A 2.18520
10 NaN 3.821221
11 C 0.922389
12 NaN 2.165784
I want the average for each event, but also the average two time points before and two time points after the event occurs. Something like this might work:
TimeIndex Event Value Around_A Around_B Around_C
0 NaN 4.099969 NaN NaN NaN
1 NaN 3.833528 -2 NaN NaN
2 NaN -1.335025 -1 NaN NaN
3 A 4.420085 0 NaN NaN
4 NaN 4.508899 1 -2 NaN
5 NaN 4.557383 2 -1 NaN
6 B -3.377152 NaN 0 NaN
7 NaN 4.508899 -2 1 NaN
8 NaN -1.919803 -1 2 NaN
9 A 2.18520 0 NaN 2
10 NaN 3.821221 1 NaN -1
11 C 0.922389 2 NaN 0
12 NaN 2.165784 NaN NaN 1
However: 1) I'm unsure how to get the new column values without looping, and 2) appending a new column becomes intractable for the many different events I have.
Is there an easier way to select timepoints/rows around a value in pandas, and then average by time point/row?
My desired output is the average Value for Event x AroundTime (dummy means shown here)
Event AroundTime Value.mean
A -2 3.35
A -1 0.19
A 0 2.33
A 1 -1.01
A 2 3.78
B -2 4.53
B -1 4.22
B 0 5.14
B 1 1.88
B 2 0.70
C -2 -1.01
C -1 -2.33
C 0 1.69
C 1 1.19
C 2 2.21
I would suggest:
In [26]:
print(df)
TimeIndex Event Value
0 0 NaN 4.099969
1 1 NaN 3.833528
2 2 NaN -1.335025
3 3 A 4.420085
4 4 NaN 4.508899
5 5 NaN 4.557383
6 6 B -3.377152
7 7 NaN 4.508899
8 8 NaN -1.919803
9 9 A 2.185200
10 10 NaN 3.821221
11 11 C 0.922389
12 12 NaN 2.165784
[13 rows x 3 columns]
In [27]:
df['Around_A']=np.nan
In [28]:
for i in range(-2, 3):
    # shift marks the rows sitting i steps away from each event;
    # .loc avoids chained assignment
    df.loc[(df.Event == 'A').shift(i, fill_value=False), 'Around_A'] = i
In [29]:
print(df)
TimeIndex Event Value Around_A
0 0 NaN 4.099969 NaN
1 1 NaN 3.833528 -2
2 2 NaN -1.335025 -1
3 3 A 4.420085 0
4 4 NaN 4.508899 1
5 5 NaN 4.557383 2
6 6 B -3.377152 NaN
7 7 NaN 4.508899 -2
8 8 NaN -1.919803 -1
9 9 A 2.185200 0
10 10 NaN 3.821221 1
11 11 C 0.922389 2
12 12 NaN 2.165784 NaN
[13 rows x 4 columns]
I don't quite get your last question; mind providing the intended result?
Edit
Now that it is clear, my approach:
In [22]:
df = pd.read_clipboard()
df['Around_A'] = np.nan
df['Around_B'] = np.nan
df['Around_C'] = np.nan
for i in range(-2, 3):
    df.loc[(df.Event == 'A').shift(i, fill_value=False), 'Around_A'] = i
    df.loc[(df.Event == 'B').shift(i, fill_value=False), 'Around_B'] = i
    df.loc[(df.Event == 'C').shift(i, fill_value=False), 'Around_C'] = i
Data = []
for s in ['A', 'B', 'C']:
    _df = pd.DataFrame(df.groupby('Around_%s' % s).Value.mean())
    _df['Event'] = s
    _df.index.name = 'AroundTime'
    Data.append(_df.reset_index())
print(pd.concat(Data)[['Event', 'AroundTime', 'Value']])
Event AroundTime Value
0 A -2 4.171213
1 A -1 -1.627414
2 A 0 3.302643
3 A 1 4.165060
4 A 2 2.739886
0 B -2 4.508899
1 B -1 4.557383
2 B 0 -3.377152
3 B 1 4.508899
4 B 2 -1.919803
0 C -2 2.185200
1 C -1 3.821221
2 C 0 0.922389
3 C 1 2.165780
[14 rows x 3 columns]
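For reference, the same idea can be written without one column per event, by building a long table of (Event, AroundTime, Value) rows and averaging with a single groupby. This is my own restatement with the thread's data hard-coded, not code from the answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Event': [np.nan, np.nan, np.nan, 'A', np.nan, np.nan, 'B',
              np.nan, np.nan, 'A', np.nan, 'C', np.nan],
    'Value': [4.099969, 3.833528, -1.335025, 4.420085, 4.508899, 4.557383,
              -3.377152, 4.508899, -1.919803, 2.185200, 3.821221, 0.922389,
              2.165784],
})

frames = []
for event in ['A', 'B', 'C']:
    hits = df['Event'] == event
    for offset in range(-2, 3):
        # rows sitting `offset` steps after each occurrence of the event
        sel = hits.shift(offset, fill_value=False)
        frames.append(pd.DataFrame({'Event': event,
                                    'AroundTime': offset,
                                    'Value': df.loc[sel, 'Value']}))

result = (pd.concat(frames)
            .groupby(['Event', 'AroundTime'], as_index=False)['Value']
            .mean())
```

Offsets that fall off the end of the frame (e.g. two rows past the final C) simply contribute no rows, so no special-casing is needed, and adding a new event type is just one more entry in the list.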
