This question seems very basic, but I couldn't find an answer.
I have a multi-index dataframe which looks like this:
            A                       B
      a     b     c           a     b     c
     x y z x y z x y z       x y z x y z x y z
1     :     :     :           :     :     :
2     :     :     :           :     :     :
3     :     :     :           :     :     :
4     :     :     :           :     :     :
5     :     :     :           :     :     :
6     :     :     :           :     :     :
7     :     :     :           :     :     :
All I would like to do is create another dataframe which shows x-z and y-z.
I tried to subtract the slices, but it gives me NaNs (despite the two slices having the same dimensions):
test.loc[:,idx[:,:,'x']].sub(test.loc[:,idx[:,:,'z']])
Do you know a trick to perform this task?
Pandas operations (such as subtraction) always align NDFrames based on the row and column indexes. Since df.loc[:,idx[:,:,'x']] and df.loc[:,idx[:,:,'z']] have different column indexes, subtraction yields NaNs:
x = df.loc[:,idx[:,:,'x']]
z = df.loc[:,idx[:,:,'z']]
x.sub(z)
# A B
# a b c a b c
# x z x z x z x z x z x z
# 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
To make Pandas perform an operation element-wise (ignoring the index), remove the index by making z a NumPy array:
x = df.loc[:,idx[:,:,'x']]
z = df.loc[:,idx[:,:,'z']].values
x.sub(z)
A B
a b c a b c
x x x x x x
0 1 -1 -1 -1 -5 -1
1 0 1 0 4 -2 1
2 -5 -9 4 7 -5 -2
3 -5 -2 -6 -6 0 0
4 -3 -4 4 -5 -6 8
5 1 4 -7 7 -4 8
6 4 0 -2 1 3 -6
For example,
import pandas as pd
import numpy as np
np.random.seed(2016)
columns = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b', 'c'], ['x', 'y', 'z']])
df = pd.DataFrame(np.random.randint(10, size=(7, 18)), columns=columns)
idx = pd.IndexSlice
x = df.loc[:,idx[:,:,'x']]
y = df.loc[:,idx[:,:,'y']]
z = df.loc[:,idx[:,:,'z']].values
result = pd.concat([x-z, y-z], axis=1)
result = result.rename(columns={'x':'x-z', 'y':'y-z'})
yields
A B A B
a b c a b c a b c a b c
x-z x-z x-z x-z x-z x-z y-z y-z y-z y-z y-z y-z
0 1 -1 -1 -1 -5 -1 5 4 -2 3 -8 0
1 0 1 0 4 -2 1 1 5 1 4 6 0
2 -5 -9 4 7 -5 -2 -9 -5 4 8 4 4
3 -5 -2 -6 -6 0 0 -5 -7 0 -4 3 -2
4 -3 -4 4 -5 -6 8 -2 -1 2 -8 1 1
5 1 4 -7 7 -4 8 0 2 -8 3 2 5
6 4 0 -2 1 3 -6 2 5 5 6 -2 -2
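An alternative that stays label-based: drop the innermost column level from both slices so their column indexes match, and let the subtraction align on the remaining levels. A minimal sketch, assuming pandas >= 0.24 for DataFrame.droplevel:
x = df.loc[:, idx[:, :, 'x']].droplevel(2, axis=1)
z = df.loc[:, idx[:, :, 'z']].droplevel(2, axis=1)
result = x - z  # columns now share the (A/B, a/b/c) labels, so they align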
I have a dataframe with a lot of NaNs.
The y columns hold the count of events, the val columns hold the value of each event in that year, and the total columns are the product of the two.
Many columns have zeros, and many have NaNs in 4 columns because the values are not available (up to 80% of the data is missing).
y17 y18 y19 y20 val17 val18 val19 val20 total17 total18 total19 total20
1 2 1 2 2 2 2 2 1 4 2 4
2 2 2 2 2 2 2 2 4 4 4 4
3 3 3 3 NaN NaN NaN NaN NaN NaN NaN NaN
0 0 0 0 1 2 3 4 0 0 0 0
0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN
I want to keep rows where all values are numbers (zeros included), AND I want to keep rows where the first four columns (a multi-column condition) are all zeros.
Expected output
y17 y18 y19 y20 val17 val18 val19 val20 total17 total18 total19 total20
1 2 1 2 2 2 2 2 1 4 2 4
2 2 2 2 2 2 2 2 4 4 4 4
0 0 0 0 1 2 3 4 0 0 0 0
0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN
Thanks!
Just build each condition with all and combine them:
out = df[df.iloc[:,:4].eq(0).all(1) | df.notna().all(1)]
Out[386]:
y17 y18 y19 y20 val17 ... val20 total17 total18 total19 total20
0 1 2 1 2 2.0 ... 2.0 1.0 4.0 2.0 4.0
1 2 2 2 2 2.0 ... 2.0 4.0 4.0 4.0 4.0
3 0 0 0 0 1.0 ... 4.0 0.0 0.0 0.0 0.0
4 0 0 0 0 NaN ... NaN NaN NaN NaN NaN
[4 rows x 12 columns]
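For reference, here is a sketch that rebuilds the sample frame from the question and applies the same filter end to end (the values are transcribed from the question's table, with va18 read as val18):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'y17': [1, 2, 3, 0, 0], 'y18': [2, 2, 3, 0, 0],
    'y19': [1, 2, 3, 0, 0], 'y20': [2, 2, 3, 0, 0],
    'val17': [2, 2, np.nan, 1, np.nan], 'val18': [2, 2, np.nan, 2, np.nan],
    'val19': [2, 2, np.nan, 3, np.nan], 'val20': [2, 2, np.nan, 4, np.nan],
    'total17': [1, 4, np.nan, 0, np.nan], 'total18': [4, 4, np.nan, 0, np.nan],
    'total19': [2, 4, np.nan, 0, np.nan], 'total20': [4, 4, np.nan, 0, np.nan]})
# keep rows whose first four columns are all zero, OR rows with no NaN at all
out = df[df.iloc[:, :4].eq(0).all(axis=1) | df.notna().all(axis=1)]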
I have a column of the form:
   q4
1   4
2   3
3   1
4   2
5   1
6   5
7   1
8   3
The column holds users' answers to a question with 5 choices (1-5).
I want to turn this into a matrix of 5 columns, where the columns correspond to the 5 possible answers and the values are 1 or 0 according to the user's given answer.
Visually, I want a matrix of the form:
   q4_1 q4_2 q4_3 q4_4 q4_5
1   NaN  NaN  NaN    1  NaN
2   NaN  NaN    1  NaN  NaN
3     1  NaN  NaN  NaN  NaN
4   NaN    1  NaN  NaN  NaN
5     1  NaN  NaN  NaN  NaN
for i in range(1,6):
    df['q4_'+str(i)] = np.where(df.q4==i, 1, 0)
del df['q4']
Output:
>>> print(df)
q4_1 q4_2 q4_3 q4_4 q4_5
0 0 0 0 1 0
1 0 0 1 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 0 0 0 0
5 0 0 0 0 1
6 1 0 0 0 0
7 0 0 1 0 0
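For what it's worth, pd.get_dummies builds the same 0/1 matrix in one call, starting from the original q4 column; the astype(int) is only needed on newer pandas, where the dummies come back as booleans:
out = pd.get_dummies(df['q4'], prefix='q4').astype(int)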
I think pivot is the way to go. You'd have to prepopulate the df with the info you want in the new table.
Also, I don't understand why you want only 5 rows, but I added that as well via iloc. If you remove it, you will have this data for your entire index (up to 8).
import pandas as pd
df = pd.DataFrame({'q4': [4, 3, 1, 2, 1, 5, 1, 3]})
df.index += 1
df['values'] = 1
df = df.reset_index().pivot(index='q4', columns='index', values='values').T.iloc[:5]
prints
q4 1 2 3 4 5
index
1 NaN NaN NaN 1.0 NaN
2 NaN NaN 1.0 NaN NaN
3 1.0 NaN NaN NaN NaN
4 NaN 1.0 NaN NaN NaN
5 1.0 NaN NaN NaN NaN
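As a side note, the transpose can be avoided by swapping the pivot arguments, and the columns can then be renamed to the q4_N headers shown in the question (a sketch of the same idea):
df = pd.DataFrame({'q4': [4, 3, 1, 2, 1, 5, 1, 3]})
df.index += 1
df['values'] = 1
out = df.reset_index().pivot(index='index', columns='q4', values='values').iloc[:5]
out.columns = ['q4_' + str(c) for c in out.columns]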
I have a pandas dataframe that I would like to process row by row from the last non-null value, subtracting 1 from that value for each following column.
import numpy as np
import pandas as pd

z = pd.DataFrame({'gfh': [np.nan] * 9 + [2],
                  'gh': [np.nan] * 4 + [15] + [np.nan] * 5,
                  'l': range(10),
                  'r': [4] + [np.nan] * 9})
df = z.transpose().copy()
df.reset_index(inplace=True)
df.drop(['index'],axis=1, inplace=True)
df.columns = ['a','b','c','d','e','f','g','h','i','j']
In [8]: df
Out[8]:
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 NaN NaN NaN NaN NaN
2 0 1 2 3 4 5 6 7 8 9
3 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
I would like to reduce the above dataframe by 1 per step, for every row, out to the last column. For example, in the second row the value is 15, so I want 14, 13, 12, 11, 10 to follow. Nothing follows the 2 in the first row, since there are no columns left. And the 4 in the last row would be followed by 3, 2, 1, 0, 0, 0, and so on.
I reached my desired output by doing the following.
for index, row in df.iterrows():
    last = df.columns.get_loc(df.iloc[index].last_valid_index())
    df.iloc[index, last + 1:] = [
        (df.iloc[index, last:][0] - (x + 1)).astype(int)
        for x in range((df.shape[1] - 1) - last)]
df[df < 0] = 0
This gives me the desired output
In [13]: df
Out[13]:
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 14 13 12 11 10
2 0 1 2 3 4 5 6 7 8 9
3 4 3 2 1 0 0 0 0 0 0
BUT in my real-world data I have 50K-plus columns, and the above code takes WAY too long.
Can anyone please suggest how I can make this run faster?
I believe the solution would be to somehow tell the code that once the subtraction reaches zero it should move on to the next row, but I don't know how to do that, since even if I use max(0, <subtraction formula>) the code still wastes time subtracting.
Thank you.
I don't know how fast it will be, but you could experiment with ffill, fillna, and cumsum. For example:
>>> df
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 NaN NaN NaN NaN NaN
2 0 1 2 3 4 5 6 7 8 9
3 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
>>> mask = df.ffill(axis=1).notnull() & df.isnull()
>>> df.where(~mask, df.fillna(-1).cumsum(axis=1).clip(lower=0))
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 10 9 8 7 6
2 0 1 2 3 4 5 6 7 8 9
3 4 3 2 1 0 0 0 0 0 0
This is a little tricky. First we figure out which cells we need to fill, by forward-filling along each row and finding the cells that are null in the original but not after the fill (there might be a faster way using last_valid_index tests, but this is the first thing that occurred to me):
>>> mask = df.ffill(axis=1).notnull() & df.isnull()
>>> mask
a b c d e f g h i j
0 False False False False False False False False False False
1 False False False False False True True True True True
2 False False False False False False False False False False
3 False True True True True True True True True True
If we fill the empty spots with -1, we can get the values we want by cumulative summing to the right:
>>> (df.fillna(-1).cumsum(axis=1))
a b c d e f g h i j
0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -7
1 -1 -2 -3 -4 11 10 9 8 7 6
2 0 1 3 6 10 15 21 28 36 45
3 4 3 2 1 0 -1 -2 -3 -4 -5
Many of those values we don't want, but that's okay, because we're only going to insert the ones we need. We should clip to 0, though:
>>> df.fillna(-1).cumsum(axis=1).clip(lower=0)
a b c d e f g h i j
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 11 10 9 8 7 6
2 0 1 3 6 10 15 21 28 36 45
3 4 3 2 1 0 0 0 0 0 0
and finally we can use the original ones where mask is False, and the new values where mask is True:
>>> df.where(~mask, df.fillna(-1).cumsum(axis=1).clip(lower=0))
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 10 9 8 7 6
2 0 1 2 3 4 5 6 7 8 9
3 4 3 2 1 0 0 0 0 0 0
(Note: this assumes the rows we need to fill look like the ones in your example. If they're messier we'd have to do a little more work, but the same techniques will apply.)
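For completeness: the asker's desired output counts down from the filled value itself (15 14 13 12 11 10), while the -1/cumsum trick starts lower when a row has leading NaNs (15 10 9 8 7 6 above). A sketch of a variant that instead subtracts the number of steps since the last valid cell, assuming (as in the example) each row has at most one valid run followed only by NaNs:
>>> mask = df.ffill(axis=1).notnull() & df.isnull()
>>> filled = df.ffill(axis=1)           # carry the last valid value rightwards
>>> steps = mask.cumsum(axis=1)         # 1, 2, 3, ... along each trailing NaN run
>>> df.where(~mask, (filled - steps).clip(lower=0))
    a   b   c   d   e   f   g   h   i   j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN   2
1 NaN NaN NaN NaN  15  14  13  12  11  10
2   0   1   2   3   4   5   6   7   8   9
3   4   3   2   1   0   0   0   0   0   0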
I have a dataframe where col1 should contain all the values from 1 to 169. If a value doesn't exist, I'd like to add a new row to my dataframe which contains the said value (and some zeros).
I can't get the following code to work, even though there are no errors:
for i in range(1,170):
    if i in df.col1 is False:
        df.loc[len(df)+1] = [i,0,0]
    else:
        continue
Any advice?
It would be better to do something like:
In [37]:
# create our test df; we have values 1 to 9 in steps of 2
df = pd.DataFrame({'a':np.arange(1,10,2)})
df['b'] = np.nan
df['c'] = np.nan
df
Out[37]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
In [38]:
# now set the index to a, this allows us to reindex the values with optional fill value, then reset the index
df = df.set_index('a').reindex(index = np.arange(1,10), fill_value=0).reset_index()
df
Out[38]:
a b c
0 1 NaN NaN
1 2 0 0
2 3 NaN NaN
3 4 0 0
4 5 NaN NaN
5 6 0 0
6 7 NaN NaN
7 8 0 0
8 9 NaN NaN
So just to explain the above:
In [40]:
# set the index to 'a', this allows us to reindex and fill missing values
df = df.set_index('a')
df
Out[40]:
b c
a
1 NaN NaN
3 NaN NaN
5 NaN NaN
7 NaN NaN
9 NaN NaN
In [41]:
# now reindex and pass fill_value for the extra rows we want
df = df.reindex(index = np.arange(1,10), fill_value=0)
df
Out[41]:
b c
a
1 NaN NaN
2 0 0
3 NaN NaN
4 0 0
5 NaN NaN
6 0 0
7 NaN NaN
8 0 0
9 NaN NaN
In [42]:
# now reset the index
df = df.reset_index()
df
Out[42]:
a b c
0 1 NaN NaN
1 2 0 0
2 3 NaN NaN
3 4 0 0
4 5 NaN NaN
5 6 0 0
6 7 NaN NaN
7 8 0 0
8 9 NaN NaN
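Applied to the original problem (the full range 1 to 169, with col1 as the key column, as in the question's code), the whole thing collapses to one chain:
df = df.set_index('col1').reindex(index=np.arange(1, 170), fill_value=0).reset_index()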
If you modified your loop to the following, then it would work:
In [63]:
for i in range(1,10):
    if not df.a.isin([i]).any():
        df.loc[len(df)+1] = [i,0,0]
    else:
        continue
df
Out[63]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
6 2 0 0
7 4 0 0
8 6 0 0
9 8 0 0
EDIT
If you wanted the missing rows to appear at the end of the df, you could create a temporary df with the full range of values (and the other columns set to zero), then filter that df down to the values missing from the original and concatenate the two:
In [70]:
df_missing = pd.DataFrame({'a':np.arange(10),'b':0,'c':0})
df_missing
Out[70]:
a b c
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
6 6 0 0
7 7 0 0
8 8 0 0
9 9 0 0
In [73]:
df = pd.concat([df,df_missing[~df_missing.a.isin(df.a)]], ignore_index=True)
df
Out[73]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
5 0 0 0
6 2 0 0
7 4 0 0
8 6 0 0
9 8 0 0
The expression i in df.col1 is False never does what you expect: in checks the Series index rather than its values, and because Python chains comparison operators, it evaluates as (i in df.col1) and (df.col1 is False), which is always False. Also, in modern versions of pandas you should build the new rows separately and use pandas.concat instead of assigning to df.loc[] row by row.
I would recommend gathering all missing values in a list and then concatenating them to the dataframe at the end. For instance:
>>> df = pd.DataFrame({'col1': list(range(5)) + list(range(6, 11)), 'col2': range(10)})
>>> print(df)
col1 col2
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 6 5
6 7 6
7 8 7
8 9 8
9 10 9
>>> to_add = []
>>> for i in range(11):
...     if i not in df.col1.values:
...         to_add.append([i, 0])
...     else:
...         continue
...
>>> pd.concat([df, pd.DataFrame(to_add, columns=['col1', 'col2'])])
col1 col2
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 6 5
6 7 6
7 8 7
8 9 8
9 10 9
0 5 0
I assume you don't care about the index values of the rows you add.
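If you do care, a sort plus index reset after the concat tidies things up (sketch):
>>> result = pd.concat([df, pd.DataFrame(to_add, columns=['col1', 'col2'])])
>>> result.sort_values('col1').reset_index(drop=True)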
I have a DataFrame with a continuous measure, marked by occasional events:
TimeIndex Event Value
0 NaN 4.099969
1 NaN 3.833528
2 NaN -1.335025
3 A 4.420085
4 NaN 4.508899
5 NaN 4.557383
6 B -3.377152
7 NaN 4.508899
8 NaN -1.919803
9 A 2.18520
10 NaN 3.821221
11 C 0.922389
12 NaN 2.165784
I want the average for each event, but also the average two time points before and two time points after the event occurs. Something like this might work:
TimeIndex Event Value Around_A Around_B Around_C
0 NaN 4.099969 NaN NaN NaN
1 NaN 3.833528 -2 NaN NaN
2 NaN -1.335025 -1 NaN NaN
3 A 4.420085 0 NaN NaN
4 NaN 4.508899 1 -2 NaN
5 NaN 4.557383 2 -1 NaN
6 B -3.377152 NaN 0 NaN
7 NaN 4.508899 -2 1 NaN
8 NaN -1.919803 -1 2 NaN
9 A 2.18520 0 NaN 2
10 NaN 3.821221 1 NaN -1
11 C 0.922389 2 NaN 0
12 NaN 2.165784 NaN NaN 1
However: 1) I'm unsure how to get the new column values without looping, and 2) appending a new column per event becomes intractable for the many different events I have.
Is there an easier way to select timepoints/rows around a value in pandas, and then average by time point/row?
My desired output is the average Value for each Event x AroundTime combination (dummy means shown here):
Event AroundTime Value.mean
A -2 3.35
A -1 0.19
A 0 2.33
A 1 -1.01
A 2 3.78
B -2 4.53
B -1 4.22
B 0 5.14
B 1 1.88
B 2 0.70
C -2 -1.01
C -1 -2.33
C 0 1.69
C 1 1.19
C 2 2.21
I would suggest:
In [26]:
print(df)
TimeIndex Event Value
0 0 NaN 4.099969
1 1 NaN 3.833528
2 2 NaN -1.335025
3 3 A 4.420085
4 4 NaN 4.508899
5 5 NaN 4.557383
6 6 B -3.377152
7 7 NaN 4.508899
8 8 NaN -1.919803
9 9 A 2.185200
10 10 NaN 3.821221
11 11 C 0.922389
12 12 NaN 2.165784
[13 rows x 3 columns]
In [27]:
df['Around_A']=np.nan
In [28]:
for i in range(-2,3):
    df.loc[(df.Event=='A').shift(i).fillna(False), 'Around_A'] = i
In [29]:
print(df)
TimeIndex Event Value Around_A
0 0 NaN 4.099969 NaN
1 1 NaN 3.833528 -2
2 2 NaN -1.335025 -1
3 3 A 4.420085 0
4 4 NaN 4.508899 1
5 5 NaN 4.557383 2
6 6 B -3.377152 NaN
7 7 NaN 4.508899 -2
8 8 NaN -1.919803 -1
9 9 A 2.185200 0
10 10 NaN 3.821221 1
11 11 C 0.922389 2
12 12 NaN 2.165784 NaN
[13 rows x 4 columns]
I don't quite get your last question; mind providing an intended result?
Edit
Now that it is clear, my approach:
In [22]:
df=pd.read_clipboard()
df['Around_A']=np.nan
df['Around_B']=np.nan
df['Around_C']=np.nan
for i in range(-2,3):
    df.loc[(df.Event=='A').shift(i).fillna(False), 'Around_A'] = i
    df.loc[(df.Event=='B').shift(i).fillna(False), 'Around_B'] = i
    df.loc[(df.Event=='C').shift(i).fillna(False), 'Around_C'] = i
Data = []
for s in ['A', 'B', 'C']:
    _df = pd.DataFrame(df.groupby('Around_%s' % s).Value.mean())
    _df['Event'] = s
    _df.index.name = 'AroundTime'
    Data.append(_df.reset_index())
print(pd.concat(Data)[['Event', 'AroundTime', 'Value']])
Event AroundTime Value
0 A -2 4.171213
1 A -1 -1.627414
2 A 0 3.302643
3 A 1 4.165060
4 A 2 2.739886
0 B -2 4.508899
1 B -1 4.557383
2 B 0 -3.377152
3 B 1 4.508899
4 B 2 -1.919803
0 C -2 2.185200
1 C -1 3.821221
2 C 0 0.922389
3 C 1 2.165780
[14 rows x 3 columns]
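If there are many event types, the per-event Around columns can be skipped entirely. A sketch that loops over whatever labels appear in Event and aggregates directly (shift's fill_value argument assumes pandas >= 0.24; the shift convention is the same as above, with negative i meaning rows before the event):
pieces = []
for ev in df['Event'].dropna().unique():
    is_ev = df['Event'].eq(ev)
    for i in range(-2, 3):
        sel = is_ev.shift(i, fill_value=False)   # rows i positions away from the event
        pieces.append(pd.DataFrame({'Event': ev, 'AroundTime': i,
                                    'Value': df.loc[sel, 'Value']}))
result = (pd.concat(pieces)
            .groupby(['Event', 'AroundTime'], as_index=False)['Value']
            .mean())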