I have a dataframe with subjects in two different conditions and many value columns.
import pandas as pd

d = {
    "subject": [1, 1, 2, 2],
    "condition": ["on", "off", "on", "off"],
    "value": [1, 2, 3, 5],
}
df = pd.DataFrame(data=d)
df
   subject condition  value
0        1        on      1
1        1       off      2
2        2        on      3
3        2       off      5
I would like to get new columns which indicate the difference off-on between the two conditions. In this case I would like to get:
   subject condition  value  off-on
0        1        on      1       1
1        1       off      2       1
2        2        on      3       2
3        2       off      5       2
How would I best do that?
I could achieve the result using this code:
onoff = (df[df.condition == "off"].value.reset_index()
         - df[df.condition == "on"].value.reset_index()).value
for idx, sub in enumerate(df.subject.unique()):
    df.loc[df.subject == sub, "off-on"] = onoff.iloc[idx]
But it seems quite tedious and slow. I was hoping for a solution without a loop, since I have many rows and very many value columns. Is there a better way?
Use a pivot combined with map; eval('off-on') evaluates the expression off - on on the pivoted frame, i.e. it subtracts the on column from the off column:

df['off-on'] = df['subject'].map(
    df.pivot(index='subject', columns='condition', values='value')
      .eval('off-on')
)
Or with a MultiIndex (more efficient than a pivot); s['off'] and s['on'] are both indexed by subject, so the subtraction aligns automatically:

s = df.set_index(['condition', 'subject'])['value']
df['off-on'] = df['subject'].map(s['off'] - s['on'])
Output:
subject condition value off-on
0 1 on 1 1
1 1 off 2 1
2 2 on 3 2
3 2 off 5 2
Timings, on 100k subjects:
# MultiIndexing
43.2 ms ± 2.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# pivot
77 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
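Since the question mentions very many value columns, here is a hedged sketch of how the pivot approach could generalize: pivoting without values= reshapes every remaining column at once, giving a MultiIndex of (value column, condition) on the columns.

# Sketch, assuming all non-key columns of df are value columns.
wide = df.pivot(index='subject', columns='condition')
diff = (wide.xs('off', axis=1, level='condition')
        - wide.xs('on', axis=1, level='condition'))
# One off-on difference per original value column, mapped back by subject.
df = df.join(diff.add_suffix('_off-on'), on='subject')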
Use DataFrame.pivot, then map the difference of the off and on columns back to each subject with Series.map:
df1 = df.pivot(index='subject', columns='condition', values='value')
df['off-on'] = df['subject'].map(df1['off'].sub(df1['on']))
print (df)
subject condition value off-on
0 1 on 1 1
1 1 off 2 1
2 2 on 3 2
3 2 off 5 2
Details:
print (df.pivot(index='subject', columns='condition', values='value'))
condition off on
subject
1 2 1
2 5 3
print (df1['off'].sub(df1['on']))
subject
1 1
2 2
dtype: int64
I'm selecting several columns of a dataframe, by a list of the column names. This works fine if all elements of the list are in the dataframe.
But if some elements of the list are not in the DataFrame, then it will generate the error "not in index".
Is there a way to select all columns which are included in that list, even if not all elements of the list are included in the dataframe? Here is some sample data which generates the above error:
df = pd.DataFrame( [[0,1,2]], columns=list('ABC') )
lst = list('ARB')
data = df[lst] # error: not in index
I think you need Index.intersection:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
lst = ['A','R','B']
print (df.columns.intersection(lst))
Index(['A', 'B'], dtype='object')
data = df[df.columns.intersection(lst)]
print (data)
A B
0 1 4
1 2 5
2 3 6
Another solution with numpy.intersect1d (note that np.intersect1d returns the matching columns in sorted order):

import numpy as np

data = df[np.intersect1d(df.columns, lst)]
print (data)
A B
0 1 4
1 2 5
2 3 6
A few other ways; the list comprehension is much faster. (Note that & as a set operation on Index objects is deprecated in newer pandas; df.columns.intersection is the forward-compatible spelling.)
In [1357]: df[df.columns & lst]
Out[1357]:
A B
0 1 4
1 2 5
2 3 6
In [1358]: df[[c for c in df.columns if c in lst]]
Out[1358]:
A B
0 1 4
1 2 5
2 3 6
Timings
In [1360]: %timeit [c for c in df.columns if c in lst]
100000 loops, best of 3: 2.54 µs per loop
In [1359]: %timeit df.columns & lst
1000 loops, best of 3: 231 µs per loop
In [1362]: %timeit df.columns.intersection(lst)
1000 loops, best of 3: 236 µs per loop
In [1363]: %timeit np.intersect1d(df.columns, lst)
10000 loops, best of 3: 26.6 µs per loop
Details
In [1365]: df
Out[1365]:
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
In [1366]: lst
Out[1366]: ['A', 'R', 'B']
A really simple solution here is to use filter(). In your example, just type:
df.filter(lst)
and it will automatically ignore any missing columns. For more, see the documentation for filter.
As a general note, filter is a very flexible and powerful way to select specific columns. In particular, you can use regular expressions. Borrowing the sample data from @jezrael, you could type either of the following.
df.filter(regex='A|R|B')
df.filter(regex='[ARB]')
Those are trivial examples, but suppose you wanted only columns starting with those letters, then you could type:
df.filter(regex='^[ARB]')
FWIW, in some quick timings I find this to be faster than the list comprehension method, but I don't think speed is really much of a concern here -- even the slowest way should be fast enough, as the speed does not depend on the size of the dataframe, only on the number of columns.
Honestly, all of these ways are fine and you can go with whatever is most readable to you. I prefer filter because it is simple while also giving you more options for selecting columns than a simple intersection.
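To make the comparison concrete, here is a minimal sketch (reusing the df and lst from the question) showing that the main approaches agree on this input:

import pandas as pd

df = pd.DataFrame([[0, 1, 2]], columns=list('ABC'))
lst = list('ARB')

# Each of these keeps only the columns that actually exist:
via_intersection = df[df.columns.intersection(lst)]
via_listcomp = df[[c for c in df.columns if c in lst]]
via_filter = df.filter(items=lst)  # 'items' is filter's first parameter

assert list(via_intersection.columns) == ['A', 'B']
assert list(via_listcomp.columns) == ['A', 'B']
assert list(via_filter.columns) == ['A', 'B']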
Use * to unpack the list:

data = df[[*lst]]

It will give the desired result when every column in lst exists; note that this is equivalent to df[lst], so in recent pandas versions it still raises a KeyError for missing labels.
Please try this syntax: DataFrame[[list of columns]], for example df[['a','b']].

a
Out[5]:
    a  b   c
0   1  2   3
1  12  3  44

x is the list of required columns to slice:

x = ['a','b']

This gives the required slice:

a[x]
Out[7]:
    a  b
0   1  2
1  12  3

Note that, as with df[[*lst]] above, this assumes every listed column exists in the dataframe.

Performance:

%timeit a[x]
333 µs ± 9.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Hi, I am working to find repeated positions in the following data frame:
data = pd.DataFrame()
data['league'] = ['A','A','A','A','A','A','B','B','B']
data['Team'] = ['X','X','X','Y','Y','Y','Z','Z','Z']
data['week'] = [1,2,3,1,2,3,1,2,3]
data['position'] = [1,1,2,2,2,1,2,3,4]
I will compare the position with that of the previous row: if it is the same, I will assign 0, and if it is different from the previous row, I will assign 1.
My expected outcome will be as follows:

  league Team  week  position  frequency
0      A    X     1         1          0
1      A    X     2         1          0
2      A    X     3         2          1
3      A    Y     1         2          0
4      A    Y     2         2          0
5      A    Y     3         1          1
6      B    Z     1         2          1
7      B    Z     2         3          1
8      B    Z     3         4          1

It means I will group by (league, Team and week) and work out the frequency.
Can anyone advise how to do that in Pandas?
Thanks,
Zep
Use diff, and compare against 0:
v = df.position.diff()
v[0] = 0
df['frequency'] = v.ne(0).astype(int)
print(df)
league Team week position frequency
0 A X 1 1 0
1 A X 2 1 0
2 A X 3 2 1
3 A Y 1 2 0
4 A Y 2 2 0
5 A Y 3 1 1
6 B Z 1 2 1
7 B Z 2 3 1
8 B Z 3 4 1
For performance reasons, you should try to avoid a fillna call.
df = pd.concat([df] * 100000, ignore_index=True)

%timeit df['frequency'] = df['position'].diff().abs().fillna(0, downcast='infer')
83.7 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
v = df.position.diff()
v[0] = 0
df['frequency'] = v.ne(0).astype(int)
10.9 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To extend this answer to work in a groupby, use
import numpy as np

v = df.groupby(['league', 'Team', 'week']).position.diff()
v[np.isnan(v)] = 0
df['frequency'] = v.ne(0).astype(int)
Use diff and abs with fillna:
data['frequency'] = data['position'].diff().abs().fillna(0,downcast='infer')
print(data)
league Team week position frequency
0 A X 1 1 0
1 A X 2 1 0
2 A X 3 2 1
3 A Y 1 2 0
4 A Y 2 2 0
5 A Y 3 1 1
6 B Z 1 2 1
7 B Z 2 3 1
8 B Z 3 4 1
Using groupby gives all zeros, since each (league, Team, week) combination identifies exactly one row, so the diff is computed within single-row groups and there is no previous row to compare against:

data.groupby(['league', 'Team', 'week'])['position'].diff().fillna(0, downcast='infer')
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
Name: position, dtype: int64
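A quick check confirms why: with the data above, every (league, Team, week) combination contains exactly one row, so each group's diff is all NaN before the fillna.

# Every group has size 1, so there is no previous row within any group.
print(data.groupby(['league', 'Team', 'week']).size().unique())  # [1]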
I have two data frames, let's say dataframe A with a column 'name':
name
0 4
1 2
2 1
3 3
and another dataframe B with two columns, name and value:
name value
0 3 5
1 2 6
2 4 7
3 1 8
I want to rearrange the values in dataframe B according to the name column in dataframe A. I am expecting a final dataframe similar to this:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Here are two options:
dfB.set_index('name').loc[dfA.name].reset_index()
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Or,
dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
dfA
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Timings:
import numpy as np
import pandas as pd
prng = np.random.RandomState(0)
names = np.arange(10**7)
prng.shuffle(names)
dfA = pd.DataFrame({'name': names})
prng.shuffle(names)
dfB = pd.DataFrame({'name': names, 'value': prng.randint(0, 100, 10**7)})
%timeit dfB.set_index('name').loc[dfA.name].reset_index()
1 loop, best of 3: 2.27 s per loop
%timeit dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
1 loop, best of 3: 1.65 s per loop
%timeit dfB.set_index('name').ix[dfA.name].reset_index()
1 loop, best of 3: 1.66 s per loop

(Note: .ix has since been removed from pandas; .loc, as in the first option, is the modern replacement.)
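For completeness, a left merge would preserve dfA's row order as well; a small sketch, not part of the timings above:

# Assumes fresh dfA/dfB as defined in the question; how='left' keeps dfA's order.
result = dfA.merge(dfB, on='name', how='left')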
Is it possible to put percentile cuts on all columns of a dataframe without using a loop? This is how I am doing it now:
df = pd.DataFrame(np.random.randn(10,5))
df_q = pd.DataFrame()
for i in list(range(len(df.columns))):
    df_q[i] = pd.qcut(df[i], 5, labels=list(range(5)))
I am hoping there is a slick pandas solution for this to avoid the use of a loop.
Thanks!
pd.qcut accepts a 1D array or Series as its argument. To apply pd.qcut to every column requires multiple calls to pd.qcut. So no matter how you dress it up, there will be a loop -- either explicit or implicit.
You could for example, use apply to call pd.qcut for each column:
In [46]: df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
Out[46]:
0 1 2 3 4
0 4 0 3 0 3
1 0 0 2 3 0
2 3 4 1 2 3
3 4 1 1 1 4
4 3 2 2 4 1
5 2 4 3 0 1
6 2 3 0 4 4
7 1 3 4 2 2
8 0 1 4 3 0
9 1 2 0 1 2
but under the hood, df.apply is using a for-loop, so it really isn't very different from your for-loop:

df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
In [47]: %timeit df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
100 loops, best of 3: 2.9 ms per loop
In [48]: %%timeit
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
100 loops, best of 3: 2.95 ms per loop
Note that
for i in list(range(len(df.columns))):
will only work if the columns of df happen to be sequential integers starting at 0.
It is more robust to use
for col in df:
to iterate over the columns of the DataFrame.
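For instance, here is a small sketch with hypothetical string column names 'x' and 'y': the positional loop (df[0]) would raise a KeyError, while iterating over the columns directly still works.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2), columns=['x', 'y'])

df_q = pd.DataFrame()
for col in df:  # iterates the labels 'x', 'y' regardless of their type
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))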