I need to create a new column that will contain one of the letters {a, b, c, d} based on these rules:
{'a' if (df['q1']==0 & df['q2']==0),
'b' if (df['q1']==0 & df['q2']==1),
'c' if (df['q1']==1 & df['q2']==0),
'd' if (df['q1']==1 & df['q2']==1)}
So the new third column should contain the letter that corresponds to a particular combination of {0, 1} in the two columns.
q1 q2
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
6 0 1
7 0 1
8 0 1
9 0 1
10 1 1
11 1 1
12 0 1
13 0 1
14 1 0
15 0 0
16 0 0
17 0 0
18 0 0
19 0 0
20 0 0
21 0 0
I thought about converting the numbers in each row from binary to decimal and then applying dictionary rules.
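For reference, that binary-to-decimal idea works directly: treat (q1, q2) as a two-bit number and map the four resulting values. A minimal sketch (assuming the letter order from the rules above):
df['val'] = (df['q1'] * 2 + df['q2']).map({0: 'a', 1: 'b', 2: 'c', 3: 'd'})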
You can use join with a Series that has a MultiIndex:
idx = pd.MultiIndex.from_product([[0,1],[0,1]], names=('q1','q2'))
s = pd.Series(['a','b','c','d'], index=idx, name='val')
print(s)
q1 q2
0 0 a
1 b
1 0 c
1 d
Name: val, dtype: object
df = df.join(s, on=['q1','q2'])
print(df)
q1 q2 val
0 0 1 b
1 0 1 b
2 0 1 b
3 0 1 b
4 0 1 b
5 0 1 b
6 0 1 b
7 0 1 b
8 0 1 b
9 0 1 b
10 1 1 d
11 1 1 d
12 0 1 b
13 0 1 b
14 1 0 c
15 0 0 a
16 0 0 a
17 0 0 a
18 0 0 a
19 0 0 a
20 0 0 a
21 0 0 a
Another method: build row tuples with df.transform and map them with Series.map:
In [90]: mapping = {(0, 0) :'a', (0, 1) : 'b', (1, 0): 'c', (1, 1): 'd'}
In [91]: df['val'] = df.transform(lambda x: (x['q1'], x['q2']), axis=1).map(mapping); df
Out[91]:
q1 q2 val
0 0 1 b
1 0 1 b
2 0 1 b
3 0 1 b
4 0 1 b
5 0 1 b
6 0 1 b
7 0 1 b
8 0 1 b
9 0 1 b
10 1 1 d
11 1 1 d
12 0 1 b
13 0 1 b
14 1 0 c
15 0 0 a
16 0 0 a
17 0 0 a
18 0 0 a
19 0 0 a
20 0 0 a
21 0 0 a
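Note that on recent pandas versions DataFrame.transform with axis=1 may not return a Series of row tuples, so the line above can fail there. An equivalent formulation of the same mapping (a sketch, not a different technique) uses apply, which by default yields a Series of tuples when the function returns a tuple:
df['val'] = df.apply(lambda x: (x['q1'], x['q2']), axis=1).map(mapping)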
You can also use zip to pair the two columns, wrap the result in pd.Series, and then do the mapping:
In [119]: df['val'] = pd.Series(list(zip(df.q1, df.q2))).map(mapping); df
Out[119]:
q1 q2 val
0 0 1 b
1 0 1 b
2 0 1 b
3 0 1 b
4 0 1 b
5 0 1 b
6 0 1 b
7 0 1 b
8 0 1 b
9 0 1 b
10 1 1 d
11 1 1 d
12 0 1 b
13 0 1 b
14 1 0 c
15 0 0 a
16 0 0 a
17 0 0 a
18 0 0 a
19 0 0 a
20 0 0 a
21 0 0 a
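One caveat with the zip approach: pd.Series(list(zip(...))) is built with a fresh RangeIndex, which lines up with df here only because df also has the default index. If your frame is indexed differently, it is safer to pass the index explicitly — a sketch using the same mapping:
df['val'] = pd.Series(list(zip(df.q1, df.q2)), index=df.index).map(mapping)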
Performance
jezrael's solution:
In [552]: %%timeit
...: idx = pd.MultiIndex.from_product([[0,1],[0,1]], names=('q1','q2'))
...: s = pd.Series(['a','b','c','d'], index=idx, name='val')
...: df.join(s, on=['q1','q2'])
...:
100 loops, best of 3: 2.84 ms per loop
Proposed in this post:
In [553]: %%timeit
...: mapping = {(0, 0) :'a', (0, 1) : 'b', (1, 0): 'c', (1, 1): 'd'}
...: df.transform(lambda x: (x['q1'], x['q2']), axis=1).map(mapping)
...:
1000 loops, best of 3: 1.7 ms per loop
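These numbers come from IPython's %%timeit magic. Outside a notebook, the comparison can be reproduced with the standard timeit module — a minimal sketch, assuming the 22-row df from the question and using the apply variant noted earlier in place of transform (on a frame this small the timings mostly measure per-call overhead, so expect different ratios on large data):
import timeit

setup = """
import pandas as pd
df = pd.DataFrame({'q1': [0]*10 + [1, 1, 0, 0, 1] + [0]*7,
                   'q2': [1]*14 + [0]*8})
mapping = {(0, 0): 'a', (0, 1): 'b', (1, 0): 'c', (1, 1): 'd'}
"""
stmt = "df.apply(lambda x: (x['q1'], x['q2']), axis=1).map(mapping)"
print(timeit.timeit(stmt, setup=setup, number=1000))  # total seconds for 1000 runs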
Related
I have a DataFrame with a column called No., and I need to count consecutive 0s in it: the first 0 of a run is recorded as 1, the second as 2, and so on, and whenever a 1 is encountered the counter resets. The result should be saved in a column called count.
What should I do?
An example of my Dataframe is as follows:
import numpy as np
import pandas as pd
np.random.seed(2021)
a = np.random.randint(0, 2, 20)
df = pd.DataFrame(a, columns=['No.'])
print(df)
No.
0 0
1 1
2 1
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 1
11 1
12 1
13 1
14 0
15 0
16 0
17 0
18 0
19 0
The result I need:
No. count
0 0 1
1 1 0
2 1 0
3 0 1
4 1 0
5 0 1
6 0 2
7 0 3
8 1 0
9 0 1
10 1 0
11 1 0
12 1 0
13 1 0
14 0 1
15 0 2
16 0 3
17 0 4
18 0 5
19 0 6
Generate pseudo-groups with cumsum, then build within-group counters with groupby.cumsum:
groups = df['No.'].ne(0).cumsum()
df['count'] = df['No.'].eq(0).groupby(groups).cumsum()
Output:
No. count
0 0 1
1 1 0
2 1 0
3 0 1
4 1 0
5 0 1
6 0 2
7 0 3
8 1 0
9 0 1
10 1 0
11 1 0
12 1 0
13 1 0
14 0 1
15 0 2
16 0 3
17 0 4
18 0 5
19 0 6
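To see why this works, look at the intermediate pseudo-groups for the first few rows (a sketch using the same df):
groups = df['No.'].ne(0).cumsum()
# No.   : 0  1  1  0  1  0  0  0 ...
# groups: 0  1  2  2  3  3  3  3 ...
Each run of 0s shares one group label (a new label starts at every 1), so the cumulative sum of the boolean df['No.'].eq(0) within a group counts 1, 2, 3, ... through the run and is unaffected by earlier runs.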
>>> df = pd.DataFrame({'a': [1,1,1,1,2,2,2,2,3,3,3,3],
'b': [0,0,1,1,0,0,1,1,0,0,1,1,],
'c': [5,5,5,8,9,9,6,6,7,8,9,9]})
>>> df
a b c
0 1 0 5
1 1 0 5
2 1 1 5
3 1 1 8
4 2 0 9
5 2 0 9
6 2 1 6
7 2 1 6
8 3 0 7
9 3 0 8
10 3 1 9
11 3 1 9
Is there an alternative way to get this output?
>>> pd.pivot_table(df, index=['a','b'], columns='c', aggfunc=len, fill_value=0).reset_index()
c a b 5 6 7 8 9
0 1 0 2 0 0 0 0
1 1 1 1 0 0 1 0
2 2 0 0 0 0 0 2
3 2 1 0 2 0 0 0
4 3 0 0 0 1 1 0
5 3 1 0 0 0 0 2
I have a large df (roughly 1M rows) where len(df.c.unique()) is 134, so the pivot takes forever.
Given that this result is returned within a second on my actual df:
>>> df.groupby(by = ['a', 'b', 'c']).size().reset_index()
a b c 0
0 1 0 5 2
1 1 1 5 1
2 1 1 8 1
3 2 0 9 2
4 2 1 6 2
5 3 0 7 1
6 3 0 8 1
7 3 1 9 2
I was wondering whether I could manually construct the desired outcome from this output.
1. Here's one:
df.groupby(by = ['a', 'b', 'c']).size().unstack(fill_value=0).reset_index()
Output:
c a b 5 6 7 8 9
0 1 0 2 0 0 0 0
1 1 1 1 0 0 1 0
2 2 0 0 0 0 0 2
3 2 1 0 2 0 0 0
4 3 0 0 0 1 1 0
5 3 1 0 0 0 0 2
2. Here's another way:
pd.crosstab([df.a,df.b], df.c).reset_index()
Output:
c a b 5 6 7 8 9
0 1 0 2 0 0 0 0
1 1 1 1 0 0 1 0
2 2 0 0 0 0 0 2
3 2 1 0 2 0 0 0
4 3 0 0 0 1 1 0
5 3 1 0 0 0 0 2
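In both outputs, 'c' lingers as the name of the columns axis (the stray c at the left of the header row). If you want a completely flat header, you can clear it — a small sketch:
out = pd.crosstab([df.a, df.b], df.c).reset_index()
out.columns.name = None  # drop the leftover 'c' label from the header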
Is there a more efficient way to create multiple new columns, initialized to zero, in a pandas dataframe df than:
for col in add_cols:
df.loc[:, col] = 0
UPDATE: using @Jeff's method, but doing it dynamically:
In [208]: add_cols = list('xyz')
In [209]: df.assign(**{i:0 for i in add_cols})
Out[209]:
a b c x y z
0 4 8 6 0 0 0
1 3 7 0 0 0 0
2 4 0 1 0 0 0
3 5 4 5 0 0 0
4 1 3 0 0 0 0
OLD answer:
Another method:
df[add_cols] = pd.DataFrame(0, index=df.index, columns=add_cols)
Demo:
In [343]: df = pd.DataFrame(np.random.randint(0, 10, (5,3)), columns=list('abc'))
In [344]: add_cols = list('xyz')
In [345]: add_cols
Out[345]: ['x', 'y', 'z']
In [346]: df
Out[346]:
a b c
0 4 9 0
1 1 1 1
2 8 8 1
3 0 1 4
4 8 5 6
In [347]: df[add_cols] = pd.DataFrame(0, index=df.index, columns=add_cols)
In [348]: df
Out[348]:
a b c x y z
0 4 9 0 0 0 0
1 1 1 1 0 0 0
2 8 8 1 0 0 0
3 0 1 4 0 0 0
4 8 5 6 0 0 0
In [13]: df = pd.DataFrame(np.random.randint(0, 10, (5,3)), columns=list('abc'))
In [14]: df
Out[14]:
a b c
0 7 2 3
1 7 0 7
2 5 1 5
3 9 1 4
4 2 1 4
In [15]: df.assign(x=0, y=0, z=0)
Out[15]:
a b c x y z
0 7 2 3 0 0 0
1 7 0 7 0 0 0
2 5 1 5 0 0 0
3 9 1 4 0 0 0
4 2 1 4 0 0 0
Here is a hack:
[df.insert(0, col, 0) for col in add_cols]  # list comprehension used only for insert's in-place side effect
You can treat a DataFrame with dict-like syntax:
for col in add_cols:
df[col] = 0
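The same dict-style loop can also be collapsed into a single concat, much like the pd.DataFrame(0, ...) assignment shown earlier — a sketch:
df = pd.concat([df, pd.DataFrame(0, index=df.index, columns=add_cols)], axis=1)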
I would like to slice a dataframe to return rows where the element x=0 appears consecutively at least n=3 times, and then drop the first i=2 instances in each mini-sequence.
Is there an efficient way of achieving this in pandas and, if not, with numpy or scipy?
import pandas as pd
import numpy as np
Example 1
df=pd.DataFrame({'A':[0,1,0,0,1,1,0,0,0,0,1,1,0,0,0,1,1],'B':np.random.randn(17)})
A B
0 0 0.748958
1 1 0.254730
2 0 0.629609
3 0 0.272738
4 1 -1.885906
5 1 1.206371
6 0 -0.332471
7 0 0.217553
8 0 0.768986
9 0 -1.607236
10 1 1.613650
11 1 -1.096892
12 0 -0.435762
13 0 0.131284
14 0 -0.177188
15 1 1.393890
16 1 0.174803
Desired output:
A B
8 0 0.768986
9 0 -1.607236
14 0 -0.177188
Example 2
x=0 (element of interest)
n=5 (min length of sequence)
i=2 (drop first two in each sequence)
df2=pd.DataFrame({'A':[0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0],'B':np.random.randn(20)})
A B
0 0 0.703803
1 0 -0.144088
2 0 0.635577
3 0 -0.834611
4 0 1.472271
5 0 -0.554860
6 0 -0.167016
7 1 0.578847
8 1 -1.873663
9 0 0.197062
10 0 1.458845
11 0 -1.921660
12 0 -1.301481
13 0 0.240197
14 0 -1.425058
15 1 -2.801151
16 0 0.766757
17 0 1.249806
18 0 0.595366
19 0 -1.447632
Desired output:
A B
2 0 0.635577
3 0 -0.834611
4 0 1.472271
5 0 -0.554860
6 0 -0.167016
11 0 -1.921660
12 0 -1.301481
13 0 0.240197
14 0 -1.425058
Here's an approach using some NumPy manipulations -
def slice_consc(df, n):
    Acol = np.array(df['A'])
    # 0/1 mask of the zero positions, padded so runs touching the edges are detected
    Acol_ext = np.concatenate(([0], (Acol == 0) + 0, [0]))
    starts = np.where(np.diff(Acol_ext) == 1)[0]   # first index of each run of zeros
    stops = np.where(np.diff(Acol_ext) == -1)[0]   # one past the last index of each run
    marks = np.zeros(Acol.size + 2, dtype=int)     # renamed from id to avoid shadowing the builtin
    valid_mask = stops - starts >= n               # keep only runs of length >= n
    marks[stops[valid_mask]] = -1
    marks[starts[valid_mask] + 2] = 1              # +2 drops the first i=2 rows of each run
    # cumsum is 1 exactly between each (start+2) and stop; trim the padding
    return df[(marks.cumsum() == 1)[:-2]]
Sample runs -
Case #1:
>>> df
A B
0 0 0.977325
1 1 -0.408457
2 0 -0.377973
3 0 0.567537
4 1 -0.222019
5 1 -1.167422
6 0 -0.142546
7 0 0.675458
8 0 -0.184456
9 0 -0.826050
10 1 -0.772413
11 1 -1.556440
12 0 -0.687249
13 0 -0.481676
14 0 0.420400
15 1 0.031999
16 1 -1.092540
>>> slice_consc(df,3)
A B
8 0 -0.184456
9 0 -0.826050
14 0 0.420400
Case #2:
>>> df2
A B
0 0 0.757102
1 0 2.114935
2 0 -0.352309
3 0 -0.214931
4 0 -1.626064
5 0 -0.989776
6 0 0.639635
7 1 0.049358
8 1 -2.600326
9 0 0.057792
10 0 1.263418
11 0 0.618495
12 0 -1.637054
13 0 1.220862
14 0 1.245484
15 1 1.388218
16 0 -0.499900
17 0 0.761310
18 0 -1.308176
19 0 -2.005983
>>> slice_consc(df2,5)
A B
2 0 -0.352309
3 0 -0.214931
4 0 -1.626064
5 0 -0.989776
6 0 0.639635
11 0 0.618495
12 0 -1.637054
13 0 1.220862
14 0 1.245484
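To answer the "in pandas" part of the question directly: the same slice can be expressed with groupby over run labels. A sketch (slice_consc_pd is an illustrative name; x, n, i as defined in the question):
def slice_consc_pd(df, n, i=2, x=0):
    m = df['A'].eq(x)                          # rows equal to the element of interest
    runs = (~m).cumsum()                       # constant label within each run of x's
    pos = m.groupby(runs).cumsum()             # 1-based position inside each run
    length = m.groupby(runs).transform('sum')  # total length of the run the row sits in
    return df[m & (length >= n) & (pos > i)]
slice_consc_pd(df, 3) should match slice_consc(df, 3) on the sample data.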
What is the idiomatic way to store this kind of data structure in a pandas DataFrame?
### Option 1
df = pd.DataFrame(data=[
    {'kws': np.array([0, 0, 0]), 'x': i, 'y': i} for i in range(10)
])
# df.x and df.y work as expected
# the list and array casting is required because df.kws is
# an array of arrays
np.array(list(df.kws))
# this causes problems when trying to assign as well though:
# for any other data type, this would set all kws in df to the rhs [1,2,3],
# but since the rhs is a list, it tries an element-wise assignment and
# errors saying that the length of df and the length of the rhs do not match
df.kws = [1,2,3]
### Option 2
df = pd.DataFrame(data=[
    {'kw_0': 0, 'kw_1': 0, 'kw_2': 0, 'x': i, 'y': i} for i in range(10)
])
# retrieving a 2d array:
df[sorted([c for c in df if c.startswith('kw_')])].values
# batch set:
kws = [1, 2, 3]
for i, kw in enumerate(kws):
    df['kw_%d' % i] = kw
Neither of these solutions feels right to me. For one thing, neither of them allows retrieving a 2d matrix without copying all of the data. Is there a better way to handle this kind of mixed-dimension data, or is this just a task that pandas isn't up to right now?
Just use a column MultiIndex; see the pandas docs:
In [31]: df = pd.DataFrame([ {'kw_0' : 0, 'kw_1' : 0, 'kw_2' : 0, 'x' : i, 'y': i} for i in range(10) ])
In [32]: df
Out[32]:
kw_0 kw_1 kw_2 x y
0 0 0 0 0 0
1 0 0 0 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
In [33]: df.columns = pd.MultiIndex.from_tuples([('kw',0),('kw',1),('kw',2),('value','x'),('value','y')])
In [34]: df
Out[34]:
kw value
0 1 2 x y
0 0 0 0 0 0
1 0 0 0 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
Selection is easy
In [35]: df['kw']
Out[35]:
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
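This also addresses the 2d-matrix retrieval from the question: the kw columns share one dtype, so selecting the group converts cleanly to a NumPy array (whether it is a view or a copy depends on pandas' internal block layout, so treat this as a sketch rather than a zero-copy guarantee):
kw_matrix = df['kw'].to_numpy()  # shape (10, 3)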
Setting too
In [36]: df.loc[1,'kw'] = [4,5,6]
In [37]: df
Out[37]:
kw value
0 1 2 x y
0 0 0 0 0 0
1 4 5 6 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
Alternatively you can use two DataFrames, indexed the same way, and combine/merge them when needed, as sketched below.
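A minimal sketch of that two-frame layout (all names are illustrative):
kws = pd.DataFrame(0, index=range(10), columns=[0, 1, 2])   # homogeneous 2d block
values = pd.DataFrame({'x': range(10), 'y': range(10)})     # mixed per-row fields
combined = kws.join(values)  # align on the shared index when a single frame is needed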