Say I have a DataFrame, filled as below, with the column 'Key' taking one of five possible values: A, B, C, D, X. I would like to add a new column 'Res' that counts these letters cumulatively and resets to zero each time it hits an X.
For example:
Key Res
0 D 1
1 X 0
2 B 1
3 C 2
4 D 3
5 X 0
6 A 1
7 C 2
8 X 0
9 X 0
Can anyone assist with how I can achieve this?
A possible solution:
# a is True for every row that is not an 'X'
a = df.Key.ne('X')
# running count of non-'X' rows, minus the running count as of the most recent 'X'
df['new'] = (a.cumsum() - a.cumsum().where(~a).ffill().fillna(0)).astype(int)
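The subtraction resets the running count at every 'X'. The same idea can also be written as a grouped cumulative sum, which some may find easier to read; this is only a sketch, assuming df has the Key column shown above:
mask = df.Key.ne('X')                    # rows that should be counted
groups = (~mask).cumsum()                # each 'X' starts a new group
df['new'] = mask.astype(int).groupby(groups).cumsum()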
Another possible solution, which is more basic than the previous one, but much faster (several orders of magnitude):
s = np.zeros(len(df), dtype=int)
for i in range(len(df)):                 # assumes a default RangeIndex
    if df.Key[i] != 'X':
        s[i] = s[i - 1] + 1              # at i == 0, s[-1] is still 0, so the first row works out
df['new'] = s
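Most of the loop's time goes into the per-element df.Key[i] lookups; pulling the column out as a NumPy array first keeps the same logic while avoiding that overhead (a sketch, not a benchmark):
keys = df.Key.to_numpy()
s = np.zeros(len(keys), dtype=int)
for i in range(len(keys)):
    if keys[i] != 'X':
        s[i] = s[i - 1] + 1              # s[-1] is still 0 at i == 0
df['new'] = s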
Output:
Key Res new
0 D 1 1
1 X 0 0
2 B 1 1
3 C 2 2
4 D 3 3
5 X 0 0
6 A 1 1
7 C 2 2
8 X 0 0
9 X 0 0
Example
df = pd.DataFrame(list('DXBCDXACXX'), columns=['Key'])
df
Key
0 D
1 X
2 B
3 C
4 D
5 X
6 A
7 C
8 X
9 X
Code
# prepend a copy of the first row so that rows before the first 'X' are counted from 1
df1 = pd.concat([df.iloc[[0]], df])
# each 'X' starts a new group
grouper = df1['Key'].eq('X').cumsum()
# number the rows within each group, then drop the prepended row again
df1.assign(Res=df1.groupby(grouper).cumcount()).iloc[1:]
result:
Key Res
0 D 1
1 X 0
2 B 1
3 C 2
4 D 3
5 X 0
6 A 1
7 C 2
8 X 0
9 X 0
Related
I have a dataframe
x y
a 1
b 1
c 1
d 0
e 0
f 0
g 1
h 1
i 0
j 0
I want to remove the rows with 0 except every first new occurrence of 0 after a 1, so the resultant dataframe should be:
x y
a 1
b 1
c 1
d 0
g 1
h 1
i 0
Is it possible to do this without creating groups or row-by-row iteration, to make it faster, since I have a big dataframe?
Let us try diff with cumsum to create the consecutive-value groups, then use duplicated:
# keep every row with y == 1, plus the first row of each consecutive run
out = df[~df.y.diff().ne(0).cumsum().duplicated() | df.y.astype(bool)].copy()
Out[352]:
x y
0 a 1
1 b 1
2 c 1
3 d 0
6 g 1
7 h 1
8 i 0
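To make the chained expression concrete, here is the same mask written out step by step (the intermediate names are only for illustration):
groups = df.y.diff().ne(0).cumsum()   # label each consecutive run: 1 1 1 2 2 2 3 3 4 4
first_of_run = ~groups.duplicated()   # True only at the first row of each run
keep = first_of_run | df.y.astype(bool)
out = df[keep].copy()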
Check consecutive similarity using shift():
df[df.y.ne(0) | (df.y.eq(0) & df.y.shift(1).ne(0))]
x y
0 a 1
1 b 1
2 c 1
3 d 0
6 g 1
7 h 1
8 i 0
I am looking for a solution to pick values (row-wise) from a DataFrame.
Here is what I already have:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,10, (10, 10)))
df.columns = list('ABCDEFGHIJ')
N = 2
idx = np.argsort(df.values, 1)[:, 0:N]
df = pd.concat([pd.DataFrame(df.values.take(idx), index=df.index),
                pd.DataFrame(df.columns[idx], index=df.index)],
               keys=['Value', 'Columns']).sort_index(level=1)
Now I have the index/position for every value, but if I try to get the values from the DataFrame, it only takes the values from the first row.
What do I have to change in the code?
df looks like:
A B C D E F G H I J
0 6 9 6 1 1 2 8 7 3 5
1 6 3 5 3 5 8 8 2 8 1
2 7 8 7 2 1 2 9 9 4 9
....
My output should look like:
0 D E
0 1 1
1 J H
1 1 2
You can use np.take_along_axis to take values from the DataFrame row-wise. Use np.insert to interleave the values taken with their corresponding column names.
# idx is the same as the one used in the question.
vals = np.take_along_axis(df.values, idx, axis=1)   # the N smallest values per row
cols = df.columns.values[idx]                       # their column labels
indices = np.r_[:len(vals)]                         # same as np.arange(len(vals))
# insert each row of column labels directly above its row of values
out = np.insert(vals.astype(str), indices, cols, axis=0)
index = np.repeat(indices, 2)                       # 0, 0, 1, 1, ...
df = pd.DataFrame(out, index=index)
0 1
0 D E
0 1 1
1 J H
1 1 2
2 E D
2 1 2
3 E I
3 2 2
4 A D
4 1 1
5 I J
5 1 3
6 E I
6 1 2
7 B H
7 1 3
8 G I
8 1 1
9 E A
9 1 2
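If you prefer the layout from the original attempt, the same idx can be plugged into that pd.concat pattern; a sketch, assuming df is the original 10x10 frame (before it is overwritten) and idx is as computed in the question. The only change is using np.take_along_axis for the values, and the result keeps an extra 'Value'/'Columns' level in the index:
vals = pd.DataFrame(np.take_along_axis(df.values, idx, axis=1), index=df.index)
cols = pd.DataFrame(df.columns.values[idx], index=df.index)
out = pd.concat([vals, cols], keys=['Value', 'Columns']).sort_index(level=1)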
The df I have:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
The df I want:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
I am able to get df want by using:
df.loc['d']=df.loc['b']-df.loc['a']
However, my actual df has 'a','b','c' rows for multiple IDs 'X', 'Y' etc.
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
How can I create the same output with multiple IDs?
My original method:
df.loc['d']=df.loc['b']-df.loc['a']
fails with KeyError: 'b'.
Desired output:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1
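For reference, a minimal sketch to construct a MultiIndex frame like the one above (values taken from the question's data), so the answers below can be run directly:
import pandas as pd

mi = pd.MultiIndex.from_product([['X', 'Y'], ['a', 'b', 'c']])
df = pd.DataFrame([[1, 2, 3], [2, 1, 4], [1, 1, 1],
                   [1, 2, 3], [2, 2, 4], [1, 1, 1]],
                  index=mi, columns=['A', 'B', 'C'])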
IIUC,
for i, sub in df.groupby(df.index.get_level_values(0)):
    df.loc[(i, 'd'), :] = sub.loc[(i, 'b')] - sub.loc[(i, 'a')]
print(df.sort_index())
Or maybe
k = (df.groupby(df.index.get_level_values(0), as_index=False)
       .apply(lambda s: pd.DataFrame(
           [s.loc[(s.name, 'b')].values - s.loc[(s.name, 'a')].values],
           columns=s.columns,
           index=pd.MultiIndex(levels=[[s.name], ['d']], codes=[[0], [0]])))
       .reset_index(drop=True, level=0))
pd.concat([k, df]).sort_index()
Data reshaping is a useful trick if you want to do manipulation on a particular level of a MultiIndex. See the code below:
result = (df.unstack(0).T                # rows: (column, ID), columns: a, b, c
            .assign(d=lambda x: x.b - x.a)
            .stack()
            .unstack(0))                 # back to rows: (ID, letter), columns: A, B, C
Use pd.IndexSlice to slice a and b. Call diff, slice on b, and rename it to d. Finally, append it to the original df:
idx = pd.IndexSlice
# within the (a, b) slice, diff makes each 'b' row hold b - a
df1 = df.loc[idx[:, ['a', 'b']], :].diff().loc[idx[:, 'b'], :].rename({'b': 'd'})
df2 = pd.concat([df, df1]).sort_index().astype(int)
Out[106]:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1
I have a pandas dataframe with roughly 150,000,000 rows in the following format:
df.head()
Out[1]:
ID TERM X
0 1 A 0
1 1 A 4
2 1 A 6
3 1 B 0
4 1 B 10
5 2 A 1
6 2 B 1
7 2 F 1
I want to aggregate it by ID & TERM, and count the number of rows. Currently I do the following:
df.groupby(['ID','TERM']).count()
Out[2]:
ID TERM X
0 1 A 3
1 1 B 2
2 2 A 1
3 2 B 1
4 2 F 1
But this takes roughly two minutes. The same operation using R data.table takes less than 22 seconds. Is there a more efficient way to do this in Python?
For comparison, R data.table:
system.time({ df[,.(.N), .(ID, TERM)] })
#user: 30.32 system: 2.45 elapsed: 22.88
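As a side note before the NumPy route: GroupBy.size() counts rows per group directly, while count() counts non-null values for every column, so size() is usually the faster pandas-level option. A minimal sketch (timings will vary by machine and pandas version):
# count rows per (ID, TERM) group without per-column work
out = df.groupby(['ID', 'TERM']).size().reset_index(name='X')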
A NumPy solution would be like so -
def groupby_count(df):
    # encode TERM as integer codes so rows can be compared cheaply
    unq, t = np.unique(df.TERM, return_inverse=True)
    ids = df.ID.values
    sidx = np.lexsort([t, ids])              # sort by ID, then TERM within ID
    ts = t[sidx]
    idss = ids[sidx]
    # True wherever a new (ID, TERM) block starts
    m0 = (idss[1:] != idss[:-1]) | (ts[1:] != ts[:-1])
    m = np.concatenate(([True], m0, [True]))
    ids_out = idss[m[:-1]]
    t_out = unq[ts[m[:-1]]]
    x_out = np.diff(np.flatnonzero(m) + 1)   # block lengths = group sizes
    out_ar = np.column_stack((ids_out, t_out, x_out))
    return pd.DataFrame(out_ar, columns=['ID', 'TERM', 'X'])
A bit simpler version -
def groupby_count_v2(df):
    a = df.values
    sidx = np.lexsort(a[:, :2].T)            # sort so equal (ID, TERM) pairs are adjacent
    b = a[sidx, :2]
    # True wherever ID or TERM changes, with sentinels at both ends
    m = np.concatenate(([True], (b[1:] != b[:-1]).any(1), [True]))
    out_ar = np.column_stack((b[m[:-1], :2], np.diff(np.flatnonzero(m) + 1)))
    return pd.DataFrame(out_ar, columns=['ID', 'TERM', 'X'])
Sample run -
In [332]: df
Out[332]:
ID TERM X
0 1 A 0
1 1 A 4
2 1 A 6
3 1 B 0
4 1 B 10
5 2 A 1
6 2 B 1
7 2 F 1
In [333]: groupby_count(df)
Out[333]:
ID TERM X
0 1 A 3
1 1 B 2
2 2 A 1
3 2 B 1
4 2 F 1
Let's randomly shuffle the rows and verify that our solution still works -
In [339]: df1 = df.iloc[np.random.permutation(len(df))]
In [340]: df1
Out[340]:
ID TERM X
7 2 F 1
6 2 B 1
0 1 A 0
3 1 B 0
5 2 A 1
2 1 A 6
1 1 A 4
4 1 B 10
In [341]: groupby_count(df1)
Out[341]:
ID TERM X
0 1 A 3
1 1 B 2
2 2 A 1
3 2 B 1
4 2 F 1
I have a df that looks something like:
   a  b  c  d  e
0  0  1  2  3  5
1  1  4  0  5  2
2  5  8  9  6  0
3  4  5  0  0  0
I would like to output the number of values in column c that are not zero.
Use double sum:
print df
a b c d e
0 0 1 2 3 5
1 1 4 0 5 2
2 5 8 9 6 0
3 4 5 0 0 0
print (df != 0).sum(1)
0 4
1 4
2 4
3 2
dtype: int64
print (df != 0).sum(1).sum()
14
If you need the count only for column c or d:
print (df['c'] != 0).sum()
2
print (df['d'] != 0).sum()
3
EDIT: Solution with numpy.sum:
print ((df != 0).values.sum())
14
NumPy's count_nonzero function is efficient for this:
np.count_nonzero(df["c"])
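If the count is needed for the whole frame rather than a single column, the same function can be applied to the underlying array; a small sketch using the df above:
import numpy as np

np.count_nonzero(df['c'])        # 2  -> non-zero entries in column c
np.count_nonzero(df.to_numpy())  # 14 -> non-zero entries in the whole frame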