I want to produce a column B in a dataframe that tracks the maximum value reached in column A since row Index 0.
A B
Index
0 1 1
1 2 2
2 3 3
3 2 3
4 1 3
5 3 3
6 4 4
7 2 4
I want to avoid iterating, so is there a vectorized solution, and if so, what could it look like?
You're looking for cummax:
In [257]:
df['B'] = df['A'].cummax()
df
Out[257]:
A B
Index
0 1 1
1 2 2
2 3 3
3 2 3
4 1 3
5 3 3
6 4 4
7 2 4
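For completeness, here is a minimal self-contained sketch reproducing the example; the DataFrame construction is an assumption, since the question only shows the frame itself:
import pandas as pd

# assumed reconstruction of the question's frame, with "Index" as the index name
df = pd.DataFrame({'A': [1, 2, 3, 2, 1, 3, 4, 2]},
                  index=pd.RangeIndex(8, name='Index'))
df['B'] = df['A'].cummax()  # running maximum of A from the first row onward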
Related
I have a dataframe that looks like
ID feature
1 2
1 3
1 4
2 3
2 2
3 5
3 8
3 4
3 2
4 4
4 6
and I want to add a new column n_ID that counts the number of times each element occurs in the column ID, so the desired output should look like
ID feature n_ID
1 2 3
1 3 3
1 4 3
2 3 2
2 2 2
3 5 4
3 8 4
3 4 4
3 2 4
4 4 2
4 6 2
I know the .value_counts() function, but I don't know how to use it to create the new column. Thanks in advance.
Using value_counts... I was thinking of this... @sophocles, thanks for transform... :)
df = pd.DataFrame({"ID":[1,1,1,2,2,3,3,3,3,4,4],
"feature":[1,2,3,4,5,6,7,8,9,10,11]})
df1 = pd.DataFrame(df["ID"].value_counts().reset_index())
df1.columns = ["ID","n_ID"]
df = df.merge(df1,how = "left",on="ID")
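As a shorter follow-up sketch on the same df, the counts from value_counts can be mapped straight back onto ID, or the transform mentioned above can do it in one step:
# map each ID to its frequency from value_counts
df['n_ID'] = df['ID'].map(df['ID'].value_counts())

# equivalent groupby route using transform('count')
df['n_ID'] = df.groupby('ID')['ID'].transform('count')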
Just create a new column and count the occurrences using a lambda function:
Code:
df['n_id'] = df.apply(lambda x: df['ID'].tolist().count(x.ID), axis=1)
Output:
ID feature n_id
0 1 1 3
1 1 2 3
2 1 3 3
3 2 4 2
4 2 5 2
5 3 6 4
6 3 7 4
7 3 8 4
8 3 9 4
9 4 10 2
10 4 11 2
I have a dataframe that looks as follows:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[[1,2,3],[1,2,3],[1,2,3]], 'c': [[4,5,6],[4,5,6],[4,5,6]]})
I want to explode the dataframe on columns b and c. I know that if we only use one column, we can do
df.explode('column_name')
However, I can't find a way to use it with two columns. Here is the desired output.
output = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c': [4,5,6,4,5,6,4,5,6]})
I have tried
df.explode(['a','b'])
but it does not work and gives me a
ValueError: column must be a scalar.
Thanks.
Let us try
df=pd.concat([df[x].explode() for x in ['b','c']],axis=1).join(df[['a']]).reindex(columns=df.columns)
Out[179]:
a b c
0 1 1 4
0 1 2 5
0 1 3 6
1 2 1 4
1 2 2 5
1 2 3 6
2 3 1 4
2 3 2 5
2 3 3 6
You can use itertools.chain along with zip to get your result:
from itertools import chain

pd.DataFrame(chain.from_iterable(zip([a] * df.shape[-1], b, c)
                                 for a, b, c in df.to_numpy()))
0 1 2
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
The list comprehension from @Ben is the fastest. However, if you aren't too concerned about speed, you may use apply with pd.Series.explode:
df.set_index('a').apply(pd.Series.explode).reset_index()
Or simply apply. On non-list columns, it will return the original values
df.apply(pd.Series.explode).reset_index(drop=True)
Out[42]:
a b c
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
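Note that multi-column explode was added to pandas itself in version 1.3, so on a recent enough pandas the direct call works; a sketch, assuming pandas >= 1.3 and list columns of equal per-row length:
# requires pandas >= 1.3
df.explode(['b', 'c']).reset_index(drop=True)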
I have the following Dataframe:
a b c d
0 1 4 9 2
1 2 5 8 7
2 4 6 2 3
3 3 2 7 5
I want to assign a number to each element in a row according to its order. The result should look like this:
a b c d
0 1 3 4 2
1 1 2 4 3
2 3 4 1 2
3 2 1 4 3
I tried to use the np.argsort function, which doesn't work. Does someone know an easy way to do this? Thanks.
Use DataFrame.rank:
df = df.rank(axis=1).astype(int)
print (df)
a b c d
0 1 3 4 2
1 1 2 4 3
2 3 4 1 2
3 2 1 4 3
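If you want to stick with np.argsort as you tried, a double argsort also yields the ranks; a sketch on a reconstruction of the original frame (note that, unlike rank, it breaks ties by position rather than averaging):
import numpy as np
import pandas as pd

# hypothetical reconstruction of the frame from the question
orig = pd.DataFrame({'a': [1, 2, 4, 3], 'b': [4, 5, 6, 2],
                     'c': [9, 8, 2, 7], 'd': [2, 7, 3, 5]})
# argsort of argsort gives 0-based ranks along each row; add 1 for 1-based ranks
ranked = pd.DataFrame(orig.values.argsort(axis=1).argsort(axis=1) + 1,
                      index=orig.index, columns=orig.columns)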
I have a DataFrame with two columns and I would like to stack the second column below the first and the first below the second.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
A B
0 1 4
1 2 5
2 3 6
Desired output:
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
So far I have tried:
pd.concat([df, df[['B','A']].rename(columns={'A':'B', 'B':'A'})])
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
Is this the cleanest way?
Concat is better if you ask me. But if you have 100 columns, renaming is a pain. As a generalized approach, here's one with numpy fliplr and vstack, i.e.
import numpy as np

v = df.values
pd.DataFrame(np.vstack((v, np.fliplr(v))), columns=df.columns)
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
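Another generalized sketch, assuming pandas >= 1.0, avoids renaming column by column: reverse the column labels with set_axis and let concat align by name.
# relabel the columns in reverse order, so A's values land under B and vice versa
swapped = df.set_axis(df.columns[::-1], axis=1)
out = pd.concat([df, swapped], ignore_index=True)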
For example, I have a DataFrame A as follows:
A
0
1
2
Now I want to insert every 2 rows of DataFrame B into A after every 1 row of A, where B is as follows:
B
3
3
4
4
5
5
Finally, I want
A
0
3
3
1
4
4
2
5
5
How can I achieve this?
One option is to take each dataframe's values, reshape, concatenate with np.hstack and then assign to a new dataframe.
In [533]: pd.DataFrame(np.hstack((df1.A.values.reshape(-1, 1),
                                  df2.B.values.reshape(-1, 2))).reshape(-1),
                       columns=['A'])
Out[533]:
A
0 0
1 3
2 3
3 1
4 4
5 4
6 2
7 5
8 5
Another solution with pd.concat and df.stack:
In [622]: pd.DataFrame(pd.concat([df1.A, pd.DataFrame(df2.B.values.reshape(-1, 2))], axis=1)
                         .stack().reset_index(drop=True),
                       columns=['A'])
Out[622]:
A
0 0
1 3
2 3
3 1
4 4
5 4
6 2
7 5
8 5
Setup
Consider the dataframes a and b
a = pd.DataFrame(dict(A=range(3)))
b = pd.DataFrame(dict(B=np.arange(3).repeat(2) + 3))
Solution
Use interleave from toolz or cytoolz
The trick is to split b into two arguments of interleave
from cytoolz import interleave
pd.Series(list(interleave([a.A, b.B[::2], b.B[1::2]])))
0 0
1 3
2 3
3 1
4 4
5 4
6 2
7 5
8 5
dtype: int64
This is a modification of @root's answer to my question.
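If you'd rather not pull in toolz/cytoolz, the same interleaving can be sketched with the standard library, using the a and b from the setup above:
from itertools import chain

# zip pairs each A value with its two B values; chain flattens the triples in order
pd.Series(list(chain.from_iterable(zip(a.A, b.B[::2], b.B[1::2]))))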
Maybe this one?
A = len(df1) + len(df2)
df1.index = list(range(0, A, 3))
# sorted() keeps the gap positions in a deterministic order instead of relying on set ordering
df2.index = sorted(set(range(0, A)) - set(range(0, A, 3)))
df2.columns = ['A']
df = pd.concat([df1, df2], axis=0).sort_index()
df
Out[188]:
A
0 0
1 3
2 3
3 1
4 4
5 4
6 2
7 5
8 5
If we first split a into len(a) one-element arrays and b into len(b)/2 two-element arrays, we can zip them together and concatenate the pieces in order.
a = np.split(dfa.A.values, len(dfa.A))
b = np.split(dfb.B.values, len(dfb.B) // 2)
# flatten each (one A value, two B values) pair in order, then concatenate
c = np.concatenate([arr for pair in zip(a, b) for arr in pair])
pd.Series(c)
Returns:
0 0
1 3
2 3
3 1
4 4
5 4
6 2
7 5
8 5
dtype: int64