Pandas DataFrame assign hierarchical number to element - python

I have the following Dataframe:
a b c d
0 1 4 9 2
1 2 5 8 7
2 4 6 2 3
3 3 2 7 5
I want to assign a number to each element in a row according to its order. The result should look like this:
a b c d
0 1 3 4 2
1 1 2 4 3
2 3 4 1 2
3 2 1 4 3
I tried to use the np.argsort function, which doesn't work. Does someone know an easy way to do this? Thanks.

Use DataFrame.rank:
df = df.rank(axis=1).astype(int)
print (df)
a b c d
0 1 3 4 2
1 1 2 4 3
2 3 4 1 2
3 2 1 4 3
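A minimal, self-contained sketch of the rank approach, rebuilding the question's frame by hand. The `method="first"` argument is an addition not present in the answer above: it is assumed here so that tied values within a row would still get distinct whole-number ranks. The double `argsort` line is included only to illustrate that a single `np.argsort` call returns sort positions rather than ranks.

```python
import numpy as np
import pandas as pd

# Frame from the question, rebuilt by hand
df = pd.DataFrame({
    "a": [1, 2, 4, 3],
    "b": [4, 5, 6, 2],
    "c": [9, 8, 2, 7],
    "d": [2, 7, 3, 5],
})

# Rank across each row; method="first" (an assumption, not in the answer)
# breaks ties by column order so the result can be cast to int safely
ranked = df.rank(axis=1, method="first").astype(int)
print(ranked)

# argsort gives the positions that would sort each row;
# applying it twice turns those positions into 0-based ranks
ranks_np = np.argsort(np.argsort(df.to_numpy(), axis=1), axis=1) + 1
```

Without `method="first"`, tied values receive the average of their ranks, so the cast to `int` would truncate any fractional rank.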

Related

Pandas Add an incremental number based on another column

Consider a dataframe with a column like this:
sequence
1
2
3
4
5
1
2
3
1
2
3
4
5
6
7
I wish to create a column when the sequence resets. The sequence is of variable length.
Such that I'd get something like:
sequence run
1 1
2 1
3 1
4 1
5 1
1 2
2 2
3 2
1 3
2 3
3 3
4 3
5 3
6 3
7 3
Try with diff then cumsum
df['run'] = df['sequence'].diff().ne(1).cumsum()
Out[349]:
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 3
11 3
12 3
13 3
14 3
Name: sequence, dtype: int32
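The diff/cumsum idea can be broken into steps, as a sketch on the question's data. The intermediate names `step` and `starts_new_run` are illustrative only. The `eq(1)` variant at the end assumes every run restarts at exactly 1, which holds for this sample but is not stated in the question.

```python
import pandas as pd

df = pd.DataFrame({"sequence": [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7]})

# Difference to the previous row; NaN for the first row
step = df["sequence"].diff()

# A new run starts wherever the value did not increase by exactly 1
# (NaN != 1 is True, so the first row opens run 1)
starts_new_run = step.ne(1)

# The running count of run starts gives the run number
df["run"] = starts_new_run.cumsum()

# Variant, assuming each run always restarts at 1
run_alt = df["sequence"].eq(1).cumsum()
print(df)
```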
To number the repeated occurrences of each value instead, use groupby with cumcount:
dataset['run'] = dataset.groupby('sequence').cumcount().add(1)
output example:
sequence run
y 1
a 1
g 1
a 2
b 1
a 3
b 2

How to use two columns to distinguish data points in a pandas dataframe

I have a dataframe that looks as follows:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[[1,2,3],[1,2,3],[1,2,3]], 'c': [[4,5,6],[4,5,6],[4,5,6]]})
I want to explode the dataframe on columns b and c. I know that if we only use one column then we can do
df.explode('column_name')
However, I can't find a way to do this with two columns. Here is the desired output:
output = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c': [4,5,6,4,5,6,4,5,6]})
I have tried
df.explode(['a','b'])
but it does not work and gives me a
ValueError: column must be a scalar.
Thanks.
Let us try
df=pd.concat([df[x].explode() for x in ['b','c']],axis=1).join(df[['a']]).reindex(columns=df.columns)
Out[179]:
a b c
0 1 1 4
0 1 2 5
0 1 3 6
1 2 1 4
1 2 2 5
1 2 3 6
2 3 1 4
2 3 2 5
2 3 3 6
You can use itertools.chain along with zip to get your result:
from itertools import chain
# repeat a once per list element (len(b)), not once per column
pd.DataFrame(chain.from_iterable(zip([a] * len(b), b, c)
                                 for a, b, c in df.to_numpy()))
0 1 2
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
The list comprehension from @Ben is the fastest. However, if speed is not a major concern, you can use apply with pd.Series.explode:
df.set_index('a').apply(pd.Series.explode).reset_index()
Or simply use apply on the whole frame. Non-list columns are returned unchanged:
df.apply(pd.Series.explode).reset_index(drop=True)
Out[42]:
a b c
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
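On newer pandas there may be a shorter route: as far as I know, `DataFrame.explode` accepts a list of columns from pandas 1.3 onward, provided the lists in each row have matching lengths. A sketch under that version assumption, using the question's frame (the final `astype` is optional and only converts the object columns back to integers):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    "c": [[4, 5, 6], [4, 5, 6], [4, 5, 6]],
})

# Explode both list columns in lockstep (assumes pandas >= 1.3);
# ignore_index=True renumbers the rows 0..n-1
exploded = df.explode(["b", "c"], ignore_index=True)

# The exploded columns come back as object dtype, so cast them explicitly
exploded = exploded.astype({"b": "int64", "c": "int64"})
print(exploded)
```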

Pandas Dataframe: Update values in a certain columns for last n rows

In the example below, I want to update column C for the last 3 rows to the value 0.
Source Dataframe
A B C D E
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
Target Dataframe
A B C D E
1 1 1 1 1
2 2 2 2 2
3 3 0 3 3
4 4 0 4 4
5 5 0 5 5
I tried something like
df.tail(3)['C']=0
but it does not work. Any idea?
You can settle for
df.loc[df.tail(3).index, 'C'] = 0
You can use iloc with the column's integer position, which avoids chained indexing:
df.iloc[-3:, df.columns.get_loc('C')] = 0
Output:
A B C D E
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 0 3 3
3 4 4 0 4 4
4 5 5 0 5 5
Another way:
df.loc[df.index[-3:], 'C'] = 0
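Chained forms such as `df.tail(3)['C'] = 0` assign into an intermediate object, so the change may never reach `df`. A small sketch of the single-step alternative, rebuilding the source frame from the question. The variable `n` and the two copies `by_label` and `by_position` are illustrative names, not part of the original answers.

```python
import pandas as pd

# Source frame from the question: every column holds 1..5
df = pd.DataFrame({col: [1, 2, 3, 4, 5] for col in "ABCDE"})
n = 3  # number of trailing rows to overwrite

# Label-based: select the last n index labels and the column in one .loc call
by_label = df.copy()
by_label.loc[by_label.index[-n:], "C"] = 0

# Position-based: useful when index labels are not unique
by_position = df.copy()
by_position.iloc[-n:, by_position.columns.get_loc("C")] = 0

print(by_label)
```

The position-based form is the safer choice when the index contains duplicate labels, since `.loc` would then match every row that shares a label.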

Adding columns to DataFrame from other DataFrame without intersection

I have two DataFrames with different sizes and columns. I need to add the columns from one DataFrame to the other, filling every row with the same data.
For instance:
one of them:
Out[48]:
A B
0 1 2
1 1 2
2 1 2
3 1 2
and the other
Out[49]:
C D
0 3 4
I want to have a new one as:
A B C D
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
Is it possible?
You can use assign with the single row taken as a pd.Series (here df is the four-row frame and df1 is the one-row frame):
df.assign(**df1.loc[0])
Out[11]:
A B C D
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
Using join with ffill:
df1.join(df2).ffill().astype(int)
A B C D
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
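Both answers can be reproduced end to end with the frames from the question. The sketch below uses the names `df1` (four rows) and `df2` (one row) and also shows a cross join, which is not in the answers above and assumes pandas 1.2 or later for `how="cross"`.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 1, 1, 1], "B": [2, 2, 2, 2]})
df2 = pd.DataFrame({"C": [3], "D": [4]})

# Option 1: broadcast the single row of df2 as new constant columns
out_assign = df1.assign(**df2.iloc[0].to_dict())

# Option 2 (assumes pandas >= 1.2): a cross join pairs every row of df1
# with every row of df2; with one row in df2 the row count is unchanged
out_cross = df1.merge(df2, how="cross")

print(out_assign)
```

Unlike the join/ffill approach, neither option introduces NaN values, so no cast back to integers is needed.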

tracking maximum value in dataframe column

I want to produce a column B in a dataframe that tracks the maximum value reached in column A since row Index 0.
A B
Index
0 1 1
1 2 2
2 3 3
3 2 3
4 1 3
5 3 3
6 4 4
7 2 4
I want to avoid iterating, so is there a vectorized solution, and if so, what would it look like?
You're looking for cummax:
In [257]:
df['B'] = df['A'].cummax()
df
Out[257]:
A B
Index
0 1 1
1 2 2
2 3 3
3 2 3
4 1 3
5 3 3
6 4 4
7 2 4
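For completeness, a runnable version of the cummax answer with column A rebuilt from the question. The extra `new_high` flag is an illustrative addition, marking the rows where A sets a new running maximum or ties the current one.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 2, 1, 3, 4, 2]})
df.index.name = "Index"

# Running maximum of A from the first row down to the current row
df["B"] = df["A"].cummax()

# Illustrative extra: True where A equals the running maximum
new_high = df["A"].eq(df["B"])
print(df)
```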
