pandas - how can I replace rows in a dataframe - python

I am new to Python and am trying to replace rows.
I have a dataframe such as:
X  Y
1  a
2  d
3  c
4  a
5  b
6  e
7  a
8  b
I have two questions:
1- How can I swap the 2nd row with the 5th row, like this:
X  Y
1  a
5  b
3  c
4  a
2  d
6  e
7  a
8  b
2- How can I move the 6th row above the 3rd row, like this:
X  Y
1  a
2  d
6  e
3  c
4  a
5  b
7  a
8  b

First use DataFrame.iloc; Python counts from 0, so to select the second row use 1 and for the fifth use 4. Convert the right-hand side to a NumPy array (for example with .to_numpy()) so pandas does not align it back onto the original index:
df.iloc[[1, 4]] = df.iloc[[4, 1]].to_numpy()
print (df)
X Y
0 1 a
1 5 b
2 3 c
3 4 a
4 2 d
5 6 e
6 7 a
7 8 b
And then, for the second question (starting again from the original df), rename the index of the row you want to move to match the row just above the target position (here 5 to 1) and sort with a stable sort (mergesort), so the renamed row lands right after it:
df = df.rename({5:1}).sort_index(kind='mergesort', ignore_index=True)
print (df)
X Y
0 1 a
1 2 d
2 6 e
3 3 c
4 4 a
5 5 b
6 7 a
7 8 b
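Not from the answer above, but a minimal self-contained sketch of the same idea: you can also spell out the desired row order by position and pass it to iloc (the example data is rebuilt here so the snippet runs on its own):
import pandas as pd

df = pd.DataFrame({'X': range(1, 9), 'Y': list('adcabeab')})

# question 2: move the 6th row (position 5) above the 3rd row (position 2)
order = [0, 1, 5, 2, 3, 4, 6, 7]
print (df.iloc[order].reset_index(drop=True))
For question 1 the same pattern works with order = [0, 4, 2, 3, 1, 5, 6, 7].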

Related

Pandas - combine two Series by all unique combinations

Let's say I have the following series:
0 A
1 B
2 C
dtype: object
0 1
1 2
2 3
3 4
dtype: int64
How can I merge them to create a dataframe with every possible combination of their values, like this:
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
Assuming the two Series are s and s1, use itertools.product(), which gives the Cartesian product of the input iterables:
import itertools
import pandas as pd

df = pd.DataFrame(list(itertools.product(s, s1)), columns=['letter', 'number'])
print(df)
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
As of pandas 1.2.0, there is a how='cross' option in pandas.merge() that produces the Cartesian product of the two DataFrames.
import pandas as pd
letters = pd.DataFrame({'letter': ['A','B','C']})
numbers = pd.DataFrame({'number': [1,2,3,4]})
together = pd.merge(letters, numbers, how = 'cross')
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
As an additional bonus, this function makes it easy to do so with more than one column.
letters = pd.DataFrame({'letterA': ['A', 'B', 'C'],
                        'letterB': ['D', 'D', 'E']})
numbers = pd.DataFrame({'number': [1, 2, 3, 4]})
together = pd.merge(letters, numbers, how='cross')
letterA letterB number
0 A D 1
1 A D 2
2 A D 3
3 A D 4
4 B D 1
5 B D 2
6 B D 3
7 B D 4
8 C E 1
9 C E 2
10 C E 3
11 C E 4
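If you start from the two Series s and s1 from the question rather than from DataFrames, a minimal sketch along the same lines (the column names letter and number are just picked for illustration):
import pandas as pd

s = pd.Series(['A', 'B', 'C'])
s1 = pd.Series([1, 2, 3, 4])

# to_frame() turns each Series into a one-column DataFrame for the cross merge
together = pd.merge(s.to_frame('letter'), s1.to_frame('number'), how='cross')
print (together)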
If you have two Series named s1 and s2, you can do this:
pd.DataFrame(index=s1,columns=s2).unstack().reset_index()[["s1","s2"]]
It will give you the following:
s1 s2
0 A 1
1 B 1
2 C 1
3 A 2
4 B 2
5 C 2
6 A 3
7 B 3
8 C 3
9 A 4
10 B 4
11 C 4
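For reference, a minimal sketch of the setup this one-liner assumes; the Series really do need to be named s1 and s2, because reset_index() uses the index and column names to label the new columns:
import pandas as pd

s1 = pd.Series(['A', 'B', 'C'], name='s1')
s2 = pd.Series([1, 2, 3, 4], name='s2')

# empty frame indexed by s1 with columns s2; unstack then enumerates every (s2, s1) pair
out = pd.DataFrame(index=s1, columns=s2).unstack().reset_index()[['s1', 's2']]
print (out)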
You can use pandas.MultiIndex.from_product():
import pandas as pd
pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [
            ['A', 'B', 'C'],
            [1, 2, 3, 4]
        ],
        names=['letters', 'numbers']
    )
)
which results in a hierarchical structure:
letters numbers
A       1
        2
        3
        4
B       1
        2
        3
        4
C       1
        2
        3
        4
and you can further call .reset_index() to get ungrouped results:
letters numbers
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
(However I find #NickCHK's answer to be the best)

Repeating rows of a dataframe based on a column value

I have a data frame like this:
df1 = pd.DataFrame({'a': [1, 2],
                    'b': [3, 4],
                    'c': [6, 5]})
df1
Out[150]:
a b c
0 1 3 6
1 2 4 5
Now I want to create a df that repeats each row based on the difference between columns c and b, plus 1. The difference for the first row is 6 - 3 = 3, so I want to repeat that row 3 + 1 = 4 times. Similarly, for the second row the difference is 5 - 4 = 1, so I want to repeat it 1 + 1 = 2 times. A column d is also added that runs from b up to c within each repeated block (for the first row it goes from 3 to 6). So I want to get this df:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5
Do it with reindex + Index.repeat, then assign the new column d using groupby with cumcount:
(df1.reindex(df1.index.repeat(df1.eval('c - b').add(1)))
    .assign(d=lambda x: x.c - x.groupby('a').cumcount(ascending=False)))
Out[572]:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5
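For readability, here is a minimal sketch of the same idea broken into separate steps (it groups on the repeated index instead of column a, which gives the same blocks for this data):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [6, 5]})

reps = df1['c'] - df1['b'] + 1                 # 4 and 2 repetitions
out = df1.loc[df1.index.repeat(reps)].copy()   # duplicate the rows
# d counts up from b to c within each repeated block
out['d'] = out['c'] - out.groupby(level=0).cumcount(ascending=False)
print (out)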

pandas delete a cell and shift up the column

I am using pandas with python.
I have a column in which the first value is zero.
There are other zeros in the column as well, but I don't want to delete them.
I want to delete just this one cell and move the rest of the column up by one position.
If it is easier, I can make the first zero an empty cell and then delete it, but I can't find anything that just deletes a specific cell and shifts the rest of the column up.
So far I have searched Stack Overflow, Quora, GitHub and so on, but I can't find what I am looking for.
I believe you need shift first and then replace the last NaN value:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
If the column contains no NaNs, just use fillna to replace the NaN created by the shift:
df['A'] = df['A'].shift(-1).fillna('AAA')
print (df)
A B C D E F
0 b 4 7 1 5 a
1 c 5 8 3 3 a
2 d 4 9 5 6 a
3 e 5 4 7 9 b
4 f 5 2 1 2 b
5 AAA 4 3 0 4 b
If the column may already contain some NaNs, set the last value with iloc instead; get_loc returns the position of column A:
df['A'] = df['A'].shift(-1)
df.iloc[-1, df.columns.get_loc('A')] = 'AAA'
print (df)
A B C D E F
0 b 4 7 1 5 a
1 c 5 8 3 3 a
2 d 4 9 5 6 a
3 e 5 4 7 9 b
4 f 5 2 1 2 b
5 AAA 4 3 0 4 b
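If the cell you want to drop is not the first one, a minimal sketch of the same shift-up idea; pos is a hypothetical variable marking the position of the cell to delete:
import pandas as pd

df = pd.DataFrame({'A': [0, 5, 0, 7, 2], 'B': [1, 2, 3, 4, 5]})

pos = 2                                      # hypothetical: drop only the second zero
col = df['A'].drop(df.index[pos]).tolist()   # remove that single value
col.append(pd.NA)                            # pad the bottom so the length still matches
df['A'] = col
print (df)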

What is the equivalent of a SQL count in Pandas

In SQL, select a.*, count(a.id) as N from table a group by a.name would give me a new column 'N' containing the count as per my group by specification.
However in pandas, if I try df['name'].value_counts(), I get the count but not as a column in the original dataframe.
Is there a way to get the count as a column in the original dataframe in a single step/statement?
It seems you need groupby + transform with the size function:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'name': list('aaabcc')})
print (df)
A B C D E name
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 c
5 f 4 3 0 4 c
df['new'] = df.groupby('name')['name'].transform('size')
print (df)
A B C D E name new
0 a 4 7 1 5 a 3
1 b 5 8 3 3 a 3
2 c 4 9 5 6 a 3
3 d 5 4 7 9 b 1
4 e 5 2 1 2 c 2
5 f 4 3 0 4 c 2
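An alternative sketch of the same result, not from the answer above, maps each name onto its frequency from value_counts:
import pandas as pd

df = pd.DataFrame({'name': list('aaabcc')})

# value_counts() gives the per-name frequencies; map looks them up row by row
df['new'] = df['name'].map(df['name'].value_counts())
print (df)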
See also: What is the difference between size and count in pandas?

top k columns with values in pandas dataframe for every row

I have a pandas dataframe like the following:
A B C D
0 7 2 5 2
1 3 3 1 1
2 0 2 6 1
3 3 6 2 9
There can be hundreds of columns; in the above example I have only shown 4.
I would like to extract the top-k columns for each row, along with their values.
I can get the top-k columns using:
pd.DataFrame({n: df.T[column].nlargest(k).index.tolist() for n, column in enumerate(df.T)}).T
which, for k=3 gives:
0 1 2
0 A C B
1 A B C
2 C B D
3 D B A
But what I would like to have is:
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
Is there a pand(a)oic way to achieve this?
You can use a numpy solution:
- numpy.argsort to get the column order
- the array is already sorted (thanks Jeff), so take the values by those indices
- interweave names and values into a new array
- pass the result to the DataFrame constructor
import numpy as np
import pandas as pd

k = 3
vals = df.values
arr1 = np.argsort(-vals, axis=1)
a = df.columns.to_numpy()[arr1[:, :k]]
b = vals[np.arange(len(df.index))[:, None], arr1][:, :k]
c = np.empty((vals.shape[0], 2 * k), dtype=a.dtype)
c[:, 0::2] = a
c[:, 1::2] = b
print (c)
[['A' 7 'C' 5 'B' 2]
['A' 3 'B' 3 'C' 1]
['C' 6 'B' 2 'D' 1]
['D' 9 'B' 6 'A' 3]]
df = pd.DataFrame(c)
print (df)
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
>>> def foo(x):
... r = []
... for p in zip(list(x.index), list(x)):
... r.extend(p)
... return r
...
>>> pd.DataFrame({n: foo(df.T[row].nlargest(k)) for n, row in enumerate(df.T)}).T
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
Or, using list comprehension:
>>> def foo(x):
... return [j for i in zip(list(x.index), list(x)) for j in i]
...
>>> pd.DataFrame({n: foo(df.T[row].nlargest(k)) for n, row in enumerate(df.T)}).T
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
This does the job efficiently: it uses argpartition, which finds the k biggest in linear time, and then sorts only those.
import numpy as np
import pandas as pd

values = df.values
n, m = df.shape
k = 3  # number of top columns to keep
I, J = np.mgrid[:n, :m]
I = I[:, :1]
if k < m:
    J = (-values).argpartition(k)[:, :k]
values = values[I, J]
names = np.take(df.columns, J)
J2 = (-values).argsort()
names = names[I, J2]
values = values[I, J2]
names_and_values = np.empty((n, 2 * k), object)
names_and_values[:, 0::2] = names
names_and_values[:, 1::2] = values
result = pd.DataFrame(names_and_values)
which gives:
0 1 2 3 4 5
0 A 7 C 5 B 2
1 B 3 A 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
