Repeating rows of a dataframe based on a column value - python

I have a data frame like this:
df1 = pd.DataFrame({'a': [1,2],
'b': [3,4],
'c': [6,5]})
df1
Out[150]:
a b c
0 1 3 6
1 2 4 5
Now I want to create a df that repeats each row based on difference between col b and c plus 1. So diff between b and c for first row is 6-3 = 3. I want to repeat that row 3+1=4 times. Similarly for second row the difference is 5-4 = 1, so I want to repeat it 1+1=2 times. The column d is added to have value from min(b) to diff between b and c (i.e.6-3 = 3. So it goes from 3->6). So I want to get this df:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5

Do it with reindex + repeat, then using groupby cumcount assign the new value d
df1.reindex(df1.index.repeat(df1.eval('c-b').add(1))).\
assign(d=lambda x : x.c-x.groupby('a').cumcount(ascending=False))
Out[572]:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5

Related

How to pivot one column into multiple columns in a dataframe?

I have a dataframe of type:
a = ['a','b','c','a','b','c','a','b','c']
b = [0,1,2,3,4,5,6,7,8]
df = pd.DataFrame({'key':a,'values':b})
key values
0 a 0
1 b 1
2 c 2
3 a 3
4 b 4
5 c 5
6 a 6
7 b 7
8 c 8
I want to move the values in the "values" column to new columns where they have the same "key".
So result:
key values0 values1 values2
0 a 0 3 6
1 b 1 4 7
2 c 2 5 8
From this question How can I pivot a dataframe?
I've tried:
a=d1.pivot_table(index='key',values='values',aggfunc=list).squeeze()
pd.DataFrame(a.tolist(),index=a.index)
Which gives
0 1 2
key
a 0 3 6
b 1 4 7
c 2 5 8
But I don't want the index to be 'key', I want the index to stay the same.
You can use reset_index.
a = df.pivot_table(index='key',values='values',aggfunc=list).squeeze()
out = pd.DataFrame(a.tolist(),index=a.index).add_prefix('values').reset_index()
print(out)
# Output
key values0 values1 values2
0 a 0 3 6
1 b 1 4 7
2 c 2 5 8
Another way to do it:
out = (df.pivot_table('values', 'key', df.index // 3)
.add_prefix('values').reset_index())
print(out)
# Output
key values0 values1 values2
0 a 0 3 6
1 b 1 4 7
2 c 2 5 8
df["id"] = df.groupby("key").cumcount()
df.pivot(columns="id", index="key").reset_index()
# key values
# id 0 1 2
# 0 a 0 3 6
# 1 b 1 4 7
# 2 c 2 5 8

Autoincrement indexing after groupby with pandas on the original table

I cannot solve a very easy/simple problem in pandas. :(
I have the following table:
df = pd.DataFrame(data=dict(a=[1, 1, 1,2, 2, 3,1], b=["A", "A","B","A", "B", "A","A"]))
df
Out[96]:
a b
0 1 A
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 1 A
I would like to make an incrementing ID of each grouped (grouped by columns a and b) unique item. So the result would like like this (column c):
Out[98]:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
I tried with:
df.groupby(["a", "b"]).nunique().cumsum().reset_index()
Result:
Out[105]:
a b c
0 1 A 1
1 1 B 2
2 2 A 3
3 2 B 4
4 3 A 5
Unfortunatelly this works only for the grouped by dataset and not on the original dataset. As you can see in the original table I have 7 rows and the grouped by returns only 5.
So could someone please help me on how to get the desired table:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Thank you in advance!
groupby + ngroup
df['c'] = df.groupby(['a', 'b']).ngroup() + 1
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Use pd.factorize after create a tuple from (a, b) columns:
df['c'] = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))[0] + 1
print(df)
# Output
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1

Group by id and get all rows corresponds to min value in a column

I have a dataset with three columns A,B and C.
*
A B C
1 2 3
1 3 4
1 4 5
1 2 6
2 1 9
2 9 8
2 8 2
2 1 2
I need to get value of A,B,C columns corresponds to min B value grouped by A column.
As you can see I have duplicated values for A{1,2}B and A{2,1}B. If I do this command:
dataset['A,'B','C'].loc[dataset.groupby('A').B.idxmin()]
I get only first value of A,B,C for min B. But how can I get all rows? \
Output:
A B C
1 2 3
2 1 9
Output expected:
A B C
1 2 3
1 2 6
2 1 9
2 1 2
Use GroupBy.transform and compare by column B in boolean indexing:
df = dataset[dataset.groupby('A').B.transform('min').eq(dataset['B'])]
print (df)
A B C
0 1 2 3
3 1 2 6
4 2 1 9
7 2 1 2

Merge and split columns in pandas dataframe

I want to know how to merge multiple columns, and split them again.
Input data
A B C
1 3 5
2 4 6
Merge A, B, C to one column X
X
1
2
3
4
5
6
Process something with X, then split X into A, B, C again. The number of rows for A, B, C is same(2).
A B C
1 3 5
2 4 6
Is there any simple way for this work?
Start with df:
A B C
0 1 3 5
1 2 4 6
Next, get all values in one column:
df2 = df.unstack().reset_index(drop=True).rename('X').to_frame()
print(df2)
X
0 1
1 2
2 3
3 4
4 5
5 6
And, convert back to original shape:
df3 = pd.DataFrame(df2.values.reshape(2,-1, order='F'), columns=list('ABC'))
print(df3)
A B C
0 1 3 5
1 2 4 6
Setup
df=pd.DataFrame({'A': {0: 1, 1: 2}, 'B': {0: 3, 1: 4}, 'C': {0: 5, 1: 6}})
df
Out[684]:
A B C
0 1 3 5
1 2 4 6
Solution
Merge df to 1 column:
df2 = pd.DataFrame(df.values.flatten('F'),columns=['X'])
Out[686]:
X
0 1
1 2
2 3
3 4
4 5
5 6
Split it back to 3 columns:
pd.DataFrame(df2.values.reshape(-1,3,order='F'),columns=['A','B','C'])
Out[701]:
A B C
0 1 3 5
1 2 4 6
un unwind in the way you'd like, you need to either unstack or ravel with order='F'
Option 1
def proc1(df):
v = df.values
s = v.ravel('F')
s = s * 2
return pd.DataFrame(s.reshape(v.shape, order='F'), df.index, df.columns)
proc1(df)
A B C
0 2 6 10
1 4 8 12
Option 2
def proc2(df):
return df.unstack().mul(2).unstack(0)
proc2(df)
A B C
0 2 6 10
1 4 8 12

Finding the maximum entry based on another column in a data frame

Suppose I have a data frame with 3 columns: A, B, C. I want to group by column A, and find the row (for each unique A) with the maximum entry in C, so that I can store that row.A, row.B, row.C into a dictionary elsewhere.
What's the best way to do this without using iterrows?
# generate sample data
import pandas as pd
df = pd.DataFrame(np.random.randint(0,10,(10,3)))
df.columns = ['A','B','C']
# sort by C, group by A, take last row of each group
df.sort('C').groupby('A').nth(-1)
Here's another method. If df is the DataFrame, you can write df.groupby('A').apply(lambda d: d.ix[d['C'].argmax()]).
For example,
In [96]: df
Out[96]:
A B C
0 1 0 3
1 3 0 4
2 0 4 5
3 2 4 0
4 3 1 1
5 1 6 2
6 3 6 0
7 4 0 1
8 2 3 4
9 0 5 0
10 7 6 5
11 3 1 2
In [97]: g = df.groupby('A').apply(lambda d: d['C'].argmax())
In [98]: g
Out[98]:
A
0 2
1 0
2 8
3 1
4 7
7 10
dtype: int64
In [99]: df.ix[g.values]
Out[99]:
A B C
2 0 4 5
0 1 0 3
8 2 3 4
1 3 0 4
7 4 0 1
10 7 6 5

Categories