Duplicating each row in a dataframe with counts

Duplicating each row in a dataframe with counts - python

For each row in a dataframe, I wish to create duplicates of it with an additional column to identify each duplicate.
E.g Original dataframe is
A | A
B | B
I wish to make make duplicate of each row with an additional column to identify it. Resulting in:
A | A | 1
A | A | 2
B | B | 1
B | B | 2

You can use df.reindex followed by a groupby on df.index.
df = df.reindex(df.index.repeat(2))
df['count'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Similarly, using reindex and assign with np.tile:
df = df.reindex(df.index.repeat(2))\
.assign(count=np.tile(df.index, 2) + 1)\
.reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2

Use Index.repeat with loc, for count groupby with cumcount:
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
print (df)
a b
0 A A
1 B B
df = df.loc[df.index.repeat(2)]
df['new'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Or:
df = df.loc[df.index.repeat(2)]
df['new'] = np.tile(range(int(len(df.index)/2)), 2) + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2

Setup
Borrowed from #jezrael
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
a b
0 A A
1 B B
Solution 1
Create a pd.MultiIndex with pd.MultiIndex.from_product
Then use pd.DataFrame.reindex
idx = pd.MultiIndex.from_product(
[df.index, [1, 2]],
names=[df.index.name, 'New']
)
df.reindex(idx, level=0).reset_index('New')
New a b
0 1 A A
0 2 A A
1 1 B B
1 2 B B
Solution 2
This uses the same loc and reindex concept used by #cᴏʟᴅsᴘᴇᴇᴅ and #jezrael, but simplifies the final answer by using list and int multiplication rather than np.tile.
df.loc[df.index.repeat(2)].assign(New=[1, 2] * len(df))
a b New
0 A A 1
0 A A 2
1 B B 1
1 B B 2

Use pd.concat() to repeat, and then groupby with cumcount() to count:
In [24]: df = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['A', 'B']})
In [25]: df
Out[25]:
col1 col2
0 A A
1 B B
In [26]: df_repeat = pd.concat([df]*3).sort_index()
In [27]: df_repeat
Out[27]:
col1 col2
0 A A
0 A A
0 A A
1 B B
1 B B
1 B B
In [28]: df_repeat["count"] = df_repeat.groupby(level=0).cumcount() + 1
In [29]: df_repeat # df_repeat.reset_index(drop=True); if index reset required.
Out[29]:
col1 col2 count
0 A A 1
0 A A 2
0 A A 3
1 B B 1
1 B B 2
1 B B 3

Related

copy value from one column to another if condition is met in pandas

I have a dataframe:
df =
col1
Aciton1
1
A
2
B
3
C
1
C
I want to edit df, such that when the value of col1 is bigger than 1 , take the value from Action1.
So I will get:
col1
Aciton1
Previous_Aciton
1
A
Initial Action
2
B
A
3
C
B
1
C
Initial Action

If there are no negative and 0 values and each group starting by 1 is possible use DataFrameGroupBy.shift:
df['Previsou_Aciton'] = df.groupby(df['col1'].eq(1).cumsum())['Aciton1'].shift(fill_value='Initail Action')
print (df)
col1 Aciton1 Previsou_Aciton
0 1 A Initail Action
1 2 B A
2 3 C B
3 1 C Initail Action

Use:
df = pd.DataFrame({'col1': [1,2,3], 'act1': ['A', 'B', 'C'], 'Previsou_Aciton':['G', 'H', 'I']})
temp = df[df['col1']>1].index
df['Previsou_Aciton'].loc[temp]=df.loc[[x-1 for x in temp]]['act1'].values
input df:
the result:

Copy data from a row to another row in Pandas dataframe based on condition

Let's say I have a (pandas) dataframe like this:
Index A ID B C
1 a 1 0 0
2 b 2 0 0
3 c 2 a a
4 d 3 0 0
I want to copy the data of the third row to the second row, because their IDs are matching, but the data is not filled. However, I want to leave column 'A' intact. Looking for a result like this:
Index A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
What would you suggest as solution?

You can try replacing '0' with NaN then ffill()+bfill() using groupby()+apply():
df[['B','C']]=df[['B','C']].replace('0',float('NaN'))
df[['B','C']]=df.groupby('ID')[['B','C']].apply(lambda x:x.ffill().bfill()).fillna('0')
output of df:
Index A ID B C
0 1 a 1 0 0
1 2 b 2 a a
2 3 c 2 a a
3 4 d 3 0 0
Note: you can also use transform() method in place of apply() method

You can use combine_first:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
ID A B C Index
0 1 a 0 0 1.0
1 2 b a a 2.0
2 2 c a a 3.0
3 3 d 0 0 4.0

import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]
y = df[columns_to_replace].loc[index_to_use_to_replace]
df.loc[index_to_be_replaced] = pd.concat([x, y])
print(df)
Does it solve your problem? I would check on other pandas functions, as well. Like join, merge.
❯ python3 b.py
A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0

map DataFrame index and forward fill nan values

I have a DataFrame with integer indexes that are missing some values (i.e. not equally spaced), I want to create a new DataFrame with equally spaced index values and forward fill column values. Below is a simple example:
have
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
0
0 A
2 B
4 C
want to use above and create:
0
0 A
1 A
2 B
3 B
4 C

Use reindex with method='ffill':
df = df.reindex(np.arange(0, df.index.max()+1), method='ffill')
Or:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print (df)
0
0 A
1 A
2 B
3 B
4 C

Using reindex and ffill:
df = df.reindex(range(df.index[0],df.index[-1]+1)).ffill()
print(df)
0
0 A
1 A
2 B
3 B
4 C

You can do this:
In [319]: df.reindex(list(range(df.index.min(),df.index.max()+1))).ffill()
Out[319]:
0
0 A
1 A
2 B
3 B
4 C

How to reshape dataframe if they have same index?

If I have a dataframe like
df= pd.DataFrame(['a','b','c','d'],index=[0,0,1,1])
0
0 a
0 b
1 c
1 d
How can I reshape the dataframe based on index like below i.e
df= pd.DataFrame([['a','b'],['c','d']],index=[0,1])
0 1
0 a b
1 c d

Let's use set_index, groupby, cumcount, and unstack:
df.set_index(df.groupby(level=0).cumcount(), append=True)[0].unstack()
Output:
0 1
0 a b
1 c d

You can use pivot with cumcount :
a = df.groupby(level=0).cumcount()
df = pd.pivot(index=df.index, columns=a, values=df[0])

Couple of ways
1.
In [490]: df.groupby(df.index)[0].agg(lambda x: list(x)).apply(pd.Series)
Out[490]:
0 1
0 a b
1 c d
2.
In [447]: df.groupby(df.index).apply(lambda x: pd.Series(x.values.tolist()).str[0])
Out[447]:
0 1
0 a b
1 c d
3.
In [455]: df.assign(i=df.index, c=df.groupby(level=0).cumcount()).pivot('i', 'c', 0)
Out[455]:
c 0 1
i
0 a b
1 c d
to remove names
In [457]: (df.assign(i=df.index, c=df.groupby(level=0).cumcount()).pivot('i', 'c', 0)
.rename_axis(None).rename_axis(None, 1))
Out[457]:
0 1
0 a b
1 c d

How to make columns in dataframe to be unique?

I wanted to apply one-hot encoding (it isn't important to understand the question) to my dataframe this way:
train = pd.concat([train, pd.get_dummies(train['Canal_ID'])], axis=1, join_axes=[train.index])
train.drop([11,'Canal_ID'],axis=1, inplace = True)
train = pd.concat([train, pd.get_dummies(train['Agencia_ID'])], axis=1, join_axes=[train.index])
train.drop([1382,'Agencia_ID'],axis=1, inplace = True)
Unfortunately, original dataframe had number as values, this is why after getting dummies variables, there are a lot of columns with the same name. How can I make them unique?

Try this: get_dummies has a "prefix" method
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
'C': [1, 2, 3]})
pd.get_dummies(df, prefix=['col1', 'col2'])
C col1_a col1_b col2_a col2_b col2_c
0 1 1 0 0 1 0
1 2 0 1 1 0 0
2 3 1 0 0 0 1

You can set new column names by range with shape:
df.columns = range(df.shape[1])
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
print (df.shape)
(3, 6)
df.columns = range(df.shape[1])
print (df)
0 1 2 3 4 5
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3

I would append a random number to the original id of the columns.
new_cols = train.columns
new_cols = new_cols.map(lambda x: "{}-{}".format(x, randint(0,100))
train.columns = new_cols

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Duplicating each row in a dataframe with counts - python

For each row in a dataframe, I wish to create duplicates of it with an additional column to identify each duplicate. E.g Original dataframe is A | A B | B I wish to make make duplicate of each row with an additional column to identify it. Resulting in: A | A | 1 A | A | 2 B | B | 1 B | B | 2

Related

copy value from one column to another if condition is met in pandas

Copy data from a row to another row in Pandas dataframe based on condition

map DataFrame index and forward fill nan values

How to reshape dataframe if they have same index?

How to make columns in dataframe to be unique?

Categories

Resources