Duplicating each row in a dataframe with counts - python

For each row in a dataframe, I wish to create duplicates of it with an additional column to identify each duplicate.
E.g Original dataframe is
A | A
B | B
I wish to make make duplicate of each row with an additional column to identify it. Resulting in:
A | A | 1
A | A | 2
B | B | 1
B | B | 2

You can use df.reindex followed by a groupby on df.index.
df = df.reindex(df.index.repeat(2))
df['count'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Similarly, using reindex and assign with np.tile:
df = df.reindex(df.index.repeat(2))\
.assign(count=np.tile(df.index, 2) + 1)\
.reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2

Use Index.repeat with loc, for count groupby with cumcount:
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
print (df)
a b
0 A A
1 B B
df = df.loc[df.index.repeat(2)]
df['new'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Or:
df = df.loc[df.index.repeat(2)]
df['new'] = np.tile(range(int(len(df.index)/2)), 2) + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2

Setup
Borrowed from #jezrael
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
a b
0 A A
1 B B
Solution 1
Create a pd.MultiIndex with pd.MultiIndex.from_product
Then use pd.DataFrame.reindex
idx = pd.MultiIndex.from_product(
[df.index, [1, 2]],
names=[df.index.name, 'New']
)
df.reindex(idx, level=0).reset_index('New')
New a b
0 1 A A
0 2 A A
1 1 B B
1 2 B B
Solution 2
This uses the same loc and reindex concept used by #cᴏʟᴅsᴘᴇᴇᴅ and #jezrael, but simplifies the final answer by using list and int multiplication rather than np.tile.
df.loc[df.index.repeat(2)].assign(New=[1, 2] * len(df))
a b New
0 A A 1
0 A A 2
1 B B 1
1 B B 2

Use pd.concat() to repeat, and then groupby with cumcount() to count:
In [24]: df = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['A', 'B']})
In [25]: df
Out[25]:
col1 col2
0 A A
1 B B
In [26]: df_repeat = pd.concat([df]*3).sort_index()
In [27]: df_repeat
Out[27]:
col1 col2
0 A A
0 A A
0 A A
1 B B
1 B B
1 B B
In [28]: df_repeat["count"] = df_repeat.groupby(level=0).cumcount() + 1
In [29]: df_repeat # df_repeat.reset_index(drop=True); if index reset required.
Out[29]:
col1 col2 count
0 A A 1
0 A A 2
0 A A 3
1 B B 1
1 B B 2
1 B B 3

Related

copy value from one column to another if condition is met in pandas

I have a dataframe:
df =
col1
Aciton1
1
A
2
B
3
C
1
C
I want to edit df, such that when the value of col1 is bigger than 1 , take the value from Action1.
So I will get:
col1
Aciton1
Previous_Aciton
1
A
Initial Action
2
B
A
3
C
B
1
C
Initial Action
If there are no negative and 0 values and each group starting by 1 is possible use DataFrameGroupBy.shift:
df['Previsou_Aciton'] = df.groupby(df['col1'].eq(1).cumsum())['Aciton1'].shift(fill_value='Initail Action')
print (df)
col1 Aciton1 Previsou_Aciton
0 1 A Initail Action
1 2 B A
2 3 C B
3 1 C Initail Action
Use:
df = pd.DataFrame({'col1': [1,2,3], 'act1': ['A', 'B', 'C'], 'Previsou_Aciton':['G', 'H', 'I']})
temp = df[df['col1']>1].index
df['Previsou_Aciton'].loc[temp]=df.loc[[x-1 for x in temp]]['act1'].values
input df:
the result:

Copy data from a row to another row in Pandas dataframe based on condition

Let's say I have a (pandas) dataframe like this:
Index A ID B C
1 a 1 0 0
2 b 2 0 0
3 c 2 a a
4 d 3 0 0
I want to copy the data of the third row to the second row, because their IDs are matching, but the data is not filled. However, I want to leave column 'A' intact. Looking for a result like this:
Index A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
What would you suggest as solution?
You can try replacing '0' with NaN then ffill()+bfill() using groupby()+apply():
df[['B','C']]=df[['B','C']].replace('0',float('NaN'))
df[['B','C']]=df.groupby('ID')[['B','C']].apply(lambda x:x.ffill().bfill()).fillna('0')
output of df:
Index A ID B C
0 1 a 1 0 0
1 2 b 2 a a
2 3 c 2 a a
3 4 d 3 0 0
Note: you can also use transform() method in place of apply() method
You can use combine_first:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
ID A B C Index
0 1 a 0 0 1.0
1 2 b a a 2.0
2 2 c a a 3.0
3 3 d 0 0 4.0
import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]
y = df[columns_to_replace].loc[index_to_use_to_replace]
df.loc[index_to_be_replaced] = pd.concat([x, y])
print(df)
Does it solve your problem? I would check on other pandas functions, as well. Like join, merge.
❯ python3 b.py
A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0

map DataFrame index and forward fill nan values

I have a DataFrame with integer indexes that are missing some values (i.e. not equally spaced), I want to create a new DataFrame with equally spaced index values and forward fill column values. Below is a simple example:
have
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
0
0 A
2 B
4 C
want to use above and create:
0
0 A
1 A
2 B
3 B
4 C
Use reindex with method='ffill':
df = df.reindex(np.arange(0, df.index.max()+1), method='ffill')
Or:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print (df)
0
0 A
1 A
2 B
3 B
4 C
Using reindex and ffill:
df = df.reindex(range(df.index[0],df.index[-1]+1)).ffill()
print(df)
0
0 A
1 A
2 B
3 B
4 C
You can do this:
In [319]: df.reindex(list(range(df.index.min(),df.index.max()+1))).ffill()
Out[319]:
0
0 A
1 A
2 B
3 B
4 C

How to reshape dataframe if they have same index?

If I have a dataframe like
df= pd.DataFrame(['a','b','c','d'],index=[0,0,1,1])
0
0 a
0 b
1 c
1 d
How can I reshape the dataframe based on index like below i.e
df= pd.DataFrame([['a','b'],['c','d']],index=[0,1])
0 1
0 a b
1 c d
Let's use set_index, groupby, cumcount, and unstack:
df.set_index(df.groupby(level=0).cumcount(), append=True)[0].unstack()
Output:
0 1
0 a b
1 c d
You can use pivot with cumcount :
a = df.groupby(level=0).cumcount()
df = pd.pivot(index=df.index, columns=a, values=df[0])
Couple of ways
1.
In [490]: df.groupby(df.index)[0].agg(lambda x: list(x)).apply(pd.Series)
Out[490]:
0 1
0 a b
1 c d
2.
In [447]: df.groupby(df.index).apply(lambda x: pd.Series(x.values.tolist()).str[0])
Out[447]:
0 1
0 a b
1 c d
3.
In [455]: df.assign(i=df.index, c=df.groupby(level=0).cumcount()).pivot('i', 'c', 0)
Out[455]:
c 0 1
i
0 a b
1 c d
to remove names
In [457]: (df.assign(i=df.index, c=df.groupby(level=0).cumcount()).pivot('i', 'c', 0)
.rename_axis(None).rename_axis(None, 1))
Out[457]:
0 1
0 a b
1 c d

How to make columns in dataframe to be unique?

I wanted to apply one-hot encoding (it isn't important to understand the question) to my dataframe this way:
train = pd.concat([train, pd.get_dummies(train['Canal_ID'])], axis=1, join_axes=[train.index])
train.drop([11,'Canal_ID'],axis=1, inplace = True)
train = pd.concat([train, pd.get_dummies(train['Agencia_ID'])], axis=1, join_axes=[train.index])
train.drop([1382,'Agencia_ID'],axis=1, inplace = True)
Unfortunately, original dataframe had number as values, this is why after getting dummies variables, there are a lot of columns with the same name. How can I make them unique?
Try this: get_dummies has a "prefix" method
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
'C': [1, 2, 3]})
pd.get_dummies(df, prefix=['col1', 'col2'])
C col1_a col1_b col2_a col2_b col2_c
0 1 1 0 0 1 0
1 2 0 1 1 0 0
2 3 1 0 0 0 1
You can set new column names by range with shape:
df.columns = range(df.shape[1])
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
print (df.shape)
(3, 6)
df.columns = range(df.shape[1])
print (df)
0 1 2 3 4 5
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
I would append a random number to the original id of the columns.
new_cols = train.columns
new_cols = new_cols.map(lambda x: "{}-{}".format(x, randint(0,100))
train.columns = new_cols

Categories