Repeating rows in a DataFrame based on a column - python

I have a dataframe now:
class1 class2 value value2
0 1 0 1 4
1 2 1 2 3
2 2 0 3 5
3 3 1 4 6
I want to repeat each row according to the difference between value and value2 (that is, value2 - value + 1 times) and add an incrementing column that counts from value up to value2. The resulting dataframe should look like this:
class1 class2 value value2 value3
0 1 0 1 4 1
1 1 0 1 4 2
2 1 0 1 4 3
3 1 0 1 4 4
4 2 1 2 3 2
5 2 1 2 3 3
6 2 0 3 5 3
7 2 0 3 5 4
8 2 0 3 5 5
9 3 1 4 6 4
10 3 1 4 6 5
11 3 1 4 6 6
I tried it like:
def func(x):
    copy = x.copy()
    num = x.value2+1-x.value
    return pd.concat([copy]*num.values[0])
df = df.groupby(['class1','class2']).apply(lambda x:func(x))
But this causes an ordering problem, so I don't know how to add the value3 column. I'd also like a more elegant way of doing it.
Can anyone help me? Thanks in advance

Compute the difference and call Index.repeat:
idx = df.index.repeat(df.value2 - df.value + 1)
Now, either use reindex:
df = df.reindex(idx).reset_index(drop=True)
Or loc:
df = df.loc[idx].reset_index(drop=True)
And you get
df
class1 class2 value value2
0 1 0 1 4
1 1 0 1 4
2 1 0 1 4
3 1 0 1 4
4 2 1 2 3
5 2 1 2 3
6 2 0 3 5
7 2 0 3 5
8 2 0 3 5
9 3 1 4 6
10 3 1 4 6
11 3 1 4 6
For the second part of your question, you'll need groupby.cumcount:
s = idx.to_series()
df['value3'] = df['value'] + s.groupby(idx).cumcount().values
df
class1 class2 value value2 value3
0 1 0 1 4 1
1 1 0 1 4 2
2 1 0 1 4 3
3 1 0 1 4 4
4 2 1 2 3 2
5 2 1 2 3 3
6 2 0 3 5 3
7 2 0 3 5 4
8 2 0 3 5 5
9 3 1 4 6 4
10 3 1 4 6 5
11 3 1 4 6 6
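For reference, the steps above can be put together into one self-contained sketch. It rebuilds the sample frame from the question; the names out and offset are only illustrative, and groupby(level=0) is used here as an equivalent way of grouping the series by its own (repeated) index.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "class1": [1, 2, 2, 3],
    "class2": [0, 1, 0, 1],
    "value": [1, 2, 3, 4],
    "value2": [4, 3, 5, 6],
})

# Repeat each index label (value2 - value + 1) times.
idx = df.index.repeat(df["value2"] - df["value"] + 1)

# Select the repeated rows, then renumber them 0..n-1.
out = df.loc[idx].reset_index(drop=True)

# cumcount numbers the copies of each original row 0, 1, 2, ...
offset = idx.to_series().groupby(level=0).cumcount().to_numpy()
out["value3"] = out["value"] + offset
print(out)
```

Because the repeated rows stay in their original order, the positional offset lines up with out row by row, which is why the plain array can be added directly.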

Here's a sequence of things that would get you the desired output:
df.join(df
        .apply(lambda x: pd.Series(range(x.value, x.value2+1)), axis=1)
        .stack().astype(int)
        .reset_index(level=1, drop=True)
        .to_frame('value3')).reset_index(drop=True)
Out[]:
class1 class2 value value2 value3
0 1 0 1 4 1
1 1 0 1 4 2
2 1 0 1 4 3
3 1 0 1 4 4
4 2 1 2 3 2
5 2 1 2 3 3
6 2 0 3 5 3
7 2 0 3 5 4
8 2 0 3 5 5
9 3 1 4 6 4
10 3 1 4 6 5
11 3 1 4 6 6
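A different technique, not used in the answer above, is DataFrame.explode (available from pandas 0.25). The sketch below assumes that version or newer: it builds one list of target values per row and then expands each list into separate rows. The name out is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "class1": [1, 2, 2, 3],
    "class2": [0, 1, 0, 1],
    "value": [1, 2, 3, 4],
    "value2": [4, 3, 5, 6],
})

# One list per row holding value, value + 1, ..., value2.
ranges = [list(range(a, b + 1)) for a, b in zip(df["value"], df["value2"])]

# explode turns each list element into its own row, repeating the other columns.
out = df.assign(value3=ranges).explode("value3").reset_index(drop=True)

# explode leaves an object column, so cast it back to integers.
out["value3"] = out["value3"].astype(int)
print(out)
```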

Related

Better way other than for loops,

I want to create a DataFrame with the columns feature1, month and feature_segment. I have over 3,000 unique values in feature1 and 3 feature segments, and I need to map each feature1 value to every month and every feature segment,
for example:
feature1 = 1 so the mapping should create a data frame as such:
feature1 month feature_Segment
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 2 3
1 3 1
1 3 2
1 3 3
1 4 1
1 4 2
1 4 3
1 5 1
1 5 2
1 5 3
1 6 1
1 6 2
1 6 3
1 7 1
1 7 2
1 7 3
1 8 1
1 8 2
1 8 3
1 9 1
1 9 2
1 9 3
1 10 1
1 10 2
1 10 3
1 11 1
1 11 2
1 11 3
1 12 1
1 12 2
1 12 3
Now is there any way to create this data frame without using a for loop?
All the df columns are in lists.
Use itertools.product:
from itertools import product
feature = [1]
feature_Segment = [1,2,3]
month = range(1, 13)
df = pd.DataFrame(product(feature, month, feature_Segment),
columns=['feature1','month','feature_Segment'])
print (df.head(10))
feature1 month feature_Segment
0 1 1 1
1 1 1 2
2 1 1 3
3 1 2 1
4 1 2 2
5 1 2 3
6 1 3 1
7 1 3 2
8 1 3 3
9 1 4 1
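The same idea scales to many feature1 values without an explicit loop. The sketch below uses three stand-in values for feature1 (the real list would hold the 3,000+ values) and also shows pd.MultiIndex.from_product as an alternative that stays inside pandas; the variable names are only illustrative.

```python
from itertools import product

import pandas as pd

feature1 = [1, 2, 3]      # stand-in for the 3,000+ real values
months = range(1, 13)
segments = [1, 2, 3]
cols = ["feature1", "month", "feature_Segment"]

# Cartesian product: every feature1 paired with every month and segment.
df = pd.DataFrame(list(product(feature1, months, segments)), columns=cols)

# Alternative: let pandas build the product, then turn the index into columns.
alt = pd.MultiIndex.from_product([feature1, months, segments], names=cols)
alt = alt.to_frame(index=False)

print(len(df))  # 3 * 12 * 3 rows
```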

add a new counter based on an existing counter in python pandas

I have a Series that look like this
col
0 1
1 2
2 3
3 4
4 1
5 2
6 3
7 1
8 2
9 3
10 1
11 2
and I would like to generate a second counter that looks like this
col col2
0 1 1
1 2 1
2 3 1
3 4 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
How can I do that in python?
If 1 always marks the start of a group, build a boolean mask with Series.eq and take its cumulative sum with Series.cumsum:
df['col2'] = df['col'].eq(1).cumsum()
print (df)
col col2
0 1 1
1 2 1
2 3 1
3 4 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
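A self-contained version of this, using the series from the question, might look like the following. The second column is a variant that is not part of the answer: it assumes a new group starts whenever the counter stops increasing, which also covers counters that do not restart at exactly 1.

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 1, 2]})

# True wherever a group starts (col == 1); the running sum numbers the groups.
df["col2"] = df["col"].eq(1).cumsum()

# Variant (assumption: a group restarts whenever the value does not increase).
# diff() is NaN on the first row, which compares as False, hence the + 1.
df["col2_alt"] = df["col"].diff().le(0).cumsum() + 1

print(df)
```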

Stacking Pandas Dataframe without dropping row

Currently, I have a dataframe like this:
0 0 0 3 0 0
0 7 8 9 1 0
0 4 5 2 4 0
My code to stack it:
dt = dataset.iloc[:,0:7].stack().sort_index(level=1).reset_index(level=0, drop=True).to_frame()
dt['variable'] = pandas.Categorical(dt.index).codes+1
dt.rename(columns={0:index_column_name}, inplace=True)
dt.set_index(index_column_name, inplace=True)
dt['variable'] = numpy.sort(dt['variable'])
However, it drops the first row when I'm stacking it, and I want to keep the headers / first row, how would I achieve this?
In essence, I'm losing the data from the first row (a.k.a column headers) and I want to keep it.
Desired Output:
value,variable
0 1
0 1
0 1
0 2
7 2
4 2
0 3
8 3
5 3
3 4
9 4
2 4
0 5
1 5
4 5
0 6
0 6
0 6
Current output:
value,variable
0 1
0 1
7 2
4 2
8 3
5 3
9 4
2 4
1 5
4 5
0 6
0 6
Why not use df.melt, as @WeNYoBen mentioned?
print(df)
1 2 3 4 5 6
0 0 0 0 3 0 0
1 0 7 8 9 1 0
2 0 4 5 2 4 0
print(df.melt())
variable value
0 1 0
1 1 0
2 1 0
3 2 0
4 2 7
5 2 4
6 3 0
7 3 8
8 3 5
9 4 3
10 4 9
11 4 2
12 5 0
13 5 1
14 5 4
15 6 0
16 6 0
17 6 0
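If the first row is being lost because it was read in as the column headers, one possible fix is to load the data without a header row before melting. The sketch below assumes the data comes from a CSV source (the question does not say); raw and long are illustrative names.

```python
import io

import pandas as pd

# Assumed input: the three rows from the question as headerless CSV text.
raw = "0,0,0,3,0,0\n0,7,8,9,1,0\n0,4,5,2,4,0\n"

# header=None keeps the first line as data instead of using it as column names.
dataset = pd.read_csv(io.StringIO(raw), header=None)

# Label the columns 1..6 so they become the 'variable' values.
dataset.columns = range(1, dataset.shape[1] + 1)

# melt stacks the columns one after another, keeping every row.
long = dataset.melt()
print(long)
```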

With python dataframes add column of counts of rows that meet condition to each row that meets it

Say I have a python DataFrame with the following structure:
pd.DataFrame([[1,2,3,4],[1,2,3,4],[1,3,5,6],[1,4,6,7],[1,4,6,7],[1,4,6,7]])
Out[262]:
0 1 2 3
0 1 2 3 4
1 1 2 3 4
2 1 3 5 6
3 1 4 6 7
4 1 4 6 7
5 1 4 6 7
How can I add a column called 'ct' that gives, for each row, the number of rows in the DataFrame whose values in columns 1-3 match that row? The DataFrame would look like this when done:
0 1 2 3 ct
0 1 2 3 4 2
1 1 2 3 4 2
2 1 3 5 6 1
3 1 4 6 7 3
4 1 4 6 7 3
5 1 4 6 7 3
You can use groupby + transform + size:
df['ct'] = df.groupby([1,2,3])[1].transform('size')
#alternatively
#df['ct'] = df.groupby([1,2,3])[1].transform(len)
print (df)
0 1 2 3 ct
0 1 2 3 4 2
1 1 2 3 4 2
2 1 3 5 6 1
3 1 4 6 7 3
4 1 4 6 7 3
5 1 4 6 7 3
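As a runnable sketch with the frame from the question: transform('size') computes the size of each group and broadcasts it back to every row of that group, so the result lines up with the original index.

```python
import pandas as pd

df = pd.DataFrame([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 3, 5, 6],
    [1, 4, 6, 7],
    [1, 4, 6, 7],
    [1, 4, 6, 7],
])

# Group on columns 1, 2 and 3; 'size' counts the rows in each group and
# transform returns that count once per original row.
df["ct"] = df.groupby([1, 2, 3])[1].transform("size")
print(df)
```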

Python: Replace a cell value in Dataframe with if statement

I have a matrix that looks like this:
com 0 1 2 3 4 5
AAA 0 5 0 4 2 1 4
ABC 0 9 8 9 1 0 3
ADE 1 4 3 5 1 0 1
BCD 1 6 7 8 3 4 1
BCF 2 3 4 2 1 3 0 ...
Where AAA, ABC ... is the dataframe index. The dataframe columns are com 0 1 2 3 4 5
I want to set a cell to 0 when the row's value of com equals that cell's column "number". So for instance, the above matrix will look like:
com 0 1 2 3 4 5
AAA 0 0 0 4 2 1 4
ABC 0 0 8 9 1 0 3
ADE 1 4 0 5 1 0 1
BCD 1 6 0 8 3 4 1
BCF 2 3 4 0 1 3 0 ...
I tried to iterate over rows and use both .loc and .ix, but had no success.
This just requires a NumPy trick:
In [22]:
print(df)
0 1 2 3 4 5
0 5 0 4 2 1 4
0 9 8 9 1 0 3
1 4 3 5 1 0 1
1 6 7 8 3 4 1
2 3 4 2 1 3 0
[5 rows x 6 columns]
In [23]:
# Build a masking matrix: 0 where the column label equals the index value, 1 elsewhere (a vectorized way of doing: if True then 0, else 1)
print(df*np.where(df.columns.values==df.index.values[..., np.newaxis], 0,1))
0 1 2 3 4 5
0 0 0 4 2 1 4
0 0 8 9 1 0 3
1 4 0 5 1 0 1
1 6 0 8 3 4 1
2 3 4 0 1 3 0
[5 rows x 6 columns]
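The answer above treats com as the index. As a hedged adaptation for the layout in the question, where com is an ordinary column and the index holds the labels AAA, ABC, ..., the same broadcast comparison can be made against the com column. The names num_cols, hit and out are only illustrative.

```python
import numpy as np
import pandas as pd

# Data from the question: 'com' column plus numbered columns 0..5.
df = pd.DataFrame(
    [[0, 5, 0, 4, 2, 1, 4],
     [0, 9, 8, 9, 1, 0, 3],
     [1, 4, 3, 5, 1, 0, 1],
     [1, 6, 7, 8, 3, 4, 1],
     [2, 3, 4, 2, 1, 3, 0]],
    index=["AAA", "ABC", "ADE", "BCD", "BCF"],
    columns=["com", 0, 1, 2, 3, 4, 5],
)

num_cols = [0, 1, 2, 3, 4, 5]

# Broadcast: compare each column label (row vector) with each row's com
# value (column vector) to get a boolean matrix of cells to zero out.
hit = np.array(num_cols) == df["com"].to_numpy()[:, np.newaxis]

out = df.copy()
out[num_cols] = np.where(hit, 0, df[num_cols].to_numpy())
print(out)
```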
I think this should work.
for line in range(len(matrix)):
    matrix[line][matrix[line][0]+1]=0
NOTE
Depending on your matrix setup you may not need the +1
Basically it takes the first digit of each line in the matrix and uses that as the index of the value to change to 0
i.e. if the row was
c 0 1 2 3 4 5
AAA 4 3 2 3 9 5 9,
it would change the 5 below the number 4 to 0
c 0 1 2 3 4 5
AAA 4 3 2 3 9 0 9
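A minimal runnable sketch of this loop, assuming the matrix is a plain list of lists whose first element in each row is the com value (so the +1 offset applies):

```python
# Each row is [com, v0, v1, v2, v3, v4, v5]; the data is illustrative.
matrix = [
    [0, 5, 0, 4, 2, 1, 4],
    [1, 4, 3, 5, 1, 0, 1],
    [4, 3, 2, 3, 9, 5, 9],
]

for row in matrix:
    # row[0] is com; the matching value sits one position further right.
    row[row[0] + 1] = 0

print(matrix)
```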
