Duplicate row and add a column pandas - python

I need to duplicate each row 3 times and add two new columns. The new column values are different for each row.
import pandas as pd
df = {'A': [ 8,9,12],
'B': [ 1,11,3],
'C': [ 7,9,13],
'D': [81,92,121]}
df = pd.DataFrame(df)
#####################################################
#input
A B C D
8 1 7 81
9 11 9 92
12 3 13 121
####################################################
#expected output
A B C D E F
8 1 7 81 9 8 E=A+1, F= C+1
8 1 7 81 8 7 E=A, F= C
8 1 7 81 7 6 E=A-1, F= C-1
9 11 9 92 10 10
9 11 9 92 9 9
9 11 9 92 8 8
12 3 13 121 13 14
12 3 13 121 12 13
12 3 13 121 11 12

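To repeat the rows of the DataFrame you can use np.repeat().
Afterwards you can build a list of offsets to add to "A" and "C".
import numpy as np
n_rows = len(df)  # number of original rows, taken before df is overwritten
df = pd.DataFrame(np.repeat(df.to_numpy(), 3, axis=0), columns=df.columns)
extra = [1, 0, -1] * n_rows  # one (+1, 0, -1) triple per original row
df['E'] = df['A'] + extra
df['F'] = df['C'] + extra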
This gives:
A B C D E F
0 8 1 7 81 9 8
1 8 1 7 81 8 7
2 8 1 7 81 7 6
3 9 11 9 92 10 10
4 9 11 9 92 9 9
5 9 11 9 92 8 8
6 12 3 13 121 13 14
7 12 3 13 121 12 13
8 12 3 13 121 11 12

Use Index.repeat with DataFrame.loc to repeat the rows, then tile the integers [1, 0, -1] with numpy.tile and create the new columns E and F:
import numpy as np
df1 = df.loc[df.index.repeat(3)]
g = np.tile([1,0,-1], len(df))
df1[['E','F']] = df1[['A','C']].add(g, axis=0).to_numpy()
df1 = df1.reset_index(drop=True)
print (df1)
A B C D E F
0 8 1 7 81 9 8
1 8 1 7 81 8 7
2 8 1 7 81 7 6
3 9 11 9 92 10 10
4 9 11 9 92 9 9
5 9 11 9 92 8 8
6 12 3 13 121 13 14
7 12 3 13 121 12 13
8 12 3 13 121 11 12
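A minimal self-contained sketch that combines the two answers above, assuming the offsets are the same for every original row. The names offsets and out are illustrative and do not appear in the answers; the repeat count is derived from the number of offsets rather than hard-coded.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [8, 9, 12],
                   "B": [1, 11, 3],
                   "C": [7, 9, 13],
                   "D": [81, 92, 121]})

# Offsets applied to each copy of a row (illustrative name, not from the answers).
offsets = np.array([1, 0, -1])

# Repeat every row once per offset and renumber the index 0..n-1.
out = df.loc[df.index.repeat(len(offsets))].reset_index(drop=True)

# Tile the offsets so each original row receives the full (+1, 0, -1) cycle.
extra = np.tile(offsets, len(df))
out["E"] = out["A"] + extra
out["F"] = out["C"] + extra

print(out)
```

Because both the repeat count and the tiling depend only on len(offsets) and len(df), the same code should work for a different number of rows or offsets.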


Repeat and concatenate a DataFrame with constant step value increase

I have a dataframe like the following example:
A B C D E F
0 1 4 7 10 13 16
1 2 5 8 11 14 17
2 3 6 9 12 15 18
I want to repeat the whole dataframe as one block:
for example, repeat the dataframe above 3 times, with every element in each new block increased by 3 relative to the previous block.
The desired dataframe:
A B C D E F
0 1 4 7 10 13 16
1 2 5 8 11 14 17
2 3 6 9 12 15 18
3 4 7 10 13 16 19
4 5 8 11 14 17 20
5 6 9 12 15 18 21
6 7 10 13 16 19 22
7 8 11 14 17 20 23
8 9 12 15 18 21 24
My real df is like:
0 1 2 3 4 5 6 7 8 9 10 11 12
11 CONECT 12 9 13
12 CONECT 13 12 14 15 16
13 CONECT 14 13
14 CONECT 15 13
15 CONECT 16 13 17 18 19
16 CONECT 17 16
code:
import pandas as pd
df = pd.read_csv('connect_part.txt', 'sample_file.csv', names =['A'])
df = df.A.str.split(expand=True)
df.fillna('', inplace=True)
repeats = 3
step = 3
df1 = df.set_index([0]) # add all non-numeric columns here
df2 = pd.concat([df1+i for i in range(0, len(df1)*repeats, step)]).reset_index()
print(df2)
error:
TypeError: can only concatenate str (not "int") to str
res = pd.concat([df + 3*i for i in range(3)], ignore_index=True)
Output:
>>> res
A B C D E F
0 1 4 7 10 13 16
1 2 5 8 11 14 17
2 3 6 9 12 15 18
3 4 7 10 13 16 19
4 5 8 11 14 17 20
5 6 9 12 15 18 21
6 7 10 13 16 19 22
7 8 11 14 17 20 23
8 9 12 15 18 21 24
Setup:
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9],
'D': [10, 11, 12],
'E': [13, 14, 15],
'F': [16, 17, 18]
})
Assuming df as input, use pandas.concat:
repeats = 3
step = 3
# one shifted copy per repeat: offsets 0, step, 2*step, ...
df2 = pd.concat([df+i for i in range(0, repeats*step, step)],
                ignore_index=True)
output:
A B C D E F
0 1 4 7 10 13 16
1 2 5 8 11 14 17
2 3 6 9 12 15 18
3 4 7 10 13 16 19
4 5 8 11 14 17 20
5 6 9 12 15 18 21
6 7 10 13 16 19 22
7 8 11 14 17 20 23
8 9 12 15 18 21 24
update: if there are non-numeric columns, move them to the index before adding:
repeats = 3
step = 3
df1 = df.set_index([0]) # add all non-numeric columns here
df2 = pd.concat([df1+i for i in range(0, repeats*step, step)]).reset_index()
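The TypeError in the question comes from the fact that str.split returns strings, so adding an integer to those columns fails even after the label column is moved out of the way. A minimal sketch of one way around this, assuming the numeric fields can be converted with pd.to_numeric before any fillna('') is applied. The lines Series is a made-up stand-in for the file contents, and label, nums, blocks and res are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for the rows read from the file (not the real data).
lines = pd.Series(["CONECT 12 9 13",
                   "CONECT 13 12 14 15 16",
                   "CONECT 14 13"])

# After str.split every cell is a string (or missing), not a number.
df = lines.str.split(expand=True)

# Keep the text column aside and convert the remaining columns to numbers;
# missing cells become NaN.
label = df[0]
nums = df.drop(columns=0).apply(pd.to_numeric)

repeats = 3
step = 3

# Build one shifted copy per repeat and stack the copies vertically.
blocks = [pd.concat([label, nums + k * step], axis=1) for k in range(repeats)]
res = pd.concat(blocks, ignore_index=True)

print(res)
```

Missing cells stay NaN in every block, so they can still be replaced with fillna('') at the very end if the blank layout is needed for output.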

How to change the default number of top and bottom rows

By default, pandas shows the top and bottom 5 rows of a dataframe in Jupyter when there are too many rows to display:
>>> df.shape
(100, 4)
    col0  col1  col2  col3
0      7    17    15     2
1      6     5     5    12
2     10    15     5    15
3      6    19    19    14
4     12     7     4    12
..   ...   ...   ...   ...
95     2    14     8    16
96     8     8     5    16
97     6     8     9     1
98     1     5    10    15
99    15     9     1    18
I know that this setting exists:
pd.set_option("display.max_rows", 20)
However, that yields the same result. Using df.head(10) and df.tail(10) in two consecutive cells is an option, but less clean. The same goes for concatenation. Is there another pandas setting like display.max_rows for this default view? How can I expand this to, let's say, the top and bottom 10?
IIUC, use display.min_rows:
pd.set_option("display.min_rows", 20)
print(df)
# Output:
0 1 2 3
0 18 8 12 2
1 2 13 13 14
2 8 7 9 2
3 17 19 9 3
4 14 18 12 3
5 11 5 9 18
6 4 5 12 3
7 12 8 2 7
8 11 2 14 13
9 6 6 3 6
.. .. .. .. ..
90 8 2 1 9
91 7 19 4 6
92 4 3 17 12
93 19 6 5 18
94 3 5 15 5
95 16 3 13 13
96 11 3 18 8
97 1 9 18 4
98 13 10 18 15
99 16 3 5 9
[100 rows x 4 columns]
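As far as I understand the two options, display.max_rows is the threshold above which the output is truncated, while display.min_rows is the number of rows shown once truncation happens (the defaults are 60 and 10). That would explain why lowering max_rows to 20 alone still shows 5 + 5 rows. A small sketch, using made-up random data and pd.option_context so the change stays local to the block:

```python
import numpy as np
import pandas as pd

# Made-up data with the same shape as in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 20, size=(100, 4)),
                  columns=["col0", "col1", "col2", "col3"])

# option_context restores the previous settings when the block exits.
# 100 rows exceed max_rows=60, so the repr is truncated to min_rows=20 rows.
with pd.option_context("display.max_rows", 60, "display.min_rows", 20):
    text = repr(df)

print(text)
```

Using pd.set_option("display.min_rows", 20) instead makes the change persistent for the session.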

populate matrix with value_counts

I need to compute a statistic on some data: how many times a value "j" appears next to a value "i". The code below is a gross simplification of what I need to do, but it contains the problem I have.
Let's say that you have this data frame.
import numpy as np
import pandas as pd
a_df=pd.DataFrame({"a_col":np.random.randint(10, size=1000), "b_col":np.random.randint(10, size=1000)})
I generate a matrix that will contain our statistics:
res_matrix=np.zeros((10, 10))
By looking at res_matrix[i][j] we will know how many times the number "j" was next to the number "i" in our data frame.
I know that "for loops" are bad in pandas, but again, this is a simplification.
I generate a sub-table for the value "i" and on this table I run value_counts()
on the column "b_col".
for i in a_df["a_col"].unique():
    temp_df = a_df[a_df["a_col"] == i]
    table_count = temp_df["b_col"].value_counts()
    for val, cnt in table_count.items():  # Series.iteritems() was removed in pandas 2.0
        res_matrix[i][val] += int(cnt)
Is there an efficient way to populate res_matrix without changing the topmost for loop?
I am thinking of something like a list comprehension, but I cannot wrap my mind around it.
Please, focus ONLY on these two lines:
    for val, cnt in table_count.items():
        res_matrix[i][val] += int(cnt)
I can't use groupby because my project requires many more operations on the dataframe.
There's a function crosstab in pandas that does just that:
pd.crosstab(a_df['a_col'], a_df['b_col'])
Output:
b_col 0 1 2 3 4 5 6 7 8 9
a_col
0 10 10 10 12 14 9 10 5 13 16
1 16 9 13 14 14 8 4 11 9 12
2 10 8 12 13 9 12 13 7 10 5
3 11 7 10 17 6 9 6 8 7 14
4 9 8 4 5 7 13 12 8 11 6
5 14 9 8 15 6 10 12 9 7 9
6 11 13 10 9 7 5 8 11 13 21
7 8 9 11 8 8 10 11 15 10 12
8 6 17 11 4 12 9 6 10 10 13
9 12 6 14 3 11 11 7 5 14 14
Update: if the outer loop must remain for other reasons, you can set values in res_matrix inside the loop:
res_matrix = np.zeros((10, 10))
for i in a_df["a_col"].unique():
    temp_df = a_df[a_df["a_col"] == i]
    table_count = temp_df["b_col"].value_counts()
    # set values in res_matrix
    res_matrix[i, table_count.index] = table_count
Don't loop, this is slow. If you think there is a good reason to loop, please explain it and provide an appropriate example.
Here is another method.
You can groupby both columns and get the group size, then unstack to get a 2D shape:
a_df.groupby(['a_col', 'b_col']).size().unstack()
output:
b_col 0 1 2 3 4 5 6 7 8 9
a_col
0 16 2 4 11 9 13 11 11 8 6
1 10 12 7 6 6 11 10 8 2 12
2 9 12 10 22 12 13 8 11 9 8
3 13 11 11 14 7 11 9 7 8 14
4 14 7 17 5 8 6 15 8 11 8
5 10 12 7 14 6 16 11 12 6 8
6 13 10 9 12 11 14 8 10 6 8
7 9 12 12 9 11 9 8 14 5 12
8 7 8 9 8 10 14 9 8 8 18
9 13 6 13 11 13 11 8 7 11 11
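A short sketch that checks the crosstab and groupby approaches against a plain counting loop, and turns the result into a NumPy matrix like res_matrix. The seed, the reindex step and the names ct, gb, full and check are assumptions added for illustration. The reindex guards against a value that never occurs, which would otherwise leave a row or column out.

```python
import numpy as np
import pandas as pd

# Seeded random data so the run is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)
a_df = pd.DataFrame({"a_col": rng.integers(10, size=1000),
                     "b_col": rng.integers(10, size=1000)})

# Count co-occurrences with crosstab and with groupby/size/unstack.
ct = pd.crosstab(a_df["a_col"], a_df["b_col"])
gb = a_df.groupby(["a_col", "b_col"]).size().unstack(fill_value=0)

# Force a full 10x10 layout, then convert to a plain NumPy array.
full = ct.reindex(index=range(10), columns=range(10), fill_value=0)
res_matrix = full.to_numpy()

# Reference result from an explicit (slow) loop over the rows.
check = np.zeros((10, 10), dtype=int)
for i, j in zip(a_df["a_col"], a_df["b_col"]):
    check[i, j] += 1

print((res_matrix == check).all())
```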

Problem while merging a specific multi level pivot table back to the original (single level) dataframe

My question is related to pivot table and merging.
I have a main dataframe that I use to create a pivot table. Later, I perform some calculations on that pivot and add a new column. Finally, I want to merge this new column back into the main dataframe, but I am not getting the desired result.
I will try to explain the steps that I performed:
Step 1.
df:
items cat section weight factor1
0 1 7 abc 3 80
1 1 7 abc 3 80
2 2 7 xyz 5 60
3 2 7 xyz 5 60
4 2 7 xyz 5 60
5 2 7 xyz 5 60
6 3 7 abc 3 80
7 3 7 abc 3 80
8 3 7 abc 3 80
9 1 8 abc 2 80
10 1 8 abc 2 60
11 2 8 xyz 6 60
12 2 8 xyz 6 60
12 2 8 xyz 6 60
13 2 8 xyz 6 60
14 3 8 abc 2 80
15 1 9 abc 4 80
16 2 9 xyz 9 60
17 2 9 xyz 9 60
18 3 9 abc 4 80
The main dataframe (df) holds a number of items, each identified by a number.
Each item belongs to a dedicated section and has a weight that varies with the category (cat) and section. In addition, there is another column named 'factor' whose value is constant for a given section.
Step 2.
I need to create a pivot as follows from the above df.
pivot = df.pivot_table(index=['section'], values=['weight', 'factor', 'items'], columns=['cat'], aggfunc={'weight': 'max', 'factor': 'max', 'items': 'sum'})
pivot:
weight factor items
cat 7 8 9 7 8 9 7 8 9
section
abc 3 2 4 80 80 80 5 3 2
xyz 5 6 9 60 60 60 4 4 2
Step 3:
Now I want to perform some calculations on that pivot then add the
result in a new column as follows:
pivot['w_n',7] = pivot['factor', 7]/pivot['items', 7]
pivot['w_n',8] = pivot['factor', 8]/pivot['items', 8]
pivot['w_n',9] = pivot['factor', 9]/pivot['items', 9]
pivot:
weight factor items w_n
cat 7 8 9 7 8 9 7 8 9 7 8 9
section
abc 3 2 4 80 80 80 5 3 2 16 27 40
xyz 5 6 9 60 60 60 4 4 2 15 15 30
Step 4:
Finally, I want to merge that new column back into the main df,
with a single column 'w_n' as the desired result, but instead I am getting 3 columns, one for each cat.
Current result:
df:
items cat section weight factor1 w_n,7 w_n,8 w_n,9
0 1 7 abc 3 80 16 27 40
1 1 7 abc 3 80 16 27 40
2 2 7 xyz 5 60 15 15 30
3 2 7 xyz 5 60 15 15 30
4 2 7 xyz 5 60 15 15 30
5 2 7 xyz 5 60 15 15 30
6 3 7 abc 3 80 16 27 40
7 3 7 abc 3 80 16 27 40
8 3 7 abc 3 80 16 27 40
9 1 8 abc 2 80 16 27 40
10 1 8 abc 2 60 16 27 40
11 2 8 xyz 6 60 15 15 30
12 2 8 xyz 6 60 15 15 30
12 2 8 xyz 6 60 15 15 30
13 2 8 xyz 6 60 15 15 30
14 3 8 abc 2 80 16 27 40
15 1 9 abc 4 80 16 27 40
16 2 9 xyz 9 60 15 15 30
17 2 9 xyz 9 60 15 15 30
18 3 9 abc 4 80 16 27 40
Desired result:
------------------
df:
items cat section weight factor1 w_n
0 1 7 abc 3 80 16
1 1 7 abc 3 80 16
2 2 7 xyz 5 60 15
3 2 7 xyz 5 60 15
4 2 7 xyz 5 60 15
5 2 7 xyz 5 60 15
6 3 7 abc 3 80 16
7 3 7 abc 3 80 16
8 3 7 abc 3 80 16
9 1 8 abc 2 80 27
10 1 8 abc 2 60 27
11 2 8 xyz 6 60 15
12 2 8 xyz 6 60 15
12 2 8 xyz 6 60 15
13 2 8 xyz 6 60 15
14 3 8 abc 2 80 27
15 1 9 abc 4 80 40
16 2 9 xyz 9 60 30
17 2 9 xyz 9 60 30
18 3 9 abc 4 80 40
Use DataFrame.join with a MultiIndex Series, built by calling unstack on pivot['w_n'] so that the Series is indexed by (cat, section):
df = df.join(pivot['w_n'].unstack().rename('W_n'), on=['cat','section'])
print (df)
items cat section weight factor W_n
0 1 7 abc 3 80 7.272727
1 1 7 abc 3 80 7.272727
2 2 7 xyz 5 60 7.500000
3 2 7 xyz 5 60 7.500000
4 2 7 xyz 5 60 7.500000
5 2 7 xyz 5 60 7.500000
6 3 7 abc 3 80 7.272727
7 3 7 abc 3 80 7.272727
8 3 7 abc 3 80 7.272727
9 1 8 abc 2 80 16.000000
10 1 8 abc 2 60 16.000000
11 2 8 xyz 6 60 7.500000
12 2 8 xyz 6 60 7.500000
12 2 8 xyz 6 60 7.500000
13 2 8 xyz 6 60 7.500000
14 3 8 abc 2 80 16.000000
15 1 9 abc 4 80 20.000000
16 2 9 xyz 9 60 15.000000
17 2 9 xyz 9 60 15.000000
18 3 9 abc 4 80 20.000000
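A self-contained sketch of the same idea on a smaller, made-up frame (the data is invented for illustration and is not the data from the question). Here the ratio is computed as a separate frame w_n instead of being written back into the pivot, which is a simplification of the steps above.

```python
import pandas as pd

# Small made-up frame with the same kind of columns as in the question.
df = pd.DataFrame({
    "items":   [1, 1, 2, 2, 1, 2],
    "cat":     [7, 7, 7, 8, 8, 8],
    "section": ["abc", "abc", "xyz", "xyz", "abc", "xyz"],
    "factor":  [80, 80, 60, 60, 80, 60],
})

# Pivot: one row per section, one column per (value, cat) pair.
pivot = df.pivot_table(index="section", columns="cat",
                       values=["factor", "items"],
                       aggfunc={"factor": "max", "items": "sum"})

# Ratio per (section, cat): rows are sections, columns are cats.
w_n = pivot["factor"] / pivot["items"]

# DataFrame.unstack turns the wide frame into a Series indexed by (cat, section).
w_n_long = w_n.unstack().rename("w_n")

# Join on the two key columns, in the same order as the index levels.
result = df.join(w_n_long, on=["cat", "section"])

print(result)
```

The order of the keys in on= has to match the order of the index levels of the Series, which is (cat, section) after unstack.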

Summing values across given range of days difference backwards - Pandas

I have created a days-difference column in a pandas DataFrame, and I want to add a column holding the sum of a value over a given window of days looking backwards.
Note that I can supply a date column for each row if needed; the diff was computed as the number of days since the first day of the data.
Example
df = pd.DataFrame.from_dict({'diff': [0,0,1,2,2,2,2,10,11,15,18],
'value': [10,11,15,2,5,7,8,9,23,14,15]})
df
Out[12]:
diff value
0 0 10
1 0 11
2 1 15
3 2 2
4 2 5
5 2 7
6 2 8
7 10 9
8 11 23
9 15 14
10 18 15
I want to add a 5_days_back_sum column that sums the values over the past 5 days, including the same day, so the result would look like this:
Out[15]:
5_days_back_sum diff value
0 21 0 10
1 21 0 11
2 36 1 15
3 58 2 2
4 58 2 5
5 58 2 7
6 58 2 8
7 9 10 9
8 32 11 23
9 46 15 14
10 29 18 15
How can I achieve that? I originally used a date column to create the diff column; it is available if that helps.
Use a custom function with boolean indexing to filter the range, then sum:
def f(x):
    return df.loc[(df['diff'] >= x - 5) & (df['diff'] <= x), 'value'].sum()
df['5_days_back_sum'] = df['diff'].apply(f)
print (df)
diff value 5_days_back_sum
0 0 10 21
1 0 11 21
2 1 15 36
3 2 2 58
4 2 5 58
5 2 7 58
6 2 8 58
7 10 9 9
8 11 23 32
9 15 14 46
10 18 15 29
Similar solution with between:
def f(x):
    return df.loc[df['diff'].between(x - 5, x), 'value'].sum()
df['5_days_back_sum'] = df['diff'].apply(f)
print (df)
diff value 5_days_back_sum
0 0 10 21
1 0 11 21
2 1 15 36
3 2 2 58
4 2 5 58
5 2 7 58
6 2 8 58
7 10 9 9
8 11 23 32
9 15 14 46
10 18 15 29
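Both functions above scan the whole frame once per row. As a possible alternative (not part of the answer), the same window sums can be read off a cumulative sum with np.searchsorted, assuming the diff column is sorted in ascending order as it is in the example. The names window, d, csum, lo and hi are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"diff": [0, 0, 1, 2, 2, 2, 2, 10, 11, 15, 18],
                   "value": [10, 11, 15, 2, 5, 7, 8, 9, 23, 14, 15]})

window = 5
d = df["diff"].to_numpy()  # assumed to be sorted ascending

# csum[k] holds the sum of the first k values, with csum[0] == 0.
csum = np.concatenate(([0], df["value"].to_numpy().cumsum()))

# For each row: first position with diff >= x - window,
# and one past the last position with diff <= x.
lo = np.searchsorted(d, d - window, side="left")
hi = np.searchsorted(d, d, side="right")

# The sum over rows lo..hi-1 is the difference of two cumulative sums.
df["5_days_back_sum"] = csum[hi] - csum[lo]

print(df)
```

If the frame is not sorted by diff, it would need a sort_values("diff") first for the searchsorted lookups to be valid.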
