reshape Pandas dataframe by appending column to column - python

I have a Pandas df like this (df1):
0 1 2 3 4 5
0 a b c d e f
1 1 4 7 10 13 16
2 2 5 8 11 14 17
3 3 6 9 12 15 18
and I want to generate a DataFrame like this (df2):
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Additional information about the given df:
The shape of the given df is unknown: b = df1.shape -> b = (n, m).
It is a given fact that the width of df1 is divisible by 3.
I tried stack, melt and wide_to_long. With stack the order of the rows is lost; the rows should behave as shown in the exemplary df2. I would really appreciate any help.
Kind regards, Hans

Use np.vstack and np.hsplit:
>>> pd.DataFrame(np.vstack(np.hsplit(df, df.shape[1] // 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Another example:
>>> df
0 1 2 3 4 5 6 7 8
0 a b c d e f g h i
1 1 4 7 10 13 16 19 22 25
2 2 5 8 11 14 17 20 23 26
3 3 6 9 12 15 18 21 24 27
>>> pd.DataFrame(np.vstack(np.hsplit(df, df.shape[1] // 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
8 g h i
9 19 22 25
10 20 23 26
11 21 24 27
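Note that np.hsplit raises a ValueError when the width does not divide evenly into the requested number of sections, so if the divisible-by-3 guarantee might ever fail, a small guard makes the failure explicit (a minimal sketch):
assert df.shape[1] % 3 == 0, 'width of df must be divisible by 3'
out = pd.DataFrame(np.vstack(np.hsplit(df, df.shape[1] // 3)))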

You can use DataFrame.append (splitting this 6-column example in half):
half = len(df.columns) // 2
a = df[df.columns[:half]]
b = df[df.columns[half:]]
b.columns = a.columns
df_out = a.append(b).reset_index(drop=True)  # pandas < 2.0; in 2.0+ use pd.concat([a, b])
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
EDIT: To handle unknown widths:
dfs = []
for i in range(0, len(df.columns), 3):
    dfs.append(df[df.columns[i : i + 3]])
    dfs[-1].columns = df.columns[:3]
df_out = pd.concat(dfs)
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
0 d e f
1 10 13 16
2 11 14 17
3 12 15 18
0 g h i
1 19 22 25
2 20 23 26
3 21 24 27
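If you prefer the clean 0..n-1 index from the first snippet, ignore_index (a standard pd.concat argument) resets it:
df_out = pd.concat(dfs, ignore_index=True)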

Related

Duplicate the last row with a value, index, and merge

I have this data frame:
ID X1 X2 X3 Y
A 9 2 3 10
A 3 5 5 0
A 3 3 4 0
A 6 3 6 20
A 6 7 4 0
A 2 4 9 0
A 1 7 8 0
B 6 9 5 0
B 3 3 7 10
B 6 7 8 20
B 5 2 7 0
B 2 9 3 10
B 7 2 7 0
B 9 2 6 0
C 2 9 4 30
C 5 1 5 40
C 8 3 2 0
C 2 1 9 40
C 3 9 8 30
C 7 9 6 20
D 8 2 6 10
D 3 1 8 0
D 4 8 9 20
D 7 3 2 10
F 2 8 8 0
F 7 4 8 0
Step #1:
First, duplicate the last row for each ID (groupby('ID')), skipping rows with 0 in the Y column.
Then, index the rows (by df['index'] = df.index).
After step #1, df should look like this:
index ID X1 X2 X3 Y
1 A 9 2 3 10
2 A 3 5 5 0
3 A 3 3 4 0
4 A 6 3 6 20
5 A 6 7 4 0
6 A 2 4 9 0
7 A 1 7 8 0
8 A 6 3 6 20
9 B 6 9 5 0
10 B 3 3 7 10
11 B 6 7 8 20
12 B 5 2 7 0
13 B 2 9 3 10
14 B 7 2 7 0
15 B 9 2 6 0
16 B 2 9 3 10
17 C 2 9 4 30
18 C 5 1 5 40
19 C 8 3 2 0
20 C 2 1 9 40
21 C 3 9 8 30
22 C 7 9 6 20
23 C 7 9 6 20
24 D 8 2 6 10
25 D 3 1 8 0
26 D 4 8 9 20
27 D 7 3 2 10
28 D 7 3 2 10
29 F 2 8 8 0
30 F 7 4 8 0
Step #2:
First, exclude rows with 0 in the Y column (by df2 = df[df['Y'] != 0]).
Then create an X4 column (it doesn't matter which values are assigned; in this example I assign 1 to all rows via df2['X4'] = np.ones(len(df2))).
Note: In fact, in this step I create some other variable with calculations, but it doesn't matter for the example.
After step #2, df2 should look like this:
index ID X1 X2 X3 X4 Y
1 A 9 2 3 1 10
4 A 6 3 6 1 20
8 A 6 3 6 1 20
10 B 3 3 7 1 10
11 B 6 7 8 1 20
13 B 2 9 3 1 10
16 B 2 9 3 1 10
17 C 2 9 4 1 30
18 C 5 1 5 1 40
20 C 2 1 9 1 40
21 C 3 9 8 1 30
22 C 7 9 6 1 20
23 C 7 9 6 1 20
24 D 8 2 6 1 10
26 D 4 8 9 1 20
27 D 7 3 2 1 10
28 D 7 3 2 1 10
Step #3:
First, merge the df with df2, on index (by df = df.merge(df2, on = 'index', how = 'left')).
Then, exclude the last row we added before, for each ID.
Final result (df):
index ID X1 X2 X3 X4 Y
1 A 9 2 3 1.0 10
2 A 3 5 5 NaN 0
3 A 3 3 4 NaN 0
4 A 6 3 6 1.0 20
5 A 6 7 4 NaN 0
6 A 2 4 9 NaN 0
7 A 1 7 8 NaN 0
9 B 6 9 5 NaN 0
10 B 3 3 7 1.0 10
11 B 6 7 8 1.0 20
12 B 5 2 7 NaN 0
13 B 2 9 3 1.0 10
14 B 7 2 7 NaN 0
15 B 9 2 6 NaN 0
17 C 2 9 4 1.0 30
18 C 5 1 5 1.0 40
19 C 8 3 2 NaN 0
20 C 2 1 9 1.0 40
21 C 3 9 8 1.0 30
22 C 7 9 6 1.0 20
24 D 8 2 6 1.0 10
25 D 3 1 8 NaN 0
26 D 4 8 9 1.0 20
27 D 7 3 2 1.0 10
29 F 2 8 8 NaN 0
30 F 7 4 8 NaN 0
Please follow the steps, it's important, because I do additional manipulations that I don't show here.
The idea is to remove 0 values by Y and take the last remaining row per ID into df1; for correct ordering, each of those rows is mapped to its ID's last index value + 0.5, so that after concat a DataFrame.sort_index restores the correct order:
# map each ID to the position just after its last row
idx = df.drop_duplicates('ID', keep='last')['ID']
idx = pd.Series(idx.index + .5, idx)
# last non-zero Y row per ID, re-indexed so sort_index slots it in right after the last row of its ID
df1 = df[df['Y'].ne(0)].drop_duplicates('ID', keep='last')
df1.index = df1['ID'].map(idx)
df = pd.concat([df, df1]).sort_index().reset_index(drop=True)
df['index'] = np.arange(1, len(df) + 1)
df2 = df[df['Y'] != 0].copy()
df2['X4'] = 1
df = df.merge(df2[['index', 'X4']], on='index', how='left')
Last, filter out the last row of each ID where X4 is not NaN:
df = df[df['ID'].duplicated(keep='last') | df['X4'].isna()]
print (df)
ID X1 X2 X3 Y index X4
0 A 9 2 3 10 1 1.0
1 A 3 5 5 0 2 NaN
2 A 3 3 4 0 3 NaN
3 A 6 3 6 20 4 1.0
4 A 6 7 4 0 5 NaN
5 A 2 4 9 0 6 NaN
6 A 1 7 8 0 7 NaN
8 B 6 9 5 0 9 NaN
9 B 3 3 7 10 10 1.0
10 B 6 7 8 20 11 1.0
11 B 5 2 7 0 12 NaN
12 B 2 9 3 10 13 1.0
13 B 7 2 7 0 14 NaN
14 B 9 2 6 0 15 NaN
16 C 2 9 4 30 17 1.0
17 C 5 1 5 40 18 1.0
18 C 8 3 2 0 19 NaN
19 C 2 1 9 40 20 1.0
20 C 3 9 8 30 21 1.0
21 C 7 9 6 20 22 1.0
23 D 8 2 6 10 24 1.0
24 D 3 1 8 0 25 NaN
25 D 4 8 9 20 26 1.0
26 D 7 3 2 10 27 1.0
28 F 2 8 8 0 29 NaN
29 F 7 4 8 0 30 NaN
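The fractional-index trick in isolation, as a toy sketch (names are illustrative): sort_index places an index of 1.5 between rows 1 and 2:
s = pd.Series(['x', 'y', 'z'], index=[0, 1, 2])
extra = pd.Series(['y2'], index=[1.5])
print(pd.concat([s, extra]).sort_index())
#0.0 x
#1.0 y
#1.5 y2
#2.0 z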
EDIT:
Another idea for appending the last row per ID:
def f(x):
    y = x[x['Y'].ne(0)].tail(1)
    return x.append(y)  # pandas < 2.0; in 2.0+ use pd.concat([x, y])
df = df.groupby('ID').apply(f)
print (df)
ID X1 X2 X3 Y
ID
A 0 A 9 2 3 10
1 A 3 5 5 0
2 A 3 3 4 0
3 A 6 3 6 20
4 A 6 7 4 0
5 A 2 4 9 0
6 A 1 7 8 0
3 A 6 3 6 20
B 7 B 6 9 5 0
8 B 3 3 7 10
9 B 6 7 8 20
10 B 5 2 7 0
11 B 2 9 3 10
12 B 7 2 7 0
13 B 9 2 6 0
11 B 2 9 3 10
C 14 C 2 9 4 30
15 C 5 1 5 40
16 C 8 3 2 0
17 C 2 1 9 40
18 C 3 9 8 30
19 C 7 9 6 20
19 C 7 9 6 20
D 20 D 8 2 6 10
21 D 3 1 8 0
22 D 4 8 9 20
23 D 7 3 2 10
23 D 7 3 2 10
F 24 F 2 8 8 0
25 F 7 4 8 0
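To flatten the extra ID level that groupby.apply adds here, DataFrame.droplevel can be applied afterwards:
df = df.droplevel(0)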

Insert new dataframe to existing dataframe into specific row position in Pandas

I have a df1 and df2 as follows:
df1:
a b c
0 1 2 4
1 6 12 24
2 7 14 28
3 4 8 16
4 3 6 12
df2:
a b c
0 7 8 9
1 10 11 12
How can I insert df2 into df1 after the second row? My desired output looks like this:
a b c
0 1 2 4
1 6 12 24
2 7 8 9
3 10 11 12
4 7 14 28
5 4 8 16
6 3 6 12
Thank you.
Use concat with the first DataFrame split by DataFrame.iloc:
df = pd.concat([df1.iloc[:2], df2, df1.iloc[2:]], ignore_index=False)
print (df)
a b c
0 1 2 4
1 6 12 24
0 7 8 9
1 10 11 12
2 7 14 28
3 4 8 16
4 3 6 12
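To get the renumbered 0..6 index shown in the desired output, pass ignore_index=True instead:
df = pd.concat([df1.iloc[:2], df2, df1.iloc[2:]], ignore_index=True)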
Here is another way using np.r_:
df2.index = range(len(df1), len(df1) + len(df2))  # change index to where df1 ends
final = pd.concat((df1, df2))                     # concat
final.iloc[np.r_[0:2, df2.index, 2:len(df1)]]     # select the ordering with iloc
a b c
0 1 2 4
1 6 12 24
5 7 8 9
6 10 11 12
2 7 14 28
3 4 8 16
4 3 6 12
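A trailing reset_index(drop=True) renumbers this result to 0..6 as well:
final.iloc[np.r_[0:2, df2.index, 2:len(df1)]].reset_index(drop=True)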

combining vectors in pandas

considering the following dataset:
df = pd.DataFrame(data=np.array([['a', 1, 2, 3, 'T'],
                                 ['b', 4, 5, 6, 'T'],
                                 ['b', 9, 9, 39, 'T'],
                                 ['c', 16, 17, 18, 'N']]),
                  columns=['id', 'A', 'B', 'C', 'Active'])
id A B C Active
a 1 2 3 T
b 4 5 6 T
b 9 9 39 T
c 16 17 18 N
I need to augment each row of each group (id) with the rows where Active = T, which means:
a 1 2 3 a 1 2 3
b 4 5 6 a 1 2 3
b 9 9 39 a 1 2 3
a 1 2 3 b 4 5 6
b 4 5 6 b 4 5 6
b 9 9 39 b 4 5 6
a 1 2 3 b 9 9 39
b 4 5 6 b 9 9 39
b 9 9 39 b 9 9 39
a 1 2 3 c 16 17 18
b 9 9 39 c 16 17 18
b 4 5 6 c 16 17 18
I have an idea which I could not implement.
First, make a new dataset by filtering the data:
take all rows where the Active column equals T and save them in a new df.
df_t = df[df['Active']=='T']
Then, for each row of df, join a vector from the df_t dataset,
which means (pseudocode):
for sample in df:
    for t in df_t:
        df_new = sample + t  # the vectors of df and df_t joined together
        Df_new = concat(df_new, Df_new)
I really appreciate your comments and suggestions on implementing my own idea!
You want the Cartesian product of df and df_t. You can do it with a bit of a hack like this:
df['cross'] = 1
df_t['cross'] = 1
df_new = pd.merge(df,df_t.drop('Active',axis=1),on='cross').drop('cross',axis=1)
Putting it all together:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.array([['a', 1, 2, 3, 'T'],
                                 ['b', 4, 5, 6, 'T'],
                                 ['b', 9, 9, 39, 'T'],
                                 ['c', 16, 17, 18, 'N']]),
                  columns=['id', 'A', 'B', 'C', 'Active'])
df_t = df[df['Active']=='T'].copy()  # .copy() avoids SettingWithCopyWarning when adding the key column
df['cross'] = 1
df_t['cross'] = 1
df_new = pd.merge(df,df_t.drop('Active',axis=1),on='cross').drop('cross',axis=1)
results in:
>>> df_new
id_x A_x B_x C_x Active id_y A_y B_y C_y
0 a 1 2 3 T a 1 2 3
1 a 1 2 3 T b 4 5 6
2 a 1 2 3 T b 9 9 39
3 b 4 5 6 T a 1 2 3
4 b 4 5 6 T b 4 5 6
5 b 4 5 6 T b 9 9 39
6 b 9 9 39 T a 1 2 3
7 b 9 9 39 T b 4 5 6
8 b 9 9 39 T b 9 9 39
9 c 16 17 18 N a 1 2 3
10 c 16 17 18 N b 4 5 6
11 c 16 17 18 N b 9 9 39
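For what it's worth, pandas 1.2+ supports a cross join directly, which makes the dummy key column unnecessary:
df_new = pd.merge(df, df_t.drop('Active', axis=1), how='cross')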

How can I create a new column in a DataFrame that shows patterns in a different column?

My original CSV file looks like this
1, 9
2, 8
3, 9
14, 7
15, 6
19, 8
20, 9
21, 3
I grouped the table into runs of consecutive integers in column A with:
for grp, val in df.groupby((df.A.diff() - 1).fillna(0).cumsum()):
    print(val)
Resulting table:
A B
1 1 9
2 2 8
3 3 9
A B
14 14 7
15 15 6
A B
19 19 8
20 20 9
21 21 3
In practice the B values are very long ID numbers, but insignificant as numbers. How can I create a new column C that will show patterns in each of the three groups by assigning a simple value to each ID, and the same simple value for each duplicate in a group?
Desired output:
A B C
1 1 9 1
2 2 8 2
3 3 9 1
A B C
14 14 7 1
15 15 6 2
A B C
19 19 8 1
20 20 9 2
21 21 3 3
Thanks
You are close:
df['C'] = df.groupby((df.A.diff() - 1).fillna(0).cumsum()).B.apply(lambda x: pd.Series(pd.factorize(x)[0] + 1)).values
df
Out[105]:
A B C
0 1 9 1
1 2 8 2
2 3 9 1
3 14 7 1
4 15 6 2
5 19 8 1
6 20 9 2
7 21 3 3
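Here pd.factorize assigns 0-based codes in order of first appearance, hence the + 1. For example:
>>> pd.factorize([9, 8, 9])[0] + 1
array([1, 2, 1])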
Or using category codes:
df['C'] = df.groupby((df.A.diff() - 1).fillna(0).cumsum()).B.apply(lambda x: x.astype('category').cat.codes + 1).values
df
Out[110]:
A B C
0 1 9 2
1 2 8 1
2 3 9 2
3 14 7 2
4 15 6 1
5 19 8 2
6 20 9 3
7 21 3 1
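The labels differ here because categorical codes follow the sorted categories rather than the order of appearance:
>>> (pd.Series([9, 8, 9]).astype('category').cat.codes + 1).tolist()
[2, 1, 2]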
If you need a for loop:
for x, df1 in df.groupby((df.A.diff() - 1).fillna(0).cumsum()):
    print(df1.assign(C=pd.factorize(df1.B)[0] + 1))
A B C
0 1 9 1
1 2 8 2
2 3 9 1
A B C
3 14 7 1
4 15 6 2
A B C
5 19 8 1
6 20 9 2
7 21 3 3
Let's try:
df.columns = ['A','B']
g = df.groupby(df.A.diff().ne(1).cumsum())
df['C'] = g['B'].transform(lambda x: pd.factorize(x)[0] + 1)
for n, grp in g:
    print(grp)
Output:
A B C
0 1 9 1
1 2 8 2
2 3 9 1
A B C
3 14 7 1
4 15 6 2
A B C
5 19 8 1
6 20 9 2
7 21 3 3
Try the withColumn function (a Spark DataFrame method, not pandas), which adds a new column to the dataframe where you may assign an index value.

Remove duplicates but keep some

Is it possible to remove duplicates but keep the last 3-4? Something like:
df = df.drop_duplicates(['ID'], keep='last_four')
Thank you
You can use groupby and tail, passing the number of rows you wish to keep, to achieve the same result:
In [5]:
# data setup
df = pd.DataFrame({'ID':[0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,3,3,3,3,3,3,3,3,3,4], 'val':np.arange(25)})
df
Out[5]:
ID val
0 0 0
1 0 1
2 0 2
3 0 3
4 0 4
5 0 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
11 1 11
12 1 12
13 2 13
14 2 14
15 3 15
16 3 16
17 3 17
18 3 18
19 3 19
20 3 20
21 3 21
22 3 22
23 3 23
24 4 24
Now groupby and call tail:
In [11]:
df.groupby('ID',as_index=False).tail(4)
Out[11]:
ID val
2 0 2
3 0 3
4 0 4
5 0 5
9 1 9
10 1 10
11 1 11
12 1 12
13 2 13
14 2 14
20 3 20
21 3 21
22 3 22
23 3 23
24 4 24
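If you wanted the first 3-4 rows per ID instead, head works the same way:
df.groupby('ID', as_index=False).head(4)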
