Duplicate the last row with a value, index, and merge - python

I have this data frame:
ID X1 X2 X3 Y
A 9 2 3 10
A 3 5 5 0
A 3 3 4 0
A 6 3 6 20
A 6 7 4 0
A 2 4 9 0
A 1 7 8 0
B 6 9 5 0
B 3 3 7 10
B 6 7 8 20
B 5 2 7 0
B 2 9 3 10
B 7 2 7 0
B 9 2 6 0
C 2 9 4 30
C 5 1 5 40
C 8 3 2 0
C 2 1 9 40
C 3 9 8 30
C 7 9 6 20
D 8 2 6 10
D 3 1 8 0
D 4 8 9 20
D 7 3 2 10
F 2 8 8 0
F 7 4 8 0
Step #1:
First, for each ID (groupby('ID')), duplicate the last row whose Y value is not 0 and append it at the end of that ID's rows.
Then, index the rows (by df['index'] = df.index).
After step #1, df should look like this:
index ID X1 X2 X3 Y
1 A 9 2 3 10
2 A 3 5 5 0
3 A 3 3 4 0
4 A 6 3 6 20
5 A 6 7 4 0
6 A 2 4 9 0
7 A 1 7 8 0
8 A 6 3 6 20
9 B 6 9 5 0
10 B 3 3 7 10
11 B 6 7 8 20
12 B 5 2 7 0
13 B 2 9 3 10
14 B 7 2 7 0
15 B 9 2 6 0
16 B 2 9 3 10
17 C 2 9 4 30
18 C 5 1 5 40
19 C 8 3 2 0
20 C 2 1 9 40
21 C 3 9 8 30
22 C 7 9 6 20
23 C 7 9 6 20
24 D 8 2 6 10
25 D 3 1 8 0
26 D 4 8 9 20
27 D 7 3 2 10
28 D 7 3 2 10
29 F 2 8 8 0
30 F 7 4 8 0
Step #2:
First, exclude those with 0 in Y column (by df2 = df[df['Y'] != 0]).
Then create X4 column (no matter which values to assign. In this example, I assign 1 for all rows in that X4 by df2['X4'] = np.ones(len(df2))).
Note: In fact, in this step I create some other variable with calculations, but it doesn't matter for the example.
After step #2, df2 should look like this:
index ID X1 X2 X3 X4 Y
1 A 9 2 3 1 10
4 A 6 3 6 1 20
8 A 6 3 6 1 20
10 B 3 3 7 1 10
11 B 6 7 8 1 20
13 B 2 9 3 1 10
16 B 2 9 3 1 10
17 C 2 9 4 1 30
18 C 5 1 5 1 40
20 C 2 1 9 1 40
21 C 3 9 8 1 30
22 C 7 9 6 1 20
23 C 7 9 6 1 20
24 D 8 2 6 1 10
26 D 4 8 9 1 20
27 D 7 3 2 1 10
28 D 7 3 2 1 10
Step #3:
First, merge the df with df2, on index (by df = df.merge(df2, on = 'index', how = 'left')).
Then, exclude the last row we added before, for each ID.
Final result (df):
index ID X1 X2 X3 X4 Y
1 A 9 2 3 1.0 10
2 A 3 5 5 NaN 0
3 A 3 3 4 NaN 0
4 A 6 3 6 1.0 20
5 A 6 7 4 NaN 0
6 A 2 4 9 NaN 0
7 A 1 7 8 NaN 0
9 B 6 9 5 NaN 0
10 B 3 3 7 1.0 10
11 B 6 7 8 1.0 20
12 B 5 2 7 NaN 0
13 B 2 9 3 1.0 10
14 B 7 2 7 NaN 0
15 B 9 2 6 NaN 0
17 C 2 9 4 1.0 30
18 C 5 1 5 1.0 40
19 C 8 3 2 NaN 0
20 C 2 1 9 1.0 40
21 C 3 9 8 1.0 30
22 C 7 9 6 1.0 20
24 D 8 2 6 1.0 10
25 D 3 1 8 NaN 0
26 D 4 8 9 1.0 20
27 D 7 3 2 1.0 10
28 D 7 3 2 1.0 10
29 F 2 8 8 NaN 0
30 F 7 4 8 NaN 0
Please follow the steps, it's important, because I do additional manipulations that I don't show here.

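The idea: filter out rows with Y equal to 0 and keep the last remaining row per ID in df1. To get the correct ordering, each of those rows is given an index equal to the last index of its ID + 0.5, so that after concat, DataFrame.sort_index puts each duplicate right after its group: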
#df['index'] = df.index + 1
idx = df.drop_duplicates('ID', keep='last')['ID']
idx = pd.Series(idx.index + .5, idx)
#print (idx)
df1 = df[df['Y'].ne(0)].drop_duplicates('ID', keep='last')
df1.index = df1['ID'].map(idx)
df = pd.concat([df, df1]).sort_index().reset_index(drop=True)
df['index'] = np.arange(1, len(df)+1)
df2 = df[df['Y'] != 0].copy()
df2['X4'] = 1
#print (df2)
df = df.merge(df2[['index','X4']], on = 'index', how = 'left')
Finally, filter out the last (duplicated) row of each ID, but only if X4 is not NaN:
df = df[df['ID'].duplicated(keep='last') | df['X4'].isna()]
print (df)
ID X1 X2 X3 Y index X4
0 A 9 2 3 10 1 1.0
1 A 3 5 5 0 2 NaN
2 A 3 3 4 0 3 NaN
3 A 6 3 6 20 4 1.0
4 A 6 7 4 0 5 NaN
5 A 2 4 9 0 6 NaN
6 A 1 7 8 0 7 NaN
8 B 6 9 5 0 9 NaN
9 B 3 3 7 10 10 1.0
10 B 6 7 8 20 11 1.0
11 B 5 2 7 0 12 NaN
12 B 2 9 3 10 13 1.0
13 B 7 2 7 0 14 NaN
14 B 9 2 6 0 15 NaN
16 C 2 9 4 30 17 1.0
17 C 5 1 5 40 18 1.0
18 C 8 3 2 0 19 NaN
19 C 2 1 9 40 20 1.0
20 C 3 9 8 30 21 1.0
21 C 7 9 6 20 22 1.0
23 D 8 2 6 10 24 1.0
24 D 3 1 8 0 25 NaN
25 D 4 8 9 20 26 1.0
26 D 7 3 2 10 27 1.0
28 F 2 8 8 0 29 NaN
29 F 7 4 8 0 30 NaN
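To see why that final mask keeps the right rows, here is a minimal sketch on a toy frame. The frame `demo` and the names `not_last` and `no_x4` are made up for illustration; only the two conditions come from the answer above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): the last row of ID "A" is an appended
# duplicate that received X4, while ID "F" never got a duplicate.
demo = pd.DataFrame({
    "ID": ["A", "A", "A", "F", "F"],
    "X4": [1.0, np.nan, 1.0, np.nan, np.nan],
})

not_last = demo["ID"].duplicated(keep="last")  # True for every row except the last of each ID
no_x4 = demo["X4"].isna()                      # True where the merge found no X4 value

# A row is dropped only if it is the last row of its ID AND has an X4 value.
kept = demo[not_last | no_x4]
print(kept.index.tolist())  # [0, 1, 3, 4]
```

The last row of "F" survives because its X4 is NaN, which is what distinguishes an original last row from an appended duplicate in this setup.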
EDIT:
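Another idea for appending the last row (pd.concat is used here because DataFrame.append was removed in pandas 2.0):
def f(x):
    # last row of the group with a non-zero Y (empty if there is none)
    y = x[x['Y'].ne(0)].tail(1)
    return pd.concat([x, y])
df = df.groupby('ID').apply(f)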
print (df)
ID X1 X2 X3 Y
ID
A 0 A 9 2 3 10
1 A 3 5 5 0
2 A 3 3 4 0
3 A 6 3 6 20
4 A 6 7 4 0
5 A 2 4 9 0
6 A 1 7 8 0
3 A 6 3 6 20
B 7 B 6 9 5 0
8 B 3 3 7 10
9 B 6 7 8 20
10 B 5 2 7 0
11 B 2 9 3 10
12 B 7 2 7 0
13 B 9 2 6 0
11 B 2 9 3 10
C 14 C 2 9 4 30
15 C 5 1 5 40
16 C 8 3 2 0
17 C 2 1 9 40
18 C 3 9 8 30
19 C 7 9 6 20
19 C 7 9 6 20
D 20 D 8 2 6 10
21 D 3 1 8 0
22 D 4 8 9 20
23 D 7 3 2 10
23 D 7 3 2 10
F 24 F 2 8 8 0
25 F 7 4 8 0
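The same per-ID duplication can also be sketched without groupby.apply, which avoids the extra ID level in the index. This is only a sketch under two assumptions: the rows of each ID are contiguous and the IDs are already in sorted order, so a stable sort on ID places each duplicate at the end of its group. The frame `df_small` is made-up sample data.

```python
import pandas as pd

# Hypothetical sample: "F" has no non-zero Y, so it gets no duplicate.
df_small = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "F"],
    "Y":  [10, 20, 0, 0, 30, 0],
})

# Last row with a non-zero Y for each ID.
last_nonzero = df_small[df_small["Y"].ne(0)].groupby("ID").tail(1)

# Duplicates are concatenated after the originals; the stable sort keeps
# that relative order, so each duplicate ends up last within its ID.
out = (
    pd.concat([df_small, last_nonzero])
      .sort_values("ID", kind="stable")
      .reset_index(drop=True)
)
out["index"] = range(1, len(out) + 1)
print(out)
```

If the IDs are not sorted in the original data, the +0.5 index trick from the first part of the answer is the safer choice, since it does not reorder groups.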

Related

pandas dataframe groupby rank generates unexpected order of ranking

I am using the following code to generate the rank column,
df["rank"] = df.groupby(['group1','userId'])[['rank_level1','rank_level2']].rank(method='first', ascending=True).astype(int)
but as you can see in the following example data it is generating the wrong order of ranking considering rank_level2 column
expected_Rank is the ranking order I am expecting
group1 userId geoId rank_level1 rank_level2 rank expected_Rank
a 1 q 3 3.506102795 1 8
a 1 w 3 -9.359613563 2 2
a 1 e 3 -2.368458072 3 3
a 1 r 3 13.75731938 4 9
a 1 t 3 0.229777761 5 5
a 1 y 3 -10.25124866 6 1
a 1 u 3 2.82822285 7 7
a 1 i 3 0 8 4
a 1 o 3 1.120593402 9 6
a 1 p 4 1.98 10 10
a 1 z 4 5.110299374 11 11
b 1 p 2 -9.552317622 1 1
b 1 r 3 1.175083121 2 6
b 1 t 3 0 3 5
b 1 o 3 9.383253146 4 8
b 1 w 3 5.782528196 5 7
b 1 i 3 -0.680999413 6 4
b 1 y 3 -0.990387248 7 3
b 1 e 3 -11.18793533 8 2
b 1 z 3 12.33791512 9 9
b 1 u 4 -4.799979138 10 11
b 1 q 4 -25.92 11 10
Create tuples by both columns and then use GroupBy.transform with Series.rank and method='dense':
df["rank"] = (df.assign(new=df[['rank_level1','rank_level2']].agg(tuple, 1))
.groupby(['group1','userId'])['new']
.transform(lambda x: x.rank(method='dense', ascending=True))
.astype(int))
print (df)
group1 userId geoId rank_level1 rank_level2 rank expected_Rank
0 a 1 q 3 3.506103 8 8
1 a 1 w 3 -9.359614 2 2
2 a 1 e 3 -2.368458 3 3
3 a 1 r 3 13.757319 9 9
4 a 1 t 3 0.229778 5 5
5 a 1 y 3 -10.251249 1 1
6 a 1 u 3 2.828223 7 7
7 a 1 i 3 0.000000 4 4
8 a 1 o 3 1.120593 6 6
9 a 1 p 4 1.980000 10 10
10 a 1 z 4 5.110299 11 11
11 b 1 p 2 -9.552318 1 1
12 b 1 r 3 1.175083 6 6
13 b 1 t 3 0.000000 5 5
14 b 1 o 3 9.383253 8 8
15 b 1 w 3 5.782528 7 7
16 b 1 i 3 -0.680999 4 4
17 b 1 y 3 -0.990387 3 3
18 b 1 e 3 -11.187935 2 2
19 b 1 z 3 12.337915 9 9
20 b 1 u 4 -4.799979 11 11
21 b 1 q 4 -25.920000 10 10
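GroupBy.transform with Series.rank is needed because calling GroupBy.rank directly on the tuple column fails: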
df["rank"] = df.assign(new=df[['rank_level1','rank_level2']].agg(tuple, 1)).groupby(['group1','userId'])['new'].rank(method='first', ascending=True).astype(int)
DataError: No numeric types to aggregate
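A different technique, not the one used in the answer above, is to sort by both ranking columns and number the rows within each group with GroupBy.cumcount. The sketch below uses a small made-up frame `ranks`; note that, unlike method='dense', tied rows get distinct consecutive ranks here.

```python
import pandas as pd

# Hypothetical sample with the same column names as in the question.
ranks = pd.DataFrame({
    "group1": ["a", "a", "a", "b", "b"],
    "userId": [1, 1, 1, 1, 1],
    "rank_level1": [3, 3, 4, 4, 2],
    "rank_level2": [3.5, -9.4, 1.98, -25.92, -9.55],
})

# Sort by the primary key, then by the tie-breaker.
ordered = ranks.sort_values(["rank_level1", "rank_level2"])

# cumcount numbers rows 0, 1, 2, ... per group in sorted order; the
# assignment aligns on the original index, so row order is unchanged.
ranks["rank"] = ordered.groupby(["group1", "userId"]).cumcount() + 1
print(ranks)
```

This avoids building tuples, which may matter on large frames, at the cost of different tie handling.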

reshape Pandas dataframe by appending column to column

I have a pandas df like this (df1):
0 1 2 3 4 5
0 a b c d e f
1 1 4 7 10 13 16
2 2 5 8 11 14 17
3 3 6 9 12 15 18
and I want to generate a DataFrame like this (df2):
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Additional information about the given df:
The shape of the given df is unknown: b = df1.shape -> b = (n, m).
The width of df1 is always divisible by 3.
I tried stack, melt and wide_to_long. With stack, the order of the rows is lost; the rows should be ordered as shown in the example df2. I would really appreciate any help.
Kind regards, Hans
Use np.vstack and np.hsplit:
>>> pd.DataFrame(np.vstack(np.hsplit(df, df.shape[1] // 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Another example:
>>> df
0 1 2 3 4 5 6 7 8
0 a b c d e f g h i
1 1 4 7 10 13 16 19 22 25
2 2 5 8 11 14 17 20 23 26
3 3 6 9 12 15 18 21 24 27
>>> pd.DataFrame(np.vstack(np.hsplit(df, df.shape[1] // 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
8 g h i
9 19 22 25
10 20 23 26
11 21 24 27
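For reference, a self-contained sketch of the split-and-stack idea, run on the underlying NumPy array rather than on the DataFrame itself. The names `wide`, `n_blocks` and `stacked` are illustrative, and the sample data is the 6-column frame from the question.

```python
import numpy as np
import pandas as pd

# The 4 x 6 frame from the question, with default integer column labels.
wide = pd.DataFrame([
    list("abcdef"),
    [1, 4, 7, 10, 13, 16],
    [2, 5, 8, 11, 14, 17],
    [3, 6, 9, 12, 15, 18],
])

# Number of 3-column blocks; integer division keeps this an int.
n_blocks = wide.shape[1] // 3

# hsplit cuts the array into equal column blocks, vstack piles them up.
stacked = pd.DataFrame(np.vstack(np.hsplit(wide.to_numpy(), n_blocks)))
print(stacked)
```

np.hsplit raises an error if the width is not evenly divisible into `n_blocks` parts, which fits the stated guarantee that the width is a multiple of 3.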
You can split the columns into two blocks of three and stack them with pd.concat (DataFrame.append was removed in pandas 2.0):
a = df[df.columns[:3]]
b = df[df.columns[3:]]
b.columns = a.columns
df_out = pd.concat([a, b]).reset_index(drop=True)
print(df_out)
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
EDIT: To handle unknown widths:
dfs = []
for i in range(0, len(df.columns), 3):
    dfs.append(df[df.columns[i : i + 3]])
    dfs[-1].columns = df.columns[:3]
df_out = pd.concat(dfs)
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
0 d e f
1 10 13 16
2 11 14 17
3 12 15 18
0 g h i
1 19 22 25
2 20 23 26
3 21 24 27
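If a continuous row index is wanted, as in the desired df2 from the question, pd.concat can renumber the rows with ignore_index=True. A minimal sketch, again on the question's 6-column frame; `wide`, `parts` and `out` are names chosen for this example.

```python
import pandas as pd

wide = pd.DataFrame([
    list("abcdef"),
    [1, 4, 7, 10, 13, 16],
    [2, 5, 8, 11, 14, 17],
    [3, 6, 9, 12, 15, 18],
])

parts = []
for i in range(0, wide.shape[1], 3):
    # Take the next three columns by position and relabel them 0, 1, 2
    # so that every block lines up under the same column names.
    part = wide.iloc[:, i:i + 3].copy()
    part.columns = range(3)
    parts.append(part)

# ignore_index=True replaces the repeated 0..3 labels with 0..n-1.
out = pd.concat(parts, ignore_index=True)
print(out)
```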

Can You Preserve Column Order When Pandas Dataframe.Combine Or DataFrame.Combine_First?

If you have 2 dataframes, represented as:
A F Y
0 1 2 3
1 4 5 6
And
B C T
0 7 8 9
1 10 11 12
When combining it becomes:
A B C F T Y
0 1 7 8 2 9 3
1 4 10 11 5 12 6
I would like it to become:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
How do I combine 1 data frame with another but keep the original column order?
In [1294]: new_df = df.join(df1)
In [1295]: new_df
Out[1295]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
Or you can also use pd.merge on the index (not as clean as join, though):
In [1309]: pd.merge(df, df1, left_index=True, right_index=True)
Out[1309]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
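If combine_first itself is required, one option is to reorder the columns afterwards. The sketch below assumes the two frames from the question; `left`, `right`, `wanted` and `ordered` are names introduced for illustration.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 4], "F": [2, 5], "Y": [3, 6]})
right = pd.DataFrame({"B": [7, 10], "C": [8, 11], "T": [9, 12]})

# combine_first returns the union of the columns, which may come back
# in a different order than the inputs.
combined = left.combine_first(right)

# Rebuild the order explicitly: all of left's columns first, then any
# columns that only exist in right.
wanted = list(left.columns) + [c for c in right.columns if c not in left.columns]
ordered = combined[wanted]
print(ordered)
```

Selecting with an explicit column list makes the result independent of how the combining method orders the union.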

Insert new dataframe to existing dataframe into specific row position in Pandas

I have a df1 and df2 as follows:
df1:
a b c
0 1 2 4
1 6 12 24
2 7 14 28
3 4 8 16
4 3 6 12
df2:
a b c
0 7 8 9
1 10 11 12
How can I insert df2 to df1 but after the second row? My desired output will like this.
a b c
0 1 2 4
1 6 12 24
2 7 8 9
3 10 11 12
4 7 14 28
5 4 8 16
6 3 6 12
Thank you.
Split the first DataFrame with DataFrame.iloc and use concat; ignore_index=True renumbers the rows as in the desired output:
df = pd.concat([df1.iloc[:2], df2, df1.iloc[2:]], ignore_index=True)
print (df)
a b c
0 1 2 4
1 6 12 24
2 7 8 9
3 10 11 12
4 7 14 28
5 4 8 16
6 3 6 12
Here is another way using np.r_:
df2.index=range(len(df1),len(df1)+len(df2)) #change index where df1 ends
final=pd.concat((df1,df2)) #concat
final.iloc[np.r_[0,1,df2.index,2:len(df1)]] #select ordering with iloc
#final.iloc[np.r_[0:2,df2.index,2:len(df1)]]
a b c
0 1 2 4
1 6 12 24
5 7 8 9
6 10 11 12
2 7 14 28
3 4 8 16
4 3 6 12
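The slicing approach generalizes to any insert position. Below is a small sketch wrapped in a helper; `insert_rows` is a hypothetical name, not a pandas function, and the data is taken from the question.

```python
import pandas as pd

def insert_rows(base, new_rows, pos):
    """Return a new frame with new_rows placed before row position pos of base."""
    # Slice base around the insert position and glue the three pieces
    # together; ignore_index=True renumbers the rows 0..n-1.
    return pd.concat([base.iloc[:pos], new_rows, base.iloc[pos:]], ignore_index=True)

df1 = pd.DataFrame({"a": [1, 6, 7, 4, 3],
                    "b": [2, 12, 14, 8, 6],
                    "c": [4, 24, 28, 16, 12]})
df2 = pd.DataFrame({"a": [7, 10], "b": [8, 11], "c": [9, 12]})

result = insert_rows(df1, df2, 2)
print(result)
```

Neither input is modified, and `pos=0` or `pos=len(df1)` simply prepend or append the new rows.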

How can I create a new column in a DataFrame that shows patterns in a different column?

My original CSV file looks like this
1, 9
2, 8
3, 9
14, 7
15, 6
19, 8
20, 9
21, 3
I grouped the table for continuous integers in column A with
for grp, val in df.groupby((df.diff()-1).fillna(0).cumsum().index1):
    print(val)
Resulting table:
A B
1 1 9
2 2 8
3 3 9
A B
14 14 7
15 15 6
A B
19 19 8
20 20 9
21 21 3
In practice the B values are very long ID numbers, but insignificant as numbers. How can I create a new column C that will show patterns in each of the three groups by assigning a simple value to each ID, and the same simple value for each duplicate in a group?
Desired output:
A B C
1 1 9 1
2 2 8 2
3 3 9 1
A B C
14 14 7 1
15 15 6 2
A B C
19 19 8 1
20 20 9 2
21 21 3 3
Thanks
You are close
df['C']=df.groupby((df.A.diff()-1).fillna(0).cumsum()).B.apply(lambda x : pd.Series(pd.factorize(x)[0]+1)).values
df
Out[105]:
A B C
0 1 9 1
1 2 8 2
2 3 9 1
3 14 7 1
4 15 6 2
5 19 8 1
6 20 9 2
7 21 3 3
Or using category
df['C']=df.groupby((df.A.diff()-1).fillna(0).cumsum()).B.apply(lambda x : x.astype('category').cat.codes+1).values
df
Out[110]:
A B C
0 1 9 2
1 2 8 1
2 3 9 2
3 14 7 2
4 15 6 1
5 19 8 2
6 20 9 3
7 21 3 1
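The two variants number the values differently: pd.factorize assigns codes in order of first appearance, while category codes follow the sorted order of the distinct values. A minimal sketch on the first group's B values (the Series `s` and the result names are illustrative):

```python
import pandas as pd

s = pd.Series([9, 8, 9])

# factorize: 9 is seen first, so it gets code 0 (1 after the +1 shift).
by_appearance = pd.factorize(s)[0] + 1

# category codes: categories are sorted (8, 9), so 8 gets the lowest code.
by_sorted = s.astype("category").cat.codes + 1

print(by_appearance.tolist())  # [1, 2, 1]
print(by_sorted.tolist())      # [2, 1, 2]
```

Only the factorize variant reproduces the desired output from the question; the category variant still marks duplicates with equal values, just with a different numbering.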
If you need a for loop:
for x,df1 in df.groupby((df.A.diff()-1).fillna(0).cumsum()):
    print(df1.assign(C=pd.factorize(df1.B)[0]+1))
A B C
0 1 9 1
1 2 8 2
2 3 9 1
A B C
3 14 7 1
4 15 6 2
A B C
5 19 8 1
6 20 9 2
7 21 3 3
Let's try:
df.columns = ['A','B']
g = df.groupby(df.A.diff().ne(1).cumsum())
df['C'] = g['B'].transform(lambda x: pd.factorize(x)[0] + 1)
for n, grp in g:
    print(grp)
Output:
A B C
0 1 9 1
1 2 8 2
2 3 9 1
A B C
3 14 7 1
4 15 6 2
A B C
5 19 8 1
6 20 9 2
7 21 3 3
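To make the grouping step visible, here is a self-contained sketch that builds the run identifier explicitly before factorizing. The data is the CSV from the question; `run_id` is a name introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 14, 15, 19, 20, 21],
    "B": [9, 8, 9, 7, 6, 8, 9, 3],
})

# A new run starts wherever A does not increase by exactly 1 (the first
# row has a NaN diff, so it also starts a run); cumsum numbers the runs.
run_id = df["A"].diff().ne(1).cumsum()
print(run_id.tolist())  # [1, 1, 1, 2, 2, 3, 3, 3]

# Within each run, number the B values by order of first appearance.
df["C"] = df.groupby(run_id)["B"].transform(lambda s: pd.factorize(s)[0] + 1)
print(df)
```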
Try the withColumn function, which adds a new column to the dataframe; you can assign an index value to it. Note that withColumn is a PySpark DataFrame method, not a pandas one.
