How to keep column names when merge/concat in Pandas DataFrame - python

I have a dataframe created in pandas which looks like this:
A B C
X Y Z X Y Z X Y Z
Y K 2 5 12 11 9 8 4 5 12
K 4 4 13 15 5 4 6 7 2
K 6 7 14 0 2 3 0 6 8
C M 4 5 12 5 2 2 1 14 0
M 6 7 2 3 1 6 7 12 5
M 0 6 8 7 3 9 6 8 4
D N 7 1 13 15 9 8 1 13 5
N 9 0 14 0 5 4 0 14 6
N 3 2 12 5 2 3 1 2 2
I want to make it look like this:
A B C
X Y Z X Y Z X Y Z
Y K 2 5 12 11 9 8 4 5 12
K 4 4 13 15 5 4 6 7 2
K 6 7 14 0 2 3 0 6 8
A B C
X Y Z X Y Z X Y Z
C M 4 5 12 5 2 2 1 14 0
M 6 7 2 3 1 6 7 12 5
M 0 6 8 7 3 9 6 8 4
A B C
X Y Z X Y Z X Y Z
D N 7 1 13 15 9 8 1 13 5
N 9 0 14 0 5 4 0 14 6
N 3 2 12 5 2 3 1 2 2
Is there any way I can do that? I have tried several approaches with concat/merge/join, but I didn't find a way to keep the column names for "Y, C, D".

No, it is not possible using the standard DataFrame string output/display functions. If there were a way to do it, it would be a "display option", but those are all listed here and I don't see a relevant one for your case: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html#available-options
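A workaround sketch rather than a display option (it assumes the frame is named df and that the Y/C/D labels form the first level of the row MultiIndex): print each top-level group separately, so the column header is repeated for every group:
for key, grp in df.groupby(level=0, sort=False):
    print(grp, end='\n\n')  # each group prints with its own A/B/C + X/Y/Z header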

Related

pandas dataframe groupby rank generates unexpected order of ranking

I am using the following code to generate the rank column,
df["rank"] = df.groupby(['group1','userId'])[['rank_level1','rank_level2']].rank(method='first', ascending=True).astype(int)
but, as you can see in the example data below, it generates the wrong ranking order with respect to the rank_level2 column.
expected_Rank is the ranking order I am expecting:
group1  userId  geoId  rank_level1  rank_level2   rank  expected_Rank
a       1       q      3             3.506102795   1     8
a       1       w      3            -9.359613563   2     2
a       1       e      3            -2.368458072   3     3
a       1       r      3            13.75731938    4     9
a       1       t      3             0.229777761   5     5
a       1       y      3           -10.25124866    6     1
a       1       u      3             2.82822285    7     7
a       1       i      3             0             8     4
a       1       o      3             1.120593402   9     6
a       1       p      4             1.98         10    10
a       1       z      4             5.110299374  11    11
b       1       p      2            -9.552317622   1     1
b       1       r      3             1.175083121   2     6
b       1       t      3             0             3     5
b       1       o      3             9.383253146   4     8
b       1       w      3             5.782528196   5     7
b       1       i      3            -0.680999413   6     4
b       1       y      3            -0.990387248   7     3
b       1       e      3           -11.18793533    8     2
b       1       z      3            12.33791512    9     9
b       1       u      4            -4.799979138  10    11
b       1       q      4           -25.92         11    10
Create tuples from both columns, then use GroupBy.transform with Series.rank and method='dense':
df["rank"] = (df.assign(new=df[['rank_level1','rank_level2']].agg(tuple, 1))
                .groupby(['group1','userId'])['new']
                .transform(lambda x: x.rank(method='dense', ascending=True))
                .astype(int))
print (df)
group1 userId geoId rank_level1 rank_level2 rank expected_Rank
0 a 1 q 3 3.506103 8 8
1 a 1 w 3 -9.359614 2 2
2 a 1 e 3 -2.368458 3 3
3 a 1 r 3 13.757319 9 9
4 a 1 t 3 0.229778 5 5
5 a 1 y 3 -10.251249 1 1
6 a 1 u 3 2.828223 7 7
7 a 1 i 3 0.000000 4 4
8 a 1 o 3 1.120593 6 6
9 a 1 p 4 1.980000 10 10
10 a 1 z 4 5.110299 11 11
11 b 1 p 2 -9.552318 1 1
12 b 1 r 3 1.175083 6 6
13 b 1 t 3 0.000000 5 5
14 b 1 o 3 9.383253 8 8
15 b 1 w 3 5.782528 7 7
16 b 1 i 3 -0.680999 4 4
17 b 1 y 3 -0.990387 3 3
18 b 1 e 3 -11.187935 2 2
19 b 1 z 3 12.337915 9 9
20 b 1 u 4 -4.799979 11 11
21 b 1 q 4 -25.920000 10 10
The transform with a lambda is needed because calling Series.rank directly on the tuple column raises an error:
df["rank"] = df.assign(new=df[['rank_level1','rank_level2']].agg(tuple, 1)).groupby(['group1','userId'])['new'].rank(method='first', ascending=True).astype(int)
DataError: No numeric types to aggregate
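An alternative sketch, not from the answer above: sort by both rank levels and number rows within each group with cumcount. This yields method='first'-style ranks rather than dense ranks, which coincide here because the sample has no tied (rank_level1, rank_level2) pairs; rank_alt is a hypothetical column name:
# sort rows by both levels, then count positions within each group;
# assignment aligns back to df by index
order = df.sort_values(['rank_level1', 'rank_level2'])
df['rank_alt'] = order.groupby(['group1', 'userId']).cumcount() + 1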

Find index of first row whose value matches a condition set by another row

I have a dataframe which consists of two columns:
x y
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
I would like to add a column whose value is the index of the first value to meet the following condition: y >= x. For example, for row 2 (x = 3), the first y value greater than or equal to 3 is 4, so the output of z for row 2 is index 1. I expect the final table to look like:
x y z
0 1 2 0
1 2 4 0
2 3 6 1
3 4 8 1
4 5 10 2
5 6 12 2
6 7 14 3
7 8 16 3
8 9 18 4
9 10 20 4
It should be noted that both x and y are sorted, in case that makes the solution easier.
I have seen a similar answer but I could not translate it to my situation.
You want np.searchsorted, which assumes df['y'] is sorted:
import numpy as np

df['z'] = np.searchsorted(df['y'], df['x'])
Output:
x y z
0 1 2 0
1 2 4 0
2 3 6 1
3 4 8 1
4 5 10 2
5 6 12 2
6 7 14 3
7 8 16 3
8 9 18 4
9 10 20 4
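If df['y'] were not sorted, a more general sketch (O(n^2), via NumPy broadcasting; note that argmax returns 0 when no y meets the condition, so that case would need a separate check):
import numpy as np

# mask[i, j] is True where y[j] >= x[i]; argmax picks the first True per row
mask = df['y'].to_numpy()[None, :] >= df['x'].to_numpy()[:, None]
df['z'] = mask.argmax(axis=1)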

Can You Preserve Column Order When Pandas Dataframe.Combine Or DataFrame.Combine_First?

If you have 2 dataframes, represented as:
A F Y
0 1 2 3
1 4 5 6
And
B C T
0 7 8 9
1 10 11 12
When combining it becomes:
A B C F T Y
0 1 7 8 2 9 3
1 4 10 11 5 12 6
I would like it to become:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
How do I combine 1 data frame with another but keep the original column order?
In [1294]: new_df = df.join(df1)
In [1295]: new_df
Out[1295]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
Or you can also use pd.merge (not a very clean solution, though):
In [1297]: df['tmp'] = 1
In [1298]: df1['tmp'] = 1
In [1309]: pd.merge(df, df1, on=['tmp'], left_index=True, right_index=True).drop('tmp', axis=1)
Out[1309]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
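Alternatively, a minimal sketch assuming both frames share the same row index: pd.concat along axis=1 also preserves each frame's original column order:
import pandas as pd

new_df = pd.concat([df, df1], axis=1)  # columns: A F Y B C T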

Duplicate the last row with a value, index, and merge

I have this data frame:
ID X1 X2 X3 Y
A 9 2 3 10
A 3 5 5 0
A 3 3 4 0
A 6 3 6 20
A 6 7 4 0
A 2 4 9 0
A 1 7 8 0
B 6 9 5 0
B 3 3 7 10
B 6 7 8 20
B 5 2 7 0
B 2 9 3 10
B 7 2 7 0
B 9 2 6 0
C 2 9 4 30
C 5 1 5 40
C 8 3 2 0
C 2 1 9 40
C 3 9 8 30
C 7 9 6 20
D 8 2 6 10
D 3 1 8 0
D 4 8 9 20
D 7 3 2 10
F 2 8 8 0
F 7 4 8 0
Step #1:
First, duplicate the last row for each ID (groupby('ID')), skipping rows with 0 in the Y column.
Then, index the rows (by df['index'] = df.index).
After step #1, df should look like this:
index ID X1 X2 X3 Y
1 A 9 2 3 10
2 A 3 5 5 0
3 A 3 3 4 0
4 A 6 3 6 20
5 A 6 7 4 0
6 A 2 4 9 0
7 A 1 7 8 0
8 A 6 3 6 20
9 B 6 9 5 0
10 B 3 3 7 10
11 B 6 7 8 20
12 B 5 2 7 0
13 B 2 9 3 10
14 B 7 2 7 0
15 B 9 2 6 0
16 B 2 9 3 10
17 C 2 9 4 30
18 C 5 1 5 40
19 C 8 3 2 0
20 C 2 1 9 40
21 C 3 9 8 30
22 C 7 9 6 20
23 C 7 9 6 20
24 D 8 2 6 10
25 D 3 1 8 0
26 D 4 8 9 20
27 D 7 3 2 10
28 D 7 3 2 10
29 F 2 8 8 0
30 F 7 4 8 0
Step #2:
First, exclude the rows with 0 in the Y column (by df2 = df[df['Y'] != 0]).
Then create the X4 column (it doesn't matter which values are assigned; in this example I assign 1 to all rows via df2['X4'] = np.ones(len(df2))).
Note: in fact, in this step I create some other variables with calculations, but that doesn't matter for the example.
After step #2, df2 should look like this:
index ID X1 X2 X3 X4 Y
1 A 9 2 3 1 10
4 A 6 3 6 1 20
8 A 6 3 6 1 20
10 B 3 3 7 1 10
11 B 6 7 8 1 20
13 B 2 9 3 1 10
16 B 2 9 3 1 10
17 C 2 9 4 1 30
18 C 5 1 5 1 40
20 C 2 1 9 1 40
21 C 3 9 8 1 30
22 C 7 9 6 1 20
23 C 7 9 6 1 20
24 D 8 2 6 1 10
26 D 4 8 9 1 20
27 D 7 3 2 1 10
28 D 7 3 2 1 10
Step #3:
First, merge df with df2 on index (by df = df.merge(df2, on = 'index', how = 'left')).
Then exclude the last row we added before, for each ID.
Final result (df):
index ID X1 X2 X3 X4 Y
1 A 9 2 3 1.0 10
2 A 3 5 5 NaN 0
3 A 3 3 4 NaN 0
4 A 6 3 6 1.0 20
5 A 6 7 4 NaN 0
6 A 2 4 9 NaN 0
7 A 1 7 8 NaN 0
9 B 6 9 5 NaN 0
10 B 3 3 7 1.0 10
11 B 6 7 8 1.0 20
12 B 5 2 7 NaN 0
13 B 2 9 3 1.0 10
14 B 7 2 7 NaN 0
15 B 9 2 6 NaN 0
17 C 2 9 4 1.0 30
18 C 5 1 5 1.0 40
19 C 8 3 2 NaN 0
20 C 2 1 9 1.0 40
21 C 3 9 8 1.0 30
22 C 7 9 6 1.0 20
24 D 8 2 6 1.0 10
25 D 3 1 8 NaN 0
26 D 4 8 9 1.0 20
27 D 7 3 2 1.0 10
28 D 7 3 2 1.0 10
29 F 2 8 8 NaN 0
30 F 7 4 8 NaN 0
Please follow the steps; it's important, because I do additional manipulations that I don't show here.
The idea is to take, for each ID, the last row whose Y is not 0 into df1 as the duplicate; for correct ordering, each duplicate is re-indexed to its group's last index + 0.5, so that after concat, DataFrame.sort_index places it directly after its group:
#df['index'] = df.index + 1
# map each ID to its last positional index + 0.5, so the appended
# duplicate sorts directly after its group
idx = df.drop_duplicates('ID', keep='last')['ID']
idx = pd.Series(idx.index + .5, idx)
#print (idx)
# last non-zero-Y row per ID, re-indexed to land after its group
df1 = df[df['Y'].ne(0)].drop_duplicates('ID', keep='last')
df1.index = df1['ID'].map(idx)
df = pd.concat([df, df1]).sort_index().reset_index(drop=True)
# step #1 numbering, then step #2: mark non-zero-Y rows with X4
df['index'] = np.arange(1, len(df)+1)
df2 = df[df['Y'] != 0].copy()
df2['X4'] = 1
#print (df2)
# step #3: merge X4 back on the index column
df = df.merge(df2[['index','X4']], on='index', how='left')
Last, filter out the appended duplicate per ID, i.e. the last row of each ID where X4 is not NaN:
df = df[df['ID'].duplicated(keep='last') | df['X4'].isna()]
print (df)
ID X1 X2 X3 Y index X4
0 A 9 2 3 10 1 1.0
1 A 3 5 5 0 2 NaN
2 A 3 3 4 0 3 NaN
3 A 6 3 6 20 4 1.0
4 A 6 7 4 0 5 NaN
5 A 2 4 9 0 6 NaN
6 A 1 7 8 0 7 NaN
8 B 6 9 5 0 9 NaN
9 B 3 3 7 10 10 1.0
10 B 6 7 8 20 11 1.0
11 B 5 2 7 0 12 NaN
12 B 2 9 3 10 13 1.0
13 B 7 2 7 0 14 NaN
14 B 9 2 6 0 15 NaN
16 C 2 9 4 30 17 1.0
17 C 5 1 5 40 18 1.0
18 C 8 3 2 0 19 NaN
19 C 2 1 9 40 20 1.0
20 C 3 9 8 30 21 1.0
21 C 7 9 6 20 22 1.0
23 D 8 2 6 10 24 1.0
24 D 3 1 8 0 25 NaN
25 D 4 8 9 20 26 1.0
26 D 7 3 2 10 27 1.0
28 F 2 8 8 0 29 NaN
29 F 7 4 8 0 30 NaN
EDIT:
Another idea for appending the last row:
def f(x):
    y = x[x['Y'].ne(0)].tail(1)
    return x.append(y)  # in pandas >= 2.0 use pd.concat([x, y]) instead

df = df.groupby('ID').apply(f)
print (df)
ID X1 X2 X3 Y
ID
A 0 A 9 2 3 10
1 A 3 5 5 0
2 A 3 3 4 0
3 A 6 3 6 20
4 A 6 7 4 0
5 A 2 4 9 0
6 A 1 7 8 0
3 A 6 3 6 20
B 7 B 6 9 5 0
8 B 3 3 7 10
9 B 6 7 8 20
10 B 5 2 7 0
11 B 2 9 3 10
12 B 7 2 7 0
13 B 9 2 6 0
11 B 2 9 3 10
C 14 C 2 9 4 30
15 C 5 1 5 40
16 C 8 3 2 0
17 C 2 1 9 40
18 C 3 9 8 30
19 C 7 9 6 20
19 C 7 9 6 20
D 20 D 8 2 6 10
21 D 3 1 8 0
22 D 4 8 9 20
23 D 7 3 2 10
23 D 7 3 2 10
F 24 F 2 8 8 0
25 F 7 4 8 0
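Continuing the groupby/apply idea, a sketch of steps #2-#3 applied to its result (assumptions: the appended duplicate is always the last row of its ID group, and X4 simply reproduces the question's marker column):
import numpy as np

df = df.reset_index(drop=True)                  # flatten the (ID, index) MultiIndex
df['index'] = df.index + 1                      # step #1 numbering
df['X4'] = np.where(df['Y'].ne(0), 1.0, np.nan)              # step #2
df = df[df['ID'].duplicated(keep='last') | df['X4'].isna()]  # step #3: drop appended rows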

How can I create a new column in a DataFrame that shows patterns in a different column?

My original CSV file looks like this
1, 9
2, 8
3, 9
14, 7
15, 6
19, 8
20, 9
21, 3
I grouped the table by runs of consecutive integers in column A with
for grp, val in df.groupby((df['A'].diff() - 1).fillna(0).cumsum()):
    print(val)
Resulting table:
A B
1 1 9
2 2 8
3 3 9
A B
14 14 7
15 15 6
A B
19 19 8
20 20 9
21 21 3
In practice the B values are very long ID numbers, but insignificant as numbers. How can I create a new column C that will show patterns in each of the three groups by assigning a simple value to each ID, and the same simple value for each duplicate in a group?
Desired output:
A B C
1 1 9 1
2 2 8 2
3 3 9 1
A B C
14 14 7 1
15 15 6 2
A B C
19 19 8 1
20 20 9 2
21 21 3 3
Thanks
You are close:
df['C'] = (df.groupby((df.A.diff()-1).fillna(0).cumsum())
             .B.apply(lambda x: pd.Series(pd.factorize(x)[0]+1)).values)
df
Out[105]:
A B C
0 1 9 1
1 2 8 2
2 3 9 1
3 14 7 1
4 15 6 2
5 19 8 1
6 20 9 2
7 21 3 3
Or using category codes (note: codes follow the sorted order of the values rather than order of appearance, so the labels differ from factorize):
df['C'] = (df.groupby((df.A.diff()-1).fillna(0).cumsum())
             .B.apply(lambda x: x.astype('category').cat.codes+1).values)
df
Out[110]:
A B C
0 1 9 2
1 2 8 1
2 3 9 2
3 14 7 2
4 15 6 1
5 19 8 2
6 20 9 3
7 21 3 1
If you need a for loop:
for x, df1 in df.groupby((df.A.diff()-1).fillna(0).cumsum()):
    print(df1.assign(C=pd.factorize(df1.B)[0]+1))
A B C
0 1 9 1
1 2 8 2
2 3 9 1
A B C
3 14 7 1
4 15 6 2
A B C
5 19 8 1
6 20 9 2
7 21 3 3
Let's try:
df.columns = ['A','B']
g = df.groupby(df.A.diff().ne(1).cumsum())
df['C'] = g['B'].transform(lambda x: pd.factorize(x)[0] + 1)
for n, grp in g:
    print(grp)
Output:
A B C
0 1 9 1
1 2 8 2
2 3 9 1
A B C
3 14 7 1
4 15 6 2
A B C
5 19 8 1
6 20 9 2
7 21 3 3
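Putting it together, a self-contained sketch of the transform-based approach (the file name data.csv and header=None are assumptions about the original CSV):
import pandas as pd

df = pd.read_csv('data.csv', header=None, names=['A', 'B'])  # hypothetical file name
groups = df['A'].diff().ne(1).cumsum()  # new group whenever A jumps by more than 1
df['C'] = df.groupby(groups)['B'].transform(lambda x: pd.factorize(x)[0] + 1)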
Try the withColumn function, which adds a new column to the dataframe and lets you assign an index value. (Note: withColumn is a Spark DataFrame method, not pandas.)
