print(df)
Names Maths Physics Chemistry
0 Khaja 75 91 84
1 Srihari 81 89 71
2 Krishna 69 77 76
3 jain 87 69 68
4 shakir 79 70 74
df.drop(['Chemistry'],axis=1,inplace=True)
df
Names Maths Physics
0 Khaja 75 91
1 Srihari 81 89
2 Krishna 69 77
3 jain 87 69
4 shakir 79 70
How can I get back the dropped column in the table? I tried to get the column back with reset_drop(), but it doesn't work.
The final outcome should look like this:
print(df)
Names Maths Physics Chemistry
0 Khaja 75 91 84
1 Srihari 81 89 71
2 Krishna 69 77 76
3 jain 87 69 68
4 shakir 79 70 74
Use pop to extract the column as a Series and join to add it to the end of the DataFrame:
a = df.pop('Chemistry')
print (a)
0 84
1 71
2 76
3 68
4 74
Name: Chemistry, dtype: int64
print (df)
Names Maths Physics
0 Khaja 75 91
1 Srihari 81 89
2 Krishna 69 77
3 jain 87 69
4 shakir 79 70
df = df.join(a)
print (df)
Names Maths Physics Chemistry
0 Khaja 75 91 84
1 Srihari 81 89 71
2 Krishna 69 77 76
3 jain 87 69 68
4 shakir 79 70 74
If the column is not the last one, also reindex by the original columns:
cols = df.columns
a = df.pop('Maths')
print (a)
0 75
1 81
2 69
3 87
4 79
Name: Maths, dtype: int64
print (df)
Names Physics Chemistry
0 Khaja 91 84
1 Srihari 89 71
2 Krishna 77 76
3 jain 69 68
4 shakir 70 74
df = df.join(a).reindex(columns=cols)
print (df)
Names Maths Physics Chemistry
0 Khaja 75 91 84
1 Srihari 81 89 71
2 Krishna 69 77 76
3 jain 87 69 68
4 shakir 79 70 74
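Another option (just a sketch, not part of the answer above) is to remember the column's position before popping and put the Series back with DataFrame.insert:
import pandas as pd
df = pd.DataFrame({'Names': ['Khaja', 'Srihari', 'Krishna', 'jain', 'shakir'],
                   'Maths': [75, 81, 69, 87, 79],
                   'Physics': [91, 89, 77, 69, 70],
                   'Chemistry': [84, 71, 76, 68, 74]})
# Remember where the column sits before popping it.
pos = df.columns.get_loc('Maths')
maths = df.pop('Maths')
# Re-insert the popped Series at its original position.
df.insert(pos, 'Maths', maths)
print(df)
This avoids having to store the original column order separately.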
It's always good practice to keep a master DataFrame and then do operations on copies of it.
I would also suggest following good naming practice and giving subset DataFrames meaningful names.
print (Master)
Names Maths Physics Chemistry
0 Khaja 75 91 84
1 Srihari 81 89 71
2 Krishna 69 77 76
3 jain 87 69 68
4 shakir 79 70 74
df_withoutChemistry = Master.copy()
Chemistry = df_withoutChemistry.pop('Chemistry')
print(Chemistry)
0 84
1 71
2 76
3 68
4 74
Name: Chemistry, dtype: int64
print(df_withoutChemistry)
Names Maths Physics
0 Khaja 75 91
1 Srihari 81 89
2 Krishna 69 77
3 jain 87 69
4 shakir 79 70
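A minimal sketch of that workflow, assuming a Master DataFrame like the one printed above: drop columns only on a copy, and pull a column back from the master whenever it is needed again.
import pandas as pd
Master = pd.DataFrame({'Names': ['Khaja', 'Srihari', 'Krishna', 'jain', 'shakir'],
                       'Maths': [75, 81, 69, 87, 79],
                       'Physics': [91, 89, 77, 69, 70],
                       'Chemistry': [84, 71, 76, 68, 74]})
# Work on a copy so the master stays untouched.
df_withoutChemistry = Master.drop(columns=['Chemistry'])
# If the column is needed again later, restore it from the master.
restored = df_withoutChemistry.join(Master['Chemistry'])
print(restored)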
I have a DataFrame and I need to create a new column which contains the second largest value of each row in the original DataFrame.
Sample:
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
Desired output:
0 1 2 3 4 5 6 7 8 9 penultimate
0 52 69 62 7 20 69 38 10 57 17 62
1 52 94 49 63 1 90 14 76 20 84 90
2 78 37 58 7 27 41 27 26 48 51 58
3 6 39 99 36 62 90 47 25 60 84 90
4 37 36 91 93 76 69 86 95 69 6 93
5 5 54 73 61 22 29 99 27 46 24 73
6 71 65 45 9 63 46 4 93 36 18 71
7 85 7 76 46 65 97 64 52 28 80 85
How can this be done in as little code as possible?
You could use NumPy for this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
df['penultimate'] = np.sort(df.values, 1)[:, -2]
print(df)
Using NumPy is faster.
Here is a simple lambda function!
# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
# Output
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
print(df)
Cheers!
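Note that the two answers can give different results when a row's maximum value appears more than once: the NumPy version takes the second-largest entry counting duplicates, while unique()[-2] takes the second-largest distinct value. A small sketch of the difference:
import numpy as np
import pandas as pd
row = pd.Series([7, 9, 9, 3])
# Second largest counting duplicates (NumPy answer): 9
print(np.sort(row.values)[-2])
# Second largest distinct value (unique() answer): 7
print(row.sort_values().unique()[-2])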
I want to parse data from a PDF and then find the sum and percentage of the data.
My code looks like this:
import camelot
import re
import pandas as pd
tables = camelot.read_pdf('result.pdf', pages="17")
marks = tables[0].df.iloc[[3,6,9,12,15,18,21,24,27,30],3:12]
print(marks)
marks.to_csv('sample.csv')
And I want to remove the repeated content between ( ) so that only the numbers remain:
3 4 5 6 7 8 9 10 11
3 52(B) 78(A+) 76(A+) 56(B+) 73(A) 74(A) 83(A+) 78(A+) 90(O)
6 40(P)* 58(B+) 52(B) 45(C) 57(B+) 55(B+) 83(A+) 82(A+) 90(O)
9 59(B+) 40(P)* 63(B+) 59(B+) 64(B+) 65(A) 91(O) 82(A+) 85(A+)
12 64(B+) 54(B) 78(A+) 42(P) 72(A) 73(A) 83(A+) 85(A+) 75(A+)
15 67(A) 44(P) A 53(B) 65(A) 64(B+) 83(A+) 84(A+) 75(A+)
18 61(B+) 53(B) 64(B+) 42(P) 65(A) 49(C) 81(A+) 82(A+) 90(O)
21 44(P) 46(C) 68(A) 40(P)* 49(C) 51(B) 83(A+) 82(A+) 75(A+)
24 69(A) 77(A+) 76(A+) 62(B+) 71(A) 64(B+) 85(A+) 84(A+) 100(O)
27 78(A+) 78(A+) 83(A+) 76(A+) 79(A+) 69(A) 85(A+) 82(A+) 90(O)
30 87(A+) 84(A+) 90(O) 71(A) 82(A+) 81(A+) 87(A+) 84(A+) 95(O)
What should I do next to compute the sum and then the percentage?
Use DataFrame.replace with to_numeric applied per column via DataFrame.apply:
df = df.replace(r'\D+', '', regex=True).apply(pd.to_numeric, errors='coerce')
print (df)
3 4 5 6 7 8 9 10 11
3 52 78 76.0 56 73 74 83 78 90
6 40 58 52.0 45 57 55 83 82 90
9 59 40 63.0 59 64 65 91 82 85
12 64 54 78.0 42 72 73 83 85 75
15 67 44 NaN 53 65 64 83 84 75
18 61 53 64.0 42 65 49 81 82 90
21 44 46 68.0 40 49 51 83 82 75
24 69 77 76.0 62 71 64 85 84 100
27 78 78 83.0 76 79 69 85 82 90
30 87 84 90.0 71 82 81 87 84 95
If you only want to remove the content of the ():
df = df.replace(r'(\(.*\))', '', regex=True)
print (df)
3 4 5 6 7 8 9 10 11
3 52 78 76 56 73 74 83 78 90
6 40* 58 52 45 57 55 83 82 90
9 59 40* 63 59 64 65 91 82 85
12 64 54 78 42 72 73 83 85 75
15 67 44 A 53 65 64 83 84 75
18 61 53 64 42 65 49 81 82 90
21 44 46 68 40* 49 51 83 82 75
24 69 77 76 62 71 64 85 84 100
27 78 78 83 76 79 69 85 82 90
30 87 84 90 71 82 81 87 84 95
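To then get the sum and percentage the question asks for, one possible sketch, assuming the cleaned numeric df from the first snippet and that every subject is out of 100 marks (that maximum is an assumption, not stated in the question):
# Total marks per row (NaN values are skipped by sum).
total = df.sum(axis=1)
# Maximum possible marks per row, counting only subjects that have a value
# and assuming each subject is out of 100 marks.
max_marks = df.notna().sum(axis=1) * 100
percentage = total / max_marks * 100
print(total)
print(percentage.round(2))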
I have a pandas data frame, df:
import pandas as pd
import numpy as np
np.random.seed(123)
s = np.arange(5)
df = pd.DataFrame()
for i in s:
    s_df = pd.DataFrame({'time': np.arange(100),
                         'x': np.arange(100),
                         'y': np.arange(100),
                         'r': np.random.randint(60, 100)})
    s_df['unit'] = str(i)
    df = df.append(s_df)
I want to select the 'x' and 'y' data for each 'unit', from 'time' 0 up until its value of 'r', and then warp the selected data to fit a new normalized timescale of 0-100. The new DataFrame should look the same, but x and y will have been stretched to fit the new timescale.
I think you can start with this and modify it:
df.groupby('unit', as_index=False, group_keys=False)\
  .apply(lambda g: g[g.time <= g.r.max()]
         .pipe(lambda x: x.assign(x=np.interp(x.time * 100 / x.r.max(), g.time, g.x),
                                  y=np.interp(x.time * 100 / x.r.max(), g.time, g.y))))
Output:
r time x y unit
0 91 0 0.369445 0.802790 0
1 91 1 0.802881 0.411523 0
2 91 2 0.080290 0.228482 0
3 91 3 0.248865 0.624470 0
4 91 4 0.350376 0.207805 0
5 91 5 0.604447 0.495269 0
6 91 6 0.402430 0.317250 0
7 91 7 0.205757 0.296371 0
8 91 8 0.426954 0.793716 0
9 91 9 0.728095 0.486691 0
10 91 10 0.087941 0.701258 0
11 91 11 0.653719 0.937834 0
12 91 12 0.702571 0.381267 0
13 91 13 0.113419 0.492686 0
14 91 14 0.381172 0.539422 0
15 91 15 0.490320 0.166290 0
16 91 16 0.440490 0.029675 0
17 91 17 0.663973 0.245057 0
18 91 18 0.273116 0.280711 0
19 91 19 0.807658 0.869288 0
20 91 20 0.227972 0.987803 0
21 91 21 0.747295 0.526613 0
22 91 22 0.491929 0.118479 0
23 91 23 0.403465 0.564284 0
24 91 24 0.618359 0.648467 0
25 91 25 0.867436 0.447866 0
26 91 26 0.487128 0.526473 0
27 91 27 0.135412 0.855466 0
28 91 28 0.469281 0.753690 0
29 91 29 0.397495 0.786670 0
.. .. ... ... ... ...
53 82 53 0.985053 0.534743 4
54 82 54 0.255997 0.789710 4
55 82 55 0.629316 0.889916 4
56 82 56 0.730672 0.539548 4
57 82 57 0.484289 0.278669 4
58 82 58 0.120573 0.754350 4
59 82 59 0.071606 0.912240 4
60 82 60 0.126613 0.775831 4
61 82 61 0.392633 0.706384 4
62 82 62 0.312653 0.698514 4
63 82 63 0.164337 0.420798 4
64 82 64 0.655284 0.317136 4
65 82 65 0.526961 0.484673 4
66 82 66 0.205197 0.516752 4
67 82 67 0.405965 0.314419 4
68 82 68 0.892710 0.620090 4
69 82 69 0.351876 0.422846 4
70 82 70 0.601449 0.152340 4
71 82 71 0.187239 0.486854 4
72 82 72 0.757108 0.727058 4
73 82 73 0.728311 0.623236 4
74 82 74 0.725225 0.279149 4
75 82 75 0.536730 0.746806 4
76 82 76 0.584319 0.543595 4
77 82 77 0.591636 0.451003 4
78 82 78 0.042806 0.766688 4
79 82 79 0.326183 0.832956 4
80 82 80 0.558992 0.507238 4
81 82 81 0.303649 0.143872 4
82 82 82 0.303214 0.113151 4
[428 rows x 5 columns]
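The one-liner above is fairly dense; here is the same idea unrolled into a small helper function (a sketch, assuming the df built in the question) so each step is easier to read and modify:
import numpy as np
def warp_unit(g):
    # Keep only the rows up to this unit's r value.
    g = g[g.time <= g.r.max()]
    # Map the kept time points onto a normalized 0-100 scale, then
    # interpolate x and y at those points (same np.interp call as above).
    new_time = g.time * 100 / g.r.max()
    return g.assign(x=np.interp(new_time, g.time, g.x),
                    y=np.interp(new_time, g.time, g.y))
warped = df.groupby('unit', as_index=False, group_keys=False).apply(warp_unit)
print(warped.head())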
Below I am using pandas to read my csv file in the following format:
dataframe = pandas.read_csv("test.csv", header=None, usecols=range(2,62), skiprows=1)
dataset = dataframe.values
How can I delete the first value in the very last column in the dataframe and then delete the last row in the dataframe?
Any ideas?
You can shift the last column up to get rid of the first value, then drop the last line.
df.assign(E=df.E.shift(-1)).drop(df.index[-1])
MVCE:
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 100, (10, 5)), columns=list('ABCDE'))
Output:
A B C D E
0 91 83 40 17 94
1 61 5 43 87 48
2 3 69 73 15 85
3 99 53 18 95 45
4 67 30 69 91 28
5 25 89 14 39 64
6 54 99 49 44 73
7 70 41 96 51 68
8 36 3 15 94 61
9 51 4 31 39 0
df.assign(E=df.E.shift(-1)).drop(df.index[-1]).astype(int)
Output:
A B C D E
0 91 83 40 17 48
1 61 5 43 87 85
2 3 69 73 15 45
3 99 53 18 95 28
4 67 30 69 91 64
5 25 89 14 39 73
6 54 99 49 44 68
7 70 41 96 51 61
8 36 3 15 94 0
or in two steps:
df[df.columns[-1]] = df[df.columns[-1]].shift(-1)
df = df[:-1]
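Since the file in the question is read positionally (header=None), the same two-step idea can also be written purely by position; a minimal sketch using the names from the question:
import pandas as pd
dataframe = pd.read_csv("test.csv", header=None, usecols=range(2, 62), skiprows=1)
# Shift the last column up by one to drop its first value ...
dataframe.iloc[:, -1] = dataframe.iloc[:, -1].shift(-1)
# ... then drop the now-incomplete last row.
dataframe = dataframe.iloc[:-1]
dataset = dataframe.values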
Is there a way to find and rank rows in a Pandas DataFrame by their similarity to a row from another DataFrame?
My understanding of your question: you have two data frames, hopefully with the same column count. You want to rate the members of the first data frame (the subject data frame) by how close, i.e. similar, they are to any of the members of the target data frame.
I am not aware of a built-in method.
It is probably not the most efficient way, but here is how I'd approach it:
#! /usr/bin/python3
import pandas as pd
import numpy as np
import pprint

pp = pprint.PrettyPrinter(indent=4)

# Simulate data
df_subject = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))  # This is the one we're iterating over to check similarity to the target.
df_target = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))   # This is the one we're checking distance to.

# This will hold the min distances.
distances = []

# Loop to iterate over the subject DF
for ix1, subject in df_subject.iterrows():
    distances_cur = []
    # Loop to iterate over the target DF
    for ix2, target in df_target.iterrows():
        distances_cur.append(np.linalg.norm(target - subject))
    # Get the minimum distance for the subject set member.
    distances.append(min(distances_cur))

# Distances to df
distances = pd.DataFrame(distances)

# Normalize.
distances = 0.5 - (distances - distances.mean(axis=0)) / distances.max(axis=0)

# Column index joining, ordering and beautification.
Proximity_Ratings_name = 'Proximity Ratings'
distances = distances.rename(columns={0: Proximity_Ratings_name})
df_subject = df_subject.join(distances)
pp.pprint(df_subject.sort_values(Proximity_Ratings_name, ascending=False))
It should yield something like the table below. A higher rating means there's a similar member in the target data frame:
A B C D Proximity Ratings
55 86 21 91 78 0.941537
38 91 31 35 95 0.901638
43 49 89 49 6 0.878030
98 28 98 98 36 0.813685
77 67 23 78 84 0.809324
35 52 16 36 58 0.802223
54 2 25 61 44 0.788591
95 76 3 60 46 0.766896
5 55 39 88 37 0.756049
52 79 71 90 70 0.752520
66 52 27 82 82 0.751353
41 45 67 55 33 0.739919
76 12 93 50 62 0.720323
94 99 84 39 63 0.716123
26 62 6 97 60 0.715081
40 64 50 37 27 0.714042
68 70 21 8 82 0.698824
47 90 54 60 65 0.676680
7 85 95 45 71 0.672036
2 14 68 50 6 0.661113
34 62 63 83 29 0.659322
8 87 90 28 74 0.647873
75 14 61 27 68 0.633370
60 9 91 42 40 0.630030
4 46 46 52 35 0.621792
81 94 19 82 44 0.614510
73 67 27 34 92 0.608137
30 92 64 93 51 0.608137
11 52 25 93 50 0.605770
51 17 48 57 52 0.604984
.. .. .. .. .. ...
64 28 56 0 9 0.397054
18 52 84 36 79 0.396518
99 41 5 32 34 0.388519
27 19 54 43 94 0.382714
92 69 56 73 93 0.382714
59 1 29 46 16 0.374878
58 2 36 8 96 0.362525
69 58 92 16 48 0.361505
31 27 57 80 35 0.349887
10 59 23 47 24 0.345891
96 41 77 76 33 0.345891
78 42 71 87 65 0.344398
93 12 31 6 27 0.329152
23 6 5 10 42 0.320445
14 44 6 43 29 0.319964
6 81 51 44 15 0.311840
3 17 60 13 22 0.293066
70 28 40 22 82 0.251549
36 95 72 35 5 0.249354
49 78 10 30 18 0.242370
17 79 69 57 96 0.225168
46 42 95 86 81 0.224742
84 58 81 59 86 0.221346
9 9 62 8 30 0.211659
72 11 51 74 8 0.159265
90 74 26 80 1 0.138993
20 90 4 6 5 0.117652
50 3 12 5 53 0.077088
42 90 76 42 1 0.075284
45 94 46 88 14 0.054244
Hope I understood correctly. Don't use this if performance matters; I'm sure there's an algebraic way to approach this (matrix multiplication) that would run much faster.
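As hinted above, the double iterrows loop can be replaced by a vectorized pairwise-distance computation. A sketch using scipy.spatial.distance.cdist (using SciPy here is my choice, not part of the original answer), with the same normalization as above:
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
df_subject = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
df_target = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
# All pairwise Euclidean distances at once: shape (len(df_subject), len(df_target)).
pairwise = cdist(df_subject.values, df_target.values)
# Minimum distance from each subject row to any target row.
min_dist = pd.Series(pairwise.min(axis=1), index=df_subject.index)
# Same normalization as the loop-based version above.
df_subject['Proximity Ratings'] = 0.5 - (min_dist - min_dist.mean()) / min_dist.max()
print(df_subject.sort_values('Proximity Ratings', ascending=False).head())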