Number of levels (depth) of index and columns in a Pandas DataFrame - python

A Pandas DataFrame can have a hierarchical index (MultiIndex) or hierarchical columns.
I'm looking for a way to get the number of levels (depth) of the index and columns.
len(df.index.levels)
seems to work with a MultiIndex, but it doesn't work with a normal Index.
Is there an attribute for this that works for a MultiIndex as well as a simple Index? Something like
df.index.depth
or
df.columns.depth
would be great.
One example of MultiIndex columns and index:
import pandas as pd
import numpy as np

def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]

def mi_sample():
    miindex = pd.MultiIndex.from_product([mklbl('A', 4),
                                          mklbl('B', 2),
                                          mklbl('C', 4),
                                          mklbl('D', 2)])
    micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
                                           ('b', 'foo'), ('b', 'bah')],
                                          names=['lvl0', 'lvl1'])
    dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
                          .reshape((len(miindex), len(micolumns))),
                        index=miindex,
                        columns=micolumns).sort_index().sort_index(axis=1)
    return dfmi
df = mi_sample()
So df looks like:
lvl0            a         b
lvl1          bar  foo  bah  foo
A0 B0 C0 D0     1    0    3    2
          D1    5    4    7    6
       C1 D0    9    8   11   10
          D1   13   12   15   14
       C2 D0   17   16   19   18
          D1   21   20   23   22
       C3 D0   25   24   27   26
          D1   29   28   31   30
   B1 C0 D0    33   32   35   34
          D1   37   36   39   38
       C1 D0   41   40   43   42
          D1   45   44   47   46
       C2 D0   49   48   51   50
          D1   53   52   55   54
       C3 D0   57   56   59   58
          D1   61   60   63   62
A1 B0 C0 D0    65   64   67   66
          D1   69   68   71   70
       C1 D0   73   72   75   74
          D1   77   76   79   78
       C2 D0   81   80   83   82
          D1   85   84   87   86
       C3 D0   89   88   91   90
          D1   93   92   95   94
   B1 C0 D0    97   96   99   98
          D1  101  100  103  102
       C1 D0  105  104  107  106
          D1  109  108  111  110
       C2 D0  113  112  115  114
          D1  117  116  119  118
...           ...  ...  ...  ...
A2 B0 C1 D0   137  136  139  138
          D1  141  140  143  142
       C2 D0  145  144  147  146
          D1  149  148  151  150
       C3 D0  153  152  155  154
          D1  157  156  159  158
   B1 C0 D0   161  160  163  162
          D1  165  164  167  166
       C1 D0  169  168  171  170
          D1  173  172  175  174
       C2 D0  177  176  179  178
          D1  181  180  183  182
       C3 D0  185  184  187  186
          D1  189  188  191  190
A3 B0 C0 D0   193  192  195  194
          D1  197  196  199  198
       C1 D0  201  200  203  202
          D1  205  204  207  206
       C2 D0  209  208  211  210
          D1  213  212  215  214
       C3 D0  217  216  219  218
          D1  221  220  223  222
   B1 C0 D0   225  224  227  226
          D1  229  228  231  230
       C1 D0  233  232  235  234
          D1  237  236  239  238
       C2 D0  241  240  243  242
          D1  245  244  247  246
       C3 D0  249  248  251  250
          D1  253  252  255  254

[64 rows x 4 columns]

To summarize the comments above:
You can use the .nlevels attribute, which gives the number of levels for both the index and the columns:

df = pd.DataFrame(np.random.rand(2, 2), index=[['A', 'A'], ['B', 'C']], columns=['a', 'b'])

df
         a      b
A B  0.558  0.336
  C  0.148  0.436

df.index.nlevels
2
df.columns.nlevels
1

As @joris mentioned above, len(df.columns.levels) will not work in the example above, because columns is not a MultiIndex, giving:
AttributeError: 'Index' object has no attribute 'levels'
But it works fine for the index in the example above:
len(df.index.levels)
2
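For completeness (my own check, not from the comments): nlevels is defined on every Index type, so it also answers 1 for a default RangeIndex:

pd.DataFrame({'a': [1, 2]}).index.nlevels   # 1 for a plain RangeIndex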

If what you actually want is the number of distinct values rather than the number of levels, something like len(set(df.a)) works on both an index and a normal column.
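If distinct values are indeed what you are after, nunique is the more idiomatic spelling (my suggestion, equivalent in spirit):

df['a'].nunique()    # number of unique values in column 'a'
df.index.nunique()   # number of unique entries in the index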

Related

Sum specific columns in dataframe with multi index

I have a dataframe (df) with a MultiIndex:

Class    A       B
Sex      M   F   M   F
Group
B1      81  34  55  92
B2      38   3  25  11
B3      73  71  69  79
B4      69  23  27  96
B5       5   1  33  28
B6      40  81  91  87
I am trying to create 2 sum columns (one for M and one for F) so my output would look like:
Class    A       B      Total
Sex      M   F   M   F  Total M  Total F
Group
B1      81  34  55  92      136      126
B2      38   3  25  11       63       14
B3      73  71  69  79      142      150
B4      69  23  27  96       96      119
B5       5   1  33  28       38       29
B6      40  81  91  87      131      168
I have tried:

df['Total M'] = df['M'].sum(axis=1)
df['Total F'] = df['F'].sum(axis=1)

without success, since 'M' and 'F' live on the second level of the columns, not the first.
You can try
df[('Total', 'Total M')] = df.xs('M', level=1, axis=1).sum(axis=1)
df[('Total', 'Total F')] = df.xs('F', level=1, axis=1).sum(axis=1)

# or in a for loop
for col in ['M', 'F']:
    df[('Total', f'Total {col}')] = df.xs(col, level=1, axis=1).sum(axis=1)

print(df)
     A       B      Total
     M   F   M   F  Total M  Total F
B1  81  34  55  92      136      126
B2  38   3  25  11       63       14
B3  73  71  69  79      142      150
B4  69  23  27  96       96      119
B5   5   1  33  28       38       29
B6  40  81  91  87      131      168
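For reference, xs('M', level=1, axis=1) selects every column whose second ('Sex') level is 'M' and drops that level, so the frame being summed looks roughly like this:

df.xs('M', level=1, axis=1)
# Class   A   B
# Group
# B1     81  55
# B2     38  25
# ...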
Use the .loc accessor (called here as df.loc(axis=1) so the slice applies to the columns):

df[('Total', 'Total M')] = df.loc(axis=1)[:, ['M']].sum(axis=1)
df[('Total', 'Total F')] = df.loc(axis=1)[:, ['F']].sum(axis=1)
Here's an alternative approach using groupby. It's probably overkill when the level has only two groups, but it should scale well in scenarios where there are more.
totals = df.groupby(axis=1, level=1).sum()
totals.columns = pd.MultiIndex.from_product([['Total'], totals.columns])
df = df.join(totals)
[out]

Class    A       B      Total
Sex      M   F   M   F    F    M
Group
B1      81  34  55  92  126  136
B2      38   3  25  11   14   63
B3      73  71  69  79  150  142
B4      69  23  27  96  119   96
B5       5   1  33  28   29   38
B6      40  81  91  87  168  131
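Side note (my addition, not part of the answer above): groupby(..., axis=1) is deprecated in pandas 2.x. A roughly equivalent sketch that avoids it: transpose, group the former columns by the 'Sex' level, sum, and transpose back:

# equivalent without groupby(axis=1)
totals = df.T.groupby(level=1).sum().T
totals.columns = pd.MultiIndex.from_product([['Total'], totals.columns])
df = df.join(totals)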

More efficient way to compute angles?

I have a dataframe with three points given as columns (so, a total of six):
   x_measured  y_measured  x_calculated  y_calculated  x_fixedpoint  \
0         142          37           143          37.5           138
1         142          37           143          37.6           138
2         142          37           143          37.7           138
3         142          37           143          37.8           138
4         142          37           143          37.9           138
5          73          55            71          55.6            72
6          73          55            71          55.7            72
7          73          55            71          55.8            72
8          73          55            71          55.9            72
9          73          55            71          55.1            72

   y_fixedpoint
0            38
1            38
2            38
3            38
4            38
5            55
6            55
7            55
8            55
9            55
Now, I need to calculate the angle between (x_measured, y_measured) and (x_calculated, y_calculated) relative to (x_fixedpoint, y_fixedpoint). To do so, I created this function:
def angle_calculator(x1, x2, x3, x4, x5, x6):
    All_points = np.array([[x1, x2], [x3, x4], [x5, x6]])
    A = All_points[2] - All_points[0]
    B = All_points[1] - All_points[0]
    C = All_points[2] - All_points[1]
    for e1, e2 in ((A, B), (A, C), (B, -C)):
        dotproduct = np.dot(e1, e2)
        norm = np.linalg.norm(e1) * np.linalg.norm(e2)
        if dotproduct != 0:
            angle = round(np.arccos(dotproduct / norm) * 180 / np.pi, 2)
        else:
            angle = 0
    return angle
taking the different x,y coordinates as arguments and returning an angle. It works and gives:
df['angles'] = df.apply(lambda x: angle_calculator(x.x_measured, x.y_measured, x.x_calculated, x.y_calculated, x.x_fixedpoint, x.y_fixedpoint), axis=1)
   x_measured  y_measured  x_calculated  y_calculated  x_fixedpoint  \
0         142          37           143          37.5           138
1         142          37           143          37.6           138
2         142          37           143          37.7           138
3         142          37           143          37.8           138
4         142          37           143          37.9           138
5          73          55            71          55.6            72
6          73          55            71          55.7            72
7          73          55            71          55.8            72
8          73          55            71          55.9            72
9          73          55            71          55.1            72

   y_fixedpoint     angles
0            38  32.275644
1            38  35.537678
2            38  38.425651
3            38  40.950418
4            38  43.132975
5            55  14.264512
6            55  15.701974
7            55  16.858399
8            55  17.759467
9            55   2.848188
Usually, I would be relatively pleased with this, BUT... it is rather slow for dataframes with over 200,000 rows. Slow (I know!) is a relative term, but in this case it takes around 10 seconds for 200,000 rows.
So, my questions are:
Am I overcomplicating things?
Is there a more efficient way to do this?
As always, thankful for knowledge.
Considering the values in the numpy domain, we can do:
# extract the pairs (and go to numpy)
meas = df.filter(like="measured").to_numpy()
calc = df.filter(like="calculated").to_numpy()
fix = df.filter(like="fixed").to_numpy()
# calculate the differences of `meas` and `calc` from `fix`
meas_dist = fix - meas
calc_dist = fix - calc
# get the inner products
inners = (meas_dist * calc_dist).sum(axis=1)
# or with: inners = np.einsum("ij,ij->i", meas_dist, calc_dist); might be faster
# norm function for brevity
norm = lambda mat: np.linalg.norm(mat, axis=1)
# get the angles (in radians)
angles_in_rad = np.arccos(inners / (norm(meas_dist) * norm(calc_dist)))
# handling possible NaNs (by @Serge de Gosson de Varennes, thanks!)
where_nans = np.isnan(angles_in_rad)
angles_in_rad[where_nans] = 0
# go to degrees
angles_in_deg = np.rad2deg(angles_in_rad)
# put back to df
df["angles"] = angles_in_deg
I get:

>>> df
   x_measured  y_measured  x_calculated  y_calculated  x_fixedpoint  y_fixedpoint      angles
0         142          37           143          37.5           138            38    8.325650
1         142          37           143          37.6           138            38    9.462322
2         142          37           143          37.7           138            38   10.602613
3         142          37           143          37.8           138            38   11.745633
4         142          37           143          37.9           138            38   12.890481
5          73          55            71          55.6            72            55  149.036243
6          73          55            71          55.7            72            55  145.007980
7          73          55            71          55.8            72            55  141.340192
8          73          55            71          55.9            72            55  138.012788
9          73          55            71          55.1            72            55  174.289407
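One extra guard worth considering (my addition, not part of the answer above): floating-point round-off can push the cosine argument slightly outside [-1, 1], making np.arccos return NaN even for geometrically valid points. Clipping before the arccos avoids relying on the NaN fix-up:

# clip to the valid arccos domain before taking the inverse cosine
cos_angles = np.clip(inners / (norm(meas_dist) * norm(calc_dist)), -1.0, 1.0)
angles_in_rad = np.arccos(cos_angles)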

How to turn a dictionary into a dataframe with all the keys in a column

def weights():
    saved = {}
    for i in range(len(bread_pairs["key_id"])):
        drawing = np.array(bread_pairs['bitmap'][i], dtype=np.uint8)
        new_test_cnn = drawing.reshape(1, 28, 28, 1).astype('float32')
        new_cnn_predict = model.predict(new_test_cnn, batch_size=32, verbose=0)
        w = model.layers[8].get_weights()
        w = list(w[0].flatten())
        saved[bread_pairs["key_id"][i]] = w
    return saved
I have this function that is creating a dictionary of key_ids and mapping them to an associated list of values of length 200. So for example my dictionary looks something like saved = {key_id_1: [1,2,3...200], key_id_2: [1,2,...,200], ....}
I would like to turn this dictionary into a dataframe with a column of key_ids, where each element of the associated list of 200 becomes its own column. So there is a total of 201 columns: the first column holds the key_id, the second column holds the first element of the list, the third column holds the second element, and so on for each row. Is there a way to convert this dictionary to a df? I have 10000 key_ids, so the dimensions would be 10000x201. Thanks!
Load the dict into a DataFrame using pandas.DataFrame.from_dict with orient='index', and reset the index with .reset_index().
This will create the DataFrame as requested; however, I recommend leaving the keys as the index, which should make it easier to perform calculations and address specific rows.
If the columns should be named 0...201, then use df.columns = list(range(202)), or use pandas.DataFrame.rename to rename specific columns.
import pandas as pd
# test data
saved = {'key_id_1': list(range(201)), 'key_id_2': list(range(201))}
# create the DataFrame
df = pd.DataFrame.from_dict(saved, orient='index')
# reset the index
df = df.reset_index()
# display(df)
      index  0  1  2  3  4  ...  195  196  197  198  199  200
0  key_id_1  0  1  2  3  4  ...  195  196  197  198  199  200
1  key_id_2  0  1  2  3  4  ...  195  196  197  198  199  200

[2 rows x 202 columns]
Alternative Implementation
Create the DataFrame with pandas.DataFrame, transpose the DataFrame with pandas.DataFrame.T, and then reset with .reset_index().
df = pd.DataFrame(saved)
df = df.T.reset_index()
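If you do reset the index, you may also want to give the key column a clearer name; 'key_id' here is just an illustrative choice, not a name from the original post:

df = df.rename(columns={'index': 'key_id'})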

Pandas Dataframe

I have a dataframe containing a number of columns and rows, in all of the columns except for the leftmost two, there is data of the form "integer-integer". I would like to split all of these columns into two columns, with each integer in its own cell, and remove the dash.
I have tried to follow the answers in Pandas Dataframe: Split multiple columns each into two columns, but it seems that they are splitting after one element, while I would like to split on the "-".
By way of example, suppose I have a dataframe of the form:
I would like to split the columns labelled 2 through to 22, to have them called 2F, 2A, 3F, 3A, ..., 6A with the data in the first row being R1, Hawthorn, 229, 225, 91, 81, ..., 12.
Thank you for any help.
You can use DataFrame.set_index with DataFrame.stack to get a Series, then split it into two new columns with Series.str.split, convert to integers, create the new column names with DataFrame.set_axis, reshape with DataFrame.unstack, sort the columns with DataFrame.sort_index, and finally flatten the MultiIndex and convert the index back to columns with DataFrame.reset_index:
# first replace column names with default values
df.columns = range(len(df.columns))

df = (df.set_index([0, 1])
        .stack()
        .str.split('-', expand=True)
        .astype(int)
        .set_axis(['F', 'A'], axis=1)
        .unstack()
        .sort_index(axis=1, level=[1, 0], ascending=[True, False]))
df.columns = df.columns.map(lambda x: f'{x[1]}{x[0]}')
df = df.reset_index()
print(df)
    0                1   2F   2A   3F   3A   4F   4A   5F   5A  6F  6A
0  R1         Hawthorn  229  225   91   81  216  142  439  367   7  12
1  R2           Sydney  226  214   93   92  151  167  377  381  12   8
2  R3          Geelong  216  228   91  166  159  121  369  349  16  14
3  R4  North Melbourne  213  239  169  126  142  155  355  394   8   9
4  R5       Gold Coast  248  226  166   94  267  169  455  389  18   6
5  R6         St Kilda  242  197  118  161  158  156  466  353  15  16
6  R7        Fremantle  225  219   72   84  224  185  449  464   7   5
For input:

df = pd.DataFrame({0: ['R1'], 1: ['Hawthorn'], 2: ['229-225'], 3: ['91-81'], 4: ['210-142'], 5: ['439-367'], 6: ['7-12']})

    0         1        2      3        4        5     6
0  R1  Hawthorn  229-225  91-81  210-142  439-367  7-12

trying the code:

for i in df.columns[2::]:
    df[[str(i) + 'F', str(i) + 'A']] = pd.DataFrame(df[i].str.split('-').tolist(), index=df.index)
    del df[i]

prints (1st row):

    0         1   2F   2A  3F  3A   4F   4A   5F   5A  6F  6A
0  R1  Hawthorn  229  225  91  81  210  142  439  367   7  12
You can use a lambda function to split a Series:
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
>>> data
0 12-24
1 13-26
2 14-28
3 15-30
df["d1"] = df["data"].apply(lambda x: x.split("-")[0])
df["d2"] = df["data"].apply(lambda x: x.split("-")[1])
df.head()
>>>
data d1 d2
0 12-24 12 24
1 13-26 13 26
2 14-28 14 28
3 15-30 15 30
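As a side note (my addition, not part of the answer above): Series.str.split with expand=True performs the same split in a single vectorized pass, avoiding the two per-row lambdas:

df[["d1", "d2"]] = df["data"].str.split("-", expand=True)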

Efficient finite field multiplication with log-antilog-table lookup in numpy

I am trying to implement efficient multiplication in GF(2^8), whose elements are most naturally represented as uint8 numpy values, in a numpy-thonic way. To that end, I implemented GF arithmetic (in pure Python, not numpy) in order to build log-antilog tables (I took a random generator, 9); in particular, I implemented a (non-numpy) Python function multGF for GF multiplication, which works great but is slow (since it uses polynomial modulo calcs). A common trick to speed up multiplication is to use the following equation:

a * b = antilog[(log[a] + log[b]) mod 255]
Building the log-antilog uint8 ndarrays is easily performed like this:

gen = 9
K = [1]
g = gen
for i in range(1, 255):
    K.append(g)
    g = multGF(g, gen)
antilog = np.array(K, dtype='uint8')

log = np.full(256, 0, dtype='uint8')
for i in range(255):
    log[antilog[i]] = i
But, and that is my question: how do I implement the multiplication in a numpy-thonic way? Both the log table and the antilog table are of size 255 (not 256; there is no log for 0) and the exponents have to be added modulo 255, not mod 256. I came up with the following, IMHO non-numpy-thonic, solution:

def multGF2(a, b):
    return antilog[(int(log[a]) + log[b]) % 255]

I had to convert the uint8 addition (which naturally works mod 256) into an int addition in order to perform the mod-255 addition. This is neither elegant nor efficient, and I am quite sure that someone has a better solution.
For testing, here are both log tables as arrays:
log = [ nan 0 250 214 245 173 209 42 240 1 168 71 204 187 37 132 235 91
251 191 163 84 66 146 199 212 182 215 32 30 127 247 230 206 86 229
246 65 186 244 158 87 79 171 61 174 141 180 194 113 207 50 177 150
210 54 27 105 25 231 122 93 242 43 225 2 201 156 81 142 224 52
241 53 60 64 181 190 239 254 153 119 82 72 74 9 166 62 56 13
169 143 136 34 175 109 189 80 108 165 202 188 45 99 172 203 145 126
205 157 49 24 22 139 100 159 20 111 226 133 117 233 88 46 237 130
38 3 220 217 252 35 196 96 151 89 76 6 137 192 219 5 47 178
236 110 48 98 55 118 59 155 176 92 185 179 234 211 249 70 148 18
114 39 77 124 67 14 69 58 4 195 161 7 57 147 51 238 8 135
164 144 138 116 131 208 29 162 170 85 104 193 184 97 75 216 103 115
160 123 197 11 183 10 40 222 94 101 167 213 198 90 140 243 121 149
200 63 152 12 44 23 19 129 17 68 134 28 95 218 154 248 15 16
106 227 221 102 128 120 112 26 228 78 83 31 41 36 232 21 125 107
33 73 253 223]
antilog = [ 1 9 65 127 170 141 137 173 178 85 203 201 219 89 167 232 233 224
161 222 116 249 112 221 111 58 241 56 227 186 29 245 28 252 93 131
247 14 126 163 204 246 7 63 220 102 123 142 146 110 51 176 71 73
55 148 88 174 169 150 74 44 87 217 75 37 22 166 225 168 159 11
83 253 84 194 136 164 243 42 97 68 82 244 21 189 34 41 122 135
211 17 153 61 206 228 133 193 147 103 114 207 237 196 190 57 234 251
98 95 145 117 240 49 162 197 183 120 149 81 239 214 60 199 165 250
107 30 238 223 125 184 15 119 226 179 92 138 182 113 212 46 69 91
181 106 23 175 160 215 53 134 218 80 230 151 67 109 40 115 198 172
187 20 180 99 86 208 10 90 188 43 104 5 45 94 152 52 143 155
47 76 26 202 192 154 38 13 101 96 77 19 139 191 48 171 132 200
210 24 216 66 100 105 12 108 33 50 185 6 54 157 25 209 3 27
195 129 229 140 128 236 205 255 70 64 118 235 242 35 32 59 248 121
156 16 144 124 177 78 8 72 62 213 39 4 36 31 231 158 2 18
130 254 79 ]
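A possible numpy-thonic approach (a sketch of my own, not from the thread; multGF2_vec is a hypothetical name): store the exponents in a wider dtype so the addition cannot wrap, and double the antilog table so that the mod-255 reduction disappears entirely; zero operands, which have no logarithm, are masked out at the end.

import numpy as np

# assumes `log` and `antilog` were built as above
log16 = log.astype(np.int16)                    # exponents as int16: no uint8 wraparound
antilog2 = np.concatenate([antilog, antilog])   # g^255 == 1, so antilog2[i] == g^i for all i < 510

def multGF2_vec(a, b):
    a = np.asarray(a, dtype=np.uint8)
    b = np.asarray(b, dtype=np.uint8)
    # log16[a] + log16[b] <= 508 < 510, so the doubled table replaces the % 255
    out = antilog2[log16[a] + log16[b]]
    # log[0] is undefined, so force the product to 0 whenever either factor is 0
    return np.where((a == 0) | (b == 0), 0, out).astype(np.uint8)

This works elementwise on whole uint8 arrays, so there is no per-element Python overhead.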
