Find the total % of each value in its respective index level [duplicate] - python

I'm trying to find the % of the total for each value within its respective index level; however, my current attempt produces NaN values.
df = pd.DataFrame({"one": np.arange(0, 20), "two": np.arange(20, 40)},
                  index=[np.array([np.zeros(10), np.ones(10)]).flatten(), np.arange(80, 100)])
DataFrame:
one two
0.0 80 0 20
81 1 21
82 2 22
83 3 23
84 4 24
85 5 25
86 6 26
87 7 27
88 8 28
89 9 29
1.0 90 10 30
91 11 31
92 12 32
93 13 33
94 14 34
95 15 35
96 16 36
97 17 37
98 18 38
99 19 39
Aim:
To see the % total of a column 'one' within its respective level.
Current attempted code:
for loc in df.index.get_level_values(0):
    df.loc[loc, 'total'] = df.loc[loc, :] / df.loc[loc, :].sum()

IIUC, use:
df['total'] = df['one'].div(df.groupby(level=0)['one'].transform('sum'))
output:
one two total
0 80 0 20 0.000000
81 1 21 0.022222
82 2 22 0.044444
83 3 23 0.066667
84 4 24 0.088889
85 5 25 0.111111
86 6 26 0.133333
87 7 27 0.155556
88 8 28 0.177778
89 9 29 0.200000
1 90 10 30 0.068966
91 11 31 0.075862
92 12 32 0.082759
93 13 33 0.089655
94 14 34 0.096552
95 15 35 0.103448
96 16 36 0.110345
97 17 37 0.117241
98 18 38 0.124138
99 19 39 0.131034
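For context on why the original loop produced NaN: df.loc[loc, :] is a two-column sub-frame, so dividing it by its sum and assigning the result into the single 'total' column leaves nothing to align on, and pandas fills in NaN. Below is a minimal sketch of an equivalent group-wise normalisation; rebuilding the example frame this way is an assumption based on the output printed in the question.
import numpy as np
import pandas as pd

# Rebuild the example frame (an assumption matching the index/values shown above).
df = pd.DataFrame(
    {"one": np.arange(0, 20), "two": np.arange(20, 40)},
    index=[np.repeat([0.0, 1.0], 10), np.arange(80, 100)],
)

# Same idea as the answer: normalise 'one' within each level-0 group.
df["total"] = df.groupby(level=0)["one"].transform(lambda s: s / s.sum())
print(df.head())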

Related

Grid of integers

I need to make a grid with the numbers generated by the code, but I don't understand how to align them in columns.
Is there a parameter of print, or something else, that could help me out?
# main()
a = 0
b = 0
for i in range(1, 13):
    a = a + 1
    print(" ")
    b = b + 1
    for f in range(1, 13):
        print(f*b, end=" ")
I would recommend using python's f-strings:
for i in range(1, 13):
    print(''.join(f"{i*j: 4}" for j in range(1, 13)))
Here's the output:
1 2 3 4 5 6 7 8 9 10 11 12
2 4 6 8 10 12 14 16 18 20 22 24
3 6 9 12 15 18 21 24 27 30 33 36
4 8 12 16 20 24 28 32 36 40 44 48
5 10 15 20 25 30 35 40 45 50 55 60
6 12 18 24 30 36 42 48 54 60 66 72
7 14 21 28 35 42 49 56 63 70 77 84
8 16 24 32 40 48 56 64 72 80 88 96
9 18 27 36 45 54 63 72 81 90 99 108
10 20 30 40 50 60 70 80 90 100 110 120
11 22 33 44 55 66 77 88 99 110 121 132
12 24 36 48 60 72 84 96 108 120 132 144
The most common form lets you put almost any arbitrary expression within the curly braces, including dictionary values, function calls and so on. The usage above specifies formatting after the colon: the 4 means the whole expression should take up at least 4 characters (padded with spaces by default), and the space before the 4 reserves a leading space for the sign of non-negative numbers so the columns line up. For more info, check out the documentation.
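For illustration, here are a few variations on the format specification; a small sketch with arbitrary values, not part of the answer above:
n = 42
print(f"{n:4}")    # '  42'  -> right-aligned in a field at least 4 wide (space fill by default)
print(f"{n: }")    # ' 42'   -> a leading space is reserved for the sign of non-negative numbers
print(f"{n:04}")   # '0042'  -> zero-padded to width 4
print(f"{n:<4}|")  # '42  |' -> left-aligned within width 4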
Assuming the width of each grid cell is stored in w (4 is wide enough for the snippet above), a regularly spaced grid can be printed using:
w = 4
a, b = 0, 0
for i in range(1, 13):
    a, b = a+1, b+1
    for f in range(1, 13):
        print(('{:'+str(w)+'}').format(f*b), end='')
    print('')
Its output is
1 2 3 4 5 6 7 8 9 10 11 12
2 4 6 8 10 12 14 16 18 20 22 24
3 6 9 12 15 18 21 24 27 30 33 36
4 8 12 16 20 24 28 32 36 40 44 48
5 10 15 20 25 30 35 40 45 50 55 60
6 12 18 24 30 36 42 48 54 60 66 72
7 14 21 28 35 42 49 56 63 70 77 84
8 16 24 32 40 48 56 64 72 80 88 96
9 18 27 36 45 54 63 72 81 90 99 108
10 20 30 40 50 60 70 80 90 100 110 120
11 22 33 44 55 66 77 88 99 110 121 132
12 24 36 48 60 72 84 96 108 120 132 144
You can reference keyword argument values passed to the str.format() method in the format string by name via {name}. Here's an example of doing that where the value referenced is computed (as opposed to being a constant):
mx = 12
w = len(str(mx*mx)) + 1
for b in range(1, mx+1):
    for f in range(1, mx+1):
        print(('{:{w}}').format(f*b, w=w), end='')
    print('')
Output:
1 2 3 4 5 6 7 8 9 10 11 12
2 4 6 8 10 12 14 16 18 20 22 24
3 6 9 12 15 18 21 24 27 30 33 36
4 8 12 16 20 24 28 32 36 40 44 48
5 10 15 20 25 30 35 40 45 50 55 60
6 12 18 24 30 36 42 48 54 60 66 72
7 14 21 28 35 42 49 56 63 70 77 84
8 16 24 32 40 48 56 64 72 80 88 96
9 18 27 36 45 54 63 72 81 90 99 108
10 20 30 40 50 60 70 80 90 100 110 120
11 22 33 44 55 66 77 88 99 110 121 132
12 24 36 48 60 72 84 96 108 120 132 144

How to get randomly 20 elements from np.array and save it to DataFrame?

I have a DataFrame of the numbers from 1 to 80. How can I get 20 elements at random and save the result to another DataFrame? I can't save each list as a row; it saves the elements as columns. In the future I want to try to predict each random selection with sklearn.
a = np.arange(1,81).reshape(8,10)
pd.DataFrame(a)
I need to get 20 unique numbers and write them into one row. For example, in Python:
from random import sample
for x in range(1, 20):
    i = sample(range(1, 81), k=20)
    i.sort()
    print(x, '-', i)
It returns a list of 20 elements like [1,3,5,8,34,45,12,76,45...], and I want it to look like:
0 1 2 3 4 5 6 7 8 9 10 11 12 ... 20
0 1 5 10 14 20 55 67 34 ...... 20 elements
1
.
.
Use df.sample() to get samples of data from a DataFrame:
a = np.arange(1,81).reshape(8,10)
df = pd.DataFrame(a)
df1= df.sample(frac=.25)
>>df1
0 1 2 3 4 5 6 7 8 9
5 51 52 53 54 55 56 57 58 59 60
3 31 32 33 34 35 36 37 38 39 40
For a random permutation, use np.random.permutation():
df.iloc[np.random.permutation(len(df))].head(2)
0 1 2 3 4 5 6 7 8 9
6 61 62 63 64 65 66 67 68 69 70
1 11 12 13 14 15 16 17 18 19 20
EDIT : To get 20 elements in a list use:
import itertools
list(itertools.chain.from_iterable(df.sample(frac=.25).values))
#[71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
frac=.25 means 25% of the data; since you have 80 elements, 25% gives you 20 elements. You can adjust the fraction depending on how many elements you have and how many you want.
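As a side note, df.sample also accepts n for an exact row count, which avoids the fraction arithmetic; a small sketch assuming the same 1-to-80 frame:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 81).reshape(8, 10))

# n=2 asks for exactly two rows; two rows of ten columns = 20 elements.
sample_df = df.sample(n=2)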
EDIT1: Further to your edit in the question: print(df.values) gives you an array:
[[ 1 2 3 4 5 6 7 8 9 10]
[11 12 13 14 15 16 17 18 19 20]
[21 22 23 24 25 26 27 28 29 30]
[31 32 33 34 35 36 37 38 39 40]
[41 42 43 44 45 46 47 48 49 50]
[51 52 53 54 55 56 57 58 59 60]
[61 62 63 64 65 66 67 68 69 70]
[71 72 73 74 75 76 77 78 79 80]]
You would need to shuffle this array using np.random.shuffle; in this case, do it on df.T.values since you also want to shuffle the columns:
np.random.shuffle(df.T.values)
Then do a reshape:
df1 = pd.DataFrame(np.reshape(df.values,(4,20)))
>>df1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 4 3 10 2 8 7 1 5 6 9 14 13 20 12 18 17 11 15 16 19
1 24 23 30 22 28 27 21 25 26 29 34 33 40 32 38 37 31 35 36 39
2 44 43 50 42 48 47 41 45 46 49 54 53 60 52 58 57 51 55 56 59
3 64 63 70 62 68 67 61 65 66 69 74 73 80 72 78 77 71 75 76 79
This is a simple way using existing stackoverflow answers:
1- Flatten the array so it looks more like a list; this lets you deal with a single index instead of two array indexes:
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ndarray.flatten.html
aflat = a.flatten()
2- Choose random items from the flattened array using any of the answers here:
How to randomly select an item from a list?
3- With the selected data, build your DataFrame (see the sketch below).
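A minimal sketch of those three steps, assuming the same 1-to-80 array from the question (the variable names are illustrative):
import numpy as np
import pandas as pd
from random import sample

a = np.arange(1, 81).reshape(8, 10)

# 1. Flatten to a 1-D array so a single index addresses every element.
aflat = a.flatten()

# 2. Draw 20 distinct elements at random and sort them.
picked = sorted(sample(list(aflat), k=20))

# 3. Build a one-row DataFrame from the selection.
df_row = pd.DataFrame([picked])
print(df_row)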
You can also use numpy.random.choice, where you can specify exactly how many rows you want in the sample:
In [263]: a = np.arange(1,81).reshape(8,10)
In [265]: b = pd.DataFrame(a)
In [268]: b.iloc[np.random.choice(np.arange(len(b)), 5, False)]
Out[268]:
0 1 2 3 4 5 6 7 8 9
5 51 52 53 54 55 56 57 58 59 60
7 71 72 73 74 75 76 77 78 79 80
3 31 32 33 34 35 36 37 38 39 40
1 11 12 13 14 15 16 17 18 19 20
4 41 42 43 44 45 46 47 48 49 50
You can change 5 to however many rows you need; you don't have to work out a fraction.
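If the goal is 20 individual elements per row rather than whole rows, np.random.choice can also draw from the flattened values; a sketch, not part of the answer above:
import numpy as np
import pandas as pd

a = np.arange(1, 81).reshape(8, 10)

# Draw 20 distinct elements from the flattened array and store them as one row.
vals = np.sort(np.random.choice(a.ravel(), size=20, replace=False))
row_df = pd.DataFrame([vals])
print(row_df)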

compare two data frames and delete columns based on lookup table

I have two data frames:
df1:
A B C D E F
0 63 9 56 23 41 0
1 40 35 69 98 47 45
2 51 95 55 36 10 34
3 25 11 67 83 49 89
4 91 10 43 73 96 95
5 2 47 8 30 46 9
6 37 10 33 8 45 20
7 40 88 6 29 46 79
8 75 87 49 76 0 69
9 92 21 86 91 46 41
df2:
A B C D E F
0 0 0 0 1 1 0
I want to delete columns in df1 based on the values in df2 (a lookup table): wherever df2 has a 1, I have to delete that column in df1.
So my final output should look like:
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
Assuming len(df1.columns) == len(df2.columns):
df1.loc[:, ~df2.loc[0].astype(bool).values]
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
If the columns aren't the same, but df2's columns are a subset of df1's, then:
df1.reindex(df2.columns[~df2.loc[0].astype(bool)], axis=1)
Or with drop, similar to #student's method:
df1.drop(df2.columns[df2.loc[0].astype(bool)], axis=1)
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
Index.intersection on the columns can also be used:
df1[df1.columns.intersection(df2.columns[~df2.iloc[0].astype(bool)])]
Out[354]:
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
You can use drop to remove the columns:
remove_col = df2.columns[(df2 == 1).any()] # get columns with any value 1
df1.drop(remove_col, axis=1, inplace=True) # drop the columns in original dataframe
Or, in one line as:
df1.drop(df2.columns[(df2 == 1).any()], axis=1, inplace=True)
The following is an easily understandable solution:
df1.loc[:,df2.loc[0]!=1]
Output:
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
loc can be used for selecting rows or columns with a boolean or conditional lookup: https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
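A self-contained sketch tying the approaches above together; the frames here are small, illustrative stand-ins for the ones in the question:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(18).reshape(3, 6), columns=list("ABCDEF"))
df2 = pd.DataFrame([[0, 0, 0, 1, 1, 0]], columns=list("ABCDEF"))

# Keep only the columns whose flag in df2 is 0.
mask = ~df2.loc[0].astype(bool)       # boolean Series indexed by column name
result = df1.loc[:, mask.values]      # or: df1.drop(columns=df2.columns[~mask])
print(result.columns.tolist())        # ['A', 'B', 'C', 'F']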

how to multiply multiple columns by another column pandas

I have a DataFrame of 100 columns and I want to multiply one column's value ('Count') with the columns at positions 6 to 74. Please tell me how to do that. I have been trying:
df = df.ix[0, 6:74].multiply(df["Count"], axis="index")
df = df[df.columns[6:74]]*df["Count"]
Neither of them works.
The resulting DataFrame should keep all 100 original columns, with columns 6 to 74 holding the multiplied values in every row.
Assuming the same dataframe provided by @MaxU.
Not easier, but a perspective on how to use other API elements:
pd.DataFrame.update and pd.DataFrame.mul
df.update(df.iloc[:, 3:7].mul(df.Count, 0))
df
0 1 2 3 4 5 6 7 8 9 Count
0 89 38 89 15.366436 1.355862 7.231264 4.971494 12 70 69 0.225977
1 49 1 38 1.004190 1.095480 2.829990 0.273870 57 93 64 0.030430
2 2 53 49 49.749460 50.379200 54.157640 16.373240 22 31 41 0.629740
3 38 44 23 28.437516 73.545300 41.185368 73.545300 19 99 57 0.980604
4 45 2 60 10.093230 4.773825 10.502415 6.274170 43 63 55 0.136395
5 65 97 15 10.375760 57.066680 38.260615 14.915155 68 5 21 0.648485
6 95 90 45 52.776000 16.888320 22.517760 50.664960 76 32 75 0.703680
7 60 31 65 63.242210 2.976104 26.784936 38.689352 72 73 94 0.744026
8 64 96 96 7.505370 37.526850 11.007876 10.007160 68 56 39 0.500358
9 78 54 74 8.409275 25.227825 16.528575 9.569175 97 63 37 0.289975
Demo:
Sample DF:
In [6]: df = pd.DataFrame(np.random.randint(100,size=(10,10))) \
.assign(Count=np.random.rand(10))
In [7]: df
Out[7]:
0 1 2 3 4 5 6 7 8 9 Count
0 89 38 89 68 6 32 22 12 70 69 0.225977
1 49 1 38 33 36 93 9 57 93 64 0.030430
2 2 53 49 79 80 86 26 22 31 41 0.629740
3 38 44 23 29 75 42 75 19 99 57 0.980604
4 45 2 60 74 35 77 46 43 63 55 0.136395
5 65 97 15 16 88 59 23 68 5 21 0.648485
6 95 90 45 75 24 32 72 76 32 75 0.703680
7 60 31 65 85 4 36 52 72 73 94 0.744026
8 64 96 96 15 75 22 20 68 56 39 0.500358
9 78 54 74 29 87 57 33 97 63 37 0.289975
Let's multiply columns 3-6 by df['Count']:
In [8]: df.iloc[:, 3:6+1]
Out[8]:
3 4 5 6
0 68 6 32 22
1 33 36 93 9
2 79 80 86 26
3 29 75 42 75
4 74 35 77 46
5 16 88 59 23
6 75 24 32 72
7 85 4 36 52
8 15 75 22 20
9 29 87 57 33
In [9]: df.iloc[:, 3:6+1] *= df['Count']
In [10]: df
Out[10]:
0 1 2 3 4 5 6 7 8 9 Count
0 89 38 89 66.681065 0.818372 20.751519 15.480964 12 70 69 0.225977
1 49 1 38 32.359929 4.910233 60.309102 6.333122 57 93 64 0.030430
2 2 53 49 77.467708 10.911630 55.769707 18.295685 22 31 41 0.629740
3 38 44 23 28.437513 10.229653 27.236368 52.776014 19 99 57 0.980604
4 45 2 60 72.564688 4.773838 49.933342 32.369289 43 63 55 0.136395
5 65 97 15 15.689662 12.002793 38.260613 16.184644 68 5 21 0.648485
6 95 90 45 73.545292 3.273489 20.751519 50.664974 76 32 75 0.703680
7 60 31 65 83.351331 0.545581 23.345459 36.591370 72 73 94 0.744026
8 64 96 96 14.709058 10.229653 14.266669 14.073604 68 56 39 0.500358
9 78 54 74 28.437513 11.866397 36.963643 23.221446 97 63 37 0.289975
The easiest thing to do here would be to extract the values, multiply, and then assign.
u = df.iloc[:, 6:74].values   # values of the columns to scale
v = df[['Count']].values      # (n, 1) column vector so it broadcasts row-wise
df.iloc[:, 6:74] = u * v
By using combine_first
df.iloc[:, 3:6+1].mul(df['Count'],axis=0).combine_first(df)
You need to concatenate the DataFrame resulting from the multiplication with the remaining columns:
df = pd.concat([df.iloc[:, 0:6], df.iloc[:, 75:], df.iloc[:, 6:74+1].multiply(df['Count'], axis=0)], axis=1)
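Putting this together for the shape described in the question, a sketch assuming 'Count' is a real column and positions 6 through 74 are the ones to scale (the data here is randomly generated for illustration):
import numpy as np
import pandas as pd

# A stand-in frame: 99 data columns plus 'Count'.
df = pd.DataFrame(np.random.randint(1, 100, size=(5, 99)),
                  columns=[f"c{i}" for i in range(99)])
df["Count"] = np.random.rand(5)

# Multiply the columns at positions 6..74 (inclusive) by 'Count', row-wise,
# writing the results back in place and leaving the other columns untouched.
df.update(df.iloc[:, 6:74 + 1].mul(df["Count"], axis=0))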

Axis argument to .loc() to interpret the passed slicers on axis=1

The documentation suggests:
You can also specify the axis argument to .loc to interpret the passed
slicers on a single axis.
However I get an error trying to slice along the column index.
import pandas as pd
import numpy as np
cols= [(yr,m) for yr in [2014,2015] for m in [7,8,9,10]]
df = pd.DataFrame(np.random.randint(1,100,(10,8)),index=tuple('ABCDEFGHIJ'))
df.columns =pd.MultiIndex.from_tuples(cols)
print df.head()
2014 2015
7 8 9 10 7 8 9 10
A 68 51 6 48 24 3 4 85
B 79 75 68 62 19 40 63 45
C 60 15 32 32 37 95 56 38
D 4 54 81 50 13 64 65 13
E 78 21 84 1 83 18 39 57
#This does not work as expected
print df.loc(axis=1)[(2014,9):(2015,8)]
AssertionError: Start slice bound is non-scalar
#but an arbitrary transpose and changing axis works!
df = df.T
print df.loc(axis=0)[(2014,9):(2015,8)]
A B C D E F G H I J
2014 9 6 68 32 81 84 60 83 39 94 93
10 48 62 32 50 1 84 18 14 92 33
2015 7 24 19 37 13 83 69 31 91 69 90
8 3 40 95 64 18 8 32 93 16 25
So I could always assign the slice and re-transpose.
That, though, feels like a hack, and the axis=1 setting should have worked.
df = df.loc(axis=0)[(2014,9):(2015,8)]
df = df.T
print df
2014 2015
9 10 7 8
A 64 98 99 87
B 43 36 22 84
C 32 78 86 66
D 67 8 34 73
E 83 54 96 33
F 18 83 36 71
G 13 25 76 8
H 69 4 99 84
I 3 52 50 62
J 67 60 9 49
That might be a bug; please post an issue on GitHub. The canonical way to select things is to fully specify all the axes.
In [6]: df.loc[:,(2014,9):(2015,8)]
Out[6]:
2014 2015
9 10 7 8
A 26 2 44 69
B 41 7 5 1
C 8 27 23 22
D 54 72 81 93
E 18 23 54 7
F 11 81 37 83
G 60 38 59 29
H 3 95 89 96
I 6 9 77 9
J 90 92 10 32
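For reference, a sketch of the fully specified selection in current pandas (Python 3 syntax; the data here is randomly generated):
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [(yr, m) for yr in (2014, 2015) for m in (7, 8, 9, 10)]
)
df = pd.DataFrame(np.random.randint(1, 100, (10, 8)),
                  index=list("ABCDEFGHIJ"), columns=cols)

# Fully specify both axes; the tuple slice selects columns from (2014, 9)
# through (2015, 8) inclusive.
subset = df.loc[:, (2014, 9):(2015, 8)]

# pd.IndexSlice expresses the same selection a little more explicitly.
subset2 = df.loc[:, pd.IndexSlice[(2014, 9):(2015, 8)]]
print(subset.head())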
