I want a bar plot showing the number of cases of every disease in Albania in 2000.
I tried this, but I could not get what I want:
fig, ax = plt.subplots()
ax.bar(df[country['Albania']], df['2000'])
plt.xlabel('Fruit', fontsize=17, fontname='Times New Roman')
plt.ylabel('Spent', fontsize=17, fontname='Times New Roman')
plt.title('Share of diseases in Albania in 2000 ', fontsize=17, fontname="Times New Roman")
plt.show()
Let's first set up a dummy example:
import numpy as np
import pandas as pd
import itertools
np.random.seed(0)
df = pd.DataFrame({('Country_%s' % c, y): {'disease_%d' % (i + 1): np.random.randint(100)
                                            for i in range(4)}
                   for c, y in itertools.product(list('ABCD'), range(1998, 2002))
                   }).T
df.index.names = ('country', 'year')
disease_1 disease_2 disease_3 disease_4
country year
Country_A 1998 44 47 64 67
1999 67 9 83 21
2000 36 87 70 88
2001 88 12 58 65
Country_B 1998 39 87 46 88
1999 81 37 25 77
2000 72 9 20 80
2001 69 79 47 64
Country_C 1998 82 99 88 49
1999 29 19 19 14
2000 39 32 65 9
2001 57 32 31 74
Country_D 1998 23 35 75 55
1999 28 34 0 0
2000 36 53 5 38
2001 17 79 4 42
You can then select a single multi-indexed row for a given country and year:
df.loc[('Country_B', 2000)]
output:
disease_1 72
disease_2 9
disease_3 20
disease_4 80
Name: (Country_B, 2000), dtype: int64
and plot (here using pandas+matplotlib):
ax = df.loc[('Country_B', 2000)].plot.bar()
ax.set_ylabel('number of cases')
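Applied to your own data, the same pattern should work, assuming your real DataFrame is also indexed by (country, year) and that matplotlib.pyplot is imported as plt, as in your snippet (adjust the key if, for example, the year is stored as the string '2000'):
ax = df.loc[('Albania', 2000)].plot.bar()
ax.set_ylabel('number of cases')
ax.set_title('Share of diseases in Albania in 2000', fontsize=17, fontname='Times New Roman')
plt.show()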
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 100 110 5
21 60 70 80 55 57 8
32 12 43 57 87 98 9
41 99 23 45 65 78 12
This is the demo DataFrame.
For each row I want to take the maximum among the three countries (US, INDIA, GERMANY), add that row's Threshold value to it, and write the result back into the DataFrame.
Let's take an example:
max(US, INDIA, GERMANY) = max(US, INDIA, GERMANY) + Threshold
After this operation the DataFrame should be updated as below:
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 105 110 5
21 60 78 80 55 57 8
32 12 43 57 96 98 9
41 111 23 45 65 78 12
I tried to achieve this with a for loop, but it takes too long to execute:
df_max = df_final[['US','INDIA','GERMANY']].idxmax(axis=1)
for ind in df_final.index:
    column = df_max[ind]
    df_final[column][ind] = df_final[column][ind] + df_final['Threshold'][ind]
Please help me find a faster solution. Thanks in advance!
First solution: compare the maximal value per row with all values of the filtered columns, then multiply the boolean mask by Threshold and add the result to the original columns:
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
                   .mul(df_final['Threshold'], axis=0))
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
Or use NumPy: get the column names with idxmax, compare them against the array built from cols, multiply the mask by Threshold, and add it to the original columns:
cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
                   df_final['Threshold'].to_numpy()[:, None])
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
The two solutions differ when a row has multiple maximum values: the first solution adds the threshold to every maximum, the second only to the first maximum.
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 100 20 100 110 5 <- changed data: duplicated maximum value 100
1 21 60 70 80 55 57 8
2 32 12 43 57 87 98 9
3 41 99 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
                   .mul(df_final['Threshold'], axis=0))
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
                   df_final['Threshold'].to_numpy()[:, None])
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 100 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
I'm trying to find each value's share of the total within its respective index level; however, my current attempt produces NaN values.
df = pd.DataFrame(
    {"one": np.arange(0, 20), "two": np.arange(20, 40)},
    index=[np.concatenate([np.zeros(10), np.ones(10)]), np.arange(80, 100)],
)
DataFrame:
one two
0.0 80 0 20
81 1 21
82 2 22
83 3 23
84 4 24
85 5 25
86 6 26
87 7 27
88 8 28
89 9 29
1.0 90 10 30
91 11 31
92 12 32
93 13 33
94 14 34
95 15 35
96 16 36
97 17 37
98 18 38
99 19 39
Aim:
To compute each value of column 'one' as a percentage of the total within its level-0 group.
Excel example: (screenshot of the expected result omitted)
Current attempted code:
for loc in df.index.get_level_values(0):
    df.loc[loc, 'total'] = df.loc[loc, :] / df.loc[loc, :].sum()
IIUC, use:
df['total'] = df['one'].div(df.groupby(level=0)['one'].transform('sum'))
output:
one two total
0 80 0 20 0.000000
81 1 21 0.022222
82 2 22 0.044444
83 3 23 0.066667
84 4 24 0.088889
85 5 25 0.111111
86 6 26 0.133333
87 7 27 0.155556
88 8 28 0.177778
89 9 29 0.200000
1 90 10 30 0.068966
91 11 31 0.075862
92 12 32 0.082759
93 13 33 0.089655
94 14 34 0.096552
95 15 35 0.103448
96 16 36 0.110345
97 17 37 0.117241
98 18 38 0.124138
99 19 39 0.131034
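The result above is a fraction of each group's total; if you want an actual percentage, as the question's wording suggests, scale it afterwards (a small addition on top of the answer, not part of the original output):
# convert the group-wise fraction into a percentage
df['total'] = df['total'] * 100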
I have a set of data that looks like this:
Exp # ID Q1 Q2 All IDs Q1 unique Q2 unique Overlap Unnamed: 8
0 1 58 32 58 58 14 40 18 18
1 2 55 38 44 55 28 34 10 10
2 4 95 69 83 95 37 51 32 32
3 5 92 68 84 92 31 47 37 37
4 6 0 0 0 0 0 0 0 0
5 7 71 52 65 71 27 40 25 25
6 8 84 69 69 84 39 39 30 30
7 10 65 35 63 65 17 45 18 18
8 11 90 72 72 90 39 39 33 33
9 14 88 84 80 88 52 48 32 32
10 17 89 56 75 89 30 49 26 26
11 19 83 56 70 83 32 46 24 24
12 20 94 72 83 93 35 46 37 37
13 21 73 57 56 73 38 37 19 19
For each Exp #, I want to make a Venn diagram from the values Q1 unique, Q2 unique, and Overlap.
I have tried a couple of things; the code below has gotten me the closest:
from matplotlib import pyplot as plt
import numpy as np
from matplotlib_venn import venn2, venn2_circles
import csv
import pandas as pd
import numpy as np
val_path = r"C:\Users\lawashburn\Documents\DIA\DSD First Pass\20220202_Acquisition\Overlap_Values.csv"
val_tab = pd.read_csv(val_path)
exp_num = val_tab['Exp #']
cols = ['Q1 unique','Q2 unique', 'Overlap']
df = pd.DataFrame()
df['Exp #'] = exp_num
df['combined'] = val_tab[cols].apply(lambda row: ','.join(row.values.astype(str)), axis=1)
print(df)
exp_no = df['Exp #'].tolist()
combined = df['combined'].tolist()
#combined = [int(i) for i in combined]
print(combined)
for a in exp_no:
    plt.figure(figsize=(4,4))
    plt.title(a)
    for b in combined:
        v = venn2(subsets=(b), set_labels = ('Q1', 'Q2'), set_colors=('purple','skyblue'), alpha=0.7)
        v.get_label_by_id('A').set_text('Q1')
        c = venn2_circles(subsets=(b))
        plt.show()
        plt.savefig(a + 'output.png')
This generates a DataFrame:
Exp # combined
0 1 14,40,18
1 2 28,34,10
2 4 37,51,32
3 5 31,47,37
4 6 0,0,0
5 7 27,40,25
6 8 39,39,30
7 10 17,45,18
8 11 39,39,33
9 14 52,48,32
10 17 30,49,26
11 19 32,46,24
12 20 35,46,37
13 21 38,37,19
However, I think I run into an issue when I export the combined column to a list:
['14,40,18', '28,34,10', '37,51,32', '31,47,37', '0,0,0', '27,40,25', '39,39,30', '17,45,18', '39,39,33', '52,48,32', '30,49,26', '32,46,24', '35,46,37', '38,37,19']
As after this I get the error:
numpy.core._exceptions.UFuncTypeError: ufunc 'absolute' did not contain a loop with signature matching types dtype('<U8') -> dtype('<U8')
How should I proceed from here? I would like 13 separate Venn diagrams, each exported to its own .png file.
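One way to proceed, sketched below and not tested against your actual CSV: venn2 expects subsets as a tuple of three numbers (Q1 only, Q2 only, overlap), so the joined string '14,40,18' is what triggers the ufunc error. Passing the integer values of each row directly, and saving before (or instead of) calling plt.show(), should give one .png per Exp #:
from matplotlib import pyplot as plt
from matplotlib_venn import venn2, venn2_circles
import pandas as pd

val_tab = pd.read_csv(val_path)  # same file as in your snippet

for _, row in val_tab.iterrows():
    subsets = (int(row['Q1 unique']), int(row['Q2 unique']), int(row['Overlap']))
    if sum(subsets) == 0:
        continue  # nothing to draw for an all-zero row (e.g. Exp # 6)
    plt.figure(figsize=(4, 4))
    venn2(subsets=subsets, set_labels=('Q1', 'Q2'),
          set_colors=('purple', 'skyblue'), alpha=0.7)
    venn2_circles(subsets=subsets)
    plt.title(str(row['Exp #']))
    plt.savefig('{}_output.png'.format(row['Exp #']))  # save before show, or the file may be blank
    plt.close()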
I have a DataFrame and I need to create a new column which contains the second largest value of each row in the original Dataframe.
Sample:
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
Desired output:
0 1 2 3 4 5 6 7 8 9 penultimate
0 52 69 62 7 20 69 38 10 57 17 62
1 52 94 49 63 1 90 14 76 20 84 90
2 78 37 58 7 27 41 27 26 48 51 58
3 6 39 99 36 62 90 47 25 60 84 90
4 37 36 91 93 76 69 86 95 69 6 93
5 5 54 73 61 22 29 99 27 46 24 73
6 71 65 45 9 63 46 4 93 36 18 71
7 85 7 76 46 65 97 64 52 28 80 85
How can this be done in as little code as possible?
You could use NumPy for this:
import numpy as np
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
df['penultimate'] = np.sort(df.values, 1)[:, -2]
print(df)
Using NumPy here is faster than a row-wise apply.
Here is a simple lambda function!
# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
# Output
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
print(df)
Cheers!
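One caveat worth noting (my observation, not part of either answer): the two approaches differ when the largest value appears more than once in a row. np.sort keeps duplicates, so the second-from-last element is the repeated maximum, while .unique() drops duplicates and returns the second-largest distinct value:
import numpy as np
import pandas as pd

row = pd.Series([5, 9, 9, 1])
print(np.sort(row.values)[-2])           # 9 -> duplicates count
print(row.sort_values().unique()[-2])    # 5 -> second-largest distinct value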
The documentation suggests:
You can also specify the axis argument to .loc to interpret the passed
slicers on a single axis.
However, I get an error when trying to slice along the column index.
import pandas as pd
import numpy as np
cols = [(yr, m) for yr in [2014, 2015] for m in [7, 8, 9, 10]]
df = pd.DataFrame(np.random.randint(1, 100, (10, 8)), index=tuple('ABCDEFGHIJ'))
df.columns = pd.MultiIndex.from_tuples(cols)
print(df.head())
2014 2015
7 8 9 10 7 8 9 10
A 68 51 6 48 24 3 4 85
B 79 75 68 62 19 40 63 45
C 60 15 32 32 37 95 56 38
D 4 54 81 50 13 64 65 13
E 78 21 84 1 83 18 39 57
#This does not work as expected
print(df.loc(axis=1)[(2014, 9):(2015, 8)])
AssertionError: Start slice bound is non-scalar
#but an arbitrary transpose and changing axis works!
df = df.T
print(df.loc(axis=0)[(2014, 9):(2015, 8)])
A B C D E F G H I J
2014 9 6 68 32 81 84 60 83 39 94 93
10 48 62 32 50 1 84 18 14 92 33
2015 7 24 19 37 13 83 69 31 91 69 90
8 3 40 95 64 18 8 32 93 16 25
So I could always assign the slice and then transpose back.
That feels like a hack, though, and the axis=1 setting should have worked.
df = df.loc(axis=0)[(2014, 9):(2015, 8)]
df = df.T
print(df)
2014 2015
9 10 7 8
A 64 98 99 87
B 43 36 22 84
C 32 78 86 66
D 67 8 34 73
E 83 54 96 33
F 18 83 36 71
G 13 25 76 8
H 69 4 99 84
I 3 52 50 62
J 67 60 9 49
That might be a bug; please post an issue on GitHub. The canonical way to select things is to fully specify all the axes:
In [6]: df.loc[:,(2014,9):(2015,8)]
Out[6]:
2014 2015
9 10 7 8
A 26 2 44 69
B 41 7 5 1
C 8 27 23 22
D 54 72 81 93
E 18 23 54 7
F 11 81 37 83
G 60 38 59 29
H 3 95 89 96
I 6 9 77 9
J 90 92 10 32
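For completeness, pd.IndexSlice builds exactly the same slice object, so this is only an alternative, arguably more readable, spelling of the fully specified selection above (written here with Python 3 print syntax):
idx = pd.IndexSlice
print(df.loc[:, idx[(2014, 9):(2015, 8)]])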