Venn Diagram for each row in DataFrame - python

I have a set of data that looks like this:
    Exp #  ID  Q1  Q2  All IDs  Q1 unique  Q2 unique  Overlap  Unnamed: 8
0       1  58  32  58       58         14         40       18          18
1       2  55  38  44       55         28         34       10          10
2       4  95  69  83       95         37         51       32          32
3       5  92  68  84       92         31         47       37          37
4       6   0   0   0        0          0          0        0           0
5       7  71  52  65       71         27         40       25          25
6       8  84  69  69       84         39         39       30          30
7      10  65  35  63       65         17         45       18          18
8      11  90  72  72       90         39         39       33          33
9      14  88  84  80       88         52         48       32          32
10     17  89  56  75       89         30         49       26          26
11     19  83  56  70       83         32         46       24          24
12     20  94  72  83       93         35         46       37          37
13     21  73  57  56       73         38         37       19          19
For each exp #, I want to make a Venn diagram with the values Q1 Unique, Q2 Unique, and Overlap.
I have tried a couple of things, the below code has gotten me the closest:
from matplotlib import pyplot as plt
import numpy as np
from matplotlib_venn import venn2, venn2_circles
import csv
import pandas as pd
import numpy as np
val_path = r"C:\Users\lawashburn\Documents\DIA\DSD First Pass\20220202_Acquisition\Overlap_Values.csv"
val_tab = pd.read_csv(val_path)
exp_num = val_tab['Exp #']
cols = ['Q1 unique','Q2 unique', 'Overlap']
df = pd.DataFrame()
df ['Exp #'] = exp_num
df['combined'] = val_tab[cols].apply(lambda row: ','.join(row.values.astype(str)), axis=1)
print(df)
exp_no = df['Exp #'].tolist()
combined = df['combined'].tolist()
#combined = [int(i) for i in combined]
print(combined)
for a in exp_no:
    plt.figure(figsize=(4,4))
    plt.title(a)
    for b in combined:
        v = venn2(subsets=(b), set_labels = ('Q1', 'Q2'), set_colors=('purple','skyblue'), alpha=0.7)
        v.get_label_by_id('A').set_text('Q1')
        c = venn2_circles(subsets=(b))
        plt.show()
        plt.savefig(a + 'output.png')
This generates a DataFrame:
Exp # combined
0 1 14,40,18
1 2 28,34,10
2 4 37,51,32
3 5 31,47,37
4 6 0,0,0
5 7 27,40,25
6 8 39,39,30
7 10 17,45,18
8 11 39,39,33
9 14 52,48,32
10 17 30,49,26
11 19 32,46,24
12 20 35,46,37
13 21 38,37,19
However, I think I run into the issue when I export the combined column into a list:
['14,40,18', '28,34,10', '37,51,32', '31,47,37', '0,0,0', '27,40,25', '39,39,30', '17,45,18', '39,39,33', '52,48,32', '30,49,26', '32,46,24', '35,46,37', '38,37,19']
As after this I get the error:
numpy.core._exceptions.UFuncTypeError: ufunc 'absolute' did not contain a loop with signature matching types dtype('<U8') -> dtype('<U8')
How should I proceed from here? I would like 13 separate Venn Diagrams, and to export each of them into a separate .png file.
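One possible way forward, offered as a minimal sketch rather than a definitive fix: the error most likely arises because each entry of combined is a string such as '14,40,18', while venn2 expects numbers in the order (Q1 unique, Q2 unique, Overlap). The helper names parse_subsets, save_venn and plot_all below are hypothetical, as is the '<exp>_output.png' file name. The sketch also pairs each experiment with its own subset via zip instead of nesting two loops, skips all-zero rows (an assumption, since there is nothing to draw for them), and saves the figure without calling plt.show() first, because saving after show often writes an empty image.

```python
import os


def parse_subsets(combined):
    """Turn strings like '14,40,18' into (Q1 unique, Q2 unique, overlap) int tuples."""
    return [tuple(int(part) for part in item.split(',')) for item in combined]


def save_venn(exp, subset, out_dir='.'):
    """Draw one two-set Venn diagram and save it as '<exp>_output.png' (hypothetical name)."""
    # Imported here so the parsing helper can be used without the plotting stack.
    from matplotlib import pyplot as plt
    from matplotlib_venn import venn2, venn2_circles

    fig = plt.figure(figsize=(4, 4))
    plt.title(str(exp))
    venn2(subsets=subset, set_labels=('Q1', 'Q2'),
          set_colors=('purple', 'skyblue'), alpha=0.7)
    venn2_circles(subsets=subset)
    # exp is an integer, so build the file name with an f-string instead of `a + '...'`.
    fig.savefig(os.path.join(out_dir, f'{exp}_output.png'))
    plt.close(fig)


def plot_all(exp_no, subsets):
    """One figure per experiment; each experiment is paired with its own subset."""
    for exp, subset in zip(exp_no, subsets):
        if sum(subset) == 0:
            continue  # assumption: skip rows where every count is zero
        save_venn(exp, subset)


# A few values taken from the question's data.
exp_no = [1, 2, 6]
combined = ['14,40,18', '28,34,10', '0,0,0']
subsets = parse_subsets(combined)
print(subsets)
```

Calling plot_all(exp_no, subsets) would then write one PNG per non-empty experiment. The string round trip could also be avoided entirely by iterating over val_tab[cols] row by row and passing each row's three integers to venn2.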

Related

Find the total % of each value in its respective index level [duplicate]

This question already has answers here:
Pandas percentage of total with groupby
(16 answers)
Closed 10 months ago.
I'm trying to find each value's percentage of the total within its respective index level; however, my current attempt produces NaN values.
df = pd.DataFrame({"one": np.arange(0, 20), "two": np.arange(20, 40)}, index=[np.array([np.zeros(10), np.ones(10)]).flatten(), np.arange(80, 100)])
DataFrame:
one two
0.0 80 0 20
81 1 21
82 2 22
83 3 23
84 4 24
85 5 25
86 6 26
87 7 27
88 8 28
89 9 29
1.0 90 10 30
91 11 31
92 12 32
93 13 33
94 14 34
95 15 35
96 16 36
97 17 37
98 18 38
99 19 39
Aim:
To see the % total of a column 'one' within its respective level.
Excel example:
Current attempted code:
for loc in df.index.get_level_values(0):
    df.loc[loc, 'total'] = df.loc[loc, :] / df.loc[loc, :].sum()
If I understand correctly, use:
df['total'] = df['one'].div(df.groupby(level=0)['one'].transform('sum'))
output:
one two total
0 80 0 20 0.000000
81 1 21 0.022222
82 2 22 0.044444
83 3 23 0.066667
84 4 24 0.088889
85 5 25 0.111111
86 6 26 0.133333
87 7 27 0.155556
88 8 28 0.177778
89 9 29 0.200000
1 90 10 30 0.068966
91 11 31 0.075862
92 12 32 0.082759
93 13 33 0.089655
94 14 34 0.096552
95 15 35 0.103448
96 16 36 0.110345
97 17 37 0.117241
98 18 38 0.124138
99 19 39 0.131034
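As a quick check of the approach above, here is a minimal sketch that rebuilds the sample frame and confirms that the shares add up to 1 within each level. The names level_sum and check are only for illustration. transform('sum') returns a Series aligned with the original rows, which is why the division lines up without any loop.

```python
import numpy as np
import pandas as pd

# Rebuild the two-level index from the question: 0 for the first ten rows, 1 for the rest.
index = [np.repeat([0, 1], 10), np.arange(80, 100)]
df = pd.DataFrame({"one": np.arange(0, 20), "two": np.arange(20, 40)}, index=index)

# Sum of 'one' per first-level group, broadcast back to every row of that group.
level_sum = df.groupby(level=0)["one"].transform("sum")
df["total"] = df["one"] / level_sum

# Within each level the shares should add up to 1.
check = df.groupby(level=0)["total"].sum()
print(check)
```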

Pandas - merging start/end time ranges with short gaps

Say I have a series of start and end times for a given event:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,5,30).cumsum().reshape(-1, 2), columns = ["start", "end"])
start end
0 2 6
1 7 8
2 12 14
3 18 20
4 24 25
5 26 28
6 29 33
7 35 36
8 39 41
9 44 45
10 48 50
11 53 54
12 58 59
13 62 63
14 65 68
I'd like to merge time ranges with a gap less than or equal to n, so for n = 1 the result would be:
fn(df, n = 1)
start end
0 2 8
2 12 14
3 18 20
4 24 33
7 35 36
8 39 41
9 44 45
10 48 50
11 53 54
12 58 59
13 62 63
14 65 68
I can't seem to find a way to do this with pandas without iterating and building up the result line-by-line. Is there some simpler way to do this?
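You can subtract the previous row's end from each start to get the gap, compare the gap with N to build a mask (True where a new range begins), take the cumulative sum of the mask to get group labels, and pass those to groupby to aggregate the min of start and the max of end: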
N = 1
g = df['start'].sub(df['end'].shift())
df = df.groupby(g.gt(N).cumsum()).agg({'start':'min', 'end':'max'})
print(df)
start end
0 2 8
1 12 14
2 18 20
3 24 33
4 35 36
5 39 41
6 44 45
7 48 50
8 53 54
9 58 59
10 62 63
11 65 68
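The same idea can be wrapped in a reusable function. This is a minimal sketch, not the answer's exact code: merge_ranges is a hypothetical name, the input is sorted by start first, and the running maximum of end is used so that a long earlier range also absorbs later ranges it overlaps (an extension beyond what the question asks for). The data is copied from the question's table rather than regenerated from the random seed.

```python
import pandas as pd


def merge_ranges(df, n):
    """Merge consecutive [start, end] ranges whose gap is <= n (hypothetical helper)."""
    ordered = df.sort_values("start")
    # Largest end seen so far, shifted so each row is compared with the rows before it.
    prev_end = ordered["end"].cummax().shift()
    # A new group starts wherever the gap exceeds n; the first row has a NaN gap (False).
    group = (ordered["start"] - prev_end).gt(n).cumsum()
    merged = ordered.groupby(group).agg({"start": "min", "end": "max"})
    return merged.reset_index(drop=True)


df = pd.DataFrame({
    "start": [2, 7, 12, 18, 24, 26, 29, 35, 39, 44, 48, 53, 58, 62, 65],
    "end":   [6, 8, 14, 20, 25, 28, 33, 36, 41, 45, 50, 54, 59, 63, 68],
})
merged = merge_ranges(df, n=1)
print(merged)
```

Unlike the desired output in the question, this version renumbers the result from 0 instead of keeping the original row labels.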

How to create a column that contains the penultimate value of each row?

I have a DataFrame and I need to create a new column which contains the second largest value of each row in the original Dataframe.
Sample:
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
Desired output:
0 1 2 3 4 5 6 7 8 9 penultimate
0 52 69 62 7 20 69 38 10 57 17 62
1 52 94 49 63 1 90 14 76 20 84 90
2 78 37 58 7 27 41 27 26 48 51 58
3 6 39 99 36 62 90 47 25 60 84 90
4 37 36 91 93 76 69 86 95 69 6 93
5 5 54 73 61 22 29 99 27 46 24 73
6 71 65 45 9 63 46 4 93 36 18 71
7 85 7 76 46 65 97 64 52 28 80 85
How can this be done in as little code as possible?
You could use NumPy for this:
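import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
# sort each row in ascending order and take the second-to-last column
df['penultimate'] = np.sort(df.values, 1)[:, -2]
print(df)
Using NumPy is faster than a row-wise apply. Note, however, that when a row's maximum occurs more than once this returns the maximum again: for row 0 of the desired output it would give 69 rather than 62.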
Here is a simple lambda function!
# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
# Output
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
print(df)
Cheers!
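For completeness, here is a sketch of a vectorized variant that ignores repeated maxima, so it matches the desired output while staying in NumPy. It assumes every row has at least two distinct values; otherwise the masked row would contain only -inf and the integer cast would fail. The three rows are taken from the desired output in the question, and the names vals, masked and naive are only for illustration.

```python
import numpy as np
import pandas as pd

# Three rows copied from the question's desired output (rows 0, 1 and 4).
df = pd.DataFrame([
    [52, 69, 62, 7, 20, 69, 38, 10, 57, 17],
    [52, 94, 49, 63, 1, 90, 14, 76, 20, 84],
    [37, 36, 91, 93, 76, 69, 86, 95, 69, 6],
])

vals = df.to_numpy().astype(float)
row_max = vals.max(axis=1, keepdims=True)
# Hide every occurrence of the row maximum, then take the maximum of what is left.
masked = np.where(vals == row_max, -np.inf, vals)
df["penultimate"] = masked.max(axis=1).astype(int)

# For comparison: plain sorting returns the duplicated maximum for the first row.
naive = np.sort(vals, axis=1)[:, -2].astype(int)
print(df["penultimate"].tolist(), naive.tolist())
```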

How to get 20 random elements from an np.array and save them to a DataFrame?

I have a DataFrame holding the numbers 1 to 80. How can I pick 20 elements at random and save the result to another DataFrame? I can't get each list saved as a row; the elements are saved as columns instead. Later I want to try predicting the random elements with sklearn.
a = np.arange(1,81).reshape(8,10)
pd.DataFrame(a)
I need to get 20 unique numbers and write them to one row. For example, in plain Python:
from random import sample
for x in range(1,20):
    i=sample(range(1,81), k=20)
    i.sort()
    print(x,'-',i)
This returns a list of 20 elements such as [1,3,5,8,34,45,12,76,45...], and I want it to look like this:
0 1 2 3 4 5 6 7 8 9 10 11 12 ... 20
0 1 5 10 14 20 55 67 34 ...... 20 elements
1
.
.
Use df.sample() to get a random sample of rows from a DataFrame:
a = np.arange(1,81).reshape(8,10)
df = pd.DataFrame(a)
df1= df.sample(frac=.25)
>>df1
0 1 2 3 4 5 6 7 8 9
5 51 52 53 54 55 56 57 58 59 60
3 31 32 33 34 35 36 37 38 39 40
For a random permutation of the rows, use np.random.permutation():
df.iloc[np.random.permutation(len(df))].head(2)
0 1 2 3 4 5 6 7 8 9
6 61 62 63 64 65 66 67 68 69 70
1 11 12 13 14 15 16 17 18 19 20
EDIT : To get 20 elements in a list use:
import itertools
list(itertools.chain.from_iterable(df.sample(frac=.25).values))
#[71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
frac=.25 means 25% of the rows. Your frame has 8 rows of 10 elements, so 25% gives 2 rows, i.e. 20 elements. You can adjust the fraction depending on how many elements you have and how many you want.
EDIT1: Further to your edit in the question: print(df.values) gives you an array:
[[ 1 2 3 4 5 6 7 8 9 10]
[11 12 13 14 15 16 17 18 19 20]
[21 22 23 24 25 26 27 28 29 30]
[31 32 33 34 35 36 37 38 39 40]
[41 42 43 44 45 46 47 48 49 50]
[51 52 53 54 55 56 57 58 59 60]
[61 62 63 64 65 66 67 68 69 70]
[71 72 73 74 75 76 77 78 79 80]]
You would need to shuffle this array using np.random.shuffle. In this case, apply it to df.T.values, since you also want to shuffle the columns (this relies on df.T.values sharing memory with df, which may depend on your pandas version):
np.random.shuffle(df.T.values)
Then do a reshape:
df1 = pd.DataFrame(np.reshape(df.values,(4,20)))
>>df1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 4 3 10 2 8 7 1 5 6 9 14 13 20 12 18 17 11 15 16 19
1 24 23 30 22 28 27 21 25 26 29 34 33 40 32 38 37 31 35 36 39
2 44 43 50 42 48 47 41 45 46 49 54 53 60 52 58 57 51 55 56 59
3 64 63 70 62 68 67 61 65 66 69 74 73 80 72 78 77 71 75 76 79
This is a simple way using existing Stack Overflow answers:
1- Flatten the array so it behaves like a list; that way you deal with a single index instead of two array indexes:
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ndarray.flatten.html
aflat = a.flatten()
2- Choose random items from the flattened array using any of the answers here:
How to randomly select an item from a list?
3- Build your DataFrame from the selected data.
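The three steps above can be sketched as follows. This is one possible reading of the question, in which each row of the result is a separate draw of 20 unique, sorted numbers. The number of draws (n_draws) and the fixed seed are assumptions made for the example, and numpy's Generator.choice with replace=False stands in for the linked "select an item from a list" answers.

```python
import numpy as np
import pandas as pd

a = np.arange(1, 81).reshape(8, 10)

# Step 1: flatten to a one-dimensional array of the 80 candidate numbers.
aflat = a.flatten()

# Step 2: draw 20 unique values per row; replace=False prevents repeats within a draw.
rng = np.random.default_rng(0)   # fixed seed only to make the example repeatable
n_draws = 5                      # hypothetical number of rows wanted
rows = [np.sort(rng.choice(aflat, size=20, replace=False)) for _ in range(n_draws)]

# Step 3: each draw becomes one row with columns 0..19.
draws = pd.DataFrame(rows)
print(draws)
```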
You can also use numpy.random.choice, which lets you specify exactly how many rows to sample:
In [263]: a = np.arange(1,81).reshape(8,10)
In [265]: b = pd.DataFrame(a)
In [268]: b.iloc[np.random.choice(np.arange(len(b)), 5, False)]
Out[268]:
0 1 2 3 4 5 6 7 8 9
5 51 52 53 54 55 56 57 58 59 60
7 71 72 73 74 75 76 77 78 79 80
3 31 32 33 34 35 36 37 38 39 40
1 11 12 13 14 15 16 17 18 19 20
4 41 42 43 44 45 46 47 48 49 50
You can change 5 to the number of rows you need, up to len(b) (8 here) because the rows are drawn without replacement. There is no need to work out a fraction.

Axis argument to .loc() to interpret the passed slicers on axis=1

The documentation suggests:
You can also specify the axis argument to .loc to interpret the passed
slicers on a single axis.
However I get an error trying to slice along the column index.
import pandas as pd
import numpy as np
cols= [(yr,m) for yr in [2014,2015] for m in [7,8,9,10]]
df = pd.DataFrame(np.random.randint(1,100,(10,8)),index=tuple('ABCDEFGHIJ'))
df.columns =pd.MultiIndex.from_tuples(cols)
print(df.head())
2014 2015
7 8 9 10 7 8 9 10
A 68 51 6 48 24 3 4 85
B 79 75 68 62 19 40 63 45
C 60 15 32 32 37 95 56 38
D 4 54 81 50 13 64 65 13
E 78 21 84 1 83 18 39 57
# This does not work as expected
print(df.loc(axis=1)[(2014,9):(2015,8)])
AssertionError: Start slice bound is non-scalar
# but an arbitrary transpose and changing the axis works!
df = df.T
print(df.loc(axis=0)[(2014,9):(2015,8)])
A B C D E F G H I J
2014 9 6 68 32 81 84 60 83 39 94 93
10 48 62 32 50 1 84 18 14 92 33
2015 7 24 19 37 13 83 69 31 91 69 90
8 3 40 95 64 18 8 32 93 16 25
So I could always assign the slice and re-transpose.
That though feels like a hack and the axis=1 setting should have worked.
df = df.loc(axis=0)[(2014,9):(2015,8)]
df = df.T
print(df)
2014 2015
9 10 7 8
A 64 98 99 87
B 43 36 22 84
C 32 78 86 66
D 67 8 34 73
E 83 54 96 33
F 18 83 36 71
G 13 25 76 8
H 69 4 99 84
I 3 52 50 62
J 67 60 9 49
That might be a bug. Please post an issue on GitHub. The canonical way to select things is to fully specify all the axes.
In [6]: df.loc[:,(2014,9):(2015,8)]
Out[6]:
2014 2015
9 10 7 8
A 26 2 44 69
B 41 7 5 1
C 8 27 23 22
D 54 72 81 93
E 18 23 54 7
F 11 81 37 83
G 60 38 59 29
H 3 95 89 96
I 6 9 77 9
J 90 92 10 32
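A self-contained sketch of the fully specified form, using a deterministic frame (np.arange instead of random integers, an assumption made so the result is reproducible). The tuple slice works here because the column MultiIndex is sorted; both endpoints are included, as with other label-based slices.

```python
import numpy as np
import pandas as pd

cols = [(yr, m) for yr in [2014, 2015] for m in [7, 8, 9, 10]]
df = pd.DataFrame(
    np.arange(80).reshape(10, 8),            # deterministic stand-in for the random data
    index=list("ABCDEFGHIJ"),
    columns=pd.MultiIndex.from_tuples(cols),
)

# Specify both axes: all rows, and columns from (2014, 9) through (2015, 8) inclusive.
sub = df.loc[:, (2014, 9):(2015, 8)]
print(sub.head())
```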
