Pandas - merging start/end time ranges with short gaps - python

Say I have a series of start and end times for a given event:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,5,30).cumsum().reshape(-1, 2), columns = ["start", "end"])
start end
0 2 6
1 7 8
2 12 14
3 18 20
4 24 25
5 26 28
6 29 33
7 35 36
8 39 41
9 44 45
10 48 50
11 53 54
12 58 59
13 62 63
14 65 68
I'd like to merge time ranges with a gap less than or equal to n, so for n = 1 the result would be:
fn(df, n = 1)
start end
0 2 8
2 12 14
3 18 20
4 24 33
7 35 36
8 39 41
9 44 45
10 48 50
11 53 54
12 58 59
13 62 63
14 65 68
I can't seem to find a way to do this with pandas without iterating and building up the result line-by-line. Is there some simpler way to do this?

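You can subtract the shifted end values from the start values to get each row's gap to the previous range, compare the gaps with N to build a mask, turn the mask into group labels with a cumulative sum, and pass the labels to groupby to aggregate the minimum start and maximum end:
N = 1
g = df['start'].sub(df['end'].shift())
df = df.groupby(g.gt(N).cumsum()).agg({'start':'min', 'end':'max'})
print(df)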
start end
0 2 8
1 12 14
2 18 20
3 24 33
4 35 36
5 39 41
6 44 45
7 48 50
8 53 54
9 58 59
10 62 63
11 65 68
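As a minimal sketch, the same idea can be wrapped in a function with the fn(df, n) shape the question asks for. The name merge_ranges is made up for illustration, and the sketch assumes the rows are already sorted by start. It also relabels each merged row with the index of the first row in its group, which is how the desired output in the question is labelled.

```python
import pandas as pd

def merge_ranges(df, n):
    """Merge consecutive rows whose gap to the previous row's end is <= n.

    Hypothetical helper; assumes df is sorted by 'start'.
    """
    gap = df["start"] - df["end"].shift()      # NaN for the first row
    group = (gap > n).cumsum()                 # label increases whenever gap > n
    merged = df.groupby(group).agg(start=("start", "min"), end=("end", "max"))
    # Label each merged row with the index of the first row in its group.
    merged.index = df.index[(~group.duplicated()).to_numpy()]
    return merged

# The 15 ranges printed in the question, typed out so the example is deterministic.
df = pd.DataFrame({
    "start": [2, 7, 12, 18, 24, 26, 29, 35, 39, 44, 48, 53, 58, 62, 65],
    "end":   [6, 8, 14, 20, 25, 28, 33, 36, 41, 45, 50, 54, 59, 63, 68],
})
result = merge_ranges(df, n=1)
print(result)
```

Because the first gap is NaN and NaN > n is False, the first row always starts group 0.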


Find the total % of each value in its respective index level [duplicate]

This question already has answers here:
Pandas percentage of total with groupby
(16 answers)
Closed 10 months ago.
I'm trying to find the % total of each value within its respective index level; however, the current result is producing NaN values.
df = pd.DataFrame({"one": np.arange(0, 20), "two": np.arange(20, 40)}, index=[np.array([np.zeros(10), np.ones(10)]).flatten(), np.arange(80, 100)])
DataFrame:
        one  two
0.0 80    0   20
    81    1   21
    82    2   22
    83    3   23
    84    4   24
    85    5   25
    86    6   26
    87    7   27
    88    8   28
    89    9   29
1.0 90   10   30
    91   11   31
    92   12   32
    93   13   33
    94   14   34
    95   15   35
    96   16   36
    97   17   37
    98   18   38
    99   19   39
Aim:
To see the % total of a column 'one' within its respective level.
Excel example:
Current attempted code:
for loc in df.index.get_level_values(0):
    df.loc[loc, 'total'] = df.loc[loc, :] / df.loc[loc, :].sum()
IIUC, use:
df['total'] = df['one'].div(df.groupby(level=0)['one'].transform('sum'))
output:
      one  two     total
0 80    0   20  0.000000
  81    1   21  0.022222
  82    2   22  0.044444
  83    3   23  0.066667
  84    4   24  0.088889
  85    5   25  0.111111
  86    6   26  0.133333
  87    7   27  0.155556
  88    8   28  0.177778
  89    9   29  0.200000
1 90   10   30  0.068966
  91   11   31  0.075862
  92   12   32  0.082759
  93   13   33  0.089655
  94   14   34  0.096552
  95   15   35  0.103448
  96   16   36  0.110345
  97   17   37  0.117241
  98   18   38  0.124138
  99   19   39  0.131034
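A small self-contained sketch of why transform fits here: it returns one value per original row (the sum of that row's group), so the division lines up row by row. The frame is rebuilt with np.repeat for brevity, and the names group_sum and check are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question: level 0 is 0/1, level 1 runs 80..99.
index = [np.repeat([0, 1], 10), np.arange(80, 100)]
df = pd.DataFrame({"one": np.arange(0, 20), "two": np.arange(20, 40)}, index=index)

# One group sum per row, aligned with the original index.
group_sum = df.groupby(level=0)["one"].transform("sum")
df["total"] = df["one"] / group_sum

# Sanity check: the shares inside each level-0 group add up to 1.
check = df.groupby(level=0)["total"].sum()
print(check)
```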

Venn Diagram for each row in DataFrame

I have a set of data that looks like this:
    Exp #  ID  Q1  Q2  All IDs  Q1 unique  Q2 unique  Overlap  Unnamed: 8
0       1  58  32  58       58         14         40       18          18
1       2  55  38  44       55         28         34       10          10
2       4  95  69  83       95         37         51       32          32
3       5  92  68  84       92         31         47       37          37
4       6   0   0   0        0          0          0        0           0
5       7  71  52  65       71         27         40       25          25
6       8  84  69  69       84         39         39       30          30
7      10  65  35  63       65         17         45       18          18
8      11  90  72  72       90         39         39       33          33
9      14  88  84  80       88         52         48       32          32
10     17  89  56  75       89         30         49       26          26
11     19  83  56  70       83         32         46       24          24
12     20  94  72  83       93         35         46       37          37
13     21  73  57  56       73         38         37       19          19
For each exp #, I want to make a Venn diagram with the values Q1 Unique, Q2 Unique, and Overlap.
I have tried a couple of things, the below code has gotten me the closest:
from matplotlib import pyplot as plt
import numpy as np
from matplotlib_venn import venn2, venn2_circles
import csv
import pandas as pd
import numpy as np
val_path = r"C:\Users\lawashburn\Documents\DIA\DSD First Pass\20220202_Acquisition\Overlap_Values.csv"
val_tab = pd.read_csv(val_path)
exp_num = val_tab['Exp #']
cols = ['Q1 unique','Q2 unique', 'Overlap']
df = pd.DataFrame()
df ['Exp #'] = exp_num
df['combined'] = val_tab[cols].apply(lambda row: ','.join(row.values.astype(str)), axis=1)
print(df)
exp_no = df['Exp #'].tolist()
combined = df['combined'].tolist()
#combined = [int(i) for i in combined]
print(combined)
for a in exp_no:
    plt.figure(figsize=(4,4))
    plt.title(a)
    for b in combined:
        v = venn2(subsets=(b), set_labels = ('Q1', 'Q2'), set_colors=('purple','skyblue'), alpha=0.7)
        v.get_label_by_id('A').set_text('Q1')
        c = venn2_circles(subsets=(b))
        plt.show()
        plt.savefig(a + 'output.png')
This generates a DataFrame:
Exp # combined
0 1 14,40,18
1 2 28,34,10
2 4 37,51,32
3 5 31,47,37
4 6 0,0,0
5 7 27,40,25
6 8 39,39,30
7 10 17,45,18
8 11 39,39,33
9 14 52,48,32
10 17 30,49,26
11 19 32,46,24
12 20 35,46,37
13 21 38,37,19
However, I think I run into the issue when I export the combined column into a list:
['14,40,18', '28,34,10', '37,51,32', '31,47,37', '0,0,0', '27,40,25', '39,39,30', '17,45,18', '39,39,33', '52,48,32', '30,49,26', '32,46,24', '35,46,37', '38,37,19']
As after this I get the error:
numpy.core._exceptions.UFuncTypeError: ufunc 'absolute' did not contain a loop with signature matching types dtype('<U8') -> dtype('<U8')
How should I proceed from here? I would like 13 separate Venn Diagrams, and to export each of them into a separate .png file.
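The error message suggests that venn2 is receiving text such as '14,40,18' instead of numbers. One possible way forward, sketched here under assumptions, is to turn each string back into a tuple of integers and to pair every experiment with its own counts using zip rather than two nested loops. The helpers parse_subsets and save_venn are hypothetical names, the output file name pattern is made up, and skipping the all-zero row is a choice made here because that row has nothing to draw. The plotting function is only defined, not called, so the snippet runs even without matplotlib_venn installed.

```python
def parse_subsets(text):
    """Turn '14,40,18' into (14, 40, 18): (Q1 unique, Q2 unique, Overlap)."""
    return tuple(int(part) for part in text.split(","))

def save_venn(exp, subsets):
    """Draw one Venn diagram and save it as '<exp>_output.png' (made-up name)."""
    # Imported lazily so the parsing part runs without the plotting libraries.
    from matplotlib import pyplot as plt
    from matplotlib_venn import venn2

    plt.figure(figsize=(4, 4))
    plt.title(str(exp))           # the experiment number is an int, so convert it
    venn2(subsets=subsets, set_labels=("Q1", "Q2"),
          set_colors=("purple", "skyblue"), alpha=0.7)
    plt.savefig(f"{exp}_output.png")
    plt.close()

# Three rows taken from the question's data (Exp # 1, 2 and 6).
exp_no = [1, 2, 6]
combined = ["14,40,18", "28,34,10", "0,0,0"]

# Pair each experiment with its own counts.
rows = [(exp, parse_subsets(text)) for exp, text in zip(exp_no, combined)]
to_plot = [(exp, s) for exp, s in rows if any(s)]   # skip all-zero rows
print(to_plot)

# To write the files, call the plotting helper once per remaining row:
# for exp, s in to_plot:
#     save_venn(exp, s)
```

Joining the three columns into a string is not strictly needed; keeping them as integers from the start would avoid the parsing step altogether.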

Grid of integers

I need to make a grid with the numbers generated by the code, but I'm not understanding how to align them in columns.
Is there a parameter of print or something else that could help me out?
#main()
a=0
b=0
for i in range(1, 13):
    a=a+1
    print(" ")
    b=b+1
    for f in range(1,13):
        print(f*b, end=" ")
My output at the moment:
I would recommend using Python's f-strings:
for i in range(1, 13):
    print(''.join(f"{i*j: 4}" for j in range(1,13)))
Here's the output:
   1   2   3   4   5   6   7   8   9  10  11  12
   2   4   6   8  10  12  14  16  18  20  22  24
   3   6   9  12  15  18  21  24  27  30  33  36
   4   8  12  16  20  24  28  32  36  40  44  48
   5  10  15  20  25  30  35  40  45  50  55  60
   6  12  18  24  30  36  42  48  54  60  66  72
   7  14  21  28  35  42  49  56  63  70  77  84
   8  16  24  32  40  48  56  64  72  80  88  96
   9  18  27  36  45  54  63  72  81  90  99 108
  10  20  30  40  50  60  70  80  90 100 110 120
  11  22  33  44  55  66  77  88  99 110 121 132
  12  24  36  48  60  72  84  96 108 120 132 144
An f-string accepts almost any expression inside the curly braces, including dictionary lookups, function calls and so on. Anything after the colon is a format specification. Here the 4 is the minimum width, so every number takes up at least 4 characters and is right-aligned by default. The space before the 4 is the sign option: non-negative numbers get a leading space where a negative number would get a minus sign. For more info, check out the documentation on the format specification mini-language.
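A quick way to see what each part of a format specification does is to compare a few variants side by side. In Python's format specification mini-language, a lone space before the width is the sign option (a fill character has to be followed by an alignment character such as > or <), and the width is a minimum rather than a fixed size. The variable names below are chosen only for illustration.

```python
value = 42

cells = [
    f"{value: 4}",    # space sign option + minimum width 4
    f"{-value: 4}",   # a negative number gets '-' in place of the space
    f"{value:4}",     # width only; numbers are right-aligned by default
    f"{value:*>4}",   # explicit fill '*' must be followed by an alignment character
    f"{1234: 4}",     # the sign slot pushes a 4-digit number past the width
]
for cell in cells:
    print(repr(cell))
```

For a grid of non-negative numbers that all fit in the width, {x: 4} and {x:4} produce the same text.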
If the width of each grid cell is stored in w (4 is enough for the snippet above), a regularly spaced grid can be printed using
w = 4
a, b = 0, 0
for i in range(1, 13):
    a, b = a+1, b+1
    for f in range(1, 13):
        print(('{:'+str(w)+'}').format(f*b), end='')
    print('')
Its output is
   1   2   3   4   5   6   7   8   9  10  11  12
   2   4   6   8  10  12  14  16  18  20  22  24
   3   6   9  12  15  18  21  24  27  30  33  36
   4   8  12  16  20  24  28  32  36  40  44  48
   5  10  15  20  25  30  35  40  45  50  55  60
   6  12  18  24  30  36  42  48  54  60  66  72
   7  14  21  28  35  42  49  56  63  70  77  84
   8  16  24  32  40  48  56  64  72  80  88  96
   9  18  27  36  45  54  63  72  81  90  99 108
  10  20  30  40  50  60  70  80  90 100 110 120
  11  22  33  44  55  66  77  88  99 110 121 132
  12  24  36  48  60  72  84  96 108 120 132 144
You can reference keyword argument values passed to the str.format() method in the format string by name via {name}. Here's an example of doing that where the value referenced is computed (as opposed to being a constant):
mx = 12
w = len(str(mx*mx)) + 1
for b in range(1, mx+1):
    for f in range(1, mx+1):
        print(('{:{w}}').format(f*b, w=w), end='')
    print('')
Output:
   1   2   3   4   5   6   7   8   9  10  11  12
   2   4   6   8  10  12  14  16  18  20  22  24
   3   6   9  12  15  18  21  24  27  30  33  36
   4   8  12  16  20  24  28  32  36  40  44  48
   5  10  15  20  25  30  35  40  45  50  55  60
   6  12  18  24  30  36  42  48  54  60  66  72
   7  14  21  28  35  42  49  56  63  70  77  84
   8  16  24  32  40  48  56  64  72  80  88  96
   9  18  27  36  45  54  63  72  81  90  99 108
  10  20  30  40  50  60  70  80  90 100 110 120
  11  22  33  44  55  66  77  88  99 110 121 132
  12  24  36  48  60  72  84  96 108 120 132 144
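The computed-width idea can also be packaged so the grid is built as strings before anything is printed, which makes it easy to reuse for other sizes. This is only a sketch: multiplication_table is a hypothetical helper name, and the cell width follows the same rule as above (one more than the number of digits in the largest product).

```python
def multiplication_table(mx):
    """Return the mx-by-mx multiplication table as a list of aligned lines.

    Hypothetical helper; each cell is one character wider than the largest product.
    """
    w = len(str(mx * mx)) + 1
    return ["".join(f"{row * col:{w}}" for col in range(1, mx + 1))
            for row in range(1, mx + 1)]

lines = multiplication_table(12)
print("\n".join(lines))
```

A nested replacement field such as {w} inside the format specification works in f-strings just as it does with str.format().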

How to get randomly 20 elements from np.array and save it to DataFrame?

I have a DataFrame containing the numbers 1 to 80. How can I pick 20 elements at random and save the result to another DataFrame? I can't get each list saved as a row; the elements are saved as columns instead. Later I want to try predicting the random elements with sklearn.
a = np.arange(1,81).reshape(8,10)
pd.DataFrame(a)
I need to get 20 unique numbers and write them as one row. For example, in plain Python:
from random import sample
for x in range(1,20):
    i=sample(range(1,81), k=20)
    i.sort()
    print(x,'-',i)
It returns a list of 20 elements such as [1,3,5,8,34,45,12,76,45...], and I want the result to look like this:
0 1 2 3 4 5 6 7 8 9 10 11 12 ... 20
0 1 5 10 14 20 55 67 34 ...... 20 elements
1
.
.
Use df.sample() to get a random sample of rows from a DataFrame:
a = np.arange(1,81).reshape(8,10)
df = pd.DataFrame(a)
df1= df.sample(frac=.25)
>>df1
0 1 2 3 4 5 6 7 8 9
5 51 52 53 54 55 56 57 58 59 60
3 31 32 33 34 35 36 37 38 39 40
For a random permutation, use np.random.permutation():
df.iloc[np.random.permutation(len(df))].head(2)
0 1 2 3 4 5 6 7 8 9
6 61 62 63 64 65 66 67 68 69 70
1 11 12 13 14 15 16 17 18 19 20
EDIT : To get 20 elements in a list use:
import itertools
list(itertools.chain.from_iterable(df.sample(frac=.25).values))
#[71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
frac=.25 samples 25% of the rows. The frame has 8 rows of 10 elements, so 25% is 2 rows, which gives you 20 elements. You can adjust the fraction depending on how many elements you have and how many you want.
EDIT1: Further to your edit in the question: print(df.values) gives you an array:
[[ 1  2  3  4  5  6  7  8  9 10]
 [11 12 13 14 15 16 17 18 19 20]
 [21 22 23 24 25 26 27 28 29 30]
 [31 32 33 34 35 36 37 38 39 40]
 [41 42 43 44 45 46 47 48 49 50]
 [51 52 53 54 55 56 57 58 59 60]
 [61 62 63 64 65 66 67 68 69 70]
 [71 72 73 74 75 76 77 78 79 80]]
You need to shuffle this array with np.random.shuffle; in this case, apply it to df.T.values, since you also want to shuffle the columns:
np.random.shuffle(df.T.values)
Then do a reshape:
df1 = pd.DataFrame(np.reshape(df.values,(4,20)))
>>df1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 4 3 10 2 8 7 1 5 6 9 14 13 20 12 18 17 11 15 16 19
1 24 23 30 22 28 27 21 25 26 29 34 33 40 32 38 37 31 35 36 39
2 44 43 50 42 48 47 41 45 46 49 54 53 60 52 58 57 51 55 56 59
3 64 63 70 62 68 67 61 65 66 69 74 73 80 72 78 77 71 75 76 79
This is a simple way using existing Stack Overflow answers:
1- Flatten the array so that it behaves more like a list; this lets you work with a single index instead of two array indexes
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ndarray.flatten.html
aflat = a.flatten()
2- Choose random items from the flattened array using any of the answers here:
How to randomly select an item from a list?
3- With the selected data, build your DataFrame
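The three steps above could be sketched as follows. This is one possible reading, not the only one: it uses NumPy's Generator.choice with replace=False for step 2 so that the 20 numbers in a draw are unique, sorts each draw as the question's loop does, and stores one draw per DataFrame row. The names rng, n_draws, k, rows and draws, the fixed seed, and the choice of 5 draws are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)     # fixed seed only to make the sketch repeatable
a = np.arange(1, 81).reshape(8, 10)

# Step 1: flatten to a single pool of 80 numbers.
aflat = a.flatten()

# Step 2: draw k unique numbers from the pool, n_draws times (assumed sizes).
n_draws, k = 5, 20
rows = [np.sort(rng.choice(aflat, size=k, replace=False)) for _ in range(n_draws)]

# Step 3: build the DataFrame, one draw per row and 20 columns.
draws = pd.DataFrame(rows)
print(draws)
```

Passing a list of equal-length arrays to pd.DataFrame puts each array in its own row, which is the layout the question asks for.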
You can also use numpy.random.choice and you can specify exact rows you want from the sample:
In [263]: a = np.arange(1,81).reshape(8,10)
In [265]: b = pd.DataFrame(a)
In [268]: b.iloc[np.random.choice(np.arange(len(b)), 5, False)]
Out[268]:
0 1 2 3 4 5 6 7 8 9
5 51 52 53 54 55 56 57 58 59 60
7 71 72 73 74 75 76 77 78 79 80
3 31 32 33 34 35 36 37 38 39 40
1 11 12 13 14 15 16 17 18 19 20
4 41 42 43 44 45 46 47 48 49 50
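Here 5 is the exact number of rows to pick, so you need not work out a fraction. Because the rows are drawn without replacement (the third argument is False), the count cannot exceed len(b), which is 8 here.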

How can you use nested for loops to print out the following pattern in python?

How do you use nested for loops to print out the following pattern? So you don't have to write 10 for loops for it.
1 2 3 4 5 6 7 8 9 10
2 4 6 8 10 12 14 16 18 20
3 6 9 12 15 18 21 24 27 30
4 8 12 16 20 24 28 32 36 40
5 10 15 20 25 30 35 40 45 50
6 12 18 24 30 36 42 48 54 60
7 14 21 28 35 42 49 56 63 70
8 16 24 32 40 48 56 64 72 80
9 18 27 36 45 54 63 72 81 90
10 20 30 40 50 60 70 80 90 100
Simply increase your step size!
for stepSize in range(10):
    for count in range(10):
        print((count + 1) * (stepSize + 1), end=" ")
    # count loop has ended, back into the scope of stepSize loop
    # We are also printing(" ") to end the line
    print(" ")
# stepSize loop has finished, code is done
Explanation:
The first, outer loop is increasing our step size, then for each step size we count up 10 steps and finish the line when we print(" ") in the outer for loop.
This is how I would do it:
for x in range(1, 11):
    product = []
    for y in range(1, 11):
        current_product = x * y
        product.append(current_product)
    print(*product, sep=' ')
This is going to be one of the least interpretable answers, but I had fun trying to write a one-liner:
num_rows = 10
print('\n\n'.join(' '.join(str(i) for i in range(j,(num_rows+1)*j)[::j]) for j in range(1,num_rows+1)))
Output:
1 2 3 4 5 6 7 8 9 10
2 4 6 8 10 12 14 16 18 20
3 6 9 12 15 18 21 24 27 30
4 8 12 16 20 24 28 32 36 40
5 10 15 20 25 30 35 40 45 50
6 12 18 24 30 36 42 48 54 60
7 14 21 28 35 42 49 56 63 70
8 16 24 32 40 48 56 64 72 80
9 18 27 36 45 54 63 72 81 90
10 20 30 40 50 60 70 80 90 100
Dissecting it: range(j,(num_rows+1)*j)[::j] generates the integers for each line, where j ranges over the row numbers (starting from 1, as you ask for). The [::j] part takes every j-th element of the range. The inner join then builds the line string from those integers, each separated by a space ' '. The outer join builds the final output by combining the lines with \n\n, a double newline that puts a blank line between each line of integers.
I think the other solutions are more readable, but this one is kind of fun.
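To make the dissection easier to follow, the pieces of the one-liner can be evaluated separately. The sketch below uses Python 3 and a 3-row table so the intermediate values stay short; the names row_values, line and table are only illustrative.

```python
num_rows = 3   # small size chosen here so the pieces are easy to read
j = 2          # look at the second row on its own

# Every j-th integer starting at j: the multiples of j for this row.
row_values = list(range(j, (num_rows + 1) * j)[::j])

# Inner join: one row of text, numbers separated by a single space.
line = " ".join(str(i) for i in row_values)

# Outer join: all rows, separated by a blank line.
table = "\n\n".join(
    " ".join(str(i) for i in range(k, (num_rows + 1) * k)[::k])
    for k in range(1, num_rows + 1)
)
print(table)
```

Slicing a range with [::j] yields another range, so no intermediate list is needed inside the join.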
