Calculate duration between events with pandas


I have a dataframe:
import pandas as pd
df = pd.DataFrame([['2018-07-02', 'B'],
['2018-07-03', 'A'],
['2018-07-06', 'B'],
['2018-07-08', 'B'],
['2018-07-09', 'A'],
['2018-07-09', 'A'],
['2018-07-10', 'A'],
['2018-07-12', 'B'],
['2018-07-15', 'A'],
['2018-07-16', 'A'],
['2018-07-18', 'B'],
['2018-07-22', 'A'],
['2018-07-25', 'B'],
['2018-07-25', 'B'],
['2018-07-27', 'A'],
['2018-07-28', 'A']], columns = ['DateEvent','Event'])
where counting starts with event A and ends with event B. Some events could start on more than one day and end on more than one day.
I already calculated the difference:
df = df.set_index('DateEvent')
begin = df.loc[df['Event'] == 'A'].index
cutoffs = df.loc[df['Event'] == 'B'].index
idx = cutoffs.searchsorted(begin)
mask = idx < len(cutoffs)
idx = idx[mask]
begin = begin[mask]
end = cutoffs[idx]
pd.DataFrame({'begin':begin, 'end':end})
but I get the difference for multiple starts and ends also:
begin end
0 2018-07-03 2018-07-06
1 2018-07-09 2018-07-12
2 2018-07-09 2018-07-12
3 2018-07-10 2018-07-12
4 2018-07-15 2018-07-18
5 2018-07-16 2018-07-18
6 2018-07-22 2018-07-25
The desired output includes the first occurrence of event A and the last occurrence of event B... looking for maximum duration, just to be sure.
I could loop before or after to delete the unnecessary events, but is there a nicer, more pythonic way?
Thank you,
Aleš
EDIT:
I've been using the code successfully as a function in a groupby, but it's not clean and it takes some time. How can I rewrite the code to include the group in the df?
df = pd.DataFrame([['2.07.2018', 1, 'B'],
['3.07.2018', 1, 'A'],
['3.07.2018', 2, 'A'],
['6.07.2018', 2, 'B'],
['8.07.2018', 2, 'B'],
['9.07.2018', 2, 'A'],
['9.07.2018', 2, 'A'],
['9.07.2018', 2, 'B'],
['9.07.2018', 3, 'A'],
['10.07.2018', 3, 'A'],
['10.07.2018', 3, 'B'],
['12.07.2018', 3, 'B'],
['15.07.2018', 3, 'A'],
['16.07.2018', 4, 'A'],
['16.07.2018', 4, 'B'],
['18.07.2018', 4, 'B'],
['18.07.2018', 4, 'A'],
['22.07.2018', 5, 'A'],
['25.07.2018', 5, 'B'],
['25.07.2018', 7, 'B'],
['25.07.2018', 7, 'A'],
['25.07.2018', 7, 'B'],
['27.07.2018', 9, 'A'],
['28.07.2018', 9, 'A'],
['28.07.2018', 9, 'B']], columns = ['DateEvent','Group','Event'])
I'm trying to somehow do a combination of cumsum on a group, but cannot get the desired results.
Thank you!

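Let's try labelling each run of A events and the B events that follow it as one event group, then taking the first and last date of every group that contains both events: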
df = pd.DataFrame([['2018-07-02', 'B'],
['2018-07-03', 'A'],
['2018-07-06', 'B'],
['2018-07-08', 'B'],
['2018-07-09', 'A'],
['2018-07-09', 'A'],
['2018-07-10', 'A'],
['2018-07-12', 'B'],
['2018-07-15', 'A'],
['2018-07-16', 'A'],
['2018-07-18', 'B'],
['2018-07-22', 'A'],
['2018-07-25', 'B'],
['2018-07-25', 'B'],
['2018-07-27', 'A'],
['2018-07-28', 'A']], columns = ['DateEvent','Event'])
a = (df['Event'] != 'A').cumsum()
a = a.groupby(a).cumcount()
df['Event Group'] = (a == 1).cumsum()
df_out = df.groupby('Event Group').filter(lambda x: set(x['Event']) == set(['A','B']))\
.groupby('Event Group')['DateEvent'].agg(['first','last'])\
.rename(columns={'first':'start','last':'end'})\
.reset_index()
print(df_out)
Output:
Event Group start end
0 1 2018-07-03 2018-07-08
1 2 2018-07-09 2018-07-12
2 3 2018-07-15 2018-07-18
3 4 2018-07-22 2018-07-25
Edit
a = (df['Event'] != 'A').cumsum().mask(df['Event'] != 'A')
df['Event Group'] = a.ffill()
df_out = df.groupby('Event Group').filter(lambda x: set(x['Event']) == set(['A','B']))\
.groupby('Event Group')['DateEvent'].agg(['first','last'])\
.rename(columns={'first':'start','last':'end'})\
.reset_index()
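The follow-up about the Group column is not covered by the code above. A minimal sketch of one way to apply the same idea per group follows. It assumes the rows are already in chronological order within each Group, and that a new block should start at every A that does not directly follow another A in the same Group; both are assumptions about the intended semantics, and the Block column name is made up for illustration. Only Groups 1 to 3 of the sample data are used to keep it short.

```python
import pandas as pd

# A subset of the grouped sample data from the question (Groups 1 to 3).
df = pd.DataFrame([['2.07.2018', 1, 'B'],
                   ['3.07.2018', 1, 'A'],
                   ['3.07.2018', 2, 'A'],
                   ['6.07.2018', 2, 'B'],
                   ['8.07.2018', 2, 'B'],
                   ['9.07.2018', 2, 'A'],
                   ['9.07.2018', 2, 'A'],
                   ['9.07.2018', 2, 'B'],
                   ['9.07.2018', 3, 'A'],
                   ['10.07.2018', 3, 'A'],
                   ['10.07.2018', 3, 'B'],
                   ['12.07.2018', 3, 'B'],
                   ['15.07.2018', 3, 'A']],
                  columns=['DateEvent', 'Group', 'Event'])
df['DateEvent'] = pd.to_datetime(df['DateEvent'], format='%d.%m.%Y')

# Previous event within the same Group (NaN on the first row of each Group).
prev_event = df.groupby('Group')['Event'].shift()
# A block starts at an A that does not directly follow another A in its Group.
starts = df['Event'].eq('A') & prev_event.ne('A')
# Number the blocks per Group; rows before the first A keep block number 0.
df['Block'] = starts.astype(int).groupby(df['Group']).cumsum()

df_out = (df[df['Block'] > 0]
          .groupby(['Group', 'Block'])
          .filter(lambda g: g['Event'].eq('B').any())  # keep blocks that reach a B
          .groupby(['Group', 'Block'])['DateEvent']
          .agg(['first', 'last'])
          .rename(columns={'first': 'start', 'last': 'end'})
          .reset_index())
print(df_out)
```

Because the block numbers come from vectorised shift and cumsum calls rather than a Python function applied per group, this may also help with the speed concern, though that is untested here.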

Related

Changing 2-dimensional list to standard matrix form

org = [['A', 'a', 1],
['A', 'b', 2],
['A', 'c', 3],
['B', 'a', 4],
['B', 'b', 5],
['B', 'c', 6],
['C', 'a', 7],
['C', 'b', 8],
['C', 'c', 9]]
I want to change the 'org' to the standard matrix form like below.
transform = [['\t','A', 'B', 'C'],
['a', 1, 4, 7],
['b', 2, 5, 8],
['c', 3, 6, 9]]
I made a small function that converts this.
The code I wrote is below:
import numpy as np
def matrix(li):
    column = ['\t']
    row = []
    result = []
    rest = []
    for i in li:
        if i[0] not in column:
            column.append(i[0])
        if i[1] not in row:
            row.append(i[1])
    result.append(column)
    for i in li:
        for r in row:
            if r == i[1]:
                rest.append([i[2]])
    rest = np.array(rest).reshape((len(row),len(column)-1)).tolist()
    for i in range(len(rest)):
        rest[i] = [row[i]]+rest[i]
    result += rest
    for i in result:
        print(i)
matrix(org)
The result was this:
>>>['\t', 'school', 'kids', 'really']
[72, 0.008962252017017516, 0.04770759762717251, 0.08993156334317577]
[224, 0.004180594204995023, 0.04450803342634945, 0.04195010047081213]
[385, 0.0021807662921382335, 0.023217182598008267, 0.06564858527712682]
I don't think this is efficient since I use so many for loops.
Is there any efficient way to do this?

Since you are using 3rd party libraries, this is a task well suited for pandas.
There is some messy, but not inefficient, work to incorporate index and columns as per your requirement.
import pandas as pd
org = [['A', 'a', 1],
['A', 'b', 2],
['A', 'c', 3],
['B', 'a', 4],
['B', 'b', 5],
['B', 'c', 6],
['C', 'a', 7],
['C', 'b', 8],
['C', 'c', 9]]
df = pd.DataFrame(org)
pvt = df.pivot_table(index=0, columns=1, values=2)
cols = ['\t'] + pvt.columns.tolist()
res = pvt.values.T.tolist()
res.insert(0, pvt.index.tolist())
res = [[i]+j for i, j in zip(cols, res)]
print(res)
[['\t', 'A', 'B', 'C'],
['a', 1, 4, 7],
['b', 2, 5, 8],
['c', 3, 6, 9]]

Here's another "manual" way using only numpy:
org_arr = np.array(org)
key1 = np.unique(org_arr[:,0])
key2 = np.unique(org_arr[:,1])
values = org_arr[:,2].reshape((len(key1),len(key2))).transpose()
np.block([
["\t", key1 ],
[key2[:,None], values]
])
""" # alternatively, for numpy < 1.13.0
np.vstack((
np.hstack(("\t", key1)),
np.hstack((key2[:, None], values))
))
"""
For simplicity, it requires the input matrix to be strictly ordered (first col is major and ascending ...).
Output:
Out[58]:
array([['\t', 'A', 'B', 'C'],
['a', '1', '4', '7'],
['b', '2', '5', '8'],
['c', '3', '6', '9']],
dtype='<U1')
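If the ordering requirement is a problem, or the values should keep their original types instead of becoming strings, a plain-Python version is also possible. The following is only a sketch, not part of the numpy approach above: it assumes each (column key, row key) pair appears at most once, and the names cols, rows and cell are made up for illustration.

```python
org = [['A', 'a', 1],
       ['A', 'b', 2],
       ['A', 'c', 3],
       ['B', 'a', 4],
       ['B', 'b', 5],
       ['B', 'c', 6],
       ['C', 'a', 7],
       ['C', 'b', 8],
       ['C', 'c', 9]]

# Unique keys in order of first appearance (dicts keep insertion order in Python 3.7+).
cols = list(dict.fromkeys(c for c, _, _ in org))
rows = list(dict.fromkeys(r for _, r, _ in org))

# Look-up table from (row key, column key) to the value.
cell = {(r, c): v for c, r, v in org}

# Header line first, then one line per row key; missing pairs become None.
transform = [['\t'] + cols] + [[r] + [cell.get((r, c)) for c in cols] for r in rows]

for line in transform:
    print(line)
```

This makes a single pass over the data for the look-up table, so the input does not need to be sorted.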

Sequence numbers for lists with lists

I am preparing a list of lists. The initial list can contain any number of entries, but the sub-lists each contain 3 entries, e.g.:
colony = [['A', 'B', 'C'], [1, 'b', 'c'], [2, 'b', 'c'], [3, 'b', 'c'], [4, 'b', 'c'], [5, 'b', 'c']]
The first entry in each of the sub-lists is the sequence number of the entry and needs to be sequential,
ie. 1, 2, 3, 4, 5, 6,...
A, B and C are the column headings; the data is in the subsequent lists. My difficulty is that if the number of sub-lists that I need to add is, say, 5, then the sequence number of every entry is 5. How can I change my code to insert the correct sequence number in each sub-list?
colony = [['A', 'B', 'C']]
add_colony = [0, 0, 0]
R = 5
i = 1
for i in range(R):
    add_colony[0] = i + 1
    add_colony[1] = 'b'
    add_colony[2] = 'c'
    colony.append(add_colony)
    i = i + 1
print()
print('colony = ', colony)
produces:
colony = [['A', 'B', 'C'], [5, 'b', 'c'], [5, 'b', 'c'], [5, 'b', 'c'], [5, 'b', 'c'], [5, 'b', 'c']]
not:
colony = [['A', 'B', 'C'], [1, 'b', 'c'], [2, 'b', 'c'], [3, 'b', 'c'], [4, 'b', 'c'], [5, 'b', 'c']]
I have tried all sorts of variations but end up with the incorrect output.
Thanks in advance
Bob

You are repeatedly mutating and appending the same list object, add_colony. All the lists in colony are references to this same object. You have to create a new list for each loop iteration:
for i in range(R):
    add_colony = [0, 0, 0]  # without this line ...
    add_colony[0] = i + 1  # ... this mutation will affect all the references in colony
    add_colony[1] = 'b'
    add_colony[2] = 'c'
    colony.append(add_colony)
Or shorter:
for i in range(R):
    colony.append([i + 1, 'b', 'c'])
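To see the aliasing directly, the identity of the appended items can be checked with the is operator. This is a small illustrative sketch; the variable names shared, aliased and fresh are made up for the example.

```python
shared = [0, 'b', 'c']
aliased = []
for i in range(3):
    shared[0] = i + 1       # mutates the one and only list object
    aliased.append(shared)  # appends another reference to that same object

# Every entry is the very same object, so all show the last value written.
print(all(item is shared for item in aliased))
print(aliased)

fresh = []
for i in range(3):
    fresh.append([i + 1, 'b', 'c'])  # a new list object on every iteration

# The entries are now distinct objects with distinct sequence numbers.
print(fresh[0] is fresh[1])
print(fresh)
```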

Hello there fellow University of Melbourne Student!
As @schwobaseggl mentioned, you need to create a new list object on each iteration of your loop, or you just keep appending the same object over and over again. You could also just make add_colony have the default values ['b', 'c'], and insert() the new i value at the beginning of each list.
Here is an example:
colony = [['A', 'B', 'C']]
R = 5
for i in range(R):
    add_colony = ['b', 'c']
    add_colony.insert(0, i+1)
    colony.append(add_colony)
print('colony = ', colony)
Which outputs:
colony = [['A', 'B', 'C'], [1, 'b', 'c'], [2, 'b', 'c'], [3, 'b', 'c'], [4, 'b', 'c'], [5, 'b', 'c']]
You could also use a list comprehension:
colony = [['A', 'B', 'C']] + [[i + 1] + ['b', 'c'] for i in range(R)]
Good Luck!

A one-line list comprehension that mutates the original list without saving the result to a variable (kind of weird, since it also builds a throwaway list of None values):
[colony.append([i, 'b', 'c']) for i in range(1, R + 1)]
print(colony) outputs
[['A', 'B', 'C'], [1, 'b', 'c'], [2, 'b', 'c'], [3, 'b', 'c'], [4, 'b', 'c'], [5, 'b', 'c']]

Conditional random allocation in Python

I need to conditionally randomly allocate users to groups. The table governing the process is as follows:
   A  B  C
0  9  1  1
1  1  7  8
2  0  2  1
According to the above matrix, there's a total of 11 users from area 0, 16 from area 1, and 3 from area 2.
Furthermore, of the 11 users from area 0, 9 should be allocated to group A, 1 each should be allocated to B and C. The process is analogous for the rest of the groups.
I have some code in Python:
import random
import pandas as pd
df = pd.DataFrame({"A": [9,1,0], "B": [1,7,2], "C": [1,8,1]})
random.sample(range(1,df.sum(axis=1)[0] + 1),df.sum(axis=1)[0])
The last line creates a random vector of integers e.g. [1, 4, 10, 2, 5, 11, 9, 3, 8, 7, 6]. I can allocate indices from 1 to 9 to group A,
the index with 10 to group B, the index with 11 to group C. In other words, user 3 goes to group B, user 6 goes to group C, and all the rest go to group A.
The desired output would be [A,A,B,A,A,C,A,A,A,A,A], or even better, a pandas dataframe like:
1 A
2 A
3 B
4 A
5 A
6 C
...
How can I automate the process I described in words above? (the actual allocation matrix is 10 x 10)

You could use np.repeat to get an array with the right number of users:
In [38]: [np.repeat(df.columns, row) for row in df.values]
Out[38]:
[Index(['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'C'], dtype='object'),
Index(['A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C',
'C', 'C'],
dtype='object'),
Index(['B', 'B', 'C'], dtype='object')]
And then permute them:
In [39]: [np.random.permutation(np.repeat(df.columns, row)) for row in df.values]
Out[39]:
[array(['C', 'A', 'A', 'A', 'A', 'A', 'B', 'A', 'A', 'A', 'A'], dtype=object),
array(['A', 'B', 'C', 'C', 'B', 'C', 'B', 'C', 'C', 'B', 'B', 'C', 'C',
'C', 'B', 'B'], dtype=object),
array(['B', 'C', 'B'], dtype=object)]
and then you could call pd.Series on each array if you wanted.
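Putting the two steps together, here is a sketch that collects the shuffled labels for every area into one long dataframe. The column names area, user and group, the variable allocation, and the seed are choices made for this example, not something given in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [9, 1, 0], "B": [1, 7, 2], "C": [1, 8, 1]})
rng = np.random.default_rng(0)  # seed chosen arbitrarily, only for reproducibility

frames = []
for area, counts in df.iterrows():
    # Repeat each group label as often as the matrix demands, then shuffle.
    labels = np.repeat(df.columns.to_numpy(), counts.to_numpy())
    shuffled = rng.permutation(labels)
    frames.append(pd.DataFrame({"area": area,
                                "user": np.arange(1, len(shuffled) + 1),
                                "group": shuffled}))

allocation = pd.concat(frames, ignore_index=True)
print(allocation.head(11))

# Sanity check: counting the labels per area gives back the allocation matrix.
print(pd.crosstab(allocation["area"], allocation["group"]))
```

The loop runs once per area, so it should scale to a 10 x 10 matrix without changes.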

Pandas way of getting intersection between two rows in a python Pandas dataframe

Say I have some data that looks like below. I want to get the count of ids that have two tags at the same time.
tag id
a A
b B
a B
b A
c A
What I desire the result:
tag1 tag2 count
a b 2
a c 1
b c 1
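In plain Python I could write something like this:
from collections import defaultdict
import itertools
rows = [('a', 'A'), ('b', 'B'), ('a', 'B'), ('b', 'A'), ('c', 'A')]  # (tag, id) pairs
d = defaultdict(set)
for tag, id_ in rows:
    d[tag].add(id_)
for tag1, tag2 in itertools.combinations(sorted(d), 2):
    print(tag1, tag2, len(d[tag1] & d[tag2]))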
Not the most efficient way but it should work. Now I already have the data stored in Pandas dataframe. Is there a more pandas-way to achieve the same result?

Here is my attempt:
from itertools import combinations
import pandas as pd
import numpy as np
In [123]: df
Out[123]:
tag id
0 a A
1 b B
2 a B
3 b A
4 c A
In [124]: a = np.asarray(list(combinations(df.tag, 2)))
In [125]: a
Out[125]:
array([['a', 'b'],
['a', 'a'],
['a', 'b'],
['a', 'c'],
['b', 'a'],
['b', 'b'],
['b', 'c'],
['a', 'b'],
['a', 'c'],
['b', 'c']],
dtype='<U1')
In [126]: a = a[a[:,0] != a[:,1]]
In [127]: a
Out[127]:
array([['a', 'b'],
['a', 'b'],
['a', 'c'],
['b', 'a'],
['b', 'c'],
['a', 'b'],
['a', 'c'],
['b', 'c']],
dtype='<U1')
In [129]: np.ndarray.sort(a)
In [130]: pd.DataFrame(a).groupby([0,1]).size()
Out[130]:
0 1
a b 4
c 2
b c 2
dtype: int64
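Note that the counts above (4, 2, 2) differ from the desired ones (2, 1, 1), because the combinations are taken over all rows without looking at the id column. One possible alternative, offered only as a sketch, is to merge the dataframe with itself on id so that tags are paired only when they share an id. The names pairs and result are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"tag": ["a", "b", "a", "b", "c"],
                   "id":  ["A", "B", "B", "A", "A"]})

# Self-merge on id: every tag is paired with every tag that shares its id.
pairs = df.merge(df, on="id", suffixes=("1", "2"))
# Keep each unordered pair once (tag1 < tag2), which also drops self-pairs.
pairs = pairs[pairs["tag1"] < pairs["tag2"]]

# Count the distinct ids per tag pair.
result = (pairs.groupby(["tag1", "tag2"])["id"]
               .nunique()
               .reset_index(name="count"))
print(result)
```

The self-merge can get large when many tags share the same id, so for big data the set-based loop from the question may still be the more economical choice.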

Unpack a nested list

My question is simple.
There are two lists.
The first is a list of integers:
a = [1, 2, 3]
The other is a list of lists:
b = [['a', 'b'], ['c', 'd'], ['e', 'f']]
How could I get the result below:
result = [[1, 'a', 'b'], [2, 'c', 'd'], [3, 'e', 'f']]
Thanks.

>>> a = [1, 2, 3]
>>> b = [['a', 'b'], ['c', 'd'], ['e', 'f']]
>>> [[aa] + bb for aa, bb in zip(a, b)]
[[1, 'a', 'b'], [2, 'c', 'd'], [3, 'e', 'f']]

In Python3
>>> a = [1, 2, 3]
>>> b = [['a', 'b'], ['c', 'd'], ['e', 'f']]
>>> [aa+bb for *aa, bb in zip(a,b)]
[[1, 'a', 'b'], [2, 'c', 'd'], [3, 'e', 'f']]

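Another way to do this would be the following (note that it modifies b in place, because l is just another name for the same list):
index = 0
l = b
for i in a:
    l[index].insert(0, i)  # insert at the front; append would give ['a', 'b', 1]
    index += 1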

The following Python code will unpack each list and assemble it in the form you indicated.
[[a[i]] + b[i] for i in range(min(len(a),len(b)))]

Using Python's enumerate function you can loop over a list together with its index. x.extend(y) appends the values in list y to the end of list x.
a = [1, 2, 3]
b = [['a', 'b'], ['c', 'd'], ['e', 'f']]
result = []
for index, value in enumerate(a):
    aa = [value]
    aa.extend(b[index])
    result.append(aa)
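As a quick illustration of why extend is used here and not append, and of a more compact spelling with zip and unpacking, here is a small sketch. The names front, nested and rest are made up for the example.

```python
a = [1, 2, 3]
b = [['a', 'b'], ['c', 'd'], ['e', 'f']]

front = [1]
front.extend(['a', 'b'])   # adds each element of the argument to the end of front
print(front)

nested = [1]
nested.append(['a', 'b'])  # adds the argument itself as a single element
print(nested)

# Same result as the loop above, without touching a or b.
result = [[value, *rest] for value, rest in zip(a, b)]
print(result)
```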
